Is 80% sentiment-scoring accuracy good? Is it good enough?
Mikko Kotila, founder and CEO of online media analytics company Statsit, says no. He asks, in this blog post, what automated sentiment analysis is good for and answers, "not much." Yet his assumptions and analysis betray common misapprehensions. The questions are valid, but the facts, reasoning, and conclusion offered, while they may be widely shared, capture only a small part of the sentiment-analysis picture.
Any look at the sentiment-analysis big picture should start with classification precision. Here, as with every automated solution, human performance creates a baseline for expectations. I suspect that Kotila, in his look at sentiment analysis, overestimates the precision of both human analysis and, in projecting from two vendors' self-proclaimed results to the whole of automated sentiment analysis, the accuracy of the broad set of automated solutions. That's right: in many contexts, many automated sentiment solutions are far less than 80% precise. They're still useful, however, because more comes into play than raw, document-level classification precision. There's more to accuracy and usefulness than that one point. Automation advantages typically include speed, reach, consistency, and cost. For a more complete picture, add in accuracy-boosting techniques and look at use cases beyond listening platforms.
So here's my own review of the accuracy question and my take on the usefulness of sentiment analysis.
Sentiment Analysis Accuracy
Do humans read sentiment -- broadly, expressions of attitude, opinion, feeling, and emotion -- with 100% accuracy? Mikko Kotila (who is my Everyman; his impressions reflect those of a significant portion of the market) reports that "leading providers such as Sysomos and Radian6 estimate their automated sentiment analysis and scoring system to be 80% accurate." His own assumption about human accuracy comes out in his statement that the "20% difference statistically is huge and comes with an array of problems." What "20% difference" could he possibly see other than between the automated tools' accuracy and human accuracy? Yet we humans do not agree universally with one another on anything subjective, and sentiment, rendered in text that lacks visual or aural clues, is really tricky, even for people. Sadly, we're nowhere close to perfect.
The yardstick is agreement between a method's results and a "gold standard" or, lacking a definitive standard, "inter-annotator agreement" between two human analysts, between two automated methods, or between a human and a machine. I know of only one scientific study of human sentiment-annotation accuracy, "Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis," by Wilson, Wiebe & Hoffmann, 2005.
The Univ. of Pittsburgh researchers found 82% agreement in the assignment, by two individuals, of phrase-level sentiment polarity to automatically identified subjective statements in a test document set. Polarity (a.k.a. valence) is the sentiment direction: positive, negative, both, or neutral.
The authors further report, "For 18% of the subjective expressions, at least one annotator used an uncertain tag when marking polarity. If we consider these cases to be borderline and exclude them from the study, percent agreement increases to 90%."
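The percent-agreement measure behind those 82% and 90% figures is simple to compute: count the items the two annotators label identically, then optionally drop the items either annotator flagged as uncertain. Here's a minimal sketch in Python; the labels and counts are hypothetical illustrations, not the study's data.

```python
# Percent agreement between two annotators on the same set of items.
def percent_agreement(a, b):
    """Fraction of items where the two annotators assign the same label."""
    assert len(a) == len(b), "annotators must label the same items"
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical phrase-level polarity judgments from two annotators;
# "uncertain" marks a borderline case.
ann1 = ["pos", "neg", "pos", "neutral", "uncertain",
        "neg", "pos", "neg", "pos", "pos"]
ann2 = ["pos", "neg", "neutral", "neutral", "pos",
        "neg", "pos", "neg", "pos", "neg"]

overall = percent_agreement(ann1, ann2)  # 7 of 10 items match -> 0.7

# Excluding items either annotator flagged as uncertain, as the study does:
pairs = [(x, y) for x, y in zip(ann1, ann2) if "uncertain" not in (x, y)]
certain = percent_agreement([x for x, _ in pairs], [y for _, y in pairs])
```

With these made-up labels, dropping the one uncertain item raises agreement from 7/10 to 7/9, the same kind of lift, in miniature, that the Pittsburgh authors report when they exclude borderline cases.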
The Pittsburgh authors evaluated sentiment at a phrase level; other applications look at feature, sentence, and document-level sentiment. Text features include entities such as named individuals and companies; they may also include concepts or topics that are composed of entities. The situation isn't necessarily better at other levels. (The use cases are different, however; more on that later.)
On the other hand, Bing Liu, of the University of Illinois at Chicago, reported to me, "We have done some informal and general study on the agreement of human annotators. We did not find much disagreement and thus did not write any paper on it. Of course, our focus has been on opinions/sentiments on product features/attributes in consumer reviews and blogs. The focus of Wiebe's group has been on political/news type of articles, which tend to be more difficult to judge."
Mike Marshall of text-analytics vendor Lexalytics did his own experimental testing of document-level sentiment analysis and found "overall accuracy was 81.5% with 81 of the positive documents being correctly identified and 82 of the negative ones. This is right in the magic space for human agreement." According to Mike, "Experience has also shown us that human analysts tend to agree about 80% of the time, which means that you are always going to find documents that you disagree with the machine on."
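Marshall's 81.5% figure works out cleanly if his test set held 100 positive and 100 negative documents, which is an assumption on my part; he doesn't state the set size. The arithmetic:

```python
# Document-level accuracy from per-class correct counts (assuming a
# balanced test set of 100 positive and 100 negative documents, which
# is what Marshall's figures imply).
correct_pos = 81   # positive documents classified correctly
correct_neg = 82   # negative documents classified correctly
total_docs = 200

accuracy = (correct_pos + correct_neg) / total_docs  # 163/200 = 0.815
```

At that rate, the machine lands squarely in the roughly-80% band where, per Marshall, two human analysts would also disagree with each other.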