Opinion Spam Detection: Detecting Fake Reviews and Reviewers
Many names: Spam Review, Fake Review, or Bogus Review
Opinion Spammer, Review Spammer, Fake Reviewer, Shill (Stooge or Plant)
Deception, Deceptive Message
(See this New York Times front-page article, Jan. 26, 2012)
(Bloomberg BusinessWeek, Sept. 29, 2011 and more ... )
It has become common practice for people to find and read opinions and reviews on the Web for many purposes. For example, someone who wants to buy a product typically goes to a merchant or review site (e.g., amazon.com) to read reviews from existing users of the product. If there are many positive reviews, that person is very likely to buy the product. If there are many negative reviews, he or she will most likely choose another product. Positive opinions can result in significant financial gain and/or fame for organizations and individuals. This, unfortunately, gives strong incentives for opinion spam.
Opinion Spam: Opinion spamming refers to "illegal" activities (e.g., writing fake reviews, also called shilling) that try to deliberately mislead readers or automated opinion mining and sentiment analysis systems by giving undeserved positive opinions to some target entities in order to promote them and/or by giving false negative opinions to some other entities in order to damage their reputations.
Opinion spam comes in many forms, e.g., fake reviews (also called bogus reviews), fake comments, fake blogs, fake social network postings, deceptions, and deceptive messages. Manually spotting such postings is very hard, but there are several pages on the Web (see below) which tell people how to spot fake reviews and deceptive messages. To the best of our knowledge, our group was the first in academia to conduct research on detecting fake reviews and reviewers (or shills). Our first paper was published in 2007, and subsequent papers were published in 2008 and 2010. My textbook Web Data Mining also has a section in Chapter 11 discussing the issue (Springer, Second Edition, July 2011; First Edition, Dec. 2006). The objective of our current project is to detect fake reviews. We have not worked on detecting other forms of opinion spam.
Fake Review Detection
We have used supervised learning, pattern discovery, unexpectedness defined with probability, and graph-based methods for the task. Below are the main signals that we use:
Review content:
Lexical features such as words, n-grams, and part-of-speech tags.
Content and style similarity of reviews from different reviewers.
Semantic inconsistency (we have never used this kind of feature). For example, a reviewer wrote "My wife and I bought this car ..." in one review and then in another review he/she wrote "My husband really love ..." (I heard this example from a friend at a company that actively detects fake reviews).
Reviewer abnormal behaviors:
Public data available from Web sites, e.g., time of posting, frequency of posting, first reviewers of products, and many more. For example, do you see anything wrong with the reviews from this user-name, Big John? What about after you see the reviews of these two user-names, Cletus and Jake? In fact, if you browse the reviews of their reviewed products, you will find another suspicious user-name/person. This is just one example of atypical behaviors that our algorithm is able to discover.
Web site private/internal data (we have not used such data, but they are extremely useful), e.g., IP and MAC addresses, time taken to post a review, physical location of the reviewer, etc. (there are many of them).
Product related features: For example, product descriptions and sales ranks.
Relationships: Complex relationships among reviewers, reviews, and entities (e.g., products and stores).
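As a rough illustration of two of the signals listed above (content similarity of reviews from different reviewers, and frequency of posting), here is a minimal Python sketch. It is not the method used in the papers cited on this page. The record fields (id, reviewer, date, text), the word-bigram Jaccard measure, the 0.7 threshold, and the sample reviews are all assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations


def shingles(text, n=2):
    """Return the set of lowercase word n-grams in a review text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(reviews, threshold=0.7):
    """Return (id1, id2, similarity) for pairs of reviews written by
    different reviewers whose n-gram sets overlap at least `threshold`."""
    sets = [(r["id"], r["reviewer"], shingles(r["text"])) for r in reviews]
    flagged = []
    for (id1, rev1, s1), (id2, rev2, s2) in combinations(sets, 2):
        if rev1 == rev2:
            continue  # only compare reviews from different reviewers
        sim = jaccard(s1, s2)
        if sim >= threshold:
            flagged.append((id1, id2, sim))
    return flagged


def max_reviews_per_day(reviews):
    """Map each reviewer to the largest number of reviews posted on one date."""
    per_day = Counter((r["reviewer"], r["date"]) for r in reviews)
    busiest = {}
    for (reviewer, _date), count in per_day.items():
        busiest[reviewer] = max(busiest.get(reviewer, 0), count)
    return busiest


# Hypothetical sample data, invented for illustration only.
reviews = [
    {"id": 1, "reviewer": "a", "date": "2011-05-01",
     "text": "Great camera, sharp pictures and long battery life"},
    {"id": 2, "reviewer": "b", "date": "2011-05-01",
     "text": "Great camera, sharp pictures and long battery life"},
    {"id": 3, "reviewer": "a", "date": "2011-05-01",
     "text": "The printer jams on every other page"},
    {"id": 4, "reviewer": "c", "date": "2011-06-12",
     "text": "Decent phone but the screen scratches easily"},
]

print(similar_pairs(reviews))
print(max_reviews_per_day(reviews))
```

Neither signal is conclusive on its own: a high similarity score or a busy posting day only marks a review or reviewer for closer inspection. The all-pairs comparison is quadratic in the number of reviews, so a data set of realistic size would need an approximate method instead.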
We believe that as opinions on the Web are increasingly used in practice by consumers, organizations, and businesses for their decision making, opinion spam will get worse and also more sophisticated.
Detecting spam reviews or opinions will become more and more critical. The situation is already quite bad. When I have time, I will write more about it. You can also have a look at our papers.
Differences from Web Spam and Email Spam
Web spam: Web spamming refers to the use of "illegitimate means" to boost the search rank position of some target Web pages (see this New York Times article). There are two main types of Web spam: link spam and content spam. Opinion spam is very different from Web spam because both link spam and content spam seldom occur in opinion documents such as product reviews. Link spam is spam on hyperlinks, which is almost nonexistent in reviews because there are usually no links among reviews. Content spam tries to add irrelevant or remotely relevant words to target Web pages in order to fool search engines, which again hardly occurs in reviews.
Email Spam: Email spamming usually refers to unsolicited commercial advertisements. Advertisements do appear in reviews, but they are rare and relatively easy to detect.
Check out this interview: Practical Sentiment Analysis and Lies, conducted by Tom H. C. Anderson for his Next Gen Market Research blog, April 9, 2012.
Types of Opinion Spam
There are generally three types of spam reviews (Jindal and Liu WSDM-2008):
Type 1 (fake reviews): These are reviews that deliberately mislead readers or opinion mining systems by giving undeserved positive opinions to some target entities in order to promote them and/or by giving unjust or malicious negative opinions to some other entities in order to damage their reputations.
Type 2 (reviews on brands only): These reviews do not comment on the specific products that they are supposed to review, but only on the brands, the manufacturers, or the sellers of the products. Although they may be useful, they are considered spam because they are not targeted at the specific products and are often biased. For example, in a review for an HP printer, the reviewer only wrote "I hate HP. I never buy any of their products".
Type 3 (non-reviews): These are not reviews and contain no opinions, although they appear as reviews. There are two main sub-types: (1) advertisements, and (2) other irrelevant texts containing no opinions (e.g., questions, answers, and random texts).
Type 2 and Type 3 spam are rare, but Type 1 spam reviews are widespread and very hard to detect. Some fake reviews are not so harmful, but some are very harmful. See details in (Jindal and Liu WSDM-2008) or Chapter 11 of my book Web Data Mining.
Acknowledgement: This project was partially funded by Microsoft and Google
China's Internet "Water Army" (Shuijun) - Opinion Spammers
You can hire people to write and post fake reviews or comments, and even bribe staff at review, forum and microblog sites to delete posts that you do not like.
If you read Chinese, see this description from Baidu Baike at baidu.com.
Data Sets
Amazon Product Review Data (Huge), used in (Jindal and Liu, WWW-2007; WSDM-2008; Lim et al., CIKM-2010; Jindal, Liu and Lim, CIKM-2010; Mukherjee et al., WWW-2011; Mukherjee, Liu and Glance, WWW-2012) for review spam (fake review) detection. It contains information about reviewers, review text, ratings, product info, etc. Due to the large file size, you may need to use Download Accelerator Plus (DAP) to download it. If you use this data, please cite (Jindal and Liu, WSDM-2008).
Nitin Jindal, Bing Liu and Ee-Peng Lim. "Finding Unusual Review Patterns Using Unexpected Rules." Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM-2010, short paper), Toronto, Canada, Oct 26 - 30, 2010.
Ee-Peng Lim, Viet-An Nguyen, Nitin Jindal, Bing Liu and Hady Lauw. "Detecting Product Review Spammers using Rating Behaviors." Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM-2010, full paper), Toronto, Canada, Oct 26 - 30, 2010.
Nitin Jindal and Bing Liu. "Opinion Spam and Analysis." Proceedings of the First ACM International Conference on Web Search and Data Mining (WSDM-2008), Feb 11-12, 2008, Stanford University, Stanford, California, USA.
Nitin Jindal and Bing Liu. "Review Spam Detection." Proceedings of WWW-2007 (poster paper), May 8-12, Banff, Canada.