This is the dataset used in the paper:
CER: Complementary Entity Recognition via Knowledge Expansion on Large Unlabeled Product Reviews
IEEE International Conference on Big Data 2016 (IEEE Bigdata 2016)
Hu Xu, Sihong Xie, Lei Shu, Philip S. Yu
[paper], [slides], [Annotated Dataset], [bib]
The dataset covers 7 annotated accessory products with about 1200 reviews. More products are coming. The data are in JSON format. Each review is split into sentences, and each token is annotated with a BIO tag (Beginning, Inside, Outside). Each entity also carries a compatibility polarity (compatible, incompatible, or unknown).
If you use this dataset in your research, please cite our Bigdata 2016 paper; here is the bib.
New: Compatible/Incompatible Entity Recognition from Product Community Question and Answering dataset (CIER PCQA)
Note: this is a similar problem but on a totally different dataset. For example, PCQA tends to be more specific about complementary entities, and its answers are closer to facts than to the opinions found in reviews.
Python:
    import json

    with open('CER_7.json', 'r') as f:
        products = json.load(f)

    # Print the list of words of the 1st sentence in the first product.
    print(products['X']['words'])
    # Print the list of labels of the 1st sentence in the first product.
    print(products['y']['label'])
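Once a sentence's words and labels are loaded, the BIO tags can be grouped into entity spans. The sketch below is a minimal illustration, not part of the released code. It assumes each label is a string starting with "B", "I", or "O" (the actual label strings in the files may differ, for example by carrying a polarity suffix), the helper name `bio_to_spans` is hypothetical, and the sample sentence is made up rather than taken from the dataset.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (start, end, text) spans; end is exclusive."""
    spans = []
    start = None  # index where the current entity began, or None if outside
    for i, tag in enumerate(tags):
        if tag.startswith("B"):
            # A new entity begins; close any entity that was still open.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag.startswith("I"):
            # Continue the current entity; tolerate an I tag with no preceding B.
            if start is None:
                start = i
        else:
            # Outside tag: close the open entity, if any.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
                start = None
    if start is not None:
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


# Hypothetical sentence for illustration only (assumed tag strings).
tokens = ["It", "works", "with", "my", "Galaxy", "S5", "and", "iPad", "."]
tags = ["O", "O", "O", "O", "B", "I", "O", "B", "O"]
print(bio_to_spans(tokens, tags))
# → [(4, 6, 'Galaxy S5'), (7, 8, 'iPad')]
```

Matching on the first character keeps the sketch usable if the tags turn out to carry a suffix after "B" or "I"; adapt the checks once you have inspected the real label values.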
Java: You may need a library to parse the JSON files.