Web Content Mining
WWW-2005 Tutorial, May 10, 2005, Chiba, Japan
References
- Agrawal, R. and Srikant, R. Fast algorithms for mining association rules. VLDB-94, 1994.
- Agrawal, R. and Srikant, R. On integrating catalogs. WWW-01, 2001.
- Agrawal, R., Rajagopalan, S., Srikant, R., and Xu, Y. Mining newsgroups using networks arising from social behavior. WWW-03, 2003.
- Arasu, A. and Garcia-Molina, H. Extracting Structured Data from Web Pages. SIGMOD-03, 2003.
- Baeza-Yates, R. Algorithms for string matching: A survey. ACM SIGIR Forum, 23(3-4):34-58, 1989.
- Barton, G., Sternberg, M. A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol. 1987, 327-337.
- Bar-Yossef, Z. and Rajagopalan, S. Template Detection via Data Mining and its Applications, WWW-02, 2002.
- Brill, E. Some advances in rule-based part of speech tagging. AAAI-94, 1994.
- Broder, A., Glassman, S., Manasse, M. and Zweig, G. Syntactic Clustering of the Web. WWW-6, 1997.
- Bunescu, R., Mooney, R. Collective Information Extraction with Relational Markov Networks. ACL-04, 2004.
- Mooney, R., and Bunescu, R. Mining Knowledge from Text Using Information Extraction. To appear in a special issue of SigKDD Explorations on Text Mining and Natural Language Processing, 2005.
- Buttler, D., Liu, L., Pu, C. A fully automated extraction system for the World Wide Web. IEEE ICDCS-21, 2001.
- Cai, D, Yu, S., Wen, J-R and Ma, W-Y. "Extracting Content Structure for Web Pages based on Visual Representation", Fifth Asia Pacific Web Conference (APWeb-03), 2003.
- Cai, D., Yu, S., Wen, J.-R. and Ma, W.-Y. Block-based web search. SIGIR-04, 2004.
- Carrillo, H., Lipman, D. The multiple sequence alignment problem in biology. SIAM J. Applied Math., 1988;48(5).
- Chakrabarti, S. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, 2002.
- Chang, C. and Lui, S-L. IEPAD: Information extraction based on pattern discovery. WWW-10, 2001.
- Chen, W. New algorithm for ordered tree-to-tree correction problem. Journal of Algorithms, 40:135-158, 2001.
- Chrisment, C., Dousset, B., Karouach, S., Mothe, J. Information mining: extracting, exploring and visualising geo-referenced information. SIGIR-04 Workshop on Geographic Information Retrieval, 2004.
- Cimiano, P., Handschuh, S., and Staab, S. Towards the self-annotating web. WWW-04, 2004.
- Cohen, W., Hurst, M., and Jensen, L. A flexible learning system for wrapping tables and lists in HTML documents. WWW-02, 2002.
- Crescenzi, V., Mecca, G. and Merialdo, P. Roadrunner: Towards automatic data extraction from large web sites. VLDB-01, 2001.
- Cui, Hang, Min-Yen Kan and Tat-Seng Chua, Unsupervised Learning of Soft Patterns for Definitional Question Answering, Proceedings of the Thirteenth World Wide Web conference (WWW 2004), New York, May 17-22, 2004, pp. 90-99.
- Das, S. and Chen, M. Yahoo! for Amazon: Extracting market sentiment from stock message boards. APFA-01, 2001.
- Dave, K., Lawrence, S., and Pennock, D. Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. WWW-03, 2003.
- Doan, A., and Halevy, A., Semantic Integration Research in the Database Community: A Brief Survey. AI magazine, 2005.
- Doan, A., Madhavan, J., Domingos, P., Halevy, A. Learning to map between ontologies on the semantic web. WWW-02, 2002.
- Embley, D., Jiang, Y. and Ng, Y. "Record-boundary discovery in Web documents." SIGMOD-99, 1999.
- Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A., Shaked, T., Soderland, S., Weld, D. Web-Scale Information Extraction in KnowItAll (Preliminary Results). WWW-04, 2004.
- Fellbaum, C. 1998. WordNet: an Electronic Lexical Database, MIT Press.
- Freitag, D., and McCallum, A. Information extraction with HMM structures learned by stochastic optimization. AAAI-00, 2000.
- Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A. Information diffusion through blogspace. WWW-04, 2004.
- Guha, R., Kumar, R., Raghavan, P., Tomkins, A. Propagation of trust and distrust. WWW-04, 2004.
- Gupta, S., Kaiser, G., Neistadt, D. and Grimm, P. DOM based Content Extraction of HTML Documents, WWW-03, 2003.
- Gusfield, D. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
- Hatzivassiloglou, V. and McKeown, K. Predicting the Semantic Orientation of Adjectives. ACL-97, 1997.
- Hatzivassiloglou, V., and Wiebe, J. Effects of adjective orientation and gradability on sentence subjectivity. COLING-00, 2000.
- He, B., Chang, K., Statistical Schema Matching across Web Query Interfaces. SIGMOD-03, 2003.
- He, B., Chang, K., Han, J. Discovering complex matchings across web query interfaces: a correlation mining approach. KDD-04, 2004.
- Hearst, M. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics, pages 539-545, 1992.
- Hogeweg, P., Hesper, B. The alignment of sets of sequences and the construction of phylogenetic trees: An integrated method. J. Mol. Evol., 20, 175-186 (1984).
- Hsu, C.-N. and Dung, M.-T. Generating finite-state transducers for semi-structured data extraction from the Web. Information Systems. 23(8): 521-538, 1998.
- Hu, M., and Liu, B. Mining and summarizing customer reviews. KDD-04, 2004.
- Kummamuru, K., Lotlikar, R., Roy, S., Singal, K., Krishnapuram, R. A hierarchical monothetic document clustering algorithm for summarization and browsing search results. WWW-04, 2004.
- Kushmerick, N., Weld, D., and Doorenbos, R. Wrapper induction for information extraction. IJCAI-97, 1997.
- Kushmerick, N. Wrapper Verification. WWW Journal 3, 2000.
- Kushmerick, N. Regression testing for wrapper maintenance. AAAI-99, pp. 74-79, 1999.
- Kushmerick, N. Wrapper induction: efficiency and expressiveness. Artificial Intelligence, 118:15-68, 2000.
- Lafferty, J., McCallum, A., Pereira, F. Conditional random fields: probabilistic models for segmenting and labeling sequence data. ICML-01, 2001.
- Lerman, K., Minton, S., Knoblock, C. Wrapper Maintenance: A Machine Learning Approach. J. Artif. Intell. Res. (JAIR) 18: 149-181, 2003.
- Lerman, K., Getoor, L., Minton, S. and Knoblock, C. "Using the Structure of Web Sites for Automatic Segmentation of Tables." SIGMOD-04, 2004.
- Leuski, A. and Allan, J. "Improving interactive retrieval by combining ranked lists and clustering". In Proceedings of RIAO-2000, pages 665-681, Paris, France, 2000.
- Li, X, Liu, B, Phang, T, and Hu, M. "Using Micro Information Unit for Internet Search," CIKM-2002, McLean, VA, Nov 5-9, 2002.
- Lin, S and Ho, J. Discovering informative content blocks from Web documents. KDD-02, 2002.
- Liu, B., and Chang, K. "Editorial: Special Issue on Web Content Mining" SIGKDD Explorations special issue on Web Content Mining, Dec, 2004.
- Liu, B., Chin, C, and Ng, H. "Mining Topic-Specific Concepts and Definitions on the Web." WWW-03, 2003.
- Liu, B., Grossman, R., and Zhai, Y. "Mining Data Records in Web Pages." KDD-03, 2003.
- Liu, B and Zhai, Y. "NET - A System for Extracting Web Data from Flat and Nested Data Records." WISE-05, 2005.
- Liu, B., Hsu, W., and Ma, Y. Integrating Classification and Association Rule Mining. KDD-98, 1998.
- Liu, B., Hu, M and Cheng, J. "Opinion Observer: Analyzing and comparing opinions on the Web" WWW-05, May 10-14, 2005, in Chiba, Japan.
- Liu, B., Ma, Y., and Yu, P. "Discovering unexpected information from your competitors' Web sites." KDD-01, San Francisco, CA, Aug 20-23, 2001.
- Liu, B., Zhao, K and Yi, L. "Visualizing Web site comparisons." WWW-02. Honolulu, Hawaii, USA, 2002.
- Maedche, A., and Staab, S. Ontology Learning for the Semantic Web. IEEE Intelligent Systems 16(2): 72-79, 2001.
- Meng, X., Lu, H., Wang, H., and Gu, M. Schema-guided wrapper generator. ICDE-02, 2002.
- Morinaga, S., Yamanishi, K., Tateishi, K., and Fukushima, T. Mining Product Reputations on the Web. KDD-02, 2002.
- Muslea, I., Minton, S. and Knoblock, C. Active Learning for Hierarchical Wrapper Induction. AAAI-99, 1999: 975.
- Muslea, I., Minton, S. and Knoblock, C. Selective Sampling with Co-Testing: Preliminary Results. AAAI-00, 2000.
- Muslea, I., Minton, S. and Knoblock, C. "A hierarchical approach to wrapper induction." Agents-99, 1999.
- Nasukawa, T. and Yi, J. Sentiment analysis: Capturing favorability using natural language processing. Proceedings of the 2nd Intl. Conf. on Knowledge Capture (K-CAP-03), 2003.
- Nigam, K., and Hurst, M. Towards a Robust Metric of Opinion. AAAI Spring Symposium on Exploring Attitude and Affect in Text. 2004.
- NLProcessor - Text Analysis Toolkit. 2000. http://www.infogistics.com/textanalysis.html
- Noy, N, and Musen, M. PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment. AAAI-00, 2000.
- Pang, B., Lee, L., and Vaithyanathan, S., Thumbs up? Sentiment Classification Using Machine Learning Techniques. EMNLP-02, 2002.
- Pinto, D., McCallum, A., Wei, X. and Bruce, W. Table Extraction Using Conditional Random Fields. SIGIR-03, 2003.
- Ramaswamy, L., Iyengar, A., Liu, L., and Douglis, F. Automatic detection of fragments in dynamically generated Web pages. WWW-04, 2004.
- Reis, D., Golgher, P., Silva, A., Laender, A. Automatic Web news extraction using tree edit distance. WWW-04, 2004.
- Riloff, E. and Wiebe, J. Learning extraction patterns for subjective expressions. EMNLP-03, 2003.
- Rosenfeld, B., Feldman, R., Aumann, Y. Structural extraction from visual layout of documents. CIKM-02, 2002.
- Song, R., Liu, H., Wen, J.-R., Ma, W.-Y. Learning block importance models for Web pages. WWW-04, 2004.
- Tai, K. The tree-to-tree correction problem. J. ACM, 26(3):422-433, 1979.
- Hogue, A. and Karger, D. Thresher: Automating the unwrapping of semantic content from the World Wide Web. WWW-05, 2005.
- Tong, R. An Operational System for Detecting and Tracking Opinions in on-line discussion. SIGIR 2001 Workshop on Operational Text Classification, 2001.
- Turney, P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. ACL-02, 2002.
- Vaithyanathan, S., Dom, B. Model Selection in Unsupervised Learning with Applications To Document Clustering. ICML-99, 1999.
- Valiente, G. Tree edit distance and common subtrees. Research Report LSI-02-20-R, Universitat Politecnica de Catalunya, Barcelona, Spain, 2002.
- Wang, J., Wen, J-R, Lochovsky, F., Ma, W-Y. Instance-based Schema Matching for Web Databases by Domain-specific Query Probing. VLDB-04, 2004.
- Wang, J., and Lochovsky, F. Data extraction and label assignment for Web databases. WWW-03, 2003.
- Wang, Y., and Hu, J. A machine learning based approach for table detection on the Web. WWW-02, 2002.
- Wiebe, J., Bruce, R., and O'Hara, T. Development and Use of a Gold Standard Data Set for Subjectivity Classifications. ACL-99, 1999.
- Wilson, T., Wiebe, J., and Hwa, R. Just how mad are you? Finding strong and weak opinion clauses. AAAI-04, 2004.
- Wu, W, Yu, C, Doan, A., and Meng, W., An Interactive Clustering-based Approach to Integrating Source Query interfaces on the Deep Web. SIGMOD-04, 2004.
- Yang, W. Identifying syntactic differences between two programs. Softw. Pract. Exper., 21(7):739-755, 1991.
- Yi, L., and Liu, B. "Web Page Cleaning for Web Mining through Feature Weighting" IJCAI-03, Aug 9-15, 2003, Acapulco, Mexico.
- Yi, L., Liu, B., and Li, X. "Eliminating Noisy Information in Web Pages for Data Mining." KDD-2003, Washington, DC, USA, August 24 - 27, 2003.
- Yin, X. and Lee, W-S. Using link analysis to improve layout on mobile devices. WWW-04, 2004.
- Yu, H., and Hatzivassiloglou, V. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. EMNLP-03, 2003.
- Zamir, O. and Etzioni, O. Grouper: A Dynamic Clustering Interface to Web Search Results. WWW-8, 1999.
- Zeng, H-J., He, Q-C., Chen, Z., Ma, W-Y., Ma, J. Learning to cluster web search results. SIGIR-04, 2004.
- Zhai, Y., and Liu, B. Web data extraction based on partial tree alignment. WWW-05, 2005.
- Zhai, Y., and Liu, B. Extracting Web Data Using Instance-Based Learning. WISE-05, 2005.
- Zhang, D., and Lee, W-S. Web taxonomy integration using support vector machines. WWW-04, 2004.
- Zhao, H., Meng, W., Wu, Z., Raghavan, V. and Yu, C. Fully automatic wrapper generation for search engines. WWW-05, 2005.
Created on April 25, 2005 by Bing Liu.