Ph.D. Research Projects
Project: Topic Modeling & Lifelong Learning
Type: Research
Keywords: Machine Learning, Text Mining, Lifelong learning, Opinion Mining, Aspect Extraction, Knowledge-based Topic Models
Time: Sep. 2012 – Present
Organization: University of Illinois at Chicago
Programming Language: Java, Python
My Achievements: Long papers accepted at KDD 2014, ICML 2014, ACL 2014, EMNLP 2013, CIKM 2013, and IJCAI 2013. See my publications for details.
Project: Intention Identification
Type: Research
Keywords: Text Mining, Transfer Learning, Machine Learning, Natural Language Processing
Time: Nov. 2011 – Sep. 2012
Organization: University of Illinois at Chicago
Programming Language: Java
My Achievements: Long paper accepted in NAACL 2013. See my publications for details.
Internship Projects
Project: Question Quality Classification and Grammar Auto-Correction
Type: Machine Learning Engineering & Production
Keywords: Machine Learning, Natural Language Processing, Feature Engineering, Production
Time: May 2015 – July 2015
Organization: Quora, Mountain View, USA
Complexity and Technology: 5K lines of Python code
Details: Built a machine learning classifier for a large number of questions based on a variety of features. Also applied data mining techniques to improve question quality by auto-correcting grammatical issues.
My Achievements: 1) Implemented the entire data pipeline for the project, including extracting data, modeling questions, and evaluating models. 2) Designed fine-grained natural language processing features for model improvements. 3) Integrated the code into production.
Project: Modeling Twitter Influence on TV Tune in
Type: Data Science & Research
Keywords: Machine Learning, Time Series Data, Granger Causality, Vector Autoregression
Time: May 2014 – Aug. 2014
Organization: Twitter, Boston, USA
Complexity and Technology: 1,000 lines of Scalding (Hadoop, 100TG data), 4,000 lines of Python code
Details: Social TV has become an important emerging technology due to the popularity of social media such as Twitter. How Twitter influences traditional media platforms (such as TV and news) is an interesting yet challenging problem. My internship project was to collect data and build a model to infer the relationship between Twitter, TV, and other media platforms such as news.
My Achievements: I collected and processed a large amount of data (tweets, ratings, etc.) from multiple data sources. Then I built the model and designed experiments to show the relationship. Finally, I presented the results to multiple teams.
Project: Video Triggering in Bing Search Engine
Type: Research & Product
Keywords: Information Retrieval, Bing Search Engine, Natural Language Processing
Time: May 2012 – Aug. 2012
Organization: Microsoft Research Redmond, WA, USA
Complexity and Technology: 4,000 lines of C# code for experiments, 3,000 lines of C# code for product.
Details: The project was to build a trigger for educational videos, particularly for Khan Academy. The algorithm had two parts: classification and ranking. The classification (of user query intent) was based on Bing query logs and the word distribution in the video library. For ranking, I modeled the context of educational videos based on domain background. In detail, the ranking algorithm was based on both video fields (text) and domain entities, each with corresponding weights.
My Achievements: I designed the algorithm and implemented it. The algorithm improved precision by about 35% compared with the previous algorithm. I also wrote the product code to flight the new algorithm into Bing.
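The weighted ranking idea from the Details above (text fields plus domain entities, each with a weight) can be sketched roughly as follows. This is only an illustration, not the code that ran in Bing: the field names, the weight values, and the helpers `score_video` and `rank_videos` are all hypothetical.

```python
# Hypothetical weights; the real fields and weights are not given in this document.
FIELD_WEIGHTS = {"title": 3.0, "description": 1.0}
ENTITY_WEIGHT = 2.0


def score_video(query_terms, video):
    """Score a video by weighted term overlap in text fields plus entity matches."""
    terms = {t.lower() for t in query_terms}
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        field_terms = set(video.get(field, "").lower().split())
        score += weight * len(terms & field_terms)
    entities = {e.lower() for e in video.get("entities", [])}
    score += ENTITY_WEIGHT * len(terms & entities)
    return score


def rank_videos(query_terms, videos):
    """Return videos sorted from highest to lowest score."""
    return sorted(videos, key=lambda v: score_video(query_terms, v), reverse=True)
```

A query term that matches a title counts more than one that matches a description, and a match against a known domain entity adds a separate bonus.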
Project: Automatic Wrapper for Web Data Extraction
Type: Research & Product
Keywords: Web Mining, Data Mining
Time: Sept. 2010 – Nov. 2010
Organization: Microsoft Research Asia, Beijing, China
Complexity and Technology: 5,000 lines of C# code for experiments, 2,500 lines of C# code for product.
Details: The project was to build an automatic wrapper to extract information from structured data, e.g. forum posts and reviews. My task was to improve the efficiency of an algorithm called MiBAT. I redesigned the algorithm by reversing its order of operations: I proposed a new bottom-up algorithm that finds common parents in the DOM (Document Object Model) tree in order to locate the anchor trees (sub-trees containing the actual date and time).
My Achievements: The new algorithm cut the processing time from 245 milliseconds to 8.8 milliseconds per page without losing much precision or recall. Since it worked so well, I wrote C++ product code to incorporate it into Bing to improve document understanding.
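The core bottom-up step mentioned in the Details above, finding a common parent of two DOM nodes, can be sketched as below. This is a generic lowest-common-ancestor walk, not the MiBAT algorithm or my redesign of it; the `Node` class and both helper functions are hypothetical stand-ins for a real DOM API.

```python
class Node:
    """A minimal stand-in for a DOM node (hypothetical, for illustration only)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent


def ancestors(node):
    """Return the path from a node up to the root, starting with the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def common_parent(a, b):
    """Walk upward from both nodes and return their lowest common ancestor."""
    seen = {id(n) for n in ancestors(a)}
    for n in ancestors(b):
        if id(n) in seen:
            return n
    return None
```

Starting from leaf nodes (for example, two date/time text nodes) and walking upward touches only the ancestors of those nodes instead of the whole tree.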
Project: Scale Blocking for Record Linkage
Type: Research
Keywords: Information Integration, Big Data, Machine Learning, Publication
Time: Dec. 2010 – Jan. 2011
Organization: Microsoft Research Asia, Beijing, China
Complexity and Technology: 3,000 lines of C# code and 300 lines of SQL for research. Dataset: 29 Million entries.
Details: Linking identical entries between two huge databases is very challenging because of memory and efficiency constraints. Building on previous machine learning work, we proposed using unlabeled data to help learn blocking schemes. Moreover, we designed an algorithm that uses unlabeled data effectively and efficiently by sampling it. I ran comprehensive experiments to verify the approach empirically.
My Achievements: Paper published:
Yunbo Cao, Zhiyuan Chen, Jiamin Zhu, Pei Yue, Chin-Yew Lin, and Yong Yu. Leveraging Unlabeled Data to Scale Blocking for Record Linkage. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July 16-22, 2011.
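For readers unfamiliar with blocking, the basic idea can be sketched as follows: records are grouped by a cheap key, and only records that share a key are compared. The paper learns blocking schemes from data; this sketch hard-codes one hypothetical key (`block_key`, the first three letters of a `name` field) purely for illustration.

```python
from collections import defaultdict
from itertools import product


def block_key(record):
    """Hypothetical blocking key: first three letters of the lowercased name."""
    return record["name"].lower()[:3]


def candidate_pairs(left, right, key=block_key):
    """Pair up only records that share a blocking key, not all left x right pairs."""
    blocks = defaultdict(lambda: ([], []))
    for rec in left:
        blocks[key(rec)][0].append(rec)
    for rec in right:
        blocks[key(rec)][1].append(rec)
    pairs = []
    for left_block, right_block in blocks.values():
        pairs.extend(product(left_block, right_block))
    return pairs
```

With millions of entries on each side, comparing only within blocks avoids the full cross product, which is what makes linkage feasible at that scale.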
Project: Sempute Link
Type: Research & Demo
Keywords: Natural Language Processing, Machine Learning, Demo
Time: Feb. 2011 – May 2011
Organization: Microsoft Research Asia, Beijing, China
Complexity and Technology: 3,000 lines of C# code for backend, 500 lines of HTML/JavaScript for frontend.
Details: The project was to build a system (including backend algorithm and UI) that can link any entity on a Web page to its entry in Wikipedia. The challenge was entity name disambiguation. We designed dozens of features by comparing the context of the entity with the content of the Wikipedia page. I refined the features in the model and ran experiments to improve the accuracy.
My Achievements: I developed the backend services and the desktop user interface. The demo was shown in Microsoft TechFest 2011.
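One way to compare an entity's context with candidate Wikipedia pages, as described in the Details above, is a bag-of-words cosine similarity. The sketch below shows that single feature only; the actual system combined dozens of features in a model, and the function names and sample texts here are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase whitespace tokenization into term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_candidate(context, candidates):
    """Pick the candidate title whose page text is most similar to the mention context."""
    ctx = bag_of_words(context)
    return max(candidates, key=lambda title: cosine(ctx, bag_of_words(candidates[title])))
```

Here `candidates` maps a candidate page title to that page's text, and the function returns the title with the highest similarity to the text surrounding the mention.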
Course Projects
Project: Mini Siri Q&A System
Type: Course project
Keywords: Natural Language Processing, Database
Time: May 2012
Organization: CS 421, Natural Language Processing, University of Illinois at Chicago
Programming Language: Java
Details: The project was to design a mini Siri Q&A system to answer user questions about movies, geography, and the Olympics. The idea was to use the parse tree of the question and convert it into an SQL query in order to get the answer from a database.
My Achievements: Evaluated on user question data, the algorithm ranked 3rd among all submissions.
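The question-to-SQL idea can be illustrated with a much simpler stand-in: a single regular-expression pattern in place of the parse tree used in the project. The pattern, the `movies` table with `title` and `director` columns, and both helper functions are assumptions made up for this sketch (the project itself was written in Java).

```python
import re
import sqlite3

# Hypothetical question pattern; the actual system walked a parse tree.
PATTERN = re.compile(r"who directed (?P<title>.+?)\??$", re.IGNORECASE)


def question_to_sql(question):
    """Map a question to a parameterized SQL query, or None if no pattern matches."""
    match = PATTERN.match(question.strip())
    if match is None:
        return None
    sql = "SELECT director FROM movies WHERE title = ? COLLATE NOCASE"
    return sql, (match.group("title"),)


def answer(conn, question):
    """Run the translated query and return the first matching value, if any."""
    translated = question_to_sql(question)
    if translated is None:
        return None
    row = conn.execute(*translated).fetchone()
    return row[0] if row else None
```

Passing the extracted title as a query parameter, not by string concatenation, keeps user input out of the SQL text.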
Project: Product Entity Recognition
Type: Course project
Keywords: Natural Language Processing, Text Mining, Machine Learning
Time: Dec. 2011
Organization: CS 583, Data Mining and Text Mining, University of Illinois at Chicago
Programming Language: Java
Details: The project was to extract product entities from product reviews. Motivated by traditional named entity recognition algorithms, each word was classified as the beginning, middle, or end of a product entity, which greatly reduced the feature space. Finally, SVM was applied twice to solve this multi-class classification problem.
My Achievements: Evaluated on the testing data, the algorithm ranked 1st among all submissions.
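The begin/middle/end labeling described in the Details above can be sketched as a function that turns annotated entity spans into per-token tags, which then serve as classifier targets. The "O" (outside) label and the handling of single-token entities are assumptions of this sketch, and `tag_tokens` is a hypothetical helper (the project itself was written in Java).

```python
def tag_tokens(tokens, entity_spans):
    """Label each token as B (begin), M (middle), E (end) of an entity, or O (outside).

    entity_spans holds (start, end) token indices, end inclusive. How single-token
    entities were labeled is not stated in the source; this sketch tags them as B.
    """
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "M"
        if end > start:
            tags[end] = "E"
    return tags
```

For example, tagging the span covering "Canon PowerShot SX 50" in a review sentence yields B, M, M, E for those four tokens and O for the rest.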
Website Design Projects
Project: ACM-ICPC Automatic Judge Website
Type: Website Building
Keywords: HTML, JavaScript
Time: Jan. 2010 – May 2010
Organization: DLUT, China
Programming Language: ASP.NET, HTML, JavaScript
Details: The website I implemented automatically compiles and runs the code users submit, compares the program's output with the expected standard output, and returns the result to the user. I also added extra features, e.g. forums and messages.
My Achievements: The website served as a training and competition platform for ACM-ICPC contests and as a teaching platform for the course.
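The output-comparison step of such a judge can be sketched as below. This is an illustration in Python, not the original ASP.NET code; the whitespace-tolerant `normalize` rule and the verdict strings are assumptions, and compiling and running the submission is left out.

```python
def normalize(output):
    """Strip trailing whitespace on each line and drop trailing blank lines."""
    lines = [line.rstrip() for line in output.splitlines()]
    while lines and not lines[-1]:
        lines.pop()
    return lines


def judge(expected_output, user_output):
    """Return a verdict string by comparing normalized outputs line by line."""
    if normalize(expected_output) == normalize(user_output):
        return "Accepted"
    return "Wrong Answer"
```

Ignoring trailing whitespace is a common leniency in online judges, so that an extra space or newline at the end of the output does not fail an otherwise correct answer.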