Welcome to Xiaoxiao (Ricky) Shi's Home Page

List of Projects


Project Descriptions

  • LinkedOut


    LinkedOut is a system that predicts who is likely to change jobs in the near future, and how that move affects his/her social network. I built it with Guan Wang and Yuchen Zhao for a global competition (LinkedIn Hackday). I mainly designed the prediction model and implemented it in a scripting language and in Gephi, a Java-based HCI (human-computer interaction) framework. I also implemented part of the HCI interface, including a visual effect that simulates the elastic behavior of a physical spring. The system won 2nd place among 170 competitors.
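
    As an illustration of the spring-like visual effect, here is a minimal sketch of a damped spring following Hooke's law. It is not the Gephi code used in the project; the function name `simulate_spring` and all parameter values (stiffness, damping, time step) are assumptions chosen for the example.

```python
def simulate_spring(x0, rest=0.0, k=20.0, damping=4.0, mass=1.0,
                    dt=0.01, steps=1000):
    """Return the positions of a damped spring released from x0.

    All parameter values are illustrative assumptions.
    """
    if dt <= 0 or mass <= 0:
        raise ValueError("dt and mass must be positive")
    x, v = float(x0), 0.0
    positions = [x]
    for _ in range(steps):
        # Hooke's law pulls toward the rest position; damping opposes motion.
        force = -k * (x - rest) - damping * v
        # Semi-implicit Euler: update velocity first, then position.
        v += (force / mass) * dt
        x += v * dt
        positions.append(x)
    return positions


if __name__ == "__main__":
    path = simulate_spring(1.0)
    print("first overshoot:", min(path))
    print("final position:", path[-1])
```

    With these values the spring is underdamped, so a node drawn at each position would overshoot its rest point and then settle, which gives the elastic look.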

  • Heterogeneous Learning


    This is a project I did during my internship at AT&T Labs. The objective is to derive prediction models that learn from multiple heterogeneous data sources. For instance, users' profiles can be used to build recommendation systems. In addition, a model can also use users' historical behaviors and social networks to infer their interests in related products. We argue that it is desirable to collectively use all available heterogeneous data sources in order to build effective learning models. We call this framework heterogeneous learning. We proposed a gradient boosting model to solve the problem, and the paper was selected as one of the 10 best papers at SDM'12.

  • Compression based user profile generation and advertisement recommendation

  • Important User Behaviors detected by the proposed model

    This is work I did as a summer intern at Yahoo! Labs in 2010. I derived a MapReduce framework that explores an essential feature subspace of users' historical behaviors, based on a non-parametric clustering technique. The precision of advertisement recommendation improves in the reduced feature subspace. Five research scientists mainly worked on the project, with 20 more providing help and discussion. I wrote over 20,000 lines of Java code under Hadoop (a MapReduce framework), as well as thousands of lines in scripting languages. I contributed the main framework of the proposed model and set up extensive experiments on large-scale datasets. This work was published in SIGIR'10.

  • Shaker Detection

  • Capture the lead-lag effects (positive and negative), and summarize them in the cascading graph
    Algorithm flow
    Example of the cascading graph constructed by the algorithm
  • Transfer Learning

  • Transfer Learning across Chemical Graph Databases
    Transfer Learning across different feature spaces




    Transfer Learning across different output spaces


    Transfer Learning across classification and clustering tasks
    Active Transfer Learning




  • Advanced Trading Signals

    (a) VPIN: The Volume Synchronized Probability of INformed Trading, commonly known as VPIN, is a mathematical model to evaluate the volume-based volatility of the market. Its origin is the PIN model, which is used to infer the probability of informed trading:



    where S0 is the current price, SB corresponds to the arrival of bad news, and SG corresponds to the arrival of good news, which are further modeled as Poisson processes. My responsibility was to implement and improve the VPIN model in a streaming environment, and to test it on various markets at different frequencies. An example of the trading signals is illustrated in the following figure:

    (b) VADX (derived from reverse engineering): The Volume Synchronized Average Directional IndeX is a trading signal that I proposed, inspired by the VPIN model and by the trading signal of another financial service company. It captures the current market direction on the volume-synchronized clock. I was first assigned the task of studying the “secret” of an accurate trading signal provided by another financial service company, given only the values of the signal without knowing how it was computed. I applied various data mining approaches and successfully uncovered how the signal works. Combining this with the idea from VPIN, I developed the VADX signal.

    (c) Market Force Signal: This is a trading model that I developed to simulate market movements as the movements of a vibrating string. The signal has two important properties: (i) it is calculated from both price movement and volume movement; (ii) it is consistent with the efficient market hypothesis in the long run; in other words, the accumulated market force is neutral in the long run.

    (d) Combined Trading Signal: With the three signals, I developed a cost-sensitive decision tree model (a machine learning model) to explore the optimal trading strategy. In an empirical study on order book data, the precision of predicting up-ticks/down-ticks is over 95%.
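
    To make the volume-synchronized idea behind item (a) concrete, here is a minimal streaming sketch of a VPIN-style estimate: trades are packed into equal-volume buckets, and the signal is the average absolute buy/sell imbalance over the last n buckets, divided by the bucket volume. It is not the implementation described above. The function name `vpin`, its parameters, and the use of the simple tick rule to label trades as buys or sells are assumptions made for the example; published VPIN work uses bulk volume classification instead.

```python
def vpin(trades, bucket_volume, n_buckets):
    """Return one VPIN-style value per completed bucket.

    trades: iterable of (price, volume) pairs in time order.
    Values start once n_buckets buckets have been filled.
    Trade direction uses the tick rule, which is an assumption here.
    """
    if bucket_volume <= 0 or n_buckets <= 0:
        raise ValueError("bucket_volume and n_buckets must be positive")
    imbalances = []          # |buy - sell| for each completed bucket
    values = []
    buy = sell = filled = 0.0
    last_price = None
    side = 1                 # arbitrary assumption: first trade is a buy
    for price, volume in trades:
        # Tick rule: uptick = buy, downtick = sell, unchanged = same as before.
        if last_price is not None:
            if price > last_price:
                side = 1
            elif price < last_price:
                side = -1
        last_price = price
        remaining = volume
        while remaining > 0:
            # A large trade may spill over into the next bucket.
            take = min(remaining, bucket_volume - filled)
            if side > 0:
                buy += take
            else:
                sell += take
            filled += take
            remaining -= take
            if filled >= bucket_volume:
                imbalances.append(abs(buy - sell))
                buy = sell = filled = 0.0
                if len(imbalances) >= n_buckets:
                    window = imbalances[-n_buckets:]
                    values.append(sum(window) / (n_buckets * bucket_volume))
    return values


if __name__ == "__main__":
    one_sided = [(1, 10), (2, 10), (3, 10), (4, 10)]
    print(vpin(one_sided, bucket_volume=10, n_buckets=2))
```

    A value near 1 means the recent volume was almost entirely one-sided, and a value near 0 means buys and sells were balanced.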

  • Optimal Market Making Strategy via Cost-sensitive Decision Tree

    I derived a machine learning model to determine how to submit an order in the market making process. For example, you can submit a market order, a limit order competing with the best prices, or a limit order one or two ticks away from the best prices. For this task, I defined 5 class labels (2 ticks higher, 2 ticks lower, 1 tick higher, 1 tick lower, trade_at_middle), and converted the market order book data into a summarized feature vector with various indicator values (e.g., bid/offer spread, bid/offer ratio, etc.). I then built a cost-sensitive decision tree to learn the best policy for market making.
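
    The two ingredients above can be sketched as follows: a summary of one order book snapshot, and the cost-sensitive step that picks the label with the lowest expected cost instead of the most probable label. This shows only the decision rule at a single tree leaf, not the tree induction, and it is not the model built in the project. The names `book_features`, `misclassification_cost`, and `min_cost_label`, the choice of features, and the asymmetric cost with `under_penalty=3.0` are all hypothetical.

```python
# Tick offset for each of the five class labels named in the text.
OFFSETS = {
    "2_ticks_higher": 2,
    "1_tick_higher": 1,
    "trade_at_middle": 0,
    "1_tick_lower": -1,
    "2_ticks_lower": -2,
}


def book_features(best_bid, best_ask, bid_size, ask_size):
    """Summarize one order book snapshot into indicator values."""
    return {
        "spread": best_ask - best_bid,
        "size_ratio": bid_size / ask_size,
    }


def misclassification_cost(predicted, actual, under_penalty=3.0):
    """Hypothetical cost in ticks; quoting too low costs more than too high."""
    gap = OFFSETS[actual] - OFFSETS[predicted]
    return under_penalty * gap if gap > 0 else -gap


def min_cost_label(class_probs):
    """Pick the label with the lowest expected cost.

    class_probs: dict mapping label -> probability, e.g. the class
    frequencies observed at one decision tree leaf.
    """
    def expected_cost(predicted):
        return sum(p * misclassification_cost(predicted, actual)
                   for actual, p in class_probs.items())
    return min(OFFSETS, key=expected_cost)


if __name__ == "__main__":
    print(book_features(99.0, 101.0, 300, 100))
    leaf = {"trade_at_middle": 0.6, "2_ticks_higher": 0.4}
    print(min_cost_label(leaf))
```

    In this example the most probable label is trade_at_middle, but the asymmetric cost makes 2_ticks_higher the cheaper choice, which is the behavior that distinguishes a cost-sensitive tree from a plain one.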

  • Correlation Trading

    (a) Pairs trading with 1 million in initial capital:

    (b) Basket trading with 1 million in initial capital:

    Average PnL per Trade:  464.4797

    Std: 662.0391

    Excess Kurtosis: 12.25663 (concentrated around the mean)

    Skewness:  2.040957

    84.9% winners

    Average 55 trades/day
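
    For readers who want to compute the same kind of summary from their own per-trade PnL series, here is a minimal sketch. The function name `pnl_summary` is hypothetical, and the use of population (rather than sample) moments is an assumption, since the page does not say which convention produced the figures above.

```python
import math


def pnl_summary(pnls):
    """Summarize per-trade PnL using population moments (an assumption)."""
    n = len(pnls)
    if n == 0:
        raise ValueError("need at least one trade")
    mean = sum(pnls) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in pnls) / n
    m3 = sum((x - mean) ** 3 for x in pnls) / n
    m4 = sum((x - mean) ** 4 for x in pnls) / n
    if m2 == 0:
        raise ValueError("PnL values are all identical")
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        # A normal distribution has kurtosis 3, so subtract it.
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
        "win_rate": sum(1 for x in pnls if x > 0) / n,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only, not data from the strategy.
    print(pnl_summary([-1.0, 1.0, -1.0, 1.0]))
```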

  • FX Post-trade Analysis Software

    Classic FX arbitrage is a cross-exchange trading strategy that involves different banks, ECNs, and EBS in the trades. Hence, it is not straightforward to track good/bad trades and to perform various post-trade analyses. I developed a visual tool for this purpose on my own.

  • Spam Mining

    This was a course project to discover fake reviews on resellerrating.com. There are over 20,000 stores and over 200,000 reviews. I mainly contributed a concept-based model to find suspicious reviews containing unusual concepts related to the stores. Two people worked on this project, and I took the lead. The whole system has about 7,000 lines of code, of which I wrote 3,000 lines in Java and 1,000 lines in MATLAB.

  • MPI Parallel Computing

    This was a course project to design different logical schemas (e.g., ring structure, hypercube structure, etc.) for parallel computing. It was written in pure C, and I wrote about 3,000 lines of C code for the project.
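
    As a small illustration of the two logical structures named above, the sketch below computes which ranks a given process talks to in a ring and in a hypercube. It is written in Python for brevity rather than in the C/MPI of the original project, and the function names `ring_neighbors` and `hypercube_neighbors` are hypothetical.

```python
def ring_neighbors(rank, size):
    """Return (left, right) neighbor ranks in a ring of `size` processes."""
    if size <= 0 or not 0 <= rank < size:
        raise ValueError("rank must be in [0, size)")
    return ((rank - 1) % size, (rank + 1) % size)


def hypercube_neighbors(rank, dimensions):
    """Return neighbor ranks in a hypercube of 2**dimensions processes.

    Two ranks are neighbors when their binary labels differ in one bit.
    """
    if dimensions < 0 or not 0 <= rank < (1 << dimensions):
        raise ValueError("rank must be in [0, 2**dimensions)")
    return [rank ^ (1 << d) for d in range(dimensions)]


if __name__ == "__main__":
    print(ring_neighbors(0, 4))
    print(hypercube_neighbors(5, 3))
```

    In a ring each process has two neighbors regardless of the number of processes, while in a hypercube each process has as many neighbors as there are dimensions, so messages need fewer hops at the cost of more links.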