Research Statement

The long-term goal of my group's research is to create machine learning algorithms with provable guarantees that enable efficient and accurate data analysis. In the “big data” era, despite the plethora of data in both the public and private sectors, most datasets are only incompletely observed: the target structure is often missing, and the salient structures are obscured by the raw feature representation. Although machine learning is evolving to tackle these challenges with increasingly intricate models, the training schemes still rely on heuristics that offer no theoretical guarantees such as global optimality.

Over the past few years, we have focused on addressing this issue by jointly inferring latent representations and learning predictors for massive datasets through a single convex optimization. The results have been analyzed theoretically and validated on real applications. In the pursuit of representation learning, it is critical to couple it with the learning of the prediction model, so that the latent features discovered have predictive power. Such joint learning poses significant challenges because it leads to non-convex problems in which finding a jointly optimal solution is intractable; this remains a major obstacle to the analysis of deep learning despite its empirical success. Rather than superseding deep learning, our goal is to complement it with analyzable tools that learn predictive representations with improved scalability, modularity, reliability, and/or flexibility. Towards this end, we have made progress in the following directions:

  1. Convexifying multi-layer neural networks by reformulating them in terms of the kernel matrices of the latent outputs, so that salient data abstractions can be discovered efficiently under various network architectures (see the first sketch after this list). This was first achieved for a single hidden layer [27], later extended to multiple hidden layers using normalized kernels [21], then to structured latent representations such as chains and graph matching [18], and finally to an inductive setting [13] that scales to tens of thousands of examples [T1].
Using a similar approach, we convexified multiview learning with a single hidden layer in [36, 29], and we are extending it to multiple layers and multiple modalities. Similar extensions are under way for convex modeling of temporal, spatial, and multi-way correlations by leveraging reproducing kernel Hilbert space embeddings of distributions [40, 41] and our work on tensor trace-norm relaxation [19].
     
  2. Incorporating prior knowledge and structured correlations in the data while retaining global optimality. Many domain-specific priors, such as invariance and robustness, can be modeled effectively in this framework. We first achieved this in [11] and [T2] by warping the kernels with transformation invariance and directional smoothness, building on our original observation that many invariances can be compactly represented by a linear functional [25] (see the second sketch after this list). In [8], this was further extended to sublinear invariances, such as logical relationships between multiple classes and mixup regularization; the result is a novel semi-inner-product space in which the representer of each example can be efficiently embedded into a Euclidean space. This breakthrough enables a variety of generalized invariances to be incorporated into convex representation learning, and we are applying it to unsupervised domain adaptation and to adversarial training of graph neural networks.
The technique leverages the polar operator on the set of representations with unit invariant complexity, and it has been extended to multiple layers [7], allowing deep regularization of neural networks to be carried out much more effectively for, e.g., multiview learning and robust sequential modeling.

  3. Designing efficient and provable optimization algorithms based on greedy approximation and parallel optimization, both exploiting problem structure such as double separability. To optimize kernel matrices, we used low-rank approximation through the conditional gradient algorithm, establishing the first O(1/epsilon) rate of convergence in [2, 30] and [T1] (see the third sketch after this list). This was then accelerated to linear convergence on polytopes, with rates tightened, for the first time, by exploiting the sparsity of the solution [15]. We also established the first linear convergence for variance-reduced stochastic optimization with Bregman divergences on min-max problems [16] and O(1/epsilon) rates for finding minimal enclosing convex shapes [37], along with distributed optimization for huge datasets (228 GB) and models (358 GB) [10, 17].
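
A minimal sketch of the convexification idea behind the first direction is given below. The symbols (W, V, phi, ell, lambda, K) are assumed here for illustration only and do not reproduce the exact formulations of [27] or [21].

```latex
% Sketch only: the symbols are illustrative, not the exact models of [27]/[21].
% One-hidden-layer training is non-convex in the hidden weights W and the
% output weights V:
\[
  \min_{W,\,V}\; L\big(V\,\phi(WX);\,Y\big) \;+\; \Omega(W,V).
\]
% Re-expressing the loss and the regularizer through the kernel (Gram) matrix
% of the latent output, K = \phi(WX)^{\top}\phi(WX) \succeq 0, and relaxing
% the admissible set of K to a convex set \mathcal{K} (e.g. normalized,
% bounded-trace PSD matrices) yields a single convex problem
\[
  \min_{K \in \mathcal{K}}\; \ell(K;\,X,\,Y) \;+\; \lambda\,\operatorname{tr}(K),
\]
% which is optimized over an n-by-n matrix instead of over hidden weights,
% and is jointly convex whenever \ell is convex in K.
```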

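The second sketch illustrates how an invariance becomes a linear functional that warps the kernel, as used in the second direction; the symbols (T_theta, L_x, g_x, gamma) are assumed for illustration and do not reproduce the exact constructions of [25], [11], or [T2].

```latex
% Sketch only: symbols are illustrative, not the exact constructions of
% [25], [11], or [T2].  For a predictor f in an RKHS \mathcal{H} with kernel
% k, requiring f to vary slowly along a transformation T_\theta of an input x
% is a linear functional of f:
\[
  L_x[f] \;=\; \frac{d}{d\theta}\Big|_{\theta=0} f(T_\theta x)
         \;=\; \big\langle f,\; g_x \big\rangle_{\mathcal{H}},
  \qquad
  g_x \;=\; \frac{d}{d\theta}\Big|_{\theta=0} k(T_\theta x,\,\cdot),
\]
% by the Riesz representation theorem.  Penalizing violations of the
% invariance, \sum_x (L_x[f])^2, is then equivalent to replacing the RKHS
% norm by the warped quadratic form
\[
  \|f\|_{\mathcal{H}}^2 \;+\; \gamma \sum_x \big\langle f,\, g_x \big\rangle_{\mathcal{H}}^2,
\]
% i.e. a warped kernel whose Gram matrix encodes the invariance while the
% training problem remains convex in f.
```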

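The third sketch shows the low-rank conditional-gradient (Frank-Wolfe) idea referred to in the third direction, assuming a smooth convex loss over bounded-trace positive semidefinite kernel matrices; the function names and the toy loss are hypothetical placeholders and do not reproduce the algorithms of [2, 30] or [T1].

```python
# Sketch only: a generic conditional gradient (Frank-Wolfe) loop over the
# bounded-trace PSD cone {K >= 0 : tr(K) <= tau}.  The loss and all names are
# hypothetical; the point is that each update is rank-one and that the
# standard step size gives the O(1/epsilon) iteration complexity.
import numpy as np
from scipy.sparse.linalg import eigsh


def frank_wolfe_kernel(grad_loss, n, tau, num_iters=200):
    """Minimize a smooth convex loss f(K) over {K PSD, tr(K) <= tau}.

    grad_loss: callable returning the (symmetric, n x n) gradient df/dK at K.
    Each update adds at most a rank-one term, so the iterate stays low rank.
    """
    K = np.zeros((n, n))
    for t in range(num_iters):
        G = grad_loss(K)
        # Linear minimization oracle over {S PSD, tr(S) <= tau}:
        # min_S <G, S> is attained at tau * v v^T for the eigenvector v of the
        # smallest eigenvalue of G when that eigenvalue is negative, else at 0.
        lam, v = eigsh(G, k=1, which="SA")  # smallest algebraic eigenpair
        if lam[0] < 0:
            v = v[:, 0]
            S = tau * np.outer(v, v)        # rank-one extreme point
        else:
            S = np.zeros_like(K)
        step = 2.0 / (t + 2.0)              # standard step size -> O(1/t) gap
        K = (1.0 - step) * K + step * S     # convex combination stays feasible
    return K


if __name__ == "__main__":
    # Toy usage: recover a target Gram matrix under a matching trace budget.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((30, 5))
    target = A @ A.T
    grad = lambda K: K - target             # gradient of 0.5 * ||K - target||_F^2
    K_hat = frank_wolfe_kernel(grad, n=30, tau=float(np.trace(target)))
    print("relative error:", np.linalg.norm(K_hat - target) / np.linalg.norm(target))
```

Because the oracle returns a rank-one matrix, the iterate after t steps has rank at most t, which keeps memory low, while the 2/(t+2) step size yields the O(1/epsilon) iteration count mentioned above.
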
I maintain several notes on research and IT; they can be found here.