Some research projects
Bioinformatics/Computational Biology/Systems Biology/Approximation Algorithms
My broad research goals in these areas is to develop new combinatorial/geometric
algorithmic techniques as well as to use existing such techniques to design
efficient exact or approximation algorithms for computationally challenging
problems. My research plans include
designing provably efficient algorithms for
these bioinformatics problems, as well as
implementing these algorithms and testing them again
benchmark data.
Advantages of using combinatorial or geometric algorithmic
approaches for
the computationally difficult problems are plenty. Typically combinatorial or
geometric algorithms are simple to implement and thus run faster;
it is the proof that the
algorithm is correct as well as proof of guarantees of its performance
or resource bounds that is often non-trivial. Unlike many other approaches,
resource and performance bounds (such as worst-case or average case bounds for
deterministic algorithms and expected case bounds for randomized algorithms)
for these algorithms is guaranteed. Moreover, a combinatorial
or geometric interpretation of the problem often provides more intuition
about the nature of the problem such as the structure of the optimal
solution or the effect of errors in the input data on the optimal
solution.
Below are only some specific problems/projects that I have investigated.
Bioinformatics/Computational Biology
-
Test Set Problems
-
In the most basic versions of the test set problems, we are given a universe of objects,
family of subsets ("tests") of the universe and a notion of "distinguishability"
of pairs of elements of the universe by a collection of these tests.
Our goal is to select a subset of these tests of minimum size or cost
that distinguishes every pair of elements of the universe.
Such combinatorial formulations allow us to investigate, at one shot,
several important problems in bioinformatics such as:
-
Minimum Test Collection Problem
-
applications in diagnostic testing
-
Condition Cover Problems
-
for verifying a multi-output feedforward Boolean circuit as a model of
biological pathways
-
String Barcoding Problems
-
in detecting unknown pathogen sequences
-
Minimum Cost Probe Set Problem with a Threshold
-
for minimizing the number of oligonucleotide
probes needed for analyzing populations of ribosomal RNA gene (rDNA)
clones by hybridization experiments on DNA microarrays
-
Protein Sequence Design Problems
-
In protein landscapes analysis, one wishes to analyze
the structures of the space of all protein sequences and their native 3D
structures.
A basic computational problem at this level, called the protein
sequence design (PSD) or the inverse protein folding problem,
is to take a target
3D structure as input and ask for a fittest sequence with respect to
one or more fitness functions of the target structure.
One possible way to attack the problem
by defining a heuristic sequence design (HSD) problem with
the implicit assumption being that a sequence that satisfies the
HSD problem also solves PSD.
-
Substructure Similarity Identification Problems
-
A basic version of these problems is as follows.
Our input consists of two biological structures
S1 and S2 (such as
two DNA sequences) and a similarity function
s  : 
t1
×
t2
®
R
mapping substructures
t1
of
S1
and
t2
of
S2
to a real number.
Our goal is to identify a set of pairs of disjoint
substructures, each pair containing a substructure from each of
the given structures, such that the total similarity is
maximized.
Problems of this nature appear abundantly
in bioinformatics applications such as:
-
Subsequence Similarity Identification Problems
-
S1 and S2 are (DNA or protein) sequences
and a substructure refers to a contiguous subsequence (substring).
Elucidating similarities between sequences is
a fundamental problem in bioinformatics;
algorithmic analysis of these type of problems provide the basis
of development of popular alignment softwares such as
BLAST and PipMaker.
Examples of specific problems of this type that are of interest are:
Consistent Local Alignment Sets:
this problems has applications in genome sequencing approach in which
one compares conserved regions of contigs
of a species
with the assembled sequence of another closely related species
to determine the relative order and orientations of the contigs.
Nonoverlapping Local Alignments:
This is the sequence similarity identification problem with
the added assumption that no two subsequences
t1
and
t'1
of the same input sequence can enclose each other; the assumption
is justified by preprocessing the biological data and
simplifies the complexity of the problem.
This problem has applications in computing local alignments between
sequences.
-
Topology independent protein structural alignment
-
Here each structure is a 3D protein structure.
Identifying structurally similar proteins with different chain
topologies can aid studies in homology modeling, protein folding,
protein design, and protein evolution. These include circular
permuted protein structures, and the more general cases of non-cyclic
permutations between similar structures, which are related by
non-topological rearrangement beyond circular permutation. We present
a method based on an approximation algorithm that finds sequence-order
independent structural alignments that are close to optimal. We
formulate the structural alignment problem as a special case of the
maximum-weight independent set problem, and solve this computationally
intensive problem approximately by iteratively solving relaxations of
a corresponding integer programming problem. The resulting structural
alignment is sequence order independent. Our method is also
insensitive to insertions, deletions, and gaps.
Using a novel
similarity score and a statistical model for significance $p$-value,
we are able to discover previously unknown circular permuted proteins
between nucleoplasmin-core protein and auxin binding protein, between
aspartate rasemase and 3-dehydrogenate dehydralase, as well as between
migration inhibition factor and arginine repressor which involves an
additional strand-swapping. We also report the finding of
non-cyclic permuted protein structures existing in nature between
AML1/core binding factor and ribofalvin synthase. Our method can be
used for large scale alignment of protein structures regardless of the
topology.
The approximation algorithm introduced in this work can find good
solutions for the problem of protein structure alignment.
Furthermore, this algorithm can detect topological differences
between two spatially similar protein structures. The alignment
between MIF and the arginine repressor demonstrates our
algorithm's ability to detect structural similarities even when
spatial rearrangement of structural units has occurred.
The effectiveness of our method is also demonstrated by the discovery
of previously unknown circular permutations.
In addition, we report in this study the finding of a naturally
occurring non-cyclic permuted protein between AML1/Core Binding
Factor chain F and riboflavin synthase chain A.
-
Computing similarities/distances between biological objects
-
I am interested in defining suitable similarity/distance measures
between biological objects and efficiently computing them. Motivations
for looking at this type of problems include validating alternate
experimental/computational approaches and detecting specific evolutionary
events. Some of the problems that I have looked at it in this direction
are as follows.
-
Comparison of Phylogenetic Trees
-
We have looked at designing efficient approximation algorithms for computing
the nearest-neighbour-interchange (NNI) distance including possible extensions
to weighted phylogenies. We also formulated the linear-cost subtree-transfer
distance to model recombination events and designed efficient approximation
algorithms for this.
-
Syntenic distance between multi-chromosome genomes
-
This has applications in finding the number of fissions, fusions and
translocations to transform a multi-chromosome genome to another.
-
Lateral Gene Transfer Problems in Phylogenetic Trees
-
Our goal is to investigate the Lateral Gene Transfer Problems in phylogenies.
For example, one such minimization problem, defined by
Hallet and Lagergren, is that of finding the most parsimonious lateral gene transfer scenario
for a given pair of gene and species trees; we were able to prove some inapproximability
results for this problem.
Systems Biology
Our investigation is concerned with algorithmic, computational complexity
and experimental questions
that arise in the analysis of biological networks. Our goal is to
formulate precise
theoretical problems and propose a set of experiments to validate the
approaches suggested in it.
In particular, we will look at the following aspects:
-
the formulation of decision and optimization problems for the
identification and verification of biological pathways, such as those that may
result from reverse-engineering studies,
-
the formulation of combinatorial optimization problems that arise in the
experimental design for reverse engineering of gene and protein networks,
including the characterization of its computational complexity
and the design of provably efficient deterministic or randomized algorithms,
-
carrying out a set of experiments on a particular biological pathway, to
obtain real biological data on which to apply the algorithms.
-
Reverse Engineering Problems
-
This research is motivated by a central concern of contemporary cell biology,
that of unraveling (or "reverse engineering") the web of interactions among
the components of complex protein and genetic regulatory networks.
Notwithstanding the remarkable progress in genetics and molecular biology in
the sequencing of the genomes of a number of species, the inference and
quantification of interconnections in signaling and genetic networks that are
critical to cell function is still a challenging practical and theoretical
problem.
High-throughput technologies allow the monitoring the expression levels
of sets of genes, and the activity states of signaling proteins, providing
snapshots of the transcriptional and signaling behavior of living cells.
Statistical and machine learning techniques, such as clustering, are often used
in order to group genes into co-expression patterns,
but they are less able to explain functional interactions.
An intrinsic difficulty in capturing such interactions in intact cells by
traditional genetic experiments or pharmacological interventions
is that any perturbation to a particular gene or signaling component may
rapidly propagate throughout the network, causing global changes.
The question thus arises of how to use the observed global changes to derive
interactions between individual nodes.
In this research we investigated
some computational problems that arises in the
context of experimental design for reverse engineering of protein and gene
networks.
Biological networks may have a very large number of species and parameters.
For example, the E. coli transcription network
has 577 interactions involving 116 transcription factors and
419 operons. For such large-scale networks, exhaustive calculations are not practically
possible due to combinatorial explosion and this necessitates the design of
provably efficient approximation algorithms.
-
Inferring Signal Trasduction Networks from Indirect Experimental Evidences
-
We introduce a new method of combined synthesis and inference of biological signal
transduction networks.
A main idea of our method lies in representing observed causal relationships as network paths and using
techniques from combinatorial optimization to find the sparsest graph consistent with all experimental observations.
We formalize our approach, study its computational complexity and prove new results
for exact and approximate solutions of the computationally hard transitive reduction
substep of the approach.
We validate the biological usability of our approach by successfully
applying it to a previously published signal transduction network and
by using it to synthesize and simplify a novel
network corresponding to activation induced cell death in large granular lymphocyte leukemia.
We also show that our algorithm for the transitive reduction substep performs well
on graphs with a structure similar to those observed in transcriptional
regulatory and signal transduction networks.
The NET-SYNTHESIS is freely downloadable from
http://www.cs.uic.edu/~dasgupta/network-synthesis/
.
-
Monotone Decomposition of Biological Networks
-
A useful approach to the mathematical analysis of large-scale biological
networks is based upon their decompositions into monotone dynamical systems.
This research deals with computational problems associated to finding
decompositions which are optimal in an appropriate sense.
In graph-theoretic language, the problems can be recast in terms of
maximal sign-consistent subgraphs.
The theoretical results include polynomial-time approximation algorithms
as well as constant-ratio inapproximability results.
One of the algorithms, which has a worst-case guarantee of 87.9% from
optimality, is based on the semidefinite programming
relaxation approach of Goemans-Williamson.
The algorithm was implemented and tested on a Drosophila
segmentation network and an Epidermal Growth Factor Receptor pathway model,
and it was found to perform close to optimally.
Hybrid Systems (Computational Complexity Issues)
The area of hybrid systems concerns issues of modeling, computation, and
control for systems which combine discrete and continuous components.
I am interested in designing and analyzing efficient algorithms for
hybrid systems. Some specific recent interests are listed below.
-
Piecewise Linear Systems
-
A specific interest of mine inside hybrid systems is
the class of (discrete-time) piecewise linear(PL)
systems.
The subclass of PL systems provides one systematic approach
to discrete-time hybrid systems, naturally blending switching mechanisms with
classical linear components.
PL systems are of interest as controllers as well as identification models,
and they can be thought of as arbitrary interconnections of finite automata
and linear systems.
-
Honey-pot Constrained Searching with Local Sensory Information
-
I have also investigated the problem of searching for a hidden
target in a bounded region of the plane by an autonomous robot which
is only able to use local sensory information. The problem is
naturally formulated in the continuous domain but the solution
proposed is based on an aggregation/refinement approach in which the
continuous search space is partitioned into a finite collection of
regions on which we define a discrete search problem. A solution to
the original problem is obtained through a refinement procedure that
lifts the discrete path into a continuous one.
We show that the discrete optimization is computationally difficult
(NP-hard) but there are computationally efficient approximation
algorithms to solve it. The resulting solution to the continuous
problem is in general not optimal but one can construct bounds to
gauge the cost penalty incurred due to (i) the discretization of the
problem and (ii) the attempt to approximately solve the NP-hard
problem in polynomial time. Numerical simulations show that the
algorithms proposed behave significantly better than naive approaches such
as a random walk or a greedy algorithms.
Approximation Algorithms
In general, I am interested in applications of combinatorial/geometric techniques
in designing and analyzing efficient algorithms for combinatorial, geometric or
graph-theoretic problems with applications to VLSI/CAD, multiprocessor scheduling,
combinatorial auctions, computational geometry, databases etc.