# Bioinformatics/Computational Biology/Systems Biology/Approximation Algorithms

My broad research goals in these areas is to develop new combinatorial/geometric algorithmic techniques as well as to use existing such techniques to design efficient exact or approximation algorithms for computationally challenging problems. My research plans include designing provably efficient algorithms for these bioinformatics problems, as well as implementing these algorithms and testing them again benchmark data. Advantages of using combinatorial or geometric algorithmic approaches for the computationally difficult problems are plenty. Typically combinatorial or geometric algorithms are simple to implement and thus run faster; it is the proof that the algorithm is correct as well as proof of guarantees of its performance or resource bounds that is often non-trivial. Unlike many other approaches, resource and performance bounds (such as worst-case or average case bounds for deterministic algorithms and expected case bounds for randomized algorithms) for these algorithms is guaranteed. Moreover, a combinatorial or geometric interpretation of the problem often provides more intuition about the nature of the problem such as the structure of the optimal solution or the effect of errors in the input data on the optimal solution.

Below are only some specific problems/projects that I have investigated.

## Bioinformatics/Computational Biology

Test Set Problems
In the most basic versions of the test set problems, we are given a universe of objects, family of subsets ("tests") of the universe and a notion of "distinguishability" of pairs of elements of the universe by a collection of these tests. Our goal is to select a subset of these tests of minimum size or cost that distinguishes every pair of elements of the universe. Such combinatorial formulations allow us to investigate, at one shot, several important problems in bioinformatics such as:
Minimum Test Collection Problem
applications in diagnostic testing
Condition Cover Problems
for verifying a multi-output feedforward Boolean circuit as a model of biological pathways
String Barcoding Problems
in detecting unknown pathogen sequences
Minimum Cost Probe Set Problem with a Threshold
for minimizing the number of oligonucleotide probes needed for analyzing populations of ribosomal RNA gene (rDNA) clones by hybridization experiments on DNA microarrays
Protein Sequence Design Problems
In protein landscapes analysis, one wishes to analyze the structures of the space of all protein sequences and their native 3D structures. A basic computational problem at this level, called the protein sequence design (PSD) or the inverse protein folding problem, is to take a target 3D structure as input and ask for a fittest sequence with respect to one or more fitness functions of the target structure. One possible way to attack the problem by defining a heuristic sequence design (HSD) problem with the implicit assumption being that a sequence that satisfies the HSD problem also solves PSD.
Substructure Similarity Identification Problems
A basic version of these problems is as follows. Our input consists of two biological structures S1 and S2 (such as two DNA sequences) and a similarity function s  :  t1 × t2 ® R mapping substructures t1 of S1 and t2 of S2 to a real number. Our goal is to identify a set of pairs of disjoint substructures, each pair containing a substructure from each of the given structures, such that the total similarity is maximized. Problems of this nature appear abundantly in bioinformatics applications such as:
Subsequence Similarity Identification Problems
S1 and S2 are (DNA or protein) sequences and a substructure refers to a contiguous subsequence (substring). Elucidating similarities between sequences is a fundamental problem in bioinformatics; algorithmic analysis of these type of problems provide the basis of development of popular alignment softwares such as BLAST and PipMaker. Examples of specific problems of this type that are of interest are:
Consistent Local Alignment Sets: this problems has applications in genome sequencing approach in which one compares conserved regions of contigs of a species with the assembled sequence of another closely related species to determine the relative order and orientations of the contigs.
Nonoverlapping Local Alignments: This is the sequence similarity identification problem with the added assumption that no two subsequences t1 and t'1 of the same input sequence can enclose each other; the assumption is justified by preprocessing the biological data and simplifies the complexity of the problem. This problem has applications in computing local alignments between sequences.
Topology independent protein structural alignment
Here each structure is a 3D protein structure. Identifying structurally similar proteins with different chain topologies can aid studies in homology modeling, protein folding, protein design, and protein evolution. These include circular permuted protein structures, and the more general cases of non-cyclic permutations between similar structures, which are related by non-topological rearrangement beyond circular permutation. We present a method based on an approximation algorithm that finds sequence-order independent structural alignments that are close to optimal. We formulate the structural alignment problem as a special case of the maximum-weight independent set problem, and solve this computationally intensive problem approximately by iteratively solving relaxations of a corresponding integer programming problem. The resulting structural alignment is sequence order independent. Our method is also insensitive to insertions, deletions, and gaps. Using a novel similarity score and a statistical model for significance \$p\$-value, we are able to discover previously unknown circular permuted proteins between nucleoplasmin-core protein and auxin binding protein, between aspartate rasemase and 3-dehydrogenate dehydralase, as well as between migration inhibition factor and arginine repressor which involves an additional strand-swapping. We also report the finding of non-cyclic permuted protein structures existing in nature between AML1/core binding factor and ribofalvin synthase. Our method can be used for large scale alignment of protein structures regardless of the topology. The approximation algorithm introduced in this work can find good solutions for the problem of protein structure alignment. Furthermore, this algorithm can detect topological differences between two spatially similar protein structures. The alignment between MIF and the arginine repressor demonstrates our algorithm's ability to detect structural similarities even when spatial rearrangement of structural units has occurred. The effectiveness of our method is also demonstrated by the discovery of previously unknown circular permutations. In addition, we report in this study the finding of a naturally occurring non-cyclic permuted protein between AML1/Core Binding Factor chain F and riboflavin synthase chain A.
Computing similarities/distances between biological objects
I am interested in defining suitable similarity/distance measures between biological objects and efficiently computing them. Motivations for looking at this type of problems include validating alternate experimental/computational approaches and detecting specific evolutionary events. Some of the problems that I have looked at it in this direction are as follows.
Comparison of Phylogenetic Trees
We have looked at designing efficient approximation algorithms for computing the nearest-neighbour-interchange (NNI) distance including possible extensions to weighted phylogenies. We also formulated the linear-cost subtree-transfer distance to model recombination events and designed efficient approximation algorithms for this.
Syntenic distance between multi-chromosome genomes
This has applications in finding the number of fissions, fusions and translocations to transform a multi-chromosome genome to another.
Lateral Gene Transfer Problems in Phylogenetic Trees
Our goal is to investigate the Lateral Gene Transfer Problems in phylogenies. For example, one such minimization problem, defined by Hallet and Lagergren, is that of finding the most parsimonious lateral gene transfer scenario for a given pair of gene and species trees; we were able to prove some inapproximability results for this problem.

## Systems Biology

Our investigation is concerned with algorithmic, computational complexity and experimental questions that arise in the analysis of biological networks. Our goal is to formulate precise theoretical problems and propose a set of experiments to validate the approaches suggested in it. In particular, we will look at the following aspects:

• the formulation of decision and optimization problems for the identification and verification of biological pathways, such as those that may result from reverse-engineering studies,
• the formulation of combinatorial optimization problems that arise in the experimental design for reverse engineering of gene and protein networks, including the characterization of its computational complexity and the design of provably efficient deterministic or randomized algorithms,
• carrying out a set of experiments on a particular biological pathway, to obtain real biological data on which to apply the algorithms.
Reverse Engineering Problems
This research is motivated by a central concern of contemporary cell biology, that of unraveling (or "reverse engineering") the web of interactions among the components of complex protein and genetic regulatory networks. Notwithstanding the remarkable progress in genetics and molecular biology in the sequencing of the genomes of a number of species, the inference and quantification of interconnections in signaling and genetic networks that are critical to cell function is still a challenging practical and theoretical problem. High-throughput technologies allow the monitoring the expression levels of sets of genes, and the activity states of signaling proteins, providing snapshots of the transcriptional and signaling behavior of living cells. Statistical and machine learning techniques, such as clustering, are often used in order to group genes into co-expression patterns, but they are less able to explain functional interactions. An intrinsic difficulty in capturing such interactions in intact cells by traditional genetic experiments or pharmacological interventions is that any perturbation to a particular gene or signaling component may rapidly propagate throughout the network, causing global changes. The question thus arises of how to use the observed global changes to derive interactions between individual nodes. In this research we investigated some computational problems that arises in the context of experimental design for reverse engineering of protein and gene networks. Biological networks may have a very large number of species and parameters. For example, the E. coli transcription network has 577 interactions involving 116 transcription factors and 419 operons. For such large-scale networks, exhaustive calculations are not practically possible due to combinatorial explosion and this necessitates the design of provably efficient approximation algorithms.
Inferring Signal Trasduction Networks from Indirect Experimental Evidences
We introduce a new method of combined synthesis and inference of biological signal transduction networks. A main idea of our method lies in representing observed causal relationships as network paths and using techniques from combinatorial optimization to find the sparsest graph consistent with all experimental observations. We formalize our approach, study its computational complexity and prove new results for exact and approximate solutions of the computationally hard transitive reduction substep of the approach. We validate the biological usability of our approach by successfully applying it to a previously published signal transduction network and by using it to synthesize and simplify a novel network corresponding to activation induced cell death in large granular lymphocyte leukemia. We also show that our algorithm for the transitive reduction substep performs well on graphs with a structure similar to those observed in transcriptional regulatory and signal transduction networks. The NET-SYNTHESIS is freely downloadable from http://www.cs.uic.edu/~dasgupta/network-synthesis/ .
Monotone Decomposition of Biological Networks
A useful approach to the mathematical analysis of large-scale biological networks is based upon their decompositions into monotone dynamical systems. This research deals with computational problems associated to finding decompositions which are optimal in an appropriate sense. In graph-theoretic language, the problems can be recast in terms of maximal sign-consistent subgraphs. The theoretical results include polynomial-time approximation algorithms as well as constant-ratio inapproximability results. One of the algorithms, which has a worst-case guarantee of 87.9% from optimality, is based on the semidefinite programming relaxation approach of Goemans-Williamson. The algorithm was implemented and tested on a Drosophila segmentation network and an Epidermal Growth Factor Receptor pathway model, and it was found to perform close to optimally.

## Hybrid Systems (Computational Complexity Issues)

The area of hybrid systems concerns issues of modeling, computation, and control for systems which combine discrete and continuous components. I am interested in designing and analyzing efficient algorithms for hybrid systems. Some specific recent interests are listed below.

Piecewise Linear Systems
A specific interest of mine inside hybrid systems is the class of (discrete-time) piecewise linear(PL) systems. The subclass of PL systems provides one systematic approach to discrete-time hybrid systems, naturally blending switching mechanisms with classical linear components. PL systems are of interest as controllers as well as identification models, and they can be thought of as arbitrary interconnections of finite automata and linear systems.
Honey-pot Constrained Searching with Local Sensory Information
I have also investigated the problem of searching for a hidden target in a bounded region of the plane by an autonomous robot which is only able to use local sensory information. The problem is naturally formulated in the continuous domain but the solution proposed is based on an aggregation/refinement approach in which the continuous search space is partitioned into a finite collection of regions on which we define a discrete search problem. A solution to the original problem is obtained through a refinement procedure that lifts the discrete path into a continuous one. We show that the discrete optimization is computationally difficult (NP-hard) but there are computationally efficient approximation algorithms to solve it. The resulting solution to the continuous problem is in general not optimal but one can construct bounds to gauge the cost penalty incurred due to (i) the discretization of the problem and (ii) the attempt to approximately solve the NP-hard problem in polynomial time. Numerical simulations show that the algorithms proposed behave significantly better than naive approaches such as a random walk or a greedy algorithms.

## Approximation Algorithms

In general, I am interested in applications of combinatorial/geometric techniques in designing and analyzing efficient algorithms for combinatorial, geometric or graph-theoretic problems with applications to VLSI/CAD, multiprocessor scheduling, combinatorial auctions, computational geometry, databases etc.