
 
   Research
 
   Here is an overview of my current and past research.
 
  
  • Microbial Genotype-Phenotype Mapping     (Postdoctoral Fellow at the Lawrence Livermore National Laboratory,   August 2006 - present)
     
       Microbial phenotypes typically arise from the concerted action of multiple gene functions, yet the presence of each individual gene may correlate only weakly with the observed phenotype. Hence, it may be more appropriate to examine the co-occurrence between sets of genes and a phenotype instead of the one-to-one relation between single genes and the phenotype. However, the size of the search space grows exponentially with the size of the set. For example, my data set contains about 12,000 unique gene profiles, and extracting sets of genes with a naive exhaustive method takes

     1-COG set  1 second
     2-COG set  1.5 hours
     3-COG set  288 days
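
    The growth in these timings can be roughly reproduced by counting candidate sets. The sketch below assumes about 12,000 gene profiles and a constant cost per candidate set, calibrated so that scoring all single genes takes one second. Both are simplifying assumptions, and the names are illustrative only.

```python
from math import comb

N_PROFILES = 12_000                   # assumed: "about 12,000" unique gene profiles
SECONDS_PER_SET = 1.0 / N_PROFILES    # assumed: all 1-COG sets together take 1 second


def exhaustive_seconds(k, n=N_PROFILES):
    """Estimated time to score every k-gene set at a constant cost per set."""
    return comb(n, k) * SECONDS_PER_SET


for k in (1, 2, 3):
    seconds = exhaustive_seconds(k)
    print(f"{k}-COG sets: {comb(N_PROFILES, k):,} candidates, "
          f"{seconds / 3600:.2f} hours = {seconds / 86400:.2f} days")
```

    Under these assumptions the estimate lands in the same range as the measured values above (hours for pairs, most of a year for triples). The differences reflect the rounded profile count and the constant-cost simplification.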


    A heuristic is therefore necessary to focus on a subset of the most promising candidate sets. I developed an efficient Class Association Rule mining algorithm, NETCAR, to extract sets of genes associated with phenotypes from phylogenetic and phenotype observation profiles. NETCAR takes into account the connectivity graph between items (genes) to restrict the hypothesis space, and uses mutual information to evaluate the biconditional relation between a gene set and a phenotype.
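
    The two ingredients named above, a graph that limits which sets are considered and mutual information as the score, can be illustrated with a small sketch. This is not the NETCAR implementation: the function names, the toy profiles, and the rule "extend a set only with a graph neighbor" are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information in bits between two equal-length binary sequences."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def set_profile(profiles, genes):
    """A genome carries the gene set only if it carries every gene in it."""
    return [int(all(bits)) for bits in zip(*(profiles[g] for g in genes))]


def connected_extensions(gene_set, graph):
    """Candidate supersets: add one gene adjacent to a current member."""
    neighbors = set()
    for gene in gene_set:
        neighbors |= graph.get(gene, set())
    return [gene_set | {g} for g in sorted(neighbors - gene_set)]


# Toy data (hypothetical): presence/absence of three genes in 8 genomes.
profiles = {
    "geneA": [1, 1, 1, 1, 1, 0, 0, 1],
    "geneB": [1, 1, 1, 0, 0, 1, 1, 1],
    "geneC": [0, 1, 0, 1, 0, 1, 0, 1],
}
phenotype = [1, 1, 1, 0, 0, 0, 0, 1]
graph = {"geneA": {"geneB"}, "geneB": {"geneA"}, "geneC": set()}

for candidate in connected_extensions(frozenset({"geneA"}), graph):
    score = mutual_information(set_profile(profiles, sorted(candidate)), phenotype)
    print(sorted(candidate), round(score, 3))
```

    In this toy example neither gene alone matches the phenotype, but the pair geneA and geneB is present exactly in the genomes that show it, so the pair scores higher than either single gene. Because mutual information is symmetric, it rewards sets that are present if and only if the phenotype is present.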

    Based on the sets of genes extracted by NETCAR with respect to microbial phenotypes, we found three topologies of gene module: (a) mixed, (b) star, and (c) one-to-one. The mixed and star type gene modules contain large numbers of genes that cannot be extracted by searching for one-to-one relations alone.


    From left to right: gene interaction networks relevant to (a) the aerobic phenotype, with the mixed type topology, (b) endospore formation, with the star type topology, and (c) motility, with the one-to-one type topology.



    The nodes are genes involved in the top 30 rules. The orange nodes are genes with a strong one-to-one relation, while the green nodes represent genes with a weak one-to-one relation. The size of each node and the width of each edge are proportional to the observation frequencies of the corresponding gene and link in the extracted rules, respectively. The color intensity of each edge indicates the profile similarity between the linked genes.


    NETCAR is applicable to other situations where the number of features (genes) exceeds the number of samples (genomes). This is typical for biological data. For example, NETCAR may be able to mine co-regulatory gene network modules relevant to a target physiological observation, from micro-array data with many more genes than expression arrays.
      
     
     
       References   
     
       Makio Tamura and Patrik D'haeseleer, "Genotype-Phenotype Mapping by Class Association Rule Mining", submitted,  draft paper is available upon request   



     
      
  • Missing Value Expectation of Matrix Data by Fixed Rank Approximation Algorithm     (Master project at the University of Illinois at Chicago,   November 2005 - July 2006)
     
       Microarray experiments give an overview of the on-off switching of gene activities across a series of conditions, such as a time course after a drug dose, consecutive changes in environmental stimulation, or different physiological states such as normal versus cancer cells or different stages of cell development. Since one microarray contains a huge number of spots, there are often missing or unreliable values due to insufficient image resolution, image corruption, or dust and scratches on the plate. Standard statistical analyses of microarray data, such as hierarchical clustering, k-means clustering, support vector machine classification, principal component analysis, or singular value decomposition, cannot be applied directly to data sets with missing values.

    The Fixed Rank Approximation Algorithm (FRAA) [S. Friedland, A. Niknejad, et al.] is a method that predicts missing entries by using eigengenes. FRAA requires the number of dominant eigengenes as an input; however, it is difficult to guess the correct number of eigengenes, that is, the rank of the complete matrix. Therefore, even though FRAA itself is powerful, it is not useful in practical cases. Another drawback of FRAA is that the result depends heavily on the initial tentative values for the missing entries. To deal with these problems, Scanned FRAA (SFRAA) was developed in this project. SFRAA automatically finds the optimal number of top-ranked eigengenes and avoids locally optimal solutions by scanning that number from small to large. SFRAA shows better prediction accuracy than previously reported methods (BPCAimpute and LLSimpute).
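
    The idea of scanning the rank from small to large can be sketched with a simpler relative of FRAA: iterative low-rank imputation, where missing entries are repeatedly replaced by the values of a rank-k approximation. This is not the FRAA or SFRAA algorithm itself. The function names are hypothetical, the rank-k approximation uses plain power iteration, and the rule for choosing the optimal rank is left out because this overview does not describe it.

```python
def rank_k_approx(X, k, iters=100):
    """Approximate the best rank-k matrix via power iteration with deflation."""
    m, n = len(X), len(X[0])
    resid = [row[:] for row in X]
    approx = [[0.0] * n for _ in range(m)]
    for _ in range(k):
        v = [1.0 / (j + 1) for j in range(n)]      # arbitrary start vector
        for _ in range(iters):
            u = [sum(resid[i][j] * v[j] for j in range(n)) for i in range(m)]
            w = [sum(resid[i][j] * u[i] for i in range(m)) for j in range(n)]
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Project the residual onto the dominant right singular direction.
        u = [sum(resid[i][j] * v[j] for j in range(n)) for i in range(m)]
        for i in range(m):
            for j in range(n):
                approx[i][j] += u[i] * v[j]
                resid[i][j] -= u[i] * v[j]
    return approx


def impute_fixed_rank(A, missing, k, start=None, sweeps=50):
    """Fill the entries listed in `missing` so the matrix is close to rank k."""
    X = [row[:] for row in (start if start is not None else A)]
    if start is None:
        for i, j in missing:                       # initial guess: observed row mean
            seen = [A[i][c] for c in range(len(A[i])) if (i, c) not in missing]
            X[i][j] = sum(seen) / len(seen)
    for _ in range(sweeps):
        low_rank = rank_k_approx(X, k)
        for i, j in missing:                       # only missing entries are updated
            X[i][j] = low_rank[i][j]
    return X


def scan_ranks(A, missing, max_rank):
    """Scan the rank from small to large, warm-starting each rank from the last."""
    results, current = {}, None
    for k in range(1, max_rank + 1):
        current = impute_fixed_rank(A, missing, k, start=current)
        results[k] = current
    return results


# Toy matrix (hypothetical): exactly rank 1 if the missing entry equals 4.
A = [[1.0, 2.0, 4.0],
     [2.0, None, 8.0],
     [3.0, 6.0, 12.0]]
missing = {(1, 1)}
completed = scan_ranks(A, missing, max_rank=2)
print(round(completed[1][1][1], 6))
```

    In this toy case the rank-1 pass should bring the missing entry close to 4, the value that makes the matrix exactly rank 1. Warm-starting each rank from the previous result is one way to reduce the dependence on the initial tentative values.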

    To make the SFRAA algorithm available, I implemented it in C# as a Windows application called "Seed". Since a spreadsheet application such as Microsoft Excel is well suited to displaying the numerical values of matrix data, the application is integrated with Microsoft Excel. The user can take advantage of the various analysis methods provided by Microsoft Excel as well as SFRAA. Other prediction algorithms can be added to "Seed" because the prediction code is implemented as an independent module.

      
     
     
       References   
     
       Makio Tamura, "Missing Value Expectation of Matrix Data by Fixed Rank Approximation Algorithm", Master Project at the University of Illinois at Chicago, pdf
       The implementation SFRAA - Seed - is available for the Windows platform. The documentation is included in my MS project paper.   



     
      
  • Proteomics and Data Integration     (Research Assistant at the National Center for Data Mining, University of Illinois at Chicago,   August 2005 - May 2006)
     
       I am developing a peptide/protein identification algorithm for tandem mass spectrum data analysis. A mass spectrometer is an instrument that measures molecular weight and can be applied to detect biological molecules such as proteins. In a biological context, however, a sample is usually a mixture of a very large number of molecules, and simple mass spectrometry does not have enough dimensionality to separate such a mixture. Tandem mass spectrometry is an enhanced method for analyzing such mixed samples. However, data from tandem mass spectrometry are themselves complicated, and computational aid is critical to understanding the results. I am currently applying a new mathematical model to tandem mass spectra and trying to create a new algorithm.

    In addition, as the volume of biological data is growing overwhelmingly, we need a robust system to capture the information relevant to our interests. In this context, data integration for biological data is becoming important. I am also developing a method for integrating biological data from genomics and proteomics as well as non-coding RNA.

      
     
       Some Links:
    The External RNA Controls Consortium: a progress report  Nature Methods 2, 731 - 734 (2005)
    Proteomics' new order  Nature Volume 437, pp169
    Bioinformatics Data Integration  Business Intelligence Network, August 16, 2005
    Mathematics in Biology; Science, February 6, 2004
      
     



     
      
  • Structural Classification of RNA   - SCOR -    (Postdoctoral Fellow at the Lawrence Berkeley National Laboratory,   May 2001 - May 2004)
     
       The Structural Classification of RNA, SCOR, is a database designed to provide a comprehensive perspective and understanding of RNA motif structure, function, tertiary interaction and their relationships.

    The number of RNA structures whose coordinates are available in the Protein Data Bank and the Nucleic Acid Database has been growing rapidly. SCOR was developed to organize this information and make it available to the non-specialist, to discover new features of RNA structure and their relationships to sequence and function, and to enumerate and classify substructures for model building and RNA engineering.




    Classification of the loops with a dinucleotide platform in a triple structure. The left side is a snapshot of the SCOR web interface and the right side is a schematic representation of the classification as a directed acyclic graph. The corresponding structure is indicated by red arrows.


    The classification is represented as a Directed Acyclic Graph (DAG), which allows a classification node to have multiple parents, in contrast to a strictly hierarchical classification. The updated search engine supports three types of query terms: PDB or NDB identifier, nucleotide sequence, and keyword. We also provide parseable XML files for all information. The web interface of SCOR is implemented in JSP and Java.
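
    The difference between a DAG and a strict hierarchy can be shown with a minimal sketch in which each class stores a list of parents. The class labels below are hypothetical and do not reproduce the actual SCOR classification. The point is only that one node can sit under two parents at once.

```python
def ancestors(node, parents):
    """Return every class reachable by following parent links upward."""
    seen = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical classification fragment: the last class has two parents,
# which a strict tree could not express without duplicating the node.
parents = {
    "structural motif": [],
    "internal loop": ["structural motif"],
    "base triple": ["structural motif"],
    "loop with dinucleotide platform in a triple": ["internal loop", "base triple"],
}

print(sorted(ancestors("loop with dinucleotide platform in a triple", parents)))
```

    With this representation, a query for all members of either parent class finds the shared child.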
      
     
     
       References   
     
       Peter S. Klosterman, Makio Tamura, Stephen R. Holbrook and Steven E. Brenner, "SCOR: a Structural Classification of RNA database", Nucleic Acids Research, 2002, 30, 392-394.   
     
       Makio Tamura, Donna K. Hendrix, Peter S. Klosterman, Nancy R. B. Schimmelman, Steven E. Brenner, and Stephen R. Holbrook, "SCOR: Structural Classification of RNA, Version 2.0", Nucleic Acids Research, 2004, 1, 32, 182-184.   
     
       Peter S. Klosterman, Donna K. Hendrix, Makio Tamura, Stephen R. Holbrook, and Steven E. Brenner. 2004. Three-Dimensional Motifs from the SCOR: Structural Classification of RNA Database - Extruded Strands, Base Triples, Tetraloops, and U-turn. Nucleic Acids Res. 32. 2342-2352.   



     
      
  • Ribose Zipper   -   A Tertiary Interaction of RNA   (Postdoctoral Fellow at the Lawrence Berkeley National Laboratory,   May 2001 - May 2004)
     
       The ribose zipper, an important element of RNA tertiary structure, is characterized by consecutive hydrogen bonding interactions between ribose 2'-hydroxyls from different regions of an RNA chain or between RNA chains. The ribose zipper was first recognized as an intermolecular interaction in hammerhead ribozyme crystals and two intramolecular tertiary interactions in the crystal structure of the P4-P6 domain of the group I intron. One ribose zipper mediates the interaction between an adenosine rich bulge and the P4 stem and the other mediates the interaction between the GAAA tetraloop and tetraloop receptor.
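
    The definition above, consecutive hydrogen bonds between 2'-hydroxyls of two chain regions, suggests a simple geometric screen. The sketch below is not the search procedure used in this study: the 3.5 angstrom O2'-O2' distance cutoff, the function names, and the toy coordinates are all assumptions, and a real search would also check hydrogen bonds that involve base atoms.

```python
from math import dist

H_BOND_CUTOFF = 3.5   # angstroms; assumed O2'...O2' distance cutoff


def o2_contacts(region_a, region_b, cutoff=H_BOND_CUTOFF):
    """Residue pairs (i, j) whose O2' atoms lie within the cutoff."""
    return {(i, j)
            for i, p in region_a.items()
            for j, q in region_b.items()
            if dist(p, q) <= cutoff}


def zipper_candidates(region_a, region_b, cutoff=H_BOND_CUTOFF):
    """Pairs of contacts on consecutive residues of both regions."""
    contacts = o2_contacts(region_a, region_b, cutoff)
    found = []
    for i, j in sorted(contacts):
        for step in (-1, 1):          # allow either relative strand direction
            if (i + 1, j + step) in contacts:
                found.append(((i, j), (i + 1, j + step)))
    return found


# Toy O2' coordinates (hypothetical), keyed by residue number.
region_a = {10: (0.0, 0.0, 0.0), 11: (5.0, 0.0, 0.0)}
region_b = {50: (5.0, 3.0, 0.0), 51: (0.0, 3.0, 0.0)}

print(zipper_candidates(region_a, region_b))
```

    In the toy data, residues 10 and 11 each contact one of residues 51 and 50, so the two consecutive contacts are reported together as one candidate.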




    Left figures show ribose zippers of the P4-P6 domain and right shows a ribose zipper in ribosomal RNA.


    We searched for ribose zipper tertiary interactions in the crystal structures of the large ribosomal subunit RNAs of Haloarcula marismortui and Deinococcus radiodurans, and the small ribosomal subunit RNA of Thermus thermophilus and identified a total of 97 ribose zippers. Of these, 20 were found in T. thermophilus 16S rRNA, 44 in H. marismortui 23S rRNA (plus 2 bridging 5S and 23S rRNAs) and 30 in D. radiodurans 23S rRNA (plus 1 bridging 5S and 23S rRNAs).


    from left to right a, b, c, d



    The atoms included in ribose zipper residues are drawn as colored spheres. (a-b) Ribbon drawings of T. thermophilus small ribosomal subunit RNA (16S rRNA) (a) as viewed into the face interacting with 23S rRNA and (b) rotated by 90° about the vertical axis. Each colored region represents an rRNA domain (domain I is lime, domain II is teal, domain III is slate, and domain IV is pink). (c-d) Ribbon drawings of H. marismortui large subunit ribosomal RNA (23S and 5S rRNA) (c) as viewed into the face that interacts with 16S rRNA and (d) rotated by 90° about the vertical axis. The color of the ribbon represents the 23S domains and 5S rRNA (domain I - lime, domain II - teal, domain III - slate, domain IV - pink, domain V - salmon, domain VI - wheat, and 5S rRNA - olive).


    Out of a total of 66 ribose zippers in the small ribosomal subunit of T. thermophilus, and the large ribosomal subunit of H. marismortui, 43 ribose zippers (65.2%) interact with ribosomal proteins. This is especially true for canonical RZs, where 30 (75.0%) of the 40 form hydrogen bonds between the RNA backbone atoms and residues of a neighboring protein or several proteins. Arginine and lysine are the most common protein residues for hydrogen bonding to the ribose zipper backbone, thus providing charge neutralization. As judged from the structure of the large ribosomal subunit, water-mediated RNA-protein hydrogen bonds are much more frequently observed than direct hydrogen bonds. There are also a few cases in which nucleotide base atoms are used for hydrogen bonding with protein.


    from left to right, (a), (b)



    (a) Tube and ribbon drawings of H. marismortui 23S rRNA around a ribose zipper, where residues included in it and its base pair residues on the stem-side (slate) are drawn as sticks, and its neighboring ribosomal proteins L15e, L37e, and L4, which interact with 3L or its base pair residues on the stem-side. (b) Stick and tube diagrams of the ribose zipper and its neighboring residues of L15e, with only the residues used in hydrogen bonding drawn as sticks. Hydrogen bonds are shown as broken blue lines.


    We also find evidence of covariant conservation of the RZ sequences in 16S rRNA, suggesting that RZ mediated tertiary interactions are preserved in evolution.
      
     
     
     
       References   
     
       Makio Tamura and Stephen R. Holbrook, "Sequence and Structural Conservation In RNA Ribose Zippers", Journal of Molecular Biology, 2002, 320, 455-474   
       
       
      Contact  
      Makio Tamura