CLUE: A GUI for cluster browsing
The following are selected overheads from my presentation
at MUG96, February 27, 1996. Yes, they are terse: they are
intended as a reminder for those who attended the meeting rather than
as a full explanation of CLUE.
CLUE = CLUster Examiner
- Clustering run produces output files.
- Perl script collects data from files and
  creates a gdbm (GNU) database file.
- CLUE reads gdbm file and allows user to:
  - list clusters and compounds.
  - search clusters for compound.
  - sort clusters by variance, size, etc.
  - sort compounds by size, logP, etc.
  - view and print compound depictions.
  - edit arbitrary groups of compounds;
    groups can be stored in database.
  - find cross reference list of all clusters
    which contain a compound.
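
The sort, search, and cross-reference operations listed above can be pictured with a minimal Python sketch. This is not CLUE's own Tcl-tk code. The dictionary layout and all sample values are assumptions invented for illustration; only the field names are borrowed from the cluster records described under "Contents of gdbm database".

```python
# Hypothetical in-memory cluster records; all values are invented for illustration.
clusters = {
    "CL1": {"NCompounds": 3, "averageVariance": 0.21,
            "compoundIndices": ["CP1", "CP2", "CP3"]},
    "CL2": {"NCompounds": 2, "averageVariance": 0.05,
            "compoundIndices": ["CP2", "CP4"]},
    "CL3": {"NCompounds": 1, "averageVariance": 0.00,
            "compoundIndices": ["CP5"]},
}


def sort_clusters(records, field, descending=False):
    """Return cluster keys ordered by one numeric field."""
    return sorted(records, key=lambda key: records[key][field], reverse=descending)


def clusters_containing(records, compound):
    """Cross-reference list: every cluster whose members include the compound."""
    return [key for key in sorted(records)
            if compound in records[key]["compoundIndices"]]


by_size = sort_clusters(clusters, "NCompounds", descending=True)
by_variance = sort_clusters(clusters, "averageVariance")
xref = clusters_containing(clusters, "CP2")
```

Scanning every cluster for a compound, as `clusters_containing` does here, is the simple approach; storing the answer with each compound record avoids the scan.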
Design goals
- Do better than dumpcluster, statcluster,
  print_cmpds, singleton_list, etc.
- Provide a friendly GUI powerful enough
  for experienced users and simple enough
  for occasional users.
- Quick (interactive) response.
- Handle large systems smoothly
  (10,000 clusters and 200,000 compounds).
- Collect all relevant output data into
  one file.
Design decisions
- Use Tcl-tk to make creation and maintenance
  of GUI quicker, easier, and more reliable.
  Tcl-tk is freely available for Unix, Mac and
  Windows. There is a gdbm version available.
- Use Perl to collect and parse information from
  relevant output files. Perl is a powerful, stable
  file and text manipulation language available
  at no cost. There is a Perl+gdbm version.
- Use GNU's gdbm to store the output information
  in a single file. gdbm does not have the size
  limitations of standard Unix dbm or ndbm.
  gdbm is available from GNU for free.
- Use Daylight's PRADO to create PostScript files
  of compound depictions. Use GNU's ghostview
  to preview and control printing of
  these files.
Design compromises
Precompute and store some data in the
database. This increases the size of the database
and the CLUE startup (database load) time,
but speeds up user interaction.
For example (times on a 36 MHz Iris 4D/35):
Nclus/Ncomp       Size (MB)   Create (sec)   Load (sec)
3/24                  0.007   xx                      5
23/500                0.083   xxx                     7
368/2,825             1.006   xxx                    23
1,753/39,468          8.610   xxx                   140
12,834/198,637       45.717   xxx                   370
Once any database is loaded, time to navigate
(scroll, etc.) is constant. Some operations
(load compounds, depict) depend on individual
cluster size.
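
The compromise above, spending database size and creation time to buy interactive speed, can be illustrated with a small Python sketch. It assumes, hypothetically, that one of the precomputed items is the list of clusters each compound belongs to (the clusterIndices field of a compound record). The function name and the sample data are invented, and this is not the actual Perl collection script.

```python
from collections import defaultdict


def build_cluster_index(cluster_members):
    """Invert cluster -> compounds into compound -> clusters, once, at creation time."""
    index = defaultdict(list)
    for cluster in sorted(cluster_members):
        for compound in cluster_members[cluster]:
            index[compound].append(cluster)
    return dict(index)


# Invented sample data: two clusters sharing one compound.
cluster_members = {"CL1": ["CP1", "CP2"], "CL2": ["CP2", "CP3"]}

# Done once when the database is created; the result is stored with the data.
cluster_indices = build_cluster_index(cluster_members)

# During interaction, a lookup is a single dictionary access instead of a scan.
hits = cluster_indices.get("CP2", [])
```

The stored index duplicates information already present in the cluster records, which is where the extra database size comes from.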
Contents of gdbm database
- Experiment record ("EX")
experimentName, Ncompounds, Nclusters,
clusteringMethod, NcompoundsInClusters,
Nsingletons, averageClusterSize, text
- Cluster records (clusterIndex, e.g. "CL42")
NCompounds, Ncrisp, Nfuzzy, Nskeletons,
averageVariance, overallVariance, centroid,
totalBitsInFingerprint, BKSbits, NBKSreps,
compoundIndices, variances, inventoryIndex
- Compound records (compoundIndex, e.g. "CP796")
name, SMILES, molecularWeight, clogP,
AndrewsBinding, inventoryIndex, totalClusters,
clusterIndices, totalCentroid, centroidIndices,
totalSkeleton, skeletonIndices, totalBKS,
BKSIndices, nearest20Neighbors
- Group records (groupName, e.g. "Sngl")
Ncompounds, compoundIndices, referenceString,
inventoryIndex
- Index records ("IX")
compoundNames
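
The key scheme above ("EX", "CL42", "CP796", a group name, "IX") maps naturally onto a single key-value file. The following Python sketch shows the idea under stated assumptions: it uses the standard library's `dbm.dumb` module as a stand-in for gdbm, it packs each record's fields as JSON (the overheads do not say how CLUE serializes fields), and every stored value is invented.

```python
import dbm.dumb
import json
import os
import tempfile


def cluster_key(n):
    """Key for a cluster record, e.g. 42 -> "CL42"."""
    return f"CL{n}"


def compound_key(n):
    """Key for a compound record, e.g. 796 -> "CP796"."""
    return f"CP{n}"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clue_sketch")
    with dbm.dumb.open(path, "c") as db:
        # One record per key; fields are packed as JSON (an assumption).
        db["EX"] = json.dumps({"experimentName": "demo", "Nclusters": 1})
        db[cluster_key(42)] = json.dumps({"NCompounds": 1,
                                          "compoundIndices": ["CP796"]})
        db[compound_key(796)] = json.dumps({"name": "example",
                                            "clusterIndices": ["CL42"]})
        db["Sngl"] = json.dumps({"Ncompounds": 0, "compoundIndices": []})
        db["IX"] = json.dumps({"compoundNames": {"example": "CP796"}})

        # Read back: keys come back as bytes, values as JSON bytes.
        stored_keys = sorted(key.decode() for key in db.keys())
        cluster = json.loads(db[cluster_key(42)])
```

Because record types are distinguished only by key prefix, every kind of record can live in the same file, which fits the design goal of collecting all output data in one place.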
Future work
- Include 2D depictions in database, or
  compute depiction on demand for
  quicker response and more interactivity
  than ghostscript provides. PRADO is
  quick, but PostScript files are big and
  slow to interpret.
- Include ability to set up new clustering runs
  with subsets of clusters and compounds.
- Provide general mechanism to interface
  to external programs, for example
  smi2tdt, smi2tanmat.
- Invent and implement new graphical
  representations of interesting clustering
  statistics.
- Extend group paste operation to include
  Boolean operations on compounds.
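
One way to picture Boolean operations on groups is as set operations over compound indices. The Python sketch below is only an illustration of that idea, not a description of how CLUE would implement it; the function name, the operator names, and the sample groups are all hypothetical.

```python
def combine_groups(first, second, op):
    """Combine two groups of compound indices with a Boolean operation."""
    operations = {
        "and": set.intersection,          # compounds in both groups
        "or": set.union,                  # compounds in either group
        "not": set.difference,            # in the first group but not the second
        "xor": set.symmetric_difference,  # in exactly one of the groups
    }
    return sorted(operations[op](set(first), set(second)))


# Invented sample groups.
group_a = ["CP1", "CP2", "CP3"]
group_b = ["CP2", "CP3", "CP4"]

in_both = combine_groups(group_a, group_b, "and")
in_either = combine_groups(group_a, group_b, "or")
only_a = combine_groups(group_a, group_b, "not")
```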
If you have any comments or inquiries, please contact me at
tjo@acm.org