Thursday, January 8, 2009

Python in Bioinformatics and Chemical Informatics

Hello, my name is Andrew Dalke. I'm a software
consultant for the computational life sciences.
That's a cumbersome phrase, but it's the best I've
been able to come up with. What it means is I
help design and develop software systems which
help computational chemists and biologists do
their research.

I'm going to tell you a story.
Well, several stories all intertwined. One is
the story of high-level languages in the
computational life sciences. Another is the
story of my career and my interests in this
field. A third is information about some of the
available Python resources, and I'll probably
include a few other subplots to keep things
interesting. As with all autobiographical
stories, parts of this one likely are not true.
I haven't done the research and fact checking
to back everything up. But I hope it's at least
entertaining and perhaps enlightening.

In the
beginning there was Fortran. Developed first in the 1950s, it became the predominant programming
language in scientific computing. It was one of
the first high-level languages, as compared to
macro assemblers, and it was designed for
scientists to do numerics, fast. What I mean by
"high-level language" is that it was the first
language to let scientists with relatively
little training in computers do science using a
computer. And it was a wild success. One of the
accounts I read was of one of the original FORTRAN
developers giving a scientist a manual and
within a short time that scientist wrote a
working program. Fortran is still widely used
for some types of scientific programming, which
surprises software people from nearly every
other field. Not only is there nearly 50 years
of working library code, but it still makes
fast code. I chuckle every time I see a paper
talking about how Java approaches the
performance of C++ since I remember papers 10
years ago talking about how with expression
templates C++ performance could approach that
of FORTRAN.

It's hard to say just exactly when
computer programs started playing a role in
understanding chemistry. Before the general
purpose computer, back when you rewired boards,
it was hard to tell the difference between the
program and the computer. It's also hard to
tell without digging up a lot of old papers
because people are usually interested in the
science, and not the software which enables the
science. It's also hard to tell because
chemistry is a very broad field, with many
sub-disciplines. The subfield now known as
computational chemistry started in the late
1950s with people like Pople developing
self-consistent molecular orbital theory, which
uses a computer to find an internally
consistent solution to the quantum description
of a molecule. There was a large enough group
of people that QCPE (the Quantum Chemistry
Program Exchange) started in 1962 as a way for
people to share software for quantum chemistry.

I mentioned high-level languages before. A
language is used to communicate, and most of
the time people think of high-level languages
in programming as a way for someone to compute
with a computer. It's more than that. A
language is a way for people to communicate
with other people. In the case of QCPE this was
Fortran (Fortran II actually, and something
called the Fortran Assembly Program). It's the
earliest example I know of people sharing the
results of their work, and it's hard to imagine
that happening without a common high-level
language. QCPE is still around. You can go to
their site and download source code. Mostly
Fortran but a few others as well. It's a bit
strange though in that they charge based on the
number of lines in the program, with the most
expensive category as 10,000 lines or more.

Another early subfield to use computers was
crystallography. This uses various chemical
tricks to make a crystal out of the protein,
DNA, or whatever else is being studied. This is
placed in an X-ray beam which causes the X-rays
to scatter in a pattern specific to the crystal
structure. By reversing the process
computationally (basically a Fourier transform)
the original structure can be found. Easy to
say but harder in practice. Even if the
structure is well defined and easy to
crystallize there usually isn't enough data in
the scattering to know where all the atoms are.
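
To make that concrete, here's a toy sketch of the idea (my own illustration, not code from any crystallography package): the measured diffraction amplitudes are the magnitudes of a Fourier transform of the electron density, and the phases, which the experiment does not record, are exactly the missing data.

    import numpy as np

    # A toy 1-D "electron density" standing in for the crystal's contents.
    density = np.zeros(64)
    density[[10, 25, 40]] = [8.0, 6.0, 1.0]   # three "atoms" of different weight

    F = np.fft.fft(density)        # structure factors (complex)
    amplitudes = np.abs(F)         # what the X-ray experiment measures
    phases = np.angle(F)           # lost in the experiment: the "phase problem"

    # With both amplitudes and phases, the density comes straight back.
    recovered = np.fft.ifft(amplitudes * np.exp(1j * phases)).real
    assert np.allclose(recovered, density)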

Instead, crystallographers combine that data
with constraints given by the chemistry and
found by experiments to develop an initial
model of the structure and use numerical
methods, structure visualization and intuition
to refine the model.

The programs for doing this
started in the mid-1960s. One early Fortran
program, ORTEP, from 1965, draws a simple
molecular structure along with the
uncertainties in position. It is still
available to this day. I think it's pretty cool
that software older than I am is still being
used. Either that or it says something about
the slow progress in this field. ;)

The
refinement process is complicated. There are
many ways to do it based both on the structure
being studied and the person doing the work.
The term used in science is "protocol" but you
know it as "algorithm." Rather than code each
protocol as a new Fortran program, perhaps
using an extensive software library, several
projects in the late-1970s/early-1980s decided
to embed an interactive domain-specific
language. The most important of these are
CHARMm (1983) and X-PLOR (shortly thereafter).
(I say "are" hesitantly since I know X-PLOR has
been succeeded by CNS.) These scripting
languages made an environment that was easier
to use. Just a few lines of scripting could
replace many lines of Fortran and the
interactivity meant researchers could work with
the data more easily and flexibly. In addition,
these programs came with source code so you
could add your own features if you wanted to.
This is not the same as open source; at best you
could distribute changes to others who had a
license.

Structure visualization also started
in the 1960s. The first work was by Levinthal
using an oscilloscope as the display! Computer
graphics required expensive, specialized
hardware for decades so the programming
language used depended on the computer systems.
By the late 1970s there were a few widely used visualization programs: GRIP, started in 1971 and written in PL/I; Frodo, from 1978; and Bilder, from 1980. Of these, Frodo had the most
effect in part because it was widely shared.
People would use it at one site, like it, work
on it, and bring a copy of the source with them
elsewhere. (On tape naturally.) Software
branched because different sites were doing
different research or had different
visualization hardware. Without the internet
there was no good way to merge branches. Plus,
many people in this field are good enough to
write code for themselves but know it isn't
good enough for others. This hasn't changed.

Frodo ended up highly fragmented. Besides the
normal problems in fragmentation that you know
about, e.g., time wasted on redoing someone
else's work, fragmentation is more worrisome
for science because it's harder to reproduce
someone else's work. You may need to get the
specific version used by the author and it may
be hard to track that down. It's made worse
because people will often forget to list
exactly which version of the software was used,
preferring the canonical literature reference.

Towards the end of the 1980s Alwyn Jones worked
on reconciling the different threads and
produced O, written in C. It's still being
improved and is one of the most widely used
programs for crystallography. Still, if you
want, you can download TurboFrodo.

As you might
have inferred, visualization is a bit different
than the other scientific programs. For a long
time it was dismissed as "making pretty
pictures" akin to what a draftsman does -- and
for the most part it still is. It's relatively
hard to publish, especially as new features are added,
and so people who work on visualization
software are less competitive in the publish or
perish environment of academic life.

The
scientific examples I gave so far were modeling
oriented. Loosely speaking, that means applying
techniques from physical chemistry to
understand how a molecular structure works and
interacts. Given a compound, do something with
it. Another subfield is chemical informatics
which helps find the compound in the first
place. For example, find all published chemical
data about a given compound, list all compounds
in the in-house corporate collection (of
already synthesized drugs) which have a given
substructure, or are similar to some other
compound.

The field of chemical informatics is
quite old. It is closely tied to chemical
notation, which started with Berzelius nearly
200 years ago. The late 1800s saw the rise of paper-based information systems, which moved to
machines in the 1920s and computers in the
1950s after the Univac became available. The
very earliest programs I dug up were in
assembly for the Univac. I haven't yet found
much about the languages used in the 1960s and
1970s but I suspect it was Fortran and PL/I.

I've mentioned PL/I before but didn't talk
about its impact on chemistry. That's because I
don't know it. I suspect it was used where C
would be used a decade or so later, as a fast,
portable language that could handle complex
data structures, as compared to Fortran IV.

See, from a computer science sense, molecular modeling isn't all that complex. Most of the programs are structured like this: a list of atom types, a list of bond types, a list of coordinates, some data table lookups, and code to solve some linear algebra or numeric integration problem.
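
As a sketch of what I mean, in Python terms (toy values; real codes kept these as parallel Fortran arrays):

    atom_types = ["OW", "HW", "HW"]            # a water: one oxygen, two hydrogens
    bonds = [(0, 1), (0, 2)]                   # bonds as pairs of atom indices
    coords = [(0.000, 0.000, 0.000),
              (0.957, 0.000, 0.000),
              (-0.240, 0.927, 0.000)]          # coordinates, in Angstroms
    masses = {"OW": 15.999, "HW": 1.008}       # the "data table lookup"

    # typical use: a simple pass over the parallel lists
    total_mass = sum(masses[atom_type] for atom_type in atom_types)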

Databases are different, because they handle complex record types and deal with graph searches, bitmap tests, and other non-numeric algorithms. Visualization code also handles complex data, has its own set of non-numeric algorithms, and has to manipulate a lot of user-defined state. Fortran is not the right programming language for this.

C started being used in the early 1980s, in
large part because of DEC and the affordable
minicomputer. This allowed a single research
group to have its own computer. Because more
people had computers, it became possible for
people to start commercializing software.
That's not to say there weren't commercial
software systems before then, but they were
rare. The ones I knew about were time-share
based. For example, in the 1960s ISI (the
Institute for Scientific Information, a
for-profit company) indexed the most popular
chemistry journals and provided chemistry and
literature searches. In part because of their
early use of computers to help enter and
maintain all the chemistry data, they kept up
to date with the literature. By comparison, CAS
(the Chemical Abstracts Service) was about
three years behind. In the 1960s ISI could mail
the results of the searches to their customers,
acting as a clipping service (or early form of
RSS), and in 1969 they supported ICI's
(Imperial Chemical Industries) CROSSBOW
(Computerized Retrieval of Organic Structures
Based on Wiswesser). Written in a mixture of
COBOL and assembly, it let people do queries
against their server. In the 1970s they and
other companies provided more powerful clients,
with support for graphical input. But for the
most part these were service-based commercial
systems and not ones distributed as products.

The first application suites started in the
1980s by merging capabilities from many
different subfields (molecular modeling,
chemical informatics, and others) into a
GUI-based end-user program. The first one I
know of was Tripos's Sybyl, which started in 1979 but had a major rewrite in the mid-1980s, when it was converted from Fortran to C. Similar programs included Quanta (1984) and
Insight (1984). For this talk, Tripos is the
most interesting. It included SPL, the Sybyl
Programming Language as part of the rewrite
into C. One of the main developers used to work
at Bell Labs where he learned about embedding
application-specific languages. They figured it was a good idea for prototyping and planned to rewrite the prototype for production. It turned out that SPL was fast enough
and good enough that more and more of Sybyl was
written in SPL and not C. This meant the
company could get more done with fewer people,
which let them stay competitive. Sybyl is still
one of the leading modeling packages, in part
because of the flexibility available from
including SPL.

So far I've talked a lot about
chemistry software but not about biology
software. That's partially because I know the
chemistry software field much better than the
biology one so I can weave a more cohesive
story. It's also because the history of
chemistry software is much better documented
because it's an older field, and that in turn
is because chemistry is much more
mathematically oriented than is most biology.
Still, there are a few interesting observations about how high-level languages are used in biology software in ways that differ from chemistry.
One is GCG, also called the Wisconsin Package.
It is a set of interoperable programs for doing
DNA and protein sequence analysis. Each program
did one thing well and the output from one
program could be used as the input for another.
This let scientists develop new programs using
a shell language (under Unix or VMS). Shell languages are neat because they let you experiment with things interactively and because they let you mix and match programs written in different languages.
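
In Python terms the same pipeline style looks something like this; the tool names here are made up for illustration (GCG's own scripting was done in the shell):

    import subprocess

    # roughly: extract_seqs seqs.fasta | translate | find_motifs > motifs.txt
    extract = subprocess.Popen(["extract_seqs", "seqs.fasta"],
                               stdout=subprocess.PIPE)
    translate = subprocess.Popen(["translate"], stdin=extract.stdout,
                                 stdout=subprocess.PIPE)
    with open("motifs.txt", "wb") as out:
        subprocess.check_call(["find_motifs"], stdin=translate.stdout, stdout=out)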

Okay, I've taken you up through history to the early 1990s. This is a good breaking point for two reasons: it's about when modern high-level languages (like Python) were released, and 1992 is when I first started in this field.

In
retrospect my interest in high-level languages
started early. In high school I wrote a BASIC interpreter in BASIC, and in college my larger programs
usually included an interpreter of some sort.
The lab I worked in did image analysis of
liquid crystal flows. The image tools were a
set of Unix utilities so I learned shell
scripting, and grew to enjoy awk. I worked a
bit on molecular dynamics in 1992 and in grad
school joined a molecular dynamics group in
1993. We developed a parallel molecular
dynamics program called NAMD. It was written in
C++ in part because we didn't know (or want to
know) FORTRAN and in part because the harder
part of distributed computing is data
management, not calculations. We also developed
a visualization program called VRChem, which
became VMD. This was also written in C++ but it
had its own simple scripting language for
doing command-oriented tasks like "load a
molecule" or "rotate around the x axis by 5
degrees".

I had some ideas on making a new
language for molecular modeling but I'm one who
tends to use existing packages rather than make my own, so I looked around a bit first. I used Perl 4 a
lot for CGI scripts and as a replacement for
awk. I loved working in it for simple scripting
but it couldn't handle complex data structures
and it was hard to embed. I looked at Python
but heeded people's cautionary statements about
indentation (boy was I wrong). Tcl's syntax was
a powerful superset of the one we had already
so it was a more natural fit, and it turned out to work very well, especially once I figured out how to make atom selection "objects" and added support for people to display their own graphics. This let the researchers do complicated analysis of their structure data and display the results along with the structure itself. Some of the software they wrote was nearly 1,000 lines long and quite impressive in what it displayed.
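
The selection-as-object idea translates directly into Python terms. Here's a toy version of my own (VMD's Tcl interface looked quite different):

    # A selection is a first-class object you can keep, query, and reuse.
    class AtomSelection:
        def __init__(self, atoms, predicate):
            self.atoms = [atom for atom in atoms if predicate(atom)]

        def center(self):
            # geometric center of the selected atoms
            n = float(len(self.atoms))
            return tuple(sum(atom["xyz"][i] for atom in self.atoms) / n
                         for i in range(3))

    atoms = [{"name": "CA", "resid": 1, "xyz": (0.0, 0.0, 0.0)},
             {"name": "CB", "resid": 1, "xyz": (1.5, 0.0, 0.0)},
             {"name": "CA", "resid": 2, "xyz": (3.0, 1.0, 0.0)}]

    alpha_carbons = AtomSelection(atoms, lambda atom: atom["name"] == "CA")
    print(alpha_carbons.center())    # -> (1.5, 0.5, 0.0)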

We weren't the only ones to use Tcl. Several other visualization packages included an embedded Tcl, including Cerius2, a commercial system. (Though I don't like the way they did atom selections. :)

Tcl wasn't as popular with the software
developers. Oh, it was great for file parsing
but it was slow and didn't have namespaces
(this was in the pre-8.0 days) and didn't
really handle complex data structures -- we had
to use a third-party package to support real dictionaries! This meant there were two classes of users: "researchers" who did things in Tcl,
and "programmers" who did things in C++. I
wanted a language which was more scalable than
that.

I was reintroduced to Python by Jim Phillips in, I believe, 1996. After reading
through the documentation I decided that that
was the language I wanted to use. It had all
sorts of useful things for programmers -- real
data structures, a large number of libraries,
support for C extensions -- and looked simple
enough for computational scientists. But it
would take a couple of years and two changes of
fields before I got a chance to use it.

I
worked for a while at a company called MAG
("Molecular Applications Group"). I helped
develop a web-based bioinformatics server in
1997/1998. We wanted to sell this as a product
which meant we had to do it in Perl, because
that's what bioinformatics people used.

Why
Perl? Think back to the early days of the web,
now more than 10 years ago. Interactive web
pages could only be done with CGI. The old NCSA
server (hoohoo.ncsa.uiuc.edu) had a dozen
examples of writing CGI programs in different
languages and the easiest to figure out was
Perl. Indeed, that's how I learned about Perl.
There was a third-party package called
cgi-lib.pl from Steven Brenner and nearly
everyone doing CGI scripts in Perl used this
library. What's interesting is that Steven is
in bioinformatics. Just as interesting, when
Perl5 came out Lincoln Stein developed a
replacement package for cgi-lib based on
Perl5's new module system. And Lincoln is also
in bioinformatics. The most popular CGI
libraries for Perl were both developed by
bioinformatics people!

I think the main reason is that bioinformatics has a long
history of sharing data, even back to the 1920s
with the early work in Drosophila. Sharing was
needed to do good science. It took a lot of
work to figure out the genome and no one lab
could do it all or come up with all the ideas.
It's hard to make money from sequence data. We
have the human genome now, which means that with a good target and half a billion dollars of research we might have a new drug. So nearly
all of the bioinformatics data and even most of
the software is freely available, funded in
large part by taxes and research grants.
Compare this to the pharmaceutical datasets
used in chemical informatics where the tie
between the data and making money is much
clearer. In that field only about 5% of the
data is public. The rest is in proprietary
databases, either commercial or in-house and
very few programs are freely available.

Perl
specifically was picked because bioinformatics
got its start in the 1980s managing early
sequence databases, which were frequently
developed on Suns, which were cheap and came
with the Sybase database. Many of the early
people had Unix experience, which fits quite
naturally with Perl. I also think Perl worked
because of its strong support for strings, string editing, and regular expressions. These are core parts of bioinformatics, where
sequences are represented as strings. Early
bioinformatics work didn't need complex data
structures like trees so Perl's limitations
weren't that big a problem. (See also Lincoln
Stein's How Perl Saved the Human Genome
Project.)

Steven started Bioperl in the
mid-1990s as a way for people to share their
Perl programs. Perl5 came out in 1994 with support for modules and complex data types, and within a few years of that Bioperl became a set
of Perl libraries for doing bioinformatics.
Bioperl had their first meeting (where the
different developers got together) in 1998 and
Biopython joined them in 1999, which was the
basis for the Open Bioinformatics Foundation.

Remember Python? This talk is about Python. We
left off with me working on a bioinformatics
server written in Perl, because that's what our
potential customers were using. I wrote
something like 8,000 lines of Perl over 6
months. It was clean code and it had a lot of
functionality. But I had a hard time explaining
the result to non-programmers and even to
programmers. I managed to convince Jeff Chang,
a co-worker, that Python was a better language
than Perl. He was working on an in-house project where he got to try it out. He agreed with me
and has stayed with Python since then.

At the
Objects in Bioinformatics conference in 1999 I
held a birds-of-a-feather meeting on high-level
languages in bioinformatics. At it, Jeff and I
decided to start Biopython along the same lines
as Bioperl. Rather than duplicate effort, we
asked the Bioperl people if we could work with
them and use their servers. They agreed, and
the BioJava folk followed soon afterwards.
Together we decided to start the Open
Bioinformatics Foundation, a non-profit company
to manage the computer systems needed by the
different groups. We also have a meeting every
year, BOSC, which is a satellite conference of
ISMB, the big yearly bioinformatics conference.
The OBF mostly hosts infrastructure projects: libraries for the different languages, standard object models like XML and CORBA, and an SQL schema.

At about the same time, Jeff
Bizzaro started Bioinformatics.org which is
more along the lines of SourceForge and is mostly aimed at applications and small
packages. So if you're looking for packages for
bioinformatics you should try out those two
sites first.

Biopython, Bioperl, etc. are libraries. This style of programming is not
new. The earliest commercial success I know is
the Daylight toolkit, a set of libraries for
chemical informatics which started some 20
years ago.

After I left MAG I worked for a
while at Bioreason which developed data mining
tools for analyzing high-throughput chemical screens. We used Daylight as our basis for
understanding the chemistry. It was convenient
in part because Daylight and Bioreason are both
located in Santa Fe, so it was easy to get help
and report bugs. Daylight sells toolkits with C
and Fortran bindings. For several years I had wanted to do Python, so I looked into ways
to make Daylight's C API accessible from
Python. Luckily, Dave Beazley wrote SWIG and
Roger Critchlow wrote DaySWIG. But the result
still felt like C. So instead I wrote PyDaylight, a thick wrapper around the toolkit which feels like Python. It handles garbage collection, iterators, attributes, and exceptions, all in a very object-oriented fashion. This let us do our work a lot faster and let more people in the company develop new code with fewer errors.
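
To give the flavor of a "thick" wrapper, here's a minimal sketch with invented names, not the real PyDaylight API, assuming a flat SWIG-generated binding called dayswig underneath:

    import dayswig   # hypothetical low-level SWIG binding to the C toolkit

    class Molecule:
        def __init__(self, smiles):
            self._handle = dayswig.dt_smilin(smiles)
            if not self._handle:
                # a C-style NULL return becomes a Python exception
                raise ValueError("could not parse SMILES: %r" % (smiles,))

        def __del__(self):
            # garbage collection frees the underlying C object
            dayswig.dt_dealloc(self._handle)

        @property
        def atoms(self):
            # attribute access instead of a dt_atoms(mol) function call
            return list(dayswig.dt_atoms(self._handle))

        def __iter__(self):
            # so researchers can write: for atom in mol
            return iter(self.atoms)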

We released PyDaylight under the LGPL in 1998 and presented it formally in 1999. I wrote a Dr. Dobb's article about it too. I'm rather happy about that program because it's the basis of the consulting work I've been doing for the last 4 years.

Vertex
Pharmaceuticals was one of my first clients. I
had convinced Pat Walters there to take a look
at Python, especially with PyDaylight. He was a
big Perl programmer but decided to switch over
to Python and now Vertex is a big Python (and
Java) shop.

PyDaylight depends on a commercial
toolkit. Brian Kelly, who worked with me at
Bioreason, has developed FROWNS, a free version
of the PyDaylight API.

There are also toolkits
for Python for molecular modeling. The earliest
is MMTK from Konrad Hinsen which is mostly for
macromolecules. Another company in Santa Fe,
OpenEye, developed first OELib (now Open Babel) and then OEChem, an extensive set of libraries supporting both small-molecule and large-molecule modeling, e.g., for doing docking. Bob Tolbert
developed Python bindings for OELib which
persuaded OpenEye to hire him and have him
develop the Python bindings for OEChem. I like
to think I helped with that persuasion.

As for
visualization there are many choices. VMD now
also supports Python, and UCSF's Chimera is
written in a combination of Python and C++ as
is Warren DeLano's PyMol. PyMol is currently the best open source molecular modeling package.
Michel Sanner also has a complete set of data
flow tools for molecular visualization. (BTW,
Michel's been pushing for a common Pythonic API
for molecules and ran a conference on this
topic last fall.) There's also PyQuante for
doing quantum chemistry, I've heard about
in-house work at Tripos doing programs with
Python, and I know various pharmas and biotechs
are using Python for in-house work. So by
comparison to bioinformatics, where Perl is
king, Python is well on the road to being the
high-level programming language for
computational chemistry.

There are a few other
high-level languages for the life sciences
which I'll mention. Lisp is used at a few
places. EcoCyc is one, mostly I think because
it was started by someone who's been using Lisp
for decades. Like I said earlier, the programming for these fields is mostly boring from a computer science sense. The problems almost never need the cool macro solutions that set Lisp apart from most other languages.
There are also some people doing Smalltalk,
including a bioSqueak. Baylor College of
Medicine is the only place I know using
Smalltalk. Klotho used Prolog as a declarative
language for molecular structure and assembly.
And there's SVL, the Scientific Vector Language
used by CCG's MOE. It's a non-OO imperative
language with special scalar types like
molecule as well as support for
higher-dimensional arrays.

SVL points out some of the things that Python does not do well. I and others would really like a built-in multidimensional vector data type, including C-speed numerical operations. I hope Numeric/numarray becomes that some day, but I know it will still be a while.
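
The appeal is whole-array arithmetic at C speed, without a Python-level loop per atom. A small sketch (using today's numpy for illustration; Numeric's spelling differed in details):

    import numpy as np

    coords = np.array([[0.000, 0.000, 0.000],
                       [0.957, 0.000, 0.000],
                       [-0.240, 0.927, 0.000]])   # an N x 3 coordinate array

    center = coords.mean(axis=0)      # one C-speed reduction
    shifted = coords - center         # broadcast subtraction, no explicit loop
    distances = np.sqrt((shifted ** 2).sum(axis=1))   # per-atom distance from center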

A good built-in numeric package, including statistics, would make Python more useful for gene expression analysis; currently many of those analyses are done in R.

Python doesn't take real advantage
of multiple processors, because of the global
interpreter lock. I know about some ways around
it, but they are kludges. I want some way to run Python code in a sandbox so a researcher can download and run someone else's analysis code off the web without worrying about the
evil it might do. PyPy should help with this.

One of the biggest complaints I get about
Python is that it doesn't have a good standard
cross-platform GUI framework. That's the main
reason people still use Java. I don't
understand that one because the complaint is
that Tkinter doesn't use native widgets but I
thought Java didn't either. But I'm not a GUI
person.

And finally, something that Warren first
explained to me is that Python doesn't make for
a good shell or command language. "ls -l *.txt"
is lengthy to write in Python and "rotate x 5
degrees" is easier to explain than
"viewMatrix.rotateX(5)". His PyMol has a dual
mode interface, similar to the interactive Python used by SciPy, that can distinguish between a shell-like command and a Python statement. The fundamental issue here is that basic Tcl is still easier for a few things than basic Python. I would like to experiment some to see how much of a problem this is.
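
A toy version of that dual-mode dispatch (my sketch, not PyMol's actual code):

    # Lines whose first word is a known command are dispatched shell-style;
    # everything else runs as ordinary Python.
    def rotate(axis, angle):
        print("rotate %s by %s degrees" % (axis, angle))

    COMMANDS = {"rotate": rotate}

    def interpret(line, namespace):
        words = line.split()
        if words and words[0] in COMMANDS:
            COMMANDS[words[0]](*words[1:])   # "rotate x 5" -> rotate("x", "5")
        else:
            exec(line, namespace)            # "x = 3 * 7" is plain Python

    namespace = {}
    interpret("rotate x 5", namespace)   # command mode
    interpret("x = 3 * 7", namespace)    # Python mode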

There you have it: a summary of the last 50 years of high-level languages in the computational life sciences, and how Python fits into that history. You got a lot of information in the last 30 minutes; I hope you found it interesting, or at least useful. Any questions?
