Information integration in bioinformatics

The gold standard for information integration in bioinformatics is still the SRS platform of Lion Bioscience, publicly accessible at the European Bioinformatics Institute (EBI) [28]. The developers take particular pride in the flexibility of the tool, which facilitates the straightforward integration of local databases, a feature responsible for the commercial success of this technology far beyond its application in bioinformatics. Although SRS is still available to the academic community free of charge, the number of installations is surprisingly low. This speaks for the quality of the EBI installation but, to our understanding, also for the extra maintenance burden such a system imposes. A companion tool, Prisma, which addresses automated updates, is not free.

The problem of maintenance and coherence across sites is particularly obvious in the myGrid effort [21]. Workflows that combine data repositories with their analysis across multiple tools and sites are mostly static. A machine that already contributes to myGrid and has idle time cannot help out busy machines by dynamically adding services to those it already offers. This would at least require the fully automated installation and deinstallation of each service's runtime environment. In addition, other sites would need to be notified of such a change so that they can react to it. Such issues are addressed in agent research and by standardised middleware such as CORBA. The Debian Linux distribution presented here has implemented such dynamics.

The sharing of workloads in homogeneous environments is addressed by grid initiatives such as the Globus-based [9] NorduGrid initiative [8]. At each site, the workload is handled by one or more clusters of homogeneous machines; the coherence of installations across institutions is coordinated and supervised by so-called virtual organisations (VOs). The VOs define Runtime Environments (REs). Clusters that adhere to an RE state this in their description, which is then used to select the clusters feasible for a job's execution.
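The selection of feasible clusters by their advertised REs can be illustrated with a minimal sketch. It is not NorduGrid's actual brokering code: the `Cluster` class, the `feasible_clusters` function and all cluster and RE names are hypothetical, and the sketch assumes that a cluster is feasible exactly when it advertises every RE the job requests.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Cluster:
    """Hypothetical cluster description listing the REs it adheres to."""
    name: str
    runtime_environments: FrozenSet[str]


def feasible_clusters(clusters: Iterable[Cluster],
                      required: Iterable[str]) -> List[Cluster]:
    """Return the clusters that advertise every RE the job requires."""
    needed = frozenset(required)
    # Subset test: all required REs must appear in the cluster description.
    return [c for c in clusters if needed <= c.runtime_environments]


# Illustrative data only; the RE names are made up for this example.
clusters = [
    Cluster("cluster-a", frozenset({"BLAST", "EMBOSS"})),
    Cluster("cluster-b", frozenset({"BLAST"})),
    Cluster("cluster-c", frozenset()),
]

selected = feasible_clusters(clusters, ["BLAST", "EMBOSS"])
print([c.name for c in selected])
```

A job requesting no RE at all is feasible on every cluster in this sketch, since the empty set is a subset of any description.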

To deploy one's algorithms and data, programs are submitted as source code and, unless written in an interpreted or otherwise hardware-independent language, compiled prior to execution. VOs may ease this burden by requiring sites to install a minimal set of programs on every machine through their inclusion in an RE. While this has proven very workable for the grids' roots in particle physics, bioinformatics is characterised by a vast heterogeneity of small applications and comparatively tiny databases. Even if an agreement could be reached within a VO, a site's maintainer could hardly sustain the resulting installation.

Andreas Tille 2005-05-13