Whatever the problem under study, researchers must review and explore the knowledge already acquired. Modern experimental techniques such as DNA microarrays, in which thousands of genes can be assayed simultaneously, require efficient retrieval of the information gathered for sets of genes, some of which the researcher may not be familiar with.

Scientific knowledge is mostly stored in huge collections of written text. Medline, the archive of abstracts of biological articles, already contains more than 10 million entries. The rapid growth of these collections makes it very difficult for human experts to access the information effectively. It is therefore essential to create tools capable of extracting the information of interest from the literature stored in these databases.

Although still very experimental, the first applications of information extraction (IE) techniques have addressed problems such as the detection of gene names and their positions on chromosomes, the detection of protein names, the construction of knowledge bases and, most frequently, the detection of protein-protein interactions. General utilities for text retrieval, such as keyword suggestion systems and searches for related documents, are also available. The application of IE techniques to the analysis of expression array data has also been addressed recently.

There are currently two alternative approaches in the field of automated IE. The more biologically oriented one is closely related to the methods of bioinformatics. It involves a simple analysis of the co-occurrence of names in text and direct statistical treatment of words as individual units (word content, comparison of frequencies, etc.). These applications have proven relatively successful, but there is an obvious limit to what they can achieve, as statistics cannot fully account for the complex structure of the information contained in plain scientific text. More computationally oriented approaches involve text parsing, part-of-speech tagging, resolution of ambiguities, etc. Adapting these approaches to field-specific problems (e.g. dictionaries, expressions and linguistic constructions) requires considerable work.
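
As an illustration of the first, statistical approach, the following minimal Python sketch counts how often pairs of names occur in the same sentence. It is not taken from any particular system: the name dictionary `GENE_NAMES`, the helper functions and the example sentences are all hypothetical, and a real application would need a curated dictionary and proper sentence splitting.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical name dictionary (an assumption for this sketch);
# a real system would load a curated list of gene and protein names.
GENE_NAMES = {"cdc2", "cyclin b", "p53", "mdm2"}


def names_in_sentence(sentence, names=GENE_NAMES):
    """Return the dictionary names that occur as whole words in a sentence."""
    lowered = sentence.lower()
    return {
        name
        for name in names
        if re.search(r"\b" + re.escape(name) + r"\b", lowered)
    }


def cooccurrence_counts(sentences, names=GENE_NAMES):
    """Count, for each pair of names, the sentences in which both occur."""
    counts = Counter()
    for sentence in sentences:
        found = sorted(names_in_sentence(sentence, names))
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


# Invented example sentences, used only to exercise the functions.
sentences = [
    "Cdc2 binds cyclin B during mitosis.",
    "MDM2 regulates p53 stability.",
    "p53 is degraded after MDM2 binding.",
    "Cyclin B levels oscillate.",
]

counts = cooccurrence_counts(sentences)
for pair, n in counts.most_common():
    print(pair, n)
```

Counting at the sentence level is only one possible choice; counting per abstract would find more pairs at the cost of more spurious associations. Either way, the counts say that two names appear together, not how they are related, which is the limit of purely statistical treatment noted above.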

It is important to keep in mind that the degree of success of these approaches depends not only on the quality of the methods, but also on the intrinsic difficulty of the particular field of application. Realistic methodologies will use techniques from both approaches and combine them within a single application. In particular, part-of-speech taggers, a classic development in natural language understanding, can be applied with little effort to biological text (at the syntax level). However, at the level of semantics and pragmatics, no general solutions exist for understanding the information in the text according to local and global context. Here, frame-based approaches, which use patterns matching frequently used constructions, can extract a large amount of information with little effort and considerable reliability.
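
A frame-based pattern of this kind can be sketched in a few lines of Python. The frame below, "name, interaction verb, name", the verb list and the single-token definition of a name are simplifying assumptions made for illustration; they are not the patterns of any specific system, and the example text is invented.

```python
import re

# Hypothetical frame: "<name A> <interaction verb> <name B>".
# The verb list is an assumption; a real frame set would be much larger.
INTERACTION_VERBS = r"(?:interacts with|binds(?: to)?|phosphorylates|activates|inhibits)"

# Simplifying assumption: a name is one token of letters, digits and hyphens.
NAME = r"[A-Za-z][A-Za-z0-9-]*"

FRAME = re.compile(
    rf"\b(?P<a>{NAME})\s+(?P<verb>{INTERACTION_VERBS})\s+(?P<b>{NAME})\b"
)


def extract_interactions(text):
    """Return (name A, verb, name B) for every match of the frame in the text."""
    return [
        (m.group("a"), m.group("verb"), m.group("b"))
        for m in FRAME.finditer(text)
    ]


# Invented example text.
text = "We show that Raf1 phosphorylates MEK1. In addition, MDM2 binds to p53 in vivo."

for interaction in extract_interactions(text):
    print(interaction)
```

Such a pattern only captures constructions it was written for. Passive forms ("p53 is bound by MDM2"), negation and multi-word names would each need additional frames or a preceding tagging step.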

2002 ALMA Bioinformatics, SL. All rights reserved.