Perhaps you have to explore large amounts of expression array or proteomic data involving hundreds of genes or proteins, you want to learn about a new field, or you simply need some clues about what the literature says about a gene, a protein family, a drug or a disease. This normally means reviewing hundreds or thousands of documents, studying the titles, reading the abstracts of a number of them and trying to form a preliminary overview. That's where information extraction comes in.
The AlmaTextMiner (ATM) analyzes the query documents to extract the most relevant parts, interconnects the documents of the same or even completely different queries, and presents the results in an easy-to-use interface that lets the researcher explore the literature quickly. This makes it possible to spot the most relevant documents, expand the query to include related documents that do not contain the initial query terms, and cluster and annotate large text collections to make them accessible for exploration. One of the advanced features of the ATM is comparing different document collections and highlighting the differences between them, to answer questions such as "what is specific to each drug in this group of drugs with similar effects?".
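As an illustration of the collection comparison described above, the following minimal Python sketch ranks the terms that are most specific to one document collection relative to another. It is not the ATM's actual algorithm, which is not described here: the smoothed log-odds score, the simple tokenizer, the function names `doc_freq` and `specific_terms`, and the sample texts are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def doc_freq(docs):
    """Count, for each term, the number of documents that contain it."""
    counts = Counter()
    for text in docs:
        # Hypothetical tokenizer: lowercase alphanumeric runs, one count per document.
        counts.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    return counts


def specific_terms(target, background, top=3):
    """Rank terms of `target` by smoothed log-odds against `background`."""
    df_t, df_b = doc_freq(target), doc_freq(background)
    n_t, n_b = len(target), len(background)
    scores = {}
    for term in df_t:
        # Add-0.5 smoothing keeps both probabilities strictly between 0 and 1.
        p_t = (df_t[term] + 0.5) / (n_t + 1.0)
        p_b = (df_b[term] + 0.5) / (n_b + 1.0)
        scores[term] = math.log(p_t / (1 - p_t)) - math.log(p_b / (1 - p_b))
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]


# Made-up sample collections, for illustration only.
drug_a = [
    "aspirin reduces fever and inhibits platelets",
    "aspirin inhibits platelets in patients",
]
drug_b = [
    "ibuprofen reduces fever",
    "ibuprofen reduces inflammation in patients",
]

print(specific_terms(drug_a, drug_b))
```

Terms shared by both collections, such as "fever", score near zero, while terms found only in the first collection rise to the top. A real system would work on mapped bio-entities rather than raw words.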
The ATM builds heavily on KnowledgeDB, Alma's central database, which contains information about genes, proteins, drugs, diseases and other "bio-entities", together with the specific algorithms that reliably map them to the documents. This knowledge adds a new level of information to the text collections and assists in the analysis of hidden relations that would not be evident from the text alone.
The first commercial version of the AlmaTextMiner was developed to assist in the analysis of DNA array experiments. The main result of these experiments is the discovery of sets of genes with similar gene expression patterns (expression-based gene clusters). The underlying assumption is that the genes in a cluster are related by their participation in common biological processes. The operations carried out to define the "biological meaning" of these clusters typically involve consulting functional annotations in different sequence databases such as SwissProt or GenBank. This information is often insufficient, and bibliographic information must be consulted, usually by following the links to selected Medline abstracts provided in some sequence databases. Since only a small fraction of these pointers provide direct information about gene function, further references are usually collected by querying PubMed directly with gene names. In practice, the analysis of a full experiment can involve tens of thousands of references, making a systematic manual analysis of the differences between gene groups impractical.
The AlmaTextMiner supports the functional annotation of groups of genes that show similar expression patterns in DNA array experiments. First, the system uses the gene groups as a framework for clustering the related literature. Based on this clustering, it performs a functional analysis of the gene expression clusters, summarizes the documents for each cluster and highlights the most relevant documents.
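The idea of using gene groups as a framework for organizing the literature can be sketched as follows. This is only an illustration under stated assumptions, not the AlmaTextMiner implementation: it assumes the gene mentions of each document have already been mapped, it ranks documents simply by how many genes of a cluster they mention, and the function name `group_literature`, the document identifiers and the sample data are hypothetical.

```python
from collections import defaultdict


def group_literature(clusters, doc_genes):
    """Assign documents to gene clusters and rank them within each cluster.

    clusters:  cluster name -> set of gene names in that expression cluster
    doc_genes: document id  -> set of gene names mentioned in the document
    """
    grouped = defaultdict(list)
    for cluster, genes in clusters.items():
        for doc_id, mentioned in doc_genes.items():
            overlap = len(genes & mentioned)
            if overlap:
                grouped[cluster].append((overlap, doc_id))
    # Within a cluster, documents mentioning more of its genes come first;
    # ties are broken by document id for stable output.
    return {
        cluster: [doc for _, doc in sorted(items, key=lambda x: (-x[0], x[1]))]
        for cluster, items in grouped.items()
    }


# Made-up example data: two expression clusters and three documents.
clusters = {
    "cluster_1": {"CDC28", "CLN1", "CLN2"},
    "cluster_2": {"HSP104", "SSA1"},
}
doc_genes = {
    "doc1": {"CDC28", "CLN1"},
    "doc2": {"CLN2"},
    "doc3": {"HSP104", "SSA1", "CDC28"},
}

for cluster, docs in group_literature(clusters, doc_genes).items():
    print(cluster, docs)
```

Note that a document can belong to more than one cluster when it mentions genes from several of them, as "doc3" does here. Summarizing each cluster's documents would be a separate step on top of this grouping.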