Brief info

ALMATextMiner is a system for extracting relevant bibliographic information related to groups of genes. The goal of the system is to discover and pinpoint the important facts that relate to a given group of genes but not to others. To do so, for each group of genes the system retrieves pertinent keywords, sentences, and links to relevant articles.
These are then weighted so that information specifically relevant to each group is highlighted. This helps significantly in the functional annotation of each group of genes, and has the potential to uncover much other interesting information (involvement in diseases, important mutations or alterations, connections with cellular processes and pathways, etc.).
Although the most obvious application of the system is the annotation of microarray results, any set of genes or proteins can be analyzed. The system can therefore be used for a number of genomics and proteomics applications.
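The idea of weighting keywords so that group-specific information stands out can be illustrated with a small sketch. This is not the weighting scheme used by the system, which is not described here. It assumes a simple TF-IDF-like score computed across groups, and the function name, group names and toy texts are all hypothetical.

```python
import math
from collections import Counter

def specific_keywords(group_docs, top_n=2):
    """Rank each group's terms by (relative frequency in the group)
    times log(number of groups / number of groups containing the term).
    Assumption: a TF-IDF-like score, chosen only for illustration."""
    # Term counts per group, using naive whitespace tokenization.
    counts = {
        group: Counter(word for doc in docs for word in doc.lower().split())
        for group, docs in group_docs.items()
    }
    n_groups = len(counts)
    # In how many groups does each term occur at least once?
    group_freq = Counter(term for c in counts.values() for term in c)
    ranked = {}
    for group, c in counts.items():
        total = sum(c.values())
        scores = {
            term: (n / total) * math.log(n_groups / group_freq[term])
            for term, n in c.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked[group] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return ranked

# Toy, invented texts standing in for the literature of two gene groups.
toy_groups = {
    "cluster_a": ["kinase signalling apoptosis", "apoptosis caspase kinase"],
    "cluster_b": ["kinase transport membrane", "membrane channel transport"],
}
ranked = specific_keywords(toy_groups)
```

In this toy input the term "kinase" occurs in both groups, so its score is zero and it drops out of both rankings, while terms that occur in only one group rise to the top.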


You have to explore large amounts of expression array or proteomic data involving hundreds of genes or proteins, you want to learn something about a new field, or you simply need some clues about what is known in the literature about a gene, a protein family, a drug or a disease. This normally means reviewing hundreds or thousands of documents, studying the titles, reading the abstracts of a number of them and trying to get a preliminary overview. That's where information extraction comes in.

The AlmaTextMiner analyzes the query documents to extract the most relevant parts, interconnects the documents of the same or even completely different queries, and presents the results in an easy-to-use interface that allows the researcher to explore the literature quickly. This makes it possible to spot the most relevant documents, to expand the query to include related documents that do not contain the initial query terms, and to cluster and annotate large text collections so that they become accessible for exploration. One of the advanced features of the ATM is the comparison of different document collections, highlighting the differences between them to answer questions such as "what is specific to each of the drugs in this group of drugs with similar effects?".
The ATM builds heavily on KnowledgeDB, Alma's central database, which contains information about genes, proteins, drugs, diseases and other "bio-entities", together with the specific algorithms used to map them reliably to the documents. This knowledge adds a new level of information to the text collections and assists in the analysis of hidden relations that would not be evident from the text alone.

The first commercial version of the AlmaTextMiner was developed to assist in the analysis of DNA array experiments. The main result of these experiments is the discovery of sets of genes with similar gene expression patterns (expression-based gene clusters). The underlying assumption is that the genes in each cluster are related by their participation in common biological processes. The operations carried out to define the "biological meaning" of these clusters typically involve consulting functional annotations in different sequence databases such as SwissProt or GenBank. This information is often insufficient, so bibliographic information must be consulted, usually by following the links to selected Medline abstracts provided in some sequence databases. Since only a small fraction of these pointers provide direct information about gene function, further references are usually collected by querying PubMed directly with gene names. In practice, the analysis of a full experiment can involve tens of thousands of references, which makes a systematic manual analysis of the differences between gene groups impractical.
The AlmaTextMiner addresses the functional annotation of groups of genes that show similar expression patterns in DNA array experiments. The system first uses the groups of genes as a framework for clustering the related literature. On this basis it performs a functional analysis of the gene expression clusters, summarizes the documents for each cluster and highlights the most relevant documents.
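As a rough illustration of using gene groups as a framework for organizing the literature, the sketch below assigns abstracts to each cluster and ranks them by how many of the cluster's genes they mention. It rests on several assumptions: gene mentions are found by naive exact token matching rather than by the entity mapping described above, and the function name, gene names, document identifiers and texts are invented placeholders.

```python
import re

def rank_documents(clusters, abstracts, top_n=2):
    """For each gene cluster, list the documents that mention the most
    distinct genes of that cluster (hypothetical relevance measure)."""
    ranked = {}
    for name, genes in clusters.items():
        wanted = {gene.upper() for gene in genes}
        scored = []
        for doc_id, text in abstracts.items():
            # Naive tokenization: runs of letters and digits, upper-cased.
            tokens = {t.upper() for t in re.findall(r"[A-Za-z0-9]+", text)}
            hits = len(wanted & tokens)
            if hits:
                # Negative hit count so that sorting puts best matches first.
                scored.append((-hits, doc_id))
        ranked[name] = [doc_id for _, doc_id in sorted(scored)[:top_n]]
    return ranked

# Placeholder clusters and abstracts, invented for this example.
toy_clusters = {"cluster_1": ["GENE1", "GENE2"], "cluster_2": ["GENE3"]}
toy_abstracts = {
    "doc1": "GENE1 interacts with GENE2.",
    "doc2": "GENE2 expression is reduced.",
    "doc3": "GENE3 and GENE1 are co-expressed.",
}
top_docs = rank_documents(toy_clusters, toy_abstracts)
```

A real system would need synonym handling and disambiguation of gene names, which exact token matching does not provide.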

These techniques perform dramatically better when they are used together: the TextMiner to find relevant documents and get an overview of the available publications; the KnowledgeExplorer to establish relationships between the entities; then the TextMiner again to extract information about a set of entity pairs, to compare them with other sets in order to spot hidden differences, and to extend the text corpus with the new information. All this knowledge is accumulated in KnowledgeDB to be mined in further sessions.
Contact us for further information.