One of the main limitations of the literature is that it is "flat": no information about the words and terms used is attached to the documents. Relevant documents are therefore hard to find, because words are ambiguous and keyword searches generally suffer from low precision and recall. One reason is that many words are part of multi-word terms whose meaning depends on the context (e.g. the word "DNA" means something different in "DNA replication", "DNA splicing" and "DNA expression"). For the same reason, it is difficult to locate relevant information within a text and to compare information across publications and other data sources (e.g. databases of gene and protein sequences, protein structures, metabolic pathways or transcription factors).
To facilitate access to the literature, and to provide a basic framework for other information extraction methods, KnowledgeDB includes:
- Annotation of the text by classifying every biologically relevant term that appears.
- Use of external knowledge sources (ontologies such as UMLS or GO) to support this task.
- Linking of the classified terms to existing databases (proteins/genes to sequence/structure databases, chemical compounds to chemical ontologies, diseases to a medical classification of diseases, ...).
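The annotation and linking steps listed above can be illustrated with a minimal sketch. It assumes a simple dictionary-based approach with greedy longest-match lookup, which may differ from how KnowledgeDB actually classifies terms. The lexicon, the class names and the identifiers (`ONTO:0001` and so on) are hypothetical placeholders, not real ontology or database accessions.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon: term -> (semantic class, external identifier).
# The identifiers are placeholders, not real database accessions.
LEXICON = {
    "dna replication": ("biological_process", "ONTO:0001"),
    "dna": ("chemical", "CHEM:0001"),
    "p53": ("gene_protein", "SEQ:0001"),
    "breast cancer": ("disease", "DIS:0001"),
}

# Longest term in the lexicon, measured in words.
MAX_TERM_WORDS = max(len(term.split()) for term in LEXICON)


@dataclass
class Annotation:
    start: int        # character offset of the first character of the term
    end: int          # character offset one past the last character
    text: str         # the term as it appears in the document
    term_class: str   # semantic class assigned to the term
    link: str         # identifier in an external knowledge source


def annotate(text):
    """Greedy longest-match lookup of lexicon terms over word tokens."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    annotations = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so that multi-word terms
        # ("DNA replication") take precedence over their parts ("DNA").
        for n in range(min(MAX_TERM_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tok.lower() for tok, _, _ in tokens[i:i + n])
            if candidate in LEXICON:
                term_class, link = LEXICON[candidate]
                start, end = tokens[i][1], tokens[i + n - 1][2]
                annotations.append(
                    Annotation(start, end, text[start:end], term_class, link)
                )
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return annotations


for a in annotate("DNA replication requires DNA and p53."):
    print(a.text, a.term_class, a.link)
```

Preferring the longest match is what keeps "DNA" inside "DNA replication" from being classified on its own, which is the ambiguity described earlier.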
Annotating and linking terms in this way makes it possible to use the literature like a structured database: complex queries can be performed (e.g. retrieve all documents that are related to some form of cancer and that mention a human gene or a drug in the same paragraph), and information in the text can be related to external knowledge sources (e.g. get the specifications for a drug mentioned in the text just by clicking on it). Based on this information, texts can be clustered by specific criteria, and information extraction methods can be applied to reduce the complexity of the information presented to the user and the amount of text that has to be reviewed to answer a given question (e.g. compare all abstracts for a number of diseases to extract the differences between them).
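The example query mentioned above (a form of cancer together with a human gene or a drug in the same paragraph) can be sketched as a co-occurrence filter over annotated paragraphs. This is only an illustration under assumed data structures: the `Term` and `Document` classes, the attribute names and the three toy documents are all hypothetical and do not describe KnowledgeDB's actual storage or query interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    text: str
    term_class: str                       # e.g. "disease", "gene", "drug"
    attributes: frozenset = frozenset()   # e.g. {"cancer"} or {"human"}


@dataclass
class Document:
    doc_id: str
    paragraphs: list   # each paragraph is a list of annotated Term objects


def is_cancer(term):
    return term.term_class == "disease" and "cancer" in term.attributes


def is_human_gene_or_drug(term):
    if term.term_class == "drug":
        return True
    return term.term_class == "gene" and "human" in term.attributes


def cooccurrence_query(documents, *predicates):
    """Return ids of documents in which a single paragraph
    contains at least one term matching every predicate."""
    hits = []
    for doc in documents:
        for paragraph in doc.paragraphs:
            if all(any(pred(t) for t in paragraph) for pred in predicates):
                hits.append(doc.doc_id)
                break   # one matching paragraph is enough for this document
    return hits


# Toy annotated corpus, made up for illustration only.
melanoma = Term("melanoma", "disease", frozenset({"cancer"}))
braf = Term("BRAF", "gene", frozenset({"human"}))
influenza = Term("influenza", "disease")
aspirin = Term("aspirin", "drug")

DOCS = [
    Document("A", [[melanoma, braf]]),        # same paragraph: match
    Document("B", [[melanoma], [braf]]),      # different paragraphs: no match
    Document("C", [[influenza, aspirin]]),    # no cancer term: no match
]

print(cooccurrence_query(DOCS, is_cancer, is_human_gene_or_drug))
```

Passing the conditions as predicates keeps the paragraph-level co-occurrence logic separate from the term classes, so the same function could serve other combinations of criteria.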