The detection and correct classification of named entities in text is the basis of any Natural Language Processing (NLP) pipeline. Without clearly tagged named entities, none of the subsequent steps can reach satisfactory performance. The objects mentioned in a document therefore have to be detected (where are they?) and classified (what are they?). This is especially important for gene and protein names, because they are the basic building blocks of our understanding of molecular biology and are crucial for understanding the molecular mechanisms of many diseases. In practice, if the term "cdc28 kinase" appears in a text, one has to detect
- that it represents a protein name, and
- which molecular entity it refers to (i.e. what the database ID of this protein is in, for example, Swiss-Prot).
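The two steps above, detection and linking to a database entry, can be illustrated with a minimal dictionary-lookup sketch. It is not a description of how Text Detective works internally: the lexicon, the identifier `EXAMPLE:0001` (a placeholder, not a real Swiss-Prot accession) and the function name `find_entities` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical lexicon: surface form -> (entity type, database identifier).
# The identifiers are placeholders, not real Swiss-Prot accessions.
LEXICON = {
    "cdc28 kinase": ("protein", "EXAMPLE:0001"),
    "cdc28": ("protein", "EXAMPLE:0001"),
}

def find_entities(text):
    """Return (start, end, surface, type, db_id) for each lexicon match."""
    hits = []
    # Try longer names first so "cdc28 kinase" wins over "cdc28".
    for name in sorted(LEXICON, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Skip matches nested inside an already accepted longer match.
            if any(s <= m.start() and m.end() <= e for s, e, *_ in hits):
                continue
            etype, db_id = LEXICON[name]
            hits.append((m.start(), m.end(), m.group(0), etype, db_id))
    return sorted(hits)

print(find_entities("The Cdc28 kinase controls the yeast cell cycle."))
```

Each hit carries both the position of the mention (where is it?) and its type and identifier (what is it?). A plain lookup like this cannot resolve ambiguous names, which is where most of the difficulty lies.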
Biological nomenclature is especially difficult to manage, and this has hindered the development of accurate text mining systems in the field. Ambiguity tends to be high in biological names: synonyms or aliases exist for many bio-entities (some human genes can be found under 17 different names in the literature), and since names are often acronyms, many of them can stand for multiple meanings (the gene name "SCT" matches more than 20 other meanings, such as "Stem Cell Transplant" or "Stair Climbing Test").
Besides, detecting a possible name is only part of the story: it is also important to link the name to its database entries, so that we can access the full knowledge available for that bio-entity and represent it precisely. Many current name detection systems omit this crucial step.
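The acronym ambiguity described above is often tackled by comparing the words around a mention with cue words for each candidate meaning. The sketch below assumes a small, hand-made cue-word table and a simple overlap score; the table `SENSE_CUES` and the function `disambiguate` are hypothetical, and this illustrates the general idea rather than the algorithm used by Text Detective.

```python
import re

# Hypothetical cue words for three senses of "SCT"; a real system would
# derive these from curated resources or annotated text.
SENSE_CUES = {
    "gene": {"gene", "expression", "mrna", "promoter", "transcription"},
    "Stem Cell Transplant": {"patients", "donor", "graft", "transplant", "leukemia"},
    "Stair Climbing Test": {"exercise", "test", "seconds", "steps", "mobility"},
}

def disambiguate(sentence, senses=SENSE_CUES):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns None when no cue word is present, so the mention stays unresolved.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("SCT expression was measured by mRNA levels."))
print(disambiguate("Patients received SCT from a matched donor."))
```

Only a mention resolved to the gene sense would then be linked to a database entry; returning `None` for sentences without any cue word avoids forcing a link when the context gives no evidence.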
We are developing Text Detective for the detection of entity names in the biological field. The system accurately annotates text entries, detecting the bio-entities cited in them, and uses a dedicated algorithm to disambiguate the possible names found in the text. By combining available knowledge on bio-entities with carefully designed rules, Text Detective achieves up to 90% in both precision and recall on several bio-entity types:
- Gene and protein names (human, S. cerevisiae and E. coli; mouse coming soon)
- Chemicals and drugs
More organisms and additional entity types (tissues, experimental techniques, ...) will be covered in future releases.