TY - JOUR
T1 - A multi-strategy approach to biological named entity recognition
AU - Atkinson, John
AU - Bull, Veronica
N1 - Funding Information: This research is partially sponsored by the Universidad de Concepcion, Chile, under grant number DIUC no. 210.093.015-1.0.
PY - 2012/12/1
Y1 - 2012/12/1
N2 - Recognizing and disambiguating the names of bio-entities (genes, proteins, cells, etc.) are very challenging tasks: some biological databases can be outdated, names may not be normalized, abbreviations are used, syntax and word order are modified, etc. Thus, the same bio-entity might be written in different ways, which makes searching a key obstacle, as much relevant candidate literature containing those entities might not be found. As a consequence, the same protein must be searched for under different names, or an existing protein name may be reused for a new protein with completely different features; hence, named-entity recognition methods are required. In this paper, a bio-entity recognition model is presented which combines different classification methods and incorporates simple pre-processing tasks for recognizing bio-entities (genes and proteins). Linguistic pre-processing and feature representation for training and testing are observed to positively affect the overall performance of the method, showing promising results. Unlike some state-of-the-art methods, the approach does not require additional knowledge bases or special-purpose post-processing tasks, which makes it more appealing. Experiments showing the promise of the model compared to other state-of-the-art methods are discussed.
AB - Recognizing and disambiguating the names of bio-entities (genes, proteins, cells, etc.) are very challenging tasks: some biological databases can be outdated, names may not be normalized, abbreviations are used, syntax and word order are modified, etc. Thus, the same bio-entity might be written in different ways, which makes searching a key obstacle, as much relevant candidate literature containing those entities might not be found. As a consequence, the same protein must be searched for under different names, or an existing protein name may be reused for a new protein with completely different features; hence, named-entity recognition methods are required. In this paper, a bio-entity recognition model is presented which combines different classification methods and incorporates simple pre-processing tasks for recognizing bio-entities (genes and proteins). Linguistic pre-processing and feature representation for training and testing are observed to positively affect the overall performance of the method, showing promising results. Unlike some state-of-the-art methods, the approach does not require additional knowledge bases or special-purpose post-processing tasks, which makes it more appealing. Experiments showing the promise of the model compared to other state-of-the-art methods are discussed.
KW - Bioinformatics
KW - Machine learning
KW - Markov models
KW - Named entity recognition
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=84865040493&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2012.05.033
DO - 10.1016/j.eswa.2012.05.033
M3 - Article
AN - SCOPUS:84865040493
SN - 0957-4174
VL - 39
SP - 12968
EP - 12974
JO - Expert Systems with Applications
JF - Expert Systems with Applications
IS - 17
ER -