Identification of relations between risk factors and their pathologies or health conditions by mining scientific literature.
Abstract
The discovery and prevention of risk factors is an active research field within the biomedical domain. Although abundant information on risk factors exists in bibliographical databases and on various websites, accessing it can be difficult. Methods from Natural Language Processing and Information Extraction can make this information easier to access. Specifically, we present a procedure for analyzing massive amounts of scientific literature and for detecting linguistically marked associations between pathologies and risk factors. This approach allowed us to extract over 22,000 risk factors and their associated pathologies. The evaluations show that (1) over 88% of the risk factors extracted for coronary heart disease are correct, (2) the associated pathologies, when they could be compared to MeSH indexing, are correct in about 70% of cases, and (3) links between risk factors and their pathologies are seldom recorded in existing terminologies.
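
To illustrate what detecting a linguistically marked association can look like, the following is a minimal sketch and not the procedure used in this work. It assumes a single hypothetical surface pattern ("X is a risk factor for Y") and a naive sentence splitter; the names `PATTERN` and `extract_relations` are introduced here for illustration only.

```python
import re

# Hypothetical surface pattern (an assumption, not the paper's actual markers):
# "<factor> is/are (a|an) [modifier] risk factor(s) for/of <pathology>"
PATTERN = re.compile(
    r"(?P<factor>[A-Za-z][A-Za-z \-]*?)\s+(?:is|are)\s+(?:an?\s+)?"
    r"(?:\w+\s+)?risk factors?\s+(?:for|of)\s+"
    r"(?P<pathology>[A-Za-z][A-Za-z \-]*)",
    re.IGNORECASE,
)


def extract_relations(text):
    """Return (risk factor, pathology) pairs found by the pattern above."""
    relations = []
    # Naive sentence splitting on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        match = PATTERN.search(sentence)
        if match:
            factor = match.group("factor").strip().lower()
            pathology = match.group("pathology").strip().lower()
            relations.append((factor, pathology))
    return relations


if __name__ == "__main__":
    sample = (
        "Smoking is a major risk factor for coronary heart disease. "
        "The cohort included 500 patients."
    )
    for factor, pathology in extract_relations(sample):
        print(f"{factor} -> {pathology}")
```

A real system would need many more markers, term normalization, and handling of negation and coordination; this sketch only shows the basic idea of mapping a textual marker to a pair of related terms.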