Text and Data Mining

What is Text Mining?

“Text mining,” or “text and data mining” (TDM), refers to the process of deriving high-quality information from text materials and databases using software.

Researchers use text mining to extract assertions, facts and relationships from text in order to identify patterns or relations between items that would otherwise be difficult to discern. To do this, text miners build a collection of articles (a corpus) and mine it using text mining software (such as Linguamatics I2E or IBM Watson).

Text mining is different from search. It uses software to analyze information far more quickly than a human being can, in order to identify patterns and make new connections. Such a connection might be an unexpected pattern in protein interactions that eventually leads to the development of a new drug, or a subtle shift in weather patterns that predicts a downturn in the price of wheat. In many cases, this knowledge is spread across a number of sources rather than stated in any single document.
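To make the contrast with search concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the three-document corpus, the entity list and both function names are hypothetical, and commercial tools such as those named above use far richer linguistic analysis than the plain substring matching shown here.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical three-document corpus; the sentences are invented for illustration.
CORPUS = {
    "doc1": "Protein A binds protein B in liver cells.",
    "doc2": "Protein B inhibits protein C. Protein A was not studied.",
    "doc3": "Protein A binds protein B under stress.",
}

# Hypothetical entity dictionary; real tools rely on curated vocabularies.
ENTITIES = ["protein a", "protein b", "protein c"]


def keyword_search(corpus, keyword):
    """Search: return the ids of documents that mention the keyword."""
    return sorted(
        doc_id for doc_id, text in corpus.items()
        if keyword.lower() in text.lower()
    )


def cooccurrence_counts(corpus, entities):
    """Mining: count how often two entities share a sentence, across all documents."""
    counts = Counter()
    for text in corpus.values():
        # Naive sentence split on end-of-sentence punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            lowered = sentence.lower()
            # Naive substring matching; a simplification for this sketch.
            found = sorted(e for e in entities if e in lowered)
            counts.update(combinations(found, 2))
    return counts


hits = keyword_search(CORPUS, "protein a")
pairs = cooccurrence_counts(CORPUS, ENTITIES)
print(hits)
print(pairs.most_common())
```

Search only reports which documents mention a term. The co-occurrence count aggregates evidence across the whole corpus and shows which pairs of entities are repeatedly mentioned together, which is a very simple stand-in for the relationship extraction described above.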

What are the benefits of text mining?

Text mining gives your R&D team the ability to examine articles in great detail, so that you can guide business decisions and prioritize investments.

If your organization has not yet pursued text mining, here are three reasons to start:


1. Enhance R&D Efficiency

According to a 2012 report by JISC (Joint Information Systems Committee), approximately 1.5 million scientific papers are published each year. MEDLINE contains over 24 million references to biomedical journal articles. It is neither feasible nor cost-effective for a researcher to read and analyze this much information. Text mining enables researchers to analyze massive amounts of unstructured text quickly and to extract the data, assertions and facts that are specific to a particular research topic.


2. Increase Discovery

Unlike search engines, which surface documents based on keywords, text mining tools analyze documents to identify entities and extract the relationships between them. This unlocks hidden information that helps researchers identify and develop new hypotheses, attain knowledge and improve understanding.


3. Monitor Drug Safety

Recognizing the potential for adverse effects from a drug is vital at each stage of its pipeline, as is information on drug interactions, unsafe dosage levels and safety issues related to drug target pathways. By surfacing this information from the literature, text mining can help companies avoid late-stage drug development failures.

*Sources
1. JISC (2012). The Value and Benefit of Text Mining to UK Further and Higher Education. Programme: Digital Infrastructure. Available at: http://bit.ly/jisc-textm (programme page: www.jisc.ac.uk/whatwedo/programmes/di_directions.aspx)

Text Mining Challenges

Despite the many benefits of text mining, researchers face a number of obstacles before they even get a chance to run queries against the body of biomedical literature.

Here are three primary challenges for researchers as they build a corpus for their text mining projects:


1. Incomplete Information

Many researchers build their corpus using scientific article abstracts because they are easily accessible via biomedical databases such as PubMed. While mining abstracts provides some value, an abstract contains only a limited share of the data and insights that can be gained from the full-text article.


2. Limited Access to XML-Formatted Content

When researchers have subscriptions, the documents are often provided as PDFs, a format not intended for use with text mining software. Researchers must then spend time converting the PDFs to XML (Extensible Markup Language), the preferred format for text mining software.

3. Inconsistent Licensing Terms and Fees

Because text mining projects depend on access to a broad base of content, the text miner must work directly with multiple rightsholders (publishers and authors) for the use of full-text XML articles, resulting in varying fee structures, inconsistent terms of use and ultimately reduced productivity.
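Returning to the first two challenges, the short Python sketch below suggests why structured full-text XML is easier to mine than a PDF or an abstract alone. It is an illustration only: the element names (`article`, `abstract`, `body`, `sec`, `title`, `p`), the sample text and the `sections` function are hypothetical and do not represent any particular publisher's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; the element names are illustrative,
# not a real publisher schema, and the text is invented.
SAMPLE_XML = """
<article>
  <abstract><p>Compound X was evaluated in adults.</p></abstract>
  <body>
    <sec><title>Methods</title><p>Patients received 10 mg daily.</p></sec>
    <sec><title>Results</title><p>Two patients reported headaches.</p></sec>
  </body>
</article>
"""


def sections(xml_text):
    """Map each section title to its paragraph text, with the abstract first."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    abstract = root.find("abstract")
    if abstract is not None:
        result["Abstract"] = " ".join(
            "".join(p.itertext()) for p in abstract.iter("p")
        )
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="Untitled")
        result[title] = " ".join("".join(p.itertext()) for p in sec.iter("p"))
    return result


parsed = sections(SAMPLE_XML)
for name, text in parsed.items():
    print(f"{name}: {text}")
```

Because the markup labels each section explicitly, a program can target the Results section directly. In this invented sample, the adverse-event sentence appears only in the full text and not in the abstract, and a PDF would provide no such labels without a conversion step.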