The Three Biggest Challenges of Using Scientific Articles in AI and Machine Learning Projects

Don't let content access, manual processes, and licensing terms hold back your work

Researchers and data scientists use text mining tools to extract and interpret facts, assertions, and relationships from vast amounts of published information. Projects supported by AI and machine learning accelerate the research process, enable more discoveries, provide competitive intelligence, and help companies identify potential safety issues in the drug or product development pipeline.

Despite the many benefits of text mining, however, researchers face a number of challenges before they even have the chance to run queries against the scientific literature.

Here are the three main challenges we hear about when companies build a collection of articles for their text mining projects, along with our concrete tips for overcoming them:

1. Incomplete Information in Article Abstracts  

Many researchers build their corpus from scientific article abstracts because they are easily accessible via databases such as PubMed. Abstracts provide some value, but they contain only a limited portion of an article's data. Mining the full text of the article, including detailed descriptions of methods and protocols and the complete study results, ensures that researchers don’t miss vital data, discoveries, and assertions. However, unlike article abstracts, full text is often not readily available from publishers in a format suitable for text mining.

Tip: The more data from multiple publishers and sources, the better. Focus on full text to reduce FOMO, and ideally unify the format across sources to make ingestion into mining tools simpler.

2. Limited Access to XML-Formatted Content 

When companies have journal subscriptions, the documents are often available as PDFs, a format not intended for use with text mining software. Researchers and data scientists must then spend time converting the PDFs to XML (Extensible Markup Language), the preferred format for text mining software. XML encodes documents in a format that is easily read by computers, or “machines,” and is widely used so that computer programs can parse or display the content appropriately. Converting PDFs to XML requires additional software tools, which is not only inefficient but can also create several problems with the document itself: data and tables can be lost, document sections can be conflated into a “blob of text,” and bad characters and non-words can be introduced, all of which leave open the risk of missing data.

Tip: Rely on original source XML rather than converted PDFs, especially if it is normalized to a standard schema (like JATS). This typically provides better quality results. Remember: bad data in, bad data out!
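To make the benefit of source XML concrete, here is a minimal Python sketch of how section structure survives when the input is XML rather than a converted PDF. It assumes a heavily simplified, hypothetical article that uses a few common JATS element names (article-title, body, sec, title, p); real publisher XML is far richer and may need namespace and nested-section handling. The sample document and the extract_sections helper are invented for illustration and are not part of any product mentioned in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified JATS-style article used only for illustration.
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>Example Study</article-title>
      </title-group>
      <abstract><p>Short summary.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Methods</title>
      <p>Samples were incubated for 24 hours.</p>
    </sec>
    <sec>
      <title>Results</title>
      <p>Response improved in the treated group.</p>
      <p>No adverse events were observed.</p>
    </sec>
  </body>
</article>"""


def extract_sections(jats_xml):
    """Return the article title and a dict mapping section titles to text."""
    root = ET.fromstring(jats_xml)
    title = root.findtext(".//article-title", default="").strip()
    sections = {}
    # Only top-level sections are read here; nested <sec> elements are ignored.
    for sec in root.findall("./body/sec"):
        heading = (sec.findtext("title") or "Untitled").strip()
        # itertext() also collects text inside inline markup such as <italic>.
        paragraphs = ["".join(p.itertext()).strip() for p in sec.findall("p")]
        sections[heading] = " ".join(paragraphs)
    return title, sections


title, sections = extract_sections(SAMPLE_JATS)
print(title)
for heading, text in sections.items():
    print(f"{heading}: {text}")
```

Because the Methods and Results sections arrive as labeled elements, a mining pipeline can target them directly instead of guessing section boundaries inside a “blob of text.”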

3. Inconsistent Licensing Terms and Fees 

Projects define their corpus of scientific literature in different ways: some require only a handful, dozens, or hundreds of articles, while others require processing hundreds of thousands or even millions of articles. To get the best results, projects often depend on access to a broad base of content, so businesses must work directly with multiple rightsholders and publishers to use full-text XML articles. This typically results in varying fee structures, inconsistent terms of use, and ultimately reduced productivity. Without a common set of terms and conditions for using full-text content across publishers, researchers and information managers are left negotiating one by one with individual rightsholders to obtain the content and rights they need for text mining.

Tip: Save time and effort by taking advantage of available collective licensing options, and let someone else take on the negotiating for you. This is also an important time to involve the person or department within your company who manages subscriptions (typically a knowledge or information manager). They’ll have insights into what the company currently uses and may already have relationships with partners that can help streamline licensing.

Keep Learning: 

How can CCC and RightsDirect help? 

RightFind® XML enables researchers in R&D-intensive companies to make discoveries and connections that can only be found in article full-text. Learn more here.  

This article was originally written by Carl Robinson, Sr. Director Pre-Sales & Consulting at CCC for the Velocity of Content blog.