Don’t let content access, manual processes, and licensing terms get in the way of your work
Researchers and data scientists use text mining tools to extract and interpret facts, assertions, and relationships from vast amounts of published information. Projects powered by AI and machine learning accelerate the research process, enable more discoveries, provide competitive intelligence, and help companies identify potential safety issues in the drug or product development pipeline.
Despite the many benefits of text mining, however, researchers face a number of challenges before they even have the chance to run queries against the scientific literature.
Here are the top three challenges we hear about when companies build a collection of articles for their text mining projects, along with our concrete tips for overcoming them:
1. Incomplete Information in Article Abstracts
Many researchers build their corpus from scientific article abstracts because they are easily accessible via databases such as PubMed. While abstracts provide some value, there are limits to what data an abstract can contain. The ability to mine the full text of the article, including detailed descriptions of methods and protocols and the complete study results, ensures that researchers don’t miss vital data, discoveries, and assertions. However, unlike article abstracts, full text is often not readily available from publishers in a format suitable for text mining.
Tip: The more data from multiple publishers and sources, the better. Focus on full text to reduce the risk of missing something, and ideally unify the formats of those sources to make ingestion into mining tools simpler.
2. Limited Access to XML-Formatted Content
When companies have journal subscriptions, the documents are often available as PDFs, a format not intended for use with text mining software. Researchers and data scientists must then spend time converting the PDFs to XML (Extensible Markup Language), the preferred format for text mining software. XML encodes documents in a widely used format that is easily read by computers or “machines,” so that computer programs can parse or display the content appropriately. Converting PDFs to XML requires additional software tools, which is not only inefficient but can also create several problems with the document itself: loss of data and tables, conflation of document sections into a “blob of text,” and the addition of bad characters and non-words. All of these leave open the risk of missing data.
Tip: Lean more on original source XML versus conversion of PDFs, especially if it is normalized to a standard schema (like JATS). This typically provides better quality results. Remember: bad data in, bad data out!
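To illustrate why source XML is easier to mine than converted PDFs, here is a minimal sketch that pulls section-level text out of a JATS-style document using Python's standard library. The sample article is hypothetical and heavily trimmed, and the function name `extract_sections` is our own for illustration; real publisher XML will carry namespaces, nested sections, and far more metadata.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed JATS-style article used only for illustration.
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group><article-title>Example Study</article-title></title-group>
      <abstract><p>Short summary of the study.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Methods</title>
      <p>Samples were incubated for <italic>24</italic> hours.</p>
    </sec>
    <sec>
      <title>Results</title>
      <p>Response improved in the treated group.</p>
      <p>No adverse events were observed.</p>
    </sec>
  </body>
</article>"""


def extract_sections(jats_xml: str) -> dict[str, str]:
    """Map each top-level body section title to its paragraph text."""
    root = ET.fromstring(jats_xml)
    sections = {}
    for sec in root.findall("./body/sec"):
        title = (sec.findtext("title") or "Untitled").strip()
        # itertext() flattens inline markup such as <italic> into plain text.
        paragraphs = ["".join(p.itertext()).strip() for p in sec.findall("p")]
        sections[title] = " ".join(paragraphs)
    return sections


sections = extract_sections(SAMPLE_JATS)
for title, text in sections.items():
    print(f"{title}: {text}")
```

Because the sections are explicitly tagged, the Methods and Results text stays separate instead of collapsing into a single “blob of text,” which is the failure mode described above for PDF conversion.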
3. Inconsistent Licensing Terms and Fees
Tip: Save time and effort by taking advantage of collective licensing options available – let someone else take on the negotiating for you. This is also an important time to involve the person or department within your company who manages subscriptions (typically a knowledge or information manager) – they’ll have insights into what the company currently utilizes and may already have relationships with partners that can help streamline licensing.
How can CCC and RightsDirect help?
RightFind® XML enables researchers in R&D-intensive companies to make discoveries and connections that can only be found in the full text of articles.
This article was originally written by Carl Robinson, Sr. Director Pre-Sales & Consulting at CCC for the Velocity of Content blog.