Five Recommendations for Smart Text and Data Mining


Recent research shows that submission rates at scientific journals rose exponentially in the first months of 2020. As the amount of available information keeps growing, R&D-intensive companies are increasingly turning to text and data mining of full-text scientific literature, both at scale and in the context of individual projects, to extract information and strengthen their knowledge supply chain. Naturally, these efforts vary in scope and requirements from company to company, or even from project to project. When working with full-text content, there are many factors a knowledge manager should consider in trying to develop an optimal workflow suited to the needs of their organization. Below, we take a closer look at a few of these factors.

1) End-to-End Workflow

As a knowledge manager, it is essential to understand your company’s expected end-to-end workflow for text mining full-text literature. It can be useful to map the anticipated inputs and outputs at each phase of the workflow, as well as to clarify expected timelines and business criticality. This applies both to any backend data processing pipeline and to the dependent end-user workflows. By looking at this workflow as one continuous stream, a knowledge manager can ensure that adjustments upstream do not break processes downstream.

2) Corpus Parameters

The parameters for defining a full-text corpus of scientific literature will vary depending on the organization’s end-to-end workflow. For example, the dimensions of a corpus being leveraged in a text mining process applied to specific projects – such as a pharmacovigilance workflow – will differ from those used within broader initiatives to process scientific information at scale, apply machine learning or artificial intelligence capabilities, or construct knowledge graph representations. In narrower use cases, specific queries may rely on keywords or subject-related metadata (such as Medical Subject Headings, or MeSH, and other indexing aids) that will pull relevant content based on the project specifications. The broader the use case, the less likely an organization is to be able to pre-filter for specific topics; in these cases, broader selection criteria, such as time- or journal-based categories of content, need to be applied. Based on the end-to-end workflow envisioned, knowledge managers can help their stakeholders by identifying key questions that will define the approach to creating a useful corpus, such as:

  • What are the desired outputs from processing full-text content?
  • Is there an ongoing need for new literature, or will a backfile of historical research suffice?
  • What is the expected business outcome of the text and data mining effort?
  • What journals, timelines, and fields of research are most relevant to the project?

3) Volume

As described above, there are different approaches to defining a corpus of scientific literature, and those parameters will naturally affect the volume of content in question. Returning to the two use cases mentioned above, an organization text mining for specific projects may consume only a handful, dozens, or hundreds of articles at a given time for interrogation. Larger-scale initiatives, with broader corpus parameters, may result in the processing of hundreds of thousands, or even millions, of articles. Beyond the project’s needs at the present snapshot in time, it is also important to consider the maintenance of the project over time and its likely content needs in the future. Predicting the amount of content needed for text mining going forward is an important exercise, but it can also be a challenge. One method that helps with this estimation is to analyze current or backfile consumption and use that metric to forecast future needs. This calculation can help a knowledge manager choose the appropriate method for consuming the content and predict costs.
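The backfile-based estimate described above amounts to simple arithmetic: derive a monthly consumption rate from historical usage, then project it forward with an assumed growth rate. The figures and the growth factor below are invented for the example.

```python
# Back-of-the-envelope volume forecast from backfile consumption.
# All numbers are illustrative assumptions.

backfile_articles = 12000   # articles consumed over the past 24 months
backfile_months = 24

monthly_rate = backfile_articles / backfile_months   # 500 articles/month
growth_factor = 1.10        # assume ~10% annual growth in relevant literature

projected_next_year = monthly_rate * 12 * growth_factor
print(round(projected_next_year))  # -> 6600
```

An estimate like this, however rough, gives a concrete basis for comparing content-delivery options and predicting licensing costs.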

4) Timeliness 

For some organizations, timeliness is essential for supporting their text mining use case. As a knowledge manager, it is important to understand stakeholder expectations for when a newly published article should yield outputs from a text and data mining processing pipeline. Delays can be introduced by the lag between an article’s publication and its availability in a data feed or via a public API, by batch or asynchronous processing rules, and so on; these delays in turn affect business commitments, service level agreements, and expectations.
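One way to make those expectations measurable is to track the lag between an article’s publication date and its appearance in the data feed, and compare it to the agreed delivery window. The dates and the two-week threshold below are illustrative assumptions.

```python
# Minimal sketch: measuring publication-to-feed lag against a service-level
# expectation. Dates and the SLA threshold are made up for the example.

from datetime import date

published = date(2020, 3, 2)        # article's publication date
available_in_feed = date(2020, 3, 9)  # date it appeared in the content feed

lag_days = (available_in_feed - published).days
within_sla = lag_days <= 14         # e.g. a two-week delivery commitment

print(lag_days, within_sla)  # -> 7 True
```

Tracking this metric across a feed over time lets a knowledge manager verify that upstream delays are not silently eroding downstream commitments.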

5) Licensing

Published scientific literature is a precious asset and an important investment for an organization. Based on the expected end-to-end workflow and corpus parameters, the knowledge manager should determine whether content needs can be satisfied through existing subscriptions and licenses, or whether these should be augmented through extended licensing or transactional purchases. Any expected transactional steps should likewise be factored into the workflow, and their effect on timeliness requirements measured.

Considering these five factors will help a knowledge manager uncover the optimum workflow for their organization to consume full-text content. Text mining, for many, is a new and exciting technology that can drastically improve research and innovation within an organization. By understanding and analyzing these five considerations, the potentially daunting task of developing a new workflow to support text mining will become significantly more manageable.

Author: Garrett Dintaman

Garrett Dintaman is CCC’s Associate Product Manager and Product Owner for RightFind XML for Mining. He focuses on client needs and use cases related to text and data mining, data processing pipelines, and analytics. In his free time, Garrett enjoys following college basketball and spending time in nature.