Must-Reads on Data Management, Fall 2022

“In almost every role in today’s world, decisions are made by compiling and analyzing data and by implementing strategies based on the results. In my world, not a day goes by when I am not doing something with data.” – Stephen Howe, Analytics Starts with a Question: How to Better Understand Your Data

Copyright Clearance Center is pleased to publish the Fall 2022 edition of its “Data Management Must-Reads” series: a carefully curated selection of important articles from recent months that explain developments in the world of data you should not miss.

The list of featured articles follows, as first published on the CCC blog.

Managing the entire lifecycle of data

Companies often think of the data that they collect or ingest as static. However, data, like a natural resource, has a life cycle of its own. In June, an article from the IEEE Computer Society did an excellent job of discussing the importance of managing the entire lifecycle of data. The Importance of Data Lifecycle Management (DLM) and Best Practices describes the various stages of the life cycle and highlights the importance of curating and maintaining data.

The data quality challenge in action

Data quality is a well-known challenge for companies of all sizes. Engineers at LinkedIn examined how to manage data quality at the scale of data that LinkedIn consumes. Towards data quality management at LinkedIn describes the architecture of the solution they developed, the “Data Health Monitor”, with the goal of improving the quality of the data that LinkedIn uses for its machine learning efforts.

Going inside the BLOOM Project

Staying on the topic of machine learning, an article in MIT Technology Review describes the BLOOM Project (BigScience Large Open-science Open-access Multilingual Language Model), which attempts to address a criticism often directed at language models: they are opaque, both in their source code and in the data used to train them. Inside a radical new project to democratize AI describes how the project designers hope to make their models as powerful as proprietary ones, but with a transparent process.

Research around databases

For those of you interested in fundamental research around databases, an article in the Communications of the ACM, The Seattle Report on Database Research, describes the most recent in an ongoing series of meetings (held since 1988) to identify promising areas of research for the next five years.

While the author of 8 Levels of Reproducibility: Future-Proofing Your Python Projects uses Python to discuss his ideas, the framework for reproducible research and coding that he lays out is applicable to data science projects using any language.