OpenCitations Meta: Discussion

3 Jun 2024

Authors:

(1) Arcangelo Massari, Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy;

(2) Fabio Mariani, Institute of Philosophy and Sciences of Art, Leuphana University, Lüneburg, Germany;

(3) Ivan Heibi, Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy and Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy;

(4) Silvio Peroni, Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy and Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy;

(5) David Shotton, Oxford e-Research Centre, University of Oxford, Oxford, United Kingdom.

5. Discussion

As shown in Section 2, when considering only semantic publishing datasets, OpenCitations Meta, which currently includes data from Crossref, DataCite, and the NIH Open Citation Collection (ICite et al., 2022), is first in data volume. Moreover, work is already underway to ingest data from new sources, such as the Japan Link Center (Hara, 2020), the OpenAIRE Research Graph (Atzori et al., 2017), and the Dryad Digital Repository (Vision, 2010).

Compared to the OpenAIRE Research Graph, OpenCitations Meta has a functional advantage: the use of OMIDs, globally unique persistent identifiers used internally to identify every entity within OpenCitations Meta. OMIDs make it possible to represent and index citations between bibliographic resources that lack an external persistent identifier such as a Digital Object Identifier (DOI). This feature adds significant value to the OpenCitations Indexes, as it allows, for the first time, the ingestion of many citations that until now could not be characterised, particularly citations between publications from the humanities and social sciences (Gorraiz et al., 2016), and citations involving primary sources, e.g. a statue, a painting, or a codex, which typically lack a persistent identifier. Importantly, having an OMID also permits the identified resource to be assigned a unique URL, for example https://w3id.org/oc/meta/br/061401975837 for omid:br/061401975837.
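The correspondence between an OMID and its URL in the example above can be sketched in a few lines of Python. This is only an illustration: the base URL and the sample identifier come from the text, while the function name omid_to_url and the assumed identifier shape (a two-letter entity-type code followed by digits) are our own assumptions, not part of any OpenCitations software.

```python
import re

# Base URL taken from the example in the text.
OMID_BASE_URL = "https://w3id.org/oc/meta/"

# Assumed shape of an OMID: "omid:" + two-letter entity type + "/" + digits.
# This pattern is inferred from the single example above, not from a specification.
_OMID_PATTERN = re.compile(r"^omid:([a-z]{2})/(\d+)$")


def omid_to_url(omid: str) -> str:
    """Return the URL for an OMID such as 'omid:br/061401975837' (hypothetical helper)."""
    match = _OMID_PATTERN.match(omid.strip())
    if match is None:
        raise ValueError(f"not a recognised OMID: {omid!r}")
    entity_type, number = match.groups()
    return f"{OMID_BASE_URL}{entity_type}/{number}"


print(omid_to_url("omid:br/061401975837"))
# prints https://w3id.org/oc/meta/br/061401975837
```

Note that the leading zero of the numeric part is preserved, because the identifier is handled as a string rather than parsed as a number.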

Another feature that, to the best of our knowledge, is only present in OpenCitations Meta is the mechanism for change-tracking management within the provenance information stored in RDF. This information can be queried using the Python time-agnostic-library software (Massari & Peroni, 2022), which can perform time-traversal SPARQL queries, i.e. queries across different snapshots together with provenance information.
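To give an intuition of what a time-traversal query has to do, the following toy model reconstructs the state of an entity at a chosen moment from a list of change snapshots. It is a deliberately simplified sketch: it does not use the API of the library cited above, nor the actual RDF provenance model, and every name in it (Snapshot, state_at, the sample identifiers and agents) is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), as plain strings


@dataclass(frozen=True)
class Snapshot:
    """One recorded change: when it happened, who made it, and what changed."""
    generated_at: datetime
    responsible_agent: str
    added: FrozenSet[Triple] = frozenset()
    removed: FrozenSet[Triple] = frozenset()


def state_at(snapshots: Iterable[Snapshot], moment: datetime) -> Set[Triple]:
    """Replay every snapshot generated up to `moment`, oldest first."""
    state: Set[Triple] = set()
    for snap in sorted(snapshots, key=lambda s: s.generated_at):
        if snap.generated_at > moment:
            break
        state -= snap.removed
        state |= snap.added
    return state


# Hypothetical history: a title is imported with a typo and later corrected.
typo = ("br/1", "title", "Exmaple title")
fixed = ("br/1", "title", "Example title")
history = [
    Snapshot(datetime(2022, 1, 1), "agent:importer", added=frozenset({typo})),
    Snapshot(datetime(2023, 1, 1), "agent:curator",
             added=frozenset({fixed}), removed=frozenset({typo})),
]

print(state_at(history, datetime(2022, 6, 1)))  # {('br/1', 'title', 'Exmaple title')}
print(state_at(history, datetime(2023, 6, 1)))  # {('br/1', 'title', 'Example title')}
```

Because each snapshot also records the responsible agent and the generation time, the same history can answer both "what did the data look like then?" and "who changed it, and when?".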

Among bibliographic datasets that do not use Semantic Web technologies, OpenAlex (Priem et al., 2022) is an important point of comparison for OpenCitations Meta. OpenAlex uses web crawls to add missing metadata, which allows it to automatically correct more of the errors present in the source data than OpenCitations Meta can.

Indeed, currently, the main limitation of OpenCitations Meta concerns the quality of the data, which is strictly dependent on the quality of the sources. Crossref does not double-check the metadata provided by publishers, and thus many errors are preserved. For instance, it is possible to encounter articles published in the future (the metadata available at https://api.crossref.org/v1/works/10.12960/tsh.2020.0006 say that the article will be published in print in 2029). Some of these errors can be corrected automatically without any background knowledge, while others require either the use of web crawlers or manual intervention. While OpenAlex is pursuing the path of web crawls, OpenCitations is working on a framework that will allow the editing and curation of data by trusted human domain experts (such as academic librarians).
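A future publication date is an example of an error that can be detected without background knowledge. The sketch below flags a record whose print publication year lies after the current year. It assumes the record is a dictionary in the style of Crossref JSON, with a "published-print" field containing "date-parts"; the sample record is hand-written from the year mentioned above rather than fetched from the API, and the function names are hypothetical.

```python
from datetime import date
from typing import Optional


def print_publication_year(record: dict) -> Optional[int]:
    """Extract the print publication year from a Crossref-style record, if any."""
    date_parts = record.get("published-print", {}).get("date-parts", [])
    if date_parts and date_parts[0] and isinstance(date_parts[0][0], int):
        return date_parts[0][0]
    return None


def is_future_dated(record: dict, today: Optional[date] = None) -> bool:
    """Return True if the record claims a print year later than the current year."""
    today = today or date.today()
    year = print_publication_year(record)
    return year is not None and year > today.year


# Hand-written illustrative record; only the DOI and the year come from the text.
sample = {
    "DOI": "10.12960/tsh.2020.0006",
    "published-print": {"date-parts": [[2029]]},
}

print(is_future_dated(sample, today=date(2024, 6, 3)))  # True
```

The check compares years only, so it is conservative: a date later in the current year is not flagged. Detecting the anomaly is the easy part; deciding what the correct date should be is the step that may need a web crawl or a human curator.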

OpenCitations Meta fulfils its primary purpose by holding the bibliographic metadata required to describe the citing and cited publications involved in the citations within the OpenCitations Indexes. In addition to these bibliographic metadata elements, however, we are well aware that there are additional metadata elements of great importance for the academic community: Abstracts, for text mining, domain and subject field determination, and indexing (even if the full texts of the publications are available open access elsewhere), and Funder IDs, Funding information and Institutional identifiers, essential for determining performance metrics and undertaking research assessment. Once we have completed the provision of our textual search operations, expanded our coverage in the ways indicated, and enhanced the computational infrastructure upon which OpenCitations Meta and the OpenCitations Indexes run, we will proceed to integrate and populate these additional metadata fields.

The provision of high-quality bibliographic metadata is a complex and difficult goal to achieve by automated operations, while the scale of the operations precludes manual curation except for a minority of records. No bibliographic dataset is currently able to achieve this goal on its own. For this reason, all the available bibliographic databases should be viewed as complementary. For example, while at the moment OpenAlex provides better quality metadata, OpenCitations Meta has complete provenance data openly available, and enables more complex searches thanks to the capabilities offered by Semantic Web technologies. One such search is "Search for all authors who co-authored with Silvio Peroni or Fabio Vitali in conference proceedings that were published by Springer after 2009". Furthermore, OpenAlex is only partially free, since a fee must be paid to make more than a hundred thousand requests per day via the API and to access data updated every hour via the API (instead of every month via the dump)[9]. In contrast, users can make unlimited requests to the latest version of OpenCitations Meta for free.

Also, although the OpenAIRE Research Graph currently contains more metadata, those data are released under a CC-BY attribution licence, whereas the data in OpenCitations Meta are released under a CC0 public domain waiver, permitting complete freedom for reuse, including commercial reuse, and for machine processing without any requirement for attribution.

This paper is available on arXiv under CC 4.0 DEED license.
