OpenCitations Meta: Related Works

3 Jun 2024

Authors:

(1) Arcangelo Massari, Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy {arcangelo.massari@unibo.it};

(2) Fabio Mariani, Institute of Philosophy and Sciences of Art, Leuphana University, Lüneburg, Germany {fabio.mariani@leuphana.de};

(3) Ivan Heibi, Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy and Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy {ivan.heibi2@unibo.it};

(4) Silvio Peroni, Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy and Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy {silvio.peroni@unibo.it};

(5) David Shotton, Oxford e-Research Centre, University of Oxford, Oxford, United Kingdom {david.shotton@opencitations.net}.

In this section, we review the most important scholarly publishing datasets that can be accessed without a subscription, i.e. publicly available datasets holding scholarly bibliographic metadata. Since OpenCitations Meta uses Semantic Web technologies to represent data, special attention will be given to RDF datasets, namely Wikidata, Springer Nature SciGraph, BioTea, the Open Research Knowledge Graph and ScholarlyData. In addition, the OpenAIRE Research Graph, OpenAlex and Semantic Scholar will be described, as they are the most extensive datasets in terms of the number of works, although they do not represent data semantically.

OpenAlex (Priem et al., 2022) rose from the ashes of the Microsoft Academic Graph on January 1st 2022, and inherited all its metadata. It includes data from Crossref (Hendricks et al., 2020), PubMed (Maloney et al., 2013), ORCID (Haak et al., 2012), ROR (Lammey, 2020), DOAJ (Morrison, 2017), Unpaywall (Dhakal, 2019), arXiv (Sigurdsson, 2020), Zenodo (Research & OpenAIRE, 2013), the ISSN International Centre[1], and the Internet Archive’s General Index[2]. In addition, web crawls are used to add missing metadata. With over 240 million works[3], OpenAlex is the most extensive bibliographic metadata dataset to date. OpenAlex assigns persistent identifiers to each resource. In addition, authors are disambiguated through heuristics based on co-authors, citations, and other features of the bibliographic resources. The data are distributed under a CC0 licence and can be accessed via an API, via a web interface, or by downloading a full snapshot copy of the OpenAlex database.

The OpenAIRE project started in 2008 to support the adoption of the European Commission Open Access mandates (Manghi et al., 2010), and it is now the flagship organisation within the Horizon 2020 research and innovation programme to realise the European Open Science Cloud (European Commission. Directorate General for Research and Innovation., 2016). One of its primary outcomes is the OpenAIRE Research Graph, which includes metadata about scholarly outputs (e.g. literature, datasets and software), organisations, research funders, funding streams, projects, and communities, together with provenance information. Data are harvested from a variety of sources (Atzori et al., 2017): archives, e.g. arXiv (Sigurdsson, 2020), Europe PMC (The Europe PMC Consortium, 2015), Software Heritage (Abramatic et al., 2018) and Zenodo (Research & OpenAIRE, 2013); aggregator services, e.g. DOAJ (Morrison, 2017) and OpenCitations (Peroni & Shotton, 2020); and other research graphs, e.g. Crossref (Hendricks et al., 2020) and DataCite (Brase, 2009). As of June 2023, this OpenAIRE dataset consisted of 232,174,001 research products[4]. The deduplication process implemented by OpenAIRE takes into account not only PIDs but also other heuristics, such as the number of authors and the Levenshtein distance between titles. However, the internal identifiers OpenAIRE associates with entities are not persistent and may change when the data are updated. Data of the OpenAIRE Research Graph can be accessed via an API and the Explore interface. Dumps are also available under a Creative Commons Attribution 4.0 International Licence.
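To make the deduplication heuristics mentioned above more concrete, the following is a minimal Python sketch of how shared PIDs, the number of authors and the Levenshtein distance between titles could be combined. It is not OpenAIRE's actual implementation: the record fields (`pids`, `authors`, `title`), the function names and the `max_ratio` threshold are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalise(title: str) -> str:
    """Lower-case the title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def likely_duplicates(rec_a: dict, rec_b: dict, max_ratio: float = 0.1) -> bool:
    """Heuristic duplicate check; field names and threshold are hypothetical."""
    # A shared persistent identifier is treated as decisive.
    if set(rec_a.get("pids", [])) & set(rec_b.get("pids", [])):
        return True
    # Otherwise, require the same number of authors...
    if len(rec_a["authors"]) != len(rec_b["authors"]):
        return False
    # ...and titles whose edit distance is small relative to their length.
    title_a, title_b = normalise(rec_a["title"]), normalise(rec_b["title"])
    longest = max(len(title_a), len(title_b))
    if longest == 0:
        return False
    return levenshtein(title_a, title_b) / longest <= max_ratio
```

Dividing the distance by the length of the longer title keeps the threshold comparable across short and long titles; a production system would likely add further signals, such as publication year or venue.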

Semantic Scholar was introduced by the Allen Institute for Artificial Intelligence in 2015 (Fricke, 2018). It is a search engine that uses artificial intelligence to select only the papers most relevant to the user’s search and to simplify exploration, e.g. by producing automatic summaries. Semantic Scholar sources its content via web indexing and partnerships with scientific journals, indexes, and content providers. Among those are the Association for Computational Linguistics, Cambridge University Press, IEEE, PubMed, Springer Nature, The MIT Press, Wiley, arXiv, and HAL. As of June 2023, it indexes 212,605,886 scholarly works[5]. Authors are disambiguated via an artificial intelligence model (Subramanian et al., 2021) and associated with a Semantic Scholar ID, and a page is automatically generated for each author, which the real person can redeem. Semantic Scholar provides a web interface and APIs, and the complete dataset is downloadable under the Open Data Commons Attribution Licence (ODC-By) v1.0.

Wikidata was introduced in 2012 by Wikimedia Deutschland as an open knowledge base to store in RDF data from other Wikimedia projects, such as Wikipedia, Wikivoyage, Wiktionary, and Wikisource (Mora-Cantallops et al., 2019). Due to its success, in 2014 Google closed Freebase, which was intended to become a “Wikipedia for structured data”, and migrated its content to Wikidata (Tanon et al., 2016). Since 2016, the WikiCite project has contributed significantly to the evolution of Wikidata as a bibliographic database, such that, by June 2023, Wikidata contained descriptions of 39,864,447 academic articles[6]. The internal Wikidata identifier referring to any entity (including bibliographic resources) is associated with numerous external identifiers, e.g. DOI, PMID, PMCID, arXiv, ORCID, Google Scholar, VIAF, Crossref funder ID, ZooBank and Twitter. The data are released under a CC0 licence as RDF dumps in Turtle and N-Triples. Users can browse them via SPARQL, a web interface and, as of 2017, via Scholia, a web service which performs real-time SPARQL queries to generate profiles on researchers, organisations, journals, publishers, academic works and research topics, while also generating valuable infographics (Nielsen et al., 2017).
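As an illustration of SPARQL access to Wikidata's bibliographic data, the sketch below builds a query that looks up a scholarly article by its DOI. It relies on the Wikidata identifiers P31 (instance of), Q13442814 (scholarly article) and P356 (DOI), and on the prefixes that the Wikidata Query Service declares by default. The function name is hypothetical, and upper-casing the DOI reflects the convention we assume Wikidata follows when storing DOIs. The function only produces the query text; sending it to the endpoint is left to the reader.

```python
def scholarly_article_by_doi_query(doi: str) -> str:
    """Return a SPARQL query string for the Wikidata Query Service.

    Hypothetical helper: it only builds the query text and performs
    no network request.
    """
    # Assumption: DOIs are stored in upper case on Wikidata.
    safe_doi = doi.upper().replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?work ?workLabel WHERE {\n"
        "  ?work wdt:P31 wd:Q13442814 ;\n"       # instance of: scholarly article
        f'        wdt:P356 "{safe_doi}" .\n'     # DOI
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        "LIMIT 10"
    )
```

The resulting string can be pasted into the query editor at https://query.wikidata.org/ or submitted programmatically with any SPARQL client.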

While the OpenAIRE Research Graph and Wikidata aggregate many heterogeneous sources, Springer Nature SciGraph (Hammond et al., 2017) aggregates only data from Springer Nature and its partners. It contains entities concerning publications, affiliations, research projects, funders and conferences, totalling more than 14 million research products[7]. There is no current plan to offer a public SPARQL endpoint, but the data can be explored via a browser interface, and a dump is released monthly in JSON-LD format under a CC-BY licence.

BioTea is also a domain-oriented dataset, and represents the annotated full-text open-access subset of PubMed Central (PMC-OA) (Garcia et al., 2018) using RDF technologies. At the time of that 2018 paper, the dataset contained 1.5 million bibliographic resources. Unlike other datasets, BioTea not only describes metadata and citations but also represents the annotated full texts semantically. Named-entity recognition analysis is adopted to identify expressions and terminology related to biomedical ontologies that are then recorded as annotations (e.g. about biomolecules, drugs, and diseases). BioTea data are released as dumps in RDF/XML and JSON-LD formats under the Creative Commons Attribution Non-Commercial 4.0 International licence, while the SPARQL endpoint is currently offline.

A noteworthy approach is that adopted by the Open Research Knowledge Graph (ORKG) (Auer et al., 2020). Metadata are mainly collected either by trusted agents via crowdsourcing or automatically from Crossref. However, ORKG’s primary purpose is not to organise metadata but to provide services. The main scope of these services is to perform a literature comparison analysis using word embeddings to enable a similarity analysis and foster the exploration and linking of related works. To enable such sophisticated analyses, metadata from Crossref are insufficient; therefore, structured annotations on the topic, result, method, educational context and evaluator must be manually specified for each resource. The dataset contains (as of June 2023) 25,680 papers[8], 5153 datasets, 1364 software and 71 reviews. Given the importance of human contribution to the creation of the ORKG dataset, the platform keeps track of changes and provenance, although not in RDF format. The data can be explored through a web interface, SPARQL, and an API, and can also be downloaded under a CC BY-SA licence.

ScholarlyData collects information only about conferences and workshops on the topic of the Semantic Web (Nuzzolese et al., 2016). Data are modelled following the Conference Ontology, which describes typical entities in an academic conference, such as accepted papers, authors, their affiliations, and the organising committee, but not bibliographic references. As of June 2023, the dataset stored information about 5678 conference papers. The dataset is updated by employing the Conference Linked Open Data generator software, which outputs RDF starting from CSV files (Gentile & Nuzzolese, 2015). The deduplication of the agents is based only on their URIs using a supervised classification method (Zhang et al., 2017), while ORCIDs are added in a further step. This methodology does not address the existence of homonyms. However, this is a minor issue for ScholarlyData, since only a few thousand people are involved in the conferences being indexed. ScholarlyData can be explored via a SPARQL endpoint, and dumps are available in RDF/XML format under a Creative Commons Attribution 3.0 Unported licence.
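The general idea of producing RDF from CSV files can be sketched in a few lines of Python. This is not the Conference Linked Open Data generator itself, and it does not use the Conference Ontology: the CSV columns (`id`, `title`, `author_ids`), the `https://example.org/` namespace and the `Paper` class are placeholders chosen for illustration, while the title and creator properties are borrowed from Dublin Core Terms. The output is serialised as N-Triples.

```python
import csv
import io

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_CREATOR = "http://purl.org/dc/terms/creator"
BASE = "https://example.org/"            # placeholder namespace (assumption)
PAPER_CLASS = BASE + "ontology/Paper"    # placeholder class (assumption)


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def csv_to_ntriples(csv_text: str) -> list:
    """Turn rows with columns id, title, author_ids into N-Triples lines."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        paper = f"<{BASE}paper/{row['id']}>"
        title = escape_literal(row["title"])
        triples.append(f"{paper} <{RDF_TYPE}> <{PAPER_CLASS}> .")
        triples.append(f'{paper} <{DCT_TITLE}> "{title}" .')
        # Authors are given as a semicolon-separated list of local identifiers.
        for author_id in row["author_ids"].split(";"):
            author_id = author_id.strip()
            if author_id:
                triples.append(
                    f"{paper} <{DCT_CREATOR}> <{BASE}person/{author_id}> ."
                )
    return triples
```

For example, calling `csv_to_ntriples` on a file with a single row describing a paper with two authors yields four triples: one for the type, one for the title and one per author.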

To conclude, we would like to point out that none of the datasets mentioned above exposes change-tracking data and the related provenance information in RDF.

Table 1 summarises all the considerations made on each dataset.

This paper is available on arxiv under CC 4.0 DEED license.


[1] https://www.issn.org/

[2] https://archive.org/details/GeneralIndex

[3] https://docs.openalex.org/api-entities/works

[4] https://explore.openaire.eu/search/find/research-outcomes

[5] https://www.semanticscholar.org/

[6] https://scholia.toolforge.org/statistics

[7] https://scigraph.springernature.com/explorer/datasets/data_at_a_glance/

[8] https://orkg.org/papers