FREME Datasets

This page lists the datasets that were converted in the FREME project. They can be used with FREME NER.

ORCID

Description:

ORCID (Open Researcher and Contributor ID) is a nonproprietary alphanumeric code that uniquely identifies scientific and other academic authors. This dataset contains an RDF conversion of the ORCID dataset. The current conversion is based on the 2014 ORCID data dump, which contains around 1.3 million JSON files amounting to 41GB of data.

The converted RDF version is 13GB in size (uncompressed). It is modelled with well-known vocabularies such as Dublin Core, FOAF and schema.org, and it is interlinked with GeoNames.

A dump of the converted dataset can be downloaded from here.

Acknowledgments: The conversion of this corpus was supported by the FREME H2020 project.

Author and Maintainer:

Milan Dojchinovski

URL:

Statbel Corpus

Description:

This corpus contains an RDF conversion of datasets from "Statistics Belgium" (also known as Statbel), which aims at collecting, processing and disseminating relevant, reliable and commented data on Belgian society: http://statbel.fgov.be/en/statistics/figures/

Currently, the corpus contains three datasets:

  • Belgian house price index dataset (dump): measures inflation in the residential property market in Belgium. The data for conversion was obtained from here (http://statbel.fgov.be/en/statistics/figures/economy/construction_industry/house_price_index/).

  • Employment, unemployment, labour market structure dataset (dump): data on employment, unemployment and the labour market from the labour force survey conducted among Belgian households. The data for conversion was obtained from here.

  • Unemployment and additional indicators dataset (dump): contains unemployment-related statistics about Belgium and its regions. The data for conversion was obtained from here.

The corpus is provided in RDF and it is modelled using the Data Cube vocabulary.
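
As a rough illustration of what a Data Cube observation can look like, the sketch below serializes a single house price index value as N-Triples. The qb: terms (qb:Observation, qb:dataSet) come from the Data Cube vocabulary, and refPeriod and obsValue come from the SDMX dimension and measure vocabularies. The example.org URIs, the period and the value are invented placeholders; the actual Statbel dumps may use different dimensions and identifiers.

```python
# Namespaces of the Data Cube and SDMX vocabularies.
QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


def observation_triples(obs_uri, dataset_uri, period, value):
    """Return N-Triples lines describing one qb:Observation."""
    subject = f"<{obs_uri}>"
    return [
        f"{subject} <{RDF_TYPE}> <{QB}Observation> .",
        f"{subject} <{QB}dataSet> <{dataset_uri}> .",
        f'{subject} <{SDMX_DIM}refPeriod> "{period}" .',
        f'{subject} <{SDMX_MEASURE}obsValue> "{value}"^^<{XSD_DECIMAL}> .',
    ]


# Hypothetical observation: all URIs and values below are placeholders.
lines = observation_triples(
    "http://example.org/statbel/hpi/obs/2014-Q1",
    "http://example.org/statbel/hpi",
    "2014-Q1",
    "101.5",
)
print("\n".join(lines))
```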

Acknowledgments: The conversion of this corpus was supported by the FREME H2020 project.

Author and Maintainer:

Milan Dojchinovski

URL:

Global Airports in RDF

Description:

This corpus contains an RDF conversion of the Global Airports dataset, which was retrieved from openflights.org. The dataset contains airport names, locations, codes, and other related information.

A dump of the dataset can be downloaded from here. The corpus is provided in RDF and is interlinked with DBpedia.

Acknowledgments: The conversion of this corpus was supported by the FREME H2020 project.

Author and Maintainer:

Milan Dojchinovski

URL:

DBpedia abstract corpus

Description:

This corpus contains a conversion of Wikipedia abstracts in six languages (Dutch, English, French, German, Italian and Spanish) into the NLP Interchange Format (NIF). The corpus contains the abstract texts, as well as the position, surface form and linked article of all links in the text. As such, it contains entity mentions manually disambiguated to Wikipedia/DBpedia resources by native speakers, which makes it well suited for NER training and evaluation.

Furthermore, the abstracts represent a special form of text that lends itself to more sophisticated tasks, such as open relation extraction. Their encyclopedic style, which follows Wikipedia guidelines on opening paragraphs, adds further interesting properties. The first sentence puts the article in a broader context. Most anaphors refer to the original topic of the text, making them easier to resolve. Finally, should the same string occur with different meanings, Wikipedia guidelines suggest that the new meaning should again be linked for disambiguation. In short, this type of text is highly interesting.
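
To make the link annotations described above more concrete, the following sketch models one link as a begin offset, an end offset, a surface form and a target resource, and checks that the offsets really select the surface form in the abstract. The field names only mirror the NIF and ITS RDF properties named in the comments. The sample sentence, the helper names and the choice to treat the end offset as exclusive are assumptions made for illustration, not details taken from the corpus.

```python
from dataclasses import dataclass


@dataclass
class Link:
    begin: int   # start offset in the abstract, cf. nif:beginIndex
    end: int     # end offset, treated here as exclusive, cf. nif:endIndex
    anchor: str  # surface form of the link, cf. nif:anchorOf
    target: str  # linked resource, cf. itsrdf:taIdentRef


def make_link(text, surface, target, start=0):
    """Locate the first occurrence of `surface` at or after `start`."""
    begin = text.index(surface, start)
    return Link(begin, begin + len(surface), surface, target)


def is_consistent(text, link):
    """True if the offsets select exactly the recorded surface form."""
    return text[link.begin:link.end] == link.anchor


# Hypothetical abstract text and links, for illustration only.
abstract = "Leipzig is a city in the German state of Saxony."
links = [
    make_link(abstract, "German", "http://dbpedia.org/resource/Germany"),
    make_link(abstract, "Saxony", "http://dbpedia.org/resource/Saxony"),
]

for link in links:
    print(link.begin, link.end, link.anchor, link.target, is_consistent(abstract, link))
```

A check like is_consistent is useful before training, because a single shifted offset silently produces wrong entity spans.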

Acknowledgments: The conversion of this corpus was supported by the FREME H2020 project.

Author and Maintainer:

Martin Brümmer

URL:

CORDIS corpus

Description:

CORDIS (Community Research and Development Information Service) is the European Commission’s core public repository providing dissemination information for all EU-funded research projects. This dataset contains an RDF conversion of the CORDIS FP7 dataset, which provides descriptions of projects funded by the European Union under the Seventh Framework Programme for research and technological development (FP7) from 2007 to 2013. The converted dataset contains over 1 million RDF triples with a total size of around 200MB in the N-Triples RDF serialization format.

The dataset is modelled with well-known vocabularies such as Dublin Core, FOAF, the DBpedia ontology and DOAP, and it is interlinked with DBpedia. A dump of the converted dataset can be downloaded from here.
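
Because N-Triples stores one triple per line, a dump of this size can be inspected line by line without loading it into memory. The sketch below counts how often each predicate occurs. It is a simplified illustration under stated assumptions: the regular expression only recognises the subject and predicate at the start of a line and is not a full N-Triples parser, and the three sample triples use invented example.org subjects rather than real CORDIS data.

```python
import io
import re
from collections import Counter

# Matches "<subject-iri> <predicate-iri> " or "_:bnode <predicate-iri> ".
TRIPLE_START = re.compile(r"^(<[^>]*>|_:\S+)\s+<([^>]*)>\s")


def predicate_counts(lines):
    """Count predicates in an iterable of N-Triples lines."""
    counts = Counter()
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE_START.match(stripped)
        if match:
            counts[match.group(2)] += 1
    return counts


# Invented sample data standing in for a real dump file.
sample = io.StringIO(
    '<http://example.org/project/1> <http://purl.org/dc/terms/title> "Project one" .\n'
    "# a comment line\n"
    "<http://example.org/project/1> <http://xmlns.com/foaf/0.1/homepage> <http://example.org/> .\n"
    '<http://example.org/project/2> <http://purl.org/dc/terms/title> "Project two" .\n'
)
counts = predicate_counts(sample)
print(sum(counts.values()), "triples")
for predicate, n in counts.most_common():
    print(n, predicate)
```

For a downloaded dump, the same function can be given an open file handle instead of the in-memory sample.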

Acknowledgments: The conversion of this corpus was supported by the FREME H2020 project.

Author and Maintainer:

Milan Dojchinovski

URL:

VIAF

Description:

The VIAF® (Virtual International Authority File) combines multiple name authority files into a single OCLC-hosted name authority service. The goal of the service is to lower the cost and increase the utility of library authority files by matching and linking widely-used authority files and making that information available on the Web.

URL:

Geopolitical Ontology

Description:

The FAO geopolitical ontology and related services have been developed to facilitate data exchange and sharing in a standardized manner among systems managing information about countries and/or regions.

The geopolitical ontology ensures that FAO and associated partners can rely on a master reference for geopolitical information, as it manages names in multiple languages (English, French, Spanish, Arabic, Chinese, Russian and Italian); maps standard coding systems (UN, ISO, FAOSTAT, AGROVOC, etc); provides relations among territories (land borders, group membership, etc); and tracks historical changes.

URL:

ONLD

Description:

The NCSU Organization Name Linked Data (ONLD) is based on the NCSU Organization Name Authority, a tool maintained by the Acquisitions & Discovery department since 2009 to manage the variant forms of name for journal and e-resource publishers, providers, and vendors in E-Matrix, our locally-developed electronic resource management system (ERMS).

The information in the NCSU Organization Name Linked Data is represented as RDF triples using properties from the SKOS, RDF Schema, FOAF, and OWL vocabularies. Clicking on the name of each property will take users to the property's definition. The authorized form of name for each organization is recorded with skos:prefLabel and variant forms of name are recorded with skos:altLabel. All of the organizations are associated with relevant classes in several popular vocabularies using rdf:type. The webpage of the organization is recorded using foaf:homepage.
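
The split between skos:prefLabel and skos:altLabel is what makes this kind of dataset useful for name normalization. The sketch below builds a case-insensitive lookup from any recorded name to the authorized form. The triples are represented as plain Python tuples, and the organization, its names and the function names are hypothetical placeholders rather than entries or tooling from ONLD.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"


def build_name_index(triples):
    """Map every known name (lower-cased) to its organization URI."""
    preferred = {}  # organization URI -> authorized form of name
    variants = {}   # lower-cased variant form -> organization URI
    for subject, predicate, obj in triples:
        if predicate == SKOS + "prefLabel":
            preferred[subject] = obj
        elif predicate == SKOS + "altLabel":
            variants[obj.lower()] = subject
    # Authorized forms resolve to themselves and win over variants.
    index = {name.lower(): uri for uri, name in preferred.items()}
    for variant, uri in variants.items():
        index.setdefault(variant, uri)
    return preferred, index


def authorized_name(name, preferred, index):
    """Return the authorized form for `name`, or None if it is unknown."""
    uri = index.get(name.lower())
    return preferred.get(uri) if uri is not None else None


# Hypothetical organization with two variant forms of name.
org = "http://example.org/org/1"
triples = [
    (org, SKOS + "prefLabel", "Example Press"),
    (org, SKOS + "altLabel", "Example Pr."),
    (org, SKOS + "altLabel", "Ex. Press"),
]
preferred, index = build_name_index(triples)
print(authorized_name("ex. press", preferred, index))
```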

URL:

GRID

Description:

The Global Research Identifier Database (GRID) provides identifiers for research organizations worldwide.

URL: