Named Entity Recognition against Custom Dataset

Introduction

Most named entity recognition (NER) tools link the entities occurring in a text only against the single dataset provided by the NER system. However, users often want to match (link) the entities occurring in a document against a proprietary, domain-specific dataset. In this tutorial you will learn how to perform named entity linking against a custom dataset using the e-Entity services.

Quick links

  1. Pre-requisites
  2. Step 1: Prepare your dataset
  3. Step 2: Submit your dataset
  4. Step 3: Perform Named Entity Recognition with your dataset
  5. Step 4: Simplify the results (in CSV)
  6. Step 5: Remove your dataset
  7. cURL examples
  8. How to upload a large dataset

Pre-requisites

  1. A custom dataset provided in RDF containing entities and labels for the entities.
  2. The authentication token of an existing FREME user; see the authentication article. This tutorial page uses an existing dummy user, so you do not have to handle this yourself.
  3. A document/text which will be used for processing.

Step 1: Prepare your dataset

First we need to prepare our custom dataset. The dataset should be provided in RDF and contain a list of entity names. Below is a small example of such a dataset in the N-Triples format. In each record, the subject (first element) is a URI identifier for the entity, the predicate (second element) is the type of information we describe (preferred or alternative label for the entity), and the object (last element) is the actual name of the entity.

<http://www.freme-projects.eu/dataset/people/Milan_Dojchinovski> <http://www.w3.org/2004/02/skos/core#prefLabel> "Milan Dojchinovski" .
<http://www.freme-projects.eu/dataset/people/Milan_Dojchinovski> <http://www.w3.org/2004/02/skos/core#altLabel> "Milan" .
<http://www.freme-projects.eu/dataset/people/Sebastian_Hellmann> <http://www.w3.org/2004/02/skos/core#prefLabel> "Sebastian Hellmann" .
<http://www.freme-projects.eu/dataset/people/Sebastian_Hellmann> <http://www.w3.org/2004/02/skos/core#altLabel> "Sebastian" .
<http://www.freme-projects.eu/dataset/people/Felix_Sasaki> <http://www.w3.org/2004/02/skos/core#prefLabel> "Felix Sasaki" .
<http://www.freme-projects.eu/dataset/people/Felix_Sasaki> <http://www.w3.org/2004/02/skos/core#altLabel> "Felix" .
<http://www.freme-projects.eu/dataset/people/Jan_Nehring> <http://www.w3.org/2004/02/skos/core#prefLabel> "Jan Nehring" .
<http://www.freme-projects.eu/dataset/people/Jan_Nehring> <http://www.w3.org/2004/02/skos/core#altLabel> "Jan" .
<http://www.freme-projects.eu/dataset/org/INFAI> <http://www.w3.org/2004/02/skos/core#prefLabel> "INFAI" .
<http://www.freme-projects.eu/dataset/org/DFKI> <http://www.w3.org/2004/02/skos/core#prefLabel> "DFKI" .
<http://www.freme-projects.eu/dataset/org/Tilde> <http://www.w3.org/2004/02/skos/core#prefLabel> "Tilde" .
<http://www.freme-projects.eu/dataset/org/ISMB> <http://www.w3.org/2004/02/skos/core#prefLabel> "ISMB" .
<http://www.freme-projects.eu/dataset/org/Wripl> <http://www.w3.org/2004/02/skos/core#prefLabel> "Wripl" .
<http://www.freme-projects.eu/dataset/org/Vistatec> <http://www.w3.org/2004/02/skos/core#prefLabel> "Vistatec" .
<http://www.freme-projects.eu/dataset/org/AgroKnow> <http://www.w3.org/2004/02/skos/core#prefLabel> "AgroKnow" .
<http://www.freme-projects.eu/dataset/org/iMinds> <http://www.w3.org/2004/02/skos/core#prefLabel> "iMinds" .


The dataset above contains descriptions of 12 entities. For example, it records that the entity Milan Dojchinovski is known as "Milan" as well as "Milan Dojchinovski".
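
Before submitting, you may want to check that the file is syntactically valid. A minimal sketch, assuming the dataset is stored in a file named dataset.nt and the Raptor rapper command-line tool is installed:

# Parse dataset.nt as N-Triples; prints only the triple count, or reports the first syntax error
rapper -i ntriples -c dataset.nt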

Step 2: Submit the custom dataset to the e-Entity service

We need to submit our dataset before we can use it for entity linking. You can re-use the example dataset above.

Dataset content
Paste the dataset content here. It should be valid RDF in the serialization format specified below.


Dataset serialization format
Specify the RDF serialization format of the dataset.


Dataset name
The name of the dataset. It will be used as its ID, therefore it must be unique.


Authorization token
Every dataset must be owned by a user; see the authentication article for further information. In this tutorial we use a predefined dummy user, so you do not have to handle this yourself.


Dataset description
Short description of the dataset.


Response

Here you will get information on whether your dataset was successfully accepted and processed.
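
Putting these fields together on the command line, the request looks like the submit example in the cURL section at the end of this tutorial; NAME, DATASET_DESCRIPTION and YOUR_TOKEN_HERE are placeholders to fill in:

# Create a dataset named NAME from the local file dataset.nt
curl -v -d @dataset.nt "https://api.freme-project.eu/current/e-entity/freme-ner/datasets?name=NAME&description=DATASET_DESCRIPTION&informat=n-triples&outformat=json&language=en&token=YOUR_TOKEN_HERE" -H "Content-Type: text/n3"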
    

Step 3: Perform Named Entity Recognition with your dataset

After submitting our custom dataset to the e-Entity service, we can proceed with entity spotting and linking against it.

Text for processing
Paste the text to be processed here. The recognized entities will be linked with your dataset. The text is sent as the body of the request.


Text language
Specify the language of the input. E.g. "language=en".


Input format
Since we send plain text, we set the input format to text/plain, either via the Content-Type header ("Content-Type: text/plain") or via the "informat" parameter ("informat=text").


Output format
Specify the RDF serialization format for the output. You can set the output format by setting the outformat parameter, e.g. "outformat=turtle", or by setting the "Accept" header, e.g. "Accept: text/turtle".


Dataset name
The name of the dataset used for entity linking. The value should be the same as you specified when you submitted the dataset for creation. You can specify the target dataset using the "dataset" parameter. E.g. "dataset=dbpedia".


Response

Here you will get the results from the named entity recognition using your dataset.
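
As a sketch, the full request with all of the above parameters could look as follows; the input sentence is only an example, and NAME must match the dataset name from Step 2:

# Spot entities in the sentence and link them against the custom dataset NAME
curl -v -d "Milan Dojchinovski works with Felix Sasaki." "https://api.freme-project.eu/current/e-entity/freme-ner/documents?dataset=NAME&informat=text&outformat=turtle&language=en" -H "Content-Type: text/plain"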
    

Step 4: Simplify the results (in CSV)

If you are not familiar with the RDF/NIF format, you can retrieve the results in a simplified form (e.g. in CSV) by specifying a filter. In the following example we use the "extract-entities-only" filter, which outputs the results in CSV.

Text for processing
Paste the text to be processed here. The recognized entities will be linked with your dataset. The text is sent as the body of the request.


Text language
Specify the language of the input. E.g. "language=en".


Filter name
Specify the name of the filter which will be used to simplify the results. E.g. "filter=extract-entities-only".


Input format
Since we send plain text, we set the input format to text/plain, either via the Content-Type header ("Content-Type: text/plain") or via the "informat" parameter ("informat=text").


Output format
Specify the RDF serialization format for the output. You can set the output format by setting the outformat parameter, e.g. "outformat=turtle", or by setting the "Accept" header, e.g. "Accept: text/turtle".


Dataset name
The name of the dataset used for entity linking. The value should be the same as you specified when you submitted the dataset for creation. You can specify the target dataset using the "dataset" parameter. E.g. "dataset=dbpedia".


Response

Here you will get the simplified (CSV) results from the named entity recognition using your dataset.
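
A sketch of the same request as in Step 3, extended with the filter parameter; whether outformat is still honoured when a filter is active is an assumption to verify against the API documentation:

# Same call as in Step 3, plus the extract-entities-only filter for simplified CSV output
curl -v -d "Milan Dojchinovski works with Felix Sasaki." "https://api.freme-project.eu/current/e-entity/freme-ner/documents?dataset=NAME&informat=text&language=en&filter=extract-entities-only" -H "Content-Type: text/plain"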
    

Step 5: Remove your dataset

Dataset name
The name of the dataset which you want to delete.


Authorization token
The access token of the owner of the dataset you want to delete.


Response

Here you will get the response to your delete request.

cURL Examples

Below you can find some useful cURL command examples.

Submit dataset

curl -v -d @dataset.nt "https://api.freme-project.eu/current/e-entity/freme-ner/datasets?name=NAME&description=DATASET_DESCRIPTION&informat=n-triples&outformat=json&language=en&token=YOUR_TOKEN_HERE" -H "Content-Type: text/n3"

Submit text for named entity recognition

curl -v -d "Diego Maradona is from Argentina." "https://api.freme-project.eu/current/e-entity/freme-ner/documents?dataset=dbpedia&informat=text&outformat=turtle&language=en" -H "Content-Type: text/plain"

How to upload a large dataset

In real-world use cases, datasets can become very large. To make the upload of large datasets feasible, consider the following hints:

  • Split the dataset into chunks of, for example, 1000 lines each. A dataset that is too large will not fit into a single HTTP request.
  • Start with an HTTP POST request to create a new dataset and upload the first chunk. Afterwards, add the remaining chunks with HTTP PUT requests, which append the data to the previously created dataset. See the interactive API documentation for how to perform the two types of requests, and the sketch after this list.
  • Use N-Triples as the serialization format for your dataset, if possible. This makes chunking easy because N-Triples holds one triple per line.
  • If possible, execute the upload requests from the same server that runs FREME-NER to speed up processing.
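
A minimal sketch of such a chunked upload, assuming the create endpoint from the cURL examples above; the PUT URL addressed by the dataset name is an assumption, so check the interactive API documentation:

# Split the dataset into chunks of 1000 lines each (chunk_aa, chunk_ab, ...)
split -l 1000 dataset.nt chunk_

# Create the dataset with the first chunk (HTTP POST)
curl -X POST -d @chunk_aa "https://api.freme-project.eu/current/e-entity/freme-ner/datasets?name=NAME&informat=n-triples&token=YOUR_TOKEN_HERE" -H "Content-Type: text/n3"

# Append the remaining chunks (HTTP PUT)
for f in chunk_*; do
  [ "$f" = "chunk_aa" ] && continue
  curl -X PUT -d @"$f" "https://api.freme-project.eu/current/e-entity/freme-ner/datasets/NAME?informat=n-triples&token=YOUR_TOKEN_HERE" -H "Content-Type: text/n3"
done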

Have a look at the Freme Dataset Tool; it can support you with dataset creation and uploading.