Reconcile Entities
Introduction
Entity reconciliation, also called entity linking and named entity disambiguation, is the step where we add unique identifiers in the form of URIs to your data to represent each unique entity. The goal is to use the same identifier every time that the same real-world thing is mentioned in your data, in other LINCS data, and, ideally, in linked data elsewhere on the web. By using the same identifier, we can connect all of the statements made about that entity to create a rich and informative graph about it.
This page covers how reconciliation fits into the data conversion workflows and the available tools, while our Reconciliation Guide gets into the details of how to reconcile entities.
For a list of all reconciliation tools that LINCS uses, see the Reconcile page.
Resources Needed
Because reconciliation is so time-consuming, we recommend starting to reconcile your data as early as possible. You can follow our Reconciliation Guide to start reconciling even before you have committed to the rest of the LINCS conversion process.
Reconciliation can be completed in tandem with the other conversion steps. Use placeholder values in the conversion until your Research Team has finished reconciling. Once URIs have been found, they can be added to either the source data or to the converted data to replace the placeholder values. The dataset, however, will not be published by LINCS until either the Research Team finishes their reconciliation or LINCS and the Research Team come to an agreement that no more reconciliation can take place, and new URIs need to be minted for the remaining entities. Note that once the data is published, the Research Team can continue to enhance it, including further reconciliation.
Reconciliation will always be completed by your Research Team because it requires domain knowledge to ensure you are choosing the correct identifiers. This is a great task for undergraduate or graduate research assistants.
The Conversion Team can offer guidance on this step, particularly on how to set up your data for reconciliation and how to merge the URIs back into your data.
Time Commitment
Reconciliation tends to be the slowest part of the conversion process. Tools that perform automated linking can speed it up, but at the cost of accuracy. That loss is greater for the kind of data coming into LINCS, which often references obscure or historically overlooked entities that are not well represented in existing LOD sources.
LINCS’s approach is to mix automation with manual review:
- Start with tools that automatically suggest candidate matches for entities
- If possible, apply filtering based on the context for each entity in your data and the authority data to separate trustworthy suggestions from ones that need review
- Have students manually review the uncertain candidates
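The filtering step in this list can be sketched as follows. This is a minimal illustration rather than LINCS tooling: the field names (`name`, `label`, `birth`, `death`), the candidate format, the placeholder URIs, and the rule "trust a candidate only when the name and both dates agree" are all assumptions made for the example.

```python
def triage(entity, candidates):
    """Split candidate matches into trusted matches and ones needing review.

    A candidate is trusted only when its label and both dates agree with the
    entity's own context; everything else is left for a human reviewer.
    """
    trusted, needs_review = [], []
    for cand in candidates:
        same_name = cand["label"].casefold() == entity["name"].casefold()
        same_dates = (
            entity.get("birth") is not None
            and entity.get("death") is not None
            and cand.get("birth") == entity["birth"]
            and cand.get("death") == entity["death"]
        )
        (trusted if same_name and same_dates else needs_review).append(cand)
    return trusted, needs_review


# Hypothetical entity and candidates; the URIs are placeholders.
entity = {"name": "Jane Example", "birth": "1850", "death": "1920"}
candidates = [
    {"uri": "https://example.org/auth/1", "label": "Jane Example",
     "birth": "1850", "death": "1920"},
    {"uri": "https://example.org/auth/2", "label": "Jane Example",
     "birth": "1901", "death": None},
]
trusted, needs_review = triage(entity, candidates)
```

A stricter or looser rule changes how much is left for manual review, so the threshold is worth agreeing on within your Research Team before reviewing begins.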
With that mixed approach, you can estimate the time needed by assuming that each entity in your data will need 1-5 minutes for a human to reconcile it. This range depends on how familiar the person is with the data and whether they will need to spend time researching the entities to confirm a match.
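As a rough worked example of that estimate, the short function below multiplies the number of entities by the 1-5 minute range. The function name and the 3,000-entity figure are illustrative assumptions only.

```python
def estimate_hours(entity_count, min_minutes=1, max_minutes=5):
    """Return (low, high) estimated person-hours for manual reconciliation."""
    low = entity_count * min_minutes / 60
    high = entity_count * max_minutes / 60
    return low, high


low, high = estimate_hours(3000)
print(f"{low:.0f} to {high:.0f} hours")  # prints "50 to 250 hours"
```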
For large datasets, it is not always feasible to carefully reconcile all entities. Our strategy for this has been:
- Reconcile as much as is feasible
- When the data is otherwise ready for publication but reconciliation is not complete, your team can discuss with LINCS at what point you would like to stop reconciling and mint URIs for the remaining unreconciled entities
- Once the data is published, you can slowly continue to add URIs for the unreconciled entities in ResearchSpace Review
Step | Research Team | Ontology Team | Conversion Team | Storage Team |
---|---|---|---|---|
Set Up your Data | ✓ | ✓ | | |
Reconcile your Entities | ✓ | | | |
Merge Reconciled Data | ✓ | ✓ | | |
Choose Vocabularies | ✓ | ✓ | | |
Set Up your Data
For each workflow there are typically two options for setting up your data:
- Use a tool that takes your data in its original format and allows you to add URIs to the source data. For example, LEAF Writer lets you annotate XML and TEI data with entity types and URIs.
- In this case, you should not need to do any setup beyond the typical Clean Data step.
- Use a script or query tool to pull entities out of your data, along with contextual information about those entities. Then use a tool such as VERSD or OpenRefine to find URIs for the entities. Finally, use another script or query tool to insert those URIs back into either the source data or the converted data.
- This case will typically result in one or more spreadsheets where each row represents one entity and the columns contain contextual details about the entity. For example, you may have one internal unique identifier per row to represent a person, and then columns for their name, birth date, and death date so that you can quickly check if candidate URIs are correct.
- LINCS typically uses custom scripts to complete this step. The Conversion Team can offer advice and sample scripts from previous conversions.
Consider enhancing your source data with internal unique identifiers for each entity. These can be temporary identifiers that will be replaced before your data is published. The benefit is that, after you extract entities from your text and reconcile them, the identifiers let you easily insert the new URIs at the correct locations in the source or converted data.
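A minimal sketch of such a setup script is shown below, assuming the entities have already been pulled out of the source as simple records. The `temp:person/` prefix, the column names, and the sample records are hypothetical; a real script would read from your own data and follow whatever layout your reconciliation tool expects.

```python
import csv
import io

# Hypothetical source records; a real script would read these from your data.
records = [
    {"name": "Jane Example", "birth": "1850", "death": "1920"},
    {"name": "John Sample", "birth": "1872", "death": ""},
]


def build_reconciliation_rows(records, prefix="temp:person/"):
    """Give each entity a temporary internal identifier and keep its context."""
    rows = []
    for index, record in enumerate(records, start=1):
        # The empty "uri" column is filled in during reconciliation.
        rows.append({"internal_id": f"{prefix}{index}", **record, "uri": ""})
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text, one row per entity."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_reconciliation_rows(records)
print(to_csv(rows))
```

Keeping the `internal_id` column through reconciliation is what later allows each URI to be matched back to the right place in the data.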
Reconcile your Entities
- Structured Data
- Semi-Structured Data
- TEI Data
- Natural Language Data
For structured data, extract entities and their context from your source data. This process may require a custom script, but often it will be as easy as using a simplified version of your source spreadsheets or using an online tool to convert structured data into a spreadsheet. To find and confirm URIs, use VERSD if your data is bibliographic and OpenRefine otherwise. Note that OpenRefine accepts a broad range of starting file types so you may be able to skip the initial extraction step.
Particularly for small datasets, you may find it sufficient to manually look up URIs and add them directly to your source data, or to wait and add them to the converted data.
As usual, your options for semi-structured data depend on your specific data:
- Use LEAF Writer to review and add URIs directly to source XML documents.
- Create a custom script to extract entities and their context. To find and confirm URIs, use VERSD if your data is bibliographic and OpenRefine otherwise.
- Manually look up URIs and add them directly to your source data, or wait and add them to the converted data.
To take advantage of the automated triple extraction possible with the TEI workflow, it is best that you insert entity URIs directly into your source TEI files. LEAF Writer is our preferred tool for this step.
Alternatively, you may choose to make manual changes or changes using custom scripts to the source TEI files. You do still have the option to add additional URIs to the converted data at the end.
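For the custom-script route, the sketch below shows one way such a script might add URIs to `persName` elements using TEI's `ref` attribute. It is an illustration under stated assumptions, not the LINCS workflow: the name-to-URI mapping, the sample document, and the choice to match on the exact text of the name are all hypothetical, and matching on internal identifiers would be more reliable than matching on text.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Keep TEI as the default namespace when the tree is serialized again.
ET.register_namespace("", TEI_NS)

# Hypothetical mapping from the text of a name to its reconciled URI.
uri_by_name = {"Jane Example": "https://example.org/auth/1"}


def add_refs(tei_xml, uri_by_name):
    """Add a ref attribute to each persName whose text has a known URI."""
    root = ET.fromstring(tei_xml)
    for pers in root.iter(f"{{{TEI_NS}}}persName"):
        name = "".join(pers.itertext()).strip()
        # Never overwrite a ref that is already present in the source.
        if name in uri_by_name and "ref" not in pers.attrib:
            pers.set("ref", uri_by_name[name])
    return ET.tostring(root, encoding="unicode")


sample = (
    f'<TEI xmlns="{TEI_NS}"><text><body><p>'
    "<persName>Jane Example</persName> met <persName>John Sample</persName>."
    "</p></body></text></TEI>"
)
updated = add_refs(sample, uri_by_name)
print(updated)
```

Names without a match are left untouched, so they remain available for later review or for URIs added to the converted data.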
This step of the natural language data workflow is still in progress. Check back over the next few months as we release the tools described here.
The APIs that we use to automatically extract triples from natural language data combine the tasks of identifying entities, choosing URIs for entities, and extracting the relationships between entities. If you are using this approach, you do not need to do anything for this step. However, you may want to review the URIs that the tool suggests for your data and double-check any that have conflicting information. You should also do additional look-ups to find URIs for entities that our tools did not reconcile, as these automated tools do not search through all authority files. Refer to the documentation for the tool you are using to confirm which authority files it connects to.
Merge Reconciled Data
The Research Team and Conversion Team will use a custom script or the Linked Data Enhancement API to merge the new URIs with either the cleaned version of the source data or the converted data.
The Conversion Team will mint new URIs for anything that could not be reconciled.
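The merge-and-mint logic of a custom script might look roughly like the sketch below. It does not represent the Linked Data Enhancement API; the `internal_id` column, the `minted` flag, and the `https://example.org/entity/` namespace are hypothetical, and the real namespace for minted URIs would be decided with LINCS.

```python
import uuid

# Hypothetical namespace for newly minted URIs.
MINT_BASE = "https://example.org/entity/"


def merge_uris(rows, reconciled):
    """Attach a reconciled URI to each row, minting one when none was found."""
    merged = []
    for row in rows:
        uri = reconciled.get(row["internal_id"])
        minted = uri is None
        if minted:
            uri = MINT_BASE + str(uuid.uuid4())
        # Record whether the URI was minted so it can be revisited later.
        merged.append({**row, "uri": uri, "minted": minted})
    return merged


rows = [
    {"internal_id": "temp:person/1", "name": "Jane Example"},
    {"internal_id": "temp:person/2", "name": "John Sample"},
]
reconciled = {"temp:person/1": "https://example.org/auth/1"}
merged = merge_uris(rows, reconciled)
```

Flagging minted URIs makes it easier to return to those entities if reconciliation continues after publication.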
Choose Vocabularies
As with reconciling entities, you also need to choose vocabulary terms to use in your data and include their URIs. These vocabulary terms are often used to add fine-grained types to entities and relationships, compared to the broad types that CIDOC CRM uses. Choosing the appropriate vocabulary terms for your project will require you to explore the terms’ definitions to find ones that fit. When possible, prioritize terms that are already frequently used within LINCS data to increase the connections and potential for interesting queries between your data and other LINCS data.
See our Vocabularies documentation for additional background and the Vocabulary Browser to find vocabulary terms created by or used in LINCS.