Interlinking data with alignments and link keys

This version:
http://alignapi.gforge.inria.fr/tutorial/tutorial6/
Author:
Jérôme Euzenat, INRIA & LIG

This tutorial explains how to generate links between datasets from EDOAL alignments with link keys. It optionally illustrates similar tasks with Silk.

Requirements

The tutorial works with the software embedded in the Alignment API alone. However, making it work with Silk or a triple store requires additional software.

As usual, the whole tutorial is performed from the command line.

Evaluating the generated SPARQL queries in a triple store, using named graphs, is illustrated here.

Set up

First start by cleaning up your environment:

$ cd tutorial6
$ mkdir results

Data sets

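We use two different data sets, stored in the files regions-2010.rdf (INSEE) and nuts2008_complete.rdf (Eurostat NUTS), which are the files loaded by the commands in this tutorial.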

Of course, the tutorial can be adapted with your own data sets.

Generating links from link keys

From an alignment comprising link keys, it is possible to generate sameAs links. We have several such alignments here:

The goal of the tutorial is for you to apply them one after the other, i.e., to replace the number in the instructions below, and to see what these link keys do.

This corresponds to linkkey3.rdf:
<map>
  <Cell>
    <entity1>
      <edoal:Class rdf:about="&insee;Departement"/>
    </entity1>
    <entity2>
      <edoal:Class>
        <edoal:and rdf:parseType="Collection">
          <edoal:Class rdf:about="&eurostat;NUTSRegion"/>
          <edoal:AttributeValueRestriction>
            <edoal:onAttribute>
              <edoal:Property rdf:about="&eurostat;level"/>
            </edoal:onAttribute>
            <edoal:comparator rdf:resource="&edoal;equals"/>
            <edoal:value><edoal:Literal edoal:type="&xsd;integer" edoal:string="3" /></edoal:value>
          </edoal:AttributeValueRestriction>
          <edoal:AttributeValueRestriction>
            <edoal:onAttribute>
              <edoal:Relation>
                <edoal:compose rdf:parseType="Collection">
                  <edoal:Relation rdf:about="&eurostat;hasParentRegion" />
                  <edoal:Relation rdf:about="&eurostat;hasParentRegion" />
                  <edoal:Relation rdf:about="&eurostat;hasParentRegion" />
                </edoal:compose>
              </edoal:Relation>
            </edoal:onAttribute>
            <edoal:comparator rdf:resource="&edoal;equals"/>
            <edoal:value><edoal:Instance rdf:about="&esdata;FR" /></edoal:value>
          </edoal:AttributeValueRestriction>
        </edoal:and>
      </edoal:Class>
    </entity2>
    <relation>equivalence</relation>
    <measure>1.0</measure>
    <edoal:linkkey>
      <edoal:Linkkey>
        <edoal:binding>
          <edoal:Intersects>
            <edoal:property1>
              <edoal:Property rdf:about="&insee;nom" /><!-- xml:lang="fr"-->
            </edoal:property1>
            <edoal:property2>
              <edoal:Property rdf:about="&eurostat;name" />
            </edoal:property2>
          </edoal:Intersects>
        </edoal:binding>
      </edoal:Linkkey>
    </edoal:linkkey>
  </Cell>
</map>
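To make the effect of the Intersects binding concrete, here is a minimal Python sketch of how such a link key can be read: two instances are linked as soon as they share at least one value for the paired properties. It is not the Alignment API implementation. The function names, the identifiers and the toy values are all hypothetical, and the case-insensitive comparison is an assumption modelled on the lcase(str(...)) filter of the SPARQL query generated from this alignment.

```python
def normalize(value):
    """Compare values as lower-cased strings (assumption, see lead-in)."""
    return str(value).lower()


def intersects(values1, values2):
    """True when the two value collections share at least one normalized value."""
    return bool({normalize(v) for v in values1} & {normalize(v) for v in values2})


def generate_links(source, target, prop1, prop2):
    """Return (source_id, target_id) pairs whose prop1/prop2 values intersect."""
    links = []
    for id1, props1 in source.items():
        for id2, props2 in target.items():
            if intersects(props1.get(prop1, []), props2.get(prop2, [])):
                links.append((id1, id2))
    return links


# Hypothetical toy data: identifiers and values are made up for illustration.
departements = {
    "dep:A": {"nom": ["Alpha"]},
    "dep:B": {"nom": ["Beta"]},
}
regions = {
    "nuts:X1": {"name": ["ALPHA"]},
    "nuts:X2": {"name": ["Gamma"]},
}

links = generate_links(departements, regions, "nom", "name")
print(links)
```

The class restrictions of the cell (level equal to 3, three hasParentRegion steps leading to FR) are left out of the sketch; they would only narrow the set of candidate instances on the target side before the values are compared.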

The full alignment is available at: linkkey3.rdf

This is processed by:

$ java -cp $CLASSPATH fr.inrialpes.exmo.align.cli.ParserPrinter file:linkkey1.rdf -r fr.inrialpes.exmo.align.impl.renderer.SPARQLLinkkerRendererVisitor -o results/query.sparql

to generate a set of SPARQL queries.
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ns1:<http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#>
PREFIX owl:<http://www.w3.org/2002/07/owl#>
PREFIX ns2:<http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/>
PREFIX ns0:<http://rdf.insee.fr/geo/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
CONSTRUCT { ?s1 owl:sameAs ?s2 }
WHERE {
  ?s1 rdf:type ns0:Departement .
  ?s2 rdf:type ns1:NUTSRegion .
  ?s2 ns1:level ?o2 .
  FILTER (?o2=3)
  ?s2 ns1:hasParentRegion ?o4 .
  ?o4 ns1:hasParentRegion ?o5 .
  ?o5 ns1:hasParentRegion ?o6 .
  FILTER (?o6=ns2:FR)
  ?s1 ns0:nom ?o7 .
  ?s2 ns1:name ?o8 .
  FILTER( lcase(str(?o7)) = lcase(str(?o8)) )
}
Think about what you could do to improve this query.

Processing any of these SPARQL queries will generate links:

$ java -cp $CLASSPATH arq.query --query results/query.sparql --data regions-2010.rdf --data nuts2008_complete.rdf > results/links.ttl
@prefix geo: <http://rdf.insee.fr/geo/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix : <http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ns0: <http://rdf.insee.fr/geo/> .
@prefix ns1: <http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://rdf.insee.fr/geo/2010/DEP_67> owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR421> .
<http://rdf.insee.fr/geo/2010/DEP_39> owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/CH025> , <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR432> .
<http://rdf.insee.fr/geo/2010/DEP_2A> owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR831> .
<http://rdf.insee.fr/geo/2010/DEP_61> owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR253> .
<http://rdf.insee.fr/geo/2010/DEP_33> owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR612> .
<http://rdf.insee.fr/geo/2010/DEP_05> owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR822> .
<http://rdf.insee.fr/geo/2010/DEP_74> owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR718> .
...

The full link set is available at: results/links.ttl

Can you spot a problem? Where does it come from? How can it be solved?

Generating links from similarity measures

We use Silk 2.6.1 to generate links based on the similarity between resources. Silk is driven by scripts that express such similarity. The scripts are written in the Link Specification Language.

We have several linkage rules available; they are all in the same script (identified by no1...no6):

Again, your goal is to process the linkage rules provided in this script, from no1 to no6, and to understand what they do.

Here is part of script.xml:

<?xml version="1.0" encoding="utf-8" ?>
<Silk>
  <Prefixes>
    <Prefix id="rdf" namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#" />
    <Prefix id="owl" namespace="http://www.w3.org/2002/07/owl#" />
    <Prefix id="id2" namespace="http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#" />
    <Prefix id="id1" namespace="http://rdf.insee.fr/geo/" />
  </Prefixes>

  <DataSources>
    <DataSource id="id1" type="file">
      <Param name="file" value="regions-2010.rdf"/>
      <Param name="format" value="RDF/XML" />
    </DataSource>
    <DataSource id="id2" type="file">
      <Param name="file" value="nuts2008_complete.rdf"/>
      <Param name="format" value="RDF/XML" />
    </DataSource>
  </DataSources>

  <Interlinks>
    <Interlink id="no1">
      <LinkType>owl:sameAs</LinkType>
      <SourceDataset dataSource="id1" var="e1">
        <RestrictTo> ?e1 rdf:type id1:Departement . </RestrictTo>
      </SourceDataset>
      <TargetDataset dataSource="id2" var="e2">
        <RestrictTo> ?e2 rdf:type id2:NUTSRegion . </RestrictTo>
      </TargetDataset>
      <LinkageRule>
        <Aggregate type="max">
          <Compare metric="jaccard">
            <TransformInput function="tokenize">
              <Input path="?e1\id1:subdivision/id1:nom" />
            </TransformInput>
            <TransformInput function="tokenize">
              <Input path="?e2/id2:name" />
            </TransformInput>
          </Compare>
          <Compare metric="jaccard">
            <TransformInput function="tokenize">
              <Input path="?e1/id1:nom" />
            </TransformInput>
            <TransformInput function="tokenize">
              <Input path="?e2/id2:name" />
            </TransformInput>
          </Compare>
        </Aggregate>
      </LinkageRule>

      <Filter />

      <Outputs>
        <Output type="file">
          <Param name="file" value="results/Round1.rdf"/>
          <Param name="format" value="alignment"/>
        </Output>
      </Outputs>
    </Interlink>
  </Interlinks>
</Silk>
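The linkage rule no1 above combines two token-based Jaccard comparisons with a max aggregation. The following Python sketch illustrates that idea only; it is not Silk's implementation. The tokenizer (lower-casing and splitting on non-word characters) is an assumption, Silk's own jaccard metric may be defined as a distance rather than a similarity, and the names used in the examples are illustrative strings, not values read from the data files.

```python
import re


def tokenize(text):
    """Split a string into a set of lower-cased word tokens (assumed tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(tokens1, tokens2):
    """Jaccard similarity: |intersection| / |union|, 0.0 for two empty sets."""
    union = tokens1 | tokens2
    if not union:
        return 0.0
    return len(tokens1 & tokens2) / len(union)


def rule_score(parent_nom, nom, name):
    """Max of the two comparisons of rule no1.

    parent_nom stands for the value reached through ?e1\\id1:subdivision/id1:nom,
    nom for ?e1/id1:nom and name for ?e2/id2:name.
    """
    target = tokenize(name)
    return max(
        jaccard(tokenize(parent_nom), target),
        jaccard(tokenize(nom), target),
    )


# Illustrative strings only.
print(rule_score("Rhône-Alpes", "Haute-Savoie", "Haute Savoie"))
print(rule_score("Rhône-Alpes", "Savoie", "Haute-Savoie"))
```

With a max aggregation, a pair scores well as soon as either comparison does, which makes the rule permissive; an average or a min over the same comparisons is stricter.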

$ java -DconfigFile=script.xml -DlinkSpec=no1 -jar silk.jar
The result is provided as a set of links in a format that is supposed to be the Alignment format. However, the output is not correct. This is fixed by applying a patch to the resulting file (here results/Round1-accepted.rdf):
$ sh bin/fix.sh results/Round1-accepted.rdf
It is possible to count the number of answers provided by the evaluation with:
$ grep entity1 results/Round1-accepted.rdf | wc -l
103

Link quality can be tested by comparison with the reference alignment reflinks.rdf:

$ java -cp $CLASSPATH fr.inrialpes.exmo.align.cli.EvalAlign -i fr.inrialpes.exmo.align.impl.eval.PRecEvaluator file:reflinks.rdf file:results/Round1-accepted.rdf
which answers:
<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:map='http://www.atl.external.lmco.com/projects/ontology/ResultsOntology.n3#'>
  <map:output rdf:about=''>
    <map:input1 rdf:resource="http://rdf.insee.fr/geo/"/>
    <map:input2 rdf:resource="http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#"/>
    <map:precision>0.9611650485436893</map:precision>
    <map:recall>1.0</map:recall>
    <map:fMeasure>0.9801980198019802</map:fMeasure>
    <map:oMeasure>0.9595959595959596</map:oMeasure>
    <result>1.0404040404040404</result>
  </map:output>
</rdf:RDF>
It provides all valid links (recall=100%) but not all the links it found are correct (precision=96%). Could you improve on this?
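The figures in this report can be re-derived by hand. The sketch below assumes 103 returned links, of which 99 are correct, against a reference of 99 links. The two 99s are not stated in the tutorial: they are inferred from the reported precision and recall. The formula used for the overall measure, recall × (2 − 1/precision), is also an assumption; it happens to reproduce the oMeasure value reported above.

```python
def evaluate(found, correct, expected):
    """Return (precision, recall, f_measure, overall) from link counts.

    found:    number of links returned by the linkage rule
    correct:  number of returned links that are in the reference
    expected: number of links in the reference
    """
    precision = correct / found
    recall = correct / expected
    f_measure = 2 * precision * recall / (precision + recall)
    # Assumed formula for the overall measure (see lead-in).
    overall = recall * (2 - 1 / precision)
    return precision, recall, f_measure, overall


# Counts inferred from the report: 103 links found, 99 correct, 99 expected.
precision, recall, f_measure, overall = evaluate(found=103, correct=99, expected=99)
print(f"{precision:.2f} {recall:.2f} {f_measure:.2f} {overall:.2f}")
# prints: 0.96 1.00 0.98 0.96
```

Under these assumptions, the four links that lower the precision are the only difference between the result and the reference, so a rule that excludes them without losing correct links would reach precision and recall of 1.0.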

Try the other proposed linkage rules and/or try to improve the linkage rule used.

rule | classes | comparison | #links | prec. | rec.
no1 | dpt ≡ NR | name=nom | 103 | .96 | 1.0
no2 | dpt ≡ NR & level=3 | tok(name)~tok(nom) | 100 | .99 | 1.0
no3 | dpt ≡ NR | AVG(tok(name)~tok(nom), tok(hasParentRegion/name)~tok(\subdivision/name)) | 89 | 1.0 | .90
no4 | dpt ≡ NR & level=3 & hasParentRegion³=FR | AVG(tok(name)~tok(nom), tok(hasParentRegion/name)~tok(\subdivision/name)) | 89 | 1.0 | .90
no5 | dpt ≡ NR & level=3 & hasParentRegion³=FR | name=nom | 99 | 1.0 | 1.0
no6 | dpt ≡ NR & level=3 & hasParentRegion³=FR | MAX(tok(name)~tok(nom), tok(hasParentRegion/name)~tok(\subdivision/name)) | 439 | .23 | 1.0
no7 | dpt ≡ NR & level=3 & hasParentRegion³=FR | MIN(tok(name)~tok(nom), tok(hasParentRegion/name)~tok(\subdivision/name)) | 89 | 1.0 | .90

Extra work

For the curious, we have a larger example, between the French communes in insee-communes.ttl and those of geonames (communes_gn.ttl). A starting script is geo-script.xml. This sample data comes from the LinkKeyDisco system experiments.



$Id: index.html 2058 2015-09-11 06:24:58Z euzenat $