How to set up regular dumping of standalone Wikibase content as TTL for loading into Blazegraph
So I'm setting up an instance of Wikibase on a server and trying to get a standalone version of the Wikidata Query Service running alongside it. I'm having trouble following the tutorials, especially with regard to the following:
- Exporting this Wikibase instance's data as a .ttl.gz file.
- Figuring out a way to automate exporting that data and importing it into Blazegraph regularly (it's a small instance, so even monthly updates would be fine; see the sketch below).
I know how to export the Wikibase data as XML but that's about it. Any help with this would be much, much appreciated!
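For the automation part, what I'm imagining is a script along the following lines, run monthly from cron. This is an untested sketch: the paths come from my setup described under UPDATE 1 below, and the script name and variable names are my own.

#!/bin/bash
# Untested sketch of the monthly dump-and-load I'm aiming for.
# Paths match my setup below; WDQS_DIR and DUMP are placeholders of my own.
set -e

WDQS_DIR=/var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT
DUMP="$WDQS_DIR/data/wikibase-$(date +%Y%m%d)-all.ttl"  # dated filename (my own naming)

# 1. Export the wiki's entities as a Turtle dump
php /var/lib/mediawiki/extensions/Wikibase/repo/maintenance/dumpRdf.php \
  --server http://localhost:400 --output "$DUMP"

# 2. Preprocess the dump into gzipped chunks for Blazegraph
bash "$WDQS_DIR/munge.sh" -f "$DUMP" -d "$WDQS_DIR/data/split" -- --conceptUri http://localhost:400

# 3. Load the chunks into the already-running Blazegraph instance
bash "$WDQS_DIR/loadRestAPI.sh" -d "$WDQS_DIR/data/split"

I'd then schedule it from root's crontab with something like 0 3 1 * * /usr/local/bin/wikibase-monthly-load.sh (a location I made up).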
UPDATE 1: I'm still struggling, but here is what I've tried so far. I've adapted the following commands by combining aspects of the code here and here.
First, I create an RDF dump of the local Wikibase instance (it only has one entity, so it's very small):
php /var/lib/mediawiki/extensions/Wikibase/repo/maintenance/dumpRdf.php --server http://localhost:400 --output /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/data/wikibase-05072022-all.ttl
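Since the end goal is a .ttl.gz file, I'm assuming the plain Turtle output could simply be compressed afterwards (gzip -k keeps the original), though it turns out munge.sh below produces gzipped chunks on its own anyway:

gzip -k /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/data/wikibase-05072022-all.ttl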
Next, I start Blazegraph with ./runBlazegraph.sh, because ./munge.sh seems to fail without it running:
sudo BLAZEGRAPH_OPTS="-DwikibaseConceptUri=http://localhost:400" bash /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/runBlazegraph.sh
Next, I run ./munge.sh to preprocess the exported .ttl file:
sudo bash /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/munge.sh -f /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/data/wikibase-05072022-all.ttl -d /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/data/split -- --conceptUri http://localhost:400
This creates a file called wikidump-000000001.ttl.gz.good in /data/split. Following this, I attempt to import it using ./loadRestAPI.sh:
sudo bash /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/loadRestAPI.sh -d /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/data/split
The output from ./loadRestAPI.sh is as follows:
Loading with properties...
quiet=false
verbose=0
closure=false
durableQueues=true
#Needed for quads
#defaultGraph=
com.bigdata.rdf.store.DataLoader.flush=false
com.bigdata.rdf.store.DataLoader.bufferCapacity=100000
com.bigdata.rdf.store.DataLoader.queueCapacity=10
#Namespace to load
namespace=wdq
#Files to load
fileOrDirs=/var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/data/split
#Property file (if creating a new namespace)
propertyFile=/var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/RWStore.properties
<?xml version="1.0"?><data modified="0" milliseconds="68"/>DATALOADER-SERVLET: Loaded wdq with properties: /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/RWStore.properties
However, the running Blazegraph instance prints the following in response:
Reading properties: /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/RWStore.properties
15:14:27.545 [com.bigdata.journal.Journal.executorService9] ERROR com.bigdata.rdf.store.DataLoader IP: UA: - Parser error - skipping source: source=/var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/data/split/wikidump-000000001.ttl.gz.good
org.openrdf.rio.RDFParseException: Expected an RDF value here, found '' [line 1]
at org.openrdf.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:440)
Additionally, looking in /data/split/ reveals that the file wikidump-000000001.ttl.gz.good has been renamed to wikidump-000000001.ttl.gz.good.fail.
This process also creates another file, wikidata.jnl, in /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/. It is approximately 1.6 GB in size, but it does not include the new entity from the local instance when queried, and it is exactly the same size every time this process is run.
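To check whether anything actually lands in the journal, I've also been counting triples in the wdq namespace from the command line, assuming the default WDQS endpoint path on port 9999:

curl -G http://localhost:9999/bigdata/namespace/wdq/sparql --data-urlencode 'query=SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }'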
What I believe is happening is that the wikidump-000000001.ttl.gz.good file produced by munge.sh from the dumpRdf.php output is in the wrong format, but I'm not sure why that happened or what alternative steps could fix it. The original file wikibase-05072022-all.ttl appears as follows:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix wikibase: <http://wikiba.se/ontology#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix wd: <http://localhost:400/entity/> .
@prefix data: <http://localhost:400/wiki/Special:EntityData/> .
@prefix s: <http://localhost:400/entity/statement/> .
@prefix ref: <http://localhost:400/reference/> .
@prefix v: <http://localhost:400/value/> .
@prefix wdt: <http://localhost:400/prop/direct/> .
@prefix wdtn: <http://localhost:400/prop/direct-normalized/> .
@prefix p: <http://localhost:400/prop/> .
@prefix ps: <http://localhost:400/prop/statement/> .
@prefix psv: <http://localhost:400/prop/statement/value/> .
@prefix psn: <http://localhost:400/prop/statement/value-normalized/> .
@prefix pq: <http://localhost:400/prop/qualifier/> .
@prefix pqv: <http://localhost:400/prop/qualifier/value/> .
@prefix pqn: <http://localhost:400/prop/qualifier/value-normalized/> .
@prefix pr: <http://localhost:400/prop/reference/> .
@prefix prv: <http://localhost:400/prop/reference/value/> .
@prefix prn: <http://localhost:400/prop/reference/value-normalized/> .
@prefix wdno: <http://localhost:400/prop/novalue/> .
wikibase:Dump a schema:Dataset,
owl:Ontology ;
cc:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
schema:softwareVersion "1.0.0" ;
schema:dateModified "2022-05-08T02:41:03Z"^^xsd:dateTime ;
owl:imports <http://wikiba.se/ontology-1.0.owl> .
data:Q1 a schema:Dataset ;
schema:about wd:Q1 ;
schema:version "2"^^xsd:integer ;
schema:dateModified "2022-05-06T20:07:48Z"^^xsd:dateTime ;
wikibase:statements "0"^^xsd:integer ;
wikibase:sitelinks "0"^^xsd:integer ;
wikibase:identifiers "0"^^xsd:integer .
wd:Q1 a wikibase:Item ;
rdfs:label "NameOfEntity"@en ;
skos:prefLabel "NameOfEntity"@en ;
schema:name "NameOfEntity"@en .
UPDATE 2: It's possible, per here, that including rdfs:label, skos:prefLabel, and schema:name is the issue. So I tried removing those triples from the dump, and it appeared to work.
I checked SELECT * WHERE {?a ?b ?c} at http://localhost:9999/bigdata/#query and it executed correctly and returned "NameOfEntity" in the output.
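For reference, the equivalent check from the command line would be (again assuming the default endpoint path):

curl -G http://localhost:9999/bigdata/namespace/wdq/sparql --data-urlencode 'query=SELECT * WHERE { ?a ?b ?c } LIMIT 10' -H 'Accept: application/sparql-results+json'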
However, I then attempted to use ./runUpdate.sh as follows:
sudo bash /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/runUpdate.sh -- --wikibaseUrl http://localhost:400 --conceptUri http://localhost:400
This resulted in the following stack trace:
15:36:36.069 [main] INFO o.w.q.r.t.change.ChangeSourceContext - Checking where we left off
15:36:36.069 [main] INFO o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the updater
15:36:36.467 [main] INFO o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the dump
15:36:36.580 [main] INFO o.w.q.r.t.change.ChangeSourceContext - Found start time in the RDF store: 2022-05-08T02:41:03Z
15:36:36.649 [main] ERROR org.wikidata.query.rdf.tool.Update - Error during updater run.
java.lang.RuntimeException: org.apache.http.conn.HttpHostConnectException: Connect to localhost:400 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.fetchRecentChanges(WikibaseRepository.java:244)
at org.wikidata.query.rdf.tool.change.RecentChangesPoller.doFetchRecentChanges(RecentChangesPoller.java:325)
at org.wikidata.query.rdf.tool.change.RecentChangesPoller.fetchRecentChanges(RecentChangesPoller.java:314)
at org.wikidata.query.rdf.tool.change.RecentChangesPoller.batch(RecentChangesPoller.java:338)
at org.wikidata.query.rdf.tool.change.RecentChangesPoller.firstBatch(RecentChangesPoller.java:162)
at org.wikidata.query.rdf.tool.change.RecentChangesPoller.firstBatch(RecentChangesPoller.java:38)
at org.wikidata.query.rdf.tool.Updater.run(Updater.java:152)
at org.wikidata.query.rdf.tool.Update.run(Update.java:174)
at org.wikidata.query.rdf.tool.Update.main(Update.java:98)
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:400 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.getJson(WikibaseRepository.java:439)
at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.fetchRecentChanges(WikibaseRepository.java:241)
... 8 common frames omitted
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.base/java.net.Socket.connect(Socket.java:609)
at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:74)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:134)
... 20 common frames omitted
Exception in thread "main" java.lang.RuntimeException: org.apache.http.conn.HttpHostConnectException: Connect to localhost:400 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.fetchRecentChanges(WikibaseRepository.java:244)
at org.wikidata.query.rdf.tool.change.RecentChangesPoller.doFetchRecentChanges(RecentChangesPoller.java:325)
at org.wikidata.query.rdf.tool.change.RecentChangesPoller.fetchRecentChanges(RecentChangesPoller.java:314)
at org.wikidata.query.rdf.tool.change.RecentChangesPoller.batch(RecentChangesPoller.java:338)
at org.wikidata.query.rdf.tool.change.RecentChangesPoller.firstBatch(RecentChangesPoller.java:162)
at org.wikidata.query.rdf.tool.change.RecentChangesPoller.firstBatch(RecentChangesPoller.java:38)
at org.wikidata.query.rdf.tool.Updater.run(Updater.java:152)
at org.wikidata.query.rdf.tool.Update.run(Update.java:174)
at org.wikidata.query.rdf.tool.Update.main(Update.java:98)
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:400 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.getJson(WikibaseRepository.java:439)
at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.fetchRecentChanges(WikibaseRepository.java:241)
... 8 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.base/java.net.Socket.connect(Socket.java:609)
at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:74)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:134)
... 20 more
I triple-checked that http://localhost:400 was running (and it is). I'm wondering if it might be a result of where MediaWiki is placed, though? Visiting http://localhost:400 serves the Apache2 Ubuntu default page, while http://localhost:400/mediawiki lands at http://localhost:400/wiki/Main_Page. Should I attempt URL forwarding, or is this another issue entirely?
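As a sanity check on that theory, I've been poking the MediaWiki API directly, since the stack trace shows runUpdate.sh polling recent changes through the API; on my setup api.php presumably lives under /mediawiki:

curl 'http://localhost:400/mediawiki/api.php?action=query&list=recentchanges&format=json'

Although, as I understand it, a plain "Connection refused" normally means nothing answered on the port at all rather than a wrong URL path, so the path prefix may not be the whole story.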
Regardless, I fully plan to catalog all of the issues I've had somewhere, and I think I'm getting close to success.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow