Error during a Wikidata update

I've created a local version of the Wikidata Query Service using the instructions here, and after running munge.sh with the default options I ran
./runUpdate.sh -n wdq, which produced the following error message:
ERROR org.wikidata.query.rdf.tool.Update -
RDF store reports the last update time is before the minimum safe poll time.
You will have to reload from scratch or you might have missing data.
What does it mean? Should I munge again before updating?

The default updater can currently only update based on what is in RecentChanges for the wiki.
The default window for this is 30 days, so if the dump that you imported is older than 30 days the updater will fail.
There are options that can now be passed to the updater script to look further back into the history of RecentChanges.
You can also set the last updated triple that the check is performed against.
These options are discussed in https://phabricator.wikimedia.org/T182394 (but I'm not sure better docs currently exist):
"wikibaseMaxDaysBack" can be used to set the maximum number of days to look back in RecentChanges
"init" can be used to set the last updated triple

Related

How to recover from "missing docs" in xtdb?

I'm using xtdb in a testing environment with a RocksDB backend. All was well until yesterday, when the system stopped ingesting new data. It tells me that this is because of "missing docs" and gives me the id of the allegedly missing doc, but since it is missing, that doesn't tell me much. I have a specific format for my xt/ids (basically type+guid) and this id doesn't match that format, so I don't think it is one of mine. Calling history on the entity id just gives me an empty vector. I understand the block on updates for consistency reasons, but how do I diagnose and recover from this situation (short of trashing the database and starting again)? This would obviously be a massive worry were it to happen in production.
In the general case this "missing docs" error indicates a corrupted document store, and the only proper resolution is to manually restore/recover based on a backup of the document store. This almost certainly implies some level of data loss.
However, there was a known bug in the transaction function logic prior to 1.22.0 which could intermittently produce this error (but without any genuine data loss); see https://github.com/xtdb/xtdb/commit/1c30550fb14bd6d09027ff902cb00021bd6e57c4
If you weren't using transaction functions, though, then there may be another, as yet unknown, explanation.

How to run an aggregated update report?

If one types update in the sbt console, it runs an aggregated report that typically takes a minute or so for a project.
However, if one programmatically runs update for each ProjectRef it is chronically slow (10 minutes to an hour is not unheard of).
How can one programmatically run the same (faster) aggregated update report that the console runs?
The implementation of the update task is available here:
https://github.com/sbt/sbt-zero-thirteen/blob/v0.13.9/main/src/main/scala/sbt/Defaults.scala#L1325-L1443
The main thing it adds there is caching based on the input parameters.
Not sure what you mean by aggregated. Do you mean aggregated across the configurations (e.g. Compile and Test)?
Basically, this PR is how I ended up doing it:
https://github.com/ensime/ensime-sbt/pull/122
It meant setting up an aggregated report in a single task and calling that once, with the result referenced later on.
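As an illustrative sketch only (the allUpdates key name is mine, not from that PR, and this assumes sbt 0.13.x), a single task can run update across all projects in one aggregated pass using a ScopeFilter, instead of invoking update separately for each ProjectRef:
// build.sbt
// One task that resolves the update report for every project in the build.
lazy val allUpdates = taskKey[Seq[UpdateReport]]("aggregated update report across all projects")

allUpdates := update.all(ScopeFilter(inAnyProject)).value
Calling that one task once and reusing its result where needed (as the PR does) avoids paying the per-project resolution cost repeatedly.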

XProc - Pausing a pipeline and continuing it when a certain event occurs

I'm fairly new to XProc and XPath, but I've been asked to solve the following problem:
Step 2 receives data via the secondary port from step 1. Step 2 contains a p:for-each, which saves a document into a folder for each element that passes through the for-each. (Part A)
These documents (let's say I receive 6 documents from the for-each) lie in the same directory, get picked up by p:directory-list, and are eventually stored in one single document containing the full path of every document the for-each created. (Part B)
So far, so good.
The problem is that Part A seems to be too slow: Part B already tries to read the data Part A stores while the directory is still empty. In other words, I have a performance/synchronization problem.
And now comes the question:
Is it possible to make the pipeline wait and let it continue as soon as a certain event occurs?
That's what I'm imagining:
Part B waits as long as necessary until the directory that Part A stores the data in is no longer empty. I read something about
dbxml:breakpoint, but unfortunately I couldn't find more information than the name and
a short description of what it seems to do:
Set a breakpoint, optionally based upon a condition, that will cause pipeline operation to pause at the breakpoint, possibly requiring user intervention to continue and/or issuing a message.
It would be awesome if you knew more about it and could give an example of how it's used. It would also help if you know of a workaround or another way to solve this problem.
UPDATE:
After searching Google for half an eternity, I found SMIL, whose timesheets seem to do the trick. Does anyone have experience with combining XML / XProc and SMIL?
Back towards the end of 2009 I proposed the concept of 'Orchestrating XProc with SMIL' http://broadcast.oreilly.com/2009/09/xproc-and-smil-orchestrating-p.html in a blog post on the O'Reilly Network.
However, I'm not sure that this (XProc + Time) is the solution to your problem. It's not entirely clear to me from your description what's happening. Are you implying that you're trying to write something to disk and then read it in a subsequent step? You need to keep stuff in the pipeline in order to ensure you can connect outputs to subsequent inputs.

Riak: are my 2is broken?

We're having some weird things happening with a cleanup cronjob and Riak:
The objects we store (postboxes) have a 2i for the modification date (which is a Unix timestamp).
There's a cronjob running frequently that deletes all postboxes that have not been modified within 180 days. However, we've found evidence that some (very few) postboxes that were modified in the last three days were deleted by this cronjob.
After reviewing and debugging every line of code several times over, I am confident that this is not a problem with the cronjob.
I also traced back all delete calls to that bucket - and no one else is deleting objects there.
Of course I also checked with Riak, reading the postboxes with r=ALL: they're definitely gone (and they were stored with w=QUORUM).
I also checked the logs: updating the postboxes did succeed (there were no errors reported back from the write operations).
This leaves me with two possible causes for this:
riak loses data (which I am not willing to believe that easily)
the secondary indexes are corrupt and queries to them return wrong keys
So my questions are:
Can 2is actually break?
Is it possible to verify that?
Am I missing something completely different?
Cheers,
Matthias
Secondary index queries in Riak are coverage queries, which means that they will only use one of the stored replicas, and not perform a quorum read.
As you are writing with w=QUORUM, it is possible (if you have n_val set to 3 or higher) that one or more of the replicas does not get updated while the operation is still deemed successful. If such a stale replica is the one selected for the coverage query, you could end up deleting based on the old value. To avoid this, you will need to perform updates with w=ALL.
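As a sketch of what that looks like in code (this uses the Riak Java client 2.x from Scala; the class and method names are written from memory and the bucket/key/value are made up, so treat it as illustrative only and check against your client's docs):
import com.basho.riak.client.api.RiakClient
import com.basho.riak.client.api.cap.Quorum
import com.basho.riak.client.api.commands.kv.StoreValue
import com.basho.riak.client.core.query.{Location, Namespace, RiakObject}
import com.basho.riak.client.core.util.BinaryValue

object StorePostbox {
  def main(args: Array[String]): Unit = {
    val client = RiakClient.newClient("127.0.0.1")
    try {
      // The stored value; content type and 2i entries omitted for brevity.
      val obj = new RiakObject().setValue(BinaryValue.create("some value"))
      val store = new StoreValue.Builder(obj)
        .withLocation(new Location(new Namespace("postboxes"), "box-123"))
        // Wait for every replica, not just a quorum, so a later coverage-based
        // 2i query cannot land on a replica that missed the update.
        .withOption(StoreValue.Option.W, Quorum.allQuorum())
        .build()
      client.execute(store)
    } finally {
      client.shutdown()
    }
  }
}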

QTP Retrieve code that has already been executed

I'm trying to figure out a way to reflectively look at the code that I've executed in a QTP script. The idea here is that, when I encounter a crash, a recovery scenario captures an error message and sends it to QC as a defect. If I can see the code I've already executed, then I could, in theory, also include the steps to reproduce the defect.
Any thoughts?
Option 1: Movie recording and playback
QTP 11 (finally) has a feature for requirements like that: take a look at Tools, Options, Run, Screen capture. "Save movie to results" there allows you to record exactly what happened. The resulting movie is part of the run result, i.e. if you submit a bug with this run result, the movie will be included.
Normally I would not use such a feature, because you would have to record the movie every time just to have it in case of an error. You would end up with big run results containing movies nobody wants to see, just to have them in the rare case that an error occurred and a defect is created. But:
In this regard, HP has done the job right: you can select in the dialog to save the movie to the results only if an error occurs. And, to avoid saving the whole boring part of the test execution that did not contain errors while still seeing the critical steps that led up to it, you can specify that only the last N kB of the movie be kept, so you will always see what led to the error.
Option 2: "Macro" recording and playback
You could, in theory, create your own playback methods for all test objects (registering functions via RegisterUserFunc), and make them save the call info into some data structure before doing the playback step (by calling the original playback function).
Then, still in theory, you could create a nice little playback engine that iterates over that data structure and executes exactly the playback steps that were recorded previously.
I've done similar stuff to repeat bundles of playback steps after changing the AUT config, in order to iterate a given playback over various configs without changing the code that does the original playback.
But hey, this is quite some work, and a lot of things can go wrong: the AUT must be in the same initial state upon playback as during the "recording of playback". This includes all relevant databases and subsystems of your testing environment. Usually, this is not an easy task in large projects and not worth the trouble (we are talking about re-creating the original initial config just to reproduce one single bug).
So I suggest you check out the movie feature, i.e. option 1. It does not play back the steps in the AUT, but it shows exactly what happened during the original playback.
