Avoiding HTTP "too many requests" error when using SPARQLWrapper and Wikidata - http

I have a list of approximately 6k Wikidata instance IDs (beginning Q#####) that I want to look up the human-readable labels for. I am not too familiar with SPARQL, but following some guidelines I have managed to find a query that works for a single ID.
from SPARQLWrapper import SPARQLWrapper, JSON
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT *
WHERE {
wd:Q##### rdfs:label ?label .
FILTER (langMatches( lang(?label), "EN" ) )
}
LIMIT 1
"""
sparql = SPARQLWrapper("http://query.wikidata.org/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
output = sparql.query().convert()
I had hoped that iterating over a list of IDs would be as simple as putting the IDs in a dataframe and using the apply function...
ids_DF['label'] = ids_DF['instance_id'].apply(my_query_function)
... However, when I do that it errors out with an "HTTPError: Too Many Requests" error. Looking into the documentation, specifically the query limits section, I see the following:
Query limits
There is a hard query deadline configured which is set to 60 seconds. There are also following limits:
One client (user agent + IP) is allowed 60 seconds of processing time each 60 seconds
One client is allowed 30 error queries per minute
I'm unsure how to go about resolving this. Am I looking at running 6k error queries (I'm unsure what an error query even is)? In that case I presumably need to run them in batches to stay under the limit of 30 per minute.
My first attempt to resolve this was to put a delay of 2 seconds after each query (see the third from last line below). I noticed that each instance ID was taking approximately 1 second to return a value, so my thinking was that the delay would bring the time per query up to about 3 seconds (which should comfortably keep me within the limit). However, that still returns the same error. I've tried extending this sleep period as well, with the same results.
import time
from SPARQLWrapper import SPARQLWrapper, JSON
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT *
WHERE {
wd:Q##### rdfs:label ?label .
FILTER (langMatches( lang(?label), "EN" ) )
}
LIMIT 1
"""
sparql = SPARQLWrapper("http://query.wikidata.org/sparql")
sparql.setQuery(query)
time.sleep(2)  # from the time module, imported above
sparql.setReturnFormat(JSON)
output = sparql.query().convert()
A similar question on this topic was asked here but I've not been able to follow the advice given.
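One way to stay well inside those limits (not something from the original post, just a sketch) is to batch many IDs into a single query with VALUES and to send a descriptive User-Agent, so the 6k lookups only need a few dozen requests. The IDs and contact address below are illustrative placeholders:
import time
from SPARQLWrapper import SPARQLWrapper, JSON

ids = ["Q42", "Q64", "Q90"]  # illustrative; in practice, loop over ~100-ID chunks of the 6k list
sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="label-lookup script (contact: you@example.org)")  # placeholder contact
sparql.setReturnFormat(JSON)

values = " ".join("wd:%s" % qid for qid in ids)
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?item ?label
WHERE {
  VALUES ?item { %s }
  ?item rdfs:label ?label .
  FILTER (langMatches( lang(?label), "EN" ) )
}
""" % values)

output = sparql.query().convert()
for row in output["results"]["bindings"]:
    print(row["item"]["value"], row["label"]["value"])
time.sleep(1)  # if looping over chunks, a short pause between them gives a courteous extra margin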

Related

'distributed=true' property doesn't seem to work with ingest from query

I am performing ingest from query in the following manner:-
.append async mytable with(distributed=true) <| myquery
Since this uses 'async', I got an OperationId to track the progress. When I issue the .show operations command against that OperationId, I get 2 rows in the result set. The 'State' column value for both rows was 'InProgress'. The 'NodeId' column value for one of the rows was blank, whereas for the other row it was KENGINE000001. My cluster has 10+ worker nodes. Should I be getting ~10 rows as a result of this command, since I am using the distributed=true option? My data load is also heavy, so it's really a candidate for distributed ingestion. So either this property is not working, or I am not interpreting its usage correctly?
Should I be getting ~10 rows as a result of this command, since I am using the distributed=true option?
No
So either this property is not working or I am not interpreting its usage correctly?
Likely the latter, or a false expectation about the output of .show operations; see above.
You can track the state/status of an async command using .show operations <operation_id>.
If it doesn't reach a final state ("Completed", "Failed", "Throttled", etc.) after 1 hour, that's unexpected, and you should open a support ticket for it.
Regardless, it's ill-advised to attempt to ingest a lot of data (multi-GB or more) using a single command, even if it's distributed.
If that's what you're attempting to do, you should consider splitting your ingestion into multiple commands, each handling a subset of the data.
See the 'remarks' section here: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/data-ingestion/ingest-from-query
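For reference, a minimal sketch of polling that state from Python until a final state is reached, assuming the azure-kusto-data package; the cluster URL, database name, and operation id are placeholders:
import time
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

CLUSTER = "https://<cluster>.kusto.windows.net"    # placeholder
DATABASE = "<database>"                            # placeholder
OPERATION_ID = "<operation id returned by .append async>"

client = KustoClient(KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER))
final_states = {"Completed", "Failed", "Throttled"}

while True:
    rows = client.execute_mgmt(DATABASE, f".show operations {OPERATION_ID}").primary_results[0]
    states = {row["State"] for row in rows}        # an operation may report more than one row
    if states <= final_states:                     # stop once every row is in a final state
        print("operation finished:", states)
        break
    time.sleep(30)                                 # poll every 30 seconds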

Cosmos db OFFSET LIMIT clause issue

I am querying a Cosmos DB using the REST API. I am having problems with the 'OFFSET LIMIT' clause. I have tested this both with my code (Dart) and Postman with the same results:
This query works ok:
SELECT * FROM Faults f WHERE CONTAINS(f.Key, 'start', true)
This query does not work. Same as 1 but using OFFSET and LIMIT to get a subset:
SELECT * FROM Faults f
WHERE CONTAINS(f.Key, 'start', true)
OFFSET 10 LIMIT 10
This query works ok. Same as 2. but with an additional filter
SELECT * FROM Faults f
WHERE CONTAINS(f.Key, 'start', true)
AND f.Node = 'XI'
OFFSET 10 LIMIT 10
I don't get why 2 fails when 1 and 3 work.
This is the response from query 2:
{
"code": "BadRequest",
"message": "The provided cross partition query can not be directly served by the gateway. This is a first chance (internal) exception that all newer clients will know how to handle gracefully. This exception is traced, but unless you see it bubble up as an exception (which only happens on older SDK clients), then you can safely ignore this message.\r\nActivityId: 5918ae0e-71ab-48a4-aa20-edd8427fe21f, Microsoft.Azure.Documents.Common/2.11.0",
"additionalErrorInfo": "{\"partitionedQueryExecutionInfoVersion\":2,\"queryInfo\":{\"distinctType\":\"None\",\"top\":null,\"offset\":10,\"limit\":10,\"orderBy\":[],\"orderByExpressions\":[],\"groupByExpressions\":[],\"groupByAliases\":[],\"aggregates\":[],\"groupByAliasToAggregateType\":{},\"rewrittenQuery\":\"SELECT *\\nFROM Faults AS f\\nWHERE CONTAINS(f.Key, \\\"start\\\", true)\\nOFFSET 0 LIMIT 20\",\"hasSelectValue\":false},\"queryRanges\":[{\"min\":\"\",\"max\":\"FF\",\"isMinInclusive\":true,\"isMaxInclusive\":false}]}"
}
Thanks for your help
It seems that you can't execute a cross-partition query through the REST API.
You probably have to use the official SDKs.
Cosmos DB : cross partition query can not be directly served by the gateway
Thanks decoy for pointing me in the right direction.
OFFSET LIMIT is not supported by the REST API.
Pagination, though, can be achieved with headers, without using the SDK.
Set on your first request:
x-ms-max-item-count to the number of records you want to retrieve at a time, e.g. 10.
With the response you get the header:
x-ms-continuation: a string that points to the next document.
To get the next 10 documents send a new request with the headers:
x-ms-max-item-count = 10. Just like the first one.
x-ms-continuation set to the value you got from the response.
So it is very easy to get the next documents, but it is not straightforward to get the previous ones. I had to save the document number and 'x-ms-continuation' strings as key-value pairs and use them to implement a 'search previous' pagination.
I don't know if there is an easier way to do it.
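For illustration, the paging loop above looks roughly like this in Python with requests. The account, database, and collection names are placeholders, and make_auth_headers() is a hypothetical stand-in for your existing code that builds the authorization, date, and x-ms-version headers (signing is omitted in this sketch):
import requests

URL = "https://<account>.documents.azure.com/dbs/<db>/colls/<coll>/docs"   # placeholders
QUERY = {"query": "SELECT * FROM Faults f WHERE CONTAINS(f.Key, 'start', true)", "parameters": []}

def make_auth_headers():
    # hypothetical placeholder: build the 'authorization', 'x-ms-date' and 'x-ms-version'
    # headers here (HMAC signing of the verb/resource/date, not shown in this sketch)
    raise NotImplementedError

continuation = None
while True:
    headers = make_auth_headers()                                  # your signing code
    headers["Content-Type"] = "application/query+json"
    headers["x-ms-documentdb-isquery"] = "true"
    headers["x-ms-documentdb-query-enablecrosspartition"] = "true"
    headers["x-ms-max-item-count"] = "10"                          # page size
    if continuation:
        headers["x-ms-continuation"] = continuation                # resume after the previous page
    resp = requests.post(URL, json=QUERY, headers=headers)
    resp.raise_for_status()
    for doc in resp.json()["Documents"]:
        print(doc.get("id"))
    continuation = resp.headers.get("x-ms-continuation")
    if not continuation:                                           # no more pages
        break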

Problem with automated SQLite query via Node Red

Warning, I am a complete noob with SQLite and Node-Red.
I am working on a project to scan and read car license plates. I now have the hardware up and running, it is passing plate information to a very basic SQLite 3 table of two records through Node-Red on a Raspberry Pi 3.
I can run instant queries, where a module sends over an exact query to run, e.g.
SELECT "License_Plate" FROM QuickDirtyDB WHERE "License_Plate" LIKE "%RAF66%"
This will come back with my plate RAF660, as below
topic: "SELECT "License_Plate" FROM QuickDirtyDB WHERE "License_Plate" LIKE "%RAF66%""
payload: array[1]
0: object
License_Plate: "RAF660"
When I automate and run this query it will not work; I have been playing with this for three days now.
I am even unable to get a very basic automated query to work like
var readlpr = msg.payload;
msg.topic = 'SELECT "License_Plate" FROM QuickDirtyDB WHERE "License_Plate" = ' + readlpr + ''
return msg;
that's two single quotes at the end of the query line.
This is sent through to the query as below, it is the output from the debug node, exactly what is going into the query.
"SELECT "License_Plate" FROM QuickDirtyDB WHERE "License_Plate" = RAF660 "
and the error that comes out is,
"Error: SQLITE_ERROR: no such column: RAF660"
After this is working, I need to work out how I can allow a mismatch of two characters, in case the OCR software either misreads two characters or drops two characters entirely. Is this something that a query can handle, or will I have to pass many plate details to a program to work out if I have a match?
I thought I would have had to run a query to create some kind of a view and then requery my read plate vs that view to see which plate in the database is the closest match, not sure if I have the terminology correct, view, join, union etc.
Thank you for looking and any suggestions you may have.
I will probably be going home in about an hour, so may not be able to check back in till Monday
RAF660 is a string and needs to be quoted "RAF660"
License_Plate is a column and should not be quoted.
The way you have it reads as fetch the rows where the RAF660 column is set to the value "License_Plate".
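The same rule, sketched here with Python's sqlite3 module rather than a Node-RED function node (the database file name is a placeholder): leave the column name unquoted and, better than quoting the value by hand, bind it as a parameter.
import sqlite3

conn = sqlite3.connect("quickdirty.db")           # placeholder file containing the QuickDirtyDB table
readlpr = "RAF660"                                # the plate text coming from the OCR step
rows = conn.execute(
    "SELECT License_Plate FROM QuickDirtyDB WHERE License_Plate = ?",
    (readlpr,),                                   # bound as a string value, so no quoting mistakes
).fetchall()
print(rows)
In the function node itself, the equivalent fix is simply to put single quotes around the value when building msg.topic.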

SOLR Delta import takes longer than next scheduled delta import cron job

We are using Solr 5.0.0. Delta import configuration is very simple, just like the apache-wiki
We have setup cron job to do delta-imports every 30 mins, simple setup as well:
0,30 * * * * /usr/bin/wget http://<solr_host>:8983/solr/<core_name>/dataimport?command=delta-import
Now, what happens if the currently running delta-import sometimes takes longer than the next scheduled cron job?
Does Solr launch the next delta-import in a parallel thread, or ignore the job until the previous one is done?
Extending the interval in the cron scheduler isn't an option, as the same problem could happen as the user and document counts increase over time...
I had a similar problem on my end.
Here is how I worked around it.
Note: I have implemented Solr with cores.
I have one table in which I keep information about Solr, such as the core name, last re-index date, re-indexing-required and current_status.
I have written a scheduler which checks that table for the cores that need re-indexing (delta-import) and starts the re-index.
Re-indexing requests are sent/invoked every 20 minutes (in your case it's 30 minutes).
When I start the re-indexing I also update the table and mark the status for the specific core as "inprogress".
After ten minutes I fire a request to check whether the re-indexing has completed.
For checking the re-indexing I use a request like:
final URL url = new URL(SOLR_INDEX_SERVER_PROTOCOL, SOLR_INDEX_SERVER_IP, Integer.valueOf(SOLR_INDEX_SERVER_PORT),
"/solr/"+ core_name +"/select?qt=/dataimport&command=status");
I check the status for "Committed" or "idle", then consider the re-indexing completed and mark the core's status as idle in the table.
So the re-indexing scheduler won't pick cores which are in "inprogress" status.
It also only considers cores for re-indexing where there are some updates (which can be identified by the flag "re-indexing-required").
Re-indexing is invoked only if re-indexing-required is true and the current status is idle.
If there are some updates (identified by "re-indexing-required") but the current_status is inprogress, the scheduler won't pick it for re-indexing.
I hope this helps.
Note: I have used DIH for indexing and re-indexing.
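A minimal sketch of that status check in Python (not from the answer above): it assumes the standard DataImportHandler endpoint, reuses the <solr_host>/<core_name> placeholders from the cron line in the question, and only triggers the next delta-import when the handler reports itself idle.
import requests

BASE = "http://<solr_host>:8983/solr/<core_name>/dataimport"   # placeholders as in the question

status = requests.get(BASE, params={"command": "status", "wt": "json"}).json()
if status.get("status") == "idle":
    # the previous delta-import has finished, so it is safe to start the next one
    requests.get(BASE, params={"command": "delta-import", "wt": "json"})
else:
    print("delta-import still running, skipping this cycle")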
Solr will simply ignore the next import request until the first one has finished, and it will not queue the second request. I have observed this behaviour, and I read it somewhere as well but couldn't find the reference now.
In fact I'm dealing with the same problem. I try to optimize the queries:
deltaImportQuery="select * from Assests where ID='${dih.delta.ID}'"
deltaQuery="select [ID] from Assests where date_created > '${dih.last_index_time}' "
I only retrieve the ID field at first and then try to retrieve the intended doc.
You may also specify your fields instead of the '*' sign; since I use a view it doesn't apply in my case.
I will update if I find another solution.
Edit After Solution
Beyond the suggestion above, I made one more change that sped up my indexing process 10 times. I had two big entities nested; I used one entity inside the other, like:
<entity name="TableA" query="select * from TableA">
    <entity name="TableB" query="select * from TableB where TableB.TableA_ID='${TableA.ID}'" >
    </entity>
</entity>
This yields multi-valued TableB fields, but for every row one request is made to the database for TableB.
I changed my view to use a WITH clause combined with a comma-separated field value, then parse the value in the Solr field mapping and index it into a multi-valued field.
My whole indexing process sped up from hours to minutes. Below are my view and Solr mapping config.
WITH tableb_with AS (SELECT * FROM TableB)
SELECT *,
       STUFF( (SELECT ',' + REPLACE(fieldb1, ',', ';') FROM tableb_with WHERE tableb_with.TableA_ID = TableA.ID
               FOR XML PATH(''), TYPE).value('.', 'varchar(max)'), 1, 1, '') AS field1WithComma,
       -- the second multi-valued field is built the same way (source column name here is illustrative)
       STUFF( (SELECT ',' + REPLACE(fieldb2, ',', ';') FROM tableb_with WHERE tableb_with.TableA_ID = TableA.ID
               FOR XML PATH(''), TYPE).value('.', 'varchar(max)'), 1, 1, '') AS field2WithComma
FROM TableA
All the fancy joins and unions go into the WITH clause for TableB, and there are also a lot of joins on TableA. In total this view held around 200 fields.
The Solr mapping goes like this:
<field column="field1WithComma" name="field1" splitBy=","/>
Hope it may help someone.

How to get SimpleDB Query Count with Boto and SDBManager

I would like to query my SimpleDB domain to get the count of records that match a certain criteria. Something that could be done like this:
rs = appsDomain.select("SELECT count(*) FROM %s WHERE (%s='%s' or %s='%s') and %s!='%s'" % (APPS_SDBDOMAIN, XML_APPNODE_NAME_ATTR, appName, XML_APPNODE_RESERVED_NAME_ATTR, appName, XML_EMAIL_NODE, thisSession.email), None, True)
After doing some reading I have found that getting a query count from SimpleDB via the SDBManager count method might be more efficient than doing a straightforward "count(*)"-style query. Further, I would love not to have to loop over a result set when I know there is only one row and one column that I need, so I would also like to avoid this:
count = int(rs.iter().next()['Count'])
Is it true that SDBManager is more efficient? Is there a better way?
If SDBManager is the best way can anyone show me how to use it as I have been thoroughly unsuccessful?
Thanks in advance!
Well, I stopped being lazy and simply went to the source to get my answer
(FROM: boto-2.6.0-py2.7.egg/boto/sdb/db/manager/sdbmanager.py)
def count(self, cls, filters, quick=True, sort_by=None, select=None):
    """
    Get the number of results that would
    be returned in this query
    """
    query = "select count(*) from `%s` %s" % (self.domain.name, self._build_filter_part(cls, filters, sort_by, select))
    count = 0
    for row in self.domain.select(query):
        count += int(row['Count'])
        if quick:
            return count
    return count
As you can see, the sdbmanager.count method does nothing special; in fact it does exactly what I was hoping to avoid, which is looping over a result set just to get the 'Count' value(s).
So in the end I will probably just implement this method myself, as using SDBManager actually implies a lot more overhead which, in my case, is not worth it.
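For illustration, a minimal version of such a helper might look like this (a sketch using the legacy boto SimpleDB API; the domain name and where-clause are placeholders). It still sums over the returned rows, because SimpleDB can split a count(*) across several partial Count rows:
import boto

sdb = boto.connect_sdb()                              # credentials taken from the environment
domain = sdb.get_domain("my-apps-domain")             # placeholder domain name

def count_items(domain, where_clause):
    query = "select count(*) from `%s` where %s" % (domain.name, where_clause)
    return sum(int(row["Count"]) for row in domain.select(query))

print(count_items(domain, "appName = 'myApp'"))       # placeholder filter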
Thanks!
