Wikidata Forbidden Access - wikidata

I was trying to run some Wikidata queries with Python requests and multiprocessing (number_workers = 8) and now I'm getting code 403 (Access Forbidden). Are there any restrictions? I've seen here that I should limit myself to 5 concurrent queries, but now even with one query I don't get any result through Python. It used to work.
Is this Access Forbidden temporary, or am I blacklisted forever? :(
I didn't see any restrictions in their docs, so I was not aware that I was doing something that would get me banned.
Does anyone know what the situation is?
wikidata_url = 'https://query.wikidata.org/sparql'
headers = {'User-Agent': 'Chrome/77.0.3865.90'}
r = requests.get(wikidata_url, params={'format': 'json', 'query': query, 'headers': headers})
EDIT AFTER FIXES:
It turned out that I was temporarily banned from the server. I changed my user agent to follow the recommended template and waited for my ban to be removed. The problem was that I was ignoring error 429, which tells me that I have exceeded my allowed limit and have to retry after some time (a few seconds). This led to the 403 error.
I tried to correct this mistake, caused by inexperience, by writing the following code that takes this into account. I added this edit because it may be useful for someone else.
import datetime
import time

import requests

wikidata_url = 'https://query.wikidata.org/sparql'

def get_delay(date):
    # Retry-After may be either an HTTP date or a number of seconds
    try:
        date = datetime.datetime.strptime(date, '%a, %d %b %Y %H:%M:%S GMT')
        timeout = int((date - datetime.datetime.now()).total_seconds())
    except ValueError:
        timeout = int(date)
    return timeout

def make_request(params):
    r = requests.get(wikidata_url, params)
    print(r.status_code)
    if r.status_code == 200:
        if r.json()['results']['bindings']:
            return r.json()
        else:
            return None
    if r.status_code == 500:
        return None
    if r.status_code == 403:
        return None
    if r.status_code == 429:
        # Too many requests: wait for the interval given in the Retry-After header, then retry
        timeout = get_delay(r.headers['retry-after'])
        print('Timeout {} m {} s'.format(timeout // 60, timeout % 60))
        time.sleep(timeout)
        return make_request(params)
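For completeness, a hypothetical usage sketch (the SPARQL query string here is only a placeholder; in practice a policy-compliant User-Agent header should also be sent, as the answer below recommends):
query = 'SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 10'  # placeholder query
result = make_request({'format': 'json', 'query': query})
if result is not None:
    for binding in result['results']['bindings']:
        print(binding)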

The access limits were tightened up in 2019 to try to cope with overloading of the query servers. The generic python-requests user agent was blocked as part of this (I don't know if/when this was reinstated).
Per the Query Service manual, the current rules seem to be:
One client (user agent + IP) is allowed 60 seconds of processing time each 60 seconds
One client is allowed 30 error queries per minute
Clients who don't comply with the User-Agent policy may be blocked completely
Access to the service is limited to 5 parallel queries per IP [this may change]
I would recommend trying again, running single queries with a more detailed user-agent, to see if that works.
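For example, a minimal single-query sketch; the User-Agent string below is only an illustration of the recommended "tool/version (contact information)" template, not an official value:
import requests

wikidata_url = 'https://query.wikidata.org/sparql'
# Illustrative User-Agent per the Wikimedia User-Agent policy; use your own tool name and contact details
headers = {'User-Agent': 'MyWikidataTool/1.0 (https://example.org/my-tool; me@example.org)'}
query = 'SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 10'  # placeholder query

r = requests.get(wikidata_url,
                 params={'format': 'json', 'query': query},
                 headers=headers)  # headers is a separate argument, not an entry in params
print(r.status_code)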

Related

Log entry to Splunk using python

In Splunk we have a URL, index, token, host, source and sourcetype, and with those details I need to post data to Splunk using Python.
I was able to write code using requests with the URL, index and token, and it works:
import requests

url = 'SPLUNK_URL'
Header = {'Authorization': 'Splunk ' + '1234567'}
payload = {"index": "xxx_yyy", "event": {'message': "Value"}}
r = requests.post(url, headers=Header, json=payload, verify=False)
But sometimes I get this error: ConnectionError: ('Connection aborted.', OSError("(10054, 'WSAECONNRESET')")). How can I avoid this error?
Assuming this is HEC,
I would compare the times you receive this error against the times you have issues on the receiver, such as high CPU utilization, or check its internal logs for connection drops, etc. That could be your answer, as the receiver rejects/resets. Also, if you are sending directly to an Indexer rather than to an intermediate instance, I believe there is a common issue with that.
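If the resets are transient, one possible mitigation (my own suggestion, not part of the answer above) is to retry the POST with backoff using requests' built-in Retry support; url, Header and payload are assumed to be defined as in the question:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=1,
                allowed_methods=['POST'],  # named method_whitelist in older urllib3 releases
                status_forcelist=[500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

r = session.post(url, headers=Header, json=payload, verify=False, timeout=30)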

Stop Scrapy request pipeline for a few minutes and retry

I am scraping a single domain using Scrapy and a Crawlera proxy, and sometimes, due to Crawlera issues (technical breaks), I get a 407 status code and can't scrape any site. Is it possible to stop the request pipeline for 10 minutes and then restart the spider? To be clear, I don't want to defer the request but stop everything (maybe except Item processing) for 10 minutes until they resolve the problem. I am running 10 concurrent threads.
Yes you can; there are a few ways of doing this, but the most obvious would be to simply insert some blocking code:
# middlewares.py
import time

class BlockMiddleware:
    def process_response(self, request, response, spider):
        if response.status == 407:
            print('beep boop, taking a nap')
            time.sleep(60)
        return response
and activate it:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.BlockMiddleware': 100,
}
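If you also want the 407 request to be retried after the pause instead of the 407 response being passed through, a variant along these lines should work (my own sketch, not part of the answer above); returning a Request from process_response makes Scrapy reschedule it, and dont_filter keeps the dupefilter from dropping the retry:
# middlewares.py
import time

class BlockAndRetryMiddleware:
    def process_response(self, request, response, spider):
        if response.status == 407:
            spider.logger.info('Got 407, pausing the crawl for 10 minutes')
            time.sleep(600)  # blocks the Twisted reactor, so everything stops
            retry_req = request.copy()
            retry_req.dont_filter = True  # otherwise the dupefilter would drop the retried request
            return retry_req
        return response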

FQL error 'failed to open stream: HTTP request failed! HTTP/1.0 500 Internal Server Error'

I have 2 Facebook accounts.
I use one for testing (it has 600 friends) and the other for development (5 friends). I am trying to get all the IDs of a user's friends using the code
function get_photo($access_token) {
    // Run fql query
    $fql_query_url = 'https://graph.facebook.com/'
        . '/fql?q=SELECT+pid+,object_id+,owner+FROM+photo+WHERE+owner+=+me()+OR+owner+IN+(SELECT+uid2+FROM+friend+WHERE+uid1+=+me())+LIMIT+50'
        . '&access_token=' . $access_token;

    // First logically check the result. TRUE: result is passed to $fql_query_result.
    // FALSE: $fql_query_result = 0 or false.
    $fql_query_result = file_get_contents($fql_query_url) ? file_get_contents($fql_query_url) : 0;

    // Only json_decode the result if it was valid.
    $fql_query_obj = (!$fql_query_result == 0) ? json_decode($fql_query_result, true) : "FQL was not a valid query";

    // Display results of fql query
    echo '<pre>';
    //print_r("query results:");
    print_r($fql_query_obj);
    echo '</pre>';

    return $fql_query_obj;
}
I have obtained the following permissions
user_about_me,read_stream,user_activities,email,user_location,user_photos,friends_photos,publish_actions,user_birthday,user_likes,read_insights,read_insights,user_status';
The problem is the code works from the developer's account but gives the following error with the test account.
file_get_contents(https://graph.facebook.com//fql?q=SELECT+pid+,object_id+,owner+FROM+photo+WHERE+owner+=+me()+OR+owner+IN+(SELECT+uid2+FROM+friend+WHERE+uid1+=+me())+LIMIT+50&access_token=AAAEEKwQN3CMBAEUC6DqfakwkSZCdeLI2zk5Ec2evZBJvZB14Nh9e4ZBs8bOOw36F9T2winWRyzSx3vSCcWOl4A80AgcOjEvft1sbW7MLeEE2cyVPCIAb): failed to open stream: HTTP request failed! HTTP/1.0 500 Internal Server Error in /var/www/JMJ_test/facebook_include1.php on line 190
FQL was not a valid query
I have found this problem with other APIs. Normally there's nothing wrong with the code; it's the number of friends on the account. The dev one works with only 5 friends, whereas the testing one doesn't because it has 600 friends. I will try and find the limit for you in a second.
Edit
Sorry, I could not find the limit, but I read something about 600 requests per 600 seconds per IP. Not sure if that has anything to do with it.

Node.js Http.request slows down under load testing. Am I doing something wrong?

Here is my sample code:
var http = require('http');

var options1 = {
    host: 'www.google.com',
    port: 80,
    path: '/',
    method: 'GET'
};

http.createServer(function (req, res) {
    var start = new Date();
    var myCounter = req.query['myCounter'] || 0;
    var isSent = false;

    http.request(options1, function(response) {
        response.setEncoding('utf8');
        response.on('data', function (chunk) {
            var end = new Date();
            console.log(myCounter + ' BODY: ' + chunk + " time: " + (end - start) + " Request start time: " + start.getTime());
            if (!isSent) {
                isSent = true;
                res.writeHead(200, {'Content-Type': 'application/xml'});
                res.end(chunk);
            }
        });
    }).end();
}).listen(3013);

console.log('Server running at port 3013');
console.log('Server running at port 3013');
What I found out is that if I connect to another server (Google or any other), the responses get slower and slower, up to a few seconds. It doesn't happen if I connect to another node.js server within the same network.
I use JMeter to test: 50 concurrent requests per second with 1000 loops.
I have no idea what the problem is...
=========================
Further investigation:
I ran the same script on Rackspace, and also on EC2 for testing. The script uses http.request to connect to Google, Facebook, and also another script of mine that simply outputs data (like hello world), hosted on another EC2 instance.
The test tool I use is JMeter on my desktop.
Pre-node.js test:
jMeter -> Google Result: fast and consistent.
jMeter -> Facebook result: fast and consistent.
jMeter -> My Simple Output Script Result: fast and consistent.
Then I ran 50 concurrent threads/sec with 100 loops against my Rackspace node.js, and then against the EC2 node.js, which has the same kind of performance issue:
jMeter -> node.js -> Google Result: from 50 ms goes to 2000 ms in 200 requests.
jMeter -> node.js -> Facebook Result: from 200 ms goes to 3000 ms after 200 requests.
jMeter -> node.js -> My Simple Output Script Result: from 100 ms goes to 1000 ms after 200 requests.
The first 10-20 requests are fast, then start slowing down.
Then, when I changed to 10 concurrent threads, things started changing: the responses are very consistent, with no slowdown.
It seems to be something to do with the number of concurrent threads that Node.js (http.request) can handle.
------------ More --------------
I did more tests today, and here they are:
I used http.Agent and increased maxSockets. However, the interesting thing is that on one testing server (EC2) it improves a lot and there is no more slowdown, while the other server (Rackspace) only improves a bit and still shows the slowdown. I even set "Connection: close" in the request header; it only improves things by about 100 ms.
If http.request is using connection pooling, how do I increase the pool size?
On both servers, if I do "ulimit -a", the open files limit is 1024.
------------- ** MORE AND MORE ** -------------------
It seems that even if I set maxSockets to a higher number, it only helps up to some limit. There seems to be an internal or OS-dependent socket limitation. How do I bump it up?
------------- ** AFTER EXTENSIVE TESTING ** ---------------
After reading lots of post, I find out:
quoted from: https://github.com/joyent/node/issues/877
1) If I set the request headers with 'Connection': 'keep-alive', the performance is good and can go up to maxSockets = 1024 (which is my Linux setting).
var options1 = {
    host: 'www.google.com',
    port: 80,
    path: '/',
    method: 'GET',
    headers: {
        'Connection': 'keep-alive'
    }
};
If I set it to "Connection":"close", the response time would be 100 times slower.
Funny things happened here:
1) In EC2, when I first test with Connection: keep-alive, it takes about 20-30 ms. Then if I change to Connection: close OR set agent: false, the response time slows down to 300 ms. Without restarting the server, if I change back to Connection: keep-alive, the response time slows down even further to 4000 ms. Either I have to restart the server or wait for a while to get back my 20-30 ms lightning-speed responses.
2) If I run it with agent: false, at first the response time slows down to 300 ms. But then it gets faster again and back to "normal".
My guess is that connection pooling is still in effect even if you set agent: false. However, if you keep Connection: keep-alive, it will be fast for sure; just don't switch it.
Update on July 25, 2011
I tried the latest node.js V0.4.9 with the http.js & https.js fix from https://github.com/mikeal/node/tree/http2
the performance is much better and stable.
I solved the problem with
require('http').globalAgent.maxSockets = 100000
or
agent = new http.Agent();
agent.maxSockets = 1000000;  // 1 million
http.request({agent: agent});
I was struggling with the same problem. I found my bottleneck to be the DNS stuff, although I'm not exactly clear where/why. If I make my requests to something like http://myserver.com/asd I am barely able to run 50-100 rq/s, and if I go beyond 100 rq/s things become a disaster: response times become huge, some requests never finish and wait indefinitely, and I need to kill -9 my server. If I make the requests to the IP address of the server, everything is stable at 500 rq/s, although not exactly smooth, and the graph (I have a realtime graph) is peaky. And beware, there is still a limit to the number of open files in Linux, and I managed to hit it once. Another observation is that a single node process cannot smoothly make 500 rq/s. But I can start 4 node processes each making 200 rq/s and get a very smooth graph, consistent CPU/net load and very short response times. This is node 0.10.22.
This will not necessarily fix your problem, but it cleans up your code a bit and makes use of the various events in the way you should:
var http = require('http');

var options1 = {
    host: 'www.google.com',
    port: 80,
    path: '/',
    method: 'GET'
};

http.createServer(function (req, res) {
    var start = new Date();
    var myCounter = req.query['myCounter'] || 0;

    http.request(options1, function(response) {
        res.on('drain', function () {   // when the output stream's buffer has drained
            response.resume();          // continue to receive from the input stream
        });
        response.setEncoding('utf8');
        res.writeHead(response.statusCode, {'Content-Type': 'application/xml'});
        response.on('data', function (chunk) {
            if (!res.write(chunk)) {    // if write failed, the stream is choking
                response.pause();       // tell the incoming stream to wait until the output stream has drained
            }
        }).on('end', function () {
            var end = new Date();
            console.log(myCounter + ' time: ' + (end - start) + " Request start time: " + start.getTime());
            res.end();
        });
    }).end();
}).listen(3013);

console.log('Server running at port 3013');
I removed the output of the body. Since we are streaming from one socket to another we cannot be sure to see the entire body at any time without buffering it.
EDIT: I believe node is using connection pooling for http.request. If you have 50 concurrent connections (and thus 50 concurrent http.request attempts) you might be running into the connection pool limit. I currently have no time to look this up for you, but you should have a look at the node documentation regarding http, especially http agent.
EDIT 2: There's a thread regarding a very similar problem on the node.js mailing list. You should have a look at it, especially Mikael's post should be of interest. He suggests turning off connection pooling entirely for the requests by passing the option agent: false to the http.request invocation. I do not have any further clues, so if this does not help maybe you should try getting help on the node.js mailing list.
Github issue 877 may be related:
https://github.com/joyent/node/issues/877
Though it's not clear to me if this is what you're hitting. The "agent: false" workaround worked for me when I hit that, as did setting a "connection: keep-alive" header with the request.

ORA-29270: too many open HTTP requests

Can someone help me with this problem? It occurs whenever I run this from a TRIGGER, but it works in a normal PROCEDURE.
TRIGGER:
create or replace
procedure testeHTTP(search varchar2)
IS
  req  sys.utl_http.req;
  resp sys.utl_http.resp;
  url  varchar2(500);
Begin
  url := 'http://www.google.com.br';
  dbms_output.put_line('abrindo');
  -- Opening the connection and starting a request
  req := sys.utl_http.begin_request(search);
  dbms_output.put_line('preparando');
  -- Preparing to get the response
  resp := sys.utl_http.get_response(req);
  dbms_output.put_line('finalizando response');
  -- Ending the request/response exchange
  sys.utl_http.end_response(resp);
Exception
  When Others Then
    dbms_output.put_line('excecao');
    dbms_output.put_line(sys.utl_http.GET_DETAILED_SQLERRM());
End;
Close your user session and the problem is fixed.
Internally there is a limit of 5 HTTP requests.
The problem might be a missing utl_http.end_response,
or an exception in the app without a close of the resp object.
Modify the code like this:
EXCEPTION
  WHEN UTL_HTTP.TOO_MANY_REQUESTS THEN
    UTL_HTTP.END_RESPONSE(resp);
You need to close your requests once you are done with them; it does not happen automatically (unless you disconnect from the DB entirely).
It used to be utl_http.end_response, but I am not sure if it is the same API any more.
Usually we need UTL_HTTP.END_RESPONSE(resp); to avoid ORA-29270: too many open HTTP requests, but I think I reproduced the problem of @Clóvis Santos in Oracle 19c.
If the web service always returns status 200 (success), then "too many open HTTP requests" never happens. But if persistent connections are enabled and the web service returns status 404, the behavior becomes different.
Let's call something that always returns 404.
The first call of utl_http.begin_request returns normally and opens a new persistent connection. We can check it with select utl_http.get_persistent_conn_count() from dual;. The second call causes an exception inside utl_http.begin_request and the persistent connection gets closed. (The exception is correctly handled with end_response/end_request.)
If I continue, each odd execution returns 404 normally and each even execution gives an exception (handled correctly, of course).
After some iterations I get ORA-29270: too many open HTTP requests. If the web service returns status 200, everything goes normally.
I guess it happens because of the specific web service. Probably it drops the persistent connection after a 404 and doesn't after a 200. The second call tries to reuse the persistent connection, but it no longer exists, which causes a request leak.
If I use utl_http.set_persistent_conn_support(false, 0); once in my session, the problem disappears. I can call the web service as many times as I need.
Resolution:
Try switching off persistent connection support. Probably, on the HTTP server, persistent connections work differently for different requests. Looks like a bug.
