I'm getting a permanent hang when trying to read a response using requests to access a particular site, most likely due to blocking of some sort. What I'm unsure about is how curl, which successfully receives a response, differs from my Python GET request, which never receives any response.
Note: the curl command is expected to return an error, as I'm not sending required info like cookies.
curl:
curl 'https://www.yellowpages.com.au/search/listings?clue=Programmer&locationClue=All+States&pageNumber=3&referredBy=UNKNOWN&&eventType=pagination' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0'
successfully gets response
Python:
import requests
r = requests.get('https://www.yellowpages.com.au/search/listings?clue=Programmer&locationClue=All+States&pageNumber=3&referredBy=UNKNOWN&&eventType=pagination', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0'})
hangs on read forever
It works with Python 3.
import requests
r = requests.get('https://www.yellowpages.com.au/search/listings?clue=Programmer&locationClue=All+States&pageNumber=3&referredBy=UNKNOWN&&eventType=pagination', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0'})
print(r.headers)
Response:
{'Cache-Control': 'max-age=86400, public', 'Content-Encoding': 'gzip', 'Content-Language': 'en-US', 'Content-Type': 'text/html;charset=utf-8', 'Server': 'Apache-Coyote/1.1', 'Vary': 'Accept-Encoding', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Length': '8009', 'Date': 'Wed, 19 Feb 2020 06:04:55 GMT', 'Connection': 'keep-alive'}
There may be subtle differences in the way the requests are made. For example, Python requests will automatically add a few headers:
'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'
(You can see them by executing: r.request.headers)
Whereas curl will add Accept: */*, but not gzip unless you ask for it. But the site in question seems to support gzip, so the problem must lie elsewhere.
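If you want to rule out those header differences, you can strip the defaults and send roughly what curl sends (a minimal sketch using the URL and User-Agent from the question):
import requests

# URL from the question.
url = ('https://www.yellowpages.com.au/search/listings?clue=Programmer'
       '&locationClue=All+States&pageNumber=3&referredBy=UNKNOWN&&eventType=pagination')

# Drop the headers requests adds by default (Accept-Encoding, Connection, ...)
# and send roughly what curl sends: a User-Agent and Accept: */*.
session = requests.Session()
session.headers.clear()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0',
    'Accept': '*/*',
})

r = session.get(url, timeout=10)  # a timeout prevents hanging forever
print(r.status_code, r.headers.get('Content-Type'))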
Suggestion: add a timeout to your request and catch possible exceptions, e.g. (url and headers being the ones from your original call):
try:
    r = requests.get(url, headers=headers, timeout=10)
except requests.exceptions.RequestException as e:
    print(e)
Related
I'm trying to integrate a custom authentication service with Micronaut Security. To do this I've implemented my own AuthenticationProvider, and that works fine for basic auth; however, I also need to take care of authentication tokens passed in the request.
To do this I'm trying to implement my own AuthenticationFetcher, and in the fetchAuthentication method I'm trying to get my custom authentication header and then authenticate the request.
@Override
public Publisher<Authentication> fetchAuthentication(HttpRequest<?> request) {
    if (request.getHeaders().get(authConfiguration.getTokenHeader()) != null) {
The issue I'm having is that Netty's request.getHeaders() doesn't return all the headers that are being sent to the web service (I confirmed this from my browser's developer console):
GET /service/all HTTP/1.1
Accept: application/json, text/plain, */*
Cookie: m=2258:Z3Vlc3Q6Z3Vlc3Q%253D
Accept-Encoding: gzip, deflate
Host: localhost:4200
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15
Accept-Language: pl-pl
Referer: http://localhost:4200/campaigns
Connection: keep-alive
X-Token: my.token.here
And here are my app settings:
micronaut:
  server:
    netty:
      maxHeaderSize: 1024
      worker:
        threads: 4
      parent:
        threads: 4
      childOptions:
        autoRead: true
  application:
    name: appName
Any feedback appreciated.
It was caused by the CORS config:
micronaut:
  server:
    cors:
      enabled: true
After adding this, my request header was removed in the filter chain.
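If you still need the custom header with CORS enabled, one thing worth checking (an assumption on my part, not verified against this app) is whether the header has to be whitelisted in the per-configuration allowedHeaders list:
micronaut:
  server:
    cors:
      enabled: true
      configurations:
        web:
          # "web" is an arbitrary configuration name; X-Token is the custom
          # header from the question.
          allowedHeaders:
            - Content-Type
            - X-Token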
I've been trying to scrape the data from www.spitogatos.gr but with no luck.
My code looks something like:
import requests

headers = {
    'Host': 'www.spitogatos.gr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept': 'application/json; charset=utf-8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.spitogatos.gr/',
    'Content-Type': 'text/plain; charset=utf-8',
    'Origin': 'https://www.spitogatos.gr'
}

url = "https://www.spitogatos.gr/search/results/residential/sale/r100/m2007m/propertyType_apartment/onlyImage"
req = requests.get(url, headers=headers)
print(req)
print(req.content)
Although I get a response status 200, instead of any useful content I get the HTML message:
Pardon Our Interruption. As you were browsing, something about your browser made us think you were a bot. There are a few reasons this might happen: You've disabled JavaScript in your web browser. You're a power user moving through this website with super-human speed. You've disabled cookies in your web browser. A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. ...
Now I had a look at Firefox to see what kind of request it sends, and although it sends a POST request, I copied the Cookie that Firefox sends with its request. So my headers would look something like this:
headers = {
    'Host': 'www.spitogatos.gr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept': 'application/json; charset=utf-8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.spitogatos.gr/',
    'Content-Type': 'text/plain; charset=utf-8',
    'Origin': 'https://www.spitogatos.gr',
    'Cookie': 'PHPSESSID=iielr7e8boibudjmln9ubd3i62; spitogatosS=areaIDs%255B0%255D%3D2007%26propertyCategory%3Dresidential%26listingType%3Dsale; currentCurrency=EUR; spitogatosInfoBar_shortCuts=false; openedTabs=; _ga=GA1.2.1557987790.1597249012; _gid=GA1.2.507964674.1597249012; _gat_UA-3455846-10=1; _gat_UA-3455846-2=1; _hjid=dfd027d6-e6f1-474c-a427-c26d5f2ca64c; _cmpQcif3pcsupported=1; reese84=3:T2t/w3VdpNG5w9Knf78l7w==:Gg20L4RyGJffieidEn4Eb1Hmb3wyAtPQfmH/5WYHWfKjzLmjhkGCoTR0j5UUmKxIbkzZltWBeJ6KaPVCFa5qiaddz2Cn6OltrBdp…2YIriDYTOwLMNNxEFPDPkL/Lw2cGC0MwJ3uUg6kSP/VgPp/AYkIcVjXLgqjSwmAdGl4oQDyrAKDpn9PcN/fWSUjPrtAOAJzkWcZ7FPCfvcsnAo9oSNpXtAaZ0JLzgMKXqQqP8Jrakjo4eL9TSdFKIVEJZos=:eBpByDUvhUkR0pGwgnYacTV3VeYzKEi+4pJpI3mhQ6c=; _fbp=fb.1.1597249012911.16321581; _hjIncludedInPageviewSample=1; eupubconsent=BO4CN1PO4CN1PAKAkAENAAAAgAAAAA; euconsent=BO4CN1PO4CN1PAKAkBENDV-AAAAx5rv6_77e_9f-_fv_9ujzGr_v_e__2mccL5tn3huzv6_7fi_-0nV4u_1tfJdydkh-5YpCjto5w7iakiPHmqNeZ1nfmz1eZpRP58E09j53zpEQ_r8_t-b7BCHN_Y2v-8K96lPKACA; spitogatosHomepageMap=0'.encode('utf8')
}
Now when I do this, the request will SOMETIMES work and sometimes it will give me the above message, so I keep refreshing Firefox and copying the cookies over to my Python script.
It is hit and miss.
I have also tried to replicate the exact POST request of Firefox (identical data, params and headers), but this gives me the same error message.
Firefox does not seem to have this problem, no matter how many requests or refreshes I do.
Can someone help me please?
Thanks
By specifying just User-Agent and Accept-Language I was able to get a correct response every time (try changing the User-Agent header to a Linux one, as in my example):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept-Language': 'en-US,en;q=0.5'
}

url = 'https://www.spitogatos.gr/search/results/residential/sale/r100/m2007m/propertyType_apartment/onlyImage'
print(requests.get(url, headers=headers).text)
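If you do end up needing cookies, a requests.Session keeps whatever cookies the site sets between calls, so you don't have to copy them out of Firefox by hand (a generic sketch, not verified against this site):
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept-Language': 'en-US,en;q=0.5',
})

# Any Set-Cookie values from the first response are stored in session.cookies
# and sent back automatically on later requests through the same session.
session.get('https://www.spitogatos.gr/', timeout=30)
r = session.get('https://www.spitogatos.gr/search/results/residential/sale/r100/m2007m/propertyType_apartment/onlyImage', timeout=30)
print(r.status_code)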
I have resolved the problem: the answer was in fact very simple. It was "AdBlocker Ultimate" that was the cause of the blockage.
So if you have ad blockers, you could try deactivating them, just to see, because the website may identify you as a bot when you use them.
Just so you know, I have three ad blockers (Adblock Plus, uBlock Origin and AdBlocker Ultimate), and it seems that only AdBlocker Ultimate (installed very recently, while I have had the others for months) caused the trouble.
I hope this solution works for you too, if you're in the same situation, or can help anybody else.
I am trying to use JSoup to parse content from URLs like https://www.tesco.com/groceries/en-GB/products/300595003
Jsoup.connect(url).get() simply times out; however, I can access the website fine in a web browser.
Through trial and error, the simplest working curl command I found was:
curl 'https://www.tesco.com/groceries/en-GB/products/300595003' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0' \
-H 'Accept-Language: en-GB,en;q=0.5' --compressed
I am able to translate the User-Agent and Accept-Language into JSoup; however, I still get timeouts. Is there an equivalent to the --compressed flag for Jsoup? The curl command will not work without it.
To find out what the --compressed option does, try using curl with the --verbose parameter. It will display the full request headers.
Without --compressed:
> GET /groceries/en-GB/products/300595003 HTTP/2
> Host: www.tesco.com
> Accept: */*
> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0
> Accept-Language: en-GB,en;q=0.5
With --compressed:
> GET /groceries/en-GB/products/300595003 HTTP/2
> Host: www.tesco.com
> Accept: */*
> Accept-Encoding: deflate, gzip
> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0
> Accept-Language: en-GB,en;q=0.5
The difference is the new Accept-Encoding header, so adding .header("Accept-Encoding", "deflate, gzip") to your Jsoup connection should solve your problem.
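Put together, a Jsoup call mirroring the working curl command might look something like this (a sketch; the 30-second timeout is my own choice):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TescoFetch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.tesco.com/groceries/en-GB/products/300595003")
                // Same headers the working curl command sends.
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0")
                .header("Accept-Language", "en-GB,en;q=0.5")
                // Equivalent of curl's --compressed flag.
                .header("Accept-Encoding", "deflate, gzip")
                .timeout(30_000)
                .get();
        System.out.println(doc.title());
    }
}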
By the way, for me both Jsoup and curl are able to download the page source without this header and without --compressed, and I'm not getting timeouts, so there's a chance your requests are being limited by the server because you have made too many of them.
EDIT:
It works for me using your original command with --http1.1, so there has to be a way to make it work for you as well. I'd start by using the Chrome developer tools to look at what headers your browser sends and try to pass all of them using .header(...). You can also copy the request as a curl command to see all the headers and simulate exactly what Chrome is sending.
I am trying to crawl some information from www.blogabet.com.
In the meantime, I am attending a course on Udemy about web crawling. The author of the course I am enrolled in has already given me the answer to my problem. However, I do not fully understand why I have to do the specific steps he mentioned. You can find his code below.
I am asking myself:
1. For which websites do I have to use headers?
2. How do I get the information that I have to provide in the header?
3. How do I get the url he fetches? Basically, I just wanted to fetch: https://blogabet.com/tipsters
Thank you very much :)
scrapy shell

from scrapy import Request
url = 'https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=0'
page = Request(url,
               headers={'Accept': '*/*',
                        'Accept-Encoding': 'gzip, deflate, br',
                        'Accept-Language': 'en-US,en;q=0.9,pl;q=0.8,de;q=0.7',
                        'Connection': 'keep-alive',
                        'Host': 'blogabet.com',
                        'Referer': 'https://blogabet.com/tipsters',
                        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
                        'X-Requested-With': 'XMLHttpRequest'})
fetch(page)
If you look at the network panel when you load that page, you can see the XHR request and the headers it sends, so it looks like he just copied those.
In general you can skip everything except User-Agent, and you want to avoid setting the Host, Connection and Accept headers unless you know what you're doing.
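In other words, a trimmed-down version of the same scrapy shell session, keeping only the User-Agent, will likely be enough (untested against this site):
scrapy shell

from scrapy import Request
url = 'https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=0'
# Only the User-Agent is kept; scrapy fills in the rest itself.
page = Request(url,
               headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'})
fetch(page)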
I'm trying to remove the unnecessary pre-flight requests in my application. To do that I've simplified some parts of the request, removed custom headers, etc. But I got stuck on a problem: GET requests now work fine without pre-flights, but POST requests still have them.
I've followed the requirements:
Request does not set custom HTTP headers.
Content type is "text/plain; charset=utf-8".
The request method has to be one of GET, HEAD or POST. If POST, content type should be one of application/x-www-form-urlencoded, multipart/form-data, or text/plain.
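For reference, a POST that stays within these limits (a plain XMLHttpRequest sketch, not the actual httpinvoke call shown further down) should go out without an OPTIONS pre-flight:
var xhr = new XMLHttpRequest();
xhr.open('POST', 'http://mydomain/APIEndpoint/GETRequest?param=TEST');
// text/plain is one of the content types allowed for "simple" requests,
// so setting it does not count as a custom header.
xhr.setRequestHeader('Content-Type', 'text/plain; charset=utf-8');
xhr.send('{MyData: {}}');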
Both GET and POST requests go through the same httpinvoke call.
As an example, here is a GET request that is not preceded by a pre-flight:
URL: http://mydomain/APIEndpoint/GETRequest?Id=346089&Token=f5h345
Request Method:GET
Request Headers:
Accept:*/*
Accept-Encoding:gzip, deflate
Accept-Language:uk-UA,uk;q=0.8,ru;q=0.6,en-US;q=0.4,en;q=0.2
Cache-Control:no-cache
Connection:keep-alive
Content-Type:text/plain; charset=utf-8
Host: correct host
Origin:http://localhost
Pragma:no-cache
Referer: correct referer
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
Query String Parameters:
Id=346089
Token=f5h345
And a POST request that looks very similar but is still preceded by a pre-flight:
URL: http://mydomain/APIEndpoint/GETRequest?param=TEST
Request Method:POST
Request Headers:
Accept:*/*
Accept-Encoding:gzip, deflate
Accept-Language:uk-UA,uk;q=0.8,ru;q=0.6,en-US;q=0.4,en;q=0.2
Cache-Control:no-cache
Connection:keep-alive
Content-Length:11
Content-Type:text/plain; charset=UTF-8
Host:
Origin:http://localhost
Pragma:no-cache
Referer:
User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
Query String Parameters:
param:TEST
Request Payload
{MyData: {}}
Any advice would be appreciated! Thanks!
=== Update ===
As requested, posting the pre-flight request for the POST request:
URL: http://mydomain/APIEndpoint/GETRequest?param=TEST
Request Method:OPTIONS
Status Code:200 OK
Response Header
Access-Control-Allow-Origin:*
Cache-Control:no-cache
Content-Length:0
Date:Wed, 09 Aug 2017 08:02:16 GMT
Expires:-1
Pragma:no-cache
Server:Microsoft-IIS/8.5
X-AspNet-Version:4.0.30319
X-Powered-By:ASP.NET
Request Headers
Accept:*/*
Accept-Encoding:gzip, deflate
Accept-Language:uk-UA,uk;q=0.8,ru;q=0.6,en-US;q=0.4,en;q=0.2
Access-Control-Request-Method:POST
Cache-Control:no-cache
Connection:keep-alive
Host:correct host
Origin:http://localhost
Pragma:no-cache
Referer: correct referer
User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
Query String Parameters
param:TEST
It's legacy code which uses the httpinvoke library. Here is the code snippet that actually makes the call:
var _converters = {
    'text json': function (input, reviver) {},
    'json text': JSON.stringify
};

var headers = {
    'Content-Type': 'text/plain; charset=utf-8'
};

var data = { params: "test" };

httpinvoke(url, method.toUpperCase(), {
    corsExposedHeaders: ['Content-Type'],
    headers: headers,
    input: data,
    converters: _converters,
    inputType: 'json',
    outputType: 'json',
    timeout: self._getMessageTimeout()
}).then(function (res) {}, function (error) {});
This could happen if there are event listeners registered on the XMLHttpRequestUpload object (that forces a preflight; see the note on the use-CORS-preflight flag in https://xhr.spec.whatwg.org/, the related note in https://fetch.spec.whatwg.org/, and the updated documentation on CORS preflight requests in the MDN CORS article).
Does httpinvoke do that?
As @Anne mentioned, the reason the POST requests were sending pre-flight requests, despite the requests themselves conforming to the rules of "simple requests" (and thus not needing a pre-flight), was the XMLHttpRequestUpload event listeners.
XMLHttpRequestUpload itself might not be mentioned in the code, but you can always find it via the xhr.upload property. This was the case for the httpinvoke library.
So totally innocent-looking code like:
xhr.upload.onprogress = onuploadprogress;
actually causes mandatory pre-flight requests.
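To illustrate the difference with a plain XMLHttpRequest (httpinvoke's internals are not shown in the question, so this is only a sketch):
var url = 'http://mydomain/APIEndpoint/GETRequest?param=TEST';

// A listener on the upload object forces a pre-flight, even for an
// otherwise "simple" POST.
var withPreflight = new XMLHttpRequest();
withPreflight.open('POST', url);
withPreflight.setRequestHeader('Content-Type', 'text/plain; charset=utf-8');
withPreflight.upload.onprogress = function (e) { /* upload progress */ };
withPreflight.send('test');

// The same request without the upload listener (a listener on the request
// itself for download progress is fine) goes out without an OPTIONS call.
var withoutPreflight = new XMLHttpRequest();
withoutPreflight.open('POST', url);
withoutPreflight.setRequestHeader('Content-Type', 'text/plain; charset=utf-8');
withoutPreflight.onprogress = function (e) { /* download progress */ };
withoutPreflight.send('test');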
Thanks to all who helped solve this problem.