Cannot get Python requests GET to work with spitogatos.gr - web-scraping

I've been trying to scrape the data from www.spitogatos.gr but with no luck.
My code looks something like:
import requests

headers = {
    'Host': 'www.spitogatos.gr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept': 'application/json; charset=utf-8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.spitogatos.gr/',
    'Content-Type': 'text/plain; charset=utf-8',
    'Origin': 'https://www.spitogatos.gr'
}
url = "https://www.spitogatos.gr/search/results/residential/sale/r100/m2007m/propertyType_apartment/onlyImage"
req = requests.get(url, headers=headers)
print(req)
print(req.content)
Although I get a response status 200, instead of any useful content I get the HTML message:
Pardon Our Interruption. As you were browsing, something about your browser made us think you were a bot. There are a few reasons this might happen: You've disabled JavaScript in your web browser. You're a power user moving through this website with super-human speed. You've disabled cookies in your web browser. A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. ...
Now, I had a look at Firefox to see what kind of request it sends, and although it sends a POST request, I copied the Cookie that Firefox sends with its request. So my header would look something like:
headers = {
    'Host': 'www.spitogatos.gr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept': 'application/json; charset=utf-8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.spitogatos.gr/',
    'Content-Type': 'text/plain; charset=utf-8',
    'Origin': 'https://www.spitogatos.gr',
    'Cookie': 'PHPSESSID=iielr7e8boibudjmln9ubd3i62; spitogatosS=areaIDs%255B0%255D%3D2007%26propertyCategory%3Dresidential%26listingType%3Dsale; currentCurrency=EUR; spitogatosInfoBar_shortCuts=false; openedTabs=; _ga=GA1.2.1557987790.1597249012; _gid=GA1.2.507964674.1597249012; _gat_UA-3455846-10=1; _gat_UA-3455846-2=1; _hjid=dfd027d6-e6f1-474c-a427-c26d5f2ca64c; _cmpQcif3pcsupported=1; reese84=3:T2t/w3VdpNG5w9Knf78l7w==:Gg20L4RyGJffieidEn4Eb1Hmb3wyAtPQfmH/5WYHWfKjzLmjhkGCoTR0j5UUmKxIbkzZltWBeJ6KaPVCFa5qiaddz2Cn6OltrBdp…2YIriDYTOwLMNNxEFPDPkL/Lw2cGC0MwJ3uUg6kSP/VgPp/AYkIcVjXLgqjSwmAdGl4oQDyrAKDpn9PcN/fWSUjPrtAOAJzkWcZ7FPCfvcsnAo9oSNpXtAaZ0JLzgMKXqQqP8Jrakjo4eL9TSdFKIVEJZos=:eBpByDUvhUkR0pGwgnYacTV3VeYzKEi+4pJpI3mhQ6c=; _fbp=fb.1.1597249012911.16321581; _hjIncludedInPageviewSample=1; eupubconsent=BO4CN1PO4CN1PAKAkAENAAAAgAAAAA; euconsent=BO4CN1PO4CN1PAKAkBENDV-AAAAx5rv6_77e_9f-_fv_9ujzGr_v_e__2mccL5tn3huzv6_7fi_-0nV4u_1tfJdydkh-5YpCjto5w7iakiPHmqNeZ1nfmz1eZpRP58E09j53zpEQ_r8_t-b7BCHN_Y2v-8K96lPKACA; spitogatosHomepageMap=0'.encode('utf8')
}
Now when I do this, the request will SOMETIMES work and sometimes it will give me the above message, so I keep refreshing Firefox and copying the cookies over to my Python script.
It is hit and miss.
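A variant using requests.Session, sketched below, keeps whatever cookies the server sets instead of copying them from Firefox by hand; it may still trip the detection, though, since cookies such as reese84 are presumably set by JavaScript, which requests never runs:
import requests

# Sketch: let a Session carry cookies between requests automatically.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept-Language': 'en-US,en;q=0.5',
})
# Visit the front page first so any server-set cookies get stored.
session.get('https://www.spitogatos.gr/')
url = "https://www.spitogatos.gr/search/results/residential/sale/r100/m2007m/propertyType_apartment/onlyImage"
req = session.get(url)
print(req.status_code)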
I have also been trying to replicate Firefox's exact POST request (identical data, params and headers), but this gives me the same error message.
Firefox does not seem to have this problem, no matter how many requests or refreshes I do.
Can someone help me please?
Thanks

By specifying just the User-Agent and Accept-Language headers I was able to get a correct response every time (try changing the User-Agent header to a Linux one, as in my example):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept-Language': 'en-US,en;q=0.5'
}
url = 'https://www.spitogatos.gr/search/results/residential/sale/r100/m2007m/propertyType_apartment/onlyImage'
print(requests.get(url, headers=headers).text)

I have resolved the problem: the answer was in fact very simple. It was "AdBlocker Ultimate" that was the cause of the blockage.
So if you have any ad blockers, you could try deactivating them, just to see, because the website may identify you as a bot when you use them.
Just so you know, I have 3 ad blockers (Adblock Plus, uBlock Origin and AdBlocker Ultimate), and it seems that only AdBlocker Ultimate (installed very recently, while I have had the others for months) caused the trouble.
I hope this solution works for you too, if you're in the same situation, or can help somebody else.

Related

Python requests hangs whereas CURL doesn't (same request)

I'm getting a permanent hang when trying to read a response using requests to access a particular site, most likely due to blocking of some sort. What I'm unsure about is how curl, which successfully receives a response, differs from my Python GET request, which never receives any response.
Note: the curl command is expected to return an error, as I'm not sending required info like cookies.
curl:
curl 'https://www.yellowpages.com.au/search/listings?clue=Programmer&locationClue=All+States&pageNumber=3&referredBy=UNKNOWN&&eventType=pagination' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0'
successfully gets response
Python:
import requests
r = requests.get('https://www.yellowpages.com.au/search/listings?clue=Programmer&locationClue=All+States&pageNumber=3&referredBy=UNKNOWN&&eventType=pagination', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0'})
hangs on read forever
It works with Python 3.
import requests
r = requests.get('https://www.yellowpages.com.au/search/listings?clue=Programmer&locationClue=All+States&pageNumber=3&referredBy=UNKNOWN&&eventType=pagination', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0'})
print(r.headers)
Response:
{'Cache-Control': 'max-age=86400, public', 'Content-Encoding': 'gzip', 'Content-Language': 'en-US', 'Content-Type': 'text/html;charset=utf-8', 'Server': 'Apache-Coyote/1.1', 'Vary': 'Accept-Encoding', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Length': '8009', 'Date': 'Wed, 19 Feb 2020 06:04:55 GMT', 'Connection': 'keep-alive'}
There may be subtle differences in the way the requests are made. For example, Python requests will automatically add a few headers:
'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'
(You can see them by executing r.request.headers.)
Curl, on the other hand, adds Accept: */* but not gzip unless you ask for it. The site in question seems to support gzip, though, so the problem must lie elsewhere.
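One way to test that theory is to strip requests' extra defaults so the wire request matches curl's; requests drops a default header when its value is set to None. A sketch:
import requests

# Sketch: mimic the curl request by removing requests' default headers.
# A header whose value is None is not sent at all.
url = ('https://www.yellowpages.com.au/search/listings?clue=Programmer'
       '&locationClue=All+States&pageNumber=3&referredBy=UNKNOWN'
       '&&eventType=pagination')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0',
    'Accept-Encoding': None,  # curl does not send this unless asked
    'Connection': None,       # drop the default keep-alive header
}
r = requests.get(url, headers=headers, timeout=10)
print(r.request.headers)  # confirm what was actually sent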
Suggestion: add a timeout to your request and catch possible exceptions, i.e.:
try:
    r = requests.get(url, headers=headers, timeout=10)  # timeout in seconds
except requests.exceptions.RequestException as e:
    print(e)

When do I have to set headers and how do I get them?

I am trying to crawl some information from www.blogabet.com.
In the meantime, I am attending a Udemy course about web crawling. The author of the course has already given me the answer to my problem. However, I do not fully understand why I have to do the specific steps he mentioned. You can find his code below.
I am asking myself:
1. For which websites do I have to use headers?
2. How do I get the information that I have to provide in the header?
3. How do I get the URL he fetches? Basically, I just wanted to fetch: https://blogabet.com/tipsters
Thank you very much :)
scrapy shell
from scrapy import Request
url = 'https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=0'
page = Request(url,
               headers={'Accept': '*/*',
                        'Accept-Encoding': 'gzip, deflate, br',
                        'Accept-Language': 'en-US,en;q=0.9,pl;q=0.8,de;q=0.7',
                        'Connection': 'keep-alive',
                        'Host': 'blogabet.com',
                        'Referer': 'https://blogabet.com/tipsters',
                        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
                        'X-Requested-With': 'XMLHttpRequest'})
fetch(page)
If you look at your network panel when you load that page, you can see the XHR request and the headers it sends, so it looks like he just copied those.
In general you can skip everything except User-Agent, and you want to avoid setting the Host, Connection and Accept headers unless you know what you're doing.
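For example, fetching with nothing but a browser-like User-Agent will usually work here (a minimal sketch for the same scrapy shell session):
from scrapy import Request

# Sketch: the same fetch with only a User-Agent header set.
url = 'https://blogabet.com/tipsters'
page = Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'})
fetch(page)  # fetch() is provided inside the scrapy shell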

How to copy HTTP headers from Charles to Postman

I have a problem recreating the headers: everything seems identical, but it just doesn't work. I need those headers to access the Instagram API.
I used Charles to intercept the traffic from a mobile device, and that works as expected, but I'm struggling to recreate the same headers.
URL is https://i.instagram.com/api/v1/feed/user/7499201770/reel_media/
Headers are
:method: GET
:scheme: https
:path: /api/v1/feed/user/7499201770/reel_media/
:authority: i.instagram.com
content-type: application/json
authority: i.instagram.com
accept: */*
path: /api/v1/feed/user/7499201770/reel_media/
accept-language: en-IN;q=1.0
accept-encoding: gzip;q=1.0, compress;q=0.5
content-length: 2
user-agent: Instagram 10.29.0 (iPhone7,2; iPhone OS 9_3_3; en_US; en-US; scale=2.00; 750x1334) AppleWebKit/420+
referer: https://www.instagram.com/
x-ig-capabilities: 3w==
cookie: ds_user_id=6742557571; sessionid=IGSCf716eb61bf2a6d41f...
I tried to use Postman to recreate this request, but every time I get the same error, "Login required". How should I paste those headers? I can't figure it out.
It was the user-agent, Instagram 10.29.0 (iPhone7,2; iPhone OS 9_3_3; en_US; en-US; scale=2.00; 750x1334) AppleWebKit/420+, that I didn't copy.
With the user-agent set it works, so the HTTP headers will look like this, in case someone is writing an Instagram story saver:
["Content-Type": "application/json",
 "Accept-encoding": "gzip, deflate",
 "User-agent": "Instagram 10.29.0 (iPhone7,2; iPhone OS 9_3_3; en_US; en-US; scale=2.00; 750x1334) AppleWebKit/420+",
 "Cookie": "ds_user_id=67425...; sessionid=IGSCf716eb61b..."]

CORS: No pre-flight on GET but a pre-flight on POST

I'm trying to remove the unnecessary pre-flight requests in my application. To that end I've simplified some parts of the requests, removed custom headers, etc. But I've hit a problem: GET requests now work fine without pre-flights, but POST requests still trigger them.
I've followed the requirements:
Request does not set custom HTTP headers.
Content type is "text/plain; charset=utf-8".
The request method has to be one of GET, HEAD or POST. If POST, content type should be one of application/x-www-form-urlencoded, multipart/form-data, or text/plain.
Both GET and POST requests go through the single httpinvoke call.
As an example, here is a GET request that is not prefaced by a pre-flight:
URL: http://mydomain/APIEndpoint/GETRequest?Id=346089&Token=f5h345
Request Method:GET
Request Headers:
Accept:*/*
Accept-Encoding:gzip, deflate
Accept-Language:uk-UA,uk;q=0.8,ru;q=0.6,en-US;q=0.4,en;q=0.2
Cache-Control:no-cache
Connection:keep-alive
Content-Type:text/plain; charset=utf-8
Host: correct host
Origin:http://localhost
Pragma:no-cache
Referer: correct referer
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
Query String Parameters:
Id=346089
Token=f5h345
And a POST request that looks very similar but is still prefaced with pre-flight:
URL: http://mydomain/APIEndpoint/GETRequest?param=TEST
Request Method:POST
Request Headers:
Accept:*/*
Accept-Encoding:gzip, deflate
Accept-Language:uk-UA,uk;q=0.8,ru;q=0.6,en-US;q=0.4,en;q=0.2
Cache-Control:no-cache
Connection:keep-alive
Content-Length:11
Content-Type:text/plain; charset=UTF-8
Host:
Origin:http://localhost
Pragma:no-cache
Referer:
User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
Query String Parameters:
param:TEST
Request Payload
{MyData: {}}
Any advice would be appreciated! Thanks!
=== Update ===
As requested, posting the pre-flight request for the POST request:
URL: http://mydomain/APIEndpoint/GETRequest?param=TEST
Request Method:OPTIONS
Status Code:200 OK
Response Header
Access-Control-Allow-Origin:*
Cache-Control:no-cache
Content-Length:0
Date:Wed, 09 Aug 2017 08:02:16 GMT
Expires:-1
Pragma:no-cache
Server:Microsoft-IIS/8.5
X-AspNet-Version:4.0.30319
X-Powered-By:ASP.NET
Request Headers
Accept:*/*
Accept-Encoding:gzip, deflate
Accept-Language:uk-UA,uk;q=0.8,ru;q=0.6,en-US;q=0.4,en;q=0.2
Access-Control-Request-Method:POST
Cache-Control:no-cache
Connection:keep-alive
Host:correct host
Origin:http://localhost
Pragma:no-cache
Referer: correct referer
User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
Query String Parameters
param:TEST
It's legacy code which uses the httpinvoke library. The snippet that actually makes the call:
_converters: {
    'text json': function (input, reviver) {},
    'json text': JSON.stringify
};
headers = {
    'Content-Type': 'text/plain; charset=utf-8'
};
data = {params: "test"};
httpinvoke(url, method.toUpperCase(), {
    corsExposedHeaders: ['Content-Type'],
    headers: headers,
    input: data,
    converters: _converters,
    inputType: 'json',
    outputType: 'json',
    timeout: self._getMessageTimeout()
}).then(function (res) {}, function (error) {});
This could happen if there are event listeners registered on the XMLHttpRequestUpload object (that forces a preflight; see the note on the use-CORS-preflight flag in https://xhr.spec.whatwg.org/, and a related note in https://fetch.spec.whatwg.org/ and the updated documentation on CORS preflight requests in the MDN CORS article).
Does httpinvoke do that?
As @Anne mentioned, the reason the POST requests were sending pre-flights, despite the requests themselves conforming to the rules of "simple requests" (and thus not needing a pre-flight), was the XMLHttpRequestUpload event listeners.
XMLHttpRequestUpload itself might not be mentioned in the code, but you can always find it as the xhr.upload property. This was the case for the httpinvoke library.
So totally innocent-looking code like:
xhr.upload.onprogress = onuploadprogress;
actually causes mandatory pre-flight requests.
Thanks to all who helped solve this problem.

Scrapy Shell: twisted.internet.error.ConnectionLost although USER_AGENT is set

When I try to scrape a certain web site (with both a spider and the shell), I get the following error:
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>]
I found out that this can happen when no user agent is set. But after setting one manually, I still got the same error.
You can see the whole output of scrapy shell here: http://pastebin.com/ZFJZ2UXe
Notes: I am not behind a proxy, and I can access other sites via scrapy shell without problems. I am also able to access the site with Chrome, so it is not a network or connection issue.
Maybe someone can give me a hint on how to solve this problem?
Here is working code.
What you need to do is send the request headers as well.
Also set ROBOTSTXT_OBEY = False in settings.py (a sketch of that follows the spider code below).
# -*- coding: utf-8 -*-
import scrapy, logging
from scrapy.http.request import Request

class Test1SpiderSpider(scrapy.Spider):
    name = "test1_spider"

    def start_requests(self):
        headers = {
            "Host": "www.firmenabc.at",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "DNT": "1",
            "Accept-Encoding": "gzip, deflate, sdch",
            "Accept-Language": "en-US,en;q=0.8"
        }
        yield Request(url='http://www.firmenabc.at/result.aspx?what=&where=Graz', callback=self.parse_detail_page, headers=headers)

    def parse_detail_page(self, response):
        logging.info(response.body)
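The ROBOTSTXT_OBEY setting mentioned above goes in the project's settings.py; a minimal sketch:
# settings.py (excerpt)
# Stop Scrapy from honouring robots.txt, which would otherwise
# filter out these requests.
ROBOTSTXT_OBEY = False
The spider can then be run with scrapy crawl test1_spider.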
EDIT:
You can see what headers to send by inspecting the URLs in Dev Tools
