Scrapy Shell: twisted.internet.error.ConnectionLost although USER_AGENT is set - web-scraping

When I try to scrape a certain website (with both the spider and the shell), I get the following error:
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>]
I found out that this can happen when no user agent is set.
But after setting it manually, I still got the same error.
You can see the whole output of scrapy shell here: http://pastebin.com/ZFJZ2UXe
Notes:
I am not behind a proxy, and I can access other sites via scrapy shell without problems. I am also able to access the site with Chrome, so it is not a network or connection issue.
Maybe someone can give me a hint on how to solve this problem?

Here is 100% working code.
What you need to do is send the request headers as well.
Also set ROBOTSTXT_OBEY = False in settings.py (a minimal settings sketch follows the spider code).
# -*- coding: utf-8 -*-
import scrapy, logging
from scrapy.http.request import Request


class Test1SpiderSpider(scrapy.Spider):
    name = "test1_spider"

    def start_requests(self):
        headers = {
            "Host": "www.firmenabc.at",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "DNT": "1",
            "Accept-Encoding": "gzip, deflate, sdch",
            "Accept-Language": "en-US,en;q=0.8"
        }
        yield Request(url='http://www.firmenabc.at/result.aspx?what=&where=Graz', callback=self.parse_detail_page, headers=headers)

    def parse_detail_page(self, response):
        logging.info(response.body)
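For reference, a minimal settings.py sketch for the ROBOTSTXT_OBEY change mentioned above; the BOT_NAME value and the global USER_AGENT line are assumptions (the spider already sends its own headers per request):

# settings.py (minimal sketch)
BOT_NAME = "test1"  # assumed project name, adjust to yours

# Do not fetch or obey robots.txt for this test
ROBOTSTXT_OBEY = False

# Optional: a browser-like user agent applied to all requests by default
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"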
EDIT:
You can see which headers to send by inspecting the requests in your browser's Dev Tools (Network tab).

Related

Instagram blocks me for the requests with 429

I have sent a lot of requests to https://www.instagram.com/{username}/?__a=1 to check whether a username exists, and now I am getting 429.
Before, I just had to wait a few minutes for the 429 to disappear. Now it is persistent! :( I'm trying once a day and it doesn't work anymore.
Do you know anything about Instagram's request limits?
Do you have any workaround, please? Thanks.
Code ...
import requests
r = requests.get('https://www.instagram.com/test123/?__a=1')
res = str(r.status_code)
Try adding the user-agent header; otherwise the website thinks that you're a bot and will block you.
import requests
URL = "https://www.instagram.com/bla/?__a=1"
HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
response = requests.get(URL, headers=HEADERS)
print(response.status_code) # <- Output: 200
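If the 429 persists even with a user agent set, it usually means rate limiting; below is a minimal retry sketch under that assumption. The delay values, retry count and the get_with_backoff helper are illustrative only, not documented Instagram limits:

import time
import requests

URL = "https://www.instagram.com/test123/?__a=1"
HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}

def get_with_backoff(url, retries=3, base_delay=60):
    response = None
    for attempt in range(retries):
        response = requests.get(url, headers=HEADERS)
        if response.status_code != 429:
            return response
        # Honour Retry-After if the server sends it, otherwise back off exponentially
        delay = int(response.headers.get("Retry-After", base_delay * (2 ** attempt)))
        time.sleep(delay)
    return response

print(get_with_backoff(URL).status_code)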

Cannot get Python request GET to work with spitogatos.gr

I've been trying to scrape the data from www.spitogatos.gr but with no luck.
My code looks something like:
import requests

headers = {
    'Host': 'www.spitogatos.gr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept': 'application/json; charset=utf-8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.spitogatos.gr/',
    'Content-Type': 'text/plain; charset=utf-8',
    'Origin': 'https://www.spitogatos.gr'
}

url = "https://www.spitogatos.gr/search/results/residential/sale/r100/m2007m/propertyType_apartment/onlyImage"

req = requests.get(url, headers=headers)
print(req)
print(req.content)
Although I get a response status 200, instead of any useful content I get the HTML message:
Pardon Our Interruption
As you were browsing, something about your browser made us think you were a bot. There are a few reasons this might happen:
You've disabled JavaScript in your web browser.
You're a power user moving through this website with super-human speed.
You've disabled cookies in your web browser.
A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. ...
Now I had a look in Firefox to see what kind of request it sends, and although it sends a POST request, I copied the Cookie that Firefox sends with its request. So my headers would look something like:
headers = {
    'Host': 'www.spitogatos.gr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept': 'application/json; charset=utf-8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.spitogatos.gr/',
    'Content-Type': 'text/plain; charset=utf-8',
    'Origin': 'https://www.spitogatos.gr',
    'Cookie': 'PHPSESSID=iielr7e8boibudjmln9ubd3i62; spitogatosS=areaIDs%255B0%255D%3D2007%26propertyCategory%3Dresidential%26listingType%3Dsale; currentCurrency=EUR; spitogatosInfoBar_shortCuts=false; openedTabs=; _ga=GA1.2.1557987790.1597249012; _gid=GA1.2.507964674.1597249012; _gat_UA-3455846-10=1; _gat_UA-3455846-2=1; _hjid=dfd027d6-e6f1-474c-a427-c26d5f2ca64c; _cmpQcif3pcsupported=1; reese84=3:T2t/w3VdpNG5w9Knf78l7w==:Gg20L4RyGJffieidEn4Eb1Hmb3wyAtPQfmH/5WYHWfKjzLmjhkGCoTR0j5UUmKxIbkzZltWBeJ6KaPVCFa5qiaddz2Cn6OltrBdp…2YIriDYTOwLMNNxEFPDPkL/Lw2cGC0MwJ3uUg6kSP/VgPp/AYkIcVjXLgqjSwmAdGl4oQDyrAKDpn9PcN/fWSUjPrtAOAJzkWcZ7FPCfvcsnAo9oSNpXtAaZ0JLzgMKXqQqP8Jrakjo4eL9TSdFKIVEJZos=:eBpByDUvhUkR0pGwgnYacTV3VeYzKEi+4pJpI3mhQ6c=; _fbp=fb.1.1597249012911.16321581; _hjIncludedInPageviewSample=1; eupubconsent=BO4CN1PO4CN1PAKAkAENAAAAgAAAAA; euconsent=BO4CN1PO4CN1PAKAkBENDV-AAAAx5rv6_77e_9f-_fv_9ujzGr_v_e__2mccL5tn3huzv6_7fi_-0nV4u_1tfJdydkh-5YpCjto5w7iakiPHmqNeZ1nfmz1eZpRP58E09j53zpEQ_r8_t-b7BCHN_Y2v-8K96lPKACA; spitogatosHomepageMap=0'.encode('utf8')
}
Now when I do this, the request will SOMETIMES work and sometimes give me the above message. So I keep refreshing Firefox and copying the cookies over to my Python script.
It's hit and miss.
I have also tried replicating Firefox's exact POST request (identical data, params and headers), but this gives me the same error message.
Firefox does not seem to have this problem, no matter how many requests or refreshes I do.
Can someone help me please?
Thanks
By specifying User-Agent and Accept-Language I was able to get a correct response every time (try changing the User-Agent header to a Linux one, as in my example):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept-Language': 'en-US,en;q=0.5'
}

url = 'https://www.spitogatos.gr/search/results/residential/sale/r100/m2007m/propertyType_apartment/onlyImage'

print(requests.get(url, headers=headers).text)
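If cookies do turn out to matter, letting requests manage them in a Session is less fragile than copying them out of Firefox by hand. A minimal sketch assuming the same URL and headers as above (the warm-up request to the homepage is an assumption, not something the site is known to require):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept-Language': 'en-US,en;q=0.5'
}

url = 'https://www.spitogatos.gr/search/results/residential/sale/r100/m2007m/propertyType_apartment/onlyImage'

with requests.Session() as session:
    session.headers.update(headers)
    # Warm-up request so the session picks up any cookies the site sets
    session.get('https://www.spitogatos.gr/')
    response = session.get(url)
    print(response.status_code)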
I have resolved the problem: the answer was in fact quite simple. It was "AdBlocker Ultimate" that caused the blockage.
So if you have any ad blockers, you could try deactivating them, just to see, because the website may identify you as a bot when you use them.
Just so you know, I have 3 ad blockers (AdBlock Plus, uBlock Origin and AdBlocker Ultimate), and it seems that only AdBlocker Ultimate (installed very recently, while I have had the others for months) caused the trouble.
I hope this solution works for you too if you're in the same situation, or helps anybody else.

why is nothing getting parsed in my web scraping program?

I made this code to get the top links from a Google search, but it's returning None.
import webbrowser, requests
from bs4 import BeautifulSoup
string = 'selena+gomez'
website = f'http://google.com/search?q={string}'
req_web = requests.get(website).text
parser = BeautifulSoup(req_web, 'html.parser')
gotolink = parser.find('div', class_='r').a["href"]
print(gotolink)
Google needs you to specify the User-Agent HTTP header to return the correct page. Without the correct User-Agent specified, Google returns a page that doesn't contain <div> tags with the r class. You can see this if you print the parsed soup with and without the User-Agent.
For example:
import requests
from bs4 import BeautifulSoup
string = 'selena+gomez'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
website = f'http://google.com/search?hl=en&q={string}'
req_web = requests.get(website, headers=headers).text
parser = BeautifulSoup(req_web, 'html.parser')
gotolink = parser.find('div', class_='r').a["href"]
print(gotolink)
Prints:
https://www.instagram.com/selenagomez/?hl=en
The answer from Andrej Kesely will throw an error, since this CSS class no longer exists:
gotolink = parser.find('div', class_='r').a["href"]
AttributeError: 'NoneType' object has no attribute 'a'
Learn more about user-agent and request headers.
Basically, the user-agent identifies the browser, its version number, and its host operating system, representing a person (browser) in a Web context, and it lets servers and network peers identify whether the request comes from a bot or not.
In this case, you need to send a fake user-agent so Google treats your request as a "real" user visit, also known as user-agent spoofing.
Pass user-agent in request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get(YOUR_URL, headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "selena gomez"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# link of the first organic result
link = soup.select_one('.yuRUbf a')['href']
print(link)
# https://www.instagram.com/selenagomez/
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
Essentially, the main difference in your case is that you don't need to think about how to bypass Google's blocks if they appear, or figure out how to scrape elements that are a bit harder to scrape, since it's already done for the end user. The only thing that needs to be done is to get the data you want from the JSON response.
Example code:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "selena gomez",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# [0] means index of the first organic result
link = results['organic_results'][0]['link']
print(link)
# https://www.instagram.com/selenagomez/
Disclaimer: I work for SerpApi.

When do I have to set headers and how do I get them?

I am trying to crawl some information from www.blogabet.com.
In the meantime, I am attending a Udemy course about web crawling. The author of the course already gave me the answer to my problem. However, I do not fully understand why I have to do the specific steps he mentioned. You can find his code below.
I am asking myself:
1. For which websites do I have to use headers?
2. How do I get the information that I have to provide in the header?
3. How do I get the URL he fetches? Basically, I just wanted to fetch https://blogabet.com/tipsters
Thank you very much :)
scrapy shell

from scrapy import Request

url = 'https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=0'

page = Request(url,
               headers={'Accept': '*/*',
                        'Accept-Encoding': 'gzip, deflate, br',
                        'Accept-Language': 'en-US,en;q=0.9,pl;q=0.8,de;q=0.7',
                        'Connection': 'keep-alive',
                        'Host': 'blogabet.com',
                        'Referer': 'https://blogabet.com/tipsters',
                        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
                        'X-Requested-With': 'XMLHttpRequest'})

fetch(page)
If you look at the network panel when you load that page, you can see the XHR request and the headers it sends.
So it looks like he just copied those.
In general you can skip everything except User-Agent, and you want to avoid setting the Host, Connection and Accept headers unless you know what you're doing.
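As an illustration of that advice, here is a stripped-down version of the same shell fetch keeping only the User-Agent; whether blogabet.com accepts this without the other headers (in particular X-Requested-With) is an assumption, not something verified here:

from scrapy import Request

url = 'https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=0'

# Only a browser-like User-Agent; let Scrapy fill in the rest of the headers
page = Request(url,
               headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'})

fetch(page)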

Simulate file download with Gatling

Good morning,
I would like to simulate a file download with Gatling. I'm not sure that a simple GET request on a file resource really simulates it:
val stuffDownload: ScenarioBuilder = scenario("Download stuff")
  .exec(http("Download stuff").get("https://stuff.pdf")
    .header("Content-Type", "application/pdf")
    .header("Content-Type", "application/force-download"))
I want to stress my server with multiple simultaneous downloads, and I need to be sure I have the right method to do it.
Thanks in advance for your help.
EDIT: Other headers I send:
"User-Agent" -> "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
"Accept" -> "application/json, text/plain, */*; q=0.01",
"Accept-Encoding" -> "gzip, deflate, br",
"Accept-Language" -> "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7",
"DNT" -> "1",
"Connection" -> "keep-alive"
Technically it looks globally fine, except that:
You have 2 Content-Type headers; is there a mistake in the second one?
Also, aren't you missing other browser headers like User-Agent?
Aren't you missing an important one related to compression, like Accept-Encoding?
But regarding the functional part, aren't you missing some steps before it?
I mean, do your users access the link immediately, or do they hit a login screen, then do a search and finally click on a link?
Also, is it always the same file? Shouldn't you introduce some kind of variability, for example using Gatling CSV feeders with a set of files?
