How to get Request Headers automatically using Scrapy? - web-scraping

Please forgive me if this question is too stupid.
We know that in the browser it is possible to go to Inspect -> Network -> XHR -> Headers and get Request Headers. It is then possible to add these Headers to the Scrapy request.
However, is there a way to get these Request Headers automatically using the Scrapy request, rather than manually?
I tried to use: response.request.headers but this information is not enough:
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 S afari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}
We see a lot more of Request Headers information in the browser. How to get this information?

Scrapy uses these headers to scrape the webpage. Sometimes if a website needs some special keys in headers (like an API), you'll notice that the scrapy won't be able to scrape the webpage.
However there is a workaround, in DownloaMiddilewares, you can implement Selenium. So the requested webpage will be downloaded using selenium automated browser. then you would be able to extract the complete headers as the selenium initiates an actual browser.
## Import webdriver from Selenium Wire instead of Selenium
from seleniumwire import webdriver
## Get the URL
driver = webdriver.Chrome("my/path/to/driver", options=options)
driver.get("https://my.test.url.com")
## Print request headers
for request in driver.requests:
print(request.url) # <--------------- Request url
print(request.headers) # <----------- Request headers
print(request.response.headers) # <-- Response headers
You can use the above code to get the request headers. This must be placed within DownlaodMiddleware of Scrapy so both can work together.

Related

how to deal with captcha when web scraping using R

I'm trying to scrape data from this website, using httr and rvest. After several times of scraping (around 90 - 100), the website will automatically transfer me to another url with captcha.
this is the normal url: "https://fs.lianjia.com/ershoufang/pg1"
this is the captcha url: "http://captcha.lianjia.com/?redirect=http%3A%2F%2Ffs.lianjia.com%2Fershoufang%2Fpg1"
When my spider comes accross captcha url, it will tell me to stop and solve it in browser. Then I solve it by hand in browser. But when I run the spider and send GET request, the spider is still transferred to captcha url. Meanwhile in browser, everything goes normal, even I type in the captcha url, it will transfer me back to the normal url in browser.
Even I use proxy, I still got the same problem. In browser, I can normally browse the website, while the spider kept being transferred to captcha url.
I was wondering,
Is my way of using proxy correct?
Why the spider keeps being transferred while browser doesn't. They are from the same IP.
Thanks.
This is my code:
a <- GET(url, use_proxy(proxy, port), timeout(10),
add_headers('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Connection' = 'keep-alive',
'Accept-Language' = 'en-GB,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,en-US;q=0.2,fr;q=0.2,zh-TW;q=0.2',
'Accept-Encoding' = 'gzip, deflate, br',
'Host' = 'ajax.api.lianjia.com',
'Accept' = '*/*',
'Accept-Charset' = 'GBK,utf-8;q=0.7,*;q=0.3',
'Cache-Control' = 'max-age=0'))
b <- a %>% read_html %>% html_nodes('div.leftContent') %>% html_nodes('div.info.clear') %>%
html_nodes('div.title') %>% html_text()
Finally, I turned to RSelenium, it's slow but no more captchas. Even when it appears, I can directly solve it in the browser.
You are getting CAPTCHAs because that is the way website is trying to prevent non-human/programming script scrapping their data. So, when you are trying to scrape the data, it's detecting you as non-human/robotic script. The reason why this is happening because your script sending very frequent GET request along with some parameters data. Your program need to behave like a real user (Visiting website in random time pattern, different browsers, and IP).
You can avoid getting CAPTCHA by manipulating with these parameters as below. So your program would appear like a real user:
Use randomness when sending GET request. Like you can use Sys.sleep function (use random distribution) to sleep before sending each GET request.
Manipulate user agent data(Mozilla, Chrome, IE etc), cookie acceptance, and encoding.
Manipulate your source location (ip address, and server info)
Manipulating these information will help you to avoid getting CAPTACHA validation in some way.

How does this site know I'm a scraper?

I'm trying to scrape results from this web form (sample ID: 15740175). Actually I'm sending a POST request from Scrapy, in the same way that the form does.
I am working from a non-blocked IP - I can make a request successfully from Firefox on this machine. I'm using Firefox with JavaScript and cookies disabled, so the site doesn't require either JS or cookies to return results.
This is my Scrapy code:
allowed_domains = ['eservices.landregistry.gov.uk']
start_urls = []
_FORM_URL = "http://eservices.landregistry.gov.uk/www/wps/portal/!ut/p/b1/" \
"hc7LDoIwEAXQb-ELOrQFu60EgSgg8hDYEFQ0GHksCIZ-veBODTK7Sc69MyhFMU" \
"rrvC9veVc2df6Y9lTNCGZUlik2GVFXYCkbg8iBQoCSESR_gCEv5Y8oBpr5d9ba" \
"QxfvhNYHd-ENjtCxLTg44vy0ndP-Eh3CNefGoLMa-UU95tKvanfDwSJrd2sQDw" \
"OoP-DzNsMLYPr9DWBmOCDHbKoCJSNbzfWwiKK2CvvyoF81LkkvDLGUgw!!/dl4" \
"/d5/L0lDU0lKSmdwcGlRb0tVUW9LVVEhL29Gb2dBRUlRaGpFQ1VJZ0FJQUl5Rk" \
"FNaHdVaFM0SldsYTRvIS80RzNhRDJnanZ5aERVd3BNaFFqVW81Q2pHcHhBL1o3" \
"XzMyODQxMTQySDgzNjcwSTVGRzMxVDUzOFY0LzAvMjc0MzY5MTc0Njk2L3NwZl" \
"9BY3Rpb25OYW1lL3NwZl9BY3Rpb25MaXN0ZW5lci9zcGZfc3RydXRzQWN0aW9uL" \
"yEyZlFEU2VhcmNoLmRv/"
def start_requests(self):
settings = get_project_settings()
ids = ['15740175']
for i, id in enumerate(ids):
yield FormRequest(
url=self._FORM_URL,
formdata={
'polygonId': id,
'enquiryType': 'lrInspireId',
},
headers={
'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:43.0) Gecko/20100101 Firefox/43.0",
'Accept-Language': 'en-GB,en;q=0.5', '
'Referer': ''
}
)
def parse(self, response):
# do parsing here
But in the log I just see a 403 response. (NB, the site's robots.txt doesn't forbid scraping.)
I've used Charles to inspect the request sent by Scrapy, and all the request headers (including User-Agent) look identical to the request headers sent when I make the request in Firefox and get 200 back.
Presumably the site knows I'm a scraper and is blocking me, but how does it know? I'm genuinely mystified. I'm only sending one response, so it can't be to do with rate limiting or download delays.
This site may be protected against CSRF (Cross Site Request Forgery). Also the action URL looks like having a session token which prevents replay attacks. However scraping can be illegal and check with the owners of this site/ organization before accessing this site in such ways
Just open page source HTML in browser and refresh it several times - you'll see that form action URL is changing every time, so it is dynamic URL while you try to use it as hard-coded. You should fetch HTML page with the form first and then send form data using current form action URL.

Jetty client http error code 412

I am accessing one website (I am hiding origin website name as it is against the policy) using browser and jetty/apache httpclient.
The website works fine with web browser.
Using api I am able to login into website,gets the session cookie JSESSIONID and home page html content. But after that when I submit any form or call the links from html I receive the HTTP error code 412(Pre condition failed).
I understand this error is due problem in client header. I set all the headers from browser(checked using inspect element in chrome browser). Still I have the same error.
I am not able to track down which header is causing the problem.
Here is the Header from browser
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,/;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:en-GB,en-US;q=0.8,en;q=0.6
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:318
Content-Type:application/x-www-form-urlencoded
Cookie:language=en_IN; __gads=ID=14c64d8f9fd7de54:T=1424658276:S=ALNI_Mba1kvJO4mLo7R-T2jUJE9zCYck5A; SLB_Cookie=ffffffff09461c2d45525d5f4f58455e445a4a422971; JSESSIONID=36m4Oo6dCML_Wvx-Wgmm9rtLh9mbURxnZhWIVwg-zHaNzFQeUt9C!-1989013783; _ga=GA1.3.379900459.1428120216
Host:www.irctc.co.in
Origin:https://www.example.com
Referer:https://www.example.com/context/home
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36
Same header I am setting from jetty client.
Request request = httpClient.newRequest(url);
request.method(HttpMethod.POST);
request.agent(USER_AGENT);
request.accept("text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
request.header(HttpHeader.REFERER,"https://www.example.com/context/home");
request.header(HttpHeader.ACCEPT_ENCODING, "gzip, deflate");
request.header(HttpHeader.ACCEPT_LANGUAGE, "en-GB,en-US;q=0.8,en;q=0.6");
request.header(HttpHeader.CACHE_CONTROL, "max-age=0");
request.header(HttpHeader.CONNECTION, "keep-alive");
request.header(HttpHeader.CONTENT_TYPE, "application/x-www-form-urlencoded");
request.header(HttpHeader.HOST, "www.example.com);
for (Map.Entry<String, String> entry : params.entrySet()){
request.param(URLEncoder.encode(entry.getKey(), "UTF-8"), URLEncoder.encode(entry.getValue(), "UTF-8"));
if(!StringUtils.isEmpty(content)){
content+="&";
}
content+=URLEncoder.encode(entry.getKey(), "UTF-8")+"="+URLEncoder.encode(entry.getValue(), "UTF-8");
}
request.header(HttpHeader.CONTENT_LENGTH, ""+content.length());
I see JSESSIONID and SLB_Cookie are present in the request. Since the website is out of our control I really can not track what is the issue.
Please help me to resolve this issue. Any pointers to resolve the issue on client side is appreciated. is there any way we can make sure which header causing this issue.
I solved the problem.
Issue with the form parameters value. I was sending encoded values where were encoded by jetty client again

the header fields of an HTTP message: who chooses them?

Looking at the network panel in the developer tools of google chrome I can read the HTTP request and response messages of each file in a web page and, in particular, I can read the start line and the headers with all their fields.
I know (and I hope that is right) that the start line of each HTTP message has a specific and rigorous structure (different for request and response message, of course) and any element inside a start line cannot be missed.
Unlike the start line, the header of an HTTP message contains additional informations, so, I guess, the header fields are facultative or, at least, not so strictly requested like the fields in the start line.
Considering all this, I'm wondering: who sets the header fields in an HTTP message? Or, in other words, how are determined the header fields of an HTTP message?
For example, i can actually see that the HTTP request message for a web page is this:
GET / HTTP/1.1
Host: www.corriere.it
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: it-IT,it;q=0.8,en-US;q=0.6,en;q=0.4,de;q=0.2
Cookie: rccsLocalPref=milano%7CMilano%7C015146; rcsLocalPref=milano%7CMilano; _chartbeat2=DVgclLD1BW8iBl8sAi.1422913713367.1430683372200.1111111111111111; rlId=8725ab22-cbfc-45f7-a737-7c788ad27371; __ric=5334%3ASat%20Jun%2006%202015%2014%3A13%3A31%20GMT+0200%20%28ora%20legale%20Europa%20occidentale%29%7C; optimizelyEndUserId=oeu1433680191192r0.8780217287130654; optimizelySegments=%7B%222207780387%22%3A%22gc%22%2C%222230660652%22%3A%22false%22%2C%222231370123%22%3A%22referral%22%7D; optimizelyBuckets=%7B%7D; __gads=ID=bbe86fc4200ddae2:T=1434976116:S=ALNI_MZnWxlEim1DkFzJn-vDIvTxMXSJ0g; fbm_203568503078644=base_domain=.corriere.it; apw_browser=3671792671815076067.; channel=Direct; apw_cache=1438466400.TgwTeVxF.1437740670.0.0.0...EgjHfb6VZ2K4uRK4LT619Zau06UsXnMdig-EXKOVhvw; ReadSpeakerSettings=enlarge=enlargeoff; _ga=GA1.2.1780902850.1422986273; __utma=226919106.1780902850.1422986273.1439110897.1439114180.19; __utmc=226919106; __utmz=226919106.1439114180.19.18.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); s_cm_COR=Googlewww.google.it; gvsC=New; rcsddfglr=1441375682.3.2.m0i10Mw-|z1h7I0wH.3671792671815076067..J3ouwyCkNXBCyau35GWCru0I1mfcA3hRLNURnDWREPs; cpmt_xa=5334,5364; utag_main=v_id:014ed4175b8e000f4d2bb480bdd10606d001706500bd0$_sn:74$_ss:1$_st:1439133960323$_pn:1%3Bexp-session$ses_id:1439132160323%3Bexp-session; testcookie=true; s_cc=true; s_nr=1439132160762-Repeat; SC_LNK_CR=%5B%5BB%5D%5D; s_sq=%5B%5BB%5D%5D; dtLatC=116p80.5p169.5p91.5p76.5p130.5p74p246.5p100p74.5p122.5; dtCookie=E4365758C13B82EE9C1C69A59B6F077E|Corriere|1|_default|1; dtPC=-; NSC_Wjq_Dpssjfsf_Dbdif=ffffffff091a1f8d45525d5f4f58455e445a4a423660; hz_amChecked=1
how these header fields are chosen? Who/what chose them? (The browser? Not me, of course...)
p.s.:
hope my question is clear, please, forgive my bad english
All internet websites are hosted on HTTP servers, these headers are set by the http server who is hosting the webpage. They are used to control how pages are shown, cached, and encoded.
Web browsers set the headers when requesting the pages from the servers. This mutual communication protocol is the HTTP protocol linked above.
here is a list of all the possible header fields for a request message: the question is, why the broser chooses only some of them?
The browser doesn't include all possible request headers in every request because either:
They aren't applicable to the current request or
The default value is the desired value
For instance:
Accept tells the server that only certain data formats are acceptable in the response. If any kind of data is acceptable, then it can be omitted as the default is "everything".
Content-Length describes the length of the body of the request. A GET request doesn't have a body, so there is nothing to describe the length of.
Cookie contains a cookie set by the server (or JavaScript) on a previous request. If a cookie hasn't been set, then there isn't one to send back to the server.
and so on.

XULRunner Application Request Header Information

I've developed a standalone XULRunner app that I'm using as a site-specific browser. The web application it accesses does filtering of browsers to know whether the browser being used is optimal. I'd like to add my XULRunner app to the list of optimal browsers. I figured that to do this, I'll need to know the HTTP header information that accompanies request sent by the XULRunner app. What information in the HTTP header can I use to identify my XULRunner app? Something like the Gecko Engine version, etc. I've been searching around, but no luck yet.
An application is usually identified by means of the User-Agent header. You can see it on the client side by means of the window.navigator.userAgent property, e.g. the header for Firefox 12 on Windows 7 is:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
The important part here is Gecko/... (identifies a Gecko-based browser) and rv:... (Gecko version). The Firefox/12.0 part should be replaced by something like MyApp/1.2.3 in your case (name and version number of your application).

Resources