When to pass headers when scraping a website - web-scraping

I am trying to understand when xml2/rvest commands actually query a website and when it is necessary to specify headers in order to avoid default headers being passed.
library(httr)
library(xml2)
library(rvest)
url <- "http://testing-ground.scraping.pro"
#open session, passing headers
s <- paste0(url, "/textlist") %>% html_session(add_headers(
  "user-agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0",
  "Accept" = "text/css,*/*;q=0.1",
  "Accept-Language" = "en-US,en;q=0.5",
  "Accept-Encoding" = "gzip, deflate, br"))
#scrape path list, assuming headers need not be passed again
url_list <- s %>% html_nodes(xpath=".//a[contains(text(), 'Text list')]") %>% html_attr("href")
#Concatenate base URL and scraped paths
url_list <- paste0(url, url_list)
#scrape web page, assuming headers need not be passed again
h <- s %>% jump_to(url_list[1]) %>% read_html()
My take on the above code is that the website is queried three times: 1) when the session is opened, 2) when the paths are scraped, and 3) when the web page is scraped. Passing headers when opening the session should be sufficient, and the same headers will be reused by the following commands that use the same session.
Is this correct? Will any cookies (and other unspecified header information) also be stored in the session and passed back to the website?

I understand now that the website is queried only twice:
1) when the session is opened: html_session()
2) when jumping to another URL within the session: jump_to()
Scraping (read_html()) or parsing (html_nodes()) a page from the session does not generate further requests, since the session already contains the downloaded page.
This is not a full answer, as I still don't know whether unspecified headers (including cookies) are sent back to the host on subsequent requests; one way to check is sketched below.
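A sketch of that check (an assumption not in the original post: it uses the external echo service httpbin.org, which simply returns whatever headers and cookies it receives):
# Open a session that sets a cookie, then jump elsewhere and inspect what was actually sent
echo <- html_session("https://httpbin.org/cookies/set?probe=1", add_headers(
  "user-agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0"))
sent <- echo %>% jump_to("https://httpbin.org/headers")
httr::content(sent$response)$headers  # headers (including Cookie) sent on the second request
If the custom user-agent and the cookie show up in the echoed headers, the session is indeed recycling them on later requests.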

Related

How to get Request Headers automatically using Scrapy?

Please forgive me if this question is too stupid.
We know that in the browser it is possible to go to Inspect -> Network -> XHR -> Headers and get Request Headers. It is then possible to add these Headers to the Scrapy request.
However, is there a way to get these Request Headers automatically using the Scrapy request, rather than manually?
I tried to use: response.request.headers but this information is not enough:
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}
We see a lot more request header information in the browser. How can I get this information?
Scrapy uses these headers to scrape the web page. Sometimes, if a website requires special keys in the headers (like an API key), you'll notice that Scrapy isn't able to scrape it.
There is a workaround, however: in a downloader middleware you can implement Selenium, so the requested web page is downloaded by a Selenium-automated browser. You can then extract the complete headers, since Selenium drives an actual browser.
## Import webdriver from Selenium Wire instead of Selenium
from seleniumwire import webdriver
from selenium.webdriver.chrome.options import Options

## Start the browser and get the URL
options = Options()
driver = webdriver.Chrome("my/path/to/driver", options=options)
driver.get("https://my.test.url.com")

## Print request headers for every request the browser made
for request in driver.requests:
    print(request.url)                   # <--------------- Request url
    print(request.headers)               # <----------- Request headers
    if request.response:                 # some requests may not have a response yet
        print(request.response.headers)  # <-- Response headers
You can use the above code to get the request headers. It must be placed within a Scrapy downloader middleware so the two can work together.

how to deal with captcha when web scraping using R

I'm trying to scrape data from this website, using httr and rvest. After about 90-100 scrapes, the website automatically redirects me to another URL with a CAPTCHA.
this is the normal url: "https://fs.lianjia.com/ershoufang/pg1"
this is the captcha url: "http://captcha.lianjia.com/?redirect=http%3A%2F%2Ffs.lianjia.com%2Fershoufang%2Fpg1"
When my spider comes across the CAPTCHA URL, it tells me to stop and solve it in the browser. I then solve it by hand in the browser, but when I run the spider and send a GET request again, it is still redirected to the CAPTCHA URL. Meanwhile, everything works normally in the browser: even if I type in the CAPTCHA URL, it redirects me back to the normal URL.
Even when I use a proxy I still have the same problem: in the browser I can browse the website normally, while the spider keeps being redirected to the CAPTCHA URL.
I was wondering:
Is my way of using the proxy correct?
Why does the spider keep being redirected while the browser isn't, even though they are on the same IP?
Thanks.
This is my code:
a <- GET(url, use_proxy(proxy, port), timeout(10),
         add_headers('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
                     'Connection' = 'keep-alive',
                     'Accept-Language' = 'en-GB,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,en-US;q=0.2,fr;q=0.2,zh-TW;q=0.2',
                     'Accept-Encoding' = 'gzip, deflate, br',
                     'Host' = 'ajax.api.lianjia.com',
                     'Accept' = '*/*',
                     'Accept-Charset' = 'GBK,utf-8;q=0.7,*;q=0.3',
                     'Cache-Control' = 'max-age=0'))
b <- a %>% read_html() %>% html_nodes('div.leftContent') %>% html_nodes('div.info.clear') %>%
  html_nodes('div.title') %>% html_text()
Finally, I turned to RSelenium. It's slow, but there are no more CAPTCHAs; even when one appears, I can solve it directly in the browser.
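For reference, a minimal sketch of that RSelenium fallback (assumptions not in the original post: a local driver started with rsDriver() on an arbitrary port, a short fixed sleep, and the same CSS selectors as the httr code above):
library(RSelenium)
library(rvest)

rD    <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)  # arbitrary free port
remDr <- rD$client
remDr$navigate("https://fs.lianjia.com/ershoufang/pg1")
Sys.sleep(2)  # give the page time to load (assumption: 2 s is enough)

page   <- remDr$getPageSource()[[1]]
titles <- read_html(page) %>%
  html_nodes('div.leftContent div.info.clear div.title') %>%
  html_text()

remDr$close()
rD$server$stop()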
You are getting CAPTCHAs because that is how the website tries to prevent non-human/programmatic scripts from scraping its data. When you try to scrape the data, it detects you as a non-human/robotic script. The reason this happens is that your script sends GET requests very frequently, with the same parameter data each time. Your program needs to behave like a real user (visiting the website in a random time pattern, with different browsers and IPs).
You can avoid getting CAPTCHAs by manipulating these parameters as described below, so that your program appears like a real user:
Use randomness when sending GET requests, for example with the Sys.sleep function (drawing from a random distribution) to pause before each GET request, as sketched after this list.
Vary the user-agent data (Mozilla, Chrome, IE, etc.), cookie acceptance, and encoding.
Vary your source location (IP address and server info).
Manipulating this information will help you avoid CAPTCHA validation to some extent.
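A minimal sketch of the first two points, wrapped around the GET call from the question (the 5-15 second range and the loop length are arbitrary choices; the user-agent strings are borrowed from elsewhere on this page):
user_agents <- c(
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:43.0) Gecko/20100101 Firefox/43.0'
)

for (i in 1:10) {
  Sys.sleep(runif(1, min = 5, max = 15))   # random pause before each request
  a <- GET(url, use_proxy(proxy, port), timeout(10),
           add_headers('User-Agent' = sample(user_agents, 1),
                       'Accept-Language' = 'en-GB,en;q=0.8',
                       'Accept-Encoding' = 'gzip, deflate, br'))
  # ... parse `a` as in the question's code ...
}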

How does this site know I'm a scraper?

I'm trying to scrape results from this web form (sample ID: 15740175). I'm actually sending a POST request from Scrapy, in the same way the form does.
I am working from a non-blocked IP: I can make a request successfully from Firefox on this machine, with JavaScript and cookies disabled, so the site doesn't require either JS or cookies to return results.
This is my Scrapy code:
allowed_domains = ['eservices.landregistry.gov.uk']
start_urls = []
_FORM_URL = "http://eservices.landregistry.gov.uk/www/wps/portal/!ut/p/b1/" \
"hc7LDoIwEAXQb-ELOrQFu60EgSgg8hDYEFQ0GHksCIZ-veBODTK7Sc69MyhFMU" \
"rrvC9veVc2df6Y9lTNCGZUlik2GVFXYCkbg8iBQoCSESR_gCEv5Y8oBpr5d9ba" \
"QxfvhNYHd-ENjtCxLTg44vy0ndP-Eh3CNefGoLMa-UU95tKvanfDwSJrd2sQDw" \
"OoP-DzNsMLYPr9DWBmOCDHbKoCJSNbzfWwiKK2CvvyoF81LkkvDLGUgw!!/dl4" \
"/d5/L0lDU0lKSmdwcGlRb0tVUW9LVVEhL29Gb2dBRUlRaGpFQ1VJZ0FJQUl5Rk" \
"FNaHdVaFM0SldsYTRvIS80RzNhRDJnanZ5aERVd3BNaFFqVW81Q2pHcHhBL1o3" \
"XzMyODQxMTQySDgzNjcwSTVGRzMxVDUzOFY0LzAvMjc0MzY5MTc0Njk2L3NwZl" \
"9BY3Rpb25OYW1lL3NwZl9BY3Rpb25MaXN0ZW5lci9zcGZfc3RydXRzQWN0aW9uL" \
"yEyZlFEU2VhcmNoLmRv/"
def start_requests(self):
    settings = get_project_settings()
    ids = ['15740175']
    for i, id in enumerate(ids):
        yield FormRequest(
            url=self._FORM_URL,
            formdata={
                'polygonId': id,
                'enquiryType': 'lrInspireId',
            },
            headers={
                'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:43.0) Gecko/20100101 Firefox/43.0",
                'Accept-Language': 'en-GB,en;q=0.5',
                'Referer': ''
            }
        )

def parse(self, response):
    # do parsing here
    pass
But in the log I just see a 403 response. (NB, the site's robots.txt doesn't forbid scraping.)
I've used Charles to inspect the request sent by Scrapy, and all the request headers (including User-Agent) look identical to the request headers sent when I make the request in Firefox and get 200 back.
Presumably the site knows I'm a scraper and is blocking me, but how does it know? I'm genuinely mystified. I'm only sending one request, so it can't be to do with rate limiting or download delays.
This site may be protected against CSRF (Cross-Site Request Forgery). The action URL also looks like it contains a session token, which prevents replay attacks. Note, however, that scraping can be illegal; check with the owners of the site/organization before accessing it in this way.
Just open the page source HTML in a browser and refresh it several times: you'll see that the form action URL changes every time, so it is a dynamic URL, while you are trying to use it as a hard-coded one. You should first fetch the HTML page containing the form, and then send the form data to the current form action URL (see the sketch below).
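A minimal sketch of that "fetch the form first, then submit it" flow, written in R/rvest to match the rest of this page rather than Scrapy (in Scrapy itself, FormRequest.from_response serves the same purpose); the search-page URL and the form index below are placeholders, and only the two field names come from the question:
library(rvest)
s    <- html_session("http://eservices.landregistry.gov.uk/...")  # placeholder: the page that contains the form
form <- html_form(s)[[1]]                                         # assumption: the search form is the first one
form <- set_values(form, polygonId = "15740175", enquiryType = "lrInspireId")
res  <- submit_form(s, form)                                      # POSTs to the form's *current* action URL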

the header fields of an HTTP message: who chooses them?

Looking at the network panel in the developer tools of google chrome I can read the HTTP request and response messages of each file in a web page and, in particular, I can read the start line and the headers with all their fields.
I know (and I hope this is right) that the start line of each HTTP message has a specific, rigorous structure (different for request and response messages, of course) and that no element of the start line can be omitted.
Unlike the start line, the header of an HTTP message contains additional information, so, I guess, the header fields are optional or, at least, not as strictly required as the fields in the start line.
Considering all this, I'm wondering: who sets the header fields in an HTTP message? Or, in other words, how are the header fields of an HTTP message determined?
For example, I can actually see that the HTTP request message for a web page is this:
GET / HTTP/1.1
Host: www.corriere.it
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: it-IT,it;q=0.8,en-US;q=0.6,en;q=0.4,de;q=0.2
Cookie: rccsLocalPref=milano%7CMilano%7C015146; rcsLocalPref=milano%7CMilano; _chartbeat2=DVgclLD1BW8iBl8sAi.1422913713367.1430683372200.1111111111111111; rlId=8725ab22-cbfc-45f7-a737-7c788ad27371; __ric=5334%3ASat%20Jun%2006%202015%2014%3A13%3A31%20GMT+0200%20%28ora%20legale%20Europa%20occidentale%29%7C; optimizelyEndUserId=oeu1433680191192r0.8780217287130654; optimizelySegments=%7B%222207780387%22%3A%22gc%22%2C%222230660652%22%3A%22false%22%2C%222231370123%22%3A%22referral%22%7D; optimizelyBuckets=%7B%7D; __gads=ID=bbe86fc4200ddae2:T=1434976116:S=ALNI_MZnWxlEim1DkFzJn-vDIvTxMXSJ0g; fbm_203568503078644=base_domain=.corriere.it; apw_browser=3671792671815076067.; channel=Direct; apw_cache=1438466400.TgwTeVxF.1437740670.0.0.0...EgjHfb6VZ2K4uRK4LT619Zau06UsXnMdig-EXKOVhvw; ReadSpeakerSettings=enlarge=enlargeoff; _ga=GA1.2.1780902850.1422986273; __utma=226919106.1780902850.1422986273.1439110897.1439114180.19; __utmc=226919106; __utmz=226919106.1439114180.19.18.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); s_cm_COR=Googlewww.google.it; gvsC=New; rcsddfglr=1441375682.3.2.m0i10Mw-|z1h7I0wH.3671792671815076067..J3ouwyCkNXBCyau35GWCru0I1mfcA3hRLNURnDWREPs; cpmt_xa=5334,5364; utag_main=v_id:014ed4175b8e000f4d2bb480bdd10606d001706500bd0$_sn:74$_ss:1$_st:1439133960323$_pn:1%3Bexp-session$ses_id:1439132160323%3Bexp-session; testcookie=true; s_cc=true; s_nr=1439132160762-Repeat; SC_LNK_CR=%5B%5BB%5D%5D; s_sq=%5B%5BB%5D%5D; dtLatC=116p80.5p169.5p91.5p76.5p130.5p74p246.5p100p74.5p122.5; dtCookie=E4365758C13B82EE9C1C69A59B6F077E|Corriere|1|_default|1; dtPC=-; NSC_Wjq_Dpssjfsf_Dbdif=ffffffff091a1f8d45525d5f4f58455e445a4a423660; hz_amChecked=1
How are these header fields chosen? Who/what chooses them? (The browser? Not me, of course...)
P.S.: I hope my question is clear; please forgive my bad English.
All websites are served by HTTP servers; the response headers are set by the HTTP server hosting the page, and they are used to control how pages are shown, cached, and encoded.
Web browsers set the request headers when requesting pages from the servers. This mutual communication protocol is HTTP.
Here is a list of all the possible header fields for a request message; the question is, why does the browser choose only some of them?
The browser doesn't include all possible request headers in every request because either:
They aren't applicable to the current request or
The default value is the desired value
For instance:
Accept tells the server that only certain data formats are acceptable in the response. If any kind of data is acceptable, then it can be omitted as the default is "everything".
Content-Length describes the length of the body of the request. A GET request doesn't have a body, so there is nothing to describe the length of.
Cookie contains a cookie set by the server (or JavaScript) on a previous request. If a cookie hasn't been set, then there isn't one to send back to the server.
and so on.
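To see this from R (the scraping context of the rest of this page), httr can print the headers your client actually chooses to send; a small sketch, using the same site as the example above (verbose() writes the outgoing request headers to the console):
library(httr)
# Default request: only the handful of headers the client sets on its own
invisible(GET("http://www.corriere.it", verbose()))
# Explicitly chosen header fields, closer to what a browser sends
invisible(GET("http://www.corriere.it",
              add_headers("Accept-Language" = "it-IT,it;q=0.8,en-US;q=0.6,en;q=0.4",
                          "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
              verbose()))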

R XML package: how to set the user-agent?

I've confirmed that calls to XML functions from R, such as htmlParse and readHTML, send a blank user-agent string to the server.
?XML::htmlParse tells me under isURL that "The libxml parser handles the connection to servers, not the R facilities". Does that mean there is no way to set user agent?
(I did try options(HTTPUserAgent="test") but that is not being applied.)
Matt's answer is entirely correct. As for downloading to a string/character vector,
you can use RCurl and getURLContent() (or getForm() or postForm() as appropriate).
With these functions, you have immense control over the HTTP request, including being able to set the user-agent and any field in the header. So
library(RCurl)
library(XML)
x <- getURLContent("http://biostatmatt.com", useragent = "BioStatMatt-via-R",
                   followlocation = TRUE)
htmlParse(x, asText = TRUE)  # or htmlParse(I(x))
does the job.
XML::htmlParse uses the libxml facilities (i.e. NanoHTTP) to fetch HTTP content using the GET method. By default, NanoHTTP does not send a User-Agent header. There is no libxml API to pass a User-Agent string to NanoHTTP, although one can pass arbitrary header strings to lower-level NanoHTTP functions, like xmlNanoHTTPMethod. Hence, it would require significant source code modification in order to make this possible in the XML package.
Alternatively, options(HTTPUserAgent="test") sets the User-Agent header for functions that use the R facilities for HTTP requests. For example, one could use download.file like so:
options(HTTPUserAgent='BioStatMatt-via-R')
download.file('http://biostatmatt.com/', destfile='biostatmatt.html')
XML::htmlParse('biostatmatt.html')
The (Apache style) access log entry looks something like this:
160.129.***.*** - - [01/Sep/2011:20:16:40 +0000] "GET / HTTP/1.0" 200 4479 "-" "BioStatMatt-via-R"

Resources