How to deal with CAPTCHA when web scraping using R

I'm trying to scrape data from this website using httr and rvest. After a number of requests (around 90-100), the website automatically redirects me to another URL with a CAPTCHA.
This is the normal URL: "https://fs.lianjia.com/ershoufang/pg1"
This is the CAPTCHA URL: "http://captcha.lianjia.com/?redirect=http%3A%2F%2Ffs.lianjia.com%2Fershoufang%2Fpg1"
When my spider comes across the CAPTCHA URL, it tells me to stop and solve the CAPTCHA in a browser, which I do by hand. But when I run the spider again and send a GET request, it is still redirected to the CAPTCHA URL. Meanwhile, everything works normally in the browser; even if I type in the CAPTCHA URL, the browser redirects me back to the normal URL.
Even when I use a proxy I get the same problem: I can browse the website normally in the browser, while the spider keeps being redirected to the CAPTCHA URL.
I was wondering:
Is my way of using a proxy correct?
Why does the spider keep being redirected while the browser isn't, when both come from the same IP?
Thanks.
This is my code:
library(httr)
library(rvest)

a <- GET(url, use_proxy(proxy, port), timeout(10),
         add_headers('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
                     'Connection' = 'keep-alive',
                     'Accept-Language' = 'en-GB,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,en-US;q=0.2,fr;q=0.2,zh-TW;q=0.2',
                     'Accept-Encoding' = 'gzip, deflate, br',
                     'Host' = 'ajax.api.lianjia.com',
                     'Accept' = '*/*',
                     'Accept-Charset' = 'GBK,utf-8;q=0.7,*;q=0.3',
                     'Cache-Control' = 'max-age=0'))

b <- a %>% read_html() %>%
  html_nodes('div.leftContent') %>%
  html_nodes('div.info.clear') %>%
  html_nodes('div.title') %>%
  html_text()
Finally, I turned to RSelenium. It's slow, but there are no more CAPTCHAs; even when one appears, I can solve it directly in the browser.

You are getting CAPTCHAs because that is how the website tries to prevent non-human scripts from scraping its data. When you try to scrape, it detects you as a robotic script, because your script sends very frequent GET requests along with some parameter data. Your program needs to behave like a real user (visiting the website with random timing, different browsers, and different IPs).
You can avoid the CAPTCHA by varying these parameters, as below, so your program appears more like a real user:
Use randomness when sending GET requests; for example, use Sys.sleep (with a random draw) to pause before each GET request.
Vary the user-agent data (Mozilla, Chrome, IE etc.), cookie acceptance, and encoding.
Vary your source location (IP address and server info).
Manipulating this information will help you avoid the CAPTCHA validation to some extent; a rough sketch of the first two points follows.
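This is not R-specific, so here is a minimal Python sketch of the idea (randomized delays plus a rotated User-Agent), purely as an illustration; the page range and the User-Agent pool are placeholders, and in R the equivalents would be Sys.sleep() and httr::add_headers().

import random
import time

import requests  # third-party HTTP client, assumed installed

# Placeholder pool of User-Agent strings; use real, current browser UAs in practice
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:54.0) Gecko/20100101 Firefox/54.0',
]

def polite_get(url):
    # Pause for a random interval so requests do not arrive at a fixed rate
    time.sleep(random.uniform(2, 8))
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

for page in range(1, 4):
    resp = polite_get('https://fs.lianjia.com/ershoufang/pg{}'.format(page))
    print(resp.status_code)

Rotating the source IP (the third point) requires a pool of proxies, passed per request via the proxies argument in requests or use_proxy() in httr.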

Related

How to get Request Headers automatically using Scrapy?

Please forgive me if this question is too stupid.
We know that in the browser it is possible to go to Inspect -> Network -> XHR -> Headers and get Request Headers. It is then possible to add these Headers to the Scrapy request.
However, is there a way to get these Request Headers automatically using the Scrapy request, rather than manually?
I tried to use: response.request.headers but this information is not enough:
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}
We see a lot more request header information in the browser. How can I get this information?
Scrapy uses these headers to scrape the webpage. Sometimes, if a website needs special keys in the headers (like an API), you'll notice that Scrapy won't be able to scrape the webpage.
However, there is a workaround: in a downloader middleware you can implement Selenium, so the requested webpage is downloaded by a Selenium-automated browser. You can then extract the complete headers, since Selenium drives an actual browser.
## Import webdriver from Selenium Wire instead of Selenium
from seleniumwire import webdriver

## Get the URL
options = webdriver.ChromeOptions()
driver = webdriver.Chrome("my/path/to/driver", options=options)
driver.get("https://my.test.url.com")

## Print request headers
for request in driver.requests:
    print(request.url)               # <--------------- Request URL
    print(request.headers)           # <----------- Request headers
    print(request.response.headers)  # <-- Response headers
You can use the above code to get the request headers. This must be placed within a Scrapy downloader middleware so the two can work together; a rough sketch of that wiring is below.
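As a sketch of that wiring (the class and module names are placeholders, not something from the original answer), a downloader middleware could let Selenium Wire fetch the page and hand the rendered HTML back to Scrapy:

from scrapy.http import HtmlResponse
from seleniumwire import webdriver

class SeleniumWireMiddleware:
    # Downloads pages with a Selenium Wire browser so the full request headers are visible

    def __init__(self):
        self.driver = webdriver.Chrome()  # assumes chromedriver is on the PATH

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Log every request the real browser made, with its complete headers
        for req in self.driver.requests:
            spider.logger.info("%s %s", req.url, req.headers)
        # Return the rendered page to Scrapy as the response for this request
        return HtmlResponse(self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8',
                            request=request)

It would then be enabled under DOWNLOADER_MIDDLEWARES in settings.py, e.g. {'myproject.middlewares.SeleniumWireMiddleware': 543}, where the dotted path is hypothetical.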

How does this site know I'm a scraper?

I'm trying to scrape results from this web form (sample ID: 15740175). Actually I'm sending a POST request from Scrapy, in the same way that the form does.
I am working from a non-blocked IP - I can make a request successfully from Firefox on this machine. I'm using Firefox with JavaScript and cookies disabled, so the site doesn't require either JS or cookies to return results.
This is my Scrapy code:
allowed_domains = ['eservices.landregistry.gov.uk']
start_urls = []
_FORM_URL = "http://eservices.landregistry.gov.uk/www/wps/portal/!ut/p/b1/" \
            "hc7LDoIwEAXQb-ELOrQFu60EgSgg8hDYEFQ0GHksCIZ-veBODTK7Sc69MyhFMU" \
            "rrvC9veVc2df6Y9lTNCGZUlik2GVFXYCkbg8iBQoCSESR_gCEv5Y8oBpr5d9ba" \
            "QxfvhNYHd-ENjtCxLTg44vy0ndP-Eh3CNefGoLMa-UU95tKvanfDwSJrd2sQDw" \
            "OoP-DzNsMLYPr9DWBmOCDHbKoCJSNbzfWwiKK2CvvyoF81LkkvDLGUgw!!/dl4" \
            "/d5/L0lDU0lKSmdwcGlRb0tVUW9LVVEhL29Gb2dBRUlRaGpFQ1VJZ0FJQUl5Rk" \
            "FNaHdVaFM0SldsYTRvIS80RzNhRDJnanZ5aERVd3BNaFFqVW81Q2pHcHhBL1o3" \
            "XzMyODQxMTQySDgzNjcwSTVGRzMxVDUzOFY0LzAvMjc0MzY5MTc0Njk2L3NwZl" \
            "9BY3Rpb25OYW1lL3NwZl9BY3Rpb25MaXN0ZW5lci9zcGZfc3RydXRzQWN0aW9uL" \
            "yEyZlFEU2VhcmNoLmRv/"

def start_requests(self):
    settings = get_project_settings()
    ids = ['15740175']
    for i, id in enumerate(ids):
        yield FormRequest(
            url=self._FORM_URL,
            formdata={
                'polygonId': id,
                'enquiryType': 'lrInspireId',
            },
            headers={
                'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:43.0) Gecko/20100101 Firefox/43.0",
                'Accept-Language': 'en-GB,en;q=0.5',
                'Referer': ''
            }
        )

def parse(self, response):
    # do parsing here
    pass
But in the log I just see a 403 response. (NB, the site's robots.txt doesn't forbid scraping.)
I've used Charles to inspect the request sent by Scrapy, and all the request headers (including User-Agent) look identical to the request headers sent when I make the request in Firefox and get 200 back.
Presumably the site knows I'm a scraper and is blocking me, but how does it know? I'm genuinely mystified. I'm only sending one request, so it can't be to do with rate limiting or download delays.
This site may be protected against CSRF (Cross-Site Request Forgery). The action URL also appears to contain a session token, which prevents replay attacks. Note that scraping can be against a site's terms or even illegal; check with the owners of the site/organization before accessing it this way.
Just open the page source in your browser and refresh it several times: you'll see that the form action URL changes every time. It is a dynamic URL, while you are using it as a hard-coded one. You should fetch the HTML page containing the form first and then submit the form data to the current form action URL, as in the sketch below.
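A minimal Scrapy sketch of that two-step flow; the entry URL is a guess, and FormRequest.from_response picks the first form on the page by default, so a formnumber or formxpath argument may be needed in practice:

import scrapy

class LandRegistrySpider(scrapy.Spider):
    name = 'landregistry'
    # Hypothetical entry page that contains the search form
    start_urls = ['http://eservices.landregistry.gov.uk/www/wps/portal/']

    def parse(self, response):
        # Read the current (dynamic) form action URL from the freshly fetched HTML
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'polygonId': '15740175', 'enquiryType': 'lrInspireId'},
            callback=self.parse_results,
        )

    def parse_results(self, response):
        self.logger.info('Search returned status %s', response.status)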

Jetty client http error code 412

I am accessing a website (I am hiding the actual website name, as naming it is against policy) using a browser and the Jetty/Apache HTTP client.
The website works fine in a web browser.
Using the API I am able to log in to the website, get the session cookie JSESSIONID, and retrieve the home page HTML content. But after that, when I submit any form or follow the links from the HTML, I receive HTTP error code 412 (Precondition Failed).
I understand this error is due to a problem with the client headers. I set all the headers from the browser (checked using Inspect Element in Chrome). Still I get the same error.
I am not able to track down which header is causing the problem.
Here are the headers from the browser:
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:en-GB,en-US;q=0.8,en;q=0.6
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:318
Content-Type:application/x-www-form-urlencoded
Cookie:language=en_IN; __gads=ID=14c64d8f9fd7de54:T=1424658276:S=ALNI_Mba1kvJO4mLo7R-T2jUJE9zCYck5A; SLB_Cookie=ffffffff09461c2d45525d5f4f58455e445a4a422971; JSESSIONID=36m4Oo6dCML_Wvx-Wgmm9rtLh9mbURxnZhWIVwg-zHaNzFQeUt9C!-1989013783; _ga=GA1.3.379900459.1428120216
Host:www.irctc.co.in
Origin:https://www.example.com
Referer:https://www.example.com/context/home
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36
I am setting the same headers from the Jetty client:
Request request = httpClient.newRequest(url);
request.method(HttpMethod.POST);
request.agent(USER_AGENT);
request.accept("text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
request.header(HttpHeader.REFERER, "https://www.example.com/context/home");
request.header(HttpHeader.ACCEPT_ENCODING, "gzip, deflate");
request.header(HttpHeader.ACCEPT_LANGUAGE, "en-GB,en-US;q=0.8,en;q=0.6");
request.header(HttpHeader.CACHE_CONTROL, "max-age=0");
request.header(HttpHeader.CONNECTION, "keep-alive");
request.header(HttpHeader.CONTENT_TYPE, "application/x-www-form-urlencoded");
request.header(HttpHeader.HOST, "www.example.com");

String content = "";
for (Map.Entry<String, String> entry : params.entrySet()) {
    request.param(URLEncoder.encode(entry.getKey(), "UTF-8"), URLEncoder.encode(entry.getValue(), "UTF-8"));
    if (!StringUtils.isEmpty(content)) {
        content += "&";
    }
    content += URLEncoder.encode(entry.getKey(), "UTF-8") + "=" + URLEncoder.encode(entry.getValue(), "UTF-8");
}
request.header(HttpHeader.CONTENT_LENGTH, "" + content.length());
I see that JSESSIONID and SLB_Cookie are present in the request. Since the website is out of our control, I really cannot track down what the issue is.
Please help me resolve this issue. Any pointers for resolving it on the client side are appreciated. Is there a way to determine which header is causing the problem?
I solved the problem.
The issue was with the form parameter values: I was sending values that were already URL-encoded, and the Jetty client encoded them again. Passing the raw values and letting the client encode them once resolved the 412.

How to request from multiple sources with urllib.request.Request?

The following code makes a single request to the Yahoo API. How do I manage to request from multiple sources using urllib.request.Request? I am aware of grequests. If possible, is there any performance difference between the two?
Any suitable modules on this topic?
import urllib.request

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}
assembled_request = urllib.request.Request(YAHOO, None, headers)
response = urllib.request.urlopen(assembled_request)
html_data = response.read()
There is no way to "bundle" multiple requests into one; the overhead of building the request object on the client side is trivial anyway. You are mostly just waiting for the request to go through and for the server to respond. If you need to send requests in bulk, doing it asynchronously (or concurrently) is the best way, as in the sketch below.
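A rough sketch of the concurrent approach using only the standard library (the URLs are placeholders; asyncio with aiohttp, or grequests, would achieve the same effect):

import urllib.request
from concurrent.futures import ThreadPoolExecutor

URLS = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']

def fetch(url):
    # Build and send one request; return the error instead of raising, for simplicity
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return url, resp.read()
    except Exception as exc:
        return url, exc

# Issue the requests from a pool of threads instead of one after another
with ThreadPoolExecutor(max_workers=5) as pool:
    for url, result in pool.map(fetch, URLS):
        print(url, type(result))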
If you are using a web API to query data, there is often some equivalent of a get_multiple() method that you can use instead of calling get() X times. This might be the kind of thing you are looking for.
For example:
www.example.com/get_cat.html?brown=1
Might yield a brown cat object.
While:
www.example.com/get_cats.html?brown=1
Might yield all the brown cat objects that the database contains.
These kinds of methods save time and bandwidth for both the server and client.

XULRunner Application Request Header Information

I've developed a standalone XULRunner app that I'm using as a site-specific browser. The web application it accesses filters browsers to determine whether the browser being used is optimal. I'd like to add my XULRunner app to the list of optimal browsers. I figured that to do this, I'll need to know the HTTP header information that accompanies requests sent by the XULRunner app. What information in the HTTP headers can I use to identify my XULRunner app? Something like the Gecko engine version, etc. I've been searching around, but no luck yet.
An application is usually identified by means of the User-Agent header. You can see it on the client side by means of the window.navigator.userAgent property, e.g. the header for Firefox 12 on Windows 7 is:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
The important part here is Gecko/... (identifies a Gecko-based browser) and rv:... (Gecko version). The Firefox/12.0 part should be replaced by something like MyApp/1.2.3 in your case (name and version number of your application).
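On the server side, the browser filter would then just look for that token in the User-Agent string. A minimal Python sketch, assuming the hypothetical MyApp/1.2.3 token and a minimum Gecko version (the web application's actual filtering logic is not shown in the question):

import re

# Hypothetical User-Agent sent by the XULRunner app
UA = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 MyApp/1.2.3'

def is_supported_browser(user_agent):
    # Accept the custom app on any reasonably recent Gecko engine
    app = re.search(r'MyApp/(\d+)\.(\d+)', user_agent)
    gecko = re.search(r'rv:(\d+)', user_agent)
    return bool(app) and bool(gecko) and int(gecko.group(1)) >= 10

print(is_supported_browser(UA))  # True for the example above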
