How does this site know I'm a scraper? - web-scraping

I'm trying to scrape results from this web form (sample ID: 15740175). I'm sending a POST request from Scrapy in the same way that the form does.
I am working from a non-blocked IP - I can make a request successfully from Firefox on this machine. I'm using Firefox with JavaScript and cookies disabled, so the site doesn't require either JS or cookies to return results.
This is my Scrapy code:
allowed_domains = ['eservices.landregistry.gov.uk']
start_urls = []
_FORM_URL = "http://eservices.landregistry.gov.uk/www/wps/portal/!ut/p/b1/" \
"hc7LDoIwEAXQb-ELOrQFu60EgSgg8hDYEFQ0GHksCIZ-veBODTK7Sc69MyhFMU" \
"rrvC9veVc2df6Y9lTNCGZUlik2GVFXYCkbg8iBQoCSESR_gCEv5Y8oBpr5d9ba" \
"QxfvhNYHd-ENjtCxLTg44vy0ndP-Eh3CNefGoLMa-UU95tKvanfDwSJrd2sQDw" \
"OoP-DzNsMLYPr9DWBmOCDHbKoCJSNbzfWwiKK2CvvyoF81LkkvDLGUgw!!/dl4" \
"/d5/L0lDU0lKSmdwcGlRb0tVUW9LVVEhL29Gb2dBRUlRaGpFQ1VJZ0FJQUl5Rk" \
"FNaHdVaFM0SldsYTRvIS80RzNhRDJnanZ5aERVd3BNaFFqVW81Q2pHcHhBL1o3" \
"XzMyODQxMTQySDgzNjcwSTVGRzMxVDUzOFY0LzAvMjc0MzY5MTc0Njk2L3NwZl" \
"9BY3Rpb25OYW1lL3NwZl9BY3Rpb25MaXN0ZW5lci9zcGZfc3RydXRzQWN0aW9uL" \
"yEyZlFEU2VhcmNoLmRv/"
def start_requests(self):
    settings = get_project_settings()
    ids = ['15740175']
    for i, id in enumerate(ids):
        yield FormRequest(
            url=self._FORM_URL,
            formdata={
                'polygonId': id,
                'enquiryType': 'lrInspireId',
            },
            headers={
                'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:43.0) Gecko/20100101 Firefox/43.0",
                'Accept-Language': 'en-GB,en;q=0.5',
                'Referer': ''
            }
        )

def parse(self, response):
    # do parsing here
    pass
But in the log I just see a 403 response. (NB, the site's robots.txt doesn't forbid scraping.)
I've used Charles to inspect the request sent by Scrapy, and all the request headers (including User-Agent) look identical to the request headers sent when I make the request in Firefox and get 200 back.
Presumably the site knows I'm a scraper and is blocking me, but how does it know? I'm genuinely mystified. I'm only sending one request, so it can't be anything to do with rate limiting or download delays.

This site may be protected against CSRF (Cross-Site Request Forgery). The action URL also looks like it contains a session token, which prevents replay attacks. Note also that scraping can be illegal; check with the owners of the site/organisation before accessing it in this way.

Just open the page source HTML in a browser and refresh it several times: you'll see that the form action URL changes every time. It is a dynamic URL, but you are using it as a hard-coded one. You should fetch the HTML page containing the form first, and then submit the form data to the current form action URL.
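A minimal sketch of that approach in Scrapy, assuming the search form can be reached on a landing page (the spider name and start URL below are placeholders): FormRequest.from_response picks up the form's current, session-specific action URL and hidden fields automatically.

import scrapy
from scrapy.http import FormRequest

class LandSearchSpider(scrapy.Spider):  # illustrative name
    name = 'landsearch'
    allowed_domains = ['eservices.landregistry.gov.uk']
    # Placeholder: use the page that actually renders the search form
    start_urls = ['http://eservices.landregistry.gov.uk/www/wps/portal/']

    def parse(self, response):
        # from_response reads the current (session-specific) action URL and
        # hidden fields from the fetched page, then merges in our own values
        yield FormRequest.from_response(
            response,
            formdata={
                'polygonId': '15740175',
                'enquiryType': 'lrInspireId',
            },
            callback=self.parse_results,
        )

    def parse_results(self, response):
        # do parsing here
        pass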

Related

authenticate HTTP request to /admin-ajax.php on wordpress

I'm trying to download a file provided by the WordPress plugin Pinpoint World. This plugin uses admin-ajax.php to retrieve that file in the admin UI.
I want to periodically download it for backup. How can I download it using curl? It looks like the request needs to be authenticated using cookies (as the browser does, judging by the requests I inspected). Is there any way I can simulate that using curl in bash?
The following results in 400 Bad Request:
curl "https://${HOST}/wp-admin/admin-ajax.php" \
--data-raw 'action=dopbsp_reservations_get&type=xls&calendar_id=1&start_date=&end_date=&start_hour=00%3A00&end_hour=23%3A59&status_pending=false&status_approved=false&status_rejected=false&status_canceled=false&status_expired=false&payment_methods=&search=&page=1&per_page=25&order=ASC&order_by=check_in' \
-o /tmp/output.xls
Basic authentication (using --user) didn't work either.
How can I authenticate to WordPress's admin-ajax using bash?
You can just pass the cookie from your authenticated, logged-in user in your curl request.
First, log in to your WordPress site in your browser.
Then hit F12, go to the Application tab, then Cookies,
and copy the cookie that looks like wordpress_logged_in_xxxxxxxxxxxx.
You can then use it in your curl request.
As a basic test,
create a simple AJAX handler which returns a user object if your request is authenticated, and an empty user otherwise:
add_action( 'wp_ajax_sample_duh', 'sample_duh');
add_action( 'wp_ajax_nopriv_sample_duh', 'sample_duh');

function sample_duh() {
    wp_send_json([
        'user' => wp_get_current_user()
    ]);
}
Run your curl request with the cookie you copied from the browser,
e.g.
curl -X POST --cookie "wordpress_logged_in_xxxxxxxxxxxxxx=xxxxxxxxxxx" "http://mydomain.me/wp-admin/admin-ajax.php?action=sample_duh"
You should get the user object in the response if you have a valid cookie;
then use the same cookie with your actual curl request.
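If the curl quoting gets unwieldy, the same cookie-based approach can be sketched in Python with requests; the host and cookie value are placeholders, and the form parameters are taken from the question (trimmed for brevity):

import requests

HOST = 'example.com'  # placeholder: your WordPress host
# Cookie copied from the browser after logging in (placeholder name/value)
cookies = {'wordpress_logged_in_xxxxxxxxxxxxxx': 'xxxxxxxxxxx'}

# Subset of the parameters from the original request; add the remaining
# date/status_*/search fields as needed
data = {
    'action': 'dopbsp_reservations_get',
    'type': 'xls',
    'calendar_id': '1',
    'start_hour': '00:00',
    'end_hour': '23:59',
    'page': '1',
    'per_page': '25',
    'order': 'ASC',
    'order_by': 'check_in',
}

resp = requests.post(f'https://{HOST}/wp-admin/admin-ajax.php',
                     cookies=cookies, data=data)
resp.raise_for_status()
with open('/tmp/output.xls', 'wb') as f:
    f.write(resp.content)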

Anti-Scraping bypass?

Hello,
I'm working on a scraper for this page : https://www.dirk.nl/
I'm trying to get the 'row-wrapper' div class in the Scrapy shell.
If I enter response.css('row-wrapper'), it gives me some random results; I think an anti-scraping system is involved. I need the hrefs from this class.
Any opinions on how I can move forward?
We would need a little more information, like the response you receive and any code you've already set up.
But from the looks of it, it could be one of several things, from a 429 response blocking the request because of rate limiting, to the site's internal API (XHR) causing data not to be rendered on page load, etc.
Before fetching any website for scraping, try curl, Postman or Insomnia to see what type of response you are going to receive. Some servers and website architectures require certain cookies and headers while others don't. You simply have to do this research so you can make your scraping workflow efficient.
I ran curl https://www.dirk.nl/ and it returned data generated by the Nuxt framework. In this case that data is not very usable, since Nuxt uses its own functionality to render it.
Instead, the best solution is not to scrape the HTML-based data but to use the API content data.
Something like this:
curl 'https://content-api.dirk.nl/misc/specific/culios.aspx?action=GetRecipe' \
-H 'accept: application/json, text/plain, */*' \
--data-raw '{"id":"11962"}' \
--compressed
Will return:
{"id":11962,"slug":"Muhammara kerstkrans","title":"Muhammara kerstkrans","subtitle":"", ...Rest of the data
I don't understand the language, but from my basic understanding this would be an API route for recipes.
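For reference, a rough equivalent of that API call in Python with requests; the endpoint, header and raw body are copied from the curl above, and the 'title' key is inferred from the sample response:

import requests

url = 'https://content-api.dirk.nl/misc/specific/culios.aspx?action=GetRecipe'
headers = {'accept': 'application/json, text/plain, */*'}

# Same raw body as the curl example above
resp = requests.post(url, data='{"id":"11962"}', headers=headers)
resp.raise_for_status()

recipe = resp.json()
print(recipe['title'])  # e.g. "Muhammara kerstkrans"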

How to get Request Headers automatically using Scrapy?

Please forgive me if this question is too stupid.
We know that in the browser it is possible to go to Inspect -> Network -> XHR -> Headers and get Request Headers. It is then possible to add these Headers to the Scrapy request.
However, is there a way to get these Request Headers automatically using the Scrapy request, rather than manually?
I tried to use: response.request.headers but this information is not enough:
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}
We see a lot more of Request Headers information in the browser. How to get this information?
Scrapy uses these headers to scrape the webpage. Sometimes, if a website needs special keys in the headers (like an API key), you'll notice that Scrapy won't be able to scrape the webpage.
However, there is a workaround: in a downloader middleware, you can implement Selenium, so the requested webpage is downloaded by a Selenium-automated browser. You will then be able to extract the complete headers, since Selenium launches an actual browser.
## Import webdriver from Selenium Wire instead of Selenium
from seleniumwire import webdriver

## Get the URL
driver = webdriver.Chrome("my/path/to/driver", options=options)
driver.get("https://my.test.url.com")

## Print request headers
for request in driver.requests:
    print(request.url)               # <-- Request url
    print(request.headers)           # <-- Request headers
    print(request.response.headers)  # <-- Response headers
You can use the above code to get the request headers. It must be placed inside a Scrapy downloader middleware so the two can work together.
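A minimal sketch of what that middleware wiring might look like (the class name is illustrative; it assumes selenium-wire and ChromeDriver are installed and that the middleware is enabled in DOWNLOADER_MIDDLEWARES):

# middlewares.py -- illustrative downloader middleware using selenium-wire
from seleniumwire import webdriver
from scrapy.http import HtmlResponse

class SeleniumWireMiddleware:
    def __init__(self):
        # Assumes chromedriver is available on PATH
        self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Log the full headers the real browser sent while loading the page
        for req in self.driver.requests:
            spider.logger.info('%s %s', req.url, req.headers)
        # Return the rendered page to Scrapy; returning a Response here
        # stops Scrapy's normal download for this request
        return HtmlResponse(
            url=request.url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )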

authorization with python requests

There's this site: vten.ru.
When I try to make a GET request to it with Postman, I get back status code 304 Not Modified.
Code in Python:
import requests
url = "http://vten.ru"
payload = ""
headers = {
'cache-control': "no-cache",
'Postman-Token': "29ae741a-1c31-4a52-b10e-4486cb0d6eb7"
}
response = requests.request("GET", url, data=payload, headers=headers)
print(response.text)
how can I get the page?
You presumably already have a version of the response cached, hence the "Not Modified" status indicating that it hasn't changed since you last requested it.
EDIT:
Viewing that site/inspecting the network activity via Chrome shows that the document returned is actually http://m.vten.ru. You should try making your GET request to that URL instead.
You also need to add the Accept: text/html header to your request. That returns the page you want; I've just tested it locally.
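Putting both suggestions together, a minimal sketch with requests (based only on the advice above):

import requests

# Request the mobile document directly and explicitly ask for HTML
headers = {'Accept': 'text/html'}
response = requests.get('http://m.vten.ru', headers=headers)

print(response.status_code)
print(response.text[:500])  # first part of the page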

how to deal with captcha when web scraping using R

I'm trying to scrape data from this website using httr and rvest. After a number of requests (around 90-100), the website automatically redirects me to another URL with a captcha.
this is the normal url: "https://fs.lianjia.com/ershoufang/pg1"
this is the captcha url: "http://captcha.lianjia.com/?redirect=http%3A%2F%2Ffs.lianjia.com%2Fershoufang%2Fpg1"
When my spider comes across the captcha URL, it tells me to stop and solve it in the browser, which I do by hand. But when I run the spider and send a GET request again, it is still redirected to the captcha URL. Meanwhile, in the browser everything works normally; even if I type in the captcha URL, it redirects me back to the normal URL.
Even when I use a proxy, I still have the same problem: in the browser I can browse the website normally, while the spider keeps being redirected to the captcha URL.
I was wondering:
Is my way of using the proxy correct?
Why does the spider keep being redirected while the browser isn't, when they are on the same IP?
Thanks.
This is my code:
a <- GET(url, use_proxy(proxy, port), timeout(10),
add_headers('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Connection' = 'keep-alive',
'Accept-Language' = 'en-GB,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,en-US;q=0.2,fr;q=0.2,zh-TW;q=0.2',
'Accept-Encoding' = 'gzip, deflate, br',
'Host' = 'ajax.api.lianjia.com',
'Accept' = '*/*',
'Accept-Charset' = 'GBK,utf-8;q=0.7,*;q=0.3',
'Cache-Control' = 'max-age=0'))
b <- a %>% read_html %>% html_nodes('div.leftContent') %>% html_nodes('div.info.clear') %>%
html_nodes('div.title') %>% html_text()
Finally, I turned to RSelenium. It's slow, but there are no more captchas; even when one appears, I can solve it directly in the browser.
You are getting CAPTCHAs because that is how the website tries to prevent non-human scripts from scraping its data. When you scrape, it detects you as a robotic script, because your script sends very frequent GET requests with the same parameters. Your program needs to behave like a real user (visiting the website in a random time pattern, with different browsers and IPs).
You can avoid getting CAPTCHAs by varying these parameters as described below, so your program appears more like a real user:
Use randomness when sending GET requests. For example, you can use the Sys.sleep function (with a random distribution) to sleep before sending each GET request.
Vary the user-agent data (Mozilla, Chrome, IE etc.), cookie acceptance, and encoding.
Vary your source location (IP address and server info).
Manipulating these parameters will help you avoid CAPTCHA validation to some extent.
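A rough sketch of the randomised-delay and user-agent/proxy rotation idea, shown in Python purely for illustration (in R the equivalents would be Sys.sleep, httr::user_agent and httr::use_proxy); the user-agent strings are taken from the examples on this page and the proxy address is a placeholder:

import random
import time
import requests

# Pool of user agents to rotate through (strings taken from the examples above)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:43.0) Gecko/20100101 Firefox/43.0',
]
# Optional proxy pool (placeholder address)
PROXIES = [None, {'http': 'http://proxy.example.com:8080'}]

def polite_get(url):
    # Sleep a random amount of time before each request
    time.sleep(random.uniform(2, 8))
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers,
                        proxies=random.choice(PROXIES), timeout=10)

for page in range(1, 4):
    r = polite_get(f'https://fs.lianjia.com/ershoufang/pg{page}')
    print(page, r.status_code)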
