Cannot replicate POST request with JSON body - web-scraping

I'm using Scrapy to replicate a POST request to a site. I'm sure I'm passing the right form arguments, but somehow the site isn't responding the way it should.
Copying the request as cURL from Chrome gives (values have been redacted):
curl 'https://example.com/somepath' -H 'origin: https://example.com/' -H 'x-requested-with: XMLHttpRequest' -H 'pragma: no-cache' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36' -H 'content-type: application/json' --data '{"foo":"var"}' --compressed
Here is my Scrapy request:
FormRequest(url="https://example.com/somepath", formdata={'foo': 'var'})

You are missing the Content-Type header, and you won't be able to make that request with FormRequest anyway, since FormRequest sends form-encoded data rather than JSON. Just use a plain Request with the correct body:
import json
...
Request(
    url="https://example.com/somepath",
    method="POST",  # Request defaults to GET; the original request is a POST
    body=json.dumps({'foo': 'var'}),
    headers={'Content-Type': 'application/json'},
)
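As an aside, recent Scrapy versions (1.8+) also ship a JsonRequest subclass that serializes the payload and sets the Content-Type header for you. A minimal sketch of the same request:
from scrapy.http import JsonRequest

JsonRequest(
    url="https://example.com/somepath",
    data={'foo': 'var'},  # serialized to JSON; method defaults to POST when data is given
)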

Related

"avito.ru" scraping protection: unable to scrape, request blocked

avito.ru has some special scraping protections and I'm trying to understand how they work.
When I request this URL, https://www.avito.ru/all?q=car, without cookies, as a fresh user, I receive the correct HTML content.
But once I copy the request over to cURL, it fails:
curl 'https://www.avito.ru/all?q=car' \
-H 'authority: www.avito.ru' \
-H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8' \
-H 'accept-language: de-DE,de;q=0.5' \
-H 'cache-control: no-cache' \
-H 'pragma: no-cache' \
-H 'sec-ch-ua: "Not_A Brand";v="99", "Brave";v="109", "Chromium";v="109"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "macOS"' \
-H 'sec-fetch-dest: document' \
-H 'sec-fetch-mode: navigate' \
-H 'sec-fetch-site: none' \
-H 'sec-fetch-user: ?1' \
-H 'sec-gpc: 1' \
-H 'upgrade-insecure-requests: 1' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36' \
--compressed
Instead I receive the VPN / IP blocking page. The request inside the browser always works fine, regardless of what I do.
Why is my cloned cURL request not working? Any ideas?

cURL gives HTML error 1020 when making an HTTP request to the OpenSea API

I'm trying to make a request to the OpenSea.io API. When I go to the network inspector I can see a whole slew of requests that come through to/from the page. When I select one, right-click, and choose "copy as cURL", I can paste that into my terminal and normally the data comes through as output. For a few requests, I got a message about binary output that I was able to resolve by modifying the request. For example:
curl 'https://api.opensea.io/tokens/?limit=100' \
-X 'GET' \
-H 'Pragma: no-cache' \
-H 'Accept: */*' \
-H 'Accept-Language: en-US,en;q=0.9' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Cache-Control: no-cache' \
-H 'Origin: https://opensea.io' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15' \
-H 'Connection: keep-alive' \
-H 'Referer: https://opensea.io/' \
-H 'Host: api.opensea.io' \
-H 'X-API-KEY: 2f6f419a083c46de9d83ce3dbe7db601' \
-H 'X-BUILD-ID: da14c5fd3811187c88141eb116061b5f6cf87f45'
The above gave me the binary error message; I resolved it by adding --compressed at the end to decompress the "binary" data and removing the br option from the encoding header. The request below now works just fine in my terminal.
curl 'https://api.opensea.io/tokens/?limit=100' \
-X 'GET' \
-H 'Pragma: no-cache' \
-H 'Accept: */*' \
-H 'Accept-Language: en-US,en;q=0.9' \
-H 'Accept-Encoding: gzip, deflate' \
-H 'Cache-Control: no-cache' \
-H 'Origin: https://opensea.io' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15' \
-H 'Connection: keep-alive' \
-H 'Referer: https://opensea.io/' \
-H 'Host: api.opensea.io' \
-H 'X-API-KEY: 2f6f419a083c46de9d83ce3dbe7db601' \
-H 'X-BUILD-ID: da14c5fd3811187c88141eb116061b5f6cf87f45' --compressed
So that's all fine and dandy, but that didn't fix the issue for all of the requests. I went through and found the requests that have the data I'm looking for, but they give a new error about not being the website owner. Consider the request below:
curl 'https://api.opensea.io/graphql/' \
-X 'POST' \
-H 'Content-Type: application/json' \
-H 'Pragma: no-cache' \
-H 'Accept: */*' \
-H 'Host: api.opensea.io' \
-H 'Cache-Control: no-cache' \
-H 'Accept-Language: en-US,en;q=0.9' \
-H 'Origin: https://opensea.io' \
-H 'Content-Length: 451' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15' \
-H 'Referer: https://opensea.io/' \
-H 'Accept-Encoding: gzip, deflate' \
-H 'Connection: keep-alive' \
-H 'Cookie: _ga_9VSBF2K4BX=GS1.1.1653330281.9.1.1653332997.0; csrftoken=BVdZtaJOMRxED1ALVr79hZfFHIcUUTeNokvuFbqkb17fPoZiEqpe5Fb26Mq4RQsg; sessionid=eyJzZXNzaW9uSWQiOiI0MzJjMWVlYi0zY2Q5LTQ4Y2QtODljZS1jZWFhNzk0NzI2ZDIifQ:1ntDPZ:iRgNCzJHvxP1nDBSR90Hjx4hcpPy8UmpZl7GG6lV2e8; ajs_anonymous_id=41ec97c3-3ebf-467b-a921-a31f94abeb2f; amp_ddd6ec=yUkvg9MB9AgtD0-EafL8wO...1g3p2k0km.1g3p52466.5c.54.ag; _fbp=fb.1.1652624043939.1609498506; _ga=GA1.2.337370304.1652623932; _gid=GA1.2.1049414718.1653330282; _uetsid=9d339a80dac511ec84300fb0b22c8619; _uetvid=ebc21490d88011ec99749d8ebc9bcd13; __cf_bm=OZmIijoynqXFgy9j69FEOB2a0As_1yLXG3751dUFAO4-1653332831-0-AX1rqerC9b2mttE3Lg4rIp33aWgqCGg2fozR3+cJTaeEEJ6xgpz1/VY5OIrHCONfYfGI26n0qHHCGtxb5YDwVBw=; cf_chl_2=; cf_chl_prog=; cf_clearance=mfMY41rDtGcV.Hkkmp5dZkZUtz10Y7fXRmobKhROBlw-1653331507-0-150; _gcl_au=1.1.13890619.1653330282; __os_session=eyJpZCI6IjQzMmMxZWViLTNjZDktNDhjZC04OWNlLWNlYWE3OTQ3MjZkMiJ9; __os_session.sig=xyK0HcEq8hEtOPpbnB0ra5A18qm3t-xGKx_2YDCmObc' \
-H 'x-signed-query: d73eda68d997705a2785aa8222d5a3c5663c392d0df699f665e44fb31e14642b' \
-H 'X-BUILD-ID: da14c5fd3811187c88141eb116061b5f6cf87f45' \
-H 'X-API-KEY: 2f6f419a083c46de9d83ce3dbe7db601' \
--data-binary '{"id":"TraitsDropdownQuery","query":"query TraitsDropdownQuery(\n $collection: CollectionSlug!\n) {\n collection(collection: $collection) {\n assetCount\n numericTraits {\n key\n value {\n max\n min\n }\n }\n stringTraits {\n key\n counts {\n count\n value\n }\n }\n defaultChain {\n identifier\n }\n id\n }\n}\n","variables":{"collection":"boredapeyachtclub"}}' --compressed
When the webpage makes the request, the server returns a JSON file with all kinds of useful data inside. But for some reason, when I make the request, it gives me back an HTML file that says:
<h1>
<span class="error-description">Access denied</span>
<span class="code-label">Error code <span>1020</span></span>
</h1>
<div class="large-font">
<p>You do not have access to api.opensea.io.</p><p>The site owner may have set restrictions that prevent you from accessing the site. Contact the site owner for access or try loading the page again.</p>
</div>
Can anybody help in resolving this? What changes do I need to make to the curl request so that I actually get the JSON data I'm looking for? I understand the page is saying that I am not the website owner, and that's correct, but then why does it give the JSON data to my browser and not to me through a curl request? How does the server know the difference between my terminal and a browser making a request when I pass through all of the same headers and cookies that the browser was given?
I noticed that among the cookies there were __cf_bm and similar cookies that hold some info like a Unix timestamp. I tried to pass along the current Unix timestamp, generated on the fly using Node.js and Axios, but I still got the same message, so I believe there's something more going on besides a cookie difference. Additionally, I tried looking for the cookie values in previous requests, to see if maybe the server had sent back some info that has to be returned later, but I couldn't find any matching values from one request to the next.
Any help is much appreciated, both in fixing this specific problem and in explaining how the server tells the difference between a browser and a terminal.
The reason for the Access Denied / error 1020 page (a Cloudflare block) is that the target is blocking you at the IP or User-Agent level.
Solution: use a proxy and randomize your request headers.
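As a rough illustration of that advice, here is a minimal Python sketch using the requests library; the proxy endpoint and the User-Agent pool below are placeholders you would substitute with your own values:
import random
import requests

# Placeholder proxy endpoint and User-Agent pool; substitute real values.
PROXIES = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
]

response = requests.get(
    "https://api.opensea.io/tokens/?limit=100",
    headers={"User-Agent": random.choice(USER_AGENTS)},  # rotate per request
    proxies=PROXIES,  # egress through the proxy so the blocked IP is not yours
    timeout=30,
)
print(response.status_code)
Be aware that Cloudflare-style protections can fingerprint more than the IP and headers (TLS details, for example), so rotating proxies and headers is not guaranteed to get through.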

How to access websites that have an IP block on GCP services using Cloud Run or Cloud Functions

I'm trying to access a site from Google Cloud Functions or Cloud Run, but it looks like the site is blocking the IPs coming from these services.
Locally the code works fine. I've tried adding a lot of headers to simulate a local call, but it doesn't work.
Some of the headers:
--header 'Connection: keep-alive'
--header 'Accept: */*'
--header 'X-Requested-With: XMLHttpRequest'
--header 'Origin: <site fororigin>'
--header 'Referer: <site for referer>'
--header 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'
--header 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8'
--header 'Sec-Fetch-Site: same-origin'
--header 'Sec-Fetch-Mode: cors'
--header 'Sec-Fetch-Dest: empty'
--header 'Accept-Language: pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7'
It also works if I create a VM in my region (South America), but that solution would add a lot of automation complexity.
Is there a way to work around the IP block? Maybe by calling another server to change the IP?
I wrote the documentation on this topic, which explains how to get a static IP address that you can ask to be whitelisted and then use for outbound connections from Cloud Run: https://cloud.google.com/run/docs/configuring/static-outbound-ip
This involves routing your external traffic through a VPC Connector to a VPC that has a NAT configured with one or more static IP addresses. That way, Cloud Run will use those IPs when it connects to external endpoints.
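For orientation, the steps in that guide boil down to roughly the following gcloud commands; the resource names (my-static-ip, my-connector, and so on) and the region are placeholders, and the linked documentation remains the authoritative sequence:
# Reserve a static external IP for Cloud NAT to use
gcloud compute addresses create my-static-ip --region=us-central1

# Create a Serverless VPC Access connector for the Cloud Run service
gcloud compute networks vpc-access connectors create my-connector \
  --network=default --region=us-central1 --range=10.8.0.0/28

# Create a Cloud Router plus a NAT gateway that egresses via the reserved IP
gcloud compute routers create my-router --network=default --region=us-central1
gcloud compute routers nats create my-nat --router=my-router --region=us-central1 \
  --nat-external-ip-pool=my-static-ip --nat-all-subnet-ip-ranges

# Deploy the service with all outbound traffic routed through the connector
gcloud run deploy my-service --image=gcr.io/my-project/my-image \
  --vpc-connector=my-connector --vpc-egress=all-traffic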

Curl says it cannot resolve host but host can be resolved

I've replaced the real domain with mywebsite below; the host resolves fine when I run curl www.mywebsite.com. These are the options I am using:
curl -X 'GET https://www.mywebsite.com/Web2/PDF.aspx?page=1' \
-H 'Host: www.mywebsite.org' \
-H 'Connection: keep-alive' \
-H 'Upgrade-Insecure-Requests: 1' \
-A 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36 OPR/51.0.2830.26' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
-H 'DNT: 1' \
-e 'the-referer' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Accept-Language: en-US,en;q=0.9' \
-b '_the-cookies'
When I try to run this in OSX terminal, the following happens:
$ curl -X 'GET https://www.mywebsite.com/Web2/PDF.aspx?page=1' \
-H 'Host: www.mywebsite.org' \
-H 'Connection: keep-alive' \
-H 'Upgrade-Insecure-Requests: 1' \
-A 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36 OPR/51.0.2830.26' \
curl: (6) Could not resolve host:
Mac-mini-3:~ myuser$
It says:
curl: (6) Could not resolve host:
Why is this happening? And why does the command execute before all the options are passed, when I used the \ escape at the end of each line? It shouldn't run anything until the full command has been entered.
Because you have not actually given curl a URL: the whole string GET https://... is being passed as the -X argument, which only sets the request method, so curl is left with an empty host to resolve. (The early execution most likely happened because a stray space after one of the trailing backslashes broke the line continuation.)
You need (note the placement of the single quotes):
curl -X GET 'https://www.mywebsite.com/Web2/PDF.aspx?page=1' ...
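As an aside, -X GET is redundant here: GET is curl's default method, so you can drop the option entirely and just pass the quoted URL.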

Local Drupal cURL requests result in "forbidden" error

I can make the following request from any remote client/server:
curl 'http://my.drupalserver.com/node/4688?_format=json' -H 'Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: nl-NL,nl;q=0.8,en-US;q=0.6,en;q=0.4' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36' -H 'Content-Type: application/hal+json' -H 'Accept: */*' -H 'Connection: keep-alive' --compressed
And it works: I get my node as expected.
However, when I make this exact same request from the same server the website is hosted on, I get a 403 Forbidden error.
I'm at my wits' end: the Drupal web profiler clearly shows the request headers for both requests are identical, so I have no idea what the problem could be.
I have already cleared the caches, checked the trusted host settings, ...
I'm running Drupal 8.0.5.
