Protection magic from "avito.ru": unable to scrape, request blocked - web-scraping

avito.ru has some special scraping protections, and I am trying to understand how they work.
When I request the URL https://www.avito.ru/all?q=car in the browser without cookies, as a fresh user, I receive the correct HTML content.
Once I copy the request over to cURL, it fails.
curl 'https://www.avito.ru/all?q=car' \
-H 'authority: www.avito.ru' \
-H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8' \
-H 'accept-language: de-DE,de;q=0.5' \
-H 'cache-control: no-cache' \
-H 'pragma: no-cache' \
-H 'sec-ch-ua: "Not_A Brand";v="99", "Brave";v="109", "Chromium";v="109"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "macOS"' \
-H 'sec-fetch-dest: document' \
-H 'sec-fetch-mode: navigate' \
-H 'sec-fetch-site: none' \
-H 'sec-fetch-user: ?1' \
-H 'sec-gpc: 1' \
-H 'upgrade-insecure-requests: 1' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36' \
--compressed
I then receive the VPN / IP blocking page. The request in the browser always works fine, regardless of what I do.
Why is my cloned cURL request not working? Any ideas?

Related

Curl gives html error 1020 when making HTTP request to opensea API

I'm trying to make a request to the OpenSea.io API. When I go to the network inspector I can see a whole slew of requests going to and from the page. When I select one, right-click, and choose "Copy as cURL", I can paste it into my terminal, and normally the data comes through as output to the terminal. For a few requests, I got a message about binary output that I was able to resolve by modifying the request. For example:
curl 'https://api.opensea.io/tokens/?limit=100' \
-X 'GET' \
-H 'Pragma: no-cache' \
-H 'Accept: */*' \
-H 'Accept-Language: en-US,en;q=0.9' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Cache-Control: no-cache' \
-H 'Origin: https://opensea.io' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15' \
-H 'Connection: keep-alive' \
-H 'Referer: https://opensea.io/' \
-H 'Host: api.opensea.io' \
-H 'X-API-KEY: 2f6f419a083c46de9d83ce3dbe7db601' \
-H 'X-BUILD-ID: da14c5fd3811187c88141eb116061b5f6cf87f45'
The above gave me the binary error message. I resolved it by adding --compressed at the end to decompress the "binary" data and by removing the br option from the Accept-Encoding header. The request below now works fine in my terminal.
curl 'https://api.opensea.io/tokens/?limit=100' \
-X 'GET' \
-H 'Pragma: no-cache' \
-H 'Accept: */*' \
-H 'Accept-Language: en-US,en;q=0.9' \
-H 'Accept-Encoding: gzip, deflate' \
-H 'Cache-Control: no-cache' \
-H 'Origin: https://opensea.io' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15' \
-H 'Connection: keep-alive' \
-H 'Referer: https://opensea.io/' \
-H 'Host: api.opensea.io' \
-H 'X-API-KEY: 2f6f419a083c46de9d83ce3dbe7db601' \
-H 'X-BUILD-ID: da14c5fd3811187c88141eb116061b5f6cf87f45' --compressed
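To illustrate why the first request printed "binary" data: a gzip-compressed body is not readable text until it is decompressed, which is the step curl performs when --compressed is given. Below is a minimal Python sketch of that idea. The JSON payload and the helper names are made up for illustration and are not OpenSea's actual response.

```python
import gzip
import json

# Hypothetical response body; NOT the real OpenSea payload.
payload = json.dumps({"tokens": [], "limit": 100}).encode("utf-8")

# What a server sends when it honours "Accept-Encoding: gzip".
compressed = gzip.compress(payload)


def looks_like_gzip(body: bytes) -> bool:
    """Return True if the body starts with the two gzip magic bytes."""
    return body[:2] == b"\x1f\x8b"


def decode_body(body: bytes, content_encoding: str) -> str:
    """Decompress the body if the server marked it as gzip, then decode it as text."""
    if content_encoding == "gzip":
        body = gzip.decompress(body)
    return body.decode("utf-8")


if __name__ == "__main__":
    print(looks_like_gzip(compressed))       # the raw bytes are compressed data
    print(decode_body(compressed, "gzip"))   # readable JSON after decompression
```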
So that's all fine and dandy, but it didn't fix the problem for all of the requests. I went through and found the requests that carry the data I'm looking for, but they give a new error about not being the website owner. Consider the request below:
curl 'https://api.opensea.io/graphql/' \
-X 'POST' \
-H 'Content-Type: application/json' \
-H 'Pragma: no-cache' \
-H 'Accept: */*' \
-H 'Host: api.opensea.io' \
-H 'Cache-Control: no-cache' \
-H 'Accept-Language: en-US,en;q=0.9' \
-H 'Origin: https://opensea.io' \
-H 'Content-Length: 451' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15' \
-H 'Referer: https://opensea.io/' \
-H 'Accept-Encoding: gzip, deflate' \
-H 'Connection: keep-alive' \
-H 'Cookie: _ga_9VSBF2K4BX=GS1.1.1653330281.9.1.1653332997.0; csrftoken=BVdZtaJOMRxED1ALVr79hZfFHIcUUTeNokvuFbqkb17fPoZiEqpe5Fb26Mq4RQsg; sessionid=eyJzZXNzaW9uSWQiOiI0MzJjMWVlYi0zY2Q5LTQ4Y2QtODljZS1jZWFhNzk0NzI2ZDIifQ:1ntDPZ:iRgNCzJHvxP1nDBSR90Hjx4hcpPy8UmpZl7GG6lV2e8; ajs_anonymous_id=41ec97c3-3ebf-467b-a921-a31f94abeb2f; amp_ddd6ec=yUkvg9MB9AgtD0-EafL8wO...1g3p2k0km.1g3p52466.5c.54.ag; _fbp=fb.1.1652624043939.1609498506; _ga=GA1.2.337370304.1652623932; _gid=GA1.2.1049414718.1653330282; _uetsid=9d339a80dac511ec84300fb0b22c8619; _uetvid=ebc21490d88011ec99749d8ebc9bcd13; __cf_bm=OZmIijoynqXFgy9j69FEOB2a0As_1yLXG3751dUFAO4-1653332831-0-AX1rqerC9b2mttE3Lg4rIp33aWgqCGg2fozR3+cJTaeEEJ6xgpz1/VY5OIrHCONfYfGI26n0qHHCGtxb5YDwVBw=; cf_chl_2=; cf_chl_prog=; cf_clearance=mfMY41rDtGcV.Hkkmp5dZkZUtz10Y7fXRmobKhROBlw-1653331507-0-150; _gcl_au=1.1.13890619.1653330282; __os_session=eyJpZCI6IjQzMmMxZWViLTNjZDktNDhjZC04OWNlLWNlYWE3OTQ3MjZkMiJ9; __os_session.sig=xyK0HcEq8hEtOPpbnB0ra5A18qm3t-xGKx_2YDCmObc' \
-H 'x-signed-query: d73eda68d997705a2785aa8222d5a3c5663c392d0df699f665e44fb31e14642b' \
-H 'X-BUILD-ID: da14c5fd3811187c88141eb116061b5f6cf87f45' \
-H 'X-API-KEY: 2f6f419a083c46de9d83ce3dbe7db601' \
--data-binary '{"id":"TraitsDropdownQuery","query":"query TraitsDropdownQuery(\n $collection: CollectionSlug!\n) {\n collection(collection: $collection) {\n assetCount\n numericTraits {\n key\n value {\n max\n min\n }\n }\n stringTraits {\n key\n counts {\n count\n value\n }\n }\n defaultChain {\n identifier\n }\n id\n }\n}\n","variables":{"collection":"boredapeyachtclub"}}' --compressed
When the webpage makes the request, the site server returns back a JSON file with all kinds of useful data inside. But for some reason when I make the request it gives me back an HTML file and says:
<h1>
<span class="error-description">Access denied</span>
<span class="code-label">Error code <span>1020</span></span>
</h1>
<div class="large-font">
<p>You do not have access to api.opensea.io.</p><p>The site owner may have set restrictions that prevent you from accessing the site. Contact the site owner for access or try loading the page again.</p>
</div>
Can anybody help resolve this? What changes do I need to make to the curl request so that I actually get the JSON data I'm looking for? I understand the page is saying that I am not the website owner, and that's correct, but then why does it give the JSON data to my browser and not to me through a curl request? How does the server know the difference between my terminal and a browser when I pass all of the same headers and cookies that the browser sent? I noticed that the cookies include cf_bm and similar cookies that hold information such as a Unix timestamp. I tried passing the current Unix timestamp, generated on the fly with Node.js and Axios, but I still got the same message, so I believe there is more going on than a cookie difference. Additionally, I tried finding the cookie values in previous requests, in case the server hands out some value that has to be sent back later, but I couldn't find any matching values from one request to the next.
Any help is much appreciated, both in fixing this specific problem as well as explaining the overall process of how the server identifies the differences between browser and terminal.
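The timestamp observation in the question can be checked with a short Python sketch that pulls the 10-digit field out of the __cf_bm value from the request above. The cookie layout is not documented anywhere in this thread, so the "hyphen, ten digits, hyphen" pattern is an assumption based on this one value, and the rest of the cookie is treated as opaque.

```python
import re
from datetime import datetime, timezone

# __cf_bm value copied from the request in the question.
CF_BM = (
    "OZmIijoynqXFgy9j69FEOB2a0As_1yLXG3751dUFAO4-1653332831-0-"
    "AX1rqerC9b2mttE3Lg4rIp33aWgqCGg2fozR3+cJTaeEEJ6xgpz1/VY5OIrHCONfYfGI26n0qHHCGtxb5YDwVBw="
)


def embedded_timestamp(cookie_value):
    """Return the first hyphen-delimited 10-digit field as a UTC datetime, or None.

    Assumption: that field is a Unix timestamp, as the question suggests.
    """
    match = re.search(r"-(\d{10})-", cookie_value)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


if __name__ == "__main__":
    print(embedded_timestamp(CF_BM))
```

The sketch only reads the field. As the question reports, sending a fresh timestamp in its place did not change the server's response.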
The reason for the Access Denied (error 1020) response is that the target is blocking you at the IP or User-Agent level.
Solution: use a proxy and randomize your request headers.
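A minimal Python sketch of that suggestion, using only the standard library. The User-Agent pool, the proxy address, and the function names are hypothetical placeholders, and nothing here sends a request. Whether rotating headers and IPs is enough depends on the rules the site owner has configured.

```python
import random
import urllib.request

# Hypothetical pool; both strings are copied from requests shown in this thread.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.0 Safari/605.1.15",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
]


def build_request(url, rng=random):
    """Build (but do not send) a request with a randomly chosen User-Agent."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", rng.choice(USER_AGENTS))
    req.add_header("Accept", "*/*")
    return req


def build_opener(proxy_url):
    """Return an opener that would route HTTP and HTTPS traffic through proxy_url."""
    proxy = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(proxy)


if __name__ == "__main__":
    req = build_request("https://api.opensea.io/tokens/?limit=100")
    print(req.get_header("User-agent"))  # urllib stores the key as "User-agent"
    # opener = build_opener("http://127.0.0.1:8080")  # placeholder proxy address
    # opener.open(req) would actually send the request.
```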

How can I debug Google Analytics 4 events, that are missing in the debug view

I'm fairly new to Google Analytics and I'm starting with the new Google Analytics 4. I've set it up via Google Tag Manager.
I have two custom events:
cta_visible (event visible)
click_meeting_link (outbound click)
When I debug my page with https://tagassistant.google.com/, I can see both events being triggered.
In the debug view of Google Analytics, the cta_visible event is displayed, but click_meeting_link is missing. I thought it might be a bug caused by the fact that my browser leaves the page when I click the link.
But while I can see the cta_visible event in my reports, click_meeting_link is missing there as well.
In the network tab I see both events being sent to GA (with a response code of 204).
curl 'https://www.google-analytics.com/g/collect?v=2&tid=G-NKBZG0FK64&gtm=2oead0&_p=1988538019&sr=1792x1120&gcs=G100&gdid=dOThhZD&ul=en-gb&cid=1495603155.1634555573&_s=5&dl=https%3A%2F%2Finnovation.tarent.de%2Fsparring&dt=Innovation%20Sparring%20%7C%20tarent&sid=1634555572&sct=1&seg=0&en=click_meeting_link&_c=1&_et=2&ep.debug_mode=true&ep.click_url=https%3A%2F%2Fmeetings.hubspot.com%2Ffrederik-vosberg%2Finnovation-sparring' \
-X 'POST' \
-H 'authority: www.google-analytics.com' \
-H 'content-length: 0' \
-H 'pragma: no-cache' \
-H 'cache-control: no-cache' \
-H 'sec-ch-ua: "Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36' \
-H 'sec-ch-ua-platform: "macOS"' \
-H 'content-type: text/plain;charset=UTF-8' \
-H 'accept: */*' \
-H 'origin: https://innovation.tarent.de' \
-H 'sec-fetch-site: cross-site' \
-H 'sec-fetch-mode: no-cors' \
-H 'sec-fetch-dest: empty' \
-H 'referer: https://innovation.tarent.de/' \
-H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8,de;q=0.7' \
--compressed
curl 'https://www.google-analytics.com/g/collect?v=2&tid=G-NKBZG0FK64&gtm=2oead0&_p=1800931673&sr=1792x1120&gcs=G100&ul=en-gb&cid=175794657.1634555667&_s=2&dl=https%3A%2F%2Finnovation.tarent.de%2Fsparring&dt=Innovation%20Sparring%20%7C%20tarent&sid=1634555666&sct=1&seg=0&en=cta_visible&_fv=1&_nsi=1&_ss=1&_eu=C&ep.debug_mode=true' \
-X 'POST' \
-H 'authority: www.google-analytics.com' \
-H 'content-length: 0' \
-H 'pragma: no-cache' \
-H 'cache-control: no-cache' \
-H 'sec-ch-ua: "Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36' \
-H 'sec-ch-ua-platform: "macOS"' \
-H 'content-type: text/plain;charset=UTF-8' \
-H 'accept: */*' \
-H 'origin: https://innovation.tarent.de' \
-H 'sec-fetch-site: cross-site' \
-H 'sec-fetch-mode: no-cors' \
-H 'sec-fetch-dest: empty' \
-H 'referer: https://innovation.tarent.de/' \
-H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8,de;q=0.7' \
--compressed
Any suggestions as to what could cause this?
Thanks in advance
I've found the problem: the Consent Management Platform Usercentrics activated the new Google Consent Mode. We hadn't configured it properly, so analytics tracking was denied. This added the gcs=G100 parameter to the request, which tells Google Analytics to ignore it.
This also prevented DebugView from working properly.
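A copied collect URL can be checked for this condition with a small Python sketch. The decoding of gcs as "G1" followed by one digit for ad_storage and one for analytics_storage (1 = granted, 0 = denied) is an assumption based on how the parameter is commonly described, not something stated in this thread. The URL is a shortened version of the one in the question.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened version of the click_meeting_link request from the question.
COLLECT_URL = (
    "https://www.google-analytics.com/g/collect"
    "?v=2&tid=G-NKBZG0FK64&gcs=G100&en=click_meeting_link&ep.debug_mode=true"
)


def consent_state(collect_url):
    """Decode the gcs parameter of a GA4 collect URL.

    Assumed layout: "G1" + ad_storage digit + analytics_storage digit.
    Returns None if gcs is absent or does not match that layout.
    """
    params = parse_qs(urlsplit(collect_url).query)
    gcs = params.get("gcs", [None])[0]
    if gcs is None or len(gcs) != 4 or not gcs.startswith("G1"):
        return None
    return {
        "gcs": gcs,
        "ad_storage": gcs[2] == "1",
        "analytics_storage": gcs[3] == "1",
    }


if __name__ == "__main__":
    print(consent_state(COLLECT_URL))
```

Under that assumption, G100 means both storage types are denied, which matches the behaviour described above.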
I don't see anything wrong in the network requests. They look perfectly fine. This data should be accessible in GA4.
But if you don't see the other event in GA4, I can only presume that the event is not being created in GA4. GA4 limits the number of unique events that you can create ( https://support.google.com/analytics/answer/9267744?hl=en ). Even though the limit is quite generous, it is still a sensible precaution not to create events on the fly.
Also, you don't need to enable "Preserve log". It's usually easier to prevent the page from navigating away by executing this in the browser console:
window.onbeforeunload = function(){return false;}

Using iTerm2 to modify and massage commands before executing

I'm looking for a way to work with a long command in a terminal such as iTerm2.
In a perfect world, I'd have a text box that I can edit, and hitting Command+Enter would send it to "execute". The editing experience would be more like VS Code than a terminal, where jumping around is a pain.
Example
curl 'xxx' \
-X 'OPTIONS' \
-H 'authority: xxx' \
-H 'accept: */*' \
-H 'access-control-request-method: POST' \
-H 'access-control-request-headers: content-type' \
-H 'origin: xxx' \
-H 'user-agent: xxx' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-site: same-site' \
-H 'sec-fetch-dest: empty' \
-H 'referer: xxx' \
-H 'accept-language: xxx' \
-b 'sessionID=xxx' \
--compressed
UPDATE:
I found this which kind of works but it doesn't show the output.
Since version 3 of iTerm2 you can use the Composer feature.
Use:
Cmd + Shift + . to open the editor window
Shift + Enter to send the command for execution

Curl says it cannot resolve host but host can be resolved

In this post, mywebsite stands in for the real website, which resolves correctly when I run curl www.mywebsite.com. These are the options I am using:
curl -X 'GET https://www.mywebsite.com/Web2/PDF.aspx?page=1' \
-H 'Host: www.mywebsite.org' \
-H 'Connection: keep-alive' \
-H 'Upgrade-Insecure-Requests: 1' \
-A 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36 OPR/51.0.2830.26' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
-H 'DNT: 1' \
-e 'the-referer' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Accept-Language: en-US,en;q=0.9' \
-b '_the-cookies'
When I try to run this in OSX terminal, the following happens:
$ curl -X 'GET https://www.mywebsite.com/Web2/PDF.aspx?page=1' \
-H 'Host: www.mywebsite.org' \
-H 'Connection: keep-alive' \
-H 'Upgrade-Insecure-Requests: 1' \
-A 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36 OPR/51.0.2830.26' \
curl: (6) Could not resolve host:
Mac-mini-3:~ myuser$
It says:
curl: (6) Could not resolve host:
Why is this happening? And why is it trying to run commands when I used the \ escape sequence in terminal? It should not be running any commands until all the options are passed.
Because you have not specified a host. The URL sits inside the quoted -X argument, so curl takes it as part of the request method rather than as the URL to fetch.
You need the following instead (note the placement of the single quote):
curl -X GET 'https://www.mywebsite.com/Web2/PDF.aspx?page=1' ...
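The difference between the two quotings can be seen by tokenizing both commands the way a POSIX shell would. The sketch below uses Python's shlex for that. The helper name is made up, and the commands are shortened versions of the ones in the question.

```python
import shlex

# Quoting from the question: method and URL share one pair of quotes.
BROKEN = "curl -X 'GET https://www.mywebsite.com/Web2/PDF.aspx?page=1' -H 'Host: www.mywebsite.org'"
# Quoting from the answer: only the URL is quoted.
FIXED = "curl -X GET 'https://www.mywebsite.com/Web2/PDF.aspx?page=1' -H 'Host: www.mywebsite.org'"


def split_method_and_rest(command):
    """Return (argument given to -X, remaining arguments) after shell-style splitting."""
    argv = shlex.split(command)
    i = argv.index("-X")
    return argv[i + 1], argv[i + 2:]


if __name__ == "__main__":
    for label, cmd in (("broken", BROKEN), ("fixed", FIXED)):
        method, rest = split_method_and_rest(cmd)
        print(label, "| method:", repr(method), "| rest:", rest)
```

In the broken variant the URL ends up inside the method string and no standalone URL argument remains. In the fixed variant the method is just GET and the URL is its own argument.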

Why would Postman strip out the form post data?

I got this Postman:
Obviously I've included the nickname field, but my Laravel app thinks otherwise. I clicked on the code link to get the curl version, and it returned this:
curl -X POST \
http://192.168.1.143:8000/api/addresses/new \
-H 'Cache-Control: no-cache' \
-H 'Content-Type: application/x-www-form-urlencoded' \
-H 'Postman-Token: 507c0989-f02a-028c-4222-c91302402fd6' \
-H 'accept: application/x.toters.v1+json' \
-H 'accept-encoding: gzip' \
-H 'authorization: Bearer {***obfuscated***}' \
-H 'connection: Keep-Alive' \
-H 'content-language: en-US' \
-H 'content-length: 165' \
-H 'host: 192.168.1.143:8000' \
-H 'user-agent: Dalvik/2.1.0 (Linux; U; Android 8.1.0; Pixel 2 XL Build/OPM1.171019.018)' \
-d 'country_code=&street=&nickname=&lon=&phone_number=&is_default=&lat=&apartment=&building_ref='
Notice the params in -d: the keys are there, but their values are gone! How can I make Postman respect my params?
This made it work, by adding the values manually in curl:
curl -X POST http://192.168.1.143:8000/api/addresses/new \
-H 'Cache-Control: no-cache' \
-H 'Content-Type: application/x-www-form-urlencoded' \
-H 'Postman-Token: c04b38ca-c687-acfc-c4c7-b54bd85a6018' \
-H 'accept: application/x.toters.v1+json' \
-H 'accept-encoding: gzip' \
-H 'authorization: Bearer {eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOjEwMiwiaXNzIjoiaHR0cDpcL1wvMTkyLjE2OC4xLjE0Mzo4MDAwXC9hcGlcL3VzZXJzXC9sb2dpbiIsImlhdCI6MTUyMDUxMjcxMSwiZXhwIjoxNjE1MTIwNzExLCJuYmYiOjE1MjA1MTI3MTEsImp0aSI6IkhzODRaamdjbnJEdlQ5Z3UifQ.TrFOeB5qKJ9DwWCqjDLSXXlBscBZKTtbogjWY_bLjdQ}' \
-H 'connection: Keep-Alive' \
-H 'content-language: en-US' \
-H 'host: 192.168.1.143:8000' \
-H 'user-agent: Dalvik/2.1.0 (Linux; U; Android 8.1.0; Pixel 2 XL Build/OPM1.171019.018)' \
-d 'country_code=657&street=toters&nickname=toters_office&lon=35.5243772&lat=33.8967797&phone_number=96176447024&is_default=1&apartment=toters&building_ref=toters'
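To make the difference between the two -d bodies concrete, here is a minimal Python sketch that lists which form fields arrive empty and shows how a filled-in body is encoded. It does not explain why Postman generated empty values. The helper names are hypothetical, and the sample values are taken from the working command above.

```python
from urllib.parse import parse_qsl, urlencode

# Body generated by Postman's code export in the question.
GENERATED_BODY = (
    "country_code=&street=&nickname=&lon=&phone_number=&is_default="
    "&lat=&apartment=&building_ref="
)


def empty_fields(body):
    """Return the names of form fields whose value is an empty string."""
    return [key for key, value in parse_qsl(body, keep_blank_values=True) if value == ""]


def encode_form(values):
    """URL-encode a dict of field values as an application/x-www-form-urlencoded body."""
    return urlencode(values)


if __name__ == "__main__":
    print(empty_fields(GENERATED_BODY))  # every field in the generated body is empty
    print(encode_form({"country_code": "657", "nickname": "toters_office"}))
```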
I have the latest version of the native app (v6.0.9) and copied your POST body data. When I selected the code option, I got the correct output.
I'm not really sure what problem you're seeing there.

Resources