Link opens from browser but gets rejected (403) if I use requests - python-requests

Problem
I can open the following link on my browser just fine:
https://www.hapag-lloyd.com/en/online-business/track/track-by-container-solution.html?container=HLBU9517848
However, if I use Python requests I get a <Response [403]>.
Upon further investigation of this response, I noticed that the response text of this unauthorized request contains something like this:
<p class="display-later" style="display:none;">Please complete
following CAPTCHA to get access to the Hapag-Lloyd's website.</p>
<p><form id="challenge-form" class="challenge-form managed-form" action="/en/online-business/track/track-by-container-solution.html?container=HLBU9517848&__cf_chl_f_tk=ubmYYNVSxf5CVa3wq85K19Cb6kHn.6V14MgS52EOPpU-1658760627-0-gaNycGzNB70" method="POST" enctype="application/x-www-form-urlencoded">
<div id='cf-please-wait'>
<div id='spinner'>
<div id="cf-bubbles">
<div class="bubbles"></div>
<div class="bubbles"></div>
<div class="bubbles"></div>
</div>
</div>
<p data-translate="please_wait" id="cf-spinner-please-wait">Please stand by, while we are checking your browser...</p>
<p data-translate="redirecting" id="cf-spinner-redirecting" style="display:none">Redirecting...</p>
</div>
The website is using a security service to protect itself from online attacks. The request I just made triggered the security solution; completing the CAPTCHA proves I'm a human and gives me access to Hapag-Lloyd's website.
Minimal Reproducible Example
url = "https://www.hapag-lloyd.com/en/online-business/track/track-by-container-solution.html?container=HLBU9517848"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
container_data = requests.get(url, proxies={"http":"http://95.66.151.101:8080"}, headers=headers)
print(container_data)
Solution
What are possible fixes?
This answer only mentions changing the User-Agent and HTTP headers to deal with 403 responses, which doesn't work in this case.
I believe I will only be on the right path if I follow something like this.
Any advice or links regarding this issue are welcome.
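One possible direction (not necessarily what the linked post describes) is to drive a real browser so the Cloudflare JavaScript check runs in a genuine browser environment. Below is a minimal sketch with Selenium, assuming Chrome and a matching chromedriver are available; there is no guarantee the challenge actually passes automatically.

import time
from selenium import webdriver

url = ("https://www.hapag-lloyd.com/en/online-business/track/"
       "track-by-container-solution.html?container=HLBU9517848")

driver = webdriver.Chrome()   # assumes Chrome plus a compatible chromedriver
driver.get(url)
time.sleep(10)                # crude wait while the browser runs the Cloudflare check
html = driver.page_source     # rendered HTML, if the check completed
driver.quit()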

Related

Why does using free proxies not work for scraping Google Scholar?

I have seen that there are many approaches for scraping data from Google Scholar out there. I tried to put together my own code, however I cannot get access to Google Scholar using free proxies. I am interested in understanding why that is (and, secondarily, what to change). Below is my code. I know it is not the most elegant one, it's my first try at data scraping...
This is a list of proxies I got from "https://free-proxy-list.net/", and I did test that they worked by accessing "http://icanhazip.com" with them (a sketch of that check follows the list).
live_proxies = ['193.122.71.184:3128', '185.76.10.133:8081', '169.57.1.85:8123', '165.154.235.76:80', '165.154.235.156:80']
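The check itself isn't shown in the question; it might look roughly like the sketch below, where proxy_works is a hypothetical helper (not from the original post). Note that passing this check only shows the proxy answers a plain-HTTP request.

import requests

def proxy_works(proxy, timeout=5):
    """Return True if the proxy can fetch icanhazip.com (hypothetical helper)."""
    try:
        response = requests.get("http://icanhazip.com",
                                proxies={"http": proxy, "https": proxy},
                                timeout=timeout)
        return response.ok
    except requests.RequestException:
        return False

# keep only the proxies that responded
live_proxies = [p for p in live_proxies if proxy_works(p)]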
Then I made the URLs I want to scrape and tried to get the content of the pages with one random proxy:
import random
import requests

search_terms = ['Acanthizidae', 'mammalia']
for term in search_terms:
    url = 'https://scholar.google.de/scholar?hl=en&as_sdt=0%2C5&q={}&btnG='.format(term)
    session = requests.Session()
    # use one randomly chosen proxy for both HTTP and HTTPS
    proxy = random.choice(live_proxies)
    session.proxies = {"http": proxy, "https": proxy}
    ua = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
    output = session.get(url, headers=ua, timeout=1.5).text
However, I get:
requests.exceptions.ProxyError: HTTPSConnectionPool(host='scholar.google.de', port=443): Max retries exceeded with url: /scholar?hl=en&as_sdt=0%2C5&q=Acanthizidae&btnG= (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 400 Bad Request')))
Like I said, I did test the proxies before with a different site. What is the problem?

Changing user agent on headless chrome

I have an issue with changing the user agent.
I am trying to use the following line in my runner.js file, in the browsers array:
chrome:headless:userAgent=Mozilla/5.0\ \(Linux\;\ Android\ 5.0\;\ SM-G900P\ Build/LRX21T\)\ AppleWebKit/537.36\ \(KHTML,\ like\ Gecko\)\ Chrome/57.0.2987.133\ Mobile\ Safari/537.36
However, the best I can get is "Mozilla/5.0 (Linux" in the actual user agent.
The guide doesn't say anything explicit about user agents and how to escape them.
Could someone help me with using a custom user agent for the headless chrome? I can't seem to get over the escaping problem. Thanks.
I actually found the answer: you need to escape every ; character with \\.
E.g.:
chrome:headless:userAgent=Mozilla/5.0 (X11\\; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36
will work.
When using it in a CLI command you need to double-escape. (I didn't have success with that.)
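For comparison only, this is Selenium driven from Python rather than the TestCafe runner discussed above: when headless Chrome is launched directly, the user agent is passed as a single Chrome argument and the escaping problem does not arise. A sketch:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
# the user agent goes in as one Chrome argument, so no semicolon escaping is needed
options.add_argument(
    "user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36"
)
driver = webdriver.Chrome(options=options)
driver.get("https://www.example.com")
print(driver.execute_script("return navigator.userAgent"))
driver.quit()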

'RCurl' [R] package getURL webpage error when scraping API

I am trying to scrape data from an API using the getURL function of the RCurl package in R. My problem is that I can't replicate the response that I get when I open the URL in Chrome when I make the request from R. Essentially, when I open the API page (URL below) in Chrome it works fine, but if I request it using getURL in R (or open it in incognito mode in Chrome) I get a '500 Internal Server Error' response and not the pretty JSON that I'm looking for.
URL/API in question:
http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082
Here is my (failed) request in R.
test2 <- fromJSON(getURL("http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082", ssl.verifypeer = FALSE, useragent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36"))
My Research so Far
First I looked at this prior question on Stack Overflow and added my user agent to the request (this did not solve the problem but may still be necessary):
ViralHeat API issues with getURL() command in RCurl package
Next I looked at this helpful post which guides my rationale:
R Disparity between browser and GET / getURL
My Ideas About the Solution
This is not my area of expertise but my guess is that the request is lacking a cookie needed to complete the request (hence why it doesn't work in my browser in incognito mode). I compared the requests and responses from the successful request to the unsuccessful request:
Successful request: (screenshot not included)
Unsuccessful request: (screenshot not included)
Anyone have any ideas? Should I try the RSelenium package that was suggested by MrFlick in the 2nd post I linked?
This is a courteous site. It would like to know where you come from, what currency you use, etc., to give you a better user experience. It does this by setting a multitude of cookies on the landing page. So we follow suit: navigate to the landing page first to pick up the cookies, then go to the page we want:
library(RCurl)
myURL <- "http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082"
agent <- "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0"
# Set RCurl options: keep a cookie jar, send a browser user agent, follow redirects
curl <- getCurlHandle()
curlSetOpt(cookiejar = "cookies.txt", useragent = agent, followlocation = TRUE, curl = curl)
# Visit the landing page first so the handle collects the site's cookies
firstPage <- getURL("http://www.bluenile.com", curl = curl)
# Then request the API page with those cookies attached
myPage <- getURL(myURL, curl = curl)
library(RJSONIO)
> names(fromJSON(myPage))
[1] "diamondDetailsHeader" "diamondDetailsBodies" "pageMetadata" "expandedUrl"
[5] "newVersion" "multiDiamond"
and the cookies:
> getCurlInfo(curl)$cookielist
[1] ".bluenile.com\tTRUE\t/\tFALSE\t2412270275\tGUID\tDA5C11F5_E468_46B5_B4E8_D551D4D6EA4D"
[2] ".bluenile.com\tTRUE\t/\tFALSE\t1475342275\tsplit\tver~3&presetFilters~TEST"
[3] ".bluenile.com\tTRUE\t/\tFALSE\t1727630275\tsitetrack\tver~2&jse~0"
[4] ".bluenile.com\tTRUE\t/\tFALSE\t1425230275\tpop\tver~2&china~false&french~false&ie~false&internationalSelect~false&iphoneApp~false&survey~false&uae~false"
[5] ".bluenile.com\tTRUE\t/\tFALSE\t1475342275\tdsearch\tver~6&newUser~true"
[6] ".bluenile.com\tTRUE\t/\tFALSE\t1443806275\tlocale\tver~1&country~IRL&currency~EUR&language~en-gb&productSet~BNUK"
[7] ".bluenile.com\tTRUE\t/\tFALSE\t0\tbnses\tver~1&ace~false&isbml~false&fbcs~false&ss~0&mbpop~false&sswpu~false&deo~false"
[8] ".bluenile.com\tTRUE\t/\tFALSE\t1727630275\tbnper\tver~5&NIB~0&DM~-&GUID~DA5C11F5_E468_46B5_B4E8_D551D4D6EA4D&SESS-CT~1&STC~32RPVK&FB_MINI~false&SUB~false"
[9] "#HttpOnly_www.bluenile.com\tFALSE\t/\tFALSE\t0\tJSESSIONID\tB8475C3AEC08205E5AC6252C94E4B858"
[10] ".bluenile.com\tTRUE\t/\tFALSE\t1727630278\tmigrationstatus\tver~1&redirected~false"

How to block a specific user agent in nginx config

How do I block a user agent using nginx?
So far I have something like this:
if ($http_user_agent = "Mozilla/5.0 (Linux; Android 4.2.2; SGH-M919 Build/JDQ39) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.169 Mobile Safari/537.22") {
    return 403;
}
This is from a similar thread here on Stack Overflow.
I run nginx as a reverse proxy for a CherryPy server. I intend to filter a certain user agent using nginx alone, but the above code doesn't work on my server.
Is that the correct way to do this?
The snippet wasn't included in any block in the nginx config. Should I add it to the "http" block or the "server" block?
In order to block the specific user agent, I included this code in the "server" block:
if ($http_user_agent = "Mozilla/5.0 (Linux; Android 4.2.2; SGH-M919 Build/JDQ39) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.169 Mobile Safari/537.22") {
    return 403;
}
and it worked as expected.
Ifs are evil: use the map directive instead. Directive if has problems when used in location context; in some cases it doesn't do what you expect but something completely different instead. In some cases it even segfaults. It's generally a good idea to avoid it if possible.
Nginx Ultimate Bad Bot Blocker makes blocking bots easy with support for Debian / Centos / Alpine Linux / FreeBSD.

curl command gives different output for a URL

I am using curl to open a URL. It worked for a few URLs, but for a few it's giving me an error report. When I open the same URL in a browser it works fine. The output of the browser and the curl command should be the same, but it's not. What could be the reason?
$ curl 'http://server:port/ABC_Service/app'
<html><head><title>VMware vFabric tc Runtime 2.6.4.RELEASE/6.0.35.A.RELEASE - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 401 - </h1><HR size="1" noshade="noshade"><p><b>type</b> Status report</p><p><b>message</b> <u></u></p><p><b>description</b> <u>This request requires HTTP authentication ().</u></p><HR size="1" noshade="noshade"><h3>VMware vFabric tc Runtime 2.6.4.RELEASE/6.0.35.A.RELEASE</h3></body></html>
Expected Output:
$ curl 'http://server:port/ABC_Service/app'
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
Output in Browser (1st 2 lines):
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
- <appMetadata>
The browser is just helpfully giving you an interactive view on the XML. Use View Source to see the actual response.
A lot of websites try to detect whether a browser supports XML/XSLT. If the user agent is something they know supports it, they send what you see in the browser. If not, they send plain HTML (in your case, an error page in HTML).
You should try setting your user agent:
curl -A "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5" http://server:port/ABC_Service/app
You can find a list of user agent strings from different devices/programs here.
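The same header override in Python requests, for comparison with the main question at the top (a sketch; "server:port" is the asker's placeholder and would need the real host):

import requests

# "server:port" is the placeholder host from the question; substitute the real service URL
url = "http://server:port/ABC_Service/app"
headers = {
    "User-Agent": ("Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) "
                   "AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 "
                   "Safari/6533.18.5"),
}
response = requests.get(url, headers=headers)
print(response.text)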

Resources