I have an issue with changing the user agent.
I am trying to use the following line in my runner.js file, in the browsers array:
chrome:headless:userAgent=Mozilla/5.0\ \(Linux\;\ Android\ 5.0\;\ SM-G900P\ Build/LRX21T\)\ AppleWebKit/537.36\ \(KHTML,\ like\ Gecko\)\ Chrome/57.0.2987.133\ Mobile\ Safari/537.36
However, the best I can get is a truncated user agent: Mozilla/5.0 (Linux.
The guide doesn't say anything explicit about user agents and how to escape them.
Could someone help me with using a custom user agent for headless Chrome? I can't seem to get past the escaping problem. Thanks.
I actually found the answer: you need to escape every ; character with \\.
E.g:
chrome:headless:userAgent=Mozilla/5.0 (X11\\; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36
will work.
When using this in a CLI command you need to escape twice (I didn't have any success with that).
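For reference, here is what the escaping boils down to in a programmatic runner.js. This is only a sketch: the user agent value is illustrative, and the final runner.browsers() call is how TestCafe's programmatic API would consume the string.

```javascript
// Build the browser alias string with each ';' in the user agent
// value preceded by a literal backslash.
const ua = 'Mozilla/5.0 (X11\\; Linux x86_64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36';
const browser = 'chrome:headless:userAgent=' + ua;
console.log(browser);
// The string would then be passed as e.g. runner.browsers([browser]).
```
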
I've written a simple Python script to upload an image to a WP site:
import requests
import base64

BASE_URL = "https://example.com/wp-json/wp/v2"

media = {
    "file": open("image.png", "rb"),
    "caption": "a media file",
    "description": "some media file"
}

creds = "wp_admin_user" + ":" + "app password"
token = base64.b64encode(creds.encode())

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Authorization": "Basic " + token.decode("utf-8")
}

r = requests.post(BASE_URL + "/media", headers=header, files=media)
print(r)
When using Python 3.9 on Windows, everything works as expected: I get a <Response [201]> reply and I can see the image in my site's media library.
When running the exact same script on a Linux machine, it fails with a 503 reply from the WP server:
<Response [503]>
The Linux machine is running Python 3.9.1.
I can run the script on Windows ten times in a row and it always works. I've searched the internet for the error; it's usually attributed to a WP configuration error, which doesn't seem to be the case here since the script works from Windows.
Any help is much appreciated!
I think the problem is the IP address of the Linux machine: the WP server is probably blocking or throttling requests coming from it.
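If it is an IP-level block, the 503 usually comes from the hosting layer (a WAF or CDN) rather than WordPress itself. One way to narrow that down is to inspect the Server header and body of the failing response instead of just the status code. A rough sketch; the marker strings and the helper itself are my own guesses, not from the thread:

```python
def classify_503(server_header, body):
    """Rough heuristic to narrow down where a 503 comes from."""
    body = body.lower()
    if "cloudflare" in server_header.lower() or "captcha" in body:
        return "blocked at the edge (WAF/bot protection)"
    if "briefly unavailable" in body:
        return "WordPress maintenance mode"
    return "generic 503 from the host"
```

On the failing Linux machine you would call classify_503(r.headers.get("Server", ""), r.text) and compare the result with what the Windows machine gets for the same request.
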
I have seen that there are many approaches to scraping data from Google Scholar out there. I tried to put together my own code, but I cannot get access to Google Scholar using free proxies. I am interested in understanding why that is (and, secondarily, what to change). Below is my code. I know it is not the most elegant; it's my first try at data scraping...
This is a list of proxies I got from "https://free-proxy-list.net/", and I did test that they worked by accessing "http://icanhazip.com" with them.
live_proxies = ['193.122.71.184:3128', '185.76.10.133:8081', '169.57.1.85:8123', '165.154.235.76:80', '165.154.235.156:80']
Then I built the URLs I want to scrape and tried to get the content of the pages with one random proxy:
import random
import requests

search_terms = ['Acanthizidae', 'mammalia']

for term in search_terms:
    url = 'https://scholar.google.de/scholar?hl=en&as_sdt=0%2C5&q={}&btnG='.format(term)
    session = requests.Session()
    proxy = random.choice(live_proxies)
    session.proxies = {"http": proxy, "https": proxy}
    ua = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
    output = session.get(url, headers=ua, timeout=1.5).text
However, I get:
requests.exceptions.ProxyError: HTTPSConnectionPool(host='scholar.google.de', port=443): Max retries exceeded with url: /scholar?hl=en&as_sdt=0%2C5&q=Acanthizidae&btnG= (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 400 Bad Request')))
Like I said, I did test the proxies before with a different site. What is the problem?
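One possible explanation (my guess, not confirmed in the thread): a free proxy that serves plain HTTP requests may still be unable to handle the CONNECT tunnelling that HTTPS requires, and icanhazip was tested over HTTP while Scholar is HTTPS only; that would match the "Tunnel connection failed: 400 Bad Request" error. A sketch that pre-filters the list by testing each proxy against an HTTPS URL:

```python
import requests

def https_capable(proxy, timeout=5):
    """Return True if the proxy can tunnel an HTTPS request."""
    try:
        r = requests.get("https://icanhazip.com",
                         proxies={"https": proxy},
                         timeout=timeout)
        return r.ok
    except requests.RequestException:
        return False
```

Filtering the list as [p for p in live_proxies if https_capable(p)] before the scraping loop should at least remove the proxies that cannot tunnel at all.
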
Problem
I can open the following link on my browser just fine:
https://www.hapag-lloyd.com/en/online-business/track/track-by-container-solution.html?container=HLBU9517848
However, if I use Python requests I get a <Response [403]>.
Upon further investigation, I noticed that the response text of this non-authorized request contains something like this:
<p class="display-later" style="display:none;">Please complete
following CAPTCHA to get access to the Hapag-Lloyd's website.</p>
<p><form id="challenge-form" class="challenge-form managed-form" action="/en/online-business/track/track-by-container-solution.html?container=HLBU9517848&__cf_chl_f_tk=ubmYYNVSxf5CVa3wq85K19Cb6kHn.6V14MgS52EOPpU-1658760627-0-gaNycGzNB70" method="POST" enctype="application/x-www-form-urlencoded">
<div id='cf-please-wait'>
<div id='spinner'>
<div id="cf-bubbles">
<div class="bubbles"></div>
<div class="bubbles"></div>
<div class="bubbles"></div>
</div>
</div>
<p data-translate="please_wait" id="cf-spinner-please-wait">Please stand by, while we are checking your browser...</p>
<p data-translate="redirecting" id="cf-spinner-redirecting" style="display:none">Redirecting...</p>
</div>
This website is using a security service to protect itself from online attacks. The action I just performed triggered the security solution. Completing the CAPTCHA proves I'm a human and gives me access to Hapag-Lloyd's website.
Minimal Reproducible Example
import requests

url = "https://www.hapag-lloyd.com/en/online-business/track/track-by-container-solution.html?container=HLBU9517848"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
container_data = requests.get(url, proxies={"http": "http://95.66.151.101:8080"}, headers=headers)
print(container_data)
Solution
What are the possible fixes?
This answer only mentions changing the User-Agent and HTTP headers to deal with 403 responses, which doesn't work in this case.
I believe I will only be on the right path if I follow something like this.
Any advice or links regarding this issue are welcome.
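Whichever route you take, it helps to detect the challenge page explicitly instead of treating every 403 the same, so retries or proxy rotation can react to it. A hypothetical helper, based only on the markers visible in the response body quoted above:

```python
def is_cloudflare_challenge(status_code, html):
    """Heuristic: does this response look like a Cloudflare challenge page?"""
    return status_code == 403 and (
        "challenge-form" in html or "cf-please-wait" in html
    )
```

Usage: is_cloudflare_challenge(container_data.status_code, container_data.text).
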
I have a site that has some plugins, and one of them (Facebook for WooCommerce) loads until it returns a timeout error (504). I can change some constants in wp-config.php, but none of them helps when I need to debug a timeout.
I tried removing every configuration and file I could find from this plugin and then reinstalling it, but the error is still there.
I tried deactivating every plugin other than WooCommerce, but the error is still there.
I looked for some debug plugins, but I only found plugins that change wp-config.php constants and write logs to files. That's no use; I can already do that myself.
I tried putting some "die" calls with messages in the plugin's code, but nothing changed.
Server log just shows this:
x.x.x.x - - [09/Nov/2020:17:52:56 -0300] "xxxxx.com" "GET /wp-admin/admin.php?page=wc-facebook HTTP/1.1" 504 160 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0" "-"
I don't know what else I can do to debug this timeout; I've tried everything I know about WordPress.
I solved it by asking on the plugin's forum: https://wordpress.org/support/topic/plugin-page-giving-timeout-504/#post-13687667
I just needed to activate the WP_DEBUG and WP_DEBUG_LOG flags. That revealed the line that was breaking the site, and then I could properly debug and find the problem.
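For reference, enabling those flags in wp-config.php looks like this (WP_DEBUG and WP_DEBUG_LOG are standard WordPress constants; WP_DEBUG_DISPLAY is optional but keeps the errors out of the rendered page):

```php
// In wp-config.php, above the "That's all, stop editing!" line.
define( 'WP_DEBUG', true );          // enable debug mode
define( 'WP_DEBUG_LOG', true );      // log errors to wp-content/debug.log
define( 'WP_DEBUG_DISPLAY', false ); // don't print errors into the page
```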
How do I block a user agent using nginx?
So far I have something like this:
if ($http_user_agent = "Mozilla/5.0 (Linux; Android 4.2.2; SGH-M919 Build/JDQ39) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.169 Mobile Safari/537.22") {
    return 403;
}
This is from a similar thread here on Stack Overflow.
I run nginx as a reverse proxy in front of a CherryPy server. I intend to filter a certain user agent using nginx alone, but the above code doesn't work on my server.
Is that the correct way to do this?
It wasn't included in any block of the nginx config. Should I add it to the "http" block or the "server" block?
In order to block the specific user agent, I included this code in the "server" block:
if ($http_user_agent = "Mozilla/5.0 (Linux; Android 4.2.2; SGH-M919 Build/JDQ39) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.169 Mobile Safari/537.22") {
    return 403;
}
and it worked as expected.
Ifs are evil - use the map directive.
Directive if has problems when used in location context, in some cases
it doesn’t do what you expect but something completely different
instead. In some cases it even segfaults. It’s generally a good idea
to avoid it if possible.
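A map-based version of the user-agent block might look like this (a sketch: the variable name is arbitrary, the map goes in the "http" block, and the check goes in the "server" block):

```nginx
# In the "http" block:
map $http_user_agent $blocked_agent {
    default 0;
    "Mozilla/5.0 (Linux; Android 4.2.2; SGH-M919 Build/JDQ39) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.169 Mobile Safari/537.22" 1;
}

# In the "server" block:
if ($blocked_agent) {
    return 403;
}
```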
Nginx Ultimate Bad Bot Blocker makes blocking bots easy with support for Debian / Centos / Alpine Linux / FreeBSD.