When I run this command:
wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" http://yahoo.com
...I get this result (with nothing else in the file):
<!-- hw147.fp.gq1.yahoo.com uncompressed/chunked Wed Jun 19 03:42:44 UTC 2013 -->
But when I run wget http://yahoo.com with no --user-agent option, I get the full page.
The user agent is the same header that my current browser sends. Why does this happen? Is there a way to make sure the user agent doesn't get blocked when using wget?
It seems Yahoo's server applies some heuristic based on the User-Agent when the Accept header is set to */*.
Accept: text/html
did the trick for me.
e.g.
wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" http://yahoo.com
Note: if you don't declare an Accept header, wget automatically adds Accept: */*, which means "give me anything you have".
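The same idea can be sketched in Python with the standard library. This is only an illustration of sending an explicit Accept header together with a browser-like User-Agent; the helper name build_request is made up for this example, and the request is built but never sent.

```python
import urllib.request


def build_request(url, user_agent):
    """Build (but do not send) a GET request with an explicit Accept header.

    build_request is a hypothetical helper name used only for this sketch.
    """
    return urllib.request.Request(
        url,
        headers={
            # Ask for HTML explicitly instead of the catch-all */*.
            "Accept": "text/html",
            "User-Agent": user_agent,
        },
    )


req = build_request(
    "http://yahoo.com",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0",
)
# Request normalizes header names with str.capitalize(), e.g. "User-agent".
print(req.header_items())
```

Passing req to urllib.request.urlopen would actually perform the request; whether a given server then responds differently is something you would have to check yourself.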
I created a ~/.wgetrc file with the following content (taken from askapache.com, but with a newer user agent, because otherwise it didn't always work):
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0
referer = /
robots = off
Now I’m able to download from most (all?) file-sharing (streaming video) sites.
You need to set both the user-agent and the referer:
wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" --referer connect.wso2.com http://dist.wso2.org/products/carbon/4.2.0/wso2carbon-4.2.0.zip
This request used to work but now gets a 403. I tried adding a user agent like in this answer but still no good: https://stackoverflow.com/a/38489588/2415706
This second answer further down says to find the referer header but I can't figure out where these response headers are: https://stackoverflow.com/a/56946001/2415706
import requests
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    "referer": "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State"
}
job_url = "https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State"
job_response = requests.get(job_url, headers=headers, timeout=10)
print(job_response)
This is what I see under Request Headers for the first tab after refreshing the page but there's too much stuff. I assume I only need one of these lines.
:authority: www.ziprecruiter.com
:method: GET
:path: /Salaries/What-Is-the-Average-Programmer-Salary-by-State
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
cache-control: max-age=0
cookie: __cfduid=dea4372c39465cfa2422e97f84dea45fb1620355067; zva=100000000%3Bvid%3AYJSn-w3tCu9yJwJx; ziprecruiter_browser=99.31.211.77_1620355067_495865399; SAFESAVE_TOKEN=1a7e5e90-60de-494d-9af5-6efdab7ade45; zglobalid=b96f3b99-1bed-4b7c-a36f-37f2d16c99f4.62fd155f2bee.6094a7fb; ziprecruiter_session=66052203cea2bf6afa7e45cae7d1b0fe; experian_campaign_visited=1
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"
sec-ch-ua-mobile: ?0
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36
EDIT: Looking at the other tabs, they have a referer header ("referer": "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State"), so I tried adding that, but I still get a 403.
Using the httpx package, it seems to work:
import httpx
url = 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State'
r = httpx.get(url)
print(r.text)
print(r.status_code)
print(r.http_version)
repl.it: https://replit.com/#bertrandmartel/ZipRecruiter
I may be wrong, but I think the server didn't like the TLS negotiation of the requests library. It's weird, since the above call uses HTTP/1.1, whereas with curl it only works with HTTP/2 and TLS 1.3.
Using a curl binary built with HTTP/2 support and an OpenSSL that supports TLS 1.3, the following works:
docker run --rm curlimages/curl:7.76.1 \
--http2 --tlsv1.3 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State' \
-H 'user-agent: Mozilla' \
-s -o /dev/null -w "%{http_code}"
returns:
301
The following commands fail:
Forcing HTTP/1.1 and enforcing TLS 1.3:
docker run --rm curlimages/curl:7.76.1 \
--http1.1 --tlsv1.3 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State' \
-H 'user-agent: Mozilla' \
-s -o /dev/null -w "%{http_code}"
Output: 403
Forcing HTTP/2 and enforcing TLS 1.2:
docker run --rm curlimages/curl:7.76.1 \
--http2 --tlsv1.2 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State' \
-H 'user-agent: Mozilla' \
-s -o /dev/null -w "%{http_code}"
Output: 403
My guess is that it detects something in the TLS negotiation, but that the check is different when both TLS 1.3 and HTTP/2 are in use.
Unfortunately, you can't test HTTP/2 with requests/urllib, since they don't support it.
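Although the standard library cannot speak HTTP/2, the TLS side of the experiment can at least be approximated in Python. The following is a minimal sketch, assuming a Python build linked against an OpenSSL with TLS 1.3 support; make_tls13_context is a name invented for this example, and nothing here confirms what the server actually checks.

```python
import ssl


def make_tls13_context():
    """Return a client context that refuses anything older than TLS 1.3.

    Roughly analogous to curl's --tlsv1.3 option; it does not add HTTP/2.
    """
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx


ctx = make_tls13_context()
# The context can be passed on, e.g. urllib.request.urlopen(url, context=ctx).
print(ctx.minimum_version)
```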
I am trying to use JSoup to parse content from URLs like https://www.tesco.com/groceries/en-GB/products/300595003
Jsoup.connect(url).get() simply times out; however, I can access the website fine in a web browser.
Through trial and error, the simplest working curl command I found was:
curl 'https://www.tesco.com/groceries/en-GB/products/300595003' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0' \
-H 'Accept-Language: en-GB,en;q=0.5' --compressed
I am able to translate the User-Agent and Accept-Language into JSoup, however I still get timeouts. Is there an equivalent to the --compressed flag for Jsoup, because the curl command will not work without it?
To find out what the --compressed option does, try running curl with the --verbose parameter. It will display the full request headers.
Without --compressed:
> GET /groceries/en-GB/products/300595003 HTTP/2
> Host: www.tesco.com
> Accept: */*
> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0
> Accept-Language: en-GB,en;q=0.5
With --compressed:
> GET /groceries/en-GB/products/300595003 HTTP/2
> Host: www.tesco.com
> Accept: */*
> Accept-Encoding: deflate, gzip
> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0
> Accept-Language: en-GB,en;q=0.5
The difference is the new Accept-Encoding header, so adding .header("Accept-Encoding", "deflate, gzip") should solve your problem.
By the way, for me both jsoup and curl can download the page source without this header and without --compressed, and I'm not getting timeouts, so there's a chance the server is throttling your requests because you have made too many.
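To make the effect of that header concrete: a client that advertises Accept-Encoding must also be ready to decompress the response according to its Content-Encoding. Here is a small Python sketch of that second half, using locally compressed bytes instead of a real response. decode_body is a hypothetical helper, and "deflate" is assumed to be zlib-wrapped data.

```python
import gzip
import zlib

# The value curl --compressed sent in the verbose output above.
ACCEPT_ENCODING = "deflate, gzip"


def decode_body(body, content_encoding):
    """Undo the Content-Encoding of a response body (hypothetical helper)."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # Assumes zlib-wrapped deflate data.
        return zlib.decompress(body)
    # No (or unknown) encoding: return the bytes unchanged.
    return body


# Stand-in for a compressed server response.
compressed = gzip.compress(b"<html>example</html>")
print(decode_body(compressed, "gzip"))
```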
EDIT:
It works for me using your original command with --http1.1, so there has to be a way to make it work for you as well. I'd start by using the Chrome developer tools to look at the headers your browser sends, and try to pass all of them using .header(...). You can also copy the request as a curl command from the developer tools to see all headers and simulate exactly what Chrome is sending.
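If you copy the raw header list from the developer tools instead, a few lines of code can turn it into name/value pairs that you then pass along one by one. The sketch below is in Python and assumes one "name: value" pair per line; parse_header_dump is an invented name. Lines starting with a colon (such as :authority) are HTTP/2 pseudo-headers and are skipped, because they are not ordinary headers.

```python
def parse_header_dump(text):
    """Turn a copied 'name: value' header dump into a dict (sketch only)."""
    headers = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
        if not line or line.startswith(":"):
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line
        headers[name.strip()] = value.strip()
    return headers


dump = """
:authority: www.ziprecruiter.com
:method: GET
accept-language: en-US,en;q=0.9
cache-control: max-age=0
upgrade-insecure-requests: 1
"""
print(parse_header_dump(dump))
```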
I am using a web service API.
http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=1&type=n
Typing that address into Chrome returns the expected result (a JSON file containing song information), but using curl fails. (In both cases the response code is 200 OK, but in the latter case the response body is not correct.)
Here is the request info dumped from the Chrome developer tools:
Request URL:http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n
Request Method:GET
Status Code:200 OK
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:bid="lwaJyClu5Zg"
Host:www.douban.com
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Query String Parameters
app_name:radio_desktop_win
version:100
user_id:
expire:
token:
sid:
h:
channel:7
type:n
However, calling that API with curl, i.e. curl http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n, does not return the expected result.
Even specifying exactly the same headers as those dumped from Chrome still fails.
curl -v -H "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" -H "Accept-Encoding:gzip,deflat,sdcn" -H "Accept-Language:zh-CN,zh;q=0.8" -H "Cache-Control:max-age=0" -H "Connection:keep-alive" -H "Host:www.douban.com" -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36" http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n
Below is what curl prints with -v. Everything seems identical to the request made by Chrome, but the response body is still not correct.
GET /j/app/radio/people?app_name=radio_desktop_win HTTP/1.1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflat,sdcn
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Host:www.douban.com
Why does this happen? I'd appreciate your help.
You need to put quotes around that url in the shell. Otherwise the &s are going to cause trouble.
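The curl -v output in the question already shows the symptom: the request line stops at app_name=radio_desktop_win. An unquoted & ends the command in the shell, so curl only receives the URL text up to the first &. The Python sketch below merely simulates that cut with str.partition to show which query parameters survive; it does not invoke a shell, and query_params is a name made up for this example.

```python
from urllib.parse import parse_qs, urlsplit

FULL_URL = (
    "http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win"
    "&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n"
)


def query_params(url):
    """Return the query parameters of a URL, keeping empty values."""
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


# Simulate what curl receives when the URL is not quoted:
# everything from the first "&" onward is consumed by the shell.
unquoted_url = FULL_URL.partition("&")[0]

print(sorted(query_params(FULL_URL)))      # all parameters
print(sorted(query_params(unquoted_url)))  # only what curl actually saw
```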
Another common problem: you may be using an HTTP proxy with Chrome. If so, you need to tell curl about this proxy as well. You can do so by setting the environment variable http_proxy.
I want to get the user's locale when they land on my website and then keep it for that user (and also keep the new one if they decide to change the language).
Yet I don't want the locale to appear in the URL.
I implemented the LocaleListener from the Symfony2 docs, but I am unable to get the user's default locale on the first request.
These calls return nothing:
$locale = $this->getRequest()->attributes->get('_locale');
$locale = $this->getRequest()->get('_locale');
While
$this->getRequest()
actually contains:
GET /Twinkler1.2.3/web/app_dev.php/ HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip,deflate,sdch
Accept-Language: fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4
Cache-Control: max-age=0
Connection: keep-alive
Cookie: __uvt=; PHPSESSID=f28e3958ecab05fe97d6fc6950eb72ec; SQLiteManager_currentLangue=2
Host: localhost:8888
Referer: http://localhost:8888/Twinkler1.2.3/web/app_dev.php/login
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36
X-Php-Ob-Level: 0
So how can I get the locale of the request (French here)?
Thanks in advance
Jules
$language = $request->getPreferredLanguage();
$request->setLocale($language);
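For intuition, the preferred language comes from the Accept-Language header, where each entry may carry a q-value (weight) and the highest weight wins. The following Python sketch shows that ordering on the header from the question. It is an illustration of the general idea only, not Symfony's implementation, and parse_accept_language is a hypothetical name.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # entries without an explicit q-value default to 1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as lowest priority
        # Sort by descending quality, then by original position.
        entries.append((-quality, position, tag.strip()))
    entries.sort()
    return [tag for _, _, tag in entries]


languages = parse_accept_language("fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4")
print(languages[0])
```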
I'm debugging a program I wrote and noticed something strange. I set up an HTTP server on port 12345 that serves a simple OGG video file, and attempted to access it from Firefox.
Upon sniffing the network requests, I found these two requests were made:
GET /video.ogv HTTP/1.1
Host: 127.0.0.1:12345
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
GET /video.ogv HTTP/1.1
Host: 127.0.0.1:12345
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Range: bytes=8122368-
The video is almost 8 MB in size, so the fact that the second request specifies byte 8122368, which is 7932 KB, suggests it is requesting the very end of the file for some reason. Anyone have ideas?
In order to support seeking and playing back regions of the media that aren't yet downloaded, Gecko uses HTTP 1.1 byte-range requests to retrieve the media from the seek target position. Because Ogg files don't contain their duration, the initial download connection is terminated. Gecko then seeks to the end of the Ogg file and reads a bit of data to extract the duration of the media.
Some media formats have metadata at the end of the file, and this data is usually required to allow proper seeking of the video.
Note that Range: bytes=8122368- asks for everything from byte offset 8122368 (about 7.75 MB into the file) through the end of the file, not for 8122368 bytes counted backwards from the end.
It might be something in how the buffering for that file type is done.
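To make the Range arithmetic concrete, here is a small Python sketch that resolves a single bytes range against a file size. The size of 8,200,000 bytes is an assumption chosen for illustration (the question only says "almost 8 MB"), and resolve_range is an invented helper that handles only the simple single-range forms.

```python
def resolve_range(header_value, size):
    """Resolve a single 'bytes=...' range to inclusive (start, end) offsets."""
    unit, _, spec = header_value.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        raise ValueError("only single 'bytes' ranges are handled in this sketch")
    first, sep, last = spec.strip().partition("-")
    if not sep:
        raise ValueError("malformed range")
    if first == "":
        # Suffix form "bytes=-N": the last N bytes of the file.
        start = max(size - int(last), 0)
        end = size - 1
    else:
        # "bytes=N-" (to end of file) or "bytes=N-M".
        start = int(first)
        end = min(int(last), size - 1) if last else size - 1
    if start > end:
        raise ValueError("range not satisfiable")
    return start, end


ASSUMED_SIZE = 8_200_000  # hypothetical file size, not from the question

print(resolve_range("bytes=8122368-", ASSUMED_SIZE))  # offset 8122368 to EOF
print(resolve_range("bytes=-500", ASSUMED_SIZE))      # last 500 bytes
print(8122368 // 1024)                                # offset in KB
```

With the assumed size, the request from the question covers only the final stretch of the file, which is consistent with reading trailing data to determine the duration.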