Python requests 403 Forbidden referer from network headers - web-scraping

This request used to work but now gets a 403. I tried adding a user agent like in this answer but still no good: https://stackoverflow.com/a/38489588/2415706
This second answer further down says to find the referer header but I can't figure out where these response headers are: https://stackoverflow.com/a/56946001/2415706
import requests

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    "referer": "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State",
}

job_url = "https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State"
job_response = requests.get(job_url, headers=headers, timeout=10)
print(job_response)
This is what I see under Request Headers for the first entry after refreshing the page, but there's too much stuff; I assume I only need one of these lines.
:authority: www.ziprecruiter.com
:method: GET
:path: /Salaries/What-Is-the-Average-Programmer-Salary-by-State
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
cache-control: max-age=0
cookie: __cfduid=dea4372c39465cfa2422e97f84dea45fb1620355067; zva=100000000%3Bvid%3AYJSn-w3tCu9yJwJx; ziprecruiter_browser=99.31.211.77_1620355067_495865399; SAFESAVE_TOKEN=1a7e5e90-60de-494d-9af5-6efdab7ade45; zglobalid=b96f3b99-1bed-4b7c-a36f-37f2d16c99f4.62fd155f2bee.6094a7fb; ziprecruiter_session=66052203cea2bf6afa7e45cae7d1b0fe; experian_campaign_visited=1
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"
sec-ch-ua-mobile: ?0
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36
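Most of that list can be carried over wholesale. A small hypothetical helper (the parsing and the sample header block below are illustrative): the lines starting with ":" are HTTP/2 pseudo-headers that belong to the protocol framing, so requests cannot send them; everything else can go into the headers dict as-is.

```python
# Hypothetical helper: paste the header block copied from DevTools and
# turn it into a dict that requests can send. The ":authority"-style
# pseudo-headers are HTTP/2 framing fields, not real headers, so they
# are skipped.

def devtools_headers_to_dict(raw: str) -> dict:
    headers = {}
    for line in raw.strip().splitlines():
        name, _, value = line.partition(": ")
        if name.startswith(":"):   # :authority, :method, :path, :scheme
            continue
        headers[name] = value
    return headers

raw = """\
:authority: www.ziprecruiter.com
:method: GET
accept-language: en-US,en;q=0.9
user-agent: Mozilla/5.0"""

print(devtools_headers_to_dict(raw))
# {'accept-language': 'en-US,en;q=0.9', 'user-agent': 'Mozilla/5.0'}
```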
EDIT: Looking at the other tabs, they include "referer": "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State", so I'm trying that now, but it still returns 403.

Using the httpx package, it seems to work with:
import httpx
url = 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State'
r = httpx.get(url)
print(r.text)
print(r.status_code)
print(r.http_version)
repl.it: https://replit.com/#bertrandmartel/ZipRecruiter
I may be wrong, but I think the server didn't like the TLS negotiation from the requests library. It's odd, since the httpx call above uses HTTP/1.1, while with curl it only works with HTTP/2 and TLS 1.3.
Using a curl binary built with http2 and with openssl supporting TLS1.3, the following works:
docker run --rm curlimages/curl:7.76.1 \
--http2 --tlsv1.3 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State' \
-H 'user-agent: Mozilla' \
-s -o /dev/null -w "%{http_code}"
returns:
301
The following commands are failing:
forcing http1.1 and enforcing TLS 1.3
docker run --rm curlimages/curl:7.76.1 \
--http1.1 --tlsv1.3 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State' \
-H 'user-agent: Mozilla' \
-s -o /dev/null -w "%{http_code}"
Output: 403
forcing http2 and enforcing TLS 1.2:
docker run --rm curlimages/curl:7.76.1 \
--http2 --tlsv1.2 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State' \
-H 'user-agent: Mozilla' \
-s -o /dev/null -w "%{http_code}"
Output: 403
My guess is that it detects something in the TLS negotiation, but the check is different when both TLS 1.3 and HTTP/2 are present.
Unfortunately, you can't test HTTP/2 with requests/urllib since neither supports it.
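One thing you can control from requests is the TLS floor. This is a sketch, not a confirmed fix: it mounts a custom transport adapter that refuses anything below TLS 1.3, but requests will still speak HTTP/1.1, so a server that fingerprints both may keep answering 403.

```python
import ssl

import requests
from requests.adapters import HTTPAdapter

class TLS13Adapter(HTTPAdapter):
    """Adapter that makes urllib3 negotiate TLS 1.3 or fail."""
    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        ctx.minimum_version = ssl.TLSVersion.TLSv1_3
        kwargs["ssl_context"] = ctx
        return super().init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount("https://", TLS13Adapter())
# Network call, uncomment to try:
# print(session.get("https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State", timeout=10))
```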

Related

Jsoup times out and cURL only works with '--compressed' header - how do I emulate this header in Jsoup?

I am trying to use JSoup to parse content from URLs like https://www.tesco.com/groceries/en-GB/products/300595003
Jsoup.connect(url).get() simply times out, however I can access the website fine in the web browser.
Through trial and error, the simplest working curl command I found was:
curl 'https://www.tesco.com/groceries/en-GB/products/300595003' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0' \
-H 'Accept-Language: en-GB,en;q=0.5' --compressed
I am able to translate the User-Agent and Accept-Language into JSoup, however I still get timeouts. Is there an equivalent to the --compressed flag for Jsoup, because the curl command will not work without it?
To find out what --compressed option does try using curl with --verbose parameter. It will display full request headers.
Without --compressed:
> GET /groceries/en-GB/products/300595003 HTTP/2
> Host: www.tesco.com
> Accept: */*
> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101
Firefox/76.0
> Accept-Language: en-GB,en;q=0.5
With --compressed:
> GET /groceries/en-GB/products/300595003 HTTP/2
> Host: www.tesco.com
> Accept: */*
> Accept-Encoding: deflate, gzip
> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101
Firefox/76.0
> Accept-Language: en-GB,en;q=0.5
The difference is new Accept-Encoding header so adding .header("Accept-Encoding", "deflate, gzip") should solve your problem.
By the way, both jsoup and curl are able to download the page source for me without this header and without --compressed, and I'm not getting timeouts, so there's a chance the server is limiting your requests for making too many of them.
EDIT:
It works for me using your original command with --http1.1, so there has to be a way to make it work for you as well. I'd start by using the Chrome developer tools to look at the headers your browser sends, and try passing all of them using .header(...). You can also use "Copy as cURL" to see all the headers and simulate exactly what Chrome is sending.
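For reference, what --compressed changes can be sketched in Python (the body bytes below are a stand-in for a real server response): the flag adds an Accept-Encoding request header, and the client must then decompress the encoded body.

```python
import gzip

# What curl --compressed adds to the request:
headers = {"Accept-Encoding": "deflate, gzip"}

# Stand-in for a gzip-encoded response body from the server.
body_from_server = gzip.compress(b"<html>page source</html>")

# ...and what the client must then do with that body.
page = gzip.decompress(body_from_server)
print(page)  # b'<html>page source</html>'
```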

ffmpeg - How to pass http headers?

I need to pass http headers (user agent and ip) to an ffmpeg command.
I use the following command:
ffmpeg -y -timeout 5000000 -map 0:0 -an -sn -f md5 - -headers "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36" -headers "X-Forwarded-For: 13.14.15.66" -i "http://127.0.0.1"
And I run a local node.js server to see the headers I get:
'use strict';
var express = require('express');
var server = express();
server.all('/*', function(req, res) {
console.log(JSON.stringify(req.headers));
res.sendFile('SampleVideo_1080x720_1mb.mp4', {root: '.'});
});
server.listen(80);
I keep getting an error saying "No trailing CRLF found in HTTP header." and the request is stuck.
If I drop the headers - everything works normally.
I also tried putting both headers in one string, but no line-breaking character I used (\r\n, \n, etc.) worked.
Can someone help me figure out how to write this command correctly with the headers included?
Short Answer
Make sure you're using the latest ffmpeg, and use the -user-agent option.
Longer Answer
For debugging, I set up a BaseHTTPServer running at 127.0.0.1:8080 with do_GET() as:
from os import curdir, sep

def do_GET(self):
    try:
        f = open(curdir + sep + self.path, 'rb')
        self.send_response(200)
        self.end_headers()
        print("GET: " + str(self.headers))
        self.wfile.write(f.read())
        f.close()
        return
    except IOError:
        self.send_error(404, 'File Not Found: %s' % self.path)
With that running, this enabled me to run your command like:
ffmpeg \
-y \
-timeout 5000000 \
-map 0:0 \
-an \
-sn \
-f md5 - \
-headers "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36" \
-headers "X-Forwarded-For: 13.14.15.66" \
-i "http://127.0.0.1:8080/some_video_file.mp4" \
-v trace
When I do this, I see the following relevant output from ffmpeg:
Reading option '-headers' ... matched as AVOption 'headers' with argument 'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'.
Reading option '-headers' ... matched as AVOption 'headers' with argument 'X-Forwarded-For: 13.14.15.66'.
On the server, I saw:
User-Agent: Lavf/56.40.101
X-Forwarded-For: 13.14.15.66
So it looks like ffmpeg is setting its own User-Agent. But there is a -user-agent option to ffmpeg, and when I replaced -headers "User-Agent: <foo>" with -user-agent "<foo>", I then saw it on the server too, alongside the X-Forwarded-For header:
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36
Last note. There are lots of discussions around headers bugs in trac for ffmpeg. What I have observed above (that essentially it is working, perhaps with a small command change) was with a fairly recent version:
ffmpeg version 2.8.1 Copyright (c) 2000-2015 the FFmpeg developers
built with gcc 4.8 (Ubuntu 4.8.4-2ubuntu1~14.04)
configuration: --enable-libx264 --enable-gpl --prefix=/usr/local --enable-shared --cc='gcc -fPIC'
libavutil 54. 31.100 / 54. 31.100
libavcodec 56. 60.100 / 56. 60.100
libavformat 56. 40.101 / 56. 40.101
libavdevice 56. 4.100 / 56. 4.100
libavfilter 5. 40.101 / 5. 40.101
libswscale 3. 1.101 / 3. 1.101
libswresample 1. 2.101 / 1. 2.101
libpostproc 53. 3.100 / 53. 3.100
So, your next move might be make sure you have the latest version of ffmpeg.
Well, the ffmpeg manual says to separate multiple HTTP headers with CRLF. The problem is that your second -headers argument overwrites the first one, as there can be only one -headers argument.
For your example, you need to join User-Agent and X-Forwarded-For into one argument with a literal CRLF, like this:
-headers "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"$'\r\n'"X-Forwarded-For: 13.14.15.66"$'\r\n'
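If the command is assembled from a script, the same CRLF-joined argument can be built programmatically. A Python sketch (the URL and ffmpeg invocation are illustrative, mirroring the question's setup):

```python
import subprocess

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36",
    "X-Forwarded-For": "13.14.15.66",
}

# ffmpeg wants ONE -headers value, with each header terminated by CRLF.
header_arg = "".join(f"{name}: {value}\r\n" for name, value in headers.items())

cmd = ["ffmpeg", "-headers", header_arg,
       "-i", "http://127.0.0.1:8080/some_video_file.mp4",
       "-f", "md5", "-"]
# subprocess.run(cmd)  # requires ffmpeg and a server on 127.0.0.1:8080
```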
To set the headers x:1 and y:2 on an ffmpeg request, use:
ffmpeg -headers $'x:1\r\ny:2\r\n' -i 'http://example.com' -y 'sample.mp4' -v debug
Result:
[http # 0x358be00] Setting default whitelist 'http,https,tls,rtp,tcp,udp,crypto,httpproxy'
[http # 0x358be00] request: GET / HTTP/1.1
User-Agent: Lavf/57.76.100
Accept: */*
Range: bytes=0-
Connection: close
Host: example.com
Icy-MetaData: 1
x:1
y:2

XenForo: create a new thread with curl

I'm stuck trying to get curl to create a new thread in the XenForo forum software.
Httpfox gives me the following data to create a new thread:
Headers:
POST /forum/add-thread HTTP/1.1
Host forum.tld
User-Agent Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20100101 Firefox/17.0
Accept application/json, text/javascript, */*; q=0.01
Accept-Language en-us,en;q=0.5
Accept-Encoding gzip, deflate
Connection keep-alive
Content-Type application/x-www-form-urlencoded; charset=UTF-8
X-Ajax-Referer http://forum.tld/forum/create-thread
X-Requested-With XMLHttpRequest
Referer http://forum.tld/forum/create-thread
Content-Length 17871
Cookie cookie stuff
Pragma no-cache
Cache-Control no-cache
Raw post Data:
title=title+of+the+thread&message=urlencoded+thread+message&_xfRelativeResolver=http%3A%2F%2Fforum.tld%2Fforum%2Fcreate-thread&watch_thread_state=1&poll%5Bquestion%5D=&poll%5Bresponses%5D%5B%5D=&poll%5Bresponses%5D%5B%5D=&_xfToken=atoken&_xfRequestUri=%2Fforum%2Fcreate-thread&_xfNoRedirect=1&_xfToken=atoken+again&_xfResponseType=json
I extract the token from another page with curl by login in or using cookies.
When I use this curl:
curl -b cookie.txt -L \
  -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" \
  -e "http://forum.tld/forum/create-thread" \
  --data-urlencode "title=$title" \
  --data-urlencode "watch_thread_state=1" \
  --data-urlencode "message=$message" \
  --data-urlencode "_xfRelativeResolver=http://forum.tld/forum/create-thread" \
  --data-urlencode "poll[question]=" \
  --data-urlencode "poll[responses][]=" \
  --data-urlencode "_xfToken=$token" \
  --data-urlencode "_xfRequestUri=/forum/create-thread" \
  --data-urlencode "_xfNoRedirect=1" \
  --data-urlencode "_xfResponseType=json" \
  "http://forum.tld/forum/add-thread"
It outputs:
{"error":{"message":"Please enter a valid message."}}
The $message gets read from a text file encoded in iso-8859-1.
Any ideas? I'm kind of clueless right now.
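One guess worth ruling out (not confirmed by the error message): XenForo expects UTF-8, and the message file is ISO-8859-1. Re-encoding before URL-encoding, sketched here with a stand-in byte string for the file contents:

```python
from urllib.parse import quote_plus

raw_bytes = b"caf\xe9 thread body"        # stand-in for the ISO-8859-1 file
message = raw_bytes.decode("iso-8859-1")  # decode with the file's real charset
post_fragment = "message=" + quote_plus(message)  # quote_plus emits UTF-8 %-escapes
print(post_fragment)  # message=caf%C3%A9+thread+body
```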

Why does curl not work, but wget works?

I am using both curl and wget to get this url: http://opinionator.blogs.nytimes.com/2012/01/19/118675/
For curl, it returns no output at all, but with wget, it returns the entire HTML source:
Here are the two commands. I've used the same user agent, both come from the same IP, and both follow redirects. The URL is exactly the same. curl returns after about one second, so I know it's not a timeout issue.
curl -L -s "http://opinionator.blogs.nytimes.com/2012/01/19/118675/" --max-redirs 10000 --location --connect-timeout 20 -m 20 -A "Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" 2>&1
wget http://opinionator.blogs.nytimes.com/2012/01/19/118675/ --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
If NY Times might be cloaking, and not returning the source to curl, what could be different in the headers curl is sending? I assumed since the user agent is the same, the request should look exactly the same from both of these requests. What other "footprints" should I check?
The way to solve this is to analyze your curl request with curl -v ... and your wget request with wget -d ..., which shows that curl is redirected to a login page:
> GET /2012/01/19/118675/ HTTP/1.1
> User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
> Host: opinionator.blogs.nytimes.com
> Accept: */*
>
< HTTP/1.1 303 See Other
< Date: Wed, 08 Jan 2014 03:23:06 GMT
* Server Apache is not blacklisted
< Server: Apache
< Location: http://www.nytimes.com/glogin?URI=http://opinionator.blogs.nytimes.com/2012/01/19/118675/&OQ=_rQ3D0&OP=1b5c69eQ2FCinbCQ5DzLCaaaCvLgqCPhKP
< Content-Length: 0
< Content-Type: text/plain; charset=UTF-8
followed by a loop of redirections (which you must have noticed, because you have already set the --max-redirs flag).
On the other hand, wget follows the same sequence except that it returns the cookie set by nytimes.com with its subsequent request(s)
---request begin---
GET /2012/01/19/118675/?_r=0 HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: */*
Host: opinionator.blogs.nytimes.com
Connection: Keep-Alive
Cookie: NYT-S=0MhLY3awSMyxXDXrmvxADeHDiNOMaMEZFGdeFz9JchiAIUFL2BEX5FWcV.Ynx4rkFI
The request sent by curl never includes the cookie.
The easiest way I see to modify your curl command and obtain the desired resource is by adding -c cookiefile to your curl command. This stores the cookie in the otherwise unused temporary "cookie jar" file called "cookiefile" thereby enabling curl to send the needed cookie(s) with its subsequent requests.
For example, I added the flag -c x directly after "curl " and I obtained the output just like from wget (except that wget writes it to a file and curl prints it on STDOUT).
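The Python equivalent of curl's cookie jar, for comparison: requests.Session persists cookies between calls the way -c/-b does (the cookie value below is made up):

```python
import requests

session = requests.Session()                   # persistent cookie jar
session.cookies.set("NYT-S", "example-value")  # made-up cookie value
# Subsequent requests through this session automatically resend stored
# cookies, including any set by earlier responses in a redirect chain:
# session.get("http://opinionator.blogs.nytimes.com/2012/01/19/118675/")
print(session.cookies.get("NYT-S"))  # example-value
```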
In my case, it was because the https_proxy environment variable for curl needs the port set in the URL, for example:
Does not work with curl:
https_proxy=http://proxyapp.net.com/
Works with curl:
https_proxy=http://proxyapp.net.com:80/
wget works with or without the port in the URL, but curl needs it; without the port, curl returns error "(56) Proxy CONNECT aborted".
Running the command with "curl -v" shows that curl uses port 1080 as the default when no port is set in the proxy URL.

REST API - works in Chrome but not with curl

I am using a web service API.
http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=1&type=n
Typing that address into Chrome, the expected result (a JSON file containing song information) is returned, but with curl it fails. (In both cases the response code is 200 OK, but in the latter case the response body is not correct.)
Here are the request info dumped using the Chrome developer tool:
Request URL:http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n
Request Method:GET
Status Code:200 OK
Request Headersview source
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:bid="lwaJyClu5Zg"
Host:www.douban.com
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Query String Parametersview sourceview URL encoded
app_name:radio_desktop_win
version:100
user_id:
expire:
token:
sid:
h:
channel:7
type:n
However, using that API with curl, i.e. curl http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n, does not return the expected result.
Even specifying the exactly header as what dumped from Chrome still failed.
curl -v -H "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" -H "Accept-Encoding:gzip,deflat,sdcn" -H "Accept-Language:zh-CN,zh;q=0.8" -H "Cache-Control:max-age=0" -H "Connection:keep-alive" -H "Host:www.douban.com" -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36" http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n
Below is what print out with -v from curl. Seems everything was identical with the request made by Chrome but still the response body is not correct.
GET /j/app/radio/people?app_name=radio_desktop_win HTTP/1.1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflat,sdcn
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Host:www.douban.com
Why does this happen? I'd appreciate your help.
You need to put quotes around that url in the shell. Otherwise the &s are going to cause trouble.
Another common problem: you may be using an HTTP proxy with Chrome. If so, you need to tell curl about this proxy as well. You can do so by setting the http_proxy environment variable.
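For completeness, the shell-quoting trap disappears entirely if the query string is built by the HTTP client instead of pasted into a shell. A requests sketch of the same call:

```python
import requests

params = {
    "app_name": "radio_desktop_win", "version": "100",
    "user_id": "", "expire": "", "token": "", "sid": "", "h": "",
    "channel": "7", "type": "n",
}

# requests assembles and escapes the query string itself, so nothing an
# unquoted "&" could do in the shell can truncate it.
req = requests.Request(
    "GET", "http://www.douban.com/j/app/radio/people", params=params
).prepare()
print(req.url)
```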
