I am using a web service API.
http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=1&type=n
Typing that address into Chrome returns the expected result (a JSON document containing song information), but the same request fails with curl. In both cases the response code is 200 OK, but in the latter case the response body is not correct.
Here is the request info dumped from the Chrome developer tools:
Request URL:http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n
Request Method:GET
Status Code:200 OK
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:bid="lwaJyClu5Zg"
Host:www.douban.com
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Query String Parameters
app_name:radio_desktop_win
version:100
user_id:
expire:
token:
sid:
h:
channel:7
type:n
However, calling that API with curl, i.e. curl http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n, does not return the expected result.
Even specifying exactly the same headers as those dumped from Chrome still fails.
curl -v -H "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" -H "Accept-Encoding:gzip,deflat,sdcn" -H "Accept-Language:zh-CN,zh;q=0.8" -H "Cache-Control:max-age=0" -H "Connection:keep-alive" -H "Host:www.douban.com" -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36" http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n
Below is what curl prints with -v. Everything seems identical to the request made by Chrome, but the response body is still not correct.
GET /j/app/radio/people?app_name=radio_desktop_win HTTP/1.1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflat,sdcn
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Host:www.douban.com
Why does this happen? I appreciate your help.
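You need to put quotes around that URL in the shell. Otherwise the shell treats each & as a command separator that runs the preceding command in the background, so curl only receives the URL up to the first &. The -v output in the question shows exactly this: the request line ends at app_name=radio_desktop_win.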
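To make the effect of the missing quotes concrete, here is a small Python sketch (not part of the original answer) that builds the full query string from the parameters listed in the question and compares it with what curl receives when the shell cuts the command at the first &. The helper name query_keys is made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

BASE = "http://www.douban.com/j/app/radio/people"

# Query parameters as listed in the Chrome dump from the question.
params = {
    "app_name": "radio_desktop_win",
    "version": "100",
    "user_id": "",
    "expire": "",
    "token": "",
    "sid": "",
    "h": "",
    "channel": "7",
    "type": "n",
}

# The URL the browser sends (and what a quoted curl argument would contain).
full_url = f"{BASE}?{urlencode(params)}"

# An unquoted '&' ends the command in the shell, so curl only sees
# the part of the URL before the first '&'.
truncated_url = full_url.split("&", 1)[0]


def query_keys(url):
    """Return the query parameter names present in a URL (hypothetical helper)."""
    return [key for key, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)]


print("full:     ", query_keys(full_url))
print("truncated:", query_keys(truncated_url))
```

Building the URL from a dictionary, or passing it to an HTTP library directly, avoids shell quoting altogether.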
Another common problem: you may be using an HTTP proxy with Chrome. If so, you need to tell curl about this proxy as well. You can do so by setting the environment variable http_proxy.
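As a rough illustration of the same convention outside curl, Python's standard library also picks up http_proxy from the environment. The proxy address below is a placeholder assumed for this sketch, not something taken from the question.

```python
import os
import urllib.request

# Hypothetical proxy address, used only for illustration.
os.environ["http_proxy"] = "http://127.0.0.1:3128"

# getproxies() returns a mapping of URL scheme to proxy URL,
# taken from the *_proxy environment variables when they are set.
proxies = urllib.request.getproxies()
print(proxies.get("http"))
```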
Instagram has an endpoint that gives you a JSON in the response.
When I call it using curl, I get JSON back.
curl -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36" https://www.instagram.com/microsoft/\?__a\=1
However, if I use other HTTP clients such as Axios for Node.js, I get an HTML page instead. With some other HTTP clients I occasionally get a 302 redirect.
const axios = require('axios')
axios.request({
  url: 'https://www.instagram.com/microsoft/\?__a',
  headers: {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
})
Is there a way to get around this with axios or other HTTP clients so that they follow the curl behaviour for HTTP requests?
What I usually do in these cases is use Postman to convert a request from one form, such as a curl command, into another, for example axios or fetch.
This request used to work but now gets a 403. I tried adding a user agent like in this answer but still no good: https://stackoverflow.com/a/38489588/2415706
This second answer further down says to find the referer header but I can't figure out where these response headers are: https://stackoverflow.com/a/56946001/2415706
import requests

# Headers copied from the browser request
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    "referer": "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State",
}
job_url = "https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State"
job_response = requests.get(job_url, headers=headers, timeout=10)
print(job_response)
This is what I see under Request Headers for the first tab after refreshing the page but there's too much stuff. I assume I only need one of these lines.
:authority: www.ziprecruiter.com
:method: GET
:path: /Salaries/What-Is-the-Average-Programmer-Salary-by-State
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
cache-control: max-age=0
cookie: __cfduid=dea4372c39465cfa2422e97f84dea45fb1620355067; zva=100000000%3Bvid%3AYJSn-w3tCu9yJwJx; ziprecruiter_browser=99.31.211.77_1620355067_495865399; SAFESAVE_TOKEN=1a7e5e90-60de-494d-9af5-6efdab7ade45; zglobalid=b96f3b99-1bed-4b7c-a36f-37f2d16c99f4.62fd155f2bee.6094a7fb; ziprecruiter_session=66052203cea2bf6afa7e45cae7d1b0fe; experian_campaign_visited=1
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"
sec-ch-ua-mobile: ?0
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36
EDIT: looking at the other tabs, they have "referer": "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State", so I'm trying that now, but it is still 403.
Using httpx package it seems to work with:
import httpx
url = 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State'
r = httpx.get(url)
print(r.text)
print(r.status_code)
print(r.http_version)
repl.it: https://replit.com/#bertrandmartel/ZipRecruiter
I may be wrong, but I think the server didn't like the TLS negotiation performed by the requests library. It's odd, since the httpx call above uses HTTP/1.1 for the request, whereas with curl it only works with HTTP/2 and TLS 1.3.
Using a curl binary built with HTTP/2 support and an OpenSSL that supports TLS 1.3, the following works:
docker run --rm curlimages/curl:7.76.1 \
--http2 --tlsv1.3 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State' \
-H 'user-agent: Mozilla' \
-s -o /dev/null -w "%{http_code}"
returns:
301
The following commands fail:
Forcing HTTP/1.1 and enforcing TLS 1.3:
docker run --rm curlimages/curl:7.76.1 \
--http1.1 --tlsv1.3 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State' \
-H 'user-agent: Mozilla' \
-s -o /dev/null -w "%{http_code}"
Output: 403
Forcing HTTP/2 and enforcing TLS 1.2:
docker run --rm curlimages/curl:7.76.1 \
--http2 --tlsv1.2 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State' \
-H 'user-agent: Mozilla' \
-s -o /dev/null -w "%{http_code}"
Output: 403
My guess is that it detects something in the TLS negotiation, but the check is different when both TLS 1.3 and HTTP/2 are in use.
Unfortunately, you can't test HTTP/2 with requests/urllib, since they don't support it.
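If you want to see what a Python client negotiates with a given server, a minimal sketch using only the standard library is shown below. It does not reproduce the server's check (that remains a guess); it only reports the TLS version and the ALPN protocol agreed during the handshake. The function names build_context and probe are made up for this example, and the defaults are assumptions.

```python
import socket
import ssl


def build_context(min_version=ssl.TLSVersion.TLSv1_2, alpn=("h2", "http/1.1")):
    """Create a client TLS context with a minimum version and an ALPN offer."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = min_version
    # Offering "h2" only advertises HTTP/2 during the handshake;
    # it does not make this script speak HTTP/2.
    ctx.set_alpn_protocols(list(alpn))
    return ctx


def probe(host, port=443, ctx=None, timeout=10):
    """Connect to host and return (TLS version, ALPN protocol chosen by the server)."""
    ctx = ctx or build_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version(), tls.selected_alpn_protocol()


# Local check only; no network access happens here.
print(ssl.OPENSSL_VERSION, "TLS 1.3 available:", ssl.HAS_TLSv1_3)
```

Calling probe("ziprecruiter.com", ctx=build_context(ssl.TLSVersion.TLSv1_3)) would attempt a handshake that only accepts TLS 1.3, which roughly mirrors the --tlsv1.3 curl run above.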
I'm trying to trigger a new dag run via Airflow 2.0 REST API. If I am logged in to the Airflow webserver on the remote machine and I go to the swagger documentation page to test the API, the call is successful. If I log out or if the API call is sent through Postman or curl, then I get a 403 forbidden message. The same 403 error message is received in curl or postman whether I provide the web server username password or not.
curl -X POST --user "admin:blabla" "http://10.0.0.3:7863/api/v1/dags/tutorial_taskflow_api_etl/dagRuns" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"conf\":{},\"dag_run_id\":\"string5\"}"
{
"detail": null,
"status": 403,
"title": "Forbidden",
"type": "https://airflow.apache.org/docs/2.0.0/stable-rest-api-ref.html#section/Errors/PermissionDenied"
}
The API security setting has been changed to default instead of deny_all (auth_backend = airflow.api.auth.backend.default). Airflow was installed with pip on Ubuntu 18 (Bionic). DAGs run fine when triggered manually or on schedule. The database backend is Postgres.
Also tried copying the cookie details from Chrome into postman to get past this issue, but it did not work.
Here is the log on the web server for the two calls mentioned above.
airflowWebserver_container | 10.0.0.4 - - [05/Jan/2021:06:35:33 +0000] "POST /api/v1/dags/tutorial_taskflow_api_etl/dagRuns HTTP/1.1" 403 170 "http://10.0.0.3:7863/api/v1/ui/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
airflowWebserver_container | 10.0.0.4 - - [05/Jan/2021:06:35:07 +0000] "POST /api/v1/dags/tutorial_taskflow_api_etl/dagRuns HTTP/1.1" 409 251 "http://10.0.0.3:7863/api/v1/ui/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
I am using basic_auth for Airflow v2.0. The AIRFLOW__API__AUTH_BACKEND environment variable should be set to airflow.api.auth.backend.basic_auth. You will have to restart the webserver container. Then you should be able to access all stable APIs using the cURL commands with --user option.
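For readers who prefer to make the call from code, the following is a minimal sketch of what curl's --user option amounts to: an Authorization: Basic header added to the POST request. The host, DAG id, run id, and credentials are the ones from the question; the function name build_dag_run_request is hypothetical, and the request is only constructed here, not sent.

```python
import base64
import json
import urllib.request


def build_dag_run_request(base_url, dag_id, run_id, username, password):
    """Build (but do not send) a POST request that triggers a DAG run."""
    # HTTP basic auth is "username:password", base64-encoded.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    body = json.dumps({"conf": {}, "dag_run_id": run_id}).encode()
    return urllib.request.Request(
        f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


req = build_dag_run_request(
    "http://10.0.0.3:7863", "tutorial_taskflow_api_etl", "string5", "admin", "blabla"
)
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would send it; a 403 at that point suggests the server is still not running the basic_auth backend.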
In Airflow 2.0, there seems to be a bug.
Setting this auth configuration in airflow.cfg doesn't work:
auth_backend = airflow.api.auth.backend.basic_auth
But setting it as an environment variable works:
AIRFLOW__API__AUTH_BACKEND: "airflow.api.auth.backend.basic_auth"
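The variable name above follows Airflow's AIRFLOW__{SECTION}__{KEY} naming convention for configuration overrides. A tiny sketch of that mapping, with a hypothetical helper name airflow_env_name:

```python
def airflow_env_name(section: str, key: str) -> str:
    """Map an airflow.cfg section/key pair to its environment variable name."""
    # Note the double underscores around the section name.
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


print(airflow_env_name("api", "auth_backend"))
```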
@AmitSingh was correct. Setting security to default only works with the experimental API. I changed the relevant configuration in Airflow, restarted, and added 'experimental' to the API path. Please see https://airflow.apache.org/docs/apache-airflow/stable/rest-api-ref.html
Maybe also good to know:
You can only disable authentication for the experimental API, not the stable REST API.
See: https://airflow.apache.org/docs/apache-airflow/stable/security/api.html#disable-authentication
I am trying to use JSoup to parse content from URLs like https://www.tesco.com/groceries/en-GB/products/300595003
Jsoup.connect(url).get() simply times out, although I can access the website fine in a web browser.
Through trial and error, the simplest working curl command I found was:
curl 'https://www.tesco.com/groceries/en-GB/products/300595003' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0' \
-H 'Accept-Language: en-GB,en;q=0.5' --compressed
I am able to translate the User-Agent and Accept-Language into JSoup, however I still get timeouts. Is there an equivalent to the --compressed flag for Jsoup, because the curl command will not work without it?
To find out what the --compressed option does, try running curl with the --verbose parameter. It displays the full request headers.
Without --compressed:
> GET /groceries/en-GB/products/300595003 HTTP/2
> Host: www.tesco.com
> Accept: */*
> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0
> Accept-Language: en-GB,en;q=0.5
With --compressed:
> GET /groceries/en-GB/products/300595003 HTTP/2
> Host: www.tesco.com
> Accept: */*
> Accept-Encoding: deflate, gzip
> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0
> Accept-Language: en-GB,en;q=0.5
The difference is the new Accept-Encoding header, so adding .header("Accept-Encoding", "deflate, gzip") should solve your problem.
By the way, for me both jsoup and curl are able to download the page source without this header and without --compressed, and I'm not getting timeouts, so there's a chance the server is limiting your requests because you are making too many of them.
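Besides sending the Accept-Encoding header, curl's --compressed also decompresses the response before printing it. The Python sketch below (not Jsoup code) shows that second half: decoding a body according to its Content-Encoding value. The function name decode_body and the sample payload are assumptions for illustration.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Undo the compression named in a response's Content-Encoding header."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # HTTP "deflate" is normally zlib-wrapped deflate data.
        return zlib.decompress(body)
    # No (or unknown) encoding: return the body unchanged.
    return body


# Simulate a server that honoured "Accept-Encoding: deflate, gzip".
page = b"<html><body>product page</body></html>"
compressed = gzip.compress(page)
print(decode_body(compressed, "gzip") == page)
```

A client that sends Accept-Encoding but skips this step ends up with compressed bytes instead of HTML, so it is worth checking how your HTTP library handles it.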
EDIT:
It works for me using your original command with --http1.1, so there has to be a way to make it work for you as well. I'd start by using Chrome developer tools to look at what headers your browser sends and try to pass all of them using .header(...). You can also copy the request as a curl command to see all headers and simulate exactly what Chrome is sending.
When I run this command:
wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" http://yahoo.com
...I get this result (with nothing else in the file):
<!-- hw147.fp.gq1.yahoo.com uncompressed/chunked Wed Jun 19 03:42:44 UTC 2013 -->
But when I run wget http://yahoo.com with no --user-agent option, I get the full page.
The user agent is the same header that my current browser sends. Why does this happen? Is there a way to make sure the user agent doesn't get blocked when using wget?
It seems the Yahoo server applies some heuristic based on the User-Agent when the Accept header is set to */*.
Accept: text/html
did the trick for me.
e.g.
wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" http://yahoo.com
Note: if you don't declare an Accept header, wget automatically adds Accept: */*, which means "give me anything you have".
I created a ~/.wgetrc file with the following content (obtained from askapache.com but with a newer user agent, because otherwise it didn’t always work):
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0
referer = /
robots = off
Now I’m able to download from most (all?) file-sharing (streaming video) sites.
You need to set both the user-agent and the referer:
wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" --referer connect.wso2.com http://dist.wso2.org/products/carbon/4.2.0/wso2carbon-4.2.0.zip
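The same three headers can be set from Python as well. This is a minimal sketch using the standard library, with the values taken from the wget command above; build_request is a hypothetical helper, and the request is only constructed, not sent.

```python
import urllib.request


def build_request(url, user_agent, referer=None, accept="text/html"):
    """Build a GET request with browser-like User-Agent, Accept, and Referer."""
    headers = {"User-Agent": user_agent, "Accept": accept}
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


req = build_request(
    "http://dist.wso2.org/products/carbon/4.2.0/wso2carbon-4.2.0.zip",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0",
    referer="connect.wso2.com",
)
# Show the headers that would be sent.
for name, value in req.header_items():
    print(f"{name}: {value}")
```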