Instagram has an endpoint that gives you a JSON in the response.
When I try to call it using curl I would get a JSON.
curl -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36" https://www.instagram.com/microsoft/\?__a\=1
However, if I use other HTTP clients such as Axios for node.js I would get an html page instead. Also for some other HTTP clients I would get 302 redirect occasionally.
const axios = require('axios')
axios.request({
url: 'https://www.instagram.com/microsoft/\?__a',
headers: {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
})
Is there a way to get around this with axios or other HTTP clients so that they follow the curl behaviour for HTTP requests?
What I usually do in these cases, is use Postman in order to convert any type of requests, like Curl based request, to any other type, for example axios, fetch etc...
Related
I am trying to download multiple files from a website. Im scraping the website to come up with the individual URLs, but when I put the URL's into download.file, I'm getting a 403 forbidden error.
It looks like theres a user validation step on the website, but adding a header doesnt help.
Any help getting around this is appreciated. here's what im trying with one sample URL and file :
headers = c(
`user-agent` = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'
)
download.file("https://gibsons.civicweb.net/filepro/document/125764/Regular%20Council%20-%2006%20Dec%202022%20-%20Minutes%20-%20Pdf.pdf",
"file",
mode="wb",
headers=headers)```
I'm trying to trigger a new dag run via Airflow 2.0 REST API. If I am logged in to the Airflow webserver on the remote machine and I go to the swagger documentation page to test the API, the call is successful. If I log out or if the API call is sent through Postman or curl, then I get a 403 forbidden message. The same 403 error message is received in curl or postman whether I provide the web server username password or not.
curl -X POST --user "admin:blabla" "http://10.0.0.3:7863/api/v1/dags/tutorial_taskflow_api_etl/dagRuns" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"conf\":{},\"dag_run_id\":\"string5\"}"
{
"detail": null,
"status": 403,
"title": "Forbidden",
"type": "https://airflow.apache.org/docs/2.0.0/stable-rest-api-ref.html#section/Errors/PermissionDenied"
}
The security for API has been changed to default, instead of deny_all (auth_backend = airflow.api.auth.backend.default). The installation of airflow has been done using pip using ubuntu 18 bionic. Dags are running fine if triggered manually or scheduled. The database backend is postgres.
Also tried copying the cookie details from Chrome into postman to get past this issue, but it did not work.
Here is the log on the web server for the two calls mentioned above.
airflowWebserver_container | 10.0.0.4 - - [05/Jan/2021:06:35:33 +0000] "POST /api/v1/dags/tutorial_taskflow_api_etl/dagRuns HTTP/1.1" 403 170 "http://10.0.0.3:7863/api/v1/ui/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
airflowWebserver_container | 10.0.0.4 - - [05/Jan/2021:06:35:07 +0000] "POST /api/v1/dags/tutorial_taskflow_api_etl/dagRuns HTTP/1.1" 409 251 "http://10.0.0.3:7863/api/v1/ui/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
I am using basic_auth for Airflow v2.0. The AIRFLOW__API__AUTH_BACKEND environment variable should be set to airflow.api.auth.backend.basic_auth. You will have to restart the webserver container. Then you should be able to access all stable APIs using the cURL commands with --user option.
In Airflow 2.0, There seems to be some bug.
If you set this auth configuration in airflow.cfg, it doesn't work.
auth_backend = airflow.api.auth.backend.basic_auth
But setting this as an environment variable works
AIRFLOW__API__AUTH_BACKEND: "airflow.api.auth.backend.basic_auth"
#AmitSingh was correct. Setting security to default only works with the experimental api. I changed the relevant configuration in airflow, restarted and added 'experimental' in the api path. Please see https://airflow.apache.org/docs/apache-airflow/stable/rest-api-ref.html
Maybe also good to know:
You can only disable authentication for experimental API, not the stable REST API.
See: https://airflow.apache.org/docs/apache-airflow/stable/security/api.html#disable-authentication
I have used a lot of requests to https://www.instagram.com/{username}/?__a=1 to check if a pseudo was existing and now I am getting 429.
Before, I had just to wait few minutes to make the 429 disapear. Now it is persistent ! :( I'm trying once a day, it doesnt work anymore.
Do you know anything about instagram requests limitation ?
Do you have any workaround please ? Thanks
Code ...
import requests
r = requests.get('https://www.instagram.com/test123/?__a=1')
res = str(r.status_code)
Try adding the user-agent header, otherwise, the website thinks that your a bot, and will block you.
import requests
URL = "https://www.instagram.com/bla/?__a=1"
HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
response = requests.get(URL, headers=HEADERS)
print(response.status_code) # <- Output: 200
I am trying to crawl some information from www.blogabet.com.
In the mean time, I am attending a course at udemy about webcrawling. The author of the course I am enrolled in already gave me the answer to my problem. However, I do not fully understand why I have to do the specific steps he mentioned. You can find his code bellow.
I am asking myself:
1. For which websites do I have to use headers?
2. How do I get the information that I have to provide in the header?
3. How do I get the url he fetches? Basically, I just wanted to fetch: https://blogabet.com/tipsters
Thank you very much :)
scrapy shell
from scrapy import Request
url = 'https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=0'
page = Request(url,
headers={'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9,pl;q=0.8,de;q=0.7',
'Connection': 'keep-alive',
'Host': 'blogabet.com',
'Referer': 'https://blogabet.com/tipsters',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest'})
fetch(page)
If you look in your network panel when you load that page you can see the XHR and the headers it sends
So it looks like he just copied those.
In general you can skip everything except User-Agent and you want to avoid setting Host, Connection and Accept headers unless you know what you're doing.
I am using a web service API.
http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=1&type=n
Typing that address into the chrome, expected result (json file containing song information) could be returned but when using curl it failed. (in both case,response code is OK but the response body is not correct in the later case )
Here are the request info dumped using the Chrome developer tool:
Request URL:http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n
Request Method:GET
Status Code:200 OK
Request Headersview source
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:bid="lwaJyClu5Zg"
Host:www.douban.com
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Query String Parametersview sourceview URL encoded
app_name:radio_desktop_win
version:100
user_id:
expire:
token:
sid:
h:
channel:7
type:n
However, using that API with curl, i.e curl http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n will not return expected result.
Even specifying the exactly header as what dumped from Chrome still failed.
curl -v -H "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" -H "Accept-Encoding:gzip,deflat,sdcn" -H "Accept-Language:zh-CN,zh;q=0.8" -H "Cache-Control:max-age=0" -H "Connection:keep-alive" -H "Host:www.douban.com" -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36" http://www.douban.com/j/app/radio/people?app_name=radio_desktop_win&version=100&user_id=&expire=&token=&sid=&h=&channel=7&type=n
Below is what print out with -v from curl. Seems everything was identical with the request made by Chrome but still the response body is not correct.
GET /j/app/radio/people?app_name=radio_desktop_win HTTP/1.1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,/;q=0.8
Accept-Encoding:gzip,deflat,sdcn
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Host:www.douban.com
Why this happened? Appreciate your help.
You need to put quotes around that url in the shell. Otherwise the &s are going to cause trouble.
Another common problem: you may be using an HTTP proxy with Chrome. If so, you need to tell curl about this proxy as well. You can do so by setting the environmental variable http_proxy.