I've been scraping data from an API using R with the httr and plyr libraries. It's pretty straightforward and works well with the following code:
library(httr)
library(plyr)
headers <- c("Accept" = "application/json, text/javascript",
"Accept-Encoding" = "gzip, deflate, sdch",
"Connection" = "keep-alive",
"Referer" = "http://www.afl.com.au/stat",
"Host" = "www.afl.com.au",
"User-Agent" = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36",
"X-Requested-With"= "XMLHttpRequest",
"X-media-mis-token" = "f31fcfedacc75b1f1b07d5a08887f078")
query <- GET("http://www.afl.com.au/api/cfs/afl/season?seasonId=CD_S2016014", add_headers(headers))
stats <- httr::content(query)
My question is about the request token required in the headers (i.e. X-media-mis-token). This is easy to get manually by inspecting the XHR elements in Chrome or Firefox, but the token is updated every 24 hours, making manual extraction a pain.
Is it possible to query the web page and extract this token automatically using R?
You can get the X-media-mis-token token, but with a disclaimer. ;)
library(httr)
token_url <- 'http://www.afl.com.au/api/cfs/afl/WMCTok'
token <- POST(token_url, encode="json")
content(token)$token
#[1] "f31fcfedacc75b1f1b07d5a08887f078"
content(token)$disclaimer
#[1] "All content and material contained within this site is protected by copyright owned by or licensed to Telstra. Unauthorised reproduction, publishing, transmission, distribution, copying or other use is prohibited.
I made this code to search for the top links in a Google search, but it's returning None.
import webbrowser, requests
from bs4 import BeautifulSoup
string = 'selena+gomez'
website = f'http://google.com/search?q={string}'
req_web = requests.get(website).text
parser = BeautifulSoup(req_web, 'html.parser')
gotolink = parser.find('div', class_='r').a["href"]
print(gotolink)
Google requires you to specify a User-Agent HTTP header to return the correct page. Without the correct User-Agent specified, Google returns a page that doesn't contain <div> tags with the r class. You can see this if you do print(parser) with and without a User-Agent.
For example:
import requests
from bs4 import BeautifulSoup
string = 'selena+gomez'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
website = f'http://google.com/search?hl=en&q={string}'
req_web = requests.get(website, headers=headers).text
parser = BeautifulSoup(req_web, 'html.parser')
gotolink = parser.find('div', class_='r').a["href"]
print(gotolink)
Prints:
https://www.instagram.com/selenagomez/?hl=en
The answer from Andrej Kesely will throw an error, since this CSS class no longer exists:
gotolink = parser.find('div', class_='r').a["href"]
AttributeError: 'NoneType' object has no attribute 'a'
Learn more about user-agent and request headers.
Basically, the user-agent identifies the browser, its version number, and its host operating system; it represents the person (browser) in a Web context and lets servers and network peers identify whether the request comes from a bot or not.
In this case, you need to send a fake user-agent so that Google treats your request as a "real" user visit. This is also known as user-agent spoofing.
Pass user-agent in request headers:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get(YOUR_URL, headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "selena gomez"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
link = soup.select_one('.yuRUbf a')['href']
print(link)
# https://www.instagram.com/selenagomez/
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
Essentially, the main difference in your case is that you don't need to think about how to bypass Google's blocks if they appear, or figure out how to scrape elements that are a bit harder to scrape, since it's already done for the end user. The only thing that needs to be done is to grab the data you want from the JSON string.
Example code:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "selena gomez",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# [0] means index of the first organic result
link = results['organic_results'][0]['link']
print(link)
# https://www.instagram.com/selenagomez/
Disclaimer, I work for SerpApi.
I am trying to crawl some information from www.blogabet.com.
In the meantime, I am attending a course on Udemy about web crawling. The author of the course I am enrolled in has already given me the answer to my problem. However, I do not fully understand why I have to do the specific steps he mentioned. You can find his code below.
I am asking myself:
1. For which websites do I have to use headers?
2. How do I get the information that I have to provide in the header?
3. How do I get the URL he fetches? Basically, I just wanted to fetch: https://blogabet.com/tipsters
Thank you very much :)
scrapy shell
from scrapy import Request
url = 'https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=0'
page = Request(url,
headers={'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9,pl;q=0.8,de;q=0.7',
'Connection': 'keep-alive',
'Host': 'blogabet.com',
'Referer': 'https://blogabet.com/tipsters',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest'})
fetch(page)
If you look at your network panel when you load that page, you can see the XHR request and the headers it sends.
So it looks like he just copied those.
In general you can skip everything except User-Agent, and you want to avoid setting the Host, Connection and Accept headers unless you know what you're doing.
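For example, here is a minimal sketch of the same fetch in the scrapy shell with only the User-Agent kept (whether the site still returns the same data without the other headers is something you would need to verify):
from scrapy import Request
# Only the User-Agent is set; everything else is left to Scrapy's defaults.
page = Request('https://blogabet.com/tipsters',
               headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'})
fetch(page)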
Good morning,
I would like to simulate a file download with Gatling. I'm not sure that a simple GET request on a file resource really simulates it:
val stuffDownload: ScenarioBuilder = scenario("Download stuff")
.exec(http("Download stuff").get("https://stuff.pdf")
.header("Content-Type", "application/pdf")
.header("Content-Type", "application/force-download"))
I want to challenge my server with multiple downloads within the same moment and I need to be sure I have the right method to do it.
Thanks in advance for your help.
EDIT: Other headers I send:
"User-Agent" -> "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
"Accept" -> "application/json, text/plain, */*; q=0.01",
"Accept-Encoding" -> "gzip, deflate, br",
"Accept-Language" -> "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7",
"DNT" -> "1",
"Connection" -> "keep-alive"
Technically, it looks globally fine, except that:
you have 2 Content-Type headers?
Is there a mistake in the second one?
Also, aren't you missing other browser headers like User-Agent?
Aren't you missing an important one related to compression, like Accept-Encoding?
But regarding the functional part, aren't you missing some steps before it?
I mean, do your users access the link immediately, or do they hit a login screen, then do a search, and finally click on a link?
Also, is it always the same file? Shouldn't you introduce some variability, for example using Gatling CSV feeders with a set of files?
I am trying to pull a web page in my client (not a browser) with the following settings in the HTTP header
Accept: "text/html;charset=UTF-8"
Accept-Charset: "ISO-8859-1"
User-Agent: "Mozilla/5.0"
However, I get error code 406.
I also tried changing to:
Accept: "text/html"
with no success; the error code and status message in the response are:
statusCode: 406
statusMessage: "Not Acceptable"
Any idea what the correct header settings should be? The page loads fine in the browser.
Finally figured it out. I ran a sniffer to see which header settings worked, and here is what worked in every case:
headers: {
'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; de-de) AppleWebKit/523.10.3 (KHTML, like Gecko) Version/3.0.4 Safari/523.10',
'Accept-Charset': 'ISO-8859-1,UTF-8;q=0.7,*;q=0.7',
'Accept-language' : 'de,en;q=0.7,en-us;q=0.3'
}
You should add Accept-Language. See Here
Why are you sending contradictory headers? You are requesting a representation that is both UTF-8 and ISO-8859-1 at the same time. I guess the request could be interpreted as being for a 7-bit ASCII representation.
In this case I would omit Accept-Charset and change the Accept header to text/html, */*;q=0.1 so that you get something back, with a strong preference for HTML. See the Content Negotiation section of RFC 7231 for details about these headers.
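For illustration, here is a minimal sketch of that suggestion using Python's requests (the question doesn't say which HTTP client is in use, and the URL below is a placeholder, so treat this as showing the headers rather than a drop-in fix):
import requests

# Drop Accept-Charset entirely; Accept strongly prefers HTML but allows anything as a fallback.
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html, */*;q=0.1',
    'Accept-Language': 'de,en;q=0.7,en-us;q=0.3',
}

response = requests.get('https://example.com/page', headers=headers)  # placeholder URL
print(response.status_code)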
I'm printing out all the headers and I get:
map[Cookie:[_ga=GA1.2.843429125.1462575405] User-Agent:[Mozilla/5.0
(Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/601.4.4 (KHTML, like Gecko)
Version/9.0.3 Safari/601.4.4] Accept-Language:[en-us]
Accept-Encoding:[gzip, deflate] Connection:[keep-alive]
Accept:[text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8]]
which means my browser is sending "Cookie", "User-Agent", "Accept-Language", "Accept-Encoding", "Connection", and "Accept" but there is no "Host" value.
How can I get my https://en.wikipedia.org/wiki/Virtual_hosting working without this value?
I'm using https://github.com/gin-gonic/gin
It's stated in the Go net/http docs:
For incoming requests, the Host header is promoted to the Request.Host
field and removed from the Header map.
So you can get the host by accessing
http.Request.Host
Check here for details: https://golang.org/pkg/net/http/