R XML Bad Request (php)

I would like to scrape the contents of this web page using the XML package and htmlParse: http://www.interactivebrokers.com/en/p.php?f=products. However, the page I fetch and pass to htmlParse gives me a Bad Request error. What am I missing?

require(RCurl)
require(XML)
iburl <- 'http://www.interactivebrokers.com/en/p.php?f=products'
ua <- 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:16.0) Gecko/20100101 Firefox/16.0'
ibdata <- getURL(iburl, useragent = ua)
htmlParse(ibdata)
readHTMLTable(ibdata)
It looks like the server is checking the user agent.
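If that is the case, a minimal RCurl sketch (assuming the check covers only the User-Agent and Accept headers; the Accept value here is an assumption, not something confirmed by the server) would send them explicitly as request headers and parse the returned text:

require(RCurl)
require(XML)

iburl <- 'http://www.interactivebrokers.com/en/p.php?f=products'
ua <- 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:16.0) Gecko/20100101 Firefox/16.0'

# Send the UA (plus an explicit Accept) via httpheader instead of the useragent option
ibdata <- getURL(iburl, httpheader = c('User-Agent' = ua, 'Accept' = 'text/html'))

doc <- htmlParse(ibdata, asText = TRUE)  # parse the downloaded text, not a URL
tables <- readHTMLTable(doc)             # pull any HTML tables out of the parsed document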

Related

Keep receiving 503 error while scraping with rvest

I am trying to scrape fanfiction.net with rvest and keep getting a 503 server error.
The robots.txt file allows scraping the site with a 5-second delay, and the "Terms of Service" only forbid it for commercial use, whereas I intend to use it for research purposes (Digital Literary Studies here).
The following chunk of code results in an error already:
library(httr)
library(rvest)
url <- "https://www.fanfiction.net"
userAgent <- "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
read_html(url, user_agent(userAgent))
Most of the advice regarding that error recommends incorporating a delay between scraping requests, or providing a user agent.
Since I get the error on the very first request, adding a delay doesn't seem to solve the problem.
I have provided different user agents:
a plain "Mozilla/5.0"
full agents of the "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0" sort, but the server seems to react to the brackets, and the long version gives me a 403 error instead.
Finally, I provided my institution, name and email address as the user agent, and once (!!) received a 200 status code.
What would be the next step?
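For reference, a minimal httr-based sketch of the approach described in the question (explicit User-Agent plus the crawl delay from robots.txt); the URL and UA string are taken from the question, the rest is illustrative and may still be blocked by the server:

library(httr)
library(rvest)

url <- "https://www.fanfiction.net"
ua <- user_agent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0")

resp <- GET(url, ua)                 # send the custom User-Agent via httr instead of read_html()
status_code(resp)                    # check the status before trying to parse
if (status_code(resp) == 200) {
  page <- read_html(content(resp, as = "text", encoding = "UTF-8"))  # parse the body on success
}
Sys.sleep(5)                         # respect the 5-second delay required by robots.txt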

Why wget fails to download a file but browser succeeds?

I am trying to download the virus database for ClamAV from http://database.clamav.net/main.cvd. I am able to download main.cvd from a web browser (Chrome or Firefox) but unable to do the same with wget, which gives the following error:
--2021-05-03 19:06:01-- http://database.clamav.net/main.cvd
Resolving database.clamav.net (database.clamav.net)... 104.16.219.84, 104.16.218.84, 2606:4700::6810:db54, ...
Connecting to database.clamav.net (database.clamav.net)|104.16.219.84|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-05-03 19:06:01 ERROR 403: Forbidden.
Any lead on this issue?
Edit 1:
This is what my Chrome cookies look like when I try to download main.cvd.
It might be that the blocking is based on the User-Agent header. You can use the --user-agent= option to set the same User-Agent as your browser. For example,
wget --user-agent="Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0" https://www.example.com
will download the example.com page and identify itself to the server as Firefox. If you want to know more about what the parts of a User-Agent string mean, see the Mozilla Developer docs for the User-Agent header.
Also check for session cookies or tokens from the browser, as some websites add that kind of protection as well.

Jsoup times out and cURL only works with '--compressed' header - how do I emulate this header in Jsoup?

I am trying to use JSoup to parse content from URLs like https://www.tesco.com/groceries/en-GB/products/300595003
Jsoup.connect(url).get() simply times out; however, I can access the website fine in a web browser.
Through trial and error, the simplest working curl command I found was:
curl 'https://www.tesco.com/groceries/en-GB/products/300595003' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0' \
-H 'Accept-Language: en-GB,en;q=0.5' --compressed
I am able to translate the User-Agent and Accept-Language headers into Jsoup, but I still get timeouts. Is there an equivalent to the --compressed flag for Jsoup? The curl command will not work without it.
To find out what the --compressed option does, run curl with the --verbose flag; it will display the full request headers.
Without --compressed:
> GET /groceries/en-GB/products/300595003 HTTP/2
> Host: www.tesco.com
> Accept: */*
> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0
> Accept-Language: en-GB,en;q=0.5
With --compressed:
> GET /groceries/en-GB/products/300595003 HTTP/2
> Host: www.tesco.com
> Accept: */*
> Accept-Encoding: deflate, gzip
> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0
> Accept-Language: en-GB,en;q=0.5
The difference is the new Accept-Encoding header, so adding .header("Accept-Encoding", "deflate, gzip") to your Jsoup request should solve your problem.
By the way, for me both jsoup and curl are able to download the page source without this header and without --compressed, and I'm not getting timeouts, so there's a chance the server is limiting your requests because you have made too many.
EDIT:
It works for me using your original command with --http1.1, so there has to be a way to make it work for you as well. I'd start by using the Chrome developer tools to look at which headers your browser sends, and try to pass all of them using .header(...). You can also use "Copy as cURL" in the developer tools to see all the headers and simulate exactly what Chrome is sending.

Extracting token from XHR request header with R

I've been scraping data from an API using R with the httr and plyr libraries. It's pretty straightforward and works well with the following code:
library(httr)
library(plyr)
headers <- c("Accept" = "application/json, text/javascript",
"Accept-Encoding" = "gzip, deflate, sdch",
"Connection" = "keep-alive",
"Referer" = "http://www.afl.com.au/stat",
"Host" = "www.afl.com.au",
"User-Agent" = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36",
"X-Requested-With"= "XMLHttpRequest",
"X-media-mis-token" = "f31fcfedacc75b1f1b07d5a08887f078")
query <- GET("http://www.afl.com.au/api/cfs/afl/season?seasonId=CD_S2016014", add_headers(headers))
stats <- httr::content(query)
My question is about the request token required in the headers (i.e. X-media-mis-token). This is easy to get manually by inspecting the XHR requests in Chrome or Firefox, but the token is updated every 24 hours, making manual extraction a pain.
Is it possible to query the web page and extract this token automatically using R?
You can get the X-media-mis-token token, but with a disclaimer. ;)
library(httr)
token_url <- 'http://www.afl.com.au/api/cfs/afl/WMCTok'
token <- POST(token_url, encode="json")
content(token)$token
#[1] "f31fcfedacc75b1f1b07d5a08887f078"
content(token)$disclaimer
#[1] "All content and material contained within this site is protected by copyright owned by or licensed to Telstra. Unauthorised reproduction, publishing, transmission, distribution, copying or other use is prohibited.

Sites not accepting wget user agent header

When I run this command:
wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" http://yahoo.com
...I get this result (with nothing else in the file):
<!-- hw147.fp.gq1.yahoo.com uncompressed/chunked Wed Jun 19 03:42:44 UTC 2013 -->
But when I run wget http://yahoo.com with no --user-agent option, I get the full page.
The user agent string is the same one my current browser sends. Why does this happen? Is there a way to make sure the user agent doesn't get blocked when using wget?
It seems the Yahoo server applies some heuristic based on the User-Agent when the Accept header is set to */*.
Adding
Accept: text/html
did the trick for me, e.g.
wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" http://yahoo.com
Note: if you don't declare an Accept header, wget automatically adds Accept: */*, which means "give me anything you have".
I created a ~/.wgetrc file with the following content (obtained from askapache.com but with a newer user agent, because otherwise it didn't always work):
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0
referer = /
robots = off
Now I’m able to download from most (all?) file-sharing (streaming video) sites.
You need to set both the user-agent and the referer:
wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" --referer connect.wso2.com http://dist.wso2.org/products/carbon/4.2.0/wso2carbon-4.2.0.zip
