Web Scraping error in R

Web Scraping error in R - r

I am having an issue trying to scrape data from a site using R. I am trying to scrape the first table from the following webpage:
http://racing-reference.info/race/2016_Daytona_500/W
I looked at a lot of the threads about this, but I can't figure out how to make it work, most likely due to the fact I don't know HTML or much about it.
I have tried lots of different things with the code, and I keep getting the same error:
Error: failed to load HTTP resource
Here is what I have now:
library(RCurl)
library(XML)
URL <- "http://racing-reference.info/race/2016_Daytona_500/W"
doc <- htmlTreeParse(URL, useInternalNodes = TRUE)
If possible could you explain why whatever the solution is works and why what I have is giving an error? Thanks in advance.

Your sample code specifically included RCurl, but did not use it. You needed to. I think that you will get what you want from:
URL <- "http://racing-reference.info/race/2016_Daytona_500/W"
Content = getURL(URL)
doc <- htmlTreeParse(Content, useInternalNodes = TRUE)

Related

R question: use xmlValue in xpathSApply, but get an empty list

I'm attempting web scraping. Before posting my question, I looked up several similar questions such as this, and this. However, I still get stuck in my problem.
Speicifically, I'm trying to extract the listed prices on a second-hand cars website. # In case you are unable to see the data because you're not a registered user of this website, I also attached the screenshot of this website's html elements:
the screen shot.
The code I executed are:
library(httr)
library(XML)
url <- "https://www.sahibinden.com/vasita?query_text_mf=alfa+romeo+giulietta&query_text=alfa+romeo+giulietta"
htmlresponse <- GET(url)
htmlcontent <- content(htmlresponse, as="text")
parsedhtml <- htmlParse(htmlcontent, asText = TRUE)
# The above is just following the conventions, and seems okay.
prices <- xpathSApply(doc = parsedhtml, path = "//div/td[#class='searchResultsPriceValue']", fun = xmlValue)
# This command returned me an empty list.
Can someone have a look and give me some advices? Thank you very much!

Access sharepoint folders in R

I'm currently trying to access sharepoint folders in R. I read multiple articles addressing that issue but all the proposed solutions don't seem to work in my case.
I first tried to upload a single .txt file using the httr package, as follows:
URL <- "<domain>/<file>/<subfile>/document.txt"
r <- httr::GET(URL, httr::authenticate("username","password",type="any"))
I get the following error:
Error in curl::curl_fetch_memory(url, handle = handle) :
URL using bad/illegal format or missing URL
I then tried another package that use a similar syntax (RCurl):
URL <- "<domain>/<file>/<subfile>/document.txt"
r <- getURL(URL, userpwd = "username:password")
I get the following error:
Error in function (type, msg, asError = TRUE) :
I tried many other ways of linking R to sharepoint, but these two seemed the most straightforward. (also, my URL doesn't seem to be the problem since it works when I run it in my web browser).
Ultimately, I want to be able to upload a whole sharepoint folder to R (not only a single document). Something that would really help is to set my sharepoint folder as my working directory and use the base::list.files() function to list files in my folder, but I doubt thats possible.
Does anyone have a clue how I can do that?

I created an R library called sharepointr for doing just that.
What I basically did was:
Create App Registration
Add permissions
Get credentials
Make REST calls
The Readme.md for the repository has a full description, and here is an example:
# Install
install.packages("devtools")
devtools::install_github("esbeneickhardt/sharepointr")
# Parameters
client_id <- "insert_from_first_step"
client_secret <- "insert_from_first_step"
tenant_id <- "insert_from_fourth_step"
resource_id <- "insert_from_fourth_step"
site_domain <- "yourorganisation.sharepoint.com"
sharepoint_url <- "https://yourorganisation.sharepoint.com/sites/MyTestSite"
# Get Token
sharepoint_token <- get_sharepoint_token(client_id, client_secret, tenant_id, resource_id, site_domain)
# Get digest value
sharepoint_digest_value <- get_sharepoint_digest_value(sharepoint_token, sharepoint_url)
# List folders
sharepoint_path <- "Shared Documents/test"
get_sharepoint_folder_names(sharepoint_token, sharepoint_url, sharepoint_digest_value, sharepoint_path)

download xlsx from link and import into r

I know there are a number of posts on this topic and I usually am able to accomplish what I want just fine but I'm having trouble with this one particular link. It's likely related to the non-orthodox layout of the excel file. Here's my workflow:
library(rest)
url<-"http://irandataportal.syr.edu/wp-content/uploads/3.-economic-participation-and-unemployment-rates-for-populationa-aged-10-and-overa-by-ostan-province-1380-1384-2001-2005.xlsx"
unemp <- url %>%
read.xls()
That produces an error Error in getinfo.shape(fn) : Error opening SHP file
The problem is not related to the scraping of the data. The problem arises in regards to importing the data into a usable format. For example, read.xls("file.path/file.csv") produces the same error.

For example :
library(RCurl)
download.file(url, destfile = "./file.xlsx")
use your favorite reader then,

Adding the option fileEncoding="latin1" solved my problem.
url<-"http://irandataportal.syr.edu/wp-content/uploads/3.-economic-participation-and-unemployment-rates-for-populationa-aged-10-and-overa-by-ostan-province-1380-1384-2001-2005.xlsx"
unemp <- url %>%
read.xls(fileEncoding="latin1")

Downloading a complete html

I'm trying to scrape some website using R. However, I cannot get all the information from the website due to an unknown reason. I found a work around by first downloading the complete webpage (save as from browser). I was wondering whether it would be to download complete websites using some function.
I tried "download.file" and "htmlParse" but they seems to only download the source code.
url = "http://www.tripadvisor.com/Hotel_Review-g2216639-d2215212-Reviews-Ayurveda_Kuren_Maho-Yapahuwa_North_Western_Province.html"
download.file(url , "webpage")
doc <- htmlParse(urll)
ratings = as.data.frame(xpathSApply(doc,'//div[#class="rating reviewItemInline"]/span//#alt'))

This worked with rvest first go.
llply(html(url) %>% html_nodes('div.rating.reviewItemInline'),function(i)
data.frame(nth_stars = html_nodes(i,'img') %>% html_attr('alt'),
date_var = html_text(i)%>%stri_replace_all_regex('(\n|Reviewed)','')))

R Error using readHTMLTable

I am using the following code:
url = "http://finance.yahoo.com/q/op?s=DIA&m=2013-07"
library(XML)
tabs = readHTMLTable(url, stringsAsFactors = F)
I get the following error:
Error: failed to load external entity "http://finance.yahoo.com/q/op?s=DIA&m=2013-07"
When I use the url in the browser it works fine. So, what am I doing incorrect here?
Thanks

It's difficult to know for sure since I can't replicate your error, but according the package's author (see http://comments.gmane.org/gmane.comp.lang.r.mac/2284), XML's methods for getting web content are pretty minimalistic. A workaround is to use RCurl to get the content and XML to parse it:
library(XML)
library(RCurl)
url <- "http://finance.yahoo.com/q/op?s=DIA&m=2013-07"
tabs <- getURL(url)
tabs <- readHTMLTable(tabs, stringsAsFactors = F)
Or, if RCurl still throws an error, try the httr package:
library(httr)
tabs <- GET(url)
tabs <- readHTMLTable(rawToChar(tabs$content), stringsAsFactors = F)

I just got the same error as above "failed to load external entity" when using
url <- "http://www.cisco.com/c/en/us/products/a-to-z-series-index.html"
doc <- htmlTreeParse(url, useInternal=TRUE)
I came across this and another post on the topic, which didn't solve my problem. This code worked before. I then realized that I was on corporate VPN. I got off the VPN and tried again and it worked. So, being on VPN might be another reason why you would get the above error. Getting off VPN solves it.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Web Scraping error in R - r

Your sample code specifically included RCurl, but did not use it. You needed to. I think that you will get what you want from: URL <- "http://racing-reference.info/race/2016_Daytona_500/W" Content = getURL(URL) doc <- htmlTreeParse(Content, useInternalNodes = TRUE)

Related

R question: use xmlValue in xpathSApply, but get an empty list

Access sharepoint folders in R

download xlsx from link and import into r

Downloading a complete html

R Error using readHTMLTable

Categories

Resources