R Error using readHTMLTable

I am using the following code:
url = "http://finance.yahoo.com/q/op?s=DIA&m=2013-07"
library(XML)
tabs = readHTMLTable(url, stringsAsFactors = F)
I get the following error:
Error: failed to load external entity "http://finance.yahoo.com/q/op?s=DIA&m=2013-07"
When I use the URL in the browser it works fine. So, what am I doing incorrectly here?
Thanks

It's difficult to know for sure since I can't replicate your error, but according to the package's author (see http://comments.gmane.org/gmane.comp.lang.r.mac/2284), XML's methods for getting web content are pretty minimalistic. A workaround is to use RCurl to fetch the content and XML to parse it:
library(XML)
library(RCurl)
url <- "http://finance.yahoo.com/q/op?s=DIA&m=2013-07"
tabs <- getURL(url)
tabs <- readHTMLTable(tabs, stringsAsFactors = F)
Or, if RCurl still throws an error, try the httr package:
library(httr)
tabs <- GET(url)
tabs <- readHTMLTable(rawToChar(tabs$content), stringsAsFactors = F)
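A small variant of the httr approach uses its content() helper to do the raw-to-character conversion for you (a sketch, assuming the response body is text):
library(httr)
library(XML)
resp <- GET(url)
# as = "text" converts the raw response body into a character string
tabs <- readHTMLTable(content(resp, as = "text"), stringsAsFactors = F)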

I just got the same error as above ("failed to load external entity") when using:
url <- "http://www.cisco.com/c/en/us/products/a-to-z-series-index.html"
doc <- htmlTreeParse(url, useInternal=TRUE)
I came across this and another post on the topic, neither of which solved my problem. This code had worked before. I then realized that I was on my corporate VPN. I got off the VPN, tried again, and it worked. So being on a VPN can be another reason you get the above error; getting off the VPN solves it.
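For anyone diagnosing this kind of connectivity problem, a quick hypothetical check from inside R (these calls are not from the original post, just a diagnostic sketch):
# Does R itself get an HTTP response from the host? (errors if blocked)
curlGetHeaders("http://www.cisco.com/c/en/us/products/a-to-z-series-index.html")
# Is a corporate proxy configured in the environment?
Sys.getenv(c("http_proxy", "https_proxy", "no_proxy"))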

Related

Download data from HTTPS in R

I have a problem downloading data over HTTPS in R. I tried using curl, but it doesn't work.
URL <- "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
options('download.file.method'='curl')
download.file(URL, destfile = "./data.csv", method="auto")
The code downloads a CSV file, but when I checked the data the format had changed, so it didn't download correctly.
Could someone please help me?
I think you might actually have the URL wrong. I think you want:
https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv
Then you can download the file directly with download.file() (which is base R, so RCurl isn't actually needed here) rather than creating a variable with the URL:
download.file("https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv", destfile = "./data.csv", method = "libcurl")
You can also read the file directly into R, again using the raw URL; read.csv() accepts a URL, whereas wrapping the URL string in textConnection() would parse the URL text itself rather than the file:
URL <- "https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
out <- read.csv(URL)
You can find the 'raw.githubusercontent.com' link in the browser: when you go to "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv" and click "View raw" (above "Sorry about that, but we can’t show files that are this big right now."), it takes you to the actual data. You also have some minor typos.
This worked as expected for me:
url <- "https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
download.file(url, destfile = "./data.csv", method="auto")
df <- read.csv("./data.csv")

R webscraping "SSL certificate problem: certificate has expired" but works in browser. Need to parse HTML to JSON

I've come up with a partial solution to my question, but still need help getting to the end.
My issue is that I can no longer get JSON from a website using R, but I can still access it via my browser:
library(rvest)
library(httr)
library(jsonlite)
library(dplyr)
website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
fromJSON(website)
Now gives me:
Error in open.connection(con, "rb") : SSL certificate problem: certificate has expired
but I'm still able to visit the site in Chrome.
Ideally I'd like to find a way to get this working using fromJSON()
I don't entirely understand what's causing this error, so I tried a bunch of different solutions. I found that I could at least read the HTML using this:
website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
doc <- website %>%
httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
read_html()
However, from here I'm stuck trying to parse doc into the JSON I need. I've tried things like doc %>% xmlParse() %>% xmlToList() %>% toJSON() %>% fromJSON(), but the result is gibberish.
So my question comes down to: 1) is there a way around the SSL certificate problem so that I can use fromJSON() directly again, and 2) if not, how can I sanitize the HTML document to get it into a usable JSON format?
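One approach worth sketching, assuming the endpoint really returns plain JSON (so the read_html() step can be skipped entirely): take the body off the httr response as text and hand it straight to fromJSON(). This builds on the partial solution above and is not a confirmed fix:
library(httr)
library(jsonlite)
website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
# Fetch with peer verification disabled, as in the partial solution above
resp <- GET(website, config = config(ssl_verifypeer = FALSE))
# Parse the body as text rather than HTML
dat <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))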

Access sharepoint folders in R

I'm currently trying to access SharePoint folders in R. I have read multiple articles addressing the issue, but none of the proposed solutions seem to work in my case.
I first tried to read a single .txt file into R using the httr package, as follows:
URL <- "<domain>/<file>/<subfile>/document.txt"
r <- httr::GET(URL, httr::authenticate("username","password",type="any"))
I get the following error:
Error in curl::curl_fetch_memory(url, handle = handle) :
URL using bad/illegal format or missing URL
I then tried another package that uses a similar syntax (RCurl):
library(RCurl)
URL <- "<domain>/<file>/<subfile>/document.txt"
r <- getURL(URL, userpwd = "username:password")
I get the following error:
Error in function (type, msg, asError = TRUE) :
I tried many other ways of linking R to SharePoint, but these two seemed the most straightforward. (Also, my URL doesn't seem to be the problem, since it works when I run it in my web browser.)
Ultimately, I want to be able to read a whole SharePoint folder into R (not only a single document). Something that would really help is to set my SharePoint folder as my working directory and use the base::list.files() function to list the files in it, but I doubt that's possible.
Does anyone have a clue how I can do that?
I created an R package called sharepointr for doing just that.
What I basically did was:
Create App Registration
Add permissions
Get credentials
Make REST calls
The Readme.md for the repository has a full description, and here is an example:
# Install
install.packages("devtools")
devtools::install_github("esbeneickhardt/sharepointr")
# Parameters
client_id <- "insert_from_first_step"
client_secret <- "insert_from_first_step"
tenant_id <- "insert_from_fourth_step"
resource_id <- "insert_from_fourth_step"
site_domain <- "yourorganisation.sharepoint.com"
sharepoint_url <- "https://yourorganisation.sharepoint.com/sites/MyTestSite"
# Get Token
sharepoint_token <- get_sharepoint_token(client_id, client_secret, tenant_id, resource_id, site_domain)
# Get digest value
sharepoint_digest_value <- get_sharepoint_digest_value(sharepoint_token, sharepoint_url)
# List folders
sharepoint_path <- "Shared Documents/test"
get_sharepoint_folder_names(sharepoint_token, sharepoint_url, sharepoint_digest_value, sharepoint_path)

Web Scraping error in R

I am having an issue trying to scrape data from a site using R. I am trying to scrape the first table from the following webpage:
http://racing-reference.info/race/2016_Daytona_500/W
I have looked at a lot of the threads about this, but I can't figure out how to make it work, most likely because I don't know HTML or much about it.
I have tried lots of different things with the code, and I keep getting the same error:
Error: failed to load HTTP resource
Here is what I have now:
library(RCurl)
library(XML)
URL <- "http://racing-reference.info/race/2016_Daytona_500/W"
doc <- htmlTreeParse(URL, useInternalNodes = TRUE)
If possible, could you explain why the solution works and why my code gives an error? Thanks in advance.
Your sample code loads RCurl but never uses it, and you need it: as noted above, XML's own facilities for fetching web content are minimal, so htmlTreeParse() fails when handed the URL directly. Fetch the page with RCurl first and parse the result:
URL <- "http://racing-reference.info/race/2016_Daytona_500/W"
Content = getURL(URL)
doc <- htmlTreeParse(Content, useInternalNodes = TRUE)
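From there, since the goal was the first table, readHTMLTable() can be applied to the same fetched content; the [[1]] index is an assumption about which table you want:
tabs <- readHTMLTable(Content, stringsAsFactors = FALSE)
first_table <- tabs[[1]]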

Scrape forum using R

I would like to seek help in scraping information from HardwareZone.
This is the link: http://www.hardwarezone.com.sg/search/forum/camera
I would like to get all the information on cameras from the forum.
library(RSelenium)
library(magrittr)
base_url = "http://www.hardwarezone.com.sg/search/forum/camera"
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
I tried using the above code for the first part, and I got an error on the last line (I'm a Mac user).
The error I got was an undefined RCurl call; I looked at many possible solutions, but I still cannot solve it.
library(rvest)
url <- "http://www.hardwarezone.com.sg/search/forum/camera"
result <- url %>%
html() %>%
html_nodes(xpath = '//*[@id="cse"]/table[1]') %>%
html_table()
result
I tried this other method (the code above), but it still didn't work.
Can anyone guide me through this?
Thank you.
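For reference, here is a sketch of how the RSelenium session could continue once the server starts, so that the JavaScript-rendered results become visible to rvest; the XPath is taken from the question and is an assumption about the page structure:
library(RSelenium)
library(rvest)
remDrv$navigate("http://www.hardwarezone.com.sg/search/forum/camera")
# Grab the fully rendered page source and parse it with rvest
page_source <- remDrv$getPageSource()[[1]]
result <- read_html(page_source) %>%
html_nodes(xpath = '//*[@id="cse"]/table[1]') %>%
html_table(fill = TRUE)
remDrv$close()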
