Downloading a complete HTML page in R

I'm trying to scrape a website using R. However, I cannot get all the information from the page, for reasons I haven't identified. I found a workaround: first saving the complete webpage from the browser ("Save as..."). I was wondering whether it is possible to download a complete webpage like that using some function in R.
I tried "download.file" and "htmlParse" but they seems to only download the source code.
url = "http://www.tripadvisor.com/Hotel_Review-g2216639-d2215212-Reviews-Ayurveda_Kuren_Maho-Yapahuwa_North_Western_Province.html"
download.file(url , "webpage")
doc <- htmlParse(urll)
ratings = as.data.frame(xpathSApply(doc,'//div[#class="rating reviewItemInline"]/span//#alt'))

This worked with rvest on the first go:
library(rvest)
library(plyr)     # llply()
library(stringi)  # stri_replace_all_regex()

llply(read_html(url) %>% html_nodes('div.rating.reviewItemInline'), function(i)
  data.frame(nth_stars = html_nodes(i, 'img') %>% html_attr('alt'),
             date_var  = html_text(i) %>% stri_replace_all_regex('(\n|Reviewed)', '')))

Related

Download data from HTTPS in R

I have a problem downloading data over HTTPS in R. I tried using curl, but it doesn't work.
URL <- "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
options('download.file.method'='curl')
download.file(URL, destfile = "./data.csv", method="auto")
I downloaded the CSV file with that code, but when I checked the data the format was wrong, so it didn't download correctly.
Could someone please help me?
I think you might actually have the URL wrong: that link points to GitHub's HTML page for the file, not the raw CSV. I think you want:
https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv
Then you can download the file directly rather than first creating a variable with the URL. Note that download.file() with method = "libcurl" is base R, so loading RCurl isn't actually required:
download.file("https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv",
              destfile = "./data.csv", method = "libcurl")
You can also load the file into R directly from the site, since read.csv() accepts a URL, provided you use the raw link (the original textConnection(URL) would have parsed the URL string itself as the data):
URL <- "https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
out <- read.csv(URL)
You can also find the 'raw.githubusercontent.com' link in the browser: when you go to "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv" and click "View raw" (it's above "Sorry about that, but we can’t show files that are this big right now."), it takes you to the actual data. You also have some minor typos.
This worked as expected for me:
url <- "https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
download.file(url, destfile = "./data.csv", method="auto")
df <- read.csv("~/Desktop/data.csv")

R: Scraping multiple tables in URL

I'm learning how to scrape information from websites using httr and XML in R. It works just fine for websites with only a few tables, but I can't figure it out for websites with several tables. I'm using the following page from pro-football-reference as an example: https://www.pro-football-reference.com/boxscores/201609110atl.htm
library(httr)
library(XML)

# To get just the box score by quarter, which is the first table:
URL <- "https://www.pro-football-reference.com/boxscores/201609080den.htm"
resp <- GET(URL)
SnapTable <- readHTMLTable(rawToChar(resp$content), stringsAsFactors = FALSE)[[1]]

# Return the number of tables:
AllTables <- readHTMLTable(rawToChar(resp$content), stringsAsFactors = FALSE)
length(AllTables)
[1] 2
So I'm able to scrape info, but for some reason I can only capture the top two tables out of the 20+ on the page. For practice, I'm trying to get the "Starters" tables and the "Officials" tables.
Is my inability to get the other tables down to the website's setup or to incorrect code?
When it comes to web scraping in R, make intensive use of the package rvest.
While getting the raw HTML works either way, rvest lets you use CSS selectors, and SelectorGadget helps you find a styling pattern for a particular table that is hopefully unique. That way you can extract exactly the tables you are looking for rather than relying on their position by coincidence.
To get you started, read the rvest vignette for more detailed information.
#install.packages("rvest")
library(rvest)
library(magrittr)
# Store web url
fb_url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"
linescore = fb_url %>%
read_html() %>%
html_node(xpath = '//*[#id="content"]/div[3]/table') %>%
html_table()
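As for why only two tables come back: sports-reference sites commonly ship most of their tables inside HTML comments and only reveal them with JavaScript, so a plain parse never sees them. If that is indeed the case here, a hedged sketch of one way around it is to re-parse the comment text (untested against the live page):
library(rvest)
library(xml2)

page <- read_html("https://www.pro-football-reference.com/boxscores/201609080den.htm")

# Collect every comment node, glue the text together, and parse it as HTML;
# tables hidden inside comments then become ordinary nodes
hidden_tables <- page %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_nodes('table') %>%
  html_table(fill = TRUE)

length(hidden_tables)  # should be well above 2 if the guess is right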
Hope this helps.

Web Scraping error in R

I am having an issue trying to scrape data from a site using R. I am trying to scrape the first table from the following webpage:
http://racing-reference.info/race/2016_Daytona_500/W
I looked at a lot of the threads about this, but I can't figure out how to make it work, most likely because I don't know much HTML.
I have tried lots of different things with the code, and I keep getting the same error:
Error: failed to load HTTP resource
Here is what I have now:
library(RCurl)
library(XML)
URL <- "http://racing-reference.info/race/2016_Daytona_500/W"
doc <- htmlTreeParse(URL, useInternalNodes = TRUE)
If possible could you explain why whatever the solution is works and why what I have is giving an error? Thanks in advance.
Your sample code specifically included RCurl but did not use it, and you need to: htmlTreeParse() fails to fetch this page over HTTP on its own, so retrieve the content with getURL() first and then parse the resulting string. I think that you will get what you want from:
URL <- "http://racing-reference.info/race/2016_Daytona_500/W"
Content = getURL(URL)
doc <- htmlTreeParse(Content, useInternalNodes = TRUE)
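From there, since you want the first table, one possible next step (assuming it really is the first <table> element in the document) is XML::readHTMLTable():
library(XML)
table1 <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
head(table1)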

Harvesting data with rvest retrieves no value from data-widget

I'm trying to harvest data using rvest (also tried using XML and selectr) but I am having difficulties with the following problem:
In my browser's web inspector the html looks like
<span data-widget="turboBinary_tradologic1_rate" class="widgetPlaceholder widgetRate rate-down">1226.45</span>
(Note: rate-down and 1226.45 are updated periodically.) I want to harvest the 1226.45, but when I run my code (below) it says there is no information stored there. Does this have something to do with the fact that it's a widget? Any suggestions on how to proceed would be appreciated.
library(rvest)
library(selectr)
library(XML)
zoom.turbo.url <- "https://www.zoomtrader.com/trade-now?game=turbo"
zoom.turbo <- read_html(zoom.turbo.url)
# Navigate to node
zoom.turbo <- zoom.turbo %>% html_nodes("span") %>% `[[`(90)
# No value
as.character(zoom.turbo)
html_text(zoom.turbo)
# Using XML and Selectr
doc <- htmlParse(as.character(zoom.turbo), asText = TRUE)
xmlValue(querySelector(doc, 'span'))
For websites that are difficult to scrape, for example because the content is rendered dynamically with JavaScript, you can use RSelenium. With this package and a browser docker image, you can drive a real browser with R commands.
I have used this method to scrape a website with a dynamic login script that I could not get to work with other methods.
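A minimal sketch of that approach, assuming a Selenium server is already running locally (for instance via the selenium/standalone-firefox docker image); the port, wait time, and selector are assumptions based on the question's HTML:
library(RSelenium)

# Assumes a Selenium server is listening, e.g. started with:
#   docker run -d -p 4445:4444 selenium/standalone-firefox
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L,
                      browserName = "firefox")
remDr$open()
remDr$navigate("https://www.zoomtrader.com/trade-now?game=turbo")

Sys.sleep(5)  # give the page's JavaScript time to populate the widget

# Target the span via the data-widget attribute shown in the question
elem <- remDr$findElement(using = "css selector",
                          value = "span[data-widget='turboBinary_tradologic1_rate']")
elem$getElementText()

remDr$close()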

download xlsx from link and import into r

I know there are a number of posts on this topic and I'm usually able to accomplish what I want just fine, but I'm having trouble with this one particular link, likely because of the unorthodox layout of the Excel file. Here's my workflow:
library(gdata)     # read.xls()
library(magrittr)  # %>%

url <- "http://irandataportal.syr.edu/wp-content/uploads/3.-economic-participation-and-unemployment-rates-for-populationa-aged-10-and-overa-by-ostan-province-1380-1384-2001-2005.xlsx"
unemp <- url %>%
  read.xls()
That produces the error Error in getinfo.shape(fn) : Error opening SHP file.
The problem is not the scraping of the data; it arises when importing the file into a usable format. For example, read.xls("file.path/file.csv") produces the same error.
For example:
download.file(url, destfile = "./file.xlsx", mode = "wb")  # mode = "wb" keeps binary files like .xlsx intact on Windows
then use your favorite reader.
Adding the option fileEncoding = "latin1" solved my problem:
url <- "http://irandataportal.syr.edu/wp-content/uploads/3.-economic-participation-and-unemployment-rates-for-populationa-aged-10-and-overa-by-ostan-province-1380-1384-2001-2005.xlsx"
unemp <- url %>%
  read.xls(fileEncoding = "latin1")
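If read.xls still trips over the layout, a hedged alternative (assuming the readxl package is acceptable) is to download the workbook in binary mode and read it locally; read_excel()'s skip and range arguments can help with unorthodox sheets:
library(readxl)

url <- "http://irandataportal.syr.edu/wp-content/uploads/3.-economic-participation-and-unemployment-rates-for-populationa-aged-10-and-overa-by-ostan-province-1380-1384-2001-2005.xlsx"
download.file(url, destfile = "./file.xlsx", mode = "wb")  # binary mode matters on Windows
unemp <- read_excel("./file.xlsx")                         # e.g. read_excel("./file.xlsx", skip = 2)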
