I tried the following code to download an HTML file. The code runs without error, but the file returned is very small (~2 KB) and cannot be opened.
url <- "http://racing.hkjc.com/racing/information/english/Horse/OtherHorse.aspx?HorseNo=L042#htop"
download.file(url, destfile)
I am not sure whether connection speed affects what download.file returns, because sometimes the page downloads correctly after several tries. Any help or alternative solution would be appreciated. Thanks.
Lots of cleanup to do, but here's the basic method:
library(rvest)

read_html(url) %>%
  html_nodes(xpath = '/html/body/div/form/table[3]') %>%
  html_table(fill = TRUE)
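Note that html_table() returns a list of data frames, so pull out the one you want before cleaning it up. A minimal sketch of that step (the cleanup itself depends on how this particular table parses, so treat the details as assumptions):

tbl <- read_html(url) %>%
  html_nodes(xpath = '/html/body/div/form/table[3]') %>%
  html_table(fill = TRUE) %>%
  .[[1]]          # html_table() returns a list; keep the first match
str(tbl)          # inspect the columns before tidying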
Related
I have a problem downloading data over HTTPS in R. I tried using curl, but it doesn't work.
URL <- "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
options('download.file.method'='curl')
download.file(URL, destfile = "./data.csv", method="auto")
That code downloaded a file, but when I checked the data the format was wrong, so it didn't download correctly.
Could someone please help me?
I think you might actually have the URL wrong; you want:
https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv
Then you can download the file directly with download.file() rather than first creating a variable with the URL. (The method = "libcurl" download is part of base R, so loading the RCurl package isn't actually needed.)

download.file("https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv",
              destfile = "./data.csv", method = "libcurl")
You can also load the file directly into R from the site, as long as you use the raw URL:

URL <- "https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
out <- read.csv(URL)
You can use the 'raw.githubusercontent.com' link: in the browser, when you go to "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv" you can click the "View raw" link (it's above "Sorry about that, but we can’t show files that are this big right now."), and this takes you to the actual data. You also have some minor typos.
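If you'd rather not edit the URL by hand, here's a small sketch of the blob-to-raw rewrite (the pattern is an assumption based on how GitHub lays out the two hosts):

blob_url <- "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
raw_url <- sub("github\\.com", "raw.githubusercontent.com", blob_url)  # swap the host
raw_url <- sub("/blob/", "/", raw_url)                                 # drop the /blob/ path segment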
This worked as expected for me:
url <- "https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
download.file(url, destfile = "./data.csv", method="auto")
df <- read.csv("./data.csv")
I am trying to scrape the text of U.N. Security Council (UNSC) resolutions into R. The U.N. maintains an online archive of all UNSC resolutions in PDF format (here). So, in theory, this should be doable.
If I click on the hyperlink for a specific year and then click on the link for a specific document (e.g., this one), I can see the PDF in my browser. When I try to download that PDF by pointing download.file at the link in the URL bar, it seems to work. When I try to read the contents of that file into R using the pdf_text function from the pdftools package, however, I get a stack of error messages.
Here's what I'm trying that's failing. If you run it, you'll see the error messages I'm talking about.
library(pdftools)
pdflink <- "http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)"
tmp <- tempfile()
download.file(pdflink, tmp, mode = "wb")
doc <- pdf_text(tmp)
What am I missing? I think it has to do with the link addresses to the downloadable versions of these files differing from the link addresses for the in-browser display, but I can't figure out how to get the path to the former. I tried right-clicking on the download icon; using the "Inspect" option in Chrome to see the URL identified as 'src' there (this link); and pointing the rest of my process at it. Again, the download.file part executes, but I get the same error messages when I run pdf_text. I also tried a) varying the mode part of the call to download.file and b) tacking ".pdf" onto the end of the path to tmp, but neither of those helped.
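One quick diagnostic before anything else (a sketch, not a fix): peek at the first bytes of the file that download.file saved. A real PDF starts with "%PDF-"; if you see HTML tags instead, you downloaded the viewer page rather than the document itself.

rawToChar(readBin(tmp, what = "raw", n = 64))  # "%PDF-..." means a real PDF; "<html..." means the wrapper page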
The PDF you are trying to download is in an iframe in the main page, so the link you are downloading only contains HTML.
You need to follow the link in the iframe to get the actual link to the PDF, jumping through several pages to pick up cookies and temporary URLs before reaching the direct download link.
Here's an example for the link you posted:
library(rvest)
library(pdftools)

s <- html_session("http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)")

# get the link in the mainFrame frame holding the pdf
frame_link <- s %>% read_html() %>%
  html_nodes(xpath = "//frame[@name='mainFrame']") %>%
  html_attr("src")
# go to that link
s <- s %>% jump_to(url = frame_link)

# there is a meta refresh with a link to another page; get it and go there
temp_url <- s %>% read_html() %>%
  html_nodes("meta") %>%
  html_attr("content") %>% {gsub(".*URL=", "", .)}
s <- s %>% jump_to(url = temp_url)

# get the LtpaToken cookie, then come back
s %>% jump_to(url = "https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234") %>%
  back()

# get the pdf link and download it
pdf_link <- s %>% read_html() %>%
  html_nodes(xpath = "//meta[@http-equiv='refresh']") %>%
  html_attr("content") %>% {gsub(".*URL=", "", .)}
s <- s %>% jump_to(pdf_link)

tmp <- tempfile()
writeBin(s$response$content, tmp)
doc <- pdf_text(tmp)
doc
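As a side note, if the code above fails with "could not find function", rvest 1.0 renamed these session verbs. Under the new names the same flow would start like this (a sketch of the renames only, untested against this site):

library(rvest)
s <- session("http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)")  # was html_session()
s <- session_jump_to(s, frame_link)  # was jump_to()
s <- session_back(s)                 # was back()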
I know there are a number of posts on this topic, and I can usually accomplish what I want just fine, but I'm having trouble with this one particular link. It's likely related to the unorthodox layout of the Excel file. Here's my workflow:
library(rvest)
library(gdata)  # read.xls() comes from gdata

url <- "http://irandataportal.syr.edu/wp-content/uploads/3.-economic-participation-and-unemployment-rates-for-populationa-aged-10-and-overa-by-ostan-province-1380-1384-2001-2005.xlsx"
unemp <- url %>%
  read.xls()
That produces the error: Error in getinfo.shape(fn) : Error opening SHP file
The problem is not with scraping the data; it arises when importing the data into a usable format. For example, read.xls("file.path/file.csv") produces the same error.
For example:

download.file(url, destfile = "./file.xlsx", mode = "wb")  # mode = "wb" keeps the binary xlsx intact

Then use your favorite reader.
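For instance, with the readxl package (a sketch; given the unorthodox sheet layout, you may need to experiment with the skip or range arguments):

library(readxl)
unemp <- read_excel("./file.xlsx")              # first sheet by default
# unemp <- read_excel("./file.xlsx", skip = 2)  # e.g. if merged header rows get in the way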
Adding the option fileEncoding="latin1" solved my problem.
url<-"http://irandataportal.syr.edu/wp-content/uploads/3.-economic-participation-and-unemployment-rates-for-populationa-aged-10-and-overa-by-ostan-province-1380-1384-2001-2005.xlsx"
unemp <- url %>%
read.xls(fileEncoding="latin1")
I'm trying to scrape a website using R. However, I cannot get all the information from the website, for reasons unknown. I found a workaround by first downloading the complete webpage ("save as" from the browser). I was wondering whether it would be possible to download a complete website using some function.
I tried "download.file" and "htmlParse" but they seems to only download the source code.
url = "http://www.tripadvisor.com/Hotel_Review-g2216639-d2215212-Reviews-Ayurveda_Kuren_Maho-Yapahuwa_North_Western_Province.html"
download.file(url , "webpage")
doc <- htmlParse(urll)
ratings = as.data.frame(xpathSApply(doc,'//div[#class="rating reviewItemInline"]/span//#alt'))
This worked with rvest on the first go.
library(rvest)
library(plyr)     # llply()
library(stringi)  # stri_replace_all_regex()
reviews <- llply(read_html(url) %>% html_nodes('div.rating.reviewItemInline'), function(i)
  data.frame(nth_stars = html_nodes(i, 'img') %>% html_attr('alt'),
             date_var  = html_text(i) %>% stri_replace_all_regex('(\n|Reviewed)', '')))
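Since llply() returns a list of data frames (one per review block), you can bind them into a single data frame afterwards:

ratings <- do.call(rbind, reviews)  # stack the per-review data frames into one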
I would like to seek help in scraping information from HardwareZone.
This is the link: http://www.hardwarezone.com.sg/search/forum/camera
I would like to get all the information on cameras from the forum.
library(RSelenium)
library(magrittr)
base_url = "http://www.hardwarezone.com.sg/search/forum/camera"
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
I tried using the above code for the first part, and I got an error on the last line (I'm a Mac user).
The error was an undefined RCurl call; I have looked at many possible solutions but still cannot solve it.
library(rvest)

url <- "http://www.hardwarezone.com.sg/search/forum/camera"
result <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="cse"]/table[1]') %>%
  html_table()
result
I tried another method (the code above), but it still didn't work.
Can anyone guide me through this?
Thank you.