Scrape multiple URLs at the same time in R

Good afternoon,
Thanks for helping me out with this question.
I have a set of >5000 URLs within a data frame that I am interested in scraping for their text.
At the moment, I've figured out how to obtain the text for a single URL using the code below:
singleURL <- c("http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all")
singleText <- readLines(singleURL)
Unfortunately, when I try to scale this up with multiple URLs it gives me an "Error in file(con, "r") : invalid 'description' argument" message. Here is the code I have been trying:
multipleURL <- c("http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all", "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1407&start=1&labeltype=all", "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1975&start=1&labeltype=all")
multipleText <- readLines(multipleURL)
If anyone has any suggestions, I would be greatly appreciative.
Many thanks,
Chris
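A minimal sketch of one possible fix (not from the original post): readLines() reads from a single connection, so it fails when handed a whole vector of URLs; iterating over the vector works instead.
multipleText <- lapply(multipleURL, readLines)  # one readLines() call per URL
names(multipleText) <- multipleURL              # label each result with its URL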

Related

R - how to compare a list of characters with an online article

I have a character vector of search terms:
vector <- c("Retail","real consumption","jobs")
I want to compare this vector of terms with an online article:
https://finance.yahoo.com/news/coroanvirus-covid-may-2020-retail-sales-171911895.html
I want to output how many matches there are for each term. For example, how many times "Retail" appears in the article, and how many times "real consumption" appears in the article, regardless of capitalization.
The first step I took is download the website link using
article <- download.file("https://finance.yahoo.com/news/coroanvirus-covid-may-2020-retail-sales-171911895.html",destfile="basename(url)",method="libcurl")
But I got error message:
cannot open URL 'https://finance.yahoo.com/news/coroanvirus-covid-may-2020-retail-sales-171911895.html'
In addition: Warning message:
In download.file("https://finance.yahoo.com/news/coroanvirus-covid-may-2020-retail-sales-171911895.html", :
URL 'https://finance.yahoo.com/news/coroanvirus-covid-may-2020-retail-sales-171911895.html': status was 'Couldn't resolve host name'
New Edit:
I also tried the code below, but I'm not sure where to go from here:
con <- url("https://finance.yahoo.com/news/coroanvirus-covid-may-2020-retail-sales-171911895.html", "rb")
article <- read_html(con)
The above is just an example; in my real case, I need to compare the vector of terms with many online articles. Can anyone show me a way to do this? Is there a package I could use? Thanks a lot in advance!
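A rough sketch of the counting step, assuming the article can actually be fetched (the "Couldn't resolve host name" error above points to a network or proxy problem rather than the code); rvest plus stringr is one option, and the object names below are made up:
library(rvest)
library(stringr)
terms <- c("Retail", "real consumption", "jobs")
page  <- read_html("https://finance.yahoo.com/news/coroanvirus-covid-may-2020-retail-sales-171911895.html")
text  <- paste(html_text(html_nodes(page, "p")), collapse = " ")  # article paragraphs as one string
counts <- sapply(terms, function(x) str_count(text, regex(x, ignore_case = TRUE)))
counts  # named vector: number of matches per term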

Web scraping issue in R

I am trying to extract university rankings data for the years 2017 and 2018 from the website https://www.topuniversities.com.
I am trying to run the code below in R, but it gives me an error.
My code:
library(rvest)
# Specifying the URL of the website to be scraped
url <-"https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/-1/sort_by/scores_international_outlook/sort_order/asc/cols/scores"
#Reading the HTML code from the website
webpage <- read_html(url)
vignette("selectorgadget")
ranking_html=html_nodes(url,".namesearch , .sorting_2 , .sorting_asc")
The error:
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character"
Please help me solve the above issue; any suggestions related to web scraping are welcome.
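For what it's worth, the UseMethod("xml_find_all") error comes from passing the character URL to html_nodes() rather than the parsed document. A sketch of that fix, reusing url and webpage from the snippet above; note that the rankings table on this page is rendered with JavaScript, so the static HTML may not contain the data, which is why the PDF route that follows works:
library(rvest)
webpage <- read_html(url)
ranking_html <- html_nodes(webpage, ".namesearch , .sorting_2 , .sorting_asc")  # parsed document, not the URL string
ranking_text <- html_text(ranking_html)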
I was able to extract the tables with a different approach:
library(pagedown)
library(tabulizer)
chrome_print("https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/-1/sort_by/scores_international_outlook/sort_order/asc/cols/scores",
"C:\\...\\table.pdf")
table_Page_2 <- extract_tables("C:\\...\\table.pdf",pages = 2)
table_Page_2
Afterwards, we have to clean up the table text, but all the values are there.

Web Scraping error in R

I am having an issue trying to scrape data from a site using R. I am trying to scrape the first table from the following webpage:
http://racing-reference.info/race/2016_Daytona_500/W
I have looked at a lot of threads about this, but I can't figure out how to make it work, most likely because I don't know HTML or much about it.
I have tried lots of different things with the code, and I keep getting the same error:
Error: failed to load HTTP resource
Here is what I have now:
library(RCurl)
library(XML)
URL <- "http://racing-reference.info/race/2016_Daytona_500/W"
doc <- htmlTreeParse(URL, useInternalNodes = TRUE)
If possible could you explain why whatever the solution is works and why what I have is giving an error? Thanks in advance.
Your sample code loaded RCurl but did not use it; you need to. I think you will get what you want from:
URL <- "http://racing-reference.info/race/2016_Daytona_500/W"
Content <- getURL(URL)                                  # fetch the page body as a character string
doc <- htmlTreeParse(Content, useInternalNodes = TRUE)  # parse the HTML that was actually retrieved
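To then pull the first table out of the parsed document, readHTMLTable() from the same XML package should work; a sketch, untested against this page, reusing doc from above:
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)  # every <table> on the page as a data frame
firstTable <- tables[[1]]                               # the first table, as requested
head(firstTable)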

download xlsx from link and import into r

I know there are a number of posts on this topic, and I am usually able to accomplish what I want just fine, but I'm having trouble with this particular link. It's likely related to the unorthodox layout of the Excel file. Here's my workflow:
library(gdata)     # read.xls()
library(magrittr)  # %>%
url<-"http://irandataportal.syr.edu/wp-content/uploads/3.-economic-participation-and-unemployment-rates-for-populationa-aged-10-and-overa-by-ostan-province-1380-1384-2001-2005.xlsx"
unemp <- url %>%
read.xls()
That produces the error Error in getinfo.shape(fn) : Error opening SHP file.
The problem is not related to scraping the data; it arises when importing the data into a usable format. For example, read.xls("file.path/file.csv") produces the same error.
For example:
library(RCurl)
download.file(url, destfile = "./file.xlsx")
Then use your favorite reader.
Adding the option fileEncoding = "latin1" solved my problem:
url<-"http://irandataportal.syr.edu/wp-content/uploads/3.-economic-participation-and-unemployment-rates-for-populationa-aged-10-and-overa-by-ostan-province-1380-1384-2001-2005.xlsx"
unemp <- url %>%
read.xls(fileEncoding="latin1")
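A related sketch, not from the thread: downloading the workbook in binary mode and reading it with readxl is another route that often sidesteps format issues (the destination file name here is arbitrary):
library(readxl)
url <- "http://irandataportal.syr.edu/wp-content/uploads/3.-economic-participation-and-unemployment-rates-for-populationa-aged-10-and-overa-by-ostan-province-1380-1384-2001-2005.xlsx"
download.file(url, destfile = "unemp.xlsx", mode = "wb")  # "wb" keeps the binary xlsx intact, especially on Windows
unemp <- read_excel("unemp.xlsx")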

Scraping multiple URLs in R using sapply

Good afternoon,
Thanks for helping me out with this question.
I have a list of multiple URLs that I am interested in scraping for a specific field.
At the moment, I'm using the function below to return the value I'm interested in for a specific field:
# fromJSON() below comes from a JSON package such as RJSONIO or jsonlite
dayViews <- function(url) {
  raw <- readLines(url)
  dat <- fromJSON(raw)
  daily <- dat$daily_views$`2014-08-14`
  return(daily)
}
How do I modify this to run on a list of multiple URLs? I tried using sapply/lapply over a list of URLs, but it gives me the following message:
"Error in file(con, "r") : invalid 'description' argument"
If anyone has any suggestions, I would be greatly appreciative.
Many thanks,
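For anyone hitting the same message: readLines() raises "invalid 'description' argument" when it is given something other than a single character string, which typically happens when the URLs are passed as a data frame or a factor column rather than a plain character vector. A minimal sketch, with urls standing in for your vector of URLs:
urls <- as.character(urls)          # make sure these are plain character strings, not a factor
allViews <- sapply(urls, dayViews)  # one dayViews() call per URL
allViews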
Doing something similar to you, @yarbaur, I read into R an Excel spreadsheet that keeps all the URLs of the set I want to scrape. It has columns for company, URL, and XPath. Then try something like this code, where I have substituted made-up variable names for yours. I am not scraping JSON sites, however:
library(httr)  # GET(), content()
library(XML)   # getNodeSet(), xmlValue()
temp <- apply(yourspreadsheetReadintoR, 1,
              function(x) {
                yourCompanyName <- x[1]
                yourURLS <- x[2]
                yourxpath <- x[3]  # I also store the XPath expressions for each site
                fetch <- content(GET(yourURLS))                         # fetch and parse the page
                locs <- sapply(getNodeSet(fetch, yourxpath), xmlValue)  # text of every node matching the XPath
                data.frame(coName = rep(yourCompanyName, length(locs)), location = locs)
              })
