Web scraping issue in R

I am trying to extract university rankings data for the years 2017 and 2018 from the website https://www.topuniversities.com.
I am running the code below in R, but it gives me an error.
My code:
library(rvest)
#Specifying the url for the desired website to be scraped
url <-"https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/-1/sort_by/scores_international_outlook/sort_order/asc/cols/scores"
#Reading the HTML code from the website
webpage <- read_html(url)
vignette("selectorgadget")
ranking_html=html_nodes(url,".namesearch , .sorting_2 , .sorting_asc")
The error:
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character"
Please help me solve the above issue; any suggestions related to web scraping are welcome.
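A note on the error itself: html_nodes() is being given the URL string, but it expects the parsed document returned by read_html(). A minimal sketch of that one-line fix, keeping the same selectors (the rankings table on this site is rendered by JavaScript, so rvest alone may still come back with an empty node set):
library(rvest)
webpage <- read_html(url)
# pass the parsed document, not the URL string, to html_nodes()
ranking_html <- html_nodes(webpage, ".namesearch , .sorting_2 , .sorting_asc")
ranking_text <- html_text(ranking_html)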

I was able to extract the tables with a different approach :
library(pagedown)
library(tabulizer)
chrome_print("https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/-1/sort_by/scores_international_outlook/sort_order/asc/cols/scores",
"C:\\...\\table.pdf")
table_Page_2 <- extract_tables("C:\\...\\table.pdf",pages = 2)
table_Page_2
Afterwards, we have to clean the text of the tables, but all the values are there.
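For that cleanup step, a minimal sketch (extract_tables() returns a list of character matrices by default, so the exact column handling will depend on how the PDF is laid out):
page2 <- table_Page_2[[1]]
# turn the character matrix into a data frame and trim stray whitespace
rankings <- as.data.frame(page2, stringsAsFactors = FALSE)
rankings[] <- lapply(rankings, trimws)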

Related

R: Webscraping data not contained in HTML

I'm trying to web scrape in R from webpages such as these, but the HTML is only 50 lines, so I assume the numbers are loaded from a JavaScript file or from their server. I'm not sure how to find the numbers I want (e.g., the enrollment number under student population).
When I try to use rvest, as in
num <- school_webpage %>%
html_elements(".number no-mrg-btm") %>%
html_text()
I get an error that says "could not find function "html_elements"" even though I've installed and loaded rvest.
What's my best strategy for getting those various numbers, and why am I getting that error message? Thanks.
That data comes from an API request you can find in the browser's network tab; it returns JSON. Make a request directly to that endpoint (since there is no browser executing the JavaScript from the landing page for you):
library(jsonlite)
data <- jsonlite::read_json('https://api.caschooldashboard.org/LEAs/01611766000590/6/true')
print(data$enrollment)
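As for the error message itself: html_elements() was only added in rvest 1.0.0, so an older installation needs html_nodes() instead, and a compound class selector has to join the classes with dots. A minimal sketch of the corrected call (it still will not return the enrollment figure here, because that value is not in the served HTML, as explained above):
library(rvest)
num <- school_webpage %>%
  html_nodes(".number.no-mrg-btm") %>%  # dots join classes on the same element
  html_text()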

Using R to mimic “clicking” a download file button on a webpage

There are two parts to my question, as I explored two methods in this exercise, but I succeeded with neither. It would be greatly appreciated if someone could help me out.
[PART 1:]
I am attempting to scrape data from a Singapore Exchange webpage, https://www2.sgx.com/derivatives/negotiated-large-trade, which stores its data in a table. I have some basic knowledge of scraping data using rvest. However, looking at the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsASfactors = FALSE)
html_nodes(SGXdata,".table-container")
However, nothing is picked up by the code, and I doubt whether I'm using it correctly.
[PART 2:]
I realized that there's a small "download" button on the page which downloads exactly the data file I want in .csv format. So I thought about writing some code to mimic the download button, and I found this question, Using R to "click" a download file button on a webpage, but I'm unable to get it to work even with some modifications to that code.
There are a few filters on the webpage; mostly I'm interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata = function(date){
  POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
       body = NULL,
       encode = "form",
       write_disk("SGXdata.csv")) -> resfile
  res = read.csv(resfile)
  return(res)
}
I intended to put the function input "date" into the "body" argument, but I was unable to figure out how to do that, so I started with "body = NULL", assuming it does no filtering. However, the result is still unsatisfactory: the downloaded file is basically empty, apart from the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call that returns JSON; you can find it in the network tab of your browser's dev tools.
The following returns that content. It reads the total number of result pages, then loops, combining the data frame returned by each call into one final data frame containing all results.
library(jsonlite)
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if(num_pages > 1){
  # page numbering starts at 0 and the first page is already in df,
  # so only pages 1 .. num_pages - 1 remain to be fetched
  for(i in seq(1, num_pages - 1)){
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
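Since the original goal was a function that takes a business date, here is a minimal sketch along the same lines (assuming the businessdatestart/businessdateend query parameters take a single yyyymmdd date, as in the URL above):
library(jsonlite)
crawlSGXdata <- function(date){  # date as "yyyymmdd", e.g. "20190708"
  base <- paste0("https://api.sgx.com/negotiatedlargetrades/v1.0",
                 "?order=asc&orderby=contractcode&category=futures",
                 "&businessdatestart=", date, "&businessdateend=", date,
                 "&pagestart=%d&pageSize=250")
  r <- fromJSON(sprintf(base, 0))
  df <- r$data
  num_pages <- r$meta$totalPages
  if(num_pages > 1){
    for(i in seq(1, num_pages - 1)){
      df <- rbind(df, fromJSON(sprintf(base, i))$data)
    }
  }
  df
}
# usage: res <- crawlSGXdata("20190708")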

Error "xpath_search" after trying to scrape the website with "xml_find_all"

I'm new to R. I'm trying to scrape a public website which shows the number of prisoners and vacancies in prisons in the state of São Paulo, Brazil. I'm a journalist, and I asked the state for this information, but they didn't want to give it to me.
I can't get any data even when using xml_find_all(). How can I scrape the website?
url <- "http://www.sap.sp.gov.br/"
data <- url %>%
httr::GET() %>%
xml2::read_html() %>%
xml2::xml_find_all(url, '//*[#id="wrap"]/div/ul/ul/li[3]/div/div/span[1]/b')
Running the code above, I get the following error:
"Error in xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns,
num_results = Inf) : Not compatible with STRSXP: [type=NULL]."
The information that needs to be scraped is on the right side of the website:
Access the URL;
Click on "Álvaro de Carvalho";
Get the numbers after "Capacidade" (Capacity) and "População" (Population) for each prison (such as "Álvaro de Carvalho", "Andradina", "Araraquara" and so on).
Thank you.
Unfortunately, you cannot solve this problem with that strategy. The main page is complex and loads several extra files; you will notice one of them is http://www.sap.sp.gov.br/js/dados-unidades.js. This JS script loads all the information you need, but you will have to clean it up with string functions.
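A minimal sketch of that route, downloading the JS file and pulling the figures out with string functions (the "Capacidade"/"Popula" patterns below are an assumption about how the values appear inside dados-unidades.js; inspect the file and adjust the regular expressions accordingly):
library(httr)
js_url <- "http://www.sap.sp.gov.br/js/dados-unidades.js"
js_txt <- content(GET(js_url), as = "text", encoding = "UTF-8")
# grab every number that follows the labels of interest (patterns are illustrative)
capacidade <- regmatches(js_txt, gregexpr("Capacidade[^0-9]*[0-9]+", js_txt))[[1]]
populacao  <- regmatches(js_txt, gregexpr("Popula[^0-9]*[0-9]+", js_txt))[[1]]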

Having difficulty navigating webpages using rvest package

I am having real difficulty with the rvest package in R. I am trying to navigate to a particular webpage after hitting an "I Agree" button on the first webpage. Here's the link to the webpage that I begin with. The code below attempts to obtain the next webpage which has a form to fill out in order to obtain data that I will need to extract.
url <- "http://wonder.cdc.gov/mcd-icd10.html"
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[3]]
new_session <- html_session(submit_form(pgsession,pgform)$url)
pgform_new <- html_form(new_session)
The last line does not obtain the html form for the next webpage and gives me the following error in R.
Error in read_xml.response(x$response, ..., as_html = as_html) :
server error: (500) Internal Server Error
I would very much appreciate any help with both getting to the next webpage and submitting a form to obtain data. Thanks so much for your time!
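The 500 comes back from the server rather than from rvest itself, so before parsing it helps to inspect what the form submission actually sent and received. A small diagnostic sketch, assuming the same session objects as above (not a fix, just a way to see which fields and submit buttons the form exposes and what status the server returns):
library(rvest)
library(httr)
pgform                            # print the form to check its fields and submit buttons
submitted <- submit_form(pgsession, pgform)
status_code(submitted$response)   # HTTP status the server returned
submitted$url                     # URL the session actually landed on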

Scrape multiple URLs at the same time in R

Good afternoon,
Thanks for helping me out with this question.
I have a set of >5000 URLs within a data frame that I am interested in scraping for their text.
At the moment, I've figured out how to obtain the text for a single URL using the code below:
singleURL <- c("http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all")
singleText <- readLines(singleURL)
Unfortunately, when I try to scale this up with multiple URLs it gives me an "Error in file(con, "r") : invalid 'description' argument" message. Here is the code I have been trying:
multipleURL <- c("http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all", "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1407&start=1&labeltype=all", "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1975&start=1&labeltype=all")
multipleText <- readLines(multipleURL)
If anyone has any suggestions, I would be greatly appreciative.
Many thanks,
Chris
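For what it's worth, readLines() expects a single connection, which is why passing a character vector of several URLs triggers that error. A minimal sketch that loops over the vector instead (each element of the result holds the lines of one page):
# read each URL separately; the result is a list with one character vector per page
multipleText <- lapply(multipleURL, readLines)
With more than 5,000 URLs you would probably also want tryCatch() around each call and a short Sys.sleep() between requests.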
