I am having real difficulty with the rvest package in R. I am trying to navigate to a particular webpage after hitting an "I Agree" button on the first webpage (the starting URL is in the code below). The code attempts to reach the next webpage, which has a form to fill out in order to obtain the data I will need to extract.
url <- "http://wonder.cdc.gov/mcd-icd10.html"
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[3]]
new_session <- html_session(submit_form(pgsession,pgform)$url)
pgform_new <- html_form(new_session)
The last line does not obtain the HTML form for the next webpage and gives me the following error in R:
Error in read_xml.response(x$response, ..., as_html = as_html) :
server error: (500) Internal Server Error
I would very much appreciate any help with both getting to the next webpage and then submitting a form there to obtain the data. Thanks so much for your time!
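A minimal sketch of the intended flow under the same pre-1.0 rvest session API, keeping the session object returned by submit_form() instead of re-opening its URL so that any cookies set by the agreement step travel with the session; whether this avoids the 500 depends on what the server expects:

library(rvest)

url <- "http://wonder.cdc.gov/mcd-icd10.html"
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[3]]      # the agreement form, index taken from the question
agreed <- submit_form(pgsession, pgform) # keep the submitted session itself
pgform_new <- html_form(agreed)          # the query form should now be reachable here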
I am trying to scrape the following page with R:
http://monitoramento.sema.mt.gov.br/simlam/ListarTituloImovelRural.aspx?modeloTitulo=117
using the code "17942" as a trial input in "Número do CAR", as I know that works (see pic below).
As far as I can tell, .aspx pages are web forms that return data via AJAX requests using a mix of JavaScript and HTML (apologies if I have this wrong).
From other Stack Overflow posts I have seen that others have used Selenium (e.g. Scrape "aspx" page with R); however, from inspecting the page it seems to be obtaining its data through a POST request.
I have tried getting the data using:
search1 <- "http://monitoramento.sema.mt.gov.br/simlam/ListarTituloImovelRural.aspx?modeloTitulo=117"
rawdata <- POST(search1,encode="form",add_headers(`txtNumeroTitulo` = 17821
,`hdnCodigoPagina` = "869E318BBE656CB74E76744CD3026"
), progress())
d_content <- content(rawdata,"text") %>%
read_html() %>%
html_text()
This was just a basic attempt to get something out that I could look through, but the data I am hoping to get (see picture) is not there. I wonder if the issue is that the POST request on the site has request headers as well as fields within the form data, which my request is not getting right.
Any help greatly appreciated!
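A hedged sketch of sending those fields in the POST body instead of as headers; the hidden-field handling and the field names here (txtNumeroTitulo plus whatever viewstate-style hidden inputs the page carries) are assumptions that would need to be checked against the real request in the network tab:

library(httr)
library(rvest)

search1 <- "http://monitoramento.sema.mt.gov.br/simlam/ListarTituloImovelRural.aspx?modeloTitulo=117"

# GET the page first so its hidden inputs (viewstate etc.) can be echoed back
pg <- GET(search1)
hidden <- html_nodes(read_html(content(pg, "text")), "input[type='hidden']")
fields <- setNames(as.list(html_attr(hidden, "value")), html_attr(hidden, "name"))

# Add the search value and POST everything as form-encoded body fields
fields$txtNumeroTitulo <- "17942"
resp <- POST(search1, body = fields, encode = "form")
d_content <- html_text(read_html(content(resp, "text")))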
I'm trying to web scrape in R from webpages such as these, but the HTML is only 50 lines, so I'm assuming the numbers are hidden in a JavaScript file or on their server. I'm not sure how to find the numbers I want (e.g., the enrollment number under student population).
When I try to use rvest, as in
num <- school_webpage %>%
html_elements(".number no-mrg-btm") %>%
html_text()
I get an error that says "could not find function "html_elements"" even though I've installed and loaded rvest.
What's my best strategy for getting those various numbers, and why am I getting that error message? Thanks.
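(A hedged aside on the error itself: html_elements() was only added in rvest 1.0.0, so on an older install the equivalent call is html_nodes(), and matching both classes on one element needs a compound selector; even then, numbers rendered client-side will not be in the static HTML, which is what the answer below addresses.)

# Older-rvest equivalent of the call above; ".number.no-mrg-btm" matches an element
# carrying both classes, whereas ".number no-mrg-btm" looks for a <no-mrg-btm> tag
num <- school_webpage %>%
  html_nodes(".number.no-mrg-btm") %>%
  html_text()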
That data is coming from an API request, which you can find in the browser's network tab. It returns JSON. Make a request directly to that endpoint (since you don't have a browser to trigger it from the landing page):
library(jsonlite)
data <- jsonlite::read_json('https://api.caschooldashboard.org/LEAs/01611766000590/6/true')
print(data$enrollment)
There are 2 parts to my question, as I explored 2 methods in this exercise; however, I succeeded in neither. It would be greatly appreciated if someone could help me out.
[PART 1:]
I am attempting to scrape data from a webpage on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, which contains data stored in a table. I have some basic knowledge of scraping data using rvest. However, using the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsAsFactors = FALSE)
html_nodes(SGXdata, ".table-container")
However, nothing is picked up by the code, and I doubt whether I'm using it correctly.
[PART 2:]
I realized that there's a small "download" button on the page which downloads exactly the data file I want in .csv format. So I was thinking of writing some code to mimic the download button, and I found this question: Using R to "click" a download file button on a webpage. However, even with some modifications to that code, I'm unable to get it to work.
There are a few filters on the webpage; mostly I will be interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata <- function(date){
  POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
       body = NULL,
       encode = "form",
       write_disk("SGXdata.csv")) -> resfile
  res <- read.csv("SGXdata.csv")
  return(res)
}
I intended to put the function input "date" into the "body" argument; however, I was unable to figure out how to do that, so I started off with "body = NULL", assuming it does no filtering. However, the result is still unsatisfactory: the downloaded file is basically empty apart from the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call returning JSON. You can find this in the network tab via dev tools.
The following returns that content. I find the total number of pages of results and loop, combining the dataframe returned from each call into one final dataframe containing all results.
library(jsonlite)

# First request: pagestart=0 returns the first page of results plus paging metadata
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data

# Template URL with a placeholder for the page number
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'

# Fetch the remaining pages (pagestart is 0-indexed, so page 0 was fetched above)
# and append each page's data to the final dataframe
if(num_pages > 1){
  for(i in seq(1, num_pages - 1)){
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
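A hedged follow-up tying this back to Part 2 of the question: the same endpoint can be wrapped in a crawlSGXdata() function that takes the business date (in yyyymmdd form, as in the URL above) as its argument. The parameter names come from the answer's URL; nothing else about the API is assumed:

library(jsonlite)

crawlSGXdata <- function(date) {
  base <- paste0(
    "https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode",
    "&category=futures&businessdatestart=", date,
    "&businessdateend=", date,
    "&pagestart=%d&pageSize=250"
  )
  first <- jsonlite::fromJSON(sprintf(base, 0))   # first page plus paging metadata
  out <- first$data
  num_pages <- first$meta$totalPages
  if (num_pages > 1) {
    for (i in seq(1, num_pages - 1)) {            # remaining 0-indexed pages
      out <- rbind(out, jsonlite::fromJSON(sprintf(base, i))$data)
    }
  }
  out
}

df <- crawlSGXdata("20190708")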
I am trying to scrape football (soccer) statistics for a project I'm working on, using rvest and purrr to loop through the numeric values at the end of the URL. I'm not sure what I'm missing, but here is a snippet of the code as well as the error message that keeps coming up.
library(xml2)
library(rvest)
library(purrr)
wins_URL <- "https://www.premierleague.com/stats/top/clubs/wins?se=%d"
map_df(1:15, function(i){
  cat(".")
  page <- read_html(sprintf(wins_URL, i))
  data.frame(statTable = html_table(html_nodes(page, "td , th")))
}) -> WinsTable
Error in doc_namespaces(doc) : external pointer is not valid
I've only recently started using R, so I'm no expert and would just like to know what mistakes I'm making.
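A hedged variant of the loop, reading each page fresh inside the call, skipping failures with possibly(), and pausing between requests; note that premierleague.com appears to render these stats tables with JavaScript, so consistently empty results would point to needing the site's underlying API or a headless browser rather than plain read_html():

library(rvest)
library(purrr)

wins_URL <- "https://www.premierleague.com/stats/top/clubs/wins?se=%d"

read_one <- possibly(function(i) {
  Sys.sleep(1)                           # be polite between requests
  page <- read_html(sprintf(wins_URL, i))
  tbls <- html_table(page)               # all <table> elements found in the HTML
  if (length(tbls) == 0) return(NULL)
  cbind(se = i, tbls[[1]])
}, otherwise = NULL)

WinsTable <- map_df(1:15, read_one)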
I am trying to extract university rankings data for the years 2017 and 2018 from the website https://www.topuniversities.com.
I am trying to run code in R but it's giving me an error.
My code:
library(rvest)
# Specifying the url for the desired website to be scraped
url <- "https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/-1/sort_by/scores_international_outlook/sort_order/asc/cols/scores"
# Reading the HTML code from the website
webpage <- read_html(url)
vignette("selectorgadget")
ranking_html <- html_nodes(url, ".namesearch , .sorting_2 , .sorting_asc")
Error:
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character"
Please help me solve the above issue; any suggestions related to web scraping are welcome.
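(A hedged aside on the error itself: html_nodes() is being given the URL string rather than the parsed document, which is exactly what "applied to an object of class "character"" is complaining about; pointing it at webpage fixes that error, though the rankings table on that page appears to be rendered client-side, which is why the answer below goes through a printed PDF instead.)

# Operate on the parsed document from read_html(), not on the url string
ranking_html <- html_nodes(webpage, ".namesearch , .sorting_2 , .sorting_asc")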
I was able to extract the tables with a different approach:
library(pagedown)
library(tabulizer)

# Print the JavaScript-rendered page to a PDF with headless Chrome...
chrome_print("https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/-1/sort_by/scores_international_outlook/sort_order/asc/cols/scores",
             "C:\\...\\table.pdf")

# ...then pull the tables back out of the PDF
table_Page_2 <- extract_tables("C:\\...\\table.pdf", pages = 2)
table_Page_2
Afterwards, we have to clean the text of the tables, but all the values are there.
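A hedged follow-up on the cleaning step: extract_tables() with no pages argument pulls every page of the PDF, and the pieces can be bound together, assuming each page yields the same number of columns:

library(tabulizer)

# Extract every page, then stack the per-page matrices into one data frame
all_tables <- extract_tables("C:\\...\\table.pdf")
combined <- do.call(rbind, lapply(all_tables, as.data.frame))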