R - web scraping through multiple URLs? with rvest and purrr

I am trying to scrape football (soccer) statistics for a project I'm working on, and I'm trying to use rvest and purrr to loop through the numeric values at the end of the URL. I'm not sure what I'm missing, but below is a snippet of the code along with the error message that keeps coming up.
library(xml2)
library(rvest)
library(purrr)
wins_URL <- "https://www.premierleague.com/stats/top/clubs/wins?se=%d"
map_df(1:15, function(i){
  cat(".")
  page <- read_html(sprintf(wins_URL, i))
  data.frame(statTable = html_table(html_nodes(page,"td , th")))
}) -> WinsTable
Error in doc_namespaces(doc) : external pointer is not valid
I've only recently started using R, so I'm no expert, and I would just like to know what mistakes I'm making.
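
One thing that stands out in the snippet above is that html_table() is applied to individual td and th cells, whereas, as far as I know, it expects <table> elements. Below is a minimal sketch of the loop-and-combine pattern with that change. The two inline HTML strings, the club names and numbers, and the helper scrape_table() are made up for illustration and stand in for the downloaded pages. Whether the Premier League pages actually contain the table in their static HTML has not been checked here, so treat this as a pattern rather than a fix for that site.

```r
library(xml2)
library(rvest)
library(purrr)

# Hypothetical stand-ins for two downloaded pages (made-up data)
pages <- c(
  "<html><body><table><tr><th>Club</th><th>Wins</th></tr><tr><td>Club A</td><td>20</td></tr><tr><td>Club B</td><td>18</td></tr></table></body></html>",
  "<html><body><table><tr><th>Club</th><th>Wins</th></tr><tr><td>Club A</td><td>22</td></tr><tr><td>Club C</td><td>19</td></tr></table></body></html>"
)

# Hypothetical helper: parse one page and return its first table
scrape_table <- function(html, season) {
  page <- read_html(html)  # in real use: read_html(sprintf(wins_URL, season))
  tbl <- as.data.frame(html_table(html_node(page, "table")))
  tbl$season <- season     # remember which page each row came from
  tbl
}

# Scrape every page, then stack the per-page tables into one data frame
WinsTable <- do.call(rbind, map2(pages, seq_along(pages), scrape_table))
print(WinsTable)
```

Selecting the whole table and letting html_table() split it into rows and columns avoids having to reassemble individual cells by hand.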

Related

R: Webscraping data not contained in HTML

I'm trying to scrape data in R from webpages such as these. The HTML is only 50 lines long, so I'm assuming the numbers are loaded from a JavaScript file or from their server. I'm not sure how to find the numbers I want (e.g., the enrollment number under student population).
When I try to use rvest, as in
num <- school_webpage %>%
html_elements(".number no-mrg-btm") %>%
html_text()
I get an error that says "could not find function "html_elements"" even though I've installed and loaded rvest.
What's my best strategy for getting those various numbers, and why am I getting that error message? Thanks.

That data comes from an API request, which you can find in the browser's network tab. It returns JSON. Since R does not run the landing page's scripts the way a browser does, make the request directly to the API endpoint:
library(jsonlite)
data <- jsonlite::read_json('https://api.caschooldashboard.org/LEAs/01611766000590/6/true')
print(data$enrollment)
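
On the error message itself: html_elements() was introduced in rvest 1.0.0, so on an older installation only html_nodes() is available. Separately, a class attribute such as "number no-mrg-btm" holds two classes, and a CSS selector matches both by chaining them without a space. The sketch below uses a made-up HTML fragment with made-up values, not the real school page, to show both points.

```r
library(xml2)
library(rvest)

# html_elements() exists only in rvest >= 1.0.0; html_nodes() works in old and new versions
has_html_elements <- packageVersion("rvest") >= "1.0.0"

# Made-up fragment for illustration
page <- read_html('<div><p class="number no-mrg-btm">1,234</p><p class="number">99</p></div>')

# ".number.no-mrg-btm" matches elements that carry both classes
nums <- html_text(html_nodes(page, ".number.no-mrg-btm"))

# ".number no-mrg-btm" (with a space) instead looks for a <no-mrg-btm> tag
# nested inside a .number element, which matches nothing here
wrong <- html_nodes(page, ".number no-mrg-btm")

print(nums)
print(length(wrong))
```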

Web Scraping In R readHTMLTable error with function

I'm teaching myself some basic table web scraping techniques in R, but I get the following error when running the function readHTMLTable:
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
I am specifically trying to read the data in the second table. I've already checked the page source to make sure that the table is formatted with <table> and <td>.
release_table <- readHTMLTable("https://www.comichron.com/monthlycomicssales/1997/1997-01.html", header=TRUE, which=2, stringsAsFactors=FALSE)
I would expect the output to mirror the text in the second table.

We can use rvest to get all the tables.
url <- "https://www.comichron.com/monthlycomicssales/1997/1997-01.html"
library(rvest)
tab <- url %>% read_html() %>% html_table()
I think what you are looking for is tab[[1]] or tab[[4]].
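
To see why indexing with [[ ]] is needed: called on a whole document, html_table() returns a list with one entry per <table>, in page order. The sketch below uses a made-up two-table page, not the real Comichron HTML. A likely reason the original readHTMLTable() call failed (my assumption, not stated in the thread) is that it could not fetch the https URL itself, so it had nothing to parse.

```r
library(xml2)
library(rvest)

# Made-up page with two tables, standing in for the real download
html <- paste0(
  "<html><body>",
  "<table><tr><th>Month</th></tr><tr><td>January</td></tr></table>",
  "<table><tr><th>Title</th><th>Units</th></tr>",
  "<tr><td>Comic A</td><td>100</td></tr>",
  "<tr><td>Comic B</td><td>80</td></tr></table>",
  "</body></html>"
)

tab <- html_table(read_html(html))  # a list, one element per <table>
print(length(tab))                  # how many tables were found

second <- tab[[2]]                  # pick a table by its position on the page
print(second)
```

Printing length(tab) and a few rows of each element is usually the quickest way to find which index holds the table you want.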

Using R to mimic “clicking” a download file button on a webpage

My question has two parts, as I explored two methods in this exercise but succeeded with neither. I would greatly appreciate it if someone could help me out.
[PART 1:]
I am attempting to scrape data from a webpage on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, which contains data stored in a table. I have some basic knowledge of scraping data using rvest. However, using the Inspector in Chrome, I found the HTML hierarchy much more complex than I expected. I can see that the data I want is nested under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsASfactors = FALSE)
html_nodes(SGXdata,".table-container")
However, the code picks up nothing, and I doubt whether I'm using it correctly.
[PART 2:]
I then realized that there's a small "download" button on the page which downloads exactly the data file I want in .csv format. So I thought I would write some code to mimic the download button, and I found the question "Using R to "click" a download file button on a webpage", but I'm unable to get it to work with some modifications to that code.
There are a few filters on the webpage. Mostly I will be interested in downloading data for a particular business day while leaving the other filters blank, so I tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata = function(date){
  POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
       body = NULL,
       encode = "form",
       write_disk("SGXdata.csv")) -> resfile
  # resfile is the response object, so read the file that was written to disk
  res = read.csv("SGXdata.csv")
  return(res)
}
I intended to pass the function input "date" into the “body” argument, but I was unable to figure out how to do that, so I started with "body = NULL", assuming it would apply no filtering. However, the result is still unsatisfactory. The downloaded file is basically empty and contains the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400

The content is loaded dynamically from an API call that returns JSON. You can find this call in the network tab of the browser's dev tools.
The following returns that content. I find the total number of result pages, then loop over them, combining the data frame returned from each call into one final data frame containing all results.
library(jsonlite)
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
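if(num_pages > 1){
  # Page 0 was fetched above; this assumes pagestart is a zero-based page index,
  # so the remaining pages are 1 .. num_pages - 1
  for(i in seq(1, num_pages - 1)){
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}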
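
The paging logic can also be separated from the downloading, which makes the page range easy to check before any request is sent. The sketch below is one way to do that. The helper build_page_urls() and the example.com template are hypothetical, and it assumes that pagestart is a zero-based page index, which is suggested by the first request using pagestart=0 but is not documented in this thread.

```r
# Hypothetical helper: expand a URL template into one URL per result page.
# Assumes pages are numbered from first_page upward (zero-based by default).
build_page_urls <- function(template, total_pages, first_page = 0) {
  if (total_pages < 1) return(character(0))
  pages <- seq(from = first_page, length.out = total_pages)
  vapply(
    pages,
    function(p) sub("placeholder", as.character(p), template, fixed = TRUE),
    character(1)
  )
}

# Made-up template for illustration; the real one would be url2 from above
template <- "https://api.example.com/v1.0?pagestart=placeholder&pageSize=250"
urls <- build_page_urls(template, total_pages = 3)
print(urls)

# With network access, the pages could then be fetched and stacked, e.g.:
# df <- do.call(rbind, lapply(urls, function(u) jsonlite::fromJSON(u)$data))
```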

Difference between GET(), read_html(), getURL() in R

I'm attempting to scrape realtor.com for a school project. I have a working solution that uses a combination of the rvest and httr packages, but I want to migrate it to the RCurl package, specifically the getURLAsynchronous() function. I expect my code to scrape much faster if it can download multiple URLs at once.
Here's what I have so far:
library(RCurl)
library(rvest)
library(httr)
urls <- c("http://www.realtor.com/realestateandhomes-search/Denver_CO/pg-1?pgsz=50",
"http://www.realtor.com/realestateandhomes-search/Denver_CO/pg-2?pgsz=50")
prop.info <- vector("list", length = 0)
for (j in 1:length(urls)) {
  prop.info <- c(prop.info, urls[[j]] %>% # Appends this url's results to the list
                   GET(add_headers("user-agent" = "r")) %>%
                   read_html() %>% # creates the html object
                   html_nodes(".srp-item-body") %>% # grabs appropriate html element
                   html_text()) # converts it to a text vector
}
This gets me an output that I can readily work with. I'm getting all of the information off of the webpages, then reading all of the html from the GET() output. Next, I'm finding the html nodes, and converting it to text. The trouble I'm running into is when I attempt to implement something similar using RCurl.
Here is what I have for that using the same URLs:
getURLAsynchronous(urls) %>%
read_html() %>%
html_node(".srp-item-details") %>%
html_text
When I call getURLAsynchronous() on the urls vector, not all of the information is downloaded. I'm honestly not sure exactly what is being scraped, but I know it's considerably different from my current solution.
Any ideas what I'm doing wrong? Or maybe an explanation on how getURLAsynchronous() should be working?
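
A few differences between the two snippets may matter here. The RCurl version sends no user-agent header, it selects ".srp-item-details" rather than ".srp-item-body", and it calls html_node(), which returns only the first match, rather than html_nodes(). In addition, as far as I know getURLAsynchronous() returns a character vector with one HTML string per URL, so each string needs to be parsed separately. The sketch below shows only that last step, with made-up HTML strings standing in for the downloaded pages.

```r
library(xml2)
library(rvest)

# Stand-in for the result of getURLAsynchronous(urls): one HTML string per URL (made-up content)
raw_pages <- c(
  '<html><body><div class="srp-item-body">House A</div><div class="srp-item-body">House B</div></body></html>',
  '<html><body><div class="srp-item-body">House C</div></body></html>'
)

# Parse each page on its own, collect every matching node, then flatten the results
prop.info <- unlist(lapply(raw_pages, function(html) {
  html_text(html_nodes(read_html(html), ".srp-item-body"))
}))

print(prop.info)
```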

Harvesting data with rvest retrieves no value from data-widget

I'm trying to harvest data using rvest (also tried using XML and selectr) but I am having difficulties with the following problem:
In my browser's web inspector the html looks like
<span data-widget="turboBinary_tradologic1_rate" class="widgetPlaceholder widgetRate rate-down">1226.45</span>
(Note: rate-down and 1226.45 are updated periodically.) I want to harvest the 1226.45, but when I run my code (below) it says there is no information stored there. Does this have something to do with the fact that it's a widget? Any suggestions on how to proceed would be appreciated.
library(rvest);library(selectr);library(XML)
zoom.turbo.url <- "https://www.zoomtrader.com/trade-now?game=turbo"
zoom.turbo <- read_html(zoom.turbo.url)
# Navigate to node
zoom.turbo <- zoom.turbo %>% html_nodes("span") %>% `[[`(90)
# No value
as.character(zoom.turbo)
html_text(zoom.turbo)
# Using XML and Selectr
doc <- htmlParse(zoom.turbo, asText = TRUE)
xmlValue(querySelector(doc, 'span'))

For websites that are difficult to scrape, for example where the content is dynamic, you can use RSelenium. With this package and a browser running in a Docker container, you can navigate websites with R commands.
I have used this method to scrape a website with a dynamic login script that I could not get to work with other methods.
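
As a rough sketch of what that could look like: the helper scrape_rendered_text() below is hypothetical, and it assumes a Selenium server with Firefox is already reachable at localhost:4444 (for example from a Selenium Docker image). The host, port, browser, and the fixed two-second wait are all assumptions to adjust for your setup.

```r
# Hypothetical helper: load a page in a remote browser and return the text of one element.
# Assumes a Selenium server is already running at host:port.
scrape_rendered_text <- function(url, css, host = "localhost", port = 4444L) {
  if (!requireNamespace("RSelenium", quietly = TRUE)) {
    stop("RSelenium is not installed")
  }
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = host,
    port = port,
    browserName = "firefox"
  )
  remDr$open(silent = TRUE)
  on.exit(remDr$close(), add = TRUE)  # always close the browser session

  remDr$navigate(url)
  Sys.sleep(2)  # crude wait for the page's scripts to fill in the widget

  elem <- remDr$findElement(using = "css selector", value = css)
  elem$getElementText()[[1]]
}

# Example call (needs a running Selenium server):
# scrape_rendered_text("https://www.zoomtrader.com/trade-now?game=turbo",
#                      "span[data-widget='turboBinary_tradologic1_rate']")
```

Because the browser executes the page's JavaScript, the span should contain the rendered value by the time it is read, which a plain read_html() call cannot provide.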
