How to scrape a reactive table in R

I'm really new to R. This is the code I normally use to scrape tables, but I couldn't get it to work because the table on this website is reactive.
This is the url: https://sgpgrid.com/filter/property-fund-management-including-reit-management-and-direct-property-fund-management
And this is the code chunk I used.
library(rvest)
d2 <- read_html("https://sgpgrid.com/filter/property-fund-management-including-reit-management-and-direct-property-fund-management")
stats <- d2 %>%
html_node(".rt-table") %>%
html_table()
stats
RStudio keeps showing "Error in html_table.xml_node(.) : html_name(x) == "table" is not TRUE" whenever I try to run the code.
Would really appreciate any help here :(

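The error means the node matched by .rt-table is not a <table> element, so html_table() cannot parse it. The data is rendered from a JSON object located in a script tag (the app's initial state, which Next.js embeds in the page). You can get it by selecting the script tag with id __NEXT_DATA__: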
library(rvest)
data <- "https://sgpgrid.com/filter/property-fund-management-including-reit-management-and-direct-property-fund-management" %>%
read_html %>%
html_nodes('script#__NEXT_DATA__') %>%
html_text()
output <- jsonlite::fromJSON(data)
print(output$props$initialState$company$companies)
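
As a rough offline illustration of the same idea, the sketch below parses a tiny made-up page instead of the live site. The HTML string, company names, and field names are invented for the example; only the script#__NEXT_DATA__ selector and the props$initialState$company$companies path come from the answer above.

```r
library(rvest)
library(jsonlite)

# Hypothetical miniature of such a page; the real JSON is far larger.
page_source <- '<html><body>
<div class="rt-table"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"initialState":{"company":{"companies":[
  {"name":"Alpha Fund","licence":"X1"},
  {"name":"Beta REIT","licence":"X2"}
]}}}}
</script>
</body></html>'

page <- read_html(page_source)

# html_table() only accepts <table> elements, so check the tag name first.
table_tag <- page %>% html_node(".rt-table") %>% html_name()

# Pull the raw JSON text out of the script tag and parse it.
next_data <- page %>%
  html_node("script#__NEXT_DATA__") %>%
  html_text()

companies <- fromJSON(next_data)$props$initialState$company$companies
print(table_tag)
print(companies)
```

Because every record has the same fields, fromJSON() simplifies the array into a data frame, which is usually what you want from a table scrape.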

Related

Web scraping tables on college basketball stats

I am new to webscraping and working on a test project in which I am trying to scrape every table of data on the following website for this particular team. There should be 15 tables but when I run my code, it only seems to pull the first 6 of the 15. How do I go about getting the rest of the tables?
Here is the code:
library(tidyverse)
library(rvest)
library(stringr)
library(lubridate)
library(magrittr)
iowa_stats<- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")
iowa_stats %>% html_table()
Edit: I dug a little deeper to see if I could get more insight. I started with the first table that doesn't appear when you call html_table, which is the 'Totals' table. I followed the HTML path all the way down to the table to figure out what's wrong, using the following code.
iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper")
This is as far as I can get before getting an error. At the next step there should be div#div_totals.table_container.is_setup, in which the table is stored, but if I add that to the above code, it doesn't exist. The following returns nothing either.
iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper") %>% html_nodes("div")
Does someone who is better with html/css have any idea why this is the case?
It looks like this webpage stores some of the tables inside HTML comments. To solve this, read and save the web page, remove the comment tags, and then process it normally.
library(rvest)
library(xml2) # write_xml() comes from xml2
library(dplyr)
iowa_stats <- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")
# Only save and work with the body
body <- html_node(iowa_stats, "body")
write_xml(body, "temp.xml")
# Remove only the comment markers, so the commented-out tables are kept
# (dropping whole lines with lines[-grep(...)] returns nothing when there is no match)
lines <- readLines("temp.xml")
lines <- gsub("<!--|-->", "", lines)
writeLines(lines, "temp2.xml")
# Read the file back in and process normally
body <- read_html("temp2.xml")
html_nodes(body, "table") %>% html_table()
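
If you would rather avoid the temporary files, the same trick can probably be done in memory. This is only a sketch on a made-up page (the table ids and values are invented), assuming the comment markers can simply be stripped from the serialised document:

```r
library(rvest)

# Made-up page: one visible table, one table wrapped in an HTML comment.
raw_html <- '<html><body>
<table id="roster"><tr><th>player</th></tr><tr><td>A</td></tr></table>
<div id="all_totals">
<!--
<table id="totals"><tr><th>pts</th></tr><tr><td>42</td></tr></table>
-->
</div>
</body></html>'

page <- read_html(raw_html)

# Only the uncommented table is visible to the parser at this point.
n_visible <- length(html_nodes(page, "table"))

# Serialise the document to text, drop only the comment markers, and re-parse.
uncommented <- gsub("<!--|-->", "", as.character(page))
tables <- read_html(uncommented) %>%
  html_nodes("table") %>%
  html_table()

print(n_visible)
print(length(tables))
```

Note that this uncomments everything on the page, so a page with ordinary prose comments would have that text injected into the body as well.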

Read HTML table using R directly from a website

I want to read covid data directly from government website: https://pikobar.jabarprov.go.id/distribution-case#
I did that using rvest library
url <- "https://pikobar.jabarprov.go.id/distribution-case#"
df <- url %>%
read_html() %>%
html_nodes("table") %>%
html_table(fill = T)
I saw someone using lapply to turn the result into a tidy table, but when I tried it the output was a mess, because I'm new to this.
Can anybody help me? I'm really frustrated.
You can't scrape the table with rvest, because the page does not contain the data in its HTML. It requests the data from the link below, with an api-key header attached:
https://dashboard-pikobar-api.digitalservice.id/v2/sebaran/pertumbuhan?wilayah=kota&=32
pg <- httr::GET(
"https://dashboard-pikobar-api.digitalservice.id/v2/sebaran/pertumbuhan?wilayah=kota&=32",
config = httr::add_headers(`api-key` = "480d0aeb78bd0064d45ef6b2254be9b3")
)
data <- httr::content(pg)$data
I don't know whether the api-key will keep working in the future, but it works for now.
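
Since the key may stop working, it can help to fail loudly instead of silently parsing an error page. A minimal sketch, assuming the endpoint returns JSON; fetch_json is a hypothetical helper name (not part of httr) and "YOUR-KEY" is a placeholder:

```r
library(httr)
library(jsonlite)

# Hypothetical helper: GET a JSON endpoint with an api-key header attached.
fetch_json <- function(url, api_key) {
  resp <- GET(url, add_headers(`api-key` = api_key))
  # Turn 4xx/5xx responses into an R error instead of parsing an error body.
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

# Build the header config on its own to check what would be sent.
# No request is made here.
cfg <- add_headers(`api-key` = "YOUR-KEY")
sent_headers <- cfg$headers
print(sent_headers)
```

With a working key, something like fetch_json(url, key)$data should give the same data element as the httr::content(pg)$data call above, though fromJSON() may simplify it into a data frame rather than a nested list.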

How to log scrape paths rvest used?

Background: Using rvest, I'd like to scrape all details of all art pieces by the painter Paolo Uccello on wikiart.org. The end result will look something like this:
> names(uccello_dt)
[1] title year style genre media imgSRC infoSRC
Problem: When a scraping attempt doesn't go as planned, I get back character(0). This isn't helpful for me in understanding exactly what path the scrape took to get character(0). I'd like to have my scrape attempts output what path it specifically took so that I can better troubleshoot my failures.
What I've tried:
I use Firefox, so after each failed attempt I go back to the web inspector tool to make sure that I am using the correct CSS selector / element tag. I've been keeping the rvest documentation by my side to better understand its functions. It's been a process of trial and error that's taking much longer than I think it should. Here's a cleaned-up version of one of many failures:
library(tidyverse)
library(data.table)
library(rvest)
sample_url <-
read_html(
"https://www.wikiart.org/en/paolo-uccello/all-works#!#filterName:all-paintings-chronologically,resultType:detailed"
)
imgSrc <-
sample_url %>%
html_nodes(".wiki-detailed-item-container") %>%
html_nodes(".masonry-detailed-artwork-item") %>%
html_nodes("aside") %>%
html_nodes(".wiki-layout-artist-image-wrapper") %>%
html_nodes("img") %>%
html_attr("src") %>%
as.character()
title <-
sample_url %>%
html_nodes(".masonry-detailed-artwork-title") %>%
html_text() %>%
as.character()
Thank you in advance.
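
One possible way to see where a chain stops matching is to apply the selectors one at a time and record how many nodes each step returns. The sketch below is only an illustration: trace_selectors is a hypothetical helper (not an rvest function), and the HTML is a made-up fragment rather than the wikiart page.

```r
library(rvest)

# Hypothetical helper: apply CSS selectors step by step and log match counts.
trace_selectors <- function(doc, selectors) {
  nodes <- doc
  counts <- integer(length(selectors))
  for (i in seq_along(selectors)) {
    nodes <- html_nodes(nodes, selectors[i])
    counts[i] <- length(nodes)
    message(sprintf("step %d: '%s' matched %d node(s)",
                    i, selectors[i], counts[i]))
  }
  data.frame(selector = selectors, matched = counts,
             stringsAsFactors = FALSE)
}

# Made-up fragment in which the last selector finds nothing.
doc <- read_html('<div class="item"><aside><p>no image here</p></aside></div>')
path_log <- trace_selectors(doc, c(".item", "aside", "img"))
print(path_log)
```

The first step that reports 0 matches is where the path breaks. If that happens at the very first selector, the content is probably not in the downloaded HTML at all and is added later by JavaScript.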

How to scrape NBA data?

I want to compare rookies across leagues with stats like Points per game (PPG) and such. ESPN and NBA have great tables to scrape from (as does Basketball-reference), but I just found out that they're not stored in html, so I can't use rvest. For context, I'm trying to scrape tables like this one (from NBA):
https://i.stack.imgur.com/SdKjE.png
I'm trying to learn how to use httr and JSON for this, but I'm running into some issues. I followed the answer in this post, but it's not working out for me.
This is what I've tried:
library(httr)
library(jsonlite)
coby.white <- GET('https://www.nba.com/players/coby/white/1629632')
out <- content(coby.white, as = "text") %>%
fromJSON(flatten = FALSE)
However, I get an error:
Error: lexical error: invalid char in json text.
<!DOCTYPE html><html class="" l
(right here) ------^
Is there an easier way to scrape a table from ESPN or NBA, or is there a solution to this issue?
PPG and other stats come from
https://data.nba.net/prod/v1/2019/players/1629632_profile.json
and player info, e.g. weight and height, comes from
https://www.nba.com/players/active_players.json
So you could use jsonlite to parse it, e.g.
library(jsonlite)
data <- jsonlite::read_json('https://data.nba.net/prod/v1/2019/players/1629632_profile.json')
You can find these in the browser's network tab when refreshing the page. It looks like you can use the player id in the URL to get a different player's info for the season.
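
The lexical error in the question shows that the response body started with <!DOCTYPE html>, i.e. the player page is HTML, not JSON. A small sketch of checking a body before parsing it, using jsonlite::validate on two made-up strings (the field names and values are invented):

```r
library(jsonlite)

# Made-up response bodies for illustration.
html_body <- '<!DOCTYPE html><html class=""><body>player page</body></html>'
json_body <- '{"player": {"id": 1, "stats": {"ppg": 10.5}}}'

# validate() reports whether a string is valid JSON without raising an error.
html_is_json <- isTRUE(validate(html_body))
json_is_json <- isTRUE(validate(json_body))

# Only parse when the body really is JSON.
parsed <- if (json_is_json) fromJSON(json_body) else NULL
ppg <- parsed$player$stats$ppg
print(c(html = html_is_json, json = json_is_json))
```

If the check fails on a real response, the URL is most likely the page itself rather than the JSON endpoint it loads its data from.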
You actually can scrape this with rvest. Here's an example of scraping White's totals table from Basketball Reference. On Sports Reference's sites, any table that is not the first on the page is wrapped in an HTML comment, so we must extract the comment nodes first and then pull the desired table out of them.
library(rvest)
library(dplyr)
cobywhite = 'https://www.basketball-reference.com/players/w/whiteco01.html'
totalsdf = cobywhite %>%
read_html %>%
html_nodes(xpath = '//comment()') %>%
html_text() %>%
paste(collapse='') %>%
read_html() %>%
html_node("#totals") %>%
html_table()
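
To check the comment-node approach without hitting the site, here is a small sketch on a made-up page. The table ids mirror the answer, but the numbers are invented.

```r
library(rvest)

# Made-up page: the #totals table only exists inside an HTML comment.
raw_html <- '<html><body>
<table id="per_game"><tr><th>G</th></tr><tr><td>1</td></tr></table>
<!-- <table id="totals"><tr><th>PTS</th></tr><tr><td>2</td></tr></table> -->
</body></html>'

page <- read_html(raw_html)

# A direct lookup finds nothing, because comments are not element nodes.
n_direct <- length(html_nodes(page, "#totals"))

# Take the text of every comment, parse it as HTML, then select the table.
totals <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_node("#totals") %>%
  html_table()

print(n_direct)
print(totals)
```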

html_table isn't recognising my node pass despite it being a table

I'm new to the rvest package. I'm trying to extract the table seen here, which consists of results from an athletics event.
https://www.decathlon2000.com/720/gotzis-2000/
Basic rvest usage seems to be: pass the URL to read_html, pick the relevant CSS selectors with the "SelectorGadget" JS bookmarklet, then pass them to html_nodes, which I've done.
gotzis2000 <- read_html("https://www.decathlon2000.com/720/gotzis-2000/")
gotzis2000 %>% html_nodes("#articlecontent td")
However, when I try and then pipe this into html_table:
gotzis2000 %>% html_nodes("#articlecontent td") %>% html_table()
I get the error: Error: html_name(x) == "table" is not TRUE.
When I pipe the above with html_text, I can see the data has been extracted, so I'm not sure what the correct procedures are from here.
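
The error message suggests that html_table() only accepts <table> nodes, while the selector above returns individual <td> cells. A sketch of the difference on a made-up fragment; whether the real page has a <table> directly under #articlecontent is an assumption, and the athletes and scores are invented:

```r
library(rvest)

# Made-up fragment standing in for the results page.
page <- read_html('<div id="articlecontent"><table>
<tr><th>Athlete</th><th>Points</th></tr>
<tr><td>Athlete A</td><td>8000</td></tr>
<tr><td>Athlete B</td><td>7900</td></tr>
</table></div>')

# The original selector yields <td> nodes, which html_table() rejects.
cells <- page %>% html_nodes("#articlecontent td")
cell_tags <- unique(html_name(cells))

# Selecting the enclosing <table> gives html_table() what it expects.
results <- page %>%
  html_node("#articlecontent table") %>%
  html_table()

print(cell_tags)
print(results)
```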
