Web scraping a table: Error using rvest and splash - r

I am trying to scrape a table from this link (https://www.sac-isc.gc.ca/eng/1620925418298/1620925434679). I have tried two approaches so far:
- Using rvest, though I couldn't find HTML nodes on the page for the table
- Using splashr, assuming the page is being loaded dynamically
Neither approach seems to be working for me. After hours of trying and reading posts/blogs, etc., I can't seem to figure out what I'm missing. Can someone please point me in the right direction, or at least tell me what I'm doing wrong? Thanks!
Scraping using rvest:
library(rvest)
library(dplyr)  # as.tbl() comes from dplyr

loadedTable <- read_html("https://www.sac-isc.gc.ca/eng/1620925418298/1620925434679") %>%
  html_nodes("table") %>%
  html_table() %>%
  as.tbl()
Error:
Error in matrix(unlist(values), ncol = width, byrow = TRUE) :
'data' must be of a vector type, was 'NULL'
Scraping using splashr:
library(splashr)
library(reticulate)
library(rvest)  # for html_node()/html_table()

install_splash()
splash("localhost") %>% splash_active()
sp <- start_splash()
pg <- render_html(url = 'https://www.sac-isc.gc.ca/eng/1620925418298/1620925434679')
stop_splash(sp)

loadedTable1 <- pg %>%
  html_node('table') %>%
  html_table()
loadedTable1
Errors:
Did not find required python module 'docker'
Error in curl::curl_fetch_memory(url, handle = handle) :
Failed to connect to localhost port 8050: Connection refused
Error in stop_splash(sp) : object 'sp' not found
Error in html_element(...) : object 'pg' not found
Error: object 'loadedTable1' not found

The table is pulled dynamically from another URI. There is some work to do to extract just what you see on the page, or whatever you actually want, as a lot of information comes back, including all of the more detailed records.
The Unix timestamp on the end of the URL is largely there to prevent cached results from being served. You can generate it yourself and sprintf() it into the rest of the URL.
library(jsonlite)
data <- jsonlite::read_json('https://www.sac-isc.gc.ca/DAM/DAM-ISC-SAC/DAM-WTR/STAGING/texte-text/lTDWA_map_data_1572010201618_eng.txt?date=1622135739812')
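For instance, a minimal sketch of generating a fresh timestamp and formatting it into the URL; that the date parameter is Unix time in milliseconds is an assumption based on the value seen above:

ts <- round(as.numeric(Sys.time()) * 1000)  # current time in milliseconds (assumed format)
url <- sprintf('https://www.sac-isc.gc.ca/DAM/DAM-ISC-SAC/DAM-WTR/STAGING/texte-text/lTDWA_map_data_1572010201618_eng.txt?date=%.0f', ts)  # %.0f avoids scientific notation
data <- jsonlite::read_json(url)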
To access all info about First Nation Kapawe'no, for example, you would do the following:
print(data$data[82])
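If you then want the records as a rectangular table, something along these lines may work as a starting point; the assumption that each element of data$data carries named scalar fields is mine, based on inspecting a single record:

library(purrr)

# Keep only the length-one (scalar) fields of each record and row-bind them
df <- map_df(data$data, ~ as.data.frame(.x[lengths(.x) == 1], stringsAsFactors = FALSE))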

Related

R webscraping "SSL certificate problem: certificate has expired" but works in browser. Need to parse HTML to JSON

I've come up with a partial solution to my question, but still need help getting to the end.
My issue is that I can no longer get JSON from a website using R, but I can still access it via my browser:
library(rvest)
library(httr)
library(jsonlite)
library(dplyr)
website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
fromJSON(website)
This now gives me:
Error in open.connection(con, "rb") : SSL certificate problem: certificate has expired
but I'm still able to visit the site in Chrome.
Ideally I'd like to find a way to get this working using fromJSON()
I don't entirely understand what's causing this error, so I tried a bunch of different solutions. I found that I could at least read the HTML using this:
website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
doc <- website %>%
  httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
  read_html()
However, from here I'm stuck. I'm struggling to parse doc into the JSON I need. I've tried things like doc %>% xmlParse() %>% xmlToList() %>% toJSON() %>% fromJSON(), but it comes out as gibberish.
So my question comes down to: (1) is there a way to get around the SSL certificate problem so that I can use fromJSON() directly again? And (2) if not, how can I convert the HTML document into a usable JSON format?
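A sketch of one way around both issues, assuming the endpoint really returns plain JSON: fetch the body as text with the relaxed SSL setting and hand it straight to fromJSON(), skipping the HTML parse entirely. Disabling certificate verification is a workaround rather than a fix; the expired certificate remains a server-side problem:

library(httr)
library(jsonlite)

website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
resp <- GET(website, config = httr::config(ssl_verifypeer = FALSE))
# Parse the raw response text directly as JSON
parsed <- fromJSON(content(resp, as = 'text', encoding = 'UTF-8'))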

Web Scraping In R readHTMLTable error with function

I'm teaching myself some basic table web scraping techniques in R, but I get the following error when running readHTMLTable():
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
I am specifically trying to read the data in the second table. I've already checked the page source to make sure that the table is formatted with <table> and <td>
release_table <- readHTMLTable("https://www.comichron.com/monthlycomicssales/1997/1997-01.html",
  header = TRUE, which = 2, stringsAsFactors = FALSE)
I would expect the output to mirror the text in the second table.
We can use rvest to get all the tables.
url <- "https://www.comichron.com/monthlycomicssales/1997/1997-01.html"
library(rvest)
tab <- url %>% read_html() %>% html_table()
I think what you are looking for is tab[[1]] or tab[[4]].
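As for the original error: the 'NULL' signature usually means readHTMLTable() never received a parsed document, and the XML package cannot fetch https URLs on its own. If you want to stay with readHTMLTable(), a sketch of a possible workaround is to download the page yourself first:

library(XML)
library(httr)

url <- "https://www.comichron.com/monthlycomicssales/1997/1997-01.html"
page <- content(GET(url), as = "text")  # fetch over https with httr
doc <- htmlParse(page, asText = TRUE)   # parse the raw HTML string
release_table <- readHTMLTable(doc, header = TRUE, which = 2, stringsAsFactors = FALSE)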

web scraping a table with R

I am trying to web scrape a table from the PitchBook website.
Simple HTML scraping does not work because PitchBook uses JavaScript rather than static HTML to load the data, so I need to execute the JS in order to extract the info from the JSON file.
This is my code:
library(httr)
library(jsonlite)
library(magrittr)
json = get("https://my.pitchbook.com/old/homeContent.64ea0536fd321cc1dd3b.js") %>%
  content(as = 'text') %>%
  fromJSON()
I get this error:
Error in get("https://my.pitchbook.com/old/homeContent.64ea0536fd321cc1dd3b.js") :
  object 'https://my.pitchbook.com/old/homeContent.64ea0536fd321cc1dd3b.js' not found
Whatever data I try to load, it returns the same error.
Would appreciate your help :)
Thank you :)
You have called base::get and not httr::GET.
So it should be:
library(httr)
library(jsonlite)
library(magrittr)
json <- GET("https://my.pitchbook.com/old/homeContent.64ea0536fd321cc1dd3b.js") %>%
  content("text") %>%
  fromJSON()
but I'm not entirely sure that your URL returns valid JSON. This in itself will give:
lexical error: invalid char in json text.
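Since the URL points at a .js bundle rather than a JSON endpoint, it may help to inspect the response before trying to parse it; a small sketch using httr (the content type in the comment is an expectation, not a verified value):

library(httr)

resp <- GET("https://my.pitchbook.com/old/homeContent.64ea0536fd321cc1dd3b.js")
http_type(resp)  # likely something like 'application/javascript', not 'application/json'
substr(content(resp, as = "text"), 1, 200)  # peek at the start of the body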

R - web scraping through multiple URLs? with rvest and purrr

I am trying to scrape football (soccer) statistics for a project I'm working on, and I'm trying to use rvest and purrr to loop through the numeric values at the end of the URL. I'm not sure what I'm missing, but here is a snippet of the code as well as the error message that keeps coming up.
library(xml2)
library(rvest)
library(purrr)
wins_URL <- "https://www.premierleague.com/stats/top/clubs/wins?se=%d"
map_df(1:15, function(i) {
  cat(".")
  page <- read_html(sprintf(wins_URL, i))
  data.frame(statTable = html_table(html_nodes(page, "td , th")))
}) -> WinsTable
Error in doc_namespaces(doc) : external pointer is not valid
I've only recently started using R, so I'm no expert and would just like to know what mistakes I'm making.
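One likely culprit is calling html_table() on td/th cells instead of on whole table nodes. A sketch of a common restructure, assuming each page actually serves a static <table> (this site may render its stats with JavaScript, in which case read_html() will find no table at all and a headless browser is needed):

library(rvest)
library(purrr)

wins_URL <- "https://www.premierleague.com/stats/top/clubs/wins?se=%d"

WinsTable <- map_df(1:15, function(i) {
  page <- read_html(sprintf(wins_URL, i))
  tbl <- html_table(html_node(page, "table"), fill = TRUE)
  tbl$se <- i  # record which ?se= value the rows came from
  tbl
})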

Scrape forum using R

I would like to seek help in scraping information from HardwareZone.
This is the link: http://www.hardwarezone.com.sg/search/forum/camera
I would like to get all the information on cameras on the forum.
library(RSelenium)
library(magrittr)
base_url = "http://www.hardwarezone.com.sg/search/forum/camera"
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
I tried using the above code for the first part, and I got an error on the last line (I'm a Mac user).
The error I got is "undefined RCurl call"; I referenced many possible solutions, but I still cannot solve it.
library(rvest)

url <- "http://www.hardwarezone.com.sg/search/forum/camera"
result <- url %>%
  html() %>%
  html_nodes(xpath = '//*[id="cse"]/table[1]') %>%
  html_table()
result
I tried another method (the code above), but it still didn't work.
Can anyone guide me through this?
Thank you.
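A sketch of the rvest attempt with two likely fixes: read_html() replaces the long-deprecated html(), and the XPath id test needs an @ ('//*[@id="cse"]/table[1]'). Note that if the search results are injected by JavaScript, no table will exist in the static HTML and a browser-driving approach such as RSelenium is needed after all:

library(rvest)

url <- "http://www.hardwarezone.com.sg/search/forum/camera"
result <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="cse"]/table[1]') %>%  # note the @ before id
  html_table()
result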
