Web scraping a table with R

I am trying to web scrape a table from the PitchBook website.
But using simple HTML scraping does not work, because PitchBook uses JavaScript instead of static HTML to load the data, so I need to execute the JS in order to extract the info from the JSON file.
This is my code:
library(httr)
library(jsonlite)
library(magrittr)
json = get("https://my.pitchbook.com/old/homeContent.64ea0536fd321cc1dd3b.js") %>%
  content(as = 'text') %>%
  fromJSON()
I get this error:
Error in get("https://my.pitchbook.com/old/homeContent.64ea0536fd321cc1dd3b.js") :
  object 'https://my.pitchbook.com/old/homeContent.64ea0536fd321cc1dd3b.js' not found
Whatever data I am trying to load, it returns the same error.
Would appreciate your help :)
Thank you :)

You have called base::get and not httr::GET.
So it should be
library(httr)
library(jsonlite)
library(magrittr)
json <- GET(
"https://my.pitchbook.com/old/homeContent.64ea0536fd321cc1dd3b.js"
) %>%
content("text") %>%
fromJSON()
But I'm not entirely sure that your url gives back valid JSON; it points to a JavaScript file, and passing its raw contents to fromJSON() will in itself give
lexical error: invalid char in json text.
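If the file is JavaScript with a JSON literal embedded in it, one option is to pull the literal out of the text before parsing. A minimal sketch, assuming the .js body contains a single {...} object; the regex here is a guess and would need adapting to the real file contents:
library(httr)
library(jsonlite)
library(stringr)
library(magrittr)

js_text <- GET("https://my.pitchbook.com/old/homeContent.64ea0536fd321cc1dd3b.js") %>%
  content(as = "text")

# grab the first {...} span, letting . match newlines too
json_text <- str_extract(js_text, regex("\\{.*\\}", dotall = TRUE))
if (!is.na(json_text)) {
  json <- fromJSON(json_text)
}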

Related

R webscraping "SSL certificate problem: certificate has expired" but works in browser. Need to parse HTML to JSON

I've come up with a partial solution to my question, but still need help getting to the end.
My issue is that I can no longer get JSON from a website using R, but I can still access it via my browser:
library(rvest)
library(httr)
library(jsonlite)
library(dplyr)
website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
fromJSON(website)
Now gives me:
Error in open.connection(con, "rb") : SSL certificate problem: certificate has expired
but I'm still able to visit the site in Chrome.
Ideally I'd like to find a way to get this working using fromJSON().
I don't entirely understand what's causing this error, so I tried a bunch of different solutions. I found that I could at least read the html using this:
website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
doc <- website %>%
httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
read_html()
However, from here I'm stuck. I'm struggling trying to parse the doc to the JSON I require. I've tried things like doc %>% xmlParse() %>% xmlToList() %>% toJSON() %>% fromJSON() but it comes out as gibberish.
So my question comes down to: 1) is there a way to get around the SSL certificate problem so that I can use fromJSON() directly again? And 2) if not, how can I sanitize the html document to get it into a usable JSON format?
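One workaround for 1), as a sketch assuming the endpoint really does return JSON: skip read_html() entirely and hand the raw response text straight to fromJSON(), keeping the ssl_verifypeer = FALSE workaround for the expired certificate:
library(httr)
library(jsonlite)

website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
# disabling peer verification bypasses the expired certificate (at a security cost)
resp <- httr::GET(website, config = httr::config(ssl_verifypeer = FALSE))
sports <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))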

Web scraping a table: Error using rvest and splash

I am trying to scrape a table from this link (https://www.sac-isc.gc.ca/eng/1620925418298/1620925434679). I have tried two approaches so far:
Using rvest, though I couldn't find HTML nodes on the page for the table
Using splash, assuming the page is being loaded dynamically
Neither approach seems to be working for me. After hours of trying and reading posts/blogs, etc., I can't seem to figure out what I'm missing. Can someone please point me in the right direction, or at least tell me what I'm doing wrong? Thanks!
Scraping using rvest:
library(rvest)
library(dplyr)  # for as.tbl()
loadedTable <- read_html("https://www.sac-isc.gc.ca/eng/1620925418298/1620925434679") %>%
  html_nodes("table") %>%
  html_table() %>%
  as.tbl()
Error:
Error in matrix(unlist(values), ncol = width, byrow = TRUE) :
'data' must be of a vector type, was 'NULL'
Scraping using splashr:
library(splashr)
library(reticulate)
install_splash()
splash("localhost") %>% splash_active()
sp <- start_splash()
pg <- render_html(url = 'https://www.sac-isc.gc.ca/eng/1620925418298/1620925434679')
stop_splash(sp)
loadedTable1 <- pg %>%
html_node('table') %>%
html_table()
loadedTable1
Error:
Did not find required python module 'docker'
Error in curl::curl_fetch_memory(url, handle = handle) :
Failed to connect to localhost port 8050: Connection refused
Error in stop_splash(sp) : object 'sp' not found
Error in html_element(...) : object 'pg' not found
Error: object 'loadedTable1' not found
The table is pulled dynamically from another URI. There is some work to do to write a process that extracts just what you see on the page, or whatever you actually want to see, as there is a lot of info brought back, including all of the more detailed records.
The unix timestamp on the end is largely about preventing cached results from being served. You can generate it yourself and sprintf it into the rest of the url.
library(jsonlite)
data <- jsonlite::read_json('https://www.sac-isc.gc.ca/DAM/DAM-ISC-SAC/DAM-WTR/STAGING/texte-text/lTDWA_map_data_1572010201618_eng.txt?date=1622135739812')
To access all info about First Nation Kapawe'no, for example, you would do the following:
print(data$data[82])
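Putting the timestamp idea into code, a sketch that generates the millisecond epoch from the system clock and sprintfs it into the URI (the base URI is the one from the snippet above):
library(jsonlite)

# unix time in milliseconds, used as the site's cache-buster
ts <- sprintf("%.0f", as.numeric(Sys.time()) * 1000)
url <- sprintf("https://www.sac-isc.gc.ca/DAM/DAM-ISC-SAC/DAM-WTR/STAGING/texte-text/lTDWA_map_data_1572010201618_eng.txt?date=%s", ts)
data <- jsonlite::read_json(url)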

Web Scraping in R Timeout

I am doing a project where I need to download FAFSA completion data from this website: https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school
I am using rvest to webscrape that data, but when I try to use the function read_html on the link, it never reads in and eventually I have to stop execution. I can read in other websites, so I'm not sure if it is a website specific issue or if I'm doing something wrong. Here is my code so far:
library(rvest)
fafsa_link <- "https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school"
read_html(fafsa_link)
Any help would be greatly appreciated! Thank you!
A user-agent header is required. The download links are also given in a JSON file. You could regex out the links (or indeed parse them out); or, as I do, regex out one link and then substitute the state code within it to get the other download urls (given the urls only vary in this aspect).
library(magrittr)
library(httr)
library(stringr)
data <- httr::GET(
  'https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school.json',
  add_headers("User-Agent" = "Mozilla/5.0")
) %>%
  content(as = "text")

# pull out the California link, then swap the state code to get Massachusetts
ca <- data %>%
  stringr::str_match(': "(.*?CA\\.xls)"') %>%
  .[2] %>%
  paste0('https://studentaid.gov', .)
ma <- gsub('CA\\.xls', 'MA\\.xls', ca)
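From there, a short follow-up sketch to pull the workbooks down to disk (the destination file names here are just examples):
# write the binary .xls files to disk
download.file(ca, destfile = "fafsa_CA.xls", mode = "wb")
download.file(ma, destfile = "fafsa_MA.xls", mode = "wb")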

Web Scraping In R readHTMLTable error with function

I'm teaching myself some basic table web scraping techniques in R, but I see this error when running the function readHTMLTable:
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
I am specifically trying to read the data in the second table. I've already checked the page source to make sure that the table is formatted with <table> and <td>.
library(XML)
release_table <- readHTMLTable("https://www.comichron.com/monthlycomicssales/1997/1997-01.html",
                               header = TRUE, which = 2, stringsAsFactors = FALSE)
I would expect the output to mirror the text in the second table.
We can use rvest to get all the tables.
url <- "https://www.comichron.com/monthlycomicssales/1997/1997-01.html"
library(rvest)
tab <- url %>% read_html() %>% html_table()
I think what you are looking for is tab[[1]] or tab[[4]].
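If you want to stay with XML::readHTMLTable, the NULL-signature error is most likely readHTMLTable failing to fetch the https URL itself, since the XML package cannot download https pages. A sketch of a workaround, assuming that is indeed the cause: download the page text yourself, then parse that:
library(XML)
library(httr)

url <- "https://www.comichron.com/monthlycomicssales/1997/1997-01.html"
# fetch the page over https ourselves, then hand the text to XML
page_text <- content(GET(url), as = "text", encoding = "UTF-8")
doc <- htmlParse(page_text, asText = TRUE)
release_table <- readHTMLTable(doc, header = TRUE, which = 2, stringsAsFactors = FALSE)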

Scraping JSON link is not working using fromJSON(url)

For web scraping I normally use the jsonlite::fromJSON(url) command, which usually does the job for me. However, this time the JSON is wrapped inside other text.
Basically like this:
jQuery([
JSON stuff that I am more used to
]);
How do I get around this easily?
The actual data looks like this when I call the address (I have formatted it to be prettier):
jQuery(
[
{"Date":"2019-05-31T00:00:00+02:00","FromTime":"2019-05-31T00:00:00+02:00","ToTime":"2019-05-31T00:15:00+02:00","Value":3315.9120000000003,"Value2":2584.244,"Value3":731.668},
{"Date":"2019-05-31T00:00:00+02:00","FromTime":"2019-05-31T00:15:00+02:00","ToTime":"2019-05-31T00:30:00+02:00","Value":3386.238,"Value2":2655.814,"Value3":730.424}
]
);
The error message I get when I try to make the function parse it is:
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
jQuery([{"Date":"2019-05-29T00:
(right here) ------^
End goal is just to have a dataframe to continue work on.
You can substr out what you want from the rvest return. The jQuery wrapper has a fixed start and end syntax:
library(rvest)
library(jsonlite)
url <- 'https://ws.50hertz.com/web02/api/PhotovoltaicActual/ListRecords?filterDateTime=2019-05-30T22:23:14.716Z&callback=jQuery&_=1559254994256'
# the JSONP response is rendered as text in the page body
r <- read_html(url) %>%
  html_node("p") %>%
  html_text()
# drop the leading "jQuery(" (7 chars) and the trailing ");"
x <- jsonlite::fromJSON(substr(r[1], 8, nchar(r) - 2))
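As an alternative to counting characters, a sketch that strips the wrapper with regexes (reusing r from above), which also copes with a differently named callback such as jQuery12345(...):
# drop everything up to and including the first "(" ...
json_text <- sub("^[^(]*\\(", "", r)
# ...and the trailing ");" (with optional trailing whitespace)
json_text <- sub("\\);?\\s*$", "", json_text)
df <- jsonlite::fromJSON(json_text)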
