R: Webscraping data not contained in HTML

I'm trying to webscrape in R from webpages such as these, but the HTML is only 50 lines, so I'm assuming the numbers are hidden in a JavaScript file or on their server. I'm not sure how to find the numbers I want (e.g., the enrollment number under student population).
When I try to use rvest, as in
num <- school_webpage %>%
  html_elements(".number no-mrg-btm") %>%
  html_text()
I get an error that says "could not find function "html_elements"" even though I've installed and loaded rvest.
What's my best strategy for getting those various numbers, and why am I getting that error message? Thanks.

First, the error: html_elements() was only added in rvest 1.0.0, so either update rvest or use the older html_nodes(). But that alone won't help here, because the data never appears in the HTML. It is loaded by an API request, which you can find in your browser's network tab; it returns JSON. Make a request directly to that endpoint, since rvest only sees the static landing page and never runs the JavaScript that fetches the numbers:
library(jsonlite)
data <- jsonlite::read_json('https://api.caschooldashboard.org/LEAs/01611766000590/6/true')
print(data$enrollment)

Related

API Webscrape OpenFDA with R

I am scraping OpenFDA (https://open.fda.gov/apis). I know my particular query has 6974 hits, organized into 100 hits per page (the API's maximum download). I am trying to use R (rvest, jsonlite, purrr, tidyverse, httr) to download all of this data.
I checked the website with curl in the terminal and downloaded a couple of pages to see the pattern.
I've tried a few lines of code, and I can only get 100 entries to download. This code seems to work decently, but it will only pull 100 entries, i.e. one page. To skip the first 100 (which I can pull down and merge later), here is the code that I have used:
url_json <- "https://api.fda.gov/drug/label.json?api_key=YOULLHAVETOGETAKEY&search=grapefruit&limit=100&skip=6973"
raw_json <- httr::GET(url_json, httr::accept_json())
data <- httr::content(raw_json, "text")
my_content_from_json <- jsonlite::fromJSON(data)
dplyr::glimpse(my_content_from_json)
dataframe1 <- my_content_from_json$results
View(dataframe1)
SOLUTION below in the responses. Thanks!
From the comments:
It looks like the API parameters skip and limit work better than the search_after parameter. According to the documentation (open.fda.gov/apis/query-parameters), they allow pulling down 1,000 entries at a time. Passed in the query string, an example URL would be
https://api.fda.gov/drug/label.json?api_key=YOULLHAVETOGETAKEY&search=grapefruit&limit=1000&skip=0
after which you can loop to get the remaining entries with skip=1000, skip=2000, etc., as you've done above.
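That loop can be sketched as follows. The API key is the placeholder from the question (swap in your own), and page_url/fetch_all are hypothetical helper names, not part of any package:

```r
# Build one page URL; the key is the placeholder from the question.
page_url <- function(skip, limit = 1000, key = "YOULLHAVETOGETAKEY") {
  sprintf("https://api.fda.gov/drug/label.json?api_key=%s&search=grapefruit&limit=%d&skip=%d",
          key, limit, skip)
}

# One skip value per page: 0, 1000, ..., 6000 covers all 6974 hits
skips <- seq(0, 6974 - 1, by = 1000)

# Fetch every page and row-bind the $results (needs a valid API key)
fetch_all <- function() {
  pages <- lapply(skips, function(s) {
    raw <- httr::GET(page_url(s), httr::accept_json())
    jsonlite::fromJSON(httr::content(raw, "text"))$results
  })
  do.call(rbind, pages)
}
```

Binding with do.call(rbind, …) assumes each page's $results has the same columns; with openFDA's nested fields you may prefer dplyr::bind_rows().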

Web scrape synonyms

I am trying to scrape synonyms from the National Cancer Institute Thesaurus database, but I am having trouble finding the right HTML to point to. Below is my code and the data frame I am using. When I run my script to pull the synonyms, I get Error in open.connection(x, "rb") : HTTP error 404. I can't figure out what the right URL should be or how to find it.
library(xml2)
library(rvest)
library(dplyr)
library(tidyverse)
synonyms <- read_csv("terms.csv")
## list of acronyms
words <- c(synonyms$Keyword)
## Designate html link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)
Data <- data.frame(Pages = c(htmls))
results <- sapply(Data$Pages, function(url) {
  try(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('p') %>%
      html_text()
  )
})
I suspect there's a problem with this line of code:
##Designate html like and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)
Because paste0() just concatenates text, this will give you URLs like
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Ketamine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Azacitidine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Axicabtagene+Ciloleucel
While I do not have particular experience with rvest, the 404 error you see almost certainly means those URLs don't resolve to real pages. I recommend logging or printing out htmls so you can confirm that they load properly in a web browser.
I will point out that in this particular case the website offers a downloadable database; you might find it easier to download and query that offline than to do this web scraping.
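A quick way to run that check, sketched with the same URL pattern as the question (the check_status helper is hypothetical and needs httr installed):

```r
# Rebuild the URLs exactly as the question does, then inspect them
words <- c("Ketamine", "Azacitidine", "Axicabtagene Ciloleucel")
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/",
                words)
print(htmls)  # note the raw space in the last one -- already suspicious

# Hypothetical helper: report the HTTP status of each URL before scraping
check_status <- function(urls) {
  vapply(urls, function(u) httr::status_code(httr::GET(u)), integer(1))
}
# check_status(htmls)  # 404s here confirm the URL pattern is wrong
```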

R: Scraping multiple tables in URL

I'm learning how to scrape information from websites using httr and XML in R. It works just fine for websites with only a few tables, but I can't figure it out for websites with several. Take the following pro-football-reference page as an example: https://www.pro-football-reference.com/boxscores/201609110atl.htm
# To get just the boxscore by quarter, which is the first table:
URL = "https://www.pro-football-reference.com/boxscores/201609080den.htm"
URL = GET(URL)
SnapTable = readHTMLTable(rawToChar(URL$content), stringsAsFactors = F)[[1]]
# Return the number of tables:
AllTables = readHTMLTable(rawToChar(URL$content), stringsAsFactors = F)
length(AllTables)
[1] 2
So I'm able to scrape info, but for some reason I can only capture the top two tables out of the 20+ on the page. For practice, I'm trying to get the "Starters" tables and the "Officials" tables.
Is my inability to get the other tables a matter of the website's setup or incorrect code?
For web scraping in R, make intensive use of the rvest package.
Getting the HTML is the easy part; rvest works with CSS selectors, and SelectorGadget helps you find a styling pattern for a particular table that is hopefully unique. That way you can extract exactly the tables you are looking for instead of relying on coincidence.
To get you started, read the rvest vignette for more detailed information.
# install.packages("rvest")
library(rvest)
library(magrittr)

# Store web url
fb_url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"

linescore = fb_url %>%
  read_html() %>%
  html_node(xpath = '//*[@id="content"]/div[3]/table') %>%
  html_table()
Hope this helps.
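One likely reason only two tables show up: pro-football-reference is known to wrap many of its tables in HTML comments, so parsers don't see them until the comments are unwrapped. A minimal offline sketch of the idea (the tiny HTML string here stands in for the real page):

```r
library(rvest)

# A table hidden inside an HTML comment, the way this site hides
# most of its box-score tables
html <- '<html><body>
  <table><tr><td>visible</td></tr></table>
  <!-- <table><tr><td>hidden</td></tr></table> -->
</body></html>'

page <- read_html(html)
length(html_nodes(page, "table"))   # 1 -- only the visible table

# Unwrap the comments, re-parse, and the hidden table appears
hidden <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_nodes("table")
length(hidden)                      # 1
```

The same two-step parse, applied to the real page's HTML, should surface the Starters and Officials tables.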

Harvesting data with rvest retrieves no value from data-widget

I'm trying to harvest data using rvest (also tried using XML and selectr) but I am having difficulties with the following problem:
In my browser's web inspector the html looks like
<span data-widget="turboBinary_tradologic1_rate" class="widgetPlaceholder widgetRate rate-down">1226.45</span>
(Note: rate-down and 1226.45 are updated periodically.) I want to harvest the 1226.45, but when I run my code (below) it says there is no information stored there. Does this have something to do with the fact that it's a widget? Any suggestions on how to proceed would be appreciated.
library(rvest)
library(selectr)
library(XML)
zoom.turbo.url <- "https://www.zoomtrader.com/trade-now?game=turbo"
zoom.turbo <- read_html(zoom.turbo.url)
# Navigate to node
zoom.turbo <- zoom.turbo %>% html_nodes("span") %>% `[[`(90)
# No value
as.character(zoom.turbo)
html_text(zoom.turbo)
# Using XML and Selectr
doc <- htmlParse(zoom.turbo, asText = TRUE)
xmlValue(querySelector(doc, 'span'))
For websites that are difficult to scrape, for example where the content is dynamic, you can use RSelenium. With this package and a browser (for example in a Docker container), you can navigate websites with R commands.
I have used this method to scrape a website with a dynamic login script that I could not get to work with other methods.
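A minimal sketch of that approach, assuming a working Selenium setup (get_rate is a hypothetical helper; the CSS selector is taken from the question's HTML, and rsDriver starts a local Selenium server):

```r
# Hypothetical helper: open a real browser, let the JavaScript run,
# and read the rendered rate. Requires RSelenium plus a browser.
get_rate <- function(url) {
  driver <- RSelenium::rsDriver(browser = "firefox")
  on.exit(driver$client$close(), add = TRUE)
  remDr <- driver$client
  remDr$navigate(url)
  elem <- remDr$findElement(
    using = "css selector",
    "span[data-widget='turboBinary_tradologic1_rate']"
  )
  elem$getElementText()[[1]]
}
# get_rate("https://www.zoomtrader.com/trade-now?game=turbo")
```

Selecting on the data-widget attribute is more robust than indexing into the 90th span, which breaks as soon as the page layout shifts.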

rvest package to harvest instagram number of followers?

I'm trying to adapt the example code in the rvest package to get the number of followers for an Instagram account (e.g., https://www.instagram.com/bradyellison/). I tried using SelectorGadget to isolate the code for number of followers, which gave me this: ._218yx:nth-child(2) ._s53mj. But I don't get the expected followers back, and I'm not sure how to debug this. Here's my code.
# example
require(rvest)
html <- read_html("https://www.instagram.com/bradyellison/")
athlete_followers <- html_nodes(html, "._218yx:nth-child(2) ._s53mj")
length(athlete_followers)
Output is:
[1] 0
Expected followers are 12.1K. Would really appreciate help. (I've tried using the Instagram API for this first, but couldn't get it to work, perhaps because I'm in sandbox mode or something.)
You can't scrape this page with rvest because it's not a static site; the content is generated dynamically via JavaScript (e.g. try xml_text(html)). To access Instagram data you should use their API. See a full example here: https://www.r-bloggers.com/analyze-instagram-with-r/
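An offline sketch of that xml_text() diagnostic, using two toy pages (the number and tag names are made up) to show what the static HTML does and doesn't contain:

```r
library(rvest)
library(xml2)

# A server-rendered page: the number is right there in the HTML
static <- read_html('<html><body><span id="n">12100</span></body></html>')
# A client-rendered page: the HTML only ships a script stub
dynamic <- read_html('<html><body><script>renderFollowers()</script></body></html>')

xml_text(static)   # "12100" -- rvest can see it
xml_text(dynamic)  # just the script source; no follower count to scrape
```

On a page like Instagram's, the dump looks like the second case, which is why the selector matches nothing.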
