I'm learning how to scrape information from websites using httr and XML in R. I'm getting it to work just fine for websites with just a few tables, but can't figure it out for websites with several tables. Using the following page from pro-football-reference as an example: https://www.pro-football-reference.com/boxscores/201609110atl.htm
# To get just the boxscore by quarter, which is the first table:
library(httr)
library(XML)
url <- "https://www.pro-football-reference.com/boxscores/201609080den.htm"
resp <- GET(url)
SnapTable <- readHTMLTable(rawToChar(resp$content), stringsAsFactors = FALSE)[[1]]
# Return the number of tables:
AllTables <- readHTMLTable(rawToChar(resp$content), stringsAsFactors = FALSE)
length(AllTables)
[1] 2
So I'm able to scrape info, but for some reason I can only capture the top two tables out of the 20+ on the page. For practice, I'm trying to get the "Starters" tables and the "Officials" tables.
Is my inability to get the other tables a matter of the website's setup or incorrect code?
For web scraping in R, I'd recommend making heavy use of the rvest package.
Getting the HTML is the easy part; rvest then lets you target elements with CSS selectors, and SelectorGadget helps you find a selector for a particular table that is hopefully unique. That way you can extract exactly the tables you are looking for rather than relying on their position by coincidence.
To get you started, here is an example; read the rvest vignette for more detailed information.
#install.packages("rvest")
library(rvest)
library(magrittr)
# Store web url
fb_url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"
linescore <- fb_url %>%
  read_html() %>%
  html_node(xpath = '//*[@id="content"]/div[3]/table') %>%
  html_table()
Hope this helps.
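One more note on why you only captured two tables: on this particular site many of the tables appear to be wrapped in HTML comments, so a plain parse only sees the first couple (this is an assumption about how the page is built, but it is a common pattern there). A sketch that also counts the commented-out tables:

library(rvest)
library(xml2)

fb_url <- "https://www.pro-football-reference.com/boxscores/201609080den.htm"
page <- read_html(fb_url)

# Tables visible in the raw HTML (this is all readHTMLTable could see)
length(html_nodes(page, "table"))

# Tables hidden inside HTML comments: pull the comment text, re-parse it as
# HTML, and count the tables found there
hidden <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_nodes("table")
length(hidden)

# Any of these table nodes can then be converted with html_table()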
I'm trying to web scrape in R from web pages such as these, but the HTML is only about 50 lines, so I'm assuming the numbers are hidden in a JavaScript file or fetched from their server. I'm not sure how to find the numbers I want (e.g., the enrollment number under student population).
When I try to use rvest, as in
num <- school_webpage %>%
html_elements(".number no-mrg-btm") %>%
html_text()
I get an error that says "could not find function "html_elements"" even though I've installed and loaded rvest.
What's my best strategy for getting those various numbers, and why am I getting that error message? Thanks.
That data comes from an API request, which you can find in the browser's network tab; it returns JSON. Make the request directly to that endpoint (since you don't have a browser executing the landing page's JavaScript for you):
library(jsonlite)
data <- jsonlite::read_json('https://api.caschooldashboard.org/LEAs/01611766000590/6/true')
print(data$enrollment)
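As for the error message itself: html_elements() was only added in rvest 1.0.0, so an older installed version will report "could not find function"; html_nodes() is the pre-1.0 equivalent. Also, a space in a CSS selector means "descendant", so an element carrying both classes is selected with ".number.no-mrg-btm". Keep in mind that on this page the numbers are filled in by JavaScript, so a selector only helps if the value is present in the static HTML. A minimal sketch of the corrected call, assuming school_webpage came from read_html():

library(rvest)

# Works only if the number is present in the static HTML (on this site it is
# not -- it comes from the API above); html_nodes() also works on rvest < 1.0.0
num <- school_webpage %>%
  html_nodes(".number.no-mrg-btm") %>%
  html_text()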
I am trying to scrape synonyms from the National Cancer Institute Thesaurus database, but I am having trouble finding the right URL to point to. Below are my code and the data frame I am using. When I run my script to pull the synonyms I get Error in open.connection(x, "rb") : HTTP error 404. I can't seem to figure out what the right link should be or how to find it.
library(xml2)
library(rvest)
library(dplyr)
library(tidyverse)
synonyms<-read_csv("terms.csv")
##list of acronyms
words <- c(synonyms$Keyword)
##Designate html link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)
Data<-data.frame(Pages=c(htmls))
results <- sapply(Data$Pages, function(url){
  try(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('p') %>%
      html_text()
  )
})
I suspect there's a problem with this line of code:
##Designate html link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)
Because paste0() just concatenates text together, this will give you URLs like
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Ketamine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Azacitidine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Axicabtagene+Ciloleucel
While I do not have particular experience with rvest, the 404 error you see almost certainly means those URLs do not resolve to real pages. I recommend logging or printing out htmls so you can confirm that they actually work in a web browser.
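A quick way to do that check from R is to URL-encode each term and look at the HTTP status of each page before scraping it. A sketch with httr, still using the base URL from the question (whose exact format is the thing you need to confirm):

library(httr)

# URL-encode the terms (spaces, slashes, etc.) and check each page's status;
# anything other than 200 will fail in read_html()
htmls <- paste0(
  "https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/",
  vapply(words, URLencode, character(1), reserved = TRUE)
)
status <- vapply(htmls, function(u) status_code(HEAD(u)), integer(1))
print(status)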
I will point out that in this particular case the website offers a downloadable database; you might find it easier to download and query that offline than to do this web scraping.
I'm trying to scrape data into R from a table on a website but am having trouble finding the table. The table doesn't seem to have any distinguishable class or id. I have scraped data from many tables but have never encountered this situation before; I'm still a novice and my searches haven't turned up anything that works.
I have tried the following R commands, but they only produce "" with no data scraped. I did get some (jumbled) results when selecting the article-template class, so I know the data exists in the result of read_html(). I am just not sure of the proper way to write this for this type of website.
Here is what I have so far in R:
library(xml2)
library(rvest)
website <- read_html("https://baseballsavant.mlb.com/daily_matchups")
dailymatchup <- website %>%
  html_nodes(xpath = '//*[@id="leaderboard"]') %>%
  html_text()
Essentially, pulling the data from the table should give me a quick, organizable data frame, which is my ultimate goal.
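One way to narrow this down (a diagnostic sketch, not a fix) is to count how many table elements exist in the HTML that R actually receives; if there are none, the leaderboard is built client-side by JavaScript, a static html_nodes()/html_table() call will always come back empty, and you would need the site's underlying JSON endpoint or a browser-driving tool instead:

library(rvest)

website <- read_html("https://baseballsavant.mlb.com/daily_matchups")

# How many <table> elements are in the static HTML?
length(html_nodes(website, "table"))

# 0 here would mean the table is rendered by JavaScript after the page loads.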
I'm trying to harvest data using rvest (also tried using XML and selectr) but I am having difficulties with the following problem:
In my browser's web inspector the html looks like
<span data-widget="turboBinary_tradologic1_rate" class="widgetPlaceholder widgetRate rate-down">1226.45</span>
(Note: rate-down and 1226.45 are updated periodically.) I want to harvest the 1226.45, but when I run my code (below) it says there is no information stored there. Does this have something to do with the fact that it's a widget? Any suggestions on how to proceed would be appreciated.
library(rvest);library(selectr);library(XML)
zoom.turbo.url <- "https://www.zoomtrader.com/trade-now?game=turbo"
zoom.turbo <- read_html(zoom.turbo.url)
# Navigate to node
zoom.turbo <- zoom.turbo %>% html_nodes("span") %>% `[[`(90)
# No value
as.character(zoom.turbo)
html_text(zoom.turbo)
# Using XML and Selectr
doc <- htmlParse(zoom.turbo, asText = TRUE)
xmlValue(querySelector(doc, 'span'))
For websites that are difficult to scrape, for example where the content is dynamic, you can use RSelenium. With this package and a browser running in Docker, you can navigate websites with R commands.
I have used this method to scrape a website with a dynamic login script that I could not get to work with other methods.
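A minimal sketch of that setup, assuming a Selenium server is already running in a Docker container on port 4444 (the CSS selector below is a guess based on the data-widget attribute in the question):

library(RSelenium)

# Connect to a Selenium server, e.g. started with:
#   docker run -d -p 4444:4444 selenium/standalone-chrome
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444L,
                      browserName = "chrome")
remDr$open()

# Let the page run its JavaScript, then read the rendered value
remDr$navigate("https://www.zoomtrader.com/trade-now?game=turbo")
Sys.sleep(5)  # crude wait for the widget to fill in

rate <- remDr$findElement(using = "css selector",
                          value = "span[data-widget='turboBinary_tradologic1_rate']")
rate$getElementText()

remDr$close()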
I am a beginner at scraping data from websites, and I find it difficult to interpret the structure of the HTML using XML or other packages.
Can anyone help me to download the data from this website?
http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp
It is about investment from China, and the text is in Chinese.
What I've tried so far:
library("rvest")
url <- "http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp"
firm <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="Grid1MainLayer"]/table[1]') %>%
  html_table()
firm <- firm[[1]]
head(firm)
You can try the readHTMLTable() function from the XML package, which downloads all the tables on the page and formats each one as a data.frame.
library(XML)
all_tables = readHTMLTable("http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp")
Then, since there is only one table on the page you linked, it is enough to take the first element:
target_table = all_tables[[1]]
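If the Chinese characters come back garbled, the parser may have guessed the wrong character set. One option is to parse the page first with an explicit encoding and then hand the parsed document to readHTMLTable(); the "UTF-8" below is an assumption you should check against the page's meta charset tag:

library(XML)

# Parse with an explicit encoding, then extract the tables from the parsed doc
doc <- htmlParse("http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp",
                 encoding = "UTF-8")
all_tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
target_table <- all_tables[[1]]
head(target_table)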