Scraping table into R

Trying to scrape data into R from a table on a website, but I'm having trouble finding the table. It seems to have no distinguishable class or id, as it is labeled as . I have scraped data from many tables before but have never encountered this situation. Still a novice, and my searches so far have turned up nothing that works.
I have tried the following R commands, but they only produce "" with no data scraped. I got some results by selecting the article-template class, but the output is jumbled, so I know the data exists in the result of read_html. I am just not sure how to write the code for this type of website.
Here is what I have so far in R:
library(xml2)
library(rvest)

website <- read_html("https://baseballsavant.mlb.com/daily_matchups")

dailymatchup <- website %>%
  html_nodes(xpath = '//*[@id="leaderboard"]') %>%
  html_text()
Essentially, pulling the data out of the table should give me a quick, organizable data frame. That is the ultimate goal.
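For what it's worth, the XPath syntax itself is fine once the table actually exists in the static HTML. A minimal sketch on an inline stand-in (the markup below is made up; on baseballsavant.mlb.com the leaderboard is built by JavaScript, so read_html() may never see it and an API or headless-browser approach may be needed):

```r
library(rvest)

# Made-up markup standing in for a page where the table IS in the static HTML.
html <- '<table id="leaderboard">
  <tr><th>Player</th><th>AVG</th></tr>
  <tr><td>Smith</td><td>.300</td></tr>
</table>'

page <- read_html(html)

# Note the @id (not #id) syntax inside the XPath predicate.
tbl <- page %>%
  html_node(xpath = '//*[@id="leaderboard"]') %>%
  html_table()
```

If the same selector returns nothing on the live page, the table is almost certainly rendered client-side, and the data has to come from the site's underlying JSON endpoint or a headless browser instead.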

Related

Using rvest or RSelenium to create automated webscrape of table inside frame

I know there are plenty of resources/questions that deal with this subject, but I have been trying for days and can't seem to figure it out. I have scraped websites before, but this one is causing me problems.
The website: njaqinow.net
What I want scraped: I would like to scrape the table under the "Current Status"->"Pollutants" tab. I would like to have this scraped every time the table is updated so I can use this information inside a shiny app I am creating.
What I have tried: I have tried numerous different approaches but for simplicity I will show my most recent approach:
library(rvest)

url <- "http://www.njaqinow.net"
webpage <- read_html(url)

test <- webpage %>%
  html_node("table") %>%
  html_table()
My guess is that this is far more complicated than I originally thought, because the table appears to be inside a frame. I am not a JavaScript/HTML pro, so I am not entirely sure. Any help/guidance would be greatly appreciated!
I can contribute a solution with RSelenium. I will show how to navigate to that table and get its content. For formatting the table content I provide a link to another question; that part is outside the scope of this answer.
I think you have two challenges: switching into a frame, and switching between frames.
Switch into a frame is done by remDr$switchToFrame().
Switching between frames is discussed here: https://github.com/ropensci/RSelenium/issues/155.
In your case:
remDr$switchToFrame("contents")
...
remDr$switchToFrame(NA)
remDr$switchToFrame("contentsi")
Full code would read:
remDr$navigate("http://www.njaqinow.net")
frame1 <- remDr$findElement("xpath", "//frame[@id = 'contents']")
remDr$switchToFrame(frame1)
remDr$findElement("xpath", "//*[text() = 'Current Status']")$clickElement()
remDr$findElement("xpath", "//*[text() = 'POLLUTANTS']")$clickElement()
remDr$switchToFrame(NA)
remDr$switchToFrame("contentsi")
table <- remDr$findElement("xpath", "//table[@id = 'C1WebGrid1']")
table$getElementText()
For formatting a table you could look here:
scraping table with R using RSelenium
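getElementText() returns the whole table as one newline-separated string. A rough sketch of reshaping such a string into a data frame, using a made-up stand-in for the real table text (the column layout of the actual C1WebGrid1 table may well differ):

```r
# Made-up stand-in for the string returned by table$getElementText().
raw <- "Site PM2.5 Ozone\nNewark 12 31\nCamden 9 28"

rows  <- strsplit(raw, "\n")[[1]]   # one element per table row
cells <- strsplit(rows, " ")        # split each row into cells

# First row is the header; the rest are data.
df <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(df) <- cells[[1]]
```

Real table text often needs a smarter split (multi-word cells, tabs), so treat the single-space delimiter here as an assumption.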

download/scraping table from htmlTable

I am trying to download a csv from
https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.html
or to scrape the data from the HTML table output of the website found here
https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)]
I have tried to scrape the data using
library(rvest)
url <- read_html("https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)],sea_ice_fraction[(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)]")

test <- url %>%
  html_nodes(xpath = 'table.erd.commonBGColor.nowrap') %>%
  html_text()
And I have tried to download a csv with
download.file(url, destfile = "~/Documents/test.csv", mode = 'wb')
But neither worked. The download.file call downloaded a CSV containing only the node description, and the rvest approach gave me a huge character string on my MacBook and a NULL data frame on my Windows machine. I have also tried SelectorGadget (Chrome extension) to pick out only the data I need, but it does not seem to work on the htmlTable output.
I managed to find a workaround using the htmltab package; not sure it is optimal, though. It is a big data frame for a web page, so it took a while to load. table[2] selects the actual data table, since there are two HTML tables in the link you gave.
library(htmltab)

url1 <- "https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)]"
tbls <- htmltab(url1, which = "//table[2]")
rdf <- as.data.frame(tbls)
let me know if it helps.
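If you would rather stay within rvest, the same two-table situation can be handled by parsing all tables and keeping the second. A sketch on an inline stand-in (the real ERDDAP response has the same shape, just far larger):

```r
library(rvest)

# Inline stand-in: a small metadata table followed by the data table,
# mirroring the two-table layout of the .htmlTable response.
html <- '<table><tr><td>metadata</td></tr></table>
<table class="erd"><tr><th>sst</th></tr><tr><td>25.1</td></tr></table>'

tables <- read_html(html) %>%
  html_nodes("table") %>%
  html_table()

rdf <- tables[[2]]   # the second table holds the data
```

html_table() handles the type conversion, so the result is a ready-to-use data frame rather than raw text.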

Scraping data from a dynamic web page (.asp) with R

I'm trying to scrape some data using this code.
require(XML)
tables <- readHTMLTable('http://fantasynba.movistarplus.es/basketball/reports/player_rankings.asp')
str(tables, max.level = 1)
df <- tables$searchResults
It works perfectly, but the problem is that it only gives me data for the first 188 observations, which correspond to the players whose position is "Base". Whenever I try to get data for the "Pivot" or "Alero" players, it gives me the same info. Since the URL never changes, I don't know how to get at this info.
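Since the URL never changes, the position filter is almost certainly sent in the body of a POST request (the page is .asp and form-driven). The browser's developer tools (Network tab) will reveal the real field names; the sketch below shows the general httr pattern, where the field name "position" and its value are guesses, not the site's actual parameters:

```r
# Hypothetical form field -- inspect the actual request in your browser's
# developer tools (Network tab) to find the real name and values.
form_body <- list(position = "Pivot")

# The request itself is left commented out because the parameter name
# above is an assumption:
# resp   <- httr::POST("http://fantasynba.movistarplus.es/basketball/reports/player_rankings.asp",
#                      body = form_body, encode = "form")
# tables <- XML::readHTMLTable(httr::content(resp, "text"))
```

Once the real field name is known, the same readHTMLTable call from the question can parse the response for each position in turn.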

R: Scraping multiple tables in URL

I'm learning how to scrape information from websites using httr and XML in R. I'm getting it to work just fine for websites with just a few tables, but can't figure it out for websites with several tables. Using the following page from pro-football-reference as an example: https://www.pro-football-reference.com/boxscores/201609110atl.htm
# To get just the boxscore by quarter, which is the first table:
URL = "https://www.pro-football-reference.com/boxscores/201609080den.htm"
URL = GET(URL)
SnapTable = readHTMLTable(rawToChar(URL$content), stringsAsFactors = F)[[1]]

# Return the number of tables:
AllTables = readHTMLTable(rawToChar(URL$content), stringsAsFactors = F)
length(AllTables)
[1] 2
So I'm able to scrape info, but for some reason I can only capture the top two tables out of the 20+ on the page. For practice, I'm trying to get the "Starters" tables and the "Officials" tables.
Is my inability to get the other tables a matter of the website's setup or incorrect code?
When it comes to web scraping in R, make intensive use of the rvest package.
While getting the HTML is straightforward enough, rvest works with CSS selectors, and SelectorGadget helps you find a styling pattern for a particular table that is hopefully unique. That way you can extract exactly the tables you are looking for instead of relying on coincidence.
To get you started, read the rvest vignette for more detailed information.
# install.packages("rvest")
library(rvest)
library(magrittr)

# Store web url
fb_url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"

linescore = fb_url %>%
  read_html() %>%
  html_node(xpath = '//*[@id="content"]/div[3]/table') %>%
  html_table()
Hope this helps.
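One extra wrinkle on pro-football-reference specifically: most of the 20+ tables (the "Starters" and "Officials" tables among them) are wrapped in HTML comments, which is why readHTMLTable only sees the first two. rvest can pull them out via the comment nodes. A sketch on a made-up snippet with the same structure:

```r
library(rvest)

# Made-up snippet: a table hidden inside an HTML comment, as on
# pro-football-reference box score pages.
html <- '<div><!-- <table id="officials">
<tr><th>Official</th></tr>
<tr><td>Referee: J. Doe</td></tr>
</table> --></div>'

page <- read_html(html)

# Grab the comment text, then re-parse it as HTML.
commented <- page %>% html_nodes(xpath = "//comment()") %>% html_text()
tbl <- read_html(commented[[1]]) %>% html_node("table") %>% html_table()
```

On the real page you would filter the comment nodes for the one containing the table id you want before re-parsing.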

How can I scrape data from this website (multiple webpages) using R?

I am a beginner at scraping data from websites. I find it difficult to interpret the structure of the HTML using XML or other packages.
Can anyone help me to download the data from this website?
http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp
It is about the investment from China. The character set is in Chinese.
What I've tried so far:
library("rvest")
url <- "http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp"
firm <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="Grid1MainLayer"]/table[1]') %>%
  html_table()
firm <- firm[[1]]
head(firm)
You can try the readHTMLTable function in the XML package, which should download all the tables on the page and format each one as a data.frame.
library(XML)
all_tables = readHTMLTable("http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp")
Then, since there is only one table on the page you linked, it should be enough to get the first element:
target_table = all_tables[[1]]
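The page is paginated, so a single table only covers the first screen of results. A sketch of looping over pages; note that the query parameter name "pagenum" is a guess (check the URL the site's next-page button actually produces):

```r
base <- "http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp"

# Hypothetical pagination parameter -- verify against the real next-page URL.
urls <- paste0(base, "?pagenum=", 1:3)

# The fetch is commented out because the parameter name is an assumption:
# pages     <- lapply(urls, function(u) XML::readHTMLTable(u)[[1]])
# all_firms <- do.call(rbind, pages)
```

If the site paginates via a form POST instead of a query parameter, the same loop works with an httr::POST call per page.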

Resources