Web Scraping In R readHTMLTable error with function - r

I'm teaching myself some basic table web scraping techniques in R. But I see the error when running the function readHTMLTable.
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
I am specifically trying to read the data in the second table. I've already checked the page source to make sure that the table is formatted with <table> and <td>
release_table <- readHTMLTable("https://www.comichron.com/monthlycomicssales/1997/
1997-01.html", header=TRUE, which=2,stringsAsFactors=F)
I would expect the output to mirror the text in the second table.

We can use rvest to get all the tables.
url <- "https://www.comichron.com/monthlycomicssales/1997/1997-01.html"
library(rvest)
tab <- url %>% read_html() %>% html_table()
I think what you are looking for is tab[[1]] or tab[[4]].

Related

R: Webscraping data not contained in HTML

I'm trying to webscrape in R from webpages such as these. But the html is only 50 lines so I'm assuming the numbers are hidden in a javascript file or on their server. I'm not sure how to find the numbers I want (e.g., the enrollment number under student population).
When I try to use rvest, as in
num <- school_webpage %>%
html_elements(".number no-mrg-btm") %>%
html_text()
I get an error that says "could not find function "html_elements"" even though I've installed and loaded rvest.
What's my best strategy for getting those various numbers and why am I getting that error message? Thnx.
That data is coming from an API request you can find in the browser network tab. It returns json. Make a request direct to that page (as you don't have a browser to do this based off landing page):
library(jsonlite)
data <- jsonlite::read_json('https://api.caschooldashboard.org/LEAs/01611766000590/6/true')
print(data$enrollment)

R - web scraping through multiple URLs? with rvest and purrr

I am trying to scrape football(soccer) statistics for a project i'm working on and i'm trying to utilise rvest and purrr to loop through the numeric values at the end of the url. I'm not sure what i'm missing but i have a snippet of the code as well as the error message that keeps coming up.
library(xml2)
library(rvest)
library(purrr)
wins_URL <- "https://www.premierleague.com/stats/top/clubs/wins?se=%d"
map_df(1:15, function(i){
cat(".")
page <- read_html(sprintf(wins_URL, i))
data.frame(statTable = html_table(html_nodes(page,"td , th")))
}) -> WinsTable
Error in doc_namespaces(doc) : external pointer is not valid
I've only recently started using R so I'm no expert and would just like to know what mistakes I'm making

Harvesting data with rvest retrieves no value from data-widget

I'm trying to harvest data using rvest (also tried using XML and selectr) but I am having difficulties with the following problem:
In my browser's web inspector the html looks like
<span data-widget="turboBinary_tradologic1_rate" class="widgetPlaceholder widgetRate rate-down">1226.45</span>
(Note: rate-downand 1226.45 are updated periodically.) I want to harvest the 1226.45 but when I run my code (below) it says there is no information stored there. Does this have something to do with
the fact that its a widget? Any suggestions on how to proceed would be appreciated.
library(rvest);library(selectr);library(XML)
zoom.turbo.url <- "https://www.zoomtrader.com/trade-now?game=turbo"
zoom.turbo <- read_html(zoom.turbo.url)
# Navigate to node
zoom.turbo <- zoom.turbo %>% html_nodes("span") %>% `[[`(90)
# No value
as.character(zoom.turbo)
html_text(zoom.turbo)
# Using XML and Selectr
doc <- htmlParse(zoom.turbo, asText = TRUE)
xmlValue(querySelector(doc, 'span'))
For websites that are difficult to scrape, for example where the content is dynamic, you can use RSelenium. With this package and a browser docker, you are able to navigate websites with R commands.
I have used this method to scrape a website that had a dynamic login script, that I could not get to work with other methods.

How can I scrape data from this website (multiple webpages) using R?

I am a beginner in scraping data from website. It seems difficult for me to interpret the structure of html using XML or other packages.
Can anyone help me to download the data from this website?
http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp
It is about the investment from China. The character set is in Chinese.
What I've tried so far:
library("rvest")
url <- "http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp"
firm <- url %>%
html() %>%
html_nodes(xpath='//*[#id="Grid1MainLayer"]/table[1]') %>%
html_table()
firm <- firm[[1]] head(firm)
You can try with the function in the XML package called readHTMLTable that should download all the tables in the page and already format it into a data.frame.
library(XML)
all_tables = readHTMLTable("http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp")
Then since there is only one table in the page you linked it should be enough to get the first element so:
target_table = all_tables[[1]]

How can I scrape this data?

I want to scrape the statistics from this page:
url <- "http://www.pgatour.com/players/player.20098.stuart-appleby.html/statistics"
Specifically, I want to grab the data in the table that's underneath Stuart's headshot. It's headlined by "Stuart Appleby - 2015 STATS PGA TOUR"
I attempt to use rvest, in combo with the Selector Gadget (http://selectorgadget.com/).
url_html <- url %>% html()
url_html %>%
html_nodes(xpath = '//*[(#id = "playerStats")]//td')
'Should' get me the table without, for example, the row on top that says "Recap -- Rank -- Additional Stats"
url_html <- url %>% html()
url_html %>%
html_nodes(xpath = '//*[(#id = "playerStats")] | //th//*[(#id = "playerStats")]//td')
'Should' get me the table with that "Recap -- Rank -- Add'l Stats" line.
Neither do.
Obvs I'm a complete newb when it comes to web scraping. When I click on 'view source' for that webpage, the data contained in the table isn't there.
In the source code, where I think the table should be starting, is this bit of code:
<script id="playerStatsTourTemplate" type="text/x-jquery-tmpl">
{{each(t, tour) tours}}
{{if pgatour.players.shouldProcessTour(tour.tourCodeLC)}}
<div class="statistics-head">
<h2 class="title">Stuart Appleby - <b>${year} STATS
.
.
.
So, it appears the table is stored somewhere (Json? Jquery? Javascript? Are those terms applicable here?) that isn't accessible to the html() function. Is there anyway to use rvest to grab this data? Is there an rvest equivalent for grabbing data that is stored in this manner?
Thanks.
I'd probably use the GET request that the page is making to get the raw data from their API and work on parsing that...
content(a) gives you a list representation... basically the output from fromJSON()
or
as(a, "character") gives you the raw JSON
library("httr")
a <- GET("http://www.pgatour.com/data/players/20098/2014stat.json")
content(a)
as(a, "character")
Check this out.
Open source project on GitHub scraping PGA data: https://github.com/zachwill/golf/blob/master/pga.py

Resources