R: scrape nested HTML table with links (table within cell)
For university research I am trying to scrape an FDA table (the site's robots.txt allows scraping this content).
The table contains 19 rows and 2 columns:
https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181
The format I am trying to extract is:
col1 col2 url_of_col2
<chr> <chr> <chr>
1 Device Classificati~ distal transcutaneous electrical stimulator for treatm~ https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpcd/classification.cfm?s~
What I achieved:
I can easily extract the items of the first column:
#library
library(tidyverse)
library(xml2)
library(rvest)
#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")
# select table of interest
html %>%
html_nodes("table") -> tables
tables[[9]] -> table
# extract col 1 items
table %>%
html_nodes("th") %>%
html_text() %>%
gsub("\n|\t|\r","",.) %>%
trimws()
#> [1] "Device Classification Name" "510(k) Number"
#> [3] "Device Name" "Applicant"
#> [5] "Applicant Contact" "Correspondent"
#> [7] "Correspondent Contact" "Regulation Number"
#> [9] "Classification Product Code" "Date Received"
#> [11] "Decision Date" "Decision"
#> [13] "Regulation Medical Specialty" "510k Review Panel"
#> [15] "summary" "Type"
#> [17] "Clinical Trials" "Reviewed by Third Party"
#> [19] "Combination Product"
Created on 2021-02-27 by the reprex package (v1.0.0)
Where I get stuck
Since some cells of column 2 contain a table, this approach does not give the same number of items:
# extract col 2 items
table %>%
html_nodes("td") %>%
html_text() %>%
gsub("\n|\t|\r","",.) %>%
trimws()
#> [1] "distal transcutaneous electrical stimulator for treatment of acute migraine"
#> [2] "K203181"
#> [3] "Nerivio, FGD000075-4.7"
#> [4] "Theranica Bioelectronics ltd4 Ha-Omanutst. Poleg Industrial Parknetanya, IL4250574"
#> [5] "Theranica Bioelectronics ltd"
#> [6] "4 Ha-Omanutst. Poleg Industrial Park"
#> [7] "netanya, IL4250574"
#> [8] "alon ironi"
#> [9] "Hogan Lovells US LLP1735 Market StreetSuite 2300philadelphia, PA 19103"
#> [10] "Hogan Lovells US LLP"
#> [11] "1735 Market Street"
#> [12] "Suite 2300"
#> [13] "philadelphia, PA 19103"
#> [14] "janice m. hogan"
#> [15] "882.5899"
#> [16] "QGT "
#> [17] "QGT "
#> [18] "10/26/2020"
#> [19] "01/22/2021"
#> [20] "substantially equivalent (SESE)"
#> [21] "Neurology"
#> [22] "Neurology"
#> [23] "summary"
#> [24] "Traditional"
#> [25] "NCT04089761"
#> [26] "No"
#> [27] "No"
Created on 2021-02-27 by the reprex package (v1.0.0)
Moreover, I could not find a way to extract the URLs of column 2.
I found a good manual for reading html tables with cells spanning multiple rows. However, I think this approach does not work for nested tables.
There is a similar question about a nested table without links (How to scrape older html with nested tables in R?) which has not been answered yet. A comment suggested this question; unfortunately, I could not apply it to my html table.
There is also the unpivotr package, which aims to read nested html tables, but I could not solve my problem with it.
Yes, the tables within the rows of the parent table do make it more difficult. The key here is to find the 27 rows of the table and then parse each row individually.
library(rvest)
library(stringr)
library(dplyr)
#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")
# select table of interest
tables <- html %>% html_nodes("table")
table <- tables[[9]]
#find all of the table's rows
trows <- table %>% html_nodes("tr")
#find the left column
leftside <- trows %>% html_node("th") %>% html_text() %>% trimws()
#find the right column (remove white at the end and in the middle)
rightside <- trows %>% html_node("td") %>% html_text() %>% str_squish() %>% trimws()
#get links
links <- trows %>% html_node("td a") %>% html_attr("href")
answer <- data.frame(leftside, rightside, links)
One will need to use paste0("https://www.accessdata.fda.gov/", answer$links) on some of the links (the relative ones) to obtain the full web address.
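For example, a minimal sketch of that step (this assumes the relative hrefs on this page begin with "/", while the remaining hrefs are already absolute; that layout is an assumption, not something verified in this answer):
# prepend the site root only to relative hrefs; absolute URLs are left alone
base_url <- "https://www.accessdata.fda.gov"
answer$links <- ifelse(
  !is.na(answer$links) & !grepl("^https?://", answer$links),
  paste0(base_url, answer$links),
  answer$links
)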
The final dataframe does have several cells containing NA; these can be removed, and the table can be cleaned up some more depending on the final requirements. See tidyr::fill() as a good starting point.
Update
To reduce the answer down to the desired 19 original rows:
library(tidyr)
#replace NA with blanks
answer$links <- replace_na(answer$links, "")
#fill in the blanks in the first column to allow for grouping
answer <- fill(answer, leftside, .direction = "down")
#Create the final results
finalanswer <- answer %>% group_by(leftside) %>%
summarize(info=paste(rightside, collapse = " "), link=first(links))
Related
How to download data from the Reptile database using r
I am using R to try to download images from the Reptile-database by filling in their form to search for specific images. For that, I am following previous suggestions on how to fill an online form from R, such as:
library(httr)
library(tidyverse)

POST(
  url = "http://reptile-database.reptarium.cz/advanced_search",
  encode = "json",
  body = list(
    genus = "Chamaeleo",
    species = "dilepis"
  )) -> res

out <- content(res)[1]
This seems to work smoothly, but my problem now is identifying the link with the correct species name in the resulting out object. This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
That page contains names with links. Thus, I would like to identify the link that takes me to the page with the correct species's table. However, I am unable to find the link, or even the name of the species, within the generated out object.
Here I only extract the links to the pictures. Simply map or apply a function to download them with download.file().
library(tidyverse)
library(rvest)

genus <- "Chamaeleo"
species <- "dilepis"

pics <- paste0(
  "http://reptile-database.reptarium.cz/species?genus=", genus,
  "&species=", species) %>%
  read_html() %>%
  html_elements("#gallery img") %>%
  html_attr("src")

 [1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
 [2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
 [3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
 [4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
 [5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
 [6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
 [7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
 [8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
 [9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"
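As a possible follow-up to the download.file() suggestion, a minimal sketch of the download step (naming each local file after the last path component of its URL is an assumption, not part of the answer above):
library(purrr)

# download each picture, writing it to a local file named after the URL basename
walk2(pics, basename(pics), ~ download.file(.x, destfile = .y, mode = "wb"))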
Information lost by html_table
I'm looking to scrape the third table off of this website and store it as a data frame. Below is a reproducible example. The third table is the one with "Isiah YOUNG" in the first row, third column.
library(rvest)
library(dplyr)

target_url <- "https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"

table <- target_url %>%
  read_html(options = c("DTDLOAD")) %>%
  html_nodes("[id^=splitevents]") # this is the correct node
So far so good. Printing table[[1]] shows the contents I want.
table[[1]]
{html_node}
<table id="splitevents" class="sortable" align="center">
 [1] <tr>\n<th class="sorttable_nosort" width="20">Pl</th>\n<th class="sorttable_nosort" width="20">Ln</th>\n<th ...
 [2] <td>1</td>\n
 [3] <td>6</td>\n
 [4] <td></td>\n
 [5] <td>Isiah YOUNG</td>\n
 [6] <td></td>\n
 [7] <td>NIKE</td>\n
 [8] <td>20.28 Q</td>\n
 [9] <td><b><font color="grey">0.184</font></b></td>
[10] <td>2</td>\n
[11] <td>7</td>\n
[12] <td></td>\n
[13] <td>Elijah HALL-THOMPSON</td>\n
[14] <td></td>\n
[15] <td>Houston</td>\n
[16] <td>20.50 Q</td>\n
[17] <td><b><font color="grey">0.200</font></b></td>
[18] <td>3</td>\n
[19] <td>9</td>\n
[20] <td></td>\n
...
However, passing this to html_table results in an empty data frame.
table[[1]] %>%
  html_table(fill = TRUE)
[1] Pl          Ln          Athlete     Affiliation Time
<0 rows> (or 0-length row.names)
How can I get the contents of table[[1]] (which clearly do exist) as a data frame?
The html is full of errors that trip up the parser, and I haven't seen any easy way to fix them. An alternative, in this particular scenario, is to use the header count to determine the appropriate column count, then derive the row count by dividing the total td count by the number of columns; use these to convert the cell text into a matrix, then a dataframe.
library(rvest)
library(dplyr)

target_url <- "https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"

table <- read_html(target_url) %>%
  html_node("#splitevents")

tds <- table %>% html_nodes("td") %>% html_text()
ths <- table %>% html_nodes("th") %>% html_text()

num_col <- length(ths)
num_row <- length(tds) / num_col

df <- tds %>%
  matrix(nrow = num_row, ncol = num_col, byrow = TRUE) %>%
  data.frame() %>%
  setNames(ths)
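A possible follow-up step, not part of the answer above: every column of df is character at this point, so you may want to coerce the purely numeric columns afterwards. A sketch, assuming columns with mixed text such as "20.28 Q" should simply stay as character:
# convert columns that parse cleanly as numbers; mixed columns are left as character
df[] <- lapply(df, type.convert, as.is = TRUE)
head(df)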
Web Scraping with R on multiple pages/links
I have a list of 5000 movies in an excel file:
Avatar
Tangled
Superman Returns
Avengers : Endgame
Man of Steel
And so on...
I need to extract the weekend collections of these movies. The weekend collections are available on the boxofficemojo.com website. With the following code, I am only able to fetch the weekend collections of one single movie, 'Avatar', since the url in the code contains only the details of 'Avatar'.
library(rvest)

webpage <- read_html("https://www.boxofficemojo.com/release/rl876971521/weekend/?ref_=bo_rl_tab#tabs")

weekend_collections <- webpage %>%
  html_nodes(".mojo-field-type-rank+ .mojo-estimatable") %>%
  html_text()
Other movies will have different urls: 5000 different movies' weekend collections live at 5000 different urls. Is it possible to just give the list of movies and ask R to fetch the weekend collections of every movie without providing the respective urls? I could add the urls manually and perform the task, but that isn't a great idea for 5000 movies. So how do I fetch the weekend collections of these 5000 movies? I am new to R. Need help.
It is possible to automate the search process on this site, since it is easy enough to generate the search string and parse the incoming html to navigate to the weekends page. The problem is that the search will sometimes generate several hits, so you can't be sure you are getting exactly the right movie; you can only examine the title afterwards to find out.
Here is a function you can use. You supply it with a movie title and it will try to get the url of the weekend collections for the original release. It selects the first hit on the search page, so you have no guarantee it's the correct movie.
get_weekend_url <- function(movie) {
  site            <- "https://www.boxofficemojo.com"
  search_query    <- paste0(site, "/search/?q=")
  search_xpath    <- "//a[@class = 'a-size-medium a-link-normal a-text-bold']"
  release_xpath   <- "//option[text() = 'Original Release']"
  territory_xpath <- "//option[text() = 'Domestic']"
  weekend         <- "weekend/?ref_=bo_rl_tab#tabs"

  movie_url <- url_escape(movie) %>%
    {gsub("%20", "+", .)} %>%
    {paste0(search_query, .)} %>%
    read_html() %>%
    html_nodes(xpath = search_xpath) %>%
    html_attr("href")

  if(!is.na(movie_url[1])) {
    release <- read_html(paste0(site, movie_url[1])) %>%
      html_node(xpath = release_xpath) %>%
      html_attr("value") %>%
      {paste0(site, .)}
  } else release <- NA

  # We can stop if there is no original release found
  if(!is.na(release)) {
    target <- read_html(release) %>%
      html_node(xpath = territory_xpath) %>%
      html_attr("value") %>%
      {paste0(site, ., weekend)}
  } else target <- "Movie not found"

  return(target)
}
Now you can use sapply to get the urls you want:
movies <- c("Avatar", "Tangled", "Superman Returns", "Avengers : Endgame", "Man of Steel")

urls <- sapply(movies, get_weekend_url)

urls
#> Avatar
#> "https://www.boxofficemojo.com/release/rl876971521/weekend/?ref_=bo_rl_tab#tabs"
#> Tangled
#> "https://www.boxofficemojo.com/release/rl980256257/weekend/?ref_=bo_rl_tab#tabs"
#> Superman Returns
#> "https://www.boxofficemojo.com/release/rl4067591681/weekend/?ref_=bo_rl_tab#tabs"
#> Avengers : Endgame
#> "https://www.boxofficemojo.com/release/rl3059975681/weekend/?ref_=bo_rl_tab#tabs"
#> Man of Steel
#> "https://www.boxofficemojo.com/release/rl4034037249/weekend/?ref_=bo_rl_tab#tabs"
Now you can use these to get the tables for each movie:
css <- ".mojo-field-type-rank+ .mojo-estimatable"

weekends <- lapply(urls, function(x) read_html(x) %>% html_nodes(css) %>% html_text)
Which gives you:
weekends
#> $`Avatar`
#>  [1] "Weekend\n " "$77,025,481" "$75,617,183"
#>  [4] "$68,490,688" "$50,306,217" "$42,785,612"
#>  [7] "$54,401,446" "$34,944,081" "$31,280,029"
#> [10] "$22,850,881" "$23,611,625" "$28,782,849"
#> [13] "$16,240,857" "$13,655,274" "$8,118,102"
#> [16] "$6,526,421" "$4,027,005" "$2,047,475"
#> [19] "$980,239" "$1,145,503" "$844,651"
#> [22] "$1,002,814" "$920,204" "$633,124"
#> [25] "$425,085" "$335,174" "$188,505"
#> [28] "$120,080" "$144,241" "$76,692"
#> [31] "$64,767" "$45,181" "$44,572"
#> [34] "$28,729" "$35,706" "$36,971"
#> [37] "$15,615" "$16,817" "$13,028"
#> [40] "$10,511"
#>
#> $Tangled
#>  [1] "Weekend\n " "$68,706,298" "$56,837,104"
#>  [4] "$48,767,052" "$21,608,891" "$14,331,687"
#>  [7] "$8,775,344" "$6,427,816" "$9,803,091"
#> [10] "$5,111,098" "$3,983,009" "$5,638,656"
#> [13] "$3,081,926" "$2,526,561" "$1,850,628"
#> [16] "$813,849" "$534,351" "$743,090"
#> [19] "$421,474" "$790,248" "$640,753"
#> [22] "$616,057" "$550,994" "$336,339"
#> [25] "$220,670" "$85,574" "$31,368"
#> [28] "$16,475" "$5,343" "$6,351"
#> [31] "$910,502" "$131,938" "$135,891"
#>
#> $`Superman Returns`
#>  [1] "Weekend\n " "$52,535,096" "$76,033,267"
#>  [4] "$21,815,243" "$12,288,317" "$7,375,213"
#>  [7] "$3,788,228" "$2,158,227" "$1,242,461"
#> [10] "$848,255" "$780,405" "$874,141"
#> [13] "$1,115,228" "$453,273" "$386,424"
#> [16] "$301,373" "$403,377" "$296,502"
#> [19] "$331,938" "$216,430" "$173,300"
#> [22] "$40,505"
#>
#> $`Avengers : Endgame`
#>  [1] "Weekend\n " "$357,115,007" "$147,383,211"
#>  [4] "$63,299,904" "$29,973,505" "$17,200,742"
#>  [7] "$22,063,855" "$8,037,491" "$4,870,963"
#> [10] "$3,725,855" "$1,987,849" "$6,108,736"
#> [13] "$3,118,317" "$2,104,276" "$1,514,741"
#> [16] "$952,609" "$383,158" "$209,992"
#> [19] "$100,749" "$50,268" "$70,775"
#> [22] "$86,837" "$12,680"
#>
#> $`Man of Steel`
#>  [1] "Weekend\n " "$116,619,362" "$41,287,206"
#>  [4] "$20,737,490" "$11,414,297" "$4,719,084"
#>  [7] "$1,819,387" "$749,233" "$466,574"
#> [10] "$750,307" "$512,308" "$353,846"
#> [13] "$290,194" "$390,175" "$120,814"
#> [16] "$61,017"
If you have 5000 movies to look up, it is going to take a long time to send and parse all these requests. Depending on your internet connection, it may well take 2-3 seconds per movie. That's not bad, but it may still be 4 hours of processing time. I would recommend starting with an empty list and writing each result to the list as it is received, so that if something breaks after an hour or two, you don't lose everything you have so far.
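A minimal sketch of that incremental approach, reusing get_weekend_url() and css from above (the saveRDS() checkpoint file and its name are assumptions added here, not part of the original advice):
# loop over the movies one at a time, saving after each request so a crash loses little work
weekends <- list()
for (movie in movies) {
  url <- get_weekend_url(movie)
  if (startsWith(url, "http")) {
    weekends[[movie]] <- read_html(url) %>% html_nodes(css) %>% html_text()
  }
  saveRDS(weekends, "weekends_checkpoint.rds")  # hypothetical checkpoint file
}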
Extract data from HTML tags in R
I have an html table I'm trying to extract data from. I need to turn row 21 into an 11-element character vector (and then do the same for all rows of data). I'm trying to write a function to do this, where dt is my data table and this is what row 21 looks like:
[1] "<tr><td>1</td><td>11 Com</td><td>b</td><td>Radial Velocity</td> <td>1</td><td>326.03</td><td>1.29</td><td></td><td>19.4</td><td></td> <td>2.7</td></tr>"
I need to get rid of all of the "<tr><td>" etc., as well as insert either a 0 or an NA where they occur back to back ("</td><td></td><td>"). Here is what I have so far. First, I keep getting the error:
Error in strsplit(a, "</td><td>") : non-character argument
f <- function(row.data){
  a <- strsplit(row.data, "<tr><td>")
  b <- unlist(strsplit(a, "</td><td>"))
}
f(dt[21])
And this has yet to address inserting 0s or NAs. I'm quite new to R, so I am super appreciative of any help.
This can be done with gsub. As commented, you should indeed escape the / with \\
dat <- "<tr><td>1</td><td>11 Com</td><td>b</td><td>Radial Velocity</td><td>1</td><td>326.03</td><td>1.29</td><td></td><td>19.4</td><td></td><td>2.7</td></tr>"

a <- gsub("<tr>", 0, dat)
a <- gsub("<td>", 0, a)
a <- gsub("<\\/td>", 0, a)
a <- gsub("<\\/tr>", 0, a)

a
[1] "0010011 Com00b00Radial Velocity0 \n0100326.03001.29000019.4000 \n02.700"
Like I mentioned above, your task is really parsing HTML, so a more appropriate method would be to use a package like rvest that's made for parsing HTML. I'm guessing this is part of a larger table, in which case you could probably use rvest::html_table to scrape data from the entire table at once. If, instead, what you have is really just strings of the HTML tags for each row, you can convert that text to its XML representation (the backbone of HTML) with read_html. Then from that XML you can pull out the <tr> tags, and from those, pull out the <td> tags. I did table rows before table cells in case there's more logic you need for keeping rows together.
library(dplyr)
library(rvest)

tags <- "<tr><td>1</td><td>11 Com</td><td>b</td><td>Radial Velocity</td> <td>1</td><td>326.03</td><td>1.29</td><td></td><td>19.4</td><td></td> <td>2.7</td></tr>"

read_html(tags) %>%
  html_nodes("tr") %>%
  html_nodes("td")
#> {xml_nodeset (11)}
#>  [1] <td>1</td>\n
#>  [2] <td>11 Com</td>\n
#>  [3] <td>b</td>\n
#>  [4] <td>Radial Velocity</td>
#>  [5] <td>1</td>\n
#>  [6] <td>326.03</td>\n
#>  [7] <td>1.29</td>\n
#>  [8] <td></td>\n
#>  [9] <td>19.4</td>\n
#> [10] <td></td>
#> [11] <td>2.7</td>
Then html_text pulls the inner text out from each tag.
read_html(tags) %>%
  html_nodes("tr") %>%
  html_nodes("td") %>%
  html_text()
#>  [1] "1"               "11 Com"          "b"
#>  [4] "Radial Velocity" "1"               "326.03"
#>  [7] "1.29"            ""                "19.4"
#> [10] ""                "2.7"
Created on 2018-10-31 by the reprex package (v0.2.1)
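The 0-or-NA part of the question isn't addressed above; a possible finishing step, assuming the empty strings returned by html_text() mark exactly the cells that should become NA:
# empty <td></td> cells come back as "", so recode them to NA (or to 0 if preferred)
vals <- read_html(tags) %>%
  html_nodes("tr") %>%
  html_nodes("td") %>%
  html_text()
vals[vals == ""] <- NA
vals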
Web scraping with R and selector gadget
I am trying to scrape data from a website using R. I am using rvest in an attempt to mimic an example scraping the IMDB page for the Lego Movie. The example advocates the use of a tool called Selector Gadget to help easily identify the html_node associated with the data you are seeking to pull.
I am ultimately interested in building a data frame that has the following schema/columns: rank, blog_name, facebook_fans, twitter_followers, alexa_rank.
My code is below. I was able to use Selector Gadget to correctly identify the html tag used in the Lego example. However, following the same process and the same code structure as the Lego example, I get NAs (...using first, NAs introduced by coercion, [1] NA).
library(rvest)

data2_html = read_html("http://blog.feedspot.com/video_game_news/")

data2_html %>%
  html_node(".stats") %>%
  html_text() %>%
  as.numeric()
I have also experimented with html_node(".stats , .stats span"), which seems to work for the "Facebook fans" column since it reports 714 matches, however only 1 number is returned.
714 matches for .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')] | .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')]/descendant-or-self::*/span: using first
{xml_node}
<td>
[1] <span>997,669</span>
This may help you:
library(rvest)

d1 <- read_html("http://blog.feedspot.com/video_game_news/")

stats <- d1 %>%
  html_nodes(".stats") %>%
  html_text()

blogname <- d1 %>%
  html_nodes(".tlink") %>%
  html_text()
Note that it is html_nodes (plural).
Result:
> head(blogname)
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games"          "Xbox Wire"                  "Official PlayStation Blog"
[5] "Nintendo Life "             "Game Informer"
> head(stats, 12)
 [1] "997,669"    "1,209,029"  "873"        "4,070,476"  "4,493,805"  "399"        "23,141,452" "10,210,993" "879"
[10] "38,019,811" "12,059,607" "500"
blogname returns the list of blog names, which is easy to manage. On the other hand, the stats info comes out mixed. This is because the stats class for the Facebook and Twitter fans is indistinguishable from one to the other. In this case the output array has the information every three numbers, that is stats = c(fb, tw, alx, fb, tw, alx, ...). You should separate each vector from this one.
FBstats = stats[seq(1, length(stats), 3)]

> head(stats[seq(1,length(stats),3)])
[1] "997,669"    "4,070,476"  "23,141,452" "38,019,811" "35,977"     "603,681"
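To finish that separation, a sketch only: this assumes the .stats values really do repeat in Facebook/Twitter/Alexa order and line up one blog per triple (not verified here), and the data frame name and columns are illustrative, chosen to match the asker's target schema.
# split the interleaved stats vector into the three columns the asker wants
video_game_blogs <- data.frame(
  blog_name         = blogname,
  facebook_fans     = stats[seq(1, length(stats), 3)],
  twitter_followers = stats[seq(2, length(stats), 3)],
  alexa_rank        = stats[seq(3, length(stats), 3)]
)
head(video_game_blogs)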
You can use html_table to extract the whole table with minimal work:
library(rvest)
library(tidyverse)

# scrape html
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html()

game_blogs <- h %>%
  html_node('table') %>%    # select enclosing table node
  html_table() %>%    # turn table into data.frame
  set_names(make.names) %>%    # make names syntactic
  mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>%    # extract title from name info
  mutate_at(3:5, parse_number) %>%    # make numbers actually numbers
  tbl_df()    # for printing

game_blogs
#> # A tibble: 119 x 5
#>     Rank                  Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank
#>    <int>                      <chr>         <dbl>             <dbl>      <dbl>
#>  1     1 Kotaku - The Gamer's Guide        997669           1209029        873
#>  2     2          IGN | Video Games       4070476           4493805        399
#>  3     3                  Xbox Wire      23141452          10210993        879
#>  4     4  Official PlayStation Blog      38019811          12059607        500
#>  5     5              Nintendo Life         35977             95044      17727
#>  6     6              Game Informer        603681           1770812      10057
#>  7     7            Reddit | Gamers       1003705            430017         25
#>  8     8                    Polygon        623808            485827       1594
#>  9     9   Xbox Live's Major Nelson         65905            993481      23114
#> 10    10                      VG247        397798            202084       3960
#> # ... with 109 more rows
It's worth checking that everything is parsed like you want, but it should be usable at this point.
This uses html_nodes (plural) and str_replace to remove commas in numbers. Not sure if these are all the stats you need.
library(rvest)
library(stringr)

data2_html = read_html("http://blog.feedspot.com/video_game_news/")

data2_html %>%
  html_nodes(".stats") %>%
  html_text() %>%
  str_replace_all(',', '') %>%
  as.numeric()