I know that, in general, \uxxxx escape sequences are not supported inside backticks. Is there any workaround to include them in column names?
To be specific, imagine calculating Body Mass Index and adding units to column names.
Start with
dt<-data.frame(
`Weight [kg]` = runif(5,50,100),
`Height [m]` = runif(5,1.5,2),
check.names=F
)
and mutate:
> dt2<-dt %>%
mutate(
`BMI [kg/m\u00b2]`= `Weight [kg]`/`Height [m]`^2
)
This produces an error: Error: \uxxxx sequences not supported inside backticks (line 3).
My workaround is like this:
> dt2<-dt %>%
mutate(
`BMI [kg/m2]`= `Weight [kg]`/`Height [m]`^2
) %>%
set_colnames(colnames(.) %>% str_replace('2\\]', '\u00b2\\]'))
> colnames(dt2)
[1] "Weight [kg]" "Height [m]" "BMI [kg/m²]"
It gives me exactly what I want but is not very elegant.
Surprisingly, a slightly cleaner approach fails:
> dt2<-dt %>%
mutate(
`BMI [kg/m2]`= `Weight [kg]`/`Height [m]`^2
) %>%
rename_all(str_replace, '2\\]', '\u00b2\\]')
> colnames(dt2)
[1] "Weight [kg]" "Height [m]" "BMI [kg/m2]"
So, my question is: can it be done in a not-so-hacky way?
And:
yes, I'm sure I need \uxxxx in column names;
yes, I use them further on graphs;
no, I don't want to replace them with expression()s.
How about just using single quotes instead of backticks?
dt %>% mutate('BMI [kg/m\u00b2]' = `Weight [kg]`/`Height [m]`^2)
#> Weight [kg] Height [m] BMI [kg/m²]
#> 1 67.68154 1.757490 21.91211
#> 2 72.32362 1.817616 21.89151
#> 3 89.28197 1.854459 25.96146
#> 4 52.14819 1.709520 17.84395
#> 5 83.48281 1.969367 21.52502
Or double quotes?
dt %>% mutate("BMI [kg/m\u00b2]" = `Weight [kg]`/`Height [m]`^2)
#> Weight [kg] Height [m] BMI [kg/m²]
#> 1 67.68154 1.757490 21.91211
#> 2 72.32362 1.817616 21.89151
#> 3 89.28197 1.854459 25.96146
#> 4 52.14819 1.709520 17.84395
#> 5 83.48281 1.969367 21.52502
You can also use them to access items in your new data frame:
dt2$'BMI [kg/m\u00b2]'
#> [1] 21.91211 21.89151 25.96146 17.84395 21.52502
dt2$"BMI [kg/m\u00b2]"
#> [1] 21.91211 21.89151 25.96146 17.84395 21.52502
Or did you specifically need to use backticks for some reason?
Argument names don't have to be in backticks; they can be regular quoted strings. So this works fine:
dt2<-dt %>%
mutate(
"BMI [kg/m\u00b2]" = `Weight [kg]`/`Height [m]`^2
)
It will be hard to refer to that column name in expressions in later code; you'll need to specify the column by number, or use an expression like dt2["BMI [kg/m\u00b2]"] (or dt2$"BMI [kg/m\u00b2]" as used by @AllanCameron in his answer). But it will print fine:
> dt2
Weight [kg] Height [m] BMI [kg/m²]
1 51.89918 1.825124 15.58029
2 80.74140 1.602126 31.45595
3 71.35380 1.974187 18.30799
4 64.44167 1.989202 16.28580
5 76.13564 1.886232 21.39922
Edited to add: It's also fine to use
`BMI [kg/m²]`
anywhere a column name can be used; you just can't encode the special character with \uxxxx.
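Since the OP uses these names on graphs, here is a minimal sketch (assuming ggplot2 is available) of referring to such a column without backticks, via the .data pronoun and a quoted string, where the \uxxxx escape is handled by the normal string parser:
library(ggplot2)
# the quoted string is parsed as a regular string, so \u00b2 works here
ggplot(dt2, aes(x = `Height [m]`, y = .data[["BMI [kg/m\u00b2]"]])) +
  geom_point() +
  labs(y = "BMI [kg/m\u00b2]")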
I have a text column that has a tag style. I want to split this text into columns where column names are the tags with their corresponding values.
text = "{\"article_id\":-41,\"word-count\":379,\"article_date\":05012017,\"source\":\"news::abc\",\"author\":\"Peter K\",\"title\":\"The rise of AI\",\"topics\":{\"Business\":10, \"Computer\":5},\"topics-group\":[{\"primary\":\"Business\",\"secondary\":\"Computer\"}]}"
Desired output:
data = data.frame("article_id" = -41, "word-count" = 379, "article_date" = 05012017,
"source"= "news::abc", "author" = "Peter K", "title" = "The rise of AI",
"topics" = "{\"Business\":10, \"Computer\":5}",
"topics-group" = "[{\"primary\":\"Business\",\"secondary\":\"Computer\"}]")
I tried with strsplit
test = strsplit(as.character(text), ",\\\"")
test
[[1]]
[1] "{\"article_id\":-41" "word-count\":379"
[3] "article_date\":05012017" "source\":\"news::abc\""
[5] "author\":\"Peter K\"" "title\":\"The rise of AI\""
[7] "topics\":{\"Business\":10, \"Computer\":5}" "topics-group\":[{\"primary\":\"Business\""
[9] "secondary\":\"Computer\"}]}"
but there are problems with tags like topics-group which is split into 2.
My workflow thought is to complete the split, then do another split for each element to separate the tags and values. But I think there must be a better way to split and set names to these tags as column names.
We may do this in tidyverse:
Use str_replace_all to replace the digits (\\d+) after 'article_date":' with their integer form, since the leading 0 padding makes the number invalid JSON
Use fromJSON to convert the JSON to an R object
Flatten the nested list of data.frames with invoke(c, .)
Use as_tibble to convert the list to a tibble
Finally, use mdy from lubridate to convert 'article_date' to Date class
library(dplyr)
library(stringr)
library(jsonlite)
library(lubridate)
library(purrr)
text %>%
str_replace_all('(?<=article_date":)(\\d+)', as.integer) %>%
fromJSON %>%
invoke(c, .) %>%
as_tibble %>%
mutate(article_date = mdy(article_date))
Output:
# A tibble: 1 x 10
article_id `word-count` article_date source author title topics.Business topics.Computer `topics-group.prima… `topics-group.second…
<int> <int> <date> <chr> <chr> <chr> <int> <int> <chr> <chr>
1 -41 379 2017-05-01 news::a… Peter K The rise o… 10 5 Business Computer
You can add double quotes around the article_date field and parse the JSON string with jsonlite:
text <- gsub('("article_date":)(\\d+)', '\\1"\\2"', text)
library(jsonlite)
document <- fromJSON(txt=text)
> as.data.frame(document)
# article_id word.count article_date source author title topics.Business topics.Computer topics.group.primary topics.group.secondary
# 1 -41 379 05012017 news::abc Peter K The rise of AI 10 5 Business Computer
Details:
("article_date":) - Group 1: "article_date": string
(\d+) - Group 2: one or more digits.
The replacement is \1"\2": Group 1 value + Group 2 value enclosed with double quotation marks.
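If you also want article_date as a proper Date afterwards (as the first answer does), a small follow-up sketch, assuming lubridate is available:
library(lubridate)
df <- as.data.frame(document)
df$article_date <- mdy(df$article_date)  # "05012017" -> 2017-05-01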
I'm noticing some odd behavior with R regex quantifiers written as either {min,max} (as recommended in the stringr cheatsheet) vs. as {min-max}, when using the pointblank package. I expect the regexes to work with {min,max} and fail with {min-max}. However, in the two examples below, one works with {min,max} and one works with {min-max}.
Example 1 works as expected: pattern_comma works and pattern_dash does not. But example 2 behaves unexpectedly: doi_pattern_comma does not work and doi_pattern_dash does.
Any suggestions about this regex? Or might this be a bug in pointblank (in which case I can open an issue there)?
Thank you, SO community!
library(dplyr)
library(stringr)
library(pointblank)
# EXAMPLE 1
df1 <- tibble(x = c("123", "68"))
pattern_comma <- "^\\d{1,3}$"
pattern_dash <- "^\\d{1-3}$"
stringr::str_detect(df1$x, pattern_comma) #pass
#> [1] TRUE TRUE
stringr::str_detect(df1$x, pattern_dash) #fail
#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`^\d{1-3}$`)
#pass
df1 %>%
pointblank::col_vals_regex(
vars(x),
pattern_comma
)
#> # A tibble: 2 x 1
#> x
#> <chr>
#> 1 123
#> 2 68
#fail
df1 %>%
pointblank::col_vals_regex(
vars(x),
pattern_dash
)
#> Error: Exceedance of failed test units where values in `x` should have matched the regular expression: `^\d{1-3}$`.
#> The `col_vals_regex()` validation failed beyond the absolute threshold level (1).
#> * failure level (2) >= failure threshold (1)
# EXAMPLE 2
df2 <- tibble(doi = c("10.1186/s12872-020-01551-9", "10.1002/cpp.1968"))
doi_pattern_comma <- "^10\\.\\d{4,9}/[-.;()/:\\w\\d]+$"
doi_pattern_dash <- "^10\\.\\d{4-9}/[-.;()/:\\w\\d]+$"
stringr::str_detect(df2$doi, doi_pattern_comma) #pass
#> [1] TRUE TRUE
stringr::str_detect(df2$doi, doi_pattern_dash) #fail
#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`^10\.\d{4-9}/[-.;()/:\w\d]+$`)
#fail
df2 %>%
col_vals_regex(
vars(doi),
doi_pattern_comma
)
#> Error: Exceedance of failed test units where values in `doi` should have matched the regular expression: `^10\.\d{4,9}/[-.;()/:\w\d]+$`.
#> The `col_vals_regex()` validation failed beyond the absolute threshold level (1).
#> * failure level (2) >= failure threshold (1)
#pass
df2 %>%
col_vals_regex(
vars(doi),
doi_pattern_dash
)
#> # A tibble: 2 x 1
#> doi
#> <chr>
#> 1 10.1186/s12872-020-01551-9
#> 2 10.1002/cpp.1968
Created on 2021-05-09 by the reprex package (v0.3.0)
There is no doubt here: a {min-max} quantifier does not exist; you need to use {min,max}. \d{4-9} throws an exception (try it with sub and you will get invalid regular expression '\d{4-9}', reason 'Invalid contents of {}').
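For instance, reproducing that failure with base R's sub (default TRE engine):
# the malformed quantifier errors immediately
sub("\\d{4-9}", "", "10.1186")
#> Error in sub("\\d{4-9}", "", "10.1186") :
#>   invalid regular expression '\d{4-9}', reason 'Invalid contents of {}'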
The second issue is that the regex is parsed with the default TRE regex engine, and you can't use shorthand character classes like \w or \W inside bracket expressions there, so you need to use [:alnum:]_ instead of \w inside square brackets.
Now, that you know the right regex:
"^10\\.\\d{4,9}/[-.;()/:[:alnum:]_]+$"
you can dive deeper.
You can see what results you get if you use test_col_vals_regex:
> df2 %>% test_col_vals_regex(vars(doi), "^10\\.\\d{4,9}/[-.;()/:[:alnum:]_]+$")
[1] TRUE
> df2 %>% test_col_vals_regex(vars(doi), "^10\\.\\d{4-9}/[-.;()/:[:alnum:]_]+$")
[1] NA
> df2 %>% test_col_vals_regex(vars(doi), "^10\\.\\d{4,9}/[-.;()/:\\w]+$")
[1] FALSE
> df2 %>% test_col_vals_regex(vars(doi), "^10\\.\\d{4-9}/[-.;()/:\\w]+$")
[1] NA
So, all the cases where the regex is malformed return NA; validation for those items is skipped, and they pass in the end.
CONCLUSION: Always test your regex patterns for validity before using them in col_vals_regex.
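A minimal sketch of such a check; is_valid_regex is a hypothetical helper (not part of pointblank) that relies on grepl() erroring on malformed patterns:
is_valid_regex <- function(pattern) {
  # grepl() throws an error on a malformed pattern; treat that as invalid
  tryCatch({grepl(pattern, ""); TRUE}, error = function(e) FALSE)
}
is_valid_regex("^\\d{1,3}$")  # TRUE
is_valid_regex("^\\d{1-3}$")  # FALSE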
I need to extract text from a PDF. I have a list of keywords which tell me what part of the text I need to extract.
PDF looks something like this:
Schema element: Keyword1 This is my keyword
Fontsize: 14 I dont need this
Guide to complete schema element: Text text. This is the text I need and it can be between 2 and 3 lines long. And even contain multiple sentences.
Schema element: Keyword2 This is my keyword
Fontsize: 18 I dont need this
Guide to complete schema element: Text text, this is the text I need and it can be between 2 and 3 lines long. And even contain multiple sentences. This text is different from the text above.
This is my code so far:
library(pdftools)
library(pdfsearch)
library(tidyverse)
pdf <- pdf_text(dir(pattern = "*.pdf")) %>%
read_lines()
Keyword_list <- c("swDisproportionateCost", "swDisproportionateCostOtherEULegislation",
                  "swExemptionsTransboundary", "swDisproportionateCostAlternativeFinancing",
                  "swDisproportionateCostAnalysis", "swDisproportionateCostScale")
Then I tried using keyword_search but it only told me which line the keyword was in.
I would like to extract the text in italics into a new column alongside my keyword list. I think it can be done with regex, using the keyword and the bold text as start and stop markers.
Here is a link to the pdf.
https://www.dropbox.com/s/kyyzr5wnh8z87if/FINAL%20Draft4_WFD_Reporting_Guidance_2022_resource_page.pdf?dl=0
This is just a rather pedestrian text-extraction job. There are many ways to do it, and I'm sure some are more elegant than this, but this one does the job:
library(pdftools)
library(dplyr)
keywords <- pdf_text("mypdf.pdf") %>%
  strsplit("Schema element:") %>%                              # split at each schema-element heading
  lapply(function(x) x[-1]) %>%                                # drop the text before the first heading
  lapply(function(x) sapply(strsplit(x, "\r\n"), `[`, 1)) %>%  # keep only the first line: the keyword itself
  unlist %>%
  trimws()
text <- pdf_text("mypdf.pdf") %>%
  strsplit("Guidance on completion of schema element:") %>%    # split at the guidance heading
  lapply(function(x) x[-1]) %>%                                # drop the text before the first heading
  lapply(function(x) sapply(strsplit(x, ":"), `[`, 1)) %>%     # keep text up to the next "label:" field
  lapply(function(x) sapply(strsplit(x, "\r\n"),
                            function(y) paste(y[-length(y)], collapse = ""))) %>%  # rejoin wrapped lines
  unlist() %>%
  {gsub("\\s+", " ", .)} %>%                                   # collapse runs of whitespace
  trimws() %>%
  strsplit("Guidance on contents") %>%                         # cut everything after the next section
  sapply(`[`, 1)
df <- tibble(keywords, text)
So the result looks like this:
df
#> # A tibble: 15 x 2
#> keywords text
#> <chr> <chr>
#> 1 swExemption44Driver "Required. Select from the enumeration list the driver~
#> 2 swExemption45Impact "Required. Select from the enumeration list the impact~
#> 3 swExemption45Driver "Required. Select from the enumeration list the driver~
#> 4 swDisproportionateCost "Required. Indicate if disproportionate costs have bee~
#> 5 swDisproportionateCostScale "Conditional. Select from the enumeration list the sc~
#> 6 swDisproportionateCostAnalysis "Conditional. Select from the enumeration list the an~
#> 7 swDisproportionateCostAlterna~ "Conditional. Select from the enumeration list the al~
#> 8 swDisproportionateCostOtherEU~ "Conditional. Indicate whether the costs of basic mea~
#> 9 swTechnicalInfeasibility "Required. Report how ‘technical infeasibility’ has be~
#> 10 swNaturalConditions "Required. Select from the enumeration list the eleme~
#> 11 swExemption46 "Required. Select from the enumeration list the reason~
#> 12 swExemption47 "Required. Select from the enumeration list the modif~
#> 13 swExemptionsTransboundary "Required. Indicate whether the application of exempt~
#> 14 swExemptionsReference "Required. Provide references or hyperlinks to the re~
#> 15 driversSWExemptionsReference "Required. Provide references or hyperlinks to the re~
So I'm doing a project where I need to load a large number of PDFs into R. That part is mostly covered. The problem is that when importing the PDFs into R, every line is a string, not all the information in the string is relevant, and in some cases information is missing. So I want to select the info I need and place it into a tibble for further analysis.
Importing the PDFs is done with pdftools. It's working, though hints or tips are welcome:
invoice_pdfs = list.files(pattern="*.pdf") # gather all the .pdf in current wd.
invoice_list <- map(invoice_pdfs, .f = function(invoices){ # Using the purrr::map function .
pdf_text(invoices) %>% # extracting text from listed pdf file(s)
readr::read_lines() %>% # read all text from pdf
str_squish() %>% # clear all white space in text.
str_to_lower # convert string to lower case
})
reproducible example:
invoice_example <- c("invoice",
"to: rade ris",
"cane nompany",
"kakber street 23d",
"nork wey",
"+223 (0)56 015 6542",
"invoice id: 85600023",
"date reference product product reference weigth amount",
"01-02-2016 840000023 product a 24.45.6 de6583621 14.900 kg a 50 per tonne 745,00",
"07-02-2016 840000048 product b 24.45.7 qf8463641 19.000 kg a 50 per tonne 950,00",
"03-02-2016 840000032 product b 24.34.2 qf8463641 4.000 kg per tonne 250,00",
"02-02-2016 840000027 ke7801465 1.780 kg per tonne 89,00",
"subtotal 2.034,00",
"sales tax 183,06",
"total 2.217,06")
So here is where the problem starts.
What I've tried is using stringr and rebus to select specific parts of the text. I've made the following function to search the document for a specific string; it returns the row number:
word_finder <- function(x, findWord){
word_hit <- x %>% # temp for storing TRUE or FALSE
str_detect(pattern = fixed(findWord))
which(word_hit == TRUE) # give rownumber if TRUE
}
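For example, on the reproducible data above (assuming stringr's str_detect and fixed are loaded via tidyverse):
word_finder(invoice_example, "invoice id")
#> [1] 7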
And the following search patterns:
detect_date <- dgt(2) %R% "-" %R% dgt(2) %R% "-" %R% dgt(2)
detect_money <- optional(DIGIT) %R% optional(".") %R% one_or_more(DIGIT) %R% "," %R% dgt(2)
detect_invoice_num <- str_trim(SPC %R% dgt(8) %R% optional(SPC))
The next step should be to make a tibble (or data frame) with the column names c("date", "reference", "product", "product reference", "weight", "amount"). I've also tried making a tibble of the whole invoice_example; the problem is the missing info in some fields, and the column names don't match the corresponding values.
So I would like to make some function that uses the search pattern and places the matched value into a predefined column. I've got no clue how to get this done. Or maybe I should handle this completely differently?
The final result should be something like this.
reproducible example:
invoice_nr <- c("85600023", "85600023", "85600023", "85600023" )
date <- c( "01-02-2016", "07-02-2016", "03-02-2016", "02-02-2016")
reference <- c( "840000023", "840000048", "840000032", "840000027")
product_id <- c( "de6583621", "qf8463641", "qf8463641", "ke7801465")
weight <- c("14.900", "19.000", "4.000", "1.780")
amount <- c("745.00", "950.00", "250.00", "89.00")
example_tibble <- tibble(invoice_nr, date, reference, product_id, weight, amount)
Result:
# A tibble: 4 x 6
invoice_nr date reference product_id weight amount
<chr> <chr> <chr> <chr> <chr> <chr>
1 85600023 01-02-2016 840000023 de6583621 14.900 745.00
2 85600023 07-02-2016 840000048 qf8463641 19.000 950.00
3 85600023 03-02-2016 840000032 qf8463641 4.000 250.00
4 85600023 02-02-2016 840000027 ke7801465 1.780 89.00
Any suggested ways of dealing with this will be appreciated!
Actually you can use the functions of library(stringr) to achieve your goal (I skipped the rebus part, as it seems to be anyway 'just' a helper for creating the regex, which I did by hand):
library(tidyverse)
parse_invoice <- function(in_text) {
## define regex, some assumptions:
## product id is 2 lower characters followed by 7 digits
## weight is some digits with a dot followed by kg
## amount is some digits at the end with a comma
all_regex <- list(date = "\\d{2}-\\d{2}-\\d{4}",
reference = "\\d{9}",
product_id = "[a-z]{2}\\d{7}",
weight = "\\d+\\.\\d+ kg",
amount = "\\d+,\\d+$")
## look only at lines where there is invoice data
rel_lines <- str_subset(in_text, all_regex$date)
## extract the pieces from the regex
ret <- as_tibble(map(all_regex, str_extract, string = rel_lines))
## clean up the data
ret %>%
mutate(invoice_nr = str_extract(str_subset(in_text, "invoice id:"), "\\d{8}"),
date = as.Date(date, "%d-%m-%Y"),
weight = as.numeric(str_replace(weight, "(\\d+.\\d+) kg", "\\1")),
amount = as.numeric(str_replace(amount, ",", "."))
) %>%
select(invoice_nr,
date,
reference,
product_id,
weight,
amount)
}
str(parse_invoice(invoice_example))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 6 variables:
# $ invoice_nr: chr "85600023" "85600023" "85600023" "85600023"
# $ date : Date, format: "2016-02-01" "2016-02-07" ...
# $ reference : chr "840000023" "840000048" "840000032" "840000027"
# $ product_id: chr "de6583621" "qf8463641" "qf8463641" "ke7801465"
# $ weight : num 14.9 19 4 1.78
# $ amount : num 745 950 250 89
Since I'm not familiar with rebus, I've rewritten your code. Assuming the invoices are all structured in roughly the same way, I could generate a tibble from your example. You would just have to apply this to your whole list and then purrr::reduce it to a big tibble (a sketch of that step follows the code):
df <- tibble(date = na.omit(str_extract(invoice_example, "\\d{2}-\\d{2}-\\d{4}")))
df <- df %>% mutate(invoice_nr = na.omit(sub("invoice id: ", "", str_extract(invoice_example, "invoice id: [0-9]+"))),
                    reference = na.omit(sub("\\d{2}-\\d{2}-\\d{4} ", "", str_extract(invoice_example, "\\d{2}-\\d{2}-\\d{4} \\d{9}"))),
                    product_id = na.omit(str_extract(invoice_example, "[:lower:]{2}\\d{7}")),
                    weight = na.omit(sub(" kg", "", str_extract(invoice_example, "[0-9\\.]+ kg"))),
                    amount = na.omit(sub("tonne ", "", str_extract(invoice_example, "tonne [0-9,]+"))))
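To run this over the whole list of imported invoices and combine the results, a sketch of that purrr::reduce step (assuming the pipeline above is wrapped in a function, e.g. the parse_invoice() from the first answer, and that each element of invoice_list holds one invoice's lines):
library(purrr)
all_invoices <- invoice_list %>%
  map(parse_invoice) %>%   # one tibble per invoice
  reduce(bind_rows)        # stack them into one big tibble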
I am trying to scrape data from a website using R. I am using rvest in an attempt to mimic an example scraping the IMDB page for the Lego Movie. The example advocates use of a tool called Selector Gadget to help easily identify the html_node associated with the data you are seeking to pull.
I am ultimately interested in building a data frame that has the following schema/columns:
rank, blog_name, facebook_fans, twitter_followers, alexa_rank.
I was able to use Selector Gadget to correctly identify the html tag used in the Lego example. However, following the same process and the same code structure as the Lego example, I get NAs (a "...using first" message, then "NAs introduced by coercion", then [1] NA). My code is below:
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
html_node(".stats") %>%
html_text() %>%
as.numeric()
I have also experimented with html_node(".stats , .stats span"), which seems to work for the "Facebook fans" column since it reports 714 matches; however, only one number is returned:
714 matches for .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')] | .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')]/descendant-or-self::*/span: using first{xml_node}
<td>
[1] <span>997,669</span>
This may help you:
library(rvest)
d1 <- read_html("http://blog.feedspot.com/video_game_news/")
stats <- d1 %>%
html_nodes(".stats") %>%
html_text()
blogname <- d1%>%
html_nodes(".tlink") %>%
html_text()
Note that it is html_nodes (plural)
Result:
> head(blogname)
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games" "Xbox Wire" "Official PlayStation Blog"
[5] "Nintendo Life " "Game Informer"
> head(stats,12)
[1] "997,669" "1,209,029" "873" "4,070,476" "4,493,805" "399" "23,141,452" "10,210,993" "879"
[10] "38,019,811" "12,059,607" "500"
blogname returns the list of blog names, which is easy to manage. On the other hand, the stats info comes out mixed. This is because the stats class makes the Facebook, Twitter, and Alexa figures indistinguishable from one another. In this case the output array carries the information in groups of three, that is, stats = c(fb, tw, alx, fb, tw, alx, ...). You should separate each vector from this one, as sketched below.
FBstats = stats[seq(1,length(stats),3)]
> head(stats[seq(1,length(stats),3)])
[1] "997,669" "4,070,476" "23,141,452" "38,019,811" "35,977" "603,681"
You can use html_table to extract the whole table with minimal work:
library(rvest)
library(tidyverse)
# scrape html
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html()
game_blogs <- h %>%
html_node('table') %>% # select enclosing table node
html_table() %>% # turn table into data.frame
set_names(make.names) %>% # make names syntactic
mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>% # extract title from name info
mutate_at(3:5, parse_number) %>% # make numbers actually numbers
tbl_df() # for printing
game_blogs
#> # A tibble: 119 x 5
#> Rank Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 Kotaku - The Gamer's Guide 997669 1209029 873
#> 2 2 IGN | Video Games 4070476 4493805 399
#> 3 3 Xbox Wire 23141452 10210993 879
#> 4 4 Official PlayStation Blog 38019811 12059607 500
#> 5 5 Nintendo Life 35977 95044 17727
#> 6 6 Game Informer 603681 1770812 10057
#> 7 7 Reddit | Gamers 1003705 430017 25
#> 8 8 Polygon 623808 485827 1594
#> 9 9 Xbox Live's Major Nelson 65905 993481 23114
#> 10 10 VG247 397798 202084 3960
#> # ... with 109 more rows
It's worth checking that everything is parsed like you want, but it should be usable at this point.
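For instance, one quick sanity check (a sketch) is whether parse_number introduced any NAs in the numeric columns:
game_blogs %>%
  summarise_at(vars(Facebook.Fans, Twitter.Followers, Alexa.Rank),
               ~ sum(is.na(.)))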
This uses html_nodes (plural) and str_replace_all to remove the commas in the numbers. Not sure if these are all the stats you need.
library(rvest)
library(stringr)
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
html_nodes(".stats") %>%
html_text() %>%
str_replace_all(',', '') %>%
as.numeric()
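Since the .stats values come out interleaved three per blog (Facebook, Twitter, Alexa, as shown in the earlier answer), the numeric vector can then be reshaped into columns. A sketch, assuming the three-per-blog layout holds for every row:
stats_num <- data2_html %>%
  html_nodes(".stats") %>%
  html_text() %>%
  str_replace_all(',', '') %>%
  as.numeric()
# one row per blog: facebook, twitter, alexa
stats_df <- as.data.frame(matrix(stats_num, ncol = 3, byrow = TRUE))
names(stats_df) <- c("facebook_fans", "twitter_followers", "alexa_rank")
head(stats_df)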