extract multiple parts of a string with R - r

I have two strings:
data = "Product Number: #76 in Fruit Juices (See Top 10 products in this department)"
data1 = "Product Number: #321,222 in Thin Base Pizzas (See Top 10 products in this department)"
Using str_match() in R, what regex would give the following results?
str_match(data, regex)
[,1] [,2] [,3]
[1,] "#76 in Fruit Juices " "76" "Fruit Juices "
str_match(data1, regex)
[,1] [,2] [,3]
[1,] "#321,222 in Thin Base Pizzas " "321,222" "Thin Base Pizzas "

You can use this regex to extract the information you need:
#([0-9,]+) in ([A-Za-z ]+)
You can see it in action here: https://regex101.com/r/IM0wHV/1
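As a minimal sketch of how the pattern behaves, the snippet below applies it with base R's regexec() and regmatches(), swapped in for str_match() so that it runs without stringr. It assumes the first string reads "Fruit Juices", as the expected output does, and writes the letter class as [A-Za-z ]. The helper name extract_rank is hypothetical.

```r
data  <- "Product Number: #76 in Fruit Juices (See Top 10 products in this department)"
data1 <- "Product Number: #321,222 in Thin Base Pizzas (See Top 10 products in this department)"

# Group 1: digits and commas after "#". Group 2: letters and spaces after " in ".
pattern <- "#([0-9,]+) in ([A-Za-z ]+)"

# Hypothetical helper: returns the full match followed by the two groups.
extract_rank <- function(x) regmatches(x, regexec(pattern, x))[[1]]

m  <- extract_rank(data)
m1 <- extract_rank(data1)
print(m)
print(m1)
```

Each result holds the full match followed by the two capture groups, the same three values that str_match() returns as matrix columns. The second group keeps its trailing space because the space before "(" is still part of the letter-and-space class.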

Given your first comment I think this will generalize to give you the product number.
sub(" .*", "", sub(".*#", "", data))
"76"
And this second one will give you whatever is between the in and (.
sub(" \\(.*", "", sub(".*[0-9]+ in ", "", data))
"Fruit Juices"
Not an ideal solution, but it's a working example you can take forward from here.
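For what it's worth, the same two calls can be checked against the second string. This is only a sketch; the variable names product_number, category and rank are made up for illustration.

```r
data1 <- "Product Number: #321,222 in Thin Base Pizzas (See Top 10 products in this department)"

# Drop everything up to "#", then everything from the first space onward.
product_number <- sub(" .*", "", sub(".*#", "", data1))

# Drop everything up to "<digits> in ", then everything from " (" onward.
category <- sub(" \\(.*", "", sub(".*[0-9]+ in ", "", data1))

# Strip the thousands separator before converting to a number.
rank <- as.numeric(gsub(",", "", product_number))

print(product_number)
print(category)
print(rank)
```

The comma has to be removed before as.numeric(), otherwise the conversion yields NA.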

Related

pattern matching a formula in R

I want to do a pattern matching of variables in a formula. The ideal solution should be able to perform as below:
formula <- 'variable_1+variable_2*variable_3-variable_4/variable_5 + 456' and the output should be variable_1, variable_2, variable_3, variable_4, variable_5.
Note: a variable name can contain only letters, underscores (_), and digits, and the operations are limited to +, -, *, /. The formula may also contain constants (here, 456). The output should contain only variable names and should ignore any numeric constants.
I have tried the code below. It only handles variable names made of letters, and it also breaks when the formula contains a minus operation (-).
formula <- "variableX +variableY*VariableZ"
strapplyc(gsub(" ", "", format(formula), fixed = T), "-?|[a-zA-Z_]+", simplify = T, ignore.case = T) gives the output below:
[,1]
[1,] "variableX"
[2,] ""
[3,] "variableY"
[4,] ""
[5,] "VariableZ"
which is correct, but when I include a minus operation (-), strapplyc gives the wrong result:
formula <- "variableX -variableY"
strapplyc(gsub(" ", "", format(formula), fixed = T), "-?|[a-zA-Z_]+", simplify = T, ignore.case = T) gives the output below:
[,1]
[1,] "variableX"
[2,] "-"
[3,] "variableY"
I would appreciate any help toward an ideal solution.
You can use regular expressions for this:
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5"
gsub("[\\+\\*\\-\\/]", ", ", formula)
Explanation of the regex:
[ and ] start and end a character class, i.e. the set of characters that you want to match
\\+ escapes the + sign, which you want to replace with ", "
\\* escapes the * sign, which you want to replace with ", "
\\- escapes the - sign, which you want to replace with ", "
\\/ escapes the / sign, which you want to replace with ", "
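A quick way to see the effect, sketched with an equivalent character class: putting the hyphen first in the class makes it literal, so none of the four operators needs a backslash there. The name out is just illustrative.

```r
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5"

# "-" comes first in the class, so it is a literal hyphen rather than a range.
out <- gsub("[-+*/]", ", ", formula)
print(out)
```

Note that this only swaps the operators for separators; a constant such as 456 would still appear in the result.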
Edit to reflect OP's updated request
Another way would be to extract the variables directly. The code below works if your variable names keep the format lowercaseletters_number:
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5+34+brigadeiro_5"
paste(regmatches(formula, gregexpr("[a-z]+_[0-9]+", formula))[[1]], collapse = ", ")
You can also use the stringr package if you want the code to look a little cleaner:
library(stringr)
str_extract_all(formula, "[a-z]*_[0-9]*")
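If the variable names do not all follow the letters_number shape, a broader pattern may help. The sketch below assumes a name starts with a letter or an underscore and continues with letters, digits, or underscores. That is one reading of the question, not something the answers state. It uses base R, so no package is needed, and the names vars and vars_unique are illustrative.

```r
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5 + 456"

# A name must start with a letter or underscore, so a bare constant
# such as 456 never produces a match.
vars <- regmatches(formula, gregexpr("[A-Za-z_][A-Za-z0-9_]*", formula))[[1]]

# Drop repeats in case a variable occurs more than once in the formula.
vars_unique <- unique(vars)
print(vars_unique)
```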
You could use strsplit() with some extras.
res <- trimws(el(strsplit(formula, "\\+|\\-|\\*|\\/")))
Then we keep only the elements that yield NA when we try to coerce them with as.numeric(), i.e. the non-numeric ones.
res[is.na(suppressWarnings(as.numeric(res)))]
# [1] "variable_1" "variable_2" "variable_3" "variable_4" "variable_5"
Data
formula <- 'variable_1+variable_2*variable_3-variable_4/variable_5 + 456'

How to iterate through an R list of character vectors to modify each element by keeping all characters up to and including one character past comma

I have an R list of approx. 90 character vectors (representing 90 documents), each containing several author names. As a means to stem (or normalize, what have you) the names, I'd like to drop all characters after the white-space and first character just past the comma in each element. So, for example, "Smith, Joe" would become "Smith, J" (or "Smith J" would be fine).
1) I've tried using lapply with str_sub, but I can't seem to specify keeping one character past the comma (each element has different character length). 2) I also tried using lapply to split on the comma and make the last and first names separate elements, then using modify_depth to apply str_sub, but I can't figure out how to specifically use the str_sub only on the second element.
Fake sample to replicate issue.
doc1 = c("King, Stephen", "Martin, George")
doc2 = c("Clancy, Tom", "Patterson, James", "Stine, R.L.")
author = list(doc1,doc2)
What I've tried:
myfun1 = function(x,arg1){str_split(x, ", ")}
author = lapply(author, myfun1)
myfun2 = function(x,arg1){str_sub(x, end = 1L)}
f2 = modify_depth(author, myfun2, .depth = 2)
f2
[[1]]
[[1]][[1]]
[1] "K" "S"
[[1]][[2]]
[1] "M" "G"
Ultimately, I'm hoping after applying a solution, including maybe using unite(), the result will be as follows:
[[1]]
[[1]][[1]]
[1] "King S"
[[1]][[2]]
[1] "Martin G"
lapply( author, function(x) gsub( "(^.*, [A-Z]).*$", "\\1", x))
# [[1]]
# [1] "King, S" "Martin, G"
#
# [[2]]
# [1] "Clancy, T" "Patterson, J" "Stine, R"
What it does:
lapply loops over list of authors
gsub replaces the part of each vector element matched by the regex "(^.*, [A-Z]).*$" with the first group (the part between the round brackets).
the regex "(^.*, [A-Z]).*$" puts everything from the start (^.*) up to and including a 'comma, space, capital letter' sequence (, [A-Z]) into a group. Because .* is greedy, this is the last such sequence in the string, which for a name with a single comma is the only one.
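The question also mentions that "Smith J" (without the comma) would be acceptable. A possible variant, sketched here with two capture groups instead of one, might look like this; the name short is illustrative.

```r
doc1 <- c("King, Stephen", "Martin, George")
doc2 <- c("Clancy, Tom", "Patterson, James", "Stine, R.L.")
author <- list(doc1, doc2)

# Group 1: everything before the first comma (the surname).
# Group 2: the first character after the comma and any spaces.
short <- lapply(author, function(x) sub("^([^,]+),\\s*(.).*$", "\\1 \\2", x))
print(short)
```

Using [^,]+ rather than .* anchors the split at the first comma, so a second comma later in the string does not shift the cut point.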

Match a substring with character, digits and spaces with gsub

I have a string like:
a <- '{:name=>"krill", :priority=>2, :count=>1}, {:name=>"vit a", :priority=>2]}, {:name=>"vit-b", :priority=>2, :count=>1}, {:name=>"vit q10", :priority=>2]}'
I would like to parse via str_match the elements within ':name=>" ' and ' " '
krill
vit a
vit-b
vit q10
So far I tried:
str_match(a, ':name=>\\"([A-Za-z]{3})')
But it doesn't work.
Any help is appreciated
You may extract those values with
> regmatches(a, gregexpr(':name=>"\\K[^"]+', a, perl=TRUE))
[[1]]
[1] "krill" "vit a" "vit-b" "vit q10"
The :name=>"\\K[^"]+ pattern matches
:name=>" - a literal substring
\K - omits the substring from the match
[^"]+ - one or more chars other than ".
If you need to use stringr package, use str_extract_all:
> library(stringr)
> str_extract_all(a, '(?<=:name=>")[^"]+')
[[1]]
[1] "krill" "vit a" "vit-b" "vit q10"
In (?<=:name=>")[^"]+, the (?<=:name=>") matches any location that is immediately preceded with :name=>".
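The original attempt fails for two reasons: str_match() returns only the first match, and [A-Za-z]{3} matches exactly three letters, so neither spaces, hyphens, digits nor longer names are covered. If neither \K nor a lookbehind is available, a two-step base R sketch is to match the prefix together with the value and then strip the prefix. The names hits and product_names are illustrative.

```r
a <- '{:name=>"krill", :priority=>2, :count=>1}, {:name=>"vit a", :priority=>2]}, {:name=>"vit-b", :priority=>2, :count=>1}, {:name=>"vit q10", :priority=>2]}'

# Step 1: match the literal prefix plus everything up to the closing quote.
hits <- regmatches(a, gregexpr(':name=>"[^"]+', a))[[1]]

# Step 2: remove the prefix from each match.
product_names <- sub('^:name=>"', "", hits)
print(product_names)
```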
Using stringr and positive lookbehind:
library(stringr)
str_match_all(a, '(?<=:name=>")[^"]+')
[[1]]
[,1]
[1,] "krill"
[2,] "vit a"
[3,] "vit-b"
[4,] "vit q10"

Extracting multiple substrings that come after certain characters in a string using stringi in R

I have a large dataframe in R that has a column that looks like this where each sentence is a row
data <- data.frame(
datalist = c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
"these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
"anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
"while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations"),
stringsAsFactors=FALSE)
I want to extract all the words that come after "wiki/" and put them in another column
So for the first row it should come out with "political_philosophy self-governance"
The second row should look like "hierarchy free_association_(communism_and_anarchism)"
The third row should be "state_(polity)"
And the fourth row should be "anti-statism"
I definitely want to use stringi because it's a huge dataframe. Thanks in advance for any help.
I've tried
stri_extract_all_fixed(data$datalist, "wiki")[[1]]
but that just extracts the word wiki
You can do this with a regex. By using stri_match_ instead of stri_extract_, we can use parentheses to define capture groups that let us extract only part of the regex match. In the result below, you can see that each row of data gives a list item containing a character matrix, with the whole match in the first column and each capture group in the following columns:
match <- stri_match_all_regex(data$datalist, "wiki/([\\w-()]*)")
match
[[1]]
[,1] [,2]
[1,] "wiki/political_philosophy" "political_philosophy"
[2,] "wiki/self-governance" "self-governance"
[[2]]
[,1] [,2]
[1,] "wiki/stateless_society" "stateless_society"
[2,] "wiki/hierarchy" "hierarchy"
[3,] "wiki/free_association_(communism_and_anarchism)" "free_association_(communism_and_anarchism)"
[[3]]
[,1] [,2]
[1,] "wiki/state_(polity)" "state_(polity)"
[[4]]
[,1] [,2]
[1,] "wiki/anti-statism" "anti-statism"
You can then use apply functions to make the data into any form you want:
match <- stri_match_all_regex(data$datalist, "wiki/([\\w-()]*)")
sapply(match, function(x) paste(x[,2], collapse = " "))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
You can use a lookbehind in the regex.
library(dplyr)
library(stringi)
text <- c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
"these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
"anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
"while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations")
df <- data.frame(text, stringsAsFactors = FALSE)
df %>%
mutate(words = stri_extract_all(text, regex = "(?<=wiki\\/)\\S+"))
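The mutate() call above stores a list of matches per row, whereas the question asks for one space-separated string per row. A minimal sketch of that last step follows, using base R's gregexpr() and regmatches() in place of stringi so it runs without extra packages. The shortened sample sentences and the names hits and words are assumptions for illustration.

```r
text <- c(
  "anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies",
  "anarchism holds the wiki/state_(polity) to be undesirable",
  "no links here"
)

# Same lookbehind idea: match runs of non-whitespace preceded by "wiki/".
hits <- regmatches(text, gregexpr("(?<=wiki/)\\S+", text, perl = TRUE))

# Collapse each row's matches into a single string; rows with no match give "".
words <- vapply(hits, paste, character(1), collapse = " ")
print(words)
```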
You may use
> trimws(gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
See the online R code demo.
Details
wiki/(\\S+) - matches wiki/ and captures 1+ non-whitespace chars into Group 1
| - or
(?:(?!wiki/\\S).)+ - a tempered greedy token that matches any char, other than a line break char, 1+ occurrences, that does not start a wiki/+a non-whitespace char sequence.
If you need to get rid of redundant whitespace inside the result you may use another call to gsub:
> gsub("^\\s+|\\s+$|\\s+(\\s)", "\\1", gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"

Split names and create matrix in R

I have this data:
names <- c("Baker, Chet", "Jarret, Keith", "Miles Davis")
I want to manipulate it so that the first name comes first, so I split it:
names <- strsplit(names, ", ")
[[1]]
[1] "Baker" "Chet"
[[2]]
[1] "Jarret" "Keith"
[[3]]
[1] "Miles Davis"
The problem is that, when I put them back together, the name "Miles Davis" comes out wrong, because it is already the full name.
matrix(unlist(names), ncol=2, byrow = TRUE)
[,1] [,2]
[1,] "Baker" "Chet"
[2,] "Jarret" "Keith"
[3,] "Miles Davis" "Baker"
What should I do to create a new df that looks like this:
"Chet Baker"
"Keith Jarret"
"Miles Davis"
Here's the reference: http://rfunction.com/archives/1499
You can easily adapt the pattern used in the regular expression so that it matches either a comma followed by 0+ spaces or 1+ spaces:
names <- strsplit(names, ",\\s*|\\s+")
matrix(unlist(names), ncol=2, byrow = TRUE)
# [,1] [,2]
#[1,] "Baker" "Chet"
#[2,] "Jarret" "Keith"
#[3,] "Miles" "Davis"
Since the desired result is different than initially described, here's a different approach:
names <- strsplit(names, ",\\s*")
data.frame(name = sapply(names, function(x) paste(rev(x), collapse = " ")))
# name
#1 Chet Baker
#2 Keith Jarret
#3 Miles Davis
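The split-and-reverse idea can also be wrapped in a small function. This is a sketch only: the name flip_name is hypothetical, and vapply() is used in place of sapply() so the result is always a character vector.

```r
# Hypothetical helper: "Last, First" becomes "First Last";
# names without a comma are returned unchanged.
flip_name <- function(x) {
  parts <- strsplit(x, ",\\s*")
  vapply(parts, function(p) paste(rev(p), collapse = " "), character(1))
}

flipped <- flip_name(c("Baker, Chet", "Jarret, Keith", "Miles Davis"))
print(flipped)
```

A name without a comma splits into a single piece, so reversing it is a no-op and "Miles Davis" passes through untouched.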
Another option, using capture groups in a regular expression to swap everything before the comma with everything after the comma and replace the comma with a space.
names <- c("Baker, Chet", "Jarret, Keith", "Miles Davis")
sub("([^,]+),\\s*([^,]+)$", "\\2 \\1", names)
#[1] "Chet Baker" "Keith Jarret" "Miles Davis"
Another regex solution:
gsub("(\\w+), (\\w+)", "\\2 \\1", names)
# [1] "Chet Baker" "Keith Jarret" "Miles Davis"

Resources