Pattern matching a formula in R

I want to extract the variable names from a formula by pattern matching. The ideal solution should behave as follows:
formula <- 'variable_1+variable_2*variable_3-variable_4/variable_5 + 456' and the output should be variable_1, variable_2, variable_3, variable_4, variable_5.
Note: a variable name can contain only letters, underscores (_) and digits, and the operations are limited to +, -, *, /. The formula may also contain constants (here, 456). The output should contain only variable names and should ignore any numeric constants.
I have tried the code below. It only handles variable names made of letters, and it also fails when the formula contains a minus sign (-).
formula <- "variableX +variableY*VariableZ"
strapplyc(gsub(" ", "", format(formula), fixed = T), "-?|[a-zA-Z_]+", simplify = T, ignore.case = T) gives the output below:
[,1]
[1,] "variableX"
[2,] ""
[3,] "variableY"
[4,] ""
[5,] "VariableZ"
which is correct, but when I include a minus operation (-), strapplyc gives the wrong result:
formula <- "variableX -variableY"
strapplyc(gsub(" ", "", format(formula), fixed = T), "-?|[a-zA-Z_]+", simplify = T, ignore.case = T) gives the output below:
[,1]
[1,] "variableX"
[2,] "-"
[3,] "variableY"
I would appreciate it if anyone could help me find an ideal solution.

You can use regular expressions for this:
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5"
gsub("[\\+\\*\\-\\/]", ", ", formula)
Explanation of the regex:
[ and ] start and end a character class: any single character listed inside it is matched
\\+ escapes the + sign, which you want to replace with ", "
\\* escapes the * sign, which you want to replace with ", "
\\- escapes the - sign, which you want to replace with ", "
\\/ escapes the / sign, which you want to replace with ", "
Edit to reflect OP's updated request
Another way would be to extract just the variables. The code below works if your variable names follow the format lowercaseletters_number:
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5+34+brigadeiro_5"
paste(regmatches(formula, gregexpr("[a-z]+_[0-9]+", formula))[[1]],
collapse = ", ")
You can also use the stringr package if you want the code to look a little cleaner:
library(stringr)
str_extract_all(formula, "[a-z]*_[0-9]*")
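The patterns above rely on one specific name format. As a minimal sketch of the rule stated in the question (names made of letters, digits and underscores, with numeric constants ignored), the base-R version below assumes that a name never starts with a digit and that constants are plain numbers such as 456, not scientific notation like 1e5. The name vars is only illustrative.

```r
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5 + 456"

# A name starts with a letter or underscore, followed by letters, digits or
# underscores. Tokens that start with a digit (the constants) never match.
vars <- regmatches(formula, gregexpr("[A-Za-z_][A-Za-z0-9_]*", formula))[[1]]

vars
# [1] "variable_1" "variable_2" "variable_3" "variable_4" "variable_5"

paste(vars, collapse = ", ")
```

Because the operators are never part of a match, nothing has to be escaped or listed explicitly.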

You could use strsplit() with some extras.
res <- trimws(el(strsplit(formula, "\\+|\\-|\\*|\\/")))
Then keep only the elements that yield NA when coerced with as.numeric(), i.e. the ones that are not numeric constants.
res[is.na(suppressWarnings(as.numeric(res)))]
# [1] "variable_1" "variable_2" "variable_3" "variable_4" "variable_5"
Data
formula <- 'variable_1+variable_2*variable_3-variable_4/variable_5 + 456'
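As a cross-check on the split-and-filter result, base R can also parse the string and list the symbols it contains. This is only a sketch: it assumes the formula is a syntactically valid R expression and that every variable name is a valid R name.

```r
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5 + 456"

# parse() turns the string into an unevaluated expression;
# all.vars() returns the names used in it and skips numeric constants.
found <- all.vars(parse(text = formula))
found
# [1] "variable_1" "variable_2" "variable_3" "variable_4" "variable_5"
```

Nothing is evaluated here, so the variables do not need to exist in the workspace.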

Related

concat a SPLIT variable in R

I've been trying to split a string in R and then join it back together, but none of the tricks I found have worked for what I need.
!!!Important !!! My question is not a duplicate:
saving a split result into a variable and then pasting, collapsing etc is not the same as just paste a vector like this
paste(c("bla", "bla"), collapse = " ")
> paste(c("The","birch", "canoe"), collapse = ' ')
[1] "The birch canoe"
> paste(s, collapse=" ")
[1] "c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
Here's the code:
I take pre-saved sentences in R
sentences[1]
and split it
s <- str_split(sentences[1], " ")
this is what I get:
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
Now when I try to join this back together I get backslashes
toString(s)
"c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
paste produces the same result:
> paste(s)
[1] "c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
I tried using str_split_fixed and wrapping the result in a vector, but it joins the sentence back together with commas, even if I ask it not to.
v <- as.vector(str_split_fixed(sentences[1], " ", 5))
toString(v, sep="")
[1] "The, birch, canoe, slid, on the smooth planks."
I thought maybe str_split_i or str_split_1 could solve it as according to the documentation in theory it should, but that's what I get when I try to use it
"could not find function "str_split_1" "
Are there any other ways to join a string back together after splitting it, without producing commas or backslashes?
See the difference between:
s <- list(c("The" , "birch" , "canoe" , "slid" , "on" , "the" , "smooth" , "planks."))
paste(s[1], collapse = " ")
#[1] "c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
and
paste(s[[1]], collapse = " ")
#[1] "The birch canoe slid on the smooth planks."
This is because [[ extracts the character vector stored in the list, whereas [ keeps the output as a list.
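A minimal base-R sketch of the whole round trip, assuming strsplit() in place of str_split() (both return a list with one element per input string):

```r
# Splitting one string still yields a list of length 1.
s <- strsplit("The birch canoe slid on the smooth planks.", " ")

class(s)        # "list"
class(s[1])     # "list": single brackets keep the list wrapper
class(s[[1]])   # "character": double brackets return the vector itself

# Paste the vector, not the list, to get the sentence back.
joined <- paste(s[[1]], collapse = " ")
joined
# [1] "The birch canoe slid on the smooth planks."
```

unlist(s) would also give the plain vector when only one string was split.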

How to change values before text in string using R

I have multiple strings that are similar to the following pattern:
dat<-("00000000AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0")
I need to change all 0 values to "." before the first character value within a string. My desired output in this example would be:
"........AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0".
I tried using gsub to accomplish this task:
gsub("\\G([^_\\d]*)\\d", ".\\1", dat, perl=T)
Unfortunately it changed all of the 0s to "." instead of the 0s preceding the first "A".
Can someone please help me with this issue?
If you wish to simply replace each leading 0 with a ., you can use
gsub("\\G0", ".", dat, perl=TRUE)
Here, \G0 matches a 0 char at the start of the string, and then again immediately after each successful match.
If you need to replace each 0 in a string before the first letter you can use
gsub("\\G[^\\p{L}0]*\\K0", ".", dat, perl=TRUE)
Here, \G matches the start of the string or the end of the preceding successful match, [^\p{L}0]* matches zero or more chars other than a letter and 0, \K discards the text matched so far, and 0 then matches the 0 char, which is replaced with a dot.
R demo:
dat <- c("00000000AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0","102030405000AZD")
gsub("\\G0", ".", dat, perl=TRUE)
## [1] "........AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0"
## [2] "102030405000AZD"
gsub("\\G[^\\p{L}0]*\\K0", ".", dat, perl=TRUE)
## [1] "........AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0"
## [2] "1.2.3.4.5...AZD"
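For comparison, the same "zeros before the first letter" rule can be sketched without \G by rewriting only the leading run of non-letter characters. This assumes ASCII letters only (the \p{L} class above also covers non-ASCII letters), and the helper name replace_leading_zeros is hypothetical.

```r
replace_leading_zeros <- function(x) {
  # Locate the (possibly empty) run of non-letters at the start of each string.
  m <- regexpr("^[^A-Za-z]*", x)
  # Swap 0 for . inside that run only; the rest of the string is untouched.
  regmatches(x, m) <- chartr("0", ".", regmatches(x, m))
  x
}

dat <- c("00000000AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0", "102030405000AZD")
res <- replace_leading_zeros(dat)
res
```

On these two inputs the result should match the second gsub() call in the demo.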
This is really hard, so I tried to do it with a custom function:
library(stringr)
dat<-("00000000AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0")
Zero_Replacer <- function(x) {
  # everything before the first letter (may be empty)
  prefix <- str_extract(x, "^[^A-Za-z]*")
  # the remainder, starting at the first letter
  rest <- str_sub(x, str_length(prefix) + 1)
  paste0(str_replace_all(prefix, "0", "."), rest)
}
Zero_Replacer(dat)
Output:
[1] "........AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0"

Replace matched patterns in a string based on condition

I have a text string containing digits, letters and spaces. Some of its substrings are month abbreviations. I want to perform a condition-based pattern replacement, namely to surround a month abbreviation with spaces if and only if a given condition is fulfilled. As an example, let the condition be: "preceded by a digit and followed by a letter".
I tried the stringr package, but I cannot work out how to combine the functions str_replace_all() and str_locate_all():
# Input:
txt = "START1SEP2 1DECX JANEND"
# Desired output:
# "START1SEP2 1 DEC X JANEND"
# (A) What I could do without checking the condition:
library(stringr)
patt_month = paste("(", paste(toupper(month.abb), collapse = "|"), ")", sep='')
str_replace_all(string = txt, pattern = patt_month, replacement = " \\1 ")
# "START1 SEP 2 1 DEC X JAN END"
# (B) But I actually only need replacements inside the condition-based bounds:
str_locate_all(string = txt, pattern = paste("[0-9]", patt_month, "[A-Z]", sep=''))[[1]]
# start end
# [1,] 12 16
# To combine (A) and (B), I'm currently using an ugly for() loop not shown here and want to get rid of it
You are looking for lookarounds:
(?<=\d)DEC(?=[A-Z])
See a demo on regex101.com.
Lookarounds assert that a certain position matches without consuming any characters. A lookbehind checks what comes immediately before the current position, and a lookahead checks what follows it. Each comes in a positive and a negative form, so there are four types (positive/negative lookbehind/lookahead).
A short memo:
(?=...) is a pos. lookahead
(?!...) is a neg. lookahead
(?<=...) is a pos. lookbehind
(?<!...) is a neg. lookbehind
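Applied to the question's input, the lookarounds can be combined with the full list of month abbreviations. This is a sketch in base R; it needs perl = TRUE because R's default regex engine does not support lookarounds. The names months, pat and result are illustrative.

```r
txt <- "START1SEP2 1DECX JANEND"

# Alternation of all upper-case month abbreviations: JAN|FEB|...|DEC
months <- paste(toupper(month.abb), collapse = "|")

# (?<=[0-9]) requires a digit before the month, (?=[A-Z]) a letter after it.
# Neither is consumed, so only the month itself is replaced.
pat <- paste0("(?<=[0-9])(", months, ")(?=[A-Z])")

result <- gsub(pat, " \\1 ", txt, perl = TRUE)
result
# [1] "START1SEP2 1 DEC X JANEND"
```

SEP is left alone because it is followed by a digit, and JAN because it is preceded by a space.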
A Base R version
patt_month <- paste(toupper(month.abb), collapse = "|") # join all month abbreviations with OR
pat <- paste0("(\\s\\d)(", patt_month, ")([A-Z]\\s)") # three capture groups: space+digit, month, letter+space
gsub(pattern = pat, replacement = "\\1 \\2 \\3", txt, perl = TRUE) # same result as above
Also works for txt2 <- "START1SEP2 1JANY JANEND" out of the box.
[1] "START1SEP2 1 JAN Y JANEND"

Regex expression starting from a certain character

Example: "example._AL(5)._._4500_GRE/Jan_2018"
I am trying to extract text from the above string containing parentheses. I wanna extract everything starting from AL.
Output should look like: "AL(5)._._4500_GRE/Jan_2018"
There is some question on what we can assume is known but here are a few variations which make various assumptions.
1) word( This removes everything prior to the first word followed by a parenthesis.
"^" matches the start of string
".*?" is the shortest match of anything provided we still match rest of regex
"\\w+" matches a word
"\\(" matches a left paren
(...) forms a capture group which the replacement string can refer to as "\\1"
Code
x <- "example.AL(5)._._4500_GRE/Jan_2018"
sub("^.*?(\\w+\\()", "\\1", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
1a) or matching a word followed by ( followed by anything and extracting that:
library(gsubfn)
strapplyc(x, "\\w+\\(.*", simplify = TRUE)
## [1] "AL(5)._._4500_GRE/Jan_2018"
2) AL( or if we know that the word is AL then:
sub("^.*?(AL\\(.*)", "\\1", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
3) remove up to 1st dot or if we know that the part to be removed is the part before and including the first dot:
sub("^.*?\\.", "", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
4) dot separated fields If the format of the input is dot-separated fields we can parse them all out at once like this:
read.table(text = x, sep = ".", as.is = TRUE)
## V1 V2 V3 V4
## 1 example AL(5) _ _4500_GRE/Jan_2018
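Variation (1a) can also be sketched in base R without gsubfn. The pattern below uses [A-Za-z]+ rather than \\w+, which is an assumption on my part: \\w also matches an underscore, so with the string exactly as written in the question (with "._AL") it would keep the leading underscore.

```r
x  <- "example.AL(5)._._4500_GRE/Jan_2018"    # string used in the answer
x2 <- "example._AL(5)._._4500_GRE/Jan_2018"   # string as written in the question

# regexpr() locates the first run of letters followed by "(" plus everything
# after it; regmatches() returns the matched text.
r1 <- regmatches(x,  regexpr("[A-Za-z]+\\(.*", x))
r2 <- regmatches(x2, regexpr("[A-Za-z]+\\(.*", x2))

r1
# [1] "AL(5)._._4500_GRE/Jan_2018"
r2
# [1] "AL(5)._._4500_GRE/Jan_2018"
```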

Handling string search and substitution in R

I am a beginner in R, used Matlab before and I have been searching around for a solution to my problem but I do not appear to find one.
I have a very large vector with text entries. Something like
CAT06
6CAT
CAT 6
DOG3
3DOG
I would like to be able to find a function such that: If an entry is found and it contains "CAT" & "6" (no matter position), substitute cat6. If an entry is found and it contains "DOG" & "3" (no matter position) substitute dog3. So the outcome should be:
cat6 cat6 cat6 dog3 dog3
Can anybody help on this? Thank you very much, find myself a bit lost!
First remove the spaces, so that elements like "CAT 6" become "CAT6":
sp = gsub(" ", "", c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG"))
Then use some regex magic to find any combination of "CAT", "0", "6" and replace these matches with "cat6" as follows:
sp = gsub("^(?:CAT|0|6)*$", "cat6", sp)
Same here with DOG case:
sp = gsub("^(?:DOG|0|3)*$", "dog3", sp)
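The patterns above depend on listing every character that may appear in an entry. A more literal reading of the request (an entry contains both "CAT" and "6", in any position) can be sketched with two grepl() tests per rule. The names x, out, is_cat6 and is_dog3 are illustrative, and leaving entries that match neither rule unchanged is an assumption.

```r
x <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG")

out <- x
# fixed = TRUE treats the pattern as a literal substring, not a regex.
is_cat6 <- grepl("CAT", x, fixed = TRUE) & grepl("6", x, fixed = TRUE)
is_dog3 <- grepl("DOG", x, fixed = TRUE) & grepl("3", x, fixed = TRUE)
out[is_cat6] <- "cat6"
out[is_dog3] <- "dog3"

out
# [1] "cat6" "cat6" "cat6" "dog3" "dog3"
```

Since grepl() is vectorised, this runs over a very large vector without an explicit loop.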
The input shown in the question is ambiguous as per my comment under the question. We show how to calculate it depending on which of three assumptions was intended.
1) vector input with embedded spaces Remove the digits and spaces ("[0-9 ]") in the first gsub and remove the non-digits ("\\D") in the second gsub converting to numeric to avoid leading zeros and then paste together:
x1 <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG") # test input
paste0(gsub("[0-9 ]", "", x1), as.numeric(gsub("\\D", "", x1)))
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
2) single string Form chars by removing all digits and scanning the result in. Then form nums by removing everything except digits and spaces and scanning the result. Finally paste these together.
x2 <- "CAT06 6CAT CAT 6 DOG3 3DOG" # test input
chars <- scan(textConnection(gsub("\\d", "", x2)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", x2)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
or, if a single output string is wanted, add this:
paste(y, collapse = " ")
3) vector input without embedded spaces Reduce this to case (2) and then apply (2).
x3 <- c("CAT06", "6CAT", "CAT", "6", "DOG3", "3DOG") # test input
xx <- paste(x3, collapse = " ")
chars <- scan(textConnection(gsub("\\d", "", xx)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", xx)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
Note that this actually works for all three inputs. That is, if we replace x3 with x1 or x2 it still works and, as with (2), if a single output string is wanted, add paste(y, collapse = " ").
