How to extract only valid equations from a string - r

Extract all the valid equations in the following text.
I have tried a few regex expressions but none seem to work.
Hoping to use sub or gsub functions in R.
myText <- 'equation1: 2+3=5, equation2 is: 2*3=6, do not extract 2w3=6'
expected result : 2+3=5 2*3=6

Here is a base R approach. We can use gregexpr() to find all matches of equations in the input string, and regmatches() to extract them:
x <- c("equation1: 2+3=5, equation2 is: 2*3=6, do not extract 2w3=6")
m <- gregexpr("\\b\\w+(?:[-+*]\\w+)+=\\w+\\b", x)
regmatches(x, m)
[[1]]
[1] "2+3=5" "2*3=6"
Here is an explanation of the regex:
\\b\\w+ - match an initial number or variable name
(?:[-+*]\\w+)+ - then match an arithmetic operator (+, - or *) followed by another number or variable, one or more times. The hyphen comes first in the bracket expression so that it is taken literally rather than as a range
=\\w+\\b - match an equals sign, followed by a number or variable

For the examples you've posted, the regex (\d+[+\-*\/]\d+=\d+) should extract the equations and nothing else. Note that this regex handles only numbers and the basic arithmetic operators, not variables or variable names. To use it in R, the backslashes need to be doubled inside the string literal.
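A minimal sketch of how the digits-only pattern might be written in R, assuming the input string from the question. The variable names are illustrative, and the operator class is written as [-+*/] so that the hyphen needs no escape.

```r
# Input string from the question
my_text <- "equation1: 2+3=5, equation2 is: 2*3=6, do not extract 2w3=6"

# Backslashes are doubled in R string literals; digits only, no variable names
pattern <- "\\d+[-+*/]\\d+=\\d+"

# gregexpr() locates every match; regmatches() pulls the matched text out
equations <- regmatches(my_text, gregexpr(pattern, my_text))[[1]]
print(equations)
```

Because the pattern requires an operator between two runs of digits, 2w3=6 is skipped.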

Related

Replace Strings in R with regular expression in it dynamically [duplicate]

I want to build a regular expression that embeds some strings to search for. These strings need to be escaped before I put them in the regex, so that it still works when a search string contains regex metacharacters.
Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
str_replace_all(string, "(\\W)", "\\\\\\1")
}
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
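If you would rather not load Hmisc, that one-liner can be wrapped in a small helper. This is only a sketch: the name escape_regex is made up for illustration, and the body is the gsub() call quoted above.

```r
# Hypothetical helper name; the body is the gsub() call quoted above
escape_regex <- function(string) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
}

escaped <- escape_regex("foo[bar]")

# Once escaped, the dot in "a.c" no longer matches an arbitrary character
hits <- grepl(escape_regex("a.c"), c("a.c", "abc"))

print(escaped)
print(hits)
```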
My previous answer:
I'm not sure if there is a built in function but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of what you want to replace them with and then loops through those making the necessary replacements.
re.escape <- function(strings){
vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
"\\{", "\\}", "\\^", "\\$","\\*",
"\\+", "\\?", "\\.", "\\|")
replace.vals <- paste0("\\\\", vals)
for(i in seq_along(vals)){
strings <- gsub(vals[i], replace.vals[i], strings)
}
strings
}
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
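A short sketch of the quoting approach, assuming the PCRE engine is selected with perl = TRUE. The variable names are illustrative. One caveat: a search string that itself contains \E would end the quoted section early. When the whole pattern is literal, fixed = TRUE is another option.

```r
needle <- "foo[bar]"
haystack <- c("x foo[bar] y", "foob")

# \Q ... \E quotes everything in between; perl = TRUE selects the PCRE engine
quoted <- paste0("\\Q", needle, "\\E")
hits_quoted <- grepl(quoted, haystack, perl = TRUE)

# When the entire pattern is literal, fixed = TRUE skips regex parsing altogether
hits_fixed <- grepl(needle, haystack, fixed = TRUE)

print(hits_quoted)
print(hits_fixed)
```

Without quoting, "foo[bar]" would be read as "foo" followed by one of b, a or r, and so would also match "foob".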
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using a capture group, (\\W), we can detect occurrences of non-word characters and escape them with the \\1 syntax:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or, equivalently, use "([^[:alnum:]_])" in place of "(\\W)".

Insert character string between period and digit in R

I have a vector of character strings like so:
test <- c("A1.7","A1.8")
and I want to use regular expressions to insert A1c<= between the period and the digit, like so:
A1.A1c<=7 A1.A1c<=8
I looked through existing questions and found a similar question by @zx8754. I tried to modify the answer posted there, but had no luck:
insert <- 'A1c<='
n <- 4
old <- test
lhs <- paste0('([[:alpha:]][[:digit:]][[:punct:]]{', n-1, '})([[:digit:]]+)$')
rhs <- paste0('\\1', insert, '\\2')
gsub(lhs, rhs, test)
Can anyone direct me as to how to correctly execute this?
Another pattern:
gsub("\\.(\\d+)", ".A1c<=\\1", test)
## [1] "A1.A1c<=7" "A1.A1c<=8"
You may use
insert <- 'A1c<='
test <- c("A1.7","A1.8")
sub("(?<=\\.)(?=\\d)", insert, test, perl=TRUE)
## => A1.A1c<=7 A1.A1c<=8
Details
(?<=\\.) - a positive lookbehind that matches a location that is immediately preceded with a dot
(?=\\d) - a positive lookahead that matches a location that is immediately followed with a digit.
The sub function will replace the first occurrence only, and perl=TRUE makes it possible to use the lookaround constructs in the pattern (as it is now parsed with the PCRE regex engine).
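To illustrate the difference between sub and gsub here, the following sketch uses a made-up string that contains two period-digit positions. The input string and variable names are assumptions for illustration only.

```r
insert <- "A1c<="
two_codes <- "A1.7 and B2.8"  # hypothetical input with two insertion points

# sub() stops after the first zero-width match
first_only <- sub("(?<=\\.)(?=\\d)", insert, two_codes, perl = TRUE)

# gsub() inserts at every position between a period and a digit
every_one <- gsub("(?<=\\.)(?=\\d)", insert, two_codes, perl = TRUE)

print(first_only)
print(every_one)
```

For the vector in the question each element has a single insertion point, so both functions give the same result.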

How to extract a certain part of a string in R using regular expressions

How to convert the following string in R :
this_isastring_12(=32)
so that only the following is kept
isastring_12
Eg
f('this_isastring_12(=32)') returns 'isastring_12'
This should work on other strings with a similar structure, but different characters
Another example with a different string of similar structure
f('something_here_3(=1)') returns 'here_3'
We can use sub to extract everything between the first underscore and the opening round bracket in the text.
sub(".*?_(.*)\\(.*", "\\1", x)
#[1] "isastring_12" "here_3" "string_4"
where x is
x <- c("this_isastring_12(=32)", "something_here_3(=1)", "another_string_4(=1)")
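Since the question asks for a function f, here is one way it could be wrapped up. This is a sketch with a slightly different pattern that uses negated character classes instead of lazy matching; it assumes every input contains an underscore followed later by an opening bracket.

```r
# f is the function name used in the question; the pattern is an illustrative variant
f <- function(s) {
  # ^[^_]*_   skip everything up to and including the first underscore
  # ([^(]*)   capture everything up to the opening round bracket
  # \\(.*$    consume the bracket and the rest of the string
  sub("^[^_]*_([^(]*)\\(.*$", "\\1", s)
}

res1 <- f("this_isastring_12(=32)")
res2 <- f("something_here_3(=1)")
print(c(res1, res2))
```

Strings that do not fit the assumed structure are returned unchanged, because sub() leaves non-matching input as it is.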
You could use the package unglue.
Borrowing Ronak's data :
x <- c("this_isastring_12(=32)", "something_here_3(=1)", "another_string_4(=1)")
library(unglue)
unglue_vec(x, "{=.*?}_{res}({=.*?})")
#> [1] "isastring_12" "here_3" "string_4"
{=.*?} matches anything until what's next is matched, but doesn't extract anything because there's no lhs to the equality
{res}, where the name res could be replaced by anything, matches anything, and extracts it
outside of curly braces, no need to escape characters
unglue_vec() returns an atomic vector of the matches

How to subset words with certain number of vowels in rstudio?

I am trying to subset a list of words that have 5 or more vowels using the str_subset function in RStudio, but I can't figure it out.
Any suggestions?
Since you are evidently using stringr, the function str_count will give you what you are after. Assuming your "list of words" means a character vector of single words, the following should do the trick.
library(stringr)
testStrings <- c("Brillig", "slithey", "TOVES",
"Abominable", "EQUATION", "Multiplication", "aaagh")
# Count the vowels in each word
VowelCount <- str_count(testStrings, pattern = "[AEIOUaeiou]")
# Keep only the words with five or more vowels
OutputStrings <- testStrings[VowelCount >= 5]
The part in square brackets is a regular expression which matches any capital or lower case vowel in English. Of course other languages have different sets of vowels which you may need to take into account.
If you want to do the same in base R, the following one-liner should do it:
OutputStrings <- grep("([AEIOUaeiou].*){5,}", testStrings, value = TRUE)
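If the per-word counts themselves are useful, a base R stand-in for str_count can be sketched with gregexpr() and regmatches(). The variable names below are illustrative; the word list is the one from the answer.

```r
words <- c("Brillig", "slithey", "TOVES",
           "Abominable", "EQUATION", "Multiplication", "aaagh")

# One character vector of matched vowels per word
vowel_matches <- regmatches(words, gregexpr("[AEIOUaeiou]", words))

# lengths() gives the number of vowels found in each word
vowel_count <- lengths(vowel_matches)

many_vowels <- words[vowel_count >= 5]
print(vowel_count)
print(many_vowels)
```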

Regex to remove all non-digit symbols from string in R

How can I extract digits from a string that can have a structure of xxxx.x or xxxx.x-x and combine them as a number? e.g.
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1",
"1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
The desired (numeric) output would be:
101011, 101021, 101031...
I tried
regexp <- "([[:digit:]]+)"
solution <- str_extract(list, regexp)
However that only extracts the first set of digits; and using something like
regexp <- "([[:digit:]]+\\.[[:digit:]]+\\-[[:digit:]]+)"
returns the first result (data in its initial form) if matched otherwise NA for shorter strings. Thoughts?
Remove all non-digit symbols:
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1", "1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
as.numeric(gsub("\\D+", "", list))
## => [1] 101011 101021 101031 10301 10401 106011 106021 10701 110011 110021
I have no experience with R, but I do know regular expressions. Looking at the pattern you're specifying, "([[:digit:]]+)", I assume [[:digit:]] stands for [0-9], so you're capturing one group of digits.
It seems to me you're missing a + to make it capture multiple groups of digits.
I'd think you'd need to use "([[:digit:]]+)+".
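Note that a repeated group still matches only one contiguous run of digits, so on its own it would likely return the same first run. A sketch of the "capture every group" idea in base R, assuming all runs of digits should be concatenated in order (the variable names and the shortened input are illustrative):

```r
ids <- c("1010.1-1", "1030-1")

# Find every run of digits in each string
digit_groups <- regmatches(ids, gregexpr("[[:digit:]]+", ids))

# Glue the runs of each string together, then convert to numeric
combined <- as.numeric(vapply(digit_groups, paste, character(1), collapse = ""))
print(combined)
```

This gives the same values as removing the non-digit characters with gsub.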
