Regex to remove all non-digit symbols from string in R

How can I extract digits from a string that can have a structure of xxxx.x or xxxx.x-x and combine them as a number? e.g.
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1",
"1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
The desired (numeric) output would be:
101011, 101021, 101031...
I tried
regexp <- "([[:digit:]]+)"
solution <- str_extract(list, regexp)
However that only extracts the first set of digits; and using something like
regexp <- "([[:digit:]]+\\.[[:digit:]]+\\-[[:digit:]]+)"
returns the first result (data in its initial form) if matched otherwise NA for shorter strings. Thoughts?

Remove all non-digit symbols:
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1", "1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
as.numeric(gsub("\\D+", "", list))
## => [1] 101011 101021 101031 10301 10401 106011 106021 10701 110011 110021

I have no experience with R but I do know regular expressions. Looking at the pattern you're specifying, "([[:digit:]]+)", I assume [[:digit:]] stands for [0-9], so you're capturing one group of digits.
It seems to me you're missing a + to make it capture multiple groups of digits.
I'd think you'd need to use "([[:digit:]]+)+".
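A quick check of that suggestion, as a sketch in base R: regexpr, like str_extract, returns only the first match, so it stands in for str_extract here. The sample vector x is an assumption, shortened from the question's data. The outer + does not seem to help, because the first run of digits still ends at the "." or "-"; collecting every run and pasting the pieces together is one alternative to deleting the non-digits.

```r
x <- c("1010.1-1", "1030-1")

# regexpr() returns only the first match per string, so the outer "+"
# cannot reach past the "." or "-" that ends the first run of digits.
first_run <- regmatches(x, regexpr("([[:digit:]]+)+", x))

# gregexpr() finds every run of digits; paste the runs back together.
all_runs <- regmatches(x, gregexpr("[[:digit:]]+", x))
combined <- as.numeric(vapply(all_runs, paste, character(1), collapse = ""))

print(first_run)
print(combined)
```

Here combined should agree with as.numeric(gsub("\\D+", "", x)) from the other answer.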

Related

Regex - Best way to match all values between two two-digit numbers?

Let's say I want a Regex expression that will only match numbers between 18 and 31. What is the right way to do this?
I have a set of strings that look like this:
"quiz.18.player.total_score"
"quiz.19.player.total_score"
"quiz.20.player.total_score"
"quiz.21.player.total_score"
I am trying to match only the strings that contain the numbers 18-31, and am currently trying something like this
(quiz.)[1-3]{1}[1-9]{1}.player.total_score
This obviously won't work because it will actually match all numbers between 11-39. What is the right way to do this?
Regex: 1[89]|2\d|3[01]
For matching add additional text and escape the dots:
quiz\.(?:1[89]|2\d|3[01])\.player\.total_score
Details:
(?:) non-capturing group
[] match a single character present in the list
| or
\d matches a digit (equal to [0-9])
\. dot
. matches any character
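A minimal sketch of using that pattern from R, assuming a sample vector s that also contains 17 and 32 to show the boundaries. The ^ and $ anchors are my addition, and perl = TRUE is used so that the (?:) group is handled by the PCRE engine.

```r
# Sample data (assumed): 17 and 32 sit just outside the wanted range.
s <- c("quiz.17.player.total_score",
       "quiz.18.player.total_score",
       "quiz.25.player.total_score",
       "quiz.31.player.total_score",
       "quiz.32.player.total_score")

# 1[89] covers 18-19, 2\d covers 20-29, 3[01] covers 30-31.
pattern <- "^quiz\\.(?:1[89]|2\\d|3[01])\\.player\\.total_score$"

matched <- s[grepl(pattern, s, perl = TRUE)]
print(matched)
```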
1) If s is the character vector, read the fields into a data frame, pick off the second field and check whether it is in the desired range, then use the result to subset s. This uses no regular expressions and only base R.
digits <- read.table(text = s, sep = ".")$V2
s[digits %in% 18:31]
2) Another approach based on the pattern "\\D" matching any non-digit is to remove all such characters and then check if what is left is in the desired range:
digits <- gsub("\\D", "", s)
s[digits %in% 18:31]
2a) In R 3.6.0 and later we can alternatively use the whitespace argument of trimws like this:
digits <- trimws(s, whitespace = "\\D")
s[digits %in% 18:31]
3) Another alternative is to simply construct the boundary strings and compare s to them. This will work only if all the number parts in s are exactly the same number of digits (which for the sample shown in the question is the case).
ok <- s >= "quiz.18.player.total_score" & s <= "quiz.31.player.total_score"
s[ok]
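The snippets above assume s already exists. A self-contained sketch of approaches 1) and 2), with an assumed sample vector that includes 17 and 32 as out-of-range cases, might look like this:

```r
# Sample data (assumed); the real s would come from your own data.
s <- c("quiz.17.player.total_score",
       "quiz.18.player.total_score",
       "quiz.25.player.total_score",
       "quiz.31.player.total_score",
       "quiz.32.player.total_score")

# 1) Split on "." and test the second field as a number.
field <- read.table(text = s, sep = ".")$V2
by_field <- s[field %in% 18:31]

# 2) Strip every non-digit and test what is left.
digits <- gsub("\\D", "", s)
by_digits <- s[digits %in% 18:31]

print(by_field)
print(by_digits)
```

Both versions rely on each string containing exactly one number; approach 2) in particular would glue the digits together if there were several.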
This is done using character ranges and alternations. For your range
3[10]|[2][0-9]|1[8-9]

regular expression: remove consecutive repeated characters at least 2 times as well as those after it in a string in R

I have a vector with different strings like this:
s <- c("mir123mm8", "qwe98wwww98", "123m3tqppppw23!")
and
> s
[1] "mir123mm8" "qwe98wwww98" "123m3tqppppw23!"
I would like to have the answer like this:
> c("mir123", "qwe98", "123m3tq")
[1] "mir123" "qwe98" "123m3tq"
That means that if a string has at least 2 consecutive repeated characters, then they and everything after them should be removed.
What is the best way to do this using a regular expression in R?
You can use back reference in the pattern to match repeated characters:
sub("(.*?)(.)\\2.*", "\\1", s)
# [1] "mir123" "qwe98" "123m3tq"
The pattern matches when the second capture group, a single character, is immediately followed by a copy of itself (the backreference \\2). The ? makes the first capture group non-greedy, so it stops at the first repeated character, and replacing the whole match with \\1 keeps only the text before the repeat.
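As a sketch of how this behaves on strings with and without a repeat: the extra element "abc", the explicit ^ and $ anchors and perl = TRUE are my assumptions, not part of the answer above.

```r
# "abc" (assumed extra case) has no doubled character at all.
s <- c("mir123mm8", "qwe98wwww98", "123m3tqppppw23!", "abc")

# (.*?) lazily takes the prefix, (.)\\2 is the first doubled character,
# and .* swallows the rest; only the prefix is kept.
trimmed <- sub("^(.*?)(.)\\2.*$", "\\1", s, perl = TRUE)
print(trimmed)
```

A string without any doubled character does not match, so sub leaves it unchanged.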

selective removal of characters following a pattern using R

How to selectively remove characters from a string following a pattern?
I wish to remove the 7 figures and the preceding colon.
For example:
"((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
should become
"((Northern_b,Tropical_b)N19"
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
gsub("[:]\\d{1}[.]\\d{6}", "", x)
The gsub function does string replacement and replaces all matches found in the string (see ?gsub). An alternative, if you want something with a friendlier name, is str_replace_all from the stringr package.
The regular expression makes use of the \\d{n} search, which looks for digits. The integer indicates the number of digits to look for. So \\d{1} looks for a set of digits with length 1. \\d{6} looks for a set of digits of length 6.
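To illustrate what the fixed counts imply, here is a small sketch comparing the fixed-width pattern with a variable-width one. The second input y is made up for the comparison; it has branch lengths that are not one digit, a dot and six digits.

```r
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"

# Exactly one digit, a dot, then exactly six digits after the colon.
fixed_width <- gsub(":\\d\\.\\d{6}", "", x)
# Any run of digits and dots after the colon.
any_width <- gsub(":[0-9.]+", "", x)

# Hypothetical input whose numbers have a different shape.
y <- "(A:12.5,B:0.25)"
y_fixed <- gsub(":\\d\\.\\d{6}", "", y)  # nothing matches, y is unchanged
y_any <- gsub(":[0-9.]+", "", y)

print(c(fixed_width, any_width, y_fixed, y_any))
```

The fixed-width version is stricter, which is useful if other colon-number pairs must survive; the variable-width one is more forgiving about precision.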
gsub('[:][0-9.]+','',x)
[1] "((Northern_b,Tropical_b)N19"
Another approach to solve this problem
library(stringr)
str1 <- c("((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950")
str_replace_all(str1, regex("(:\\d{1,}\\.\\d{1,})"), "")
#[1] "((Northern_b,Tropical_b)N19"

Removing a character from within a vector element

I have a vector of strings:
str.vect<-c ("abcR.1", "abcL.1", "abcR.2", "abcL.2")
str.vect
[1] "abcR.1" "abcL.1" "abcR.2" "abcL.2"
How can I remove the third character from the right in each vector element?
Here is the desired result:
"abc.1" "abc.1" "abc.2" "abc.2"
Thank you very much in advance
You can use nchar to find the length of each element of the vector
> nchar(str.vect)
[1] 6 6 6 6
Then you combine this with strtrim to get the beginning of each string
> strtrim(str.vect, nchar(str.vect)-3)
[1] "abc" "abc" "abc" "abc"
To get the end of the word you can then use substr (actually, you could use substr to get the beginning too...)
> substr(str.vect, nchar(str.vect)-1, nchar(str.vect))
[1] ".1" ".1" ".2" ".2"
And finally you use paste0 (which is paste with sep="") to stick them together
> paste0(strtrim(str.vect, nchar(str.vect)-3), # Beginning
substr(str.vect, nchar(str.vect)-1, nchar(str.vect))) # End
[1] "abc.1" "abc.1" "abc.2" "abc.2"
There are easier ways if you know your strings have some special characteristics
For instance, if the length is always 6 you can directly substitute the nchar calls with the appropriate value.
EDIT: alternatively, R also supports regular expressions, which make this task much easier.
> gsub(".(..)$", "\\1", str.vect)
[1] "abc.1" "abc.1" "abc.2" "abc.2"
The syntax is a bit more obscure, but not that difficult once you know what you are looking at.
The first parameter (".(..)$") is what you want to match
. matches any character, $ denotes the end of the string.
So ...$ indicates the last 3 characters in the string.
We put the last two in parentheses, so that we can store them in memory.
The second parameter is what you want to substitute the matched substring with. In our case we put \\1, which means "whatever was in the first pair of parentheses".
So essentially this command means: "find the last three characters in the string and replace them with the last two".
The solution provided by @nico seems fine, but a simpler alternative might be to use sub:
sub('.(.{2})$', '\\1', str.vect)
This searches for the pattern of: "any character (represented by .) followed by 2 of any character (represented by .{2}), followed by the end of the string (represented by $)". By wrapping the .{2} in parentheses, R captures whatever those last two characters were. The second argument is the string to replace the matched substrings with. In this case, we refer to the first string captured in the matched pattern. This is represented by \\1. (If you captured multiple parts of the pattern, with multiple sets of parentheses, you would refer to subsequent captured regions with, e.g. \\2, \\3, etc.)
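The same idea can be generalised to "the k-th character from the right" by building the pattern with sprintf. This is only a sketch: drop_from_right is a hypothetical helper name, and it is written for k of 2 or more.

```r
str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")

# Hypothetical helper: drop the k-th character from the right (k >= 2).
drop_from_right <- function(x, k) {
  # One arbitrary character, then the last k - 1 characters captured.
  pattern <- sprintf(".(.{%d})$", as.integer(k - 1))
  sub(pattern, "\\1", x)
}

third <- drop_from_right(str.vect, 3)
second <- drop_from_right(str.vect, 2)
print(third)
print(second)
```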
str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
# Split each string into single characters
a <- strsplit(str.vect, split = "")
# Blank out the 4th character, then glue the characters back together
b <- unlist(lapply(a, FUN = function(x) {
  x[4] <- ""
  paste(x, collapse = "")
}))
If you want to parameterize it further change 4 to a variable and put the index of the character you want to remove there.
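A parameterised version could be sketched as follows. drop_char_at is a hypothetical helper name, and it uses negative indexing instead of overwriting the character with "".

```r
str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")

# Hypothetical helper: remove the character at position pos (from the left).
drop_char_at <- function(x, pos) {
  chars <- strsplit(x, split = "")
  # ch[-pos] drops that single character before pasting back together.
  vapply(chars, function(ch) paste(ch[-pos], collapse = ""), character(1))
}

fourth_removed <- drop_char_at(str.vect, 4)
first_removed <- drop_char_at(str.vect, 1)
print(fourth_removed)
print(first_removed)
```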
Not sure how general or efficient this is, but it seems to work with your example string:
(This seems very similar to nico's answer although I am not using the strtrim function.)
my.string <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
n.char <- nchar(my.string)
the.beginning <- substr(my.string, n.char-(n.char-1), n.char-3)
the.end <- substr(my.string, n.char-1, n.char)
new.string <- paste0(the.beginning, the.end)
new.string
# [1] "abc.1" "abc.1" "abc.2" "abc.2"
The 3rd character from the right of each element is removed.
# Note: gsub removes every occurrence of that character, which is fine for this example
sapply(str.vect, function(x) gsub(substr(x, nchar(x)-2, nchar(x)-2), "", x))
This is a very quick and dirty answer, but that's what is needed sometimes:
#Define vector
str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
#Use gsub to remove both 'R' and 'L' independently.
str.vect2 <- gsub("R", '', str.vect )
str.vect_final <- gsub("L", '', str.vect2 )
>str.vect_final
[1] "abc.1" "abc.1" "abc.2" "abc.2"
