I need to remove a parenthesis after a number in a string:
"dl_CONH_r = a0cons+a2cons*(CONH_r_lag_1)-a3cons*HGDI_r_lag_1)-(1-a3cons)*HNW_r_lag_2)+a4cons*rate_90_r_lag_1))+a5cons*dl_HCOE_r+a6cons*dl_HOY_r_lag_2)+a7cons*dl_HNW_r_lag_1)+a8cons*d_rate_UNE_lag_2)+(1-a5cons-a6cons-a7cons)*(dl_TREND_PROD+dl_TREND_AVEH+dl_TREND_WAP)"
The resulting string should look like this:
"dl_CONH_r = a0cons+a2cons*(CONH_r_lag_1-a3cons*HGDI_r_lag_1-(1-a3cons)*HNW_r_lag_2+a4cons*rate_90_r_lag_1)+a5cons*dl_HCOE_r+a6cons*dl_HOY_r_lag_2+a7cons*dl_HNW_r_lag_1+a8cons*d_rate_UNE_lag_2+(1-a5cons-a6cons-a7cons)*(dl_TREND_PROD+dl_TREND_AVEH+dl_TREND_WAP)"
The regular expression I am trying to capture here is the first parenthesis after the string "lag_" followed by some number. Note, that in places there are two parenthesis:
rate_90_r_lag_1))
And I only want to remove the first one.
I've tried a simple regex in gsub
a <- "dl_CONH_r = a0cons+a2cons*(CONH_r_lag_1)-a3cons*HGDI_r_lag_1)-(1-a3cons)*HNW_r_lag_2)+a4cons*rate_90_r_lag_1))+a5cons*dl_HCOE_r+a6cons*dl_HOY_r_lag_2)+a7cons*dl_HNW_r_lag_1)+a8cons*d_rate_UNE_lag_2)+(1-a5cons-a6cons-a7cons)*(dl_TREND_PROD+dl_TREND_AVEH+dl_TREND_WAP)"
gsub("[0-9]\\)","[0-9]",a)
But I the resulting string removes the number and replaces it with [0-9]:
"dl_CONH_r = a0cons+a2cons*(CONH_r_lag_[0-9]-a3cons*HGDI_r_lag_[0-9]-(1-a3cons)*HNW_r_lag_[0-9]+a4cons*rate_90_r_lag_[0-9])+a5cons*dl_HCOE_r+a6cons*dl_HOY_r_lag_[0-9]+a7cons*dl_HNW_r_lag_[0-9]+a8cons*d_rate_UNE_lag_[0-9]+(1-a5cons-a6cons-a7cons)*(dl_TREND_PROD+dl_TREND_AVEH+dl_TREND_WAP)"
I understand that the gsub is doing what it is intended to do. What I don't know is how to keep the number before the parenthesis?
You need to use a look around (in this case the preceded by) so that it will match just the parentheses as the matching group instead of the numbers and the parentheses. Then you can just remove the parentheses.
gsub("(?<=[0-9])\\)","", a, perl = TRUE)
You can do this using capture groups:
Lets just try it on the string my_string <- " = a0cons+a2cons*(CONH_r_lag_1)-a3cons*"
reg_expression <- "(.*[0-9])\\)(.*)" #two capture groups, with the parenthesis not in a group
my_sub_string <- sub(reg_expression,"\\1\\2", my_string)
Notice "\\1" reads like \1 to the regex engine, and so is a special character referring to the first capture group. (These can also be named)
Another way of doing this is lookarounds:
There are two basic kinds of lookarounds, a lookahead (?=) and a lookbehind (?<=). Since we want to match a pattern, but not capture, something behind our matched expression we need a lookbehind.
reg_expression <- "(?<=[0-9])\\)" #lookbehind
my_sub_string <- sub(reg_expression,"", my_string)
Which will match the pattern, but only replace the parenthesis.
Related
I have a vector that is constructed with numbers and letters. I want to get all the characters before the LAST letter of each value (which is I guess, always the 2nd letter of the vector). Using stringr (preferably)...
Example :
x = c("1H23456789H10", "97845784584H2", "0H987654321H0", "0P45454545A3", "63A00000000000A91")
str_extract_all(string = x, pattern = ????????)
I tried some tricks here : https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf
The result I want is :
"1H23456789" instead of "1H23456789H10"
"97845784584" instead of "97845784584H2"
"0H987654321", instead of "0H987654321H0"
"0P45454545", instead of "0P45454545A3"
"63A00000000000" instead of "63A00000000000A91"
str_extract(string = x, pattern = "[^A-Z]*[A-Z][^A-Z]*")
# [1] "1H23456789" "0H987654321" "0P45454545" "63A00000000000"
Explanation: we want to extract 1 pattern match per input, so we use str_extract not str_extract_all. Our pattern [^A-Z]*, any number of non-letters, followed by [A-Z] exactly one letter, followed by [^A-Z]* any number of non-letters. I just used capital letters based on your input, but you could change A-Z to A-Za-z inside the brackets to include lower case letters.
I have a vector of strings and I want to remove -es from all strings (words) ending in either -ses or -ces at the same time. The reason I want to do it at the same time and not consequitively is that sometimes it happens that after removing one ending, the other ending appears while I don't want to apply this pattern to a single word twice.
I have no idea how to use two patterns at the same time, but this is the best I could:
text <- gsub("[sc]+s$", "[sc]", text)
I know the replacement is not correct, but I wonder how can I show that I want to replace it with the letter I just detected (c or s in this case). Thank you in advance.
To remove es at the end of words, that is preceded with s or c, you may use
gsub("([sc])es\\b", "\\1", text)
gsub("(?<=[sc])es\\b", "", text, perl=TRUE)
To remove them at the end of strings, you can go on using your $ anchor:
gsub("([sc])es$", "\\1", text)
gsub("(?<=[sc])es$", "", text, perl=TRUE)
The first gsub TRE pattern is ([sc])es\b: a capturing group #1 that matches either s or c, and then es is matched, and then \b makes sure the next char is not a letter, digit or _. The \1 in the replacement is the backreference to the value stored in the capturing group #1 memory buffer.
In the second example with the PCRE regex (due to perl=TRUE), (?<=[sc]) positive lookbehind is used instead of the ([sc]) capturing group. Lookbehinds are not consuming text, the text they match does not land in the match value, and thus, there is no need to restore it anyhow. The replacement is an empty string.
Strings ending with "ces" and "ses" follow the same pattern, i.e. "*es$"
If I understand it correctly than you don't need two patterns.
Example:
x = c("ces", "ses", "mes)
gsub( pattern = "*([cs])es$", replacement = "\\1", x)
[1] "c" "s" "mes"
Hope it helps.
M
Let's say I want a Regex expression that will only match numbers between 18 and 31. What is the right way to do this?
I have a set of strings that look like this:
"quiz.18.player.total_score"
"quiz.19.player.total_score"
"quiz.20.player.total_score"
"quiz.21.player.total_score"
I am trying to match only the strings that contain the numbers 18-31, and am currently trying something like this
(quiz.)[1-3]{1}[1-9]{1}.player.total_score
This obviously won't work because it will actually match all numbers between 11-39. What is the right way to do this?
Regex: 1[89]|2\d|3[01]
For matching add additional text and escape the dots:
quiz\.(?:1[89]|2\d|3[01])\.player\.total_score
Details:
(?:) non-capturing group
[] match a single character present in the list
| or
\d matches a digit (equal to [0-9])
\. dot
. matches any character
!) If s is the character vector read the fields into a data frame picking off the second field and check whether it is in the desired range. Put the result in logical vector ok and get those elements from s. This uses no regular expressions and only base R.
digits <- read.table(text = s, sep = ".")$V2
s[digits %in% 18:31]
2) Another approach based on the pattern "\\D" matching any non-digit is to remove all such characters and then check if what is left is in the desired range:
digits <- gsub("\\D", "", s)
s[digits %in% 18:31]
2a) In the development version of R (to be 3.6.0) we could alternately use the new whitespace argument of trimws like this:
digits <- trimws(s, whitespace = "\\D")
s[digits %in% 18:31]
3) Another alternative is to simply construct the boundary strings and compare s to them. This will work only if all the number parts in s are exactly the same number of digits (which for the sample shown in the question is the case).
ok <- s >= "quiz.18.player.total_score" & s <= "quiz.31.player.total_score"
s[ok]
This is done using character ranges and alternations. For your range
3[10]|[2][0-9]|1[8-9]
Demo
I am trying to extract usernames tagged in a text-chat, such as "#Jack #Marie Hi there!"
I am trying to do it on the combination of # and whitespace but I cannot get the regex to match non-greedy (or at least this is what I think is wrong):
library(stringr)
str_extract(string = '#This is what I want to extract', pattern = "(?<=#)(.*)(?=\\s+)")
[1] "This is what I want to"
What I would like to extract instead is only This.
You could make your regex non greedy:
(?<=#)(.*?)(?=\s+)
Or if you want to capture only "This" after the # sign, you could try it like this using only a positive lookbehind:
(?<=#)\w+
Explanation
A positive lookbehind (?<=
That asserts that what is behind is an #
Close positive lookbehind )
Match one or more word characters \w+
The central part of your regex ((.*)) is a sequence of any chars.
Instead you shoud look for a sequence of chars other than white space
(\S+) or word chars (\w+).
Note also that I changed * to +, as you are probably not interested
in any empty sequence of chars.
To capture also a name which has "last" position in the source
string, the last part of your regex should match not only a sequence
of whitespace chars, but also the end of the string, so change
(?=\\s+) to (?=\\s+|$).
And the last remark: Actually you don't need the parentheses around
the "central" part.
So to sum up, the whole regex can be like this:
(?<=#)\w+(?=\s+|$)
(with global oprion).
Here is a non-regex approach or rather a minimal-regex approach since grep takes the detection of # through the regex engine
grep('#', strsplit(x, ' ')[[1]], value = TRUE)
#[1] "#This"
Or to avoid strsplit, we can use scan (taken from this answer), i.e.
grep('#', scan(textConnection(x), " "), value=TRUE)
#Read 7 items
#[1] "#This"
I realize this is a rather simple question and I have searched throughout this site, but just can't seem to get my syntax right for the following regex challenges. I'm looking to do two things. First have the regex to pick up the first three characters and stop at a semicolon. For example, my string might look as follows:
Apt;House;Condo;Apts;
I'd like to go here
Apartment;House;Condo;Apartment
I'd also like to create a regex to substitute a word in between delimiters, while keep others unchanged. For example, I'd like to go from this:
feline;labrador;bird;labrador retriever;labrador dog; lab dog;
To this:
feline;dog;bird;dog;dog;dog;
Below is the regex I'm working with. I know ^ denotes the beginning of the string and $ the end. I've tried many variations, and am making substitutions, but am not achieving my desired out put. I'm also guessing one regex could work for both? Thanks for your help everyone.
df$variable <- gsub("^apt$;", "Apartment;", df$variable, ignore.case = TRUE)
Here is an approach that uses look behind (so you need perl=TRUE):
> tmp <- c("feline;labrador;bird;labrador retriever;labrador dog; lab dog;",
+ "lab;feline;labrador;bird;labrador retriever;labrador dog; lab dog")
> gsub( "(?<=;|^) *lab[^;]*", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The (?<=;|^) is the look behind, it says that any match must be preceded by either a semi-colon or the beginning of the string, but what is matched is not included in the part to be replaced. The * will match 0 or more spaces (since your example string had one case where there was space between the semi-colon and the lab. It then matches a literal lab followed by 0 or more characters other than a semi-colon. Since * is by default greedy, this will match everything up to, but not including' the next semi-colon or the end of the string. You could also include a positive look ahead (?=;|$) to make sure it goes all the way to the next semi-colon or end of string, but in this case the greediness of * will take care of that.
You could also use the non-greedy modifier, then force to match to end of string or semi-colon:
> gsub( "(?<=;|^) *lab.*?(?=;|$)", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The .*? will match 0 or more characters, but as few as it can get away with, stretching just until the next semi-colon or end of line.
You can skip the look behind (and perl=TRUE) if you match the delimiter, then include it in the replacement:
> gsub("(;|^) *lab[^;]*", "\\1dog", tmp)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
With this method you need to be careful that you only match the delimiter on one side (the first in my example) since the match consumes the delimiter (not with the look-ahead or look-behind), if you consume both delimiters, then the next will be skipped and only every other field will be considered for replacement.
I'd recommend doing this in two steps:
Split the string by the delimiters
Do the replacements
(optional, if that's what you gotta do) Smash the strings back together.
To split the string, I'd use the stringr library. But you can use base R too:
myString <- "Apt;House;Condo;Apts;"
# base R
splitString <- unlist(strsplit(myString, ";", fixed = T))
# with stringr
library(stringr)
splitString <- as.vector(str_split(myString, ";", simplify = T))
Once you've done that, THEN you can do the text substitution:
# base R
fixedApts <- gsub("^Apt$|^Apts$", "Apartment", splitString)
# with stringr
fixedApts <- str_replace(splitString, "^Apt$|^Apts$", "Apartment")
# then do the rest of your replacements
There's probabably a better way to do the replacements than regular expressions (using switch(), maybe?)
Use paste0(fixedApts, collapse = "") to collapse the vector into a single string at the end if that's what you need to do.