Prevent grep in R from treating "." as a letter

I have a character vector that contains text similar to the following:
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
I would like to remove everything before a .:
> "xYz" "ge" "qrstu"
However, the grep function seems to be treating . as a letter:
pattern <- "([A-Z]|[a-z])+$"
grep(pattern, text, value = T)
> "ABc.def.xYz" "ge" "lmo.qrstu"
The pattern works elsewhere, such as on regexpal.
How can I get grep to behave as expected?

grep is for finding a pattern. It returns the indices of the vector elements that match the pattern; if value = TRUE is specified, it returns the matching elements themselves. From the description, it seems that you want to remove a substring rather than return a subset of the initial vector.
If you need to remove the substring, you can use sub
sub('.*\\.', '', text)
#[1] "xYz" "ge" "qrstu"
As the first argument, we pass the pattern '.*\\.'. It matches zero or more characters (.*) followed by a dot (\\.). The \\ is needed to escape the . so that it is treated as a literal dot instead of any character. Because .* is greedy, the match extends to the last . character in the string. We replace the matched text with '' (the replacement argument) and thereby remove the substring.
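To make the effect of the greedy .* concrete, here is a small sketch that compares it with a lazy quantifier. The names after_last and after_first, and the use of perl = TRUE for the lazy version, are illustrative choices and not part of the original answer.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

# Greedy: .* takes as much as it can, so the match ends at the LAST dot
after_last <- sub(".*\\.", "", text)

# Lazy: .*? takes as little as it can, so the match ends at the FIRST dot
after_first <- sub(".*?\\.", "", text, perl = TRUE)

print(after_last)   # "xYz" "ge" "qrstu"
print(after_first)  # "def.xYz" "ge" "qrstu"
```

Elements without a dot, such as "ge", are returned unchanged in both cases because the pattern does not match.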

grep doesn't do any replacements. It searches for matches and returns the indices (or the value if you specify value=T) that give a match. The results you're getting are just saying that those meet your criteria at some point in the string. If you added something that doesn't meet the criteria anywhere into your text vector (for example: "9", "#$%23", ...) then it wouldn't return those when you called grep on it.
If you want it just to return the matched portion you should look at the regmatches function. However for your purposes it seems like sub or gsub should do what you want.
gsub(".*\\.", "", text)
I would suggest reading the help page for regular expressions, ?regex. The Wikipedia page is a decent read as well, but note that R's regular expressions differ a little from some others. https://en.wikipedia.org/wiki/Regular_expression
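For completeness, here is a minimal sketch of the regmatches route mentioned above. The simplified pattern "[A-Za-z]+$" and the extra element "123" are assumptions added for illustration; they show that elements with no match are dropped from the result.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu", "123")
pattern <- "[A-Za-z]+$"

# regexpr() records where the first match starts in each element (-1 if none)
m <- regexpr(pattern, text)

# regmatches() pulls out only the matched portions
matched <- regmatches(text, m)
print(matched)  # "xYz" "ge" "qrstu" ("123" has no match and is dropped)
```

Because non-matching elements are dropped, the result can be shorter than the input, which is one reason sub is often more convenient when the output has to line up with the original vector.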

You may try str_extract function from stringr package.
str_extract(text, "[^.]*$")
This matches the run of non-dot characters at the end of each string.

Your pattern does work, the problem is that grep does something different than what you are thinking it does.
Let's first use your pattern with str_extract_all from the package stringr.
library(stringr)
str_extract_all(text, pattern ="([A-Z]|[a-z])+$")
[[1]]
[1] "xYz"
[[2]]
[1] "ge"
[[3]]
[1] "qrstu"
Notice that the results came as you expected!
The problem you are having is that grep gives you the complete element that matches your regular expression, not only the matching part of the element. In the example below, grep returns the whole first element because it contains "a":
grep(pattern = "a", x = c("abcdef", "bcdf"), value = TRUE)
[1] "abcdef"
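The three ways of asking "does this element match?" can be put side by side in a short sketch. The variable names idx, vals and hits are illustrative only.

```r
x <- c("abcdef", "bcdf")

idx  <- grep("a", x)                # positions of matching elements
vals <- grep("a", x, value = TRUE)  # the matching elements themselves
hits <- grepl("a", x)               # one TRUE/FALSE per element

print(idx)   # 1
print(vals)  # "abcdef"
print(hits)  # TRUE FALSE
```

None of these calls modifies or trims the strings; they only report which elements contain a match.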

Related

Replacing string variable with punctuation in R without removing other string

In R, I am having trouble replacing a substring that contains punctuation. For example, within the string "r.Export", I am trying to replace "r." with "Report.". I've used gsub, and below is my code:
string <- "r.Export"
short <- "r."
replacement <- "Report."
gsub(short,replacement,string)
The desired output is: "Report.Export" however gsub seems to replace the second r such that the output is:
Report.ExpoReport.
Using sub() instead is not a solution either because I am doing multiple gsubs where sometimes the string to be replaced is:
short <- "o."
So, then the o's in r.Export are replaced anyway and it becomes a complete mess.
string <- "r.Export"
short <- "r\\."
replacement <- "Report."
gsub(short,replacement,string)
Returns:
[1] "Report.Export"
Or, using fixed=TRUE:
string <- "r.Export"
short <- "r."
replacement <- "Report."
gsub(short,replacement,string, fixed=TRUE)
Returns:
[1] "Report.Export"
Explanation: Without the fixed=TRUE argument, gsub expects a regular expression as first argument. And with regular expressions . is a placeholder for 'any character'. If you want the literal . (period) you have to use either \\. (i.e. escaping the period) or the aforementioned argument fixed=TRUE
Since your pattern contains a character (.) which has a special meaning in regex, use fixed = TRUE, which matches the string as is.
gsub(short,replacement,string, fixed = TRUE)
#[1] "Report.Export"
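Since the question mentions running several replacements in a row, here is one possible sketch of doing that with fixed = TRUE. The input string, the second pair ("o." to "Output.") and the name replacements are made up for illustration.

```r
string <- "r.Export o.Import"

# Named vector: names are the literal strings to find, values are replacements
replacements <- c("r." = "Report.", "o." = "Output.")

result <- string
for (short in names(replacements)) {
  # fixed = TRUE treats the dot as a literal dot, not as "any character"
  result <- gsub(short, replacements[[short]], result, fixed = TRUE)
}

print(result)  # "Report.Export Output.Import"
```

Note that the replacements are applied in order, so an earlier replacement could in principle introduce text that a later one matches; the order may need adjusting for other inputs.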
I might actually add word boundaries and lookaheads to the mix here, to ensure as targeted a match as possible:
string <- "r.Export"
replacement <- "Report."
output <- gsub("\\br\\.(?=\\w)", replacement, string, perl=TRUE)
output
[1] "Report.Export"
This approach ensures that we only match r. when the r starts a word (that is, it is at the start of the string or preceded by a non-word character such as whitespace), and only when the dot is followed by a word character. Consider the sentence "The project r.Export needed a programmer." We wouldn't want to replace the final r. in this case.
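Applied to that sentence, the difference between a plain literal replacement and the targeted pattern looks like this. The variable names naive and targeted are illustrative.

```r
sentence <- "The project r.Export needed a programmer."

# Literal replacement: also rewrites the "r." at the end of "programmer."
naive <- gsub("r.", "Report.", sentence, fixed = TRUE)

# Word boundary plus lookahead: only rewrites "r." at the start of a word
# when another word character follows the dot
targeted <- gsub("\\br\\.(?=\\w)", "Report.", sentence, perl = TRUE)

print(naive)     # "The project Report.Export needed a programmeReport."
print(targeted)  # "The project Report.Export needed a programmer."
```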
We can use sub
sub(short,replacement,string, fixed = TRUE)
#[1] "Report.Export"

stringr in R: extract filename from filename.extension when filename and filename.extension share common chars

I have a set of 'filename.extension's and I want to extract just the filename. I am having trouble extracting the full filename when the filename shares a character with the file extension. For example, the filename.extension "qrs.sas7bdat" has
filename="qrs"
extension="sas7bdat"
In this case one may observe that the filename shares in common with the extension the character "s".
Here's some R code to give more context:
files_sas <- c("abc.sas7bdat","qrs.sas7bdat")
stringr::str_extract(files_sas,"(?:.*|.*s)[^\\.sas7bdat]")
This set of code returns the following character vector:
"abc" "qr"
This is not what I want -- the desired result I want follows:
c("abc","qrs")
It looks like I'm close, and so I am hoping someone might be able to help me get my desired result.
Many thanks.
We can use sub to match the . (. is a metacharacter that matches any character, so we escape it with \\), followed by any other characters (.*); as the replacement, we specify an empty string ("").
sub("\\..*", "", files_sas)
#[1] "abc" "qrs"
Or with stringr
library(stringr)
str_remove(files_sas, "\\..*")
Or with file_path_sans_ext
tools::file_path_sans_ext(files_sas)
#[1] "abc" "qrs"
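These options agree on the sample data but can differ when a name contains more than one dot. The sketch below uses a made-up file name, "archive.tar.gz", to show the difference; the variable names are illustrative.

```r
files <- c("abc.sas7bdat", "archive.tar.gz")  # second name is hypothetical

# Drop everything from the FIRST dot onwards
from_first_dot <- sub("\\..*", "", files)

# Drop only the LAST extension
from_last_dot <- sub("\\.[^.]*$", "", files)

# tools::file_path_sans_ext() also strips only the last extension by default
via_tools <- tools::file_path_sans_ext(files)

print(from_first_dot)  # "abc" "archive"
print(from_last_dot)   # "abc" "archive.tar"
print(via_tools)       # "abc" "archive.tar"
```

Which behaviour is right depends on whether dots can legitimately appear inside the file names.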

regex to get everything before first number

I can't figure out how to get this regex to work.
My sample data vector looks like this:
claims40 1.1010101
clinical41 391.1
...
It follows the pattern of:
a name,
followed with no spaces by a version number, and
then various other numbers.
I'm trying to create a new column in the data frame with just the name, which can be a variable amount of characters.
So the new column should look like:
claims
clinical
...
When I try to use the expression:
^(.*?)\\d
in regexp, I don't get the correct character match length.
Question: What is the correct regex to capture everything in a string prior to the first number?
gsub("[^a-zA-Z]", "", c("claims40 1.1010101", "clinical41 391.1"))
# [1] "claims" "clinical"
Also this posix style:
gsub("[^[:alpha:]]", "", c("claims40 1.1010101", "clinical41 391.1"))
# [1] "claims" "clinical"
If you specifically want to match until the first digit, you could also do this
gsub("^(.+?)(?=\\d).*", "\\1", c("claims40 1.1010101", "clinical41 391.1"), perl = TRUE)
[1] "claims" "clinical"
Also with str_extract from stringr:
stringr::str_extract(c("claims40 1.1010101", "clinical41 391.1"), "^[[:alpha:]]+")
# [1] "claims" "clinical"
This "extracts" the alphabetical characters instead of removing everything else.
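The approaches above give the same result on the sample data, but "keep only letters" and "keep everything before the first digit" are not the same operation. A small sketch, with a hypothetical third element added to show where they diverge:

```r
x <- c("claims40 1.1010101", "clinical41 391.1", "plan9 beta 2.0")  # third is made up

# Remove the first digit and everything after it
before_digit <- sub("\\d.*", "", x)

# Remove every non-letter anywhere in the string
letters_only <- gsub("[^a-zA-Z]", "", x)

print(before_digit)  # "claims" "clinical" "plan"
print(letters_only)  # "claims" "clinical" "planbeta"
```

If letters can appear again after the version number, the sub version is closer to what the question literally asks for.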

Match & Replace String, utilising the original string in the replacement, in R

I am trying to get to grips with the world of regular expressions in R.
I was wondering whether there was any simple way of combining the functionality of "grep" and "gsub"?
Specifically, I want to append some additional information to anything which matches a specific pattern.
For a generic example, let's say I have a character vector:
char_vec <- c("A","A B","123?")
Then let's say I want to append the following to any letter within any element of char_vec:
append <- "_APPEND"
Such that the result would be:
[1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Clearly a gsub can replace the letters with append, but this does not keep the original expression (while grep would return the letters but not append!).
Thanks in advance for any / all help!
It seems you are not familiar with backreferences that you may use in the replacement patterns in (g)sub. Once you wrap a part of the pattern with a capturing group, you can later put this value back into the result of the replacement.
So, a mere gsub solution is possible:
char_vec <- c("A","A B","123?")
append <- "_APPEND"
gsub("([[:alpha:]])", paste0("\\1", append), char_vec)
## => [1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Here, ([[:alpha:]]) matches any letter and captures it into Group 1, and \1 in the replacement (written "\\1" in an R string) reinserts this value into the result.
Definitely not as slick as @Wiktor Stribiżew's answer, but here is another method I developed.
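char_vars <- c('a', 'b', 'a b', '123')
grep('[A-Za-z]', char_vars)      # indices of elements that contain a letter
gregexpr('[A-Za-z]', char_vars)  # positions of every letter in each element
# Collect the matched letters for each element
matches <- regmatches(char_vars, gregexpr('[A-Za-z]', char_vars))
for (i in seq_along(matches)) {
  for (found in matches[[i]]) {
    # sub() replaces only the first occurrence of the letter, so a letter that
    # also occurs in an already appended suffix could be hit in the wrong place
    char_vars[i] <- sub(pattern = found,
                        replacement = paste0(found, "_append"),
                        x = char_vars[i])
  }
}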

Remove substring in string and keep some substring

PES+PWA+PWH
I have the above string in R in a data frame, I have to write a script such that if it finds PES then it keeps PES and removes the rest.
I only want PES in the output
The grouping operator in R's regex would allow removal of non-"PES" characters:
gsub("(.*)(PES)(.*)", "\\2", c("PES+PWA+PWH", "something else") )
#[1] "PES" "something else"
The problem description wasn't very clear, since another respondent interpreted your request very differently:
text <- c("hello", "PES+PWA+PWH", "world")
text[grepl("PES", text)] <- "PES"
# "hello" "PES" "world"
From your question above I assumed you meant you wanted to have only those rows that contain PES?
You can use the grep function in R for that
column<-c("PES-PSA","PES","PWS","PWA","PES+PWA+PWH")
column[grep("PES",column)]
[1] "PES-PSA" "PES" "PES+PWA+PWH"
grep takes the pattern to match as its first argument and the vector to search in as the second.

Resources