RegEx for replacing part of a string in R [duplicate] - r

I am trying to do an exact pattern match using gsub/sub and the replace function, but I am not getting the desired result. I want to remove the .x and .y from the names without affecting the other names.
name = c("company", "deriv.x", "isConfirmed.y")
new.name = gsub(".x$|.y$", "", name)
new.name
[1] "compa" "deriv" "isConfirmed"
company has become compa.
I have also tried
remove = c(".x", ".y")
replace(name, name %in% remove, "")
[1] "company" "deriv.x" "isConfirmed.y"
I would like the outcome to be.
"company", "deriv", "isConfirmed"
How do I solve this problem?

Here we can have a simple expression that removes the undesired . and anything after that:
(.+?)(?:\..+)?
or for exact match:
(.+?)(?:\.x|\.y)?
R Test
Your code might look like something similar to:
gsub("(.+?)(?:\\..+)?", "\\1", "deriv.x")
or
gsub("(.+?)(?:\\.x|\\.y)?", "\\1", "deriv.x")
Description
Here we have a capturing group (.+?), which holds the desired output, and an optional non-capturing group (?:\..+)?, which consumes the undesired . and everything after it.

An unescaped dot matches any character (in most regex flavors, any character except a newline), so .x$|.y$ also matches the ny at the end of company.
There is no need for any grouping structure to match a dot followed by x or y. You could match a dot and match either x or y using a character class:
\\.[xy]
And replace with an empty string:
name = c("company", "deriv.x", "isConfirmed.y")
new.name = gsub("\\.[xy]", "", name)
new.name
Result
[1] "company" "deriv" "isConfirmed"
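One caveat, offered as a hedged sketch rather than as part of the answer above: without a $ anchor, \\.[xy] also removes a .x or .y in the middle of a name. The name max.x.total below is made up purely for illustration.

```r
# "max.x.total" is a hypothetical column name, not from the question
name <- c("company", "deriv.x", "isConfirmed.y", "max.x.total")

# Unanchored: every ".x" or ".y" is removed, wherever it occurs
unanchored <- gsub("\\.[xy]", "", name)

# Anchored with $: only a trailing ".x" or ".y" is removed
anchored <- sub("\\.[xy]$", "", name)

print(unanchored)
print(anchored)
```

If only trailing suffixes should go, the anchored form is probably the safer choice.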

In a regex, . represents "any character". In order to recognize literal . characters, you need to escape the character, like so:
name <- c("company", "deriv.x", "isConfirmed.y")
new.name <- gsub("\\.x$|\\.y$", "", name)
new.name
[1] "company" "deriv" "isConfirmed"
This explains why, in your original example, "company" was transformed to "compa": .y$ matched "any character" (here the n) followed by a y at the end of the string, and that match was deleted.
Onyambu's comment would also work, since within the [ ] portion of a regex, . is interpreted literally.
gsub("[.](x|y)$", "", name)
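To tie the points above together, here is a minimal sketch comparing the attempts from the question with the escaped patterns. It assumes only the name vector from the question; the variable names are illustrative.

```r
name <- c("company", "deriv.x", "isConfirmed.y")

# %in% compares whole elements, so no name equals ".x" or ".y"
# and replace() has nothing to change
whole_match <- name %in% c(".x", ".y")

# Unescaped dot: ".y$" matches the "ny" at the end of "company"
unescaped <- sub(".x$|.y$", "", name)

# Escaped dot and bracketed dot both require a literal "."
escaped   <- sub("\\.[xy]$", "", name)
bracketed <- sub("[.][xy]$", "", name)

print(whole_match)
print(unescaped)
print(escaped)
print(bracketed)
```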

Related

Find pattern and wrap between parenthesis the next letter

I have to find different patterns in a data frame column, once it is found, the next letter should be wrapped between parentheses:
Data:
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
if the pattern is: '(acetyl)'
this is the output that I'd like to achieve:
Expected output:
b <- c('(R)KJOEQLKQ', 'LDFEION(E)FNEOW')
I know that I can find the pattern with gsub:
b <- gsub('(acetyl)', replacement = '', a)
However, I'm not sure how to approach the wrapping between the parenthesis of the next letter after the pattern is found.
Any help would be appreciated.
You can use
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
gsub('\\(acetyl\\)(.)', '(\\1)', a)
## => [1] "(R)KJOEQLKQ" "LDFEION(E)FNEOW"
See the regex demo and the online R demo.
Details:
\(acetyl\) - matches a literal string (acetyl)
(.) - captures into Group 1 any single char
The (\1) replacement pattern replaces the matches with ( + Group 1 value + ).
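Since the question mentions several different patterns, the same idea can be wrapped in a small function. This is only a sketch: wrap_next_char() is a hypothetical helper name, and it assumes the tag never contains the sequence \E, because the tag is quoted with PCRE's \Q...\E.

```r
# wrap_next_char() is a hypothetical helper, not part of base R or the answer above.
# \\Q...\\E (PCRE, hence perl = TRUE) quotes the tag so its parentheses are literal.
wrap_next_char <- function(x, tag) {
  pattern <- paste0("\\Q", tag, "\\E(.)")
  gsub(pattern, "(\\1)", x, perl = TRUE)
}

a <- c("(acetyl)RKJOEQLKQ", "LDFEION(acetyl)EFNEOW")
b <- wrap_next_char(a, "(acetyl)")
print(b)
```

Quoting the tag avoids having to escape each parenthesis by hand for every new pattern.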

Removing leading question marks from first two words of data frame entries in R

I have a large data frame in R with a column "FullName" holding a text string made up of two words (a binomial scientific name), followed by author name(s) and initials. Both have been corrupted (presumably by UTF translation issues). This means that in the binomials any leading "x" (indicating hybrids) has been replaced with "?". Unfortunately, any non-standard characters in the author names have also been replaced with "?", so I cannot just replace all "?" with x.
I simply want to replace any leading "?" in the first two words with "x" (I will then have to manually compose a list of corrected author names to replace the corrupted ones, unless anyone has a bright idea on that!).
Example chunk of df:
df.corrupt <- data.frame(Bing = 1:6, FullName = c("?Anthematricaria dominii Rohlena", "?Anthemimatricaria inolens P.Fourn.", "?Anthemimatricaria maleolens P.Fourn.", "Achillea ?albinea Bjel?i? & K.Mal?", "Achillea carpatica B?ocki ex Dubovik", "Floscaldasia azorelloides SklenĀ ? & H.Rob."), Bang = 1:6)
I've tried to shoehorn it into regex but can't get close. Any help appreciated!
As I understand it, you want to replace ? only if it occurs in word-initial position in either the first or the second word; if that's correct, this should work:
Data: (I've changed a few chars)
df.corrupt <- data.frame(Bing = 1:6,
FullName = c("?Anthematricaria dominii ?Rohlena",
"?Anthemimatricaria inolens P.Fourn.",
"?Anthemimatricaria maleolens ?P.Fourn.",
"Achillea ?albinea Bjel?i? & K.Mal?",
"Achillea carpatica B?ocki ex Dubovik",
"Floscaldasia azorelloides Sklen ? & H.Rob."), Bang = 1:6)
Solution:
library(stringr)
str_replace_all(df.corrupt$FullName, "^\\?|(?<=^(\\?)?\\b\\w{1,100}\\b\\s)\\?", "x")
[1] "xAnthematricaria dominii ?Rohlena" "xAnthemimatricaria inolens P.Fourn."
[3] "xAnthemimatricaria maleolens ?P.Fourn." "Achillea xalbinea Bjel?i? & K.Mal?"
[5] "Achillea carpatica B?ocki ex Dubovik" "Floscaldasia azorelloides Sklen ? & H.Rob."
This stringr solution puts x where ? occurs right at the start of the string (^) or (|), using a positive lookbehind (a non-consuming assertion), where it follows a whitespace character (\\s) that in turn follows a word of 1 to 100 \\w characters delimited by word boundaries (\\b), itself optionally preceded by a ? at the start of the string.
Alternatively, we can look for a ? that follows a space or sits at the start of the string and replace it with 'x'. Note that this also replaces a ? at the start of any later word, such as one in an author name:
trimws(gsub("(^|\\s)\\?", " x", df.corrupt$FullName))
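If the replacement has to stay within the first two words without stringr, one possible base R sketch uses two anchored sub() calls. It assumes that words are separated by a single whitespace character; the vector is a shortened sample of the data above.

```r
# Shortened sample of the names used above
full_name <- c("?Anthematricaria dominii ?Rohlena",
               "Achillea ?albinea Bjel?i? & K.Mal?",
               "Achillea carpatica B?ocki ex Dubovik")

# Step 1: a "?" at the very start of the string (start of the first word)
step1 <- sub("^\\?", "x", full_name)

# Step 2: a "?" that starts the second word, i.e. one that directly follows
# the first run of non-space characters plus one whitespace character
fixed_names <- sub("^(\\S+\\s)\\?", "\\1x", step1)

print(fixed_names)
```

Because both patterns are anchored at ^, a ? in the author part is left untouched.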

Replace multiple consecutive hyphens in R

I have a string which looks like this:
something-------another--thing
I want to replace the multiple dashes with a single one.
So the expected output would be:
something-another-thing
We can try using gsub here:
x <- "something-------another--thing"
gsub("-{2,}", "-", x)
[1] "something-another-thing"
More generally, if we want to replace any sequence of two or more of the same character with just the single character, then use this version:
x <- "something-------another--thing"
gsub("(.)\\1+", "\\1", x)
The second pattern could use an explanation:
(.) match AND capture any single character
\\1+ then match that same character again, one or more times
Then, we replace the whole run with just the single captured character.
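A word of caution, shown as a small sketch: because (.) matches any character, the general version also collapses legitimate double letters. Restricting the capture group to a character class avoids that. The input string here is hypothetical.

```r
# Hypothetical input containing a legitimate double letter ("oo")
x <- "book__shelf--unit"

# Any repeated character is collapsed, including the "oo" in "book"
any_char <- gsub("(.)\\1+", "\\1", x)

# Only repeated hyphens or underscores are collapsed
only_marks <- gsub("([-_])\\1+", "\\1", x)

print(any_char)
print(only_marks)
```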
You can do it with gsub and a regex.
> text='something-------another--thing'
> gsub('-{2,}','-',text)
[1] "something-another-thing"
t2 <- "something-------another--thing"
library(stringr)
str_replace_all(t2, pattern = "-+", replacement = "-")
which gives:
[1] "something-another-thing"
If you're searching for the right regex to search for a string, you can test it out here https://regexr.com/
In the above, the pattern to find is a hyphen, so pattern = "-"; adding the + quantifier makes it match a run of one or more consecutive hyphens, so we get pattern = "-+". Each run, however long, is then replaced by a single hyphen.

Regular expressions, stringr - have the regex, can't get it to work in R

I have a data frame with a field called "full.path.name"
This contains things like
s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx
01 GROUP is a pattern of variable size in the whole string.
I would like to add a new field onto the data frame called "short.path"
and it would contain things like
s:///01 GROUP
s:///02 GROUP LONGER NAME
I've managed to extract the last four characters of the file using stringr, I think I should use stringr again.
This gives me the file extension
sfiles$file_type<-as.factor(str_sub(sfiles$Type.of.file,-4))
I went to https://www.regextester.com/
and got this
s:///*.[^/]*
as the regex to use
so I tried it below
sfiles$file_path_short<-as.factor(str_match(sfiles$Full.path.name,regex("s:///*.[^/]*")))
What I thought I would get is a new field on my data frame containing
01 GROUP etc
I get NA
When I try this
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,"[S]")
Gives me S
Where am I going wrong?
When I use: https://regexr.com/
I get
\d* [A-Z]* [A-Z]*[^/]
How do I put that into
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,\d* [A-Z]* [A-Z]*[^\/])
And make things work?
EDIT:
There are two solutions here.
The solutions did not work at first because
sfiles$Full.path.name
was longer than 255 characters in some cases.
What I did:
To make g_t_m's regex work
library(tidyverse)
# read the file
sfiles <- read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
# add a field to calculate path length and filter out long paths
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles <- sfiles %>% filter(file_path_length <= 255)
# then use str_replace to take out the full path name and leave only the top
# folder names
sfiles$file_path_short <- as.factor(str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1"))
levels(sfiles$file_path_short)
[1] "S:///01 GROUP 1"
[2] "S:///02 GROUP 2"
[3] "S:///03 GROUP 3"
[4] "S:///04 GROUP 4"
[5] "S:///05 GROUP 5"
[6] "S:///06 GROUP 6"
[7] "S:///07 GROUP 7"
I think it was the full.path.name field that was causing problems.
To make Wiktor's answer work I did this:
#read the file
sfiles<-read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
str(sfiles)
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles<-sfiles%>%filter(file_path_length <=255)
sfiles$file_path_short <- str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
You may use a mere
sfiles$file_path_short <- str_extract(sfiles$Full.path.name, "^s:///[^/]+")
If you plan to exclude s:/// from the results, wrap it within a positive lookbehind:
"(?<=^s:///)[^/]+"
See the regex demo
Details
^ - start of string
s:/// - a literal substring
[^/]+ - a negated character class matching any 1+ chars other than /.
(?<=^s:///) - a positive lookbehind that requires the presence of s:/// at the start of the string immediately to the left of the current location (but this value does not appear in the resulting matches since lookarounds are non-consuming patterns).
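For readers without stringr, the same extraction can be approximated in base R with regexpr() and regmatches(). This is a sketch under assumptions: the second and third paths are hypothetical, and perl = TRUE is needed for the lookbehind. Unlike str_extract, regmatches() drops non-matching elements, so the first variant fills an NA vector by hand.

```r
# Only the first path is from the question; the other two are hypothetical
paths <- c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
           "s:///02 GROUP LONGER NAME/notes.txt",
           "readme.txt")

# Keep one result per input, with NA where the pattern does not match
m <- regexpr("^s:///[^/]+", paths)
short <- rep(NA_character_, length(paths))
short[m > 0] <- regmatches(paths, m)

# Lookbehind variant: returns only the matches, without the s:/// prefix
no_prefix <- regmatches(paths, regexpr("(?<=^s:///)[^/]+", paths, perl = TRUE))

print(short)
print(no_prefix)
```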
Firstly, I would amend your regex to extract the file extension, since file extensions are not always 4 characters long:
library(stringr)
df <- data.frame(full.path.name = c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
"s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf"), stringsAsFactors = F)
df$file_type <- str_replace(basename(df$full.path.name), "^.+\\.(.+)$", "\\1")
df$file_type
[1] "docx" "pdf"
Then, the following code should give you your short name:
df$file_path_short <- str_replace(df$full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
df
                                              full.path.name file_type file_path_short
1 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx      docx   s:///01 GROUP
2  s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf       pdf   s:///01 GROUP
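As an alternative to the hand-written extension regex, the tools package that ships with R provides file_ext(). The sketch below reuses the two example paths; note that file_ext() returns an empty string when a name has no extension.

```r
paths <- c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
           "s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf")

# tools::file_ext() returns the part after the last dot
file_type <- tools::file_ext(paths)

print(file_type)
```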

Transform suffix into prefix in column name

I would like to move the suffix of a column name to its beginning so that it becomes its prefix. I have many columns with changing names (except the suffix), so manually renaming is not an option.
Example:
set.seed(1)
dat <- data.frame(ID = 1:5,
speed.x.alpha = runif(5),
power.x.alpha = rpois(5, 1),
force.x.alpha = rexp(5),
speed.y.beta = runif(5),
power.y.beta = rpois(5, 1),
force.y.beta = rexp(5))
In the end, the data frame should have the following column names:
ID, alpha.speed.x, alpha.power.x, alpha.force.x, beta.speed.x, beta.power.x, beta.force.x.
I strongly assume I need a gsub/sub expression which allows me to select the characters after the last dot, which I would then paste to the colnames, and eventually remove from the end. So far without success though...
A couple of gsubs and paste0 will do the trick:
gsub("y$", "x", gsub("(^.*)\\.(.*a$)", paste0("\\2", ".", "\\1"), names(dat)))
[1] "ID" "alpha.speed.x" "alpha.power.x" "alpha.force.x" "beta.speed.x"
[6] "beta.power.x" "beta.force.x"
The () in the regular expression capture the characters that match the subexpression. "\\." matches a literal "." and "$" anchors the expression to the end of the string. The second argument pastes together the backreferences to the captured subexpressions. The result is fed to a second gsub, which replaces a trailing "y" with an "x" if one is found.
To rename the variables, use
names(dat) <- gsub("y$", "x", gsub("(^.*)\\.(.*a$)", paste0("\\2", ".", "\\1"), names(dat)))
Here is one option with sub. From the start of the string (^), we match one or more characters that are not a dot ([^.]+) and capture them as a group (the parentheses (...)). This is followed by a literal dot (\\.; an unescaped . is a metacharacter that matches any character, so it must be escaped with \\ or placed inside square brackets to be read literally), then a second captured run of non-dot characters, then another dot, and finally the rest of the string, captured as the third group. In the replacement, we reorder the backreferences to the capture groups to get the expected output.
names(dat) <- sub("^([^.]+)\\.([^.]+)\\.(.*)", "\\3.\\1.\\2", names(dat))
names(dat)
#[1] "ID" "alpha.speed.x" "alpha.power.x" "alpha.force.x"
#[5] "beta.speed.y" "beta.power.y" "beta.force.y"
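If the number of dot-separated pieces varies between columns, a slightly more general sketch moves only the last piece to the front. The name a.b.c.gamma is hypothetical and only shows a name with more pieces.

```r
# The first three names come from the example; "a.b.c.gamma" is hypothetical
nms <- c("ID", "speed.x.alpha", "power.y.beta", "a.b.c.gamma")

# (.*) greedily takes everything up to the last dot,
# ([^.]+)$ takes the final dot-free piece
moved <- sub("^(.*)\\.([^.]+)$", "\\2.\\1", nms)

print(moved)
```

Names without a dot, such as ID, do not match and are returned unchanged.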
