Transform suffix into prefix in column name - r

I would like to move the suffix of a column name to its beginning so that it becomes its prefix. I have many columns with changing names (except the suffix), so manually renaming is not an option.
Example:
set.seed(1)
dat <- data.frame(ID = 1:5,
speed.x.alpha = runif(5),
power.x.alpha = rpois(5, 1),
force.x.alpha = rexp(5),
speed.y.beta = runif(5),
power.y.beta = rpois(5, 1),
force.y.beta = rexp(5))
In the end, the data frame should have the following column names:
ID, alpha.speed.x, alpha.power.x, alpha.force.x, beta.speed.x, beta.power.x, beta.force.x.
I strongly assume I need a gsub/sub expression which allows me to select the characters after the last dot, which I would then paste to the colnames, and eventually remove from the end. So far without success though...

A couple of gsubs and paste0 will do the trick:
gsub("y$", "x", gsub("(^.*)\\.(.*a$)", paste0("\\2", ".", "\\1"), names(dat)))
[1] "ID" "alpha.speed.x" "alpha.power.x" "alpha.force.x" "beta.speed.x"
[6] "beta.power.x" "beta.force.x"
The parentheses in the regular expression capture the characters that match each subexpression. "\\." matches a literal "." and "$" anchors the expression to the end of the string. Because the first group is greedy, the "\\." ends up matching the last dot in the name. The second argument pastes together the captured subexpressions in swapped order. The result is fed to a second gsub, which replaces a trailing "y" with "x" if one is found.
To rename the variables, use
names(dat) <- gsub("y$", "x", gsub("(^.*)\\.(.*a$)", paste0("\\2", ".", "\\1"), names(dat)))

Here is one option with sub. From the start of the string (^), we match one or more characters that are not a dot ([^.]+) and capture them as a group (the part inside the parentheses). This is followed by a literal dot (\\.). Note that . is a metacharacter that matches any character, so it has to be escaped (\\) or placed inside square brackets to be read literally. Next comes a second capture group of non-dot characters, another literal dot, and a third group holding the rest of the characters up to the end of the string. In the replacement, we change the order of the backreferences to get the expected output.
names(dat) <- sub("^([^.]+)\\.([^.]+)\\.(.*)", "\\3.\\1.\\2", names(dat))
names(dat)
#[1] "ID" "alpha.speed.x" "alpha.power.x" "alpha.force.x"
#[5] "beta.speed.y" "beta.power.y" "beta.force.y"
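The pattern above assumes exactly three dot-separated parts. As a minimal sketch for names with any number of dots, a greedy first group can be used so that only the part after the last dot is moved to the front. The vector old below is made up for illustration and is not part of the question's data.

```r
# Hypothetical column names with a varying number of dots
old <- c("ID", "speed.x.alpha", "power.y.beta", "a.b.c.gamma")

# (.*) is greedy, so "\\." matches the LAST dot;
# ([^.]+)$ captures the suffix after it
new <- sub("^(.*)\\.([^.]+)$", "\\2.\\1", old)
print(new)
```

Names without a dot, such as ID, do not match the pattern and are returned unchanged.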

Related

Cut every element of a vector of strings at the second occurrence of a pattern in R

I have a chr vector:
> head(strings)
[1] "10_88517_0" "10_88521_1" "10_88542_2" "10_280230_3" "10_280258_4" "10_280310_5"
I want to create a new vector of substrings, obtained by cutting each element of this vector at the second _. E.g.:
> head(cut_strings)
[1] "10_88517" "10_88521" "10_88542" "10_280230" "10_280258" "10_280310"
My idea was to first grep for the position of second _ in each string:
cut_pts <- sapply(stringr::str_locate_all(strings, "_"), "[", 2)
All I can come up with though is an awkward for loop that goes through the strings vector and calls substr for each element, e.g.:
cut_strings <- strings
for(i in 1:length(strings)){
string <- strings[i]
cut_pt <- cut_pts[i]
string <- substr(string, 1, cut_pt-1)
cut_strings[i] <- string
}
I'm thinking maybe there's a way to use apply in this context, to cut each element of strings based on the appropriate element of cut_pts?
We could capture the part to keep with sub and drop the rest. The pattern below matches, from the start of the string (^), one or more characters that are not an underscore ([^_]+), followed by an underscore and another run of non-underscore characters; all of this sits inside the capture group ((...)). The second underscore and everything after it are matched outside the group, so they are removed. In the replacement, use the backreference (\\1) of the captured group.
sub("^([^_]+_[^_]+)_.*", "\\1", strings)
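The question also asks whether the cut positions can be used without a loop. As a minimal sketch: substr is vectorized over its arguments, so a vector of cut points can be passed directly. The example below uses base R's gregexpr instead of stringr, and a shortened copy of the strings vector.

```r
strings <- c("10_88517_0", "10_88521_1", "10_280230_3")

# Position of the second "_" in each string
# (assumes every element has at least two)
cut_pts <- vapply(gregexpr("_", strings, fixed = TRUE),
                  function(m) m[2], integer(1))

# substr() is vectorized, so no loop or apply is needed
cut_strings <- substr(strings, 1, cut_pts - 1)
print(cut_strings)
```

For this input the result is the same as the one obtained with the sub pattern above.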

Substitute and drop characters in column names

I have the following names of variables:
vars <- c("var-1.caps(12, For]","var2(5,For]","var-3.tree.(15, For]","var-3.tree.(30, For]")
I need to clean these names in order to get the following result:
clean_vars <- c("var1.caps_12_For","var2_5_For","var3.tree_15_For","var3.tree_30_For")
So, basically I would like to drop -, ( and ].
I was using this approach:
gsub("\\(.*\\]","",vars)
But it drops everything between ( and ], and it does not drop the - symbol either.
We can use capture groups. Match an optional . (\\.?) followed by a ( (a metacharacter, so escape it with \\), then one or more digits (\\d+) captured as the first group ((...)), followed by a , and zero or more spaces (\\s*), then capture the word ([A-Za-z]+) as the second group, followed by the closing ] at the end of the string. In the replacement, specify the backreferences (\\1, \\2) of the capture groups along with _ to get the expected output. A second sub then removes the -.
out <- sub("\\.?\\((\\d+),\\s*([A-Za-z]+)\\]$", "_\\1_\\2", vars)
out
#[1] "var-1.caps_12_For" "var2_5_For" "var-3.tree_15_For" "var-3.tree_30_For"
sub('-', '', out)
#[1] "var1.caps_12_For" "var2_5_For" "var3.tree_15_For" "var3.tree_30_For"
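The same cleaning can also be written as a chain of small substitutions, which may be easier to adapt when the names vary. This is only a sketch; the intermediate names step1 to step3 and clean are introduced here for illustration.

```r
vars <- c("var-1.caps(12, For]", "var2(5,For]",
          "var-3.tree.(15, For]", "var-3.tree.(30, For]")

step1 <- gsub("-", "", vars, fixed = TRUE)  # drop every hyphen
step2 <- sub("\\.?\\(", "_", step1)         # "(" or ".(" becomes "_"
step3 <- sub(",\\s*", "_", step2)           # comma plus optional spaces becomes "_"
clean <- sub("]", "", step3, fixed = TRUE)  # drop the closing bracket
print(clean)
```

Using fixed = TRUE for the literal characters avoids having to escape them.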

Replace multiple consecutive hyphens in R

I have a string which looks like this:
something-------another--thing
I want to replace the multiple dashes with a single one.
So the expected output would be:
something-another-thing
We can try using gsub here:
x <- "something-------another--thing"
gsub("-{2,}", "-", x)
[1] "something-another-thing"
More generally, if we want to replace any sequence of two or more of the same character with just the single character, then use this version:
x <- "something-------another--thing"
gsub("(.)\\1+", "\\1", x)
The second pattern could use an explanation:
(.) match AND capture any single character
\\1+ then match that same character again, one or more times
Then, we replace with just the single captured character.
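One thing to keep in mind with the general version is that it collapses every repeated character, including doubled letters inside words. A small sketch with a made-up second string shows the difference between the two patterns:

```r
x <- c("something-------another--thing", "book--shelf")

# Collapse runs of hyphens only
hyphens_only <- gsub("-{2,}", "-", x)
print(hyphens_only)

# Collapse runs of ANY repeated character: the "oo" in "book" shrinks too
any_char <- gsub("(.)\\1+", "\\1", x)
print(any_char)
```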
You can do it with gsub and a regex.
> text='something-------another--thing'
> gsub('-{2,}','-',text)
[1] "something-another-thing"
t2 <- "something-------another--thing"
library(stringr)
str_replace_all(t2, pattern = "-+", replacement = "-")
which gives:
[1] "something-another-thing"
If you're looking for the right regex, you can test it out at https://regexr.com/
In the above, a single hyphen would be matched by pattern = "-". Adding the + quantifier makes the pattern match a run of one or more consecutive hyphens, so pattern = "-+" replaces each whole run with a single hyphen.

RegEx for replacing part of a string in R [duplicate]

I am trying to do an exact pattern match using the gsub/sub and replace function. I am not getting the desired response. I am trying to remove the .x and .y from the names without affecting other names.
name = c("company", "deriv.x", "isConfirmed.y")
new.name = gsub(".x$|.y$", "", name)
new.name
[1] "compa" "deriv" "isConfirmed"
company has become compa.
I have also tried
remove = c(".x", ".y")
replace(name, name %in% remove, "")
[1] "company" "deriv.x" "isConfirmed.y"
I would like the outcome to be.
"company", "deriv", "isConfirmed"
How do I solve this problem?
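Here we can use a simple expression that removes the undesired . and anything after it:
^(.+?)(?:\..+)?$
or, for an exact match of .x or .y only:
^(.+?)(?:\.x|\.y)?$
The ^ and $ anchors matter: without them, the lazy (.+?) stops after a single character and nothing is removed.
In an R string every backslash has to be doubled, and perl = TRUE is passed here so that the lazy quantifier and the non-capturing group are handled by PCRE. Your code might look something like:
gsub("^(.+?)(?:\\..+)?$", "\\1", "deriv.x", perl = TRUE)
or
gsub("^(.+?)(?:\\.x|\\.y)?$", "\\1", "deriv.x", perl = TRUE)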
Description
Here we have a capturing group (.+?), which holds the desired output, and an optional non-capturing group (?:\..+)?, which consumes the undesired . and everything after it.
The dot matches any character except a newline, so .x$|.y$ would also match the ny in company.
There is no need for any grouping structure to match a dot followed by x or y. You could match a dot and match either x or y using a character class:
\\.[xy]
And replace with an empty string:
name = c("company", "deriv.x", "isConfirmed.y")
new.name = gsub("\\.[xy]", "", name)
new.name
Result
[1] "company" "deriv" "isConfirmed"
In a regex, . represents "any character". In order to recognize literal . characters, you need to escape the character, like so:
name <- c("company", "deriv.x", "isConfirmed.y")
new.name <- gsub("\\.x$|\\.y$", "", name)
new.name
[1] "company" "deriv" "isConfirmed"
This explains why in your original example, "company" was being transformed to "compa" (deleting the "any character of 'n', followed by a 'y' and end of string").
Onyambu's comment would also work, since within the [ ] portion of a regex, . is interpreted literally.
gsub("[.](x|y)$", "", name)
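To compare the variants discussed in this thread, here is a small sketch. The extra element "max.xy" is a made-up name added to show why the $ anchor matters when only a trailing .x or .y should go.

```r
name <- c("company", "deriv.x", "isConfirmed.y", "max.xy")

# Unescaped dot: "." matches any character, so "ny" in "company" is removed
unescaped <- gsub(".x$|.y$", "", name)
print(unescaped)

# Escaped dot, no anchor: also strips ".x" from the middle of "max.xy"
unanchored <- gsub("\\.[xy]", "", name)
print(unanchored)

# Escaped dot with anchor: only a trailing ".x" or ".y" is removed
anchored <- sub("\\.[xy]$", "", name)
print(anchored)
```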

Replace element in vector based on first letter of character string

Consider the vectors below:
ID <- c("A1","B1","C1","A12","B2","C2","Av1")
names <- c("ALPHA","BRAVO","CHARLIE","AVOCADO")
I want to replace the first character of each element in vector ID with vector names based on the first letter of vector names. I also want to add a _0 before each number between 0:9.
Note that the elements Av1 and AVOCADO throw things off a bit, especially with the lowercase v in Av1.
The result should look like this:
res <- c("ALPHA_01","BRAVO_01","CHARLIE_01","ALPHA_12","BRAVO_02","CHARLIE_02", "AVOCADO_01")
I know it should be done with regex but I've been trying for 2 days now and haven't got anywhere.
We can use gsubfn.
library(gsubfn)
#remove the number part from 'ID' (using `sub`) and get the unique elements
nm1 <- unique(sub("\\d+", "", ID))
#using gsubfn, replace the non-numeric elements with the matching
#key/value pair in the replacement
#finally format to add the "_" with sub
sub("(\\d+)$", "_0\\1", gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_012"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
The (\\d+) matches one or more digits, and (\\D+) matches one or more non-digit characters. We wrap each in parentheses to capture it as a group, and refer to it in the replacement with the backreference (\\1, as it is the first captured group).
Update
If the 0 should be prepended only for 'ID's whose number is less than 10 (so that A12 becomes ALPHA_12 rather than ALPHA_012), then we can do this with a second gsubfn and sprintf
gsubfn("(\\d+)", ~sprintf("_%02d", as.numeric(x)),
gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_12"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
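The difference between pasting a fixed "_0" and zero-padding with sprintf can be seen in isolation. This is a small sketch; the vector num stands in for the numeric parts of IDs such as A1, A12 and B2.

```r
# Hypothetical numeric parts extracted from the IDs
num <- as.integer(c("1", "12", "2"))

# "%02d" pads to a width of two digits, so 12 stays as it is
padded <- sprintf("_%02d", num)
print(padded)

# Pasting a fixed "_0" in front pads every number, including 12
naive <- paste0("_0", num)
print(naive)
```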
Doing this in base R, we can check whether the second character of each name is V (as in AVOCADO) and take the first two characters as the key if so, or the first character otherwise. This distinguishes AVOCADO (key AV) from ALPHA (key A). We then match these keys against the letters extracted from ID, converted with toupper so that Av matches AV. Finally, we paste on an underscore and the number from each ID, zero-padded to two digits with sprintf so that 12 is not turned into 012.
paste0(names[match(toupper(sub('\\d+', '', ID)),
ifelse(substr(names, 2, 2) == 'V', substr(names, 1, 2),
substr(names, 1, 1)))],
sprintf('_%02d', as.integer(sub('\\D+', '', ID))))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_12" "BRAVO_02" "CHARLIE_02" "AVOCADO_01"

Resources