Substitute and drop characters in column names - r

I have the following names of variables:
vars <- c("var-1.caps(12, For]","var2(5,For]","var-3.tree.(15, For]","var-3.tree.(30, For]")
I need to clean these names in order to get the following result:
clean_vars <- c("var1.caps_12_For","var2_5_For","var3.tree_15_For","var3.tree_30_For")
So, basically I would like to drop -, ( and ].
I was using this approach:
gsub("\\(.*\\]","",vars)
But that drops everything between ( and ], and it does not remove the - either.

We can capture the parts to keep as groups. The pattern matches an optional . (\\.?), then a literal ( (a metacharacter, so it is escaped with \\), then one or more digits captured as the first group ((\\d+)), then a , and zero or more spaces (\\s*), then the word captured as the second group (([A-Za-z]+)), and finally the closing ] at the end of the string. In the replacement, the backreferences (\\1, \\2) are joined with _ to give the expected output. A second sub call then removes the -.
out <- sub("\\.?\\((\\d+),\\s*([A-Za-z]+)\\]$", "_\\1_\\2", vars)
out
#[1] "var-1.caps_12_For" "var2_5_For" "var-3.tree_15_For" "var-3.tree_30_For"
sub('-', '', out)
#[1] "var1.caps_12_For" "var2_5_For" "var3.tree_15_For" "var3.tree_30_For"
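The two steps above can be folded into one helper, shown here as a minimal sketch. The function name clean_names is hypothetical, and gsub is used for the hyphen on the assumption that a name might contain more than one.

```r
vars <- c("var-1.caps(12, For]", "var2(5,For]",
          "var-3.tree.(15, For]", "var-3.tree.(30, For]")

# Hypothetical helper: rewrite the "(n, Word]" suffix, then drop all hyphens
clean_names <- function(x) {
  x <- sub("\\.?\\((\\d+),\\s*([A-Za-z]+)\\]$", "_\\1_\\2", x)
  gsub("-", "", x, fixed = TRUE)
}

cleaned <- clean_names(vars)
cleaned
```

If the names belong to a data frame, the same call could be applied as names(df) <- clean_names(names(df)).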

Related

Cut every element of a vector of strings at the second occurrence of a pattern in R

I have a chr vector:
> head(strings)
[1] "10_88517_0" "10_88521_1" "10_88542_2" "10_280230_3" "10_280258_4" "10_280310_5"
I want to create a new vector of substrings, obtained by cutting each element of this vector at the second _. E.g.:
> head(cut_strings)
[1] "10_88517" "10_88521" "10_88542" "10_280230" "10_280258" "10_280310"
My idea was to first grep for the position of second _ in each string:
cut_pts <- sapply(stringr::str_locate_all(strings, "_"), "[", 2)
All I can come up with though is an awkward for loop that goes through the strings vector and calls substr for each element, e.g.:
cut_strings <- strings
for(i in 1:length(strings)){
string <- strings[i]
cut_pt <- cut_pts[i]
string <- substr(string, 1, cut_pt-1)
cut_strings[i] <- string
}
I'm thinking maybe there's a way to use apply in this context, to cut each element of strings based on the appropriate element of cut_pts?
We can capture the part to keep with sub and discard the rest. The pattern below matches one or more characters that are not an underscore ([^_]+), then an underscore, then another run of non-underscore characters, all inside the capture group ((...)). Everything from the second underscore onward is matched outside the group, so it is dropped. Note that the pattern is anchored to the start of the string (^). In the replacement, use the backreference (\\1) of the captured group.
sub("^([^_]+_[^_]+)_.*", "\\1", strings)
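For the original idea of cutting at precomputed positions, no loop or apply call is needed, because substr is vectorised over both the strings and the stop positions. A minimal base R sketch, using a shortened copy of the example vector:

```r
strings <- c("10_88517_0", "10_88521_1", "10_280230_3")

# Position of the second "_" in each string (NA if there are fewer than two)
cut_pts <- vapply(gregexpr("_", strings, fixed = TRUE),
                  function(m) m[2], integer(1))

# substr() works element-wise on vectors, so no explicit loop is required
cut_strings <- substr(strings, 1, cut_pts - 1)
cut_strings
```

Strings with fewer than two underscores come back as NA here, whereas the sub approach above returns them unchanged.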

Using gsub or sub function to only get part of a string?

Col
WBU-ARGU*06:03:04
WBU-ARDU*08:01:01
WBU-ARFU*11:03:05
WBU-ARFU*03:456
I have a column which has 75 rows of variables such as the col above. I am not quite sure how to use gsub or sub in order to get up until the integers after the first colon.
Expected output:
Col
WBU-ARGU*06:03
WBU-ARDU*08:01
WBU-ARFU*11:03
WBU-ARFU*03:456
I tried this but it doesn't seem to work:
gsub("*..:","", df$col)
The following may help here too.
sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
Output will be as follows.
> sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456b"
Where Input for data frame is as follows.
dat <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456b")
df <- data.frame(dat)
Explanation: the breakdown below is for explanation purposes only and is not runnable as written.
sub(" ##sub() replaces only the first match in each element (gsub() would replace every match).
([^:]*) ##The parentheses store the matched text in capture group 1: everything up to the first colon.
: ##A literal colon.
([^:]*) ##Capture group 2: everything up to the next colon.
.*" ##.* matches the rest of the string, from the second colon onward.
,"\\1:\\2" ##The replacement: group 1 (\\1), a colon, then group 2 (\\2).
,df$dat) ##The input vector: the dat column of df.
You may use
df$col <- sub("(\\d:\\d+):\\d+$", "\\1", df$col)
Details
(\\d:\\d+) - Capturing group 1 (its value will be accessible via \1 in the replacement pattern): a digit, a colon and 1+ digits.
: - a colon
\\d+ - 1+ digits
$ - end of string.
R Demo:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("(\\d:\\d+):\\d+$", "\\1", col)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternative approach:
df$col <- sub("^(.*?:\\d+).*", "\\1", df$col, perl=TRUE)
Here,
^ - start of string
(.*?:\\d+) - Group 1: any 0+ chars, as few as possible (due to the lazy *? quantifier), then : and 1+ digits
.* - the rest of the string.
The lazy quantifier should be used with the PCRE regex engine, so pass perl=TRUE:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("^(.*?:\\d+).*", "\\1", col, perl=TRUE)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
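To see why the lazy quantifier matters in the pattern above, compare it with its greedy counterpart on a single value. This is only an illustrative sketch; the variable names greedy and lazy are made up for the comparison.

```r
x <- "WBU-ARGU*06:03:04"

# Greedy .* runs to the last colon, so nothing is trimmed
greedy <- sub("^(.*:\\d+).*", "\\1", x, perl = TRUE)

# Lazy .*? stops at the first colon, so the trailing ":04" is dropped
lazy <- sub("^(.*?:\\d+).*", "\\1", x, perl = TRUE)

greedy  # "WBU-ARGU*06:03:04"
lazy    # "WBU-ARGU*06:03"
```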
sub("(\\d+:\\d+):\\d+$", "\\1", df$Col)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternatively match what you want (instead of subbing out what you don't want) with stringi:
stringi::stri_extract_first(df$Col, regex = "[A-Z-\\*]+\\d+:\\d+")
Slightly more concise stringr:
stringr::str_extract(df$Col, "[A-Z-\\*]+\\d+:\\d+")
# or
stringr::str_extract(df$Col, "[\\w-*]+\\d+:\\d+")
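The same "match what you want" idea also works without extra packages, using regexpr and regmatches from base R. A small sketch, assuming every element contains at least one colon (regmatches silently drops elements that do not match):

```r
col <- c("WBU-ARGU*06:03:04", "WBU-ARDU*08:01:01",
         "WBU-ARFU*11:03:05", "WBU-ARFU*03:456")

# Match from the start: text up to the first colon, the colon,
# then everything up to (but not including) the next colon
m <- regexpr("^[^:]*:[^:]*", col)
extracted <- regmatches(col, m)
extracted
```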

Replace all characters between the 3rd occurrence of “-” and the ":" in each element of a vector

Here is what I am trying to do:
Given a string, I want to remove everything between the third occurrence of '-' and the ':' character, assuming there is a third occurrence, which there may not be.
This is my expected result :
Initial string
yy-aa-bbb-cccc1:HYT => yy-aa-bbb:HYT
yy-aa-vvv-vv:ZTR => yy-aa-vvv:ZTR
yy-aa-ddd:YTLM => yy-aa-ddd:YTLM
Any help?
gsub('(.*-.*-.*)\\-.*(\\:.*)','\\1\\2',string)
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
We match two instances of one or more characters that are not a - followed by a - (([^-]+-){2}), then another run of characters that are not a - ([^-]+), and capture all of that as the first group. Next comes a literal - and a run of characters that are not a : ([^:]+), followed by a capture group that starts at the : ((:.*)). Because the inner ([^-]+-) is itself group 2, the : part is group 3, so the replacement uses \\1\\3. Strings without a third - do not match and are returned unchanged.
sub("(([^-]+-){2}[^-]+)-[^:]+(:.*)", "\\1\\3", str1)
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
data
str1 <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM")
Match the first three fields and everything after them up to the colon, and replace that with the first three fields and a colon. Note that \w matches any word character and the \ needs to be doubled inside "...":
sub("(\\w+-\\w+-\\w+)-.+:", "\\1:", xx)
## [1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
Note: The input xx in reproducible form is:
xx <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM")
Just throwing a stringi solution in there.
library(stringi)
sub('_.*:' ,':', stri_replace_last_fixed(x, '-', '_'))
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa:YTLM"
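A regex-free alternative is to split each string and keep only the leading fields. The sketch below is one way to do that in base R; the function name trim_fields and its keep argument are hypothetical, and it assumes the part before the first ":" is the "-"-separated list.

```r
x <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM")

# Hypothetical helper: keep at most `keep` dash-separated fields before the colon
trim_fields <- function(x, keep = 3) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    fields <- strsplit(p[1], "-", fixed = TRUE)[[1]]
    head_part <- paste(head(fields, keep), collapse = "-")
    # Re-attach whatever followed the first colon
    paste(c(head_part, p[-1]), collapse = ":")
  }, character(1))
}

trimmed <- trim_fields(x)
trimmed
```

Elements with three or fewer fields pass through unchanged, which covers the case where there is no third "-".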

Transform suffix into prefix in column name

I would like to move the suffix of a column name to its beginning so that it becomes its prefix. I have many columns with changing names (except the suffix), so manually renaming is not an option.
Example:
set.seed(1)
dat <- data.frame(ID = 1:5,
speed.x.alpha = runif(5),
power.x.alpha = rpois(5, 1),
force.x.alpha = rexp(5),
speed.y.beta = runif(5),
power.y.beta = rpois(5, 1),
force.y.beta = rexp(5))
In the end the dataframe should have the following column names:
ID, alpha.speed.x, alpha.power.x, alpha.force.x, beta.speed.x, beta.power.x, force.power.x.
I strongly assume I need a gsub/sub expression which allows me to select the characters after the last dot, which I would then paste to the colnames, and eventually remove from the end. So far without success though...
A couple of gsubs and paste0 will do the trick:
gsub("y$", "x", gsub("(^.*)\\.(.*a$)", paste0("\\2", ".", "\\1"), names(dat)))
[1] "ID" "alpha.speed.x" "alpha.power.x" "alpha.force.x" "beta.speed.x"
[6] "beta.power.x" "beta.force.x"
The () in the regular expression capture the characters that match each subexpression. "\\." matches a literal "." and "$" anchors the expression to the end of the string. The second argument pastes together the captured subexpressions in swapped order. This result is fed to a second gsub, which replaces a trailing "y" with "x" if one is found.
To rename the variables, use
names(dat) <- gsub("y$", "x", gsub("(^.*)\\.(.*a$)", paste0("\\2", ".", "\\1"), names(dat)))
Here is one option with sub. We match one or more characters that are not a . ([^.]+) from the start (^) of the string and capture them as the first group (inside the parentheses). Next comes a literal dot (\\.). Since . is a metacharacter that matches any character, it has to be escaped (\\) or placed inside square brackets to be read literally. Then another run of characters that are not a dot is captured as the second group, followed by another dot, and the rest of the string is captured as the third group. In the replacement, we reorder the backreferences to get the expected output. Names without two dots, such as ID, do not match and stay as they are.
names(dat) <- sub("^([^.]+)\\.([^.]+)\\.(.*)", "\\3.\\1.\\2", names(dat))
names(dat)
#[1] "ID" "alpha.speed.x" "alpha.power.x" "alpha.force.x"
#[5] "beta.speed.y" "beta.power.y" "beta.force.y"
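If the number of dots before the suffix can vary, a variant that only looks at the last dot may be more robust. This is a minimal sketch on a hypothetical vector nms rather than on names(dat):

```r
nms <- c("ID", "speed.x.alpha", "power.x.alpha", "speed.y.beta")

# Greedy (.*) takes everything up to the last dot; ([^.]+)$ is the suffix
moved <- sub("^(.*)\\.([^.]+)$", "\\2.\\1", nms)
moved
```

As with the pattern above, names without any dot are returned unchanged.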

Replace element in vector based on first letter of character string

Consider the vectors below:
ID <- c("A1","B1","C1","A12","B2","C2","Av1")
names <- c("ALPHA","BRAVO","CHARLIE","AVOCADO")
I want to replace the first character of each element in vector ID with vector names based on the first letter of vector names. I also want to add a _0 before each number between 0:9.
Note that the elements Av1 and AVOCADO throw things off a bit, especially with the lowercase v in Av1.
The result should look like this:
res <- c("ALPHA_01","BRAVO_01","CHARLIE_01","ALPHA_12","BRAVO_02","CHARLIE_02", "AVOCADO_01")
I know it should be done with regex but I've been trying for 2 days now and haven't got anywhere.
We can use gsubfn.
library(gsubfn)
#remove the number part from 'ID' (using `sub`) and get the unique elements
nm1 <- unique(sub("\\d+", "", ID))
#using gsubfn, replace the non-numeric elements with the matching
#key/value pair in the replacement
#finally format to add the "_" with sub
sub("(\\d+)$", "_0\\1", gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_012"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
The (\\d+) matches one or more digits, and (\\D+) matches one or more non-digit characters. Wrapping each in parentheses captures it as a group, so it can be reused in the replacement through the backreference (\\1, as it is the first captured group). Note that _0 is added in front of every number here, including two-digit ones such as 12.
Update
If the condition would be to append 0 only to those 'ID's that have numbers less than 10, then we can do this with a second gsubfn and sprintf
gsubfn("(\\d+)", ~sprintf("_%02d", as.numeric(x)),
gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_12"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
Doing this via base R, we can search for second character being V (as in AVOCADO) and substring 2 characters if that's true or 1 character if not. This will capture both AVOCADO and ALPHA. We then match those substrings with the letters extracted from ID (also convert toupper to capture Av with AV). Finally paste _0 along with the number found in each ID
paste0(names[match(toupper(sub('\\d+', '', ID)),
ifelse(substr(names, 2, 2) == 'V', substr(names, 1, 2),
substr(names, 1, 1)))],'_0', sub('\\D+', '', ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_012" "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
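To pad only single-digit numbers in base R, the pasted "_0" can be swapped for sprintf with a %02d format. The sketch below reuses the matching logic from above; full_names, key, prefix and num are hypothetical names (full_names replaces names to avoid masking the base function).

```r
ID <- c("A1", "B1", "C1", "A12", "B2", "C2", "Av1")
full_names <- c("ALPHA", "BRAVO", "CHARLIE", "AVOCADO")

# Lookup key: two letters when the second one is "V", otherwise one letter
key <- ifelse(substr(full_names, 2, 2) == "V",
              substr(full_names, 1, 2), substr(full_names, 1, 1))

prefix <- toupper(sub("\\d+$", "", ID))   # letter part, e.g. "A", "AV"
num <- as.integer(sub("^\\D+", "", ID))   # numeric part, e.g. 1, 12

# %02d pads to two digits, so 1 becomes "01" while 12 stays "12"
res <- sprintf("%s_%02d", full_names[match(prefix, key)], num)
res
```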
