Change order of multiple optional substrings - r

That's a bit like this question, but I have multiple substrings that may or may not occur.
The substrings code for two different dimensions, in my example "test" and "eye". They can occur in any imaginable order.
The variables can be coded in different ways - in my example, "method|test" would be two ways to code for "test", as well as "r|re|l|le" different ways to code for eyes.
I found a convoluted solution, which is using a chain of seven (!) gsub calls, and I wondered if there is a more concise way.
x <- c("id", "r_test", "l_method", "test_re", "method_le", "test_r_old",
"test_l_old", "re_test_new","new_le_method", "new_r_test")
x
#> [1] "id" "r_test" "l_method" "test_re"
#> [5] "method_le" "test_r_old" "test_l_old" "re_test_new"
#> [9] "new_le_method" "new_r_test"
Desired output
#> [1] "id" "r_test" "l_test" "r_test" "l_test"
#> [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"
How I got there (convoluted)
## Unify codes for variables, I use the underscores to make it more unique for future regex
clean_test<- gsub("(?<![a-z])(test|method)(?![a-z])", "_test_", tolower(x), perl = TRUE)
clean_r <- gsub("(?<![a-z])(r|re)(?![a-z])", "_r_", tolower(clean_test), perl = TRUE)
clean_l <- gsub("(?<![a-z])(l|le)(?![a-z])", "_l_", tolower(clean_r), perl = TRUE)
## Now sort, one after the other
sort_eye <- gsub("(.*)(_r_|_l_)(.*)", "\\2\\1\\3", clean_l, perl = TRUE)
sort_test <- gsub("(_r_|_l_)(.*)(_test_)(.*)", "\\1\\3\\2\\4", sort_eye, perl = TRUE)
## Remove underscores
clean_underscore_mult <- gsub("_{2,}", "_", sort_test)
clean_underscore_ends <- gsub("^_|_$", "", clean_underscore_mult)
clean_underscore_ends
#> [1] "id" "r_test" "l_test" "r_test" "l_test"
#> [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"
I'd be already very very grateful for a suggestion how to better proceed from ## Now sort, one after the other downwards...

How about tokenizing the string and using lookup tables instead? I'll use data.table to assist but the idea fits naturally with other data grammars as well
library(data.table)
# build into a table, keeping track of an ID
# to say which element it came from originally
l = strsplit(x, '_', fixed=TRUE)
DT = data.table(id = rep(seq_along(l), lengths(l)), token = unlist(l))
Now build a lookup table:
# defined using fread to make it easier to see
# token & match side-by-side; only define tokens
# that actually need to be changed here
lookups = fread('
token,match
le,l
re,r
method,test
')
Now combine:
# default value is the token itself
DT[ , match := token]
# replace anything matched
DT[lookups, match := i.match, on = 'token']
Next use factor ordering to get the tokens in the right order:
# the more general [where you don't have an exact list of all the possible
# tokens ready at hand] is a bit messier -- you might do something
# similar to setdiff(unique(match), lookups$match)
DT[ , match := factor(match, levels = c('id', 'r', 'l', 'test', 'old', 'new'))]
# sort to this new order
setorder(DT, id, match)
Finally combine again (an aggregation) to get the output:
DT[ , paste(match, collapse='_'), by = id]$V1
# [1] "id" "r_test" "l_test" "r_test" "l_test"
# [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"

Here's a one-liner with nested sub that transforms x without any intermediary steps:
sub("^(\\w+)_(r|re|l|le)", "\\2_\\1",
sub("method", "test",
sub("(l|r)e", "\\1",
sub("(^new)_(\\w+_\\w+)$", "\\2_\\1", x))))
# [1] "id" "r_test" "l_test" "r_test" "l_test" "r_test_old"
# [7] "l_test_old" "r_test_new" "l_test_new" "r_test_new"
Data:
x <- c("id", "r_test", "l_method", "test_re", "method_le", "test_r_old",
"test_l_old", "re_test_new","new_le_method", "new_r_test")

Much inspired and building on user MichaelChirico's answer, this is a function using base R only, which (in theory) should work with any number of substrings to sort. The list defines the sort (by its elements), and it specifies all ways to code for the default tokens (the list names).
## I've added some more ways to code for right and left eyes, as well as different further strings that are not known.
x <- c("id", "r_random_test_old", "r_test", "r_test_else", "l_method", "test_re", "method_le", "test_od_old",
"test_os_old", "re_mth_new","new_le_method", "new_r_test_random")
x
#> [1] "id" "r_random_test_old" "r_test"
#> [4] "r_test_else" "l_method" "test_re"
#> [7] "method_le" "test_od_old" "test_os_old"
#> [10] "re_mth_new" "new_le_method" "new_r_test_random"
sort_substr(x, list(r = c("od","re"), l = c("os","le"), test = c("method", "mth"), time = c("old","new")))
#> [1] "id" "r_test_time_random" "r_test"
#> [4] "r_test_else" "l_test" "r_test"
#> [7] "l_test" "r_test_time" "l_test_time"
#> [10] "r_test_time" "l_test_time" "r_test_time_random"
sort_substr
sort_substr <- function(x, list_substr) {
lookups <- data.frame(match = rep(names(list_substr), lengths(list_substr)),
token = unlist(list_substr))
l <- strsplit(x, "_", fixed = TRUE)
DF <- data.frame(id = rep(seq_along(l), lengths(l)), token = unlist(l))
match_token <- lookups$match[match(DF$token, lookups$token)]
DF$match <- ifelse(is.na(match_token), DF$token, match_token)
rest_token <- base::setdiff(DF$match, names(list_substr))
DF$match <- factor(DF$match, levels = c(names(list_substr), rest_token))
DF <- DF[with(DF, order(id, match)), ]
out <- vapply(split(DF$match, DF$id),
paste, collapse = "_",
FUN.VALUE = character(1),
USE.NAMES = FALSE)
out
}

Related

how to sort list.files() in correct date order?

Using normal list.files() in the working directory return the file list but the numeric order is messed up.
f <- list.files(pattern="*.nc")
f
# [1] "te1971-1.nc" "te1971-10.nc" "te1971-11.nc" "te1971-12.nc"
# [5] "te1971-2.nc" "te1971-3.nc" "te1971-4.nc" "te1971-5.nc"
# [9] "te1971-6.nc" "te1971-7.nc" "te1971-8.nc" "te1971-9.nc"
where the number after "-" describes the month number.
I used the following to try to sort it
myFiles <- paste("te", i, "-", c(1:12), ".nc", sep = "")
mixedsort(myFiles)
it returns ordered files but in reverse:
[1] "te1971-12.nc" "te1971-11.nc" "tev1971-10.nc" "te1971-9.nc"
[5] "te1971-8.nc" "te1971-7.nc" "te1971-6.nc" "te1971-5.nc"
[9] "te1971-4.nc" "te1971-3.nc" "te1971-2.nc" "te1971-1.nc"
How do I fix this?
The issue is that the values get alphabetically sorted.
You could gsub out years and months as groups (.) and add "-1" as first day of the month to the yield, coerce it as.Date and order by that.
x[order(as.Date(gsub('.*(\\d{4})-(\\d{,2}).*', '\\1-\\2-1', x)))]
# [1] "te1971-1.nc" "te1971-2.nc" "te1971-3.nc" "te1971-4.nc" "te1971-5.nc"
# [6] "te1971-6.nc" "te1971-7.nc" "te1971-8.nc" "te1971-9.nc" "te1971-10.nc"
# [11] "te1971-11.nc" "te1971-12.nc"
Data:
x <- c("te1971-1.nc", "te1971-10.nc", "te1971-11.nc", "te1971-12.nc",
"te1971-2.nc", "te1971-3.nc", "te1971-4.nc", "te1971-5.nc", "te1971-6.nc",
"te1971-7.nc", "te1971-8.nc", "te1971-9.nc")

Apply regmatches function to a list of chr in R

I have this list of character stored in a variable called x:
x <-
c(
"images/logos/france2.png",
"images/logos/cnews.png",
"images/logos/lcp.png",
"images/logos/europe1.png",
"images/logos/rmc-bfmtv.png",
"images/logos/sudradio.png",
"images/logos/franceinfo.png"
)
pattern <- "images/logos/\\s*(.*?)\\s*.png"
regmatches(x, regexec(pattern, x))[[1]][2]
I wish to extract a portion of each chr string according to a pattern, like this function does, which works fine but only for the first item in the list.
pattern <- "images/logos/\\s*(.*?)\\s*.png"
y <- regmatches(x, regexec(pattern, x))[[1]][2]
Only returns:
"france2"
How can I apply the regmatches function to all items in the list in order to get a result like this?
[1] "france2" "europe1" "sudradio"
[4] "cnews" "rmc-bfmtv" "franceinfo"
[7] "lcp" "rmc" "lcp"
FYI this is a list of src tags that comes from a scraper
Try gsub
gsub(
".*/(.*)\\.png", "\\1",
c(
"images/logos/france2.png", "images/logos/cnews.png",
"images/logos/lcp.png", "images/logos/europe1.png",
"images/logos/rmc-bfmtv.png", "images/logos/sudradio.png",
"images/logos/franceinfo.png"
)
)
which gives
[1] "france2" "cnews" "lcp" "europe1" "rmc-bfmtv"
[6] "sudradio" "franceinfo"
Output of regmatches(..., regexec(...)) is a list. You may use sapply to extract the 2nd element from each element of the list.
sapply(regmatches(x, regexec(pattern, x)), `[[`, 2)
#[1] "france2" "europe1" "sudradio" "cnews" "rmc-bfmtv" "franceinfo"
#[7] "lcp" "rmc" "lcp"
You may also use the function basename + file_path_sans_ext from tools package which would give the required output directly.
tools::file_path_sans_ext(basename(x))
#[1] "france2" "europe1" "sudradio" "cnews" "rmc-bfmtv" "franceinfo"
#[7] "lcp" "rmc" "lcp"
A possible solution:
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
strings = c("images/logos/france2.png","images/logos/cnews.png",
"images/logos/lcp.png","images/logos/europe1.png",
"images/logos/rmc-bfmtv.png","images/logos/sudradio.png",
"images/logos/franceinfo.png")
)
df %>%
mutate(strings = str_remove(strings, "images/logos/") %>%
str_remove("\\.png"))
#> strings
#> 1 france2
#> 2 cnews
#> 3 lcp
#> 4 europe1
#> 5 rmc-bfmtv
#> 6 sudradio
#> 7 franceinfo
Or even simpler:
library(tidyverse)
df %>%
mutate(strings = str_extract(strings, "(?<=images/logos/)(.*)(?=\\.png)"))
#> strings
#> 1 france2
#> 2 cnews
#> 3 lcp
#> 4 europe1
#> 5 rmc-bfmtv
#> 6 sudradio
#> 7 franceinfo

Create list with specific iteration in R

I have the following dataset containing dates:
> dates
[1] "20180412" "20180424" "20180506" "20180518" "20180530" "20180611" "20180623" "20180705" "20180717" "20180729"
I am trying to create a list where in each position, the name is 'Coherence_' + the first and second dates in dates. So in output1[1] I would have Coherence_20180412_20180424. Then in output1[2] I would have Coherence_20180506_20180518, etc.
I am starting with this code but it is not working they way I need:
output1<-list()
for (i in 1:5){
output1[[i]]<-paste("-Poutput1=", S1_Out_Path,"Coherence_VV_TC", dates[[i]],"_", dates[[i+1]], ".tif", sep="")
}
Do you have any suggestions?
M
Try this:
Without loop
even_indexes<-seq(2,10,2) # List of even indexes
odd_indexes<-seq(1,10,2) # List of odd indexes
print(paste('Coherence',paste(odd_indexes,even_indexes,sep = "_"),sep = "_"))
Link answer from here: Create list in R with specific iteration
Updated (To get data in List)
lst=c(paste('Coherence',paste(odd_indexes,even_indexes,sep = "_"),sep = "_"))
OR
a=c(1:10)
for (i in seq(1, 9, 2)){
print(paste('Coherence',paste(a[i],a[i+1],sep = "_"),sep = "_"))
}
Output:
[1] "Coherence_1_2"
[1] "Coherence_3_4"
[1] "Coherence_5_6"
[1] "Coherence_7_8"
[1] "Coherence_9_10"
You can create these patterns using paste capability to operate on vectors:
dates <- c("20180412", "20180424", "20180506", "20180518", "20180530",
"20180611", "20180623", "20180705", "20180717", "20180729")
paste("Coherence", dates[1:length(dates)-1], dates[2:length(dates)], sep="_")
[1] "Coherence_20180412_20180424" "Coherence_20180424_20180506" "Coherence_20180506_20180518"
[4] "Coherence_20180518_20180530" "Coherence_20180530_20180611" "Coherence_20180611_20180623"
[7] "Coherence_20180623_20180705" "Coherence_20180705_20180717" "Coherence_20180717_20180729"
Or other simple patterns can be generated as:
paste("Coherence", dates[seq(1, length(dates), 2)], dates[seq(2, length(dates), 2)], sep="_")
[1] "Coherence_20180412_20180424" "Coherence_20180506_20180518" "Coherence_20180530_20180611"
[4] "Coherence_20180623_20180705" "Coherence_20180717_20180729"
You can use matrix(..., nrow=2):
dates <- c("20180412", "20180424", "20180506", "20180518", "20180530", "20180611", "20180623", "20180705", "20180717", "20180729")
paste0("Coherence_", apply(matrix(dates, 2), 2, FUN=paste0, collapse="_"))
# > paste0("Coherence_", apply(matrix(dates, 2), 2, FUN=paste0, collapse="_"))
# [1] "Coherence_20180412_20180424" "Coherence_20180506_20180518" "Coherence_20180530_20180611" "Coherence_20180623_20180705"
# [5] "Coherence_20180717_20180729"

Cannot iterate through a list while I iterate through a list of lists

The input is a list of lists. Please see below. The file names is a list containing as many names as there are lists in the list (name1, name2, name3).
Each name is appended to the path: path/name1 - path/name2 - path/name3
The program iterates through the list containing the paths as it iterates through the list of lists and prints the paths with their file names. I would expect for the output to be path/name1 - path/name2 - path/name3. However I get the output below. Please see OUTPUT after INPUT
INPUT
[[1]]
[1] "150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt" "160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
[3] "JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt" "JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
[[2]]
[1] "150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt" "160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
[3] "JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
[[3]]
[1] "150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
OUTPUT
I would expect for the output to be path/nam1 - path/name2 - path/name3
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/name1.tsv",
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/name2.tsv",
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/name3.tsv".
However I get the output below:
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/name1.tsv"
I cannot understand why I cannot iterate through the list of paths with the file name while iterating through the list of lists. I hope this helps to clarify the problem. Could anyone help with this?
I have analyzed each statement using printing and every thing works fine except for the output of the code below
for (i in 1:length(lc)) {
for (j in 1:length(lc[[i]])) { # fetch and read files
if (j==1) {
newFile <- paste(dataFnsDir, lc[[i]][j], sep="/")
newFile <- tryCatch(read.delim(newFile, header = TRUE, sep = '/'), error=function(e) NULL)
newFile<- tryCatch(newFile, error=function(e) data.frame())
print(tmpFn[i])
} else {
newFile <- paste(dataFnsDir, lc[[i]][j], sep="/")
newFile <- tryCatch(read.delim(newFilei, header = TRUE, sep = '/'), error=function(e) NULL)
newFile <- tryCatch(newFile, error=function(e) data.frame())
newFile <- dplyr::bind_rows(newFile, newFile)
print(tmpFn[i])
}
}
}
There's no need to use nested loop. try this:
# sample data
dataFnsDir <- "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/"
lc <- list()
lc[[1]] <- c("150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt","160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
, "JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt" ,"JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
)
lc[[2]] <- c(
"150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt" , "160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt",
"JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
)
lc[[3]] <- c(
"150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt",
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
)
# actual code
lc.path.v <- paste0(dataFnsDir,unlist(lc))
# maybe this is what you want?
lc.path.v
#> [1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
#> [2] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
#> [3] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
#> [4] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
#> [5] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
#> [6] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
#> [7] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
#> [8] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
#> [9] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
If you want to read all of them and combine them together, try this(it may not work because I don't know what the data looks like):
lc.alldf <- lapply(lc.path, read.delim, header = TRUE, sep = "/")
lc.onedf <- dplyr::bind_rows(lc.alldf)
Edit:
code improved, thanks! #Onyambu
If I understand correctly, the OP wants to create 3 new files each from the file names given as character vectors in each list element.
The main issue with OP's code is that newFile is overwritten in each iteration of the nested loops.
Here is what I would with my preferred tools (untested):
library(data.table) # for fread() and rbindlist()
library(magrittr) # use piping for clarity
lapply(
lc,
function(x) {
filenames <- file.path(dataFnsDir, x)
lapply(filenames, fread) %>%
rbindlist()
}
)
This will return a list of three dataframes (data.tables).
I do not have the OP's input files available but we can simulate the effect for demonstration. If we remove the second call to lapply() we will get a list of 3 elements each containing a character vector of file names with the path prepended.
lapply(
lc,
function(x) {
filenames <- file.path(dataFnsDir, x)
print(filenames)
}
)
[[1]]
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
[2] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
[3] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
[4] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
[[2]]
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
[2] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
[3] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
[[3]]
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
[2] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
Data
dataFnsDir <-"/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA"
lc <- list(
c("150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt",
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt",
"JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt",
"JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
),
c(
"150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt" ,
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt",
"JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
),
c(
"150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt",
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
)
)

How to complete several character vector formatting steps in a single function?

EDITED
I have a simple list of column names that I would like to change the format of, ideally programmatically. This is a sample of the list:
vars_list <- c("tBodyAcc.mean...X", "tBodyAcc.mean...Y", "tBodyAcc.mean...Z",
"tBodyAcc.std...X", "tBodyAcc.std...Y", "tBodyAcc.std...Z",
"tGravityAcc.mean...X", "tGravityAcc.mean...Y", "tGravityAcc.mean...Z",
"tGravityAcc.std...X", "tGravityAcc.std...Y", "tGravityAcc.std...Z",
"fBodyAcc.mean...X", "fBodyAcc.mean...Y", "fBodyAcc.mean...Z",
"fBodyAcc.std...X", "fBodyAcc.std...Y", "fBodyAcc.std...Z",
"fBodyAccJerk.mean...X", "fBodyAccJerk.mean...Y", "fBodyAccJerk.mean...Z",
"fBodyAccJerk.std...X", "fBodyAccJerk.std...Y", "fBodyAccJerk.std...Z")
And this is the result I'm hoping for:
[3]"Time_Body_Acc_Mean_X" "Time_Body_Acc_Mean_Y"
[5] "Time_Body_Acc_Mean_Z" "Time_Body_Acc_Stddev_X"
[7] "Time_Body_Acc_Stddev_Y" "Time_Body_Acc_Stddev_Z"
[9] "Time_Gravity_Acc_Mean_X" "Time_Gravity_Acc_Mean_Y"
[11] "Time_Gravity_Acc_Mean_Z" "Time_Gravity_Acc_Stddev_X"
[13] "Time_Gravity_Acc_Stddev_Y" "Time_Gravity_Acc_Stddev_Z"
...
[43] "Freq_Body_Acc_Mean_X" "Freq_Body_Acc_Mean_Y"
[45] "Freq_Body_Acc_Mean_Z" "Freq_Body_Acc_Stddev_X"
[47] "Freq_Body_Acc_Stddev_Y" "Freq_Body_Acc_Stddev_Z"
[49] "Freq_Body_Acc_Jerk_Mean_X" "Freq_Body_Acc_Jerk_Mean_Y"
[51] "Freq_Body_Acc_Jerk_Mean_Z" "Freq_Body_Acc_Jerk_Stddev_X"
[53] "Freq_Body_Acc_Jerk_Stddev_Y" "Freq_Body_Acc_Jerk_Stddev_Z"
I've put together what feels like a really verbose way of making the changes employing regular expressions.
vars_list <- unlist(lapply(vars_list, function(x){gsub("^t", "Time", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("^f", "Freq", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("std", "Stddev", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("mean", "Mean", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("\\.+", "", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("\\.", "", x)}))
vars_list <- unlist(lapply(vars_list,
function(x){gsub("(?<=[a-z]).{0}(?=[A-Z])",
"_", x, perl = TRUE)}))
Is there a way to arrive at the same results more efficiently and elegantly by including two or more formatting steps in a single function call?
One alternative is to write your patterns and replacement in two vectors, then use stringi::stri_replace_all_regex which can do this replacement in a vectorized manner:
# patterns correspond to replacement at the same positions
patterns <- c('^t', '^f', 'std', 'mean', '\\.+', '(?<=[a-z])([A-Z])')
replacement <- c('Time', 'Freq', 'Stddev', 'Mean', '', '_$1')
library(stringi)
stri_replace_all_regex(vars_list, patterns, replacement, vectorize_all = F)
# [1] "Time_Body_Acc_Mean_X" "Time_Body_Acc_Mean_Y"
# [3] "Time_Body_Acc_Mean_Z" "Time_Body_Acc_Stddev_X"
# [5] "Time_Body_Acc_Stddev_Y" "Time_Body_Acc_Stddev_Z"
# [7] "Time_Gravity_Acc_Mean_X" "Time_Gravity_Acc_Mean_Y"
# [9] "Time_Gravity_Acc_Mean_Z" "Time_Gravity_Acc_Stddev_X"
#[11] "Time_Gravity_Acc_Stddev_Y" "Time_Gravity_Acc_Stddev_Z"
How about this using base R's sub?
sub("t(\\w+)(Acc)\\.(\\w+)\\.+([XYZ])", "Time_\\1_\\2_\\3_\\4", vars_list);
#[1] "Time_Body_Acc_mean_X" "Time_Body_Acc_mean_Y"
#[3] "Time_Body_Acc_mean_Z" "Time_Body_Acc_std_X"
#[5] "Time_Body_Acc_std_Y" "Time_Body_Acc_std_Z"
#[7] "Time_Gravity_Acc_mean_X" "Time_Gravity_Acc_mean_Y"
#[9] "Time_Gravity_Acc_mean_Z" "Time_Gravity_Acc_std_X"
#[11] "Time_Gravity_Acc_std_Y" "Time_Gravity_Acc_std_Z"
Changing mean to Mean, and std to StdDev requires two additional subs.
Ditto for t to Time and f to Freq.

Resources