R: gsub, pattern = vector and replacement = vector

As the title states, I am trying to use gsub with a vector for both the "pattern" and the "replacement". Currently, I have code that looks like this:
names(x1) <- gsub("2110027599", "Inv1", names(x1)) #x1 is a data frame
names(x1) <- gsub("2110025622", "Inv2", names(x1))
names(x1) <- gsub("2110028045", "Inv3", names(x1))
names(x1) <- gsub("2110034716", "Inv4", names(x1))
names(x1) <- gsub("2110069349", "Inv5", names(x1))
names(x1) <- gsub("2110023264", "Inv6", names(x1))
What I hope to do is something like this:
a <- c("2110027599","2110025622","2110028045","2110034716", "2110069349", "2110023264")
b <- c("Inv1","Inv2","Inv3","Inv4","Inv5","Inv6")
names(x1) <- gsub(a,b,names(x1))
I'm guessing there is an apply function somewhere that can do this, but I am not very sure which one to use!
EDIT: names(x1) looks like this (There are many more columns, but I'm leaving them out):
> names(x1)
[1] "2110023264A.Ms.Amp" "2110023264A.Ms.Vol" "2110023264A.Ms.Watt" "2110023264A1.Ms.Amp"
[5] "2110023264A2.Ms.Amp" "2110023264A3.Ms.Amp" "2110023264A4.Ms.Amp" "2110023264A5.Ms.Amp"
[9] "2110023264B.Ms.Amp" "2110023264B.Ms.Vol" "2110023264B.Ms.Watt" "2110023264B1.Ms.Amp"
[13] "2110023264Error" "2110023264E-Total" "2110023264GridMs.Hz" "2110023264GridMs.PhV.phsA"
[17] "2110023264GridMs.PhV.phsB" "2110023264GridMs.PhV.phsC" "2110023264GridMs.TotPFPrc" "2110023264Inv.TmpLimStt"
[21] "2110023264InvCtl.Stt" "2110023264Mode" "2110023264Mt.TotOpTmh" "2110023264Mt.TotTmh"
[25] "2110023264Op.EvtCntUsr" "2110023264Op.EvtNo" "2110023264Op.GriSwStt" "2110023264Op.TmsRmg"
[29] "2110023264Pac" "2110023264PlntCtl.Stt" "2110023264Serial Number" "2110025622A.Ms.Amp"
[33] "2110025622A.Ms.Vol" "2110025622A.Ms.Watt" "2110025622A1.Ms.Amp" "2110025622A2.Ms.Amp"
[37] "2110025622A3.Ms.Amp" "2110025622A4.Ms.Amp" "2110025622A5.Ms.Amp" "2110025622B.Ms.Amp"
[41] "2110025622B.Ms.Vol" "2110025622B.Ms.Watt" "2110025622B1.Ms.Amp" "2110025622Error"
[45] "2110025622E-Total" "2110025622GridMs.Hz" "2110025622GridMs.PhV.phsA" "2110025622GridMs.PhV.phsB"
What I hope to get is this:
> names(x1)
[1] "Inv6A.Ms.Amp" "Inv6A.Ms.Vol" "Inv6A.Ms.Watt" "Inv6A1.Ms.Amp" "Inv6A2.Ms.Amp"
[6] "Inv6A3.Ms.Amp" "Inv6A4.Ms.Amp" "Inv6A5.Ms.Amp" "Inv6B.Ms.Amp" "Inv6B.Ms.Vol"
[11] "Inv6B.Ms.Watt" "Inv6B1.Ms.Amp" "Inv6Error" "Inv6E-Total" "Inv6GridMs.Hz"
[16] "Inv6GridMs.PhV.phsA" "Inv6GridMs.PhV.phsB" "Inv6GridMs.PhV.phsC" "Inv6GridMs.TotPFPrc" "Inv6Inv.TmpLimStt"
[21] "Inv6InvCtl.Stt" "Inv6Mode" "Inv6Mt.TotOpTmh" "Inv6Mt.TotTmh" "Inv6Op.EvtCntUsr"
[26] "Inv6Op.EvtNo" "Inv6Op.GriSwStt" "Inv6Op.TmsRmg" "Inv6Pac" "Inv6PlntCtl.Stt"
[31] "Inv6Serial Number" "Inv2A.Ms.Amp" "Inv2A.Ms.Vol" "Inv2A.Ms.Watt" "Inv2A1.Ms.Amp"
[36] "Inv2A2.Ms.Amp" "Inv2A3.Ms.Amp" "Inv2A4.Ms.Amp" "Inv2A5.Ms.Amp" "Inv2B.Ms.Amp"
[41] "Inv2B.Ms.Vol" "Inv2B.Ms.Watt" "Inv2B1.Ms.Amp" "Inv2Error" "Inv2E-Total"
[46] "Inv2GridMs.Hz" "Inv2GridMs.PhV.phsA" "Inv2GridMs.PhV.phsB"

Lots of solutions already; here is one more:
The qdap package:
library(qdap)
names(x1) <- mgsub(a,b,names(x1))

From stringr documentation of str_replace_all, "If you want to apply multiple patterns and replacements to the same string, pass a named version to pattern."
Thus using a, b, and names(x1) from above
stringr::str_replace_all(names(x1), setNames(b, a))
EDIT
stringr::str_replace_all calls stringi::stri_replace_all_regex, which can be used directly and is quite a bit quicker.
x <- names(x1)
pattern <- a
replace <- b
microbenchmark::microbenchmark(
str = stringr::str_replace_all(x, setNames(replace, pattern)),
stri = stringi::stri_replace_all_regex(x, pattern, replace, vectorize_all = FALSE)
)
Unit: microseconds
expr min lq mean median uq max neval cld
str 1022.1 1070.45 1286.547 1175.55 1309 2526.8 100 b
stri 145.2 150.45 190.124 160.55 178 457.9 100 a

New Answer
If we can make another assumption, the following should work. The assumption this time is that you are really interested in substituting the first 10 characters from each value in names(x1).
Here, I've stored names(x1) as a character vector named "X1". The solution essentially uses substr to separate the values in X1 into 2 parts, match to figure out the correct replacement option, and paste to put everything back together.
a <- c("2110027599", "2110025622", "2110028045",
"2110034716", "2110069349", "2110023264")
b <- c("Inv1","Inv2","Inv3","Inv4","Inv5","Inv6")
X1pre <- substr(X1, 1, 10)
X1post <- substr(X1, 11, max(nchar(X1)))
paste0(b[match(X1pre, a)], X1post)
# [1] "Inv6A.Ms.Amp" "Inv6A.Ms.Vol" "Inv6A.Ms.Watt"
# [4] "Inv6A1.Ms.Amp" "Inv6A2.Ms.Amp" "Inv6A3.Ms.Amp"
# [7] "Inv6A4.Ms.Amp" "Inv6A5.Ms.Amp" "Inv6B.Ms.Amp"
# [10] "Inv6B.Ms.Vol" "Inv6B.Ms.Watt" "Inv6B1.Ms.Amp"
# [13] "Inv6Error" "Inv6E-Total" "Inv6GridMs.Hz"
# [16] "Inv6GridMs.PhV.phsA" "Inv6GridMs.PhV.phsB" "Inv6GridMs.PhV.phsC"
# [19] "Inv6GridMs.TotPFPrc" "Inv6Inv.TmpLimStt" "Inv6InvCtl.Stt"
# [22] "Inv6Mode" "Inv6Mt.TotOpTmh" "Inv6Mt.TotTmh"
# [25] "Inv6Op.EvtCntUsr" "Inv6Op.EvtNo" "Inv6Op.GriSwStt"
# [28] "Inv6Op.TmsRmg" "Inv6Pac" "Inv6PlntCtl.Stt"
# [31] "Inv6Serial Number" "Inv2A.Ms.Amp" "Inv2A.Ms.Vol"
# [34] "Inv2A.Ms.Watt" "Inv2A1.Ms.Amp" "Inv2A2.Ms.Amp"
# [37] "Inv2A3.Ms.Amp" "Inv2A4.Ms.Amp" "Inv2A5.Ms.Amp"
# [40] "Inv2B.Ms.Amp" "Inv2B.Ms.Vol" "Inv2B.Ms.Watt"
# [43] "Inv2B1.Ms.Amp" "Inv2Error" "Inv2E-Total"
# [46] "Inv2GridMs.Hz" "Inv2GridMs.PhV.phsA" "Inv2GridMs.PhV.phsB"
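One caveat with the substr/match approach: match returns NA for any name whose first 10 characters are not in a, and paste0 would then produce a name starting with "NA". A minimal sketch of a guard, assuming names without a known prefix should be left unchanged (the "Timestamp" column is a hypothetical example, not from the question):

```r
a <- c("2110027599", "2110025622")
b <- c("Inv1", "Inv2")
# Hypothetical column names; "Timestamp" has no matching prefix
X1 <- c("2110025622A.Ms.Amp", "2110027599Error", "Timestamp")

idx <- match(substr(X1, 1, 10), a)   # NA where the prefix is not in `a`
# Keep the original name when there is no match, otherwise swap the prefix
renamed <- ifelse(is.na(idx), X1, paste0(b[idx], substring(X1, 11)))
renamed
```

substring(X1, 11) is used here only as a shorthand for "everything from character 11 onwards".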
Old Answer
If we can assume that names(x1) is in the same order as the pattern and replacement and that it is basically a one-for-one replacement, you might be able to get away with just sapply.
Here's an example of that particular situation:
Imagine "names(x)" looks something like this:
X1 <- paste0("A2", a, sequence(length(a)))
X1
# [1] "A221100275991" "A221100256222" "A221100280453"
# [4] "A221100347164" "A221100693495" "A221100232646"
Here's our pattern and replacement vectors:
a <- c("2110027599", "2110025622", "2110028045",
"2110034716", "2110069349", "2110023264")
b <- c("Inv1","Inv2","Inv3","Inv4","Inv5","Inv6")
This is how we might use sapply if these assumptions are valid.
sapply(seq_along(a), function(x) gsub(a[x], b[x], X1[x]))
# [1] "A2Inv11" "A2Inv22" "A2Inv33" "A2Inv44" "A2Inv55" "A2Inv66"

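Try mapply. Note that mapply pairs a[i], b[i] and names(x1)[i] element by element (recycling the shorter vectors), so this only works when the names are aligned one-to-one with the patterns.
names(x1) <- mapply(gsub, a, b, names(x1), USE.NAMES = FALSE)
Or, even easier, str_replace from stringr, which is vectorised in the same element-wise way.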
library(stringr)
names(x1) <- str_replace(names(x1), a, b)
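To see why element-wise pairing matters, here is a small sketch with made-up names (the vectors below are hypothetical, chosen so that the names are not in pattern order). It contrasts the mapply call with applying every pattern to every name:

```r
a <- c("111", "222")
b <- c("A", "B")
nm <- c("222x", "111x", "111y", "222y")   # hypothetical names, not in pattern order

# mapply pairs a[i] with nm[i] (recycling a and b), so only some names are hit
pairwise <- mapply(gsub, a, b, nm, USE.NAMES = FALSE)

# Applying each pattern to all names replaces every occurrence
all_pairs <- nm
for (i in seq_along(a)) all_pairs <- gsub(a[i], b[i], all_pairs)

pairwise
all_pairs
```

In this sketch the first two names are left untouched by the pairwise version, because each is paired with the other name's pattern.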

I needed to do something similar but had to use base R. As long as your vectors are the same length, I think this will work:
for (i in seq_along(a)){
names(x1) <- gsub(a[i], b[i], names(x1))
}
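The same loop could be wrapped in a small helper that checks the equal-length condition up front. This is only a sketch: replace_pairs is a hypothetical name, and fixed = TRUE is an assumption that the patterns are literal strings rather than regular expressions.

```r
# Hypothetical helper: apply each (pattern, replacement) pair to all of x in turn
replace_pairs <- function(x, pattern, replacement) {
  stopifnot(length(pattern) == length(replacement))
  for (i in seq_along(pattern)) {
    # fixed = TRUE treats the pattern as a literal string (an assumption here)
    x <- gsub(pattern[i], replacement[i], x, fixed = TRUE)
  }
  x
}

out <- replace_pairs(c("2110023264A.Ms.Amp", "2110025622Error"),
                     c("2110025622", "2110023264"),
                     c("Inv2", "Inv6"))
out
```

Usage would then be along the lines of names(x1) <- replace_pairs(names(x1), a, b).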

Somehow names<- and match seems much more appropriate here...
names( x1 ) <- b[ match( names( x1 ) , a ) ]
But I am making the assumption that the elements of vector a are the actual names of your data.frame.
If a really is a pattern found within each of the names of x1 then this grepl approach with names<- could be useful...
new <- sapply( a , grepl , x = names( x1 ) )
names( x1 ) <- b[ apply( new , 1 , which.max ) ]
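A small sketch of what the grepl step produces may help (the three names below are a shortened sample). Two things are worth keeping in mind: the result replaces each name entirely with its b value, so the suffix is dropped, and which.max returns 1 for a row with no TRUE at all, so unmatched names would silently get b[1].

```r
a <- c("2110025622", "2110023264")
b <- c("Inv2", "Inv6")
nm <- c("2110023264A.Ms.Amp", "2110025622Error", "2110023264Mode")

hits <- sapply(a, grepl, x = nm)   # logical matrix: one row per name, one column per pattern
idx <- apply(hits, 1, which.max)   # column index of the first TRUE in each row
new_names <- b[idx]
new_names
```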

Related

How to make continuous a discontinuous sequence of character numbers with leading zero(s)?

I have this character vector:
dput(t$line)
c("0304", "0305", "0306", "0308", "0311", "0313", "0314", "0316",
"0318", "0321", "0322", "0323", "0324", "0326", "0327", "0330",
"0333", "0337", "0338", "0339", "0342", "0341", "0344", "0346",
"0347", "0348", "0349", "0350", "0352", "0353", "0357", "0359",
"0360", "0362", "0363", "0364", "0365", "0367", "0371", "0370",
"0373", "0375", "0378", "0380", "0381", "0385", "0386", "0387",
"0391", "0395", "0394", "0397", "0398", "0399", "0400", "0402",
"0404", "0405", "0406", "0408", "0412", "0416", "0419", "0423",
"0424", "0425", "0426", "0428", "0429", "0432", "0433", "0436",
"0435", "0439", "0437", "0440", "0441")
The numbers it contains are not completely continuous. I'd like to make them continuous, while preserving the leading zero or zeros where needed. I've come up with this solution:
paste("0", seq(as.numeric(t$line[1]), as.numeric(t$line[1]) + length(t$line), 1), sep = "")
[1] "0304" "0305" "0306" "0307" "0308" "0309" "0310" "0311" "0312" "0313" "0314" "0315" "0316" "0317" "0318" "0319" "0320"
[18] "0321" "0322" "0323" "0324" "0325" "0326" "0327" "0328" "0329" "0330" "0331" "0332" "0333" "0334" "0335" "0336" "0337"
[35] "0338" "0339" "0340" "0341" "0342" "0343" "0344" "0345" "0346" "0347" "0348" "0349" "0350" "0351" "0352" "0353" "0354"
[52] "0355" "0356" "0357" "0358" "0359" "0360" "0361" "0362" "0363" "0364" "0365" "0366" "0367" "0368" "0369" "0370" "0371"
[69] "0372" "0373" "0374" "0375" "0376" "0377" "0378" "0379" "0380" "0381"
This works okay as long as there is exactly one 0 to be added. There may however be more than one leading zero or none at all. How can the sequence be made continuous with appropriate leading zeros?
One stringr option could be:
x <- t$line
stringr::str_pad(seq.int(min(as.numeric(x)), length.out = length(x)), 4, "left", "0")
[1] "0304" "0305" "0306" "0307" "0308" "0309" "0310" "0311" "0312" "0313" "0314" "0315" "0316"
[14] "0317" "0318" "0319" "0320" "0321" "0322" "0323" "0324" "0325" "0326" "0327" "0328" "0329"
[27] "0330" "0331" "0332" "0333" "0334" "0335" "0336" "0337" "0338" "0339" "0340" "0341" "0342"
[40] "0343" "0344" "0345" "0346" "0347" "0348" "0349" "0350" "0351" "0352" "0353" "0354" "0355"
[53] "0356" "0357" "0358" "0359" "0360" "0361" "0362" "0363" "0364" "0365" "0366" "0367" "0368"
[66] "0369" "0370" "0371" "0372" "0373" "0374" "0375" "0376" "0377" "0378" "0379" "0380"
A more general solution that takes into account the maximum length of the entries, and therefore implicitly the number of leading zeros:
t$line2 <- c("000517","00524")
Cont.PadZero <- function(vec) sprintf(paste0("%0", max(nchar(vec)), "d"), seq.int(min(as.numeric(vec)), max(as.numeric(vec))))
Cont.PadZero(t$line2)
[1] "000517" "000518" "000519" "000520" "000521" "000522" "000523" "000524"
You want a continuous sequence of length(x) starting at min(x), where nchar of the resulting elements is identical to that of x.
Use sprintf instead of paste0 to format leading zeros. nchar(x)[1] gives the width to which shorter numbers are zero-padded. If the elements are not guaranteed to have equal lengths, use max(nchar(x)) instead, but that is slower.
Since x[1] is not necessarily the minimum, you may want to use min(as.numeric(x)) as the starting point. When you use seq, its end point should be min(as.numeric(x)) + length(x) - 1 (because the minimum is already the first element). Alternatively, use length.out=length(x), which appears to be faster, and faster still when combined with seq.int.
sprintf(paste0("%0", nchar(x)[1], "d"), seq.int(min(as.numeric(x)), length.out=length(x)))
# [1] "0304" "0305" "0306" "0307" "0308" "0309" "0310" "0311" "0312" "0313" "0314" "0315"
# [13] "0316" "0317" "0318" "0319" "0320" "0321" "0322" "0323" "0324" "0325" "0326" "0327"
# [25] "0328" "0329" "0330" "0331" "0332" "0333" "0334" "0335" "0336" "0337" "0338" "0339"
# [37] "0340" "0341" "0342" "0343" "0344" "0345" "0346" "0347" "0348" "0349" "0350" "0351"
# [49] "0352" "0353" "0354" "0355" "0356" "0357" "0358" "0359" "0360" "0361" "0362" "0363"
# [61] "0364" "0365" "0366" "0367" "0368" "0369" "0370" "0371" "0372" "0373" "0374" "0375"
# [73] "0376" "0377" "0378" "0379" "0380"
Another option is using colon :, but seq.int above appears to be faster (see benchmark below).
sprintf(paste0("%0", nchar(x)[1], "d"), 0:(length(x) - 1) + min(as.numeric(x)))
NB: To complete the original vector by filling in the missing values, you may do:
sprintf(paste0("%0", max(nchar(x)), "d"), do.call(`:`, as.list(range(as.numeric(x)))))
# [1] "0304" "0305" "0306" "0307" "0308" "0309" "0310" "0311" "0312" "0313" "0314"
# [12] "0315" "0316" "0317" "0318" "0319" "0320" "0321" "0322" "0323" "0324" "0325"
# [23] "0326" "0327" "0328" "0329" "0330" "0331" "0332" "0333" "0334" "0335" "0336"
# [34] "0337" "0338" "0339" "0340" "0341" "0342" "0343" "0344" "0345" "0346" "0347"
# [45] "0348" "0349" "0350" "0351" "0352" "0353" "0354" "0355" "0356" "0357" "0358"
# [56] "0359" "0360" "0361" "0362" "0363" "0364" "0365" "0366" "0367" "0368" "0369"
# [67] "0370" "0371" "0372" "0373" "0374" "0375" "0376" "0377" "0378" "0379" "0380"
# [78] "0381" "0382" "0383" "0384" "0385" "0386" "0387" "0388" "0389" "0390" "0391"
# [89] "0392" "0393" "0394" "0395" "0396" "0397" "0398" "0399" "0400" "0401" "0402"
# [100] "0403" "0404" "0405" "0406" "0407" "0408" "0409" "0410" "0411" "0412" "0413"
# [111] "0414" "0415" "0416" "0417" "0418" "0419" "0420" "0421" "0422" "0423" "0424"
# [122] "0425" "0426" "0427" "0428" "0429" "0430" "0431" "0432" "0433" "0434" "0435"
# [133] "0436" "0437" "0438" "0439" "0440" "0441"
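The difference between the two variants (same length as the input versus filling the whole range) is easier to see on a tiny vector. A sketch with made-up values:

```r
x <- c("0007", "0009", "0012")           # made-up input with gaps
fmt <- paste0("%0", max(nchar(x)), "d")  # "%04d": pad to the widest entry
start <- min(as.integer(x))

# Same number of elements as x, counting up from the minimum
by_length <- sprintf(fmt, seq.int(start, length.out = length(x)))

# Every value between the minimum and the maximum
by_range <- sprintf(fmt, seq.int(start, max(as.integer(x))))

by_length
by_range
```

Here by_length has three elements ending at "0009", while by_range has six elements ending at "0012".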
Benchmark
f1 <- function() sprintf(paste0("%0", max(nchar(x)), "d"),
seq(min(as.numeric(x)), min(as.numeric(x)) + length(x) - 1))
f2 <- function() sprintf(paste0("%0", max(nchar(x)), "d"),
seq(min(as.numeric(x)), length.out=length(x)))
f3 <- function() sprintf(paste0("%0", max(nchar(x)), "d"),
seq.int(min(as.numeric(x)), length.out=length(x)))
f31 <- function() sprintf(paste0("%0", nchar(x[1]), "d"),
seq.int(min(as.numeric(x)), length.out=length(x)))
f4 <- function() sprintf(paste0("%0", nchar(x[1]), "d"),
0:(length(x) - 1) + min(as.numeric(x)))
f5 <- function() stringr::str_pad(seq.int(min(as.numeric(x)),
length.out=length(x)),
nchar(x[1]), "left", "0")
set.seed(5789)
x <- sample(sprintf("%05d", 1:99999))
microbenchmark::microbenchmark(seq_to=f1(), seq_len=f2(), seq.int=f3(),
seq.int1=f31(), colon=f4(), stringr=f5())
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# seq_to 104.22119 106.83928 108.92791 107.81301 109.68406 124.35686 100 f
# seq_len 87.14385 89.89180 92.34962 90.97192 92.09823 110.59426 100 d
# seq.int 85.72324 87.93885 89.91353 89.03327 90.32758 113.41480 100 c
# seq.int1 59.54312 61.63065 62.86618 62.47707 63.53334 76.33471 100 a
# colon 60.94867 63.16109 64.73306 63.88925 64.79997 81.63646 100 b
# stringr 99.08452 101.56649 104.01522 102.74420 104.20269 158.30948 100 e

Is there a specific function in R to merge 2 vectors [duplicate]

This question already has answers here:
Pasting two vectors with combinations of all vectors' elements
(8 answers)
Closed 2 years ago.
I have two vectors, one that contains a list of variables, and one that contains dates, such as
Variables_Pays <- c("PIB", "ConsommationPrivee","ConsommationPubliques",
"FBCF","ProductionIndustrielle","Inflation","InflationSousJacente",
"PrixProductionIndustrielle","CoutHoraireTravail")
Annee_Pays <- c("2000","2001")
I want to merge them to have a vector with each variable indexed by my date, that is my desired output is
> Colonnes_Pays_Principaux
[1] "PIB_2020" "PIB_2021" "ConsommationPrivee_2020"
[4] "ConsommationPrivee_2021" "ConsommationPubliques_2020" "ConsommationPubliques_2021"
[7] "FBCF_2020" "FBCF_2021" "ProductionIndustrielle_2020"
[10] "ProductionIndustrielle_2021" "Inflation_2020" "Inflation_2021"
[13] "InflationSousJacente_2020" "InflationSousJacente_2021" "PrixProductionIndustrielle_2020"
[16] "PrixProductionIndustrielle_2021" "CoutHoraireTravail_2020" "CoutHoraireTravail_2021"
Is there a simpler / more readable way than the double for loop below, which I have tried and which works?
Colonnes_Pays_Principaux <- vector()
for (Variable in (1:length(Variables_Pays))){
for (Annee in (1:length(Annee_Pays))){
Colonnes_Pays_Principaux=
append(Colonnes_Pays_Principaux,
paste(Variables_Pays[Variable],Annee_Pays[Annee],sep="_")
)
}
}
expand.grid will create a data frame with all combinations of the two vectors.
with(
expand.grid(Variables_Pays, Annee_Pays),
paste0(Var1, "_", Var2)
)
#> [1] "PIB_2000" "ConsommationPrivee_2000"
#> [3] "ConsommationPubliques_2000" "FBCF_2000"
#> [5] "ProductionIndustrielle_2000" "Inflation_2000"
#> [7] "InflationSousJacente_2000" "PrixProductionIndustrielle_2000"
#> [9] "CoutHoraireTravail_2000" "PIB_2001"
#> [11] "ConsommationPrivee_2001" "ConsommationPubliques_2001"
#> [13] "FBCF_2001" "ProductionIndustrielle_2001"
#> [15] "Inflation_2001" "InflationSousJacente_2001"
#> [17] "PrixProductionIndustrielle_2001" "CoutHoraireTravail_2001"
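expand.grid varies its first argument fastest, which is why the years are grouped together above. If the variable-major order from the question is wanted, one option is to pass the years first. A sketch with a shortened variable list (the column names annee and variable are just illustrative):

```r
Variables_Pays <- c("PIB", "FBCF", "Inflation")   # shortened for the example
Annee_Pays <- c("2000", "2001")

# First argument varies fastest, so put the years first
grid <- expand.grid(annee = Annee_Pays, variable = Variables_Pays,
                    stringsAsFactors = FALSE)
res <- paste(grid$variable, grid$annee, sep = "_")
res
```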
We can use outer :
c(t(outer(Variables_Pays, Annee_Pays, paste, sep = '_')))
# [1] "PIB_2000" "PIB_2001"
# [3] "ConsommationPrivee_2000" "ConsommationPrivee_2001"
# [5] "ConsommationPubliques_2000" "ConsommationPubliques_2001"
# [7] "FBCF_2000" "FBCF_2001"
# [9] "ProductionIndustrielle_2000" "ProductionIndustrielle_2001"
#[11] "Inflation_2000" "Inflation_2001"
#[13] "InflationSousJacente_2000" "InflationSousJacente_2001"
#[15] "PrixProductionIndustrielle_2000" "PrixProductionIndustrielle_2001"
#[17] "CoutHoraireTravail_2000" "CoutHoraireTravail_2001"
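The t() is there because c() flattens a matrix column by column. A tiny sketch with placeholder values shows the difference:

```r
m <- outer(c("A", "B"), c("1", "2"), paste, sep = "_")  # 2 x 2 character matrix

col_major <- c(m)      # flattens column by column: second vector varies slowest
row_major <- c(t(m))   # transpose first, so the second vector varies fastest

col_major
row_major
```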
No real need to go beyond the basics here! Use paste for pasting the strings and rep to repeat either Annee_Pays or Variables_Pays to get all combinations:
Variables_Pays <- c("PIB", "ConsommationPrivee","ConsommationPubliques",
"FBCF","ProductionIndustrielle","Inflation","InflationSousJacente",
"PrixProductionIndustrielle","CoutHoraireTravail")
Annee_Pays <- c("2000","2001")
# To get this in the same order as in your example:
paste(rep(Variables_Pays, rep(2, length(Variables_Pays))), Annee_Pays, sep = "_")
# Alternative order:
paste(Variables_Pays, rep(Annee_Pays, rep(length(Variables_Pays), 2)), sep = "_")
# Or, if order doesn't matter too much:
paste(Variables_Pays, rep(Annee_Pays, length(Variables_Pays)), sep = "_")
In base R:
Variables_Pays <- c("PIB", "ConsommationPrivee","ConsommationPubliques",
"FBCF","ProductionIndustrielle","Inflation","InflationSousJacente",
"PrixProductionIndustrielle","CoutHoraireTravail")
Annee_Pays <- c("2000","2001")
cbind(paste(Variables_Pays, Annee_Pays,sep="_"),paste(Variables_Pays, rev(Annee_Pays),sep="_"))

str_split on first and second occurrence of delimiter at different locations in character vector

I have a character list that has weather variables followed by "mean_#" where # is a number between 5 and 10. I want to subset the list to only have the weather variable names themselves. The mean weather variables look like this:
> mean_vars
[1] "dew_mean_10" "dew_mean_5" "dew_mean_6" "dew_mean_7"
[5] "dew_mean_8" "dew_mean_9" "humid_mean_10" "humid_mean_5"
[9] "humid_mean_6" "humid_mean_7" "humid_mean_8" "humid_mean_9"
[13] "rain_mean_10" "rain_mean_5" "rain_mean_6" "rain_mean_7"
[17] "rain_mean_8" "rain_mean_9" "soil_moist_mean_10" "soil_moist_mean_5"
[21] "soil_moist_mean_6" "soil_moist_mean_7" "soil_moist_mean_8" "soil_moist_mean_9"
[25] "soil_temp_mean_10" "soil_temp_mean_5" "soil_temp_mean_6" "soil_temp_mean_7"
[29] "soil_temp_mean_8" "soil_temp_mean_9" "solar_mean_10" "solar_mean_5"
[33] "solar_mean_6" "solar_mean_7" "solar_mean_8" "solar_mean_9"
[37] "temp_mean_10" "temp_mean_5" "temp_mean_6" "temp_mean_7"
[41] "temp_mean_8" "temp_mean_9" "wind_dir_mean_10" "wind_dir_mean_5"
[45] "wind_dir_mean_6" "wind_dir_mean_7" "wind_dir_mean_8" "wind_dir_mean_9"
[49] "wind_gust_mean_10" "wind_gust_mean_5" "wind_gust_mean_6" "wind_gust_mean_7"
[53] "wind_gust_mean_8" "wind_gust_mean_9" "wind_spd_mean_10" "wind_spd_mean_5"
[57] "wind_spd_mean_6" "wind_spd_mean_7" "wind_spd_mean_8" "wind_spd_mean_9"
And this is all I want at the end:
> var_names
"dew" "humid" "rain" "solar" "temp" "soil_moist" "soil_temp" "wind_dir" "wind_gust" "wind_spd"
I have figured out how to do it, but I feel my method is more convoluted than it needs to be because I am not fluent in regular expressions. I will also have to repeat the process 20 times, substituting "mean" with other words.
var_names <- unique(str_split_fixed(mean_vars, "_", n = 3)[c(1:18,31:42),1])
var_names <- unlist(c(var_names, unique(unite(as_tibble(str_split_fixed(mean_vars, "_", n = 3)[c(19:30,43:60), 1:2])))))
I've been trying to stay within the realm of the tidyverse packages as much as possible so I was using stringr::str_split_fixed.
If you have a solution using this same function that would be ideal as I could continue the same programming style, but I'm open to all suggestions.
Thanks.
Use sub and unique. This is shorter and has no package dependencies (or use unique(str_replace(mean_vars, "_mean.*", "")) with stringr):
unique(sub("_mean.*", "", mean_vars))
giving:
[1] "dew" "humid" "rain" "soil_moist" "soil_temp"
[6] "solar" "temp" "wind_dir" "wind_gust" "wind_spd"
If for some reason you really want to use str_split then:
rmMean <- function(x) paste(head(x, -2), collapse = "_")
unique(sapply(str_split(mean_vars, "_"), rmMean))
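Since the question mentions repeating this for about 20 other words in place of "mean", the sub approach could be parameterised. A sketch, where strip_stat is a hypothetical helper name and the pattern assumes the names always end in _<word>_<digits>:

```r
# Hypothetical helper: return the unique variable names that carry a given statistic
strip_stat <- function(vars, stat) {
  pattern <- paste0("_", stat, "_[0-9]+$")
  matching <- vars[grepl(pattern, vars)]   # keep only names for this statistic
  unique(sub(pattern, "", matching))
}

vars <- c("dew_mean_10", "dew_mean_5", "soil_moist_mean_5", "dew_max_3")
mean_names <- strip_stat(vars, "mean")
max_names <- strip_stat(vars, "max")
mean_names
max_names
```

Looping over the list of words with lapply would then give one vector of names per statistic.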
Note
mean_vars <- c("dew_mean_10", "dew_mean_5", "dew_mean_6", "dew_mean_7", "dew_mean_8",
"dew_mean_9", "humid_mean_10", "humid_mean_5", "humid_mean_6",
"humid_mean_7", "humid_mean_8", "humid_mean_9", "rain_mean_10",
"rain_mean_5", "rain_mean_6", "rain_mean_7", "rain_mean_8", "rain_mean_9",
"soil_moist_mean_10", "soil_moist_mean_5", "soil_moist_mean_6",
"soil_moist_mean_7", "soil_moist_mean_8", "soil_moist_mean_9",
"soil_temp_mean_10", "soil_temp_mean_5", "soil_temp_mean_6",
"soil_temp_mean_7", "soil_temp_mean_8", "soil_temp_mean_9", "solar_mean_10",
"solar_mean_5", "solar_mean_6", "solar_mean_7", "solar_mean_8",
"solar_mean_9", "temp_mean_10", "temp_mean_5", "temp_mean_6",
"temp_mean_7", "temp_mean_8", "temp_mean_9", "wind_dir_mean_10",
"wind_dir_mean_5", "wind_dir_mean_6", "wind_dir_mean_7", "wind_dir_mean_8",
"wind_dir_mean_9", "wind_gust_mean_10", "wind_gust_mean_5", "wind_gust_mean_6",
"wind_gust_mean_7", "wind_gust_mean_8", "wind_gust_mean_9", "wind_spd_mean_10",
"wind_spd_mean_5", "wind_spd_mean_6", "wind_spd_mean_7", "wind_spd_mean_8",
"wind_spd_mean_9")

How to complete several character vector formatting steps in a single function?

EDITED
I have a simple list of column names that I would like to change the format of, ideally programmatically. This is a sample of the list:
vars_list <- c("tBodyAcc.mean...X", "tBodyAcc.mean...Y", "tBodyAcc.mean...Z",
"tBodyAcc.std...X", "tBodyAcc.std...Y", "tBodyAcc.std...Z",
"tGravityAcc.mean...X", "tGravityAcc.mean...Y", "tGravityAcc.mean...Z",
"tGravityAcc.std...X", "tGravityAcc.std...Y", "tGravityAcc.std...Z",
"fBodyAcc.mean...X", "fBodyAcc.mean...Y", "fBodyAcc.mean...Z",
"fBodyAcc.std...X", "fBodyAcc.std...Y", "fBodyAcc.std...Z",
"fBodyAccJerk.mean...X", "fBodyAccJerk.mean...Y", "fBodyAccJerk.mean...Z",
"fBodyAccJerk.std...X", "fBodyAccJerk.std...Y", "fBodyAccJerk.std...Z")
And this is the result I'm hoping for:
[3]"Time_Body_Acc_Mean_X" "Time_Body_Acc_Mean_Y"
[5] "Time_Body_Acc_Mean_Z" "Time_Body_Acc_Stddev_X"
[7] "Time_Body_Acc_Stddev_Y" "Time_Body_Acc_Stddev_Z"
[9] "Time_Gravity_Acc_Mean_X" "Time_Gravity_Acc_Mean_Y"
[11] "Time_Gravity_Acc_Mean_Z" "Time_Gravity_Acc_Stddev_X"
[13] "Time_Gravity_Acc_Stddev_Y" "Time_Gravity_Acc_Stddev_Z"
...
[43] "Freq_Body_Acc_Mean_X" "Freq_Body_Acc_Mean_Y"
[45] "Freq_Body_Acc_Mean_Z" "Freq_Body_Acc_Stddev_X"
[47] "Freq_Body_Acc_Stddev_Y" "Freq_Body_Acc_Stddev_Z"
[49] "Freq_Body_Acc_Jerk_Mean_X" "Freq_Body_Acc_Jerk_Mean_Y"
[51] "Freq_Body_Acc_Jerk_Mean_Z" "Freq_Body_Acc_Jerk_Stddev_X"
[53] "Freq_Body_Acc_Jerk_Stddev_Y" "Freq_Body_Acc_Jerk_Stddev_Z"
I've put together what feels like a really verbose way of making the changes employing regular expressions.
vars_list <- unlist(lapply(vars_list, function(x){gsub("^t", "Time", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("^f", "Freq", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("std", "Stddev", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("mean", "Mean", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("\\.+", "", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("\\.", "", x)}))
vars_list <- unlist(lapply(vars_list,
function(x){gsub("(?<=[a-z]).{0}(?=[A-Z])",
"_", x, perl = TRUE)}))
Is there a way to arrive at the same results more efficiently and elegantly by including two or more formatting steps in a single function call?
One alternative is to write your patterns and replacement in two vectors, then use stringi::stri_replace_all_regex which can do this replacement in a vectorized manner:
# patterns correspond to replacement at the same positions
patterns <- c('^t', '^f', 'std', 'mean', '\\.+', '(?<=[a-z])([A-Z])')
replacement <- c('Time', 'Freq', 'Stddev', 'Mean', '', '_$1')
library(stringi)
stri_replace_all_regex(vars_list, patterns, replacement, vectorize_all = FALSE)
# [1] "Time_Body_Acc_Mean_X" "Time_Body_Acc_Mean_Y"
# [3] "Time_Body_Acc_Mean_Z" "Time_Body_Acc_Stddev_X"
# [5] "Time_Body_Acc_Stddev_Y" "Time_Body_Acc_Stddev_Z"
# [7] "Time_Gravity_Acc_Mean_X" "Time_Gravity_Acc_Mean_Y"
# [9] "Time_Gravity_Acc_Mean_Z" "Time_Gravity_Acc_Stddev_X"
#[11] "Time_Gravity_Acc_Stddev_Y" "Time_Gravity_Acc_Stddev_Z"
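If stringi is not an option, roughly the same sequential replacement can be sketched in base R by folding gsub over the pattern/replacement pairs with Reduce. Note the assumptions: perl = TRUE is needed for the lookbehind, the backreference is written "\\1" instead of "$1", and rename_vars is a hypothetical helper name.

```r
patterns <- c("^t", "^f", "std", "mean", "\\.+", "(?<=[a-z])([A-Z])")
replacement <- c("Time", "Freq", "Stddev", "Mean", "", "_\\1")

# Apply each (pattern, replacement) pair in order to the running result
rename_vars <- function(x) {
  Reduce(function(acc, i) gsub(patterns[i], replacement[i], acc, perl = TRUE),
         seq_along(patterns), x)
}

renamed <- rename_vars(c("tBodyAcc.mean...X", "fBodyAccJerk.std...Z"))
renamed
```

The order of the pairs matters: the dots have to be removed before the underscore step, because the lookbehind relies on a lowercase letter sitting directly before each capital.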
How about this using base R's sub?
sub("t(\\w+)(Acc)\\.(\\w+)\\.+([XYZ])", "Time_\\1_\\2_\\3_\\4", vars_list);
#[1] "Time_Body_Acc_mean_X" "Time_Body_Acc_mean_Y"
#[3] "Time_Body_Acc_mean_Z" "Time_Body_Acc_std_X"
#[5] "Time_Body_Acc_std_Y" "Time_Body_Acc_std_Z"
#[7] "Time_Gravity_Acc_mean_X" "Time_Gravity_Acc_mean_Y"
#[9] "Time_Gravity_Acc_mean_Z" "Time_Gravity_Acc_std_X"
#[11] "Time_Gravity_Acc_std_Y" "Time_Gravity_Acc_std_Z"
Changing mean to Mean, and std to Stddev, requires two additional subs.
Ditto for t to Time and f to Freq.

Sort columns numerically?

I have a rather long dataframe with 200+ columns like this:
RespondentID Q16_positiv.31 Q16_positiv.68 Q16_positiv.194 ....
Is there a way to sort the Q16_positiv.XYZ variables in ascending numerical order? All my attempts used alphabetical sort, which does not yield the expected results.
It sounds like you would be interested in the mixedsort function from "gtools".
Consider the following sample data:
set.seed(1)
x <- paste("Q", sample(5, 20, TRUE), sample(20, 20, TRUE), sep = "_")
x
# [1] "Q_2_19" "Q_2_5" "Q_3_14" "Q_5_3" "Q_2_6" "Q_5_8" "Q_5_1"
# [8] "Q_4_8" "Q_4_18" "Q_1_7" "Q_2_10" "Q_1_12" "Q_4_10" "Q_2_4"
# [15] "Q_4_17" "Q_3_14" "Q_4_16" "Q_5_3" "Q_2_15" "Q_4_9"
Here's the result of sort:
sort(x)
# [1] "Q_1_12" "Q_1_7" "Q_2_10" "Q_2_15" "Q_2_19" "Q_2_4" "Q_2_5"
# [8] "Q_2_6" "Q_3_14" "Q_3_14" "Q_4_10" "Q_4_16" "Q_4_17" "Q_4_18"
# [15] "Q_4_8" "Q_4_9" "Q_5_1" "Q_5_3" "Q_5_3" "Q_5_8"
Here's the result of mixedsort:
library(gtools)
mixedsort(x)
# [1] "Q_1_7" "Q_1_12" "Q_2_4" "Q_2_5" "Q_2_6" "Q_2_10" "Q_2_15"
# [8] "Q_2_19" "Q_3_14" "Q_3_14" "Q_4_8" "Q_4_9" "Q_4_10" "Q_4_16"
# [15] "Q_4_17" "Q_4_18" "Q_5_1" "Q_5_3" "Q_5_3" "Q_5_8"
I hope this is enough to be helpful for you--otherwise, please update your question with a reproducible example.
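For the column layout described in the question, a base-R sketch without gtools might look like the following. It assumes the number to sort by is everything after the last dot in the column name, and the data frame here is made up for illustration:

```r
# Made-up data frame with the question's naming scheme, columns out of order
df <- data.frame(RespondentID = 1:2,
                 Q16_positiv.68 = 0, Q16_positiv.194 = 0, Q16_positiv.31 = 0)

q_cols <- grep("^Q16_positiv\\.", names(df), value = TRUE)
suffix <- as.numeric(sub(".*\\.", "", q_cols))   # numeric part after the last dot

# Keep the ID column first, then the Q16 columns in numeric order
df <- df[, c("RespondentID", q_cols[order(suffix)])]
names(df)
```

With gtools installed, mixedsort(q_cols) could be used in place of q_cols[order(suffix)].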
