Removing trailing strings in a dataframe's column names to make groups - r

This is my set of column names from which I want to make groups:
colnames(data4)
[1] "VC_1_UI_S3" "VC_2_UI_S4" "BAF60A_KD_1_S13" "BAF60A_KD_2_S14" "VC_VD3_1_S7" "VC_VD3_2_S8"
[7] "BAF60A_VD3_1_S15" "BAF60A_VD3_2_S16"
I'm doing this:
metadata$Group <- sub("_[^[:alpha:]]+_S[0-9]", "", (colnames(data4)))
Which results in this:
metadata
Group
VC_1_UI_S3 VC_1_UI_S3
VC_2_UI_S4 VC_2_UI_S4
BAF60A_KD_1_S13 BAF60A_KD3
BAF60A_KD_2_S14 BAF60A_KD4
VC_VD3_1_S7 VC_VD3
VC_VD3_2_S8 VC_VD3
BAF60A_VD3_1_S15 BAF60A_VD35
BAF60A_VD3_2_S16 BAF60A_VD36
For VC_VD3_1_S7 and VC_VD3_2_S8 I'm getting the desired result, which is just VC_VD3, but not for the others. What I require is this:
Group
VC_1_UI_S3 VC_1_UI
VC_2_UI_S4 VC_2_UI
BAF60A_KD_1_S13 BAF60A_KD
BAF60A_KD_2_S14 BAF60A_KD
VC_VD3_1_S7 VC_VD3
VC_VD3_2_S8 VC_VD3
BAF60A_VD3_1_S15 BAF60A_VD3
BAF60A_VD3_2_S16 BAF60A_VD3
Any suggestion or help is really appreciated.

We could use
sub("_\\d*_*S[0-9]+$", "", x)
#[1] "VC_1_UI" "VC_2_UI" "BAF60A_KD" "BAF60A_KD"
#[5]"VC_VD3" "VC_VD3" "BAF60A_VD3" "BAF60A_VD3"
Or use str_remove from stringr
library(stringr)
str_remove(x, "_\\d*_*S[0-9]+$")
data
x <- c("VC_1_UI_S3", "VC_2_UI_S4", "BAF60A_KD_1_S13", "BAF60A_KD_2_S14",
"VC_VD3_1_S7", "VC_VD3_2_S8", "BAF60A_VD3_1_S15", "BAF60A_VD3_2_S16"
)

You can try this regex, making the number before S[0-9]+ optional:
sub("_(\\d+_)?S[0-9]+$", "", x)
#[1] "VC_1_UI" "VC_2_UI" "BAF60A_KD" "BAF60A_KD" "VC_VD3" "VC_VD3"
#[7] "BAF60A_VD3" "BAF60A_VD3"
data
x <- c("VC_1_UI_S3" , "VC_2_UI_S4" ,"BAF60A_KD_1_S13" ,"BAF60A_KD_2_S14" ,
"VC_VD3_1_S7","VC_VD3_2_S8", "BAF60A_VD3_1_S15", "BAF60A_VD3_2_S16")

Related

How to sort list.files() in correct date order?

Using normal list.files() in the working directory returns the file list, but the numeric order is messed up.
f <- list.files(pattern="*.nc")
f
# [1] "te1971-1.nc" "te1971-10.nc" "te1971-11.nc" "te1971-12.nc"
# [5] "te1971-2.nc" "te1971-3.nc" "te1971-4.nc" "te1971-5.nc"
# [9] "te1971-6.nc" "te1971-7.nc" "te1971-8.nc" "te1971-9.nc"
where the number after "-" is the month number.
I used the following to try to sort it:
myFiles <- paste("te", i, "-", c(1:12), ".nc", sep = "")
mixedsort(myFiles)
it returns ordered files but in reverse:
[1] "te1971-12.nc" "te1971-11.nc" "tev1971-10.nc" "te1971-9.nc"
[5] "te1971-8.nc" "te1971-7.nc" "te1971-6.nc" "te1971-5.nc"
[9] "te1971-4.nc" "te1971-3.nc" "te1971-2.nc" "te1971-1.nc"
How do I fix this?
The issue is that the values get alphabetically sorted.
You could gsub out the year and month as capture groups, append "-1" as the first day of the month, coerce the result with as.Date, and order by that.
x[order(as.Date(gsub('.*(\\d{4})-(\\d{1,2}).*', '\\1-\\2-1', x)))]
# [1] "te1971-1.nc" "te1971-2.nc" "te1971-3.nc" "te1971-4.nc" "te1971-5.nc"
# [6] "te1971-6.nc" "te1971-7.nc" "te1971-8.nc" "te1971-9.nc" "te1971-10.nc"
# [11] "te1971-11.nc" "te1971-12.nc"
Data:
x <- c("te1971-1.nc", "te1971-10.nc", "te1971-11.nc", "te1971-12.nc",
"te1971-2.nc", "te1971-3.nc", "te1971-4.nc", "te1971-5.nc", "te1971-6.nc",
"te1971-7.nc", "te1971-8.nc", "te1971-9.nc")

Extract certain words from dynamic strings vector

I'm working with questionnaire datasets where I need to extract some brands' names from several questions. The problem is that each dataset might have a different question line, for example:
Data #1
What do you know about AlphaToy?
Data #2
What comes to your mind when you heard AlphaCars?
Data #3
What do you think of FoodTruckers?
What I want to extract are the words AlphaToy, AlphaCars, and FoodTruckers. In Excel, I can get those brand names via flash fill.
As I'm working with R, I need to convert the "flash fill" step into an R function, yet I couldn't figure out how to do it. Here's the desired output:
brandName <- list(
Toy = c(
"1. What do you know about AlphaToy?",
"2. What do you know about BetaToyz?",
"3. What do you know about CharlieDoll?",
"4. What do you know about DeltaToys?",
"5. What do you know about Echoty?"
),
Car = c(
"18. What comes to your mind when you heard AlphaCars?",
"19. What comes to your mind when you heard BestCar?",
"20. What comes to your mind when you heard CoolCarz?"
),
Trucker = c(
"5. What do you think of FoodTruckers?",
"6. What do you think of IceCreamTruckers?",
"7. What do you think of JellyTruckers?",
"8. What do you think of SodaTruckers?"
)
)
extractBrandName <- function(...) {
#some codes here
}
#desired output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
As the title says, the function should work on dynamic strings, so when the function is applied to brandName the desired output is:
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Edit:
The brand name can be in lowercase, uppercase, or even two words or more, for instance: IBM, Louis Vuitton
The brand names might appear in the middle of the sentence; they don't always come at the end. The thing is, the sentences are unpredictable because each client might provide data that differs from the others'.
Can anyone help me with the function code to achieve the desired output? Thank you in advance!
Edit: here's my attempt
The idea (thanks to shs' answer) is to find the words the inputs have in common, then exclude them, leaving the unique words (which should be the brand names) behind. Following this post, I use intersect() wrapped inside Reduce() to get the common words, then I exclude them via lapply() and make sure any brand names of two or more words are merged back together with str_c(collapse = " ").
Code
library(stringr)
extractBrandName <- function(x) {
cleanWords <- x %>%
str_remove_all("^\\d+|\\.|,|\\?") %>%
str_squish() %>%
str_split(" ")
commonWords <- cleanWords %>%
Reduce(intersect, .)
extractedWords <- cleanWords %>%
lapply(., function(y) {
y[!y %in% commonWords] %>%
str_c(collapse = " ")
}) %>% unlist()
return(extractedWords)
}
Output (1st test case)
> #output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Output (2nd test case)
This test case includes brand names of two or more words, located in the middle and at the beginning of the sentences.
brandName2 <- list(
Middle = c("Have you used any products from AlphaToy this past 6 months?",
"Have you used any products from BetaToys Collection this past 6 months?",
"Have you used any products from Charl TOYZ this past 6 months?"),
First = c("AlphaCars is the best automobile dealer, yes/no?",
"Best Vehc is the best automobile dealer, yes/no?",
"CoolCarz & Bike is the best automobile dealer, yes/no?")
)
> #output
> lapply(brandName2, extractBrandName)
$Middle
[1] "AlphaToy" "BetaToys Collection" "Charl TOYZ"
$First
[1] "AlphaCars" "Best Vehc" "CoolCarz & Bike"
In the end, a solution to this problem was found, thanks to shs, who gave the initial idea, and to the answer from the post I linked above. If you have any suggestions, please feel free to comment. Thank you.
This function checks which words the first two strings have in common and then removes everything from the beginning of the strings up to and including the common element, leaving only the desired part of the string:
library(stringr)
extractBrandName <- function(x) {
x %>%
str_split(" ") %>%
{.[[1]][.[[1]] %in% .[[2]]]} %>%
str_c(collapse = " ") %>%
str_c("^.+", .) %>%
str_remove(x, .) %>%
str_squish() %>%
str_remove("\\?")
}
lapply(brandName, extractBrandName)
#> $Toy
#> [1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
#>
#> $Car
#> [1] "AlphaCars" "BestCar" "CoolCarz"
#>
#> $Trucker
#> [1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"

Change order of multiple optional substrings

That's a bit like this question, but I have multiple substrings that may or may not occur.
The substrings code for two different dimensions, in my example "test" and "eye". They can occur in any imaginable order.
The variables can be coded in different ways - in my example, "method|test" are two ways to code for "test", and "r|re|l|le" are different ways to code for eyes.
I found a convoluted solution that uses a chain of seven (!) gsub calls, and I wondered if there is a more concise way.
x <- c("id", "r_test", "l_method", "test_re", "method_le", "test_r_old",
"test_l_old", "re_test_new","new_le_method", "new_r_test")
x
#> [1] "id" "r_test" "l_method" "test_re"
#> [5] "method_le" "test_r_old" "test_l_old" "re_test_new"
#> [9] "new_le_method" "new_r_test"
Desired output
#> [1] "id" "r_test" "l_test" "r_test" "l_test"
#> [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"
How I got there (convoluted)
## Unify codes for variables, I use the underscores to make it more unique for future regex
clean_test<- gsub("(?<![a-z])(test|method)(?![a-z])", "_test_", tolower(x), perl = TRUE)
clean_r <- gsub("(?<![a-z])(r|re)(?![a-z])", "_r_", tolower(clean_test), perl = TRUE)
clean_l <- gsub("(?<![a-z])(l|le)(?![a-z])", "_l_", tolower(clean_r), perl = TRUE)
## Now sort, one after the other
sort_eye <- gsub("(.*)(_r_|_l_)(.*)", "\\2\\1\\3", clean_l, perl = TRUE)
sort_test <- gsub("(_r_|_l_)(.*)(_test_)(.*)", "\\1\\3\\2\\4", sort_eye, perl = TRUE)
## Remove underscores
clean_underscore_mult <- gsub("_{2,}", "_", sort_test)
clean_underscore_ends <- gsub("^_|_$", "", clean_underscore_mult)
clean_underscore_ends
#> [1] "id" "r_test" "l_test" "r_test" "l_test"
#> [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"
I'd already be very grateful for a suggestion on how to better proceed from ## Now sort, one after the other downwards...
How about tokenizing the string and using lookup tables instead? I'll use data.table to assist, but the idea fits naturally with other data grammars as well.
library(data.table)
# build into a table, keeping track of an ID
# to say which element it came from originally
l = strsplit(x, '_', fixed=TRUE)
DT = data.table(id = rep(seq_along(l), lengths(l)), token = unlist(l))
Now build a lookup table:
# defined using fread to make it easier to see
# token & match side-by-side; only define tokens
# that actually need to be changed here
lookups = fread('
token,match
le,l
re,r
method,test
')
Now combine:
# default value is the token itself
DT[ , match := token]
# replace anything matched
DT[lookups, match := i.match, on = 'token']
Next use factor ordering to get the tokens in the right order:
# the more general [where you don't have an exact list of all the possible
# tokens ready at hand] is a bit messier -- you might do something
# similar to setdiff(unique(match), lookups$match)
DT[ , match := factor(match, levels = c('id', 'r', 'l', 'test', 'old', 'new'))]
# sort to this new order
setorder(DT, id, match)
Finally combine again (an aggregation) to get the output:
DT[ , paste(match, collapse='_'), by = id]$V1
# [1] "id" "r_test" "l_test" "r_test" "l_test"
# [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"
Here's a one-liner with nested sub that transforms x without any intermediary steps:
sub("^(\\w+)_(r|re|l|le)", "\\2_\\1",
sub("method", "test",
sub("(l|r)e", "\\1",
sub("(^new)_(\\w+_\\w+)$", "\\2_\\1", x))))
# [1] "id" "r_test" "l_test" "r_test" "l_test" "r_test_old"
# [7] "l_test_old" "r_test_new" "l_test_new" "r_test_new"
Data:
x <- c("id", "r_test", "l_method", "test_re", "method_le", "test_r_old",
"test_l_old", "re_test_new","new_le_method", "new_r_test")
Much inspired by and building on user MichaelChirico's answer, this is a function using base R only, which (in theory) should work with any number of substrings to sort. The list names define the canonical tokens and their sort order, and each list element specifies all the ways that token can be coded.
## I've added some more ways to code for right and left eyes, as well as some further strings that are not known in advance.
x <- c("id", "r_random_test_old", "r_test", "r_test_else", "l_method", "test_re", "method_le", "test_od_old",
"test_os_old", "re_mth_new","new_le_method", "new_r_test_random")
x
#> [1] "id" "r_random_test_old" "r_test"
#> [4] "r_test_else" "l_method" "test_re"
#> [7] "method_le" "test_od_old" "test_os_old"
#> [10] "re_mth_new" "new_le_method" "new_r_test_random"
sort_substr(x, list(r = c("od","re"), l = c("os","le"), test = c("method", "mth"), time = c("old","new")))
#> [1] "id" "r_test_time_random" "r_test"
#> [4] "r_test_else" "l_test" "r_test"
#> [7] "l_test" "r_test_time" "l_test_time"
#> [10] "r_test_time" "l_test_time" "r_test_time_random"
sort_substr
sort_substr <- function(x, list_substr) {
lookups <- data.frame(match = rep(names(list_substr), lengths(list_substr)),
token = unlist(list_substr))
l <- strsplit(x, "_", fixed = TRUE)
DF <- data.frame(id = rep(seq_along(l), lengths(l)), token = unlist(l))
match_token <- lookups$match[match(DF$token, lookups$token)]
DF$match <- ifelse(is.na(match_token), DF$token, match_token)
rest_token <- base::setdiff(DF$match, names(list_substr))
DF$match <- factor(DF$match, levels = c(names(list_substr), rest_token))
DF <- DF[with(DF, order(id, match)), ]
out <- vapply(split(DF$match, DF$id),
paste, collapse = "_",
FUN.VALUE = character(1),
USE.NAMES = FALSE)
out
}
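As a quick check (my own invocation, not from the answer), sort_substr() also reproduces the desired output for the original vector from the question:
x <- c("id", "r_test", "l_method", "test_re", "method_le", "test_r_old",
"test_l_old", "re_test_new", "new_le_method", "new_r_test")
sort_substr(x, list(r = c("re"), l = c("le"), test = c("method")))
#> [1] "id" "r_test" "l_test" "r_test" "l_test"
#> [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"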

Create list with specific iteration in R

I have the following dataset containing dates:
> dates
[1] "20180412" "20180424" "20180506" "20180518" "20180530" "20180611" "20180623" "20180705" "20180717" "20180729"
I am trying to create a list where, in each position, the value is 'Coherence_' plus the next pair of dates from dates. So output1[1] would hold Coherence_20180412_20180424, output1[2] would hold Coherence_20180506_20180518, etc.
I am starting with this code but it is not working the way I need:
output1<-list()
for (i in 1:5){
output1[[i]]<-paste("-Poutput1=", S1_Out_Path,"Coherence_VV_TC", dates[[i]],"_", dates[[i+1]], ".tif", sep="")
}
Do you have any suggestions?
M
Try this:
Without loop
even_indexes<-seq(2,10,2) # List of even indexes
odd_indexes<-seq(1,10,2) # List of odd indexes
print(paste('Coherence',paste(odd_indexes,even_indexes,sep = "_"),sep = "_"))
Answer linked from here: Create list in R with specific iteration
Updated (To get data in List)
lst=c(paste('Coherence',paste(odd_indexes,even_indexes,sep = "_"),sep = "_"))
OR
a=c(1:10)
for (i in seq(1, 9, 2)){
print(paste('Coherence',paste(a[i],a[i+1],sep = "_"),sep = "_"))
}
Output:
[1] "Coherence_1_2"
[1] "Coherence_3_4"
[1] "Coherence_5_6"
[1] "Coherence_7_8"
[1] "Coherence_9_10"
You can create these patterns using paste's ability to operate on vectors:
dates <- c("20180412", "20180424", "20180506", "20180518", "20180530",
"20180611", "20180623", "20180705", "20180717", "20180729")
paste("Coherence", dates[1:length(dates)-1], dates[2:length(dates)], sep="_")
[1] "Coherence_20180412_20180424" "Coherence_20180424_20180506" "Coherence_20180506_20180518"
[4] "Coherence_20180518_20180530" "Coherence_20180530_20180611" "Coherence_20180611_20180623"
[7] "Coherence_20180623_20180705" "Coherence_20180705_20180717" "Coherence_20180717_20180729"
Or other simple patterns can be generated as:
paste("Coherence", dates[seq(1, length(dates), 2)], dates[seq(2, length(dates), 2)], sep="_")
[1] "Coherence_20180412_20180424" "Coherence_20180506_20180518" "Coherence_20180530_20180611"
[4] "Coherence_20180623_20180705" "Coherence_20180717_20180729"
You can use matrix(..., nrow=2):
dates <- c("20180412", "20180424", "20180506", "20180518", "20180530", "20180611", "20180623", "20180705", "20180717", "20180729")
paste0("Coherence_", apply(matrix(dates, 2), 2, FUN=paste0, collapse="_"))
# > paste0("Coherence_", apply(matrix(dates, 2), 2, FUN=paste0, collapse="_"))
# [1] "Coherence_20180412_20180424" "Coherence_20180506_20180518" "Coherence_20180530_20180611" "Coherence_20180623_20180705"
# [5] "Coherence_20180717_20180729"

How to complete several character vector formatting steps in a single function?

EDITED
I have a simple list of column names that I would like to change the format of, ideally programmatically. This is a sample of the list:
vars_list <- c("tBodyAcc.mean...X", "tBodyAcc.mean...Y", "tBodyAcc.mean...Z",
"tBodyAcc.std...X", "tBodyAcc.std...Y", "tBodyAcc.std...Z",
"tGravityAcc.mean...X", "tGravityAcc.mean...Y", "tGravityAcc.mean...Z",
"tGravityAcc.std...X", "tGravityAcc.std...Y", "tGravityAcc.std...Z",
"fBodyAcc.mean...X", "fBodyAcc.mean...Y", "fBodyAcc.mean...Z",
"fBodyAcc.std...X", "fBodyAcc.std...Y", "fBodyAcc.std...Z",
"fBodyAccJerk.mean...X", "fBodyAccJerk.mean...Y", "fBodyAccJerk.mean...Z",
"fBodyAccJerk.std...X", "fBodyAccJerk.std...Y", "fBodyAccJerk.std...Z")
And this is the result I'm hoping for:
[3]"Time_Body_Acc_Mean_X" "Time_Body_Acc_Mean_Y"
[5] "Time_Body_Acc_Mean_Z" "Time_Body_Acc_Stddev_X"
[7] "Time_Body_Acc_Stddev_Y" "Time_Body_Acc_Stddev_Z"
[9] "Time_Gravity_Acc_Mean_X" "Time_Gravity_Acc_Mean_Y"
[11] "Time_Gravity_Acc_Mean_Z" "Time_Gravity_Acc_Stddev_X"
[13] "Time_Gravity_Acc_Stddev_Y" "Time_Gravity_Acc_Stddev_Z"
...
[43] "Freq_Body_Acc_Mean_X" "Freq_Body_Acc_Mean_Y"
[45] "Freq_Body_Acc_Mean_Z" "Freq_Body_Acc_Stddev_X"
[47] "Freq_Body_Acc_Stddev_Y" "Freq_Body_Acc_Stddev_Z"
[49] "Freq_Body_Acc_Jerk_Mean_X" "Freq_Body_Acc_Jerk_Mean_Y"
[51] "Freq_Body_Acc_Jerk_Mean_Z" "Freq_Body_Acc_Jerk_Stddev_X"
[53] "Freq_Body_Acc_Jerk_Stddev_Y" "Freq_Body_Acc_Jerk_Stddev_Z"
I've put together what feels like a really verbose way of making the changes employing regular expressions.
vars_list <- unlist(lapply(vars_list, function(x){gsub("^t", "Time", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("^f", "Freq", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("std", "Stddev", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("mean", "Mean", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("\\.+", "", x)}))
vars_list <- unlist(lapply(vars_list, function(x){gsub("\\.", "", x)}))
vars_list <- unlist(lapply(vars_list,
function(x){gsub("(?<=[a-z]).{0}(?=[A-Z])",
"_", x, perl = TRUE)}))
Is there a way to arrive at the same results more efficiently and elegantly by including two or more formatting steps in a single function call?
One alternative is to write your patterns and replacements in two vectors, then use stringi::stri_replace_all_regex, which can apply all of them in a vectorized manner:
# patterns correspond to replacement at the same positions
patterns <- c('^t', '^f', 'std', 'mean', '\\.+', '(?<=[a-z])([A-Z])')
replacement <- c('Time', 'Freq', 'Stddev', 'Mean', '', '_$1')
library(stringi)
stri_replace_all_regex(vars_list, patterns, replacement, vectorize_all = F)
# [1] "Time_Body_Acc_Mean_X" "Time_Body_Acc_Mean_Y"
# [3] "Time_Body_Acc_Mean_Z" "Time_Body_Acc_Stddev_X"
# [5] "Time_Body_Acc_Stddev_Y" "Time_Body_Acc_Stddev_Z"
# [7] "Time_Gravity_Acc_Mean_X" "Time_Gravity_Acc_Mean_Y"
# [9] "Time_Gravity_Acc_Mean_Z" "Time_Gravity_Acc_Stddev_X"
#[11] "Time_Gravity_Acc_Stddev_Y" "Time_Gravity_Acc_Stddev_Z"
How about this using base R's sub?
sub("t(\\w+)(Acc)\\.(\\w+)\\.+([XYZ])", "Time_\\1_\\2_\\3_\\4", vars_list);
#[1] "Time_Body_Acc_mean_X" "Time_Body_Acc_mean_Y"
#[3] "Time_Body_Acc_mean_Z" "Time_Body_Acc_std_X"
#[5] "Time_Body_Acc_std_Y" "Time_Body_Acc_std_Z"
#[7] "Time_Gravity_Acc_mean_X" "Time_Gravity_Acc_mean_Y"
#[9] "Time_Gravity_Acc_mean_Z" "Time_Gravity_Acc_std_X"
#[11] "Time_Gravity_Acc_std_Y" "Time_Gravity_Acc_std_Z"
Changing mean to Mean and std to Stddev requires two additional subs. Likewise, the f-prefixed names need their own sub mapping f to Freq (the pattern above only handles the names starting with t).
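For instance, a hedged sketch of my own chaining those extra subs (it covers the non-Jerk names shown in the output above, not the AccJerk variables):
out <- sub("t(\\w+)(Acc)\\.(\\w+)\\.+([XYZ])", "Time_\\1_\\2_\\3_\\4", vars_list)
out <- sub("f(\\w+)(Acc)\\.(\\w+)\\.+([XYZ])", "Freq_\\1_\\2_\\3_\\4", out)
out <- sub("_mean_", "_Mean_", out)
out <- sub("_std_", "_Stddev_", out)
out[1:6]
#[1] "Time_Body_Acc_Mean_X" "Time_Body_Acc_Mean_Y" "Time_Body_Acc_Mean_Z"
#[4] "Time_Body_Acc_Stddev_X" "Time_Body_Acc_Stddev_Y" "Time_Body_Acc_Stddev_Z"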
