split columns in to two while recoding in r - r

I have following data:
set.seed(123)
M1 <- c(sample(c("AA", "AB", "BB"), 5, replace = T))
M2k <- c(sample (c("AG", "GG", "AA"), 5, replace = T))
M3l <- c(sample (c("AT", "TT", "AA"), 5, replace = T))
M4 <- c(sample (c("CT", "TT", "CC"), 5, replace = T))
#in real data M1 .......M1000
myd <- data.frame (M1, M2k, M3l, M4)
I want split each M into two M1a, M1b for M1, M2ka, M2kb for M2k and so on. Similarly content of cell will be split AB will be A in M1a column and another M1b column. Also I want to re-code A = 1, B = 2, C = 3, G = 4, T = 5, else = NA.

EDIT reshape::colsplit will split by ''
Using reshape::colsplit.
library(reshape)
split_col <- function(.col, data){
.x <- colsplit( data[[.col]], names = paste0(.col, letters[1:2]))
}
# split each column and combine
new_data <- do.call(cbind,lapply(names(myd), split_col, data = myd))
# convert each new column to a factor with levels 1:5 as requested.
new_data_2 <- do.call(data.frame,
lapply(new_data, factor, levels = c('A','B','C','G','T'), labels= 1:5))
M1a M1b M2ka M2kb M3la M3lb M4a M4b
1 1 1 1 4 1 1 3 3
2 2 2 4 4 5 5 3 5
3 1 2 1 1 1 1 3 5
4 2 2 4 4 5 5 3 5
5 2 2 4 4 1 5 3 3

Here is another possible solution, with no particular advantage except that I find it easy follow:
myd$M5 = c("AB", "GT", "GA", "QW", "CK") # Add another test column.
mat = as.matrix(myd) # Convert to matrix for speed and indexing benefits.
# Construct new column names.
new_names = character(length=ncol(mat) * 2)
new_names[seq(1, ncol(mat) * 2, 2)] = paste(colnames(mat), "a", sep="")
new_names[seq(2, ncol(mat) * 2, 2)] = paste(colnames(mat), "b", sep="")
# Create empty matrix with correct column names.
newmat = matrix(ncol=ncol(mat) * 2, nrow=nrow(mat))
colnames(newmat) = new_names
# Split columns.
for (i in seq(1, ncol(mat))) {
newmat[, (i * 2) - 1] = substr(mat[, i], 1, 1)
newmat[, i * 2 ] = substr(mat[, i], 2, 2)
}
# Use named vector to recode data.
recode = c(A=1, B=2, C=3, G=4, T=5)
newmat[] = recode[newmat]
newmat
# M1a M1b M2ka M2kb M3la M3lb M4a M4b M5a M5b
# [1,] "1" "1" "1" "4" "1" "1" "3" "3" "1" "2"
# [2,] "2" "2" "4" "4" "5" "5" "3" "5" "4" "5"
# [3,] "1" "2" "1" "1" "1" "1" "3" "5" "4" "1"
# [4,] "2" "2" "4" "4" "5" "5" "3" "5" NA NA
# [5,] "2" "2" "4" "4" "1" "5" "3" "3" "3" NA

mnel already gave a pretty straight forward answer. This is me playing with my package in process (qdap) that is on GitHub though not on CRAN yet:
To Install qdap
# install.packages("devtools")
library(devtools)
install_github("qdap", "trinker")
Solving the problem:
lapply(seq_along(myd), function(i){
myd <<- colsplit2df(myd, (i+i-1), paste0(names(myd)[i+i-1],
letters[1:2]), sep="")
})
data.frame(apply(myd, 2, function(x) as.numeric(text2color(x,
c("A", "B", "C", "G", "T"), c(1:5, NA)))))
The work horse of this code is colsplit2df (returns a data.frame) and text2col (designed to recode text for wordcloud coloring; really a dictionary lookup tool). This really isn't what these functions are designed to do, just playing and seeing how they can be extended.

Using qdap with more stable solution:
x <- colsplit2df(myd, 1:ncol(myd), sep="")
colnames(x) <- paste(rep(colnames(myd), each = 2), letters[1:2], sep=".")
## M1a M1b M2ka M2kb M3la M3lb M4a M4b
## 1 1 1 1 4 1 1 3 3
## 2 2 2 4 4 5 5 3 5
## 3 1 2 1 1 1 1 3 5
## 4 2 2 4 4 5 5 3 5
## 5 2 2 4 4 1 5 3 3

Related

Determining All Possible Combinations of Items With a Grouping Variable, Allowing for Different Numbers of Items From Each Original Groups

This question is very similar to this one and this one but it combines elements of both in a way that I can't figure out on my own.
I have the following list.
original_groups <- list(group_1 = as.character(1:6), group_2 = as.character(7:12), group_3 = as.character(13:20))
I want to create new groups based on these original groups. There is a constraint - each new group must contain an equal number of items from each original group. Furthermore, items cannot be used more than once. For example, if we take one item from each original group, we may get the following new groups.
Number_of_Items_From_Each_Original_Group_to_End_up_in_Each_New_Group <- 1
Number_of_New_Groups <- 3
# option 1
new_groups <- list(group_1 = as.character(c(1, 7, 13)), group_2 = as.character(c(2, 8, 14)), group_3 = as.character(c(3, 9, 15)))
# option 2
new_groups <- list(group_1 = as.character(c(1, 7, 13)), group_2 = as.character(c(2, 8, 14)), group_3 = as.character(c(3, 9, 16)))
# option 3
new_groups <- list(group_1 = as.character(c(1, 8, 13)), group_2 = as.character(c(2, 7, 14)), group_3 = as.character(c(3, 9, 15)))
There are three things that make what I'm hoping to do really tricky. First, I want to generate all possible combinations since this operation is part of a larger function. Second, I want to have the option to have multiple items from each original group end up in each new group. Third, I also want to have the option to choose how many new groups there will be. Here is another example.
Number_of_Items_From_Each_Original_Group_to_End_up_in_Each_New_Group <- 2
Number_of_New_Groups <- 3
# option 1
new_groups <- list(group_1 = as.character(c(1, 2, 7, 8, 13, 14)), group_2 = as.character(c(3, 4, 9, 10, 15, 16)), group_3 = as.character(c(5, 6, 11, 12, 17, 18)))
# option 2
new_groups <- list(group_1 = as.character(c(1, 2, 7, 8, 13, 14)), group_2 = as.character(c(3, 4, 9, 10, 15, 16)), group_3 = as.character(c(5, 6, 11, 12, 17, 19)))
# option 3
new_groups <- list(group_1 = as.character(c(1, 3, 7, 8, 13, 14)), group_2 = as.character(c(2, 4, 9, 10, 15, 16)), group_3 = as.character(c(5, 6, 11, 12, 17, 18)))
If each original group contained at least 9 items, I could even make new groups that each contain 3 items from each original group. Alternatively, if each original group contained at least 9 items, I could also increase the number of new groups to 4 if only 2 items from each original group end up in each new group.
Note that the original groups don't all need to contain the same amount of items for this process to work - the third original group contains more items than the other two original groups.
Also, note that item order doesn't matter within new groups. In other words, new_groups <- list(group_1 = as.character(c(1, 2, 7, 8, 13, 14)), group_2 = as.character(c(3, 4, 9, 10, 15, 16)), group_3 = as.character(c(5, 6, 11, 12, 17, 18))) is the same as new_groups <- list(group_1 = as.character(c(2, 1, 8, 7, 14, 13)), group_2 = as.character(c(3, 4, 9, 10, 15, 16)), group_3 = as.character(c(5, 6, 11, 12, 17, 18))), so in my final output, I would only want one of these options reported.
Finally, note that the number of original groups won’t always equal the number of new groups - they just happen to in this example.
All solutions are welcome but I'd be especially curious to see one using only base functions.
Thank you!
I suggest this solution: it does not use any other library than base.
Define a permutations function, to compute all possibile combinations of elements of a vector (or list) vec taken sublen elements at a time (likewise the combn function)
permutations <- function(vec, sublen, prev_vec=NULL){
out_list <- list()
if(sublen==1){
for(v in vec){
out_list <- append(out_list,list(append(prev_vec,list(v))))
}
} else {
for (i in 1:(length(vec)-sublen+1)){
v <- vec[1]
prev_vec0 <- c(prev_vec,vec[1])
vec <- vec[2:length(vec)]
perm_list <- permutations(
vec=vec,
sublen=sublen-1,
prev_vec=prev_vec0
)
out_list <- append(out_list,perm_list)
}
}
return(out_list)
}
Define a find_matrix function that unlists matrices from deeply nested lists (source)
find_matrix <- function(x) {
if (is.matrix(x))
return(list(x))
if (!is.list(x))
return(NULL)
unlist(lapply(x, find_matrix), FALSE)
}
Define a compatible_rows function, that extract from a dataframe a subset of rows which can be used to create the other output vectors, given an output vector
compatible_rows <- function(df,row_value){
row_ids <- c()
if(is.null(nrow(df))){
return(NULL)
} else {
for (row_id in 1:nrow(df)){
row_ids <- c(row_ids,!any(row_value %in% df[row_id,]))
}
return(df[which(row_ids),])
}
}
Create a new_groups_list function, that computes all possible output matrices
new_groups_list <- function(df, prev_df=NULL, lvl=-1, verbose=F){
lvl <- lvl+1
results_list <- list()
if(is.null(nrow(df))){
if(verbose==T) cat(paste0("--BRANCH END (BEGIN lvl ",lvl,")--\n"))
prev_df0 <- rbind(prev_df,df)
rownames(prev_df0) <- NULL
if(verbose==T) cat("returned\n")
if(verbose==T) print(prev_df0)
if(verbose==T) cat("\n")
return(prev_df0)
}
if(nrow(df)==0){
if(verbose==T) cat(paste0("--BRANCH END (BEGIN lvl ",lvl,")--\n"))
prev_df0 <- rbind(prev_df,df)
rownames(prev_df0) <- NULL
if(verbose==T) cat("returned\n")
if(verbose==T) print(prev_df0)
if(verbose==T) cat("\n")
return(prev_df0)
}
for(row_id in 1:nrow(df)){
if(verbose==T) cat(paste("-- lvl",lvl,"cycle",row_id,"--\n"))
if(verbose==T) cat("initial results list\n")
if(verbose==T) print(results_list)
if(verbose==T) cat("\n")
if(verbose==T) cat("df in\n")
if(verbose==T) assign("last_df",df,envir = .GlobalEnv)
if(verbose==T) print(df)
if(verbose==T) cat("\n")
if(is.null(nrow(df))){
prev_df0 <- rbind(prev_df,df)
rownames(prev_df0) <- NULL
if(verbose==T) cat(paste0("--BRANCH END (MID lvl ",lvl,")--\n"))
if(verbose==T) cat("returned\n")
results_list <- append(results_list,list(prev_df0))
if(verbose==T) print(results_list)
if(verbose==T) cat("\n")
return(results_list)
}
considered_row <- df[1,]
if(verbose==T) assign("last_considered_row",considered_row,envir = .GlobalEnv)
if(verbose==T) cat("considered rows\n")
if(verbose==T) print(considered_row)
if(verbose==T) cat("\n")
df <- df[2:nrow(df),]
if(verbose==T) assign("last_df",df,envir = .GlobalEnv)
if(verbose==T) cat("df without considered rows\n")
if(verbose==T) print(df)
if(verbose==T) cat("\n")
prev_df0 <- rbind(prev_df,considered_row)
rownames(prev_df0) <- NULL
if(verbose==T) assign("last_prev0",prev_df0,envir = .GlobalEnv)
if(verbose==T) cat("collected considered rows\n")
if(verbose==T) print(prev_df0)
if(verbose==T) cat("\n")
comp_df <- compatible_rows(df,considered_row)
if(verbose==T) assign("last_comp_df",comp_df,envir = .GlobalEnv)
if(verbose==T) cat("compatible rows in df\n")
if(verbose==T) print(comp_df)
if(verbose==T) cat("\n")
if(verbose==T) cat(paste(">>> GOING TO LVL",lvl+1,"\n\n"))
new_rows <- new_groups_list(
comp_df,
prev_df=prev_df0,
lvl=lvl,
verbose=verbose
)
if(verbose==T) cat(paste0("--ROOT (lvl ",lvl,")--\n"))
if(verbose==T) cat("result received from branch\n")
if(verbose==T) print(new_rows)
if(verbose==T) cat("\n")
results_list <- append(results_list,list(new_rows))
if(verbose==T) cat("results list\n")
if(verbose==T) print(results_list)
if(verbose==T) cat("\n")
}
return(results_list)
}
Create create_new_groups, which wraps-up all the other functions, and outputs the whole list of possible solutions
create_new_groups <- function(original_groups, max_output = NULL){
min_len_original_groups = min(lengths(original_groups))
num_original_groups = length(original_groups)
max_len_subgroup <- floor(min_len_original_groups/2)
if(min_len_original_groups<2){
return("Not possible to populate new groups: at least one original group has less than 2 elements")
}
NewGroups_subLen_len_num <- list()
for (len_subgroup in 1:max_len_subgroup){
new_group_params <- c(
len_subgroup,
len_subgroup*num_original_groups,
floor(min_len_original_groups/len_subgroup)
)
NewGroups_subLen_len_num[[len_subgroup]] <- new_group_params
}
out_list <- list()
ind <- 1
for (e in 1:length(NewGroups_subLen_len_num)){
NewGroup_subLen_len_num <- NewGroups_subLen_len_num[[e]]
elem_list <- list()
ind <- 1
# print(ind)
for (o in 1:length(original_groups)){
original_group <- original_groups[[o]]
elem_list[[paste("group",ind)]] <- permutations(original_group,NewGroup_subLen_len_num[1])
ind <- ind+1
}
out_list[[paste(c("subLen","len","numGroups"), NewGroup_subLen_len_num, collapse = " ")]] <- elem_list
}
results_list <- list()
config_test <<- NewGroups_subLen_len_num
for (config_id in 1:length(NewGroups_subLen_len_num)){
config <- NewGroups_subLen_len_num[[config_id]]
perm_grid <- expand.grid(out_list[[config_id]])
perm_grid <- split(perm_grid,1:nrow(perm_grid))
perm_grid <- lapply(perm_grid,unlist)
perm_grid <- lapply(perm_grid,as.character)
perm_grid <- do.call(rbind, perm_grid)
new_groups <- new_groups_list(perm_grid,verbose = F)
new_groups <- find_matrix(new_groups)
# config_test <<- config
new_groups <- new_groups[which(sapply(new_groups, nrow) >= config[3])]
# out_test <<- new_groups
results_list[[config_id]] <- new_groups
}
return(results_list)
}
Given a simple input, like
original_groups <- list(
group_1 = as.character(1:2),
group_2 = as.character(3:4),
group_3 = as.character(5:7)
)
The output of create_new_groups(original_groups) is
> create_new_groups_modified(original_groups)
[[1]]
[[1]][[1]]
[,1] [,2] [,3]
[1,] "1" "3" "5"
[2,] "2" "4" "6"
[[1]][[2]]
[,1] [,2] [,3]
[1,] "1" "3" "5"
[2,] "2" "4" "7"
[[1]][[3]]
[,1] [,2] [,3]
[1,] "2" "3" "5"
[2,] "1" "4" "6"
[[1]][[4]]
[,1] [,2] [,3]
[1,] "2" "3" "5"
[2,] "1" "4" "7"
[[1]][[5]]
[,1] [,2] [,3]
[1,] "1" "4" "5"
[2,] "2" "3" "6"
[[1]][[6]]
[,1] [,2] [,3]
[1,] "1" "4" "5"
[2,] "2" "3" "7"
[[1]][[7]]
[,1] [,2] [,3]
[1,] "2" "4" "5"
[2,] "1" "3" "6"
[[1]][[8]]
[,1] [,2] [,3]
[1,] "2" "4" "5"
[2,] "1" "3" "7"
[[1]][[9]]
[,1] [,2] [,3]
[1,] "1" "3" "6"
[2,] "2" "4" "7"
[[1]][[10]]
[,1] [,2] [,3]
[1,] "2" "3" "6"
[2,] "1" "4" "7"
[[1]][[11]]
[,1] [,2] [,3]
[1,] "1" "4" "6"
[2,] "2" "3" "7"
[[1]][[12]]
[,1] [,2] [,3]
[1,] "2" "4" "6"
[2,] "1" "3" "7"
Moreover, the create_new_groups function also creates a global variable config_test where all possible configurations, for a given list of groups (i.e., original_groups), are stored. For example, for the previous problem, config_test is equal to
> config_test
[[1]]
[1] 1 3 2
So, for this problem only one output configuration is possible, having this structure:
the number of elements taken from each original group and used in each output group is 1
the length of the output groups is 6
the number of output groups (in each possible combination) is 2
Given a slightly more complex example
original_groups <- list(
group_1 = as.character(1:4),
group_2 = as.character(5:8),
group_3 = as.character(9:13)
)
config_test would be equal to
> config_test
[[1]]
[1] 1 3 4
[[2]]
[1] 2 6 2
I made some tests, this method should work for any number of groups, of any length, and the output should always be composed of not-duplicated matrices.
I'm sorry if the explanation is short, if I have time in the following days I'll try to add some notes.
EDIT
A simple way to output only the configurations characterized by a specific number of elements from the original groups is to change the create_new_groups as follows
create_new_groups_modified <- function(original_groups, max_output = NULL, elements_from_original = NULL){
min_len_original_groups = min(lengths(original_groups))
num_original_groups = length(original_groups)
max_len_subgroup <- floor(min_len_original_groups/2)
if(min_len_original_groups<2){
stop("Not possible to populate new groups: at least one original group has less than 2 elements")
}
NewGroups_subLen_len_num <- list()
for (len_subgroup in 1:max_len_subgroup){
new_group_params <- c(
len_subgroup,
len_subgroup*num_original_groups,
floor(min_len_original_groups/len_subgroup)
)
NewGroups_subLen_len_num[[len_subgroup]] <- new_group_params
}
out_list <- list()
ind <- 1
for (e in 1:length(NewGroups_subLen_len_num)){
NewGroup_subLen_len_num <- NewGroups_subLen_len_num[[e]]
elem_list <- list()
ind <- 1
# print(ind)
for (o in 1:length(original_groups)){
original_group <- original_groups[[o]]
elem_list[[paste("group",ind)]] <- permutations(original_group,NewGroup_subLen_len_num[1])
ind <- ind+1
}
out_list[[paste(c("subLen","len","numGroups"), NewGroup_subLen_len_num, collapse = " ")]] <- elem_list
}
results_list <- list()
config_test <<- NewGroups_subLen_len_num
# if `elements_from_original` is not set, output all possible combinations
if(is.null(elements_from_original)){
for (config_id in 1:length(NewGroups_subLen_len_num)){
config <- NewGroups_subLen_len_num[[config_id]]
perm_grid <- expand.grid(out_list[[config_id]])
perm_grid <- split(perm_grid,1:nrow(perm_grid))
perm_grid <- lapply(perm_grid,unlist)
perm_grid <- lapply(perm_grid,as.character)
perm_grid <- do.call(rbind, perm_grid)
new_groups <- new_groups_list(perm_grid,verbose = F)
new_groups <- find_matrix(new_groups)
# config_test <<- config
new_groups <- new_groups[which(sapply(new_groups, nrow) >= config[3])]
# out_test <<- new_groups
results_list[[config_id]] <- new_groups
}
} else {
# if `elements_from_original` is set, check if this is a valid configuration, then output only the matrix having this configuration
config_id <- which(sapply(NewGroups_subLen_len_num,function(x) x[1]==elements_from_original))
if (length(config_id)!=0){
config <- NewGroups_subLen_len_num[[config_id]]
perm_grid <- expand.grid(out_list[[config_id]])
perm_grid <- split(perm_grid,1:nrow(perm_grid))
perm_grid <- lapply(perm_grid,unlist)
perm_grid <- lapply(perm_grid,as.character)
perm_grid <- do.call(rbind, perm_grid)
new_groups <- new_groups_list(perm_grid,verbose = F)
new_groups <- find_matrix(new_groups)
new_groups <- new_groups[which(sapply(new_groups, nrow) >= config[3])]
results_list[[1]] <- new_groups
} else {
stop("value of elements_from_original not available: check config_test to see available configurations")
}
}
return(results_list)
}
The elements_from_original parameter of the function allows to set the number elements from the original groups to consider, and, if applicable, the output will include only matrices following that configuration.
EDIT 2
To output the matrices composed by a specific number of groups
Write a new function select_matrices_by_number_output_groups, that outputs only the matrices with n_output_groups rows
select_matrices_by_number_output_groups <- function(l,n_output_groups){
# Filter out matrices having less rows than `n_output_groups`
out_l <- l[which(
sapply(
l,
# function(x) check_matrix_by_number_output_groups(x,n_output_groups)
function(mtr){
if(nrow(mtr)<n_output_groups) return(F)
else return(T)
}
)
)]
# Cut-off rows from matrices having more rows than `n_output_groups`
out_l <- lapply(
out_l,
function(x) head(x,n_output_groups)
)
# Keep only unique elements (i.e., matrices)
out_l <- unique(out_l)
return(out_l)
}
The update create_new_groups so that it includes the select_matrices_by_number_output_groups function
create_new_groups_modified_2 <- function(original_groups, max_output = NULL, elements_from_original = NULL, n_output_groups = NULL){
min_len_original_groups = min(lengths(original_groups))
num_original_groups = length(original_groups)
max_len_subgroup <- floor(min_len_original_groups/2)
if(min_len_original_groups<2){
stop("Not possible to populate new groups: at least one original group has less than 2 elements")
}
NewGroups_subLen_len_num <- list()
for (len_subgroup in 1:max_len_subgroup){
new_group_params <- c(
len_subgroup,
len_subgroup*num_original_groups,
floor(min_len_original_groups/len_subgroup)
)
NewGroups_subLen_len_num[[len_subgroup]] <- new_group_params
}
out_list <- list()
ind <- 1
for (e in 1:length(NewGroups_subLen_len_num)){
NewGroup_subLen_len_num <- NewGroups_subLen_len_num[[e]]
elem_list <- list()
ind <- 1
# print(ind)
for (o in 1:length(original_groups)){
original_group <- original_groups[[o]]
elem_list[[paste("group",ind)]] <- permutations(original_group,NewGroup_subLen_len_num[1])
ind <- ind+1
}
out_list[[paste(c("subLen","len","numGroups"), NewGroup_subLen_len_num, collapse = " ")]] <- elem_list
}
results_list <- list()
config_test <<- NewGroups_subLen_len_num
# if `elements_from_original` is not set, output all possible combinations
if(is.null(elements_from_original)){
for (config_id in 1:length(NewGroups_subLen_len_num)){
config <- NewGroups_subLen_len_num[[config_id]]
perm_grid <- expand.grid(out_list[[config_id]])
perm_grid <- split(perm_grid,1:nrow(perm_grid))
perm_grid <- lapply(perm_grid,unlist)
perm_grid <- lapply(perm_grid,as.character)
perm_grid <- do.call(rbind, perm_grid)
new_groups <- new_groups_list(perm_grid,verbose = F)
new_groups <- find_matrix(new_groups)
new_groups <- new_groups[which(sapply(new_groups, nrow) >= config[3])]
results_list[[config_id]] <- new_groups
}
} else {
# if `elements_from_original` is set, check if this is a valid configuration, then output only the matrix having this configuration
config_id <- which(sapply(NewGroups_subLen_len_num,function(x) x[1]==elements_from_original))
if (length(config_id)!=0){
config <- NewGroups_subLen_len_num[[config_id]]
perm_grid <- expand.grid(out_list[[config_id]])
perm_grid <- split(perm_grid,1:nrow(perm_grid))
perm_grid <- lapply(perm_grid,unlist)
perm_grid <- lapply(perm_grid,as.character)
perm_grid <- do.call(rbind, perm_grid)
new_groups <- new_groups_list(perm_grid,verbose = F)
new_groups <- find_matrix(new_groups)
new_groups <- lapply(
new_groups,
function(x) {
dimnames(x) <- NULL
return(x)
}
)
if(is.null(n_output_groups)){
new_groups <- new_groups[which(sapply(new_groups, nrow) == config[3])]
} else if (n_output_groups > config[3]){
stop("value n_output_groups higher than max number of new groups for this configuration: check config_test to see available configurations")
} else {
new_groups <- select_matrices_by_number_output_groups(new_groups,n_output_groups)
}
# results_list[[1]] <- new_groups
results_list <- new_groups
} else {
stop("value of elements_from_original not available: check config_test to see available configurations")
}
}
return(results_list)
}
As mentioned in the comment, there are potentially huge numbers of combinations involved here. However, one approach that works (assuming you have enough time/memory) is as follows. This example is for just the first two elements of your original_groups list, allowing two elements from each group. It would be straightforward to generalise the final map to arbitrary numbers of groups, but this is just to illustrate the principle.
The first map sets up a vector of group indices, padded out with NAs to the length of the group (6) (i.e in this case c(1,2,1,2,NA,NA)), and finds all unique permutations of it. cross then combines each option for the first group with each option for the second, and the final map uses these indices to separate the elements into the two groups.
library(combinat) #for permn function
library(tidyverse) #purrr and dplyr
original_groups <- list(group_1 = as.character(1:6), group_2 = as.character(7:12))
no_items <- 2
no_groups <- length(original_groups) #i.e. 2 in this case
output <- map(original_groups,
~unique(permn(`length<-`(rep(seq_len(no_groups),
no_items),
length(.))))) %>%
cross() %>%
map(~list(c(original_groups$group_1[which(.$group_1 == 1)],
original_groups$group_2[which(.$group_2 == 1)]),
c(original_groups$group_1[which(.$group_1 == 2)],
original_groups$group_2[which(.$group_2 == 2)])))
head(output) #full output has 8100 elements
[[1]]
[[1]][[1]]
[1] "1" "3" "7" "9"
[[1]][[2]]
[1] "2" "4" "8" "10"
[[2]]
[[2]][[1]]
[1] "1" "3" "7" "9"
[[2]][[2]]
[1] "2" "5" "8" "10"
[[3]]
[[3]][[1]]
[1] "1" "4" "7" "9"
[[3]][[2]]
[1] "2" "5" "8" "10"
[[4]]
[[4]][[1]]
[1] "1" "4" "7" "9"
[[4]][[2]]
[1] "3" "5" "8" "10"
[[5]]
[[5]][[1]]
[1] "2" "4" "7" "9"
[[5]][[2]]
[1] "3" "5" "8" "10"
[[6]]
[[6]][[1]]
[1] "2" "4" "7" "9"
[[6]][[2]]
[1] "3" "6" "8" "10"
This is the job of expand.grid + combn: will be showing only the first 5 rows:
n <- 1
expand.grid(lapply(original_groups, combn, n, simplify = FALSE))
group_1 group_2 group_3
1 1 7 13
2 2 7 13
3 3 7 13
4 4 7 13
5 5 7 13
when n = 2
n <- 2
expand.grid(lapply(original_groups, combn, n, simplify = FALSE))
group_1 group_2 group_3
1 1, 2 7, 8 13, 14
2 1, 3 7, 8 13, 14
3 1, 4 7, 8 13, 14
4 1, 5 7, 8 13, 14
5 1, 6 7, 8 13, 14
You can write a simple function:
generate_all <- function(lst, n){
expand.grid(lapply(lst, combn, n, simplify = FALSE))
}
head(generate_all(original_groups, 3))
group_1 group_2 group_3
1 1, 2, 3 7, 8, 9 13, 14, 15
2 1, 2, 4 7, 8, 9 13, 14, 15
3 1, 2, 5 7, 8, 9 13, 14, 15
4 1, 2, 6 7, 8, 9 13, 14, 15
5 1, 3, 4 7, 8, 9 13, 14, 15
head(generate_all(original_groups, 4))
group_1 group_2 group_3
1 1, 2, 3, 4 7, 8, 9, 10 13, 14, 15, 16
2 1, 2, 3, 5 7, 8, 9, 10 13, 14, 15, 16
3 1, 2, 3, 6 7, 8, 9, 10 13, 14, 15, 16
4 1, 2, 4, 5 7, 8, 9, 10 13, 14, 15, 16
5 1, 2, 4, 6 7, 8, 9, 10 13, 14, 15, 16
6 1, 2, 5, 6 7, 8, 9, 10 13, 14, 15, 16
The prototype function below works for any number of groups from which some number ("numobs" in the function) of observations have been uniquely drawn. I have added an example with four groups instead of 3 to illustrate.
The function was written in R version 4.2.1 and data.table version 1.14.2 on a windows operating system. The code is rather sophisticated; changing the position of a single comma, or a single compatibility issue, might render the function inoperable.
In the example given by the poster, there are 15 ways of drawing two distinct members from the first two groups (length 6) and 28 ways of drawing two distinct members from the third group (length 8). There should then be 15 * 15 * 28 = 6300 possible combinations.
I have added two other examples to show that this function works for any given number of groups with k objects taken from each group. However, the code speed is affected as the number of groups and objects increases, and will likely become an issue for large numbers of groups or objects.
Suggested Citation:
Harkness, Jeffrey (2022). Customized R combination function. Posted 7/14/2022 on stackoverflow.com
##original example from poster
(original_groups <- list(group_1 = as.character(1:6),
group_2 = as.character(7:12), group_3 = as.character(13:20)))
$group_1
[1] "1" "2" "3" "4" "5" "6"
$group_2
[1] "7" "8" "9" "10" "11" "12"
$group_3
[1] "13" "14" "15" "16" "17" "18" "19" "20"
#testing 4 groups instead of 3
(original_groups2 <- list(group_1 = as.character(1:6),
group_2 = as.character(7:12), group_3 = as.character(13:20), group_4 = as.character(21:24)))
$group_1
[1] "1" "2" "3" "4" "5" "6"
$group_2
[1] "7" "8" "9" "10" "11" "12"
$group_3
[1] "13" "14" "15" "16" "17" "18" "19" "20"
$group_4
[1] "21" "22" "23" "24"
#testing 4 different groups
(original_groups3 <- list(group_1 = as.character(1:4),
group_2 = as.character(5:9), group_3 = as.character(10:16), group_4 = as.character(17:22)))
$group_1
[1] "1" "2" "3" "4"
$group_2
[1] "5" "6" "7" "8" "9"
$group_3
[1] "10" "11" "12" "13" "14" "15" "16"
$group_4
[1] "17" "18" "19" "20" "21" "22"
require(data.table)
# Loading required package: data.table
# data.table 1.14.2 using 4 threads (see ?getDTthreads). Latest news: r-datatable.com
gc <- function(input_ob = original_groups, numgroups = length(input_ob), numobs = 2) {
ansdim <- NULL
for (k in 1:numgroups) {
rowcount <- 1
temp <- as.vector(input_ob[[k]]) # temporary vector for group k
newcombs <- data.frame(combn(temp, numobs)) # data frame of combinations for group k
newcombs <- transpose(newcombs)
newname <- paste0("ansnew", k)
ansdim[k] <- dim(newcombs)[1]
assign(newname, newcombs)
}
### To hold final answer
nm <- matrix(data = 0, nrow = prod(ansdim), ncol = numobs * numgroups, byrow = T)
## All possible combinations of the first two groups
combine <- NULL
nc <- ansdim[1]
nc <- data.frame(CJ(1:nc, 1:ansdim[2])) # instructions for which row numbers to combine
ntemp <- matrix(data = 0, nrow = ansdim[1] * ansdim[2], ncol = numobs * 2, byrow = T)
for (m in 1:dim(ntemp)[1]) {
newrow <- cbind(ansnew1[nc[m, 1], ], ansnew2[nc[m, 2], ])
ntemp[m, ] <- as.matrix(newrow[1, ])
}
fcom <- ntemp ### all combinations of first two groups
## All possible combinations of all groups
for (n in 3:(numgroups)) {
nc <- ansdim[n]
nc <- data.frame(CJ(1:nrow(fcom), 1:ansdim[n])) # instructions for which row numbers to combine
ntemp <- matrix(data = 0, nrow = nrow(fcom) * ansdim[n], ncol = numobs * (n), byrow = T)
for (p in 1:dim(ntemp)[1]) {
frow <- fcom[nc[p, 1], ] ### First part of new row
srow <- as.character(as.vector(get(paste0("ansnew", n))[nc[p, 2], ])) ## second part of new row
newrow <- c(frow, srow)
ntemp[p, ] <- newrow
}
fcom <- ntemp
}
#
nm <- fcom
return(nm)
}
ans1 <- gc(input_ob = original_groups)
ans2 <- gc(input_ob = original_groups2)
ans3 <- gc(input_ob = original_groups3)
dim(ans1);head(ans1)
[1] 6300 6
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "1" "2" "7" "8" "13" "14"
[2,] "1" "2" "7" "8" "13" "15"
[3,] "1" "2" "7" "8" "13" "16"
[4,] "1" "2" "7" "8" "13" "17"
[5,] "1" "2" "7" "8" "13" "18"
[6,] "1" "2" "7" "8" "13" "19"
dim(ans2);head(ans2)
[1] 37800 8
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "1" "2" "7" "8" "13" "14" "21" "22"
[2,] "1" "2" "7" "8" "13" "14" "21" "23"
[3,] "1" "2" "7" "8" "13" "14" "21" "24"
[4,] "1" "2" "7" "8" "13" "14" "22" "23"
[5,] "1" "2" "7" "8" "13" "14" "22" "24"
[6,] "1" "2" "7" "8" "13" "14" "23" "24"
dim(ans3);head(ans3)
[1] 18900 8
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "1" "2" "5" "6" "10" "11" "17" "18"
[2,] "1" "2" "5" "6" "10" "11" "17" "19"
[3,] "1" "2" "5" "6" "10" "11" "17" "20"
[4,] "1" "2" "5" "6" "10" "11" "17" "21"
[5,] "1" "2" "5" "6" "10" "11" "17" "22"
[6,] "1" "2" "5" "6" "10" "11" "18" "19"
###Sample random rows from output to illustrate output is consistent
ans1[sample(nrow(ans1), size = 5, replace = F),]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "1" "3" "8" "12" "13" "17"
[2,] "3" "6" "8" "12" "19" "20"
[3,] "1" "3" "7" "11" "15" "19"
[4,] "1" "2" "9" "11" "17" "20"
[5,] "5" "6" "7" "8" "19" "20"
ans2[sample(nrow(ans2), size = 5, replace = F),]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "2" "3" "8" "11" "13" "15" "22" "24"
[2,] "1" "2" "8" "9" "13" "19" "23" "24"
[3,] "1" "3" "11" "12" "13" "15" "21" "22"
[4,] "1" "4" "9" "10" "14" "16" "21" "22"
[5,] "2" "5" "7" "9" "15" "20" "22" "24"
ans3[sample(nrow(ans3), size = 5, replace = F),]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "3" "4" "6" "7" "10" "15" "21" "22"
[2,] "1" "3" "5" "8" "12" "14" "18" "22"
[3,] "2" "4" "7" "8" "12" "16" "18" "19"
[4,] "2" "3" "5" "9" "11" "13" "17" "21"
[5,] "1" "2" "7" "8" "14" "16" "17" "18"
##shows that the output has no duplicated rows
ans1[duplicated(ans1),]
[,1] [,2] [,3] [,4] [,5] [,6]
In order to show convincingly that all the cases are represented in the output, I wrote a few lines of code that randomly sample two objects from the OP's example groups and finds the corresponding row in the output. This code can easily be looped over.
#The next lines sample from original group and find corresponding row in answer
samp1 = sample(original_groups$group_1,size = 2, replace = F)
samp2 = sample(original_groups$group_2,size = 2, replace = F)
samp3 = sample(original_groups$group_3,size = 2, replace = F)
(samprow = c(samp1, samp2, samp3))
[1] "6" "4" "8" "9" "20" "18"
colnames(ans1) = c("v1","v2","v3","v4","v5","v6")
ans5 = data.frame(ans1)
ans5[ans5$v1 %in% samprow & ans5$v2 %in% samprow &ans5$v3 %in% samprow &ans5$v4 %in% samprow &ans5$v5 %in% samprow &ans5$v6 %in% samprow,]
v1 v2 v3 v4 v5 v6
5627 4 6 8 9 18 20
7/16/2022 Update: Generating all combinations
The following approach was written for the case of 3 groups. The code would require modest modification to output the data in whatever form the user wants. It can be altered to accommodate other group sizes, periodically dump data into an output file to avoid size issues, etc., but I will not do so here for the sake of brevity. Rough calculations suggest that for the example data in "original_groups", it should work in about two hours or less for a reasonably fast laptop. It has been commented out for this reason.
Every unique sampled set (the output from gc()) is represented by a single integer in the code, so they don't have to be copied or repeated over and over. This helps to mitigate data size issues.
The code below writes a data.table where each row gives 3 integers that correspond to three unique sampled sets, like a set of instructions that can be used to construct the final result in whatever form (list, etc.) the user wants with additional code.
After running the code for about 15 minutes, the answer frame had collected 697,152 combinations after removing duplicates. The duplicate removal code is taken from here. All other code is original. Code and example output are shown below.
References:
Stack Overflow commentator Thomas (2014). Accessed 7/16/2022. URL: Deleting reversed duplicates with R
# inans = ans1 #object from gc() function with all combinations
# inans = data.table(inans)
# colvec = colnames(inans)
# inans$index = 1:nrow(inans)
#
# for(i in 1:nrow(inans)){ #For every possible sampled sequence
# sf = NULL;sf2 = NULL;ntemp = NULL; nin = NULL
# ntemp = as.character(as.vector(inans[i,..colvec]))
# nin = inans$index[i]
# sf = inans
# for(k in colvec){ #Remove rows that don't match first element
# sf = sf[!get(k) %in% ntemp]
# }
#
# ntemp2 = NULL; sin = NULL
# for(j in 1:nrow(sf)){ #Loop through all possibilities for element
# ntemp2 = as.character(as.vector(sf[j,..colvec]))
# sin = sf$index[j]
# sf2 = sf
# for(m in colvec){
# sf2 = sf2[!get(m) %in% ntemp2] #Remove rows that don't match second element
# }
# if(dim(sf2)[1] > 0){
# newframe = data.table(nin, sin, sf2$index) #Store row numbers for each combination
# if(i == 1){
# ansframe = newframe
# }else{
# ansframe = rbind(ansframe, newframe) #start process over for next sampled set
# }}}}
#
# ##Remove duplicate groups
# ##Code Source:
# ##https://stackoverflow.com/questions/22756392/deleting-reversed-duplicates-with-r
# ansframe = ansframe[!duplicated(apply(ansframe,1,function(x) paste(sort(x),collapse=''))),]
#
# #input data.table of all possible sample sets
# #6300 total here - made from gc() function above
# #index is an integer to represent each sequence
# head(inans)
# v1 v2 v3 v4 v5 v6 index
# 1: 1 2 7 8 13 14 1
# 2: 1 2 7 8 13 15 2
# 3: 1 2 7 8 13 16 3
# 4: 1 2 7 8 13 17 4
# 5: 1 2 7 8 13 18 5
# 6: 1 2 7 8 13 19 6
#
#
# #data.table that holds instructions to build final output
# #each row gives row numbers for a feasible combination
# head(ansframe)
# nin sin V3
# 1: 1 6300 4046
# 2: 1 6300 4047
# 3: 1 6300 4048
# 4: 1 6300 4051
# 5: 1 6300 4052
# 6: 1 6300 4055
#
# #answer frame size: each row represents a combination
# dim(ansframe)
# [1] 697152 3
#
#
# ###Code to print a selection of valid combinations from the
# ###output instructions given by the code above.
# forind = as.integer(seq(1,nrow(ansframe),length.out = 50))
# for(i in forind){print(inans[as.numeric(ansframe[i])])}
# v1 v2 v3 v4 v5 v6 index
# 1: 1 2 7 8 13 14 1
# 2: 5 6 11 12 19 20 6300
# 3: 3 4 9 10 15 16 4046
# v1 v2 v3 v4 v5 v6 index
# 1: 1 2 7 8 13 15 2
# 2: 3 4 9 11 16 17 4079
# 3: 5 6 10 12 14 20 6257
# v1 v2 v3 v4 v5 v6 index
# 1: 1 2 7 8 13 15 2
# 2: 3 4 9 12 17 20 4113
# 3: 5 6 10 11 18 19 6242
# v1 v2 v3 v4 v5 v6 index
# 1: 1 2 7 8 13 15 2
# 2: 3 4 10 12 14 19 4156
# 3: 5 6 9 11 16 18 6180
# v1 v2 v3 v4 v5 v6 index
# 1: 1 2 7 8 13 15 2
# 2: 3 4 11 12 17 18 4195
# 3: 5 6 9 10 16 20 6154
# v1 v2 v3 v4 v5 v6 index
# 1: 1 2 7 8 13 15 2
# 2: 3 5 9 11 14 17 4490
# 3: 4 6 10 12 16 18 5844
# v1 v2 v3 v4 v5 v6 index
# 1: 1 2 7 8 13 15 2
# 2: 3 5 9 12 16 19 4529
# 3: 4 6 10 11 17 18 5819
# v1 v2 v3 v4 v5 v6 index
# 1: 1 2 7 8 13 15 2
# 2: 3 5 10 11 18 20 4563
# 3: 4 6 9 12 17 19 5792
7/17/2022 Update: All combinations with variable output group size
The following approach was written for the cases of 2,3,or 4 groups. In tested cases, the code ran in the rough ballpark of 15 minutes or so for two groups but can take longer for 4 groups, depending on the size and number of the input groups.
This example shows 4 input groups with 4 output groups and 1 object from each group, but the user can also choose 2 or 3 output groups. The object "numobs" in the gc() function specifies objects to take from each group and the "groupsout" object specifies the number of output groups. The code below found 13,824 combinations in under ten minutes for this example.
(original_groups4 <- list(group_1 = as.character(1:4), group_2 = as.character(5:8), group_3 = as.character(9:12), group_4 = as.character(13:16)))
$group_1
[1] "1" "2" "3" "4"
$group_2
[1] "5" "6" "7" "8"
$group_3
[1] "9" "10" "11" "12"
$group_4
[1] "13" "14" "15" "16"
ans1 = gc(input_ob = original_groups4, numobs = 1)
###Find all combinations
###groupsout specifies the number of output groups
groupsout = 4 #Number of groups in the output: choose 2,3, or 4
inans = ans1 #dev object from gc() function with all combos
inans = data.table(inans) #order data.table by first few columns
colvec = colnames(inans)
inans$index = 1:nrow(inans)
if(groupsout == 2){ #groupsout = 2 case
for(i in 1:nrow(inans)){ #Remove rows that don't match first element
ntemp = as.character(as.vector(inans[i,..colvec]))
nin = inans$index[i]
sf = inans
for(k in colvec){
sf = sf[!get(k) %in% ntemp]
}
if(dim(sf)[1] > 0){
newframe = data.table(nin, sf$index) #Store row numbers for each combination
if(i == 1){
ansframe = newframe
}else{
ansframe = rbind(ansframe, newframe)
}}}}
if(groupsout == 3){ #groupsout = 3 case
for(i in 1:nrow(inans)){ #Remove rows that don't match first element
sf = NULL;sf2 = NULL;ntemp = NULL; nin = NULL
ntemp = as.character(as.vector(inans[i,..colvec]))
nin = inans$index[i]
sf = inans
for(k in colvec){
sf = sf[!get(k) %in% ntemp]
}
ntemp2 = NULL; sin = NULL #Create last group and write to output frame
for(j in 1:nrow(sf)){ #Remove rows that don't match second element
ntemp2 = as.character(as.vector(sf[j,..colvec]))
sin = sf$index[j]
sf2 = sf
for(m in colvec){
sf2 = sf2[!get(m) %in% ntemp2]
}
if(dim(sf2)[1] > 0){
newframe = data.table(nin, sin, sf2$index) #Store row numbers for each combination
if(i == 1){
ansframe = newframe
}else{
ansframe = rbind(ansframe, newframe)
}}}}}
###groupsout = 4 case
if(groupsout == 4){
for(i in 1:nrow(inans)){ #Remove rows that don't match first element
sf = NULL;sf2 = NULL;sf3 = NULL
ntemp = NULL; nin = NULL;fin = NULL
ntemp = as.character(as.vector(inans[i,..colvec]))
nin = inans$index[i]
sf = inans
for(k in colvec){
sf = sf[!get(k) %in% ntemp]
}
####Create next group
ntemp2 = NULL; sin = NULL
for(j in 1:nrow(sf)){ #Remove rows that don't match second element
ntemp2 = as.character(as.vector(sf[j,..colvec]))
sin = sf$index[j]
sf2 = sf
for(m in colvec){
sf2 = sf2[!get(m) %in% ntemp2]
}
####Create last group and write to output frame
fin = NULL;ntemp3 = NULL
for(p in 1:nrow(sf2)){ #Remove rows that don't match third element
ntemp3 = as.character(as.vector(sf2[p,..colvec]))
fin = sf2$index[p]
sf3 = sf2
for(r in colvec){
sf3 = sf3[!get(r) %in% ntemp3]
}
if(dim(sf3)[1] > 0){
newframe = data.table(nin, sin,fin, sf3$index) #Store row numbers
if(i == 1){
ansframe = newframe
}else{
ansframe = rbind(ansframe, newframe)
}}}}}}#end if groupsout = 4
##Remove duplicate groups
ansframe = ansframe[!duplicated(apply(ansframe,1,function(x) paste(sort(x),collapse=''))),]
#for illustration
dim(ansframe)
[1] 13824 4
##View subset of combination output to check for valid results
forind = as.integer(seq(1,nrow(ansframe),length.out = 50))
for(i in forind){print(inans[as.numeric(ansframe[i])])}
V1 V2 V3 V4 index
1: 1 5 9 13 1
2: 4 8 12 16 256
3: 3 7 11 15 171
4: 2 6 10 14 86
V1 V2 V3 V4 index
1: 1 5 9 15 3
2: 2 6 12 16 96
3: 3 7 10 14 166
4: 4 8 11 13 249
V1 V2 V3 V4 index
1: 1 5 9 16 4
2: 2 7 12 14 110
3: 3 6 11 15 155
4: 4 8 10 13 245
V1 V2 V3 V4 index
1: 1 5 10 13 5
2: 2 8 12 14 126
3: 3 7 9 16 164
4: 4 6 11 15 219
V1 V2 V3 V4 index
1: 1 5 10 15 7
2: 2 6 11 16 92
3: 3 8 12 14 190
4: 4 7 9 13 225
V1 V2 V3 V4 index
1: 1 5 10 16 8
2: 2 7 11 15 107
3: 3 6 9 14 146
4: 4 8 12 13 253
V1 V2 V3 V4 index
1: 1 5 11 13 9
2: 2 8 10 15 119
3: 3 6 12 16 160
4: 4 7 9 14 226
If I understand your question correctly, it seems to be a problem regarding all possible partitions of each group by a given size and re-organizing the partitions across all groups to form new collections, and keep only one of the isomorphics. In this case, I guess the key step is to generate all exclusive partitions by size, which seems related to permutation problem.
Since a base R option is preferrable to OP, probably we can try the code below:
build a helper function permM, which generate all permutations of vector x with given group size M
create a function f to produce desired output, i.e., all possible combinations of new groups, where all combinations are stored in a nested list
# generate all permuations of x with given size M for each group
permM <- function(x, M) {
if (length(x) == M) {
return(list(x))
}
S <- combn(x, M, simplify = FALSE)
res <- c()
for (k in seq_along(S)) {
z <- Recall(x[!x %in% S[[k]]], M)
res <- c(res, lapply(z, c, S[[k]]))
}
res
}
# create all possible combinations of new groups
f <- function(lst, K) {
nms <- names(lst)
l <- lapply(lst, combn, m = length(lst) * K)
g <- apply(
expand.grid(lapply(choose(lengths(lst), length(lst) * K), seq)),
1,
function(idx) {
Map(function(p, q) l[[p]][, q], seq_along(idx), unlist(idx))
}
)
x <- do.call(
c,
lapply(
g,
function(v) {
apply(
expand.grid(lapply(v, permM, M = K)),
1,
function(...) {
setNames(
asplit(do.call(rbind, lapply(..., matrix, K)), 2),
nms
)
}
)
}
)
)
# remove the isomorphics but keep one of them only
x[
!duplicated(lapply(
x,
function(v) {
unname(sort(sapply(v, function(z) toString(sort(z)))))
}
))
]
}
Example
Given a smaller data sample lst <- list(grp1 = 1:4, grp2 = 5:9) as the original_group list, we run
r1 <- f(lst,1)
r2 <- f(lst,2)
and we will see a snapshot of result like below
> head(r1)
[[1]]
[[1]]$grp1
[1] 2 6
[[1]]$grp2
[1] 1 5
[[2]]
[[2]]$grp1
[1] 1 6
[[2]]$grp2
[1] 2 5
[[3]]
[[3]]$grp1
[1] 3 6
[[3]]$grp2
[1] 1 5
[[4]]
[[4]]$grp1
[1] 1 6
[[4]]$grp2
[1] 3 5
[[5]]
[[5]]$grp1
[1] 4 6
[[5]]$grp2
[1] 1 5
[[6]]
[[6]]$grp1
[1] 1 6
[[6]]$grp2
[1] 4 5
and
> head(r2)
[[1]]
[[1]]$grp1
[1] 3 4 7 8
[[1]]$grp2
[1] 1 2 5 6
[[2]]
[[2]]$grp1
[1] 2 4 7 8
[[2]]$grp2
[1] 1 3 5 6
[[3]]
[[3]]$grp1
[1] 2 3 7 8
[[3]]$grp2
[1] 1 4 5 6
[[4]]
[[4]]$grp1
[1] 1 4 7 8
[[4]]$grp2
[1] 2 3 5 6
[[5]]
[[5]]$grp1
[1] 1 3 7 8
[[5]]$grp2
[1] 2 4 5 6
[[6]]
[[6]]$grp1
[1] 1 2 7 8
[[6]]$grp2
[1] 3 4 5 6

Why does `ave` with `table` return character when first argument is character?

Consider two vectors v1and v2,
v1 <- c(3, 3, 3, 3, 2, 2, 2, 1, 1)
v2 <- as.character(v1)
where their tables give identical numerical output.
table(v1)
# v1
# 1 2 3
# 2 3 4
table(v2)
# v1
# 1 2 3
# 2 3 4
Now, aveing with numerics as first argument gives "numeric":
ave(v1, v1, FUN=table)
# [1] 4 4 4 4 3 3 3 2 2
ave(v1, v2, FUN=table)
# [1] 4 4 4 4 3 3 3 2 2
Whereas character as first argument gives "character":
ave(v2, v1, FUN=table)
# [1] "4" "4" "4" "4" "3" "3" "3" "2" "2"
ave(v2, v2, FUN=table)
# [1] "4" "4" "4" "4" "3" "3" "3" "2" "2"
Documentation of ave says:
Value
A numeric vector, say y of length length(x). [...]
For me that means it should always return "numeric".
Is this a bug or a feature?

how can I find the all the numeric characters in a specific column in a metrix and printing them?

how can I find the all the numeric characters in a specific column in a metrix and printing them?
for example this list:
dat <- matrix(c(1,"a","b", 11,12,13), nrow = 2, ncol = 3, byrow = TRUE,
dimnames = list(c("row1", "row2"),
c("C.1", "C.2", "C.3")))
dat
C.1 C.2 C.3
row1 "1" "a" "b"
row2 "11" "12" "13"
We can use grep.
> grep("\\d+", c(dat), value=TRUE)
[1] "1" "11" "12" "13"
If you want the location of each element in the matrix they come from, then you can use:
> num <- grep("\\d+", c(dat), value=TRUE)
> positions <- sapply(num, function(x) which(dat == x, arr.ind = TRUE))
> rownames(positions) <- c("row", "col")
> positions
1 11 12 13
row 1 2 2 2
col 1 1 2 3
It tells you number 1 is in row 1, col 1 in matrix dat. Number 11 is in row 2, col 1 in dat.
Convert it to a vector, and then remove all NAs. This works because the conversion puts non-numeric data to NA.
v <- as.numeric(dat)
v[!is.na(v)]
[1] 1 11 12 13

count of records within levels of a factor

I am trying to populate a field in a table (or create a separate vector altogether, whichever is easier) with consecutive numbers from 1 to n, where n is the total number of records that share the same factor level, and then back to 1 for the next level, etc. That is, for a table like this
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
the result should be a new column (e.g. "sample") as follows:
sample<-c(1,2,3,4,1,2,3,1,2,3,4,1,2)
You can get it as follows, using ave:
data <- data.frame(data)
new <- ave(rep(1,nrow(data)),data$data,FUN=cumsum)
all.equal(new,sample) # check if it's right.
You can use rle function together with lapply :
sample <- unlist(lapply(rle(data[,1])$lengths,FUN=function(x){1:x}))
data <- cbind(data,sample)
Or even better, you can combine rle and sequence in the following one-liner (thanks to #Arun suggestion)
data <- cbind(data,sequence(rle(data[,1])$lengths))
> data
[,1] [,2]
[1,] "A" "1"
[2,] "A" "2"
[3,] "A" "3"
[4,] "A" "4"
[5,] "B" "1"
[6,] "B" "2"
[7,] "B" "3"
[8,] "C" "1"
[9,] "C" "2"
[10,] "C" "3"
[11,] "C" "4"
[12,] "D" "1"
[13,] "D" "2"
There are lots of different ways of achieving this, but I prefer to use ddply() from plyr because the logic seems very consistent to me. I think it makes more sense to be working with a data.frame (your title talks about levels of a factor):
dat <- data.frame(ID = c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)))
library(plyr)
ddply(dat, .(ID), summarise, sample = 1:length(ID))
# ID sample
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 B 1
# 6 B 2
# 7 B 3
# 8 C 1
# 9 C 2
# 10 C 3
# 11 C 4
# 12 D 1
# 13 D 2
My answer:
sample <- unlist(lapply(levels(factor(data)), function(x)seq_len(sum(factor(data)==x))))
factors <- unique(data)
f1 <- length(which(data == factors[1]))
...
fn <- length(which(data == factors[length(factors)]))
You can use a for loop or 'apply' family to speed that part up.
Then,
sample <- c(1:f1, 1:f2, ..., 1:fn)
Once again you can use a for loop for that part. Here is the full script you can use:
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
factors <- unique(data)
f <- c()
for(i in 1:length(factors)) {
f[i] <- length(which(data == factors[i]))
}
sample <- c()
for(i in 1:length(f)) {
sample <- c(sample, 1:f[i])
}
> sample
[1] 1 2 3 4 1 2 3 1 2 3 4 1 2

select row lowest values using two matrices

If I have this matrix (mat):
set.seed(140213)
mat <- matrix(runif(16,0,1),nrow = 4)
colnames(mat) <- 1:4
rownames(mat) <- 5:8
#> mat
# 1 2 3 4
#5 0.1120015 0.01454408 0.3411633 0.3456254
#6 0.5709174 0.70443202 0.9114756 0.9157580
#7 0.1500032 0.40889119 0.6231543 0.9736331
#8 0.9773827 0.45136413 0.9706694 0.5022132
I can get the of the two lowest columns with for each row with:
namesmat <- t(apply(mat, 1, function(x)
head(names(x)[order(x, decreasing = FALSE)], 2)))
# [,1] [,2]
#5 "2" "1"
#6 "1" "2"
#7 "1" "2"
#8 "2" "4"
Now my question is:
If I have another matrix (mat2)
set.seed(14022013) ; mat2 <- matrix(runif(16,0,1),nrow = 4)
How can I get the lowest two rows for each column as I did before for mat2 but ignoring the columns I already selected from mat.
E.g. If in mat2 the highest columns for row 5 were cols 3 & 1 but would have to choose the nest highest. If the next highest was 'col 2' I would have to choose the next highest. Let me know if this is unclear.
My head wants to to some sort of paried apply using mat2 and namesmat like:
t(apply(mat, 1, function(x) head(
names(x)[order(x[! names(x) == apply(namesmat,1,c)] ,
decreasing = FALSE)], 2)))
I will be doing this over many mats so that. mat2 selection depends on the mat selection. Then I wil have a mat3 and the selection of lowest columns per row will depend on the selection of mat and mat2 cobined, so ignoring all rows already selected for mat and mat2 continuing on for several mats.
Obviously in this case there are only 4 columns so i can only do this sequence twice.
May be bit convoluted, but gets work done. I have used simpler matrices to demonstrate whats going on..
> mat1 <- matrix(c(1,2,3,4,4,3,2,1), nrow=2 , byrow=T)
> mat2 <- matrix(c(5,6,7,8,8,7,6,5), nrow=2 , byrow=T)
> colnames(mat1) <- c("A", "B", "C", "D")
> colnames(mat2) <- c("A", "B", "C", "D")
> mat1
A B C D
[1,] 1 2 3 4
[2,] 4 3 2 1
> mat2
A B C D
[1,] 5 6 7 8
[2,] 8 7 6 5
> mymax <- function(x) {
+ names( x[order(x)] )[1:2]
+ }
> mymax2 <- function(x,y) {
+ z <- names( x[order(x)] )
+ z[!z %in% y][1:2]
+ }
> namesmat <- t(apply(mat1, 1, mymax))
> namesmat
[,1] [,2]
[1,] "A" "B"
[2,] "D" "C"
> namesmat2 <- t(sapply(1:nrow(mat2), function(i) mymax2(mat2[i,], namesmat[i,])))
> namesmat2
[,1] [,2]
[1,] "C" "D"
[2,] "B" "A"

Resources