Creating all combinations of sampling from two groups of columns in R

I have the dataframe below, which contains two groups of columns: A & B and D & E. I would like to find all combinations of filters on these columns, choosing only one column from each group at a time, and then group by those combinations. I don't know the right approach for this, and the real problem is much bigger.
df=
Size A B D E
1 1 1 0 0
5 0 0 1 0
10 1 1 1 0
3 1 0 0 0
2 1 1 1 1
55 0 0 0 1
5 1 0 1 1
2 0 0 1 1
1 1 1 1 1
4 1 1 1 0
So the combinations to filter should be
Filter 1: A=1 AND D=1
Filter 2: A=1 AND D=0
Filter 3: A=1 AND E=1
Filter 4: A=1 AND E=0
Filter 5: A=0 AND D=1
Filter 6: A=0 AND D=0
Filter 7: A=0 AND E=1
Filter 8: A=0 AND E=0
Filter 9: B=1 AND D=1
Filter 10: B=1 AND D=0
Filter 11: B=1 AND E=1
Filter 12: B=1 AND E=0
Filter 13: B=0 AND D=1
Filter 14: B=0 AND D=0
Filter 15: B=0 AND E=1
Filter 16: B=0 AND E=0
I want an efficient way to create these filter groups (always drawing one filter column from A & B and one from D & E) and then compute the count and mean of the Size column for each filter setting. I only managed to do this without treating the columns as two separate groups to draw from.
What I tried looks like this:
groupNames <- names(df)[2:5]
myGroups <- Map(combn, list(groupNames), seq_along(groupNames), simplify = FALSE) %>%
  unlist(recursive = FALSE)
results <- lapply(myGroups, FUN = function(x) {
  do.call(what = group_by_, args = c(list(df), x)) %>%
    summarise(n = length(Size), avgVar1 = mean(Size))
})
It treats the four columns equally and does not draw from the two groups separately. What could I change in the code to make this work?
Thank you very much.

library(tidyverse)

df <- tribble(~Size, ~A, ~B, ~D, ~E,
              1, "1", "1", "0", "0",
              5, "0", "0", "1", "0",
              10, "1", "1", "1", "0",
              3, "1", "0", "0", "0",
              2, "1", "1", "1", "1",
              55, "0", "0", "0", "1",
              5, "1", "0", "1", "1",
              2, "0", "0", "1", "1",
              1, "1", "1", "1", "1",
              4, "1", "1", "1", "0")

p <- function(...) paste0(...) # for legibility, should rather use glue

all_filtering_groups <- list(c("A", "B"), c("D", "E")) # assuming these are known
# combinations of 1, 2, ... filtering groups
all_combns <- map(seq_along(all_filtering_groups), ~ combn(all_filtering_groups, .))

res <- vector("list", length(all_combns))
# microbenchmark::microbenchmark({
for (comb_length in seq_along(all_combns)) {
  res[[comb_length]] <- vector("list", ncol(all_combns[[comb_length]]))
  for (col_i in seq_len(ncol(all_combns[[comb_length]]))) {
    filtering_groups <- all_combns[[comb_length]][, col_i]
    group_names <- as.character(seq_along(filtering_groups))
    # prepare grid of all combinations: which column is drawn from each group,
    # and which value (0/1) it is filtered on;
    # stringsAsFactors = FALSE keeps the vars_* columns as character so they
    # can be used to index df by name
    filtering_combs <- c(filtering_groups, rep(list(0:1), length(filtering_groups)))
    names(filtering_combs) <- c(p("vars_", group_names), p("vals_", group_names))
    full_grid <- expand.grid(filtering_combs, stringsAsFactors = FALSE)
    for (ll in 1:nrow(full_grid)) { # for each line in the full_grid
      # find the df lines that match this filter setting
      cond <- rep(TRUE, nrow(df))
      for (grp in group_names) {
        cond <- cond & df[[full_grid[p("vars_", grp)][ll, ]]] == full_grid[p("vals_", grp)][ll, ]
      }
      # and compute whatever summaries are needed
      full_grid$lines[ll] <- paste(which(cond), collapse = ", ") # for visual verification
      full_grid$n[ll] <- length(df$Size[cond])
      full_grid$sum[ll] <- sum(df$Size[cond])
      full_grid$mean[ll] <- mean(df$Size[cond])
    }
    res[[comb_length]][[col_i]] <- full_grid
  }
}
# }, times = 10) # microbenchmark

bind_rows(res) %>%
  relocate(starts_with("vars") | starts_with("vals"))
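For comparison, the same one-column-per-group idea can be written more compactly. The sketch below is an addition, not part of the answer above; it assumes the tidyverse is loaded (tidyr >= 1.0 for expand_grid) and reuses the two groups and the df defined above:
# Sketch only: enumerate every (column from group 1, column from group 2,
# value, value) combination, then count and average Size for each filter.
filter_grid <- expand_grid(var1 = c("A", "B"), var2 = c("D", "E"),
                           val1 = c("0", "1"), val2 = c("0", "1"))

filter_grid %>%
  mutate(stats = pmap(list(var1, var2, val1, val2), function(v1, v2, x1, x2) {
    rows <- df[[v1]] == x1 & df[[v2]] == x2
    tibble(n = sum(rows), mean_size = mean(df$Size[rows]))
  })) %>%
  unnest(stats)
This yields one row per filter setting, i.e. the 16 combinations listed in the question.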

Following the discussion in the comments, I think we can treat the groups as variables. So we need to reshape the dataframe to have one column per factor, then we can use standard tidyverse approaches. I'm assuming the groups are defined by the column names (A1...Ak, B1...Bk, ...).
library(tidyverse)

df <- tribble(~Size, ~A1, ~A2, ~B1, ~B2,
              1, "1", "1", "0", "0",
              5, "0", "0", "1", "0",
              10, "1", "1", "1", "0",
              3, "1", "0", "0", "0",
              2, "1", "1", "1", "1",
              55, "0", "0", "0", "1",
              5, "1", "0", "1", "1",
              2, "0", "0", "1", "1",
              1, "1", "1", "1", "1",
              4, "1", "1", "1", "0")

# collapse the columns of one group into a single comma-separated label
get_levels <- function(col){
  paste(names(col)[col == "1"], collapse = ",")
}

# Rewrite with groups as factors
df_factors <- df %>%
  mutate(id = row_number()) %>% # to avoid aggregating rows with the same Size
  nest(A = starts_with("A"), B = starts_with("B")) %>%
  mutate(A = factor(map_chr(A, get_levels)),
         B = factor(map_chr(B, get_levels)))

# Now look at factor combinations
df_factors %>%
  group_by(A, B) %>%
  summarize(n = n(),
            mean = mean(Size))
# A tibble: 8 x 4
# Groups:   A [3]
#   A       B           n  mean
#   <fct>   <fct>   <int> <dbl>
# 1 ""      "B1"        1   5
# 2 ""      "B1,B2"     1   2
# 3 ""      "B2"        1  55
# 4 "A1"    ""          1   3
# 5 "A1"    "B1,B2"     1   5
# 6 "A1,A2" ""          1   1
# 7 "A1,A2" "B1"        2   7
# 8 "A1,A2" "B1,B2"     2   1.5
I named "A" and "B" explicitly. That still seems doable with 6 groups; with more, it would become necessary to automate it, but I'm not sure how to do that easily.
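For what it is worth, here is a rough sketch of how that step could be automated for an arbitrary number of groups. This is an addition, not part of the answer above; it assumes each group shares a letter prefix (A1...Ak, B1...Bk, ...), dplyr >= 1.0 for across(), and reuses the df defined above:
# Sketch: build one collapsed factor column per prefix in a loop instead of
# naming "A" and "B" explicitly; prefixes could also be derived from names(df).
prefixes <- c("A", "B")

df_factors <- df %>% mutate(id = row_number())
for (pref in prefixes) {
  cols <- grep(paste0("^", pref), names(df), value = TRUE)
  df_factors[[pref]] <- factor(
    apply(df_factors[cols] == "1", 1, function(hit) paste(cols[hit], collapse = ","))
  )
  df_factors <- df_factors[setdiff(names(df_factors), cols)]
}

df_factors %>%
  group_by(across(all_of(prefixes))) %>%
  summarize(n = n(), mean = mean(Size), .groups = "drop")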

Related

Make a new column with values according to other columns - in R

Taking this dummy data for example:
df <- structure(list(
  Metastasis_Brain = c("1", "1", "0", "1", "0", "0"),
  Metastasis_Liver = c("0", "0", "1", "1", "1", "0"),
  Metastasis_Bone  = c("1", "1", "0", "1", "1", "0")
), class = "data.frame", row.names = c(
  "Patient_1", "Patient_2", "Patient_3", "Patient_4", "Patient_5", "Patient_6"
))
Example of what I'm searching for: If there is 1 in columns Metastasis_Brain and Metastasis_Liver, the new column will contain "Brain, Liver".
If all three tissues are 1, then that row in the new column will contain "Brain, Liver, Bone".
If all are 0, then it doesn't matter, NA would be fine.
Using tidyverse:
library(tidyverse)

df %>%
  rownames_to_column() %>%
  left_join(pivot_longer(., -rowname, names_prefix = '.*_') %>%
              filter(value > 0) %>%
              group_by(rowname) %>%
              summarise(nm = toString(name)))
rowname Metastasis_Brain Metastasis_Liver Metastasis_Bone nm
1 Patient_1 1 0 1 Brain, Bone
2 Patient_2 1 0 1 Brain, Bone
3 Patient_3 0 1 0 Liver
4 Patient_4 1 1 1 Brain, Liver, Bone
5 Patient_5 0 1 1 Liver, Bone
6 Patient_6 0 0 0 <NA>
In base R you could do:
aggregate(ind ~ rn,
          subset(transform(stack(df), ind = sub('.*_', '', ind), rn = rownames(df)),
                 values > 0),
          toString)
rn ind
1 Patient_1 Brain, Bone
2 Patient_2 Brain, Bone
3 Patient_3 Liver
4 Patient_4 Brain, Liver, Bone
5 Patient_5 Liver, Bone
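Note that this aggregate() call drops Patient_6 (all zeros). If the all-zero rows should be kept with NA, one option (a sketch, not part of the answer above) is to merge the result back onto the row names:
res <- aggregate(ind ~ rn,
                 subset(transform(stack(df), ind = sub('.*_', '', ind), rn = rownames(df)),
                        values > 0),
                 toString)
# left-join onto all patients so the all-zero rows come back as NA
merge(data.frame(rn = rownames(df)), res, all.x = TRUE)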
Base R:
df <- data.frame(
  stringsAsFactors = FALSE,
  row.names = c("Patient_1", "Patient_2", "Patient_3", "Patient_4", "Patient_5", "Patient_6"),
  Metastasis_Brain = c("1", "1", "0", "1", "0", "0"),
  Metastasis_Liver = c("0", "0", "1", "1", "1", "0"),
  Metastasis_Bone  = c("1", "1", "0", "1", "1", "0"),
  # expected result column, recomputed from scratch just below
  res = c("Brain, Bone", "Brain, Bone", "Liver", "Brain, Liver, Bone", "Liver, Bone", NA)
)

df$res <- sapply(apply(df, 1, function(x) gsub("Metastasis_", "", names(df)[x == 1])), toString)
df
#> Metastasis_Brain Metastasis_Liver Metastasis_Bone res
#> Patient_1 1 0 1 Brain, Bone
#> Patient_2 1 0 1 Brain, Bone
#> Patient_3 0 1 0 Liver
#> Patient_4 1 1 1 Brain, Liver, Bone
#> Patient_5 0 1 1 Liver, Bone
#> Patient_6 0 0 0 NA
Created on 2022-06-20 by the reprex package (v2.0.1)

Intricate variable generation with conditionals against multiple factor variables in R

I'm trying to generate a new variable using multiple conditionals that are evaluated against factor variables.
So, let's say I have this data.frame of factor variables:
x<-c("1", "2", "1","NA", "1", "2", "NA", "1", "2", "2", "NA" )
y<-c("1","NA", "2", "1", "1", "NA", "2", "1", "2", "1", "1" )
z<-c("1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3")
w<- c("01", "02", "03", "04","05", "06", "07", "01", "02", "03", "04")
df<-data.frame(x,y,z,w)
df$x<-as.factor(df$x)
df$y<-as.factor(df$y)
df$z<-as.factor(df$z)
df$w<-as.factor(df$w)
str(df)
So I need to get a new column v in my dataframe, which takes the value 1, 0 or NA according to the following conditionals:
Takes value 1 if: x = "1", y = "1", z = "1" or "2", w = "01" to "06"
Takes value 0 if it doesn't meet at least one of the conditionals.
Takes value NA if any of x, y, z, or w is NA.
I had tried using a pipe %>% along with mutate and case_when but have been unable to make it work.
So my desired result would be a new column v in df which would look like this:
[1] 1 NA 0 NA 1 NA NA 0 0 0 NA
Here I also use mutate with case_when. Since the NA in your dataset is the character "NA" (a literal string "NA"), we cannot use a function like is.na() to identify it. I would recommend changing it to "real" NA (by removing the double quotes in your input).
As I've pointed out in the comment, I'm not sure why the eighth entry is "1" when the corresponding z is not "1" or "2".
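As an aside, converting those literal "NA" strings to real missing values is straightforward; a minimal sketch, assuming dplyr >= 1.0 for across() and the column names from the question:
library(dplyr)
# turn the "NA" strings into real NA so that is.na() works as usual
df_clean <- df %>%
  mutate(across(c(x, y, z, w), ~ na_if(as.character(.x), "NA")))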
library(dplyr)

df %>%
  mutate(v = case_when(
    x == "1" & y == "1" & z %in% c("1", "2") & w %in% paste0(0, seq(1:6)) ~ "1",
    x == "NA" | y == "NA" | z == "NA" | w == "NA" ~ NA_character_,
    TRUE ~ "0"
  ))
x y z w v
1 1 1 1 01 1
2 2 NA 2 02 <NA>
3 1 2 3 03 0
4 NA 1 4 04 <NA>
5 1 1 1 05 1
6 2 NA 2 06 <NA>
7 NA 2 3 07 <NA>
8 1 1 4 01 0
9 2 2 1 02 0
10 2 1 2 03 0
11 NA 1 3 04 <NA>

How to format data from excel containing two rows of column headers to be able to use in R?

I am importing the following table (table 1) into R but am struggling with the formatting, as each column has two header rows. My desired output is the second table (table 2). I plan to use tidyr to gather the data.
Another obstacle is the merged cells; I have been using fillMergedCells = TRUE to fill in the duplicated values.
read.xlsx(xlsxFile ="C:/Users/X/X/Desktop/X.xlsx",fillMergedCells = TRUE)
One option would be to:
1. Read your excel file with the option colNames = FALSE.
2. Paste the first two rows together and use the result as the column names. Here I use an underscore as the separator, which makes it easy to split the names later on.
3. Get rid of the first two rows.
4. Use tidyr::pivot_longer to convert to long format.
# df <- openxlsx::read.xlsx(xlsxFile = "data/test2.xlsx", fillMergedCells = TRUE, colNames = FALSE)

# Use first two rows as names
names(df) <- paste(df[1, ], df[2, ], sep = "_")
names(df)[1] <- "category"

# Get rid of first two rows and columns containing year average
df <- df[-c(1:2), ]
df <- df[, !grepl("^Year", names(df))]

library(tidyr)
library(dplyr)

df %>%
  pivot_longer(-category, names_to = c("Time", ".value"), names_pattern = "^(.*?)_(.*)$") %>%
  arrange(Time)
#> # A tibble: 16 × 4
#> category Time Y Z
#> <chr> <chr> <chr> <chr>
#> 1 Total Feb-21 1 1
#> 2 A Feb-21 2 2
#> 3 B Feb-21 3 3
#> 4 C Feb-21 4 4
#> 5 D Feb-21 5 5
#> 6 E Feb-21 6 6
#> 7 F Feb-21 7 7
#> 8 G Feb-21 8 8
#> 9 Total Jan-21 1 1
#> 10 A Jan-21 2 2
#> 11 B Jan-21 3 3
#> 12 C Jan-21 4 4
#> 13 D Jan-21 5 5
#> 14 E Jan-21 6 6
#> 15 F Jan-21 7 7
#> 16 G Jan-21 8 8
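Since the pasted names that remain after dropping the year-average columns ("Jan-21_Y", "Jan-21_Z", ...) contain exactly one underscore, names_sep = "_" would work here as well and is a bit simpler than the regex; a sketch using the same df as above:
df %>%
  pivot_longer(-category, names_to = c("Time", ".value"), names_sep = "_") %>%
  arrange(Time)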
DATA
df <- structure(list(
  X1 = c(NA, NA, "Total", "A", "B", "C", "D", "E", "F", "G"),
  X2 = c("Year Rolling Avg.", "Share", NA, "1", "1", "1", "1", "1", "1", "1"),
  X3 = c("Year Rolling Avg.", "Y", "1", "2", "3", "4", "5", "6", "7", "8"),
  X4 = c("Year Rolling Avg.", "Z", "1", "2", "3", "4", "5", "6", "7", "8"),
  X5 = c("Jan-21", "Y", "1", "2", "3", "4", "5", "6", "7", "8"),
  X6 = c("Jan-21", "Z", "1", "2", "3", "4", "5", "6", "7", "8"),
  X7 = c("Feb-21", "Y", "1", "2", "3", "4", "5", "6", "7", "8"),
  X8 = c("Feb-21", "Z", "1", "2", "3", "4", "5", "6", "7", "8")
), row.names = c(NA, 10L), class = "data.frame")

How to subset a large data set with multiple conditions?

My full data set is larger but I have this reproducible sample:
Data <- structure(list(
  ID = c("121", "122", "123", "124"),
  Var1P = c("3", "1", "3", "3"), Var1C = c("1", "3", "3", "1"),
  Var2P = c("1", "1", "1", "1"), Var2P = c("1", "1", "1", "1"),
  Var3P = c("1", "1", "1", "1"), Var3C = c("1", "1", "1", "1"),
  Var4P = c("1", "1", "1", "1"), Var4C = c("1", "3", "1", "1"),
  Var5P = c("1", "1", "3", "1"), Var5C = c("1", "1", "1", "1"),
  Var6P = c("1", "1", "1", "1"), Var6C = c("1", "1", "1", "1"),
  Var7P = c("1", "1", "1", "1"), Var7C = c("1", "1", "1", "1"),
  Var8 = c("0", "1", "1", "1")
), row.names = c(84L, 150L, 271L, 303L), class = "data.frame")
I want to subset the data so that only observations with a score of 3 under Var1P or Var1C, and a score of 2, 1, or 0 in all the other columns, are kept. I have tried the simple subset function:
Data <- subset(Data, Var1P == 3 | Var1C == 3)
But how can I make this argument more complex, so that R also removes entries with a score of 3 under any of the other columns?
I thought simply using the following code would work:
Data <- subset(Data, Var1P == 3 | Var1C == 3 & 4:16 == 1 | 4:16 == 0)
It doesn't, because R would then be looking at the row numbers, I think. I don't want to type out all of the column names because, as I said, my full data frame is much larger. I am also trying to avoid loops.
You can divide the data into two sets of columns: select_cols are the columns where you want to select rows with a 3 in them, and remove_cols are the remaining columns.
We can then select rows with rowSums, keeping rows where select_cols contains a 3 but remove_cols does not.
select_cols <- c('Var1P', 'Var1C')
remove_cols <- setdiff(names(Data), select_cols)
Data[rowSums(Data[select_cols] == 3) > 0 & rowSums(Data[remove_cols] == 3) == 0, ]
# ID Var1P Var1C Var2P Var2P Var3P Var3C Var4P Var4C Var5P Var5C Var6P Var6C Var7P Var7C Var8
#84 121 3 1 1 1 1 1 1 1 1 1 1 1 1 1 0
#303 124 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
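For reference, the same selection can be written with dplyr's if_any()/if_all(). This is an alternative sketch (assuming dplyr >= 1.0.4), not part of the answer above; note that the sample data has a duplicated Var2P name, which dplyr refuses, so the names are repaired first:
library(dplyr)
# the sample has Var2P twice, which dplyr will not accept, so repair names first
Data2 <- as_tibble(Data, .name_repair = "unique")
sel <- c("Var1P", "Var1C")
rem <- setdiff(names(Data2), c(sel, "ID"))
Data2 %>%
  filter(if_any(all_of(sel), ~ .x == 3),  # a 3 in Var1P or Var1C
         if_all(all_of(rem), ~ .x != 3))  # and no 3 in any other scored column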

Error when subsetting based on adjusted values of different data frame in R

I am asking a side question about the method I learned here from @redmode:
Subsetting based on values of a different data frame in R
When I try to dynamically adjust the level I want to subset by:
N <- nrow(A)
cond <- sapply(3:N, function(i) sum(A[i,] > 0.95*B[i,])==2)
rbind(A[1:2,], subset(A[3:N,], cond))
I get an error:
Error in FUN(left, right) : non-numeric argument to binary operator.
Can you think of a way I can get rows pertaining to values in A that are greater than 95% of the value in B? Thank you.
Here is code for A and B to play with.
A <- structure(list(
  name1 = c("trt", "0", "1", "10", "1", "1", "10"),
  name2 = c("ctrl", "3", "1", "1", "1", "1", "10")
), .Names = c("name1", "name2"),
   row.names = c("cond", "hour", "A", "B", "C", "D", "E"), class = "data.frame")

B <- structure(list(
  name1 = c("trt", "0", "1", "1", "1", "1", "9.4"),
  name2 = c("ctrl", "3", "1", "10", "1", "1", "9.4")
), .Names = c("name1", "name2"),
   row.names = c("cond", "hour", "A", "B", "C", "D", "E"), class = "data.frame")
You have some serious formatting issues with your data.
First, columns should be of the same data type and rows should be observations (not always true, but a very good way to start). Here you have a row called cond, then a row called hour, then a series of classifications, I'm guessing. The way your data is presented to begin with doesn't make much sense and doesn't lend itself to easy manipulation. But all is not lost. This is what I would do:
Reorganize the data:
C <- data.frame(matrix(as.numeric(unlist(A)), ncol=2)[-(1:2), ])
colnames(C) <- c('A.trt', 'A.cntr')
rownames(C) <- LETTERS[1:nrow(C)]
D <- data.frame(matrix(as.numeric(unlist(B)), ncol=2)[-(1:2), ])
colnames(D) <- c('B.trt', 'B.cntr')
(df <- cbind(C, D))
Which gives:
# A.trt A.cntr B.trt B.cntr
# A 1 1 1.0 1.0
# B 10 1 1.0 10.0
# C 1 1 1.0 1.0
# D 1 1 1.0 1.0
# E 10 10 9.4 9.4
Then your problem is easily solved by:
df[which(df[, 1] > 0.95*df[, 3] & df[, 2] > 0.95*df[, 4]), ]
# A.trt A.cntr B.trt B.cntr
# A 1 1 1.0 1.0
# C 1 1 1.0 1.0
# D 1 1 1.0 1.0
# E 10 10 9.4 9.4
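More generally, the same comparison can be vectorised over all columns at once instead of spelling out each pair; a sketch (my addition) using the C, D and df objects built above:
# keep rows where every value in C exceeds 0.95 times the matching value in D
keep <- rowSums(as.matrix(C) > 0.95 * as.matrix(D)) == ncol(C)
df[keep, ]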
