Create new columns from multiple conditions across multiple columns - case

I need to create 2 new columns based on the conditions across 7 columns for over a million rows. Sample data is below.
I need newcol1 to return 1 if R == 1 and at least one of the 6 columns D:N == 1, and 0 otherwise. If R == 0, newcol1 is 0 regardless of the D:N values.
I need newcol2 to return 0, 1, 2, 3, 4, 5, 6, or 7, depending on R and on which of the columns D:N == 1:
#R==1 & D==1 ~1
#R==1 & M==1 ~2
#R==1 & P==1 ~3
#R==1 & A==1 ~4
#R==1 & S==1 ~5
#R==1 & N==1 ~6
#R==1 & D:N has more than 1 column with a 1 ~7
#R==0 & no matter what is in column D:N ~0
#no matter what R== if all D:N ==0 ~0
Sample Data
Id D M P A S N R
1 0 0 0 0 0 0 0
2 1 1 1 1 1 1 0
3 1 1 1 1 1 1 1
4 1 1 0 0 0 0 1
5 1 0 0 0 0 0 1
6 1 0 1 0 0 0 1
7 1 0 0 1 0 0 1
8 1 0 0 0 1 0 1
9 1 0 0 0 0 1 1
10 0 1 0 0 0 0 1
11 0 0 1 0 0 0 1
12 0 0 0 1 0 0 1
13 0 0 0 0 1 0 1
14 0 0 0 0 0 1 1
15 0 1 1 1 1 1 1
16 0 0 0 0 0 0 1
17 1 0 0 0 0 0 1
18 1 0 1 1 0 0 1
19 0 1 0 0 0 1 1
20 0 0 1 0 1 0 1
21 0 0 0 0 0 1 1
22 0 0 0 0 1 0 1
23 0 0 0 1 0 0 1
24 0 0 1 0 0 0 1
I've tried many versions of case_when, across, list, etc., and I can't get the syntax right. Thank you in advance for your help.
new.df <- test.df %>%
rowwise(Id) %>%
mutate(newcol1 = case_when(
R ==0 & D==1 & M==1 & P==1 & A==1 & S==1 & N==1 ~ 0, #0 regardless of D:N value because R==0
R ==1 & for any D==1 & M==1 & P==1 & A==1 & S==1 & N==1 ~ 1, #1 as long as at least 1 D:N==1
),
newcol1 = factor(
newcol1,
level = c( "0", "1")
)
)
new.df2 <- new.df
rowwise(Id) %>%
mutate(newvcol2 = case_when(
R ==1 & D==1 & M:N ==0 ~ 1,
R ==1 & M==1 & D==0 & P:N ==0 ~ 2,
R ==1 & P==1 & D:M==0 & A:N ==0 ~ 3,
R ==1 & A==1 & D:P==0 & S:N ==0 ~ 4,
R ==1 & S==1 & D:A==0 & N==0 ~ 5,
R ==1 & N==1 & D:S ==0 ~ 6,
R ==1 & more than 1 of D:N ==1 ~ 7 ,
TRUE ~ 0), # should include (R ==0 & any value in D:N ~ 0, if all D:N==0 ~0)
newcol2 = factor(
newcol2,
level = c( "0", "1", "2", "3", "4", "5", "6", "7")
)
)
The code from the question below looks very close to what I need but I can't expand it because I don't understand all of it.
Create column based on multiple conditions on different columns
for _, sub in df.groupby(['cust_id']): # test for each cust_id
for col in ['prod_1', 'prod_2', 'first_act']: # test columns in sequence
tmp = sub[sub[col] == 1] # try to match
if len(tmp) != 0: # ok found at least one
df.loc[tmp.index[0], 't0'] = 1 # set t0 to 1 for first index found
break
Desired Result
Id D M P A S N R newcol1 newcol2
1 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 0 0 0
3 1 1 1 1 1 1 1 1 7
4 1 1 0 0 0 0 1 1 7
5 1 0 0 0 0 0 1 1 1
6 1 0 1 0 0 0 1 1 7
7 1 0 0 1 0 0 1 1 7
8 1 0 0 0 1 0 1 1 7
9 1 0 0 0 0 1 1 1 7
10 0 1 0 0 0 0 1 1 2
11 0 0 1 0 0 0 1 1 3
12 0 0 0 1 0 0 1 1 4
13 0 0 0 0 1 0 1 1 5
14 0 0 0 0 0 1 1 1 6
15 0 1 1 1 1 1 1 1 7
16 0 0 0 0 0 0 1 0 0
17 1 0 0 0 0 0 1 1 1
18 1 0 1 1 0 0 1 1 7
19 0 1 0 0 0 1 1 1 7
20 0 0 1 0 1 0 1 1 7
21 0 0 0 0 0 1 1 1 6
22 0 0 0 0 1 0 1 1 5
23 0 0 0 1 0 0 1 1 4
24 0 0 1 0 0 0 1 1 3
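One possible way to get both columns without rowwise() is to count the 1's per row and branch on that count. The following is only a sketch in base R, assuming D:N contain nothing but 0 and 1; the helper names flag_cols, vals, n_ones, active, and first_one are illustrative and do not come from the question.

```r
# Sample data from the question.
test.df <- read.table(header = TRUE, text = "
Id D M P A S N R
1 0 0 0 0 0 0 0
2 1 1 1 1 1 1 0
3 1 1 1 1 1 1 1
4 1 1 0 0 0 0 1
5 1 0 0 0 0 0 1
6 1 0 1 0 0 0 1
7 1 0 0 1 0 0 1
8 1 0 0 0 1 0 1
9 1 0 0 0 0 1 1
10 0 1 0 0 0 0 1
11 0 0 1 0 0 0 1
12 0 0 0 1 0 0 1
13 0 0 0 0 1 0 1
14 0 0 0 0 0 1 1
15 0 1 1 1 1 1 1
16 0 0 0 0 0 0 1
17 1 0 0 0 0 0 1
18 1 0 1 1 0 0 1
19 0 1 0 0 0 1 1
20 0 0 1 0 1 0 1
21 0 0 0 0 0 1 1
22 0 0 0 0 1 0 1
23 0 0 0 1 0 0 1
24 0 0 1 0 0 0 1
")

flag_cols <- c("D", "M", "P", "A", "S", "N")  # column order defines codes 1..6
vals <- as.matrix(test.df[flag_cols])
n_ones <- rowSums(vals == 1)                   # number of 1's in D:N per row

# newcol1: 1 only when R == 1 and at least one of D:N is 1.
active <- test.df$R == 1 & n_ones > 0
test.df$newcol1 <- as.integer(active)

# Position of the first 1 in each row; only meaningful when n_ones == 1.
first_one <- max.col(vals, ties.method = "first")

# newcol2: 0 when inactive, 7 when several columns are 1, else the column's code.
test.df$newcol2 <- ifelse(!active, 0L, ifelse(n_ones > 1, 7L, first_one))
```

Because rowSums() and max.col() work on the whole matrix at once, this should scale to a million rows far better than a rowwise() pipeline. If factors are needed, factor(test.df$newcol2, levels = 0:7) can be applied afterwards.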

Related

How to count columns in a range with nonzero values?

I am essentially trying to replicate the COUNTIF function from excel. I have a data frame called filtered.data like so:
Experiment_ID t_20_n_6 t_20_n_5 t_20_n_4 t_20_n_3 t_20_n_2 t_20_n_1
1 SG100520_social_01 0 0 0 0 2 1
2 K8012921_social_03 0 0 0 0 0 1
3 K8020521_social_01 0 0 0 1 1 1
4 K8020521_social_02 0 0 1 0 0 1
5 K8020521_social_03 0 0 0 0 2 3
6 K8020521_social_04 0 0 0 1 1 2
7 K8020521_social_05 0 0 0 1 1 3
8 K8021221_social_01 1 0 0 0 0 1
9 K8021221_social_03 0 0 0 0 0 2
10 K8021221_social_04 0 0 0 2 0 1
And I need to calculate a sort of average for t_20_n_6:t_20_n_1. I have the totaling part down by using x <- filtered.data %>% mutate(t_20_mean = ( (6*t_20_n_6)+(5*t_20_n_5)+(4*t_20_n_4)+(3*t_20_n_3)+(2*t_20_n_2)+(1*t_20_n_1) ) / ~~~~)
but I need to replace the ~~~~ with a count of the number of nonzero columns from t_20_n_6:t_20_n_1.
I have tried sum(x$t_10_n_6 != 0 | x$t_20_n_5 != 0 | x$t_20_n_4 != 0 | x$t_20_n_3 != 0 | x$t_20_n_2 !=0 | x$t_20_n_1 != 0 ) but the numbers don't make sense.
The results should be:
Experiment_ID t_20_n_6 t_20_n_5 t_20_n_4 t_20_n_3 t_20_n_2 t_20_n_1 t_20_mean
1 SG100520_social_01 0 0 0 0 2 1 2.5
2 K8012921_social_03 0 0 0 0 0 1 1
3 K8020521_social_01 0 0 0 1 1 1 2
4 K8020521_social_02 0 0 1 0 0 1 2.5
5 K8020521_social_03 0 0 0 0 2 3 3.5
6 K8020521_social_04 0 0 0 1 1 2 2.33
7 K8020521_social_05 0 0 0 1 1 3 2.67
8 K8021221_social_01 1 0 0 0 0 1 3.5
9 K8021221_social_03 0 0 0 0 0 2 2
10 K8021221_social_04 0 0 0 2 0 1 3.5
If you are interested in using the number (1 through 6) embedded in the column names for weighting, you could also try this approach.
Use pivot_longer to put the data in long format. Then, for each Experiment_ID, you can sum the values weighted by the number extracted from the column name, and divide by the number of values that are greater than zero.
library(tidyverse)
filtered.data %>%
pivot_longer(cols = -Experiment_ID,
names_pattern = "t_20_n_(\\d+)",
names_transform = list(name = as.integer)) %>%
group_by(Experiment_ID) %>%
summarise(t_20_mean = sum(name * value) / sum(value > 0))
Output
Experiment_ID t_20_mean
<chr> <dbl>
1 K8012921_social_03 1
2 K8020521_social_01 2
3 K8020521_social_02 2.5
4 K8020521_social_03 3.5
5 K8020521_social_04 2.33
6 K8020521_social_05 2.67
7 K8021221_social_01 3.5
8 K8021221_social_03 2
9 K8021221_social_04 3.5
10 SG100520_social_01 2.5
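For comparison, the count of nonzero columns asked for in the question can also be computed directly with rowSums() on a logical matrix, without reshaping. This is a hedged base R sketch on the first three rows of the sample data; the names cols, weights, and vals are illustrative only.

```r
# First three rows of the sample data from the question.
filtered.data <- data.frame(
  Experiment_ID = c("SG100520_social_01", "K8012921_social_03", "K8020521_social_01"),
  t_20_n_6 = c(0, 0, 0),
  t_20_n_5 = c(0, 0, 0),
  t_20_n_4 = c(0, 0, 0),
  t_20_n_3 = c(0, 0, 1),
  t_20_n_2 = c(2, 0, 1),
  t_20_n_1 = c(1, 1, 1)
)

cols <- paste0("t_20_n_", 6:1)  # columns in the order 6, 5, ..., 1
weights <- 6:1                  # weight taken from each column name
vals <- as.matrix(filtered.data[cols])

# Weighted total per row divided by the number of nonzero columns in that row.
filtered.data$t_20_mean <- as.vector(vals %*% weights) / rowSums(vals != 0)
```

A row in which every column is zero would give 0 / 0, that is NaN, so such rows may need separate handling.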

R: How to drop columns with less than 10% 1's

My dataset:
a b c
1 1 0
1 0 0
1 1 0
I want to drop columns which have less than 10% 1's. I have this code but it's not working:
sapply(df, function(x) df[df[,c(x)]==1]>0.1))
Maybe I need a totally different approach.
Try this option, which uses apply() with a small helper function to compute the proportion of 1's in each column and compare it against the threshold. I have created a dummy example. The index i holds the columns that will be dropped, based on the proportions computed by myfun. Here is the code:
#Data
df <- as.data.frame(matrix(c(1,0),20,10))
df$V1<-c(1,rep(0,19))
df$V2<-c(1,rep(0,19))
#Function
myfun <- function(x) {sum(x==1)/length(x)}
#Index For removing
i <- unname(which(apply(df,2,myfun)<0.1))
#Drop
df2 <- df[,-i]
The output:
df2
V3 V4 V5 V6 V7 V8 V9 V10
1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1
6 0 0 0 0 0 0 0 0
7 1 1 1 1 1 1 1 1
8 0 0 0 0 0 0 0 0
9 1 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0 0
11 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0
13 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0
15 1 1 1 1 1 1 1 1
16 0 0 0 0 0 0 0 0
17 1 1 1 1 1 1 1 1
18 0 0 0 0 0 0 0 0
19 1 1 1 1 1 1 1 1
20 0 0 0 0 0 0 0 0
Columns V1 and V2 are dropped because their proportion of 1's is below 0.1.
You can use colMeans in base R to keep columns in which at least 10% of the values are 1's.
df[, colMeans(df == 1) >= 0.1]
Or in dplyr use select with where:
library(dplyr)
df %>% select(where(~mean(. == 1) >= 0.1))
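As a quick check, here is the column-mean idea applied to the three-column dataset from the question, in base R. It is a minimal sketch; keep and df_kept are illustrative names, and drop = FALSE is added so the result stays a data frame even if only one column survives.

```r
# Dataset from the question.
df <- data.frame(a = c(1, 1, 1), b = c(1, 0, 1), c = c(0, 0, 0))

# Proportion of 1's per column, compared against the 10% threshold.
keep <- colMeans(df == 1) >= 0.1

# Select by logical vector; this also behaves correctly when no column is dropped.
df_kept <- df[, keep, drop = FALSE]
```

Selecting with a logical vector avoids a pitfall of negative indexing: if no column falls below the threshold, the index of columns to drop is empty, and df[, -i] then returns a data frame with zero columns instead of the full data.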

Creating a new variable and altering dependent variables in r using ifelse

Let's say we have a df as follows:
A B C D E
1 1 0 0 1
0 0 1 0 0
0 0 0 0 1
1 1 1 1 0
0 1 1 0 1
1 0 1 0 0
So I would like to make another variable F which says, if the sum of A:D is greater than 1, F is 1 and A:D are 0.
Additionally, If E == 1, then F = 0.
So here's how I wrote it but it's not working...
#Counter
df<- df %>%
mutate(case_count = A+B+C+D)
df$F <- ifelse(df$E == 1, 0,
ifelse(df$case_count > 1,
df$A == 0 &
df$B == 0 &
df$C == 0 &
df$D == 0 &
df$F == 1, 0))
And the correct result here should then be
A B C D E case_count F
1 1 0 0 1 2 0
0 0 1 0 0 1 0
0 0 0 0 1 0 0
0 0 0 0 0 4 1
0 1 1 0 1 2 0
0 0 0 0 0 2 1
Using dplyr and the new functions across and c_across:
df %>%
rowwise() %>%
mutate(
case_count = sum(c_across(A:D)),
F_ = ifelse(E == 1, 0, ifelse(case_count > 1, 1, 0))
) %>%
mutate(across(A:D, ~ifelse(F_ == 1, 0, .)))
I named the new column F_ instead of just F because the latter may be confused with the abbreviation for FALSE.
Output
# A tibble: 6 x 7
# Rowwise:
# A B C D E case_count F_
# <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl>
# 1 1 1 0 0 1 2 0
# 2 0 0 1 0 0 1 0
# 3 0 0 0 0 1 0 0
# 4 0 0 0 0 0 4 1
# 5 0 1 1 0 1 2 0
# 6 0 0 0 0 0 2 1
You can try this solution (DF is your original data):
#Create index
DF$I1 <- rowSums(DF[,1:4])
DF[DF[,6]>1,1:4]<-0
#Create F
DF$F <- ifelse(DF$I1>1,1,0)
DF$F <- ifelse(DF$E==1,0,DF$F)
A B C D E I1 F
1 0 0 0 0 1 2 0
2 0 0 1 0 0 1 0
3 0 0 0 0 1 0 0
4 0 0 0 0 0 4 1
5 0 0 0 0 1 2 0
6 0 0 0 0 0 2 1
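The same logic can also be written fully vectorized in base R, following the rule from the question that E == 1 forces F to 0 and that A:D are zeroed only where F is 1. This is a sketch rather than a definitive answer; it reuses the F_ column name from the dplyr answer, and abcd is an illustrative name.

```r
# Data from the question.
df <- data.frame(
  A = c(1, 0, 0, 1, 0, 1),
  B = c(1, 0, 0, 1, 1, 0),
  C = c(0, 1, 0, 1, 1, 1),
  D = c(0, 0, 0, 1, 0, 0),
  E = c(1, 0, 1, 0, 1, 0)
)

abcd <- c("A", "B", "C", "D")
df$case_count <- rowSums(df[abcd])  # counter computed before A:D are changed

# F_ is 1 only when E is not 1 and more than one of A:D is set.
df$F_ <- as.integer(df$E != 1 & df$case_count > 1)

# Zero out A:D in the rows where F_ is 1.
df[df$F_ == 1, abcd] <- 0
```

On the sample data this reproduces the desired result shown in the question, including row 1, where A and B keep their values because E == 1.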

create a loop to get samples in grouped data which meet a condition

I have a dataframe where data are grouped by ID. I need to know how many cells make up 10% of each group so that I can draw a sample of that size, but the sample should only select cells where EP is 1.
I've tried a nested for loop: an inner loop to find the number of cells that make up 10% of each group, and an outer loop to draw a sample of that size from the cells meeting the condition EP==1.
x <- data.frame("ID"=rep(1:2, each=10),"EP" = rep(0:1, times=10))
x
ID EP
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 1 1
7 1 0
8 1 1
9 1 0
10 1 1
11 2 0
12 2 1
13 2 0
14 2 1
15 2 0
16 2 1
17 2 0
18 2 1
19 2 0
20 2 1
for(j in 1:1000){
for (i in 1:nrow(x)){
d <- x[x$ID==i,]
npix <- 10*nrow(d)/100
}
r <- sample(d[d$EP==1,],npix)
print(r)
}
data frame with 0 columns and 0 rows
data frame with 0 columns and 0 rows
data frame with 0 columns and 0 rows
... (and so on, 1000 times)
I would like to get this dataframe, where each sample is a new column in x and each sampled cell is marked with a 1:
ID EP s1 s2....s1000
1 1 0 0 0 ....
2 1 1 0 1
3 1 0 0 0
4 1 1 0 0
5 1 0 0 0
6 1 1 0 0
7 1 0 0 0
8 1 1 0 0
9 1 0 0 0
10 1 1 1 0
11 2 0 0 0
12 2 1 0 0
13 2 0 0 0
14 2 1 0 1
15 2 0 0 0
16 2 1 0 0
17 2 0 0 0
18 2 1 1 0
19 2 0 0 0
20 2 1 0 0
Note that the 1's in s1 and s2 mark the sampled cells: in each group (1, 2) their number equals 10% of the group's cells, and they are drawn only from cells that meet the condition EP==1.
You can try:
set.seed(1231)
x <- data.frame("ID"=rep(1:2, each=10),"EP" = rep(0:1, times=10))
library(tidyverse)
x %>%
group_by(ID) %>%
mutate(index= ifelse(EP==1, 1:n(),0)) %>%
mutate(s1 = ifelse(index %in% sample(index[index!=0], n()*0.1), 1, 0)) %>%
mutate(s2 = ifelse(index %in% sample(index[index!=0], n()*0.1), 1, 0))
# A tibble: 20 x 5
# Groups: ID [2]
ID EP index s1 s2
<int> <int> <dbl> <dbl> <dbl>
1 1 0 0 0 0
2 1 1 2 0 0
3 1 0 0 0 0
4 1 1 4 0 0
5 1 0 0 0 0
6 1 1 6 1 1
7 1 0 0 0 0
8 1 1 8 0 0
9 1 0 0 0 0
10 1 1 10 0 0
11 2 0 0 0 0
12 2 1 2 0 0
13 2 0 0 0 0
14 2 1 4 0 1
15 2 0 0 0 0
16 2 1 6 0 0
17 2 0 0 0 0
18 2 1 8 0 0
19 2 0 0 0 0
20 2 1 10 1 0
We can write a function that, for each ID, marks a number of rows equal to 10% of the group size with 1, choosing only among rows where EP == 1.
library(dplyr)
rep_func <- function() {
x %>%
group_by(ID) %>%
mutate(s1 = 0,
s1 = replace(s1, sample(which(EP == 1), floor(0.1 * n())), 1)) %>%
pull(s1)
}
Then use replicate to repeat it n times:
n <- 5
x[paste0("s", seq_len(n))] <- replicate(n, rep_func())
x
# ID EP s1 s2 s3 s4 s5
#1 1 0 0 0 0 0 0
#2 1 1 0 0 0 0 0
#3 1 0 0 0 0 0 0
#4 1 1 0 0 0 0 0
#5 1 0 0 0 0 0 0
#6 1 1 1 0 0 1 0
#7 1 0 0 0 0 0 0
#8 1 1 0 1 0 0 0
#9 1 0 0 0 0 0 0
#10 1 1 0 0 1 0 1
#11 2 0 0 0 0 0 0
#12 2 1 0 0 1 0 0
#13 2 0 0 0 0 0 0
#14 2 1 1 1 0 0 0
#15 2 0 0 0 0 0 0
#16 2 1 0 0 0 0 1
#17 2 0 0 0 0 0 0
#18 2 1 0 0 0 1 0
#19 2 0 0 0 0 0 0
#20 2 1 0 0 0 0 0

Using any in nested ifelse statement

data:
set.seed(1337)
m <- matrix(sample(c(0,0,0,1),size = 50,replace=T),ncol=5) %>% as.data.frame
colnames(m)<-LETTERS[1:5]
code:
m %<>%
mutate(newcol = ifelse(A==1&(B==1|C==1)&(D==1|E==1),1,
ifelse(any(A,B,C,D,E),0,NA)),
desiredResult= ifelse(A==1&(B==1|C==1)&(D==1|E==1),1,
ifelse(!(A==0&B==0&C==0&D==0&E==0),0,NA)))
looks like:
A B C D E newcol desiredResult
1 0 1 1 1 0 0 0
2 0 1 0 0 1 0 0
3 0 1 0 0 0 0 0
4 0 0 0 0 0 0 NA
5 0 1 0 1 0 0 0
6 0 0 1 0 0 0 0
7 1 1 1 1 0 1 1
8 0 1 1 0 0 0 0
9 0 0 0 0 0 0 NA
10 0 0 1 0 0 0 0
question
I want newcol to be the same as desiredResult.
Why can't I use any in that "stratified" manner inside ifelse? Is there a function like any that would work in that situation?
possible workaround
I could define a function
any_vec <- function(...) {apply(cbind(...),1,any)} but this does not make me smile too much.
as suggested in the answer
Using pmax works like a vectorized any when the columns hold only 0 and 1.
m %>%
mutate(pmaxResult = ifelse(A==1& pmax(B,C) & pmax(D,E),1,
ifelse(pmax(A,B,C,D,E),0,NA)),
desiredResult= ifelse(A==1&(B==1|C==1)&(D==1|E==1),1,
ifelse(!(A==0&B==0&C==0&D==0&E==0),0,NA)))
Here's an alternative approach. I converted to logical at the beginning and back to integer at the end:
m %>%
mutate_all(as.logical) %>%
mutate(newcol = A & pmax(B,C) & pmax(D, E) ,
newcol = replace(newcol, !newcol & !pmax(A,B,C,D,E), NA)) %>%
mutate_all(as.integer)
# A B C D E newcol
# 1 0 1 1 1 0 0
# 2 0 1 0 0 1 0
# 3 0 1 0 0 0 0
# 4 0 0 0 0 0 NA
# 5 0 1 0 1 0 0
# 6 0 0 1 0 0 0
# 7 1 1 1 1 0 1
# 8 0 1 1 0 0 0
# 9 0 0 0 0 0 NA
# 10 0 0 1 0 0 0
I basically replaced the any with pmax.
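The reason any() fails inside ifelse() is that it collapses all of its arguments into a single TRUE or FALSE, which ifelse() then applies to every row. A row-wise test needs a vectorized operation such as |, pmax(), or rowSums(). The sketch below shows this in base R on the ten rows printed in the question; any_row and hit are illustrative names.

```r
# The ten rows printed in the question.
m <- data.frame(
  A = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  B = c(1, 1, 1, 0, 1, 0, 1, 1, 0, 0),
  C = c(1, 0, 0, 0, 0, 1, 1, 1, 0, 1),
  D = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0),
  E = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
)

# any() returns one value for the whole data frame, not one per row.
length(any(m$A == 1, m$B == 1))

# Row-wise "any": TRUE for every row that has at least one 1.
any_row <- rowSums(m == 1) > 0

hit <- m$A == 1 & (m$B == 1 | m$C == 1) & (m$D == 1 | m$E == 1)
m$newcol <- ifelse(hit, 1, ifelse(any_row, 0, NA))
```

For 0/1 columns, pmax(A, B, C, D, E) == 1 gives the same per-row result as any_row.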

Resources