Creating a new variable and altering dependent variables in R using ifelse

Let's say we have a df as follows:
A B C D E
1 1 0 0 1
0 0 1 0 0
0 0 0 0 1
1 1 1 1 0
0 1 1 0 1
1 0 1 0 0
I would like to create another variable F: if the sum of A:D is greater than 1, F is 1 and A:D are set to 0.
Additionally, if E == 1, then F = 0.
Here's how I wrote it, but it's not working:
#Counter
df<- df %>%
mutate(case_count = A+B+C+D)
df$F <- ifelse(df$E == 1, 0,
ifelse(df$case_count > 1,
df$A == 0 &
df$B == 0 &
df$C == 0 &
df$D == 0 &
df$F == 1, 0))
And the correct result here should then be
A B C D E case_count F
1 1 0 0 1 2 0
0 0 1 0 0 1 0
0 0 0 0 1 0 0
0 0 0 0 0 4 1
0 1 1 0 1 2 0
0 0 0 0 0 2 1

Using dplyr (1.0.0 or later) and its functions across and c_across:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(
    case_count = sum(c_across(A:D)),
    F_ = ifelse(E == 1, 0, ifelse(case_count > 1, 1, 0))
  ) %>%
  mutate(across(A:D, ~ ifelse(F_ == 1, 0, .)))
I named the new column F_ instead of just F because the latter may be confused with the abbreviation for FALSE.
Output
# A tibble: 6 x 7
# Rowwise:
# A B C D E case_count F_
# <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl>
# 1 1 1 0 0 1 2 0
# 2 0 0 1 0 0 1 0
# 3 0 0 0 0 1 0 0
# 4 0 0 0 0 0 4 1
# 5 0 1 1 0 1 2 0
# 6 0 0 0 0 0 2 1
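For comparison, the attempt in the question fails because ifelse() only returns a vector of values: an expression such as df$A == 0 inside it is a comparison, not an assignment, so A:D are never changed. Below is a minimal base R sketch that separates the two steps (compute the flag, then overwrite the columns). It assumes the example data from the question and reuses the F_ name from the answer above; cols is a helper name introduced here.

```r
# Example data from the question
df <- data.frame(
  A = c(1, 0, 0, 1, 0, 1),
  B = c(1, 0, 0, 1, 1, 0),
  C = c(0, 1, 0, 1, 1, 1),
  D = c(0, 0, 0, 1, 0, 0),
  E = c(1, 0, 1, 0, 1, 0)
)
cols <- c("A", "B", "C", "D")  # helper name, not from the question

# Step 1: compute the flag as a plain vector
df$case_count <- rowSums(df[cols])
df$F_ <- as.integer(df$E != 1 & df$case_count > 1)

# Step 2: overwrite A:D only in the flagged rows
df[df$F_ == 1, cols] <- 0
df
```

Keeping the assignment outside ifelse() makes it explicit which rows are modified and leaves rows with E == 1 untouched.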

You can try this solution (DF is your original data):
# Create index: row sums of A:D
DF$I1 <- rowSums(DF[, 1:4])
# Create F: 1 when the sum exceeds 1, unless E == 1
DF$F <- ifelse(DF$I1 > 1, 1, 0)
DF$F <- ifelse(DF$E == 1, 0, DF$F)
# Zero out A:D only in the rows where F is 1
DF[DF$F == 1, 1:4] <- 0
  A B C D E I1 F
1 1 1 0 0 1  2 0
2 0 0 1 0 0  1 0
3 0 0 0 0 1  0 0
4 0 0 0 0 0  4 1
5 0 1 1 0 1  2 0
6 0 0 0 0 0  2 1

Related

Calculate the number of times an interaction has been made by the same client ID when a conversion is made

I have a dataset like so:
client_id interaction_1 interaction_2 conversion
A         1             0             0
B         0             1             0
C         0             0             1
A         0             0             1
B         0             1             0
B         0             0             1
C         0             1             0
C         0             0             1
The dataset is already ordered by timestamp (ascending). The interaction and conversion columns are dummies (0/1). For every conversion, I need to count how many times that client_id previously had an interaction or a conversion, counting only the interactions since the last conversion (therefore, the column "lag_conversion" can never be >1).
The output should look something like this:
client_id interaction_1 interaction_2 conversion lag_interaction_1 lag_interaction_2 lag_conversion
A         1             0             0          0                 0                 0
B         0             1             0          0                 0                 0
C         0             0             1          0                 0                 0
A         0             0             1          1                 0                 0
B         0             1             0          0                 0                 0
B         0             0             1          0                 2                 0
C         0             1             0          0                 0                 0
C         0             0             1          0                 1                 1
I've tried the code:
for (i in 1:nrow(mydata)) {
client_id <- mydata$client_id[i]
if (mydata$conversion[i] == 1) {
last_conversion_index <- max(which(mydata$client_id == client_id & mydata$conversion== 1 & 1:nrow(mydata) <= i))
mydata$interaction_1[i:last_conversion_index & mydata$interaction_1== 1] <- 1
}
}
although this only results in a 1 in the row itself, meaning that the first row would look like:
client_id interaction_1 interaction_2 conversion lag_interaction_1 lag_interaction_2 lag_conversion
A         1             0             0          1                 0                 0
Any help is much appreciated! Thanks in advance
With dplyr:
library(dplyr) #1.1.0 and above
dat %>%
mutate(across(everything(), ~ ifelse(conversion == 1, cumsum(lag(.x, default = 0)), 0),
.names = "lag_{col}"), .by = client_id)
client_id interaction_1 interaction_2 conversion lag_interaction_1 lag_interaction_2 lag_conversion
1 A 1 0 0 0 0 0
2 B 0 1 0 0 0 0
3 C 0 0 1 0 0 0
4 A 0 0 1 1 0 0
5 B 0 1 0 0 0 0
6 B 0 0 1 0 2 0
7 C 0 1 0 0 0 0
8 C 0 0 1 0 1 1
below 1.1.0:
dat %>%
group_by(client_id) %>%
mutate(across(everything(), ~ ifelse(conversion == 1, cumsum(lag(.x, default = 0)), 0),
.names = "lag_{col}")) %>%
ungroup()
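Note that cumsum(lag(.x)) counts every earlier row of the same client, so with longer histories the counts would not restart after a conversion. If they should restart, one option is to number the stretches between conversions and count within each stretch. The following is a base R sketch under that assumption, not part of the answer above: prior_sum, episode and key are names introduced here for illustration, and dat is rebuilt from the table in the question.

```r
# Data rebuilt from the question
dat <- data.frame(
  client_id     = c("A", "B", "C", "A", "B", "B", "C", "C"),
  interaction_1 = c(1, 0, 0, 0, 0, 0, 0, 0),
  interaction_2 = c(0, 1, 0, 0, 1, 0, 1, 0),
  conversion    = c(0, 0, 1, 1, 0, 1, 0, 1)
)

# Hypothetical helper: sum of all earlier values within each group
prior_sum <- function(x, g) {
  ave(x, g, FUN = function(v) cumsum(c(0, head(v, -1))))
}

# Number of conversions the client had before each row
episode <- prior_sum(dat$conversion, dat$client_id)
# One key per client and per stretch between conversions
key <- interaction(dat$client_id, episode, drop = TRUE)

# Count earlier interactions within the current stretch, on conversion rows only
for (col in c("interaction_1", "interaction_2")) {
  dat[[paste0("lag_", col)]] <-
    ifelse(dat$conversion == 1, prior_sum(dat[[col]], key), 0)
}
# 1 if this conversion was preceded by an earlier one for the same client
dat$lag_conversion <- as.numeric(dat$conversion == 1 & episode > 0)
dat
```

On the example data this reproduces the desired output; the two approaches only differ once a client has interactions before an earlier conversion.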

Converting unique values from a data frame into a reference matrix

Hello lovely people of SO!
I have the following raw dataset:
ID_TRIAL<-c(1,1,1,2,3,4,5,5,5,6,6,6,7,7,8,8,8,8)
TYPE_FAIL<-c("A","B","C","F","A","A","A","B","K","T","F","A","A","B","B","Q","P","I")
ID_TRIAL TYPE_FAIL
1        A
1        B
1        C
2        F
3        A
4        A
5        A
5        B
5        K
6        T
6        F
6        A
7        A
7        B
8        B
8        Q
8        P
8        I
I need to transform this dataset into a matrix whose columns are the TYPE_FAIL values in alphabetical order and whose rows are a binary representation of all unique fail types a trial had.
For instance, all the fail types in alphabetical order are: A B C F I K P Q T
So for TRIAL 8 the matrix row will look like this:
A B C F I K P Q T
0 1 0 0 1 0 1 1 0
The zeros in the other cells mean that, for example, fail type A did not occur during trial 8, and so on.
My desired output would look like this:
TRIAL A B C F I K P Q T
1     1 1 1 0 0 0 0 0 0
2     0 0 0 1 0 0 0 0 0
3     1 0 0 0 0 0 0 0 0
4     1 0 0 0 0 0 0 0 0
5     1 1 0 0 0 1 0 0 0
6     1 0 0 1 0 0 0 0 1
7     1 1 0 0 0 0 0 0 0
8     0 1 0 0 1 0 1 1 0
Thank you all for helping me out; I will read and respond to all of your comments.
Some of my thought process behind a solution:
First I need to group by ID_TRIAL. Then I need a function or routine that looks for a letter, let's say "B", and puts a 1 in my matrix under column B for the row of the trial in question. I can do this using multiple ifelse lines, but my real dataset is quite large and I don't know if there is a faster way to do this.
Here's a tidyverse solution using dplyr::count and tidyr::pivot_wider.
library(dplyr)
library(tidyr)
df1 <- data.frame(ID_TRIAL = c(1, 1, 1, 2, 3, 4, 5, 5 , 5, 6, 6, 6, 7, 7, 8, 8, 8, 8),
TYPE_FAIL = c("A", "B", "C", "F", "A", "A", "A", "B", "K", "T", "F",
"A", "A", "B", "B", "Q", "P", "I"))
df1 %>%
count(ID_TRIAL, TYPE_FAIL) %>%
pivot_wider(names_from = "TYPE_FAIL",
values_from = "n",
names_sort = TRUE) %>%
replace(is.na(.), 0)
Result:
# A tibble: 8 × 10
ID_TRIAL A B C F I K P Q T
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 0 0 0 0 0 0
2 0 0 0 1 0 0 0 0 0
3 1 0 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0 0
5 1 1 0 0 0 1 0 0 0
6 1 0 0 1 0 0 0 0 1
7 1 1 0 0 0 0 0 0 0
8 0 1 0 0 1 0 1 1 0
Matrix format. Generating a matrix from the table returned values
ID_TRIAL<-c(1,1,1,2,3,4,5,5,5,6,6,6,7,7,8,8,8,8)
TYPE_FAIL<-c("A","B","C","F","A","A","A","B","K","T","F","A","A","B","B","Q","P","I")
df <- data.frame(ID_TRIAL = ID_TRIAL, TYPE_FAIL = TYPE_FAIL)
mat <- table(df) |> matrix(nrow = 8, dimnames = list(unique(df$ID_TRIAL),
sort(unique(df$TYPE_FAIL))))
A B C F I K P Q T
1 1 1 1 0 0 0 0 0 0
2 0 0 0 1 0 0 0 0 0
3 1 0 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0 0
5 1 1 0 0 0 1 0 0 0
6 1 0 0 1 0 0 0 0 1
7 1 1 0 0 0 0 0 0 0
8 0 1 0 0 1 0 1 1 0
I thought you meant a literal matrix.
If you meant a data.frame, you can use the table function to generate the counts and then pivot wider (pivot_wider is from tidyr):
data.frame(table(df)) |>
pivot_wider(id_cols = ID_TRIAL, names_from = TYPE_FAIL, values_from = Freq)
ID_TRIAL A B C F I K P Q T
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 1 0 0 0 0 0 0
2 2 0 0 0 1 0 0 0 0 0
3 3 1 0 0 0 0 0 0 0 0
4 4 1 0 0 0 0 0 0 0 0
5 5 1 1 0 0 0 1 0 0 0
6 6 1 0 0 1 0 0 0 0 1
7 7 1 1 0 0 0 0 0 0 0
8 8 0 1 0 0 1 0 1 1 0
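The approaches above return counts, which equal 0/1 here only because no trial repeats a fail type. If the real data can contain repeats and a strictly binary matrix is wanted, the counts can be thresholded. A minimal base R sketch, assuming the two vectors from the question (tab and mat are names chosen here):

```r
ID_TRIAL <- c(1, 1, 1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 8)
TYPE_FAIL <- c("A", "B", "C", "F", "A", "A", "A", "B", "K", "T", "F",
               "A", "A", "B", "B", "Q", "P", "I")

# Cross-tabulate; rows and columns come out sorted
tab <- table(ID_TRIAL, TYPE_FAIL)

# Turn counts into 0/1 so a repeated fail type still gives 1
mat <- matrix(as.integer(tab > 0), nrow = nrow(tab),
              dimnames = list(rownames(tab), colnames(tab)))
mat["8", ]
```

Taking the dimensions and names from tab itself avoids hard-coding nrow = 8, so the same lines work when the number of trials changes.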

Automatic subsetting of a dataframe on the basis of a prediction matrix

I have created a prediction matrix for large dataset as follows:
library(mice)
dfpredm <- quickpred(df, mincor=.3)
A B C D E F G H I J
A 0 1 1 1 0 1 0 1 1 0
B 1 0 0 0 1 0 1 0 0 1
C 0 0 0 1 1 0 0 0 0 0
D 1 0 1 0 0 1 0 1 0 1
E 0 1 0 1 0 1 1 0 1 0
F 0 0 1 0 0 0 1 0 0 0
G 0 1 0 1 0 0 0 0 0 0
H 1 0 1 0 0 1 0 0 0 1
I 0 1 0 1 1 0 1 0 0 0
J 1 0 1 0 0 1 0 1 0 0
I would like to create a subset of the original df on the basis of dfpredm.
More specifically I would like to do the following:
Let's assume that my dependent variable is F.
According to the prediction matrix F is correlated with C and G.
In addition, C and G are best predicted by D,E and B,D respectively.
The idea is to create a subset of df based on the dependent variable F, keeping the columns for which the value in the F row is 1.
Fpredictors <- df[, dfpredm["F", ] == 1]
But I also want to do the same for the variables whose value in the F row is 1. I am thinking of first getting the column names like this:
Fpredcol <- colnames(dfpredm[, dfpredm["F", ] == 1])
And then doing a for loop with these column names?
For the specific example I would like to end up with the subset.
dfsub <- df[,c("F","C","G","B","E","D")]
I would however like to automate this process. Could anyone show me how to do this?
Here is one strategy that seems like it would work for you:
first_preds <- function(dat, predictor) {
cols <- which(dat[predictor, ] == 1)
names(dat)[cols]
}
# wrap first_preds() for getting best and second best predictors
first_and_second_preds <- function(dat, predictor) {
matches <- first_preds(dat, predictor)
matches <- c(matches, unlist(lapply(matches, function(x) first_preds(dat, x))))
unique(c(predictor, matches))  # base R, so no pipe package is needed
}
dat[first_and_second_preds(dat, "F")] # order is not exactly the same as your output
F C G D E B
A 1 1 0 1 0 1
B 0 0 1 0 1 0
C 0 0 0 1 1 0
D 1 1 0 0 0 0
E 1 0 1 1 0 1
F 0 1 1 0 0 0
G 0 0 0 1 0 1
H 1 1 0 0 0 0
I 0 0 1 1 1 1
J 1 1 0 0 0 0
Not sure if the ordering in the result is important, but you could add the logic if it is.
Using dat from here (a kinder way to share small R data on SO):
dat <- read.table(
text = "A B C D E F G H I J
A 0 1 1 1 0 1 0 1 1 0
B 1 0 0 0 1 0 1 0 0 1
C 0 0 0 1 1 0 0 0 0 0
D 1 0 1 0 0 1 0 1 0 1
E 0 1 0 1 0 1 1 0 1 0
F 0 0 1 0 0 0 1 0 0 0
G 0 1 0 1 0 0 0 0 0 0
H 1 0 1 0 0 1 0 0 0 1
I 0 1 0 1 1 0 1 0 0 0
J 1 0 1 0 0 1 0 1 0 0",
header = TRUE
)
Something a little more general that would let you use self-selected predictors directly:
all_preds <- function(dat, predictors) {
unlist(lapply(predictors, function(x) names(dat)[which(dat[x, ] == 1 )]))
}
dat[all_preds(dat, c("A", "B"))]
B C D F H I A E G J
A 1 1 1 1 1 1 0 0 0 0
B 0 0 0 0 0 0 1 1 1 1
C 0 0 1 0 0 0 0 1 0 0
D 0 1 0 1 1 0 1 0 0 1
E 1 0 1 1 0 1 0 0 1 0
F 0 1 0 0 0 0 0 0 1 0
G 1 0 1 0 0 0 0 0 0 0
H 0 1 0 1 0 0 1 0 0 1
I 1 0 1 0 0 0 0 1 1 0
J 0 1 0 1 1 0 1 0 0 0
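If more than two levels of predictors are ever needed, the same lookup can be repeated in a loop instead of being nested by hand. The sketch below is one possible generalisation, not part of the answer above: preds_to_depth and its depth argument are hypothetical names, and dat is the example matrix read with read.table as before.

```r
dat <- read.table(
  text = "A B C D E F G H I J
A 0 1 1 1 0 1 0 1 1 0
B 1 0 0 0 1 0 1 0 0 1
C 0 0 0 1 1 0 0 0 0 0
D 1 0 1 0 0 1 0 1 0 1
E 0 1 0 1 0 1 1 0 1 0
F 0 0 1 0 0 0 1 0 0 0
G 0 1 0 1 0 0 0 0 0 0
H 1 0 1 0 0 1 0 0 0 1
I 0 1 0 1 1 0 1 0 0 0
J 1 0 1 0 0 1 0 1 0 0",
  header = TRUE
)

# Hypothetical helper: collect predictors of `target` up to `depth` levels
preds_to_depth <- function(pm, target, depth = 2) {
  found <- target
  frontier <- target
  for (i in seq_len(depth)) {
    # Predictors of every variable found at the previous level
    nxt <- unique(unlist(lapply(
      frontier, function(v) names(pm)[unlist(pm[v, ]) == 1]
    )))
    frontier <- setdiff(nxt, found)  # keep only variables not seen yet
    if (length(frontier) == 0) break
    found <- c(found, frontier)
  }
  found
}

preds_to_depth(dat, "F", depth = 2)
```

The result can be used for subsetting in the same way, for example df[preds_to_depth(dfpredm, "F")] if dfpredm is first converted to a data frame.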

Using any in nested ifelse statement

data:
library(dplyr)
library(magrittr)  # for %<>%
set.seed(1337)
m <- matrix(sample(c(0, 0, 0, 1), size = 50, replace = TRUE), ncol = 5) %>% as.data.frame()
colnames(m) <- LETTERS[1:5]
code:
m %<>%
mutate(newcol = ifelse(A==1&(B==1|C==1)&(D==1|E==1),1,
ifelse(any(A,B,C,D,E),0,NA)),
desiredResult= ifelse(A==1&(B==1|C==1)&(D==1|E==1),1,
ifelse(!(A==0&B==0&C==0&D==0&E==0),0,NA)))
looks like:
A B C D E newcol desiredResult
1 0 1 1 1 0 0 0
2 0 1 0 0 1 0 0
3 0 1 0 0 0 0 0
4 0 0 0 0 0 0 NA
5 0 1 0 1 0 0 0
6 0 0 1 0 0 0 0
7 1 1 1 1 0 1 1
8 0 1 1 0 0 0 0
9 0 0 0 0 0 0 NA
10 0 0 1 0 0 0 0
question
I want newcol to be the same as desiredResult.
Why can't I use any in that "stratified" manner inside ifelse? Is there a function like any that would work in that situation?
possible workaround
I could define a function
any_vec <- function(...) apply(cbind(...), 1, any)
but this does not make me smile too much.
As suggested in the answer, pmax works like a vectorized any for 0/1 columns:
m %>%
mutate(pmaxResult = ifelse(A==1& pmax(B,C) & pmax(D,E),1,
ifelse(pmax(A,B,C,D,E),0,NA)),
desiredResult= ifelse(A==1&(B==1|C==1)&(D==1|E==1),1,
ifelse(!(A==0&B==0&C==0&D==0&E==0),0,NA)))
Here's an alternative approach. I converted to logical at the beginning and back to integer at the end:
m %>%
mutate_all(as.logical) %>%
mutate(newcol = A & pmax(B,C) & pmax(D, E) ,
newcol = replace(newcol, !newcol & !pmax(A,B,C,D,E), NA)) %>%
mutate_all(as.integer)
# A B C D E newcol
# 1 0 1 1 1 0 0
# 2 0 1 0 0 1 0
# 3 0 1 0 0 0 0
# 4 0 0 0 0 0 NA
# 5 0 1 0 1 0 0
# 6 0 0 1 0 0 0
# 7 1 1 1 1 0 1
# 8 0 1 1 0 0 0
# 9 0 0 0 0 0 NA
# 10 0 0 1 0 0 0
I basically replaced the any with pmax.
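The reason any fails inside ifelse is that it collapses all of its arguments into a single TRUE or FALSE, which ifelse then recycles for every row. A small base R sketch of the difference, using a made-up three-row data frame rather than the seeded data above (all variable names here are illustrative):

```r
m <- data.frame(A = c(0, 0, 1), B = c(1, 0, 1), C = c(0, 0, 0))

# any() reduces everything to one value for the whole data frame
any_overall <- any(m$A == 1, m$B == 1, m$C == 1)

# Element-wise alternatives give one value per row
row_any_pmax <- pmax(m$A, m$B, m$C) == 1
row_any_or   <- m$A == 1 | m$B == 1 | m$C == 1
row_any_sum  <- unname(rowSums(m[c("A", "B", "C")]) > 0)

length(any_overall)   # one value
length(row_any_pmax)  # one value per row
```

For 0/1 columns the three element-wise forms agree, so the choice is mostly a matter of readability.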

Transform data frame

I have a questionnaire with an open-ended question like "Please name up to ten animals", which gives me the following data frame (where each letter stands for an animal):
nrow <- 1000
list <- vector("list", nrow)
for(i in 1:nrow){
na <- rep(NA, sample(1:10, 1))
list[[i]] <- sample(c(letters, na), 10, replace=FALSE)
}
df <- data.frame()
df <- rbind(df, do.call(rbind, list))
head(df)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 r <NA> a j w e i h u z
# 2 t o e x d v <NA> z n c
# 3 f y e s n c z i u k
# 4 y <NA> v j h z p i c q
# 5 w s v f <NA> c g b x e
# 6 p <NA> a h v x k z o <NA>
How can I transform this data frame to look like the following data frame? Remember that I don't actually know the column names.
r <- 1000
c <- length(letters)
t1 <- matrix(rbinom(r*c,1,0.5),r,c)
colnames(t1) <- letters
head(t1)
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# [1,] 0 1 0 1 0 0 0 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0
# [2,] 1 1 1 1 0 1 0 1 1 1 1 0 1 0 0 0 1 1 1 0 0 1 0 1 0 1
# [3,] 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
# [4,] 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 1 0 1 1 0 0
# [5,] 1 0 1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0 0
# [6,] 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 1 1 0 1
td <- data.frame(t(apply(df, 1, function(x) as.numeric( unique(unlist(df)) %in% x))))
colnames (td) <- unique(unlist(df))
letters could be replaced with a vector of animal names colnames(t1).
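A variation on the same idea, sketched under the assumption that missing answers should not become a column of their own: collect the distinct non-missing values first, sort them, and test each row against that vector. The three-row df below is a made-up stand-in for the survey data, and animals is a name chosen here.

```r
# Small stand-in for the survey data (stringsAsFactors = FALSE for older R)
df <- data.frame(
  V1 = c("r", "t", "f"),
  V2 = c(NA, "o", "y"),
  V3 = c("a", "e", "e"),
  stringsAsFactors = FALSE
)

# Distinct answers, without NA, in alphabetical order
vals <- unlist(df, use.names = FALSE)
animals <- sort(unique(vals[!is.na(vals)]))

# One row per respondent, one 0/1 column per distinct answer
td <- t(apply(df, 1, function(x) as.integer(animals %in% x)))
colnames(td) <- animals
td
```

Because the column names are derived from the data, nothing needs to be known about the answers in advance.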
You can do the following using tidyr, which could be much faster than other approaches, though I like the approach by @germcd very much. You may need to tinker with the filter step, which removes NAs as well as blank strings (the latter may be an artifact of the simulated data you provided):
library(tidyr)
library(dplyr)  # for %>% and filter()
## Add an ID for each record:
df$id <- 1:nrow(df)
out <- df %>%
  gather(column, animal, -id) %>%
  filter(animal != " ") %>%  # rows where animal is NA are dropped too
  spread(animal, column)
head(out)
head(out)
This code gathers the unnamed columns into a long format, removes any blank or missing entries, and then spreads by the unique values of the animal column. Each cell then holds the name of the original column, so this also has the potentially desirable property of preserving the order in which the animals were named. If that's not desirable, you could easily convert the resulting animal columns to numeric:
out_num <- out
out_num[,-1] <- as.numeric((!is.na(out[,-1])))
head(out_num)
You can try mtabulate from the "qdapTools" package:
library(qdapTools)
head(mtabulate(as.data.frame(t(df))))
# c d i l m o r v x y a f s t k p u b h j n q e g w z
# 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 2 0 1 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
# 3 0 0 1 0 0 0 1 0 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0
# 4 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
# 5 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 0
# 6 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0
There are, of course, many other options.
For example, cSplit_e from my "splitstackshape" package (with the downside that, inefficiently, you need to paste the values together before you can split them):
library(splitstackshape)
library(dplyr)
As ones and zeroes:
df %>%
mutate(combined = apply(., 1, function(x) paste(na.omit(x), collapse = ","))) %>%
cSplit_e("combined", ",", mode = "binary", type = "character", fill = 0) %>%
select(starts_with("combined_")) %>%
head
# combined_a combined_b combined_c combined_d combined_e combined_f combined_g combined_h combined_i
# 1 0 0 1 1 0 0 0 0 1
# 2 1 0 0 1 0 1 0 0 0
# 3 1 0 0 0 0 0 0 0 1
# 4 0 1 1 0 0 0 0 1 1
# 5 0 1 0 1 0 0 0 1 0
# 6 0 1 0 0 0 0 0 0 0
# combined_j combined_k combined_l combined_m combined_n combined_o combined_p combined_q combined_r
# 1 0 0 1 1 0 1 0 0 1
# 2 0 0 0 1 0 0 0 0 0
# 3 0 1 0 0 0 0 1 0 1
# 4 1 0 1 0 1 0 0 0 0
# 5 0 1 0 0 1 0 1 1 1
# 6 1 1 0 1 0 0 0 1 0
# combined_s combined_t combined_u combined_v combined_w combined_x combined_y combined_z
# 1 0 0 0 1 0 1 1 0
# 2 1 1 0 0 0 0 0 0
# 3 0 1 1 0 0 1 1 0
# 4 0 0 1 0 0 0 1 0
# 5 1 0 0 0 0 0 0 0
# 6 1 1 1 0 0 0 0 0
As the original values:
df %>%
mutate(combined = apply(., 1, function(x) paste(na.omit(x), collapse = ","))) %>%
cSplit_e("combined", ",", mode = "value", type = "character", fill = "") %>%
select(starts_with("combined_")) %>%
head
# combined_a combined_b combined_c combined_d combined_e combined_f combined_g combined_h combined_i
# 1 c d i
# 2 a d f
# 3 a i
# 4 b c h i
# 5 b d h
# 6 b
# combined_j combined_k combined_l combined_m combined_n combined_o combined_p combined_q combined_r
# 1 l m o r
# 2 m
# 3 k p r
# 4 j l n
# 5 k n p q r
# 6 j k m q
# combined_s combined_t combined_u combined_v combined_w combined_x combined_y combined_z
# 1 v x y
# 2 s t
# 3 t u x y
# 4 u y
# 5 s
# 6 s t u
Alternatively, you can use "reshape2":
library(reshape2)
## The values
dcast(melt(as.matrix(df), na.rm = TRUE),
Var1 ~ value, value.var = "value")
## ones and zeroes
dcast(melt(as.matrix(df), na.rm = TRUE),
Var1 ~ value, value.var = "value", fun.aggregate = length)
