Generate new column in dataframe, based on group-event in nested groups - r

I have a dataframe with three "main"-groups (x: 1, 2, 3), three groups within the main-groups (v: 2, 3 or 1) and some events within the main-groups (0 and 1 in y):
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
v <- c(2, 3, 3, 2, 2, 1, 1, 2, 2)
y <- c(0, 0, 1, 0, 0, 0, 0, 0, 1)
df <- data.frame(x, v, y)
df
> df
x v y
1 1 2 0
2 1 3 0
3 1 3 1
4 2 2 0
5 2 2 0
6 3 1 0
7 3 1 0
8 3 2 0
9 3 2 1
For example: In group 1 (x = 1) there are two more groups (v = 2 and v = 3), event y = 1 happens in group x = 1 and v = 3.
Now I want to generate a new column z based on the events in y: if there is any y = 1 within a group (a combination of x and v), all rows of that group should get z = 1; otherwise NA. How can z be generated this way? df should look like:
> df
x v y z
1 1 2 0 NA
2 1 3 0 1
3 1 3 1 1
4 2 2 0 NA
5 2 2 0 NA
6 3 1 0 NA
7 3 1 0 NA
8 3 2 0 1
9 3 2 1 1
I am grateful for any help.

df %>% group_by(x, v) %>% mutate(z = if(any(y == 1)) 1 else NA)
After grouping by x and v, the new column z is filled with 1 if there is any 1 in y within the group, and with NA otherwise.

Try this:
library(dplyr)
df %>%
group_by(x, v) %>%
mutate(
z = ifelse(any(y == 1), 1, NA)
)
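For readers without dplyr, the same per-group rule can be sketched in base R. This is only an illustration under the assumption that a "group" means one combination of x and v; the helper name has_event is made up for this example and is not part of either answer above.

```r
# Data from the question
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
v <- c(2, 3, 3, 2, 2, 1, 1, 2, 2)
y <- c(0, 0, 1, 0, 0, 0, 0, 0, 1)
df <- data.frame(x, v, y)

# For every row: does its (x, v) group contain at least one y == 1?
# ave() applies any() per group and recycles the result over the group's rows.
has_event <- ave(df$y == 1, df$x, df$v, FUN = any)

# 1 for rows in groups with an event, NA for all other rows
df$z <- ifelse(has_event, 1, NA)
df
```

With the data above, only the groups (x = 1, v = 3) and (x = 3, v = 2) contain an event, so rows 2, 3, 8 and 9 get z = 1.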

Related

Iterating over columns to create flagging variables

I've got a dataset that has a lot of numerical columns (in the example below these columns are x, y, z). I want to create individual flagging variables for each of those columns (x_YN, y_YN, z_YN) such that, if the numerical column is > 0, the flagging variable is = 1 and otherwise it's = 0. What might be the most efficient way to tackle this?
Thanks for the help!
x <- c(3, 7, 0, 10)
y <- c(5, 2, 20, 0)
z <- c(0, 0, 4, 12)
df <- data.frame(x,y,z)
We may use a logical matrix and coerce
df[paste0(names(df), "_YN")] <- +(df > 0)
-output
> df
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1
The dplyr alternative:
library(dplyr)
df %>%
mutate(across(everything(), ~ +(.x > 0), .names = "{col}_YN"))
output
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1
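To make the "logical matrix and coerce" step more explicit, here is a small sketch that splits the base R one-liner into its parts. The intermediate names is_positive, flags and result are assumptions introduced for illustration only.

```r
df <- data.frame(x = c(3, 7, 0, 10),
                 y = c(5, 2, 20, 0),
                 z = c(0, 0, 4, 12))

# Comparing a data frame with a scalar gives a logical matrix
is_positive <- df > 0

# Unary plus coerces TRUE/FALSE to 1/0 (an integer matrix)
flags <- +is_positive

# Name the flag columns and bind them to the original data
colnames(flags) <- paste0(colnames(flags), "_YN")
result <- cbind(df, flags)
result
```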

How to find the most commonly occurring combinations of Boolean variables by row in R

I have a series of 14 Boolean variables and I would like to find the top 3 combinations of 3 or more variables (where the value == 1).
Sample data:
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
var1 = c(0, 0, 1, 1, 1, 0, 0, 1),
var2 = c(1, 0, 0, 1, 1, 1, 1, 0),
var3 = c(0, 0, 1, 1, 1, 1, 0, 0),
var4 = c(1, 1, 1, 1, 1, 0, 1, 1),
var5 = c(0, 0, 0, 1, 1, 0, 1, 1)
)
df
> df
ID var1 var2 var3 var4 var5
1 1 0 1 0 1 0
2 2 0 0 0 1 0
3 3 1 0 1 1 0
4 4 1 1 1 1 1
5 5 1 1 1 1 1
6 6 0 1 1 0 0
7 7 0 1 0 1 1
8 8 1 0 0 1 1
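I found a solution to bring all column names together per unique occurrence:
library(plyr)      # ddply(); load before dplyr to avoid masking problems
library(reshape2)  # melt()
library(dplyr)     # %>%
# Bring to long format
df_long <- df %>%
melt(id.vars = "ID")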
# Collapse the variables that have a '1' together per row
df_combo <- ddply(df_long, "ID", summarize,
combos = paste(variable[value == 1], collapse = "/"))
> df_combo
ID combos
1 1 var2/var4
2 2 var4
3 3 var1/var3/var4
4 4 var1/var2/var3/var4/var5
5 5 var1/var2/var3/var4/var5
6 6 var2/var3
7 7 var2/var4/var5
8 8 var1/var4/var5
If I only wanted counts on unique combinations this would be fine, but I would like to know the number of times each combination of 3 or more variables occurs, even in cases where other variables also occur. The combination (var1/var4/var5) occurs 3 times in the above example, but twice it occurs next to two other variables.
There must be an easy way to extract this information, just can't think of it. Thank you for your help!!
An attempt, using combn as the workhorse function.
arr <- which(df[-1] == 1, arr.ind=TRUE)
tmp <- tapply(arr[,"col"], arr[,"row"],
FUN=function(x) if (length(x) >= 3) combn(x,3, simplify=FALSE) )
tmp <- data.frame(do.call(rbind, unlist(tmp, rec=FALSE)))
aggregate(count ~ . , cbind(tmp, count=1), sum)
## X1 X2 X3 count
##1 1 2 3 2
##2 1 2 4 2
##3 1 3 4 3
##4 2 3 4 2
##5 1 2 5 2
##6 1 3 5 2
##7 2 3 5 2
##8 1 4 5 3
##9 2 4 5 3
##10 3 4 5 2
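The output above identifies the variables by column position. A possible variation, sketched here with the same combn idea, keeps the variable names and returns the most frequent triples directly. The names vars, triples and counts are assumptions made for this example.

```r
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
                 var1 = c(0, 0, 1, 1, 1, 0, 0, 1),
                 var2 = c(1, 0, 0, 1, 1, 1, 1, 0),
                 var3 = c(0, 0, 1, 1, 1, 1, 0, 0),
                 var4 = c(1, 1, 1, 1, 1, 0, 1, 1),
                 var5 = c(0, 0, 0, 1, 1, 0, 1, 1))

vars <- names(df)[-1]

# For each row, list every 3-variable combination among the variables set to 1
triples <- unlist(lapply(seq_len(nrow(df)), function(i) {
  on <- vars[unlist(df[i, -1]) == 1]
  if (length(on) >= 3) combn(on, 3, paste, collapse = "/") else character(0)
}))

# Count how often each triple occurs and show the three most frequent
counts <- sort(table(triples), decreasing = TRUE)
head(counts, 3)
```

For the sample data, three triples occur three times each (var1/var3/var4, var1/var4/var5 and var2/var4/var5), which agrees with the counts in the answer above.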

Mutate a column in data.table with ifelse and group by

I have some dplyr code that I'm moving to data.table, and I just ran into this problem. I want the difference from one row to the next in b stored in column c if a is greater than or equal to 3. However, after running this code:
df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
b = c(0, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))
library(data.table)
setDT(df)
df[ , c := ifelse(a >= 3, c(0, diff(b)), b), by = .(a)]
all the elements in c are 0. Why is this?
df
a b c
1: 1 0 0
2: 1 1 0
3: 1 0 0
4: 1 1 0
5: 2 0 0
6: 2 1 0
7: 2 1 0
8: 3 0 0
9: 3 3 0
10: 3 4 0
11: 3 5 0
What I thought was the equivalent dplyr:
df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
b = c(0, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))
df %>%
group_by(a) %>%
mutate(c = ifelse( a >= 3, c(0, diff(b)), b))
From the help for ifelse(test, yes, no), it should return...
A vector of the same length and attributes (including dimensions and "class") as test and data values from the values of yes or no. The mode of the answer will be coerced from logical to accommodate first any values taken from yes and then any values taken from no.
However:
> df %>% group_by(a) %>% do(print(.$a))
[1] 1 1 1 1
[1] 2 2 2
[1] 3 3 3 3
> data.table(df)[, print(a), by=a]
[1] 1
[1] 2
[1] 3
As explained in the help pages, since the first argument has a length of one, if you pass vectors for the other parts, only their first element is used:
> ifelse(TRUE, 1:10, eleventy + million)
[1] 1
You should probably use if ... else ... when working with a constant value, like...
> data.table(df)[, b := if (a >= 3) c(0, diff(b)) else b, by=a]
or even better, in this case you can assign to a subset:
> data.table(df)[a >= 3, b := c(0, diff(b)), by=a]
Regarding why a has length 1 for the data.table idiom, see its FAQ question "Inside each group, why are the group variables length-1?"
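The difference between ifelse() and if ... else with a length-one test can be checked in plain R, without data.table. The sketch below mimics what happens inside the a = 3 group, where the grouping variable is a single value; the names a_in_group, from_ifelse and from_if are made up for this illustration.

```r
# Values of b inside the group a == 3; within a data.table group,
# the grouping variable a is a single value
b <- c(0, 3, 4, 5)
a_in_group <- 3

# ifelse() returns something shaped like its test, so a length-1 test
# yields only the first element of the chosen branch
from_ifelse <- ifelse(a_in_group >= 3, c(0, diff(b)), b)

# if ... else returns the whole chosen vector
from_if <- if (a_in_group >= 3) c(0, diff(b)) else b

from_ifelse  # 0
from_if      # 0 3 1 1
```

The single value returned by ifelse() is then recycled over all rows of the group, which is why c ended up constant within each group.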
I am creating a dataset which has non-zero values for b as the first element of each group by a to illustrate better. Your previous dataset had all zeros and also c(0,diff(b)) was starting with zero so it was hard to differentiate.
What happens here is that output of ifelse is a vector of length 1.
library(data.table)
df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
b = c(10, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))
Look below:
setDT(df)[ , c := ifelse(a >= 3, c(0, diff(b)), b), by = .(a)][]
#> a b c
#> 1: 1 10 10
#> 2: 1 1 10
#> 3: 1 0 10
#> 4: 1 1 10
#> 5: 2 0 0
#> 6: 2 1 0
#> 7: 2 1 0
#> 8: 3 0 0
#> 9: 3 3 0
#> 10: 3 4 0
#> 11: 3 5 0
Now, let's look at some other examples; here I am using a simple vector of length 4 (instead of c(0,diff(b))):
setDT(df)[ , c := ifelse(a >= 3L, c(20,2,3,4), -999), by=a][]
#> a b c
#> 1: 1 10 -999
#> 2: 1 1 -999
#> 3: 1 0 -999
#> 4: 1 1 -999
#> 5: 2 0 -999
#> 6: 2 1 -999
#> 7: 2 1 -999
#> 8: 3 0 20
#> 9: 3 3 20
#> 10: 3 4 20
#> 11: 3 5 20
You see that still the first element is getting assigned to all the rows of c for that group of a.
A work-around is using diff on a to see when it's not changing (i.e. diff(a)==0) and use that as a pseudo-grouping along with the other condition; like below:
setDT(df)[, c := ifelse(a >= 3 & c(F,diff(a)==0), c(0,diff(b)), b)][]
#> a b c
#> 1: 1 10 10
#> 2: 1 1 1
#> 3: 1 0 0
#> 4: 1 1 1
#> 5: 2 0 0
#> 6: 2 1 1
#> 7: 2 1 1
#> 8: 3 0 0
#> 9: 3 3 3
#> 10: 3 4 1
#> 11: 3 5 1
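As an alternative sketch, not taken from the answers above, the per-group differences can be computed first and the condition applied afterwards on the full columns, so that ifelse() sees vectors of equal length. This uses base R's ave() and assumes the rows of each group are already in the desired order.

```r
a <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
b <- c(10, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5)
df <- data.frame(a, b)

# Row-to-row difference of b within each group of a (first row of a group gets 0)
group_diff <- ave(df$b, df$a, FUN = function(v) c(0, diff(v)))

# Vectorised choice over all rows: the difference where a >= 3, b otherwise
df$c <- ifelse(df$a >= 3, group_diff, df$b)
df
```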

R: modify a variable conditioned on data from multiple previous rows

Hi I would really appreciate some help for this, I really couldn't find the solution in previous questions.
I have a tibble in long format (rows grouped by id and arranged by time).
I want to create a variable "eleg" based on "varx". The condition would be that "eleg" = 1 if "varx" in the previous 3 rows == 0 and in the current row varx == 1, if not = 0, for each ID. If possible using dplyr.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3)
time <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,1,2,3,4)
varx <- c(0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,0,1)
eleg <- c(0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1)
table <- data.frame(id, time, varx, eleg)
In my real dataset the condition is "in the previous 24 rows" and the same ID could have eleg == 1 more than one time if it suits the condition.
Thank you.
One approach could be:
library(dplyr)
m <- 3 # number of previous rows to look back
df %>%
group_by(id) %>%
mutate(eleg = ifelse(rowSums(sapply(1:m, function(k) lag(varx, n = k, order_by = id, default = 1) == 0)) == m & varx == 1,
1,
0)) %>%
data.frame()
which gives
id time varx eleg
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 1 4 0 0
5 1 5 1 1
6 1 6 1 0
7 1 7 0 0
8 2 1 0 0
9 2 2 1 0
10 2 3 1 0
11 2 4 1 0
12 2 5 1 0
13 2 6 1 0
14 3 1 0 0
15 3 2 0 0
16 3 3 0 0
17 3 4 1 1
Sample data:
df <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3), time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,
1, 2, 3, 4), varx = c(0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 1)), .Names = c("id", "time", "varx"), row.names = c(NA,
-17L), class = "data.frame")
library(dplyr)
library(data.table)
# Note: this version does not group by id, so the look-back can cross id boundaries
df %>%
mutate(elegnew = ifelse(Reduce("+", shift(varx, 1:3)) == 0 & varx == 1, 1, 0))
id time varx eleg elegnew
1 1 1 0 0 0
2 1 2 0 0 0
3 1 3 0 0 0
4 1 4 0 0 0
5 1 5 1 1 1
6 1 6 1 0 0
7 1 7 0 0 0
8 2 1 0 0 0
9 2 2 1 0 0
10 2 3 1 0 0
11 2 4 1 0 0
12 2 5 1 0 0
13 2 6 1 0 0
14 3 1 0 0 0
15 3 2 0 0 0
16 3 3 0 0 0
17 3 4 1 1 1
Here's another approach, using dplyr and zoo:
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(elegnew = as.integer(varx == 1 &
rollsum(varx == 1, k = 4, align = "right", fill = 0) == 1))
# # A tibble: 17 x 5
# # Groups: id [3]
# id time varx eleg elegnew
# <dbl> <dbl> <dbl> <dbl> <int>
# 1 1. 1. 0. 0. 0
# 2 1. 2. 0. 0. 0
# 3 1. 3. 0. 0. 0
# 4 1. 4. 0. 0. 0
# 5 1. 5. 1. 1. 1
# 6 1. 6. 1. 0. 0
# 7 1. 7. 0. 0. 0
# 8 2. 1. 0. 0. 0
# 9 2. 2. 1. 0. 0
# 10 2. 3. 1. 0. 0
# 11 2. 4. 1. 0. 0
# 12 2. 5. 1. 0. 0
# 13 2. 6. 1. 0. 0
# 14 3. 1. 0. 0. 0
# 15 3. 2. 0. 0. 0
# 16 3. 3. 0. 0. 0
# 17 3. 4. 1. 1. 1
The idea is to group by id and then check a) whether varx is 1 and b) whether the sum of varx=1 events in the previous 3 plus current row (k=4) is 1 (which means all previous 3 must be 0). I assume that varx is either 0 or 1.
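The same window idea can be sketched without zoo, using a cumulative sum per id. This is an illustration only: the function name flag_after_zeros is hypothetical, and like the answer above it assumes varx contains only 0 and 1 and that rows are ordered by time within each id.

```r
id   <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3)
varx <- c(0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,0,1)

# Flag positions where v is 1 and the n preceding values are all 0
flag_after_zeros <- function(v, n = 3) {
  cs <- c(0, cumsum(v))           # cs[k + 1] is the sum of v[1:k]
  i <- seq_along(v)
  enough <- i > n                 # rows with at least n predecessors
  prev_sum <- rep(NA_real_, length(v))
  # sum of v[(i - n):(i - 1)] expressed via the cumulative sum
  prev_sum[enough] <- cs[i[enough]] - cs[i[enough] - n]
  as.integer(enough & v == 1 & prev_sum == 0)
}

# Apply the helper separately within each id
eleg <- ave(varx, id, FUN = flag_after_zeros)
eleg
```

For a look-back of 24 rows, an anonymous wrapper such as FUN = function(v) flag_after_zeros(v, n = 24) could be passed to ave() instead.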
You have asked for a dplyr solution, preferably.
The following is a base R one, with a function that you can adapt to "in the previous 24 rows", just pass n = 24 to the function.
fun <- function(DF, crit = "varx", new = "eleg", n = 3){
DF[[new]] <- 0
for(i in seq_len(nrow(DF))[-seq_len(n)]){
if(all(DF[[crit]][(i - n):(i - 1)] == 0) && DF[[crit]][i] == 1)
DF[[new]][i] <- 1
}
DF
}
sp <- split(table[-4], table[-4]$id)
new_df <- do.call(rbind, lapply(sp, fun))
row.names(new_df) <- NULL
identical(table, new_df)
#[1] TRUE
Note that if you are creating a new column, eleg, you would probably not need to split table[-4], just table since the 4th column wouldn't exist yet.
You could do do.call(rbind, lapply(sp, fun, n = 24)) and the rest would be the same.

R - assigning values to data frame subsets in nested for loop

R version 3.3.2
I am trying to assign certain values to an empty variable of my data frame, using a nested for loop, according to the values of other variables of that data frame. However the output isn't what I expected.
Here is a reproducible example:
id <- c("ID61", "ID61", "ID63", "ID69", "ID69", "ID69", "ID69", "ID69", "ID80", "ID80", "ID80", "ID81", "ID81", "ID81", "ID81")
Round <- c(1, 2, 1, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4)
nrPosRound <- c(2, 0, 2, 15, 8, 4, 4, 0, 3, 1, 1, 0, 0, 0, 0)
Y <- rep(NA, 15)
df <- data.frame(id, Round, nrPosRound, Y)
The data frame I've got looks like this:
> df
id Round nrPosRound Y
1 ID61 1 2 NA
2 ID61 2 0 NA
3 ID63 1 2 NA
4 ID69 1 15 NA
5 ID69 2 8 NA
6 ID69 3 4 NA
7 ID69 4 4 NA
8 ID69 5 0 NA
9 ID80 1 3 NA
10 ID80 2 1 NA
11 ID80 3 1 NA
12 ID81 1 0 NA
13 ID81 2 0 NA
14 ID81 3 0 NA
15 ID81 4 0 NA
And I would like it to look like this, after the nested for loop:
> df
id Round nrPosRound Y
1 ID61 1 2 FP
2 ID61 2 0 FP
3 ID63 1 2 FP
4 ID69 1 15 FP
5 ID69 2 8 FP
6 ID69 3 4 FP
7 ID69 4 4 FP
8 ID69 5 0 FP
9 ID80 1 3 1
10 ID80 2 1 1
11 ID80 3 1 1
12 ID81 1 0 0
13 ID81 2 0 0
14 ID81 3 0 0
15 ID81 4 0 0
What I want is to assign the value '1' to the variable 'Y' if, for the same 'id', in a certain 'Round', there are 3 or more Positives (nrPosRound >= 3) and in the following rounds there is at least 1 Positive (nrPosRound >= 1).
'Y' would be assigned the value '0' if, in every 'Round' for the same 'id' the 'nrPosRound' is '0'.
'Y' should be assigned 'FP' (False Positive) if the previous conditions aren't met.
If there is only 1 'Round' for that 'id', 'Y' would have the value '1' if the 'nrPosRound' is >= 3; value '0' if 'nrPosRound' == 0; value 'FP' otherwise (0 < 'nrPosRound' < 3).
Here is my code, with the nested for loop:
for (i in 1:nrow(df)) {
current_id <- df$id[i]
id_group <- df[df$id == current_id, ]
for (j in 1:nrow(id_group)) {
current_Round <- id_group$Round[j]
remainder_Rounds <- id_group$Round[(j+1):nrow(id_group)]
current_nrPos <- id_group$nrPosRound[id_group$Round == current_Round]
remainder_nrPos <- id_group$nrPosRound[id_group$Round %in% remainder_Rounds]
ifelse(current_nrPos >= 3 & remainder_nrPos >= 1,
df$Y[i] <- 1, ifelse(current_nrPos == 0 & remainder_nrPos == 0,
df$Y[i] <- 0, "FP"))
}
}
I think the problem is related to 'remainder_nrPos', since the 2nd ifelse doesn't work like I was hoping. I tried numerous ways but don't seem to be able to make it work like I intended. Any help is appreciated!
This can be done with dplyr. In the following code, I first group_by id.
I create an intermediary variable min_from_last to see if there was a zero after each round. To do this, I first reorder from last with arrange(desc(Round)).
After that I use cummin to get the cumulative min.
Then, I reorder the data and perform three ifelse to get the result you want. BTW, you may not need the second ifelse as it will be caught by the first one, but I included it as it was in your question.
id <- c("ID61", "ID61", "ID63", "ID69", "ID69", "ID69", "ID69", "ID69", "ID80", "ID80", "ID80", "ID81", "ID81", "ID81", "ID81")
Round <- c(1, 2, 1, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4)
nrPosRound <- c(2, 0, 2, 15, 8, 4, 4, 0, 3, 1, 1, 0, 0, 0, 0)
df1 <- data.frame(id, Round, nrPosRound,stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
group_by(id) %>%
arrange(desc(Round)) %>%
mutate(min_from_last=cummin(nrPosRound)) %>%
arrange(Round) %>%
mutate(Y= ifelse(max(nrPosRound)>=3 & min_from_last>0 ,"1",
ifelse(n()==1 & nrPosRound>=3,"1",
ifelse(max(nrPosRound)==0,"0","FP"))))
id Round nrPosRound min_from_last Y
(chr) (dbl) (dbl) (dbl) (chr)
1 ID61 1 2 0 FP
2 ID61 2 0 0 FP
3 ID63 1 2 2 FP
4 ID69 1 15 0 FP
5 ID69 2 8 0 FP
6 ID69 3 4 0 FP
7 ID69 4 4 0 FP
8 ID69 5 0 0 FP
9 ID80 1 3 1 1
10 ID80 2 1 1 1
11 ID80 3 1 1 1
12 ID81 1 0 0 0
13 ID81 2 0 0 0
14 ID81 3 0 0 0
15 ID81 4 0 0 0
Here's a base R solution.
id.vals <- unique(df$id)
for (i in 1:length(id.vals)) {
group.ind <- df$id == id.vals[i]
id_group <- df[group.ind, 'nrPosRound']
n <- length(id_group)
Y <- rep(NA, n)
g3 <- any(id_group >= 3)
a0 <- all(id_group == 0)
for (j in 1:n) {
if (g3 & all(id_group[j:n] >= 1)) Y[j] <- 1
else if (a0) Y[j] <- 0
else Y[j] <- 'FP'
}
df$Y[group.ind] <- Y
}
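The two answers above can also be combined into a loop-free base R sketch. It is only one possible reading of the rules: it assumes the rows are already sorted by Round within each id, and the names min_from_here and grp_max are invented for this example.

```r
id <- c("ID61", "ID61", "ID63", "ID69", "ID69", "ID69", "ID69", "ID69",
        "ID80", "ID80", "ID80", "ID81", "ID81", "ID81", "ID81")
Round <- c(1, 2, 1, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4)
nrPosRound <- c(2, 0, 2, 15, 8, 4, 4, 0, 3, 1, 1, 0, 0, 0, 0)
df <- data.frame(id, Round, nrPosRound, stringsAsFactors = FALSE)

# Smallest nrPosRound from the current round to the last round of the same id
min_from_here <- ave(df$nrPosRound, df$id,
                     FUN = function(v) rev(cummin(rev(v))))

# Largest nrPosRound over all rounds of the same id
grp_max <- ave(df$nrPosRound, df$id, FUN = max)

# "1": some round has >= 3 positives and every round from here on has >= 1
# "0": no positives in any round of this id
# "FP": everything else
df$Y <- ifelse(grp_max >= 3 & min_from_here >= 1, "1",
               ifelse(grp_max == 0, "0", "FP"))
df
```

On the example data this reproduces the desired Y column: FP for ID61, ID63 and ID69, 1 for ID80 and 0 for ID81.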
