Create an index variable for blocks of values - r

I have a dataframe "Data" with a grouping variable "grp" and a binary classification variable "classif". For each group in grp, I want to create a "result" variable that indexes the separate blocks of 0s in the classif variable. So far I don't know how to reset the count for each level of the grouping variable, and I haven't found a way to create the index only for blocks of 0s (ignoring the 1s).
Example data:
grp <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
classif <- c(0,1,0,0,1,0,0,1,1,0,0,0,0,1,0,1,1,1,0,0,1,1,0,0,0,1,0,1,0)
result <- c(1,0,2,2,0,3,3,0,0,1,1,1,1,0,2,0,0,0,3,3,0,0,1,1,1,0,2,0,3)
wrong_result <- c(1,2,3,3,4,5,5,1,1,2,2,2,2,3,4,5,5,5,6,6,1,1,2,2,2,3,4,5,6)
Data <- data.frame(grp, classif, result, wrong_result)
I have tried data.table's rleid, but the following commands produce "wrong_result", which is not what I'm after.
library(data.table)
setDT(Data)
Data[, wrong_result := rleid(classif)]
Data[, wrong_result := rleid(classif), by = grp]

With dplyr, use cumsum() and lag() to find blocks of zeroes per group via the .by argument (available since dplyr 1.1.0). The cumulative sum increments by one exactly on the rows where a zero-block starts, i.e. where classif is 0 and the previous value was 1.
library(dplyr)
Data %>%
  mutate(
    result2 = ifelse(
      classif == 0,
      cumsum(classif == 0 & lag(classif, default = 1) == 1),
      0
    ),
    .by = grp
  )
grp classif result result2
1 1 0 1 1
2 1 1 0 0
3 1 0 2 2
4 1 0 2 2
5 1 1 0 0
6 1 0 3 3
7 1 0 3 3
8 2 1 0 0
9 2 1 0 0
10 2 0 1 1
11 2 0 1 1
12 2 0 1 1
13 2 0 1 1
14 2 1 0 0
15 2 0 2 2
16 2 1 0 0
17 2 1 0 0
18 2 1 0 0
19 2 0 3 3
20 2 0 3 3
21 3 1 0 0
22 3 1 0 0
23 3 0 1 1
24 3 0 1 1
25 3 0 1 1
26 3 1 0 0
27 3 0 2 2
28 3 1 0 0
29 3 0 3 3
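Since the question mentions rleid, a data.table sketch of the same logic is also possible (my addition, not from the original answers): number every run with rleid(), then renumber only the zero-runs within each group.
library(data.table)
setDT(Data)  # no-op if Data is already a data.table
Data[, result3 := {
  r   <- rleid(classif)          # id of each run, 0-runs and 1-runs alike
  z   <- classif == 0            # rows belonging to a zero-run
  idx <- match(r, unique(r[z]))  # sequential index of each zero-run
  fifelse(z, idx, 0L)            # rows in 1-runs get 0
}, by = grp]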

Use rle, sequentially number the runs of 0s it produces, convert back with inverse.rle, and zero out the runs of 1s. No packages are used.
seq0 <- function(x) {
  r <- rle(x)
  is0 <- r$values == 0
  r$values[is0] <- seq_len(sum(is0))  # number the zero-runs 1, 2, 3, ...
  inverse.rle(r) * !x                 # expand back; !x zeroes out the 1-runs
}
transform(Data, result2 = ave(classif, grp, FUN = seq0))
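A quick sanity check (my addition) that this reproduces the expected column:
res <- ave(Data$classif, Data$grp, FUN = seq0)
all(res == Data$result)
# [1] TRUE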

Related

Create a new variable from conditions on multiple variables in R using ifelse condition

I am really new to R.
I have a table named RWA2010LONG containing 65 variables. I want to create a new variable named NEWVAR from the 30th to 49th variables of RWA2010LONG and from another variable in the same table named BIRTH, based on the following condition: for each row, if any of the values in variables 30:49 equals the value of BIRTH, NEWVAR takes the value 1; otherwise, NEWVAR takes the value 0.
This is what I tried to do:
RWA2010LONG$NEWVAR <- for (i in colnames(RWA2010LONG[30:49])) {ifelse(i == RWA2010LONG$BIRTH, 1, 0)}
Here is an approach. Since you didn't provide data, I am making up some sample data and checking if any values in columns 5 through 10 are the same as BIRTH:
#Example data
df <- data.frame(matrix(rnbinom(100, mu = 5, size = 0.5), ncol = 10,
                        dimnames = list(sprintf("obs_%s", 1:10),
                                        c("BIRTH", sprintf("col_%s", 2:10)))))
df$newvar <- apply(df[,5:10] == df$BIRTH, 1, any)*1
The apply statement checks the condition df[,5:10] == df$BIRTH by row (the 1 tells apply to work row-wise; for future reference, a 2 would make it work column-wise) and returns TRUE or FALSE for each row. The *1 converts those logicals to numeric values.
Output:
# BIRTH col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10 newvar
# obs_1 0 3 4 0 6 18 0 10 5 7 1
# obs_2 5 1 0 7 5 0 2 2 2 3 1
# obs_3 1 2 4 2 1 13 14 1 2 8 1
# obs_4 1 0 0 0 11 0 0 0 15 0 0
# obs_5 1 9 1 0 4 27 2 1 0 0 1
# obs_6 5 1 0 2 0 7 2 4 0 0 0
# obs_7 1 0 0 6 0 0 1 5 0 0 1
# obs_8 44 0 15 1 1 0 1 1 0 6 0
# obs_9 6 6 0 0 0 4 17 0 6 1 1
# obs_10 0 2 0 0 2 11 2 1 9 2 0
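As a minimal illustration of the MARGIN argument described above (a toy example, not part of the original answer):
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)  # MARGIN = 1: one sum per row
# [1]  9 12
apply(m, 2, sum)  # MARGIN = 2: one sum per column
# [1]  3  7 11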
An option with if_any
library(dplyr)
df <- df %>%
  mutate(newvar = +(if_any(5:10, ~ .x == BIRTH)))
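The unary + is a compact way to coerce the logical returned by if_any() to 0/1; as.integer() would do the same:
+c(TRUE, FALSE, TRUE)
# [1] 1 0 1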

How to count number of columns that have a value by a grouping variable in R?

I have data like this:
repetition Ob1 Ob2 Ob3 Ob4
1 0 0 0 1
1 0 0 3 0
1 1 3 3 0
1 2 3 3 0
2 4 0 2 2
2 4 0 3 0
2 0 0 0 0
3 0 0 0 0
3 4 0 4 0
3 0 0 0 0
I want to count the number of columns per repetition that contain a certain value, e.g. 1.
So in this case repetition 1 should return 2, because Ob1 and Ob4 contain a 1. Every other repetition gets a 0 because no other repetition contains a 1.
You can get the count using rowSums() together with the dplyr package, as in the code below. Note this counts matching cells rather than columns; here each Ob column contains a 1 at most once per repetition, so the two coincide:
df$count <- rowSums(df[, 2:5] == 1)
df %>% select(repetition, count) %>% group_by(repetition) %>% summarise(count = sum(count))
# A tibble: 3 x 2
repetition count
<int> <dbl>
1 1 2
2 2 0
3 3 0
You can use by (with the data frame named df, as above) like:
by(df[-1] == 1, df$repetition, function(y) sum(colSums(y) > 0))
#INDICES: 1
#[1] 2
#------------------------------------------------------------
#INDICES: 2
#[1] 0
#------------------------------------------------------------
#INDICES: 3
#[1] 0
or to return a named vector
c(by(df[-1] == 1, df$repetition, function(y) sum(colSums(y) > 0)))
#1 2 3
#2 0 0
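For completeness, a dplyr sketch of the same per-repetition column count (my addition, assuming the data frame is named df as above):
library(dplyr)
df %>%
  group_by(repetition) %>%
  summarise(count = sum(colSums(across(Ob1:Ob4) == 1) > 0))
# A tibble: 3 x 2
# repetition count
# 1          1     2
# 2          2     0
# 3          3     0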

How do I create this variable in R?

Consider the following test data set using R:
testdat <- data.frame(id = c(rep(1, 5), rep(2, 5), rep(3, 5)),
                      period = rep(1:5, 3),
                      treat = c(c(0,1,1,1,0), c(0,0,1,1,1), c(0,0,1,1,1)),
                      state = c(rep(0, 5), c(0,1,1,1,1), c(0,0,0,1,1)),
                      int = c(rep(0, 13), 1, 1))
testdat
id period treat state int
1 1 1 0 0 0
2 1 2 1 0 0
3 1 3 1 0 0
4 1 4 1 0 0
5 1 5 0 0 0
6 2 1 0 0 0
7 2 2 0 1 0
8 2 3 1 1 0
9 2 4 1 1 0
10 2 5 1 1 0
11 3 1 0 0 0
12 3 2 0 0 0
13 3 3 1 0 0
14 3 4 1 1 1
15 3 5 1 1 1
The first 4 variables are what I have, int is the variable I want to make. It is similar to an interaction between treat and state, but that would include 1s in rows 8-10 which is not desired. Essentially, I only want an interaction when state changes during treat but not otherwise. Any thoughts on how to create this (especially on a large scale for a dataset with a million observations)?
Edit: For clarification on why I want this measure. I want to run something like the following regression:
lm(outcome~treat+state+I(treat*state))
But I'm really interested in the interaction only when treat straddles a change in state. If I ran the above regression, I(treat*state) would pool the effect I'm interested in with the cases where treat occurs entirely while state is 1. In theory these should have two different effects, so I need to disaggregate them. I hope this makes sense; I'm happy to provide additional details.
I'm sure this is possible in base R, but here's a tidyverse version:
library(dplyr)
testdat %>%
  group_by(grp = cumsum(c(FALSE, diff(treat) > 0))) %>%
  mutate(int2 = +(state > 0 & first(state) == 0 & treat > 0)) %>%
  ungroup() %>%
  select(-grp)
# # A tibble: 15 x 6
# id period treat state int int2
# <dbl> <int> <dbl> <dbl> <dbl> <int>
# 1 1 1 0 0 0 0
# 2 1 2 1 0 0 0
# 3 1 3 1 0 0 0
# 4 1 4 1 0 0 0
# 5 1 5 0 0 0 0
# 6 2 1 0 0 0 0
# 7 2 2 0 1 0 0
# 8 2 3 1 1 0 0
# 9 2 4 1 1 0 0
# 10 2 5 1 1 0 0
# 11 3 1 0 0 0 0
# 12 3 2 0 0 0 0
# 13 3 3 1 0 0 0
# 14 3 4 1 1 1 1
# 15 3 5 1 1 1 1
An alternative grouping logic uses run-length encoding, effectively the same idea (as suggested in https://stackoverflow.com/a/35313426):
testdat %>%
  group_by(grp = { yy <- rle(treat); rep(seq_along(yy$lengths), yy$lengths) }) %>%
  # ...
And as in that answer, I wish dplyr had an equivalent to data.table's rleid. The desired behavior is grouping by runs of consecutive identical values in a column, not by each distinct value across all rows. If you look at this mid-pipe (before cleaning up grp), you'd see
testdat %>%
  group_by(grp = { yy <- rle(treat); rep(seq_along(yy$lengths), yy$lengths) }) %>%
  mutate(int2 = +(state > 0 & first(state) == 0 & treat > 0)) %>%
  ungroup()
# # A tibble: 15 x 7
# id period treat state int grp int2
# <dbl> <int> <dbl> <dbl> <dbl> <int> <int>
# 1 1 1 0 0 0 1 0
# 2 1 2 1 0 0 2 0
# 3 1 3 1 0 0 2 0
# 4 1 4 1 0 0 2 0
# 5 1 5 0 0 0 3 0
# 6 2 1 0 0 0 3 0
# 7 2 2 0 1 0 3 0
# 8 2 3 1 1 0 4 0
# 9 2 4 1 1 0 4 0
# 10 2 5 1 1 0 4 0
# 11 3 1 0 0 0 5 0
# 12 3 2 0 0 0 5 0
# 13 3 3 1 0 0 6 0
# 14 3 4 1 1 1 6 1
# 15 3 5 1 1 1 6 1
But that's just wishful thinking. I guess I could also do
my_rleid <- function(x) { yy <- rle(x); rep(seq_along(yy$lengths), yy$lengths); }
testdat %>%
  group_by(grp = my_rleid(treat)) %>%
  # ...
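Alternatively, if data.table is installed, its rleid() can be borrowed directly inside the pipe, avoiding the helper:
testdat %>%
  group_by(grp = data.table::rleid(treat)) %>%
  mutate(int2 = +(state > 0 & first(state) == 0 & treat > 0)) %>%
  ungroup() %>%
  select(-grp)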
Here is a base R way using rle and ave.
r <- rle(testdat$treat)
r$values <- cumsum(r$values) + seq_along(r$values)  # give every run a distinct id
int2 <- +(ave(testdat$state, inverse.rle(r),
              FUN = function(x) x != x[1]) & testdat$treat == 1)
testdat <- cbind(testdat, int2)
testdat
# id period treat state int int2
#1 1 1 0 0 0 0
#2 1 2 1 0 0 0
#3 1 3 1 0 0 0
#4 1 4 1 0 0 0
#5 1 5 0 0 0 0
#6 2 1 0 0 0 0
#7 2 2 0 1 0 0
#8 2 3 1 1 0 0
#9 2 4 1 1 0 0
#10 2 5 1 1 0 0
#11 3 1 0 0 0 0
#12 3 2 0 0 0 0
#13 3 3 1 0 0 0
#14 3 4 1 1 1 1
#15 3 5 1 1 1 1
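To see why the renumbering gives every run a distinct id, here is what the cumsum trick produces on testdat: a strictly increasing sequence, one unique id per run (an illustration I've added):
r <- rle(testdat$treat)
r$values
# [1] 0 1 0 1 0 1
cumsum(r$values) + seq_along(r$values)
# [1] 1 3 4 6 7 9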
Timings
Since the question mentions performance as an issue (the real use case has a data set with 1 million rows), here are timings of my solution and of the one by r2evans.
Write both solutions as functions.
library(dplyr)
f1 <- function(X) {
  r <- rle(X$treat)
  r$values <- cumsum(r$values) + seq_along(r$values)
  int2 <- +(ave(X$state, inverse.rle(r), FUN = function(x) x != x[1]) & X$treat == 1)
  cbind(X, int2)
}
f2 <- function(X) {
  X %>%
    group_by(grp = cumsum(c(FALSE, diff(treat) > 0))) %>%
    mutate(int2 = +(state > 0 & first(state) == 0 & treat > 0)) %>%
    ungroup() %>%
    select(-grp)
}
How many doublings of testdat are needed to reach a million rows?
log2(1e6/nrow(testdat))
#[1] 16.02468
df1 <- testdat
for(i in 1:15) df1 <- rbind(df1, df1)
nrow(df1)
#[1] 491520
That is about half a million rows, which should be enough for a test.
mb <- microbenchmark::microbenchmark(
  base = f1(df1),
  dplyr = f2(df1),
  times = 10
)
rm(df1) # tidy up
print(mb, unit = "relative", order = "median")
#Unit: relative
# expr min lq mean median uq max neval
# base 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# dplyr 1.283237 1.359772 1.331494 1.369062 1.316815 1.256968 10
The dplyr solution's median time is about 37% higher than that of the base R solution.
Another base version, also using ave.
testdat$treat & c(0, diff(testdat$state)) == 1 is TRUE where state changes from 0 to 1 while treat is 1. testdat$treat & testdat$state is TRUE where both are 1.
testdat$int2 <- +ave(testdat$treat & c(0, diff(testdat$state)) == 1,
                     cumsum(c(0, abs(diff(testdat$treat & testdat$state)))),
                     FUN = function(x) rep(x[1], length(x)))
testdat
# id period treat state int int2
#1 1 1 0 0 0 0
#2 1 2 1 0 0 0
#3 1 3 1 0 0 0
#4 1 4 1 0 0 0
#5 1 5 0 0 0 0
#6 2 1 0 0 0 0
#7 2 2 0 1 0 0
#8 2 3 1 1 0 0
#9 2 4 1 1 0 0
#10 2 5 1 1 0 0
#11 3 1 0 0 0 0
#12 3 2 0 0 0 0
#13 3 3 1 0 0 0
#14 3 4 1 1 1 1
#15 3 5 1 1 1 1
Or using Reduce:
testdat$int2 <- Reduce(function(x, y) if (y == -1) 0 else if (x == 1 || y == 1) 1 else 0,
                       (testdat$treat & c(0, diff(testdat$state)) == 1) -
                         c(0, diff(testdat$treat & testdat$state) == -1),
                       accumulate = TRUE)
Timings (continuing from the benchmark by Rui Barradas above):
f3 <- function(testdat) {
  cbind(testdat, int2 = +ave(testdat$treat & c(0, diff(testdat$state)) == 1,
                             cumsum(c(0, abs(diff(testdat$treat & testdat$state)))),
                             FUN = function(x) rep(x[1], length(x))))
}
f4 <- function(testdat) {
  cbind(testdat, int2 = Reduce(function(x, y) if (y == -1) 0 else if (x == 1 || y == 1) 1 else 0,
                               (testdat$treat & c(0, diff(testdat$state)) == 1) -
                                 c(0, diff(testdat$treat & testdat$state) == -1),
                               accumulate = TRUE))
}
microbenchmark::microbenchmark(base = f1(df1), dplyr = f2(df1),
GKi1 = f3(df1), GKi2 = f4(df1), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# base 1132.7269 1188.7439 1233.106 1226.8532 1293.9901 1364.8358 10 c
# dplyr 1376.0856 1436.4027 1466.418 1458.7240 1509.8990 1559.7976 10 d
# GKi1 960.5438 1006.8803 1029.105 1022.6114 1065.7427 1074.6027 10 b
# GKi2 588.0484 667.2482 694.415 699.0845 739.5523 786.1819 10 a

Filtering on a Column Whose Number is Specified in Another Column

I'm looking for a better way to achieve what the code below does with a for loop. The goal is to create a dataframe (or matrix) in which each row is a possible n-length sequence of 1s and 0s, followed by an (n+1)th column containing the number of one of the preceding columns that holds a 0.
So in the n == 3 case for example, we want to include a row like this:
1 0 0 2
but not this:
1 0 0 1
Here's the code I have now (assuming n == 3 for simplicity):
library(tidyverse)
df <- expand.grid(x = 0:1, y = 0:1, z = 0:1, target = 1:3, keep = FALSE)
for (row in 1:nrow(df)) {
  df$keep[row] <- df[row, df$target[row]] == 0
}
df <- df %>%
  filter(keep == TRUE) %>%
  select(-keep)
df
# x y z target
# 1 0 0 0 1
# 2 0 1 0 1
# 3 0 0 1 1
# 4 0 1 1 1
# 5 0 0 0 2
# 6 1 0 0 2
# 7 0 0 1 2
# 8 1 0 1 2
# 9 0 0 0 3
# 10 1 0 0 3
# 11 0 1 0 3
# 12 1 1 0 3
Seems like there has to be a better way to do this, especially with dplyr. But I can't figure out how to use the value of target to specify the column to filter on.
Using base R, we can create a row/column index to filter values from the dataframe and keep rows where the extracted value is 0.
df[df[cbind(seq_len(nrow(df)), df$target)] == 0, ]
# x y z target
#1 0 0 0 1
#3 0 1 0 1
#5 0 0 1 1
#7 0 1 1 1
#9 0 0 0 2
#10 1 0 0 2
#13 0 0 1 2
#14 1 0 1 2
#17 0 0 0 3
#18 1 0 0 3
#19 0 1 0 3
#20 1 1 0 3
data
df <- expand.grid(x = 0:1, y = 0:1, z = 0:1, target = 1:3)
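As a minimal illustration of the cbind() matrix indexing used above (a toy example, not from the original answer):
m <- matrix(1:9, nrow = 3)
m[cbind(1:3, c(2, 1, 3))]  # extracts m[1,2], m[2,1], m[3,3]
# [1] 4 2 9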

Create new column when when values repeat 3 or more times

Problem
I'm trying to create a new column (b) based on values from a previous column (a). Column a is binary, consisting of either 0s or 1s. If there are three or more 1s in a row in column a, then keep them in column b. I'm close to the desired output, but when there are two 1s in a row, the ifelse grabs the second value because it meets the first condition.
Desired Output–Column b
df <- data.frame(a = c(1,1,1,0,0,1,0,1,1,0,1,1,1,0,1,1,0,1,1,1,1),
                 b = c(1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,1,1,1,1))
df
a b
1 1 1
2 1 1
3 1 1
4 0 0
5 0 0
6 1 0
7 0 0
8 1 0 #
9 1 0 #
10 0 0
11 1 1
12 1 1
13 1 1
14 0 0
15 1 0 #
16 1 0 #
17 0 0
18 1 1
19 1 1
20 1 1
21 1 1
Failed Attempt...s
require(dplyr)
df_fail <- df %>%
  mutate(b = ifelse((lag(df$a) + df$a) > 1 | (df$a + lead(df$a) + lead(df$a, 2)) >= 3,
                    df$a, NA))
df_fail
a b
1 1 1
2 1 1
3 1 1
4 0 0
5 0 0
6 1 0
7 0 0
8 1 0
9 1 1 # should be 0
10 0 0
11 1 1
12 1 1
13 1 1
14 0 0
15 1 0
16 1 1 # should be 0
17 0 0
18 1 1
19 1 1
20 1 1
21 1 1
We can use rle from base R to change the elements in runs of fewer than 3 repeating 1s to 0:
inverse.rle(within.list(rle(df$a), values[values == 1 & lengths < 3] <- 0))
#[1] 1 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1
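The same idea spelled out step by step (my expansion of the one-liner above):
r <- rle(df$a)
r$values[r$values == 1 & r$lengths < 3] <- 0  # zero out runs of 1s shorter than 3
df$b2 <- inverse.rle(r)
all(df$b2 == df$b)
# [1] TRUE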
Or use rleid from data.table
library(data.table)
library(dplyr)
df %>%
  group_by(grp = rleid(a)) %>%
  mutate(b1 = if (n() < 3 & all(a == 1)) 0 else a) %>%
  ungroup %>%
  select(-grp)
