Restarting a Counter By Groups Under Conditions [duplicate] - r

I have the following dataset:
id = c("A","A","A","A","A","B", "B", "B", "B")
result = c(1,1,0,1,1,0,1,0,1)
my_data = data.frame(id, result)
For each unique id, I want to create a "counter variable" that:
if the first result value is 1 then counter = 1 , else 0
increases by 1 each time when result = 1
becomes 0 when the result = 0
remains 0 until the first result = 1 is encountered
restart to increasing by 1 each time the result = 1
when the next unique id is encountered, the counter initializes back to 1 if result = 1 , else 0
I think the final result should look something like this:
id result counter
1 A 1 1
2 A 1 2
3 A 0 0
4 A 1 1
5 A 1 2
6 B 0 0
7 B 1 1
8 B 0 0
9 B 1 1
I have these two codes that I am trying to use:
# creates counter by treating entire dataset as a single ID
my_data$counter = unlist(lapply(split(my_data$result, c(0, cumsum(abs(diff(!my_data$result == 1))))), function(x) (x[1] == 1) * seq(length(x))))
# creates counter by taking into consideration ID's
my_data$counter = ave(my_data$result, my_data$id, FUN = function(x){ tmp<-cumsum(x);tmp-cummax((!x)*tmp)})
But I am not sure how to interpret these correctly. For example, I am interested in learning about how to write a general function to accomplish this task with general conditions - e.g. if result = AAA then counter restarts to 0, if result = BBB then counter + 1, if result = CCC then counter + 2, if result = DDD then counter - 1.
Can someone please show me how to do this?
Thanks!

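We can create a run-id column with rleid and then group by both 'id' and that run id, so that row_number() restarts in every run of identical 'result' values. Multiplying by 'result' keeps the counter at 0 in runs of 0: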
library(dplyr)
library(data.table)
my_data %>%
group_by(id) %>%
mutate(grp = rleid(result)) %>%
group_by(grp, .add = TRUE) %>%
mutate(counter = row_number() * result)%>%
ungroup %>%
select(-grp)
-output
# A tibble: 9 × 3
id result counter
<chr> <dbl> <dbl>
1 A 1 1
2 A 1 2
3 A 0 0
4 A 1 1
5 A 1 2
6 B 0 0
7 B 1 1
8 B 0 0
9 B 1 1
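For readers who prefer not to load data.table, the same run-based idea can be sketched in base R. This is only an illustration, assuming result contains nothing but 0 and 1; run_counter is a made-up helper name, not a function from any package. The second half unpacks the cumsum/cummax one-liner from the question for id "A".

```r
# Rebuild the example data from the question.
id <- c("A", "A", "A", "A", "A", "B", "B", "B", "B")
result <- c(1, 1, 0, 1, 1, 0, 1, 0, 1)
my_data <- data.frame(id, result)

# Hypothetical helper: number the rows inside each run of equal values,
# then multiply by the value so that runs of 0 stay at 0.
run_counter <- function(x) sequence(rle(x)$lengths) * x

# ave() applies the helper separately within each id.
my_data$counter <- ave(my_data$result, my_data$id, FUN = run_counter)

# The cumsum/cummax one-liner, step by step, for id "A".
x <- my_data$result[my_data$id == "A"]  # 1 1 0 1 1
total <- cumsum(x)                       # running count of 1s: 1 2 2 3 4
at_reset <- cummax((!x) * total)         # count reached at the latest 0: 0 0 2 2 2
counter_a <- total - at_reset            # 1 2 0 1 2
```

Subtracting the count reached at the most recent 0 is what makes the counter restart, which is why both approaches agree on this data.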
Or using data.table
library(data.table)
setDT(my_data)[, counter := seq_len(.N) * result, .(id, rleid(result))]
-output
> my_data
id result counter
1: A 1 1
2: A 1 2
3: A 0 0
4: A 1 1
5: A 1 2
6: B 0 0
7: B 1 1
8: B 0 0
9: B 1 1
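Neither snippet above covers the more general rules mentioned in the question (AAA resets, BBB adds 1, CCC adds 2, DDD subtracts 1). One possible sketch is a lookup table of steps plus an explicit loop. The names rule_counter, rules and binary_rules are hypothetical, and encoding "reset" as NA is an assumption made here for illustration.

```r
# Hypothetical helper: walk through `result`, adding the step that the
# rule table assigns to each value; an NA step restarts the counter at 0.
# Values missing from the table also look up as NA, so they reset too.
rule_counter <- function(result, rules) {
  steps <- unname(rules[as.character(result)])
  counter <- numeric(length(steps))
  total <- 0
  for (i in seq_along(steps)) {
    total <- if (is.na(steps[i])) 0 else total + steps[i]
    counter[i] <- total
  }
  counter
}

# The general rules from the question.
rules <- c(AAA = NA, BBB = 1, CCC = 2, DDD = -1)
general <- rule_counter(c("BBB", "CCC", "DDD", "AAA", "BBB"), rules)

# The original 0/1 problem expressed with the same helper, applied per id.
id <- c("A", "A", "A", "A", "A", "B", "B", "B", "B")
result <- c(1, 1, 0, 1, 1, 0, 1, 0, 1)
binary_rules <- c("0" = NA, "1" = 1)
by_id <- unsplit(lapply(split(result, id), rule_counter, rules = binary_rules), id)
```

A plain loop is used because each value depends on the previous running total, and the loop keeps that dependency explicit.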

Related

How do I drop all observations except the last of a pattern?

I asked a question a few months back about how to identify and keep only observations that follow a certain pattern: How can I identify patterns over several rows in a column and fill a new column with information about that pattern using R?
I want to take this a step further. In that question I just wanted to identify that pattern. Now, if the pattern appears several times within a group, how do I keep only the last occurrence of that pattern? For example, given df1, how can I achieve df2?
df1
TIME ID D
12:30:10 2 0
12:30:42 2 0
12:30:59 2 1
12:31:20 2 0
12:31:50 2 0
12:32:11 2 0
12:32:45 2 1
12:33:10 2 1
12:33:33 2 1
12:33:55 2 1
12:34:15 2 0
12:34:30 2 0
12:35:30 2 0
12:36:30 2 0
12:36:45 2 0
12:37:00 2 0
12:38:00 2 1
I want to end up with the following df2
df2
TIME ID D
12:33:55 2 1
12:34:15 2 0
12:34:30 2 0
12:35:30 2 0
12:36:30 2 0
12:36:45 2 0
12:37:00 2 0
12:38:00 2 1
Thoughts? There were some helpful answers in the question I linked above, but I now want to narrow it.
Here is a base R function I find too complicated but that gets what is asked for.
If I understood the pattern correctly, it doesn't matter if the last sequence ends in a 1 or a 0. The test with df1b has a last sequence ending in a 0.
keep_last_pattern <- function(data, col){
x <- data[[col]]
if(x[length(x)] == 0) x[length(x)] <- 1
#
i <- ave(x, cumsum(x), FUN = \(y) y[1] == 1 & length(y) > 1)
r <- rle(i)
l <- length(r$lengths)
n <- which(as.logical(r$values))
r$values[ n[-length(n)] ] <- 0
r$values[l] <- r$lengths[l] == 1 && r$values[l] == 0
j <- as.logical(inverse.rle(r))
#
data[j, ]
}
keep_last_pattern(df1, "D")
df1b <- df1
df1b[17, "D"] <- 0
keep_last_pattern(df1b, "D")
Do you want the rows of the sequence in each ID between the second-to-last 1 and the last 1?
Here is a function to do that which can be applied for each ID.
library(dplyr)
extract_sequence <- function(x) {
inds <- which(x == 1)
inds[length(inds) - 1]:inds[length(inds)]
}
df1 %>%
group_by(ID) %>%
slice(extract_sequence(D)) %>%
ungroup
# TIME ID D
# <chr> <int> <int>
#1 12:33:55 2 1
#2 12:34:15 2 0
#3 12:34:30 2 0
#4 12:35:30 2 0
#5 12:36:30 2 0
#6 12:36:45 2 0
#7 12:37:00 2 0
#8 12:38:00 2 1
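The function above assumes every ID contains at least two 1s. A guarded base R variant might look like the sketch below. last_span is a hypothetical name, and returning no rows for groups with fewer than two 1s is an assumption, since the question does not say what should happen in that case.

```r
# The D column of df1 from the question.
D <- c(0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1)

# Hypothetical helper: row positions from the second-to-last 1 to the last 1.
# Returns an empty index when there are fewer than two 1s (an assumption).
last_span <- function(x) {
  ones <- which(x == 1)
  if (length(ones) < 2) return(integer(0))
  seq(ones[length(ones) - 1], ones[length(ones)])
}

rows_to_keep <- last_span(D)   # positions 10 to 17 of df1
```

With a data frame, the index could be applied per group, for example with split() and lapply(), or inside slice() as in the dplyr code above.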
Not sure this will help as it's unclear what your pattern is.
Let's assume you have data like this, with one column indicating in some way whether the row matches a pattern or not:
set.seed(123)
df <- data.frame(
grp = sample(LETTERS[1:3], 10, replace = TRUE),
x = 1:10,
y = c(0,1,0,0,1,1,1,1,1,1),
pattern = rep(c("TRUE", "FALSE"),5)
)
If the aim is to keep only the last occurrence of pattern == "TRUE" per group, this might work:
df %>%
filter(pattern == "TRUE") %>%
group_by(grp) %>%
slice_tail(n = 1)
# A tibble: 3 x 4
# Groups: grp [3]
grp x y pattern
<chr> <int> <dbl> <chr>
1 A 1 0 TRUE
2 B 9 1 TRUE
3 C 5 1 TRUE

Check condition row wise for a number of columns [duplicate]

Data example:
df <- data.frame("a" = c(1,2,3,4), "b" = c(4,3,2,1), "x_ind" = c(1,0,1,1), "y_ind" = c(0,0,1,1), "z_ind" = c(0,1,1,1) )
> df
a b x_ind y_ind z_ind
1 1 4 1 0 0
2 2 3 0 0 1
3 3 2 1 1 1
4 4 1 1 1 1
I want to add a new column which checks if the whole row for the columns which end in "_ind" has all values equal to 1. If it does then returns 1 else returns 0. So the result dataframe looks like:
a b x_ind y_ind z_ind keep
1 1 4 1 0 0 0
2 2 3 0 0 1 0
3 3 2 1 1 1 1
4 4 1 1 1 1 1
I can select the columns by using df %>% select(contains("_ind")) however I am not sure how to do a rowwise operation which checks if every value in the row contains a 1, and then append the column back to the original dataframe.
Any help would be appreciated! I am working with dplyr but would appreciate any solution.
You can use rowSums to count, per row, how many of the _ind columns are equal to 1, and compare that count with the number of those columns, i.e.
rowSums(df[grepl('_ind', names(df))] == 1) == ncol(df[grepl('_ind', names(df))])
#[1] FALSE FALSE TRUE TRUE
Continuing your dplyr attempt you can do,
df %>%
select(contains("_ind")) %>%
mutate(new = rowSums(. == 1) == ncol(.))
# x_ind y_ind z_ind new
#1 1 0 0 FALSE
#2 0 0 1 FALSE
#3 1 1 1 TRUE
#4 1 1 1 TRUE
#OR you can filter directly
df %>%
select(contains("_ind")) %>%
filter(rowSums(. == 1) == ncol(.))
# x_ind y_ind z_ind
#1 1 1 1
#2 1 1 1
If you also want to keep the original columns, you can use,
df %>%
filter_at(vars(ends_with('_ind')), all_vars(. == 1))
# a b x_ind y_ind z_ind
#1 3 2 1 1 1
#2 4 1 1 1 1
NOTE: When we use (.), the dot refers to the resulting data frame. In this case, it refers to the columns specified in the condition (i.e. the columns that end with _ind).
Similarly in base R,
df[rowSums(df[grepl('_ind', names(df))] == 1) == ncol(df[grepl('_ind', names(df))]),]
# a b x_ind y_ind z_ind
#3 3 2 1 1 1
#4 4 1 1 1 1
You can use rowwise with c_across in dplyr 1.0.0 and later:
library(dplyr)
df %>% rowwise() %>% mutate(keep = +all(c_across(ends_with('ind')) == 1))
# a b x_ind y_ind z_ind keep
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#1 1 4 1 0 0 0
#2 2 3 0 0 1 0
#3 3 2 1 1 1 1
#4 4 1 1 1 1 1
You can use apply with all, using endsWith to get the columns ending with _ind and testing if they are == 1.
df$keep <- +(apply(df[,endsWith(colnames(df), "_ind")]==1, 1, all))
df
# a b x_ind y_ind z_ind keep
#1 1 4 1 0 0 0
#2 2 3 0 0 1 0
#3 3 2 1 1 1 1
#4 4 1 1 1 1 1
or using rowSums
df$keep <- +(rowSums(df[,endsWith(colnames(df), "_ind")]!=1) == 0)
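The rowSums variant can be wrapped into a small reusable function. The sketch below is one way to do that; flag_all_ones and its suffix argument are made-up names, and treating NA as "not equal to 1" is an assumption, since the question does not mention missing values.

```r
df <- data.frame(a = c(1, 2, 3, 4), b = c(4, 3, 2, 1),
                 x_ind = c(1, 0, 1, 1), y_ind = c(0, 0, 1, 1), z_ind = c(0, 1, 1, 1))

# Hypothetical helper: 1 when every column ending in `suffix` equals 1, else 0.
# A missing value counts as a failure (assumption).
flag_all_ones <- function(data, suffix = "_ind") {
  m <- as.matrix(data[endsWith(names(data), suffix)])
  as.integer(rowSums(is.na(m) | m != 1) == 0)
}

df$keep <- flag_all_ones(df)       # 0 0 1 1

# With a missing indicator in row 3, that row is no longer flagged.
df_na <- df
df_na$y_ind[3] <- NA
keep_na <- flag_all_ones(df_na)    # 0 0 0 1
```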

Identifying Duplicates in `data.frame` Using `dplyr`

I want to identify (not eliminate) duplicates in a data frame and add a 0/1 variable accordingly (whether a row is a duplicate or not), using the R dplyr package.
Example:
| A B C D
1 | 1 0 1 1
2 | 1 0 1 1
3 | 0 1 1 1
4 | 0 1 1 1
5 | 1 1 1 1
Clearly, row 1 and 2 are duplicates, so I want to create a new variable (with mutate?), say E, that is equal to 1 in row 1,2,3 and 4 since row 3 and 4 are also identical.
Moreover, I want to add another variable, F, that is equal to 1 if there is a duplicate differing only by one column. That is, F in row 1,2 and 5 would be equal to 1 since they only differ in the B column.
I hope it is clear what I want to do and I hope that dplyr offers a smooth solution to this problem. This is of course possible in "base" R but I believe (hope) that there exists a smoother solution.
You can use dist() to compute the differences, and then a search in the resulting distance object can give the needed answers (E, F, etc.). Here is an example code, where X is the original data.frame:
W=as.matrix(dist(X, method="manhattan"))
X$E = as.integer(sapply(1:ncol(W), function(i,D){any(W[-i,i]==D)}, D=0))
X$F = as.integer(sapply(1:ncol(W), function(i,D){any(W[-i,i]==D)}, D=1))
Just change D= for the number of different columns needed.
It's all base R though. Using plyr::laply instead of sapply has the same effect. dplyr looks like overkill here.
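The Manhattan distance equals the number of differing columns only because the example data are 0/1. For arbitrary values, the differing columns can be counted directly, as in the sketch below. n_diff is an illustrative name, and X is rebuilt here from the example table in the question.

```r
X <- data.frame(A = c(1, 1, 0, 0, 1), B = c(0, 0, 1, 1, 1),
                C = c(1, 1, 1, 1, 1), D = c(1, 1, 1, 1, 1))
m <- as.matrix(X)

# n_diff[j, i] = number of columns in which row j differs from row i.
n_diff <- apply(m, 1, function(r) colSums(t(m) != r))
diag(n_diff) <- NA   # do not compare a row with itself

# E: some other row is identical. F: some other row differs in exactly one column.
X$E <- as.integer(rowSums(n_diff == 0, na.rm = TRUE) > 0)
X$F <- as.integer(rowSums(n_diff == 1, na.rm = TRUE) > 0)
```

On this data, F comes out as 1 for every row, because rows 3 and 4 also differ from row 5 in a single column.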
Here is a data.table solution that is extendable to an arbitrary case (1..n columns the same). I am not sure whether someone can convert it to dplyr for you. I had to change your dataset a bit to show your desired F column: in your example all rows would get a 1, because rows 3 and 4 are one column different from row 5 as well.
library(data.table)
DT <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1), C = c(1,1,1,1,1), D = c(1,1,1,1,1), E = c(1,1,0,0,0))
DT
A B C D E
1 1 0 1 1 1
2 1 0 1 1 1
3 0 1 1 1 0
4 0 1 1 1 0
5 1 1 1 1 0
setDT(DT)
DT_ncols <- length(DT)
base <- data.table(t(combn(1:nrow(DT), 2)))
setnames(base, c("V1","V2"),c("ind_x","ind_y"))
DT[, ind := .I]
DT_melt <- melt(DT, id.var = "ind", variable.name = "column")
base <- merge(base, DT_melt, by.x = "ind_x", by.y = "ind", allow.cartesian = TRUE)
base <- merge(base, DT_melt, by.x = c("ind_y", "column"), by.y = c("ind", "column"))
base <- base[, .(common_cols = sum(value.x == value.y)), by = .(ind_x, ind_y)]
This gives us a data.table that looks like this:
base
ind_x ind_y common_cols
1: 1 2 5
2: 1 3 2
3: 2 3 2
4: 1 4 2
5: 2 4 2
6: 3 4 5
7: 1 5 3
8: 2 5 3
9: 3 5 4
10: 4 5 4
This says that rows 1 and 2 have 5 common columns (duplicates). Rows 3 and 5 have 4 common columns, and 4 and 5 have 4 common columns. We can now use a fairly extendable format to flag any combination we want:
base <- melt(base, id.vars = "common_cols")
# Exact duplicates - common_cols == DT_ncols
DT[, F := ifelse(ind %in% unique(base[common_cols == DT_ncols, value]), 1, 0)]
# Same save 1 - common_cols == DT_ncols - 1
DT[, G := ifelse(ind %in% unique(base[common_cols == DT_ncols - 1, value]), 1, 0)]
# Same save 2 - common_cols == DT_ncols - 2
DT[, H := ifelse(ind %in% unique(base[common_cols == DT_ncols - 2, value]), 1, 0)]
This gives:
A B C D E ind F G H
1: 1 0 1 1 1 1 1 0 1
2: 1 0 1 1 1 2 1 0 1
3: 0 1 1 1 0 3 1 1 0
4: 0 1 1 1 0 4 1 1 0
5: 1 1 1 1 0 5 0 1 1
Instead of manually selecting, you can append all combinations like so:
# run after base <- melt(base, id.vars = "common_cols")
base <- unique(base[,.(ind = value, common_cols)])
base[, common_cols := factor(common_cols, 1:DT_ncols)]
merge(DT, dcast(base, ind ~ common_cols, fun.aggregate = length, drop = FALSE), by = "ind")
ind A B C D E 1 2 3 4 5
1: 1 1 0 1 1 1 0 1 1 0 1
2: 2 1 0 1 1 1 0 1 1 0 1
3: 3 0 1 1 1 0 0 1 0 1 1
4: 4 0 1 1 1 0 0 1 0 1 1
5: 5 1 1 1 1 0 0 0 1 1 0
Here is a dplyr solution:
test%>%mutate(flag = (A==lag(A)&
B==lag(B)&
C==lag(C)&
D==lag(D)))%>%
mutate(twice = lead(flag)==T)%>%
mutate(E = ifelse(flag == T | twice ==T,1,0))%>%
mutate(E = ifelse(is.na(E),0,1))%>%
mutate(FF = ifelse( ( (A +lag(A)) + (B +lag(B)) + (C+lag(C)) + (D + lag(D))) == 7,1,0))%>%
mutate(FF = ifelse(is.na(FF)| FF == 0,0,1))%>%
select(A,B,C,D,E,FF)
Result:
A B C D E FF
1 1 0 1 1 1 0
2 1 0 1 1 1 0
3 0 1 1 1 1 0
4 0 1 1 1 1 0
5 1 1 1 1 0 1

propagate changes down a column

I would like to use dplyr to go through a dataframe row by row, and if A == 0, then set B to the value of B in the previous row, otherwise leave it unchanged. However, I want "the value of B in the previous row" to refer to the previous row during the computation, not before the computation began, because the value may have changed -- in other words, I'd like changes to propagate downwards. For example, with the following data:
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
A B
1 0
0 1
0 1
0 1
1 1
I would like the result of the computation to be:
result <- data.frame(A=c(1,0,0,0,1),B=c(0,0,0,0,1))
A B
1 0
0 0
0 0
0 0
1 1
If I use something like result <- dat %>% mutate(B = ifelse(A==0,lag(B),B)) then changes won't propagate downwards: result$B will be equal to c(0,0,1,1,1), not c(0,0,0,0,1).
More generally, how do you use dplyr::mutate to create a column that depends on itself (as it updates during the computation, not a copy of what it was before)?
Seems like you want a "last observation carried forward" approach. The most common R implementation is zoo::na.locf which fills in NA values with the last observation. All we need to do to use it in this case is to first set to NA all the B values that we want to fill in:
mutate(dat,
B = ifelse(A == 0, NA, B),
B = zoo::na.locf(B))
# A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1
As to my comment, do note that the only thing mutate does is add the column to the data frame. We could do it just as well without mutate:
result = dat
result$B = with(result, ifelse(A == 0, NA, B))
result$B = zoo::na.locf(result$B)
Whether you use mutate or [ or $ or any other method to access/add the columns is tangential to the problem.
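For the more general question of a column that depends on its own already-updated values, the recurrence can also be written as an explicit loop. The following is a minimal sketch of exactly the rule stated in the question; propagate is a made-up name, and the first row is left unchanged because it has no previous row.

```r
# Hypothetical helper: if A[i] == 0, copy the already-updated previous value.
propagate <- function(A, B) {
  out <- B
  for (i in seq_along(B)[-1]) {
    if (A[i] == 0) out[i] <- out[i - 1]   # reads the updated value, not the original
  }
  out
}

dat <- data.frame(A = c(1, 0, 0, 0, 1), B = c(0, 1, 1, 1, 1))
dat$B <- propagate(dat$A, dat$B)   # 0 0 0 0 1
```

Because out[i - 1] is read after it may have been overwritten, changes propagate downwards, which a single vectorised lag() cannot do.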
We could use fill from tidyr after changing the 'B' values to NA that corresponds to 0 in 'A'
library(dplyr)
library(tidyr)
dat %>%
mutate(B = NA^(!A)*B) %>%
fill(B)
# A B
#1 1 0
#2 0 0
#3 0 0
#4 0 0
#5 1 1
NOTE: By default, the .direction (argument in fill) is "down", but it can also take "up" i.e. fill(B, .direction="up")
Here's a solution using grouping, and rleid (Run length encoding id) from data.table. I think it should be faster than the zoo solution, since zoo relies on doing multiple revs and a cumsum. And rleid is blazing fast
Basically, we only want the last value of the previous group, so we create a grouping variable based on the diff vector of the rleid and add that to the rleid if A == 1. Then we group and take the first B-value of the group for every case where A == 0
library(dplyr)
library(data.table)
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
dat <- dat %>%
mutate(grp = data.table::rleid(A),
grp = ifelse(A == 1, grp + c(diff(grp),0),grp)) %>%
group_by(grp) %>%
mutate(B = ifelse(A == 0, B[1],B)) # EDIT: Always carry forward B on A == 0
dat
Source: local data frame [5 x 3]
Groups: grp [2]
A B grp
<dbl> <dbl> <dbl>
1 1 0 2
2 0 0 2
3 0 0 2
4 0 0 2
5 1 1 3
EDIT: Here's an example with a longer dataset so we can really see the behavior. (Also switched the condition: it should be "if all A != 1", not "if not all A == 1".)
set.seed(30)
dat <- data.frame(A=sample(0:1,15,replace = TRUE),
B=sample(0:1,15,replace = TRUE))
> dat
A B
1 0 1
2 0 0
3 0 1
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 1 0
10 0 0
11 0 0
12 0 0
13 1 0
14 1 1
15 0 0
Result:
Source: local data frame [15 x 3]
Groups: grp [5]
A B grp
<int> <int> <dbl>
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 1 1
7 1 1 3
8 0 1 3
9 1 0 5
10 0 0 5
11 0 0 5
12 0 0 5
13 1 0 6
14 1 1 7
15 0 1 7

Find consecutive values in dataframe

I have a dataframe. I wish to detect consecutive numbers and populate a new column as 1 or 0.
ID Val
1 a 8
2 a 7
3 a 5
4 a 4
5 a 3
6 a 1
Expected output
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
You could do this with the diff function in combination with abs and see whether the outcome is 1 or another value:
d$outP <- c(0, abs(diff(d$Val)) == 1)
which gives:
> d
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
If you only want to take decreasing consecutive values into account, you can use:
c(0, diff(d$Val) == -1)
When you want to do this for each ID, you can also do this in base R or with dplyr:
# base R
d$outP <- ave(d$Val, d$ID, FUN = function(x) c(0, abs(diff(x)) == 1))
# dplyr
library(dplyr)
d %>%
group_by(ID) %>%
mutate(outP = c(0, abs(diff(Val)) == 1))
We can also use a faster option by comparing the previous value with the current one:
with(df1, as.integer(c(FALSE, Val[-length(Val)] - Val[-1]) ==1))
#[1] 0 1 0 1 1 0
If we need to group by "ID", one option is data.table
library(data.table)
setDT(df1)[, outP := as.integer((shift(Val, fill =Val[1]) - Val)==1) , by = ID]
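None of the answers construct the example data, so here is a small self-contained sketch that rebuilds it and shows the intermediate values of the diff approach. The second data frame, d2, is invented purely to illustrate why grouping by ID matters.

```r
d <- data.frame(ID = "a", Val = c(8, 7, 5, 4, 3, 1))

step <- diff(d$Val)               # -1 -2 -1 -1 -2
is_consecutive <- abs(step) == 1  # TRUE FALSE TRUE TRUE FALSE
d$outP <- c(0, is_consecutive)    # 0 1 0 1 1 0 (the first row has no predecessor)

# Invented two-ID example: 7 -> 6 crosses an ID boundary.
d2 <- data.frame(ID = c("a", "a", "b", "b"), Val = c(8, 7, 6, 5))
ungrouped <- c(0, abs(diff(d2$Val)) == 1)                                    # 0 1 1 1
grouped <- ave(d2$Val, d2$ID, FUN = function(x) c(0, abs(diff(x)) == 1))     # 0 1 0 1
```

Without grouping, the first row of ID "b" is flagged only because its value happens to follow the last value of ID "a".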
