R: Define starting condition for continuous value - r

I'm trying to set up two new variables to add to an existing data.frame. Each should be a running counter that starts at 1 (or 0) once a condition is met, computed separately for each ID in the data.frame. The data.frame has a structure similar to this:
ID Var1
1 0
1 2
1 5
1 12
2 0
2 2
2 NA
2 11
and I want to get to:
ID Var1 start stop
1 0 0 0
1 2 0 1
1 5 1 2
1 12 2 3
2 0 0 0
2 2 0 1
2 NA 1 2
2 11 2 3
start should be a running counter that begins once Var1 > 0 for the first time, and stop should work the same way. start's initial value should be 0 and stop's initial value should be 1. Both should keep running even if Var1 becomes NA or 0 again later in the data.frame. I have tried the following:
df %>%
  group_by(ID) %>%
  mutate(stop = ifelse(Var1 > 0, 0:nrow(df), 0))
But the variable it returns doesn't start at 0; it starts with the number of the row in which the condition is first met.

Sorry, I don't speak dplyr but you can easily adapt this, since data.table is only used for group-by.
DF <- read.table(text = "ID Var1
1 0
1 2
1 5
1 12
2 0
2 2
2 NA
2 11", header = TRUE)
foo <- function(x) {
  # quantify leading zeros:
  x[is.na(x)] <- 0
  lead0 <- cumsum(x > 0)
  nlead0 <- sum(lead0 == 0)
  # create result using sequence:
  list(c(rep.int(0, nlead0), sequence(length(x) - nlead0) - 1),
       c(rep.int(0, nlead0), sequence(length(x) - nlead0)))
}
library(data.table)
setDT(DF)
DF[, c("start", "stop") := foo(Var1), by = ID]
# ID Var1 start stop
#1: 1 0 0 0
#2: 1 2 0 1
#3: 1 5 1 2
#4: 1 12 2 3
#5: 2 0 0 0
#6: 2 2 0 1
#7: 2 NA 1 2
#8: 2 11 2 3
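The same logic can be translated to dplyr; here is a minimal sketch of such an adaptation (not part of the original answer; seen is a helper column introduced only for illustration):
library(dplyr)
DF %>%
  group_by(ID) %>%
  mutate(
    seen  = cumsum(replace(Var1, is.na(Var1), 0) > 0) > 0,  # TRUE once Var1 > 0 has occurred
    stop  = cumsum(seen),                                   # stays 0, then counts 1, 2, 3, ...
    start = pmax(stop - 1L, 0L)                             # same counter, lagged by one step
  ) %>%
  select(-seen) %>%
  ungroup()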

Here is a base R option using ave + replace:
transform(df,
  Start = ave(ave(replace(Var1, is.na(Var1), 0) > 0, ID, FUN = cumsum) > 0, ID,
              FUN = function(x) cumsum(c(0, x))[-(length(x) + 1)]),
  Stop = ave(ave(replace(Var1, is.na(Var1), 0) > 0, ID, FUN = cumsum) > 0, ID, FUN = cumsum)
)
or
transform(df,
  Start = ave(ave(ave(replace(Var1, is.na(Var1), 0) > 0, ID, FUN = cumsum), ID, FUN = cumsum) > 1, ID, FUN = cumsum),
  Stop = ave(ave(replace(Var1, is.na(Var1), 0) > 0, ID, FUN = cumsum) > 0, ID, FUN = cumsum)
)
which gives
ID Var1 Start Stop
1 1 0 0 0
2 1 2 0 1
3 1 5 1 2
4 1 12 2 3
5 2 0 0 0
6 2 2 0 1
7 2 NA 1 2
8 2 11 2 3

Restarting a Counter By Groups Under Conditions [duplicate]

I have the following dataset:
id = c("A","A","A","A","A","B", "B", "B", "B")
result = c(1,1,0,1,1,0,1,0,1)
my_data = data.frame(id, result)
For each unique id, I want to create a "counter variable" that:
starts at 1 if the first result value is 1, and at 0 otherwise
increases by 1 each time result = 1
becomes 0 when result = 0
remains 0 until the next result = 1 is encountered
then resumes increasing by 1 each time result = 1
re-initializes when the next unique id is encountered: back to 1 if result = 1, else 0
I think the final result should look something like this:
id result counter
1 A 1 1
2 A 1 2
3 A 0 0
4 A 1 1
5 A 1 2
6 B 0 0
7 B 1 1
8 B 0 0
9 B 1 1
I have these two pieces of code that I am trying to use:
# creates counter by treating entire dataset as a single ID
my_data$counter = unlist(lapply(split(my_data$result, c(0, cumsum(abs(diff(!my_data$result == 1))))), function(x) (x[1] == 1) * seq(length(x))))
# creates counter by taking into consideration ID's
my_data$counter = ave(my_data$result, my_data$id, FUN = function(x){ tmp <- cumsum(x); tmp - cummax((!x) * tmp)})
But I am not sure how to interpret these correctly. For example, I am interested in learning about how to write a general function to accomplish this task with general conditions - e.g. if result = AAA then counter restarts to 0, if result = BBB then counter + 1, if result = CCC then counter + 2, if result = DDD then counter - 1.
Can someone please show me how to do this?
Thanks!
We may create a grouping column with rleid and then group by 'id' and the rleid of 'result'.
library(dplyr)
library(data.table)
my_data %>%
  group_by(id) %>%
  mutate(grp = rleid(result)) %>%
  group_by(grp, .add = TRUE) %>%
  mutate(counter = row_number() * result) %>%
  ungroup %>%
  select(-grp)
-output
# A tibble: 9 × 3
id result counter
<chr> <dbl> <dbl>
1 A 1 1
2 A 1 2
3 A 0 0
4 A 1 1
5 A 1 2
6 B 0 0
7 B 1 1
8 B 0 0
9 B 1 1
Or using data.table
library(data.table)
setDT(my_data)[, counter := seq_len(.N) * result, .(id, rleid(result))]
-output
> my_data
id result counter
1: A 1 1
2: A 1 2
3: A 0 0
4: A 1 1
5: A 1 2
6: B 0 0
7: B 1 1
8: B 0 0
9: B 1 1
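As an aside, the second snippet from the question already implements this counter in base R via cumsum/cummax; here it is again with comments added (my reading of how it works, as a sketch):
my_data$counter <- ave(my_data$result, my_data$id, FUN = function(x) {
  tmp <- cumsum(x)           # running count of 1s within each id
  tmp - cummax((!x) * tmp)   # subtract the count as of the most recent 0, so the counter restarts there
})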

Calculate number of times a streak of categories changes in a row in R

I have the following data frame in R:
Row number  A   B  C  D  E  F  G  H  I  J
1           NA  1  1  0  0  1  0  0  1  1
2           NA  NA 1  0  0  0  1  0  0  1
3           NA  1  0  0  0  1  0  0  1  1
I am trying to calculate the number of times the value changes between 1 and 0, excluding the NAs.
The result I am expecting is this
Row Number No of changes
---------- --------------
1 4
2 4
3 4
An explanation for row 1:
In row 1, A has an NA, so we exclude it.
B and C have 1, which is our first set of values.
D and E have 0, which is our second set of values. Now Change = 1.
F has 1, which is our third set of values. Now Change = 1 + 1.
G and H have 0, which is our fourth set of values. Now Change = 1 + 1 + 1.
I and J have 1, which is our fifth set of values. Now Change = 1 + 1 + 1 + 1 = 4.
Here's a tidyverse approach.
I pivot into longer format (with tidyr::pivot_longer), then add a helper column noting when we have a change from 0 to 1 or from 1 to 0, and then sum those by row.
library(tidyverse)
df %>%
  # before tidyr 1.0, this would be gather(col, value, -1)
  pivot_longer(-1, "col") %>%
  group_by(Row.number) %>%
  mutate(chg = value == 1 & lag(value) == 0 |
               value == 0 & lag(value) == 1) %>%
  summarize(no_chgs = sum(chg, na.rm = T))
# A tibble: 3 x 2
Row.number no_chgs
<int> <int>
1 1 4
2 2 4
3 3 4
Sample data:
df <- read.table(
header = T,
stringsAsFactors = F,
text = "'Row number' A B C D E F G H I J
1 NA 1 1 0 0 1 0 0 1 1
2 NA NA 1 0 0 0 1 0 0 1
3 NA 1 0 0 0 1 0 0 1 1")
Here's a data.table solution:
library(data.table)
dt <- as.data.table(df)
dt[,
   no_change := max(rleid(na.omit(t(.SD)))) - 1,
   by = Row.number
]
dt
Alternatively, here's a base version:
apply(df[, -1], 1, function(x) {
  complete_case = complete.cases(x)
  if (sum(complete_case) > 0) {
    return(length(rle(x[complete_case])$lengths) - 1)
  } else {
    return(0)
  }
})

How to subset rows that have values larger than other values for multiple columns in R

I have the following data.table
library(data.table)
dt <- data.table(V1=c(1,3,1,0,NA,0),
V2=c(1,0,1,0,1,3),
Q1=c(3,5,10,14,0,3),
Q2=c(0,1,8,NA,0,NA))
and I want to add a new column that will have the value 1:
if any of the columns V1, V2 has a value larger than 2,
and
if any of the columns Q1, Q2 has a value larger than 0.
So in the end I want to end up with something like this:
> dt
V1 V2 Q1 Q2 new
1: 1 1 3 0 0
2: 3 0 5 1 1
3: 1 1 10 8 0
4: 0 0 14 NA 0
5: NA 1 0 0 0
6: 0 3 3 NA 1
EDIT
In principle I would like to have 2 vectors of column names, so something like v_columms <- names(dt)[names(dt) %like% "V"] and q_columms <- names(dt)[names(dt) %like% "Q"], and use these.
We can use melt to process multiple columns by specifying the patterns in measure to convert it to 'long' format and then apply the condition
dt[, new := melt(dt, measure = patterns("V", "Q"))[,
+(any(value1 > 2) & any(value2 > 0)),rowid(variable)]$V1]
dt
# V1 V2 Q1 Q2 new
#1: 1 1 3 0 0
#2: 3 0 5 1 1
#3: 1 1 10 8 0
#4: 0 0 14 NA 0
#5: NA 1 0 0 0
#6: 0 3 3 NA 1
Or without melt, if there are only two groups of columns, then
vs <- grep("V", names(dt))
qs <- grep("Q", names(dt))
dt[, new := +(Reduce(`|`, lapply(.SD[, ..vs], `>`, 2)) &
Reduce(`|`, lapply(.SD[, ..qs], `>`, 0)))]
Using dplyr and either case_when or if_else:
dt %>%
  mutate(new = case_when((V1 > 2 | V2 > 2) & (Q1 > 0 | Q2 > 0) ~ 1,
                         TRUE ~ 0))
dt %>%
  mutate(new = if_else((V1 > 2 | V2 > 2) & (Q1 > 0 | Q2 > 0), 1, 0))
V1 V2 Q1 Q2 new
1 1 1 3 0 0
2 3 0 5 1 1
3 1 1 10 8 0
4 0 0 14 NA 0
5 NA 1 0 0 0
6 0 3 3 NA 1
Here's another approach with some helper functions:
foo <- function(.dt, cols, vals, na.rm = TRUE) {
  rowSums(.dt[, cols, with = FALSE] > vals, na.rm = na.rm) > 0
}
bar <- function(.dt, cols_list, vals_list) {
  as.integer(Reduce(`&`, Map(function(cols, vals) foo(.dt, cols, vals), cols_list, vals_list)))
}
dt[, new := bar(.SD, list(v_columms, q_columms), list(2, 0))]
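Note that v_columms and q_columms here are the column-name vectors from the question's EDIT, so they need to be defined first, e.g. (using the uppercase "Q" pattern, since %like% is case-sensitive):
v_columms <- names(dt)[names(dt) %like% "V"]   # "V1" "V2"
q_columms <- names(dt)[names(dt) %like% "Q"]   # "Q1" "Q2"
With those in place, the call above should give dt$new as 0 1 0 0 0 1, matching the expected output.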

Identifying Duplicates in `data.frame` Using `dplyr`

I want to identify (not eliminate) duplicates in a data frame and add a 0/1 variable accordingly (whether a row is a duplicate or not), using the R dplyr package.
Example:
| A B C D
1 | 1 0 1 1
2 | 1 0 1 1
3 | 0 1 1 1
4 | 0 1 1 1
5 | 1 1 1 1
Clearly, rows 1 and 2 are duplicates, so I want to create a new variable (with mutate?), say E, that is equal to 1 in rows 1, 2, 3 and 4, since rows 3 and 4 are also identical.
Moreover, I want to add another variable, F, that is equal to 1 if there is a duplicate differing only by one column. That is, F in rows 1, 2 and 5 would be equal to 1 since they only differ in the B column.
I hope it is clear what I want to do and I hope that dplyr offers a smooth solution to this problem. This is of course possible in "base" R but I believe (hope) that there exists a smoother solution.
You can use dist() to compute the differences, and then a search in the resulting distance object can give the needed answers (E, F, etc.). Here is an example code, where X is the original data.frame:
W=as.matrix(dist(X, method="manhattan"))
X$E = as.integer(sapply(1:ncol(W), function(i,D){any(W[-i,i]==D)}, D=0))
X$F = as.integer(sapply(1:ncol(W), function(i,D){any(W[-i,i]==D)}, D=1))
Just change D= to the number of differing columns needed.
It's all base R though. Using plyr::laply instead of sapply has the same effect. dplyr looks overkill here.
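Applied to the 5-row example from the question, a quick check (added here for illustration) might look like this; note that F comes out as 1 for every row, since, as the next answer observes, rows 3 and 4 also differ from row 5 in exactly one column:
X <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1), C = c(1,1,1,1,1), D = c(1,1,1,1,1))
W <- as.matrix(dist(X, method = "manhattan"))
X$E <- as.integer(sapply(1:ncol(W), function(i, D) any(W[-i, i] == D), D = 0))  # an exact duplicate exists
X$F <- as.integer(sapply(1:ncol(W), function(i, D) any(W[-i, i] == D), D = 1))  # a row differing in one column exists
X
#   A B C D E F
# 1 1 0 1 1 1 1
# 2 1 0 1 1 1 1
# 3 0 1 1 1 1 1
# 4 0 1 1 1 1 1
# 5 1 1 1 1 0 1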
Here is a data.table solution that is extendable to an arbitrary case (1..n columns the same). Not sure if someone can convert it to dplyr for you. I had to change your dataset a bit to show your desired F column: in your example all rows would get a 1, because rows 3 and 4 are one column different from 5 as well.
library(data.table)
DT <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1), C = c(1,1,1,1,1), D = c(1,1,1,1,1), E = c(1,1,0,0,0))
DT
A B C D E
1 1 0 1 1 1
2 1 0 1 1 1
3 0 1 1 1 0
4 0 1 1 1 0
5 1 1 1 1 0
setDT(DT)
DT_ncols <- length(DT)
base <- data.table(t(combn(1:nrow(DT), 2)))
setnames(base, c("V1","V2"),c("ind_x","ind_y"))
DT[, ind := .I]
DT_melt <- melt(DT, id.var = "ind", variable.name = "column")
base <- merge(base, DT_melt, by.x = "ind_x", by.y = "ind", allow.cartesian = TRUE)
base <- merge(base, DT_melt, by.x = c("ind_y", "column"), by.y = c("ind", "column"))
base <- base[, .(common_cols = sum(value.x == value.y)), by = .(ind_x, ind_y)]
This gives us a data.frame that looks like this:
base
ind_x ind_y common_cols
1: 1 2 5
2: 1 3 2
3: 2 3 2
4: 1 4 2
5: 2 4 2
6: 3 4 5
7: 1 5 3
8: 2 5 3
9: 3 5 4
10: 4 5 4
This says that rows 1 and 2 have 5 common columns (duplicates). Rows 3 and 5 have 4 common columns, and 4 and 5 have 4 common columns. We can now use a fairly extendable format to flag any combination we want:
base <- melt(base, id.vars = "common_cols")
# Unique - common_cols == DT_ncols
DT[, F := ifelse(ind %in% unique(base[common_cols == DT_ncols, value]), 1, 0)]
# Same save 1 - common_cols == DT_ncols - 1
DT[, G := ifelse(ind %in% unique(base[common_cols == DT_ncols - 1, value]), 1, 0)]
# Same save 2 - common_cols == DT_ncols - 2
DT[, H := ifelse(ind %in% unique(base[common_cols == DT_ncols - 2, value]), 1, 0)]
This gives:
A B C D E ind F G H
1: 1 0 1 1 1 1 1 0 1
2: 1 0 1 1 1 2 1 0 1
3: 0 1 1 1 0 3 1 1 0
4: 0 1 1 1 0 4 1 1 0
5: 1 1 1 1 0 5 0 1 1
Instead of manually selecting, you can append all combinations like so:
# run after base <- melt(base, id.vars = "common_cols")
base <- unique(base[,.(ind = value, common_cols)])
base[, common_cols := factor(common_cols, 1:DT_ncols)]
merge(DT, dcast(base, ind ~ common_cols, fun.aggregate = length, drop = FALSE), by = "ind")
ind A B C D E 1 2 3 4 5
1: 1 1 0 1 1 1 0 1 1 0 1
2: 2 1 0 1 1 1 0 1 1 0 1
3: 3 0 1 1 1 0 0 1 0 1 1
4: 4 0 1 1 1 0 0 1 0 1 1
5: 5 1 1 1 1 0 0 0 1 1 0
Here is a dplyr solution:
test %>%
  mutate(flag = (A == lag(A) &
                 B == lag(B) &
                 C == lag(C) &
                 D == lag(D))) %>%
  mutate(twice = lead(flag) == T) %>%
  mutate(E = ifelse(flag == T | twice == T, 1, 0)) %>%
  mutate(E = ifelse(is.na(E), 0, 1)) %>%
  mutate(FF = ifelse(((A + lag(A)) + (B + lag(B)) + (C + lag(C)) + (D + lag(D))) == 7, 1, 0)) %>%
  mutate(FF = ifelse(is.na(FF) | FF == 0, 0, 1)) %>%
  select(A, B, C, D, E, FF)
Result:
A B C D E FF
1 1 0 1 1 1 0
2 1 0 1 1 1 0
3 0 1 1 1 1 0
4 0 1 1 1 1 0
5 1 1 1 1 0 1

Find consecutive values in dataframe

I have a dataframe. I wish to detect consecutive numbers and populate a new column as 1 or 0.
ID Val
1 a 8
2 a 7
3 a 5
4 a 4
5 a 3
6 a 1
Expected output
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
You could do this with the diff function in combination with abs and see whether the outcome is 1 or another value:
d$outP <- c(0, abs(diff(d$Val)) == 1)
which gives:
> d
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
If you only want to take decreasing consecutive values into account, you can use:
c(0, diff(d$Val) == -1)
If you want to do this for each ID, you can also do it in base R or with dplyr:
# base R
d$outP <- ave(d$Val, d$ID, FUN = function(x) c(0, abs(diff(x)) == 1))
# dplyr
library(dplyr)
d %>%
  group_by(ID) %>%
  mutate(outP = c(0, abs(diff(Val)) == 1))
We can also use a faster option by comparing the previous value with the current one:
with(df1, as.integer(c(FALSE, Val[-length(Val)] - Val[-1]) ==1))
#[1] 0 1 0 1 1 0
If we need to group by "ID", one option is data.table
library(data.table)
setDT(df1)[, outP := as.integer((shift(Val, fill = Val[1]) - Val) == 1), by = ID]
