count number of consecutive cells under some conditions - r

I'd like to create a new column "mzpceyrs" that records the number of consecutive years "ccode" has remained at peace. The "mzinit" variable codes whether or not "ccode" initiates a conflict in a given year.
ccode  year  mzinit  mzpceyrs
2      1816  1       NA
2      1817  1       0
2      1818  1       0
2      1819  0       1
2      1820  0       ??
2      1821  0       ??
I suppose there are far more efficient ways to do this, but the following code is what I've come up with so far. I basically consider four different scenarios:
for(i in 1:nrow(test)){
  previousindex <- i-1
  if(identical(previousindex, integer(0))){
    test$mzpceyrs[i] <- NA
  }
  else if(test$mzinit[previousindex]==1 & test$mzinit[i]==1){
    test$mzpceyrs[i] <- 0
  }
  else if(test$mzinit[previousindex]==1 & test$mzinit[i]==0){
    test$mzpceyrs[i] <- 1
  }
  else if(test$mzinit[previousindex]==0 & test$mzinit[i]==1){
    test$mzpceyrs[i] <- 0
  }
  else if(test$mzinit[previousindex]==0 & test$mzinit[i]==0){
    test$mzpceyrs[i] <- ??
  }
}
i) If "ccode" initiates a conflict in the previous year AND in the current year, I assign a value of 0 (no peace year).
ii) If "ccode" initiates a conflict in the previous year and DOES NOT in the current year, I assign a value of 1 (one peace year).
iii) If "ccode" does NOT initiate a conflict in the previous year and DOES initiate a conflict in the current year, I assign a value of 0 (no peace year).
iv) If "ccode" DOES NOT initiate a conflict in the previous year and DOES NOT in the current year, I assign the number of consecutive peace years "ccode" has accumulated so far.
I'm struggling with how to code the last scenario. Would you be able to share your insights in terms of how to calculate consecutive "0" values after it has 1 in the "mzinit" column? For example, my desired outcome is to have 2 and 3 in the "mzpceyrs" variable in rows 5 and 6. Any advice would be much appreciated.

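Using data.table:
library(data.table)
setDT(test)[, mzpceyrs := rowid(rleid(mzinit)) * (mzinit == 0)]
rleid() labels each run of identical mzinit values, so rowid() restarts at 1 whenever a run of peace years is interrupted; rowid(mzinit) alone would keep counting across separate runs.
-output
> test
   ccode  year mzinit mzpceyrs
   <int> <int>  <int>    <int>
1:     2  1816      1        0
2:     2  1817      1        0
3:     2  1818      1        0
4:     2  1819      0        1
5:     2  1820      0        2
6:     2  1821      0        3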
data
test <- structure(list(ccode = c(2L, 2L, 2L, 2L, 2L, 2L), year = 1816:1821,
mzinit = c(1L, 1L, 1L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
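For comparison, here is a minimal base R sketch of the same run-length idea. It assumes the goal is a counter that restarts whenever a run of mzinit == 0 is interrupted, computed separately for each ccode. The helper name peace_years and the second country (ccode 20) are made up for illustration and are not part of the original data.

```r
# Hypothetical helper: count consecutive 0s, resetting to 0 on every 1.
peace_years <- function(init) {
  runs <- rle(init == 0)                    # runs of peace (TRUE) and conflict (FALSE)
  counter <- sequence(runs$lengths)         # 1, 2, 3, ... within each run
  counter * rep(runs$values, runs$lengths)  # zero out the conflict runs
}

# Illustrative data: ccode 20 is invented to show a peace run being interrupted.
test <- data.frame(
  ccode  = c(2L, 2L, 2L, 2L, 2L, 2L, 20L, 20L, 20L, 20L, 20L),
  year   = c(1816:1821, 1816:1820),
  mzinit = c(1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L)
)

# ave() applies the helper within each ccode and returns a full-length vector.
test$mzpceyrs <- ave(test$mzinit, test$ccode, FUN = peace_years)
test
```

Unlike the desired output in the question, this leaves 0 rather than NA in the first row of each country; setting that row to NA would be a separate step.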

You can use cumsum() combined with rleid() from the data.table package:
library(data.table)
setDT(df)[, mzpceyrs := cumsum(mzinit == 0), by = rleid(mzinit)]
Output:
   ccode  year mzinit mzpceyrs
1:     2  1816      1        0
2:     2  1817      1        0
3:     2  1818      1        0
4:     2  1819      0        1
5:     2  1820      0        2
6:     2  1821      0        3
Input:
structure(list(ccode = c(2L, 2L, 2L, 2L, 2L, 2L), year = 1816:1821,
mzinit = c(1L, 1L, 1L, 0L, 0L, 0L)), row.names = c(NA, -6L
), class = "data.frame")
Alternatively, you can call data.table::rleid() within a dplyr pipeline, but it is slower and more verbose:
library(dplyr)
df %>%
  group_by(rl = data.table::rleid(mzinit)) %>%
  mutate(mzpceyrs = cumsum(mzinit == 0)) %>%
  ungroup() %>%
  select(-rl)

Related

R - sum multiple separate columns for each unique ID? Aggregate?

My dataset features several blocks, each containing several plots. In each plot, three different lifeforms were marked as present/absent (i.e. 1/0):
Block  Plot  tree  bush  grass
1      1     0     1     0
1      2     1     1     1
1      3     1     1     1
2      1     0     0     1
2      2     0     0     1
2      3     1     0     1
I'm looking for code that will sum the counts for each distinct lifeform at the block level.
I would like an output that resembles this:
Block  tree  bush  grass
1      2     3     2
2      1     0     3
I have tried this many ways but the only thing that comes close is:
aggregate(df[,3:5], by = list(df$block), FUN = sum)
However, what this actually returns is:
Block  tree  bush  grass
1      7     7     7
2      4     4     4
It appears to be summing all columns together instead of keeping the lifeforms separate.
I feel as though this should be so simple, as there are many queries online about similar processes, but nothing I try has worked.
library(tidyverse)
df %>%
  select(-Plot) %>%
  pivot_longer(-Block) %>%
  group_by(Block, name) %>%
  summarise(sum = sum(value)) %>%
  pivot_wider(names_from = name, values_from = sum)
# A tibble: 2 × 4
# Groups:   Block [2]
  Block  bush grass  tree
  <dbl> <dbl> <dbl> <dbl>
1     1     3     2     2
2     2     0     3     1
You were close. Maybe just a typo?
The data frame style:
aggregate(df[,3:5], by = list(Block = df$Block), sum)
  Block tree bush grass
1     1    2    3     2
2     2    1    0     3
Or a formula-style aggregate:
aggregate(. ~ Block, df[,-2], sum)
  Block tree bush grass
1     1    2    3     2
2     2    1    0     3
With dplyr:
library(dplyr)
df %>%
  group_by(Block) %>%
  summarize(across(tree:grass, sum))
# A tibble: 2 × 4
  Block  tree  bush grass
  <int> <int> <int> <int>
1     1     2     3     2
2     2     1     0     3
Data
df <- structure(list(Block = c(1L, 1L, 1L, 2L, 2L, 2L), Plot = c(1L,
2L, 3L, 1L, 2L, 3L), tree = c(0L, 1L, 1L, 0L, 0L, 1L), bush = c(1L,
1L, 1L, 0L, 0L, 0L), grass = c(0L, 1L, 1L, 1L, 1L, 1L)), class =
"data.frame", row.names = c(NA,
-6L))
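Base R also has rowsum(), which sums the rows of a matrix or data frame within groups. A minimal sketch on the same data, offered as one more option alongside aggregate() rather than as a replacement for the answers above:

```r
df <- data.frame(
  Block = c(1L, 1L, 1L, 2L, 2L, 2L),
  Plot  = c(1L, 2L, 3L, 1L, 2L, 3L),
  tree  = c(0L, 1L, 1L, 0L, 0L, 1L),
  bush  = c(1L, 1L, 1L, 0L, 0L, 0L),
  grass = c(0L, 1L, 1L, 1L, 1L, 1L)
)

# Sum each lifeform column separately within each Block.
# The group labels end up as row names rather than as a Block column.
totals <- rowsum(df[c("tree", "bush", "grass")], group = df$Block)
totals
```

Selecting the columns by name instead of by position (3:5) keeps the call valid if the column order changes.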

Calculate difference between index date and date with first indicator

Suppose I have the following data frame with an index date and follow-up dates, where a "1" in the indicator column marks a stop. For each id, I want the difference in days between the index date and the stop date, placed in the row with the stop indicator. If no stop indicator is present, I want the number of days from the index date to the last observation, placed in the last row:
id date group indicator
1 15-01-2022 1 0
1 15-01-2022 2 0
1 16-01-2022 2 1
1 20-01-2022 2 0
2 18-01-2022 1 0
2 20-01-2022 2 0
2 27-01-2022 2 0
Want:
id date group indicator stoptime
1 15-01-2022 1 0 NA
1 15-01-2022 2 0 NA
1 16-01-2022 2 1 1
1 20-01-2022 2 0 NA
2 18-01-2022 1 0 NA
2 20-01-2022 2 0 NA
2 27-01-2022 2 0 9
Convert 'date' to Date class, then, grouped by 'id', find the position of the first 1 in 'indicator' (if none is found, use the last position, n()), and take the difference in days between the 'date' at that position and the first 'date':
library(dplyr)
library(lubridate)
df1 %>%
  mutate(date = dmy(date)) %>%
  group_by(id) %>%
  mutate(ind = match(1, indicator, nomatch = n()),
         stoptime = case_when(row_number() == ind ~
           as.integer(difftime(date[ind], first(date), units = "days"))),
         ind = NULL) %>%
  ungroup()
-output
# A tibble: 7 × 5
id date group indicator stoptime
<int> <date> <int> <int> <int>
1 1 2022-01-15 1 0 NA
2 1 2022-01-15 2 0 NA
3 1 2022-01-16 2 1 1
4 1 2022-01-20 2 0 NA
5 2 2022-01-18 1 0 NA
6 2 2022-01-20 2 0 NA
7 2 2022-01-27 2 0 9
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), date = c("15-01-2022",
"15-01-2022", "16-01-2022", "20-01-2022", "18-01-2022", "20-01-2022",
"27-01-2022"), group = c(1L, 2L, 2L, 2L, 1L, 2L, 2L), indicator = c(0L,
0L, 1L, 0L, 0L, 0L, 0L)), class = "data.frame",
row.names = c(NA,
-7L))
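The same steps can also be sketched in base R without dplyr or lubridate. This is only an illustration of the logic described above (first stop per id, falling back to the last row), using a plain loop over the row indices of each id:

```r
df1 <- data.frame(
  id        = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  date      = c("15-01-2022", "15-01-2022", "16-01-2022", "20-01-2022",
                "18-01-2022", "20-01-2022", "27-01-2022"),
  group     = c(1L, 2L, 2L, 2L, 1L, 2L, 2L),
  indicator = c(0L, 0L, 1L, 0L, 0L, 0L, 0L)
)

df1$date <- as.Date(df1$date, format = "%d-%m-%Y")
df1$stoptime <- NA_integer_

# Loop over the row indices belonging to each id.
for (rows in split(seq_len(nrow(df1)), df1$id)) {
  # Position of the first stop within this id, or the last row if there is none.
  hit <- match(1L, df1$indicator[rows], nomatch = length(rows))
  # Days between the first date of the id and the date at that position.
  df1$stoptime[rows[hit]] <- as.integer(df1$date[rows[hit]] - df1$date[rows[1]])
}
df1
```

Like the dplyr version, this assumes the rows are already sorted by date within each id.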

How can I remove rows on more conditions in R?

I have session id's, client id's, a conversion column and all with a specific date. I want to delete the rows after the last purchase of a client. My data looks as follows:
SessionId ClientId Conversion Date
1 1 0 05-01
2 1 0 06-01
3 1 0 07-01
4 1 1 08-01
5 1 0 09-01
6 2 0 05-01
7 2 1 06-01
8 2 0 07-01
9 2 1 08-01
10 2 0 09-01
As output I want:
SessionId ClientId Conversion Date
1 1 0 05-01
2 1 0 06-01
3 1 0 07-01
4 1 1 08-01
6 2 0 05-01
7 2 1 06-01
8 2 0 07-01
9 2 1 08-01
It looks quite easy, but it has some conditions. For each client id, the sessions after that customer's last purchase need to be deleted. I have many observations, so deleting everything after a particular date is not possible; the code needs to check, for every client id, when a purchase was made.
I have no clue what kind of function I need to use for this. Maybe a certain kind of loop?
Hopefully someone can help me with this.
If your data is already ordered according to Date, for each ClientId we can select all the rows before the last conversion took place.
This can be done in base R :
subset(df, ave(Conversion == 1, ClientId, FUN = function(x) seq_along(x) <= max(which(x))))
Using dplyr :
library(dplyr)
df %>% group_by(ClientId) %>% filter(row_number() <= max(which(Conversion == 1)))
Or data.table :
library(data.table)
setDT(df)[, .SD[seq_len(.N) <= max(which(Conversion == 1))], ClientId]
We could try
library(dplyr)
df1 %>%
group_by(ClientId) %>%
slice(seq_len(tail(which(Conversion == 1), 1)))
data
df1 <- structure(list(SessionId = 1:10, ClientId = c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L), Conversion = c(0L, 0L, 0L, 1L, 0L, 0L,
1L, 0L, 1L, 0L), Date = c("05-01", "06-01", "07-01", "08-01",
"09-01", "05-01", "06-01", "07-01", "08-01", "09-01")),
class = "data.frame", row.names = c(NA,
-10L))
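One case the answers above do not spell out is a client with no conversion at all: which(Conversion == 1) is then empty, so max() returns -Inf with a warning and the max(which(...)) filters drop every row for that client. A minimal base R sketch that handles this explicitly follows. Keeping all rows for such clients is an assumption, and the helper name keep_through_last_purchase and ClientId 3 are made up for illustration.

```r
# Hypothetical helper: TRUE for rows up to and including the last conversion.
keep_through_last_purchase <- function(conversion) {
  hits <- which(conversion == 1)
  if (length(hits) == 0) {
    # Assumption: a client who never converted keeps all sessions.
    return(rep(TRUE, length(conversion)))
  }
  seq_along(conversion) <= max(hits)
}

# Illustrative data: ClientId 3 is invented and never converts.
df <- data.frame(
  SessionId  = 1:12,
  ClientId   = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L),
  Conversion = c(0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L)
)

# ave() stores the logical result in an integer vector, so convert it back.
keep <- as.logical(ave(df$Conversion, df$ClientId, FUN = keep_through_last_purchase))
result <- df[keep, ]
result
```

As with the answers above, this relies on the rows being ordered by date within each client.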

Subset data based on first occurrence of a status flag

My data looks like this:
dfin <-
ID TIME CONC STATUS
1 0 5 0
1 1 4 1
1 2 3 0
2 0 2 0
2 10 2 0
2 15 1 0
I want to subset dfin to the first occurrence (for each ID) where STATUS==1 and TIME > 0. If a subject ID has no STATUS==1 recorded at any time, then I need the last row of that subject.
the output here should be:
dfout <-
ID TIME CONC STATUS
1 1 4 1
2 15 1 0
One way with dplyr: group_by ID and check whether any row satisfies the condition (STATUS == 1 & TIME > 0). If so, take the first such row using which.max; if not, return the last row using n().
library(dplyr)
dfin %>%
  group_by(ID) %>%
  slice(ifelse(any(STATUS == 1 & TIME > 0), which.max(STATUS == 1 & TIME > 0), n()))
# ID TIME CONC STATUS
# <int> <int> <int> <int>
#1 1 1 4 1
#2 2 15 1 0
Another approach uses only base R and follows the same logic as the dplyr version. Because ave returns a vector as long as its input, the function flags the one row to keep within each ID, and the resulting logical vector subsets the data frame.
keep <- with(dfin, ave(STATUS == 1 & TIME > 0, ID, FUN = function(x) {
  target <- if (any(x)) which.max(x) else length(x)
  seq_along(x) == target
}))
dfin[keep, ]
#  ID TIME CONC STATUS
#2  1    1    4      1
#6  2   15    1      0
Here is one approach with data.table. Convert the data.frame to 'data.table' (setDT(dfin)), grouped by 'ID', if there is any 'STATUS' as 1, then get the logical expression where 'TIME' is greater than 0 or else get the last row (.N) and subset with .SD
library(data.table)
setDT(dfin)[, .SD[if(any(STATUS == 1)) STATUS == 1& TIME > 0 else .N], ID]
# ID TIME CONC STATUS
#1: 1 1 4 1
#2: 2 15 1 0
It can be also written as
setDT(dfin)[, .SD[(STATUS == 1 & TIME > 0)| (!any(STATUS) & seq_len(.N) == .N)], ID]
data
dfin <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), TIME = c(0L, 1L,
2L, 0L, 10L, 15L), CONC = c(5L, 4L, 3L, 2L, 2L, 1L), STATUS = c(0L,
1L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
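The selection rule (first row with STATUS == 1 and TIME > 0, otherwise the last row) can also be sketched with split() and lapply() in base R. The helper name first_event_or_last and subject 3 are made up for illustration; subject 3 has STATUS == 1 at TIME 0 as well, to show that such a row is skipped.

```r
# Hypothetical helper: pick one row from the data for a single subject.
first_event_or_last <- function(d) {
  hit <- d$STATUS == 1 & d$TIME > 0
  d[if (any(hit)) which(hit)[1] else nrow(d), ]
}

# Illustrative data: ID 3 is invented; IDs 1 and 2 match the question.
dfin <- data.frame(
  ID     = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  TIME   = c(0L, 1L, 2L, 0L, 10L, 15L, 0L, 5L, 9L),
  CONC   = c(5L, 4L, 3L, 2L, 2L, 1L, 6L, 4L, 2L),
  STATUS = c(0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L)
)

# Apply the helper to each subject and stack the selected rows.
dfout <- do.call(rbind, lapply(split(dfin, dfin$ID), first_event_or_last))
dfout
```

This assumes the rows are sorted by TIME within each ID, so that "first" and "last" refer to time order.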

How do I create occasion variable (time) for each ID?

I would like to create variable "Time" which basically indicates the number of times variable ID showed up within each day minus 1. In other words, the count is lagged by 1 and the first time ID showed up in a day should be left blank. Second time the same ID shows up on a given day should be 1.
Basically, I want to create the "Time" variable in the example below.
ID  Day  Time  Value
1   1          0
1   1    1     0
1   1    2     0
1   2          0
1   2    1     0
1   2    2     0
1   2    3     1
2   1          0
2   1    1     0
2   1    2     0
Below is the code I am working on. Have not been successful with it.
data$time<-data.frame(data$ID,count=ave(data$ID==data$ID, data$Day, FUN=cumsum))
We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'ID', 'Day', we get the lag of sequence of rows (shift(seq_len(.N))) and assign (:=) it as "Time" column.
library(data.table)
setDT(df1)[, Time := shift(seq_len(.N)), .(ID, Day)]
df1
# ID Day Value Time
# 1: 1 1 0 NA
# 2: 1 1 0 1
# 3: 1 1 0 2
# 4: 1 2 0 NA
# 5: 1 2 0 1
# 6: 1 2 0 2
# 7: 1 2 1 3
# 8: 2 1 0 NA
# 9: 2 1 0 1
#10: 2 1 0 2
Or with base R
with(df1, ave(Day, Day, ID, FUN= function(x)
ifelse(seq_along(x)!=1, seq_along(x)-1, NA)))
#[1] NA 1 2 NA 1 2 3 NA 1 2
Or without the ifelse
with(df1, ave(Day, Day, ID, FUN= function(x)
NA^(seq_along(x)==1)*(seq_along(x)-1)))
#[1] NA 1 2 NA 1 2 3 NA 1 2
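The arithmetic trick in the last snippet relies on NA^0 being 1 and NA^1 being NA in R. A small standalone illustration for a single group of four rows (the vector x is made up):

```r
x <- c(7L, 7L, 7L, 7L)             # one hypothetical ID/Day group
is_first <- seq_along(x) == 1      # TRUE only for the first row of the group
mask <- NA^is_first                # NA for the first row, 1 elsewhere
time <- mask * (seq_along(x) - 1)  # first row becomes NA, the rest count up from 1
time
```

The ifelse() version above does the same thing and is probably easier to read; the mask form simply avoids the explicit branch.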
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
Day = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), Value = c(0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L)), .Names = c("ID", "Day",
"Value"), row.names = c(NA, -10L), class = "data.frame")

Resources