How to calculate yearly retention rate by group in R?

I have a large data set with individuals located in counties over a period of multiple years. Each year, some individuals move to a different county or leave the data set and new individuals join.
I would like to count the number of individuals that stayed in the same county from one year to the next, and the number still present from year 1. Here is the question I found that comes closest to this task (without the additional grouping by counties): Month-over-month Customer Retention Rate in R
Here is a simplified version of the data set:
library(data.table)
dt <- setDT(data.frame(ID   = rep(c('a', 'b', 'c', 'd', 'a', 'c', 'd', 'e', 'c', 'e', 'f'), 2),
                       CTY  = rep(c(1, 2), each = 11),
                       YEAR = rep(c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3), 2)))
My solution so far relies on a loop:
x <- matrix(NA, 2, 3)
y <- matrix(NA, 2, 3)
for (i in 1:2) {
  for (j in 1:3) {
    x[i, j] <- ifelse(j == 1, NA, sum(dt[CTY == i & YEAR == j, ID] %in% dt[CTY == i & YEAR == j - 1, ID]))
    y[i, j] <- ifelse(j == 1, NA, sum(dt[CTY == i & YEAR == 1, ID] %in% dt[CTY == i & YEAR == j, ID]))
  }
}
which, after reshaping and joining back, gives:
colnames(x) <- unique(dt$YEAR)
rownames(x) <- unique(dt$CTY)
x <- reshape2::melt(x)
names(x) <- c("CTY", "YEAR", "stayed")
x <- x[order(x$CTY),]
colnames(y) <- unique(dt$YEAR)
rownames(y) <- unique(dt$CTY)
y <- reshape2::melt(y)
names(y) <- c("CTY", "YEAR", "stayed2")
y <- y[order(y$CTY),]
dt <- dt[x, on = c("CTY", "YEAR")]
dt <- dt[y, on = c("CTY", "YEAR")]
dt
# ID CTY YEAR stayed stayed2
# 1: a 1 1 NA NA
# 2: b 1 1 NA NA
# 3: c 1 1 NA NA
# 4: d 1 1 NA NA
# 5: a 1 2 3 3
# 6: c 1 2 3 3
# 7: d 1 2 3 3
# 8: e 1 2 3 3
# 9: c 1 3 2 1
# 10: e 1 3 2 1
# 11: f 1 3 2 1
# 12: a 2 1 NA NA
# 13: b 2 1 NA NA
# 14: c 2 1 NA NA
# 15: d 2 1 NA NA
# 16: a 2 2 3 3
# 17: c 2 2 3 3
# 18: d 2 2 3 3
# 19: e 2 2 3 3
# 20: c 2 3 2 1
# 21: e 2 3 2 1
# 22: f 2 3 2 1
This is the right final table, but the manipulation of the loop output seems unnecessary; in short, this works, but it is clunky and slow.
I have experimented with data.table and dplyr solutions but can't seem to make them work.

Try the sapply function, like this:
fx <- function(x) ifelse(x$YEAR == 1, NA,
                         sum(dt[CTY == x$CTY & YEAR == x$YEAR, ID] %in% dt[CTY == x$CTY & YEAR == x$YEAR - 1, ID]))
fy <- function(y) ifelse(y$YEAR == 1, NA,
                         sum(dt[CTY == y$CTY & YEAR == 1, ID] %in% dt[CTY == y$CTY & YEAR == y$YEAR, ID]))
x <- merge(data.frame(CTY = 1:2), data.frame(YEAR = 1:3))
s <- data.frame(x, stayed = sapply(split(x, 1:nrow(x)), fx))
s <- data.frame(s, stayed2 = sapply(split(x, 1:nrow(x)), fy))
merge(dt, s)
# CTY YEAR ID stayed stayed2
# 1: 1 1 a NA NA
# 2: 1 1 b NA NA
# 3: 1 1 c NA NA
# 4: 1 1 d NA NA
# 5: 1 2 a 3 3
# 6: 1 2 c 3 3
# 7: 1 2 d 3 3
# 8: 1 2 e 3 3
# 9: 1 3 c 2 1
# 10: 1 3 e 2 1
# 11: 1 3 f 2 1
# 12: 2 1 a NA NA
# 13: 2 1 b NA NA
# 14: 2 1 c NA NA
# 15: 2 1 d NA NA
# 16: 2 2 a 3 3
# 17: 2 2 c 3 3
# 18: 2 2 d 3 3
# 19: 2 2 e 3 3
# 20: 2 3 c 2 1
# 21: 2 3 e 2 1
# 22: 2 3 f 2 1
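For reference, the same counts can also be done in grouped data.table syntax. This is only a sketch built on the question's original dt (before the joins above), and it assumes every county appears in consecutive years starting at year 1, as in the example:
library(data.table)
retention <- dt[, {
  # split this county's IDs by year, then compare each year's IDs
  # with the previous year's (stayed) and with year 1's (stayed2)
  ids <- split(ID, YEAR)
  yrs <- as.numeric(names(ids))
  .(YEAR    = yrs,
    stayed  = c(NA, sapply(2:length(yrs), function(k) sum(ids[[k]] %in% ids[[k - 1]]))),
    stayed2 = c(NA, sapply(2:length(yrs), function(k) sum(ids[[1]] %in% ids[[k]]))))
}, by = CTY]
dt[retention, on = c("CTY", "YEAR")]
This replaces the repeated subsetting of the loop with a single grouped pass per county, which should scale considerably better.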

Related

Get the group index in a data.table

I have the following data.table:
library(data.table)
DT <- data.table(a = c(1,2,3,4,5,6,7,8,9,10),
                 b = c('A','A','A','B','B','C','C','C','D','D'),
                 c = c(1,1,1,1,1,2,2,2,2,2))
> DT
a b c
1: 1 A 1
2: 2 A 1
3: 3 A 1
4: 4 B 1
5: 5 B 1
6: 6 C 2
7: 7 C 2
8: 8 C 2
9: 9 D 2
10: 10 D 2
I want to add a column that shows the index within each group of column c (restarting from 1 for each group), but that only increments when the value of b changes. The desired result is the col column shown in the output below.
Here are two ways to do this:
Using rleid:
library(data.table)
DT[, col := rleid(b), c]
With match + unique:
DT[, col := match(b, unique(b)), c]
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
We can use factor with levels specified and coerce it to integer
library(data.table)
DT[, col := as.integer(factor(b, levels = unique(b))), c]
Output:
DT
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
Or using base R with rle:
with(DT, as.integer(ave(b, c, FUN = function(x)
  with(rle(x), rep(seq_along(values), lengths)))))
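For completeness, the match/unique idea translates directly to dplyr; a sketch, assuming dplyr is installed:
library(dplyr)
DT %>%
  group_by(c) %>%
  mutate(col = match(b, unique(b))) %>%
  ungroup()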

R Indicate If Group Has Value

data <- data.frame(GROUP  = c(1,1,1,2,2,2,3,3,3,4,4,4),
                   VAR    = c('A','B','C','A','B','C','A','B','C','A','B','C'),
                   SCORE  = c(3,6,NA,NA,NA,NA,1,NA,5,5,2,NA),
                   NEWVAR = c(1,1,1,NA,NA,NA,2,2,2,1,1,1))
score1 <- c(2,3,7)
score2 <- c(0,5,6)
I have 'data' with all columns except 'NEWVAR', and I wish to create it like this:
If any value of score1 is in SCORE for a particular 'GROUP', then 'NEWVAR' is 1.
If any value of score2 is in SCORE, and no value of score1 is, for that 'GROUP', then 'NEWVAR' is 2.
Using dplyr:
library(dplyr)
input <- data %>% select(-NEWVAR)
input %>%
  group_by(GROUP) %>%
  mutate(NEWVAR = case_when(any(SCORE %in% score1) ~ 1,
                            any(SCORE %in% score2 & !(SCORE %in% score1)) ~ 2))
Since you have tagged this with data.table, here is ifelse in data.table syntax:
library(data.table)
setDT(data)[, NEWVAR := ifelse(any(SCORE %in% score1), 1,
                               ifelse(any(SCORE %in% score2), 2, NA_integer_)), GROUP]
data
# GROUP VAR SCORE NEWVAR
# 1: 1 A 3 1
# 2: 1 B 6 1
# 3: 1 C NA 1
# 4: 2 A NA NA
# 5: 2 B NA NA
# 6: 2 C NA NA
# 7: 3 A 1 2
# 8: 3 B NA 2
# 9: 3 C 5 2
#10: 4 A 5 1
#11: 4 B 2 1
#12: 4 C NA 1
I think there is also fcase in the development version of data.table, which is similar to case_when in dplyr and makes such nested ifelse calls easier to write.
A base R alternative with ave generalises to a list of score vectors:
scores <- list(score1, score2)
ave(data$SCORE, data$GROUP, FUN = function(x) {
  chk <- sapply(scores, function(y) any(y %in% x))
  seq_along(scores)[chk][1]
})
# [1] 1 1 1 NA NA NA 2 2 2 1 1 1
We can use fifelse in data.table (note that fifelse needs the NA to match the output type, hence NA_real_):
library(data.table)
setDT(data)[, NEWVAR := fifelse(any(SCORE %in% score1), 1,
                                fifelse(any(SCORE %in% score2), 2, NA_real_)), GROUP]
data
# GROUP VAR SCORE NEWVAR
# 1: 1 A 3 1
# 2: 1 B 6 1
# 3: 1 C NA 1
# 4: 2 A NA NA
# 5: 2 B NA NA
# 6: 2 C NA NA
# 7: 3 A 1 2
# 8: 3 B NA 2
# 9: 3 C 5 2
#10: 4 A 5 1
#11: 4 B 2 1
#12: 4 C NA 1
Or with fcase from the devel version of data.table:
setDT(data)[, NEWVAR := fcase(any(SCORE %in% score1), 1,
                              any(SCORE %in% score2), 2,
                              default = NA_real_), GROUP]
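If the number of score vectors may grow, the ave() idea above can also be expressed in grouped data.table syntax; a sketch, reusing the scores list (this recreates NEWVAR):
library(data.table)
scores <- list(score1, score2)
setDT(data)[, NEWVAR := {
  # index of the first score vector sharing a value with this group's SCORE
  chk <- sapply(scores, function(s) any(s %in% SCORE))
  if (any(chk)) which(chk)[1] else NA_integer_
}, by = GROUP]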

Applying a conditional replace function over every cell in a data frame

I'm trying to subset a dataframe in R by checking if each value is present in a specific list and keeping it if it is. For instance in the following dataframe:
x <- data.frame(A = sample(1:5, 5),
B = sample(1:5, 5),
C = sample(1:5, 5))
A B C
1 2 2 1
2 3 3 3
3 1 4 4
4 4 5 2
5 5 1 5
How could I subset it to include only the values 1, 3 and 4, giving the following as a result:
A B C
1 1
2 3 3 3
3 4 4
4 4
5 1
It doesn't matter what happens to the missing values - they could be changed to NA if this is easier. From browsing similar questions it seems that lapply might do it, but as a novice I'm struggling to apply what I've seen to this scenario.
set.seed(47)
x <- data.frame(A = sample(1:5, 5),
B = sample(1:5, 5),
C = sample(1:5, 5))
# with lapply
keep_vals = c(1, 3, 4)
x[] = lapply(x, function(y) {
  y[!y %in% keep_vals] = NA
  return(y)
})
x
# A B C
# 1 3 1 1
# 2 1 NA NA
# 3 NA NA 4
# 4 4 3 NA
# 5 NA 4 3
Or with a for loop:
set.seed(47) # reset data
x <- data.frame(A = sample(1:5, 5),
B = sample(1:5, 5),
C = sample(1:5, 5))
keep_vals = c(1, 3, 4)
for (i in 1:ncol(x)) {
  x[, i][!x[, i] %in% keep_vals] <- NA
}
x
# A B C
# 1 3 1 1
# 2 1 NA NA
# 3 NA NA 4
# 4 4 3 NA
# 5 NA 4 3
With dplyr:
library(dplyr)
x %>% mutate_all(
  ~ replace(., !. %in% keep_vals, NA)
)
# A B C
# 1 3 1 1
# 2 1 NA NA
# 3 NA NA 4
# 4 4 3 NA
# 5 NA 4 3
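In dplyr 1.0 and later, mutate_all is superseded by across; a sketch of the same replacement, assuming a recent dplyr:
library(dplyr)
x %>% mutate(across(everything(), ~ replace(.x, !.x %in% keep_vals, NA)))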
Using dplyr::bind_rows:
do.call(bind_rows, apply(x, 1, function(a) a[a %in% c(1, 3, 4)]))
# A tibble: 5 x 3
A B C
<int> <int> <int>
1 4 NA NA
2 1 1 1
3 3 3 NA
4 NA NA 4
5 NA 4 3
Collapsing each row to the matching numbers, and adjusting each length to ncol. (Assuming you want to "left-align" your numbers, as shown in your expected output.)
d <- setNames(as.data.frame(t(apply(d, 1, function(x) {
  x <- x[x %in% c(1, 3, 4)]
  `length<-`(x, ncol(d))
}))), names(d))
d
# A B C
# 1 1 NA NA
# 2 3 3 3
# 3 1 4 4
# 4 4 NA NA
# 5 NA NA NA
Since apply returns a matrix, we take the transpose, convert it with as.data.frame, and use setNames to restore the column names.
Note that I changed row 5 of your example data so that it doesn't contain any of the matching numbers, so as not to make it too easy.
Data
d <- read.table(text="A B C
1 2 2 1
2 3 3 3
3 1 4 4
4 4 5 2
5 5 2 5", header=TRUE)

Relative reference to rows in large data set

I have a very large data set (millions of rows) where I need to set certain rows to NA when var1 equals "Z". However, I also need to set to NA the row preceding each row with var1 == "Z".
E.g.:
id var1
1 A
1 B
1 Z
1 S
1 A
1 B
2 A
2 B
3 A
3 B
3 A
3 B
4 A
4 B
4 A
4 B
In this case, the second row and the third row for id==1 should be NA.
I have tried a loop, but it doesn't work as the data set is very large.
for (i in 1:length(df$var1)) {
  if (df$var1[i] == "Z") {
    df[i, ] <- NA
    df[(i - 1), ] <- NA
  }
}
I have also tried the data.table package, unsuccessfully. Do you have any idea how I could do this, or what the right term is to search for?
Maybe do it like this using data.table:
df <- as.data.table(read.table(header=T, file='clipboard'))
df$var1 <- as.character(df$var1)
#find where var1 == Z
index <- df[, which(var1 == 'Z')]
#add the previous lines too
index <- c(index, index-1)
#convert to NA
df[index, var1 := NA ]
Or in one call:
df[c(which(var1 == 'Z'), which(var1 == 'Z') - 1), var1 := NA ]
Output:
> df
id var1
1: 1 A
2: 1 NA
3: 1 NA
4: 1 S
5: 1 A
6: 1 B
7: 2 A
8: 2 B
9: 3 A
10: 3 B
11: 3 A
12: 3 B
13: 4 A
14: 4 B
15: 4 A
16: 4 B
If you want to take the preceding indices into account only when they come from the same id, I would suggest using the .I and by combination, which will make sure that you are not taking indices from the previous id:
setDT(df)[, var1 := as.character(var1)]
indx <- df[, {indx <- which(var1 == "Z") ; .I[c(indx - 1L, indx)]}, by = id]$V1
df[indx, var1 := NA_character_]
df
# id var1
# 1: 1 A
# 2: 1 NA
# 3: 1 NA
# 4: 1 S
# 5: 1 A
# 6: 1 B
# 7: 2 A
# 8: 2 B
# 9: 3 A
# 10: 3 B
# 11: 3 A
# 12: 3 B
# 13: 4 A
# 14: 4 B
# 15: 4 A
# 16: 4 B
You can have a base R approach:
x <- df$var1 == 'Z'
df[x | c(x[-1], FALSE), 'var1'] <- NA
# id var1
#1 1 A
#2 1 <NA>
#3 1 <NA>
#4 1 S
#5 1 A
#6 1 B
#7 2 A
#8 2 B
#9 3 A
#10 3 B
#11 3 A
#12 3 B
#13 4 A
#14 4 B
#15 4 A
#16 4 B
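The two answers above can be combined into a shift-based variant that keeps the lookup within each id; a sketch, assuming var1 has already been converted to character:
library(data.table)
setDT(df)[, var1 := {
  z <- var1 == "Z"
  # flag Z rows and, via a lead shift, the row immediately before each Z
  replace(var1, z | shift(z, type = "lead", fill = FALSE), NA_character_)
}, by = id]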

Merge dataframes on matching A, B and *closest* C?

I have two dataframes like so:
set.seed(1)
df <- cbind(expand.grid(x=1:3, y=1:5), time=round(runif(15)*30))
to.merge <- data.frame(x=c(2, 2, 2, 3, 2),
y=c(1, 1, 1, 5, 4),
time=c(17, 12, 11.6, 22.5, 2),
val=letters[1:5],
stringsAsFactors=F)
I want to merge to.merge into df (with all.x=T) such that:
df$x == to.merge$x AND
df$y == to.merge$y AND
abs(df$time - to.merge$time) <= 1; if multiple rows of to.merge satisfy this, we pick the one that minimises this distance.
How can I do this?
So my desired result is (this is just df with the corresponding val column of to.merge added for matching rows):
x y time val
1 1 1 8 NA
2 2 1 11 c
3 3 1 17 NA
4 1 2 27 NA
5 2 2 6 NA
6 3 2 27 NA
7 1 3 28 NA
8 2 3 20 NA
9 3 3 19 NA
10 1 4 2 NA
11 2 4 6 NA
12 3 4 5 NA
13 1 5 21 NA
14 2 5 12 NA
15 3 5 23 d
where to.merge was:
x y time val
1 2 1 17.0 a
2 2 1 12.0 b
3 2 1 11.6 c
4 3 5 22.5 d
5 2 4 2.0 e
Note - (2, 1, 17, a) didn't match into df because the time 17 was more than 1 away from df$time 11 for (x, y) = (2, 1).
Also, there were two rows in to.merge that satisfied the condition for matching to df's (2, 1, 11) row, but the 'c' row was picked instead of the 'b' row because its time was the closest to 11.
Finally, there may be rows in to.merge that do not match anything in df.
One way that works is a for-loop, but it takes far too long for my data (df has ~12k rows and to.merge has ~250k rows):
df$val <- NA
for (i in 1:nrow(df)) {
  row <- df[i, ]
  idx <- which(row$x == to.merge$x &
               row$y == to.merge$y &
               abs(row$time - to.merge$time) <= 1)
  if (length(idx)) {
    j <- idx[which.min(abs(row$time - to.merge$time[idx]))]
    df$val[i] <- to.merge$val[j]
  }
}
I feel that I can somehow do a merge, like:
to.merge$closest_time_in_df <- sapply(to.merge$time,
  function(tm) {
    dts <- abs(tm - df$time)
    # difference must be at most 1
    if (min(dts) <= 1) {
      df$time[which.min(dts)]
    } else {
      NA
    }
  })
merge(df, to.merge,
by.x=c('x', 'y', 'time'),
by.y=c('x', 'y', 'closest_time_in_df'),
all.x=T)
But this doesn't merge the (2, 1, 11) row, because to.merge$closest_time_in_df for (2, 1, 11.6, c) is 12, and a time of 12 in df corresponds to (x, y) = (2, 5), not (2, 1), hence the merge fails.
Use data.table with roll = 'nearest'; or, to limit the match to within 1, use roll = 1 (or roll = -1) together with rollends = c(TRUE, TRUE). For example:
library(data.table)
# create data.tables with the same key columns (x, y, time)
DT <- data.table(df, key = names(df))
tm <- data.table(to.merge, key = key(DT))
# use join syntax with roll = 'nearest'
tm[DT, roll='nearest']
# x y time val
# 1: 1 1 8 NA
# 2: 1 2 27 NA
# 3: 1 3 28 NA
# 4: 1 4 2 NA
# 5: 1 5 21 NA
# 6: 2 1 11 c
# 7: 2 2 6 NA
# 8: 2 3 20 NA
# 9: 2 4 6 e
# 10: 2 5 12 NA
# 11: 3 1 17 NA
# 12: 3 2 27 NA
# 13: 3 3 19 NA
# 14: 3 4 5 NA
# 15: 3 5 23 d
You can limit yourself to looking forward and back by at most 1 by setting roll = -1 and rollends = c(TRUE, TRUE):
new <- tm[DT, roll=-1, rollends =c(TRUE,TRUE)]
new
x y time val
1: 1 1 8 NA
2: 1 2 27 NA
3: 1 3 28 NA
4: 1 4 2 NA
5: 1 5 21 NA
6: 2 1 11 c
7: 2 2 6 NA
8: 2 3 20 NA
9: 2 4 6 NA
10: 2 5 12 NA
11: 3 1 17 NA
12: 3 2 27 NA
13: 3 3 19 NA
14: 3 4 5 NA
15: 3 5 23 d
Or you can roll = 1 first, then roll = -1, and combine the results (tidying up the val.1 column from the second rolling join):
new <- tm[DT, roll = 1][tm[DT, roll = -1]][
  is.na(val), val := ifelse(is.na(val.1), val, val.1)
][, val.1 := NULL]
new
x y time val
1: 1 1 8 NA
2: 1 2 27 NA
3: 1 3 28 NA
4: 1 4 2 NA
5: 1 5 21 NA
6: 2 1 11 c
7: 2 2 6 NA
8: 2 3 20 NA
9: 2 4 6 NA
10: 2 5 12 NA
11: 3 1 17 NA
12: 3 2 27 NA
13: 3 3 19 NA
14: 3 4 5 NA
15: 3 5 23 d
Using merge a couple of times and aggregate once, here is how to do it:
set.seed(1)
df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = F)
#Find rows that match by x and y
res <- merge(to.merge, df, by = c("x", "y"), all.x = TRUE)
res$dif <- abs(res$time.x - res$time.y)
res
## x y time.x val time.y dif
## 1 2 1 17.0 a 11 6.0
## 2 2 1 12.0 b 11 1.0
## 3 2 1 11.6 c 11 0.6
## 4 2 4 2.0 e 6 4.0
## 5 3 5 22.5 d 23 0.5
#Find rows that need to be merged
res1 <- merge(aggregate(dif ~ x + y, data = res, FUN = min), res)
res1
## x y dif time.x val time.y
## 1 2 1 0.6 11.6 c 11
## 2 2 4 4.0 2.0 e 6
## 3 3 5 0.5 22.5 d 23
#Finally merge the result back into df
final <- merge(df, res1[res1$dif <= 1, c("x", "y", "val")], all.x = TRUE)
final
## x y time val
## 1 1 1 8 <NA>
## 2 1 2 27 <NA>
## 3 1 3 28 <NA>
## 4 1 4 2 <NA>
## 5 1 5 21 <NA>
## 6 2 1 11 c
## 7 2 2 6 <NA>
## 8 2 3 20 <NA>
## 9 2 4 6 <NA>
## 10 2 5 12 <NA>
## 11 3 1 17 <NA>
## 12 3 2 27 <NA>
## 13 3 3 19 <NA>
## 14 3 4 5 <NA>
## 15 3 5 23 d
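A comparable join-then-pick-closest pipeline can also be sketched in dplyr (this assumes dplyr >= 1.0 for slice_min, and is an illustration rather than tested production code):
library(dplyr)
df %>%
  left_join(to.merge, by = c("x", "y"), suffix = c("", ".m")) %>%
  mutate(dif = abs(time - time.m)) %>%
  group_by(x, y, time) %>%
  slice_min(dif, n = 1, with_ties = FALSE) %>%  # closest candidate per df row
  ungroup() %>%
  mutate(val = ifelse(!is.na(dif) & dif <= 1, val, NA)) %>%
  select(x, y, time, val)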
mnel's answer uses roll = "nearest" in a data.table join but does not limit the match to +/- 1 as requested by the OP. In addition, MichaelChirico has suggested using the on parameter.
This approach uses:
roll = "nearest",
an update by reference, i.e., without copying,
setDT() to coerce a data.frame to a data.table without copying (introduced 2014-02-27 with v1.9.2 of data.table), and
the on parameter, which spares us from setting a key explicitly (introduced 2015-09-19 with v1.9.6).
So, the code below
library(data.table) # version 1.11.4 used
setDT(df)[setDT(to.merge), on = .(x, y, time), roll = "nearest",
val := replace(val, abs(x.time - i.time) > 1, NA)]
df
has updated df:
x y time val
1: 1 1 8 <NA>
2: 2 1 11 c
3: 3 1 17 <NA>
4: 1 2 27 <NA>
5: 2 2 6 <NA>
6: 3 2 27 <NA>
7: 1 3 28 <NA>
8: 2 3 20 <NA>
9: 3 3 19 <NA>
10: 1 4 2 <NA>
11: 2 4 6 <NA>
12: 3 4 5 <NA>
13: 1 5 21 <NA>
14: 2 5 12 <NA>
15: 3 5 23 d
Note that the order of rows has not been changed (in contrast to Chinmay Patil's answer).
In case df must not be changed, a new data.table can be created by
result <- setDT(to.merge)[setDT(df), on = .(x, y, time), roll = "nearest",
.(x, y, time, val = replace(val, abs(x.time - i.time) > 1, NA))]
result
which returns the same result as above.
