I have a dataset similar to the following format:
Account_ID Date Delinquency age count
1 01/01/2016 0 1 0
1 02/01/2016 1 2 0
1 03/01/2016 2 3 1
1 04/01/2016 0 4 2
1 05/01/2016 1 5 2
1 06/01/2016 2 6 2
2 01/01/2016 0 1 0
2 02/01/2016 0 2 0
2 03/01/2016 1 3 0
2 04/01/2016 0 4 1
2 05/01/2016 1 5 1
3 01/01/2016 1 1 0
3 02/01/2016 2 2 1
3 03/01/2016 3 3 2
3 04/01/2016 4 4 3
3 05/01/2016 5 5 4
3 06/01/2016 6 6 5
I want to count the number of non-zeros in the previous 3 months by account for each row, i.e. I want to create the count variable using the first 4 variables (Account_ID, Date, Delinquency, Age). I would like to know how to do this for n past months. I'm hoping I can extend this exercise to other tasks such as finding the max delinquency in the past 3 months.
Welcome to SE!
If you would like to count non-zero delinquency events over the 3 previous months by account for each row, you can use the aggregate function together with the zlag function from the TSA package, as in the code below. Because the values in your count column are difficult to interpret and to reconcile with the stated condition, the data in the example are simulated.
library(lubridate)
set.seed(123)
# data simulation
df <- data.frame( id = factor(rep(0:9, 100)),
date = sample(seq(ymd("2010-12-01"), by = 1, length.out = 1000), 1000, replace = TRUE),
deliquency = sample(c(rep(0, 30), 1:5), 1000, replace = TRUE),
age = sample(1:10, 1000, replace = TRUE))
head(df)
# id date deliquency age
# 1 0 2011-08-06 0 10
# 2 1 2013-08-16 0 6
# 3 2 2012-11-17 0 1
# 4 3 2012-09-12 0 9
# 5 4 2011-07-29 0 1
# 6 5 2011-02-25 0 9
# aggregation of delinquency by id and month
df$year_month <- df$date
day(df$year_month) <- 1
df_m <- aggregate(deliquency ~ id + year_month, data = df, sum)
df_m <- df_m[order(as.character(df_m$id), df_m$year_month), ]
# note: despite its name, is_zero is TRUE when the monthly delinquency is non-zero
df_m$is_zero <- df_m$deliquency > 0
head(df_m)
# id year_month deliquency is_zero
# 1 0 2010-12-01 1 TRUE
# 10 0 2011-01-01 0 FALSE
# 19 0 2011-02-01 0 FALSE
# 29 0 2011-03-01 0 FALSE
# 39 0 2011-04-01 0 FALSE
# 65 0 2011-07-01 1 TRUE
# count months with non-zero delinquency among the three previous rows, per id
library(TSA)
df_m_l <- by(df_m, df_m$id, function(dfx) {
dfx$zero_del <- zlag(dfx$is_zero, 1) + zlag(dfx$is_zero, 2) + zlag(dfx$is_zero, 3)
dfx})
df_m_res <- do.call(rbind, df_m_l)
head(df_m_res)
The output is a data.frame whose zero_del column gives, for each id, the number of months with non-zero delinquency among the 3 preceding rows. For example:
id year_month deliquency is_zero zero_del
0.1 0 2010-12-01 1 TRUE NA
0.10 0 2011-01-01 0 FALSE NA
0.19 0 2011-02-01 0 FALSE NA
0.29 0 2011-03-01 0 FALSE 1
0.39 0 2011-04-01 0 FALSE 0
0.65 0 2011-07-01 1 TRUE 0
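Note that zlag shifts by rows, so the approach above only matches "previous 3 months" when every id has one row per month with no gaps. Under that same assumption, here is a minimal base R sketch on the data from the question, with the window length n as a parameter. The helper name roll_prev and the columns count_prev3 and max_prev3 are made up for illustration. For account 3 a strict 3-row window gives 3 in the last two rows, not the 4 and 5 listed in the question's count column.

```r
# Data from the question (Delinquency by account, one row per month, already sorted).
df <- data.frame(
  Account_ID  = rep(1:3, times = c(6, 5, 6)),
  Delinquency = c(0, 1, 2, 0, 1, 2,
                  0, 0, 1, 0, 1,
                  1, 2, 3, 4, 5, 6)
)

# Hypothetical helper: apply f to the (at most) n rows preceding each row.
# Rows with no history get 0.
roll_prev <- function(x, n, f) {
  sapply(seq_along(x), function(i) {
    window <- tail(x[seq_len(i - 1)], n)
    if (length(window) == 0) 0 else f(window)
  })
}

n <- 3
# number of non-zero values in the previous n rows, per account
df$count_prev3 <- ave(df$Delinquency, df$Account_ID,
                      FUN = function(x) roll_prev(x, n, function(w) sum(w != 0)))
# maximum delinquency in the previous n rows, per account
df$max_prev3 <- ave(df$Delinquency, df$Account_ID,
                    FUN = function(x) roll_prev(x, n, max))
df
```

Changing n, or passing a different summary function, covers the other tasks mentioned in the question. If months can be missing, the window would have to be built from the dates rather than from row positions.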
I'm struggling to figure out how to do the following. I want to find how many days of high sales I had immediately before a price change. For example, I have a price change on day 10, and the high sales indicator flags any day with sales greater than or equal to 10. I need my algorithm to count the number of consecutive high-sales days leading up to the change.
In this case it should return 5 (days 5 to 9).
For example purposes, the dataframe is called df. Code:
#trying to create a while loop that will check if lag(high_sales) is 1, if yes it will count until
#there's a lag(high_sales) ==0
#loop is just my dummy variable that will take me out of the while loop
count_sales<-0
loop<-0
df<- df %>% mutate(consec_high_days= ifelse(price_change > 0, while(loop==0){
if(lag(High_sales_ind)==1){
count_sales<-count_sales +1}
else{loop<-0}
count_sales},0))
day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0
This is my error message:
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i the condition has length > 1 and only the first element will be used
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i 'x' is NULL so the result will be NULL
Error: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
x replacement has length zero
Any help would be greatly appreciated.
This is an inelegant, brute-force answer, and hopefully someone can provide a more elegant one, but to get the desired dataset you can try:
df <- read.table(text = "day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0", header = TRUE)
library(dplyr)  # provides %>% and lag()
# position within the current run of identical high-sales flags
df$seq <- sequence(rle(as.character(df$sales >= 10))$lengths)
# run position reached on the previous day
df <- df %>% mutate(lseq = lag(seq))
# length of the run just before the price change, and the last row of that run
keepz <- df[df$price_change != 0, "lseq"]
end <- which(df$price_change != 0) - 1
# the run ends at `end` and is `keepz` rows long (assumes a single price change)
df_want <- df[(end - keepz + 1):end, -c(6:7)]
Output:
# day price price_change sales High_sales_ind
# 5 5 5 0 10 1
# 6 6 5 0 10 1
# 7 7 5 0 10 1
# 8 8 5 0 12 1
# 9 9 5 0 14 1
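If the goal is the count itself (5) rather than the rows, a vectorised sketch in base R avoids the while loop entirely. It assumes the rows are sorted by day. The names run and consec_high_days are only illustrative.

```r
df <- read.table(text = "day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0", header = TRUE)

# Running count of consecutive high-sales days: every 0 starts a new group,
# and the cumulative sum inside a group counts the 1s that follow it.
run <- ave(df$High_sales_ind, cumsum(df$High_sales_ind == 0), FUN = cumsum)

# Shift by one row so each day sees the streak that ended the day before.
prev_run <- c(0, head(run, -1))

# Report the streak only on days with a price change, 0 elsewhere.
df$consec_high_days <- ifelse(df$price_change > 0, prev_run, 0)
df$consec_high_days[df$day == 10]
```

The same prev_run vector works unchanged when there are several price changes in the data.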
There are four types of common coins in US currency:
quarters (25 cents)
dimes (10 cents)
nickels (5 cents), and
pennies (1 cent)
There are six ways to make change for 15 cents:
A dime and a nickel
A dime and 5 pennies
3 nickels
2 nickels and 5 pennies
A nickel and 10 pennies
15 pennies
Task:
How many ways are there to make change for a dollar using these common coins? (1 dollar = 100 cents).
tl;dr
There are 242 possibilities to make 1 dollar out of an unlimited supply of 1, 5, 10 and 25 cent-pieces.
code
Here is a go at it using the comboGeneral() function from the RcppAlgos package.
Just set sum_constraint to the sum you want the coin values to add up to.
library(RcppAlgos)
library(data.table)
# possible coin-values
vec <- c( 1, 5, 10, 25 )
#desired sum
sum_constraint <- 15
l <- lapply( 1:(sum_constraint / min(vec)) , function(x) {
#calculate possible combinations (output = matrix)
temp <- comboGeneral( vec,
m = x,
repetition = TRUE,
constraintFun = "sum",
comparisonFun = "==",
limitConstraints = sum_constraint )
#create rowwise frequency-table of the freshly created matrix,
#and convert the table to a data.frame
as.data.frame.matrix( table( c( row(temp)), c(temp) ) )
})
#bind the list together to a data.table
answer <- rbindlist(l, idcol = "no_coins", use.names = TRUE, fill = TRUE )
#set missing values to 0
answer[ is.na(answer) ] <- 0
#output
answer
sum_constraint = 15
# no_coins 5 10 1
# 1: 2 1 1 0
# 2: 3 3 0 0
# 3: 6 0 1 5
# 4: 7 2 0 5
# 5: 11 1 0 10
# 6: 15 0 0 15
sum_constraint = 100
# no_coins 25 5 10 1
# 1: 4 4 0 0 0
# 2: 6 3 1 2 0
# 3: 7 3 3 1 0
# 4: 7 2 0 5 0
# 5: 8 3 5 0 0
# ---
# 238: 88 0 3 0 85
# 239: 91 0 0 1 90
# 240: 92 0 2 0 90
# 241: 96 0 1 0 95
# 242: 100 0 0 0 100
# no_coins 25 5 10 1
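If only the number of ways is needed and not the combinations themselves, the count can be cross-checked with a short dynamic-programming sketch in base R. This is a different technique from the comboGeneral() enumeration above, and the function name count_ways is made up for illustration.

```r
# ways[k + 1] holds the number of ways to make k cents with the coins seen so far.
count_ways <- function(target, coins) {
  ways <- c(1, rep(0, target))          # one way to make 0 cents: no coins
  for (coin in coins) {
    if (coin > target) next             # coin too large to be used at all
    for (amount in coin:target) {
      # every way to make (amount - coin) extends to a way to make amount
      ways[amount + 1] <- ways[amount + 1] + ways[amount - coin + 1]
    }
  }
  ways[target + 1]
}

count_ways(15,  c(1, 5, 10, 25))   # 6
count_ways(100, c(1, 5, 10, 25))   # 242
```

Looping over the coins in the outer loop is what makes each combination count once regardless of the order of the coins.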
I have found a lot of material about counting a value in a table, but my goal is a little different and I haven't found any source.
This is my data
ID_1 ID_2 Date RESULT
1 12 3 2011-12-21 0
2 12 13 2011-12-22 0
3 3 12 2011-12-22 1
4 15 13 2011-12-23 0
5 12 13 2011-12-23 1
5 13 15 2011-12-23 1
6 3 12 2011-12-23 0
7 12 13 2011-12-23 0
TARGET
ID_1 ID_2 Date RESULT H2H_ID1 H2H_ID2
1 12 3 2011-12-21 0 0 0
2 12 13 2011-12-22 0 0 0
3 3 12 2011-12-22 1 1 0
4 15 13 2011-12-23 0 0 0
5 12 13 2011-12-23 1 0 1
5 13 15 2011-12-23 1 1 0
6 3 12 2011-12-23 0 2 0
7 12 13 2011-12-23 0 1 1
...
and so on
In RESULT, 0 means the match was won by ID_2 and 1 means it was won by ID_1.
I need 2 columns (H2H_ID1, H2H_ID2) that count the matches previously won by each of the same two players (ID_1 and ID_2), i.e. the traditional head-to-head.
Here is an example, using row 3.
ID_1 = 3 and ID_2 = 12.
The row where id 3 and id 12 had a previous match is row 1, and the winner of that match is ID_2 (RESULT = 0), i.e. id 3.
So in row 3 I want to read 1 in H2H_ID1.
In row 6, same conditions: 2 previous matches with the same ids and the same result.
In another post, for a similar task (a column with the previous result),
they gave me this code, which finds only the one previous match (and without a sum), but maybe it could help.
# emulate the original dataframe
ID_1 <- c(12,12,3,15,16,3)
ID_2<-c(3,13,12,13,17,15)
ids <- cbind(ID_1, ID_2) # IDs columns
x1 <- c(15, 50, 20, 30, 51, 60)
y2 <- c(10, 40, 30, 20, 53, 62)
vars <- cbind(x1, y2) # x&y columns
FindPreviousIDsPair <- function(id_matrix, i_of_row) {
shorten_matrix <- id_matrix[1:(i_of_row - 1),,drop = FALSE]
string_to_search_for <- id_matrix[i_of_row, ]
string_to_search_for_sorted <-
string_to_search_for[order(string_to_search_for)]
found_rows_boolean <- sapply(FUN = function(i) all(shorten_matrix[i,
order(shorten_matrix[i, ])] ==
string_to_search_for_sorted), X = 1:(i_of_row - 1))
found_row_n <- ifelse(any(found_rows_boolean),
max(which(found_rows_boolean)), NA_real_)
found_col_of_DI1 <- ifelse(any(found_rows_boolean),
match(string_to_search_for[1], shorten_matrix[found_row_n, ]), NA_real_)
found_col_of_DI2 <- ifelse(any(found_rows_boolean),
match(string_to_search_for[2], shorten_matrix[found_row_n, ]), NA_real_)
return(c(found_row_n, found_col_of_DI1, found_col_of_DI2))
}
Thanks for your help.
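One possible way to get the two head-to-head columns, sketched in base R. It assumes the rows are already in chronological order, so "previous" simply means "earlier row". The names winner, pair_key and count_prior_wins are made up for illustration. It compares each row with all earlier rows, so it is quadratic in the number of rows.

```r
# The eight matches from the question, in row order.
matches <- data.frame(
  ID_1   = c(12, 12, 3, 15, 12, 13, 3, 12),
  ID_2   = c(3, 13, 12, 13, 13, 15, 12, 13),
  RESULT = c(0, 0, 1, 0, 1, 1, 0, 0)
)

# Who won each match: ID_1 when RESULT is 1, ID_2 when RESULT is 0.
winner <- ifelse(matches$RESULT == 1, matches$ID_1, matches$ID_2)

# Order-independent key so that (3, 12) and (12, 3) count as the same pairing.
pair_key <- paste(pmin(matches$ID_1, matches$ID_2),
                  pmax(matches$ID_1, matches$ID_2))

# For every row, count earlier matches of the same pairing won by `player`.
count_prior_wins <- function(player) {
  sapply(seq_along(winner), function(i) {
    earlier <- seq_len(i - 1)
    sum(pair_key[earlier] == pair_key[i] & winner[earlier] == player[i])
  })
}

matches$H2H_ID1 <- count_prior_wins(matches$ID_1)
matches$H2H_ID2 <- count_prior_wins(matches$ID_2)
matches
```

On the sample data this gives the H2H_ID1 and H2H_ID2 values shown in the TARGET table.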
I'm trying to count # consecutive days inactive (consecDaysInactive), per ID.
I have already created an indicator variable inactive that is 1 on days where id is inactive and 0 when active. I also have an id variable, and a date variable. My analysis dataset will have hundreds of thousands of rows, so efficiency will be important.
The logic I'm trying to create is as follows:
per id, if user is active, consecDaysInactive = 0
per id, if user is inactive, and was active on previous day, consecDaysInactive = 1
per id, if user is inactive on previous day, consecDaysInactive = 1 + # previous consecutive inactive days
consecDaysInactive should reset to 0 for new values of id.
I've been able to create a cumulative sum, but I'm unable to get it to reset to 0 after one or more rows of inactive==0.
I've illustrated below the result that I want (consecDaysInactive), as well as the result that I was able to achieve programmatically (bad_consecDaysInactive).
library(dplyr)
d <- data.frame(id = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2), date=as.Date(c('2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08','2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08')), inactive=c(0,0,0,1,1,1,0,1,0,1,1,1,1,0,0,1), consecDaysInactive=c(0,0,0,1,2,3,0,1,0,1,2,3,4,0,0,1))
d <- d %>%
group_by(id) %>%
arrange(id, date) %>%
do( data.frame(., bad_consecDaysInactive = cumsum(ifelse(.$inactive==1, 1,0))
)
)
d
where consecDaysInactive iterates by +1 for each consecutive day inactive, but resets to 0 each date user is active, and resets to 0 for new values of id. As the output shows below, I'm unable to get bad_consecDaysInactive to reset to 0 -- e.g. row
id date inactive consecDaysInactive bad_consecDaysInactive
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2017-01-01 0 0 0
2 1 2017-01-02 0 0 0
3 1 2017-01-03 0 0 0
4 1 2017-01-04 1 1 1
5 1 2017-01-05 1 2 2
6 1 2017-01-06 1 3 3
7 1 2017-01-07 0 0 3
8 1 2017-01-08 1 1 4
9 2 2017-01-01 0 0 0
10 2 2017-01-02 1 1 1
11 2 2017-01-03 1 2 2
12 2 2017-01-04 1 3 3
13 2 2017-01-05 1 4 4
14 2 2017-01-06 0 0 4
15 2 2017-01-07 0 0 4
16 2 2017-01-08 1 1 5
I also considered (and tried) incrementing a variable within group_by() & do(), but since do() isn't iterative, I can't get my counter to get past 2:
d2 <- d %>%
group_by(id) %>%
do( data.frame(., bad_consecDaysInactive2 = ifelse(.$inactive == 0, 0, ifelse(.$inactive==1,.$inactive+lag(.$inactive), .$inactive))))
d2
which yielded, as described above:
id date inactive consecDaysInactive bad_consecDaysInactive bad_consecDaysInactive2
<dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 1 2017-01-01 0 0 0 0
2 1 2017-01-02 0 0 0 0
3 1 2017-01-03 0 0 0 0
4 1 2017-01-04 1 1 1 1
5 1 2017-01-05 1 2 2 2
6 1 2017-01-06 1 3 3 2
7 1 2017-01-07 0 0 3 0
8 1 2017-01-08 1 1 4 1
9 2 2017-01-01 0 0 0 0
10 2 2017-01-02 1 1 1 1
11 2 2017-01-03 1 2 2 2
12 2 2017-01-04 1 3 3 2
13 2 2017-01-05 1 4 4 2
14 2 2017-01-06 0 0 4 0
15 2 2017-01-07 0 0 4 0
16 2 2017-01-08 1 1 5 1
As you can see, my iterator bad_consecDaysInactive2 resets at 0, but doesn't increment past 2! If there's a data.table solution, I'd be happy to hear it as well.
Here's a cute way to do it with a for-loop:
a <- c(1,1,1,1,0,0,1,0,1,1,1,0,0)
b <- rep(NA, length(a))
b[1] <- a[1]
for(i in 2:length(a)){
b[i] <- a[i]*(a[i]+b[i-1])
}
a
b
It may not be the most efficient way to do this, but it will be pretty darn fast. 11.7 seconds for ten million rows on my computer.
a <- round(runif(10000000,0,1))
b <- rep(NA, length(a))
b[1] <- a[1]
t <- Sys.time()
for(i in 2:length(a)){
b[i] <- a[i]*(a[i]+b[i-1])
}
b
Sys.time()-t
Time difference of 11.73612 secs
But this doesn't account for the need to do things per id. That's easy to fix, at a minimal efficiency penalty. Your example dataframe is sorted by id. If your actual data are not already sorted, then do so. Then:
a <- round(runif(10000000,0,1))
id <- round(runif(10000000,1,1000))
id <- id[order(id)]
b <- rep(NA, length(a))
b[1] <- a[1]
t <- Sys.time()
for(i in 2:length(a)){
b[i] <- a[i]*(a[i]+b[i-1])
if(id[i] != id[i-1]){
b[i] <- a[i]
}
}
b
Sys.time()-t
Time difference of 13.54373 secs
If we include the time that it took to sort id, then the time difference is closer to 19 seconds. Still not too bad!
How much of an efficiency savings can we get using Frank's answer in the comments on the OP?
library(data.table)  # provides setDT() and rleid()
d <- data.frame(inactive=a, id=id)
t2 <- Sys.time()
b <- setDT(d)[, v := if (inactive[1]) seq.int(.N) else 0L, by=rleid(inactive)]
Sys.time()-t2
Time difference of 2.233547 secs
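The rleid call above groups by runs of inactive only, so a streak that spans two ids would not reset. For comparison, here is a loop-free sketch in base R that also resets on a change of id. It assumes the data are sorted by id and then date, and the names new_id and run_id are made up for illustration. No timing is claimed for it.

```r
# The example data from the question (already sorted by id, then date).
d <- data.frame(
  id       = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
  inactive = c(0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1)
)

# TRUE on the first row of every id.
new_id <- c(TRUE, d$id[-1] != d$id[-nrow(d)])

# A new run starts whenever the user is active or the id changes.
run_id <- cumsum(d$inactive == 0 | new_id)

# Within a run, the cumulative sum of inactive counts the consecutive inactive days.
d$consecDaysInactive <- ave(d$inactive, run_id, FUN = cumsum)
d
```

Using a single run_id vector, rather than grouping by id and run separately, keeps the number of groups equal to the number of runs.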
My simplified data looks like this:
set.seed(1453); x = sample(0:1, 10, TRUE)
date = c('2016-01-01', '2016-01-05', '2016-01-07', '2016-01-12', '2016-01-16', '2016-01-20',
'2016-01-20', '2016-01-25', '2016-01-26', '2016-01-31')
df = data.frame(x, date = as.Date(date))
df
x date
1 2016-01-01
0 2016-01-05
1 2016-01-07
0 2016-01-12
0 2016-01-16
1 2016-01-20
1 2016-01-20
0 2016-01-25
0 2016-01-26
1 2016-01-31
I'd like to calculate the number of occurrences of x == 1 within a specified time period, e.g. 14 and 30 days from the current date (but excluding the current entry, if it is x == 1). The desired output would look like this:
solution
x date x_plus14 x_plus30
1 2016-01-01 1 3
0 2016-01-05 1 4
1 2016-01-07 2 3
0 2016-01-12 2 3
0 2016-01-16 2 3
1 2016-01-20 2 2
1 2016-01-20 1 1
0 2016-01-25 1 1
0 2016-01-26 1 1
1 2016-01-31 0 0
Ideally, I'd like this to be in dplyr, but it is not a must. Any ideas how to achieve this? Thanks a lot for your help!
Adding another approach based on findInterval:
cs = cumsum(df$x) # cumulative number of occurrences
data.frame(df,
plus14 = cs[findInterval(df$date + 14, df$date, left.open = TRUE)] - cs,
plus30 = cs[findInterval(df$date + 30, df$date, left.open = TRUE)] - cs)
# x date plus14 plus30
#1 1 2016-01-01 1 3
#2 0 2016-01-05 1 4
#3 1 2016-01-07 2 3
#4 0 2016-01-12 2 3
#5 0 2016-01-16 2 3
#6 1 2016-01-20 2 2
#7 1 2016-01-20 1 1
#8 0 2016-01-25 1 1
#9 0 2016-01-26 1 1
#10 1 2016-01-31 0 0
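To spell out why this works: cs[i] is the number of 1s up to and including row i, and findInterval with left.open = TRUE returns the index of the last row whose date is strictly before date + days, so the difference counts the 1s in the rows after row i that fall inside the window. A sketch that wraps this in a function for any window length follows. The function name count_ahead is made up, the x values are typed in from the printed data rather than regenerated with set.seed, and the dates are assumed to be sorted.

```r
x <- c(1, 0, 1, 0, 0, 1, 1, 0, 0, 1)
date <- as.Date(c('2016-01-01', '2016-01-05', '2016-01-07', '2016-01-12',
                  '2016-01-16', '2016-01-20', '2016-01-20', '2016-01-25',
                  '2016-01-26', '2016-01-31'))

count_ahead <- function(x, date, days) {
  cs <- cumsum(x)                       # 1s seen up to and including each row
  d  <- as.numeric(date)                # work on plain day numbers
  # index of the last row dated strictly before (date + days)
  last <- findInterval(d + days, d, left.open = TRUE)
  cs[last] - cs                         # 1s in later rows inside the window
}

data.frame(x, date,
           x_plus14 = count_ahead(x, date, 14),
           x_plus30 = count_ahead(x, date, 30))
```

Because the subtraction is by row position, the first of two rows sharing a date still counts the second one, which is what the desired output shows for 2016-01-20.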
Earlier I wasn't including the present date and so numbers didn't match.
library(data.table)
setDT(df)[, `:=`(x14 = sum(df$x[between(df$date, date, date + 14, incbounds = FALSE)]),
x30 = sum(df$x[between(df$date, date, date + 30, incbounds = FALSE)])),
by = date]
# x date x14 x30
# 1: 1 2016-01-01 1 3
# 2: 0 2016-01-05 1 4
# 3: 1 2016-01-07 2 3
# 4: 0 2016-01-12 2 3
# 5: 0 2016-01-16 2 3
# 6: 1 2016-01-20 1 1
# 7: 1 2016-01-20 1 1
# 8: 0 2016-01-25 1 1
# 9: 0 2016-01-26 1 1
# 10: 1 2016-01-31 0 0
Or a general solution that will work for any desired range
vec <- c(14, 30) # Specify desired ranges
setDT(df)[, paste0("x", vec) :=
lapply(vec, function(i) sum(df$x[between(df$date,
date,
date + i,
incbounds = FALSE)])),
by = date]
A concise dplyr and purrr solution:
library(tidyverse)
sample %>%
  # map_int() returns plain integer columns; map() would create list-columns
  mutate(x_plus14 = map_int(date, ~sum(x == 1 & between(date, . + 1, . + 14))),
         x_plus30 = map_int(date, ~sum(x == 1 & between(date, . + 1, . + 30))))
x date x_plus14 x_plus30
1 1 2016-01-01 1 4
2 0 2016-01-05 1 4
3 1 2016-01-07 2 3
4 0 2016-01-12 2 3
5 0 2016-01-16 2 3
6 1 2016-01-20 1 1
7 1 2016-01-20 1 1
8 0 2016-01-25 1 1
9 0 2016-01-26 1 1
10 1 2016-01-31 0 0
Here's my stab at it with some dplyr and purrr help. I got slightly different counts because of the > and <= in the helper function x_next(); if you adjust them, I think you should be able to get what you want. Hope this helps.
library("tidyverse")
library("lubridate")
set.seed(1453)
x = sample(0:1, 10, TRUE)
dates = c('2016-01-01', '2016-01-05', '2016-01-07', '2016-01-12', '2016-01-16', '2016-01-20',
'2016-01-20', '2016-01-25', '2016-01-26', '2016-01-31')
df = tibble(x = x, dates = lubridate::as_date(dates))  # data_frame() is deprecated
# helper function to calculate the sum of xs in the next days_in_future
x_next <- function(d, days_in_future) {
df %>%
# subset on days of interest
filter(dates > d & dates <= d + days(days_in_future)) %>%
# sum up xs
summarise(sum = sum(x)) %>%
# have to unlist them so that the (following) call to mutate works
unlist(use.names=F)
}
# mutate your df
df %>%
mutate(xplus14 = map_dbl(dates, x_next, 14),
xplus30 = map_dbl(dates, x_next, 30))
As others have already mentioned, it is strange that you do not count the starting day itself, and you should avoid naming objects after functions (sample). However, the code below reproduces your desired output, apart from row 6, where the second entry on 2016-01-20 is not counted:
set.seed(1453);
x = sample(0:1, 10, TRUE)
date = c('2016-01-01', '2016-01-05', '2016-01-07', '2016-01-12', '2016-01-16', '2016-01-20',
'2016-01-20', '2016-01-25', '2016-01-26', '2016-01-31')
sample = data.frame(x = x, date = as.Date(date))
getOccurences <- function(one_row, sample_data, date_range){
one_date <- as.Date(one_row[2])
sum(sample_data$x[sample_data$date > one_date &
sample_data$date < one_date + date_range])
}
sample$x_plus14 <- apply(sample,1,getOccurences, sample, 14)
sample$x_plus30 <- apply(sample,1,getOccurences, sample, 30)
sample
x date x_plus14 x_plus30
1 1 2016-01-01 1 3
2 0 2016-01-05 1 4
3 1 2016-01-07 2 3
4 0 2016-01-12 2 3
5 0 2016-01-16 2 3
6 1 2016-01-20 1 1
7 1 2016-01-20 1 1
8 0 2016-01-25 1 1
9 0 2016-01-26 1 1
10 1 2016-01-31 0 0