R: Roll up by each variable and take the total count

I have a data set similar to this, with around 80 variables (flags) and 80,000 rows:
Acc_Nbr flag1 flag2 flag3 flag4 Exposure
ab      1     0     1     0     1000
bc      0     1     1     0     2000
cd      1     1     0     1     3000
ef      1     0     1     1     4000
Expected Output
Variable Count_Acct_Number Sum_Exposure Total_Acct_Number Total_Expo
flag1    3                 8000         4                 10000
flag2    2                 5000         4                 10000
flag3    3                 7000         4                 10000
flag4    2                 7000         4                 10000
Basically, I want the output to show, for each flag variable, the count of account numbers and the sum of exposure where that flag is 1, alongside the total count of account numbers and the total exposure.
Please help.

We can convert the 'data.frame' to a 'data.table' (setDT(df1)) and reshape it to 'long' format with melt. Then, grouped by 'variable', we compute the sum of 'value1', the sum of 'Exposure' where 'value1' is 1, the number of rows (.N), and the sum of all values in 'Exposure' to get the expected output.
library(data.table)
melt(setDT(df1), measure = patterns("^flag"))[,
    list(Count_Acct_Number = sum(value1),
         Sum_Exposure = sum(Exposure[value1 == 1]),
         Total_Acct_Number = .N,
         TotalExposure = sum(Exposure)),
    by = variable]
# variable Count_Acct_Number Sum_Exposure Total_Acct_Number TotalExposure
#1: flag1 3 8000 4 10000
#2: flag2 2 5000 4 10000
#3: flag3 3 7000 4 10000
#4: flag4 2 7000 4 10000
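For comparison, here is a minimal base-R sketch of the same roll-up (assuming, as in the example, that the flag columns share the "flag" prefix and that the data frame is called df1):
flag_cols <- grep("^flag", names(df1), value = TRUE)
do.call(rbind, lapply(flag_cols, function(f) data.frame(
  Variable          = f,
  # accounts flagged 1 and their exposure
  Count_Acct_Number = sum(df1[[f]] == 1),
  Sum_Exposure      = sum(df1$Exposure[df1[[f]] == 1]),
  # overall totals repeated on every row
  Total_Acct_Number = nrow(df1),
  Total_Expo        = sum(df1$Exposure))))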

A straightforward way is to use the doBy package:
library(doBy)
df <- data.frame(account=LETTERS[1:10], exposure=1:10*3.14, mark=round(runif(10)))
res <- as.data.frame(summaryBy(exposure~mark+account, df, FUN=sum))
subset(res, mark==0)
Starting with the base data (note: the sample uses random values, so the mark column will differ between runs):
> df
account exposure mark
1 A 3.14 1
2 B 6.28 1
3 C 9.42 0
4 D 12.56 0
5 E 15.70 1
6 F 18.84 0
7 G 21.98 1
8 H 25.12 0
9 I 28.26 1
10 J 31.40 0
this gives an intermediate result grouped by mark (in this case there is no actual summing within groups, since each account appears once, but summing would work the same way):
> res
mark account exposure.sum
1 0 A 3.14
2 0 D 12.56
3 0 F 18.84
4 0 H 25.12
5 1 B 6.28
6 1 C 9.42
7 1 E 15.70
8 1 G 21.98
9 1 I 28.26
10 1 J 31.40
The final result can be selected with
> subset(res, mark==0)
mark account exposure.sum
1 0 A 3.14
2 0 D 12.56
3 0 F 18.84
4 0 H 25.12

Related

R: finding absolute difference with dplyr and group_by

I have the following example. I want to create a new column with the absolute difference in Age compared to the row where Treat == 1 within the same PairID. The desired output is shown below.
Complete data:
Treat <- c(1,0,0,1,0,0,1,0)
PairID <- c(1,1,1,2,2,2,3,3)
Age <- c(30,60,31,20,20,40,50,52)
D <- data.frame(Treat,PairID,Age)
D
I have tried using dplyr with:
D %>%
  group_by(PairID) %>%
  abs(Age - Age[Treat == 1])
In base R:
D$absD <- unlist(lapply(split(D,D$PairID), function(x) abs(x$Age - x$Age[x$Treat==1])))
> D
Treat PairID Age absD
1 1 1 30 0
2 0 1 60 30
3 0 1 31 1
4 1 2 20 0
5 0 2 20 0
6 0 2 40 20
7 1 3 50 0
8 0 3 52 2
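If you want to stay in dplyr, a minimal sketch of the same calculation (assuming, as in the example, exactly one Treat == 1 row per PairID) is to wrap the expression in mutate():
library(dplyr)
D %>%
  group_by(PairID) %>%
  # difference to the treated row's Age within each pair
  mutate(absD = abs(Age - Age[Treat == 1])) %>%
  ungroup()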

How to reference previous rows by account and date

I have a dataset similar to the following format:
Account_ID Date Delinquency age count
1 01/01/2016 0 1 0
1 02/01/2016 1 2 0
1 03/01/2016 2 3 1
1 04/01/2016 0 4 2
1 05/01/2016 1 5 2
1 06/01/2016 2 6 2
2 01/01/2016 0 1 0
2 02/01/2016 0 2 0
2 03/01/2016 1 3 0
2 04/01/2016 0 4 1
2 05/01/2016 1 5 1
3 01/01/2016 1 1 0
3 02/01/2016 2 2 1
3 03/01/2016 3 3 2
3 04/01/2016 4 4 3
3 05/01/2016 5 5 4
3 06/01/2016 6 6 5
I want to count the number of non-zeros in the previous 3 months by account for each row, i.e. I want to create the count variable using the first 4 variables (Account_ID, Date, Delinquency, Age). I would like to know how to do this for n past months. I'm hoping I can extend this exercise to other tasks such as finding the max delinquency in the past 3 months.
Welcome to SE!
If you would like to count non-zero delinquency events over the 3 previous months by account for each row, you can use the aggregate function together with the zlag function from the TSA package, as in the code below. Since the data you provided in the count column are difficult to interpret and to connect with the stated condition, the data in this example were simulated.
library(lubridate)
set.seed(123)
# data simulation
df <- data.frame( id = factor(rep(0:9, 100)),
date = sample(seq(ymd("2010-12-01"), by = 1, length.out = 1000), 1000, replace = TRUE),
deliquency = sample(c(rep(0, 30), 1:5), 1000, replace = TRUE),
age = sample(1:10, 1000, replace = TRUE))
head(df)
# id date deliquency age
# 1 0 2011-08-06 0 10
# 2 1 2013-08-16 0 6
# 3 2 2012-11-17 0 1
# 4 3 2012-09-12 0 9
# 5 4 2011-07-29 0 1
# 6 5 2011-02-25 0 9
# aggregation of non-zero deliquency by month
df$year_month <- df$date
day(df$year_month) <- 1
df_m <- aggregate(deliquency ~ id + year_month, data = df, sum)
df_m <- df_m[order(df_m$id, df_m$year_month), ]
df_m$is_zero <- df_m$deliquency > 0
head(df_m)
# id year_month deliquency is_zero
# 1 0 2010-12-01 1 TRUE
# 10 0 2011-01-01 0 FALSE
# 19 0 2011-02-01 0 FALSE
# 29 0 2011-03-01 0 FALSE
# 39 0 2011-04-01 0 FALSE
# 65 0 2011-07-01 1 TRUE
# count non-zero delinquency events over the three previous months
library(TSA)
df_m_l <- by(df_m, df_m$id, function(dfx) {
  dfx$zero_del <- zlag(dfx$is_zero, 1) + zlag(dfx$is_zero, 2) + zlag(dfx$is_zero, 3)
  dfx})
df_m_res <- do.call(rbind, df_m_l)
head(df_m_res)
The output is a data.frame showing the number of non-zero delinquency events in the previous 3 months. E.g. the output here is:
id year_month deliquency is_zero zero_del
0.1 0 2010-12-01 1 TRUE NA
0.10 0 2011-01-01 0 FALSE NA
0.19 0 2011-02-01 0 FALSE NA
0.29 0 2011-03-01 0 FALSE 1
0.39 0 2011-04-01 0 FALSE 0
0.65 0 2011-07-01 1 TRUE 0
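The question also asks about finding the maximum delinquency over the past 3 months. A hedged sketch reusing the same by()/zlag() pattern follows; the max_del_3m column name is illustrative, and "months" here means the previous three observed months per id, exactly as for zero_del above:
df_m_l2 <- by(df_m, df_m$id, function(dfx) {
  # element-wise max of the three lagged monthly delinquency sums
  dfx$max_del_3m <- pmax(zlag(dfx$deliquency, 1),
                         zlag(dfx$deliquency, 2),
                         zlag(dfx$deliquency, 3), na.rm = TRUE)
  dfx})
df_m_res2 <- do.call(rbind, df_m_l2)
head(df_m_res2)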

Average columns based on other column value and number of rows in R

I'm using R and am trying to create a new dataframe of averaged results from another dataframe, based on the values in column A. To demonstrate my goal, here is some data:
set.seed(1981)
df <- data.frame(A = sample(c(0,1), replace = TRUE, size = 100),
                 B = round(runif(100), digits = 4),
                 C = sample(1:1000, 100, replace = TRUE))
head(df, 30)
A B C
0 0.6739 459
1 0.5466 178
0 0.154 193
0 0.41 206
1 0.7526 791
1 0.3104 679
1 0.739 434
1 0.421 171
0 0.3653 577
1 0.4035 739
0 0.8796 147
0 0.9138 37
0 0.7257 350
1 0.2125 779
0 0.1502 495
1 0.2972 504
0 0.2406 245
1 0.0325 613
0 0.8642 539
1 0.1096 630
1 0.2113 363
1 0.277 974
0 0.0485 755
1 0.0553 412
0 0.509 24
0 0.2934 795
0 0.0725 413
0 0.8723 606
0 0.3192 591
1 0.5557 177
I need to reduce the size of the data by calculating the average value of columns B and C over as many rows as the value in column A stays consecutively the same, up to a maximum of 3 rows. If the value of A remains either 1 or 0 for more than 3 rows, it rolls over into the next row of the new dataframe, as you can see below.
The new dataframe requires the following columns:
Value of A B.Av C.Av No. of rows used
0 0.6739 459 1
1 0.5466 178 1
0 0.282 199.5 2
1 0.600666667 634.6666667 3
1 0.421 171 1
0 0.3653 577 1
1 0.4035 739 1
0 0.8397 178 3
1 0.2125 779 1
0 0.1502 495 1
1 0.2972 504 1
0 0.2406 245 1
1 0.0325 613 1
0 0.8642 539 1
1 0.1993 655.6666667 3
0 0.0485 755 1
1 0.0553 412 1
0 0.291633333 410.6666667 3
0 0.59575 598.5 2
1 0.5557 177 1
I haven't managed to find another similar scenario to mine whilst searching Stack Overflow so any help would be really appreciated.
Here is a base-R solution:
## define a function to split the run-length if greater than 3
split.3 <- function(l,v) {
o <- c(values=v,lengths=min(l,3))
while(l > 3) {
l <- l - 3
o <- rbind(o,c(values=v,lengths=min(l,3)))
}
return(o)
}
## compute the run-length encoding of column A
rl <- rle(df$A)
## apply split.3 to the run-length encoding
## the first column of vl are the values of column A
## the second column of vl are the corresponding run-length limited to 3
vl <- do.call(rbind,mapply(split.3,rl$lengths,rl$values))
## compute the begin and end row indices of df for each value of A to average
fin <- cumsum(vl[,2])
beg <- fin - vl[,2] + 1
## compute the averages
out <- do.call(rbind,lapply(1:length(beg), function(i) data.frame(`Value of A`=vl[i,1],
B.Av=mean(df$B[beg[i]:fin[i]]),
C.Av=mean(df$C[beg[i]:fin[i]]),
`No. of rows used`=fin[i]-beg[i]+1)))
## Value.of.A B.Av C.Av No..of.rows.used
##1 0 0.6739000 459.0000 1
##2 1 0.5466000 178.0000 1
##3 0 0.2820000 199.5000 2
##4 1 0.6006667 634.6667 3
##5 1 0.4210000 171.0000 1
##6 0 0.3653000 577.0000 1
##7 1 0.4035000 739.0000 1
##8 0 0.8397000 178.0000 3
##9 1 0.2125000 779.0000 1
##10 0 0.1502000 495.0000 1
##11 1 0.2972000 504.0000 1
##12 0 0.2406000 245.0000 1
##13 1 0.0325000 613.0000 1
##14 0 0.8642000 539.0000 1
##15 1 0.1993000 655.6667 3
##16 0 0.0485000 755.0000 1
##17 1 0.0553000 412.0000 1
##18 0 0.2916333 410.6667 3
##19 0 0.5957500 598.5000 2
##20 1 0.5557000 177.0000 1
Here is a data.table solution:
library(data.table)
setDT(df)
# create two group variables, consecutive A and for each consecutive A every three rows
(df[,rleid := rleid(A)][, threeWindow := ((1:.N) - 1) %/% 3, rleid]
# calculate the average of the columns grouped by the above two variables
[, c(.N, lapply(.SD, mean)), .(rleid, threeWindow)]
# drop group variables
[, `:=`(rleid = NULL, threeWindow = NULL)][])
# N A B C
#1: 1 0 0.6739000 459.0000
#2: 1 1 0.5466000 178.0000
#3: 2 0 0.2820000 199.5000
#4: 3 1 0.6006667 634.6667
#5: 1 1 0.4210000 171.0000
#6: 1 0 0.3653000 577.0000
#7: 1 1 0.4035000 739.0000
#8: 3 0 0.8397000 178.0000
#9: 1 1 0.2125000 779.0000
#10: 1 0 0.1502000 495.0000
#11: 1 1 0.2972000 504.0000
#12: 1 0 0.2406000 245.0000
#13: 1 1 0.0325000 613.0000
#14: 1 0 0.8642000 539.0000
#15: 3 1 0.1993000 655.6667
#16: 1 0 0.0485000 755.0000
#17: 1 1 0.0553000 412.0000
#18: 3 0 0.2916333 410.6667
#19: 2 0 0.5957500 598.5000
#20: 1 1 0.5557000 177.0000
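Not part of the original answers, but for reference a dplyr sketch of the same two-level grouping might look like this (the run and chunk helper names are illustrative; .groups requires dplyr >= 1.0; it starts from the plain data frame df before setDT):
library(dplyr)
df %>%
  # consecutive runs of A, then chunks of at most three rows within each run
  group_by(run = cumsum(A != lag(A, default = first(A)))) %>%
  mutate(chunk = (row_number() - 1) %/% 3) %>%
  group_by(run, chunk) %>%
  summarise(A = first(A), B.Av = mean(B), C.Av = mean(C),
            rows_used = n(), .groups = "drop") %>%
  select(-run, -chunk)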

Count flags across multiple rows depending on key

I have a dataset that consists of customers and accounts where a customer can have multiple accounts. The dataset has several 'flags' on each account.
I'm trying to get a count of 'unique' hits on these flags per customer, i.e. if 3 accounts have flag1 I want this to count as 1 hit, but if just one of the accounts have flag2 too I want this to count as 2. Essentially, I want to see how many flags each customer hits across all of their accounts.
Example Input data frame:
cust acct flag1 flag2 flag3
a 123 0 1 0
a 456 1 1 0
b 789 1 1 1
c 428 0 1 0
c 247 0 1 0
c 483 0 1 1
Example Output dataframe:
cust acct flag1 flag2 flag3 UniqueSum
a 123 0 1 0 2
a 456 1 1 0 2
b 789 1 1 1 3
c 428 0 1 0 2
c 247 0 1 0 2
c 483 0 1 1 2
I've tried to use the following:
fSumData <- ddply(fData, "cust", numcolwise(sum, c(flag1, flag2, flag3)))
but this sums the acct column too giving me one row per customer where I'd like to have the same amount of rows as the customer has accounts.
Using data.table:
require(data.table) # v1.9.6
dt[, un := sum(sapply(.SD, max)), by = cust, .SDcols = flag1:flag3]
We group by cust, and on the subdata for each group for columns flag1, flag2, flag3 (achieved using .SD and .SDcols), we extract each column's max, and summing it up would give the total number of 1's.
We update the original table with these values by reference using the LHS := RHS notation (see Reference Semantics vignette).
where dt is:
dt = fread('cust acct flag1 flag2 flag3
a 123 0 1 0
a 456 1 1 0
b 789 1 1 1
c 428 0 1 0
c 247 0 1 0
c 483 0 1 1')
One way that comes to mind is to take colSums for each cust and check which are greater than 0. For example,
> tab
cust acct flag1 flag2 flag3
1 a 123 0 1 0
2 a 456 1 1 0
3 b 789 1 1 1
4 c 428 0 1 0
5 c 247 0 1 0
6 c 483 0 1 1
> uniqueSums <- sapply(tab$cust, function(cust) length(which(colSums(tab[tab$cust == cust,3:5]) > 0)))
> cbind(tab, uniqueSums = uniqueSums)
cust acct flag1 flag2 flag3 uniqueSums
1 a 123 0 1 0 2
2 a 456 1 1 0 2
3 b 789 1 1 1 3
4 c 428 0 1 0 2
5 c 247 0 1 0 2
6 c 483 0 1 1 2
For each value of cust, the function in sapply finds the rows, does a vectorized sum and checks for values that are greater than 0.
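A slightly cheaper variant, sketched here, computes the count once per unique cust and then looks it up per row (hits is an illustrative name):
hits <- sapply(split(tab[3:5], tab$cust), function(g) sum(colSums(g) > 0))
cbind(tab, uniqueSums = hits[as.character(tab$cust)])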
Here's an approach using library(dplyr):
df %>%
group_by(cust) %>%
summarise_each(funs(max), -acct) %>%
mutate(UniqueSum = rowSums(.[-1])) %>%
select(-starts_with("flag")) %>%
right_join(df, "cust")
#Source: local data frame [6 x 6]
#
# cust UniqueSum acct flag1 flag2 flag3
# (fctr) (dbl) (int) (int) (int) (int)
#1 a 2 123 0 1 0
#2 a 2 456 1 1 0
#3 b 3 789 1 1 1
#4 c 2 428 0 1 0
#5 c 2 247 0 1 0
#6 c 2 483 0 1 1
I was able to answer my own question after reading Roman's post. I did something like this, where fData is my dataset:
fSumData <- ddply(fData, "cust", numcolwise(sum))
fSumData$UniqueHits <- ifelse(fSumData$flag1 >= 1, 1, 0) + ifelse(fSumData$flag2 >= 1, 1, 0) + ifelse(fSumData$flag3 >= 1, 1, 0)
I found this to be a bit faster than Roman's solution when running against my dataset, but I am unsure whether it's the optimal solution. Thank you all for your input; this helped a ton!
The underused rowsum() can also be of use:
rowSums(rowsum(DF[-(1:2)], DF$cust) > 0)[DF$cust]
#a a b c c c
#2 2 3 2 2 2
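To attach that as a per-row column, as in the expected output (assuming DF holds the example data with cust and acct as its first two columns):
# name-based lookup of each customer's unique flag count
DF$UniqueSum <- rowSums(rowsum(DF[-(1:2)], DF$cust) > 0)[as.character(DF$cust)]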

Calculate for date difference in days R

How can I determine whether an ID appears consecutively for fewer than 5 days, and also calculate the day difference between records of the same ID?
I really cannot work out the logic for this problem and don't know where to start.
(The sample data given below is just a sample; my actual data is huge, so optimization is needed.)
sample data :
sample <- data.frame(
  id = c("A","B","C","D","A","C","D","A","C","D","A","D","A","C"),
  date = c("1/3/2013","1/3/2013","1/3/2013","1/3/2013","2/3/2013","2/3/2013",
           "2/3/2013","3/3/2013","3/3/2013","3/3/2013","4/3/2013","4/3/2013",
           "5/3/2013","5/3/2013")
)
Expected Output:
output <- data.frame(
  id = c("A","A","A","A","A","B","C","C","C","C","D","D","D","D","D","D","D"),
  date = c("1/3/2013","2/3/2013","3/3/2013","4/3/2013","5/3/2013","1/3/2013",
           "1/3/2013","2/3/2013","3/3/2013","5/3/2013","1/3/2013","2/3/2013",
           "3/3/2013","4/3/2013","5/3/2013","6/3/2013","7/3/2013"),
  num = c(0,1,2,3,4,0,0,1,2,4,0,1,2,3,4,5,6)
)
Calculation logic:
Work from the date difference within the same ID. For example, 1/3 to 2/3 is a 1-day difference, so the row for 2/3 gets idu: 1. 2/3 to 3/3 is a 1-day difference, so add 1 and the row for 3/3 gets idu: 2. 3/3 to 5/3 is a 2-day difference, so add 2 and the row for 5/3 gets idu: 4.
Date | idu
1/3 | 0
2/3 | 1
3/3 | 2
5/3 | 4
Thanks in advance.
sample <- data.frame(
  id = c("A","B","C","D","A","C","D","A","C","D","A","D","A","C"),
  date = c("1/3/2013","1/3/2013","1/3/2013","1/3/2013","2/3/2013","2/3/2013",
           "2/3/2013","3/3/2013","3/3/2013","3/3/2013","4/3/2013","4/3/2013",
           "5/3/2013","5/3/2013"),
  stringsAsFactors = FALSE)
library(lubridate)
sample$date <- dmy(sample$date)
sample1 <- sample[order(sample$id, sample$date), ]
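# number rows 0,1,2,... within each consecutive run of the same id (rle + seq_len)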
sample1$idu <- unlist(sapply(rle(sample1$id)$lengths, seq_len)) -1
id date idu
1 A 2013-03-01 0
5 A 2013-03-02 1
8 A 2013-03-03 2
11 A 2013-03-04 3
13 A 2013-03-05 4
2 B 2013-03-01 0
3 C 2013-03-01 0
6 C 2013-03-02 1
9 C 2013-03-03 2
14 C 2013-03-05 3
4 D 2013-03-01 0
7 D 2013-03-02 1
10 D 2013-03-03 2
12 D 2013-03-04 3
In order to add a time lag column, several options are available. I'd simply do
sample1$diff <- c(0, int_diff(sample1$date)/days(1))
# Remainder cannot be expressed as fraction of a period.
# Performing %/%.
> sample1
id date idu diff
1 A 2013-03-01 0 0
5 A 2013-03-02 1 1
8 A 2013-03-03 2 1
11 A 2013-03-04 3 1
13 A 2013-03-05 4 1
2 B 2013-03-01 0 -4
3 C 2013-03-01 0 0
6 C 2013-03-02 1 1
9 C 2013-03-03 2 1
14 C 2013-03-05 3 2
4 D 2013-03-01 0 -4
7 D 2013-03-02 1 1
10 D 2013-03-03 2 1
12 D 2013-03-04 3 1
Make further changes as needed, e.g. replacing all negative values (which arise at the boundary between ids) with 0.
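A hedged base-R sketch of a per-id day difference that avoids those negative boundary values (the diff_days name is illustrative; it relies on sample1 already being sorted by id and date as above):
# first row of each id gets 0, later rows get days since the previous row
sample1$diff_days <- ave(as.numeric(sample1$date), sample1$id,
                         FUN = function(x) c(0, diff(x)))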
