How can I determine whether an ID appears on fewer than 5 consecutive days? I also need to calculate the day difference between records of the same ID.
I can't work out the logic for this problem and don't know where to start.
(The data below is only a sample; my actual data is very large, so the solution needs to be efficient.)
sample data :
sample<- data.frame(
id=c("A","B","C","D","A","C","D","A","C","D","A","D","A","C"),
date=c("1/3/2013","1/3/2013", "1/3/2013","1/3/2013","2/3/2013","2/3/2013",
"2/3/2013","3/3/2013","3/3/2013",
"3/3/2013",
"4/3/2013",
"4/3/2013",
"5/3/2013",
"5/3/2013"
)
)
Expected Output:
output<- data.frame(
id=c("A","A","A","A","A","B","C","C","C","C","D","D","D","D","D","D","D"),
date=c("1/3/2013",
"2/3/2013",
"3/3/2013",
"4/3/2013",
"5/3/2013",
"1/3/2013",
"1/3/2013",
"2/3/2013",
"3/3/2013",
"5/3/2013",
"1/3/2013",
"2/3/2013",
"3/3/2013",
"4/3/2013",
"5/3/2013",
"6/3/2013",
"7/3/2013" ),
num=c(0,1,2,3,4,0,0,1,2,4,0,1,2,3,4,5,6)
)
Calculation Logic :
Calculate the date difference within each ID and accumulate it. For example, 1/3 to 2/3 is a difference of 1 day, so the row for 2/3 gets idu = 1. 2/3 to 3/3 is another 1 day, so the row for 3/3 gets idu = 2. 3/3 to 5/3 is a difference of 2 days, so 2 is added and the row for 5/3 gets idu = 4. (All differences are taken within the same ID.)
Date | idu
1/3 | 0
2/3 | 1
3/3 | 2
5/3 | 4
Thanks in advance.
sample <- data.frame(
  id = c("A","B","C","D","A","C","D","A","C","D","A","D","A","C"),
  date = c("1/3/2013","1/3/2013","1/3/2013","1/3/2013","2/3/2013","2/3/2013",
           "2/3/2013","3/3/2013","3/3/2013","3/3/2013",
           "4/3/2013","4/3/2013","5/3/2013","5/3/2013"),
  stringsAsFactors = FALSE)
library(lubridate)
sample$date <- dmy(sample$date)
# sort by id, then by date within each id
sample1 <- sample[order(sample$id, sample$date), ]
# 0-based row counter within each id; lapply (rather than sapply) always
# returns a list, so unlist() yields a plain vector
sample1$idu <- unlist(lapply(rle(sample1$id)$lengths, seq_len)) - 1
id date idu
1 A 2013-03-01 0
5 A 2013-03-02 1
8 A 2013-03-03 2
11 A 2013-03-04 3
13 A 2013-03-05 4
2 B 2013-03-01 0
3 C 2013-03-01 0
6 C 2013-03-02 1
9 C 2013-03-03 2
14 C 2013-03-05 3
4 D 2013-03-01 0
7 D 2013-03-02 1
10 D 2013-03-03 2
12 D 2013-03-04 3
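Note that idu above counts rows within each id, so C on 2013-03-05 gets 3, whereas the expected output in the question has 4. A minimal base R sketch of the cumulative day difference, assuming the intended value is the number of days since each id's first date (the names sample_df, sorted and num are chosen here for illustration only):

```r
# Rebuild the question's sample data with real Date values
sample_df <- data.frame(
  id = c("A","B","C","D","A","C","D","A","C","D","A","D","A","C"),
  date = as.Date(rep(sprintf("2013-03-%02d", 1:5), times = c(4, 3, 3, 2, 2))),
  stringsAsFactors = FALSE
)

# Sort by id, then by date within each id
sorted <- sample_df[order(sample_df$id, sample_df$date), ]

# Days elapsed since the first date of the same id
days <- as.numeric(sorted$date)
sorted$num <- days - ave(days, sorted$id, FUN = min)

sorted[sorted$id == "C", ]
```

Because ave() is vectorised over the whole column, this avoids an explicit loop over ids, which may matter on large data.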
In order to add a time lag column, several options are available. I'd simply do
sample1$diff <- c(0, int_diff(sample1$date)/days(1))
# Remainder cannot be expressed as fraction of a period.
# Performing %/%.
> sample1
id date idu diff
1 A 2013-03-01 0 0
5 A 2013-03-02 1 1
8 A 2013-03-03 2 1
11 A 2013-03-04 3 1
13 A 2013-03-05 4 1
2 B 2013-03-01 0 -4
3 C 2013-03-01 0 0
6 C 2013-03-02 1 1
9 C 2013-03-03 2 1
14 C 2013-03-05 3 2
4 D 2013-03-01 0 -4
7 D 2013-03-02 1 1
10 D 2013-03-03 2 1
12 D 2013-03-04 3 1
Then make further changes as needed, such as replacing all negative values (which arise at the boundary between two ids) with 0.
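As an alternative to patching the negative values afterwards, the difference can be computed within each id so that no value crosses an id boundary. This is a sketch in base R rather than lubridate; the data frame name df is an assumption made for the example:

```r
ids <- c("A","B","C","D","A","C","D","A","C","D","A","D","A","C")
dates <- as.Date(rep(sprintf("2013-03-%02d", 1:5), times = c(4, 3, 3, 2, 2)))
df <- data.frame(id = ids, date = dates, stringsAsFactors = FALSE)
df <- df[order(df$id, df$date), ]

# Day difference to the previous record of the same id; the first record
# of every id gets 0 instead of a negative jump back to an earlier date
df$diff <- ave(as.numeric(df$date), df$id,
               FUN = function(d) c(0, diff(d)))
df
```

Any diff greater than 1 marks a break in a run of consecutive days for that id.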
I have a dataset similar to the following format:
Account_ID Date Delinquency age count
1 01/01/2016 0 1 0
1 02/01/2016 1 2 0
1 03/01/2016 2 3 1
1 04/01/2016 0 4 2
1 05/01/2016 1 5 2
1 06/01/2016 2 6 2
2 01/01/2016 0 1 0
2 02/01/2016 0 2 0
2 03/01/2016 1 3 0
2 04/01/2016 0 4 1
2 05/01/2016 1 5 1
3 01/01/2016 1 1 0
3 02/01/2016 2 2 1
3 03/01/2016 3 3 2
3 04/01/2016 4 4 3
3 05/01/2016 5 5 4
3 06/01/2016 6 6 5
I want to count the number of non-zeros in the previous 3 months by account for each row, i.e. I want to create the count variable using the first 4 variables (Account_ID, Date, Delinquency, Age). I would like to know how to do this for n past months. I'm hoping I can extend this exercise to other tasks such as finding the max delinquency in the past 3 months.
Welcome to SE!
If you would like to count non-zero delinquency events over the 3 previous months, by account, for each row, you can use the aggregate function together with the zlag function from the TSA package (see the code below). Because the values in your count column are difficult to interpret and to reconcile with the stated condition, the data in this example are simulated.
library(lubridate)
set.seed(123)
# data simulation
df <- data.frame( id = factor(rep(0:9, 100)),
date = sample(seq(ymd("2010-12-01"), by = 1, length.out = 1000), 1000, replace = TRUE),
deliquency = sample(c(rep(0, 30), 1:5), 1000, replace = TRUE),
age = sample(1:10, 1000, replace = TRUE))
head(df)
# id date deliquency age
# 1 0 2011-08-06 0 10
# 2 1 2013-08-16 0 6
# 3 2 2012-11-17 0 1
# 4 3 2012-09-12 0 9
# 5 4 2011-07-29 0 1
# 6 5 2011-02-25 0 9
# aggregation of non-zero deliquency by month
df$year_month <- df$date
day(df$year_month) <- 1
df_m <- aggregate(deliquency ~ id + year_month, data = df, sum)
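# sort by id, then by month within each id
df_m <- df_m[order(as.character(df_m$id), df_m$year_month), ]
# note: despite its name, is_zero is TRUE when the monthly deliquency is non-zero
df_m$is_zero <- df_m$deliquency > 0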
head(df_m)
# id year_month deliquency is_zero
# 1 0 2010-12-01 1 TRUE
# 10 0 2011-01-01 0 FALSE
# 19 0 2011-02-01 0 FALSE
# 29 0 2011-03-01 0 FALSE
# 39 0 2011-04-01 0 FALSE
# 65 0 2011-07-01 1 TRUE
# count months with non-zero deliquency among the three preceding rows of each id
library(TSA)
df_m_l <- by(df_m, df_m$id, function(dfx) {
  dfx$zero_del <- zlag(dfx$is_zero, 1) + zlag(dfx$is_zero, 2) + zlag(dfx$is_zero, 3)
  dfx})
df_m_res <- do.call(rbind, df_m_l)
head(df_m_res)
The output is a data.frame showing, for each row, the number of months with non-zero delinquency among the 3 preceding monthly rows of the same id. For example, the first rows of the output are:
id year_month deliquency is_zero zero_del
0.1 0 2010-12-01 1 TRUE NA
0.10 0 2011-01-01 0 FALSE NA
0.19 0 2011-02-01 0 FALSE NA
0.29 0 2011-03-01 0 FALSE 1
0.39 0 2011-04-01 0 FALSE 0
0.65 0 2011-07-01 1 TRUE 0
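For the data in the question itself, a base R sketch without TSA is also possible. It assumes one row per account per month, sorted by date with no missing months, so that "previous n months" means "previous n rows"; the helper count_prev_nonzero and the data frame acct are names invented for this example:

```r
# Number of non-zero values among the n rows preceding each position
count_prev_nonzero <- function(x, n = 3) {
  cs <- c(0, cumsum(x != 0))   # cs[i + 1] = non-zeros among x[1..i]
  i <- seq_along(x)
  cs[i] - cs[pmax(i - n, 1)]   # non-zeros among x[(i - n)..(i - 1)]
}

acct <- data.frame(
  Account_ID  = rep(1:3, times = c(6, 5, 6)),
  Delinquency = c(0, 1, 2, 0, 1, 2,  0, 0, 1, 0, 1,  1, 2, 3, 4, 5, 6)
)

# Apply the window count separately within each account
acct$count <- ave(acct$Delinquency, acct$Account_ID,
                  FUN = function(x) count_prev_nonzero(x, n = 3))
acct
```

For accounts 1 and 2 this reproduces the count column in the question. For account 3 it gives 3 in the last two rows, where the question shows 4 and 5, which is consistent with the remark above that the posted counts are hard to reconcile with a 3-month window. Changing n gives the window for n past months.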
I have a large data set with month, customer ID and store ID. There is one record per customer, per location, per month summarizing their activity at that location.
Month Customer ID Store
Jan 1 A
Jan 4 A
Jan 2 A
Jan 3 A
Feb 7 B
Feb 2 B
Feb 1 B
Feb 12 B
Mar 1 C
Mar 11 C
Mar 3 C
Mar 12 C
I'm interested in creating a matrix that shows the number of customers that each location shares with another. Like this:
A B C
A 4 2 2
B 2 4 2
C 2 2 4
For example, if a customer visited Store A and then Store B in the next month, they would be added to the tally for A and B. I'm interested in the number of shared customers, not the number of visits.
I tried the sparse matrix approach in this thread (Creating co-occurrence matrix), but the numbers returned don't match up, for reasons I cannot understand.
Any ideas would be greatly appreciated!
Update:
The original solution that I posted worked for your data. But your data has
the unusual property that no customer ever visited the same store in two different
months. Presuming that would happen, a modification is needed.
What we need is a matrix of stores by customers that has 1 if the customer ever
visited the store and zero otherwise. The original solution used
M = as.matrix(table(Dat$ID_Store, Dat$Customer))
which gives how many different months the store was visited by each customer. With
different data, these numbers might be more than one. We can fix that by using
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
If you look at this matrix, it will say TRUE and FALSE, but since TRUE=1 and FALSE=0
that will work just fine. So the full corrected solution is:
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
M %*% t(M)
A B C
A 4 2 2
B 2 4 2
C 2 2 4
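To make the steps above reproducible, here is a self-contained sketch that rebuilds the question's data. The column names Customer and ID_Store follow the code above; the exact layout of Dat is an assumption, since the original data frame is not shown:

```r
Dat <- data.frame(
  Month    = rep(c("Jan", "Feb", "Mar"), each = 4),
  Customer = c(1, 4, 2, 3, 7, 2, 1, 12, 1, 11, 3, 12),
  ID_Store = rep(c("A", "B", "C"), each = 4)
)

# Store-by-customer indicator: 1 if the customer ever visited the store
M <- 1 * (unclass(table(Dat$ID_Store, Dat$Customer)) > 0)

# Entry [i, j] counts the customers shared by stores i and j;
# the diagonal holds each store's number of distinct customers
shared <- M %*% t(M)
shared
```

Multiplying by 1 converts the logical indicator to numeric explicitly, so the result does not rely on implicit coercion.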
We can try this too:
library(reshape2)
df <- dcast(df,CustomerID~Store, length, value.var='Store')
# CustomerID A B C
#1 1 1 1 1
#2 2 1 1 0 # Customer 2 went to stores A,B but not to C
#3 3 1 0 1
#4 4 1 0 0
#5 7 0 1 0
#6 11 0 0 1
#7 12 0 1 1
crossprod(as.matrix(df[-1]))
# A B C
#A 4 2 2
#B 2 4 2
#C 2 2 4
With the arules package:
library(arules)
write(' Jan 1 A
Jan 4 A
Jan 2 A
Jan 3 A
Feb 7 B
Feb 2 B
Feb 1 B
Feb 12 B
Mar 1 C
Mar 11 C
Mar 3 C
Mar 12 C', 'basket_single')
tr <- read.transactions("basket_single", format = "single", cols = c(2,3))
inspect(tr)
# items transactionID
#[1] {A,B,C} 1
#[2] {C} 11
#[3] {B,C} 12
#[4] {A,B} 2
#[5] {A,C} 3
#[6] {A} 4
#[7] {B} 7
image(tr)
crossTable(tr, sort=TRUE)
# A B C
#A 4 2 2
#B 2 4 2
#C 2 2 4
I have a data set something similar to this, with around 80 variables (flags) and 80,000 rows:
Acc_Nbr flag1 flag2 flag3 flag4 Exposure
ab 1 0 1 0 1000
bc 0 1 1 0 2000
cd 1 1 0 1 3000
ef 1 0 1 1 4000
Expected Output
Variable Count_Acct_Number Sum_Exposure Total_Acct_Number Total_Expo
flag1 3 8000 4 10000
flag2 2 5000 4 10000
flag3 3 7000 4 10000
flag4 2 7000 4 10000
Basically I want the output to show me count of account number and sum of exposure which are marked as 1 for each variable and in front of them total count of account numbers and exposures.
Please help.
We can convert the 'data.frame' to a 'data.table' with setDT(df1) and reshape it to 'long' format with melt. Then, grouped by 'variable', we take the sum of 'value1', the sum of 'Exposure' where 'value1' is 1, the number of rows (.N), and the sum of all values in 'Exposure' to get the expected output.
library(data.table)
melt(setDT(df1), measure=patterns("^flag"))[,
list(Count_Acct_Number= sum(value1),
Sum_Exposure= sum(Exposure[value1==1]),
Total_Acct_Number = .N,
TotalExposure=sum(Exposure)),
by = variable]
# variable Count_Acct_Number Sum_Exposure Total_Acct_Number TotalExposure
#1: flag1 3 8000 4 10000
#2: flag2 2 5000 4 10000
#3: flag3 3 7000 4 10000
#4: flag4 2 7000 4 10000
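If data.table is not available, the same summary can be sketched in base R. This assumes every flag column name starts with "flag" and holds only 0/1 values; df1 is rebuilt here from the question's sample, and flags and res are names chosen for the example:

```r
df1 <- data.frame(
  Acc_Nbr  = c("ab", "bc", "cd", "ef"),
  flag1    = c(1, 0, 1, 1),
  flag2    = c(0, 1, 1, 0),
  flag3    = c(1, 1, 0, 1),
  flag4    = c(0, 0, 1, 1),
  Exposure = c(1000, 2000, 3000, 4000)
)

flags <- grep("^flag", names(df1), value = TRUE)

res <- data.frame(
  Variable          = flags,
  # accounts marked 1 for each flag
  Count_Acct_Number = sapply(df1[flags], function(f) sum(f == 1)),
  # exposure of the accounts marked 1 for each flag
  Sum_Exposure      = sapply(df1[flags], function(f) sum(df1$Exposure[f == 1])),
  Total_Acct_Number = nrow(df1),
  Total_Expo        = sum(df1$Exposure),
  row.names         = NULL
)
res
```

With about 80 flag columns, the two sapply() calls loop over columns only, so the 80,000 rows are handled by vectorised sums.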
A straightforward way is to use the doBy package:
library(doBy)
df <- data.frame(account=LETTERS[1:10], exposure=1:10*3.14, mark=round(runif(10)))
res <- as.data.frame(summaryBy(exposure~mark+account, df, FUN=sum))
subset(res, mark==0)
Starting with the base data (note that the sample contains random values):
> df
account exposure mark
1 A 3.14 1
2 B 6.28 1
3 C 9.42 0
4 D 12.56 0
5 E 15.70 1
6 F 18.84 0
7 G 21.98 1
8 H 25.12 0
9 I 28.26 1
10 J 31.40 0
this gives an intermediate result grouped by mark and account (in this case nothing is actually summed, because each account occurs only once, but summing would work the same way):
> res
mark account exposure.sum
1 0 A 3.14
2 0 D 12.56
3 0 F 18.84
4 0 H 25.12
5 1 B 6.28
6 1 C 9.42
7 1 E 15.70
8 1 G 21.98
9 1 I 28.26
10 1 J 31.40
The final result can be selected with
> subset(res, mark==0)
mark account exposure.sum
1 0 A 3.14
2 0 D 12.56
3 0 F 18.84
4 0 H 25.12
I have a data frame called "e" that contains posts from a platform, with unique entry_id and member_id:
row. member_id entry_id timestamp
1 1 a 2008-06-09 12:41:00
2 1 b 2008-07-14 18:41:00
3 1 c 2010-07-17 15:40:00
4 2 d 2008-06-09 12:41:00
5 2 e 2008-09-18 10:22:00
6 3 f 2008-10-03 13:36:00
I have another data frame called "c", that contains comments:
row. member_id comment_id timestamp
1 1 I 2007-06-09 12:41:00
2 1 II 2007-07-14 18:41:00
3 1 III 2009-07-17 15:40:00
4 2 IV 2007-06-09 12:41:00
5 2 V 2009-09-18 10:22:00
6 3 VI 2010-10-03 13:36:00
I want to count all the comments a member wrote before he posted an entry. So the data frame "e" should look like this. Only mind the years when reading the example. The solution however should cover minutes too:
row. member_id entry_id prev_comment_count timestamp
1 1 a 2 2008-06-09 12:41:00
2 1 b 2 2008-07-14 18:41:00
3 1 c 3 2010-07-17 15:40:00
4 2 d 1 2008-06-09 12:41:00
5 2 e 1 2008-09-18 10:22:00
6 3 f 0 2008-10-03 13:36:00
I already tried the following function:
functionPrevComments <- function(givE) nrow(subset
(c, (as.character(givE["member_id"]) == c["member_id"]) &
(c["timestamp"] <= givE["timestamp"])))
But when I try to sapply it, I get the error
"Incompatible methods ("Ops.data.frame", "Ops.factor") for "<=""
I used the "$" operator to reference the columns I need before, but then I got
"$ operator is invalid for atomic vectors "
How do I apply my function correctly, or is there another, better solution to my problem?
Best Regards,
Nikolas
Here's a slightly different option. Make sure you have both "timestamp" columns converted to POSIXct-class before running the code.
e$prev_comment_count <- sapply(seq_len(nrow(e)), function(i) {
nrow(c[c$member_id == e$member_id[i] & c$timestamp < e$timestamp[i], ])
})
e
# row. member_id entry_id timestamp prev_comment_count
#1 1 1 a 2008-06-09 12:41:00 2
#2 2 1 b 2008-07-14 18:41:00 2
#3 3 1 c 2010-07-17 15:40:00 3
#4 4 2 d 2008-06-09 12:41:00 1
#5 5 2 e 2008-09-18 10:22:00 1
#6 6 3 f 2008-10-03 13:36:00 0
e$type <- "entry"
c$type <- "comment"
names(e) <- c("row", "member_id", "action_id", "timestamp", "type")
names(c) <- c("row", "member_id", "action_id", "timestamp", "type")
DF <- rbind(e,c)
DF$timestamp <- as.POSIXct(DF$timestamp,
format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
DF <- DF[order(DF$member_id, DF$timestamp),]
DF$count <- as.integer(ave(DF$type,
DF$member_id,
FUN = function(x) cumsum(x == "comment")))
DF[DF$type == "entry",]
# row member_id action_id timestamp type count
#1 1 1 a 2008-06-09 12:41:00 entry 2
#2 2 1 b 2008-07-14 18:41:00 entry 2
#3 3 1 c 2010-07-17 15:40:00 entry 3
#4 4 2 d 2008-06-09 12:41:00 entry 1
#5 5 2 e 2008-09-18 10:22:00 entry 1
#6 6 3 f 2008-10-03 13:36:00 entry 0
If this is not fast enough, it can be improved with data.table or dplyr.
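One possible base R speed-up, offered as a sketch rather than a benchmarked solution, replaces the row-by-row comparison with findInterval() on sorted comment times. The data frames are named entries and comments here (instead of e and c) purely for the example, and the strict "before" comparison relies on the left.open argument, which needs R 3.3.0 or later:

```r
ts <- function(x) as.POSIXct(x, tz = "GMT")

entries <- data.frame(
  member_id = c(1, 1, 1, 2, 2, 3),
  entry_id  = c("a", "b", "c", "d", "e", "f"),
  timestamp = ts(c("2008-06-09 12:41:00", "2008-07-14 18:41:00",
                   "2010-07-17 15:40:00", "2008-06-09 12:41:00",
                   "2008-09-18 10:22:00", "2008-10-03 13:36:00"))
)
comments <- data.frame(
  member_id = c(1, 1, 1, 2, 2, 3),
  timestamp = ts(c("2007-06-09 12:41:00", "2007-07-14 18:41:00",
                   "2009-07-17 15:40:00", "2007-06-09 12:41:00",
                   "2009-09-18 10:22:00", "2010-10-03 13:36:00"))
)

entries$prev_comment_count <- 0L
for (m in unique(entries$member_id)) {
  rows <- entries$member_id == m
  # sorted comment times of this member, as seconds since the epoch
  ct <- sort(as.numeric(comments$timestamp[comments$member_id == m]))
  # with left.open = TRUE, findInterval counts comment times strictly
  # earlier than each entry time
  entries$prev_comment_count[rows] <-
    findInterval(as.numeric(entries$timestamp[rows]), ct, left.open = TRUE)
}
entries
```

The loop runs once per member rather than once per entry, and each lookup is a binary search over that member's comments.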
I've been struggling for a while with the following dataset:
id date var1 var2
1 7031 2008-12-01 27 1
2 7031 2009-01-05 6 0
3 7031 2009-02-02 0 3
4 7031 2008-11-01 1 4
5 7500 2009-07-11 30 0
6 7500 2009-10-01 8 0
7 7500 2010-01-01 0 0
8 7041 2009-06-20 26 0
9 7041 2009-08-01 0 0
10 0277 2009-01-01 3 0
For each id, I would like to output the last date on which the variables are non-zero. The time series for these users have different lengths. I expect output something like:
id last_date
7031 2009-02-02
7500 2009-10-01
7041 2009-06-20
0277 2009-01-01
Any help would be appreciated!
First, subset your data, and then use aggregate():
Here's your sample data:
x <- read.table(header = TRUE, stringsAsFactors=FALSE, text = "
id date var1 var2
1 '7031' 2008-12-01 27 1
2 '7031' 2009-01-05 6 0
3 '7031' 2009-02-02 0 3
4 '7031' 2008-11-01 1 4
5 '7500' 2009-07-11 30 0
6 '7500' 2009-10-01 8 0
7 '7500' 2010-01-01 0 0
8 '7041' 2009-06-20 26 0
9 '7041' 2009-08-01 0 0
10 '0277' 2009-01-01 3 0")
Make sure that your "date" variable values are represented by actual dates and not characters.
x$date <- as.Date(x$date)
Subset:
x2 <- with(x, x[!(var1 == 0 & var2 == 0), ])
Aggregate:
aggregate(date ~ id, x2, max)
# id date
# 1 277 2009-01-01
# 2 7031 2009-02-02
# 3 7041 2009-06-20
# 4 7500 2009-10-01
If you didn't want to create a new object of your subsetted data, you can also use: aggregate(date ~ id, x[!(x$var1 == 0 & x$var2 == 0), ], max)
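Note that the result above shows id 277 rather than 0277, because read.table converted the id column to a number. A minimal sketch that keeps the leading zero, assuming the ids are meant to be character strings, builds the data frame directly (nonzero and last_dates are names chosen for the example):

```r
x <- data.frame(
  id   = c("7031", "7031", "7031", "7031", "7500",
           "7500", "7500", "7041", "7041", "0277"),
  date = as.Date(c("2008-12-01", "2009-01-05", "2009-02-02", "2008-11-01",
                   "2009-07-11", "2009-10-01", "2010-01-01", "2009-06-20",
                   "2009-08-01", "2009-01-01")),
  var1 = c(27, 6, 0, 1, 30, 8, 0, 26, 0, 3),
  var2 = c(1, 0, 3, 4, 0, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Keep rows where at least one variable is non-zero
nonzero <- x[x$var1 != 0 | x$var2 != 0, ]

# Latest remaining date per id
last_dates <- aggregate(date ~ id, nonzero, max)
last_dates
```

The filter `var1 != 0 | var2 != 0` is the same condition as `!(var1 == 0 & var2 == 0)` used above, written in its positive form.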