Extracting event types from last 21 day window - r

My dataframe looks like this. The two rightmost columns are my desired columns.
**Name ActivityType ActivityDate Email(last 21 days) Webinar(last 21 days)**
John Email 1/1/2014 NA NA
John Webinar 1/5/2014 NA NA
John Sale 1/20/2014 Yes Yes
John Webinar 3/25/2014 NA NA
John Sale 4/1/2014 No Yes
John Sale 7/1/2014 No No
Tom Email 1/1/2015 NA NA
Tom Webinar 1/5/2015 NA NA
Tom Sale 1/20/2015 Yes Yes
Tom Webinar 3/25/2015 NA NA
Tom Sale 4/1/2015 No Yes
Tom Sale 7/1/2015 No No
I am just trying to create a yes/no variable that denotes whether there was an email or a webinar in the last 21 days for each "Sale" transaction. I was thinking (mock code) along the lines of using dplyr this way:
custlife %>%
group_by(Name) %>%
mutate(Email(last21days)=lag(ifelse(ActivityType = "Email" & ActivityDate of email within (activity date of sale - 21),Yes,No)).
I am not sure of the way to implement this. Kindly help. Your help is sincerely appreciated!
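For reference, one way that mock code could be fleshed out (just a sketch: flag_within_21 is a made-up helper, it assumes ActivityDate is stored as m/d/Y text, and the logical result would still need recoding to Yes/No):
library(dplyr)
# hypothetical helper: for each row's date, was there an activity of `type`
# in the 21 days up to and including that date?
flag_within_21 <- function(type, act_type, act_date) {
  sapply(act_date, function(d) any(act_type == type & act_date >= d - 21 & act_date <= d))
}
custlife %>%
  mutate(ActivityDate = as.Date(ActivityDate, "%m/%d/%Y")) %>%
  group_by(Name) %>%
  mutate(Email21   = ifelse(ActivityType == "Sale",
                            flag_within_21("Email",   ActivityType, ActivityDate), NA),
         Webinar21 = ifelse(ActivityType == "Sale",
                            flag_within_21("Webinar", ActivityType, ActivityDate), NA)) %>%
  ungroup()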

Here's a possible data.table solution. I'm creating 2 temporary data sets, one for Sale and one for the rest of the activity types, then joining between them with a rolling window of 21 days while using by = .EACHI in order to check the conditions within each join. Finally, I'm joining the result back to the original data set.
Convert the date column to Date class and key the data by Name and Date (for the final/rolling join)
library(data.table)
setkey(setDT(df)[, ActivityDate := as.IDate(ActivityDate, "%m/%d/%Y")], Name, ActivityDate)
Create 2 temporary data sets, one for the sales and one for the other activity types
Saletemp <- df[ActivityType == "Sale", .(Name, ActivityDate)]
Elsetemp <- df[ActivityType != "Sale", .(Name, ActivityDate, ActivityType)]
Join to the sales temporary data set with a rolling window of 21 days while checking the conditions
Saletemp[Elsetemp, `:=`(Email21 = as.logical(which(i.ActivityType == "Email")),
                        Webinar21 = as.logical(which(i.ActivityType == "Webinar"))),
         roll = -21, by = .EACHI]
Join everything back
df[Saletemp, `:=`(Email21 = i.Email21, Webinar21 = i.Webinar21)]
df
# Name ActivityType ActivityDate Email21 Webinar21
# 1: John Email 2014-01-01 NA NA
# 2: John Webinar 2014-01-05 NA NA
# 3: John Sale 2014-01-20 TRUE TRUE
# 4: John Webinar 2014-03-25 NA NA
# 5: John Sale 2014-04-01 NA TRUE
# 6: John Sale 2014-07-01 NA NA
# 7: Tom Email 2015-01-01 NA NA
# 8: Tom Webinar 2015-01-05 NA NA
# 9: Tom Sale 2015-01-20 TRUE TRUE
# 10: Tom Webinar 2015-03-25 NA NA
# 11: Tom Sale 2015-04-01 NA TRUE
# 12: Tom Sale 2015-07-01 NA NA
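If the Yes/No/NA coding from the question is wanted, one further optional step could look like this sketch; replacing the whole columns avoids assigning character values into the logical flags:
df[, `:=`(Email21   = ifelse(ActivityType == "Sale", ifelse(is.na(Email21),   "No", "Yes"), NA),
          Webinar21 = ifelse(ActivityType == "Sale", ifelse(is.na(Webinar21), "No", "Yes"), NA))]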

Here is another option with base R:
df is first split according to Name. Then, within each subset, for each Sale it checks whether there is an Email (Webinar) within 21 days before the Sale. Finally, the list is unsplit according to Name.
You then just have to replace FALSE with "No" and TRUE with "Yes" afterwards.
df_split <- split(df, df$Name)
df_split <- lapply(df_split, function(tab){
  i_s <- which(tab$ActivityType == "Sale")
  # for each Sale, look for an Email / Webinar in the 21 days up to the Sale date
  tab$Email21[i_s] <- sapply(i_s, function(i) any(tab$ActivityType == "Email" &
    tab$ActivityDate >= tab$ActivityDate[i] - 21 & tab$ActivityDate <= tab$ActivityDate[i]))
  tab$Webinar21[i_s] <- sapply(i_s, function(i) any(tab$ActivityType == "Webinar" &
    tab$ActivityDate >= tab$ActivityDate[i] - 21 & tab$ActivityDate <= tab$ActivityDate[i]))
  tab
})
df_res <- unsplit(df_split, df$Name)
df_res
# Name ActivityType ActivityDate Email21 Webinar21
#1 John Email 2014-01-01 NA NA
#2 John Webinar 2014-01-05 NA NA
#3 John Sale 2014-01-20 TRUE TRUE
#4 John Webinar 2014-03-25 NA NA
#5 John Sale 2014-04-01 FALSE TRUE
#6 John Sale 2014-07-01 FALSE FALSE
#7 Tom Email 2015-01-01 NA NA
#8 Tom Webinar 2015-01-05 NA NA
#9 Tom Sale 2015-01-20 TRUE TRUE
#10 Tom Webinar 2015-03-25 NA NA
#11 Tom Sale 2015-04-01 FALSE TRUE
#12 Tom Sale 2015-07-01 FALSE FALSE
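The recoding mentioned above could then be done along these lines (a sketch; the NA rows stay NA since ifelse() propagates NA):
df_res$Email21   <- ifelse(df_res$Email21,   "Yes", "No")
df_res$Webinar21 <- ifelse(df_res$Webinar21, "Yes", "No")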
data
df <- structure(list(Name = c("John", "John", "John", "John", "John",
"John", "Tom", "Tom", "Tom", "Tom", "Tom", "Tom"), ActivityType = c("Email",
"Webinar", "Sale", "Webinar", "Sale", "Sale", "Email", "Webinar",
"Sale", "Webinar", "Sale", "Sale"), ActivityDate = structure(c(16071,
16075, 16090, 16154, 16161, 16252, 16436, 16440, 16455, 16519,
16526, 16617), class = "Date")), .Names = c("Name", "ActivityType",
"ActivityDate"), row.names = c(NA, -12L), index = structure(integer(0), ActivityType = c(1L,
7L, 3L, 5L, 6L, 9L, 11L, 12L, 2L, 4L, 8L, 10L)), class = "data.frame")

Related

Filter data.frame by date range in R

I have a DF like this:
Date <- c("10/17/17","11/11/17","11/23/17","11/25/17","12/3/17","12/10/17","12/16/17")
Ben <- c("1294",NA,"8959","2345",NA,"0303",NA)
James <- c(NA,"4523","3246",NA,"2394","8877","1427")
Alex <- c("3754","1122","5582",NA,"0094",NA,NA)
df1 <- data.frame(Date,Ben,James,Alex)
#df1
Date Ben James Alex
10/17/17 1294 NA 3754
11/11/17 NA 4523 1122
11/23/17 8959 3246 5582
11/25/17 2345 NA NA
12/3/17 NA 2394 0094
12/10/17 0303 8877 NA
12/16/17 NA 1427 NA
As you can see, the DF is sorted by date. I'm trying to put values that are within 2 weeks of the latest date for each column into a new DF, like this:
#df2
Ben James Alex
0303 1427 0094
NA 8877 5582
NA 2394 NA
Ben only has one listed value because there's only one non-NA value within 2 weeks of 12/10/17, the latest date that has a non-NA value in Ben's column. James's latest non-NA date is 12/16/17. He has three values that fall within two weeks of that date: 1427, 8877 and 2394. Alex's latest date is 12/3/17. He has two values within two weeks of his latest date: 0094 and 5582. The number of rows in the new data.frame should equal that of the longest column. Columns with fewer entries within their respective two-week ranges should be padded with NA, like Ben's column.
I'm currently using the following code, which simply takes the last 3 non-NA values in each column:
df2 <- lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 3))
Using base R to subset:
df1$Date <- as.Date(df1$Date, "%m/%d/%y")  # the Date column has to be an actual Date first

lapply(df1[-1], function(x) x[which((m <- tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 14)]) -> result
max(lengths(result)) -> len
do.call(cbind.data.frame, lapply(result, `length<-`, len))
Ben James Alex
1 <NA> 2394 5582
2 0303 8877 <NA>
3 <NA> 1427 0094
I just realized those values are coded as characters, according to the data you gave.
To have it exactly as given in the expected results, we would have:
do.call(cbind.data.frame,lapply(result,function(x) `length<-`(rev(x),len)))
Ben James Alex
1 0303 1427 0094
2 <NA> 8877 <NA>
3 <NA> 2394 5582
If I have understood what you are looking for correctly, the following code will help you.
I have loaded your dataset (with the dput function):
dataset <- structure(list(Date = structure(c(17456, 17481, 17493, 17495,
17499, 17510, 17516), class = "Date"), Ben = c(1294L, NA, 8959L,
2345L, NA, 303L, NA), James = c(NA, 4523L, 3246L, NA, NA, 8877L,
1427L), Alex = c(3754L, 1122L, 5582L, NA, 94L, NA, NA)), .Names = c("Date",
"Ben", "James", "Alex"), row.names = c(NA, -7L), class = "data.frame")
Then load the following packages:
library(lubridate)
library(tidyverse)
Fix last_date (in the dput above the Date column is already of class Date; if it were stored as m/d/y text you would also run dataset$Date <- mdy(dataset$Date)):
last_date <- mdy("12/16/17")
Now, let's select only rows you want:
dataset_filtered <- dataset %>%
filter(Date<=last_date & Date>=(last_date-days(14)))
You'll have:
Date Ben James Alex
1 2017-12-10 303 8877 NA
2 2017-12-16 NA 1427 NA
Please use the dput function next time; it's not always Christmas ;-)

Inserting rows into a dataframe based on a vector that contains dates

This is what my dataframe looks like:
df <- read.table(text='
Name ActivityType ActivityDate
John Email 2014-01-01
John Webinar 2014-01-05
John Webinar 2014-01-20
John Email 2014-04-20
Tom Email 2014-01-01
Tom Webinar 2014-01-05
Tom Webinar 2014-01-20
Tom Email 2014-04-20
', header=T, row.names = NULL)
I have this vector x which contains different dates:
x <- c("2014-01-03","2014-01-25","2015-05-27")
I want to insert rows into my original dataframe in a way that incorporates the dates in the x vector. This is what the output should look like:
Name ActivityType ActivityDate
John Email 2014-01-01
John NA 2014-01-03
John Webinar 2014-01-05
John Webinar 2014-01-20
John NA 2014-01-25
John Email 2014-04-20
John NA 2015-05-27
Tom Email 2014-01-01
Tom NA 2014-01-03
Tom Webinar 2014-01-05
Tom Webinar 2014-01-20
Tom NA 2014-01-25
Tom Email 2014-04-20
Tom NA 2015-05-27
Sincerely appreciate your help!
It looks like you've added each of the 'new' dates against each of the people, correct?
In which case you can turn your x into a data.frame, and merge/join it on
## original dataframe
df <- data.frame(Name = c(rep("John", 4), rep("Tom", 4)),
                 ActivityType = c("Email","Web","Web","Email","Email","Web","Web","Email"),
                 ActivityDate = c("2014-01-01","2014-01-05","2014-01-20","2014-04-20",
                                  "2014-01-01","2014-01-05","2014-01-20","2014-04-20"))
## Turning x into a dataframe.
x <- data.frame(ActivityDate = rep(c("2014-01-03","2014-01-25","2015-05-27"), 2),
                Name = rep(c("John","Tom"), 3))
merge(df, x, by=c("Name", "ActivityDate"), all=T)
# Name ActivityDate ActivityType
# 1 John 2014-01-01 Email
# 2 John 2014-01-05 Web
# 3 John 2014-01-20 Web
# 4 John 2014-04-20 Email
# 5 John 2014-01-03 <NA>
# 6 John 2014-01-25 <NA>
# 7 John 2015-05-27 <NA>
# 8 Tom 2014-01-01 Email
# 9 Tom 2014-01-05 Web
# 10 Tom 2014-01-20 Web
# 11 Tom 2014-04-20 Email
# 12 Tom 2014-01-03 <NA>
# 13 Tom 2014-01-25 <NA>
# 14 Tom 2015-05-27 <NA>
Update
As you are having memory issues, you can use data.table like this:
library(data.table)
dt <- as.data.table(df)
x_dt <- as.data.table(x)
merge(dt, x_dt, by=c("Name","ActivityDate"), all=T)
or, if you're not looking to merge you can rbind them, using data.table's rbindlist
rbindlist(list(dt, x_dt), fill=TRUE) ## fill sets the 'ActivityType' to NA in X
Update 2
To generate your x with 16000 unique names (I've used numbers here, but the principle is the same) and a month of dates
ActivityDates <- seq(as.Date("2014-01-01"), as.Date("2014-01-31"), by=1)
Names <- seq(1,16000)
x <- data.frame(Names = rep(Names, length(ActivityDates)),
ActivityDates = rep(ActivityDates, length(Names)))
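If the rep() bookkeeping feels error-prone, expand.grid builds the same set of name/date combinations more directly; a minimal sketch:
# every combination of name and date, one row each
x <- expand.grid(Names = Names, ActivityDates = ActivityDates)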
1) expand.grid Using expand.grid, create a data frame adds with the rows to be added, then use rbind to combine df and adds, converting the ActivityDate column to "Date" class. Then sort. No packages are used.
adds <- expand.grid(Name = levels(df$Name), ActivityType = NA, ActivityDate = x)
both <- transform(rbind(df, adds), ActivityDate = as.Date(ActivityDate))
o <- with(both, order(Name, ActivityDate))
both[o, ]
giving:
Name ActivityType ActivityDate
1 John Email 2014-01-01
9 John <NA> 2014-01-03
2 John Webinar 2014-01-05
3 John Webinar 2014-01-20
11 John <NA> 2014-01-25
4 John Email 2014-04-20
13 John <NA> 2015-05-27
5 Tom Email 2014-01-01
10 Tom <NA> 2014-01-03
6 Tom Webinar 2014-01-05
7 Tom Webinar 2014-01-20
12 Tom <NA> 2014-01-25
8 Tom Email 2014-04-20
14 Tom <NA> 2015-05-27
2) sqldf This uploads adds and df to an SQLite database which it creates on the fly, then performs the SQL query and downloads the result. The computation occurs outside of R, so it might work with your large data.
adds <- data.frame(Name = NA, ActivityDate = x)
library(sqldf)
sqldf("select *
from (select *
from df
union
select a.Name, NULL ActivityType, ActivityDate
from (select distinct Name from df) a
cross join adds b
) order by 1, 3"
)
giving:
Name ActivityType ActivityDate
1 John Email 2014-01-01
2 John <NA> 2014-01-03
3 John Webinar 2014-01-05
4 John Webinar 2014-01-20
5 John <NA> 2014-01-25
6 John Email 2014-04-20
7 John <NA> 2015-05-27
8 Tom Email 2014-01-01
9 Tom <NA> 2014-01-03
10 Tom Webinar 2014-01-05
11 Tom Webinar 2014-01-20
12 Tom <NA> 2014-01-25
13 Tom Email 2014-04-20
14 Tom <NA> 2015-05-27

What is the "data table" way of doing this join/merge?

I have a "dictionary" table like this:
dict <- data.table(
Nickname = c("Abby", "Ben", "Chris", "Dan", "Ed"),
Name = c("Abigail", "Benjamin", "Christopher", "Daniel", "Edward")
)
dict
# Nickname Name
# 1: Abby Abigail
# 2: Ben Benjamin
# 3: Chris Christopher
# 4: Dan Daniel
# 5: Ed Edward
And a "data" table like this:
dat <- data.table(
Friend1 = c("Abby", "Ben", "Ben", "Chris"),
Friend2 = c("Ben", "Ed", NA, "Ed"),
Friend3 = c("Ed", NA, NA, "Dan"),
Friend4 = c("Dan", NA, NA, NA)
)
dat
# Friend1 Friend2 Friend3 Friend4
# 1: Abby Ben Ed Dan
# 2: Ben Ed NA NA
# 3: Ben NA NA NA
# 4: Chris Ed Dan NA
I would like to produce a data.table that looks like this
result <- data.table(
Friend1.Nickname = c("Abby", "Ben", "Ben", "Chris"),
Friend1.Name = c("Abigail", "Benjamin", "Benjamin", "Christopher"),
Friend2.Nickname = c("Ben", "Ed", NA, "Ed"),
Friend2.Name = c("Benjamin", "Edward", NA, "Edward"),
Friend3.Nickname = c("Ed", NA, NA, "Dan"),
Friend3.Name = c("Edward", NA, NA, "Daniel"),
Friend4.Nickname = c("Dan", NA, NA, NA),
Friend4.Name = c("Daniel", NA, NA, NA)
)
result
# sorry, word wrapping makes this too annoying to copy
And this is the solution I had in mind:
friend_vars <- paste0("Friend", 1:4)
friend_nicks <- paste0(friend_vars, ".Nickname")
friend_names <- paste0(friend_vars, ".Name")
setnames(dat, friend_vars, friend_nicks)
for (i in 1:4) {
  dat[, friend_names[i] := dict$Name[match(dat[[friend_nicks[i]]], dict$Nickname)], with = FALSE]
}
Is there a more "data-table-esque" way to do this? I'm sure it's nice and efficient, but it's ugly to read, and apart from data.table's in-place assignment I don't feel like I'm taking good advantage of what the package has to offer.
I'm also not a very strong SQL user, and I'm not too comfortable with join terminology. I have a feeling that Data.table - left outer join on multiple tables could be useful here but I'm not sure how to apply it to my situation.
Using data.table 1.9.5:
for (nm in names(dat)) {
  on = setattr("Nickname", 'names', nm)
  dat[dict, paste0(nm, ".Name") := i.Name, on = on]
}
We can join using on= instead of setting keys. Now you can use setcolorder() to reorder the names.
I avoid reshaping data unless absolutely necessary. This is where updating while joining comes in really handy. And now with the on= argument, I couldn't resist posting an answer :-).
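For the setcolorder() step mentioned above, a possible sketch that interleaves each FriendN column with its new FriendN.Name column:
setcolorder(dat, c(rbind(paste0("Friend", 1:4), paste0("Friend", 1:4, ".Name"))))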
I didn't come up with a solution that exactly matches your result, but you might be able to work with something like this:
dat[, id := .I]
dat.m <- melt(dat, id.vars='id', variable.name='Friend', value.name='Nickname')
setkey(dict, Nickname)
dat.m[, Name := dict[Nickname, Name]]
> dat.m
id Friend Nickname Name
1: 1 Friend1 Abby Abigail
2: 2 Friend1 Ben Benjamin
3: 3 Friend1 Ben Benjamin
4: 4 Friend1 Chris Christopher
5: 1 Friend2 Ben Benjamin
6: 2 Friend2 Ed Edward
7: 3 Friend2 NA NA
8: 4 Friend2 Ed Edward
9: 1 Friend3 Ed Edward
10: 2 Friend3 NA NA
11: 3 Friend3 NA NA
12: 4 Friend3 Dan Daniel
13: 1 Friend4 Dan Daniel
14: 2 Friend4 NA NA
15: 3 Friend4 NA NA
16: 4 Friend4 NA NA
The variable id was just a placeholder so I could melt the DT.
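If the wide shape is needed after all, a dcast sketch could spread both columns back out (this assumes a data.table version with multi-column value.var support, 1.9.6+; note the names come out as Nickname_Friend1, Name_Friend1, ... rather than Friend1.Nickname):
# one Nickname and one Name column per Friend
dcast(dat.m, id ~ Friend, value.var = c("Nickname", "Name"))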
setkey(dict, Nickname)
dat[, paste(names(dat), "Name", sep = ".") := lapply(.SD, function(x) dict[J(x)]$Name)]
setcolorder(dat, c(1, 5, 2, 6, 3, 7, 4, 8))
dat
# Friend1 Friend1.Name Friend2 Friend2.Name Friend3 Friend3.Name Friend4 Friend4.Name
# 1: Abby Abigail Ben Benjamin Ed Edward Dan Daniel
# 2: Ben Benjamin Ed Edward NA NA NA NA
# 3: Ben Benjamin NA NA NA NA NA NA
# 4: Chris Christopher Ed Edward Dan Daniel NA NA
In base R, super ugly:
cbind(dat, lapply(dat, function(x){dict$Name[match(x, dict$Nickname)]}))
Friend1 Friend2 Friend3 Friend4 V2 NA NA NA
1: Abby Ben Ed Dan Abigail Benjamin Edward Daniel
2: Ben Ed NA NA Benjamin Edward NA NA
3: Ben NA NA NA Benjamin NA NA NA
4: Chris Ed Dan NA Christopher Edward Daniel NA

Convert AsIs to numeric separated by comma in data frame

I have such data frame:
structure(list(P1 = c("Mark", "Katrin", "Kate", "Hank", "Tom",
"Marcus"), P2 = c("Tim", "Greg", "Seba", "Teqa", "Justine", "Monica"
), clique = structure(list(`930` = integer(0), `2090` = integer(0),
`3120` = c(2L, 3L, 231L), `3663` = integer(0), `3704` = integer(0),
`4156` = c(19L, 27L)), .Names = c("930", "2090", "3120",
"3663", "3704", "4156"), class = "AsIs")), .Names = c("P1", "P2",
"clique"), row.names = c(930L, 2090L, 3120L, 3663L, 3704L, 4156L
), class = "data.frame")
And I have a problem with the last column, called clique. I would like to convert this column to numeric values separated by commas in one column, or, even better, to transform integer(0) to NAs and put the numbers in separate columns, keeping just one number in each column.
I will accept both solutions.
example data:
P1 P2 clique
Mark Tim integer(0)
Katrin Greg integer(0)
Kate Seba c(2, 3, 231)
Hank Teqa integer(0)
Tom Justine integer(0)
Marcus Monica c(19, 27)
> class(data$clique)
[1] "AsIs"
Desired output:
P1 P2 clique
Mark Tim NA
Katrin Greg NA
Kate Seba 2,3,231
Hank Teqa NA
Tom Justine NA
Marcus Monica 19,27
or
P1 P2 clique New_column1 New_column2
Mark Tim
Katrin Greg
Kate Seba 2 3 231
Hank Teqa
Tom Justine
Marcus Monica 19 27
You can try listCol_w from my "splitstackshape" package:
library(splitstackshape)
listCol_w(mydf, "clique")[, lapply(.SD, as.numeric), by = .(P1, P2)]
## P1 P2 clique_fl_1 clique_fl_2 clique_fl_3
## 1: Mark Tim NA NA NA
## 2: Katrin Greg NA NA NA
## 3: Kate Seba 2 3 231
## 4: Hank Teqa NA NA NA
## 5: Tom Justine NA NA NA
## 6: Marcus Monica 19 27 NA
I recommend this because you mentioned you wanted the numeric values. You won't be able to store a value like "2,3,231" as a numeric value.
If you still want to try the approach of collapsing the values and then splitting them, you can try:
mydf$clique <- vapply(mydf$clique, function(x) paste(x, collapse = ","), character(1L))
The str would show that you now have a single character string instead of a list of character vectors. You can then use cSplit on that to get the wide form.
> str(mydf)
'data.frame': 6 obs. of 3 variables:
$ P1 : chr "Mark" "Katrin" "Kate" "Hank" ...
$ P2 : chr "Tim" "Greg" "Seba" "Teqa" ...
$ clique: chr "" "" "2,3,231" "" ...
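The cSplit step mentioned above could then look roughly like this (a sketch; the split columns should come out as clique_1, clique_2, ... and the empty strings as NA):
# splitstackshape is already loaded above; split the collapsed strings wide
cSplit(mydf, "clique", ",")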

How do I find last date in which a value increased in another column?

I have a data frame in R that looks something like this:
person date level
Alex 2007-06-01 3
Alex 2008-12-01 4
Alex 2009-12-01 3
Beth 2008-03-01 6
Beth 2010-10-01 6
Beth 2010-12-01 6
Mary 2009-11-04 9
Mary 2012-04-25 9
Mary 2013-09-10 10
I have sorted it first by "person" and second by "date".
I am trying to find out when the last increase in "level" occurred for each person. Ideally, the output would look something like:
person date
Alex 2008-12-01
Beth NA
Mary 2013-09-10
Using dplyr
library(dplyr)
dat %>%
  group_by(person) %>%
  mutate(inc = c(F, diff(level) > 0)) %>%
  summarize(date = last(date[inc], default = NA))
Yielding:
Source: local data frame [3 x 2]
person date
1 Alex 2008-12-01
2 Beth <NA>
3 Mary 2013-09-10
Try the data.table version:
library(data.table)
setDT(dat)[order(person), diff := c(NA, diff(level)), by = person][diff > 0, tail(.SD, 1), by = person][, -c(3, 4), with = FALSE]
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
If NA also needs to be included:
dd <- setDT(dat)[order(person), diff := c(NA, diff(level)), by = person][diff > 0, tail(.SD, 1), by = person][, -c(3, 4), with = FALSE]
dd2 <- data.frame(unique(dat[!(person %in% dd$person), ]$person), NA)
names(dd2) <- c('person', 'date')
rbind(dd, dd2)
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
3: Beth NA
A base-R version, using data frame df:
sapply(levels(df$Person), function(p) {
  s <- df[df$Person == p, ]
  i <- 1 + nrow(s) - match(TRUE, rev(diff(s$Level) > 0))
  ifelse(is.na(i), NA, as.character(s$Date[i]))
})
produces the named vector
Alex Beth Mary
"2008-12-01" NA "2013-09-10"
Easy to wrap this to produce any output format you need:
last.level.up <- function(df) {
  data.frame(Date = sapply(levels(df$Person), function(p) {
    s <- df[df$Person == p, ]
    i <- 1 + nrow(s) - match(TRUE, rev(diff(s$Level) > 0))
    ifelse(is.na(i), NA, as.character(s$Date[i]))
  }))
}
last.level.up(df)
Date
Alex 2008-12-01
Beth <NA>
Mary 2013-09-10
