Inserting rows into a dataframe based on a vector that contains dates - r

This is what my dataframe looks like:
df <- read.table(text='
Name ActivityType ActivityDate
John Email 2014-01-01
John Webinar 2014-01-05
John Webinar 2014-01-20
John Email 2014-04-20
Tom Email 2014-01-01
Tom Webinar 2014-01-05
Tom Webinar 2014-01-20
Tom Email 2014-04-20
', header=T, row.names = NULL)
I have this vector x which contains different dates
x<- c("2014-01-03","2014-01-25","2015-05-27"). I want to insert rows in my original dataframe in a way that incorporates these dates in the x vector.This is what the output should look like:
Name ActivityType ActivityDate
John Email 2014-01-01
John NA 2014-01-03
John Webinar 2014-01-05
John Webinar 2014-01-20
John NA 2014-01-25
John Email 2014-04-20
John NA 2015-05-27
Tom Email 2014-01-01
Tom NA 2014-01-03
Tom Webinar 2014-01-05
Tom Webinar 2014-01-20
Tom NA 2014-01-25
Tom Email 2014-04-20
Tom NA 2015-05-27
Sincerely appreciate your help!

It looks like you've added one of the 'new' dates aginst each of the people, correct?
In which case you can turn your x into a data.frame, and merge/join it on
## original dataframe
df <- data.frame(Name = c(rep("John", 4), rep("Tom", 4)),
ActivityType = c("Email","Web","Web","Email","Email","Web","Web", "Email"),
ActivityDate = c("2014-01-01","2014-05-01","2014-20-01","2014-20-04","2014-01-01","2014-05-01","2014-20-01","2014-20-04"))
## Turning x into a dataframe.
x <- data.frame(ActivityDate = rep(c("2014-01-03","2014-01-25","2015-05-27"), 2),
Name = rep(c("John","Tom"), 3))
merge(df, x, by=c("Name", "ActivityDate"), all=T)
# Name ActivityDate ActivityType
# 1 John 2014-01-01 Email
# 2 John 2014-05-01 Web
# 3 John 2014-20-01 Web
# 4 John 2014-20-04 Email
# 5 John 2014-01-03 <NA>
# 6 John 2014-01-25 <NA>
# 7 John 2015-05-27 <NA>
# 8 Tom 2014-01-01 Email
# 9 Tom 2014-05-01 Web
# 10 Tom 2014-20-01 Web
# 11 Tom 2014-20-04 Email
# 12 Tom 2014-01-03 <NA>
# 13 Tom 2014-01-25 <NA>
# 14 Tom 2015-05-27 <NA>
Update
As you are having memory issues, you can use data.table thusly
library(data.table)
dt <- as.data.table(df)
x_dt <- as.data.table(x)
merge(dt, x_dt, by=c("Name","ActivityDate"), all=T)
or, if you're not looking to merge you can rbind them, using data.table's rbindlist
rbindlist(list(dt, x_dt), fill=TRUE) ## fill sets the 'ActivityType' to NA in X
Update 2
To generate your x with 16000 uniqe names (I've used numbers here, but the principle is the same) and 30 dates
ActivityDates <- seq(as.Date("2014-01-01"), as.Date("2014-01-31"), by=1)
Names <- seq(1,16000)
x <- data.frame(Names = rep(Names, length(ActivityDates)),
ActivityDates = rep(ActivityDates, length(Names)))

1) expand.grid Using expand.grid create a data frame adds with the rows to be added and then use rbind to combine df and adds converting the ActivityDate column to "Date" class. Then sort. No packages are used.
adds <- expand.grid(Name = levels(df$Name), ActivityType = NA, ActivityDate = x)
both <- transform(rbind(df, adds), ActivityDate = as.Date(ActivityDate))
o <- with(both, order(Name, ActivityDate))
both[o, ]
giving:
Name ActivityType ActivityDate
1 John Email 2014-01-01
9 John <NA> 2014-01-03
2 John Webinar 2014-01-05
3 John Webinar 2014-01-20
11 John <NA> 2014-01-25
4 John Email 2014-04-20
13 John <NA> 2015-05-27
5 Tom Email 2014-01-01
10 Tom <NA> 2014-01-03
6 Tom Webinar 2014-01-05
7 Tom Webinar 2014-01-20
12 Tom <NA> 2014-01-25
8 Tom Email 2014-04-20
14 Tom <NA> 2015-05-27
2) sqldf This uploads adds and df to an sqlite data base which it creates on the fly, then it performs the sql query and downloads the result. The computation occurs outside of R so it might work with your large data.
adds <- data.frame(Name = NA, ActivityDate = x)
library(sqldf)
sqldf("select *
from (select *
from df
union
select a.Name, NULL ActivityType, ActivityDate
from (select distinct Name from df) a
cross join adds b
) order by 1, 3"
)
giving:
Name ActivityType ActivityDate
1 John Email 2014-01-01
2 John <NA> 2014-01-03
3 John Webinar 2014-01-05
4 John Webinar 2014-01-20
5 John <NA> 2014-01-25
6 John Email 2014-04-20
7 John <NA> 2015-05-27
8 Tom Email 2014-01-01
9 Tom <NA> 2014-01-03
10 Tom Webinar 2014-01-05
11 Tom Webinar 2014-01-20
12 Tom <NA> 2014-01-25
13 Tom Email 2014-04-20
14 Tom <NA> 2015-05-27

Related

Getting Data in a single row into multiple rows

I have a code where I see which people work in certain groups. When I ask the leader of each group to present those who work for them, in a survey, I get a row of all of the team members. What I need is to clean the data into multiple rows with their group information.
I don't know where to start.
This is what my data frame looks like,
LeaderName <- c('John','Jane','Louis','Carl')
Group <- c('3','1','4','2')
Member1 <- c('Lucy','Stephanie','Chris','Leslie')
Member1ID <- c('1','2','3','4')
Member2 <- c('Earl','Carlos','Devon','Francis')
Member2ID <- c('5','6','7','8')
Member3 <- c('Luther','Peter','','Severus')
Member3ID <- c('9','10','','11')
GroupInfo <- data.frame(LeaderName, Group, Member1, Member1ID, Member2 ,Member2ID, Member3, Member3ID)
This is what I would like it to show with a certain code
LeaderName_ <- c('John','Jane','Louis','Carl','John','Jane','Louis','Carl','John','Jane','','Carl')
Group_ <- c('3','1','4','2','3','1','4','2','3','1','','2')
Member <- c('Lucy','Stephanie','Chris','Leslie','Earl','Carlos','Devon','Francis','Luther','Peter','','Severus')
MemberID <- c('1','2','3','4','5','6','7','8','9','10','','11')
ActualGroupInfor <- data.frame(LeaderName_,Group_,Member,MemberID)
An option would be melt from data.table and specify the column name patterns in the measure parameter
library(data.table)
melt(setDT(GroupInfo), measure = patterns("^Member\\d+$",
"^Member\\d+ID$"), value.name = c("Member", "MemberID"))[, variable := NULL][]
# LeaderName Group Member MemberID
# 1: John 3 Lucy 1
# 2: Jane 1 Stephanie 2
# 3: Louis 4 Chris 3
# 4: Carl 2 Leslie 4
# 5: John 3 Earl 5
# 6: Jane 1 Carlos 6
# 7: Louis 4 Devon 7
# 8: Carl 2 Francis 8
# 9: John 3 Luther 9
#10: Jane 1 Peter 10
#11: Louis 4
#12: Carl 2 Severus 11
Here is a solution in base r:
reshape(
data=GroupInfo,
idvar=c("LeaderName", "Group"),
varying=list(
Member=which(names(GroupInfo) %in% grep("^Member[0-9]$",names(GroupInfo),value=TRUE)),
MemberID=which(names(GroupInfo) %in% grep("^Member[0-9]ID",names(GroupInfo),value=TRUE))),
direction="long",
v.names = c("Member","MemberID"),
sep="_")[,-3]
#> LeaderName Group Member MemberID
#> John.3.1 John 3 Lucy 1
#> Jane.1.1 Jane 1 Stephanie 2
#> Louis.4.1 Louis 4 Chris 3
#> Carl.2.1 Carl 2 Leslie 4
#> John.3.2 John 3 Earl 5
#> Jane.1.2 Jane 1 Carlos 6
#> Louis.4.2 Louis 4 Devon 7
#> Carl.2.2 Carl 2 Francis 8
#> John.3.3 John 3 Luther 9
#> Jane.1.3 Jane 1 Peter 10
#> Louis.4.3 Louis 4
#> Carl.2.3 Carl 2 Severus 11
Created on 2019-05-23 by the reprex package (v0.2.1)

More efficient methods than nested for loops in R -- matching

I'm trying to match people when they have identical names, last names, and first names, and keep the smallest numerical value for IDs.
I've created a test database below (much smaller than my actual dataset) and written a nested for-loop that looks like it's doing what it's supposed to.
But it's slow as hell on bigger datasets.
I'm relatively new to the apply functions, but they seem more intuitive for applying functions than data wrangling.
What's a more efficient alternative for what I'm doing here? I'm sure there's a simple solution that will have me shaking my head for asking here, but I'm not coming to it.
dta.test<- NULL
dta.test$Person_id <- c(1,2,3,4,5,6,7,8,9,10, 11)
dta.test$FirstName <- c("John", "James", "John", "Alex", "Alexander", "Jonathan", "John", "Alex", "James", "John", "John")
dta.test$LastName <- c("Smith", "Jones", "Jones", "Jones", "Jones", "Smith", "Jones", "Smith", "Johnson", "Smith", "Smith")
dta.test$DOB <- c("2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01", "2004-01-01", "2001-01-01", "2003-01-01", "2006-01-01", "2006-01-01", "2001-01-01", "2009-01-01")
dta.test$Actual_ID <- c(1, 2, 3, 4, 5, 6, 3, 8, 9, 1, 11)
dta.test <- as.data.frame(dta.test)
for(i in unique(dta.test$FirstName))
for(j in unique(dta.test$LastName))
for (k in unique (dta.test$DOB))
{
{
{
dta.test$Person_id[dta.test$FirstName==i & dta.test$LastName==j & dta.test$DOB==k] <- min(dta.test$Person_id[dta.test$FirstName==i & dta.test$LastName==j & dta.test$DOB==k], na.rm=T)
}
}
}
Here's a dplyr solution
library(dplyr)
dta.test %>%
group_by(FirstName, LastName, DOB) %>%
mutate(Person_id = min(Person_id))
# A tibble: 11 x 5
# Groups: FirstName, LastName, DOB [9]
# Person_id FirstName LastName DOB Actual_ID
# <dbl> <fct> <fct> <fct> <dbl>
# 1 1. John Smith 2001-01-01 1.
# 2 2. James Jones 2002-01-01 2.
# 3 3. John Jones 2003-01-01 3.
# 4 4. Alex Jones 2004-01-01 4.
# 5 5. Alexander Jones 2004-01-01 5.
# 6 6. Jonathan Smith 2001-01-01 6.
# 7 3. John Jones 2003-01-01 3.
# 8 8. Alex Smith 2006-01-01 8.
# 9 9. James Johnson 2006-01-01 9.
# 10 1. John Smith 2001-01-01 1.
# 11 11. John Smith 2009-01-01 11.
EDIT - Added Performance comparison
for_loop_approach <- function() {
for(i in unique(dta.test$FirstName))
for(j in unique(dta.test$LastName))
for (k in unique (dta.test$DOB))
{
{
{
dta.test$Person_id[dta.test$FirstName==i & dta.test$LastName==j & dta.test$DOB==k] <- min(dta.test$Person_id[dta.test$FirstName==i & dta.test$LastName==j & dta.test$DOB==k], na.rm=T)
}
}
}
}
dplyr_approach <- function() {
require(dplyr)
dta.test %>%
group_by(FirstName, LastName, DOB) %>%
mutate(Person_id = min(Person_id))
}
library(microbenchmark)
microbenchmark(for_loop_approach(), dplyr_approach(), unit="relative", times=100L)
Unit: relative
expr min lq mean median uq max neval
for_loop_approach() 20.97948 20.6478 18.8189 17.81437 17.91815 11.76743 100
dplyr_approach() 1.00000 1.0000 1.0000 1.00000 1.00000 1.00000 100
There were 50 or more warnings (use warnings() to see the first 50)
I've implemented a base R approach rather than dplyr and it comes out (according to microbenchmark) 7.46 times faster than the dplyr approach of CPak, and 139.4 times faster than the for loop approach. I've just used the match and paste0 functions to get this working, and it will automatically retain the smallest matching id:
dta.test[, "Actual_id"] <- match(paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB), paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB))
This approach also outputs it straight to a data frame, rather than a tibble (which you would need to extract the new column from, and add to your data frame):
Person_id FirstName LastName DOB Actual_id
1 1 John Smith 2001-01-01 1
2 2 James Jones 2002-01-01 2
3 3 John Jones 2003-01-01 3
4 4 Alex Jones 2004-01-01 4
5 5 Alexander Jones 2004-01-01 5
6 6 Jonathan Smith 2001-01-01 6
7 7 John Jones 2003-01-01 3
8 8 Alex Smith 2006-01-01 8
9 9 James Johnson 2006-01-01 9
10 10 John Smith 2001-01-01 1
11 11 John Smith 2009-01-01 11
In your real data I expect the person id is not so simple (not just an integer) and doesn't run in numerical order, e.g.
dta.test$Person_id <- paste0(LETTERS[1:11],1:11)
You just need a small tweak to make this still work, to make it extract value from the Person_id column:
dta.test[, "Actual_id"] <- dta.test[match(paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB), paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB)), "Person_id"]
Giving:
Person_id FirstName LastName DOB Actual_id
1 A1 John Smith 2001-01-01 A1
2 B2 James Jones 2002-01-01 B2
3 C3 John Jones 2003-01-01 C3
4 D4 Alex Jones 2004-01-01 D4
5 E5 Alexander Jones 2004-01-01 E5
6 F6 Jonathan Smith 2001-01-01 F6
7 G7 John Jones 2003-01-01 C3
8 H8 Alex Smith 2006-01-01 H8
9 I9 James Johnson 2006-01-01 I9
10 J10 John Smith 2001-01-01 A1
11 K11 John Smith 2009-01-01 K11
A data table solution will probably be quickest on large data with lots of groups:
library(data.table)
setDT(dta.test, key = c("FirstName", "LastName", "DOB"))
dta.test[, Actual_ID := min(Person_id, na.rm = TRUE), by = .(FirstName, LastName, DOB)]

Extracting event types from last 21 day window

My dataframe looks like this. The two rightmost columns are my desired columns.
**Name ActivityType ActivityDate Email(last 21 says) Webinar(last21)**
John Email 1/1/2014 NA NA
John Webinar 1/5/2014 NA NA
John Sale 1/20/2014 Yes Yes
John Webinar 3/25/2014 NA NA
John Sale 4/1/2014 No Yes
John Sale 7/1/2014 No No
Tom Email 1/1/2015 NA NA
Tom Webinar 1/5/2015 NA NA
Tom Sale 1/20/2015 Yes Yes
Tom Webinar 3/25/2015 NA NA
Tom Sale 4/1/2015 No Yes
Tom Sale 7/1/2015 No No
I am just trying to create a yes/no variable that denotes whether there was an email or a webinar in the last 21 days for each "Sale" transaction. I was thinking(mock code) along the lines of using dplyr this way:
custlife %>%
group_by(Name) %>%
mutate(Email(last21days)=lag(ifelse(ActivityType = "Email" & ActivityDate of email within (activity date of sale - 21),Yes,No)).
I am not sure of the way to implement this. Kindly help. Your help is sincerely appreciated!
Here's a possible data.table solution. Here I'm creating 2 temporary data sets- one for Sale and one for the rest of activity types and then joining between them by a rolling window of 21 while using by = .EACHI in order to check conditions in each join. Then, I'm joining the result to the original data set.
Convert the date column to Date class and key the data by Name and Date (for the final/rolling join)
library(data.table)
setkey(setDT(df)[, ActivityDate := as.IDate(ActivityDate, "%m/%d/%Y")], Name, ActivityDate)
Create 2 temporary data sets per each activity
Saletemp <- df[ActivityType == "Sale", .(Name, ActivityDate)]
Elsetemp <- df[ActivityType != "Sale", .(Name, ActivityDate, ActivityType)]
Join by a rolling window of 21 to the sales temporary data set while checking conditions
Saletemp[Elsetemp, `:=`(Email21 = as.logical(which(i.ActivityType == "Email")),
Webinar21 = as.logical(which(i.ActivityType == "Webinar"))),
roll = -21, by = .EACHI]
Join everything back
df[Saletemp, `:=`(Email21 = i.Email21, Webinar21 = i.Webinar21)]
df
# Name ActivityType ActivityDate Email21 Webinar21
# 1: John Email 2014-01-01 NA NA
# 2: John Webinar 2014-01-05 NA NA
# 3: John Sale 2014-01-20 TRUE TRUE
# 4: John Webinar 2014-03-25 NA NA
# 5: John Sale 2014-04-01 NA TRUE
# 6: John Sale 2014-07-01 NA NA
# 7: Tom Email 2015-01-01 NA NA
# 8: Tom Webinar 2015-01-05 NA NA
# 9: Tom Sale 2015-01-20 TRUE TRUE
# 10: Tom Webinar 2015-03-25 NA NA
# 11: Tom Sale 2015-04-01 NA TRUE
# 12: Tom Sale 2015-07-01 NA NA
Here is another option with base R:
df is first split according to Name and then, among each subset, for each Sale, it looks if there is an Email (Webinar) within 21 days from the Sale. Finally, the list is unsplit according to Name.
You just have to replace FALSE by no and TRUE by yes afterwards.
df_split <- split(df, df$Name)
df_split <- lapply(df_split, function(tab){
i_s <- which(tab[,2]=="Sale")
tab$Email21[i_s] <- sapply(tab[i_s, 3], function(d_s){any(tab[tab$ActivityType=="Email", 3] >= d_s-21)})
tab$Webinar21[i_s] <- sapply(tab[i_s, 3], function(d_s){any(tab[tab$ActivityType=="Webinar", 3] >= d_s-21)})
tab
})
df_res <- unsplit(df_split, df$Name)
df_res
# Name ActivityType ActivityDate Email21 Webinar21
#1 John Email 2014-01-01 NA NA
#2 John Webinar 2014-01-05 NA NA
#3 John Sale 2014-01-20 TRUE TRUE
#4 John Webinar 2014-03-25 NA NA
#5 John Sale 2014-04-01 FALSE TRUE
#6 John Sale 2014-07-01 FALSE FALSE
#7 Tom Email 2015-01-01 NA NA
#8 Tom Webinar 2015-01-05 NA NA
#9 Tom Sale 2015-01-20 TRUE TRUE
#10 Tom Webinar 2015-03-25 NA NA
#11 Tom Sale 2015-04-01 FALSE TRUE
#12 Tom Sale 2015-07-01 FALSE FALSE
data
df <- structure(list(Name = c("John", "John", "John", "John", "John",
"John", "Tom", "Tom", "Tom", "Tom", "Tom", "Tom"), ActivityType = c("Email",
"Webinar", "Sale", "Webinar", "Sale", "Sale", "Email", "Webinar",
"Sale", "Webinar", "Sale", "Sale"), ActivityDate = structure(c(16071,
16075, 16090, 16154, 16161, 16252, 16436, 16440, 16455, 16519,
16526, 16617), class = "Date")), .Names = c("Name", "ActivityType",
"ActivityDate"), row.names = c(NA, -12L), index = structure(integer(0), ActivityType = c(1L,
7L, 3L, 5L, 6L, 9L, 11L, 12L, 2L, 4L, 8L, 10L)), class = "data.frame")

In R: add rows based on a date and another condition

I have a data frame df:
df <- data.frame(names=c("john","mary","tom"),dates=c(as.Date("2010-06-01"),as.Date("2010-07-09"),as.Date("2010-06-01")),tours_missed=c(2,12,6))
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
tom 2010-06-01 6
I want to be able to add a row with the dates the person missed. There are 2 tours every day the person works. Each person works every 4 days.
The result should be (though the order doesn't matter):
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
mary 2010-07-13 12
mary 2010-07-17 12
mary 2010-07-21 12
mary 2010-07-25 12
mary 2010-07-29 12
tom 2010-06-01 6
tom 2010-06-05 6
tom 2010-06-09 6
I have already tried looking at these topics but was unable to produce the above result: Add rows to a data frame based on date in previous row, In R: Add rows with data of previous row to data frame, add new row to dataframe, enter link description here. Thanks for your help!
library(data.table)
dt = as.data.table(df) # or convert in-place using setDT
# all of the relevant dates
dates.all = dt[, seq(dates, length = tours_missed/2, by = "4 days"), by = names]
# set the key and merge filling in the blanks with previous observation
setkey(dt, names, dates)
dt[dates.all, roll = T]
# names dates tours_missed
# 1: john 2010-06-01 2
# 2: mary 2010-07-09 12
# 3: mary 2010-07-13 12
# 4: mary 2010-07-17 12
# 5: mary 2010-07-21 12
# 6: mary 2010-07-25 12
# 7: mary 2010-07-29 12
# 8: tom 2010-06-01 6
# 9: tom 2010-06-05 6
#10: tom 2010-06-09 6
Or if merging is unnecessary (not quite clear from OP), just construct the answer:
dt[, list(dates = seq(dates, length = tours_missed/2, by = "4 days"), tours_missed)
, by = names]

How do I find last date in which a value increased in another column?

I have a data frame in R that looks something like this:
person date level
Alex 2007-06-01 3
Alex 2008-12-01 4
Alex 2009-12-01 3
Beth 2008-03-01 6
Beth 2010-10-01 6
Beth 2010-12-01 6
Mary 2009-11-04 9
Mary 2012-04-25 9
Mary 2013-09-10 10
I have sorted it first by "person" and second by "date".
I am trying to find out when the last increase in "level" occurred for each person. Ideally, the output would look something like:
person date
Alex 2008-12-01
Beth NA
Mary 2013-09-10
Using dplyr
library(dplyr)
dat %>% group_by(person) %>%
mutate(inc = c(F, diff(level) > 0)) %>%
summarize(date = last(date[inc], default = NA))
Yielding:
Source: local data frame [3 x 2]
person date
1 Alex 2008-12-01
2 Beth <NA>
3 Mary 2013-09-10
Try data.table version:
library(data.table)
setDT(dat)[order(person),diff:=c(NA,diff(level)),by=person][diff>0,tail(.SD,1),by=person][,-c(3,4),with=F]
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
If na also needs to be included:
dd=setDT(dat)[order(person),diff:=c(NA,diff(level)),by=person][diff>0,tail(.SD,1),by=person][,-c(3,4),with=F]
dd2 =data.frame(unique(ddt[!(person %in% dd$person),,]$person),NA)
names(dd2) = c('person','date')
rbind(dd, dd2)
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
3: Beth NA
A base-R version, using data frame df:
sapply(levels(df$Person), function(p) {
s <- df[df$Person==p,]
i <- 1+nrow(s)-match(TRUE,rev(diff(s$Level)>0))
ifelse(is.na(i), NA, as.character(s$Date[i]))
})
produces the named vector
Alex Beth Mary
"2008-12-01" NA "2013-09-10"
Easy to wrap this to produce any output format you need:
last.level.up <- function(df) {
data.frame(Date=sapply(levels(df$Person), function(p) {
s <- df[df$Person==p,]
i <- 1+nrow(s)-match(TRUE,rev(diff(s$Level)>0))
ifelse(is.na(i), NA, as.character(s$Date[i]))
}))
}
last.level.up(df)
Date
Alex 2008-12-01
Beth <NA>
Mary 2013-09-10

Resources