In R: add rows based on a date and another condition

I have a data frame df:
df <- data.frame(names=c("john","mary","tom"),dates=c(as.Date("2010-06-01"),as.Date("2010-07-09"),as.Date("2010-06-01")),tours_missed=c(2,12,6))
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
tom 2010-06-01 6
I want to add one row for each date the person missed. There are 2 tours every day the person works, and each person works every 4 days, so a person who missed 12 tours missed 6 working days spaced 4 days apart.
The result should be (though the order doesn't matter):
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
mary 2010-07-13 12
mary 2010-07-17 12
mary 2010-07-21 12
mary 2010-07-25 12
mary 2010-07-29 12
tom 2010-06-01 6
tom 2010-06-05 6
tom 2010-06-09 6
I have already tried looking at these topics but was unable to produce the above result: Add rows to a data frame based on date in previous row; In R: Add rows with data of previous row to data frame; add new row to dataframe. Thanks for your help!

library(data.table)
dt = as.data.table(df) # or convert in-place using setDT
# all of the relevant dates
dates.all = dt[, seq(dates, length = tours_missed/2, by = "4 days"), by = names]
# set the key and merge filling in the blanks with previous observation
setkey(dt, names, dates)
dt[dates.all, roll = T]
# names dates tours_missed
# 1: john 2010-06-01 2
# 2: mary 2010-07-09 12
# 3: mary 2010-07-13 12
# 4: mary 2010-07-17 12
# 5: mary 2010-07-21 12
# 6: mary 2010-07-25 12
# 7: mary 2010-07-29 12
# 8: tom 2010-06-01 6
# 9: tom 2010-06-05 6
#10: tom 2010-06-09 6
Or if merging is unnecessary (not quite clear from OP), just construct the answer:
dt[, list(dates = seq(dates, length = tours_missed/2, by = "4 days"), tours_missed)
, by = names]
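For comparison, here is a minimal base R sketch of the same expansion, assuming (as the question states) 2 tours per working day and a working day every 4 days. The names n_days, idx, offset and expanded are made up for this illustration.

```r
df <- data.frame(
  names = c("john", "mary", "tom"),
  dates = as.Date(c("2010-06-01", "2010-07-09", "2010-06-01")),
  tours_missed = c(2, 12, 6)
)

# Two tours per working day, so each person missed tours_missed / 2 working days.
n_days <- df$tours_missed / 2

# Repeat each original row once per missed working day.
idx <- rep(seq_len(nrow(df)), times = n_days)
expanded <- df[idx, ]

# Offset within each person: 0, 4, 8, ... days after the first missed date.
offset <- 4 * (sequence(n_days) - 1)
expanded$dates <- expanded$dates + offset
rownames(expanded) <- NULL
expanded
```

This gives 1 row for john, 6 for mary and 3 for tom, with tours_missed carried along unchanged on every row.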

Related

Getting Data in a single row into multiple rows

I have a code where I see which people work in certain groups. When I ask the leader of each group to present those who work for them, in a survey, I get a row of all of the team members. What I need is to clean the data into multiple rows with their group information.
I don't know where to start.
This is what my data frame looks like,
LeaderName <- c('John','Jane','Louis','Carl')
Group <- c('3','1','4','2')
Member1 <- c('Lucy','Stephanie','Chris','Leslie')
Member1ID <- c('1','2','3','4')
Member2 <- c('Earl','Carlos','Devon','Francis')
Member2ID <- c('5','6','7','8')
Member3 <- c('Luther','Peter','','Severus')
Member3ID <- c('9','10','','11')
GroupInfo <- data.frame(LeaderName, Group, Member1, Member1ID, Member2 ,Member2ID, Member3, Member3ID)
This is what I would like the code to produce:
LeaderName_ <- c('John','Jane','Louis','Carl','John','Jane','Louis','Carl','John','Jane','','Carl')
Group_ <- c('3','1','4','2','3','1','4','2','3','1','','2')
Member <- c('Lucy','Stephanie','Chris','Leslie','Earl','Carlos','Devon','Francis','Luther','Peter','','Severus')
MemberID <- c('1','2','3','4','5','6','7','8','9','10','','11')
ActualGroupInfor <- data.frame(LeaderName_,Group_,Member,MemberID)
An option would be melt from data.table and specify the column name patterns in the measure parameter
library(data.table)
melt(setDT(GroupInfo), measure = patterns("^Member\\d+$",
"^Member\\d+ID$"), value.name = c("Member", "MemberID"))[, variable := NULL][]
# LeaderName Group Member MemberID
# 1: John 3 Lucy 1
# 2: Jane 1 Stephanie 2
# 3: Louis 4 Chris 3
# 4: Carl 2 Leslie 4
# 5: John 3 Earl 5
# 6: Jane 1 Carlos 6
# 7: Louis 4 Devon 7
# 8: Carl 2 Francis 8
# 9: John 3 Luther 9
#10: Jane 1 Peter 10
#11: Louis 4
#12: Carl 2 Severus 11
Here is a solution in base R:
reshape(
data=GroupInfo,
idvar=c("LeaderName", "Group"),
varying=list(
Member=which(names(GroupInfo) %in% grep("^Member[0-9]$",names(GroupInfo),value=TRUE)),
MemberID=which(names(GroupInfo) %in% grep("^Member[0-9]ID",names(GroupInfo),value=TRUE))),
direction="long",
v.names = c("Member","MemberID"),
sep="_")[,-3]
#> LeaderName Group Member MemberID
#> John.3.1 John 3 Lucy 1
#> Jane.1.1 Jane 1 Stephanie 2
#> Louis.4.1 Louis 4 Chris 3
#> Carl.2.1 Carl 2 Leslie 4
#> John.3.2 John 3 Earl 5
#> Jane.1.2 Jane 1 Carlos 6
#> Louis.4.2 Louis 4 Devon 7
#> Carl.2.2 Carl 2 Francis 8
#> John.3.3 John 3 Luther 9
#> Jane.1.3 Jane 1 Peter 10
#> Louis.4.3 Louis 4
#> Carl.2.3 Carl 2 Severus 11
Created on 2019-05-23 by the reprex package (v0.2.1)
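If the reshape() call feels opaque, the same stacking can be sketched by hand in base R. This assumes exactly three member slots named Member1..Member3 with matching MemberNID columns, as in the question; the names slots, blocks, long and long_nonempty are made up for this illustration.

```r
GroupInfo <- data.frame(
  LeaderName = c("John", "Jane", "Louis", "Carl"),
  Group      = c("3", "1", "4", "2"),
  Member1    = c("Lucy", "Stephanie", "Chris", "Leslie"),
  Member1ID  = c("1", "2", "3", "4"),
  Member2    = c("Earl", "Carlos", "Devon", "Francis"),
  Member2ID  = c("5", "6", "7", "8"),
  Member3    = c("Luther", "Peter", "", "Severus"),
  Member3ID  = c("9", "10", "", "11"),
  stringsAsFactors = FALSE
)

# Build one block of rows per member slot (1, 2, 3), then stack the blocks.
slots <- 1:3
blocks <- lapply(slots, function(k) {
  data.frame(
    LeaderName = GroupInfo$LeaderName,
    Group      = GroupInfo$Group,
    Member     = GroupInfo[[paste0("Member", k)]],
    MemberID   = GroupInfo[[paste0("Member", k, "ID")]],
    stringsAsFactors = FALSE
  )
})
long <- do.call(rbind, blocks)

# Optional: drop the empty slots (Louis has no third member).
long_nonempty <- long[long$Member != "", ]
```

long has the same 12 rows as the outputs above; long_nonempty drops the blank row for Louis, which may or may not be what you want.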

data.table join + update with mult='first' gives unexpected result

In the below example, I have a table of users and a table of transactions where one user can have 0, 1, or more transactions. I execute a join+update with mult='first' on the users table to attempt to insert a column indicating the date of the first occurring transaction for each user.
library(data.table) # v1.10.4
# Download data
users <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/users.csv")
transactions <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/transactions.csv")
# Convert date columns to Date type
users[, `:=`(Registered = as.Date(Registered), Cancelled = as.Date(Cancelled))]
transactions[, TransactionDate := as.Date(TransactionDate)]
users
UserID User Gender Registered Cancelled FirstTransactionDate
1: 1 Charles male 2012-12-21 <NA> 2012-08-26
2: 2 Pedro male 2010-08-01 2010-08-08 2013-12-23
3: 3 Caroline female 2012-10-23 2016-06-07 2016-05-08
4: 4 Brielle female 2013-07-17 <NA> <NA>
5: 5 Benjamin male 2010-11-25 <NA> <NA>
transactions
TransactionID TransactionDate UserID ProductID Quantity
1: 1 2010-08-21 7 2 1
2: 2 2011-05-26 3 4 1
3: 3 2011-06-16 3 3 1
4: 4 2012-08-26 1 2 3
5: 5 2013-06-06 2 4 1
6: 6 2013-12-23 2 5 6
7: 7 2013-12-30 3 4 1
8: 8 2014-04-24 NA 2 3
9: 9 2015-04-24 7 4 3
10: 10 2016-05-08 3 4 4
##### For each user, insert the TransactionDate of the first matching row
users[transactions, FirstTransactionDate := i.TransactionDate, on="UserID", mult="first"]
# Unexpected result
users[UserID == 2]
UserID User Gender Registered Cancelled FirstTransactionDate
1: 2 Pedro male 2010-08-01 2010-08-08 2013-12-23 # <- shouldn't this be 2013-06-06?
Why does FirstTransactionDate 2013-12-23 get set for user 2 when an earlier transaction in the transactions table is tied to that user? Is this a bug?
Reading the documentation for data.table's mult more closely, it says that:
When i is a list (or data.frame or data.table) and multiple rows in x
match to the row in i, mult controls which are returned: "all"
(default), "first" or "last".
So if there are multiple rows in x ("users") that match to i ("transactions"), then mult will return the first row in x. However, in your case, there aren't multiple rows in x that match to i, rather there are multiple rows in i that match to x.
As @Arun suggested, the best option would be to change your join around so that mult = "first" is relevant:
users[, FirstTransactionDate := transactions[users, TransactionDate, on="UserID", mult = "first"]]
users
# UserID User Gender Registered Cancelled FirstTransactionDate
#1: 1 Charles male 2012-12-21 <NA> 2012-08-26
#2: 2 Pedro male 2010-08-01 2010-08-08 2013-06-06
#3: 3 Caroline female 2012-10-23 2016-06-07 2011-05-26
#4: 4 Brielle female 2013-07-17 <NA> <NA>
#5: 5 Benjamin male 2010-11-25 <NA> <NA>
Another option would be to change up your merge slightly:
# i.FirstTransactionDate refers explicitly to the column coming from transactions
users[transactions[, FirstTransactionDate := min(TransactionDate), by = UserID],
FirstTransactionDate := i.FirstTransactionDate, on="UserID"]
I just create the first transaction date within the transactions dataset. This gets merged on multiple times, but it should be fine because it's always the same value for a UserID.
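The same "first match in the lookup table" idea can be sketched in base R with match(), which returns the position of the first match only. The data below is typed in from the tables printed in the question (only the columns needed here), and the names tx and first_row are made up for this illustration.

```r
users <- data.frame(
  UserID = 1:5,
  User = c("Charles", "Pedro", "Caroline", "Brielle", "Benjamin")
)
transactions <- data.frame(
  TransactionID = 1:10,
  TransactionDate = as.Date(c(
    "2010-08-21", "2011-05-26", "2011-06-16", "2012-08-26", "2013-06-06",
    "2013-12-23", "2013-12-30", "2014-04-24", "2015-04-24", "2016-05-08"
  )),
  UserID = c(7, 3, 3, 1, 2, 2, 3, NA, 7, 3)
)

# Sort by date so the first row found for each user is the earliest transaction.
tx <- transactions[order(transactions$TransactionDate), ]

# For each user, match() gives the row of the FIRST matching transaction (NA if none).
first_row <- match(users$UserID, tx$UserID)
users$FirstTransactionDate <- tx$TransactionDate[first_row]
users
```

Here users is the table being looked up from and transactions is the table being searched, which is the direction in which "first" is meaningful.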

Erasing duplicates with NA values

I have a data frame like this:
names <- c('Mike','Mike','Mike','John','John','John','David','David','David','David')
dates <- c('04-26','04-26','04-27','04-28','04-27','04-26','04-01','04-02','04-02','04-03')
values <- c(NA,1,2,4,5,6,1,2,NA,NA)
test <- data.frame(names,dates,values)
Which is:
names dates values
1 Mike 04-26 NA
2 Mike 04-26 1
3 Mike 04-27 2
4 John 04-28 4
5 John 04-27 5
6 John 04-26 6
7 David 04-01 1
8 David 04-02 2
9 David 04-02 NA
10 David 04-03 NA
I'd like to get rid of duplicates with NA values. So, in this case, I have a valid observation from Mike on 04-26 and also have a valid observation from David on 04-02, so rows 1 and 9 should be erased and I will end up with:
names dates values
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
I tried to use duplicated function, something like this:
test[!duplicated(test[,c('names','dates')]),]
But that does not work since some NA values come before the valid value. Do you have any suggestions without trying things like merge or making another data frame?
Update: I'd like to keep rows with NA that are not duplicates.
What about this way?
library(dplyr)
test %>% group_by(names, dates) %>% filter((n()>=2 & !is.na(values)) | n()==1)
Source: local data frame [8 x 3]
Groups: names, dates [8]
names dates values
(fctr) (fctr) (dbl)
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
Here is an attempt in data.table:
# set up
library(data.table)
setDT(test)
# construct condition
test[, dupes := max(duplicated(.SD)), .SDcols=c("names", "dates"), by=c("names", "dates")]
# print out result
test[dupes == 0 | !is.na(values),]
Here is a similar method using base R, except that the dupes variable is kept separately from the data.frame:
dupes <- duplicated(test[c("names", "dates")])
# this generates warnings, but works nonetheless
dupes <- ave(dupes, test$names, test$dates, FUN=max)
# print out result
test[dupes == 0 | !is.na(test$values),]
If there are duplicated rows where the values variable is NA, and these duplicates add nothing to the data, then you can drop them prior to running the code above:
testNoNADupes <- test[!(duplicated(test) & is.na(test$values)),]
This should work based on your sample.
test <- test[order(test$values),]
test <- test[!(duplicated(test$names) & duplicated(test$dates) & is.na(test$values)),]
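Another way to state the rule is: keep a row if its value is not NA, or if its (names, dates) group has no valid value at all. A minimal base R sketch of that rule follows; has_value and result are names made up for this illustration.

```r
test <- data.frame(
  names  = c("Mike", "Mike", "Mike", "John", "John", "John",
             "David", "David", "David", "David"),
  dates  = c("04-26", "04-26", "04-27", "04-28", "04-27", "04-26",
             "04-01", "04-02", "04-02", "04-03"),
  values = c(NA, 1, 2, 4, 5, 6, 1, 2, NA, NA),
  stringsAsFactors = FALSE
)

# For every row: does its (names, dates) group contain at least one non-NA value?
has_value <- ave(!is.na(test$values), test$names, test$dates, FUN = any)

# Keep valid rows, plus NA rows whose group has no valid observation.
result <- test[!is.na(test$values) | !has_value, ]
result
```

On the sample data this drops rows 1 and 9 and keeps the lone NA for David on 04-03. Note that it does not collapse a group consisting of several NA rows; that would need an extra duplicated() step.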

Inserting rows into a dataframe based on a vector that contains dates

This is what my dataframe looks like:
df <- read.table(text='
Name ActivityType ActivityDate
John Email 2014-01-01
John Webinar 2014-01-05
John Webinar 2014-01-20
John Email 2014-04-20
Tom Email 2014-01-01
Tom Webinar 2014-01-05
Tom Webinar 2014-01-20
Tom Email 2014-04-20
', header=T, row.names = NULL)
I have this vector x which contains different dates:
x <- c("2014-01-03","2014-01-25","2015-05-27")
I want to insert rows into my original dataframe so that it incorporates the dates in x. This is what the output should look like:
Name ActivityType ActivityDate
John Email 2014-01-01
John NA 2014-01-03
John Webinar 2014-01-05
John Webinar 2014-01-20
John NA 2014-01-25
John Email 2014-04-20
John NA 2015-05-27
Tom Email 2014-01-01
Tom NA 2014-01-03
Tom Webinar 2014-01-05
Tom Webinar 2014-01-20
Tom NA 2014-01-25
Tom Email 2014-04-20
Tom NA 2015-05-27
Sincerely appreciate your help!
It looks like you've added each of the 'new' dates against each of the people, correct?
In which case you can turn your x into a data.frame, and merge/join it on
## original dataframe
df <- data.frame(Name = c(rep("John", 4), rep("Tom", 4)),
ActivityType = c("Email","Web","Web","Email","Email","Web","Web", "Email"),
ActivityDate = c("2014-01-01","2014-05-01","2014-20-01","2014-20-04","2014-01-01","2014-05-01","2014-20-01","2014-20-04"))
## Turning x into a dataframe.
x <- data.frame(ActivityDate = rep(c("2014-01-03","2014-01-25","2015-05-27"), 2),
Name = rep(c("John","Tom"), 3))
merge(df, x, by=c("Name", "ActivityDate"), all=T)
# Name ActivityDate ActivityType
# 1 John 2014-01-01 Email
# 2 John 2014-05-01 Web
# 3 John 2014-20-01 Web
# 4 John 2014-20-04 Email
# 5 John 2014-01-03 <NA>
# 6 John 2014-01-25 <NA>
# 7 John 2015-05-27 <NA>
# 8 Tom 2014-01-01 Email
# 9 Tom 2014-05-01 Web
# 10 Tom 2014-20-01 Web
# 11 Tom 2014-20-04 Email
# 12 Tom 2014-01-03 <NA>
# 13 Tom 2014-01-25 <NA>
# 14 Tom 2015-05-27 <NA>
Update
As you are having memory issues, you can use data.table thusly
library(data.table)
dt <- as.data.table(df)
x_dt <- as.data.table(x)
merge(dt, x_dt, by=c("Name","ActivityDate"), all=T)
or, if you're not looking to merge you can rbind them, using data.table's rbindlist
rbindlist(list(dt, x_dt), fill=TRUE) ## fill sets the 'ActivityType' to NA in X
Update 2
To generate your x with 16000 unique names (I've used numbers here, but the principle is the same) and a month of dates (the seq() call below yields 31)
ActivityDates <- seq(as.Date("2014-01-01"), as.Date("2014-01-31"), by=1)
Names <- seq(1,16000)
x <- data.frame(Names = rep(Names, length(ActivityDates)),
ActivityDates = rep(ActivityDates, length(Names)))
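One caveat on pairing two rep() calls like this: it lines the vectors up element by element, so it only covers every name/date combination when the two lengths share no common factor (16000 and 31 happen not to). A small sketch with made-up sizes (4 ids, 2 dates) shows the difference and how expand.grid() avoids it; ids, dates, paired and full are names chosen for this illustration.

```r
dates <- seq(as.Date("2014-01-01"), as.Date("2014-01-02"), by = 1)  # 2 dates
ids <- 1:4                                                          # 4 names

# Element-wise pairing of two rep() calls: 8 rows, but pairs repeat.
paired <- data.frame(Names = rep(ids, length(dates)),
                     ActivityDates = rep(dates, length(ids)))
n_paired <- nrow(unique(paired))

# expand.grid() builds the full cross product whatever the lengths are.
full <- expand.grid(Names = ids, ActivityDates = dates)
n_full <- nrow(unique(full))

c(paired = n_paired, full = n_full)
```

With these sizes only 4 of the 8 possible combinations appear in paired, while full contains all 8.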
1) expand.grid Using expand.grid, create a data frame adds holding the rows to be added, then use rbind to combine df and adds, converting the ActivityDate column to "Date" class. Then sort. No packages are used.
# unique() works whether Name is a factor or a character column
adds <- expand.grid(Name = unique(df$Name), ActivityType = NA, ActivityDate = x)
both <- transform(rbind(df, adds), ActivityDate = as.Date(ActivityDate))
o <- with(both, order(Name, ActivityDate))
both[o, ]
giving:
Name ActivityType ActivityDate
1 John Email 2014-01-01
9 John <NA> 2014-01-03
2 John Webinar 2014-01-05
3 John Webinar 2014-01-20
11 John <NA> 2014-01-25
4 John Email 2014-04-20
13 John <NA> 2015-05-27
5 Tom Email 2014-01-01
10 Tom <NA> 2014-01-03
6 Tom Webinar 2014-01-05
7 Tom Webinar 2014-01-20
12 Tom <NA> 2014-01-25
8 Tom Email 2014-04-20
14 Tom <NA> 2015-05-27
2) sqldf This uploads adds and df to an sqlite data base which it creates on the fly, then it performs the sql query and downloads the result. The computation occurs outside of R so it might work with your large data.
adds <- data.frame(Name = NA, ActivityDate = x)
library(sqldf)
sqldf("select *
from (select *
from df
union
select a.Name, NULL ActivityType, ActivityDate
from (select distinct Name from df) a
cross join adds b
) order by 1, 3"
)
giving:
Name ActivityType ActivityDate
1 John Email 2014-01-01
2 John <NA> 2014-01-03
3 John Webinar 2014-01-05
4 John Webinar 2014-01-20
5 John <NA> 2014-01-25
6 John Email 2014-04-20
7 John <NA> 2015-05-27
8 Tom Email 2014-01-01
9 Tom <NA> 2014-01-03
10 Tom Webinar 2014-01-05
11 Tom Webinar 2014-01-20
12 Tom <NA> 2014-01-25
13 Tom Email 2014-04-20
14 Tom <NA> 2015-05-27

How do I find last date in which a value increased in another column?

I have a data frame in R that looks something like this:
person date level
Alex 2007-06-01 3
Alex 2008-12-01 4
Alex 2009-12-01 3
Beth 2008-03-01 6
Beth 2010-10-01 6
Beth 2010-12-01 6
Mary 2009-11-04 9
Mary 2012-04-25 9
Mary 2013-09-10 10
I have sorted it first by "person" and second by "date".
I am trying to find out when the last increase in "level" occurred for each person. Ideally, the output would look something like:
person date
Alex 2008-12-01
Beth NA
Mary 2013-09-10
Using dplyr
library(dplyr)
dat %>% group_by(person) %>%
mutate(inc = c(F, diff(level) > 0)) %>%
summarize(date = last(date[inc], default = NA))
Yielding:
Source: local data frame [3 x 2]
person date
1 Alex 2008-12-01
2 Beth <NA>
3 Mary 2013-09-10
Try data.table version:
library(data.table)
setDT(dat)[order(person),diff:=c(NA,diff(level)),by=person][diff>0,tail(.SD,1),by=person][,-c(3,4),with=F]
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
If NA also needs to be included:
dd=setDT(dat)[order(person),diff:=c(NA,diff(level)),by=person][diff>0,tail(.SD,1),by=person][,-c(3,4),with=F]
dd2 = data.frame(unique(dat[!(person %in% dd$person)]$person), NA)
names(dd2) = c('person','date')
rbind(dd, dd2)
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
3: Beth NA
A base-R version, using a data frame df with columns Person (a factor), Date and Level:
sapply(levels(df$Person), function(p) {
s <- df[df$Person==p,]
i <- 1+nrow(s)-match(TRUE,rev(diff(s$Level)>0))
ifelse(is.na(i), NA, as.character(s$Date[i]))
})
produces the named vector
Alex Beth Mary
"2008-12-01" NA "2013-09-10"
Easy to wrap this to produce any output format you need:
last.level.up <- function(df) {
data.frame(Date=sapply(levels(df$Person), function(p) {
s <- df[df$Person==p,]
i <- 1+nrow(s)-match(TRUE,rev(diff(s$Level)>0))
ifelse(is.na(i), NA, as.character(s$Date[i]))
}))
}
last.level.up(df)
Date
Alex 2008-12-01
Beth <NA>
Mary 2013-09-10
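A variant of the base R idea that uses the lowercase column names from the question and does not need person to be a factor might look like the sketch below. The data is typed in from the question, and the names dat, up and last_inc are assumptions made for this illustration.

```r
dat <- data.frame(
  person = rep(c("Alex", "Beth", "Mary"), each = 3),
  date = as.Date(c("2007-06-01", "2008-12-01", "2009-12-01",
                   "2008-03-01", "2010-10-01", "2010-12-01",
                   "2009-11-04", "2012-04-25", "2013-09-10")),
  level = c(3, 4, 3, 6, 6, 6, 9, 9, 10)
)

# Assumes rows are already sorted by person and date, as stated in the question.
last_inc <- sapply(split(dat, dat$person), function(s) {
  up <- which(diff(s$level) > 0) + 1  # rows where level rose vs. the previous row
  if (length(up)) format(s$date[max(up)]) else NA_character_
})
last_inc
```

The result is a named character vector with one entry per person, NA where the level never increased; wrap it in as.Date() if you need real dates.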
