Extract rows with most current date out of a data frame - r

I would like to extract certain rows from a data frame that contains a date column (column C). Here is a small example:
Before <- data.frame(A=c("0010","0011","0012","0015","0024","0032","0032","0033","0039","0039","0039","0041","0054"),
B=c(11,12,11,11,12,12,12,11,"NA","NA",11,11,11),
C=c("2014-01-07","2013-06-03","2013-07-29","2014-07-14","2012-12-17","2013-08-21","2013-08-21","2014-07-11","2012-10-06","2012-10-06","2013-10-22","2014-05-28","2014-03-26"))
The output should look like this:
After <- data.frame(A=c("0010","0011","0012","0015","0024","0032","0033","0039","0041","0054"),
B=c(11,12,11,11,12,12,11,11,11,11),
C=c("2014-01-07","2013-06-03","2013-07-29","2014-07-14","2012-12-17","2013-08-21","2014-07-11","2013-10-22","2014-05-28","2014-03-26"))
So what I'm aiming for is:
Only return the entry with the latest date (out of rows 9, 10, 11 in Before) --> return only row 8 in After
Return identical entries only once (rows 6 and 7 in Before) --> return only row 6 in After
I wasn't able to find a solution using subset, unique etc. Any help appreciated!

Here are two data.table variations, depending on what you can assume about the data:
Assuming that your data already has the latest date for each group of A as the last element:
require(data.table)
setDT(Before)[, .SD[.N], by=A]
.SD holds the Subset of Data for each group in A and .N holds the number of observations in that group. So .SD[.N] gives us the last observation for each group.
Without any assumptions:
require(data.table)
setDT(Before)[, C := as.Date(C)][, .SD[which.max(C)], by=A]
Here, we first replace C with as.Date(C) using data.table's := operator, which modifies by reference (without making any copy, hence fast and memory efficient). Then, for each group of A, we subset the row corresponding to the maximum value of C.
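A hedged variation on the same idea (my addition, not part of the original answer): compute the row indices with .I instead of subsetting .SD, which is often faster on large tables.
require(data.table)
setDT(Before)[, C := as.Date(C)]              # convert C to Date by reference
Before[Before[, .I[which.max(C)], by=A]$V1]   # .I gives the row index of the max C per group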
HTH

require(dplyr)
Before %>%
mutate(C=as.Date(C)) %>%
group_by(A) %>%
arrange(A,desc(C)) %>%
filter(row_number()==1)
#Source: local data frame [10 x 3]
#Groups: A
# A B C
#1 0010 11 2014-01-07
#2 0011 12 2013-06-03
#3 0012 11 2013-07-29
#4 0015 11 2014-07-14
#5 0024 12 2012-12-17
#6 0032 12 2013-08-21
#7 0033 11 2014-07-11
#8 0039 11 2013-10-22
#9 0041 11 2014-05-28
#10 0054 11 2014-03-26
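A hedged variant of the pipeline above (my addition, not part of the original answer): slice(1) can replace filter(row_number()==1) to take the top row per group.
require(dplyr)
Before %>%
mutate(C=as.Date(C)) %>%   # convert C so desc() sorts chronologically
group_by(A) %>%
arrange(A,desc(C)) %>%     # latest date first within each A
slice(1)                   # keep one row per group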

split-apply-combine:
Before$C <- as.Date(Before$C)
library(plyr)
ddply(Before, .(A), function(df) {
df <- df[df$C==max(df$C),]
df[!duplicated(df),]
})
# A B C
#1 0010 11 2014-01-07
#2 0011 12 2013-06-03
#3 0012 11 2013-07-29
#4 0015 11 2014-07-14
#5 0024 12 2012-12-17
#6 0032 12 2013-08-21
#7 0033 11 2014-07-11
#8 0039 11 2013-10-22
#9 0041 11 2014-05-28
#10 0054 11 2014-03-26

By using the fact that dates act like numerics, something like the following might do the trick:
Before$C <- as.Date(Before$C) # Convert to dates
ans <- aggregate(C ~ A + B, max, data = Before) # Aggregate date, choose the last date
ans <- ans[ans$B != "NA", ] # Remove NA in col B
print(ans)
# A B C
#1 0010 11 2014-01-07
#2 0012 11 2013-07-29
#3 0015 11 2014-07-14
#4 0033 11 2014-07-11
#5 0039 11 2013-10-22
#6 0041 11 2014-05-28
#7 0054 11 2014-03-26
#8 0011 12 2013-06-03
#9 0024 12 2012-12-17
#10 0032 12 2013-08-21
The max of type Date will return the most recent one.
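For instance (a quick check using two of the dates from column C):
max(as.Date(c("2012-10-06", "2013-10-22")))
# [1] "2013-10-22"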

Iterating over Dates by Group in R using FOR loops

I'm trying to populate "FinalDate" based on "ExpectedDate" and "ObservedDate".
The rules are: for each group, if the observed date is greater than the previous expected date and less than the next expected date, then the final date equals the observed date; otherwise the final date equals the expected date.
How can I modify the code below to make sure that:
FinalDate is filled in by Group
Iteration numbers don't skip any rows
library(dplyr) # needed below for %>%, mutate(), group_by() and if_else()
set.seed(2)
dat<-data.frame(Group=sample(LETTERS[1:10], 100, replace=TRUE),
Date=sample(seq(as.Date('2013/01/01'), as.Date('2020/01/01'), by="day"), 100))%>%
mutate(ExpectedDate=Date+sample(10:200, 100, replace=TRUE),
ObservedDate=Date+sample(10:200, 100, replace=TRUE))%>%
group_by(Group)%>%
arrange(Date)%>%
mutate(n=row_number())%>%arrange(Group)%>%ungroup()%>%
as.data.frame()
#generate some missing values in "ObservedDate"
dat[sample(nrow(dat),20), "ObservedDate"]<-NA
dat$FinalDate<-NA
for (i in 1:nrow(dat)){
dat[i, "FinalDate"]<-if_else(!is.na(dat$"ObservedDate")[i] &&
dat[i, "ObservedDate"] > dat[i-1, "ExpectedDate"] &&
dat[i, "ObservedDate"] < dat[i+1, "ExpectedDate"],
dat[i, "ObservedDate"],
dat[i,"ExpectedDate"])
}
dat$FinalDate<-as.Date(dat$FinalDate) # convert numeric to Date format
E.g. in the output below:
At i=90, the code looks for the previous ExpectedDate within letter I; we want it to look for ExpectedDate only within letter J. If there is no previous expected date for a group, and ObservedDate is greater than ExpectedDate but less than the next ExpectedDate, then FinalDate should be filled with ExpectedDate.
At i=100, the code generates NA because there is no next observation available; we want this value to be filled in such that, for the last observation in each group, FinalDate=ObservedDate if ObservedDate is greater than this last ExpectedDate within the group, else ExpectedDate.
Group Date ExpectedDate ObservedDate n FinalDate
88 I 2015-09-07 2015-12-05 <NA> 7 2015-12-05
89 I 2018-08-02 2018-11-01 2018-08-13 8 2018-11-01
90 J 2013-07-24 2013-08-30 2013-08-12 1 2013-08-30
91 J 2013-11-22 2014-01-02 2014-04-05 2 2014-04-05
92 J 2014-11-03 2015-03-23 2015-05-10 3 2015-05-10
93 J 2015-08-30 2015-12-09 2016-02-04 4 2016-02-04
94 J 2016-04-18 2016-09-03 <NA> 5 2016-09-03
95 J 2016-10-10 2017-01-29 2017-04-14 6 2017-04-14
96 J 2017-02-14 2017-07-05 <NA> 7 2017-07-05
97 J 2017-04-21 2017-10-01 2017-08-26 8 2017-08-26
98 J 2017-10-01 2018-01-27 2018-02-28 9 2018-02-28
99 J 2018-08-03 2019-01-31 2018-10-20 10 2018-10-20
100 J 2019-04-25 2019-06-23 2019-08-16 11 <NA>
We can let go of the for loop and use group_by, lag and lead from dplyr here:
library(dplyr)
dat %>%
group_by(Group) %>%
mutate(FinalDate = if_else(ObservedDate > lag(ExpectedDate) &
ObservedDate < lead(ExpectedDate), ObservedDate, ExpectedDate))
We can also do this with data.table::between:
dat %>%
group_by(Group) %>%
mutate(FinalDate = if_else(data.table::between(ObservedDate,
lag(ExpectedDate), lead(ExpectedDate)), ObservedDate, ExpectedDate))
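A hedged sketch (my addition, not part of either answer above) of how the boundary rules from the question could be written out explicitly with dplyr's case_when(); treating every first row of a group as ExpectedDate is my reading of the question's first-row rule:
library(dplyr)
dat %>%
group_by(Group) %>%
mutate(FinalDate = case_when(
  is.na(ObservedDate) ~ ExpectedDate,                 # missing observation: fall back
  row_number() == 1 ~ ExpectedDate,                   # no previous ExpectedDate in the group (assumed rule)
  row_number() == n() & ObservedDate > ExpectedDate ~ ObservedDate,  # last row of the group
  row_number() == n() ~ ExpectedDate,
  ObservedDate > lag(ExpectedDate) &
    ObservedDate < lead(ExpectedDate) ~ ObservedDate, # interior rows, as in the answers above
  TRUE ~ ExpectedDate
)) %>%
ungroup()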

Scoping when creating a new r data.table column in a function using :=

This is a continuation of a question I posted here: Creating a new r data.table column based on values in another column and grouping, to which #Frank provided an excellent answer.
As I have to do several of these calculations with different date intervals, I want to write a function that does them. However, I seem to be running into a scoping problem. I have read the vignettes, the FAQ, and a ton of questions here, and I am still left baffled.
We'll use the same data:
library(data.table)
set.seed(88)
DT <- data.table(date = Sys.Date()-365 + sort(sample(1:100, 10)),
zip = sample(c("2000", "1150", "3000"),10, replace = TRUE),
purchaseAmount = sample(1:20, 10))
Here is the answer #Frank provided:
DT[, new_col :=
DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1),
sum(purchaseAmount)
, by=.EACHI ]$V1
]
DT
date zip purchaseAmount new_col
1: 2016-01-08 1150 5 5
2: 2016-01-15 3000 15 15
3: 2016-02-15 1150 16 16
4: 2016-02-20 2000 18 18
5: 2016-03-07 2000 19 19
6: 2016-03-15 2000 11 30
7: 2016-03-17 2000 6 36
8: 2016-04-02 1150 17 17
9: 2016-04-08 3000 7 7
10: 2016-04-09 3000 20 27
And now the actual problem I have encountered. I created the following function which enables dynamically changing the interval:
sumPreviousPurchases = function(dt, newColName, daysFrom, daysUntil){
zip = substitute(zip)
newColName = substitute(newColName)
dt[, newColName :=
dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
, on=.(zip, date >= d0, date <= d1),
sum(purchaseAmount)
, by=.EACHI ]$V1
]
}
sumPreviousPurchases(DT, prevPurch1to10, 0, 10)
DT
date zip purchaseAmount newColName
1: 2016-02-07 1150 5 5
2: 2016-02-14 3000 15 15
3: 2016-03-16 1150 16 16
4: 2016-03-21 2000 18 18
5: 2016-04-06 2000 19 19
6: 2016-04-14 2000 11 30
7: 2016-04-16 2000 6 36
8: 2016-05-02 1150 17 17
9: 2016-05-08 3000 7 7
10: 2016-05-09 3000 20 27
What troubles me is the scoping. The function names the new column newColName regardless of what I pass in the function call. From my reading I gathered that when referring to data.table column names in function arguments, one should use the substitute() function. However, that does not work here; the result is the same even if I leave the whole newColName = substitute(newColName) line out. I suppose it is because the column does not exist yet, but I do not know how to address this issue.
As a bonus I would like to ask: is there also a way to name the columns dynamically, i.e. in the example something like "daysFrom_to_daysUntil", so that the name would be "0_to_10"?
----- EDIT ----
I also stumbled upon a possible answer myself, somewhat similar to #lmo's answer, using an idea from here: http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/#assign-a-column-with--named-with-a-character-object
The most important differences from the question: I removed the newColName = substitute(newColName) line entirely, and added parentheses around newColName in dt[, (newColName) :=
sumPreviousPurchases = function(dt, newColName, daysFrom, daysUntil){
zip = substitute(zip)
#newColName = substitute(newColName)
dt[, (newColName) :=
dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
, on=.(zip, date >= d0, date <= d1),
sum(purchaseAmount)
, by=.EACHI ]$V1
]
}
Additionally, I added quotes around "prevPurch1to10".
sumPreviousPurchases(DT, "prevPurch1to10", 0, 10)
and got the answer
date zip purchaseAmount prevPurch1to10
1: 2016-02-17 1150 7 7
2: 2016-02-22 3000 8 8
3: 2016-03-04 1150 2 2
4: 2016-03-16 2000 14 14
5: 2016-04-03 2000 11 11
6: 2016-04-11 3000 12 12
7: 2016-04-21 1150 17 17
8: 2016-04-22 3000 3 3
9: 2016-05-03 2000 9 9
10: 2016-05-11 3000 4 4
However, there are still the two following weird things:
a) substitute() is not needed once the parentheses are added around (newColName). Why is that?
b) quotes are required around "prevPurch1to10". Again, why? Is there a more data.table-ish way to do this, without the quotes?
You can use substitute directly in the assignment:
sumPreviousPurchases = function(dt, newColName, daysFrom, daysUntil){
zip = substitute(zip)
dt[, substitute(newColName) :=
dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
, on=.(zip, date >= d0, date <= d1),
sum(purchaseAmount)
, by=.EACHI ]$V1
]
}
Then give it a try
sumPreviousPurchases(DT, "prevPurch1to10", 0, 10)
which returns
DT
date zip purchaseAmount prevPurch1to10
1: 2016-02-07 1150 5 5
2: 2016-02-14 3000 15 15
3: 2016-03-16 1150 16 16
4: 2016-03-21 2000 18 18
5: 2016-04-06 2000 19 19
6: 2016-04-14 2000 11 30
7: 2016-04-16 2000 6 36
8: 2016-05-02 1150 17 17
9: 2016-05-08 3000 7 7
10: 2016-05-09 3000 20 27
Notes:
The parentheses in your solution, (newColName), force the evaluation of the argument. This is implemented in base R and is a common technique across many programming languages, based on the mathematical concept of order of operations (first evaluate objects in parentheses, then exponentiate, etc.). The use of substitute makes the substitution explicit, perhaps for easier reading.
Often, an argument to a function that will define a future object, like prevPurch1to10, requires quotes, since the object does not exist prior to calling the function. Using such an argument without quotes will usually result in an error: "object X not found."
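Regarding the bonus question about dynamic column names: here is a hedged sketch (my addition, not part of the answer above) that builds the name from the arguments with paste0() and assigns it via the (name) := form already used in the question's edit:
sumPreviousPurchases2 = function(dt, daysFrom, daysUntil){
zip = substitute(zip)                             # kept as in the question's version
newColName = paste0(daysFrom, "_to_", daysUntil)  # e.g. "0_to_10"
dt[, (newColName) :=
dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
, on=.(zip, date >= d0, date <= d1),
sum(purchaseAmount)
, by=.EACHI ]$V1
]
}
sumPreviousPurchases2(DT, 0, 10)   # adds a column named "0_to_10"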

Can I cross tab dates, grouped by year?

I cleared one hurdle, with some help from SO and thought the next hurdle would be easier. What I really have is start and end dates in a data frame:
require(lubridate)
demo <- read.table(text = "
start end num
2010-12-31 <NA> 35
2013-04-01 <NA> 34
2015-06-02 <NA> 34
2015-06-15 2012-12-31 34
2015-01-30 2011-12-31 33
2014-04-15 2013-12-31 33
2014-05-28 2013-12-31 33
2014-06-02 <NA> 33
2015-06-17 <NA> 33
2015-06-25 <NA> 33
2015-06-24 <NA> 32
2013-07-31 <NA> 32
2013-08-31 <NA> 32
2015-04-27 <NA> 31
2015-05-07 <NA> 31
2013-12-30 <NA> 31
2014-11-21 <NA> 30
2013-12-20 2013-06-30 30
",header = TRUE, sep = "")
demo$start <- as.Date(demo$start, '%Y-%m-%d')
demo$end <- as.Date(demo$end, '%Y-%m-%d')
I can get a table of start years or a table of end years with table(year(demo$end)) or table(year(demo$start)), which is a lovely start. But what I really want to know is something more like: for each year, how many entries that started have not yet ended? So I want to count is.na(end) for each start year.
I thought I could use aggregate() for that:
aggregate(is.na(end) ~ year(start), demo, FUN = length)
But that seems to be counting every observation, not just the observations for which the end date is NA.
You can use table with multiple arguments to give you 2-way or multi-way tables:
> with(demo, table( year=format(demo$start, "%Y"), Not.missing = !is.na(end) ) )
Not.missing
year FALSE TRUE
2010 1 0
2013 4 1
2014 2 2
2015 6 2
You could also use lubridate::year instead of the format call.
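For instance, that variant might look like this (a small sketch of my own; it assumes lubridate is loaded, as in the question):
with(demo, table(year = year(start), Not.missing = !is.na(end)))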
If you need to find the number of NA values for each 'year', we can use sum, as is.na(end) is a logical vector. length gives the total length of the vector per year instead of the number of TRUE values.
aggregate(cbind(end=is.na(end)) ~ cbind(year=year(start)), demo, FUN = sum)
# year end
#1 2010 1
#2 2013 4
#3 2014 2
#4 2015 6
Or we can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(demo)); grouped by the year of the 'start' column and using is.na(end) as the row index in i, we get .N, the number of elements in each group.
library(data.table)
setDT(demo)[is.na(end), list(end = .N) , list(year=year(start))]
# year end
#1: 2010 1
#2: 2013 4
#3: 2015 6
#4: 2014 2
Here is another option:
library(dplyr)
library(lubridate)
demo %>% subset(is.na(end)) %>% group_by(year(start)) %>% summarise(n=length(end))
#Source: local data frame [4 x 2]
#
# year(start) n
#1 2010 1
#2 2013 4
#3 2014 2
#4 2015 6
This is pretty straightforward. With your original data (demo), subset to get only the rows where end is NA. Afterwards (using year() from the lubridate package), group by each year and summarise the number of NAs present in the end column. This will return a data.frame object.
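A hedged shorthand for the same pipeline (my addition, not part of the answer above): dplyr's count() collapses the group_by() and summarise() steps.
library(dplyr)
library(lubridate)
demo %>% filter(is.na(end)) %>% count(year = year(start))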

Calculating elapsed time for different interview dates in R

So my data looks like this
dat<-data.frame(
subjid=c("a","a","a","b","b","c","c","d","e"),
type=c("baseline","first","second","baseline","first","baseline","first","baseline","baseline"),
date=c("2013-02-07","2013-02-27","2013-04-30","2013-03-03","2013-05-23","2013-01-02","2013-07-23","2013-03-29","2013-06-03"))
i.e.:
subjid type date
1 a baseline 2013-02-07
2 a first 2013-02-27
3 a second 2013-04-30
4 b baseline 2013-03-03
5 b first 2013-05-23
6 c baseline 2013-01-02
7 c first 2013-07-23
8 d baseline 2013-03-29
9 e baseline 2013-06-03
and I'm trying to make a variable "elapsedtime" that denotes the time elapsed from the baseline date to the first and second round interview dates (so that elapsedtime=0 for baselines). Note that it varies individually whether they have taken further interviews.
I tried to reshape the data so that I could subtract the dates, but my brain isn't really functioning today -- or is there another way?
Please help and thank you.
Screaming out for ave:
I'll throw an NA value in there just for good measure:
dat<-data.frame(
subjid=c("a","a","a","b","b","c","c","d","e"),
type=c("baseline","first","second","baseline","first","baseline","first","baseline","baseline"),
date=c("2013-02-07","NA","2013-04-30","2013-03-03","2013-05-23","2013-01-02","2013-07-23","2013-03-29","2013-06-03"))
And you should probably sort the data to be on the safe side:
dat$type <- ordered(dat$type,levels=c("baseline","first","second","third") )
dat <- dat[order(dat$subjid,dat$type),]
Turn your date into a proper Date object:
dat$date <- as.Date(dat$date)
Then calculate the differences:
dat$elapsed <- ave(as.numeric(dat$date),dat$subjid,FUN=function(x) x-x[1] )
# subjid type date elapsed
#1 a baseline 2013-02-07 0
#2 a first <NA> NA
#3 a second 2013-04-30 82
#4 b baseline 2013-03-03 0
#5 b first 2013-05-23 81
#6 c baseline 2013-01-02 0
#7 c first 2013-07-23 202
#8 d baseline 2013-03-29 0
#9 e baseline 2013-06-03 0
This makes no assumption that baseline is always at position 1:
dat$date <- as.Date(dat$date)
dat$elapsed <- unlist(by(dat, dat$subjid, FUN=function(x) {
as.numeric(x$date - x[x$type=="baseline",]$date)
}))
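A hedged dplyr alternative (my addition, not from the answers above); it also avoids assuming the baseline row comes first, and it assumes each subjid has exactly one baseline row and that date has already been converted with as.Date:
library(dplyr)
dat %>%
group_by(subjid) %>%
mutate(elapsed = as.numeric(date - date[type == "baseline"])) %>%  # days since the subject's baseline
ungroup()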

R: subset column entries in "df A" to column entries in "df B" and eliminate if true match

I'm an R beginner and am having difficulty with the following pretty simple problem:
I have two data frames (All_df, Bad_df) and want to generate a third such that
All_df - Bad_df = Good_df
> All_df
Row# Originator Recipient Date Time
4 1 6 2000-05-16 16:15:00
7 2 7 2000-05-16 16:25:00
22 2 4 2000-07-04 18:05:00
25 2 9 2000-08-07 05:23:00
10 3 2 2000-06-17 18:07:00
13 4 8 2000-06-21 06:49:00
> Bad_df
Row# Originator Recipient Date Time
4 2 6 2000-05-16 16:15:00
7 2 7 2000-05-16 16:25:00
22 6 4 2000-07-04 18:05:00
25 12 9 2000-08-07 05:23:00
10 30 2 2000-06-17 18:07:00
13 32 8 2000-06-21 06:49:00
I want to generate Good_df similar to this:
> Good_df
Row# Originator Recipient Date Time
4 1 6 2000-05-16 16:15:00
10 3 2 2000-06-17 18:07:00
13 4 8 2000-06-21 06:49:00
Essentially I need a function which searches All_df$Originator for values that appear in Bad_df$Originator, eliminating any matches before returning the remaining rows as Good_df.
I have tried
Good_df <-subset(All_df, Originator %in% Bad_df$Originator)
however the row counts of each df look a little off!
> nrow(All_df)
[1] 26032
> nrow(Bad_df)
[1] 1452
> nrow(Good_df)
[1] 12395
Any assistance would be greatly appreciated.
Quite intuitively,
Good_df <-subset(All_df, Originator %in% Bad_df$Originator)
gives you the subset of All_df for bad originators. What you want is to negate your filter to get the subset of good (or non-bad) originators, using the ! operator:
Good_df <-subset(All_df, ! Originator %in% Bad_df$Originator)
If you are uncomfortable with the precedence rules, you can add a set of parentheses:
Good_df <-subset(All_df, !(Originator %in% Bad_df$Originator))
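A hedged alternative (my addition, assuming dplyr is available): anti_join() keeps the rows of All_df that have no match in Bad_df on the chosen key column(s).
library(dplyr)
Good_df <- anti_join(All_df, Bad_df, by = "Originator")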
