R: Aggregating a Large Data Frame under a Grouping Condition

I'm trying to figure out the fastest way to aggregate a large data frame (about 50M rows) that looks similar to:
>sample_frame = data.frame("id" = rep(sample(1:100,2,replace=F),3),
+ "date" = sample(seq(as.Date("2014-01-01"),as.Date("2014-02-13"),by=1),6),
+ "value" = runif(6))
> sample_frame
id date value
1 73 2014-02-11 0.84197491
2 7 2014-01-14 0.08057893
3 73 2014-01-16 0.78521616
4 7 2014-01-24 0.61889286
5 73 2014-02-06 0.54792356
6 7 2014-01-06 0.66484848
Here we have 2 unique IDs, each with 3 dates and a value assigned to each date. I know that I can use ddply, or data.table, or just lapply to aggregate and find the mean for each ID.
What I'm really looking for is a way to quickly find the mean for each ID for the most recent two dates. For example, with sapply:
> sapply(split(sample_frame,sample_frame$id),function(x){
+ mean(x$value[x$date%in%x$date[order(x$date,decreasing=T)][1:2]])
+ })
7 73
0.3497359 0.6949492
I can't figure out how to get data.table to do this. Thoughts? Hints?

Why not use tail in your "data.table" aggregation step?
set.seed(1)
sample_frame = data.frame("id" = rep(sample(1:100, 2, replace = FALSE), 3),
                          "date" = sample(seq(as.Date("2014-01-01"),
                                              as.Date("2014-02-13"), by = 1), 6),
                          "value" = runif(6))
library(data.table)
DT <- data.table(sample_frame, key = "id,date")
DT
# id date value
# 1: 27 2014-01-09 0.20597457
# 2: 27 2014-01-26 0.62911404
# 3: 27 2014-02-07 0.68702285
# 4: 37 2014-02-06 0.17655675
# 5: 37 2014-02-09 0.06178627
# 6: 37 2014-02-13 0.38410372
DT[, mean(tail(value, 2)), by = id]
# id V1
# 1: 27 0.6580684
# 2: 37 0.2229450
Since you require the mean of just two values, you can compute it directly (without using mean). And you can use the special symbol .N instead of tail to get a further speed-up; you just have to take care of the case where an id has only one date. Basically, this should be much faster:
DT[, (value[.N]+value[max(1L, .N-1)])/2, by=id]
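The original answer doesn't include timings, but as a rough way to check the claimed speed-up yourself, here is a minimal benchmark sketch. It assumes the microbenchmark package; the table size and id count are made up, and timings will vary by machine and data.table version.
library(data.table)
library(microbenchmark)
set.seed(1)
n <- 1e6
big <- data.table(id = sample(1e4, n, replace = TRUE),
                  date = as.Date("2014-01-01") + sample(0:365, n, replace = TRUE),
                  value = runif(n))
setkey(big, id, date)
microbenchmark(
  tail_version   = big[, mean(tail(value, 2)), by = id],
  direct_version = big[, (value[.N] + value[max(1L, .N - 1L)]) / 2, by = id],
  times = 10
)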

Related

How to group sales data by 4 days from yesterday back to the start date in R?

Date Sales
3/11/2017 1
3/12/2017 0
3/13/2017 40
3/14/2017 47
3/15/2017 83
3/16/2017 62
3/17/2017 13
3/18/2017 58
3/19/2017 27
3/20/2017 17
3/21/2017 71
3/22/2017 76
3/23/2017 8
3/24/2017 13
3/25/2017 97
3/26/2017 58
3/27/2017 80
3/28/2017 77
3/29/2017 31
3/30/2017 78
3/31/2017 0
4/1/2017 40
4/2/2017 58
4/3/2017 32
4/4/2017 31
4/5/2017 90
4/6/2017 35
4/7/2017 88
4/8/2017 16
4/9/2017 72
4/10/2017 39
4/11/2017 8
4/12/2017 88
4/13/2017 93
4/14/2017 57
4/15/2017 23
4/16/2017 15
4/17/2017 6
4/18/2017 91
4/19/2017 87
4/20/2017 44
Here the current date is 20/04/2017. My question: how can I group the data from 19/04/2017 back to 11/03/2017 into 4 equal parts, summing the sales, in R?
E.g.:
library("xts")
ep <- endpoints(data, on = 'days', k = 4)
period.apply(data,ep,sum)
It's not working: it takes the range from the start date to the current date, but I need to gather the data from yesterday (19/4/2017) back to the start date and split it into 4 equal parts.
Can anyone guide me?
Thank you
Base R has the function cut.Date() which is built for exactly this purpose.
However, the question is not fully clear on what the OP intends. My understanding of the requirements supplied in the Q and the additional comment is:
Take the sales data per day in Book1 but leave out the current day, i.e., use only completed days.
Group the data in four equal parts, i.e., four periods containing an equal number of days. (Note that the title of the Q and the attempt to use xts::endpoints() with k = 4 indicate that the OP might instead have intended to group the data into periods of four days each.)
Summarize the sales figures by period.
For the sake of brevity, data.table is used here for data manipulation and aggregation, and lubridate for date manipulation.
library(data.table)
library(lubridate)
# coerce to data.table, convert Date column from character to class Date,
# exclude the actual date
temp <- setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()]
# cut the date range in four parts
temp[, start_date_of_period := cut.Date(Date, 4)]
temp
# Date Sales start_date_of_period
# 1: 2017-03-11 1 2017-03-11
# 2: 2017-03-12 0 2017-03-11
# 3: 2017-03-13 40 2017-03-11
# ...
#38: 2017-04-17 6 2017-04-10
#39: 2017-04-18 91 2017-04-10
#40: 2017-04-19 87 2017-04-10
# Date Sales start_date_of_period
# aggregate sales by period
temp[, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
# start_date_of_period n_days total_sales
#1: 2017-03-11 10 348
#2: 2017-03-21 10 589
#3: 2017-03-31 10 462
#4: 2017-04-10 10 507
Thanks to chaining, this can be put together in one statement without using a temporary variable:
setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()][
, start_date_of_period := cut.Date(Date, 4)][
, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
Note: if you want to reproduce the result in the future, you will have to replace the call to today(), which excludes the current day, with mdy("4/20/2017"), the last day in the sample data set supplied by the OP.
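For reference, here is a sketch of that reproducible variant, starting from the original character data (Book1 is the OP's data frame with a character Date column and a numeric Sales column):
library(data.table)
library(lubridate)
setDT(Book1)[, Date := mdy(Date)][Date != mdy("4/20/2017")][
    , start_date_of_period := cut.Date(Date, 4)][
    , .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]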

Scoping when creating a new r data.table column in a function using :=

This is a continuation of a question I posted here: Creating a new r data.table column based on values in another column and grouping, to which @Frank provided an excellent answer.
As I have to do several of these calculations with different date intervals, I want to write a function that does them. However, I seem to be running into a scoping problem. I have read the vignettes, the FAQ, and a ton of questions here, and I am still baffled.
We'll use the same data:
library(data.table)
set.seed(88)
DT <- data.table(date = Sys.Date() - 365 + sort(sample(1:100, 10)),
                 zip = sample(c("2000", "1150", "3000"), 10, replace = TRUE),
                 purchaseAmount = sample(1:20, 10))
Here is the answer #Frank provided:
DT[, new_col :=
DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1),
sum(purchaseAmount)
, by=.EACHI ]$V1
]
DT
date zip purchaseAmount new_col
1: 2016-01-08 1150 5 5
2: 2016-01-15 3000 15 15
3: 2016-02-15 1150 16 16
4: 2016-02-20 2000 18 18
5: 2016-03-07 2000 19 19
6: 2016-03-15 2000 11 30
7: 2016-03-17 2000 6 36
8: 2016-04-02 1150 17 17
9: 2016-04-08 3000 7 7
10: 2016-04-09 3000 20 27
And now the actual problem I have encountered. I created the following function which enables dynamically changing the interval:
sumPreviousPurchases = function(dt, newColName, daysFrom, daysUntil){
  zip = substitute(zip)
  newColName = substitute(newColName)
  dt[, newColName :=
       dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
          , on=.(zip, date >= d0, date <= d1),
          sum(purchaseAmount)
          , by=.EACHI ]$V1
     ]
}
sumPreviousPurchases(DT, prevPurch1to10, 0, 10)
DT
date zip purchaseAmount newColName
1: 2016-02-07 1150 5 5
2: 2016-02-14 3000 15 15
3: 2016-03-16 1150 16 16
4: 2016-03-21 2000 18 18
5: 2016-04-06 2000 19 19
6: 2016-04-14 2000 11 30
7: 2016-04-16 2000 6 36
8: 2016-05-02 1150 17 17
9: 2016-05-08 3000 7 7
10: 2016-05-09 3000 20 27
What troubles me is the scoping. The function names the new column newColName regardless of what I insert in the function call. From my reading I gathered that when passing data.table column names as function arguments, one should use substitute(). However, that does not work here; the result is the same even if I leave the whole newColName = substitute(newColName) line out. I suppose it is because the column does not exist yet, but I do not know how to address this issue.
As a bonus I would like to ask: is there also a way to name the columns dynamically, i.e., in the example to have the name follow the pattern "daysFrom_to_daysUntil", so it would be "0_to_10"?
----- EDIT ----
I also stumbled upon a possible answer myself, somewhat similar to @lmo's answer, using an idea from here: http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/#assign-a-column-with--named-with-a-character-object
The most important differences from the question: I removed the newColName = substitute(newColName) line entirely, and added parentheses around newColName in dt[, (newColName) :=
sumPreviousPurchases = function(dt, newColName, daysFrom, daysUntil){
  zip = substitute(zip)
  #newColName = substitute(newColName)
  dt[, (newColName) :=
       dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
          , on=.(zip, date >= d0, date <= d1),
          sum(purchaseAmount)
          , by=.EACHI ]$V1
     ]
}
Additionally, I added quotes around "prevPurch1to10".
sumPreviousPurchases(DT, "prevPurch1to10", 0, 10)
and got the answer
date zip purchaseAmount prevPurch1to10
1: 2016-02-17 1150 7 7
2: 2016-02-22 3000 8 8
3: 2016-03-04 1150 2 2
4: 2016-03-16 2000 14 14
5: 2016-04-03 2000 11 11
6: 2016-04-11 3000 12 12
7: 2016-04-21 1150 17 17
8: 2016-04-22 3000 3 3
9: 2016-05-03 2000 9 9
10: 2016-05-11 3000 4 4
However, two things still seem weird:
a) substitute() is not needed once the parentheses are added around (newColName). Why is that?
b) quotes are required around "prevPurch1to10". Again, why? Is there a more data.table-ish way to do this, without the quotes?
You can use substitute directly in the assignment:
sumPreviousPurchases = function(dt, newColName, daysFrom, daysUntil){
  zip = substitute(zip)
  dt[, substitute(newColName) :=
       dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
          , on=.(zip, date >= d0, date <= d1),
          sum(purchaseAmount)
          , by=.EACHI ]$V1
     ]
}
Then give it a try
sumPreviousPurchases(DT, "prevPurch1to10", 0, 10)
which returns
DT
date zip purchaseAmount prevPurch1to10
1: 2016-02-07 1150 5 5
2: 2016-02-14 3000 15 15
3: 2016-03-16 1150 16 16
4: 2016-03-21 2000 18 18
5: 2016-04-06 2000 19 19
6: 2016-04-14 2000 11 30
7: 2016-04-16 2000 6 36
8: 2016-05-02 1150 17 17
9: 2016-05-08 3000 7 7
10: 2016-05-09 3000 20 27
Notes:
The parentheses in your solution, (), force the evaluation of the argument. This is implemented in base R and is a common technique across many programming languages, based on the mathematical concept of order of operations (first evaluate expressions in parentheses, then exponentiate, etc.). The use of substitute makes the substitution explicit, perhaps for easier reading.
Often, an argument to a function that will define a future object, like prevPurch1to10, requires quotes, since the object does not exist prior to calling the function. Using such an argument without quotes will usually result in an error: "object X not found."
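As for the OP's bonus question about dynamic column names, which the answer above does not address: a minimal sketch could build the name from the interval arguments with paste0() and assign with the parenthesized form. The function name sumPreviousPurchases2 is made up for illustration; it mirrors the OP's edited function.
sumPreviousPurchases2 = function(dt, daysFrom, daysUntil){
  # build e.g. "0_to_10" from the interval arguments
  newColName <- paste0(daysFrom, "_to_", daysUntil)
  dt[, (newColName) :=
       dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
          , on=.(zip, date >= d0, date <= d1),
          sum(purchaseAmount)
          , by=.EACHI ]$V1
     ]
}
sumPreviousPurchases2(DT, 0, 10)  # adds a column named "0_to_10"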

Can I cross-tab dates, grouped by year?

I cleared one hurdle, with some help from SO and thought the next hurdle would be easier. What I really have is start and end dates in a data frame:
require(lubridate)
demo <- read.table(text = "
start end num
2010-12-31 <NA> 35
2013-04-01 <NA> 34
2015-06-02 <NA> 34
2015-06-15 2012-12-31 34
2015-01-30 2011-12-31 33
2014-04-15 2013-12-31 33
2014-05-28 2013-12-31 33
2014-06-02 <NA> 33
2015-06-17 <NA> 33
2015-06-25 <NA> 33
2015-06-24 <NA> 32
2013-07-31 <NA> 32
2013-08-31 <NA> 32
2015-04-27 <NA> 31
2015-05-07 <NA> 31
2013-12-30 <NA> 31
2014-11-21 <NA> 30
2013-12-20 2013-06-30 30
",header = TRUE, sep = "")
demo$start <- as.Date(demo$start, '%Y-%m-%d')
demo$end <- as.Date(demo$end, '%Y-%m-%d')
I can get a table of start years, or a table of end years, with table(year(demo$end)) or table(year(demo$start)) which is a lovely start. But what I really want to know is something more like: for each year, how many entries that started have not yet ended? So count is.na() for each start year.
I thought I could use aggregate() for that:
aggregate(is.na(end) ~ year(start), demo, FUN = length)
But that seems to count every observation, not just the observations for which the end date is NA.
You can use table with multiple arguments to give you 2-way or multi-way tables:
> with(demo, table(year = format(start, "%Y"), Not.missing = !is.na(end)))
Not.missing
year FALSE TRUE
2010 1 0
2013 4 1
2014 2 2
2015 6 2
You could also use lubridate::year instead of the format call.
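For instance, a quick sketch of that substitution:
library(lubridate)
with(demo, table(year = year(start), Not.missing = !is.na(end)))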
If you need to find the number of NA values for each year, we can use sum, since is.na(end) is a logical vector. length gives the total length of the vector per year instead of the number of TRUE values.
aggregate(cbind(end=is.na(end)) ~ cbind(year=year(start)), demo, FUN = sum)
# year end
#1 2010 1
#2 2013 4
#3 2014 2
#4 2015 6
Or we can use data.table. We convert the data.frame to a data.table (setDT(demo)); grouping by the year of the start column and using is.na(end) as the row index in i, we get .N, the number of elements in each group.
library(data.table)
setDT(demo)[is.na(end), list(end = .N) , list(year=year(start))]
# year end
#1: 2010 1
#2: 2013 4
#3: 2015 6
#4: 2014 2
Here is another option:
library(dplyr)
library(lubridate)
demo %>% subset(is.na(end)) %>% group_by(year(start)) %>% summarise(n=length(end))
#Source: local data frame [4 x 2]
#
# year(start) n
#1 2010 1
#2 2013 4
#3 2014 2
#4 2015 6
This is pretty straightforward. With your original data (demo), subset to keep only the rows where end is NA. Afterwards (using year() from the lubridate package), group by year and count the number of NAs in the end column. This will return a data frame.
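Assuming a reasonably recent dplyr, a slightly more compact sketch of the same idea uses filter() and count():
library(dplyr)
library(lubridate)
demo %>% filter(is.na(end)) %>% count(year = year(start))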

Combination of merge and aggregate in R

I have created the following 2 dummy datasets as follows:
id<-c(8,8,50,87,141,161,192,216,257,282)
date<-c("2011-03-03","2011-12-12","2010-08-18","2009-04-28","2010-11-29","2012-04-02","2013-01-08","2007-01-22","2009-06-03","2009-12-02")
data<-data.frame(cbind(id,date))
id<-c(3,8,11,11,11,11,11,11,19,19,19,19,19,50,50,50,50,50,87,87,87,87,87,87,282,282,282,282,282,282,282,282,282,282,288,288,288,288,288,288,288,288,288,288,288,288,288)
date<-c("2010-11-04","2011-02-25","2009-07-26","2009-07-27","2009-08-09","2009-08-10","2009-08-30","2004-01-20","2006-02-13","2006-07-18","2007-04-20","2008-05-12","2008-05-29","2009-06-10","2010-08-17","2010-08-15","2011-05-13","2011-06-08","2007-08-09","2008-01-19","2008-02-19","2009-04-28","2009-05-16","2009-05-20","2005-05-14","2007-04-15","2007-07-25","2007-10-12","2007-10-23","2007-10-27","2007-11-20","2009-11-28","2012-08-16","2012-08-16","2008-11-17","2009-10-23","2009-10-27","2009-10-27","2009-10-27","2009-10-27","2009-10-28","2010-06-15","2010-06-17","2010-06-23","2010-07-27","2010-07-27","2010-07-28")
ns<-data.frame(cbind(id,date))
Note that only some of the ids in data are included in ns, and vice versa.
For each value in data$id I am trying to find whether there is an ns$date less than 14 days before the corresponding data$date (where data$id == ns$id), and to report the difference in days.
The output I need is a vector/column ("received") with the same number of rows as data, holding TRUE/FALSE where ns$date[ns$id == data$id] is less than 14 days before the respective data$date, and a similar vector with the actual number of days where "received" is TRUE. I hope this makes sense now.
This is how far I got:
# convert dates (ymd() comes from the lubridate package)
library(lubridate)
data$date <- ymd(data$date)
ns$date <- ymd(ns$date)
# left join datasets
tmp <- merge(data, ns, by="id", all.x=TRUE)
# NOTE: this automatically renames data$date as date.x and ns$date as date.y
# create variable to say when there is a date difference less than 14 days
tmp$received <- with(tmp, difftime(date.x, date.y, units="days")<14&difftime(date.x, date.y, units="days")>0)
#create a variable that reports the days difference
tmp$dif<-ifelse(tmp$received==TRUE,difftime(tmp$date.x,tmp$date.y, units="days"),NA)
This link Find if date is within 14 days if id matches between datasets in R provides an idea but the result does not include the number of days difference in tmp$dif.
In the result table I need only the lowest difference for each data$id for those cases where tmp$received was TRUE.
Hope this makes more sense now? If not please let me know what needs further clarification.
M
PS: as requested, I added what the desired output should look like (same number of rows as data = 10; no rows for ids that are in ns but not in data). I should have thought of this earlier.
id date received dif
1 8 2011-03-03 TRUE 6
2 8 2011-12-12 FALSE NA
3 50 2010-08-18 TRUE 1
4 87 2009-04-28 TRUE 0
5 141 2010-11-29 NA NA
6 161 2012-04-02 NA NA
7 192 2013-01-08 NA NA
8 216 2007-01-22 NA NA
9 257 2009-06-03 NA NA
10 282 2009-12-02 TRUE 4
Here's a data.table approach
Converting to data.table objects
library(data.table)
setkey(setDT(data), id)
setkey(setDT(ns), id)
Merging
ns <- ns[data]
Converting to Date class
ns[, c("date", "date.1") := lapply(.SD, as.Date), .SDcols = c("date", "date.1")]
Computing days differences and TRUE/FALSE
ns[, `:=`(timediff = date.1 - date,
          Logical = (date.1 - date) < 14)]
Taking only the rows we are interested in
res <- ns[is.na(timediff) | timediff >= 0, list(received = any(Logical), dif = timediff[Logical]), by = list(id, date.1)]
Sorting by id and date
res[, id := as.numeric(as.character(id))]
setkey(res, id, date.1)
Subsetting by minimum distance
res[, list(diff = min(dif)), by = list(id, date.1, received)]
# id date.1 received diff
# 1: 8 2011-03-03 TRUE 6 days
# 2: 8 2011-12-12 FALSE NA days
# 3: 50 2010-08-18 TRUE 1 days
# 4: 87 2009-04-28 TRUE 0 days
# 5: 141 2010-11-29 NA NA days
# 6: 161 2012-04-02 NA NA days
# 7: 192 2013-01-08 NA NA days
# 8: 216 2007-01-22 NA NA days
# 9: 257 2009-06-03 NA NA days
# 10: 282 2009-12-02 TRUE 4 days
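For completeness (not part of the original answer): newer data.table versions (1.9.8 or later) support non-equi joins, which can express the 14-day window directly. Here is a hedged sketch under that assumption, using typed copies of the inputs since the cbind() construction in the question coerces every column to character:
library(data.table)
library(lubridate)
dataDT <- data.table(id = as.numeric(as.character(data$id)), date = ymd(data$date))
nsDT   <- data.table(id = as.numeric(as.character(ns$id)),   nsdate = ymd(ns$date))
# window: 0 <= data$date - ns$date < 14, i.e. nsdate in [date - 13, date]
dataDT[, `:=`(lo = date - 13, hi = date)]
# one result row per data row (by = .EACHI); x.nsdate is the matched ns date
res <- nsDT[dataDT, on = .(id, nsdate >= lo, nsdate <= hi),
            .(dif = as.numeric(min(i.date - x.nsdate))), by = .EACHI]
out <- data.table(dataDT[, .(id, date)],
                  received = !is.na(res$dif), dif = res$dif)
out[!id %in% nsDT$id, received := NA]  # ids absent from ns get NA, not FALSE
out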

Fastest way for filling-in missing dates for data.table

I am loading a data.table from CSV file that has date, orders, amount etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above, 03-Jan and 06-Jan do not have any entries.
I would like to fill the missing entries with default values (say, zero for orders, amount, etc.), or carry the last value forward (e.g., 03-Jan would reuse 02-Jan's values and 06-Jan would reuse 05-Jan's values).
What is the best/optimal way to fill in such gaps of missing dates with default values?
The answer here suggests using allow.cartesian = TRUE and expand.grid for missing weekdays - that may work for weekdays (since there are just 7 of them) - but I am not sure it would be the right way to handle dates as well, especially if we are dealing with multi-year data.
The idiomatic data.table way (using rolling joins) is this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
Here is how you fill in the gaps within each subgroup:
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]
# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]
# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA's to 0)
require(xts)
na.locf(dt)
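If zeros are preferred over carrying the last observation forward, a hedged alternative sketch (assumes data.table 1.12.4 or later, which added setnafill()):
library(data.table)
dt <- merge(as.data.table(NADayWiseOrders), alldates, by = "date", all = TRUE)
setnafill(dt, type = "const", fill = 0, cols = c("orders", "amount", "guests"))
dt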
