I need a fast aggregation of dates by client id: the min date, the max date, the number of months spanned between them, and the count of records.
Example table:
tbl <- data.frame(id_cliente=c(1,1,1,1,2,3,3,3),
                  fecha=c('2013-01-01','2013-06-01','2013-05-01','2013-04-01',
                          '2013-01-01','2013-01-01','2013-05-01','2013-04-01'))
Format dates:
tbl$fecha<-as.Date(as.character(tbl$fecha))
My first approach was ddply:
library(plyr)
tbl2 <- ddply(tbl, .(id_cliente), summarize,
              cant=length(id_cliente),
              max=max(fecha), min=min(fecha),
              dif=length(seq(from=min, to=max, by='month')))
I got the desired result, but with my real table it takes too much time.
So I tried tapply:
tbl3<-data.frame(cbind(dif=tapply(tbl$fecha, list(tbl$id_cliente), secuencia),
hay=tapply(tbl$fecha, list(tbl$id_cliente), length),
min=tapply(tbl$fecha, list(tbl$id_cliente), min),
max=tapply(tbl$fecha, list(tbl$id_cliente), max)
))
The result was:
> tbl3
  dif hay   min   max
1   6   4 15706 15857
2   1   1 15706 15706
3   5   3 15706 15826
In this case I got numbers instead of dates. Since the following conversion works on its own, I tried using as.Date inside tapply:
as.Date(15706, origin='1970-01-01')
MIN<-function(x){as.Date(min(x), origin='1970-01-01')}
The function works on its own, but inside tapply it doesn't:
tbl3<-data.frame(cbind(min=tapply(tbl$fecha, list(tbl$id_cliente), MIN)))
And I still got the number instead of the date.
How can I solve this? Thanks.
I know this is a bit late, but I figured I would put this here for the people still googling this issue.
Interestingly, tapply returns the correct results when you keep the date column in text format and convert to Date afterwards: min and max work on 'YYYY-MM-DD' character strings because that format sorts lexicographically in date order.
tbl <- data.frame(id_cliente=c(1,1,1,1,2,3,3,3),
                  fecha=c('2013-01-01','2013-06-01','2013-05-01','2013-04-01',
                          '2013-01-01','2013-01-01','2013-05-01','2013-04-01'),
                  stringsAsFactors=FALSE) # keep fecha as character so min/max compare strings
tbl3<-data.frame(cbind(dif=tapply(tbl$fecha, list(tbl$id_cliente), seq),
hay=tapply(tbl$fecha, list(tbl$id_cliente), length),
min=tapply(tbl$fecha, list(tbl$id_cliente), min),
max=tapply(tbl$fecha, list(tbl$id_cliente), max)))
head(tbl3)
#          dif hay        min        max
# 1 1, 2, 3, 4   4 2013-01-01 2013-06-01
# 2          1   1 2013-01-01 2013-01-01
# 3    1, 2, 3   3 2013-01-01 2013-05-01
With base R, a Date is stored as the number of days since 1970-01-01, and tapply/cbind simplify their results into a plain numeric matrix, which strips the Date class. Try dplyr or data.table, shown below, to preserve the date class.
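You can check the underlying storage directly; a quick base R sketch:
d <- as.Date('2013-01-01')
unclass(d) # 15706: days since 1970-01-01
as.Date(15706, origin='1970-01-01') # back to "2013-01-01"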
dplyr
library(dplyr)
tbl %>% group_by(id_cliente) %>%
summarise(dif=length(seq(min(fecha), max(fecha), by='month')),
hay=length(fecha),
min=min(fecha),
max=max(fecha))
# Source: local data frame [3 x 5]
#
# id_cliente dif hay min max
# 1 1 6 4 2013-01-01 2013-06-01
# 2 2 1 1 2013-01-01 2013-01-01
# 3 3 5 3 2013-01-01 2013-05-01
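As an aside, n() is the idiomatic per-group row count in current dplyr, so hay could be written as (a minimal sketch):
tbl %>% group_by(id_cliente) %>% summarise(hay = n())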
data.table
library(data.table)
setDT(tbl)[,.(dif=length(seq(min(fecha), max(fecha), by='month')),
hay= .N,
min=min(fecha),
max=max(fecha)), by=id_cliente]
# id_cliente dif hay min max
# 1: 1 6 4 2013-01-01 2013-06-01
# 2: 2 1 1 2013-01-01 2013-01-01
# 3: 3 5 3 2013-01-01 2013-05-01
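As a sanity check of the dif logic, which counts month starts from min to max inclusive:
length(seq(as.Date('2013-01-01'), as.Date('2013-06-01'), by='month')) # 6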
Related
I have a vector of dates like this:
1 2014-03-10 22:54:24
2 2014-03-10 22:53:16
3 2014-03-10 22:53:01
4 2014-03-10 22:52:38
5 2014-03-10 22:52:00
6 2014-03-01 01:13:08
7 2014-03-01 01:11:30
8 2014-03-01 01:07:41
9 2014-03-01 01:05:28
10 2014-03-01 00:58:40
11 2014-03-27 18:11:57
How can I group by month, day, morning, afternoon or week? For instance:
month sum
2014-3 11
==============
week sum
2014-3-1 5
2014-3-9 5
==============
2014-3-1
morning sum
2014-3-1 5
Use the data.table package and get to know the POSIXlt class.
# x is assumed to be your vector of time objects (POSIXct/POSIXlt).
# The following lines are just for getting to know POSIXlt. You do not need to run them.
Secs <- as.POSIXlt(x)[[1]]
Mins <- as.POSIXlt(x)[[2]]
# ...
Month <- as.POSIXlt(x)[[5]] + 1 # months start at 0, so add 1
Year <- as.POSIXlt(x)[[6]] + 1900 # stored as years since 1900 (2016 is stored as 116)
DayOfYear <- as.POSIXlt(x)[[8]] + 1 # day of year (field 8, not 9) also starts at 0
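Named access to the POSIXlt fields is less error-prone than positional indexing; a sketch, with x as assumed above:
lt <- as.POSIXlt(x)
lt$hour # 0-23
lt$mon + 1 # month, 1-12
lt$year + 1900 # calendar year
lt$yday + 1 # day of year, 1-based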
You can derive more complicated values similarly. Now use data.table:
require(data.table)
X <- as.data.table(x) # creates a data.table object
setnames(X, "Time") # name the single column 'Time'
X[ , month := as.POSIXlt(Time)[[5]] + 1] #adds a column month
X[ , doy:= as.POSIXlt(Time)[[8]] + 1] #adds a column day of year
#....
Now you can group your data.table with:
X[ , .N, by = doy]
X[ , .N, by = month]
# ...
.N returns the number of items in each group. You could also combine the grouping:
X[ , .N, by = list(doy, month)]
There are many good data.table tutorials, and its grouping and evaluation syntax is similar to SQL.
A good link to start is the FAQ of the developer:
http://datatable.r-forge.r-project.org/datatable-faq.pdf
You could also build more complicated columns, such as a morning/afternoon flag:
X[ , afternoon := as.POSIXlt(Time)[[3]] >= 12] # field 3 is the hour; 12:00 and later counts as afternoon
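You can then count per day of year and half-day with the same grouping pattern:
X[ , .N, by = list(doy, afternoon)]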
Assuming you have a data frame like this where time is in POSIXct format:
df
time
1 2014-03-10 22:54:24
2 2014-03-10 22:53:16
3 2014-03-10 22:53:01
4 2014-03-10 22:52:38
5 2014-03-10 22:52:00
6 2014-03-01 01:13:08
7 2014-03-01 01:11:30
8 2014-03-01 01:07:41
9 2014-03-01 01:05:28
10 2014-03-01 00:58:40
11 2014-03-27 18:11:57
You can get month, week and am/pm as follows:
df$month <- format(df$time, '%Y-%m')
df$week <- format(df$time, '%Y-%U')
df$ampm <- ifelse(as.numeric(format(df$time, '%H')) >= 12, 'pm', 'am') # hours 12-23 are pm
df
time month week ampm
1 2014-03-10 22:54:24 2014-03 2014-10 pm
2 2014-03-10 22:53:16 2014-03 2014-10 pm
3 2014-03-10 22:53:01 2014-03 2014-10 pm
4 2014-03-10 22:52:38 2014-03 2014-10 pm
5 2014-03-10 22:52:00 2014-03 2014-10 pm
6 2014-03-01 01:13:08 2014-03 2014-08 am
7 2014-03-01 01:11:30 2014-03 2014-08 am
8 2014-03-01 01:07:41 2014-03 2014-08 am
9 2014-03-01 01:05:28 2014-03 2014-08 am
10 2014-03-01 00:58:40 2014-03 2014-08 am
11 2014-03-27 18:11:57 2014-03 2014-12 pm
Then you can get your summaries using the dplyr package:
library(dplyr)
count(df, month)
Source: local data frame [1 x 2]
month n
(chr) (int)
1 2014-03 11
count(df, week)
Source: local data frame [3 x 2]
week n
(chr) (int)
1 2014-08 5
2 2014-10 5
3 2014-12 1
count(df, ampm)
Source: local data frame [2 x 2]
ampm n
(chr) (int)
1 am 5
2 pm 6
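Note that %U gives week-of-year numbers; if you want weeks labeled by their start date as in the question, cut.Date can do that (a sketch; cut starts weeks on Monday by default):
df$weekstart <- as.Date(cut(as.Date(df$time), "week"))
count(df, weekstart)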
I have a big data.table that looks like:
dt<-data.table(start=c("2012-07-13 23:45:00", "2012-07-14 15:30:00",
"2012-07-14 23:57:00"),
end=c("2012-07-14 00:02:00", "2012-07-14 15:35:00",
"2012-07-15 00:05:00"), id=c(1,2,1),cat=c("a","b","a"))
dt
start end id cat
1: 2012-07-13 23:45:00 2012-07-14 00:02:00 1 a
2: 2012-07-14 15:30:00 2012-07-14 15:35:00 2 b
3: 2012-07-14 23:57:00 2012-07-15 00:05:00 1 a
I need to get an output that shows total minutes of event on each calendar day by id and category. Using the example above the output should be:
day id cat V1
1: 13.07.2012 1 a 15
2: 14.07.2012 1 a 5
3: 14.07.2012 2 b 5
4: 15.07.2012 1 a 5
I used the adply function from the plyr package to split each duration into one-minute intervals:
fn <- function(x){
  s <- seq(from = as.POSIXct(x$start),
           to = as.POSIXct(x$end) - 1, by = "mins")
  # here s is a sequence of all minutes in the given interval
  df <- data.table(x$id, x$cat, s)
  # return a new data.table that contains each calendar minute for each id
  # and category of the original data
  df
}
# run the function above for each row in the data.table
dd<-adply(dt,1,fn)
# extract the date from calendar minutes
dd[, day := format(s, "%d.%m.%Y")] # s is already POSIXct, so format it directly
#calculate sum of all minutes of event for each day, id and category
dd[,.N,by=c("day","id","cat")][order(day,id,cat)]
The solution above suits my needs perfectly, except for how long it takes. When adply runs over very big data with several categories handled in fn, the computation seems to run forever.
I would highly appreciate any hint on how to solve this with pure data.table functionality.
I would suggest a few things:
Convert with as.POSIXct only once, instead of once per row.
Instead of adply, which creates a whole data.table in each iteration, just use by within the data.table scope.
To do that, simply create a row index using .I.
Here's a quick attempt (I've used substr because it will probably be faster than as.Date or as.POSIXct; if you want Date class again, run res[, Date := as.IDate(Date)] on the result instead of doing it by group).
dt[, `:=`(start = as.POSIXct(start), end = as.POSIXct(end), indx = .I)]
dt[, seq(start, end - 1L, by = "mins"), by = .(indx, id, cat)
][, .N, by = .(Date = substr(V1, 1L, 10L), id, cat)]
# Date id cat N
# 1: 2012-07-13 1 a 15
# 2: 2012-07-14 1 a 5
# 3: 2012-07-14 2 b 5
# 4: 2012-07-15 1 a 5
See if this is faster.
It is still data.table in the background, but using dplyr syntax for the processing.
library(data.table)
dt<-data.table(start=c("2012-07-13 23:45:00", "2012-07-14 15:30:00",
"2012-07-14 23:57:00"),
end=c("2012-07-14 00:02:00", "2012-07-14 15:35:00",
"2012-07-15 00:05:00"), id=c(1,2,1),cat=c("a","b","a"))
fn <- function(x){
  s <- seq(from = as.POSIXct(x$start),
           to = as.POSIXct(x$end) - 1, by = "mins")
  # here s is a sequence of all minutes in the given interval
  df <- data.table(x$id, x$cat, s)
  # return a new data.table that contains each calendar minute for each id
  # and category of the original data
  df
}
library(dplyr)
dt %>%
rowwise() %>% # for each row
do(fn(.)) %>% # apply your function
select(day=s, id=V1, cat=V2) %>% # rename columns
mutate(day = substr(day,1,10)) %>% # keep only the day
ungroup %>%
group_by(day,id,cat) %>%
summarise(N=n()) %>%
ungroup
# Source: local data frame [4 x 4]
#
# day id cat N
# (chr) (dbl) (chr) (int)
# 1 2012-07-13 1 a 15
# 2 2012-07-14 1 a 5
# 3 2012-07-14 2 b 5
# 4 2012-07-15 1 a 5
I am loading a data.table from a CSV file that has date, orders, amount, etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above, 03-Jan and 06-Jan do not have any entries.
I would like to fill the missing entries with default values (say, zero for orders, amount, etc.), or carry the last value forward (e.g., 03-Jan reuses the 02-Jan values, 06-Jan reuses the 05-Jan values, and so on).
What is the best way to fill in such gaps of missing dates with default values?
An answer elsewhere suggests using allow.cartesian = TRUE and expand.grid for missing weekdays. That may work for weekdays, since there are only 7 of them, but I'm not sure it is the right approach for dates as well, especially when dealing with multi-year data.
The idiomatic data.table way (using rolling joins) is this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
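If you would rather zero-fill than carry forward, join without roll so the missing days come back as NA rows, then overwrite the NAs (a sketch; set() coerces the 0 to each column's type):
filled <- NADayWiseOrders[J(all_dates)] # no roll: gap days appear with NA values
for (col in c("orders", "amount", "guests"))
  set(filled, which(is.na(filled[[col]])), col, 0)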
Here is how you fill in the gaps within subgroups:
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]
# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]
# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
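If you want the gaps left as NA instead of carried forward, simply drop roll:
dt[indx] # same join; gap rows keep x = NA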
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA's to 0)
require(xts)
na.locf(dt)
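Note that na.locf (from zoo, which xts attaches) returns the filled object rather than changing dt in place, so assign the result; any leading NAs before the first observation are left as-is:
dt <- na.locf(dt)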
I have the following data set:
>d
x date
1 1 1-3-2013
2 2 2-4-2010
3 3 2-5-2011
4 4 1-6-2012
I want:
> d
x date
1 1 31-12-2013
2 2 31-12-2010
3 3 31-12-2011
4 4 31-12-2012
i.e. the last day of the last month of each date's year.
Please help!
You can also just use the ceiling_date function from the lubridate package.
You can do something like:
library(lubridate)
last_date <- ceiling_date(date,"year") - days(1)
ceiling_date(date, "year") gives you the first day of the next year; to get the last day of the current year, subtract one day, i.e. days(1).
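For example, with dates from the question (a sketch; note ceiling_date leaves a date already on the boundary unchanged unless change_on_boundary = TRUE):
library(lubridate)
dts <- dmy(c('1-3-2013', '2-4-2010'))
ceiling_date(dts, "year") - days(1) # "2013-12-31" "2010-12-31"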
Hope this helps.
Another option using the lubridate package:
## using d from Roland's answer
transform(d,last =dmy(paste0('3112',year(dmy(date)))))
x date last
1 1 1-3-2013 2013-12-31
2 2 2-4-2010 2010-12-31
3 3 2-5-2011 2011-12-31
4 4 1-6-2012 2012-12-31
d <- read.table(text="x date
1 1 1-3-2013
2 2 2-4-2010
3 3 2-5-2011
4 4 1-6-2012", header=TRUE)
d$date <- as.Date(d$date, "%d-%m-%Y")
d$date <- as.POSIXlt(d$date)
d$date$mon <- 11
d$date$mday <- 31
d$date <- as.Date(d$date)
d
#  x       date
#1 1 2013-12-31
#2 2 2010-12-31
#3 3 2011-12-31
#4 4 2012-12-31
1) cut.Date
Define cut_year to give the first day of the year. Adding 366 gets us into the next year, and applying cut_year again gives the first day of that next year; finally, subtracting 1 yields the last day of the original year. This uses base functionality only.
cut_year <- function(x) as.Date(cut(as.Date(x), "year"))
transform(d, date = cut_year(cut_year(date) + 366) - 1)
2) format
transform(d, date = as.Date(format(as.Date(date), "%Y-12-31")))
3) zoo
A "yearmon" class variable stores a date as the year plus 0 for Jan, 1/12 for Feb, ..., 11/12 for Dec. Taking its floor and adding 11/12 lands on December, and as.Date.yearmon(..., frac = 1) gives the last day of that month instead of the first.
library(zoo)
transform(d, date = as.Date(floor(as.yearmon(as.Date(date))) + 11 / 12, frac = 1))
Note: The inner as.Date in cut_year and in the other two solutions can be omitted if it is known that date is already of "Date" class.
I have a dataframe called daily which looks like this:
daily[1:10,]
Climate_Division Date Precipitation
1 1 1948-07-01 0.2100000
2 1 1948-07-02 0.7000000
3 1 1948-07-03 0.1900000
4 1 1948-07-04 0.1033333
5 1 1948-07-05 0.1982895
6 1 1948-07-06 0.1433333
7 1 1948-07-07 NA
8 1 1948-07-08 NA
9 1 1948-07-09 NA
10 1 1948-07-10 NA
The objective is to average each calendar day's values across the years (1948-1995) and use that average to replace any NA occurring on that day. For example, since row 7 has an NA for July 7, 1948, I would average all July 7 values from 1948-1995 and replace that particular day's NA with the result.
What I have tried so far is this:
index <- which(is.na(daily$Precipitation)) # find where the NA's occur
daily_avg <- daily # copy dataframe
daily_avg$Date <- strftime(daily_avg$Date, format="2000-%m-%d") # Change the Date format to represent only the day and month and disregard year
daily_avg <- aggregate(Precipitation~Date, FUN = mean, data = daily_avg, na.rm = TRUE) # find the mean precip per day
daily[index,3] <- daily_avg[daily_avg$Date %in% strftime(daily[index,2], format="2000-%m-%d"), 2]
The last line of the code is not working properly; I'm not sure why yet. That is my thought process so far, but I was wondering if there is a better way to do it using a built-in function I am not aware of. Any help is greatly appreciated. Thank you.
I think the data in your example don't illustrate the problem: you should give data for a certain day over many years, with some NA values. For example, here I restate the problem with 2 days over 3 years.
Climate_Division Date Precipitation
1 1 1948-07-01 0.2100000
2 1 1948-07-02 NA
3 1 1949-07-01 0.1900000
4 1 1949-07-02 0.1033333
5 1 1950-07-01 NA
6 1 1950-07-02 0.1433333
The idea, if I understand correctly, is to replace each NA with the mean of that day's values over all years. You can use ave and transform to create a new column containing the mean, then replace the NA values with it.
daily$daymonth <- strftime(daily$Date, format="%m-%d")
daily <- transform(daily, mp =ave(Precipitation,daymonth,
FUN=function(x) mean(x,na.rm=TRUE) ))
transform(daily, Precipitation =ifelse(is.na(Precipitation),mp,Precipitation))
Climate_Division Date Precipitation daymonth mp
1 1 1948-07-01 0.2100000 07-01 0.2000000
2 1 1948-07-02 0.1233333 07-02 0.1233333
3 1 1949-07-01 0.1900000 07-01 0.2000000
4 1 1949-07-02 0.1033333 07-02 0.1233333
5 1 1950-07-01 0.2000000 07-01 0.2000000
6 1 1950-07-02 0.1433333 07-02 0.1233333
Using data.table
Some dummy data:
set.seed(1)
library(data.table)
daily <- seq(as.Date('1948-01-01'), as.Date('1995-12-31'), by = "day")
dd <- data.table(date = daily, precip = runif(length(daily)))
# add na values
nas <- sample(length(daily),300, FALSE)
dd[, precip := {is.na(precip) <- nas; precip}]
## calculate the daily averages
# add day and month
dd[, c('month','day') := list(month(date), mday(date))]
monthdate <- dd[, list(mprecip = mean(precip, na.rm = TRUE)),
                keyby = list(month, day)]
# set key for joining
setkey(dd, month, day)
# replace NA with day-month averages
dd[monthdate, precip := ifelse(is.na(precip), mprecip, precip)]
# set key to reorder to daily
setkey(dd, date)
A slightly neater version of mnel's answer, which I would prefer to the accepted one:
set.seed(1)
library(data.table)
# step 1: form data
daily <- seq(as.Date('1948-01-01'),as.Date('1995-12-31'),by="day")
dd <- data.table(date = daily, precip = runif(length(daily)))
# step 2: add NA values
nas <- sample(length(daily),300, FALSE)
dd[, precip := {is.na(precip) <- nas; precip}]
# step 3: replace NAs with the day-of-month average across years
dd[, c('month','day') := list(month(date), mday(date))]
dd[,precip:= ifelse(is.na(precip), mean(precip, na.rm=TRUE), precip), by=list(month,day)]
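A quick check that the fill worked:
dd[is.na(precip), .N] # 0, unless some month-day is NA in every year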