I have a data table containing daily data. From this data table I want to extract weekly data points obtained each Wednesday. If Wednesday is a holiday, i.e. not available in the data table, the next available data point should be taken.
Here is a MWE:
library(data.table)
df <- data.table(date=as.Date(c("2012-06-25","2012-06-26","2012-06-27","2012-06-28","2012-06-29","2012-07-02","2012-07-03","2012-07-05","2012-07-06","2012-07-09","2012-07-10","2012-07-11","2012-07-12","2012-07-13","2012-07-16","2012-07-17","2012-07-18","2012-07-19","2012-07-20")))
df[,weekday:=strftime(date,'%u')]
with output:
date weekday
1: 2012-06-25 1
2: 2012-06-26 2
3: 2012-06-27 3
4: 2012-06-28 4
5: 2012-06-29 5
6: 2012-07-02 1
7: 2012-07-03 2
8: 2012-07-05 4 #here the 4th of July was skipped
9: 2012-07-06 5
10: 2012-07-09 1
11: 2012-07-10 2
12: 2012-07-11 3
13: 2012-07-12 4
14: 2012-07-13 5
15: 2012-07-16 1
16: 2012-07-17 2
17: 2012-07-18 3
18: 2012-07-19 4
19: 2012-07-20 5
My desired result in this case would be:
date weekday
2012-06-27 3
2012-07-05 4
2012-07-11 3
2012-07-18 3
Is there a more efficient way of obtaining this than going week-by-week via a for loop and checking whether the Wednesday data point is included in the data or not? I feel that there must be a better way, so any advice would be highly appreciated!
Working solution (following Imo's suggestion):
df[,weekday:=wday(date)] #faster way to get weekdays, careful: numbers increased by 1 vs strftime
df[,numweek:=floor(as.numeric(date-date[1])/7+1)] #get continuous week numbers extending over end of years
df[df[,.I[which.min(abs(weekday-4.25))],by=.(numweek)]$V1] #gets result
Here is one method using a join on a data.table that finds, by week, the position (using .I) of the weekday value closest to 3. The offset in which.min(abs(as.integer(weekday)-3.25)) breaks ties towards Thursday (4) rather than Tuesday (2), so the next available day is picked when Wednesday is missing.
df[df[, .I[which.min(abs(as.integer(weekday)-3.25))], by=week(date)]$V1]
date weekday
1: 2012-06-27 3
2: 2012-07-05 4
3: 2012-07-11 3
4: 2012-07-18 3
Note that if your real data spans years, then you need to use by=.(week(date), year(date)).
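For example, a sketch of the same call with the year added to the grouping:
df[df[, .I[which.min(abs(as.integer(weekday)-3.25))], by = .(week(date), year(date))]$V1]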
Note also that there is a data.table function wday that returns the integer day of the week directly. It is 1 greater than the (character) value returned by strftime, so an adjustment is required if you want to use it directly.
From your data.table with a single variable, you'd do
df[, weekday := wday(date)]
df[df[, .I[which.min(abs(weekday-4.25))], by=week(date)]$V1]
date weekday
1: 2012-06-27 4
2: 2012-07-05 5
3: 2012-07-11 4
4: 2012-07-18 4
Note that the dates match those above.
I need to create 'n' variables with lags of the original variable from 1 to 'n' on the fly, something like this:
OrigVar
DatePeriod, value
2/01/2018,6
3/01/2018,4
4/01/2018,0
5/01/2018,2
6/01/2018,4
7/01/2018,1
8/01/2018,6
9/01/2018,2
10/01/2018,7
Lagged 1 variable
2/01/2018,NA
3/01/2018,6
4/01/2018,4
5/01/2018,0
6/01/2018,2
7/01/2018,4
8/01/2018,1
9/01/2018,6
10/01/2018,2
11/01/2018,7
Lagged 2 variable
2/01/2018,NA
3/01/2018,NA
4/01/2018,6
5/01/2018,4
6/01/2018,0
7/01/2018,2
8/01/2018,4
9/01/2018,1
10/01/2018,6
11/01/2018,2
12/01/2018,7
Lagged 3 variable
2/01/2018,NA
3/01/2018,NA
4/01/2018,NA
5/01/2018,6
6/01/2018,4
7/01/2018,0
8/01/2018,2
9/01/2018,4
10/01/2018,1
11/01/2018,6
12/01/2018,2
13/01/2018,7
and so on
I tried using the shift function and various other functions. With most of those that worked for me, the lagged variables finished at the last date of the original variable; in other words, the length of the lagged variable is the same as that of the original variable.
What I am looking for is the new lagged variable shifted down by the kth lag, with the data series extended by 'k' elements, including the index.
The reason I need this is to be able to compute the value of the dependent variable using the regression coefficients and the corresponding lagged variable values beyond the in-sample period.
y1 <- Lag(ciresL1_usage_1601_1612, shift = 1)
head(y1)
2016-01-02 2016-01-03 2016-01-04 2016-01-05 2016-01-06 2016-01-07
NA -5171.051 -6079.887 -3687.227 -3229.453 -2110.368
y2 <- Lag(ciresL1_usage_1601_1612, shift = 2)
head(y2)
2016-01-02 2016-01-03 2016-01-04 2016-01-05 2016-01-06 2016-01-07
NA NA -5171.051 -6079.887 -3687.227 -3229.453
tail(y2)
2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
-2316.039 -2671.185 -4100.793 -2043.020 -1147.798 1111.674
tail(ciresL1_usage_1601_1612)
2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
-4100.793 -2043.020 -1147.798 1111.674 3498.729 2438.739
Is there a way to do it relatively easily? I know I can do it by looping, adding 'k' rows to a new vector, and reloading the data into this new vector with the values appropriately shifted, but I don't want to use that method unless I have to. I am quietly confident that there must be a better way!
By the way, the object is a zoo object with daily dates as the index.
Best regards
Deepak
Convert the input zoo object to zooreg and then use lag.zooreg like this:
library(zoo)
# test input
z <- zoo(1:10, as.Date("2008-01-01") + 0:9)
zr <- as.zooreg(z)
lag(zr, -(0:3))
giving:
lag0 lag-1 lag-2 lag-3
2008-01-01 1 NA NA NA
2008-01-02 2 1 NA NA
2008-01-03 3 2 1 NA
2008-01-04 4 3 2 1
2008-01-05 5 4 3 2
2008-01-06 6 5 4 3
2008-01-07 7 6 5 4
2008-01-08 8 7 6 5
2008-01-09 9 8 7 6
2008-01-10 10 9 8 7
2008-01-11 NA 10 9 8
2008-01-12 NA NA 10 9
2008-01-13 NA NA NA 10
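If you only need a single lag k, the same idea gives one series whose index extends past the end of the input; a minimal sketch reusing the test input above (lag2 is just a helper name):
library(zoo)
zr <- as.zooreg(zoo(1:10, as.Date("2008-01-01") + 0:9))  # same test input as above
lag2 <- lag(zr, -2)  # values 1:10 re-dated two days later; the index now ends 2008-01-12
merge(zr, lag2)      # align with the original; leading/trailing gaps show up as NA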
ID FROM TO
1881 11/02/2013 11/02/2013
3090 09/09/2013 09/09/2013
1113 24/11/2014 06/12/2014
1110 24/07/2013 25/07/2013
111 25/06/2015 05/09/2015
If I have a data.table of vacation dates, FROM and TO, I want to know how many people were on vacation in any given month.
I tried:
dt[, .N, by=.(year(FROM), month(FROM))]
but obviously it would exclude people who were on vacation across two months, i.e. someone on vacation FROM JAN TO FEB would only show up in the JAN count and not the FEB count, even though they are still on vacation in FEB.
The output of the above code showing year, month and number is exactly what I'm looking for otherwise.
year month N
1: 2013 2 17570
2: 2013 9 16924
3: 2014 11 18809
4: 2013 7 16984
5: 2015 6 14401
6: 2015 12 10239
7: 2014 3 19346
8: 2013 5 14864
EDIT: I want every month someone is away on vacation counted. So ID 111 would be counted in June, July, August and Sept.
EDIT 2:
Running uwe's code on the full dataset produces the Total Count column below.
Subsetting the full data set for people on vacation for <= 30 days and > 30 days produces the counts in the respective columns below. These columns added to each other should equal the Total Count and therefore the DIFFERENCE should be 0 but this isn't the case.
month Total count <=30 >30 (<=30) + (>30) DIFFERENCE
01/02/2012 899 4 895 899 0
01/03/2012 3966 2320 1646 3966 0
01/04/2012 8684 6637 2086 8723 39
01/05/2012 10287 7586 2750 10336 49
01/06/2012 12018 9080 3000 12080 62
The OP has not specified what the exact rules are for counting, for instance, how to count if the same ID has multiple non-overlapping periods of vacation in the same month.
The solution below is based on the following rules:
Each ID may appear in more than one row.
For each row, all months between FROM and TO are counted (including the FROM and TO months). E.g., ID 111 is counted in the months of June, July, August, and September 2015.
Vacations on the last and first day of a month are counted in full, e.g., a vacation starting on May 31 and ending on June 1 is counted in both months.
If an ID has multiple periods of vacation in one month it is only counted once.
To verify that the code implements these rules, I had to extend the sample dataset provided by the OP with additional use cases (see the Data section below).
library(data.table)
library(lubridate)
# coerce dt to data.table object and character dates to class Date
setDT(dt)[, (2:3) := lapply(.SD, dmy), .SDcols = 2:3]
# for each row, create sequence of first days of months
dt[, .(month = seq(floor_date(FROM, "months"), TO, by = "months")), by = .(ID, rowid(ID))][
# count the number of unique IDs per month, order result by month
, uniqueN(ID), keyby = month]
month V1
1: 2013-02-01 1
2: 2013-07-01 1
3: 2013-09-01 2
4: 2014-11-01 1
5: 2014-12-01 1
6: 2015-06-01 1
7: 2015-07-01 1
8: 2015-08-01 1
9: 2015-09-01 1
10: 2015-11-01 1
11: 2015-12-01 1
12: 2016-06-01 1
13: 2016-07-01 1
14: 2016-08-01 1
15: 2016-09-01 1
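If you prefer the year/month layout shown in the question rather than first-of-month dates, the aggregation step could be written like this (a sketch reusing the expanded table from above):
dt[, .(month = seq(floor_date(FROM, "months"), TO, by = "months")), by = .(ID, rowid(ID))][
  , .(N = uniqueN(ID)), keyby = .(year = year(month), month = month(month))]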
Data
Based on OP's sample dataset but extended by additional use cases:
library(data.table)
dt <- fread(
"ID FROM TO
1881 11/02/2013 11/02/2013
1881 23/02/2013 24/02/2013
3090 09/09/2013 09/09/2013
3091 09/09/2013 09/09/2013
1113 24/11/2014 06/12/2014
1110 24/07/2013 25/07/2013
111 25/06/2015 05/09/2015
111 25/11/2015 05/12/2015
11 25/06/2016 01/09/2016"
)
For the data given above, you would do the following (the sub() call replaces the day of the month with a fixed day, 20, so that the monthly sequence lands in every month the vacation touches):
melt(dat,1)[,value:=as.Date(sub("\\d+","20",value),"%d/%m/%Y")][,
seq(value[1],value[2],by="1 month"),by=ID][,.N,by=.(year(V1),month(V1))]
year month N
1: 2013 2 1
2: 2013 9 1
3: 2014 11 1
4: 2014 12 1
5: 2013 7 1
6: 2015 6 1
7: 2015 7 1
8: 2015 8 1
9: 2015 9 1
I have two data.tables that are each 5-10GB in size. They look similar to the following.
library(data.table)
A <- data.table(
person = c(1,1,1,2,3,3,3,3,4,4),
datetime = c(
'2015-04-06 14:22:18',
'2015-04-07 02:55:32',
'2015-11-21 10:16:05',
'2015-10-03 13:37:29',
'2015-02-26 23:51:56',
'2015-05-16 18:21:44',
'2015-06-02 04:07:43',
'2015-11-28 15:22:36',
'2015-01-19 04:10:22',
'2015-01-24 02:18:11'
)
)
B <- data.table(
person = c(1,1,3,4,4,5),
datetime2 = c(
'2015-04-06 14:24:59',
'2015-11-28 15:22:36',
'2015-06-02 04:07:43',
'2015-01-19 06:10:22',
'2015-01-24 02:18:18',
'2015-04-06 14:22:18'
)
)
A$datetime <- as.POSIXct(A$datetime)
B$datetime2 <- as.POSIXct(B$datetime2)
The idea is to find rows in B where the datetime is within 0-10 minutes of a matching row in A (matching is done by person) and mark them in A. The question is how can I do it most efficiently using data.table?
One plan is to join the two data tables based on person only, then calculate the time difference and find rows where the time difference is between 0 and 600 seconds, and finally outer join the latter with A:
setkey(A,person)
AB <- A[B,.(datetime,
datetime2,
diff = difftime(datetime2, datetime, units = "secs"))
, by = .EACHI]
M <- AB[diff < 600 & diff > 0]
setkey(A, person, datetime)
setkey(M, person, datetime)
M[A,]
Which gives us the correct result:
person datetime datetime2 diff
1: 1 2015-04-06 14:22:18 2015-04-06 14:24:59 161 secs
2: 1 2015-04-07 02:55:32 <NA> NA secs
3: 1 2015-11-21 10:16:05 <NA> NA secs
4: 2 2015-10-03 13:37:29 <NA> NA secs
5: 3 2015-02-26 23:51:56 <NA> NA secs
6: 3 2015-05-16 18:21:44 <NA> NA secs
7: 3 2015-06-02 04:07:43 <NA> NA secs
8: 3 2015-11-28 15:22:36 <NA> NA secs
9: 4 2015-01-19 04:10:22 <NA> NA secs
10: 4 2015-01-24 02:18:11 2015-01-24 02:18:18 7 secs
However, I am not sure if this is the most efficient way. Specifically, I am using AB[diff < 600 & diff > 0] which I assume will run a vector search not a binary search, but I cannot think of how to do it using a binary search.
Also, I am not sure if converting to POSIXct is the most efficient way of calculating time differences.
Any ideas on how to improve efficiency are highly appreciated.
data.table's rolling join is perfect for this task:
B[, datetime := datetime2]
setkey(A,person,datetime)
setkey(B,person,datetime)
B[A,roll=-600]
person datetime2 datetime
1: 1 2015-04-06 14:24:59 1428319338
2: 1 NA 1428364532
3: 1 NA 1448090165
4: 2 NA 1443868649
5: 3 NA 1424983916
6: 3 NA 1431789704
7: 3 2015-06-02 04:07:43 1433207263
8: 3 NA 1448713356
9: 4 NA 1421629822
10: 4 2015-01-24 02:18:18 1422055091
The only difference from your expected output is that it treats the time difference as less than or equal to 10 minutes (<=). If that is a problem for you, you can just drop the equal matches.
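On newer data.table versions (>= 1.9.8), a non-equi update join is another way to express the 0-10 minute window directly and mark the matches in A. A sketch, where upper and matched are helper columns introduced here for illustration; like the rolling join, it treats the window as inclusive:
A[, upper := datetime + 600]                                       # upper bound of the 10 minute window
A[B, on = .(person, datetime <= datetime2, upper >= datetime2),
  matched := TRUE]                                                 # flag A rows that have a B row in the window
A[, upper := NULL]
A[]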
I have data with the following columns:
CaseID, Time, Value.
The Time column values are not at regular one-minute intervals. I am trying to add rows for the missing Time values, with NA for the other columns except CaseID.
Case Value Time
1 100 07:52:00
1 110 07:53:00
1 120 07:55:00
2 10 08:35:00
2 11 08:36:00
2 12 08:38:00
Desired output:
Case Value Time
1 100 07:52:00
1 110 07:53:00
1 NA 07:54:00
1 120 07:55:00
2 10 08:35:00
2 11 08:36:00
2 NA 08:37:00
2 12 08:38:00
I tried dt[CJ(unique(CaseID),seq(min(Time),max(Time),"min"))] but it gives the following error:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x), :
Join results in 9827315 rows; more than 9620640 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
I have not been able to make it work; any help would be appreciated.
Like this??
dt[,Time:=as.POSIXct(Time,format="%H:%M:%S")]
result <- dt[,list(Time=seq(min(Time),max(Time),by="1 min")),by=Case]
setkey(result,Case,Time)
setkey(dt,Case,Time)
result <- dt[result][,Time:=format(Time,"%H:%M:%S")]
result
# Case Value Time
# 1: 1 100 07:52:00
# 2: 1 110 07:53:00
# 3: 1 NA 07:54:00
# 4: 1 120 07:55:00
# 5: 2 10 08:35:00
# 6: 2 11 08:36:00
# 7: 2 NA 08:37:00
# 8: 2 12 08:38:00
Another way:
dt[, Time := as.POSIXct(Time, format = "%H:%M:%S")]
setkey(dt, Time)
dt[, .SD[J(seq(min(Time), max(Time), by='1 min'))], by=Case]
We group by Case and join on Time within each group using .SD (hence setting the key on Time). From here you can use format() as shown above.
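The same join can also be written with on= instead of setting keys (available in newer data.table versions); a sketch, assuming Time has already been converted to POSIXct as above, with grid just a helper name:
grid <- dt[, .(Time = seq(min(Time), max(Time), by = "1 min")), by = Case]
dt[grid, on = .(Case, Time)]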
I am loading a data.table from a CSV file that has date, orders, amount, etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above, 03-Jan and 06-Jan do not have any entries.
I would like to fill the missing entries with default values (say, zero for orders, amount, etc.), or carry the last value forward (e.g., 03-Jan reuses the 02-Jan values and 06-Jan reuses the 05-Jan values, etc.).
What is the best/optimal way to fill in such gaps of missing dates with these default values?
The answer here suggests using allow.cartesian = TRUE and expand.grid for missing weekdays. That may work for weekdays (since there are just 7 of them), but I am not sure it is the right way to go about dates as well, especially when dealing with multi-year data.
The idiomatic data.table way (using rolling joins) is this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
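If you want the zero defaults mentioned in the question instead of carrying values forward, one option (a sketch, assuming data.table >= 1.12.4 for setnafill) is a plain join followed by filling the NAs:
filled <- NADayWiseOrders[J(all_dates)]   # no roll: missing dates come back as NA rows
setnafill(filled, type = "const", fill = 0, cols = c("orders", "amount", "guests"))
filled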
Here is how you fill in the gaps within each subgroup:
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]
# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]
# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA's to 0)
require(xts)
na.locf(dt)