Here is a subset of the original data I am working with:
dput(datumi)
structure(c("21:26", "21:33", "21:38", "23:02", "23:03", "21:27",
"21:34", "21:39", "23:03", "23:04", "21:26", "21:33", "21:38",
"23:02", "23:04", "21:26", "21:34", "21:38", "23:02", "23:04",
"21:27", "21:34", "21:39", "23:02", "23:04"), .Dim = c(5L, 5L
), .Dimnames = list(c("2", "3", "4", "5", "6"), c("Datum_1",
"Datum_2", "Datum_3", "Datum_4", "Datum_5")))
So I am working with times of day, where e.g. 21:26 means 21:26 in the evening.
Now I would like to take consecutive differences between the columns: Datum_2 minus Datum_1, Datum_3 minus Datum_2, Datum_4 minus Datum_3, and so on. The output should be new columns containing the differences in seconds.
I've already created a function/loop that does this for numeric data. For example, with numeric data I would do this and get the desired output:
dat <- data.frame(
  column1 = round(runif(n = 10, min = 0, max = 5), 0),
  column2 = round(runif(n = 10, min = 0, max = 5), 0),
  column3 = round(runif(n = 10, min = 0, max = 5), 0),
  column4 = round(runif(n = 10, min = 0, max = 5), 0)
)
results <- list()
for (i in seq_along(dat)) {
  if (i == length(dat)) {
    results[[i]] <- dat[, i]
  } else {
    results[[i]] <- dat[, i + 1] - dat[, i]
  }
}
results <- t(do.call(rbind, results))
results <- data.frame(results)
But I cannot figure out how to do this for the time format. I have tried strptime and as.POSIXct:
x1 <- strptime(datumi, "%H:%M")
as.numeric(x1,units="secs")
and
as.POSIXct(datumi,format="%H:%M")
And I also looked at these questions:
Subtract time in r
Subtracting Two Columns Consisting of Both Date and Time in R
convert character to time in R
Here is one solution based on the answer given in R: Convert hours:minutes:seconds.
datumi
# Datum_1 Datum_2 Datum_3 Datum_4 Datum_5
# 2 "21:26" "21:27" "21:26" "21:26" "21:27"
# 3 "21:33" "21:34" "21:33" "21:34" "21:34"
# 4 "21:38" "21:39" "21:38" "21:38" "21:39"
# 5 "23:02" "23:03" "23:02" "23:02" "23:02"
# 6 "23:03" "23:04" "23:04" "23:04" "23:04"
makeTime <- function(x) as.POSIXct(paste(Sys.Date(), x))
dat <- apply(datumi, 2, makeTime)
mapply(x = 2:ncol(dat),
       y = 1:(ncol(dat) - 1),
       function(x, y) dat[, x] - dat[, y])
# [,1] [,2] [,3] [,4]
# [1,] 60 -60 0 60
# [2,] 60 -60 60 0
# [3,] 60 -60 0 60
# [4,] 60 -60 0 0
# [5,] 60 0 0 0
You can also use as.POSIXct without pasting in the current date, by using the 'format' argument:
makeTime <- function(x) as.POSIXct(x, format = "%H:%M")
Note, the result is the same because as.POSIXct assumes the current date when none is given.
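Putting that variant together end to end, here is a minimal sketch (using just the first two rows of the question's data for brevity):

```r
makeTime <- function(x) as.POSIXct(x, format = "%H:%M")
datumi <- matrix(c("21:26", "21:33",  "21:27", "21:34",  "21:26", "21:33",
                   "21:26", "21:34",  "21:27", "21:34"),
                 nrow = 2,
                 dimnames = list(c("2", "3"), paste0("Datum_", 1:5)))
dat <- apply(datumi, 2, makeTime)  # POSIXct gets coerced to numeric seconds here
# Each column minus the previous one, vectorised over all rows at once
diffs <- dat[, -1, drop = FALSE] - dat[, -ncol(dat), drop = FALSE]
# row 1: 60 -60 0 60 ; row 2: 60 -60 60 0
```

Because `apply` returns a plain numeric matrix of seconds, subtracting the shifted column blocks gives all the consecutive differences in one step, without an explicit loop or mapply.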
One way you could also do it, if you wanted the differences as named columns alongside your original data, would be:
df <- as.data.frame(lapply(dat, strptime, format = "%H:%M"))
lapply(1:4, function(i) {
  df[, paste0("diff", i, "_", i + 1)] <<- difftime(df[, i], df[, i + 1], units = "secs")
})
df
Datum_1 Datum_2 Datum_3 Datum_4 Datum_5 diff1_2 diff2_3 diff3_4
2 2016-07-22 21:26:00 2016-07-22 21:27:00 2016-07-22 21:26:00 2016-07-22 21:26:00 2016-07-22 21:27:00 -60 secs 60 secs 0 secs
3 2016-07-22 21:33:00 2016-07-22 21:34:00 2016-07-22 21:33:00 2016-07-22 21:34:00 2016-07-22 21:34:00 -60 secs 60 secs -60 secs
4 2016-07-22 21:38:00 2016-07-22 21:39:00 2016-07-22 21:38:00 2016-07-22 21:38:00 2016-07-22 21:39:00 -60 secs 60 secs 0 secs
5 2016-07-22 23:02:00 2016-07-22 23:03:00 2016-07-22 23:02:00 2016-07-22 23:02:00 2016-07-22 23:02:00 -60 secs 60 secs 0 secs
6 2016-07-22 23:03:00 2016-07-22 23:04:00 2016-07-22 23:04:00 2016-07-22 23:04:00 2016-07-22 23:04:00 -60 secs 0 secs 0 secs
diff4_5
2 -60 secs
3 0 secs
4 -60 secs
5 0 secs
6 0 secs
I've found a solution to my problem using the function/loop I had already created for numeric data. I just needed to include
difftime(strptime(datumi[,i+1], format = "%H:%M"), strptime(datumi[,i], format = "%H:%M"), units = "secs") in my for loop, so the code looks like this:
datumi <- as.data.frame(datumi)
results <- list()
for (i in 1:(ncol(datumi) - 1)) {
  results[[i]] <- difftime(strptime(datumi[, i + 1], format = "%H:%M"),
                           strptime(datumi[, i], format = "%H:%M"),
                           units = "secs")
}
results <- t(do.call(rbind, results))
results <- data.frame(results)
# And the output
  X1  X2 X3 X4
2 60 -60  0 60
3 60 -60 60  0
4 60 -60  0 60
5 60 -60  0  0
6 60   0  0  0
But because the mapply approach used by @dayne is more convenient for me (it applies a function to multiple list arguments and is more readable to me), I used his solution.
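As a side note, since the values are bare "HH:MM" strings, the whole date/time machinery can be skipped entirely: convert to seconds since midnight and use plain arithmetic. A sketch (assumes all times fall within the same day, as in this data):

```r
toSecs <- function(x) {
  # Split "HH:MM" strings and convert to seconds since midnight
  hm <- do.call(rbind, strsplit(x, ":", fixed = TRUE))
  (as.numeric(hm[, 1]) * 60 + as.numeric(hm[, 2])) * 60
}
times <- c("21:26", "21:27", "21:26", "21:26", "21:27")  # row "2" of datumi
diff(toSecs(times))
## [1]  60 -60   0  60
```

For the full matrix, the same idea applies row-wise: `t(apply(datumi, 1, function(r) diff(toSecs(r))))`.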
I have this function to generate monthly date ranges; it should handle years where February has 28 or 29 days:
starts ends
1 2017-01-01 2017-01-31
2 2017-02-01 2017-02-28
3 2017-03-01 2017-03-31
It works with:
make_date_ranges(as.Date("2017-01-01"), Sys.Date())
But gives error with:
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Why?
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Error in data.frame(starts, ends) :
arguments imply differing number of rows: 38, 36
add_months <- function(date, n){
  seq(date, by = paste(n, "months"), length = 2)[2]
}
make_date_ranges <- function(start, end){
  starts <- seq(from = start,
                to = Sys.Date() - 1,
                by = "1 month")
  ends <- c(seq(from = add_months(start, 1),
                to = end,
                by = "1 month") - 1,
            Sys.Date() - 1)
  data.frame(starts, ends)
}
## usage
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
1) First, define start-of-month (som) and end-of-month (eom) functions which take a Date class object, a date string in standard Date format, or a yearmon object, and produce a Date class object giving the start or end of that month.
Using those, create a monthly Date series s using the start of each month from the month/year of from to that of to. Use pmax to ensure that the series does not extend before from and pmin so that it does not extend past to.
The input arguments can be strings in standard Date format, Date class objects or yearmon class objects. In the yearmon case it assumes the user wanted the full month for every month. (The if statement can be omitted if you don't need to support yearmon inputs.)
library(zoo)
som <- function(x) as.Date(as.yearmon(x))
eom <- function(x) as.Date(as.yearmon(x), frac = 1)
date_ranges2 <- function(from, to) {
if (inherits(to, "yearmon")) to <- eom(to)
s <- seq(som(from), eom(to), "month")
data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges2("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges2(as.yearmon("2000-01"), as.yearmon("2000-06"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
2) This alternative takes the same approach but defines start of month (som) and end of month (eom) functions without using yearmon so that only base R is needed. It takes character strings in standard Date format or Date class inputs and gives the same output as (1).
som <- function(x) as.Date(cut(as.Date(x), "month")) # start of month
eom <- function(x) som(som(x) + 32) - 1 # end of month
date_ranges3 <- function(from, to) {
s <- seq(som(from), as.Date(to), "month")
data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges3("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges3(som("2000-01-10"), eom("2000-06-20"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
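The `+ 32` trick in eom works because the first of any month plus 32 days always lands in the following month (no month exceeds 31 days); truncating back to that month's start and subtracting one day then gives the true month end, leap Februaries included. A quick check of the base-R definitions:

```r
som <- function(x) as.Date(cut(as.Date(x), "month"))  # start of month
eom <- function(x) som(som(x) + 32) - 1               # end of month
# Leap February, non-leap February, and a 30-day month
eom(c("2000-02-15", "2001-02-15", "2000-06-20"))
## [1] "2000-02-29" "2001-02-28" "2000-06-30"
```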
You don't need to use seq twice -- you can subtract 1 day from the first of each month to get the ends; generate one too many starts, then shift & subset:
make_date_ranges = function(start, end) {
# format(end, "%Y-%m-01") essentially truncates end to
# the first day of end's month; 32 days later is guaranteed to be
# in the subsequent month
starts = seq(from = start, to = as.Date(format(end, '%Y-%m-01')) + 32, by = 'month')
data.frame(starts = head(starts, -1L), ends = tail(starts - 1, -1L))
}
x = make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
rbind(head(x), tail(x))
# starts ends
# 1 2017-01-01 2017-01-31
# 2 2017-02-01 2017-02-28
# 3 2017-03-01 2017-03-31
# 4 2017-04-01 2017-04-30
# 5 2017-05-01 2017-05-31
# 6 2017-06-01 2017-06-30
# 31 2019-07-01 2019-07-31
# 32 2019-08-01 2019-08-31
# 33 2019-09-01 2019-09-30
# 34 2019-10-01 2019-10-31
# 35 2019-11-01 2019-11-30
# 36 2019-12-01 2019-12-31
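A quick sanity check of the leap-year handling, reusing this answer's function on a span covering both a leap and a non-leap February (2020 and 2021 are illustrative choices, not from the question):

```r
make_date_ranges <- function(start, end) {
  # Truncate end to the first of its month; +32 days is guaranteed to land
  # in the following month, so we get one start past the last month needed
  starts <- seq(from = start, to = as.Date(format(end, "%Y-%m-01")) + 32,
                by = "month")
  data.frame(starts = head(starts, -1L), ends = tail(starts - 1, -1L))
}
x <- make_date_ranges(as.Date("2020-01-01"), as.Date("2021-03-31"))
x$ends[format(x$ends, "%m") == "02"]  # the two February month-ends
## [1] "2020-02-29" "2021-02-28"
```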
I have the following data as a list of POSIXct times that span one month. Each represents a bike delivery. My aim is to find the average number of bike deliveries per ten-minute interval over a 24-hour period (producing a total of 144 rows). First all of the trips need to be summed and binned into intervals, then divided by the number of days. So far, I've managed to write code that sums trips per 10-minute interval, but it produces incorrect values, and I am not sure where it went wrong.
The data looks like this:
head(start_times)
[1] "2014-10-21 16:58:13 EST" "2014-10-07 10:14:22 EST" "2014-10-20 01:45:11 EST"
[4] "2014-10-17 08:16:17 EST" "2014-10-07 17:46:36 EST" "2014-10-28 17:32:34 EST"
length(start_times)
[1] 1747
The code looks like this:
library(lubridate)
library(dplyr)
tripduration <- floor(runif(1747) * 1000)
time_bucket <- start_times - minutes(minute(start_times) %% 10) - seconds(second(start_times))
df <- data.frame(tripduration, start_times, time_bucket)
summarized <- df %>%
group_by(time_bucket) %>%
summarize(trip_count = n())
summarized <- as.data.frame(summarized)
out_buckets <- data.frame(out_buckets = seq(as.POSIXct("2014-10-01 00:00:00"), as.POSIXct("2014-10-31 23:00:00"), by = 600))
out <- left_join(out_buckets, summarized, by = c("out_buckets" = "time_bucket"))
out$trip_count[is.na(out$trip_count)] <- 0
head(out)
out_buckets trip_count
1 2014-10-01 00:00:00 0
2 2014-10-01 00:10:00 0
3 2014-10-01 00:20:00 0
4 2014-10-01 00:30:00 0
5 2014-10-01 00:40:00 0
6 2014-10-01 00:50:00 0
dim(out)
[1] 4459 2
test <- format(out$out_buckets,"%H:%M:%S")
test2 <- out$trip_count
test <- cbind(test, test2)
colnames(test)[1] <- "interval"
colnames(test)[2] <- "count"
test <- as.data.frame(test)
test$count <- as.numeric(test$count)
test <- aggregate(count~interval, test, sum)
head(test, n = 20)
interval count
1 00:00:00 32
2 00:10:00 33
3 00:20:00 32
4 00:30:00 31
5 00:40:00 34
6 00:50:00 34
7 01:00:00 31
8 01:10:00 33
9 01:20:00 39
10 01:30:00 41
11 01:40:00 36
12 01:50:00 31
13 02:00:00 33
14 02:10:00 34
15 02:20:00 32
16 02:30:00 32
17 02:40:00 36
18 02:50:00 32
19 03:00:00 34
20 03:10:00 39
but this is impossible, because when I sum the counts
sum(test$count)
[1] 7494
I get 7494, whereas the total should be 1747.
I'm not sure where I went wrong or how to simplify this code to get the correct result.
I've done what I can, but I can't reproduce your issue without your data.
library(dplyr)
I created the full sequence of 10 minute blocks:
blocks.of.10mins <- data.frame(out_buckets=seq(as.POSIXct("2014/10/01 00:00"), by="10 mins", length.out=30*24*6))
Then split the start_times into the same bins. Note: I created a baseline time of midnight to force the blocks to align to 10 minute intervals. Removing this later is an exercise for the reader. I also changed one of your data points so that there was at least one example of multiple records in the same bin.
start_times <- as.POSIXct(c("2014-10-01 00:00:00", ## added
"2014-10-21 16:58:13",
"2014-10-07 10:14:22",
"2014-10-20 01:45:11",
"2014-10-17 08:16:17",
"2014-10-07 10:16:36", ## modified
"2014-10-28 17:32:34"))
trip_times <- data.frame(start_times) %>%
mutate(out_buckets = as.POSIXct(cut(start_times, breaks="10 mins")))
The start_times and all the 10 minute intervals can then be merged
trips_merged <- merge(trip_times, blocks.of.10mins, by="out_buckets", all=TRUE)
These can then be grouped by 10 minute block and counted
trips_merged %>% filter(!is.na(start_times)) %>%
group_by(out_buckets) %>%
summarise(trip_count=n())
Source: local data frame [6 x 2]
out_buckets trip_count
(time) (int)
1 2014-10-01 00:00:00 1
2 2014-10-07 10:10:00 2
3 2014-10-17 08:10:00 1
4 2014-10-20 01:40:00 1
5 2014-10-21 16:50:00 1
6 2014-10-28 17:30:00 1
Instead, if we only consider the time of day, not the date:
trips_merged2 <- trips_merged
trips_merged2$out_buckets <- format(trips_merged2$out_buckets, "%H:%M:%S")
trips_merged2 %>% filter(!is.na(start_times)) %>%
group_by(out_buckets) %>%
summarise(trip_count=n())
Source: local data frame [6 x 2]
out_buckets trip_count
(chr) (int)
1 00:00:00 1
2 01:40:00 1
3 08:10:00 1
4 10:10:00 2
5 16:50:00 1
6 17:30:00 1
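To get from per-bucket counts to the averages the question actually asks for, one approach (a base-R sketch using synthetic stand-in data, since the real start_times isn't available) is to bucket by time of day directly and divide by the number of days observed:

```r
set.seed(1)
# Stand-in for the asker's 1747 trips, spread randomly over October 2014
start_times <- as.POSIXct("2014-10-01 00:00:00", tz = "UTC") +
  runif(1747, 0, 31 * 24 * 3600)
# Floor each timestamp to its 10-minute time-of-day bucket, ignoring the date
interval <- sprintf("%s:%02d",
                    format(start_times, "%H"),
                    (as.integer(format(start_times, "%M")) %/% 10) * 10)
counts <- table(interval)                      # trips per bucket, all days pooled
n_days <- length(unique(as.Date(start_times)))
avg    <- counts / n_days                      # average trips per bucket per day
sum(counts)                                    # 1747 -- the totals reconcile
```

Note that `table` only lists buckets that actually occur; to force all 144 rows, convert `interval` to a factor whose levels are the full set of 10-minute labels first.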
I have data that looks like
Dates another column
2015-05-13 23:53:00 some values
2015-05-13 23:53:00 ....
2015-05-13 23:33:00
2015-05-13 23:30:00
...
2003-01-06 00:01:00
2003-01-06 00:01:00
The code I then used is
trainDF<-read.csv("train.csv")
diff<-as.POSIXct(trainDF[1,1])-as.POSIXct(trainDF[,1])
head(diff)
Time differences in hours
[1] 23.88333 23.88333 23.88333 23.88333 23.88333 23.88333
However, this doesn't make sense, because subtracting the first two entries should give 0 since they are the exact same time, and subtracting the 3rd entry from the 1st should give a difference of 20 minutes, not 23.88333 hours. I get similarly nonsensical values when I try as.duration(diff) and as.numeric(diff). Why is this?
If you just have a series of dates in POSIXct, you can use the diff function to calculate the difference between each date. Here's an example:
> BD <- as.POSIXct("2015-01-01 12:00:00", tz = "UTC") # Making a begin date.
> ED <- as.POSIXct("2015-01-01 13:00:00", tz = "UTC") # Making an end date.
> timeSeq <- seq(BD, ED, "min") # Creating a time series in between the dates by minute.
>
> head(timeSeq) # To see what it looks like.
[1] "2015-01-01 12:00:00 UTC" "2015-01-01 12:01:00 UTC" "2015-01-01 12:02:00 UTC" "2015-01-01 12:03:00 UTC" "2015-01-01 12:04:00 UTC"
[6] "2015-01-01 12:05:00 UTC"
>
> diffTime <- diff(timeSeq) # Takes the difference between each adjacent time in the time series.
> print(diffTime) # Printing out the result.
Time differences in mins
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> # For the sake of example, let's make a hole in the data.
>
> limBD <- as.POSIXct("2015-01-01 12:15:00", tz = "UTC") # Start of the hole we want to create.
> limED <- as.POSIXct("2015-01-01 12:45:00", tz = "UTC") # End of the hole we want to create.
>
> timeSeqLim <- timeSeq[timeSeq <= limBD | timeSeq >= limED] # Make a hole of 1/2 hour in the sequence.
>
> diffTimeLim <- diff(timeSeqLim) # Taking the diff.
> print(diffTimeLim) # There is now a large gap, which is reflected in the print out.
Time differences in mins
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 30 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
However, I read through your post again, and it seems you just want to subtract each item not in the first row by the first row. I used the same sample I used above to do this:
> timeSeq[1] - timeSeq[2:length(timeSeq)]
Time differences in mins
[1] -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 -15 -16 -17 -18 -19 -20 -21 -22 -23 -24 -25 -26 -27 -28 -29 -30 -31 -32 -33 -34 -35 -36
[37] -37 -38 -39 -40 -41 -42 -43 -44 -45 -46 -47 -48 -49 -50 -51 -52 -53 -54 -55 -56 -57 -58 -59 -60
Which gives me what I'd expect. Trying a data.frame method:
> timeDF <- data.frame(time = timeSeq)
> timeDF[1,1] - timeDF[, 1]
Time differences in secs
[1] 0 -60 -120 -180 -240 -300 -360 -420 -480 -540 -600 -660 -720 -780 -840 -900 -960 -1020 -1080 -1140 -1200 -1260 -1320 -1380
[25] -1440 -1500 -1560 -1620 -1680 -1740 -1800 -1860 -1920 -1980 -2040 -2100 -2160 -2220 -2280 -2340 -2400 -2460 -2520 -2580 -2640 -2700 -2760 -2820
[49] -2880 -2940 -3000 -3060 -3120 -3180 -3240 -3300 -3360 -3420 -3480 -3540 -3600
It seems I'm not encountering the same problem as you. Perhaps coerce everything to POSIXct first and then do your subtraction? Try checking the class of your data and make sure it is actually POSIXct. Check the actual values you are subtracting and that may give you some insight.
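As a hedged illustration of that advice (the column values here are made up): a factor column, which was the read.csv default before R 4.0, is a common source of confusion, and coercing explicitly via character before subtracting keeps things unambiguous.

```r
# Hypothetical stand-in for the CSV column; stringsAsFactors = TRUE mimics
# the pre-R-4.0 read.csv default
df <- data.frame(Dates = c("2015-05-13 23:53:00", "2015-05-13 23:53:00",
                           "2015-05-13 23:33:00"),
                 stringsAsFactors = TRUE)
class(df$Dates)                                          # "factor", not "character"
dates <- as.POSIXct(as.character(df$Dates), tz = "UTC")  # coerce explicitly
dates[1] - dates[2:3]                                    # 0 and 20 minutes, as expected
```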
EDIT:
After downloading the file, here's what I ran. The file is trainDF:
trainDF$Dates <- as.POSIXct(trainDF$Dates, tz = "UTC") # Coercing to POSIXct.
datesDiff <- trainDF[1, 1] - trainDF[, 1] # Taking the difference of each date with the first date.
head(datesDiff) # Printing out the head.
With results:
Time differences in secs
[1] 0 0 1200 1380 1380 1380
The only thing I did differently was use the time zone UTC, which does not shift hours with daylight savings time, so there should be no effect there.
HOWEVER, I did the exact same method as you and got the same results:
> diff<-as.POSIXct(trainDF[1,1])-as.POSIXct(trainDF[,1])
> head(diff)
Time differences in hours
[1] 23.88333 23.88333 23.88333 23.88333 23.88333 23.88333
So there is something up with your method, but I can't say what. I find it is typically safer to coerce first and then do the mathematical operation, rather than doing both in one line.
I am quite new to R and have been struggling with trying to convert my data and could use some much needed help.
I have a dataframe which is approx. 70,000*2. This data covers a whole year (52 weeks/365 days). A portion of it looks like this:
Create.Date.Time Ticket.ID
1 2013-06-01 12:59:00 INCIDENT684790
2 2013-06-02 07:56:00 SERVICE684793
3 2013-06-02 09:39:00 SERVICE684794
4 2013-06-02 14:14:00 SERVICE684796
5 2013-06-02 17:20:00 SERVICE684797
6 2013-06-03 07:20:00 SERVICE684799
7 2013-06-03 08:02:00 SERVICE684839
8 2013-06-03 08:04:00 SERVICE684841
9 2013-06-03 08:04:00 SERVICE684842
10 2013-06-03 08:08:00 SERVICE684843
I am trying to get the number of tickets in every hour of the week (that is, hour 1 to hour 168) for each week. Hour 1 would start on Monday at 00.00, and hour 168 would be Sunday 23.00-23.59. This would be repeated for each week. I want to use the Create.Date.Time data to calculate the hour of the week the ticket is in, say for:
2013-06-01 12:59:00 INCIDENT684790 - hour 133,
2013-06-03 08:08:00 SERVICE684843 - hour 9
I am then going to do averages for each hour and plot those. I am completely at a loss as to where to start. Could someone please point me to the right direction?
Before addressing the plotting aspect of your question: is this the format of data you are trying to get? This uses the lubridate package, which you may need to install (install.packages("lubridate", dependencies = TRUE)).
library(lubridate)
##
Events <- paste(
sample(c("INCIDENT","SERVICE"),20000,replace=TRUE),
sample(600000:900000,20000)
)
t0 <- as.POSIXct(
"2013-01-01 00:00:00",
format="%Y-%m-%d %H:%M:%S",
tz="America/New_York")
Dates <- sort(t0 + sample(0:(3600*24*365-1),20000))
Weeks <- week(Dates)
wDay <- wday(Dates,label=TRUE)
Hour <- hour(Dates)
##
hourShift <- function(time,wday){
hShift <- sapply(wday, function(X){
if(X=="Mon"){
0
} else if(X=="Tues"){
24*1
} else if(X=="Wed"){
24*2
} else if(X=="Thurs"){
24*3
} else if(X=="Fri"){
24*4
} else if(X=="Sat"){
24*5
} else {
24*6
}
})
##
tOut <- hour(time) + hShift + 1
return(tOut)
}
##
weekHour <- hourShift(time=Dates,wday=wDay)
##
Data <- data.frame(
Event=Events,
Timestamp=Dates,
Week=Weeks,
wDay=wDay,
dayHour=Hour,
weekHour=weekHour,
stringsAsFactors=FALSE)
##
This gives you:
> head(Data)
Event Timestamp Week wDay dayHour weekHour
1 SERVICE 783405 2013-01-01 00:13:55 1 Tues 0 25
2 INCIDENT 860015 2013-01-01 01:06:41 1 Tues 1 26
3 INCIDENT 808309 2013-01-01 01:10:05 1 Tues 1 26
4 INCIDENT 835509 2013-01-01 01:21:44 1 Tues 1 26
5 SERVICE 769239 2013-01-01 02:04:59 1 Tues 2 27
6 SERVICE 762269 2013-01-01 02:07:41 1 Tues 2 27
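The weekday ladder in hourShift can also be written with base R's %u format (ISO weekday, Monday = 1), which avoids depending on lubridate's label abbreviations. A minimal sketch, checked against the question's own two examples:

```r
hour_of_week <- function(t) {
  # %u gives the ISO weekday (Mon = 1 .. Sun = 7); %H gives the hour 00-23.
  # Hours of the week are numbered 1..168, Monday 00:xx being hour 1.
  (as.integer(format(t, "%u")) - 1) * 24 + as.integer(format(t, "%H")) + 1
}
hour_of_week(as.POSIXct("2013-06-01 12:59:00"))  # Saturday -> 133
hour_of_week(as.POSIXct("2013-06-03 08:08:00"))  # Monday   -> 9
```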
Say I have a file which contains a few entries like this:
02/10/11 10:26:35 AM UTC, 0
02/10/11 10:26:38 AM UTC, 1
02/10/11 10:26:42 AM UTC, 0
Is there any straightforward way, in R, to turn this information into a full-length binary timeseries (assuming a one second sampling interval), imputed with zeros and ones?
In this example the series would be: 0 0 0 1 1 1 1 0
EDIT: Because Dirk and Josh gave unique solutions I wanted to see how they compare in terms of processing time:
library(xts)
library(data.table)
library(rbenchmark)
doseq <- function(N,Nby){
base.t <<- Sys.time()
t.seq <<- base.t + seq.int(from=0, to=N, by=Nby)
n.t <<- length(t.seq)
val.seq <<- (1:n.t - 1) %% 2
}
josh <- function(N,Nby=10){
doseq(N,Nby)
dt1 <- data.table(time = t.seq, val=val.seq, key="time")
dt2 <- data.table(time = with(dt1, seq(min(time), max(time), by=1)), key = "time")
dtf <- dt1[dt2, rolltolast = TRUE]
return(dtf)
}
dirk <- function(N,Nby=10){
doseq(N,Nby)
xt1 <- xts(val.seq, t.seq)
secs <- seq(start(xt1), end(xt1), by="1 sec")
xtf <- zoo::na.locf(merge(xt1, xts(, secs)))
return(xtf)
}
bm <- benchmark(josh(1e2,10), josh(1e3,10), josh(1e4,10), josh(1e5,10), josh(1e6,10),
dirk(1e2,10), dirk(1e3,10), dirk(1e4,10), dirk(1e5,10), dirk(1e6,10),
columns=c("test", "replications","elapsed", "relative"),
replications=10)
print(bm)
giving:
test replications elapsed relative
6 dirk(100, 10) 10 0.024 1.000
7 dirk(1000, 10) 10 0.026 1.083
8 dirk(10000, 10) 10 0.044 1.833
9 dirk(1e+05, 10) 10 0.321 13.375
10 dirk(1e+06, 10) 10 3.342 139.250
1 josh(100, 10) 10 0.034 1.417
2 josh(1000, 10) 10 0.036 1.500
3 josh(10000, 10) 10 0.070 2.917
4 josh(1e+05, 10) 10 0.453 18.875
5 josh(1e+06, 10) 10 5.381 224.208
So it seems they aren't too different, but the xts method is somewhat faster than the data.table method.
Yes, the xts package can help.
First, create an xts object:
R> pt <- strptime(c("02/10/11 10:26:35 AM", "02/10/11 10:26:38 AM",
+ "02/10/11 10:26:42 AM"), "%d/%m/%y %H:%M:%S %p", tz="UTC")
R> vals <- c(0,1,0)
R> x <- xts(vals, pt)
R> x
[,1]
2011-10-02 10:26:35 0
2011-10-02 10:26:38 1
2011-10-02 10:26:42 0
Warning message:
timezone of object (UTC) is different than current timezone ().
R>
We can ignore the warning -- I have a US timezone.
Now, we can create a sequence of seconds from the beginning to the end of that variable:
R> secs <- seq(start(x), end(x), by="1 sec")
And now for the magic: by merging our original object with an 'empty' object on that grid, we expand to the grid:
R> x2 <- merge(x, xts(, secs))
R> x2
x
2011-10-02 10:26:35 0
2011-10-02 10:26:36 NA
2011-10-02 10:26:37 NA
2011-10-02 10:26:38 1
2011-10-02 10:26:39 NA
2011-10-02 10:26:40 NA
2011-10-02 10:26:41 NA
2011-10-02 10:26:42 0
Warning message:
timezone of object (UTC) is different than current timezone ().
All that is left is to call na.locf():
R> x2 <- na.locf(merge(x, xts(, secs)))
R> x2
x
2011-10-02 10:26:35 0
2011-10-02 10:26:36 0
2011-10-02 10:26:37 0
2011-10-02 10:26:38 1
2011-10-02 10:26:39 1
2011-10-02 10:26:40 1
2011-10-02 10:26:41 1
2011-10-02 10:26:42 0
Warning message:
timezone of object (UTC) is different than current timezone ().
R>
Here's how you could do that using the data.table package:
library(data.table)
## Some example data
X <- data.table(time = Sys.time() + c(0,3,7), val=c(0,1,0), key = "time")
## A data.table with one row for each second spanned by X
Y <- data.table(time = with(X, seq(min(time), max(time), by=1)), key = "time")
## Merge them
X[Y, rolltolast = TRUE]
# time val
# 1: 2012-09-13 15:58:53 0
# 2: 2012-09-13 15:58:54 0
# 3: 2012-09-13 15:58:55 0
# 4: 2012-09-13 15:58:56 1
# 5: 2012-09-13 15:58:57 1
# 6: 2012-09-13 15:58:58 1
# 7: 2012-09-13 15:58:59 1
# 8: 2012-09-13 15:59:00 0
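For comparison, the same last-observation-carried-forward fill can be sketched in base R with findInterval, no packages needed (using the question's three observations):

```r
pt <- as.POSIXct(c("2011-10-02 10:26:35", "2011-10-02 10:26:38",
                   "2011-10-02 10:26:42"), tz = "UTC")
vals <- c(0, 1, 0)
secs <- seq(min(pt), max(pt), by = "1 sec")
# findInterval returns, for each second, the index of the last observation
# at or before it -- exactly last-observation-carried-forward
filled <- vals[findInterval(secs, pt)]
filled
## [1] 0 0 0 1 1 1 1 0
```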