aggregating 15-minute time series data to daily - R

This is the data in my text file (I have shown 10 rows out of 10,000). Index is the row names, temp is the timestamp, and m holds the values in mm.
"Index" "temp" "m"
1 "2012-02-07 18:15:13" "4297"
2 "2012-02-07 18:30:04" "4296"
3 "2012-02-07 18:45:10" "4297"
4 "2012-02-07 19:00:01" "4297"
5 "2012-02-07 19:15:07" "4298"
6 "2012-02-07 19:30:13" "4299"
7 "2012-02-07 19:45:04" "4299"
8 "2012-02-07 20:00:10" "4299"
9 "2012-02-07 20:15:01" "4300"
10 "2012-02-07 20:30:07" "4301"
which I import into R using:
x2 = read.table("data.txt", header=TRUE)
I tried the following code to aggregate the time series to daily data:
c = aggregate(ts(x2[, 2], freq = 96), 1, mean)
I set the frequency to 96 because 24 hours of 15-minute data are covered by 96 values.
It returns this:
Time Series:
Start = 1
End = 5
Frequency = 1
[1] 5366.698 5325.115 5311.969 5288.542 5331.115
But I want the same format as my original data, i.e. I also want the timestamps next to the values.
How can I achieve that?

Use apply.daily from the xts package after converting your data to an xts object. Something like this should work:
x2 = read.table(header=TRUE, text=' "Index" "temp" "m"
1 "2012-02-07 18:15:13" "4297"
2 "2012-02-07 18:30:04" "4296"
3 "2012-02-07 18:45:10" "4297"
4 "2012-02-07 19:00:01" "4297"
5 "2012-02-07 19:15:07" "4298"
6 "2012-02-07 19:30:13" "4299"
7 "2012-02-07 19:45:04" "4299"
8 "2012-02-07 20:00:10" "4299"
9 "2012-02-07 20:15:01" "4300"
10 "2012-02-07 20:30:07" "4301"')
x2$temp = as.POSIXct(strptime(x2$temp, "%Y-%m-%d %H:%M:%S"))
require(xts)
x2 = xts(x = x2$m, order.by = x2$temp)
apply.daily(x2, mean)
## [,1]
## 2012-02-07 20:30:07 4298.3
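If you then want a plain data frame with the timestamps next to the values, as in the original file, you can pull the two parts out of the xts object with index() and coredata(). A minimal sketch, using a hypothetical daily-mean series like the one apply.daily() returns (the numbers are made up):

```r
library(xts)

# hypothetical daily-mean series like the one apply.daily() returns above
daily <- xts(c(4298.3, 4300.1),
             order.by = as.POSIXct(c("2012-02-07 20:30:07",
                                     "2012-02-08 20:30:07")))
# index() recovers the timestamps, coredata() the plain values
daily_df <- data.frame(time = index(daily), mean = coredata(daily))
daily_df
```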
Update: Your problem in a reproducible format (with fake data)
We don't always need the actual dataset to be able to help troubleshoot.
set.seed(1) # So you can get the same numbers as I do
x = data.frame(datetime = seq(ISOdatetime(1970, 1, 1, 0, 0, 0),
                              length = 384, by = 900),
               m = sample(2000:4000, 384, replace = TRUE))
head(x)
# datetime m
# 1 1970-01-01 00:00:00 2531
# 2 1970-01-01 00:15:00 2744
# 3 1970-01-01 00:30:00 3146
# 4 1970-01-01 00:45:00 3817
# 5 1970-01-01 01:00:00 2403
# 6 1970-01-01 01:15:00 3797
require(xts)
x2 = xts(x$m, x$datetime)
head(x2)
# [,1]
# 1970-01-01 00:00:00 2531
# 1970-01-01 00:15:00 2744
# 1970-01-01 00:30:00 3146
# 1970-01-01 00:45:00 3817
# 1970-01-01 01:00:00 2403
# 1970-01-01 01:15:00 3797
apply.daily(x2, mean)
# [,1]
# 1970-01-01 23:45:00 3031.302
# 1970-01-02 23:45:00 3043.250
# 1970-01-03 23:45:00 2896.771
# 1970-01-04 23:45:00 2996.479
Update 2: A workaround alternative
(Using the fake data I've provided in the above update.)
data.frame(time = x[seq(96, nrow(x), by=96), 1],
           mean = aggregate(ts(x[, 2], freq = 96), 1, mean))
# time mean
# 1 1970-01-01 23:45 3031.302
# 2 1970-01-02 23:45 3043.250
# 3 1970-01-03 23:45 2896.771
# 4 1970-01-04 23:45 2996.479

This would be a way to do it in base R:
x2 <- within(x2, {
  temp <- as.POSIXct(temp, format='%Y-%m-%d %H:%M:%S')
  days <- as.POSIXct(cut(temp, breaks='days'))
  m <- as.numeric(m)
})
with(x2, aggregate(m, by=list(days=days), mean))
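A quick self-contained check of that base-R approach, using a couple of made-up rows rather than the asker's file (the values below are fake):

```r
# two fake days of 15-minute readings
x2 <- data.frame(temp = c("2012-02-07 18:15:13", "2012-02-07 19:30:13",
                          "2012-02-08 09:00:00"),
                 m = c("4297", "4299", "4301"),
                 stringsAsFactors = FALSE)
x2 <- within(x2, {
  temp <- as.POSIXct(temp, format = '%Y-%m-%d %H:%M:%S')
  days <- as.POSIXct(cut(temp, breaks = 'days'))  # midnight of each day
  m    <- as.numeric(m)
})
# one mean per day, with the day kept next to the value
res <- with(x2, aggregate(m, by = list(days = days), mean))
res
```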

Related

Linear interpolation R

I have this data.frame (12x2) called df_1, which represents monthly values:
month df_test
[1,] 1 -1.4408567
[2,] 2 -1.0007642
[3,] 3 2.1454113
[4,] 4 1.6935537
[5,] 5 0.1149219
[6,] 6 -1.3205144
[7,] 7 1.0277486
[8,] 8 1.0323482
[9,] 9 -0.1442319
[10,] 10 -0.2091197
[11,] 11 -0.6803158
[12,] 12 0.5965196
and this data.frame (8760x2) called df_2, where each row represents a value associated with a one-hour interval of a day. This data.frame contains hourly values for one year:
time df_time
1 2015-01-01 00:00:00 -0.4035650
2 2015-01-01 01:00:00 0.1800579
3 2015-01-01 02:00:00 -0.3770589
4 2015-01-01 03:00:00 0.2573456
5 2015-01-01 04:00:00 1.2000178
6 2015-01-01 05:00:00 -0.4276127
...........................................
time df_time
8755 2015-12-31 18:00:00 1.3540119
8756 2015-12-31 19:00:00 0.4852843
8757 2015-12-31 20:00:00 -0.9194670
8758 2015-12-31 21:00:00 -1.0751814
8759 2015-12-31 22:00:00 1.0097749
8760 2015-12-31 23:00:00 -0.1032468
I want to interpolate df_1 to each hour of each day. The problem is that not all months have the same number of days.
Finally, we should obtain a data.frame called df_3 (8760x2) with values interpolated between those of df_1.
Thanks for the help!
Here it is done with zoo. I'm assuming each monthly value is associated with a specific datetime stamp (middle of the month, midnight); you have to pick one. If you want a different datetime stamp, just change the value.
library(zoo)
library(dplyr)
library(tidyr)
df_3 <- df_1 %>%
  mutate(time = paste(2015, month, "15 00:00:00", sep = "-"),
         time = as.POSIXct(strptime(time, "%Y-%m-%d %H:%M:%S"))) %>%
  full_join(df_2) %>%
  arrange(time) %>%
  mutate(df_test = na.approx(df_test, rule = 2))
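If you'd rather avoid the extra packages, the same interpolation can be sketched in base R with approx(). The monthly values below are fake, and the anchor point (15th of each month at midnight, UTC) is the same assumption as above:

```r
# fake monthly values anchored at the 15th of each month, midnight UTC
month_times <- as.POSIXct(sprintf("2015-%02d-15 00:00:00", 1:12), tz = "UTC")
month_vals  <- sin(1:12)
# hourly grid for the whole year (8760 points; 2015 is not a leap year)
hourly <- seq(as.POSIXct("2015-01-01 00:00:00", tz = "UTC"),
              as.POSIXct("2015-12-31 23:00:00", tz = "UTC"), by = "hour")
# rule = 2 carries the first/last monthly value past the end anchors,
# like na.approx(..., rule = 2) above
df_3 <- data.frame(time = hourly,
                   df_test = approx(as.numeric(month_times), month_vals,
                                    xout = as.numeric(hourly), rule = 2)$y)
```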

R: count 15 minutes interval in time

I would like to count the number of sessions started in each 15-minute interval on business days within a large dataset.
My data looks like:
df <-
Start_datetime End_datetime Duration Volume
2016-04-01 06:20:55 2016-04-01 14:41:22 08:20:27 8.360
2016-04-01 08:22:27 2016-04-01 08:22:40 00:00:13 0.000
2016-04-01 08:38:53 2016-04-01 09:31:58 00:53:05 12.570
2016-04-01 09:33:57 2016-04-01 12:37:43 03:03:46 7.320
2016-04-01 10:05:03 2016-04-01 16:41:16 06:36:13 9.520
2016-04-01 12:07:57 2016-04-02 22:22:32 34:14:35 7.230
2016-04-01 16:56:55 2016-04-02 10:40:17 17:43:22 5.300
2016-04-01 17:29:18 2016-04-01 19:50:29 02:21:11 7.020
2016-04-01 17:42:39 2016-04-01 19:45:38 02:02:59 2.430
2016-04-01 17:47:57 2016-04-01 20:26:35 02:38:38 8.090
2016-04-01 22:00:15 2016-04-04 08:22:21 58:22:06 4.710
2016-04-02 01:12:38 2016-04-02 09:49:00 08:36:22 3.150
2016-04-02 01:32:00 2016-04-02 12:49:47 11:17:47 5.760
2016-04-02 07:28:48 2016-04-04 06:58:56 47:30:08 0.000
2016-04-02 07:55:18 2016-04-05 07:55:15 71:59:57 0.240
I would like to count all sessions started per 15-minute interval, where:
For business days
Time PTU Count
00:00:00 - 00:15:00 1 10 #(where Count is the number of sessions started between 00:00:00 and 00:15:00)
00:15:00 - 00:30:00 2 6
00:30:00 - 00:45:00 3 5
00:45:00 - 01:00:00 4 3
And so on, with the same table for the weekend.
I have tried the cut function:
df$PTU <- table (cut(df$Start_datetime, breaks="15 minutes"))
data.frame(PTU)
EDIT: When I run this I receive the following error:
Error in cut.default(df$Start_datetime, breaks = "15 minutes") : 'x' must be numeric
I have also tried some lubridate functions, but I can't seem to make them work. My final goal is to create a table like the one above, but with 15-minute intervals.
Here is a more or less complete process from datetime strings to the format you want. The start is a string vector:
Start_time <-
c("2016-04-01 06:20:55", "2016-04-01 08:22:27", "2016-04-01 08:38:53",
"2016-04-01 09:33:57", "2016-04-01 10:05:03", "2016-04-01 12:07:57",
"2016-04-01 16:56:55", "2016-04-01 17:29:18", "2016-04-01 17:42:39",
"2016-04-01 17:47:57", "2016-04-01 22:00:15", "2016-04-02 01:12:38",
"2016-04-02 01:32:00", "2016-04-02 07:28:48", "2016-04-02 07:55:18"
)
df <- data.frame(Start_time)
And this is the actual processing:
## We will use two packages
library(lubridate)
library(data.table)
# convert df to data.table, parse the datetime string
setDT(df)[, Start_time := ymd_hms(Start_time)]
# floor time by 15 min to assign the appropriate slot (new variable Start_time_slot)
df[, Start_time_slot := floor_date(Start_time, "15 min")]
# aggregate by wday and time in a date
start_time_data_frame <- df[, .N, by = .(wday(Start_time_slot),
                                         format(Start_time_slot, format="%H:%M:%S"))]
# output looks like this
start_time_data_frame
## wday time N
## 1: 6 06:15:00 1
## 2: 6 08:15:00 1
## 3: 6 08:30:00 1
## 4: 6 09:30:00 1
## 5: 6 10:00:00 1
## 6: 6 12:00:00 1
## 7: 6 16:45:00 1
## 8: 6 17:15:00 1
## 9: 6 17:30:00 1
## 10: 6 17:45:00 1
## 11: 6 22:00:00 1
## 12: 7 01:00:00 1
## 13: 7 01:30:00 1
## 14: 7 07:15:00 1
## 15: 7 07:45:00 1
There are two things you have to keep in mind when using cut on datetimes:
Make sure your data actually has a POSIXt class. I'm quite sure yours doesn't, or R would be dispatching to the cut.POSIXt method rather than cut.default.
"15 minutes" should be "15 min". See ?cut.POSIXt
So this works:
Start_datetime <- as.POSIXct(
c("2016-04-01 06:20:55",
"2016-04-01 06:22:12",
"2016-04-01 05:30:12")
)
table(cut(Start_datetime, breaks = "15 min"))
# 2016-04-01 05:30:00 2016-04-01 05:45:00 2016-04-01 06:00:00 2016-04-01 06:15:00
# 1 0 0 2
Note that the output gives you the start of the 15 minute interval as names of the table.
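To get from that table to the layout in the question (one row per interval with a PTU number), one possible sketch, reusing the three example timestamps above:

```r
Start_datetime <- as.POSIXct(c("2016-04-01 06:20:55",
                               "2016-04-01 06:22:12",
                               "2016-04-01 05:30:12"))
counts <- table(cut(Start_datetime, breaks = "15 min"))
# the names of the table are the interval starts; number them as PTUs
ptu <- data.frame(Time  = names(counts),
                  PTU   = seq_along(counts),
                  Count = as.integer(counts))
ptu
```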

combine two data frames with daily and hourly records

I have two data frames: A
y_m_d SNOW
1 2010-01-01 0.0
2 2010-01-02 0.0
3 2010-01-03 0.1
4 2010-01-04 0.0
5 2010-01-05 0.0
6 2010-01-06 2.3
B:
time temp
1 2010-01-01 00:00:00 20.00000
2 2010-01-01 01:00:00 18.33333
3 2010-01-01 02:00:00 17.00000
4 2010-01-01 03:00:00 25.33333
5 2010-01-01 04:00:00 23.33333
I want to combine the two data frames based on time. A is a daily record and B is an hourly record. I want to fill in the A value at the beginning of each day at 00:00:00 and leave the rest of the day blank.
The result should look like this:
time temp SNOW
1 2010-01-01 00:00:00 20.00000 0.0
2 2010-01-01 01:00:00 18.33333
3 2010-01-01 02:00:00 17.00000
4 2010-01-01 03:00:00 25.33333
5 2010-01-01 04:00:00 23.33333
6 2010-01-01 05:00:00 22.66667
Could you please give me some advice?
Thank you.
Here's a quick solution:
A$y_m_d <- as.Date(A$y_m_d)
B$SNOW <- sapply(as.Date(B$time), function(x) A[A$y_m_d==x, "SNOW"])
This might not be the most efficient way in the world to do this, but it is a solution. I attempted to create data with exactly the same variable types and structure as yours.
# Create example data
y_m_d <- as.POSIXct(c("2010-01-01", "2010-01-02"), format="%Y-%m-%d")
SNOW <- c(0, 0.1)
time <- as.POSIXct(c("2010-01-01 00:00:00", "2010-01-01 01:00:00", "2010-01-01 02:00:00", "2010-01-02 00:00:00", "2010-01-02 01:00:00", "2010-01-02 02:00:00"), format="%Y-%m-%d %H:%M:%S")
temp <- rnorm(6, mean=20, sd=4)
A <- data.frame(y_m_d, SNOW)
B <- data.frame(time, temp)
# Check data
A
## y_m_d SNOW
## 1 2010-01-01 0.0
## 2 2010-01-02 0.1
B
## time temp
## 1 2010-01-01 00:00:00 17.52852
## 2 2010-01-01 01:00:00 12.42715
## 3 2010-01-01 02:00:00 21.79584
## 4 2010-01-02 00:00:00 19.90442
## 5 2010-01-02 01:00:00 16.40524
## 6 2010-01-02 02:00:00 16.86854
# Loop through days and construct new SNOW variable
days <- as.POSIXct(format(B$time, "%Y-%m-%d"), format="%Y-%m-%d")
SNOW_new <- c()
for (i in 1:nrow(A)) {
  SNOW_new <- c(SNOW_new, A[i, "SNOW"], rep(NA, sum(days==A[i, "y_m_d"])-1))
}
# Create new data frame
C <- data.frame(B, SNOW_new)
C
## time temp SNOW_new
## 1 2010-01-01 00:00:00 17.52852 0.0
## 2 2010-01-01 01:00:00 12.42715 NA
## 3 2010-01-01 02:00:00 21.79584 NA
## 4 2010-01-02 00:00:00 19.90442 0.1
## 5 2010-01-02 01:00:00 16.40524 NA
## 6 2010-01-02 02:00:00 16.86854 NA
I put NA rather than a blank space because I assume you want the SNOW_new variable to be numeric, not character. But if you do want a blank space, you can just replace the NA in the rep function with a "".
First, make sure the time variables are in the right format:
A$y_m_d <- as.POSIXct(A$y_m_d, format="%Y-%m-%d")
B$time <- as.POSIXct(B$time, format="%Y-%m-%d %H:%M:%S")
The xts package is well suited to merging time series data:
# install.packages("xts")
library(xts)
A <- xts(A[,-1], order.by = A$y_m_d)
B <- xts(B[,-1], order.by = B$time)
merge.xts(A, B)
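A self-contained sketch of that merge with fake temperatures: the daily SNOW value lands on each day's 00:00:00 row and every other hour comes out NA.

```r
library(xts)

A <- data.frame(y_m_d = as.POSIXct(c("2010-01-01", "2010-01-02"), tz = "UTC"),
                SNOW  = c(0, 0.1))
B <- data.frame(time = as.POSIXct(c("2010-01-01 00:00:00",
                                    "2010-01-01 01:00:00",
                                    "2010-01-02 00:00:00"), tz = "UTC"),
                temp = c(20, 18.3, 19.9))
Ax <- xts(A[, -1], order.by = A$y_m_d)
Bx <- xts(B[, -1], order.by = B$time)
# merge on the union of the two time indexes; unmatched rows become NA
res <- merge.xts(Ax, Bx)
res
```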

summarize by time interval not working

I have the following data as a list of POSIXct times that span one month. Each of them represents a bike delivery. My aim is to find the average number of bike deliveries per ten-minute interval over a 24-hour period (producing a total of 144 rows). First all of the trips need to be summed and binned into intervals, then divided by the number of days. So far, I've managed to write code that sums trips per 10-minute interval, but it produces incorrect values. I am not sure where it went wrong.
The data looks like this:
head(start_times)
[1] "2014-10-21 16:58:13 EST" "2014-10-07 10:14:22 EST" "2014-10-20 01:45:11 EST"
[4] "2014-10-17 08:16:17 EST" "2014-10-07 17:46:36 EST" "2014-10-28 17:32:34 EST"
length(start_times)
[1] 1747
The code looks like this:
library(lubridate)
library(dplyr)
tripduration <- floor(runif(1747) * 1000)
time_bucket <- start_times - minutes(minute(start_times) %% 10) - seconds(second(start_times))
df <- data.frame(tripduration, start_times, time_bucket)
summarized <- df %>%
group_by(time_bucket) %>%
summarize(trip_count = n())
summarized <- as.data.frame(summarized)
out_buckets <- data.frame(out_buckets = seq(as.POSIXlt("2014-10-01 00:00:00"), as.POSIXct("2014-10-31 23:0:00"), by = 600))
out <- left_join(out_buckets, summarized, by = c("out_buckets" = "time_bucket"))
out$trip_count[is.na(out$trip_count)] <- 0
head(out)
out_buckets trip_count
1 2014-10-01 00:00:00 0
2 2014-10-01 00:10:00 0
3 2014-10-01 00:20:00 0
4 2014-10-01 00:30:00 0
5 2014-10-01 00:40:00 0
6 2014-10-01 00:50:00 0
dim(out)
[1] 4459 2
test <- format(out$out_buckets,"%H:%M:%S")
test2 <- out$trip_count
test <- cbind(test, test2)
colnames(test)[1] <- "interval"
colnames(test)[2] <- "count"
test <- as.data.frame(test)
test$count <- as.numeric(test$count)
test <- aggregate(count~interval, test, sum)
head(test, n = 20)
interval count
1 00:00:00 32
2 00:10:00 33
3 00:20:00 32
4 00:30:00 31
5 00:40:00 34
6 00:50:00 34
7 01:00:00 31
8 01:10:00 33
9 01:20:00 39
10 01:30:00 41
11 01:40:00 36
12 01:50:00 31
13 02:00:00 33
14 02:10:00 34
15 02:20:00 32
16 02:30:00 32
17 02:40:00 36
18 02:50:00 32
19 03:00:00 34
20 03:10:00 39
but this is impossible because when I sum the counts
sum(test$count)
[1] 7494
I get 7494, whereas the total should be 1747.
I'm not sure where I went wrong or how to simplify this code to get the correct result.
I've done what I can, but I can't reproduce your issue without your data.
library(dplyr)
I created the full sequence of 10 minute blocks:
blocks.of.10mins <- data.frame(out_buckets=seq(as.POSIXct("2014/10/01 00:00"), by="10 mins", length.out=30*24*6))
Then split the start_times into the same bins. Note: I created a baseline time of midnight to force the blocks to align to 10-minute intervals. Removing this later is an exercise for the reader. I also changed one of your data points so that there was at least one example of multiple records in the same bin.
start_times <- as.POSIXct(c("2014-10-01 00:00:00", ## added
"2014-10-21 16:58:13",
"2014-10-07 10:14:22",
"2014-10-20 01:45:11",
"2014-10-17 08:16:17",
"2014-10-07 10:16:36", ## modified
"2014-10-28 17:32:34"))
trip_times <- data.frame(start_times) %>%
  mutate(out_buckets = as.POSIXct(cut(start_times, breaks="10 mins")))
The start_times and all the 10-minute intervals can then be merged:
trips_merged <- merge(trip_times, blocks.of.10mins, by="out_buckets", all=TRUE)
These can then be grouped by 10-minute block and counted:
trips_merged %>% filter(!is.na(start_times)) %>%
  group_by(out_buckets) %>%
  summarise(trip_count=n())
Source: local data frame [6 x 2]
out_buckets trip_count
(time) (int)
1 2014-10-01 00:00:00 1
2 2014-10-07 10:10:00 2
3 2014-10-17 08:10:00 1
4 2014-10-20 01:40:00 1
5 2014-10-21 16:50:00 1
6 2014-10-28 17:30:00 1
Instead, if we only consider the time of day, not the date:
trips_merged2 <- trips_merged
trips_merged2$out_buckets <- format(trips_merged2$out_buckets, "%H:%M:%S")
trips_merged2 %>% filter(!is.na(start_times)) %>%
  group_by(out_buckets) %>%
  summarise(trip_count=n())
Source: local data frame [6 x 2]
out_buckets trip_count
(chr) (int)
1 00:00:00 1
2 01:40:00 1
3 08:10:00 1
4 10:10:00 2
5 16:50:00 1
6 17:30:00 1
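The original goal, average deliveries per 10-minute time-of-day slot, can then be sketched by dividing each slot's total by the number of days observed. A minimal base-R example with three fake timestamps (kept in UTC so the modulo arithmetic on the underlying seconds is safe):

```r
start_times <- as.POSIXct(c("2014-10-01 00:03:00",
                            "2014-10-02 00:07:00",
                            "2014-10-01 01:45:30"), tz = "UTC")
# floor each timestamp to its 10-minute slot, keeping only the time of day
slot <- format(as.POSIXct(as.numeric(start_times) %/% 600 * 600,
                          origin = "1970-01-01", tz = "UTC"), "%H:%M:%S")
n_days <- length(unique(as.Date(start_times)))
# total count per slot divided by number of days = average per slot
avg <- aggregate(list(avg_count = rep(1, length(slot))),
                 by = list(interval = slot),
                 FUN = function(x) sum(x) / n_days)
avg
```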

how to determine the mean of time series data?

I have sample measurement data taken every 2 seconds, and I would like to determine the mean and standard deviation of this time series every 2 minutes. Any help will be appreciated.
Date       Times    Pos Date and time       pressure temp
01.01.2013 02:20:01 A   2013-01-01 02:20:25 .335     140.741
01.01.2013 02:20:02 A   2013-01-01 02:20:26 .091     140.741
1.01.2013  02:20:03 A   2013-01-01 02:20:26 .091     140.741
# example data
set.seed(1)
df <- data.frame(dates = sort(Sys.time() + sample(1:1000, size=100)),
                 values = rnorm(100, 100, 50))
# 2 minute groups
df$groups <- cut.POSIXt(df$dates, breaks="2 min")
# summary
require(plyr)
ddply(df, "groups", summarise, mean=mean(values), sd=sd(values))
# groups mean sd
# 1 2014-02-03 14:35:00 114.60027 55.67169
# 2 2014-02-03 14:37:00 107.16711 57.97990
# 3 2014-02-03 14:39:00 99.36876 45.03428
# 4 2014-02-03 14:41:00 111.37508 44.37829
# 5 2014-02-03 14:43:00 93.33474 46.33670
# 6 2014-02-03 14:45:00 108.71795 40.43259
# 7 2014-02-03 14:47:00 85.60400 29.38563
# 8 2014-02-03 14:49:00 83.57215 69.01886
# 9 2014-02-03 14:51:00 26.82735 12.52657
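If you prefer to stay in base R rather than depend on plyr, the same per-group summary can be sketched with aggregate(), which can return both statistics at once as a matrix column (same fake df as above):

```r
set.seed(1)
df <- data.frame(dates = sort(Sys.time() + sample(1:1000, size = 100)),
                 values = rnorm(100, 100, 50))
# 2 minute groups
df$groups <- cut.POSIXt(df$dates, breaks = "2 min")
# base-R equivalent of the ddply() summary: mean and sd per group
res <- aggregate(df$values, by = list(groups = df$groups),
                 FUN = function(v) c(mean = mean(v), sd = sd(v)))
head(res)
```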
Edit:
Regarding your example data:
df <- read.table(sep=";", header=TRUE, stringsAsFactors=FALSE, text="
Date;Times;Pos;Date and time;pressure;temp
01.01.2013;02:20:01;A;2013-01-01 02:20:25;.335;140.741
01.01.2013;02:20:02;A;2013-01-01 02:20:26;.091;140.741
1.01.2013;02:20:03;A;2013-01-01 02:20:26;.091;140.741")
df$dates <- as.POSIXct(paste(df$Date, df$Times),
                       format="%d.%m.%Y %H:%M:%S")
df$groups <- cut.POSIXt(df$dates, breaks="2 sec")
require(plyr)
ddply(df, "groups", summarise,
      mean_pressure=mean(pressure), sd_pressure=sd(pressure),
      mean_temp=mean(temp), sd_temp=sd(temp))
# groups mean_pressure sd_pressure mean_temp sd_temp
# 1 2013-01-01 02:20:01 0.213 0.1725341 140.741 0
# 2 2013-01-01 02:20:03 0.091 NA 140.741 NA
