I have a dataframe as:
T1 T2 T3 timestamp
45.37 44.48 13 2015-11-05 10:23:00
44.94 44.55 13.37 2015-11-05 10:24:00
45.32 44.44 13.09 2015-11-05 10:27:00
45.46 44.51 13.29 2015-11-05 10:28:00
45.46 44.65 13.18 2015-11-05 10:29:16
45.96 44.85 13.23 2015-11-05 10:32:00
45.52 44.56 13.53 2015-11-05 10:36:00
45.36 44.62 13.25 2015-11-05 10:37:00
I want to create a new dataframe that contains values of T1, T2 and T3 aggregated over 5-minute intervals based on the timestamp column. I did come across aggregate, and it seems to use one of the columns to group/aggregate the corresponding values in the other columns.
If no rows fall within a 5-minute interval, that interval's row should contain NAs. I would also like another column that indicates the number of items used to compute each 5-minute average.
Looking for the most efficient way of doing this in R. Thanks
First make sure the timestamp column is a date-time (POSIXct) column. You can skip this line if it is already in that format.
df1$timestamp <- as.POSIXct(df1$timestamp)
xts has some nice functions for working with time series, especially rolling functions and time-aggregating functions. In this case period.apply can help out.
library(xts)
# create xts object. Be sure to exclude the timestamp column otherwise you end up with a character matrix.
df1_xts <- as.xts(df1[, -4], order.by = df1$timestamp)
# sum per 5 minute intervals
df1_xts_summed <- period.apply(df1_xts, endpoints(df1_xts, on = "minutes", k = 5), colSums)
# count rows per 5 minute interval and add to data
df1_xts_summed$nrows <- period.apply(df1_xts$T1, endpoints(df1_xts, on = "minutes", k = 5), nrow)
df1_xts_summed
T1 T2 T3 nrows
2015-11-05 10:24:00 90.31 89.03 26.37 2
2015-11-05 10:29:16 136.24 133.60 39.56 3
2015-11-05 10:32:00 45.96 44.85 13.23 1
2015-11-05 10:37:00 90.88 89.18 26.78 2
If you want it all back into a data.frame:
df_final <- data.frame(timestamp = index(df1_xts_summed), coredata(df1_xts_summed))
df_final
timestamp T1 T2 T3 nrows
1 2015-11-05 10:24:00 90.31 89.03 26.37 2
2 2015-11-05 10:29:16 136.24 133.60 39.56 3
3 2015-11-05 10:32:00 45.96 44.85 13.23 1
4 2015-11-05 10:37:00 90.88 89.18 26.78 2
Edit: if you want everything rounded to 5 minutes, with these as the timestamps, you need to do the following:
The first step is to replace the timestamps with the 5-minute interval endpoints, taking into account the starting minute of the timestamps. For this I use ceiling_date from the lubridate package and add to it the difference between the ceiling of the first timestamp and the first timestamp itself. This returns the last value of each interval. (If you want to use the start of the interval, use floor_date instead.)
df1$timestamp <- lubridate::ceiling_date(df1$timestamp, "5 mins") + difftime(lubridate::ceiling_date(first(df1$timestamp), "5 mins"), first(df1$timestamp), unit = "secs")
Next, the same xts code as before, which returns the same data, but the timestamps are now the last values of the 5-minute intervals.
df1_xts <- as.xts(df1[, -4], order.by = df1$timestamp)
df1_xts_summed <- period.apply(df1_xts, endpoints(df1_xts, on = "minutes", k = 5), colSums)
df1_xts_summed$nrows <- period.apply(df1_xts$T1, endpoints(df1_xts, on = "minutes", k = 5), nrow)
df_final <- data.frame(timestamp = index(df1_xts_summed), coredata(df1_xts_summed))
df_final
timestamp T1 T2 T3 nrows
1 2015-11-05 10:27:00 90.31 89.03 26.37 2
2 2015-11-05 10:32:00 136.24 133.60 39.56 3
3 2015-11-05 10:37:00 45.96 44.85 13.23 1
4 2015-11-05 10:42:00 90.88 89.18 26.78 2
Data:
df1 <- structure(list(T1 = c(45.37, 44.94, 45.32, 45.46, 45.46, 45.96,
45.52, 45.36), T2 = c(44.48, 44.55, 44.44, 44.51, 44.65, 44.85,
44.56, 44.62), T3 = c(13, 13.37, 13.09, 13.29, 13.18, 13.23,
13.53, 13.25), timestamp = c("2015-11-05 10:23:00", "2015-11-05 10:24:00",
"2015-11-05 10:27:00", "2015-11-05 10:28:00", "2015-11-05 10:29:16",
"2015-11-05 10:32:00", "2015-11-05 10:36:00", "2015-11-05 10:37:00"
)), class = "data.frame", row.names = c(NA, -8L))
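For comparison, here is a base-R alternative that needs no extra packages — a sketch, not a drop-in replacement for the xts answer: cut() the timestamps into 5-minute bins and aggregate a mean plus a count per bin. Note that aggregate drops empty intervals; to keep them as NA rows (as the question asked), you would additionally merge against data.frame(interval = levels(bins)).

```r
# Rebuild df1 from the data above, with timestamp already POSIXct
df1 <- data.frame(
  T1 = c(45.37, 44.94, 45.32, 45.46, 45.46, 45.96, 45.52, 45.36),
  T2 = c(44.48, 44.55, 44.44, 44.51, 44.65, 44.85, 44.56, 44.62),
  T3 = c(13, 13.37, 13.09, 13.29, 13.18, 13.23, 13.53, 13.25),
  timestamp = as.POSIXct(c("2015-11-05 10:23:00", "2015-11-05 10:24:00",
                           "2015-11-05 10:27:00", "2015-11-05 10:28:00",
                           "2015-11-05 10:29:16", "2015-11-05 10:32:00",
                           "2015-11-05 10:36:00", "2015-11-05 10:37:00"))
)
# cut() into 5-minute bins starting at the first timestamp
bins <- cut(df1$timestamp, breaks = "5 mins")
# mean per bin, then the row count per bin from table()
res <- aggregate(df1[, c("T1", "T2", "T3")], by = list(interval = bins), FUN = mean)
res$nrows <- as.integer(table(bins)[as.character(res$interval)])
```
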
So I have a large data frame with a date-time column of class POSIXct and another column with price data of class numeric. The date-time column has values of the form "1998-12-07 02:00:00 AEST" that are half-hour observations across 20 years. A sample data set can be generated with the following code (vary the 100 to whatever number of observations is necessary):
data.frame(date.time = seq.POSIXt(as.POSIXct("1998-12-07 02:00:00 AEST"), as.POSIXct(Sys.Date()+1), by = "30 min")[1:100], price = rnorm(100))
I want to look at a typical year and a typical week. So for the typical year I have the following code:
mean.year <- aggregate(df$price, by = list(format(df$date.time, "%m-%d %H:%M")), mean)
It seems to give me what I want:
Group.1 x
1 01-01 00:00 31.86200
2 01-01 00:30 34.20526
3 01-01 01:00 28.40105
4 01-01 01:30 26.01684
5 01-01 02:00 23.68895
6 01-01 02:30 23.70632
However the column "Group.1" is of class character and I would like it to be of class POSIXct. How can I do this?
For the typical week I have the following code:
mean.week <- aggregate(df$price, by = list(format(df$date.time, "%wday %H:%M")), mean)
The output is as follows:
Group.1 x
1 0day 00:00 33.05613
2 0day 00:30 30.92815
3 0day 01:00 29.26245
4 0day 01:30 29.47959
5 0day 02:00 29.18380
6 0day 02:30 25.99400
Again, the column "Group.1" is of class character and I would like POSIXct. Also, I would like to have the day of the week as "Monday", "Tuesday", etc. instead of 0day. How would I do this?
Convert the datetime to a character string that can validly be converted back to POSIXct and then do so:
mean.year <- aggregate(df["price"],
by = list(time = as.POSIXct(format(df$date.time, "2000-%m-%d %H:%M"))), mean)
head(mean.year)
## time price
## 1 2000-12-07 02:00:00 -0.56047565
## 2 2000-12-07 02:30:00 -0.23017749
## 3 2000-12-07 03:00:00 1.55870831
## 4 2000-12-07 03:30:00 0.07050839
## 5 2000-12-07 04:00:00 0.12928774
## 6 2000-12-07 04:30:00 1.71506499
To get the day of the week use %a or %A -- see ?strptime for the list of percent codes.
mean.week <- aggregate(df["price"],
by = list(time = format(df$date.time, "%a %H:%M")), mean)
head(mean.week)
## time price
## 1 Mon 02:00 -0.56047565
## 2 Mon 02:30 -0.23017749
## 3 Mon 03:00 1.55870831
## 4 Mon 03:30 0.07050839
## 5 Mon 04:00 0.12928774
## 6 Mon 04:30 1.71506499
Note
The input df in reproducible form -- note that set.seed is needed to make it reproducible:
set.seed(123)
df <- data.frame(date.time = seq.POSIXt(as.POSIXct("1998-12-07 02:00:00 AEST"),
as.POSIXct(Sys.Date()+1), by = "30 min")[1:100], price = rnorm(100))
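If you also want the full weekday names ("Monday", "Tuesday", ...) in weekday order rather than alphabetical order, one option — a sketch, and note the day names assume an English locale — is to format with %A and then reorder the result via a factor built from the day-name prefix:

```r
set.seed(123)
df <- data.frame(
  date.time = seq.POSIXt(as.POSIXct("1998-12-07 02:00:00"),
                         by = "30 min", length.out = 100),
  price = rnorm(100)
)
# %A gives the full weekday name, e.g. "Monday 02:00"
mean.week <- aggregate(df["price"],
                       by = list(time = format(df$date.time, "%A %H:%M")), mean)
# order Monday..Sunday using the day-name prefix, then by time of day
day <- factor(sub(" .*", "", mean.week$time),
              levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                         "Friday", "Saturday", "Sunday"))
mean.week <- mean.week[order(day, sub("^\\S+ ", "", mean.week$time)), ]
```
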
I have the following dataframe in R:
Date Value
1986-01-02 25.67
1986-01-03 23.56
1986-01-06 34.56
1986-01-07 23.77
1986-01-08 25.67
1986-01-09 26.56
1986-01-10 25.56
1986-01-13 28.77
.
.
.
2018-07-03 73.45
2018-07-04 74.34
2018-07-05 73.45
2018-07-06 74.34
2018-07-09 72.34
The Date column is in POSIXct format and excludes weekends (Saturday and Sunday). I want to convert it into a daily time series in R.
I am doing the following:
ts_object <- ts(df,frequency = 365)
It gives me the following ts:
Time Series:
Start = c(1, 1)
End = c(23, 193)
Frequency = 365
Date Value
1.000000 505008000 25.67
1.002740 505094400 23.56
1.005479 505353600 34.56
Why is it not taking the date in the correct format? Am I setting the frequency right for a daily time series object?
You will have to add the missing days (Saturday and Sunday), because frequency = 365 doesn't account for missing observations.
One way to generate them is as follows:
df <- data.frame( Date = seq(as.Date("1986-01-02"), as.Date("1986-01-07"), 1))
df$Date <- as.character(df$Date)
ds <- read.table(text = "Date Value
1986-01-02 25.67
1986-01-03 23.56
1986-01-06 34.56
1986-01-07 23.77", header = T)
df <- merge(df, ds, by = "Date", all.x = T)
df[is.na(df)] <- 0
df
Date Value
1 1986-01-02 25.67
2 1986-01-03 23.56
3 1986-01-04 0.00
4 1986-01-05 0.00
5 1986-01-06 34.56
6 1986-01-07 23.77
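If you would rather keep the weekend gaps as NA instead of zero-filling them, a base-R sketch of the same idea builds the full daily sequence and match()es the observed values into it:

```r
# observed business-day data (from the example above)
dates <- as.Date(c("1986-01-02", "1986-01-03", "1986-01-06", "1986-01-07"))
vals  <- c(25.67, 23.56, 34.56, 23.77)
# full daily index; match() leaves NA where no observation exists
all_days <- seq(min(dates), max(dates), by = "day")
daily <- data.frame(Date  = all_days,
                    Value = vals[match(all_days, dates)])
```

The resulting Value column can then be passed to ts() (or interpolated first, e.g. with zoo::na.approx) depending on what the downstream model expects.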
I have the following dataframe (ts1):
D1 Diff
1 20/11/2014 16:00 0.00
2 20/11/2014 17:00 0.01
3 20/11/2014 19:00 0.03
I would like to add a new column to ts1 containing the difference in hours between successive D1 values (dates).
The new ts1 should be:
D1 Diff N
1 20/11/2014 16:00 0.00
2 20/11/2014 17:00 0.01 1
3 20/11/2014 19:00 0.03 2
For calculating the difference in hours independently I use:
library(lubridate)
difftime(dmy_hm("29/12/2014 11:00"), dmy_hm("29/12/2014 9:00"), units="hours")
I know that for calculating the difference between each row I need to transform the ts1 into matrix.
I use the following command:
> ts1$N<-difftime(dmy_hm(as.matrix(ts1$D1)), units="hours")
And I get:
Error in as.POSIXct(time2) : argument "time2" is missing, with no default
Suppose ts1 is as shown in Note 2 at the end. Then create a POSIXct variable tt from D1, convert tt to numeric giving the number of seconds since the Epoch, divide that by 3600 to get the number of hours since the Epoch and take differences. No packages are used.
tt <- as.POSIXct(ts1$D1, format = "%d/%m/%Y %H:%M")
m <- transform(ts1, N = c(NA, diff(as.numeric(tt) / 3600)))
giving:
> m
D1 Diff N
1 20/11/2014 16:00 0.00 NA
2 20/11/2014 17:00 0.01 1
3 20/11/2014 19:00 0.03 2
Note 1: I assume you are looking for N so that you can fill in the empty hours. In that case you don't really need N. Also, it would be easier to deal with time series if you use a time series representation. First we convert ts1 to a zoo object, then we create a zero width zoo object with the datetimes that we need and finally we merge them:
library(zoo)
z <- read.zoo(ts1, tz = "", format = "%d/%m/%Y %H:%M")
z0 <- zoo(, seq(start(z), end(z), "hours"))
zz <- merge(z, z0)
giving:
> zz
2014-11-20 16:00:00 2014-11-20 17:00:00 2014-11-20 18:00:00 2014-11-20 19:00:00
0.00 0.01 NA 0.03
If you really did need a data frame back then:
DF <- fortify.zoo(zz)
Note 2: Input used in reproducible form is:
Lines <- "D1,Diff
1,20/11/2014 16:00,0.00
2,20/11/2014 17:00,0.01
3,20/11/2014 19:00,0.03"
ts1 <- read.csv(text = Lines, as.is = TRUE)
Thanks to @David Arenburg and @nicola:
Can use either:
res <- diff(as.POSIXct(ts1$D1, format = "%d/%m/%Y %H:%M")) ; units(res) <- "hours"
Or:
res <- diff(dmy_hm(ts1$D1))
and afterwards:
ts1$N <- c(NA_real_, as.numeric(res))
Data:
DB1 <- data.frame(orderItemID = 1:10,
orderDate = c("2013-01-21","2013-03-31","2013-04-12","2013-06-01","2014-01-01", "2014-02-19","2014-02-27","2014-10-02","2014-10-31","2014-11-21"),
deliveryDate = c("2013-01-23", "2013-03-01", "NA", "2013-06-04", "2014-01-03", "NA", "2014-02-28", "2014-10-04", "2014-11-01", "2014-11-23"))
Expected Outcome:
DB1 <- data.frame(orderItemID = 1:10,
orderDate= c("2013-01-21","2013-03-31","2013-04-12","2013-06-01","2014-01-01", "2014-02-19","2014-02-27","2014-10-02","2014-10-31","2014-11-21"),
deliveryDate = c("2013-01-23", "2013-03-01", "2013-04-14", "2013-06-04", "2014-01-03", "2014-02-21", "2014-02-28", "2014-10-04", "2014-11-01", "2014-11-23"))
My question is similar to another one I posted, so don't be confused.
As you can see above, I have some missing values in the delivery dates and I want to replace them with another date: the order date of the specific item plus the average delivery time in (full) days (2 days).
The average delivery time is computed from all samples that do not contain missing values: (2 days + 1 day + 3 days + 2 days + 1 day + 2 days + 1 day + 2 days) / 8 = 1.75.
So I want to replace the NAs in deliveryDate with the order date + 2 days. When there is no NA, the date should stay the same.
I tried this already (with lubridate), but it's not working :(
DB1$deliveryDate[is.na(DB1$deliveryDate) ] <- DB1$orderDate + days(2)
Can someone please help me?
First, convert the columns to Date objects:
DB1[,2:3]<-lapply(DB1[,2:3],as.Date)
Then, replace the NA elements:
DB1$deliveryDate[is.na(DB1$deliveryDate)] <-
DB1$orderDate[is.na(DB1$deliveryDate)] +
round(mean(difftime(DB1$deliveryDate,DB1$orderDate,units="days"),na.rm=TRUE))
# orderItemID orderDate deliveryDate
#1 1 2013-01-21 2013-01-23
#2 2 2013-03-31 2013-03-01
#3 3 2013-04-12 2013-04-14
#4 4 2013-06-01 2013-06-04
#5 5 2014-01-01 2014-01-03
#6 6 2014-02-19 2014-02-21
#7 7 2014-02-27 2014-02-28
#8 8 2014-10-02 2014-10-04
#9 9 2014-10-31 2014-11-01
#10 10 2014-11-21 2014-11-23
You can do:
DB1 = cbind(DB1$orderItemID, as.data.frame(lapply(DB1[-1], as.Date)))
days = round(mean(DB1$deliveryDate-DB1$orderDate, na.rm=T))
mask = is.na(DB1$deliveryDate)
DB1$deliveryDate[mask] = DB1$orderDate[mask]+days
# DB1$orderItemID orderDate deliveryDate
#1 1 2013-01-21 2013-01-23
#2 2 2013-03-31 2013-04-01
#3 3 2013-04-12 2013-04-14
#4 4 2013-06-01 2013-06-04
#5 5 2014-01-01 2014-01-03
#6 6 2014-02-19 2014-02-21
#7 7 2014-02-27 2014-02-28
#8 8 2014-10-02 2014-10-04
#9 9 2014-10-31 2014-11-01
#10 10 2014-11-21 2014-11-23
I re-arranged your data since it was not clean:
DB1 <- data.frame(orderItemID = 1:10,
orderDate = c("2013-01-21","2013-03-31","2013-04-12","2013-06-01","2014-01-01", "2014-02-19","2014-02-27","2014-10-02","2014-10-31","2014-11-21"),
deliveryDate = c("2013-01-23", "2013-04-01", NA, "2013-06-04", "2014-01-03", NA, "2014-02-28", "2014-10-04", "2014-11-01", "2014-11-23"))
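To see where the 2 full days comes from, here is a small sketch using the cleaned data above: the mean gap over the non-missing rows is 1.75 days, which rounds to 2.

```r
DB1 <- data.frame(
  orderDate = as.Date(c("2013-01-21","2013-03-31","2013-04-12","2013-06-01","2014-01-01",
                        "2014-02-19","2014-02-27","2014-10-02","2014-10-31","2014-11-21")),
  deliveryDate = as.Date(c("2013-01-23","2013-04-01",NA,"2013-06-04","2014-01-03",
                           NA,"2014-02-28","2014-10-04","2014-11-01","2014-11-23"))
)
# per-row gaps in days: 2 1 NA 3 2 NA 1 2 1 2
gaps <- as.numeric(DB1$deliveryDate - DB1$orderDate)
# mean of the 8 non-missing gaps is 1.75, rounded to 2 full days
avg_days <- round(mean(gaps, na.rm = TRUE))
```
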
Assuming that you have entered your data like this (note that NAs are not enclosed in quotes so they are read as NAs and not "NA")...
DB1 <- data.frame(orderItemID = 1:10,
orderDate = c("2013-01-21","2013-03-31","2013-04-12","2013-06-01","2014-01-01", "2014-02-19","2014-02-27","2014-10-02","2014-10-31","2014-11-21"),
deliveryDate = c("2013-01-23", "2013-03-01", NA, "2013-06-04", "2014-01-03", NA, "2014-02-28", "2014-10-04", "2014-11-01", "2014-11-23"),
stringsAsFactors = FALSE)
...and, per Nicola's answer, done this to get the formatting right...
DB1[,2:3]<-lapply(DB1[,2:3],as.Date)
...this also works:
library(lubridate)
DB1$deliveryDate <- with(DB1, as.Date(ifelse(is.na(deliveryDate), orderDate + days(2), deliveryDate), origin = "1970-01-01"))
Or you could use dplyr and pipe it:
library(lubridate)
library(dplyr)
DB2 <- DB1 %>%
mutate(deliveryDate = ifelse(is.na(deliveryDate), orderDate + days(2), deliveryDate)) %>%
mutate(deliveryDate = as.Date(.[,"deliveryDate"], origin = "1970-01-01"))
I am quite new to R, and I am trying to find a way to average continuous data over a specific period of time.
My data is a month-long recording of several parameters with 1 s time steps.
The table read via read.csv has a date and time in one column and several other columns with values.
TimeStamp UTC Pitch Roll Heave(m)
05-02-13 6:45 0 0 0
05-02-13 6:46 0.75 -0.34 0.01
05-02-13 6:47 0.81 -0.32 0
05-02-13 6:48 0.79 -0.37 0
05-02-13 6:49 0.73 -0.08 -0.02
So I want to average the data over specific intervals, 20 min for example, such that the average for 7:00 takes all the points from 6:41 to 7:00 and returns the average for that interval, and so on for the entire dataset.
The time intervals will look like this:
TimeStamp
05-02-13 19:00 462
05-02-13 19:20 332
05-02-13 19:40 15
05-02-13 20:00 10
05-02-13 20:20 42
Here is a reproducible dataset similar to your own.
meteorological <- data.frame(
TimeStamp = rep.int("05-02-13", 1440),
UTC = paste(
rep(formatC(0:23, width = 2, flag = "0"), each = 60),
rep(formatC(0:59, width = 2, flag = "0"), times = 24),
sep = ":"
),
Pitch = runif(1440),
Roll = rnorm(1440),
Heave = rnorm(1440)
)
The first thing that you need to do is to combine the first two columns to create a single (POSIXct) date-time column.
library(lubridate)
meteorological$DateTime <- with(
meteorological,
dmy_hm(paste(TimeStamp, UTC))
)
Then set up a sequence of break points for your different time groupings.
breaks <- seq(ymd("2013-02-05"), ymd("2013-02-06"), "20 mins")
Finally, you can calculate the summary statistics for each group. There are many ways to do this. ddply from the plyr package is a good choice.
library(plyr)
ddply(
meteorological,
.(cut(DateTime, breaks)),
summarise,
MeanPitch = mean(Pitch),
MeanRoll = mean(Roll),
MeanHeave = mean(Heave)
)
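The same grouping also works without plyr — a sketch using base aggregate() on a smaller made-up series, where cut() with a "20 mins" break specification replaces the explicit breaks vector:

```r
set.seed(1)
# two hours of 1-minute observations
met <- data.frame(
  DateTime = seq(as.POSIXct("2013-02-05 00:00:00"), by = "1 min", length.out = 120),
  Pitch = runif(120)
)
# mean per 20-minute bin; cut() labels each row with its interval start
agg <- aggregate(met["Pitch"],
                 by = list(interval = cut(met$DateTime, "20 mins")),
                 FUN = mean)
```
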
Please see if something simple like this works for you:
myseq <- data.frame(time=seq(ISOdate(2014,1,1,12,0,0), ISOdate(2014,1,1,13,0,0), "5 min"))
myseq$cltime <- cut(myseq$time, "20 min", labels = F)
> myseq
time cltime
1 2014-01-01 12:00:00 1
2 2014-01-01 12:05:00 1
3 2014-01-01 12:10:00 1
4 2014-01-01 12:15:00 1
5 2014-01-01 12:20:00 2
6 2014-01-01 12:25:00 2
7 2014-01-01 12:30:00 2
8 2014-01-01 12:35:00 2
9 2014-01-01 12:40:00 3
10 2014-01-01 12:45:00 3
11 2014-01-01 12:50:00 3
12 2014-01-01 12:55:00 3
13 2014-01-01 13:00:00 4
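Building on the cltime index above — a sketch with a made-up value column — tapply then averages within each 20-minute group:

```r
set.seed(42)
myseq <- data.frame(
  time  = seq(ISOdate(2014, 1, 1, 12, 0, 0), ISOdate(2014, 1, 1, 13, 0, 0), "5 min"),
  heave = rnorm(13)
)
# integer group index per 20-minute bin, as in the answer above
myseq$cltime <- cut(myseq$time, "20 min", labels = FALSE)
# mean of heave within each group
avg <- tapply(myseq$heave, myseq$cltime, mean)
```
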