I have a dataframe with many days of data in it and I want to obtain the max and min per day, but I'm getting the same df as the start, with one value per hour. The original df looks like this:
date temperature
1: 2006-04-17 00:00:00 12.67833
2: 2006-04-17 01:00:00 12.14133
3: 2006-04-17 02:00:00 10.36833
4: 2006-04-17 03:00:00 10.78600
5: 2006-04-17 04:00:00 10.76967
6: 2006-04-17 05:00:00 10.92467
And I'm getting this:
date Max Min
1: 2006-04-17 00:00:00 12.67833 12.67833
2: 2006-04-17 01:00:00 12.14133 12.14133
3: 2006-04-17 02:00:00 10.36833 10.36833
4: 2006-04-17 03:00:00 10.78600 10.78600
5: 2006-04-17 04:00:00 10.76967 10.76967
6: 2006-04-17 05:00:00 10.92467 10.92467
I'm using the following code:
library(lubridate)
datatemp <- read.csv("04_2006.csv", header = TRUE)
datatemp$date_time <- parse_date_time(datatemp$date_time, orders = "mdy HMS")
temp_aveg <- aggregate(list(temperature = datatemp$temp),
                       list(date = cut(datatemp$date_time, "1 hour")),
                       mean)
library(data.table)
Tmaxmin <- setDT(temp_aveg)[, list(Max = max(temperature), Min = min(temperature)), by = list(date)]
I don't know what I'm missing.
You are grouping on both the date and the time rather than just the date.
Here is a solution using lubridate and dplyr:
library(lubridate)
library(dplyr)

datatemp$date <- date(datatemp$date_time)  # keep only the calendar date
datatemp <- na.omit(datatemp)

output <- datatemp %>%
  group_by(date) %>%
  summarise(max_val = max(temperature),
            min_val = min(temperature))
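If you prefer to stay with data.table, the same fix applies to your original attempt: group by the calendar date alone, not the full timestamp. A sketch assuming the temp_aveg built above (cut() returns a factor, hence the as.Date() conversion):
library(data.table)
Tmaxmin <- setDT(temp_aveg)[, .(Max = max(temperature), Min = min(temperature)),
                            by = .(date = as.Date(date))]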
Related
I have a time series, that spans almost 20 years with a resolution of 15 min.
I want to extract only hourly values (00:00:00, 01:00:00, and so on...) and plot the resulting time series.
The df looks like this:
3 columns: date, time, and discharge
How would you approach this?
A reproducible example would be good for this kind of question. Here is my code; I hope it helps:
# Create dummy data: one day of 15-minute observations
df <- data.frame(time = seq(as.POSIXct("2018-01-01 00:00:00"),
                            as.POSIXct("2018-01-01 23:59:59"),
                            by = "15 min"),
                 variable = runif(96, 0, 1))
Example output (first 5 rows only):
time variable
1 2018-01-01 00:00:00 0.331546992
2 2018-01-01 00:15:00 0.407269290
3 2018-01-01 00:30:00 0.635367577
4 2018-01-01 00:45:00 0.808612045
5 2018-01-01 01:00:00 0.258801201
library(dplyr)

df %>% filter(format(time, "%M:%S") == "00:00")
output:
1 2018-01-01 00:00:00 0.76198532
2 2018-01-01 01:00:00 0.01304103
3 2018-01-01 02:00:00 0.10729465
4 2018-01-01 03:00:00 0.74534184
5 2018-01-01 04:00:00 0.25942667
library(ggplot2)

df %>%
  filter(format(time, "%M:%S") == "00:00") %>%
  ggplot(aes(x = time, y = variable)) +
  geom_line()
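If you'd rather avoid string formatting, a lubridate-based filter should give the same result (a sketch assuming the same df as above):
library(lubridate)
df %>% filter(minute(time) == 0, second(time) == 0)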
I have this dataframe containing a date column and a unique ID. I would simply like to extract the first observation of each day.
I tried the dplyr package and the aggregate function, as well as the date function, but I'm still a beginner in R. I also looked for an answer on this forum without success. Thanks in advance!
Here is the situation:
df <- as.data.frame(c(2013-01-12 07:30:00, 2013-01-12 12:40:00, 2013-01-16 06:50:00, 2013-01-16 15:10:00, 2013-01-14 11:20:00, 2013-01-14 08:15:00),
c(A,B,E,F,C,D))
The outcome should be:
2013-01-12 07:30:00 A
2013-01-14 08:15:00 D
2013-01-16 06:50:00 E
Try the code below. Note that I have edited your example data.
library(dplyr)
df <- data.frame(date = as.POSIXct(c("2013-01-12 07:30:00",
"2013-01-12 12:40:00",
"2013-01-16 06:50:00",
"2013-01-16 15:10:00",
"2013-01-14 11:20:00",
"2013-01-14 08:15:00")),
id = letters[1:6])
df %>%
  group_by(as.Date(date)) %>%
  filter(date == min(date))
The result should look like this:
# A tibble: 3 x 3
# Groups: as.Date(date) [3]
date id `as.Date(date)`
<dttm> <fct> <date>
1 2013-01-12 07:30:00 a 2013-01-12
2 2013-01-16 06:50:00 c 2013-01-16
3 2013-01-14 08:15:00 f 2013-01-14
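A small variant, assuming a reasonably recent dplyr (>= 1.0): slice_min() picks the earliest row per day directly, without the filter step:
library(dplyr)
df %>%
  group_by(day = as.Date(date)) %>%  # group by calendar day
  slice_min(date, n = 1) %>%         # keep the earliest timestamp per day
  ungroup()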
Here is an approach using aggregate from the stats package, also editing your dataset definition:
df <- data.frame(times=strptime(c('2013-01-12 07:30:00', '2013-01-12 12:40:00',
'2013-01-16 06:50:00', '2013-01-16 15:10:00',
'2013-01-14 11:20:00', '2013-01-14 08:15:00'),
format = "%Y-%m-%d %H:%M:%S"),
id=c('A','B','E','F','C','D'))
df$day <- as.Date(df$times, format='%Y-%m-%d') #create a day column
aggregate(times ~ day, data = df, FUN='min')
# day times
# 1 2013-01-12 2013-01-12 07:30:00
# 2 2013-01-14 2013-01-14 08:15:00
# 3 2013-01-16 2013-01-16 06:50:00
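If you also need the id of that first observation, one option (a sketch building on the df defined above) is to merge the per-day minima back onto the original data:
first_obs <- aggregate(times ~ day, data = df, FUN = min)
merge(first_obs, df, by = c("day", "times"))
#          day               times id
# 1 2013-01-12 2013-01-12 07:30:00  A
# 2 2013-01-14 2013-01-14 08:15:00  D
# 3 2013-01-16 2013-01-16 06:50:00  E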
How do you set 0:00 as the end of day instead of 23:00 in hourly data? I have this struggle while using period.apply and to.period, as both return days ending at 23:00. Here is an example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:00:00"), by="hour"), x = rnorm(120))
The following calls show periods ending at 23:00:
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "days")
x1[endpoints(x1, 'days')]
So when I aggregate the hourly data to daily, does someone have an idea how to set the end of day at 0:00?
As already pointed out by another answer here, to.period on days computes on the data with timestamps between 00:00:00 and 23:59:59.9999999 on the day in question. So 23:00:00 is seen as the last timestamp in your data, and 00:00:00 corresponds to a value in the next day's "bin".
What you can do is shift all the timestamps back one hour, use to.period to get the daily data points from the hourly points, and then use align.time to get the timestamps aligned correctly.
(More generally, to.period is useful for generating OHLCV-type data, so if you're generating, say, hourly bars from ticks, it makes sense to look at all the ticks between 23:00:00 and 23:59:59.99999 when building the bar; 00:00:00 to 00:59:59.9999... would then form the next hourly bar, and so on. A short sketch of this follows.)
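A quick illustration of that bar-building behaviour, as a sketch with hypothetical minute-level data:
library(xts)
set.seed(1)
xmin <- xts(rnorm(180),
            order.by = seq(as.POSIXct("2018-02-01 00:00:00"),
                           by = "min", length.out = 180))
to.period(xmin, period = "hours")  # one Open/High/Low/Close row per hour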
Now, here is an example of the daily case:
> tail(x1["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -1.2760349
# 2018-02-01 19:00:00 -0.1496041
# 2018-02-01 20:00:00 -0.5989614
# 2018-02-01 21:00:00 -0.9691905
# 2018-02-01 22:00:00 -0.2519618
# 2018-02-01 23:00:00 -1.6081656
> head(x1["2018-02-02"])
# [,1]
# 2018-02-02 00:00:00 -0.3373271
# 2018-02-02 01:00:00 0.8312698
# 2018-02-02 02:00:00 0.9321747
# 2018-02-02 03:00:00 0.6719425
# 2018-02-02 04:00:00 -0.5597391
# 2018-02-02 05:00:00 -0.9810128
> head(x1["2018-02-03"])
# [,1]
# 2018-02-03 00:00:00 2.3746424
# 2018-02-03 01:00:00 0.8536594
# 2018-02-03 02:00:00 -0.2467268
# 2018-02-03 03:00:00 -0.1316978
# 2018-02-03 04:00:00 0.3079848
# 2018-02-03 05:00:00 0.2445634
x2 <- x1
.index(x2) <- .index(x1) - 3600  # shift every timestamp back one hour
> tail(x2["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -0.1496041
# 2018-02-01 19:00:00 -0.5989614
# 2018-02-01 20:00:00 -0.9691905
# 2018-02-01 21:00:00 -0.2519618
# 2018-02-01 22:00:00 -1.6081656
# 2018-02-01 23:00:00 -0.3373271
x.d2 <- to.period(x2, OHLC = FALSE, drop.date = FALSE, period = "days")
> x.d2
# [,1]
# 2018-01-31 23:00:00 0.12516594
# 2018-02-01 23:00:00 -0.33732710
# 2018-02-02 23:00:00 2.37464235
# 2018-02-03 23:00:00 0.51797747
# 2018-02-04 23:00:00 0.08955208
# 2018-02-05 22:00:00 0.33067734
x.d2 <- align.time(x.d2, n = 86400)
> x.d2
# [,1]
# 2018-02-01 0.12516594
# 2018-02-02 -0.33732710
# 2018-02-03 2.37464235
# 2018-02-04 0.51797747
# 2018-02-05 0.08955208
# 2018-02-06 0.33067734
Want to convince yourself? Try something like this:
x3 <- rbind(x1, xts(x = matrix(c(1, 2), nrow = 2),
                    order.by = as.POSIXct(c("2018-02-01 23:59:59.999",
                                            "2018-02-02 00:00:00"))))
x3["2018-02-01 23/2018-02-02 01"]
# [,1]
# 2018-02-01 23:00:00.000 -1.6081656
# 2018-02-01 23:59:59.999 1.0000000
# 2018-02-02 00:00:00.000 -0.3373271
# 2018-02-02 00:00:00.000 2.0000000
# 2018-02-02 01:00:00.000 0.8312698
x3.d <- to.period(x3, OHLC = FALSE, drop.date = FALSE, period = "days")
> x3.d <- align.time(x3.d, 86400)
> x3.d
[,1]
2018-02-02 1.00000000
2018-02-03 -0.09832625
2018-02-04 -0.65075506
2018-02-05 -0.09423664
2018-02-06 0.33067734
See that the value of 2 at 2018-02-02 00:00:00 did not form the last observation of the row stamped 2018-02-02, whose bin runs from 2018-02-01 00:00:00 to 2018-02-01 23:59:59.9999.
Of course, if you want the daily timestamp to be the start of the day rather than the end (i.e. 2018-02-01 for the first row of x3.d above), you could shift the index back by one day. You can do this relatively safely in most time zones, as long as your data doesn't include the weekends on which clock shifts occur:
index(x3.d) = index(x3.d) - 86400
I say relatively safely because there are corner cases when there are time shifts in a time zone, e.g. be careful with daylight saving. Simply subtracting 86400 can be a problem when going from Sunday to Saturday in time zones where daylight saving occurs:
# e.g. bad: daylight saving occurs on this weekend for US EST
z <- xts(x = 9, order.by = as.POSIXct("2018-03-12", tz = "America/New_York"))
> index(z) - 86400
[1] "2018-03-10 23:00:00 EST"
That is, the timestamp is off by one hour when you really want the midnight timestamp (00:00:00).
You could get around this problem using something much safer like this:
library(lubridate)
# right
> index(z) - days(1)
[1] "2018-03-11 EST"
I don't think this is possible because 00:00 is the start of the day. From the manual:
These endpoints are aligned in POSIXct time to the zero second of the day at the beginning, and the 59.9999th second of the 59th minute of the 23rd hour of the final day
I think the solution here is to use minutes instead of hours. Using your example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:59:00"), by = "min"), x = rnorm(7200))
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "days")
x1[endpoints(x1, 'days')]
Consider this:
time <- seq(ymd_hms("2014-02-24 23:00:00"), ymd_hms("2014-06-25 08:32:00"), by="hour")
group <- rep(LETTERS[1:20], each = length(time))
value <- sample(-10^3:10^3,length(time), replace=TRUE)
df2 <- data.frame(time,group,value)
str(df2)
> head(df2)
time group value
1 2014-02-24 23:00:00 A 246
2 2014-02-25 00:00:00 A -261
3 2014-02-25 01:00:00 A 628
4 2014-02-25 02:00:00 A 429
5 2014-02-25 03:00:00 A -49
6 2014-02-25 04:00:00 A -749
I would like to create a variable that contains, for each group, the rolling mean of value over the last 5 days (not including the current observation), only considering observations that fall at the exact same hour as the current observation.
In other words: at time 2014-02-24 23:00:00, rolling_mean_same_hour should contain the mean of the value observations recorded at 23:00:00 during the last 5 days in the data (not including 2014-02-24, of course).
I would like to do this in either dplyr or data.table, but I confess I have no idea how.
Any ideas?
Many thanks!
You can calculate the rollmean() with your data grouped by the group variable and the hour of the time variable. Normally rollmean() includes the current observation, but you can use the shift() function to exclude it:
library(data.table); library(zoo)
setDT(df2)
df2[, .(rolling_mean_same_hour = shift(
          rollmean(value, 5, na.pad = TRUE, align = 'right'),
          n = 1,
          type = 'lag'),
        time),
    by = .(hour(time), group)]
# hour group rolling_mean_same_hour time
# 1: 23 A NA 2014-02-24 23:00:00
# 2: 23 A NA 2014-02-25 23:00:00
# 3: 23 A NA 2014-02-26 23:00:00
# 4: 23 A NA 2014-02-27 23:00:00
# 5: 23 A NA 2014-02-28 23:00:00
# ---
#57796: 22 T -267.0 2014-06-20 22:00:00
#57797: 22 T -389.6 2014-06-21 22:00:00
#57798: 22 T -311.6 2014-06-22 22:00:00
#57799: 22 T -260.0 2014-06-23 22:00:00
#57800: 22 T -26.8 2014-06-24 22:00:00
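Since the question mentions dplyr as well, a roughly equivalent pipeline might look like this (a sketch assuming df2 as defined in the question; rollmean() still comes from zoo):
library(dplyr); library(zoo); library(lubridate)
df2 %>%
  group_by(group, hour = hour(time)) %>%  # one group per (group, hour-of-day) pair
  arrange(time, .by_group = TRUE) %>%
  mutate(rolling_mean_same_hour = lag(    # lag() drops the current observation
    rollmean(value, 5, fill = NA, align = "right"))) %>%
  ungroup()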
I have hourly values for precipitation that I'd like to sum up over each day.
My data (Nd_hourly) looks like this:
Datum Uhrzeit Nd
1 2013-05-01 01:00:00 0.0
2 2013-05-01 02:00:00 0.1
3 2013-05-01 03:00:00 0.0
4 2013-05-01 04:00:00 0.3
(date, time, precipitation)
and I'd like to get an output of Datum with the daily Nd.
I computed the min and max temperature with the plyr package and its ddply function:
t_maxmin <- ddply(t_air, .(Datum), summarize,
                  Datum = Datum[which.max(T_Luft)],
                  max.value = max(T_Luft),
                  min.value = min(T_Luft))
I then tried to do something similar for the precipitation and tried
Nd_daily=ddply(Nd_hourly,.(Datum),summarize,Datum=Datum, sum(Nd_hourly))
but get the error message
Error: only defined on a data frame with all numeric variables
I assume something may be wrong with my data input? I imported data from Excel 2010 via a .txt file.
Still very new to R and programming in general, so I would really appreciate some help :)
Is this what you want?
library(plyr)
ddply(.data = df, .variables = .(Datum), summarize,
sum_precip = sum(Nd))
# Datum sum_precip
# 1 2013-05-01 0.4
I think @Henrik has identified your problem, but here's an alternative approach, using data.table:
# Create some fake datetime data
datetime <- seq(ISOdate(2000,1,1), ISOdate(2000,1,10), "hours")
# A data.frame with columns for date, time, and random precipitation data.
DF <- data.frame(date=format(datetime, "%Y-%m-%d"),
time=format(datetime, "%H:%M:%S"),
precip=runif(length(datetime)))
head(DF)
# date time precip
# 1 2000-01-01 12:00:00 0.9294353
# 2 2000-01-01 13:00:00 0.5082905
# 3 2000-01-01 14:00:00 0.5222088
# 4 2000-01-01 15:00:00 0.1841305
# 5 2000-01-01 16:00:00 0.9121000
# 6 2000-01-01 17:00:00 0.2434706
library(data.table)
DT <- as.data.table(DF) # convert to a data.table
DT[, list(precip=sum(precip)), by=date]
# date precip
# 1: 2000-01-01 7.563350
# 2: 2000-01-02 10.147659
# 3: 2000-01-03 10.936760
# 4: 2000-01-04 13.925727
# 5: 2000-01-05 11.415149
# 6: 2000-01-06 10.966494
# 7: 2000-01-07 12.751461
# 8: 2000-01-08 15.218148
# 9: 2000-01-09 12.213046
# 10: 2000-01-10 6.219439
There's a great introductory text on data.tables here.
Given your particular data structure, the following should do the trick.
library(data.table)
DT <- data.table(Nd_hourly)
DT[, list(Nd_daily=sum(Nd)), by=Datum]
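One caveat (an assumption about how the .txt import went): if Datum arrived as character or factor rather than Date, convert it first so the grouping really is by calendar day:
DT[, Datum := as.Date(Datum)]
DT[, .(Nd_daily = sum(Nd)), by = Datum]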