I have a dataset filled with the average windspeed per hour for multiple years. I would like to create an 'average year', in which for each hour the average windspeed for that hour over multiple years is calculated. How can I do this without looping endlessly through the dataset?
Ideally, I would like to just loop through the data once, extracting for each row the right month, day, and hour, and adding the windspeed from that row to the right row in a dataframe where the aggregates for each month, day, and hour are gathered. Is it possible to do this without extracting the month, day, and hour, and then looping over the complete average-year data.frame to find the right row?
Some example data:
data.multipleyears <- data.frame(
DATETIME = c("2001-01-01 01:00:00", "2001-05-03 09:00:00", "2007-01-01 01:00:00", "2008-02-29 12:00:00"),
Windspeed = c(10, 5, 8, 3)
)
Which I would like to aggregate in a dataframe like this:
average.year <- data.frame(
DATETIME = c("01-01 00:00:00", "01-01 01:00:00", ..., "12-31 23:00:00")
Aggregate.Windspeed = (100, 80, ...)
)
From there, I can go on calculating the averages, etc. I have probably overlooked some command, but what would be the right syntax for something like this (in pseudocode):
for(i in 1:nrow(data.multipleyears) {
average.year$Aggregate.Windspeed[
where average.year$DATETIME(month, day, hour) == data.multipleyears$DATETIME[i](month, day, hour)] <- average.year$Aggregate.Windspeed + data.multipleyears$Windspeed[i]
}
Or something like that. Help is appreciated!
I predict that ddply and the plyr package are going to be your best friend :). I created a 30 year dataset with hourly random windspeeds between 1 and 10 ms:
begin_date = as.POSIXlt("1990-01-01", tz = "GMT")
# 30 year dataset
dat = data.frame(dt = begin_date + (0:(24*30*365)) * (3600))
dat = within(dat, {
speed = runif(length(dt), 1, 10)
unique_day = strftime(dt, "%d-%m")
})
> head(dat)
dt unique_day speed
1 1990-01-01 00:00:00 01-01 7.054124
2 1990-01-01 01:00:00 01-01 2.202591
3 1990-01-01 02:00:00 01-01 4.111633
4 1990-01-01 03:00:00 01-01 2.687808
5 1990-01-01 04:00:00 01-01 8.643168
6 1990-01-01 05:00:00 01-01 5.499421
To calculate the daily normalen (30 year average, this term is much used in meteorology) over this 30 year period:
library(plyr)
res = ddply(dat, .(unique_day),
summarise, mean_speed = mean(speed), .progress = "text")
> head(res)
unique_day mean_speed
1 01-01 5.314061
2 01-02 5.677753
3 01-03 5.395054
4 01-04 5.236488
5 01-05 5.436896
6 01-06 5.544966
This takes just a few seconds on my humble two core AMD, so I suspect just going once through the data is not needed. Multiple of these ddply calls for different aggregations (month, season etc) can be done separately.
You can use substr to extract the part of the date you want,
and then use tapply or ddply to aggregate the data.
tapply(
data.multipleyears$Windspeed,
substr( data.multipleyears$DATETIME, 6, 19),
mean
)
# 01-01 01:00:00 02-29 12:00:00 05-03 09:00:00
# 9 3 5
library(plyr)
ddply(
data.multipleyears,
.(when=substr(DATETIME, 6, 19)),
summarize,
Windspeed=mean(Windspeed)
)
# when Windspeed
# 1 01-01 01:00:00 9
# 2 02-29 12:00:00 3
# 3 05-03 09:00:00 5
It is pretty old post, but I wanted to add. I guess timeAverage in Openair can also be used. In the manual, there are more options for timeAverage function.
Related
I have a data frame with hour stamp and corresponding temperature measured. The measurements are taken at random intervals over time continuously. I would like to convert the hours to respective date-time and temperature measured. My data frame looks like this: (The measurement started at 20/05/2016)
Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25
I would like to create a data.frame with respective date-time and Temp like below:
Time, Temp
2016-05-20 09:25,28
2016-05-20 10:35,28.2
2016-05-20 18:25,29
2016-05-20 23:50,30
2016-05-21 01:10,31
2016-05-21 12:00,36
2016-05-22 02:00,25
I am thankful for any comments and tips on the packages or functions in R, I can have a look to do this. Thanks for your time.
A possible solution in base R:
df$Time <- as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',df$Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT'))
df$Time <- df$Time + cumsum(c(0,diff(df$Time)) < 0) * 86400 # 86400 = 60 * 60 * 24
which gives:
> df
Time Temp
1 2016-05-20 09:25:00 28.0
2 2016-05-20 10:35:00 28.2
3 2016-05-20 18:25:00 29.0
4 2016-05-20 23:50:00 30.0
5 2016-05-21 01:10:00 31.0
6 2016-05-21 12:00:00 36.0
7 2016-05-22 02:00:00 25.0
An alternative with data.table (off course you can also use cumsum with diff instead of rleid & shift):
setDT(df)[, Time := as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
(rleid(Time < shift(Time, fill = Time[1]))-1) * 86400]
Or with dplyr:
library(dplyr)
df %>%
mutate(Time = as.POSIXct(strptime(paste('2016-05-20',
sprintf('%05.2f',Time)),
format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
cumsum(c(0,diff(Time)) < 0)*86400)
which will both give the same result.
Used data:
df <- read.table(text='Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25', header=TRUE, sep=',')
You can use a custom date format combined with some code that detects when a new day begins (assuming the first measurement takes place earlier in the day than the last measurement of the previous day).
# starting day
start_date = "2016-05-20"
values=read.csv('values.txt', colClasses=c("character",NA))
last=c(0,values$Time[1:nrow(values)-1])
day=cumsum(values$Time<last)
Time = strptime(paste(start_date,values$Time), "%Y-%m-%d %H.%M")
Time = Time + day*86400
values$Time = Time
I want to be able to create a water year column for a time series. The US water year is from Oct-Sept and is considered the year it ends on. For example the 2014 water year is from October 1, 2013 - September 30, 2014.
This is the US water year, but not the only water year. Therefore I want to enter in a start month and have a water year calculated for the date.
For example if my data looks like
date
2008-01-01 00:00:00
2008-02-01 00:00:00
2008-03-01 00:00:00
2008-04-01 00:00:00
.
.
.
2008-12-01 00:00:00
I want my function to work something like:
wtr_yr <- function(data, start_month) {
does stuff
}
Then my output would be
wtr_yr(data, 2)
date wtr_yr
2008-01-01 00:00:00 2008
2008-02-01 00:00:00 2009
2008-03-01 00:00:00 2009
2008-04-01 00:00:00 2009
.
.
.
2009-01-01 00:00:00 2009
2009-02-01 00:00:00 2010
2009-03-01 00:00:00 2010
2009-04-01 00:00:00 2010
I started by breaking the date up into separate columns, but I don't think that is the best way to go about it. Any advice?
Thanks in advance!
We can use POSIXlt to come up with an answer.
wtr_yr <- function(dates, start_month=9) {
# Convert dates into POSIXlt
dates.posix = as.POSIXlt(dates)
# Year offset
offset = ifelse(dates.posix$mon >= start_month - 1, 1, 0)
# Water year
adj.year = dates.posix$year + 1900 + offset
# Return the water year
adj.year
}
Let's now use this function in an example.
# Sample input vector
dates = c("2008-01-01 00:00:00",
"2008-02-01 00:00:00",
"2008-03-01 00:00:00",
"2008-04-01 00:00:00",
"2009-01-01 00:00:00",
"2009-02-01 00:00:00",
"2009-03-01 00:00:00",
"2009-04-01 00:00:00")
# Display the function output
wtr_yr(dates, 2)
# Combine the input and output vectors in a dataframe
df = data.frame(dates, wtr_yr=wtr_yr(dates, 2))
I had a similar problem a while back but dealing with fiscal years that started in October. I found this function which also computes the quarters within the year. For one part, I only wanted it to output the fiscal year, so I edited a tiny part of the function to do that. There is surely a much cleaner/efficient way of doing it, but this should work for smaller data sets. Here is the edited function:
getYearQuarter <- function(x,
firstMonth=7,
fy.prefix='FY',
quarter.prefix='Q',
sep='-',
level.range=c(min(x), max(x)) ) {
if(level.range[1] > min(x) | level.range[2] < max(x)) {
warning(paste0('The range of x is greater than level.range. Values ',
'outside level.range will be returned as NA.'))
}
quarterString <- function(d) {
year <- as.integer(format(d, format='%Y'))
month <- as.integer(format(d, format='%m'))
y <- ifelse(firstMonth > 1 & month >= firstMonth, year+1, year)
q <- cut( (month - firstMonth) %% 12, breaks=c(-Inf,2,5,8,Inf),
labels=paste0(quarter.prefix, 1:4))
return(paste0(fy.prefix, substring(y,3,4)))
}
vals <- quarterString(x)
levels <- unique(quarterString(seq(
as.Date(format(level.range[1], '%Y-%m-01')),
as.Date(format(level.range[2], '%Y-%m-28')), by='month')))
return(factor(vals, levels=levels, ordered=TRUE))
}
Your input vector should be type Date, and then specify the start month. Assuming you have a data frame(df) with the 'date' column as in your question, this should do the trick.
df$wtr_yr <- getYearQuarter(df$date, firstMonth=10)
You can also achieve adding a column by water year by using the "lfstat" package
https://www.rdocumentation.org/packages/lfstat/versions/0.9.4/topics/water_year
The format of my excel data file is:
day value
01-01-2000 00:00:00 4
01-01-2000 00:01:00 3
01-01-2000 00:02:00 1
01-01-2000 00:04:00 1
I open my file with this:
ts = read.csv(file=pathfile, header=TRUE, sep=",")
How can I add additional rows with zero number in column “value” into the data frame. Output example:
day value
01-01-2000 00:00:00 4
01-01-2000 00:01:00 3
01-01-2000 00:02:00 1
01-01-2000 00:03:00 0
01-01-2000 00:04:00 1
This is now completely automated in the padr package. Takes only one line of code.
original <- data.frame(
day = as.POSIXct(c("01-01-2000 00:00:00",
"01-01-2000 00:01:00",
"01-01-2000 00:02:00",
"01-01-2000 00:04:00"), format="%m-%d-%Y %H:%M:%S"),
value = c(4, 3, 1, 1))
library(padr)
library(dplyr) # for the pipe operator
original %>% pad %>% fill_by_value(value)
See vignette("padr") or this blog post for its working.
I think this is a more general solution, which relies on creating a sequence of all timestamps, using that as the basis for a new data frame, and then filling in your original values in that df where applicable.
# convert original `day` to POSIX
ts$day <- as.POSIXct(ts$day, format="%m-%d-%Y %H:%M:%S", tz="GMT")
# generate a sequence of all minutes in a day
minAsNumeric <- 946684860 + seq(0,60*60*24,by=60) # all minutes of your first day
minAsPOSIX <- as.POSIXct(minAsNumeric, origin="1970-01-01", tz="GMT") # convert those minutes to POSIX
# build complete dataframe
newdata <- as.data.frame(minAsPOSIX)
newdata$value <- ts$value[pmatch(newdata$minAsPOSIX, ts$day)] # fill in original `value`s where present
newdata$value[is.na(newdata$value)] <- 0 # replace NAs with 0
Try:
ts = read.csv(file=pathfile, header=TRUE, sep=",", stringsAsFactors=F)
ts.tmp = rbind(ts,list("01-01-2000 00:03:00",0))
ts.out = ts.tmp[order(ts.tmp$day),]
Notice that you need to force load the strings in first column as character and not factors otherwise you will have issue with the rbind. To get the day column to be a factor after than just do:
ts.out$day = as.factor(ts.out$day)
Tidyr offers the nice complete function to generate rows for implicitly missing data. I use replace_na to turn NA values to 0 in second step.
ts%>%
tidyr::complete(day=seq.POSIXt(min(day), max(day), by="min"))%>%
dplyr::mutate(value=tidyr::replace_na(value,0))
Notice that I set the granularity of the dates to minutes since your dataset expects a row every minute.
I don't often have to work with dates in R, but I imagine this is fairly easy. I have daily data as below for several years with some values and I want to get for each 8 days period the sum of related values.What is the best approach?
Any help you can provide will be greatly appreciated!
str(temp)
'data.frame':648 obs. of 2 variables:
$ Date : Factor w/ 648 levels "2001-03-24","2001-03-25",..: 1 2 3 4 5 6 7 8 9 10 ...
$ conv2: num -3.93 -6.44 -5.48 -6.09 -7.46 ...
head(temp)
Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802
I tried to use aggregate function but for some reasons it doesn't work and it aggregates in wrong way:
z <- aggregate(amount ~ Date, timeSequence(from =as.Date("2001-03-24"),to =as.Date("2001-03-29"), by="day"),data=temp,FUN=sum)
I prefer the package xts for such manipulations.
I read your data, as zoo objects. see the flexibility of format option.
library(xts)
ts.dat <- read.zoo(text ='Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802',header=TRUE,format = '%d/%m/%Y')
Then I extract the index of given period
ep <- endpoints(ts.dat,'days',k=8)
finally I apply my function to the time series at each index.
period.apply(x=ts.dat,ep,FUN=sum )
2001-03-29 2001-04-01
-36.13014 -19.06520
Use cut() in your aggregate() command.
Some sample data:
set.seed(1)
mydf <- data.frame(
DATE = seq(as.Date("2000/1/1"), by="day", length.out = 365),
VALS = runif(365, -5, 5))
Now, the aggregation. See ?cut.Date for details. You can specify the number of days you want in each group using cut:
output <- aggregate(VALS ~ cut(DATE, "8 days"), mydf, sum)
list(head(output), tail(output))
# [[1]]
# cut(DATE, "8 days") VALS
# 1 2000-01-01 8.242384
# 2 2000-01-09 -5.879011
# 3 2000-01-17 7.910816
# 4 2000-01-25 -6.592012
# 5 2000-02-02 2.127678
# 6 2000-02-10 6.236126
#
# [[2]]
# cut(DATE, "8 days") VALS
# 41 2000-11-16 17.8199285
# 42 2000-11-24 -0.3772209
# 43 2000-12-02 2.4406024
# 44 2000-12-10 -7.6894484
# 45 2000-12-18 7.5528077
# 46 2000-12-26 -3.5631950
rollapply. The zoo package has a rolling apply function which can also do non-rolling aggregations. First convert the temp data frame into zoo using read.zoo like this:
library(zoo)
zz <- read.zoo(temp)
and then its just:
rollapply(zz, 8, sum, by = 8)
Drop the by = 8 if you want a rolling total instead.
(Note that the two versions of temp in your question are not the same. They have different column headings and the Date columns are in different formats. I have assumed the str(temp) output version here. For the head(temp) version one would have to add a format = "%d/%m/%Y" argument to read.zoo.)
aggregate. Here is a solution that does not use any external packages. It uses aggregate based on the original data frame.
ix <- 8 * ((1:nrow(temp) - 1) %/% 8 + 1)
aggregate(temp[2], list(period = temp[ix, 1]), sum)
Note that ix looks like this:
> ix
[1] 8 8 8 8 8 8 8 8 16
so it groups the indices of the first 8 rows, the second 8 and so on.
Those are NOT Date classed variables. (No self-respecting program would display a date like that, not to mention the fact that these are labeled as factors.) [I later noticed these were not the same objects.] Furthermore, the timeSequence function (at least the one in the timeDate package) does not return a Date class vector either. So your expectation that there would be a "right way" for two disparate non-Date objects to be aligned in a sensible manner is ill-conceived. The irony is that just using the temp$Date column would have worked since :
> z <- aggregate(amount ~ Date, data=temp , FUN=sum)
> z
Date amount
1 1/04/2001 -5.402788
2 24/03/2001 -3.927020
3 25/03/2001 -6.442700
4 26/03/2001 -5.477593
5 27/03/2001 -6.094622
6 28/03/2001 -7.456669
7 29/03/2001 -6.731541
8 30/03/2001 -6.855206
9 31/03/2001 -6.807210
But to get it in 8 day intervals use cut.Date:
> z <- aggregate(temp$amount ,
list(Dts = cut(as.Date(temp$Date, format="%d/%m/%Y"),
breaks="8 day")), FUN=sum)
> z
Dts x
1 2001-03-24 -49.792561
2 2001-04-01 -5.402788
A more cleaner approach extended to #G. Grothendieck appraoch. Note: It does not take into account if the dates are continuous or discontinuous, sum is calculated based on the fixed width.
code
interval = 8 # your desired date interval. 2 days, 3 days or whatevea
enddate = interval-1 # this sets the enddate
nrows = nrow(z)
z <- aggregate(.~V1,data = df,sum) # aggregate sum of all duplicate dates
z$V1 <- as.Date(z$V1)
data.frame ( Start.date = (z[seq(1, nrows, interval),1]),
End.date = z[seq(1, nrows, interval)+enddate,1],
Total.sum = rollapply(z$V2, interval, sum, by = interval, partial = TRUE))
output
Start.date End.date Total.sum
1 2000-01-01 2000-01-08 9.1395926
2 2000-01-09 2000-01-16 15.0343960
3 2000-01-17 2000-01-24 4.0974712
4 2000-01-25 2000-02-01 4.1102645
5 2000-02-02 2000-02-09 -11.5816277
data
df <- data.frame(
V1 = seq(as.Date("2000/1/1"), by="day", length.out = 365),
V2 = runif(365, -5, 5))
Here is an example of a subset data in .csv files. There are three columns with no header. The first column represents the date/time and the second column is load [kw] and the third column is 1= weekday, 0 = weekends/ holiday.
9/9/2010 3:00 153.94 1
9/9/2010 3:15 148.46 1
I would like to program in R, so that it selects the first and second column within time ranges from 10:00 to 20:00 for all weekdays (when the third column is 1) within a month of September and do not know what's the best and most efficient way to code.
code dt <- read.csv("file", header = F, sep=",")
#Select a column with weekday designation = 1, weekend or holiday = 0
y <- data.frame(dt[,3])
#Select a column with timestamps and loads
x <- data.frame(dt[,1:2])
t <- data.frame(dt[,1])
#convert timestamps into readable format
s <- strptime("9/1/2010 0:00", format="%m/%d/%Y %H:%M")
e <- strptime("9/30/2010 23:45", format="%m/%d/%Y %H:%M")
range <- seq(s,e, by = "min")
df <- data.frame(range)
OP ask for "best and efficient way to code" this without showing "inefficient code", so #Justin is right.
It's seems that the OP is new to R (and it's officially the summer of love) so I give it a try and I have a solution (not sure about efficiency..)
index <- c("9/9/2010 19:00", "9/9/2010 21:15", "10/9/2010 11:00", "3/10/2010 10:30")
index <- as.POSIXct(index, format = "%d/%m/%Y %H:%M")
set.seed(1)
Data <- data.frame(Date = index, load = rnorm(4, mean = 120, sd = 10), weeks = c(0, 1, 1, 1))
## Data
## Date load weeks
## 1 2010-09-09 19:00:00 113.74 0
## 2 2010-09-09 21:15:00 121.84 1
## 3 2010-09-10 11:00:00 111.64 1
## 4 2010-10-03 10:30:00 135.95 1
cond <- expression(format(Date, "%H:%M") < "20:00" &
format(Date, "%H:%M") > "10:00" &
weeks == 1 &
format(Date, "%m") == "09")
subset(Data, eval(cond))
## Date load weeks
## 3 2010-09-10 11:00:00 111.64 1