Convert irregular 6-hourly data to daily accumulated totals using R

I have the following data:
Date,Rain
1979_8_9_0,8.775
1979_8_9_6,8.775
1979_8_9_12,8.775
1979_8_9_18,8.775
1979_8_10_0,0
1979_8_10_6,0
1979_8_10_12,0
1979_8_10_18,0
1979_8_11_0,8.025
1979_8_12_12,0
1979_8_12_18,0
1979_8_13_0,8.025
The data are six-hourly, but some dates have incomplete six-hourly data. For example, August 11, 1979 has only one value, at 00H. I would like to get the daily accumulated rainfall from this kind of data using R. Any suggestions on how to do this easily in R?
I'll appreciate any help.

You can transform your data to dates very easily with:
dat$Date <- as.Date(strptime(dat$Date, '%Y_%m_%d_%H'))
After that you should aggregate with:
aggregate(Rain ~ Date, dat, sum)
The result:
Date Rain
1 1979-08-09 35.100
2 1979-08-10 0.000
3 1979-08-11 8.025
4 1979-08-12 0.000
5 1979-08-13 8.025
Based on the comment of Henrik, you can also transform to dates with:
dat$Date <- as.Date(dat$Date, '%Y_%m_%d')
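Since the question is specifically about days with missing 6-hourly values, it may also help to keep track of how many observations went into each daily sum. The following is only a sketch: the data frame is rebuilt by hand from the sample in the question, and the column names Day, n_obs and complete are made up for illustration.

```r
# Sample data retyped from the question (names Day, n_obs, complete are assumptions)
dat <- data.frame(
  Date = c("1979_8_9_0", "1979_8_9_6", "1979_8_9_12", "1979_8_9_18",
           "1979_8_10_0", "1979_8_10_6", "1979_8_10_12", "1979_8_10_18",
           "1979_8_11_0", "1979_8_12_12", "1979_8_12_18", "1979_8_13_0"),
  Rain = c(8.775, 8.775, 8.775, 8.775, 0, 0, 0, 0, 8.025, 0, 0, 8.025),
  stringsAsFactors = FALSE
)

# Keep only the year_month_day part; the trailing hour is ignored by as.Date
dat$Day <- as.Date(dat$Date, format = "%Y_%m_%d")

# Daily totals, plus the number of 6-hourly records behind each total
daily  <- aggregate(Rain ~ Day, data = dat, FUN = sum)
counts <- aggregate(Rain ~ Day, data = dat, FUN = length)
daily$n_obs    <- counts$Rain
daily$complete <- daily$n_obs == 4   # a full day has 4 six-hourly values

daily
```

Whether an incomplete day should be kept, dropped, or scaled up is a judgment call that depends on the data, so the sketch only flags such days.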

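# split the "Date" variable into its four components (year, month, day, hour)
splitDate <- stringr::str_split_fixed(string = df$Date, pattern = "_", n = 4)
# build a year-month-day key so that equal day numbers in different months stay separate
df$Day <- paste(splitDate[,1], splitDate[,2], splitDate[,3], sep = "-")
# split the rain values by day and sum each group
sapply(split(df$Rain, df$Day), sum)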

Related

How to combine aggregate per month and splitting a data set at a certain date

I am trying to create a monthly average of precipitation values of two different time sets, but I can't get the data to be split into two before making the aggregation.
I have a dataset of daily precipitation data from 01-01-2006 to 31-12-2099 and I want to aggregate per month over the time period of (01-01-2015 to 31-12-2054) and (01-01-2055 to 31-12-2099).
I have used the aggregate function to create an average per month like this. But now I have the average per month over the entire data set (2006-2100) and I want to have two lists (one for 01-01-2015 to 31-12-2054 and one for 01-01-2055 to 31-12-2099). I think I need to make a subset or split the data, but I cannot find how to combine this with the aggregate function. Thank you so much!
months = Alentejo_RCP4.5_Average$Month
Alentejo_RCP4.5_Average.myma = aggregate(x = Alentejo_RCP4.5_Average,
by = list(months), FUN = mean)
I also tried this but it just takes the dates and not the attached values to the date.
df <- data.frame(date=as.Date("2015-01-01")+1:365, x=1:365)
list <- split(df,df$date<as.Date("2055-01-01"))
zz <- " Year Month Day Date Average_P
2006 1 1 2006-01-01 6.5
2007 1 2 2007-01-02 2.8
2055 3 3 2055-03-03 3.5
2058 3 4 2058-03-04 5.1
2060 5 5 2060-05-05 3.2"
Data <- read.table(text=zz, header = TRUE)
Instead of splitting the datasets you can create a new column to distinguish between the two groups and take mean of each group and each month.
library(dplyr)
Data %>%
mutate(Date = as.Date(Date),
group = ifelse(Date < as.Date("2055-01-01"),
'below_2055', 'above_2055'),
month = format(Date, '%m-%Y')) %>%
group_by(group, month) %>%
summarise(Average_P = mean(Average_P)) -> result
Or in base R :
Data$Date <- as.Date(Data$Date)
aggregate(Average_P~group + month,
transform(Data,
group = ifelse(Date < as.Date("2055-01-01"),
'below_2055', 'above_2055'),
month = format(Date, '%m-%Y')), mean) -> result
If you need final output as list you can then use split.
split(result,result$group)
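If the goal is one mean per calendar month (1 to 12) for each of the two periods, with the years before 2015 left out, a variation along the following lines might work. This is a sketch on made-up random data; the column names period and Month and the period labels are assumptions, not part of the original question.

```r
# Hypothetical daily series standing in for the real precipitation data
set.seed(1)
Data <- data.frame(Date = seq(as.Date("2006-01-01"), as.Date("2099-12-31"), by = "day"))
Data$Average_P <- runif(nrow(Data), min = 0, max = 10)

# Assign each row to a period by year; years up to 2014 fall outside and become NA
yr <- as.integer(format(Data$Date, "%Y"))
Data$period <- cut(yr, breaks = c(2014, 2054, 2099),
                   labels = c("2015-2054", "2055-2099"))
Data$Month <- as.integer(format(Data$Date, "%m"))

# The formula interface drops rows with NA period, so 2006-2014 is excluded
result <- aggregate(Average_P ~ period + Month, data = Data, FUN = mean)

# One data frame of 12 monthly means per period
by_period <- split(result, result$period)
```

Using cut() on the year keeps the period boundaries in one place, which is convenient if the periods change later.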

How to aggregate using water years (oct 1 2008- sept 30 2009)

I have data measuring daily precipitation, which I am analysing in R. My dates are in the format 2008-01-01 and span 10 years. I am trying to aggregate from 2008-10-01 to 2009-09-30 but I am not sure how. Is there a way in aggregate to set a start date for the aggregation and grouping?
My current code is
data<- aggregate(data$total_snow_cm, by=list(data$year), FUN = 'sum')
but this output gives me a sum total of the snowfall for each year from jan - dec but I want it to include oct / 08 to sept / 09.
Assuming your data are in long format, I'd do something like this:
library(tidyverse)
library(lubridate)  # ymd(), month(), year(); attached explicitly in case your tidyverse version does not load it
#make sure R knows your dates are dates - you mention they're 'yyyy-mm-dd', so
yourdataframe <- yourdataframe %>%
mutate(yourcolumnforprecipdate = ymd(yourcolumnforprecipdate))
#in this script or another, define a water year function
water_year <- function(date) {
ifelse(month(date) < 10, year(date), year(date)+1)}
#new wateryear column for your data, using your new function
yourdataframe <- yourdataframe %>%
mutate(wateryear = water_year(yourcolumnforprecipdate))
#now group by water year (and location if there's more than one)
#and sum and create new data.frame
wy_sums <- yourdataframe %>% group_by(locationcolumn, wateryear) %>%
summarize(wy_totalprecip = sum(dailyprecip))
For more info, read up on the tidyverse's lubridate package, which is where the ymd() function comes from. There are others, such as ymd_hms(). mutate() is from the tidyverse's dplyr package. Both packages are extremely useful!
I'd like to give the actual answer to the question, where the aggregate() way was asked.
You may use with() to wrap the data specification around aggregate(). In the with() you can define date intervals as you can with numbers.
df1.agg <- with(df1[as.Date("2008-10-01") <= df1$year & df1$year <= as.Date("2009-09-30"), ],
aggregate(total_snow_cm, by=list(year), FUN=sum))
Another way is to use aggregate()'s formula interface, where data and, hence, also the interval can be specified inside the aggregate() call.
df1.agg <- aggregate(total_snow_cm ~ year,
data=df1[as.Date("2008-10-01") <= df1$year &
df1$year <= as.Date("2009-09-30"), ], FUN=sum)
Result
head(df1.agg)
# year total_snow_cm
# 1 2008-10-01 171
# 2 2008-10-02 226
# 3 2008-10-03 182
# 4 2008-10-04 129
# 5 2008-10-05 135
# 6 2008-10-06 222
Data
set.seed(42)
df1 <- data.frame(total_snow_cm=sample(120:240, 4018, replace=TRUE),
year=seq(as.Date("2000-01-01"),as.Date("2010-12-31"), by="day"))
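The two aggregate() calls above pick out a single water year. To get one total per water year across the whole record with base R only, something like the following sketch could be used. The data frame snow, its columns, and the random values are invented for illustration; the water year is labelled by the calendar year in which it ends, the same convention as the water_year() function in the tidyverse answer.

```r
# Water year label: October to December belong to the following year's water year
water_year <- function(d) {
  y <- as.integer(format(d, "%Y"))
  m <- as.integer(format(d, "%m"))
  ifelse(m >= 10, y + 1L, y)
}

# Hypothetical daily snowfall record (values are random, not real data)
set.seed(42)
snow <- data.frame(date = seq(as.Date("2008-01-01"), as.Date("2010-12-31"), by = "day"))
snow$total_snow_cm <- sample(0:20, nrow(snow), replace = TRUE)

# Tag every row with its water year, then sum within each water year
snow$wateryear <- water_year(snow$date)
wy_totals <- aggregate(total_snow_cm ~ wateryear, data = snow, FUN = sum)
wy_totals
```

Note that the first and last water years in such a result are usually partial (here Jan to Sep 2008 and Oct to Dec 2010), so they may need to be dropped before comparing totals.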

Create 10,000 date data.frames with fake years based on 365 days window

Here my time period range:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
df = as.data.frame(seq(from = start_day, to = end_day, by = 'day'))
colnames(df) = 'date'
I need to create 10,000 data.frames, each with different fake years of 365 days. This means that each of the 10,000 data.frames needs to have a different start and end of year.
In total df has got 14,965 days which, divided by 365 days = 41 years. In other words, df needs to be grouped 10,000 times differently by 41 years (of 365 days each one).
The start of each year has to be random, so it can be 1974-10-03, 1974-08-30, 1976-01-03, etc... and the remaining dates at the end df need to be recycled with the starting one.
The grouped fake years need to appear in a 3rd col of the data.frames.
I would put all the data.frames into a list but I don't know how to create the function which generates 10,000 different year's start dates and subsequently group each data.frame with a 365 days window 41 times.
Can anyone help me?
@gringer gave a good answer but it solved only 90% of the problem:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
What I need is 10,000 columns with 14,965 rows made by dates taken from df which need to be eventually recycled when reaching the end of df.
I tried to change length.out = 14965 but R does not recycle the dates.
Another option could be to change length.out = 1 and eventually add the remaining df rows for each column by maintaining the same order:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=1, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
How can I add the remaining df rows to each col?
The seq method also works if the to argument is unspecified, so it can be used to generate a specific number of days starting at a particular date:
> seq(from=df$date[20], length.out=10, by="day")
[1] "1974-01-20" "1974-01-21" "1974-01-22" "1974-01-23" "1974-01-24"
[6] "1974-01-25" "1974-01-26" "1974-01-27" "1974-01-28" "1974-01-29"
When used in combination with replicate and sample, I think this will give what you want in a list:
> replicate(2,seq(sample(df$date, 1), length.out=10, by="day"), simplify=FALSE)
[[1]]
[1] "1985-07-24" "1985-07-25" "1985-07-26" "1985-07-27" "1985-07-28"
[6] "1985-07-29" "1985-07-30" "1985-07-31" "1985-08-01" "1985-08-02"
[[2]]
[1] "2012-10-13" "2012-10-14" "2012-10-15" "2012-10-16" "2012-10-17"
[6] "2012-10-18" "2012-10-19" "2012-10-20" "2012-10-21" "2012-10-22"
Without the simplify=FALSE argument, it produces an array of integers (i.e. R's internal representation of dates), which is a bit trickier to convert back to dates. A slightly more convoluted way to do this and still produce Date output is to use data.frame on the unsimplified replicate result. Here's an example that will produce a 10,000-column data frame with 365 dates in each column (takes about 5s to generate on my computer):
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE));
colnames(dates.df) <- 1:10000;
> dates.df[1:5,1:5];
1 2 3 4 5
1 1988-09-06 1996-05-30 1987-07-09 1974-01-15 1992-03-07
2 1988-09-07 1996-05-31 1987-07-10 1974-01-16 1992-03-08
3 1988-09-08 1996-06-01 1987-07-11 1974-01-17 1992-03-09
4 1988-09-09 1996-06-02 1987-07-12 1974-01-18 1992-03-10
5 1988-09-10 1996-06-03 1987-07-13 1974-01-19 1992-03-11
To get the date wraparound working, a slight modification can be made to the original data frame, pasting a copy of itself on the end:
df <- as.data.frame(c(seq(from = start_day, to = end_day, by = 'day'),
seq(from = start_day, to = end_day, by = 'day')));
colnames(df) <- "date";
This is easier to code for downstream; the alternative being a double seq for each result column with additional calculations for the start/end and if statements to deal with boundary cases.
Now, instead of doing date arithmetic, the result columns are subsets of the original data frame (where the arithmetic is already done): each one starts at a date in the first half of the frame and takes the next 14965 values. I'm using nrow(df)/2 instead of the literal number to keep the code more generic:
dates.df <-
as.data.frame(lapply(sample.int(nrow(df)/2, 10000),
function(startPos){
df$date[startPos:(startPos+nrow(df)/2-1)];
}));
colnames(dates.df) <- 1:10000;
>dates.df[c(1:5,(nrow(dates.df)-5):nrow(dates.df)),1:5];
1 2 3 4 5
1 1988-10-21 1999-10-18 2009-04-06 2009-01-08 1988-12-28
2 1988-10-22 1999-10-19 2009-04-07 2009-01-09 1988-12-29
3 1988-10-23 1999-10-20 2009-04-08 2009-01-10 1988-12-30
4 1988-10-24 1999-10-21 2009-04-09 2009-01-11 1988-12-31
5 1988-10-25 1999-10-22 2009-04-10 2009-01-12 1989-01-01
14960 1988-10-15 1999-10-12 2009-03-31 2009-01-02 1988-12-22
14961 1988-10-16 1999-10-13 2009-04-01 2009-01-03 1988-12-23
14962 1988-10-17 1999-10-14 2009-04-02 2009-01-04 1988-12-24
14963 1988-10-18 1999-10-15 2009-04-03 2009-01-05 1988-12-25
14964 1988-10-19 1999-10-16 2009-04-04 2009-01-06 1988-12-26
14965 1988-10-20 1999-10-17 2009-04-05 2009-01-07 1988-12-27
This takes a bit less time now, presumably because the date values have been precalculated.
Try this one, using subsetting instead:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
date_vec <- seq.Date(from=start_day, to=end_day, by="day")
Now, I create a vector long enough so that I can use easy subsetting later on:
date_vec2 <- rep(date_vec,2)
Now, create the random start dates for 100 instances (replace this with 10000 for your application):
random_starts <- sample(1:14965, 100)
Now, create a list of dates by simply subsetting date_vec2 with your desired length:
dates <- lapply(random_starts, function(x) date_vec2[x:(x+14964)])
date_df <- data.frame(dates)
names(date_df) <- 1:100
date_df[1:5,1:5]
1 2 3 4 5
1 1997-05-05 2011-12-10 1978-11-11 1980-09-16 1989-07-24
2 1997-05-06 2011-12-11 1978-11-12 1980-09-17 1989-07-25
3 1997-05-07 2011-12-12 1978-11-13 1980-09-18 1989-07-26
4 1997-05-08 2011-12-13 1978-11-14 1980-09-19 1989-07-27
5 1997-05-09 2011-12-14 1978-11-15 1980-09-20 1989-07-28
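Neither answer adds the column with the fake-year grouping that the question asks for. A possible sketch, assuming one data frame per random start and using modular indexing for the wraparound instead of a doubled vector, is shown below. The function name make_fake_years and the column name fake_year are made up for illustration, and only 3 data frames are generated to keep the example quick.

```r
start_day <- as.Date("1974-01-01")
end_day   <- as.Date("2014-12-21")
date_vec  <- seq(start_day, end_day, by = "day")
n <- length(date_vec)   # 14965 days = 41 blocks of 365

# Build one data frame that starts at position start_pos and wraps around
make_fake_years <- function(start_pos) {
  # positions start_pos, start_pos + 1, ..., wrapping back to 1 after n
  idx <- ((start_pos - 1 + seq_len(n) - 1) %% n) + 1
  data.frame(date = date_vec[idx],
             fake_year = rep(seq_len(n / 365), each = 365))
}

set.seed(1)
starts <- sample.int(n, 3)   # replace 3 with 10000 for the full set
fake_list <- lapply(starts, make_fake_years)

head(fake_list[[1]])
```

Keeping the results in a list of data frames, rather than one very wide data frame, makes it straightforward to lapply() further processing over the 10,000 variants.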

Aggregating weekly (7 day) data to monthly in R

I have data measured over a 7 day period. Part of the data looks as follows:
start wk end wk X1
2/1/2004 2/7/2004 89
2/8/2004 2/14/2004 65
2/15/2004 2/21/2004 64
2/22/2004 2/28/2004 95
2/29/2004 3/6/2004 79
3/7/2004 3/13/2004 79
I want to convert this weekly (7 day) data into monthly data using weighted averages of X1. Notice that some of the 7 day X1 data will overlap from one month to the other (X1=79 for the period 2/29 to 3/6 of 2004).
Specifically I would obtain the February 2004 monthly data (say, Y1) the following way
(7*89 + 7*65 + 7*64 + 7*95 + 1*79)/29 = 78.27
Does R have a function that will properly do this? (to.monthly in the xts library DOES NOT do what I need.) If not, what is the best way to do this in R?
Convert the data to daily data and then aggregate:
Lines <- "start end X1
2/1/2004 2/7/2004 89
2/8/2004 2/14/2004 65
2/15/2004 2/21/2004 64
2/22/2004 2/28/2004 95
2/29/2004 3/6/2004 79
3/7/2004 3/13/2004 79
"
library(zoo)
# read data into data frame DF
DF <- read.table(text = Lines, header = TRUE)
# convert date columns to "Date" class
fmt <- "%m/%d/%Y"
DF <- transform(DF, start = as.Date(start, fmt), end = as.Date(end, fmt))
# convert to daily zoo series
to.day <- function(i) with(DF, zoo(X1[i], seq(start[i], end[i], "day")))
z.day <- do.call(c, lapply(1:nrow(DF), to.day))
# aggregate by month
aggregate(z.day, as.yearmon, mean)
The last line gives:
Feb 2004 Mar 2004
78.27586 79.00000
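The same expand-to-daily idea can be written without zoo. The sketch below is one possible base R version; the data frame wk is retyped from the question, and the names wk, days and monthly are only illustrative.

```r
# Weekly data retyped from the question, with dates already in ISO format
wk <- data.frame(
  start = as.Date(c("2004-02-01", "2004-02-08", "2004-02-15",
                    "2004-02-22", "2004-02-29", "2004-03-07")),
  end   = as.Date(c("2004-02-07", "2004-02-14", "2004-02-21",
                    "2004-02-28", "2004-03-06", "2004-03-13")),
  X1    = c(89, 65, 64, 95, 79, 79)
)

# Repeat each weekly value once for every day the week covers
days <- do.call(rbind, lapply(seq_len(nrow(wk)), function(i) {
  data.frame(day = seq(wk$start[i], wk$end[i], by = "day"), X1 = wk$X1[i])
}))

# A plain mean over the daily rows is the day-weighted mean of the weekly values
days$month <- format(days$day, "%Y-%m")
monthly <- tapply(days$X1, days$month, mean)
monthly
```

For February 2004 this reproduces the hand calculation in the question, (7*89 + 7*65 + 7*64 + 7*95 + 1*79)/29. As with the zoo version, a month that is only partly covered by the data (March here) is averaged over the days that are present.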
If you are willing to get rid of "end week" from your DF, apply.monthly will work like a charm.
library(xts)
# start_wk is assumed to be the week-start column, already converted to Date
DF.xts <- xts(DF$X1, order.by=DF$start_wk)
DF.xts.monthly <- apply.monthly(DF.xts, sum)
Then you can always recreate end dates if you absolutely need them by adding 30.

Extract Date in R

I struggle mightily with dates in R and could do this pretty easily in SPSS, but I would love to stay within R for my project.
I have a date column in my data frame and want to remove the year completely in order to leave the month and day. Here is a peek at my original data.
> head(ds$date)
[1] "2003-10-09" "2003-10-11" "2003-10-13" "2003-10-15" "2003-10-18" "2003-10-20"
> class((ds$date))
[1] "Date"
I "want" it to be.
> head(ds$date)
[1] "10-09" "10-11" "10-13" "10-15" "10-18" "10-20"
> class((ds$date))
[1] "Date"
If possible, I would love to set the first date to be October 1st instead of January 1st.
Any help you can provide will be greatly appreciated.
EDIT: I felt like I should add some context. I want to plot an NHL player's performance over the course of a season which starts in October and ends in April. To add to this, I would like to facet the plots by each season which is a separate column in my data frame. Because I want to compare cumulative performance over the course of the season, I believe that I need to remove the year portion, but maybe I don't; as I indicated, I struggle with dates in R. What I am looking to accomplish is a plot that compares cumulative performance over relative dates by season and have the x-axis start in October and end in April.
> d = as.Date("2003-10-09", format="%Y-%m-%d")
> format(d, "%m-%d")
[1] "10-09"
Is this what you are looking for?
library(ggplot2)
library(reshape2)  # provides melt()
## make up data for two seasons a and b
a = as.Date("2010/10/1")
b = as.Date("2011/10/1")
a.date <- seq(a, by='1 week', length=28)
b.date <- seq(b, by='1 week', length=28)
## make up some score data
a.score <- abs(trunc(rnorm(28, mean = 10, sd = 5)))
b.score <- abs(trunc(rnorm(28, mean = 10, sd = 5)))
## create a data frame
df <- data.frame(a.date, b.date, a.score, b.score)
df
## Since I am using ggplot, I had better create a long-format data frame
df.molt <- melt(df, measure.vars = c("a.score", "b.score"))
levels(df.molt$variable) <- c("First season", "Second season")
df.molt
Then, I am using ggplot2 for plotting the data:
## plot it
ggplot(aes(y = value, x = a.date), data = df.molt) + geom_point() +
geom_line() + facet_wrap(~variable, ncol = 1) +
scale_x_date("Date", date_labels = "%m-%d")
If you want to modify the x-axis (e.g., display format), then you'll probably be interested in scale_x_date.
You have to remember that Date is a numeric format, representing the number of days passed since the "origin" of the internal date counting (1970-01-01 in R):
> str(Date)
Class 'Date' num [1:10] 14245 14360 14475 14590 14705 ...
This is the same as in Excel, if you want a reference. Hence the solution with format() is perfectly valid.
Now if you want to set the first date of a year as October 1st, you can construct some year index like this :
redefine.year <- function(x,start="10-1"){
year <- as.numeric(strftime(x,"%Y"))
yearstart <- as.Date(paste(year,start,sep="-"))
year + (x >= yearstart) - min(year) + 1
}
Testing code :
Start <- as.Date("2009-1-1")
Stop <- as.Date("2011-11-1")
Date <- seq(Start,Stop,length.out=10)
data.frame( Date=as.character(Date),
year=redefine.year(Date))
gives
Date year
1 2009-01-01 1
2 2009-04-25 1
3 2009-08-18 1
4 2009-12-11 2
5 2010-04-05 2
6 2010-07-29 2
7 2010-11-21 3
8 2011-03-16 3
9 2011-07-09 3
10 2011-11-01 4
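For the plotting goal in the question (seasons that run from October to April, compared on a common x-axis), one option is to combine a season index like the one above with a "day of season" counter. The helper below is a sketch with a made-up name, season_day, and it assumes that every season starts on the 1st of the given month.

```r
# Day number within a season that starts on the 1st of start_month (October by default)
season_day <- function(d, start_month = 10L) {
  y <- as.integer(format(d, "%Y"))
  m <- as.integer(format(d, "%m"))
  # dates before the start month belong to the season that began the previous year
  season_start_year <- ifelse(m >= start_month, y, y - 1L)
  season_start <- as.Date(sprintf("%d-%02d-01", season_start_year,
                                  as.integer(start_month)))
  as.integer(d - season_start) + 1L
}

d <- as.Date(c("2003-10-01", "2003-10-09", "2004-01-01", "2004-04-15"))
season_day(d)
```

Plotting cumulative performance against this day counter, faceted or coloured by season, puts October 1st at the left edge of every panel without having to strip the year from the Date column.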
