Python: adding a water-year time variable to an xarray Dataset (datetime)

I have an xarray Dataset named 'scratch' with the lat, lon and lev coordinates removed, so that time is the only dimension. It has several variables, making it a multivariate daily time series from 2002 to 2014. I need to add a new variable, "water_year", that indicates which water year each day belongs to. It could perhaps be done with Dataset.assign or Dataset.resample, but I am not sure and could use some help. Note: a water year starts on Oct 01 and ends on Sep 30 of the following year, so water year 2003 runs from 2002-10-01 to 2003-09-30.

I'll create a sample dataset with a single variable for this example:
In [2]: scratch = xr.Dataset(
...: {'Baseflow': (('time', ), np.random.random(4018))},
...: coords={'time': pd.date_range('2002-10-01', freq='D', periods=4018)},
...: )
In [3]: scratch
Out[3]:
<xarray.Dataset>
Dimensions: (time: 4018)
Coordinates:
* time (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
Data variables:
Baseflow (time) float64 0.7588 0.05129 0.9914 ... 0.7744 0.6581 0.8686
We can build a water_year array using the Datetime Components accessor .dt:
In [4]: water_year = (scratch.time.dt.month >= 10) + scratch.time.dt.year
...: water_year
Out[4]:
<xarray.DataArray (time: 4018)>
array([2003, 2003, 2003, ..., 2013, 2013, 2013])
Coordinates:
* time (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
Because water_year is a DataArray indexed by an existing dimension, we can just add it as a coordinate and xarray will understand that it's a non-dimension coordinate. This is important to make sure we don't create a new dimension in our data.
In [7]: scratch.coords['water_year'] = water_year
In [8]: scratch
Out[8]:
<xarray.Dataset>
Dimensions: (time: 4018)
Coordinates:
* time (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
water_year (time) int64 2003 2003 2003 2003 2003 ... 2013 2013 2013 2013
Data variables:
Baseflow (time) float64 0.7588 0.05129 0.9914 ... 0.7744 0.6581 0.8686
Because water_year is indexed by time, we still need to select from the arrays using the time dimension, but we can subset the arrays to specific water years:
In [9]: scratch.sel(time=(scratch.water_year == 2010))
Out[9]:
<xarray.Dataset>
Dimensions: (time: 365)
Coordinates:
* time (time) datetime64[ns] 2009-10-01 2009-10-02 ... 2010-09-30
water_year (time) int64 2010 2010 2010 2010 2010 ... 2010 2010 2010 2010
Data variables:
Baseflow (time) float64 0.441 0.7586 0.01377 ... 0.2656 0.1054 0.6964
Aggregation operations can use non-dimension coordinates directly, so the following works:
In [10]: scratch.groupby('water_year').sum()
Out[10]:
<xarray.Dataset>
Dimensions: (water_year: 11)
Coordinates:
* water_year (water_year) int64 2003 2004 2005 2006 ... 2010 2011 2012 2013
Data variables:
Baseflow (water_year) float64 187.6 186.4 184.7 ... 185.2 189.6 192.7

Related

I need to change the Month column to date format

My data set is monthly from Jan 1997 to Dec 2021. I need the Month column to be in a proper date format, but as.Date doesn't recognise the cell contents as they are. Please help.
Month BrentSpot GDP Agriculture Production Construction Services
1 Jan-1997 23.54 63.8229 53.5614 81.9963 87.2775 59.4453
2 Feb-1997 20.85 64.7182 53.9091 82.1917 87.8350 60.5018
3 Mar-1997 19.13 64.9264 54.2569 81.6142 88.6714 60.8375
4 Apr-1997 17.56 65.2327 55.1264 82.0006 89.5170 61.0981
5 May-1997 19.02 64.7336 55.8220 82.0093 89.8144 60.4470
6 Jun-1997 17.58 65.1322 56.3438 82.3350 89.4891 60.8886
Gdp_Brent_Table$Month = seq(ymd('1997-01-01'),ymd('2021-12-01'), by = 'months')
(this seemed to do the trick; ymd() comes from the lubridate package)
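Alternatively, if you'd rather parse the existing "Jan-1997"-style labels than generate a replacement sequence, here is a small sketch using zoo (assuming the table is called Gdp_Brent_Table as above, and that the month abbreviations parse in your locale):
library(zoo)
# parse "Jan-1997" into a yearmon, then take the first day of each month
Gdp_Brent_Table$Month <- as.Date(as.yearmon(Gdp_Brent_Table$Month, "%b-%Y"))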

R Weekly Time Series Object

I have the following vector, which contains data for each day of December 2016.
vector1 <- c(1056772, 674172, 695744, 775040, 832036,735124,820668,1790756,1329648,1195276,1267644,986716,926468,828892,826284,749504,650924,822256,3434204,2502916,1262928,1025980,1828580,923372,658824,956916,915776,1081736,869836,898736,829368)
Now I want to create a time series object on a weekly basis and used the following code snippet:
weeklyts = ts(vector1,start=c(2016,12,01), frequency=7)
However, the starting and end points are not correct. I always get the following time series:
> weeklyts
Time Series:
Start = c(2017, 5)
End = c(2021, 7)
Frequency = 7
[1] 1056772 674172 695744 775040 832036 735124 820668 1790756 1329648 1195276 1267644 986716 926468 828892 826284 749504
[17] 650924 822256 3434204 2502916 1262928 1025980 1828580 923372 658824 956916 915776 1081736 869836 898736 829368
Does anybody know what I am doing wrong?
To get a time series that starts and ends as you would expect, you need to think about what the series represents: you have 31 days of December 2016.
The start option of ts takes 2 numbers, not 3, so something like c(2016, 1) if you start with month 1 of 2016. See the following example.
ts(1:12, start = c(2016, 1), frequency = 12)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2016 1 2 3 4 5 6 7 8 9 10 11 12
Now, ts and daily data are an annoyance: ts cannot handle leap years, which is why you see people using a frequency of 365.25 for annual time series of daily data. To get a correct December 2016 series we can do the following:
ts(vector1, start = c(2016, 336), frequency = 366)
Time Series:
Start = c(2016, 336)
End = c(2016, 366)
Frequency = 366
[1] 1056772 674172 695744 775040 832036 735124 820668 1790756 1329648 1195276 1267644 986716 926468 828892 826284 749504
[17] 650924 822256 3434204 2502916 1262928 1025980 1828580 923372 658824 956916 915776 1081736 869836 898736 829368
Note the following things that are going on:
Frequency is 366 because 2016 is a leap year.
start is c(2016, 336), because "2016-12-01" is day 336 of the year.
Personally I use the xts package (and zoo) to handle daily data, and use the functions in xts to aggregate to weekly series. These can then be used with packages that expect ts objects, such as forecast.
edit: added small xts example
my_df <- data.frame(dates = seq.Date(as.Date("2016-12-01"), as.Date("2017-01-31"), by = "day"),
                    var1 = rep(1:31, 2))
library(xts)
my_xts <- xts(my_df[, -1], order.by = my_df$dates)
# roll up to weekly. Dates shown are the last day in the week period.
my_xts_weekly <- period.apply(my_xts, endpoints(my_xts, on = "weeks"), colSums)
head(my_xts_weekly)
[,1]
2016-12-04 10
2016-12-11 56
2016-12-18 105
2016-12-25 154
2017-01-01 172
2017-01-08 35
Depending on your needs you can transform this back into data.frames, etc. (see the small sketch below). Read the help for period.apply, as you can supply your own functions to the aggregation, and read the xts (and zoo) vignettes.
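For instance, a minimal sketch of turning the weekly result back into a plain data.frame, using the my_xts_weekly object from above:
# index() holds the week-end dates, as.numeric() the aggregated values
weekly_df <- data.frame(week_end = index(my_xts_weekly),
                        total    = as.numeric(my_xts_weekly))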

Moving average over 5 years with irregular dates

I have a large number of files (~1200), each of which contains a long time series of groundwater height measurements. The starting date and length of the series differ between files, and there can be large gaps between dates, for example (a small part of such a file):
Date Height (cm)
14-1-1980 7659
28-1-1980 7632
14-2-1980 7661
14-3-1980 7638
28-3-1980 7642
14-4-1980 7652
25-4-1980 7646
14-5-1980 7635
29-5-1980 7622
13-6-1980 7606
27-6-1980 7598
14-7-1980 7654
28-7-1980 7654
14-8-1980 7627
28-8-1980 7600
12-9-1980 7617
14-10-1980 7596
28-10-1980 7601
14-11-1980 7592
28-11-1980 7614
11-12-1980 7650
29-12-1980 7670
14-1-1981 7698
28-1-1981 7700
13-2-1981 7694
17-3-1981 7740
30-3-1981 7683
14-4-1981 7692
14-5-1981 7682
15-6-1981 7696
17-7-1981 7706
28-7-1981 7699
28-8-1981 7686
30-9-1981 7678
17-11-1981 7723
11-12-1981 7803
18-2-1982 7757
16-3-1982 7773
13-5-1982 7753
11-6-1982 7740
14-7-1982 7731
15-8-1982 7739
14-9-1982 7722
14-10-1982 7794
15-11-1982 7764
14-12-1982 7790
14-1-1983 7810
28-3-1983 7836
28-4-1983 7815
31-5-1983 7857
29-6-1983 7801
28-7-1983 7774
24-8-1983 7758
28-9-1983 7748
26-10-1983 7727
29-11-1983 7782
27-1-1984 7801
28-3-1984 7764
27-4-1984 7752
28-5-1984 7795
27-7-1984 7748
27-8-1984 7729
28-9-1984 7752
26-10-1984 7789
28-11-1984 7797
18-12-1984 7781
28-1-1985 7833
21-2-1985 7778
22-4-1985 7794
28-5-1985 7768
28-6-1985 7836
26-8-1985 7765
19-9-1985 7760
31-10-1985 7756
26-11-1985 7760
20-12-1985 7781
17-1-1986 7813
28-1-1986 7852
26-2-1986 7797
25-3-1986 7838
22-4-1986 7807
27-5-1986 7785
24-6-1986 7787
26-8-1986 7744
23-9-1986 7742
22-10-1986 7752
1-12-1986 7749
17-12-1986 7758
I want to calculate the average height over 5-year periods: in the case of the example, 14-1-1980 + 5 years, then 14-1-1985 + 5 years, and so on. The number of data points differs for each average, and it is very likely that the date exactly 5 years later is not present in the dataset as a data point. Hence, I think I need to tell R to take an average over a certain time span.
I searched on the internet but didn't find anything that fitted my needs. A lot of useful packages (uts, zoo, lubridate) and the aggregate function came up, but instead of getting closer to a solution I got more and more confused about which approach is best for my problem.
Thanks a lot in advance!
As @vagabond points out, you'll want to combine your 1200 files into a single data frame (the plyr package would let you do something as simple as data.all <- adply(dir([DATA FOLDER]), 1, read.csv)); a base-R sketch of the same step is shown below.
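A minimal base-R sketch of that combining step (the folder name "data_folder" is a placeholder, and the read.csv arguments may need adjusting to your file format):
# read every file in the folder and stack them into one data frame
files <- list.files("data_folder", full.names = TRUE)
data.all <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))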
Once you have the data, the first step is to transform the Date column into proper date objects. Right now the dates appear to be strings, and we want them to have an underlying numerical representation (which the Date class provides):
library(lubridate)
df$date.new <- as.Date(dmy(df$Date))
Date Height date.new
1 14-1-1980 7659 1980-01-14
2 28-1-1980 7632 1980-01-28
3 14-2-1980 7661 1980-02-14
4 14-3-1980 7638 1980-03-14
5 28-3-1980 7642 1980-03-28
6 14-4-1980 7652 1980-04-14
Note that the date.new column looks like a string, but is in fact Date data, and can be handled with numerical operations (addition, comparison, etc.).
Next, we might construct a set of date periods over which we want to compute averages. Your example mentions 5 years, but with the data you provided that's not a very illustrative example, so here I'm creating 1-year periods starting at every day between Jan 14 1980 and Jan 14 1985:
date.start <- as.Date(as.Date('1980-01-14') : as.Date('1985-01-14'), origin = '1970-01-01')
date.end <- date.start + years(1)
dates <- data.frame(start = date.start, end = date.end)
start end
1 1980-01-14 1981-01-14
2 1980-01-15 1981-01-15
3 1980-01-16 1981-01-16
4 1980-01-17 1981-01-17
5 1980-01-18 1981-01-18
6 1980-01-19 1981-01-19
Then we can use the dplyr package to move through each row of this data frame and compute a summary average of Height:
library(dplyr)
df.mean <- dates %>%
  group_by(start, end) %>%
  summarize(height.mean = mean(df$Height[df$date.new >= start & df$date.new < end]))
start end height.mean
<date> <date> <dbl>
1 1980-01-14 1981-01-14 7630.273
2 1980-01-15 1981-01-15 7632.045
3 1980-01-16 1981-01-16 7632.045
4 1980-01-17 1981-01-17 7632.045
5 1980-01-18 1981-01-18 7632.045
6 1980-01-19 1981-01-19 7632.045
The foverlaps function from the data.table package is IMHO the perfect candidate for such a situation:
library(data.table)
library(lubridate)
# convert to a data.table with setDT()
# convert the 'Date'-column to date-format
# create a begin & end date for the required period
setDT(dat)[, Date := as.Date(Date, '%d-%m-%Y')
][, `:=` (begindate = Date, enddate = Date + years(1))]
# set the keys (necessary for the foverlaps function)
setkey(dat, begindate, enddate)
res <- foverlaps(dat, dat, by.x = c(1,3))[, .(moving.average = mean(i.Height)), Date]
the result:
> head(res,15)
Date moving.average
1: 1980-01-14 7633.217
2: 1980-01-28 7635.000
3: 1980-02-14 7637.696
4: 1980-03-14 7636.636
5: 1980-03-28 7641.273
6: 1980-04-14 7645.261
7: 1980-04-25 7644.955
8: 1980-05-14 7646.591
9: 1980-05-29 7647.143
10: 1980-06-13 7648.400
11: 1980-06-27 7652.900
12: 1980-07-14 7655.789
13: 1980-07-28 7660.550
14: 1980-08-14 7660.895
15: 1980-08-28 7664.000
Now you have, for each date, an average of all the values that lie between that date and one year ahead of it. For the 5-year window from the question, only the end of the interval needs to change (see the sketch below).
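A minimal sketch of that adjustment, reusing the data.table dat built above (lubridate still loaded):
# widen the window from 1 year to 5 years and recompute
dat[, enddate := Date + years(5)]
setkey(dat, begindate, enddate)
res5 <- foverlaps(dat, dat, by.x = c(1, 3))[, .(moving.average = mean(i.Height)), Date]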
Hey, I just tried this after seeing your question! I ran it on a sample data frame; try it on yours after understanding the code and then let me know!
Btw, instead of an interval of 5 years, I used just 2 months (2*30 days, approximately 2 months) as the interval.
df = data.frame(Date = c("14-1-1980", "28-1-1980", "14-2-1980", "14-3-1980", "28-3-1980",
                         "14-4-1980", "25-4-1980", "14-5-1980", "29-5-1980", "13-6-1980",
                         "27-6-1980", "14-7-1980", "28-7-1980", "14-8-1980"), height = 1:14)
# as.Date(df$Date, "%d-%m-%Y")
df1 = data.frame(orig = NULL, dest = NULL, avg_ht = NULL)
orig = as.Date(df$Date, "%d-%m-%Y")[1]
dest = as.Date(df$Date, "%d-%m-%Y")[1] + 2*30 #approx 2 months
dest_final = as.Date(df$Date, "%d-%m-%Y")[14]
while (dest < dest_final) {
  m = mean(df$height[which(as.Date(df$Date, "%d-%m-%Y") >= orig &
                           as.Date(df$Date, "%d-%m-%Y") < dest)])
  df1 = rbind(df1, data.frame(orig = orig, dest = dest, avg_ht = m))
  orig = dest
  dest = dest + 2*30
  print(paste("orig:", orig, " + ", "dest:", dest))
}
> df1
orig dest avg_ht
1 1980-01-14 1980-03-14 2.0
2 1980-03-14 1980-05-13 5.5
3 1980-05-13 1980-07-12 9.5
I hope this works for you as well
This is my best try, but please keep in mind that I am working with the years instead of the full dates, i.e. based on the example you provided I am averaging from the beginning of 1980 to the end of 1984.
dat <- read.csv("paixnidi.csv")
install.packages("stringr")
library(stringr)
dates <- dat[, 1]
# extract the year of each measurement
years <- as.integer(str_sub(dat[, 1], start = -4))
spread_y <- years[length(years)] - years[1]
ind <- list()
# find how many 5-year intervals there are
groups <- ceiling(spread_y/4)
meangroups <- matrix(0, ncol = 2, nrow = groups)
k <- 0
for (i in 1:groups) {
  # extract the indices of the dates vector within the 5-year period
  ind[[i]] <- which(years >= (years[1] + k) & years <= (years[1] + k + 4), arr.ind = TRUE)
  meangroups[i, 2] <- mean(dat[ind[[i]], 2])
  meangroups[i, 1] <- (years[1] + k)
  k <- k + 5
}
colnames(meangroups) <- c("Year:Year+4", "Mean Height (cm)")

How to subtract these two time series objects?

I have one time series object that looks as follows (sorry, I don't know how to format it any more nicely):
             Jan          Feb          Mar          Apr          May          Jun          Jul          Aug          Sep
2010 0.051495184  0.012516017  0.029767280  0.046781229  0.041615717  0.002205329  0.056919026 -0.026339813  0.078932572 ...
It contains data from 2010m01 - 2014m12
And one that looks like this:
Time Series:
Start = 673
End = 732
Frequency = 1
[1] 0.01241940 0.01238126 0.01234626 0.01227542 ...
They have the same number of observations. However, when I try to subtract them I get the error:
Error in .cbind.ts(list(e1, e2), c(deparse(substitute(e1))[1L], deparse(substitute(e2))[1L]), :
not all series have the same frequency
Can anyone tell me what I can do to subtract the two?
Thanks in advance.
Edit:
str() gives:
Time-Series [1:60] from 2010 to 2015: 0.0515 0.0125 0.0298 0.0468 0.0416 ...
and
Time-Series [1:60] from 673 to 732: 0.0124 0.0124 0.0123 0.0123 0.0122 ...
There is no frequency<- function, but you can change the frequency of time-series objects using the ts function:
> x <- ts(1:10, frequency = 4, start = c(1959, 2))
> frequency(x) <- 12
Error in frequency(x) <- 12 : could not find function "frequency<-"
> y <- ts(x, frequency=12)
> frequency(y)
[1] 12
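To actually subtract the two series from the question, one option is to give the second series the same time base as the first and then subtract. A sketch, assuming the monthly series is called ts1 and the other ts2 (the names are placeholders):
# rebuild ts2 on ts1's time base, then subtract element-wise
ts2_aligned <- ts(as.numeric(ts2), start = start(ts1), frequency = frequency(ts1))
diff_series <- ts1 - ts2_aligned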

Performance problems when converting timestamped row data

I've written a function that takes a data.frame whose rows represent 1-minute intervals of data. The purpose of the function is to convert these 1-minute intervals into longer intervals, e.g. 1 minute becomes 5 minutes, 60 minutes, etc. The data set may have gaps, i.e. jumps in time, so the function must accommodate these occurrences. I've written the following code, which appears to work, but its performance is terrible on large data sets.
I'm hoping someone can suggest how I might be able to speed this up. See below.
compressMinute = function(interval, DAT) {
  # Grab all data which begins at the same interval length
  retSet = NULL
  intervalFilter = which(DAT$time$min %% interval == 0)
  barSet = NULL
  for (x in intervalFilter) {
    barEndTime = DAT$time[x] + 60*interval
    barIntervals = DAT[x,]
    x = x+1
    while(x <= nrow(DAT) & DAT[x,"time"] < barEndTime) {
      barIntervals = rbind(barIntervals, DAT[x,])
      x = x + 1
    }
    bar = data.frame(date = barIntervals[1,"date"], time = barIntervals[1,"time"],
                     open = barIntervals[1,"open"],
                     high = max(barIntervals[1:nrow(barIntervals),"high"]),
                     low = min(barIntervals[1:nrow(barIntervals),"low"]),
                     close = tail(barIntervals, 1)$close,
                     volume = sum(barIntervals[1:nrow(barIntervals),"volume"]))
    if (is.null(barSet)) {
      barSet = bar
    } else {
      barSet = rbind(barSet, bar)
    }
  }
  return(barSet)
}
EDIT:
Below is a row of my data. Each row represents a 1-minute interval. I am trying to convert these into arbitrary buckets that aggregate the 1-minute intervals, i.e. 5 minutes, 15 minutes, 60 minutes, 240 minutes, etc.
date time open high low close volume
2005-09-06 2005-09-06 16:33:00 1297.25 1297.50 1297.25 1297.25 98
You probably want to re-use existing facilities, specifically the POSIXct time types, as well as existing packages.
For example, look at the xts package --- it already has a generic function to.period() as well as convenience wrappers to.minutes(), to.minutes3(), to.minutes10(), ....
Here is an example from the help page:
R> example(to.minutes)
t.mn10R> data(sample_matrix)
t.mn10R> samplexts <- as.xts(sample_matrix)
t.mn10R> to.monthly(samplexts)
samplexts.Open samplexts.High samplexts.Low samplexts.Close
Jan 2007 50.0398 50.7734 49.7631 50.2258
Feb 2007 50.2245 51.3234 50.1910 50.7709
Mar 2007 50.8162 50.8162 48.2365 48.9749
Apr 2007 48.9441 50.3378 48.8096 49.3397
May 2007 49.3457 49.6910 47.5180 47.7378
Jun 2007 47.7443 47.9413 47.0914 47.7672
t.mn10R> to.monthly(sample_matrix)
sample_matrix.Open sample_matrix.High sample_matrix.Low sample_matrix.Close
Jan 2007 50.0398 50.7734 49.7631 50.2258
Feb 2007 50.2245 51.3234 50.1910 50.7709
Mar 2007 50.8162 50.8162 48.2365 48.9749
Apr 2007 48.9441 50.3378 48.8096 49.3397
May 2007 49.3457 49.6910 47.5180 47.7378
Jun 2007 47.7443 47.9413 47.0914 47.7672
t.mn10R> str(to.monthly(samplexts))
An ‘xts’ object from Jan 2007 to Jun 2007 containing:
Data: num [1:6, 1:4] 50 50.2 50.8 48.9 49.3 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:4] "samplexts.Open" "samplexts.High" "samplexts.Low" "samplexts.Close"
Indexed by objects of class: [yearmon] TZ:
xts Attributes:
NULL
t.mn10R> str(to.monthly(sample_matrix))
num [1:6, 1:4] 50 50.2 50.8 48.9 49.3 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:6] "Jan 2007" "Feb 2007" "Mar 2007" "Apr 2007" ...
..$ : chr [1:4] "sample_matrix.Open" "sample_matrix.High" "sample_matrix.Low" "sample_matrix.Close"
R>
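Applied to the 1-minute OHLCV data from the question, a minimal sketch could look like this (DAT and its column names date/time/open/high/low/close/volume come from the question; renaming the columns to Open/High/Low/Close/Volume is an assumption so that xts recognises them as OHLCV):
library(xts)
# build an xts object from the question's 1-minute data.frame
prices <- xts(DAT[, c("open", "high", "low", "close", "volume")],
              order.by = as.POSIXct(DAT$time))
colnames(prices) <- c("Open", "High", "Low", "Close", "Volume")
bars5  <- to.period(prices, period = "minutes", k = 5)    # 5-minute bars
bars60 <- to.period(prices, period = "minutes", k = 60)   # hourly bars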
