Average of month's data (Jan-Dec) in xts objects - R

I have this large xts, aggregated monthly with the apply.monthly function:
2011-07-31 269.8
2011-08-31 251.0
2011-09-30 201.8
2011-10-31 95.8
2011-11-30 NA
2011-12-31 49.3
2012-01-31 77.1
...
What I want is to calculate the average for each month (Jan-Dec) over the whole period. Something like this, but in xts form:
01 541.8
02 23.0
03 34.8
04 12.8
05 21.8
06 44.8
07 22.8
08 55.0
09 287.8
10 15.8
11 113
12 419.3
I want to avoid using dplyr functions like group_by. I think there must be a solution using split and lapply / do.call.
I tried splitting the xts by years
xtsobject <- split(xtsobject, f = "years")
and then I don't know how to use the lapply function properly in order to calculate the 12 averages (Jan-Dec) over the whole period.
This question
Group by period.apply() in xts
is similar, but in my xts I don't have/want a new column; I think it can be done using the xts index.

Assuming the input data x, shown reproducibly in the Note at the end, use aggregate.zoo like this:
ag <- aggregate(x, cycle(as.yearmon(time(x))), mean, na.rm = TRUE)
ag
giving the following zoo series:
1 77.1
7 269.8
8 251.0
9 201.8
10 95.8
11 NaN
12 49.3
We could plot it like this:
plot(ag, type = "h")
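If month abbreviations are preferred over the 1-12 cycle numbers, one small follow-up (a sketch, assuming the ag object from above) is to relabel the aggregated values:
setNames(coredata(ag), month.abb[index(ag)])
This returns a plain named vector (Jan, Jul, Aug, ...) rather than a zoo series.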
Note
Lines <- "2011-07-31 269.8
2011-08-31 251.0
2011-09-30 201.8
2011-10-31 95.8
2011-11-30 NA
2011-12-31 49.3
2012-01-31 77.1"
library(xts)
z <- read.zoo(text = Lines)
x <- as.xts(z)

You can use the base::months function to extract the month before calculating the mean:
do.call(rbind, lapply(split(x, base::months(index(x))), mean, na.rm=TRUE))
output:
[,1]
April 165.1600
August 290.2444
December 106.8200
February 82.6300
January 62.9100
July 264.9889
June 246.4889
March 100.5500
May 246.3333
November 116.6400
October 151.3667
September 158.5667

It seems the index is a number and not a POSIXct object. You can convert it, use format to extract the months, and use that in tapply:
tapply(xtsobject[, 1],
       format(as.POSIXct(zoo::index(xtsobject), origin = '1970-01-01'), '%m'),
       mean, na.rm = TRUE)
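For completeness, a sketch of the same idea using the reproducible x from the Note in the first answer (its index is already of class Date, so the POSIXct conversion is not needed there; xtsobject itself is not reproduced in the question):
tapply(coredata(x)[, 1], format(index(x), "%m"), mean, na.rm = TRUE)
#    01    07    08    09    10    11    12
#  77.1 269.8 251.0 201.8  95.8   NaN  49.3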

Related

For loop for converting character data to numeric in a data frame

I have 50 data frames (each with a different name) with 10 identically named columns of climate data. The first 5 columns contain numbers, but their class is "character". The next 4 columns are already the correct class (numeric), and the last one (named 'wind dir') is of character class, so no change is needed there.
I tried two ways to convert the class of those 5 columns in all 50 data frames, but nothing worked.
1st way) First I created a vector with the names of those 50 data frames and named it onomata.
Then I created a vector col_numbers2 <- c(1:5) with the numbers of the columns I would like to convert.
Then I wrote the following code:
for(i in onomata){
i[col_numbers2] <- sapply(i[col_numbers2], as.numeric)
}
Checking the class of those first five columns I saw that nothing changed. (No error report after executing the code)
2nd way) Then I tried to use the dplyr package with a for loop and the code is as follows:
for(i in onomata){
  i <- i %>%
    mutate_at(vars(-`wind_dir`), as.numeric)
}
In this case I excluded the character column and applied the mutate function to the whole data frame, but I received an error message:
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "character"
What do you think I am doing wrong ?
Thank you
Original data table (what I get when I use read.table() for each txt file):
date      Time   Tdry  Humidity  Wind_velocity  Wind_direction  Wind_gust
02/01/15  02:00  2.4   77.0      6.4            WNW             20.9
02/01/15  03:00  2.3   77.0      11.3           NW              30.6
02/01/15  04:00  2.3   77.0      9.7            NW              20.9
02/01/15  05:00  2.3   77.0      11.3           NW              30.6
02/01/15  06:00  2.3   78.0      9.7            NW              19.3
02/01/15  07:00  2.2   79.0      12.9           NNW             35.4
02/01/15  08:00  2.4   79.0      8.0            NW              14.5
02/01/15  09:00  2.6   79.0      8.0            WNW             20.9
Data after splitting columns 1 and 2 (date, time):
day  month  year  Hour  Minutes  Tdry  Humidity  Wind_velocity  Wind_direction  Wind_gust
02   01     15    02    00       2.4   77.0      6.4            WNW             20.9
02   01     15    03    00       2.3   77.0      11.3           NW              30.6
02   01     15    04    00       2.3   77.0      9.7            NW              20.9
02   01     15    05    00       2.3   77.0      11.3           NW              30.6
02   01     15    06    00       2.3   78.0      9.7            NW              19.3
02   01     15    07    00       2.2   79.0      12.9           NNW             35.4
02   01     15    08    00       2.4   79.0      8.0            NW              14.5
02   01     15    09    00       2.6   79.0      8.0            WNW             20.9
Here are two possible ways. Both rely on getting all your files into a list of data frames (called df_list in the example below). To achieve this you could use mget() (e.g. mget(onomata)) or list.files().
Once this is done, you can use lapply (or mapply) to go through all your data frames.
Solution 1
To transform your data, I propose you first convert it into POSIXct format and then extract the relevant elements to make the wanted columns.
# create a custom function that transforms each dataframe the way you want
fun_split_datehour <- function(df){
df[, "datetime"] <- as.POSIXct(paste(df$date, df$hour), format = "%d/%m/%Y %H:%M") # create a POSIXct column with info on date and time
# Extract elements you need from the date & time column and store them in new columns
df[,"year"] <- as.numeric(format(df[, "datetime"], format = "%Y"))
df[,"month"] <- as.numeric(format(df[, "datetime"], format = "%m"))
df[,"day"] <- as.numeric(format(df[, "datetime"], format = "%d"))
df[,"hour"] <- as.numeric(format(df[, "datetime"], format = "%H"))
df[,"min"] <- as.numeric(format(df[, "datetime"], format = "%M"))
return(df)
}
# use this function on each dataframe of your list
lapply(df_list, FUN = fun_split_datehour)
Adapted from Split date data (m/d/y) into 3 separate columns (this answer)
Data:
# two dummy data frames; the date and hour format does not matter, you can tell as.POSIXct what to expect using the format argument (see ?as.POSIXct)
df1 <- data.frame(date = c("02/01/2010", "03/02/2010", "10/09/2010"),
hour = c("05:32", "08:20", "15:33"))
df2 <- data.frame(date = c("02/01/2010", "03/02/2010", "10/09/2010"),
hour = c("05:32", "08:20", "15:33"))
# you can replace c("df1", "df2") with onomata: df_list <- mget(onomata)
df_list <- mget(c("df1", "df2"))
Outputs:
> lapply(df_list, FUN = fun_split_datehour)
$df1
date hour datetime year month day min
1 2010-01-02 5 2010-01-02 05:32:00 2010 1 2 32
2 2010-02-03 8 2010-02-03 08:20:00 2010 2 3 20
3 2010-09-10 15 2010-09-10 15:33:00 2010 9 10 33
$df2
date hour datetime year month day min
1 2010-01-02 5 2010-01-02 05:32:00 2010 1 2 32
2 2010-02-03 8 2010-02-03 08:20:00 2010 2 3 20
3 2010-09-10 15 2010-09-10 15:33:00 2010 9 10 33
The columns year, month, day, hour and min are numeric. You can check this using str(lapply(df_list, FUN = fun_split_datehour)).
Note: looking at the question you asked before this one, you might find https://stackoverflow.com/a/24376207/10264278 useful. In addition, using the POSIXct format will save you time if you want to make plots, arrange, etc.
Solution 2
If you do not want to use POSIXct, you could do:
# Dummy data changed to match your situation, with the date already split
dfa <- data.frame(day = c("02", "03", "10"),
hour = c("05", "08", "15"))
dfb <- data.frame(day = c("02", "03", "10"),
hour = c("05", "08", "15"))
df_list <- mget(c("dfa", "dfb"))
# Same thing, use lapply() to go through each dataframe of the list and apply() to use as.numeric on the wanted columns
lapply(df_list, FUN = function(df){as.data.frame(apply(df[1:2], 2, as.numeric))}) # change df[1:2] to select columns you want to convert in your actual dataframes
Maybe the following code can help.
First, get the filenames with list.files. Second, read them all in with lapply. If read.table is not the appropriate function, read help("read.table"); it is the same page as for read.csv, read.csv2, etc. Then coerce the first 5 columns of all the data frames to numeric in one go.
filenames <- list.files(path = "your_directory", pattern = "\\.txt")
onomata <- lapply(filenames, read.table)
onomata <- lapply(onomata, function(X){
X[1:5] <- lapply(X[1:5], as.numeric)
X
})
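As a quick sanity check afterwards (a sketch, assuming the onomata list built above), you can confirm the coercion worked in every data frame:
# each of the first five columns should now report "numeric"
sapply(onomata, function(X) sapply(X[1:5], class))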

Importing Text File as Zoo in R

I am trying to import a text file with data that looks like:
Jan 1998 4.36
Feb 1998 4.34
Mar 1998 4.35
Apr 1998 4.37
May 1998 4.45
Jun 1998 4.54
Jul 1998 4.52
Aug 1998 4.68
Sep 1998 4.82
Oct 1998 4.72
Nov 1998 4.80
...
as a zoo in R. I have tried importing it directly as a zoo:
install.packages("zoo")
library("zoo")
FMAGX_prices <- read.csv.zoo("filepath.../FMAGX_prices.csv", format = "%m/%Y")
and importing it as a data frame and then converting it to a zoo. The reason I create the dates vector and re-assign it to the front of the data frame is that, by default, I get a 3-column data frame: one column with the month abbreviation, one with the year, and one with the price:
install.packages("zoo")
library("zoo")
FMAGX_prices <-read.table("filepath.../FMAGX_prices.txt")
dates <- paste(FMAGX_prices$V1, FMAGX_prices$V2, sep = " ")
FMAGX_prices$V3 <- as.numeric(as.character(FMAGX_prices$V3))
FMAGX_prices$dates <- dates
FMAGX_prices <- subset(FMAGX_prices, select= c(dates, V3))
FMAGX_prices <- read.zoo(FMAGX_prices, "%b %Y")
Neither method works. I always get the error below:
Error in read.zoo(FMAGX_prices, format = "%b %Y") :
index has 144 bad entries at data rows: 1 2 3 4 5 6 7 8 9 10 11...
My assumption is that there is something wrong with my date format, but I am not sure what it would be.
I've tried various combinations of arguments in the read statements, I've added headers, I've reformatted the data as a CSV, and changed the dates to 01/1998, 02/1998, etc. (with the corresponding arguments), but I always get that same error.
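For reference, a minimal sketch of one way a file like this could be read (assuming three whitespace-separated fields per line: month abbreviation, year, price, and keeping the placeholder file path from above). "%b %Y" has no day component, so it cannot be converted to a Date, which is likely what produces the "bad entries" error; it can, however, be parsed into a zoo yearmon index:
library(zoo)
FMAGX_raw <- read.table("filepath.../FMAGX_prices.txt",
                        col.names = c("month", "year", "price"))
FMAGX_prices <- zoo(FMAGX_raw$price,
                    as.yearmon(paste(FMAGX_raw$month, FMAGX_raw$year), "%b %Y"))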

R: 3rd Wednesday of a specific month using xts

I want to retrieve the third Wednesday of specific months in R.
This is not exactly a duplicate question of How to figure third Friday of a month in R because I want to use either Base R or XTS.
The data is in x:
library(xts)
x = xts(1:100, Sys.Date()+1:100)
and I can retrieve Wednesdays by using:
wed=x[.indexwday(x) %in% 3]
> wed
[,1]
2015-09-30 6
2015-10-07 13
2015-10-14 20
2015-10-21 27
2015-10-28 34
2015-11-04 41
2015-11-11 48
2015-11-18 55
2015-11-25 62
2015-12-02 69
2015-12-09 76
2015-12-16 83
2015-12-23 90
2015-12-30 97
>
I haven't figured out how to get the third observation in each month of this wed vector using xts but there must be a way.
third=wed[head(endpoints(wed, "months") + 3, -3)]
returns a wrong result.
I have read the xts documentation and couldn't find the right function there.
Any help would be appreciated.
Why not just
library(xts)
x = xts(1:3650, Sys.Date()+1:3650)
x[.indexwday(x) == 3 &
.indexmday(x) >= 15 &
.indexmday(x) <= 21
]
If the first Wednesday falls on the 1st, the third falls on the 15th.
If the first Wednesday falls on the 7th, the third falls on the 21st.
So the third Wednesday is always somewhere between the 15th and the 21st.
Take your wed object, split it by month, then select the 3rd row. Then use do.call and rbind to put it back together.
R> # 3rd or last available Wednesday
R> wedList <- split(wed, "months")
R> do.call(rbind, lapply(wedList, function(x) x[min(nrow(x),3),]))
# [,1]
# 2015-09-30 6
# 2015-10-21 27
# 2015-11-18 55
# 2015-12-16 83
R> # no observation if 3rd Wednesday isn't available
R> do.call(rbind, lapply(wedList, function(x) if(nrow(x) < 3) NULL else x[3,]))
# [,1]
# 2015-10-21 27
# 2015-11-18 55
# 2015-12-16 83

Why is the time series data being plotted backwards in R?

I am stuck on why this is happening and have tried searching everywhere for the answer. When I try to plot a time series object in R, the resulting plot comes out in reverse.
I have the following code:
library(sqldf)
stock_prices <- read.csv('~/stockPrediction/input/REN.csv')
colnames(stock_prices) <- tolower(colnames(stock_prices))
colnames(stock_prices)[7] <- 'adjusted_close'
stock_prices <- sqldf('SELECT date, adjusted_close FROM stock_prices')
head(stock_prices)
date adjusted_close
1 2014-10-20 3.65
2 2014-10-17 3.75
3 2014-10-16 4.38
4 2014-10-15 3.86
5 2014-10-14 3.73
6 2014-10-13 4.09
tail(stock_prices)
date adjusted_close
1767 2007-10-15 8.99
1768 2007-10-12 9.01
1769 2007-10-11 9.02
1770 2007-10-10 9.06
1771 2007-10-09 9.06
1772 2007-10-08 9.08
But when I try the following code:
stock_prices_ts <- ts(stock_prices$adjusted_close, start=c(2007, 1), end=c(2014, 10), frequency=12)
plot(stock_prices_ts, col='blue', lwd=2, type='l')
The image that results looks like this:
And even if I reverse the time series object with this code:
plot(rev(stock_prices_ts), col='blue', lwd=2, type='l')
I get this
which has arbitrary numbers.
Any idea why this is happening? Any help is much appreciated.
This happens because your object loses its time series structure once you apply the rev function.
For example :
set.seed(1)
gnp <- ts(cumsum(1 + round(rnorm(100), 2)),
start = c(1954, 7), frequency = 12)
gnp ## gnp has a real time series structure
       Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
1954                                      0.37  1.55  1.71  4.31  5.64  5.82
1955  7.31  9.05 10.63 11.32 13.83 15.22 15.60 14.39 16.51 17.47 18.45 20.39
1956 22.21 23.80 25.72 27.50 28.57 27.58 29.20 30.14 30.98 30.51 31.03 32.45
1957 ...
rev(gnp) ## the reversal is just a vector
[1] 110.91 110.38 110.60 110.17 110.45 108.89 106.30 104.60 102.44 ....
In general it is a little bit painful to manipulate objects of class ts. One idea is to use an xts object, which "generally" conserves its structure once you apply common operations on it.
Even if the generic rev method is not implemented for xts objects, it is easy to coerce the resulting zoo time series back to an xts one using as.xts.
par(mfrow=c(2,2))
plot(gnp,col='red',main='gnp')
plot(rev(gnp),type='l',col='red',main='rev(gnp)')
library(xts)
xts_gnp <- as.xts(gnp)
plot(xts_gnp)
## note here that I apply as.xts again after rev operation
## otherwise i lose xts structure
rev_xts_gnp = as.xts(rev(as.xts(gnp)))
plot(rev_xts_gnp)
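Applied to the original stock_prices data frame (a hedged sketch, assuming the date and adjusted_close columns shown in the question), building an xts keyed on the actual dates avoids the reversal altogether, because zoo/xts order observations by their time index:
library(xts)  # attaches zoo as well
stock_prices_xts <- as.xts(zoo(stock_prices$adjusted_close,
                               as.Date(stock_prices$date)))
plot(stock_prices_xts, col = 'blue', lwd = 2)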

compute mean of last 5 days of each month in R

I am finding this to be quite tricky. I have an R time series data frame, consisting of a value for each day for about 50 years of data. I would like to compute the mean of only the last 5 values for each month. This would be simple if every month ended on the 31st, in which case I could just subset. However, as we all know, some months end on the 31st, some on the 30th, and then we have leap years. So, is there a simple way to do this in R without having to write a complex indexing function to take account of all the possibilities, including leap years? Perhaps a function that works on zoo-type objects? The data frame is as follows:
Date val
1 2014-01-06 1.49
2 2014-01-03 1.38
3 2014-01-02 1.34
4 2013-12-31 1.26
5 2013-12-30 2.11
6 2013-12-26 3.20
7 2013-12-25 3.00
8 2013-12-24 2.89
9 2013-12-23 2.90
10 2013-12-22 4.5
tapply
Try this, where dd is your data frame and we have assumed that the Date column is of class "Date". (If dd is already sorted in descending order of Date, as it appears it might be in the question, then we can shorten it a bit by replacing the anonymous function with function(x) mean(head(x, 5)).)
> tapply(dd$val, format(dd$Date, "%Y-%m"), function(x) mean(tail(sort(x), 5)))
2013-12 2014-01
2.492000 1.403333
aggregate.zoo
In terms of zoo we can do the following, which returns another zoo object whose index is of class "yearmon". (In the case of zoo it does not matter whether dd is sorted or not, since zoo will sort it automatically.)
> library(zoo)
> z <- read.zoo(dd)
> aggregate(z, as.yearmon, function(x) mean(tail(x, 5)))
Dec 2013 Jan 2014
2.492000 1.403333
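A hedged xts variant of the same idea (again assuming the dd data frame from above): split the series into monthly chunks with endpoints and average the last five rows of each chunk.
> library(xts)
> x <- as.xts(read.zoo(dd))
> period.apply(x, endpoints(x, "months"), function(v) mean(tail(v, 5)))
# gives 2.492 for Dec 2013 and 1.403333 for Jan 2014, indexed by each month's
# last available date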
