I'm trying to aggregate a data frame so as to obtain a table with weekly averages of a variable. I found that the following package provides a nice solution, and I've been using it to aggregate data yearly and monthly. However, the function to aggregate data weekly simply does not work as described. Does anyone have an idea how I can fix this?
For instance, following the manual:
require(TSAgg)
# Load the data:
data(foo)
## Format the data using the timeSeries function.
foo.ts <- timeSeries(foo[, 1], "%d/%m/%Y %H:%M", foo[, 3])
## Aggregate the data into 6-month blocks using mean:
(mean.month <- monthsAgg(foo.ts, mean, 6))
# Aggregate the data into weeks, using 7 days and mean:
(foo.week <- daysAgg(foo.ts, mean, 7))
The last command doesn't work. The function is defined as follows:
daysAgg <- function (data, process, multiple = NULL, na.rm = FALSE)
{
    if (is.null(multiple)) {
        multiple = 1
    }
    if (multiple == 1) {
        day <- aggregate(data[, 8:length(data)], list(day = data$day,
            month = data$month, year = data$year), process, na.rm = na.rm)
        days <- ymd(paste(day$year, day$month, day$day))
        data2 <- data.frame(date = days, data = day[, 4:length(day)])
        names(data2) <- c("Date", names(data[8:length(data)]))
        return(data2)
    }
    temp <- data
    day <- aggregate(list(data[, 8:length(data)], count = 1),
        list(day = data$day, month = data$month, year = data$year),
        process, na.rm = na.rm)
    days <- ymd(paste(day$year, day$month, day$day))
    data <- data.frame(date = days, day[, 5:length(day) - 1],
        count = day[length(day)])
    days = paste(multiple, "days")
    all.dates <- seq.Date(as.Date(data$date[1]),
        as.Date(data$date[length(data[, 1])]), by = "day")
    dates <- data.frame(date = all.dates)
    aggreGated <- merge(dates, data, by = "date", all.x = TRUE)
    aggreGated$date <- rep(seq.Date(as.Date(data$date[1]),
        as.Date(data$date[length(data[, 1])]), by = days),
        each = multiple, length = length(all.dates))
    results <- aggregate(list(aggreGated[2:length(aggreGated)]),
        list(date = aggreGated$date), process, na.rm = TRUE)
    results <- subset(results, results$count != 0)
    results <- results[, -length(results)]
    names(results) <- c("Date", names(temp[8:length(temp)]))
    return(results)
}
The problem in the code stems from its use of the function ymd, which attaches " UTC" to the end of all the dates it outputs. You can mask the function by defining ymd again, using
ymd <- function(x) {
    as.Date(x, "%Y %m %d")
}
before you call daysAgg.
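For example (a minimal sketch, assuming foo.ts has been built as in the question), the weekly aggregation should then run:
# Mask ymd with a version that returns plain Date objects, so that
# seq.Date() inside daysAgg() receives what it expects
ymd <- function(x) {
    as.Date(x, "%Y %m %d")
}
# Weekly (7-day) means now aggregate without error
(foo.week <- daysAgg(foo.ts, mean, 7))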
I'm trying to perform the Jarque-Bera test on hourly and daily return series in R. It worked fine for my daily return series, but it doesn't work for high-frequency data.
This is what I have done so far:
# Daily, hourly, minute prices of Tether in USD
df.ohlc.daily_usdt <- get_ohlc(usdt, periods = 86400, after = "2014-01-01", datetime = TRUE)
df.ohlc.hourly_usdt <- get_ohlc(usdt, periods = 3600, after = "2014-01-01", datetime = TRUE)
df.ohlc.min_usdt <- get_ohlc(usdt, periods = 60, after = "2014-01-01", datetime = TRUE)
index_daily_usdt <- df.ohlc.daily_usdt$CloseTime
data_daily_usdt <- data.frame(df.ohlc.daily_usdt[,2:6])
df.ohlc.daily_usdt_xts <- xts(data_daily_usdt, index_daily_usdt)
usdt_daily_return <- dailyReturn(df.ohlc.daily_usdt_xts, log=TRUE)
index_hour_usdt <- df.ohlc.hourly_usdt$CloseTime
data_hour_usdt <- data.frame(df.ohlc.hourly_usdt[,2:6])
df.ohlc.hourly_usdt_xts <- xts(data_hour_usdt, index_hour_usdt)
usdt_hourly_return <- diff(log(Cl(df.ohlc.hourly_usdt_xts)), lag=1)
#Descriptive statistics Tether hourly log returns
usdt_mean_hourly = mean(usdt_hourly_return, na.rm = TRUE)
usdt_sd_hourly = sd(usdt_hourly_return, na.rm = TRUE)
usdt_min_hourly = min(usdt_hourly_return, na.rm = TRUE)
usdt_max_hourly = max(usdt_hourly_return, na.rm = TRUE)
usdt_JB_hourly = jarque.bera.test(usdt_hourly_return)
Error in jarque.bera.test(usdt_hourly_return) : NAs in x
The JB test using DescTools does not work for me. Can someone tell me what other way I have to remove the NAs so that I can perform the JB test with the tseries package?
Does
DescTools::JarqueBeraTest(usdt_hourly_return, na.rm = TRUE)
not work?
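Alternatively (a sketch, assuming usdt_hourly_return is the xts series built above), you can drop the NAs yourself before calling tseries::jarque.bera.test():
library(tseries)
library(xts)
# diff() introduces a leading NA; remove all NAs first
usdt_hourly_clean <- na.omit(usdt_hourly_return)
# jarque.bera.test() expects a plain numeric vector or time series
jarque.bera.test(coredata(usdt_hourly_clean))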
I am trying to sum values that are greater than 70 in several different data sets. I believe that aggregate can do this, but my research has not pointed to an obvious way of obtaining the values that exceed 70 in my data sets. I first used aggregate to get the daily max values and put them into a data frame called yearmaxs. Here is my code and what I have tried:
# Number of times O3 > 70 in a year per site
Sys.setenv(TZ = "UTC")
library(openair)
library(lubridate)
filedir <- "C:/Users/dfmcg/Documents/Thesisfiles/8hravg"
myfiles <- c(list.files(path = filedir))
paste(filedir, myfiles, sep = '/')
npsfiles <- c(paste(filedir, myfiles, sep = '/'))
for (i in npsfiles[22]) {
    x <- substr(i, 45, 61)
    y <- paste('C:/Users/dfmcg/Documents/Thesisfiles/exceedenceall', x, sep = '/')
    timeozone <- import(i, date = "DATES", date.format = "%Y-%m-%d %H",
                        header = TRUE, na.strings = "NA")
    overseventy <- c()
    yearmaxs <- aggregate(rolling.O3new ~ format(as.Date(date)), timeozone, max)
    colnames(yearmaxs) <- c("date", "daymax")
    overseventy <- aggregate(daymax ~ format(as.Date(date)), yearmaxs,
                             FUN = length, subset = as.numeric(daymax) > 70)
    colnames(overseventy) <- c("date", "daymax")
    aggregate(daymax ~ format(as.Date(date), "%Y"), overseventy, sum)
}
I have also tried sum > "70" and sum(daymax > "70").
My other idea at this point is using a for loop to iterate through the values. I was hoping that I could use aggregate again to sum the values of interest. Any help at all would be greatly appreciated!
I think you want:
aggregate(daymax ~ format(as.Date(date)), yearmaxs, FUN = length,
subset = as.numeric(daymax) > 70)
Two things:
you need a numerical comparison, so use as.numeric(daymax) > 70, not daymax > "70";
use the subset argument in aggregate.formula.
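Putting it together (a sketch with made-up data, since the original files aren't available; yearmaxs and the 70 threshold mirror the question):
# Hypothetical daily maxima for two years
yearmaxs <- data.frame(date = as.Date("2015-01-01") + 0:729,
                       daymax = runif(730, 40, 90))
# Count days whose maximum exceeds 70, grouped by year
overseventy <- aggregate(daymax ~ format(date, "%Y"), yearmaxs,
                         FUN = length, subset = as.numeric(daymax) > 70)
colnames(overseventy) <- c("year", "days_over_70")
overseventy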
How can I subset a spacetime::STFDF (spatio-temporal data with full space-time grid) by time?
So far, I have tried:
library("maps")
library("maptools")
library("spacetime")
library("plm")
states.m <- map("state", plot = FALSE, fill = TRUE)
IDs <- sapply(strsplit(states.m$names, ":"), function(x) x[1])
states <- map2SpatialPolygons(states.m, IDs = IDs)
yrs <- 1970:1986
time <- as.POSIXct(paste(yrs, "-01-01", sep = ""), tz = "GMT")
data("Produc")
Produc.st <- STFDF(states[-8], time, Produc[order(Produc[2], Produc[1]),])
Produc.st@time[c(1,5,17)]
Produc.st[Produc.st@time[c(1,5,17)]]
But that gives me the error: ncol(i) == 2 is not TRUE.
Any ideas?
Please try
Produc.st[, index(Produc.st@time[c(1,5,17)])]
i.e., time selection is done after the comma, and don't select with an xts object (which Produc.st@time[c(1,5,17)] is), but with a time (POSIXct) vector.
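In other words (a sketch following the code above), index() extracts the POSIXct time stamps from the xts object stored in the time slot, and those are what the [ method expects after the comma:
# POSIXct stamps of the 1st, 5th and 17th time slices
tm <- index(Produc.st@time[c(1, 5, 17)])
class(tm)  # "POSIXct" "POSIXt"
# All spatial features, only those three times
Produc.sub <- Produc.st[, tm]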
My R learning curve has gotten the best of me today. So... I have a list of multi-series zoo objects, and I'm trying to rename the columns in each of them to the same values. I'm attempting this in the last line... and it runs without error... but the names aren't changed. Any ideas would be great.
require("zoo")
Get monthly data of stocks.
symbs = c('AAPL', 'HOV', 'NVDA')
importData <- lapply(symbs, function(symb) get.hist.quote(instrument= symb,
start = "2000-01-01", end = "2013-07-15", quote="AdjClose", provider = "yahoo",
origin="1970-01-01", compression = "m", retclass="zoo"))
names(importData) <- symbs
#Calculate monthly pct chgs of stocks.
monthlyPctChgs = lapply(importData, function(x) diff(x, lag = 1) / lag(x, k = -1))
names(monthlyPctChgs) <- symbs
#Merge the pct chgs and the monthly closing prices
pricingAndPerfsMerged = mapply(merge, importData, lag(monthlyPctChgs, k = -1),
SIMPLIFY = FALSE)
#Rename the columns in each zoo.
lapply(pricingAndPerfsMerged, function(x) colnames(x) = c('AdjClose', 'MonthlyPerf'))
You're renaming columns of a copy. This would be a good place to use a for loop instead:
for (i in seq_along(pricingAndPerfsMerged)) {
colnames(pricingAndPerfsMerged[[i]]) = c('AdjClose', 'MonthlyPerf')
}
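Alternatively (a sketch), have lapply return the modified copies and reassign the result:
pricingAndPerfsMerged <- lapply(pricingAndPerfsMerged, function(x) {
    colnames(x) <- c('AdjClose', 'MonthlyPerf')
    x  # return the modified zoo object, not the names vector
})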
Consider the following example:
Date1 <- seq(from = as.POSIXct("2010-05-03 00:00"),
to = as.POSIXct("2010-06-20 23:00"), by = 120)
Dat1 <- data.frame(DateTime = Date1,
x1 = rnorm(length(Date1)))
Date2 <- seq(from = as.POSIXct("2010-05-01 03:30"),
to = as.POSIXct("2010-07-03 22:00"), by = 120)
Dat2 <- data.frame(DateTime = Date2,
x1 = rnorm(length(Date2)))
Date3 <- seq(from = as.POSIXct("2010-06-08 01:30"),
to = as.POSIXct("2010-07-13 11:00"), by = 120)
Dat3Matrix <- matrix(data = rnorm(length(Date3)*3), ncol = 3)
Dat3 <- data.frame(DateTime = Date3,
x1 = Dat3Matrix)
list1 <- list(Dat1,Dat2,Dat3)
Here I build three data frames as an example and place them all into a list. From here I would like to write a routine that returns the three data frames, but keeping only the times that are present in each of the others, i.e. all three data frames should be reduced to the times common to all of them. How can this be done?
zoo has a multi-way merge. The code below lapplys read.zoo over the components of list1, converting each of them to the zoo class; tz = "" tells it to use POSIXct for the resulting date/times. It then merges the converted components using all = FALSE so that only intersecting times are kept.
library(zoo)
z <- do.call("merge", c(lapply(setNames(list1, 1:3), read.zoo, tz = ""), all = FALSE))
If we later wish to convert z to a data.frame, try dd <- data.frame(Time = time(z), coredata(z)), but it might be better to keep it as a zoo object (or convert it to an xts object) so that further processing is simplified as well.
One approach is to find the respective indices and then subset accordingly:
idx1 <- (Dat1[,1] %in% Dat2[,1]) & (Dat1[,1] %in% Dat3[,1])
idx2 <- (Dat2[,1] %in% Dat1[,1]) & (Dat2[,1] %in% Dat3[,1])
idx3 <- (Dat3[,1] %in% Dat1[,1]) & (Dat3[,1] %in% Dat2[,1])
Now Dat1[idx1,], Dat2[idx2,], Dat3[idx3,] should give the desired result.
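A more general variant (a sketch) intersects the time columns across the whole list, so it works for any number of data frames; the times are compared as character strings because intersect() would strip the POSIXct class:
# Character keys side-step intersect() dropping the POSIXct class
keys <- lapply(list1, function(d) as.character(d$DateTime))
common <- Reduce(intersect, keys)
# Subset every data frame to the shared time stamps
list1_common <- Map(function(d, k) d[k %in% common, ], list1, keys)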
You could use merge:
res <- NULL
for (i in seq_along(list1)) {
    dat <- list1[[i]]
    names(dat)[2] <- paste0(names(dat)[2], "_", i)
    dat[[paste0("id_", i)]] <- 1:nrow(dat)
    if (is.null(res)) {
        res <- dat
    } else {
        res <- merge(res, dat, by = "DateTime")
    }
}
I added columns with IDs; you can use these to index the records in the original data.frames.
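For example (a sketch following the loop above), the id columns recover the matching rows of the originals:
# res$id_1 holds the row numbers of the matching records in list1[[1]]
Dat1_common <- list1[[1]][res$id_1, ]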