Using aggregate to sum values greater than 70 in R - r

I am trying to sum values that are greater than 70 in several different data sets. I believe that aggregate can do this but my research has not pointed to an obvious solution to obtaining the values that exceed seventy in my data sets. I have first used aggregate to get the daily max values and put these values into the data frame called yearmaxs. Here is my code and what I have tried:
# number of times O3 > 70 in a year per site
Sys.setenv(TZ = "UTC")
library(openair)
library(lubridate)
filedir <- "C:/Users/dfmcg/Documents/Thesisfiles/8hravg"
myfiles <- c(list.files(path = filedir))
paste(filedir, myfiles, sep = '/')
npsfiles <- c(paste(filedir, myfiles,sep = '/'))
for (i in npsfiles[22]) {
x <- substr(i,45,61)
y <- paste('C:/Users/dfmcg/Documents/Thesisfiles/exceedenceall', x, sep='/')
timeozone <- import(i, date="DATES", date.format = "%Y-%m-%d %H", header=TRUE, na.strings="NA")
overseventy <- c()
yearmaxs <- aggregate(rolling.O3new ~ format(as.Date(date)), timeozone, max)
colnames(yearmaxs) <- c("date", "daymax")
overseventy <- aggregate(daymax ~ format(as.Date(date)), yearmaxs, FUN = length,
subset = as.numeric(daymax) > 70)
colnames(overseventy) <- c("date", "daymax")
aggregate(daymax ~ format(as.Date(date), "%Y"), overseventy, sum)
}
I have also tried: sum > "70" and sum(daymax > "70").
My other idea at this point is to use a for loop to iterate through the values. I was hoping that I could use aggregate again to sum the values of interest. Any help at all would be greatly appreciated!

I think you want:
aggregate(daymax ~ format(as.Date(date)), yearmaxs, FUN = length,
subset = as.numeric(daymax) > 70)
Two things:
you need numerical comparison, so use as.numeric(daymax) > 70 not daymax > "70";
use the subset argument in aggregate.formula.
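As a minimal sketch of the two points above, the following counts the days per year whose daily maximum exceeds 70. The data frame here is synthetic (a hypothetical stand-in for the asker's timeozone, with random values), and the output column names year and days_over_70 are made up for illustration.

```r
set.seed(42)

# Hypothetical stand-in for `timeozone`: hourly values over two years (UTC)
hours <- seq(as.POSIXct("2014-01-01 00:00", tz = "UTC"),
             as.POSIXct("2015-12-31 23:00", tz = "UTC"), by = "hour")
timeozone <- data.frame(date = hours,
                        rolling.O3new = runif(length(hours), min = 20, max = 72))

# Step 1: daily maxima, as in the question
yearmaxs <- aggregate(rolling.O3new ~ format(as.Date(date)), timeozone, max)
colnames(yearmaxs) <- c("date", "daymax")

# Step 2: count the days per year with a daily maximum above 70.
# `subset` does the numeric comparison; FUN = length counts the rows kept.
overseventy <- aggregate(daymax ~ substr(date, 1, 4), yearmaxs,
                         FUN = length, subset = daymax > 70)
colnames(overseventy) <- c("year", "days_over_70")
print(overseventy)
```

Note that a year with no exceedances does not appear in the result at all, because subset removes its rows before grouping.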

Related

R does not recognise dates in loop

I am looking to loop over a vector of dates, using these dates as a subsetting criterion and carry out calculations. For simplicity's sake we will assume these calculations are a count of rows.
The problem is that R treats the vector of dates as 5-digit numbers, even though they have been coerced to dates using as.Date. As a result, the loop creates a list of length 17,896, although my vector of loop dates contains only 12 dates.
I very much look forward to any suggestions. Thank you.
# first date of each month in 2018
dates_2018 = seq(as.Date("2018-1-1"), as.Date("2018-12-31"), "days")
loop_date = as.Date(as.vector(tapply(dates_2018, substr(dates_2018, 1, 7), max), mode="any"), origin = "1970-01-01")
# dummy df
df = data.frame(id = 1:length(dates_2018)
,dates_2018)
# count number of days satisfy criteria
y = list()
for (i in loop_date)
{
y[[i]] = nrow(df[df$dates_2018 >= i, ])
}; y
You can do y[[i]] = nrow(df[df$dates_2018 >= as.Date(i,origin = "1970-01-01"), ]) and get a result with y[[17562]], but you will find your result in a list with 17,896 elements. A more proper approach is to loop over the indices of loop_date:
for (i in seq_along(loop_date))
{
y[[i]] = nrow(df[df$dates_2018 >= loop_date[i], ])
}
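A small sketch of why this happens and of the index-based fix: for() iterates over the underlying numbers of a Date vector and drops the class, so indexing the original vector keeps the dates intact. The three month-end dates and the names loop_classes and counts are chosen only for illustration.

```r
dates_2018 <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day")
df <- data.frame(id = seq_along(dates_2018), dates_2018)

# Three month-end dates, for illustration only
loop_date <- as.Date(c("2018-01-31", "2018-02-28", "2018-03-31"))

# for() drops the Date class: each d is a plain number (days since 1970-01-01)
loop_classes <- character(0)
for (d in loop_date) loop_classes <- c(loop_classes, class(d))
print(loop_classes)

# Looping over positions keeps the dates intact and gives one result per date
counts <- vapply(seq_along(loop_date),
                 function(i) sum(df$dates_2018 >= loop_date[i]),
                 integer(1))
names(counts) <- format(loop_date)
print(counts)
```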

How to interpolate missing values in a time series, limited by the number of sequential NAs (R)?

I have missing values in a time series of dates. For example:
set.seed(101)
df <- data.frame(DATE = as.Date(c('2012-01-01', '2012-01-02',
'2012-01-03', '2012-01-05', '2012-01-06', '2012-01-15', '2012-01-18',
'2012-01-19', '2012-01-20', '2012-01-22')),
VALUE = rnorm(10, mean = 5, sd = 2))
How can I write a function that will fill all the missing dates between the first and last date (i.e. 2012-01-01 and 2012-01-22), then use interpolation (linear and smoothing spline) to fill the missing values, but not more than 3 sequential missing values (i.e. no interpolation between 2012-01-06 and 2012-01-15)?
The function will be applied to a very large dataframe. I have been able to write a function that uses linear interpolation to fill all missing values between two dates (see below), but I can not figure out how to stop it interpolating long stretches of missing values.
interpolate.V <- function(df){
# sort data by time
df <- df[order(df$DATE),]
# linearly interpolate VALUE for all missing DATEs
temp <- with(df, data.frame(approx(DATE, VALUE, xout = seq(DATE[1],
DATE[nrow(df)], "day"))))
colnames(temp) <- c("DATE", "VALUE_INTERPOLATED")
temp$ST_ID <- df$ST_ID[1]
out <- merge(df, temp, all = T)
rm(temp)
return(out)
}
Any help will be greatly appreciated!
Thanks
Function that adds rows for all missing dates:
date.range <- function(sub){
sub$DATE <- as.Date(sub$DATE)
DATE <- seq.Date(min(sub$DATE), max(sub$DATE), by="day")
all.dates <- data.frame(DATE)
out <- merge(all.dates, sub, all = T)
return(out)
}
Use na.approx or na.spline from zoo package with maxgap argument:
library(zoo)
interpolate.zoo <- function(df){
df$VALUE_INT <- na.approx(df$VALUE, maxgap = 3, na.rm = F)
return(df)
}
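To make the effect of a gap limit concrete without depending on zoo, here is a base R sketch that imitates it: interpolate everything with approx, then blank out runs of missing days longer than the limit. This is not zoo's implementation; the function name fill_short_gaps and the test values are assumptions for illustration only.

```r
# Hypothetical helper: linear interpolation on a daily grid, but only across
# gaps of at most `maxgap` consecutive missing days.
fill_short_gaps <- function(dates, values, maxgap = 3) {
  all_dates <- seq(min(dates), max(dates), by = "day")
  observed  <- all_dates %in% dates

  # Interpolate across every gap first
  filled <- approx(as.numeric(dates), values, xout = as.numeric(all_dates))$y

  # Find runs of missing days and mask the runs that are too long
  runs     <- rle(!observed)
  long_gap <- rep(runs$values & runs$lengths > maxgap, runs$lengths)
  filled[long_gap] <- NA

  data.frame(DATE = all_dates, VALUE = filled)
}

# Dates from the question; values set to the day of the month for easy checking
dates <- as.Date(c("2012-01-01", "2012-01-02", "2012-01-03", "2012-01-05",
                   "2012-01-06", "2012-01-15", "2012-01-18", "2012-01-19",
                   "2012-01-20", "2012-01-22"))
values <- c(1, 2, 3, 5, 6, 15, 18, 19, 20, 22)
res <- fill_short_gaps(dates, values, maxgap = 3)
print(res)
```

With these inputs the short gaps (for example 2012-01-04) are filled, while the eight days between 2012-01-06 and 2012-01-15 stay NA.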

cbind function with yearly vectors

for(t in 1921:2017) {
nam <- paste("", t, sep = "")
assign(nam, window(aCPI_gr, start=c(t,1), end=c(t,12)))
}
aCPI_gr_y <- cbind(`1921`: `2017`) #doesn't work
This loop generates vectors with the monthly CPI data for each year. Now I would like to pack all of them into a data frame with cbind, but I am of course too lazy to type every year vector by hand in the cbind function. Is there an easy way to avoid typing every year vector by hand? Something like cbind(1921:2017)
1) matrix If you have a monthly unidimensional ts series spanning 97 years, such as the test series tser below, then this will convert it to a 12 x 97 matrix with one year per column. The dimnames argument can be omitted if you don't need the names.
tser <- ts(seq_len(12 * 97), start = 1921, freq = 12) # test data
m <- matrix(tser, 12, dimnames = list(month = 1:12, year = 1921:2017))
2) tapply An alternative is:
tapply(tser, list(month = cycle(tser), year = as.integer(time(tser))), c)
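For comparison, here is a sketch that stays closer to the loop in the question: it takes one window per year, but collects the results with sapply instead of assign, so there is nothing to type by hand. It uses a three-year test series rather than the real aCPI_gr, and the name by_year is an assumption. The as.numeric call matters, since cbind on ts objects with different time spans aligns them on time and pads with NA.

```r
# Three-year monthly test series (stand-in for the real CPI series)
tser  <- ts(seq_len(12 * 3), start = c(1921, 1), frequency = 12)
years <- 1921:1923

# One column per year; as.numeric drops the ts attributes before binding
by_year <- sapply(years, function(y) {
  as.numeric(window(tser, start = c(y, 1), end = c(y, 12)))
})
colnames(by_year) <- years

# Same layout as the matrix() approach from the answer
m <- matrix(tser, nrow = 12, dimnames = list(month = 1:12, year = years))

print(by_year)
aCPI_df <- as.data.frame(by_year)  # data frame with one column per year
```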

How to speed up a loop-like function in R

In trying to avoid using the for loop in R, I wrote a function that returns an average value from one data frame given row-specific values from another data frame. I then pass this function to sapply over the range of row numbers. My function works, but it returns ~ 2.5 results per second, which is not much better than using a for loop. So, I feel like I've not fully exploited the vectorized aspects of the apply family of functions. Can anyone help me rethink my approach? Here is a minimally working example. Thanks in advance.
#Creating first dataframe
dates<-seq(as.Date("2013-01-01"), as.Date("2016-07-01"), by = 1)
n<-length(seq(as.Date("2013-01-01"), as.Date("2016-07-01"), by = 1))
df1<-data.frame(date = dates,
hour = sample(1:24, n,replace = T),
cat = sample(c("a", "b"), n, replace = T),
lag = sample(1:24, n, replace = T))
#Creating second dataframe
df2<-data.frame(date = sort(rep(dates, 24)),
hour = rep(1:24, length(dates)),
p = runif(length(rep(dates, 24)), min = -20, max = 100))
df2<-df2[order(df2$date, df2$hour),]
df2$cat<-"a"
temp<-df2
temp$cat<-"b"
df2<-rbind(df2,temp)
#function
period_mean<-function(x){
tmp<-df2[df2$cat == df1[x,]$cat,]
#This line extracts the row name index from tmp,
#in which the two dataframes match on date and hour
he_i<-which(tmp$date == df1[x,]$date & tmp$hour == df1[x,]$hour)
#My lagged period is given by the variable "lag". I want the average
#over the period hour - (hour - lag). Since df2 is sorted such that hours
#are consecutive, this method requires that I subset on only the
#relevant value for cat (hence the creation of tmp in the first line
#of the function)
p<-mean(tmp[(he_i - df1[x,]$lag):he_i,]$p)
print(x)
print(p)
return(p)
}
#Execute function
out<-sapply(1:length(row.names(df1)), period_mean)
EDIT I have subsequently learned that part of the reason my original problem was iterating so slowly is that my data classes between the two dataframes were not the same. df1$date was a date field, while df2$date was a character field. Of course, this wasn't apparent with the example I posted because the data types were the same by construction. Hope this helps.
Here's one suggestion:
getIdx <- function(i) {
date <- df1$date[i]
hour <- df1$hour[i]
cat <- df1$cat[i]
which(df2$date==date & df2$hour==hour & df2$cat==cat)
}
v_getIdx <- Vectorize(getIdx)
df1$index <- v_getIdx(1:nrow(df1))
b_start <- match("b", df2$cat)
out2 <- apply(df1[,c("cat","lag","index")], MAR=1, function(x) {
flr <- ifelse(x[1]=="a", 1, b_start)
x <- as.numeric(x[2:3])
mean(df2$p[max(flr, (x[2]-x[1])):x[2]])
})
We make a function (getIdx) to retrieve the rows from df2 that match the values from each row in df1, and then Vectorize the function.
We then run the vectorized function to get a vector of rownames. We set b_start to be the row where the "b" category starts.
We then iterate through the rows of df1 with apply. In the mean(...) function, we set the "floor" to be either row 1 (if cat=="a") or b_start (if cat=="b"), which eliminates the need to subset (what you were doing with tmp).
Performance:
> system.time(out<-sapply(1:length(row.names(df1)), period_mean))
user system elapsed
11.304 0.393 11.917
> system.time({
+ df1$index <- v_getIdx(1:nrow(df1))
+ b_start <- match("b", df2$cat)
+ out2 <- apply(df1[,c("cat","lag","index")], MAR=1, function(x) {
+ flr <- ifelse(x[1]=="a", 1, b_start)
+ x <- as.numeric(x[2:3])
+ mean(df2$p[max(flr, (x[2]-x[1])):x[2]])
+ })
+ })
user system elapsed
2.839 0.405 3.274
> all.equal(out, out2)
[1] TRUE
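If the remaining apply loop is still too slow, one further option (not part of the answer above) is to remove the per-row work entirely: look up all row indices with a single match on a pasted key, and get every window mean from a cumulative sum. This is only a sketch on a small synthetic data set; it assumes df2 is sorted by cat, then date, then hour, and it uses the same "floor at the start of the category block" rule as the answer. The names idx, first, start and cs are made up here.

```r
set.seed(1)
dates <- seq(as.Date("2013-01-01"), as.Date("2013-01-10"), by = 1)
n <- length(dates)

df1 <- data.frame(date = dates,
                  hour = sample(1:24, n, replace = TRUE),
                  cat  = sample(c("a", "b"), n, replace = TRUE),
                  lag  = sample(1:24, n, replace = TRUE))

# df2 sorted by cat, then date, then hour (assumed by the index arithmetic below)
df2 <- data.frame(date = rep(rep(dates, each = 24), times = 2),
                  hour = rep(1:24, times = 2 * n),
                  cat  = rep(c("a", "b"), each = 24 * n))
df2$p <- runif(nrow(df2), min = -20, max = 100)

# Row of df2 matching each row of df1, found in one vectorised lookup
idx <- match(paste(df1$date, df1$hour, df1$cat),
             paste(df2$date, df2$hour, df2$cat))

# First row of each category block, used as the lower bound of the window
first <- match(df1$cat, df2$cat)
start <- pmax(first, idx - df1$lag)

# cs[k + 1] holds sum(p[1:k]), so sum(p[start:idx]) = cs[idx + 1] - cs[start]
cs  <- c(0, cumsum(df2$p))
out <- (cs[idx + 1] - cs[start]) / (idx - start + 1)
print(out)
```

The design choice is to trade the repeated which() scans for one match and one cumsum, both of which run once over df2 regardless of how many rows df1 has.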

r - find same times in n number of data frames

Consider the following example:
Date1 = seq(from = as.POSIXct("2010-05-03 00:00"),
to = as.POSIXct("2010-06-20 23:00"), by = 120)
Dat1 <- data.frame(DateTime = Date1,
x1 = rnorm(length(Date1)))
Date2 <- seq(from = as.POSIXct("2010-05-01 03:30"),
to = as.POSIXct("2010-07-03 22:00"), by = 120)
Dat2 <- data.frame(DateTime = Date2,
x1 = rnorm(length(Date2)))
Date3 <- seq(from = as.POSIXct("2010-06-08 01:30"),
to = as.POSIXct("2010-07-13 11:00"), by = 120)
Dat3Matrix <- matrix(data = rnorm(length(Date3)*3), ncol = 3)
Dat3 <- data.frame(DateTime = Date3,
x1 = Dat3Matrix)
list1 <- list(Dat1,Dat2,Dat3)
Here I built three data.frames as an example and placed them all into a list. From here I would like to write a routine that returns the 3 data frames but keeps only the times that are present in each of the others, i.e. all three data frames should be reduced to the times that are common to all of them. How can this be done?
zoo has a multi-way merge. This lapply's read.zoo over the components of list1 converting them each to zoo class. tz="" tells it to use POSIXct for the resulting date/times. It then merges the converted components using all=FALSE so that only intersecting times are kept.
library(zoo)
z <- do.call("merge", c(lapply(setNames(list1, 1:3), read.zoo, tz = ""), all = FALSE))
If we later wish to convert z to data.frame try dd <- cbind(Time = time(z), coredata(z)) but it might be better to keep it as a zoo object (or convert it to an xts object) so that further processing is simplified as well.
One approach is to find the respective indices and then subset accordingly:
idx1 <- (Dat1[,1] %in% Dat2[,1]) & (Dat1[,1] %in% Dat3[,1])
idx2 <- (Dat2[,1] %in% Dat1[,1]) & (Dat2[,1] %in% Dat3[,1])
idx3 <- (Dat3[,1] %in% Dat1[,1]) & (Dat3[,1] %in% Dat2[,1])
Now Dat1[idx1,], Dat2[idx2,], Dat3[idx3,] should give the desired result.
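The pairwise %in% checks generalise to any number of data frames by intersecting all the time stamps first and then filtering each data frame against the result. The sketch below uses three small hypothetical data frames (A, B, C) instead of the ones in the question; the times are compared as numbers so that the POSIXct class cannot be lost along the way.

```r
t0 <- as.POSIXct("2010-05-03 00:00", tz = "UTC")

# Small hypothetical data frames with partly overlapping hourly time stamps
A <- data.frame(DateTime = t0 + 3600 * 0:5, x1 = 1:6)
B <- data.frame(DateTime = t0 + 3600 * 2:7, x1 = 1:6)
C <- data.frame(DateTime = t0 + 3600 * 3:9, x1 = 1:7)
list1 <- list(A, B, C)

# Times present in every data frame (works for a list of any length)
common <- Reduce(intersect, lapply(list1, function(d) as.numeric(d$DateTime)))

# Keep only the rows whose time stamp is in the common set
list2 <- lapply(list1, function(d) {
  d[as.numeric(d$DateTime) %in% common, , drop = FALSE]
})
print(list2)
```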
You could use merge:
res <- NULL
for (i in seq_along(list1)) {
dat <- list1[[i]]
names(dat)[2] <- paste0(names(dat)[2], "_", i);
dat[[paste0("id_", i)]] <- 1:nrow(dat)
if (is.null(res)) {
res <- dat
} else {
res <- merge(res, dat, by="DateTime")
}
}
I added id columns; you can use these to index the records in the original data.frames.
