# create one object per year ("1921" ... "2017"), each holding that year's
# twelve monthly values of aCPI_gr
for (t in 1921:2017) {
nam <- paste("", t, sep = "")
assign(nam, window(aCPI_gr, start = c(t, 1), end = c(t, 12)))
}
aCPI_gr_y <- cbind(`1921`: `2017`) #doesn't work
This loop generates a vector of monthly CPI data for every year. Now I would like to pack all of them into a data frame with cbind, but I am of course too lazy to type every year vector by hand in the cbind call. Is there an easy way to avoid typing every year vector by hand? Something like cbind(1921:2017)?
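For the code as written, one direct route is mget — a sketch, assuming the loop above has already created the 97 objects named "1921" through "2017" in the global environment:
# mget() fetches all objects with the given names as a list;
# do.call() then passes that list to cbind in a single call
aCPI_gr_y <- do.call(cbind, mget(as.character(1921:2017)))
That said, the answers below avoid creating the per-year objects in the first place.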
1) matrix If you have a univariate monthly ts series spanning 97 years, such as the test series tser below, then this will convert it to a 12 x 97 matrix with one year per column. The dimnames argument can be omitted if you don't need the names.
tser <- ts(seq_len(12 * 97), start = 1921, freq = 12) # test data
m <- matrix(tser, 12, dimnames = list(month = 1:12, year = 1921:2017))
2) tapply An alternative is:
tapply(tser, list(month = cycle(tser), year = as.integer(time(tser))), c)
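A quick check (a sketch) that the two approaches agree on the values:
# both give a 12 x 97 layout with months down the rows;
# unname() drops the dimnames so only the values are compared
m2 <- tapply(tser, list(month = cycle(tser), year = as.integer(time(tser))), c)
all.equal(unname(m), unname(m2))  # expect TRUE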
I am looking to loop over a vector of dates, using these dates as a subsetting criterion and carry out calculations. For simplicity's sake we will assume these calculations are a count of rows.
The problem is that R treats the vector of dates as 5-digit numbers inside the loop, despite their having been coerced with as.Date. As a result the loop creates a list of length 17,896, even though my date vector contains only 12 dates.
I very much look forward to any suggestions. Thank you.
# last date of each month in 2018
dates_2018 = seq(as.Date("2018-1-1"), as.Date("2018-12-31"), "days")
loop_date = as.Date(as.vector(tapply(dates_2018, substr(dates_2018, 1, 7), max), mode="any"), origin = "1970-01-01")
# dummy df
df = data.frame(id = 1:length(dates_2018)
,dates_2018)
# count number of days satisfy criteria
y = list()
for (i in loop_date)
{
y[[i]] = nrow(df[df$dates_2018 >= i, ])
}; y
You can do y[[i]] = nrow(df[df$dates_2018 >= as.Date(i, origin = "1970-01-01"), ]) and read off a result with y[[17562]], but your results will then sit in a list with 17,896 elements, one slot per day number. It is more proper to loop over indices:
for (i in seq_along(loop_date))
{
y[[i]] = nrow(df[df$dates_2018 >= loop_date[i], ])
}
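An equivalent one-liner (a sketch, returning a plain vector rather than a list) sidesteps the class-dropping issue entirely, since summing a logical vector counts its TRUE values:
# one count per element of loop_date, in the same order
y <- sapply(loop_date, function(d) sum(df$dates_2018 >= d))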
I am trying to create a vector (dateVec) which contains the dates in the column Date propagated by the number of days in column Days. I cannot understand why the code that I created is not working. Dates are in Date format.
for (i in mydata[, 1]) {
dateVec = mydata[, 1] + 0:mydata[, 2]
}
The data has many more rows; here is a small sample:
Date (mydata[,1])   Days (mydata[,2])
10/05/2017          3
05/05/2017          2
The result that I would expect for dateVec would be:
(10/05/2017, 11/05/2017, 12/05/2017, 13/05/2017, 05/05/2017, 06/05/2017, 07/05/2017, ...)
There are a few reasons why your code isn't working.
1) For loop: i needs to iterate over a series of integers. As written, you are looping over the values of mydata[,1] themselves, i.e. "10/05/2017" and "05/05/2017", rather than over positions. A more useful pattern is seq_along, which generates the sequence of integers from 1 to the length of the object passed to it.
2) dateVec is not indexed, so you overwrite dateVec on every iteration of the loop.
3) Variable number of days: the first date generates a sequence running 3 days past its start (4 dates in all) and the second 2 days past (3 dates), so you need a data structure that can hold variable-length elements, such as a list.
To modify your existing code:
mydata <- data.frame(Date = as.Date(c("10/05/2017", "05/05/2017"),
format = "%d/%m/%Y"), Days = c(3, 2))
dateVec <- list()
for (i in seq_along(mydata[, 1])) {
dateVec[[i]] = mydata[i, 1] + 0:mydata[i, 2]
}
res <- do.call("c", dateVec)
A more R-idiomatic approach is to pass the starting date and the sequence length in parallel using mapply, which returns a list that is then concatenated into a single vector of dates:
res <- do.call("c", mapply(function(x, y) seq(from = x, length.out = y + 1,
by = "1 day"), x = mydata[["Date"]], y = mydata[["Days"]]))
(Here length.out = y + 1 so that, like the loop above, each sequence includes the start date plus Days further days.)
Here's a clunky solution:
library("lubridate")
mydata = data.frame(Date = dmy(c("10/05/2017", "05/05/2017")),
Days = c(3,2))
dateVec = dmy(character())
for(i in 1:length(mydata$Date)){
dateVec = c(dateVec,mydata$Date[i])
for(j in 1:mydata$Days[i]){
dateVec = c(dateVec, mydata$Date[i]+j)
}
}
Note that this uses the lubridate package and doesn't format the dates quite the way you did. I also found it interesting that I had to initialize dateVec as a Date object: initially I tried dateVec = c(), but R coerced the result to numeric.
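If you want to stay in base R, an empty vector that already carries the Date class (an alternative to dmy(character())) avoids that numeric coercion:
dateVec <- as.Date(character())  # zero-length, but class "Date" from the start
class(dateVec)                   # "Date"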
I have a gridded climate dataset, such as:
# generate time vector
time1 <- seq(14847.5,14974.5, by = 1)
time2 <- seq(14947.5,14974.5, by = 1)
time <- c(time1,time2)
time <- as.POSIXct(time*86400,origin='1970-01-01 00:00')
# generate lat and lon coordinates
lat <- seq(80,90, by = 1)
lon <- seq(20,30, by = 1)
# generate 3dimensional array
dat <- array(runif(length(lat)*length(lon)*length(time)),
dim = c(length(lon),length(lat),length(time)))
such that
> dim(dat)
[1] 11 11 156
The dimensions describe the variable at different longitudes (dim 1), latitudes (dim 2), and times (dim 3).
The issue is that some of the times are repeated, due to overlapping sensors measuring the data. I would therefore like to keep only the unique times in dat, but average the gridded values over the duplicated times, i.e. if a day appears twice, take the mean value at each latitude/longitude grid cell for that time.
I can find the unique times as:
# only select unique times
new_time <- unique(time)
unique_time <- new_time
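As a quick check (a sketch), duplicated() shows which time steps repeat:
which(duplicated(time))  # positions of the repeated time steps
sum(duplicated(time))    # 28 here, one per day of the overlap period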
The following code then loops through each grid cell (lat/lon) and averages all of the duplicated days.
# loop through lat/lon coordinates to generate new data
new_dat <- array(dim = c(length(lon),length(lat),length(new_time)))
for(i in 1:length(lon)){
for(ii in 1:length(lat)){
dat2 <- dat[i,ii,]
dat2b <- NA
for(k in 1:length(unique_time)){
idx <- time == unique_time[k]
dat2b[k] <- mean(dat2[idx], na.rm = TRUE)
}
new_dat[i,ii,] <- dat2b
}
}
I'm convinced that this gives the correct answer, but I'm certain there is a much cleaner method to achieve this.
I should also note that my data is quite large (i.e. k = 7000), so this last loop is not very efficient, to say the least.
My original answer:
This is a bit more concise and efficient, using aggregate:
for (i in 1:length(lon)) {
for (ii in 1:length(lat)) {
new_dat[i, ii, ] <- as.numeric(aggregate(dat[i, ii, ], by = list(time), mean)$x)
}
}
It still has 2 of the 3 loops, but it manages to bypass creating dat2, dat2b, and unique_time.
My improved answer:
f <- function(i, ii) as.numeric(aggregate(dat[i, ii, ], by = list(time), mean)$x)
# compute the lon/lat index grid once rather than on every access
grid <- expand.grid(1:length(lon), 1:length(lat))
for (i in 1:nrow(grid)) {
new_dat[grid[i, 1], grid[i, 2], ] <- f(grid[i, 1], grid[i, 2])
}
Got it down to just 1 loop. We could probably bypass that loop too with an apply.
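A sketch of that apply() version, assuming the same per-cell aggregate() call as above; apply() over margins c(1, 2) returns time as the first dimension, so aperm() restores the lon x lat x time layout:
agg <- apply(dat, c(1, 2), function(v) as.numeric(aggregate(v, by = list(time), mean)$x))
new_dat <- aperm(agg, c(2, 3, 1))
dim(new_dat)  # 11 11 128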
I am trying to sum values that are greater than 70 in several different data sets. I believe that aggregate can do this, but my research has not turned up an obvious way to select the values that exceed seventy. I first used aggregate to get the daily max values and put them into a data frame called yearmaxs. Here is my code and what I have tried:
# number of times O3 > 70 in a year per site
Sys.setenv(TZ = "UTC")
library(openair)
library(lubridate)
filedir <- "C:/Users/dfmcg/Documents/Thesisfiles/8hravg"
myfiles <- c(list.files(path = filedir))
npsfiles <- c(paste(filedir, myfiles, sep = '/'))
for (i in npsfiles[22]) {
x <- substr(i,45,61)
y <- paste('C:/Users/dfmcg/Documents/Thesisfiles/exceedenceall', x, sep='/')
timeozone <- import(i, date="DATES", date.format = "%Y-%m-%d %H", header=TRUE, na.strings="NA")
overseventy <- c()
yearmaxs <- aggregate(rolling.O3new ~ format(as.Date(date)), timeozone, max)
colnames(yearmaxs) <- c("date", "daymax")
overseventy <- aggregate(daymax ~ format(as.Date(date)), yearmaxs, FUN = length,
subset = as.numeric(daymax) > 70)
colnames(overseventy) <- c("date", "daymax")
aggregate(daymax ~ format(as.Date(date), "%Y"), overseventy, sum)
}
I have also tried: sum > "70" and sum(daymax > "70").
My other idea at this point is using a for loop to iterate through the values. I was hoping that I could use aggregate again to sum the values of interest. Any help at all would be greatly appreciated!
I think you want:
aggregate(daymax ~ format(as.Date(date)), yearmaxs, FUN = length,
subset = as.numeric(daymax) > 70)
Two things:
1) you need a numerical comparison, so use as.numeric(daymax) > 70, not daymax > "70";
2) use the subset argument of aggregate.formula.
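A tiny self-contained illustration of the subset argument, with hypothetical toy data (here daymax is already numeric, so the as.numeric() coercion isn't needed):
d <- data.frame(date = as.Date("2018-01-01") + 0:5,
                daymax = c(65, 72, 80, 50, 90, 71))
aggregate(daymax ~ format(date, "%Y"), d, FUN = length,
          subset = daymax > 70)
#   format(date, "%Y") daymax
# 1               2018      4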
In trying to avoid for loops in R, I wrote a function that returns an average value from one data frame given row-specific values from another data frame, and I pass this function to sapply over the range of row numbers. The function works, but it returns ~2.5 results per second, which is not much better than a for loop, so I feel I have not fully exploited the vectorized aspects of the apply family of functions. Can anyone help me rethink my approach? Here is a minimal working example. Thanks in advance.
#Creating first dataframe
dates<-seq(as.Date("2013-01-01"), as.Date("2016-07-01"), by = 1)
n<-length(seq(as.Date("2013-01-01"), as.Date("2016-07-01"), by = 1))
df1<-data.frame(date = dates,
hour = sample(1:24, n,replace = T),
cat = sample(c("a", "b"), n, replace = T),
lag = sample(1:24, n, replace = T))
#Creating second dataframe
df2<-data.frame(date = sort(rep(dates, 24)),
hour = rep(1:24, length(dates)),
p = runif(length(rep(dates, 24)), min = -20, max = 100))
df2<-df2[order(df2$date, df2$hour),]
df2$cat<-"a"
temp<-df2
temp$cat<-"b"
df2<-rbind(df2,temp)
#function
period_mean<-function(x){
tmp <- df2[df2$cat == df1[x, ]$cat, ]
#This line extracts the row name index from tmp,
#in which the two dataframes match on date and hour
he_i<-which(tmp$date == df1[x,]$date & tmp$hour == df1[x,]$hour)
#My lagged period is given by the variable "lag". I want the average
#over the period hour - (hour - lag). Since df2 is sorted such hours
#are consecutive, this method requires that I subset on only the
#relevant value for cat (hence the creation of tmp in the first line
#of the function)
p<-mean(tmp[(he_i - df1[x,]$lag):he_i,]$p)
print(x)
print(p)
return(p)
}
#Execute function
out<-sapply(1:length(row.names(df1)), period_mean)
EDIT: I have subsequently learned that part of the reason my original version iterated so slowly is that the data classes in the two dataframes were not the same: df1$date was a Date field, while df2$date was a character field. Of course, this wasn't apparent in the example I posted, because there the types are the same by construction. Hope this helps.
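A quick diagnostic (a sketch) that would have caught this up front:
# the join columns must share a class, or comparisons coerce
# element by element and can become very slow
str(df1$date)
str(df2$date)
stopifnot(identical(class(df1$date), class(df2$date)))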
Here's one suggestion:
getIdx <- function(i) {
date <- df1$date[i]
hour <- df1$hour[i]
cat <- df1$cat[i]
which(df2$date==date & df2$hour==hour & df2$cat==cat)
}
v_getIdx <- Vectorize(getIdx)
df1$index <- v_getIdx(1:nrow(df1))
b_start <- match("b", df2$cat)
out2 <- apply(df1[,c("cat","lag","index")], MAR=1, function(x) {
flr <- ifelse(x[1]=="a", 1, b_start)
x <- as.numeric(x[2:3])
mean(df2$p[max(flr, (x[2]-x[1])):x[2]])
})
We make a function (getIdx) to retrieve the rows from df2 that match the values from each row in df1, and then Vectorize the function.
We then run the vectorized function to get a vector of row indices and set b_start to the row where the "b" category starts.
We then iterate through the rows of df1 with apply. In the mean(...) call, we set the "floor" to either row 1 (if cat=="a") or b_start (if cat=="b"), which eliminates the need to subset (what you were doing with tmp).
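This also explains the as.numeric(x[2:3]) line: apply() first coerces the data frame to a matrix, and with the character column cat present the whole matrix, and hence each row x, becomes character. A quick demonstration:
m <- as.matrix(df1[, c("cat", "lag", "index")])
typeof(m)  # "character": lag and index must be converted back with as.numeric()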
Performance:
> system.time(out<-sapply(1:length(row.names(df1)), period_mean))
user system elapsed
11.304 0.393 11.917
> system.time({
+ df1$index <- v_getIdx(1:nrow(df1))
+ b_start <- match("b", df2$cat)
+ out2 <- apply(df1[,c("cat","lag","index")], MAR=1, function(x) {
+ flr <- ifelse(x[1]=="a", 1, b_start)
+ x <- as.numeric(x[2:3])
+ mean(df2$p[max(flr, (x[2]-x[1])):x[2]])
+ })
+ })
user system elapsed
2.839 0.405 3.274
> all.equal(out, out2)
[1] TRUE