R: Subset/extract rows of a data frame in steps of 12

I have a data frame with monthly data covering 26 years (1993 to 2019), which makes 312 rows in total.
Unfortunately, I had to lag the data, so each year now runs from July of year t to June of year t+1. This means I can't just extract the year from the date.
Now I want to extract the 12 months of data for each year into a separate data frame. My first idea is to write the year into the first column and then use lapply to filter on it afterward.
For this, I wrote the following loop:
n <- 1
m <- 1993
for (a in 1:26) {
for (i in n:(n+11)) {
t.monthly.ret.lag[i,1] <- m
}
n <- n+1
m <- m+1
}
Unfortunately, R isn't assigning the years in steps of 12. Instead, it moves forward in steps of 1.
Does anyone know how to solve this, or know a better way of doing it?

y.first <- 1993
y.last <- 2018  # label of the last lagged year (July 2018 to June 2019), so 26 years = 312 rows
month.col <- rep(c(7:12, 1:6), y.last-y.first+1)
year.col <- rep(c(y.first:y.last), each=length(month.name))
df <- data.frame(year=year.col, month=month.col)
This yields a data frame with the month and year tagged on each row, which then allows you to use dplyr::group_by() and so on.

You could just create a 312-element vector giving the year (and another giving the month) using rep() and seq(). You can then attach them as additional columns to your data.frame, or just use them as a reference for month and year.
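month <- rep(seq_len(12), 27)                # 1..12 for each calendar year 1993-2019
year <- rep(seq(1993, 2019), each = 12)      # each calendar year repeated 12 times
month <- month[7:(length(month) - 6)]        # drop Jan-Jun 1993 and Jul-Dec 2019
year <- year[7:(length(year) - 6)]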
The month vector cycles through 1 to 12, beginning at 7 (July); the year vector repeats each year 12 times (the first and the last only 6 times).
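For reference, a minimal sketch of the rep() idea applied to the lagged years, in base R. It assumes the 312 rows are sorted by date and start in July 1993; the data frame monthly and its columns date, ret and lag_year are hypothetical stand-ins for t.monthly.ret.lag. Note that the loop in the question advances n by 1 instead of 12, which is why the years move forward one row at a time.

```r
# Hypothetical stand-in for the lagged monthly data: July 1993 to June 2019
n_years <- 26
monthly <- data.frame(
  date = seq(as.Date("1993-07-01"), by = "month", length.out = 12 * n_years),
  ret  = seq_len(12 * n_years)  # placeholder values
)

# Option 1: label rows in blocks of 12 (relies on the rows being sorted)
monthly$lag_year <- rep(seq(1993, length.out = n_years), each = 12)

# Option 2: derive the same label from the date (January to June belong to the previous year)
lag_year_from_date <- as.integer(format(monthly$date, "%Y")) -
  (as.integer(format(monthly$date, "%m")) < 7)

# One data frame of 12 rows per lagged year, named "1993" ... "2018"
by_year <- split(monthly, monthly$lag_year)
```

Individual years can then be pulled out with by_year[["1993"]], or processed together with lapply(by_year, ...).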

Related

How to perform calculations on moving subsets of n elements of data frame without loop

I'm trying to calculate the Effective Drought Index using R. One of the many steps needed is to calculate the stored water quantity (EP):
EP365=P1/1+(P1+P2)/2+(P1+P2+P3)/3+(P1+P2+P3+P4)/4+ … +(P1+…+P365)/365
where P1 is the precipitation on the most recent day, P2 is the precipitation two days ago, and P365 is the precipitation 365 days ago. EP must be calculated for each 365-day period: days 1 to 365, 2 to 366, etc.
So I have a data frame with two columns, date and prcp, and more than 20000 rows. A simple (and slow) solution is to compute EP over every 365-row window, ending at each row from 365 to nrow(df):
period_length <- 365
df$EP <- NA
for (i in (period_length:nrow(df))) {
first <- (i - period_length) + 1
SUB <- rev(df[first:i,]$prcp)
EP <- sum(cumsum(SUB)/seq_along(SUB))
df$EP[i] <- EP
}
Of course it works; the question is how to calculate EP without using a loop.
Use rollapplyr with the indicated function. Replace fill=NA with partial=TRUE if you want it to work with fewer than 365 days during the first 364 points or omit both if you want to drop the first 364 points.
library(zoo)
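x <- 1:1000 # sample data
# rev() puts the most recent day first, matching P1 in the formula and the loop in the question
ep <- rollapplyr(x, 365, function(w) sum(cumsum(rev(w)) / seq_along(w)), fill = NA)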
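As a base R alternative, the double sum can be rewritten as a single weighted sum: P_j appears in every term from k = j to N, so EP = sum over j of P_j * w_j, with w_j = 1/j + 1/(j+1) + ... + 1/N. A one-sided stats::filter() then applies those weights to every window at once. This is a sketch under that derivation, not part of the answer above; ep_weights, ep_filter and ep_loop are hypothetical helper names, and a window of 7 is used only to keep the check small.

```r
# w_j = sum_{k = j}^{n} 1/k, with j = 1 being the most recent day
ep_weights <- function(n) rev(cumsum(rev(1 / seq_len(n))))

# sides = 1 makes filter() use the current and past values only:
# y[i] = w[1] * p[i] + w[2] * p[i - 1] + ... + w[n] * p[i - n + 1]
ep_filter <- function(p, n) {
  as.numeric(stats::filter(p, ep_weights(n), sides = 1))
}

# Reference implementation following the loop in the question
ep_loop <- function(p, n) {
  out <- rep(NA_real_, length(p))
  for (i in n:length(p)) {
    w <- rev(p[(i - n + 1):i])
    out[i] <- sum(cumsum(w) / seq_along(w))
  }
  out
}

set.seed(1)
p <- runif(50)             # made-up daily precipitation
ep_fast <- ep_filter(p, 7)
ep_slow <- ep_loop(p, 7)
```

For the real data this would be called as ep_filter(df$prcp, 365); the first 364 entries come back as NA, as in the loop.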

Longest consecutive period above threshold using rle and for loop

I have four years of streamflow data for one month and I'm trying to figure out how to extract the longest consecutive period at or above a certain threshold for each of the four years. In the example below, the threshold is 4. I want to try to accomplish this using a for loop or possibly one of the apply functions, but I'm not sure how to go about it.
Here's my example dataframe:
year <- c(rep(2009,31), rep(2010, 31), rep(2011, 31), rep(2012, 31))
day<-c(rep(seq(1:31),4))
discharge <- c(4,4,4,5,6,5,4,8,4,5,3,8,8,8,8,8,8,8,1,2,2,8,8,8,8,8,8,8,8,8,4,4,4,5,6,3,1,1,3,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,10,3,3,3,3,3,3,1,1,3,8,8,8,8,8,8,8,8,8,1,2,2,8,8,3,8,8,8,8,8,8,4,4,4,5,6,3,1,1,3,3,3,3,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,3)
df<-data.frame(cbind(year, day, discharge))
df$threshold<-ifelse(discharge>=4,1,0)
In this example, the threshold column is coded as 1 if the discharge is at or above the threshold and 0 if not. I'm able to partially get my desired output for one year (2009 in the example below), with the following code:
rl2009 <- with(subset(df, year == 2009), rle(threshold))
cs2009 <- cumsum(rl2009$lengths)
index2009 <- cbind(cs2009[rl2009$values == 1] - rl2009$lengths[rl2009$values == 1] + 1,
                   cs2009[rl2009$values == 1])
df2009 <- data.frame(index2009)
df2009 # output all periods when flow is at or above threshold
df2009$X3 <- df2009$X2 - df2009$X1 + 1
max2009 <- df2009[which.max(df2009$X3), ]
max2009 # output the first and longest period when flow is at or above threshold
For 2009, there are three time periods when the discharge equals or exceeds 4, but the period from day 1 to day 10 is chosen because it is the first of the longest periods above the threshold. X1 represents the start of the time period, X2 the end of the time period, and X3 the number of days in the time period. If there is more than one period with the same number of days, I want to select the first of them.
My desired output for all four years is below:
year X1 X2 X3
2009 1 10 10
2010 9 31 23
2011 10 18 9
2012 12 30 19
The actual data includes many more years and many streams, so it's not feasible to do this for each year individually. If anyone has any thoughts on how to achieve this, it'd be greatly appreciated. Thanks.
Simply generalize your process into a function such as threshold_find and pass it the data frame subset for each year, which by can handle for you.
As the object-oriented wrapper to tapply, by slices a data frame by one or more factors (here, year) and returns a list of whatever object the function outputs, in this case the one-row data frame holding the longest period. At the end, do.call() row-binds all data frames in the by list into one data frame.
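threshold_find <- function(df) {
  # run-length encode the 0/1 threshold flags for this year
  rl <- with(df, rle(threshold))
  cs <- cumsum(rl$lengths)
  # start (X1) and end (X2) row of every run at or above the threshold
  index <- cbind(cs[rl$values == 1] - rl$lengths[rl$values == 1] + 1,
                 cs[rl$values == 1])
  runs <- data.frame(index)
  runs$X3 <- runs$X2 - runs$X1 + 1
  # which.max returns the first maximum, so ties go to the earliest run
  runs[which.max(runs$X3), ]
}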
finaldf <- do.call(rbind, by(df, df$year, FUN=threshold_find))
finaldf
# X1 X2 X3
# 2009 1 10 10
# 2010 9 31 23
# 2011 10 18 9
# 2012 12 30 19
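A compact variant of the same rle idea, sketched on a small made-up data set so the result is easy to check by hand. The helper longest_run, the data frame flow and the column names start, end and days are illustrative and not from the answer above. It also returns a row with days = 0 for a group with no value at or above the threshold, a case the original function does not handle explicitly.

```r
# Longest run of TRUE in a logical vector; ties go to the earliest run
longest_run <- function(flag) {
  r <- rle(flag)
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1L
  hit <- which(r$values)               # indices of the runs that are TRUE
  if (length(hit) == 0) {
    return(data.frame(start = NA_integer_, end = NA_integer_, days = 0L))
  }
  best <- hit[which.max(r$lengths[hit])]
  data.frame(start = starts[best], end = ends[best], days = r$lengths[best])
}

# Made-up example: two years with six days each, threshold 4
flow <- data.frame(
  year = rep(c(2001, 2002), each = 6),
  discharge = c(5, 5, 1, 6, 6, 2,
                1, 4, 4, 4, 0, 9)
)

# One row per year; row names carry the year
result <- do.call(rbind, lapply(split(flow$discharge >= 4, flow$year), longest_run))
```

Splitting on more than one factor, for example split(x, list(df$stream, df$year)), would extend this to many streams, assuming a stream column exists in the real data.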

How do I subset every day except the last five days of zoo data?

I am trying to extract all dates except for the last five days from a zoo dataset into a single object.
This question is somewhat related to How do I subset the last week for every month of a zoo object in R?
You can reproduce the dataset with this code:
library(zoo)
set.seed(123)
price <- rnorm(365)
dates <- seq(as.Date("2013-01-01"), by = "day", length.out = 365)
zoodata <- zoo(price, dates)  # build the series directly so the dates stay of class Date
For my output, I'm hoping to get a combined dataset of everything except the last five days of each month. For example, if there are 20 days in the first month's data and 19 days in the second month's, I only want to subset the first 15 and 14 days of data respectively.
I tried using the head() function and the first() function to extract the first three weeks, but since each month has a different number of days (depending on the month and on leap years), that's not ideal.
Thank you.
Here are a few approaches:
1) as.Date Let tt be the dates. Then we compute a Date vector the same length as tt which has the corresponding last date of the month. We then pick out those dates which are at least 5 days away from that:
tt <- time(zoodata)
last.date.of.month <- as.Date(as.yearmon(tt), frac = 1)
zoodata[ last.date.of.month - tt >= 5 ]
2) tapply/head For each month tapply head(x, -5) to the data and then concatenate the reduced months back together:
do.call("c", tapply(zoodata, as.yearmon(time(zoodata)), head, -5))
3) ave Define revseq which given a vector or zoo object returns sequence numbers in reverse order so that the last element corresponds to 1. Then use ave to create a vector ix the same length as zoodata which assigns such reverse sequence numbers to the days of each month. Thus the ix value for the last day of the month will be 1, for the second last day 2, etc. Finally subset zoodata to those elements corresponding to sequence numbers greater than 5:
revseq <- function(x) rev(seq_along(x))
ix <- ave(seq_along(zoodata), as.yearmon(time(zoodata)), FUN = revseq)
z <- zoodata[ ix > 5 ]
ADDED Solutions (1) and (2).
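The reverse-numbering idea behind approach (3) can also be sketched in base R on a plain Date vector, without zoo. This is an illustration only: month_id, rev_rank and kept are hypothetical names, and the dates simply mirror the 365 daily observations from the question.

```r
dates <- seq(as.Date("2013-01-01"), by = "day", length.out = 365)

# Group key: one label per calendar month, e.g. "2013-01"
month_id <- format(dates, "%Y-%m")

# Within each month, number the observations from the end (last day = 1)
rev_rank <- ave(seq_along(dates), month_id, FUN = function(i) rev(seq_along(i)))

# Keep everything except the last five observations of each month
kept <- dates[rev_rank > 5]
```

Because the ranking counts observations rather than calendar days, it also behaves sensibly when a month has missing days, in the same way as approaches (2) and (3).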
Exactly the same way as in the answer to your other question:
Split the dataset by month and remove the last 5 days; just add a "-":
library(xts)
xts.data <- as.xts(zoodata)
lapply(split(xts.data, "months"), last, "-5 days")
And likewise, if you want it in one single object:
do.call(rbind, lapply(split(xts.data, "months"), last, "-5 days"))

R repeat rows by vector and date

I have a data frame with 275 different stations, 43 years of seasonal data (October to the following April; May to September are not needed) and 6 variables. Here is a small example of the data frame with only one variable, called value:
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
What I need is to fill the gaps between observation days with daily dates (e.g. 1:8), with the value in each row being the observed value divided by the number of days in that gap. It would look like:
data1 <- data.frame(station=rep(1,40), year=rep(1969,40), month=c(rep(10,31),rep(11,9)),day=c(1:31,1:9),value=rep(c(1/7,2/8,3/8,4/8,5/8,6/8),c(7,8,8,8,8,1)))
I wrote some poor code and searched around the site, but unfortunately couldn't get it to work. Any help or better ideas would be appreciated.
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
for (i in 1:length(station.date)){
days <- as.numeric(station.date[i+1]-station.date[i]) #not working
data <- within(data, days <- c(days,1))
}
rows <- rep(1:nrow(data), times=data[ ,data$days])
rows <- ifelse(rows > 10, 0, rows) #get rid of month May to Sept
data <- data[rows, ]
data <- within(data, value1 <- value/days)
data <- within(data, dd <- ?) #don't know to change the repeated days to real days
I wrote some code that does the same thing as your example, but you will probably have to modify it to handle the whole data set. I wasn't sure what to do with the last observation, so in the end I made a special case for it. If it should be divided by a different number, you just need to replace the 8 inside values <- c(values, tail(data$value, 1) / 8)
with that number. Moreover, if you have all 275 stations in one data.frame, I think the best idea would be to split it by station, transform each piece separately, and then rbind the results.
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
d <- as.numeric(diff(station.date))
range <- sum(d) + 1
# create dates
dates <- seq(station.date[1], by = "day", length.out = range)
# create values
values <- unlist(sapply(1:length(d), function(i) {
  rep(data$value[i] / d[i], d[i])
}))
# adding last observation
values <- c(values, tail(data$value, 1) / 8)
# create new data frame
data2 <- data.frame(station = rep(1, range),
                    year = as.numeric(format(dates, "%Y")),
                    month = as.numeric(format(dates, "%m")),
                    day = as.numeric(format(dates, "%d")),
                    value = values)
It could probably be optimised, but I hope it helps. Note how year, month and day are extracted from dates.
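One way the per-station step might look, as a rough sketch: the logic above is wrapped in a function and applied to each station with split() and rbind. The names expand_station, last_divisor, obs and daily are hypothetical. The sketch assumes each station's rows are sorted by date and form one continuous season; with several seasons per station, the data would also need to be split by season, otherwise the May to September gap gets filled as well.

```r
# Expand one station's observations to daily rows.
# last_divisor is the assumed number of days covered by the final observation.
expand_station <- function(st, last_divisor = 8) {
  dates <- as.Date(paste(st$year, st$month, st$day, sep = "-"))
  gaps <- as.numeric(diff(dates))               # days between observations
  daily_dates <- seq(dates[1], by = "day", length.out = sum(gaps) + 1)
  daily_values <- c(rep(head(st$value, -1) / gaps, times = gaps),
                    tail(st$value, 1) / last_divisor)
  data.frame(station = st$station[1],
             year = as.numeric(format(daily_dates, "%Y")),
             month = as.numeric(format(daily_dates, "%m")),
             day = as.numeric(format(daily_dates, "%d")),
             value = daily_values)
}

# Made-up input: the example station from the question, duplicated as station 2
one <- data.frame(station = rep(1, 6), year = rep(1969, 6),
                  month = c(10, 10, 10, 10, 11, 11),
                  day = c(1, 8, 16, 24, 1, 9), value = 1:6)
obs <- rbind(one, transform(one, station = 2))

daily <- do.call(rbind, lapply(split(obs, obs$station), expand_station))
```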

Producing Ordered Columns of Integers in R for odd-numbered ranges

Total newb R question, but here it is: let's say I want to create a data frame with two columns, one with all years in a range, and the other with every month in each year. When I'm done, I should have this:
year month
1990 1
1990 2
1990 3
Et cetera. This seems like a pretty obvious job for cbind, to turn a range into a column, and rep, to produce 12 instances of each year. This works great, but only for an even number of years in the range. So, for instance:
df <- data.frame(cbind(year=rep(c(1990:2000), 12)))
Works fine. And so does this:
df <- data.frame(cbind(year=rep(c(1990:2000), 12), month=c(1:12)))
But this produces overt nonsense:
df <- data.frame(cbind(year=rep(c(1990:2001), 12), month=c(1:12)))
The first line of code produces 12 instances of each year in the range, just as you'd expect; the second line produces the desired result. The third line produces 12 instances of each year, where each year only gets one month number. Thus:
year month
1990 1
1990 1
1990 1
Is there a way around this that doesn't require always adding a year and trimming it off later?
You are looking for expand.grid
df <- expand.grid(year = 1990:2001, month = 1:12)
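Note that expand.grid varies its first argument fastest, so the call above lists every year for month 1 before moving on to month 2. The attempt in the question misfires for a related reason: rep(1990:2001, 12) cycles through the years rather than repeating each one, and with 12 years the recycled 1:12 lines up so that each year always meets the same month. A short sketch of two ways to get the year-then-month ordering shown in the question, assuming that ordering is what is wanted (grid and same are illustrative names):

```r
years <- 1990:2001

# Put month first so it varies fastest, then reorder the columns
grid <- expand.grid(month = 1:12, year = years)[, c("year", "month")]

# Equivalent with rep(): each year 12 times, months cycling within each year
same <- data.frame(year = rep(years, each = 12),
                   month = rep(1:12, times = length(years)))
```

Both give 144 rows starting 1990/1, 1990/2, 1990/3, and both work for any number of years.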
