Suppose I have the following data set:
Daily observations of the S&P500, and
Quarterly Total Public Debt.
The observation of the quarter is at time
xxxx-01-01
xxxx-04-01
xxxx-07-01
xxxx-10-01
The non trading days such as weekend and holidays are denoted with NAs
2020-01-01 NA
2020-01-02 3257.85
2020-01-03 3234.85
.
.
.
.
2020-03-31 2584.59
This yields an unequal number of observations per quarter.
My question is: how do I remove dates so that each quarter has exactly 66 observations of the S&P500?
We can convert the index to yearqtr (from zoo) and use it to build a logical index that keeps the first 66 observations of each quarter:
xt1[ave(seq_along(index(xt1)), as.yearqtr(index(xt1)), FUN =
seq_along) <= 66]
As @G.Grothendieck mentioned in the comments, the idea is to first remove the NA elements
xt2 <- na.omit(xt1)
then calculate the minimum number of remaining observations in any quarter
n <- min(tapply(seq_along(index(xt2)), as.yearqtr(index(xt2)), FUN = length))
and use that in the first code block
xt2[ave(seq_along(index(xt2)), as.yearqtr(index(xt2)), FUN =
seq_along) <= n]
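The same three steps (drop NAs, find the smallest quarter, keep the first n rows per quarter) can be sketched in base R without zoo or xts. This is only an illustration under assumptions: the data frame df, its columns date and price, and the simulated prices are all made up, and weekends stand in for every non-trading day.

```r
# Made-up daily series covering three quarters; weekends are set to NA
dates <- seq(as.Date("2020-01-01"), as.Date("2020-09-30"), by = "day")
set.seed(1)
price <- round(3000 + cumsum(rnorm(length(dates))), 2)
price[as.POSIXlt(dates)$wday %in% c(0, 6)] <- NA   # 0 = Sunday, 6 = Saturday

# Step 1: remove the NA rows
df <- na.omit(data.frame(date = dates, price = price))

# Step 2: smallest number of observations in any quarter
qtr <- paste(format(df$date, "%Y"), quarters(df$date))
n <- min(table(qtr))

# Step 3: number the rows within each quarter and keep the first n
pos <- ave(seq_along(qtr), qtr, FUN = seq_along)
trimmed <- df[pos <= n, ]
trimmed_qtr <- paste(format(trimmed$date, "%Y"), quarters(trimmed$date))
table(trimmed_qtr)   # every quarter now has n rows
```

Keeping the first n rows drops dates from the end of the longer quarters; a different rule (for example dropping from the start) only changes the FUN passed to ave.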
This question is about how to replace missing days and months in a data frame using R. Considering the data frame below, 99 denotes missing day or month and NA represents dates that are completely unknown.
df<-data.frame("id"=c(1,2,3,4,5),
"date" = c("99/10/2014","99/99/2011","23/02/2016","NA",
"99/04/2009"))
I am trying to replace the missing days and months based on the following criteria:
For dates with a missing day but known month and year, the replacement date would be a random selection from the middle of the interval (first day to the last day of that month). For example, for id 1, the replacement date would be sampled from the middle of 01/10/2014 to 31/10/2014. For id 5, this would be the middle of 01/04/2009 to 30/04/2009. Of note is the varying number of days for different months, e.g. 31 days for October and 30 days for April.
As in the case of id 2, where both day and month are missing, the replacement date is a random selection from the middle of the interval (first day to last day of the year), e.g. 01/01/2011 to 31/12/2011.
Please note: complete dates (e.g. the case of id 3) and NAs are not to be replaced.
I have tried by making use of the seq function together with the as.POSIXct and as.Date functions to obtain the sequence of dates from which the replacement dates are to be sampled. The difficulty I am experiencing is how to automate the R code to obtain the date intervals (it varies across distinct id) and how to make a random draw from the middle of the intervals.
The expected output would have the date of id 1, 2 and 5 replaced but those of id 3 and 4 remain unchanged. Any help on this is greatly appreciated.
This isn't the prettiest, but it seems to work and adapts to differing month and year lengths:
set.seed(999)
df$dateorig <- df$date
seld <- grepl("^99/", df$date)
selm <- grepl("^../99", df$date)
md <- seld & (!selm)
mm <- seld & selm
df$date <- as.Date(gsub("99","01",as.character(df$date)), format="%d/%m/%Y")
monrng <- sapply(df$date[md], function(x) seq(x, length.out=2, by="month")[2]) - as.numeric(df$date[md])
df$date[md] <- df$date[md] + sapply(monrng, sample, 1)
yrrng <- sapply(df$date[mm], function(x) seq(x, length.out=2, by="12 months")[2]) - as.numeric(df$date[mm])
df$date[mm] <- df$date[mm] + sapply(yrrng, sample, 1)
#df
# id date dateorig
#1 1 2014-10-14 99/10/2014
#2 2 2011-02-05 99/99/2011
#3 3 2016-02-23 23/02/2016
#4 4 <NA> NA
#5 5 2009-04-19 99/04/2009
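One detail worth checking in the code above: sample(n, 1) returns a value from 1 to n, so adding it to the first day of a period can never return that first day and can land on the first day of the next period. A minimal sketch of a draw that stays inside the period is shown below; random_date_in is a hypothetical helper name, not part of the answer, and it samples uniformly over the whole interval rather than only its "middle".

```r
# Hypothetical helper: draw one date uniformly from a month or year
# that starts at `first`; `by` is "month" or "year" as accepted by seq()
random_date_in <- function(first, by) {
  last <- seq(first, length.out = 2, by = by)[2] - 1   # last day of the period
  n_days <- as.integer(last - first) + 1L              # days in the period
  first + sample.int(n_days, 1L) - 1L                  # offsets 0 .. n_days - 1
}

set.seed(999)
d_month <- random_date_in(as.Date("2014-10-01"), "month")
d_year  <- random_date_in(as.Date("2011-01-01"), "year")
```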
I'm trying to calculate the Effective Drought Index using R. One of the many steps needed to do so is to calculate a stored water quantity (EP):
EP365=P1/1+(P1+P2)/2+(P1+P2+P3)/3+(P1+P2+P3+P4)/4+ … +(P1+…+P365)/365
Where P1 is the precipitation on the last day, P2 is the precipitation two days ago and P365 is the precipitation 365 days ago. EP must be calculated for each 365-day period, starting with days 1 to 365, then 2 to 366, etc.
So I have a data frame with two columns, date and prcp, and more than 20000 rows. A simple (and slow) solution is to calculate EP for every window of 365 rows, from row 365 to nrow(df):
period_length <- 365
df$EP <- NA
for (i in (period_length:nrow(df))) {
first <- (i - period_length) + 1
SUB <- rev(df[first:i,]$prcp)
EP <- sum(cumsum(SUB)/seq_along(SUB))
df$EP[i] <- EP
}
Of course it works; however, the question is how to calculate EP without using a loop.
Use rollapplyr with the indicated function. Replace fill=NA with partial=TRUE if you want it to work with fewer than 365 days during the first 364 points or omit both if you want to drop the first 364 points.
library(zoo)
x <- 1:1000 # sample data
# rev() puts the most recent day first, matching P1 in the question's formula
ep <- rollapplyr(x, 365, function(x) sum(cumsum(rev(x)) / seq_along(x)), fill = NA)
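The sum can also be regrouped by day instead of by term: P_k appears in every term from the k-th onward, so EP = w_1*P1 + ... + w_365*P365 with w_k = 1/k + 1/(k+1) + ... + 1/365. A minimal sketch of that idea, assuming a made-up numeric vector prcp in chronological order and using base R's stats::filter as a one-sided weighted moving sum:

```r
n <- 365
# w[k] = 1/k + 1/(k+1) + ... + 1/n, the total weight of the value k days back
w <- rev(cumsum(rev(1 / seq_len(n))))

set.seed(42)
prcp <- runif(1000)   # made-up daily precipitation

# Reference: the loop definition from the question
ep_loop <- sapply(n:length(prcp), function(i) {
  p <- rev(prcp[(i - n + 1):i])   # most recent day first
  sum(cumsum(p) / seq_along(p))
})

# Weighted form: with sides = 1, filter() applies w[1] to the current value,
# w[2] to the previous one, and so on; the first n - 1 results are NA
ep_filt <- as.numeric(stats::filter(prcp, w, sides = 1))

all.equal(ep_loop, ep_filt[n:length(prcp)])
```

Because the weights are fixed, each EP value costs one weighted sum instead of a cumsum over the window.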
My data set looks like this: Daily closing share price is given for 25 years from 1991 to 2016 for each trading date.
Company Code   Company Name        Daily Trading Dates   Daily Closing Share Price
43677          CENTURY ENKA LTD.   1/1/1991              3550.00
-do-           -do-                1/2/1991              3600.00
                                   .                     3700.00
                                   .                     3800.00
                                   12/31/1991            x
                                   .                     x
                                   .                     x
                                   1/1/2016              x
                                   .                     x
                                   .                     x
                                   12/31/2016            x
I think the SMA function in the TTR package may help. Here is an example:
library(TTR)
p1 <- c(45,68,98,97,45,12,46,98,45,65,97,48,65,95) #dummy price data
SMA(p1,4) #calculate a 4 period simple moving average
# output:
[1] NA NA NA 77.00 77.00 63.00 50.00 50.25 50.25 63.50 76.25
[12] 63.75 68.75 76.25
So if you set the second argument of SMA to 252 (roughly the number of trading days in a year), I think you will get, for each date in your data frame, the average share price over the preceding year.
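For readers without TTR installed, the same trailing average can be sketched in base R with stats::filter. This is an equivalent for illustration only, not how SMA is implemented; the variable name sma4 is made up.

```r
p1 <- c(45, 68, 98, 97, 45, 12, 46, 98, 45, 65, 97, 48, 65, 95)  # dummy price data

# Equal weights of 1/4 over the current and three previous values;
# sides = 1 makes the window trailing, so the first three results are NA
sma4 <- as.numeric(stats::filter(p1, rep(1 / 4, 4), sides = 1))
sma4
```

Replacing 4 with 252 gives the one-year trailing window described above.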
I would use package lubridate and either tapply or ave. In what follows I assume that your data is in the form of a data.frame named dat.
library(lubridate)
yr <- year(mdy(dat$date))  # assumes the dates are in a column named date
res1 <- tapply(dat$price, yr, FUN = mean)
res2 <- ave(dat$price, yr, FUN = mean)
The difference between the two is that ave returns a vector the length of the input vector, whereas tapply returns a vector with as many elements as groups defined by the grouping variable(s), in this case yr.
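A small base R illustration of that difference, using a made-up four-row dat (the column names date and price are assumptions, and the year is extracted with format instead of lubridate):

```r
dat <- data.frame(
  date  = c("1/1/1991", "1/2/1991", "1/1/1992", "1/2/1992"),
  price = c(10, 20, 30, 50)
)
yr <- format(as.Date(dat$date, format = "%m/%d/%Y"), "%Y")

res1 <- tapply(dat$price, yr, FUN = mean)  # one mean per year, named by year
res2 <- ave(dat$price, yr, FUN = mean)     # the year's mean repeated on every row

length(res1)  # number of years
length(res2)  # number of rows in dat
```

res2 can be stored directly as a new column of dat, while res1 is the natural choice for a one-row-per-year summary.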
I've got a working function, but I'm hoping there is a more succinct way of going about this.
I have a dataset of events that are captured with the hour of the week they occurred in. For example, 4 AM on Sunday= 4, 4 AM on Monday = 28 etc. I want to analyze this data on a daily basis. For instance, all of the events that happen between 8 and 10 am on any day.
To do this I have built a function that returns a dichotomous value for the given range for an ordered list. Function two_break accepts an ordered list of integers between 0:168 representing the hours of a week and a range (b1 and b2) for the desired periods of a 24 hour day. b1 and b2 bound the desired part of the 24 hour day, i.e. if b1=8 and b2=10, two_break will return all values of 9, (9+24)=33, (9+48)=57, etc. as 1 and all others as 0.
two_break <- function(test_hr,b1,b2){
test_hr<-ifelse(test_hr==1,1.1,test_hr)
for(i in 0:6){
test_hr<-ifelse(test_hr> (b1+24*i) & test_hr< (b2+24*i), 1 ,test_hr)
}
test_hr<-ifelse(test_hr==1,1,0)
return(test_hr)
}
This function works fine, but I'm wondering if anybody out there could do it more efficiently/succinctly.
See full code and data set at my github: anthonyjp87 168 hr transformation file/data.
Cheers!
You can use integer division %/% to capture the day of the week, and modulus, %% to capture the hour in the day:
weekHours <- 1:168
# return the indices of all elements where the hour is between 8AM and 10AM, inclusive
test_hr <- weekHours[weekHours %% 24 %in% 8:10]
Note that midnight is represented by 0. If you want to wrap this into a function, you might use
getTest_hr <- function(weekHours, startTime, stopTime) {
weekHours[weekHours %% 24 %in% seq(startTime, stopTime)]
}
To get the day of the week, you can use integer division:
# get all indices for the third day of the week
dayOfWeek3 <- weekHours[(weekHours %/% 24 + 1) == 3]
To get a binary vector of the selected time periods, simply pull the logical out of the index:
allTimesBinary <- (weekHours %% 24) %in% 8:10
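Note that %in% 8:10 includes both end points, whereas two_break in the question uses strict inequalities (only hour 9 for b1=8, b2=10). A minimal sketch that keeps the strict bounds and returns 0/1 like the original; two_break_vec is a hypothetical name, and it assumes b1 and b2 lie within a single day (0 to 24):

```r
# Hypothetical vectorized variant of two_break with exclusive bounds
two_break_vec <- function(test_hr, b1, b2) {
  h <- test_hr %% 24            # hour within the day, midnight = 0
  as.integer(h > b1 & h < b2)   # 1 inside (b1, b2), 0 otherwise
}

hrs <- 0:168
flag <- two_break_vec(hrs, 8, 10)
hrs[flag == 1]   # the hour-9 slot of each of the seven days
```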
I am trying to extract all dates except for the last five days from a zoo dataset into a single object.
This question is somewhat related to How do I subset the last week for every month of a zoo object in R?
You can reproduce the dataset with this code:
library(zoo)
set.seed(123)
price <- rnorm(365)
# Build the dates directly; cbind() would coerce them to plain numbers
dates <- seq(as.Date("2013-01-01"), by = "day", length.out = 365)
zoodata <- zoo(price, dates)
For my output, I'm hoping to get a combined dataset of everything except the last five days of each month. For example, if there are 20 days in the first month's data and 19 days in the second month's, I only want to subset the first 15 and 14 days of data respectively.
I tried using the head() function and the first() function to extract the first three weeks, but since each month will have a different amount of days according to month or leap year months, it's not ideal.
Thank you.
Here are a few approaches:
1) as.Date Let tt be the dates. Then we compute a Date vector the same length as tt which has the corresponding last date of the month. We then pick out those dates which are at least 5 days away from that:
tt <- time(zoodata)
last.date.of.month <- as.Date(as.yearmon(tt), frac = 1)
zoodata[ last.date.of.month - tt >= 5 ]
2) tapply/head For each month tapply head(x, -5) to the data and then concatenate the reduced months back together:
do.call("c", tapply(zoodata, as.yearmon(time(zoodata)), head, -5))
3) ave Define revseq which given a vector or zoo object returns sequence numbers in reverse order so that the last element corresponds to 1. Then use ave to create a vector ix the same length as zoodata which assigns such reverse sequence numbers to the days of each month. Thus the ix value for the last day of the month will be 1, for the second last day 2, etc. Finally subset zoodata to those elements corresponding to sequence numbers greater than 5:
revseq <- function(x) rev(seq_along(x))
ix <- ave(seq_along(zoodata), as.yearmon(time(zoodata)), FUN = revseq)
z <- zoodata[ ix > 5 ]
ADDED Solutions (1) and (2).
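The reverse-sequence idea in (3) does not depend on zoo. A minimal base R sketch on a plain Date vector, with made-up dates covering January and February 2013 and a "YYYY-MM" string standing in for as.yearmon:

```r
dates <- seq(as.Date("2013-01-01"), by = "day", length.out = 59)  # Jan + Feb 2013
month_key <- format(dates, "%Y-%m")

# Number the days of each month backwards: last day = 1, second last = 2, ...
ix <- ave(seq_along(dates), month_key, FUN = function(x) rev(seq_along(x)))

kept <- dates[ix > 5]   # drop the last five days of every month
range(kept[format(kept, "%m") == "01"])
range(kept[format(kept, "%m") == "02"])
```

Because the numbering is done within each month, months of different lengths are handled without any special cases.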
Exactly the same way as in the answer to your other question:
Split dataset by month, remove last 5 days, just add a "-":
library(xts)
xts.data <- as.xts(zoodata)
lapply(split(xts.data, "months"), last, "-5 days")
And the same way, if you want it on one single object:
do.call(rbind, lapply(split(xts.data, "months"), last, "-5 days"))