Replacement of missing day and month in dates using R - r

This question is about how to replace missing days and months in a data frame using R. Considering the data frame below, 99 denotes missing day or month and NA represents dates that are completely unknown.
df<-data.frame("id"=c(1,2,3,4,5),
"date" = c("99/10/2014","99/99/2011","23/02/2016","NA",
"99/04/2009"))
I am trying to replace the missing days and months based on the following criteria:
For dates with a missing day but known month and year, the replacement date would be a random selection from the middle of the interval (first day to last day of that month). For example, for id 1, the replacement date would be sampled from the middle of 01/10/2014 to 31/10/2014. For id 5, this would be the middle of 01/04/2009 to 30/04/2009. Note that the number of days varies by month, e.g. 31 days for October and 30 days for April.
Where both day and month are missing, as for id 2, the replacement date is a random selection from the middle of the interval (first day to last day of the year), e.g. 01/01/2011 to 31/12/2011.
Please note: complete dates (e.g. the case of id 3) and NAs are not to be replaced.
I have tried by making use of the seq function together with the as.POSIXct and as.Date functions to obtain the sequence of dates from which the replacement dates are to be sampled. The difficulty I am experiencing is how to automate the R code to obtain the date intervals (it varies across distinct id) and how to make a random draw from the middle of the intervals.
The expected output would have the date of id 1, 2 and 5 replaced but those of id 3 and 4 remain unchanged. Any help on this is greatly appreciated.

This isn't the prettiest, but it seems to work and adapts to differing month and year lengths:
set.seed(999)
df$dateorig <- df$date
seld <- grepl("^99/", df$date)
selm <- grepl("^../99", df$date)
md <- seld & (!selm)
mm <- seld & selm
df$date <- as.Date(gsub("99","01",as.character(df$date)), format="%d/%m/%Y")
monrng <- sapply(df$date[md], function(x) seq(x, length.out=2, by="month")[2]) - as.numeric(df$date[md])
df$date[md] <- df$date[md] + sapply(monrng, sample, 1)
yrrng <- sapply(df$date[mm], function(x) seq(x, length.out=2, by="12 months")[2]) - as.numeric(df$date[mm])
df$date[mm] <- df$date[mm] + sapply(yrrng, sample, 1)
#df
# id date dateorig
#1 1 2014-10-14 99/10/2014
#2 2 2011-02-05 99/99/2011
#3 3 2016-02-23 23/02/2016
#4 4 <NA> NA
#5 5 2009-04-19 99/04/2009
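A variation on the same idea, as a minimal sketch in base R. It assumes that a uniform random draw over the whole month (or year) is acceptable, which may not be exactly what "middle of the interval" means in the question. The helper name impute_date is hypothetical. Since sample(n, 1) returns a value from 1 to n, the sketch draws an offset from 0 to n - 1 so that the result stays inside the interval.

```r
# Hypothetical helper: replace "99" placeholders with a random date drawn
# uniformly from the matching month (day missing) or year (day and month missing).
impute_date <- function(x) {
  x <- as.character(x)
  # Complete dates parse normally; "99/..." strings and "NA" become NA here
  out <- as.Date(x, format = "%d/%m/%Y")
  parts <- strsplit(x, "/", fixed = TRUE)
  for (i in seq_along(x)) {
    p <- parts[[i]]
    # Leave complete dates and fully unknown dates untouched
    if (length(p) != 3 || p[1] != "99") next
    if (p[2] == "99") {
      first <- as.Date(paste(p[3], "01", "01", sep = "-"))
      last  <- seq(first, by = "year", length.out = 2)[2] - 1
    } else {
      first <- as.Date(paste(p[3], p[2], "01", sep = "-"))
      last  <- seq(first, by = "month", length.out = 2)[2] - 1
    }
    n_days <- as.integer(last - first) + 1
    # Offset of 0 .. n_days - 1 keeps the draw between first and last
    out[i] <- first + sample.int(n_days, 1) - 1
  }
  out
}

set.seed(999)
dates <- c("99/10/2014", "99/99/2011", "23/02/2016", "NA", "99/04/2009")
res <- impute_date(dates)
res
```

Working on the character vector directly avoids the separate logical selectors, at the cost of an explicit loop.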

Related

Enforce equal amount of days each quarter R 'xts'

Suppose I have the following data set:
Daily observations of the S&P500, and
Quarterly Total Public Debt.
The observation of the quarter is at time
xxxx-01-01
xxxx-04-01
xxxx-07-01
xxxx-10-01
The non trading days such as weekend and holidays are denoted with NAs
2020-01-01 NA
2020-01-02 3257.85
2020-01-03 3234.85
.
.
.
.
2020-03-31 2584.59
This will yield an unequal number of observations per quarter.
My question is: how do I remove a certain number of dates so that each quarter has exactly 66 observations of the S&P500?
We can convert the index to yearqtr (from zoo), use that to create a logical index for first 66 observations
xt1[ave(seq_along(index(xt1)), as.yearqtr(index(xt1)), FUN =
seq_along) <= 66]
As @G.Grothendieck mentioned in the comments, the idea would be to first remove the NA elements
xt2 <- na.omit(xt1)
then, calculate the minimum number of elements per each quarter
n <- min(tapply(seq_along(index(xt2)), as.yearqtr(index(xt2)), FUN = length))
Use that in first code block
xt2[ave(seq_along(index(xt2)), as.yearqtr(index(xt2)), FUN =
seq_along) <= n]
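The same grouping logic can be sketched without xts or zoo, as a rough illustration only. The data below are made up (weekdays of 2020 with placeholder values), and the quarter key is built with base R's quarters() instead of as.yearqtr.

```r
# Toy daily series: weekdays of 2020 only (values are placeholders)
d <- seq(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "day")
d <- d[!format(d, "%u") %in% c("6", "7")]        # drop Saturdays and Sundays
x <- data.frame(date = d, value = seq_along(d))

# Quarter key such as "2020 Q1"
qtr <- paste(format(x$date, "%Y"), quarters(x$date))
# Running position of each observation within its quarter
pos <- ave(seq_along(qtr), qtr, FUN = seq_along)
# Size of the shortest quarter
n <- min(table(qtr))
# Keep only the first n observations of every quarter
balanced <- x[pos <= n, ]
table(qtr[pos <= n])
```

Every quarter ends up with n rows, because rows whose within-quarter position exceeds n are dropped.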

Longest consecutive period above threshold using rle and for loop

I have four years of streamflow data for one month and I'm trying to figure out how to extract the longest consecutive period at or above a certain threshold for each of the four years. In the example below, the threshold is 4. I want to try to accomplish this using a for loop or possibly one of the apply functions, but I'm not sure how to go about it.
Here's my example dataframe:
year <- c(rep(2009,31), rep(2010, 31), rep(2011, 31), rep(2012, 31))
day<-c(rep(seq(1:31),4))
discharge <- c(4,4,4,5,6,5,4,8,4,5,3,8,8,8,8,8,8,8,1,2,2,8,8,8,8,8,8,8,8,8,4,4,4,5,6,3,1,1,3,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,10,3,3,3,3,3,3,1,1,3,8,8,8,8,8,8,8,8,8,1,2,2,8,8,3,8,8,8,8,8,8,4,4,4,5,6,3,1,1,3,3,3,3,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,3)
df<-data.frame(cbind(year, day, discharge))
df$threshold<-ifelse(discharge>=4,1,0)
In this example, the threshold column is coded as 1 if the discharge is at or above the threshold and 0 if not. I'm able to partially get my desired output for one year (2009 in the example below), with the following code:
rl2009<-with(subset(df,year==2009),rle(threshold))
cs2009 <- cumsum(rl2009$lengths)
index2009<-cbind(cs2009[rl2009$values == 1] - rl2009$lengths[rl2009$values == 1] + 1,
cs2009[rl2009$values == 1])
df2009<-data.frame(index2009)
df2009 #output all periods when flow is above threshold
df2009$X3<-df2009$X2-df2009$X1+1
max2009<-df2009[which.max(df2009$X3),]
max2009 #output the first and longest period when flow is above threshold
For 2009, there are three time periods when the discharge equals or exceeds 4, but the period from day 1 to day 10 is chosen because it is the first of the longest periods above the threshold. X1 represents the start of the time period, X2 the end of the time period, and X3 the number of days in the time period. If there is more than one period with the same number of days, I want to select the first of such periods.
My desired output for all four years is below:
year X1 X2 X3
2009 1 10 10
2010 9 31 23
2011 10 18 9
2012 12 30 19
The actual data includes many more years and many streams, so it's not feasible to do this for each year individually. If anyone has any thoughts on how to achieve this, it'd be greatly appreciated. Thanks.
Simply generalize your process into a function such as threshold_find, and pass it the data frame subset for each year; by handles the splitting.
As the object-oriented wrapper to tapply, by slices a data frame by one or more factors (here, year) and returns a list of whatever the function outputs, in this case the one-row data frame for the longest period. At the end, do.call() row-binds all the data frames in that list into one data frame.
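threshold_find <- function(df) {
  # run-length encode the 0/1 threshold flags for this subset
  rl <- with(df, rle(threshold))
  # end position of every run; start = end - length + 1
  cs <- cumsum(rl$lengths)
  index <- cbind(cs[rl$values == 1] - rl$lengths[rl$values == 1] + 1,
                 cs[rl$values == 1])
  runs <- data.frame(index)
  runs$X3 <- runs$X2 - runs$X1 + 1
  # which.max() returns the first maximum, so ties go to the earliest run
  runs[which.max(runs$X3), ]
}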
finaldf <- do.call(rbind, by(df, df$year, FUN=threshold_find))
finaldf
# X1 X2 X3
# 2009 1 10 10
# 2010 9 31 23
# 2011 10 18 9
# 2012 12 30 19
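To see what rle and cumsum contribute, here is a small sketch on a made-up vector of 0/1 flags (the vector and the column names start, end and len are illustrative, not from the question's data).

```r
# Made-up threshold flags: runs of 1s at positions 1-2, 4-6 and 8-10
x <- c(1, 1, 0, 1, 1, 1, 0, 1, 1, 1)

r <- rle(x)                       # run values and run lengths
ends   <- cumsum(r$lengths)       # last position of each run
starts <- ends - r$lengths + 1    # first position of each run

# Keep only the runs where the flag is 1
runs <- data.frame(start = starts, end = ends, len = r$lengths)[r$values == 1, ]

# which.max() returns the first maximum, so the earlier of two tied runs wins
best <- runs[which.max(runs$len), ]
best
```

Here the runs at positions 4-6 and 8-10 both have length 3, and the earlier one is selected, which matches the tie-breaking rule asked for in the question.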

Adding quarters to R date

I have a R time series data, where I am calculating the means for all values up to a particular date, and storing this means in the date + 4 quarters. The dates are all month ends. To achieve this, I am looking to increment 4 quarters to a date. My question is how can I add 4 quarters to an R date data-type. An illustration:
a <- as.Date("2006-01-01")
b <- as.Date("2011-01-01")
date_range <- quarter(seq.Date(a, b, by = "quarter"), with_year = TRUE)
> date_range[1] + 1
[1] 2007.1
> date_range[1] + quarter(1)
[1] 2007.1
> date_range[1] + 0.25
[1] 2006.35
One possible way I am thinking is to get year-quarter dates, and then adding 4 to it. But wasn't sure what is the best way to do this?
The problem is that quarters have different lengths. Q1 is shortest because it includes February (though it ties with Q2 in leap years). Things like this make "adding a quarter to a date" poorly defined. Even adding months to a date can be tricky at the ends of months - what is 1 month after January 31?
Beginnings of months are more straightforward, and I would recommend you use the 1st day of quarters rather than the last (if you must use a specific date). lubridate provides functions like floor_date() and ceiling_date() to which you can pass unit = "quarter" and they will return the first day of the current or subsequent quarter, respectively. You can also always add months(3) to a day at the beginning of a month, though of course if your intention is to add 4 quarters you may as well just add 1 year.
Just add 12 months or a year instead?
Or if it must be quarters, define yourself a function, like so:
quarters <- function(x) {
months(3*x)
}
and then use it to add to the date sequence:
date_range <- seq.Date(a, b, by = "quarter")
date_range + quarters(4)
Lubridate has a function for quarters already included. This is a much better solution than creating your own function.
https://www.rdocumentation.org/packages/lubridate/versions/1.7.4/topics/quarter
Old answer, but for those arriving here: lubridate has an operator, %m+%, that adds months and preserves month ends.
a <- as.Date("2006-01-01")
Add future months worth of dates:
The original poster wanted 4 quarters in future so that will be 12 months.
future_date <- a %m+% months(12)
future_date
[1] "2007-01-01"
You could also do years as the period:
future_date <- a %m+% years(1)
Remove months from date:
Subtract dates with %m-%
If you wanted a date 3 months ago from 1/1/2006:
past_date <- a %m-% months(3)
past_date
[1] "2005-10-01"
Example with dates not at end of months:
mplus will preserve days in month:
as.Date("2022-10-10") %m-% months(3)
[1] "2022-07-10"
For more, see documentation on "Add and subtract months to a date without exceeding the last day of the new month"
Note that other answers that use Date class will give irregularly spaced series and so are unsuitable for time series analysis.
To do this in such a way that time series analyses can be performed and noting the zoo tag on the question, the yearmon class represents year/month as year + fraction where fraction is 0 for Jan, 1/12 for Feb, 2/12 for Mar, ..., 11/12 for Dec. Thus adding 4 quarters is just a matter of adding 1. (Adding x quarters is done by adding x/4.)
library(zoo)
ym <- yearmon(2006) + 0:11/12 # months in 2006
ym + 1 # one year later
The first line below converts yearmon objects to end-of-month Dates, and the second line converts Dates back to yearmon. Using frac = 0, or omitting frac, in the first line would give beginning-of-month dates instead.
d <- as.Date(ym, frac = 1) # d is Date vector of end-of-months
as.yearmon(d) # convert Date vector to yearmon
If your input dates represent quarters then there is also the yearqtr class which represents a year/quarter as year + fraction where fraction is 0, 1/4, 2/4, 3/4 for the 4 quarters of a year. Adding 4 quarters is done by adding 1 (or to add x quarters add x/4).
yq <- as.yearqtr(2006) + 0:3/4 # all quarters in 2006
yq + 1 # one year later
Conversions work similarly to yearmon:
d <- as.Date(yq, frac = 1) # d is Date vector of end-of-quarters
as.yearqtr(d) # convert Date vector to yearqtr
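For completeness, a base R sketch of shifting month-end dates by whole quarters, without lubridate or zoo. The helper name add_quarters_eom is hypothetical, and it assumes every input is the last day of its month (as in the question): it steps from the first of the month, where month arithmetic is unambiguous, and then backs up one day.

```r
# Hypothetical helper: move month-end dates forward (or back) by n quarters
add_quarters_eom <- function(d, n) {
  first_of_month <- as.Date(format(d, "%Y-%m-01"))
  shifted <- lapply(first_of_month, function(f) {
    # First day of the month AFTER the target month, minus one day
    seq(f, by = paste(3 * n + 1, "month"), length.out = 2)[2] - 1
  })
  do.call(c, shifted)
}

month_ends <- as.Date(c("2006-01-31", "2007-02-28"))
plus_four  <- add_quarters_eom(month_ends, 4)
plus_four
```

The second input shows why the detour through the first of the month matters: four quarters after 28 February 2007 is 29 February 2008, the end of a leap-year February.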

Best way to input daily data into R to allow further manipulation

I have daily rainfall data in Excel (which I can save as a CSV or txt file) that I would like to manipulate and load into R. I'm very new to R.
The format of the data is such that I have the following columns
Year; Month; Rain on day 1 of Month, Rain on Day 2, ... , Rain on day 31;
This means that I have a large array/table. Some data is missing because it wasn't recorded, and some because February 31st, June 31st, etc do not exist.
I would like to analyse things like monthly totals, and their distributions.
What is the best way to input data so it can be easily manipulated, and that I can distinguish between missing data and NULL data (31st Feb)?
Thanks a lot in advance
Several things for you to have a look at. E.g. readxl::read_excel() for reading excel files or Hmisc::monthDays(dates) for determining the number of days for each month in a dates vector.
Anyway, here's one idea as a starter:
# create sample data
set.seed(1)
mat <- matrix(rbinom(5*31, 31, .5), nrow=5)
mat[sample(1:length(mat), 10)] <- NA
df <- data.frame(year=2016, month=1:5, mat)
# reshape data from wide to long format
library(reshape2)
dflong <- melt(df, id.vars = 1:2, variable.name = "day")
# add date column (will be NA if conversion is not possible, i.e. if the date does not exist)
dflong$date <- as.Date(with(dflong, paste(year, month, day, sep="-")), format = "%Y-%m-X%e")
# Select only existing dates
dflong <- subset(dflong[order(dflong$month), ], !is.na(date))
# Aggregate: means per month and year (missing values removed)
aggregate(value~year+month, dflong, mean, na.rm=TRUE)
# year month value
# 1 2016 1 15.93548
# 2 2016 2 15.26923
# 3 2016 3 15.10345
# 4 2016 4 15.74074
# 5 2016 5 16.16667
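Since the question also asks for monthly totals and for telling unrecorded days apart from impossible dates, here is a rough base R sketch of that distinction. The data are made up, and the column names d1 to d31 and rain are assumptions for illustration: impossible dates are removed by date validation, while unrecorded days stay as NA and can be counted.

```r
# Made-up wide table: one row per month, columns d1..d31 (hypothetical names)
set.seed(1)
wide <- data.frame(year = 2016, month = 1:3, matrix(rpois(3 * 31, 2), nrow = 3))
names(wide)[-(1:2)] <- paste0("d", 1:31)
wide$d5[1] <- NA                          # one genuinely unrecorded day

# Wide to long: one row per year/month/day
long <- reshape(wide, direction = "long", varying = paste0("d", 1:31),
                v.names = "rain", timevar = "day", times = 1:31,
                idvar = c("year", "month"))

# Impossible dates (e.g. 30 February) fail to parse and become NA
long$date <- as.Date(sprintf("%d-%02d-%02d", long$year, long$month, long$day),
                     format = "%Y-%m-%d")
long <- long[!is.na(long$date), ]

# Monthly totals over recorded days, and number of unrecorded days per month
totals  <- aggregate(rain ~ year + month, long, sum)
missing <- aggregate(rain ~ year + month, long, function(v) sum(is.na(v)),
                     na.action = na.pass)
```

Reporting the count of unrecorded days next to each total makes it easier to judge how far a monthly total can be trusted.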

How do I subset every day except the last five days of zoo data?

I am trying to extract all dates except for the last five days from a zoo dataset into a single object.
This question is somewhat related to How do I subset the last week for every month of a zoo object in R?
You can reproduce the dataset with this code:
library(zoo)
set.seed(123)
price <- rnorm(365)
dates <- seq(as.Date("2013-01-01"), by = "day", length.out = 365)
zoodata <- zoo(price, dates)
For my output, I'm hoping to get a combined dataset of everything except the last five days of each month. For example, if there are 20 days in the first month's data and 19 days in the second month's, I only want to subset the first 15 and 14 days of data respectively.
I tried using the head() function and the first() function to extract the first three weeks, but since each month will have a different amount of days according to month or leap year months, it's not ideal.
Thank you.
Here are a few approaches:
1) as.Date Let tt be the dates. Then we compute a Date vector the same length as tt which has the corresponding last date of the month. We then pick out those dates which are at least 5 days away from that:
tt <- time(zoodata)
last.date.of.month <- as.Date(as.yearmon(tt), frac = 1)
zoodata[ last.date.of.month - tt >= 5 ]
2) tapply/head For each month tapply head(x, -5) to the data and then concatenate the reduced months back together:
do.call("c", tapply(zoodata, as.yearmon(time(zoodata)), head, -5))
3) ave Define revseq which given a vector or zoo object returns sequence numbers in reverse order so that the last element corresponds to 1. Then use ave to create a vector ix the same length as zoodata which assigns such reverse sequence numbers to the days of each month. Thus the ix value for the last day of the month will be 1, for the second last day 2, etc. Finally subset zoodata to those elements corresponding to sequence numbers greater than 5:
revseq <- function(x) rev(seq_along(x))
ix <- ave(seq_along(zoodata), as.yearmon(time(zoodata)), FUN = revseq)
z <- zoodata[ ix > 5 ]
ADDED Solutions (1) and (2).
Exactly the same way as in the answer to your other question:
Split dataset by month, remove last 5 days, just add a "-":
library(xts)
xts.data <- as.xts(zoodata)
lapply(split(xts.data, "months"), last, "-5 days")
And the same way, if you want it on one single object:
do.call(rbind, lapply(split(xts.data, "months"), last, "-5 days"))
