R: generate one random date per month within a defined interval

I'd like to generate a list of random dates between a defined interval using R such that there is only one date for each month present in the interval.
I've tried using a variation of the code from another solution, but I can't seem to limit it to one date per month. I get multiple dates for a given month.
Here's my attempt
df = data.frame(Date=c(sample(seq(as.Date('2020/01/01'), as.Date('2020/09/01'), by="day"), 9)))
But I seem to get more than one date for a given month. Any inputs would be highly appreciated.

First, I create a table containing all the dates you might want to sample, and I store each date's month number in a column of this table using the month() function from the lubridate package.
library(lubridate)
dates <- data.frame(
days = seq(as.Date('2020/01/01'), as.Date('2020/09/01'), by="day")
)
dates$month <- month(dates$days)
Then the idea is to loop over the months with lapply(). On each iteration, I select from the dates table only the dates of that month and pass them to the sample() function.
results <- lapply(1:9, function(x){
sample_dates <- dates$days[dates$month == x]
return(sample(sample_dates, size = 1))
})
df <- data.frame(
dates = as.Date(unlist(results), origin = "1970-01-01")
)
This results in:
dates
1 2020-01-19
2 2020-02-06
3 2020-03-26
4 2020-04-13
5 2020-05-16
6 2020-06-29
7 2020-07-06
8 2020-08-21
9 2020-09-01
In other words, the idea of this approach is to give sample() only the dates of one month on each iteration, so it picks exactly one date for that specific month.
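The same idea can be written without lubridate. The following is a minimal base-R sketch, not the answer's own code: it groups the candidate days by a year-month key (the names `days`, `groups` and `picked` are illustrative) and draws one day from each group, so it should also keep working if the interval spans more than one year.

```r
# All candidate days in the interval
days <- seq(as.Date("2020-01-01"), as.Date("2020-09-01"), by = "day")

# Group the days by a "YYYY-MM" key, one group per month present in the interval
groups <- split(days, format(days, "%Y-%m"))

# Draw one day per group by sampling an index into the group
picked <- lapply(groups, function(d) d[sample.int(length(d), 1)])

# Combine the single-date results back into one Date column
df <- data.frame(dates = do.call(c, unname(picked)))
```

Because the interval ends on 2020-09-01, the September group contains a single day, so that date is always picked for September.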

How about this:
First you create a function that returns a random day from month 'month'
Then you lapply for all months you need, 1 to 9
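x <- function(month){
  first <- as.Date(paste0('2020/', month, '/01'))
  # last day of the month: one day before the first of the next month (works for months 1 to 11)
  last <- as.Date(paste0('2020/', month + 1, '/01')) - 1
  sample(seq(first, last, by = "day"), 1)
}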
df <- data.frame(
dates = as.Date(unlist(lapply(1:9,x)), origin = "1970-01-01")
)
If you also want the dates in random order (not January, February, March...), you only need to add a sample():
df <- data.frame(
dates = as.Date(unlist(sample(lapply(1:9,x))), origin = "1970-01-01")
)
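One caveat: building the end of the month from `month + 1` breaks for December. A hedged variant, using a hypothetical helper named `random_day_in_month()` that is not part of the answer, lets seq() find the first day of the following month instead:

```r
# Hypothetical helper: one random day from the given month of the given year
random_day_in_month <- function(year, month) {
  first <- as.Date(sprintf("%d-%02d-01", year, month))
  # seq() by "month" rolls over into the next year when month is 12
  next_first <- seq(first, by = "month", length.out = 2)[2]
  n_days <- as.integer(next_first - first)   # number of days in the month
  first + sample.int(n_days, 1) - 1          # offsets 0 .. n_days - 1
}

# One date for each month of 2020
picks <- do.call(c, lapply(1:12, function(m) random_day_in_month(2020, m)))
```

Subtracting 1 from the sampled offset keeps the result inside the month, so the first day can be drawn and the first day of the next month cannot.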

Related

R: Subset/extract rows of a data frame in steps of 12

I have a data frame with data for each month of a 26 years period (1993 - 2019), which makes 312 rows in total.
Unfortunately, I had to lag the data, so each year goes now from July t to June t+1. So I can't just extract the year from the date.
Now, I want to extract the 12-month data for each year into a separate data frame. My first idea is to insert the year in the first column and use the lapply function to filter afterward.
For this, I created the following loop:
n <- 1
m <- 1993
for (a in 1:26) {
for (i in n:(n+11)) {
t.monthly.ret.lag[i,1] <- m
}
n <- n+1
m <- m+1
}
Unfortunately, R isn't naming the year in steps of 12. Instead, it is counting directly in steps of 1.
Does anyone know how to solve this or maybe know a better way of doing it?
y.first <- 1993
y.last <- 2019
month.col <- rep(c(7:12, 1:6), y.last-y.first+1)
year.col <- rep(c(y.first:y.last), each=length(month.name))
df <- data.frame(year=year.col, month=month.col)
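This yields a data frame with month and year tagged for each row, which lets you use dplyr::group_by() and so on. Note that 1993 to 2019 inclusive is 27 years (324 rows); set y.last to 2018 if your 312 rows cover 26 July-to-June years.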
You could just create a 312 element long vector giving the year (and one giving the month) using rep() and seq(). Then you can attach them as additional columns to your data.frame or just use them as reference for month and year.
month = rep(seq(1:12),27)
year = c(matrix(rep(seq(1:27),12),ncol=27,byrow=T)+1992)
month = month[7:(length(month)-6)]
year = year[7:(length(year)-6)]
The month vector counts from 1 to 12, beginning at 7 (July); the year vector repeats each year 12 times (the first and last only 6 times).
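For the layout described in the question (312 rows, each year running from July to June), a minimal sketch might look as follows. The `ret` column is a placeholder for the real data, and labelling each 12-row block by the calendar year of its July is an assumption about what the asker wants. The question's loop fails because `n` advances by 1 instead of 12 on each pass; `rep(..., each = 12)` avoids the loop entirely.

```r
n_years <- 26   # 26 July-to-June years = 312 rows

# Label each block of 12 rows with the year in which it starts
fiscal_year <- rep(seq(1993, length.out = n_years), each = 12)
month <- rep(c(7:12, 1:6), times = n_years)

# Placeholder data frame standing in for t.monthly.ret.lag
t.monthly.ret.lag <- data.frame(fiscal_year, month, ret = NA_real_)

# One data frame per July-to-June year
by_year <- split(t.monthly.ret.lag, t.monthly.ret.lag$fiscal_year)
```

`by_year[["1993"]]` then holds the 12 rows from July 1993 to June 1994.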

Replacement of missing day and month in dates using R

This question is about how to replace missing days and months in a data frame using R. Considering the data frame below, 99 denotes missing day or month and NA represents dates that are completely unknown.
df<-data.frame("id"=c(1,2,3,4,5),
"date" = c("99/10/2014","99/99/2011","23/02/2016","NA",
"99/04/2009"))
I am trying to replace the missing days and months based on the following criteria:
For dates with missing day but known month and year, the replacement date would be a random selection from the middle of the interval (first day to the last day of that month). Example, for id 1, the replacement date would be sampled from the middle of 01/10/2014 to 31/10/2014. For id 5, this would be the middle of 01/04/2009 to 30/04/2009. Of note is the varying number of days for different months, e.g. 31 days for October and 30 days for April.
As in the case of id 2, where both day and month are missing, the replacement date is a random selection from the middle of the interval (first day to last day of the year), e.g 01/01/2011 to 31/12/2011.
Please note: complete dates (e.g. the case of id 3) and NAs are not to be replaced.
I have tried by making use of the seq function together with the as.POSIXct and as.Date functions to obtain the sequence of dates from which the replacement dates are to be sampled. The difficulty I am experiencing is how to automate the R code to obtain the date intervals (it varies across distinct id) and how to make a random draw from the middle of the intervals.
The expected output would have the date of id 1, 2 and 5 replaced but those of id 3 and 4 remain unchanged. Any help on this is greatly appreciated.
This isn't the prettiest, but it seems to work and adapts to differing month and year lengths:
set.seed(999)
df$dateorig <- df$date
seld <- grepl("^99/", df$date)
selm <- grepl("^../99", df$date)
md <- seld & (!selm)
mm <- seld & selm
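# replace only the day/month placeholders, so that a year such as 1999 is left intact
df$date <- as.Date(gsub("99/", "01/", as.character(df$date), fixed=TRUE), format="%d/%m/%Y")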
monrng <- sapply(df$date[md], function(x) seq(x, length.out=2, by="month")[2]) - as.numeric(df$date[md])
# sample() returns 1..n, so subtract 1 to keep the result within the month
df$date[md] <- df$date[md] + sapply(monrng, sample, 1) - 1
yrrng <- sapply(df$date[mm], function(x) seq(x, length.out=2, by="12 months")[2]) - as.numeric(df$date[mm])
# same adjustment for the year range
df$date[mm] <- df$date[mm] + sapply(yrrng, sample, 1) - 1
#df
# id date dateorig
#1 1 2014-10-13 99/10/2014
#2 2 2011-02-04 99/99/2011
#3 3 2016-02-23 23/02/2016
#4 4 <NA> NA
#5 5 2009-04-18 99/04/2009
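The same rules can also be written as a small function applied to one date string at a time. This is only a sketch under the question's stated conventions (day/month/year strings, 99 as the placeholder); `impute_date()` is a hypothetical name, and uniform sampling over the whole interval is an assumption about what "random selection" should mean here.

```r
# Hypothetical helper: returns a Date for one "dd/mm/yyyy" string
impute_date <- function(s) {
  parts <- strsplit(s, "/", fixed = TRUE)[[1]]
  if (length(parts) != 3) return(as.Date(NA))      # NA or malformed input
  day <- parts[1]; mon <- parts[2]; yr <- parts[3]
  if (mon == "99") {
    # day and month unknown: sample from the whole year
    start <- as.Date(paste(yr, "01", "01", sep = "-"))
    end   <- as.Date(paste(yr, "12", "31", sep = "-"))
  } else if (day == "99") {
    # day unknown: sample from the whole month, whatever its length
    start <- as.Date(paste(yr, mon, "01", sep = "-"))
    end   <- seq(start, by = "month", length.out = 2)[2] - 1
  } else {
    return(as.Date(s, format = "%d/%m/%Y"))        # complete date, unchanged
  }
  n_days <- as.integer(end - start) + 1L
  start + sample.int(n_days, 1) - 1                # offsets 0 .. n_days - 1
}

dates <- c("99/10/2014", "99/99/2011", "23/02/2016", NA, "99/04/2009")
out <- do.call(c, lapply(dates, impute_date))
```

Complete dates and NA pass through unchanged, and every imputed date stays inside its month or year.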

monthlyReturn and unequal month length

I have 300+ companies and need to calculate monthly return for them and later use it as one of the variables in my data set.
I download prices from Yahoo and calculated monthly return using quantmod package:
require(quantmod)
stockData <- lapply(symbols,function(x) getSymbols(x,auto.assign=FALSE, src='yahoo', from = '2000-01-01'))
stockDataReturn <- lapply(stockData,function(x) monthlyReturn(Ad(x)))
The problem I have is that some companies have different month ends (due to trading halts, etc.) which is reflected in the output list: 2013-12-30 for company AAA and 2013-12-31 for company BBB and the rest of the sample.
When I merge the list using
returns <- do.call(merge.xts, stockDataReturn)
It creates a separate row for 2013-12-30 with all NAs except for AAA company.
How can I resolve this? My understanding is that I need to stick to a month-year format and use it as the index before I merge.
Ideally, what I want is that at the monthlyReturn stage, it uses the beginning of the month date rather than end of the month.
You could use lubridate's floor_date() to merge on the same beginning-of-month timestamp rather than the end-of-month timestamp. Alternatively, use ceiling_date() to align all securities on the same end-of-month date before merging; ceiling_date() rounds a Date up to the first day of the next month, so subtract one day to get the month end.
library(lubridate)
stockDataReturn <- lapply(stockDataReturn,
  function(x) {
    index(x) <- floor_date(index(x), "month")
    # Or, to align on the last calendar day of the month instead:
    # index(x) <- ceiling_date(index(x), "month") - 1
    x
  })
returns <- do.call(merge, stockDataReturn)
colnames(returns) <- symbols
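To see why aligning the index removes the extra row, here is a small base-R sketch that uses plain data frames instead of xts objects. The two companies and their return values are made up for illustration, and `to_month_start()` is a hypothetical helper that plays the role of floor_date().

```r
# Made-up monthly returns whose December month ends differ by one day
aaa <- data.frame(date = as.Date(c("2013-11-29", "2013-12-30")), AAA = c(0.01, 0.02))
bbb <- data.frame(date = as.Date(c("2013-11-29", "2013-12-31")), BBB = c(0.03, -0.01))

# Hypothetical helper: map every date to the first day of its month
to_month_start <- function(d) as.Date(format(d, "%Y-%m-01"))

aaa$date <- to_month_start(aaa$date)
bbb$date <- to_month_start(bbb$date)

# With a common monthly key, the outer merge no longer creates NA-padded rows
returns <- merge(aaa, bbb, by = "date", all = TRUE)
```

Without the alignment step, the outer merge would produce three rows, with NA for BBB on 2013-12-30 and NA for AAA on 2013-12-31.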

Best way to input daily data into R to allow further manipulation

I have daily rainfall data in Excel (which I can save as a CSV or txt file) that I would like to manipulate and load into R. I'm very new to R.
The format of the data is such that I have the following columns:
Year; Month; Rain on day 1 of Month, Rain on Day 2, ... , Rain on day 31;
This means that I have a large array/table. Some data is missing because it wasn't recorded, and some because February 31st, June 31st, etc do not exist.
I would like to analyse things like monthly totals, and their distributions.
What is the best way to input data so it can be easily manipulated, and that I can distinguish between missing data and NULL data (31st Feb)?
Thanks a lot in advance
Several things for you to have a look at. E.g. readxl::read_excel() for reading excel files or Hmisc::monthDays(dates) for determining the number of days for each month in a dates vector.
Anyway, here's one idea as a starter:
# create sample data
set.seed(1)
mat <- matrix(rbinom(5*31, 31, .5), nrow=5)
mat[sample(1:length(mat), 10)] <- NA
df <- data.frame(year=2016, month=1:5, mat)
# reshape data from wide to long format
library(reshape2)
dflong <- melt(df, id.vars = 1:2, variable.name = "day")
# add date column (will be NA if conversion is not possible, i.e. if the date does not exist)
dflong$date <- as.Date(with(dflong, paste(year, month, day, sep="-")), format = "%Y-%m-X%e")
# Select only existing dates
dflong <- subset(dflong[order(dflong$month), ], !is.na(date))
# Aggregate: means per month and year (missing values removed)
aggregate(value~year+month, dflong, mean, na.rm=TRUE)
# year month value
# 1 2016 1 15.93548
# 2 2016 2 15.26923
# 3 2016 3 15.10345
# 4 2016 4 15.74074
# 5 2016 5 16.16667
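To keep unrecorded days distinguishable from days that do not exist, one option is to drop the impossible dates first and only then count the remaining NA values. The following base-R sketch assumes the wide layout from the question with day columns named X1 to X31; the rainfall numbers are simulated and the variable names are illustrative.

```r
set.seed(1)
# Simulated wide data: one row per month, one column per day of the month
rain <- data.frame(year = 2016, month = 1:3,
                   matrix(round(runif(3 * 31, 0, 20), 1), nrow = 3))
rain[1, "X5"] <- NA                     # a day that was genuinely not recorded

# Wide to long: unlist() stacks the day columns one after another
day_cols <- paste0("X", 1:31)
long <- data.frame(
  year  = rep(rain$year, times = 31),
  month = rep(rain$month, times = 31),
  day   = rep(1:31, each = nrow(rain)),
  rain  = unlist(rain[day_cols], use.names = FALSE)
)

# Impossible dates (30 and 31 February here) become NA and are dropped
long$date <- as.Date(paste(long$year, long$month, long$day, sep = "-"),
                     format = "%Y-%m-%d")
long <- long[!is.na(long$date), ]

# Any NA left in rain is a real calendar day without a reading
monthly_total <- tapply(long$rain, long$month, sum, na.rm = TRUE)
n_unrecorded  <- tapply(is.na(long$rain), long$month, sum)
```

After the filter, February 2016 keeps its 29 real days, and `n_unrecorded` counts only the genuinely missing readings per month.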

R repeat rows by vector and date

I have a data frame with 275 different stations, 43 years of seasonal data (October to the following April; May to September are not needed) and 6 variables. Here is a small example of the data frame with only one variable, called value:
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
What I need is to fill the gaps between observation days with daily dates (e.g. days 1 to 8), giving each new row the observed value divided by the number of days in the gap. The result would look like:
data1 <- data.frame(station=rep(1,40), year=rep(1969,40), month=c(rep(10,31),rep(11,9)),day=c(1:31,1:9),value=rep(c(1/7,2/8,3/8,4/8,5/8,6/8),c(7,8,8,8,8,1)))
I wrote some poor code and searched around the site, but unfortunately didn't work out, please help or better ideas would be appreciated.
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
for (i in 1:length(station.date)){
days <- as.numeric(station.date[i+1]-station.date[i]) #not working
data <- within(data, days <- c(days,1))
}
rows <- rep(1:nrow(data), times=data[ ,data$days])
rows <- ifelse(rows > 10, 0, rows) #get rid of month May to Sept
data <- data[rows, ]
data <- within(data, value1 <- value/days)
data <- within(data, dd <- ?) #don't know to change the repeated days to real days
I wrote some code that reproduces your example, but you will probably have to modify it to handle the whole data set. I wasn't sure what to do with the last observation, so I made a special case for it. If it should be divided by a different number, just replace the 8 in values <- c(values, tail(data$value, 1) / 8) with that number.
Moreover, if you have all 275 stations in one data.frame, I think the best idea would be to split it, transform each piece separately and then rbind the results.
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
d <- as.numeric(diff(station.date))
range <- sum(d) + 1
# create dates
dates <- seq(station.date[1], by = "day", length = range)
# create values
values <- unlist(sapply(1:length(d), function(i){
rep(data$value[i] / d[i] , d[i])
}))
# adding last observation
values <- c(values, tail(data$value, 1) / 8)
# create new data frame
data2 <- data.frame(station = rep(1, range),
year = as.numeric(format(dates, "%Y")),
month = as.numeric(format(dates, "%m")),
day = as.numeric(format(dates, "%d")),
value = values)
It could probably be optimised in some way, however I hope it helps too. Note how I extract year, month and day from dates.
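The split, transform and recombine step for several stations could be sketched like this. `expand_station()` and its `last_divisor` argument are hypothetical names, and the sketch assumes each piece passed to it is one continuous run of observations; with seasonal data you would also need to split by season so that the May to September gap is not filled.

```r
# Hypothetical helper: expand one station's observations to daily rows
expand_station <- function(st, last_divisor = 8) {
  obs_date <- as.Date(with(st, paste(year, month, day, sep = "-")))
  gap <- as.numeric(diff(obs_date))     # days from each observation to the next
  dates <- seq(obs_date[1], by = "day", length.out = sum(gap) + 1)
  data.frame(
    station = st$station[1],
    year    = as.numeric(format(dates, "%Y")),
    month   = as.numeric(format(dates, "%m")),
    day     = as.numeric(format(dates, "%d")),
    # each value is spread over its gap; the last observation yields one row
    value   = rep(st$value / c(gap, last_divisor), times = c(gap, 1))
  )
}

# Two stations with the same example observations
data <- data.frame(station = rep(1:2, each = 6), year = 1969,
                   month = rep(c(10, 10, 10, 10, 11, 11), 2),
                   day = rep(c(1, 8, 16, 24, 1, 9), 2),
                   value = rep(1:6, 2))

# Split by station, expand each piece, and stack the results
data2 <- do.call(rbind, lapply(split(data, data$station), expand_station))
```

Each station expands to 40 daily rows, matching the `data1` layout from the question.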
