I am performing creel surveys and am attempting to construct a random date generator that weights the weekends higher than the weekdays. So far I have a simplistic random date generator that does not take into account the day type. We expect more angling pressure on the weekends (as that is when more people have time to fish) but do not have a way to select random days without including bias. I would like to select 15 days within a given month.
I've already written a simple random date generator:
dates <- data.frame(seq.Date(as.Date(day.start),as.Date(day.end),by="day"))
dates
sample(dates$seq.Date.as.Date.day.start...as.Date.day.end...by....day.., size = 15, replace = FALSE)
[1] "2019-11-10" "2019-11-06" "2019-11-04" "2019-11-27" "2019-11-30" "2019-11-15"
[7] "2019-11-18" "2019-11-21" "2019-11-13" "2019-11-01" "2019-11-19" "2019-11-25"
[13] "2019-11-07" "2019-11-02" "2019-11-23"
Ideally I would have an end product that allows me to input the month start and end and outputs 15 random days.
Explanation in comments in code below:
# Generate initial data; as in question
day_start <- as.Date("2010-10-01")
day_end <- as.Date("2010-10-31")
dates <- data.frame(date = seq.Date(day_start, day_end,by="day"))
# Determine inclusion probabilities for each date; give weekend a higher
# probability.
dates$day <- as.numeric(format(dates$date, "%u"))
dates$psamp <- ifelse(dates$day >= 6, 0.2, 0.1)
# Make sure probabilities add up to the required sample size
samplesize <- 15
dates$psamp <- dates$psamp * samplesize/sum(dates$psamp)
# Do not use sample() for sampling without replacement with unequal probabilities!
# The sampling package has a large number of routines for sampling without
# replacement and with unequal probabilities. The following gives a fixed-size
# sample (of size sum(dates$psamp))
library(sampling)
dates$selected <- UPrandomsystematic(dates$psamp)
As for the reason why I don't use sample see, for example, https://stat.ethz.ch/pipermail/r-help/2008-February/153601.html.
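As a quick sanity check on the scaling step above, here is a minimal base-R sketch (no sampling package needed) that recomputes the inclusion probabilities and confirms they sum to the sample size and that none exceeds 1. The column name `weekend` and the variable `raw` are my own, chosen only for illustration.

```r
# Illustrative month; any start and end dates would do
dates <- data.frame(date = seq(as.Date("2010-10-01"), as.Date("2010-10-31"), by = "day"))

# %u gives the ISO weekday as a number (1 = Monday, ..., 7 = Sunday),
# so it does not depend on the locale's day names
dates$weekend <- as.integer(format(dates$date, "%u")) >= 6

samplesize <- 15
raw <- ifelse(dates$weekend, 0.2, 0.1)       # weekends twice as likely
dates$psamp <- raw * samplesize / sum(raw)   # rescale so the sum is the sample size

# Inclusion probabilities must sum to the sample size and stay <= 1
stopifnot(isTRUE(all.equal(sum(dates$psamp), samplesize)), all(dates$psamp <= 1))

# One inclusion probability per day type
tapply(dates$psamp, dates$weekend, unique)
```

If the weekend weight or the sample size is pushed high enough, some scaled probabilities can exceed 1, which is why the second check is worth keeping.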
Here's a somewhat general function that does what you want. It takes the start day, end day, and the weight (relative to 1) that you want to put on weekends as its own arguments, and passes on other additional arguments (size, replace, etc.) to sample. No dependencies other than base R.
However, if sampling without replacement, you may want to use the sampling package as recommended in Jan van der Laan's answer.
rday = function(
start_day = as.Date("2019-01-01"),
end_day = as.Date("2019-01-31"),
weekend_weight = 2,
...
) {
if (! "Date" %in% class(start_day)) start_day = as.Date(start_day)
if (! "Date" %in% class(end_day)) end_day = as.Date(end_day)
dates = seq(start_day, end_day, by = "1 day")
weights = rep(1, length(dates))
# Note: weekdays() returns locale-dependent names; this assumes an English locale
weights[weekdays(dates) %in% c("Saturday", "Sunday")] = weekend_weight
sample(dates, ..., prob = weights)
}
rday(size = 15)
# [1] "2019-01-24" "2019-01-07" "2019-01-21" "2019-01-15" "2019-01-27" "2019-01-04" "2019-01-30" "2019-01-12"
# [9] "2019-01-11" "2019-01-08" "2019-01-20" "2019-01-01" "2019-01-03" "2019-01-19" "2019-01-29"
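Because `weekdays()` returns day names in the current locale, matching on "Saturday" and "Sunday" can silently fail on a non-English system. Below is a small sketch of the same weighting idea that uses the numeric weekday from `as.POSIXlt()` instead. The weight of 2 and the seed are arbitrary choices for illustration, not part of the original answer.

```r
dates <- seq(as.Date("2019-01-01"), as.Date("2019-01-31"), by = "1 day")

# $wday is 0 for Sunday and 6 for Saturday, regardless of locale
is_weekend <- as.POSIXlt(dates)$wday %in% c(0L, 6L)
weights <- ifelse(is_weekend, 2, 1)

set.seed(42)  # arbitrary seed, only so the draw is reproducible
picked <- sort(sample(dates, size = 15, prob = weights))
picked
```

The earlier caveat still applies: for sampling without replacement with unequal probabilities, the sampling package may be the better tool.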
I have a starting time specified as a year-month character, e.g. "2020-12". From the start, for each of T consecutive months, I need to generate n different dates (year-month-day), where the day is random.
Any help will be useful!
The data I'm working on:
data <- data.frame(
data = sample(seq(as.Date('2000/01/01'), as.Date('2020/01/01'), by="day"), 500),
price = round(runif(500, min = 10, max = 20),2),
quantity = round(rnorm(500,30),0)
)
func <- function(start, months, n) {
startdate <- as.Date(paste0(start, "-01"))
enddate <- seq(startdate, by = "month", length.out = months)
months <- seq_len(months)
enddate_lt <- as.POSIXlt(enddate)
enddate_lt$mon <- enddate_lt$mon + 1
enddate_lt$mday <- enddate_lt$mday - 1
days_per_month <- as.integer(format(enddate_lt, format = "%d"))
days <- lapply(days_per_month, sample, size = n)
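# enddate holds the first day of each month, so day d of the month is first + d - 1
dates <- Map(function(first, d) first + d - 1L, enddate, days)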
do.call(c, dates)
}
set.seed(2021)
func("2020-12", 4, 3)
# [1] "2020-12-08" "2020-12-07" "2020-12-15" "2021-01-27" "2021-01-08" "2021-01-13" "2021-02-21" "2021-02-07" "2021-02-28"
# [10] "2021-03-28" "2021-03-07" "2021-03-15"
func("2020-12", 5, 2)
# [1] "2020-12-06" "2020-12-16" "2021-01-08" "2021-01-10" "2021-02-24" "2021-02-13" "2021-03-20" "2021-03-29" "2021-04-19"
# [10] "2021-04-28"
func("2020-12", 2, 10)
# [1] "2020-12-29" "2020-12-30" "2020-12-04" "2020-12-15" "2020-12-09" "2020-12-27" "2020-12-05" "2020-12-06" "2020-12-23"
# [10] "2020-12-17" "2021-01-03" "2021-01-20" "2021-01-05" "2021-01-22" "2021-01-23" "2021-01-06" "2021-01-10" "2021-01-07"
# [19] "2021-01-19" "2021-01-12"
Most of the dancing with POSIXlt objects is because it gives us clean (base R) access to the number of days in a month, which makes sampling the days in a month rather simple. It can also be done (code-golf shorter) using the lubridate package, but I don't know that that is any more correct than this code is.
This just dumps out a sequence of random dates, with n days per month. It does not sort within each month, though it does output the months in order. (That's not a difficult extension; there just wasn't a requirement for it.) It doesn't put out a frame, but you can easily extend this to fit in a frame or call data.frame(date = do.call(c, dates)) on the last line, depending on what you need to do with the output.
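The two extensions mentioned above (sorting within each month and returning a frame) could look roughly like the sketch below. The function name `random_dates_by_month` is hypothetical, and it gets the month lengths from `seq(..., by = "month")` rather than from POSIXlt arithmetic, so treat it as one possible variant rather than a drop-in replacement.

```r
random_dates_by_month <- function(start, months, n) {
  # First day of each requested month, plus one extra month for the length calculation
  firsts <- seq(as.Date(paste0(start, "-01")), by = "month", length.out = months + 1)
  first_days <- firsts[-length(firsts)]
  # The day before the next month's first day is this month's last day
  days_in_month <- as.integer(format(firsts[-1] - 1, "%d"))

  picked <- lapply(seq_along(first_days), function(i) {
    # Sample n distinct day numbers in 1..days_in_month, sorted; day 1 is the first itself
    first_days[i] + sort(sample.int(days_in_month[i], n)) - 1L
  })
  data.frame(date = do.call(c, picked))
}

set.seed(2021)
random_dates_by_month("2020-12", 4, 3)
```

Since the days are sampled without replacement, n must not exceed the length of the shortest month in the range.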
You could convert the start time to a class for monthly data, zoo::yearmon. Then use as.Date.yearmon and its frac argument ("a number between 0 and 1 inclusive that indicates the fraction of the way through the period that the result represents") with random values from runif (uniform between 0 and 1) to convert to a random date within each year-month.
start = "2020-12"
T = 3
n = 2
library(zoo)
set.seed(1)
as.Date(as.yearmon(start) + rep((1:T)/12, each = n), frac = runif(T * n))
# [1] "2021-01-08" "2021-01-12" "2021-02-16" "2021-02-25" "2021-03-07" "2021-03-27"
I'm hoping to retrieve the month number from a fiscal year that starts in November (i.e. the first day of the fiscal year is November 1st). The following code provides my desired output, borrowing the week_start syntax of lubridate::wday, where year_start is analogous to week_start:
library('lubridate')
dateToRetrieve = ymd('2017-11-05')
#output: [1] "2017-11-05"
monthFromDate = month(dateToRetrieve, year_start=11)
#output: [1] 1
Since this functionality doesn't yet exist, I'm looking for an alternative solution that provides the same output. Adding period(10, units="month") to each date does not work because the length of different months leads to issues translating between months (e.g. March 31st minus a month = February 31st, which doesn't make sense).
I checked a somewhat similar question on the lubridate github here, but didn't see any solutions. Does anyone have an idea that will provide my desired functionality?
Many thanks,
1) lubridate Below x can be a character vector or a Date vector:
x <- "2017-11-05" # test data
(month(x) - 11) %% 12 + 1
## [1] 1
2) Base R To do this with only base R first calculate the month number giving mx as shown and then perform the same computation:
mx <- as.POSIXlt(x)$mon + 1
(mx - 11) %% 12 + 1
## [1] 1
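The same modular arithmetic can be wrapped in a small helper so that the fiscal-year start month becomes a parameter. This is only a sketch: the name `fiscal_month` and the `year_start` argument are my own, chosen to mirror the syntax asked for in the question.

```r
fiscal_month <- function(x, year_start = 11) {
  # Calendar month number, 1..12 (POSIXlt stores months as 0..11)
  m <- as.POSIXlt(as.Date(x))$mon + 1
  # Shift so that year_start maps to 1, wrapping around at 12
  (m - year_start) %% 12 + 1
}

fiscal_month(c("2017-11-05", "2017-01-15", "2017-05-01", "2017-10-31"))
# [1]  1  3  7 12
```

With `year_start = 1` the function simply returns the calendar month.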
It's not a pretty way... but you could create a vector of month names starting at November, get the full month name of the date object, then match the two together to get the position in the vector.
suppressPackageStartupMessages(library('lubridate'))
x <- format(ISOdate(2004,1:12,1),"%B")[c(11,12,1:10)]
match(as.character(month(ymd('2017-11-05'), label = TRUE, abbr = FALSE)), x)
#> [1] 1
match(as.character(month(ymd('2017-01-15'), label = TRUE, abbr = FALSE)), x)
#> [1] 3
match(as.character(month(ymd('2017-05-01'), label = TRUE, abbr = FALSE)), x)
#> [1] 7
I am trying to create a ts object using R for a daily time series that starts on 24.02.2015 and ends on 13.04.2015. I have put the frequency=7 for daily data but I cannot find a way to put the exact date as start argument.
I think this is what you want, using the decimal_date() function from 'lubridate' to get the proper start time for a daily series and assuming that the vector of values you want to index as a ts is called x and is of the proper length:
library(lubridate)
df <- ts(x, start = decimal_date(as.Date("2015-02-24")), frequency = 365)
Here's what that looks like if I use rnorm() to generate an x of the proper length:
> df
Time Series:
Start = c(2015, 55)
End = c(2015, 103)
Frequency = 365
[1] 0.4284579 1.9384426 0.1242242 -2.4002789 -0.4064669 0.6945274 -0.5172909 0.4772347 0.8758635 -1.7233406 0.5929249 1.5662611 1.0692173 -0.1354226
[15] 1.1404375 0.7714662 -0.2871663 -5.2720038 -1.7353146 -0.7053329 1.0206803 1.7170262 -0.3469172 0.2594851 2.0371700 -2.1549066 -0.6639050 -0.4912258
[29] -0.3849884 -3.0448583 -1.3317834 1.6173705 0.7176759 -0.8646802 -1.7697016 1.1114061 0.6941131 -0.1942612 -0.1836107 -0.5850649 -1.7449090 -3.3646555
[43] -0.4341833 1.9721407 1.4995265 1.7168002 1.8617295 -3.4578959 1.1639413
Note that to index a daily series by calendar date you want frequency = 365 (observations per year), not 7, which makes the unit of time a week.
If you want a vector of dates that you can use in 'zoo' instead, this does it:
seq(from = as.Date("2015-02-24"), to = as.Date("2015-04-13"), by = 1)
So you would create a zoo object like this:
zoo(x, seq(from = as.Date("2015-02-24"), to = as.Date("2015-04-13"), by = 1))
And if you want a table with date column, you can use:
df <- data.frame(date = seq(from = as.Date("2015-02-24"), to = as.Date("2015-04-13"), by = 1))
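Putting these pieces together in base R only, here is a rough sketch that builds the date sequence, checks its length, and creates both a data frame and a ts object. The values in `x` are placeholders generated with `rnorm()`, and the start is given as (year, day of year) rather than through `decimal_date()`, which is my substitution to avoid the lubridate dependency.

```r
dates <- seq(from = as.Date("2015-02-24"), to = as.Date("2015-04-13"), by = 1)

set.seed(1)                    # arbitrary seed for reproducibility
x <- rnorm(length(dates))      # placeholder values, one per day

# Table with a date column and the values alongside it
df <- data.frame(date = dates, value = x)

# %j is the day of the year; 24 February is day 55
start_doy <- as.integer(format(dates[1], "%j"))
series <- ts(x, start = c(2015, start_doy), frequency = 365)

length(dates)
start(series)
end(series)
```

A ts object stores only a start, an end and a frequency, so for irregular or gap-containing daily data the zoo approach above is likely the safer choice.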
Using the xts library:
library(xts)
data_xts <- xts(x=dataframe$x, order.by=as.Date(dataframe$date, "%m/%d/%Y"))
With this method, you don't have to (and can't) specify the end date; the index is taken from the dates in the data.
The output looks like this:
[,1]
2020-01-01 7168.3
2020-01-02 7174.4
2020-01-03 6942.3
2020-01-04 7334.8
I have a vector of dates of the form BW01.68, BW02.68, ... , BW26.10. BW stands for "bi-week", so for example, "BW01.68" represents the first bi-week of the year 1968, and "BW26.10" represents the 26th (and final) bi-week of the year 2010. Using R, how could I convert this vector into actual dates, say, of the form 01-01-1968, 01-15-1968, ... , 12-16-2010? Is there a way for R to know exactly which dates correspond to each bi-week? Thanks for any help!
An alternative solution.
biwks <- c("BW01.68", "BW02.68", "BW26.10")
bw <- substr(biwks,3,4)
yr <- substr(biwks,6,7)
yr <- paste0(ifelse(as.numeric(yr) > 15,"19","20"),yr)
# the %j in the date format is the number of days into the year
as.Date(paste(((as.numeric(bw)-1) * 14) + 1,yr,sep="-"),format="%j-%Y")
#[1] "1968-01-01" "1968-01-15" "2010-12-17"
Though I will note that a 'bi-week' seems a strange measure and I can't be sure that just using 14 day blocks is what is intended in your work.
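The same logic can be wrapped in a function that adds the offset directly to 1 January instead of parsing a day-of-year string. This is a sketch under the same assumptions as above: 14-day blocks counted from 1 January, and two-digit years above a cutoff (15 here) read as 19xx. The name `biweek_to_date` and the `cutoff` argument are hypothetical.

```r
biweek_to_date <- function(biwks, cutoff = 15) {
  bw <- as.integer(substr(biwks, 3, 4))   # bi-week number, e.g. 26
  yy <- as.integer(substr(biwks, 6, 7))   # two-digit year, e.g. 10
  year <- ifelse(yy > cutoff, 1900L + yy, 2000L + yy)
  # Bi-week 1 starts on 1 January; each later bi-week starts 14 days after the previous one
  as.Date(paste0(year, "-01-01")) + (bw - 1L) * 14L
}

biweek_to_date(c("BW01.68", "BW02.68", "BW26.10"))
# [1] "1968-01-01" "1968-01-15" "2010-12-17"
```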
You can make this code a lot shorter. I have spaced out each step to help understanding but you could finish it off in one (long) line of code.
bw <- c('BW01.68', 'BW02.68','BW26.10','BW22.13')
# the gsub ensures that bw01.1 is treated the same as bw01.01, bw1.01, or bw1.1
# isolating the year number
yearno <- as.numeric(
gsub(
x = bw,
pattern = "BW.*\\.",
replacement = ""
)
)
#isolating and converting bw to no of days
dayno <- 14 * as.numeric(
gsub(
x = bw,
pattern = "BW|\\.[[:digit:]]{1,2}",
replacement = ""
)
)
#cutoff year chosen as 15
yearno <- yearno + 1900
yearno[yearno < 1915] <- yearno[yearno < 1915] + 100
# identifying dates
dates <- as.Date(paste0('01/01/',yearno),"%d/%m/%Y") + dayno
# specifically identifying the Monday of that week
mondaydates <- dates - as.numeric(strftime(dates,'%w')) + 1
Output -
> bw
[1] "BW01.68" "BW02.68" "BW26.10" "BW22.13"
> dates
[1] "1968-01-15" "1968-01-29" "2010-12-31" "2013-11-05"
> mondaydates
[1] "1968-01-15" "1968-01-29" "2010-12-27" "2013-11-04"
PS: Just be careful that you're aligned with how bw is measured in your data and whether you're translating it correctly. You should be able to manipulate this to get it to work; for instance, you might encounter a bw 27.
I'm working with some time data and I'm having problems converting a time difference to years and months.
My data looks more or less like this,
dfn <- data.frame(
Today = Sys.time(),
DOB = seq(as.POSIXct('2007-03-27 00:00:01'), len= 26, by="3 day"),
Patient = factor(1:26, labels = LETTERS))
First I subtract the date of birth (DOB) from today's date (Today).
dfn$ageToday <- dfn$Today - dfn$DOB
This gives me the Time difference in days.
dfn$ageToday
Time differences in days
[1] 1875.866 1872.866 1869.866 1866.866 1863.866
[6] 1860.866 1857.866 1854.866 1851.866 1848.866
[11] 1845.866 1842.866 1839.866 1836.866 1833.866
[16] 1830.866 1827.866 1824.866 1821.866 1818.866
[21] 1815.866 1812.866 1809.866 1806.866 1803.866
[26] 1800.866
attr(,"tzone")
[1] ""
This is where first part of my question comes in; how do I convert this difference to years and months (rounded to months)? (i.e. 4.7, 4.11, etc.)
I read the ?difftime man page and the ?format, but I did not figure it out.
Any help would be appreciated.
Furthermore, I would like to melt my final object and if I try using melt on the data frame above using this command,
require(plyr)
require(reshape)
mdfn <- melt(dfn, id=c('Patient'))
I get this strange error I haven't seen before
Error in as.POSIXct.default(value) :
do not know how to convert 'value' to class "POSIXct"
So, my second question is: how do I create a time difference I can melt alongside my POSIXct variables? If I melt without dfn$ageToday everything works like a charm.
Thanks, Eric
The lubridate package makes working with dates and times, including finding time differences, really easy.
library("lubridate")
library("reshape2")
dfn <- data.frame(
Today = Sys.time(),
DOB = seq(as.POSIXct('2007-03-27 00:00:01'), len= 26, by="3 day"),
Patient = factor(1:26, labels = LETTERS))
dfn$diff <- interval(dfn$DOB, dfn$Today) / duration(num = 1, units = "years")
mdfn <- melt(dfn, id=c('Patient'))
class(mdfn$value) # all values are coerced into numeric
The interval() function (called new_interval() in older lubridate releases) calculates the time difference between two dates. Note that there is a function today() that could substitute for your use of Sys.time. Finally note the duration() function that creates a standard, ehm, duration that you can use to divide the interval by a length of standard units, in this case, a unit of one year.
In case you want to preserve the contents of Today and DOB, then you may want to convert everything to character first and reconvert later...
library("lubridate")
library("reshape2")
dfn <- data.frame(
Today = Sys.time(),
DOB = seq(as.POSIXct('2007-03-27 00:00:01'), len= 26, by="3 day"),
Patient = factor(1:26, labels = LETTERS))
# Create standard durations for a year and a month
one.year <- duration(num = 1, units = "years")
one.month <- duration(num = 1, units = "months")
# Calculate the difference in years as float and integer
# (interval() is the current name of the older new_interval())
dfn$diff.years <- interval(dfn$DOB, dfn$Today) / one.year
dfn$years <- floor( interval(dfn$DOB, dfn$Today) / one.year )
# Calculate the modulo for number of months
dfn$diff.months <- round( interval(dfn$DOB, dfn$Today) / one.month )
dfn$months <- dfn$diff.months %% 12
# Paste the years and months together
# I am not using the decimal point so as not to imply this is
# a numeric representation of the difference
dfn$y.m <- paste(dfn$years, dfn$months, sep = '|')
# convert Today and DOB to character so as to preserve them in melting
dfn$Today <- as.character(dfn$Today)
dfn$DOB <- as.character(dfn$DOB)
# melt using string representation of difference between the two dates
dfn2 <- dfn[,c("Today", "DOB", "Patient", "y.m")]
mdfn2 <- melt(dfn2, id=c('Patient'))
# alternative melt using numeric representation of difference in years
dfn3 <- dfn[,c("Today", "DOB", "Patient", "diff.years")]
mdfn3 <- melt(dfn3, id=c('Patient'))
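For comparison, here is a base-R sketch that counts completed calendar years and months instead of dividing by standard durations. It is not the method above: it floors to whole months rather than rounding, which keeps the year and month parts consistent with each other. The function name `age_years_months` is hypothetical, and inputs are assumed to be Dates or "YYYY-MM-DD" strings.

```r
age_years_months <- function(dob, today = Sys.Date()) {
  dob   <- as.POSIXlt(as.Date(dob))
  today <- as.POSIXlt(as.Date(today))
  # Whole calendar months between the two dates
  months <- (today$year - dob$year) * 12 + (today$mon - dob$mon)
  # Subtract one if the day of the month has not been reached yet
  months <- months - (today$mday < dob$mday)
  data.frame(years = months %/% 12, months = months %% 12)
}

age_years_months(c("2007-03-27", "2007-03-30"), today = "2012-05-15")
#   years months
# 1     5      1
# 2     5      1
```

Both columns are plain numbers, so they can be melted alongside other numeric variables without the POSIXct conversion error.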