Simulate a series of code n (let's say 1000) times while saving the result in a vector in R

I'm still relatively new to R, so I'm struggling with repeating lines of code several times and saving the result of each repetition.
The aim is to randomly (with equal probability) assign a number of events, in my case 100, over a 20-year period. Since days are irrelevant, I use the number of months to define the period. Subsequently, I count the events in every 24-month period within the 20 years and, lastly, extract the maximum number of events occurring within any 24-month period.
Albeit messy and probably inefficient, the code works for the intended purpose. However, I want to repeat this process 1000 times to get a distribution of the maximum number of events taking place over 24 months, to compare to my real data.
Here is my code so far:
library(runner)
library(dplyr)
#First I set the period from the year 2000 to 2019 with one-month increments.
period <- seq(as.Date("2000/1/1"), by = "month", length.out = 240)
#I sample random observations assigned to different months over the entire period.
u <- sample(period, size = 100, replace = TRUE)
#Make a table in order to register the number of occurrences within each month.
u <- table(u)
#Create a data frame to ease information processing.
simulation <- data.frame(u)
#Change the date column to date format.
simulation$u <- as.Date(simulation$u)
#Compute number of events taking place within every 24-month period (730 = days in 24 months).
simulation <- simulation %>%
  mutate(
    Last_24_month_total = sum_run(
      x = Freq,
      k = 730,
      idx = u
    )
  )
#Extract the maximum number of occurrences within a 24-month period.
max_events <- max(simulation$Last_24_month_total)
Could someone help me understand how to rewrite this process so that it runs a thousand repetitions while saving the max value of each repetition?
Thanks!

As @jogo suggested in the comments, you can use replicate.
I simplified your code.
library(runner)
library(dplyr)
seq_dates <- seq(as.Date("2000/1/1"), by = "month", length.out = 240)
max_events <- replicate(1000,
  seq_dates %>%
    sample(100, replace = TRUE) %>%
    table() %>%
    sum_run(730, idx = as.Date(names(.))) %>%
    max())
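replicate() returns an ordinary numeric vector, so the simulated distribution can be summarised or plotted directly. A short usage sketch (max_events is the vector assigned above; the plot labels are just suggestions):
summary(max_events)
hist(max_events,
     main = "Maximum events in any 24-month window (1000 runs)",
     xlab = "Maximum number of events")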

Related

R - Use Lubridate to create 1 second intervals in datetime column where only minutes are specified

I am working with a time series that looks something like this:
# making a df with POSIXct datetime sequence with just minutes
#Make reproducible data frame:
set.seed(1234)
datetime <- rep(lubridate::ymd_hm("2016-08-01 15:10"), 60)
# Generate measured value
value <- runif(n = 60, min = 280, max = 1000)
df <- data.frame(datetime, value)
The data is actually recorded at 1-second intervals, but it appears as 60 rows with the same hour and minute, with the seconds always at 00. I want to change it such that each minute's seconds increase in one-second intervals. The actual dataset includes many hours of data. Thank you!
We can use
library(lubridate)
df$datetime <- with(df, datetime + seconds(seq_along(datetime)) - 1)
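Note that this adds a single running seconds counter across the whole frame, which is exactly right for the one repeated minute in the example. If the real data spans many minutes, the counter has to restart at each recorded minute; a sketch with dplyr, assuming exactly 60 rows per minute:
library(dplyr)
library(lubridate)
df <- df %>%
  group_by(minute_stamp = datetime) %>%  # one group per recorded minute
  mutate(datetime = datetime + seconds(row_number() - 1)) %>%
  ungroup() %>%
  select(-minute_stamp)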

Smoothing out missing values in R dataframe

I am using the dataset - https://data.ca.gov/dataset/covid-19-cases/resource/7e477adb-d7ab-4d4b-a198-dc4c6dc634c9 - to look into covid cases and deaths in California.
As well as looking at cases/deaths by ethnicity, I have grouped the data to give total columns of cases and deaths per day. I also used the lag function to compute daily case/death counts.
However, on two days in December (the 23rd and 30th) no increment was made to the cases or deaths columns, so the daily cases and deaths read 0. The following day the data is 'caught up', with an extra-large amount added on, clearly the sum of the two days. (I suspect Christmas and New Year are the causes.)
Is there a way of fixing this data? E.g. splitting the double days' measurement in half, populating the cells with this, and then retrospectively altering the daily cases and daily deaths figures?
Hopefully the screenshots clarify what I mean.
Here is the code I have used:
library(dplyr)
demog_eth <- read.csv("./Data/case_demographics_ethnicity.csv", header = TRUE, sep = ",")
demog_eth$date <- as.Date(demog_eth$date)
#Create a DF with total daily information
total_stats <- data.frame(demog_eth$cases, demog_eth$deaths, demog_eth$date)
names(total_stats) <- c('cases', 'deaths', 'date')
total_stats <- total_stats %>% group_by(date) %>% summarise(cases = sum(cases), deaths = sum(deaths))
#Add daily cases and deaths by computing daily difference in totals
##Comment - use lag to look at previous rows
total_stats <- total_stats %>%
  mutate(daily_cases = cases - lag(cases),
         daily_deaths = deaths - lag(deaths))
(Note: where the screenshot says cases and deaths, it should say Daily Cases and Daily Deaths. Apologies.)
df <- data.frame(col = 1:100, col2 = seq(from = 1, to = 200, by = 2))
df[c(33, 2), ] <- 0
zeros <- which(df$col == 0)  # detect rows with 0
for (i in zeros) {
  df[i, "col"] <- 0.5 * df[i + 1, "col"]      # give the zero day half of the next day's value
  df[i + 1, "col"] <- 0.5 * df[i + 1, "col"]  # keep the other half on the catch-up day
}
Sorry that I used my own simple example data, but the mechanism should work if adapted.
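Adapted to the total_stats frame built in the question, it could look like this (a sketch, untested against the real data; daily_deaths would be handled the same way, and zeros falling on the last row need the guard shown):
zero_days <- which(total_stats$daily_cases == 0)
for (i in zero_days[zero_days < nrow(total_stats)]) {
  total_stats$daily_cases[i]     <- 0.5 * total_stats$daily_cases[i + 1]
  total_stats$daily_cases[i + 1] <- 0.5 * total_stats$daily_cases[i + 1]
}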

Mean Returns in Time Series - Restarting after NA values - rstudio

Has anyone encountered calculating historical mean log returns in time series datasets?
The dataset is ordered by individual security first and by time for each respective security. I am trying to form a historical mean log return, i.e. the mean log return for the security from its first appearance in the dataset to date, for each point in time for each security.
Luckily, the return time series contains NAs between returns for differing securities. My idea is to calculate a historical mean that restarts after each NA that appears.
A simple cumsum() probably will not do it, as the NAs will have to be dropped.
I thought about using rollmean(), if only I knew an efficient way to set the 'width' parameter to the length of the vector of consecutive preceding non-NAs.
The current approach I am taking, based on Count how many consecutive values are true, takes significantly too much time, given the size of the data set I am working with.
For any x of the form x : [r(1) r(2) ... r(N)], where r(2) is the log return in period 2:
df <- data.frame(x, zcount = NA)
df[1, 2] <- 0  # df$x[1] = NA by construction of the data set
for (i in 2:nrow(df))
  df$zcount[i] <- ifelse(!is.na(df$x[i]), df$zcount[i - 1] + 1, 0)
Any idea how to speed this up would be highly appreciated!
You will need to reshape the data.frame to apply the cumsum function over each security. Here's how:
First, I'll generate some data on 100 securities over 100 months which I think corresponds to your description of the data set
securities <- 100
months <- 100
time <- seq.Date(as.Date("2010/1/1"), by = "months", length.out = months)
ID <- rep(paste0("sec", 1:securities), each = months)
returns <- rnorm(securities * months, mean = 0.08, sd = 2)
df <- data.frame(time, ID, returns)
head(df)
time ID returns
1 2010-01-01 sec1 -3.0114466
2 2010-02-01 sec1 -1.7566112
3 2010-03-01 sec1 1.6615731
4 2010-04-01 sec1 0.9692533
5 2010-05-01 sec1 1.3075774
6 2010-06-01 sec1 0.6323768
Now, you must reshape your data so that each security column contains its returns, and each row represents the date.
library(tidyr)
df_wide <- spread(df, ID, returns)
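As an aside, spread() still works but has since been superseded in tidyr by pivot_wider(); an equivalent call (tidyr >= 1.0) would be:
df_wide <- tidyr::pivot_wider(df, names_from = ID, values_from = returns)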
Once this is done, you can use the apply function to sum every column, each of which now represents one security. Or use the cumsum function. Notice the data object df_wide[-1], which drops the time column; this is necessary to avoid the sum or cumsum functions throwing an error.
matrix_sum <- apply(df_wide[-1], 2, FUN = sum)
matrix_cumsum <- apply(df_wide[-1], 2, FUN = cumsum)
Now, add the time column back as a data.frame if you like:
df_final <- data.frame(time = df_wide[,1], matrix_cumsum)
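Since the question asks for the historical mean rather than the cumulative sum, you can divide each cumulative sum by the number of observations so far. A sketch using the df_wide object built above:
matrix_cummean <- apply(df_wide[-1], 2, function(x) cumsum(x) / seq_along(x))
df_means <- data.frame(time = df_wide[, 1], matrix_cummean)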

calculating seasonal range in r for a number of years

I have a data frame of daily temperature measurements spanning 20 years. I would like to calculate the annual range in the data series for each year (i.e. end up with 20 values, representing the range for each year). Example data:
begin_date = as.POSIXlt("1990-01-01", tz = "GMT")
dat = data.frame(dt = begin_date + (0:(20*365)) * (86400))
dat = within(dat, {speed = runif(length(dt), 1, 10)})
I was thinking of writing a loop which goes through each year and then calculate the range, but was hoping there was another solution.
I think the best way forward would be to have the maximum and minimum values for each year and then calculate the range from that. Can anyone suggest a method to do this without writing a loop to go through each year individually?
Try
library(dplyr)
library(lubridate)
dat %>%
  group_by(year = year(dt)) %>%
  summarise(Range = diff(range(speed)))
Or
library(data.table)
setDT(dat)[, list(Range=diff(range(speed))), year(dt)]
Or
aggregate(speed~cbind(year=year(dt)), dat, function(x) diff(range(x)))
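If you would rather avoid the lubridate dependency altogether, format() can extract the year in base R; a sketch:
aggregate(speed ~ year,
          data = transform(dat, year = format(dt, "%Y")),
          FUN = function(x) diff(range(x)))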

R repeat rows by vector and date

I have a data frame with 275 different stations, 43 years of seasonal data (October to the following April; no need for May to September data) and 6 variables. Here is a small example of the data frame with only one variable, called value:
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
What I need is to fill the gaps in day with daily dates (e.g. 1:8) and set the value of each new row to the original value spread evenly over the days until the next observation. It would look like:
data1 <- data.frame(station=rep(1,40), year=rep(1969,40), month=c(rep(10,31),rep(11,9)),day=c(1:31,1:9),value=rep(c(1/7,2/8,3/8,4/8,5/8,6/8),c(7,8,8,8,8,1)))
I wrote some poor code and searched around the site, but unfortunately it didn't work out. Help or better ideas would be appreciated.
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
for (i in 1:length(station.date)){
  days <- as.numeric(station.date[i+1] - station.date[i]) #not working
  data <- within(data, days <- c(days, 1))
}
rows <- rep(1:nrow(data), times = data[, data$days])
rows <- ifelse(rows > 10, 0, rows) #get rid of months May to Sept
data <- data[rows, ]
data <- within(data, value1 <- value/days)
data <- within(data, dd <- ?) #don't know how to change the repeated days to real days
I wrote some code that does the same thing as your example, but you will probably have to modify it to handle the whole data set. I wasn't sure what to do with the last observation, so eventually I made a special case for it. If it should be divided by a different number, just replace the 8 inside values <- c(values, tail(data$value, 1) / 8) with that number. Moreover, if you have all 275 stations in one data.frame, I think the best idea would be to split it, transform each piece separately and then rbind the results (see the sketch after the code).
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
d <- as.numeric(diff(station.date))
range <- sum(d) + 1
# create dates
dates <- seq(station.date[1], by = "day", length = range)
# create values
values <- unlist(sapply(1:length(d), function(i){
  rep(data$value[i] / d[i], d[i])
}))
# adding last observation
values <- c(values, tail(data$value, 1) / 8)
# create new data frame
data2 <- data.frame(station = rep(1, range),
                    year = as.numeric(format(dates, "%Y")),
                    month = as.numeric(format(dates, "%m")),
                    day = as.numeric(format(dates, "%d")),
                    value = values)
It could probably be optimised in some way, but I hope it helps. Note how I extract the year, month and day from dates.
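Following the split-transform-recombine suggestion above, here is a sketch for all 275 stations. expand_station() is a hypothetical wrapper whose body is just the steps from the answer, applied to one station's rows:
# hypothetical helper: expand one station's rows to daily resolution
expand_station <- function(d) {
  station.date <- as.Date(with(d, paste(year, month, day, sep = "-")))
  dd <- as.numeric(diff(station.date))
  n <- sum(dd) + 1
  dates <- seq(station.date[1], by = "day", length.out = n)
  # spread each value over the days until the next observation;
  # the last observation is divided by 8, as in the answer above
  values <- c(rep(d$value[-nrow(d)] / dd, dd), tail(d$value, 1) / 8)
  data.frame(station = d$station[1],
             year = as.numeric(format(dates, "%Y")),
             month = as.numeric(format(dates, "%m")),
             day = as.numeric(format(dates, "%d")),
             value = values)
}
expanded <- do.call(rbind, lapply(split(data, data$station), expand_station))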
