This is my first question on Stack Overflow, so please bear with me if I make any mistakes or omit necessary information.
I have a dataset consisting of a time series where I need to find the 5-day rolling average of a binary variable for each specific hour of the day. An example of my data can be created using:
library(dplyr)
library(zoo)
set.seed(69)
df <- data.frame(Hour = rep(1:24, times = 10),
                 Reg = round(runif(24 * 10, 0, 1)),
                 HumidityLevel = runif(24 * 10, 0, 100))
df_ranges <- data.frame(LowerRange = rep(rollapply(df$HumidityLevel, 24, min, by = 24), each = 24),
                        UpperRange = rep(rollapply(df$HumidityLevel, 24, max, by = 24), each = 24))
df <- cbind(df, df_ranges)
I have computed the simple rolling average using the following code:
df <- df %>%
  group_by(Hour) %>%
  mutate(AvgReg = lag(rollapplyr(Reg, 5, mean, na.rm = TRUE, partial = TRUE), n = 1))
What I need to do is compute the rolling average of Reg using only previous rows where HumidityLevel lies within the range for that specific day. The lower and upper boundaries of the range are given by two columns (LowerRange, UpperRange), which hold the lowest and highest HumidityLevel values for the day.
For instance, a day may have levels between 20 and 54. The rolling average for hour 1 of that specific day should then be computed using previous Hour 1 observations with a HumidityLevel greater than or equal to 20 and less than or equal to 54.
I hope that my question makes sense.
This is my desired output:
desired_output <- data.frame(RowNum = 1:10,
                             Hour = rep(1, times = 10),
                             Reg = c(1, 0, 0, 1, 0, 1, 0, 0, 0, 0),
                             HumidityLevel = c(28.36, 65.02, 1.12, 49.61, 24.50, 98.16, 77.33, 97.03, 47.03, 85.71),
                             LowerBoundary = c(5.67, 7.50, 1.12, 19.32, 0.01, 6.94, 7.48, 0.71, 2.85, 1.59),
                             UpperBoundary = c(93.60, 89.37, 97.25, 99.63, 91.92, 98.16, 98.48, 99.98, 99.70, 98.86),
                             AvgReg = c("NA", 1, 0.5, 0.5, 0.5, 0.5, 0.6, 0.4, 0.4, 0.2))
Using data.table, you can use between to filter and shift + frollmean to do the calculation:
setDT(df)[
between(HumidityLevel, LowerRange, UpperRange),
new_col := shift(
frollmean(Reg, c(seq_len(min(5, .N)), rep(5, max(0, .N - 5))), adaptive = TRUE)
),
by = Hour
]
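The adaptive width vector c(seq_len(min(5, .N)), rep(5, max(0, .N - 5))) grows from 1 up to 5, so the first few matching rows average over whatever history is available (similar to partial = TRUE), and shift() then lags the result by one row. To sanity-check the result for a single hour (a quick inspection, using the new_col name from the expression above):
# rows not selected by the between() filter keep new_col = NA;
# the remaining rows hold the lagged adaptive rolling mean of Reg
df[Hour == 1, .(Hour, Reg, HumidityLevel, LowerRange, UpperRange, new_col)]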
I have a data set that consists of an ID, years and an index:
ID = c("ABW", "ABW", "FRA", "FRA", "FRA", "GER", "GER", "GER")
year = c(2000, 2005, 2000, 2002, 2008, 2005, 2008, 2010)
index = c(NA, NA, 4, NA, 8, NA, 6, NA)
df <- data.frame(ID, year, index)
I am trying to interpolate the missing values in the index but I want the interpolation to be restricted by the ID - e.g. I want R to interpolate the index for all rows with the ID "FRA" and then start the interpolation over again for all rows with the ID "GER". And if there are no values at all for a specific ID (like for the ID "ABW") then I want R to return no interpolated values either.
I have tried this code (but it does not take the ID into consideration):
df <- df %>% mutate(index = na.approx(index, rule = 2))
After the interpolation I want my index column to look like this:
index = c(NA, NA, 4, 6, 8, 6, 6, 6)
Does anyone know how I can do this?
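One way to restrict the interpolation to each ID is to group by ID first. A minimal sketch (not from the original post), with a guard for groups that have fewer than two non-missing values so that ABW stays NA and GER is filled with its single value:
library(dplyr)
library(zoo)
df <- df %>%
  group_by(ID) %>%
  mutate(index = if (sum(!is.na(index)) >= 2) {
    na.approx(index, rule = 2, na.rm = FALSE)   # interpolate within this ID only
  } else if (sum(!is.na(index)) == 1) {
    rep(index[!is.na(index)], n())              # a single value: carry it across the group
  } else {
    index                                       # no values at all (ABW): keep NA
  }) %>%
  ungroup()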
I have a panel data that looks like this:
x <- structure(list(id2 = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8,
8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11), private = c(1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
0, 0, 0, 0, 1, 1, 1), capex = c(-0.003423963, -0.028064674, -0.03058208,
-0.00186256, -0.010839419, 0.052905358, 0.058931317, 0.065547734,
0.007644231, -0.025942514, 0.00163772, -0.007530502, 0.010706151,
0.025040116, 0.035105374, 0.036772128, 0.03886272, 0.045399148,
0.042642809, 0.080788992, 0.080848917, 0.057645567, 0.057636742,
0.046084184, 0.041080192, 0.05690382, 0.057325598, 0.051791377,
0.070084445, 0.069627948, 0.077849329, 0.077247024, 0.081251919,
0.071702167, 0.078424804, 0.077482991, 0.078969546, 0.076208547,
0.059055354, 0.056043826, 0.029450929, 0.016044363, 0.048353843,
0.047607671, 0.046497576, 0.047454875, 0.050881654, 0.047155183,
0.055546004, 0.057564467), roa = c(-0.003078416, -0.035302367,
-0.01884984, 0.002839225, -0.001113289, 0.024291474, 0.040153231,
0.051482957, 0.026102768, 0.005372915, 0.004466314, -0.025178509,
-0.002043936, -0.069235161, 0.023604594, 0.010512878, 0.021912357,
0.016721437, -0.09472625, 0.04061316, 0.074661337, 0.0214584,
0.047743626, 0.013149141, -0.010418181, 0.025671346, 0.031785361,
0.084893651, 0.018490626, 0.024941774, 0.023567598, 0.031878859,
0.029931206, 0.043837443, 0.041305128, 0.041293647, 0.039307728,
0.046259467, 0.017479861, 0.029429797, 0.023826957, 0.00763736,
0.017485917, 0.017156925, 0.006504506, 0.021350464, 0.032917287,
0.036106978, 0.04545372, 0.049348988), year = c(2011, 2012, 2013,
2014, 2015, 2011, 2012, 2013, 2011, 2012, 2013, 2014, 2015, 2011,
2012, 2013, 2014, 2015, 2011, 2012, 2013, 2014, 2015, 2011, 2012,
2013, 2014, 2015, 2011, 2012, 2013, 2014, 2015, 2011, 2012, 2013,
2014, 2015, 2011, 2012, 2013, 2014, 2011, 2012, 2013, 2014, 2015,
2011, 2012, 2013)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
Here id2 is the firm ID and private is an indicator for private/public status. My goal is to run a randomization test for R-squared as follows:
1. Regress roa on capex separately for private firms (private == 1) and public firms (private == 0) and get the observed difference in R-squared.
2. Randomly reassign firms to private/public status (note that the data is a panel).
3. Rerun the regressions and get the difference in R-squared for the random sample.
4. Repeat this, say, 1000 times.
5. Compute the p-value as the number of times the randomly generated difference in R-squared exceeds the observed difference, divided by the number of iterations (1000).
My issue is that this code takes ages to run; it would be great if someone has an idea of a better way to do this. You will need the estimatr and tidyverse packages to run this code:
library(estimatr)
library(tidyverse)
# run model 1 (public firms)
mod1 <- lm_robust(roa ~ capex,
                  clusters = id2,
                  se_type = "stata",
                  data = x, subset = private == 0)
# run model 2 (private firms)
mod2 <- lm_robust(roa ~ capex,
                  clusters = id2,
                  se_type = "stata",
                  data = x, subset = private == 1)
#obtain the observed difference in R2
R2.obs1 <- summary(mod1)$adj.r.squared
R2.obs2 <- summary(mod2)$adj.r.squared
diff_r2_obs <- R2.obs2 - R2.obs1
# create a list for the simulated difference in R2
simulated_r2 <- list()
# prepare the loop
set.seed(8)
nreps = 1000
for(i in 1:nreps){
x1 <- x %>% # randomize the variable private, taking into account that each id appears a number of times
distinct(id2, private) %>%
mutate(private1=sample(private), replace=T) %>%
left_join(x, by="id2")
model_m <-lm_robust(roa ~ capex,
cluster=id2,
se_type = "stata",
data = x1,
subset=private1==0)
R2.obs_m <- summary(model_m)$adj.r.squared
model_f <-lm_robust(roa ~ capex,
cluster=id2,
se_type = "stata",
data = x1,
subset=private1==1)
R2.obs_f <- summary(model_f)$adj.r.squared
r2_diff_sim <- R2.obs_f - R2.obs_m
simulated_r2[i] <- r2_diff_sim
}
simulated_r2 <- unlist(simulated_r2)
exceed_count <- length(simulated_r2[simulated_r2 >= diff_r2_obs])
p_val <- exceed_count / nreps
p_val
I initialized simulated_r2 as a vector rather than a list; we benefit a lot when we predefine its length=. I also replaced the dplyr code with a transform approach, which appears to be faster. It runs quite fast with 1000 reps on this data, actually. (Note that there was a mistake in your code when using sample: the replace= argument is outside the call but should be inside!)
nreps <- 1000
simulated_r2 <- numeric(length=nreps)
set.seed(8)
system.time(
for (i in seq_len(nreps)) {
x1 <- transform(subset(x, !duplicated(id2)),
private=sample(private, replace=T))
model_m <- estimatr::lm_robust(roa ~ capex, clusters=id2, se_type="stata",
data=x1, subset=private == 0)
R2.obs_m <- summary(model_m)$adj.r.squared
model_f <- estimatr::lm_robust(roa ~ capex, clusters=id2, se_type="stata",
data=x1, subset=private == 1)
R2.obs_f <- summary(model_f)$adj.r.squared
r2_diff_sim <- R2.obs_f - R2.obs_m
simulated_r2[i] <- r2_diff_sim
}
)
# user system elapsed
# 8.262 0.000 8.267
(p_val <- length(simulated_r2[simulated_r2 >= diff_r2_obs])/nreps)
# [1] 0.273
Note that warnings occur if you set replace=TRUE, probably when the number of observations in one of the 0/1 subsets gets too small. You should rethink your approach a little.
Parallelize
If this is too slow, you could parallelize the code.
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, 'x')
clusterSetRNGStream(cl, 8) ## set seed
nreps <- 10000
system.time(
simulated_r2 <- parSapplyLB(cl, seq_len(nreps), \(i) {
x1 <- transform(subset(x, !duplicated(id2)),
private=sample(private, replace=F))
model_m <- estimatr::lm_robust(roa ~ capex, clusters=id2, se_type="stata",
data=x1, subset=private == 0)
R2.obs_m <- summary(model_m)$adj.r.squared
model_f <- estimatr::lm_robust(roa ~ capex, clusters=id2, se_type="stata",
data=x1, subset=private == 1)
R2.obs_f <- summary(model_f)$adj.r.squared
R2.obs_f - R2.obs_m
})
)
# user system elapsed
# 0.067 0.058 17.820
(p_val <- length(simulated_r2[simulated_r2 >= diff_r2_obs])/nreps)
# [1] 0.257
stopCluster(cl)
This takes a different approach from jay.sf's nice solution. Here, I use data.table to create the random permutations of private/public status and a small function to get the differences in R-squared:
library(data.table)
setDT(x)
set.seed(8)
nperms=1000
# get permutations of public/private status
perms = cbind(unique(x[,.(id2)]), unique(x[,.(id2,private)])[,lapply(1:nperms, function(p) sample(private, replace=F))])
# function to get R-sq differences
obsR <- function(r,c,id,p) {
r0 = lm_robust(r[p==0]~c[p==0],clusters=id[p==0],se_type="stata")$adj.r.squared
r1 = lm_robust(r[p==1]~c[p==1],clusters=id[p==1],se_type="stata")$adj.r.squared
r1-r0
}
# empirical r-sq diff
empirR = x[, obsR(roa,capex,id2,private)]
# simulated r-sq differences
simR = x[perms, on=.(id2)][, sapply(.SD, function(v) obsR(roa,capex,id2,v)), .SDcols = patterns("V")]
# pvalue
sum(simR>=empirR)/nperms
With the small provided dataset, this can do a thousand estimates and get the p-value in under 5 seconds on my machine.
user system elapsed
4.56 0.14 4.70
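For orientation (a quick look, not part of the original answer): perms has one row per firm, holding id2 plus the 1000 shuffled status columns that data.table auto-names V1 ... V1000, and the join x[perms, on=.(id2)] expands those assignments back onto the panel rows before obsR() is applied column by column.
dim(perms)       # 11 firms by 1001 columns: id2 plus V1...V1000
perms[1:3, 1:4]  # id2 and the first three permuted private/public assignments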
I have a dataset of multiple lakes with water level elevations through time. The observations are not regularly spaced and have many large gaps. Further, some of the older observations may be of lower or unknown quality. I created a separate model that does a reasonably good job of predicting water levels across time, but still misses the actual observations by varying amounts.
I would like to create a third, imputed/interpolated set of data in which the solution:
is informed by the modeled values where observations are missing
crosses the highly weighted observations
and is informed by the lower weighted observations
So far, I have used the fable package's TSLM -> interpolate to perform this. It works reasonably well, but I cannot see a way to introduce weighting into the process. Further, it relies too heavily on the global coefficient and intercept, making it a bit too volatile when the modeled value significantly misses the observed. I am thinking that I need some sort of weighted loess that relies on local coefficients and can accommodate weighting.
library(dplyr)
library(tsibble)
library(fable)
library(ggplot2)
test_data <- data.frame(obs_year = rep(2000:2009, times = 2),
                        site_name = rep(c("Lake1", "Lake2"), each = 10),
                        observed = c(100, 200, NA, NA, NA, NA, 220, NA, NA, 125,
                                     NA, NA, 425, NA, 475, NA, 450, 450, 475, 500),
                        weights = c(1, 1, NA, NA, NA, NA, 2, NA, NA, 2,
                                    NA, NA, 2, NA, 1, NA, 2, 2, 2, 2),
                        modeled = c(110, 120, 165, 150, 200, 225, 240, 250, 150, 130,
                                    450, 430, 415, 400, 425, 450, 460, 460, 470, 490))
test_tsibble <- as_tsibble(test_data, key = site_name, index = obs_year)
tslm_interpolate <- test_tsibble %>%
group_by(site_name) %>%
model(lm = TSLM(observed~modeled)) %>%
fabletools::interpolate(test_tsibble)
tslm_interpolate <- left_join(tslm_interpolate, test_data, by = c("site_name", "obs_year")) %>%
dplyr::select(obs_year, site_name, observed = observed.y, imputed = observed.x, modeled, weights)
tslm_interpolate %>%
ggplot(aes(x=obs_year))+
geom_line(aes(y = imputed), color = "blue")+
geom_line(aes(y = modeled), color = "red")+
geom_point(aes(y = observed), color = "green")+
facet_wrap(~site_name, scales = "free_y")
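Sketching the weighted-loess idea mentioned above (a rough direction, not a tested solution): treat the modeled value as a low-weight pseudo-observation wherever an observation is missing, then fit a weighted loess per lake and predict across all years. The pseudo-observation weight of 0.5 and the span/degree settings are assumptions to tune; heavier observation weights pull the local fit toward those points, but the curve will not pass exactly through them.
library(dplyr)
loess_impute <- test_data %>%
  # where an observation is missing, fall back to the modeled value at low weight
  mutate(y = coalesce(observed, modeled),
         w = if_else(is.na(observed), 0.5, weights)) %>%
  group_by(site_name) %>%
  group_modify(~ {
    fit <- loess(y ~ obs_year, data = .x, weights = .x$w,
                 span = 0.75, degree = 1)
    mutate(.x, imputed = predict(fit, newdata = .x))
  }) %>%
  ungroup()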
This is a follow-on question to a question I posted earlier (see Sum over rows with multiple changing conditions R data.table for more details). I want to calculate how many times the 3 subjects have experienced an event in the last 5 years, so I have been summing over a rolling window using rollapply from the zoo package. This assumes that the experience 5 years ago is as important as the experience 1 year ago (same weighting), so now I want to include a time decay for the experience that enters the sum: the experience 5 years ago should not enter the sum with the same weight as the experience 1 year ago.
In my case I want an age-dependent decay (although for other applications faster or slower decays, such as square roots or squares, could be possible).
For example, let's assume I have the following data (I build on the previous data for clarity):
mydf <- data.frame (Year = c(2000, 2001, 2002, 2004, 2005,
2007, 2000, 2001, 2002, 2003,
2003, 2004, 2005, 2006, 2006, 2007),
Name = c("Tom", "Tom", "Tom", "Fred", "Gill",
"Fred", "Gill", "Gill", "Tom", "Tom",
"Fred", "Fred", "Gill", "Fred", "Gill", "Gill"))
# Create an indicator for the experience
mydf$Ind <- 1
# Load require packages
library(data.table)
library(zoo)
# Set data.table
setDT(mydf)
setkey(mydf, Name,Year)
# Perform cartesian join to calculate experience. I2 is the new experience indicator
m <- mydf[CJ(unique(Name),seq(min(Year)-5, max(Year))),allow.cartesian=TRUE][,
list(Ind = unique(Ind), I2 = sum(Ind,na.rm=TRUE)),
keyby=list(Name,Year)]
# This is the approach I have been taking so far. Note that is a simple rolling sum of I2
m[,Exp := rollapply(I2, 5, function(x) sum(head(x,-1)),
align = 'right', fill=0),by=Name]
So the question now is: how can I include an age-dependent decay in this calculation? To model this I need to divide the experience by its age before it enters the sum.
I have been trying to get it to work using something along these lines:
m[,Exp_age := rollapply(I2, 5, function(x) sum(head(x,-1)/(tail((Year))-head(Year,-1))),
align = 'right', fill=0),by=Name]
But it does not work. I think my main problem is that I cannot get the age of the experience right so that I can divide by it in the sum. The result should look like the Exp_age column in the myres data.frame below:
myres <- data.frame(Name = c("Fred", "Fred", "Fred", "Fred", "Fred",
"Gill", "Gill", "Gill", "Gill", "Gill", "Gill",
"Tom", "Tom", "Tom", "Tom", "Tom"),
Year = c(2003, 2004, 2004, 2006, 2007, 2000, 2001, 2005,
2005, 2006, 2007, 2000, 2001, 2002, 2002, 2003),
Ind = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
Exp = c(0, 1, 1, 3, 4, 0, 1, 1, 1, 2, 3, 0, 1, 2, 2, 4),
Exp_age = c(0, 1, 1, 1.333333333, 1.916666667, 0, 1, 0.45,
0.45, 2.2, 2, 0, 1, 1.5, 1.5, 2.833333333))
Any pointers would be greatly appreciated!
If I understand you correctly, you are trying to do a rollapply with width=5, and rather than a simple sum you want a weighted sum, where the weights depend on the age of the experience within the 5-year window. I would do this: first set the key in your data.table so that it is in proper increasing order by Name and Year; then you know that the last item in your x variable is the youngest and the first item is the oldest (you do this in your code already). I can't quite tell which way you want the weights to go (whether the youngest or the oldest should have the greatest weight), but you get the point:
setkey(m, Name, Year)
my_fun = function(x) { w = 1:length(x); sum(x*w)}
m[,Exp_age:=rollapply(I2, width=5, by=1, fill=NA, FUN=my_fun, by.column=FALSE, align="right") ,by=Name]
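If you specifically want the 1/age decay described in the question (divide each past year's experience by how many years ago it occurred, and exclude the current row as in the original head(x, -1) sum), the weight function could look like this. A hypothetical variant, not checked against the expected Exp_age column above:
my_fun_decay <- function(x) {
  past <- head(x, -1)            # drop the current observation, as in the question
  age  <- rev(seq_along(past))   # 1 = last year, ..., 4 = four years ago
  sum(past / age)
}
m[, Exp_decay := rollapply(I2, width = 5, by = 1, fill = NA, FUN = my_fun_decay,
                           by.column = FALSE, align = "right"), by = Name]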