R: quickly simulate unbalanced panel with variable that depends on lagged values of itself

I am trying to simulate monthly panels of data where one variable depends on lagged values of that variable in R. My solution is extremely slow. I need around 1000 samples of 2545 individuals, each of whom is observed monthly over many years, but the first sample took my computer 8.5 hours to construct. How can I make this faster?
I start by creating an unbalanced panel of people with different birth dates, monthly ages, and variables xbsmall and error that will be compared to determine the Outcome. All of the code in the first block is just data setup.
# Setup:
library(plyr)
# Would like to have 2545 people (nPerson).
#Instead use 4 for testing.
nPerson = 4
# Minimum and maximum possible ages and birth dates
AgeMin = 10
AgeMax = 50
BornMin = 1950
BornMax = 1963
# Person-specific characteristics
ind =
  data.frame(
    id = 1:nPerson,
    BornYear = floor(runif(length(1:nPerson), min=BornMin, max=BornMax+1)),
    BornMonth = ceiling(runif(length(1:nPerson), min=0, max=12))
  )
# Make an unbalanced panel of people over age 10 up to year 1986
# panel = ddply(ind, ~id, transform, AgeMonths = BornMonth)
panel = ddply(ind, ~id, transform, AgeMonths = (AgeMin*12):((1986-BornYear)*12 + 12-BornMonth))
# Set up some random variables to approximate the data generating process
panel$xbsmall = rnorm(dim(panel)[1], mean=-.3, sd=.45)
# Standard normal error for probit
panel$error = rnorm(dim(panel)[1])
# Placeholders
panel$xb = rep(0, dim(panel)[1])
panel$Outcome = rep(0, dim(panel)[1])
Now that we have data, here is the part that is slow (around a second on my computer for only 4 individuals, but hours for thousands). Each month, a person gets two draws (xbsmall and error) from two different normal distributions (these were done above), and Outcome == 1 if xbsmall > -error. However, if Outcome equals 1 in the previous month, then Outcome in the current month equals 1 if xbsmall + 4.47 > -error. I use xb = xbsmall + 4.47 in the code below (xb is the "linear predictor" in a probit model). I ignore the first month for each person for simplicity. For your information, this is simulating a probit DGP (but that is not necessary to know to solve the computation-speed problem).
# Outcome == 1 if and only if xb > -error
# The hard part: xb includes information about the previous month's outcome
start_time = Sys.time()
for(i in 1:nPerson){
  # Determine the range of monthly ages to loop over for this person
  AgeMonthMin = min(panel$AgeMonths[panel$id==i], na.rm=T)
  AgeMonthMax = max(panel$AgeMonths[panel$id==i], na.rm=T)
  # Loop over the monthly ages for this person and determine the outcome
  for(t in (AgeMonthMin+1):AgeMonthMax){
    # Indicator for whether Outcome was 1 last month
    panel$Outcome1LastMonth[panel$id==i & panel$AgeMonths==t] = panel$Outcome[panel$id==i & panel$AgeMonths==t-1]
    # xb = xbsmall + 4.47 if Outcome was 1 last month
    # Otherwise, xb = xbsmall
    panel$xb[panel$id==i & panel$AgeMonths==t] = with(panel[panel$id==i & panel$AgeMonths==t,], xbsmall + 4.47*Outcome1LastMonth)
    # Outcome == 1 if xb > -error
    panel$Outcome[panel$id==i & panel$AgeMonths==t] =
      ifelse(panel$xb[panel$id==i & panel$AgeMonths==t] > -panel$error[panel$id==i & panel$AgeMonths==t], 1, 0)
  }
}
end_time = Sys.time()
end_time - start_time
My thoughts for reducing computation time:
Something with cumsum()
Some wonderful panel data function that I do not know about
Find a way to make the t loop go through the same starting and ending points for each individual and then somehow use plyr::ddply() or dplyr::group_by()
Iterative solution: make an educated guess about the value of Outcome at each monthly age (say, the mode) and somehow adjust values that do not match the previous month. This would work better in my real application because xbsmall has a very clear trend in age.
Do the simulation only for smaller samples and then estimate the effect of sample size on the values I need (the distributions of regression coefficient estimates not calculated here)

One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:
generate_outcome <- function(x) {
  AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
  AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
  for (i in 2:(AgeMonthMax - AgeMonthMin + 1)){
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > -x$error[i], 1, 0)
  }
  x
}
where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do
out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)
and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.
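If you prefer to stay in a grouped-data-frame workflow, the same split-apply-combine idea can be written with dplyr. This is a minimal sketch, assuming the generate_outcome() function defined above; group_modify() hands each person's rows to it, and timings have not been compared here:
library(dplyr)
out2 <- panel %>%
  group_by(id) %>%
  group_modify(~ generate_outcome(.x)) %>%   # .x is one person's data, without the id column
  ungroup()
all.equal(out$Outcome, out2$Outcome)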

Related

How can I make my script for correcting a logger's seasonal drift in R more efficient?

I have installed a bunch of CO2 loggers in water that log CO2 every hour for the open water season. I have characterized the loggers at 3 different concentrations of CO2 before and after installing them.
I assume that the seasonal drift in error will be linear
I assume that the error between my characterization points will be linear
My script is based on a for loop that goes through each timestamp and corrects the value. This works but is unfortunately not fast enough. I know that this can be done within a second, but I am not sure how. I seek some advice and would be grateful if someone could show me how.
Reproducible example based on base R:
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#generate dummy dataframe
dummy <- data.frame(dt,co2)
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
diff.pre <- measured.pre-actual#diff at precharacterization
diff.post <- measured.post-actual#diff at post
#linear interpolation of how deviance from actual values change throughout the season
#I assume that the temporal drift is linear
diff.0 <- seq(diff.pre[1],diff.post[1],length.out=length(dummy$dt))
diff.400 <- seq(diff.pre[2],diff.post[2],length.out = length(dummy$dt))
diff.1000 <- seq(diff.pre[3],diff.post[3],length.out = length(dummy$dt))
#creates a data frame with the assumed drift at each increment throughout the season
dummy <- data.frame(dummy,diff.0,diff.400,diff.1000)
#this loop makes a 3-point calibration at each day in the dummy data set
co2.corrected <- vector()
for(i in 1:nrow(dummy)){
  print(paste0("row: ",i))#to show the progress of the loop
  diff.0 <- dummy$diff.0[i]#get the differences at characterization increments
  diff.400 <- dummy$diff.400[i]
  diff.1000 <- dummy$diff.1000[i]
  #values below are only used for encompassing the range of measured values in the characterization
  #this is based on the interpolated difference at the given time point and the known concentrations used
  measured.0 <- diff.0+0
  measured.400 <- diff.400+400
  measured.1000 <- diff.1000+1000
  #linear difference between calibration at 0 and 400
  seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
  #linear difference between calibration at 400 and 1000
  seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
  #bind them together to get one vector
  correction.ppm <- c(seg1,seg2)
  #the complete range of measured co2 in the characterization.
  #in reality it can not be below 0 and thus it can not be below the minimum measured in the range
  measured.co2.range <- round(seq(measured.0,measured.1000,length.out=length(correction.ppm)))
  #generate a table from which we can characterize the measured values from
  correction.table <- data.frame(measured.co2.range,correction.ppm)
  co2 <- dummy$co2[i] #measured co2 at the current row
  #find the measured value in the table and extract the difference
  diff <- correction.table$correction.ppm[match(co2,correction.table$measured.co2.range)]
  #correct the value and save it to vector
  co2.corrected[i] <- co2-diff
}
#generate column with calibrated values
dummy$co2.corrected <- co2.corrected
This is what I understand after reviewing the code: you have a series of CO2 concentration readings that need to be corrected based on characterization measurements taken at the beginning and at the end of the time series. Both sets of characterization measurements were made at three known concentrations: 0, 400, and 1000.
Your code appears to be applying bilinear interpolation (over time and concentration) to compute the needed correction. This is easy to vectorize:
set.seed(1)
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
# interpolate the reference concentrations over time
cref <- mapply(seq, measured.pre, measured.post, length.out = length(dt))
#generate dummy dataframe with corrected values
dummy <- data.frame(
  dt,
  co2,
  co2.corrected = ifelse(
    co2 < cref[,2],
    actual[1] + (co2 - cref[,1])*(actual[2] - actual[1])/(cref[,2] - cref[,1]),
    actual[2] + (co2 - cref[,2])*(actual[3] - actual[2])/(cref[,3] - cref[,2])
  )
)
head(dummy)
#> dt co2 co2.corrected
#> 1 2022-08-01 00:00:00 537 416.1905
#> 2 2022-08-01 01:00:00 618 493.2432
#> 3 2022-08-01 02:00:00 516 395.9776
#> 4 2022-08-01 03:00:00 760 628.2707
#> 5 2022-08-01 04:00:00 633 507.2542
#> 6 2022-08-01 05:00:00 518 397.6533
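The same piecewise-linear correction can also be expressed with base R's approx(), which generalizes naturally if you ever characterize at more than three concentrations. This is a sketch using the cref matrix and actual vector defined above; note that rule = 2 clamps values outside the characterized range, whereas the ifelse() above extrapolates linearly at the low end:
dummy$co2.corrected2 <- vapply(
  seq_along(co2),
  function(i) approx(x = cref[i, ], y = actual, xout = co2[i], rule = 2)$y,
  numeric(1)
)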
I do not know what you are calculating (I feel that this could be done differently), but you can increase speed by:
remove the print() call; it takes a lot of time inside a loop
remove the data.frame creation in each iteration; it is slow and not needed here
This loop should be faster:
for(i in 1:nrow(dummy)){
  diff.0 <- dummy$diff.0[i]
  diff.400 <- dummy$diff.400[i]
  diff.1000 <- dummy$diff.1000[i]
  measured.0 <- diff.0+0
  measured.400 <- diff.400+400
  measured.1000 <- diff.1000+1000
  seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
  seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
  correction.ppm <- c(seg1,seg2)
  s <- seq(measured.0,measured.1000,length.out=length(correction.ppm))
  measured.co2.range <- round(s)
  co2 <- dummy$co2[i]
  diff <- correction.ppm[match(co2, measured.co2.range)]
  co2.corrected[i] <- co2-diff
}
P.S. The slowest part in my testing is now round(s). Maybe that can be removed or rewritten...
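Building on that point, here is a minimal sketch (not benchmarked) that drops the seq()/round()/match() lookup entirely: since the correction is piecewise linear in the measured value, it can be interpolated arithmetically at the measured co2 itself. Results may differ slightly from the table lookup because no rounding is involved.
co2.corrected <- numeric(nrow(dummy))
for (i in 1:nrow(dummy)) {
  diff.0    <- dummy$diff.0[i]
  diff.400  <- dummy$diff.400[i]
  diff.1000 <- dummy$diff.1000[i]
  measured.0    <- diff.0 + 0
  measured.400  <- diff.400 + 400
  measured.1000 <- diff.1000 + 1000
  co2 <- dummy$co2[i]
  # interpolate the correction directly between the two surrounding calibration points
  diff <- if (co2 <= measured.400) {
    diff.0 + (co2 - measured.0) * (diff.400 - diff.0) / (measured.400 - measured.0)
  } else {
    diff.400 + (co2 - measured.400) * (diff.1000 - diff.400) / (measured.1000 - measured.400)
  }
  co2.corrected[i] <- co2 - diff
}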

Why does just one (of 8) numeric predictor variables return NA when I run lm()?

I'm trying to build a linear regression model using eight independent variables, but when I run lm() one variable--what I anticipate being my best predictor!--keeps returning NA. I'm still new to R, and I cannot find a solution.
Here are my independent variables:
TEMPERATURE
HUMIDITY
WIND_SPEED
VISIBILITY
DEW_POINT_TEMPERATURE
SOLAR_RADIATION
RAINFALL
SNOWFALL
My df is training_set.
I'm not sure whether this matters, but training_set is 75% of my original df, and testing_set is 25%. Created thusly:
set.seed(1234)
split_bike_sharing <- sample(c(rep(0, round(0.75 * nrow(bike_sharing_df))), rep(1, round(0.25 * nrow(bike_sharing_df)))))
This gave me table(split_bike_sharing):
   0    1
6349 2116
And then I did:
training_set <- bike_sharing_df[split_bike_sharing == 0, ]
testing_set <- bike_sharing_df[split_bike_sharing == 1, ]
To create the model I run the code:
lm_model_weather=lm(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE +
SOLAR_RADIATION + RAINFALL + SNOWFALL, data = training_set)
However, the resultant model returns NA for the RAINFALL coefficient.
My first thought was to check the RAINFALL datatype, which is numeric with range 0-1 (because at an earlier step I performed min-max normalization). But SNOWFALL also is numeric, and I've done nothing (that I know of!) to the one but not the other. My second thought was to confirm that RAINFALL contains enough values to work, and summary(training_set$RAINFALL) does not suggest an issue there.
So, how do I correct the NAs in RAINFALL? Truly I will be most grateful for your guidance to a solution.
UPDATE 10 MARCH 2022
I've now checked for collinearity:
X <- model.matrix(RENTED_BIKE_COUNT ~ ., data = training_set)
X2 <- caret::findLinearCombos(X)
print(X2)
The output indicates that certain columns are jointly multicollinear; the flagged columns 8, 13, and 38 are:
[8] is RAINFALL
[13] is SEASONS_WINTER
[38] is HOUR_23
Question: if I want to preserve RAINFALL as a predictor variable (viz., return proper values rather than NAs when I run lm()), what do I do? Remove columns [13] and [38] from the dataset?
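A minimal diagnostic sketch of one way to proceed, reusing the formula and the X/X2 objects from the post (the exact columns to drop are an assumption; they should come from X2$remove rather than be hard-coded):
# alias() reports which term lm() dropped and the exact linear dependency behind the NA
lm_model_weather <- lm(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED +
                         VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION +
                         RAINFALL + SNOWFALL, data = training_set)
alias(lm_model_weather)
# if the dependency involves columns you do not need for this model, dropping the
# columns flagged by findLinearCombos() and refitting is one option:
# cols_to_drop <- colnames(X)[X2$remove]
# training_set_reduced <- training_set[, !(names(training_set) %in% cols_to_drop)]
Whether dropping columns 13 and 38 is appropriate depends on whether those dummies carry information you need elsewhere in the analysis; alias() at least shows which combination is causing the NA.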

Alternative to brute force estimation of parameter in an ecological time series model

I am modeling a hydrologic process (water levels [stage] in lakes, measured in mm) that can be described as:
S[t] = S[t-1] + Beta1*X[t] + Beta2*S[t-1]
where Beta1 is estimated from a different model and used as a constant here (1.071663 in the code below), and Beta2 is the unknown parameter, expected to lie between (-0.001, 0.001). The starting value of S doesn't matter so long as it is greater than 10 m (10000 mm). The model runs on a daily time step. I have observed stage from multiple different lakes and fit each lake independently.
Currently, I am brute-force identifying the parameter value by:
Creating a 100-value sequence of candidate parameter values spanning (-0.001, 0.001)
Predicting stage with the equation above and estimating the RMSE between modeled and observed data (there are significantly fewer observations than modeled data points)
Identifying the Beta2 with the lowest RMSE and selecting the values on either side of it to create a new, narrower sequence of parameter values to search over
Steps 2 and 3 are repeated until the RMSE decreases by less than 0.01 or starts increasing.
The code for the brute force approach I've been using is below along with the data associated with a single lake.
Is there an alternative approach to estimating the unknown parameter Beta2 given the model above and the fact that I only have observed data for a limited number of days?
Thanks!
library(tidyverse)
library(lubridate)
library(Metrics)
#The Data
dat <- read_csv("https://www.dropbox.com/s/skg8wfpu9274npb/driver_data.csv?dl=1")
observeddata <- read_csv("https://www.dropbox.com/s/bhh27g5rupoqps3/observeddata.csv?dl=1") %>% select(Date,Value)
#Setup initial values and vectors
S = matrix(nrow = nrow(dat),ncol = 1) #create an empty matrix for predicted values
S[1,1] = 10*1000 #set initial value (mm)
rmse.diff <- 10^100 #random high value for difference between min RMSE between successive
#parameter searches
b.levels <- seq(from = -0.001,to = 0.001,length.out=100) #random starting parameter that should contain
#the final value being estimated
n = 0 # counter
#Loop to bruteforce search for best parameter estimate
while(rmse.diff > 0.01 ) {
  rmse.vec = rep(NA,length(b.levels))
  for(t in 1:length(b.levels)){
    for(z in 2:nrow(S)){
      S[z,1] <- S[(z-1),1] + (1.071663*(dat$X[z])) + (b.levels[t]*(S[(z-1),1])) #-1.532236
    } #end of time series loop
    extrap_level <- data.frame(Date= dat$Date, level = S) # predicted lake levels
    #calculate an offset to center observed data on extrapolated data
    dat.offset = observeddata %>% left_join(extrap_level) %>%
      mutate(offset = level-Value) %>% drop_na()
    offset <- mean(dat.offset$offset)
    dat.compare <- observeddata %>% left_join(extrap_level) %>%
      mutate(Value = Value + offset) %>% drop_na()
    #calculate RMSE between observed and extrapolated values
    rmse.vec[t] <- rmse(actual = dat.compare$Value,predicted = dat.compare$level)
    #plot the data to watch how parameter choice influences fit while looping
    #plots have a hard time keeping up
    if(t ==1 | t==50 | t==100) {
      plot(extrap_level$Date,extrap_level$level,type="l")
      lines(dat.compare$Date,dat.compare$Value,col="red")
    }
  }
  #find minimum RMSE value
  min.rmse <- which(rmse.vec==min(rmse.vec))
  if(n == 0) rmse.best <- rmse.vec[min.rmse] else rmse.best = c(rmse.best,rmse.vec[min.rmse])
  if (n >= 1) rmse.diff <- (rmse.best[n]-rmse.best[n+1])
  if(rmse.diff < 0) break()
  best.b <- b.levels[min.rmse]
  #take the parameter values on either side of the best prior RMSE and use those as search area
  b.levels <- seq(from = (b.levels[min.rmse-1]),to = (b.levels[min.rmse+1]),length.out=100)
  n = n + 1
}
rmse.best #vector of RMSE for each parameter search
best.b #Last identified best parameter value
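One alternative worth sketching: because this is a one-dimensional problem, the same RMSE objective can be handed to optimize(), which searches the interval directly instead of repeatedly refining a grid. This is a sketch, not from the original post; it assumes the RMSE is reasonably well behaved (roughly unimodal) over (-0.001, 0.001) and reuses dat, observeddata, and the packages loaded above.
rmse_for_b <- function(b) {
  S <- numeric(nrow(dat))
  S[1] <- 10 * 1000
  for (z in 2:length(S)) {
    S[z] <- S[z - 1] + 1.071663 * dat$X[z] + b * S[z - 1]
  }
  extrap_level <- data.frame(Date = dat$Date, level = S)
  dat.compare <- observeddata %>%
    left_join(extrap_level, by = "Date") %>%
    drop_na() %>%
    mutate(Value = Value + mean(level - Value))   # same centering offset as above
  rmse(actual = dat.compare$Value, predicted = dat.compare$level)
}
opt <- optimize(rmse_for_b, interval = c(-0.001, 0.001))
opt$minimum   # estimated Beta2
opt$objective # RMSE at the estimate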

Performing an Interval Regression in R

I am trying to run an interval regression, where my dependent variable, y, is made up of 14 intervals representing incomes. I have 5000 observations. I have six independent variables I am trying to use to predict my y.
I am trying to follow the steps performed here:
http://www.karlin.mff.cuni.cz/~pesta/NMFM404/interval.html#References
So, I actually have y at its exact values, but am trying to learn how to do an interval regression from this. So, first I convert y into an interval.
Income[Income < 10000] <- 1
Income[Income > 10000 & Income < 20001] <- 2
Income[Income > 20000 & Income < 30001] <- 3
...
Income[Income > 300000] <- 14
Okay, fine. From the above link, I should actually convert it to correspond to each lower bound of the interval, and each upper bound. I have to imagine that isn't the only way, but for now, I am following those directions.
lIncome <- rep(0,5000)#lower income bound
uIncome <- rep(0,5000)#upper income bound
for (i in 1:5000){
  if(Income[i] == 1){
    lIncome[i] = 0
    uIncome[i] = 10000
  }
  if(Income[i] == 2){
    lIncome[i] = 10001
    uIncome[i] = 20000
  }
  ...
  if(Income[i] == 14){
    lIncome[i] = 300001
    uIncome[i] = Inf
  }
}
So now I have columns lIncome and uIncome which correspond to the levels of income. I am fine for this part. Perhaps it is problematic my last interval goes to infinity; but even if I just cap it at 500000 I still get errors.
The instructions next say to incorporate the Surv() function.
So, I perform:
TEST <- Surv(lIncome, uIncome, event = rep(3,5000))
However, my errors start now. I get:
Warning message:
In Surv(lIncome, uIncome, event = rep(3, 5000)) :
Invalid status value, converted to NA
If I try
TEST <- Surv(lIncome, uIncome, event = rep(2,5000))
it works, but then:
m <- survreg(TEST ~ Age + AgeSq + ... , dist="gaussian")
gives:
Error in survreg(TEST ~ Age + AgeSq + NoDegree, dist = "gaussian") :
Invalid survival type
First of all, I am not sure why changing the 3 -> 2 makes it work. Even if I change the Inf value to 500000 (or any appropriate number) having it equal to 2 (or any number) does not resolve the issue.
Second, when I can get past that part, the fact that survreg is failing is leaving me puzzled.
Right now, my approach is to play around with my intervals, to see if I can get it to work somehow, then go from there. I am also looking closer at all the documentation for ?Surv and ?survreg
Any help is very appreciated though, thank you.
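A hedged sketch of the usual construction for interval-censored outcomes (not a verified fix for this exact dataset): with type = "interval2", Surv() takes the lower and upper bounds directly and needs no event argument, inferring exact, left-, right-, or interval-censoring from the bounds. The open-ended top interval is coded with NA rather than Inf as its upper bound, and the predictor names follow the post.
library(survival)
uIncome2 <- uIncome
uIncome2[is.infinite(uIncome2)] <- NA   # right-censored at the lower bound
TEST <- Surv(lIncome, uIncome2, type = "interval2")
m <- survreg(TEST ~ Age + AgeSq + NoDegree, dist = "gaussian")
summary(m)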

Applying an lm function to different ranges of data and separate groups using data.table

How do I perform a linear regression using different intervals for data in different groups in a data.table?
I am currently doing this using plyr but with large data sets it gets very slow. Any help to speed up the process is greatly appreciated.
I have a data table which contains 10 counts of CO2 measurements over 10 days, for 10 plots and 3 fences. Different days fall into different time periods, as described below.
I would like to perform a linear regression to determine the rate of change of CO2 for each fence, plot and day combination using a different interval of counts during each period. Period 1 should regress CO2 during counts 1-5, period 2 using 1-7 and period 3 using 1-9.
CO2 <- rep((runif(10, 350,359)), 300) # 10 days, 10 plots, 3 fences
count <- rep((1:10), 300) # 10 days, 10 plots, 3 fences
DOY <-rep(rep(152:161, each=10),30) # 10 measurements/day, 10 plots, 3 fences
fence <- rep(1:3, each=1000) # 10 days, 10 measurements, 10 plots
plot <- rep(rep(1:10, each=100),3) # 10 days, 10 measurements, 3 fences
flux <- as.data.frame(cbind(CO2, count, DOY, fence, plot))
flux$period <- ifelse(flux$DOY <= 155, 1, ifelse(flux$DOY > 155 & flux$DOY < 158, 2, 3))
flux <- as.data.table(flux)
I expect an output which gives me the R2 fit and slope of the line for each plot, fence and DOY.
The data I have provided is a small subsample; my real data has 1*10^6 rows. The following works, but is slow:
model <- function(df) {
  lm(CO2 ~ count,
     data = subset(df, ifelse(df$period == 1, count>1 & count<5,
                       ifelse(df$period == 2, count>1 & count<7,
                                              count>1 & count<9))))
}
model_flux <- dlply(flux, .(fence, plot, DOY), model)
rsq <- function(x) summary(x)$r.squared
coefs_flux <- ldply(model_flux, function(x) c(coef(x), rsquare = rsq(x)))
names(coefs_flux)[1:5] <- c("fence", "plot", "DOY", "intercept", "slope")
Here is a "data.table" way to do this:
library(data.table)
flux <- as.data.table(flux)
setkey(flux,count)
flux[,include:=(period==1 & count %in% 2:4) |
(period==2 & count %in% 2:6) |
(period==3 & count %in% 2:8)]
flux.subset <- flux[(include),]
setkey(flux.subset,fence,plot,DOY)
model <- function(df) {
  fit <- lm(CO2 ~ count, data = df)
  return(list(intercept=coef(fit)[1],
              slope=coef(fit)[2],
              rsquare=summary(fit)$r.squared))
}
coefs_flux <- flux.subset[,model(.SD),by="fence,plot,DOY"]
Unless I'm missing something, the subsetting you do in each call to model(...) is unnecessary. You can segment the counts by period in one step at the beginning. This code yields the same results as yours, except that dlply(...) returns a data frame and this code produces a data table. It isn't much faster on this test dataset.
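As a small follow-up sketch, the intermediate flux.subset copy is not strictly needed either; data.table can filter in i and group in by in a single step, using the same model() helper as above (the setkey() call only affects the row order of the result):
coefs_flux <- flux[(include), model(.SD), by = .(fence, plot, DOY)]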
