R Harmonic Prediction Failing - newdata structure

I am forecasting a time series using a harmonic regression built as follows:
(Packages used: tseries, forecast, TSA, plyr)
airp <- AirPassengers
TIME <- 1:length(airp)
SIN <- COS <- matrix(0, nrow = length(TIME), ncol = 6)
for (i in 1:6){
  SIN[,i] <- sin(2*pi*i*TIME/12)
  COS[,i] <- cos(2*pi*i*TIME/12)
}
SIN <- SIN[,-6]  # sin(2*pi*6*t/12) = sin(pi*t) is identically 0 at integer t, so drop it
decomp.seasonal <- decompose(airp)$seasonal
seasonalfit <- lm(airp ~ SIN + COS)
The fitting works just fine. The problem occurs when forecasting.
TIME.NEW <- seq(length(TIME)+1, length(TIME)+12, by=1)
SINNEW <- COSNEW <- matrix(0, nrow = length(TIME.NEW), ncol = 6)
for (i in 1:6) {
  SINNEW[,i] <- sin(2*pi*i*TIME.NEW/12)
  COSNEW[,i] <- cos(2*pi*i*TIME.NEW/12)
}
SINNEW <- SINNEW[,-6]
prediction.harmonic.dataframe <- data.frame(TIME = TIME.NEW, SIN = SINNEW, COS = COSNEW)
seasonal.predictions <- predict(seasonalfit, newdata = prediction.harmonic.dataframe)
This causes the warning:
Warning message:
'newdata' had 12 rows but variables found have 144 rows
I went through and found that the names were SIN.1, SIN.2, et cetera, instead of SIN1 and SIN2... so I manually changed those, and it still didn't work. I also manually removed SIN.6 because, for some reason, it was still there.
Help?
Edit: I have gone through similar posts as well, and the answers in those questions did not fix my problem.

Trying to predict with a data.frame after fitting an lm model with variables not inside a data.frame (especially matrices) is not fun. It's better if you always fit your model from data in a data.frame.
For example, if you fit the model with
seasonalfit <- lm(airp ~ ., data.frame(airp = airp, SIN = SIN, COS = COS))
then your predict() call would work.
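With the model fit that way, the original newdata construction works unchanged; a minimal sketch reusing the question's objects:
prediction.harmonic.dataframe <- data.frame(TIME = TIME.NEW, SIN = SINNEW, COS = COSNEW)
seasonal.predictions <- predict(seasonalfit, newdata = prediction.harmonic.dataframe)
# data.frame() expands the matrices to columns SIN.1 ... SIN.5 and COS.1 ... COS.6
# on both the fitting side and the newdata side, so predict() can match them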
Alternatively, you can cram the matrices into a data.frame, though this is generally a bad idea. You would do
prediction.harmonic.dataframe <- data.frame(TIME = TIME.NEW,
                                            SIN = I(SINNEW), COS = I(COSNEW))
The I() (or AsIs) wrapper keeps them as matrices.
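With that newdata, prediction then goes through directly; a sketch, assuming seasonalfit was fit on the matrices exactly as in the question:
seasonal.predictions <- predict(seasonalfit, newdata = prediction.harmonic.dataframe)
# SIN and COS arrive as single matrix columns, matching the terms in the formula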

Related

An R function cannot work in the local environment of other functions

I use the MatchIt package for propensity score matching. It can generate a matched dataset after matching using the get_matches() function.
However, if I do not run get_matches() in the global environment but call it inside some other function, the matched data cannot be found in the local environment. (This proved to be misleading: there is nothing wrong with MatchIt's output. The answer by Noah explains my question better.)
To produce my data:
dataGen <- function(b0, b1, n = 2000, cor = 0){
  # covariate
  sigma <- matrix(rep(cor, 9), 3, 3)
  diag(sigma) <- rep(1, 3)
  cov <- MASS::mvrnorm(n, rep(0, 3), sigma)
  # error
  error <- rnorm(n, 0, sqrt(18))
  # treatment variable
  logit <- b0 + b1*cov[,1] + 0.3*cov[,2] + cov[,3]
  p <- 1/(1+exp(-logit))
  treat <- rbinom(n, 1, p)
  # outcome variable
  y <- error + treat + cov[,1] + cov[,2]
  data <- as.data.frame(cbind(cov, treat, y))
  return(data)
}
set.seed(1)
data <- dataGen(b0 = -0.92, b1 = 0.8, 900)
For example, the following works: est.m.WLS() can use m.data.
fm1 <- treat ~ V1+V2+V3
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
m.data <- MatchIt::get_matches(m.out,data=data)
est.m.WLS <- function(m.data, fm2){
  model.1 <- lm(fm2, data = m.data, weights = (weights))
  est <- model.1$coefficients["treat"]
  ## regular robust standard error ignoring pair membership
  model.1.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovHC)
  CI.r <- confint(model.1.2, "treat", level = 0.95)
  ## cluster robust standard error accounting for pair membership
  model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
  CI.cr <- confint(model.2.2, "treat", level = 0.95)
  return(c(est = est, CI.r, CI.cr))
}
fm2 <- y ~ treat+V1+V2+V3
est.m.WLS(m.data,fm2)
But the next version does not work. It reports:
"object 'm.data' not found"
rm(m.data)
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
est.m.WLS <- function(m.out, fm2){
  m.data <- MatchIt::get_matches(m.out, data = data)
  model.1 <- lm(fm2, data = m.data, weights = (weights))
  est <- model.1$coefficients["treat"]
  ## regular robust standard error ignoring pair membership
  model.1.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovHC)
  CI.r <- confint(model.1.2, "treat", level = 0.95)
  ## cluster robust standard error accounting for pair membership
  model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
  CI.cr <- confint(model.2.2, "treat", level = 0.95)
  return(c(est = est, CI.r, CI.cr))
}
est.m.WLS(m.out,fm2)
Since I want to run parallel loops for simulation purposes (using the groundhog library), get_matches() also cannot work inside a foreach() %dopar% {...} environment.
res <- foreach(s = 1:7, .combine = "rbind") %dopar% {
  m.out <- MatchIt::matchit(data = data, formula = fm.p, distance = data$logit, m.order = "random", caliper = 0.2)
  m.data <- MatchIt::get_matches(m.out, data = data)
  ...
}
How should I fix the problem?
Any help would be appreciated. Thank you!
Using a for() loop directly does not run into any problem, since it just works in the global environment, but it is too slow... I really hope to run the thousands of simulations at once. Help!
This has nothing to do with MatchIt or get_matches(). Run debugonce(est.m.WLS) with your second implementation of est.m.WLS(). You will see that get_matches() works perfectly fine and returns m.data. The problem arises when coeftest() runs with a formula argument for cluster.
This is due to a bug in R, outside any package, that I have already requested be fixed. The problem is that expand.model.frame(), the function that searches for the dataset the variables supplied to cluster could be in, only searches the global environment for data, but m.data does not exist in the global environment. To get around this issue, don't supply a formula to cluster; use cluster = m.data["subclass"]. This should hopefully be resolved in an upcoming R release.
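Concretely, only the cluster-robust line inside the second est.m.WLS() needs to change; a sketch of just that line:
model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL,
                              cluster = m.data["subclass"])
# passing the data itself means coeftest() never has to look up m.data by name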

Is there an R loop function (data.table) to run over 100s of `gam` results without exceeding the memory limit?

Spatial Interpolation using gam
Statement
I am hoping to produce many spatial interpolation outputs using generalised additive models (GAM). There is no problem predicting a single pollution map; however, I need more than 100 maps. If possible I would like to automate the implementation and also get the results without exceeding the memory limit.
Spatial Interpolation process with GAM (mgcv package)
Just to let you know, here are the essential steps to get an interpolated map:
1. Get the X, Y coordinates of the pollution monitoring stations
2. Get the pollution data for each station
3. Add the pollution data to the data frame that contains the X, Y coordinates
4. Run gam(pollution ~ s(X, Y, k=20)) for each pollution column
5. Create an empty data frame whose min and max X, Y coordinates span the spatial extent
6. Predict over that spatial extent using predict() and the gam result
7. Repeat the same job over all pollution fields
I will show a hands-on example of how I approached it.
Sample data
To give an example, I created the dataset shown below. From df you can see that I have X, Y, and 3 pollution variables.
library(data.table)
library(mgcv)
X <- c(197745.8,200443.8,200427.6,208213.4,203691.1,208303.0,202546.4,202407.9,202564.8,194095.5,194508.0,195183.8,185432.5,
190249.0,190927.0,197490.1,193551.5,204204.4,199508.4,210201.4,212088.3,191886.5,201045.2,187321.7,205987.0)
Y <- c(451633.1,452496.8,448949.5,449753.3,449282.2,453928.5,452923.2,456347.9,461614.8,456729.3,453019.7,450039.7,449472.0,
444348.1,447274.4,442390.0,443101.2,446446.5,445008.5,446765.2,449508.5,439225.3,460915.6,447392.0,461985.3)
poll1 <- c(34,29,29,33,33,38,35,30,41,43,35,34,41,41,40,36,35,27,53,40,37,32,28,36,33)
poll2 <- c(27,27,34,30,38,36,36,35,37,39,35,33,41,42,40,34,38,31,43,46,38,32,29,33,34)
poll3 <- c(26,30,27,30,37,41,36,36,35,35,35,33,41,36,38,35,34,24,40,43,36,33,30,32,36)
df <- data.table(X, Y, poll1, poll2, poll3)
How I worked on it
1. Hard code
If you look at the code below, you will see that I copied and pasted the same job for every variable. This will be extremely hard to maintain for a lot of variables.
# Run gam
gam1 <- gam(poll1 ~ s(X,Y, k=20), data = df)
gam2 <- gam(poll2 ~ s(X,Y, k=20), data = df)
gam3 <- gam(poll3 ~ s(X,Y, k=20), data = df)
# "there are over 5000 variables that needs looping
# Create an empty surface for prediction
GAM_poll <- data.frame(expand.grid(X = seq(min(df$X), max(df$X), length=200),
                                   Y = seq(min(df$Y), max(df$Y), length=200)))
# Predict gam results to the empty surface
GAM_poll$gam1 <- predict(gam1, GAM_poll, type = "response")
GAM_poll$gam2 <- predict(gam2, GAM_poll, type = "response")
GAM_poll$gam3 <- predict(gam3, GAM_poll, type = "response")
2. Using for Loop
Instead, I made a list and attempted to loop over all the variables to get the results. There is no problem per se, but iterating over multiple variables takes up all the memory (this is what I experienced).
# Run gam using list and for loop
myList <- list()
for(i in 3:length(df)){
  myList[[i-2]] <- gam(df[[i]] ~ s(X,Y, k=20), data = df)
}
# Create an empty surface for prediction
GAM_poll <- data.frame(expand.grid(X = seq(min(df$X), max(df$X), length=200),
                                   Y = seq(min(df$Y), max(df$Y), length=200)))
# Predict gam results to the empty surface
myResult <- list()
for(j in 1:length(myList)){
  myResult[[j]] <- predict(myList[[j]], GAM_poll, type = "response")
}
Asking for help
Is there a better way to get the gam results over multiple variables?
Is there a way to not exceed the memory limit during the implementation?
Can you help me, data.table/purrr users?
The solution I created keeps only the latest prediction in memory, saving each one to disk before it is overwritten by the next. The files are named after the model's column name, in a folder called results. I also melted the data.table, mostly because I think the code is a little clearer that way.
library(data.table)
library(mgcv)
X <- c(197745.8,200443.8,200427.6,208213.4,203691.1,208303.0,202546.4,202407.9,202564.8,194095.5,194508.0,195183.8,185432.5,
190249.0,190927.0,197490.1,193551.5,204204.4,199508.4,210201.4,212088.3,191886.5,201045.2,187321.7,205987.0)
Y <- c(451633.1,452496.8,448949.5,449753.3,449282.2,453928.5,452923.2,456347.9,461614.8,456729.3,453019.7,450039.7,449472.0,
444348.1,447274.4,442390.0,443101.2,446446.5,445008.5,446765.2,449508.5,439225.3,460915.6,447392.0,461985.3)
poll1 <- c(34,29,29,33,33,38,35,30,41,43,35,34,41,41,40,36,35,27,53,40,37,32,28,36,33)
poll2 <- c(27,27,34,30,38,36,36,35,37,39,35,33,41,42,40,34,38,31,43,46,38,32,29,33,34)
poll3 <- c(26,30,27,30,37,41,36,36,35,35,35,33,41,36,38,35,34,24,40,43,36,33,30,32,36)
df <- data.table(X, Y, poll1, poll2, poll3)
# melt the data.table
df <- melt.data.table(df, id.vars = c('X', 'Y'))
dir.create('results')
gam1 <- list()
for(i in unique(df$variable)){
  gam1[[i]] <- gam(value ~ s(X,Y, k=20), data = df[variable == i])
  GAM_poll <- data.table(expand.grid(X = seq(min(df$X), max(df$X), length=200),
                                     Y = seq(min(df$Y), max(df$Y), length=200)))
  GAM_poll[, 'prediction' := predict(gam1[[i]], GAM_poll, type = "response")]
  # paste0() avoids the spaces that paste()'s default separator would put in the file name
  write.csv(GAM_poll$prediction, paste0('results/model_', i, '.csv'), row.names = FALSE)
}
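Since the question also asks about purrr: the same save-to-disk loop can be phrased with walk(). This is only a sketch under the same assumptions (the melted df, a results folder, one model per variable), and it likewise keeps a single model and prediction in memory at a time:
library(purrr)
walk(as.character(unique(df$variable)), function(v){
  m <- gam(value ~ s(X, Y, k = 20), data = df[variable == v])
  grid <- data.table(expand.grid(X = seq(min(df$X), max(df$X), length = 200),
                                 Y = seq(min(df$Y), max(df$Y), length = 200)))
  # write the prediction straight to disk instead of accumulating it
  write.csv(predict(m, grid, type = "response"),
            paste0('results/model_', v, '.csv'), row.names = FALSE)
})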

Looping over objects in R

I am trying to loop over objects in R.
myfunc.linear.pred <- function(x){
  linear.pred <- predict(object = x)
  w <- exp(linear.pred)/(1+exp(linear.pred))
  as.vector(w)
}
The function works perfectly as it should: it returns a vector of 48 values derived from the object x. Now x is nothing but a full regression model from glm() (think: mod.fit <- glm(dep ~ indep, data = data)). The problem is that I have 20 such mod.fit objects and need predictions for each of them. I could literally repeat the code, but I was looking for a neater solution. So what I want is a matrix with 48 rows and 20 columns built from the function above. This is probably basic for an advanced user, but I have only ever used apply() and for loops on numbers, never on objects. I looked into lapply() but couldn't figure it out.
I tried: (and this is probably dumb)
allmodels <- c(mod.fit, mod.fit2, mod.fit3)
lpred.matrix <- matrix(data = NA, nrow = 48, ncol = 20)
for(i in allmodels){
  lpred.matrix[i,] <- myfunc.linear.pred(i)
}
which obviously won't work because allmodels has a class of "list" and it contains all the stuff from the GLM function. Hope someone can help. Thanks!
In order to use lapply(), you must have a list object, not a vector. Something like this should work:
## Load data
data("mtcars")
# fit models
mod.fit1 <- glm(mpg ~ disp, data = mtcars)
mod.fit2 <- glm(mpg ~ drat, data = mtcars)
mod.fit3 <- glm(mpg ~ wt, data = mtcars)
# build function
myfunc.linear.pred <- function(x){
  linear.pred <- predict(object = x)
  w <- exp(linear.pred)/(1+exp(linear.pred))
  as.vector(w)
}
# put models in a named list
allmodels <- list("mod1" = mod.fit1, "mod2" = mod.fit2, "mod3" = mod.fit3)
# use lapply and do.call to generate a matrix of prediction results
df <- do.call('cbind', lapply(allmodels, myfunc.linear.pred))
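A side note on the same assumptions (pred.matrix is just an illustrative name): since every prediction vector has the same length, sapply() collapses the list to the same matrix in one step.
pred.matrix <- sapply(allmodels, myfunc.linear.pred)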
Hope this helps

How to run a loop inside a loop for a gam object

I am trying to predict new observations after multiple imputation. Both the newdata and the models to use are list objects. The correctness of the approach is not the issue; the question is how to use the predict function after multiple imputation when I have new data that is a list. Below is my code.
library(betareg)
library(mice)
library(mgcv)
data(GasolineYield)
dat1 <- GasolineYield
dat1$yield <- with(dat1,
                   ifelse(yield > 0.40 | yield < 0.17, NA, yield)) # create missing values
datim <- mice(dat1, m = 30) # impute missing values
mod1 <- with(datim, gam(yield ~ batch + emp, family = betar(link = "logit"))) # fit models using gam
Creating the dataset to be used for prediction:
datnew <- complete(datim,"long")
datsplit <- split(datnew,datnew$.imp)
The code below is just testing predict() without newdata. The problem I observed was that tp is saved as a 1-by-32 matrix instead of a 30-by-32 matrix: print() displays all 30 rows, but because tp is overwritten on each iteration I couldn't save it as such.
tot <- 0
for(i in 1:30){
  tot <- mod1$analyses[[i]]
  tp <- predict.gam(tot, type = "response")
  print(tp)
}
The code below is my attempt to predict new observations using newdata. Here I am just lost; I am not sure how to go about it.
datnew <- complete(datim,"long")
datsplit <- split(datnew,datnew$.imp)
tot <- 0
for(i in 1:30){
  tot <- mod1$analyses[[i]]
  tp <- predict.gam(tot, newdata = datsplit[[i]], type = "response")
  print(tp)
}
Can someone help me out on how best to go about it?
I finally solved the problem. Here is the solution:
datnew <- complete(datim, "long") # stack all the imputed datasets
Though I should point out that this would normally be your new dataset; I am assuming it is not the data used to build the model. My aim in opening this thread was to address how to predict observations using new data after multiple imputation, i.e. using a model built on a multiply imputed dataset.
datsplit <- split(datnew, datnew$.imp)
library(purrr) # for flatten() and %>%
tot <- list()
tot_ <- list()
for(i in 1:30){
  for(j in 1:30){
    tot[[j]] <- predict.gam(mod1$analyses[[i]], newdata = datsplit[[j]])
  }
  tot_[[i]] <- tot
}
# flatten the lists within lists
totfl <- tot_ %>% flatten()
# nrow is the number of observations to be predicted, as contained in the
# newdata set (datsplit)
totn <- matrix(unlist(totfl), nrow = 32)
apply(totn, 1, mean) # take the mean prediction across the 30 datasets
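A small aside, equivalent to that last line:
rowMeans(totn) # same row means as apply(totn, 1, mean), a bit faster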
I hope this helps those with similar questions. I once came across a question on how to predict newdata after multiple imputation; I guess this will answer some of the questions in that thread.

R : Clustered standard errors in fractional probit model

I need to estimate a fractional model (response taking values between 0 and 1) with R. I also want to cluster the standard errors. I have found several examples on SO and elsewhere, and I built this function based on my findings:
require(sandwich)
require(lmtest)
clrobustse <- function(fit, cl){
  M <- length(unique(cl))
  N <- length(cl)
  K <- fit$rank
  dfc <- (M/(M - 1))*((N - 1)/(N - K))
  uj <- apply(estfun(fit), 2, function(x) tapply(x, cl, sum))
  vcovCL <- dfc*sandwich(fit, meat = crossprod(uj)/N)
  coeftest(fit, vcovCL)
}
I estimate my model like this:
fit <- glm(dep ~ exp1 + exp2 + exp3, data = df, family = quasibinomial("probit"))
clrobustse(fit, df$cluster)
Everything works fine and I get the results. However, I suspect that something is not right, as the non-clustered version:
coeftest(fit)
gives the exact same standard errors. I checked against Stata, which reports different clustered errors. I suspect that I have misspecified the function clrobustse, but I just don't know how. Any ideas about what could be going wrong here?
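If it helps with debugging, one cross-check (my suggestion, not part of the original function) is sandwich's built-in clustered estimator, which should visibly differ from plain coeftest(fit) whenever clustering matters:
coeftest(fit, vcov. = vcovCL, cluster = df$cluster)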
