How to run a loop inside a loop for a gam object - r

I am trying to predict new observations after multiple imputation. Both the newdata and the model to use are list objects. The correctness of the approach is not the issue; the question is how to use the predict function after multiple imputation when the new data is a list. Below is my code.
library(betareg)
library(mice)
library(mgcv)
data(GasolineYield)
dat1 <- GasolineYield
dat1$yield <- with(dat1,
ifelse(yield > 0.40 | yield < 0.17,NA,yield)) # created missing values
datim <- mice(dat1,m=30) #imputing missing values
mod1 <- with(datim,gam(yield ~ batch + emp,family=betar(link="logit"))) #fit models using gam
Creating the data set to be used for prediction:
datnew <- complete(datim,"long")
datsplit <- split(datnew,datnew$.imp)
The code below just tests predict without newdata. The problem I observed is that tp is saved as a 1 by 32 matrix instead of a 30 by 32 matrix. The print call prints all 30 sets of 32 predictions, but I couldn't save them as such.
tot <- 0
for(i in 1:30){
tot <- mod1$analyses[[i]]
tp <- predict.gam(tot,type = "response")
print(tp)
}
The code below is my attempt to predict new observations using newdata. Here I am lost; I am not sure how to go about it.
datnew <- complete(datim,"long")
datsplit <- split(datnew,datnew$.imp)
tot <- 0
for(i in 1:30){
tot <- mod1$analyses[[i]]
tp <- predict.gam(tot,newdata=datsplit[[i]], type = "response")
print(tp)
}
Can someone help me out on how best to go about it?

I finally solved the problem. Here is the solution:
datnew <- complete(datim,"long") # stack all the imputed data
# Note that this should really be your new dataset;
# I am assuming that it is not used in building the model. My aim in opening this
# thread was to address the question of how to predict observations using new data
# after multiple imputation, i.e. using a model built on multiply imputed datasets.
datsplit <- split(datnew,datnew$.imp)
tot <- list()
tot_ <- list()
for(i in 1:30){
for(j in 1:30){
tot[[j]] <- predict.gam(mod1$analyses[[i]],newdata=datsplit[[j]])
}
tot_[[i]] <- tot
}
# flatten the lists within lists (flatten() and %>% come from purrr)
library(purrr)
totfl <- tot_ %>% flatten()
#nrow is the number of observations to be predicted as contained in the
#newdata set (datsplit)
totn <- matrix(unlist(totfl),nrow=32)
apply(totn,1,mean) #takes the means of prediction across the 30 data set
I hope this helps those with similar questions. I once came across a question on how to predict newdata after multiple imputation, I guess this will answer some of the questions contained in that thread.
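For anyone who wants to try the same pattern without installing mice or mgcv, here is a minimal sketch in base R. It rests on stated assumptions: lm() stands in for gam(), and noisy copies of mtcars stand in for the imputed datasets. The names imputed, models, pred_mat and pooled are made up for the illustration.

```r
# Stand-ins (assumptions): lm() replaces gam(), and noisy copies of mtcars
# replace the m imputed datasets that mice would return.
set.seed(42)
m <- 5
imputed <- lapply(seq_len(m), function(i) {
  d <- mtcars[, c("mpg", "wt", "hp")]
  d$wt <- d$wt + rnorm(nrow(d), sd = 0.05)  # mimic imputation variability
  d
})
models <- lapply(imputed, function(d) lm(mpg ~ wt + hp, data = d))

# One column per (model, dataset) pair: nrow(mtcars) rows, m * m columns.
pairs <- expand.grid(model = seq_len(m), data = seq_len(m))
pred_mat <- mapply(
  function(i, j) predict(models[[i]], newdata = imputed[[j]]),
  pairs$model, pairs$data
)

# Average across all pairs to get one pooled prediction per observation.
pooled <- rowMeans(pred_mat)
```

Because mapply() simplifies equal-length results into a matrix, there is no need to flatten nested lists or to hard-code the number of rows.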

Related

R for loop stopping at 27 of 67 variables(columns) in for-loop Dunn Test

I've been doing Dunn's tests as post-hoc tests with corrections for my thesis data and I've run into an issue.
I have 67 variables and 3 groupings that I'm running through, so loops just easily solve the tons of data analysis.
ind1   var1(6)   var...   var67(73)
a      value     value    value
b      value     value    value
c      value     value    value
I've done a normal kruskal.test for all my variables and groups with no issue, but "dunnTest" and "dunn_test" both stop at 27 variables:
library(FSA)
Dunn_df = list()
for(i in names(df[,6:73])){
Dunn_df[[i]] <- dunnTest(formula(paste(i, "~ ind1")), data = df, method = "bonferroni")
}
and
library(rstatix)
Dunn_df = list()
for(i in names(df[,6:73])){
Dunn_df[[i]] <- dunn_test(formula(paste(i, "~ ind1")), data = df, p.adjust.method="BH")
}
Both output a nested list that I've been able to read with print(Dunn_df), but they only do the Dunn test on columns 6:32.
Without any reproducible example it is very difficult to help you. Using the example data mtcars, the function works just fine on a large number of variables, e.g.
library(rstatix)
library(purrr)
rep(colnames(mtcars),10) |>
purrr::map(
~dunn_test(formula(paste(.x, "~mpg")), data = mtcars, p.adjust.method="BH")
)
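When a loop like this stops partway, it can help to catch the error per variable so you can see which column fails and why. The sketch below is only an illustration under assumptions: kruskal.test() from base R stands in for dunn_test(), iris stands in for the unpublished data, and the column with a space in its name is a made-up example of something that breaks formula(paste(...)).

```r
# Stand-ins (assumptions): kruskal.test() replaces dunn_test(), iris replaces
# the asker's data, and one column is renamed so that its name has a space.
df_demo <- iris
names(df_demo)[2] <- "Sepal Width"   # non-syntactic name: formula() cannot parse it
vars <- setdiff(names(df_demo), "Species")

results <- lapply(setNames(vars, vars), function(v) {
  tryCatch(
    kruskal.test(formula(paste(v, "~ Species")), data = df_demo),
    error = function(e) conditionMessage(e)  # keep the message, keep looping
  )
})

# Names of the variables whose test failed; inspect results[failed] for the reason
failed <- names(Filter(is.character, results))
```

The loop now runs over every variable, and failed lists the ones that need a closer look.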

Looping over objects in R

I am trying to loop over objects in R.
myfunc.linear.pred <- function(x){
linear.pred <- predict(object = x)
w <- exp(linear.pred)/(1+exp(linear.pred))
as.vector(w)
}
The function here works perfectly as it should. It returns a vector of 48 rows and it comes from the object x. Now 'x' is nothing but the full regression model from a GLM function (think: mod.fit <- glm (dep~indep, data = data)). The problem is that I have 20 different such ('mod.fit') objects and need to find predictions for each of these. I could literally repeat the code, but I was looking to find a neater solution. So what I want is a matrix with 48 rows and 20 columns for the above function. This is probably basic for an advanced user, but I have only ever used "apply" and "for" loops for numbers and never objects. I looked into lapply but couldn't figure it out.
I tried: (and this is probably dumb)
allmodels <- c(mod.fit, mod.fit2, mod.fit3)
lpred.matrix <- matrix(data=NA, nrow=48, ncol=20)
for(i in allmodels){
lpred.matrix[i,] <- myfunc.linear.pred(i)
}
which obviously won't work because allmodels has a class of "list" and it contains all the stuff from the GLM function. Hope someone can help. Thanks!
In order to use lapply, you must have a list object not a vector object. Something like this should work:
## Load data
data("mtcars")
# fit models
mod.fit1 <- glm (mpg~disp, data = mtcars)
mod.fit2 <- glm (mpg~drat, data = mtcars)
mod.fit3 <- glm (mpg~wt, data = mtcars)
# build function
myfunc.linear.pred <- function(x){
linear.pred <- predict(object = x)
w <- exp(linear.pred)/(1+exp(linear.pred))
as.vector(w)
}
# put models in a list
allmodels <- list("mod1" = mod.fit1, "mod2" = mod.fit2, "mod3" = mod.fit3)
# use lapply and do.call to generate matrix of prediction results
df <- do.call('cbind', lapply(allmodels, function(x){
a <- myfunc.linear.pred(x)
}))
Hope this helps
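A shorter variant, as a sketch: if the models are logistic regressions (an assumption, suggested by the manual inverse-logit in the question), predict(type = "response") already applies the inverse link, and sapply() returns the matrix directly. The three models below are made-up examples on mtcars.

```r
# Assumption: the models are binomial GLMs with a logit link.
allmodels <- list(
  wt   = glm(am ~ wt,   data = mtcars, family = binomial),
  hp   = glm(am ~ hp,   data = mtcars, family = binomial),
  disp = glm(am ~ disp, data = mtcars, family = binomial)
)

# One column of fitted probabilities per model: 32 rows, 3 columns.
pred_mat <- sapply(allmodels, predict, type = "response")

# Same numbers as the manual inverse logit, exp(lp) / (1 + exp(lp)).
manual <- plogis(predict(allmodels$wt))
```

With 20 models of 48 observations each, the same call would give the 48 by 20 matrix asked for.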

"Error in model.frame.default(data = train, formula = cost ~ .) : variable lengths differ", but all variables are length 76?

I'm modeling burrito prices in San Diego to determine whether some burritos are over/under priced (according to the model). I'm attempting to use regsubsets() to determine the best linear model, using the BIC, on a data frame of 76 observations of 14 variables. However, I keep getting an error saying that variable lengths differ, and thus a linear model doesn't work.
I've tried rounding all the observations in the data frame to one decimal place, I've used the length() function on each variable in the data frame to make sure they're all the same length, and before I made the model I used na.omit() on the data frame to make sure no NAs were present. By the way, the original dataset can be found here: https://www.kaggle.com/srcole/burritos-in-san-diego. I cleaned it up a bit in Excel first, removing all the categorical variables that appeared after the "overall" column.
burritos <- read.csv("/Users/Jack/Desktop/R/STOR 565 R Projects/Burritos.csv")
burritos <- burritos[ ,-c(1,2,5)]
burritos <- na.exclude(burritos)
burritos <- round(burritos, 1)
library(leaps)
library(MASS)
yelp <- burritos$Yelp
google <- burritos$Google
cost <- burritos$Cost
hunger <- burritos$Hunger
tortilla <- burritos$Tortilla
temp <- burritos$Temp
meat <- burritos$Meat
filling <- burritos$Meat.filling
uniformity <- burritos$Uniformity
salsa <- burritos$Salsa
synergy <- burritos$Synergy
wrap <- burritos$Wrap
overall <- burritos$overall
variable <- sample(1:nrow(burritos), 50)
train <- burritos[variable, ]
test <- burritos[-variable, ]
null <- lm(cost ~ 1, data = train)
full <- regsubsets(cost ~ ., data = train) #This is where error occurs
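A possible explanation, assuming the CSV column is really called Cost (as cost <- burritos$Cost suggests): the formula uses lowercase cost, which is not a column of train, so R falls back to the standalone vector cost with 76 values while the columns of train have only 50. The sketch below reproduces that with made-up data (burritos_demo is hypothetical) and uses model.frame() from base R, which is where the error is raised, instead of regsubsets().

```r
set.seed(1)
# Made-up stand-in for the burrito data: 76 rows, capitalised column names.
burritos_demo <- data.frame(Cost = runif(76, 5, 10),
                            Hunger = runif(76),
                            Meat = runif(76))
cost  <- burritos_demo$Cost                          # standalone copy, length 76
train <- burritos_demo[sample(nrow(burritos_demo), 50), ]

# 'cost' is not in train, so the 76-value vector is used next to 50-row columns.
err <- tryCatch(model.frame(cost ~ ., data = train),
                error = function(e) conditionMessage(e))

# Referring to the column by its actual name keeps everything at 50 rows.
ok <- model.frame(Cost ~ ., data = train)
```

If that is the cause here, regsubsets(Cost ~ ., data = train) should avoid the error, and the standalone vectors are not needed at all.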

SVM is not generating forecast using R

I have sales data for 5 different products along with weather information. To describe the data: we have daily sales data at a particular store and daily weather information, such as the temperature and average speed for the area where the store is located.
I am using a Support Vector Machine for prediction. It works well for all the products except one. It's giving me the following output:
tunedModelLOG
named numeric(0)
Below is the code:
# load the packages
library(zoo)
library(MASS)
library(e1071)
library(rpart)
library(caret)
normalize <- function(x) {
a <- min(x, na.rm=TRUE)
b <- max(x, na.rm=TRUE)
(x - a)/(b - a)
}
# Define the train and test data
test_data <- train[1:23,]
train_data<-train[24:nrow(train),]
# Define the factors for the categorical data
names<-c("year","month","dom","holiday","blackfriday","after1","back1","after2","back2","after3","back3","is_weekend","weeday")
train_data[,names]<- lapply(train_data[,names],factor)
test_data[,names] <- lapply(test_data[,names],factor)
# Normalized the continuous data
normalized<-c("snowfall","depart","cool","preciptotal","sealevel","stnpressure","resultspeed","resultdir")
train_data[,normalized] <- data.frame(lapply(train_data[,normalized], normalize))
test_data[,normalized] <- data.frame(lapply(test_data[,normalized], normalize))
# Define the same level in train and test data
levels(test_data$month)<-levels(train_data$month)
levels(test_data$dom)<-levels(train_data$dom)
levels(test_data$year)<-levels(train_data$year)
levels(test_data$after1)<-levels(train_data$after1)
levels(test_data$after2)<-levels(train_data$after2)
levels(test_data$after3)<-levels(train_data$after3)
levels(test_data$back1)<-levels(train_data$back1)
levels(test_data$back2)<-levels(train_data$back2)
levels(test_data$back3)<-levels(train_data$back3)
levels(test_data$holiday)<-levels(train_data$holiday)
levels(test_data$is_weekend)<-levels(train_data$is_weekend)
levels(test_data$blackfriday)<-levels(train_data$blackfriday)
levels(test_data$weeday)<-levels(train_data$weeday)
# Fit the SVM model and tune the parameters
svmReFitLOG=tune(svm,logunits~year+month+dom+holiday+blackfriday+after1+after2+after3+back1+back2+back3+is_weekend+depart+cool+preciptotal+sealevel+stnpressure+resultspeed+resultdir,data=train_data,ranges = list(epsilon = c(0,0.1,0.01,0.001), cost = 2^(2:9)))
retunedModeLOG <- svmReFitLOG$best.model
tunedModelLOG <- predict(retunedModeLOG,test_data)
Working file is available at the below link
https://drive.google.com/file/d/0BzCJ8ytbECPMVVJ1UUg2RHhQNFk/view?usp=sharing
What am I doing wrong? I would appreciate any kind of help.
Thanks in advance.
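Two things in the code above can be checked without the data, sketched here with made-up values. Whether either one explains the empty result for this product is an assumption. First, normalize() returns NaN for a column that is constant in the 23 test rows, and predict() for svm objects drops rows with missing values by default (na.action = na.omit). Second, assigning levels() relabels the existing values rather than aligning them with the training levels.

```r
normalize <- function(x) {
  a <- min(x, na.rm = TRUE)
  b <- max(x, na.rm = TRUE)
  (x - a) / (b - a)
}

# A constant column (e.g. no snowfall in a short test window) becomes all NaN.
snow_demo <- normalize(c(0, 0, 0))

# Assigning levels() relabels values instead of aligning them.
train_month <- factor(c("1", "2", "3"))
test_month  <- factor(c("3", "3"))
relabelled  <- test_month
levels(relabelled) <- levels(train_month)   # "3" silently becomes "1"

# Rebuilding the factor with the training levels keeps the values intact.
aligned <- factor(as.character(test_month), levels = levels(train_month))
```

Running colSums(is.na(test_data)) on the real data would show whether any predictor is entirely missing after normalization.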

error in Predict.lm plyr Is there an alternative to mdply for prediction using newdata?

I am at a loss trying to figure out the logic of how to get predictions using new data passed to predict.lm using plyr in place of a loop. Can anyone help? Example:
Because I am new to R and not a highly skilled programmer, my code will be painfully inefficient.
Stack Overflow community:
Thanks for the suggestions to create fake code of the problem. I am hoping this will help me solve this headache.
My goal is to make predictions on a new validation dataset using the coefficients from model built on training dataset. I will eventually be building an ARIMA as well as a linear model once I can get help solving the problem. I am building 24 regression models. One model for each hour of the day. My training data would be 90 days and my validation data would be 31 days.
Creating Some Data
require(plyr)
# setting up some fake data
set.seed(31)
foo <- function(myHour, myDate){
rlnorm(1, meanlog=0,sdlog=1)*(myHour) + (150*myDate)
}
Hour <- 1:24
Day <-1:90
dates <-seq(as.Date("2012-01-01"), as.Date("2012-3-30"), by = "day")
myData <- expand.grid( Day, Hour)
names(myData) <- c("Date","Hour")
myData$Adspend <- apply(myData, 1, function(x) foo(x[2], x[1]))
myData$Date <-dates
myData$Demand <-(rnorm(1,mean = 0, sd=1)+.75*myData$Adspend)
## ok, done with the fake data generation.
myData
#Run regression on training data
FIT <- dlply(myData, "Hour", function(x) lm(x[,4] ~ x[,3], data=x))
# Create new fake validation dataset (31days)
Hour <- 1:24
Day <- 1:31
dates <-seq(as.Date("2012-03-31"), as.Date("2012-4-30"), by = "day")
newData <- expand.grid( Day, Hour)
names(newData) <- c("Date","Hour")
set.seed(310)
fooNew <- function(myHour, myDate){
rlnorm(1, meanlog=0,sdlog=1)*5*(myHour) + (300*myDate)
}
newData$AdspendNew <- apply(newData, 1, function(x) fooNew(x[2], x[1]))
newData$Date <-dates
I then try to make predictions of Demand using the New values for Adspend
NewDatabyHour <-dlply(newData,"Hour")
PREDFIT <-mdply(cbind(mod=FIT, df=NewDatabyHour), function(mod,df) {
transform(df, pred=predict(mod,df))})
The error I am now getting is the following:
Error in data.frame(list(Date = c(15430, 15431, 15432, 15433, 15434, 15435, :
arguments imply differing number of rows: 31, 90
In addition: Warning message:
'newdata' had 31 rows but variables found have 90 rows
My Question is:
How do I make predictions on new data when the new data has fewer observations than the training data?
My second question is: Is the process the same for auto.arima as for LM()?
Thank you again for any help.
Your problem arises from the way you are constructing your formula and then not having consistent names in the newdata argument to predict.lm (also, mdply is not really what you want here).
predict.lm will look for objects in newdata that have the same names as the terms in your model object. Your current definition has x[,3] as the predictor term (and x[,4] as the response), and no column of newdata has that name.
Instead, use the names, i.e.
FIT <- dlply(myData, "Hour", function(x) lm(Demand ~ Adspend, data=x))
Now when you create newData, continue to use the name Adspend
newData$Adspend <- apply(newData, 1, function(x) fooNew(x[2], x[1]))
Now you can use Map (which is a wrapper for mapply, a base R function, not plyr) to move through FIT and NewDatabyHour to do your predictions (and combine them with the new data):
predicted <- Map(object = FIT, newdata = NewDatabyHour,
f = function(object,newdata) {
newdata$predicted = predict(object, newdata)
newdata})
# combine into whole data frame again
predDF <- rbind.fill(predicted)
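The same idea as a self-contained sketch in base R, with split() and do.call(rbind, ...) standing in for dlply() and rbind.fill(). The data are made up (3 hours instead of 24) and the names train, valid, fits and pred_df are only for illustration.

```r
set.seed(31)
# Made-up training and validation data; note the shared column name Adspend.
train <- data.frame(Hour = rep(1:3, each = 20), Adspend = runif(60, 100, 500))
train$Demand <- 0.75 * train$Adspend + rnorm(60)
valid <- data.frame(Hour = rep(1:3, each = 5), Adspend = runif(15, 100, 500))

# One model per hour, fitted with column names rather than x[, 3] / x[, 4].
fits <- lapply(split(train, train$Hour),
               function(d) lm(Demand ~ Adspend, data = d))
valid_by_hour <- split(valid, valid$Hour)

# Pair each model with the validation rows for the same hour.
predicted <- Map(function(fit, nd) {
  nd$predicted <- predict(fit, newdata = nd)
  nd
}, fits, valid_by_hour)

# Combine into one data frame again.
pred_df <- do.call(rbind, predicted)
```

The validation set has fewer rows than the training set and that causes no problem, because predict() finds Adspend by name in newdata.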
Another (entirely different) approach would be to use lmList from nlme. From its documentation:
Data is partitioned according to the levels of the grouping factor g and individual lm fits are obtained for each data partition, using the model defined in object.
library(nlme)
# fit the model to each subset
FITS <- lmList(Demand ~ Adspend | Hour, data = myData)
# make the predictions
newData$predicted <- predict(FITS, newdata = newData)
(Please note that these regression models are almost certainly not the best way to analyse these data !)
