Confidence interval for regression error, R, - r

Can someone please explain what I am doing wrong here. I want to
find a confidence interval for an average response of my variable
"list1." R has an example online using the 'faithful' dataset and it
works fine. However, whenever I try to find a confidence/prediction
interval, I ALWAYS get this error message. I have been at this for 5
hours and tried a million different things, nothing works.
> list1 <- c(1,2,3,4,5) #first data set
> list2 <- c(2,4,5,6,7) # second data set
> frame <- data.frame(list1,list2) # made a data.frame object
> reg <- lm(list1~list2,data=frame) # regression
> newD = data.frame(list1 = 2.3) #new data input for confidence/prediction interval estimation
> predict(reg,newdata=newD,interval="confidence")
        fit         lwr      upr
1 0.7297297 -0.08625234 1.545712
2 2.3513514  1.88024388 2.822459
3 3.1621622  2.73210185 3.592222
4 3.9729730  3.45214407 4.493802
5 4.7837838  4.09033237 5.477235
Warning message:
'newdata' had 1 row but variables found have 5 rows #Why does this keep happening??

The problem is that you are passing the new value for prediction under the name of the dependent variable instead of the predictor. The formula syntax in the regression is y ~ x, so in lm(list1 ~ list2) the response is list1 and the predictor is list2. predict() looks in newdata for a column named list2; newD only contains list1, so R picks up the original five-element list2 from your workspace instead. That is why you get five rows and the warning. See the Details section of ?predict.lm for more.
Naming the column list2 works:
newD2 = data.frame(list2 = 2.3) #note the name is list2 and not list1
predict(reg, newdata = newD2, interval = "confidence")
       fit       lwr    upr
1 0.972973 0.2194464 1.7265
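As a minimal sketch of the point above (variable names are taken from the question; pr and the prediction-interval call are only added for illustration), the column in newdata has to carry the predictor's name, and predict() then returns one row per row of newdata:

```r
list1 <- c(1, 2, 3, 4, 5)   # response
list2 <- c(2, 4, 5, 6, 7)   # predictor
frame <- data.frame(list1, list2)
reg <- lm(list1 ~ list2, data = frame)

# newdata must contain a column named after the predictor (list2)
newD2 <- data.frame(list2 = 2.3)

# interval for the mean response at list2 = 2.3
ci <- predict(reg, newdata = newD2, interval = "confidence")
# interval for a single new observation at list2 = 2.3 (always wider)
pr <- predict(reg, newdata = newD2, interval = "prediction")

nrow(ci)   # 1, because newD2 has one row
ci
pr
```

Both calls give the same fit; only lwr and upr differ, since a prediction interval also has to cover the noise of an individual observation.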

Related

In R using the pls package, how can I obtain estimates of coefficients by group/factor

I've started looking at the pls package & I am unsure about how to extract separate coefficients by group/factor. I can run separate models per group, or consider the X ~ group interaction term, but that isn't what I'm after.
I'm using the following syntax:
model1 <- plsr(outcome ~ pred * group, data =plsDATA,2)
I've tried using the following:
model2 <- plsr(outcome ~ embed(pred:as.factor(group)), data=plsDATA,2)
but this results in this error:
Error in model.frame.default(formula = outcome ~ embed(pred:as.factor(group)), :
variable lengths differ (found for 'embed(pred:as.factor(group))')
In addition: Warning messages:
1: In pred:as.factor(group) :
numerical expression has 640 elements: only the first used
2: In pred:as.factor(group) :
numerical expression has 32 elements: only the first used
I'm not sure why I'm getting the variable lengths error since running the following command gives compatible dimensions:
dim(group)
[1] 32 1
dim(outcome)
[1] 32 1
dim(pred)
[1] 32 20
The code is below:
library(pls)
# Dummy data
setwd("/Users/John/Documents")
Data <- read.csv("SamplePLS.csv")
# Define each of the inputs: pred is X, group is the factor & outcome is Y
pred <- as.matrix(Data[,3:22])
group <- as.matrix(Data[,1])
outcome <- as.matrix(Data[,2])
# Now combine the matrices into a single data frame
plsDATA <- data.frame(SampN=c(1:nrow(Data)))
plsDATA$pred <- pred
plsDATA$group <- group
plsDATA$outcome <- outcome
# Define the model - ask for two components
model1 <- plsr(outcome ~ pred * group, data=plsDATA, 2)
# Get coefficients from this object
From your question, it sounds like you want to extract the coefficients. The function coef() will pull them out easily. See the results below.
> Data <- read.csv("SamplePLS.csv")
> # Define each of the inputs: pred is X, group is the factor & outcome is Y
> pred <- as.matrix(Data[,3:22])
> group <- as.matrix(Data[,1])
> outcome <- as.matrix(Data[,2])
> # Now combine the matrices into a single data frame
> plsDATA <- data.frame(SampN=c(1:nrow(Data)))
> plsDATA$pred <- pred
> plsDATA$group <- group
> plsDATA$outcome <- outcome
> # Define the model - ask for two components
> model1 <- plsr(outcome ~ pred * group, data=plsDATA, 2)
> coef(model1)
, , 2 comps
outcome
predpred1 -1.058426e-02
predpred2 2.634832e-03
predpred3 3.579453e-03
predpred4 1.135424e-02
predpred5 3.271867e-04
predpred6 4.438445e-03
predpred7 8.425997e-03
predpred8 3.001517e-03
predpred9 2.111697e-03
predpred10 -9.264594e-04
predpred11 1.885554e-03
predpred12 -2.798959e-04
predpred13 -1.390471e-03
predpred14 -1.023795e-03
predpred15 -3.233470e-03
predpred16 5.398053e-03
predpred17 9.796533e-03
predpred18 -8.237801e-04
predpred19 4.778983e-03
predpred20 1.235484e-03
group 9.463735e-05
predpred1:group -8.814101e-03
predpred2:group 9.013430e-03
predpred3:group 7.597494e-03
predpred4:group 1.869234e-02
predpred5:group 1.462835e-03
predpred6:group 6.928687e-03
predpred7:group 1.925111e-02
predpred8:group 3.752095e-03
predpred9:group 2.404539e-03
predpred10:group -1.288023e-03
predpred11:group 4.271393e-03
predpred12:group 6.704938e-04
predpred13:group -3.943964e-04
predpred14:group -5.468510e-04
predpred15:group -5.595737e-03
predpred16:group 1.090501e-02
predpred17:group 1.977715e-02
predpred18:group -3.013597e-04
predpred19:group 1.169534e-02
predpred20:group 3.389127e-03
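The coefficients are also stored on the fitted object as model1$coefficients (or model1[[1]]). Note that this array holds the coefficients for every number of components up to the 2 requested, while coef(model1) returns the slice for the full 2-component model by default. Based on the question, I think this is the result you are looking for.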
Actually, I've just figured this out. You need to dummy code the grouping variable & make it the outcome (i.e. predicted variable). In this case, I had two columns representing group membership. In each case, membership in the group was indicated by 1 and non-membership by 0. Then I called the first two columns as group (i.e. group <- as.matrix(Data[,1:2])) & ran the rest of the code as before substituting group for outcome.

Error in linear model when values are 0

I have a data set with names, value 1, and value 2. I need to run a regression and obtain the t-statistic for each of the names. I got help on StackOverflow in constructing the linear model. I noticed that sometimes the data for a name are all 0's. That's OK, and I want the code to keep running and not bomb. However, when the 0's are in there, it bombs.
v1<-rnorm(1:50)
v2<-rnorm(1:50)
data<-data.frame(v1,v2)
data[1:50,"nm"]<-"A"
data[50:100,"nm"]<-"B"
data[50:100,"v1"]<-0
data[50:100,"v2"]<-0
data<-data[c("nm","v1","v2")]
## run regression and generate universe
library(plyr) # provides ddply()
plyrFunc <- function(x){
  mod <- lm(v1~v2, data = x)
  return(summary(mod)$coefficients[2,3])
}
lm <- ddply(data, .(nm), plyrFunc)
As you can see, for name B, since everything is 0, the model bombs. I cannot just remove all 0's because often times the values are indeed 0.
I don't know how to edit the above code so that it keeps going.
Can anyone let me know? Thank you!
The model itself fits fine. The error comes from subsetting summary(mod)$coefficients, which contains only one row in the all-zeroes case, so there is no row 2 to index:
> summary(lm(v1~v2,data[data$nm=="A",]))$coefficients
              Estimate Std. Error    t value  Pr(>|t|)
(Intercept) -0.1462766  0.1591779 -0.9189503 0.3628138
v2          -0.1315238  0.1465024 -0.8977590 0.3738900
> summary(lm(v1~v2,data[data$nm=="B",]))$coefficients
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        0          0     NaN      NaN
Thus, you need to modify your function to take this case into account:
plyrFunc <- function(x){
  mod <- lm(v1~v2, data = x)
  res <- summary(mod)$coefficients
  if (nrow(res)>1) res[2,3] else NA
}
library(plyr)
result <- ddply(data, .(nm), plyrFunc)
Output for your sample data set:
nm V1
1 A -0.1825896
2 B NA
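If you would rather not depend on plyr, the same guard can be sketched in base R with split() and sapply(). This is only an illustration: the data are regenerated here with a seed, and slope_t and t_by_name are names I made up. It looks the row up by name ("v2") instead of by position, which amounts to the same check as nrow(res) > 1:

```r
set.seed(1)  # made-up data in the same shape as the question's
data <- data.frame(
  nm = rep(c("A", "B"), each = 50),
  v1 = c(rnorm(50), rep(0, 50)),
  v2 = c(rnorm(50), rep(0, 50))
)

# t-statistic of the v2 slope, or NA when lm() had to drop v2
slope_t <- function(x) {
  res <- summary(lm(v1 ~ v2, data = x))$coefficients
  if ("v2" %in% rownames(res)) res["v2", "t value"] else NA_real_
}

# one regression per name; returns a named numeric vector
t_by_name <- sapply(split(data, data$nm), slope_t)
t_by_name
```

Group B comes back as NA instead of stopping the whole run, while group A gets its usual t-statistic.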

How to run lm models using all possible combinations of several variables and a factor

This is not my subject, so I am sorry if my question is badly asked or if the data is incomplete. I am trying to run 31 linear models which have a single response variable (VELOC), and as predictor variables have a factor (TRAT, with 2 levels, A and B) and five continuous variables.
I have a loop I used for gls, but only with continuous predictor variables, so I thought it could work here. It did not, and I believe the problem must be something silly.
I don't know how to include the data, but it looks something like:
   TRAT    VELOC        l        b        h        t        m
1     A  0.02490 -0.05283  0.06752  0.03435 -0.03343  0.10088
2     A  0.01196 -0.01126  0.10604 -0.01440 -0.08675  0.18547
3     A -0.06381  0.00804  0.06248 -0.04467 -0.04058 -0.04890
4     A  0.07440  0.04800  0.05250 -0.01867 -0.08034  0.08049
5     A  0.07695  0.06373 -0.00365 -0.07319 -0.02579  0.06989
6     B -0.03860 -0.01909  0.04960  0.09184 -0.06948  0.17950
7     B  0.00187 -0.02076 -0.05899 -0.12245  0.12391 -0.25616
8     B -0.07032 -0.02354 -0.05741  0.03189  0.05967 -0.06380
9     B -0.09047 -0.06176 -0.17759  0.15136  0.13997  0.09663
10    B -0.01787  0.01665 -0.08228 -0.02875  0.07486 -0.14252
Now, the script I used is:
library(gtools) # provides permutations()
pred.vars = c("TRAT", "l", "b", "h", "t", "m") # define predictor variables
# all possible combinations of pred.vars
m.mat = permutations(n = 2, r = 6, v = c(F, T), repeats.allowed = T)
# fill the models
models = apply(cbind(T, m.mat), 1, function(xrow) {
  paste(c("1", pred.vars)[xrow], collapse = "+")
})
models = paste("VELOC", models, sep = "~") # fill the left side
all.aic = rep(NA, length(models)) # AIC of models
# which predictors are estimated in the models besides the intercept
m.mat = cbind(1, m.mat)
colnames(m.mat) = c("(Intercept)", pred.vars)
n.par = 2 + apply(m.mat, 1, sum) # number of parameters estimated in the models
coefs = m.mat # define an object to store the coefficients
for (k in 1:length(models)) {
  res = try(lm(as.formula(models[k]), data = xdata))
  if (!inherits(res, "try-error")) {
    all.aic[k] = -2 * logLik(res)[1] + 2 * n.par[k]
    xx = coefficients(res)
    coefs[k, match(names(xx), colnames(m.mat))] = xx
  }
}
And I get this error: "Error in coefs[k, match(names(xx), colnames(m.mat))] = xx : NAs are not allowed in subscripted assignments"
Thanks in advance for your help. I'd appreciate any corrections on how to post questions properly.
Lina
I suspect the dredge function in the MuMIn package would help you. You specify a "full" model with all parameters you want to include and then run dredge(fullmodel) to get all combinations nested within the full model.
You should then be able to get the coefficients and AIC values from the results of this.
Something like:
require(MuMIn)
data(iris)
globalmodel <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Species, data = iris)
combinations <- dredge(globalmodel)
print(combinations)
To get the parameter estimates for all models (a bit messy), you can then use
coefTable(combinations)
or, to get the coefficients for a particular model, you can index it by its row number in the dredge object, e.g.
coefTable(combinations)[1]
for the coefficients of the model at row 1. This should also print coefficients for factor levels.
See the MuMIn helpfile for more details and ways to extract information.
Hope that helps.
To deal with the error
'global.model''s 'na.action' argument is not set and
options('na.action') is "na.omit"
set na.action to "na.fail" before fitting the global model:
require(MuMIn)
data(iris)
options(na.action = "na.fail") # change the default "na.omit" to prevent models
# from being fitted to different datasets in
# case of missing values.
globalmodel <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Species, data = iris)
combinations <- dredge(globalmodel)
print(combinations)
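As for why the original loop fails: lm() names the coefficient of a two-level factor after the non-reference level (here "TRATB"), not after the variable ("TRAT"), so match(names(xx), colnames(m.mat)) returns NA for it and the assignment stops. If you want to stay with a hand-rolled loop, here is a rough base-R sketch that sidesteps this by building the coefficient table from the names lm() actually reports. The data are a made-up stand-in with only three predictors, and fits and coef.names are names I introduced:

```r
# Synthetic stand-in for the poster's data; values are made up for illustration.
set.seed(42)
xdata <- data.frame(
  TRAT = factor(rep(c("A", "B"), each = 10)),
  l = rnorm(20),
  b = rnorm(20)
)
xdata$VELOC <- 0.5 * xdata$l + rnorm(20, sd = 0.1)

pred.vars <- c("TRAT", "l", "b")

# One row per subset of predictors (TRUE = included): 2^3 = 8 rows.
m.mat <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(pred.vars))))
colnames(m.mat) <- pred.vars

# Fit every subset; the empty subset becomes the intercept-only model.
fits <- lapply(seq_len(nrow(m.mat)), function(k) {
  keep <- pred.vars[m.mat[k, ]]
  rhs <- if (length(keep) > 0) keep else "1"
  lm(reformulate(rhs, response = "VELOC"), data = xdata)
})
names(fits) <- sapply(fits, function(f) deparse(formula(f)))

all.aic <- sapply(fits, AIC)

# Columns come from the fitted models, so the factor appears as "TRATB".
coef.names <- unique(unlist(lapply(fits, function(f) names(coef(f)))))
coefs <- matrix(NA_real_, nrow = length(fits), ncol = length(coef.names),
                dimnames = list(names(fits), coef.names))
for (k in seq_along(fits)) {
  xx <- coef(fits[[k]])
  coefs[k, names(xx)] <- xx  # index by name, so no NA subscripts
}

match("TRATB", c("(Intercept)", pred.vars))  # NA: the cause of the original error
```

Indexing coefs by name also copes with factors that have more than two levels, where one variable produces several coefficient columns.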

How do I produce a set of predictions based on a new set of data using predict in R? [duplicate]

I'm struggling to understand how the predict function works and can be used with different sample data. For instance the following code...
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
fit <- lm(my$y ~ my$x)
mySample <- my[sample(nrow(my), 100),]
predict(fit, mySample)
should, as I understand it, return 100 y predictions based on the sample. But it returns 1,000 rows with the warning message:
'newdata' had 100 rows but variables found have 1000 rows
How do I produce a set of predictions based on a new set of data using predict? Or am I using the wrong function? I am a noob, so I apologise in advance if I am asking stupid questions.
It's never a good idea to use the $ symbol when using the formula syntax (and most of the time it's completely unnecessary). This is especially true when you are trying to make predictions, because the predict() function works hard to exactly match up column names and data types. So rather than
fit <- lm(my$y ~ my$x)
use
fit <- lm(y ~ x, my)
So a complete example would be
set.seed(15) # for reproducibility
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
fit <- lm(y ~ x, my)
mySample <- my[sample(1:nrow(my), 100),]
head(predict(fit, mySample))
# 694 278 298 825 366 980
# 0.43593108 -0.67936324 -0.42168723 -0.04982095 -0.72499087 0.09627245
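To see the difference side by side, here is a small sketch (fit_dollar, fit_names, p_dollar and p_names are names made up for this comparison). With my$y ~ my$x the model's only variable is literally "my$x", so predict() re-evaluates it against the full data frame no matter what newdata contains:

```r
set.seed(15)
my <- data.frame(x = rnorm(1000))
my$y <- 0.5 * my$x + 0.5 * rnorm(1000)
mySample <- my[sample(nrow(my), 100), ]

fit_dollar <- lm(my$y ~ my$x)        # variables tied to the object `my`
fit_names  <- lm(y ~ x, data = my)   # variables looked up by column name

# newdata is effectively ignored here; R warns about the row mismatch
p_dollar <- suppressWarnings(predict(fit_dollar, mySample))
# here newdata supplies x, so there is one prediction per row of mySample
p_names <- predict(fit_names, mySample)

length(p_dollar)  # 1000
length(p_names)   # 100
```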
A couple of things are wrong with the code: you are overwriting the sample function with your variable named sample. You want something like mysample <- sample(my$x, 100). It has nothing to do with predict. From my limited understanding, data frames are 'lists of columns', so sampling my means creating 100 samples of the (1000-row) column x. By using my$x you are referring to the column in the data frame, which is a list of rows.
In other words, you are sampling from a list of columns (which only has a single element), but you actually want to sample from the rows in column x.
Is this what you want?
library(caret)
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
## Divide data into train and test set
Index <- createDataPartition(my$y, p = 0.8, list = FALSE, times = 1)
train <- my[Index, ]
test <- my[-Index,]
lmfit<- train(y~x,method="lm",data=train,trControl = trainControl(method = "cv"))
lmpredict<-predict(lmfit,test)
This is for an in-sample prediction. For a pseudo out-of-sample prediction (forecasting one step ahead), you just need to lag the independent variable by 1:
Lag(x)

Manually conduct leave-one-out cross validation for a GLMM using a for() loop in R

I am trying to build a for() loop to manually conduct leave-one-out cross validations for a GLMM fit using the lmer() function from the lme4 pkg. I need to remove an individual, fit the model and use the beta coefficients to predict a response for the individual that was withheld, and repeat the process for all individuals.
I have created some test data to tackle the first step of simply leaving an individual out, fitting the model and repeating for all individuals in a for() loop.
The data have a binary (0,1) Response, an IndID that classifies 4 individuals, a Time variable, and a Binary variable. There are N=100 observations. The IndID is fit as a random effect.
require(lme4)
#Make data
Response <- round(runif(100, 0, 1))
IndID <- as.character(rep(c("AAA", "BBB", "CCC", "DDD"),25))
Time <- round(runif(100, 2,50))
Binary <- round(runif(100, 0, 1))
#Make data.frame
Data <- data.frame(Response, IndID, Time, Binary)
Data <- Data[with(Data, order(IndID)), ] #**Edit**: Added code to sort by IndID
#Look at head()
head(Data)
Response IndID Time Binary
1 0 AAA 31 1
2 1 BBB 34 1
3 1 CCC 6 1
4 0 DDD 48 1
5 1 AAA 36 1
6 0 BBB 46 1
#Build model with all IndID's
fit <- lmer(Response ~ Time + Binary + (1|IndID ), data = Data,
family=binomial)
summary(fit)
As stated above, my hope is to get four model fits, one with each IndID left out, in a for() loop. This is a new type of application of the for() command for me, and I quickly reached the limit of my coding abilities. My attempt is below.
fit <- list()
for (i in Data$IndID){
  fit[[i]] <- lmer(Response ~ Time + Binary + (1|IndID), data = Data[-i],
                   family=binomial)
}
I am not sure storing the model fits as a list is the best option, but I had seen it on a few other help pages. The above attempt results in the error:
Error in -i : invalid argument to unary operator
If I remove the [-i] conditional to the data=Data argument the code runs four fits, but data for each individual is not removed.
Just as an FYI, I will need to further expand the loop to:
1) extract the beta coefs, 2) apply them to the X matrix of the individual that was withheld and lastly, 3) compare the predicted values (after a logit transformation) to the observed values. As all steps are needed for each IndID, I hope to build them into the loop. I am providing the extra details in case my planned future steps inform the more immediate question of leave-one-out model fits.
Thanks as always!
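The problem you are having is that Data[-i] expects i to be an integer index. Instead, i is one of the character strings AAA, BBB, CCC or DDD, and negating a string gives the "invalid argument to unary operator" error. (Without a comma, Data[-i] would also drop columns rather than rows, even for an integer i.) To fix the loop, set
data = Data[Data$IndID != i, ]
in your model fit.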
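Put together, the loop could look roughly like the sketch below. Two loud assumptions: I use base R's glm() without the random effect as a stand-in so the example runs without lme4, and the data are regenerated with a seed. With lme4 you would swap in glmer(Response ~ Time + Binary + (1 | IndID), family = binomial) (lmer() with a family argument is, I believe, deprecated in favour of glmer()) and predict with re.form = NA so the withheld individual gets fixed-effects-only predictions. It also loops over unique(Data$IndID), so each individual is left out exactly once:

```r
set.seed(1)  # made-up data in the same shape as the question's
Data <- data.frame(
  Response = rbinom(100, 1, 0.5),
  IndID    = rep(c("AAA", "BBB", "CCC", "DDD"), 25),
  Time     = round(runif(100, 2, 50)),
  Binary   = rbinom(100, 1, 0.5)
)

fits  <- list()
preds <- list()
for (i in unique(Data$IndID)) {
  train <- Data[Data$IndID != i, ]   # everyone except individual i
  test  <- Data[Data$IndID == i, ]   # the withheld individual

  # Stand-in model: plain logistic regression, no random effect
  fits[[i]] <- glm(Response ~ Time + Binary, data = train, family = binomial)

  # Predicted probabilities for the withheld individual
  preds[[i]] <- predict(fits[[i]], newdata = test, type = "response")
}

# Compare predictions to observations, e.g. mean squared error per individual
sapply(names(preds), function(i) {
  mean((Data$Response[Data$IndID == i] - preds[[i]])^2)
})
```

Storing the fits in a list keyed by IndID is fine for four individuals; with many individuals you may prefer to keep only the predictions.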
