Predicted values from an R glm model, while leaving out one term

I have a glm in R that nicely explains the abundance of a species, of the form
x<-glm(log(abundance) ~ distance+sampling_effort, data=df)
All terms are significant (p-value < 0.01) and the model assumptions seem to be valid. The data are actually from a raster map. Now I want to create predicted values from my model, but while leaving out the sampling_effort term. This would create a new raster map that compensates for sampling effort and thus provides a better prediction of abundance, as if sampling_effort were equal everywhere. How can I do this?

OK, after some better googling I found the answer already at http://r.789695.n4.nabble.com/Remove-term-from-formula-for-predict-lm-td1017686.html
Basically, the easiest way is to set sampling_effort to 0 in a new dataset and use that with predict, like this:
newdata <- df
newdata$sampling_effort <- 0  # zero out the effort term so it drops from the prediction
predicted_values_compensated <- predict(x, newdata)
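Note that setting sampling_effort to 0 predicts at zero effort, which removes that term's contribution entirely. If you would rather predict at a common, realistic effort level, the same trick works with the mean (a small variation on the above, not from the linked answer):
newdata <- df
newdata$sampling_effort <- mean(df$sampling_effort)  # hold effort constant at its mean
predicted_at_mean_effort <- predict(x, newdata)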

Related

Use glm to predict on fresh data

I'm relatively new to glm - so please bear with me.
I have created a glm (logistic regression) to predict whether an individual CONTINUES studies ("0") or does NOT CONTINUE ("1"). I am interested in predicting the latter. The glm uses seven factors from the dataset, the confusion matrices are very good for what I need, and combining seven years of data has also been done. Straightforward.
However, I now need to apply the model to the current year's data, which of course does not have the NOTCONTINUE column in it. Let's say the glm model is "CombinedYears" and the new data is "Data2020".
How can I use the glm model to get predictions of who will ("0") or will not ("1") continue their studies? Do I need to insert a NOTCONTINUE column into the latest file? I have tried this structure
Predict2020 <- predict(CombinedYears, data.frame(Data2020), type = 'response')
but the output only holds values <0.5.
Any help very gratefully appreciated. Thank you in advance
You mentioned that you already created a prediction model to predict whether a particular student will continue studies or not. You used the glm function and your model name is CombinedYears.
Now, what you have to know is that your problem is binary classification and you used logistic regression for it. The output of your model, when you apply it to new data or even to the same data used to fit the model, is probabilities: values between zero and one. In the development phase of your model, you need to determine the cutoff threshold for these probabilities, which you can then use when you predict new data. For example, you may choose 0.5 as a cutoff, so every probability above it is considered NOTCONTINUE and every one below it CONTINUE. However, the best threshold can also be determined from your data by jointly maximizing specificity and sensitivity; this is read off the receiver operating characteristic (ROC) curve, whose area under the curve (AUC) summarizes overall discrimination. There are many packages that can do this for you, such as the pROC and AUC packages in R. The same packages can determine the best cutoff as well.
What you have to do is the following:
1. Determine the cutoff threshold from the ROC analysis
library(pROC)
roc_object <- roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears))
coords(roc_object, "best", ret = "threshold", transpose = FALSE)  # best cutoff
2. Use your model to predict on your new data year (as you did)
Predict2020 = predict(CombinedYears, data.frame(Data2020), type = 'response')
Now, the content of Predict2020 is just probabilities for each student. Use the cutoff you obtained from step (1) to classify your students accordingly.
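For completeness, applying that cutoff could look like the following sketch (best_cutoff stands in for the threshold returned by coords() above; the name is illustrative):
best_cutoff <- 0.5  # replace with the threshold returned by coords() in step (1)
Predicted2020_class <- ifelse(Predict2020 > best_cutoff, 1, 0)  # 1 = NOTCONTINUE, 0 = CONTINUE
table(Predicted2020_class)  # counts per predicted class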

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the package glmnet. I need to run several LASSO analyses for the calibration of a large number of variables (% reflectance at each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts about the procedure and the results that I would like to resolve. I show my provisional code below:
First I split my data in training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwards, I run a cross-validated LASSO model on the training set, then extract and write out the non-zero coefficients at lambda.min. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y=y.train, x= x.train, family="gaussian", nfolds =
5, standardize=TRUE, alpha=1)
coef(cv.lasso.1,s=cv.lasso.1$lambda.min) # Using lambda min.
(cv.lasso.1)
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use the function "predict" to apply the object "cv.lasso.1" (the model obtained previously) to the variables of the testing set (x.test) in order to get predictions, and I run the correlation between the predicted and the actual values of y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx = x.test, type = "response", s = "lambda.min")
cor.test(x = c(predict.1.2), y = c(y.test))
This is simplified code and I have had no problems so far. The point is that I would like to make a loop (of one hundred repetitions) of the whole thing and, for each repetition, get the non-zero coefficients of the cross-validated model as well as the correlation coefficient between the predicted and actual values for the testing set. I've tried but couldn't get any clear results. Can someone give me a hint?
thanks!
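For reference, a minimal sketch of the loop being described (it assumes mydata as above, with the response in X1 and the predictors in columns 3:2153; the per-iteration seeding is illustrative):
library(glmnet)
library(broom)
n_rep <- 100
coef_list <- vector("list", n_rep)  # non-zero coefficients per repetition
cors <- numeric(n_rep)              # test-set correlation per repetition
for (i in seq_len(n_rep)) {
  set.seed(i)  # a different, reproducible split each repetition
  train_ind <- sample(seq_len(nrow(mydata)), size = floor(0.70 * nrow(mydata)))
  x.train <- data.matrix(mydata[train_ind, 3:2153])
  x.test  <- data.matrix(mydata[-train_ind, 3:2153])
  y.train <- mydata$X1[train_ind]
  y.test  <- mydata$X1[-train_ind]
  cv.fit <- cv.glmnet(x = x.train, y = y.train, family = "gaussian",
                      nfolds = 5, standardize = TRUE, alpha = 1)
  coef_list[[i]] <- tidy(coef(cv.fit, s = "lambda.min"))  # keep the selected wavebands
  pred <- predict(cv.fit, newx = x.test, s = "lambda.min")
  cors[i] <- cor(as.vector(pred), y.test)
}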
In general, running repeated analyses of the same type over and over on the same data can be tricky, and in your case it may not be necessary in the way you have outlined it.
If you are trying to find the most predictive variables, you can use PCA (Principal Component Analysis) to select the variables with the most variation within and between variables. However, it does not consider your outcome at all, so with a poor model design it will pick the least correlated data in your repository, which may not be predictive; you should therefore be very aware of all the variables in the set. This is a way of reducing the dimensionality of your data for a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
                  center = TRUE,
                  scale. = TRUE)
Scaling and centering are essential to making these models work right: they put your variables on a comparable footing by setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those arguments as they are. And if you have skewed or kurtotic data, you might need to address this prior to PCA. Run this ONLY on your predictors; keep your target/outcome variable out of the data set.
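A quick way to inspect what the PCA gives you (a sketch, assuming the yourPCA object from above and at least five predictors):
summary(yourPCA)        # proportion of variance explained by each component
head(yourPCA$x[, 1:5])  # observation scores on the first five components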
If you have a classification problem you are looking to resolve with a lot of data, try an LDA (Linear Discriminant Analysis), which reduces variables by optimizing the variance of each predictor with respect to the OUTCOME variable; it specifically considers your outcome.
require(MASS)
yourLDA <- lda(formula = outcome ~ .,
               data = yourdata)
You can also set the prior probabilities in LDA if you know the global probability for each class, or you can leave them out and R/lda will estimate the probabilities from the actual classes in the training set. You can read about that here:
LDA from MASS package
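For illustration, explicit priors for a two-class problem would look like this (the 0.7/0.3 split is made up; priors must sum to 1):
yourLDA <- lda(outcome ~ ., data = yourdata, prior = c(0.7, 0.3))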
So this gets you headed in the right direction for reducing the complexity of your data via feature selection in a computationally solid way. For building the most robust model via repeated model fitting, the technique is known as cross-validation. There is a cv.glm function in the boot package which can help you get this taken care of in a safe way.
You can use the following as a rough guide:
require(boot)
yourGLM <- glm(outcomeVariable ~ ., data = yourdata, family = "gaussian")  # fit the model first
yourCVGLM <- cv.glm(data = yourdata, glmfit = yourGLM, K = 100)            # then cross-validate it
Here K = 100 specifies 100-fold cross-validation: 100 models, each fitted and tested on random subsets of your OBSERVATIONS, not your variables.
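The cross-validated prediction error can then be read off the result (assuming the fit above):
yourCVGLM$delta  # raw and bias-adjusted cross-validation estimates of prediction error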
So the process is twofold: reduce variables using one of the two methods above, then use cross-validation to build a single model from repeated trials, without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called bootstrapping, and it is powerful and available for many different model types.
Not as much code as you might hope for, but it points you in a decent direction.

Predicted(?) values from an lmer model

I have a data frame of bird counts. I have the participants ID number, the number of birds they counted, the year they counted them, their lat and long coordinates, and their effort. I have made this model:
model = lmer(count~year+lat+long+effort+(1|participant), data = df)
I now want the model to plot predicted values from that same data set. So, that data was for 1997-2017, and I want the model to give me predicted values for each year. I want to plot these, so the final plot will have the predicted count on the y-axis, and the year (categorical) on the x-axis. Each year will have one data point w/ a confidence interval.
I have tried figuring out predict(), but I'm not quite sure how to use it to get what I want. It seems to need a new data frame, but I don't have a new data set to run through the model to predict a future count. I want the model to go back and work on the data I already put into it, based on the beta values in the output of summary(model).
I found this thread, and it seems to be basically what I'm looking to do, but I can't get the sjPlot dependencies to download; sjlabelled throws an error every time: How to plot predicted values with standard errors for lmer model results?
You could try the ggeffects-package, which will be used in the forthcoming sjPlot-update to plot predicted values.
library(ggeffects)
dat <- ggpredict(model, terms = "year")  # predicted counts for each year
plot(dat)
If you're missing dependencies, try:
install.packages(
  c("sjlabelled", "sjmisc", "sjstats", "ggeffects", "sjPlot"),
  dependencies = TRUE
)
You may even want to install ggeffects from GitHub, since the current dev-version has some fixes and improvements for mixed models.
devtools::install_github("strengejacke/ggeffects")
I found the package I was looking for: it's called predictmeans, and it has a function where you put in the model and the model term you want predictions for: predictmeans(model, modelterm). It works perfectly!
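For reference, a minimal sketch of that call for the model above (the term name and the string interface are assumptions; check ?predictmeans for the exact arguments):
library(predictmeans)
predictmeans(model, modelterm = "year")  # predicted means with CIs for each year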

R: Limit/Set values of predicted results from linear model

New to R.
Looking to limit the range of values that can be predicted.
df.Train <- data.frame(S=c(1,2,2,2,1),L=c(1,2,3,3,1),M=c(400,450,400,700,795),V=c(423,400,555,600,800),G=c(4,3.2,2,2.7,3.4), stringsAsFactors=FALSE)
m.Train <- lm(G~S+L+M+V,data=df.Train)
df.Test <- data.frame(S=c(1,2,1,2,1),L=c(1,2,3,1,1),M=c(400,450,500,800,795),V=c(423,475,555,600,555), stringsAsFactors=FALSE)
round(predict(m.Train, df.Test, type="response"),digits=1)
#seq(0,4,.1) #Predicted values should fall in this range
I've experimented with the predict() options but no luck.
Is there an option in predict? Should I be limiting it in the model?
Thank you
There are ways to transform your response variable, G in this case, but there needs to be a good reason to do so. For example, if you want the output to be probabilities between 0 and 1 and your response variable is binary (0/1), then you need a logistic regression.
It all comes down to what data you have and whether a model / transformation of the response variable would be appropriate. In your example you do not specify what the data are, and therefore we cannot say anything about which model or which transformation to use.
Setting the above aside, if you really care only about the prediction and not about the model or the transformation (but why wouldn't you care?), it looks like your data could use a quasi-Poisson generalised linear model, which might provide the output you need:
df.Train <- data.frame(S=c(1,2,2,2,1),L=c(1,2,3,3,1),M=c(400,450,400,700,795),V=c(423,400,555,600,800),G=c(4,3.2,2,2.7,3.4), stringsAsFactors=FALSE)
m.Train <- glm(G~S+L+M+V,data=df.Train, family=quasipoisson)
df.Test <- data.frame(S=c(1,2,1,2,1),L=c(1,2,3,1,1),M=c(400,450,500,800,795),V=c(423,475,555,600,555), stringsAsFactors=FALSE)
> predict(m.Train, df.Test, type="response")
1 2 3 4 5
4.000000 2.840834 3.062754 3.615447 4.573276
#probably not as good as you want
The model uses a log link by default, which ensures the predicted values will be positive. There is no guarantee that the model will not predict values greater than 4, but since you fed it values of less than 4 (your G variable), chances are that most of the predictions will follow that distribution (as in this example). You might then need to consider how to treat predictions that go above 4.
In general you should consider carefully which model and which response transformation to choose. The Poisson model above, for example, is usually used for count data. However, you should never manipulate predictions on your own, so if you choose the lm model in the end, make sure you use the predictions it gives.
EDIT
It looks like in your case a non-linear regression might be what you need. The problem with a linear model like lm is that predictions can be greater than the maximum, or less than the minimum, of the observed cases, in which case a linear regression might not be appropriate. There are algorithms that will never predict a value greater than the max or less than the min, and such an algorithm might be better suited to your case. One of these is k-nearest neighbours, for example:
library(FNN)
> knn.reg(df.Train[1:4], test=df.Test[1:4], y=df.Train[5], k=3)
Prediction:
[1] 3.066667 3.066667 3.066667 2.700000 3.100000
As you can see, the predictions will never go above 4. That said, knn is a local-solution algorithm, so again you need to research whether this is a good approach for your problem and your data. In terms of predictions, though, it definitely meets your conditions. Knn is a very easy-to-understand algorithm that relies on distances between points to calculate predictions.
Hope it helps :)

How do I plot predictions from new data fit with gee, lme, glmer, and gamm4 in R?

I have fit my discrete count data using a variety of functions for comparison. I fit a GEE model using geepack, a linear mixed effect model on the log(count) using lme (nlme), a GLMM using glmer (lme4), and a GAMM using gamm4 (gamm4) in R.
I am interested in comparing these models and would like to plot the expected (predicted) values for a new set of data (predictor variables). My goal is to compare the predicted effects for each model under particular conditions (x variables). Of particular interest is the comparison between marginal (GEE) and conditional estimates.
I think my main problem might be getting the new data in the correct form with the correct labels and attributes and such. I am still very much an R novice and struggle with this stuff (no course on this at my university unfortunately).
I currently have fitted models
gee1 lme1 lmer1 gamm1
and can extract their fixed effect coefficients and standard errors without a problem. I also don't have a problem converting them from the log scale or estimating confidence intervals accounting for the random effects.
I also have my new dataframe newdat which has 365 observations of 23 variables (average environmental data for each day of the year).
I am stuck on how to predict new count estimates from this. I played around with the model.matrix function but couldn't get it to work. For example, I tried:
mm = model.matrix(terms(glmm1), newdat)
# Error in model.frame.default(object, data, xlev = xlev) : object is not a matrix
newdat$pcount = mm %*% fixef(glmm1)
Any suggestions or good references would be greatly appreciated. Can anyone help with the error above?
Getting predictions for lme() and lmer() is documented on http://glmm.wikidot.com/faq
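For fixed-effect predictions with approximate confidence intervals, the FAQ's model.matrix recipe looks roughly like the sketch below. It assumes newdat contains every fixed-effect predictor with the same names and factor levels used in fitting, plus a placeholder response column (a missing column is a common cause of the error above; count stands in for the actual response name, and a log link for the GLMM is assumed for the back-transform):
library(lme4)
newdat$count <- 0                                 # placeholder response; model.matrix needs the column present
mm <- model.matrix(terms(glmm1), newdat)          # fixed-effects design matrix
eta <- drop(mm %*% fixef(glmm1))                  # predictions on the link (log) scale
newdat$pcount <- exp(eta)                         # back-transform to the count scale
pvar <- diag(mm %*% tcrossprod(vcov(glmm1), mm))  # prediction variance, fixed effects only
newdat$lo <- exp(eta - 1.96 * sqrt(pvar))         # approximate 95% confidence band
newdat$hi <- exp(eta + 1.96 * sqrt(pvar))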
