R: Prediction using glm() gamma family

I am using the glm() function in R with link="log" to fit my model. I have read on various websites that fitted() returns values that can be compared directly with the original data, whereas predict() does not.
I am facing a problem while fitting the model and predicting from it.
data<-read.csv("training.csv")
data$X2 <- as.Date(data$X2, format="%m/%d/%Y")
data$X3 <- as.Date(data$X3, format="%m/%d/%Y")
data_subset <- subset(...)
attach(data_subset)
#define variable
Y<-cbind(Y)
X<-cbind(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X14)
# correlation among variables
cor(Y,X)
model <- glm(Y ~ X , data_subset,family=Gamma(link="log"))
summary(model)
detach(data_subset)
validation_data<-read.csv("validation.csv")
validation_data$X2 <- as.Date(validation_data$X2, format="%m/%d/%Y")
validation_data$X3 <- as.Date(validation_data$X3, format="%m/%d/%Y")
attach(validation_data)
predicted_valid<-predict(model, newdata=validation_data)
I am not sure how predict() works with a Gamma family and log link. I want to transform the predicted values so that they can be compared with the original data. Can someone please help me?

Add type="response" to your predict call to get predictions on the response scale rather than on the default link (log) scale. See ?predict.glm.
predict(model, newdata=validation_data, type="response")
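To make the difference between the two scales concrete, here is a minimal sketch on simulated data. The data frames train and valid and the columns x1, x2 and y are hypothetical stand-ins, not the asker's data. The formula references columns by name, on the assumption that this is what lets predict() find the same columns in newdata.

```r
set.seed(1)

# Simulated stand-in for the training data (column names are hypothetical)
train <- data.frame(x1 = runif(200), x2 = runif(200))
mu <- exp(1 + 2 * train$x1 - train$x2)
train$y <- rgamma(200, shape = 5, rate = 5 / mu)   # Gamma response with mean mu

# Reference columns by name so predict() can look them up in newdata
model <- glm(y ~ x1 + x2, data = train, family = Gamma(link = "log"))

# Simulated stand-in for the validation data
valid <- data.frame(x1 = runif(50), x2 = runif(50))

pred_link <- predict(model, newdata = valid)                     # default: link (log) scale
pred_resp <- predict(model, newdata = valid, type = "response")  # scale of the original y

# With a log link, the response-scale prediction is exp() of the link-scale one
all.equal(exp(pred_link), pred_resp)
```

So exp(predict(model, newdata=...)) and predict(..., type="response") should agree for a log link; the second form is preferable because it keeps working if the link function is changed.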

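It looks to me like fitted() doesn't work the way you seem to think it does: it returns the fitted values for the data the model was trained on, and it does not accept new data.
You probably want to use predict() here, since you want to pass it new data.
See ?fitted and ?predict.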
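A small sketch of the relationship between the two functions, again on simulated data (the data frame d and its columns are made up for illustration): on the training rows, fitted() matches predict(..., type="response"), and only predict() can be given new rows.

```r
set.seed(2)

# Simulated training data (hypothetical columns x and y)
d <- data.frame(x = runif(100))
d$y <- rgamma(100, shape = 4, rate = 4 / exp(0.5 + d$x))

fit <- glm(y ~ x, data = d, family = Gamma(link = "log"))

# fitted() has no newdata argument: it returns response-scale values
# for the rows the model was fitted on
fit_vals  <- fitted(fit)
pred_vals <- predict(fit, type = "response")
all.equal(fit_vals, pred_vals)

# New rows have to go through predict()
new_rows <- data.frame(x = c(0.1, 0.9))
new_pred <- predict(fit, newdata = new_rows, type = "response")
```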

I am having trouble with plotting this logistic regression model

Please help me with plotting this model. I tried just using the plot function, but I'm not sure how to incorporate the testing dataset. Thank you.
TravelInsurance <- read.csv(file="TravelInsurancePrediction.csv",header=TRUE)
set.seed(2022)
Training <- sample(c(1:1987),1500,replace=FALSE)
Test <- c(1:1987)[-Training]
TrainData <- TravelInsurance[Training,]
TestData <- TravelInsurance[Test,]
TravIns=as.factor(TravelInsurance$TravelInsurance)
years= TravelInsurance$Age
EMPTY=as.factor(TravelInsurance$Employment.Type)
Grad=as.factor(TravelInsurance$GraduateOrNot)
Income=TravelInsurance$AnnualIncome
Fam=TravelInsurance$FamilyMembers
CD=as.factor(TravelInsurance$ChronicDiseases)
FF=as.factor(TravelInsurance$FrequentFlyer)
logreg = glm(TravIns~ EMPTY+years+Grad+Income+Fam+CD+FF,family = binomial)
Too long for a comment.
A couple of things here:
You divide your dataset into train and test sets, but then build the model using the full dataset.
Passing vectors is not a good way to use glm(...), or any of the R modeling functions. It is better to pass the data frame and reference its columns in the formula.
So, with your dataset, using the data frame's own column names (standalone vectors such as TravIns and EMPTY are not columns of TrainData, so data=TrainData would not apply to them):
logreg <- glm(as.factor(TravelInsurance) ~ Employment.Type + Age + GraduateOrNot +
              AnnualIncome + FamilyMembers + ChronicDiseases + FrequentFlyer,
              family = binomial, data = TrainData)
pred <- predict(logreg, newdata = TestData, type = "response")
As this is a logistic regression, the responses are probabilities (that someone buys travel insurance?). There are several ways to assess goodness of fit. One visualization uses receiver operating characteristic (ROC) curves.
library(pROC)
roc(TestData$TravelInsurance, pred, plot = TRUE)
The area under the ROC curve (the "AUC") is a measure of goodness of fit; 1.0 is perfect, 0.5 is no better than random chance. See the docs: ?roc and ?auc
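For readers without the dataset, the same train/test workflow can be sketched on simulated data. Everything here (the data frame d, the columns age, income and bought, the coefficients) is invented for illustration. The AUC is computed in base R with the rank-sum (Mann-Whitney) formula rather than with pROC, so the sketch has no package dependencies.

```r
set.seed(2022)
n <- 500

# Simulated stand-in for the insurance data (hypothetical columns)
d <- data.frame(age = rnorm(n, 30, 5), income = rnorm(n, 50, 10))
d$bought <- rbinom(n, 1, plogis(-11 + 0.2 * d$age + 0.1 * d$income))

# Split the rows once, then fit only on the training rows
train_idx <- sample(n, 350)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

fit  <- glm(bought ~ age + income, family = binomial, data = train)
prob <- predict(fit, newdata = test, type = "response")  # predicted probabilities

# AUC via the rank-sum formula: the probability that a randomly chosen
# positive case is ranked above a randomly chosen negative case
r   <- rank(prob)
n1  <- sum(test$bought == 1)
n0  <- sum(test$bought == 0)
auc <- (sum(r[test$bought == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```

Because the formula names columns and data=train is supplied, predict() picks up the same columns from test, and prob has one entry per test row.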

Compare predicted values and actual values with caret

I use caret to get a prediction for electricity usage per day. I am wondering if there is a way to plot both the predicted data and the actual data to inspect the difference.
I checked this post, but I didn't find the solution I want:
how to plot actual and predicted values?
Assume two vectors of numbers.
actual <- c(1, 2, 3, 4)
predicted <- c(1.3, 2, 3.1, 4)
Is there a simple way to plot the two lines to observe the difference?
Thanks!
You mentioned using caret, so there must be some model you're trying to fit. If that is correct, a model object exists and you could use the plotObsVsPred function. Here is a simplified example from the help page:
# regression example
library(caret)
library(mlbench)
data(BostonHousing)
# fit a partial least squares model on the first 100 rows
plsFit <- train(BostonHousing[1:100, -c(4, 14)],
                BostonHousing$medv[1:100],
                "pls")
# collect observed and predicted values for training and test rows
predVals <- extractPrediction(list(plsFit),
                              testX = BostonHousing[101:200, -c(4, 14)],
                              testY = BostonHousing$medv[101:200])
plotObsVsPred(predVals)
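If only two plain vectors are available, as in the question, a base-graphics sketch may be enough. The vectors are the ones from the question; the variable name errors and the plot styling are illustrative choices.

```r
actual    <- c(1, 2, 3, 4)
predicted <- c(1.3, 2, 3.1, 4)

# Draw both series on one set of axes, with a y-range that covers both
plot(actual, type = "b", col = "black",
     ylim = range(actual, predicted),
     xlab = "Observation", ylab = "Value")
lines(predicted, type = "b", col = "red", lty = 2)
legend("topleft", legend = c("actual", "predicted"),
       col = c("black", "red"), lty = c(1, 2), pch = 1)

# The pointwise differences are often easier to inspect than two overlapping lines
errors <- predicted - actual
errors
```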

R: glmrob can't predict models with dropped co-linear columns, while glm can?

I'm learning to implement robust glms in R, but can't figure out why I am unable to get glmrob to predict values from my regression models when I have a model where some columns are dropped due to co-linearity. Specifically when I use the predict function to predict values from a glmrob, it always gives NA for all values. I don't observe this when predicting values from the same data & model using glm. It doesn't seem to matter what data I use -- as long as there is a NA coefficient in the fitted model (and the NA isn't the last coefficient in the coefficient vector), the predict does not work.
This behavior holds for all datasets and models I have tried where an internal column is dropped due to co-linearity. I include a fake data set where two columns are dropped from the model, which gives two NAs in the coefficient list. Both glm and glmrob give nearly identical coefficients, yet predict only works with the glm model. So my question is: what don't I understand about robust regression that would prevent my glmrob models from generating predicted values?
library(robustbase)
#Make fake data with two categorical predictors
df <- data.frame("category" = rep(c("A","B","C"),each=6))
df$location <- rep(1:6,each=3)
val <- rep(c(500,50,5000),each=6)+rep(c(50,100,25,200,100,1),each=3)
df$value <- rpois(NROW(df),val)
#note that predict works if we omit the newdata parameter. However I need the newdata param
#so I use the original dataframe here as a stand-in.
mod <- glm(value ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) # works fine
mod <- glmrob(value ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) #predicts NA for all values
I've been digging into this and have concluded that the problem does not lie in my understanding of robust regression, but rather the problem lies with a bug in the robustbase package. The predict.lmrob function does not correctly pick the necessary coefficients from the model before the prediction. It needs to pick the first x non-NA coefficients (where x=rank of the model matrix). Instead it merely picks the first x coefficients without checking if they are NA. This explains why this problem only surfaces for models where the NA isn't the last coefficient in the coefficient vector.
To fix this, I copied the predict.lmrob source using:
getAnywhere(predict.lmrob)
and created my own replacement function. In this function I made a single modification to the code:
...
p <- object$rank
if (is.null(p)) {
    df <- Inf
    p <- sum(!is.na(coef(object)))
    #piv <- seq_len(p) # old code
    piv <- which(!is.na(coef(object))) # new code
}
else {
    p1 <- seq_len(p)
    piv <- if (p)
        qr(object)$pivot[p1]
}
...
I've run a few hundred datasets using this change and it has worked well.
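The difference between the two selections can be seen without robustbase. The sketch below uses an ordinary lm() fit with a deliberately collinear column (all names are hypothetical) to produce an NA coefficient that is not the last one, and then compares the two ways of picking indices. It only illustrates the indexing logic described above, not the package's internals.

```r
set.seed(3)

# x2 is an exact multiple of x1, so its coefficient cannot be estimated
d <- data.frame(x1 = rnorm(20), x3 = rnorm(20))
d$x2 <- 2 * d$x1
d$y  <- 1 + d$x1 + d$x3 + rnorm(20)

fit <- lm(y ~ x1 + x2 + x3, data = d)
cf  <- coef(fit)            # order: (Intercept), x1, x2 (NA), x3
p   <- sum(!is.na(cf))      # number of estimable coefficients

first_p <- seq_len(p)           # old selection: simply positions 1..p
non_na  <- which(!is.na(cf))    # new selection: positions of non-NA coefficients

anyNA(cf[first_p])   # TRUE: the NA coefficient is picked up
anyNA(cf[non_na])    # FALSE: only estimable coefficients are used
```

When the NA happens to be the last coefficient, both selections coincide, which matches the observation that the problem only appears for internal NAs.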

R Prediction on a Linear Regression Model

I'm sure this is something that can be done, just not sure how!
I have a dataset of around 500 rows (CSV) that shows footballers' match stats (e.g. passes, shots on target, etc.). I have some of their salaries (around 10) and I'm trying to predict the remaining salaries using a linear regression equation.
In the code below, if y is salary, is there a way in R to essentially autopopulate what the rest of the salaries might be, based on the ten salaries I do have?
lm(y ~ x1 + x2 +x3)
Any help would be much appreciated.
This is what the predict function does.
Note that you don't need to call predict.lm explicitly. Because the result of a call to lm is an object with class "lm", R "knows" to use predict.lm when you call predict on it.
E.g.:
lm1 <- lm(y ~ x1 + x2 +x3)
y.fitted <- predict(lm1)
You should also be able to test the predictive accuracy of your model using cross-validation with the function cv.lm in the DAAG package. Cross-validation repeatedly fits the model on one part of the data (the training data) and tests it on the part that was held out (the test data).
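To fill in the unknown salaries specifically, predict() needs the rows without a salary passed as newdata. A minimal sketch on simulated data follows; the data frame players and the columns passes, shots, salary and salary_pred are hypothetical. With only ten known salaries the estimates will be rough, so treat the output as indicative.

```r
set.seed(4)

# Simulated stand-in: 50 players, salary known for the first 10 only
players <- data.frame(passes = rpois(50, 40), shots = rpois(50, 3))
players$salary <- NA_real_
known <- 1:10
players$salary[known] <- 20 + 0.5 * players$passes[known] +
  5 * players$shots[known] + rnorm(10)

# lm() drops rows with a missing response by default, so only the
# 10 known salaries are used for fitting
fit <- lm(salary ~ passes + shots, data = players)

# Predict for the rows where the salary is missing
unknown <- is.na(players$salary)
players$salary_pred <- NA_real_
players$salary_pred[unknown] <- predict(fit, newdata = players[unknown, ])
head(players[unknown, ])
```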

How to get Loess function for my data in R?

I have some data that I plotted using R.
After that, I drew the loess fit for that data on the same plot.
Here is the code:
data <- read.table("D:/data.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
ur <- subset(data, select = c(users,responseTime))
ur <- ur[with(ur, order(users, responseTime)), ]
plot(ur, xlab="Users", ylab="Response Time (ms)")
lines(ur)
loess_fit <- loess(responseTime ~ users, ur)
lines(ur$users, predict(loess_fit), col = "blue")
Here's my plot's image:
How can I get the function of this regression?
For example: responseTime = 68 + 45 * users.
Thanks.
You can use the loess_fit object from your code to predict the response time. If you want to estimate the average response time for 230 users, you could do:
predict(loess_fit, newdata=data.frame(users=230))
Here is an interesting blog post on this subject.
EDIT: If you want to make predictions for values outside the range of your data, you need a theory or further assumptions. The simplest assumption would be a linear fit,
lm_fit <- lm(responseTime ~ users, data=ur)
predict(lm_fit, newdata=data.frame(users=400))
However, your data may show heteroscedasticity (non-constant variance) and non-normal residuals. You might want to check whether that is the case. If it is, then a robust linear fitting procedure such as rlm from the package MASS, or a generalized linear model (glm), might be worth a try. I am not an expert on that; maybe someone else here or at Cross Validated can provide better help.
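The difference between predicting inside and outside the observed range can be sketched on simulated data. The data frame below reuses the names ur, users and responseTime from the question, but the values are invented. With its default settings, predict() on a loess fit returns NA for points outside the range of the data, while the linear fit extrapolates.

```r
set.seed(5)

# Simulated stand-in for the users / response time data (users from 10 to 300)
ur <- data.frame(users = seq(10, 300, by = 10))
ur$responseTime <- 68 + 45 * ur$users + rnorm(nrow(ur), sd = 200)

loess_fit <- loess(responseTime ~ users, data = ur)

# Inside the observed range: a numeric prediction
inside <- predict(loess_fit, newdata = data.frame(users = 230))

# Outside the observed range: NA with the default loess settings
outside <- predict(loess_fit, newdata = data.frame(users = 400))

# A linear model extrapolates, under the assumption that the trend stays linear
lm_fit <- lm(responseTime ~ users, data = ur)
lm_out <- predict(lm_fit, newdata = data.frame(users = 400))

c(inside = unname(inside), outside = unname(outside), lm_out = unname(lm_out))
```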
The loess.demo function in the TeachingDemos package shows the logic underlying the loess fit. This can help you understand what is going on and why there is not a simple prediction function. However, for predicting, there is a predict function that works with loess fits to create the prediction. You can also find the linear equation that will predict for a specific value of x (but it will be different for each value of x you may want to predict for).
