Generalized additive models for calibration in R

I'm working on probability calibration, using a probability-mapping approach based on generalized additive models.
The algorithm I wrote is:
probMapping = function(x, y, datax, datay) {
  if (length(x) < length(y)) stop("train smaller than test")
  if (length(datax) < length(datay)) stop("train smaller than test")
  datax$prob = x # trainset: data and raw probabilities
  datay$prob = y # testset: data and raw probabilities
  prob_map = gam(Target ~ prob, data = datax, familiy = binomial, trace = TRUE)
  prob_map_prob = predict(prob_map, newdata = datay, type = "prob")
  # return(str(datax))
  return(prob_map_prob)
}
The package I'm using is mgcv.
x - prediction on train dataset
y - prediction on test dataset
datax - traindata
datay - testdata
Problems:
The output values are not between 0 and 1
I get the following warning message:
In predict.gam(prob_map, newdata = datay, type = "prob") :
Unknown type, reset to terms.

The warning is telling you that predict.gam doesn't recognize the value you passed to the type argument. Since it didn't recognize "prob", it reset type to "terms", exactly as the warning says (the documented default for type is actually "link").
Note that predict.gam with type = "terms" returns information about the model terms, not probabilities. Hence the output values are not between 0 and 1.
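What you want is type = "response", which returns predictions on the probability scale for a binomial model. A minimal sketch of the corrected lines (note also a second bug: family is misspelled as familiy in the call above, so gam() does not see it and falls back to the default Gaussian family):
prob_map = gam(Target ~ prob, data = datax, family = binomial)
# predictions on the response (probability) scale: values in [0, 1]
prob_map_prob = predict(prob_map, newdata = datay, type = "response")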
For more information, see ?predict.gam in the mgcv documentation.

Related

Trouble in GAM model in R software

I am trying to run the following code in R:
m <- gam(Flp_pop ~ s(Flp_CO, bs = "cr", k = 30), data = data, family = poisson, method = "REML")
My dataset is like this: [screenshot of the dataset omitted]
But when I try to execute, I get this error message:
"Error in if (abs(old.score - score) > score.scale * conv.tol) { :
missing value where TRUE/FALSE needed
In addition: There were 50 or more warnings (use warnings() to see the first 50)"
I am very new to R, maybe it is a very basic question. But does anyone know why this is happening?
Thanks!
The Poisson distribution has support on the non-negative integers, and you are passing a continuous variable as the response. Here's an example with simulated data:
library("mgcv")
library("gratia")
library("dplyr")
df <- data_sim("eg1", seed = 2) %>% # simulate Gaussian response
  mutate(yabs = abs(y))             # make y non-negative
mp <- gam(yabs ~ s(x2, bs = "cr"), data = df,
          family = poisson, method = "REML")
# fails
which reproduces the error you saw
Error in if (abs(old.score - score) > score.scale * conv.tol) { :
missing value where TRUE/FALSE needed
In addition: There were 50 or more warnings (use warnings() to see the first 50)
The warnings are of the form:
> warnings()[1]
Warning message:
In dpois(y, y, log = TRUE) : non-integer x = 7.384012
This indicates the problem: the model evaluates the Poisson probability mass of your response data given the estimated model, and at a non-integer value like the one indicated that mass is exactly zero, which yields -Inf on the log scale and derails the convergence check.
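You can see this directly; the value below is the one from the warning:
# non-integer x has zero probability mass under the Poisson,
# so the log-density is -Inf, which poisons the REML score
dpois(7.384012, lambda = 7.384012)              # 0, with a warning
dpois(7.384012, lambda = 7.384012, log = TRUE)  # -Inf, with a warning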
If we'd passed the original Gaussian variable as the response, which includes negative values, the function would have errored out earlier:
mp <- gam(y ~ s(x2, bs = "cr"), data = df,
family = poisson, method = "REML")
which raises this error:
Error in eval(family$initialize) :
negative values not allowed for the 'Poisson' family
An immediate, but not necessarily advisable, solution is just to use the quasipoisson family:
mq <- gam(yabs ~ s(x2, bs = "cr"), data = df,
          family = quasipoisson, method = "REML")
which uses the same mean-variance relationship as the Poisson distribution but not the actual distribution, so we can get away with abusing it.
Better would be to ask yourself why you are trying to fit a model that is ostensibly for counts to a response that is a continuous (non-negative) variable.
If the answer is that you had a count but then normalised it in some way (say, by dividing by some measure of effort, like area surveyed or length of observation time), then you should add an offset of the form + offset(log(effort_var)) to the model formula and use the original, non-normalised integer variable as the response, as sketched below.
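For illustration only, assuming a hypothetical data frame your_data with a raw count counts and an effort variable effort_var (all three names are made up here):
# model the raw counts; effort enters as an offset on the log scale
m_off <- gam(counts ~ s(x2, bs = "cr") + offset(log(effort_var)),
             data = your_data, family = poisson, method = "REML")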
If you really have a continuous response and the Poisson was an oversight, try fitting with family = Gamma(link = "log") or family = tw().
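Continuing the simulated example from above, both alternatives fit without complaint (a sketch):
# Gamma with a log link suits strictly positive continuous responses
mg <- gam(yabs ~ s(x2, bs = "cr"), data = df,
          family = Gamma(link = "log"), method = "REML")
# a Tweedie family, which can also accommodate exact zeros
mt <- gam(yabs ~ s(x2, bs = "cr"), data = df,
          family = tw(), method = "REML")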
If it's something else, you should edit your question to include that information; perhaps we can help here, or the question could be migrated to CrossValidated if the issue is more statistical in nature.

How do we make a model in R using more than one row

Below is my R code to create a model to predict the prices of diamonds from the diamonds dataset. I am not able to create the model when taking the log of each variable; without the log transform I get a poor model with inaccurate predicted prices. I am also pasting the error shown and a link to the dataset for reference.
The error is given below:
> mod =(lm(log(price)~log(carat)+log(x)+log(y)+log(z),data=train))
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'x'
Link for the dataset is attached here: https://www.kaggle.com/shivam2503/diamonds
Here is the code:
setwd ("C:/akash/study videos/virginia")
akash = read.csv("diamonds.csv")
#summary(akash)
ind = sample(2, nrow(akash),replace = TRUE , prob = c(0.8,0.2))
train = akash[ind==1,]
test = akash[ind==2,]
mod =(lm(log(price)~log(carat)+log(x)+log(y)+log(z),data=train))
summary(mod)
predicted = predict(mod,newdata = test)
mon = round(exp(predicted),0)
head(mon)
#head(test)
#View(akash)
Your model fails because the minimum value of your variables x, y and z is 0, so when you log-transform these variables you obtain -Inf:
lapply(c("x", "y", "z"), function(x) summary(log(diamonds[[x]])))
You can try to log-transform just the outcome, remove the zero values before the transformation, or simply change the model.
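A minimal sketch of the second option, dropping the rows that would produce -Inf before fitting:
# keep only rows where all dimensions are strictly positive,
# so log() never returns -Inf
train_pos <- subset(train, x > 0 & y > 0 & z > 0)
mod <- lm(log(price) ~ log(carat) + log(x) + log(y) + log(z),
          data = train_pos)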
To compare options, here I look at the RMSE for the lm with no transformation, the lm with the log(price) transformation, and a simple random forest model from the package ranger. I'm using caret to get the same modelling interface for all three (by default caret::train performs 25 bootstrap resamples to choose the best parameters for the given model, so in this example only the random forest has tuning parameters).
library(ggplot2)#for "diamonds" dataset
data("diamonds")
set.seed(5)
ind = sample(2, nrow(diamonds),replace = TRUE , prob = c(0.8,0.2))
train = diamonds[ind==1,]
test = diamonds[ind==2,]
library(caret)
# name the caret fits so they don't mask base functions like stats::lm
rf <- train(price ~ carat + x + y + z, data = train, method = "ranger")
lm_fit <- train(price ~ carat + x + y + z, data = train, method = "lm")
lm_log <- train(log(price) ~ carat + x + y + z, data = train, method = "lm")
RMSE(predict(rf, test), test$price) / mean(test$price) * 100
RMSE(predict(lm_fit, test), test$price) / mean(test$price) * 100
RMSE(exp(predict(lm_log, test)), test$price) / mean(test$price) * 100
which gives me:
[1] 35.73012
[1] 40.2437
[1] 45.92143

Cannot generate predictions in mgcv when using discretization (discrete=T)

I am fitting a model with a random site-level effect using a generalized additive model, implemented in the mgcv package for R. I had been doing this with the function gam(); however, to speed things up I need to shift to the bam() framework, which is basically the same as gam() but faster. I sped up fitting further by passing the options bam(nthreads = N, discrete = T), where N is the number of cores on my machine. However, when I use the discretization option and then try to make predictions with my model on new data, while ignoring the random effect, I consistently get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
#generate data.
N <- 10000
x <- runif(N,0,1)
y <- (0.5*x / (x + 0.2)) + rnorm(N)*0.1 #non-linear relationship between x and y.
#uninformative random effect.
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
#fit models.
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) #this one takes ~1 minute to fit, rest faster.
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
#make predictions on new data.
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model which uses discretization throws this error (which the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.guaranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv with this reproducible example so he can take a look; see ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of not specifying random.x in newdata: mgcv then goes looking for random.x and finds the original 10000-row vector in the global environment. You should really gather those variables into a data frame and use the data argument when fitting your models, and try not to leave similarly named objects lying around in your global environment.
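A sketch of that recommended setup, using the objects created in the question's code:
# gather the variables into one data frame and pass it via `data`
dat <- data.frame(y = y, x = x, random.x = random.x)
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), data = dat,
            discrete = TRUE, nthreads = 2)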

glmnet, multinomial prediction returned object

I am attempting to do classification prediction using glmnet; however, I cannot deduce what the object returned by predict is supposed to represent. Using the code
mlogit_r <- glmnet(train_x,
                   cbind(cns_label, renal_label, breast_label, nsclc_label,
                         ovarian_label, leuk_label, colon_label, mela_label),
                   family = "multinomial", alpha = 0)
pred <- predict(mlogit_r, train_x, type="class")
with train_x being 57 (n) x 6830 (p), and the y object being 57 (n) x 8 (number of classes). The returned prediction object is a 57 x 100 matrix of labels. Which of these are the predicted labels?
The documentation does not make this clear, as it just says:
The object returned depends on the . . . argument which is passed on to the
predict method for glmnet objects.
When you fit a glmnet model without specifying lambda, a sequence of (by default) 100 lambda values is fit. When you call predict on such a model without specifying lambda, predictions are made for every lambda, hence you receive 100 different predictions from 100 different models.
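You can confirm this with the model and predictions from the question (assuming mlogit_r and pred as defined above):
dim(pred)               # 57 x 100: one column of class labels per lambda
length(mlogit_r$lambda) # 100: the lambda sequence glmnet chose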
Usually one runs cross-validation to choose the single best lambda, and then predicts using it:
library(glmnet)
data(iris)
let's use 120 rows for training:
z <- sample(1:nrow(iris), 120)
now run 5-fold cross-validation using misclassification error to choose the best lambda:
cv_fit <- cv.glmnet(as.matrix(iris[z, -5]),
                    iris[z, 5],
                    nfolds = 5,
                    type.measure = "class",
                    alpha = 0,
                    grouped = FALSE,
                    family = "multinomial")
plot(cv_fit)
Here you can see lambda.min, corresponding to the dashed line on the left (the lambda with the lowest error in 5-fold cross-validation), and lambda.1se (the largest lambda whose error is within one standard error of that minimum), slightly to the right.
These values are in:
cv_fit$lambda.min
#[1] 0.05560455
cv_fit$lambda.1se
#[1] 0.09717054
Now that you know the best lambda, you can either build a model on 100 lambda values:
fit <- glmnet(as.matrix(iris[z, -5]),
              iris[z, 5],
              alpha = 0,
              family = "multinomial")
and predict on a specific one:
predict(fit, as.matrix(iris[-z,-5]), s = cv_fit$lambda.min, type = "class")
or build a model on one lambda
fit1 <- glmnet(as.matrix(iris[z, -5]),
               iris[z, 5],
               alpha = 0,
               lambda = cv_fit$lambda.min,
               family = "multinomial")
and predict without specifying lambda:
all.equal(as.vector(predict(fit, as.matrix(iris[-z, -5]), s = cv_fit$lambda.min, type = "class")),
          as.vector(predict(fit1, as.matrix(iris[-z, -5]), type = "class")))
# [1] TRUE
To see how much the coefficients were constrained, you can plot the model and mark the lambda used:
plot(fit, xvar = "lambda")
abline(v = log(cv_fit$lambda.min), lty = 2)

predicting outcome with a model in R

I am trying to do a simple prediction using linear regression.
I have a data.frame where some of the items are missing a price (and are therefore marked NA).
This apparently doesn't work:
#Simple LR
fit <- lm(Price ~ Par1 + Par2 + Par3, data=combi[!is.na(combi$Price),])
Prediction <- predict(fit, data = combi[is.na(combi$Price),], OOB = TRUE, type = "response")
What should I put instead of data = combi[is.na(combi$Price),]?
Change data to newdata. Look at ?predict.lm to see what arguments predict can take; additional arguments are ignored. So in your case data (and OOB) are ignored, and the default of returning predictions on the training data is used.
Prediction <- predict(fit, newdata = combi[is.na(combi$Price),])
identical(predict(fit), predict(fit, data = combi[is.na(combi$Price),]))
## [1] TRUE
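If the goal is then to fill in the missing prices, you could assign the predictions back into the data frame (a sketch):
# impute the missing prices with the model's predictions
missing_price <- is.na(combi$Price)
combi$Price[missing_price] <- predict(fit, newdata = combi[missing_price, ])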
