How to interpolate and `predict` using mgcv::gam? - r

I began by learning how to use splines to interpolate a 1-dimensional function.
model = spline(bdp[,4]~bdp[,1])
I could then use
predict(model, c(0))
to predict function value in point 0.
Then I searched the Internet for a way to fit a spline to 3-dimensional data and came across an answer on Stack Overflow suggesting that mgcv::gam is the best choice.
And so I tried:
model=gam(bdp[,4]~s(bdp[,1],bdp[,2],bdp[,3]))
and then I did:
predict(model, newdata=c(0,0,0), type="response")
hoping that it will return a value of spline interpolation for point (0,0,0).
It calculated for a while and returned lots of multidimensional data that I could not understand.
I must be doing something wrong. What do I do to receive a value for a single point from gam object? And, just to be sure, can you agree/disagree that gam is the right choice to interpolate splines for 3D data or would you suggest something else?
I'm adding a reproducible example.
This is a data file (please unpack in c:/r/) https://www.sendspace.com/file/b4mazl
# install.packages("mgcv")
library(mgcv)
bdp = read.table("c:/r/temp_bdp.csv")
bdg=gam(bdp[,4]~s(bdp[,1],bdp[,2],bdp[,3]))
#this returns lots of data, not just function value that I wanted.
predict(bdg, newdata=data.frame(0,0,0,0), type="response")
Minimal reproducible example:
tmp = t(matrix(runif(4*200),4))
tmpgam=gam(tmp[,4]~s(tmp[,1],tmp[,2],tmp[,3]))
predict(tmpgam, newdata=data.frame(0,0,0,0), type="response")
For
predict(bdg, newdata=data.frame(0,0,0,0), type="response")
it returns a lot of numbers and warns that newdata didn't have enough data
for
predict(bdg, c(0,0,0,0), type="response")
it returns nothing and also warns about the same.

With nearly all types of models you fit, if you plan to use the predict function, it's best to use a "proper" formula with column names rather than matrix/data.frame slices. The reason is that when predict runs, it matches the values in newdata to the model terms by name, so the names must match exactly. When you index the data.frame like that, it creates weird term names in the model. So the best way to fit the model and predict is
bdg <- gam(V4~s(V1,V2,V3), data=bdp)
predict(bdg, newdata=data.frame(V1=0, V2=0, V3=0))
# 1
# 85431440244
That's assuming
names(bdp)
# [1] "V1" "V2" "V3" "V4"
So here we fit with "V1","V2","V3" and newdata has columns "V1","V2" and "V3"
I've focused only on the R-coding part. The question of whether this is an appropriate analysis is better suited to https://stats.stackexchange.com/
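As a self-contained sketch of the same idea, here is a version built on simulated data like the question's minimal example. The object names tmp, fit and pred are made up for illustration, and the column names V1..V4 are simply what as.data.frame() assigns to an unnamed matrix.

```r
library(mgcv)

set.seed(1)  # arbitrary seed, only for repeatability
# 200 rows of uniform noise; as.data.frame() names the columns V1..V4
tmp <- as.data.frame(t(matrix(runif(4 * 200), 4)))

# Fit with column names and the data argument, not tmp[, 1] etc.
fit <- gam(V4 ~ s(V1, V2, V3), data = tmp)

# newdata carries the same names as the covariates inside s()
pred <- predict(fit, newdata = data.frame(V1 = 0, V2 = 0, V3 = 0),
                type = "response")
length(pred)  # one value for the one requested point
```

Because the data here is pure noise, the predicted value itself is meaningless; the point is only that a single-row newdata with matching names yields a single prediction.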

Related

Why is predict in R taking Train data instead of Test data? [duplicate]

Working in R to develop regression models, I have something akin to this:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm, testset$independent)
and every single time, I get a mysterious error from R:
Warning message:
'newdata' had 34 rows but variables found have 142 rows
which essentially translates into R not being able to find the independent column of the testset data.frame. This is simply because the exact name from the right-hand side of the formula in lm must be there in predict. To fix it, I can do this:
tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)
tempset = testset
c_pred = predict(c_lm, tempset$independent)
or some similar variation, but this is really sloppy, in my opinion.
Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?
No, No, No, No, No, No! Do not use the formula interface in the way you are doing if you want all the other sugar that comes with model formulas. You wrote:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
You repeat trainingset twice, which is a waste of fingers/time, redundant, and not least causing you the problem that you are hitting. When you now call predict, it will be looking for a variable in testset that has the name trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula but is efficient and also works properly with predict()
c_lm = lm(dependent ~ independent, data = trainingset)
Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
An additional reason to write formulas as I show them, is legibility. Getting the object name out of the formula allows you to more easily see what the model is.
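A minimal sketch of that workflow on made-up data follows. The simulated values and the 142/34 split are assumptions chosen only to mirror the row counts in the warning message; trainingset, testset, dependent and independent are the names from the question.

```r
set.seed(42)  # arbitrary seed, only for repeatability
dat <- data.frame(independent = rnorm(176))
dat$dependent <- 2 * dat$independent + rnorm(176)

trainingset <- dat[1:142, ]
testset     <- dat[143:176, ]

# Bare column names in the formula; the data frame goes in `data`
c_lm <- lm(dependent ~ independent, data = trainingset)

# predict() finds `independent` in testset by name
c_pred <- predict(c_lm, newdata = testset)

# One prediction per test row, and no row-count warning
length(c_pred) == nrow(testset)
```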

R Cross Validation lm predict function [duplicate]

I am trying to convert Absorbance (Abs) values to Concentration (ng/mL), based on an established linear model & standard curve. I planned to do this by using the predict() function. I am having trouble getting predict() to return the desired results. Here is a sample of my code:
Standards<-data.frame(ng_mL=c(0,0.4,1,4),
Abs550nm=c(1.7535,1.5896,1.4285,0.9362))
LM.2<-lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Abs<-c(1.7812,1.7309,1.3537,1.6757,1.7409,1.7875,1.7533,1.8169,1.753,1.6721,1.7036,1.6707,
0.3903,0.3362,0.2886,0.281,0.3596,0.4122,0.218,0.2331,1.3292,1.2734)
predict(object=LM.2,
newdata=data.frame(Concentration=Abs[1]))#using Abs[1] as an example, but I eventually want predictions for all values in Abs
Running those last lines gives this output:
> predict(object=LM.2,
+ newdata=data.frame(Concentration=Abs[1]))
1 2 3 4
0.5338437 0.4731341 0.3820697 -0.0732525
Warning message:
'newdata' had 1 row but variables found have 4 rows
This does not seem to be the output I want. I am trying to get a single predicted Concentration value for each Absorbance (Abs) entry. It would be nice to be able to predict all of the entries at once and add them to an existing data frame, but I can't even get it to give me a single value correctly. I've read many threads on here, webpages found on Google, and all of the help files, and for the life of me I cannot understand what is going on with this function. Any help would be appreciated, thanks.
You must have a variable in newdata that has the same name as that used in the model formula used to fit the model initially.
You have two errors:
You don't use a variable in newdata with the same name as the covariate used to fit the model, and
You make the problem much more difficult to resolve because you abuse the formula interface.
Don't fit your model like this:
mod <- lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
fit your model like this
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)
Isn't that so much more readable?
To predict you would need a data frame with a variable ng_mL:
predict(mod, newdata = data.frame(ng_mL = c(0.5, 1.2)))
Now you may have a third error. You appear to be trying to predict with new values of Absorbance, but the way you fitted the model, Absorbance is the response variable. You would need to supply new values for ng_mL.
The behaviour you are seeing is what happens when R can't find a correctly-named variable in newdata; it returns the fitted values from the model or the predictions at the observed data.
This makes me think you have the formula back to front. Did you mean:
mod2 <- lm(ng_mL ~ log(Abs550nm), data = Standards)
?? In which case, you'd need
predict(mod2, newdata = data.frame(Abs550nm = c(1.7812,1.7309)))
say. Note you don't need to include the log() bit in the name. R recognises that as a function and applies it to the variable Abs550nm for you.
If the model really is log(Abs550nm) ~ ng_mL and you want to find values of ng_mL for new values of Abs550nm you'll need to invert the fitted model in some way.
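One simple way to do that inversion, sketched here under the assumption that the fit is the straight line log(Abs550nm) = b0 + b1 * ng_mL, is to solve for ng_mL algebraically. The helper name invert_fit is hypothetical, and this gives point estimates only, with no uncertainty attached.

```r
Standards <- data.frame(ng_mL    = c(0, 0.4, 1, 4),
                        Abs550nm = c(1.7535, 1.5896, 1.4285, 0.9362))

mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)
b <- coef(mod)

# Hypothetical helper: solve log(abs) = b0 + b1 * ng_mL for ng_mL
invert_fit <- function(abs) {
  (log(abs) - b[["(Intercept)"]]) / b[["ng_mL"]]
}

new_abs <- c(1.7812, 1.3537)
est <- invert_fit(new_abs)

# Round trip: predicting at the estimated concentrations recovers log(new_abs)
all.equal(unname(predict(mod, newdata = data.frame(ng_mL = est))),
          log(new_abs))
```

Absorbances outside the range of the standards can invert to concentrations outside the calibrated range (including negative ones), so the estimates are only trustworthy within the standard curve.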

Fit and calibrate data frame via factors

First of all, I use RStudio.
I have a data frame (APD) and I would like to fit a model separately for each level of the factor Serial_number. The fit is an lm fit. Then I would like to use this fit to do a calibration (calibrate() from the investr package).
Here is an example picture of my data:
Here's the data: Data
Currently I use following lines to fit via Serial_number:
Coefficients<- APD %>%
group_by(Serial_number) %>%
do(tidy(fit<- lm(log(log(Amplification)) ~ Voltage_transformed, .)))
But here, I cannot apply the calibrate()-function. Calibrate function needs an object, that inherits from "lm". And tidy only works for S3/S4-objects.
Do you have an idea?
In your posted code, you're trying to rbind the predicted values from each model, not the coefficients. The function for coefficients is just coefficients(object).
I would also suggest un-nesting your code, since that makes it hard to read and change later on. Here are two generalized functions (each make assumptions, so edit as needed):
lm_by_variable <- function(data_, formula_, byvar) {
  by(
    data_,
    data_[[byvar]],
    FUN = lm,
    formula = formula_,
    simplify = FALSE
  )
}
combine_coefficients <- function(fit_list) {
  all_coefficients <- lapply(fit_list, coefficients)
  do.call('rbind', all_coefficients)
}
lm_by_variable(...) should be pretty self-evident: group by byvar, use lm with the given formula on each subset, and don't simplify the result. Simplifying results is really only useful for interactive work. In a script, it's better to know exactly what will be returned. In this case, a list.
The next function, combine_coefficients(...) returns a matrix of the fitted coefficients. It assumes every fitted model in fit_list has the same terms. We could add logic to make it more robust, but that doesn't seem necessary in this case.
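A usage sketch of the two functions follows. The built-in mtcars data, the grouping column cyl and the formula mpg ~ wt are stand-ins chosen for illustration, not the asker's APD data; the function definitions are repeated so the block runs on its own.

```r
lm_by_variable <- function(data_, formula_, byvar) {
  by(data_, data_[[byvar]], FUN = lm, formula = formula_, simplify = FALSE)
}
combine_coefficients <- function(fit_list) {
  all_coefficients <- lapply(fit_list, coefficients)
  do.call('rbind', all_coefficients)
}

# One lm fit per cylinder count (stand-in for Serial_number)
fits <- lm_by_variable(mtcars, mpg ~ wt, "cyl")

# Each element is a complete fitted model, not a tidied summary
inherits(fits[[1]], "lm")

# One row of coefficients per group
coefs <- combine_coefficients(fits)
dim(coefs)
```

Keeping the full lm objects in the list, rather than only their tidied coefficients, is what leaves the door open to passing an individual fit on to functions that expect an object inheriting from "lm".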

Feeding newdata to R predict function

R's predict function can take a newdata parameter and its document reads:
newdata An optional data frame in which to look for variables with which to predict. If omitted, the fitted values are used.
But I found that it is not totally true depending on how the model is fit. For instance, following code works as expected:
x <- rnorm(200, sd=10)
y <- x + rnorm(200, sd=1)
data <- data.frame(x, y)
train = sample(1:length(x), size=length(x)/2, replace=F)
dataTrain <- data[train,]
dataTest <- data[-train,]
m <- lm(y ~ x, data=dataTrain)
head(predict(m,type="response"))
head(predict(m,newdata=dataTest,type="response"))
But if the model is fit as such:
m2 <- lm(dataTrain$y ~ dataTrain$x)
head(predict(m2,type="response"))
head(predict(m2,newdata=dataTest,type="response"))
The last two lines will produce exactly the same result. The predict function effectively ignores the newdata parameter, i.e. it can't really compute the prediction on new data at all.
The culprit, of course, is lm(y ~ x, data=dataTrain) versus lm(dataTrain$y ~ dataTrain$x). But I didn't find any document that mentioned the difference between these two. Is it a known issue?
I'm using R 2.15.2.
See ?predict.lm and the Note section, which I quote below:
Note:
Variables are first looked for in ‘newdata’ and then searched for
in the usual way (which will include the environment of the
formula used in the fit). A warning will be given if the
variables found are not of the same length as those in ‘newdata’
if it was supplied.
Whilst it doesn't state the behaviour in terms of "same name" etc, as far as the formula is concerned the terms you passed in to it were of the form foo$var and there are no such variables with names like that either in newdata or along the search path that R will traverse to look for them.
In your second case, you are totally misusing the model formula notation; the idea is to succinctly and symbolically describe the model. Succinctness and repeating the data object ad nauseam are not compatible.
The behaviour you note is exactly consistent with the documented behaviour. In simple terms, you fitted the model with terms dataTrain$x and dataTrain$y then tried to predict for terms x and y. As far as R is concerned those are different names and hence different things and it did right to not match them.
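The difference can be checked directly. This sketch reuses the question's simulated setup (the seed is an arbitrary addition, and p_good and p_bad are illustrative names) and compares each model's output with what it should be if newdata were really used.

```r
set.seed(123)  # arbitrary seed, only for repeatability
x <- rnorm(200, sd = 10)
y <- x + rnorm(200, sd = 1)
data <- data.frame(x, y)
train <- sample(seq_along(x), size = length(x) / 2)
dataTrain <- data[train, ]
dataTest  <- data[-train, ]

m  <- lm(y ~ x, data = dataTrain)     # terms are named x and y
m2 <- lm(dataTrain$y ~ dataTrain$x)   # terms are named dataTrain$x, dataTrain$y

p_good <- predict(m,  newdata = dataTest)
p_bad  <- predict(m2, newdata = dataTest)

# m2 never finds `dataTrain$x` in newdata, so it falls back to the
# training values found in the formula's environment
isTRUE(all.equal(unname(p_bad), unname(fitted(m2))))

# m really used dataTest: predictions equal intercept + slope * dataTest$x
isTRUE(all.equal(unname(p_good),
                 unname(coef(m)[1] + coef(m)[2] * dataTest$x)))
```

Since the training and test sets here happen to have the same number of rows, the length check described in the Note has nothing to flag, which is why the mismatch goes by silently in this example.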
