Hi, I am doing prediction with my data. If I use data.frame it throws the following error.
input(bedrooms="2",bathrooms="2",area="1000") were specified with different types from the fit
Here is my program:
input <- function(bedrooms,bathrooms,area)
{
delhi <- read.delim("delhi.tsv", na.strings = "")
delhi$lnprice <- log(delhi$price)
heddel <- lm(lnprice ~ bedrooms+ area+ bathrooms,data=delhi)
valuepred = predict (heddel,data.frame(bedrooms=bedrooms,area=area,bathrooms=bathrooms),na.rm = TRUE)
final_prediction = exp(valuepred)
final_prediction
}
If I remove the data.frame it predicts values for the whole data set. I got the following output.
1 2 3 4 5 6 7
15480952 11657414 10956873 6011639 6531880 9801468 16157549
9 10 11 14 15 16 17
10698786 5596803 14688143 20339651 22012831 16157618 26644246
But it needs to display one value only.
Any idea how to resolve this? Any help will be appreciated.
Sharon, you want to make a prediction for specific values of bedrooms, bathrooms and area, but you are passing them in as character rather than numeric values. This is causing the error you are seeing. When you remove the data.frame statement from predict, it produces predictions for the data set used to build the model, i.e. delhi.
Try
input(bedrooms=2,bathrooms=2,area=1000)
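If the values can arrive as strings (from a form, say), one option is to coerce them inside the function before building the new data frame. This is only a sketch: the data below are simulated as a stand-in for delhi.tsv, and predict_price is a hypothetical name, not something from the question.

```r
# Simulated stand-in for delhi.tsv (made-up values, for illustration only)
set.seed(1)
delhi <- data.frame(bedrooms  = sample(1:4, 50, replace = TRUE),
                    bathrooms = sample(1:3, 50, replace = TRUE),
                    area      = round(runif(50, 400, 2500)))
delhi$price <- exp(13 + 0.2 * delhi$bedrooms + 0.1 * delhi$bathrooms +
                   0.0005 * delhi$area + rnorm(50, sd = 0.1))
delhi$lnprice <- log(delhi$price)

heddel <- lm(lnprice ~ bedrooms + area + bathrooms, data = delhi)

# Hypothetical helper: coerce inputs to numeric so they match the types used in the fit
predict_price <- function(model, bedrooms, bathrooms, area) {
  newdata <- data.frame(bedrooms  = as.numeric(bedrooms),
                        bathrooms = as.numeric(bathrooms),
                        area      = as.numeric(area))
  exp(unname(predict(model, newdata)))  # back-transform from log(price)
}

# Character inputs now give a single prediction
one_price <- predict_price(heddel, "2", "2", "1000")
one_price
```

Fitting the model once outside the function also avoids re-reading the file on every call.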
Too long for a comment.
The other answer should solve your problem, but if you really believe that log(price) is linear in bedrooms + bathrooms + area then you are better off with a generalized linear model (glm) in the Poisson family. So something like:
fit <- glm(price~bedrooms+bathrooms+area, delhi, family=poisson)
Then predict using type="response"
pred <- predict(fit, data.frame(bedrooms, bathrooms, area), type="response")
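A minimal sketch of that suggestion on simulated data, since delhi.tsv is not available here. The column names follow the question; the values are made up, and price is rounded to whole numbers only so that the Poisson family does not warn about non-integer responses.

```r
# Simulated stand-in for the delhi data (assumption: made-up values)
set.seed(1)
delhi <- data.frame(bedrooms  = sample(1:4, 50, replace = TRUE),
                    bathrooms = sample(1:3, 50, replace = TRUE),
                    area      = round(runif(50, 400, 2500)))
delhi$price <- round(exp(13 + 0.2 * delhi$bedrooms + 0.1 * delhi$bathrooms +
                         0.0005 * delhi$area + rnorm(50, sd = 0.1)))

# The log link models log(E[price]) as linear in the predictors
fit <- glm(price ~ bedrooms + bathrooms + area, data = delhi, family = poisson)

# type = "response" returns the prediction on the price scale, so no exp() is needed
new_home <- data.frame(bedrooms = 2, bathrooms = 2, area = 1000)
pred <- predict(fit, new_home, type = "response")
pred
```

The practical difference from lm on log(price) is that exp() of a predicted mean log price is not the mean price, whereas the glm predicts the mean price directly.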
My data frame looks like:
head(bush_status)
distance status count
0 endemic 844
1 exotic 8
5 native 3
10 endemic 5
15 endemic 4
20 endemic 3
The count data are not normally distributed. I'm trying to fit a generalized additive model to my data in two ways so I can use anova to see if the p-value supports m2.
m1 <- gam(count ~ s(distance) + status, data=bush_status, family="nb")
m2 <- gam(count ~ s(distance, by=status) + status, data=bush_status, family="nb")
m1 works fine, but m2 sends the error message:
"Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons,
scale.penalty = scale.penalty, :
Can't find by variable"
This is pretty beyond me so if anyone could offer any advice that would be much appreciated!
From your comments it became clear that you passed a character variable to by in the smoother. You must pass a factor variable there. This has been a frequent gotcha for me too and I consider it a design flaw (because base R regression functions deal with character variables just fine).
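A sketch of the fix on simulated data (the real bush_status is not available, so the values below are made up; only the column names come from the question). The key line is the conversion of status to a factor before fitting.

```r
library(mgcv)

# Simulated stand-in for bush_status, with status deliberately stored as character
set.seed(1)
n <- 150
bush_status <- data.frame(
  distance = runif(n, 0, 50),
  status   = sample(c("endemic", "exotic", "native"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
bush_status$count <- rnbinom(n, mu = exp(3 - 0.05 * bush_status$distance), size = 2)

# Convert the by variable to a factor before fitting
bush_status$status <- factor(bush_status$status)

m1 <- gam(count ~ s(distance) + status, data = bush_status, family = nb())
m2 <- gam(count ~ s(distance, by = status) + status, data = bush_status, family = nb())

# m2 now has one smooth of distance per level of status
length(m2$smooth)
```

With both fits in hand, the two models can be compared as the question intended.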
I don't understand how to generate predicted values from a linear regression using the predict.lm command when some values of the dependent variable Y are missing, even though no independent X observation is missing. Algebraically, this isn't a problem, but I don't know an efficient method to do it in R. Take for example this fake dataframe and regression model. I attempt to assign predictions in the source dataframe but am unable to do so because of one missing Y value: I get an error.
# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))
# Regress X and Y
model<-lm(y~x+1)
summary(model)
# Attempt to generate predictions in source dataframe but am unable to.
df$y_ip<-predict.lm(testy)
Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
replacement has 9 rows, data has 10
I got around this problem by generating the predictions using algebra, df$y<-B0+ B1*df$x, or by calling the coefficients of the model, df$y<-((summary(model)$coefficients[1, 1]) + (summary(model)$coefficients[2, 1]*(df$x))); however, I am now working with a big data model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do it using the predict function.
Thank you in advance for your assistance!
There is built-in functionality for this in R (but not necessarily obvious): it's the na.action argument/?na.exclude function. With this option set, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.
Set up data:
df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA
Fit model: default na.action is na.omit, which simply removes non-complete cases.
mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000
na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:
mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
Actually, you are not using the predict.lm function correctly.
Either way you have to pass the model itself as the first argument (here, model), with or without new data. Without new data, it only predicts on the training data, which excludes your NA row, so you need this workaround to fit the result into the initial data.frame:
df$y_ip[!is.na(df$y)] <- predict.lm(model)
Or explicitly specifying some new data. Since the new x has one more row than the training x it will fill the missing row with a new prediction:
df$y_ip <- predict.lm(model, newdata = df)
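For comparison, here is a small sketch that puts two of these approaches side by side on the toy data from the other answer. The column names fit_exclude and fit_newdata are made up for illustration. Note that the results differ in the missing row: na.exclude leaves an NA there, while newdata = df fills in a prediction.

```r
# Toy data: y is missing in row 5, x is complete
df <- data.frame(x = 1:10, y = 100 * (1:10))
df$y[5] <- NA

# Approach 1: na.exclude pads the fitted values with NA where y was missing
mod_exclude <- lm(y ~ x, data = df, na.action = na.exclude)
df$fit_exclude <- predict(mod_exclude)

# Approach 2: default na.omit fit, but predict for every row via newdata
mod_default <- lm(y ~ x, data = df)
df$fit_newdata <- predict(mod_default, newdata = df)

df[4:6, ]
```

Both assignments succeed because each vector has one value per row of df.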
I am trying to build a for() loop to manually conduct leave-one-out cross validations for a GLMM fit using the lmer() function from the lme4 pkg. I need to remove an individual, fit the model and use the beta coefficients to predict a response for the individual that was withheld, and repeat the process for all individuals.
I have created some test data to tackle the first step of simply leaving an individual out, fitting the model and repeating for all individuals in a for() loop.
The data have a binary (0,1) Response, an IndID that classifies 4 individuals, a Time variable, and a Binary variable. There are N=100 observations. The IndID is fit as a random effect.
require(lme4)
#Make data
Response <- round(runif(100, 0, 1))
IndID <- as.character(rep(c("AAA", "BBB", "CCC", "DDD"),25))
Time <- round(runif(100, 2,50))
Binary <- round(runif(100, 0, 1))
#Make data.frame
Data <- data.frame(Response, IndID, Time, Binary)
Data <- Data[with(Data, order(IndID)), ] #**Edit**: Added code to sort by IndID
#Look at head()
head(Data)
Response IndID Time Binary
1 0 AAA 31 1
2 1 BBB 34 1
3 1 CCC 6 1
4 0 DDD 48 1
5 1 AAA 36 1
6 0 BBB 46 1
#Build model with all IndID's
fit <- lmer(Response ~ Time + Binary + (1|IndID ), data = Data,
family=binomial)
summary(fit)
As stated above, my hope is to get four model fits: one with each IndID left out in a for() loop. This is a new type of application of the for() command for me and I quickly reached the limit of my coding abilities. My attempt is below.
fit <- list()
for (i in Data$IndID){
fit[[i]] <- lmer(Response ~ Time + Binary + (1|IndID), data = Data[-i],
family=binomial)
}
I am not sure storing the model fits as a list is the best option, but I had seen it on a few other help pages. The above attempt results in the error:
Error in -i : invalid argument to unary operator
If I remove the [-i] conditional to the data=Data argument the code runs four fits, but data for each individual is not removed.
Just as an FYI, I will need to further expand the loop to:
1) extract the beta coefs, 2) apply them to the X matrix of the individual that was withheld and lastly, 3) compare the predicted values (after a logit transformation) to the observed values. As all steps are needed for each IndID, I hope to build them into the loop. I am providing the extra details in case my planned future steps inform the more immediate question of leave-one-out model fits.
Thanks as always!
The problem you are having is because Data[-i] is expecting i to be an integer index. Instead, i is either AAA, BBB, CCC or DDD. To fix the loop, set
data = Data[Data$IndID != i, ]
in your model fit.
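A sketch of the whole loop, with two assumptions worth flagging. First, it iterates over unique(Data$IndID), so that it runs four times rather than once per row. Second, a plain glm() stands in for the mixed model so the example runs with base R only; with lme4 you would swap in your own fit call and, I believe, predict for the unseen individual from the fixed effects only.

```r
# Test data in the shape described in the question (simulated)
set.seed(42)
Data <- data.frame(Response = rbinom(100, 1, 0.5),
                   IndID    = rep(c("AAA", "BBB", "CCC", "DDD"), 25),
                   Time     = round(runif(100, 2, 50)),
                   Binary   = rbinom(100, 1, 0.5))

ids   <- unique(Data$IndID)   # one iteration per individual
fits  <- list()
preds <- list()

for (i in ids) {
  train <- Data[Data$IndID != i, ]   # everyone except individual i
  test  <- Data[Data$IndID == i, ]   # the withheld individual

  # Stand-in model (assumption): replace with the mixed-model fit as needed
  fits[[i]] <- glm(Response ~ Time + Binary, data = train, family = binomial)

  # Predicted probabilities for the withheld rows
  preds[[i]] <- predict(fits[[i]], newdata = test, type = "response")
}

names(fits)
```

Storing fits and predictions in lists named by IndID keeps each held-out result easy to look up afterwards, e.g. preds[["AAA"]].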
I have a logit model object fit using glm2. The predictors are continuous and time varying so I am using basis splines. When I predict(FHlogit, foo..,) the model object it provides a prediction. All is well.
Now, what I would like to do is extract the part of FHLogit and the basis matrix that provides the prediction. I do not want to extract information about the model from str(FHLogit); I am trying to extract the part that says Beta * Predictor = 2, so that I can manipulate the basis matrix for each predictor.
I don't think using basis splines will affect this. If so, please provide a reproducible example.
Here's a simple case:
df1 <- data.frame(y=c(0,1,0,1),
x1=seq(4),
x2=c(1,3,2,6))
library(glm2)
g1 <- glm2(y ~ x1 + x2, data=df1)
### default for type is "link"
> stats::predict.glm(g1, type="link")
1 2 3 4
0.23809524 0.66666667 -0.04761905 1.14285714
Now, being unsure how these numbers were arrived at, we can look at the source for the above with predict.glm. We can see that type="link" is the simplest case, returning
pred <- object$linear.predictors # object is g1 in this case
These values are the predictions resulting from the original data * the coefficients, which we can verify with e.g.
all.equal(unname(predict.glm(g1, type="link")[1]),
unname(coef(g1)[1] + coef(g1)[2]*df1[1, 2] + coef(g1)[3]*df1[1, 3]))
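The same check can be done for all rows at once with the design matrix, which is probably closer to the "Beta * Predictor" pieces you are after. This sketch uses base glm() as a stand-in for glm2() so it runs without extra packages (I'm assuming the fitted object has the same structure); eta and contrib are names I made up.

```r
df1 <- data.frame(y  = c(0, 1, 0, 1),
                  x1 = seq(4),
                  x2 = c(1, 3, 2, 6))

# Base glm as a stand-in for glm2 (assumption)
g1 <- glm(y ~ x1 + x2, data = df1)

# Design matrix: one column per coefficient; spline terms would appear here as basis columns
X <- model.matrix(g1)

# Linear predictor for every row: X %*% beta
eta <- drop(X %*% coef(g1))
all.equal(unname(eta), unname(predict(g1, type = "link")))

# Per-coefficient contributions (column j holds beta_j * X[, j]); rows sum to eta
contrib <- sweep(X, 2, coef(g1), `*`)
contrib
```

Editing columns of X (or a copy of it) and multiplying by coef(g1) again gives the prediction for the modified predictors.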
I am trying to develop a local model (PLSR) which is predicting a query sample by a model built on the 10 most similar samples using the code below (not the full model yet, just a part of it). I got stuck when trying to predict the query sample (second to last line). The model is actually predicting something, ("prd") but not the query sample!
Here is my code:
require("pls")
set.seed(10000) # generate some sample data
mat <- replicate(100, rnorm(100))
y <- as.matrix(mat[,1], drop=F)
x <- mat[,2:100]
eD <- dist(x, method="euclidean") # create a distance matrix
eDm <- as.matrix(eD)
Looping over all 100 samples and extracting their 10 most similar samples for subsequent model building and prediction of query sample:
for (i in 1:nrow(eDm)) {
kni <- head(order(eDm[,i]),11)[-1] # add 10 most similar samples to kni
pls1 <- plsr(y[kni,] ~ x[kni,], ncomp=5, validation="CV") # run plsr on sel. samples
prd <- predict(pls1, ncomp=5, newdata=x[[i]]) # predict query sample ==> I suspect there is something wrong with this expression: newdata=x[[i]]
}
I can't figure out how to address the query sample properly. Many thanks in advance for any help!
Best regards,
Chega
You are going to run into all sorts of pain building models with formulae like that. Also the x[[i]] isn't doing what you think it is - you need to supply a data frame usually to these modelling functions. In this case a matrix seems fine too.
I get all your code working OK if I use:
prd <- predict(pls1, ncomp=5, newdata=x[i, ,drop = FALSE])
giving
> predict(pls1, ncomp=5, newdata=x[i,,drop = FALSE])
, , 5 comps
y[kni, ]
[1,] 0.6409897
What you were seeing with your code are the fitted values for the training data.
> fitted(pls1)[, , 5, drop = FALSE]
, , 5 comps
y[kni, ]
1 0.1443274
2 0.2706769
3 1.1407780
4 -0.2345429
5 -1.0468221
6 2.1353091
7 0.8267103
8 3.3242296
9 -0.5016016
10 0.6781804
This is convention in R when you either don't supply newdata or the object you are supplying makes no sense and doesn't contain the covariates required to generate predictions.
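To see why x[[i]] is the wrong thing to pass, it helps to look at what the different ways of indexing a matrix return. A small base R sketch on a made-up matrix:

```r
set.seed(10000)
x <- replicate(5, rnorm(8))   # 8 samples (rows) by 5 variables (columns)
i <- 3

single  <- x[[i]]               # one number: the i-th element in column-major order
row_vec <- x[i, ]               # the i-th row, but dimensions are dropped
row_mat <- x[i, , drop = FALSE] # the i-th row, kept as a 1 x 5 matrix

length(single)
dim(row_vec)
dim(row_mat)
```

Only the last form still looks like "one sample with all its covariates", which is what predict needs as newdata.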
I would have fitted the model as follows:
pls1 <- plsr(y ~ x, ncomp=5, validation="CV", subset = kni)
where I use the subset argument for its intended purpose: to select the rows of the input data to fit the model with. You get nicer output from the models (the labels use y instead of y[kni, ], etc.), and this general convention will serve you well in other modelling tools, where R will expect newdata to be a data frame with names exactly the same as those mentioned in the model formula. In your case, with your code, that would mean creating a data frame with names like x[kni, ], which is not easy to do, for good reason!
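The same convention, sketched with lm() so it runs in base R (lm is a stand-in for plsr here, and the data frame d with columns y, x1, x2 is made up): keep the variables in a data frame, name them plainly in the formula, pick training rows with subset, and pass the query row as newdata.

```r
set.seed(10000)
d <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

# Distance matrix on the predictors only
eDm <- as.matrix(dist(d[, c("x1", "x2")], method = "euclidean"))

i   <- 1                                  # query sample
kni <- head(order(eDm[, i]), 11)[-1]      # its 10 nearest neighbours (drops i itself)

# Fit on the neighbours only; the formula uses plain column names
fit <- lm(y ~ x1 + x2, data = d, subset = kni)

# newdata is a one-row data frame with the same column names as the formula
prd <- predict(fit, newdata = d[i, , drop = FALSE])
prd
```

Because the names in newdata match the formula, predict returns exactly one value for the query sample rather than the fitted values for the training rows.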