Using the predict function on a new dataset with different columns - r

Using "stackloss" data in R, I created a regression model as seen below:
stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,data=stackloss)
stackloss.lm
newdata = data.frame(Air.Flow=stackloss$Air.Flow, Water.Temp= stackloss$Water.Temp, Acid.Conc.=stackloss$Acid.Conc.)
Suppose I get a new data set and need to predict its "stack.loss" based on the previous model, as seen below:
#suppose I need to use my model on a new set of data
stackloss$predict1[-1] <- predict(stackloss.lm, newdata)
I get this error:
Error in `$<-.data.frame`(`*tmp*`, "predict1", value = numeric(0)) :
replacement has 0 rows, data has 21
Is there a way to use the predict function on a different data set with the same columns but a different number of rows?
Thanks in advance.

You can predict into a new data set of whatever length you want; you just need to make sure you assign the results to an existing vector of appropriate size.
This line causes a problem
stackloss$predict1[-1] <- predict(stackloss.lm, newdata)
because you can't assign and subset a non-existing vector at the same time. This also doesn't work
dd <- data.frame(a=1:3)
dd$b[-1]<-1:2
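If you do want a partly filled column on the original data frame, one option is to create the column at full length first and only then assign into a subset of it. A minimal sketch using the toy data frame dd from above (the name failed is just a helper for this illustration):

```r
dd <- data.frame(a = 1:3)

# Assigning into a subset of a column that does not exist yet raises an error
failed <- tryCatch({
  dd$b[-1] <- 1:2
  FALSE
}, error = function(e) TRUE)

# Create the column at full length first, then fill part of it
dd$b <- NA_integer_
dd$b[-1] <- 1:2
dd
```

The first row of b stays NA and the remaining rows receive the assigned values.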
stackloss, which you used to fit the model, always has the same number of rows (n=21), so assigning a different number of predicted values to that data.frame doesn't make sense. If you want to use a smaller dataset to predict on, that's fine
stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,data=stackloss)
newdata = head(data.frame(Air.Flow=stackloss$Air.Flow, Water.Temp= stackloss$Water.Temp, Acid.Conc.=stackloss$Acid.Conc.),5)
predict(stackloss.lm, newdata)
1 2 3 4 5
38.76536 38.91749 32.44447 22.30223 19.71165
Since the result has the same number of values as newdata has rows (n=5), it makes sense to attach these to newdata. It would not make sense to attach them to stackloss because that has a different number of rows (n=21).
newdata$predict1 <- predict(stackloss.lm, newdata)
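The same idea applies to rows that were never part of the training data. A small sketch, assuming two hypothetical new observations (the numeric values are made up for illustration, they are not from the question):

```r
# stackloss ships with base R (datasets package)
stackloss.lm <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

# Hypothetical new observations with the same predictor columns as the model
newdata <- data.frame(Air.Flow = c(60, 72), Water.Temp = c(20, 24), Acid.Conc. = c(85, 90))

# One prediction per row of newdata, so the lengths match
newdata$predict1 <- predict(stackloss.lm, newdata)

# Without newdata, predict() returns the fitted values for the 21 training rows,
# which is the only case where attaching to stackloss itself lines up
stackloss$predict1 <- predict(stackloss.lm)
```

In both assignments the number of predicted values equals the number of rows of the data frame that receives them.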

Related

how to use data matrix in predict function in R

Context
I have a data frame with multiple columns and I have to run regressions between alternate columns (such as between 1 & 2, 3 & 4), predict using a test data set, and then find the residuals. I do not want to reference each column by name (like target.data$nifty), hence I am using a data matrix.
Problem
When I predict using a data matrix, I get the following error:
number of items to replace is not a multiple of replacement length
I know my training data set has 255 rows and my test data set has 50 rows, but predict should be able to work on the test data set and give me 50 predicted values, and it does not.
b is a matrix holding the training data set (255 rows, 8 columns) and t is a matrix holding the test data set (50 rows, 8 columns).
I tried using the matrix directly as newdata in predict, but it gave the following error:
Error in eval(predvars, data, env) : numeric 'envir' arg not of length one
... so I converted it into a data frame. Please suggest how I can use a matrix inside predict.
Here's the code I am using:
b<-as.matrix(target.data)
t<-as.matrix(target.test)
t_pred <- matrix(,nrow = 50,ncol = 8)
t_res <- matrix(,nrow = 50,ncol = 8)
for(i in 1:6) {
  if (!i %% 2) {
    t_model <- lm(b[,i] ~ b[,i])
    t_pred[,i] <- predict(t_model, newdata=data.frame(t[,i]), type = 'response')
    t_res[,i] <- t_pred[,i] - t[,i]
  }
}
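One possible way to approach this, sketched under assumptions: a formula written as b[,i] ~ b[,i] refers to the training matrix itself, so predict most likely cannot match anything in newdata and falls back to the 255 training values, which do not fit into a 50-row column. Building a small data frame with fixed column names for each pair avoids that. The matrices below are random stand-ins for the asker's data, the names x, y and train are hypothetical, and the choice of the odd column as predictor and the even column as response is an assumption about what "between 1&2, 3&4" means.

```r
set.seed(1)
# Made-up stand-ins for the asker's matrices
b <- matrix(rnorm(255 * 8), nrow = 255, ncol = 8)  # training data
t <- matrix(rnorm(50 * 8), nrow = 50, ncol = 8)    # test data

t_pred <- matrix(NA_real_, nrow = nrow(t), ncol = ncol(t))
t_res  <- matrix(NA_real_, nrow = nrow(t), ncol = ncol(t))

# Assumption: in each pair, the odd column predicts the even column
for (i in seq(2, ncol(b), by = 2)) {
  train <- data.frame(x = b[, i - 1], y = b[, i])
  t_model <- lm(y ~ x, data = train)
  # newdata uses the same column name (x) as the formula
  t_pred[, i] <- predict(t_model, newdata = data.frame(x = t[, i - 1]))
  t_res[, i] <- t_pred[, i] - t[, i]
}
```

Because the model is fitted on a column named x, predict can look up x in newdata and returns one value per test row. Columns are still addressed by position, so no column names from the original data are needed.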

How to use predict from a model stored in a list in R?

I have a dataframe dfab that contains 2 columns, which I used to generate a series of linear models as follows:
models = list()
for (i in 1:10){
models[[i]] = lm(fc_ab10 ~ (poly(nUs_ab, i)), data = dfab)
}
dfab has 32 observations and I want to predict fc_ab10 for only 1 value.
I thought of doing so:
newdf = data.frame(newdf = nUs_ab)
newdf[] = 0
newdf[1,1] = 56
prediction = predict(models[[1]], newdata = newdf)
First I tried writing newdf as a dataframe with only one position, but since there are 32 in the dataset on which the model was built, I thought I had to provide at least 32 points as well. I don't think this is necessary though.
Every time I run that piece of code I am given the following error:
Error: variable 'poly(nUs_ab, i) was fitted with type “nmatrix.1” but type “numeric” was supplied.
In addition: Warning message:
In Z/rep(sqrt(norm2[-1L]), each = length(x)) :
longer object length is not a multiple of shorter object length
I thought all I need to use predict was a LM model, predictors (the number 56) given in a column-named dataframe. Obviously, I am mistaken.
How can I fix this issue?
Thanks.
newdf should be a data.frame with column name nUs_ab, otherwise R won't be able to know which column to operate upon (i.e., generate the prediction design matrix). So the following code should work
newdf = data.frame(nUs_ab = 56)
prediction = predict(models[[1]], newdata = newdf)
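A self-contained sketch of this, using a made-up stand-in for dfab (the data values are invented for illustration). As a precaution it writes the polynomial degree into each formula as a literal number instead of the loop variable i, on the assumption that predict may otherwise look up i again at prediction time, when it no longer holds the value used for fitting.

```r
set.seed(42)
# Made-up stand-in for dfab: 32 observations, two columns
dfab <- data.frame(nUs_ab = seq(10, 100, length.out = 32))
dfab$fc_ab10 <- 2 + 0.5 * dfab$nUs_ab + rnorm(32)

models <- list()
for (i in 1:10) {
  # Build the formula from text so the degree is fixed per model
  f <- as.formula(sprintf("fc_ab10 ~ poly(nUs_ab, %d)", i))
  models[[i]] <- lm(f, data = dfab)
}

# One row is enough; the column name must match the predictor in the formula
newdf <- data.frame(nUs_ab = 56)
prediction <- predict(models[[1]], newdata = newdf)

# The same one-row data frame works for every model in the list
all_predictions <- sapply(models, predict, newdata = newdf)
```

newdata does not need as many rows as the training data; it only needs the predictor column under the name the model was fitted with.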

subscript out of bounds Error, Random Forest Model

I'm trying to use the random forest model to predict Gender based on Height, Weight and Number of siblings. I've gotten the data from a much larger data set that contains dozens of variables, but I've cleaned it into this "clean" data.frame with omitted NA values and only the 4 variables I care about, the last column being Gender.
I've tried fiddling with the code and searching everywhere but I can't find a concrete fix.
Here's the code:
ind <- sample(nrow(clean),0.8*nrow(clean))
train <- clean[ind,]
test <- clean[-ind,]
rf <- randomForest(Gender ~ ., data = train[,1:4], ntree = 20)
pred <- predict(rf, newdata = test[,-c(length(test))])
cm <- table(test$Gender, pred)
cm
and here's the output:
Error in `[.default`(table(observed = y, predicted = out.class), levels(y), : subscript out of bounds
Traceback:
1. randomForest(Gender ~ ., data = train[, 1:4], ntree = 20)
2. randomForest.formula(Gender ~ ., data = train[, 1:4], ntree = 20)
3. randomForest.default(m, y, ...)
4. table(observed = y, predicted = out.class)[levels(y), levels(y)]
5. `[.table`(table(observed = y, predicted = out.class), levels(y),
. levels(y))
6. NextMethod()
The problem is likely that you have some kind of a variable level in your test data that was not reflected in your training data. So when it goes to assign the outcome, it has no basis to do so.
It is impossible to say for sure without sample data, but it is the most likely scenario. Try setting a seed with set.seed(3), then change it (set.seed(28) and so on) a few times to see whether you find a split where you do not get the error.
Compare the conflicted data frame with the un-conflicted one to see what is missing.
EDIT:
Also, try running str(train) and str(test) to be sure the fields have remained the same. You can share that if you like by editing your post.
If any of the columns are factors with levels missing (for example, a factor has 10 levels but only 8 are represented in train while 9 or 10 appear in test), that might be the problem. All levels should be represented in the training set if you are trying to create a predictor for all possible outcomes.
If nothing else works, you can set a seed and remove predictors one at a time until it runs correctly, then look to see how the train and test sets are different in that removed column.
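These checks can be done in base R before fitting anything. A rough sketch with a made-up stand-in for the asker's clean data frame (the column values and the extra "Other" level are invented to illustrate an unused level; missing_in_train is a hypothetical helper name):

```r
set.seed(3)
# Made-up stand-in for "clean"; Gender carries a level that no row uses
clean <- data.frame(
  Height   = rnorm(40, mean = 170, sd = 10),
  Weight   = rnorm(40, mean = 70, sd = 12),
  Siblings = sample(0:4, 40, replace = TRUE),
  Gender   = factor(rep(c("F", "M"), 20), levels = c("F", "M", "Other"))
)

# Factor levels are kept when rows are removed, so check for empty ones
table(clean$Gender)
clean <- droplevels(clean)

ind <- sample(nrow(clean), 0.8 * nrow(clean))
train <- clean[ind, ]
test <- clean[-ind, ]

# Any Gender value that occurs in test but never in train is listed here
missing_in_train <- setdiff(unique(as.character(test$Gender)),
                            unique(as.character(train$Gender)))
missing_in_train
```

If table() shows a level with a count of zero, or missing_in_train is not empty, that column is a good candidate for the mismatch described above.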

Multiple regression predicting using R, predicting a data.frame

I have been given data in a data.frame called petrol which has 125 rows and the following columns:
hydrcarb, tanktemp, disptemp, tankpres, disppres, sqrtankpres, sqrdisppres
I have been asked to delete the last 25 rows from petrol, fit the model where hydrcarb is the response variable and the rest are the explanatory variables, and to do this for the first 100 rows. Then use the fitted model to predict for the remaining 25.
This is what I have done so far:
#make a new table that only contains first 100
petrold <- petrol[-101:-125,]
petrold
#FITTING THE MODEL
petrol.lmB <- lm(hydrcarb~ tanktemp + disptemp + tankpres + disppres + sqrtankpres + sqrdisppres, data=petrol)
#SELECT LAST 25 ROWS FROM PETROL
last25rows <-petrol[101:125,c('tanktemp','disptemp','tankpres','disppres','sqrtankpres','sqrdisppres')]
#PREDICT LAST 25 ROWS
predict(petrold,last25rows[101,c('tanktemp','disptemp','tankpres','disppres','sqrtankpres','sqrdisppres')])
I know I have done something wrong for my predict command since R gives me the error message:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "data.frame"
So I am not sure how to get predicted values for hydrcarb for 25 different sets of data.
Alex A. already pointed out that predict expects a model as its first argument. In addition to this, you should pass predict all the rows you want to predict at once. Besides, I recommend that you subset your dataframe "on-the-fly" instead of creating unnecessary copies. Lastly, there's a shorter way to write the formula you pass to lm:
# data for example
data(Seatbelts)
petrol <- as.data.frame(Seatbelts[1:125, 1:7])
colnames(petrol) <- c("hydrcarb", "tanktemp", "disptemp", "tankpres", "disppres", "sqrtankpres", "sqrdisppres")
# fit model using observations 1:100
petrol.lmB <- lm(hydrcarb ~ ., data = petrol[1:100,])
#predict last 25 rows
predict(petrol.lmB, newdata = petrol[101:125,])

Use of randomforest() for classification in R?

I originally had a data frame composed of 12 columns in N rows. The last column is my class (0 or 1). I had to convert my entire data frame to numeric with
training <- sapply(training.temp,as.numeric)
But then I thought I needed the class column to be a factor column to use the randomforest() tool as a classifier, so I did
training[,"Class"] <- factor(training[,ncol(training)])
I proceed to creating the tree with
training_rf <- randomForest(Class ~., data = trainData, importance = TRUE, do.trace = 100)
But I'm getting two errors:
1: In Ops.factor(training[, "Status"], factor(training[, ncol(training)])) :
<= this is not relevant for factors (roughly translated)
2: In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I would appreciate it if someone could point out the formatting mistake I'm making.
Thanks!
So the issue is actually quite simple. It turns out my training data was not a data frame: sapply had returned a matrix (an atomic vector). So it first had to be converted to a data frame, and I needed to add the following line:
training <- as.data.frame(training)
Problem solved!
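A short base R sketch of why the conversion matters, using a tiny made-up stand-in for training.temp (the names training_m and training_df are hypothetical): sapply simplifies its result to a numeric matrix, and a matrix column cannot hold a factor, whereas a data frame column can.

```r
# Made-up stand-in for training.temp (values stored as character)
training.temp <- data.frame(
  f1 = c("1.5", "2.5", "3.5", "4.5"),
  Class = c("0", "1", "0", "1"),
  stringsAsFactors = FALSE
)

# sapply returns a numeric matrix here, not a data frame
training_m <- sapply(training.temp, as.numeric)
is.matrix(training_m)

# Assigning a factor into a matrix column does not make that column a factor
training_m[, "Class"] <- factor(training_m[, "Class"])
is.factor(training_m[, "Class"])

# A data frame can hold a factor column next to numeric ones
training_df <- as.data.frame(sapply(training.temp, as.numeric))
training_df$Class <- factor(training_df$Class)
str(training_df)
```

With Class stored as a factor in a data frame, a formula such as Class ~ . has a categorical response, which is what a classifier needs.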
First, your coercion to a factor is not working: sapply returns a matrix, and a matrix column cannot hold a factor. Second, you can pass the predictors and the response to randomForest by indexing (the x and y arguments) instead of a formula. Here are changes to your code that should make it work.
library(randomForest)
# sapply returns a matrix, so convert back to a data frame first
training <- as.data.frame(sapply(training.temp, as.numeric))
training[,"Class"] <- as.factor(training[,"Class"])
# assumes Class is the last column, as in the question
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=training[,"Class"],
                            importance=TRUE, do.trace=100)
# You can also coerce to a factor directly in the model statement
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=as.factor(training[,"Class"]),
                            importance=TRUE, do.trace=100)
