Make a model matrix if missing the response variable and where matrix multiplication recreates the predict function - r

I want to create a model matrix for a test dataset that is missing the response variable, such that multiplying it by the model coefficients exactly reproduces the results of calling predict() on the model. See the code below for an example.
I have code which can do this (again, see below for example), but it requires that I create a placeholder response variable in my test data. This doesn't seem very clean, and I'm wondering if there's a way to get the code to work without this workaround.
# Make data, fit model
set.seed(1); df_train = data.frame(y = rnorm(10), x = rnorm(10), z = rnorm(10))
set.seed(2); df_test = data.frame(x = rnorm(10), z = rnorm(10))
fit = lm(y ~ poly(x) + poly(z), data = df_train)
# Make model matrices. Get an error for the test data as 'y' isn't found
mm_train = model.matrix(terms(fit), df_train)
mm_test = model.matrix(terms(fit), df_test) #"Error in eval(predvars, data, env) : object 'y' not found"
# Make fake y variable for test data then build model matrix. I want to know if there's a less hacky way to do this
df_test$y = 1
mm_test = model.matrix(terms(fit), df_test)
# Check predict and matrix multiplication give identical results on test data. NB this is not the case if constructing the model matrix using (e.g.) mm_test = model.matrix(formula(fit)[-2], df_test), for the reason outlined here: https://stackoverflow.com/questions/59462820/why-are-predict-lm-and-matrix-multiplication-giving-different-predictions
preds_1 = round(predict(fit, df_test), 5)
preds_2 = round(mm_test %*% fit$coefficients, 5)
all(preds_1 == preds_2) #TRUE
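One way to avoid the placeholder column is to strip the response from the fitted terms object with delete.response() from the stats package, which is also what predict.lm() does internally. The sketch below reuses the data and model from the question; the names tt_no_response, preds_predict and preds_matmul are introduced here for illustration only. Because the terms object keeps the poly() coefficients learned on the training data, the resulting matrix should match what predict() builds.

```r
# Recreate the data and model from the question
set.seed(1); df_train <- data.frame(y = rnorm(10), x = rnorm(10), z = rnorm(10))
set.seed(2); df_test  <- data.frame(x = rnorm(10), z = rnorm(10))
fit <- lm(y ~ poly(x) + poly(z), data = df_train)

# Drop the response from the terms object; the stored poly() basis is retained
tt_no_response <- delete.response(terms(fit))

# Build the test model matrix without needing a 'y' column in df_test
mm_test <- model.matrix(tt_no_response, df_test)

# Compare predict() with the manual matrix product
preds_predict <- unname(predict(fit, df_test))
preds_matmul  <- as.vector(mm_test %*% coef(fit))
all.equal(preds_predict, preds_matmul)
```

Using all.equal() rather than rounding and == compares the two vectors up to numerical tolerance.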

Related

R: how to make predictions using gamboost

library(mboost)
### a simple two-dimensional example: cars data
cars.gb <- gamboost(dist ~ speed, data = cars, dfbase = 4,
control = boost_control(mstop = 50))
set.seed(1)
cars_new <- cars + rnorm(nrow(cars))
> predict(cars.gb, newdata = cars_new$speed)
Error in check_newdata(newdata, blg, mf) :
‘newdata’ must contain all predictor variables, which were used to specify the model.
I fit a model using the example on the help(gamboost) page. I want to use this model to predict on a new dataset, cars_new, but encountered the above error. How can I fix this?
The predict function looks for a variable called speed, but when you extract the column with the $ sign you get a bare vector that no longer carries that name.
So this variant of the prediction works:
predict(cars.gb, newdata = data.frame(speed = cars_new$speed))
Or keep the original column name by subsetting with single brackets:
predict(cars.gb, newdata = cars_new['speed'])
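The difference between the two kinds of subsetting can be checked in base R without loading mboost. A minimal sketch, reusing the cars_new construction from the question (speed_vec and speed_df are names chosen here for illustration):

```r
# Rebuild the perturbed data set from the question (base R only)
set.seed(1)
cars_new <- cars + rnorm(nrow(cars))

# `$` extracts the column as a bare numeric vector: the name "speed" is lost
speed_vec <- cars_new$speed
is.data.frame(speed_vec)   # FALSE
names(speed_vec)           # NULL

# `[` with a column name returns a one-column data frame, name intact
speed_df <- cars_new["speed"]
is.data.frame(speed_df)    # TRUE
names(speed_df)            # "speed"
```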

R predict() returns only fitted values for nls() model

First of all, I would like to mention I am just a beginner in R.
I have encountered a problem when trying to predict data from a model generated by nls(). I fitted the exponential decay function into my data and everything seems to be fine, e.g. I got a decent regression line. However, when I use predict() on a new data set, it returns only fitted values.
My code is:
df = data.frame(Time = c(0,5,15,30), Value = c(1, 0.38484677,0.18679383, 0.06732328))
model <- nls(Value~a*exp(-b*Time), start=list(a=1, b=0.15), data = df)
plot(Value~Time, data = df)
lines(df$Time, predict(model))
newtime <- data.frame(Time = seq(1,20, by = 1))
pr = predict(model, newdata = newtime$Time)
pr
[1] 0.979457389 0.450112312 0.095058637 0.009225664
Could someone please explain what I am doing wrong? I know there are some answers to this problem here, but none of them helped me.
Thank you in advance for your help!
The newdata parameter should be a data.frame with the same names as your input data. When you use newdata = newtime$Time you are actually passing in newtime$Time which is not a data.frame anymore since it 'dropped' down to a vector. You can just pass in newtime like so
pr = predict(model, newdata = newtime)
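Put together with the data from the question, the corrected call can be sketched as follows. This assumes the fit converges from the given starting values, as the question reports; the check on length is only there to show that one prediction is returned per row of newtime rather than one per training row.

```r
# Data and model from the question
df <- data.frame(Time  = c(0, 5, 15, 30),
                 Value = c(1, 0.38484677, 0.18679383, 0.06732328))
model <- nls(Value ~ a * exp(-b * Time), start = list(a = 1, b = 0.15), data = df)

# New data must be a data frame with a column named like the predictor
newtime <- data.frame(Time = seq(1, 20, by = 1))

# Pass the data frame itself, not the extracted vector newtime$Time
pr <- predict(model, newdata = newtime)
length(pr) == nrow(newtime)
```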

R C5.0 Undefined columns selected when using a formula stored in a variable

I'm using the C50 R Package for predicting with decision trees.
I have the following code:
library("partykit")
library("C50")
# Create a sample data frame
data <- data.frame(ID = c(1, 2, 3, 4, 5),
var1 = c('a', 'b', 'c', 'd', 'e'),
var2 = c(1, 1, 0, 0, 1))
# This is the variable I want to predict
variable <- "ID"
# First I convert the column to factor
data[, variable] <- factor(data[, variable])
# Then I create the formula
formula <- as.formula(paste(variable, " ~ ."))
# And finally I fit the model with the formula
model <- C5.0(formula, data=data, trials=10)
Until this point everything is ok, the problem comes here, when I try to plot the tree:
png(filename = paste("test.png"), width = 800, height = 60)
plot(model) # This line throws the error: Error in `[.data.frame`(mf, rsp) : undefined columns selected
dev.off()
But if I change the line:
model <- C5.0(formula, data=data, trials=10)
To:
model <- C5.0(ID ~ ., data=data, trials=10)
Everything is ok.
After doing a bit of debugging I have this additional info:
Inside C5.0 function there is this code:
call <- match.call()
If I look inside call this is what I get:
C5.0.formula(formula = formula, data = data, trials = 10)
But if the call to C5.0 is:
model <- C5.0(ID ~ ., data=data, trials=10)
Then the call object is this:
C5.0.formula(formula = ID ~ ., data = data, trials = 10)
This may seem normal, but while debugging the plot() function I saw that at some point the function as.party(x, trial = trial) is called, where x is the C5.0 object. Inside the as.party() function there is another call, model.frame(obj), where obj is the C5.0 object, and here is the problem. Inside the model.frame() function I found this line:
rsp <- strsplit(paste(formula$call[2]), " ")[[1]][1]
Remember? The error had a reference to that rsp variable. And the problem is that formula$call has 2 different values. If I make the first call to C5.0 like this one:
model <- C5.0(ID ~ ., data=data, trials=10)
Everything is ok, as formula$call contains C5.0.formula(formula = ID ~ ., data = data, trials = 10) and rsp is "ID", so the next call to:
tmp <- mf[rsp]
Is executed without problems (mf is the initial data frame).
But with the call:
C5.0.formula(formula = formula, data = data, trials = 10)
The formula$call object contains:
C5.0.formula(formula = formula, data = data, trials = 10)
And rsp is "formula" so the line:
tmp <- mf[rsp]
Fails as there is no "formula" column in the data frame.
Is this the expected behavior? If it is, is there no way to call C5.0 with a formula stored in a variable?
I need to do the call that way in order to test the algorithm with many different formulas.
Any help would be appreciated. Thank you all in advance.
Unfortunately, it is a bug. See issue 8 on GitHub. It looks like Max Kuhn (the developer of C50) hasn't had the time to look into it, since the bug report is from August. You might want to add your case to that issue as well. That might bring it back to the attention of the developer.
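The mechanism described in the question can be reproduced in base R without C50. In the sketch below, capture_call() is a hypothetical stand-in for a modelling function that stores match.call(), and response_name() applies the extraction line quoted in the question; neither is part of C50 or partykit. The last call shows the general bquote() idiom for splicing a formula object into a call before it is evaluated. Whether that idiom avoids the plotting error with C5.0 itself has not been verified here and is only an assumption.

```r
# Hypothetical stand-in for a function that records its own call
capture_call <- function(formula, data) match.call()

# The extraction step quoted in the question, applied to a stored call
response_name <- function(cl) strsplit(paste(cl[2]), " ")[[1]][1]

d  <- data.frame(ID = factor(1:3), var2 = c(1, 0, 1))
fm <- as.formula("ID ~ .")

# Literal formula: the call stores the formula text, so the response is found
rsp_literal <- response_name(capture_call(ID ~ ., data = d))

# Formula held in a variable: the call stores only the variable's name
rsp_variable <- response_name(capture_call(fm, data = d))

# bquote() splices the formula object into the call before evaluation
rsp_spliced <- response_name(eval(bquote(capture_call(.(fm), data = d))))

c(rsp_literal, rsp_variable, rsp_spliced)
```

Here the variable is called fm, so the second result is the string "fm"; in the question the variable is called formula, which is why rsp ends up as "formula".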

predict.glm() on blind test data

I'm using regularized logistic regression for a classification problem using the glmnet package. In the development process, everything is working fine, but I have a problem when it comes to making predictions on blind test data.
Because I don't know the class label, my data frame for testing has a column less than the one I used for training. This seems to be a problem for predict.glm(), because it expects matching dimensions - I can "fix" it by adding a column with some arbitrary labels in the test data, but this seems like a bad idea. I hope this example will illustrate the problem:
library(glmnet)
example = data.frame(rnorm(20))
colnames(example) = "A"
example$B = rnorm(20)
example$class = ((example$A + example$B) > 0)*1
testframe = data.frame(rnorm(20))
colnames(testframe) = "A"
testframe$B = rnorm(20)
x = model.matrix(class ~ ., data = example)
y = data.matrix(example$class)
# this is similar to the situation I have with my data
# the class labels are omitted on the blind test set
So if I just proceed like this, I get an error:
x.test = as.matrix(testframe)
ridge = glmnet(x,y, alpha = 0, family = "binomial", lambda = 0.01789997)
ridge.pred = predict(ridge, newx = x.test, s = 0.01789997, type = "class")
Error in cbind2(1, newx) %*% nbeta:
Cholmod error 'X and/or Y have
wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90
I can "fix" the problem by adding a class column to my test data:
testframe$class = 0
x.test = model.matrix(class ~ ., data = testframe)
ridge.pred2 = predict(ridge, newx = x.test, s = 0.01789997, type = "class")
So I have a couple of questions about this:
a) Is this workaround with adding a column safe to do? It feels very wrong/dangerous to do this, because I don't know if the predict method will use it (why would it require this column to be there otherwise?).
b) What's a better / "the correct" way to do this?
Thanks in advance!
Answer
When you create the matrix x, drop the (Intercept) column (which is always the first column). Then your predict function will work without the workaround. Specifically, use this line to create x.
x = model.matrix(class ~ ., data = example)[,-1]
Explanation
You are getting an error because model.matrix is creating a column for an intercept in the model, which is not on your x.test matrix.
colnames(x)
# [1] "(Intercept)" "A" "B"
colnames(x.test)
# [1] "A" "B"
Unless you set intercept=FALSE, glmnet will add an intercept to the model for you. Thus, the simplest thing to do is exclude the intercept column from both the x and x.test matrices.
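The column mismatch and its fix can be checked with base R alone, without fitting a glmnet model. A minimal sketch, with a seed and data construction chosen here for illustration: the training matrix is built from the full formula, while the test matrix uses a one-sided formula (~ .), so no placeholder class column is needed.

```r
set.seed(42)

# Training data with a class label, test data without one
example <- data.frame(A = rnorm(20), B = rnorm(20))
example$class <- as.integer(example$A + example$B > 0)
testframe <- data.frame(A = rnorm(20), B = rnorm(20))

# Drop the "(Intercept)" column (always the first) from the training matrix
x <- model.matrix(class ~ ., data = example)[, -1]

# One-sided formula: same predictors, no response required
x.test <- model.matrix(~ ., data = testframe)[, -1]

colnames(x)        # "A" "B"
colnames(x.test)   # "A" "B"
```

With matching columns, x.test can be passed as newx to predict() on a glmnet fit trained on x.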

R object is not a matrix

I am new to R and trying to save my svm model in R and have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix" which would seem to mean that my data is not a matrix, but it is... so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
where the last column is my label.
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn=-1)
# load dataset
SVMtimes = 1
KERNEL="polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll=c()
# Cross Fold for training and validation datasets
for(timesRun in 1:SVMtimes) {
cat("Running SVM = ",timesRun," result = ")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
trainClasses = as.factor(data[,ncol(data)])
model = svm(trainSet, trainClasses, type="C-classification",
kernel = KERNEL, degree = DEGREE, coef0=1, cost=1,
cachesize = 10000, cross = 10)
accAll = model$accuracies
cat(mean(accAll), "/", sd(accAll),"\n")
results10foldAll = rbind(results10foldAll, c(mean(accAll),sd(accAll)))
}
# create model
svm.model <- svm(type ~ ., data = trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame, but the svm() call that creates svm.model expects data to be a matrix (this is where you assign trainSet to data). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed, as pointed out by @user5196900, you need a matrix to run svm(). However, beware that a matrix object means all columns have the same data type: all numeric or all categorical/factor. If this is true for your data, as.matrix() may be fine.
In practice, people more often than not want model.matrix() or sparse.model.matrix() (from the Matrix package), which gives dummy columns for categorical variables and a single column for each numeric variable. The result is still a matrix.
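The contrast between as.matrix() and model.matrix() on mixed-type data can be seen with a small made-up data frame (the column names size and colour are invented for this illustration):

```r
# Toy data with one numeric and one categorical column
df <- data.frame(size   = c(1.5, 2.0, 3.2, 4.1),
                 colour = factor(c("red", "green", "blue", "red")))

# as.matrix() must use a single type, so everything becomes character
m_plain <- as.matrix(df)
is.character(m_plain)   # TRUE

# model.matrix() keeps numerics as-is and expands the factor into dummy columns
m_model <- model.matrix(~ ., data = df)
is.numeric(m_model)     # TRUE
colnames(m_model)       # "(Intercept)" "size" "colourgreen" "colourred"
```

The first factor level ("blue") is absorbed into the intercept under the default treatment contrasts, which is why it has no column of its own.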
