error when doing the summary of polr in r - r

I am trying to do a proportional odds logistic regression model of the form:
dsnac <- polr(formula=DS1~AC1, data = pddat1, method=c("logistic"))
summary(dsnac)
The regression ran fine,however, when I implement the summary function I get an error:
svd(X) : infinite or missing values in 'x'
I checked to see if there are any missing values in the "AC1" column (assuming AC1 is "x" as mentioned in the error), but does not have any values missing. The range of AC1 is 1.3 to 170000. DS1 is a factor having the levels 0,1 and 2.
Would be a great help if someone can help me with this. Thanks
A reproducible example is:
pddat1 <- data.frame(cbind(DS1=c(rep(0,400),rep(1,60),rep(2,40)),
AC1=runif(500,1,170000)))
pddat1$DS1 <- as.factor(pddat1$DS1)
dsnac <- polr(formula=DS1~AC1, data = pddat1, method=c("logistic"))
summary(dsnac)

A simple transformation solved the issue. svd(X) refers to singular value decomposition of covariates matrix.
dsnac <- polr(DS1~scale(AC1) , data = pddat1, method=c("logistic"))
summary(dsnac)
However, it is something has to do with your data. Calling clm function from ordinal package lead to the same conclusions with a warning such as "Model is nearly unidentifiable: very large eigenvalue - Rescale variables?"
library(ordinal)
dsnac <- clm(as.factor(DS1) ~ AC1, data=pddat1)
summary(dsnac)
If you downsize the maximum value in the runif command everything works fine
pddat1 <- data.frame(cbind(DS1=factor(c(rep(0,400),rep(1,60),rep(2,40))),
AC1=runif(500,1,15)))
str(pddat1)
pddat1$DS1 <- as.factor(pddat1$DS1)
dsnac <- polr(DS1 ~ AC1, data = pddat1, method=c("logistic"))
summary(dsnac)

Related

Error when predicting new fitted values from R gamlss object

I have a gamlss model that I'd like to use to make new y predictions (and confidence intervals) from in order to visualize how well the model fits the real data. I'd like to make predictions from a new data set of randomized predictor values (rather than the original data), but I'm running into an error message. Here's some example code:
library(gamlss)
# example data
irr <- c(0,0,0,0,0,0.93,1.4,1.4,2.3,1.5)
lite <- c(0,1,2,2.5)
blck <- 1:8
raw <- data.frame(
css =abs(rnorm(500, mean=0.5, sd=0.1)),
nit =abs(rnorm(500, mean=0.72, sd=0.5)),
irr =sample(irr, 500, replace=TRUE),
lit =sample(lite, 500, replace=TRUE),
block =factor(sample(blck, 500, replace=TRUE))
)
# the model
mod <- gamlss(css~nit + irr + lit + random(block),
sigma.fo=~irr*nit + random(block), data=raw, family=BE)
# new data (predictors) for making css predictions
pred <- data.frame(
nit =abs(rnorm(500, mean=0.72, sd=0.5)),
irr =sample(irr, 500, replace=TRUE),
lit =sample(lite, 500, replace=TRUE),
block =factor(sample(blck, 500, replace=TRUE))
)
# make predictions
predmu <- predict(mod, newdata=pred, what="mu", type="response")
This gives the following error:
Error in data[match(names(newdata), names(data))] :
object of type 'closure' is not subsettable
When I run this on my real data, it gives this slightly different error:
Error in `[.data.frame`(data, match(names(newdata), names(data))) :
undefined columns selected
When I use predict without newdata, it works fine making predictions on the original data, as in:
predmu <- predict(mod, what="mu", type="response")
Am I using predict wrong? Any suggestions are greatly appreciated! Thank you.
No, you are not wrong. I have experienced the same issue.
The documentation indicates the implementation of predict is incomplete. this appears to be an example of an incomplete feature/function.
Hedgehog mentioned that predictions based on new-data is not possible yet.
BonnieM therefore "moved the model" into lmer().
I would like to further comment on this idea:
BonniM tried to get predictions based on the object mod
mod <- gamlss(css~nit + irr + lit + random(block),
sigma.fo=~irr*nit + random(block), data=raw, family=BE)
"Moving into lme()" in this scenario could look as follows:
mod2 <- gamlss(css~nit + irr + lit + re(random=~1|block),
sigma.fo=~irr*nit + re(random=~1|block),
data=raw,
family=BE)
Predictions on new-data based on mod2 are implemented within the gamlss2 package.
Furthermore, mod and mod2 should be the same models.
See:
Stasinopoulos, M. D., Rigby, R. A., Heller, G. Z., Voudouris, V., & De Bastiani, F. (2017). Flexible regression and smoothing: using GAMLSS in R. Chapman and Hall/CRC. Chapter 10.9.1
Best regards
Kai
I had a lot of random problems in this direction, and found fitting using the weights argument, and some extra dummy observations set to weight zero (but the predictors I was interested in) to be one workaround.
I was able to overcome the undefined columns selected error by ensuring that the new data for the newdata parameter had the EXACT column structure as what was used when running the gamlss model.

Calculating prediction accuracy of a tree using rpart's predict method

I have constructed a decision tree using rpart for a dataset.
I have then divided the data into 2 parts - a training dataset and a test dataset. A tree has been constructed for the dataset using the training data. I want to calculate the accuracy of the predictions based on the model that was created.
My code is shown below:
library(rpart)
#reading the data
data = read.table("source")
names(data) <- c("a", "b", "c", "d", "class")
#generating test and train data - Data selected randomly with a 80/20 split
trainIndex <- sample(1:nrow(x), 0.8 * nrow(x))
train <- data[trainIndex,]
test <- data[-trainIndex,]
#tree construction based on information gain
tree = rpart(class ~ a + b + c + d, data = train, method = 'class', parms = list(split = "information"))
I now want to calculate the accuracy of the predictions generated by the model by comparing the results with the actual values train and test data however I am facing an error while doing so.
My code is shown below:
t_pred = predict(tree,test,type="class")
t = test['class']
accuracy = sum(t_pred == t)/length(t)
print(accuracy)
I get an error message that states -
Error in t_pred == t : comparison of these types is not implemented In
addition: Warning message: Incompatible methods ("Ops.factor",
"Ops.data.frame") for "=="
On checking the type of t_pred, I found out that it is of type integer however the documentation
(https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/predict.rpart.html)
states that the predict() method must return a vector.
I am unable to understand why is the type of the variable is an integer and not a list. Where have I made the mistake and how can I fix it?
Try calculating the confusion matrix first:
confMat <- table(test$class,t_pred)
Now you can calculate the accuracy by dividing the sum diagonal of the matrix - which are the correct predictions - by the total sum of the matrix:
accuracy <- sum(diag(confMat))/sum(confMat)
My response is very similar to #mtoto's one but a bit more simply... I hope it also helps.
mean(test$class == t_pred)

Plot in SVM model (e1071 Package) using DocumentTermMatrix

i trying do create a plot for my model create using SVM in e1071 package.
my code to build the model, predict and build confusion matrix is
ptm <- proc.time()
svm.classifier = svm(x = train.set.list[[0.999]][["0_0.1"]],
y = train.factor.list[[0.999]][["0_0.1"]],
kernel ="linear")
pred = predict(svm.classifier, test.set.list[[0.999]][["0_0.1"]], decision.values = TRUE)
time[["svm"]] = proc.time() - ptm
confmatrix = confusionMatrix(pred,test.factor.list[[0.999]][["0_0.1"]])
confmatrix
train.set.list and test.set.list contains the test and train set for several conditions. train and set factor has the true label for each set. Train.set and test.set are both documenttermmatrix.
Then i tried to see a plot of my data, i tried with
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]])
but i got the message:
"Error in plot.svm(svm.classifier, train.set.list[[0.999]][["0_0.1"]]) :
missing formula."
what i'm doing wrong? confusion matrix seems good to me even not using formula parameter in svm function
Without given code to run, it's hard to say exactly what the problem is. My guess, given
?plot.svm
which says
formula formula selecting the visualized two dimensions. Only needed if more than two input variables are used.
is that your data has more than two predictors. You should specify in your plot function:
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]], predictor1 ~ predictor2)

Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, : zero non-NA points

I am using R's ROCR package to calculate the area under the curve of large data-sets. However, The code does not work for all the datasets except a few.
the code i have used:
pred <- prediction(mydata$Total.Regexes, mydata$actual)
perf <- performance(pred, "tpr", "fpr")
I checked the dataset there is no non-Na points present in the dataset. However, since the dataset is huge, it may go out of my sight. So, is there any other process to refine the dataset (for non-NA values, if there any) without disturbing the remaining values?
And this is the error it shows for a few dataset:
Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, :
zero non-NA points
I checked using:
is.na(dataset)
dataset <- na.omit(dataset)
but still it doesn't work. There is no non-Na values present in the dataset.. I can't reproduce the error with a simple dataset, so I've posted the problem dataset in my dropbox.
https://www.dropbox.com/s/pjko6o6h23m43le/DC4.csv
Please Help!
I had a similar problem.
By "casting" the arguments to prediction I managed to get everything working properly.
Try:
pred <- prediction(as.numeric(mydata$Total.Regexes), as.numeric(mydata$actual))
perf <- performance(pred, "tpr", "fpr")
I had the similar problem. This is a 'bad' way to solve it:
MODEL <- glm(y ~ x + z, my_data, family = "binomial")
pred_probab <- predict(MODEL, type = "response")
type = "response"is specified for return prediction sample as probabilities
pr <- prediction(pred_probab, Two_levels_factor)
Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, :
zero non-NA points
The sample "Two_levels_factor" with n = 2000 has only one level value "positive_result". For logit regression it must have two levels.
levels(Two_levels_factor)
[1] "negative_result"
Two_levels_factor[1] <- "positive_result"
levels(Two_levels_factor)
[1] "positive_result" "negative_result"
pr <- prediction(pred_probab, Two_levels_factor)

ROC curve in R using ROCR package

Can someone explain me please how to plot a ROC curve with ROCR.
I know that I should first run:
prediction(predictions, labels, label.ordering = NULL)
and then:
performance(prediction.obj, measure, x.measure="cutoff", ...)
I am just not clear what is meant with prediction and labels. I created a model with ctree and cforest and I want the ROC curve for both of them to compare it in the end. In my case the class attribute is y_n, which I suppose should be used for the labels. But what about the predictions? Here are the steps of what I do (dataset name= bank_part):
pred<-cforest(y_n~.,bank_part)
tablebank<-table(predict(pred),bank_part$y_n)
prediction(tablebank, bank_part$y_n)
After running the last line I get this error:
Error in prediction(tablebank, bank_part$y_n) :
Number of cross-validation runs must be equal for predictions and labels.
Thanks in advance!
Here's another example: I have the training dataset(bank_training) and testing dataset(bank_testing) and I ran a randomForest as below:
bankrf<-randomForest(y~., bank_training, mtry=4, ntree=2,
keep.forest=TRUE,importance=TRUE)
bankrf.pred<-predict(bankrf, bank_testing, type='response')
Now the bankrf.pred is a factor object with labels c=("0", "1"). Still, I don't know how to plot ROC, cause I get stuck to the prediction part. Here's what I do
library(ROCR)
pred<-prediction(bankrf.pred$y, bank_testing$c(0,1)
But this is still incorrect, cause I get the error message
Error in bankrf.pred$y_n : $ operator is invalid for atomic vectors
The predictions are your continuous predictions of the classification, the labels are the binary truth for each variable.
So something like the following should work:
> pred <- prediction(c(0.1,.5,.3,.8,.9,.4,.9,.5), c(0,0,0,1,1,1,1,1))
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
to generate an ROC.
EDIT: It may be helpful for you to include the sample reproducible code in the question (I'm having a hard time intepreting your comment).
There's no new code here, but... here's a function I use quite often for plotting an ROC:
plotROC <- function(truth, predicted, ...){
pred <- prediction(abs(predicted), truth)
perf <- performance(pred,"tpr","fpr")
plot(perf, ...)
}
Like #Jeff said, your predictions need to be continuous for ROCR's prediction function. require(randomForest); ?predict.randomForest shows that, by default, predict.randomForest returns a prediction on the original scale (class labels, in classification), whereas predict.randomForest(..., type = 'prob') returns probabilities of each class. So:
require(ROCR)
data(iris)
iris$setosa <- factor(1*(iris$Species == 'setosa'))
iris.rf <- randomForest(setosa ~ ., data=iris[,-5])
summary(predict(iris.rf, iris[,-5]))
summary(iris.preds <- predict(iris.rf, iris[,-5], type = 'prob'))
preds <- iris.preds[,2]
plot(performance(prediction(preds, iris$setosa), 'tpr', 'fpr'))
gives you what you want. Different classification packages require different commands for getting predicted probabilities -- sometimes it's predict(..., type='probs'), predict(..., type='prob')[,2], etc., so just check out the help files for each function you're calling.
This is how you can do it:
have our data in a csv file,("data_file.csv") but you may need to give the full path here. In that file have the column headers, which here I will use
"default_flag", "var1", "var2", "var3", where default_flag is 0 or 1 and the other variables have any value.
R code:
rm(list=ls())
df <- read.csv("data_file.csv") #use the full path if needed
mylogit <- glm(default_flag ~ var1 + var2 + var3, family = "binomial" , data = df)
summary(mylogit)
library(ROCR)
df$score<-predict.glm(mylogit, type="response" )
pred<-prediction(df$score,df$default_flag)
perf<-performance(pred,"tpr", "fpr")
plot(perf)
auc<- performance(pred,"auc")
auc
Note that df$score will give you the probability of default.
In case you want to use this logit (same regression coefficients) to test in another data df2 set for cross validation, use
df2 <- read.csv("data_file2.csv")
df2$score<-predict.glm(mylogit,newdata=df2, type="response" )
pred<-prediction(df2$score,df2$default_flag)
perf<-performance(pred,"tpr", "fpr")
plot(perf)
auc<- performance(pred,"auc")
auc
The problem is, as pointed out by others, prediction in ROCR expects numerical values. If you are inserting predictions from randomForest (as the first argument into prediction in ROCR), that prediction needs to be generated by type='prob' instead of type='response', which is the default. Alternatively, you could take type='response' results and convert to numerical (that is, if your responses are, say 0/1). But when you plot that, ROCR generates a single meaningful point on ROC curve. For having many points on your ROC curve, you really need the probability associated with each prediction - i.e. use type='prob' in generating predictions.
The problem may be that you would like to run the prediction function on multiple runs for example for cross-validatation.
In this case for prediction(predictions, labels, label.ordering = NULL) function the class of "predictions" and "labels" variables should be list or matrix.
Try this one:
library(ROCR)
pred<-ROCR::prediction(bankrf.pred$y, bank_testing$c(0,1)
The function prediction is present is many packages. You should explicitly specify(ROCR::) to use the one in ROCR. This one worked for me.

Resources