R predict factor variables - r

I am trying to do some prediction in R. I loaded and cleaned the data, fit a model, and ran a prediction, which looks pretty good. My problem is that the prediction gives me the probability of each factor level occurring instead of the factor level itself.
I have a dataset on how well people perform some exercise. This performance is measured in A-D (which is a factor variable in my dataset). When I do the prediction I get this output:
but I want to have it like that:
[ B A E A A C D A A A C ]
How would I do that? This is my code:
modFitA1 <- rpart(classe ~ ., data=PML_Train_red, method="class")
Predictn<-predict(modFitA1, newdata= PML_Test_red)
Predictn

Even though you put method="class" in your model statement, you need to add type="class" to your predict statement.
Predictn<-predict(modFitA1, newdata= PML_Test_red, type="class")
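To make the difference concrete, here is a minimal sketch that uses the built-in iris data as a stand-in for PML_Train_red (an assumption, since the original data is not shown). Without type, predict() on a classification tree returns a matrix of class probabilities; with type="class" it returns a factor.

```r
library(rpart)

# iris stands in for the asker's data; Species plays the role of classe
fit <- rpart(Species ~ ., data = iris, method = "class")

# Default for a classification tree: one probability column per class
probs <- predict(fit, newdata = iris)
head(probs)

# type = "class" returns the predicted factor level for each row
labels <- predict(fit, newdata = iris, type = "class")
head(labels)
```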

Related

Ordinal regression - proportional odds assumption not met for variable in interaction

I am trying to analyze a dataset with an ordinal response (0-4) and three categorical factors. I'm interested in the interactions of all three factors as well as the main effects. I used the clm function of the package "ordinal" and checked the assumptions using the "nominal_test" function. It revealed a significant difference for one of the predictors, and now I don't know how to proceed. I tried putting the problematic factor and all its interactions in the "nominal" argument (see code), and R gives me warnings. Nevertheless, I ran several likelihood ratio tests, each comparing a model that includes an interaction with one that omits it (anova(without, with, test="Chisq")), and got some nice significant results. Still, I feel like I have no clue what I'm doing here and I don't trust the results. So my question is: Is what I did OK? What else can I do? Or is the data just 'unanalyzable'?
Here is the code for the test:
# this is the model
res=clm(cue~ intention:outcome:age+
intention:outcome+
intention:age+
outcome:age+
intention+outcome+age+
Gender,
data=xdata)
#proportional odds assumption
nominal_test(res)
# Df logLik AIC LRT Pr(>Chi)
#<none> -221.50 467.00
#intention 3 -215.05 460.11 12.891 0.004879 **
#outcome 3 -219.44 468.87 4.124 0.248384
#age
#Gender 3 -219.50 469.00 3.994 0.262156
#intention:outcome
#intention:age
#outcome:age 6 -217.14 470.28 8.716 0.190199
#intention:outcome:age 12 -188.09 424.19 66.808 1.261e-09 ***
And here is an example of how I tried to solve it -> and check the 3-way-interaction of all three predictors. I did the same for the 2-way-interactions as well...
res=clm(cue~ outcome:age+
outcome+age+
Gender,
nominal= ~ intention:age:outcome+
intention:age+
intention:outcome+
intention,
data=xdata)
res.red=clm(cue~ outcome:age+
outcome+age+
Gender,
nominal= ~
intention:age+
intention:outcome+
intention,
data=xdata)
anova(res,res.red, test="Chisq")
# no.par AIC logLik LR.stat df Pr(>Chisq)
#res.red 26 412.50 -180.25
#res 33 424.11 -179.05 2.3945 7 0.9348
And here is the warning that R gives me when I try to get the model to converge:
Warning message:
(-3) not all thresholds are increasing: fit is invalid
In addition: Absolute convergence criterion was met, but relative criterion was not met
I'm especially concerned about the phrase "fit is invalid". I don't know what to do with this and would be happy about any idea or hint!
Thank you!
Have you tried a more general model, such as the partial proportional odds model? Your data only has to be nominal, not ordinal, to use this model. If you find huge differences between the log-likelihoods, your assumption about ordinality is not met.
You can use vglm() from the VGAM package. Here are a few examples.
As I don't know what your data looks like, I can't say whether it's unanalyzable, but the code would be something like this:
library(VGAM)
res <- vglm(cue ~ intention:outcome:age+
intention:outcome+
intention:age+
outcome:age+
intention+outcome+age+
Gender,
family = cumulative(parallel = FALSE ~ intention),
data = xdata)
summary(res)
I think you could use pchisq(), as proposed in the example I posted above, to compare both models like you did before with anova():
pchisq(deviance(res) - deviance(res.red),
df = df.residual(res) - df.residual(res.red), lower.tail = FALSE)
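As a rough sketch of that comparison, the following uses the pneumo data that ships with VGAM as a stand-in for xdata (an assumption, since the original data is not shown). It fits a proportional odds model and a non-proportional one, then compares them with a likelihood ratio test. The restricted model goes first so that the deviance difference and the degrees of freedom are both non-negative.

```r
library(VGAM)

# pneumo ships with VGAM; it stands in for the asker's data
pneumo <- transform(pneumo, let = log(exposure.time))

# Restricted model: proportional odds (parallel slopes)
fit_po <- vglm(cbind(normal, mild, severe) ~ let,
               family = cumulative(parallel = TRUE), data = pneumo)

# General model: slopes may differ across thresholds
fit_npo <- vglm(cbind(normal, mild, severe) ~ let,
                family = cumulative(parallel = FALSE), data = pneumo)

# Likelihood ratio test: restricted minus general
lr_stat <- deviance(fit_po) - deviance(fit_npo)
lr_df <- df.residual(fit_po) - df.residual(fit_npo)
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
p_value
```

A small p-value here would suggest that the parallel-slopes restriction does not hold for the predictor.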

SVM:Need numeric dependent variable for regression

I have the following data
scorer<-function(points){
points["scores"] <- as.vector((points$X-5)^2+(points$Y-5)^2-9)
points["class"]<-(as.vector( points$scores<0 ))
points
}
dt<-scorer(data.frame(X=c(0,1,5,20,5,3,9,3,5,5),Y=c(0,9,9,0,-18,3,4,5,7,4)))
Then I am trying to predict the last column (class) using SVM:
library(e1071)
model <- svm(class ~ . , dt)
predictedClass <- predict(model, dt)
but it complains with:
Error in svm.default(x, y, scale = scale, ..., na.action = na.action) :
Need numeric dependent variable for regression.
The advice from nya really works.
Please have a look at the description of the type parameter:
svm can be used as a classification machine, as a regression machine, or for
novelty detection. Depending on whether y is a factor or not, the default setting
for type is C-classification or eps-regression ... page 50
With your dataset you can do classification with the svm method.
But if you absolutely want to do regression, transform your variable "class" into numeric form, taking the value 1 for a negative score and 0 for a positive score.
scorer <- function(points) {
points["scores"] <- as.vector((points$X-5)^2+(points$Y-5)^2-9)
points["class"]<-as.vector( ifelse(points$scores<0 ,1,0))
points
}
dt<-scorer(data.frame(X=c(0,1,5,20,5,3,9,3,5,5),Y=c(0,9,9,0,-18,3,4,5,7,4)))
svm(class~.,dt)
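For the classification route mentioned above, a minimal sketch is to turn the logical class column into a factor, so that svm() defaults to C-classification. The variable name inside and the choice to leave scores out of the formula are assumptions made for this illustration; including scores as a predictor would hand the model the answer directly.

```r
library(e1071)

dt <- data.frame(X = c(0, 1, 5, 20, 5, 3, 9, 3, 5, 5),
                 Y = c(0, 9, 9, 0, -18, 3, 4, 5, 7, 4))

# TRUE when the point lies inside the circle of radius 3 around (5, 5)
inside <- (dt$X - 5)^2 + (dt$Y - 5)^2 - 9 < 0

# A factor response makes svm() default to C-classification
dt$class <- factor(inside)

model <- svm(class ~ X + Y, data = dt)
predictedClass <- predict(model, dt)
predictedClass
```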

R random forest - training set using target column for prediction

I am learning how to use various random forest packages and coded up the following from example code:
library(party)
library(randomForest)
set.seed(415)
#I'll try to reproduce this with a public data set; in the mean time here's the existing code
data = read.csv(data_location, sep = ',')
test = data[1:65] #basically data w/o the "answers"
m = sample(1:(nrow(factor)),nrow(factor)/2,replace=FALSE)
o = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)
train2 = data[m,]
train3 = data[o,]
#random forest implementation
fit.rf <- randomForest(train2[,66] ~., data=train2, importance=TRUE, ntree=10000)
Prediction.rf <- predict(fit.rf, test) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
#cforest implementation
fit.cf <- cforest(train3[,66]~., data=train3, controls=cforest_unbiased(ntree=10000, mtry=10))
Prediction.cf <- predict(fit.cf, test, OOB=TRUE) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
data[,66] is the target factor I'm trying to predict, but it seems that using "~ ." to solve for it causes the formula to use the factor in the prediction model itself.
How do I solve for the dimension I want on high-ish dimensionality data, without having to spell out exactly which dimensions to use in the formula (so I don't end up with some sort of cforest(data[,66] ~ data[,1] + data[,2] + data[,3] ... etc.)?
EDIT:
On a high level, I believe one basically
loads full data
breaks it down to several subsets to prevent overfitting
trains via subset data
generates a fitting formula so one can predict values of target (in my case data[,66]) given data[1:65].
So my PROBLEM is now that if I give it a new set of test data, let's say test = data[1:65], it says "Error in eval(expr, envir, enclos) :" where it is expecting data[,66]. I basically want to predict data[,66] given the rest of the data!
I think that if the response is in train3 then it will be used as a feature.
I believe this is more like what you want:
crtl <- cforest_unbiased(ntree=1000, mtry=3)
mod <- cforest(iris[,5] ~ ., data = iris[,-5], controls=crtl)
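An alternative sketch, assuming the randomForest package and again using iris in place of the 66-column data: refer to the response by its column name in the formula. The "." then expands to every other column, and predict() accepts new data that contains only the predictors.

```r
library(randomForest)
set.seed(415)

# iris stands in for the 66-column data; Species is the target
train_idx <- sample(nrow(iris), nrow(iris) / 2)
train <- iris[train_idx, ]
test <- iris[-train_idx, names(iris) != "Species"]  # predictors only

# Naming the response lets "." expand to every other column
fit <- randomForest(Species ~ ., data = train, ntree = 500)

pred <- predict(fit, newdata = test)
head(pred)
```

For data where the target is only known by position, the formula could be built from the column name, for example with as.formula(paste(names(data)[66], "~ .")).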

Predict function from Caret package give an Error

I am doing a regular logistic regression using the caret package in R. I have a binomial response variable coded 1 or 0 called SALE_FLAG and 140 numeric predictor variables, which I transformed into dummy variables using the dummyVars function.
data <- dummyVars(~., data = data_2, fullRank=TRUE,sep="_",levelsOnly = FALSE )
dummies<-(predict(data, data_2))
model_data<- as.data.frame(dummies)
This gives me a data frame to work with. All of the variables are numeric. Next I split into training and testing:
trainIndex <- createDataPartition(model_data$SALE_FLAG, p = .80,list = FALSE)
train <- model_data[ trainIndex,]
test <- model_data[-trainIndex,]
Time to train my model using the train function:
model <- train(SALE_FLAG~., data=train,method = "glm")
Everything runs nice and I get a model. But when I run the predict function it does not give me what I need:
predict(model, newdata =test,type="prob")
and I get an ERROR:
Error in dimnames(out)[[2]] <- modelFit$obsLevels :
length of 'dimnames' [2] not equal to array extent
On the other hand when I replace "prob" with "raw" for type inside of the predict function I get prediction but I need probabilities so I can code them into binary variable given my threshold.
Not sure why this happens. I did the same thing without using the caret package and it worked how it should:
model2 <- glm(SALE_FLAG ~ ., family = binomial(logit), data = train)
predict(model2, newdata =test, type="response")
I spent some time looking at this but am not sure what is going on, and it seems very weird to me. I have tried many variations of the train function, meaning I skipped the formula and used X and Y instead. I also used method = 'bayesglm' to check and it gave me the same error. I hope someone can help me out. I don't need the train function to get what I need, but caret is a good package with lots of tools and I would like to figure this out.
Show us str(train) and str(test). I suspect the outcome variable is numeric, which makes train think that you are doing regression. That should also be apparent from printing model. Make it a factor if you want to do classification.
Max
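A minimal sketch of that suggestion, using simulated data because the original data_2 is not shown. The column names x1 and x2, the level labels "no" and "yes", and the 0.5 threshold are all assumptions made for illustration. With a factor outcome, train() fits a classification model and predict() can return class probabilities.

```r
library(caret)
set.seed(1)

# Simulated stand-in for the asker's data: two predictors, 0/1 outcome
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$SALE_FLAG <- rbinom(n, 1, plogis(0.8 * d$x1 - 0.5 * d$x2))

# Convert the numeric 0/1 outcome to a factor so caret does classification
d$SALE_FLAG <- factor(d$SALE_FLAG, levels = c(0, 1), labels = c("no", "yes"))

model <- train(SALE_FLAG ~ ., data = d, method = "glm")

# One probability column per factor level
probs <- predict(model, newdata = d, type = "prob")

# Apply a threshold to get a binary flag
flag <- ifelse(probs$yes > 0.5, 1, 0)
head(probs)
```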

r support vector machine e1071 training not working

I am playing around with Support Vector Machines in the R-Language. Specifically I am using the e1071 package.
As long as I follow the manual pages or the tutorial at wikibooks everything works. But if I try to use my own datasets with those examples things aren't that good anymore.
It seems that the model creation fails for some reason. At least I am not getting the levels on the target column. Below you find the example for clarification.
Maybe someone can help me to figure out what I am doing wrong here. So here is all the code and data.
Test dataset
target,col1,col2
0,1,2
0,2,3
0,3,4
0,4,5
0,5,6
0,1,2
0,2,3
0,3,4
0,4,5
0,5,6
0,1,2
0,2,3
0,3,4
0,4,5
1,6,7
1,7,8
1,8,9
1,9,0
1,0,10
1,6,7
1,7,8
1,8,9
1,9,0
1,0,10
1,6,7
1,7,8
1,8,9
1,9,0
1,0,10
R-Script
library(e1071)
dataset <- read.csv("test.csv", header=TRUE, sep=',')
tuned <- tune.svm(target~., data = dataset, gamma = 10^(-6:-1), cost = 10^(-1:1))
summary(tuned)
model <- svm(target~., data = dataset, kernel="radial", gamma=0.001, cost=10)
summary(model)
Output of the summary(model) statement
+ summary(model)
Call:
svm(formula = target ~ ., data = dataset, kernel = "radial", gamma = 0.001,
cost = 10)
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 10
gamma: 0.001
epsilon: 0.1
Number of Support Vectors: 28
>
Wikibooks example
If I compare this output to the output of the wikibooks example, it's missing some information. Please notice the "Levels"-Section in the output:
library(MASS)
library(e1071)
data(cats)
model <- svm(Sex~., data = cats)
summary(model)
Output
> summary(model)
Call:
svm(formula = Sex ~ ., data = cats)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.5
Number of Support Vectors: 84
( 39 45 )
Number of Classes: 2
Levels:
F M
Putting Roland's answer in the proper "answer" format:
target is numeric
sex is a factor
Let me give a few more suggestions:
it seems as if target really should be a factor. (It has only 2 levels, 0 & 1, and I suspect you're trying to classify into either 0 or 1.) So stick in a dataset$target <- factor(dataset$target) somewhere.
right now, because target is a numeric, a regression model is being run instead of a classification.
it's worthwhile to do a similar check for all of your variables before running a model. In the case you gave, for instance, it's not obvious what col1 and col2 are. If either of them is a grouping or classification, you should make it a factor, too.
In R, many functions have multiple ways in which they will run, depending upon the data types fed to them. If you feed factors into a model, it will run classification. If you feed numerics, regression. This is actually common in many programming languages, and is called function overloading.
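The check and the conversion suggested above can be sketched as follows. The data frame is a small made-up stand-in for test.csv (an assumption; it does not reproduce the file's exact rows).

```r
library(e1071)

# Small stand-in for test.csv: a 0/1 target and two numeric columns
dataset <- data.frame(target = rep(c(0, 1), each = 10),
                      col1 = c(1:5, 1:5, 6:10, 6:10),
                      col2 = c(2:6, 2:6, 7:11, 7:11))

# Check the column types before modelling
sapply(dataset, class)

# Make the target a factor so svm() runs classification
dataset$target <- factor(dataset$target)

model <- svm(target ~ ., data = dataset, kernel = "radial")
summary(model)
```

With the factor target, the summary should report C-classification and list the levels, as in the cats example.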
