I use the predict function to predict the results based on a model.
What I get is a vector of the predicted classes. I want to retrieve the same results but instead of the form
1 class_1
2 class_1
3 class_4
4 class_2
I want to have the results in the form
class_1 class_2 class_3 class_4
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 0 1 0 0
I have tried passing type=class and type=response but the results are the same.
I am completely new to R and I am still trying to find my way around R's documentation, but I think this is something trivial that I should be able to figure out. Still, I am pretty stuck.
After viewing the docs on predict.randomForest at
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
It appears the valid choices for type are response, prob, or votes.
Using the code below, I was able to reproduce your format, but using probabilities.
> predict(model, x, type='prob')
0 1
1 1.000 0.000
2 0.180 0.820
3 0.138 0.862
attr(,"class")
To obtain booleans, another option is to one-hot encode the response results:
result_classes <- list()
for (level in levels(y)) {
  result_classes[[level]] <- predict(model, x, type = 'response') == level
}
data.frame(result_classes)
Result:
X0 X1
1 TRUE FALSE
2 FALSE TRUE
3 FALSE TRUE
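The same one-hot encoding can also be written as a single sapply call. A sketch with stand-in predictions (not the model output above) that reproduces the 0/1 layout from the question:

```r
# Sketch with toy data: one-hot encode a factor of predicted classes into
# a 0/1 matrix, one column per level (stand-in for predict() output).
pred <- factor(c("class_1", "class_1", "class_4", "class_2"),
               levels = c("class_1", "class_2", "class_3", "class_4"))
onehot <- sapply(levels(pred), function(lv) as.integer(pred == lv))
onehot
#      class_1 class_2 class_3 class_4
# [1,]       1       0       0       0
# [2,]       1       0       0       0
# [3,]       0       0       0       1
# [4,]       0       1       0       0
```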
I have a study with several cases, all containing data from multiple ordinal factor variables (genotypes) and multiple numeric variables (various blood samples (concentrations)). I am trying to set up an explorative model to test linearity between any of the numeric variables (dependent in the model) and any of the ordinal factor variables (independent in the model).
Dataset structure example (independent variables): genotypes
case_id genotype_1 genotype_2 ... genotype_n
1 0 0 1
2 1 0 2
... ... ... ...
n 2 1 0
and dependent variables (with matching case IDs): samples
case_id sample_1 sample_2 ... sample_n
1 0.3 0.12 6.12
2 0.25 0.15 5.66
... ... ... ...
n 0.44 0.26 6.62
Found one similar example in the forum which doesn't solve the problem:
model <- apply(samples, 2, function(xl) lm(xl ~ ., data = genotypes))
I can't figure out how to make simple linear regressions that run through every combination of a given set of dependent and independent variables. If using the apply family, I guess the varying (x) term should be the dependent variable in the model, since every dependent variable should be tested for linearity against the same set of independent variables (individually).
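With toy stand-ins for samples and genotypes (assuming both share a case_id column as shown), looping over all pairs could be sketched like this:

```r
# Sketch: fit one simple lm for every (dependent, independent) pair.
# Toy data stands in for the real samples/genotypes data frames.
genotypes <- data.frame(case_id = 1:6,
                        genotype_1 = c(0, 1, 2, 0, 1, 2),
                        genotype_2 = c(1, 1, 0, 2, 0, 1))
samples   <- data.frame(case_id = 1:6,
                        sample_1 = c(0.3, 0.25, 0.44, 0.1, 0.2, 0.5),
                        sample_2 = c(0.12, 0.15, 0.26, 0.1, 0.2, 0.3))
merged     <- merge(samples, genotypes, by = "case_id")
dep_vars   <- setdiff(names(samples), "case_id")
indep_vars <- setdiff(names(genotypes), "case_id")
models <- lapply(setNames(dep_vars, dep_vars), function(dv)
  lapply(setNames(indep_vars, indep_vars), function(iv)
    lm(reformulate(iv, response = dv), data = merged)))
# e.g. inspect one fit: summary(models$sample_1$genotype_2)
```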
Extract from true data:
> genotypes
case_id genotype_1 genotype_2 genotype_3 genotype_4 genotype_5
1 1 2 2 1 1 0
2 2 NaN 1 NaN 0 0
3 3 1 0 0 0 NaN
4 4 2 2 1 1 0
5 5 0 0 0 1 NaN
6 6 2 2 1 0 0
7 9 0 0 0 0 1
8 10 0 0 0 NaN 0
9 13 0 0 0 NaN 0
10 15 NaN 1 NaN 0 1
> samples
case_id sample_1 sample_2 sample_3 sample_4 sample_5
1 1 0.16092019 0.08814160 -0.087733372 0.1966070 0.09085343
2 2 -0.21089678 -0.13289427 0.056583528 -0.9077926 -0.27928376
3 3 0.05102400 0.07724300 -0.212567535 0.2485348 0.52406368
4 4 0.04823619 0.12697286 0.010063683 0.2265085 -0.20257192
5 5 -0.04841221 -0.10780329 0.005759269 -0.4092782 0.06212171
6 6 -0.08926734 -0.19925538 0.202887833 -0.1536070 -0.05889369
7 9 -0.03652588 -0.18442457 0.204140717 0.1176950 -0.65290133
8 10 0.07038933 0.05797007 0.082702589 0.2927817 0.01149564
9 13 -0.14082554 0.26783539 -0.316528107 -0.7226103 -0.16165326
10 15 -0.16650266 -0.35291579 0.010063683 0.5210507 0.04404433
SUMMARY: Since I have a lot of data I want to create a simple model to help me select which possible correlations to look further into. Any ideas out there?
NOTE: I am not trying to fit a multiple linear regression model!
I feel like there must be a statistical test for linearity, but I can't recall it; visual inspection is typically how I do it. A quick and dirty way to test for linearity across a large number of variables would be to compute cor() for each pair of dependent/independent variables. Small multiples would be a handy way to visualize them.
Alternatively, for each ordinal (independent) variable, run a corrplot against each numeric (dependent) variable, a logged version of it, and an exponentiated version of it. If cor.test() on the logged or exponentiated version shows a stronger correlation (smaller p-value) than the regular version, it seems likely you have some linearity issues.
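The pairwise-correlation screen suggested above can be sketched in one call. Toy data stands in for the real samples and genotypes frames, with the case_id columns dropped first:

```r
# Sketch: correlate every sample column with every genotype column in one
# call, skipping missing values pairwise (the real data contains NaN).
genotypes <- data.frame(case_id = 1:5,
                        genotype_1 = c(2, NaN, 1, 2, 0),
                        genotype_2 = c(2, 1, 0, 2, 0))
samples   <- data.frame(case_id = 1:5,
                        sample_1 = c(0.16, -0.21, 0.05, 0.05, -0.05),
                        sample_2 = c(0.09, -0.13, 0.08, 0.13, -0.11))
ctab <- cor(samples[-1], genotypes[-1], use = "pairwise.complete.obs")
round(ctab, 2)  # one correlation per (sample, genotype) pair
```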
I am trying to train a model using bstTree method and print out the confusion matrix. adverse_effects is my class attribute.
set.seed(1234)
splitIndex <- createDataPartition(attended_num_new_bstTree$adverse_effects, p = .80, list = FALSE, times = 1)
trainSplit <- attended_num_new_bstTree[ splitIndex,]
testSplit <- attended_num_new_bstTree[-splitIndex,]
ctrl <- trainControl(method = "cv", number = 5)
model_bstTree <- train(adverse_effects ~ ., data = trainSplit, method = "bstTree", trControl = ctrl)
predictors <- names(trainSplit)[names(trainSplit) != 'adverse_effects']
pred_bstTree <- predict(model_bstTree$finalModel, testSplit[,predictors])
plot.roc(auc_bstTree)
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
But I get the error 'Error in confusionMatrix.default(pred_bstTree, testSplit$adverse_effects) :
The data must contain some levels that overlap the reference.'
> max(pred_bstTree)
[1] 1.03385
> min(pred_bstTree)
[1] 1.011738
> unique(trainSplit$adverse_effects)
[1] 0 1
Levels: 0 1
How can I fix this issue?
> head(trainSplit)
type New_missed Therapytypename New_Diesease gender adverse_effects change_in_exposure other_reasons other_medication
5 2 1 14 13 2 0 0 0 0
7 2 0 14 13 2 0 0 0 0
8 2 0 14 13 2 0 0 0 0
9 2 0 14 13 2 1 0 0 0
11 2 1 14 13 2 0 0 0 0
12 2 0 14 13 2 0 0 0 0
uvb_puva_type missed_prev_dose skintypeA skintypeB Age DoseB DoseA
5 5 1 1 1 22 3.000 0
7 5 0 1 1 22 4.320 0
8 5 0 1 1 22 4.752 0
9 5 0 1 1 22 5.000 0
11 5 1 1 1 22 5.000 0
12 5 0 1 1 22 5.000 0
I had a similar problem, which produced this error. I used the function confusionMatrix:
confusionMatrix(actual, predicted, cutoff = 0.5)
And I got the following error: Error in confusionMatrix.default(actual, predicted, cutoff = 0.5) : The data must contain some levels that overlap the reference.
I checked a couple of things, like:
class(actual) -> numeric
class(predicted) -> integer
unique(actual) -> plenty values, since it is probability
unique(predicted) -> 2 levels: 0 and 1
I concluded that there was a problem with applying the cutoff part of the function, so I applied the cutoff beforehand:
predicted<-ifelse(predicted> 0.5,1,0)
and then ran the confusionMatrix function, which now works just fine:
cm<- confusionMatrix(actual, predicted)
cm$table
which generated the correct outcome.
One takeaway for your case, which might improve interpretation once you get the code working: double-check the input order of your confusion matrix. Per the caret confusionMatrix documentation, the first argument (data) should be the predicted classes and the second (reference) the true classes, i.e.:
conf_bstTree <- confusionMatrix(pred_bstTree, testSplit$adverse_effects)
As said, this will mostly help you interpret the confusion matrix once you figure out a way to make it work.
Hope it helps.
> max(pred_bstTree)
[1] 1.03385
> min(pred_bstTree)
[1] 1.011738
and the error tells it all. Plotting a ROC curve is simply checking the effect of different threshold points. Based on the threshold, rounding happens: e.g. with a threshold of 0.5, a prediction of 0.7 will be converted to 1 (the TRUE class) and 0.3 will go to 0 (the FALSE class). Threshold values lie in the range (0, 1).
In your case, regardless of the threshold you will always get all observations in the TRUE class, as even the minimum prediction is greater than 1. (That's why #phiver was wondering if you are doing regression instead of classification.) Without any zero in the predictions there is no level in 'prediction' which coincides with the zero level in adverse_effects, hence this error.
PS: It will be difficult to tell the root cause of the error without you posting your data.
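One likely fix, sketched under the assumption that adverse_effects was a factor when train() ran: predict from the caret train object itself rather than from $finalModel, so that caret returns the factor class labels instead of raw bstTree scores. This is untested against the poster's data:

```r
# Sketch: caret's predict method on the train object returns class labels
# for classification models; calling predict() on model_bstTree$finalModel
# bypasses that and yields the raw numeric scores seen in the question.
pred_classes <- predict(model_bstTree, newdata = testSplit)
conf_bstTree <- confusionMatrix(pred_classes, testSplit$adverse_effects)
```

Note this only helps if adverse_effects is a factor in the data; with a numeric 0/1 response, train() fits a regression, as the answer above suspects, so convert it with factor() before training.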
I have made some classification models where 1 means it is the same person, and 0 means they are different.
If I print the head of my predictions it looks the following way:
> head(PredictCTree)
[1] 0 0 0 0 0 0
Levels: 0 1
> head(PredictSVM)
1 1.1 1.2 1.3 1.7 1.14
0 0 0 0 0 0
Levels: 0 1
> head(PredictForest)
1.212 1.839 1.906 1.951 1.1011 1.1151
1 1 1 0 1 1
Levels: 0 1
So if I want to average them and add them up I have to make them numeric, but here is where I am struggling:
Example:
> PredictForest[1]
1.212
1
Levels: 0 1
basically I want to add 1 + 0 (for PredictForest and SVM)
> as.numeric(PredictForest[1])
[1] 2
but I end up getting this answer:
> as.numeric(PredictForest[1]) + as.numeric(fitted.results[1] + as.numeric(PredictCTree[1] ))
[1] 4
Any suggestions?
My expected output would be:
> as.numeric(PredictForest[1]) + as.numeric(fitted.results[1] + as.numeric(PredictCTree[1] ))
[1] 1
So later on I could divide or give weights in order to test and get the most probable class.
Thank you!
If you try to convert a factor into a number, it'll give you the number of the level in the factor. To convert into numbers, you can first run as.character, which will safely turn it into a format that you can run as.numeric on.
test <- as.factor(c(0, 1))
as.numeric(test)
# [1] 1 2
as.numeric(as.character(test))
# [1] 0 1
The R FAQ recommends a different approach for speed
7.10 How do I convert factors to numeric?
It may happen that when reading numeric data into R (usually, when reading in a file), they come in as factors. If f is such a factor object, you can use
as.numeric(as.character(f))
to get the numbers back. More efficient, but harder to remember, is
as.numeric(levels(f))[as.integer(f)]
In any case, do not call as.numeric() or the like directly for the task at hand (as as.numeric() or unclass() give the internal codes).
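Applied to the voting use case above, a minimal sketch with hypothetical stand-in factors (not the original model predictions):

```r
# Sketch: convert 0/1 factor predictions to numerics before summing votes.
p1 <- factor(c(0, 1, 1))          # stand-in for PredictForest
p2 <- factor(c(1, 1, 0))          # stand-in for PredictSVM
to_num <- function(f) as.numeric(as.character(f))
votes <- to_num(p1) + to_num(p2)  # per-observation vote totals
votes
# [1] 1 2 1
```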
I am trying to use svm() to classify my data. A sample of my data is as follows:
ID call_YearWeek week WeekCount oc
x 2011W01 1 0 0
x 2011W02 2 1 1
x 2011W03 3 0 0
x 2011W04 4 0 0
x 2011W05 5 1 1
x 2011W06 6 0 0
x 2011W07 7 0 0
x 2011W08 8 1 1
x 2011W09 9 0 0
x 2011W10 10 0 0
x 2011W11 11 0 0
x 2011W12 12 1 1
x 2011W13 13 1 1
x 2011W14 14 1 1
x 2011W15 15 0 0
x 2011W16 16 2 1
x 2011W17 17 0 0
x 2011W18 18 0 0
x 2011W19 19 1 1
The third column shows the week of the year, the fourth column shows the number of calls in that week, and the last column is a binary factor (whether or not a call was received in that week). I used the following lines of code:
train <- data[1:105,]
test <- data[106:157,]
model <- svm(oc~week,data=train)
plot(model,train,week)
plot(model,train)
Neither of the last two lines works: they don't show any plots, and they return no error. I wonder why this is happening.
Thanks
It seems like there are two problems here. The first is that not all svm types are supported by plot.svm: only the classification methods are, not the regression methods. Because your response is numeric, svm() assumes you want to do regression, so it chooses "eps-regression" by default. If you want to do classification, change your response to a factor:
model <- svm(factor(oc) ~ week, data = train)
which will then use "C-classification" by default.
The second problem is that there does not seem to be a univariate predictor plot implemented. It seems to want two variables (one for x and one for y).
It may be better to take a step back and describe exactly what you want your plot to look like.
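For instance, a classification fit on two predictors can be plotted directly. This is a sketch assuming the e1071 package and the week/WeekCount columns from the sample data, not verified against the poster's data:

```r
# Sketch: plot.svm wants a classification model plus two variables to span
# the plot; the third argument is a formula choosing that pair of axes.
library(e1071)
model <- svm(factor(oc) ~ week + WeekCount, data = train)
plot(model, train, WeekCount ~ week)
```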
I'm having some trouble using coxph(). I have two categorical variables, "tecnologia" and "pais", and I want to evaluate the possible interaction effect of "pais" on "tecnologia". "tecnologia" is a factor variable with 2 levels, gps and convencional, and "pais" has 2 levels, PT and ES. I have no idea why this warning keeps appearing.
Here's the code and the output:
cox_AC<-coxph(Surv(dados_temp$dias_seg,dados_temp$status)~tecnologia*pais,data=dados_temp)
Warning message:
In coxph(Surv(dados_temp$dias_seg, dados_temp$status) ~ tecnologia * :
X matrix deemed to be singular; variable 3
> cox_AC
Call:
coxph(formula = Surv(dados_temp$dias_seg, dados_temp$status) ~
tecnologia * pais, data = dados_temp)
coef exp(coef) se(coef) z p
tecnologiagps -0.152 0.859 0.400 -0.38 7e-01
paisPT 1.469 4.345 0.406 3.62 3e-04
tecnologiagps:paisPT NA NA 0.000 NA NA
Likelihood ratio test=23.8 on 2 df, p=6.82e-06 n= 127, number of events= 64
I'm opening another question about this subject, although I asked a similar one some months ago, because I'm facing the same problem again with other data, and this time I'm sure it's not a data-related problem.
Can somebody help me?
Thank you
UPDATE:
The problem does not seem to be perfect classification:
> xtabs(~status+tecnologia,data=dados)
tecnologia
status conv doppler gps
0 39 6 24
1 30 3 34
> xtabs(~status+pais,data=dados)
pais
status ES PT
0 71 8
1 49 28
> xtabs(~tecnologia+pais,data=dados)
pais
tecnologia ES PT
conv 69 0
doppler 1 8
gps 30 28
Here's a simple example which seems to reproduce your problem:
> library(survival)
> (df1 <- data.frame(t1=seq(1:6),
s1=rep(c(0, 1), 3),
te1=c(rep(0, 3), rep(1, 3)),
pa1=c(0,0,1,0,0,0)
))
t1 s1 te1 pa1
1 1 0 0 0
2 2 1 0 0
3 3 0 0 1
4 4 1 1 0
5 5 0 1 0
6 6 1 1 0
> (coxph(Surv(t1, s1) ~ te1*pa1, data=df1))
Call:
coxph(formula = Surv(t1, s1) ~ te1 * pa1, data = df1)
coef exp(coef) se(coef) z p
te1 -23 9.84e-11 58208 -0.000396 1
pa1 -23 9.84e-11 100819 -0.000229 1
te1:pa1 NA NA 0 NA NA
Now lets look for 'perfect classification' like so:
> (xtabs( ~ s1+te1, data=df1))
te1
s1 0 1
0 2 1
1 1 2
> (xtabs( ~ s1+pa1, data=df1))
pa1
s1 0 1
0 2 1
1 3 0
Note that a value of 1 for pa1 exactly predicts having a status s1 equal to 0. That is to say, based on your data, if you know that pa1==1 then you can be sure that s1==0. Thus fitting Cox's model is not appropriate in this setting and will result in numerical errors.
This can be seen with
> coxph(Surv(t1, s1) ~ pa1, data=df1)
giving
Warning message:
In fitter(X, Y, strats, offset, init, control, weights = weights, :
Loglik converged before variable 1 ; beta may be infinite.
It's important to look at these cross tables before fitting models. Also it's worth starting with simpler models before considering those involving interactions.
If we add the interaction term to df1 manually like this:
> (df1 <- within(df1,
+ te1pa1 <- te1*pa1))
t1 s1 te1 pa1 te1pa1
1 1 0 0 0 0
2 2 1 0 0 0
3 3 0 0 1 0
4 4 1 1 0 0
5 5 0 1 0 0
6 6 1 1 0 0
Then check it with
> (xtabs( ~ s1+te1pa1, data=df1))
te1pa1
s1 0
0 3
1 3
We can see that it's a useless classifier, i.e. it does not help predict status s1.
When combining all 3 terms, the fitter does manage to produce a numerical value for te1 and pa1 even though pa1 is a perfect predictor as above. However, a look at the values of the coefficients and their errors shows them to be implausible.
Edit #JMarcelino: If you look at the warning message from the first coxph model in the example, you'll see the warning message:
2: In coxph(Surv(t1, s1) ~ te1 * pa1, data = df1) :
X matrix deemed to be singular; variable 3
This is likely the same error you're getting and is due to this classification problem. Also, your third cross table, xtabs(~ tecnologia + pais, data = dados), is not as important as the table of status by interaction term. You could add the interaction term manually first, as in the example above, then check the cross table. Or you could say:
> with(df1,
table(s1, pa1te1=pa1*te1))
pa1te1
s1 0
0 3
1 3
That said, I notice one of the cells in your third table has a zero (conv, PT), meaning you have no observations with this combination of predictors. This is going to cause problems when trying to fit.
In general, the outcome should have some values for all levels of the predictors, and the predictors should not classify the outcome as exactly all-or-nothing or 50/50.
Edit 2 #user75782131: Yes, generally speaking, xtabs or a similar cross table should be examined in models where the outcome and predictors are discrete, i.e. have a limited number of levels. If 'perfect classification' is present, then a predictive model / regression may not be appropriate. This is true, for example, for logistic regression (binary outcome) as well as Cox's model.
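That rule of thumb can be wrapped in a small base-R screen. A sketch using a hypothetical helper and the toy s1/pa1 values from the example above:

```r
# Sketch: flag 'perfect classification' risk, i.e. empty cells in the
# outcome-by-predictor cross table.
check_overlap <- function(outcome, predictor) {
  any(table(outcome, predictor) == 0)
}
s1  <- c(0, 1, 0, 1, 0, 1)
pa1 <- c(0, 0, 1, 0, 0, 0)
check_overlap(s1, pa1)
# [1] TRUE  (pa1 == 1 never co-occurs with s1 == 1)
```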