Difficulty creating ROC curve using library(ROCR) in R

I have a simple two-column data frame. The labels (binary) are Benign and Malignant, and the predictor is a five-point ordinal variable. Here is the summary:
'data.frame': 127 obs. of 2 variables:
$ GRADE : Ord.factor w/ 5 levels "benign"<"likely benign"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ BENIGN.MALIGN: Factor w/ 2 levels "Benign","Malignant": 1 1 1 1 1 1 1 1 1 1 ...
But when I use:
pred<-prediction(myTable[[1]],myTable[[2]])
I get this error message:
Error in prediction(myTable[[1]], myTable[[2]]) :
Format of predictions is invalid.
What can I do to rectify this? Thanks

If you are using the grade as a score and have verified (or are willing to assume) that the intervals of the score are equidistant, you can convert the 5-point Likert scale into numeric form as follows:
pred <- prediction(as.numeric(myTable[[1]]), myTable[[2]])
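A minimal self-contained sketch of the full ROCR workflow after the conversion, using simulated data as a stand-in for `myTable` (the grade labels and sample size are illustrative):

```r
library(ROCR)

# Simulated stand-ins for the question's data (hypothetical values)
set.seed(1)
grade <- factor(sample(1:5, 127, replace = TRUE),
                levels = 1:5, ordered = TRUE,
                labels = c("benign", "likely benign", "indeterminate",
                           "likely malignant", "malignant"))
outcome <- factor(ifelse(as.numeric(grade) + rnorm(127) > 3.5,
                         "Malignant", "Benign"))

# prediction() rejects factors, so map the ordered levels to 1..5 first
pred <- prediction(as.numeric(grade), outcome)
perf <- performance(pred, "tpr", "fpr")  # ROC coordinates
plot(perf)
```

Note that `as.numeric()` on an ordered factor returns the underlying level codes (1 through 5 here), which is exactly the equidistant-score assumption described above.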


Convert dummy variable from numeric to factor for chi-square test in R

I want to perform a chi-square test in R using the following dataset, after creating dummy variables. The p-value I get from the chi-square test is 1, which is incorrect. I suspect this is because, after dummy-variable creation, the data structure changed from factor to numeric. This is a hypothesis-testing question that asks whether the defective percentage varies across the 4 countries at the 5% significance level. Please advise what the possible error is and what the solution is.
Subset of datasets used
Phillippines Indonesia Malta India
Error Free Error Free Defective Error Free
Error Free Error Free Error Free Defective
Error Free Defective Defective Error Free
Error Free Error Free Error Free Error Free
Error Free Error Free Defective Error Free
Error Free Error Free Error Free Error Free
The structure of the initial data is factor:
> str(data)
'data.frame': 300 obs. of 4 variables:
$ Phillippines: Factor w/ 2 levels "Defective","Error Free": 2 2 2 2 2 2 2 2 2 2 ...
$ Indonesia : Factor w/ 2 levels "Defective","Error Free": 2 2 1 2 2 2 1 2 2 2 ...
$ Malta : Factor w/ 2 levels "Defective","Error Free": 1 2 1 2 1 2 2 2 2 2 ...
$ India : Factor w/ 2 levels "Defective","Error Free": 2 1 2 2 2 2 2 2 2 2 ...
I convert dummy variable for the following categorical data (error free and defective) by following code:
library(caret)
dmy <- dummyVars("~ .", data = data, fullRank = T)
trsf <- data.frame(predict(dmy, newdata = data))
After dummy-variable creation, the dummy variables are numeric:
> str(trsf)
'data.frame': 300 obs. of 4 variables:
$ Phillippines.Error.Free: num 1 1 1 1 1 1 1 1 1 1 ...
$ Indonesia.Error.Free : num 1 1 0 1 1 1 0 1 1 1 ...
$ Malta.Error.Free : num 0 1 0 1 0 1 1 1 1 1 ...
$ India.Error.Free : num 1 0 1 1 1 1 1 1 1 1 ...
P-value of chi-square is 1
> chisq.test(trsf)
Pearson's Chi-squared test
data: trsf
X-squared = 112.75, df = 897, p-value = 1
Warning message:
In chisq.test(trsf) : Chi-squared approximation may be incorrect
I tried applying as.factor and performing the chi-square test, but I get the following error:
trsf_2 <- as.factor(trsf)
str(trsf_2)
Factor w/ 4 levels "c(1, 1, 1, 1, 1, 0, 0, 0, 0, 1)",..: NA NA NA NA
- attr(*, "names")= chr [1:4] "Phillippines.Error.Free" "Indonesia.Error.Free" "Malta.Error.Free" "India.Error.Free"
> chisq.test(trsf_2)
Error in chisq.test(trsf_2) :
all entries of 'x' must be nonnegative and finite
In addition: Warning message:
In Ops.factor(x, 0) : ‘<’ not meaningful for factors
You could try
dataset <- as.data.frame(lapply(data, as.numeric))
chisq.test(dataset)
However, I am not sure that the chi-square test is the most appropriate method for binary variables. May I suggest the phi coefficient? You can find more information here:
https://en.wikipedia.org/wiki/Phi_coefficient
However, you will need to write a loop if you do not want to compute it manually for each pair of variables (i.e. countries).
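A likely source of the meaningless p-value is that `chisq.test()` was given a raw data frame of 0/1 columns rather than a contingency table. A hedged sketch (with simulated counts; the real data frame comes from the question) of building the country-by-outcome table first:

```r
# Simulated stand-in for the question's 'data' (hypothetical proportions)
set.seed(1)
make_col <- function(p) factor(sample(c("Defective", "Error Free"), 300,
                                      replace = TRUE, prob = c(p, 1 - p)))
data <- data.frame(
  Phillippines = make_col(0.05),
  Indonesia    = make_col(0.10),
  Malta        = make_col(0.20),
  India        = make_col(0.05)
)

# Tabulate each country's outcomes: a 2 x 4 outcome-by-country matrix
counts <- sapply(data, table)
counts

# chisq.test() on a contingency table gives the intended test of
# whether the defective proportion differs across countries
chisq.test(counts)
```

The key point is that `chisq.test()` interprets a matrix as a table of counts, which matches the hypothesis being tested, whereas a 300-row data frame of 0/1 indicators does not.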

R multiway split trees using ctree {partykit}

I want to analyze my data with a conditional inference trees using the ctree function from partykit. I specifically went for this function because - if I understood correctly - it's one of the only ones allowing multiway splits. I need this option because all of my variables are multilevel (unordered) categorical variables.
However, trying to enable multiway split using ctree_control gives the following error:
aufprallentree <- ctree(case ~., data = aufprallen,
control = ctree_control(minsplit = 10, minbucket = 5, multiway = TRUE))
## Error in 1:levels(x) : NA/NaN argument
## In addition: Warning messages:
## 1: In 1:levels(x) :
## numerical expression has 4 elements: only the first used
## 2: In partysplit(as.integer(isel), index = 1:levels(x)) :
## NAs introduced by coercion
Does anyone know how to solve this? Or am I mistaken, and ctree does not allow multiway splits?
For clarity, an overview of my data: (no NAs)
str(aufprallen)
## 'data.frame': 299 obs. of 10 variables:
## $ prep : Factor w/ 6 levels "an","auf","hinter",..: 2 2 2 2 2 2 1 2 2 2 ...
## $ prep_main : Factor w/ 2 levels "auf","other": 1 1 1 1 1 1 2 1 1 1 ...
## $ case : Factor w/ 2 levels "acc","dat": 1 1 2 1 1 1 2 1 1 1 ...
## $ sense : Factor w/ 3 levels "crashdown","crashinto",..: 2 2 1 3 2 2 1 2 1 2 ...
## $ PO_type : Factor w/ 4 levels "object","region",..: 4 4 3 1 4 4 3 4 3 4 ...
## $ PO_type2 : Factor w/ 3 levels "object","region",..: 1 1 3 1 1 1 3 1 3 1 ...
## $ perfectivity : Factor w/ 2 levels "imperfective",..: 1 1 2 2 1 1 1 1 1 1 ...
## $ mit_Körperteil: Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
## $ PP_place : Factor w/ 4 levels "back","front",..: 4 1 1 1 1 1 1 1 1 1 ...
## $ PP_place_main : Factor w/ 3 levels "marked","rel",..: 2 3 3 3 3 3 3 3 3 3 ...
Thanks in advance!
A couple of remarks:
The error with 1:levels(x) was a bug in ctree. The code should have been 1:nlevels(x). I just fixed this on R-Forge - so you can check out the SVN from there and manually install the package if you want to use the option now. (Contact me off-list if you need more details on this.) Torsten will probably also make a new CRAN release in the next weeks.
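The off-by-one between levels() and nlevels() can be seen directly with a small illustration, independent of ctree:

```r
x <- factor(c("an", "auf", "hinter", "in"))

levels(x)    # character vector: "an" "auf" "hinter" "in"
nlevels(x)   # integer: 4

# 1:levels(x) tries to coerce the character vector of labels to a number,
# producing exactly the "NA/NaN argument" error from the question.
# The intended expression uses the *count* of levels:
1:nlevels(x) # 1 2 3 4
```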
Another function that can learn binary classification trees with multiway splits is glmtree in the partykit package. The code would be glmtree(case ~ ., data = aufprallen, family = binomial, catsplit = "multiway", minsize = 5). It uses parameter instability tests instead of conditional inference for association to determine the splitting variables and adopts the formal likelihood. But in many cases the results are fairly similar to ctree.
In both algorithms, the multiway splits are very basic: If a categorical variable is selected for splitting, then no split selection is done at all. Instead all categories get their own daughter node. There are algorithms that try to determine optimal groupings of categories with a data-driven number of daughter nodes (between 2 and the number of categories).
Even though you have categorical predictor variables with more than two levels, you don't need multiway splits. Many algorithms just use binary splits because any multiway split can be represented by a sequence of binary splits. In many datasets, however, it turns out to be beneficial to separate only a few of the categories in a splitting factor rather than all of them.
Overall my recommendation would be to start out with standard conditional inference trees with binary splits only. And only if it turns out that this leads to many binary splits in the same factor, then I would go on to explore multiway splits.

Mean of all means of subsets of data differs from overall mean

I have a large data set which looks like so:
str(ldt)
'data.frame': 116105 obs. of 11 variables:
$ s : Factor w/ 35 levels "1","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PM : Factor w/ 3 levels "C","F","NF": 3 3 3 3 3 3 3 3 3 3 ...
$ day : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ block : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
$ item : chr "parity" "grudoitong" "gunirec" "pirul" ...
$ C : logi TRUE TRUE TRUE TRUE TRUE FALSE ...
$ S : Factor w/ 2 levels "Nonword","Word": 2 1 1 1 2 2 2 1 2 1 ...
$ R : Factor w/ 2 levels "Nonword","Word": 2 1 1 1 2 1 2 1 2 1 ...
$ RT : num 0.838 1.026 0.93 0.553 0.815 ...
When I get means by factor from this data set, and then get the mean of those means it's slightly different from the mean of the original data set. It's different again when I split it into more factors and get the mean of those means. For example:
mean(ldt$RT[ldt$C])
[1] 0.6630013
mean(tapply(ldt$RT[ldt$C],list(s=ldt$s[ldt$C], PM= ldt$PM[ldt$C]),mean))
[1] 0.6638781
mean(tapply(ldt$RT[ldt$C],list(s=ldt$s[ldt$C], day = ldt$day[ldt$C], item=ldt$S[ldt$C], PM=ldt$PM[ldt$C]),mean))
[1] 0.6648401
What on earth is causing this discrepancy? The only thing I can imagine is that the subset means are getting rounded off. Is that why the answers differ? What is the exact mechanism at work here?
Thank you
The mean of means is not the same as the mean of all numbers.
Simple example: Take the dataset
1,3,5,6,7
The mean of 1 and 3 obviously is 2; the mean of 5, 6, and 7 is 6.
The mean of the means therefore would be 4.
However, we have 1+3+5+6+7 = 22 and 22/5 = 4.4.
Thus, your problem is on the mathematical side of the calculation, not with your code.
To overcome this, you have to use a weighted mean: weight each group mean by the number of values in that group divided by the total number of observations. In our example:
2/5 * 2 + 3/5 * 6 = 4.4
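The same check in R, reproducing the small example above:

```r
x <- c(1, 3, 5, 6, 7)
g <- c("a", "a", "b", "b", "b")

mean(x)                            # grand mean: 4.4
group_means <- tapply(x, g, mean)  # a = 2, b = 6
mean(group_means)                  # naive mean of means: 4

# Weight each group mean by its share of the observations:
w <- table(g) / length(x)          # 2/5 and 3/5
sum(w * group_means)               # 4.4, matching mean(x)
```

The same weighting applied to the question's ldt data (groups of unequal size across s, PM, day, etc.) would make the mean of group means agree with the overall mean.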

Getting an error "(subscript) logical subscript too long" while training SVM from e1071 package in R

I am training an SVM using my train data (e1071 package in R). The following is information about my data.
> str(train)
'data.frame': 891 obs. of 10 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 15 levels "capt","col","countess",..: 12 13 9 13 12 12 12 8 13 13
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 ...
$ ticket : Factor w/ 533 levels "110152","110413",..: 516 522 531 50 473 276 86 396
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 9 levels "a","b","c","d",..: 9 3 9 3 9 9 5 9 9 9 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ family : int 1 1 0 1 0 0 0 4 2 1 ...
I train it as the following.
library(e1071)
model1 <- svm(survived~.,data=train, type="C-classification")
No problem here. But when I predict as:
pred <- predict(model1,test)
I get the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
I also tried removing the "ticket" predictor from both the train and test data, but I still get the same error. What is the problem?
There might be a difference in the number of levels in one of the factors in the 'test' dataset.
Run str(test) and check that the factor variables have the same levels as the corresponding variables in the 'train' dataset.
I.e., the example below shows my.test$foo only has 4 levels:
str(my.train)
'data.frame': 554 obs. of 7 variables:
....
$ foo: Factor w/ 5 levels "C","Q","S","X","Z": 2 2 4 3 4 4 4 4 4 4 ...
str(my.test)
'data.frame': 200 obs. of 7 variables:
...
$ foo: Factor w/ 4 levels "C","Q","S","X": 3 3 3 3 1 3 3 3 3 3 ...
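A hedged sketch of the usual fix for such a mismatch: re-level the test factor using the training levels so both carry the same level set (the variable names here are illustrative):

```r
embarked_train <- factor(c("", "C", "Q", "S", "S"))
embarked_test  <- factor(c("C", "S", "Q"))

levels(embarked_train)  # ""  "C" "Q" "S"  (4 levels, including the blank)
levels(embarked_test)   # "C" "Q" "S"     (3 levels -- the mismatch)

# Recode the test factor with the training levels; existing values keep
# their labels, and the missing level is simply carried along as unused
embarked_test <- factor(embarked_test, levels = levels(embarked_train))
identical(levels(embarked_test), levels(embarked_train))  # TRUE
```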
That's correct. The train data contains 2 blanks for embarked; because of this there is one extra categorical level (for the blanks), and that is why you are getting this error:
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The first level is the blank.
I encountered the same problem today. It turned out that the svm model in the e1071 package expects rows to be the observations (one row is one sample) rather than columns. If you use columns as samples and rows as variables, this error will occur.
Probably your data is fine (no new levels in the test data), and you just need a small trick; then prediction works:
test.df = rbind(train.df[1,],test.df)
test.df = test.df[-1,]
This trick is from R Random Forest - type of predictors in new data do not match. I encountered this problem today, used the trick above, and it solved the problem.
I have been playing with that data set as well. I know this was a long time ago, but one of the things you can do is explicitly include only the columns you feel will add to the model, like so:
fit <- svm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = train)
This eliminated the problem for me by dropping columns (like ticket number) that contribute nothing relevant.
Another issue that broke my code was that I had forgotten to convert some of my independent variables to factors.

can't plot predict line in R

I'm using this data:
'data.frame': 1584 obs. of 3 variables:
$ Individual: Factor w/ 3 levels "AG201","AG202",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Used : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
$ NDVI : int 4724 4576 4894 4297 4670 4932 4346 3810 3481 4058 ...
I'm doing a glm with "NDVI" as a continuous explanatory variable, and then I'm plotting the model through the scatterplot of the data (I'm reproducing the same script as in the R book, Crawley, p.596)
model<-glm(Used~NDVI,binomial);
xv<-seq(0,10000,0.2);
yv<-predict(model,list(NDVI=xv),type="response");
plot(NDVI,Used);
lines(xv,yv);
My problem is that no line appears on my graph...
Any idea what's wrong?
Following Gavin's insight, here's a suggestion:
plot(NDVI, as.numeric(Used)-1 )
lines(xv,yv)
Factors are stored as integer vectors starting at 1L, with values assigned by default in alphabetical order of the labels. So you should be OK with "no" < "yes", leading to the no's being 1 and the yes's being 2, and then shifting down to the correct [0, 1] scale. You may also want to look at str(yv).
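Putting it together, a self-contained sketch with simulated NDVI values (the real data frame comes from the question). The reason no line appeared originally is that plot(NDVI, Used) draws the factor codes 1 and 2 on the y-axis, while the predicted probabilities lie in [0, 1], below the plotting region:

```r
# Simulated stand-ins for the question's data (hypothetical values)
set.seed(1)
NDVI <- sample(3000:6000, 200, replace = TRUE)
Used <- factor(ifelse(NDVI + rnorm(200, sd = 600) > 4500, "yes", "no"))

model <- glm(Used ~ NDVI, family = binomial)
xv <- seq(min(NDVI), max(NDVI), length.out = 500)
yv <- predict(model, list(NDVI = xv), type = "response")

# Shift the factor codes (1, 2) down to the (0, 1) probability scale
# so the fitted curve lands inside the plotting region:
plot(NDVI, as.numeric(Used) - 1)
lines(xv, yv)
```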
