Can't plot predicted line in R

I'm using these data:
'data.frame': 1584 obs. of 3 variables:
$ Individual: Factor w/ 3 levels "AG201","AG202",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Used : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
$ NDVI : int 4724 4576 4894 4297 4670 4932 4346 3810 3481 4058 ...
I'm doing a glm with "NDVI" as a continuous explanatory variable, and then I'm plotting the model through the scatterplot of the data (I'm reproducing the same script as in the R book, Crawley, p.596)
model<-glm(Used~NDVI,binomial);
xv<-seq(0,10000,0.2);
yv<-predict(model,list(NDVI=xv),type="response");
plot(NDVI,Used);
lines(xv,yv);
My problem is that no line appears on my graph...
Any idea what's wrong?

Following Gavin's insight, here's a suggestion:
plot(NDVI, as.numeric(Used)-1 )
lines(xv,yv)
Factors are integer vectors starting at 1L, with levels assigned by default in alphabetical order of the labels. So you should be OK: "no" < "yes" makes the no's 1 and the yes's 2, and subtracting 1 shifts them down to the correct scale [0,1]. You may also need to look at str(yv).
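A minimal sketch of the whole workflow, assuming simulated data in place of the original data frame (the NDVI and Used values below are made up, and the coarser xv grid is my own choice rather than the book's):

```r
# Simulated stand-in for the original data (hypothetical values)
set.seed(1)
NDVI <- round(runif(200, 0, 10000))
Used <- factor(ifelse(runif(200) < plogis((NDVI - 5000) / 1000), "yes", "no"),
               levels = c("no", "yes"))

model <- glm(Used ~ NDVI, family = binomial)

# Grid of NDVI values at which to evaluate the fitted curve
xv <- seq(0, 10000, length.out = 500)
yv <- predict(model, newdata = data.frame(NDVI = xv), type = "response")

# Factor codes are 1/2; shift them to 0/1 so they share a scale with yv
used01 <- as.numeric(Used) - 1

plot(NDVI, used01, ylab = "Used (0 = no, 1 = yes)")
lines(xv, yv)
```

With the response plotted on 0/1, the fitted probabilities fall inside the y-axis range, so the curve becomes visible.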

Related

Recursive partitioning for factors/characters problem

Currently I am working with the dataset predictions. In this data I have converted clear character type variables into factors because I think factors work better than characters for glmtree() code (tell me if I am wrong with this):
> str(predictions)
'data.frame': 43804 obs. of 14 variables:
$ month : Factor w/ 7 levels "01","02","03",..: 6 6 6 6 1 1 2 2 3 3 ...
$ pred : num 0.21 0.269 0.806 0.945 0.954 ...
$ treatment : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 2 2 2 ...
$ type : Factor w/ 4 levels "S","MS","ML",..: 1 1 4 4 4 4 4 4 4 4 ...
$ i_mode : Factor w/ 143 levels "AAA","ABC","CBB",..: 28 28 104 104 104 104 104 104 104 104 ...
$ r_mode : Factor w/ 29 levels "0","5","8","11",..: 4 4 2 2 2 2 2 2 2 2 ...
$ in_mode: Factor w/ 22 levels "XY",..: 11 11 6 6 6 6 6 6 6 6 ...
$ v_mode : Factor w/ 5 levels "1","3","4","7",..: 1 1 1 1 1 1 1 1 1 1 ...
$ di : num 1157 1157 1945 1945 1945 ...
$ cont : Factor w/ 5 levels "AN","BE",..: 2 2 2 2 2 2 2 2 2 2 ...
$ hk : num 0.512 0.512 0.977 0.977 0.941 ...
$ np : num 2 2 2 2 2 2 2 2 2 2 ...
$ hd : num 1 1 0.408 0.408 0.504 ...
$ nd : num 1 1 9 9 9 9 7 7 9 9 ...
I want to estimate a recursive partitioning model of this kind:
library("partykit")
glmtr <- glmtree(formula = pred ~ treatment + 1 | (month+type+i_mode+r_mode+in_mode+v_mode+di+cont+np+nd+hd+hk),
data = predictions,
maxdepth=6,
family = quasibinomial)
My data does not contain any NAs. However, the following error arises (even after converting characters to factors):
Error in matrix(0, nrow = mi, ncol = nl) :
invalid 'nrow' value (too large or NA)
In addition: Warning message:
In matrix(0, nrow = mi, ncol = nl) :
NAs introduced by coercion to integer range
Any clue?
Thank you
You are right that glmtree() and the underlying mob() function expect the split variables to be factors in case of nominal information. However, computationally this is only feasible for factors that have a limited number of levels, because the algorithm will try all possible partitions of the levels into two groups. Thus, for your i_mode factor with nl levels this necessitates going through mi splits into two groups, with:
nl <- 143
mi <- 2^(nl - 1L) - 1L
mi
## [1] 5.575186e+42
Internally, mob() tries to create a matrix for storing all log-likelihoods associated with the corresponding partitioned models. And this is not possible because such a matrix cannot be represented. (And even if you could, then you wouldn't finish fitting all the associated models.) Admittedly, the error message is not very useful and should be improved. We will look into that for the next revision of the package.
To solve the problem, I would recommend turning the variables i_mode, r_mode, and in_mode into variables that are more suitable for binary splitting with exhaustive search. Maybe some of the variables are actually ordinal? If so, I would recommend turning them into ordinal factors or, in the case of i_mode, even into a numeric variable because the number of levels is large enough. Alternatively, you could create several factors that capture different properties of the levels and use those for partitioning.
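As a rough illustration of the scale involved and of the suggested conversions, here is a sketch in base R. The helper n_binary_splits() and the toy factor x are hypothetical and only stand in for the real variables:

```r
# Hypothetical helper: number of binary splits an exhaustive search has to
# consider for an unordered factor with nl levels
n_binary_splits <- function(nl) 2^(nl - 1) - 1

# Level counts taken from the str() output in the question
n_binary_splits(c(month = 7, r_mode = 29, i_mode = 143))

# Toy stand-in for a factor with many levels (made-up labels)
x <- factor(c("AAA", "ABC", "CBB", "ABC", "AAA"))

# If the levels have a natural order, an ordered factor has only nl - 1
# candidate cut points between adjacent levels
x_ord <- factor(x, levels = levels(x), ordered = TRUE)

# Alternatively, use the integer codes as a numeric variable
x_num <- as.integer(x)
```

Both conversions only make sense if the order of the levels is meaningful; the default alphabetical order used here is just a placeholder.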

lattice plot error: need finite xlim values calls

Whenever I try to plot across factors I keep getting an error.
Here is what my data looks like:
str(dataWithNoNa)
## 'data.frame': 17568 obs. of 4 variables:
## $ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
## $ dayType : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
I am trying to plot using the lattice plotting system using Weekday/Weekend as a factor.
Here is what I tried:
plot(dataWithNoNa$steps~ dataWithNoNa$interval | dataWithNoNa$dayType, type="l")
Error in plot.window(...) : need finite 'xlim' values
I even checked to make sure my data had no NAs:
sum(is.na(dataWithNoNa$interval))
## [1] 0
sum(is.na(dataWithNoNa$steps))
## [1] 0
What am I doing wrong?
Try this:
library(lattice)
xyplot(steps ~ interval | factor(dayType), data=df)
Sample data:
df <- data.frame(
steps=c(1.717,0.3396,0.1321,0.1509,0.0755),
interval=c(0,5,10,15,20),
dayType=c(1,1,1,2,2)
)

Mean of all means of subsets of data differs from overall mean

I have a large data set which looks like so:
str(ldt)
'data.frame': 116105 obs. of 11 variables:
$ s : Factor w/ 35 levels "1","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PM : Factor w/ 3 levels "C","F","NF": 3 3 3 3 3 3 3 3 3 3 ...
$ day : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ block : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
$ item : chr "parity" "grudoitong" "gunirec" "pirul" ...
$ C : logi TRUE TRUE TRUE TRUE TRUE FALSE ...
$ S : Factor w/ 2 levels "Nonword","Word": 2 1 1 1 2 2 2 1 2 1 ...
$ R : Factor w/ 2 levels "Nonword","Word": 2 1 1 1 2 1 2 1 2 1 ...
$ RT : num 0.838 1.026 0.93 0.553 0.815 ...
When I get means by factor from this data set, and then get the mean of those means it's slightly different from the mean of the original data set. It's different again when I split it into more factors and get the mean of those means. For example:
mean(ldt$RT[ldt$C])
[1] 0.6630013
mean(tapply(ldt$RT[ldt$C],list(s=ldt$s[ldt$C], PM= ldt$PM[ldt$C]),mean))
[1] 0.6638781
mean(tapply(ldt$RT[ldt$C],list(s=ldt$s[ldt$C], day = ldt$day[ldt$C], item=ldt$S[ldt$C], PM=ldt$PM[ldt$C]),mean))
[1] 0.6648401
What on earth is causing this discrepancy? The only thing I can imagine is that the subset means are getting rounded off. Is that why the answers are different? What's the exact mechanic at work here?
Thank you
The mean of means is not the same as the mean of all numbers.
Simple example: Take the dataset
1,3,5,6,7
The mean of 1 and 3 obviously is 2, the mean of 5,6,7 is 6.
The mean of the means therefore would be 4.
However, we have 1+3+5+6+7 = 22 and 22/5 = 4.4.
Thus, your problem is on the mathematical side of your calculation and not with your code.
To overcome this problem you would have to use the weighted mean, i.e. weight each group mean by the number of values in that group divided by the total number of observations. In our example:
2/5 * 2 + 3/5 * 6 = 4.4
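The same arithmetic can be sketched in R with tapply() and weighted.mean(); the grouping vector g is an assumption that just reproduces the split used in the example:

```r
x <- c(1, 3, 5, 6, 7)
g <- c("a", "a", "b", "b", "b")   # hypothetical grouping: {1, 3} and {5, 6, 7}

group_means <- tapply(x, g, mean)    # a = 2, b = 6
group_sizes <- tapply(x, g, length)  # a = 2, b = 3

# Unweighted mean of the group means: every group counts equally
mean(group_means)

# Weighting each group mean by its group size recovers the overall mean
weighted.mean(group_means, w = group_sizes)
mean(x)
```

The unweighted version only matches the overall mean when all groups have the same size.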

Getting an error "(subscript) logical subscript too long" while training SVM from e1071 package in R

I am training svm using my traindata. (e1071 package in R). Following is the information about my data.
> str(train)
'data.frame': 891 obs. of 10 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 15 levels "capt","col","countess",..: 12 13 9 13 12 12 12 8 13 13
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 ...
$ ticket : Factor w/ 533 levels "110152","110413",..: 516 522 531 50 473 276 86 396
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 9 levels "a","b","c","d",..: 9 3 9 3 9 9 5 9 9 9 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ family : int 1 1 0 1 0 0 0 4 2 1 ...
I train it as the following.
library(e1071)
model1 <- svm(survived~.,data=train, type="C-classification")
No problem here. But when I predict as:
pred <- predict(model1,test)
I get the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
I also tried removing "ticket" predictor from both train and test data. But still same error. What is the problem?
There might be a difference in the number of levels in one of the factors in the 'test' dataset.
Run str(test) and check that the factor variables have the same levels as the corresponding variables in the 'train' dataset.
For example, the output below shows that my.test$foo has only 4 levels:
str(my.train)
'data.frame': 554 obs. of 7 variables:
....
$ foo: Factor w/ 5 levels "C","Q","S","X","Z": 2 2 4 3 4 4 4 4 4 4 ...
str(my.test)
'data.frame': 200 obs. of 7 variables:
...
$ foo: Factor w/ 4 levels "C","Q","S","X": 3 3 3 3 1 3 3 3 3 3 ...
That's correct. The train data contains 2 blanks for embarked, so there is one extra categorical value for the blanks, and that is why you are getting this error:
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The first is blank
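One way to line the levels up, sketched in base R with made-up data frames (my.train and my.test here are hypothetical stand-ins, not the actual Titanic data):

```r
# Hypothetical stand-ins for the train and test sets
my.train <- data.frame(foo = factor(c("C", "Q", "S", "X", "Z", "S")))
my.test  <- data.frame(foo = factor(c("S", "C", "S", "X", "Q")))

nlevels(my.train$foo)  # 5 levels
nlevels(my.test$foo)   # 4 levels: "Z" never occurs in the test set

# Re-declare the test factor using the training levels
my.test$foo <- factor(my.test$foo, levels = levels(my.train$foo))

identical(levels(my.test$foo), levels(my.train$foo))
```

Note that any value present in the test set but absent from the training levels would become NA after this step, so it is worth checking for new NAs afterwards.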
I encountered the same problem today. It turned out that the svm model in e1071 package can only use rows as the objects, which means one row is one sample, rather than column. If you use column as the sample and row as the variable, this error will occur.
Probably your data is good (no new levels in test data), and you just need a small trick, then you are fine with prediction.
test.df = rbind(train.df[1,],test.df)
test.df = test.df[-1,]
This trick was from R Random Forest - type of predictors in new data do not match.
Today I encountered this problem, used above trick and then solved the problem.
I have been playing with that data set as well. I know this was a long time ago, but one of the things you can do is explicitly include only the columns you feel will add to the model, like such:
fit <- svm(Survived~Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train)
This eliminated the problem for me by eliminating columns that contribute nothing (like ticket number) which have no relevant data.
Another possible issue, and the one that fixed my code, was that I had forgotten to make some of my independent variables factors.

Difficulty creating ROC curve using library(ROCR) in R

I have a simple 2-column data frame. The labels (binary) are Benign and Malignant, and the predictor is a five-point ordinal variable. Here is the summary:
'data.frame': 127 obs. of 2 variables:
$ GRADE : Ord.factor w/ 5 levels "benign"<"likely benign"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ BENIGN.MALIGN: Factor w/ 2 levels "Benign","Malignant": 1 1 1 1 1 1 1 1 1 1 ...
But when I use:
pred<-prediction(myTable[[1]],myTable[[2]])
I get this error message:
Error in prediction(myTable[[1]], myTable[[2]]) :
Format of predictions is invalid.
What can I do to rectify this? Thanks
If you are using the grade as a score and have verified or assumed that the intervals in the score are equidistant, you can convert the 5-point Likert scale into numeric form as follows:
pred <- prediction(as.numeric(myTable[[1]]), myTable[[2]])
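To see what the conversion does, here is a small base R sketch. The first two level names come from the question; the other three are placeholders I made up, since the original output truncates them:

```r
# Ordered factor with five levels; the last three labels are hypothetical
grade <- factor(c("benign", "likely benign", "malignant", "benign"),
                levels = c("benign", "likely benign", "indeterminate",
                           "likely malignant", "malignant"),
                ordered = TRUE)

# as.numeric() returns the position of each value in the level order
scores <- as.numeric(grade)
scores
```

The resulting scores are the ranks 1 to 5 in level order, which is what makes the equidistance assumption above matter.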
