I am using the effects package to construct some probability graphs showing the predicted probabilities from a logistic regression model. However, I get an odd error message and don't know what the issue is.
When I attempt to generate the plots, I get the following error. The warning is not the issue; it's the error message that I don't understand.
library(effects)
dat$won_ping = as.factor(dat$won_ping)
mod2 = glm(won_ping ~ our_bid +
age_of_oldest_driver2 +
credit_type2 +
coverage_type2 +
home_owner2 +
vehicle_driver_score +
currently_insured2 +
zipcode2,
data=dat, family=binomial(link="logit"))
> plot(effect("our_bid*vehicle_driver_score", mod2), rescale.axis=FALSE, multiline=TRUE)
Warning message:
In analyze.model(term, mod, xlevels, default.levels) :
our_bid:vehicle_driver_score does not appear in the model
Error in plot(effect("our_bid*vehicle_driver_score", mod2), rescale.axis = FALSE, :
error in evaluating the argument 'x' in selecting a method for function 'plot': Error in apply(mod.matrix[, components], 1, prod) :
subscript out of bounds
Here's some info on my data:
> str(dat)
'data.frame': 85240 obs. of 71 variables:
$ our_bid : num 155 123 183 98 108 159 98 123 98 200 ...
$ won_ping : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
$ zipcode2 : Factor w/ 4 levels "1:6999","10000:14849",..: 4 3 2 1 3 2 3 1 2 2 ...
$ age_of_oldest_driver2 : Factor w/ 4 levels "18 to 21","22 to 25",..: NA 3 NA NA NA NA 3 NA 3 NA ...
$ currently_insured2 : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 2 1 1 ...
$ credit_type2 : Ord.factor w/ 4 levels "POOR"<"FAIR"<..: 2 3 2 3 2 2 1 3 3 2 ...
$ coverage_type2 : Factor w/ 4 levels "BASIC","MINIMUM",..: 4 3 3 3 3 3 3 3 4 3 ...
$ home_owner2 : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 2 ...
$ vehicle_driver_score : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
And finally, here is some possibly useful session info:
> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] effects_2.2-1 colorspace_1.1-1 nnet_7.3-1 MASS_7.3-16 lattice_0.20-0 foreign_0.8-46
loaded via a namespace (and not attached):
[1] tools_2.14.0
Help! What does the error message mean? Normally, if a "subscript is out of bounds" that'd mean I'm selecting something outside the bounds of that data structure, but that simply is not occurring.
EDIT:
To @Rowland:
As I said above, the warning and error messages are separate and unrelated. Let's say I take out zipcode2 and run the glm:
mod2 = glm(won_ping ~ our_bid +
age_of_oldest_driver2 +
credit_type2 +
coverage_type2 +
home_owner2 +
vehicle_driver_score +
currently_insured2,
data=dat, family=binomial(link="logit"))
> plot(effect("our_bid*home_owner2", mod2), rescale.axis=FALSE, multiline=TRUE)
Warning message:
In analyze.model(term, mod, xlevels, default.levels) :
our_bid:home_owner2 does not appear in the model
This produces just the warning, which is fine, as I get the desired result. So the fact that the ":" term does not appear in the model is not the issue, and DOES NOT cause the error message.
Try this:
with(dat, table(our_bid, vehicle_driver_score))
I suspect you have some unpopulated cells. Given your edit, it seems unlikely that the sparseness I am hypothesizing lies with these two variables. It still remains possible that, despite your large number of cases, there are empty cells when the model is constructed with all of those factor variables.
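To check that more broadly, a quick sketch along the same lines (variable names taken from the str() output above) counts empty cells across all the factor covariates at once:
# Any nonzero result means some factor combinations never occur together
sum(xtabs(~ age_of_oldest_driver2 + credit_type2 + coverage_type2 +
            home_owner2 + currently_insured2 + zipcode2, data = dat) == 0)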
I am trying to conduct multiple imputation with a 2-level zero-inflated negative binomial model using MICE. This is my code:
library(mice)
# Dry run (maxit = 0) to pull out the default method and predictor matrix
ini <- mice(mydata2, m = 10, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
# "2l.zinb": two-level zero-inflated negative binomial method
meth[3] <- "2l.zinb"
# Type codes on this row: -2 = class (cluster) variable, 0 = not used, 3 = predictor
pred[1,] <- c(-2, 0, 3, 3, 3)
set.seed(123456)
## impute missing data
imp2l <- mice(mydata2, maxit = 10, method = meth, predictorMatrix = pred, m = 10, print = FALSE)
When I run the imputation line of code I get the following error:
Error in parse(text = x, keep.source = FALSE) :
  <text>:1:141: unexpected ')'
1: nz~DistrictCode+AY2012to2013+AY2013to2014+AY2014to2015+AY2015to2016+AY2016to2017+AY2017to2018+AY2018to2019+treatTx55to80+treatTx80ormore+(1|)
I have followed other threads with a similar error and have made sure that my variables don't have any illegal characters or spaces in the variable names or even the level labels. My data structure looks like this:
'data.frame': 7461 obs. of 5 variables:
$ DistrictCode : num 61176 61176 61176 61176 61176 ...
$ AY : Factor w/ 8 levels "2011to2012","2012to2013",..: 1 2 3 4 5 6 7 8 1 2 ...
$ TotalExpulsions: num 23 24 15 10 17 13 16 13 14 4 ...
$ prepost : Factor w/ 2 levels "Preinterventionperiod",..: 1 1 2 2 2 2 2 2 1 1 ...
$ treat : Factor w/ 3 levels "Control0to55",..: 1 1 1 1 1 1 1 1 2 2 ...
Is there something that I'm missing?
Thank you in advance.
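One detail worth checking, an assumption suggested by the dangling (1|) in the error rather than a confirmed fix: in mice's two-level methods, the -2 code naming the cluster variable is read from the predictorMatrix row of the variable being imputed. The code above edits row 1 (DistrictCode), while meth[3] targets TotalExpulsions, so row 3 may contain no -2 and the random-effect term gets built with an empty group. A sketch of that adjustment:
# Hypothetical fix: put the -2 (cluster id) on row 3, the variable imputed with "2l.zinb"
pred[3,] <- c(-2, 3, 0, 3, 3)  # DistrictCode as cluster id; TotalExpulsions itself unused
imp2l <- mice(mydata2, maxit = 10, method = meth, predictorMatrix = pred, m = 10, print = FALSE)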
I am using anesrake to weight some survey data, but am getting a "non-numeric argument to binary operator" error. The error only occurs after I have added the names to the list to use as targets:
library(anesrake)
# Target distributions (population proportions) for each weighting variable
gender1 <- c(0.516166000986901, 0.483833999013099)
age <- c(0.15828262425613, 0.364861110549873, 0.429947760183493, 0.0469085050104993)
mylist <- list(gender1, age)
# Name the targets after the matching variables in the data frame
names(mylist) <- c("gender1", "age")
result <- anesrake(mylist, france, caseid = france$caseid, iterate = TRUE)
Error in x + weights : non-numeric argument to binary operator
In addition: Warning message:
In anesrake(targets, france, caseid = france$caseid, iterate = TRUE) :
Targets for age do not sum to 100%. Adjusting values to total 100%
This also says that the targets for age don't sum to 100%, which they do, so I'm also not sure what that's about. If I leave out the names(mylist) bit, I get the following error, presumably because R doesn't know which variables to use, but not the non-numeric error:
Error in selecthighestpcts(discrep1, inputter, pctlim) :
No variables are off by more than 5 percent using the method you have chosen, either weighting is unnecessary or a smaller pre-raking limit should be chosen.
The variables in the data frame have the same names as the targets in the list, and are numeric:
> str(france)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 993 obs. of 5 variables:
$ Gender :Classes 'labelled', 'numeric' atomic [1:993] 2 2 2 2 2 2 2 2 2 2 ...
.. ..- attr(*, "label")= chr "Gender"
$ Age2 : num 2 3 2 2 2 2 2 1 2 3 ...
$ gender1: num 2 2 2 2 2 2 2 2 2 2 ...
$ caseid : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : num 2 3 2 2 2 2 2 1 2 3 ...
I have also tried converting gender1 and age to factor variables (as the numbers represent levels of each variable - gender has 2, age has 4), but with the same result. I have used anesrake successfully before, so there must be something I am missing, but I cannot see it! Any help greatly appreciated.
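One possibility worth ruling out, a guess based on the tbl_df class in the str() output rather than a confirmed diagnosis: with a tibble, single-column indexing like france[, "gender1"] returns a one-column data frame instead of a numeric vector, which can trigger exactly this kind of "non-numeric argument to binary operator" error inside a package that expects plain vectors. A sketch:
# Convert the tibble to a plain data.frame before raking
france <- as.data.frame(france)
result <- anesrake(mylist, france, caseid = france$caseid, iterate = TRUE)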
I want to analyze my data with conditional inference trees using the ctree function from partykit. I specifically went for this function because - if I understood correctly - it's one of the only ones allowing multiway splits. I need this option because all of my variables are multilevel (unordered) categorical variables.
However, trying to enable multiway split using ctree_control gives the following error:
library(partykit)
aufprallentree <- ctree(case ~ ., data = aufprallen,
                        control = ctree_control(minsplit = 10, minbucket = 5, multiway = TRUE))
## Error in 1:levels(x) : NA/NaN argument
## In addition: Warning messages:
## 1: In 1:levels(x) :
## numerical expression has 4 elements: only the first used
## 2: In partysplit(as.integer(isel), index = 1:levels(x)) :
## NAs introduced by coercion
Does anyone know how to solve this? Or am I mistaken, and ctree does not allow multiway splits?
For clarity, an overview of my data: (no NAs)
str(aufprallen)
## 'data.frame': 299 obs. of 10 variables:
## $ prep : Factor w/ 6 levels "an","auf","hinter",..: 2 2 2 2 2 2 1 2 2 2 ...
## $ prep_main : Factor w/ 2 levels "auf","other": 1 1 1 1 1 1 2 1 1 1 ...
## $ case : Factor w/ 2 levels "acc","dat": 1 1 2 1 1 1 2 1 1 1 ...
## $ sense : Factor w/ 3 levels "crashdown","crashinto",..: 2 2 1 3 2 2 1 2 1 2 ...
## $ PO_type : Factor w/ 4 levels "object","region",..: 4 4 3 1 4 4 3 4 3 4 ...
## $ PO_type2 : Factor w/ 3 levels "object","region",..: 1 1 3 1 1 1 3 1 3 1 ...
## $ perfectivity : Factor w/ 2 levels "imperfective",..: 1 1 2 2 1 1 1 1 1 1 ...
## $ mit_Körperteil: Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
## $ PP_place : Factor w/ 4 levels "back","front",..: 4 1 1 1 1 1 1 1 1 1 ...
## $ PP_place_main : Factor w/ 3 levels "marked","rel",..: 2 3 3 3 3 3 3 3 3 3 ...
Thanks in advance!
A couple of remarks:
The error with 1:levels(x) was a bug in ctree; the code should have been 1:nlevels(x). I just fixed this on R-Forge - so you can check out the SVN from there and install the package manually if you want to use the option now. (Contact me off-list if you need more details on this.) Torsten will probably also make a new CRAN release in the next few weeks.
Another function that can learn binary classification trees with multiway splits is glmtree in the partykit package. The code would be glmtree(case ~ ., data = aufprallen, family = binomial, catsplit = "multiway", minsize = 5). It uses parameter instability tests instead of conditional inference for association to determine the splitting variables and adopts the formal likelihood. But in many cases the results are fairly similar to ctree.
In both algorithms, the multiway splits are very basic: if a categorical variable is selected for splitting, no split selection is done at all; instead, each category gets its own daughter node. There are algorithms that try to determine optimal groupings of categories with a data-driven number of daughter nodes (between 2 and the number of categories).
Even though you have categorical predictor variables with more than two levels, you don't need multiway splits. Many algorithms use only binary splits because any multiway split can be represented by a sequence of binary splits. In many datasets, moreover, it turns out to be beneficial to separate only a few of the categories in a splitting factor rather than all of them.
Overall, my recommendation would be to start out with standard conditional inference trees with binary splits only. Only if it turns out that this leads to many binary splits in the same factor would I go on to explore multiway splits.
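To make that recommendation concrete, here is a minimal sketch (the glmtree call is the one quoted above; everything else is standard usage):
library(partykit)
# Step 1: standard conditional inference tree, binary splits only
ct <- ctree(case ~ ., data = aufprallen,
            control = ctree_control(minsplit = 10, minbucket = 5))
plot(ct)
# Step 2: only if one factor keeps splitting repeatedly, try multiway splits
gt <- glmtree(case ~ ., data = aufprallen, family = binomial,
              catsplit = "multiway", minsize = 5)
plot(gt)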
I have a data.frame where all columns are numeric. I want to convert one integer column to a factor, but doing so will convert all other columns to class character. Is there any way to convert just one column to a factor?
The example is from Converting variables to factors in R:
myData <- data.frame(A = rep(1:2, 3), B = rep(1:3, 2), Pulse = 20:25)
myData$A <- as.factor(myData$A)
The result:
apply(myData, 2, class)
# A B Pulse
# "character" "character" "character"
sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] splines stats graphics grDevices utils datasets methods base ...
str(myData$A)
# Factor w/ 2 levels "1","2": 1 2 1 2 1 2
Your code actually works when I test it.
This is my output from str(myData):
'data.frame': 6 obs. of 3 variables:
$ A : Factor w/ 2 levels "1","2": 1 2 1 2 1 2
$ B : int 1 2 3 1 2 3
$ Pulse: int 20 21 22 23 24 25
Your issue arises because, as ?apply states:
‘apply’ attempts to coerce to an array via ‘as.matrix’ if it is two-dimensional (e.g., a data frame)
This is done before executing the function on each column. And when you run as.matrix(myData), you end up with everything coerced to one class, in this case character data:
is.character(as.matrix(myData))
#[1] TRUE
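If you just want the per-column classes without the matrix coercion, sapply() iterates over the data frame as a list of columns:
sapply(myData, class)
#         A         B     Pulse
#  "factor" "integer" "integer"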
I am training an SVM on my training data (e1071 package in R). Here is the information about my data:
> str(train)
'data.frame': 891 obs. of 10 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 15 levels "capt","col","countess",..: 12 13 9 13 12 12 12 8 13 13
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 ...
$ ticket : Factor w/ 533 levels "110152","110413",..: 516 522 531 50 473 276 86 396
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 9 levels "a","b","c","d",..: 9 3 9 3 9 9 5 9 9 9 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ family : int 1 1 0 1 0 0 0 4 2 1 ...
I train it as follows:
library(e1071)
model1 <- svm(survived ~ ., data = train, type = "C-classification")
No problem here. But when I predict as:
pred <- predict(model1,test)
I get the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
I also tried removing the "ticket" predictor from both the train and test data, but I still get the same error. What is the problem?
There might be a difference in the number of levels in one of the factors in the 'test' dataset.
Run str(test) and check that the factor variables have the same levels as the corresponding variables in the 'train' dataset.
I.e., the example below shows that my.test$foo only has 4 levels:
str(my.train)
'data.frame': 554 obs. of 7 variables:
....
$ foo: Factor w/ 5 levels "C","Q","S","X","Z": 2 2 4 3 4 4 4 4 4 4 ...
str(my.test)
'data.frame': 200 obs. of 7 variables:
...
$ foo: Factor w/ 4 levels "C","Q","S","X": 3 3 3 3 1 3 3 3 3 3 ...
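If that is the case, one common sketch for aligning them (assuming the values in my.test$foo are a subset of the training levels):
# Rebuild the test factor using the full set of training levels
my.test$foo <- factor(my.test$foo, levels = levels(my.train$foo))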
That's correct. The train data contains 2 blanks for embarked; because of this, there is one extra categorical level for the blanks, and you are getting this error:
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The first level is blank.
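One hedged way to handle it, assuming you are willing to recode the two blanks (mapping them to "S" here is only an illustrative choice, not a recommendation):
# Recode the blank entries, then drop the now-unused "" level
train$embarked[train$embarked == ""] <- "S"
train$embarked <- droplevels(train$embarked)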
I encountered the same problem today. It turned out that the svm model in the e1071 package expects rows to be the observations (one row is one sample), not columns. If you use columns as samples and rows as variables, this error will occur.
Probably your data is fine (no new levels in the test data) and you just need a small trick; then prediction works fine.
# Prepend one row of the training data so the test factors pick up
# the full set of training factor levels
test.df <- rbind(train.df[1, ], test.df)
# Drop that row again; the factor levels are retained
test.df <- test.df[-1, ]
This trick is from R Random Forest - type of predictors in new data do not match. I encountered this problem today, used the trick above, and it solved the problem.
I have been playing with that data set as well. I know this was a long time ago, but one thing you can do is explicitly include only the columns you feel will add to the model, like so:
fit <- svm(Survived~Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train)
This fixed the problem for me by excluding columns (like ticket number) that contribute no relevant data.
Another possible issue, which resolved my code, was that I had forgotten to make some of my independent variables factors.
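A minimal sketch of that last fix (pclass is only an example; apply it to whichever variables should be categorical):
train$pclass <- factor(train$pclass)
# Match the test levels to the training levels so predict() sees consistent factors
test$pclass <- factor(test$pclass, levels = levels(train$pclass))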