R anesrake issue with list names: non-numeric argument to binary operator

I am using anesrake to weight some survey data, but am getting a "non-numeric argument to binary operator" error. The error only occurs after I have added the names to the list I use as targets:
gender1 <- c(0.516166000986901, 0.483833999013099)
age <- c(0.15828262425613, 0.364861110549873, 0.429947760183493, 0.0469085050104993)
mylist <- list(gender1, age)
names(mylist) <- c("gender1", "age")
result <- anesrake(mylist, france, caseid = france$caseid, iterate = TRUE)
Error in x + weights : non-numeric argument to binary operator
In addition: Warning message:
In anesrake(targets, france, caseid = france$caseid, iterate = TRUE) :
Targets for age do not sum to 100%. Adjusting values to total 100%
The warning also says that the targets for age don't sum to 100%, which they do, so I'm not sure what that's about either. If I leave out the names(mylist) step, I get the following error instead, presumably because R doesn't know which variables to use, but not the binary operator error:
Error in selecthighestpcts(discrep1, inputter, pctlim) :
No variables are off by more than 5 percent using the method you have chosen, either weighting is unnecessary or a smaller pre-raking limit should be chosen.
The variables in the data frame have the same names as the targets in the list, and are numeric:
> str(france)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 993 obs. of 5 variables:
$ Gender :Classes 'labelled', 'numeric' atomic [1:993] 2 2 2 2 2 2 2 2 2 2 ...
.. ..- attr(*, "label")= chr "Gender"
$ Age2 : num 2 3 2 2 2 2 2 1 2 3 ...
$ gender1: num 2 2 2 2 2 2 2 2 2 2 ...
$ caseid : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : num 2 3 2 2 2 2 2 1 2 3 ...
I have also tried converting gender1 and age to factor variables (the numbers represent levels of each variable: gender has 2, age has 4), but with the same result. I have used anesrake successfully before, so there must be something I am missing, but I cannot see it! Any help greatly appreciated.
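The factor conversion I tried was along these lines (a sketch; the exact call may have differed slightly):
france$gender1 <- factor(france$gender1)   # 2 levels
france$age <- factor(france$age)           # 4 levels
result <- anesrake(mylist, france, caseid = france$caseid, iterate = TRUE)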

Related

Recursive partitioning for factors/characters problem

Currently I am working with the dataset predictions. In this data I have converted the obvious character variables into factors, because I think factors work better than characters with glmtree() (tell me if I am wrong about this):
> str(predictions)
'data.frame': 43804 obs. of 14 variables:
$ month : Factor w/ 7 levels "01","02","03",..: 6 6 6 6 1 1 2 2 3 3 ...
$ pred : num 0.21 0.269 0.806 0.945 0.954 ...
$ treatment : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 2 2 2 ...
$ type : Factor w/ 4 levels "S","MS","ML",..: 1 1 4 4 4 4 4 4 4 4 ...
$ i_mode : Factor w/ 143 levels "AAA","ABC","CBB",..: 28 28 104 104 104 104 104 104 104 104 ...
$ r_mode : Factor w/ 29 levels "0","5","8","11",..: 4 4 2 2 2 2 2 2 2 2 ...
$ in_mode: Factor w/ 22 levels "XY",..: 11 11 6 6 6 6 6 6 6 6 ...
$ v_mode : Factor w/ 5 levels "1","3","4","7",..: 1 1 1 1 1 1 1 1 1 1 ...
$ di : num 1157 1157 1945 1945 1945 ...
$ cont : Factor w/ 5 levels "AN","BE",..: 2 2 2 2 2 2 2 2 2 2 ...
$ hk : num 0.512 0.512 0.977 0.977 0.941 ...
$ np : num 2 2 2 2 2 2 2 2 2 2 ...
$ hd : num 1 1 0.408 0.408 0.504 ...
$ nd : num 1 1 9 9 9 9 7 7 9 9 ...
I want to estimate a recursive partitioning model of this kind:
library("partykit")
glmtr <- glmtree(formula = pred ~ treatment + 1 | (month+type+i_mode+r_mode+in_mode+v_mode+di+cont+np+nd+hd+hk),
data = predictions,
maxdepth=6,
family = quasibinomial)
My data does not have any NAs. However, the following error arises (even after converting the characters to factors):
Error in matrix(0, nrow = mi, ncol = nl) :
invalid 'nrow' value (too large or NA)
In addition: Warning message:
In matrix(0, nrow = mi, ncol = nl) :
NAs introduced by coercion to integer range
Any clue?
Thank you
You are right that glmtree() and the underlying mob() function expect the split variables to be factors when they carry nominal information. However, computationally this is only feasible for factors with a limited number of levels, because the algorithm tries all possible partitions of the levels into two groups. Thus, for your i_mode factor with nl = 143 levels, this means considering mi possible splits into two groups:
nl <- 143
mi <- 2^(nl - 1L) - 1L
mi
## [1] 5.575186e+42
Internally, mob() tries to create a matrix for storing all log-likelihoods associated with the corresponding partitioned models, and this fails because such a matrix cannot be represented. (And even if it could, you wouldn't finish fitting all the associated models.) Admittedly, the error message is not very useful and should be improved. We will look into that for the next revision of the package.
To solve the problem, I would recommend turning the variables i_mode, r_mode, and in_mode into variables that are more suitable for binary splitting with exhaustive search. Maybe some of them are actually ordinal? If so, I would recommend turning them into ordered factors, or, in the case of i_mode, even into a numeric variable, because the number of levels is large enough. Alternatively, you could create several factors that capture different properties of the levels and use those for partitioning.
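A minimal sketch of those conversions (whether the codes really carry an ordering is an assumption that must be checked against the actual coding; it is not guaranteed by the data shown):
# hypothetical recoding, only valid if the codes are at least ordinal
# i_mode: replace with the variable's true numeric score if one exists;
# as.numeric() on a factor only returns the (alphabetical) level index
predictions$i_mode <- as.numeric(predictions$i_mode)
predictions$r_mode <- factor(predictions$r_mode, ordered = TRUE)
predictions$in_mode <- factor(predictions$in_mode, ordered = TRUE)
glmtr <- glmtree(formula = pred ~ treatment + 1 | (month + type + i_mode + r_mode + in_mode + v_mode + di + cont + np + nd + hd + hk),
                 data = predictions, maxdepth = 6, family = quasibinomial)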

Convert dummy variable from numeric to factor for chi-square test in R

I want to perform a chi-square test in R using the following dataset, after creating dummy variables. The p-value I get from the chi-square test is 1, which is incorrect. I suspect this is because, after dummy variable creation, the data structure changes from factor to numeric. This is a hypothesis-testing exercise that asks whether the defective percentage varies across the 4 countries at the 5% significance level. Please advise what the possible error is and what the solution could be.
Subset of the dataset used:
Phillippines  Indonesia   Malta       India
Error Free    Error Free  Defective   Error Free
Error Free    Error Free  Error Free  Defective
Error Free    Defective   Defective   Error Free
Error Free    Error Free  Error Free  Error Free
Error Free    Error Free  Defective   Error Free
Error Free    Error Free  Error Free  Error Free
The columns of the initial data are factors:
> str(data)
'data.frame': 300 obs. of 4 variables:
$ Phillippines: Factor w/ 2 levels "Defective","Error Free": 2 2 2 2 2 2 2 2 2 2 ...
$ Indonesia : Factor w/ 2 levels "Defective","Error Free": 2 2 1 2 2 2 1 2 2 2 ...
$ Malta : Factor w/ 2 levels "Defective","Error Free": 1 2 1 2 1 2 2 2 2 2 ...
$ India : Factor w/ 2 levels "Defective","Error Free": 2 1 2 2 2 2 2 2 2 2 …
I create dummy variables for the categorical data (Error Free and Defective) with the following code:
library(caret)
dmy <- dummyVars("~ .", data = data, fullRank = T)
trsf <- data.frame(predict(dmy, newdata = data))
After dummy variable creation, the dummy variables are numeric:
> str(trsf)
'data.frame': 300 obs. of 4 variables:
$ Phillippines.Error.Free: num 1 1 1 1 1 1 1 1 1 1 ...
$ Indonesia.Error.Free : num 1 1 0 1 1 1 0 1 1 1 ...
$ Malta.Error.Free : num 0 1 0 1 0 1 1 1 1 1 ...
$ India.Error.Free : num 1 0 1 1 1 1 1 1 1 1 ...
P-value of chi-square is 1
> chisq.test(trsf)
Pearson's Chi-squared test
data: trsf
X-squared = 112.75, df = 897, p-value = 1
Warning message:
In chisq.test(trsf) : Chi-squared approximation may be incorrect
I tried applying as.factor and then running the chi-square test, but get the following error:
trsf_2 <- as.factor(trsf)
str(trsf_2)
Factor w/ 4 levels "c(1, 1, 1, 1, 1, 0, 0, 0, 0, 1)",..: NA NA NA NA
- attr(*, "names")= chr [1:4] "Phillippines.Error.Free" "Indonesia.Error.Free" "Malta.Error.Free" "India.Error.Free"
> chisq.test(trsf_2)
Error in chisq.test(trsf_2) :
all entries of 'x' must be nonnegative and finite
In addition: Warning message:
In Ops.factor(x, 0) : ‘<’ not meaningful for factors
You could try
dataset <- as.data.frame(lapply(data, as.numeric))
chisq.test(dataset)
However, I am not sure that chi-square is the most appropriate method for binary variables. May I suggest the Phi coefficient? You can find more information here:
https://en.wikipedia.org/wiki/Phi_coefficient
However, you will need to write a loop if you do not want to do it manually for each pair of variables (i.e., countries).
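A possible version of such a loop (my own sketch, not from the original answer), using the standard 2x2 relation |phi| = sqrt(X^2 / n) on each pair of dummy-coded columns in trsf:
countries <- names(trsf)
phi <- matrix(NA_real_, length(countries), length(countries),
              dimnames = list(countries, countries))
for (i in seq_along(countries)) {
  for (j in seq_along(countries)) {
    if (i < j) {
      tab <- table(trsf[[i]], trsf[[j]])                              # 2 x 2 contingency table
      x2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
      phi[i, j] <- sqrt(x2 / sum(tab))                                # |phi| = sqrt(X^2 / n)
    }
  }
}
phi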

Single input data.frame instead of testset.csv fails in randomForest code in R

I have written an R script which runs successfully and predicts output, but only when a CSV with multiple entries is passed as input to the classifier.
training_set = read.csv('finaldata.csv')
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[-5],
                          y = training_set$Song,
                          ntree = 50)
test_set = read.csv('testSet.csv')
y_pred = predict(classifier, newdata = test_set)
The above code runs successfully, but instead of giving 10+ inputs to the classifier, I want to pass a single-row data.frame as input. That works with other classifiers, so why not with this one?
So the following code doesn't work and throws an error:
y_pred = predict(classifier, data.frame(Emot="happy",Pact="Walking",Mact="nothing",Session="morning"))
Error in predict.randomForest(classifier, data.frame(Emot = "happy", :
Type of predictors in new data do not match that of the training data.
I even tried keeping a single entry in testinput.csv, and it still throws the same error! How do I solve it? This code is the back end of another program of mine, and I want to pass only a single entry as the test case to predict results. Also, all variables are factors in the training as well as the testing set. Help appreciated.
PS: Previous solutions to the same error didn't help me.
str(test_set)
'data.frame': 1 obs. of 5 variables:
$ Emot : Factor w/ 1 level "fear": 1
$ Pact : Factor w/ 1 level "Bicycling": 1
$ Mact : Factor w/ 1 level "browsing": 1
$ Session: Factor w/ 1 level "morning": 1
$ Song : Factor w/ 1 level "Dusk Till Dawn.mp3": 1
str(training_set)
'data.frame': 1052 obs. of 5 variables:
$ Emot : Factor w/ 8 levels "anger","contempt",..: 4 7 6 6 4 3 4 6 4 6 ...
$ Pact : Factor w/ 5 levels "Bicycling","Driving",..: 1 2 2 2 4 3 1 1 3 4 ...
$ Mact : Factor w/ 6 levels "browsing","chatting",..: 1 6 1 4 5 1 5 6 6 6 ...
$ Session: Factor w/ 4 levels "afternoon","evening",..: 3 4 3 2 1 3 1 1 2 1 ...
$ Song : Factor w/ 101 levels "Aaj Ibaadat.mp3",..: 29 83 47 72 29 75 77 8 30 53 ...
OK, this worked successfully, though it's a weird solution: equalize the factor levels of the training and test sets. The following code binds the first row of the training set onto the test set and then deletes it.
test_set <- rbind(training_set[1, ] , test_set)
test_set <- test_set[-1,]
Done! It works for a single input as well as a single-entry .csv file, without the model throwing an error.
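The rbind trick works because it copies the training factor levels onto the test columns. An equivalent, more explicit sketch (predictor names taken from the str() output above) would be to set the levels directly:
# re-level each test factor using the training levels
for (col in c("Emot", "Pact", "Mact", "Session")) {
  test_set[[col]] <- factor(test_set[[col]], levels = levels(training_set[[col]]))
}
y_pred <- predict(classifier, newdata = test_set)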

NA/NaN/Inf in foreign function error with mantelhaen.test()

I have a 100k row dataframe on which I want to compute a Cochran–Mantel–Haenszel test.
My variables are the educational level and a computed score cut into quantiles, my grouping variable is sex, and the code looks like this:
mantelhaen.test(db$education, db$score.grouped, db$sex)
This code throws the following error and warning:
Error in qr.default(a, tol = tol) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message: In ntot * rowsums : NAs produced by integer overflow
The error seems to be caused by my first variable: of the 7 variables I tested, only 2 produced the problem, and they don't seem to share anything obvious.
Missing values and factor levels don't seem to differ between the variables that throw the error and those that don't. I tried with complete cases (using na.omit) and the problem persists.
What triggers this error? What does it mean?
How can I get rid of it?
Interesting posts: R: NA/NaN/Inf in foreign function call (arg 1), What is integer overflow in R and how can it happen?
ADDENDUM: Here is the result of str (failures are education and imc.cl):
str(db[c("education","score.grouped","sex", ...)])
'data.frame': 104382 obs. of 7 variables:
$ age.cl: Ord.factor w/ 5 levels "<30 ans"<"30-40 ans"<..: 5 2 1 1 3 4 2 3 4 4 ...
..- attr(*, "label")= chr "age"
$ emploi2 : Factor w/ 8 levels "Agriculteurs exploitants",..: 3 5 6 8 8 8 8 3 3 3 ...
..- attr(*, "label")= chr "CSP"
$ tabac : Factor w/ 4 levels "ancien fumeur",..: 4 1 4 4 3 4 4 1 4 4 ...
..- attr(*, "label")= chr "tabac"
$ situ_mari2 : Factor w/ 3 levels "Vit seul","Divorsé, séparé ou veuf",..: 3 2 1 1 1 3 1 3 2 3 ...
..- attr(*, "label")= chr "marriage"
$ education : Factor w/ 3 levels "Universitaire",..: 1 1 1 1 3 1 1 1 1 1 ...
$ revenu.cl : Factor w/ 4 levels "<1800 euros/uc",..: 3 4 2 NA 4 1 1 4 4 1 ...
$ imc.cl : Ord.factor w/ 6 levels "Maigre"<"Normal"<..: 2 2 1 2 3 1 3 2 2 3 ...
..- attr(*, "label")= chr "IMC"
EDIT: by digging inside the function, I found that the error and warning are caused by a call to qr.solve. I don't understand anything about this, but I'll try to dive deeper.
EDIT2: inside qr.solve, the error is thrown by a Fortran call to .F_dqrdc2. This is so far beyond my level that my nose is starting to bleed.
EDIT3: I tried head()-ing my data to find out which row is the cause:
db2 = db %>% head(99787) #fails at 99788
db2 = db %>% tail(99698) #fails at 99699
mantelhaen.test(db2$education, db2$score.grouped, db2$sex)
This doesn't tell me much, but maybe it tells you something.
I was able to replicate the problem by making the data set bigger.
set.seed(101); n <- 500000
db <- data.frame(education = factor(sample(1:3, replace = TRUE, size = n)),
                 score = factor(sample(1:5, replace = TRUE, size = n)),
                 sex = sample(c("M", "F"), replace = TRUE, size = n))
After this, mantelhaen.test(db$education, db$score, db$sex) gives the reported error.
Thankfully, the real problem is not within the guts of the QR decomposition code: rather, it occurs when setting up a matrix prior to QR decomposition. There are two computations, ntot*colsums and ntot*rowsums, that overflow R's capacity for integer computation. There's a relatively easy way to work around this by creating a modified version of the function:
copy the source code: dump("mantelhaen.test", file = "my_mh.R")
edit the source code:
line 1: change the function name to my_mantelhaen.test (to avoid confusion)
lines 199 and 200: change ntot to as.numeric(ntot), converting the integer to double precision before the overflow happens
source("my_mh.R") to read the new function back in
Now
my_mantelhaen.test(db$education, db$score, db$sex)
should work.
You should definitely test the new function against the old function for cases where it works to make sure you get the same answer.
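For reference, the whole workaround in one place (the line numbers inside mantelhaen.test vary across R versions, so search for the ntot * rowsums and ntot * colsums expressions rather than relying on the numbers above):
dump("mantelhaen.test", file = "my_mh.R")      # write the function's source to a file
# edit my_mh.R by hand: rename the function to my_mantelhaen.test and
# wrap ntot in as.numeric() in the ntot * rowsums and ntot * colsums lines
source("my_mh.R")                              # load the patched copy
my_mantelhaen.test(db$education, db$score, db$sex)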
Now posted to the R bug list, we'll see what happens ...
Update 11 May 2018: this is fixed in the development version of R (what will become R 3.6.0).

Getting an error "(subscript) logical subscript too long" while training SVM from e1071 package in R

I am training an SVM on my training data (e1071 package in R). Here is the information about my data:
> str(train)
'data.frame': 891 obs. of 10 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 15 levels "capt","col","countess",..: 12 13 9 13 12 12 12 8 13 13
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 ...
$ ticket : Factor w/ 533 levels "110152","110413",..: 516 522 531 50 473 276 86 396
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 9 levels "a","b","c","d",..: 9 3 9 3 9 9 5 9 9 9 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ family : int 1 1 0 1 0 0 0 4 2 1 ...
I train it as follows:
library(e1071)
model1 <- svm(survived ~ ., data = train, type = "C-classification")
No problem here. But when I predict as:
pred <- predict(model1,test)
I get the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
I also tried removing the "ticket" predictor from both the train and test data, but I still get the same error. What is the problem?
There might be a difference in the number of levels of one of the factors in the 'test' dataset.
Run str(test) and check that the factor variables have the same levels as the corresponding variables in the 'train' dataset.
E.g., the example below shows that my.test$foo only has 4 levels:
str(my.train)
'data.frame': 554 obs. of 7 variables:
....
$ foo: Factor w/ 5 levels "C","Q","S","X","Z": 2 2 4 3 4 4 4 4 4 4 ...
str(my.test)
'data.frame': 200 obs. of 7 variables:
...
$ foo: Factor w/ 4 levels "C","Q","S","X": 3 3 3 3 1 3 3 3 3 3 ...
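One way to align them, assuming all the test levels also exist in the training data (a sketch using the example variables above):
# re-level the test factor using the training levels
my.test$foo <- factor(my.test$foo, levels = levels(my.train$foo))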
That's correct: the train data contains 2 blanks for embarked, so there is one extra categorical value for the blanks, and that is why you are getting this error.
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The first level is blank.
I encountered the same problem today. It turned out that the svm model in the e1071 package can only use rows as observations, meaning one row is one sample, not one column. If you use columns as samples and rows as variables, this error will occur.
Your data is probably fine (no new levels in the test data), and you just need a small trick; then prediction works.
test.df = rbind(train.df[1,],test.df)
test.df = test.df[-1,]
This trick was from R Random Forest - type of predictors in new data do not match.
Today I encountered this problem, used the above trick, and it solved the problem.
I have been playing with that data set as well. I know this was a long time ago, but one of the things you can do is explicitly include only the columns you feel will add to the model, like so:
fit <- svm(Survived~Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train)
This eliminated the problem for me by dropping columns that contribute nothing (like the ticket number) and have no relevant data.
Another possible issue, which fixed my code, was that I had forgotten to make some of my independent variables factors.
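A quick sketch of that conversion (the blanket approach here is my own illustration, not from the original post):
# convert any remaining character columns to factors
train[] <- lapply(train, function(x) if (is.character(x)) factor(x) else x)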

Resources