Factors in aov() - r

I ran into a weird problem while following Example 1 in the R Guide.
Here is the example:
> datafilename="http://personality-project.org/r/datasets/R.appendix1.data"
> data.ex1 = read.table(datafilename,header=T) #read the data into a table
> aov.ex1 = aov(Alertness~Dosage,data=data.ex1) #do the analysis of variance
> summary(aov.ex1) #show the summary table
But when I applied aov() to my own data, things changed:
> test.data <- data.frame(fac=letters[c(1:3,1:3)], x=1:6)
> test.result <- aov(fac~x, data=test.data)
Error in storage.mode(y) <- "double" :
invalid to change the storage mode of a factor
In addition: Warning message:
In model.response(mf, "numeric") :
using type="numeric" with a factor response will be ignored
I'm totally confused. What's the difference between test.data and data.ex1 from the R Guide example?
> str(test.data)
'data.frame': 6 obs. of 2 variables:
$ fac: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3
$ x : int 1 2 3 4 5 6
> str(data.ex1)
'data.frame': 18 obs. of 2 variables:
$ Dosage : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 2 2 2 2 ...
$ Alertness: int 30 38 35 41 27 24 32 26 31 29 ...

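It should be aov(x ~ fac, data = test.data), which works. The formula must be response ~ factor, not factor ~ response: the numeric response goes on the left of ~ and the grouping factor on the right. In the R Guide example, Alertness is the numeric response and Dosage is the factor.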
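As a minimal sketch of the corrected call, the snippet below rebuilds test.data from the question (with fac made an explicit factor, since recent R versions no longer convert strings automatically) and fits the model with the numeric column on the left-hand side.

```r
# Rebuild the toy data from the question; fac is made an explicit factor
# because data.frame() no longer converts strings to factors by default.
test.data <- data.frame(fac = factor(letters[c(1:3, 1:3)]), x = 1:6)

# The response (left of ~) must be numeric; the factor goes on the right.
is.numeric(test.data$x)    # TRUE
is.factor(test.data$fac)   # TRUE

test.result <- aov(x ~ fac, data = test.data)
summary(test.result)
```

With 6 observations and 3 factor levels, the fit leaves 3 residual degrees of freedom.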

Related

MICE Error in parse(text = x, keep.source = FALSE) : <text>:1:141: unexpected ')'

I am trying to conduct a multiple imputation using a 2-level zero-inflated negative binomial model using MICE. This is my code:
ini<-mice(mydata2, m = 10, maxit=0)
meth<-ini$method
pred<-ini$predictorMatrix
meth[3]<-"2l.zinb"
pred[1,]<-c(-2,0,3,3,3)
set.seed(123456)
##impute missing data
imp2l<-mice(mydata2, maxit=10, method=meth, predictorMatrix=pred, m=10, print = FALSE)
When I run the imputation line of code I get the following error:
Error in parse(text = x, keep.source = FALSE) :
<text>:1:141: unexpected ')'
1:nz~DistrictCode+AY2012to2013+AY2013to2014+AY2014to2015+AY2015to2016+AY2016to2017+AY2017to2018+AY2018to2019+treatTx55to80+treatTx80ormore+(1|)
I have followed other threads that have a similar error and have made sure that my variables don't have any illegal characters or spaces in the variable name or even the label names. My data structure looks like this:
'data.frame': 7461 obs. of 5 variables:
$ DistrictCode : num 61176 61176 61176 61176 61176 ...
$ AY : Factor w/ 8 levels "2011to2012","2012to2013",..: 1 2 3 4 5 6 7 8 1 2 ...
$ TotalExpulsions: num 23 24 15 10 17 13 16 13 14 4 ...
$ prepost : Factor w/ 2 levels "Preinterventionperiod",..: 1 1 2 2 2 2 2 2 1 1 ...
$ treat : Factor w/ 3 levels "Control0to55",..: 1 1 1 1 1 1 1 1 2 2 ...
Is there something that I'm missing?
Thank you in advance.
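The formula printed in the error ends in (1|), a random-effects term with nothing after the bar, and that alone is enough to make parse() fail. The sketch below reproduces the same parse error in base R with shortened, purely illustrative formula strings (the names bad and good are hypothetical). It does not call mice. As a guess rather than a confirmed diagnosis, one thing worth checking in the code above is that the method is set on element 3 (meth[3]) while the predictor codes, including the -2, are set on row 1 (pred[1,]).

```r
# Hypothetical, shortened formula strings; only the trailing term matters.
bad  <- "nz ~ DistrictCode + treat + (1|)"        # empty grouping variable
good <- "nz ~ treat + (1|DistrictCode)"           # grouping variable present

res_bad  <- tryCatch(parse(text = bad,  keep.source = FALSE),
                     error = function(e) e)
res_good <- tryCatch(parse(text = good, keep.source = FALSE),
                     error = function(e) e)

inherits(res_bad, "error")    # TRUE: same "unexpected ')'" failure
conditionMessage(res_bad)
inherits(res_good, "error")   # FALSE: parses once the group is filled in
```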

R SVM Predict - Error in predict.svm: test data does not match model

I started with a data frame of 23,515 rows and 3 columns. I split the data 70/30 into training/testing. I am fitting a classification model with SVM from the e1071 package to predict variable MISSING. After I fit the model, I attempt to predict MISSING in my test set but I get the error below:
> ftplh_svm <- svm(MISSING ~ V1+V2, data=train_vars, type="C-classification", kernel="linear")
> p <- predict(ftplh_svm, test_vars, type="class")
Error in predict.svm(object, ...) : test data does not match model !
I tried removing the predicted class from the test set as recommended in another post:
> p <- predict(ftplh_svm, test_vars[-3], type="class")
Error in predict.svm(object, ...) : test data does not match model !
I also tried dropping empty levels as recommended by Brad, but no levels ended up being dropped and I got the same results:
> train_vars$V1 <- droplevels(as.factor(train_vars$V1))
> train_vars$V2 <- droplevels(as.factor(train_vars$V2))
> train_vars$MISSING <- droplevels(as.factor(train_vars$MISSING))
> test_vars$V1 <- droplevels(as.factor(test_vars$V1))
> test_vars$V2 <- droplevels(as.factor(test_vars$V2))
> test_vars$MISSING <- droplevels(as.factor(test_vars$MISSING))
> ftplh_svm <- svm(MISSING ~ V1+V2, data=train_vars, type="C-classification", kernel="linear")
> p <- predict(ftplh_svm, test_vars, type="class")
Error in predict.svm(object, ...) : test data does not match model !
Structure of my training set and test set:
> str(train_vars)
'data.frame': 16395 obs. of 3 variables:
$ V1: Factor w/ 148 levels "AAC","AAL","AGP",..: 1 1 2 2 2 2 2 2 2 2 ...
$ V2 : Factor w/ 284 levels "6AR","AAC","AAL",..: 79 42 180 180 180 180 180 180 180 180 ...
$ MISSING : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
> str(test_vars)
'data.frame': 7129 obs. of 3 variables:
$ V1: Factor w/ 111 levels "AAC","AAL","AGP",..: 1 2 2 2 2 2 2 2 2 2 ...
$ V2 : Factor w/ 265 levels "AAC","AAL","ABZ",..: 225 169 169 169 169 169 169 169 169 169 ...
$ MISSING : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
Test to see if there are new levels in my test set (I did this for each variable):
> train_lev <- levels(train_vars$V1)
> test_lev <- levels(test_vars$V1)
> # these levels only exist in the test set
> new_levels <- setdiff(test_lev,train_lev)
> new_levels
character(0)
> # how many observations is it?
> obs <- which(test_vars$V1 %in% new_levels)
> length(obs)
[1] 0
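One thing that stands out in the str() output above is that the same column has a different number of levels in the two sets (148 vs. 111 for V1), even though the test set has no new levels. Whether that is the cause of the error here is an assumption, but a common remedy is to re-encode the test factors with exactly the training levels instead of calling droplevels() on each set separately. The sketch uses small hypothetical stand-ins for train_vars and test_vars and base R only.

```r
# Hypothetical toy data standing in for train_vars / test_vars.
train_vars <- data.frame(
  V1 = factor(c("AAC", "AAL", "AGP", "AAL")),
  MISSING = factor(c("FALSE", "TRUE", "FALSE", "FALSE"))
)
test_vars <- data.frame(
  V1 = factor(c("AAL", "AAC")),   # only 2 of the 3 training levels occur
  MISSING = factor(c("FALSE", "FALSE"), levels = c("FALSE", "TRUE"))
)

# No new levels in the test set, yet the encodings still differ.
setdiff(levels(test_vars$V1), levels(train_vars$V1))     # character(0)
identical(levels(test_vars$V1), levels(train_vars$V1))   # FALSE

# Re-encode the test factor with exactly the training levels.
test_vars$V1 <- factor(as.character(test_vars$V1),
                       levels = levels(train_vars$V1))
identical(levels(test_vars$V1), levels(train_vars$V1))   # TRUE
```

The same re-encoding would be repeated for every factor predictor before calling predict().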

Single input data.frame instead of testset.csv fails in randomForest code in R

I have written an R script which runs successfully and predicts output, but only when a CSV with multiple entries is passed as input to the classifier.
training_set = read.csv('finaldata.csv')
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[-5],
                          y = training_set$Song,
                          ntree = 50)
test_set = read.csv('testSet.csv')
y_pred = predict(classifier, newdata = test_set)
The code above runs successfully, but instead of giving 10+ inputs to the classifier, I want to pass a single-row data.frame as input. That works with other classifiers but not with this one. Why?
The following code doesn't work and throws an error:
y_pred = predict(classifier, data.frame(Emot="happy",Pact="Walking",Mact="nothing",Session="morning"))
Error in predict.randomForest(classifier, data.frame(Emot = "happy", :
Type of predictors in new data do not match that of the training data.
I even tried keeping a single entry in testinput.csv, and it still throws the same error. How can I solve it? This code is the back end of another program of mine, and I want to pass only a single entry as the test input to predict results. All columns are factors in both the training and the test set. Help appreciated.
PS: Previous solutions to the same error didn't help me.
str(test_set)
'data.frame': 1 obs. of 5 variables:
$ Emot : Factor w/ 1 level "fear": 1
$ Pact : Factor w/ 1 level "Bicycling": 1
$ Mact : Factor w/ 1 level "browsing": 1
$ Session: Factor w/ 1 level "morning": 1
$ Song : Factor w/ 1 level "Dusk Till Dawn.mp3": 1
str(training_set)
'data.frame': 1052 obs. of 5 variables:
$ Emot : Factor w/ 8 levels "anger","contempt",..: 4 7 6 6 4 3 4 6 4 6 ...
$ Pact : Factor w/ 5 levels "Bicycling","Driving",..: 1 2 2 2 4 3 1 1 3 4 ...
$ Mact : Factor w/ 6 levels "browsing","chatting",..: 1 6 1 4 5 1 5 6 6 6 ...
$ Session: Factor w/ 4 levels "afternoon","evening",..: 3 4 3 2 1 3 1 1 2 1 ...
$ Song : Factor w/ 101 levels "Aaj Ibaadat.mp3",..: 29 83 47 72 29 75 77 8 30 53 ...
OK, this worked, although it is a weird solution: equalize the factor levels of the training and test sets. The following code binds the first row of the training set to the test set and then deletes it.
test_set <- rbind(training_set[1, ] , test_set)
test_set <- test_set[-1,]
Done! It works for a single input as well as for a single-entry .csv file, without raising an error in the model.
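To see why the rbind trick above helps, the following sketch applies it to small hypothetical data frames (two columns only, with made-up rows) and inspects the factor levels before and after. It uses base R only and does not fit a random forest.

```r
# Hypothetical miniature versions of training_set and a one-row test_set.
training_set <- data.frame(
  Emot    = factor(c("anger", "fear", "happy")),
  Session = factor(c("morning", "evening", "afternoon"))
)
test_set <- data.frame(Emot = factor("fear"), Session = factor("morning"))

levels(test_set$Emot)   # "fear": a single level, unlike the training data

# Bind one training row on top, then drop it again. rbind() merges the
# factor levels, and removing the row does not drop them.
test_set <- rbind(training_set[1, ], test_set)
test_set <- test_set[-1, ]

levels(test_set$Emot)   # "anger" "fear" "happy"
```

The single remaining row still holds the original values; only the level sets have changed to match the training data.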

lattice plot error: need finite xlim values calls

Whenever I try to plot across factors I keep getting an error.
Here is what my data looks like:
str(dataWithNoNa)
## 'data.frame': 17568 obs. of 4 variables:
## $ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
## $ dayType : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
I am trying to plot using the lattice plotting system using Weekday/Weekend as a factor.
Here is what I tried:
plot(dataWithNoNa$steps~ dataWithNoNa$interval | dataWithNoNa$dayType, type="l")
Error in plot.window(...) : need finite 'xlim' values
I even checked to make sure my data had no NAs:
sum(is.na(dataWithNoNa$interval))
## [1] 0
sum(is.na(dataWithNoNa$steps))
## [1] 0
What am I doing wrong?
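plot() is base graphics and does not interpret | as a conditioning operator; that syntax belongs to lattice functions such as xyplot(). Try this, using the sample data below:
Sample data:
df <- data.frame(
  steps=c(1.717,0.3396,0.1321,0.1509,0.0755),
  interval=c(0,5,10,15,20),
  dayType=c(1,1,1,2,2)
)
Plot:
library(lattice)
xyplot(steps ~ interval | factor(dayType), data=df)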
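A slightly fuller sketch, closer to the question's data, keeps dayType as a Weekday/Weekend factor and draws lines with type = "l". The data values below are a made-up toy sample, and layout = c(1, 2) (one column, two stacked panels) is just one possible arrangement.

```r
library(lattice)

# Toy sample; values are illustrative, not the real step counts.
dataWithNoNa <- data.frame(
  steps    = c(1.717, 0.3396, 0.1321, 0.1509, 0.0755, 0.2, 0.9, 0.4),
  interval = rep(c(0, 5, 10, 15), times = 2),
  dayType  = factor(rep(c("Weekday", "Weekend"), each = 4))
)

# The | operator conditions on dayType, giving one panel per level.
p <- xyplot(steps ~ interval | dayType, data = dataWithNoNa,
            type = "l", layout = c(1, 2))
print(p)
```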

How to apply Naive Bayes model to new data

I asked a question about this earlier today but am deleting it and reposting here with better wording.
I created my first machine learning model using train and test data. I returned a confusion matrix and saw some summary stats.
I would now like to apply the model to new data to make predictions but I don't know how.
Context: Predicting monthly "churn" cancellations. Target variable is "churned" and it has two possible labels "churned" and "not churned".
head(tdata)
  months_subscription nvk_medium                                org_type     churned
1                  25       none                               Community not churned
2                   7       none                            Sports clubs not churned
3                  28       none                            Sports clubs not churned
4                  18    unknown Religious congregations and communities not churned
5                  15       none              Association - Professional not churned
6                   9       none              Association - Professional not churned
Here's me training and testing:
library("klaR")
library("caret")
# import data
test_data_imp <- read.csv("tdata.csv")
# subset only required vars
# had to remove "revenue" since all churned records are 0 (need last price point)
variables <- c("months_subscription", "nvk_medium", "org_type", "churned")
tdata <- test_data_imp[variables]
#training
rn_train <- sample(nrow(tdata),
                   floor(nrow(tdata)*0.75))
train <- tdata[rn_train,]
test <- tdata[-rn_train,]
model <- NaiveBayes(churned ~., data=train)
# testing
predictions <- predict(model, test)
confusionMatrix(test$churned, predictions$class)
Everything up till here works fine.
Now I have new data, structured and laid out the same way as tdata above. How can I apply my model to this new data to make predictions? Intuitively, I was looking for a new column, added with cbind, holding the predicted class for each record.
I tried this:
## prediction ##
# import data
data_imp <- read.csv("pdata.csv")
pdata <- data_imp[variables]
actual_predictions <- predict(model, pdata)
#append to data and output (as head by default)
predicted_data <- cbind(pdata, actual_predictions$class)
# output
head(predicted_data)
Which threw errors:
actual_predictions <- predict(model, pdata)
Error in object$tables[[v]][, nd] : subscript out of bounds
In addition: Warning messages:
1: In FUN(1:6433[[4L]], ...) :
Numerical 0 probability for all classes with observation 1
2: In FUN(1:6433[[4L]], ...) :
Numerical 0 probability for all classes with observation 2
3: In FUN(1:6433[[4L]], ...) :
Numerical 0 probability for all classes with observation 3
How can I apply my model to the new data? I'd like a new data frame with a new column that holds the predicted class.
Following a comment, here are the head and str of the new data for prediction:
head(pdata)
  months_subscription nvk_medium                                org_type     churned
1                  26       none                               Community not churned
2                   8       none                            Sports clubs not churned
3                  30       none                            Sports clubs not churned
4                  19    unknown Religious congregations and communities not churned
5                  16       none              Association - Professional not churned
6                  10       none              Association - Professional not churned
> str(pdata)
'data.frame': 6433 obs. of 4 variables:
$ months_subscription: int 26 8 30 19 16 10 3 5 14 2 ...
$ nvk_medium : Factor w/ 16 levels "cloned","CommunityIcon",..: 9 9 9 16 9 9 9 3 12 9 ...
$ org_type : Factor w/ 21 levels "Advocacy and civic activism",..: 8 18 18 14 6 6 11 19 6 8 ...
$ churned : Factor w/ 1 level "not churned": 1 1 1 1 1 1 1 1 1 1 ...
This is most likely caused by a mismatch in the encoding of factors between the training data (tdata in your case) and the new data passed to predict() (pdata). Typically, the new data contain factor levels that are not present in the training data. You must enforce consistent encoding of the features yourself, because predict() will not check it. I therefore suggest double-checking the levels of nvk_medium and org_type in both data frames.
The error message:
Error in object$tables[[v]][, nd] : subscript out of bounds
is raised while evaluating a given feature (the v-th feature) of a data point, where nd is the numeric code of the factor level for that feature. You also have warnings indicating that the posterior probabilities of all classes are zero for observations 1, 2, and 3, but it is not clear whether this is also related to the encoding of the factors.
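The mechanism described above can be illustrated in base R, assuming (as the answer describes) that each feature's table has one column per training level and is indexed by the integer code of the new value. The matrix tab below is a hypothetical stand-in, not the actual structure stored by NaiveBayes.

```r
# Hypothetical per-feature table: rows are classes, columns are the four
# factor levels seen during training.
tab <- matrix(0.25, nrow = 2, ncol = 4,
              dimnames = list(c("0", "1"), c("Miss", "Mr", "Mrs", "Nothing")))

tab[, 4]   # fine: code 4 exists in the training encoding

# A level encoded as 5 in the new data has no column to look up.
res <- tryCatch(tab[, 5], error = function(e) conditionMessage(e))
res        # "subscript out of bounds"
```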
To reproduce the error that you are seeing, consider the following toy data (from http://amunategui.github.io/binary-outcome-modeling/), which has a set of features somewhat similar to that in your data:
# Data setup
# From http://amunategui.github.io/binary-outcome-modeling/
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt', sep='\t')
titanicDF$Title <- as.factor(ifelse(grepl('Mr ',titanicDF$Name),'Mr',ifelse(grepl('Mrs ',titanicDF$Name),'Mrs',ifelse(grepl('Miss',titanicDF$Name),'Miss','Nothing'))) )
titanicDF$Age[is.na(titanicDF$Age)] <- median(titanicDF$Age, na.rm=T)
titanicDF$Survived <- as.factor(titanicDF$Survived)
titanicDF <- titanicDF[c('PClass', 'Age', 'Sex', 'Title', 'Survived')]
# Separate into training and test data
inds_train <- sample(1:nrow(titanicDF), round(0.5 * nrow(titanicDF)), replace = FALSE)
Data_train <- titanicDF[inds_train, , drop = FALSE]
Data_test <- titanicDF[-inds_train, , drop = FALSE]
with:
> str(Data_train)
'data.frame': 656 obs. of 5 variables:
$ PClass : Factor w/ 3 levels "1st","2nd","3rd": 1 3 3 3 1 1 3 3 3 3 ...
$ Age : num 35 28 34 28 29 28 28 28 45 28 ...
$ Sex : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
$ Title : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 2 2 1 2 4 3 2 3 2 ...
$ Survived: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 1 ...
> str(Data_test)
'data.frame': 657 obs. of 5 variables:
$ PClass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
$ Age : num 47 63 39 58 19 28 50 37 25 39 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
$ Title : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 1 2 3 3 2 3 2 2 2 ...
$ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...
Then everything goes as expected:
model <- NaiveBayes(Survived ~ ., data = Data_train)
# This will work
pred_1 <- predict(model, Data_test)
> str(pred_1)
List of 2
$ class : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 1 ...
..- attr(*, "names")= chr [1:657] "6" "7" "8" "9" ...
$ posterior: num [1:657, 1:2] 0.8352 0.0216 0.8683 0.0204 0.0435 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:657] "6" "7" "8" "9" ...
.. ..$ : chr [1:2] "0" "1"
However, if the encoding is not consistent, e.g.:
# Mess things up, by "displacing" the factor values (i.e., 'Nothing'
# will now be encoded as number 5, which was not present in the
# training data)
Data_test_2 <- Data_test
Data_test_2$Title <- factor(
  as.character(Data_test_2$Title),
  levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")
)
> str(Data_test_2)
'data.frame': 657 obs. of 5 variables:
$ PClass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
$ Age : num 47 63 39 58 19 28 50 37 25 39 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
$ Title : Factor w/ 5 levels "Dr","Miss","Mr",..: 3 2 3 4 4 3 4 3 3 3 ...
$ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...
then:
> pred_2 <- predict(model, Data_test_2)
Error in object$tables[[v]][, nd] : subscript out of bounds
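To check for the kind of mismatch shown above before calling predict(), a small helper can compare the factor encodings column by column. The function compare_levels and the toy data frames are hypothetical names introduced for this sketch, which uses base R only.

```r
# Hypothetical helper: for every factor column shared by both data frames,
# report levels that exist only in the new data and whether the two
# encodings (level sets and their order) are identical.
compare_levels <- function(train, new) {
  shared <- intersect(names(train), names(new))
  is_fac <- vapply(shared, function(n) {
    is.factor(train[[n]]) && is.factor(new[[n]])
  }, logical(1))
  out <- lapply(shared[is_fac], function(n) {
    list(only_in_new   = setdiff(levels(new[[n]]), levels(train[[n]])),
         same_encoding = identical(levels(new[[n]]), levels(train[[n]])))
  })
  names(out) <- shared[is_fac]
  out
}

# Toy data mirroring the Title example: the new data carry an extra "Dr".
train <- data.frame(Title = factor(c("Miss", "Mr", "Mrs", "Nothing")),
                    Age = c(1, 2, 3, 4))
new   <- data.frame(Title = factor(c("Mr", "Dr"),
                                   levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")),
                    Age = c(5, 6))

chk <- compare_levels(train, new)
chk$Title$only_in_new     # "Dr"
chk$Title$same_encoding   # FALSE
```

Any column reported with same_encoding FALSE is a candidate for re-encoding with the training levels before prediction.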

Resources