Flexmix package in R - extracting from Flexmix output object into a dataframe

I have output from a 2-component mixture model run using the Flexmix package in R. I am trying to extract the list of model coefficients, which is stored in what seems to be a list (mix2@components$Comp.1) inside an object of formal class FLXcomponent. I would like to store the estimates from each component in separate dataframes.
### Simulated data for regression mixture model using Flexmix
### Class 1
x<-seq(from=1,to=2, by=0.01)
y<-seq(from=0,to=1, by=0.01)
z<-x+y+y^2
class_label <- rep(1, length(z))
dat1<-data.frame(x,y,z,class_label)
### Class2
x<-seq(from=2,to=3, by=0.01)
y<-seq(from=10,to=11, by=0.01)
z<-x^2+y+y^2
class_label <- rep(2, length(z))
dat2<-data.frame(x,y,z,class_label)
simdat<-rbind(dat1,dat2)
### Run the model
mix2 <- flexmix(z ~ x+y+x^2+y^2, data=simdat, k=2)
out2<-summary(mix2)
out2
### Extract model coefficients for Component 1
mix2@components$Comp.1
str(mix2@components$Comp.1)
mix2@components[[1]][["Comp.1"]][,1]
mix2@components$Comp.1[,1]
I tried using the getSlots() function in R on the component object, but this gives an error:
getSlots(mix2@components$Comp.1)
Error in .getClassesFromCache(Class) :
class should be either a character-string name or a class definition
How can I extract the coefficients in the model components and save them in a dataframe?
For instance, neither of the approaches below works:
outdat<-as.data.frame(mix2@components[[1]][["Comp.1"]][,1])
outdat<-as.data.frame(mix2@components$Comp.1)

This seems to work, although I am open to other (better) approaches. Note the I() wrappers: inside a formula, ^ denotes crossing of terms rather than powers, so a squared term must be written as I(x^2).
mix2 <- flexmix(z ~ I(x^2)+I(y^2), data=simdat, k=2)
p1<-parameters(mix2, component=1)[[1]]
p2<-parameters(mix2, component=1)[[2]]
and so on.
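A fuller variant along the same lines, as a minimal sketch (assuming the fitted object mix2 from above): calling parameters() without a component argument returns a matrix with one column per component, which coerces cleanly to a dataframe.
## Minimal sketch: collect all component estimates in one dataframe
coefs <- parameters(mix2)        # matrix with columns Comp.1, Comp.2, ...
outdat <- as.data.frame(coefs)
outdat$term <- rownames(coefs)   # keep the coefficient names as a column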

Related

How to export a 'flexmix' model (in R) into TeX?

I have used the R package 'flexmix' to create some regression models, and I now want to export the results to TeX.
Unlike conventional models created with lm(), the flexmix models are not saved as named numerics but as FLXRoptim objects.
When I use the normal syntax from the 'texreg' package to create TeX code from the model results, I get an error message:
"unable to find an inherited method for function ‘extract’ for signature ‘"FLXRoptim"’"
I have to access the models directly; they are stored as 'Coefmat' objects, and I did not manage to make these usable for texreg().
library(flexmix)
library(texreg)
data("patent")
## 1. Flexmix model ##
flex.model <- flexmix(formula = Patents ~ lgRD, data = patent, k = 3,
                      model = FLXMRglm(family = "poisson"),
                      concomitant = FLXPmultinom(~RDS))
re.flex.model <- refit(flex.model)
## 2. Attempt at extracting the results ##
comp1.flex <- re.flex.model@components[[1]][["Comp.1"]]
## 3. Not working: TeX export ##
texreg(comp1.flex)
Do you guys have an idea how to make these model results usable for TeX export?
I have now found a workaround: texreg allows us to create texreg models with manually specified columns:
createTexreg(coef.names, coef, se, pvalues)
Using the example from above:
## Take estimates, SEs, and p-values for Comp.1 ##
est1 <- re.flex.model@components[[1]][["Comp.1"]][,1]
se1 <- re.flex.model@components[[1]][["Comp.1"]][,2]
pval1 <- re.flex.model@components[[1]][["Comp.1"]][,4]
## Take estimates, SEs, and p-values for Comp.2 ##
est2 <- re.flex.model@components[[1]][["Comp.2"]][,1]
se2 <- re.flex.model@components[[1]][["Comp.2"]][,2]
pval2 <- re.flex.model@components[[1]][["Comp.2"]][,4]
## Create texreg objects and export into TeX ##
comp2.flex <- re.flex.model@components[[1]][["Comp.2"]]
mymodel1 <- createTexreg(row.names(comp1.flex), est1, se1, pval1)
mymodel2 <- createTexreg(row.names(comp2.flex), est2, se2, pval2)
models.flex <- list(mymodel1, mymodel2)
texreg(models.flex)
That's probably the most practical way to turn such model-specific objects into conventional TeX output.
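As a variation, the per-component extraction can be wrapped in lapply() so it scales to any number of components. A sketch assuming the refitted object re.flex.model from above; the column indices 1, 2, and 4 follow the Coefmat layout (estimate, standard error, p-value) used in the extraction code:
comps <- re.flex.model@components[[1]]   # named list: Comp.1, Comp.2, ...
models.flex <- lapply(comps, function(cm)
  createTexreg(coef.names = rownames(cm),
               coef = cm[, 1], se = cm[, 2], pvalues = cm[, 4]))
texreg(models.flex)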

Anova test regression vs. knn in R

I'm trying to run an anova test for two different models in R: an lm model vs. a knn model. The problem is that this error appears:
Error in anova.lmlist(object, ...) : models were not all fitted to the same size of dataset
This makes some sense to me, but what I want to know is whether there is statistical evidence of a difference between the models. To give you a reproducible example:
#Getting dataset
library(kknn)   # kknn() is used below
xtra <- read.csv("california.dat", comment.char="#")
names(xtra) <- c("Longitude", "Latitude", "HousingMedianAge",
"TotalRooms", "TotalBedrooms", "Population", "Households",
"MedianIncome", "MedianHouseValue")
n <- length(names(xtra)) - 1
names(xtra)[1:n] <- paste("X", 1:n, sep="")
names(xtra)[n+1] <- "Y"
#Regression model
reg.model<-lm(Y~.,data=xtra)
#Knn-model
knn.model<-kknn(Y~.,train=xtra,test=xtra,kernel = "optimal")
anova(reg.model,knn.model)
What am I doing wrong?
Thanks in advance.
My guess would be that the two models aren't comparable with anova(), and this error is thrown because one of the models is deemed empty.
From the documentation for anova(object,...):
object - an object containing the results returned by a model fitting
function (e.g., lm or glm).
... - additional objects of the same type.
When you look to see whether the models can be compared, you can see that they're of different types:
> class(knn.model)
[1] "kknn"
> class(reg.model)
[1] "lm"
Probably more importantly, if you try to run anova() on knn.model, you can see that the function cannot be applied to a kknn object:
> anova(knn.model)
Error in UseMethod("anova") :
no applicable method for 'anova' applied to an object of class "kknn"
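If the underlying goal is to decide which model predicts better, a common alternative to anova() is to compare held-out prediction error directly. A rough sketch under the question's setup (one train/test split for brevity; a proper comparison would cross-validate):
library(kknn)
set.seed(1)
idx <- sample(nrow(xtra), 0.7 * nrow(xtra))
tr <- xtra[idx, ]
te <- xtra[-idx, ]
lm.pred <- predict(lm(Y ~ ., data = tr), newdata = te)
knn.pred <- kknn(Y ~ ., train = tr, test = te, kernel = "optimal")$fitted.values
# Root mean squared error for each model on the held-out data
c(lm = sqrt(mean((te$Y - lm.pred)^2)), knn = sqrt(mean((te$Y - knn.pred)^2)))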

S4 object creation in R

I am busy comparing different machine learning techniques in R.
This is the case: I made several functions that, in an automated way, each create a different prediction model (e.g. logistic regression, random forest, neural network, hybrid ensemble, etc.), predictions, confusion matrices, several statistics (e.g. AUC and F-score), and different plots.
Now I would like to create a list of S4 (or S3?) objects in R, where each object contains the model, predictions, the plots, the confusion matrix, the AUC, and the F-score.
The idea is that each function creates such an object and then appends it to the object list in the return statement.
How should I program such a class? And how can I specify that each model can be of a different type? (I suppose that all the models I create are S3 objects, so how do I declare this in my S4 class?)
The end result should be able to do something like this: modelList[[i]]@plot should, for example, summon the requested plot, and names(modelList[i]) should give the name of the used model (if this is not possible, modelList[[i]]@name will do). Also, it should be possible to select the best model from the list, based on a parameter such as AUC.
I am not experienced in creating such objects, so this is the code / idea I have at the moment:
modelObject <- setClass(
  # Set the name for the class
  "modelObject",
  # Define the slots ("ANY" is a placeholder where I don't know the right type)
  slots = c(
    modelName = "character",
    model = "ANY",              # should contain a glm, neural network, random forest, etc. model
    predictions = "ANY",        # should contain a matrix or dataframe of custid and prediction
    rocCurve = "ANY",           # when summoned, the ROC curve should be plotted
    plotX = "ANY",              # when summoned, plot X should be plotted
    AUC = "numeric",            # contains the value of the AUC
    confusionMatrix = "matrix", # prints the confusion matrix in the console
    statX = "numeric"           # contains statistic X about the confusion matrix, e.g. F-score
  ),
  # Set the default values for the slots. (optional)
  prototype = list(
    # I guess I can assign NULL to each variable of the S4 object
  ),
  # Make a function that can test to see if the data is consistent.
  # This is not called if you have an initialize function defined!
  validity = function(object)
  {
    # not really an idea how to handle this
    return(TRUE)
  }
)
Use setOldClass() to promote each S3 class to its S4 equivalent:
setOldClass("lm")
setOldClass(c("glm", "lm"))
setOldClass(c("nnet.formula", "nnet"))
setOldClass("xx")
Use setClassUnion() to insert a common base class in the hierarchy:
setClassUnion("lmORnnetORxx", c("lm", "nnet", "xx"))
.ModelObject <- setClass("ModelObject", slots=c(model="lmORnnetORxx"))
setMethod("show", "ModelObject", function(object) {
    cat("model class: ", class(object@model), "\n")
})
In action:
> library(nnet)
> x <- y <- 1:10
> .ModelObject(model=lm(x~y))
model class: lm
> .ModelObject(model=glm(x~y))
model class: glm lm
> .ModelObject(model=nnet(x~y, size=10, trace=FALSE))
model class: nnet.formula nnet
I think that you would also like to implement a Models object that contains a list where all elements are ModelObject; the constraint would be imposed by a validity method (see ?setValidity).
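For example, a minimal sketch of such a container, with the constraint expressed as a validity method (following ?setValidity):
.Models <- setClass("Models", slots = c(models = "list"))
setValidity("Models", function(object) {
    # every element of the list must be a ModelObject
    ok <- vapply(object@models, is, logical(1), "ModelObject")
    if (all(ok)) TRUE else "all elements of 'models' must be ModelObject instances"
})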
What I would do is, for each slot you want in your modelObject class, determine the range of expected values. For example, your model slot has to support all the possible classes of objects that can be returned by model training functions (e.g. lm(), glm(), nnet(), etc.). In the example case, you see the following objects returned:
```
library(nnet)
x <- y <- 1:10
class(lm(x~y))
class(glm(x~y))
class(nnet(x~y, size=10))
```
Since there is no common class among the objects returned, it might make more sense to use S3, which has less rigorous syntax and would allow you to assign various classes of output to the same field name. Your question is actually quite tough to answer, given that there are so many different approaches to take with R's myriad OO systems.
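For illustration, a minimal S3 sketch of that suggestion (the field names are hypothetical, not a fixed API):
new_model_object <- function(name, model, predictions, auc) {
    structure(list(name = name, model = model,
                   predictions = predictions, auc = auc),
              class = "modelObject")
}
print.modelObject <- function(x, ...) {
    cat("model:", x$name, "- AUC:", x$auc, "\n")
    invisible(x)
}
# Selecting the best model from a list by AUC:
# best <- models[[which.max(sapply(models, function(m) m$auc))]]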

Error when using predict() on a randomForest object trained with caret's train() using formula

Using R 3.2.0 with caret 6.0-41 and randomForest 4.6-10 on a 64-bit Linux machine.
When trying to use the predict() method on a randomForest object trained with the train() function from the caret package using a formula, the function returns an error.
When training via randomForest() and/or using x= and y= rather than a formula, it all runs smoothly.
Here is a working example:
library(randomForest)
library(caret)
data(imports85)
imp85 <- imports85[, c("stroke", "price", "fuelType", "numOfDoors")]
imp85 <- imp85[complete.cases(imp85), ]
imp85[] <- lapply(imp85, function(x) if (is.factor(x)) x[,drop=TRUE] else x) ## Drop empty levels for factors.
modRf1 <- randomForest(numOfDoors~., data=imp85)
caretRf <- train( numOfDoors~., data=imp85, method = "rf" )
modRf2 <- caretRf$finalModel
modRf3 <- randomForest(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"])
caretRf <- train(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"], method = "rf")
modRf4 <- caretRf$finalModel
p1 <- predict(modRf1, newdata=imp85)
p2 <- predict(modRf2, newdata=imp85)
p3 <- predict(modRf3, newdata=imp85)
p4 <- predict(modRf4, newdata=imp85)
Among the last 4 lines, only the second one p2 <- predict(modRf2, newdata=imp85) returns the following error:
Error in predict.randomForest(modRf2, newdata = imp85) :
variables in the training data missing in newdata
It seems that the reason for this error is that the predict.randomForest method uses rownames(object$importance) to determine the names of the variables used to train the random forest. And when looking at
rownames(modRf1$importance)
rownames(modRf2$importance)
rownames(modRf3$importance)
rownames(modRf4$importance)
We see:
[1] "stroke" "price" "fuelType"
[1] "stroke" "price" "fuelTypegas"
[1] "stroke" "price" "fuelType"
[1] "stroke" "price" "fuelType"
So somehow, using the caret train() function with a formula changes the names of the (factor) variables in the importance field of the randomForest object.
Is it really an inconsistency between the formula and non-formula versions of the caret train() function? Or am I missing something?
First, almost never use the $finalModel object for prediction. Use predict.train. This is one good example of why.
There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others.
So randomForest will not create dummy variables when you use randomForest(y ~ ., data = dat), but train (and most others) will when using a call like train(y ~ ., data = dat).
The error occurs because fuelType is a factor. The dummy variables created by train don't have the same names so predict.randomForest can't find them.
Using the non-formula method with train will pass the factor predictors to randomForest and everything will work.
TL;DR
Use the non-formula method with train if you want the same levels, or use predict.train.
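For instance, with the formula-interface fit from the question, predicting through the train object rather than its $finalModel sidesteps the dummy-variable mismatch:
caretRf <- train(numOfDoors ~ ., data = imp85, method = "rf")
p2 <- predict(caretRf, newdata = imp85)   # dispatches to predict.train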
There can be two reasons why you get this error.
1. The categories of the categorical variables in the train and test sets don't match. To check this, you can run something like the following. First of all, it is good practice to keep the independent variables/features in a list; say that list is "vars". And say you separated "Data" into "Train" and "Test". Let's go:
for (v in vars){
  if (class(Data[,v]) == 'factor'){
    print(v)
    # print(levels(Train[,v]))
    # print(levels(Test[,v]))
    print(all.equal(levels(Train[,v]), levels(Test[,v])))
  }
}
Once you find the non-matching categorical variables, you can go back and align the factor levels of the Test data with those of the Train data, and then re-build your model. In a loop similar to the one above, for each nonMatchingVar you can do
levels(Test$nonMatchingVar) <- levels(Train$nonMatchingVar)
2. A silly one. If you accidentally leave the dependent variable in the set of independent variables, you may run into this error message. I have made that mistake. Solution: just be more careful.
Another way is to explicitly code the testing data using model.matrix, e.g.
p2 <- predict(modRf2, newdata=model.matrix(~., imp85))

Plot in SVM model (e1071 Package) using DocumentTermMatrix

I am trying to create a plot for my model, created using svm() in the e1071 package.
My code to build the model, predict, and build the confusion matrix is:
library(e1071)
library(caret)   # for confusionMatrix()
ptm <- proc.time()
svm.classifier = svm(x = train.set.list[[0.999]][["0_0.1"]],
                     y = train.factor.list[[0.999]][["0_0.1"]],
                     kernel = "linear")
pred = predict(svm.classifier, test.set.list[[0.999]][["0_0.1"]], decision.values = TRUE)
time[["svm"]] = proc.time() - ptm
confmatrix = confusionMatrix(pred, test.factor.list[[0.999]][["0_0.1"]])
confmatrix
train.set.list and test.set.list contain the train and test sets for several conditions, and train.factor.list and test.factor.list hold the true labels for each set. Both the train and test sets are DocumentTermMatrix objects.
Then I tried to plot my data with
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]])
but I got the message:
"Error in plot.svm(svm.classifier, train.set.list[[0.999]][["0_0.1"]]) :
missing formula."
What am I doing wrong? The confusion matrix seems good to me, even without using the formula parameter in the svm() function.
Without runnable code, it's hard to say exactly what the problem is. My guess, given
?plot.svm
which says
formula: formula selecting the visualized two dimensions. Only needed if more than two input variables are used.
is that your data has more than two predictors. You should specify the two dimensions to plot in your plot() call (predictor1 and predictor2 stand in for two of your actual variable names):
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]], predictor1 ~ predictor2)
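Note that plot.svm expects a data frame rather than a DocumentTermMatrix, so the training data may first need converting with as.data.frame(as.matrix(...)). With more than two predictors you may also have to hold the remaining variables fixed via the slice argument; a sketch with hypothetical column names:
dat <- as.data.frame(as.matrix(train.set.list[[0.999]][["0_0.1"]]))
plot(svm.classifier, dat, predictor1 ~ predictor2,
     slice = list(predictor3 = 0, predictor4 = 0))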
