R bnlearn - parameter learning with naive.bayes() check.data() error - r

I have a graph structure, determined from another method, and I want to do parameter learning. The bnlearn methods, however, seem to do parameter learning directly on the dataset (strictly in a dataframe). I have two questions: how do I do parameter learning from an igraph or graphNEL structure with bnlearn?
Second question: I am getting a check.data() error when I try to do parameter learning using my dataset. Their example code works, and I can't understand why my dataset does not. See their code below and a reproducible example, below.
Here is their example code:
require(bnlearn)
require(Rgraphviz)
data(learning.test)
bn <- naive.bayes(learning.test, "A")
pred <- predict(bn, learning.test)
table(pred, learning.test[,"A"])
My reproducible example (errors on naive.bayes() call):
require(bnlearn, Rgraphviz)
data <- data <- matrix(sample.int(200, 61*252, TRUE), nrow=252, ncol=61)
data <- as.data.frame(matrix(as.numeric(as.matrix(data)), ncol=ncol(data),
byrow=TRUE))
bn <- naive.bayes(data, names(data)[1])
Error message:
Error in check.data(data, allowed.types = discrete.data.types) :
valid data types are:
* all variables must be unordered factors.
* all variables must be ordered factors.
* variables can be either ordered or unordered factors.
I do not think this error comes from detecting integers, because when I cast my data to a dataframe, I first cast it to numeric, because other methods in bnlearn require numeric or factored data. This dataset IS count data, but I want to use the method assuming I am using continuous datasets. Does this make sense?

Related

R -- make a prediction based upon multiple fields (factors), but not the entire data frame

Ok, I have a data frame with 250 observations of 9 variables. For simplicity, let's just label them A - I
I've done all the standard stuff (converting things to int or factor, creating the data partition, test and train sets, etc).
What I want to do is use columns A and B, and predict column E. I don't want to use the entire set of nine columns, just these three when I make my prediction.
I tried only using the limited columns in the prediction, like this:
myPred <- predict(rfModel, newdata=myData)
where rfModel is my model, and myData only contains the two fields I want to use, as a dataframe. Unfortunately, I get the following error:
Error in predict.randomForest(rfModel, newdata = myData) :
variables in the training data missing in newdata
Honestly, I'm very new to R, and I'm not even sure this is feasible. I think the data that I'm collecting (the nine fields) are important to use for "training", but I can't figure out how to make a prediction using just the "resultant" field (in this case field E) and the other two fields (A and B), and keeping the other important data.
Any advice is greatly appreciated. I can post some of the code if necessary.
I'm just trying to learn more about things like this.
A assume you used random forest method:
library(randomForest)
model <- randomForest(E ~ A+ B - c(C,D,F,G,H,I), data = train)
pred <- predict(model, newdata = test)
As you can see in this example only A and B column would be taken to build a model, others are removed from model building (however not removed from the dataset). If you want to include all of them use (E~ .). It also means that if you build your model based on all column you need to have those columns in test set too, predict won't work without them. If the test data have only A and B column the model has to be build based on them.
Hope it helped
As I mentioned in my comment above, perhaps you should be building your model using only the A and B columns. If you can't/don't want to do this, then one workaround perhaps would be to simply use the median values for the other columns when calling predict. Something like this:
myData <- cbind(data[, c("A", "B)], median(data$C), median(data$D), median(data$E),
median(data$F), median(data$G), median(data$H), median(data$I))
myPred <- predict(rfModel, newdata=myData)
This would allow you to use your current model, built with 9 predictors. Of course, you would be assuming average behavior for all predictors except for A and B, which might not behave too differently from a model built solely on A and B.

Mixed Anova in R

I am trying to do an anova anaysis in R on a data set with one within factor and one between factor. The data is from an experiment to test the similarity of two testing methods. Each subject was tested in Method 1 and Method 2 (the within factor) as well as being in one of 4 different groups (the between factor). I have tried using the aov, the Anova(in car package), and the ezAnova functions. I am getting wrong values for every method I try. I am not sure where my mistake is, if its a lack of understanding of R or the Anova itself. I included the code I used that I feel should be working. I have tried a ton of variations of this hoping to stumble on the answer. This set of data is balanced but I have a lot of similar data sets and many are unblanced. Thanks for any help you can provide.
library(car)
library(ez)
#set up data
sample_data <- data.frame(Subject=rep(1:20,2),Method=rep(c('Method1','Method2'),each=20),Level=rep(rep(c('Level1','Level2','Level3','Level4'),each=5),2))
sample_data$Result <- c(4.76,5.03,4.97,4.70,5.03,6.43,6.44,6.43,6.39,6.40,5.31,4.54,5.07,4.99,4.79,4.93,5.36,4.81,4.71,5.06,4.72,5.10,4.99,4.61,5.10,6.45,6.62,6.37,6.42,6.43,5.22,4.72,5.03,4.98,4.59,5.06,5.29,4.87,4.81,5.07)
sample_data[, 'Subject'] <- as.factor(sample_data[, 'Subject'])
#Set the contrats if needed to run type 3 sums of square for unblanaced data
#options(contrats=c("contr.sum","contr.poly"))
#With aov method as I understand it 'should' work
anova_aov <- aov(Result ~ Method*Level + Error(Subject/Method),data=test_data)
print(summary(anova_aov))
#ezAnova method,
anova_ez = ezANOVA(data=sample_data, wid=Subject, dv = Result, within = Method, between=Level, detailed = TRUE, type=3)
print(anova_ez)
Also, the values I should be getting as output by SAS
SAS Anova
Actually, your R code is correct in both cases. Running these data through SPSS yielded the same result. SAS, like SPSS, seems to require that the levels of the within factor appear in separate columns. You will end up with 20 rows instead of 40. An arrangmement like the one below might give you the desired result in SAS:
Subject Level Method1 Method2

How to extract aggregated imputed data from R-package 'mice'?

I have a question regarding the aggregation of imputed data as created by the R-package 'mice'.
As far as I understand it, the 'complete'-command of 'mice' is applied to extract the imputed values of, e.g., the first imputation. However, when running a total of ten imputations, I am not sure, which imputed values to extract. Does anyone know how to extract the (aggregate) imputed data across all imputations?
Since I would like to enter the data into MS Excel and perform further calculations in another software tool, such a command would be very helpful.
Thank you for your comments. A simple example (from 'mice' itself) can be found below:
R> library("mice")
R> nhanes
R> imp <- mice(nhanes, seed = 23109) #create imputation
R> complete(imp) #extraction of the five imputed datasets (row-stacked matrix)
How can I aggregate the five imputed data sets and extract the imputed values to Excel?
I had similar issue.
I used the code below which is good enough to numeric vars.
For others I thought about randomly choose one of the imputed results (because averaging can disrupt it).
My offered code is (for numeric):
tempData <- mice(data,m=5,maxit=50,meth='pmm',seed=500)
completedData <- complete(tempData, 'long')
a<-aggregate(completedData[,3:6] , by = list(completedData$.id),FUN= mean)
you should join the results back.
I think the 'Hmisc' is a better package.
if you already found nicer/ more elegant/ built in solution - please share with us.
You should use complete(imp,action="long") to get values for each imputation. If you see ?complete, you will find
complete(x, action = 1, include = FALSE)
Arguments
x
An object of class mids as created by the function mice().
action
If action is a scalar between 1 and x$m, the function returns the data with imputation number action filled in. Thus, action=1 returns the first completed data set, action=2 returns the second completed data set, and so on. The value of action can also be one of the following strings: 'long', 'broad', 'repeated'. See 'Details' for the interpretation.
include
Flag to indicate whether the orginal data with the missing values should be included. This requires that action is specified as 'long', 'broad' or 'repeated'.
So, the default is to return the first imputed values. In addition, the argument action can also be a string: long, broad, and repeated. If you enter long, it will give you the data in long format. You can also set include = TRUE if you want the original missing data.
ok, but still you have to choose one imputed dataset for further analyses... I think the best option is to analyze using your complete(imp,action="long") and pool the results afterwards.fit <- with(data=imp,exp=lm(bmi~hyp+chl))
pool(fit)
but I also assume its not forbidden to use just one of the imputed datasets ;)

applying lm to multiple datasets

Below are 4 datasets (I've just created them randomly for the sake of providing a reproducible code). I created a list of these so I could apply "lm" to these multiple datasets at once :
H<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
C<-data.frame(replicate(5,sample(0:100,10,rep=FALSE)))
R<-data.frame(replicate(7,sample(0:30,10,rep=TRUE)))
E<-data.frame(replicate(4,sample(0:40,10,rep=FALSE)))
dsets<-list(H,C,R,E)
models<-lapply(dsets,function(x)lm(X1~.,data=x))
lapply(models,summary)
The variables in each of the datasets are different (in count as well as names. However,if you run the code they will all be x1,x2..and so on). The first column/variable in each will be the response and rest would be the independent variables.
This code works but not on my actual dataset. Since my datasets have actual names for variables, I used the position of the variable instead as below:
dsets<-list(H,C,R,E)
models<lapply(dsets,function(x)lm(x[,1]~.,data=x))
lapply(models,summary)
Using the above, the results are messed up. It also includes the response variable as the independent variable.
Could anyone assist?
EDIT: I realized that x[,1] is calling the whole column and not the column name
models<lapply(dsets,function(x)lm(colnames(x)[1]~.,data=x))
lapply(models,summary)
but this doesn't work either. I get the following error
Error in model.frame.default(formula = colnames(H[1]) ~ ., data = H, drop.unused.levels = TRUE) :
variable lengths differ (found for 'Var1')
models <- lapply(dsets,
function(data){
lm(reformulate(termlabels=".", response=names(data)[1]), data)
})
reformulate allows you to construct a formula from character strings.

specify model with selected terms using lm

A pretty straightforward for those with intimate knowledge of R
full <- lm(hello~., hellow)
In the above specification, linear regression is being used and hello is being modeled against all variables in dataset hellow.
I have 33 variables in hellow; I wish to specify some of those as independent variable. These variables have names that carry a meaning so I really don't want to rename them to x1 x2 etc.
How can I, without having to type the individual names of the variables (since that is pretty tedious), specify a select number of variables from the whole bunch?
I tried
full <- lm(hello~hellow[,c(2,5:9)]., hellow)
but it gave me an error "Error in model.frame.default(formula = hello ~ hellow[, : invalid type (list) for variable 'hellow[, c(2, 5:9)]'
reformulate will construct a formula given the names of the variables, so something like:
(Construct data first):
set.seed(101)
hellow <- setNames(as.data.frame(matrix(rnorm(1000),ncol=10)),
c("hello",paste0("v",1:9)))
Now run the code:
ff <- reformulate(names(hellow)[c(2,5,9)],response="hello")
full <- lm(ff, data=hellow)
should work. (Works fine with this example.)
An easier solution just occurred to me; just select the columns/variables you want first:
hellow_red <- hellow[,c(1,2,5,9)]
full2 <- lm(hello~., data=hellow_red)
all.equal(coef(full),coef(full2)) ## TRUE

Resources