Error in bn.fit predict function in bnlearn (R)

I have learned and fitted a Bayesian network with the bnlearn R package and I wish to predict its "event" node value.
fl="data/discrete_kdd_10.txt"
h=TRUE
dtbl1 = read.csv(file=fl, header=h, sep=",")
net=hc(dtbl1)
fitted=bn.fit(net,dtbl1)
I want to predict the value of the "event" node based on evidence stored in another file with the same structure as the file used for learning.
fileName="data/dcmp.txt"
dtbl2 = read.csv(file=fileName, header=h, sep=",")
predict(fitted,"event",dtbl2)
However, predict fails with
Error in check.data(data) : variable duration must have at least two levels.
I don't understand why there should be any restriction on the number of levels of variables in the evidence data.frame.
The dtbl2 data.frame contains only a few rows, one for each scenario in which I want to predict the "event" value.
I know I can use cpquery, but I wish to use the predict function also for networks with mixed variables (both discrete and continuous). I haven't found out how to make use of evidence on a continuous variable in cpquery.
Can someone please explain what I'm doing wrong with the predict function and how should I do it right?
Thanks in advance!

The problem was that reading the evidence data.frame in
fileName="data/dcmp.txt"
dtbl2 = read.csv(file=fileName, header=h, sep=",")
predict(fitted,"event",dtbl2)
caused the categorical variables to become factors with a different number of levels (a subset of the levels of the original training set).
I used the following code to solve the issue.
# realign each column's factor levels with those of the training data
for (i in seq_along(dtbl2)) {
  dtbl2[[i]] = factor(dtbl2[[i]], levels = levels(dtbl1[[i]]))
}
By the way, the bnlearn package does fit models with mixed variables and also provides functions for prediction in them.
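As a hedged sketch of that mixed-variable workflow, using bnlearn's bundled clgaussian.test data set (which mixes discrete and Gaussian nodes; the node name "H" is just one of its variables):
library(bnlearn)
# structure and parameter learning on mixed (conditional Gaussian) data
data(clgaussian.test)
net = hc(clgaussian.test)
fitted = bn.fit(net, clgaussian.test)
# predict a node from new evidence rows (here: rows of the training data)
predict(fitted, node = "H", data = clgaussian.test[1:5, ])
For approximate queries, cpquery with method = "lw" (likelihood weighting) also accepts continuous evidence as a named list, e.g. cpquery(fitted, event = (event == "normal"), evidence = list(duration = 120), method = "lw") (node names from the question, values made up for illustration).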


R Cross Validation lm predict function

I am trying to convert Absorbance (Abs) values to Concentration (ng/mL), based on an established linear model & standard curve. I planned to do this by using the predict() function. I am having trouble getting predict() to return the desired results. Here is a sample of my code:
Standards<-data.frame(ng_mL=c(0,0.4,1,4),
Abs550nm=c(1.7535,1.5896,1.4285,0.9362))
LM.2<-lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Abs<-c(1.7812,1.7309,1.3537,1.6757,1.7409,1.7875,1.7533,1.8169,1.753,1.6721,1.7036,1.6707,
0.3903,0.3362,0.2886,0.281,0.3596,0.4122,0.218,0.2331,1.3292,1.2734)
predict(object=LM.2,
newdata=data.frame(Concentration=Abs[1]))#using Abs[1] as an example, but I eventually want predictions for all values in Abs
Running those last lines gives this output:
> predict(object=LM.2,
+ newdata=data.frame(Concentration=Abs[1]))
1 2 3 4
0.5338437 0.4731341 0.3820697 -0.0732525
Warning message:
'newdata' had 1 row but variables found have 4 rows
This does not seem to be the output I want. I am trying to get a single predicted Concentration value for each Absorbance (Abs) entry. It would be nice to be able to predict all of the entries at once and add them to an existing data frame, but I can't even get it to give me a single value correctly. I've read many threads on here, webpages found on Google, and all of the help files, and for the life of me I cannot understand what is going on with this function. Any help would be appreciated, thanks.
You must have a variable in newdata that has the same name as that used in the model formula used to fit the model initially.
You have two errors:
You don't use a variable in newdata with the same name as the covariate used to fit the model, and
You make the problem much more difficult to resolve because you abuse the formula interface.
Don't fit your model like this:
mod <- lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Fit your model like this:
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)
Isn't that so much more readable?
To predict you would need a data frame with a variable ng_mL:
predict(mod, newdata = data.frame(ng_mL = c(0.5, 1.2)))
Now you may have a third error. You appear to be trying to predict with new values of Absorbance, but the way you fitted the model, Absorbance is the response variable. You would need to supply new values for ng_mL.
The behaviour you are seeing is what happens when R can't find a correctly-named variable in newdata; it returns the fitted values from the model or the predictions at the observed data.
This makes me think you have the formula back to front. Did you mean:
mod2 <- lm(ng_mL ~ log(Abs550nm), data = Standards)
?? In which case, you'd need
predict(mod2, newdata = data.frame(Abs550nm = c(1.7812,1.7309)))
say. Note you don't need to include the log() bit in the name. R recognises that as a function and applies it to the variable Abs550nm for you.
If the model really is log(Abs550nm) ~ ng_mL and you want to find values of ng_mL for new values of Abs550nm you'll need to invert the fitted model in some way.
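As a hedged sketch of that inversion, plain algebra on the fitted coefficients (the numbers are the ones from the question):
# invert log(Abs550nm) = a + b * ng_mL  =>  ng_mL = (log(Abs550nm) - a) / b
Standards <- data.frame(ng_mL = c(0, 0.4, 1, 4),
                        Abs550nm = c(1.7535, 1.5896, 1.4285, 0.9362))
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)
a <- coef(mod)[1]  # intercept
b <- coef(mod)[2]  # slope
Abs <- c(1.7812, 1.7309, 1.3537)  # a few of the new absorbance readings
(log(Abs) - a) / b                # estimated concentrations
Note this gives point estimates only; proper inverse-prediction intervals take more care (packages such as chemCal implement calibration-style inverse prediction).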

Why are polychoric correlation coefficients in matrices calculated by different R packages slightly different for the same data?

I calculated polychoric correlation matrices for the same data frame (20 ordinal variables, 190 missing values) in R using three different packages, and the coefficients for the same variables are slightly different from each other.
I used the lavCor function from "lavaan" (I did list the ordinal variables when calling the function), the polychoric function from "psych" (1.9.1) (took the rhos), and the cor_auto function from "qgraph" (which is supposed to automatically calculate polychoric correlations for ordinal data). I am confused because I thought they were supposed to give exactly the same results. I read the package documentation but could not find anything that helped me understand why. Could anyone let me know why this happens? I am sure I am missing some tiny difference between them, but I cannot figure it out.
PS: I guess this could have happened because the psych package adjusts for missing values (I have 190) using the correction for continuity, but I still do not understand why qgraph yields different results than lavaan, as qgraph says it uses lavaan's lavCor function to calculate polychoric correlations.
Thanks!!
depanx<-data[1:20]
cor.depanx<-cor_auto(depanx)
polychor<-polychoric(depanx)
polymat<-polychor$rho
lav<-lavCor(depanx,ordered=c("unh","enj","trd","rst","noG","cry","cnc","htd","bdp","lnl","lov",
"cmp","wrg","pst","sch","dss","hlt","bad","ftr","oth"))
# as a result, matrices "cor.depanx", "polymat", and "lav" are different from each other.
Nice question! I do not know what the "data" data set in your example is, but I recreate the two scenarios that most probably caused the discrepancy between the cor_auto and lavCor results. In summary: first, you must set the "ordinalLevelMax" argument in cor_auto based on your data, and second, you need to synchronize the "missing" argument between the two functions. Detailed explanation in the code snippet below:
library(qgraph)  # for cor_auto
depanx <- data.frame(lapply(1:5, function(x) sample(1:6, 100, replace = TRUE)),
                     stringsAsFactors = FALSE)
colnames(depanx) <- LETTERS[1:5]
lav <- lavaan::lavCor(depanx, ordered = colnames(depanx))
cor.depanx <- cor_auto(depanx)
all(lav == cor.depanx)  # TRUE
# The first cor_auto argument to pay attention to is "ordinalLevelMax".
# It defaults to 7, so any variable with more than 7 levels is passed to
# lavCor as plain numeric, not ordinal.
# Now we create the same data set with 8-level variables. lavCor treats them
# all as ordinal, because we label them via its "ordered" argument, and so
# uses polychoric correlations. Since "ordinalLevelMax" in cor_auto is 7 by
# default and you have not changed it, cor_auto detects none of them as
# ordinal and does not pass them to lavCor as ordered variables, so lavCor
# computes Pearson correlations between all of them.
depanx2 <- data.frame(lapply(1:5, function(x) sample(1:8, 100, replace = TRUE)),
                      stringsAsFactors = FALSE)
colnames(depanx2) <- LETTERS[1:5]
lav2 <- lavaan::lavCor(depanx2, ordered = colnames(depanx2))
cor.depanx2 <- cor_auto(depanx2)
all(lav2 == cor.depanx2)  # FALSE
# The next argument to synchronize between lavCor and cor_auto is "missing",
# which defaults to "pairwise" in cor_auto and "listwise" in lavCor.
# Here we set rows 10:20 of the fifth variable to NA without synchronizing
# that argument:
depanx3 <- data.frame(lapply(1:5, function(x) sample(1:6, 100, replace = TRUE)),
                      stringsAsFactors = FALSE)
colnames(depanx3) <- LETTERS[1:5]
depanx3[10:20, 5] <- NA
lav3 <- lavaan::lavCor(depanx3, ordered = colnames(depanx3))
cor.depanx3 <- cor_auto(depanx3)
all(lav3 == cor.depanx3)  # FALSE
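A hedged follow-up sketch of the fix, reusing the objects above (the argument names are the ones cited in this answer):
# raise ordinalLevelMax so the 8-level variables are treated as ordinal too
cor.depanx2.fixed <- cor_auto(depanx2, ordinalLevelMax = 8)
all(lav2 == cor.depanx2.fixed)  # expected TRUE once both use polychorics
# align the missing-data treatment with lavCor's listwise default
cor.depanx3.fixed <- cor_auto(depanx3, missing = "listwise")
all(lav3 == cor.depanx3.fixed)  # expected TRUE once both delete the same rows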

R - Machine Learning: Subset selection packages and approaches for a categorical response variable

I have a clinical data set with around 26 variables (a mix of numerical and categorical), including the response variable.
Categorical response variable 'RETINOPATHY' has 2 factor levels: "Yes" and "No".
Now the task is to find the best subset of feature variables out of the 26 to predict the categorical response variable, and to compare AIC (Akaike) etc. values for the different subsets.
I found that the "leaps" package is pretty handy for this task, but it works only when the response variable is numerical. regsubsets won't work for a categorical response, as it is based on linear regression. See below:
regsubsets(finalDataDF$RETINOPATHY~., data = finalDataDF, nbest = 5, method="exhaustive")
In the above sample, RETINOPATHY is the categorical response variable, a function of the remaining variables.
I searched a lot but couldn't find a proper explanation of which R package can produce feature subsets for predicting RETINOPATHY.
Please guide me further. Thanks in advance.
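(No answer was captured for this question. As a hedged sketch of one common approach: base R's step() performs AIC-guided stepwise selection on a logistic regression, which handles a two-level categorical response. finalDataDF and RETINOPATHY are the names from the question.)
# AIC-guided stepwise subset selection for a binary response (sketch)
full <- glm(RETINOPATHY ~ ., data = finalDataDF, family = binomial)
null <- glm(RETINOPATHY ~ 1, data = finalDataDF, family = binomial)
best <- step(null, scope = list(lower = null, upper = full),
             direction = "both")  # adds/drops terms to minimise AIC
summary(best)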

Error : Sets of levels in train and test don't match (knncat R)

I am trying to do knn classification using knncat in R, since I have categorical attributes in my data set.
knncat(FinalData, FinalTestData, k=10, classcol = 15)
When I execute the above statement, it gives me the error: Sets of levels in train and test do not match.
On checking the levels of all the attributes, I did find a difference. I have a country attribute which can take values 1-41 in the train data set.
However, in the test data set one particular country never appears, and that is what causes this error. How am I supposed to deal with that?
I'm not sure, but you may match the factor levels as below.
train <- factor(c("a","b","c"))
test <- factor(c("a","b"))
# note: levels(test) <- ... renames levels by position; it works here because
# test's levels are a leading subset of train's. The safer general idiom is
# test <- factor(test, levels = levels(train)).
levels(test) <- levels(train)
test
[1] a b
Levels: a b c
Perhaps I am wrong, but wouldn't this still be problematic, given that the KNN algorithm bases its tuning on Euclidean distance calculations?
Wouldn't you still need to create a binary variable for each level of your categorical features? That would mean you would have an issue whenever certain levels do not appear in both the training and test sets.
Could someone perhaps enlighten me with regards to this?
Also, as a note, this is meant to be more of a spur than a hijack.
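(A hedged illustration of the point above, with made-up country values: once the factor levels are aligned, dummy coding with base R's model.matrix yields the same indicator columns in train and test, even when a level never occurs in the test set.)
# align levels first, then dummy-code: the column sets match across the split
train <- data.frame(country = factor(c("US", "FR", "DE")))
test <- data.frame(country = factor(c("US", "FR")))  # "DE" never appears
test$country <- factor(test$country, levels = levels(train$country))
model.matrix(~ country, data = train)
model.matrix(~ country, data = test)  # identical columns, so distances line up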

Classification column is removed after using dummyVars in caret package - R

I am playing around with the caret package and came upon this question.
I am using dummyVars to split my categorical columns into separate dummy variables. It seems that dummyVars removes the classification column from the input data set. For example:
library(earth)
data(etitanic)
dummies <- dummyVars(survived ~ ., data = etitanic, levelsOnly = FALSE)
et<-as.data.frame(predict(dummies, newdata = etitanic))
names(et)
[1] "pclass.1st" "pclass.2nd" "pclass.3rd" "sex.female" "sex.male" "age"
[7] "sibsp" "parch"
So when I try to split the data, I get an error.
train = createDataPartition(et$survived, p=.75, list=FALSE)
Error in createDataPartition(et$survived, p = 0.75, list = FALSE) :
y must have at least 2 data points
Could anyone let me know if this is the expected behavior of caret's dummyVars? I can easily add the survived column back into the data set using, say,
et$survived<-etitanic$survived
and then train a model. But I presume that there must be a better way or else the caret package would not remove the classification column. Am I missing something here? Could someone throw more light on this please?
Thanks
As far as I know there is no way to keep the classification column (or at least not as a factor, because the output is a matrix and therefore always numeric). This is because the purpose of the dummyVars function is to create dummy variables for the factor predictor variables. It is also designed as an alternative to the base R function model.matrix that offers more choices (model.matrix does not keep the classification column either).
Also, and maybe more importantly, functions that require the classification column to be of factor class and only of factor class offer either a way to provide the factor as a separate argument (like the svm function from the e1071 package) or specifically require it as a separate argument (like the knn function from the FNN package). In both cases you do not need to have the factor in your data.frame. You just need to provide it as a separate vector in the function you want to use.
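(A hedged sketch of that pattern, assuming the FNN package is installed; class::knn has the same interface. The dummy-coded predictors and the factor labels travel separately.)
library(earth)
library(caret)
library(FNN)
data(etitanic)
dummies <- dummyVars(survived ~ ., data = etitanic)
X <- as.data.frame(predict(dummies, newdata = etitanic))
y <- factor(etitanic$survived)  # class labels kept as a separate vector
idx <- sample(nrow(X), floor(0.75 * nrow(X)))  # simple random split
pred <- knn(train = X[idx, ], test = X[-idx, ], cl = y[idx], k = 5)
table(pred, y[-idx])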
However, there is an alternative for the cases where you do not need the classification column to be of factor type, in which case you can simply do:
library(earth)
data(etitanic)
etitanic2 <- etitanic
# convert the classification column to numeric
etitanic2$survived <- as.numeric(etitanic2$survived)
# use a formula without specifying the response variable
dummies <- dummyVars( ~ ., data = etitanic2, levelsOnly = FALSE)
et <- as.data.frame(predict(dummies, newdata = etitanic2))
> names(et)
[1] "pclass.1st" "pclass.2nd" "pclass.3rd" "survived" "sex.female" "sex.male" "age"
[8] "sibsp" "parch"
By converting the classification column to numeric and not specifying a response variable in the formula, the survived column is kept in the output data.frame, but as numeric.
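(And a hedged end-to-end sketch of the re-attach workaround mentioned in the question, after which createDataPartition works as intended:)
library(caret)
library(earth)
data(etitanic)
dummies <- dummyVars(survived ~ ., data = etitanic)
et <- as.data.frame(predict(dummies, newdata = etitanic))
et$survived <- etitanic$survived  # re-attach the response column
inTrain <- createDataPartition(et$survived, p = 0.75, list = FALSE)
training <- et[inTrain, ]
testing <- et[-inTrain, ]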
