R package bnlearn: cpquery vs predict - different results?

I want to use my Bayesian network as a classifier, first on complete evidence data (predict), but also on incomplete data (bnlearn::cpquery). However, even when working with the same evidence, the two functions give different results, and the differences go beyond slight sampling deviation.
With complete data, one can easily use R's predict function:
predict(object = BN,
        node   = "TargetVar",
        data   = FullEvidence,
        method = "bayes-lw",
        prob   = TRUE)
By analyzing the prob attribute, I understood that the predict function simply chooses the factor level with the highest assigned probability.
When it comes to incomplete evidence (only outcomes of some nodes are known), predict doesn't work anymore:
Error in check.fit.vs.data(fitted = fitted,
                           data = data,
                           subset = setdiff(names(fitted), :
  required variables [.....] are not present in the data.
So, I want to use bnlearn::cpquery with a list of known evidence:
cpquery(fitted = BN,
        event = TargetVar == "TRUE",
        evidence = evidenceList,
        method = "lw",
        n = 100000)
Again, I simply want to use the factor level with the highest probability as the prediction. So if the result of cpquery is higher than 0.5, I set the prediction to TRUE, otherwise to FALSE.
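A minimal sketch of that thresholding step (p_true is just an assumed name for the value returned by the cpquery() call above):
p_true <- cpquery(fitted = BN, event = TargetVar == "TRUE",
                  evidence = evidenceList, method = "lw", n = 100000)
prediction <- if (p_true > 0.5) "TRUE" else "FALSE"   # threshold at 0.5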
I tried to monitor the process by giving the same (complete) data to both functions, but they don't return the same results. There are large differences; for example, predict's prob attribute gives me p(false) = 27%, whereas cpquery gives me p(false) = 2.2%.
What is the "right" way of doing this? Using only cpquery, also for complete data? Why are there large differences?
Thanks for your help!

As user20650 put it, increasing the number of samples in the predict call is the solution to getting very similar results, so just provide the argument n = ... in your call.
Of course that makes sense; I just didn't know about that argument in the predict() function.
There's no documentation for it in the bn.fit utilities, and none in the rather generic documentation of predict either.
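For illustration, a sketch of the adjusted call; n here is assumed to control the number of simulations used by the likelihood-weighting method, matched to the n = 100000 of the cpquery() call above:
predict(object = BN,
        node   = "TargetVar",
        data   = FullEvidence,
        method = "bayes-lw",
        prob   = TRUE,
        n      = 100000)   # more samples, so the estimates should be much closer to cpquery's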

Related

MatchIt: Full Matching - Long Vector Error

I am running an analysis to assess the impact of a land conservation policy on land-use change at the parcel level. To address the non-random nature of conservation program enrollment, I am running a matching analysis between treated and non-treated parcel-level data. I get this error when I try to run full matching using the MatchIt package.
Error in cbind(treatmentids, controlids) :
long vectors not supported yet: ../include/Rinlinedfuns.h:535
The configuration I am using is:
m1.out <- matchit(formula = Y ~ X1 + X2 + ..... Xn, data = dataframe,
                  method = "full", distance = "glm", link = "logit",
                  estimand = "ATT", ratio = 1, pop.size = 16)
Where X1 .. Xn are continuous covariates and Y is a binary treatment variable. The dataset contains 121226 rows, of which 51693 are treatment and the rest are control samples.
I am running R (4.0.2) with MatchIt (4.3.4) on a Windows machine. Genetic and nearest-neighbor matching methods run without any issues. I appreciate any help on this.
This is an error from optmatch. The problem is too big for optmatch::fullmatch(), the function that matchit() calls when method = "full", to handle. This is because fullmatch() does a search over all pairwise distances, which in this case number over 2.5 billion. The problem may simply be infeasible for full matching. See here for the same problem.
Some things you can try: impose a very strict caliper, which reduces the number of eligible nodes; add an exact matching constraint using the exact argument, which splits the problem into smaller, more manageable chunks; or use subclassification with a large number of subclasses, which approximates full matching (see the sketch below).
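For illustration, a sketch of those workarounds; the shortened formula and the exact-matching variable "region" are placeholders, not from the original data:
library(MatchIt)
# Full matching with a strict caliper and an exact-matching constraint,
# which splits the problem into smaller, more tractable pieces.
m_full <- matchit(Y ~ X1 + X2,                 # substitute the full formula here
                  data = dataframe,
                  method = "full", distance = "glm", link = "logit",
                  estimand = "ATT",
                  caliper = 0.1,               # in SDs of the propensity score
                  exact = ~ region)            # assumed coarse grouping variable
# Subclassification with many subclasses approximates full matching.
m_sub <- matchit(Y ~ X1 + X2,
                 data = dataframe,
                 method = "subclass", distance = "glm", link = "logit",
                 estimand = "ATT",
                 subclass = 50)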
Also note that the ratio and pop.size arguments do nothing with full matching, so you should exclude them from your call to matchit().

How to obtain Brier Score in Random Forest in R?

I am having trouble getting the Brier score for my machine-learning predictive models. The outcome "y" is categorical (1 or 0). The predictors are a mix of continuous and categorical variables.
I have created four models with different predictors, which I will call "model_1" to "model_4" here (apart from the predictors, the other parameters are the same). Example code for my model is:
Model_1 <- rfsrc(y ~ ., data = TrainTest, ntree = 1000,
                 mtry = 30, nodesize = 1, nsplit = 1,
                 na.action = "na.impute", nimpute = 3, seed = 10,
                 importance = TRUE)
When I run Model_1 in R, I get the fitted-model summary (not reproduced here).
My question is: how can I get the predicted probability for each of those 412 people? And how do I find the observed probability for each person? Do I need to calculate it by hand? I found the function BrierScore() in the DescTools package.
But when I try BrierScore(Model_1), it gives me no results.
Code I added:
library(scoring)
library(DescTools)
BrierScore(Raw_SB)
class(TrainTest$VL_supress03)
TrainTest$VL_supress03_nu<-as.numeric(as.character(TrainTest$VL_supress03))
class(TrainTest$VL_supress03_nu)
prediction_Raw_SB = predict(Raw_SB, TrainTest)
BrierScore(prediction_Raw_SB, as.numeric(TrainTest$VL_supress03) - 1)
BrierScore(prediction_Raw_SB, as.numeric(as.character(TrainTest$VL_supress03)) - 1)
BrierScore(prediction_Raw_SB, TrainTest$VL_supress03_nu - 1)
I tried these, but I get many error messages.
One assumption I am making about your approach is that you want to compute the Brier score on the data you trained your model on (which is usually not the correct approach; google "train-test split" if you need more info there).
In general, therefore, you should reflect on whether your approach is correct here.
The BrierScore method in DescTools only has a defined method for glm models; otherwise it expects as input a vector of predicted probabilities and a vector of true values (see ?BrierScore).
What you would need to do though is to predict on your data using:
prediction = predict(model_1, TrainTest, na.action="na.impute")
and then compute the Brier score using:
BrierScore(as.numeric(TrainTest$y) - 1, prediction$predicted[, 1L])
(Note that we transform TrainTest$y into a numeric vector of 0s and 1s in order to compute the Brier score.)
Note: the randomForestSRC package also prints a normalized Brier score when you call print(prediction).
In general, using one of the available machine-learning frameworks in R (mlr3, tidymodels, caret) might simplify this workflow and prevent a lot of errors of this kind; that is good practice, especially if you are less experienced in ML.
See e.g. this chapter in the mlr3 book for more information.
For reference, here is some very similar code using the mlr3 package, which also takes care of the train/test split automatically.
data(breast, package = "randomForestSRC") # with target variable "status"
library(mlr3)
library(mlr3extralearners)
task = TaskClassif$new(id = "breast", backend = breast, target = "status")
algo = lrn("classif.rfsrc", na.action = "na.impute", predict_type = "prob")
resample(task, algo, rsmp("holdout", ratio = 0.8))$score(msr("classif.bbrier"))

How can we specify a custom lambda sequence to glmnet

I am new to the glmnet package in R and want to supply a lambda sequence, based on the suggestion in a published research paper, to the cv.glmnet function. The documentation says that we can supply a decreasing sequence of lambdas as a parameter, but it gives no examples of how to do this.
I would be very grateful if someone could suggest how to go about doing this. Do I pass a vector of 100-odd values (the default for nlambda) to the function? What restrictions, if any, are there on the minimum and maximum values of this vector? Also, are there things to keep in mind regarding nvars, nobs etc. while specifying the vector?
Thanks in advance.
You can define a grid like this:
grid = 10^seq(10, -2, length = 100)   ## get lambda sequence
ridge_mod = glmnet(x, y, alpha = 0, lambda = grid)
This is fairly easy though it's not well explained in the original documentation ;)
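The same kind of decreasing sequence can also be handed to cv.glmnet, which is what the question asks about. A sketch, with x, y and the ridge setting alpha = 0 carried over from the example above:
grid <- 10^seq(10, -2, length = 100)      # decreasing sequence of lambda values
cv_mod <- cv.glmnet(x, y, alpha = 0, lambda = grid)
cv_mod$lambda.min                         # lambda with the smallest cross-validated error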
In the following I've used the Cox family, but you can change it based on your needs:
my_cvglmnet_fit <- cv.glmnet(x=regression_data, y=glmnet_response, family="cox", maxit = 100000)
Then you can plot the fitted object created by cv.glmnet; in the plot you can easily see where lambda is at its minimum. One of the dotted vertical lines marks lambda.min and the other marks lambda.1se.
plot(my_cvglmnet_fit)
The following lines help you see the non-zero coefficients and their corresponding values:
coef(my_cvglmnet_fit, s = "lambda.min")[which(coef(my_cvglmnet_fit, s = "lambda.min") != 0)] # the non zero coefficients
colnames(regression_data)[which(coef(my_cvglmnet_fit, s = "lambda.min") != 0)] # The features that are selected
Here are some links that may help:
http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
http://blog.revolutionanalytics.com/2013/05/hastie-glmnet.html

Using the glmulti package in R for exhaustive-search multiple regression and Akaike weights

I was wondering if someone could help me understand why I am getting an error message when I run a script in R. For a bit of background: for my environmental science honours project I am looking at the effect six different variables (X), which I think gives 63 combinations or models, have on gross primary and net ecosystem production (Y), separately and at different spatial scales. I have decided to use exhaustive-search multiple regression with Akaike's information criterion (AIC) to try to find a group of best-fitting models (and hierarchical partitioning to compare the variance attributed to different X variables). I want the Akaike weights so I can rank which models best meet the criterion and see whether there is one model, or a group of them, that outperforms the rest and is therefore a more likely fit to the data.
I recently posted a similar question about the hier.part package on Cross Validated, received a great answer, and was told to come here if I had any similar questions in the future.
The package I am using for R is glmulti, which can be found here.
The script I am using is this:
require(glmulti)
GPPANDDRIVER<-read.table("C:\\Databases at different scales for R\\River Rhine and Netherlands\\GPP and drivers rhineland (comma delimited).csv",header=T,sep=",")
GPP<-GPPANDDRIVER$GPP
IND_VARS<-subset(GPPANDDRIVER,select=-GPP)
# glmulti S4 generic
glmulti(y = GPP, xr = IND_VARS, data, exclude = c(), name = "glmulti.analysis",
        intercept = TRUE, marginality = FALSE, bunch = 30, chunk = 1, chunks = 1,
        level = 2, minsize = 0, maxsize = -1, minK = 0, maxK = -1, method = "h",
        crit = "aic", confsetsize = 63, popsize = 40, mutrate = 10^-3, sexrate = 0.1,
        imm = 0.3, plotty = TRUE, report = TRUE, deltaM = 0.05, deltaB = 0.05,
        conseq = 5, fitfunction = "glm", resumefile = "id", includeobjects = TRUE)
Here is the link to the .csv data for the Rhineland sites mentioned in the example: http://www.filedropper.com/gppanddriversrhinelandcommadelimited
I am extremely new to R, so I presumed popsize means the number of replicates, which is 40 at this scale, so I used 40. I also assumed confsetsize means the number of possible models, which I believe is 63 due to the 6 variables.
If anyone could help it would be greatly appreciated
Thanks for you patience and apologies for the basic question
Richard
Edit: I just tried running the script this morning and it now crashes R.
This worked for me. I think the main thing is not to blindly include all the parameters in the model call. Most of them have default values, so (if the package writer has done their job) you should be able to leave them as they are and not worry too much (although of course you should RTFM and (try to) understand what they mean...).
dat <- read.csv("GPPdriversRhineland.csv")
library(glmulti)
I decided to rename the predictors with shorter tags:
prednames <- c("NDVI", "solar.rad", "avg.temp", "precip",
               "nutr.avail", "water.cap")
names(dat)[1:6] <- prednames
This is all you need to fit all combinations of main effects: since you have six predictors, there are 64 level-1 models (including the null model).
g1 <- glmulti("GPP", xr = prednames, data = dat, level = 1)
For a bigger computational challenge:
g2 <- glmulti("GPP", xr = prednames, data = dat, level = 2)
I believe there are 2^(choose(6,2)+6) = 2.1 million possible models here. I haven't looked at ?glmulti closely enough to work out how to tell it to stop fitting models. I just started it off (so far it has evaluated 66,000 models), and it has already found a 2-level model with an AIC of about 500.5, which is much better than the minimum AIC of 518 in the set of 1-level models ...
PS I played around with the settings a bit more, trying the genetic algorithm approach rather than the exhaustive approach (I don't see an obvious way to tell glmulti "use the exhaustive approach, but stop after N tries"). Even with slightly more permissive-than-default genetic algorithm settings, it seems to get stuck at an AIC of about 504, above the value found in the (partial) exhaustive screening I tried first.
e.g.:
g2 <- glmulti("GPP", xr = prednames, data = dat, level = 2, marginality = TRUE,
              method = "g", conseq = 25, popsize = 500, mutrate = 1e-2)
PPS: the reason I was getting better results in the exhaustive case was that I had marginality=FALSE, i.e. the model was allowed to leave out main-effect parameters that were involved in interactions included in the model. This isn't necessarily sensible. If I turn off the marginality constraint, then the genetic algorithm can get down to AIC=499 without too much trouble ...
glmulti("GPP", xr = prednames, data = dat, level = 2, marginality = TRUE,
        method = "d")
is also useful: it prints out the number of candidate models defined for a given specification.
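Since the original question was about Akaike weights: assuming glmulti's weightable() accessor (which, to the best of my recollection, returns the candidate models together with their IC values and weights), a short sketch with the g1 object fitted above would be:
ranked <- weightable(g1)   # data frame of model formulas, AIC values and Akaike weights
head(ranked)               # the best-supported models come first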

Random Forest with classes that are very unbalanced

I am using random forests on a big-data problem with a very unbalanced response class, so I read the documentation and found the following parameters:
strata
sampsize
The documentation for these parameters is sparse (or I didn't have the luck to find it) and I really don't understand how to implement them. I am using the following code:
randomForest(x = predictors,
             y = response,
             data = train.data,
             mtry = lista.params[1],
             ntree = lista.params[2],
             na.action = na.omit,
             nodesize = lista.params[3],
             maxnodes = lista.params[4],
             sampsize = c(250000, 2000),
             do.trace = 100,
             importance = TRUE)
The response is a class with two possible values; the first appears far more frequently than the second (10000:1 or more).
lista.params is a list with different parameters (duh! I know...).
Well, the question (again) is: how can I use the 'strata' parameter? Am I using sampsize correctly?
And finally, sometimes I get the following error:
Error in randomForest.default(x = predictors, y = response, data = train.data, :
Still have fewer than two classes in the in-bag sample after 10 attempts.
Sorry if I am asking so many (and maybe stupid) questions ...
You should try using sampling methods that reduce the degree of imbalance from 1:10,000 down to 1:100 or 1:10. You should also reduce the size of the trees that are generated. (At the moment these are recommendations that I am repeating only from memory, but I will see if I can track down more authority than my spongy cortex.)
One way of reducing the size of trees is to set the "nodesize" larger. With that degree of imbalance you might need to have the node size really large, say 5-10,000. Here's a thread in rhelp:
https://stat.ethz.ch/pipermail/r-help/2011-September/289288.html
In the current state of the question you have sampsize=c(250000, 2000), whereas I would have thought that something like sampsize=c(8000, 2000) was more in line with my suggestions. I think you are creating samples in which the class sampled with only 2000 does not appear at all.
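A sketch of how strata and a rebalanced sampsize might be combined (the per-class counts are illustrative only; predictors and response are the objects from the question):
rf_fit <- randomForest(x = predictors,
                       y = response,
                       strata = response,           # stratify the bootstrap samples by class
                       sampsize = c(8000, 2000),    # numbers drawn from each class for every tree
                       nodesize = 5000,             # a large nodesize keeps individual trees small
                       ntree = 500,
                       importance = TRUE)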
There are a few options.
If you have a lot of data, set aside a random sample of the data. Build your model on one set, then use the other to determine a proper cutoff for the class probabilities using an ROC curve.
You can also upsample the data in the minority class. The SMOTE algorithm might help (see the reference below and the DMwR package for a function).
You can also use other techniques. rpart() and a few other functions can allow different costs on the errors, so you could favor the minority class more. You can bag this type of rpart() model to approximate what random forest is doing.
ksvm() in the kernlab package can also use unbalanced costs (but the probability estimates are no longer good when you do this). Many other packages have arguments for setting the priors. You can also adjust this to put more emphasis on the minority class.
One last thought: maximizing models based on accuracy isn't going to get you anywhere (you can get 99.99% accuracy off the bat). caret can tune models based on the Kappa statistic, which is a much better choice in your case; a sketch follows below.
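A minimal sketch of that last point, assuming the predictors/response objects from the question (caret tuning a random forest on Kappa rather than accuracy):
library(caret)
ctrl <- trainControl(method = "cv", number = 5)
rf_kappa <- train(x = predictors, y = response,
                  method = "rf",
                  metric = "Kappa",      # optimize Kappa instead of raw accuracy
                  trControl = ctrl)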
Sorry, I don't know how to post a comment on the earlier answer, so I'll create a separate answer.
I suppose that the problem is caused by the high imbalance of your dataset (too few cases of one of the classes). For each tree, RF creates a bootstrap sample, which is the training set for that tree. If you have too few examples of one of the classes in your dataset, the bootstrap sampling may select examples of only one class (the major class), and a tree cannot be grown on examples of a single class. It seems there is a limit of 10 unsuccessful sampling attempts.
So DWin's proposition to reduce the degree of imbalance to lower values (1:100 or 1:10) is the most reasonable one.
I'm pretty sure I disagree with the idea of removing observations from your sample.
Instead, you might consider using a stratified sample to set a fixed percentage of each class each time it is resampled. This can be done with the caret package. This way you will not omit observations by shrinking your training sample. It will not let you over-represent your classes, but will make sure that each subsample is representative.
Here is an example I found:
len_pos <- nrow(example_dataset[example_dataset$target == 1, ])
len_neg <- nrow(example_dataset[example_dataset$target == 0, ])

train_model <- function(training_data, labels, model_type, ...) {
  experiment_control <- trainControl(method = "repeatedcv",
                                     number = 10,
                                     repeats = 2,
                                     classProbs = TRUE,
                                     summaryFunction = custom_summary_function)
  train(x = training_data,
        y = labels,
        method = model_type,
        metric = "custom_score",
        trControl = experiment_control,
        verbose = FALSE,
        ...)
}

# strata refers to which feature to do stratified sampling on.
# sampsize refers to the size of the bootstrap samples to be taken from each class.
# These samples will be taken as input for each tree.
fit_results <- train_model(example_dataset,
                           as.factor(sprintf("c%d", as.numeric(example_dataset$target))),
                           "rf",
                           tuneGrid = expand.grid(mtry = c(3, 5, 10)),
                           ntree = 500,
                           strata = as.factor(example_dataset$target),
                           sampsize = c('1' = as.integer(len_pos * 0.25),
                                        '0' = as.integer(len_neg * 0.8)))
