Prediction with cpdist using "probabilities" as evidence - r

I have a very quick question with an easy reproducible example, related to my work on prediction with bnlearn:
library(bnlearn)
#Toy data: a discrete cause and a numeric consequence
Learning.set4=cbind(c("Yes","Yes","Yes","No","No","No"),c(9,10,8,3,2,1))
#stringsAsFactors=TRUE keeps Cause a factor under R >= 4.0, as bn.fit requires
Learning.set4=as.data.frame(Learning.set4,stringsAsFactors=TRUE)
Learning.set4[,2]=as.numeric(as.character(Learning.set4[,2]))
colnames(Learning.set4)=c("Cause","Cons")
#Build the structure Cause -> Cons via its adjacency matrix
b.network=empty.graph(colnames(Learning.set4))
struct.mat=matrix(0,2,2)
colnames(struct.mat)=colnames(Learning.set4)
rownames(struct.mat)=colnames(struct.mat)
struct.mat[1,2]=1
bnlearn::amat(b.network)=struct.mat
#Fit the parameters
haha=bn.fit(b.network,Learning.set4)
#Some predictions with the "lw" method
#Here is the approach I know, with one particular modality SET
#(i.e. it holds with certainty; here, for example, I know Cause is "Yes")
classic_prediction=cpdist(haha,nodes="Cons",evidence=list("Cause"="Yes"),method="lw")
print(mean(classic_prediction[,c(1)]))
#What if I wanted to predict the value of Cons when Cause has a 60% chance of being "Yes" and 40% of being "No"?
#I decided to do the following, according to the help.
#(I could also write a function that generates "Yes" or "No" with the proper probabilities.)
prediction_idea=cpdist(haha,nodes="Cons",evidence=list("Cause"=c("Yes","Yes","Yes","No","No")),method="lw")
print(mean(prediction_idea[,c(1)]))
Here is what the help says:
"In the case of a discrete or ordinal node, two or more values can also be provided. In that case, the value for that node will be sampled with uniform probability from the set of specified values"
When predicting the value of a variable from categorical evidence, I have so far just used one specific modality of that variable, as in the first prediction in the example. (Setting the evidence to "Yes" makes Cons take a high value.)
But if I wanted to predict Cons without knowing the exact modality of the variable Cause with certainty, could I use what I did in the second prediction (just knowing the probabilities)?
Is this an elegant way, or are there better implemented ones I don't know of?

I got in touch with the creator of the package, and I will paste his answer related to the question here:
The call to cpdist() is wrong:
prediction_idea=cpdist(haha,nodes="Cons",evidence=list("Cause"=c("Yes","Yes","Yes","No","No")),method="lw")
print(mean(prediction_idea[,c(1)]))
A query with the 40%-60% soft evidence requires you to place these new probabilities in the network first:
haha$Cause = c(0.40, 0.60)
and then run the query without an evidence argument. (Because you do not have any hard evidence, really, just a different probability distribution for Cause.)
Here is the code that does what I wanted, based on the fitted network from the example:
#Replace the distribution of Cause with the 40/60 soft evidence
change=haha$Cause$prob
change[1]=0.4 #P(Cause="No"); factor levels are sorted alphabetically
change[2]=0.6 #P(Cause="Yes")
haha$Cause=change
#No hard evidence is left, so evidence=TRUE
new_prediction=cpdist(haha,nodes="Cons",evidence=TRUE,method="lw")
print(mean(new_prediction[,c(1)]))
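As a quick sanity check (my own addition, not from the package author): with soft evidence on a root node, the resulting mean of Cons should match the mixture of the two hard-evidence means, 0.6 * E[Cons | Yes] + 0.4 * E[Cons | No]. A minimal sketch using the same fitted network:
#Mixture check: weighted average of the two hard-evidence predictions
yes_mean = mean(cpdist(haha, nodes = "Cons", evidence = list(Cause = "Yes"), method = "lw")[, 1])
no_mean  = mean(cpdist(haha, nodes = "Cons", evidence = list(Cause = "No"),  method = "lw")[, 1])
print(0.6 * yes_mean + 0.4 * no_mean)  #should be close to mean(new_prediction[, 1])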

Related

ezANOVA not providing Greenhouse-Geisser corrected df though sphericity is violated

I've noticed that sometimes when I use ezANOVA from the ez package I get columns stating the Greenhouse-Geisser corrected df values, and other times the tables with the sphericity corrections do not include the new df values, even though there are violations. For example, I just ran a 2-way repeated-measures ANOVA, and my table output looks like this:
I wish I could give reproducible data, but I genuinely don't know why it does or doesn't do this sometimes. Does anyone else know? I'll show my code below in case there's something I'm missing regarding the actual ezANOVA function. I could do the df values by hand, but I haven't found a good resource online showing how to correct them using the epsilon value, and I unfortunately was never taught that.
ez::ezANOVA(data = IntA2, wid = rat, dv = numReinforcers, within = .(component, minute))
Edit: A very nice person on the internet has explained to me how to calculate the new df values by hand (multiplying the GG epsilon by the old dfs, in case anyone else was wondering!), but I'm still unclear on why the function sometimes does it for you and other times does not.
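For anyone wanting that by-hand correction as code, here is a minimal sketch; the epsilon and df values below are hypothetical placeholders, not from my output:
#GG-corrected df = uncorrected df * GG epsilon (hypothetical values)
GGe = 0.62   #epsilon from ezANOVA's "Sphericity Corrections" table
DFn = 2      #uncorrected numerator df
DFd = 22     #uncorrected denominator df
c(DFn.corrected = GGe * DFn, DFd.corrected = GGe * DFd)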

How to remove the models that failed convergence from a set of random questions?

I want to include some random replications of a model estimation (e.g., a GARCH model) in the question. The code randomly uses a different data series each time. In this process, the GARCH estimation for some random data series may not achieve numerical convergence. Therefore, I need to code the question in such a way that models that failed to converge are removed from the set of questions. How can I code this when I use R-exams?
Basic idea
In general, when using random data in the generation of exercises, there is a chance that sometimes something goes wrong, e.g., the solution does not fall into a desired range (i.e., becomes too large or too small), or the solution does not even exist due to mathematical intractability or numerical problems (as you point out), etc.
Of course, it is best to avoid such problems in the data-generating process so that they do not occur at all. However, this is not always possible, or not worth the effort because the problems occur very rarely. In such situations I typically use a while() loop to re-generate the random data if necessary. As this might potentially run for several iterations, it is important, though, to make the probability that it is needed sufficiently small.
Worked example
A worked example can be found in the fourfold exercise that ships with the package. It randomly generates a fourfold table with probabilities that should subsequently be reconstructed from partial information in the actual exercise. In order for the exercise to be well-defined, all entries of the table must be (strictly) between 0 and 1 and they must sum up to 1. The simulation code actually tries to ensure this, but edge cases might occur. Rather than writing more code to avoid these edge cases, a simple while() loop tries to catch them and sample a new table if needed:
ok <- FALSE
while(!ok) {
  [...generate probabilities...]
  tab <- cbind(c(prob1, prob3), c(prob2, prob4))
  [...compute solutions...]
  ok <- sum(tab) == 1 & all(tab > 0) & all(tab < 1)
}
Application to catching errors
The same type of strategy could also be used for other problems such as the ones you describe. You can wrap the model estimation into code like
fit <- try(mymodel(...), silent = TRUE)
and then use something like
ok <- !inherits(fit, "try-error")
In addition to not producing an error you might require, say, that all coefficients are positive (or something like that). Then you would do:
ok <- !inherits(fit, "try-error") && all(coef(fit) > 0)
Analogously, you could check the convergence of the model etc.
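Putting the pieces together, a minimal sketch of the full pattern; generate_random_series() and mymodel() are hypothetical placeholders for your data generation and GARCH estimation:
ok <- FALSE
while(!ok) {
  dat <- generate_random_series()           #hypothetical data-generating step
  fit <- try(mymodel(dat), silent = TRUE)   #wrap the estimation
  #keep only runs where estimation succeeded; add further checks
  #(e.g., the fitted object's convergence flag) as needed
  ok <- !inherits(fit, "try-error")
}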

How to interpret the prediction in this plot of a classification tree?

I have followed this tutorial and was able to reproduce the results. However, the last graph confuses me. I understand that most of the time the prediction is a probability, but why are there negative numbers? Since the response is Survived, how should I interpret the numbers in the predictions? How can I convert those numbers to Yes and No?
https://www.h2o.ai/blog/finally-you-can-plot-h2o-decision-trees-in-r/
EDIT 11/19/2019: by the way, I did find a similar post on Cross Validated. The answer there was not definitive, since it ended with a question mark.
https://stats.stackexchange.com/questions/374569/may-somebody-help-with-interpretation-of-trees-from-h2o-gbm-see-as-photo-attach
I filtered the data using the logic in the tree and looked at the unique predictions for the subset. I was able to find the threshold for 'yes' and 'no' predictions. I also changed the original code (starting at line 34) so that each leaf shows the ultimate result instead of the raw numbers. However, this is just a way to hack the plot. If someone can tell me how the numbers are derived, that would be great.
if(class(left_node)[[1]] == 'H2OLeafNode')
  leftLabel = ifelse(left_node@prediction >= threshold, 'Yes', 'No')
else
  leftLabel = left_node@split_feature
if(class(right_node)[[1]] == 'H2OLeafNode')
  rightLabel = ifelse(right_node@prediction >= threshold, 'Yes', 'No')
else
  rightLabel = right_node@split_feature
Since the picture is of a GBM tree, it's not as straightforward as you might like: the inference calculation does some math on the value extracted from the leaf of the tree.
The actual code is here:
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/gbm/GbmMojoModel.java
Look at the score0 function.
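For context (my reading of how GBM scoring works, not a confirmed statement from the answer above): for a binomial GBM the per-tree leaf values live on the log-odds (link) scale, which is why they can be negative; the scoring code sums them together with the initial prediction and then applies the inverse link. A sketch of that last step with a hypothetical raw value:
raw = -0.85                      #hypothetical: init value + sum of leaf contributions (log-odds)
prob_yes = 1 / (1 + exp(-raw))   #inverse logit, as applied during scoring
prob_yes                         #~0.30, i.e. predicted P(Survived = "Yes")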
My advice would be to build a 1-tree DRF instead, and then write a short Java program and single-step it in a Java debugger.
The Java snippet to start from is the one showing how to compile and run a MOJO in this document:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
If you do this, you will be able to step through the exact steps that produce the answer (for GBM as well if you prefer), and nothing will be unknown at that point.

How do I use prodlim function with a non-binary variable in formula?

I am trying to (eventually) plot data by groups, using the prodlim function.
I'm adjusting and adapting code that someone else (not available for questions) has written, and I'm not very familiar with the prodlim library/function. There are definitely other ways to do what I'd like to, but I'm trying to keep it consistent with what the previous person did.
I have code that works, when dividing the data into 2 groups, but when I try to adjust for a 4 group situation, I get an error.
Of note, the data is coming over from SAS using StatTransfer, which has been working fine.
I am new to coding, but I have compared the dataframes I'm trying to work with. The second is just a subset of the first (where the code does work), with all the same variables, and both of the variables I'm trying to group by are integer values.
Hist(medpop$dz_time, medpop$dz_status) works just fine, so the problem must be with the prodlim function, and I haven't understood much of what I've looked up about it, sadly :/ But the documentation seems to indicate it supports continuous or categorical variables, and doesn't seem limited to binary ones either. None of the options seem applicable as I understand them.
this works:
M <- prodlim(Hist(dz_time, dz_status)~med, data=pop)
where med is a binary value =1 when a member of this population is taking it, and dz is a disease that some portion develop.
this does not:
(either of these calls gets the error below)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=medpop)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=pop, subset=pop$med==1)
medpop = the subset of the original population taking the med,
strength = categorical variable ("1","2","3","4")
For the line that does work, the next step is just plot(M), giving a plot with two lines, med==0 and med==1 (showing cumulative incidence of dz_status by dz_time).
For the other line, I get an error saying
Error in KernSmooth::dpik(cumtabx/N, kernel = "box") :
scale estimate is zero for input data
I don't know what that means or how to fix it. :/
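One guess (untested, since the data are not available): prodlim treats numeric covariates as continuous and smooths them via KernSmooth::dpik, which is where the error message comes from; declaring strength as a factor should force discrete groups instead:
medpop$strength <- factor(medpop$strength)   #make prodlim treat it as categorical
N <- prodlim(Hist(dz_time, dz_status) ~ strength, data = medpop)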

auto.arima() seemingly selects different models given same data

I was trying something like the auto.arima example in https://otexts.com/fpp2/lagged-predictors.html and noticed I get different results depending on whether I specify (all) rows of data explicitly or not. MWE:
library(forecast); library(fpp2)
nrow(insurance)
auto.arima(insurance[,1], xreg=insurance[,2], stationary=TRUE)
auto.arima(insurance[1:40,1], xreg=insurance[1:40,2], stationary=TRUE)
The nrow(insurance) shows there are 40 rows, so I'd think insurance[,1] would be the same as insurance[1:40,1], and similarly for the second column. Yet, the first way results in a "Regression with ARIMA(3,0,0) errors" whereas the second way results in a "Regression with ARIMA(1,0,2) errors."
Why do these seemingly equivalent calls result in different selected models?
Note that insurance[,1] has labels and insurance[1:40,1] does not. If you pass as.numeric(insurance[,1]) you will actually get "ARIMA(1,0,2)". So I bet it has to do with whether the first argument has labels or not... Also note that it doesn't matter whether xreg=insurance[,2] or xreg=insurance[1:40,2]; both work.
Corey nudged me in the right direction: insurance[,1] is a "time series" whereas insurance[1:40,1] is numeric. That is, is.ts(insurance[,1]) is TRUE but is.ts(insurance[1:40,1]) is FALSE. The forecast package has a subset() function that preserves the time series structure, so is.ts(subset(insurance[,1], start=1, end=40)) is TRUE, and
auto.arima(subset(insurance[,1],start=1,end=40),
xreg=subset(insurance[,2],start=1,end=40), stationary=TRUE)
gives the same output as the first version in my question (with insurance[,1] and insurance[,2]).
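To make the comment's observation concrete, a quick check (same data as above):
is.ts(insurance[, 1])      #TRUE: column-subsetting the mts keeps the ts class
is.ts(insurance[1:40, 1])  #FALSE: row-indexing drops the ts attributes
#Stripping the ts attributes from the full series reproduces the second model:
auto.arima(as.numeric(insurance[, 1]), xreg = as.numeric(insurance[, 2]),
           stationary = TRUE)   #"Regression with ARIMA(1,0,2) errors"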
I think that explains "why" at least superficially, although I don't understand
1) why the time series structure changes the result here (since there doesn't seem to be any seasonality in the selected models?), and
2) why in the linked example Hyndman uses insurance[4:40,1] instead of his own subset() function from his forecast package?
I'll wait to see if somebody wants to answer those "deeper" questions; otherwise I'll probably accept this answer.
