A Feature Selection Algorithm POE1ACC for features with continuous value - r

i want to implement the algorithm of "Probability of Error and Average Correlation Coefficient". (more info Page 143. It is a algorithm to elect unused features from set of features. As far as i know, this algorithm is not limited to boolean valued features but i dont know how i can use it for continuous features.
This is the only example what i could find about this algorithm:
Thus, X is to be predicted feature and C is any feature. To calculate Probability of Error value of C, they select values which are mismatching with green pieces. Thus PoE of C is (1-7/9) + (1-6/7) = 3/16 = 1875.
My question is thus: How can we use a continuous feature instead of a boolean feature, like in this example, to calculate PoE? Or is it not possible?

The algorithm that you describe is a feature selection algorithm, similar to the forward selection technique. At each step, we find a new feature Fi that minimizes this criterion :
weight_1 * ErrorProbability(Fi) + weight_2 * Acc(Fi)
ACC(Fi) represents the mean correlation between the feature Fi and other features already selected. You want to minimize this in order to have all your features not correlated, thus have a well conditionned problem.
ErrorProbability(Fi) represents if the feature correctly describes the variable you want to predict. For example, lets say you want to predict if tommorow will be rainy depending on temperature (continuous feature)
The Bayes error rate is (http://en.wikipedia.org/wiki/Bayes_error_rate) :
P = Sum_Ci { Integral_xeHi { P(x|Ci)*P(Ci) } }
In our example
Ci belong to {rainy ; not rainy}
x are instances of temperatures
Hi represent all temperatures that would lead to a Ci prediction.
What is interesting is that you can take any predictor you like.
Now, suppose you have all temperatures in one vector, all states rainy/not rainy in another vector :
In order to have P(x|Rainy), consider the following values :
temperaturesWhenRainy <- temperatures[which(state=='rainy')]
What you should do next is to plot an histogram of these values. Then you should try to fit a distribution on it. You will havea parametric formula of P(x|Rainy).
If your distribution is gaussian, you can do it simply :
m <- mean(temperaturesWhenRainy)
s <- sd(temperaturesWhenRainy)
Given some x value, you have the density of probability of P(x|Rainy) :
p <- dnorm(x, mean = m, sd = s)
You can do the same procedure for P(x|Not Rainy). Then P(Rainy) and P(Not Rainy) are easy to compute.
Once you have all that stuff you can use the Bayes error rate formula, which yields your ErrorProbability for a continuous feature.
Cheers

Related

GAM smooths interaction differences - calculate p value using mgcv and gratia 0.6

I am using the useful gratia package by Gavin Simpson to extract the difference in two smooths for two different levels of a factor variable. The smooths are generated by the wonderful mgcv package. For example
library(mgcv)
library(gratia)
m1 <- gam(outcome ~ s(dep_var, by = fact_var) + fact_var, data = my.data)
diff1 <- difference_smooths(m1, smooth = "s(dep_var)")
draw(diff1)
This give me a graph of the difference between the two smooths for each level of the "by" variable in the gam() call. The graph has a shaded 95% credible interval (CI) for the difference.
Statistical significance, or areas of statistical significance at the 0.05 level, is assessed by whether or where the y = 0 line crosses the CI, where the y axis represents the difference between the smooths.
Here is an example from Gavin's site where the "by" factor variable had 3 levels.
The differences are clearly statistically significant (at 0.05) over nearly all of the graphs.
Here is another example I have generated using a "by" variable with 2 levels.
The difference in my example is clearly not statistically significant anywhere.
In the mgcv package, an approximate p value is outputted for a smooth fit that tests the null hypothesis that the coefficients are all = 0, based on a chi square test.
My question is, can anyone suggest a way of calculating a p value that similarly assesses the difference between the two smooths instead of solely relying on graphical evidence?
The output from difference_smooths() is a data frame with differences between the smooth functions at 100 points in the range of the smoothed variable, the standard error for the difference and the upper and lower limits of the CI.
Here is a link to the release of gratia 0.4 that explains the difference_smooths() function
enter link description here
but gratia is now at version 0.6
enter link description here
Thanks in advance for taking the time to consider this.
Don
One way of getting a p value for the interaction between the by factor variables is to manipulate the difference_smooths() function by activating the ci_level option. Default is 0.95. The ci_level can be manipulated to find a level where the y = 0 is no longer within the CI bands. If for example this occurred when ci_level = my_level, the p value for testing the hypothesis that the difference is zero everywhere would be 1 - my_level.
This is not totally satisfactory. For example, it would take a little manual experimentation and it may be difficult to discern accurately when zero drops out of the CI. Although, a function could be written to search the accompanying data frame that is outputted with difference_smooths() as the ci_level is varied. This is not totally satisfactory either because the detection of a non-zero CI would be dependent on the 100 points chosen by difference_smooths() to assess the difference between the two curves. Then again, the standard errors are approximate for a GAM using mgcv, so that shouldn't be too much of a problem.
Here is a graph where the zero first drops out of the CI.
Zero dropped out at ci_level = 0.88 and was still in the interval at ci_level = 0.89. So an approxiamte p value would be 1 - 0.88 = 0.12.
Can anyone think of a better way?
Reply to Gavin Simpson's comments Feb 19
Thanks very much Gavin for taking the time to make your comments.
I am not sure if using the criterion, >= 0 (for negative diffs), is a good way to go. Because of the draws from the posterior, there is likely to be many diffs that meet this criterion. I am interpreting your criterion as sample the posterior distribution and count how many differences meet the criterion, calculate the percentage and that is the p value. Correct me if I have misunderstood. Using this approach, I consistently got p values at around 0.45 - 0.5 for different gam models, even when it was clear the difference in the smooths should be statistically significant, at least at p = 0.05, because the confidence band around the smooth did not contain zero at a number of points.
Instead, I was thinking perhaps it would be better to compare the means of the posterior distribution of each of the diffs. For example
# get coefficients for the by smooths
coeff.level1 <- coef(gam.model1)[31:38]
coeff.level0 <- coef(gam.model1)[23:30]
# these indices are specific to my multi-variable gam.model1
# in my case 8 coefficients per smooth
# get posterior coefficients variances for the by smooths' coefficients
vp_level1 <- gam.model1$Vp[31:38, 31:38]
vp_level0 <- gam.model1$Vp[23:30, 23:30]
#run the simulation to get the distribution of each
#difference coefficient using the joint variance
library(MASS)
no.draws = 1000
sim <- mvrnorm(n = no.draws, (coeff.level1 - coeff.level0),
(vp_level1 + vp_level0))
# sim is a no.draws X no. of coefficients (8 in my case) matrix
# put the results into a data.frame.
y.group <- data.frame(y = as.vector(sim),
group = c(rep(1,no.draws), rep(2,no.draws),
rep(3,no.draws), rep(4,no.draws),
rep(5,no.draws), rep(6,no.draws),
rep(7,no.draws), rep(8,no.draws)) )
# y has the differences sampled from their posterior distributions.
# group is just a grouping name for the 8 sets of differences,
# (one set for each difference in coefficients)
# compare means with a linear regression
lm.test <- lm(y ~ as.factor(group), data = y.group)
summary(lm.test)
# The p value for the F statistic tells you how
# compatible the data are with the null hypothesis that
# all the group means are equal to each other.
# Same F statistic and p value from
anova(lm.test)
One could argue that if all coefficients are not equal to each other then they all can't be equal to zero but that isn't what we want here.
The basis of the smooth tests of fit given by summary(mgcv::gam.model1)
is a joint test of all coefficients == 0. This would be from a type of likelihood ratio test where model fit with and without a term are compared.
I would appreciate some ideas how to do this with the difference between two smooths.
Now that I got this far, I had a rethink of your original suggestion of using the criterion, >= 0 (for negative diffs). I reinterpreted this as meaning for each simulated coefficient difference distribution (in my case 8), count when this occurs and make a table where each row (my case, 8) is for one of these distributions with two columns holding this count and (number of simulation draws minus count), Then on this table run a chi square test. When I did this, I got a very low p value when I believe I shouldn't have as 0 was well within the smooth difference CI across almost all the levels of the exposure. Maybe I am still misunderstanding your suggestion.
Follow up thought Feb 24
In a follow up thought, we could create a variable that represents the interaction between the by factor and continuous variable
library(dplyr)
my.dat <- my.dat %>% mutate(interact.var =
ifelse(factor.2levels == "yes", 1, 0)*cont.var)
Here I am assuming that factor.2levels has the levels ("no", "yes"), and "no" is the reference level. The ifelse function creates a dummy variable which is multiplied by the continuous variable to generate the interactive variable.
Then we place this interactive variable in the GAM and get the usual statistical test for fit, that is, testing all the coefficients == 0.
#GavinSimpson actually posted a method of how to get the difference between two smooths and assess its statistical significance here in 2017. Thanks to Matteo Fasiolo for pointing me in that direction.
In that approach, the by variable is converted to an ordered categorical variable which causes mgcv::gam to produce difference smooths in comparison to the reference level. Statistical significance for the difference smooths is then tested in the usual way with the summary command for the gam model.
However, and correct me if I have misunderstood, the ordered factor approach causes the smooth for the main effect to now be the smooth for the reference level of the ordered factor.
The approach I suggested, see the main post under the heading, Follow up thought Feb 24, where the interaction variable is created, gives an almost identical result for the p value for the difference smooth but does not change the smooth for the main effect. It also does not change the intercept and the linear term for the by categorical variable which also both changed with the ordered variable approach.

Code syntax in calculating posterior distribution in WinBUGS

Recently I read "The BUGS Book – A Practical Introduction to Bayesian Analysis" to learn WinBUGS. The way WinBUGS describes the derivation of posterior distribution makes me feel confused.
Let's take Example 4.1.1 in this book to illustrae:
Suppose we observe the number of deaths y in a given hospital for a
high-risk operation. Let n denote the total number of such
operations performed and suppose we wish to make inferences regarding
the underlying true mortality rate, $\theta$.
The code of WinBUGS is:
y <- 10 # the number of deaths
n <- 100 # the total number of such operations
#########################
y ~ dbin(theta,n) # likelihood, also a parametric sampling distribution
logit(theta) <- logit.theta # normal prior for the logistic transform of theta
logit.theta ~ dnorm(0,0.368) # precision = 1/2.71
The author said that:
The software knows how to derive the posterior distribution and
subsequently sample from it.
My question is:
Which code reflects the logic structure to tell WinBUGS about "which parameter that I want to calculate its posterior distribution"?
This question seems silly, but if I do not read the background first, I truly cannot find directly in the code above about which parameter is focused on (e.g., theta, or y?).
Below are some of my thoughts (as a beginner of WinBUGS):
I think the following three attributions of the code style in WinBUGS makes me confused:
(1) the code does not follow "a specific sequence". For example, why is logit.theta ~ dnorm(0,0.368) not in front of logit(theta) <- logit.theta?
(2) repeated variable. Foe example, why did the last two lines not be reduced into one line: logit(theta) ~ dnorm(0,0.368)?
(3) variables are defined in more than one place. For example, y is defined two times: y <- 10 and y ~ dbin(theta, n). This one has been explained in Appendix A of the book (i.e., However, a check has been built in so that when finding a logical node that also features as a stochastic node, a stochastic node is created with the calculated values as fixed data), yet I still cannot catch its meaning.
BUGS is a declarative language. For the most part, statements aren't executed in sequence, they define different parts of the model. BUGS works on models that can be represented by directed acyclic graphs, i.e. those where you put a prior on some components, then conditional distributions on other components given the earlier ones.
It's a fairly simple language, so I think logit(theta) ~ dnorm(0, 0.368) is just too complicated for it.
The language lets you define a complicated probability model, and declare observations of certain components in it. Once you declare an observation, the model that BUGS samples from is the the original full model conditioned on that observation. y <- 10 defines observed data. y ~ dbin(theta,n) is part of the model.
The statement n <- 100 could be either: for fixed constants like n, it doesn't really matter which way you think of it. Either the model says that n is always 100, or n has an undeclared prior distribution not depending on any other parameter, and an observed value of 100. These two statements are equivalent.
Finally, your big question: Nothing in the code above says which parameter you want to look at. BUGS will compute the joint posterior distribution of every parameter. n and y will take on their fixed values, theta and logit.theta will both be simulated from the posterior. In another part of your code (or by using the WinBUGS menus) you can decide which of those to look at.

bnlearn::bn.fit difference and calculation of methods "mle" and "bayes"

I try to understand the differences between the two methods bayes and mle in the bn.fit function of the package bnlearn.
I know about the debate between the frequentist and the bayesian approach on understanding probabilities. On a theoretical level I suppose the maximum likelihood estimate mle is a simple frequentist approach setting the relative frequencies as the probability. But what calculations are done to get the bayes estimate? I already checked out the bnlearn documenation, the description of the bn.fit function and some application examples, but nowhere there's a real description of what's happening.
I also tried to understand the function in R by first checking out bnlearn::bn.fit, leading to bnlearn:::bn.fit.backend, leading to bnlearn:::smartSapply but then I got stuck.
Some help would be really appreciated as I use the package for academic work and therefore I should be able to explain what happens.
Bayesian parameter estimation in bnlearn::bn.fit applies to discrete variables. The key is the optional iss argument: "the imaginary sample size used by the bayes method to estimate the conditional probability tables (CPTs) associated with discrete nodes".
So, for a binary root node X in some network, the bayes option in bnlearn::bn.fit returns (Nx + iss / cptsize) / (N + iss) as the probability of X = x, where N is your number of samples, Nx the number of samples with X = x, and cptsize the size of the CPT of X; in this case cptsize = 2. The relevant code is in the bnlearn:::bn.fit.backend.discrete function, in particular the line: tab = tab + extra.args$iss/prod(dim(tab))
Thus, iss / cptsize is the number of imaginary observations for each entry in a CPT, as opposed to N, the number of 'real' observations. With iss = 0 you would be getting a maximum likelihood estimate, as you would have no prior imaginary observations.
The higher iss with respect to N, the stronger the effect of the prior on your posterior parameter estimates. With a fixed iss and a growing N, the Bayesian estimator and the maximum likelihood estimator converge to the same value.
A common rule of thumb is to use a small non-zero iss so that you avoid zero entries in the CPTs, corresponding to combinations that were not observed in the data. Such zero entries could then result in a network which generalizes poorly, such as some early versions of the Pathfinder system.
For more details on Bayesian parameter estimation you can have a look at the book by Koller and Friedman. I suppose many other Bayesian network books also cover the topic.

loess predict with new x values

I am attempting to understand how the predict.loess function is able to compute new predicted values (y_hat) at points x that do not exist in the original data. For example (this is a simple example and I realize loess is obviously not needed for an example of this sort but it illustrates the point):
x <- 1:10
y <- x^2
mdl <- loess(y ~ x)
predict(mdl, 1.5)
[1] 2.25
loess regression works by using polynomials at each x and thus it creates a predicted y_hat at each y. However, because there are no coefficients being stored, the "model" in this case is simply the details of what was used to predict each y_hat, for example, the span or degree. When I do predict(mdl, 1.5), how is predict able to produce a value at this new x? Is it interpolating between two nearest existing x values and their associated y_hat? If so, what are the details behind how it is doing this?
I have read the cloess documentation online but am unable to find where it discusses this.
However, because there are no coefficients being stored, the "model" in this case is simply the details of what was used to predict each y_hat
Maybe you have used print(mdl) command or simply mdl to see what the model mdl contains, but this is not the case. The model is really complicated and stores a big number of parameters.
To have an idea what's inside, you may use unlist(mdl) and see the big list of parameters in it.
This is a part of the manual of the command describing how it really works:
Fitting is done locally. That is, for the fit at point x, the fit is made using points in a neighbourhood of x, weighted by their distance from x (with differences in ‘parametric’ variables being ignored when computing the distance). The size of the neighbourhood is controlled by α (set by span or enp.target). For α < 1, the neighbourhood includes proportion α of the points, and these have tricubic weighting (proportional to (1 - (dist/maxdist)^3)^3). For α > 1, all points are used, with the ‘maximum distance’ assumed to be α^(1/p) times the actual maximum distance for p explanatory variables.
For the default family, fitting is by (weighted) least squares. For
family="symmetric" a few iterations of an M-estimation procedure with
Tukey's biweight are used. Be aware that as the initial value is the
least-squares fit, this need not be a very resistant fit.
What I believe is that it tries to fit a polynomial model in the neighborhood of every point (not just a single polynomial for the whole set). But the neighborhood does not mean only one point before and one point after, if I was implementing such a function I put a big weight on the nearest points to the point x, and lower weights to distal points, and tried to fit a polynomial that fits the highest total weight.
Then if the given x' for which height should be predicted is closest to point x, I tried to use the polynomial fitted on the neighborhoods of the point x - say P(x) - and applied it over x' - say P(x') - and that would be the prediction.
Let me know if you are looking for anything special.
To better understand what is happening in a loess fit try running the loess.demo function from the TeachingDemos package. This lets you interactively click on the plot (even between points) and it then shows the set of points and their weights used in the prediction and the predicted line/curve for that point.
Note also that the default for loess is to do a second smoothing/interpolating on the loess fit, so what you see in the fitted object is probably not the true loess fitting information, but the secondary smoothing.
Found the answer on page 42 of the manual:
In this algorithm a set of points typically small in number is selected for direct
computation using the loess fitting method and a surface is evaluated using an interpolation
method that is based on blending functions. The space of the factors is divided into
rectangular cells using an algorithm based on k-d trees. The loess fit is evaluated at
the cell vertices and then blending functions do the interpolation. The output data
structure stores the k-d trees and the fits at the vertices. This information
is used by predict() to carry out the interpolation.
I geuss that for predict at x, predict.loess make a regression with some points near x, and calculate the y-value at x.
Visit https://stats.stackexchange.com/questions/223469/how-does-a-loess-model-do-its-prediction

Is there an implementation of loess in R with more than 3 parametric predictors or a trick to a similar effect?

Calling all experts on local regression and/or R!
I have run into a limitation of the standard loess function in R and hope you have some advice. The current implementation supports only 1-4 predictors. Let me set out our application scenario to show why this can easily become a problem as soon as we want to employ globally fit parametric covariables.
Essentially, we have a spatial distortion s(x,y) overlaid over a number of measurements z:
z_i = s(x_i,y_i) + v_{g_i}
These measurements z can be grouped by the same underlying undistorted measurement value v for each group g. The group membership g_i is known for each measurement, but the underlying undistorted measurement values v_g for the groups are not known and should be determined by (global, not local) regression.
We need to estimate the two-dimensional spatial trend s(x,y), which we then want to remove. In our application, say there are 20 groups of at least 35 measurements each, in the most simple scenario. The measurements are randomly placed. Taking the first group as reference, there are thus 19 unknown offsets.
The below code for toy data (with a spatial trend in one dimension x) works for two or three offset groups.
Unfortunately, the loess call fails for four or more offset groups with the error message
Error in simpleLoess(y, x, w, span, degree, parametric, drop.square,
normalize, :
only 1-4 predictors are allowed"
I tried overriding the restriction and got
k>d2MAX in ehg136. Need to recompile with increased dimensions.
How easy would that be to do? I cannot find a definition of d2MAX anywhere, and it seems this might be hardcoded -- the error is apparently triggered by line #1359 in loessf.f
if(k .gt. 15) call ehg182(105)
Alternatively, does anyone know of an implementation of local regression with global (parametric) offset groups that could be applied here?
Or is there a better way of dealing with this? I tried lme with correlation structures but that seems to be much, much slower.
Any comments would be greatly appreciated!
Many thanks,
David
###
#
# loess with parametric offsets - toy data demo
#
x<-seq(0,9,.1);
x.N<-length(x);
o<-c(0.4,-0.8,1.2#,-0.2 # works for three but not four
); # these are the (unknown) offsets
o.N<-length(o);
f<-sapply(seq(o.N),
function(n){
ifelse((seq(x.N)<= n *x.N/(o.N+1) &
seq(x.N)> (n-1)*x.N/(o.N+1)),
1,0);
});
f<-f[sample(NROW(f)),];
y<-sin(x)+rnorm(length(x),0,.1)+f%*%o;
s.fs<-sapply(seq(NCOL(f)),function(i){paste('f',i,sep='')});
s<-paste(c('y~x',s.fs),collapse='+');
d<-data.frame(x,y,f)
names(d)<-c('x','y',s.fs);
l<-loess(formula(s),parametric=s.fs,drop.square=s.fs,normalize=F,data=d,
span=0.4);
yp<-predict(l,newdata=d);
plot(x,y,pch='+',ylim=c(-3,3),col='red'); # input data
points(x,yp,pch='o',col='blue'); # fit of that
d0<-d; d0$f1<-d0$f2<-d0$f3<-0;
yp0<-predict(l,newdata=d0);
points(x,y-f%*%o); # spatial distortion
lines(x,yp0,pch='+'); # estimate of that
op<-sapply(seq(NCOL(f)),function(i){(yp-yp0)[!!f[,i]][1]});
cat("Demo offsets:",o,"\n");
cat("Estimated offsets:",format(op,digits=1),"\n");
Why don't you use an additive model for this? Package mgcv will handle this sort of model, if I understand your Question, just fine. I might have this wrong, but the code you show is relating x ~ y, but your Question mentions z ~ s(x, y) + g. What I show below for gam() is for response z modelled by a spatial smooth in x and y with g being estimated parametrically, with g stored as a factor in the data frame:
require(mgcv)
m <- gam(z ~ s(x,y) + g, data = foo)
Or have I misunderstood what you wanted? If you want to post a small snippet of data I can give a proper example using mgcv...?

Resources