R: Regression of the sum of distributions on a histogram

Here is the case:
I want to describe a histogram as the sum of several distributions, and thus to fit these distributions to that histogram. In ROOT/C++ that is pretty straightforward, but I am looking for the equivalent in R. Here is a self-explanatory example:
## SUM OF TWO GAUSSIANS OF DIFFERENT WIDTHS
x=rnorm(n=1000,mean=0,sd=1)
y=rnorm(n=1000,mean=0,sd=3)
z=append(x,y)
b=seq(-10,10,by=0.25)
hist(z,breaks=b)
In this case the individual contributions (x) and (y) are known, and I can extract their density curves with a Kernel:
## NARROW GAUSSIAN
hist(x,prob=T,breaks=b)
dx=density(x,ker="epan")
lines(dx,col=3,lwd=2)
## WIDE GAUSSIAN
hist(y,prob=T,breaks=b)
dy=density(y,ker="epan")
lines(dy,col=2,lwd=2)
I would like to do something like
z~dx+dy
Where the fractions of dx and dy would be the parameters to be fitted.
Looking into the R documentation I have only found references to single regression and smoothing.
Does anyone have a clue or a sympathetic link?
Thanks in advance,
X.

I found a way, though it ignores the kernel:
x=rnorm(n=10000,mean=0,sd=1)
y=rnorm(n=10000,mean=0,sd=3)
z=append(x,y)
x=subset(x,abs(x)<=10)
y=subset(y,abs(y)<=10)
z=subset(z,abs(z)<=10)
hx=hist(x,prob=T,breaks=b)
hy=hist(y,prob=T,breaks=b)
hz=hist(z,prob=T,breaks=b)
lm(formula=as.formula(hz$intensities~hx$intensities+hy$intensities))
Call:
lm(formula = as.formula(hz$intensities ~ hx$intensities + hy$intensities))
Coefficients:
   (Intercept)  hx$intensities  hy$intensities
     4.344e-17       5.002e-01       4.998e-01
That assumes that the template histograms are reliable (enough entries, relevant binning).
I will meanwhile dig further to see how this can be applied to the fit of kernels, given that
lm(formula=as.formula(hz$intensities~dx$y+dy$y))
lm(formula=as.formula(z~dx$y+dy$y))
both end up with the error:
variable lengths differ (found for 'dx$y')
since the kernel is estimated from the full set (x) and not from the histogram hx.
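One possible workaround for the length mismatch (a sketch of mine, not tested on the original data; the names dxm and dym are just illustrative) is to evaluate each kernel density at the histogram bin midpoints with approx(), so that all vectors have the same length as hz$intensities:
# Evaluate the kernel densities at the bin midpoints; outside the range of the
# kernel estimate the density is effectively zero, hence yleft/yright = 0.
dxm <- approx(dx$x, dx$y, xout=hz$mids, yleft=0, yright=0)$y
dym <- approx(dy$x, dy$y, xout=hz$mids, yleft=0, yright=0)$y
lm(hz$intensities ~ dxm + dym)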
Thanks, greetings to Massachusetts!

Related

Generate beta-binomial distribution from existing vector

Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to be able to understand and implement the instructions provided on the function help pages, at least not in a way that has been helpful for my intended purpose yet.
We can look at the distribution of your variable; the y-axis is the probability:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too few data points to estimate the three parameters needed for a beta-binomial. Hence I fix the probability so that the mean matches the mean of your scores, which, looking at the distribution above, seems ok:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob, size, theta) {
  -sum(dbetabinom(x1, prob, size, theta, log=TRUE))
}
m0 <- mle2(mtmp, start=list(theta=100),
           data=list(size=10, prob=mean(x1)/10), control=list(maxit=1000))
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1], dbetabinom(1:10, size=10, prob=mean(x1)/10, theta=THETA),
      col="blue", lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually ok with using a normal estimation and in this case it will be:
normal_fit$estimate
     mean        sd
 6.560000  1.134196
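Regarding the 95% interval asked for in the question, a rough sketch under this normal fit (my addition, not part of the original answer) would be to take the normal quantiles on the doubled 0-10 scale and convert back to the 0-5 body condition scale:
ci_doubled <- qnorm(c(0.025, 0.975), mean=MEAN, sd=SD)  # interval on the 0-10 scale
ci_scores <- ci_doubled/2                               # back to the 0-5 scale (x1 = numbers*2)
ci_scores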

Trouble with 'fitdistrplus' package, t-distribution

I am trying to fit t-distributions to my data but am unable to do so. My first try was
fitdistr(myData, "t")
There are 41 warnings, all saying that NaNs are produced. I don't know how; logarithms seem to be involved. So I adjusted my data somewhat so that all values are >0, but I still have the same problem (9 fewer warnings, though...). The same problem occurs with sstdFit(), which also produces NaNs.
So instead I try with fitdist which I've seen on stackoverflow and CrossValidated:
fitdist(myData, "t")
I then get
Error in mledist(data, distname, start, fix.arg, ...) :
'start' must be defined as a named list for this distribution
What does this mean? I tried looking into the documentation but that told me nothing. I just want to possibly fit a t-distribution, this is so frustrating :P
Thanks!
Start is the initial guess for the parameters of your distribution. There are logs involved because it is using maximum likelihood and hence log-likelihoods.
library(fitdistrplus)
dat <- rt(100, df=10)
fit <- fitdist(dat, "t", start=list(df=2))
I think it's worth adding that in most cases, using the fitdistrplus package to fit a t-distribution to real data will lead to a very bad fit, which is actually quite misleading. This is because the default t-distribution functions in R are used, and they don't support shifting or scaling. That is, if your data has a mean other than 0, or is scaled in some way, then the fitdist function will simply lead to a bad fit.
In real life, if data fits a t-distribution, it is usually shifted (i.e. has a mean other than 0) and / or scaled. Let's generate some data like that:
data = 1.5*rt(10000,df=5) + 0.5
Given this data has been sampled from the t-distribution with 5 degrees of freedom, you'd think that trying to fit a t-distribution to this should work quite nicely. But actually, here is the result. It estimates a df of 2, and provides a bad fit as shown in the qq plot.
> fit_bad <- fitdist(data,"t",start=list(df=3))
> fit_bad
Fitting of the distribution ' t ' by maximum likelihood
Parameters:
    estimate  Std. Error
df  2.050967  0.04301357
> qqcomp(list(fit_bad)) # generates plot to show fit
When you fit to a t-distribution you want to not only estimate the degrees of freedom, but also a mean and scaling parameter.
The metRology package provides a version of the t-distribution called t.scaled that has a mean and sd parameter in addition to the df parameter. Now let's fit it again:
> library("metRology")
> fit_good <- fitdist(data,"t.scaled",
start=list(df=3,mean=mean(data),sd=sd(data)))
> fit_good
Fitting of the distribution ' t.scaled ' by maximum likelihood
Parameters:
       estimate  Std. Error
df    4.9732159  0.24849246
mean  0.4945922  0.01716461
sd    1.4860637  0.01828821
> qqcomp(list(fit_good)) # generates plot to show fit
Much better :-) The parameters are very close to how we generated the data in the first place! And the QQ plot shows a much nicer fit.
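As a side note, if metRology is unavailable, a shifted/scaled t distribution can be written by hand and passed to fitdist(); this is only a sketch, and the names dt_ls / pt_ls / qt_ls are my own:
# fitdist() finds the distribution by the "d", "p" and "q" prefixes on its name
dt_ls <- function(x, df, mean, sd) dt((x - mean)/sd, df)/sd
pt_ls <- function(q, df, mean, sd) pt((q - mean)/sd, df)
qt_ls <- function(p, df, mean, sd) qt(p, df)*sd + mean
fit_manual <- fitdist(data, "t_ls",
                      start=list(df=3, mean=mean(data), sd=sd(data)))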

Fit generic function (special case Power Law) to a histogram in R [duplicate]

I am trying to plot a power law line to fit x and y data that I already have in a data frame. I have tried power.law.fit in the igraph library but it isn't working. The data frame is:
dat=data.frame(
x=1:8,
ygm=c( 251.288, 167.739, 112.856, 109.705, 102.064, 94.331, 95.206, 91.415)
)
I generally use one of two strategies here: I take the log and fit a linear model, or I use nls. I think you could figure out the logged model if you wanted to, so I'll show the nls method here.
nls1=nls(ygm~i*x^-z,start=list(i=-3,z=-2),data=dat)
Double-check that this is the formula you want; this method accepts a pretty broad class of formulas. Spend some time playing with starting values. In particular, try to think of frontiers where the likelihood surface could do weird things. Try values on both sides of the weird places so you can be sure you are not stuck in a local optimum.
> nls1
Nonlinear regression model
  model: ygm ~ i * x^-z
   data: dat
       i       z
245.0356  0.5449
 residual sum-of-squares: 811.4
...
> predict(nls1)
[1] 245.03564 167.95574 134.66070 115.12256 101.94200 92.30101 84.86458
[8] 78.90891
> plot(dat)
> lines(predict(nls1))
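For completeness, here is a sketch of the other strategy mentioned above, taking logs and fitting a linear model (my addition; note that minimising error on the log scale is not the same criterion as nls uses, so the estimates will differ slightly):
lm1 <- lm(log(ygm) ~ log(x), data=dat)   # log(ygm) = log(i) - z*log(x)
i_log <- exp(coef(lm1)[1])               # multiplicative constant i
z_log <- -coef(lm1)[2]                   # exponent z, matching ygm ~ i * x^-z
curve(i_log * x^-z_log, from=1, to=8, add=TRUE, col="blue")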

loess predict with new x values

I am attempting to understand how the predict.loess function is able to compute new predicted values (y_hat) at points x that do not exist in the original data. For example (this is a simple example and I realize loess is obviously not needed for an example of this sort but it illustrates the point):
x <- 1:10
y <- x^2
mdl <- loess(y ~ x)
predict(mdl, 1.5)
[1] 2.25
loess regression works by fitting polynomials locally, and thus it creates a predicted y_hat at each x. However, because there are no coefficients being stored, the "model" in this case is simply the details of what was used to predict each y_hat, for example, the span or degree. When I do predict(mdl, 1.5), how is predict able to produce a value at this new x? Is it interpolating between the two nearest existing x values and their associated y_hat? If so, what are the details behind how it is doing this?
I have read the cloess documentation online but am unable to find where it discusses this.
However, because there are no coefficients being stored, the "model" in this case is simply the details of what was used to predict each y_hat
Maybe you have used the print(mdl) command, or simply mdl, to see what the model mdl contains, but that display is misleading. The model object is actually quite complicated and stores a large number of parameters.
To get an idea of what is inside, you can use unlist(mdl) and see the big list of parameters in it.
This is a part of the manual of the command describing how it really works:
Fitting is done locally. That is, for the fit at point x, the fit is made using points in a neighbourhood of x, weighted by their distance from x (with differences in ‘parametric’ variables being ignored when computing the distance). The size of the neighbourhood is controlled by α (set by span or enp.target). For α < 1, the neighbourhood includes proportion α of the points, and these have tricubic weighting (proportional to (1 - (dist/maxdist)^3)^3). For α > 1, all points are used, with the ‘maximum distance’ assumed to be α^(1/p) times the actual maximum distance for p explanatory variables.
For the default family, fitting is by (weighted) least squares. For family="symmetric" a few iterations of an M-estimation procedure with Tukey's biweight are used. Be aware that as the initial value is the least-squares fit, this need not be a very resistant fit.
What I believe is that it tries to fit a polynomial model in the neighbourhood of every point (not just a single polynomial for the whole set). The neighbourhood does not mean only one point before and one point after; if I were implementing such a function, I would put a large weight on the points nearest to x, lower weights on more distant points, and try to fit a polynomial that best fits under that total weighting.
Then, if the new value x' for which a prediction is needed is closest to the point x, I would take the polynomial fitted on the neighbourhood of x, say P, evaluate it at x' to get P(x'), and that would be the prediction.
Let me know if you are looking for anything special.
To better understand what is happening in a loess fit try running the loess.demo function from the TeachingDemos package. This lets you interactively click on the plot (even between points) and it then shows the set of points and their weights used in the prediction and the predicted line/curve for that point.
Note also that the default for loess is to do a second smoothing/interpolating on the loess fit, so what you see in the fitted object is probably not the true loess fitting information, but the secondary smoothing.
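A quick usage sketch (mine, with the toy data from the question; span and degree are left at their defaults):
# install.packages("TeachingDemos")  # if not already installed
library(TeachingDemos)
x <- 1:10
y <- x^2
loess.demo(x, y)   # click near (or between) points to see the weights and the local fit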
Found the answer on page 42 of the manual:
In this algorithm a set of points typically small in number is selected for direct computation using the loess fitting method and a surface is evaluated using an interpolation method that is based on blending functions. The space of the factors is divided into rectangular cells using an algorithm based on k-d trees. The loess fit is evaluated at the cell vertices and then blending functions do the interpolation. The output data structure stores the k-d trees and the fits at the vertices. This information is used by predict() to carry out the interpolation.
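A small check along these lines (my sketch, not from the original answers): loess.control(surface = "direct") skips the k-d tree interpolation and evaluates the local fit exactly, so comparing the two predictions shows how much the interpolation step changes the result:
x <- 1:10
y <- x^2
mdl_interp <- loess(y ~ x)   # default: interpolate from fits at the k-d tree vertices
mdl_direct <- loess(y ~ x, control=loess.control(surface="direct"))
predict(mdl_interp, 1.5)
predict(mdl_direct, 1.5)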
I guess that to predict at x, predict.loess performs a regression with some points near x and calculates the y-value at x.
Visit https://stats.stackexchange.com/questions/223469/how-does-a-loess-model-do-its-prediction

Is there an implementation of loess in R with more than 3 parametric predictors or a trick to a similar effect?

Calling all experts on local regression and/or R!
I have run into a limitation of the standard loess function in R and hope you have some advice. The current implementation supports only 1-4 predictors. Let me set out our application scenario to show why this can easily become a problem as soon as we want to employ globally fit parametric covariables.
Essentially, we have a spatial distortion s(x,y) overlaid over a number of measurements z:
z_i = s(x_i,y_i) + v_{g_i}
These measurements z can be grouped by the same underlying undistorted measurement value v for each group g. The group membership g_i is known for each measurement, but the underlying undistorted measurement values v_g for the groups are not known and should be determined by (global, not local) regression.
We need to estimate the two-dimensional spatial trend s(x,y), which we then want to remove. In our application, say there are 20 groups of at least 35 measurements each, in the simplest scenario. The measurements are randomly placed. Taking the first group as reference, there are thus 19 unknown offsets.
The below code for toy data (with a spatial trend in one dimension x) works for two or three offset groups.
Unfortunately, the loess call fails for four or more offset groups with the error message
Error in simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, :
  only 1-4 predictors are allowed
I tried overriding the restriction and got
k>d2MAX in ehg136. Need to recompile with increased dimensions.
How easy would that be to do? I cannot find a definition of d2MAX anywhere, and it seems this might be hardcoded -- the error is apparently triggered by line #1359 in loessf.f
if(k .gt. 15) call ehg182(105)
Alternatively, does anyone know of an implementation of local regression with global (parametric) offset groups that could be applied here?
Or is there a better way of dealing with this? I tried lme with correlation structures but that seems to be much, much slower.
Any comments would be greatly appreciated!
Many thanks,
David
###
#
# loess with parametric offsets - toy data demo
#
x<-seq(0,9,.1);
x.N<-length(x);
o<-c(0.4,-0.8,1.2#,-0.2 # works for three but not four
); # these are the (unknown) offsets
o.N<-length(o);
f<-sapply(seq(o.N),
          function(n){
            ifelse((seq(x.N)<= n *x.N/(o.N+1) &
                    seq(x.N)> (n-1)*x.N/(o.N+1)),
                   1,0);
          });
f<-f[sample(NROW(f)),];
y<-sin(x)+rnorm(length(x),0,.1)+f%*%o;
s.fs<-sapply(seq(NCOL(f)),function(i){paste('f',i,sep='')});
s<-paste(c('y~x',s.fs),collapse='+');
d<-data.frame(x,y,f)
names(d)<-c('x','y',s.fs);
l<-loess(formula(s),parametric=s.fs,drop.square=s.fs,normalize=F,data=d,
span=0.4);
yp<-predict(l,newdata=d);
plot(x,y,pch='+',ylim=c(-3,3),col='red'); # input data
points(x,yp,pch='o',col='blue'); # fit of that
d0<-d; d0$f1<-d0$f2<-d0$f3<-0;
yp0<-predict(l,newdata=d0);
points(x,y-f%*%o); # spatial distortion
lines(x,yp0,pch='+'); # estimate of that
op<-sapply(seq(NCOL(f)),function(i){(yp-yp0)[!!f[,i]][1]});
cat("Demo offsets:",o,"\n");
cat("Estimated offsets:",format(op,digits=1),"\n");
Why don't you use an additive model for this? Package mgcv will handle this sort of model just fine, if I understand your Question correctly. I might have this wrong, but the code you show relates y to x, whereas your Question mentions z ~ s(x, y) + g. What I show below for gam() is for a response z modelled by a spatial smooth in x and y, with g estimated parametrically and stored as a factor in the data frame:
require(mgcv)
m <- gam(z ~ s(x,y) + g, data = foo)
Or have I misunderstood what you wanted? If you want to post a small snippet of data I can give a proper example using mgcv...?
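For concreteness, here is a rough sketch of what that could look like on simulated data in the same spirit as the toy demo above (my own construction; the variable names, offsets and grid are purely illustrative):
library(mgcv)
set.seed(1)
foo <- expand.grid(x=seq(0, 9, by=0.5), y=seq(0, 9, by=0.5))
foo$g <- factor(sample(1:5, nrow(foo), replace=TRUE))   # known group membership
offs <- c(0, 0.4, -0.8, 1.2, -0.2)                      # unknown offsets, group 1 as reference
foo$z <- sin(foo$x)*cos(foo$y) + offs[foo$g] + rnorm(nrow(foo), 0, 0.1)
m <- gam(z ~ s(x, y) + g, data=foo)
summary(m)                         # the parametric g terms estimate the offsets
newd <- foo
newd$g <- factor(1, levels=levels(foo$g))
trend <- predict(m, newdata=newd)  # spatial trend with g fixed at the reference group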
