Weibull distribution with weighted data in R

I have some time-to-event data from which I need to generate around 200 shape/scale parameter pairs for subgroups in a simulation model. I have analysed the data, and it is best described by a Weibull distribution.
Normally I would use the fitdistrplus package and fitdist(x, "weibull") to do so. However, this data has been matched using kernel matching and I have a variable of weighting values called km, so the fit needs to incorporate a weight, which isn't something fitdist can do as far as I can tell.
With my gamma-distributed data, instead of using fitdist I did the calculation manually using the wtd.mean and wtd.var functions from the Hmisc package, which worked well. However, finding a similar formula for the Weibull is eluding me.
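For reference, the gamma calculation was presumably along these lines, i.e. matching moments computed from the weighted mean and variance (a minimal sketch, where x is a subgroup's data vector and km its weights):
library(Hmisc)
m <- wtd.mean(x, weights = km)
v <- wtd.var(x, weights = km)
c(shape = m^2 / v, rate = m / v)  # gamma method-of-moments estimates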
I've been testing a few options and comparing them against the fitdist results:
library(fitdistrplus)
test_data <- rweibull(100, 0.676, 946)
fitweibull <- fitdist(test_data, "weibull", method = "mle", lower = c(0,0))
fitweibull$estimate
shape scale
0.6981165 935.0907482
I first tested this: The Weibull distribution in R (ExtDist)
library(bbmle)
m1 <- mle2(y ~ dweibull(shape = exp(lshape), scale = exp(lscale)),
           data = data.frame(y = test_data),
           start = list(lshape = 0, lscale = 0))
which gave me lshape = -0.3919991 and lscale = 6.852033
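Back-transforming the coefficients puts them on the ordinary shape/scale parametrization:
exp(coef(m1))  # shape and scale on the original scale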
The other thing I've tried is eweibull from the EnvStats package.
library(EnvStats)
eweibull <- eweibull(test_data)
eweibull$parameters
shape scale
0.698091 935.239277
However, while these approaches give results, I still don't see how to fit my data together with the weights using any of them.
Edit: I have also tried the similarly named eWeibull from the ExtDist package (which I'm not 100% sure still works, but it does have a Weibull function that accepts weights!). I get a lot of error messages about the inputs being non-computable (NA or infinite). If I do it with map, i.e. map(test_data, test_km, eWeibull), I get NULL for all 100 values. If I try it with just test_data, I get a long string of errors associated with optimx.
I have also tried fitDistr from propagate, which gives errors that the weights should be a specific length. For example, if both data and weights have length 100, I get an error that weights should be length 94; if I set them to 94, it tells me they have to be length 132.
I need to be able either to pass a set of pre-weighted mean/var/sd values into the calculation, or to have a function that can take data and weights and use both in the calculation.

After much trial and error, I edited the eweibull function from the EnvStats package to use wtd.mean(x, w) and sqrt(wtd.var(x, w)) in place of mean(x) and sd(x). This now runs and outputs weighted values.
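For anyone wanting to reproduce this without patching EnvStats, here is a minimal stand-alone sketch of the same idea: a weighted method-of-moments fit that solves for the shape from the weighted coefficient of variation (the function name weighted_eweibull is mine, not from any package):
library(Hmisc)  # wtd.mean, wtd.var
weighted_eweibull <- function(x, w) {
  m  <- wtd.mean(x, weights = w)
  s  <- sqrt(wtd.var(x, weights = w))
  cv <- s / m
  # the shape k solves Gamma(1 + 2/k) / Gamma(1 + 1/k)^2 - 1 = cv^2
  f <- function(k) lgamma(1 + 2 / k) - 2 * lgamma(1 + 1 / k) - log(cv^2 + 1)
  shape <- uniroot(f, interval = c(0.02, 100))$root
  scale <- m / gamma(1 + 1 / shape)
  c(shape = shape, scale = scale)
}
# with equal weights this should land close to the unweighted estimators:
weighted_eweibull(test_data, rep(1, length(test_data)))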

Related

How can I get SE for smooth.spline()?

I use smooth.spline() on two large numeric vectors, x and y. I now want to visualize it, showing confidence intervals (or standard errors).
Ideally, I would just use geom_smooth(method = "spline"), but that doesn't exist. So I tried method = "loess", which is similar, but then it runs out of memory (unless I set se = FALSE), even though I have already raised the memory limit as far as it will go.
I then used the base R function smooth.spline(), but that doesn't calculate standard errors or confidence intervals. I tried predict(model, se = TRUE), but it does not produce standard errors for smooth.spline models.
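Roughly what is being described, as a self-contained sketch (the simulated x and y stand in for the real vectors):
x <- seq(0, 10, length.out = 5000)
y <- sin(x) + rnorm(5000, sd = 0.3)
library(ggplot2)
# the question reports that method = "loess" runs out of memory on the
# real-sized data unless se = FALSE:
# ggplot(data.frame(x, y), aes(x, y)) + geom_smooth(method = "loess")
fit  <- smooth.spline(x, y)
pred <- predict(fit, x)  # returns only $x and $y; an se argument is silently ignored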
What can I do? Please help, I have been trying for days now, and even GPT couldn't do it.

How to use escalc function?

I am working on a meta-analysis using the metafor package and want to calculate the effect sizes, but am running into some trouble with the escalc function. I have a file of ~200 rows containing the control/test means, variances, and sample sizes, and for each row I would like to calculate the effect size. I would now like to use escalc to compute the effect size as an SMD.
My current code is as follows:
# escalc function
escalc <- function(measure, ai, bi, ci, di, n1i, n2i, x1i, x2i, t1i, t2i, m1i, m2i, sd1i, sd2i, xi, mi, ri, ti, sdi, r2i, ni, yi, vi, sei,
data, slab, subset, include, add=1/2, to="only0", drop00=FALSE, vtype="LS", var.names=c("yi","vi"), add.measure=FALSE, append=TRUE, replace=TRUE, digits, ...)
# apply data and add effect size col to data frame
data$ES <- escalc(measure = SMD, dat$MRE1, dat$MTE2, dat$VRE1, dat$VTE2, dat$NR1, dat$NR2, data = dat)
When I run this code once there seems to be no problem or error (if I run it more than once it says "Error: C stack usage 15925888 is too close to the limit", which I don't understand), but my data frame does not get a new column with the ES for each study. When I inspect the new variable it shows NULL, so I don't think it actually ran. How can I get a summary of the effect sizes?
I am unsure what I am doing wrong or how to see the effect sizes I've calculated. I've been reading the metafor documentation (https://cran.r-project.org/web/packages/metafor/metafor.pdf) but can't work it out. Do I need to run escalc separately for each paper? Any help is greatly appreciated.
Thank you!
You should use:
dat <- escalc(measure="SMD", m1i=MRE1, m2i=MTE2, sd1i=sqrt(VRE1), sd2i=sqrt(VTE2), n1i=NR1, n2i=NR2, data=dat)
Note that sd1i and sd2i expect standard deviations as input, so if you have variances you need to take their square roots. Also, don't copy the argument list from the help page into a definition of your own (escalc <- function(...)): that masks metafor's escalc with a function whose body ends up calling escalc again, so it recurses into itself, which is most likely what produces the "C stack usage ... is too close to the limit" error. Remove that definition and call the package function directly, as above.
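To see the results, note that escalc() appends the effect sizes and their sampling variances to the data frame as the columns yi and vi (the default var.names), so after the call above you can inspect them with:
head(dat[, c("yi", "vi")])  # effect size and sampling variance for each row
summary(dat$yi)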

How to use fitdist when the parameters are already known (Pareto distribution)

I am fitting a Pareto distribution to some data and have already computed the maximum likelihood estimates. Now I need to create a fitdist object (fitdistrplus package) from those estimates, but I am not sure how to do this. I need a fitdist object because I would like to create QQ, density, etc. plots with functions such as denscomp. Could someone help?
The reason I calculated the MLEs first is that fitdist does not do this properly - the estimates always blow up to infinity, even if I give the correct MLEs as the starting values (see below). If manually handing fitdist my parameters is not possible, is there an optimization method in fitdist that would allow the Pareto parameters to be estimated properly?
I don't have permission to post the original data, but here's a simulation using MLE estimates of a gamma distribution/pareto distribution on the original.
library(fitdistrplus)
library(actuar)
sim <- rgamma(1000, shape = 4.69, rate = 0.482)
fit.pareto <- fitdist(sim, distr = "pareto", method = "mle",
                      start = list(scale = 0.862, shape = 0.00665))
#Estimates blow up to infinity
fit.pareto$estimate
If you look at the ?fitdist help topic, it describes what fitdist objects look like: they are lists with lots of components. If you can compute substitutes for all of those components, you should be able to create a fake fitdist object using code like
fake <- structure(list(estimate = ..., method = ..., ...),
                  class = "fitdist")
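For instance, a rough sketch of such a fake object for the Pareto case might look like the following; which components the plotting functions really need should be checked against ?fitdist, and the shape/scale values are placeholders for your precomputed MLEs:
my_shape <- 0.00665   # placeholder: your precomputed Pareto shape MLE
my_scale <- 0.862     # placeholder: your precomputed Pareto scale MLE
fake <- structure(list(estimate = c(shape = my_shape, scale = my_scale),
                       distname = "pareto",
                       fix.arg = NULL,
                       data = sim,
                       n = length(sim),
                       method = "mle",
                       loglik = sum(dpareto(sim, shape = my_shape, scale = my_scale, log = TRUE)),
                       discrete = FALSE,
                       weights = NULL,
                       dots = NULL),
                  class = "fitdist")
denscomp(fake)  # may be enough for denscomp/qqcomp; add more components if a plot function complains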
For the second part of your question, you'll need to post some code and data for people to improve.
Edited to add:
I added set.seed(123) before your simulation of random data. Then I get the MLE from fitdist to be
scale shape
87220272 9244012
If I plot the log likelihood function near there, I get this:
# plot the log-likelihood surface around the reported estimates
loglik <- Vectorize(function(shape, scale) sum(dpareto(sim, shape, scale, log = TRUE)))
shape <- seq(1000000, 10000000, len = 30)
scale <- seq(10000000, 100000000, len = 30)
surface <- outer(shape, scale, loglik)
contour(shape, scale, surface)
points(9244012, 87220272, pch = 16)  # the estimate returned by fitdist
That looks as though fitdist has made a somewhat reasonable choice, though there may not actually be a finite MLE. How did you find the MLE to be such small values? Are you sure you're using the same parameters as dpareto uses?

How to do a leave-one-out cross validation for CAP/capscale in R vegan?

I would like to perform a "leave-one-out cross validation" (LOO-CV) for a CAP in R. The CAP was calculated using capscale in the R package vegan and is a canonical analysis of principal coordinates, similar to an rda or cca but based on another similarity matrix, in my case Bray-Curtis. I have found that the predict.cca help page also documents a function calibrate.cca, but I cannot make it work.
https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/predict.cca
This is what I have (based on the sample data mite available in vegan)
library(vegan)
data(mite, mite.env)
str(mite.env) #"SubsDens", "WatrCont", "Substrate", "Shrub", "Topo"
miteBC <- vegdist(mite, method="bray") #Bray-Curtis similarity matrix
miteCAP <- capscale(miteBC ~ Substrate + Shrub + Topo, data = mite.env,  # CAP via capscale
                    distance = "bray", metaMDSdist = FALSE)
summary(miteCAP)
anova(miteCAP)
anova(miteCAP, by = "axis")
anova(miteCAP, by = "margin")
calibrate.cca(miteCAP, type = c("response")) #error cannot find function calibrate.cca
In the program Primer this is done automatically within the CAP routine ("Leave-one-out Allocation of Observations to Groups"), where each sample is assigned to a group and a misclassification error is obtained (similar to a classification randomForest, which I have already done), but I would like to use R, and it should be possible with vegan::capscale.
Any help is very much appreciated!
Function vegan::calibrate does not have an argument type and never returns "response". Check its documentation. It does environmental calibration and returns the predicted values of the constraints (Substrate, Shrub, Topo) on the scale of the model matrix, and with factors these hardly make sense directly.
There is no direct option for LOO: you have to do it by hand, cycling through the points and using the complete left-out point as the newdata. However, I'd suggest k-fold cross-validation as a better alternative for estimating predictive power: LOO changes the data too little and gives an over-optimistic view of predictive power.
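For illustration, the basic calibration call on the model fitted in the question looks like this (no type argument); it returns a matrix of predicted constraint values, one column per model-matrix term:
pred <- calibrate(miteCAP)  # environmental calibration of the fitted CAP
head(pred)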

R: Using fitdistrplus to fit curve over histogram of discrete data

So I have this discrete set of data, my_dat, that I am trying to fit a distribution to so that I can generate random variables based on it. I had great success using fitdistrplus on continuous data, but I get many errors when attempting to use it for discrete data.
Set-up:
library(fitdistrplus)
my_dat <- c(2,5,3,3,3,1,1,2,4,6,
3,2,2,8,3,4,3,3,4,4,
2,1,5,3,1,2,2,4,3,4,
2,4,1,6,2,3,2,1,2,4,
5,1,2,3,2)
I take a look at the histogram of the data first:
hist(my_dat)
Since the data are discrete, I decided to try fitting either a binomial or a negative binomial distribution, and this is where I run into trouble. Here I try each:
fitNB3 <- fitdist(my_dat, discrete = T, distr = "nbinom" ) #NaNs Produced
fitB3 <- fitdist(my_dat, discrete = T, distr = "binom")
I receive two errors:
fitNB3 seems to run but warns that "NaNs produced" - can anyone let me know why this is the case?
fitB3 doesn't run at all and gives me the error: "Error in start.arg.default(data10, distr = distname) : Unknown starting values for distribution binom." - can anyone point out why this won't work here? I am unclear about providing starting values given that the data are discrete. (I attempted start = 1 in fitdist but received another error: "Error in fitdist(my_dat, discrete = T, distr = "binom", start = 1) : the function mle failed to estimate the parameters, with the error code 100".)
I've been spinning my wheels on this for a while, so I would appreciate any feedback regarding these errors.
Don't use hist on discrete data, because it doesn't do what you think it's doing.
Compare plot(table(my_dat)) with hist(my_dat)... and then ponder how many wrong impressions you've gotten doing this before. If you must use hist, make sure you specify the breaks, don't rely on defaults designed for continuous variables.
hist(my_dat)
lines(table(my_dat),col=4,lwd=6,lend=1)
Neither of your models can be suitable as both these distributions start from 0, not 1, and with the size of values you have, p(0) will not be ignorably small.
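To put a number on that with the fitted negative binomial from the question (fitdist parametrizes nbinom by size and mu by default):
p0 <- dnbinom(0, size = fitNB3$estimate["size"], mu = fitNB3$estimate["mu"])
p0  # the fitted probability of a zero count, which the data never show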
I don't get any errors fitting the negative binomial when I run your code.
The issue you had with fitting the binomial is you need to supply starting values for the parameters, which are called size (n) and prob (p), so
you'd need to say something like:
fitdist(my_dat, distr = "binom", start=list(size=15, prob=0.2))
However, you will then get a new problem! The optimizer assumes that the parameters are continuous and will fail on size.
On the other hand this is probably a good thing because with unknown n MLE is not well behaved, particularly when p is small.
Typically, with the binomial it would be expected that you know n. In that case, estimation of p could be done as follows:
fitdist(my_dat, distr = "binom", fix.arg=list(size=20), start=list(prob=0.15))
However, with fixed n, maximum likelihood estimation is straightforward in any case -- you don't need an optimizer for that.
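For instance, with the size fixed at the (hypothetical) value used above, the MLE of prob has the closed form mean(x) / n:
n_fixed <- 20                  # the fixed size from the fix.arg example
p_hat <- mean(my_dat) / n_fixed
p_hat                          # closed-form MLE of prob for a known size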
If you really don't know n, there are a number of better-behaved estimators than the MLE to be found, but that's outside the scope of this question.
