The data appear to follow a gamma-like distribution.
Replicating the data would go something like this:
a) First, find the distribution parameters of the true data:
library(fitdistrplus)  # provides fitdist()
fitdist(datag, "gamma", optim.method="Nelder-Mead")
b) Use the fitted shape and rate (or equivalently scale = 1/rate; supply only one of the two) to simulate data:
data <- rgamma(10000, shape=0.6, rate=4.8)  # passing both rate and scale triggers a warning
Finding quantiles of the fitted distribution with the qgamma function in R would just be:
EDIT:
qgamma(seq(1, 0.1, by=-0.1), shape=0.6, rate=4.8)  # note: the p = 1 quantile is Inf
How can I find quantiles for my true data (not data simulated with rgamma)?
Please note that R's quantile function returns the desired quantiles of the true data (datag), but as I understand it these assume the data are normally distributed, and as you can see they clearly are not.
quantile(datag, seq(0,1, by=0.1), type=7)
Which R function should I use, or how else can I obtain the quantiles of this highly skewed data in a statistically sound way?
In addition, would something like this make sense? It still doesn't give me the lower values, though:
Fn <- ecdf(datag)
Fn(seq(0.1,1,by=0.1))
Quantiles are returned by the "q" functions, in this case qgamma. For your data, eyeball integration of the plot suggests that most of the mass is to the left of 0.2, and if we ask for the 0.8 quantile we see that 80% of the estimated distribution lies to the left of:
qgamma(.8, shape=0.6, rate=4.8)
#[1] 0.20604
Seems to agree with what you have plotted. If you wanted the 0.8 quantile in the sample you have, then just:
quantile(datag, 0.8)
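To compare the two approaches side by side, here is a minimal sketch (assuming datag is your original vector and reusing the fitted parameters quoted above); the sample quantiles make no distributional assumption:
probs <- seq(0.1, 0.9, by=0.1)
emp   <- quantile(datag, probs, type=7)        # empirical quantiles, no normality assumed
fit   <- qgamma(probs, shape=0.6, rate=4.8)    # quantiles of the fitted gamma
round(cbind(probs, empirical=emp, gamma_fit=fit), 4)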
Is it possible to generate a beta-binomial distribution from an existing vector, and if so, how?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to understand and implement the instructions on the function help pages, at least not in a way that has served my intended purpose so far.
We can look at the distribution of your variable; the y-axis is the probability:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too few data points to estimate the three parameters needed for a beta-binomial. Hence I fix size and the probability so that the mean matches the mean of your scores, and estimate only theta; looking at the distribution above, that seems OK:
library(bbmle)
library(emdbook)
library(MASS)
## negative log-likelihood of the beta-binomial; size and prob are held fixed
mtmp <- function(prob, size, theta) {
  -sum(dbetabinom(x1, prob, size, theta, log=TRUE))
}
## estimate theta only; size=10 and prob=mean(x1)/10 are supplied as fixed data
m0 <- mle2(mtmp, start=list(theta=100),
           data=list(size=10, prob=mean(x1)/10), control=list(maxit=1000))
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually fine using a normal approximation, in which case the estimates are:
normal_fit$estimate
mean sd
6.560000 1.134196
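If you still want the 95% interval from the fitted beta-binomial itself, one hedged sketch is to simulate from it with emdbook's rbetabinom (reusing x1, THETA and the fixed prob and size from above) and take the empirical quantiles:
sims <- rbetabinom(1e5, prob=mean(x1)/10, size=10, theta=THETA)
quantile(sims/2, c(0.025, 0.975))   # divide by 2 to return to the original 0-5 scale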
I have an arbitrary CDF that is applied to a point estimate. I have a number of these point estimates with associated CDFs, for which I need to simulate random data for a Monte Carlo simulation.
I generate the CDF by doing a spline fit to the arbitrary points provided in a table. For example, the 0.1 quantile is 0.13 * point estimate and the 0.9 quantile is 7.57 * point estimate. It is fairly crude and is based on a large study comparing these models to a real-world system -- please ignore that for now.
I fit the CDF using a spline fit as shown here.
If I take the derivative of this, I get the shape of the pdf (image).
I modified the function "samplepdf" found here, Sampling from an Arbitrary Density, as follows:
## endsign() comes from the linked "Sampling from an Arbitrary Density" post; it
## searches outward for a finite bracket to hand to uniroot.
samplecdf <- function(n, cdf, spdf.lower = -Inf, spdf.upper = Inf) {
  my_fun <- match.fun(cdf)
  invcdf <- function(u) {
    subcdf <- function(t) my_fun(t) - u
    if (spdf.lower == -Inf)
      spdf.lower <- endsign(subcdf, -1)
    if (spdf.upper == Inf)
      spdf.upper <- endsign(subcdf)
    return(uniroot(subcdf, c(spdf.lower, spdf.upper))$root)
  }
  sapply(runif(n), invcdf)
}
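For instance, a minimal sanity-check sketch against a known CDF (an exponential here), passing finite bounds so the endsign helper is never needed:
set.seed(1)
x <- samplecdf(10000, function(t) pexp(t, rate=1), spdf.lower=0, spdf.upper=50)
quantile(x, c(0.1, 0.5, 0.9))   # should be close to qexp(c(0.1, 0.5, 0.9))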
This seems to work OK -- when I compare the quantiles I estimate from the randomly generated data, they are fairly close to the initial values. However, when I look at the histogram, something funny is happening at the tail, where the function consistently generates more values than it should according to the pdf. It does this across all of my point estimates, and even though the individual quantiles look close, I can tell that the overall Monte Carlo simulation gives higher estimates for the 50th percentile than I expect. Here is a plot of the histogram of the random samples.
Any tips or advice would be very welcome. I think the best route would be to fit an exponential distribution to the CDF, but I'm struggling to do that. All the "fitting" tools assume that you have data to fit -- this is more arbitrary than that.
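One hedged sketch of that exponential idea, treating the two tabulated pairs quoted above as the fit targets (pe is a placeholder point estimate, and the least-squares criterion is just one possible choice):
pe   <- 1                           # hypothetical point estimate
p    <- c(0.1, 0.9)                 # tabulated probabilities
q    <- c(0.13, 7.57) * pe          # tabulated quantiles
sse  <- function(rate) sum((pexp(q, rate) - p)^2)
rate <- optimize(sse, c(1e-6, 100))$minimum
sims <- rexp(10000, rate)           # then simulate directly from the fitted exponential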
I have a set of simulated data that are roughly uniformly distributed. I would like to sample a subset of these data and for that subset to have a log-normal distribution with a (log)mean and (log)standard deviation that I specify.
I can figure out some slow brute-force ways to do this, but I feel like there should be a way to do it in a couple of lines using the plnorm function and the sample function with the "prob" argument set. I can't seem to get the behavior I'm looking for, though. My first attempt was something like:
probs <- plnorm(orig_data, meanlog = mu, sdlog = sigma)
new_data <- sample(orig_data, replace = FALSE, prob = probs)
I think I'm misinterpreting the way the plnorm function behaves. Thanks in advance.
If your orig_data are uniformly distributed between 0 and 1, then
new_data = qlnorm(orig_data, meanlog = mu, sdlog = sigma)
will give log-normally distributed data. If your data aren't between 0 and 1 but, say, between a and b, then first rescale:
orig_data = (orig_data-a)/(b-a)
Generally speaking, uniform RVs between 0 and 1 can be seen as probabilities, so if you want to sample from a given distribution with them you have to use the q... function, i.e. take the corresponding quantiles (inverse-transform sampling).
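A minimal sketch of that idea, with placeholder values for mu and sigma:
mu <- 0; sigma <- 1                          # placeholders for your meanlog/sdlog
orig_data <- runif(1000)                     # roughly uniform on (0, 1)
new_data  <- qlnorm(orig_data, meanlog=mu, sdlog=sigma)
c(mean(log(new_data)), sd(log(new_data)))    # should be close to mu and sigma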
Thanks guys for the suggestions. While they get me close, I've decided on a slightly different approach for my particular problem, which I'm posting as the solution in case it's useful to others.
One detail I left out of the original question is that I have a whole data set (stored as a data frame), and I want to resample rows from that set such that one of the variables (columns) is log-normally distributed. Here is the function I wrote to accomplish this, which relies on dlnorm to calculate probabilities and sample to resample the data frame:
resample_lognorm <- function(origdataframe, origvals, meanlog, sdlog, n) {
  ## log-normal (base 10) density at each row's value, used as sampling weights
  prob <- dlnorm(origvals, meanlog=log(10)*meanlog, sdlog=log(10)*sdlog)
  newsamp <- origdataframe[sample(nrow(origdataframe),
                                  size=n, replace=FALSE, prob=prob), ]
  return(newsamp)
}
In this case origdataframe is the full data frame I want to sample from, and origvals is the column of data I want to resample to a log-normal distribution. Note that the log(10) factors in meanlog and sdlog are because I want the distribution to be log-normal in base 10, not natural log.
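A small usage sketch (the data frame, column name, and parameter values are made up for illustration):
toy <- data.frame(v=runif(1e5, 0.01, 100), id=seq_len(1e5))
sub <- resample_lognorm(toy, toy$v, meanlog=0.5, sdlog=0.3, n=2000)
hist(log10(sub$v))   # roughly normal, centred near 0.5 with sd near 0.3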
So I am currently trying to draw the confidence interval for a linear model. I found out that I should use predict.lm() for this, but I have trouble really understanding the function, and I do not like using functions without knowing what is happening. I found several how-tos on the subject, but only with the corresponding R code, no real explanation.
This is the function itself:
## S3 method for class 'lm'
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
interval = c("none", "confidence", "prediction"),
level = 0.95, type = c("response", "terms"),
terms = NULL, na.action = na.pass,
pred.var = res.var/weights, weights = 1, ...)
Now, what I have trouble understanding:
1) newdata
An optional data frame in which to look for variables
with which to predict. If omitted, the fitted values are used.
Everyone seems to use newdata for this, but I cannot quite understand why. For calculating the confidence interval I obviously need the data that the interval is for (like the number of observations, the mean of x, etc.), so that cannot be what is meant by it. But then, what does it mean?
2) interval
Type of interval calculation.
okay.. but what is "none" for?
3a) type
Type of prediction (response or model term).
3b) terms
If type="terms", which terms (default is all terms)
3a: Can I use that to get the confidence interval for one specific variable in my model? And if so, what is 3b for? If I can specify the term in 3a, it wouldn't make sense to do it again in 3b... so I guess I'm wrong again, but I cannot figure out why.
I guess some of you might think: why not just try it out? And I would (even if that might not answer everything here), but right now I don't know how. Since I don't know what newdata is for, I don't know how to use it, and when I try, I don't get the right confidence interval. Somehow it is very important how you choose that data, but I just don't understand why!
EDIT: I want to add that my intention is to understand how predict.lm works. By that I mean I don't know whether it works the way I think it does, i.e. it calculates y-hat (the predicted values) and then adds/subtracts the upr/lwr bounds of the interval for each of them, producing several data points (which then look like a confidence band)? Then I would understand why newdata needs to have the same length as the data used for the linear model.
Make up some data:
d <- data.frame(x=c(1,4,5,7),
y=c(0.8,4.2,4.7,8))
Fit the model:
lm1 <- lm(y~x,data=d)
Confidence and prediction intervals with the original x values:
p_conf1 <- predict(lm1,interval="confidence")
p_pred1 <- predict(lm1,interval="prediction")
Conf. and pred. intervals with new x values (extrapolation and more finely/evenly spaced than original data):
nd <- data.frame(x=seq(0,8,length=51))
p_conf2 <- predict(lm1,interval="confidence",newdata=nd)
p_pred2 <- predict(lm1,interval="prediction",newdata=nd)
Plotting everything together:
par(las=1,bty="l") ## cosmetics
plot(y~x,data=d,ylim=c(-5,12),xlim=c(0,8)) ## data
abline(lm1) ## fit
matlines(d$x,p_conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
matlines(d$x,p_pred1[,c("lwr","upr")],col=2,lty=2,type="b",pch=1)
matlines(nd$x,p_conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")
matlines(nd$x,p_pred2[,c("lwr","upr")],col=4,lty=2,type="b",pch=1)
Using new data allows for extrapolation beyond the original data; also, if the original data are sparsely or unevenly spaced, the prediction intervals (which are not straight lines) may not be well approximated by linear interpolation between the original x values ...
I'm not quite sure what you mean by the "confidence interval for one specific variable in my model"; if you want confidence intervals on a parameter, then you should use confint. If you want predictions for the changes based only on some of the parameters changing (ignoring the uncertainty due to the other parameters), then you do indeed want to use type="terms".
interval="none" (the default) just tells R not to bother computing any confidence or prediction intervals, and to return just the predicted values.
I am a statistics student and an R beginner (understatement of the year) trying to generate multiple confidence intervals for randomly generated samples from a normal distribution as part of an assignment.
I used the function
data <- replicate(25, rnorm(20, 50, 6))
to generate 25 samples of size n=20 from a N(50, 6^2) distribution (in a double matrix).
My question is, how do I find a 95% confidence interval for each sample of this distribution? I know that I can use colMeans(data) and apply(data, 2, sd) to find the sample mean and sample standard deviation of each sample, but I am having a brain fart trying to think of a function that can generate the confidence intervals for all columns of the double matrix (data).
As of now, my (extremely crude) solution consists of creating the functions
left  <- function(x, y) {x - (qnorm(0.975) * y / sqrt(20))}
right <- function(x, y) {x + (qnorm(0.975) * y / sqrt(20))}
left(colMeans(data), apply(data, 2, sd))    # column-wise sds, one per sample
right(colMeans(data), apply(data, 2, sd))
to generate 2 vectors of left and right bounds. Please let me know if there is a better way I can do this.
I suppose you could use the t.test() function. It returns the mean and the 95% confidence interval for a given vector of numbers.
# Create your data
data <- replicate(25, rnorm(20, 50, 6))
data <- as.data.frame(data)
After you make your data, you could apply the t.test() function to all columns using the lapply() function.
# Apply the t.test function and save the results
results <- lapply(data, t.test)
If you only want to see the confidence interval or mean returned, you can call them using the dollar sign operator. For example, for column one of your original data frame, you could type the following:
# Check 95% CI for sample one
results[[1]]$conf.int[1:2]
You could come up with a more elegant way of saving these data to a results data frame. Remember, you can always see what individual bits of information you can pull from an object by using the str() command. For example:
# Example
example <- t.test(data[,1])
str(example)
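For instance, one sketch of collecting all 25 intervals and means into a single data frame from the results list above:
ci <- t(sapply(results, function(r)
  c(lower=r$conf.int[1], mean=unname(r$estimate), upper=r$conf.int[2])))
head(as.data.frame(ci))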
Hope this helps. Try this link for more information: Using R to find Confidence Intervals