Altering distribution of one dataset to match another dataset - r

I have 2 datasets, one of modeled (artificial) data and another with observed data. They have slightly different statistical distributions and I want to force the modeled data to match the observed data distribution in the spread of the data. In other words, I need the modeled data to better represent the tails of the observed data. Here's an example.
model <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)
observed <- c(39.50,44.79,58.28,56.04,53.40,59.25,48.49,54.51,35.38,39.98,28.00,
28.49,27.74,51.92,42.53,44.91,44.91,40.00,41.51,47.92,36.98,53.40,
42.26,42.89,43.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
52.81,36.87,47.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
51.34,43.37,51.15,42.77,42.88,44.26,27.14,39.31,24.80,12.62,30.30,
34.39,25.60,38.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
34.65,39.54,47.70,38.11,43.05,29.95,22.48,24.63,35.33,41.34)
summary(model)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.62 36.98 40.38 40.28 44.91 54.15
summary(observed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.62 35.54 42.58 41.10 47.76 59.25
How can I force the model data to have the variability that the observed has in R?

Are you just modeling the distribution of observed? If so, you could generate a kernel density estimate from the observations and then resample from that modeled density distribution. For example:
library(ggplot2)
First we generate a density estimate from the observed values. This is our model of the distribution of the observed values. adjust is a parameter that determines the bandwidth. The default value is 1. Smaller values result in less smoothing (i.e., a density estimate that more closely follows small-scale structure in the data):
dens.obs = density(observed, adjust=0.8)
Now, resample from the density estimate to get the modeled values. We set prob=dens.obs$y so that the probability of a value in dens.obs$x being chosen is proportional to its modeled density.
set.seed(439)
resample.obs = sample(dens.obs$x, 1000, replace=TRUE, prob=dens.obs$y)
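Note that sample(dens.obs$x, ...) can only return the grid points at which density() evaluated the estimate (512 by default). A hedged alternative, assuming the default Gaussian kernel: drawing from the kernel density estimate is equivalent to resampling the observations with replacement and adding normal noise whose standard deviation is the bandwidth. In the sketch below, obs_demo is a short stand-in for observed (its first eleven values), and dens_demo and smooth.obs are illustrative names.

```r
# Stand-in for `observed`: its first eleven values from the question
obs_demo <- c(39.50, 44.79, 58.28, 56.04, 53.40, 59.25,
              48.49, 54.51, 35.38, 39.98, 28.00)

# Same bandwidth adjustment as above; the default kernel is Gaussian
dens_demo <- density(obs_demo, adjust = 0.8)

set.seed(439)
n_draws <- 1000
# Smoothed bootstrap: pick observations at random, then jitter each one
# with normal noise whose sd equals the bandwidth actually used
smooth.obs <- sample(obs_demo, n_draws, replace = TRUE) +
  rnorm(n_draws, mean = 0, sd = dens_demo$bw)

# The draws are continuous values rather than repeated grid points
length(unique(smooth.obs))
summary(smooth.obs)
```

With the full data, replace obs_demo by observed; the rest stays the same.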
Put observed and modeled values in a data frame in preparation for plotting:
dat = data.frame(value=c(observed,resample.obs),
group=rep(c("Observed","Modeled"), c(length(observed),length(resample.obs))))
The ECDF (empirical cumulative distribution function) plot below shows that sampling from the kernel density estimate gives samples with a distribution similar to the observed data:
ggplot(dat, aes(value, fill=group, colour=group)) +
stat_ecdf(geom="step") +
theme_bw()
You can also plot the density distribution of the observed data and the values sampled from the modeled distribution (using the same value for the adjust parameter as we used above).
ggplot(dat, aes(value, fill=group, colour=group)) +
geom_density(alpha=0.4, adjust=0.8) +
theme_bw()
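If the goal is to keep the modeled values themselves (and their ordering) while giving them the observed spread, rather than drawing fresh values, one option is empirical quantile mapping. This is a different technique from the resampling shown above, offered only as a sketch: each modeled value is replaced by the observed quantile at the same cumulative probability. The vectors model_demo and observed_demo are simulated stand-ins, and p and mapped are illustrative names.

```r
set.seed(1)
# Simulated stand-ins: the "model" is narrower than the "observed"
model_demo    <- rnorm(76, mean = 40, sd = 7)
observed_demo <- rnorm(76, mean = 41, sd = 10)

# Cumulative probability of each modeled value within the modeled sample
# (mid-rank positions, so probabilities stay strictly between 0 and 1)
p <- (rank(model_demo) - 0.5) / length(model_demo)

# Replace each modeled value by the observed quantile at that probability
mapped <- quantile(observed_demo, probs = p, names = FALSE)

# Ordering is preserved; the spread now follows the observed sample
sapply(list(model = model_demo, observed = observed_demo, mapped = mapped), sd)
```

Because only ranks are carried over from the modeled data, any pairing between modeled and observed values (for example in time) is left untouched.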

Have a look at the answer to "How to generate distributions given, mean, SD, skew and kurtosis in R?".
It discusses use of the SuppDists package. This package permits you to create a distribution by creating a set of parameters based on the Johnson system of distributions.

Related

How do I calculate weighted median instead of weighted mean for single-arm meta-analysis?

I am trying to calculate weighted median instead of weighted mean for meta-analysis.
I have been using metamean, but unfortunately there is no equivalent metamedian function for cases where the data are skewed.
I saw this but it is irrelevant in case of meta-analysis.
I appreciate any help or guidance on this.
Here is my prior code.
library(meta)
data(Fleiss1993cont)
# Meta-analysis of weighted mean from each study in a single arm meta-analysis
m1 <- metamean(n.psyc, mean.psyc, sd.psyc, data = Fleiss1993cont, studlab = study); forest(m1)

fitdistrplus vs MASS - difference in standard error outputs of estimates

I am trying to fit a Gamma distribution to data. Since the data vector is huge, I cannot paste it here, but here are some summary statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max.
11.96 170.41 792.28 1983.93 2511.30 42039.76
I tried to fit Gamma distribution using fitdistrplus package -
fitdist(df %>% pull(x)/100, "gamma", start=list(shape = 1, rate = 0.1), lower=0.01)
I get estimates of parameters, but standard errors are NA (as well as correlation matrix between parameters) -
Fitting of the distribution ' gamma ' by maximum likelihood
Parameters :
estimate Std. Error
shape 0.56172991 NA
rate 0.02846644 NA
Loglikelihood: -1582.601 AIC: 3169.202 BIC: 3177.244
Correlation matrix:
[1] NA
However, when I do the same with fitdistr from MASS, it gives out standard errors -
fitdistr(df$x/100, "gamma", start=list(shape = 1, rate = 0.1), lower=0.01)
shape rate
0.561910739 0.028481215
(0.032652615) (0.002494628)
My questions are following -
The estimates are essentially the same, so why does one function report standard errors while the other cannot?
The fitdist function from fitdistrplus also reports a correlation matrix between the parameters. What can I infer from it about the quality of the fit?

calculating p-value manually or using R

Sample size of 40 with observed mean of m1. After statistical study the mean is m2, with standard deviation sd and alpha=0.05. How do I calculate p-value either manually or in R?

obtaining quantiles from complete gaussian fit of data in R

I have been struggling with how R calculates quantiles and the normal fitting of data.
I have data (NDVI values) that follows a truncated normal distribution (see figure)
I am interested in getting the lowest 10th percentile value (p=0.1) from the data and from the fitting normal distribution curve.
In my understanding, because the data is truncated, the two should be quite different: I expect the quantile from the data to be higher than the one calculated from the normal distribution, but this is not so. From what I understand of the help page for quantile, the quantile from the data should be given by the default quantile function:
q=quantile(y, p=0.1)
while the quantile from the normal distribution is :
qx=quantile(y, p=0.1, type=9)
However, the two results are very close in all cases, which makes me wonder what type of distribution R fits to the data to calculate the quantile (a truncated normal distribution?).
I have also tried to calculate the quantile based on the fitting normal curve as:
fitted=fitdist(as.numeric(y), "norm", discrete = T)
fit.q=as.numeric(quantile(fitted, p=0.1)[[1]][1])
but I obtain no difference.
So my questions are:
To what curve does R fit the data for calculating quantiles, in particular for type=9 ? How can I calculate the quantile based on the complete normal distribution (including the lower tail)?
I don't know how to generate a reproducible example for this, but the data is available at https://dl.dropboxusercontent.com/u/26249349/data.csv
Thanks!
R is using the empirical ordering of the data when determining quantiles, rather than assuming any particular distribution.
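A quick way to see this is a deliberately skewed sample: every type of quantile() is computed from the sorted sample values (type = 9 only changes the interpolation rule), whereas a normal-theory quantile has to be computed explicitly with qnorm(). The data below are simulated and purely illustrative, and the variable names are made up for this sketch.

```r
set.seed(1)
# Strongly right-skewed, strictly positive sample (illustrative only)
y <- rexp(1000)

q_default <- quantile(y, probs = 0.1)                 # type = 7, the default
q_type9   <- quantile(y, probs = 0.1, type = 9)       # different interpolation rule
q_normal  <- qnorm(0.1, mean = mean(y), sd = sd(y))   # normal with the same mean and sd

# The two empirical quantiles nearly coincide; the normal-theory value
# is much lower, even though no observation lies below zero
c(default = unname(q_default), type9 = unname(q_type9), normal = q_normal)
```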
The 10th percentile for your truncated data and a normal distribution fit to your data happen to be pretty close, although the 1st percentile is quite a bit different. For example:
# Load the package that provides fitdist(), then the data
library(fitdistrplus)
df = read.csv("data.csv", header=TRUE, stringsAsFactors=FALSE)
# Fit a normal distribution to the data
df.dist = fitdist(df$x, "norm", discrete = TRUE)
Now let's get quantiles of the fitted distribution and the original data. I've included the 1st percentile in addition to the 10th percentile. You can see that the fitted normal distribution's 10th percentile is just a bit lower than that of the data. However, the 1st percentile of the fitted normal distribution is much lower.
quantile(df.dist, p=c(0.01, 0.1))
Estimated quantiles for each specified probability (non-censored data)
p=0.01 p=0.1
estimate 1632.829 2459.039
quantile(df$x, p=c(0.01, 0.1))
1% 10%
2064.79 2469.90
quantile(df$x, p=c(0.01, 0.1), type=9)
1% 10%
2064.177 2469.400
You can also see this by direct ranking of the data and by getting the 1st and 10th percentiles of a normal distribution with mean and sd equal to the fitted values from fitdist:
# 1st and 10th percentiles of data by direct ranking
df$x[order(df$x)][round(c(0.01,0.1)*5780)]
[1] 2064 2469
# 1st and 10th percentiles of fitted distribution
qnorm(c(0.01,0.1), df.dist$estimate[1], df.dist$estimate[2])
[1] 1632.829 2459.039
Let's plot histograms of the original data (blue) and of fake data generated from the fitted normal distribution (red). The area of overlap is purple.
# Histogram of data (blue)
hist(df$x, xlim=c(0,8000), ylim=c(0,1600), col="#0000FF80")
# Overlay histogram of random draws from fitted normal distribution (red)
set.seed(685)
x.fit = rnorm(length(df$x), df.dist$estimate[1], df.dist$estimate[2])
hist(x.fit, add=TRUE, col="#FF000080")
Or we can plot the empirical cumulative distribution function (ecdf) for the data (blue) and the random draws from the fitted normal distribution (red). The horizontal grey line marks the 10th percentile:
plot(ecdf(df$x), xlim=c(0,8000), col="blue")
lines(ecdf(x.fit), col="red")
abline(0.1,0, col="grey40", lwd=2, lty="11")
Now that I've gone through this, I'm wondering if you were expecting fitdist to return the parameters of the normal distribution we would have gotten had your data really come from a normal distribution and not been truncated. Rather, fitdist returns a normal distribution with the mean and sd of the (truncated) data at hand, so the distribution returned by fitdist is shifted to the right compared to where we might have "expected" it to be.
c(mean=mean(df$x), sd=sd(df$x))
mean sd
3472.4708 790.8538
df.dist$estimate
mean sd
3472.4708 790.7853
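If the aim is instead to recover the normal distribution the data would have followed without truncation, one option is to maximise the likelihood of a truncated normal. The following is only a sketch under stated assumptions: the truncation point (cutoff) is known, the data are simulated rather than taken from data.csv, and negll, z and ztrunc are illustrative names.

```r
set.seed(1)
z <- rnorm(6000)          # "complete" normal data (simulated)
cutoff <- -1              # assumed known lower truncation point
ztrunc <- z[z > cutoff]   # what we actually get to observe

# Negative log-likelihood of a normal truncated below at `cutoff`;
# the sd is optimised on the log scale so it stays positive
negll <- function(par, obs, cutoff) {
  mu <- par[1]
  s  <- exp(par[2])
  -sum(dnorm(obs, mu, s, log = TRUE)) +
    length(obs) * pnorm(cutoff, mu, s, lower.tail = FALSE, log.p = TRUE)
}

fit <- optim(c(mean(ztrunc), log(sd(ztrunc))), negll,
             obs = ztrunc, cutoff = cutoff)
est <- c(mean = fit$par[1], sd = exp(fit$par[2]))

# 10th percentile of the reconstructed complete normal vs. the truncated data
q10_full  <- qnorm(0.1, est[["mean"]], est[["sd"]])
q10_trunc <- unname(quantile(ztrunc, 0.1))
round(c(est, q10_full = q10_full, q10_trunc = q10_trunc), 3)
```

The estimates depend on the truncation point being specified correctly, so they should be treated with more caution than an ordinary normal fit.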
Or, another quick example: x is normally distributed with mean ~ 0 and sd ~ 1. xtrunc removes all values less than -1, and xtrunc.dist is the output of fitdist on xtrunc:
set.seed(55)
x = rnorm(6000)
xtrunc = x[x > -1]
xtrunc.dist = fitdist(xtrunc, "norm")
round(cbind(sapply(list(x=x,xtrunc=xtrunc), function(x) c(mean=mean(x),sd=sd(x))),
xtrunc.dist=xtrunc.dist$estimate),3)
x xtrunc xtrunc.dist
mean -0.007 0.275 0.275
sd 1.009 0.806 0.806
And you can see in the ecdf plot below that the truncated data and the normal distribution fitted to the truncated data have about the same 10th percentile, while the 10th percentile of the untruncated data is (as we would expect) shifted to the left.
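The plot itself is not reproduced here. A minimal base-R sketch that regenerates the same comparison is given below. It re-creates the simulated data and replaces fitdist with the closed-form normal maximum-likelihood estimates (sample mean and the n-denominator standard deviation), so it approximates the original figure rather than reproducing the original plotting code. The names x.full, x.trunc, mu.hat, sd.hat and p10 are illustrative.

```r
set.seed(55)
x.full  <- rnorm(6000)
x.trunc <- x.full[x.full > -1]

# Normal maximum-likelihood estimates for the truncated data
mu.hat <- mean(x.trunc)
sd.hat <- sqrt(mean((x.trunc - mu.hat)^2))

# 10th percentiles: untruncated data, truncated data, fitted normal
p10 <- c(untruncated = unname(quantile(x.full, 0.1)),
         truncated   = unname(quantile(x.trunc, 0.1)),
         fitted      = qnorm(0.1, mu.hat, sd.hat))
round(p10, 3)

# ECDFs of both samples plus the CDF of the fitted normal
plot(ecdf(x.full), col = "grey40", main = "Untruncated vs truncated", xlab = "value")
lines(ecdf(x.trunc), col = "blue")
curve(pnorm(x, mu.hat, sd.hat), add = TRUE, col = "red", lwd = 2)
abline(h = 0.1, col = "grey40", lty = 2)   # 10th percentile
```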

What are quantiles in ggplot stat_quantile?

Here is my reproducible data:
library("ggplot2")
library("ggplot2movies")
library("quantreg")
set.seed(2154)
msamp <- movies[sample(nrow(movies), 1000), ]
I am trying to become acquainted with stat_quantile but the example from the documentation raises a couple of questions.
mggp <- ggplot(data=msamp, mapping=aes(x=year, y=rating)) +
geom_point() +
stat_quantile(formula=y~x, quantiles=c(0, 0.25, 0.50, 0.75, 1)) +
theme_classic(base_size = 12) +
ylim(c(0,10))
mggp
To my understanding quantiles split the data into parts that are smaller than the defined cut-off values, correct? If I define quantiles as in the code above I get five lines. Why? What do they represent?
It seems that the quantiles are calculated based on the dependent variable on the y-axis (rating). Is it possible to reverse this? I mean to split the data based on quantiles in 'year'?
This function performs quantile regression, and each line corresponds to one of the requested quantiles.
From Wikipedia:
Quantile regression is a type of regression analysis used in statistics and econometrics. Whereas the method of least squares results in estimates that approximate the conditional mean of the response variable given certain values of the predictor variables, quantile regression aims at estimating either the conditional median or other quantiles of the response variable.
Thus each line in the regression plot is an estimate of the quantile value, e.g. median, 75th and 100th percentile.
You can find a detailed technical discussion in the vignette of the quantreg package.
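To make the interpretation concrete, here is a hedged base-R sketch on simulated data (not the movies sample): the line for quantile tau is chosen by minimising the check (pinball) loss, and roughly a fraction tau of the points ends up below it. stat_quantile delegates the fitting to quantreg::rq by default; the optim call here is only a stand-in to keep the example free of extra packages, and pinball, fit_line, lines_par and frac_below are illustrative names.

```r
set.seed(2154)
n <- 500
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)

# Check (pinball) loss of the line par[1] + par[2] * x at quantile tau
pinball <- function(par, tau) {
  r <- y - (par[1] + par[2] * x)
  mean(ifelse(r >= 0, tau * r, (tau - 1) * r))
}

# Start from the least-squares line, shifted to the tau-quantile of its residuals
ols <- lm(y ~ x)
fit_line <- function(tau) {
  start <- coef(ols) + c(quantile(resid(ols), tau), 0)
  optim(start, pinball, tau = tau)$par
}

taus <- c(0.25, 0.50, 0.75)
lines_par <- sapply(taus, fit_line)   # 2 x 3 matrix: intercept and slope per tau

# Share of points lying below each fitted line; it should be close to tau
frac_below <- sapply(seq_along(taus), function(i)
  mean(y < lines_par[1, i] + lines_par[2, i] * x))
round(rbind(tau = taus, frac_below = frac_below), 2)
```

The split is always on the response (y) conditional on x, which is why the lines in the plot cut through rating rather than year.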
