I have a set of simulated data that are roughly uniformly distributed. I would like to sample a subset of these data and for that subset to have a log-normal distribution with a (log)mean and (log)standard deviation that I specify.
I can figure out some slow brute-force ways to do this, but I feel like there should be a way to do it in a couple of lines using the plnorm function and the sample function with the prob argument set. I can't seem to get the behavior I'm looking for, though. My first attempt was something like:
probs <- plnorm(orig_data, meanlog = mu, sdlog = sigma)
new_data <- sample(orig_data, replace = FALSE, prob = probs)
I think I'm misinterpreting the way the plnorm function behaves. Thanks in advance.
If your orig_data are uniformly distributed between 0 and 1, then
new_data = qlnorm(orig_data, meanlog = mu, sdlog = sigma)
will give log-normally distributed data. If your data aren't between 0 and 1 but between some a and b, then first rescale:
orig_data = (orig_data-a)/(b-a)
Generally speaking, a uniform random variable between 0 and 1 can be seen as a probability, so if you want to sample from a given distribution with it, you have to use the corresponding q... function, i.e. take the corresponding quantile.
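For example, a minimal sketch of that inverse-transform idea (mu and sigma below are just example values):
mu <- 0      # example meanlog
sigma <- 1   # example sdlog
set.seed(1)
u <- runif(1000)                             # uniform values act as probabilities
x <- qlnorm(u, meanlog = mu, sdlog = sigma)  # the quantile function turns them into log-normal draws
hist(log(x))                                 # roughly normal on the log scale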
Thanks guys for the suggestions. While they get me close, I've decided on a slightly different approach for my particular problem, which I'm posting as the solution in case it's useful to others.
One specific I left out of the original question is that I have a whole data set (stored as a data frame), and I want to resample rows from that set such that one of the variables (columns) is log-normally distributed. Here is the function I wrote to accomplish this, which relies on dlnorm to calculate probabilities and sample to resample the data frame:
resample_lognorm <- function(origdataframe, origvals, meanlog, sdlog, n) {
  # weight each row by the log-normal density of its value;
  # the log(10) factors convert base-10 meanlog/sdlog to the natural-log scale
  prob <- dlnorm(origvals, meanlog = log(10)*meanlog, sdlog = log(10)*sdlog)
  # resample n rows without replacement using those weights
  newsamp <- origdataframe[sample(nrow(origdataframe),
                                  size = n, replace = FALSE, prob = prob), ]
  return(newsamp)
}
In this case origdataframe is the full data frame I want to sample from, and origvals is the column of data I want to resample to a log-normal distribution. Note that the log(10) factors in meanlog and sdlog are because I want the distribution to be log-normal in base 10, not natural log.
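For illustration, a call might look like the following (the data frame, column, and parameter values here are all hypothetical):
# hypothetical example: resample 1000 rows so that "mass" is log-normal in base 10
set.seed(1)
df <- data.frame(id = 1:10000, mass = runif(10000, 0.1, 100))
sub <- resample_lognorm(df, df$mass, meanlog = 1, sdlog = 0.3, n = 1000)
hist(log10(sub$mass))  # roughly normal, centered near 1 (i.e. mass near 10)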
I have data with 10000 instances, which resemble a negative binomial distribution. I am sampling out of this data, but I need a subsample which is normally distributed and has a pre-specified mean. How can I achieve this?
library(MASS)
my_trees <- rnegbin(10000, mu = 15, theta = 3)
hist(my_trees)
mean(my_trees)
my_sample <- sample(my_trees, size = 500)
hist(my_sample)
mean(my_sample)
How can I sample data which will be normally distributed with a mean of, e.g., 25? I am aware of the prob argument, and I have also read this related question, but I still cannot get what I want.
The normal distribution has two parameters: a location parameter (the mean) and a scale parameter (the standard deviation).
You mentioned only the mean (25), so you can generate n random values from that normal distribution with rnorm(n, mean = 25). For a different sd (scale parameter), use rnorm(n, mean, sd). In the same way you generated my_trees with rnegbin, rnorm does the same job.
https://www.stat.umn.edu/geyer/old/5101/rlook.html provides information about several other distributions.
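For example, a minimal sketch (the sd of 3 is an arbitrary choice here, since only the mean was specified):
my_sample <- rnorm(500, mean = 25, sd = 3)  # 500 normally distributed values centered at 25
hist(my_sample)
mean(my_sample)  # should be close to 25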
I have a txt file with numbers that looks like this (but with 100 numbers):
[1] 7.1652348 5.6665965 4.4757553 4.8497086 15.2276296 -0.5730937
[7] 4.9798067 2.7396933 5.1468304 10.1221489 9.0165661 65.7118194
[13] 5.5205704 6.3067488 8.6777177 5.2528503 3.5039562 4.2477401
[19] 11.4137624 -48.1722034 -0.3764006 5.7647536 -27.3533138 4.0968204
I need to estimate the MLE of the theta parameter from this distribution:
[image: the density f(x | theta) from the original post]
I need to estimate theta from a sample of 1000 observations drawn with replacement, save the sample, and plot a histogram.
How can I estimate theta from my sample? I have no information about a normal distribution.
I wrote something like this -
data <- read.table(file.choose(), header = TRUE, sep = "")
data <- data[[1]]            # sample() needs a numeric vector, not a data frame
B <- 1000
sample.means <- numeric(B)   # pre-allocate one slot per bootstrap replicate
sample.sd <- numeric(B)
for (i in 1:B) {
  MySample <- sample(data, length(data), replace = TRUE)
  sample.means[i] <- mean(MySample)
  sample.sd[i] <- sd(MySample)
}
sd(sample.sd)
but it doesn't work..
This question incorporates multiple different ones, so let's tackle them step by step.
First, you will need to draw a random sample from your population (with replacement). Assuming your 100 population observations sit in a vector named pop,
rs <- sample(pop, 1000, replace = TRUE)
gives you your vector of random samples. If you want to save it, you can write it to disk in multiple formats, so I'll just point to a related question (How to Export/Import Vectors in R?).
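For instance, one simple option (the file name is arbitrary here) is:
saveRDS(rs, "random_sample.rds")    # write the vector to disk
rs <- readRDS("random_sample.rds")  # read it back later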
In a second step, you can use the mle() function from the stats4 package (https://stat.ethz.ch/R-manual/R-devel/library/stats4/html/mle.html) and specify the objective function explicitly.
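Since the actual density f(x|theta) is only given as an image in the question, here is just a sketch of how mle() could be wired up, with a placeholder normal log-likelihood standing in for f; you would replace negloglik() with the negative log-likelihood derived from your f(x|theta):
library(stats4)
# placeholder: normal density with sd = 1 instead of the actual f(x|theta)
negloglik <- function(theta) {
  -sum(dnorm(rs, mean = theta, sd = 1, log = TRUE))
}
fit <- mle(negloglik, start = list(theta = mean(rs)))
summary(fit)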
However, the second part of your question is more of a statistical/conceptual question than an R-related one, IMO.
Try to understand what MLE actually does. You do not need normally distributed variables. The idea behind MLE is to choose theta in such a way that, under the resulting distribution, the random sample is the most probable. Check https://en.wikipedia.org/wiki/Maximum_likelihood_estimation for more details, or some YouTube videos if you'd like a more intuitive approach.
I assume that, in the description of your task, it is stated that f(x|theta) is the conditional joint density function and that the observations x are i.i.d.?
What you want to do in this case is to select theta such that the squared difference between the observations x and the parameter theta is minimized.
For your statistical understanding: in such cases it makes sense to work with the log of the likelihood instead of dealing with the non-linear product directly.
Minimizing the squared difference is equivalent to maximizing the log-transformed likelihood, since the sum enters with a negative sign (<=> the product was in the denominator) and the log, as well as the +1, are monotone transformations.
This leaves you with the maximization problem: choose theta to maximize -sum_i (x_i - theta)^2.
And the first-order condition: sum_i (x_i - theta) = 0, which gives theta_hat = mean(x_i).
Obviously, you would also have to check that you are actually dealing with a maximum via the second-order condition but I'll omit that at this stage for simplicity.
The algorithm in R does nothing other than solve this maximization problem.
Hope this helps for your understanding. Maybe some smarter people can give some additional input.
Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0 to 5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution (discrete values with a restricted range).
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to understand and implement the instructions on the function help pages, at least not in a way that has been helpful for my intended purpose.
We can look at the distribution of your variable; the y-axis is the probability:
x1 = set1$numbers*2   # double the scores (0-5 in 0.5 steps) so they become integers on 0-10
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too few data points to estimate the 3 parameters needed for a beta-binomial. Hence I fix the probability so that the mean is the mean of your scores, and looking at the distribution above that seems OK:
library(bbmle)
library(emdbook)
library(MASS)
# negative log-likelihood of a beta-binomial, with prob and size held fixed
mtmp <- function(prob,size,theta) {
  -sum(dbetabinom(x1,prob,size,theta,log=TRUE))
}
# estimate theta only; prob is fixed so that the mean matches the data
m0 <- mle2(mtmp,start=list(theta=100),
           data=list(size=10,prob=mean(x1)/10),control=list(maxit=1000))
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually OK with using a normal estimation, and in this case it will be:
normal_fit$estimate
mean sd
6.560000 1.134196
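If you go with that normal approximation, a rough 95% interval for the scores could be sketched like this (dividing by 2 to return to the original 0-5 scale, since x1 = set1$numbers*2):
ci_x1 = qnorm(c(0.025, 0.975), mean=MEAN, sd=SD)  # interval on the doubled 0-10 scale
ci_x1/2                                           # back on the original 0-5 scale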
I would like to resample data with a weighted bootstrap for constructing a random forest.
The situation is as follows.
I have data which consist of normal subjects (N=20,000) and patients (N=500).
I made a new data set with normal subjects (N=2,000) and patients (N=500) because I conducted a certain experiment with 2,500 subjects.
As you can see, only 1/10 of the original normal subjects were extracted, while all of the patients were kept.
Therefore, I should give a weight to the normal subjects to perform the machine learning algorithm.
Please let me know how I can bootstrap with weights in R.
Thank you.
It sounds like you really need stratified resampling rather than weighted resampling.
Your data are structured into two different groups of different sizes, and you would like to preserve that structure in your bootstrap. You didn't say what function you were applying to these data, so let's use something simple like the mean.
Generate some fake data, and take the (observed) means:
controls <- rnorm(2000, mean = 10)
patients <- rnorm(500, mean = 9.7)
mean(controls)
mean(patients)
Tell R we want to perform 200 bootstraps, and set up two empty vectors to store means for each bootstrap sample:
nbootraps <- 200
boot_controls <- numeric(nbootraps)
boot_patients <- numeric(nbootraps)
Using a loop we can draw resamples of the same size as you have in the original sample, and calculate the means for each.
for(i in 1:nbootraps){
# draw bootstrap sample
new_controls <- controls[sample(1:2000, replace = TRUE)]
new_patients <- patients[sample(1:500, replace = TRUE)]
# send the mean of each bootstrap sample to boot_ vectors
boot_controls[i] <- mean(new_controls)
boot_patients[i] <- mean(new_patients)
}
Finally, plot the bootstrap distributions for group means:
p1 <- hist(boot_controls)
p2 <- hist(boot_patients)
plot(p1, col=rgb(0,0,1,1/4), xlim = c(9.5,10.5), main="")
plot(p2, col=rgb(1,0,0,1/4), add=T)
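If you also want interval estimates for each group, simple percentile intervals can be read off these bootstrap distributions, for example:
quantile(boot_controls, c(0.025, 0.975))  # 95% percentile interval for the control mean
quantile(boot_patients, c(0.025, 0.975))  # and for the patient mean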
The data are roughly gamma-distributed.
To replicate the data would be something like this:
a) First, find the distribution parameters of the true data:
library(fitdistrplus)  # provides fitdist()
fitdist(datag, "gamma", optim.method="Nelder-Mead")
b) Use the estimated shape and rate parameters to simulate data:
data <- rgamma(10000, shape=0.6, rate=4.8)  # scale = 1/rate, so specifying both is redundant
To find quantiles using the qgamma function in R would be just:
EDIT:
qgamma(c(seq(1,0.1,by=-0.1)), shape=0.6, rate=4.8)
How can I find quantiles for my true data (not simulated with rgamma)?
Please note that R's quantile function returns the desired quantiles of the true data (datag), but these are, as I understand it, computed assuming the data are normally distributed. As you can see, they are clearly not.
quantile(datag, seq(0,1, by=0.1), type=7)
What function in R should I use, or how else can I statistically obtain the quantiles for this highly skewed data?
In addition, would the following make sense somewhat? I am still not getting the lower values!
Fn <- ecdf(datag)
Fn(seq(0.1,1,by=0.1))
Quantiles are returned by the "q" functions, in this case qgamma. For your data, eyeball integration suggests that most of the data is to the left of 0.2, and if we ask for the 0.8 quantile we see that 80% of the data in the estimated distribution is to the left of:
qgamma(.8, shape=0.6, rate=4.8)
#[1] 0.20604
Seems to agree with what you have plotted. If you wanted the 0.8 quantile in the sample you have, then just:
quantile(datag, 0.8)
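If instead you want quantiles from a gamma distribution fitted to your true data (rather than from the fixed shape and rate used above), a sketch, assuming datag is your observed vector, would be:
library(fitdistrplus)
fit <- fitdist(datag, "gamma", optim.method="Nelder-Mead")
qgamma(seq(0.1, 0.9, by=0.1), shape=fit$estimate["shape"], rate=fit$estimate["rate"])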