R - Inverse cumulative distribution method with a given function

I have a given function (let's call it f(x)) and I used the Monte Carlo method to normalize it. From that I calculated the probability density function, and by integrating it I obtained the cumulative distribution function.
f = function(x) ... # the given function (body omitted in the question)
plot(f, xlim = c(0, 5), ylim = c(0, 1), main = "f(x)")

# Monte Carlo estimate of the integral of f over an interval
mc.integral = function(f, n.iter = 1000, interval){
  x = runif(n.iter, interval[1], interval[2])
  y = f(x)
  mean(y)*(interval[2] - interval[1])
}
MC = mc.integral(f, interval = c(0, 8))
print(MC)

# normalize f into a probability density function
densityFunction <- function(x){
  return (f(x)/MC)
}

# integrate the density to get the cumulative distribution function
distributionFunction <- function(x){
  return (integrate(densityFunction, 0, x)$value)
}
vd <- Vectorize(distributionFunction)
plot(vd, xlim = c(0, 8), ylim = c(0, 1), ylab = "y", main = "F(x)")
Now my next task is to use the inverse transform method (inverse cumulative distribution method) to generate samples and test them with the Kolmogorov-Smirnov test, but I don't know how to do this in R.
Can you please give me some help?

Well, this thread shows us how to generate a sample using the inverse transform method. One caveat: vd is the CDF itself, so we first need its (numerical) inverse; applying vd directly to uniform draws would not give samples from the target distribution. Assuming vd is strictly increasing on (0, 8), a grid-based inverse does the job:
grid <- seq(0, 8, by = 0.01)
inv.vd <- approxfun(vd(grid), grid, rule = 2)  # numerical inverse of the CDF
sample <- inv.vd(runif(1000))
head(sample)
Therefore, generating 10 different random samples could be done with:
sample <- list()
for(i in 1:10){
  set.seed(i)
  sample[[i]] <- inv.vd(runif(1000))
}
Afterwards, loop ks.test over the list:
lapply(sample, function(s) ks.test(s, vd))
will give you the output of a Kolmogorov-Smirnov test of each sample against the target distribution vd (testing against pnorm would be a test of normality, which is not the comparison we want here). Choose the size of your samples wisely, as goodness-of-fit tests are prone to come out significant for large samples even when the differences are small (reference here).
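To illustrate that caveat with a quick sketch of my own (a small mean shift, not from the thread): with n = 100 the shift usually goes undetected, while with n = 100000 the same shift is rejected decisively.
# Sketch: the same tiny deviation from N(0,1) at two sample sizes (illustration only)
set.seed(1)
ks.test(rnorm(100, mean = 0.05), pnorm)$p.value   # usually > 0.05
ks.test(rnorm(1e5, mean = 0.05), pnorm)$p.value   # usually far below 0.05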

Related

R: How can I create expected values of a variable assuming Poisson distribution?

Working in R, I have a dataset with values which I would like to compare to the expected values if it followed a Poisson distribution. Is there a way to do this?
Example:
n <- c(1,2,3,4,5,6,7,8,9,10)
people <- c(850, 200, 100, 60, 40, 20, 25, 10, 7, 60)
df <- data.frame(n, people)
In my case, n is the number of times an event happened and people is the number of people (10 means attending 10 or more events). I would like to see how many people would be in each category assuming a Poisson distribution.
I am completely clueless how to approach this.
From your data and the context of your question, it appears that you are actually observing data from a censored and truncated Poisson distribution: you don't observe people who attend zero events, and people with ten or more events are lumped into one category. Assuming this distributional form with a maximum observation of $\dot{x}=10$ gives you the following probability mass function for an individual value:
$$p_X(x) =
\begin{cases}
\frac{e^{-\lambda}}{1-e^{-\lambda}} \cdot \frac{\lambda^x}{x!}
& \text{for } x = 1, 2, 3, \ldots, \dot{x}-1, \\[6pt]
1 - \frac{e^{-\lambda}}{1-e^{-\lambda}} \cdot \sum_{i=1}^{\dot{x}-1} \frac{\lambda^i}{i!}
& \text{for } x = \dot{x}. \\[6pt]
\end{cases}$$
Suppose you observe $n$ IID data points $x_1,...,x_n$ from this distribution. To facilitate analysis, let $\dot{n} = \sum_{i=1}^n \mathbb{I}(x_i=\dot{x})$ be the number of censored points and $\bar{x}_n = \sum_{i=1}^n x_i \mathbb{I}(x_i<\dot{x})/(n-\dot{n})$ be the sample mean of the non-censored points. The log-likelihood function for this data is:
$$\begin{align}
\ell_\mathbf{x}(\lambda)
= \text{const}
&+ (n-\dot{n}) \bigg[ \bar{x}_n \log (\lambda) - \lambda - \log (1-e^{-\lambda}) \bigg] \\[6pt]
&+ \dot{n} \log \bigg( 1 - \frac{e^{-\lambda}}{1-e^{-\lambda}} \cdot \sum_{i=1}^{\dot{x}-1} \frac{\lambda^i}{i!} \bigg).
\end{align}$$
This function can be maximised numerically to get the maximum likelihood estimator (MLE). The statistic $(n, \dot{n}, \bar{x}_n)$ is a sufficient statistic in this distribution, so we can create a function to find the MLE that takes either the full dataset or this summary of the data. It is useful to create a function to compute the MLE of the rate parameter for IID data from a censored Poisson distribution. Here we give a relatively simple function for this task, with the optimisation performed on the parameter $p = \log(\lambda)$ for purposes of numerical stability.
dpois.ct <- function(x, xmax, lambda, log = FALSE) {
  #Check input
  if (!is.numeric(x)) stop('Input x should be a numeric vector')
  #Compute log-probabilities (values at xmax get the censored tail probability)
  LOGPROBS <- rep(-Inf, length(x))
  for (i in 1:length(x)) {
    if (x[i] %in% 1:xmax) {
      LOGPROBS[i] <- dpois(x[i], lambda, log = TRUE) }
    if (x[i] == xmax) {
      LOGPROBS[i] <- ppois(xmax-1, lambda, lower.tail = FALSE, log = TRUE) } }
  #Truncate at zero (log1mexp from the VGAM package computes log(1-exp(-lambda)))
  LOGPROBS <- LOGPROBS - VGAM::log1mexp(lambda)
  #Return output
  if (log) { LOGPROBS } else { exp(LOGPROBS) } }
MLE.pois.ct <- function(x, xmax, ...) {
  #Set objective function (negative log-likelihood over p = log(lambda)) and compute MLE
  NEGLOGLIKE <- function(p) {
    LL <- dpois.ct(x, xmax, lambda = exp(p), log = TRUE)
    -sum(LL) }
  MLE <- exp(nlm(NEGLOGLIKE, p = log(mean(x)), ...)$estimate)
  names(MLE) <- 'MLE.rate'
  #Give output
  MLE }
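As a quick sanity check of dpois.ct (my addition, not part of the original answer), the probabilities should sum to one over the support $x = 1, \ldots, \dot{x}$:
#Sanity check (illustrative): the pmf should sum to one over x = 1,...,xmax
sum(dpois.ct(1:10, xmax = 10, lambda = 2))   #should return 1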
We can implement this for your data to get the MLE and produce a corresponding barplot of the estimated distribution. We first generate your data and use the MLE.pois.ct function to compute the MLE. From the output below we see that a reasonable estimate of the rate parameter in your problem is $\hat{\lambda} = 1.876321$. The barplot shows the estimated probabilities under the model (the blue bars) against the actual relative frequencies in your data (the black dots). As you can see from the barplot, your data do not appear to follow a censored and truncated version of the Poisson distribution, so your model assumption seems unreasonable here.
#Generate the data vector
x <- rep(0, sum(people))
i <- 1
p <- 0
while (i <= length(people)) {
  x[(p+1):(p+people[i])] <- n[i]
  p <- p + people[i]
  i <- i + 1 }

#Compute the MLE of the rate parameter
MLE.rate <- MLE.pois.ct(x, xmax = 10)
MLE.rate

MLE.rate
1.876321

#Compute estimated probabilities in the censored Poisson distribution
PROBS <- dpois.ct(n, xmax = 10, lambda = MLE.rate)
names(PROBS) <- n

#Barplot of the estimated distribution
BARPLOT <- barplot(PROBS, col = 'blue', ylim = c(0,1),
                   main = 'Estimated censored-truncated Poisson distribution',
                   xlab = 'Number of Events', ylab = 'Estimated Probability')
points(x = BARPLOT, y = people/sum(people), pch = 16, cex = 1.2)
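If you nevertheless want the expected counts the question asks for under this fitted model, you can scale the estimated probabilities by the total number of people (a small sketch of mine using the objects above):
#Expected number of people per category under the fitted model (sketch)
expected <- sum(people) * PROBS
round(expected, 1)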

Why is predict.lm returning NA as prediction in R?

I am trying to replicate a simulation from an article for a class project. I have created a simulated matrix x and a simulated vector y. I want to first reduce the model size from p = 1000 to p = n/log(n) using SIS, and then further reduce the variable space to p' using the primal-dual Dantzig selector.
There is no direct predict function in this package, so I am plugging the p' selected predictors and their coefficients into an lm fit and using predict.lm to get the predicted values. However, the returned vector is all NA.
The following code illustrates the problem; it runs without errors. I have to follow this procedure (SIS, then the Dantzig selector) to match the simulation in the article.
library(SIS) #import the SIS library
library(fastclime)
errors_DS_small = c() #create empty array to store errors from each simulation
model_sizes_DS_small = c() #create empty array to store selected model size from each simulation
start_time = Sys.time()
set.seed(123*1) #re-create results at a later time but different seed for each data set
n = 200 #sample size
p = 1000 #variables
x = matrix(rnorm(n*p, mean=0, sd=1), n, p) #creating IID standard Gaussian random predictors
# gaussian response
set.seed(456*1) #re-create results at a later time but different seed for each data set
s = 8
u = rbinom(s, 1, 0.4)
z = rnorm(s, mean=0, sd=1)
a = 4*log(n)/sqrt(n)
b= ((-1)**(u))*(a + abs(z))
y=x[, 1:s]%*%b
#creating SIS-DS model and gaussian response. iter=FALSE means not doing ISIS
modelSIS_small = SIS(x, y, family='gaussian', iter = FALSE, nsis=(n/log(n)))
modelDS = dantzig(x[,modelSIS_small$sis.ix0], y)
selectDS = dantzig.selector(modelDS$lambdalist, modelDS$BETA0, lambda=min(modelDS$lambdalist))
linearMod = lm(y~x[,modelSIS_small$sis.ix0])
linearMod$coefficients = selectDS
newx = x[,modelSIS_small$sis.ix0]
predTest = predict(linearMod, data=newx) #create predictions using test data
a = modelDS$BETA0[, 9] != 0
mse = mean((y - predTest)^2) #compare predictions to real values of test y
rmse = sqrt(mse) #Square root of MSE (Square-loss function)
errors_DS_small[1] = rmse #store the RMSE
model_sizes_DS_small[1] = table(a)["TRUE"] #Store model size including intercept
print(c(1, errors_DS_small[1], model_sizes_DS_small[1])) #print results
end_time = Sys.time()
DS_total_time_small = end_time - start_time
DS_error_small_median = median(errors_DS_small)
DS_model_Size_small_median = median(model_sizes_DS_small)
print(c("DS", DS_model_Size_small_median, DS_error_small_median, DS_total_time_small))

How to run a Monte Carlo simulation from a custom distribution in R

I would like to draw 1000 samples from a custom distribution in R.
I have the following custom distribution:
library(gamlss)
mu <- 1
sigma <- 2
tau <- 3
kappa <- 3
rate <- 1
Rmax <- 20
x <- seq(1, 2e1, 0.01)
points <- Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) * pgamma(x, shape = kappa, rate = rate)
plot(points ~ x)
How can I randomly sample via Monte Carlo simulation from this distribution?
My first attempt was the following code which produced a histogram shape I did not expect.
hist(sample(points, 1000), breaks = 51)
This is not what I was looking for as it does not follow the same distribution as the pdf.
If you want a Monte Carlo simulation, you'll need to sample from the distribution a large number of times, not take a large sample one time.
Your object, points, holds values that increase with the index up to a peak around index 400, level off, and then decrease. That is what plot(points ~ x) shows. It may describe a distribution, but the actual distribution of the values stored in points is something different: a histogram of points shows how often its values fall within each range. You'll notice the x axis of your histogram is similar to the y axis of the plot(points ~ x) plot. The actual distribution of values in points is easy enough to see, and it is similar to what you're seeing when sampling 1000 values at random, without replacement, from an object with about 1900 values in it. Here's the distribution of values in points (no simulation required):
hist(points, 100)
I used 100 breaks on purpose so you could see some of the fine details.
Notice the little bump in the tail at the top end, which you may not expect if you want the histogram to look like the plot of the values against the index (or some increasing x). It means there are more values in points near 2 than near 1. Look at how the curve of plot(points ~ x) flattens where the value is around 2, and how steep it is between 0.5 and 1.5. Notice also the large hump at the low end of the histogram, and look at the plot(points ~ x) curve again: most of the values (at either end of that curve) are close to 0, or at least less than 0.25. If you study those details, you may be able to convince yourself that the histogram is, in fact, exactly what you should expect :)
If you want a Monte Carlo simulation of a sample from this object, you might try something like:
samples <- replicate(1000, sample(points, 100, replace = TRUE))
If you want to generate data using points as a probability density function, that question has been asked and answered here
Let's define your (not normalized) probability density function as a function:
library(gamlss)
fun <- function(x, mu = 1, sigma = 2, tau = 3, kappa = 3, rate = 1, Rmax = 20)
  Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) *
    pgamma(x, shape = kappa, rate = rate)
Now one approach is to use some MCMC (Markov chain Monte Carlo) method. For instance,
simMCMC <- function(N, init, fun, ...) {
  out <- numeric(N)
  out[1] <- init
  for(i in 2:N) {
    pr <- out[i - 1] + rnorm(1, ...)   # random-walk proposal
    r <- fun(pr) / fun(out[i - 1])     # Metropolis acceptance ratio
    out[i] <- ifelse(runif(1) < r, pr, out[i - 1])
  }
  out
}
It starts from point init and gives N draws. The approach can be improved in many ways, but I'm simply going to start from init = 5, include a burn-in period of 20000 draws, and select every second draw to reduce the number of repeated values:
d <- tail(simMCMC(20000 + 2000, init = 5, fun = fun), 2000)[c(TRUE, FALSE)]
plot(density(d))
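One cheap improvement check (a sketch of mine, using the same simMCMC and fun as above): look at how often the chain actually moves, and tune the proposal standard deviation accordingly.
# Sketch: rough acceptance rate of the chain; tune the proposal sd with this
chain <- simMCMC(22000, init = 5, fun = fun, sd = 1)
mean(diff(chain) != 0)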
You invert the ECDF of the distribution:
ecd.points <- ecdf(points)
invecdfpts <- with(environment(ecd.points), approxfun(y, x))
samp.inv.ecd <- function(n = 100) invecdfpts(runif(n))
plot(density(samp.inv.ecd(100)))
plot(density(points))
png(); layout(matrix(1:2, 1))
plot(density(samp.inv.ecd(100)), main = "The Sample")
plot(density(points), main = "The Original"); dev.off()
Here's another way to do it that draws from R: Generate data from a probability density distribution and How to create a distribution function in R?:
x <- seq(1, 2e1, 0.01)
points <- 20*dexGAUS(x, mu=1, sigma=2, nu=3)*pgamma(x, shape=3, rate=1)
f <- function(x) (20*dexGAUS(x, mu=1, sigma=2, nu=3)*pgamma(x, shape=3, rate=1))
C <- integrate(f, -Inf, Inf)
> C$value
[1] 11.50361
# normalize by C$value
f <- function(x)
  (20*dexGAUS(x, mu=1, sigma=2, nu=3)*pgamma(x, shape=3, rate=1)/11.50361)
pdf <- data.frame(x = x, y = f(x))  # tabulate the normalized density on the grid
random.points <- approx(cumsum(pdf$y)/sum(pdf$y), pdf$x, runif(10000))$y
hist(random.points, 1000)
hist(random.points*40, 1000) will recover the scaling of your original function.

Sample from a custom likelihood function

I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom = function(x, sd, L, R){
  R. = (log(R) - log(x))/sd
  L. = (log(L) - log(x))/sd
  ll = pnorm(R.) - pnorm(L.)
  return(ll)
}
df = data.frame(Range = seq(100, 500), sd = rep(0.1, 401), L = 200, U = 400)
df = mutate(df, Likelihood = dcustom(Range, sd, L, U))
with(df, plot(Range, Likelihood, type = 'l'))
abline(v = 200)
abline(v = 400)
In this function, sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all three of them are given. The function provides a likelihood near 1 if the model estimate x (a derived parameter) lies between L and R, a smooth decrease (between 0 and 1) near the bounds (whose sharpness depends on sd), and 0 if x falls well outside.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw random values of x from the function above. If I did this many times, the histogram of the draws would follow the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not sure at all!) the steps are:
1. transform the likelihood function into a probability distribution
2. calculate the cumulative distribution function
3. use inverse transform sampling
Is this correct, and if so, how do I code it? Thank you.
One idea might be to use the Metropolis-Hastings algorithm to obtain a sample from the distribution, given all the other parameters and your likelihood.
# Metropolis-Hastings algorithm
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300 # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample){
  lik <- dcustom(x = x, sd = 0.1, L = 200, R = 400)
  # propose a value for x (you can adjust the step size with the sd)
  x.proposed <- x + rnorm(1, 0, sd = 20)
  lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
  r <- lik.proposed/lik # this is the acceptance ratio
  # accept the new value with probability of the ratio
  if (runif(1) < r) {
    x <- x.proposed
  }
  # record the current state every iteration (also when the proposal is rejected),
  # otherwise the rejected steps would be dropped and the sample would be biased
  posterior_sample[i] <- x
}
# plotting the density
approximate_distr <- posterior_sample # no NA values now that every iteration is recorded
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
#[1] 281.7310 371.2317 378.0504 342.5199 412.3302
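Alternatively, the three steps outlined in the question can be coded directly on a grid (my sketch, assuming the interesting range is roughly 100 to 500, as in the plot above):
# Grid-based inverse transform sampling (sketch, not from the original answer)
grid <- seq(100, 500, by = 0.1)
dens <- dcustom(grid, sd = 0.1, L = 200, R = 400)
cdf <- cumsum(dens)/sum(dens)                  # normalize, then accumulate
keep <- c(TRUE, diff(cdf) > 0)                 # drop flat spots so the inverse is well defined
inv.cdf <- approxfun(cdf[keep], grid[keep], rule = 2)
draws <- inv.cdf(runif(10000))
hist(draws, breaks = 100)                      # should follow the plotted curve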

Generating correlated ordinal data

I'm using the GenOrd package to generate correlated ordinal data. The basic idea is to get correlated ordinal data with correlation 0.5. Now I want to repeat the whole procedure 1000 times and save the resulting correlations, to see how close I can get to 0.5, and then change the sample size and the marginal probabilities and see what changes.
library(GenOrd)
R <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
Marginal <- list(c(0.2, 0.5, 0.7, 0.9), c(0.1, 0.3, 0.4, 0.5))
DataOrd <- ordsample(100, Marginal, R)
correlation <- cor(DataOrd)
correlation[1,2] # 0.5269
Here is a simple solution:
sim.cor <- function(R, Marginal, n, K)
{
  res <- numeric(length = K)
  for(i in 1:K)
    res[i] <- cor(ordsample(n, Marginal, R))[1,2]
  res
}
where n is the sample size and K is the number of times you want to repeat. So, in your example, you can call this function and save the result (a vector of size K with the correlations) in an object:
set.seed(1234)
correlations <- sim.cor(R = R, Marginal = Marginal, n = 100, K = 1000)
mean(correlations)
[1] 0.5009389
A faster and more elegant solution is to use the replicate function as suggested by jaysunice3401:
set.seed(1234)
n <- 100
correlations <- replicate(n = 1000, expr = cor(ordsample(n, Marginal, R))[1,2])
mean(correlations)
[1] 0.5009389
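To then study the effect of sample size, as the question mentions, you can wrap either version in a loop over n (a small sketch of mine using sim.cor from above):
# Sketch: average simulated correlation for several sample sizes
sizes <- c(50, 100, 500, 1000)
avg.cor <- sapply(sizes, function(n) mean(sim.cor(R, Marginal, n, K = 1000)))
names(avg.cor) <- sizes
avg.cor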
I hope this can help!
