Importance sampling in R

I'm a beginner to statistics and currently learning Importance Sampling. I have searched through similar problems here but still can't get mine solved.
If I need to evaluate E(x) of a target distribution
f(x)=2 * x * exp(-x^2), x>0
By using Importance Sampling, I take a proposal distribution
g(x)=exp(-x)
Then
E(x)=integral(x* (f(x)/g(x)) * g(x) dx)
=integral(exp(-x) * 4 * x^2 dx)
My R code was like this
x=rexp(1000)
w=4*x^2
y=exp(-w)
mean(y)
Am I doing it right?
Thanks a lot for your help!

I think you might want to do something like this:
x <- rexp(n = 1000, rate = 1)   ## draws from the proposal g
fx <- function(x) {
  return(2 * x * exp(-x^2))     ## target density f(x) = 2 x exp(-x^2)
}
gx <- function(x) {
  return(exp(-x))               ## proposal density g(x) = exp(-x)
}
Ex <- mean(x * fx(x) / gx(x))   ## importance-sampling estimate of E(X)

It is simply the weighted sample mean.
The non-weighted sample mean mean(x) gives you the expectation of the proposal density, while the weighted sample mean mean(w * x) gives the expectation of the target density. But you are using the wrong weight. The correct one is w = f(x) / g(x), i.e. w <- 2 * x * exp(-x^2 + x).
If I were you, I would not compute weights myself. I would do
set.seed(0)
x <- rexp(1000) ## samples from proposal density
f <- function(x) 2 * x *exp(-x^2) ## target density
w <- f(x) / dexp(x) ## importance weights
mean(x) ## non-weighted sample mean
# [1] 1.029677
mean(w * x) ## weighted sample mean
# [1] 0.9380861
In theory the weights have expectation 1, but in practice you only get close to 1:
mean(w)
# [1] 1.036482
So, you might want the normalized version:
mean(w * x) / mean(w)
# [1] 0.9050671
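As a quick cross-check, the exact value of E(X) under f can be obtained with integrate, so you can see how close the estimates above are:
## exact E(X) under the target density f(x) = 2 x exp(-x^2), x > 0
integrate(function(x) x * 2 * x * exp(-x^2), lower = 0, upper = Inf)
## about 0.8862269, i.e. sqrt(pi) / 2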


Function to calculate R2 (R-squared) in R

I have a dataframe with observed and modelled data, and I would like to calculate the R2 value. I expected there to be a function I could call for this, but can't locate one. I know I can write my own and apply it, but am I missing something obvious? I want something like
obs <- 1:5
mod <- c(0.8,2.4,2,3,4.8)
df <- data.frame(obs, mod)
R2 <- rsq(df)
# 0.85
You need a little statistical knowledge to see this. R squared between two vectors is just the square of their correlation. So you can define your function as:
rsq <- function (x, y) cor(x, y) ^ 2
Sandipan's answer will return exactly the same result (see the proof below), but as it stands it may appear more readable (thanks to the explicit $r.squared).
Let's do the statistics.
Basically we fit a linear regression of y on x, and compute the ratio of the regression sum of squares to the total sum of squares.
lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x)
lemma 2: beta = cov(x, y) / var(x)
lemma 3: R.square = cor(x, y) ^ 2
Combining the lemmas: the regression sum of squares is beta^2 * sum((x - mean(x))^2) = (n - 1) * cov(x, y)^2 / var(x), and dividing by the total sum of squares (n - 1) * var(y) gives cov(x, y)^2 / (var(x) * var(y)) = cor(x, y)^2.
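A quick numerical check of lemma 3, with made-up data:
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)          ## made-up data just for illustration
cor(x, y) ^ 2                   ## squared correlation
summary(lm(y ~ x))$r.squared    ## identical value from the fitted regression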
Warning
R squared between two arbitrary vectors x and y (of the same length) is just a measure of the goodness of their linear relationship. Think twice!! The R squared between x + a and y + b is identical for any constant shifts a and b. So it is a weak or even useless measure of "goodness of prediction". Use MSE or RMSE instead:
How to obtain RMSE out of lm result?
R - Calculate Test MSE given a trained model from a training set and a test set
I agree with 42-'s comment:
The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.
R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training part, you can get a valid R squared value on the training part, but you can't legitimately compute an R squared on the test part. Some people do this, but I don't agree with it.
Here is a very extreme example:
preds <- 1:4/4
actual <- 1:4
The R squared between those two vectors is 1. Yes, of course, one is just a linear rescaling of the other, so they have a perfect linear relationship. But do you really think that preds is a good prediction of actual?
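To make the point concrete: the correlation-based R squared is exactly 1 here, yet the prediction error is large.
preds <- 1:4 / 4
actual <- 1:4
cor(preds, actual) ^ 2            ## exactly 1: a perfect linear relationship
sqrt(mean((preds - actual) ^ 2))  ## yet the RMSE is large, so preds predicts actual poorly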
In reply to wordsforthewise
Thanks for your comments (1, 2) and your detailed answer.
You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x, then compute the regression sum of squares and the total sum of squares. It looks like you skip this regression step and go straight to the sum-of-squares computation. That is where it goes wrong: without the fitted regression, the partition of the sum of squares does not hold, and you can't compute an R squared in a consistent way.
As you demonstrated, this is just one way of computing R squared:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] 0.25
But there is another:
regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
regss / tss
#[1] 0.75
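For comparison, the regression-based route gives the same 0.75 here: fit actual ~ preds first, then read off R squared.
fit <- lm(actual ~ preds)   ## fit the regression line first
summary(fit)$r.squared      ## 0.75, matching regss / tss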
Also, your formula can give a negative value (the proper value should be 1 as mentioned above in the Warning section).
preds <- 1:4 / 4
actual <- 1:4
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] -2.375
Final remark
I had never expected that this answer could eventually be so long when I posted my initial answer 2 years ago. However, given the high views of this thread, I feel obliged to add more statistical details and discussions. I don't want to mislead people that just because they can compute an R squared so easily, they can use R squared everywhere.
Why not this:
rsq <- function(x, y) summary(lm(y~x))$r.squared
rsq(obs, mod)
#[1] 0.8560185
It is not something obvious, but the caret package has a function postResample() that will calculate "A vector of performance estimates" according to the documentation. The "performance estimates" are
RMSE
Rsquared
mean absolute error (MAE)
and have to be accessed from the vector like this
library(caret)
vect1 <- c(1, 2, 3)
vect2 <- c(3, 2, 2)
res <- caret::postResample(vect1, vect2)
rsq <- res[2]
However, this is using the correlation squared approximation for r-squared as mentioned in another answer. I'm not sure why Max Kuhn didn't just use the conventional 1-SSE/SST.
caret also has an R2() method, although it's hard to find in the documentation.
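For completeness, a sketch of calling it directly; I'm writing this from memory, so double-check the argument order in ?caret::R2 (I believe the predictions come first):
library(caret)
caret::R2(vect1, vect2)   ## assumed signature R2(pred, obs); see ?caret::R2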
The way to implement the normal coefficient of determination equation is:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
Not too bad to code by hand of course, but why isn't there a function for it in a language primarily made for statistics? I'm thinking I must be missing the implementation of R^2 somewhere, or no one cares enough about it to implement it. Most of the implementations, like this one, seem to be for generalized linear models.
You can also use the summary for linear models:
summary(lm(obs ~ mod, data=df))$r.squared
Here is the simplest solution, based on https://en.wikipedia.org/wiki/Coefficient_of_determination:
# 1. 'Actual' and 'Predicted' data
df <- data.frame(
y_actual = c(1:5),
y_predicted = c(0.8, 2.4, 2, 3, 4.8))
# 2. R2 Score components
# 2.1. Average of actual data
avr_y_actual <- mean(df$y_actual)
# 2.2. Total sum of squares
ss_total <- sum((df$y_actual - avr_y_actual)^2)
# 2.3. Regression sum of squares
ss_regression <- sum((df$y_predicted - avr_y_actual)^2)
# 2.4. Residual sum of squares
ss_residuals <- sum((df$y_actual - df$y_predicted)^2)
# 3. R2 Score
r2 <- 1 - ss_residuals / ss_total
Not sure why this isn't implemented directly in R, but this answer is essentially the same as Andrii's and Wordsforthewise's; I just turned it into a function for convenience, in case somebody uses it a lot like me.
r2_general <- function(preds, actual) {
  return(1 - sum((preds - actual) ^ 2) / sum((actual - mean(actual)) ^ 2))
}
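Applied to the data from the question:
obs <- 1:5
mod <- c(0.8, 2.4, 2, 3, 4.8)
r2_general(mod, obs)   ## 0.776 under the 1 - RSS/TSS definition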
I use the function MLmetrics::R2_Score from the MLmetrics package to compute R2; it uses the vanilla 1 - (RSS/TSS) formula.
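A usage sketch (assuming the MLmetrics convention that predictions come first; see ?MLmetrics::R2_Score to confirm):
library(MLmetrics)
R2_Score(y_pred = c(0.8, 2.4, 2, 3, 4.8), y_true = 1:5)   ## about 0.776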

Integration struggles with "the condition has length > 1"

I'm having some problems with the integrate function in R. I'm trying to plot the integral vo, but it seems I'm not doing it correctly.
t <- seq(0, 0.04, 0.0001)
vi <- function(x) {5 * sin(2 * pi * 50 * x)}
vo <- function(x) {integrate(vi, lower=0, upper=x)$value}
test_vect = Vectorize(vo, vectorize.args='x')
plot(t, vo(t)) # should be a cosine wave
plot(t, vi(t)) # sine wave
vo should be a (shifted) cosine wave, but using test_vect gives me the wrong plot, and using vo directly gives the error 'x' and 'y' lengths differ. Can anyone please help me with this?
You are already there. Just use plot(t, test_vect(t)). You can't use vo, as integrate is not a vectorized function. There is no problem evaluating a single point like vo(0.002), but you cannot feed it a vector via vo(t). This is why we want Vectorize(vo)(t).
You said that test_vect is not giving the right plot. Are you sure? We can compute the integral analytically:
v <- function (x) (1-cos(100*pi*x)) / (20*pi)
Then let's compare:
sum(abs(v(t) - test_vect(t)))
# [1] 2.136499e-15
They are the same!
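Putting it together, plotting the vectorized numerical integral and the analytic result on the same axes shows they coincide:
plot(t, test_vect(t), type = "l", ylab = "vo(t)")   ## numerical integral
lines(t, v(t), col = "red", lty = 2)                ## analytic result overlaid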

Probability choose (N, K) R

I have calculated a likelihood function for a sampling without replacement problem.
How, theoretically, can we convert this likelihood function into a choose(N, K) form?
Additionally, if I plot this function such that N is my X axis and probability given by this function is the Y axis, what is the variance of the plotted distribution?
Thanks,
Your question is a follow-up to How to plot a factorial function in R. I will not repeat information / background / code given in my answer there.
Regarding your request for a derivation: don't ask for more; do a little math yourself. This is a programming site, not the place for a question like this.
Now, regarding computation of the variance, we use the statistical identity var(X) = E(X^2) - E(X)^2.
## P has been rescaled so that sum(P) = 1 (N and P are from the linked answer)
## mean
MEAN <- sum(N * P)
# [1] 726.978
## variance
VAR <- sum(N * (N * P)) - MEAN ^ 2
# [1] 55342.9
## standard deviation
SD <- sqrt(VAR)
# [1] 235.2507
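This identity requires P to be a proper probability vector (summing to 1). As a quick sanity check of the computation, here is a sketch on a distribution with a known answer, unrelated to the likelihood in the question:
## Binomial(10, 0.5): true mean = 5, true variance = 2.5
N <- 0:10                              ## support
P <- dbinom(N, size = 10, prob = 0.5)  ## probabilities; sum(P) is 1
MEAN <- sum(N * P)                     ## 5
VAR  <- sum(N * (N * P)) - MEAN ^ 2    ## 2.5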

Calculate the volume under a plot of kernel bivariate density estimation

I need to calculate a measure called mutual information. First of all, I need to calculate another measure, called entropy, for example, the joint entropy of x and y:
-∬p(x,y)·log p(x,y)dxdy
So, to calculate p(x,y), I used a kernel density estimator (the function kde2d), and it returned the Z values (the estimated density of x and y on a grid).
So now I have a 100 x 100 matrix of Z values, which is my p(x,y). But I have to integrate it, i.e. find the volume under the surface (a double integral), and I haven't found a way to do that. The function quad2d, which computes a double quadrature, didn't work, because I only have p(x,y) as a numerical matrix, not a function.
Does anyone know how to find that volume / calculate the double integral?
The image of the plot from persp3d:
Thanks everybody !!!!
Once you have the results from kde2d, it is very straightforward to compute a numerical integral. The example session below sketches how to do it.
As you know, a numerical double integral is just a 2D summation. kde2d by default takes range(x) and range(y) as the 2D domain. I see that you got a 100 * 100 matrix, so I think you set n = 100 when calling kde2d. Now den$x and den$y define a 100 * 100 grid, with den$z giving the density on each grid cell. It is easy to compute the size of each grid cell (they are all equal); then we do three steps:
find the normalizing constant: although in theory the density integrates to 1, after discretization it only approximately does, so we first compute this constant for later rescaling;
the integrand for entropy is z * log(z) (times -1); since z is a 100 * 100 matrix, this is also a matrix. Sum it up and multiply by the cell size cell_size to get a non-normalized entropy;
rescale the non-normalized entropy to get the normalized one.
## sample data: bivariate normal, with covariance/correlation 0
set.seed(123); x <- rnorm(1000, 0, 2) ## marginal variance: 4
set.seed(456); y <- rnorm(1000, 0, 2) ## marginal variance: 4
## load MASS
library(MASS)
## domain:
xlim <- range(x)
ylim <- range(y)
## 2D Kernel Density Estimation
den <- kde2d(x, y, n = 100, lims = c(xlim, ylim))
##persp(den$x,den$y,den$z)
z <- den$z ## extract density
## den$x, den$y expands a 2D grid, with den$z being density on each grid cell
## numerical integration is straightforward, by aggregation over all cells
## the size of each grid cell (a rectangular cell) is:
cell_size <- (diff(xlim) / 100) * (diff(ylim) / 100)
## normalizing constant; ideally should be 1, but actually only close to 1 due to discretization
norm <- sum(z) * cell_size
## your integrand: z * log(z) * (-1):
integrand <- z * log(z) * (-1)
## get numerical integral by summation:
entropy <- sum(integrand) * cell_size
## self-normalization:
entropy <- entropy / norm
Verification
The above code gives an entropy of 4.230938. Now, Wikipedia's article on the multivariate normal distribution gives the entropy formula:
(k / 2) * (1 + log(2 * pi)) + (1 / 2) * log(det(Sigma))
For the above bivariate normal distribution, we have k = 2. We have Sigma (covariance matrix):
4 0
0 4
whose determinant is 16. Hence, the theoretical value is:
(1 + log(2 * pi)) + (1 / 2) * log(16) = 4.224171
Good match!
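Since the ultimate goal in the question is mutual information: as a sketch, you can combine the joint entropy above with marginal entropies estimated by density(), using the same self-normalization idea. I(X; Y) = H(X) + H(Y) - H(X, Y), and for the independent x and y generated above the result should come out close to 0.
marginal_entropy <- function(v, n = 100) {
  den <- density(v, n = n)     ## 1D kernel density estimate
  dx  <- diff(den$x[1:2])      ## grid spacing
  p   <- den$y[den$y > 0]      ## drop zero cells to avoid log(0)
  nc  <- sum(p) * dx           ## normalizing constant, close to 1
  -sum(p * log(p)) * dx / nc   ## self-normalized entropy
}
mutual_info <- marginal_entropy(x) + marginal_entropy(y) - entropy
mutual_info                    ## should be near 0 for independent x and y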

Generate numbers in R

In R, how can I generate N numbers that have a mean of X and a median of Y (or at least close to them)?
Or perhaps more generally, is there an algorithm for this?
There is an infinite number of solutions.
Approximate algorithm:
Generate n/2 numbers below the median
Generate n/2 numbers above the median
Add your desired median and check
Add one number with enough weight to satisfy your mean -- which you can solve
Example assuming you want a median of zero and a mean of twenty:
R> set.seed(42)
R> lo <- rnorm(10, -10); hi <- rnorm(10, 10)
R> median(c(lo,0,hi))
[1] 0 # this meets our first criterion
R> 22*20 - sum(c(lo,0,hi)) # (n+1)*desiredMean - currentSum
[1] 436.162 # so if we insert this, we get the right answer
R> mean(c(lo,0,hi,22*20 - sum(c(lo,0,hi))))
[1] 20 # so we meet criterion two
R>
because desiredMean * (n+1) has to equal sum(currentSet) + x, so we solve for x, which gives the expression above.
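The same solve can be wrapped in a tiny helper (balance_to_mean is just a made-up name for this sketch):
## value to append so that c(xs, value) has mean m:
## m * (length(xs) + 1) = sum(xs) + value
balance_to_mean <- function(xs, m) (length(xs) + 1) * m - sum(xs)
xs <- c(lo, 0, hi)                      ## the 21 numbers built above
mean(c(xs, balance_to_mean(xs, 20)))    ## 20, as required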
For a set of data that looks fairly 'normal', you can use the correction-factor method as outlined by Dirk Eddelbuettel, but with your custom values used to generate a set of data around your mean:
X <- 25
Y <- 25.5
N <- 100
set.sd <- 5  # if you want to set the standard deviation of the set
set <- rnorm(N, Y, set.sd)              # generate a set around the mean
set.left <- set[set < X]                # take only the left half
set <- c(set.left, X + (X - set.left))  # ... and make a copy on the right
# redefine the set, adding in the correction number and an extra number
# on the opposite side to the correction:
set <- c(set,
         X + ((set.sd / 2) * sign(X - Y)),
         ((length(set) + 2) * Y) - sum(set, X + ((set.sd / 2) * sign(X - Y))))
Take strong heed of the first answer's first sentence. Unless you know what underlying distribution you want, you can't do it. Once you know that distribution, there are R functions for many standard distributions, such as runif, rnorm, and rchisq. You can create an arbitrary discrete distribution with the sample function.
If you are okay with the restriction that the mean exceeds the median (X > Y), then you can fit a lognormal distribution. The lognormal conveniently has closed forms for both the mean and the median.
rmm <- function(n, X, Y) rlnorm(n, log(Y), sqrt(2*log(X/Y)))
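To see why this works: for a lognormal with parameters meanlog and sdlog, the median is exp(meanlog) and the mean is exp(meanlog + sdlog^2 / 2). Setting meanlog = log(Y) gives median Y, and setting sdlog^2 = 2 * log(X / Y) gives mean exp(log(Y) + log(X / Y)) = X; this is also where the requirement X > Y comes from, since sdlog^2 must be non-negative.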
E.g.:
z <- rmm(10000, 3, 1)
mean(z)
# [1] 2.866567
median(z)
# [1] 0.9963516
