Comparison of two vectors resulted after simulation - r

I would like to use rejection sampling to simulate a random vector Y = (Y_1, Y_2) that is uniformly distributed on the unit disc D = { (x_1, x_2) \in R^2 : \sqrt{x_1^2 + x_2^2} \leq 1 }, using proposals X = (X_1, X_2) that are uniformly distributed on the square S = [−1, 1]^2. The target joint density is f(y_1, y_2) = \frac{1}{\pi} 1_D(y_1, y_2).
In the rejection method, we generally accept a proposal x if U \leq f(x) / (C g(x)), where U is uniform on (0, 1), g is the proposal density and C is a constant such that f(x) \leq C g(x) for all x. I am using the following code:
I have two questions:
{Using the above code, logically, the size of d should be greater than the size of disc_sample, but when I call both of them I see 100 elements in each. How is this possible? Why are the sizes the same?} THIS PART IS SOLVED, thanks to the comment below.
The question now:
How could I reformulate my code to give me the total number of samples needed to get 100 samples that satisfy the condition, i.e. the number of samples rejected before I obtain the 100 needed samples?
Thanks to r2evans for the answer, but I am looking to write something simpler: a while loop that stores all candidate samples in a matrix or a data frame instead of a list, and then extracts from that data frame just the samples that satisfy the condition. I modified the code from the answer to avoid lists and sapply, but it does not give the needed result; it yields only one row.
samps <- data.frame()
goods <- data.frame()
nr <- 0L
sampsize <- 100L
needs <- 100L
while (i < needs) {
  samps <- data.frame(x = runif(1, -1, 1), y = runif(1, -1, 1))
  goods <- samps[(samps$x^2+samps$y^2)<1, ]
  i = i+1
}
and I also thought about this:
samps <- matrix()
goods <- matrix()
needs <- 100
while (j < needs) {
  samps[i,1] <- runif(1, -1, 1)
  samps[i,2] <- runif(1, -1, 1)
  if (( (samps[i,1])**2+(samps[i,2])**2)<1){
    goods[j,1] <- samps[i,1]
    goods[j,2] <- samps[i,2]
  }
  i = i+1
}
but it is not working.
I would be very grateful for any help to modify the code.

As to your second question: you cannot reformulate your code to know in advance precisely how many draws it will take to get (at least) 100 accepted combinations. You can use a while loop and concatenate results until you have at least 100 such rows, and then truncate anything beyond 100. Because drawing random numbers one at a time is "expensive" at scale, you might prefer to always over-estimate the rows you need and generate them all at once.
(Edited to reduce "complexity" based on homework constraints.)
samps <- vector(mode = "list")
goods <- vector(mode = "list")
nr <- 0L
iter <- 0L
sampsize <- 100L
needs <- 100L
while (nr < needs && iter < 50) {
  iter <- iter + 1L
  # draw a whole batch of candidate points in the square [-1, 1]^2
  samps[[iter]] <- data.frame(x = runif(sampsize, -1, 1), y = runif(sampsize, -1, 1))
  # keep only the points that fall inside the unit disc
  rows <- (samps[[iter]]$x^2 + samps[[iter]]$y^2) < 1
  goods[[iter]] <- samps[[iter]][rows, ]
  nr <- nr + sum(rows)
}
iter # number of times we looped
# [1] 2
out <- head(do.call(rbind, goods), n = 100)
nrow(out)
# [1] 100
head(out) ; tail(out)
# x y
# 1 0.8296121 0.2524907
# 3 -0.4277209 -0.5668654
# 4 0.6608953 -0.2221099
# 5 0.2834910 0.8849114
# 6 0.0381919 0.9252160
# 7 0.4731766 0.4797106
# x y
# 221 -0.65673577 -0.2124462
# 231 0.08606199 -0.7161822
# 251 -0.37263236 0.1296444
# 271 -0.38589120 -0.2831997
# 28 -0.62909284 0.6840144
# 301 -0.50865171 0.5014720
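If a one-point-at-a-time loop is preferred, the following is a minimal sketch (not the code from the answer above) that stores accepted points in a preallocated matrix and counts how many candidates were drawn in total. The names accepted, total and rejected are chosen here for illustration, and the seed is only there to make the run repeatable.

```r
set.seed(42)                      # assumption: fixed seed, only for reproducibility
needs <- 100L                     # number of accepted points wanted
goods <- matrix(NA_real_, nrow = needs, ncol = 2,
                dimnames = list(NULL, c("x", "y")))
accepted <- 0L                    # points inside the disc so far
total <- 0L                       # all candidates drawn so far

while (accepted < needs) {
  cand <- runif(2, -1, 1)         # one candidate point in the square [-1, 1]^2
  total <- total + 1L
  if (sum(cand^2) <= 1) {         # accept if the point lies in the unit disc
    accepted <- accepted + 1L
    goods[accepted, ] <- cand
  }
}

rejected <- total - accepted      # candidates that fell outside the disc
total
rejected
```

Since the disc covers pi/4 of the square, total should be around 100 / (pi/4), roughly 127, on average, but it varies from run to run.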


Area Under the Curve using Simpson's rule in R

I would like to compute the area under the curve (AUC) defined by a set of experimental values. I created a function to calculate an approximation of the AUC using Simpson's rule, as I saw in this post. However, the function only works when it receives a vector of odd length. How can I modify the code to add the area of the last trapezoid when the input vector has an even length?
AUC <- function(x, h=1){
  # AUC function computes the Area Under the Curve of a time series using
  # Simpson's Rule (numerical method).
  # Arguments
  # x: (vector) time series values
  # h: (int) temporal resolution of the time series. default h=1
  n = length(x)-1
  xValues = seq(from=1, to=n, by=2)
  sum <- list()
  for(i in 1:length(xValues)){
    n_sub <- xValues[[i]]-1
    n <- xValues[[i]]
    n_add <- xValues[[i]]+1
    v1 <- x[[n_sub+1]]
    v2 <- x[[n+1]]
    v3 <- x[[n_add+1]]
    s <- (h/3)*(v1+4*v2+v3)
    sum <- append(sum, s)
  }
  sum <- unlist(sum)
  auc <- sum(sum)
  return(auc)
}
Here is a data example:
smoothed = c(0.3,0.317,0.379,0.452,0.519,0.573,0.61,0.629,0.628,0.613,0.587,0.556,0.521,
One recommended way to handle an even number of points and still achieve precision is to combine Simpson's 1/3 rule with Simpson's 3/8 rule, which can handle an even number of points. Such approaches can be found in (at least one or perhaps more) engineering textbooks on numerical methods.
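For reference, these are the two single-application rules being combined, written for equally spaced ordinates with spacing h (the symbols y_0, y_1, ... are notation assumed here, not taken from the post):

```latex
\int_{x_0}^{x_2} f(x)\,dx \approx \frac{h}{3}\left(y_0 + 4y_1 + y_2\right)
\qquad \text{(Simpson's 1/3 rule, two segments)}

\int_{x_0}^{x_3} f(x)\,dx \approx \frac{3h}{8}\left(y_0 + 3y_1 + 3y_2 + y_3\right)
\qquad \text{(Simpson's 3/8 rule, three segments)}
```

With an even number of points the number of segments is odd, so a single 3/8 block over three segments leaves an even number of segments for the 1/3 rule.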
However, as a practical matter, you can write a code chunk to check the data length and add a single trapezoid at the end, as was suggested in the last comment of the post to which you linked. I wouldn't assume that it is necessarily as precise as combining Simpson's 1/3 and 3/8 rules, but it is probably reasonable for many applications.
I would double-check my code edits below, but this is the basic idea.
AUC <- function(x, h=1){
  # AUC function computes the Area Under the Curve of a time series using
  # Simpson's Rule (numerical method).
  # Arguments
  # x: (vector) time series values
  # h: (int) temporal resolution of the time series. default h=1
  #jh edit: check for even data length
  #and chop off last data point if even
  nn = length(x)
  if(length(x) %% 2 == 0){
    xlast = x[length(x)]
    x = x[-length(x)]
  }
  n = length(x)-1
  xValues = seq(from=1, to=n, by=2)
  sum <- list()
  for(i in 1:length(xValues)){
    n_sub <- xValues[[i]]-1
    n <- xValues[[i]]
    n_add <- xValues[[i]]+1
    v1 <- x[[n_sub+1]]
    v2 <- x[[n+1]]
    v3 <- x[[n_add+1]]
    s <- (h/3)*(v1+4*v2+v3)
    sum <- append(sum, s)
  }
  sum <- unlist(sum)
  auc <- sum(sum)
  ##jh edit: add trapezoid for last two data points to result
  if(nn %% 2 == 0){
    auc <- auc + (x[length(x)] + xlast)/2 * h
  }
  return(auc)
}
sm = smoothed[-length(smoothed)]
length(sm)
[1] 70
#even data as an example
AUC(sm)
[1] 20.17633
#original odd data
AUC(smoothed)
[1] 20.389
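As a cross-check on a loop-based function like the one above, Simpson's 1/3 rule for an odd number of points can also be written in vectorized form. This is only a sketch: the name simpson_odd is hypothetical, and the test data (a cubic on [0, 2]) is chosen here because Simpson's rule integrates cubics exactly.

```r
# Composite Simpson's 1/3 rule for an odd number of equally spaced points.
# simpson_odd is a hypothetical helper name, not part of the original post.
simpson_odd <- function(y, h = 1) {
  n <- length(y)
  stopifnot(n >= 3, n %% 2 == 1)
  w4 <- seq(2, n - 1, by = 2)                              # points with weight 4
  w2 <- if (n > 3) seq(3, n - 2, by = 2) else integer(0)   # points with weight 2
  h / 3 * (y[1] + 4 * sum(y[w4]) + 2 * sum(y[w2]) + y[n])
}

x <- seq(0, 2, by = 0.5)          # five points, four segments
est <- simpson_odd(x^3, h = 0.5)  # exact integral of x^3 on [0, 2] is 4
est
```

Because the rule is exact for cubics, est should agree with 4 up to floating-point error, which makes this a convenient sanity test before applying the function to experimental data.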
There may be a good reason for you to prefer using Simpson's rule, but if you're just looking for a quick and efficient estimate of AUC, the trapezoid rule is far easier to implement, and does not require an even number of breaks:
AUC <- function(x, h = 1) sum((x[-1] + x[-length(x)]) / 2 * h)
AUC(smoothed)
#> [1] 20.3945
Here, I show example code that uses the Simpson's 1/3 and 3/8 rules in tandem for the numerical integration of data. As always, the usual caveats about the possibility of coding errors or compatibility issues apply.
The output at the end compares the numerical estimates of this algorithm with the trapezoidal rule using R's "integrate" function.
#Algorithm adapted from:
#Numerical Methods for Engineers, Seventh Edition,
#By Chapra and Canale, page 623
#Modified to accept data instead of functional values
#Modified by: Jeffrey Harkness, M.S.
##Begin Simpson's rule function code
# single application of Simpson's 1/3 rule (3 points)
simp13 <- function(dat, h = 1){
  ans = 2*h*(dat[1] + 4*dat[2] + dat[3])/6
  return(ans)
}
# multiple application of Simpson's 1/3 rule (odd number of points, >= 5)
simp13m <- function(dat, h = 1){
  summ <- dat[1]
  n <- length(dat)
  nseq <- seq(2,(n-2),2)
  for(i in nseq){
    summ <- summ + 4*dat[i] + 2*dat[i+1]
  }
  summ <- summ + 4*dat[n-1] + dat[n]
  result <- (h*summ)/3
  return(result)
}
# single application of Simpson's 3/8 rule (4 points)
simp38 <- function(dat, h = 1){
  ans <- 3*h*(dat[1] + 3*sum(dat[2:3]) + dat[4])/8
  return(ans)
}
simpson = function(dat, h = 1){
  hin = h
  len = length(dat)
  comp <- len %% 2
  ##number of segments
  if(len == 2){
    ans = sum(dat)/2*h ##n = 2 is the trapezoidal rule
  }
  if(len == 3){
    ans = simp13(dat, h = hin)
  }
  if(len == 4){
    ans = simp38(dat,h = hin)
  }
  if(len == 6){
    ans <- simp38(dat[1:4],h = hin) + simp13(dat[4:len],h = hin)
  }
  if(len > 6 & comp == 0){
    ans = simp38(dat[1:4],h = hin) + simp13m(dat[4:len],h = hin)
  }
  if(len >= 5 & comp == 1){
    ans = simp13m(dat,h = hin)
  }
  return(ans)
}
##End Simpson's rule function code
This next section of code shows the performance comparison. This code can easily be altered for different test functions and cases.
The precision difference tends to change with the sample size and test function used; this example is not intended to imply that the difference is always this pronounced.
#other algorithm for comparison purposes, from Allan Cameron above
oa <- function(x, h = 1) sum((x[-1] + x[-length(x)]) / 2 * h)
#Testing and algorithm comparison code
simans = NULL; oaans = NULL; simerr = NULL; oaerr = NULL; mp = NULL
for( j in 1:10){
  n = j
  #f = function(x) cos(x) + 2 ##Test functions
  f = function(x) 0.2 + 25*x - 200*x^2 + 675*x^3 - 900*x^4 + 400*x^5
  a = 0;b = 10
  h = (b-a)/n
  datain = seq(a,b,by = h)
  preans = integrate(f,a,b)$value #precise numerical estimate of test function
  simans[j] = simpson(f(datain), h = h)
  oaans[j] = oa(f(datain), h = h)
  (simerr[j] = abs(simans[j] - preans)/preans * 100)
  (oaerr[j] = abs(oaans[j] - preans)/preans * 100)
  mp[j] = simerr[j] < oaerr[j]
}
(outframe = data.frame("simpsons percent diff" = simerr,"trapezoidal percent diff" = oaerr, "more precise?" = mp, check.names = FALSE))
simpsons percent diff trapezoidal percent diff more precise?
1 214.73489738 214.734897 FALSE
2 15.07958148 64.993410 TRUE
3 6.70203621 29.816799 TRUE
4 0.94247384 16.955208 TRUE
5 0.54830021 10.905620 TRUE
6 0.18616767 7.593825 TRUE
7 0.12051767 5.588209 TRUE
8 0.05890462 4.282980 TRUE
9 0.04087107 3.386525 TRUE
10 0.02412733 2.744500 TRUE

Keep getting 'unused argument' error when trying to produce a qqplot in R

rejection_fx_sqz <- function(n){
  x <- vector() # the output vector in which simulated values should be stored
  acpt <- 0 # count the accepted values
  tol <- 0 # count the total number of values (accepted or not accepted)
  len_x = 0
  while(len_x < n){
    n_to_gen = max((n-len_x)/0.69,20) # determine number to generate - not less than 20
    tol = tol + n_to_gen # count the total number of values simulated
    u1 = runif(n_to_gen) # simulate u1
    u2 = runif(n_to_gen) # simulate u2
    y = inv_G(u2)
    g <- g_x(y)
    d <- g*y*(5-y)
    condU <- (M*u1) >= 1/d
    condL <- !condU
    condL[condL] <- (M*u1[condL]) <= lower(y[condL])/d[condL]
    other <- !(condU | condL) # condition of below the W_U and above W_L
    # modify condL because some samples can still be accepted given condL is false
    condL[other] <- u1[other] <= fstar(y[other])/(M*g[other])
    cond <- condL
    acpt = acpt + sum(cond) # count the number of accepted values
    x <- c(x, y[cond]) # add accepted values to the output vector
    len_x <- length(x)
  }
  p = acpt / tol
  return(list(x=x[1:n], p=p))
}
x=rejection_fx_sqz(n) # a function that simulates from f(x) by generating n samples
x_fx <- data.frame(x=x$x)
x_plot = cbind(x_fx, fy = 1/(I*x*(5-x))*exp(-1/8*(-1+log(x/(5-x)))^2))
f_cdf <- function(x) {
  integrate(fstar, 0, x)$value/I
}
# quantile function, inverse cdf
f_q <- inverse(f_cdf, lower=0.000000000000000001, upper=4.999999999999999999)
ggplot(x_plot, aes(sample=x))+
  labs(title="Empirical against theoretical quantiles")+
  stat_qq(distribution=f_q) +
  stat_qq_line(distribution=f_q)
What I'm trying to do is produce a 'quantile-quantile' diagnostic plot for my algorithm for simulating from f(x). The problem is that I keep getting two error messages, which say:
Computation failed in stat_qq(): unused argument (p = quantiles)
Computation failed in stat_qq_line(): unused argument (p = quantiles)
I am a beginner in R and this is driving me crazy. Any help is appreciated.
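The error text suggests that the quantile function is being called with an argument named p, which a function whose only parameter has a different name will reject. The sketch below reproduces that message in base R without ggplot2. Here q_no_p is a hypothetical stand-in for the function returned by inverse(), and qexp is used only so that the example is self-contained.

```r
# Hypothetical stand-in for f_q: a quantile function whose argument is not named p.
q_no_p <- function(y) qexp(y)

# Calling it with a named argument p fails, much like the message above.
msg <- tryCatch(
  do.call(q_no_p, list(p = c(0.1, 0.5))),
  error = function(e) conditionMessage(e)
)
msg

# A thin wrapper that accepts p and evaluates the original function
# one probability at a time (useful if it is not vectorized).
q_with_p <- function(p) vapply(p, q_no_p, numeric(1))
vals <- do.call(q_with_p, list(p = c(0.1, 0.5)))
vals
```

Under that assumption, passing a wrapper such as function(p) sapply(p, f_q) as the distribution argument may be worth trying in the plot above.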

Machine Learning: Stochastic gradient descent for logistic regression in R: Calculating Eout and average number of epochs

I am trying to write a code to solve the following problem (As stated in HW5 in the CalTech course Learning from Data):
In this problem you will create your own target function f
(probability in this case) and data set D to see how Logistic
Regression works. For simplicity, we will take f to be a 0/1
probability so y is a deterministic function of x. Take d = 2 so you
can visualize the problem, and let X = [-1, 1] × [-1, 1] with uniform
probability of picking each x ∈ X. Choose a line in the plane as the
boundary between f(x) = 1 (where y has to be +1) and f(x) = 0 (where y
has to be -1) by taking two random, uniformly distributed points from
X and taking the line passing through them as the boundary between y =
±1. Pick N = 100 training points at random from X, and evaluate the
outputs yn for each of these points xn. Run Logistic Regression with
Stochastic Gradient Descent to find g, and estimate Eout (the cross
entropy error) by generating a sufficiently large, separate set of
points to evaluate the error. Repeat the experiment for 100 runs with
different targets and take the average. Initialize the weight vector
of Logistic Regression to all zeros in each run. Stop the algorithm
when |w(t-1) - w(t)| < 0.01, where w(t) denotes the weight vector at
the end of epoch t. An epoch is a full pass through the N data points
(use a random permutation of 1, 2, ..., N to present the data
points to the algorithm within each epoch, and use different
permutations for different epochs). Use a learning rate of 0.01.
I am required to calculate the nearest value to Eout for N=100, and the average number of epochs for the required criterion.
I wrote and ran the code but I'm not getting the right answers (according to the solutions, Eout is near 0.1 and the number of epochs is near 350). The required number of epochs for a delta w of 0.01 comes out far too small (around 10), leaving the error too big (around 2). I then tried replacing the criterion with |w(t-1) - w(t)| < 0.001 (rather than 0.01). Then the average required number of epochs was about 250 and the out-of-sample error was about 0.35.
Is there something wrong with my code/solution, or is it possible that the answers provided are faulty? I've added comments to indicate what I intend to do at each step. Thanks in advance.
h<- 0 # h will later be updated to number of required epochs
p<- 0 # p will later be updated to Eout
C <- matrix(ncol=10000, nrow=2) # Testing set, used to calculate out of sample error
d <- matrix(ncol=10000, nrow=1)
for(i in 1:10000){
  C[, i] <- c(runif(2, min = -1, max = 1)) # Sample data
  d[1, i] <- sign(C[2, i] - f(C[1, i]))
}
for(g in 1:100){ # 100 runs of the experiment
  x <- runif(2, min = -1, max = 1)
  y <- runif(2, min = -1, max = 1)
  fit = (lm(y~x))
  t <- summary(fit)$coefficients[,1]
  f <- function(x){ # Target function
    t[2]*x + t[1]
  }
  A <- matrix(ncol=100, nrow=2) # Sample data
  b <- matrix(ncol=100, nrow=1)
  norm_vec <- function(x) {sqrt(sum(x^2))} # vector norm calculator
  w <- c(0,0) # weights initialized to zero
  for(i in 1:100){
    A[, i] <- c(runif(2, min = -1, max = 1)) # Sample data
    b[1, i] <- sign(A[2, i] - f(A[1, i]))
  }
  q <- matrix(nrow = 2, ncol = 1000) # q tracks the weight vector at the end of each epoch
  l= 1
  while(l < 1001){
    E <- function(z){ # cross entropy error function
      x = z[1]
      y = z[2]
      v = z[3]
      return(log(1 + exp(-v*t(w)%*%c(x, y))))
    }
    err <- function(xn1, xn2, yn){ #gradient of error function
      return(c(-yn*xn1, -yn*xn2)*(exp(-yn*t(w)*c(xn1,xn2))/(1+exp(-yn*t(w)*c(xn1,xn2)))))
    }
    e = matrix(nrow = 2, ncol = 100) # e will track the required gradient at each data point
    e[,1:100] = 0
    perm = sample(100, 100, replace = FALSE, prob = NULL) # Random permutation of the data indices
    for(j in 1:100){ # One complete Epoch
      r = A[,perm[j]] # pick the perm[j]th entry in A
      s = b[perm[j]] # pick the perm[j]th entry in b
      e[,perm[j]] = err(r[1], r[2], s) # Gradient of the error
      w = w - 0.01*e[,perm[j]] # update the weight vector according to the formula involving step size, gradient
    }
    q[,l] = w # the lth entry is the weight vector at the end of the lth epoch
    if(l > 1 & norm_vec(q[,l] - q[,l-1])<0.001){ # given criterion to terminate the algorithm
      break
    }
    l = l+1 # move to the next epoch
  }
  for(n in 1:10000){
    p[g] = mean(E(c(C[1,n], C[2, n], d[n]))) # average over 10000 data points, of the error function, in experiment no. g
  }
  h[g] = l #gth entry in the vector h, tracks the number of epochs in the gth iteration of the experiment
}
mean(h) # Mean number of epochs needed
mean(p) # average Eout, over 100 experiments
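For comparison, here is a minimal sketch of a single run of the experiment, written under two assumptions that differ from the code above: a constant bias coordinate x0 = 1 is prepended to every point (so w has three components), and the per-point gradient uses the inner product of w and x. The function name run_sgd, the test-set size of 1000 and the max_epochs safety cap are choices made here for illustration, not part of the problem statement.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility

# One run of logistic regression with SGD on a random linear target.
run_sgd <- function(N = 100, eta = 0.01, tol = 0.01, max_epochs = 5000) {
  # target boundary: the line through two random points p1 and p2
  p1 <- runif(2, -1, 1); p2 <- runif(2, -1, 1)
  label <- function(P) {
    sign((p2[1] - p1[1]) * (P[, 2] - p1[2]) - (p2[2] - p1[2]) * (P[, 1] - p1[1]))
  }
  pts <- matrix(runif(2 * N, -1, 1), ncol = 2)
  X <- cbind(1, pts)               # prepend bias coordinate x0 = 1
  y <- label(pts)
  w <- c(0, 0, 0)                  # weights start at zero
  epochs <- 0
  repeat {
    w_old <- w
    for (i in sample(N)) {         # fresh random permutation each epoch
      xi <- X[i, ]
      # gradient of log(1 + exp(-y * w.x)) with respect to w
      grad <- -y[i] * xi / (1 + exp(y[i] * sum(w * xi)))
      w <- w - eta * grad
    }
    epochs <- epochs + 1
    if (sqrt(sum((w - w_old)^2)) < tol || epochs >= max_epochs) break
  }
  # out-of-sample cross-entropy error on a separate test set
  test_pts <- matrix(runif(2 * 1000, -1, 1), ncol = 2)
  yt <- label(test_pts)
  eout <- mean(log(1 + exp(-yt * as.vector(cbind(1, test_pts) %*% w))))
  list(w = w, epochs = epochs, eout = eout)
}

res <- run_sgd()
res$epochs
res$eout
```

Averaging res$epochs and res$eout over 100 calls would give the quantities asked for in the problem; whether they match the published answers depends on the assumptions stated above.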

Faster coding than using for loop

Suppose I have the following data frame
n <- 300
dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y),]
For each value of d$y<=300, I have to create a variable res in which the numerator is the sum of the indicator (d$x <= d$y[i]) and the denominator is the sum of the indicator (d$y >= d$y[i]). I have written the code as a for loop:
res <- NULL
for( i in seq_len(sum(d$y<=300)) ){
  numerator <- sum(d$x <= d$y[i])
  denominator <- sum(d$y >= d$y[i])
  res[i] <- numerator / denominator
}
My concern is that when the number of observations of x and y is large, that is, when the number of rows of the data frame increases, the for loop will be slow. Additionally, if I simulate data 1000 times and run the for loop each time, the program will be inefficient.
What would be a more efficient way to write this code?
This depends on d already being sorted as it is:
# example data
n <- 1e5
dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y),]
My suggestion (thanks to @alexis_laz for the denominator):
system.time(res3 <- {
  xs <- sort(d$x) # sorted x
  yt <- d$y[d$y <= 300] # truncated y
  num = findInterval(yt, xs)
  den = length(d$y) - match(yt, d$y) + 1L
  num/den
})
# user system elapsed
# 0 0 0
OP's approach:
system.time(res <- {
  res <- NULL
  for( i in seq_len(sum(d$y<=300)) ){
    numerator <- sum(d$x <= d$y[i])
    denominator <- sum(d$y >= d$y[i])
    res[i] <- numerator / denominator
  }
  res
})
# user system elapsed
# 50.77 1.13 52.10
# verify it matched
all.equal(res,res3) # TRUE
@d.b's approach:
system.time(res2 <- {
  numerator = rowSums(outer(d$y, d$x, ">="))
  denominator = rowSums(outer(d$y, d$y, "<="))
  res2 = numerator/denominator
  res2 = res2[d$y <= 300]
})
# Error: cannot allocate vector of size 74.5 Gb
# ^ This error is common when using outer() on large-ish problems
Vectorization. Generally, tasks are faster in R if they can be vectorized. The key functions related to ordered vectors have confusing names (findInterval, sort, order and cut), but fortunately they all work on vectors.
Continuous vs discrete. The match above should be a fast way to compute the denominator whether the data is continuous or has mass points / repeating values. If the data is continuous (and so has no repeats), the denominator can just be seq(length(xs), length = length(yt), by=-1). If it is fully discrete and has a lot of repetition (like the example here), there might be some way to make that faster as well, maybe like one of these:
den2 <- inverse.rle(with(rle(yt), list(
values = length(xs) - length(yt) + rev(cumsum(rev(lengths))),
lengths = lengths)))
tab <- unname(table(yt))
den3 <- rep(rev(cumsum(rev(tab))) + length(xs) - length(yt), tab)
# verify
all.equal(den,den2) # TRUE
all.equal(den,den3) # TRUE
findInterval will still work for the numerator for continuous data. It's not ideal for the repeated-values case considered here I guess (since we're redundantly finding the interval for many repeated yt values). Similar ideas for speeding that up likely apply.
Other options. As @chinsoon suggested, the data.table package might be a good fit if findInterval is too slow, since it has a lot of features focused on sorted data, but it's not obvious to me how to apply it here.
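To see on a small scale why the findInterval and match approach reproduces the loop, here is a self-contained sketch on 300 rows. The names res_loop and res_vec are used here only for illustration.

```r
set.seed(1)   # assumption: fixed seed, only for reproducibility
n <- 300
d <- data.frame(x = round(runif(n, 0, 200)), y = round(runif(n, 0, 500)))
d <- d[order(d$y), ]              # the vectorized version relies on y being sorted

yt <- d$y[d$y <= 300]             # the y values for which res is needed

# direct translation of the loop: count x <= y[i] and y >= y[i] for each y[i]
res_loop <- vapply(yt, function(v) sum(d$x <= v) / sum(d$y >= v), numeric(1))

# vectorized: findInterval counts sorted x values <= each yt;
# match gives the first position of each yt in sorted y,
# so everything from that position onward is >= yt
res_vec <- findInterval(yt, sort(d$x)) / (nrow(d) - match(yt, d$y) + 1L)

all.equal(res_loop, res_vec)
```

The two results agree even with repeated y values, because match returns the first occurrence of each value in the sorted vector.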
Instead of running a loop, generate all the numerators and denominators at once. This also allows you to keep track of which res is associated with which x and y. Later, you can keep only the ones you want.
You can use outer for element-wise comparison between vectors.
numerator = rowSums(outer(d$y, d$x, ">=")) #Compare all y against all x
denominator = rowSums(outer(d$y, d$y, "<=")) #Compare all y against itself
res2 = numerator/denominator #Obtain 'res' for all rows
#I would first 'cbind' res2 to d and only then remove the ones for 'y <=300'
res2 = res2[d$y <= 300] #Keep only those 'res' that you want
Since this is using rowSums, this should be faster.

Simulating coin toss

In the New York Times yesterday there was a reference to a paper essentially saying that the probability of 'heads' after a 'head' appears is not 0.5 (assuming a fair coin), challenging the "hot hand" myth. I want to prove it to myself.
Thus, I am working on coding a simulation of 7 coin tosses, and counting the number of heads after the first head, provided, naturally, that there is a first head at all.
I came up with the following lines of R code, but I'm still getting NA values, and would appreciate some help:
n <- 7 # number of tosses
p <- 0.5 # probability of heads
sims <- 100 # number of simulations
Freq_post_H <- 0 # frequency of 'head'-s after first 'head'
for(i in 1:sims){
  z <- rbinom(n, 1, p)
  y <- which(z==1)[1]
  Freq_post_H[i] <- sum(z[(y+1):n])/length((y+1):n)
}
What am I missing?
CONCLUSION: After the initial hiccups of mismatched variable names, both responses solve the question. One of the answers corrects problems in the initial code related to what happens with the last toss (i + 1) by introducing min(y + 1, n), and corrects the basic misunderstanding of next within a loop generating NA for skipped iterations. So thank you (+1).
Critically, and the reason for this appended "conclusion", the second response addresses a more fundamental, conceptual problem: we want to calculate the fraction of H's that are preceded by an H, as opposed to p(H) in whatever number of tosses remain after a head has appeared, which will be 0.5 for a fair coin.
This is a simulation of what they did in the newspaper:
nsims <- 10000
k <- 4
sims <- replicate(nsims, {
  x <- sample(0:1, k, TRUE)
  sum( # sum logical values, i.e. 0/1
    diff(x) == 0L & # is difference between consecutive values 0?
    x[-1] == 1L ) / # and are these values heads?
  sum(head(x, -1) == 1L) #divide by number of heads (without last toss)
})
mean(sims, na.rm = TRUE) #NaN cases are samples without heads, i.e. 0/0
#[1] 0.4054715
k <- 7
sims <- replicate(nsims, {
  x <- sample(0:1, k, TRUE)
  sum(diff(x) == 0L & x[-1] == 1L) / sum(head(x, -1) == 1L)
})
mean(sims, na.rm = TRUE)
#[1] 0.4289402
n <- 7 # number of tosses
p <- 0.5 # probability of heads
sims <- 100 # number of simulations
Prob_post_H <- 0 # frequency of 'head'-s after first 'head'
for(i in 1:sims){
  z <- rbinom(n, 1, p)
  if(sum(z==1) != 0){
    y <- which(z==1)[1]
    Prob_post_H[i] <- mean(z[min(y+1, n):n], na.rm=TRUE)
  }
}
mean(Prob_post_H, na.rm=TRUE)
#[1] 0.495068
It looks like it's right around 50%. We can scale up to see more simulations.
sims <- 10000
#[1] 0.5057866
Still around 50%.
This is to simulate 100 fair coin tosses 30,000 times
counter <- 1
coin <- sum(rbinom(100,1,0.5))
while(counter < 30000){
  coin <- c(coin, sum(rbinom(100,1,0.5)))
  counter <- counter+1
}
Try these after running above variable
Below is some sample R code that simulates fair coin tosses using the sample function. You can modify it to simulate any number of flips. Since the flips are independent, the probability of a head or a tail is always 0.5 for any given flip. Over many coin flips, the probability that at least half of the flips are heads (or tails) converges to 0.5, and the probability of getting exactly half heads and half tails approaches 0.
n <- 7
count_heads <- 0
coin_flip <- sample(c(0,1), n, replace = TRUE)
for(flip_i in 1:n){
  if(coin_flip[flip_i] == 1){
    count_heads = count_heads + 1
  }
}
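The same count can be obtained without an explicit loop. The sketch below compares a loop with a vectorized sum on one set of flips; the names count_loop and count_vec are introduced here for illustration.

```r
set.seed(123)  # assumption: fixed seed, only for reproducibility
n <- 7
coin_flip <- sample(c(0, 1), n, replace = TRUE)  # 1 = heads, 0 = tails

# loop version: increment a counter for every head
count_loop <- 0
for (flip_i in seq_len(n)) {
  if (coin_flip[flip_i] == 1) count_loop <- count_loop + 1
}

# vectorized version: TRUE counts as 1 when summed
count_vec <- sum(coin_flip == 1)

c(loop = count_loop, vectorized = count_vec)
```

For large n the vectorized form is usually preferable in R, since the comparison and the sum are each done in a single call.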
