Central limit theorem in R

I am working in R to generate a sample of M = 32000 averages, each calculated by averaging N = 36 independent values of a continuous uniform(0, 1) random variable. The sample is generated as follows:
sampleA <- numeric(32000)
for (i in 1:32000) {
  MuestraAUnif <- runif(36)
  sampleA[i] <- mean(MuestraAUnif)
}
For the generated sample, I am asked to calculate the relative frequency of observed averages greater than L = 0.32 + 4*1/100 and to compare it with the probability (approximated via the central limit theorem) that the average of N values is greater than L, as follows:
L <- 0.32 + 4*1/100
sigma <- sqrt(1/12)  # sd of a uniform(a, b) is (b - a)/sqrt(12)
miu <- 0.5           # mean of a uniform(a, b) is (a + b)/2
greaterA <- sum(sampleA > L)  # 23693 values of the sample are greater than L
xBar <- greaterA/length(sampleA)
X <- sum(sampleA)
n <- 32000
Zn <- (X - n*miu)/(sigma*sqrt(n))
cat("P(xBar >",L,") = P(Z>", Zn, ")=","1 - P (Z < ", Zn,") =",1-pnorm(Zn),"\n")  # print the theoretical probability that the average exceeds L
cat("sum (sampleA >",L,")/","M=", n," for N =", 36,":",xBar, "\n")  # print the observed relative frequency of averages greater than L
The output is:
P(xBar > 0.36 ) = P(Z> -3.961838 )= 1 - P (Z < -3.961838 ) = 0.9999628
sum (sampleA > 0.36 )/ M= 32000 for N = 36 : 0.7377187
My question is: why are the values so far apart? Presumably they should be much closer (0.9999628 is far from 0.7377187). Am I doing something wrong in my implementation? Excuse my English.

Melkhiah66, you did everything right. Just change
MuestraAUnif <- runif(2) to MuestraAUnif <- runif(36)
and it should work.
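For what it's worth, here is a minimal self-contained sketch of the comparison the exercise asks for, applying the CLT to a single average of N = 36 uniforms rather than to the sum of all 32000 averages (same parameters as above; both numbers should land near 0.998):
set.seed(1)
M <- 32000; N <- 36
L <- 0.32 + 4*1/100             # 0.36
mu <- 0.5                       # (a + b)/2
sigma <- sqrt(1/12)             # (b - a)/sqrt(12)
sampleA <- replicate(M, mean(runif(N)))
z <- (L - mu)/(sigma/sqrt(N))   # CLT: the mean of N uniforms is approx. N(mu, sigma^2/N)
c(theoretical = 1 - pnorm(z), empirical = mean(sampleA > L))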

Related

Determining the sample size n needed to reject the null at alpha = 0.01 in R

Create a simulated dataset of 100 observations, where x is a random normal variable with mean 0 and standard deviation 1, and y = 0.1 + 2*x + e, where e is also a random normal error with mean 0 and sd 1.
set.seed(1)
# simulate a data set of 100 observations
x <- rnorm(100)
y.1 <- 0.1 + 2*x + rnorm(100)
Now extract the first 5 observations.
y1.FirstFive <- (y.1[1:5]) # extract first 5 observations from y
x.FirstFive <- (x[1:5]) # extract first 5 observations from x
y1.FirstFive # extracted 5 observations from y1
[1] -1.7732743 0.5094025 -2.4821789 3.4485904 0.1044309
x.FirstFive # extracted 5 observations from x
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
Assuming the mean and sd of the sample that you calculated from the first five observations do not change, what is the minimum total number of observations you would need to be able to conclude that the true mean of the population is different from 0 at the p = 0.01 significance level?
alpha <- 0.01
mu <- 0
for (i in 5:2000) {
  # Recalculate the standard error and CI at sample size i
  stand_err <- Sd_y1 / sqrt(i)
  ci <- sample_mean_y1 + c(qt(alpha/2, i-1), qt(1-alpha/2, i-1))*stand_err
  if (ci[2] < mu)
    break  # condition met, exit loop
}
i
[1] 2000
Here, I wrote a loop that iteratively increases n from the initial n = 5 up to n = 2000, recalculates the CI at each step (given a fixed y-bar and sd), and stops when the null value falls outside the CI. However, I keep getting the wrong output: it is always the maximum of the range I give (here, 2000) instead of the specific minimum sample size n needed to reject the null that mu_y = 0 at the p = 0.01 level. Any suggestions on how to fix the code?
Additional info: the sd of y1.FirstFive is 2.3 and the mean of y1.FirstFive is -0.04.
Assuming:
Sd_y1 = sd(y1.FirstFive)
sample_mean_y1 = mean(y1.FirstFive)
sample_mean_y1
[1] -0.03860587
As pointed out by @jblood94, you need a larger sample size.
You don't need a for loop for this; most of these functions are vectorized, so something like this works:
n = 5:30000
stand_err = Sd_y1 / sqrt(n)
ub = sample_mean_y1 + qt(1-alpha/2, n-1)*stand_err  # CI upper bounds for all candidate n at once
n[min(which(ub < 0))]  # first n whose upper bound falls below 0
[1] 23889
It's because n > 2000.
set.seed(1)
x <- rnorm(100)
y.1 <- 0.1 + 2*x + rnorm(100)
Sd_y1 <- sd(y.1[1:5])
sample_mean_y1 <- mean(y.1[1:5])
alpha <- 0.01
sgn <- 2*(sample_mean_y1 > 0) - 1
f <- function(n) qt(alpha/2, n - 1)*Sd_y1 + sgn*sample_mean_y1*sqrt(n)
upper <- 2
while (f(upper) < 0) upper <- upper*2
(n <- ceiling(uniroot(f, lower = upper/2, upper = upper, tol = 0.5)$root))
#> [1] 23889
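As a quick sanity check (a sketch reusing Sd_y1, sample_mean_y1, and alpha from above), the CI upper bound should cross 0 exactly between n = 23888 and n = 23889:
ub <- function(n) sample_mean_y1 + qt(1 - alpha/2, n - 1)*Sd_y1/sqrt(n)
ub(23888)  # still at or above 0
ub(23889)  # below 0 for the first time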

Why am I getting NAs in this calculation in R?

While working on an Rcpp program, I used the sample() function, which gave me the following error: "NAs not allowed in probability." I traced the issue to the fact that the probability vector I used had NA values in it, but I have no idea how they got there. Below is some R code that reproduces the error:
n.0 <- 20
n.1 <- 20
n.reps <- 1
beta0.vals <- rep(seq(-.3, .1, length.out = n.0), n.reps)
beta1.vals <- rep(seq(-7, 0, length.out = n.1), n.reps)
beta.grd <- as.matrix(expand.grid(beta0.vals, beta1.vals))
n.rnd <- 200
beta.rnd.grd <- cbind(runif(n.rnd, min(beta0.vals), max(beta0.vals)),
                      runif(n.rnd, min(beta1.vals), max(beta1.vals)))
beta.grd <- rbind(beta.grd, beta.rnd.grd)
N = 22670
count = 0
for (i in 1:dim(beta.grd)[1]) {  # iterate through 600 possible beta values in beta grid
  beta.ind = 0  # indicator for current pair of beta values
  for (j in 1:N) {  # iterate through all possible Nsums
    logit = beta.grd[i, 1]/N*(j - .1*N)^2 + beta.grd[i, 2]
    phi01 = exp(logit)/(1 + exp(logit))
    if (is.na(phi01)) {
      count = count + 1
    }
  }
}
cat("Total number of invalid probabilities: ", count)
Here, $\beta_0 \in (-0.3, 0.1), \beta_1 \in (-7, 0), N = 22670, N_\text{sum} \in (1, N)$. Note that $N$ and $N_\text{sum}$ are integers, whereas the beta values may not be.
Since mathematically $\phi_{01} \in (0,1)$, I am assuming the NAs arise because R does not like extremely small values. I am receiving an overwhelming number of NA values, too; more NAs than numbers. Why would I be getting NAs in this code?
Include print(logit) next to count = count + 1 and you will find lots of logit values greater than 1000. exp(1000) is Inf, so you divide Inf by Inf, which gives you NaN, and NaN is NA:
> exp(500)
[1] 1.403592e+217
> Inf/Inf
[1] NaN
> is.na(NaN)
[1] TRUE
So your problem is not numbers that are too small but numbers that are too large, coming from the evaluation of exp(x) with x larger than roughly 709:
> exp(709)
[1] 8.218407e+307
> exp(710)
[1] Inf
Bernhard's answer correctly identifies the problem:
If logit is large, exp(logit) = Inf.
Here is a solution:
count = 0  # reset the counter before rerunning the loop
for (i in 1:dim(beta.grd)[1]) {  # iterate through 600 possible beta values in beta grid
  beta.ind = 0  # indicator for current pair of beta values
  for (j in 1:N) {  # iterate through all possible Nsums
    logit = beta.grd[i, 1]/N*(j - .1*N)^2 + beta.grd[i, 2]
    ## This one isn't great because exp(logit) can be very large:
    # phi01 = exp(logit)/(1 + exp(logit))
    ## So we use the equivalent
    ## phi01 = 1 / (1 + exp(-logit))
    phi01 = plogis(logit)
    if (is.na(phi01)) {
      count = count + 1
    }
  }
}
cat("Total number of invalid probabilities: ", count)
# Total number of invalid probabilities:  0
We can use the more stable form 1 / (1 + exp(-logit)) (to convince yourself of this, multiply the original expression by exp(-logit) / exp(-logit)). Luckily, R also has a built-in function, plogis(), that calculates these probabilities quickly and accurately. You can see from the help file (?plogis) that it evaluates exactly this expression, but you can also double-check yourself:
x = rnorm(1000)
y = 1 / (1 + exp(-x))
z = plogis(x)
all.equal(y, z)
[1] TRUE
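To see the overflow at a single extreme value (a quick sketch): at logit = 800, exp(logit) is Inf and the naive ratio becomes NaN, while the rearranged form and plogis() both return 1:
logit <- 800
exp(logit)/(1 + exp(logit))  # NaN, because Inf/Inf
1/(1 + exp(-logit))          # 1, because exp(-800) underflows to 0
plogis(logit)                # 1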

t-distribution in R

I would like to find the t-value for a 90% confidence interval with 17 observations.
In Excel, I can do this calculation with T.INV.2T(.10, 16) = 1.75; however, in R I cannot find the correct way to get the same result.
qt(p = 1-.9, df = 17-1) = -1.34
qt(p = (1-.9)/2, df = 17-1) = -1.75 # trying two-tailed?
Which R function does the same computation as T.INV.2T in Excel?
Similarly, Excel also has T.DIST.2T; what is the equivalent function in R?
You need the 1 - .1 / 2 = 0.95 quantile from the t-distribution with 17 - 1 = 16 degrees of freedom:
qt(0.95, 16)
# [1] 1.745884
Explanation
Excel describes T.INV.2T as
Returns the two-tailed inverse of the Student's t-distribution
which is the quantile in math talk (though I would never use the term two-tailed quantile). The p% quantile q is defined as the point which satisfies P(X <= q) >= p%.
In R we get that with the function qt (q for quantile, t for t-distribution). Now we just have to sort out what is meant by a two-tailed inverse. It turns out we are looking for the point q which satisfies P(X <= -|q| or X >= |q|) >= .1. Since the t-distribution is symmetrical, this simplifies to P(X >= |q|) >= .1 / 2.
You can easily verify that in R with the use of the probability function pt:
pt(qt(0.05, 16), 16, lower.tail = TRUE) +
pt(qt(0.95, 16), 16, lower.tail = FALSE)
# [1] 0.1
As you correctly guessed, you do it by estimating the two-sided interval (alpha/2 = 0.1/2 = 0.05)
> qt(p = 0.95, df = 16)
[1] 1.745884
So 5% is cut off in each of the upper and lower tails. I don't know Excel, but I am guessing that is what that function is doing.
As for T.DIST.2T, that is, I assume, the two-tailed probability:
pt(-1.745884, df=16, lower.tail=T) +
pt(1.745884, df=16, lower.tail=F)
which is equal to 0.09999994.
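For completeness, R has no built-in with the name T.DIST.2T, but a one-line wrapper over pt() reproduces it (the function name t.dist.2t is just illustrative):
t.dist.2t <- function(t, df) 2*pt(abs(t), df, lower.tail = FALSE)
t.dist.2t(1.745884, 16)
[1] 0.09999994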

Building a formula from a scientific paper

I am trying to build a mathematical formula from a scientific paper in R.
In the example given, I used a value of 164 microns. This was taken from the first result in Table 3 of the paper I have attached. You'll see in this table the current velocities calculated (pretty neat!).
The overall point is that I wish to acquire two things: erosional velocity and depositional velocity. The attached paper does so using the formulae given. I am trying to build a package that can run hundreds of mean grain sizes (i.e. the variable) through these formulae. In an ideal world, my main goal is code built from the given formulae that takes the variable (mean grain size) and spits out lovely data. I think it is possible, but unfortunately my R skills are inadequate.
Link to formulae: https://imgur.com/a/DEN721v?
Link to the original scientific paper: https://link.springer.com/article/10.1007/s00531-008-0312-5
There are 5 equations, all of which feed into each other. The outcome depends on one variable, which I input at the start.
I am given four known values:
p (water density, given as m^3),
ps (grain density, given as m^3),
g (acceleration due to gravity, given as m/s^2),
v (kinematic viscosity of water, given as m^2/s),
and one variable, d, the mean grain size of a sediment sample. For example, a mean grain size of 164 μm would be input as 1.64e-4.
Seeking help, as my outcomes are definitely not even close to what they should be.
p <- 1027.4 #water density (m^3)
ps <- 2650 #grain density (m^3)
g <- 9.81 #acceleration due to gravity (m/s^2)
v <- 1.4313e-6 #kinematic viscosity of water (m^2/s)
z100 <- 100 #level above seabed (cms)
d <- 1.64e-4 #variable (mean grain size in microns)
EQUATION 1
D1 <- 9.81*(ps-p)
D2 <- (p*v)^2
D3 <- (D1/D2)
D4 <- D3^(1/3)
D5 <- D4*d
D <- D5 #Dimensionless grain size
D
EQUATION 2.3
1 - exp(-0.001374634317)
Tcr1 <- -0.020*D
Tcr2 <- 1 - exp(Tcr1)
Tcr3 <- 0.055*Tcr2
Tcr4 <- 0.30/1+(1.2*D)
Tcr5 <- Tcr4 + Tcr3
Tcr6 <- 9.81*(ps-p)
Tcr7 <- Tcr6*d
Tcr8 <- Tcr7*Tcr5
Tcr <- Tcr8 #threshold bed shear stress (N/m^2)
exp(Tcr1)
Tcr
Ucr1 <- Tcr/p
Ucr2 <- sqrt(Ucr1)
Ucr <- Ucr2 #critical shear velocity
EQUATION 3
z0 <- d/12 #roughness length
z0
EQUATION 4
Ue1 <- z100/z0
Ue2 <- Ucr/0.41
Ue3 <- log(Ue1)
Ue4 <- Ue2*Ue3
Ue <- Ue4 # critical current velocity erosional threshold from particle size distribution
Ue
EQUATION 5
Usetl1 <- 10.36^(2)
Usetl2 <- D^(3)
Usetl3 <- 1.049*Usetl2
Usetl4 <- Usetl1 + Usetl3
Usetl5 <- Usetl4^(1/2)
Usetl6 <- Usetl5 - 10.36
Usetl7 <- v/d
Usetl8 <- Usetl7*Usetl6
Usetl <- Usetl8
Results are given in cm/s and should be roughly 20-50 cm/s.
OK, so let us try it from the start.
p <- 1027.4 #water density (m^3)
ps <- 2650 #grain density (m^3)
g <- 9.81 #acceleration due to gravity (m/s^2)
v <- 1.4313e-6 #kinematic viscosity of water (m^2/s)
z100 <- 100 #level above seabed (cms)
d <- 1.64e-4 #variable (mean grain size in microns)
This last value is incorrect if the mean grain size is supposed to be in microns. It is in meters.
D <- d * (g * (ps - p) / (p * v^2))^(1/3)
The result is 3.22. There was an error in your formula; (p * v)^2 instead of p * (v^2).
Tcr <- g * (ps - p) * d * (.3 / (1 + 1.2 * D) + .055 * (1 - exp(-.02 * D)) )
Again, there was an error in your formula: .3/1 + 1.2 * D instead of .3/(1 + 1.2 * D). The result is .17.
Ucr <- sqrt(Tcr / p)
Result is .01.
z0 = d / 12
Result is 1.37E-5.
Ue <- Ucr / .41 * log(z100 / z0)
Result is .50. Not sure why we calculate it, however. Are we supposed to compare it with Uset?
Uset <- v / d * ( sqrt(10.36^2 + 1.049 * D^3) - 10.36)
Result is .01 (.0137014).
This is not what you say you should get, but it is also different from what you are getting. Also, assuming it is not centimeters but meters per second, it is about 1 cm per second.
Now, let us check the units. First, you need to be more careful when specifying units: water and grain density is not m^3; it is kg * m^-3.
First, D. The p's cancel (they appear in both the numerator and the denominator):
m * (m * s^-2 / (m^4 * s^-2))^(1/3) =
m * (1/m^3)^(1/3) = m / m = 1
OK, unitless.
Next, Tcr: the whole right part of the formula is unitless (it depends only on D). The rest is
m * s^-2 * kg * m^-3 * m = (m * kg * s^-2) * m^-2 = N / m^2.
Also fine.
Finally, the formula for Uset. Again, the right part of the right side of the equation is unitless. The rest is
m^2 * s^-1 / m = m / s
At least the units check out.
Hope this helps (somehow).
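Since the stated goal is to push hundreds of mean grain sizes through these formulae, the corrected equations can be collected into one function that accepts a vector of d values. A sketch under the same constants as above (sediment_velocities is just an illustrative name; units follow the corrections above: densities in kg m^-3, d in metres, velocities in m/s):
sediment_velocities <- function(d,
                                p = 1027.4,      # water density (kg m^-3)
                                ps = 2650,       # grain density (kg m^-3)
                                g = 9.81,        # gravity (m s^-2)
                                v = 1.4313e-6,   # kinematic viscosity (m^2 s^-1)
                                z100 = 100) {    # level above seabed
  D    <- d*(g*(ps - p)/(p*v^2))^(1/3)                            # eq. 1: dimensionless grain size
  Tcr  <- g*(ps - p)*d*(.3/(1 + 1.2*D) + .055*(1 - exp(-.02*D)))  # eq. 2: threshold bed shear stress
  Ucr  <- sqrt(Tcr/p)                                             # critical shear velocity
  z0   <- d/12                                                    # eq. 3: roughness length
  Ue   <- Ucr/.41*log(z100/z0)                                    # eq. 4: erosional velocity
  Uset <- v/d*(sqrt(10.36^2 + 1.049*D^3) - 10.36)                 # eq. 5: settling velocity
  data.frame(d = d, D = D, Tcr = Tcr, Ucr = Ucr, Ue = Ue, Uset = Uset)
}
sediment_velocities(c(1.64e-4, 2e-4, 2.5e-4))  # e.g. several grain sizes at once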

Simulation and apply functions in matrix, R

I have a couple of questions about the code shown below. The function func1 returns a vector of 15 simulated values; replicating it 50 times gives the matrix M, and M2, its transpose, has 50 rows and 15 columns. W0 is the initial value for the next part of the code. If I run the function Rowresult, it also gives me a 50 x 15 matrix.
My first question: if I want to run the Rowresult function for different W0 values, such as W0 = 10, 20, 30, and get three 50 x 15 matrices as results, one per W0 value, how can I achieve that?
My second question: if you run my code, you will see a matrix called wealth_result2. Once I have this big 50 x 15 matrix, I would like to divide it into three matrices of equal size, 50 x 5 each (they share the same rows but take different columns: the first takes columns 1-5, the second columns 6-10, the third columns 11-15), and then count how many positive rows (rows with all entries positive) there are in each 50 x 5 matrix. How can I achieve this?
N <- 15
func1 <- function(N) {
  alpha1 <- 8.439e-02
  beta1 <- 8.352e-01
  mu <- 7.483e-03
  omega <- 1.343e-04
  X_0 <- -3.092031e-02
  sigma_0 <- 0.03573968
  eps <- rt(N, 7.433e+00)
  # loops
  Xn <- numeric(N)
  sigma <- numeric(N)
  sigma[1] <- sigma_0
  Xn[1] <- X_0
  for (t in 2:N) {
    sigma[t] <- sqrt(omega + alpha1*(Xn[t-1])^2 + beta1*(sigma[t-1])^2)
    Xn[t] <- sigma[t]*eps[t]
  }
  Y <- mu + Xn
  Y
}
# simulate 50 paths: M is a 15 x 50 matrix
M <- replicate(50, func1(N))
# M2 is its 50 x 15 transpose
M2 <- t(M)
View(M2)
# wealth with initial wealth 10
W0=10
# other W0 values to try: 10, 20, 30, 40
r= c(0.101309031, -0.035665516, -0.037377270, -0.005928941, 0.036612849,
0.062404039, 0.124240950, -0.034843633, 0.004770613, 0.005018101,
0.097685945, -0.090660099, 0.004863099, 0.029215984, 0.020835366)
Rowresult <- function(r) {
  const <- exp(cumsum(r))
  exp.cum <- cumsum(1/const)
  wealth <- const*(W0 - exp.cum)
  wealth
}
# wealth matrix
wealth_result <-apply(M2,1,Rowresult)
wealth_result2 <-t(wealth_result )
View(wealth_result2)
This delivers the counts of positive entries in each 50 x 5 block (not quite the counts of all-positive rows):
> sapply(1:3, function(m) sum( rowSums( wealth_result2[ , (1:5)+(m-1)*5 ] >0 )) )
[1] 250 230 2
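To count the rows in which all five entries are positive, which is what the second question literally asks for, compare each row's count of positive entries with 5 (a sketch; the resulting counts depend on the random draw, since no seed is set):
> sapply(1:3, function(m) sum( rowSums( wealth_result2[ , (1:5)+(m-1)*5 ] > 0 ) == 5 ) )
For the first question, one wealth matrix per starting wealth, a minimal sketch is to make W0 an explicit argument and lapply over the desired values (Rowresult2 and wealth_by_W0 are illustrative names):
Rowresult2 <- function(r, W0) {
  const <- exp(cumsum(r))
  exp.cum <- cumsum(1/const)
  const*(W0 - exp.cum)
}
# one 50 x 15 wealth matrix for each starting wealth: W0 = 10, 20, 30
wealth_by_W0 <- lapply(c(10, 20, 30),
                       function(w) t(apply(M2, 1, Rowresult2, W0 = w)))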
