Estimate a probability from elements of a list in R

I have 100,000 simulated values of T in R (min 1.5, max 88.8) and I want to estimate the probability that T falls between 10 and 50.
I simulated 100,000 values of T, where T = t(y) %*% M %*% y, M is an 8x8 matrix of constant values and y is an 8x1 matrix. The i-th element of y equals a_i + b_i, where a is a vector of constants and b is a vector whose elements are independent draws from a Normal(0, sd = 2) distribution (each element is a separate simulated value from N(0, 2)).

Is it in a vector or a list? If it's a vector, the following should work. If it's in a list, you may use unlist() to convert it to a vector.
mylist <- runif(100000, 1.5, 88.8)  # just to generate an example numeric vector
length(which(mylist >= 10 & mylist <= 50)) / length(mylist)  # proportion of values in [10, 50]

set.seed(42)
myrandoms <- rnorm(100000, mean=5, sd=2)
mydistr <- ecdf(myrandoms)
#probability of being between 1 and 3:
diff(mydistr(c(1, 3)))
#[1] 0.13781
#compare with normal distribution
diff(pnorm(c(1, 3), mean=5, sd=2))
#[1] 0.1359051
If you really have a list, use myrandoms <- do.call(c, mylist) to make it a vector.
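For the quadratic form described in the question, a minimal end-to-end sketch might look like this (M and a are placeholders, since their actual values are not given); the last line is the empirical estimate of P(10 <= T <= 50):
set.seed(1)
# Placeholder inputs: substitute your own 8x8 matrix M and constant vector a
M <- crossprod(matrix(rnorm(64), 8, 8))   # some symmetric 8x8 matrix
a <- rnorm(8)                             # some constant vector of length 8
T_sim <- replicate(100000, {
  y <- a + rnorm(8, mean = 0, sd = 2)     # y_i = a_i + b_i, with b_i ~ N(0, sd = 2)
  drop(t(y) %*% M %*% y)                  # T = y' M y
})
mean(T_sim >= 10 & T_sim <= 50)           # estimated P(10 <= T <= 50)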

Related

How can I code this equation with double summation in R?

So I'm having a hard time coding the above equation, essentially b0 = (1/n) * sum_i ( y_i - sum_j x_ij * b_j ), mainly the part with the double sum over i and over j.
In my case, n = 200 and p = 15. The y_i values are in a vector Y = (y1, y2, ..., yn) of length 200, the x_ij values are in a matrix with 200 rows and 15 columns, and the b_j values are in a vector of length 15.
My own solution, which I'm fairly certain is wrong, is this:
b0 <- 1/200 * sum(Y - sum(matr*b))
And here is code which you can use to reproduce my vectors and matrix:
library(MASS)  # for mvrnorm
matr <- t(mvrnorm(15, mu = rep(0, 200), diag(1, nrow = 200)))
Y <- rnorm(n = 200)
b <- rnorm(n = 15)
Use matrix multiplication (here y corresponds to your Y and x to your matr):
mean(y - x %*% b)
Note that if y and x are known and b is the least squares regression estimate of the coefficients then we can write it as:
fm <- lm(y ~ x + 0)
mean(resid(fm))
and that necessarily equals 0 if there is an intercept, i.e. a constant column in x, since the residual vector must be orthogonal to the column space of x, and taking the mean is the same as taking the inner product of the residuals with a vector whose elements are all equal to 1/n.
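As an illustrative check (not part of the original answer), using the simulated Y, matr and b from the question: the literal double sum matches the vectorized form, and the mean residual is numerically zero once an intercept is included.
# literal double sum: (1/n) * sum_i ( Y_i - sum_j matr_ij * b_j )
b0_loop <- mean(sapply(1:200, function(i) Y[i] - sum(matr[i, ] * b)))
# vectorized equivalent
b0_vec <- mean(Y - matr %*% b)
all.equal(b0_loop, b0_vec)   # TRUE
# with an intercept, the mean residual is (numerically) zero
fm2 <- lm(Y ~ matr)
mean(resid(fm2))             # ~ 0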

R code Gaussian mixture -- numerical expression has 2 elements: only the first used

I'm trying to create a Gaussian Mix function according to these parameters:
For each sample, roll a die with k sides
If the j-th side appears from the roll, draw a sample from Normal(muj, sdj) where muj and sdj are the mean and standard deviation for the j-th Normal distribution respectively. This means you should have k different Normal distributions to choose from. Note that muj is the mathematical form of referring to the j-th element in a vector called mus.
The resulting sample from this Normal is then from a Gaussian Mixture.
Where:
n, an integer that represents the number of independent samples you want from this random variable
mus, a numeric vector with length k
sds, a numeric vector with length k
prob, a numeric vector with length k that indicates the probability of choosing each of the different Gaussians. This should default to NULL.
This is what I came up with so far:
n <- c(1)
mus <- c()
sds <- c()
prob <- c()
rgaussmix <- function(n, mus, sds, prob = NULL){
  if(length(mus) != length(sds)){
    stop("mus and sds have different lengths")
  }
  for(i in 1:seq_len(n)){
    if(is.null(prob)){
      rolls <- c(NA, n)
      rolls <- sample(c(1:length(mus)), n, replace=TRUE)
      avg <- rnorm(length(rolls), mean=mus[rolls], sd=sds[rolls])
    }else{
      rolls <- c(NA, n)
      rolls <- sample(c(1:length(mus), n, replace=TRUE, p=prob))
      avg <- rnorm(length(rolls), mean=mus[rolls], sd=sds[rolls])
    }
  }
  return(avg)
}
rgaussmix(2, 1:3, 1:3)
It seems to match most of the requirements, but it keeps giving me the following messages:
numerical expression has 2 elements: only the first used
number of items to replace is not a multiple of replacement length
I've tried looking at the lengths of multiple variables, but I can't seem to figure out where the error is coming from!
Could someone please help me?
If you do seq_len(2) it gives you:
[1] 1 2
And you cannot do 1:(1:2); that doesn't make sense, so R uses only the first element and warns about it.
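To see where that first warning comes from, here is a minimal reproduction of 1:seq_len(n) with n = 2 (just an illustration):
1:seq_len(2)
# [1] 1
# Warning message:
# In 1:seq_len(2) : numerical expression has 2 elements: only the first used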
Also, you can avoid the loop in your code by drawing all the samples you need at once, since rnorm is vectorized over its mean and sd arguments. For example, if you do:
rnorm(3, c(0, 10, 20), 1)
[1] -0.507961  8.568335 20.279245
it gives you the 1st sample from the 1st mean, the 2nd sample from the 2nd mean, and so on. So you can simplify your function to:
rgaussmix <- function(n, mus, sds, prob = NULL){
  if(length(mus) != length(sds)){
    stop("mus and sds have different lengths")
  }
  if(is.null(prob)){
    prob <- rep(1/length(mus), length(mus))  # equal weights by default
  }
  rolls <- sample(length(mus), n, replace = TRUE, prob = prob)
  avg <- rnorm(n, mean = mus[rolls], sd = sds[rolls])
  avg
}
You can plot the results:
plot(density(rgaussmix(10000,c(0,5,10),c(1,1,1))),main="mixture of 0,5,10")
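To see the effect of the prob argument, an illustrative call (not from the original answer) with hypothetical unequal weights might be:
set.seed(1)
plot(density(rgaussmix(10000, c(0, 5, 10), c(1, 1, 1), prob = c(0.7, 0.2, 0.1))),
     main = "mixture of 0, 5, 10 with weights 0.7/0.2/0.1")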

How can I simulate m random samples of size n from a given distribution with R?

I know how to generate a random sample of size n from a standard statistical distribution, say exponential. But if I want to generate m such random samples of size n (i.e. m vectors of dimension n) how can I do it?
To create an n-by-m matrix containing m samples of size n, you can use:
x <- replicate(m, rnorm(n, ...))
Substitute rnorm with another distribution's random generator if desired. If you then want to store these as separate individual vectors, you can use
v <- x[ , i]
This puts the i-th column of x into v, which corresponds to the i-th sample. Alternatively, it may be easier to just use a simple for loop:
for(i in 1:m){
  name <- paste("V", i, sep = "")
  assign(name, rnorm(n, ...))
}
This generates a random sample at each iteration, and for stage i, names the sample Vi. By the end of it you'll have m random samples named V1, V2, ..., Vm.
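As a concrete sketch (using the exponential distribution mentioned in the question, with illustrative values m = 5 and n = 10), the replicate approach looks like this:
set.seed(123)
m <- 5    # number of samples (illustrative)
n <- 10   # size of each sample (illustrative)
x <- replicate(m, rexp(n, rate = 1))   # n-by-m matrix, one sample per column
dim(x)    # 10  5
x[, 3]    # the third sample
# or keep the samples in a list instead of a matrix
samples <- lapply(seq_len(m), function(i) rexp(n, rate = 1))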

Using a loop to create matrices in R

I'm trying to do leave-one-out cross-validation on a relatively small dataset (n = 22, p = 17) for a linear regression fit with the LARS algorithm. Essentially I need to create n matrices of standardized data (each column consists of entries centered by the column mean and standardized by the column SD).
I've never used lists before, but would be open to making lists as long as columns of the different matrices can be manipulated/standardized.
Here's what I tried in R:
for (i in 1:n)
{
  x.standardized.i <- matrix(data = NA, nrow = (n-1), ncol = p)  # creates n matrices, all (n-1) x p
  for (j in 1:p)
  {
    x.standardized.i[,j] <- ((x[-i,j] - mean(x[-i,j])) / sd(x[-i,j]))  # standardizes the p variables with the i-th row missing (i runs from 1 to n)
  }
}
I'm not sure if I can share the data, since it's related to grades from a class, but when I run the code it goes through the loop and I end up with only a single standardized matrix (the one with the last row missing) stored as x.standardized.i, because each iteration overwrites the previous one.
You can do this quite simply with sapply and scale:
# Create dummy data
m <- matrix(runif(200), ncol=10)
# Leave each row out in turn, and scale each column
A <- sapply(seq_len(nrow(m)), function(i) scale(m[-i, ]), simplify='array')
By default, scale centres each column on its mean, and divides by its sd.
For the example above, you'll end up with an array with 19 rows, 10 columns and 20 slices.
To access particular slices (i.e. cross-validation training folds), you can subset like this:
A[,, 1] # all rows, all cols, first slice
A[,, 10] # all rows, all cols, tenth slice
To confirm that columns are centred on their mean and standardised by one sd:
apply(A, c(2, 3), mean)
apply(A, c(2, 3), sd)
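Since the question mentions being open to lists, here is an equivalent sketch (my own variant, not part of the answer above) that keeps each leave-one-out matrix as a list element instead of an array slice:
folds <- lapply(seq_len(nrow(m)), function(i) scale(m[-i, ]))
folds[[1]]      # the training matrix with row 1 left out
length(folds)   # 20 matrices, each 19 x 10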

generate random integers between two values with a given probability using R

I have the following four number sets:
A=[1,207];
B=[208,386];
C=[387,486];
D=[487,586].
I need to generate 20000 random numbers between 1 and 586 in which the probability that the generated number belongs to A is 1/2 and to B,C,D is 1/6.
How can I do this using R?
You can directly use sample, more specifically its prob argument. Just divide the probability over all 586 numbers: each number in category A gets weight 0.5/207, and so on.
A <- 1:207
B <- 208:386
C <- 387:486
D <- 487:586
L <- sapply(list(A, B, C, D), length)
x <- sample(c(A, B, C, D),
            size = 20000,
            prob = rep(c(1/2, 1/6, 1/6, 1/6) / L, L),
            replace = TRUE)
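As a quick sanity check (illustrative only), the realized proportions should be close to the target weights:
mean(x %in% A)   # roughly 0.5
mean(x %in% B)   # roughly 1/6
mean(x %in% C)   # roughly 1/6
mean(x %in% D)   # roughly 1/6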
I would say use the roulette selection method. I will try to give a brief explanation here.
Take a line segment of length 1. Break it into pieces in proportion to the probability values: in our case the first piece has length 1/2 and the next three pieces have length 1/6 each. Now sample a number between 0 and 1 from the uniform distribution. Since every point is equally likely, the probability that the sampled number falls in a given piece equals the length of that piece. Whichever piece the number lands in, sample from the corresponding vector. (The R code below lets you run this for a large number of draws to check that it behaves as described; I might not be doing a good job of explaining it here.)
It is called roulette selection because another analogy for the same situation is a circle split into sectors whose angles are proportional to the probability values: sample a uniform number, see which sector it falls in, and sample from that vector, which again selects each vector with the desired probability.
A <- 1:207
B <- 208:386
C <- 387:486
D <- 487:586
cumList <- list(A, B, C, D)
probVec <- c(1/2, 1/6, 1/6, 1/6)
cumProbVec <- cumsum(probVec)
ret <- NULL
for(i in 1:20000){
  rand <- runif(1)
  whichVec <- which(rand < cumProbVec)[1]
  ret <- c(ret, sample(cumList[[whichVec]], 1))
}
#Testing the results
length(which(ret %in% A)) # Almost 1/2*20000 of the values
length(which(ret %in% B)) # Almost 1/6*20000 of the values
length(which(ret %in% C)) # Almost 1/6*20000 of the values
length(which(ret %in% D)) # Almost 1/6*20000 of the values
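For 20,000 draws, growing ret with c() inside the loop is slow. A vectorized sketch of the same roulette idea (my own variant, not from the answer above) can use findInterval on the cumulative probabilities:
set.seed(1)
u <- runif(20000)
piece <- findInterval(u, c(0, cumProbVec), rightmost.closed = TRUE)  # which segment each u lands in
ret2 <- sapply(piece, function(k) sample(cumList[[k]], 1))
prop.table(table(piece))   # roughly 0.5, 0.167, 0.167, 0.167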
