Many R functions for simulating from probability distributions are vectorised. The help page ?rmultinom says that dmultinom is not vectorized, so I assume rmultinom is not either. What is the most efficient way to execute rmultinom repeatedly across a set of probability vectors?
For example:
p <- matrix(c(0.1, 0.2, 0.3, 0.4, 0.2, 0.3, 0.4, 0.1, 0.3, 0.4, 0.2, 0.1), nrow = 3, ncol = 4, byrow = TRUE)
p is a 3 x 4 matrix of probabilities that sum to one in each row. The goal is now to create n samples of size size for each row. For simplicity, use n = 1 and size = 1, i.e. the categorical distribution.
rmultinom(1, 1, p) gives a 12 x 1 matrix. The desired result, though, is a 4 x 3 matrix in which exactly one element per column equals 1.
A for loop is possible but seems inefficient. Is there a better way to achieve this (for large matrices p)?
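For reference, a minimal sketch of the row-wise version in question (apply still loops internally, so this only illustrates the desired 4 x 3 output, not a truly vectorised solution):
draws <- apply(p, 1, function(prob) rmultinom(1, size = 1, prob = prob))
draws  # 4 x 3 matrix with exactly one 1 per column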
Related
I am simulating some draws using random numbers. Unfortunately, the generated numbers are not as random as I would like. In fact, I find that there are some linear combinations among them.
In detail, I have the following starting data:
start_vector = c(1,10,30,40,50,100) # length equal to 6
residual_of_model = 5
n = 1000 # Number of simulations
I try to simulate n observations from a normal distribution for each element of start_vector, treating each draw as random noise added to the original value (the one in start_vector):
out_vec <- matrix(NA, nrow = n, ncol = length(start_vector))
for (h_aux in 1:length(start_vector))
{
random_noise <- rnorm(n, 0, residual_of_model)
out_vec[,h_aux] <- as.numeric(start_vector[h_aux]) + random_noise
}
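For reference, a sketch of an equivalent construction without the loop (the individual draws differ from the loop version because rnorm is called once for all columns, but the distribution is the same):
noise <- rnorm(n * length(start_vector), 0, residual_of_model)
out_vec2 <- matrix(rep(start_vector, each = n) + noise, nrow = n)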
At this point, I obtain a matrix of size 1000 x 6. In theory, I would assume all the columns and all the rows of the matrix are linearly independent of one another.
If I check this using the findLinearCombos() function from the caret package, I find that all the columns are indeed independent:
caret::findLinearCombos(out_vec)
If I try to evaluate the independence among the rows, using the following code:
caret::findLinearCombos(t(out_vec))
I obtain that all the rows from 7 to 1000 are a linear combination of the first 6 (the length of start_vector).
This seems really strange to me; I would expect to see no dependencies at all, since each row is generated by adding random noise drawn with rnorm.
What am I missing? Is there some bug? Thanks in advance!
I'm not aware of any direct commands to do this in R. Any inputs?
To make a 3x3 matrix, do this:
matrix(something, nrow=3, ncol=3)
But you need to replace something with however you want to generate the "arbitrary" numbers. Use runif(9) for 9 random (uniformly distributed) real numbers between 0 and 1. Use sample(1:100, 9, TRUE) to draw 9 numbers from the integers 1 through 100 with replacement. Use rnorm(9) to draw 9 numbers from a standard normal distribution. Etc.
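Putting those together, each of the following fills a 3x3 matrix; which generator you use is up to you:
matrix(runif(9), nrow = 3, ncol = 3)                          # uniform reals between 0 and 1
matrix(sample(1:100, 9, replace = TRUE), nrow = 3, ncol = 3)  # integers 1 to 100, with replacement
matrix(rnorm(9), nrow = 3, ncol = 3)                          # standard normal draws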
I have a vector epsilon of length N. I am applying the function bw.CDF.pi(x, pilot="UCV") from the sROC package to compute bandwidths for cdf Kernel estimation.
My goal is to apply this bandwidth function to every initial subvector of epsilon. Stated otherwise, I would like to apply the function to the first value of epsilon, then to the first two values, then to the first three values, continuing until the function has been applied to the whole vector epsilon. In the end I want N values for the bandwidth.
How can I accomplish this?
Apparently you need a vector of at least 2 elements for the function bw.CDF.pi to run. If you want to run it for the first 2 elements of a vector, then the first 3, and so on, you can do the following. Note that the data example is the one from the function's help page.
library(sROC)
set.seed(100)
n <- 200
x <- c(rnorm(n/2, mean=-2, sd=1), rnorm(n/2, mean=3, sd=0.8))
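# bandwidth for x[1:2], then x[1:3], ..., up to the full vector x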
lapply(seq_along(x)[-1], function(m) bw.CDF.pi(x[seq_len(m)], pilot="UCV"))
p1 <- c(.25,.025,.025,.1,.2,.4)
T <- sample(1:6,size=N,replace=TRUE, prob=someprobabilityvector)
Y <- rbinom(N,1,p1[c(T)])
Hi folks, I am new to R and programming in general and need some help with understanding something basic. Could someone explain what is happening in the vector Y above? I have figured out what p1[c(T)] does, but I have no idea what the vector Y is doing. All help is appreciated in advance.
The first line of your code creates a vector of six probabilities:
p1 <- c(.25,.025,.025,.1,.2,.4)
In the second line, you randomly choose N values from the numbers one to six (with replacement). The probability of each value is specified in someprobabilityvector. Hence, the function returns a vector of length N containing values between 1 and 6:
T <- sample(1:6,size=N,replace=TRUE, prob=someprobabilityvector)
In the third line, N random numbers are generated from a binomial distribution with one trial each and success probabilities given by p1[c(T)]. c(T) is the same as T: the vector containing values from 1 to 6. That vector is used to index the vector p1, so p1[c(T)] returns a vector of N values taken from p1.
Y <- rbinom(N,1,p1[c(T)])
Since the specified binomial distribution has only one trial, the vector Y will contain only zeroes and ones.
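A minimal runnable sketch of the three lines, with placeholder choices for N and someprobabilityvector (neither is given in the question; the uniform probabilities below are just for illustration):
set.seed(1)
N <- 10                               # assumed sample size
someprobabilityvector <- rep(1/6, 6)  # assumed: equal probability for each category
p1 <- c(.25, .025, .025, .1, .2, .4)
T <- sample(1:6, size = N, replace = TRUE, prob = someprobabilityvector)
Y <- rbinom(N, 1, p1[T])              # one Bernoulli draw per element, success probability p1[T[i]]
Y                                     # N zeroes and ones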
I have the number of samples per unit and need to calculate statistics with R.
The table is like this (all rows and columns are actually filled with values; I only write a few here for easier readability, and there are many more columns):
Hour       1   2   3   4
H1        72  11  98  65
H2        19  27
H3
H4
H5
:
H200000
I.e. in the first hour (H1) there were 72 samples of value 1, 11 samples of value 2, etc. In the second hour (H2) there were 19 samples of value 1, 27 samples of value 2, etc.
I need to calculate the mean and standard deviation per hour (i.e. per row). As there are many thousands of rows I need a fast method.
Example: The manual mean-calculation for hour 1 (H1) would be:
(72x1 + 11x2 + 98x3 + 65x4)/(72+11+98+65) = 2.6
I suppose there are R methods or packages that can do this, but I fail to find them. Your support is highly appreciated.
Thanks,
Chris
You want to calculate a weighted mean, so you need weighted.mean. For the first row:
values <- c(1, 2, 3, 4)
weights <- c(72, 11, 98, 65)
weighted.mean(values, weights)
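# ≈ 2.634, the 2.6 computed by hand for H1 above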
The weighted standard deviation is not uniquely defined. One option is a hand-rolled weighted RMS of the deviations from the weighted mean (but this assumes that your input sample really comes from a single Gaussian, i.e. there are no outliers -- not sure if that's the case for your example).
# same values and weights as above
m <- weighted.mean(values, weights)
sqrt(sum(weights * (values - m)^2) / sum(weights))
You should read your data into a table and iterate over every row. Also, "many thousands of rows" is not necessarily a large number for such a simple calculation. This is very basic stuff, maybe checking out a tutorial would also be beneficial.
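For example, a minimal sketch of reading such a table, assuming a whitespace-separated file (the file name and the Hour label column are just placeholders):
dat <- read.table("samples.txt", header = TRUE)  # hypothetical file name
X <- as.matrix(dat[, -1])                        # drop the Hour column, keep the counts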
You are much better off (i.e. faster calculations) using matrix operations instead of applying a function row by row. For example, assuming X is the matrix containing your counts, you can get the weighted means as follows:
vals <- 1:ncol(X)                     # the values 1, 2, 3, ... (the column headings)
wmeans <- (X %*% vals) / rowSums(X)   # row-wise weighted means: sum(count * value) / sum(count)
Assuming your table is a matrix called dataset with 200000 rows, and that the values 1, 2, 3, ... correspond to its columns, you just need to do:
# The 1 as 2nd parameter indicates to apply the function on the rows;
# each row holds the counts, which serve as the weights
values <- 1:ncol(dataset)
w.means <- apply(dataset, 1, function(counts) weighted.mean(values, counts))