How to generate such random numbers in R - r

I want to generate bivariate normal draws in the following way. I have four lists of equal length n. The first two lists hold the means and the last two hold the variances.
For example, with n=2 the lists are (1, 2), (3, 4), (5, 6), (7, 8), and I need
matrix(c(rnorm(1, mean=1, sd=sqrt(5)), rnorm(1, mean=2, sd=sqrt(6)), rnorm(1, mean=3, sd=sqrt(7)), rnorm(1, mean=4, sd=sqrt(8))), ncol=2)
How can I do this in R in a more functional way?

Here is one way:
m <- 1:4
s <- 5:8
rnorm(n = 4, mean = m, sd = s)
[1] 4.599257 1.661132 16.987241 3.418957
This works because, like many R functions, rnorm() is 'vectorized', meaning that it allows you to call it once with vectors as arguments, rather than many times in a loop that iterates through the elements of the vectors.
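Your main task, then, is to convert the 'lists' in which you've got your arguments right now into vectors that can be passed to rnorm(). Note that rnorm() takes a standard deviation, so if your second pair of lists holds variances, pass sqrt() of those values as sd.
NOTE: If you want to produce more than one -- let's say 3 -- random variate for each mean/sd combination, rnorm(n=rep(3,4), mean=m, sd=s) will not work, because when n is a vector of length greater than one, rnorm() takes its length as the number of draws. You'll have to either: (a) repeat elements of the m and s vectors like so: rnorm(n=3*4, mean=rep(m, each=3), sd=rep(s, each=3)); or (b) use mapply() as described in DWin's answer.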
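As a minimal sketch of the whole conversion, assuming the four lists from the question are stored under the hypothetical names mean1, mean2, var1 and var2, and that the last two really hold variances (so sqrt() is needed to get standard deviations):

```r
set.seed(1)  # only to make the draws reproducible

# hypothetical names for the four lists in the question
mean1 <- list(1, 2); mean2 <- list(3, 4)
var1  <- list(5, 6); var2  <- list(7, 8)

m <- unlist(c(mean1, mean2))        # 1 2 3 4
s <- sqrt(unlist(c(var1, var2)))    # standard deviations from variances

# one draw per mean/sd pair, filled column-wise into an n x 2 matrix
one_draw <- matrix(rnorm(length(m), mean = m, sd = s), ncol = 2)

# option (a) from the note: k draws per pair; each column of the result
# holds the k draws for one mean/sd pair
k <- 3
k_draws <- matrix(rnorm(k * length(m),
                        mean = rep(m, each = k),
                        sd   = rep(s, each = k)),
                  nrow = k)
```

Here one_draw has the same layout as the matrix written out by hand in the question, and k_draws is a 3 x 4 matrix with one column per mean/sd pair.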

I'm taking you at your word that you have a list, i.e. an R list:
plist <- list( a=list(1, 2), b=list(3, 4), c=list(5, 6), d=list(7, 8))
means <- plist[c("a","b")] # or you could use means <- plist[1:2]
vars <- plist[c("c","d")]
mapply(rnorm, n=rep(1,4), unlist(means), unlist(vars))
#[1] 3.9382147 1.0502025 0.9554021 -7.3591917
You used the term bivariate. Did you really want to have x,y pairs that had a specific correlation?
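If the last two entries are variances, as the question states, rnorm() needs their square roots, since its third argument is a standard deviation. A small variant of the mapply() call above under that assumption (mu, sigma and draws are illustrative names):

```r
plist <- list(a = list(1, 2), b = list(3, 4), c = list(5, 6), d = list(7, 8))
mu    <- unlist(plist[c("a", "b")])
sigma <- sqrt(unlist(plist[c("c", "d")]))   # variances -> standard deviations

set.seed(42)  # only for reproducibility
# n = 1 is recycled, so rnorm() is called once per mean/sd pair
draws <- mapply(rnorm, n = 1, mean = mu, sd = sigma)
```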

Related

K-means iterated for same data for 10 times

I am new to R. I am trying to find out whether I can improve a K-means result (using R) by calling the k-means routine repeatedly, 10 or 15 times, on the same dataset with the same value of K (k=3 in my case) and seeing whether that gives me good results. I see that the clustering changes at every call, and even the total sum of squares and withinss change, but I am not sure how to stop at the best result.
Can anyone guide me?
code:
run_kmeans <- function(xtimes)
{
  for (x in 1:xtimes)
  {
    kmeans_results <- kmeans(filtered_data, 3)
    print(kmeans_results["totss"])
    print(kmeans_results["tot.withinss"])
  }
  return(kmeans_results)
}
kmeans_results = run_kmeans(10)
Not sure I understood your question, because this is not the usual way of selecting the best partition (the usual ways are the elbow method, the silhouette method, etc.).
Let's say you want to find the kmeans partition that minimizes your within-cluster sum of squares.
Let's take the example from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
You could write this to run kmeans repeatedly:
xtimes <- 10
fits <- lapply(seq_len(xtimes), function(i) {
  kmeans(x, 3)
})
lapply is often more convenient than a for loop here, because it returns a list with one fit per run. To extract tot.withinss from each fit and see which one is minimal:
perf <- sapply(fits, function(d) d$tot.withinss)
which.min(perf)
However, unless I misunderstood your objective, this is a strange way to select the best-performing partition. Usually it is the number of clusters that is evaluated, not different partitions produced with the same sample data and the same number of clusters.
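Note that kmeans() has an nstart argument that runs this restart-and-keep-the-best loop internally. A short sketch comparing the two approaches, on simulated data in the spirit of the ?kmeans example (the object names are illustrative):

```r
set.seed(123)  # only for reproducibility
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# manual version: fit 10 times, keep the fit with the smallest tot.withinss
fits <- lapply(seq_len(10), function(i) kmeans(x, centers = 3))
perf <- vapply(fits, function(f) f$tot.withinss, numeric(1))
best_manual <- fits[[which.min(perf)]]

# built-in version: 10 random starts, only the best fit is returned
best_nstart <- kmeans(x, centers = 3, nstart = 10)
```

Either way, the returned object is an ordinary kmeans fit, so best_manual$cluster and best_nstart$cluster can be used directly.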
Edit from your comment
Ok, so you want to find the combination of columns that gives you the best performance. Below is an example where every pair of columns out of three variables is tested. You could generalize this a little, but the number of possible combinations grows quickly with 8 variables, so you should have a routine to reduce the number of tested combinations.
# 150 values per block so that each fills a 50 x 3 matrix exactly
x <- rbind(matrix(rnorm(150, sd = 0.3), ncol = 3),
           matrix(rnorm(150, mean = 1, sd = 0.3), ncol = 3)
)
colnames(x) <- c("x", "y", "z")
combinations <- combn(colnames(x), 2, simplify = FALSE)
# one kmeans fit per pair of columns
fits <- lapply(combinations, function(i) {
  kmeans(x[, i], 3)
})
perf <- sapply(fits, function(d) d$tot.withinss)
which.min(perf)

Calculate each element in a matrix based on elements in two other matrices

I'm trying to populate a matrix with i x j entries from a random normal distribution based on the means and standard deviations stored in two other matrices. Is there a way to use rnorm pulling each entry from the two "data" matrices (the two matrices with the means and standard deviations) without using a loop?
Sure, just do it:
means <- matrix(1:4, 2, 2)
sds <- matrix((1:4)/1000, 2, 2)
result <- matrix(rnorm(4, mean = means, sd = sds), 2, 2)
or (following the comment from Frank below)
result <- array(rnorm(length(means), mean = means, sd = sds),
dim = dim(means))
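One way to convince yourself that each entry is drawn from the matching position in the two parameter matrices is to set every standard deviation to zero, in which case rnorm() returns the mean itself. A small check along those lines (the zero sds matrix is only for illustration):

```r
means <- matrix(1:4, 2, 2)
sds   <- matrix(0, 2, 2)   # zero spread, so each draw equals its mean

result <- array(rnorm(length(means), mean = means, sd = sds),
                dim = dim(means))
all(result == means)       # TRUE: entry [i, j] used means[i, j]
```

This works because R stores matrices column by column, so rnorm() walks through means and sds in the same order that matrix() or array() uses to fill the result.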

Vector of different n values in rbeta

I would like to simultaneously use vectors of different parameter values in rbeta and get out a vector whose length is the sum of the elements of the n vector. For example,
n <- c(10, 20, 30)
alpha <- c(1,2,3)
beta <- c(3,2,1)
rbeta(n, alpha, beta)
The bottom line doesn't do what I would like. I want the output to be a vector of length 10+20+30 = 60, with the first 10 elements being 10 samples from a beta(1,3), the next 20 elements from a beta(2,2) and the next 30 elements from a beta(3,1). What is the best way to do this?
In general when applying a function to the elements of a vector, you’d need to lapply over your input vector:
unlist(lapply(n, rbeta, 2, 1))
However, in your case you can simply sum all the ns:
rbeta(sum(n), 2, 1)
If you have multiple parameters for alpha and beta, you can use Map instead (careful, arguments are inverted compared to lapply):
unlist(Map(rbeta, n, alpha, beta))
For your revised question I think judicious use of rep() will make it work.
n <- c(10, 20, 30)
alpha <- c(1,2,3)
beta <- c(3,2,1)
rbeta(sum(n),rep(alpha,n),rep(beta,n))
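To see that the rep() version lines up draws and parameters as intended, here is a small sketch that tags each draw with the parameter set it came from (the group vector is an illustrative addition, not part of the answer):

```r
set.seed(1)  # only for reproducibility
n     <- c(10, 20, 30)
alpha <- c(1, 2, 3)
beta  <- c(3, 2, 1)

draws <- rbeta(sum(n), rep(alpha, n), rep(beta, n))
group <- rep(seq_along(n), n)   # which parameter set each draw came from

table(group)                    # counts per group: 10, 20, 30
tapply(draws, group, mean)      # sample means per group
```

The group means should be roughly alpha / (alpha + beta), i.e. near 0.25, 0.5 and 0.75, up to sampling noise.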

Vapply command and mvtnorm

I am not familiar with applying a function over a vector in R.
I would like a vector holding the cumulative probability of a bivariate normal as several parameters change value simultaneously, each according to a different function. For example:
library(mvtnorm)
m<-2
corr<-diag(2)
corr[2,1]<-0
vapply(2*1:3,function(x)
pmvnorm(mean=c(2,x),corr,lower=c(-Inf,-Inf), upper=c(1,2)),1)
[1] 7.932763e-02 3.609428e-03 5.024809e-06
This gives the cumulative probability when the mean of the second distribution takes the values 2, 4 and 6.
My problem is that I also want to change the mean of the first distribution at the same time. I can't work out how to write the vapply command with more than one varying argument. What can I do?
Thank you very much
You will need to use mapply for this task:
library(mvtnorm)
corr <- diag(2)
m1 <- c(3, 5, 7)
m2 <- c(2, 4, 6)
mapply(function(x, y)
pmvnorm(mean = c(x, y), corr, lower = c(-Inf, -Inf), upper = c(1, 2)),
m1, m2)
## [1] 1.1375e-02 7.2052e-07 3.1246e-14
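As a sanity check that needs only base R: corr is the identity matrix in this example, so the two components are independent and the joint probability is just the product of two univariate normal probabilities. A sketch under that assumption (it does not apply once corr has non-zero off-diagonal entries):

```r
m1 <- c(3, 5, 7)
m2 <- c(2, 4, 6)

# independent components: P(X1 <= 1, X2 <= 2) = P(X1 <= 1) * P(X2 <= 2)
p <- pnorm(1, mean = m1) * pnorm(2, mean = m2)
signif(p, 5)
```

The values should agree with the mapply()/pmvnorm() output above up to numerical error. pnorm() is vectorized over mean, so no apply-style function is needed for this special case.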

Decile function in R - nested ifelse() statements lead to poor runtime

I wrote a function that calculates the decile of each element in a vector. I am doing this with the intention of creating graphics to evaluate the efficacy of a predictive model. There has to be an easier way to do this, but I haven't been able to figure it out for a while. Does anyone have any idea how I could score a vector in this way without having so many nested ifelse() statements? I included the function as well as some code to reproduce my results.
# function
decile <- function(x){
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)){
deciles[i*10] <- quantile(x, i)
}
return (ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10))))))))))
}
# check functionality
test.df <- data.frame(a = 1:10, b = rnorm(10, 0, 1))
test.df$deciles <- decile(test.df$b)
test.df
# order data frame
test.df[with(test.df, order(b)),]
You can use quantile and findInterval
# find the decile locations
decLocations <- quantile(test.df$b, probs = seq(0.1,0.9,by=0.1))
# use findInterval with -Inf and Inf as lower and upper bounds
findInterval(test.df$b,c(-Inf,decLocations, Inf))
Another solution is to use ecdf(), described in the help files as the inverse of quantile().
round(ecdf(test.df$b)(test.df$b) * 10)
Note that @mnel's solution is around 100 times faster.
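For reference, the quantile() plus findInterval() approach can be wrapped into a drop-in replacement for the original decile() function. This is a minimal sketch, not a benchmarked implementation:

```r
decile <- function(x) {
  # the nine interior decile cut points, without the "10%", "20%", ... names
  breaks <- quantile(x, probs = seq(0.1, 0.9, by = 0.1), names = FALSE)
  # interval index 1..10; -Inf/Inf make sure every value falls in an interval
  findInterval(x, c(-Inf, breaks, Inf))
}

d <- decile(1:100)
table(d)   # ten values in each decile
```

As in the nested ifelse() version, a value is assigned to decile k when it is below the k-th cut point but not below the (k-1)-th.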
