R: Applying a function to every element of a matrix using elements of a different matrix as function input

I wish to apply a custom function to each element of a matrix whilst also using elements of a different matrix as inputs to the function.
Specifically, my function generates random samples from a von Mises distribution (circular normal distribution), calling the Rfast package's rvonmises function.
I have one matrix (radians) which records the angle I wish to use for the central tendency of the random generation (similar to the mean), and another matrix (kappa) which records the concentration parameter of the von Mises I wish to use (similar to standard deviation).
I wish to use (for example) element [1, 1] of the radians matrix together with element [1, 1] of the kappa matrix in a call to the von Mises random generator. So, my call for one element would be:
rvonmises(n = 1, m = radians[1, 1], k = kappa[1, 1])
But of course I want this applied across all elements of the matrices. (The rvonmises function doesn't accept multiple m or k values, so for example I couldn't use rvonmises(4, m = c(1, 2, 3, 4), k = c(1, 1.2, 1.4, 1.6)).)
To summarise: I am basically after a more principled (and faster!) way of doing this:
result <- matrix(nrow = nrow(radians), ncol = ncol(radians))
for (i in 1:nrow(radians)) {
  for (j in 1:ncol(radians)) {
    result[i, j] <- Rfast::rvonmises(1, radians[i, j], kappa[i, j])
  }
}
What I have tried
Based on this post, I have tried to use mapply:
library(Rfast)
set.seed(42)
# random radians to use as input
radians <- matrix(data = runif(12, 0, 2 * pi), ncol = 4)
# random concentration parameters of the von Mises distribution
kappa <- matrix(data = rgamma(12, 70, 30), ncol = 4)
# function to generate a random von Mises sample with angle m and
# concentration parameter k
my_function <- function(m, k) {
  Rfast::rvonmises(1, m, k)
}
# my attempt
out <- matrix(mapply(my_function, m = as.data.frame(radians), k = kappa),
              ncol = 4, byrow = TRUE)
However, I don't think this is working. For example, I test it with the following (where the central tendency in test_radians increases steadily, and I use large values for kappa, which leads to precise estimates):
test_radians <- matrix(data = seq(from = 1, to = 2 * pi, length.out = 12),
                       ncol = 4)
test_kappa <- matrix(data = rep(20, times = 12), ncol = 4)
test <- matrix(mapply(my_function, m = as.data.frame(test_radians),
                      k = test_kappa),
               ncol = 4, byrow = TRUE)
test[1, 1] should be smallest (on average), and test[3, 4] should be largest. (I know that due to random variability this won't always be the case, but I've tried it with many replications.)
So, the mapping and matching between matrices isn't working as I had anticipated.
Any guidance welcomed.

You cannot compute the mean of circular observations by simply calling mean(); this is wrong. The correct way is to compute the mean of the cosines and sines of the angles and then use the arc tangent. See the packages for directional or circular data for this.
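A minimal sketch of that circular mean, assuming angles in radians (the %% (2 * pi) wrap is an added convention to keep results in [0, 2 * pi)):
circ_mean <- function(theta) {
  # average the sines and cosines, then recover the angle with atan2
  atan2(mean(sin(theta)), mean(cos(theta))) %% (2 * pi)
}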
Secondly, you gave us an idea: returning a matrix of von Mises generated data. But since brms does this job for you, for the moment I would go there.
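As for the elementwise pairing itself, here is a hedged sketch (an editorial suggestion, not part of the answer above): an R matrix is a vector with a dim attribute, so mapply() over the two matrices directly pairs corresponding elements in column-major order.
# mapply() walks both matrices element by element in column-major order;
# matrix() also fills by column, so positions stay aligned
out <- matrix(mapply(my_function, radians, kappa), nrow = nrow(radians))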

Related

What does the output of the function mvrnorm of MASS mean?

Using mvrnorm() from the MASS package, we can simulate realizations of multivariate normal distributions. The function is called as follows:
library(MASS)
MASS::mvrnorm(
  n = 10,                          # number of realizations
  mu = c(1, 5),                    # mean vector mu
  Sigma = my_cov_matrix(1, 3, 0.2) # covariance matrix Sigma
)
What does this output mean? Why are there two columns with ten random variables each?
The task is as follows:
Now, I created a function my_mvrnorm(n, mu_1, mu_2, sigma_1, sigma_2, rho), which simulates n realizations of the corresponding multivariate normal distribution, depending on the mean vector and the covariance matrix, and stores them in a tibble with the column names X and Y. In addition, this tibble is to contain a third column rho, in which all entries are filled with rho.
But I couldn't write the function yet, because I don't quite understand what the values in the columns X and Y should be. Can someone help me?
Attempt:
my_mvrnorm <- function(n, mu_1, mu_2, sigma_1, sigma_2, rho) {
  mu <- c(mu_1, mu_2)
  sigma <- my_cov_matrix(sigma_1, sigma_2, rho)
  tb <- tibble(
    X = ,
    Y = ,
    rho = rep(rho, n)
  )
  return(tb)
}
The n = 10 specification says to take 10 samples. The mu = c(1, 5) specification gives the two means. So you get a 10 x 2 matrix as the result. If you check, the first column has a mean close to 1, and the second a mean close to 5. Is my_cov_matrix defined somewhere else?
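To make the attempt runnable, here is a hedged completion. Since my_cov_matrix() is not shown in the post, the definition below is an assumption: a standard 2 x 2 covariance matrix built from the two standard deviations and the correlation rho. The draws from mvrnorm() fill the X and Y columns.
library(MASS)
library(tibble)

# Assumed definition: 2 x 2 covariance matrix from the sd's and correlation
my_cov_matrix <- function(sigma_1, sigma_2, rho) {
  matrix(c(sigma_1^2,               rho * sigma_1 * sigma_2,
           rho * sigma_1 * sigma_2, sigma_2^2),
         nrow = 2)
}

my_mvrnorm <- function(n, mu_1, mu_2, sigma_1, sigma_2, rho) {
  draws <- MASS::mvrnorm(n, mu = c(mu_1, mu_2),
                         Sigma = my_cov_matrix(sigma_1, sigma_2, rho))
  # column 1 of the draws is X, column 2 is Y
  tibble(X = draws[, 1], Y = draws[, 2], rho = rep(rho, n))
}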

K-means iterated for same data for 10 times

I am new to R. I am trying to evaluate whether I can improve a k-means result (using R) by iteratively calling the k-means routine on the same dataset with the same value of k (k = 3 in my case) 10 or 15 times, to see if that gives me good results. I see that the clustering changes at every call, and even the total sum of squares and withinss change, but I am not sure how to stop at the best solution.
Can anyone guide me?
code:
run_kmeans <- function(xtimes) {
  for (x in 1:xtimes) {
    # filtered_data is the asker's own dataset
    kmeans_results <- kmeans(filtered_data, 3)
    print(kmeans_results["totss"])
    print(kmeans_results["tot.withinss"])
  }
  return(kmeans_results)
}
kmeans_results <- run_kmeans(10)
Not sure I understood your question because this is not the usual way of selecting the best partition (elbow method, silhouette method, etc.)
Let's say you want to find the kmeans partition that minimizes your within-cluster sum of squares.
Let's take the example from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
You could write this to run kmeans repeatedly:
xtimes <- 10
kmeans_list <- lapply(seq_len(xtimes), function(i) {
  kmeans(x, 3)
})
lapply is generally preferable to for here; you get a list as output. (The list is named kmeans_list rather than kmeans to avoid masking the kmeans function itself.) To extract tot.withinss and see which run is minimal:
perf <- sapply(kmeans_list, function(d) as.numeric(d["tot.withinss"]))
which.min(perf)
However, unless I misunderstood your objective, this is a strange way to select the best-performing partition. Usually it is the number of clusters that is evaluated, not different partitions produced with the same sample data and the same number of clusters.
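As a side note (not part of the answer): base R's kmeans() already supports this via its nstart argument, which runs several random initialisations and keeps the solution with the lowest total within-cluster sum of squares.
# nstart = 10 runs 10 random starts internally and returns the best one,
# so the manual loop above is not strictly needed
best <- kmeans(x, centers = 3, nstart = 10)
best$tot.withinss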
Edit from your comment
OK, so you want to find the combination of columns that gives you the best performance. Below is an example where every two-by-two combination of three variables is tested. You could generalize this a little (but the number of possible combinations with 8 variables is very big; you should have a routine to reduce the number of tested combinations).
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 3),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 3))
colnames(x) <- c("x", "y", "z")
combinations <- combn(colnames(x), 2, simplify = FALSE)
kmeans_list <- lapply(combinations, function(i) {
  kmeans(x[, i], 3)
})
perf <- sapply(kmeans_list, function(d) as.numeric(d["tot.withinss"]))
which.min(perf)
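A short usage note: the best run and the column combination that produced it can then be pulled out of the list.
# The index of the smallest tot.withinss picks both the fit and the combination
best_fit <- kmeans_list[[which.min(perf)]]
combinations[[which.min(perf)]]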

Weighted Pearson's Correlation with one Object

I want to create a correlation matrix using data but weighted based on significant edges.
m <- matrix(data = rnorm(36), nrow = 6, ncol = 6)
x <- LETTERS[1:6]
y <- c()  # initialise y before appending to it in the loop
for (a in 1:length(x)) y <- c(y, paste("c", a, sep = ""))
mCor <- cor(t(m))
# seq(0.5, 0.8, by = 0.01) has only 31 values, so drawing 36 of them
# requires replacement
w <- sample(x = seq(0.5, 0.8, by = 0.01), size = 36, replace = TRUE)
The object w represents the weights for mCor. I know of other packages that compute a weighted correlation from input vectors x and y of the same length. I want to calculate a pairwise weighted Pearson correlation table, using the data in each row across all columns.
I just want to make sure it's correct: I thought about computing a weighted correlation for each pair of rows A and B by multiplying each value by the given weight. You typically need three vectors, all the same length: two for the data and one for the weights.
I am using the data.table package, so speedy solutions are welcome. Also, I am not sure if I should pass a table with two columns for the connections and one for the weights. Do the existing functions preserve order, or do they match automatically?
library(data.table)
weight <- data.table(x = rep(LETTERS[1:3], each = 12),
                     y = rep(LETTERS[4:6], times = 3),
                     w = w)
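This question has no posted answer in the thread; as a hedged starting point, base R's stats::cov.wt() computes a weighted covariance matrix and, with cor = TRUE, a weighted Pearson correlation. The sketch below assumes one weight per paired observation, which is a simplification of the per-edge weights set up above.
# Weighted Pearson correlation of two rows of m, one weight per observation;
# cov.wt() normalises the weights internally
w_obs <- runif(ncol(m))
cov.wt(cbind(m[1, ], m[2, ]), wt = w_obs, cor = TRUE)$cor[1, 2]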

R - Inverse cumulative distribution method with given function

I have a given function (let's call it f(x)) and I used the Monte Carlo method to normalize it. I calculated the probability density function, and I got the cumulative distribution function by integrating it.
f <- function(x) ...
plot(f, xlim = c(0, 5), ylim = c(0, 1), main = "f(x)")
mc.integral <- function(f, n.iter = 1000, interval) {
  x <- runif(n.iter, interval[1], interval[2])
  y <- f(x)
  mean(y) * (interval[2] - interval[1])
}
MC <- mc.integral(f, interval = c(0, 8))
print(MC)
densityFunction <- function(x) {
  f(x) / MC
}
distributionFunction <- function(x) {
  integrate(densityFunction, 0, x)$value
}
vd <- Vectorize(distributionFunction)
plot(vd, xlim = c(0, 8), ylim = c(0, 1), ylab = "y", main = "E(f(x))")
Now my next task is to use the inverse transform method (inverse cumulative distribution method) to generate samples and test them with the Kolmogorov-Smirnov test, but I don't know how to do this in R.
Can you please give me some help?
The inverse transform method applies the inverse CDF to standard uniform draws. vd above is the CDF, not its inverse, so it first has to be inverted numerically, for example with uniroot:
inv_vd <- function(u) uniroot(function(x) vd(x) - u, interval = c(0, 8))$root
sample <- sapply(runif(1000), inv_vd)
Therefore, generating 10 different random samples could be done with:
sample <- list()
for (i in 1:10) {
  set.seed(i)
  sample[[i]] <- sapply(runif(1000), inv_vd)
}
Afterwards, loop ks.test over the list:
lapply(sample, function(x) ks.test(x, pnorm))
will give you the output of a test against normality for each sample. Choose the size of your samples wisely: most normality tests are prone to come out significant for large samples, even when the differences are small.
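A hedged side note (not in the answer): to test each sample against the simulated target distribution itself rather than against normality, the vectorized CDF can be passed to ks.test directly.
# Compare each sample with the target CDF vd instead of pnorm
lapply(sample, function(x) ks.test(x, vd))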

Plot density curve of mixture of two normal distribution

I am rather new to R and could use some basic help. I'd like to generate sums of two normal random variables (variance = 1 for each) as their means move apart and plot the results. The basic idea: if the means are sufficiently far apart, the distribution will be bimodal. Here's the code I'm trying:
x <- seq(-3, 3, length = 500)
for (i in seq(0, 3, 0.25)) {
  y <- dnorm(x, mean = -i, sd = 1)
  z <- dnorm(x, mean = i, sd = 1)
  plot(x, y + z, type = "l", xlim = c(-3, 3))
}
Several questions:
Are there better ways to do this?
I'm only getting one PDF on my plot. How can I put multiple PDFs on the same plot?
Thank you in advance!
It is not difficult to do this using base R features. We first define a function f to compute the density of this mixture of normals:
## `x` is an evaluation grid
## `dev` is the deviation of the means from 0
f <- function (x, dev) {
(dnorm(x, -dev) + dnorm(x, dev)) / 2
}
Then we use sapply to loop through various dev to get corresponding density:
## `dev` sequence to test
dev <- seq(0, 3, 0.25)
## evaluation grid; extending `c(-1, 1) * max(dev)` by 4 standard deviations
x <- seq(-max(dev) - 4, max(dev) + 4, by = 0.1)
## density matrix
X <- sapply(dev, f, x = x)
## a comment on 2022-07-31: X <- outer(x, dev, f)
Finally we use matplot for plotting:
matplot(x, X, type = "l", lty = 1)
Explanation of sapply:
During sapply, x is not changed, while we pick one element of dev per iteration. It is equivalent to:
X <- matrix(0, nrow = length(x), ncol = length(dev))
for (i in 1:length(dev)) X[, i] <- f(x, dev[i])
matplot(x, X) will plot columns of X one by one, against x.
A comment on 2022-07-31: Just use outer. Here are more examples:
Run a function of 2 arguments over a span of parameter values in R
Plot of a Binomial Distribution for various probabilities of success in R
