Vectorized R function to produce sets of histograms

I have a vectorized R function (see below). On each run, the function plots two histograms. When the argument n is a vector (see the example of use below), I want the function to plot length(n) separate sets of these histograms. For example, if n is a vector of length 2, I expect two sets of histograms, i.e., 4 individual histograms.
I have tried the following with no success. Is there a way to do this?
t.sim = Vectorize(function(n, es, n.sim){
d = numeric(n.sim)
p = numeric(n.sim)
for(i in 1:n.sim){
N = sqrt((n^2)/(2*n))
x = rnorm(n, es, 1)
y = rnorm(n, 0, 1)
a = t.test(x, y, var.equal = TRUE)
d[i] = a[[1]]/N
p[i] = a[[3]]
}
par(mfcol = c(2, length(n)))
hist(p) ; hist(d)
}, "n")
# Example of use:
t.sim(n = c(30, 300), es = .1, n.sim = 1e3) # `n` is a vector of `2` so I expect
# 4 histograms in my graphical device

Vectorize is built on mapply, which calls the function once per element of the input vector. Inside each call n is a single number, so length(n) is 1 and par(mfcol = c(2, length(n))) resets the device to a 2 x 1 layout every time. The easier way out is probably to set the layout once, outside the function:
t.sim = Vectorize(function(n, es, n.sim){
d = numeric(n.sim)
p = numeric(n.sim)
for(i in 1:n.sim){
N = sqrt((n^2)/(2*n))
x = rnorm(n, es, 1)
y = rnorm(n, 0, 1)
a = t.test(x, y, var.equal = TRUE)
d[i] = a[[1]]/N
p[i] = a[[3]]
}
# par(mfcol = c(2, npar))
hist(p) ; hist(d)
}, "n")
#inputs
data <- c(30,300)
par(mfcol = c(2, length(data)))
t.sim(n = data, es = c(.1), n.sim = 1e3)
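To see why the layout has to be set outside, it may help to check what the wrapped function actually receives. The sketch below is only an illustration: the counter calls and the helper seen_length are hypothetical names, not part of the original code. It records how often the body runs and what length(n) is inside each call.

```r
# Count how many times the body runs and report length(n) as seen inside it
calls <- 0
seen_length <- Vectorize(function(n, es) {
  calls <<- calls + 1   # incremented once per element of n
  length(n)             # what the body sees: one element at a time
}, "n")

lens <- seen_length(n = c(30, 300), es = 0.1)
print(lens)   # one entry per element of n
print(calls)  # number of separate calls made through Vectorize
```

Because each call only ever sees a single n, par(mfcol = c(2, length(n))) inside the body always requests a 2 x 1 layout. Calling par() once before t.sim() avoids that.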


Avoiding duplication in R

I am trying to fit a variety of (truncated) probability distributions to the same very thin set of quantiles. I can do it but it seems to require lots of duplication of the same code. Is there a neater way?
I am using this code by Nadarajah and Kotz to compute the quantiles of the truncated distributions:
qtrunc <- function(p, spec, a = -Inf, b = Inf, ...)
{
tt <- p
G <- get(paste("p", spec, sep = ""), mode = "function")
Gin <- get(paste("q", spec, sep = ""), mode = "function")
tt <- Gin(G(a, ...) + p*(G(b, ...) - G(a, ...)), ...)
return(tt)
}
where spec can be the name of any untruncated distribution for which code in R exists, and the ... argument is used to provide the names of the parameters of that untruncated distribution.
To achieve the best fit I need to measure the distance between the given quantiles and those calculated using arbitrary values of the parameters of the distribution. In the case of the gamma distribution, for example, the code is as follows:
spec <- "gamma"
fit_gamma <- function(x, l = 0, h = 20, t1 = 5, t2 = 13){
ct1 <- qtrunc(p = 1/3, spec, a = l, b = h, shape = x[1],rate = x[2])
ct2 <- qtrunc(p = 2/3, spec, a = l, b = h, shape = x[1],rate = x[2])
dist <- vector(mode = "numeric", length = 2)
dist[1] <- (t1 - ct1)^2
dist[2] <- (t2- ct2)^2
return(sqrt(sum(dist)))
}
where l is the lower truncation, h is the higher and I am given the two tertiles t1 and t2.
Finally, I seek the best fit using optim, thus:
gamma_fit <- optim(par = c(2, 4),
fn = fit_gamma,
l = l,
h = h,
t1 = t1,
t2 = t2,
method = "L-BFGS-B",
lower = c(1.01, 1.4))
Now suppose I want to do the same thing but fitting a normal distribution instead. The names of the parameters of the normal distribution that I am using in R are mean and sd.
I can achieve what I want but only by writing a whole new function fit_normal that is extremely similar to my fit_gamma function but with the new parameter names used in the definition of ct1 and ct2.
The problem of duplication of code becomes very severe because I wish to try fitting a large number of different distributions to my data.
What I want to know is whether there is a way of writing a generic fit_spec as it were so that the parameter names do not have to be written out by me.
Use x as a named list to create a list of arguments to pass into qtrunc() using do.call().
fit_distro <- function(x, spec, l = 0, h = 20, t1 = 5, t2 = 13){
args <- c(x, list(spec = spec, a = l, b = h))
ct1 <- do.call(qtrunc, args = c(list(p = 1/3), args))
ct2 <- do.call(qtrunc, args = c(list(p = 2/3), args))
dist <- vector(mode = "numeric", length = 2)
dist[1] <- (t1 - ct1)^2
dist[2] <- (t2 - ct2)^2
return(sqrt(sum(dist)))
}
It is called as follows and returns the same value as your original function.
fit_distro(list(shape = 2, rate = 3), "gamma")
# [1] 13.07425
fit_gamma(c(2, 3))
# [1] 13.07425
This will work with other distributions, for however many parameters they have.
fit_distro(list(mean = 10, sd = 3), "norm")
# [1] 4.08379
fit_distro(list(shape1 = 2, shape2 = 3, ncp = 10), "beta")
# [1] 12.98371
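To plug this into optim, which works on a plain numeric vector, one option is a thin wrapper that re-attaches the parameter names before calling fit_distro. The sketch below is an illustration under assumptions rather than part of the original answer: fit_by_name and par_names are hypothetical names, the start values and bounds are picked only to keep the search in a sensible range, and qtrunc and fit_distro are repeated (slightly condensed) so the block runs on its own.

```r
# qtrunc and fit_distro as defined above, repeated so this runs standalone
qtrunc <- function(p, spec, a = -Inf, b = Inf, ...) {
  G   <- get(paste0("p", spec), mode = "function")
  Gin <- get(paste0("q", spec), mode = "function")
  Gin(G(a, ...) + p * (G(b, ...) - G(a, ...)), ...)
}

fit_distro <- function(x, spec, l = 0, h = 20, t1 = 5, t2 = 13) {
  args <- c(x, list(spec = spec, a = l, b = h))
  ct1 <- do.call(qtrunc, c(list(p = 1/3), args))
  ct2 <- do.call(qtrunc, c(list(p = 2/3), args))
  sqrt((t1 - ct1)^2 + (t2 - ct2)^2)
}

# Hypothetical wrapper: turn optim's numeric vector into a named list
fit_by_name <- function(par, par_names, spec, ...) {
  fit_distro(as.list(setNames(par, par_names)), spec, ...)
}

# Truncated normal on [0, 20]; start values and bounds are illustrative
norm_fit <- optim(par = c(10, 3),
                  fn = fit_by_name,
                  par_names = c("mean", "sd"),
                  spec = "norm",
                  method = "L-BFGS-B",
                  lower = c(0, 0.1),
                  upper = c(20, 50))
norm_fit$par    # fitted mean and sd
norm_fit$value  # remaining distance to the two tertiles
```

Switching to another distribution then only means changing spec, par_names, and the start values and bounds, with no new fit function.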

Draw a vector field from matrix multiplication

I'm trying to draw a vector field based on a matrix multiplication. The problem is that the function that produces the values for the matrix multiplication can only take a single number. When a range of numbers is passed to all.p, the output cannot be used in the matrix multiplication. Is there a way to change all.p so that, with multiple inputs, the matrix multiplication is still valid and the vector field can be computed? The code fails at the vectorfield call, because that function passes in a range of values between 0 and 1, but all.p can't take multiple inputs.
geno.fit = matrix(c(0.791,1.000,0.834,
0.670,1.006,0.901,
0.657,0.657,1.067),
nrow = 3,
ncol = 3,
byrow = T)
all.p <- function(p) {
if (length(p)>1) {
stop("More numbers in input than expected")
}
P = p^2
PQ = 2*p*(1-p)
Q = (1-p)^2
return(list=c(P=P,PQ=PQ,Q=Q))
}
library(pracma)
f <- function(x, y) all.p(x) %*% geno.fit %*% all.p(y)
xx <- c(0, 1); yy <- c(0, 1)
vectorfield(fun = f, xlim = xx, ylim = yy, scale = 0.1)
for (xs in seq(0, 1, by = 0.25)) {
sol <- rk4(f, 0, 1, xs, 100)
lines(sol$x, sol$y, col="darkgreen")
}
grid()
I also tried to use a for loop.
f <- function(x, y, n = 16) {
space3 = matrix(NA,nrow = n,ncol = n)
for (i in 1:(length(x))) {
for (j in 1:(length(y))) {
# Calculate mean fitness
space3[i,j] = all.p(x[i]) %*% geno.fit %*% all.p(y[j])
}
}
return(space3)
}
xx <- c(0, 1); yy <- c(0, 1)
f(seq(0,1,length.out = 16), seq(0,1,length.out = 16))
vectorfield(fun = f, xlim = xx, ylim = yy, scale = 0.1)
Below is the code to make the gradient ascent plot (without the vectors).
library(fields) # for image.plot
res = 0.01
seq.x = seq(0,1,by = res)
space = outer(seq.x,seq.x,"*")
space2 = space
for (i in 1:length(seq.x)) {
for (j in 1:length(seq.x)) {
space[i,j] = all.p(1-seq.x[i]) %*% geno.fit %*% all.p(1-seq.x[j])
}
}
round(t(space),3)
new.space = t(space)
image.plot(new.space)
by.text = 8
for (i in seq(1,length(seq.x),by = by.text)) {
for (j in seq(1,length(seq.x),by = by.text)) {
text(seq.x[i],seq.x[j],
labels = round(new.space[i,j],4),
cex = new.space[i,j]/2,
col = "black")
}
}
contour(new.space,ylim=c(1,0),add = T, nlevels = 50)
I was able to make the vector field function work, but it's not showing what I was expecting from the previous gradient ascent plot:
How can the two be reconciled (i.e., plotting the vectors on the gradient ascent image, so that they show the direction of steepest ascent)?
Here is my solution:
library(fields) # for image.plot
library(plotly)
library(raster)
# Genotype fitness matrix -------------------------------------------------
geno.fit = matrix(c(0.791,1.000,0.834,
0.670,1.006,0.901,
0.657,0.657,1.067),
nrow = 3,
ncol = 3,
byrow = T)
# Resolution
res = 0.01
# Sequence of X
seq.x = seq(0,1,by = res)
# Make a matrix
space = outer(seq.x,seq.x,"*")
# Function to calculate the AVERAGE fitness for a given frequency of an allele to get the expected frequency of genotypes in a population
all.p <- function(p) { # Takes frequency of an allele in the population
if (length(p)>1) { # Has to be only 1 number
stop("More numbers in input than expected")
}
P = p^2 # Gets the AA
PQ = 2*p*(1-p) # gets the Aa
Q = (1-p)^2 # Gets the aa
return(list=c(P=P, # Return the values
PQ=PQ,
Q=Q))
}
# Examples
all.p(0)
all.p(1)
# Plot the matrix of all combinations of genotype frequencies
image.plot(space,
ylim=c(1.05,-0.05),
ylab= "Percentage of Chromosome EF of TD form",
xlab= "Percentage of Chromosome CD of BL form")
# Backup the data
space2 = space
# calculate the average fitness for EVERY combination of frequency of 2 genotypes
for (i in 1:length(seq.x)) {
for (j in 1:length(seq.x)) {
# Calculate mean fitness
space[i,j] = all.p(1-seq.x[i]) %*% geno.fit %*% all.p(1-seq.x[j])
}
}
# Show the result
round(t(space),3)
# Transform the space
new.space = t(space)
image.plot(new.space,
# ylim=c( 1.01,-0.01),
ylab= "Percentage of Chromosome EF of TD (Tidbinbilla) form",
xlab= "Percentage of Chromosome CD of BL (Blundell) form")
# Add the numbers to get a better sense of the average fitness values at each point
by.text = 8
for (i in seq(1,length(seq.x),by = by.text)) {
for (j in seq(1,length(seq.x),by = by.text)) {
text(seq.x[i],seq.x[j],
labels = round(new.space[i,j],4),
cex = new.space[i,j]/2,
col = "black") # col = "gray70"
}
}
# Add contour lines
contour(new.space,ylim=c(1,0),add = T, nlevels = 50)
# Plotly 3D graph --------------------------------------------------------
# To get the 3D plane in an INTERACTIVE graph
xyz=cbind(expand.grid(seq.x,
seq.x),
as.vector(new.space))
plot_ly(x = xyz[,1],y = xyz[,2],z = xyz[,3],
color = xyz[,3])
# Vector field on the Adaptive landscape ----------------------------------
library(tidyverse)
library(ggquiver)
raster2quiver <- function(rast, aggregate = 50, colours = terrain.colors(6), contour.breaks = 200)
{
names(rast) <- "z"
quiv <- aggregate(rast, aggregate)
terr <- terrain(quiv, opt = c('slope', 'aspect'))
quiv$u <- -terr$slope[] * sin(terr$aspect[])
quiv$v <- -terr$slope[] * cos(terr$aspect[])
quiv_df <- as.data.frame(quiv, xy = TRUE)
rast_df <- as.data.frame(rast, xy = TRUE)
print(ggplot(mapping = aes(x = x, y = y, fill = z)) +
geom_raster(data = rast_df, na.rm = TRUE) +
geom_contour(data = rast_df,
aes(z=z, color=..level..),
breaks = seq(0,3, length.out = contour.breaks),
size = 1.4)+
scale_color_gradient(low="blue", high="red")+
geom_quiver(data = quiv_df, aes(u = u, v = v), vecsize = 1.5) +
scale_fill_gradientn(colours = colours, na.value = "transparent") +
theme_bw())
return(quiv_df)
}
r <-raster(
space,
xmn=range(seq.x)[1], xmx=range(seq.x)[2],
ymn=range(seq.x)[1], ymx=range(seq.x)[2],
crs=CRS("+proj=utm +zone=11 +datum=NAD83")
)
# Draw the adaptive landscape
raster2quiver(rast = r, aggregate = 2, colours = tim.colors(100))
Not exactly what I wanted, but it does what I was looking for!
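One way to reconcile the two views with base graphics only is to make the fitness calculation accept vectors and take the gradient of the resulting surface numerically. The sketch below is an illustration under assumptions, not the method used above: all.p.vec and mean.fitness are hypothetical names, the 16-point grid and the arrow scaling are arbitrary, and the gradient is a plain central finite difference.

```r
geno.fit <- matrix(c(0.791, 1.000, 0.834,
                     0.670, 1.006, 0.901,
                     0.657, 0.657, 1.067),
                   nrow = 3, byrow = TRUE)

# Hypothetical vectorized variant of all.p: one row of genotype
# frequencies (P, PQ, Q) per allele frequency in p
all.p.vec <- function(p) cbind(P = p^2, PQ = 2 * p * (1 - p), Q = (1 - p)^2)

# Mean fitness for every (x, y) combination in one matrix product
mean.fitness <- function(x, y) all.p.vec(x) %*% geno.fit %*% t(all.p.vec(y))

g <- seq(0, 1, length.out = 16)
W <- mean.fitness(g, g)          # W[i, j] is the fitness at x = g[i], y = g[j]

# Central finite differences on the interior grid points
h   <- g[2] - g[1]
mid <- 2:(length(g) - 1)
dWdx <- (W[mid + 1, mid] - W[mid - 1, mid]) / (2 * h)
dWdy <- (W[mid, mid + 1] - W[mid, mid - 1]) / (2 * h)

# Surface, contours, and arrows pointing in the direction of steepest ascent
image(g, g, W, xlab = "x", ylab = "y")
contour(g, g, W, add = TRUE)
pts <- expand.grid(x = g[mid], y = g[mid])   # x varies fastest, like as.vector()
len <- 0.04 / max(sqrt(dWdx^2 + dWdy^2))     # longest arrow spans 0.04 axis units
arrows(pts$x, pts$y,
       pts$x + len * as.vector(dWdx),
       pts$y + len * as.vector(dWdy),
       length = 0.03)
```

Note that this uses all.p(x) directly, as in the first definition of f. The image code above evaluates all.p(1 - x), so the axes would need flipping to match that plot.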

How to calculate fuzzy performance index and normalized classification entropy in R

I am running Fuzzy C-Means Clustering using e1071 package. I want to decide the optimum number of clusters based on fuzzy performance index (FPI) (extent of fuzziness) and normalized classification entropy (NCE) (degree of disorganization of specific class) given in the following formula
where c is the number of clusters, n is the number of observations, μik is the fuzzy membership, and log_a is the natural logarithm.
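Written out from the R expressions used for FPI and NCE in the code later in this post, the two indices are as follows. This is a reconstruction, and the index convention (i over the n observations, k over the c clusters) is an assumption.

```latex
\mathrm{FPI} = 1 - \frac{c}{c-1}\left[1 - \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c}\mu_{ik}^{2}\right]
\qquad
\mathrm{NCE} = \frac{n}{n-c}\left[-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c}\mu_{ik}\,\log_a \mu_{ik}\right]
```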
I am using the following code
library(e1071)
x <- rbind(matrix(rnorm(100,sd=0.3),ncol=2),
matrix(rnorm(100,mean=1,sd=0.3),ncol=2))
cl <- cmeans(x,2,20,verbose=TRUE,method="cmeans")
cl$membership
I have been able to extract μik, i.e., the fuzzy membership. Now cmeans has to be run for different numbers of clusters, e.g., 2 to 6, and the FPI and NCE have to be calculated to produce a plot like the following
How can it be achieved in R?
Edit
I have tried the code provided by @nya on the iris dataset, using the following code
df <- scale(iris[-5])
FPI <- function(cmem){
c <- ncol(cmem)
n <- nrow(cmem)
1 - (c / (c - 1)) * (1 - sum(cmem^2) / n)
}
NCE <- function(cmem){
c <- ncol(cmem)
n <- nrow(cmem)
(n / (n - c)) * (- sum(cmem * log(cmem)) / n)
}
# prepare variables
cl <- list()
fpi <- nce <- NULL
# cycle through the desired number of clusters
for(i in 2:6){
cl[[i]] <- cmeans(df, i, 20, method = "cmeans")
fpi <- c(fpi, FPI(cl[[i]]$membership))
nce <- c(nce, NCE(cl[[i]]$membership))
}
# add space for the second axis label
par(mar = c(5,4,1,4) + .1)
# plot FPI
plot(2:6, fpi, lty = 2, pch = 18, type = "b", xlab = "Number of clusters", ylab = "FPI")
# plot NCE, manually adding the second axis
par(new = TRUE)
plot(2:6, nce, lty = 1, pch = 15, type = "b", xlab = "", ylab = "", axes = FALSE)
axis(4, at = pretty(range(nce)))
mtext("NCE", side = 4, line = 3)
# add legend
legend("top", legend = c("FPI", "NCE"), pch = c(18,15), lty = c(2,1), horiz = TRUE)
The optimum number of clusters is taken where the fuzzy performance index (FPI) and the normalized classification entropy (NCE) are at their minimum. Here, however, NCE is always increasing and FPI is always decreasing. Ideally it should have been
With available equations, we can program our own functions. Here, the two functions use equations present in the paper you suggested and one of the references the authors cite.
FPI <- function(cmem, method = c("FuzME", "McBrathney", "Rahul")){
method = match.arg(method)
C <- ncol(cmem)
N <- nrow(cmem)
# Rahul et al. 2019. https://doi.org/10.1080/03650340.2019.1578345
if(method == "Rahul"){
res <- 1 - (C / (C - 1)) * (1 - sum(cmem^2) / N)
}
# McBrathney & Moore 1985 https://doi.org/10.1016/0168-1923(85)90082-6
if(method == "McBrathney"){
F <- sum(cmem^2) / N
res <- 1 - (C * F - 1) / (F - 1)
}
# FuzME https://precision-agriculture.sydney.edu.au/resources/software/
# MATLAB code file fvalidity.m, downloaded on 11 Nov, 2021
if(method == "FuzME"){
F <- sum(cmem^2) / N
res <- 1 - (C * F - 1) / (C - 1)
}
return(res)
}
NCE <- function(cmem, method = c("FuzME", "McBrathney", "Rahul")){
method = match.arg(method)
C <- ncol(cmem)
N <- nrow(cmem)
if(method == "Rahul"){
res <- (N / (N - C)) * (- sum(cmem * log(cmem)) / N)
}
if(method %in% c("FuzME", "McBrathney")){
H <- -1 / N * sum(cmem * log(cmem))
res <- H / log(C)
}
return(res)
}
Then use those to calculate the indices from the degrees of membership from the cmeans function from the iris dataset.
# prepare variables
cl <- list()
fpi <- nce <- NULL
# cycle through the desired number of clusters
for(i in 2:6){
cl[[i]] <- e1071::cmeans(iris[, -5], i, 20, method = "cmeans")
fpi <- c(fpi, FPI(cl[[i]]$membership, method = "M"))
nce <- c(nce, NCE(cl[[i]]$membership, method = "M"))
}
Last, plot with two different axes in one plot.
# add space for the second axis label
par(mar = c(5,4,1,4) + .1)
# plot FPI
plot(2:6, fpi, lty = 2, pch = 18, type = "b", xlab = "Number of clusters", ylab = "FPI")
# plot NCE, manually adding the second axis
par(new = TRUE)
plot(2:6, nce, lty = 1, pch = 15, type = "b", xlab = "", ylab = "", axes = FALSE)
axis(4, at = pretty(range(nce)))
mtext("NCE", side = 4, line = 3)
# add legend
legend("top", legend = c("FPI", "NCE"), pch = c(18,15), lty = c(2,1), horiz = TRUE)
EDIT1: Updated the functions according to optional equations from two different publications and calculated the example on the iris dataset.
EDIT2: Added code for the FPI and NCE calculations specified in the FuzME MATLAB code available here.
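A quick way to check functions like these is to feed them membership matrices whose answer is known in advance. The sketch below re-implements only the FuzME variant in two hypothetical helpers (fpi_fuzme, nce_fuzme) so that it runs on its own. The guard that treats 0 * log(0) as 0 is an added assumption, not part of the code above.

```r
fpi_fuzme <- function(u) {
  C <- ncol(u); N <- nrow(u)
  pc <- sum(u^2) / N                       # partition coefficient
  1 - (C * pc - 1) / (C - 1)
}

nce_fuzme <- function(u) {
  C <- ncol(u); N <- nrow(u)
  term <- ifelse(u > 0, u * log(u), 0)     # assumed convention: 0 * log(0) = 0
  (-sum(term) / N) / log(C)
}

# Completely fuzzy: every observation belongs equally to all 3 clusters
u_fuzzy <- matrix(1 / 3, nrow = 10, ncol = 3)
# Completely crisp: every observation belongs to exactly one cluster
u_crisp <- diag(3)[rep(1:3, length.out = 10), ]

c(FPI = fpi_fuzme(u_fuzzy), NCE = nce_fuzme(u_fuzzy))  # fuzzy extreme: both about 1
c(FPI = fpi_fuzme(u_crisp), NCE = nce_fuzme(u_crisp))  # crisp extreme: both 0
```

Without the guard, hard memberships make sum(cmem * log(cmem)) evaluate to NaN, because 0 * log(0) is NaN in R.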
Hope this could help
library(e1071) # for cmeans
library(dplyr)
library(tidyr) # for pivot_longer
library(ggplot2)
f <- function(cl) {
C <- length(cl$size)
N <- sum(cl$size)
mu <- cl$membership
fpi <- 1 - C / (C - 1) * (1 - sum((mu)^2) / N)
nce <- N / (N - C) * (-sum(log(mu) * mu) / N)
c(FPI = fpi, NCE = nce)
}
K <- 2:6 # numbers of clusters to try; must exist before sapply(K, ...) is evaluated
data.frame(t(rbind(
K = K,
sapply(
K,
function(k) f(cmeans(x, k, 20, verbose = TRUE, method = "cmeans"))
)
))) %>%
pivot_longer(cols = FPI:NCE, names_to = "Index") %>%
ggplot(aes(x = K, y = value, group = Index)) +
geom_line(aes(linetype = Index, color = Index)) +
geom_point() +
scale_y_continuous(
name = "FPI",
sec.axis = sec_axis(~., name = "NCE")
) +
theme(legend.position = "top")

Why is my Monte Carlo Integration wrong by a factor of 2?

I am trying to integrate the following function using a Monte Carlo Integration. The interval I want to integrate is x <- seq(0, 1, by = 0.01) and y <- seq(0, 1, by = 0.01).
my.f <- function(x, y){
result = x^2 + sin(x) + exp(cos(y))
return(result)
}
I calculated the integral using the cubature package.
library(cubature)
library(plotly)
# Rewriting the function, so it can be integrated
cub.function <- function(x){
result = x[1]^2 + sin(x[1]) + exp(cos(x[2]))
return(result)
}
cub.integral <- adaptIntegrate(f = cub.function, lowerLimit = c(0,0), upperLimit = c(1,1))
The result is 3.134606. But when I use my Monte Carlo Integration Code, see below, my result is about 1.396652. My code is wrong by more than a factor of 2!
What I did:
Since I need a volume to conduct a Monte Carlo Integration, I calculated the function values on the mentioned interval. This will give me an estimation of the maximum and minimum of the function.
# My data range
x <- seq(0, 1, by = 0.01)
y <- seq(0, 1, by = 0.01)
# The matrix, where I save the results
my.f.values <- matrix(0, nrow = length(x), ncol = length(y))
# Calculation of the function values
for(i in 1:length(x)){
for(j in 1:length(y)){
my.f.values[i,j] <- my.f(x = x[i], y = y[j])
}
}
# The maximum and minimum of the function values
max(my.f.values)
min(my.f.values)
# Plotting the surface, but this is not necessary
plot_ly(y = x, x = y, z = my.f.values) %>% add_surface()
So, the volume that we need is simply the maximum of the function values, since 1 * 1 * 4.559753 is simply 4.559753.
# Now, the Monte Carlo Integration
# I found the code online and modified it a bit.
monte = function(x){
tests = rep(0,x)
hits = 0
for(i in 1:x){
y = c(runif(2, min = 0, max = 1), # y[1] is x; y[2] is y
runif(1, min = 0, max = max(my.f.values))) # y[3] is z
if(y[3] < y[1]**2+sin(y[1])*exp(cos(y[2]))){
hits = hits + 1
}
prop = hits / i
est = prop * max(my.f.values)
tests[i] = est
}
return(tests)
}
size = 10000
res = monte(size)
plot(res, type = "l")
lines(x = 1:size, y = rep(cub.integral$integral, size), col = "red")
So, the result is completely wrong. But if I change the function a bit, it suddenly works.
monte = function(x){
tests = rep(0,x)
hits = 0
for(i in 1:x){
x = runif(1)
y = runif(1)
z = runif(1, min = 0, max = max(my.f.values))
if(z < my.f(x = x, y = y)){
hits = hits + 1
}
prop = hits / i
est = prop * max(my.f.values)
tests[i] = est
}
return(tests)
}
size = 10000
res = monte(size)
plot(res, type = "l")
lines(x = 1:size, y = rep(cub.integral$integral, size), col = "red")
Can somebody explain why the result suddenly changes? To me, both functions seem to do the exact same thing.
In your (first) code for monte, this line is in error:
y[3] < y[1]**2+sin(y[1])*exp(cos(y[2]))
Given your definition of my.f, it should surely be
y[3] < y[1]**2 + sin(y[1]) + exp(cos(y[2]))
Or..., given that you shouldn't be repeating yourself unnecessarily:
y[3] < my.f(y[1], y[2])
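For comparison, here is a minimal vectorized sketch of the corrected estimate; it is an illustration, not the original poster's code. The names hit_or_miss, mean_value and mistyped are hypothetical, the seed and sample size are arbitrary, and the bounding height is taken from my.f(1, 0) on the assumption that the maximum sits at that corner (my.f increases in x and decreases in y on the unit square). The last estimate reproduces the mistyped test, whose integrand integrates to roughly 1.41, in line with the value of about 1.40 reported in the question.

```r
my.f <- function(x, y) x^2 + sin(x) + exp(cos(y))

set.seed(1)            # arbitrary seed, only for reproducibility
n <- 1e5
x <- runif(n)
y <- runif(n)

# Hit-or-miss: fraction of points under the surface times the box volume
f_max <- my.f(1, 0)    # largest value of my.f on the unit square
z <- runif(n, min = 0, max = f_max)
hit_or_miss <- mean(z < my.f(x, y)) * f_max

# Mean-value estimator: average height times the base area (1 * 1)
mean_value <- mean(my.f(x, y))

# What the mistyped test (sin * exp instead of sin + exp) estimates
mistyped <- mean(z < x^2 + sin(x) * exp(cos(y))) * f_max

c(hit_or_miss = hit_or_miss, mean_value = mean_value, mistyped = mistyped)
```

The mean-value estimator needs no bounding height at all and usually has lower variance than hit-or-miss for the same number of samples.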

Graphical output of density for the function gammamixEM (package mixtools)

I'm using the function gammamixEM from the package mixtools. How can I get the graphical output of the density, as for the function normalmixEM (i.e., the second plot in plot(..., which = 2))?
Update:
Here is a reproducible example for the function gammamixEM:
library(mixtools)
x <- c(rgamma(200, shape = 0.2, scale = 14), rgamma(200,
shape = 32, scale = 10), rgamma(200, shape = 5, scale = 6))
out <- gammamixEM(x, lambda = c(1, 1, 1)/3, verb = TRUE)
Here is a reproducible example for the function normalmixEM:
data(faithful)
attach(faithful)
out <- normalmixEM(waiting, arbvar = FALSE, epsilon = 1e-03)
plot(out, which=2)
I would like to obtain this graphical output of density from the function gammamixEM.
Here you go.
out <- normalmixEM(waiting, arbvar = FALSE, epsilon = 1e-03)
x <- out
whichplots <- 2
density = 2 %in% whichplots
loglik = 1 %in% whichplots
def.par <- par(ask=(loglik + density > 1), "mar") # only ask and mar are changed
mix.object <- x
k <- ncol(mix.object$posterior)
x <- sort(mix.object$x)
a <- hist(x, plot = FALSE)
maxy <- max(max(a$density), .3989*mix.object$lambda/mix.object$sigma)
I just had to dig into the source code of plot.mixEM
So, now to do this with gammamixEM:
x <- c(rgamma(200, shape = 0.2, scale = 14), rgamma(200,
shape = 32, scale = 10), rgamma(200, shape = 5, scale = 6))
gammamixEM.out <- gammamixEM(x, lambda = c(1, 1, 1)/3, verb = TRUE)
mix.object <- gammamixEM.out
k <- ncol(mix.object$posterior)
x <- sort(mix.object$x)
a <- hist(x, plot = FALSE)
maxy <- max(max(a$density), .3989*mix.object$lambda/mix.object$sigma)
main2 <- "Density Curves"
xlab2 <- "Data"
col2 <- 2:(k+1)
hist(x, prob = TRUE, main = main2, xlab = xlab2,
ylim = c(0,maxy))
for (i in 1:k) {
lines(x, mix.object$lambda[i] *
dnorm(x,
sd = sd(x)))
}
I believe it should be pretty straightforward to continue this example a bit, if you want to add the labels, smooth lines, etc. Here's the source of the plot.mixEM function.
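Note that the loop above draws normal curves (dnorm with the sample standard deviation). For a gamma mixture the component curves would more naturally come from dgamma with the fitted parameters. The sketch below shows that idea without calling mixtools, so every number in it is illustrative: gamma_mix_components, shape_hat and scale_hat are hypothetical names, and the mapping from a real fit (shape in row 1 and scale in row 2 of its gamma.pars matrix) is an assumption to check against the mixtools documentation.

```r
# Weighted gamma component densities evaluated at x (one column per component)
gamma_mix_components <- function(x, lambda, shape, scale) {
  sapply(seq_along(lambda),
         function(j) lambda[j] * dgamma(x, shape = shape[j], scale = scale[j]))
}

# Illustrative data and parameters only. With a real fit, lambda would come from
# the fit's lambda, and shape/scale from its gamma.pars matrix (ASSUMED layout:
# shape in row 1, scale in row 2).
set.seed(42)
x <- sort(c(rgamma(200, shape = 2, scale = 3), rgamma(200, shape = 5, scale = 6)))
lambda    <- c(0.5, 0.5)
shape_hat <- c(2, 5)
scale_hat <- c(3, 6)

comp <- gamma_mix_components(x, lambda, shape_hat, scale_hat)

# Histogram on the density scale with one curve per component
a <- hist(x, plot = FALSE)
maxy <- max(a$density, comp)
hist(x, prob = TRUE, main = "Density Curves", xlab = "Data", ylim = c(0, maxy))
for (j in seq_along(lambda)) lines(x, comp[, j], col = j + 1, lwd = 2)
```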
