Extract a named dataframe from a function in R

I have created the following function that I need to extract some information from. However, R is giving me some issues.
HAC.sim <- function(K = 1, N, Hstar, probs, perms = 10000){
specs <- 1:N
### Set up a container to hold the identity of each individual from each permutation
pop <- array(dim = c(c(perms, N), K))
### Create an ID for each haplotype
haps <- as.character(1:Hstar)
### Assign probabilities of occurrence to each haplotype, ensure they sum to 1
### This is where we assume we "know" the distribution of haplotypes
### Here, I have assumed they all occur with equal frequency, but you can change this to assume some dominant ones and some rare ones, whatever you want
probs <- rep(1/Hstar, Hstar)
# probs <- c(220/N, rep(3/N, 2), rep(2/N, 2), rep(1/N, 10))
### Generate permutations, we assume each permutation has N individuals, and we sample those individuals' haplotypes from our probabilities
# If K > 1, haplotypes are partitioned into equally-sized subpopulations/demes
# Can change number of haplotypes in each subpopulation and re-run simulation
# For each additional, K, add new Ki and new pop[j ,, i] in loop
for(j in 1:perms){
for(i in 1:K){
if(i == 1){
pop[j, specs, i] <- sample(haps, size = N, replace = TRUE, prob = probs)
}
else{
pop[j ,, 1] <- sample(haps[K1], size = N, replace = TRUE, prob = probs[K1])
pop[j ,, 2] <- sample(haps[K2], size = N, replace = TRUE, prob = probs[K2])
}
}
}
### Make a matrix to hold the 1:N individuals from each permutation
HAC.mat <- array(dim = c(c(perms, N), K))
for(k in specs){
for(j in 1:perms){
for(i in 1:K){
ind.index <- sample(specs, size = k, replace = FALSE) ## which individuals will we sample
hap.plot <- pop[sample(1:nrow(pop), size = 1, replace = TRUE), ind.index, sample(1:K, size = 1, replace = TRUE)] ## pull those individuals from a permutation
HAC.mat[j, k, i] <- length(unique(hap.plot)) ## how many haplotypes did we get for a given sampling intensity (k) from each ### permutation (j)
}
}
}
### Calculate the mean and CI for number of haplotypes at each sampling intensity (j)
means <- apply(HAC.mat, MARGIN = 2, mean)
lower <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.025))
upper <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.975))
### Plot the curve and frequency barplot
par(mfrow = c(1, 2))
for(i in 1:K){
if(i == 1){
plot(specs, means, type = "n", xlab = "Specimens sampled", ylab = "Unique haplotypes", ylim = c(1, Hstar))
polygon(x = c(specs, rev(specs)), y = c(lower, rev(upper)), col = "gray")
lines(specs, means, lwd = 2)
HAC.bar <- barplot(N*probs, xlab = "Unique haplotypes", ylab = "Specimens sampled", names.arg = 1:Hstar)
}
else{
plot(specs, means, type = "n", xlab = "Specimens sampled", ylab = "Unique haplotypes", ylim = c(1, max(HAC.mat)))
polygon(x = c(specs, rev(specs)), y = c(lower, rev(upper)), col = "gray")
lines(specs, means, lwd = 2)
HAC.bar <- barplot(N*probs[get(paste0("K", i))], xlab = "Unique haplotypes", ylab = "Specimens sampled", names.arg = get(paste0("K",i)))
}
}
d <- data.frame(specs, means)
## Measures of Closeness ##
list(c(cat("\n Number of haplotypes sampled: " , max(means)),
cat("\n Number of haplotypes not sampled: " , Hstar - max(means)),
cat("\n Proportion of haplotypes sampled: " , max(means)/Hstar),
cat("\n Proportion of haplotypes not sampled: " , (Hstar - max(means))/Hstar)), d)
}
HAC.sim(K = 1, N = 100, Hstar = 10)
I would like to be able to easily extract the data frame 'd' from the above function.
I need to pass d to a regression model object.
The output gives me almost what I need except there is a [[1]] appended to the end of the value of (Hstar - max(means))/Hstar), instead of on a new line. Plus, [[1]] is a NULL list object. Why am I getting this NULL list as an output, and how can I fix the problem?
Any help is appreciated.
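A likely explanation for the `[[1]]` `NULL`: `cat()` prints as a side effect and returns `NULL`, so `c(cat(...), cat(...))` is `NULL`, and that becomes the first element of the returned list. A minimal sketch of one way around it, assuming it is acceptable to print the summary and return only the data frame. `sim_summary()` and its made-up inputs are hypothetical stand-ins for `HAC.sim()`, not the original function.

```r
# Hypothetical stand-in for HAC.sim(): prints the summary, returns the data frame
sim_summary <- function(specs, means, Hstar) {
  d <- data.frame(specs, means)
  # cat() is called only for its side effect; its NULL return value is discarded
  cat("\n Number of haplotypes sampled: ", max(means),
      "\n Proportion of haplotypes sampled: ", max(means) / Hstar, "\n")
  invisible(d)  # return d without auto-printing it at the console
}

# Made-up accumulation curve, for illustration only
d <- sim_summary(specs = 1:5, means = c(1, 1.8, 2.4, 2.9, 3.2), Hstar = 5)
fit <- lm(means ~ specs, data = d)  # d can be passed straight to a model
```

Assigning the result (`d <- ...`) keeps the data frame available without relying on global variables.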

Related

How to calculate fuzzy performance index and normalized classification entropy in R

I am running Fuzzy C-Means Clustering using e1071 package. I want to decide the optimum number of clusters based on fuzzy performance index (FPI) (extent of fuzziness) and normalized classification entropy (NCE) (degree of disorganization of specific class) given in the following formula
where c is the number of clusters, n is the number of observations, μik is the fuzzy membership, and log_a is the natural logarithm.
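The formula image did not survive on this page. The equations below are a reconstruction from the R functions that appear later in this thread (the `FPI` and `NCE` functions in the Edit), so treat them as an assumption rather than the original figure. The index convention (i for observations, k for clusters) is also assumed; both sums run over all memberships, so the result does not depend on it.

```latex
\mathrm{FPI} = 1 - \frac{c}{c-1}\left[\,1 - \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c}\mu_{ik}^{2}\right]

\mathrm{NCE} = \frac{n}{n-c}\left[-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c}\mu_{ik}\,\log_a \mu_{ik}\right]
```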
I am using the following code
library(e1071)
x <- rbind(matrix(rnorm(100,sd=0.3),ncol=2),
matrix(rnorm(100,mean=1,sd=0.3),ncol=2))
cl <- cmeans(x,2,20,verbose=TRUE,method="cmeans")
cl$membership
I have been able to extract the μik, i.e. the fuzzy membership. Now cmeans has to be run for different numbers of clusters, e.g. 2 to 6, and the FPI and NCE have to be calculated for each run to produce a plot like the following
How can it be achieved in R?
Edit
I have tried the code provided by @nya on the iris dataset, as follows
df <- scale(iris[-5])
FPI <- function(cmem){
c <- ncol(cmem)
n <- nrow(cmem)
1 - (c / (c - 1)) * (1 - sum(cmem^2) / n)
}
NCE <- function(cmem){
c <- ncol(cmem)
n <- nrow(cmem)
(n / (n - c)) * (- sum(cmem * log(cmem)) / n)
}
# prepare variables
cl <- list()
fpi <- nce <- NULL
# cycle through the desired number of clusters
for(i in 2:6){
cl[[i]] <- cmeans(df, i, 20, method = "cmeans")
fpi <- c(fpi, FPI(cl[[i]]$membership))
nce <- c(nce, NCE(cl[[i]]$membership))
}
# add space for the second axis label
par(mar = c(5,4,1,4) + .1)
# plot FPI
plot(2:6, fpi, lty = 2, pch = 18, type = "b", xlab = "Number of clusters", ylab = "FPI")
# plot NCE, manually adding the second axis
par(new = TRUE)
plot(2:6, nce, lty = 1, pch = 15, type = "b", xlab = "", ylab = "", axes = FALSE)
axis(4, at = pretty(range(nce)))
mtext("NCE", side = 4, line = 3)
# add legend
legend("top", legend = c("FPI", "NCE"), pch = c(18,15), lty = c(2,1), horiz = TRUE)
The optimum number of clusters is taken where the fuzzy performance index (FPI) and the normalized classification entropy (NCE) are at their minimum. Here, however, NCE is always increasing and FPI is always decreasing, which is not the pattern I expected.
With available equations, we can program our own functions. Here, the two functions use equations present in the paper you suggested and one of the references the authors cite.
FPI <- function(cmem, method = c("FuzME", "McBrathney", "Rahul")){
method = match.arg(method)
C <- ncol(cmem)
N <- nrow(cmem)
# Rahul et al. 2019. https://doi.org/10.1080/03650340.2019.1578345
if(method == "Rahul"){
res <- 1 - (C / (C - 1)) * (1 - sum(cmem^2) / N)
}
# McBrathney & Moore 1985 https://doi.org/10.1016/0168-1923(85)90082-6
if(method == "McBrathney"){
F <- sum(cmem^2) / N
res <- 1 - (C * F - 1) / (F - 1)
}
# FuzME https://precision-agriculture.sydney.edu.au/resources/software/
# MATLAB code file fvalidity.m, downloaded on 11 Nov, 2021
if(method == "FuzME"){
F <- sum(cmem^2) / N
res <- 1 - (C * F - 1) / (C - 1)
}
return(res)
}
NCE <- function(cmem, method = c("FuzME", "McBrathney", "Rahul")){
method = match.arg(method)
C <- ncol(cmem)
N <- nrow(cmem)
if(method == "Rahul"){
res <- (N / (N - C)) * (- sum(cmem * log(cmem)) / N)
}
if(method %in% c("FuzME", "McBrathney")){
H <- -1 / N * sum(cmem * log(cmem))
res <- H / log(C)
}
return(res)
}
Then use those to calculate the indices from the degrees of membership from the cmeans function from the iris dataset.
# prepare variables
cl <- list()
fpi <- nce <- NULL
# cycle through the desired number of clusters
for(i in 2:6){
cl[[i]] <- e1071::cmeans(iris[, -5], i, 20, method = "cmeans")
fpi <- c(fpi, FPI(cl[[i]]$membership, method = "M"))
nce <- c(nce, NCE(cl[[i]]$membership, method = "M"))
}
Last, plot with two different axes in one plot.
# add space for the second axis label
par(mar = c(5,4,1,4) + .1)
# plot FPI
plot(2:6, fpi, lty = 2, pch = 18, type = "b", xlab = "Number of clusters", ylab = "FPI")
# plot NCE, manually adding the second axis
par(new = TRUE)
plot(2:6, nce, lty = 1, pch = 15, type = "b", xlab = "", ylab = "", axes = FALSE)
axis(4, at = pretty(range(nce)))
mtext("NCE", side = 4, line = 3)
# add legend
legend("top", legend = c("FPI", "NCE"), pch = c(18,15), lty = c(2,1), horiz = TRUE)
EDIT1: Updated the functions according to optional equations from two different publications and calculated the example on the iris dataset.
EDIT2: Added code for the FPI and NCE calculations specified in the FuzME MATLAB code available here.
Hope this could help
library(e1071) # cmeans()
library(dplyr)
library(tidyr) # pivot_longer() is exported by tidyr, not dplyr
library(ggplot2)
f <- function(cl) {
C <- length(cl$size)
N <- sum(cl$size)
mu <- cl$membership
fpi <- 1 - C / (C - 1) * (1 - sum((mu)^2) / N)
nce <- N / (N - C) * (-sum(log(mu) * mu) / N)
c(FPI = fpi, NCE = nce)
}
K <- 2:6 # candidate numbers of clusters; must exist before sapply() uses it
data.frame(t(rbind(
K = K,
sapply(
K,
function(k) f(cmeans(x, k, 20, verbose = TRUE, method = "cmeans"))
)
))) %>%
pivot_longer(cols = FPI:NCE, names_to = "Index") %>%
ggplot(aes(x = K, y = value, group = Index)) +
geom_line(aes(linetype = Index, color = Index)) +
geom_point() +
scale_y_continuous(
name = "FPI",
sec.axis = sec_axis(~., name = "NCE")
) +
theme(legend.position = "top")

Options to speed up R code when parallelization is not feasible [duplicate]

I have developed a simulation, which can be quite slow (up to 6 hours), depending on values of input parameters.
Profiling my code reveals that sample() is the bottleneck, but to my knowledge there is no faster alternative (sample() is already implemented in C).
Parallelization will not work either since algorithm iterations are dependent on one another.
Any suggestions on alternatives to improve runtime?
Any advice is warmly welcomed.
Below is my simulation code:
### Haplotype Accumulation Curve Simulator
HAC.sim <- function(K = 1, N, Hstar, probs, m = 0, perms = 10000, p = 0.95){
pop <- array(dim = c(c(perms, ceiling((1 - m) * N / K)), K))
haps <- as.character(1:Hstar)
specs <- 1:ceiling((1 - m) * N / K)
for (j in 1:perms){
for (i in 1:K){
pop[j, specs, i] <- sample(haps, size = length(specs), replace = TRUE, prob = probs)
}
}
HAC.mat <- array(dim = c(c(perms, length(specs), K)))
for (k in specs){
for (j in 1:perms){
for (i in 1:K){
ind.index <- sample(specs, size = k, replace = FALSE)
hap.plot <- pop[sample(1:nrow(pop), size = 1, replace = TRUE), ind.index, sample(i, size = 1, replace = TRUE)]
HAC.mat[j, k, i] <- length(unique(hap.plot))
}
}
}
means <- apply(HAC.mat, MARGIN = 2, mean)
lower <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.025))
upper <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.975))
d <- assign("d", data.frame(specs, means), envir = .GlobalEnv)
P <- max(means)
Q <- Hstar - max(means)
R <- assign("R", max(means) / Hstar, envir = .GlobalEnv)
S <- (Hstar - max(means)) / Hstar
Nstar <- assign("Nstar", (N * Hstar) / max(means), envir = .GlobalEnv)
X <- ((N * Hstar) / max(means)) - N
cat("\n Measures of Sampling Closeness \n \n Mean number of haplotypes sampled: " , P, "\n Mean number of haplotypes not sampled: " , Q, "\n Proportion of haplotypes sampled: " , R, "\n Proportion of haplotypes not sampled: " , S, "\n \n Calculated mean value of N*: ", Nstar, "\n Mean number of individuals not sampled: ", X, "\n \n")
if (R < p){
cat("Desired level of H* has not yet been reached \n")
} else{
cat("Desired level of H* has been reached")
}
par(mfrow = c(1, 2))
plot(specs, means, type = "n", xlab = "Specimens sampled", ylab = "Unique haplotypes", ylim = c(1, Hstar))
polygon(x = c(specs, rev(specs)), y = c(lower, rev(upper)), col = "gray")
lines(specs, means, lwd = 2)
HAC.bar <- barplot(length(specs) * probs, xlab = "Unique haplotypes", ylab = "Specimens sampled", names.arg = 1:Hstar)
}
A quick reproducible example
N <- 50
Hstar <- 5
probs <- rep(1/Hstar, Hstar)
HAC.sim(N = N, Hstar = Hstar, probs = probs)
Sure, Rcpp is an option but would require you to rewrite parts of your code in C++. One non-invasive option is to use the compiler package that can offer some speed improvements with minimal effort:
#old R version of lapply
slow_func <- function(X, FUN, ...) {
FUN <- match.fun(FUN)
if (!is.list(X))
X <- as.list(X)
rval <- vector("list", length(X))
for(i in seq(along = X))
rval[i] <- list(FUN(X[[i]], ...))
names(rval) <- names(X) # keep `names' !
return(rval)
}
# Compiled versions
require(compiler)
slow_func_compiled <- cmpfun(slow_func)
You can read more about it here. However, if the only issue is sample(), it looks like Rcpp has an alternative implementation, though it does not appear to be faster in all cases.
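A minimal sketch of how the compiled and uncompiled versions could be compared before committing to the approach. `count_unique()` is a hypothetical toy function, not part of `HAC.sim()`. Note that since R 3.4.0 the JIT byte-compiler is enabled by default, so on a current R installation the two timings may be nearly identical.

```r
library(compiler)

# Toy loop-heavy function (hypothetical): unique labels per resample
count_unique <- function(x, reps) {
  out <- integer(reps)
  for (r in seq_len(reps)) {
    out[r] <- length(unique(sample(x, size = length(x), replace = TRUE)))
  }
  out
}
count_unique_cmp <- cmpfun(count_unique)

# Same seed, same random draws: compilation must not change the results
set.seed(1); res_plain <- count_unique(1:10, reps = 1000)
set.seed(1); res_cmp   <- count_unique_cmp(1:10, reps = 1000)
identical(res_plain, res_cmp)

# Compare run times on your own machine
system.time(count_unique(1:10, reps = 1000))
system.time(count_unique_cmp(1:10, reps = 1000))
```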

while loop within a user-defined R function is not breaking

I have created a user defined function as follows:
HAC.sim(K = 1, N, Hstar, m = 0, probs, perms, p = 0.95)
and within the body of this function, I would like to put the following 'while' loop (because I want to eventually create a package for easy use).
while(R < p){
HAC.sim(K = 1, N = Nstar, Hstar, m = 0, probs, perms, p = 0.95)
}
R is a variable that is defined in the main function body.
The problem is that the 'while' loop does not break once its condition stops being true (i.e., once R >= p).
My routine works (i.e., it breaks successfully) when the 'while' loop is placed outside of the function, but fails when I place it within the main function body.
I have also tried implementing a repeat loop with break, but the same problem occurs.
Any ideas on how I can fix the issue?
Here is the full function (quite long)
HAC.sim <- function(K = 1, N, Hstar, probs, m = 0, perms = 10000, p = 0.95){
### Set up a container to hold the identity of each individual from each permutation
pop <- array(dim = c(c(perms, ceiling((1 - m) * N / K)), K))
### Create an ID for each haplotype
haps <- as.character(1:Hstar)
### Generate permutations, we assume each permutation has Npop individuals, and we sample those individuals' haplotypes from the probabilities
specs <- 1:ceiling((1 - m) * N / K)
for(j in 1:perms){
for(i in 1:K){
pop[j, specs, i] <- sample(haps, size = ceiling((1 - m) * N / K), replace = TRUE, prob = probs)
}
}
### Make a matrix to hold the 1:N individuals from each permutation
HAC.mat <- array(dim = c(c(perms, ceiling((1 - m) * N / K)), K))
for(k in specs){
for(j in 1:perms){
for(i in 1:K){
ind.index <- sample(specs, size = k, replace = FALSE) ## which individuals will we sample
hap.plot <- pop[sample(1:nrow(pop), size = 1, replace = TRUE), ind.index, sample(i, size = 1, replace = TRUE)] ## pull those individuals from a permutation
HAC.mat[j, k, i] <- length(unique(hap.plot)) ## how many haplotypes did we get for a given sampling intensity (j) from each permutation (i)
}
}
}
### Calculate the mean and CI for number of haplotypes at each sampling intensity (k)
means <- apply(HAC.mat, MARGIN = 2, mean)
lower <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.025))
upper <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.975))
assign("d", data.frame(specs, means), envir = .GlobalEnv)
### Compute Measures of Sampling Closeness
P <- max(means)
Q <- Hstar - max(means)
R <- assign("R", max(means) / Hstar, envir = .GlobalEnv)
S <- (Hstar - max(means)) / Hstar
Nstar <- assign("Nstar", (N * Hstar) / max(means), envir = .GlobalEnv)
X <- ((N * Hstar) / max(means)) - N
cat("\n Input parameters \n \n Number of (sub)populations: ", K, "\n Number of individuals: ", N, "\n Number of haplotypes: ", Hstar, "\n Haplotype distribution: ", probs, "\n Migration rate: ", m, "\n Number of permutations: ", perms, "\n Proportion of haplotypes to recover: ", p, "\n \n \n Measures of Sampling Closeness \n \n Mean number of haplotypes sampled: " , P, "\n Mean number of haplotypes not sampled: " , Q, "\n Proportion of haplotypes sampled: " , R, "\n Proportion of haplotypes not sampled: " , S, "\n \n Calculated mean value of N*: ", Nstar, "\n Mean number of individuals not sampled: ", X, "\n \n")
if(R < p){
cat("Desired level of H* has not yet been reached")
} else{
cat("Desired level of H* has been reached")
}
### Plot the curve and frequency barplot
par(mfrow = c(1, 2))
plot(specs, means, type = "n", xlab = "Specimens sampled", ylab = "Unique haplotypes", ylim = c(1, Hstar))
polygon(x = c(specs, rev(specs)), y = c(lower, rev(upper)), col = "gray")
lines(specs, means, lwd = 2)
HAC.bar <- barplot(ceiling((1 - m) * N / K)*probs, xlab = "Unique haplotypes", ylab = "Specimens sampled", names.arg = 1:Hstar)
while(R < p){
HAC.sim(K = K, N = ceiling(Nstar), Hstar = Hstar, probs = probs, m = m, perms = perms, p = p)
}
}
### Run simulation
HAC.sim(K = K, N = N, Hstar = Hstar, probs = probs, m = m, perms = perms, p = p)
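One plausible reading of the behaviour: inside the function, `while(R < p)` tests the `R` that belongs to the current call. The recursive call computes its own `R` (and overwrites the global copy), but the caller's local `R` never changes, so the loop cannot end. A minimal sketch of an alternative, assuming the simulation can return its results instead of assigning them globally. `run_once()` is a hypothetical stand-in with a made-up recovery curve, and `iterate_sim()` and `max_iter` are likewise illustrative names.

```r
# Hypothetical stand-in for one pass of HAC.sim(): returns results, no globals
run_once <- function(N, Hstar) {
  R <- 1 - exp(-N / (10 * Hstar))  # made-up recovery curve, illustration only
  list(R = R, Nstar = N / R)       # same as (N * Hstar) / max(means)
}

# Repeat the simulation until the target proportion p is reached
iterate_sim <- function(N, Hstar, p = 0.95, max_iter = 100) {
  res <- run_once(N, Hstar)
  iter <- 1
  while (res$R < p && iter < max_iter) {
    N <- ceiling(res$Nstar)
    res <- run_once(N, Hstar)      # res is refreshed on every pass
    iter <- iter + 1
  }
  list(N = N, R = res$R, iterations = iter)
}

out <- iterate_sim(N = 50, Hstar = 5)
```

Because the loop condition reads the value returned by the latest pass, it stops as soon as `R >= p`; `max_iter` is only a safety cap.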

get(paste0(...)) overwrites results in R subarray

I have two subarrays and would like to sample one of them at random. The subarrays consist of character labels from a larger pool of labels as follows:
K <- as.character(1:10)
Suppose I call the arrays K1 and K2 and assign 10 labels between them as follows:
K1 <- 1:8
K2 <- 9:10
I then use
get(paste0("K", i))
to retrieve the labels that were sampled using a 'for' loop.
The issue I am having is this: if K1 is sampled, get(paste0("K", i)) returns labels from K2 (which contain labels 9:10) instead of returning K1 as needed.
In other words, I believe get(paste0("K", i)) may be overwriting results.
Any thoughts on why this may occur and how to go about circumventing the issue?
Below is a more extensive example:
K <- 2
N <- 100
Hstar <- 10
perms <- 10000
probs <- rep(1/Hstar, Hstar)
specs <- 1:N
pop <- array(dim = c(c(perms, N), K))
haps <- as.character(1:Hstar)
K1 <- 1:8 # subarray 1
K2 <- 9:10 # subarray 2
for(j in 1:perms){
for(i in 1:K){
if(i == 1){
pop[j, specs, i] <- sample(haps, size = N, replace = TRUE, prob = probs)
}
else{
pop[j ,, 1] <- sample(haps[K1], size = N, replace = TRUE, prob = probs[K1])
pop[j ,, 2] <- sample(haps[K2], size = N, replace = TRUE, prob = probs[K2])
}
}
}
HAC.mat <- array(dim = c(c(perms, N), K))
for(k in specs){
for(j in 1:perms){
for(i in 1:K){
ind.index <- sample(specs, size = k, replace = FALSE)
hap.plot <- pop[sample(1:nrow(pop), size = 1, replace = TRUE), ind.index, sample(i, size = 1, replace = TRUE)]
HAC.mat[j, k, i] <- length(unique(hap.plot))
}
}
}
means <- apply(HAC.mat, MARGIN = 2, mean)
lower <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.025))
upper <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.975))
d <- data.frame(specs, means, lower, upper)
par(mfrow = c(1, 2))
if(i == 1){
plot(specs, means, type = "n", xlab = "Specimens sampled", ylab = "Unique haplotypes", ylim = c(1, Hstar))
polygon(x = c(specs, rev(specs)), y = c(lower, rev(upper)), col = "gray")
lines(specs, means, lwd = 2)
HAC.bar <- barplot(N*probs, xlab = "Unique haplotypes", ylab = "Specimens sampled", names.arg = 1:Hstar)
}
else if(i > 1){
plot(specs, means, type = "n", xlab = "Specimens sampled", ylab = "Unique haplotypes", ylim = c(1, Hstar))
polygon(x = c(specs, rev(specs)), y = c(lower, rev(upper)), col = "gray")
lines(specs, means, lwd = 2)
HAC.bar <- barplot(N*probs[get(paste0("K", i))], xlab = "Unique haplotypes", ylab = "Specimens sampled", names.arg = get(paste0("K", i))) ## The issue may lie here
}
Any advice on what may be going on here is greatly appreciated.
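A possible cause, sketched under the assumption that the plotting code runs after the loops have finished, as in the example above: a `for` loop variable keeps its last value once the loop ends, so `i` equals `K` (here 2) and `get(paste0("K", i))` always resolves to `K2`. Nothing is being overwritten. The `demes` list is a hypothetical alternative to `get()`.

```r
K1 <- 1:8   # subarray 1
K2 <- 9:10  # subarray 2
K <- 2

for (i in 1:K) {
  # ... simulation work ...
}

# i still holds the last loop value, so this lookup always returns K2
after_loop <- get(paste0("K", i))

# Looking the subarrays up once per index, via a list instead of get()
demes <- list(K1, K2)
inside_loop <- lapply(seq_len(K), function(idx) demes[[idx]])
```

Moving the `if`/`else` plotting block inside a `for (i in 1:K)` loop, or indexing a list as above, lets each subarray be handled in turn.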

Plot multiple graphs in one figure using a loop

I need to compute the efficient frontier under different risk measures and use a bootstrapping technique to simulate possible outcomes. However, I'm stuck: I want a loop (which will later be integrated into a function) to generate multiple efficient frontiers, each associated with a possible future outcome, and to plot them on the same figure so I can see how they change as the simulation goes on. Here is the loop I have written so far:
for (i in 1:B) {
idx <- sample(1:N, N, replace = TRUE)
new.x <- x[idx, ]
µ.b <- apply(X = new.x, 2, FUN = mean)
range.b[, i] <- seq(from = min(µ.b), to = max(µ.b), length.out = steps)
sigma.b <- apply(X = new.x, 2, FUN = sd)
riskCov.b[, i] <- sapply(range.b[, i], function(targetReturn) {
w <- MV_QP(new.x, targetReturn, Sigma)
sd(c(new.x %*% w))
})
xlim.b <- range(c(sigma.b, riskCov.b[, 1]), na.rm = TRUE)
ylim.b <- range(µ.b)
par(new = TRUE)
plot(x = riskCov.b[, i], y = range.b[, i], type = "l", xlim = xlim.b, ylim = ylim.b, xlab = "Risk", ylab = "Return", main = "Resampling EFs")
}
The problem is that the x and y axes are redrawn each time the loop runs. How can this problem be solved?
I don't know if the optimization is correct. For plotting, you can try the following:
for (i in 1:B) {
idx <- sample(1:N, N, replace = TRUE)
new.x <- x[idx, ]
µ.b <- apply(X = new.x, 2, FUN = mean)
range.b[, i] <- seq(from = min(µ.b), to = max(µ.b), length.out = steps)
#sigma.b <- apply(X = new.x, 2, FUN = sd)
riskCov.b[, i] <- sapply(range.b[, i], function(targetReturn) {
w <- MV_QP(new.x, targetReturn,Sigma=cov(new.x))
sd(c(new.x %*% w))
})
}
xlim.b <- range(c(apply(X = x, 2, FUN= sd), riskCov.b), na.rm = TRUE) *c(0.98,1.02)
ylim.b <- range(range.b) *c(0.98,1.02) # use all resamples, not only the last µ.b
#par(new = TRUE)
for (i in 1:B){
if (i==1) plot(x = riskCov.b[, i], y = range.b[, i], type = "l", xlim = xlim.b, ylim = ylim.b, xlab = "Risk", ylab = "Return", main = "Resampling EFs") else
lines(x = riskCov.b[, i], y = range.b[, i],col=rainbow(B)[i])
}
Depending on your data, you should end up with a similar plot:
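An alternative sketch for the plotting step, assuming the curves are stored one per column as in `riskCov.b` and `range.b`: `matplot()` draws every column on shared axes in a single call, so the axis limits are computed once. The data below are made up for illustration and do not come from any optimization.

```r
set.seed(42)
B <- 5        # number of bootstrap curves (hypothetical)
steps <- 20   # points per curve

# Made-up frontier-like curves: one column per resample
ret  <- sapply(seq_len(B), function(b) seq(0.01, 0.10, length.out = steps))
risk <- sapply(seq_len(B), function(b) 0.05 + (ret[, b] - 0.03)^2 * runif(1, 8, 12))

# All columns are drawn against the same axes in one call
matplot(risk, ret, type = "l", lty = 1, col = rainbow(B),
        xlab = "Risk", ylab = "Return", main = "Resampling EFs")
```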
