while loop within a user-defined R function is not breaking - r

I have created a user defined function as follows:
HAC.sim(K = 1, N, Hstar, m = 0, probs, perms, p = 0.95)
and within the body of this function, I would like to put the following 'while' loop (because I want to eventually create a package for easy use).
while(R < p){
HAC.sim(K = 1, N = Nstar, Hstar, m = 0, probs, perms, p = 0.95)
}
R is a variable that is defined in the main function body.
The problem is that the 'while' loop does not break once the condition is no longer met (i.e., the moment R reaches or exceeds p).
My routine works (i.e., it breaks successfully) when the 'while' loop is placed outside of the function, but fails when I place it within the main function body.
I have also tried implementing a repeat loop with break, but the same problem occurs.
Any ideas on how I can fix the issue?
Here is the full function (quite long)
HAC.sim <- function(K = 1, N, Hstar, probs, m = 0, perms = 10000, p = 0.95){
### Set up a container to hold the identity of each individual from each permutation
pop <- array(dim = c(c(perms, ceiling((1 - m) * N / K)), K))
### Create an ID for each haplotype
haps <- as.character(1:Hstar)
### Generate permutations, we assume each permutation has Npop individuals, and we sample those individuals' haplotypes from the probabilities
specs <- 1:ceiling((1 - m) * N / K)
for(j in 1:perms){
for(i in 1:K){
pop[j, specs, i] <- sample(haps, size = ceiling((1 - m) * N / K), replace = TRUE, prob = probs)
}
}
### Make a matrix to hold the 1:N individuals from each permutation
HAC.mat <- array(dim = c(c(perms, ceiling((1 - m) * N / K)), K))
for(k in specs){
for(j in 1:perms){
for(i in 1:K){
ind.index <- sample(specs, size = k, replace = FALSE) ## which individuals will we sample
hap.plot <- pop[sample(1:nrow(pop), size = 1, replace = TRUE), ind.index, sample(i, size = 1, replace = TRUE)] ## pull those individuals from a permutation
HAC.mat[j, k, i] <- length(unique(hap.plot)) ## how many haplotypes did we get for a given sampling intensity (k) from each permutation (j)
}
}
}
### Calculate the mean and CI for number of haplotypes at each sampling intensity (k)
means <- apply(HAC.mat, MARGIN = 2, mean)
lower <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.025))
upper <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.975))
assign("d", data.frame(specs, means), envir = .GlobalEnv)
### Compute Measures of Sampling Closeness
P <- max(means)
Q <- Hstar - max(means)
R <- assign("R", max(means) / Hstar, envir = .GlobalEnv)
S <- (Hstar - max(means)) / Hstar
Nstar <- assign("Nstar", (N * Hstar) / max(means), envir = .GlobalEnv)
X <- ((N * Hstar) / max(means)) - N
cat("\n Input parameters \n \n Number of (sub)populations: ", K, "\n Number of individuals: ", N, "\n Number of haplotypes: ", Hstar, "\n Haplotype distribution: ", probs, "\n Migration rate: ", m, "\n Number of permutations: ", perms, "\n Proportion of haplotypes to recover: ", p, "\n \n \n Measures of Sampling Closeness \n \n Mean number of haplotypes sampled: " , P, "\n Mean number of haplotypes not sampled: " , Q, "\n Proportion of haplotypes sampled: " , R, "\n Proportion of haplotypes not sampled: " , S, "\n \n Calculated mean value of N*: ", Nstar, "\n Mean number of individuals not sampled: ", X, "\n \n")
if(R < p){
cat("Desired level of H* has not yet been reached")
} else{
cat("Desired level of H* has been reached")
}
### Plot the curve and frequency barplot
par(mfrow = c(1, 2))
plot(specs, means, type = "n", xlab = "Specimens sampled", ylab = "Unique haplotypes", ylim = c(1, Hstar))
polygon(x = c(specs, rev(specs)), y = c(lower, rev(upper)), col = "gray")
lines(specs, means, lwd = 2)
HAC.bar <- barplot(ceiling((1 - m) * N / K)*probs, xlab = "Unique haplotypes", ylab = "Specimens sampled", names.arg = 1:Hstar)
while(R < p){
HAC.sim(K = K, N = ceiling(Nstar), Hstar = Hstar, probs = probs, m = m, perms = perms, p = p)
}
}
### Run simulation
HAC.sim(K = K, N = N, Hstar = Hstar, probs = probs, m = m, perms = perms, p = p)
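One possible reading of the problem, offered as an assumption rather than a confirmed diagnosis: each recursive call updates R in .GlobalEnv and in its own frame, but the R tested by while(R < p) in the calling frame is a local copy that never changes. A minimal sketch of an alternative is to have one run return its results and let a loop re-read them on every pass. The names sim_once and sim_until and the toy formula for R are hypothetical stand-ins, not part of HAC.sim:

```r
# Toy stand-in for a single simulation run (hypothetical; the formula for R
# is illustrative only). It returns its results instead of assigning them
# to .GlobalEnv.
sim_once <- function(N, Hstar) {
  R <- 1 - exp(-N / (5 * Hstar))   # toy "proportion of haplotypes sampled"
  list(R = R, Nstar = N / R)       # same form as (N * Hstar) / max(means)
}

# Repeat the run with N = ceiling(Nstar) until R reaches p.
sim_until <- function(N, Hstar, p = 0.95, max_iter = 100) {
  res <- sim_once(N, Hstar)
  iter <- 1
  while (res$R < p && iter < max_iter) {
    N <- ceiling(res$Nstar)
    res <- sim_once(N, Hstar)      # the tested value is refreshed on every pass
    iter <- iter + 1
  }
  c(res, N = N, iterations = iter)
}

out <- sim_until(N = 10, Hstar = 5)
```

Because the loop condition reads res$R, which is reassigned inside the loop, the loop ends as soon as a run reaches p. The max_iter guard is an added safety net, not something the original code has.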


How to calculate fuzzy performance index and normalized classification entropy in R

I am running Fuzzy C-Means Clustering using e1071 package. I want to decide the optimum number of clusters based on fuzzy performance index (FPI) (extent of fuzziness) and normalized classification entropy (NCE) (degree of disorganization of specific class) given in the following formula
where c is the number of clusters, n is the number of observations, μik is the fuzzy membership, and log_a is the natural logarithm.
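The formulas referred to above are missing here. Judging from the R functions FPI and NCE posted in the edit further down in this question, they are presumably the following (a reconstruction, not the original image):

```latex
\mathrm{FPI} = 1 - \frac{c}{c - 1}\left[1 - \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c}\mu_{ik}^{2}\right]
\qquad
\mathrm{NCE} = \frac{n}{n - c}\left[-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c}\mu_{ik}\log_{a}\mu_{ik}\right]
```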
I am using the following code
library(e1071)
x <- rbind(matrix(rnorm(100,sd=0.3),ncol=2),
matrix(rnorm(100,mean=1,sd=0.3),ncol=2))
cl <- cmeans(x,2,20,verbose=TRUE,method="cmeans")
cl$membership
I have been able to extract μik, i.e. the fuzzy membership. Now cmeans has to be run for different numbers of clusters, e.g. 2 to 6, and the FPI and NCE have to be calculated for each to produce a plot like the following
How can it be achieved in R?
Edit
I have tried the code provided by @nya on the iris dataset, as follows
df <- scale(iris[-5])
FPI <- function(cmem){
c <- ncol(cmem)
n <- nrow(cmem)
1 - (c / (c - 1)) * (1 - sum(cmem^2) / n)
}
NCE <- function(cmem){
c <- ncol(cmem)
n <- nrow(cmem)
(n / (n - c)) * (- sum(cmem * log(cmem)) / n)
}
# prepare variables
cl <- list()
fpi <- nce <- NULL
# cycle through the desired number of clusters
for(i in 2:6){
cl[[i]] <- cmeans(df, i, 20, method = "cmeans")
fpi <- c(fpi, FPI(cl[[i]]$membership))
nce <- c(nce, NCE(cl[[i]]$membership))
}
# add space for the second axis label
par(mar = c(5,4,1,4) + .1)
# plot FPI
plot(2:6, fpi, lty = 2, pch = 18, type = "b", xlab = "Number of clusters", ylab = "FPI")
# plot NCE, manually adding the second axis
par(new = TRUE)
plot(2:6, nce, lty = 1, pch = 15, type = "b", xlab = "", ylab = "", axes = FALSE)
axis(4, at = pretty(range(nce)))
mtext("NCE", side = 4, line = 3)
# add legend
legend("top", legend = c("FPI", "NCE"), pch = c(18,15), lty = c(2,1), horiz = TRUE)
The optimum number of clusters is taken where the fuzzy performance index (FPI) and normalized classification entropy (NCE) reach their minimum values. In my result, however, NCE is always increasing and FPI is always decreasing. Ideally it should have been
With available equations, we can program our own functions. Here, the two functions use equations present in the paper you suggested and one of the references the authors cite.
FPI <- function(cmem, method = c("FuzME", "McBrathney", "Rahul")){
method = match.arg(method)
C <- ncol(cmem)
N <- nrow(cmem)
# Rahul et al. 2019. https://doi.org/10.1080/03650340.2019.1578345
if(method == "Rahul"){
res <- 1 - (C / (C - 1)) * (1 - sum(cmem^2) / N)
}
# McBrathney & Moore 1985 https://doi.org/10.1016/0168-1923(85)90082-6
if(method == "McBrathney"){
F <- sum(cmem^2) / N
res <- 1 - (C * F - 1) / (F - 1)
}
# FuzME https://precision-agriculture.sydney.edu.au/resources/software/
# MATLAB code file fvalidity.m, downloaded on 11 Nov, 2021
if(method == "FuzME"){
F <- sum(cmem^2) / N
res <- 1 - (C * F - 1) / (C - 1)
}
return(res)
}
NCE <- function(cmem, method = c("FuzME", "McBrathney", "Rahul")){
method = match.arg(method)
C <- ncol(cmem)
N <- nrow(cmem)
if(method == "Rahul"){
res <- (N / (N - C)) * (- sum(cmem * log(cmem)) / N)
}
if(method %in% c("FuzME", "McBrathney")){
H <- -1 / N * sum(cmem * log(cmem))
res <- H / log(C)
}
return(res)
}
Then use those to calculate the indices from the degrees of membership from the cmeans function from the iris dataset.
# prepare variables
cl <- list()
fpi <- nce <- NULL
# cycle through the desired number of clusters
for(i in 2:6){
cl[[i]] <- e1071::cmeans(iris[, -5], i, 20, method = "cmeans")
fpi <- c(fpi, FPI(cl[[i]]$membership, method = "M"))
nce <- c(nce, NCE(cl[[i]]$membership, method = "M"))
}
Last, plot with two different axes in one plot.
# add space for the second axis label
par(mar = c(5,4,1,4) + .1)
# plot FPI
plot(2:6, fpi, lty = 2, pch = 18, type = "b", xlab = "Number of clusters", ylab = "FPI")
# plot NCE, manually adding the second axis
par(new = TRUE)
plot(2:6, nce, lty = 1, pch = 15, type = "b", xlab = "", ylab = "", axes = FALSE)
axis(4, at = pretty(range(nce)))
mtext("NCE", side = 4, line = 3)
# add legend
legend("top", legend = c("FPI", "NCE"), pch = c(18,15), lty = c(2,1), horiz = TRUE)
EDIT1: Updated the functions according to optional equations from two different publications and calculated the example on the iris dataset.
EDIT2: Added code for the FPI and NCE calculations specified in the FuzME MATLAB code available here.
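A small numerical check, assuming the "Rahul" and "FuzME" expressions are exactly as coded in the FPI function above: algebraically, 1 - (C / (C - 1)) * (1 - F) equals (C * F - 1) / (C - 1), so the two variants are complements of each other and sum to 1. This may be why one variant decreases where the other increases. The membership matrix below is random illustrative data, not output from cmeans:

```r
set.seed(1)

# Random membership matrix: 20 observations, 3 clusters, each row sums to 1
mu <- matrix(runif(60), nrow = 20)
mu <- mu / rowSums(mu)

C <- ncol(mu)
N <- nrow(mu)
pc <- sum(mu^2) / N   # the quantity called F in the functions above

fpi_rahul <- 1 - (C / (C - 1)) * (1 - pc)
fpi_fuzme <- 1 - (C * pc - 1) / (C - 1)

# The two variants are complementary
total <- fpi_rahul + fpi_fuzme
```

If the expected plot was produced with one convention and the code uses the other, the FPI curve will appear mirrored.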
Hope this could help
library(e1071) # provides cmeans()
library(dplyr)
library(tidyr) # provides pivot_longer()
library(ggplot2)
f <- function(cl) {
C <- length(cl$size)
N <- sum(cl$size)
mu <- cl$membership
fpi <- 1 - C / (C - 1) * (1 - sum((mu)^2) / N)
nce <- N / (N - C) * (-sum(log(mu) * mu) / N)
c(FPI = fpi, NCE = nce)
}
# define K first: naming an argument K inside rbind() does not create a variable
K <- 2:6
data.frame(t(rbind(
K = K,
sapply(
K,
function(k) f(cmeans(x, k, 20, verbose = TRUE, method = "cmeans"))
)
))) %>%
pivot_longer(cols = FPI:NCE, names_to = "Index") %>%
ggplot(aes(x = K, y = value, group = Index)) +
geom_line(aes(linetype = Index, color = Index)) +
geom_point() +
scale_y_continuous(
name = "FPI",
sec.axis = sec_axis(~., name = "NCE")
) +
theme(legend.position = "top")

My P-values are way lower than I expected and can not build a proper power curve

pval.dist.sim = function(n, sigma_x, rho, reps = 2500){
p = 5; sigma = sqrt(2)
beta = c(0.5, 0.5, 0, 0.25, 0)
mu = 10
# generate vector for pvals
pval.list = numeric(reps)
for(r in 1:reps){
# generate design matrix
X = gen_X(n = n, p = 5, rho = rho, sigma_x = sigma_x, mu = mu)
# generate the XtXinv portion of equation
XtXinv = qr.solve(crossprod(X))
sqrtXtXinv55 = sqrt(XtXinv[5,5])
y = X %*% beta + rnorm(n = n)
beta.hat = XtXinv %*% crossprod(X, y)
sE = sqrt(sum((y - X %*% beta.hat)^2)/(n-p))
t.val = beta.hat[3]/(sE * sqrtXtXinv55)
pval.list[r] = 2 * pt(-abs(t.val), df = n - p)
}
return(pval.list)
}
Above is the pval.dist.sim simulation function. I need to run it to generate the p-values from which I build my power curve:
set.seed(3701)
# givens
p = 5; d = 2; mu = 10; sigmasqrd = 2; reps = 2500
n.list = seq(from=10, to=150, by=10)
# create a vector for the estimates of the power
est.power = numeric(length(n.list))
# create a vector for the left endpoints of the 95% CI
LB.list = numeric(length(n.list))
# create a vector for the right endpoints of the 95% CI
UB.list = numeric(length(n.list))
for(j in 1:length(n.list)){
# perform the test reps times
pvals = pval.dist.sim(n = n.list[j], sigma_x = 1.5, rho = 0.2, reps = reps )
# record the simulated estimate of the power
est.power[j] = mean(pvals<0.05)
# compute the 95% conf int
bounds = binom.test(x=sum(pvals < 0.05), n = reps, conf.level = 0.95)$conf.int[1:2]
LB.list[j] = bounds[1]
UB.list[j] = bounds[2]
}
## plot the power curve estimation
plot(n.list, est.power, t = "l", xlab = "n",ylab = "Power")
My problem is that the values I get when I plug the p-values into the power estimate are drastically low, in the single-digit percentages. What am I doing wrong?
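Two things may be worth checking in the code above. Both are observations about the posted code, not a confirmed diagnosis: the t statistic uses beta.hat[3] together with the [5,5] element of XtXinv, and the true value of beta[3] is 0. When the tested coefficient is truly 0, the rejection rate estimates the test's size (about 0.05), not its power. The sketch below illustrates this with a hypothetical stand-in gen_X_sketch for gen_X, which is not shown in the question; it draws independent normal columns and ignores rho:

```r
set.seed(1)

# Hypothetical stand-in for gen_X(): independent normal columns (rho ignored)
gen_X_sketch <- function(n, p, mu, sigma_x) {
  matrix(rnorm(n * p, mean = mu, sd = sigma_x), nrow = n, ncol = p)
}

# Rejection rate of the t-test on coefficient idx, using the matching
# diagonal element of (X'X)^{-1} for its standard error
reject_rate <- function(idx, n = 50, reps = 500, alpha = 0.05) {
  beta <- c(0.5, 0.5, 0, 0.25, 0)
  p <- length(beta)
  pvals <- numeric(reps)
  for (r in seq_len(reps)) {
    X <- gen_X_sketch(n, p, mu = 10, sigma_x = 1.5)
    XtXinv <- solve(crossprod(X))
    y <- X %*% beta + rnorm(n, sd = sqrt(2))
    beta.hat <- XtXinv %*% crossprod(X, y)
    sE <- sqrt(sum((y - X %*% beta.hat)^2) / (n - p))
    t.val <- beta.hat[idx] / (sE * sqrt(XtXinv[idx, idx]))
    pvals[r] <- 2 * pt(-abs(t.val), df = n - p)
  }
  mean(pvals < alpha)
}

rate_null <- reject_rate(idx = 3)  # true coefficient is 0
rate_alt  <- reject_rate(idx = 1)  # true coefficient is 0.5
```

Here rate_null should stay near the nominal level while rate_alt is much larger. The error standard deviation sqrt(2) is taken from the sigma defined in the question, which the original rnorm(n = n) call does not use.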

Number of items to replace is not a multiple of replacement length - Datacamp

I follow some classes in DataCamp about R and sometimes when I replicate the code from datacamp to R-studio, I have issues but usually, I find the answer here or generally online. However, this time I cannot understand what is the mistake or how to fix it. I copy paste the code from datacamp
# Compute cross-validated errors for up to 8 steps ahead
e <- matrix(NA_real_, nrow = 1000, ncol = 8)
for(h in 1:8)
e[, h] <- tsCV(goog, forecastfunction = naive, h = h)
# Compute the MSE values and remove missing values
mse <- colMeans(e^2, na.rm = TRUE)
# Plot the MSE values against the forecast horizon
data.frame(h = 1:8, MSE = mse) %>% ggplot(aes(x = h, y = MSE)) + geom_point()
When I add this code to R-studio (including the demanded packages) I always get the following error:
Error in e[, h] <- tsCV(goog, forecastfunction = naive, h = h) :
number of items to replace is not a multiple of replacement length
Does anybody know why this happens?
The problem is that tsCV with h > 1 returns a matrix with h columns (h=1, ..., h=h), so its result cannot be assigned to the single column e[, h]. The following code solves this by keeping only column h of each result. The case h = 1 cannot be handled inside the loop, because tsCV(goog, forecastfunction = naive, h = 1) returns a vector and indexing it with [, 1] fails with an "incorrect number of dimensions" error, so we calculate it separately:
library(forecast)
library(fpp2)
e <- matrix(data = NA, nrow = 1000, ncol =8)
for(h in 2:8){
e[, h] <- tsCV(goog, forecastfunction = naive, h = h)[,h]
}
e[,1]<- tsCV(goog, forecastfunction = naive, h = 1)
# Compute the MSE values and remove missing values
mse <- colMeans(e^2, na.rm = TRUE)
# Plot the MSE values against the forecast horizon
data.frame(h = 1:8, MSE = mse) %>% ggplot(aes(x = h, y = MSE)) + geom_point()
note that:
tsCV stands for Time series cross-validation from forecast package
goog is a data set from fpp2 package
when you set tsCV(h = n), it returns n columns and calculates the values for all horizons 1:n.
You can simply change your code to
# Compute cross-validated errors for up to 8 steps ahead
e <- matrix(NA_real_, nrow = 1000, ncol = 8)
e <- tsCV(goog, forecastfunction = naive, h = 8)
# Compute the MSE values and remove missing values
mse <- colMeans(e^2, na.rm = TRUE)
# Plot the MSE values against the forecast horizon
data.frame(h = 1:8, MSE = mse) %>%
ggplot(aes(x = h, y = MSE)) + geom_point()
If you want to know more about the tsCV function, below is the function code
function (y, forecastfunction, h = 1, window = NULL, ...)
{
y <- as.ts(y)
n <- length(y)
e <- ts(matrix(NA_real_, nrow = n, ncol = h))
tsp(e) <- tsp(y)
for (i in seq_len(n - 1)) {
fc <- try(suppressWarnings(forecastfunction(subset(y,
start = ifelse(is.null(window), 1L, ifelse(i - window >=
0L, i - window + 1L, stop("small window"))),
end = i), h = h, ...)), silent = TRUE)
if (!is.element("try-error", class(fc))) {
e[i, ] <- y[i + (1:h)] - fc$mean
}
}
if (h == 1) {
return(e[, 1L])
}
else {
colnames(e) <- paste("h=", 1:h, sep = "")
return(e)
}
}
<bytecode: 0x10e17fe70>
<environment: namespace:forecast>
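The error discussed above can be reproduced in base R without the forecast package. In this sketch, fake_cv is a hypothetical stand-in that only mimics the shape of what tsCV returns (a vector for h = 1, an n x h matrix otherwise) and fills it with random numbers:

```r
# Hypothetical stand-in for tsCV(): only the shape of the result is mimicked
fake_cv <- function(n, h) {
  e <- matrix(rnorm(n * h), nrow = n, ncol = h)
  if (h == 1) e[, 1] else e
}

n <- 10
e <- matrix(NA_real_, nrow = n, ncol = 3)

# h = 2 yields 2 * n values, which cannot fill the n cells of a single column
msg <- tryCatch({
  e[, 2] <- fake_cv(n, h = 2)
  "no error"
}, error = function(err) conditionMessage(err))

# Calling once with the largest horizon fills every column in one step
e <- fake_cv(n, h = 3)
```

After running this, msg holds the same "number of items to replace" message as in the question, and e is an n x 3 matrix obtained from a single call.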

Options to speed up R code when parallelization is not feasible [duplicate]

I have developed a simulation, which can be quite slow (up to 6 hours), depending on values of input parameters.
Profiling my code reveals that sample() is the bottleneck, but to my knowledge there is no better function (sample() is already optimized for speed since it is written in C).
Parallelization will not work either since algorithm iterations are dependent on one another.
Any suggestions on alternatives to improve runtime?
Any advice is warmly welcomed.
Below is my simulation code:
### Haplotype Accumulation Curve Simulator
HAC.sim <- function(K = 1, N, Hstar, probs, m = 0, perms = 10000, p = 0.95){
pop <- array(dim = c(c(perms, ceiling((1 - m) * N / K)), K))
haps <- as.character(1:Hstar)
specs <- 1:ceiling((1 - m) * N / K)
for (j in 1:perms){
for (i in 1:K){
pop[j, specs, i] <- sample(haps, size = length(specs), replace = TRUE, prob = probs)
}
}
HAC.mat <- array(dim = c(c(perms, length(specs), K)))
for (k in specs){
for (j in 1:perms){
for (i in 1:K){
ind.index <- sample(specs, size = k, replace = FALSE)
hap.plot <- pop[sample(1:nrow(pop), size = 1, replace = TRUE), ind.index, sample(i, size = 1, replace = TRUE)]
HAC.mat[j, k, i] <- length(unique(hap.plot))
}
}
}
means <- apply(HAC.mat, MARGIN = 2, mean)
lower <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.025))
upper <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.975))
d <- assign("d", data.frame(specs, means), envir = .GlobalEnv)
P <- max(means)
Q <- Hstar - max(means)
R <- assign("R", max(means) / Hstar, envir = .GlobalEnv)
S <- (Hstar - max(means)) / Hstar
Nstar <- assign("Nstar", (N * Hstar) / max(means), envir = .GlobalEnv)
X <- ((N * Hstar) / max(means)) - N
cat("\n Measures of Sampling Closeness \n \n Mean number of haplotypes sampled: " , P, "\n Mean number of haplotypes not sampled: " , Q, "\n Proportion of haplotypes sampled: " , R, "\n Proportion of haplotypes not sampled: " , S, "\n \n Calculated mean value of N*: ", Nstar, "\n Mean number of individuals not sampled: ", X, "\n \n")
if (R < p){
cat("Desired level of H* has not yet been reached \n")
} else{
cat("Desired level of H* has been reached")
}
par(mfrow = c(1, 2))
plot(specs, means, type = "n", xlab = "Specimens sampled", ylab = "Unique haplotypes", ylim = c(1, Hstar))
polygon(x = c(specs, rev(specs)), y = c(lower, rev(upper)), col = "gray")
lines(specs, means, lwd = 2)
HAC.bar <- barplot(length(specs) * probs, xlab = "Unique haplotypes", ylab = "Specimens sampled", names.arg = 1:Hstar)
}
A quick reproducible example
N <- 50
Hstar <- 5
probs <- rep(1/Hstar, Hstar)
HAC.sim(N = N, Hstar = Hstar, probs = probs)
Sure, Rcpp is an option but would require you to rewrite parts of your code in C++. One non-invasive option is to use the compiler package that can offer some speed improvements with minimal effort:
#old R version of lapply
slow_func <- function(X, FUN, ...) {
FUN <- match.fun(FUN)
if (!is.list(X))
X <- as.list(X)
rval <- vector("list", length(X))
for(i in seq(along = X))
rval[i] <- list(FUN(X[[i]], ...))
names(rval) <- names(X) # keep `names' !
return(rval)
}
# Compiled versions
require(compiler)
slow_func_compiled <- cmpfun(slow_func)
You can read more about it here. However, if the only issue is sample(), it looks like Rcpp has an alternative implementation. It does not appear to be faster in all cases, though.
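As a complement to the suggestions above, part of the cost may come from calling sample() once per permutation rather than from sample() itself. This is a minimal sketch, assuming the draws in the first loop of HAC.sim are independent with replacement, so one large draw per subpopulation has the same distribution. It covers only the construction of pop, and the sizes are illustrative:

```r
set.seed(42)

perms  <- 1000
n_spec <- 50          # plays the role of length(specs)
K      <- 1
Hstar  <- 5
haps   <- as.character(1:Hstar)
probs  <- rep(1 / Hstar, Hstar)

# One sample() call per subpopulation instead of one per permutation
pop <- array(dim = c(perms, n_spec, K))
for (i in 1:K) {
  pop[, , i] <- matrix(
    sample(haps, size = perms * n_spec, replace = TRUE, prob = probs),
    nrow = perms, ncol = n_spec
  )
}
```

The second, triple-nested loop samples without replacement for every k, so it would need a different treatment. Profiling before and after is the only way to know whether this change matters for a given set of inputs.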

Extract a named dataframe from a function in R

I have created the following function that I need to extract some information from. However, R is giving me some issues.
HAC.sim <- function(K = 1, N, Hstar, probs, perms = 10000){
specs <- 1:N
### Set up a container to hold the identity of each individual from each permutation
pop <- array(dim = c(c(perms, N), K))
### Create an ID for each haplotype
haps <- as.character(1:Hstar)
### Assign probabilities of occurrence to each haplotype, ensure they sum to 1
### This is where we assume we "know" the distribution of haplotypes
### Here, I have assumed they all occur with equal frequency, but you can change this to assume some dominant ones and some rare ones, whatever you want
probs <- rep(1/Hstar, Hstar)
# probs <- c(220/N, rep(3/N, 2), rep(2/N, 2), rep(1/N, 10))
### Generate permutations, we assume each permutation has N individuals, and we sample those individuals' haplotypes from our probabilities
# If K > 1, haplotypes are partitioned into equally-sized subpopulations/demes
# Can change number of haplotypes in each subpopulation and re-run simulation
# For each additional, K, add new Ki and new pop[j ,, i] in loop
for(j in 1:perms){
for(i in 1:K){
if(i == 1){
pop[j, specs, i] <- sample(haps, size = N, replace = TRUE, prob = probs)
}
else{
pop[j ,, 1] <- sample(haps[K1], size = N, replace = TRUE, prob = probs[K1])
pop[j ,, 2] <- sample(haps[K2], size = N, replace = TRUE, prob = probs[K2])
}
}
}
### Make a matrix to hold the 1:N individuals from each permutation
HAC.mat <- array(dim = c(c(perms, N), K))
for(k in specs){
for(j in 1:perms){
for(i in 1:K){
ind.index <- sample(specs, size = k, replace = FALSE) ## which individuals will we sample
hap.plot <- pop[sample(1:nrow(pop), size = 1, replace = TRUE), ind.index, sample(1:K, size = 1, replace = TRUE)] ## pull those individuals from a permutation
HAC.mat[j, k, i] <- length(unique(hap.plot)) ## how many haplotypes did we get for a given sampling intensity (k) from each permutation (j)
}
}
}
### Calculate the mean and CI for number of haplotypes at each sampling intensity (k)
means <- apply(HAC.mat, MARGIN = 2, mean)
lower <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.025))
upper <- apply(HAC.mat, MARGIN = 2, function(x) quantile(x, 0.975))
### Plot the curve and frequency barplot
par(mfrow = c(1, 2))
for(i in 1:K){
if(i == 1){
plot(specs, means, type = "n", xlab = "Specimens sampled", ylab = "Unique haplotypes", ylim = c(1, Hstar))
polygon(x = c(specs, rev(specs)), y = c(lower, rev(upper)), col = "gray")
lines(specs, means, lwd = 2)
HAC.bar <- barplot(N*probs, xlab = "Unique haplotypes", ylab = "Specimens sampled", names.arg = 1:Hstar)
}
else{
plot(specs, means, type = "n", xlab = "Specimens sampled", ylab = "Unique haplotypes", ylim = c(1, max(HAC.mat)))
polygon(x = c(specs, rev(specs)), y = c(lower, rev(upper)), col = "gray")
lines(specs, means, lwd = 2)
HAC.bar <- barplot(N*probs[get(paste0("K", i))], xlab = "Unique haplotypes", ylab = "Specimens sampled", names.arg = get(paste0("K",i)))
}
}
d <- data.frame(specs, means)
## Measures of Closeness ##
list(c(cat("\n Number of haplotypes sampled: " , max(means)),
cat("\n Number of haplotypes not sampled: " , Hstar - max(means)),
cat("\n Proportion of haplotypes sampled: " , max(means)/Hstar),
cat("\n Proportion of haplotypes not sampled: " , (Hstar - max(means))/Hstar)), d)
}
HAC.sim(K = 1, N = 100, Hstar = 10)
I would like to be able to easily extract the data frame 'd' from the above function.
I need to pass d to a regression model object.
The output gives me almost what I need except there is a [[1]] appended to the end of the value of (Hstar - max(means))/Hstar), instead of on a new line. Plus, [[1]] is a NULL list object. Why am I getting this NULL list as an output, and how can I fix the problem?
Any help is appreciated.
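A likely explanation for the NULL element: cat() prints its arguments and returns NULL, so c(cat(...), ...) is NULL, and list(NULL, d) prints that NULL as [[1]] right after the last cat() output. One possible fix is to print the summary first and return only the data frame. The function summarise_curve below is a hypothetical, stripped-down stand-in for HAC.sim, and the means passed to it are made-up numbers:

```r
# Hypothetical stand-in for HAC.sim(): prints a summary, returns the data frame
summarise_curve <- function(means, Hstar) {
  specs <- seq_along(means)
  d <- data.frame(specs, means)
  cat("\n Number of haplotypes sampled: ", max(means),
      "\n Proportion of haplotypes sampled: ", max(means) / Hstar, "\n")
  invisible(d)  # return d without auto-printing it at the console
}

d <- summarise_curve(means = c(2, 3.5, 4.2, 4.6), Hstar = 5)

# d can now be passed to a model-fitting function
fit <- lm(means ~ specs, data = d)
```

Returning invisible(d) keeps the console output limited to the cat() summary while still letting the caller capture the data frame with an assignment.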
