How to iterate a given process 1'000 times and average the results - r

I am here to ask you about R language and how to construct a loop to iterate some functions several times.
Here is my problem: I have a numeric matrix obtained from previous analyses (matrix1) that I want to compare (using the overlap function that results in a single value) with another numeric matrix that I get by extracting values of a given raster with a set of randomly created points, as many as the values in the first numeric matrix.
I want to repeat the random sampling procedure 1'000 times, in order to get 1'000 different sets of random points, then repeat the comparison with matrix1 1'000 times (one for each set of random points), and, in the end, calculate the mean of the results to get a single value.
Hereafter I give you an example of the functions I want to use:
#matrix1 is the first matrix, obtained before starting the potential loop;
#LineVector is a polyline shapefile that has to be used within the loop and downloaded before it;
#Raster is a raster from which I should extract values at points location;
#The loop should start from here:
Random_points <- st_sample(LineVector, size = 2000, exact = TRUE, type = "random")
Random_points <- Random_points[!st_is_empty(Random_points)]
Random_points_vect <- vect(Random_points)
Random_values <- terra::extract(Raster, Random_points_vect, ID = F, raw = T)
Random_values <- na.omit(Random_values[, c("Capriolo")])
Values_list <- list(matrix1, Random_values)
Overlapping_value <- overlap(Values_list, type = "2")
#This value, obtained 1'000 times, has then to be averaged into a single number.
I hope I have posed my question in a clear and understandable manner, and I hope you can help me with this problem.
Thanks to everyone in advance, I wish you a good day!

Easy way i can figure out is to use "replicate":
values <- replicate(1000, {
Random_points <- st_sample(LineVector, size = 2000, exact = TRUE, type = "random")
Random_points <- Random_points[!st_is_empty(Random_points)]
Random_points_vect <- vect(Random_points)
Random_values <- terra::extract(Raster, Random_points_vect, ID = F, raw = T)
Random_values <- na.omit(Random_values[, c("Capriolo")])
Values_list <- list(matrix1, Random_values)
Overlapping_value <- overlap(Values_list, type = "2")
Overlapping_value
})
mean(values)

Related

Mclust() - NAs in model selection

I recently tried to perform a GMM in R on a multivariate matrix (400 obs of 196 var), which elements belong to known categories. The Mclust() function (from package mclust) gave very poor results (around 30% of individuals were well classified, whereas with k-means the result reaches more than 90%).
Here is my code :
library(mclust)
X <- read.csv("X.csv", sep = ",", h = T)
y <- read.csv("y.csv", sep = ",")
gmm <- Mclust(X, G = 5) #I want 5 clusters
cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm
for (k in 1:nclusters){
ii = which(cl_gmm == k) # individuals of group k
counts=table(y[ii]) # number of occurences for each label
imax = which.max(counts) # Majority label
maj_lab = attributes(counts)$dimnames[[1]][imax]
print(paste("Group ",k,", majority label = ",maj_lab))
cl_gmm_lab[ii] = maj_lab
}
conf_mat_gmm <- table(y,cl_gmm_lab) # CONFUSION MATRIX
The problem seems to come from the fact that every other model than "EII" (spherical, equal volume) is "NA" when looking at gmm$BIC.
Until now I did not find any solution to this problem...are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true
I finally found the answer. GMMs simply cannot apply every model when two much explenatory variables are involved. The right thing to do is first reduce dimensions and select an optimal number of dimensions that make it possible to properly apply GMMs while preserving as much informations as possible about the data.

How do I loop different percentages of missing values using MCAR?

Using the cleveland data from MCI data respository, I want to generate missing values on the data to apply some imputation techniques.
heart.ds <- read.csv(file.choose())
head(heart.ds)
attach(heart.ds)
sum(is.na(heart.ds))
str(heart.ds)
#Changing Appropriate Variables to Factors
heart.ds$sex<-as.factor(heart.ds$sex)
heart.ds$cp<-as.factor(heart.ds$cp)
heart.ds$fbs<-as.factor(heart.ds$fbs)
heart.ds$exang<-as.factor(heart.ds$exang)
heart.ds$restecg<-as.factor(heart.ds$restecg)
heart.ds$slope<-as.factor(heart.ds$slope)
heart.ds$thal<-as.factor(heart.ds$thal)
heart.ds$target<-as.factor(heart.ds$target)
str(heart.ds)
Now i want to generate missing values using the MCAR mechanism. Below is the loop code;
p = c(0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1)
hd_mcar = rep(0, length(heart.ds)) #to generate empty bins of 10 different percentages of missingness using the MCAR package
for(i in 1:length(p)){
hd_mcar[i] <- delete_MCAR(heart.ds, p[i]) #to generate 10 different percentages of missingness using the MCAR package
}
The problem here is that, after the above code, i dont get the data been generated in it original values like in a data frame where i will have n variables and n rows.
Below is a picture of the output i had through the above code;
enter image description here
But when i use only one missingness percentage i get an accurate results; below is the coe for only one missing percentage
#Missing Completely at Random(MCAR)
hd_mcar <- delete_MCAR(heart.ds, 0.05)
sum(is.na(hd_mcar))
Below is the output of the results;
enter image description here
Please I need help to to solve the looping problem. Thank you.
Now I want to apply the MICE and other imputations methods like HMISC, Amelia, mi, and missForest inside the loop but it is giving me an error saying "Error: Data should be a matrix or data frame"
The code below is for only MICE;
#1. Method(MICE)
mice_mcar[[i]] <- mice(hd_mcar, m=ip, method = c("pmm","logreg","polyreg","pmm","pmm","logreg",
"polyreg","pmm","logreg","pmm","polyreg","pmm",
"polyreg","logreg"), maxit = 20)
#Diagnostic check
summary(heart.ds$age)
mice_mcar$imp$age
#Finding the means of the impuatations
app1 <- apply(mice_mcar$imp$age, MARGIN = 2, FUN = mean)
min1 <- abs(app1-mean(heart.ds$age))
#Selecting the minimum index
sm1 <- which(min1==min(min1))
#Selecting final imputation
final_clean_hd_mcar =mice::complete(mice_mcar,sm1)
mice.mcar = final_clean_hd_mcar
How do i go about to make it fit into the loop and works perfectly?
Your problem was this line:
hd_mcar = rep(0, length(heart.ds)) #to generate empty bins of 10 different percentages of missingness using the MCAR package
You are creating a vector here rather than a list. You can't assign a data frame to an element of a vector without coercing it into something that is not a data frame. You want to do this:
p <- c(0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1)
hd_mcar <- vector(mode = "list", length = length(p))
for(i in 1:length(p)){
hd_mcar[[i]] <- delete_MCAR(heart.ds, p[i]) #to generate 10 different percentages of missingness using the MCAR package
}
Note that because it's a list now, hd_mcar[[i]] uses the [[ rather than [ subscript.

Simulation with N trials in R

I am trying to create a simulation where a number 0:100 is chosen by a person, then a random number 0:100 is generated using sample(). The difference between their chosen number and the random number is calculated and stored. I would like to use a for loop to run this 10000 times and store the results in a vector so I can later plot the results. Can anyone point me to where I can read about this or see some examples? Below is what I have so far but I keep getting errors saying the lengths aren't the same multiple.
N = 10000
chosen.number = 0:100
generated.number = sample(0:100, N, replace = T)
differences = numeric(0)
for(i in 1:length(chosen.number)){
differences = (generated.number - chosen.number)
}
Then I'll make a scatterplot of the vector differences.
Here's an example of how you could go about it (if I understand your questions correctly).
You can set how many loops you want using Repeat.
Since you want a different randomly generated number each time, you'll have to put sample() within your loop. I didn't know where your user-selected number would come from, but in this example, it gets randomly generated with the same set of criteria as the random selection.
Then differences are collected in collect_differences for you to use downstream.
Repeat = 10 # Number of times to repeat/loop
collect_differences <- NULL
for(i in 1:Repeat){
randomly.generated.number = sample(0:100, size = 1, replace = T)
selected.number = sample(0:100, size = 1, replace = T)
differences = randomly.generated.number - selected.number
collect_differences = c(collect_differences, differences)
}
collect_differences
As for resources, you can look up anything related to the fundamentals of looping. You could also look through The Carpentries lessons in R as they have some resources for this as well.

What is the fastest way to perform an exhaustive search in R

I am implementing a version of the Very Large Scale Relieff algorithm detailed here.
Simply put, Very Large Scale Relieff split the set of features N into several random subsets Ns where Ns << N. Then it calculates the Relieff weights for the features in the subset Ns. For each feature, the final weight will be the highest weight assigned among the different subsets were that particular feature appear.
I have ~80000 features for ~100 subjects. I can calculate 10000 subsets of 8000 features each in a reasonable amount of time (~5 minutes running on 25 cores) with the following code (that is scaled down to 100 features in order to be easier to profile):
library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)
#create fake data for example
fake_table <- matrix(rnorm(100*100), ncol = 100) %>%
as_tibble()
outcome <- rnorm(100)
#create fake data for example
#VLSRelieff code
start_time <- Sys.time()
myCluster <- makeCluster(25, # number of cores to use
type = "FORK")
registerDoParallel(myCluster)
result <- foreach(x = seq(1,10000)) %dopar% {
#set seed for results consistency among different run
set.seed(x)
#subsample the features table by extracting a subset of columns
subset_index <- sample(seq(1,ncol(fake_table)),size = round(ncol(fake_table)*.01))
subset_matrix <- fake_table[,subset_index]
#assign the outcome as last column of the subset
subset_matrix[,ncol(subset_matrix)+1] <- outcome
#use the function attrEval from the CORElearn package to calculate the Relieff weights for the subset
rf_weights <- attrEval(formula = ncol(subset_matrix), subset_matrix, estimator = "RReliefFequalK")
#create a data frame with as many columns as features in the subset and only one row
#with the Relieff weigths
rf_df <- rf_weights %>%
unname() %>%
matrix(., ncol = length(.), byrow = TRUE) %>%
as_tibble() %>%
set_colnames(., names(rf_weights))}
end_time <- Sys.time()
end_time - start_time
However, the code above does only half of the work: the other half is, for each feature, to go into the results of the different repetitions and find the maximum value obtained. I have managed to write a working code, but it is outrageously slow (I let it run for 2 hours before stopping it, although it worked on testing with fewer features - again, here it is scaled down to 100 features and should run in ~7 seconds):
start_time <- Sys.time()
myCluster <- makeCluster(25, # number of cores to use
type = "FORK")
registerDoParallel(myCluster)
#get all features name
feat_names <- colnames(fake_table)
#initalize an empty vector of zeros, with the names of the features
feat_wegiths <- rep(0, length(feat_names))
names(feat_wegiths) <- feat_names
#loop in parallel on the features name, for each feature name
feat_weight_foreach <- foreach(feat = feat_names, .combine = 'c') %dopar% {
#initalize the weight as 0
current_weigth <- 0
#for all element in result (i.e. repetitions of the subsampling process)
for (el in 1:length(result)){
#assign new weight accessing the table
new_weigth <- result[[el]][[1,feat]]
#skip is empty (i.e. the features is not present in the current subset)
if(is_empty(new_weigth)){next}
#if new weight is higher than current weight assign the value to current weight
if (current_weigth < new_weigth){
current_weigth <- new_weigth}}
current_weigth
}
end_time <- Sys.time()
end_time - start_time
If I understood what you are trying to do correctly, then the answer is simpler than you think.
Correct me if I'm wrong, but you are trying to get the max value obtained from attrEval per feature?
if so, then why not just bind all results in one dataframe (or data.table), and then get the max per column like so:
allResults <- result %>% data.table::rbindlist(fill = TRUE)
apply(allResults, 2, max, na.rm=TRUE)
This follows #DS_UNI's idea, but instead of binding a list, the approach is to create a matrix from the initial loop. That is, a list of tibbles makes us do extra work. Instead, we have every thing we need to make a matrix:
library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)
nr = 50L
nc = 200L
## generate data
set.seed(123)
mat = matrix(rnorm(nr * nc), ncol = nc, dimnames = list(NULL, paste0('V', seq_len(nc))))
outcome = rnorm(nr)
## constants for sampling
n_reps = nc
nc_sample_size = round(nc * 0.01)
## pre-allocate result
res = matrix(0, nrow = n_reps, ncol = ncol(mat), dimnames = dimnames(mat))
st = Sys.time()
for (i in seq_len(n_reps)) {
set.seed(i)
## similar way to do data simulations as OP
sub_cols = sample(seq_len(nc), nc_sample_size)
sub_mat = cbind(mat[, sub_cols], outcome)
rf_weights = attrEval(formula = ncol(sub_mat), as.data.frame(sub_mat), estimator = 'RReliefFequalK')
## assign back to pre-allocated result
res[i, sub_cols] = rf_weights
}
## get max of each column
apply(res, 2L, max)
et = Sys.time()
et - st
The downsides is that this loses the parallel workers. The upside is that we have less memory slowdowns because we're allocating much of what we need up front.
This is not a final answer, but I would suggest, since it is a numerical problem, to write a function in C++. This will increase the speed significantly, by some order of magnitude I would guess. In my oppinion, using R for this very specific numercial task is just hitting a brick wall.
The first chapter of Rcpp for everyone says:
Chapter 1 Suitable situations to use Rcpp
R is weak in some kinds of operations. If you need operations listed below, it is time to consider using Rcpp.
Loop operations in which later iterations depend on previous
iterations.
Accessing each element of a vector/matrix.
Recurrent function calls within loops.
Changing the size of vectors dynamically.
Operations that need advanced data structures and algorithms.
Wickham's Advanced R has a good chapter on that topic too.

Generating n new datasets by randomly sampling existing data, and then applying a function to new datasets

For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like to not type this out 1000 or 100000 times, and have been trying things with loops and lapply (see below) but they've not worked (undoubtedly due to something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data<-select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace=TRUE)
Random_sample_cats<-cbind(Random_sample_data,cat)
correlation <- function(c) {
c <- cor.test(x,y, method="kendall")
return(c)
}
b<- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
Once you have the code for what you want to do once, you can put it in replicate to do it n times. Here's a reproducible example on built-in data
result = replicate(n = 10, expr = {
newdata <- sample_n(mtcars, 10)
output <- cor.test(newdata$wt, newdata$qsec, method="kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result, in this case cor.test returns a list of length 8, so replicate will simplify the results to a matrix with 8 rows and 10 columns (1 column per replication).
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
newdata <- sample_n(mtcars, 10)
cor.test(newdata$wt, newdata$qsec, method="kendall")$p.value
})

Resources