Simulation with N trials in R - r

I am trying to create a simulation where a number 0:100 is chosen by a person, then a random number 0:100 is generated using sample(). The difference between their chosen number and the random number is calculated and stored. I would like to use a for loop to run this 10000 times and store the results in a vector so I can later plot the results. Can anyone point me to where I can read about this or see some examples? Below is what I have so far but I keep getting errors saying the lengths aren't the same multiple.
N = 10000
chosen.number = 0:100
generated.number = sample(0:100, N, replace = T)
differences = numeric(0)
for(i in 1:length(chosen.number)){
differences = (generated.number - chosen.number)
}
Then I'll make a scatterplot of the vector differences.

Here's an example of how you could go about it (if I understand your questions correctly).
You can set how many loops you want using Repeat.
Since you want a different randomly generated number each time, you'll have to put sample() within your loop. I didn't know where your user-selected number would come from, but in this example, it gets randomly generated with the same set of criteria as the random selection.
Then differences are collected in collect_differences for you to use downstream.
Repeat = 10 # Number of times to repeat/loop
collect_differences <- NULL
for(i in 1:Repeat){
randomly.generated.number = sample(0:100, size = 1, replace = T)
selected.number = sample(0:100, size = 1, replace = T)
differences = randomly.generated.number - selected.number
collect_differences = c(collect_differences, differences)
}
collect_differences
As for resources, you can look up anything related to the fundamentals of looping. You could also look through The Carpentries lessons in R as they have some resources for this as well.

Related

How to iterate a given process 1'000 times and average the results

I am here to ask you about R language and how to construct a loop to iterate some functions several times.
Here is my problem: I have a numeric matrix obtained from previous analyses (matrix1) that I want to compare (using the overlap function that results in a single value) with another numeric matrix that I get by extracting values of a given raster with a set of randomly created points, as many as the values in the first numeric matrix.
I want to repeat the random sampling procedure 1'000 times, in order to get 1'000 different sets of random points, then repeat the comparison with matrix1 1'000 times (one for each set of random points), and, in the end, calculate the mean of the results to get a single value.
Hereafter I give you an example of the functions I want to use:
#matrix1 is the first matrix, obtained before starting the potential loop;
#LineVector is a polyline shapefile that has to be used within the loop and downloaded before it;
#Raster is a raster from which I should extract values at points location;
#The loop should start from here:
Random_points <- st_sample(LineVector, size = 2000, exact = TRUE, type = "random")
Random_points <- Random_points[!st_is_empty(Random_points)]
Random_points_vect <- vect(Random_points)
Random_values <- terra::extract(Raster, Random_points_vect, ID = F, raw = T)
Random_values <- na.omit(Random_values[, c("Capriolo")])
Values_list <- list(matrix1, Random_values)
Overlapping_value <- overlap(Values_list, type = "2")
#This value, obtained 1'000 times, has then to be averaged into a single number.
I hope I have posed my question in a clear and understandable manner, and I hope you can help me with this problem.
Thanks to everyone in advance, I wish you a good day!
Easy way i can figure out is to use "replicate":
values <- replicate(1000, {
Random_points <- st_sample(LineVector, size = 2000, exact = TRUE, type = "random")
Random_points <- Random_points[!st_is_empty(Random_points)]
Random_points_vect <- vect(Random_points)
Random_values <- terra::extract(Raster, Random_points_vect, ID = F, raw = T)
Random_values <- na.omit(Random_values[, c("Capriolo")])
Values_list <- list(matrix1, Random_values)
Overlapping_value <- overlap(Values_list, type = "2")
Overlapping_value
})
mean(values)

Mclust() - NAs in model selection

I recently tried to perform a GMM in R on a multivariate matrix (400 obs of 196 var), which elements belong to known categories. The Mclust() function (from package mclust) gave very poor results (around 30% of individuals were well classified, whereas with k-means the result reaches more than 90%).
Here is my code :
library(mclust)
X <- read.csv("X.csv", sep = ",", h = T)
y <- read.csv("y.csv", sep = ",")
gmm <- Mclust(X, G = 5) #I want 5 clusters
cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm
for (k in 1:nclusters){
ii = which(cl_gmm == k) # individuals of group k
counts=table(y[ii]) # number of occurences for each label
imax = which.max(counts) # Majority label
maj_lab = attributes(counts)$dimnames[[1]][imax]
print(paste("Group ",k,", majority label = ",maj_lab))
cl_gmm_lab[ii] = maj_lab
}
conf_mat_gmm <- table(y,cl_gmm_lab) # CONFUSION MATRIX
The problem seems to come from the fact that every other model than "EII" (spherical, equal volume) is "NA" when looking at gmm$BIC.
Until now I did not find any solution to this problem...are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true
I finally found the answer. GMMs simply cannot apply every model when two much explenatory variables are involved. The right thing to do is first reduce dimensions and select an optimal number of dimensions that make it possible to properly apply GMMs while preserving as much informations as possible about the data.

How to vary multiple parameters with lapply in R

In an attempt to avoid nesting for loops 6-7 times, I am trying to use lapply to find the proportion of randomly drawn values (that are combined in a certain way) that exceed some arbitrary thresholds values. The problem is that I have several parameters that each vary a certain number of ways, and these, in turn, will affect how the values are combined. The goal is to use the results in an ANOVA to see how varying these parameters contributes to reaching those thresholds. However, I don't understand how to do this. I have a feeling that anonymous functions could be useful, but I don't understand how they work with more than 1 parameter.
I tried to simplify the code as much as possible. But again, there are just so many parameters that must be included.
trials = 10
data_means = c(0,1,2,3)
prior_samples = c(2, 8, 32)
data_SD = c(0.5, 1, 2)
thresholds = c(10, 30, 80)
The idea is that there are two distributions, data and prior, which I draw values from. I always draw one from data, but I draw a sample (see prior_samples) of values from the prior distribution. There are four different values that determine the mean of the data distribution (see data_means), but the values are drawn the same number of times (determined by trials) from each of these four "versions" of the data distribution. These are then put into nested lists:
set.seed(123)
data_list = list()
for (nMean in data_means){ #the data values
for (nTrial in 1:trials){
data_list[[paste(nMean, sep="_")]][[paste(nTrial, sep="_")]] = rnorm(1, nMean, 1)
}
}
prior_list = list()
for (nSamples in prior_samples){ #the prior values
for (nTrial in 1:trials){
prior_list[[paste(nSamples, sep="_")]][[paste(nTrial, sep="_")]] = rnorm(nSamples, 0, 1)
}
}
Then I create another list for the prior values, because I want to calculate the means and standard deviations (SD) of the samples of prior values. I include normal SD, as well as SD/2 and SD*2:
prior_SD = list("mean"=0, "standard_devations"=list("SD/2"=0, "SD"=0, "SD*2"=0))
prior_mean_SD = rep(list(prior_SD), trials)
prior_nested_list = list("2"=prior_mean_SD, "8"=prior_mean_SD, "32"=prior_mean_SD)
for (nSamples in 1:length(prior_samples)){
for (nTrial in 1:trials){
prior_nested_list[[nSamples]][[nTrial]][["mean"]]=mean(prior_list[[nSamples]][[nTrial]])
prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD/2"]]=sum(sd(prior_list[[nSamples]][[nTrial]])/2)
prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD"]]=sd(prior_list[[nSamples]][[nTrial]])
prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD*2"]]=sum(sd(prior_list[[nSamples]][[nTrial]])*2)
}
}
Then I combinde the values from the data list and the last list, using list.zip from rlist:
library(rlist)
dataMean0 = list.zip(dMean0=data_list[["0"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean1 = list.zip(dMean1=data_list[["1"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean2 = list.zip(dMean2=data_list[["2"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean3 = list.zip(dMean3=data_list[["3"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
all_values = list(mean_difference0=dataMean0, mean_difference1=dataMean1,
mean_difference2=dataMean2, mean_difference3=dataMean3)
Now comes the tricky part. I combine the data values and the prior values in all_values by using this custom function for the Kullback-Leibler divergence. As you can see, there are 6 parameters that varies:
mean_diff refers to the means of the data distribution (data_means). It is named mean_diff beacsue it refers to the difference in mean between the prior distribution (which is always 0), and the data distribution (which can be 0, 1, 2 or 3).
trial refers to trials,
pSample refers to the numbers of samples drawn from the prior distribution (prior_samples)
p_SD refers to the calculations of the SD based on the prior samples (normal SD, SD/2, SD*2)
data_SD refers to the SD of the data distribution, determined by data_SD
threshold refers to thresholds
The Kullback-Leibler divergence function:
kld = function(mean_diff, trial, pSample, p_SD, data_SD, threshold){
prior_mean = all_values[[mean_diff]][[trial]][[pSample]][["mean"]]
data_mean = all_values[[mean_diff]][[trial]][["mean"]]
prior_SD = all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]
posterior_SD = sqrt(1/(1/
((all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]
*all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]))
+1/(data_SD*data_SD)))
length(
which(
(log(prior_SD/posterior_SD) +
(((posterior_SD*posterior_SD) +
(prior_mean -
(((data_SD*data_SD))/
((data_SD*data_SD)+(prior_SD*prior_SD))*prior_mean +
((prior_SD*prior_SD))/
((data_SD*data_SD)+(prior_SD*prior_SD))*data_mean))^2)
/(2*(prior_SD*prior_SD)))-0.5
+
log(posterior_SD/prior_SD) +
((((prior_SD*prior_SD)) +
(prior_mean -
(((data_SD*data_SD))/
((data_SD*data_SD)+(prior_SD*prior_SD))*prior_mean +
((prior_SD*prior_SD))/
((data_SD*data_SD)+(prior_SD*prior_SD))*data_mean))^2)
/(2*(posterior_SD*posterior_SD)))-0.5
)>=threshold))/trials
}
So the question is how can one use lapply on the list with all the values (all_values) while using all the different combinations of the six parameters that are included? The data I want to end up with is the proportions of values (percentage of trials) that exceed the thresholds in all the parameter combinations.
I can't find the info I need, so any tips would be appreciated.

How to decide best number of clusters for kamila clustering with R?

I have a mixed type data set, so I wanted to try kamila clustering. It is easy to apply it, but I would like a plot to decide the number of clusters similar to knee-plot.
data <- read.csv("binarymat.csv",header=FALSE,sep=";")
conInd <- c(9)
conVars <- data[,conInd]
conVars <- data.frame(scale(conVars))
catVarsFac <- data[,c(1,2,3,4,5,6,7,8)]
catVarsFac[] <- lapply(catVarsFac, factor)
catVarsDum <- dummyCodeFactorDf(catVarsFac)
kamRes <- kamila(conVars, catVarsFac, numClust=5, numInit=10,
calcNumClust = "ps",numPredStrCvRun = 10, predStrThresh = 0.5)
summary(kamRes)
It says that the best number of clusters is 5. How does it decide that and can I see a plot indicating this?
In the kamila package documentation
Setting calcNumClust to ’ps’ uses the prediction strength method of
Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005).
There is no perfect method for estimating the number of clusters; PS
tends to give a smaller number than, say, BIC based methods for large
sample sizes.
In the case, you are using it, you have specified only one value to numClust. So, it doesn't look like you are actually selecting the number of clusters - you have already picked one.
To select the number of clusters, you have to specify the range you are interested in, for example, numClust = 2 : 7 and also the method for selecting the number of clusters.
If you also want to select the number of clusters, something like the following might work.
kamRes <- kamila(conVars, catVarsFac, numClust = 2 : 7, numInit = 10,
calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
Information on the selection of the number of clusters is now present in
kamRes$nClust, and plot(2:7, kamRes$nClust$psValues) could be what you are after.

CONFUSION MATRIX, R,

I need little help with the following code below. I have to setup a loop to train a neural network model on the TRAINING data with a different number of epochs each time by starting from 5 and adding 3 until I reach 20. Then I have to calculate a line chart showing the accuracy with differing numbers of epochs. I also have to keep all the parameters same as shown. Much of the code is what was given by our instructor. I added the epochs= c(5,8,11,14,17,20) to create a list of epochs and the error.rate = vector() where I intend to store the accuracy from each loop into a vector. The accuracy I want is from the confusion matrix and is found from the formula
h2o.hit_ratio_table(<model>,train = TRUE)[1,2]
The problem I face is that I have tried to create a loop. I am trying to get the results from each loop. I have labled the first part of the loop as X to try to put it into the vector for the accuracy for each loop into a vector error.rate=h2o.hit_ratio_table(x,train=TRUE)[1,2]).
However, it gives an error.
> Error in is(object, "H2OModelMetrics") : object 'X' not found In
> addition: Warning messages: 1: In 1:epochs : numerical expression has
> 6 elements: only the first used
Moreover, when I remove the error.rate=...... part, the function runs fine but there is no way to find the values of the accuracy.
I am a noob at R so a little help will be much appreciated.
s <- proc.time()
epochs= c(5,8,11,14,17,20)
error.rate = vector()
for (epoch in 1:epochs){#set up loop to go around 6 times
X=h2o.deeplearning(x = 2:785, # column numbers for predictors
y = 1, # column number for label
training_frame = train_h2o, # data in H2O format
activation = "RectifierWithDropout", # mathematical activation function
input_dropout_ratio = 0.2, # % of inputs dropout, because some inputs might not matter.
hidden_dropout_ratios = c(0.25,0.25,0.25,0.25), # % for nodes dropout, because maybe we don't need full connections. Improves generalisability
balance_classes = T, # over/under samples so that all classes are similar size.
hidden = c(50,50,50,50), # two layers of 100 nodes
momentum_stable = 0.99,
nesterov_accelerated_gradient = T,
error.rate=h2o.hit_ratio_table(x,train=TRUE)[1,2])
proc.time() - s}
You are doing for(epoch in 1:epochs). Here the 'epoch' term changes each loop (and usually you use it within the loop but i don't see it?). 1:epochs will not work as you think it should. It is taking the first element of epochs (5) and basically saying for(epoch in 1:5) where epoch is 1, then 2, ... and then 5. You want something like for(epoch in epochs) and if you DO want a sequence from 1:each epoch in your code you should write it within the loop.
Also, x is rewritten each time it loops. You should initialize it and save subsets of it each loop instead:
epochs= c(5,8,11,14,17,20)
x <- list() # save as list #option 1
y <- list() # for an option 2
for (epoch in epochs){ #set up loop to go around 6 times
X[[epoch]] = h2o.deeplearning(... )
# or NOW you can somehow use 1:epoch where each loop epoch changes
}
But I would really focus on there is no use of using your epoch in your for loop as I see in your post. Perhaps find out where you want to use it...

Resources