Creating this covariance matrix manually in R

I have two samples, each of length 1000, and I need to construct covariance matrices for them.
Each sample is made up of 10 clusters of size 100. Each unit has a variable attached to it that identifies the cluster it came from, and the covariance between two units is X if they are from the same cluster, or Y if they are from different clusters.
So I need to construct a covariance matrix that is block-structured: 100x100 blocks of X along the diagonal (one per cluster) and Y everywhere else.
Is there any way of doing this easily? The matrix is far too big to create by inputting the entries manually, and the procedure needs to be repeated thousands of times within a loop.

You mean something like this?
m <- c(rep(1, 100), rep(0, 300),
       rep(0, 100), rep(1, 100), rep(0, 200),
       rep(0, 200), rep(1, 100), rep(0, 100),
       rep(0, 300), rep(1, 100))
m <- matrix(m, nrow = 4, byrow = TRUE)  # 4 x 400 cluster-membership matrix
m
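If it helps, here is one way (a sketch, assuming X and Y are numeric scalars) to expand this cluster-membership matrix into the covariance matrix itself:
# crossprod(m)[i, j] is 1 when units i and j share a cluster, 0 otherwise
S <- crossprod(m)            # 400 x 400 same-cluster indicator
covmat <- Y + (X - Y) * S    # X within clusters, Y between clusters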

I managed to find a straightforward solution requiring no extra packages, so I'll post it here in case other people encounter the same problem.
The easiest way for me was a double loop that goes through each index of the matrix and fills in the entry. Obviously this is very computationally expensive, so if you need to do this many times I'd recommend a more efficient, vectorized approach (see the sketch after the loop below).
m <- matrix(NA, nrow = 1000, ncol = 1000)
for (i in 1:1000) {
  for (j in 1:1000) {
    if (sampleA$cluster[i] == sampleA$cluster[j]) {
      m[i, j] <- "X"
    } else {
      m[i, j] <- "Y"
    }
  }
}
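A vectorized sketch of the same construction, assuming sampleA$cluster holds the 1000 cluster labels and X and Y are numeric covariance values (note that the quoted "X"/"Y" above would produce a character matrix):
# Compare every pair of cluster labels in one shot with outer(),
# then fill in X where the labels match and Y where they differ.
same <- outer(sampleA$cluster, sampleA$cluster, "==")  # 1000 x 1000 logical
m <- ifelse(same, X, Y)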

Related

How to use for loops to take multiple samples?

The setup is: "Suppose that you have a population of 10,000 and 6,000 people are Democrats. We simulate survey sampling from this hypothetical population. First, generate a vector of the population in R. Then, create 500 random samples of size n = 1,000."
My code so far is:
pop<-c(rep("Democrat", 6000), rep("Republican", 4000))
nTrials <- 500
n <- 1000
results <- rep(NA, nTrials)
for(i in 1:nTrials)
{
sampled <- sample(x=pop, size=n, replace=FALSE)
results[i] <- sampled
}
I then revised this to code the population as 0/1 and store the samples in a matrix:
pop <- c(rep(1, 6000), rep(0, 4000))
nTrials <- 500
n <- 1000
results <- matrix(data=NA, ncol = nTrials, nrow = n)
Y <- matrix(data=NA, ncol = nTrials, nrow = 1)
for(i in 1:nTrials)
{
sampled <- sample(x=pop, size=n, replace=TRUE)
results[,i] <- sampled
Y[,i] <- sum(results[,i])
}
I think this code works, but I'm worried about how to tell if the matrix is filling correctly.
You can easily view objects you've saved using the View function (?View in R).
We can also put a line into our R code that halts execution until a key is pressed; there is a Stack Exchange thread covering this.
Putting the two together, we can put two lines into the loop: one that shows us the current version of the final output, and another that pauses the loop until we continue. This lets us explore the behaviour of the loop step by step. Using one of your loops as an example:
for(i in 1:nTrials)
{
sampled <- sample(x=pop, size=n, replace=TRUE)
results[,i]<- sampled
Y[,i]<- sum(results[,i])
View(Y)
readline(prompt="Press [enter] to continue")
}
Keep in mind this will keep going for the specified number of trials.
You could limit the number of trials, but then you cannot be sure of getting the same result, so instead we could put a break into the code. This lets us interrupt a for loop early once we are happy with how things are building up. To make the break really shine, let's pair it with some user input, so you can choose whether or not to continue (readline collects user input in R).
Combining all of this, we get something like:
for(i in 1:nTrials)
{
sampled <- sample(x=pop, size=n, replace=TRUE)
results[,i]<- sampled
Y[,i]<- sum(results[,i])
View(Y)
interrupt = readline(prompt="Enter 1 for next loop, 0 to exit: ")
if (interrupt == 0) {break}
}
For what it's worth, your code looks perfectly fine to me so far.
Try this
replicate(500L, sample(c("Democrat", "Republican"), 1000L, replace = TRUE, prob = c(0.6, 0.4)), simplify = FALSE)
Or this
pop <- c(rep("Democrat", 6000L), rep("Republican", 4000L))
replicate(500L, sample(pop, 1000L), simplify = FALSE)
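As a quick sanity check (a sketch, assuming the samples are stored in a list as above), the Democrat share of each sample should centre near 0.6:
samples <- replicate(500L, sample(pop, 1000L), simplify = FALSE)
dem_share <- vapply(samples, function(s) mean(s == "Democrat"), numeric(1))
summary(dem_share)  # should centre near 0.6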

How can I automate creation of a list of vectors containing simulated data from a known distribution, using a "for loop" in R?

First stack exchange post so please bear with me. I'm trying to automate the creation of a list, and the list will be made up of many empty vectors of various, known lengths. The empty vectors will then be filled with simulated data. How can I automate creation of this list using a for loop in R?
In this simplified example, fish have been caught by casting a net 4 times, and their abundance is given in the vector "abundance" (from counting the total number of fish in each net). We don't have individual fish weights, just the mean weight of all fish in each net, so I need to simulate their weights from a lognormal distribution. I'm then looking to fill those empty vectors, one per net, each with length equal to the number of fish caught in that net, with weight data simulated from a lognormal distribution with a known mean and standard deviation.
A simplified example of my code:
abundance <- c(5, 10, 9, 20)
net1 <- rep(NA, abundance[1])
net2 <- rep(NA, abundance[2])
net3 <- rep(NA, abundance[3])
net4 <- rep(NA, abundance[4])
simulated_weights <- list(net1, net2, net3, net4)
# meanlog vector for each net
weight_per_net
# sdlog vector for each net
sd_per_net
for (i in 1:4) {
  simulated_weights[[i]] <- rlnorm(n = abundance[i], meanlog = weight_per_net[i], sdlog = sd_per_net[i])
  print(simulated_weights[[i]])
}
Could anyone please help me automate this so that I don't have to write out each net vector (e.g. net1) by hand, and then also write out all the net names in the list() function? There are far more nets than 4 so it would be extremely time consuming and inefficient to do it this way. I've tried several things from other posts like paste0(), other for loops, as.list(c()), all to no avail.
Thanks!
HM
Turns out you don't need the net1, net2, etc. variables at all. You can just do
abundance <- c(5, 10, 9, 20)
simulated_weights <- lapply(abundance, function(x) rep(NA, x))
The lapply function will return the list you need by calling the function once for each value of abundance.
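In fact, the pre-allocation step can be skipped entirely. A sketch of the full simulation, assuming weight_per_net and sd_per_net are the meanlog and sdlog vectors (one entry per net):
# Map() calls rlnorm once per net, pairing up the i-th elements
# of abundance, weight_per_net and sd_per_net:
simulated_weights <- Map(rlnorm,
                         n = abundance,
                         meanlog = weight_per_net,
                         sdlog = sd_per_net)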
We could create the 'simulated_weights' with split and rep:
simulated_weights <- split(rep(rep(NA, length(abundance)), abundance),
                           rep(seq_along(abundance), abundance))

K-means iterated for same data for 10 times

I am new to R. I am trying to see whether I can optimize k-means (using R) by iteratively calling the k-means routine on the same dataset with the same value of k (k = 3 in my case) 10-15 times, and whether that gives good results. I see that the clustering changes at every call, and even the total sum of squares and withinss change, but I am not sure how to stop at the best solution.
Can anyone guide me?
code:
run_kmeans <- function(xtimes)
{
  for (x in 1:xtimes)
  {
    kmeans_results <- kmeans(filtered_data, 3)
    print(kmeans_results["totss"])
    print(kmeans_results["tot.withinss"])
  }
  return(kmeans_results)
}
kmeans_results <- run_kmeans(10)
Not sure I understood your question because this is not the usual way of selecting the best partition (elbow method, silhouette method, etc.)
Let's say you want to find the kmeans partition that minimizes your within-cluster sum of squares.
Let's take the example from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
You could write this to run kmeans repeatedly (naming the result km_list rather than kmeans, to avoid shadowing the function):
xtimes <- 10
km_list <- lapply(seq_len(xtimes), function(i){
  kmeans(x, 3)
})
lapply is generally preferable to for here, and you get a list as output. To extract tot.withinss and see which run is minimal:
perf <- sapply(km_list, function(d) as.numeric(d["tot.withinss"]))
which.min(perf)
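Worth noting: kmeans() can do this repetition internally via its nstart argument, which runs several random starts and returns the best one.
# nstart = 10 fits 10 random initialisations and keeps the solution
# with the lowest total within-cluster sum of squares:
best <- kmeans(x, centers = 3, nstart = 10)
best$tot.withinss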
However, unless I misunderstood your objective, this is a strange way to select the best-performing partition. Usually it is the number of clusters that is evaluated, not different partitions produced from the same sample data with the same number of clusters.
Edit from your comment
Ok, so you want to find the combination of columns that gives you the best performance. I give an example below where every two-by-two combination of three variables is tested. You could generalize a little (but the number of possible combinations with 8 variables is very big; you should have a routine to reduce the number of tested combinations)
x <- rbind(matrix(rnorm(99, sd = 0.3), ncol = 3),   # 99 is a multiple of the 3 columns
           matrix(rnorm(99, mean = 1, sd = 0.3), ncol = 3))
colnames(x) <- c("x", "y", "z")
combinations <- combn(colnames(x), 2, simplify = FALSE)
km_list <- lapply(combinations, function(i){
  kmeans(x[, i], 3)
})
perf <- sapply(km_list, function(d) as.numeric(d["tot.withinss"]))
which.min(perf)

simulation of binomial distribution and storing value in matrix in r

set.seed(123)
for(m in 1:40)
{
u <- rbinom(1e3,40,0.30)
result[[m]]=u
}
result
for (m in 1:40) if (any(result[[m]] == 1)) break
m
m is the exit time for the company; as we change the probability it will give different results. Using this m as the exit time, I have to find whether there was a funding round in between, so I created a random binomial draw with some probability: when I get a 1, that means there is a funding round (j). If there is a funding round, I have to find the limit of the round using a random uniform draw. I am not sure the rbinom code is right, or that it runs up to m.
I am getting the y value for all 40 iterations; I need it so that when rbinom == 1 it moves on to the next loop. I am also trying to store the values in the matrix mat1, but they are not being stored. Please help me with that.
l <- 0
mat1 <- matrix(0, nrow = 40, ncol = 2) # empty matrix
for(j in 1:m) {
  k <- if(any(rbinom(1e3, 40, 0.42) == 1)) # funding round
  {
    y <- runif(j, min = 0, max = 1) # lower and upper bound
    mat1[l][0] <- j
    mat1[l][1] <- y # matrix storing the value
  }
}
mat1
y
The answer to your first question:
result <- vector("list",40)
for(m in 1:40)
{
u <- rbinom(1e3,40,0.05)
print(u)
result[[m]]=u
}
u
The second question is not clear. Could you rephrase it?
To generate 40 vectors of random binomial numbers you don't need a loop at all, use ?replicate.
u <- replicate(40, rbinom(1e3, 40, 0.05))
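With u stored as a 1000 x 40 matrix this way, the "exit time" m from the question (the first draw containing a 1) can also be found without a loop; a sketch:
# index of the first column of u that contains at least one 1
m <- which(apply(u, 2, function(col) any(col == 1)))[1]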
As for your second question, there are several problems with your code. I will try to address them; it will be up to you to say whether the proposed corrections are right.
The following does basically nothing
for(k in 1:40)
{
n<- (any(rbinom(1e3,40,0.05)==1)) # n is TRUE/FALSE
}
k # at this point, equal to 40
There are better ways of creating a T/F variable.
# list(...) # wrong, don't use list()
matrix(0, nrow = 40, ncol = 2) # or maybe NA
Then you set l = 0, but indices in R start at 1. Anyway, I don't believe you'll need this variable l.
if(any(rbinom(1e3,40,0.30)==1)) # probably TRUE, left as an exercise
# in probability theory
Then, finally,
mat1[l][0]<-j # index `0` doesn't exist
Please revise your code and tell us what you want to do; we're glad to help.
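For the storage problem specifically, here is a minimal sketch with correct matrix indexing (mat1[row, col]); it assumes one uniform draw per round, since the intended use of y is not fully clear:
m <- 40                                   # exit time from the first part (assumed)
mat1 <- matrix(0, nrow = m, ncol = 2)
for (j in 1:m) {
  if (any(rbinom(1e3, 40, 0.42) == 1)) {  # a funding round occurred
    y <- runif(1)                         # one draw per round (assumption)
    mat1[j, 1] <- j                       # column 1: round index
    mat1[j, 2] <- y                       # column 2: simulated bound
  }
}
mat1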

expand.grid very big vectors exceeding memory limit

I have a problem with R.
I have 6 vectors of data. Each vector has a weight.
I need to calculate the quantiles over all possible scenarios.
For example:
v1=c(1,2)
v2=c(0,5)
weights=c(1/3,2/3)
I would normally use:
scenarios=data.matrix(expand.grid(v1,v2))
results=scenarios %*% weights
And finally, to get all the quantiles from 1% to 100%:
quantiles=quantile(results,seq(0.01,1,0.01),names=FALSE)
The problem is that I have 6 vectors of 51, 236, 234, 71, 7 and 8 observations respectively, which would give me a vector of roughly 11 billion scenarios...
I get an error from R that I exceed the memory limit with a vector of 47 Gb...
Do you see an alternative I can use to bypass this big matrix? I'm thinking of something like looping over the values of one vector and writing the results to a file.
But then I don't know how I would calculate the percentiles across these separate files...
Rather than generating the whole population, how about sampling to approximate the distribution?
N <- 1e6
scenarios <- unique(matrix(c(sample(1:51, N, replace=T),
sample(1:236, N, replace=T),
sample(1:234, N, replace=T),
sample(1:71, N, replace=T),
sample(1:7, N, replace=T),
sample(1:8, N, replace=T)), nrow=N))
N <- nrow(scenarios)
weights <- matrix(rep(1/6, 6))
quantiles <- quantile(scenarios %*% weights, seq(0.01,1,0.01), names=FALSE)
if OP strictly wants the whole population, I will take this post down
Alright!! Thanks for your help guys!
Looks like sampling was the way to go!
Here's the code I used in the end, with chinson12's help.
I did a bootstrap to see if the sampling converges towards the right value:
N <- 1e6
B <- 2
results <- c(1:100)
for (i in 1:B) {
  scenarios <- unique(matrix(c(sample(v1, N, replace = T), sample(v2, N, replace = T), sample(v3, N, replace = T),
                               sample(v4, N, replace = T), sample(v5, N, replace = T), sample(v6, N, replace = T)),
                             nrow = N))
  weightedSum <- round(scenarios %*% weights, 4)
  results <- cbind(results, quantile(weightedSum, seq(0.01, 1, 0.01), names = FALSE))
}
write(t(results), "output.txt", ncolumns = B + 1)
The output file looks great! To 4 decimal places, all of my percentiles are the same, so they converge to a value at least!
That being said, are those percentiles unbiased estimates of my population percentiles?
Thanks
