Assume I generate the following matrix
Sigma <- diag(x = 1, 100, 100)
Sigma[Sigma == 0] <- 0.25
so it has 1 on the diagonal and 0.25 on all off-diagonal entries. How can I randomly change some of the signs on the off-diagonal to get a mix of -0.25 and 0.25?
Of course, one could loop over the elements, but I don't think that is an elegant solution here.
One way would be to generate 1 and -1 randomly, one for each element of the matrix, multiply the matrix by them element-wise, and then reset the diagonal to 1.
# random signs, one per matrix element
Sigma <- Sigma * sample(c(1, -1), length(Sigma), replace = TRUE)
diag(Sigma) <- 1  # restore the diagonal
Sigma
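Note that this flips signs independently above and below the diagonal, so the result is generally not symmetric. If Sigma is meant to stay symmetric (as a covariance or correlation matrix would be), a minimal sketch that draws signs for the upper triangle and mirrors them:

Sigma <- diag(x = 1, 100, 100)
Sigma[Sigma == 0] <- 0.25
signs <- matrix(1, nrow(Sigma), ncol(Sigma))
# draw random signs for the upper triangle only, then mirror to the lower triangle
signs[upper.tri(signs)] <- sample(c(1, -1), sum(upper.tri(signs)), replace = TRUE)
signs[lower.tri(signs)] <- t(signs)[lower.tri(signs)]
Sigma <- Sigma * signs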
Consider the Markov chain with state space S = {1, 2}, transition matrix

P = | 1/2  1/2 |
    |  0    1  |

and initial distribution α = (1/2, 1/2).
Simulate 5 steps of the Markov chain (that is, simulate X0, X1, ..., X5). Repeat the simulation 100 times. Use the results of your simulations to solve the following problems.
Estimate P(X1 = 1|X0 = 1). Compare your result with the exact probability.
My solution:
# returns Xn
# (matrixpower() is not base R; it is assumed to come from a helper or
#  package loaded elsewhere)
func2 <- function(alpha1, mat1, n1) {
  xn <- alpha1 %*% matrixpower(mat1, n1 + 1)
  return(xn)
}
alpha <- c(0.5, 0.5)
mat <- matrix(c(0.5, 0.5, 0, 1), nrow = 2, ncol = 2)
n <- 10
for (variable in 1:100) {
  print(func2(alpha, mat, n))
}
What is the difference if I run this code once or 100 times (as the problem statement says)?
How can I find the conditional probability from here on?
Let
alpha <- c(1, 1) / 2
mat <- matrix(c(1 / 2, 0, 1 / 2, 1), nrow = 2, ncol = 2) # Different than yours
be the initial distribution and the transition matrix. Your func2 only finds the n-th step distribution, which isn't needed here, and it doesn't simulate anything. Instead we may use
chainSim <- function(alpha, mat, n) {
  out <- numeric(n)
  out[1] <- sample(1:2, 1, prob = alpha)
  for (i in 2:n)
    out[i] <- sample(1:2, 1, prob = mat[out[i - 1], ])
  out
}
where out[1] is generated using only the initial distribution and then for subsequent terms we use the transition matrix.
Then we have
set.seed(1)
# Doing once
chainSim(alpha, mat, 1 + 5)
# [1] 2 2 2 2 2 2
so the chain was initiated at 2 and got stuck there: given the specified transition probabilities, state 2 is absorbing.
Doing it 100 times we have
# Doing 100 times
sim <- replicate(chainSim(alpha, mat, 1 + 5), n = 100)
rowMeans(sim - 1)
# [1] 0.52 0.78 0.87 0.94 0.99 1.00
where the last line shows how often we ended up in state 2 rather than state 1 at each step. That gives one (of many) reasons why 100 repetitions are more informative: a single simulation got stuck at state 2, while repeating it 100 times let us explore many more possible paths.
Then the conditional probability can be found with
mean(sim[2, sim[1, ] == 1] == 1)
# [1] 0.4583333
while the true probability is 0.5 (given by the upper left entry of the transition matrix).
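For the exact comparison the exercise asks for, both quantities can also be computed directly from the transition matrix; a minimal sketch (matpow is a small hypothetical helper, since base R has no built-in matrix power):

matpow <- function(P, k) Reduce(`%*%`, replicate(k, P, simplify = FALSE))
alpha %*% matpow(mat, 5)  # exact distribution of X5
mat[1, 1]                 # exact P(X1 = 1 | X0 = 1) = 0.5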
I have a 25x25 matrix with numeric values and I want to select entries that satisfy some conditions. For example, I want only the values from 0 to 0.2, to put them in another matrix. How can I do this?
x <- matrix(rnorm(25 * 25), 25, 25)
which(x > 0.2)  # indices where x > 0.2
n <- 40
# for multiple ranges and counts
h <- hist(x, breaks = seq(min(x), max(x), length.out = n + 1), plot = FALSE)
h$breaks  # n + 1 break points
h$counts  # n counts of values between those break points
What you want can be done with simple logical operations; see the file R-intro.pdf that comes with your distribution of R, section 2.7 "Index vectors; selecting and modifying subsets of a data set".
set.seed(1356) # make the results reproducible
m <- matrix(rnorm(25*25), 25) # input matrix
i <- 0 <= m & m <= 0.2 # logical index into 'm'
# create a result matrix with the same dimensions as the input
m2 <- matrix(NA, nrow = nrow(m), ncol = ncol(m))
m2[i] <- m[i] # assign the values you want
m2
sum(i) # count of values in [0, 0.2]
sum(m < 0) # count of values less than zero
sum(m > 0.2) # count of values greater than 0.2
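If you only need the selected values rather than a matrix of the same shape, the same logical index also extracts them as a plain vector:

vals <- m[i]   # just the values in [0, 0.2]
length(vals)   # equals sum(i)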
I am trying to draw from two different distributions, with given probabilities, 100,000 times. Unfortunately I can't see what is wrong with my for loop; it only adds 1 value to simulated_data instead of the desired 100,000 values.
Question 1: How can I fix this?
Question 2: Is there a far more efficient method where I don't have to loop through 100,000 items in a list?
# creating a vector of probabilities
probabilities <- rep(0.99, 100000)
# creating a vector of booleans
logicals <- runif(length(probabilities)) < probabilities
# empty vector for my simulated data
simulated_data <- c()
# drawing from two different distributions depending on the value in logicals
for (i in logicals) {
  if (isTRUE(i)) {
    simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 1)
  } else {
    simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 10)
  }
}
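The immediate problem (your Question 1) is that for (i in logicals) iterates over the logical values themselves, so i is only ever TRUE or FALSE: simulated_data[TRUE] keeps overwriting element 1, and simulated_data[FALSE] assigns nothing, which is why only one value ever appears. Iterating over positions fixes the loop (a minimal sketch):

probabilities <- rep(0.99, 100000)
logicals <- runif(length(probabilities)) < probabilities
simulated_data <- numeric(length(logicals))  # preallocate instead of growing
for (i in seq_along(logicals)) {
  if (logicals[i]) {
    simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 1)
  } else {
    simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 10)
  }
}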
As for Question 2: it seems that you want to create a final sample where each element is taken randomly from either sample1 or sample2, with probabilities 0.99 and 0.01. A far more efficient, vectorized approach is to generate both samples, each containing the same number of elements, and then select randomly from either one:
# generate both samples
n <- 100000
sample1 <- rnorm(n, 0, 1)
sample2 <- rnorm(n, 0, 10)
# create the logical vector that decides whether to take from sample 1 or 2
s1_s2 <- runif(n) < 0.99
# create the final sample ('final_sample' rather than 'sample', which would
# mask base::sample)
final_sample <- ifelse(s1_s2, sample1, sample2)
In this case, it is not guaranteed that there are exactly 0.99*n values from sample1 and 0.01*n from sample2. In fact:

> sum(final_sample == sample1)
[1] 98953

This is close to 0.99*n, as expected, but not exact.
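As an aside, rnorm is vectorized over its sd argument, so the same mixture can be drawn in a single call, without generating both full samples:

n <- 100000
final_sample <- rnorm(n, mean = 0, sd = ifelse(runif(n) < 0.99, 1, 10))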
Create a vector with the desired fraction of values from each distribution and then create a random permutation of the values:
N <- 10000
frac <- 0.99
rand_mix <- sample(c(rnorm(frac * N, 0, sd = 1), rnorm((1 - frac) * N, 0, sd = 10)))
> table(abs(rand_mix) > 1.96)

FALSE  TRUE
 9364   636

> (10000 - 636) / 10000
[1] 0.9364

For comparison, a pure N(0, 1) sample essentially never exceeds 6, so the extreme values in the mixture must come from the sd = 10 component:

> table(rnorm(10000) > 6)

FALSE
10000
The fraction is fixed. If you want a possibly random fraction (but statistically close to 0.99), then try this:
> table( sample( c( rnorm(10e6), rnorm(10e4, sd=10) ), 10e4) > 1.96 )
FALSE TRUE
97151 2849
Compare with:
> N = 100000
> frac =0.99
> rand_mix = sample( c( rnorm( frac*N, 0, sd=1) , rnorm( (1-frac)*N, 0, sd=10) ) )
> table( rand_mix > 1.96 )
FALSE TRUE
97117 2883
Here is a simple solution for anyone who lands here:
n <- 100000
prob1 <- 0.99
prob2 <- 1 - prob1
dist1 <- rnorm(prob1 * n, 0, 1)
dist2 <- rnorm(prob2 * n, 0, 10)
actual_sample <- c(dist1, dist2)
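Note that this keeps all the N(0, 1) draws first and all the N(0, 10) draws last; if the order should be random, shuffle the result as in the previous answer:

actual_sample <- sample(actual_sample)  # random permutation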
I have a correlation matrix (the example here is small; my actual matrix is much larger). I only want to type the upper triangle or the lower triangle of the matrix, but I only know how to type the entire matrix. Is there any way to give just the lower triangle, e.g. (1, 0.31, 1, 0.32 ... etc), to create the whole correlation matrix?
If v is a vector giving the lower triangular part of the correlation matrix (including the diagonal, by column), then:
# test data
nms <- c("a", "b", "c")
v <- c(1, .2, .1, 1, .1, 1)
n <- length(nms)
m <- diag(n)
dimnames(m) <- list(nms, nms)
m[lower.tri(m, diag = TRUE)] <- v
m[upper.tri(m)] <- t(m)[upper.tri(m)]
giving:
> m
a b c
a 1.0 0.2 0.1
b 0.2 1.0 0.1
c 0.1 0.1 1.0
Note: since R stores matrices by column, providing the input by column is more usual than by row; but if v contains the lower triangle by row, then swap lower.tri and upper.tri in the above.
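If this construction is needed repeatedly, the steps wrap naturally into a small helper (lowerToCorr is a hypothetical name):

lowerToCorr <- function(v, nms) {
  n <- length(nms)
  m <- diag(n)
  dimnames(m) <- list(nms, nms)
  m[lower.tri(m, diag = TRUE)] <- v       # fill lower triangle by column
  m[upper.tri(m)] <- t(m)[upper.tri(m)]   # mirror to the upper triangle
  m
}
lowerToCorr(c(1, .2, .1, 1, .1, 1), c("a", "b", "c"))  # reproduces m above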
I have a matrix in which I would like to find those columns that are very similar (I am not looking for identical columns).

# to generate a matrix
Mat <- matrix(rexp(200, rate = .1), ncol = 1000, nrow = 400)
I personally thought of cor or all.equal, and I tried the following, but it did not work.
indexmax <- apply(Mat, MARGIN = 2, function(x) which(cor(x) >= 0.5, arr.ind = TRUE))
What I need as output is which columns are highly similar, together with the degree of their similarity (it can be the correlation coefficient).
"Similar" means their values are close within some threshold, for example over 75% of the elementwise residuals (e.g. column1 - column2) are less than 0.5 in absolute value.
I would also love to see how this differs from correlation: do the two criteria give identical results?
Using correlation, you could try (with a smaller matrix for demonstration):
set.seed(123)
Mat <- matrix(rnorm(300), ncol = 10)
library(matrixcalc)
corr <- cor(Mat)
res <- which(lower.triangle(corr) > .3, arr.ind = TRUE)
res <- res[res[, 1] != res[, 2], ]  # drop the diagonal entries
data.frame(res, correlation = corr[res])
row col correlation
1 8 1 0.3387738
2 6 2 0.3350891
Both row and col actually refer to the columns in your original matrix. So, for example, the correlation between column 8 and column 1 is 0.3387738
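If you prefer to avoid the matrixcalc dependency, base R's lower.tri does the same job; a sketch (lower.tri excludes the diagonal by default, which also drops the self-correlations):

corr <- cor(Mat)
res <- which(corr > .3 & lower.tri(corr), arr.ind = TRUE)
data.frame(res, correlation = corr[res])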
I'd take a linear regression approach:
Mat <- matrix(rexp(200, rate = .1), ncol = 100, nrow = 400)
combinations <- combn(1:ncol(Mat), m = 2)
sigma <- NULL
for (i in 1:ncol(combinations)) {
  # index each pair with i (not a fixed 1) so every combination is used
  sigma <- c(sigma, summary(lm(Mat[, combinations[1, i]] ~ Mat[, combinations[2, i]]))$sigma)
}
sigma <- data.frame(sigma = sigma, comb_nr = 1:ncol(combinations))
The residual standard error serves as the similarity criterion here.
You can further order data frame by sigma and get best/worst combinations.
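For instance, to pull out the five most similar pairs (smallest residual standard error):

head(sigma[order(sigma$sigma), ], 5)                        # five best pairs
combinations[, head(sigma$comb_nr[order(sigma$sigma)], 5)]  # their column indices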
If you want a (not so elegant) straightforward approach that's likely to be very slow for matrices of your size, you can do this:
set.seed(1)
Mat <- matrix(runif(40000), ncol = 100, nrow = 400)
col.combs <- t(combn(1:ncol(Mat), 2))
similar <- data.frame(Col1 = NULL, Col2 = NULL, Corr = NULL, Pct.Diff = NULL)

# Compare each pair of columns
for (k in 1:nrow(col.combs)) {
  i <- col.combs[k, 1]
  j <- col.combs[k, 2]
  # difference within threshold?
  diff.thresh <- (abs(Mat[, i] - Mat[, j]) < 0.5)
  pair.corr <- cor(Mat[, i], Mat[, j])
  if (mean(diff.thresh) > 0.75)
    similar <- rbind(similar, c(i, j, pair.corr, 100 * mean(diff.thresh)))
}
In this example there are 2590 distinct pairs of columns with more than 75% of their values within 0.5 of each other (elementwise). You can check the actual difference and correlation coefficient by looking at the resulting data frame.
> head(similar)
Col1 Col2 Corr Pct.Diff
1 1 2 -0.003187894 76.75
2 1 3 0.074061019 76.75
3 1 4 0.082668387 78.00
4 1 5 0.001713751 75.50
5 1 8 0.052228907 75.75
6 1 12 -0.017921978 78.00
Perhaps it's not the best solution, but it gets the job done.
Also, if you're unsure why I used mean(diff.thresh): the sum of a logical vector is the number of TRUE elements, and the mean is the sum divided by the length, so here it is the fraction of values within the threshold.
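For example:

mean(c(TRUE, TRUE, FALSE, TRUE))  # 3 TRUEs out of 4 values -> 0.75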