I have a R script I'm running now that is currently using 3 correlated variables. I'd like to add a 4th, and am wondering if there's a simple way to input matrix data, particularly for correlation matrices---some Matlab-like technique to enter a correlation matrix, 3x3 or 4x4, in R without the linear to matrix reshape I've been using.
In Matlab, you can use the semicolon as an end-row delimiter, so it's easy to keep track of where the cross correlations are.
In R, where I first create
corr <- c(1, 0.1, 0.5,
0.1, 1, 0.9,
0.5, 0.9, 1)
cormat <- matrix(corr, ncol=3)
Versus
cormat = [1 0.1 0.5;
0.1 1 0.9;
0.5 0.9 1]
It just feels clunkier, which makes me suspect there's a smarter way I haven't looked up yet. Thoughts?
Welcome to the site! :) you should be able to do it in one step:
MyMatrix = matrix(
c(1, 0.1, 0.5,
0.1, 1, 0.9,
0.5, 0.9, 1),
nrow=3,
ncol=3)
Here is another way:
CorrMat <- matrix(scan(),3,3,byrow=TRUE)
1 0.1 0.5
0.1 1 0.9
0.5 0.9 1
Trailing white line is important.
If you want to input a symmetric matrix, you can use the xpnd() function in the MCMCpack library.
xpnd() takes a vector which corresponds to the upper-triangle of the matrix (thus you only have to enter each value once). For instance, if you want to input:
$\left(\begin{array}{c c c}
1 & 0.1 & 0.5 \\
0.1 & 1 & 0.9 \\
0.5 & 0.9 & 1
\end{array}\right)$
You would use
library(MCMCpack)
xpnd(c(1, 0.1, 0.5, 1, 0.9, 1), 3)
where 3 refers to the number of rows in the matrix.
Help page for xpnd.
rbind(c(1, 0.1, 0.5),
c(0.1, 1, 0.9),
c(0.5, 0.9, 1))
For the existing solutions. That may only work for 3*3 matrix. I tried this one.
a<-diag(3)
m<-diag(3)
m[lower.tri(m,diag=F)]<-c(0.1, 0.5, 0.9)
m<-m+t(m)-a
As you are working with correlation matrices, you are probably not interested in entering the diagonal, and both the upper and lower parts. You can manipulate/extract those three parts separately using diag(), upper.tri() and lower.tri().
> M <- diag(3) # create 3x3 matrix, diagonal defaults to 1's
> M[lower.tri(M, diag=F)] <- c(0.1, 0.5, 0.9) # read in lower part
> M # lower matrix containing all information
[,1] [,2] [,3]
[1,] 1.0 0.0 0
[2,] 0.1 1.0 0
[3,] 0.5 0.9 1
If you want the full matrix:
> M[upper.tri(M, diag=F)] <- M[lower.tri(M)] # fill upper part
> M # full matrix
[,1] [,2] [,3]
[1,] 1.0 0.1 0.5
[2,] 0.1 1.0 0.9
[3,] 0.5 0.9 1.0
Related
Given this example data frame:
DF <- data.frame(x = c(1, 0.85, 0.9, 0, 0, 0.9, 0.95),
y = c(0, 0, 0.1, 0.9, 1, 0.9, 0.97),
z = c(0, 0, 0, 0.9, 0.9, 0.0, 0.9 ))
I am trying to assign each row to a group containing rows adjacent to one another, based on their similarity. I would like to use a cutoff of 0.35, meaning that consecutive rows of values c(1, 0.85, 0.7) can be assigned to one group, but c(0, 1, 0) cannot.
Regarding the columns, column-to-column differences are not important i.e. c(1, 1, 1) and c(0, 0, 0) could still be assigned to one group, HOWEVER, if rows in one column meet the criteria (e.g. c(1, 1, 1)) but the rows in another column(s) do not (e.g. c(1, 0, 1)) - the row is invalid.
Here is the desired output for the example I gave above:
[1] 1 1 1 2 2 NA NA
I am currently applying the abs(diff()) function to determine the difference between the values, and then for each row I take the largest value (adding 1 at the beginning to account for the first row):
diff <- apply(DF, MARGIN = 2, function (x) abs(diff(x)))
max_diff <- c(1, apply(diff, MARGIN = 1, function (x) max(x, na.rm = T)))
max_diff
[1] 1.00 0.15 0.10 0.90 0.10 0.90 0.90
I am stuck at this point, not quite sure what is the best way to proceed with the group assignment. I was initially trying to convert max_diff into a logical vector (max diff < 0.35), and then running a for loop grouping all the TRUEs together. This has a couple of problems:
My dataset has millions of rows so the forloop takes ages,
I "ignore" the first component of the group - e.g. I would not consider the first row as a member of the first group, because the max_diff value of 1 gives FALSE. I don't want to ignore anything.
I will be very grateful for any advice on how to proceed in an efficient way.
PS. The way of determining the difference between sites is not crucial - here it is just a difference of 0.35 but this is very flexible. All I am after is an adjustable method of finding similar rows.
You could do a cluster analysis and play around with different cutoffs h.
cl <- hclust(dist(DF))
DF$group <- cutree(cl, h=.5)
DF
# x y z group
# 1 1.00 0.00 0.0 1
# 2 0.85 0.00 0.0 1
# 3 0.90 0.10 0.0 1
# 4 0.00 0.90 0.9 2
# 5 0.00 1.00 0.9 2
# 6 0.90 0.90 0.0 3
# 7 0.95 0.97 0.9 4
A dendrogram helps to determine h.
plot(cl)
abline(h=.5, col=2)
Within some matrix algebra I found the expression B = ker(A), where A is a 3x4 transformation matrix. The following two links gave me some vague idea about ker() in general:
Wolfram: Kernel
Calculate the dimensions and basis of the kernel Ker(f)
But frankly, I still can not square how to get a 4x1 vector as result. How would this kernel be calculated in R? And some additional background/links would be appreciated.
Here is the matrix A and the result B (or its transpose...).
A = structure(c(0.9, 1.1, 1.2, 0.8, 0, 0.5, 0.3, 0.1, 0.5, 0, 0.2,
0.7), .Dim = 4:3)
B = structure(c(0.533, 0.452, -0.692, -0.183), .Dim = c(4L, 1L))
I did get as far as realizing, that each row of the A-matrix times B equals zero, just like in the examples. But for solving the set of linear equations I am missing one more equation, don't I?
With the pracma package:
pracma::nullspace(t(A))
# [,1]
# [1,] -0.5330006
# [2,] -0.4516264
# [3,] 0.6916801
# [4,] 0.1830918
With the MASS package:
MASS::Null(A)
I'm new to optimisation/calibration of models in R, but i'm eager to learn and really need some help. My question relates to demographic modelling.
I've done some research and found help here and here but neither have quite answered my question.
I have a matrix of scalars (propensities) where each column must total to 1. These propensities are used to estimate the number of households that would arise from a given population (persons by age). The propensities model tends to overestimate the number of households in history (for which I know the true number of households).
I want to calibrate the model to minimise the error in the number of households by tweaking the propensities such that the columns still add to 1 and propensities with an initial value of zero must remain zero.
Simple example:
# Propensities matrix
mtx <- matrix(c(0.00, 0.00, 0.85, 0.00, 0.15, 0.35, 0.45, 0.00,
0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Population by age cohort
pop <- c(2600, 16200, 13400)
# True number of households
target <- c(7000, 4500, 5500)
# Function to optimise
hh <- function(mtx, pop, target) {
# Estimate living arrangements
x <- mtx %*% pop
# Estimate number of households using parent cohorts (1,2 and 4)
x <- c(x[1,1]/2, x[2,1]/2, x[4,1]) - target
return(x)
}
I haven't included any of my code for the optimisation/calibration step as it would be embarrassing and I've haven't been able to get anything to work!
Ideally i will have one set of propensities that generalises well for lots of different regions at the end of this process. Any advice on how i should go about achieving it? Helpful links?
Update
The snippet of code below executes the local search method as suggested by Enrico.
library(tidyverse)
library(NMOF)
data <- list(mtx = matrix(c(0.00, 0.00, 0.90, 0.00, 0.10, 0.25, 0.50, 0.00,
0.25, 0.00, 0.60, 0.20, 0.00, 0.20, 0.00), ncol = 3),
pop = c(2600, 16200, 13400),
target = c(7190, 4650, 5920))
# True mtx
mtx.true <- matrix(c(0.00, 0.00, 0.75, 0.00, 0.25, 0.35, 0.45, 0.00,
0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Function to optimise
households <- function(x, data) {
# Estimate living arrangements
z <- x %*% data$pop
# Estimate number of households using parent cohorts (1,2 and 4)
z <- c(z[1,1]/2, z[2,1]/2, z[4,1]) - data$target
sum(abs(z))
}
# Local search function to perturb propensities
neighbour <- function(x, data) {
# Choose random column from mtx
i <- sample(1:ncol(x), 1)
# Select two non-zero propensities from mtx column
j <- which(x[, i] != 0) %>% sample(2, replace = FALSE)
# Randomnly select one to perturb positively
x[j[1], i] <- 0.1 * (1 - x[j[1], i]) + x[j[1], i]
# Perturb second propensity to ensure mtx column adds to 1
x[j[2], i] <- x[j[2], i] + (1 - sum(x[,i]))
x
}
# Local search algorithm inputs
localsearch <- list(x0 = data$mtx,
neighbour = neighbour,
nS = 50000,
printBar = FALSE)
# Execute
now <- Sys.time()
solution <- LSopt(OF = households, algo = localsearch, data)
#>
#> Local Search.
#> Initial solution: 2695
#> Finished.
#> Best solution overall: 425.25
Sys.time() - now
#> Time difference of 6.33272 secs
# Inspect propensity matrices
print(solution$xbest)
#> [,1] [,2] [,3]
#> [1,] 0.0000000 0.3925 0.6
#> [2,] 0.0000000 0.4250 0.2
#> [3,] 0.2937976 0.0000 0.0
#> [4,] 0.0000000 0.1825 0.2
#> [5,] 0.7062024 0.0000 0.0
print(mtx.true)
#> [,1] [,2] [,3]
#> [1,] 0.00 0.35 0.65
#> [2,] 0.00 0.45 0.15
#> [3,] 0.75 0.00 0.00
#> [4,] 0.00 0.20 0.20
#> [5,] 0.25 0.00 0.00
Thanks!
I can only comment on the optimisation part.
The code you have provided is sufficient; only your objective function evaluates to a vector. You will need to transform this vector into a single number that is to be minimised, such as the sum of squares or of absolute values.
When it comes to methods, I would try heuristics; in fact, I would try a Local-Search method. These methods operate on the solution through functions which you define; thus, you may code your solution as a matrix. More specifically, you would need two functions: the objective function (which you essentially have) and a neighbourhood function, which takes as input a solution and modifies it. In your particular case, it could take a matrix, select two none-zero elements from one column, and increase one and decrease the other. Thus, the column sum would remain unchanged.
Perhaps the tutorial http://enricoschumann.net/files/NMOF_Rmetrics2012.pdf is of interest, with R code http://enricoschumann.net/files/NMOF_Rmetrics2012.R .
I am generating random variables with specified range and dimension.I have made a following code for this.
generateRandom <- function(size,scale){
result<- round(runif(size,1,scale),1)
return(result)
}
flag=TRUE
x <- generateRandom(300,6)
y <- generateRandom(300,6)
while(flag){
corrXY <- cor(x,y)
if(corrXY>=0.2){
flag=FALSE
}
else{
x <- generateRandom(300,6)
y <- generateRandom(300,6)
}
}
I want following 6 variables with size 300 and scale of all is between 1 to 6 except for one variable which would have scale 1-7 with following correlation structure among them.
1 0.45 -0.35 0.46 0.25 0.3
1 0.25 0.29 0.5 -0.3
1 -0.3 0.1 0.4
1 0.4 0.6
1 -0.4
1
But when I try to increase threshold value my program gets very slow.Moreover,I want more than 7 variables of size 300 and between each pair of those variables I want some specific correlation threshold.How would I do it efficiently?
This answer is directly inspired from here and there.
We would like to generate 300 samples of a 6-variate uniform distribution with correlation structure equal to
Rhos <- matrix(0, 6, 6)
Rhos[lower.tri(Rhos)] <- c(0.450, -0.35, 0.46, 0.25, 0.3,
0.25, 0.29, 0.5, -0.3, -0.3,
0.1, 0.4, 0.4, 0.6, -0.4)
Rhos <- Rhos + t(Rhos)
diag(Rhos) <- 1
We first generate from this correlation structure the correlation structure of the Gaussian copula:
Copucov <- 2 * sin(Rhos * pi/6)
This matrix is not positive definite, we use instead the nearest positive definite matrix:
library(Matrix)
Copucov <- cov2cor(nearPD(Copucov)$mat)
This correlation structure can be used as one of the inputs of MASS::mvrnorm:
G <- mvrnorm(n=300, mu=rep(0,6), Sigma=Copucov, empirical=TRUE)
We then transform G into a multivariate uniform sample whose values range from 1 to 6, except for the last variable which ranges from 1 to 7:
U <- matrix(NA, 300, 6)
U[, 1:5] <- 5 * pnorm(G[, 1:5]) + 1
U[, 6] <- 6 * pnorm(G[, 6]) + 1
After rounding (and taking the nearest positive matrix to the copula's covariance matrix etc.), the correlation structure is not changed much:
Ur <- round(U, 1)
cor(Ur)
For a simulation study I need to create nxn covariance matrices. for example I can input 2x2 covariance matrices like
[,1] [,2]
[1,] 1.0 1.5
[2,] 1.5 2.0
into a r function/object:
var <- c(1,2) ## variances
covar <- c(1.5,1.5) ## covariance(s)
mat <- matrix(c(var[1],covar[1],covar[2],var[2]),ncol=length(var))
then I only have to change var & covar values to form the matrices. but unfortunately I'm not just dealing with 2x2s but 2x2:30x30 or even higher! so is it possible to write only one function for any matrix of nxn dimension in r?
You can do:
m <- diag(variance)
m[lower.tri(m)] = m[upper.tri(m)] <- head(covar, length(covar)/2)
For example:
variance = c(0.25, 0.75, 0.6)
covar = c(0.1, 0.3, 0.2, 0.1, 0.3, 0.2)
#>m
# [,1] [,2] [,3]
#[1,] 0.25 0.10 0.3
#[2,] 0.10 0.75 0.2
#[3,] 0.30 0.20 0.6