I have a summation that I can calculate using four for loops, but I wonder if this can be simplified, maybe using a vectorized function, to reduce the computation time. Something similar to the Kronecker product (in R: kronecker(x, x)), or maybe something using outer?
The summation is
f[i,j] = Σ_{k ∈ E} Σ_{l ∈ E} h[i,k] * h[j,l] * g[k,l]
where E is the sample space of integers ranging from 1 to 9. The i and j indices are also integers ranging from 1 to 9.
So f, g, and h are all matrices of dimension 9x9.
The h matrix is fixed and I already have it, but I am simulating g many times and then choosing the one that minimizes another function. The problem is that one thousand simulations (which is too few) already take about 1 second. I really want to try a million, but that many would take far too long.
I have the for loops in a function:
sim <- function(y, nreps, h) {
  G <- vector("list", nreps)    # list to hold the simulated g matrices (Dirichlet draws)
  F <- vector("list", nreps)    # list to hold the f matrices
  M <- vector("numeric", nreps) # vector to store the results
  require(gtools)
  for (n in 1:nreps) {
    f <- matrix(0, nrow = 9, ncol = 9)     # initialize f
    g <- gtools::rdirichlet(9, rep(1, 9))  # simulate g
    for (i in 1:9) {
      for (j in 1:9) {
        for (k in 1:9) {
          for (l in 1:9) {
            f[i, j] <- f[i, j] + h[i, k] * h[j, l] * g[k, l] # the summation above
          }
        }
      }
    }
    F[[n]] <- f            # store f matrix
    G[[n]] <- g            # store g matrix
    M[n] <- sum((y - f)^2) # sum of squared differences between y and f
  }
  m <- which.min(M)        # index of the smallest M
  return(list(g = G[[m]], m = M[m]))
}
And I call the function with
sim(y=f.y1, nreps=1000, h=x)
Here is the data:
> dput(f.y1)
structure(c(0.0182002022244692, 0.0121334681496461, 0.0101112234580384,
0, 0, 0, 0, 0, 0, 0.0485338725985844, 0.0940343781597573, 0.112234580384226,
0.0434782608695652, 0.00910010111223458, 0.00101112234580384,
0, 0, 0, 0.0333670374115268, 0.110212335692619, 0.132457027300303,
0.0808897876643074, 0.0222446916076845, 0.0070778564206269, 0.00101112234580384,
0, 0, 0.0070778564206269, 0.0202224469160768, 0.0596562184024267,
0.0616784630940344, 0.0262891809908999, 0.0070778564206269, 0,
0, 0, 0.00202224469160768, 0.00505561172901921, 0.0151668351870576,
0.0182002022244692, 0.0111223458038423, 0.00404448938321537,
0, 0, 0, 0.00202224469160768, 0.00404448938321537, 0.00505561172901921,
0.00505561172901921, 0.00202224469160768, 0.00202224469160768,
0, 0, 0, 0, 0.00202224469160768, 0.00202224469160768, 0.00202224469160768,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), class = "table", dim = c(9L, 9L), dimnames = structure(list(
c("0", "1", "2", "3", "4", "5", "6", "7", "8"), c("0", "1",
"2", "3", "4", "5", "6", "7", "8")), names = c("", "")))
> dput(x)
structure(c(0.61, 0.16, 0.03, 0.005, 0, 0, 0, 0, 0, 0.32, 0.61,
0.16, 0.03, 0.005, 0, 0, 0, 0, 0.06, 0.16, 0.61, 0.16, 0.03,
0.005, 0, 0, 0, 0.01, 0.06, 0.16, 0.61, 0.16, 0.03, 0.01, 0,
0, 0, 0.01, 0.03, 0.16, 0.61, 0.16, 0.03, 0.01, 0, 0, 0, 0.01,
0.03, 0.16, 0.61, 0.16, 0.06, 0.01, 0, 0, 0, 0.005, 0.03, 0.16,
0.61, 0.16, 0.06, 0, 0, 0, 0, 0.005, 0.03, 0.16, 0.61, 0.32,
0, 0, 0, 0, 0, 0.005, 0.03, 0.16, 0.61), dim = c(9L, 9L))
And you'll need to load the gtools package for the rdirichlet function. Thanks heaps!
library(gtools)
Luckily, this particular example is just "simple" matrix multiplication: the double sum f[i,j] = Σ_k Σ_l h[i,k] * g[k,l] * h[j,l] is exactly the (i,j) entry of h %*% g %*% t(h), so it can easily be vectorised with:
sim1 <- function(y, nreps, h) {
  G <- vector("list", nreps)    # list to hold the simulated g matrices (Dirichlet draws)
  F <- vector("list", nreps)    # list to hold the f matrices
  M <- vector("numeric", nreps) # vector to store the results
  require(gtools)
  for (n in 1:nreps) {
    g <- gtools::rdirichlet(9, rep(1, 9)) # simulate g
    f <- h %*% g %*% t(h)                 # vectorised form of the quadruple loop
    F[[n]] <- f            # store f matrix
    G[[n]] <- g            # store g matrix
    M[n] <- sum((y - f)^2) # sum of squared differences between y and f
  }
  m <- which.min(M)        # index of the smallest M
  return(list(g = G[[m]], m = M[m]))
}
Run both functions for comparison:
#Original version
set.seed(0)
system.time(a <- sim(y=f.y1, nreps=1000, h=x))
# user system elapsed
# 0.97 0.03 1.00
#revised version
set.seed(0)
system.time(b <- sim1(y=f.y1, nreps=1000, h=x))
# user system elapsed
# 0.01 0.00 0.02
#Check they give the same answer
all.equal(a, b)
#[1] TRUE
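If you do push this to a million repetitions, note that storing every f and g in a list will also cost a fair amount of memory. A further tweak along the same lines (a sketch, not benchmarked here) keeps only the best draw seen so far:
sim2 <- function(y, nreps, h) {
  require(gtools)
  th <- t(h)              # precompute the transpose once
  best_m <- Inf
  best_g <- NULL
  for (n in 1:nreps) {
    g <- gtools::rdirichlet(9, rep(1, 9))
    f <- h %*% g %*% th   # same vectorised computation as in sim1
    m <- sum((y - f)^2)
    if (m < best_m) {     # keep only the best g found so far
      best_m <- m
      best_g <- g
    }
  }
  list(g = best_g, m = best_m)
}
With the same seed this should return the same g and m as sim1, since the Dirichlet draws happen in the same order.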
Related
I solved the following Linear Matrix Inequality (LMI) problem using cvx in Matlab:
Lhs = [19.467593196, 1.82394007, 0.1625838, 0.01685267, 0.002495194;
1.823940068, 1.78664305, 0.9845668, 0.32951706, 0.010431878;
0.162583843, 0.98456679, 1.2333818, 0.92276329, 0.132643463;
0.016852668, 0.32951706, 0.9227633, 1.55698000, 0.848190932;
0.002495194, 0.01043188, 0.1326435, 0.84819093, 0.638889503];
S = [0.001, -0.001, 0, 0, 0;
-0.001, 0.001, 0, 0, 0;
0, 0, 0, 0, 0;
0, 0, 0, 0.001, -0.001;
0, 0, 0, -0.001, 0.001];
cvx_begin sdp
variable t;
minimize t;
Lhs+t*S >= 0;
cvx_end
The result makes sense.
I need to solve the same problem in R. As far as I understand, it can't be expressed as an LMI with CVXR. Thus, I exploited the dual formulation to write the problem as
cvx_begin sdp
variable X(5,5) symmetric;
maximize -trace(Lhs*X);
trace(S*X) == 1;
X >= 0;
cvx_end
As expected, the result in Matlab is the same as in the primal formulation.
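Written out, the primal solved above and the dual I am using are
$$\text{primal:}\quad \min_{t}\; t \quad \text{s.t.}\quad \mathrm{Lhs} + t\,S \succeq 0$$
$$\text{dual:}\quad \max_{X \succeq 0}\; -\operatorname{tr}(\mathrm{Lhs}\,X) \quad \text{s.t.}\quad \operatorname{tr}(S\,X) = 1$$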
However, if I solve the dual problem in R:
Lhs = matrix(c(19.467593196, 1.82394007, 0.1625838, 0.01685267, 0.002495194,
1.823940068, 1.78664305, 0.9845668, 0.32951706, 0.010431878,
0.162583843, 0.98456679, 1.2333818, 0.92276329, 0.132643463,
0.016852668, 0.32951706, 0.9227633, 1.55698000, 0.848190932,
0.002495194, 0.01043188, 0.1326435, 0.84819093, 0.638889503), ncol = 5, byrow = T)
S = matrix(c(0.001, -0.001, 0, 0, 0,
-0.001, 0.001, 0, 0, 0,
0, 0, 0, 0, 0,
0, 0, 0, 0.001, -0.001,
0, 0, 0, -0.001, 0.001), ncol = 5, byrow = T)
library(CVXR)
k = 5                                    # dimension of Lhs and S
X = Variable(k, k, PSD = T)
constr = list(matrix_trace(S %*% X) == 1,
              X >= 0)
prob = Problem(Maximize(-matrix_trace(Lhs %*% X)), constr)
res = solve(prob)                        # solve and inspect res$value
the result is totally wrong. Where is the mistake?
I can build very basic plots in R, but I'm trying to make my heatmap look more professional and I'm not sure how to do it. I have a data frame df of 11 observations of 11 variables:
> dput(df)
structure(list(`0` = c(6.08, 7.91, 5.14, 2.23, 0.72, 0.19, 0.04,
0.01, 0, 0, 0), `1` = c(9.12, 11.86, 7.71, 3.34, 1.09, 0.28,
0.06, 0.01, 0, 0, 0), `2` = c(6.84, 8.89, 5.78, 2.5, 0.81, 0.21,
0.05, 0.01, 0, 0, 0), `3` = c(3.42, 4.45, 2.89, 1.25, 0.41, 0.11,
0.02, 0, 0, 0, 0), `4` = c(1.28, 1.67, 1.08, 0.47, 0.15, 0.04,
0.01, 0, 0, 0, 0), `5` = c(0.38, 0.5, 0.33, 0.14, 0.05, 0.01,
0, 0, 0, 0, 0), `6` = c(0.1, 0.13, 0.08, 0.04, 0.01, 0, 0, 0,
0, 0, 0), `7` = c(0.02, 0.03, 0.02, 0.01, 0, 0, 0, 0, 0, 0, 0
), `8` = c(0, 0.01, 0, 0, 0, 0, 0, 0, 0, 0, 0), `9` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `10+` = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0)), row.names = c("0", "1", "2", "3", "4", "5", "6", "7",
"8", "9", "10+"), class = "data.frame")
So I transform df into a matrix to get a heatmap:
heatmap(data.matrix(df), Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(5,10))
This is what the plot looks like:
I'm not sure how to:
Change the placement of the row and column labels. I want them both to start from 0 in the top-left corner and ascend until 10+.
I'd also like more granularity in the colour. Right now you can't even tell the difference in values by looking at the colour...
Is heatmap from base R even the right tool for this? I looked up a few examples and I wasn't sure if there's a better library to achieve what I want.
There are several packages that offer heatmap functionality. IMO base heatmap and gplots::heatmap.2 have not aged well and are no longer the best options. Three good possibilities are ggplot2::geom_tile, pheatmap, and ComplexHeatmap.
Example data
Let's assume we have a matrix
dta <- matrix(rnorm(25), nrow=5)
rownames(dta) <- letters[1:5]
colnames(dta) <- LETTERS[1:5]
ggplot2::geom_tile
The ggplot2 version requires your data to be a tidy dataframe, so we can transform our matrix with tidyr::pivot_longer().
library(ggplot2)
library(tibble)   # as_tibble()
library(tidyr)    # pivot_longer()
library(dplyr)    # the %>% pipe
dta %>%
  as_tibble(rownames = "myrows") %>%
  pivot_longer(cols = -myrows, names_to = "mycols", values_to = "level") %>%
  ggplot() +
  geom_tile(aes(x = myrows, y = mycols, fill = level))
pheatmap
The pheatmap package is quite good at generating modern heatmaps. It takes a matrix as input. It can cluster the rows and columns and make a dendrogram, which is often a desired feature. It can also scale rows and columns (effectively plotting a Z-score).
pheatmap::pheatmap(dta,
scale = "none",
cluster_rows = FALSE,
cluster_cols = FALSE)
Note that the positions of rows and columns are not the same as with ggplot. You can look at the options that allow some useful customization. For example, if our rows have classes defined elsewhere.
ann_df <- data.frame(row.names = rownames(dta),
classification = rep(c("first", "second"), times = c(2,3)))
pheatmap::pheatmap(dta,
scale = "none",
cluster_rows = FALSE,
cluster_cols = FALSE,
annotation_row = ann_df,
gaps_row = c(2))
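Coming back to the clustering and scaling mentioned above (switched off in the calls so far), a minimal sketch to turn them on would be:
pheatmap::pheatmap(dta,
                   scale = "row",        # Z-score each row
                   cluster_rows = TRUE,  # row dendrogram
                   cluster_cols = TRUE)  # column dendrogram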
Color scale
One of the big aspects that make your heatmap look professional is the color scale. On ggplot, you should check out scale_fill_gradient2().
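For example, building on the tidy data frame from the geom_tile example above (a sketch; the colours and the midpoint of 0 are arbitrary choices):
library(ggplot2); library(tibble); library(tidyr); library(dplyr)  # as above
dta %>%
  as_tibble(rownames = "myrows") %>%
  pivot_longer(cols = -myrows, names_to = "mycols", values_to = "level") %>%
  ggplot() +
  geom_tile(aes(x = myrows, y = mycols, fill = level)) +
  scale_fill_gradient2(low = "navy", mid = "white", high = "firebrick",
                       midpoint = 0)  # diverging scale centred at 0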
On pheatmap, you can try these settings for color as a starting point (here 100 is just the number of colour steps; see the documentation of these functions):
color = scales::div_gradient_pal(low = "navy",
                                 mid = "green",
                                 high = "yellow")(seq(0, 1, length.out = 100)),

color = colorRampPalette(RColorBrewer::brewer.pal(n = 9,
                                                  name = "Blues"))(100),

color = viridisLite::plasma(100),
ComplexHeatmap
Finally, a package that has gained a lot of traction recently is ComplexHeatmap. It is based on pheatmap but offers many additional options. See the link in zx8754's comment for a detailed book full of examples.
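A minimal call looks something like this (a sketch; ComplexHeatmap is installed from Bioconductor):
library(ComplexHeatmap)
Heatmap(dta,
        name = "level",           # title of the colour legend
        cluster_rows = FALSE,
        cluster_columns = FALSE)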
I am trying to obtain several pieces of information for each correlation using corr.test.
I have two data sets, df1 and df2.
df1<- structure(list(col1A = c(1.64, 0.03, 0, 4.202, 2.981, 0.055,
0, 0.002, 0.005, 0, 0.002, 0.649, 2.55, 2.762, 6.402), col2A = c(2.635,
0.019, 0, 5.638, 3.542, 0.793, 0.259, 0, 0.046, 0.004, 0.017,
0.971, 3.81, 3.104, 5.849), col3A = c(0.91, 0.037, 0, 5.757,
3.916, 0.022, 0, 0, 0.003, 0, 0.262, 0.136, 2.874, 3.466, 5.003
), col4A = c(1.027, 0.021, 0, 4.697, 2.832, 0.038, 0.032, 0.001,
0.003, 0, 0, 0.317, 2.743, 3.187, 6.455)), class = "data.frame", row.names = c(NA,
-15L))
The second data set is below:
df2<-structure(list(col1 = c(2.172, 0, 0, 4.353, 4.581, 0.001, 0.027,
0, 0.002, 0, 0, 0.087, 2.129, 4.317, 5.849), col2 = c(2.093,
0, 0, 4.235, 3.166, 0, 0, 0.006, 0.01, 0, 0, 0.475, 0, 2.62,
5.364), col3 = c(3.322, 0, 0, 4.332, 4.018, 0.049, 0.169, 0.004,
0.02, 0, 0.032, 1.354, 2.944, 4.323, 5.44), col4 = c(0.928, 0.018,
0, 3.943, 3.723, 0.02, 0, 0, 0, 0, 0.075, 0.136, 3.982, 3.875,
5.83)), row.names = c("A", "AA", "AAA", "Aab", "buy", "yuyn",
"gff", "fgd", "kil", "lilk", "hhk", "lolo", "fdd", "vgfh", "nghg"
), class = "data.frame")
I want to obtain all possible correlations between the two data sets and extract all p-values and adjusted p-values.
I use
library(psych)
corr.test(df1,df2, use = "pairwise",method="pearson",adjust="holm",alpha=.05,ci=TRUE,minlength=5)
It does not give me any p-values, and I also cannot control any sort of permutation scheme for calculating the adjusted p-values.
I was thinking of using the following:
library(gtools)          # assuming permute() here is gtools::permute()

x <- df1[, 1]
y <- df2[, 2]
corr_init <- cor(x, y)   # original correlation
N <- 1000                # number of permutations
count <- 0               # counts correlations greater than corr_init
for (i in 1:N) {
  y_perm <- permute(y)
  if (cor(y_perm, x) > corr_init) count <- count + 1
}
p <- count/N             # final p-value
but then I would have to do it one by one, and I would still need to extract each column and test it separately...
I am wondering if there is a better way to calculate all the correlations between the two data sets and get the R values, p-values, and adjusted p-values with a specific number of randomizations?
It could be done using the Hmisc package:
library(Hmisc)
df1_cor_matrix <- rcorr(as.matrix(df1), type = "pearson")
df2_cor_matrix <- rcorr(as.matrix(df2), type = "pearson")
You can then extract out the coefficients using the following:
df1_coef <- df1_cor_matrix$r
df2_coef <- df2_cor_matrix$r
You can extract the p-values using the following:
df1_p_values <- df1_cor_matrix$P
df2_p_values <- df2_cor_matrix$P
You could get the adjusted p-values using the rcorr.adjust function (from the RcmdrMisc package); note that it takes the data, not the rcorr result, as input:
library(RcmdrMisc)
rcorr.adjust(as.matrix(df1), type = "pearson")
rcorr.adjust(as.matrix(df2), type = "pearson")
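Since you specifically want the correlations between df1 and df2, one possible route (a sketch, not a definitive recipe) is to run rcorr() on both sets at once, index out the cross block, and Holm-adjust those p-values with base p.adjust():
library(Hmisc)
both <- rcorr(as.matrix(df1), as.matrix(df2), type = "pearson")
# cross block: rows are df1 columns, columns are df2 columns
r_cross <- both$r[colnames(df1), colnames(df2)]
p_cross <- both$P[colnames(df1), colnames(df2)]
# Holm-adjusted p-values for the 4 x 4 = 16 cross tests
p_adj <- matrix(p.adjust(p_cross, method = "holm"),
                nrow = nrow(p_cross), dimnames = dimnames(p_cross))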
I am trying to run a multiple imputation using the mice function (from the package of the same name) in R. I get a Warning that events have been logged. Here is the output from mice(.)$loggedEvents from my MWE (see below):
it im dep meth out
1 1 X pmm H
I'm not sure what is causing this warning or what its implications are. From what I understand, it can be caused by collinearity amongst variables, but that should be prevented by using remove_collinear=FALSE, and yet that isn't fixing the warning.
MWE:
Pop <- data.frame(X = c( NA, 0.02, -1.15, 0.54, -0.61, -2.07),
Z = c( 0.83, 1.40, -3.07, -0.07, -0.20, -1.90),
D = c( 0, 0, 0, 1, 0, 0),
H = c( 0.01, 0.01, 0.01, 0.01, 0.02, 0.02))
Pop.Imp <- mice(Pop, m = 1, maxit = 1, print = T)
Obviously my original issue involved much more rows and columns of data and a higher number of imputations and iterations, but I've managed to trim this down to find this MWE.
Any help into figuring out what's causing this problem would be great. Is there some sort of cut-off that mice uses when deciding if/when a covariable is collinear? If it's very high, would this override the remove_collinear=FALSE parameter?
This isn't a full answer, but I couldn't fit the reply in a comment.
The logged events warning can arise from a variety of issues. The issue raised can be identified from the "meth" column in the mice()$loggedEvents output.
The two issues I know of are collinearity, and a predictor that is constant across all values (or maybe one that is constant within the missing/non-missing groups also satisfies this criterion). I added some variables to highlight these:
Pop <- data.frame(X = c( NA, 0.02, -1.15, 0.54, -0.61, -2.07, NA),
Z1 = c( 0.83, 1.40, -3.07, -0.07, -0.20, -1.90, 2.00),
Z2 = c( 0.83, 1.40, -3.07, -0.07, -0.20, -1.90, 2.00),
D = c( 0, 0, 0, 1, 0, 0, 1),
H = c( 0.01, 0.01, 0.01, 0.01, 0.02, 0.02, 0.02))
Pop.Imp <- mice(Pop, m = 1, maxit = 1, print = T)
Pop.Imp$loggedEvents
it im dep meth out
1 0 0 collinear Z2
2 1 1 X pmm H
Pop <- data.frame(X = c( NA, 0.02, -1.15, 0.54, -0.61, -2.07, NA),
Z1 = c( 0.83, 1.40, -3.07, -0.07, -0.20, -1.90, 2.00),
consvar = c( 0.83, 0.83, 0.83, 0.83, 0.83, 0.83, 0.83),
D = c( 0, 0, 0, 1, 0, 0, 1),
H = c( 0.01, 0.01, 0.01, 0.01, 0.02, 0.02, 0.02))
Pop.Imp <- mice(Pop, m = 1, maxit = 1, print = T)
Pop.Imp$loggedEvents
it im dep meth out
1 0 0 constant consvar
2 1 1 X pmm H
Unfortunately I don't know what the "pmm" issue is. Maybe it has something to do with predictive mean matching (the chosen imputation method) not being able to work in such a small dataset?
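One thing worth trying (a sketch, not a confirmed fix): do a dry run with maxit = 0 to get the predictor matrix mice intends to use, then check whether excluding H as a predictor of X makes the logged event disappear:
library(mice)
ini <- mice(Pop, maxit = 0, print = FALSE)   # dry run, no imputation performed
ini$loggedEvents                             # events detected during setup
pred <- ini$predictorMatrix
pred["X", "H"] <- 0                          # do not use H when imputing X
Pop.Imp <- mice(Pop, m = 1, maxit = 1, predictorMatrix = pred, print = TRUE)
Pop.Imp$loggedEvents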
I've been given a matrix:
P <- matrix(c(0, 0, 0, 0.5, 0, 0.5, 0.1, 0.1, 0, 0.4, 0, 0.4, 0, 0.2, 0.2, 0.3, 0, 0.3, 0, 0, 0.3, 0.5, 0, 0.2, 0, 0, 0, 0.4, 0.6, 0, 0, 0, 0, 0, 0.4, 0.6), nrow = 6, ncol = 6, byrow = TRUE)
Using the functions mpow, rows_equal, and matrices_equal, I want to find when P^n converges, in other words the n at which all rows of the matrix are equal and P^n = P^(n+1).
Just by experimenting with the functions, I have deduced that the matrix converges somewhere around n = 19-21.
However, I want to find the right n using a loop. Below are the functions mpow, rows_equal, and matrices_equal. I know they can be written differently, but please keep them as they are.
mpow <- function(P, n, d = 4) {
  if (n == 0) diag(nrow(P))
  else if (n == 1) P
  else P %*% mpow(P, n - 1)
}
rows_equal <- function(P, d = 4) {
  P_new <- trunc(P * 10^d)
  for (k in 2:nrow(P_new)) {
    if (!all(P_new[1, ] == P_new[k, ])) {
      return(FALSE)
    }
  }
  return(TRUE)
}

matrices_equal <- function(A, B, d = 4) {
  A_new <- trunc(A * 10^d)
  B_new <- trunc(B * 10^d)
  if (all(A_new == B_new)) TRUE else FALSE
}
Now, to write the loop, I think it should be something along these lines: first create a function like
when_converged <- function(P) {...}
and then loop with something like
for (n in 1:50)
to try up to, for example, n = 50.
However, I don't know how to write the code correctly to do this. Can anyone help me with that?
Thank you for reading my question.
Actually, a much better way is to do this:
## transition probability matrix
P <- matrix(c(0, 0, 0, 0.5, 0, 0.5, 0.1, 0.1, 0, 0.4, 0, 0.4, 0, 0.2, 0.2, 0.3, 0, 0.3, 0, 0, 0.3, 0.5, 0, 0.2, 0, 0, 0, 0.4, 0.6, 0, 0, 0, 0, 0, 0.4, 0.6), nrow = 6, ncol = 6, byrow = TRUE)
## a function to find stationary distribution
stydis <- function(P, tol = 1e-16) {
  n <- 1; e <- 1
  P0 <- P             ## transition matrix P0
  while (e > tol) {
    P <- P %*% P0     ## resulting matrix P
    e <- max(abs(sweep(P, 2, colMeans(P))))
    n <- n + 1
  }
  cat(paste("convergence after", n, "steps\n"))
  P[1, ]
}
Then when you call the function:
stydis(P)
# convergence after 71 steps
# [1] 0.002590674 0.025906736 0.116580311 0.310880829 0.272020725 0.272020725
The function stydis essentially keeps doing
P <- P %*% P0
until convergence of P is reached. Convergence is determined numerically from the discrepancy matrix
sweep(P, 2, colMeans(P))
by taking the maximum absolute value of its elements (the max norm). When this value drops below 1e-16, convergence is declared.
As you can see, convergence takes 71 steps. Now, we can obtain faster "convergence" by controlling tol (tolerance):
stydis(P, tol = 1e-4)
# convergence after 17 steps
# [1] 0.002589361 0.025898057 0.116564506 0.310881819 0.272068444 0.271997814
But if you check:
mpow(P, 17)
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0.002589361 0.02589806 0.1165645 0.3108818 0.2720684 0.2719978
# [2,] 0.002589415 0.02589722 0.1165599 0.3108747 0.2720749 0.2720039
# [3,] 0.002589738 0.02589714 0.1165539 0.3108615 0.2720788 0.2720189
# [4,] 0.002590797 0.02590083 0.1165520 0.3108412 0.2720638 0.2720515
# [5,] 0.002592925 0.02592074 0.1166035 0.3108739 0.2719451 0.2720638
# [6,] 0.002588814 0.02590459 0.1166029 0.3109419 0.2720166 0.2719451
Only the first 4 digits agree, because we set tol = 1e-4.
A double-precision floating point number carries about 16 significant decimal digits, so I would suggest tol = 1e-16 for a reliable convergence test.
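For completeness, the loop asked for in the question could be written along these lines, reusing the helper functions exactly as posted (with the mpow parentheses fixed as above); this is only a sketch, with nmax = 50 as in the question:
when_converged <- function(P, nmax = 50, d = 4) {
  for (n in 1:nmax) {
    Pn  <- mpow(P, n, d)      # mpow recomputes each power recursively, so this is slow but simple
    Pn1 <- mpow(P, n + 1, d)
    if (rows_equal(Pn, d) && matrices_equal(Pn, Pn1, d)) return(n)
  }
  NA                          # no convergence within nmax steps
}
when_converged(P)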