What does MICE loggedEvents mean? (R)

I am trying to run a multiple imputation using the mice function (from the package of the same name) in R. I get a Warning that events have been logged. Here is the output from mice(.)$loggedEvents from my MWE (see below):
it im dep meth out
 1  1   X  pmm   H
I'm not sure what is causing this warning or what its implications are. From what I understand, it can be caused by collinearity amongst variables, but that should be prevented by setting remove_collinear=FALSE; however, that isn't fixing the warning.
MWE:
library(mice)
Pop <- data.frame(X = c(   NA,  0.02, -1.15,  0.54, -0.61, -2.07),
                  Z = c( 0.83,  1.40, -3.07, -0.07, -0.20, -1.90),
                  D = c(    0,     0,     0,     1,     0,     0),
                  H = c( 0.01,  0.01,  0.01,  0.01,  0.02,  0.02))
Pop.Imp <- mice(Pop, m = 1, maxit = 1, print = T)
Obviously my original issue involved many more rows and columns of data and a larger number of imputations and iterations, but I've managed to trim it down to this MWE.
Any help in figuring out what's causing this problem would be great. Is there some sort of cut-off that mice uses when deciding if/when a covariate is collinear? If it's very high, would this override the remove_collinear=FALSE parameter?

This isn't a full answer, but I couldn't fit the reply in a comment.
The logged events warning can arise from a variety of issues, and the issue raised can be identified from the "meth" column of the mice()$loggedEvents output.
The two issues I know of are collinearity and a constant predictor across all values (or possibly a predictor that is constant across all missing/non-missing values also satisfies this criterion). I've added some variables to highlight these:
Pop <- data.frame(X  = c(   NA,  0.02, -1.15,  0.54, -0.61, -2.07,   NA),
                  Z1 = c( 0.83,  1.40, -3.07, -0.07, -0.20, -1.90, 2.00),
                  Z2 = c( 0.83,  1.40, -3.07, -0.07, -0.20, -1.90, 2.00),
                  D  = c(    0,     0,     0,     1,     0,     0,    1),
                  H  = c( 0.01,  0.01,  0.01,  0.01,  0.02,  0.02, 0.02))
Pop.Imp <- mice(Pop, m = 1, maxit = 1, print = T)
Pop.Imp$loggedEvents
  it im dep      meth out
1  0  0     collinear  Z2
2  1  1   X       pmm   H
Pop <- data.frame(X       = c(   NA,  0.02, -1.15,  0.54, -0.61, -2.07,   NA),
                  Z1      = c( 0.83,  1.40, -3.07, -0.07, -0.20, -1.90, 2.00),
                  consvar = c( 0.83,  0.83,  0.83,  0.83,  0.83,  0.83, 0.83),
                  D       = c(    0,     0,     0,     1,     0,     0,    1),
                  H       = c( 0.01,  0.01,  0.01,  0.01,  0.02,  0.02, 0.02))
Pop.Imp <- mice(Pop, m = 1, maxit = 1, print = T)
Pop.Imp$loggedEvents
  it im dep     meth     out
1  0  0     constant consvar
2  1  1   X      pmm       H
Unfortunately I don't know what the "pmm" issue is. Maybe it has something to do with predictive mean matching (the chosen imputation method) not being able to work in such a small dataset?
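One thing that can help investigate (not a full fix, just a sketch of the approach) is a dry run with maxit = 0: it populates loggedEvents and the default predictorMatrix without imputing anything, and you can then edit the predictorMatrix yourself and pass it back in. The specific tweak below (keeping Z2 as a predictor of X) is hypothetical and just reuses the variable names from the example above:
library(mice)

# dry run: no iterations, just the setup checks
ini <- mice(Pop, maxit = 0, print = FALSE)
ini$loggedEvents        # shows which variables were flagged (collinear, constant, ...)
pred <- ini$predictorMatrix
pred                    # rows = imputation targets, columns = predictors (1 = used)

# force a predictor back in (or set it to 0 to drop it) and re-run
pred["X", "Z2"] <- 1    # hypothetical tweak for the collinear example above
Pop.Imp <- mice(Pop, m = 1, maxit = 1, predictorMatrix = pred, print = FALSE)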


Vectorizing many for loops

I have a summation that I can calculate using four for loops but I wonder if this can be simplified, maybe using a vectorized function, to reduce the computation time. Something similar to the Kronecker product (in R: kronecker(x, x)), or maybe something using outer?
The summation is:
f_{ij} = \sum_{k \in E} \sum_{l \in E} h_{ik} \, h_{jl} \, g_{kl}
where E is the sample space of integers ranging from 1 to 9, and the i and j indices are also integers ranging from 1 to 9.
So f, g, and h are all matrices of dimension 9x9.
The h matrix is fixed and I already have it, but I am simulating g many times and then choosing the one that minimizes another function. The problem is that one thousand simulations (which is too few) take about 1 second; I really want to try a million, but that many would take a long time.
I have the for loops in a function:
sim <- function(y, nreps, h) {
  G <- vector("list", nreps)    # list containing random values from Dirichlet distribution
  F <- vector("list", nreps)    # list containing the f matrices
  M <- vector("numeric", nreps) # vector to store the results
  require(gtools)
  for (n in 1:nreps) {
    f <- matrix(0, nrow = 9, ncol = 9)    # initialize f
    g <- gtools::rdirichlet(9, rep(1, 9)) # simulate g
    for (i in 1:9) {
      for (j in 1:9) {
        for (k in 1:9) {
          for (l in 1:9) {
            f[i, j] <- f[i, j] + h[i, k] * h[j, l] * g[k, l] # summation (see above)
          }
        }
      }
    }
    F[[n]] <- f            # store f matrix
    G[[n]] <- g            # store g matrix
    M[n] <- sum((y - f)^2) # sum of squared differences between y and f
  }
  m <- which.min(M)        # which M is the minimum?
  return(list(g = G[[m]], m = M[m]))
}
And I call the function with
sim(y=f.y1, nreps=1000, h=x)
Here is the data:
> dput(f.y1)
structure(c(0.0182002022244692, 0.0121334681496461, 0.0101112234580384,
0, 0, 0, 0, 0, 0, 0.0485338725985844, 0.0940343781597573, 0.112234580384226,
0.0434782608695652, 0.00910010111223458, 0.00101112234580384,
0, 0, 0, 0.0333670374115268, 0.110212335692619, 0.132457027300303,
0.0808897876643074, 0.0222446916076845, 0.0070778564206269, 0.00101112234580384,
0, 0, 0.0070778564206269, 0.0202224469160768, 0.0596562184024267,
0.0616784630940344, 0.0262891809908999, 0.0070778564206269, 0,
0, 0, 0.00202224469160768, 0.00505561172901921, 0.0151668351870576,
0.0182002022244692, 0.0111223458038423, 0.00404448938321537,
0, 0, 0, 0.00202224469160768, 0.00404448938321537, 0.00505561172901921,
0.00505561172901921, 0.00202224469160768, 0.00202224469160768,
0, 0, 0, 0, 0.00202224469160768, 0.00202224469160768, 0.00202224469160768,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), class = "table", dim = c(9L, 9L), dimnames = structure(list(
c("0", "1", "2", "3", "4", "5", "6", "7", "8"), c("0", "1",
"2", "3", "4", "5", "6", "7", "8")), names = c("", "")))
> dput(x)
structure(c(0.61, 0.16, 0.03, 0.005, 0, 0, 0, 0, 0, 0.32, 0.61,
0.16, 0.03, 0.005, 0, 0, 0, 0, 0.06, 0.16, 0.61, 0.16, 0.03,
0.005, 0, 0, 0, 0.01, 0.06, 0.16, 0.61, 0.16, 0.03, 0.01, 0,
0, 0, 0.01, 0.03, 0.16, 0.61, 0.16, 0.03, 0.01, 0, 0, 0, 0.01,
0.03, 0.16, 0.61, 0.16, 0.06, 0.01, 0, 0, 0, 0.005, 0.03, 0.16,
0.61, 0.16, 0.06, 0, 0, 0, 0, 0.005, 0.03, 0.16, 0.61, 0.32,
0, 0, 0, 0, 0, 0.005, 0.03, 0.16, 0.61), dim = c(9L, 9L))
And you'll need to load the gtools package for the rdirichlet function. Thanks heaps!
library(gtools)
Luckily this particular example is just "simple" matrix multiplication: the summation f_{ij} = \sum_k \sum_l h_{ik} g_{kl} h_{jl} is exactly the (i, j) entry of h %*% g %*% t(h), so it can easily be vectorised with:
sim1 <- function(y, nreps, h) {
  G <- vector("list", nreps)    # list containing random values from Dirichlet distribution
  F <- vector("list", nreps)    # list containing the f matrices
  M <- vector("numeric", nreps) # vector to store the results
  require(gtools)
  for (n in 1:nreps) {
    g <- gtools::rdirichlet(9, rep(1, 9)) # simulate g
    f <- h %*% g %*% t(h)                 # equivalent to the quadruple loop above
    F[[n]] <- f            # store f matrix
    G[[n]] <- g            # store g matrix
    M[n] <- sum((y - f)^2) # sum of squared differences between y and f
  }
  m <- which.min(M)        # which M is the minimum?
  return(list(g = G[[m]], m = M[m]))
}
Run both functions for comparison
#Original version
set.seed(0)
system.time(a <- sim(y=f.y1, nreps=1000, h=x))
# user system elapsed
# 0.97 0.03 1.00
#revised version
set.seed(0)
system.time(b <- sim1(y=f.y1, nreps=1000, h=x))
# user system elapsed
# 0.01 0.00 0.02
#Check they give the same answer
all.equal(a, b)
#[1] TRUE
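If a million replicates is still too slow, a further small tweak (untested here, just a sketch with the same interface as sim1) is to precompute t(h) once and skip storing the f matrices, which are never returned:
sim2 <- function(y, nreps, h) {
  require(gtools)
  th <- t(h)                              # transpose computed once, not nreps times
  G <- vector("list", nreps)              # store only what is returned
  M <- numeric(nreps)
  for (n in seq_len(nreps)) {
    g <- gtools::rdirichlet(9, rep(1, 9)) # simulate g
    f <- h %*% g %*% th                   # same algebra as sim1
    G[[n]] <- g
    M[n] <- sum((y - f)^2)                # sum of squared differences
  }
  m <- which.min(M)
  list(g = G[[m]], m = M[m])
}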

pa package in R - performance attribution

When I try to load my own data set into the pa package I am not able to call the Brinson method. I have attached a screenshot. The package itself is bundled with the data sets "jan", "quarter" and "year", and there is no problem running the analysis using data(jan) or data(year), but my own data "Sample.xlsx" didn't work the same way. When I try to run it I get the following error: Error in is.data.frame(x) : object 'Sample' not found.
Can anyone help me clean up my data?
Here is the structure of the first 4 rows of my data:
data <-
structure(
list(
date = structure(
c(1420070400, 1420156800, 1420243200,
1420329600),
tzone = "UTC",
class = c("POSIXct", "POSIXt")
),
portfolio_return = c(0.0212, 0.019, 0.021, 0.03),
australian_equities_3 = c(0.01,
0.02, 0.01, 0.01),
domestic_fixed_interest_4 = c(0.02,-0.01,
0.02, 0.05),
cash_equivalents_5 = c(0.03, 0.04, 0.03, 0.03),
australian_equities_6 = c(0.28, 0.3, 0.3, 0.3),
domestic_fixed_interest_7 = c(0.32,
0.3, 0.3, 0.3),
cash_equivalents_8 = c(0.4, 0.4, 0.4, 0.4),
total_weight_9 = c(1, 1, 1, 1),
benchmark_return = c(0.0175,
0.0215, 0.032, 0.0381666666666666),
australian_equities_11 = c(0.02,
0.01, 0.01, 0.00333333333333333),
domestic_fixed_interest_12 = c(-0.01,
0.02, 0.05, 0.08),
cash_equivalents_13 = c(0.04, 0.03, 0.03,
0.0233333333333333),
australian_equities_14 = c(0.25, 0.25,
0.25, 0.25),
domestic_fixed_interest_15 = c(0.35, 0.35, 0.35,
0.35),
cash_equivalents_16 = c(0.4, 0.4, 0.4, 0.4),
total_weight_17 = c(1,
1, 1, 1),
allocation_australian_equities = c(7.5e-05,-0.000575,-0.0011,-0.00174166666666667),
allocation_domestic_fixed_interest = c(
0.000824999999999999,
7.49999999999999e-05,
-9e-04,
-0.00209166666666667
),
allocation_cash_equivalents = c(0,
0, 0, 0),
selection_australian_equities = c(-0.0025, 0.0025,
0, 0.00166666666666667),
selection_domestic_fixed_interest = c(0.0105,-0.0105,-0.0105,-0.0105),
selection_cash_equivalents = c(-0.004,
0.004, 0, 0.00266666666666668),
interaction_australian_equities = c(-3e-04,
5e-04, 0, 0.000333333333333333),
interaction_domestic_fixed_interest = c(-0.000899999999999999,
0.0015, 0.0015, 0.0015),
interaction_cash_equivalents = c(0,
0, 0, 0)
),
row.names = c(NA,-4L),
class = c("tbl_df", "tbl",
"data.frame")
)
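For reference, the error "object 'Sample' not found" is what R raises when the spreadsheet has never been read into an object called Sample. A minimal sketch of reading it in before calling the pa functions such as brinson(); the file name, its location in the working directory, and the use of readxl are assumptions, not confirmed from the post:
library(readxl)                       # read_excel() reads .xlsx files
Sample <- read_excel("Sample.xlsx")   # creates the 'Sample' object the error says is missing
Sample <- as.data.frame(Sample)       # convert from tibble to a plain data.frame
str(Sample)                           # check column names and types before calling brinson()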

Has anyone successfully generated a Sankey chart with R plotly in Power BI Desktop and the Power BI service?

I tried my best to create a Sankey chart with the R plotly package in Power BI, but failed. I can create the Sankey in RStudio with the same code. I checked the official documentation and the 'plotly' package I used is supported by the Power BI service, so there should be no reason the chart is not displayed. https://learn.microsoft.com/en-us/power-bi/connect-data/service-r-packages-support
library("plotly")
a = read.csv('cleanSankey.csv', header=TRUE, sep=',')
node_names <- unique(c(as.character(a$source), as.character(a$target)))
node_names <- node_names[order(sub('.*_', '', node_names))]
nodes <- data.frame(name = node_names)
links <- data.frame(source = match(a$source, node_names) - 1,
                    target = match(a$target, node_names) - 1,
                    value  = a$value)
node_x <- c(0, 0, 0, 0,
0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125,
0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25,
0.375, 0.375, 0.375, 0.375, 0.375, 0.375,
0.5, 0.5, 0.5, 0.5,
0.625, 0.6255, 0.625,
0.8, 0.8, 0.8,
0.999, 0.999)
node_y <- c(0.01, 0.02, 0.03, 0.04,
0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12,
0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08,
0.01, 0.02, 0.03, 0.04, 0.05, 0.06,
0.01, 0.02, 0.03, 0.04,
0.01, 0.02, 0.03,
0.01, 0.02, 0.03,
0.01, 0.02)
#Plot
plot_ly(type = 'sankey',
        orientation = "h",
        arrangement = "snap",
        node = list(
          label = node_names,
          x = node_x,
          y = node_y,
          color = "grey",
          pad = 15,
          thickness = 15,
          line = list(color = "grey", width = 0.5)),
        link = list(
          source = links$source,
          target = links$target,
          value = links$value))
Then I run the above code in Power BI Desktop, but Power BI says 'Can't display the visual'.
Does anyone have experience with this and can advise?
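Two things worth checking, both assumptions about the setup rather than facts confirmed by the post: the standard R visual in Power BI renders static graphics only, so an htmlwidget such as plotly generally will not display unless it is packaged as an R-powered custom visual (the pbiviz rhtml template); and inside an R visual the bound fields arrive as a data frame called dataset, so read.csv on a local file will not work in the Power BI service. A minimal sketch of the data-input change:
# inside a Power BI R visual the bound fields arrive as a data frame called `dataset`
# (assumption: source, target and value have been added as fields of the visual)
a <- dataset
# a <- read.csv('cleanSankey.csv', header=TRUE, sep=',')  # works in RStudio, not in the service

node_names <- unique(c(as.character(a$source), as.character(a$target)))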

Make Heatmap Look More Professional in R

I can build very basic plots in R, but I'm trying to make my heatmap look more professional and I'm not sure how to do it. I have a data frame df of 11 observations of 11 variables:
> dput(df)
structure(list(`0` = c(6.08, 7.91, 5.14, 2.23, 0.72, 0.19, 0.04,
0.01, 0, 0, 0), `1` = c(9.12, 11.86, 7.71, 3.34, 1.09, 0.28,
0.06, 0.01, 0, 0, 0), `2` = c(6.84, 8.89, 5.78, 2.5, 0.81, 0.21,
0.05, 0.01, 0, 0, 0), `3` = c(3.42, 4.45, 2.89, 1.25, 0.41, 0.11,
0.02, 0, 0, 0, 0), `4` = c(1.28, 1.67, 1.08, 0.47, 0.15, 0.04,
0.01, 0, 0, 0, 0), `5` = c(0.38, 0.5, 0.33, 0.14, 0.05, 0.01,
0, 0, 0, 0, 0), `6` = c(0.1, 0.13, 0.08, 0.04, 0.01, 0, 0, 0,
0, 0, 0), `7` = c(0.02, 0.03, 0.02, 0.01, 0, 0, 0, 0, 0, 0, 0
), `8` = c(0, 0.01, 0, 0, 0, 0, 0, 0, 0, 0, 0), `9` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `10+` = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0)), row.names = c("0", "1", "2", "3", "4", "5", "6", "7",
"8", "9", "10+"), class = "data.frame")
So I transform df into a matrix to get a heatmap:
heatmap(data.matrix(df), Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(5,10))
This is what the plot looks like:
I'm not sure how to:
Change the ordering of the labels for the rows and columns. I want them both to start from 0 in the top-left corner and ascend to 10+.
I'd also like more granularity in the colour. Right now you can't even tell the difference in values by looking at the colour.
Is heatmap from base R even the right function for this? I looked up a few examples and I wasn't sure if there's a better library to achieve what I want.
There are several libraries that offer heatmap functionality. IMO base heatmap and gplots::heatmap.2 did not age well and are not the best options anymore. Three good possibilities are ggplot2::geom_tile, pheatmap, and ComplexHeatmap.
Example data
Let's assume we have a matrix
dta <- matrix(rnorm(25), nrow=5)
rownames(dta) <- letters[1:5]
colnames(dta) <- LETTERS[1:5]
ggplot2::geom_tile
The ggplot2 version requires your data to be a tidy dataframe, so we can transform our matrix with tidyr::pivot_longer().
library(ggplot2)
library(tibble)
library(tidyr)
library(dplyr)   # for the pipe
dta %>%
  as_tibble(rownames = "myrows") %>%
  pivot_longer(cols = -myrows, names_to = "mycols", values_to = "level") %>%
  ggplot() +
  geom_tile(aes(x = myrows, y = mycols, fill = level))
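To get both axes starting in the top-left corner (as asked in the question), the ggplot version only needs the discrete y axis reversed; a sketch, assuming ggplot2 >= 3.3 where limits can be a function:
# reverse the y axis so the first level sits in the top-left corner
last_plot() + scale_y_discrete(limits = rev)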
pheatmap
The pheatmap package is quite good at generating modern heatmaps. It takes a matrix as input. It can cluster the rows and columns and make a dendrogram, which is often a desired feature. It can also scale rows and columns (effectively plotting a Z-score).
pheatmap::pheatmap(dta,
scale = "none",
cluster_rows = FALSE,
cluster_cols = FALSE)
Note that the positions of rows and columns are not the same as with ggplot. You can look at the options that allow some useful customization. For example, if our rows have classes defined elsewhere.
ann_df <- data.frame(row.names = rownames(dta),
classification = rep(c("first", "second"), times = c(2,3)))
pheatmap::pheatmap(dta,
scale = "none",
cluster_rows = FALSE,
cluster_cols = FALSE,
annotation_row = ann_df,
gaps_row = c(2))
Color scale
One of the big aspects that make your heatmap look professional is the color scale. On ggplot, you should check out scale_fill_gradient2().
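For instance, a diverging scale centred on zero could be added to the geom_tile plot above; the colours and midpoint here are arbitrary choices, just a sketch:
last_plot() +
  scale_fill_gradient2(low = "navy", mid = "white", high = "firebrick", midpoint = 0)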
On pheatmap, you can try these settings for color as a starting point (see the documentation of these functions):
color = scales::div_gradient_pal(low = "navy",
                                 mid = "green",
                                 high = "yellow")(seq(0, 1, length.out = max(dta))),

color = colorRampPalette(RColorBrewer::brewer.pal(n = 9,
                                                  name = "Blues"))(max(dta)),

color = viridisLite::plasma(max(dta)),
ComplexHeatmap
Finally, a package that has gained a lot of popularity recently is ComplexHeatmap. It can reproduce what pheatmap does and offers many additional options. See the link in zx8754's comment for a detailed book full of examples.
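A minimal sketch with ComplexHeatmap on the same example matrix (assuming the package is installed from Bioconductor):
library(ComplexHeatmap)

# Heatmap() is the main function; clustering is on by default, switched off here
Heatmap(dta,
        name = "level",          # title of the colour legend
        cluster_rows = FALSE,
        cluster_columns = FALSE)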

Obtain all possible correlations between two sets of data

I am trying to obtain several pieces of information for each correlation using corr.test.
I have two data sets, df1 and df2.
df1<- structure(list(col1A = c(1.64, 0.03, 0, 4.202, 2.981, 0.055,
0, 0.002, 0.005, 0, 0.002, 0.649, 2.55, 2.762, 6.402), col2A = c(2.635,
0.019, 0, 5.638, 3.542, 0.793, 0.259, 0, 0.046, 0.004, 0.017,
0.971, 3.81, 3.104, 5.849), col3A = c(0.91, 0.037, 0, 5.757,
3.916, 0.022, 0, 0, 0.003, 0, 0.262, 0.136, 2.874, 3.466, 5.003
), col4A = c(1.027, 0.021, 0, 4.697, 2.832, 0.038, 0.032, 0.001,
0.003, 0, 0, 0.317, 2.743, 3.187, 6.455)), class = "data.frame", row.names = c(NA,
-15L))
The second data set is below:
df2<-structure(list(col1 = c(2.172, 0, 0, 4.353, 4.581, 0.001, 0.027,
0, 0.002, 0, 0, 0.087, 2.129, 4.317, 5.849), col2 = c(2.093,
0, 0, 4.235, 3.166, 0, 0, 0.006, 0.01, 0, 0, 0.475, 0, 2.62,
5.364), col3 = c(3.322, 0, 0, 4.332, 4.018, 0.049, 0.169, 0.004,
0.02, 0, 0.032, 1.354, 2.944, 4.323, 5.44), col4 = c(0.928, 0.018,
0, 3.943, 3.723, 0.02, 0, 0, 0, 0, 0.075, 0.136, 3.982, 3.875,
5.83)), row.names = c("A", "AA", "AAA", "Aab", "buy", "yuyn",
"gff", "fgd", "kil", "lilk", "hhk", "lolo", "fdd", "vgfh", "nghg"
), class = "data.frame")
I want to obtain all possible correlations between the two and extract all p-values and adjusted p-values.
I use:
library(psych)
corr.test(df1,df2, use = "pairwise",method="pearson",adjust="holm",alpha=.05,ci=TRUE,minlength=5)
It does not give me any p-values. Also, I cannot control any sort of permutation to calculate the adjusted p-values.
I was thinking to use the following
library(gtools)            # for permute(); base::sample(y) would work as well

x <- df1[, 1]
y <- df2[, 2]
corr_init <- cor(x, y)     # original correlation
N <- 1000                  # number of permutations
count <- 0                 # counts correlations greater than corr_init
for (i in 1:N) {
  y_perm <- permute(y)     # random permutation of y
  if (cor(y_perm, x) > corr_init) count <- count + 1
}
p <- count / N             # final p
But then I have to do it one pair at a time, and I still need to extract each column and test...
I am wondering if there is a better way to calculate all correlations between two data sets and get the R values, p-values, and adjusted p-values with a specific number of randomizations?
It could be done using the Hmisc package:
library(Hmisc)
df1_cor_matrix <- rcorr(as.matrix(df1), type = "pearson")
df2_cor_matrix <- rcorr(as.matrix(df2), type = "pearson")
You can then extract out the coefficients using the following:
df1_coef <- df1_cor_matrix$r
df2_coef <- df2_cor_matrix$r
You can extract the p-values using the following:
df1_p_values <- df1_cor_matrix$P
df2_p_values <- df2_cor_matrix$P
You could get the adjusted p-values using the rcorr.adjust function from the RcmdrMisc package; it takes the data rather than the rcorr object:
library(RcmdrMisc)
rcorr.adjust(as.matrix(df1), type = "pearson")
rcorr.adjust(as.matrix(df2), type = "pearson")
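Note that running rcorr() on each data frame separately only gives the within-data-frame correlations. For the correlations between df1 and df2, one option (a sketch, not from the original answer) is to run rcorr() on the combined columns, pull out the cross block, and adjust those p-values yourself:
library(Hmisc)

cross <- rcorr(as.matrix(cbind(df1, df2)), type = "pearson")

# keep only the df1-vs-df2 block of the coefficient and p-value matrices
r_cross <- cross$r[colnames(df1), colnames(df2)]
p_cross <- cross$P[colnames(df1), colnames(df2)]

# adjust the cross p-values (e.g. Holm), keeping the matrix shape
p_adj <- matrix(p.adjust(p_cross, method = "holm"),
                nrow = nrow(p_cross), dimnames = dimnames(p_cross))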
