Correlation in Scatterplot Matrix with missing values - r

I want to make a scatterplot matrix with points in upper pane and r or r2 values in lower pane, as described here: http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
When there is no missing data, it works fine. But when there are some missing values, it seems unable to calculate R, even when I use code I thought would account for missing values. See the commented-out lines in the code below, which show what I've tried. Those attempts were based on what I found after searching here on StackOverflow: Dealing with missing values for correlations calculation
Probably something simple, as I'm a pretty simple R user (so I'm hoping for solutions that are more simple than elegant). Talk to me like I'm stupid!
I do not want to remove whole rows just because there is one missing value, as my real dataset (not this example) is rather small.
# --------------------------------------
# Create Dataframes, one with missing values
# --------------------------------------
Alx <- c(13, 9, 5, 17, 2, 8, 11, 4)
Bex <- c(23, 41, 32, 58, 26, 33, 51, 46)
Dex <- c(7,10,6,4,19,6,15,16)
Gax <- c(43,54,31,28,60,30,43,21)
AlxM <- c(NA, 9, 5, 17, 2, 8, 11, 4)
BexM <- c(23, 41, NA, 58, 26, 33, 51, 46)
DexM <- c(7,10,6,4,19,6,15,NA)
GaxM <- c(43,54,31,28,60,30,43,21)
df <- data.frame(Alx,Bex,Dex,Gax) # dataframe that works in scatterplot matrix
df_miss <- data.frame(AlxM,BexM,DexM,GaxM)# dataframe that has missing values
rm(Alx,Bex,Dex,Gax,AlxM,BexM,DexM,GaxM) # removing un-needed garbage
# --------------------------------------
# --------------------------------------
# Scatterplot Matrix - functions for upper and lower
# panels, it is the line "r <- round(cor(x,y), digits=2)"
# that I've been focusing on. Perhaps the wrong approach?
# see: http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
# --------------------------------------
# Upper panel
upper.panel<-function(x, y){
points(x,y, pch=19)
r <- round(cor(x,y), digits=2)
txt <- paste0("R = ", r)
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
text(0.5, 0.9, txt)
}
# Correlation panel
panel.cor <- function(x, y){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- round(cor(x, y), digits=2) # gives all NA
# Neither of these (immediately below) worked for me:
# see: https://stackoverflow.com/questions/7445639/dealing-with-missing-values-for-correlations-calculation
# r <- round(cor(na.omit(x, y)), digits=2) # does not work
# r <- round(cor(x, y), use="pairwise.complete.obs", digits=2) # does not work
txt <- paste0("R = ", r)
cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = 0.5)
}
# Scatterplots
pairs(df[,1:4], lower.panel = panel.cor,
upper.panel = upper.panel)
pairs(df_miss[,1:4], lower.panel = panel.cor,
upper.panel = upper.panel)
# --------------------------------------

The use argument belongs inside cor(). In the OP's commented line, r <- round(cor(x, y), use="pairwise.complete.obs", digits=2), it sits outside cor() and is passed to round() instead. The corrected panel function:
panel.cor <- function(x, y){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- round(cor(x, y, use = "pairwise.complete.obs"), digits=2)
txt <- paste0("R = ", r)
cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = 0.5)
}
-testing
pairs(df_miss[,1:4], lower.panel = panel.cor,
upper.panel = upper.panel)
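To see why the default call returns NA while this version works, it can help to call cor() directly on two of the example columns. The following is a small sketch; the variable names r_default, r_pairwise and r_manual are illustrative only.

```r
# Two columns from df_miss, each with one NA in a different position
x <- c(NA, 9, 5, 17, 2, 8, 11, 4)        # AlxM
y <- c(23, 41, NA, 58, 26, 33, 51, 46)   # BexM

# Default behaviour: any NA in either vector makes the result NA
r_default <- cor(x, y)

# use = "pairwise.complete.obs" is an argument of cor(), not of round()
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

# The same computation by hand: keep only positions where both are present
ok <- complete.cases(x, y)
r_manual <- cor(x[ok], y[ok])

print(c(default = r_default, pairwise = r_pairwise, manual = r_manual))
```

Because incomplete pairs are dropped panel by panel, each panel may be based on a different number of observations. The na.omit(x, y) attempt does not help because na.omit() only cleans its first argument, so cor() is handed a single vector rather than a pair.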

Related

When plotting a correlation matrix with pairs(), how to display long column names in many lines?

After drawing the correlation matrix in R, I try to read it in binary form in an external program.
However, each column name is too long, so in the correlation matrix drawn in pairs(), parts of both sides of the column names are cut off.
If the column name is this long, is there a way to make these column names appear in multiple lines?
And can we increase the fontsize of the column names to increase readability?
This is the sample code.
In this case, for example, I want the column
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
to look like
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
in two lines.
a <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
b <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
c <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
d <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
e <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
f <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
g <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
h <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
i <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
df <- data.frame(AAAAAAAAAAAAAAAAAAAAAAAAAAA = a,
BBBBBBBBBBBBBBBBBBBBBBBBBBB = b,
CCCCCCCCCCCCCCCCCCCCCCCCCCC = c,
DDDDDDDDDDDDDDDDDDDDDDDDDDD = d,
EEEEEEEEEEEEEEEEEEEEEEEEEEE = e,
FFFFFFFFFFFFFFFFFFFFFFFFFFF = f,
GGGGGGGGGGGGGGGGGGGGGGGGGGG = g,
HHHHHHHHHHHHHHHHHHHHHHHHHHH = h,
IIIIIIIIIIIIIIIIIIIIIIIIIII = i)
pairs(df,
lower.panel = NULL,
upper.panel = function(x, y){
points(x,y,pch=20)
r <- round(cor(x, y, use = "complete.obs"), digits=2)
txt <- paste0("R = ", r)
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
text(1, 0.95, txt, col="red", pos=2, cex=1.0)
}
)
I would begin by using stringi::stri_extract_all on a long label to break it into chunks of at most ten characters:
longlabel <- "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
small_chunks <- stringi::stri_extract_all(longlabel,
regex = ".{1,10}")[[1]]
After that, you can use paste to get
betterlabel <- paste(small_chunks, collapse = "\n")
And test that it works:
hist(rnorm(100), main = betterlabel)
On your example, you would need to wrap all that into an sapply to wrap the column names of df, like so:
colnames(df) <- sapply(stringi::stri_extract_all(colnames(df), regex = ".{1,10}"), paste, collapse = "\n")
This produces the desired multi-line labels.
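If adding stringi as a dependency is not desirable, the same wrapping can be sketched in base R, and the label size can be raised with the cex.labels argument of pairs(). The helper name wrap_label, the chunk width of 10, the three-column example data and the cex.labels value of 1.2 are assumptions chosen for illustration.

```r
# Hypothetical helper: break each string into chunks of at most `width`
# characters and join the chunks with newlines (base R only).
wrap_label <- function(x, width = 10) {
  pattern <- paste0(".{1,", width, "}")
  chunks <- regmatches(x, gregexpr(pattern, x))
  vapply(chunks, paste, character(1), collapse = "\n")
}

set.seed(1)
df <- data.frame(AAAAAAAAAAAAAAAAAAAAAAAAAAA = runif(100),
                 BBBBBBBBBBBBBBBBBBBBBBBBBBB = runif(100),
                 CCCCCCCCCCCCCCCCCCCCCCCCCCC = runif(100))

wrapped <- wrap_label(colnames(df))

# Pass the wrapped text through `labels` so the column names stay intact;
# cex.labels scales the text drawn on the diagonal.
pairs(df, labels = wrapped, cex.labels = 1.2, lower.panel = NULL)
```

Supplying the wrapped text through labels rather than overwriting colnames(df) keeps the original names available for later code that refers to the columns.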

R bootstrapping for the two dataframe individual column wise

I want to bootstrap a column-wise comparison of two data frames that have different numbers of rows.
I have two data frames in which the rows hold values from experiments and the columns are named after the datasets (data1, data2, data3, data4).
emp.data1 <- data.frame(
data1 = c(234,0,34,0,46,0,0,0,2.26,0, 5,8,93,56),
data2 = c(1.40,1.21,0.83,1.379,2.60,9.06,0.88,1.16,0.64,8.28, 5,8,93,56),
data3 =c(0,34,43,0,0,56,0,0,0,45,5,8,93,56),
data4 =c(45,0,545,34,0,35,0,35,0,534, 5,8,93,56),
stringsAsFactors = FALSE
)
emp.data2 <- data.frame(
data1 = c(45, 0, 0, 45, 45, 53),
data2 = c(23, 0, 45, 12, 90, 78),
data3 = c(72, 45, 756, 78, 763, 98),
data4 = c(1, 3, 65, 78, 9, 45),
stringsAsFactors = FALSE
)
I am trying to bootstrap (n = 1000). Values are sampled at random with replacement from emp.data1 (14 x 4), while emp.data2 (6 x 4) is left unchanged. For example, for the first column (data1): take the sum of the 6 values in emp.data2, draw 6 random non-zero values from emp.data1 and sum them, divide the emp.data2 sum by the emp.data1 sum, and store the result in a temporary object. Repeat this 1000 times and take the median at the end. I want to do this for each column of the data frame. The sample code below works, but I am not able to restrict the draws from emp.data1 to non-zero values.
nboot <- 1e3
boot_temp_emp<- c()
n_data1 <- nrow(emp.data1); n_data2 <- nrow(emp.data2)
for (j in seq_len(nboot)) {
boot <- sample(x = seq_len(n_data1), size = n_data2, replace = TRUE)
value <- colSums(emp.data2)/colSums(emp.data1[boot,])
boot_temp_emp <- rbind(boot_temp_emp, value)
}
boot_data<- apply(boot_temp_emp, 2, median)
The script above produces output, but the rows drawn in emp.data1[boot, ] include zero values, which then enter the sums. I want each column's sum to be taken over individually, randomly selected non-zero values. I tried the script below, but it does not remove the zero values and I cannot get the desired output. Could someone please help me correct my script?
nboot <- 1e3
boot_temp_emp<- c()
for (i in colnames(emp.data2)){
for (j in seq_len(nboot)){
data1=emp.data1[i]
data2=emp.data2[i]
n_data1 <- nrow(data1); n_data2 <- nrow(data2)
boot <- sample(x = seq_len(n_data1), size = n_data2, replace = TRUE)
value <- colSums(data2[i])/colSums(data1[boot, ,drop = FALSE])
boot_temp_emp <- rbind(boot_temp_emp, value)
}
}
boot_data<- apply(boot_temp_emp, 2, median)
Thank you
Here is a solution.
Write a function to make the code clearer. This function takes the following arguments:
x, the input data.frame emp.data1;
s2, the column sums of emp.data2;
n = 6, the number of vector elements to sample from emp.data1's columns, with a default value of 6.
Then create a results matrix, pre-compute the column sums of emp.data2 and call the function in a loop.
boot_fun <- function(x, s2, n = 6){
# the loop makes sure there is no divide by zero
nrx <- nrow(x)
repeat{
i <- sample(nrx, n, replace = TRUE)
s1 <- colSums(x[i, ])
if(all(s1 != 0)) break
}
s2/s1
}
set.seed(2022)
nboot <- 1e3
sums2 <- colSums(emp.data2)
results <- matrix(nrow = nboot, ncol = ncol(emp.data1))
for(i in seq_len(nboot)){
results[i, ] <- boot_fun(emp.data1, sums2)
}
ratios_medians <- apply(results, 2, median)
old_par <- par(mfrow = c(2, 2))
for(j in 1:4) {
main <- paste0("data", j)
hist(results[, j], main = main, xlab = "ratios", freq = FALSE)
abline(v = ratios_medians[j], col = "blue", lty = "dashed")
}
par(old_par)
Created on 2022-02-24 by the reprex package (v2.0.1)
Edit
Following the comments here is a revised version of the bootstrap function. It makes sure there are no zeros in the sampled vectors, before computing their sums.
boot_fun2 <- function(x, s2, n = 6){
nrx <- nrow(x)
ncx <- ncol(x)
s1 <- numeric(ncx)
for(j in seq.int(ncx)) {
repeat{
i <- sample(nrx, n, replace = TRUE)
if(all(x[i, j] != 0)) {
s1[j] <- sum(x[i, j])
break
}
}
}
s2/s1
}
set.seed(2022)
nboot <- 1e3
sums2 <- colSums(emp.data2)
results2 <- matrix(nrow = nboot, ncol = ncol(emp.data1))
for(i in seq_len(nboot)){
results2[i, ] <- boot_fun2(emp.data1, sums2)
}
ratios_medians2 <- apply(results2, 2, median)
old_par <- par(mfrow = c(2, 2))
for(j in 1:4) {
main <- paste0("data", j)
hist(results2[, j], main = main, xlab = "ratios", freq = FALSE)
abline(v = ratios_medians2[j], col = "blue", lty = "dashed")
}
par(old_par)
Created on 2022-02-27 by the reprex package (v2.0.1)
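Redrawing until all sampled entries are non-zero has the same effect, in distribution, as sampling directly from the non-zero entries of each column, which avoids the repeat loop. The following is a sketch of that variant and not part of the original answer; the names boot_fun3, results3 and ratios_medians3 are hypothetical, and the code assumes every column of x contains at least one non-zero value.

```r
emp.data1 <- data.frame(
  data1 = c(234, 0, 34, 0, 46, 0, 0, 0, 2.26, 0, 5, 8, 93, 56),
  data2 = c(1.40, 1.21, 0.83, 1.379, 2.60, 9.06, 0.88, 1.16, 0.64, 8.28, 5, 8, 93, 56),
  data3 = c(0, 34, 43, 0, 0, 56, 0, 0, 0, 45, 5, 8, 93, 56),
  data4 = c(45, 0, 545, 34, 0, 35, 0, 35, 0, 534, 5, 8, 93, 56)
)
emp.data2 <- data.frame(
  data1 = c(45, 0, 0, 45, 45, 53),
  data2 = c(23, 0, 45, 12, 90, 78),
  data3 = c(72, 45, 756, 78, 763, 98),
  data4 = c(1, 3, 65, 78, 9, 45)
)

# Hypothetical variant: for each column, sample n values with replacement
# from the non-zero entries only, then divide s2 by the resulting sums.
boot_fun3 <- function(x, s2, n = 6) {
  s1 <- vapply(x, function(col) {
    nz <- col[col != 0]
    # sample indices rather than values, so a single non-zero value is safe
    sum(nz[sample.int(length(nz), n, replace = TRUE)])
  }, numeric(1))
  s2 / s1
}

set.seed(2022)
nboot <- 1e3
sums2 <- colSums(emp.data2)
# replicate() returns one column per draw, so transpose to nboot x 4
results3 <- t(replicate(nboot, boot_fun3(emp.data1, sums2)))
ratios_medians3 <- apply(results3, 2, median)
print(ratios_medians3)
```

The individual draws will differ from those of boot_fun2 even with the same seed, since the random numbers are consumed differently.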

Adjust nomogram ticks with (splines) transformation, rms package [R]

I'm using a Cox regression model in which my variable enters through a spline transformation. Everything works nicely until the subsequent nomogram. As expected, the scale of my variable is also transformed, but I'd like to add some custom ticks in the region between values 0 and 2 (I guess that is the transformed part). Any ideas?
Here's my code...
library(rms)      # datadist, cph, Survival, nomogram
library(survival) # Surv
data <- source("https://pastebin.com/raw/rGtUSTLz")$value
ddist <- datadist(data)
options(datadist = "ddist")
fit <- cph(Surv(time, event) ~ rcs(var, 3), data = data, surv = T, x = T, y = T)
surv <- Survival(fit)
plot(nomogram(fit,
fun = list(function(x) surv(times = 10, lp = x),
function(x) surv(times = 30, lp = x),
function(x) surv(times = 60, lp = x)),
funlabel = paste("c", 1:3), lp = T))
... and these are the real and the desired outputs.
Thanks in advance for your help!
I have had this issue too. My answer is a workaround using another package, regplot. Alternatively, if you know the point values at the tick marks you want plotted, you can supply those directly instead of taking them from the regplot output. Basically, you need to modify the tick marks and points that nomogram() returns before passing the object to plot().
This method also provides a way to remove points / tick marks by editing the nomogram output.
library(rms)      # datadist, cph, Survival, nomogram
library(survival) # Surv
data <- source("https://pastebin.com/raw/rGtUSTLz")$value
ddist <- datadist(data)
options(datadist = "ddist")
fit <- cph(Surv(time, event) ~ rcs(var, 3), data = data, surv = T, x = T, y = T)
surv <- Survival(fit)
nom1 <- nomogram(fit, fun = list(function(x) surv(times = 10, lp = x),
function(x) surv(times = 30, lp = x),
function(x) surv(times = 60, lp = x)),
funlabel = paste("c", 1:3), lp = T)
library(regplot)
# call regplot with points = TRUE to get output
regplot(fit, fun = list(function(x) surv(times = 10, lp = x),
function(x) surv(times = 30, lp = x),
function(x) surv(times = 60, lp = x)),
funlabel = paste("c", 1:3), points = TRUE)
# look at the points supplied through regplot and take those.
nom1_edit <- nom1
# now we edit the ticks supplied for var and their corresponding point value
nom1_edit[[1]][1] <- list(c(0, 0.06, 0.15, 0.3, 2,4,6,8,10,12,14,16))
nom1_edit[[1]][2] <- list(c(0, 10, 21, 32, 42.41191, 50.63878, 58.86565,
67.09252, 75.31939, 83.54626, 91.77313, 100.00000))
nom1_edit$var$points <- c(0, 10, 21, 32, 42.41191, 50.63878, 58.86565,
67.09252, 75.31939, 83.54626, 91.77313, 100.00000)
# plot the edited nomogram with new points
plot(nom1_edit)

Plotting line segments on top of a plot from a previous run of an R function

I have an R function called stock (below). I was wondering if it might be in any way possible that the result of each run of the function (which is a plot()) be plotted (i.e., added) on top of the plot from the previous run of the function? (the picture below the code may show this)
stock = function(m, s){
loop = length(s)
I = matrix(NA, loop, 2)
for(i in 1:loop){
I[i,] = quantile(rbeta(1e2, m, s[i]), c(.025, .975))
}
plot(rep(1:loop, 2), I[, 1:2], ty = "n", ylim = 0:1, xlim = c(1, loop))
segments(1:loop, I[, 1], 1:loop, I[, 2])
}
# Example of use:
stock(m = 2, s = c(1, 10, 15, 20, 25, 30))
stock(m = 50, s = c(1, 10, 15, 20, 25, 30)) # The result of this run should be plotted on top of the previous run above
Simplest would be to add an argument for the option. As segments() by default adds to the previous frame, all you have to do is to not do a new plot().
stock = function(m, s, add=FALSE) {
loop = length(s)
I = matrix(NA, loop, 2)
for(i in 1:loop) {
I[i,] = quantile(rbeta(1e2, m, s[i]), c(.025, .975))
}
if (!add) {
plot(rep(1:loop, 2), I[, 1:2], ty = "n", ylim = 0:1, xlim = c(1, loop))
}
segments(1:loop, I[, 1], 1:loop, I[, 2], xpd = NA)
}
# Example of use:
set.seed(1)
stock(m = 2, s = c(1, 10, 15, 20, 25, 30))
stock(m = 50, s = seq(1, 90, 10), add=TRUE)
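When several runs are overlaid, it may help to distinguish them by colour. The sketch below extends the function with a col argument and returns the intervals invisibly; the name stock2, the col argument, the axis labels and the returned matrix are additions for illustration and not part of the answer above.

```r
# Hypothetical variant of stock(): optional overlay plus a colour per run
stock2 <- function(m, s, add = FALSE, col = "black") {
  loop <- length(s)
  I <- matrix(NA_real_, loop, 2)
  for (i in seq_len(loop)) {
    I[i, ] <- quantile(rbeta(1e2, m, s[i]), c(.025, .975))
  }
  if (!add) {
    # open an empty frame only on the first call
    plot(NA, type = "n", xlim = c(1, loop), ylim = 0:1,
         xlab = "index", ylab = "interval")
  }
  # segments() draws on whatever plot is currently open
  segments(seq_len(loop), I[, 1], seq_len(loop), I[, 2], col = col)
  invisible(I)
}

set.seed(1)
s <- c(1, 10, 15, 20, 25, 30)
first  <- stock2(m = 2,  s = s)
second <- stock2(m = 50, s = s, add = TRUE, col = "red")
```

Using the same s in both calls keeps the second set of segments inside the x-range fixed by the first plot.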

Create multiple comparisons without packages

I would like to create multiple comparisons programmatically in R, as a complete factorial design like the one I get from the gen.factorial() function in the AlgDesign package. Could someone tell me how to build it from my code? I cannot use gen.factorial() directly because my real data are unbalanced.
Factor
treat <- gl(4, 15, labels = paste("t", 1:4, sep="")); treat
Variables
set.seed(125)
sp <- cbind(c(rnorm(10, 5, 0.25), rnorm(50, 2.5, 0.25)), rnorm(60, 2.5, 0.25),
c(rnorm(10, 12, 0.25), rnorm(50, 2.5, 0.25)), rnorm(60, 2.5, 0.25))
colnames(sp) <- c("sp1", "sp2", "sp3", "sp4")
Comparisons
TI <- model.matrix(~ treat-1)
head(TI)
f <- nlevels(treat)
comb <- t(combn(1:f, 2))
n <- nrow(comb)
contr2 <- NULL
for (x in 1:n) {
i <- comb[x, 1]
j <- comb[x, 2]
tmp <- list(TI[,i] - TI[,j]); names(tmp) <- paste0("TI",i, "_", j)
contr2 <- c(contr2, tmp) }
contr2df <- as.data.frame(contr2)
contr2df# OK but incomplete
Equivalent, but creating a full factorial design
require(AlgDesign)
contr2df2 <-AlgDesign::gen.factorial(3, 6, TRUE, varNames=c("TI1_2", "TI1_3", "TI1_4", "TI2_3", "TI2_4", "TI3_4"))
contr2df2
#
Thanks,
Alexandre
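One base R possibility, assuming the goal is only to reproduce the kind of grid that gen.factorial(3, 6, TRUE, ...) returns without loading AlgDesign, is expand.grid(). The sketch below builds every combination of the centred levels -1, 0, 1 for the six pairwise contrasts; the object names pair_names, levels_list and full_design are illustrative, and the row order is not guaranteed to match gen.factorial().

```r
f <- 4                               # number of treatment levels, as nlevels(treat)
comb <- t(combn(seq_len(f), 2))      # all pairs of treatments
pair_names <- paste0("TI", comb[, 1], "_", comb[, 2])

# One three-level variable (-1, 0, 1) per pairwise contrast
levels_list <- rep(list(c(-1, 0, 1)), length(pair_names))
names(levels_list) <- pair_names

# Full factorial: every combination of the levels, 3^6 rows in total
full_design <- expand.grid(levels_list)

dim(full_design)
head(full_design)
```

Since expand.grid() only enumerates level combinations, it does not depend on how many observations each treatment has in the data.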

Resources