Plot a correlation matrix with R in a specific data range

I have used the corrplot package to plot my data pairs, but all the correlations in my data are positive.
library(corrplot)
library(RColorBrewer)
Mydata <- read.csv("./xxxx.csv")
M <- cor(Mydata)
corrplot(M, col = rev(brewer.pal(n = 8, name = "RdYlBu")))
I also could not find a solution using ggcorr.
How can I generate a user-defined colormap that covers the range from 0 to 1?

If you want to map the entire colormap onto only the positive correlations, you can use col = rep(rev(brewer.pal(n=8, name="RdYlBu")), 2). This repeats the color sequence; cl.lim = c(0,1) then forces corrplot to use only the second half of the sequence, mapped to the range 0 to 1. (In more recent corrplot releases the cl.lim argument appears to have been renamed col.lim, so check ?corrplot for your installed version.)
par(xpd = TRUE)
corrplot(M, type = "upper",
         col = rep(rev(brewer.pal(n = 8, name = "RdYlBu")), 2),
         cl.lim = c(0, 1),
         mar = c(1, 0, 1, 0))
Some reproducible data:
set.seed(12)
x <- (1:100) / 100
Mydata <- data.frame(a = x^runif(1, 0, 50),
                     b = x^runif(1, 0, 50),
                     c = x^runif(1, 0, 50),
                     d = x^runif(1, 0, 50),
                     e = x^runif(1, 0, 50),
                     f = x^runif(1, 0, 50),
                     g = x^runif(1, 0, 50),
                     h = x^runif(1, 0, 50),
                     i = x^runif(1, 0, 50))
M <- cor(Mydata)
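To see why repeating the palette works, here is a minimal base-R sketch. It assumes the colors are spread evenly over the full correlation range [-1, 1], which is the behaviour the answer relies on. Here pal is a stand-in palette built with colorRampPalette rather than RColorBrewer, and col_index is a hypothetical helper for illustration, not part of corrplot.

```r
# Stand-in 8-colour palette (assumption: any 8-colour sequence would do)
pal <- colorRampPalette(c("blue", "yellow", "red"))(8)

# Repeat it, giving 16 colours that are assumed to span [-1, 1] evenly
doubled <- rep(pal, 2)

# Hypothetical helper: index of the colour an evenly spaced scale over
# [-1, 1] with n colours would assign to a correlation r
col_index <- function(r, n) pmin(n, floor((r + 1) / 2 * n) + 1)

# Correlations 0, 0.5 and 1 land in the second copy of the palette,
# so the range 0..1 runs through all 8 original colours
idx <- col_index(c(0, 0.5, 1), length(doubled))
doubled[idx]
```

Under that assumption, a correlation of 0 gets the first color of the original palette and a correlation of 1 gets the last, so the positive range uses the whole palette.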

Related

How can I condense a long list of items into categories for a repeated logit regression?

I'm using a program called Apollo to make an ordered logit model. In this model, you have to specify a list of variables like this:
apollo_beta = c(
b_var1_dum1 = 0,
b_var1_dum2 = 0,
b_var1_dum3 = 0,
b_var2_dum1 = 0,
b_var2_dum2 = 0,
b_var3_dum1 = 0,
b_var3_dum2 = 0,
b_var3_dum3 = 0,
b_var3_dum4 = 0)
I want to do two things:
Firstly, I want to be able to specify these beforehand:
specification1 = c(
b_var1_dum1 = 0,
b_var1_dum2 = 0,
b_var1_dum3 = 0,
b_var2_dum1 = 0,
b_var2_dum2 = 0,
b_var3_dum1 = 0,
b_var3_dum2 = 0,
b_var3_dum3 = 0,
b_var3_dum4 = 0)
And then be able to call it:
apollo_beta = specification1
Secondly, I want to be able to make categories:
var1 <- c(
b_var1_dum1 = 0,
b_var1_dum2 = 0,
b_var1_dum3 = 0)
var2 <- c(
b_var2_dum1 = 0,
b_var2_dum2 = 0)
var3 <- c(
b_var3_dum1 = 0,
b_var3_dum2 = 0,
b_var3_dum3 = 0,
b_var3_dum4 = 0)
And then be able to use those in the specification:
specification1 = c(
var1,
var2,
var3)
And then:
apollo_beta = specification1
I know that Apollo is a very niche program and you may not be familiar with it. I am not sure whether this is even possible, but since it would save me days (maybe weeks) of work, can anyone give me a hint about what I might be doing wrong? I worry that I have a list within a list.
I have to make 60 specifications of the same model with different variations of 6 variables, so it would be a lot of code and a lot of work if I can't shorten it like this.
Any tips would be greatly appreciated.
Data:
df <- data.frame(
var1_dum1 = c(0, 1, 0),
var1_dum2 = c(1, 0, 0),
var1_dum3 = c(0, 0, 1),
var2_dum1 = c(0, 1, 0),
var2_dum2 = c(1, 0, 0),
var3_dum1 = c(1, 1, 0),
var3_dum2 = c(1, 0, 0),
var3_dum3 = c(0, 1, 0),
var3_dum4 = c(0, 0, 1)
)
So there is a dataset with these variables. In Apollo you specify "database = df" first, so it already refers to the variables.
The list in apollo_beta doesn't refer to the variables directly, so technically you can name its elements whatever you want. I just want to give them the same names as the variables because I will refer to them later.
My question is simple: can I condense the long list to simply say "specification1"? It is purely a question about the R language, namely whether the items of the list will behave the same way as when they were written out in full.
In other words, would calling apollo_beta in the above three examples lead to the same result? If not, how do I change the code so that it does?
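From R's point of view the three variants should be equivalent: c() applied to named numeric vectors returns one flat named numeric vector, not a list within a list. The following minimal sketch checks this. Whether Apollo itself imposes further requirements on apollo_beta is not verified here, and make_betas is a hypothetical helper introduced only for illustration.

```r
# The categories, written as named numeric vectors
var1 <- c(b_var1_dum1 = 0, b_var1_dum2 = 0, b_var1_dum3 = 0)
var2 <- c(b_var2_dum1 = 0, b_var2_dum2 = 0)
var3 <- c(b_var3_dum1 = 0, b_var3_dum2 = 0, b_var3_dum3 = 0, b_var3_dum4 = 0)

# Combining them with c() flattens everything into one named vector
specification1 <- c(var1, var2, var3)

# The same thing written out in full
long_form <- c(b_var1_dum1 = 0, b_var1_dum2 = 0, b_var1_dum3 = 0,
               b_var2_dum1 = 0, b_var2_dum2 = 0,
               b_var3_dum1 = 0, b_var3_dum2 = 0, b_var3_dum3 = 0,
               b_var3_dum4 = 0)

identical(specification1, long_form)  # TRUE
is.list(specification1)               # FALSE

# Hypothetical helper to generate a category without typing each name
make_betas <- function(var, n) {
  setNames(numeric(n), paste0("b_", var, "_dum", seq_len(n)))
}
identical(make_betas("var1", 3), var1)  # TRUE
```

With this, apollo_beta = specification1 assigns exactly the same R object as the long form, so any difference in behaviour would have to come from Apollo rather than from R.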

Place elements from vector on histogram bins (R ggplot)

I have a ggplot histogram, showing two histograms of a continuous variable, one for each level of a group.
Through use of ggplot_build, I now also have vectors where each element is the proportional count of one group (1) versus the other (0), per bin.
So for the following histogram built with
ggplot(data, aes(x = nonfordist)) +
  geom_histogram(aes(fill = presence),
                 position = "identity", alpha = 0.5, bins = 30) +
  coord_cartesian(xlim = c(NA, 1750))
I have the following list, showing sequential proportions of group1/group0 per bin
list(0.398927744608261, 0.35358629130967, 0.275296034083078,
0.247361252979231, 0.260224274406332, 0.22107969151671, 0.252847380410023,
0.230055658627087, 0.212244897959184, 0.242105263157895,
0.235294117647059, 0.115384615384615, 0.2, 0.421052631578947,
0.4375, 0.230769230769231, 0.222222222222222, 0.5, 0, 0,
0, NaN, 1, 1, 0, 0, NaN, NaN, NaN, Inf)
What I want now is to plot the elements of this list on the corresponding bins, preferably above the bars showing the counts for group1.
I do not want to include the proportions for bins that fall outside of the histogram due to my xlim command.
You could use stat_bin with a text geom, using the same breaks as you do for your histogram. We don't have your actual data, so I've tried to approximate it here (see footnote for reproducible data). You haven't told us what your list of proportions is called, so I have named it props in this example.
library(ggplot2)
# Use the same explicit breaks for the bars and the labels so they line up
ggplot(data, aes(x = nonfordist)) +
  geom_histogram(aes(fill = presence),
                 breaks = seq(-82.5, by = 165, length.out = 11),
                 position = "identity", alpha = 0.5) +
  stat_bin(data = data[data$presence == 1, ], geom = "text",
           breaks = seq(-82.5, by = 165, length.out = 11),
           label = round(unlist(props)[1:10], 2), vjust = -0.5) +
  coord_cartesian(xlim = c(NA, 1750))
Approximation of data
data <- data.frame(
nonfordist = rep(165 * c(0:10, 0:10),
c(24800, 20200, 16000, 6000, 2800, 1300, 700, 450, 100,
50, 30, 9950, 7400, 4500, 600, 300, 150, 80, 50, 30, 20,
10)),
presence = factor(rep(c(0, 1), c(72430, 23090))))
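For the approximated data, a props vector matching those breaks can be sketched in base R. This is an illustration only: it is not the ggplot_build route used in the question, and it assumes the same right-closed bins as the breaks above.

```r
# Approximated data from the answer above
data <- data.frame(
  nonfordist = rep(165 * c(0:10, 0:10),
                   c(24800, 20200, 16000, 6000, 2800, 1300, 700, 450, 100,
                     50, 30, 9950, 7400, 4500, 600, 300, 150, 80, 50, 30, 20,
                     10)),
  presence = factor(rep(c(0, 1), c(72430, 23090))))

# Same breaks as in the plot: 11 break points give 10 bins
breaks <- seq(-82.5, by = 165, length.out = 11)

# Assign each observation to a bin; values beyond the last break become NA
bins <- cut(data$nonfordist, breaks = breaks)

# Counts per bin and group, then the group1 / group0 ratio per bin
counts <- table(bins, data$presence)
props <- counts[, "1"] / counts[, "0"]
round(unname(props), 2)
```

Because the bins are built from the same breaks as the histogram, props has one element per plotted bar and can be passed as the label in stat_bin.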

plot pearson correlation for the two datasets

I want to plot the correlation between two datasets, with the frequency of genes on the y axis and the correlation (or R2) value on the x axis.
gvds_counts <- read.delim("GVDS_normalized_counts_new.txt", header = TRUE, sep = "\t", row.names = 1)
gvds_elastic <- read.delim("GVDS_PrediXcan_Test_2021_new.txt", header = TRUE, sep = "\t", row.names = 1)
gvds_elastic <- gvds_elastic[, -1]
gvds_elastic1 <- t(gvds_elastic)
This is what my two datasets look like:
dput(gvds_elastic1[1:5,1:5])
structure(c(0.117057915232398, 0, 0.16739607781207, -0.00582814885591799,
-0.00522232324665859, 0.196331738115648, 0, -0.00712521269403638,
0, -0.00522232324665859, 0.196331738115648, -0.021754984214482,
-0.00356260634701819, -0.0210845683319449, -0.00522232324665859,
0.177439691941073, -0.021754984214482, 0, -0.0210845683319449,
-0.00522232324665859, 0.158547645766499, -0.0435099684289639,
0, -0.00582814885591799, -0.00522232324665859), .Dim = c(5L,
5L), .Dimnames = list(c("ENSG00000107959.15", "ENSG00000057608.16",
"ENSG00000281186.1", "ENSG00000151632.17", "ENSG00000134452.19"
), c("GTEX-1117F", "GTEX-111CU", "GTEX-111FC", "GTEX-111VG",
"GTEX-111YS")))
dput(gvds_counts[1:5,1:5])
structure(list(GTEX.1117F.3226.SM.5N9CT = c(1.07050611836238,
319.010823271988, 0, 0, 0), GTEX.111FC.3126.SM.5GZZ2 = c(0, 137.627503211918,
0.81921132864237, 1.63842265728474, 0), GTEX.1128S.2726.SM.5H12C = c(0.93125973749023,
98.7135321739644, 0, 0.93125973749023, 0.93125973749023), GTEX.117XS.3026.SM.5N9CA = c(0,
140.966661860149, 0, 0.766123162283421, 0.766123162283421), GTEX.1192X.3126.SM.5N9BY = c(0.937426181007344,
139.676500970094, 0, 0.937426181007344, 0)), row.names = c("ENSG00000223972.5",
"ENSG00000227232.5", "ENSG00000278267.1", "ENSG00000243485.5",
"ENSG00000237613.2"), class = "data.frame")
First I want to rename the sample IDs in gvds_counts so that they match those in gvds_elastic1, i.e. change
GTEX.1128S.2726.SM.5H12C to GTEX-1128S
and then plot the correlation between these two datasets, with the R2 value on the x axis and the frequency of genes on the y axis.
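One possible approach, sketched on small made-up matrices rather than the real files: shorten the sample IDs with a regular expression, keep only the genes and samples present in both datasets, compute one Pearson correlation per gene, and draw a histogram of the squared values. The objects counts, pred and counts_m are hypothetical stand-ins for gvds_counts and gvds_elastic1. The sketch also assumes that each shortened ID is unique, which may not hold if one donor has several samples.

```r
set.seed(1)
genes <- paste0("ENSG", 1:6)

# Toy stand-in for gvds_counts (long sample IDs, dots as separators)
counts <- as.data.frame(matrix(
  rnorm(24), nrow = 6,
  dimnames = list(genes, c("GTEX.1117F.3226.SM.5N9CT",
                           "GTEX.111FC.3126.SM.5GZZ2",
                           "GTEX.1128S.2726.SM.5H12C",
                           "GTEX.117XS.3026.SM.5N9CA"))))

# Toy stand-in for gvds_elastic1 (short sample IDs, hyphen separator)
pred <- matrix(
  rnorm(24), nrow = 6,
  dimnames = list(genes, c("GTEX-1117F", "GTEX-111FC",
                           "GTEX-1128S", "GTEX-117XS")))

# Keep the first two dot-separated fields and join them with a hyphen
colnames(counts) <- sub("^([^.]+)\\.([^.]+)\\..*$", "\\1-\\2",
                        colnames(counts))

# Restrict both datasets to the genes and samples they have in common
shared_genes <- intersect(rownames(counts), rownames(pred))
shared_samples <- intersect(colnames(counts), colnames(pred))
counts_m <- as.matrix(counts)

# One Pearson correlation per gene, across the shared samples
r <- vapply(shared_genes, function(g) {
  cor(counts_m[g, shared_samples], pred[g, shared_samples])
}, numeric(1))

# Histogram: R2 on the x axis, number of genes on the y axis
hist(r^2, xlab = "R2 (observed vs. predicted)", ylab = "Number of genes",
     main = "")
```

Genes with constant values across the shared samples yield NA correlations in real data, so they may need to be filtered out before plotting.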

Pooling Survreg Results Across Multiply Imputed Datasets - Error Message: log(1 - 2 * pnorm(width/2)) : NaNs produced

I am trying to run an interval regression using the survival R package (as described here https://stats.oarc.ucla.edu/r/dae/interval-regression/), but I am running into difficulties when pooling results across multiply imputed datasets. Specifically, although estimates are returned, I get the following message: log(1 - 2 * pnorm(width/2)) : NaNs produced. The estimates seem reasonable at face value (no NaNs, no very large or very small SEs).
I ran the same model on the stacked dataset (ignoring imputations) and on individual imputed datasets, and in neither case do I get the message. Would someone be able to explain what is going on? Can this message be ignored? If not, is there a workaround that avoids it?
Thanks so much!
# A Reproducible Example
library(survival)
library(mice)
library(car)
# Create DF
dat <- data.frame(dv = c(1, 1, 2, 1, 0, NA, 1, 4, NA, 0, 3, 1, 3, 0, 2, 1, 4, NA, 2, 4),
catvar1 = factor(c(0, 0, 0, 0, 0, 1, 0, 0, 0, NA, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0)),
catvar2 = factor(c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, NA, 0)))
dat_imp <- mice(data = dat)
# Transform Outcome Var for Interval Reg
dat_imp_long <- complete(dat_imp, action = "long", include=TRUE)
# 1-4 correspond to ranges (e.g., 1 = 1 to 2 times...4 = 10 or more)
# create variables that reflect this range
dat_imp_long$dv_low <- car::recode(dat_imp_long$dv, "0 = 0; 1 = 1; 2 = 3; 3 = 6; 4 = 10")
dat_imp_long$dv_high <- car::recode(dat_imp_long$dv, "0 = 0; 1 = 2; 2 = 5; 3 = 9; 4 = 999")
dat_imp_long$dv_high[dat_imp_long$dv_high > 40] <- Inf
# Convert back to mids
dat_mids <- as.mids(dat_imp_long)
# Run Interval Reg
model1 <- with(dat_mids, survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian"))
# Warning message for both calls: In log(1 - 2 * pnorm(width/2)) : NaNs produced
# Problem does not only occur with pool, but summary
summary(model1)
summary(pool(model1))
# Run Equivalent Model on Individual Datasets
# No errors produced
imp1 <- subset(dat_imp_long, .imp == 1)
model2 <- survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian", data = imp1)
summary(model2)
imp2 <- subset(dat_imp_long, .imp == 2)
model3 <- survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian", data = imp2)
summary(model3)
# Equivalent Analysis on Stacked Dataset
# No error
model <- with(dat_imp_long, survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian"))
summary(model)

How to color code a categorical variable in a mosaic

I am trying to display a relationship between my categorical variables. I finally got my data into what I believe is a contingency table
subs_count
##                [,1] [,2] [,3] [,4]
## carbohydrate      2    0   11    2
## cellulose        18    0   60    0
## chitin            0    4    0    4
## hemicellulose    21    3   10    0
## monosaccharide    3    0    0    0
## pectin            8    0    2    2
## starch            1    0    4    0
Where each column represents an organism. So for my plot I put in
barplot(subs_count, ylim = c(0, 100), col = predicted.substrate,
xlab = "organism", ylab = "ESTs per substrate")
But my substrates are not consistently the same color. What am I doing wrong?
Your data seems to be a matrix with row names, which is close to a contingency table in R but not exactly the same. Some plotting methods have additional support for tables.
More importantly, I couldn't run your code because it is unclear what predicted.substrate is. If it were a palette with 7 colors then it should do what you intend to do (or at least what I think you intend).
I replicated your data with:
subs_count <- structure(c(2, 18, 0, 21, 3, 8, 1, 0, 0,
4, 3, 0, 0, 0, 11, 60, 0, 10, 0, 2, 4, 2, 0, 4, 0, 0, 2, 0),
.Dim = c(7L, 4L), .Dimnames = list(c("carbohydrate", "cellulose",
"chitin", "hemicellulose", "monosaccharide", "pectin", "starch"), NULL))
And then transformed them into a table by:
subs_count <- as.table(subs_count)
names(dimnames(subs_count)) <- c("EST", "Organism")
Then I used a qualitative palette from the colorspace package:
subs_pal <- colorspace::qualitative_hcl(7)
With this palette, your barplot looks reasonable:
barplot(subs_count, ylim = c(0, 100), col = subs_pal,
        xlab = "organism", ylab = "ESTs per substrate", legend.text = TRUE)
And a mosaic display (as indicated in your title) would be:
mosaicplot(t(subs_count), col = subs_pal, off = 5, las = 1, main = "")
For visualizing patterns of dependence (or rather departures from independence) a mosaic plot shaded with residuals from the independence model might be even more useful.
mosaicplot(t(subs_count), shade = TRUE, off = 5, las = 1, main = "")
More refined versions of shaded mosaic displays are available in package vcd (see doi:10.18637/jss.v017.i03).
