R function to plot binned means and model fit, ggplot - r

Sample data:
pp.inc <- structure(list(has.di.rec.pp = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), m.dist.km2 = c(-34.4150009155273, 6.80600023269653, -6.55499982833862,
-61.7700004577637, 15.6840000152588, -11.2869997024536, -26.9729995727539,
0, 81.9940032958984, -35.1459999084473, -12.5179996490479, 0,
21.5919990539551, 81.9940032958984, -20.7770004272461, 85.9469985961914,
-15.2959995269775, -75.5879974365234, 81.9940032958984, 3.04999995231628,
-17.1490001678467, -25.806999206543, -16.0060005187988, -14.91100025177,
-12.9020004272461, -16.0060005187988, 5.44000005722046, -34.4150009155273,
81.9940032958984, 3.61400008201599, 13.7379999160767, 2.71300005912781,
4.31300020217896), treated = c(0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
1, 1)), .Names = c("has.di.rec.pp", "m.dist.km2", "treated"), row.names = c(NA,
-33L), class = c("data.table", "data.frame"))
Code:
library(data.table)
library(ggplot2)
rddplot <- function(data, outcome, runvar, treatment = treated, span, bw, ...){
data <- data.table(data)
data.span <- data[abs(runvar) <= span, ]
data.span <- data.span[ , bins := cut(runvar,
seq(-span, span, by = bw),
include.lowest = TRUE, right = FALSE)]
data.span.plot <- data.span[ , list(avg.outcome = mean(outcome),
avg.runvar = mean(runvar),
treated = max(treatment),
n.iid = length(outcome)), keyby = bins]
data.span.plot <- data.span.plot[ , runvar := head(seq(-span, span, by = bw), -1)]
bp <- ggplot(data = data.span.plot, aes(x = runvar, y = avg.outcome))
bp <- bp + geom_point(aes(colour = n.iid))
bp <- bp + stat_smooth(data = data.span, aes(x = runvar, y = outcome,
group = factor(treatment)), ...)
bp
return(bp)
}
rddplot(pp.inc, has.di.rec.pp, m.dist.km2, treated, 50, 5)
This code runs perfectly if I do not wrap it in a function. I am a novice in R and use it only infrequently. What am I doing wrong? Am I missing something obvious, or does it have to do with data.table or ggplot2? I thought it might be something with ggplot, as other questions mention there is an issue and that aes_string should be used. I can rewrite the data.table parts to use base functions, but I think the error already occurs before that, on the second line of the function. How do I make this work?
EDIT:
[Original title:
R function returns Error in eval(expr, envir, enclos) : object 'name' not found]
I had some time to look at this again and have worked out a solution, hence I also modified the title a bit. Using eval() didn't really work out for me, so I went the [['columname']] selection route. I've ditched data.table (and plyr as well), so that this only uses base functions except for ggplot2. I am happy for any comments on how to improve it. Please let me know if there are some essential flaws. If not I will add an answer with my solution later.
I have changed the bin calculation so that there is always a breakpoint at zero, which is necessary. The default bin width is determined by the Silverman rule. I am thinking of calculating the model fit separately and returning it, as the model choice within ggplot is limited. However, I can't think of a nice way to incorporate this for a variety of diverse models such as lm or loess, and it's not strictly necessary. I actually wanted to overlay a thin bar plot displaying the number of observations in each bin, but found out this is impossible in ggplot (I know this is generally a bad idea, but there are several well-published papers which use similar graphs). I don't find the size aesthetic too appealing here, but these are really minor gripes.
Thanks for getting me on the right path.
My solution:
rddplot <- function(data, outcome, runvar, treatment = "treated",
span, bw = bw.nrd0(data[[runvar]]), ...){
breaks <- c(sort(-seq(0, span, by = bw)[-1]), seq(0, span, by = bw))
data.span <- data[abs(data[[runvar]]) <= max(breaks), ]
data.span$bins <- cut(data.span[[runvar]], breaks,
include.lowest = TRUE, right = FALSE)
data.span.plot <- as.data.frame(cbind(tapply(data.span[[outcome]], data.span$bins, mean),
tapply(data.span[[runvar]], data.span$bins, mean),
tapply(data.span[[treatment]], data.span$bins, max),
tapply(data.span[[outcome]], data.span$bins, length),
tapply(data.span[[outcome]], data.span$bins, sum)))
colnames(data.span.plot) <- c("avg.outcome", "avg.runvar", "treated", "n.iid", "n.rec")
data.span.plot$runvar <- head(breaks, -1)
print(data.span.plot)
bp <- ggplot(data = data.span.plot, aes(x = runvar, y = avg.outcome))
bp <- bp + geom_point(aes(size = n.iid))
bp <- bp + stat_smooth(data = data.span, aes_string(x = runvar, y = outcome,
group = treatment), ...)
print(bp)
}
Call:
rddplot(pp.inc, "has.di.rec.pp", "m.dist.km2", "treated", 50,
method = lm, formula = y ~ poly(x, 4, raw = TRUE))
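
The break construction used in the solution above can be checked in isolation. This is only a sketch with illustrative values for span and bw (they are assumptions, not the values from the call above); it shows that zero is always a break and that right = FALSE puts an observation at exactly zero into the first bin on the positive side.

```r
# Illustrative values (assumptions, not taken from the call above)
span <- 50
bw <- 20

# Mirror the positive break points around zero so that 0 is always a break
pos <- seq(0, span, by = bw)          # 0, 20, 40
breaks <- c(sort(-pos[-1]), pos)      # -40, -20, 0, 20, 40

# right = FALSE makes bins closed on the left, so 0 falls in [0,20)
x <- c(-35, -0.1, 0, 19.9, 40)
bins <- cut(x, breaks, include.lowest = TRUE, right = FALSE)
print(data.frame(x, bins))
```

Note that the outermost breaks stop at the last multiple of bw inside span, so values beyond max(breaks) would get NA bins. That is why the function filters on max(breaks) rather than on span.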

I have an approach using data.table and some deparse(substitute()) and setnames trickery....
rddplot <- function(data, outcome, runvar, treatment = treated, span, bw, ...){
# convert to data.table
data <- data.table(data)
# get the column names as defined in the call to rddplot
outname <- deparse(substitute(outcome))
runname <- deparse(substitute(runvar))
treatname <- deparse(substitute(treatment))
# rename these columns with the argument names
setnames(data, old = c(outname,runname,treatname), new = c('outcome','runvar', 'treatment'))
# breaks as defined in the second example
breaks <- c(sort(-seq(0, span, by = bw)[-1]), seq(0, span, by = bw))
# the stuff you were doing before
data.span <- data[abs(runvar) <= span, ]
data.span <- data.span[ , bins := cut(runvar,
breaks,
include.lowest = TRUE, right = FALSE)]
data.span.plot <- data.span[ , list(avg.outcome = mean(outcome),
avg.runvar = mean(runvar),
treated = max(treatment),
n.iid = length(outcome)), keyby = bins]
# note: I've removed the attempt to add a `runvar` column to data.span.plot
bp <- ggplot(data = data.span.plot, aes(x = avg.runvar, y = avg.outcome))
bp <- bp + geom_point(aes(colour = n.iid))
bp <- bp + stat_smooth(data = data.span, aes(x = runvar, y = outcome,
group = treatment), ...)
bp
}
rddplot(pp.inc, has.di.rec.pp, m.dist.km2, treated, 50, 5)
Note that if you didn't convert to data.table within the function, and assumed the data argument was a data.table, then you could use on.exit() to revert the names changed by reference.
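
For context on why the original function fails: an unquoted column name passed as an argument is looked up as an ordinary object when it is evaluated, which is where the "object not found" error comes from. The deparse(substitute()) idiom used above captures the name as a string instead. Here is a minimal base-R sketch of just that idiom; col_mean is a hypothetical helper and mtcars is only used as convenient built-in data.

```r
# Hypothetical helper: takes an unquoted column name, like rddplot above
col_mean <- function(data, column) {
  # substitute() returns the expression the caller typed (e.g. mpg),
  # deparse() turns it into the string "mpg"
  colname <- deparse(substitute(column))
  mean(data[[colname]])
}

result <- col_mean(mtcars, mpg)
print(result)
```

The same string could then be passed to setnames(), as in the function above, or used with [[ ]] as in the base-R solution in the question.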

Related

When plotting a correlation matrix with pairs(), how to display long column names in many lines?

After drawing the correlation matrix in R, I try to read it in binary form in an external program.
However, each column name is too long, so in the correlation matrix drawn in pairs(), parts of both sides of the column names are cut off.
If the column name is this long, is there a way to make these column names appear in multiple lines?
And can we increase the fontsize of the column names to increase readability?
This is the sample code.
In this case, for example, I want the column
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
to look like
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
in two lines.
a <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
b <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
c <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
d <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
e <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
f <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
g <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
h <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
i <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
df <- data.frame(AAAAAAAAAAAAAAAAAAAAAAAAAAA = a,
BBBBBBBBBBBBBBBBBBBBBBBBBBB = b,
CCCCCCCCCCCCCCCCCCCCCCCCCCC = c,
DDDDDDDDDDDDDDDDDDDDDDDDDDD = d,
EEEEEEEEEEEEEEEEEEEEEEEEEEE = e,
FFFFFFFFFFFFFFFFFFFFFFFFFFF = f,
GGGGGGGGGGGGGGGGGGGGGGGGGGG = g,
HHHHHHHHHHHHHHHHHHHHHHHHHHH = h,
IIIIIIIIIIIIIIIIIIIIIIIIIII = i)
pairs(df,
lower.panel = NULL,
upper.panel = function(x, y){
points(x,y,pch=20)
r <- round(cor(x, y, use = "complete.obs"), digits=2)
txt <- paste0("R = ", r)
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
text(1, 0.95, txt, col="red", pos=2, cex=1.0)
}
)
I would begin by using the function stringi::stri_extract_all on a long label to break it down into chunks of at most ten characters
longlabel <- "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
small_chunks <- stringi::stri_extract_all(longlabel,
regex = ".{1,10}")[[1]]
After that, you can use paste to get
betterlabel <- paste(small_chunks, collapse = "\n")
And test that it works:
hist(rnorm(100), main = betterlabel)
On your example, you would need to wrap all that into an sapply to wrap the column names of df, like so:
colnames(df) <- sapply(stringi::stri_extract_all(colnames(df), regex = ".{1,10}"), paste, collapse = "\n")
Calling pairs() on df again then shows the wrapped column names.
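
As a base-R alternative that avoids the stringi dependency, the same chunking can be sketched with substring(). The helper name wrap_label and the width of 10 are assumptions for illustration. For the font size part of the question, pairs() accepts labels and cex.labels arguments, so the original column names do not have to be overwritten.

```r
# Hypothetical helper: break each label into chunks of at most `width`
# characters and join the chunks with newlines
wrap_label <- function(x, width = 10) {
  vapply(x, function(s) {
    if (nchar(s) == 0) return(s)
    starts <- seq(1, nchar(s), by = width)
    paste(substring(s, starts, starts + width - 1), collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
}

wrapped <- wrap_label(c("AAAAAAAAAAAAAAAAAAAAAAAAA", "short"))
cat(wrapped, sep = "\n---\n")

# Possible use with the data frame from the question:
# pairs(df, labels = wrap_label(colnames(df)), cex.labels = 1.2)
```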

Prp plot - Coloring positive and negative values differently

I am fitting regression trees via the function rpart(). Given my data, I am going to have both positive and negative estimates in nodes. Is there a way to color them differently?
In particular, what I would like to have is a tree whose nodes are shaded in blue for negative values and in red for positive values, where darker colors signal stronger absolute values.
I attach a minimal reproducible example.
library(rpart)
library(rpart.plot)
# Simulating data.
set.seed(1986)
X = matrix(rnorm(2000, 0, 1), nrow = 1000, ncol = 2)
epsilon = matrix(rnorm(1000, 0, 0.01), nrow = 1000)
y = X[, 1] + X[, 2] + epsilon
dta = data.frame(X, y)
# Fitting regression tree.
my.tree = rpart(y ~ X1 + X2, data = dta, method = "anova", maxdepth = 3)
# Plotting.
prp(my.tree,
type = 2,
clip.right.labs = FALSE,
extra = 101,
under = FALSE,
under.cex = 1,
fallen.leaves = TRUE,
box.palette = "BuRd",
branch = 1,
round = 0,
leaf.round = 0,
prefix = "" ,
main = "",
cex.main = 1.5,
branch.col = "gray",
branch.lwd = 3)
# Repeating, with median(y) != 0.
X = matrix(rnorm(2000, 5, 1), nrow = 1000, ncol = 2)
epsilon = matrix(rnorm(1000, 0, 0.01), nrow = 1000)
y = X[, 1] + X[, 2] + epsilon
dta = data.frame(X, y)
my.tree = rpart(y ~ X1 + X2, data = dta, method = "anova", maxdepth = 3)
# HERE I NEED HELP!
prp(my.tree,
type = 2,
clip.right.labs = FALSE,
extra = 101,
under = FALSE,
under.cex = 1,
fallen.leaves = TRUE,
box.palette = "BuRd",
branch = 1,
round = 0,
leaf.round = 0,
prefix = "" ,
main = "",
cex.main = 1.5,
branch.col = "gray",
branch.lwd = 3)
As far as I understand, the box.palette option gives me the result I need in the first setting because median(y) is close to zero.
In the second setting the result is not what I want: I get blue shades for values below median(y) and red shades for values above it. How can I impose zero as the threshold between the two colors?
To be more specific, I would like a command that automatically enforces this two-color scheme in any tree.
OK, I answered my own question. The solution is actually quite simple: if the box.palette option is a two-color diverging palette (as in my example), we can use pal.thresh to set the threshold we want. In my case:
prp(my.tree,
type = 2,
clip.right.labs = FALSE,
extra = 101,
under = FALSE,
under.cex = 1,
fallen.leaves = TRUE,
box.palette = "BuRd",
branch = 1,
round = 0,
leaf.round = 0,
prefix = "" ,
main = "",
cex.main = 1.5,
branch.col = "gray",
branch.lwd = 3,
pal.thresh = 0) # HERE THE SOLUTION!
Even if this is probably bad for me, I will leave here the answer for future users and close the question, rather than deleting it.

"For" loop with column names as index

I would like to create a loop in which the index is given by the column names of a dataframe. The idea is to select one column at a time and create a map based on the data in that column. I need i being the column name, as it identifies the name of the variable and I'll use that as part of the title of the map. However, I do not seem to be able to associate my index i to the name of the column. My code goes as follows:
# random data
x <- rep(c("AT130", "DEA1A", "DEA2C", "SE125", "SE232"), 4)
y <- c(1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0 ,1, 0, 1, 0, 1)
z <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ,0, 0, 0, 0, 0)
w <- c(0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1 ,0, 1, 0, 1, 0)
d <- as.data.frame(cbind(x,y,z,w))
colnames(d) <- c("id", "typeA", "typeB", "typeC")
for (i in colnames(d[,2:ncol(d)])) {
var_to_map <- d[,c(1,i)]
## do stuff
}
I get the following error at the first line:
Error: Can't subset columns that don't exist.
x Columns `1`, `2`, and `3` don't exist.
Run `rlang::last_error()` to see where the error occurred.
However, if I just run colnames(d[,2:ncol(d)]), it works properly
colnames(d[,2:ncol(d)])
[1] "typeA" "typeB" "typeC"
I could find a workaround by using columns numbers to make it work, but I would like to keep the column names since I am printing (10+) maps within the loop and I am using i to insert the title of the map each, as follows:
# I use geodata files from the library `Eurostat`.
geodata <- get_eurostat_geospatial(resolution = "60", nuts_level = "3", year = 2013)
for (i in colnames(d[,2:ncol(d)])) {
var_to_map <- d[,c(1,i)]
colnames(var_to_map)[1] <- "geo"
# Joining, by = "geo"
map_data <- merge(var_to_map, geodata, by=c("geo"), all.y=T, all.x=T)
## creating ranges
map_data$cat <- with(map_data, cut(value,
breaks= qu <- unique(quantile(value,
probs=c(0, 0.2, 0.5, 0.8,
0.9, 0.95, 0.99, 1),
na.rm=TRUE, include.lowest=T )),
labels=qu[-1]),include.lowest=TRUE )
# Map
print(ggplot(data=map_data) + geom_sf(aes(fill=cat), size=.1) +
scale_fill_brewer(palette = "Darkred", na.value= "grey") + aes(geometry = geometry) +
guides(fill = guide_legend(reverse=T, title = "Percentiles")) +
labs(title = paste("The name of this graph is the column name", i) ## here is where I use the index
)+
theme_minimal() + theme(legend.position=c(.8,.6)) +
coord_sf(xlim=c(-12,44), ylim=c(35,70)) +
theme( axis.text.x=element_blank(), axis.text.y=element_blank()))
}
I could also use column numbers for i and create another object with column names to refer to when pasting the title of the map, but I am wondering why the above approach fails and what I could do to make it work in that setting.
In base R you can select columns either by position or by name, but you can't mix the two in one vector: c(1, i) coerces the number to the string "1", which is then looked up as a column name. With dplyr::select you can combine names and positions in the same call.
So here are your options -
cols <- colnames(d)
for (i in cols[-1]) {
#Select columns by position
var_to_map <- d[,c(1,match(i, cols))]
#OR select column by name
var_to_map <- d[,c(cols[1],i)]
#OR select column by position and name
var_to_map <- dplyr::select(d, 1, dplyr::all_of(i))
#...rest of the code
}
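
The coercion behind the error can be seen directly in base R. This is a small sketch on a toy data frame (the values are made up; only the column layout follows the question):

```r
# Toy data frame (hypothetical values) with the column layout from the question
d <- data.frame(id = c("AT130", "DEA1A", "DEA2C"),
                typeA = c(1, 1, 0),
                typeB = c(0, 0, 1))

i <- "typeA"
idx <- c(1, i)
print(idx)   # the number has been coerced to the string "1"

# Two consistent ways to pick the id column plus column i
cols <- colnames(d)
by_name <- d[, c(cols[1], i)]
by_position <- d[, c(1, match(i, cols))]
print(by_name)
```

Since idx is a character vector, d[, idx] looks for a column literally named "1", which does not exist.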
There's a lot going on in this question, but perhaps this minimal example will help:
library(tidyverse)
# random data
d <- data.frame(x = rep(c("AT130", "DEA1A", "DEA2C", "SE125", "SE232"), 4),
y = sample(1:10, 20, replace = TRUE),
z = sample(1:10, 20, replace = TRUE),
w = sample(1:10, 20, replace = TRUE))
colnames(d) <- c("id", "typeA", "typeB", "typeC")
for (i in colnames(d[,2:ncol(d)])) {
type <- ensym(i)
p <- ggplot(d, aes(y = !!type, x = id, fill = id)) +
geom_boxplot() +
ggtitle(type)
print(p)
}

Density distributions in R

An assignment has tasked us with creating a series of variables: normal1, normal2, normal3, chiSquared1 and 2, t, and F. They are defined as follows:
library(tibble)
Normal.Frame <- data_frame(normal1 = rnorm(5000, 0, 1),
normal2 = rnorm(5000, 0, 1),
normal3 = rnorm(5000, 0, 1),
chiSquared1 = normal1^2,
chiSquared2 = normal2^2,
F = sum(chiSquared1/chiSquared2),
t = sum(normal3/sqrt(chiSquared1 )))
We then have to make histograms of the distributions for normal1, chiSquared1 and 2, t, and F, which is simple enough for normal1 and the chiSquared variables, but when I try to plot F and t, the plot space is blank.
Our lecturer recommended limiting the range of F to 0-10, and t to -5 to 5. To do this, I use:
HistT <- hist(Normal.Frame$t, xlim = c(-5, 5))
HistF <- hist(Normal.Frame$F, xlim = c(0, 10))
Like I mentioned, this yields blank plots.
Your t and F are defined as sums, so each of them is a single number (recycled across all 5000 rows). If that number is outside your range, the histogram will be empty. If you remove the sum() calls you should get the desired results.
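
A minimal base-R sketch of the corrected definitions, assuming the same construction as in the question but with element-wise ratios. The names F_stat and t_stat, the seed, and the number of breaks are my own choices (F is avoided as a name because it is also shorthand for FALSE). Subsetting to the recommended range before calling hist() is an addition on my part: both ratios have heavy tails, so a few extreme values would otherwise make the bins very wide.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
n <- 5000

normal1 <- rnorm(n, 0, 1)
normal2 <- rnorm(n, 0, 1)
normal3 <- rnorm(n, 0, 1)

chiSquared1 <- normal1^2
chiSquared2 <- normal2^2

# Element-wise ratios: one value per row, no sum()
F_stat <- chiSquared1 / chiSquared2
t_stat <- normal3 / sqrt(chiSquared1)

# Keep only the range recommended in the assignment before binning
t_in_range <- t_stat[abs(t_stat) <= 5]
F_in_range <- F_stat[F_stat <= 10]

hist(t_in_range, breaks = 50, xlim = c(-5, 5), main = "t")
hist(F_in_range, breaks = 50, xlim = c(0, 10), main = "F")
```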

Different colors of geom_point() based on subsets of dataframe

I am trying to produce a geom_violin() plot overlaid with a geom_point() plot, in which the points are colored according to the subset I have categorized the data into.
I have an error saying "Error in eval(expr, envir, enclos) : object 'ind' not found" when attempting to load the subset dataframe when I do it within the geom_point() function, but I don't understand what I am doing wrong from poking around or googling the error.
(Without that line, the code runs and generates the output I want, apart from the color coding of the points.)
Here is the nonsense dataset I used to try and make this work (gene1,2,3 are rownames). I will transpose it in the code below:
,cell_1,cell_2,cell_3,cell_4,cell_5,cell_6,cell_7,cell_8,cell_9,cell_10,cell_11,cell_12,cell_13,cell_14,cell_15,cell_16,cell_17,cell_18,cell_19,cell_20,cell_21,cell_22,cell_23,cell_24,cell_25,cell_26,cell_27,cell_28,cell_29,cell_30,cell_31,cell_32,cell_33,cell_34,cell_35,cell_36,cell_37,cell_38,cell_39,cell_40,cell_41,cell_42,cell_43,cell_44,cell_45,cell_46,cell_47,cell_48,cell_49,cell_50
gene1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.19230,0.0,0.0,0.0,0.19230,0.0,0.0,0.0,69.3915,0.0,0.0,74.123,0,0,0,0,0,13.01,0.0,0.0,0.0,0.0,0.0,0.9231,73.023,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
gene2,0.279204,23.456,13.1,10.5,0.0,14.2,151,2,50.3201,0.0,0.0,128.0,0.0,0.0,0.0,9.74082,20.9432,0.0,0.0,300.023,20.0234,0.0,0.0,300.024,123,201.345,164.681,301.421,173.023,216.537,201.234,302.102,199.234,20.234,40.234,180.0234,0.0,23.234,190.134,170.023,0.0,8.023,40.234,180.0234,0.0,23.234,190.134,170.023,21.24,8.023
gene3,25.9954,77.3398,45.3092,107.508,0.266139,70.4924,114.17,291.324,198.525,190.353,185.381,0.14223,90.323,20.4332,29.012,500.391,2.51459,300.021,60.001,192.023,60.0234,300.022,60.002,192.024,34,500.392,2.51460,300.022,60.002,192.024,60.0235,300.023,60.003,192.025,60.002,192.024,34,500.392,2.51460,300.022,60.002,192.024,60.0235,300.023,60.003,192.025,35,194.231,94.13,32.124
gene4,46.1717,194.241,0.776565,3.0325,0.762981,2.3123,14.507,13.0234,0.538315,0.0,1.5234,11.2341,0.0,1.34819,6.0142,3.2341,4.4444,150.324,0.0,20.9432,134.023,150.325,0.0,20.9433,3.2341,4.4444,150.324,0.0,20.9432,134.023,170.13408,0.0,3.2341,4.4444,150.324,0.0,3.2341,6.7023,150.324,0.0,3.2341,4.4444,170.341,0.0,20.9432,134.023,150.325,0.0,50.234,3.123
gene5,94.2341,301.234,0.0,0.0,123.371,0.0,0.0,155.234,0.0,0.664744,0.0,402.616,222.148,0.0,0.0,0.0,169.234,0.0,10.234,0.0,0.0,0.0,0.99234,0.0,0.99234,0.0,0.0,0.0,0.99234,0.0,0.99234,0.0,0.0,0.0,0.99234,0.0,10.324,0.0,0.0,15.0234,43.1243,0.0,320.023,0.0,0.0,0.0,1.234,0.0,12.123,0.0
Here's the code I wrote:
#Load dataset
df_raw <- read.table("pretend_dataset.csv",
sep=",",
header=TRUE)
#Make gene names into rownames
rownames(df_raw) <- df_raw$Name
#Remove "Name" column
df_raw$Name <- NULL
#TRANSPOSE DATASET
matrix_transp <- t(df_raw)
#Make matrix_transp matrix into dataframe
df <- as.data.frame(as.matrix(matrix_transp))
#Subset gene1 positive cells
df.positive <- subset(df, gene1 > 0)
#Convert data in data frames to log scale
df.log <- log(df+1)
df.positive.log <- log(df.positive+1)
#Violin plot for each gene with all cells (positive and negative with color coded scatter)
plot <- ggplot(stack(df.log), aes(x = ind, y = values, fill=ind)) +
geom_violin() +
geom_point(position = position_jitterdodge(jitter.width=4)) +
geom_point(data=df.positive.log, aes(x = ind, y = values, fill=ind), position = position_jitterdodge(jitter.width=4), color="red") +
xlab("Gene") + ylab("Expression level (TPM log)") +
theme_classic(base_size = 14, base_family = "Helvetica") +
theme(axis.text.y=element_text(size=14)) +
theme(axis.title.y=element_text(size=14, face="bold")) +
theme(axis.text.x=element_text(size=14)) +
theme(axis.title.x=element_text(size=14, face="bold")) +
scale_fill_brewer(palette="Pastel1")
plot + coord_cartesian(ylim = c(0, 8))
Update:
This question was asked due to a fundamental misunderstanding regarding how data needs to be formatted to efficiently plot it in R.
The data needs to be reshaped from a wide into a long format, which can be done e.g. with gather as suggested below, but also with other methods listed in this question: Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
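
To make the wide versus long distinction concrete, here is a small base-R sketch using stack() on made-up toy data (three cells, two genes; the values are hypothetical). It is not the method from the answer, just an illustration of the target shape: one row per cell and gene, plus a flag for whether that cell is positive for gene1.

```r
# Hypothetical toy data: rows are cells, columns are genes (wide format)
wide <- data.frame(gene1 = c(0, 2.5, 0),
                   gene2 = c(1, 3, 5),
                   row.names = c("cell_1", "cell_2", "cell_3"))

# stack() returns one row per value, with columns `values` and `ind`
long <- stack(wide)
names(long) <- c("value", "Gene")

# Carry the cell name and the gene1 status along for every gene
long$Cell <- rep(rownames(wide), times = ncol(wide))
long$positive <- rep(wide$gene1 > 0, times = ncol(wide))

print(long)
```

With the data in this shape, Gene can be mapped to x and fill, and positive to the point colour.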
The below answer overlays a coloured violin plot with a jittered set of points that are coloured by positive or negative.
library(dplyr); library(ggplot2); library(tidyr)
#read in data.
df2 <-read.csv(textConnection(df), header=TRUE, row.names = 1)
# Add in the rownames and gather the dataset
df3 <- df2 %>% mutate(Gene= rownames(.)) %>%
gather(., key= "cell", value="value", -Gene) %>%
mutate(positive = value>0, absolute= abs(value), logabs= log(absolute+1))
df3 %>% ggplot(. , aes(x = Gene, y=logabs, fill=Gene)) +
geom_violin() +geom_jitter( aes(colour= positive))
Is this what you were looking for?
EDIT: The read-in line takes the data you presented above, pasted into a text string, and converts that string to a data frame. If you already have the data frame, it isn't necessary. It is only used because there was no dput() object available.
EDIT 2:
This extended answer results from comments to the previous answer. The solution uses a transposed matrix of the data shown in the question. The resulting plot has violin plots, coloured by gene overlaid with points coloured by whether that observation is negative in gene1.
The exact data set is shown below and is the result of calling the dput() command on the matrix.
df <- structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0.1923, 0, 0, 0, 0.1923, 0, 0, 0, 69.3915, 0, 0, 74.123, 0, 0,
0, 0, 0, 13.01, 0, 0, 0, 0, 0, 0.9231, 73.023, 0, 0, 0, 0, 0,
0, 0, 0, 0.279204, 23.456, 13.1, 10.5, 0, 14.2, 151, 2, 50.3201,
0, 0, 128, 0, 0, 0, 9.74082, 20.9432, 0, 0, 300.023, 20.0234,
0, 0, 300.024, 123, 201.345, 164.681, 301.421, 173.023, 216.537,
201.234, 302.102, 199.234, 20.234, 40.234, 180.0234, 0, 23.234,
190.134, 170.023, 0, 8.023, 40.234, 180.0234, 0, 23.234, 190.134,
170.023, 21.24, 8.023, 25.9954, 77.3398, 45.3092, 107.508, 0.266139,
70.4924, 114.17, 291.324, 198.525, 190.353, 185.381, 0.14223,
90.323, 20.4332, 29.012, 500.391, 2.51459, 300.021, 60.001, 192.023,
60.0234, 300.022, 60.002, 192.024, 34, 500.392, 2.5146, 300.022,
60.002, 192.024, 60.0235, 300.023, 60.003, 192.025, 60.002, 192.024,
34, 500.392, 2.5146, 300.022, 60.002, 192.024, 60.0235, 300.023,
60.003, 192.025, 35, 194.231, 94.13, 32.124, 46.1717, 194.241,
0.776565, 3.0325, 0.762981, 2.3123, 14.507, 13.0234, 0.538315,
0, 1.5234, 11.2341, 0, 1.34819, 6.0142, 3.2341, 4.4444, 150.324,
0, 20.9432, 134.023, 150.325, 0, 20.9433, 3.2341, 4.4444, 150.324,
0, 20.9432, 134.023, 170.13408, 0, 3.2341, 4.4444, 150.324, 0,
3.2341, 6.7023, 150.324, 0, 3.2341, 4.4444, 170.341, 0, 20.9432,
134.023, 150.325, 0, 50.234, 3.123), .Dim = c(50L, 4L), .Dimnames = list(
c("cell_1", "cell_2", "cell_3", "cell_4", "cell_5", "cell_6",
"cell_7", "cell_8", "cell_9", "cell_10", "cell_11", "cell_12",
"cell_13", "cell_14", "cell_15", "cell_16", "cell_17", "cell_18",
"cell_19", "cell_20", "cell_21", "cell_22", "cell_23", "cell_24",
"cell_25", "cell_26", "cell_27", "cell_28", "cell_29", "cell_30",
"cell_31", "cell_32", "cell_33", "cell_34", "cell_35", "cell_36",
"cell_37", "cell_38", "cell_39", "cell_40", "cell_41", "cell_42",
"cell_43", "cell_44", "cell_45", "cell_46", "cell_47", "cell_48",
"cell_49", "cell_50"), c("gene1", "gene2", "gene3", "gene4"
)))
The code required to turn the above data set into the plot requested is shown below.
df2 <- df %>% as.data.frame %>% mutate(Cell= rownames(.), positive = gene1>0) %>%
gather(., key= "Gene", value="value", -Cell,-positive) %>%
mutate( absolute= abs(value), logabs= log(absolute+1))
df2 %>% ggplot(. , aes(x = Gene, y=logabs, fill=Gene)) +
geom_violin() +geom_jitter( aes(colour= positive))
As the plot might be difficult to interpret, here are two additional ways of displaying the status relative to gene1.
df2 %>% ggplot(., aes(x=Gene, y=logabs, fill=positive)) +geom_boxplot()
df2 %>% ggplot(. , aes(x = Gene, y=logabs, fill=positive)) +
geom_violin()
