Plotting confidence intervals in ggplot - r

I'd like to do the following plot using ggplot:
Here is an example of the structure of my df (roughly; the drawing is not to scale with the data):
example.df = data.frame(mean = c(0.3,0.8,0.4,0.65,0.28,0.91,0.35,0.61,0.32,0.94,0.1,0.9,0.13,0.85,0.7,1.3),
std.dev = c(0.01,0.03,0.023,0.031,0.01,0.012,0.015,0.021,0.21,0.13,0.023,0.051,0.07,0.012,0.025,0.058),
class = c("1","2","1","2","1","2","1","2","1","2","1","2","1","2","1","2"),
group = c("group1","group2","group1","group2","group1","group2","group1","group2","group1","group2","group1","group2","group1","group2","group1","group2"))
This data frame consists of 16 replicates, each with a given mean and a given standard deviation.
For each replicate I'd like to plot the confidence intervals, where the big dot in my figure example is the mean estimate, and the length of the bar is twice the standard deviation.
Also I'd like to plot two different replicates in the same line but with different coloring, coloring it by class, red is class 1 and blue is class 2.
Finally, I'd like to divide the whole plot into two panels (in the same row) corresponding to the two different groups.
I tried looking into this site, http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/ but couldn't figure out how to automate this for any data frame of this structure, with X number of groups (in this case 2), and K replicates per group (in this case 8, 4 of class 1 and 4 of class 2).
Is there a good way to do this using ggplot or standard R packages?

I suspect the sample data frame you provided isn't built the way you intended, because every value in group1 has class 1 and every value in group2 has class 2. So I made a new data frame and added a column named replicate that gives the replicate number (four replicates, each with two class values, in each group).
example.df = data.frame(mean = c(0.3,0.8,0.4,0.65,0.28,0.91,0.35,0.61,0.32,0.94,0.1,
                                 0.9,0.13,0.85,0.7,1.3),
                        std.dev = c(0.01,0.03,0.023,0.031,0.01,0.012,0.015,0.021,0.21,
                                    0.13,0.023,0.051,0.07,0.012,0.025,0.058),
                        class = c("1","2","1","2","1","2","1","2","1","2","1",
                                  "2","1","2","1","2"),
                        group = rep(c("group1","group2"), each = 8),
                        replicate = rep(rep(1:4, each = 2), times = 2))
Now you can use geom_pointrange() to draw the points with their intervals and facet_wrap() to make one panel per group.
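library(ggplot2)
# Point = mean, bar = mean +/- 2 standard deviations, one panel per group
ggplot(example.df, aes(x = factor(replicate), y = mean,
                       ymin = mean - 2*std.dev, ymax = mean + 2*std.dev,
                       color = factor(class))) +
  geom_pointrange() +
  facet_wrap(~group)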
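To get the colours asked for in the question (red for class 1, blue for class 2) and to keep the two classes of one replicate from sitting on top of each other, the same call could be wrapped in a small helper. This is only a sketch: plot_ci and demo.df are made-up names, and the helper assumes a data frame with the columns mean, std.dev, class, group and replicate used above.

```r
library(ggplot2)

# Hypothetical helper (not from the original answer); assumes the columns
# mean, std.dev, class, group and replicate. k is the number of standard
# deviations drawn on each side of the mean.
plot_ci <- function(df, k = 2) {
  ggplot(df, aes(x = factor(replicate), y = mean,
                 ymin = mean - k * std.dev, ymax = mean + k * std.dev,
                 color = factor(class))) +
    geom_pointrange(position = position_dodge(width = 0.5)) +
    scale_color_manual(values = c("1" = "red", "2" = "blue"), name = "class") +
    facet_wrap(~group, nrow = 1) +
    labs(x = "replicate")
}

# Small made-up data set with the same structure
demo.df <- data.frame(mean = c(0.3, 0.8, 0.4, 0.65, 0.32, 0.94, 0.1, 0.9),
                      std.dev = c(0.01, 0.03, 0.023, 0.031, 0.21, 0.13, 0.023, 0.051),
                      class = rep(c("1", "2"), times = 4),
                      group = rep(c("group1", "group2"), each = 4),
                      replicate = rep(rep(1:2, each = 2), times = 2))
p <- plot_ci(demo.df)
```

Printing p draws the plot. Because the panels come from facet_wrap(~group), the helper should work for any number of groups and replicates, as long as the column names match.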

Related

Heat scatter plot with non-numerical output

Image of excel data set
I have a table in excel with 100 columns and 100 rows. The column starts at 0% and works up to 100%. Same for the row, starts at 0% and goes up to 100%. It is a 2-way sensitivity analysis, i.e. which drug would be optimal if x(variable in column)=10% and y(variable in row)=30%.
I have a 100 by 100 table, with the names of four different drugs scattered across it. I want to take this data into R and create a scatter plot, essentially a square made of 10000 smaller squares. I then want R to colour each square by the drug that is optimal for that combination of X and Y.
I've attached an image of dummy data showing the same example in a 10 by 10 table.
Hope you can help!
You'll need to start by prepping the data: read it in, use something like the pivot_longer() function from tidyr (part of the tidyverse) to turn the columns into rows, and then likely do some clean-up on the percentages.
After that, the plot (using ggplot2) itself may be pretty straightforward. The geom_tile() function is the one that creates the squares.
library(tidyverse)
# Create test data
df <- expand_grid(x = 1:100, y = 1:100) %>%
  mutate(drug = sample(LETTERS[1:4], size = 10000, replace = TRUE))
# Make the plot
df %>%
  ggplot(aes(x, y, fill = drug)) +
  geom_tile()
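The prepping step might look roughly like the sketch below. The 3 by 3 table wide is made up to stand in for the Excel sheet, and it assumes the first column holds the row percentage and the remaining column names hold the column percentage; the real sheet may well be laid out differently.

```r
library(tidyr)
library(ggplot2)

# Made-up 3 x 3 stand-in for the Excel sheet: the first column holds the
# row percentage (y), the other column names hold the column percentage (x)
wide <- data.frame(y = c("0%", "50%", "100%"),
                   `0%` = c("A", "A", "B"),
                   `50%` = c("A", "C", "B"),
                   `100%` = c("D", "C", "B"),
                   check.names = FALSE)

# One row per (x, y) cell instead of one column per x value
long <- pivot_longer(wide, cols = -y, names_to = "x", values_to = "drug")

# Strip the percent signs so both axes are numeric
long$x <- as.numeric(sub("%", "", long$x))
long$y <- as.numeric(sub("%", "", long$y))

p <- ggplot(long, aes(x, y, fill = drug)) +
  geom_tile()
```

Since drug is a character column, ggplot2 treats it as discrete and gives each drug its own fill colour.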

Creating multiple density plots using only summary statistics (no raw data) in R

I work with a massive 4D nifti file (x - y - z - subject; MRI data) and due to the size I can't convert to a csv file and open in R. I would like to get a series of overlaying density plots (classic example here) one for each subject with the idea to just visualise that there is not much variance in density distributions across the sample.
I could however, extract summary statistics for each subject (mean, median, SD, range etc. of the variable of interest) and use these to create the density plots (at least for the variables that are normally distributed). Something like this would be fantastic but I am not sure how to do it for density plots.
Your help will be much appreciated.
So these really aren't density plots per se - they are plots of densities of normal distributions with given means and standard deviations.
That can be done in ggplot2, but you need to expand your table of subjects and summaries into grids of points and normal densities at those points.
Here's an example. First, make up some data, consisting of subject IDs and some simulated sample averages and sample standard deviations.
library(tidyverse)
set.seed(1)
# tibble() replaces the deprecated data_frame()
foo <- tibble(Subject = LETTERS[1:10], avg = runif(10, 10, 20), stdev = runif(10, 1, 2))
Now, for each subject we need to obtain a suitable grid of "x" values along with the normal density (for that subject's avg and stdev) evaluated at those "x" values. I've chosen plus/minus 4 standard deviations. This can be done using do. But that produces a funny data frame with a column consisting of data frames. I use unnest to explode out the data frame.
bar <- foo %>%
  group_by(Subject) %>%
  do(densities = tibble(x = seq(.$avg - 4 * .$stdev, .$avg + 4 * .$stdev, length.out = 50),
                        density = dnorm(x, .$avg, .$stdev))) %>%
  unnest(densities)
Have a look at bar to see what happened. Now we can use ggplot2 to put all these normal densities on the same plot. I'm guessing with lots of subjects you wouldn't want a legend for the plot.
bar %>%
ggplot(aes(x=x, y=density, color=Subject)) +
geom_line(show.legend = FALSE)
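If do() and unnest() feel opaque, the same expansion can be sketched in base R. This is an alternative, not the approach used above: it loops over the rows of the summary table and stacks one small data frame of x values and normal densities per subject.

```r
set.seed(1)
# Same kind of made-up summaries as above, as a plain data frame
foo <- data.frame(Subject = LETTERS[1:10],
                  avg = runif(10, 10, 20),
                  stdev = runif(10, 1, 2))

# One small data frame of (x, density) per subject, then stack them
bar <- do.call(rbind, lapply(seq_len(nrow(foo)), function(i) {
  x <- seq(foo$avg[i] - 4 * foo$stdev[i], foo$avg[i] + 4 * foo$stdev[i],
           length.out = 50)
  data.frame(Subject = foo$Subject[i], x = x,
             density = dnorm(x, foo$avg[i], foo$stdev[i]))
}))
```

The resulting bar has 50 rows per subject and the columns Subject, x and density, so it can go into the same ggplot call.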

Graphing 3 Variable Scatterplot R

I imported some data from Islander and am trying to graph something with 3 variables. I'm thinking of trying to graph 2 numeric variables with a nominal category (gender). The plot I'm trying to do therefore is a regular scatterplot, but color-coded.
I looked at this starter tutorial on R: Scatterplots, but didn't see any mention of 3 variable plotting.
http://www.laptopmag.com/articles/ssd-upgrade-tutorial
Can anyone help me out? My variables hold values pertaining to number of balls bounced, minutes of physical activity per week, and gender.
Picture of the data:
Data
Since gender is a binary variable (usually, otherwise ternary), I would plot a 2D scatterplot with color encoding the gender.
Dummy data:
a = data.frame(x=runif(100), y = runif(100)+2, group = round(runif(100))+1 )
Now I would plot y against x using a$group to select the color:
plot(a$x, a$y, pch = 16, col = c('cornflowerblue', 'springgreen')[a$group])
Output:
If you do have missingness I would add a third group to the color vector.
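A sketch of that third group, assuming missing genders are stored as NA. The two NA positions and the grey colour are arbitrary choices for illustration.

```r
set.seed(42)
a <- data.frame(x = runif(100), y = runif(100) + 2,
                group = round(runif(100)) + 1)
a$group[c(3, 17)] <- NA   # pretend two genders are missing

# Map NA to a third index so those points pick up the third color
idx <- ifelse(is.na(a$group), 3, a$group)
palette3 <- c('cornflowerblue', 'springgreen', 'grey60')
cols <- palette3[idx]

plot(a$x, a$y, pch = 16, col = cols, xlab = "x", ylab = "y")
legend("topright", legend = c("group 1", "group 2", "missing"),
       pch = 16, col = palette3)
```

Without the remapping, indexing the color vector with NA yields an NA color and those points are silently not drawn.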
Here is a bunch of other solutions for 2D scatter with color

R - Creating a clustered barplot with two datasets

I'd like to do the following in R.
I have a group of individuals (1 - 50) with two datasets each. Each dataset (A & B) has values that can be in two categories (Gains, shown in blue; Losses, shown in red). I'd like to show those two datasets together, as below. The frequency of Gains/Losses would be in the y axis, where Dataset A would go upward from the x axis, and Dataset B would go downward from the x axis. I'd like to be able to cluster the barplot either by Individual (as shown below) OR by Gains or Losses (All gains together, then all losses together).
I know how to make clustered barplots in ggplot, but can't figure out how to combine the two datasets as in my image (with dataset A going up and dataset B going down).
We can do something similar to age pyramids, only without flipping the coordinates
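library(ggplot2)
testA <- data.frame(v = as.factor(sample(1:2, 1000, replace = TRUE, prob = c(1, 5))), dataset = 'A')
testB <- data.frame(v = as.factor(sample(1:2, 1000, replace = TRUE, prob = c(5, 1))), dataset = 'B')
both <- rbind(testA, testB)
# The subset= argument to geom_bar() was removed in ggplot2 2.0.0, so each
# layer gets its own data; after_stat(count) replaces ..count.. (ggplot2 >= 3.3.0)
ggplot(both, aes(x = v, fill = v)) +
  geom_bar(data = subset(both, dataset == "A")) +
  geom_bar(data = subset(both, dataset == "B"), aes(y = -after_stat(count)))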
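To cluster by individual, as the question asks, one option is to count first and flip the sign of the dataset B counts before plotting. The sketch below uses made-up counts for 5 individuals; dat, n and signed_n are illustrative names, not anything from the question.

```r
library(ggplot2)

set.seed(1)
# Made-up counts: 5 individuals, gains/losses in datasets A and B
dat <- expand.grid(individual = factor(1:5),
                   type = c("Gain", "Loss"),
                   dataset = c("A", "B"))
dat$n <- rpois(nrow(dat), lambda = 20)

# Dataset B counts are negated so its bars point downward
dat$signed_n <- ifelse(dat$dataset == "A", dat$n, -dat$n)

p <- ggplot(dat, aes(x = individual, y = signed_n, fill = type)) +
  geom_col(position = "dodge") +
  geom_hline(yintercept = 0) +
  scale_fill_manual(values = c(Gain = "blue", Loss = "red")) +
  labs(y = "frequency (A up, B down)")
```

Swapping the roles of individual and type in aes() (x = type, with individual as the grouping variable) would cluster all gains together and all losses together instead.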

Plotting gene expression data with means in a randomized experiment

I'm (a newbie to R) analyzing a randomized study on the effect of two treatments on gene expression. We evaluated 5 different genes at baseline and after 1 year. The gene fold is calculated as the value at 1 year divided by the baseline value.
Example gene:
IL10_BL
IL10_1Y
IL10_fold
Gene expression is measured as a continuous variable, typically ranging from 0.1 to 5.0.
100 patients have been randomized to either a statin or diet regime.
I would like to do the following plot:
- Y axis should display the mean gene expression with 95% confidence limit
- X axis should be categorical, with the baseline, 1 year and fold value for each of the 5 genes, grouped by treatment. So, 5 genes with 3 values for each gene in two groups would mean 30 categories on the X axis. It would be really nice if the dots for the same gene were connected with a line.
I have tried to do this myself (using ggplot2) without any success. I've tried to do it directly from the crude data, which looks like this (first 6 observations and 2 different genes):
genes <- read.table(header=TRUE, sep=";", text =
"treatment;IL10_BL;IL10_1Y;IL10_fold;IL6_BL;IL6_1Y;IL6_fold;
diet;1.1;1.5;1.4;1.4;1.4;1.1;
statin;2.5;3.3;1.3;2.7;3.1;1.1;
statin;3.2;4.0;1.3;1.5;1.6;1.1;
diet;3.8;4.4;1.2;3.0;2.9;0.9;
statin;1.1;3.1;2.8;1.0;1.0;1.0;
diet;3.0;6.0;2.0;2.0;1.0;0.5;")
I would greatly appreciate any help (or link to a similar thread) to do this.
First, you need to melt your data into a long format, so that one column (your X column) contains a categorical variable indicating whether an observation is BL, 1Y, or fold.
(your command creates an empty column you might need to get rid of first: genes$X = NULL)
library(reshape2)
genes.long = melt(genes, id.vars='treatment', value.name='expression')
Then you need the gene and measurement (baseline, 1-year, fold) in different columns (from this question).
genes.long$gene = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 1))
genes.long$measurement = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 2))
And put the measurement in the order that you expect:
genes.long$measurement = factor(genes.long$measurement, levels=c('BL', '1Y', 'fold'))
Then you can plot using stat_summary() calls for the mean and confidence intervals. Use facets to separate the groups (treatment and gene combinations).
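library(ggplot2)
# fun replaces the deprecated fun.y (ggplot2 >= 3.3.0);
# mean_cl_boot needs the Hmisc package to be installed
ggplot(genes.long, aes(measurement, expression)) +
  stat_summary(fun = mean, geom='point') +
  stat_summary(fun.data = 'mean_cl_boot', geom='errorbar', width=.25) +
  facet_grid(.~treatment+gene)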
You can reverse the order to facet_grid(.~gene+treatment) if you want the top level to be gene instead of treatment.
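The question also asks for the dots of the same gene to be connected. A sketch of one way to do that is an extra stat_summary() layer with geom = 'line' and a constant group, so the means within each facet are joined. The data demo.long is made up, and mean_ci is a hypothetical helper that computes a t-based 95% interval so the example runs without Hmisc; it is a stand-in for mean_cl_boot, not the same method.

```r
library(ggplot2)

# Made-up long-format data with the same columns as genes.long
set.seed(1)
demo.long <- expand.grid(id = 1:10,
                         measurement = factor(c('BL', '1Y', 'fold'),
                                              levels = c('BL', '1Y', 'fold')),
                         gene = c('IL10', 'IL6'),
                         treatment = c('diet', 'statin'))
demo.long$expression <- runif(nrow(demo.long), 0.1, 5)

# Hypothetical helper: t-based 95% interval for the mean
mean_ci <- function(x) {
  m <- mean(x)
  half <- qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
  data.frame(y = m, ymin = m - half, ymax = m + half)
}

p <- ggplot(demo.long, aes(measurement, expression)) +
  stat_summary(fun = mean, geom = 'line', aes(group = 1)) +  # joins the means
  stat_summary(fun = mean, geom = 'point') +
  stat_summary(fun.data = mean_ci, geom = 'errorbar', width = .25) +
  facet_grid(. ~ treatment + gene)
```

Whether a line from the 1-year value to the fold value is meaningful is a separate question, since fold is on a different scale from the two expression measurements.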
