Grouped bar chart with ggplot2 and already tabulated data - r

I fit a count model to a vector of actual data and would now like to plot the actual and the predicted as a grouped (dodged) bar chart. Since this is a count model, the data are discrete (X=x from 0 to 317). Since I am fitting a model, I only have already-tabulated data for the predicted values.
Here is how my original data frame looks:
actual predicted
1 3236 3570.4995
2 1968 1137.1202
3 707 641.8186
4 302 414.8763
5 185 285.1854
6 104 203.0502
I transformed the data to be plotted with ggplot2:
melted.data <- melt(plot.data)
melted.data$realization <- c(rep(0:317, times=2))
colnames(melted.data)=c('origin','count','realization')
So that my data frame now looks like this:
head(melted.data)
origin count realization
1 actual 3236 0
2 actual 1968 1
3 actual 707 2
4 actual 302 3
5 actual 185 4
6 actual 104 5
> tail(melted.data)
origin count realization
631 predicted 1.564673e-27 312
632 predicted 1.265509e-27 313
633 predicted 1.023552e-27 314
634 predicted 8.278601e-28 315
635 predicted 6.695866e-28 316
636 predicted 5.415757e-28 317
When I try to graph it (again, I'd like to have the actual and predicted count --which is already tabulated in the data-- by discrete realization), I give this command:
ggplot(melted.data, stat="identity", aes(x=realization, fill=origin)) + geom_bar(position="dodge")
Yet it seems like the stat parameter is not liked by ggplot2, as I don't get the correct bar height (which would be those of the variable "count").
Any ideas?
Thanks,
Roberto.

With stat_identity you need to map the bar height to y in the aes mapping (here the count column). Note also that stat is an argument of the layer, not of ggplot(). Try either of the following:
ggplot(melted.data, aes(x=realization, y=count, fill=origin)) +
stat_identity(position="dodge", geom="bar")
or
ggplot(melted.data, aes(x=realization, y=count, fill=origin)) +
geom_bar(position="dodge", stat="identity")
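For anyone who wants to try this without the original data, here is a minimal self-contained sketch. It uses only the six rows shown in the question (the real data has 318), reshapes with base R's stack() instead of melt(), and uses geom_col(), which is geom_bar() with stat = "identity" built in (available in ggplot2 2.2.0 and later). The name long is made up for this illustration.

```r
library(ggplot2)

# First six rows of the table shown in the question
plot.data <- data.frame(
  actual    = c(3236, 1968, 707, 302, 185, 104),
  predicted = c(3570.4995, 1137.1202, 641.8186, 414.8763, 285.1854, 203.0502)
)

# Reshape to long format with base R: stack() returns columns `values` and `ind`
long <- stack(plot.data)
names(long) <- c("count", "origin")

# Realizations start at 0 and repeat once per original column
long$realization <- rep(seq_len(nrow(plot.data)) - 1, times = ncol(plot.data))

# geom_col() takes the bar heights directly from the y aesthetic
p <- ggplot(long, aes(x = realization, y = count, fill = origin)) +
  geom_col(position = "dodge")
p
```

The important part is the same as above: the already tabulated counts go into y, and the layer is told not to count rows itself.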

Related

How to merge estimation and projection graphs in one plot?

I found this graph of the estimated number of diabetes cases, together with the future projections made at each estimation point (roughly every two years from 2000). The graph is factually incorrect because the points on the lines do not match the scale on the left. I am trying to replot it with ggplot2 or ggplotly as two lines in a single plot: one for the estimates over the past years, and one for the projections made in those years for the following 20-25 years. Any help is highly appreciated.
Here is the data that was used to plot the graph. Estimated numbers are shown in blue against the estimation year, and projected numbers in red against the projection year. Since some projection years have several projected numbers, I intend to keep only the highest one on the line.
EstimationYear  Estimates (in millions)  Projections (in millions)  Projection Year
2000            151                      333                        2025
2003            194                      380                        2025
2006            246                      438                        2030
2009            285                      552                        2030
2011            366                      578                        2030
2013            382                      592                        2035
2015            415                      642                        2040
2017            425                      629                        2045
2019            463                      700                        2045
Your question is more about the data wrangling than the actual plotting with ggplot. Once you have the data in the right shape, the plotting command is just a few lines.
1. Prepare the data for the estimation (blue) points and add a column type set to "estimation".
2. Prepare the data for the projected (red) points and add a column type set to "projection".
3. Use bind_rows to combine both tables.
4. In the aesthetics of ggplot, use color=type.
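The steps above could be sketched like this with dplyr. The column names est_year, estimate, projection and proj_year are invented for this example, and keeping only the highest value per projection year is one reading of what the question asks for.

```r
library(dplyr)
library(ggplot2)

# Data copied from the question; column names are made up for this sketch
raw <- data.frame(
  est_year   = c(2000, 2003, 2006, 2009, 2011, 2013, 2015, 2017, 2019),
  estimate   = c(151, 194, 246, 285, 366, 382, 415, 425, 463),
  projection = c(333, 380, 438, 552, 578, 592, 642, 629, 700),
  proj_year  = c(2025, 2025, 2030, 2030, 2030, 2035, 2040, 2045, 2045)
)

# Step 1: estimation points
estimates <- raw %>%
  transmute(year = est_year, value = estimate, type = "estimation")

# Step 2: projection points, keeping the highest value per projection year
projections <- raw %>%
  group_by(year = proj_year) %>%
  summarise(value = max(projection)) %>%
  mutate(type = "projection")

# Step 3: stack both tables
combined <- bind_rows(estimates, projections)

# Step 4: colour by type
p <- ggplot(combined, aes(year, value, colour = type)) +
  geom_line() +
  geom_point() +
  labs(y = "Millions")
p
```

Unlike the base R version further down, this drops the lower projections entirely instead of still showing them as points.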
Here is a start on recreating the plot from the data. I haven't put any effort into recreating the balloons, choosing a more elegant theme, and that kind of thing.
library(ggplot2)
txt <- "2000 151 333 2025
2003 194 380 2025
2006 246 438 2030
2009 285 552 2030
2011 366 578 2030
2013 382 592 2035
2015 415 642 2040
2017 425 629 2045
2019 463 700 2045"
df <- read.table(text = txt)
# Putting years and values in the same columns
# Probably some tidyverse function can do this more elegantly
df <- rbind(cbind(unname(df[1:2]), type = "Estimates"),
cbind(unname(df[4:3]), type = "Projection"))
colnames(df) <- c("year", "value", "type")
# We're reordering on value, because the red line does not touch year-duplicates
df <- df[order(df$value, decreasing = TRUE),]
ggplot(df, aes(year, value, colour = type)) +
# Formula notation to filter out data for the line
geom_line(data = ~ subset(., !duplicated(year))) +
geom_point() +
scale_colour_manual(
values = c("Estimates" = "dodgerblue",
"Projection" = "tomato")
) +
scale_y_continuous(limits = c(0, NA),
name = "Millions")
Created on 2021-01-06 by the reprex package (v0.3.0)

R - coerce geom_density() in ggplot2 to accept a df column as the frequency (y-variable)

I am attempting to make a smoothed histogram using geom_density in ggplot2. The problem is, technically what I am making is not exactly a histogram, so I am running into trouble. Specifically, along the x-axis of the desired plot is genomic position, but the values can start and end anywhere. Moreover, the y-axis is not exactly counts, but rather is contents of a numeric vector in the df, locus_df_trim$V7, which for my data is an intensity value ranging between 0 and 266.
The "bins" corresponding to each observation may be different numbers of base pairs long, so there are no uniform bin sizes, and there also may be breaks between bins.
I have not been able to get ggplot to accept anything resembling locus_df_trim$V7 as a y-value. Also, if I set the y-value to ..scaled.., the result is not correct, because the intensity values are not proportional to the number of observations (which is what drives the ..scaled.. and ..count.. variables if I specify them).
So, at this point my only idea is to recreate the dataframe itself so that there are uniform bin sizes, and each observation for each cell type that does not have an intensity receives a 0 in the df. However, I am wondering if there is a way to produce the desired plot using a df of the current form, which is:
> head(locus_df_trim)
V2 V3 V5 V7 annot_width
1 CD4+_CD25-_IL17-_PMA- H3K27ac 204738970 103 1042
2 CD4+_CD25-_IL17-_PMA- H3K27ac 204738517 40 250
3 CD4+_CD25-_IL17-_PMA- H3K36me3 204738136 158 515
4 CD4+_CD25-_IL17-_PMA- H3K36me3 204738702 104 709
5 CD4+_CD25-_IL17+ H3K4me1 204738665 226 1246
6 CD4+_CD25-_IL17+ H3K4me1 204741441 73 397
...
43 Tmem_Primary_Cells H3K27ac 204738908 34 390
44 Tmem_Primary_Cells H3K27me3 204738382 28 194
45 Tmem_Primary_Cells H3K4me1 204738766 124 424
46 Tmem_Primary_Cells H3K4me1 204741433 48 423
47 Tmem_Primary_Cells H3K4me1 204739411 40 215
48 Tmem_Primary_Cells H3K4me1 204737304 33 210
I am able to produce a plot that is close to what is desired, but trying to specify the y-value to be proportional to V7 (as below) results in the error:
Error in eval(substitute(list(...)), `_data`, parent.frame()) :
object 'y' not found
To be clear, I want to make a smoothed histogram with the following features:
the bin for a given observation begins at the x-coordinate found in one column and ends at the value found in another column
each cell type (in column locus_df_trim$V2 below) has a different color
the height of each peak (in the y-direction) is proportional to the value in a column of the df (locus_df_trim$V7).
So far I have the following code:
library("ggplot2")
base_plot<-ggplot(data=locus_df_trim, aes(x=V5, y=V7, fill=V2, size=V7)) + geom_density(alpha=0.3, adjust=0.4, kernel="gaussian")
base_plot<-base_plot + geom_vline(xintercept = 204738919, size = 1, colour = "#FF3721", linetype = "dashed")
base_plot<-base_plot + theme_classic(); base_plot
By contrast, the following code does not error and produces a plot similar to what is desired, but it suffers from miscalibrated bin placement and incorrect peak heights in the y-direction:
base_plot<-ggplot(data=locus_df_trim, aes(x=V5, fill=V2)) + geom_density(alpha=0.3, adjust=0.4, kernel="gaussian")
base_plot<-base_plot + geom_vline(xintercept = 204738919, size = 1, colour = "#FF3721", linetype = "dashed")
base_plot<-base_plot + theme_classic(); base_plot
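One possible way to get the bin placement and heights described above, without resampling to uniform bins, is to draw each observation as a rectangle. This is only a sketch and it is not smoothed. It assumes that a bin starts at V5 and ends at V5 + annot_width, which the question does not state explicitly, and the column bin_end is made up for this example. The data are four rows copied from the output shown above.

```r
library(ggplot2)

# Four rows copied from the head(locus_df_trim) output in the question
locus_df_trim <- data.frame(
  V2 = c("CD4+_CD25-_IL17-_PMA-", "CD4+_CD25-_IL17-_PMA-",
         "CD4+_CD25-_IL17+", "CD4+_CD25-_IL17+"),
  V3 = c("H3K27ac", "H3K36me3", "H3K4me1", "H3K4me1"),
  V5 = c(204738970, 204738136, 204738665, 204741441),
  V7 = c(103, 158, 226, 73),
  annot_width = c(1042, 515, 1246, 397)
)

# ASSUMPTION: each bin runs from V5 to V5 + annot_width
locus_df_trim$bin_end <- locus_df_trim$V5 + locus_df_trim$annot_width

# One rectangle per observation: the x-range is the bin, the height is V7
base_plot <- ggplot(locus_df_trim) +
  geom_rect(aes(xmin = V5, xmax = bin_end, ymin = 0, ymax = V7, fill = V2),
            alpha = 0.3) +
  geom_vline(xintercept = 204738919, colour = "#FF3721", linetype = "dashed") +
  theme_classic()
base_plot
```

Because geom_rect takes its height from ymax, the intensity in V7 is used as-is rather than being derived from the number of observations.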

ggplot multiple lines in same graph

I am trying to plot multiple gene expressions over time in the same graph to demonstrate a similar profile and then add a line to illustrate the mean of total for each timepoint (like the figure 4b in recent Nature comm article https://www.nature.com/articles/s41467-017-02546-5/figures/4). My data has been normalised to be around 0 so they are all on the same scale.
df2 sample:
variable value gene
1 5 -0.610384193 1
2 5 -6.25967087 2
3 5 -3.773389731 3
50 6 -0.358879035 1
51 6 -6.066341017 2
52 6 -4.202998579 3
99 7 -0.103885903 1
100 7 -6.648844687 2
101 7 -5.041554127 3
I plot the expression levels with ggplot2:
plotC <- ggplot(df2, aes(x=variable, y=value, group=factor(gene), colour=gene)) + geom_line(size=0.5, aes(color=gene), alpha=0.4)
But adding the mean line in red to this plot is proving difficult. I calculated the means and put them in another dataframe:
means
value variable gene
1 -1.5037354 5 50
2 -0.8783492 6 50
3 -0.7769085 7 50
Then tried adding them as another layer:
plotC + geom_line(data=means, aes(x=variable, y=value, color="red", group=factor(gene)), size=0.75)
But I get the error: Error: Discrete value supplied to continuous scale
Do you have any suggestions as to how I can plot this mean on the same graph in another color?
Thank you,
Anna
edit: the answer by RG20 is helpful, thanks for pointing out I had the color in the wrong place. However it plots the line outside the rest of the graph... I really don't understand what's wrong with my graph...
plotC + geom_line(data=means, aes(x=variable, y=value, group=factor(gene)), color='red',size=0.75)
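As a hedged sketch of one way to overlay the mean, the example below uses only the nine sample rows shown above, so its means differ from the ones in the question. It computes the means with base R's aggregate(), which keeps variable as the same numeric type in both data frames, and sets the red colour outside aes() so that it is not treated as a mapped value.

```r
library(ggplot2)

# Sample rows copied from the question
df2 <- data.frame(
  variable = rep(c(5, 6, 7), each = 3),
  value = c(-0.610384193, -6.25967087, -3.773389731,
            -0.358879035, -6.066341017, -4.202998579,
            -0.103885903, -6.648844687, -5.041554127),
  gene = rep(1:3, times = 3)
)

# Mean per timepoint; `variable` stays numeric, exactly as in df2
means <- aggregate(value ~ variable, data = df2, FUN = mean)

# Per-gene lines use a discrete colour; the mean line gets a fixed colour
plotC <- ggplot(df2, aes(x = variable, y = value)) +
  geom_line(aes(group = gene, colour = factor(gene)), alpha = 0.4) +
  geom_line(data = means, colour = "red")
plotC
```

If the mean line still lands outside the other lines with the real data, it may be worth comparing str(df2) and str(means) to check that variable has the same type and range in both.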

Creating a Bar Plot with Proportions on ggplot

I'm trying to create a bar graph on ggplot that has proportions rather than counts, and I have c+geom_bar(aes(y=(..count..)/sum(..count..)*100)) but I'm not sure what either of the counts refer to. I tried putting in the data but it didn't seem to work. What should I input here?
This is the data I'm using
> describe(topprob1)
topprob1
n missing unique Info Mean
500 0 9 0.93 3.908
1 2 3 4 5 6 7 8 9
Frequency 128 105 9 15 13 172 39 12 7
% 26 21 2 3 3 34 8 2 1
You haven't provided a reproducible example, so here's an illustration with the built-in mtcars data frame. Compare the following two plots. The first gives counts. The second gives proportions, which are displayed in this case as percentages. ..count.. is an internal variable that ggplot creates to store the count values.
library(ggplot2)
library(scales)
ggplot(mtcars, aes(am)) +
geom_bar()
ggplot(mtcars, aes(am)) +
geom_bar(aes(y=..count../sum(..count..))) +
scale_y_continuous(labels=percent_format())
You can also use the ..prop.. computed variable together with a group aesthetic:
library(ggplot2)
library(scales)
ggplot(mtcars, aes(am)) +
geom_bar(aes(y=..prop.., group = 1)) +
scale_y_continuous(labels=percent_format())
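If the computed variables feel opaque, another option is to tabulate the percentages yourself and plot them directly. The sketch below rebuilds a vector from the frequency table in the question (the names freq and tab are made up for this example) and uses geom_col(), which needs ggplot2 2.2.0 or later. As a side note, from ggplot2 3.3.0 on the computed variables can also be written as after_stat(count) and after_stat(prop).

```r
library(ggplot2)

# Rebuild the 500 observations from the frequency table in the question
freq <- c(128, 105, 9, 15, 13, 172, 39, 12, 7)
topprob1 <- rep(1:9, times = freq)

# Tabulate by hand: one row per category with its share of the total
tab <- as.data.frame(prop.table(table(category = topprob1)))
tab$percent <- 100 * tab$Freq

# The bar heights are the precomputed percentages
p <- ggplot(tab, aes(x = category, y = percent)) +
  geom_col()
p
```

This makes explicit what ..count../sum(..count..) does internally: each category's count divided by the total number of rows.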

R histogram plot density for a given variable or column of data

I have a data set with a column for age and a corresponding column with lung capacity. How can I create a histogram showing the distribution of lung capacity with respect to age?
Here is an example of what the data looks like. I actually want to compare the distributions for those who don't smoke with those who do:
Caes Age Gender Smoke Height FEV
0 16 1 0 64.8 2.65
0 12 0 0 60.5 2.27
1 19 1 0 71.7 4.29
0 15 0 0 64.8 2.52
Histograms are usually used when you have a single vector (like lung capacity) and you want to show the distribution of values:
library(ggplot2)
foo <- data.frame(age=runif(1000,min=10,max=50), capacity=rnorm(1000,mean=10))
ggplot(foo, aes(capacity))+geom_histogram(fill="blue")
If you want to plot the relationship between two variables, a scatter plot might be a better choice:
ggplot(foo, aes(age, capacity))+geom_point(color="blue")
Thanks for the responses. I realized that I wanted a barplot rather than a histogram. Here is the solution that I came up with:
smoke=read.csv("SmokingEffect.csv",header=TRUE)
smokes=subset(smoke,select=c(Age,Smoke,FEV))
library(plyr)
smokesmeans <- ddply(smokes, c("Age","Smoke"), summarize, mean=mean(FEV),
sem=sd(FEV)/sqrt(length(FEV)))
smokesmeans <- transform(smokesmeans, lower=mean-sem, upper=mean+sem)
smokesmeans[,2] <- sapply(smokesmeans[,2], as.character)
library(ggplot2)
# qplot() ignores stat and position since ggplot2 2.0.0, so build the plot with ggplot()
plotation <- ggplot(smokesmeans, aes(x=Age, y=mean, fill=Smoke)) +
geom_bar(stat="identity", position="dodge") +
geom_errorbar(aes(ymax=upper, ymin=lower), position=position_dodge(0.9)) +
labs(title="distribution of FEV", y="mean FEV")
png("myplot.png")
plotation
dev.off()
