Related
Question 1:
Lets assume I have carried out a "before-after" (repeated measures with two points in time) experiment with 100 subjects. Each subject ticks a score on a 1 to 3 numerical scale in the 'before' condition (at T1) and again, after some treatment applied, in the 'after' condition (at T2). The behavior of each subject in the experiment can be described as 'transition from the score value at T1 to score value at T2'. E.g. from 2 to 3, or from 1 to 1, or from 3 to 1 and so on... The cartesian product tells us that 9 different transition types are theoretically possible. For each transition type, I calculated (in an external program) the count of observations.
This gives the following dataframe:
MyData1 <- data.frame(TransitionTypeID=seq(1:9), T1=c(1,1,1,2,2,2,3,3,3), T2=c(1,2,3,1,2,3,1,2,3), Count=c(2,14,0,18,12,8,23,12,11))
MyData1
For each score value (on the y-axis) I would like to plot a point and a line between T1 and T2 (on the x-axis), whereas the thickness of the line between T1 and T2 should (somehow) correspond to the count which is observed. The plot should simply visualize which transitions occur more often than others. Any hints?
Question 2:
While some pre-calculation steps of the above example have been carried out outside of R (in MS Access), I believe there must exist a way reaching the desired result from within R, i.e. using a dataframe with individual records for each subject and each point in time (i.e. with two rows per subject, one for the score at T1 and one for T2, hence in the 'long' format).
In that case the dataframe is something like this:
MyData2 <- data.frame(SubjectID=seq(1:100), Condition = c(rep("T1",100), rep("T2",100)), Score=floor(runif(200,min=1, max=4)))
library(ggplot2)
ggplot(data = MyData2, aes(x = Condition, y = Score, group = SubjectID)) + geom_line()
I get a nice plot showing the observed transitions, but obviously the individual lines for each subject are just plotted on top of each other, i.e. the thicknesses between T1 and T2 do not reflect the count of observations for each type of transition. Again: hints on how to achieve meaningful line thicknesses would be highly appreciated.
Here is a possible solution for Question 1.
MyData1 <- data.frame(TransitionTypeID=seq(1:9),
T1=c(1,1,1,2,2,2,3,3,3),
T2=c(1,2,3,1,2,3,1,2,3),
Count=c(2,14,0,18,12,8,23,12,11))
MyData1
df <- data.frame(x=rep(c("T1","T2"), each=nrow(MyData1)),
y=c(MyData1$T1,MyData1$T2),
ID=rep(MyData1$TransitionTypeID,2),
cnt=rep(MyData1$Count,2)
)
df_lb <- data.frame(x=rep(c("T1","T2"), each=3),
y=rep(1:3,2),
hj=rep(c(2,-1),each=3))
library(ggplot2)
pal <- colorRampPalette(c("white","blue","red"))
ggplot(data=df, aes(x=x, y=y, group=ID, size=cnt, color=cnt)) +
geom_line() +
geom_point(show.legend=F) +
labs(x="", y="Score") +
scale_color_gradientn(colours=pal(10)) +
geom_text(data=df_lb, aes(x=x, y=y, label=y), size=7, inherit.aes=F, hjust=df_lb$hj) +
theme_void()
For question #1
You get to your goal very tidily with geom_segment:
ggplot( data=cbind( MyData1 ),
aes(x=1, y=T1,xend=2, yend=T2 ,size=Count))+
geom_segment()
I suspect you will need to change the 0 Counts to NA to get that 1->3 transition to go away. I think a size of 0 should make a segment disappear, but apparently Hadley thinks otherwise.
Yep:
is.na(MyData1 ) <- MyData1==0
ggplot( data=cbind( MyData1 ), aes(x=1, y=T1,xend=2, yend=T2 ,size=Count))+geom_segment()
The same code as above now delivers the correct plot.
I have a set of times that I would like to plot on a histogram.
Toy example:
df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))
The problem is that one value is ">10" and refers to the number of times that more than 10 seconds were observed. The other time points are all numbers referring to the actual time. Now, I would like to create a histogram that treats all numbers as numeric and combines them in bins when appropriate, while plotting the counts of the ">10" at the side of the distribution, but not in a separate plot. I have tried to call geom_histogram twice, once with the continuous data and once with the discrete data in a separate column but that gives me the following error:
Error: Discrete value supplied to continuous scale
Happy to hear suggestions!
Here's a kind of involved solution, but I believe it best answers your question, which is that you are desiring to place next to typical histogram plot a bar representing the ">10" values (or the values which are non-numeric). Critically, you want to ensure that you maintain the "binning" associated with a histogram plot, which means you are not looking to simply make your scale a discrete scale and represent a histogram with a typical barplot.
The Data
Since you want to retain histogram features, I'm going to use an example dataset that is a bit more involved than that you gave us. I'm just going to specify a uniform distribution (n=100) with 20 ">10" values thrown in there.
set.seed(123)
df<- data.frame(time=c(runif(100,0,10), rep(">10",20)))
As prepared, df$time is a character vector, but for a histogram, we need that to be numeric. We're simply going to force it to be numeric and accept that the ">10" values are going to be coerced to be NAs. This is fine, since in the end we're just going to count up those NA values and represent them with a bar. While I'm at it, I'm creating a subset of df that will be used for creating the bar representing our NAs (">10") using the count() function, which returns a dataframe consisting of one row and column: df$n = 20 in this case.
library(dplyr)
df$time <- as.numeric(df$time) #force numeric and get NA for everything else
df_na <- count(subset(df, is.na(time)))
The Plot(s)
For the actual plot, you are asking to create a combination of (1) a histogram, and (2) a barplot. These are not the same plot, but more importantly, they cannot share the same axis, since by definition, the histogram needs a continuous axis and "NA" values or ">10" is not a numeric/continuous value. The solution here is to make two separate plots, then combine them with a bit of magic thanks to cowplot.
The histogram is created quite easily. I'm saving the number of bins for demonstration purposes later. Here's the basic plot:
bin_num <- 12 # using this later
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
Thanks to the subsetting previously, the barplot for the NA values is easy too:
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3)
Yikes! That looks horrible, but have patience.
Stitching them together
You can simply run plot_grid(p1, p2) and you get something workable... but it leaves quite a lot to be desired:
There are problems here. I'll enumerate them, then show you the final code for how I address them:
Need to remove some elements from the NA barplot. Namely, the y axis entirely and the title for x axis (but it can't be NULL or the x axes won't line up properly). These are theme() elements that are easily removed via ggplot.
The NA barplot is taking up WAY too much room. Need to cut the width down. We address this by accessing the rel_widths= argument of plot_grid(). Easy peasy.
How do we know how to set the y scale upper limit? This is a bit more involved, since it will depend on the ..count.. stat for p1 as well as the numer of NA values. You can access the maximum count for a histogram using ggplot_build(), which is a part of ggplot2.
So, the final code requires the creation of the basic p1 and p2 plots, then adds to them in order to fix the limits. I'm also adding an annotation for number of bins to p1 so that we can track how well the upper limit setting works. Here's the code and some example plots where bin_num is set at 12 and 5, respectively:
# basic plots
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3) +
labs(x="") + theme(axis.line.y=element_blank(), axis.text.y=element_blank(),
axis.title.y=element_blank(), axis.ticks.y=element_blank()
) +
scale_x_discrete(expand=expansion(add=1))
#set upper y scale limit
max_count <- max(c(max(ggplot_build(p1)$data[[1]]$count), df_na$n))
# fix limits for plots
p1 <- p1 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15))) +
annotate('text', x=0, y=max_count, label=paste('Bins:', bin_num)) # for demo purposes
p2 <- p2 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15)))
plot_grid(p1, p2, rel_widths=c(1,0.2))
So, our upper limit fixing works. You can get really crazy playing around with positioning, etc and the plot_grid() function, but I think it works pretty well this way.
Perhaps, this is what you are looking for:
df1 <- data.frame(x=sample(1:12,50,rep=T))
df2 <- df1 %>% group_by(x) %>%
dplyr::summarise(y=n()) %>% subset(x<11)
df3 <- subset(df1, x>10) %>% dplyr::summarise(y=n()) %>% mutate(x=11)
df <- rbind(df2,df3 )
label <- ifelse((df$x<11),as.character(df$x),">10")
p <- ggplot(df, aes(x=x,y=y,color=x,fill=x)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_continuous(breaks=df$x,labels=label)
p
and you get the following output:
Please note that sometimes you could have some of the bars missing depending on the sample.
This is my first time submitting a question, so apologies in advance if my formatting is not optimal.
I have a dataframe with roughly 6,000 rows of data in 2 columns, and I want to be able to pull out individual rows (and multiple rows together) to barplot.
I read my file in as a dataframe, here is a very small subset:
gene log2
1 SMa0002 0.457418
2 SMa0005 1.116950
3 SMa0007 0.686749
4 SMa0009 0.169450
5 SMa0011 0.393365
6 SMa0013 0.601940
So what I would want to be able to do is have a barplot where the x axis is a number of genes (SMaXXX, SMaXXX, SMaXXX, etc.), and the y-axis is the log2 column. It only has (+) values displayed, but there are (-) values as well. I have no real preference about whether I use barplot or geom_bar in ggplot2, or another plotter.
I know how to just plot the dataframe;
ggplot(df, aes(x = gene, y = log2)) + geom_bar(stat = "identity")
I've tried playing around with using 'match' but I haven't been able to figure out how to make that work. Ideally the code is versatile so I can just punch in different SMaXXXX codes to generate many different plots.
Thanks for reading!
It seems that you just need a way to subset your data.frame when plotting, right?
Let's assume you've got a vector subset.genes of the genes you need to plot:
df=data.frame(gene=c("SMa0002","SMa0005","SMa0006","SMa0007","SMa0011","SMa0013"),
"log2"=runif(6), stringsAsFactors=F)
subset.genes=sample(unique(df$gene), 4, replace=F)
A couples of ways:
1°) Inside ggplot2
ggplot(df, aes(x = gene, y = log2)) + geom_bar(stat = "identity") +
scale_x_discrete(limits=subset.genes)
2°) before:
df2 <- subset(df, gene %in% subset.genes)
ggplot(df2, aes(x = gene, y = log2)) + geom_bar(stat = "identity")
I have a data frame that contains 4 variables: an ID number (chr), a degree type (factor w/ 2 levels of Grad and Undergrad), a degree year (chr with year), and Employment Record Type (factor w/ 6 levels).
I would like to display this data as a count of the unique ID numbers by year as a stacked area plot of the 6 Employment Record Types. So, count of # of ID numbers on the y-axis, degree year on the x-axis, the value of x being number of IDs for that year, and the fill will handle the Record Type. I am using ggplot2 in RStudio.
I used the following code, but the y axis does not count distinct IDs:
ggplot(AlumJobStatusCopy, aes(x=Degree.Year, y=Entity.ID,
fill=Employment.Data.Type)) + geom_freqpoly() +
scale_fill_brewer(palette="Blues",
breaks=rev(levels(AlumJobStatusCopy$Employment.Data.Type)))
I also tried setting y = Entity.ID to y = ..count.. and that did not work either. I have searched for solutions as it seems to be a problem with how I am writing the aes code.
I also tried the following code based on examples of similar plots:
ggplot(AlumJobStatusCopy, aes(interval)) +
geom_area(aes(x=Degree.Year, y = Entity.ID,
fill = Employment.Data.Type)) +
scale_fill_brewer(palette="Blues",
breaks=rev(levels(AlumJobStatusCopy$Employment.Data.Type)))
This does not even seem to work. I've read the documentation and am at my wit's end.
EDIT:
After figuring out the answer to the problem, I realized that I was not actually using the correct values for my Year variable. A count tells me nothing as I am trying to display the rise in a lack of records and the decline in current records.
My Dataset:
Year, int, 1960-2015
Current Record, num: % of total records that are current
No Record, num: % of total records that are not current
Ergo each Year value has two corresponding percent values. I am now using 2 lines instead of an area plot since the Y axis has distinct values instead of a count function, but I would still like the area under the curves filled. I tried using Melt to convert the data from wide to long, but was still unable to fill both lines. Filling is just for aesthetic purposes as I would like to use a gradient for each with 1 fill being slightly lighter than the other.
Here is my current code:
ggplot(Alum, aes(Year)) +
geom_line(aes(y = Percent.Records, colour = "Percent.Records")) +
geom_line(aes(y = Percent.No.Records, colour = "Percent.No.Records")) +
scale_y_continuous(labels = percent) + ylab('Percent of Total Records') +
ggtitle("Active, Living Alumni Employment Record") +
scale_x_continuous(breaks=seq(1960, 2014, by=5))
I cannot post an image yet.
I think you're missing a step where you summarize the data to get the quantities to plot on the y-axis. Here's an example with some toy data similar to how you describe yours:
# Make toy data with three levels of employment type
set.seed(1)
df <- data.frame(Entity.ID = rep(LETTERS[1:10], 3), Degree.Year = rep(seq(1990, 1992), each=10),
Degree.Type = sample(c("grad", "undergrad"), 30, replace=TRUE),
Employment.Data.Type = sample(as.character(1:3), 30, replace=TRUE))
# Here's the part you're missing, where you summarize for plotting
library(dplyr)
dfsum <- df %>%
group_by(Degree.Year, Employment.Data.Type) %>%
tally()
# Now plot that, using the sums as your y values
library(ggplot2)
ggplot(dfsum, aes(x = Degree.Year, y = n, fill = Employment.Data.Type)) +
geom_bar(stat="identity") + labs(fill="Employment")
The result could use some fine-tuning, but I think it's what you mean. Here, the bars are equal height because each year in the toy data include an equal numbers of IDs; if the count of IDs varied, so would the total bar height.
If you don't want to add objects to your workspace, just do the summing in the call to ggplot():
ggplot(tally(group_by(df, Degree.Year, Employment.Data.Type)),
aes(x = Degree.Year, y = n, fill = Employment.Data.Type)) +
geom_bar(stat="identity") + labs(fill="Employment")
This question asks about ordering a bar graph according to an unsummarized table. I have a slightly different situation. Here's part of my original data:
experiment,pvs_id,src,hrc,mqs,mcs,dmqs,imcs
dna-wm,0,7,9,4.454545454545454,1.4545454545454546,1.4545454545454541,4.3939393939393945
dna-wm,1,7,4,2.909090909090909,1.8181818181818181,0.09090909090909083,3.9090909090909087
dna-wm,2,7,1,4.818181818181818,1.4545454545454546,1.8181818181818183,4.3939393939393945
dna-wm,3,7,8,3.4545454545454546,1.5454545454545454,0.4545454545454546,4.272727272727273
dna-wm,4,7,10,3.8181818181818183,1.9090909090909092,0.8181818181818183,3.7878787878787876
dna-wm,5,7,7,3.909090909090909,1.9090909090909092,0.9090909090909092,3.7878787878787876
dna-wm,6,7,0,4.909090909090909,1.3636363636363635,1.9090909090909092,4.515151515151516
dna-wm,7,7,3,3.909090909090909,1.7272727272727273,0.9090909090909092,4.030303030303029
dna-wm,8,7,11,3.6363636363636362,1.5454545454545454,0.6363636363636362,4.272727272727273
I only need a few variables from this, namely mqs and imcs, grouped by their pvs_id, so I create a new table:
m = melt(t, id.var="pvs_id", measure.var=c("mqs","imcs"))
I can plot this as a bar graph where one can see the correlation between MQS and IMCS.
ggplot(m, aes(x=pvs_id, y=value))
+ geom_bar(aes(fill=variable), position="dodge", stat="identity")
However, I'd like the resulting bars to be ordered by the MQS value, from left to right, in decreasing order. The IMCS values should be ordered with those, of course.
How can I accomplish that? Generally, given any molten dataframe — which seems useful for graphing in ggplot2 and today's the first time I've stumbled over it — how do I specify the order for one variable?
It's all in making
pvs_id a factor and supplying the appropriate levels to it:
dat$pvs_id <- factor(dat$pvs_id, levels = dat[order(-dat$mqs), 2])
m = melt(dat, id.var="pvs_id", measure.var=c("mqs","imcs"))
ggplot(m, aes(x=pvs_id, y=value))+
geom_bar(aes(fill=variable), position="dodge", stat="identity")
This produces the following plot:
EDIT:
Well since pvs_id was numeric it is treated in an ordered fashion. Where as if you have a factor no order is assumed. So even though you have numeric labels pvs_id is actually a factor (nominal). And as far as dat[order(-dat$mqs), 2] is concerned the order function with a negative sign orders the data frame from largest to smallest along the variable mqs. But you're interested in that order for the pvs_id variable so you index that column which is the second column. If you tear that apart you'll see it gives you:
> dat[order(-dat$mqs), 2]
[1] 6 2 0 5 7 4 8 3 1
Now you supply that to the levels argument of factor and this orders the factor as you want it.
With newer tidyverse functions, this becomes much more straightforward (or at least, easy to read for me):
library(tidyverse)
d %>%
mutate_at("pvs_id", as.factor) %>%
mutate(pvs_id = fct_reorder(pvs_id, mqs)) %>%
gather(variable, value, c(mqs, imcs)) %>%
ggplot(aes(x = pvs_id, y = value)) +
geom_col(aes(fill = variable), position = position_dodge())
What it does is:
create a factor if not already
reorder it according to mqs (you may use desc(mqs) for reverse-sorting)
gather into individual rows (same as melt)
plot as geom_col (same as geom_bar with stat="identity")