I did 10,000 stochastic implementations of a model, and want to plot some outputs. Long story short, the plot shows how long two species co-occurred within the system on a given run. Here's an example:
Now, I want to colour the histogram bars to specify which species lasted longer within the system. Specifically, I want to be able to specify if Species A outlasted Species B, if Species B outlasted Species A, or if both species went "extinct" at the same time, within a given simulation.
So, I made three vectors containing the run numbers (out of 10,000) for each of these three outcomes. For example:
# Pi wins
PiWins_1 <- which(Winner_1[1:10000, 2] %in% TRUE)
head(PiWins_1)
# [1] 1 2 6 7 9 12
# Pj wins
PjWins_1 <- which(Winner_1[1:10000, 3] %in% TRUE)
head(PjWins_1)
# [1] 3 4 5 8 10 11
# Ties
Ties_1 <- which(Winner_1[1:10000, 4] %in% TRUE)
head(Ties_1)
# [1] 20 24 29 40 110 132
I am now trying to figure out how to use these vectors (or some other method) to colour each of the 10,000 histogram bars accordingly.
This is the ggplot script so far:
Histogram_1 <- ggplot(temp_df_1, aes(x=persistance_vec_1))+
geom_histogram(binwidth = 1, fill = "darkseagreen2")+
geom_vline(aes(xintercept = mean_pv_1, colour = "Mean # of Overlapping Time Pts in Stoch Runs"))+
geom_vline(aes(xintercept = cmean_pv_1, colour = "Conditional Mean # of Overlapping Time Pts in Stoch Runs (Dashed)"), linetype=2)+
geom_vline(aes(xintercept = median_pv_1, colour = "Median # of Overlapping Time Pts in Stoch Runs"))+
geom_vline(aes(xintercept = cmedian_pv_1, colour = "Conditional Median # of Overlapping Time Pts in Stoch Runs (Dashed)"), linetype=2)+
geom_vline(aes(xintercept = DO_1, colour = "Overlap in Det Sim"))+
labs(title="Det vs Stoch Overlap; Intro # 1, 10000 sims",
x="# of Time Pts Pi and Pj Co-occur",
y="# of Sims")+
scale_colour_manual(name="Legend",
values=c("Mean # of Overlapping Time Pts in Stoch Runs" = "navyblue",
"Conditional Mean # of Overlapping Time Pts in Stoch Runs (Dashed)" = "navyblue",
"Median # of Overlapping Time Pts in Stoch Runs" = "red2",
"Conditional Median # of Overlapping Time Pts in Stoch Runs (Dashed)" = "red2",
"Overlap in Det Sim" = "orange"))+
theme_minimal()
Histogram_1
Please let me know if I should provide any more information! Thanks so much in advance for any suggestions :)
Okay so I think what you're asking for is just how to colour bars by different colours. Here's a reprex:
x = c(1:10)
y = c(11:15, 15:11)
z = sample(c("A", "B", "C"), 10, replace = TRUE)
df = data.frame(x = x, y = y, z = z)
And then the plot - you can just use the fill argument in geom_col() if you want to use ggplot.
library(ggplot2)
df$z = as.factor(df$z)
ggplot(data = df) +
geom_col(aes(x = x, y = y, fill = z), colour = "black")
Does this do what you're looking to do?
Or if you really want to use geom_histogram(), here's an option with a different reprex:
# make the numeric data
x = sample(c(1:10), 50, replace = TRUE)
# assign each value to one of three arbitrarily chosen groups;
# ifelse() is vectorised, so no loop or pre-allocated vector is needed
z = ifelse(x %in% c(1,3,5,7,9), # if x is an odd number, z=A
  "A",
  ifelse(x %in% c(2,4,6), # if x in (2,4,6), z=B, if not, z=C
    "B", "C"))
#turn into dataframe
df = data.frame(x = x, z = z)
and then the plot using the same idea:
df$z = as.factor(df$z)
ggplot(data = df) +
geom_histogram(aes(x = x, fill = z), binwidth = 1, colour = "black")
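To connect this back to the index vectors in the question: one possible approach, sketched below on simulated data, is to build a single outcome label per run from the three run-number vectors and map that label to fill. Everything here is made up for illustration; `pi_wins`, `pj_wins` and `ties` stand in for `PiWins_1`, `PjWins_1` and `Ties_1`, and `persistence` stands in for `persistance_vec_1`.

```r
set.seed(1)
n_runs <- 1000

# Simulated stand-ins for the question's objects (assumptions, not real data)
persistence <- rpois(n_runs, lambda = 20)
outcome_sim <- sample(c("Pi", "Pj", "Tie"), n_runs, replace = TRUE,
                      prob = c(0.45, 0.45, 0.10))
pi_wins <- which(outcome_sim == "Pi")
pj_wins <- which(outcome_sim == "Pj")
ties    <- which(outcome_sim == "Tie")

# One label per run, filled in from the three index vectors
outcome <- rep(NA_character_, n_runs)
outcome[pi_wins] <- "Pi outlasted Pj"
outcome[pj_wins] <- "Pj outlasted Pi"
outcome[ties]    <- "Tie"

plot_df <- data.frame(persistence = persistence,
                      outcome = factor(outcome))

# Mapping the label to fill stacks the outcomes within each bar
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(plot_df, aes(x = persistence, fill = outcome)) +
    geom_histogram(binwidth = 1, colour = "black")
  print(p)
}
```

Note that a histogram bar aggregates many runs, so each bar ends up split by outcome rather than taking a single colour.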
I have a large dataset resulting from a Bayesian logistic regression. The dataset contains parameter estimates, confidence intervals, etc. (see below for the head).
mean sd confint_2.5 confint_97.5 Rhat median spec Errorup Errordown
1 -0.7897597 0.18668304 -1.1759960 -0.4517294 1.002211 -0.7811156 Marvulg -0.3293862 -1.957112
2 -0.7891327 0.08145761 -0.9570086 -0.6380287 1.000155 -0.7861764 Viotric -0.1481477 -1.743185
3 -0.6619662 0.26049168 -1.2203315 -0.2059030 1.045208 -0.6440501 Antdioi -0.4381470 -1.864382
4 -0.6571516 0.17940842 -1.0417642 -0.3364415 1.008100 -0.6470382 Eleacic -0.3105968 -1.688802
5 -0.6526717 0.20005184 -1.0816375 -0.2968111 1.005126 -0.6394952 Antcotu -0.3426842 -1.721133
6 -0.6497648 0.16620699 -1.0081607 -0.3555847 1.003738 -0.6384035 Triflav -0.2828188 -1.646564
I have a total of 714 rows of data, sorted by mean from low to high. I use this code to plot 50 rows at a time, where a3_sort is a subset of 50 rows (so I manually run a3_sort <- a3[n:m, ] for each block of rows, print the subset, and proceed to the next 50):
ggplot2::ggplot(data = a3_sort, mapping = aes(x = reorder(spec, mean), y = mean, ymin = confint_2.5, ymax = confint_97.5))+
geom_pointrange()+
geom_hline(yintercept = 0, lty = 2)+
coord_flip()+
xlab ("species") +ylab ("mean (credibility interval)")+
theme_bw()
This works, and I get what I want, but there must be a less labour-intensive way to do this.
My question: Is there a way to loop this procedure, automatically saving the PDF in the working directory?
Below is an example of what one plot looks like:
You can try this solution. I tested it with a dummy data frame DF with 714 rows and the same columns as yours; in your case, DF is your sorted data frame. The window size of 50 is set in one place, so you can change it if you need a different number of rows per plot.
library(zoo)
library(ggplot2) # needed for aes() and the geoms used in myplot()
#Create keys; change 50 if you want a larger window
keys <- seq(1, nrow(DF), 50)
vals <- seq_along(keys)
#Flag to allocate the position and values
#na.locf is used to complete NA so that we have same index
DF$Flag <- NA
DF$Flag[keys]<-vals
DF$Flag <- na.locf(DF$Flag)
#Then split by flag
ListData <- split(DF,DF$Flag)
#Function to create plot
myplot <- function(x)
{
tplot <- ggplot2::ggplot(data = x, mapping = aes(x = reorder(spec, mean), y = mean, ymin = confint_2.5, ymax = confint_97.5))+
geom_pointrange()+
geom_hline(yintercept = 0, lty = 2)+
coord_flip()+
xlab ("species") +ylab ("mean (credibility interval)")+
theme_bw()
return(tplot)
}
#Replicate plots
LPlots <- lapply(ListData,myplot)
#Export to pdf
pdf('Myplots.pdf',width = 14)
for(i in seq_along(LPlots))
{
print(LPlots[[i]])
}
dev.off()
In the end, you will have all your plots in a single PDF (Myplots.pdf in the working directory). I hope this helps. Let me know if you have any doubts.
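The flag-and-fill step with na.locf() can also be written without zoo. The following is a minimal base-R sketch of the same chunking idea, assuming a dummy `DF` with 714 rows; `chunk_size` and `chunk_id` are names introduced here for illustration.

```r
# Dummy stand-in for the sorted data frame (714 rows, as in the question)
DF <- data.frame(spec = paste0("sp", 1:714), mean = sort(rnorm(714)))

chunk_size <- 50
# Integer division gives 1 for rows 1-50, 2 for rows 51-100, and so on
chunk_id <- (seq_len(nrow(DF)) - 1) %/% chunk_size + 1
ListData <- split(DF, chunk_id)

length(ListData)                    # 15 chunks
nrow(ListData[[length(ListData)]])  # 14 rows in the final, partial chunk
```

The resulting `ListData` can be passed to lapply() with a plotting function in the same way as above.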
This approach could be adapted to your case:
# Some dummy data:
df <- data.frame(g = letters[1:24],
min = sample(0:10, 24, replace = TRUE),
mid = sample(11:20, 24, replace = TRUE),
max = sample(21:30, 24, replace = TRUE))
library(ggplot2)
library(purrr)
# list of the rows you want printing, this could be automated
plot_range <- list(p1_6 = 1:6, p7_12 = 7:12, p13_18 = 13:18, p19_24 = 19:24)
# plotting function which also sets a title and plot name
gg_plot <- function(df, plot_rows){
title <- paste("Automatic plot rows: ", min(plot_rows), "to", max(plot_rows))
plot_nm <- paste("plots", min(plot_rows), max(plot_rows), sep = "_")
p <- ggplot(df[plot_rows, ])+
geom_segment(aes(x = min , xend = max, y = g, yend = g))+
geom_point(aes(x = mid, y = g))+
ggtitle(title)
print(ggsave(plot_nm, p, device = "pdf"))
}
# purrr function which acts as a loop to print each graph and allows a different data frame to be used.
walk(plot_range, ~gg_plot(df = df, plot_rows = .x))
#> Saving 7 x 5 in image
#> NULL
#> Saving 7 x 5 in image
#> NULL
#> Saving 7 x 5 in image
#> NULL
#> Saving 7 x 5 in image
#> NULL
Created on 2020-07-11 by the reprex package (v0.3.0)
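The comment on `plot_range` says the list could be automated. One way this might be done, as a sketch assuming 24 rows and 6 rows per plot (`n_rows`, `chunk_size`, `starts` and `ends` are illustrative names, not part of the code above):

```r
n_rows <- 24
chunk_size <- 6

# First and last row of each chunk; the last chunk is capped at n_rows
starts <- seq(1, n_rows, by = chunk_size)
ends <- pmin(starts + chunk_size - 1, n_rows)

# One integer sequence per chunk, named like the hand-written list
plot_range <- Map(seq, starts, ends)
names(plot_range) <- paste0("p", starts, "_", ends)

names(plot_range)
```

With these settings the names come out as p1_6, p7_12, p13_18 and p19_24, so the list can be handed to walk() unchanged.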
I have a number of symmetric matrices of the same dimensionality, and I wish to visualise the mean and variance of the values in each cell across these matrices in an elegant way (which I will make more precise below) that makes use of the symmetric character.
Let me start by making some data to illustrate. The following creates 10 9x9 matrices, aggregates the mean and variance, and transforms to long format in preparation for plotting:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
make_matrix <- function(n) {
m <- matrix(NA, nrow = n, ncol = n)
m[lower.tri(m)] <- runif((n^2 - n) / 2)
m <- pmax(m, t(m), na.rm = TRUE)
diag(m) <- runif(n)
rownames(m) <- colnames(m) <- letters[1:n]
m
}
matrices <- replicate(10, make_matrix(9))
means <- apply(matrices, 1:2, mean) %>%
as_tibble(rownames = "row") %>%
pivot_longer(-1, names_to = "col", values_to = "mean")
vars <- apply(matrices, 1:2, var) %>%
as_tibble(rownames = "row") %>%
pivot_longer(-1, names_to = "col", values_to = "var")
df <- full_join(means, vars, by = c("row", "col"))
head(df)
#> # A tibble: 6 x 4
#> row col mean var
#> <chr> <chr> <dbl> <dbl>
#> 1 a a 0.548 0.111
#> 2 a b 0.507 0.0914
#> 3 a c 0.374 0.105
#> 4 a d 0.350 0.0976
#> 5 a e 0.525 0.0752
#> 6 a f 0.452 0.0887
Now, I could simply use geom_tile to make one plot of the means, and one plot of the variances. However, considering that both of these are symmetric, this wastes quite a lot of space, and also fails to communicate the symmetric character to the audience.
To address this problem, I have been playing around with the ggasym package to create an asymmetric matrix plot. The following is a slight modification from the ggasym vignette:
library(ggasym)
library(ggplot2)
ggplot(df, aes(x = col, y = row)) +
geom_asymmat(aes(fill_diag = mean, fill_tl = mean, fill_br = var)) +
scale_fill_diag_gradient(limits = c(0, 1), low = "lightpink", high = "tomato") +
scale_fill_tl_gradient(limits = c(0, 1), low = "lightpink", high = "tomato") +
scale_fill_br_gradient(low = "lightblue1", high = "dodgerblue") +
geom_text(data = filter(df, row == col), aes(label = signif(var, 2)))
Created on 2020-06-27 by the reprex package (v0.3.0)
What bothers me about this is the diagonal. In the above, I have mapped the fill of the diagonal to the means, and overlaid the variance by text, which works, but doesn't seem great. Specifically, I would like to map all the information here to fill, so as to get rid of the text. I see a couple of options for how to do this, but I am not sure how to implement any of them:
1. Split the fill of the diagonal cells, so that (in the example above) the lower right of each cell on the diagonal is an appropriate shade of blue, while the upper left is some shade of red.
2. Plot the upper and lower matrices separately (each with the diagonal), and then somehow "overlay" these plots so that they end up next to each other in an appropriate way. In other words, this would plot the diagonal twice.
I am open to other suggestions for how to accomplish this in a clean way. Let me emphasise that I do not require a solution building on ggasym, this was simply the closest I have been able to get so far. However, I would like some kind of ggplot-based solution.
So here is my take on the 'split-the-fill' strategy. You can plot most of the things you would want in ggplot if you don't mind parameterising your stuff as polygons. We let the ggnewscale package handle the double fill mapping for us.
First off, we no longer autoname the matrices, as we will not use the dimnames.
suppressPackageStartupMessages({
library(ggplot2)
library(tidyr)
library(dplyr)
library(ggnewscale)
})
make_matrix <- function(n) {
m <- matrix(NA, nrow = n, ncol = n)
m[lower.tri(m)] <- runif((n^2 - n) / 2)
m <- pmax(m, t(m), na.rm = TRUE)
diag(m) <- runif(n)
# rownames(m) <- colnames(m) <- letters[1:n]
m
}
Below is a function that takes a matrix, parameterises it as a polygon and cuts off one half.
halfmat <- function(mat, side) {
side <- match.arg(side, c("upper", "lower", "both"))
# Convert to long format
dat <- data.frame(
x = as.vector(row(mat)),
y = as.vector(col(mat)),
id = seq_along(mat),
value = as.vector(mat)
)
# Parameterise as polygon
poly <- with(dat, data.frame(
x = c(x - 0.5, x + 0.5, x + 0.5, x - 0.5),
y = c(y - 0.5, y - 0.5, y + 0.5, y + 0.5),
id = rep(id, 4),
value = rep(value, 4)
))
# Slice off one of the triangles
if (side == "upper") {
poly <- filter(poly, y >= x)
} else if (side == "lower") {
poly <- filter(poly, x >= y)
}
poly
}
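To see why slicing by vertex yields split diagonal cells, it may help to look at single cells in isolation. The sketch below is a base-R illustration of the same filter rule (`y >= x` for the upper half); `cell_corners` is a helper made up for this example and is not part of halfmat().

```r
# Corners of a unit cell centred at (x, y), in drawing order
cell_corners <- function(x, y) {
  data.frame(
    x = c(x - 0.5, x + 0.5, x + 0.5, x - 0.5),
    y = c(y - 0.5, y - 0.5, y + 0.5, y + 0.5)
  )
}

diag_cell  <- cell_corners(2, 2)  # a cell on the diagonal
upper_cell <- cell_corners(1, 3)  # a cell strictly above the diagonal

# The "upper" rule keeps corners with y >= x
nrow(subset(diag_cell, y >= x))   # 3 corners: a triangle
nrow(subset(upper_cell, y >= x))  # 4 corners: the full square
```

Cells away from the diagonal keep all four corners or lose all of them, while diagonal cells lose exactly one corner and are drawn as triangles.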
Then we generate the data, compute the means and variances and reparameterise them.
matrices <- replicate(10, make_matrix(9))
means <- apply(matrices, 1:2, mean) %>% halfmat("upper")
vars <- apply(matrices, 1:2, var) %>% halfmat("lower")
Then we put in the means and variances as two separate polygon layers, since we need to separate the fill mappings with new_scale_fill(). There is a bit of extra fiddling with the scales, as these are now continuous instead of discrete, but it is not that bad.
ggplot(means, aes(x, y, fill = value, group = id)) +
geom_polygon() +
scale_fill_distiller(palette = "Reds", name = "Mean") +
# Be sure to call new_scale_fill() only after you've set up a fill scale
# for the upper part
new_scale_fill() +
geom_polygon(data = vars, aes(fill = value)) +
scale_fill_distiller(palette = "Blues", name = "Variance") +
scale_x_continuous(breaks = function(x){seq(x[1] + 0.5, x[2] - 0.5, by = 1)},
labels = function(x){letters[x]},
expand = c(0,0), name = "col") +
scale_y_continuous(breaks = function(x){seq(x[1] + 0.5, x[2] - 0.5, by = 1)},
labels = function(x){letters[x]},
expand = c(0,0), name = "row")
Created on 2020-06-27 by the reprex package (v0.3.0)
I have a data frame for observation numbers (3 observations for same id), height, weight and fev that looks like this (just for example):
id obs height weight fev
1 1 160 80 90
1 2 150 70 85
1 3 155 76 87
2 1 140 67 91
2 2 189 78 71
2 3 178 86 89
I need to plot this data using ggplot2 such that the x-axis shows the 3 variables height, weight and fev, and the observation numbers are displayed as 3 colour-coded vertical lines for each variable, where each line shows the median as a solid circle and the 25th and 75th percentiles as caps at the upper and lower ends of the line (no minimum or maximum needed). So far I have tried many variations of box plots, but I am not even getting close. Any suggestions on how to approach or solve this?
Thanks
OK, what I did instead below was make three graphs and then piece them together with gridExtra. Read more about the package here: http://www.sthda.com/english/wiki/wiki.php?id_contents=7930
I took the common-legend code from that site to produce the following, starting with our existing longdf2. Because the graphs are pieced together, the observation each one corresponds to is given in its title.
library(reshape2) # melt()
library(dplyr) # %>% and filter()
library(ggplot2)
id <- rep(1:12, each = 3)
obs <- rep(1:3, 12)
height <- seq(140,189, length.out = 36)
weight <- seq(67,86, length.out = 36)
fev <- seq(71,91, length.out = 36)
df <- as.data.frame(cbind(id,obs,height, weight, fev))
obsonly <- melt(df, id.vars = c('id'), measure.vars = 'obs')
obsonly <- rbind(obsonly,obsonly,obsonly)
newvars <- melt(df[-2],id.vars = 'id')
longdf2 <- cbind(obsonly,newvars)
longdf2 <- longdf2[-4] #dropping second id column
colnames(longdf2)[c(2:5)] <- c('obs', 'obsnum', 'variable', 'value')
#Make graph 1 of observation 1
#median_hilow needs the Hmisc package; conf.int = 0.5 gives the 25th and
#75th percentiles (the default of 0.95 gives the 2.5th and 97.5th)
g1 <- longdf2 %>%
dplyr::filter(obsnum == 1) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data = median_hilow, fun.args = list(conf.int = 0.5)) +
labs(title = "Observation 1") +
theme(plot.title = element_text(hjust = 0.5)) #has a legend
g2 <- longdf2 %>%
dplyr::filter(obsnum == 2) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data = median_hilow, fun.args = list(conf.int = 0.5)) +
labs(title = "Observation 2") +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'none')
#specified as none to make common legend at end
g3 <- longdf2 %>%
dplyr::filter(obsnum == 3) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data = median_hilow, fun.args = list(conf.int = 0.5)) +
labs(title = "Observation 3") +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'none')
library(gridExtra)
get_legend<-function(myggplot){
tmp <- ggplot_gtable(ggplot_build(myggplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)
}
# Save legend
legend <- get_legend(g1)
# Remove legend from 1st graph
g1 <- g1 + theme(legend.position = 'none')
# Combine graphs
grid.arrange(g1, g2, g3, legend, ncol=4, widths=c(2.3, 2.3, 2.3, 0.8))
Plenty of other little tweaks you could make along the way
Try putting the data into long format prior to graphing. I generated some more data, 12 subjects, each with 3 observations.
id <- rep(1:12, each = 3)
obs <- rep(1:3, 12)
height <- seq(140,189, length.out = 36)
weight <- seq(67,86, length.out = 36)
fev <- seq(71,91, length.out = 36)
df <- as.data.frame(cbind(id,obs,height, weight, fev))
library(reshape2) #use to melt data from wide to long format
longdf <- melt(df,id.vars = c('id', 'obs'))
You don't need to define measure variables here: since the id.vars are defined, the remaining columns automatically default to measure variables. If you have more variables in your data set, you'll want to define the measure variables in that same line as: measure.vars = c("height", "weight", "fev")
longdf <- melt(df,id.vars = c('id', 'obs'), measure.vars = c("height", "weight", "fev"))
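For readers without reshape2, roughly the same long layout can be built in base R with stack(). This is a sketch of an alternative rather than what the answer used; it runs on the six example rows from the question, and `measure_vars` and `stacked` are names introduced here.

```r
# The six example rows from the question
df <- data.frame(id = rep(1:2, each = 3), obs = rep(1:3, 2),
                 height = c(160, 150, 155, 140, 189, 178),
                 weight = c(80, 70, 76, 67, 78, 86),
                 fev = c(90, 85, 87, 91, 71, 89))

measure_vars <- c("height", "weight", "fev")
# stack() returns two columns: values, and ind (the source column name)
stacked <- stack(df[measure_vars])

# Repeat the id columns once per measure variable to line up with stack()
longdf <- data.frame(
  id = rep(df$id, times = length(measure_vars)),
  obs = rep(df$obs, times = length(measure_vars)),
  variable = stacked$ind,
  value = stacked$values
)

head(longdf)
```

Each original row contributes one long row per measure variable, so 6 wide rows become 18 long rows.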
Apologies, haven't earned enough votes to put figures into my responses
ggplot(data = longdf, aes(x = variable, y = value, fill = factor(obs))) +
geom_boxplot(notch = T, notchwidth = .25, width = .25, position = position_dodge(.5))
This does not produce the exact graph you described, which sounded like geom_linerange or something similar; those geoms require x, ymin, and ymax to draw. Otherwise, a regular ol' boxplot has your 1st and 3rd quartiles and median marked. I adjusted the boxplot parameters to make it thinner (narrower width, with notches) and separated the boxes slightly with position_dodge(.5).
After reading your response, I edited my original answer.
You could try facet_wrap, and watch the difference between "fill" and "color" in ggplot. If a geom has no interior that can be filled (unlike a boxplot or a density), it has to be coloured through the color aesthetic. Use color instead in the original aes():
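ggplot(data = longdf, aes(x = variable, y = value, color = factor(obs))) +
# conf.int = 0.5 asks median_hilow (Hmisc) for the 25th and 75th percentiles
stat_summary(fun.data = median_hilow, fun.args = list(conf.int = 0.5)) + facet_wrap(.~obs)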
This gives you one panel per observation: observation 1 with height, weight and fev side by side, observation 2 with height, weight and fev, and so on.
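median_hilow relies on the Hmisc package. A small hand-written summary function is one way to avoid that dependency and to state the requested percentiles explicitly. The sketch below assumes the default quantile() definition is acceptable; `median_iqr` is a name made up here.

```r
# Median with 25th/75th percentiles, returned in the y/ymin/ymax layout
# that stat_summary(fun.data = ...) expects
median_iqr <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  data.frame(y = q[2], ymin = q[1], ymax = q[3])
}

median_iqr(1:9)  # y = 5, ymin = 3, ymax = 7

# Possible usage with the long data frame from above (not run here):
# ggplot(longdf, aes(x = variable, y = value, color = factor(obs))) +
#   stat_summary(fun.data = median_iqr,
#                position = position_dodge(width = 0.5))
```

Dodging by observation, as in the commented usage, would place the three observation lines side by side within each variable instead of in separate panels.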
If that still isn't what you want, and you would rather have height for observations 1, 2, 3, then weight for observations 1, 2, 3, and so on, you'll need to modify your melting to have two variable columns and two value columns. Essentially, make two melted data frames, then cbind them. And because each observation has three variables, you'll need to rbind the first one three times so that both data frames have the same number of rows:
obsonly <- melt(df, id.vars = c('id'), measure.vars = 'obs')
obsonly <- rbind(obsonly,obsonly,obsonly) #making rows equal
longvars <- melt(df[-2],id.vars = 'id') #dropping obs from melt
longdf2 <- cbind(obsonly,longvars)
longdf2 <- longdf2[-4] #dropping second id column
colnames(longdf2)[c(2:5)] <- c('obs', 'obsnum', 'variable', 'value')
ggplot(data = longdf2, aes(x = obsnum, y = value,
color = factor(variable))) +
stat_summary(fun.data = median_hilow, fun.args = list(conf.int = 0.5)) +
facet_wrap(.~variable)
From here you can play around with the x-axis breaks (it probably isn't useful to have an observation 1.5 marked) and the spacing of the lines from each other.
I am trying to generate a heatmap based on values.
Here is my dataset which consists of three variables: Lat (latitude), Lon (longitude), and Value.
https://www.dropbox.com/s/s53xeplywz9jh15/sample_data.csv?dl=0
I have looked through the relevant posts and found this useful:
Generating spatial heat map via ggmap in R based on a value
I copied the code from that post, and here is what my code looks like:
# import data and libraries
library(ggplot2)
library(ggmap)
Yunan<-read.csv("C:\\Program Files\\RStudio\\data\\pb_sp\\sample_data.csv", header = TRUE)
# call the map to see point distribution
Yunan_map<-get_map(location="yunan",zoom=6,maptype="terrain",scale=2)
ggmap(Yunan_map)+geom_point(data=Yunan,aes(x=Lon,y=Lat),fill="red",alpha=0.3,size=0.05,shape=21)
# 1. generate bins for x, y coordinates (unit=decimal degree)
xbreaks <- seq(floor(min(Yunan$Lat,na.rm=TRUE)), ceiling(max(Yunan$Lat,na.rm=TRUE)), by = 0.5)
ybreaks <- seq(floor(min(Yunan$Lon,na.rm=TRUE)), ceiling(max(Yunan$Lon,na.rm=TRUE)), by = 0.5)
# 2. allocate the data points into the bins
Yunan$latbin <- xbreaks[cut(Yunan$Lat, breaks = xbreaks, labels=F)]
Yunan$longbin <- ybreaks[cut(Yunan$Lon, breaks = ybreaks, labels=F)]
# 3. summarise the data for each bin (use the median)
datamat <- Yunan[, list(Value= median(Value)),
by = c("latbin", "longbin" )]
# 4. Merge the summarised data with all possible x, y coordinate combinations to get
# a value for every bin
datamat <- merge(setDT(expand.grid(latbin = xbreaks, longbin = ybreaks)), datamat,
by = c("latbin", "longbin"), all.x = TRUE, all.y = FALSE)
# 5. Fill the empty bins with 0 to smooth the contour plot
datamat[is.na(Value), ]$Value <- 0
# 6. Plot the contours
ggmap(Yunan_map,extent ="device") +
stat_contour(data = datamat, aes(x = longbin, y = latbin, z = Value,
fill = ..level.., alpha = ..level..), geom = 'polygon', binwidth = 30) +
scale_fill_gradient(name = "Value", low = "green", high = "red") +
guides(alpha = FALSE)
However, I encountered two problems:
1. After executing step 3 (summarise the data for each bin), I got this error message:
Error in `[.data.frame`(Yunan, , list(Value = median(Value)), by = c("latbin", :
unused argument (by = c("latbin", "longbin"))
2. I wish to change the colour scheme from a gradient to discrete colours, something like this map:
Since the values in my dataset range from 17 to 21, I want to classify them into bins such as 17-17.5, 17.5-18, 18-18.5, ... with corresponding colours.
Any suggestions on how I can fix these problems? Thanks in advance.
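On the first problem: the `[ , list(...), by = ...]` form is data.table syntax, while read.csv() returns a plain data.frame, which is why `by` is reported as an unused argument. The copied code presumably expects the data to be a data.table first (for example via setDT(Yunan) with data.table loaded). As a hedged alternative that avoids data.table altogether, the sketch below uses base R on simulated data, and also shows one way to turn the values into discrete 0.5-wide classes with cut(). The coordinates and values are invented; only the column names Lat, Lon and Value come from the question.

```r
set.seed(42)
# Simulated stand-in for the CSV (columns Lat, Lon, Value)
Yunan <- data.frame(
  Lat = runif(200, 22, 28),
  Lon = runif(200, 98, 105),
  Value = runif(200, 17, 21)
)

# Snap coordinates to the lower edge of a 0.5-degree grid cell
Yunan$latbin <- floor(Yunan$Lat / 0.5) * 0.5
Yunan$longbin <- floor(Yunan$Lon / 0.5) * 0.5

# Median per grid cell; aggregate() works on a plain data.frame
datamat <- aggregate(Value ~ latbin + longbin, data = Yunan, FUN = median)

# Discrete classes: 17-17.5, 17.5-18, ..., 20.5-21
datamat$class <- cut(datamat$Value, breaks = seq(17, 21, by = 0.5),
                     include.lowest = TRUE)

table(datamat$class)
```

The `class` factor could then be mapped to fill (for instance with geom_tile() and a discrete scale such as scale_fill_brewer()) to get one colour per class instead of a gradient.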
I have a data set that looks like this:
CHROM POS GT DIFF
1 chr01 14653 CT 254
2 chr01 14907 AG 254
3 chr01 14930 AG 23
4 chr01 15190 GA 260
5 chr01 15211 TG 21
6 chr01 16378 TC 1167
where POS ranges from 1xxxx to 1xxxxxxx,
and CHROM is a categorical variable that takes the values "chr01" to "chr22" and "chrX".
I want to plot a scatterplot:
y(DIFF) vs. X(POS)
having panels separated by CHROM
grouped by GT (different colors by GT)
I'm creating a ggplot with a running average (though this is not time-series data).
What I want is the average DIFF for every 1,000,000-wide range of POS, by GT.
For example,
for x in range(1 ~ 1,000,000) , DIFF average = _____
for x in range(1,000,001 ~ 2,000,000), DIFF average = _____
and I want to plot horizontal lines on the ggplot (coloured by GT).
What I have so far, before applying your function:
After applying your function:
I tried to apply your solution to what I already have, and ran into some problems:
There are different panels, so the mean values differ between panels, but when I apply your code the horizontal mean lines in every panel are identical to those in the first panel.
The panels have different x-axis ranges, so when I apply your function, it fills the extra range with the previous horizontal mean line.
Here is my code before (trans_breaks() and pretty_breaks() come from the scales package):
ggplot(data1, aes(x=POS,y=DIFF,colour=GT)) +
geom_point() +
facet_grid(~ CHROM,scales="free_x",space="free_x") +
theme(strip.text.x = element_text(size=40),
strip.background = element_rect(color='lightblue',fill='lightblue'),
legend.position="top",
legend.title = element_text(size=40,colour="darkblue"),
legend.text = element_text(size=40),
legend.key.size = unit(2.5, "cm")) +
guides(colour = guide_legend(title.position="top",
title = "Legend:GT='REF'+'ALT'",
override.aes=list(size=10))) +
scale_y_log10(breaks=trans_breaks("log10", function(x) 10^x, n=10)) +
scale_x_continuous(breaks = pretty_breaks(n=3))
This should get you started:
# It saves a lot of headaches to just make factors as you need them
options(stringsAsFactors = FALSE)
library(ggplot2)
library(plyr)
# Here's some made-up data - it always helps if you can post a subset of
# your real data, though. The dput() function is really useful for that.
dat <- data.frame(POS = seq(1, 1e7, by = 1e4))
# Add random GT value
dat$GT <- sample(x = c("CT", "AG", "GA", "TG", "TC"),
size = nrow(dat),
replace = TRUE)
# Group by millions - there are several ways to do this that I can
# never remember, but here's a simple way to split by millions
dat$POSgroup <- floor(dat$POS / 1e6)
# Add an arbitrary DIFF value
dat$DIFF <- rnorm(n = nrow(dat),
mean = 200 * dat$POSgroup,
sd = 300)
# Aggregate the data by GT and POS-group
# Ideally, you'd do this inside of the plot using stat_summary,
# but I couldn't get that to work. Using two datasets in a plot
# is okay, though.
datsum <- ddply(dat, .variables = "POSgroup", .fun = function(x) {
# Calculate the mean DIFF value for each GT group in this POSgroup
meandiff <- ddply(x, .variables = "GT", .fun = summarise, ymean = mean(DIFF))
# Add the center of the POSgroup range as the x position
meandiff$center <- (x$POSgroup[1] * 1e6) + 0.5e6
# Return the results
meandiff
})
# On the plot, these results will be grouped by both POS and GT - but
# ggplot will only accept one vector for grouping. So make a combination.
datsum$combogroup <- paste(datsum$GT, datsum$POSgroup)
# Plot it
ggplot() +
# First, a layer for the points themselves
# Large numbers of points can get pretty slow - you might try getting
# the plot to work with a subsample (~1000) and then add in the rest of
# your data
geom_point(data = dat,
aes(x = POS, y = DIFF, color = as.factor(GT))) +
# Then another layer for the means. There are a variety of geoms you could
# use here, but crossbar with ymin and ymax set to the group mean
# is a simple one
geom_crossbar(data = datsum, aes(x = center,
y = ymean,
ymin = ymean,
ymax = ymean,
color = as.factor(GT),
group = combogroup),
size = 1) +
# Some other niceties
scale_x_continuous(breaks = seq(0, 1e7, by = 1e6)) +
labs(x = "POS", y = "DIFF", color = "GT") +
theme_bw()
Which results in this:
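For the faceted case raised in the question, one possible adaptation is to include CHROM in the grouping, so that each panel gets its own means, and to draw each mean as a segment that spans only its own 1,000,000-wide window. The sketch below uses base R aggregate() instead of plyr and made-up data with two chromosomes; `xstart` and `xend` are illustrative column names.

```r
set.seed(7)
# Made-up data with two chromosomes (the real data has more)
dat <- data.frame(
  CHROM = rep(c("chr01", "chr02"), each = 500),
  POS = rep(seq(1, 5e6 - 1, length.out = 500), times = 2),
  GT = sample(c("CT", "AG", "GA"), 1000, replace = TRUE)
)
# Different DIFF scales per chromosome, so the panel means visibly differ
dat$DIFF <- rexp(1000, rate = ifelse(dat$CHROM == "chr01", 1 / 200, 1 / 800))
dat$POSgroup <- floor(dat$POS / 1e6)

# Mean DIFF per chromosome, genotype and 1,000,000-wide window
datsum <- aggregate(DIFF ~ CHROM + GT + POSgroup, data = dat, FUN = mean)
# Each mean line covers only its own window
datsum$xstart <- datsum$POSgroup * 1e6
datsum$xend <- datsum$xstart + 1e6

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dat, aes(x = POS, y = DIFF, colour = GT)) +
    geom_point(alpha = 0.3) +
    geom_segment(data = datsum,
                 aes(x = xstart, xend = xend, y = DIFF, yend = DIFF)) +
    facet_grid(~ CHROM, scales = "free_x", space = "free_x")
  print(p)
}
```

Because `datsum` carries a CHROM column, facet_grid() places each segment in its own panel, and windows with no data simply get no line.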