I have a dataframe derived from the output of running GWAS. Each row is a SNP in the genome, with its Chromosome, Position, and P.value. From this dataframe, I'd like to generate a Manhattan Plot where the x-axis goes from the first SNP on Chr 1 to the last SNP on Chr 5 and the y-axis is the -log10(P.value). To do this, I generated an Index column to plot the SNPs in the correct order along the x-axis. However, I would like the x-axis to be labeled by the Chromosome column instead of the Index. Unfortunately, I cannot use Chromosome as the x variable, because then all the SNPs on any given Chromosome would be plotted in a single column of points.
Here is an example dataframe to work with:
library(tidyverse)
df <- tibble(Index = seq(1, 500, by = 1),
Chromosome = rep(seq(1, 5, by = 1), each = 100),
Position = rep(seq(1, 500, by = 5), 5),
P.value = sample(seq(1e-5, 1e-2, by = 1e-5), 500, replace = TRUE))
And the plot that I have so far:
df %>%
ggplot(aes(x = Index, y = -log10(P.value), color = as.factor(Chromosome))) +
geom_point()
I have tried playing around with the scale_x_discrete option, but haven't been able to figure out a solution.
Here is an example of a Manhattan Plot I found online. See how the x-axis is labeled according to the Chromosome? That is my desired output.
geom_jitter is your friend:
df %>%
ggplot(aes(x = Chromosome, y = -log10(P.value), color = as.factor(Chromosome))) +
geom_jitter()
Edit given OP's comment:
Using base R plot, you could do:
cols = sample(colors(), length(unique(df$Chromosome)))[df$Chromosome]
plot(df$Index, -log10(df$P.value), col=cols, xaxt="n")
axis(1, at=c(50, 150, 250, 350, 450), labels=c(1:5))
You'll need to specify exactly where you want each chromosome label to be for the axis function. Thanks to this post.
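Rather than typing the tick positions by hand, they can be computed from the data. A minimal sketch, assuming each chromosome occupies one contiguous run of Index values as in the example dataframe (the name mids is only illustrative, and the P.values are simulated here):

```r
# Example data shaped like the question's: 5 chromosomes, 100 SNPs each
set.seed(1)
df <- data.frame(Index = 1:500,
                 Chromosome = rep(1:5, each = 100),
                 P.value = runif(500, 1e-5, 1e-2))

# Midpoint of each chromosome's Index range, used as the tick position
mids <- tapply(df$Index, df$Chromosome, function(i) mean(range(i)))

# Draw the points without an x axis, then add one tick per chromosome
plot(df$Index, -log10(df$P.value), col = df$Chromosome, xaxt = "n",
     xlab = "Chromosome", ylab = "-log10(P.value)")
axis(1, at = mids, labels = names(mids))
```

Because mids is derived from the data, the ticks stay centered if the chromosomes have different numbers of SNPs.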
Edit #2:
I found an answer using ggplot2. You can use the annotate function to plot your points by coordinates, and the scale_x_discrete function (as you suggested) to place the labels in the x axis according to chromosome. We also need to define the pos vector to get the position of labels for the plot. I used the mean value of the Index column for each group as an example, but you can define it by hand if you wish.
pos <- df %>%
group_by(Chromosome) %>%
summarize(avg = round(mean(Index))) %>%
pull(avg)
ggplot(df) +
annotate("point", x=df$Index, y=-log10(df$P.value),
color=as.factor(df$Chromosome)) +
scale_x_discrete(limits = pos,
labels = unique(df$Chromosome))
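As an alternative, the same labeling can probably be had while keeping geom_point and a continuous x scale, by passing the chromosome midpoints to scale_x_continuous as breaks. This is a sketch under the assumption that Index is numeric and each chromosome is a contiguous block; axis_df is a hypothetical helper name and the P.values are simulated:

```r
library(ggplot2)

set.seed(1)
df <- data.frame(Index = 1:500,
                 Chromosome = rep(1:5, each = 100),
                 P.value = runif(500, 1e-5, 1e-2))

# One tick per chromosome, placed at the middle of its Index range
axis_df <- aggregate(Index ~ Chromosome, data = df,
                     FUN = function(i) mean(range(i)))

p <- ggplot(df, aes(x = Index, y = -log10(P.value),
                    color = factor(Chromosome))) +
  geom_point() +
  # Continuous scale: points keep their Index position, ticks show chromosomes
  scale_x_continuous(breaks = axis_df$Index, labels = axis_df$Chromosome) +
  labs(x = "Chromosome", color = "Chromosome")
p
```

Mapping color inside aes() also gives a legend, which the annotate-based version does not.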
I asked a question before, but now I would like to know how to put labels above the bars.
Previous post: how to create a frequency histogram with predefined non-uniform intervals?
dataframe <- c (1,1.2,40,1000,36.66,400.55,100,99,2,1500,333.45,25,125.66,141,5,87,123.2,61,93,85,40,205,208.9)
Update
Following a colleague's guidance, I am updating the question.
I have a data set and I would like to calculate how often its values fall within each of a set of pre-defined ranges, for example: 0-50, 50-150, 150-500, 500-2000.
In the previous post (how to create a frequency histogram with predefined non-uniform intervals?) I managed to do this, but I don't know how to add the labels above the bars. I tried:
barplot(data, labels = labels), but it didn't work.
I used barplot because the post recommended it, but if it is possible to do this with ggplot, that would be good too.
Based on the answer to your first question, here is one way to add a text() element to your Base R plot, that serves as a label for each one of your bars (assuming you want to double-up the information that is already on the x axis).
data <- c(1,1.2,40,1000,36.66,400.55,100,99,2,1500,333.45,25,125.66,141,5,87,123.2,61,93,85,40,205,208.9)
# Cut your data into categories using your breaks
data <- cut(data,
breaks = c(0, 50, 150, 500, 2000),
labels = c('0-50', '50-150', '150-500', '500-2000'))
# Make a data table (i.e. a frequency count)
data <- table(data)
# Plot with `barplot`, making enough space for the labels
p <- barplot(data, ylim = c(0, max(data) + 1))
# Add the labels with some offset to be above the bar
text(x = p, y = data + 0.5, labels = names(data))
If it is the y values that you are after, you can change what you pass to the labels argument:
p <- barplot(data, ylim = c(0, max(data) + 1))
text(x = p, y = data + 0.5, labels = data)
Created on 2020-12-11 by the reprex package (v0.3.0)
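Since the question also asks about ggplot, here is a rough ggplot2 equivalent. It is only a sketch: the names values, bins and counts are illustrative, and the label offset is a matter of taste.

```r
library(ggplot2)

values <- c(1, 1.2, 40, 1000, 36.66, 400.55, 100, 99, 2, 1500, 333.45, 25,
            125.66, 141, 5, 87, 123.2, 61, 93, 85, 40, 205, 208.9)

# Bin the values with the same breaks, then count how many fall in each bin
bins <- cut(values, breaks = c(0, 50, 150, 500, 2000),
            labels = c("0-50", "50-150", "150-500", "500-2000"))
counts <- as.data.frame(table(bin = bins))  # columns: bin, Freq

ggplot(counts, aes(x = bin, y = Freq)) +
  geom_col() +
  geom_text(aes(label = Freq), vjust = -0.5) +  # negative vjust lifts the text above the bar
  ylim(0, max(counts$Freq) + 1)                 # head room so the top label is not clipped
```

To print the bin names instead of the counts, map label = bin in geom_text.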
I've been trying this out but I cannot find a solution. The best I can do is plot the first 15846 values in one colour and then use the lines() function to add the remaining 841 points. But these then appear at the start of the graph and do not continue from the 15846th data point.
str(as.numeric(sigma.in.fr))
num [1:15846] 0.000408 0.000242 0.000536 0.000274 0.000476 ...
str(as.numeric(sigma.out.fr))
num [1:841] 0.002558 0.000428 0.000255 0.000549 0.00028 ...
plot(as.numeric(sigma.in.fr),type="l",col=c("tomato4"))
lines(as.numeric(sigma.out.fr), type="l",col="tomato1")
This returns the plot below:
Let's make some dummy data to demonstrate:
sigma.ins.fr = sin((1:800)/20) + rnorm(800)
sigma.outs.fr = sin((801:1000)/20) + rnorm(200)
Now, put all the data together into a single sequence
sigma.all = c(sigma.ins.fr, sigma.outs.fr)
And create an x vector which simply counts along the data. We'll need this in the segments call below.
x = seq_along(sigma.all)
Now create a vector of colors for the trace. It is the same length as the full data, with a color for each segment.
cols = c(rep("tomato4", length(sigma.ins.fr)), rep("blue", length(sigma.outs.fr)))
Now create a blank canvas on which to draw the data.
plot(sigma.all, type="l", col=NA)
At last, we can plot the data. Unfortunately, lines does not allow a separate color for different segments. So instead we can use segments, which draws one piece between each pair of consecutive points and accepts a color per piece:
segments(head(x,-1), head(sigma.all,-1), x[-1], sigma.all[-1], col=head(cols,-1))
Or, if you really prefer to use two separate traces using lines, then we can achieve this by adding the x coordinates to each call:
plot(sigma.all, type="l", col=NA)
lines(seq_along(sigma.ins.fr), sigma.ins.fr, col=c("tomato4"))
lines(seq_along(sigma.outs.fr) + length(sigma.ins.fr), sigma.outs.fr, col="tomato1")
Please provide a reproducible example. Using the packages ggplot2 and dplyr you can do something like:
library(ggplot2)
library(dplyr)  # provides tibble(), filter() and %>%
df <- tibble(x = seq(1,1000, 1), y = seq(1, 500.5, 0.5))
ggplot() +
geom_line(data = df %>% filter(x < 800),
aes(x = x, y = y), color = "red", size =2) +
geom_line(data = df %>% filter(x >= 800),
aes(x = x, y = y),
color = "black", size = 2)
Note that I put the cut off at 800 (as I only created 1000 points), but you can easily change that.
What I do here is pass the data to each geom_line call rather than to ggplot(), which also works if you have different dataframes (with overlapping x and y) that you want to plot in the same graph. The data are filtered at the cut-off, so each geom_line draws its own part of the line in its own color.
I need to do a deviance chart (lollipop chart with lines from the mean to values above / below the mean). From this question and answer Drawing line segments in R, it is clear that I need to plot segments and then add the points. However, my x axis is a factor and the solution fails.
This works:
df <- data.frame(ID = c(1, 2, 3),
score = c(30, 42, 48))
mid <- mean(df$score)
plot(range(df$ID), range(df$score),type="n")
segments(df$ID, df$score, df$ID, mid)
But changing my identifier variable into a factor breaks it.
df$ID2 <- factor(df$ID)
plot(range(df$ID2), range(df$score),type="n")
segments(df$ID2, df$score, df$ID2, mid)
How can I set up the plot area and x-axis values to deal with a factor?
Note that I need a base graphics solution to fit with the other charts in a dashboard style report.
You can convert the factor to a numeric variable, suppress the x-axis and then add the correct labels to the plot:
df$ID2 <- factor(letters[df$ID]) # Use letters to show that this is working
plot(range(as.numeric(df$ID2)), range(df$score), type = "n", xaxt = "n")
segments(as.numeric(df$ID2), df$score, as.numeric(df$ID2), mid)
axis(1, at = seq_along(levels(df$ID2)), labels = levels(df$ID2))
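Putting the pieces together, a complete deviance chart in base graphics might look like the sketch below. The reference line at the mean, the padding on the x axis and the point symbol are my own additions for illustration, not part of the question.

```r
df <- data.frame(ID = factor(c("a", "b", "c")),
                 score = c(30, 42, 48))
mid <- mean(df$score)
x <- as.integer(df$ID)  # integer codes follow the order of levels(df$ID)

# Empty plot with a little padding left and right, no default x axis
plot(x, df$score, type = "n", xaxt = "n",
     xlim = c(0.5, nlevels(df$ID) + 0.5),
     xlab = "ID", ylab = "score")
abline(h = mid, lty = 2)          # reference line at the mean
segments(x, mid, x, df$score)     # stems from the mean to each value
points(x, df$score, pch = 19)     # lollipop heads
axis(1, at = seq_along(levels(df$ID)), labels = levels(df$ID))
```

Using the integer codes for positions and levels() for labels keeps the two in the same order, even if the factor levels are not alphabetical.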
I want to plot two continuous variables using ggplot.
Assume I have a dataframe where one column is a ratio between 0 and 1, and the other one is an amount. I want to be able to bin the ratio on the x axis using something like
breaks=seq(0, 1, by = .1)
and on the y axis I want the sum of the amount for each bin. It would look like a histogram, but the y axis should be the sum of the amounts of all rows that fall within each bin rather than their count. If I were making a histogram, it would look like this:
ggplot(data=data, aes(ratio)) +
geom_histogram(breaks=seq(0, 1, by = .1), aes(fill=..count..))
Try this example script. In the script, the column x represents the variable you want to break into bins and the column y represents the variable you would like to sum within those bins. At the end, the column named "SUM" holds the sums and the column named "facet" holds the bins, and you can plot from those.
library(dplyr)
dataframe1<-data.frame(x=seq(0,1, length.out = 100), y = seq(0,1000, length.out = 100))
x<-mutate(dataframe1,facet = factor(rep(c("0-0.25", "0.25 - 0.50", "0.50 - 0.75", "0.75 - 1.0"), each = length(dataframe1$x)/4)))
x[,"SUM"]<-NA
x$SUM
list1<-as.list(matrix(unique(x$facet),nrow = 4, ncol = 1))
list1[[1]]
i<-1:4
facetfill<-function(i){
sum1<-sum(x$y[x$facet==list1[[i]]])
x$SUM[x$facet==list1[[i]]]<-sum1
x$SUM
}
for (j in 1:4) {
x$SUM<-facetfill(j)
x$SUM
}
x$SUM
x
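A shorter route to the same result is to bin with cut() and sum within each bin. The sketch below uses simulated data; the column names ratio and amount follow the question, while bin, total and binned are illustrative names.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
data <- data.frame(ratio = runif(200),          # simulated ratios in [0, 1]
                   amount = rpois(200, 100))    # simulated amounts

binned <- data %>%
  # include.lowest = TRUE keeps a ratio of exactly 0 in the first bin
  mutate(bin = cut(ratio, breaks = seq(0, 1, by = 0.1),
                   include.lowest = TRUE)) %>%
  group_by(bin) %>%
  summarise(total = sum(amount))

# Bars show the summed amount per bin instead of the row count
ggplot(binned, aes(x = bin, y = total)) +
  geom_col()
```

This avoids the loop and the hard-coded facet labels, and the bin width can be changed in one place.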
I have a bunch of 'paired' observations from a study for the same subject, and I am trying to build a spaghetti plot to visualize these observations as follows:
library(plotly)
df <- data.frame(id = rep(1:10, 2),
type = c(rep('a', 10), rep('b', 10)),
state = rep(c(0, 1), 10),
values = c(rnorm(10, 2, 0.5), rnorm(10, -2, 0.5)))
df <- df[order(df$id), ]
plot_ly(df, x = type, y = values, group = id, type = 'line') %>%
layout(showlegend = FALSE)
It produces the plot I am after. But the code draws each grouped line in its own color, which is really annoying and distracting. I can't seem to find a way to get rid of the colors.
Bonus question: I actually want to use color = state and actually color the sloped lines by that variable instead.
Any approaches / thoughts?
You can set the lines to the same colour like this
plot_ly(df, x = type, y = values, group = id, type = 'scatter', mode = 'lines+markers',
line=list(color='#000000'), showlegend = FALSE)
For the 'bonus' two-for-the-price-of-one question 'how to color by a different variable to the one used for grouping':
If you were only plotting markers, and no lines, this would be simple, as you can simply provide a vector of colours to marker.color. Unfortunately, however, line.color only takes a single value, not a vector, so we need to work around this limitation.
Provided the data are not too numerous (in which case this method becomes slow, and a faster method is given below), you can set colours of each line individually by adding them as separate traces one by one in a loop (looping over id)
p <- plot_ly()
for (id in unique(df$id)) { # loop over each id once, not over every row
col <- c('#AA0000','#0000AA')[df[which(df$id==id),3][1]+1] # calculate color for this line based on the 3rd column of df (df$state).
p <- add_trace(data=df[which(df$id==id),], x=type, y=values, type='scatter', mode='markers+lines',
marker=list(color=col),
line=list(color=col),
showlegend = FALSE,
evaluate=T)
}
p
Although this one-trace-per-line approach is probably the simplest way conceptually, it does become very (impractically) slow if applied to hundreds or thousands of line segments. In this case there is a faster method, which is to plot only one line per colour, but to split this line up into multiple segments by inserting NA's between the separate segments and using the connectgaps=FALSE option to break the line into segments where there are missing data.
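Begin by using dplyr to insert missing values between line segments (i.e. for each unique id we add a row containing NA in the columns that provide x and y coordinates).
library(dplyr)
library(magrittr) # provides the %<>% assignment pipe
df %<>% distinct(id, .keep_all = TRUE) %>% # keep all columns so columns 2 and 4 exist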
`[<-`(,c(2,4),NA) %>%
rbind(df) %>%
arrange (id)
and plot, using connectgaps=FALSE:
plot_ly(df, x = type, y = values, group = state, type = 'scatter', mode = 'lines+markers',
showlegend = FALSE,
connectgaps=FALSE)
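The calls above use the older plotly syntax. With plotly 4.x, where variables are passed as formulas, I believe the same plot can be written without inserting the NA rows by hand, because plotly breaks lines between the groups of a grouped data frame. Treat this as a sketch under that assumption rather than a tested recipe:

```r
library(plotly)
library(dplyr)

set.seed(1)
df <- data.frame(id = rep(1:10, 2),
                 type = c(rep("a", 10), rep("b", 10)),
                 state = rep(c(0, 1), 10),
                 values = c(rnorm(10, 2, 0.5), rnorm(10, -2, 0.5)))

p <- df %>%
  group_by(id) %>%                         # one line per id
  plot_ly(x = ~type, y = ~values) %>%
  add_trace(type = "scatter", mode = "lines+markers",
            color = ~factor(state),        # one trace (and color) per state
            colors = c("#AA0000", "#0000AA")) %>%
  layout(showlegend = FALSE)
p
```

For a single color across all lines, drop the color and colors arguments and pass line = list(color = "#000000") instead.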