Plotting 2 continuous variables in barchart using ggplot2 - r

I want to plot two continuous variables using ggplot.
Assume I have a dataframe where one column is a ratio between 0 and 1, and the other one is an amount. I want to be able to break the ratio into bins along the x axis using something like
breaks=seq(0, 5, by = .1)
and on the y axis I want the sum of the amount for each bin. It would look like a histogram, but the y axis should be the sum of the amounts of all rows that fall within each ratio bin. If I were making a histogram, it would look like this:
ggplot(data=data, aes(ratio)) +
geom_histogram(breaks=seq(0, 1, by = .1), aes(fill=..count..))

Try this example script. x in the script represents the variable you want breaks in, and y represents the variable you would like to sum within those breaks. In the end product, the column named "SUM" holds your sums and the column named "facet" holds your breaks, and you can plot both.
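library(dplyr)
# Example data: x is the variable to bin, y is the variable to sum within each bin
dataframe1 <- data.frame(x = seq(0, 1, length.out = 100), y = seq(0, 1000, length.out = 100))
# Label each row with its bin (x is sorted, so each block of 25 rows is one bin)
x <- mutate(dataframe1, facet = factor(rep(c("0-0.25", "0.25 - 0.50", "0.50 - 0.75", "0.75 - 1.0"), each = nrow(dataframe1)/4)))
x[, "SUM"] <- NA
list1 <- as.list(as.character(unique(x$facet)))
# Return the SUM column with the rows of bin i filled in
facetfill <- function(i) {
  sum1 <- sum(x$y[x$facet == list1[[i]]])
  x$SUM[x$facet == list1[[i]]] <- sum1
  x$SUM
}
# Fill in the sums one bin at a time
for (j in seq_along(list1)) {
  x$SUM <- facetfill(j)
}
x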
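As a shorter alternative, here is a minimal sketch of two ways to get the per-bin sums directly. It is not the script above. The data frame `data` and its columns `ratio` and `amount` are made up for illustration, matching the names used in the question. The first option bins with cut() and sums with dplyr, the second relies on the `weight` aesthetic that geom_histogram() understands.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
# Hypothetical example data: `ratio` in [0, 1] and an `amount` per row
data <- data.frame(ratio = runif(200), amount = rpois(200, 50))

brks <- seq(0, 1, by = 0.1)

# Option 1: bin the ratio with cut(), then sum the amount per bin
binned <- data %>%
  mutate(bin = cut(ratio, breaks = brks, include.lowest = TRUE)) %>%
  group_by(bin) %>%
  summarise(total = sum(amount), .groups = "drop")

p1 <- ggplot(binned, aes(x = bin, y = total)) +
  geom_col()

# Option 2: let the histogram sum the amounts via the `weight` aesthetic
p2 <- ggplot(data, aes(x = ratio, weight = amount)) +
  geom_histogram(breaks = brks)
```

Printing p1 or p2 draws the chart. Option 1 treats the bins as discrete categories, while option 2 keeps a continuous x axis.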

Related

how to add labels above the bar of "barplot" graphics?

I asked a question before, but now I would like to know how to put the labels above the bars.
Old post: how to create a frequency histogram with predefined non-uniform intervals?
dataframe <- c (1,1.2,40,1000,36.66,400.55,100,99,2,1500,333.45,25,125.66,141,5,87,123.2,61,93,85,40,205,208.9)
Update
Following a colleague's guidance, I am updating the question.
I have a data set and I would like to calculate how often its values fall within pre-defined ranges, for example: 0-50, 50-150, 150-500, 500-2000.
In the post (how to create a frequency histogram with predefined non-uniform intervals?) I managed to do this, but I don't know how to add the labels above the bars. I tried:
barplot(data, labels = labels), but it didn't work.
I used barplot because the post recommended it, but if it is possible to do this using ggplot, that would be good too.
Based on the answer to your first question, here is one way to add a text() element to your Base R plot, that serves as a label for each one of your bars (assuming you want to double-up the information that is already on the x axis).
data <- c(1,1.2,40,1000,36.66,400.55,100,99,2,1500,333.45,25,125.66,141,5,87,123.2,61,93,85,40,205,208.9)
# Cut your data into categories using your breaks
data <- cut(data,
breaks = c(0, 50, 150, 500, 2000),
labels = c('0-50', '50-150', '150-500', '500-2000'))
# Make a data table (i.e. a frequency count)
data <- table(data)
# Plot with `barplot`, making enough space for the labels
p <- barplot(data, ylim = c(0, max(data) + 1))
# Add the labels with some offset to be above the bar
text(x = p, y = data + 0.5, labels = names(data))
If it is the y values that you are after, you can change what you pass to the labels argument:
p <- barplot(data, ylim = c(0, max(data) + 1))
text(x = p, y = data + 0.5, labels = data)
Created on 2020-12-11 by the reprex package (v0.3.0)
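Since the question also asks about ggplot, here is a minimal sketch of the same idea with geom_col() and geom_text(). The object names `values`, `bins` and `counts` are chosen for illustration, and the vjust offset is only a starting point that you may need to adjust.

```r
library(ggplot2)

values <- c(1, 1.2, 40, 1000, 36.66, 400.55, 100, 99, 2, 1500, 333.45, 25,
            125.66, 141, 5, 87, 123.2, 61, 93, 85, 40, 205, 208.9)

# Cut the data into the predefined ranges
bins <- cut(values,
            breaks = c(0, 50, 150, 500, 2000),
            labels = c("0-50", "50-150", "150-500", "500-2000"))

# Frequency table as a data frame with columns `bin` and `Freq`
counts <- as.data.frame(table(bin = bins))

# Bars plus a count label just above each bar
p <- ggplot(counts, aes(x = bin, y = Freq)) +
  geom_col() +
  geom_text(aes(label = Freq), vjust = -0.5) +
  expand_limits(y = max(counts$Freq) * 1.1) # leave room for the top label
p
```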

Labeling x-axis with another column from dataframe

I have a dataframe derived from the output of running GWAS. Each row is a SNP in the genome, with its Chromosome, Position, and P.value. From this dataframe, I'd like to generate a Manhattan Plot where the x-axis goes from the first SNP on Chr 1 to the last SNP on Chr 5 and the y-axis is the -log10(P.value). To do this, I generated an Index column to plot the SNPs in the correct order along the x-axis, however, I would like the x-axis to be labeled by the Chromosome column instead of the Index. Unfortunately, I cannot use Chromosome to plot my x-axis because then all the SNPs on any given Chromosome would be plotted in a single column of points.
Here is an example dataframe to work with:
library(tidyverse)
df <- tibble(Index = seq(1, 500, by = 1),
Chromosome = rep(seq(1, 5, by = 1), each = 100),
Position = rep(seq(1, 500, by = 5), 5),
P.value = sample(seq(1e-5, 1e-2, by = 1e-5), 500, replace = TRUE))
And the plot that I have so far:
df %>%
ggplot(aes(x = Index, y = -log10(P.value), color = as.factor(Chromosome))) +
geom_point()
I have tried playing around with the scale_x_discrete option, but haven't been able to figure out a solution.
Here is an example of a Manhattan Plot I found online. See how the x-axis is labeled according to the Chromosome? That is my desired output.
geom_jitter is your friend:
df %>%
ggplot(aes(x = Chromosome, y = -log10(P.value), color = as.factor(Chromosome))) +
geom_jitter()
Edit given OP's comment:
Using base R plot, you could do:
cols = sample(colors(), length(unique(df$Chromosome)))[df$Chromosome]
plot(df$Index, -log10(df$P.value), col=cols, xaxt="n")
axis(1, at=c(50, 150, 250, 350, 450), labels=c(1:5))
You'll need to specify exactly where you want each chromosome label to be for the axis function. Thanks to this post.
Edit #2:
I found an answer using ggplot2. You can use the annotate function to plot your points by coordinates, and the scale_x_discrete function (as you suggested) to place the labels in the x axis according to chromosome. We also need to define the pos vector to get the position of labels for the plot. I used the mean value of the Index column for each group as an example, but you can define it by hand if you wish.
pos <- df %>%
group_by(Chromosome) %>%
summarize(avg = round(mean(Index))) %>%
pull(avg)
ggplot(df) +
annotate("point", x=df$Index, y=-log10(df$P.value),
color=as.factor(df$Chromosome)) +
scale_x_discrete(limits = pos,
labels = unique(df$Chromosome))
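Another option, sketched here as an assumption rather than as part of the answer above, is to keep Index on a continuous x axis and only relabel it: scale_x_continuous() accepts `breaks` and `labels`, so the tick for each chromosome can sit at the centre of its Index range. The example rebuilds a similar data frame with base data.frame() so that it is self-contained, and the name `centers` is made up for illustration.

```r
library(dplyr)
library(ggplot2)

set.seed(42)
df <- data.frame(Index = 1:500,
                 Chromosome = rep(1:5, each = 100),
                 P.value = sample(seq(1e-5, 1e-2, by = 1e-5), 500, replace = TRUE))

# Centre of each chromosome along the Index axis
centers <- df %>%
  group_by(Chromosome) %>%
  summarise(center = mean(Index), .groups = "drop")

# Plot by Index, but put one tick per chromosome at its centre
p <- ggplot(df, aes(x = Index, y = -log10(P.value),
                    colour = factor(Chromosome))) +
  geom_point() +
  scale_x_continuous(breaks = centers$center,
                     labels = as.character(centers$Chromosome)) +
  labs(x = "Chromosome", colour = "Chromosome")
p
```

This keeps the colour mapping and legend that the original geom_point() plot had.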

Changing colour under particular threshold for geom_line [duplicate]

I have the following dataframe that I would like to plot. I was wondering if it is possible to color portions of the lines connecting my outcome variable(stackOne$y) in a different color, depending on whether it is less than a certain value or not. For example, I would like portions of the lines falling below 2.2 to be red in color.
set.seed(123)
stackOne = data.frame(id = rep(c(1, 2, 3), each = 3),
y = rnorm(9, 2, 1),
x = rep(c(1, 2, 3), 3))
ggplot(stackOne, aes(x = x, y = y)) +
geom_point() +
geom_line(aes(group = id))
Thanks!
You have at least a couple of options here. The first is quite simple, general (in that it's not limited to straight-line segments) and precise, but uses base plot rather than ggplot. The second uses ggplot, but is slightly more complicated, and colour transition will not be 100% precise (but near enough, as long as you specify an appropriate resolution... read on).
base:
If you're willing to use base plotting functions rather than ggplot, you could clip the plotting region to above the threshold (2.2), then plot the segments in your preferred colour, and subsequently clip to the region below the threshold, and plot again in red. While the first clip is strictly unnecessary, it prevents overplotting different colours, which can look a bit dud.
threshold <- 2.2
set.seed(123)
stackOne=data.frame(id=rep(c(1,2,3),each=3),
y=rnorm(9,2,1),
x=rep(c(1,2,3),3))
# create a second df to hold segment data
d <- stackOne
d$y2 <- c(d$y[-1], NA)
d$x2 <- c(d$x[-1], NA)
d <- d[-findInterval(unique(d$id), d$id), ] # remove last row for each group
plot(stackOne[, 3:2], pch=20)
# clip to region above the threshold
clip(min(stackOne$x), max(stackOne$x), threshold, max(stackOne$y))
segments(d$x, d$y, d$x2, d$y2, lwd=2)
# clip to region below the threshold
clip(min(stackOne$x), max(stackOne$x), min(stackOne$y), threshold)
segments(d$x, d$y, d$x2, d$y2, lwd=2, col='red')
points(stackOne[, 3:2], pch=20) # plot points again so they lie over lines
ggplot:
If you want or need to use ggplot, you can consider the following...
One solution is to use geom_line(aes(group=id, color = y < 2.2)), however this will assign colours based on the y-value of the point at the beginning of each segment. I believe you want to have the colour change not just at the nodes, but wherever a line crosses your given threshold of 2.2. I'm not all that familiar with ggplot, but one way to achieve this is to make a higher-resolution version of your data by creating new points along the lines that connect your existing points, and then use the color = y < 2.2 argument to achieve the desired effect.
For example:
threshold <- 2.2 # set colour-transition threshold
yres <- 0.01 # y-resolution (accuracy of colour change location)
d <- stackOne # for code simplification
# new cols for point coordinates of line end
d$y2 <- c(d$y[-1], NA)
d$x2 <- c(d$x[-1], NA)
d <- d[-findInterval(unique(d$id), d$id), ] # remove last row for each group
# new high-resolution y coordinates between each pair within each group
y.new <- apply(d, 1, function(x) {
seq(x['y'], x['y2'], yres*sign(x['y2'] - x['y']))
})
d$len <- sapply(y.new, length) # length of each series of points
# new high-resolution x coordinates corresponding with new y-coords
x.new <- apply(d, 1, function(x) {
seq(x['x'], x['x2'], length.out=x['len'])
})
id <- rep(seq_along(y.new), d$len) # new group id vector
y.new <- unlist(y.new)
x.new <- unlist(x.new)
d.new <- data.frame(id=id, x=x.new, y=y.new)
p <- ggplot(d.new, aes(x = x, y = y)) +
  geom_line(aes(group = id, color = y < threshold)) + # refer to columns by name inside aes()
  geom_point(data = stackOne) +
  scale_color_discrete(sprintf('Below %s', threshold))
p
There may well be a way to do this through ggplot functions, but in the meantime I hope this helps. I couldn't work out how to draw a ggplotGrob into a clipped viewport (rather it seems to just scale the plot). If you want colour to be conditional on some x-value threshold instead, this would obviously need some tweaking.
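If an exact colour change is needed in ggplot, one possibility is to split each segment at the point where it crosses the threshold and draw the pieces with geom_segment(). The following is a minimal sketch of that idea, not the approach used above. It assumes straight segments and rows ordered by x within each id, and the names `seg`, `hit`, `xc` and `pieces` are made up for illustration.

```r
library(ggplot2)

threshold <- 2.2
set.seed(123)
stackOne <- data.frame(id = rep(1:3, each = 3),
                       y = rnorm(9, 2, 1),
                       x = rep(1:3, 3))

# One row per line segment within each id
seg <- do.call(rbind, lapply(split(stackOne, stackOne$id), function(g) {
  n <- nrow(g)
  data.frame(id = g$id[-n], x = g$x[-n], y = g$y[-n],
             xend = g$x[-1], yend = g$y[-1])
}))

# Segments whose endpoints lie on opposite sides of the threshold
crosses <- (seg$y - threshold) * (seg$yend - threshold) < 0
hit <- seg[crosses, ]

# x coordinate of the crossing point, by linear interpolation
xc <- hit$x + (threshold - hit$y) * (hit$xend - hit$x) / (hit$yend - hit$y)

# Replace each crossing segment with two pieces that meet at the threshold
first <- transform(hit, xend = xc, yend = threshold)
second <- transform(hit, x = xc, y = threshold)
pieces <- rbind(seg[!crosses, ], first, second)

# A piece lies entirely on one side, so its midpoint decides the colour
pieces$below <- (pieces$y + pieces$yend) / 2 < threshold

p <- ggplot(pieces) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend, colour = below)) +
  geom_point(data = stackOne, aes(x = x, y = y)) +
  scale_colour_manual(name = sprintf("Below %s", threshold),
                      values = c("FALSE" = "black", "TRUE" = "red"))
p
```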
Encouraged by people in my answer to a newer but related question, I'll also share an easier-to-use approximation to the problem here.
Instead of interpolating the correct values exactly, one can use ggforce::geom_link2() to interpolate lines and use after_stat() to assign the correct colours after interpolation. If you want more precision you can increase the n of that function.
library(ggplot2)
library(ggforce)
set.seed(123)
stackOne = data.frame(id = rep(c(1, 2, 3), each = 3),
y = rnorm(9, 2, 1),
x = rep(c(1, 2, 3), 3))
ggplot(stackOne, aes(x = x, y = y)) +
geom_point() +
geom_link2(
aes(group = id,
colour = after_stat(y < 2.2))
) +
scale_colour_manual(
values = c("black", "red")
)
Created on 2021-03-26 by the reprex package (v1.0.0)

Single histogram with two or more colors depending on xaxis values

I know it was already answered here, but only for ggplot2 histogram.
Let's say I have the following code to generate a histogram with red bars and blue bars, same number of each (six red and six blue):
set.seed(69)
hist(rnorm(500), col = c(rep("red", 6), rep("blue", 7)), breaks = 10)
I have the following image as output:
I would like to automate the entire process. How can I use the x-axis values and a condition to color the histogram bars (with two or more colors) using the hist() function, without having to specify the number of repetitions of each color?
Assistance most appreciated.
The hist function uses the pretty function to determine break points, so you can do this:
set.seed(69)
x <- rnorm(500)
breaks <- pretty(x,10)
col <- ifelse(1:length(breaks) <= length(breaks)/2, "red", "blue")
hist(x, col = col, breaks = breaks)
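To colour by an actual x-axis condition rather than by bar position, one option is to compute the histogram first with plot = FALSE and test the bin midpoints. This is a minimal sketch; the condition (midpoint below 0) and the colours are placeholders to replace with your own.

```r
set.seed(69)
x <- rnorm(500)

# Compute the bins without drawing anything
h <- hist(x, breaks = 10, plot = FALSE)

# One colour per bar, chosen from the x value at the middle of each bin
cols <- ifelse(h$mids < 0, "red", "blue")

# Draw the precomputed histogram with the per-bar colours
plot(h, col = cols, main = "Histogram coloured by bin midpoint")
```

For more than two colours, cut(h$mids, ...) with a vector of colour labels can take the place of ifelse().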
When I want to do this, I actually tabulate the data and make a barplot as follows (note that a bar plot of tabulated data is a histogram):
set.seed(69)
dat <- rnorm(500, 0, 1)
tab <- table(round(dat, 1)) # round the rnorm draws, which are more precise than most real data
bools <- as.numeric(names(tab)) >= 0 # your condition here
cols <- c("grey", "dodgerblue4")[bools + 1] # note that FALSE + 1 = 1 and TRUE + 1 = 2
barplot(tab, border = "white", col = cols, main = "Histogram with barplot")

Add multiple horizontal lines in a boxplot

I know that I can add a horizontal line to a boxplot using a command like
abline(h=3)
When there are multiple boxplots in a single panel, can I add different horizontal lines for each single boxplot?
In the above plot, I would like to add lines 'y=1.2' for 1, 'y=1.5' for 2, and 'y=2.1' for 3.
I am not sure I understand exactly what you want, but it might be this: add a line for each boxplot that covers the same x-axis range as the boxplot.
The width of the boxes is controlled by pars$boxwex which is set to 0.8 by default. This can be seen from the argument list of boxplot.default:
formals(boxplot.default)$pars
## list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5)
So, the following produces a line segment for each boxplot:
# create sample data and box plot
set.seed(123)
datatest <- data.frame(a = rnorm(100, mean = 10, sd = 4),
b = rnorm(100, mean = 15, sd = 6),
c = rnorm(100, mean = 8, sd = 5))
boxplot(datatest)
# create data for segments
n <- ncol(datatest)
# width of each boxplot is 0.8
x0s <- 1:n - 0.4
x1s <- 1:n + 0.4
# these are the y-coordinates for the horizontal lines
# that you need to set to the desired values.
y0s <- c(11.3, 16.5, 10.7)
# add segments
segments(x0 = x0s, x1 = x1s, y0 = y0s, col = "red")

Resources