How to include bar with NAs in geom_histogram? - r

I am trying to create a histogram of a continuous variable (1-10) with a bar a little to the side that says how many NAs are in the vector. I am using geom_histogram() from ggplot2. Here is an example:
v <- data.frame(x=c(1, 2, 3, 4, 3, 2, 3, 4, 5, 3, 2, 1, NA, NA, NA, NA))
ggplot(v, aes(x=x)) +
geom_histogram()
I have looked through the features of the function, but there doesn't seem to be a way to include NAs, and I haven't found an elegant way of doing it in other questions. Thanks for the help.

I don't know if it is a perfect solution, but you can get the count of NAs with dplyr before plotting your data:
library(tidyverse)
v %>% group_by(x) %>% count(x) %>%
ggplot(aes(x = as.factor(x), y = n)) +
geom_bar(stat = "identity")
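If the tidyverse is not available, the same idea can be sketched in base R: label the missing values explicitly, tabulate, and plot the counts as bars. This is only a sketch; the names `lvls`, `x_label` and `counts` are made up for illustration, and the plotting step is skipped when ggplot2 is not installed.

```r
v <- data.frame(x = c(1, 2, 3, 4, 3, 2, 3, 4, 5, 3, 2, 1, NA, NA, NA, NA))

# Numeric values in ascending order, with an explicit "NA" level at the end
# (sort() drops NA by default).
lvls <- c(as.character(sort(unique(v$x))), "NA")
x_label <- factor(ifelse(is.na(v$x), "NA", as.character(v$x)), levels = lvls)

# One row per level, with the number of observations in column n.
counts <- as.data.frame(table(x = x_label), responseName = "n")

# Draw the pre-computed counts; the NA bar ends up last on the x-axis.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts, aes(x = x, y = n)) +
    geom_col()
}
```

Setting the factor levels by hand keeps the bars in numeric order even when the values go up to 10, where a plain alphabetical sort would put "10" before "2".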

Order variables geom_point based on similar pattern across x-axis in R

How could I order the variables so that they are plotted like a clustered heat map, where rows showing a similar pattern sit together, i.e. A and D at the top, then B, C, and E at the bottom? I would like to avoid doing it manually, as the real data has many more variables.
Variable1 <- c(rep("A",7), rep("B",7),rep("C",7), rep("D",7), rep("E",7))
Variable2 <- c(rep(1:7, 5))
value <- c(15, 16, 11, 12, 13, 11, 12, 4, 3, 6, 5, 4, 3, 2, 3, 3, 2, 3, 3, 4, 3, 18, 17, 15, 2, 3, 4, 5, 2, 3, 4, 5, 6, 10, 18)
dff <- data.frame(Variable1, Variable2, value)
library(dplyr)
library(ggplot2)
dff <- dff %>%group_by(Variable1)%>%
mutate(scaled_val = scale(value)) %>%
ungroup()
dff$Variable <- factor(dff$Variable1,levels=rev(unique(dff$Variable1)))
ggplot(dff, aes(x = Variable2, y = Variable1, label=NA)) +
geom_point(aes(size = scaled_val, colour = value)) +
geom_point(aes(size = scaled_val, colour = value), shape=21, colour="black") +
geom_text(hjust = 1, size = 2) +
theme_bw()+
scale_color_gradient(low = "lightblue", high = "darkblue")+
scale_x_discrete(expand=c(1,0))+
coord_fixed(ratio=4)
And desired:
If you look at a heat map with rows clustered by similarity, for example https://3.bp.blogspot.com/-AI2dxe95VHk/TgTJtEkoBgI/AAAAAAAAC5w/XCyBw3qViGA/s400/heatmap_cluster2.png, you see that the rows at the top are the ones whose pattern peaks at the first x-axis timepoints, followed by the ones that are higher at the last x-axis timepoints.
To do: I wonder whether, using the scaled value, we can put at the top the variables with a higher mean at Variable2 1:2, then those with a higher mean at Variable2 3:5, then Variable2 6:7. Let me know if I am not being clear and I can explain better.
It sounds like you want to arrange groups A-E based on their mean. You can do that by converting Variable1 into a factor with custom levels:
lvls <- names(sort(by(dff$value, dff$Variable1, mean)))
dff$Variable1 <- factor(dff$Variable1, levels = lvls)
Here's a solution that sorts groups by which.max:
peaks <- c(by(dff$value, dff$Variable1, which.max))
lvls <- names(sort(peaks))
dff$Variable1 <- factor(dff$Variable1, levels = lvls)
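To see what the which.max ordering produces on the example data, here is a small base R sketch. It uses `sapply(split(...))` in place of `by()`, which is a substitution made here only to get a plain named vector. Note that ggplot2 draws the first factor level at the bottom of a discrete y-axis, so the levels are reversed to put the earliest-peaking groups at the top.

```r
Variable1 <- rep(c("A", "B", "C", "D", "E"), each = 7)
value <- c(15, 16, 11, 12, 13, 11, 12, 4, 3, 6, 5, 4, 3, 2, 3, 3, 2, 3, 3,
           4, 3, 18, 17, 15, 2, 3, 4, 5, 2, 3, 4, 5, 6, 10, 18)
dff <- data.frame(Variable1, value)

# Position (1 to 7) of the maximum value within each group.
peaks <- sapply(split(dff$value, dff$Variable1), which.max)

# Groups ordered from earliest to latest peak: "D" "A" "B" "C" "E"
lvls <- names(sort(peaks))

# Reverse so that the earliest peak ends up at the top of the y-axis.
dff$Variable1 <- factor(dff$Variable1, levels = rev(lvls))
```

With this data D and A peak first and E peaks last, which matches the ordering asked for in the question.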

R: Label X-axis on line chart with ggplot2

I have this data frame to construct some lines chart using ggplot2. lb is what I want my label to be on x-axis while each other variables (x0.6, x0.8, x0.9, x0.95, x0.99, and x0.999) will be against lb on the y-axis.
# my data
lb <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
x0.6 <- c(0.9200795, 0.9315084, 0.9099002, 0.9160192, 0.9121120, 0.9134098, 0.9130619, 0.9128494, 0.9144164)
x0.8 <- c(0.9804872, 1.0144678, 0.9856382, 0.9730490, 1.0032707, 1.0036311, 0.9726198, 0.9986403, 1.0022643)
x0.9 <- c(1.055256, 1.016159, 1.067242, 1.089894, 1.043502, 1.041497, 1.037738, 1.023274, 1.040536)
x0.95 <- c(1.058024, 1.105353, 1.069076, 1.061077, 1.095764, 1.096789, 1.096670, 1.121497, 1.109918)
x0.99 <- c(1.107258, 1.098061, 1.118248, 1.101253, 1.083208, 1.109715, 1.083704, 1.083704, 1.118057)
x0.999 <- c(1.110732, 1.119625, 1.121221, 1.087423, 1.093228, 1.094003, 1.108910, 1.112413, 1.096734)
# my data frame
pos11 <- data.frame(lb, x0.6, x0.8, x0.9, x0.95, x0.99, x0.999)
#load packages
library("reshape2")
library("ggplot2")
# this `R` CODE reshapes the data
long_pos11 <- melt(pos11, id="lb")
# Here is the `R` code that produces the `line-chart`
pos_line <- ggplot(data = long_pos11,
aes(x=lb, y=value, colour=variable)) +
geom_line()
I want the line chart to show the elements of the vector lb (1, 2, 3, 4, 5, 6, 7, 8, 9) as labels on the x-axis, just like the dates in "Plotting two variables as lines using ggplot2 on the same graph".
Try this. As your variable is numeric, you need to convert it to a factor and also add group to your aes() statement. Here is the code:
library("reshape2")
library("ggplot2")
# this `R` CODE reshapes the data
long_pos11 <- melt(pos11, id="lb")
# Here is the `R` code that produces the `line-chart`
pos_line <- ggplot(data = long_pos11,
aes(x=factor(lb), y=value, colour=variable,group=variable)) +
geom_line()+xlab('lb')
We can also use pivot_longer
library(ggplot2)
library(tidyr)
library(dplyr)
pos11 %>%
pivot_longer(cols = -lb) %>%
mutate(lb = factor(lb)) %>%
ggplot(aes(x = lb, y = value, color = name, group = name)) +
geom_line() +
xlab('lb')
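Another option, sketched here as an alternative rather than as part of the answers above, is to keep lb numeric and ask for one axis break per value with scale_x_continuous(). The example uses only two of the series to stay short, reshapes with base R `stack()` instead of melt(), and skips the plot when ggplot2 is not installed.

```r
lb <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
x0.6 <- c(0.9200795, 0.9315084, 0.9099002, 0.9160192, 0.9121120,
          0.9134098, 0.9130619, 0.9128494, 0.9144164)
x0.8 <- c(0.9804872, 1.0144678, 0.9856382, 0.9730490, 1.0032707,
          1.0036311, 0.9726198, 0.9986403, 1.0022643)
pos11 <- data.frame(lb, x0.6, x0.8)

# Wide to long: one row per (lb, series) pair; lb is recycled per series.
long_pos11 <- cbind(lb = pos11$lb, stack(pos11[-1]))
names(long_pos11) <- c("lb", "value", "variable")

# lb stays numeric, so no group aesthetic is needed; the breaks argument
# puts a tick and a label at every value of lb.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  pos_line <- ggplot(long_pos11, aes(x = lb, y = value, colour = variable)) +
    geom_line() +
    scale_x_continuous(breaks = pos11$lb)
}
```

Keeping the axis continuous preserves the spacing between values, which matters if lb is ever not evenly spaced; the factor approach treats every value as an equally spaced category.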

displaying different symbols for each point within the same factor in R ggplot2

I am trying to create a plot to show the mean of calculated values within each group (organised by factors), as well as the individual points themselves. I have managed to do this successfully; however, all the points use the same symbol. I want to have a different symbol for each of the points within each factor, and preferably use the same symbols in the same order for each factor.
An example version of the kind of graph I am currently making is below, however all the points within the same column use the same symbol.
I have thought about using the row number of the points to define the symbol shape, but I think there are only 25 different shapes available in the default ggplot2 package, and my real data has more than 25 points, plus I would prefer if the same points were used in each column, to keep the graph looking consistent.
Mean_list <- data.frame(Cells = factor(c("Celltype1", "Celltype2", "Celltype3",
"Celltype4"),
levels =c("Celltype1", "Celltype2", "Celltype3", "Celltype4")),
Mean = c(mean(c(1, 2, 3)), mean(c(5, 8, 4)), mean(c(9, 8 ,3)),
mean(c(3, 6, 8, 5))))
values_list <- data.frame(Cells2 = rep(c("Celltype1", "Celltype2", "Celltype3",
"Celltype4"), times = c(length(c(1, 2, 3)),
length(c(5, 8, 4)), length(c(9, 8 ,3)),
length(c(3, 6, 8, 5)))),
values = c(1, 2, 3, 5, 8, 4, 9, 8, 3, 3, 6, 8, 5))
ggplot() + geom_col(data = Mean_list, aes(Cells, Mean, fill = Cells)) +
geom_point(data = values_list, aes(Cells2, values))
Before plotting, we can assign a number to each row within a cell type:
library(dplyr)
values_list <- values_list %>% group_by(Cells2) %>% mutate(shape = factor(seq_along(values)))
ggplot() +
geom_col(data = Mean_list, aes(Cells, Mean, fill = Cells)) +
geom_point(data = values_list, aes(Cells2, values, shape = shape))
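The per-group index can also be computed in base R with `ave()`, as in the sketch below. One caveat: ggplot2's default shape palette only handles up to six discrete values, so with more points per group the shapes have to be supplied by hand through scale_shape_manual(). The vector `shape_codes` is a hypothetical choice of plotting symbols, not something from the question.

```r
values_list <- data.frame(
  Cells2 = rep(c("Celltype1", "Celltype2", "Celltype3", "Celltype4"),
               times = c(3, 3, 3, 4)),
  values = c(1, 2, 3, 5, 8, 4, 9, 8, 3, 3, 6, 8, 5)
)

# Running index of each point within its cell type: 1, 2, 3, 1, 2, 3, ...
values_list$shape <- factor(
  ave(values_list$values, values_list$Cells2, FUN = seq_along)
)

# One symbol per index level (assumed choice); extend this vector when a
# group has more points.
shape_codes <- c(16, 17, 15, 18)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(values_list, aes(Cells2, values, shape = shape)) +
    geom_point(size = 3) +
    scale_shape_manual(values = shape_codes)
}
```

Because the index restarts in every group, the first point of each cell type gets the same symbol, the second point the next one, and so on.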

Which ggplot2 geom should I use?

I have a data frame.
id <- c(1:5)
count_big <- c(15, 25, 7, 0, 12)
count_small <- c(15, 9, 22, 11, 14)
count_black <- c(7, 12, 5, 2, 6)
count_yellow <- c(2, 0, 7, 4, 3)
count_red <- c(8, 4, 4, 2, 5)
count_blue <- c(5, 9, 6, 1, 7)
count_green <- c(8, 9, 7, 2, 5)
df <- data.frame(id, count_big, count_small, count_black, count_yellow, count_red, count_blue, count_green)
How can I display the following in ggplot2 and which geom should I use:
a breakdown of big and small variable by id
a breakdown of colors by id
This is just a subset of the data set that has around 1000 rows.
Can I use this df in ggplot2, or do I need to transform it into tidy data with tidyr? (don't know data.table yet)
You need to first restructure the data from wide to long with tidyr.
library(tidyr)
library(ggplot2)
df <- gather(df, var, value, starts_with("count"))
# remove count_
df$var <- sub("count_", "", df$var)
# plot big vs small
df_size <- subset(df, var %in% c("big", "small"))
ggplot(df_size, aes(x = id, y = value, fill = var)) +
geom_bar(stat = "identity", position = position_dodge())
# same routine for colors
df_color <- subset(df, !(var %in% c("big", "small")))
ggplot(df_color, aes(x = id, y = value, fill = var)) +
geom_bar(stat = "identity", position = position_dodge())
Use stat = "identity" to prevent it from doing a row count. position = position_dodge() is used to place the bars next to each other rather than stacked.
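To make the wide-to-long step concrete, here is a base R sketch of the same reshaping using `stack()`; the result has one row per id and variable, which is the shape the plots above expect. The name `long` is made up for the example.

```r
id <- 1:5
count_big    <- c(15, 25, 7, 0, 12)
count_small  <- c(15, 9, 22, 11, 14)
count_black  <- c(7, 12, 5, 2, 6)
count_yellow <- c(2, 0, 7, 4, 3)
count_red    <- c(8, 4, 4, 2, 5)
count_blue   <- c(5, 9, 6, 1, 7)
count_green  <- c(8, 9, 7, 2, 5)
df <- data.frame(id, count_big, count_small, count_black, count_yellow,
                 count_red, count_blue, count_green)

# Stack the seven count columns; id is recycled once per stacked column.
long <- cbind(id = df$id, stack(df[-1]))
names(long) <- c("id", "value", "var")

# Drop the "count_" prefix so var holds "big", "small", "black", ...
long$var <- sub("^count_", "", long$var)

# Split into the two breakdowns asked about in the question.
long_size  <- subset(long, var %in% c("big", "small"))
long_color <- subset(long, !(var %in% c("big", "small")))
```

Each of the two subsets can then be passed to ggplot() with id on the x-axis, value on the y-axis and var as the fill.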

Plot overlapping vertical lines with ggplot

I have a list of time-ordered pairwise interactions. I want to plot a temporal network of these interactions, which would look something like the diagram below.
My data looks like the example below. The id1 and id2 values are the unique identifiers of individuals. The time indicates when an interaction between those individuals occurred. So at time = 1, I want to plot a connection between individual-1 and individual-2.
id1 <- c(1, 2, 1, 6, 2, 2, 1)
id2 <- c(2, 4, 5, 7, 3, 4, 5)
time <- c(1, 2, 2, 2, 3, 4, 5)
df <- data.frame(id1, id2, time)
According to this StackOverflow question, I can see that it is possible to draw vertical lines between positions on the y-axis in ggplot. This is achieved by reshaping the data into a long format. This is fine when there is only one pair per time value, but not when there is more than one interacting pair at a time. For example in my dummy data, at time = 2, there are three pairs (in the plot I would show these by overlaying lines with reduced opacity).
My question is, how can I organise these data in a way that ggplot will be able to plot potentially multiple interacting pairs at specified time points?
I have been trying to reorganise the data by assigning an extra identifier to each of the multiple pairs that occur at the same time. I imagined the data table to look like this, but I haven't figured out how to make this in R... In this example the three interactions at time = 2 are identified by an extra grouping of either 1, 2 or 3. Even if I could arrange this, I'm still not sure how I would get ggplot to read it.
Ultimately I'm trying to create something that looks like Fig. 2 in this scientific paper.
Any help would be appreciated!
You can do this without reshaping the data, just set one id to y and the other id to yend in geom_curve:
ggplot(df, aes(x = time, y = id1)) +
geom_curve(aes(xend = time, yend = id2), curvature = 0.3) +
geom_hline(yintercept = 1:7, colour = scales::muted("blue")) +
geom_point(size = 3) +
geom_point(aes(y = id2), size = 3) +
coord_cartesian(xlim = c(0, max(df$time) + 1)) +
theme_bw()
Libraries:
library('ggplot2')
library('data.table')
Data:
id1 <- c(1, 2, 1, 6, 2, 2, 1)
id2 <- c(2, 4, 5, 7, 3, 4, 5)
time <- c(1, 2, 2, 2, 3, 4, 5)
df <- data.frame(id1, id2, time)
setDT(df)
# melt() dispatches to the data.table method because df is now a data.table
df1 <- melt(df, id.vars = c('time'))
Plot:
p <- ggplot( df1, aes(time, value)) +
geom_point() +
geom_curve( mapping = aes(x = time, y = id1, xend = time, yend = id2, colour = "curve"),
data = df,
curvature = 0.2 )
print(p)
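The question also mentions overlaying straight lines with reduced opacity. A minimal sketch of that variant, assuming geom_segment() with a fixed alpha in place of geom_curve(), could look like this; the alpha value of 0.4 is an arbitrary choice, and the plot is skipped when ggplot2 is not installed.

```r
id1 <- c(1, 2, 1, 6, 2, 2, 1)
id2 <- c(2, 4, 5, 7, 3, 4, 5)
time <- c(1, 2, 2, 2, 3, 4, 5)
df <- data.frame(id1, id2, time)

# Number of interacting pairs at each time point; time = 2 has three.
pairs_per_time <- table(df$time)

# One semi-transparent vertical segment per row, so overlapping
# interactions at the same time point show up as darker stretches.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = time)) +
    geom_segment(aes(xend = time, y = id1, yend = id2),
                 alpha = 0.4, linewidth = 2) +
    geom_point(aes(y = id1), size = 3) +
    geom_point(aes(y = id2), size = 3) +
    theme_bw()
}
```

The `linewidth` argument requires ggplot2 3.4.0 or later; on older versions `size` plays the same role for segments. No reshaping is needed because each row of the data frame already describes one complete segment.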
