I am trying to create a stacked bar graph in R. I need the bar graph to display three things:
Y axis = Count
X axis bars = Passed Driving Test (Yes or No)
Colours within X axis bars = Owns a car (Yes or No)
So my desired output is:
However, my actual output is:
My code so far is:
carData <- read.csv(file = "~/Desktop/carData.csv", header = TRUE, sep = ";")
ggplot(carData, aes(x = passed.test, fill = owns.car)) + geom_bar()
The passed.test values in the CSV file are either 1 or 0. (1 = passed, 0 = not passed)
The owns.car values in the CSV file are either 1 or 0. (1 = owns a car, 0 = doesn't own a car)
How do I:
A. Add colours to the bar graph to show the second variable (Owns a car = Yes or No)
B. Change the X axis to be 'Yes' and 'No', rather than -0.5 -1.5
You want to make both those columns into factors. Otherwise, numeric values are assumed to be continuous: geom_bar cannot split each bar by a continuous owns.car, so no fill colours appear, and the x axis gets a continuous scale instead of two categories.
library(tidyverse)
set.seed(1234)
carData <- tibble(
  passed.test = sample(c(0, 1), 100, replace = TRUE),
  owns.car = sample(c(0, 1), 100, replace = TRUE)
)
cars_factors <- mutate_all(carData, as.factor)
ggplot(cars_factors, aes(x = passed.test, fill = owns.car)) +
geom_bar() +
scale_x_discrete(labels = c("No", "Yes")) +
scale_fill_discrete(labels = c("No", "Yes"))
Created on 2018-04-28 by the reprex package (v0.2.0).
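As a variation on the above, the "No"/"Yes" labels can be attached when the factors are created, so the scale calls are not needed. This is only a sketch: the simulated data stands in for the CSV, and the axis and legend titles are assumptions taken from the question's description.

```r
library(ggplot2)

set.seed(1234)
# Simulated stand-in for carData.csv (assumption: columns hold 0/1 codes)
carData <- data.frame(
  passed.test = sample(c(0, 1), 100, replace = TRUE),
  owns.car = sample(c(0, 1), 100, replace = TRUE)
)

# Convert the 0/1 codes to labelled factors in one step
carData$passed.test <- factor(carData$passed.test, levels = c(0, 1), labels = c("No", "Yes"))
carData$owns.car <- factor(carData$owns.car, levels = c(0, 1), labels = c("No", "Yes"))

# Stacked bars: one bar per passed.test level, coloured by owns.car
p <- ggplot(carData, aes(x = passed.test, fill = owns.car)) +
  geom_bar() +
  labs(x = "Passed driving test", y = "Count", fill = "Owns a car")
p
```

Because the labels live in the data, any table or summary built from carData shows "No" and "Yes" as well.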
I am, in R and using ggplot2, plotting the development over time of several variables for several groups in my sample (days of the week, to be precise). An artificial sample (using long data suitable for plotting) is this:
library(tidyverse)
groups1 <- rep(1:2, each = 7 * 100)
groups2 <- rep(rep(1:7, times = 2), each = 100)
x <- rep(1:100, times = 14)
values <- c(rnorm(n = 700), rgamma(n = 700, shape = 2))
data <- tibble(x, groups1, groups2, values)
data %>% ggplot(mapping = aes(x = x, y = values)) + geom_line() + facet_grid(groups2 ~ groups1)
which gives
In this example, the first variable -- shown in the left column -- has unlimited range, while the second variable -- shown in the right column -- is weakly positive.
I would like to reflect this in my plot by allowing the Y axes to differ across the columns in this plot, i.e. set Y axis limits separately for the two variables plotted. However, in order to allow for easy visual comparison of the different groups for each of the two variables, I would also like to have the identical Y axes within each column.
I've looked at the scales option to facet_grid(), but it does not seem to be able to do what I want. Specifically,
passing scales = "free_y" allows the Y axes to vary across rows, while
passing scales = "free_x" allows the X axes to vary across columns, but
there is no option to allow the Y axes to vary across columns (nor, presumably, the X axes across rows).
As usual, my attempts to find a solution have yielded nothing. Thank you very much for your help!
I think the easiest would be to create a plot per facet column and bind them with something like {patchwork}. To get the facet look, you can still add a faceting layer.
library(tidyverse)
library(patchwork)
groups1 <- rep(1:2, each = 7 * 100)
groups2 <- rep(rep(1:7, times = 2), each = 100)
x <- rep(1:100, times = 14)
set.seed(42) ## always better to set a seed before using random functions
values <- c(rnorm(n = 700), rgamma(n = 700, shape = 2))
data <- tibble(x, groups1, groups2, values)
data %>%
group_split(groups1) %>%
map({
~ggplot(.x, aes(x = x, y = values)) +
geom_line() +
facet_grid(groups2 ~ groups1)
}) %>%
wrap_plots()
Created on 2023-01-11 with reprex v2.0.2
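If a single ggplot object is preferred over combined plots, one possible alternative (not the approach used above) is facet_wrap() with free Y scales plus invisible geom_blank() layers that stretch every panel to its column's range. This is a sketch under assumptions: the helper columns lo and hi are made up for illustration, and the two-column layout relies on facet_wrap() ordering panels by groups2 first and groups1 second.

```r
library(dplyr)
library(ggplot2)

set.seed(42)
data <- data.frame(
  x = rep(1:100, times = 14),
  groups1 = rep(1:2, each = 7 * 100),
  groups2 = rep(rep(1:7, times = 2), each = 100),
  values = c(rnorm(n = 700), rgamma(n = 700, shape = 2))
)

# Attach the per-column (groups1) range to every row
limits <- data %>%
  group_by(groups1) %>%
  mutate(lo = min(values), hi = max(values)) %>%
  ungroup()

# Free Y scales per panel; the blank layers force each panel
# to cover the full range of its groups1 column
p <- ggplot(limits, aes(x = x, y = values)) +
  geom_line() +
  geom_blank(aes(y = lo)) +
  geom_blank(aes(y = hi)) +
  facet_wrap(groups2 ~ groups1, ncol = 2, scales = "free_y")
p
```

The trade-off is that every panel draws its own Y axis, whereas the patchwork version keeps one axis per column.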
I have data (from Excel) with the y-axis as ranges (also calculated in Excel) and the x-axis as cell counts, and I would like to draw a horizontal line at a specific value in the range, like a reference line. I tried using geom_hline(yintercept = 450), but I am sure that is quite naive and does not work that way for a number inside a range. I wonder if there are any better suggestions for it :)
plot.new()
library(ggplot2)
d <- read.delim("C:/Users/35389/Desktop/R.txt", sep = "\t")
head(d)
d <- cbind(row.names(d), data.frame(d), row.names=NULL)
d
g <- ggplot(d, aes(d$CTRL,d$Bin.range))+ geom_col()
g + geom_hline(yintercept = 450)
First of all, have a look at my comments.
Second, this is how I suggest you proceed: don't calculate those ranges in Excel. Let ggplot do it for you.
Say, your data is like this:
df <- data.frame(x = runif(100, 0, 500))
head(df)
#> x
#>1 322.76123
#>2 57.46708
#>3 223.31943
#>4 498.91870
#>5 155.05416
#>6 107.27830
Then you can make a plot like this:
library(ggplot2)
ggplot(df) +
geom_histogram(aes(x = x),
boundary = 0,
binwidth = 50,
fill = "steelblue",
colour = "white") +
geom_vline(xintercept = 450, colour = "red", linetype = 2, size = 1) +
coord_flip()
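To see the counts behind such a histogram without going back to Excel, the same 50-wide bins can be computed with cut() and table(). This is only a sketch: the seed and the names bins and counts are assumptions for illustration.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
df <- data.frame(x = runif(100, 0, 500))

# Assign each value to a 50-wide bin starting at 0
bins <- cut(df$x, breaks = seq(0, 500, by = 50), include.lowest = TRUE)

# Number of observations per bin
counts <- table(bins)
counts
```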
We don't have your data, but the following data frame is of a similar structure:
d <- data.frame(CTRL = sample(100, 10),
Bin.range = paste(0:9 * 50, 0:9 * 50 + 49.9, sep = "-"))
The first thing to note is that your y axis does not have your ranges ordered correctly. You have 50-99.9 at the top of the y axis. This is because your ranges are stored as characters and ggplot will automatically arrange these alphabetically, not numerically. So you need to reorder the factor levels of your ranges:
d$Bin.range <- factor(d$Bin.range, d$Bin.range)
When you create your plot, don't use d$Bin.range, but instead just use Bin.range. ggplot knows to look for this variable in the data frame you have passed.
g <- ggplot(d, aes(CTRL, Bin.range)) + geom_col()
If you want to draw a horizontal line, your two options are to specify the y axis label at which you want to draw the line (i.e. yintercept = "400-449.9") or, which is what I suspect you want, use a numeric value of 9.5 which will put it between the top two values:
g + geom_hline(yintercept = 9.5, linetype = 2)
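Rather than hard-coding 9.5, the position can be derived from the factor levels, since ggplot places the discrete levels at the integers 1, 2, 3 and so on. A minimal sketch, assuming the example data frame above; the seed and the name cutoff are made up for illustration.

```r
library(ggplot2)

set.seed(1)  # assumed seed for reproducibility
d <- data.frame(CTRL = sample(100, 10),
                Bin.range = paste(0:9 * 50, 0:9 * 50 + 49.9, sep = "-"))
d$Bin.range <- factor(d$Bin.range, levels = d$Bin.range)

# Index of the bin that ends just below 450, plus half a step
cutoff <- match("400-449.9", levels(d$Bin.range)) + 0.5

g <- ggplot(d, aes(CTRL, Bin.range)) +
  geom_col() +
  geom_hline(yintercept = cutoff, linetype = 2)
g
```

This keeps the line in the right place if bins are later added or removed below it.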
I'm trying to display an "absence/presence" heatmap with geom_tile in R. I would like to have a fill for "1" or "present" if a feature (here: OTU) can be found in at least one of the samples within a group. So below is the example code, where I grouped the samples by site:
library(reshape2)
library(ggplot2)
df <- data.frame(
OTU = c("OTU001", "OTU002", "OTU003", "OTU004", "OTU005"),
Sample1 = c(0,0,1,1,0),
Sample2 = c(1,0,0,1,0),
Sample3 = c(1,1,0,1,0),
Sample4 = c(1,1,1,1,0))
molten_df <- melt(df)
# add group data
sites <- data.frame(
site = c(rep("site_A", 10), rep("site_B", 10)))
molten_df2 <- cbind(molten_df, sites)
# plot heatmap based on group variable sites
ggplot(molten_df2, aes(x = site, y = OTU, fill = value)) +
geom_tile()
The tile (site_A, OTU003) consists of the values Sample1 = 1 and Sample2 = 0, and the outcome is 0. On the other hand, the tile (site_B, OTU003) has Sample3 = 0 and Sample4 = 1, but it turns out as 1. Maybe it uses the last value for the fill? As I would like to display 1 if an OTU appears in any of the grouped samples regardless of the order, I wondered if anyone knows how to do this within ggplot2?
The other way I thought of (but failed coding) is to write a function that sets the remaining values of a given tile to 1, if at least one 1 appears.
With library dplyr, you can create a new variable indicating if OTU at a given site is present in, at least, one sample :
library(dplyr)
tmp <- group_by(molten_df2, OTU, site) %>%
  summarise(PA = as.factor(ifelse(sum(value) > 0, 1, 0)))
Then plot :
ggplot(tmp, aes(x = site, y = OTU, fill = PA)) +
geom_tile()
Or directly inside the ggplot function :
ggplot(group_by(molten_df2,OTU, site) %>%
summarise(., PA=factor(ifelse(sum(value)>0,1,0))),
aes(x = site, y = OTU, fill =PA)) +
geom_tile()
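The same "present in at least one sample" rule can also be written in base R by taking the maximum per OTU and site with aggregate(). This is a sketch of an alternative, not the method above: the long table is built by hand instead of with melt(), and the names long and pa are assumptions for illustration.

```r
library(ggplot2)

df <- data.frame(
  OTU = c("OTU001", "OTU002", "OTU003", "OTU004", "OTU005"),
  Sample1 = c(0, 0, 1, 1, 0),
  Sample2 = c(1, 0, 0, 1, 0),
  Sample3 = c(1, 1, 0, 1, 0),
  Sample4 = c(1, 1, 1, 1, 0))

# Long format: one row per OTU and sample
long <- data.frame(
  OTU = rep(df$OTU, times = 4),
  sample = rep(names(df)[-1], each = 5),
  value = unlist(df[-1], use.names = FALSE))
long$site <- ifelse(long$sample %in% c("Sample1", "Sample2"), "site_A", "site_B")

# max() is 1 as soon as any sample in the group has a 1
pa <- aggregate(value ~ OTU + site, data = long, FUN = max)
pa$PA <- factor(pa$value)

ggplot(pa, aes(x = site, y = OTU, fill = PA)) +
  geom_tile()
```

With one row per tile, the fill no longer depends on the order in which overlapping tiles are drawn.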
So I want to create a stacked bar chart, with frequency counts printed for each fill factor.
Showing data values on stacked bar chart in ggplot2
This question places the counts in the center of each segment, but the user specifies the values. In this example we don't input the specific values, and I am seeking an R function that automatically calculates the counts.
Take the following data for example.
set.seed(123)
a <- sample(1:4, 50, replace = TRUE)
b <- sample(1:10, 50, replace = TRUE)
data <- data.frame(a,b)
data$c <- cut(data$b, breaks = c(0,3,6,10), right = TRUE,
labels = c ("M", "N", "O"))
head(data)
ggplot(data, aes(x = a, fill = c)) + geom_bar(position="fill")
So I want to print a "n= .." for M,N and O value in 1,2,3 and 4
So the end result looks like
Similar to this question, however we do not have fr
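Try the following:
# p is assumed to be the plot from the question
p <- ggplot(data, aes(x = a, fill = c)) + geom_bar(position = "fill")
obj <- ggplot_build(p)$data[[1]]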
# some tricks for getting centered the y-positions:
library(dplyr)
y_cen <- obj[["y"]]
y_cen[is.na(y_cen)] <- 0
y_cen <- (y_cen - lag(y_cen))/2 + lag(y_cen)
y_cen[y_cen == 0 | y_cen == .5] <- NA
p + annotate("text", x = obj[["x"]], y = y_cen, label = paste("N = ", obj[["count"]]))
Which gives:
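A different route, not the one shown above, is to let ggplot2 compute the labels itself: geom_text() with stat = "count" counts the rows per segment, and position_fill(vjust = 0.5) centres each label in its segment. This is a sketch that assumes ggplot2 3.3.0 or later, where after_stat() is available.

```r
library(ggplot2)

set.seed(123)
a <- sample(1:4, 50, replace = TRUE)
b <- sample(1:10, 50, replace = TRUE)
data <- data.frame(a, b)
data$c <- cut(data$b, breaks = c(0, 3, 6, 10), right = TRUE,
              labels = c("M", "N", "O"))

p <- ggplot(data, aes(x = a, fill = c)) +
  geom_bar(position = "fill") +
  # Same counting stat as geom_bar; labels sit mid-segment
  geom_text(aes(label = paste0("n = ", after_stat(count))),
            stat = "count",
            position = position_fill(vjust = 0.5))
p
```

Since both layers use the same stat, the labels stay in step with the bars if the data changes.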
For now, I'm just using something like this:
test_data$level <- rep("", nrow(test_data))
test_data[test_data$value <= 1, ]$level <- "1"
test_data[test_data$value > 1 & test_data$value <= 2, ]$level <- "2"
...
test_data[test_data$value > 4 & test_data$value <= 5, ]$level <- "5"
Just wondering if there's a better way to do this in R, or a way to simply apply some scale argument via ggplot2 to do the categorizing.
There could be a couple of approaches to this, so it was hard to phrase my question exactly. Here's the gist... I have data something like so:
set.seed(123)
test_data <- data.frame(var1 = rep(LETTERS[1:3], each = 5),
var2 = rep(letters[1:5], 3),
value = runif(30, 1, 5))
test_data
var1 value
1 A 2.150310
2 B 4.153221
3 C 2.635908
4 D 4.532070
5 E 4.761869
6 F 1.182226
7 G 3.112422
8 H 4.569676
9 I 3.205740
10 J 2.826459
I have a lot more data points, and am plotting something like this:
library(ggplot2)
p <- ggplot(test_data, aes(x = var1, y = var2, colour = value))
p <- p + geom_jitter(position = position_jitter(width = 0.1, height = 0.1))
p
Which gives something like so:
My actual data is from a subjective evaluation with 1-5 ratings, but I've bundled similar questions together and averaged them together so they're no longer integers.
I'm plotting the ratings per factor combination to visualize which combinations yielded higher ratings. The default continuous scale doesn't really "pop" and I'd like to get the color scale to treat "bins" of these values (0-1, 1-2, ... 4-5) to be colored like scale_colour_discrete does for factors.
So, my question(s):
1) Is it possible with ggplot2 to "bin" these somehow via scale_colour_continuous so I can get the default factor level coloring scheme to apply even though this is continuous data?
2) If not, is there an easier way to create a new vector where I substitute numbers/letters for my values based on criteria? I'm a bit of an R novice, so I wasn't sure except a bunch of if() or conditional statements (test_data[test_data > 0 & test_data < 1, "values"] <- "a" or something like that).
The easiest solution is to do
ggplot(transform(test_data, Discrete = cut(value, seq(0, 5, 1), include.lowest = TRUE)), ...)
Now the data frame passed to ggplot includes a factor column, Discrete, based on the column value, so you can use aes(..., color = Discrete, ...) JUST in the context of your ggplot call. test_data itself is left unchanged once you are done plotting.
To keep a discrete column, of course, your best option is:
test_data$Discrete <- cut(test_data$value, seq(0, 5, 1), include.lowest = TRUE)
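Put together as a runnable sketch, the cut() approach might look like the following. The column name level, the labels 1 to 5 and the Set1 palette are assumptions for illustration; the bins (0,1], (1,2], ..., (4,5] mirror the manual assignments from the question.

```r
library(ggplot2)

set.seed(123)
test_data <- data.frame(var1 = rep(LETTERS[1:3], each = 5),
                        var2 = rep(letters[1:5], 3),
                        value = runif(30, 1, 5))

# Bin the averaged ratings into five right-closed intervals
test_data$level <- cut(test_data$value, breaks = 0:5, labels = 1:5)

# A factor gets the discrete colour scheme; drop = FALSE keeps empty bins in the legend
p <- ggplot(test_data, aes(x = var1, y = var2, colour = level)) +
  geom_jitter(width = 0.1, height = 0.1) +
  scale_colour_brewer(palette = "Set1", drop = FALSE)
p
```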
You can switch from the colour bar legend to the discrete-style legend.
library(RColorBrewer) # for brewer.pal
ggplot(test_data, aes(x = var1, y = var2, colour = value)) +
  geom_jitter(position = position_jitter(width = 0.1, height = 0.1)) +
scale_colour_gradientn(guide = 'legend', colours = brewer.pal(n = 5, name = 'Set1'))
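Newer ggplot2 releases (3.3.0 and later) also ship binned colour scales, which bin a continuous variable at the scale level without changing the data. A hedged sketch using scale_colour_viridis_b(); the breaks and limits chosen here are assumptions that match the 1-5 rating bins.

```r
library(ggplot2)

set.seed(123)
test_data <- data.frame(var1 = rep(LETTERS[1:3], each = 5),
                        var2 = rep(letters[1:5], 3),
                        value = runif(30, 1, 5))

# value stays numeric; the scale maps it to one colour per bin
p_binned <- ggplot(test_data, aes(x = var1, y = var2, colour = value)) +
  geom_jitter(width = 0.1, height = 0.1) +
  scale_colour_viridis_b(breaks = 1:4, limits = c(0, 5))
p_binned
```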
Literally as I posted an update with my current method, I thought of another way to do this...
p <- ggplot(test_data, aes(x = var1, y = var2, colour = factor(value)))
p <- p + geom_jitter(position = position_jitter(width = 0.1, height = 0.1))
p <- p + scale_colour_discrete(breaks = 1:5)
p
Stupidly simple; just force the continuous values to be treated like individual factor levels and then control the colour scale via breaks with ggplot2. I see there's some other answers as well, though I'm not familiar with the methods, so I guess I'll let upvotes decide the best answer.