I have a data frame.
id <- c(1:5)
count_big <- c(15, 25, 7, 0, 12)
count_small <- c(15, 9, 22, 11, 14)
count_black <- c(7, 12, 5, 2, 6)
count_yellow <- c(2, 0, 7, 4, 3)
count_red <- c(8, 4, 4, 2, 5)
count_blue <- c(5, 9, 6, 1, 7)
count_green <- c(8, 9, 7, 2, 5)
df <- data.frame(id, count_big, count_small, count_black, count_yellow, count_red, count_blue, count_green)
How can I display the following in ggplot2 and which geom should I use:
a breakdown of big and small variable by id
a breakdown of colors by id
This is just a subset of the data set that has around 1000 rows.
Can I use this df in ggplot2, or do I need to transform it into tidy data with tidyr? (don't know data.table yet)
You need to first restructure the data from wide to long with tidyr.
library(tidyr)
library(ggplot2)
df <- gather(df, var, value, starts_with("count"))
# remove count_
df$var <- sub("count_", "", df$var)
# plot big vs small
df_size <- subset(df, var %in% c("big", "small"))
ggplot(df_size, aes(x = id, y = value, fill = var)) +
geom_bar(stat = "identity", position = position_dodge())
# same routine for colors
df_color <- subset(df, !(var %in% c("big", "small")))
ggplot(df_color, aes(x = id, y = value, fill = var)) +
geom_bar(stat = "identity", position = position_dodge())
Use stat = "identity" so that the bar heights are taken from value instead of a row count. position = position_dodge() places the bars next to each other rather than stacking them.
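If tidyr is not available, the same wide-to-long reshape can be sketched in base R. This is only an illustration, assuming every measure column starts with count_; the names count_cols and df_long are made up for this example.

```r
# Rebuild the example data from the question.
df <- data.frame(
  id = 1:5,
  count_big = c(15, 25, 7, 0, 12),
  count_small = c(15, 9, 22, 11, 14),
  count_black = c(7, 12, 5, 2, 6),
  count_yellow = c(2, 0, 7, 4, 3),
  count_red = c(8, 4, 4, 2, 5),
  count_blue = c(5, 9, 6, 1, 7),
  count_green = c(8, 9, 7, 2, 5)
)

# Assumption: all measure columns share the "count_" prefix.
count_cols <- grep("^count_", names(df), value = TRUE)

# One row per (id, var) pair, which is the long layout the plots rely on.
df_long <- data.frame(
  id = rep(df$id, times = length(count_cols)),
  var = rep(sub("^count_", "", count_cols), each = nrow(df)),
  value = unlist(df[count_cols], use.names = FALSE)
)

head(df_long)
```

The resulting df_long has the same id, var and value columns as the gather() result, so the subset() and ggplot() calls should work on it unchanged.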
I have the following dataframe
df <- data.frame(
x = rep(c(2, 5, 7, 9, 12), 4),
y = rep(c(1, 2), each = 10),
col1 = rexp(20),
col2 = rnorm(20),
col3 = rexp(20)
)
And this is the plot
ggplot(df, aes(x, y, fill = col2)) + geom_tile()
I want to make an interactive chart where I can use a slider to change the target column (switching between column 1, 2 and 3).
Thanks in advance.
You can use the frame aesthetic with the ggplotly function from plotly to make an interactive slider for your target (I am not sure what your target feature is), like this:
library(plotly)
df <- data.frame(
x = rep(c(2, 5, 7, 9, 12), 4),
y = rep(c(1, 2), each = 10),
col1 = rexp(20),
col2 = rnorm(20),
col3 = rexp(20)
)
df$target <- rep(sample(c(1:3), 2), 10)
plot <- ggplot(df, aes(x, y, fill = col2, frame = target)) + geom_tile()
ggplotly(plot)
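Note that the slider in this answer steps through the values of target, not through col1, col2 and col3. One possible way to switch between the columns themselves is to stack them into long format and use the column name as the frame. The sketch below assumes that approach; the names cols, df_long, column and value are introduced for this example only, and the plotting step runs only if ggplot2 and plotly are installed.

```r
set.seed(1)  # only to make the random example data reproducible
df <- data.frame(
  x = rep(c(2, 5, 7, 9, 12), 4),
  y = rep(c(1, 2), each = 10),
  col1 = rexp(20),
  col2 = rnorm(20),
  col3 = rexp(20)
)

# Stack col1..col3 into one value column, keeping track of the source column.
cols <- c("col1", "col2", "col3")
df_long <- data.frame(
  x = rep(df$x, times = length(cols)),
  y = rep(df$y, times = length(cols)),
  column = rep(cols, each = nrow(df)),
  value = unlist(df[cols], use.names = FALSE)
)

# Each distinct value of `column` becomes one slider position.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("plotly", quietly = TRUE)) {
  p <- ggplot2::ggplot(df_long, ggplot2::aes(x, y, fill = value, frame = column)) +
    ggplot2::geom_tile()
  widget <- plotly::ggplotly(p)
}
```

All three columns share one fill scale here, so columns with very different ranges may look washed out; rescaling each column beforehand is one option.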
How could I order the variables so they are plotted like a clustered heat map, with rows that show a similar pattern next to each other, i.e. A and D at the top, then B and C, and E at the bottom? I want to avoid doing it manually, as the real data has many more variables.
Variable1 <- c(rep("A",7), rep("B",7),rep("C",7), rep("D",7), rep("E",7))
Variable2 <- c(rep(1:7, 5))
value <- c(15, 16, 11, 12, 13, 11, 12, 4, 3, 6, 5, 4, 3, 2, 3, 3, 2, 3, 3, 4, 3, 18, 17, 15, 2, 3, 4, 5, 2, 3, 4, 5, 6, 10, 18)
dff <- data.frame(Variable1, Variable2, value)
library(dplyr)
dff <- dff %>%group_by(Variable1)%>%
mutate(scaled_val = scale(value)) %>%
ungroup()
dff$Variable1 <- factor(dff$Variable1, levels = rev(unique(dff$Variable1)))
ggplot(dff, aes(x = Variable2, y = Variable1, label=NA)) +
geom_point(aes(size = scaled_val, colour = value)) +
geom_point(aes(size = scaled_val, colour = value), shape=21, colour="black") +
geom_text(hjust = 1, size = 2) +
theme_bw()+
scale_color_gradient(low = "lightblue", high = "darkblue")+
scale_x_discrete(expand=c(1,0))+
coord_fixed(ratio=4)
And desired:
If you look at a heat map with rows clustered by similarity, for example https://3.bp.blogspot.com/-AI2dxe95VHk/TgTJtEkoBgI/AAAAAAAAC5w/XCyBw3qViGA/s400/heatmap_cluster2.png, you see that the rows at the top are the ones that are high at the first x-axis timepoints, followed by the ones that are higher at the last x-axis timepoints.
To do: I wonder whether, using the scaled value, we can order the rows so that the top ones have the higher mean at Variable2 (1:2), then the ones with the higher mean at Variable2 (3:5), then Variable2 (6:7). Let me know if I am not being clear and I can explain better.
It sounds like you want to arrange groups A-E based on their mean. You can do that by converting Variable1 into a factor with custom levels:
lvls <- names(sort(by(dff$value, dff$Variable1, mean)))
dff$Variable1 <- factor(dff$Variable1, levels = lvls)
Here's a solution that sorts groups by which.max:
peaks <- c(by(dff$value, dff$Variable1, which.max))
lvls <- names(sort(peaks))
dff$Variable1 <- factor(dff$Variable1, levels = lvls)
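As a self-contained sketch of the which.max idea in base R: the block below rebuilds the question's data, finds where each group peaks, and orders the groups by that position. The rev() step is an addition, on the assumption that early-peaking groups should appear at the top; ggplot2 draws the first factor level at the bottom of a discrete y axis.

```r
Variable1 <- rep(c("A", "B", "C", "D", "E"), each = 7)
Variable2 <- rep(1:7, 5)
value <- c(15, 16, 11, 12, 13, 11, 12,
           4, 3, 6, 5, 4, 3, 2,
           3, 3, 2, 3, 3, 4, 3,
           18, 17, 15, 2, 3, 4, 5,
           2, 3, 4, 5, 6, 10, 18)
dff <- data.frame(Variable1, Variable2, value)

# Position (1 to 7) of the maximum value within each group.
peaks <- tapply(dff$value, dff$Variable1, which.max)

# Groups ordered from earliest to latest peak.
lvls <- names(sort(peaks))

# Reverse so that early-peaking groups end up at the top of the y axis.
dff$Variable1 <- factor(dff$Variable1, levels = rev(lvls))
levels(dff$Variable1)
```

With this data the peak order is D, A, B, C, E, which matches the layout asked for in the question. Groups that peak at the same position would keep whatever order sort() gives them, so a secondary criterion may be needed on real data.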
This is possibly more of a maths question than a coding one, but this has me really stuck! I have a data frame in the format of this small reproducible example:
library(dplyr)
library(ggplot2)
df <- data.frame(sample = c(rep(1, 3), rep(2, 3), rep(3 , 3), rep(4 , 3)),
condition = c(rep("a", 6), rep("b", 6)),
amino_acid = rep(c("Asx", "Glx", "Ser"), 4),
conc = c(2, 4, 3, 6, 7, 5, 1, 2, 1, 8, 2, 5),
total_conc = c(rep(9, 3), rep(12, 3), rep(4 , 3), rep(15 , 3))) %>%
mutate(pc_conc = conc / total_conc * 100)
I want to look at the change in the proportion of amino_acid concs (i.e. change in pc_conc) with the conditions a and b, which I can do using fun.y = "mean" and position = position_fill():
ggplot(df, aes(x = condition, y = pc_conc, fill = amino_acid)) +
geom_bar(stat = "summary", fun.y = "mean", na.rm = TRUE,
position = position_fill())
What I want is the values of each of these bars (i.e. the average proportion of conc for each amino_acid for each condition). If I average pc_conc in the original dataframe you can see that I don't get the same result:
df2 <- df %>% group_by(condition, amino_acid) %>%
summarise_at(vars(pc_conc), mean)
ggplot(df2, aes(x = condition, y = pc_conc, fill = amino_acid)) +
geom_bar(stat = "summary", fun.y = "identity", na.rm = TRUE)
How can I either extract the values calculated by the position_fill() function from the ggplot object or replicate those calculations so I have the same numbers? I've tried averaging the original conc and total_conc variables and calculating pc_conc but that also doesn't create the normalised percentages I'm looking for.
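One way to replicate the numbers, sketched in base R: this assumes the plot first averages pc_conc per condition and amino acid, and that position_fill() then rescales the stacked means so that each bar sums to 1. The names means and prop are introduced for this example.

```r
df <- data.frame(
  sample = rep(1:4, each = 3),
  condition = rep(c("a", "b"), each = 6),
  amino_acid = rep(c("Asx", "Glx", "Ser"), 4),
  conc = c(2, 4, 3, 6, 7, 5, 1, 2, 1, 8, 2, 5),
  total_conc = rep(c(9, 12, 4, 15), each = 3)
)
df$pc_conc <- df$conc / df$total_conc * 100

# Step 1: mean pc_conc per condition and amino acid (the summary step).
means <- aggregate(pc_conc ~ condition + amino_acid, data = df, FUN = mean)

# Step 2: rescale within each condition so the segments of a bar sum to 1
# (assumed to be what position_fill() does to the stacked heights).
means$prop <- means$pc_conc / ave(means$pc_conc, means$condition, FUN = sum)

means[order(means$condition, means$amino_acid), ]
```

In this example the means for condition a add up to 125 rather than 100, which is why plotting the averaged pc_conc directly gives different bar heights than the filled plot. If ggplot2 is loaded, layer_data() on the plot object should also expose the computed ymin and ymax of each segment for comparison.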
I have a data frame that contains data on the number of TVs and radios owned by survey respondents now and before:
DF <- data.frame(TV_now = as.numeric(c(4, 9, 1, 0, 4, NA)),
TV_before = as.numeric(c(4, 1, 2, 4, 5, 2)),
Radio_now = as.numeric(c(4, 5, 1, 5, 6, 9)),
Radio_before = as.numeric(c(6, 5, 3, 6, 7, 10)))
I want to sum the total value of each variable and then create a barplot that shows the number of TVs and radios owned by survey respondents now and before.
I can manually create a new dataframe that contains just the sum of the value of each variable in the original DF
DFsum <- data.frame(TV_now = as.numeric(c(sum(DF$TV_now,na.rm = TRUE))),
TV_before = as.numeric(c(sum(DF$TV_before,na.rm = TRUE))),
Radio_now = as.numeric(c(sum(DF$Radio_now,na.rm = TRUE))),
Radio_before = as.numeric(c(sum(DF$Radio_before,na.rm = TRUE))))
and then use tidyr to do the following:
library(tidyr)
library(ggplot2)
DFsum %>%
gather(key=Device, value=Number) %>%
ggplot(aes(x=Number,fill=Device)) +
geom_bar(aes(x = Device, y = Number), position = "dodge", stat = "identity")
This gives me the result I want, but seems unnecessarily complicated for what should be easy to achieve. Is there an easier way to plot this?
You can simplify your code with dplyr::summarise_all, since you are summarizing all of your columns:
library(tidyverse)
library(ggplot2)
DF %>% summarise_all(sum, na.rm = TRUE) %>%
gather(key=Device, value=Number) %>%
ggplot(aes(x=Device,fill=Device)) +
geom_bar(aes(x = Device, y = Number), position = "dodge", stat = "identity")
Simplify data creation. R already treats 4, 9, 1, etc. as numeric, so you don't need as.numeric.
DF <- data.frame(TV_now = c(4, 9, 1, 0, 4, NA),
TV_before = c(4, 1, 2, 4, 5, 2),
Radio_now = c(4, 5, 1, 5, 6, 9),
Radio_before = c(6, 5, 3, 6, 7, 10))
Simplify the data manipulation. Tidy your data (convert it to long format) first, then do other things:
DF_long = gather(DF, key = "device") %>%
group_by(device) %>%
summarize(number = sum(value, na.rm = TRUE))
Simplify the plotting. Aesthetics are inherited - you don't need to specify them multiple times. geom_col is preferred to geom_bar with stat = "identity". position = "dodge" does nothing when there is one group per x index.
ggplot(DF_long, aes(x = device, y = number, fill = device)) +
geom_col()
I generally prefer to do my own data manipulation, but we can also lean on ggplot's stacking of bars to replace the summing, making the entire code:
gather(DF, key = "device", value = "number") %>%
ggplot(aes(x = device, y = number, fill = device)) +
geom_col()
Base approach
dev = colSums(DF, na.rm = TRUE)
barplot(dev, col = factor(names(dev)))
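As a quick check of the base approach, the totals can be computed and inspected before plotting. This is a small sketch; the name DF_sum is introduced here for illustration.

```r
DF <- data.frame(
  TV_now = c(4, 9, 1, 0, 4, NA),
  TV_before = c(4, 1, 2, 4, 5, 2),
  Radio_now = c(4, 5, 1, 5, 6, 9),
  Radio_before = c(6, 5, 3, 6, 7, 10)
)

# Column totals, ignoring the missing TV_now value.
dev <- colSums(DF, na.rm = TRUE)

# Long layout: one row per device, which geom_col() could also use.
DF_sum <- data.frame(device = names(dev), number = unname(dev))
DF_sum
```

Printing DF_sum makes it easy to spot copy-paste slips, such as two devices accidentally sharing the same total.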
Welcome to Tidyville.
Below is a small df showing the populations of cities in Tidyville. Some cities belong to the A state and some the B state.
I wish to highlight the cities that decreased in population in red. Mission accomplished so far.
But there are many states in Tidyville. Is there a way to use ggplot's faceting to show a plot for each state? I'm uncertain because I'm new and I do a little calculation outside the ggplot call to identify the cities that decreased in population.
library(ggplot2)
library(tibble)
t1 <- tibble (
y2001 = c(3, 4, 5, 6, 7, 8, 9, 10),
y2016 = c(6, 3, 9, 2, 8, 2, 11, 15),
type = c("A", "A", "B", "B", "A", "A", "B", "B")
)
years <- 15
y2001 <- t1$y2001
y2016 <- t1$y2016
# Places where 2016 pop'n < 2001 pop'n
yd <- y2016 < y2001
decrease <- tibble (
y2001 = t1$y2001[yd],
y2016 = t1$y2016[yd]
)
# Places where 2016 pop'n >= 2001 pop'n
yi <- !yd
increase <- tibble (
y2001 = t1$y2001[yi],
y2016 = t1$y2016[yi]
)
ggplot() +
# Decreasing
geom_segment(data = decrease, aes(x = 0, xend = years, y = y2001, yend = y2016),
color = "red") +
# Increasing or equal
geom_segment(data = increase, aes(x = 0, xend = years, y = y2001, yend = y2016),
color = "black")
I think this would be much easier if you just put your data in a tidy format like ggplot2 expects. Here's a possible solution using tidyverse functions
library(tidyverse)
t1 %>%
rowid_to_column("city") %>%
mutate(change=if_else(y2016 < y2001, "decrease", "increase")) %>%
gather(year, pop, y2001:y2016) %>%
ggplot() +
geom_line(aes(year, pop, color=change, group=city)) +
facet_wrap(~type) +
scale_color_manual(values=c("red","black"))
Your intermediary steps are unnecessary and lose some of your data. We'll keep what you created first:
t1 <- tibble (
y2001 = c(3, 4, 5, 6, 7, 8, 9, 10),
y2016 = c(6, 3, 9, 2, 8, 2, 11, 15),
type = c("A", "A", "B", "B", "A", "A", "B", "B")
)
years <- 15
But instead of doing all the separating and subsetting, we'll just create a dummy variable for whether or not y2016 >= y2001.
t1$incr <- as.factor(ifelse(t1$y2016 >= t1$y2001, 1, 0))
Then we can move the data argument into the ggplot() call so it is stated only once. We'll use a single geom_segment() call and map the color aesthetic to the dummy variable we created before. Because the variable is mapped to color rather than fill, we pass a vector of colors to scale_color_manual()'s values argument, in level order: 0 (decrease) first, then 1 (increase). Finally, add the facet_grid() call. If you're only faceting on one variable, you put a period on the opposite side of the tilde. Period first means the panels are placed side by side; period last means they are stacked on top of each other.
ggplot(data = t1) +
geom_segment(aes(x = 0, xend = years, y = y2001, yend = y2016, color=incr)) +
scale_color_manual(values=c("red", "black")) +
facet_grid(type~.)
I believe you don't need to create two new datasets, you can add a column to t1.
t2 <- t1
t2$decr <- factor(yd + 0L, labels = c("increase", "decrease"))
I have left the original t1 intact and altered a copy, t2.
Now in order to apply ggplot facets, maybe this is what you are looking for.
ggplot() +
geom_segment(data = t2, aes(x = 0, xend = years, y = y2001, yend = y2016), color = "red") +
facet_wrap(~ decr)
If you want to change the colors, use the new column decr as the value of color. Note that the color argument moves inside aes(): it is now aes(..., color = decr).
ggplot() +
geom_segment(data = t2, aes(x = 0, xend = years, y = y2001, yend = y2016, color = decr)) +
facet_wrap(~ decr)
library(dplyr)
t1 <- mutate(t1, decrease = y2016 < y2001)
ggplot(t1) +
facet_wrap(~type) +
geom_segment(aes(x = 0, xend = years, y = y2001, yend = y2016, colour = decrease))
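To make the red/black mapping independent of factor level order, one option is a named colour vector. The sketch below is an illustration rather than any answerer's exact method: the names change and change_colours are introduced here, a plain data frame stands in for the tibble, and the plot is built only if ggplot2 is installed.

```r
t1 <- data.frame(
  y2001 = c(3, 4, 5, 6, 7, 8, 9, 10),
  y2016 = c(6, 3, 9, 2, 8, 2, 11, 15),
  type = c("A", "A", "B", "B", "A", "A", "B", "B")
)
years <- 15

# Flag cities whose population fell between 2001 and 2016.
t1$change <- ifelse(t1$y2016 < t1$y2001, "decrease", "increase")

# Named vector: each label is tied to its colour regardless of level order.
change_colours <- c(decrease = "red", increase = "black")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # One segment per city, coloured by change, one panel per state.
  p <- ggplot(t1) +
    geom_segment(aes(x = 0, xend = years, y = y2001, yend = y2016,
                     colour = change)) +
    scale_colour_manual(values = change_colours) +
    facet_wrap(~type)
}
```

Calling print(p) draws the plot. Because the flag lives in the same data frame as type, faceting needs no separate decrease and increase tables.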