Sum variables in a dataframe and plot the sum in ggplot - r

I have a dataframe that contains data on the number of TVs and radios owned by survey respondents now and before:
DF <- data.frame(TV_now = as.numeric(c(4, 9, 1, 0, 4, NA)),
TV_before = as.numeric(c(4, 1, 2, 4, 5, 2)),
Radio_now = as.numeric(c(4, 5, 1, 5, 6, 9)),
Radio_before = as.numeric(c(6, 5, 3, 6, 7, 10)))
I want to sum the total value of each variable and then create a barplot that shows the number of TVs and radios owned by survey respondents now and before.
I can manually create a new dataframe that contains just the sum of the value of each variable in the original DF
DFsum <- data.frame(TV_now = as.numeric(c(sum(DF$TV_now,na.rm = TRUE))),
TV_before = as.numeric(c(sum(DF$TV_before,na.rm = TRUE))),
Radio_now = as.numeric(c(sum(DF$Radio_now,na.rm = TRUE))),
Radio_before = as.numeric(c(sum(DF$Radio_before,na.rm = TRUE))))
and then use tidyr to do the following:
library(tidyr)
library(ggplot2)
DFsum %>%
gather(key=Device, value=Number) %>%
ggplot(aes(x=Number,fill=Device)) +
geom_bar(aes(x = Device, y = Number), position = "dodge", stat = "identity")
This gives me the result I want, but seems unnecessarily complicated for what should be easy to achieve. Is there an easier way to plot this?

You can simplify your code with dplyr::summarise_all, since you are summarizing all your columns:
library(tidyverse)
library(ggplot2)
# summarise_all() returns one row of column sums (mutate_all() would repeat the sums on every row)
DF %>% summarise_all(sum, na.rm = TRUE) %>%
gather(key=Device, value=Number) %>%
ggplot(aes(x=Device,fill=Device)) +
geom_bar(aes(x = Device, y = Number), position = "dodge", stat = "identity")

Simplify data creation. R knows that 4, 9, 1, etc., are numbers, so you don't need as.numeric.
DF <- data.frame(TV_now = c(4, 9, 1, 0, 4, NA),
TV_before = c(4, 1, 2, 4, 5, 2),
Radio_now = c(4, 5, 1, 5, 6, 9),
Radio_before = c(6, 5, 3, 6, 7, 10))
Simplify the data manipulation. Tidy your data (convert it to long format) first, then do other things:
DF_long = gather(DF, key = "device") %>%
group_by(device) %>%
summarize(number = sum(value, na.rm = TRUE))
Simplify the plotting. Aesthetics are inherited - you don't need to specify them multiple times. geom_col is preferred to geom_bar with stat = "identity". position = "dodge" does nothing when there is one group per x index.
ggplot(DF_long, aes(x = device, y = number, fill = device)) +
geom_col()
I generally prefer to do my own data manipulation, but we can also lean on ggplot's bar stacking to replace the summing (geom_col stacks the individual values within each device, so the bar height equals the sum), making the entire code:
gather(DF, key = "device", value = "number") %>%
ggplot(aes(x = device, y = number, fill = device)) +
geom_col()
Base approach
dev = colSums(DF, na.rm = TRUE)
barplot(dev, col = factor(names(dev)))
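The tidy-then-summarise steps above can also be sketched with tidyr::pivot_longer(), which the tidyr documentation recommends over gather() for new code. This is a minimal sketch, not the answerers' own code; the name DF_sums is an assumption introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

DF <- data.frame(TV_now = c(4, 9, 1, 0, 4, NA),
                 TV_before = c(4, 1, 2, 4, 5, 2),
                 Radio_now = c(4, 5, 1, 5, 6, 9),
                 Radio_before = c(6, 5, 3, 6, 7, 10))

# DF_sums is a hypothetical name: one row per device holding its column total
DF_sums <- DF %>%
  pivot_longer(everything(), names_to = "device", values_to = "number") %>%
  group_by(device) %>%
  summarise(number = sum(number, na.rm = TRUE))

# One bar per device; geom_col() plots the totals as given
p <- ggplot(DF_sums, aes(x = device, y = number, fill = device)) +
  geom_col()
p
```

Summing before plotting keeps the totals available for inspection (for example with print(DF_sums)) instead of leaving them implicit in the stacked bars.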

Related

Order variables geom_point based on similar pattern across x-axis in R

How could I order the variables so that those with a similar pattern are plotted together, as in a clustered heat map? For example: A and D at the top, then B and C, and E at the bottom. I want to avoid doing this manually because the real data has many more variables.
Variable1 <- c(rep("A",7), rep("B",7),rep("C",7), rep("D",7), rep("E",7))
Variable2 <- c(rep(1:7, 5))
value <- c(15, 16, 11, 12, 13, 11, 12, 4, 3, 6, 5, 4, 3, 2, 3, 3, 2, 3, 3, 4, 3, 18, 17, 15, 2, 3, 4, 5, 2, 3, 4, 5, 6, 10, 18)
dff <- data.frame(Variable1, Variable2, value)
library(dplyr)
library(ggplot2)
dff <- dff %>%group_by(Variable1)%>%
mutate(scaled_val = scale(value)) %>%
ungroup()
dff$Variable <- factor(dff$Variable1,levels=rev(unique(dff$Variable1)))
ggplot(dff, aes(x = Variable2, y = Variable1, label=NA)) +
geom_point(aes(size = scaled_val, colour = value)) +
geom_point(aes(size = scaled_val, colour = value), shape=21, colour="black") +
geom_text(hjust = 1, size = 2) +
theme_bw()+
scale_color_gradient(low = "lightblue", high = "darkblue")+
scale_x_discrete(expand=c(1,0))+
coord_fixed(ratio=4)
And desired:
If you look at a heat map with rows clustered by similarity, for example https://3.bp.blogspot.com/-AI2dxe95VHk/TgTJtEkoBgI/AAAAAAAAC5w/XCyBw3qViGA/s400/heatmap_cluster2.png, you see that the rows at the top are those that peak at the first x-axis timepoints, followed by the rows that peak at the last x-axis timepoints.
To do: I wonder whether, using the scaled value, we can put at the top the variables with a higher mean at Variable2 1:2, then those with a higher mean at Variable2 3:5, then those at Variable2 6:7. Let me know if I am not being clear and I can explain better.
It sounds like you want to arrange groups A-E based on their mean. You can do that by converting Variable1 into a factor with custom levels:
lvls <- names(sort(by(dff$value, dff$Variable1, mean)))
dff$Variable1 <- factor(dff$Variable1, levels = lvls)
Here's a solution that sorts groups by which.max:
peaks <- c(by(dff$value, dff$Variable1, which.max))
lvls <- names(sort(peaks))
dff$Variable1 <- factor(dff$Variable1, levels = lvls)
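To see what the which.max ordering produces on the example data, here is a minimal base-R sketch. It uses sapply(split(...)) in place of by(), a swapped-in way to get a named vector of peak positions. The rev() step is an assumption about the desired layout: ggplot draws the first factor level at the bottom of a discrete y axis, so reversing puts the earliest-peaking groups at the top.

```r
Variable1 <- rep(c("A", "B", "C", "D", "E"), each = 7)
Variable2 <- rep(1:7, times = 5)
value <- c(15, 16, 11, 12, 13, 11, 12, 4, 3, 6, 5, 4, 3, 2, 3, 3, 2, 3, 3, 4, 3,
           18, 17, 15, 2, 3, 4, 5, 2, 3, 4, 5, 6, 10, 18)
dff <- data.frame(Variable1, Variable2, value)

# Position (1 to 7) of the largest value within each group
peaks <- sapply(split(dff$value, dff$Variable1), which.max)

# Groups ordered from earliest to latest peak
lvls <- names(sort(peaks))
print(lvls)  # "D" "A" "B" "C" "E"

# Assumption: reverse so that the earliest peaks end up at the top of the y axis
dff$Variable1 <- factor(dff$Variable1, levels = rev(lvls))
```

With these levels, D and A sit at the top, followed by B and C, with E at the bottom, which matches the ordering described in the question.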

Ggplot2: unique() does not work properly with dplyr piping

I have some problems with the unique() function when piping with dplyr. With my simple example code this works fine:
category <- as.factor(c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4))
quality <- as.factor(c(0, 1, 2, 3, 3, 0, 0, 1, 3, 2, 2, 2, 1, 0, 3, 2, 3, 3, 1, 0, 2, 1))
mydata <- data.frame(category, quality)
This adjusts my dataframe so that it is easier to work with and produce a nice plot:
mydata2 <- mydata %>%
group_by(category, quality) %>%
mutate(count_q = n()) %>%
ungroup() %>%
group_by(category) %>%
mutate(tot_q = n(),pc = count_q*100 / tot_q) %>%
unique() %>%
arrange(category)
myplot <- ggplot(mydata2, aes(x = category, y = pc, fill = quality)) +
geom_col() +
geom_text(aes(
x = category,
y = pc,
label = round(pc,digits = 1),
group = quality),
position = position_stack(vjust = .5)) +
ggtitle("test") +
xlab("cat") +
ylab("%") +
labs(fill = "quality")
myplot
Looks exactly like I want:
However, with my actual data the same code produces this mess:
I did find a solution: when I add this line and use the new mydata.unique as the basis for my ggplot, it works exactly like with my example data. This is not needed in the example data for some reason, whereas in my actual data the unique() within piping seems to do nothing.
mydata.unique <- unique(mydata2[c("quality","category", "count_q", "tot_q", "pc")])
What I don't understand is why I need to add the above line. Obviously I can't share my actual data. Maybe someone still understands what this is about. Maybe it has to do with other (irrelevant) columns in the data that can't be processed by unique()?
Try distinct() instead of unique(). In this case, you probably want summarise() rather than mutate() followed by distinct().
If your original df has more variables, try this:
mydata2 <- mydata %>%
group_by(category, quality) %>%
mutate(count_q = n()) %>%
ungroup() %>%
group_by(category) %>%
mutate(tot_q = n(),pc = count_q*100 / tot_q) %>%
distinct(category, quality, count_q, tot_q, pc, .keep_all = TRUE) %>%
arrange(category)
Or, as mentioned by @adalvarez, replace mutate with summarise.
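A minimal sketch of the summarising approach, assuming the real data has extra columns. unique() compares whole rows, so any additional column whose values differ keeps otherwise identical rows distinct. The extra_id column and the name mydata_pc are hypothetical and only stand in for such columns.

```r
library(dplyr)

category <- as.factor(c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4))
quality <- as.factor(c(0, 1, 2, 3, 3, 0, 0, 1, 3, 2, 2, 2, 1, 0, 3, 2, 3, 3, 1, 0, 2, 1))
# extra_id is a hypothetical column that makes every row unique,
# which is enough to stop unique() from collapsing anything
mydata <- data.frame(category, quality, extra_id = seq_along(category))

mydata_pc <- mydata %>%
  count(category, quality, name = "count_q") %>%   # one row per category/quality pair
  group_by(category) %>%
  mutate(tot_q = sum(count_q),                      # rows per category
         pc = count_q * 100 / tot_q) %>%            # percentage within category
  ungroup()
```

Because count() collapses to one row per category and quality pair, the result no longer depends on which other columns the data frame carries, and no unique() or distinct() call is needed.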

Replicate position_fill calculations

This is possibly more of a maths question than a coding one, but this has me really stuck! I have a data frame in the format of this small reproducible example:
library(dplyr)
library(ggplot2)
df <- data.frame(sample = c(rep(1, 3), rep(2, 3), rep(3 , 3), rep(4 , 3)),
condition = c(rep("a", 6), rep("b", 6)),
amino_acid = rep(c("Asx", "Glx", "Ser"), 4),
conc = c(2, 4, 3, 6, 7, 5, 1, 2, 1, 8, 2, 5),
total_conc = c(rep(9, 3), rep(12, 3), rep(4 , 3), rep(15 , 3))) %>%
mutate(pc_conc = conc / total_conc * 100)
I want to look at the change in the proportion of amino_acid concs (i.e. change in pc_conc) with the conditions a and b, which I can do using fun.y = "mean" and position = position_fill():
ggplot(df, aes(x = condition, y = pc_conc, fill = amino_acid)) +
geom_bar(stat = "summary", fun.y = "mean", na.rm = TRUE,
position = position_fill())
What I want is the values of each of these bars (i.e. the average proportion of conc for each amino_acid for each condition). If I average pc_conc in the original dataframe you can see that I don't get the same result:
df2 <- df %>% group_by(condition, amino_acid) %>%
summarise_at(vars(pc_conc), mean)
ggplot(df2, aes(x = condition, y = pc_conc, fill = amino_acid)) +
geom_bar(stat = "summary", fun.y = "identity", na.rm = TRUE)
How can I either extract the values calculated by the position_fill() function from the ggplot object or replicate those calculations so I have the same numbers? I've tried averaging the original conc and total_conc variables and calculating pc_conc but that also doesn't create the normalised percentages I'm looking for.
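One way to replicate the numbers, sketched under the assumption that the summary stat first computes the mean pc_conc per condition and amino_acid, and that position_fill() then rescales the stacked heights so they sum to 1 within each x value. The names df_fill, mean_pc and prop are introduced here for illustration.

```r
library(dplyr)

df <- data.frame(sample = c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)),
                 condition = c(rep("a", 6), rep("b", 6)),
                 amino_acid = rep(c("Asx", "Glx", "Ser"), 4),
                 conc = c(2, 4, 3, 6, 7, 5, 1, 2, 1, 8, 2, 5),
                 total_conc = c(rep(9, 3), rep(12, 3), rep(4, 3), rep(15, 3))) %>%
  mutate(pc_conc = conc / total_conc * 100)

df_fill <- df %>%
  group_by(condition, amino_acid) %>%
  summarise(mean_pc = mean(pc_conc)) %>%     # what fun.y = "mean" computes per bar segment
  group_by(condition) %>%
  mutate(prop = mean_pc / sum(mean_pc)) %>%  # assumed position_fill() step: rescale to sum to 1
  ungroup()
```

In this example the mean percentages for condition a add up to 125 rather than 100, because total_conc for sample 2 is 12 while its three conc values sum to 18. That is why the plain averages differ from the filled bars until they are divided by their per-condition sum.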

Conditional formatting of table in R...a better way?

Trying to improve this code. What I have worked up works but looks ugly and is VERY clumsy.
Looking for a ggplot method or something that is more user friendly. Would appreciate the tips and advice.
library("dplyr")
thi <- data.frame(RH = c(1,1,1,2,2,2,3,3,3), T = c(1,2,3,1,2,3,1,2,3), THI = c(8,8,5,7,5,10,5,8,7))
table_thi <- tapply(thi$THI, list(thi$RH, thi$T), mean) %>% as.table()
x = 1:ncol(table_thi)
y = 1:nrow(table_thi)
centers <- expand.grid(y,x)
image(x, y, t(table_thi),
col = c("lightgoldenrod", "darkgoldenrod", "darkorange"),
breaks = c(5,7,8,9),
xaxt = 'n',
yaxt = 'n',
xlab = '',
ylab = '',
ylim = c(max(y) + 0.5, min(y) - 0.5))
text(round(centers[,2],0), round(centers[,1],0), c(table_thi), col= "black")
mtext(paste(attributes(table_thi)$dimnames[[2]]), at=1:ncol(table_thi), padj = -1)
mtext(attributes(table_thi)$dimnames[[1]], at=1:nrow(table_thi), side = 2, las = 1, adj = 1.2)
abline(h=y + 0.5)
abline(v=x + 0.5)
How about this:
library(dplyr)
library(ggplot2)
thi <- data.frame(
RH = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
T = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
THI = c(8, 8, 5, 7, 5, 10, 5, 8, 7)
)
names(thi) = c('col1', 'col2', 'thi')
ggplot(thi, aes(x = col1, y = col2, fill = factor(thi), label = thi)) +
geom_tile() +
geom_text()
Or depending on whether thi is really factor (discrete) or continuous variable, you may want something like this:
ggplot(thi, aes(x = col1, y = col2, fill = thi, label = thi)) +
geom_tile() +
geom_text(color = 'white')
Note: You probably want to avoid column or variable names that clash with built-in names (e.g. avoid calling something T, because T is a built-in alias for TRUE). In the code above, I renamed the columns of your data.frame.
Since the question says conditional formatting of a table, however, you may want to consider the gt package:
library(gt)
thi %>% gt()
Or this:
thi %>% gt() %>%
data_color(
columns = vars(thi),
colors = scales::col_factor(
palette = "Set1",
domain = NULL
))
Or maybe this:
thi %>% gt() %>%
tab_style(
# older gt versions used cells_styles() and cells_data() here
style = list(
cell_fill(color = "#F9E3D6"),
cell_text(style = "italic")),
locations = cells_body(
columns = thi,
rows = thi <= 7
)
)
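If the goal is to keep the three colour bands from the original image() call, a ggplot version could bin the values first. This is a minimal sketch: the data frame name thi_df, the band column, and the choice to extend the top band to Inf (so the value 10 still gets a colour) are assumptions, and the interval boundaries produced by cut() may not match image() exactly.

```r
library(ggplot2)

thi_df <- data.frame(
  col1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3),   # RH in the question
  col2 = c(1, 2, 3, 1, 2, 3, 1, 2, 3),   # T in the question
  thi  = c(8, 8, 5, 7, 5, 10, 5, 8, 7)
)

# Assumed bands: [5,7], (7,8], (8,Inf]
thi_df$band <- cut(thi_df$thi, breaks = c(5, 7, 8, Inf), include.lowest = TRUE)

p <- ggplot(thi_df, aes(x = factor(col2), y = factor(col1), fill = band, label = thi)) +
  geom_tile(colour = "black") +            # black borders stand in for the abline() grid
  geom_text() +
  scale_fill_manual(values = c("lightgoldenrod", "darkgoldenrod", "darkorange")) +
  labs(x = "T", y = "RH", fill = "THI band")
p
```

Binning with cut() keeps the thresholds in one place, so changing the conditional formatting rule means editing only the breaks vector and the matching colour list.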

Which ggplot2 geom should I use?

I have a data frame.
id <- c(1:5)
count_big <- c(15, 25, 7, 0, 12)
count_small <- c(15, 9, 22, 11, 14)
count_black <- c(7, 12, 5, 2, 6)
count_yellow <- c(2, 0, 7, 4, 3)
count_red <- c(8, 4, 4, 2, 5)
count_blue <- c(5, 9, 6, 1, 7)
count_green <- c(8, 9, 7, 2, 5)
df <- data.frame(id, count_big, count_small, count_black, count_yellow, count_red, count_blue, count_green)
How can I display the following in ggplot2 and which geom should I use:
a breakdown of big and small variable by id
a breakdown of colors by id
This is just a subset of the data set that has around 1000 rows.
Can I use this df in ggplot2, or do I need to transform it into tidy data with tidyr? (don't know data.table yet)
You need to first restructure the data from wide to long with tidyr.
library(tidyr)
library(ggplot2)
df <- gather(df, var, value, starts_with("count"))
# remove count_
df$var <- sub("count_", "", df$var)
# plot big vs small
df_size <- subset(df, var %in% c("big", "small"))
ggplot(df_size, aes(x = id, y = value, fill = var)) +
geom_bar(stat = "identity", position = position_dodge())
# same routine for colors
df_color <- subset(df, !(var %in% c("big", "small")))
ggplot(df_color, aes(x = id, y = value, fill = var)) +
geom_bar(stat = "identity", position = position_dodge())
Use stat = "identity" so that geom_bar plots the values as given instead of counting rows (geom_col() is equivalent shorthand). position = position_dodge() is used to place the bars next to each other rather than stacked.
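The reshaping step can also be sketched with tidyr::pivot_longer(), which can strip the count_ prefix while pivoting. This is an alternative to the gather() and sub() calls above, not the answerer's own code; the name df_long is introduced here so the wide data frame is not overwritten.

```r
library(tidyr)

df <- data.frame(id = 1:5,
                 count_big = c(15, 25, 7, 0, 12),
                 count_small = c(15, 9, 22, 11, 14),
                 count_black = c(7, 12, 5, 2, 6),
                 count_yellow = c(2, 0, 7, 4, 3),
                 count_red = c(8, 4, 4, 2, 5),
                 count_blue = c(5, 9, 6, 1, 7),
                 count_green = c(8, 9, 7, 2, 5))

# Wide to long: one row per id and variable, with "count_" removed from the names
df_long <- pivot_longer(df,
                        cols = starts_with("count_"),
                        names_to = "var",
                        values_to = "value",
                        names_prefix = "count_")

# Split into the two breakdowns asked about in the question
df_size <- subset(df_long, var %in% c("big", "small"))
df_color <- subset(df_long, !(var %in% c("big", "small")))
```

Either subset can then be passed to the same ggplot() call shown above, with fill = var.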
