R: suitable plot to display data with skewed counts

I have data like:
Name Count
Object1 110
Object2 111
Object3 95
Object4 40
...
Object2000 1
Only the first 3 objects have high counts; the remaining 1996 objects have fewer than 40, and the majority have fewer than 10. I am plotting this data as a ggplot bar chart:
ggplot(data=object_count, mapping = aes(x=object, y=count)) +
geom_bar(stat="identity") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
In the resulting plot, because there are so many objects with low counts, the graph is very wide and each bar is tiny, so the bars for the high-count objects are almost invisible. Is there a better way to represent this data? My goal is to show a few top-count objects and to show that there are many low-count ones. Is there a way to group the low-count ones together?

My guess is that your data looks something like this:
library(tidyverse)  # provides tibble(), the dplyr verbs, ggplot2 and forcats
set.seed(1)
object_count <- tibble(
  obj_num = 1:2000,
  object = paste0("Object", obj_num),
  count = ceiling(20 * rpois(2000, 10) / obj_num)
)
head(object_count)
## A tibble: 6 x 3
# obj_num object count
# <int> <chr> <dbl>
#1 1 Object1 160
#2 2 Object2 100
#3 3 Object3 46
#4 4 Object4 55
#5 5 Object5 56
#6 6 Object6 40
Sure enough, when I plot that with ggplot(object_count, aes(object, count)) + geom_col() + [theme stuff] I get a similar figure.
Here are some strategies "to show a few top-count objects and to show there are many low-count ones."
Histogram
A vanilla histogram might not be clarifying here, since the important big values appear dramatically less often and would not be prominent enough:
ggplot(object_count, aes(count)) +
geom_histogram()
But we could change that by transforming the y axis to bring more emphasis to small values. The pseudo_log transformation is nice for that since it works like a log transform for large values but is approximately linear between -1 and 1, so it can handle zero. In this view, we can clearly see where the outliers with just one appearance are, but also see that there are many more small values. The binwidth = 1 here could be set to something wider if the specific values of the big counts aren't as important as their general range.
ggplot(object_count, aes(count)) +
geom_histogram(binwidth = 1) +
scale_y_continuous(trans = "pseudo_log",
breaks = c(0:3, 100, 1000), minor_breaks = NULL)
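To see why the pseudo-log axis keeps empty and single-count bins visible, it can help to look at the transform on its own. This is only a sketch: it assumes the transform has the form asinh(x / (2 * sigma)) / log(base), which is my understanding of what the scales package uses with its defaults, and the helper name pseudo_log is made up for illustration.

```r
# Hypothetical helper, assumed to mirror the default pseudo-log transform.
pseudo_log <- function(x, sigma = 1, base = exp(1)) {
  asinh(x / (2 * sigma)) / log(base)
}

x <- c(0, 1, 10, 100, 1000)
comparison <- data.frame(
  x = x,
  pseudo_log = pseudo_log(x),
  log = log(x)  # -Inf at 0, which is why a plain log axis cannot show empty bins
)
print(comparison)
```

For large x the two columns agree closely, while at 0 the pseudo-log value is 0 instead of -Inf.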
Faceting
Another option could be to split your view into two pieces, one with detail on the big values, the other showing all the small values:
object_count %>%
mutate(biggies = if_else(count > 20, "Big", "Little")) %>%
ggplot(aes(obj_num, count)) +
geom_col() +
facet_grid(~biggies, scales = "free")
Lumping
Another option might be to lump together all the counts under 10. The version below emphasizes the object name and count, and the "Others" category has been labeled to show how many values it includes.
object_count %>%
mutate(group = if_else(count < 10, "Others", object)) %>%
group_by(group) %>%
summarize(avg = mean(count), count = n()) %>%
ungroup() %>%
mutate(group = if_else(group == "Others",
paste0("Others (n = ", count, ")"),
group)) %>%
mutate(group = forcats::fct_reorder(group, avg)) %>%
ggplot() +
geom_col(aes(group, avg)) +
geom_text(aes(group, avg, label = round(avg, 0)), hjust = -0.5) +
coord_flip()
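The lumping step itself does not depend on dplyr. Here is a minimal base R sketch of the same idea on a small made-up data frame (the values and the threshold of 10 are only for illustration): objects below the threshold are collapsed into a single "Others" row that reports their mean count and how many objects it contains.

```r
# Hypothetical toy data: a few big counts and several small ones.
counts <- data.frame(
  object = paste0("Object", 1:8),
  count  = c(110, 111, 95, 40, 7, 3, 1, 1)
)

threshold <- 10
is_small <- counts$count < threshold

# Keep the big objects as they are and collapse the small ones into one row.
lumped <- rbind(
  counts[!is_small, ],
  data.frame(
    object = sprintf("Others (n = %d)", sum(is_small)),
    count  = mean(counts$count[is_small])
  )
)
print(lumped)
```

Swapping mean() for sum() would show the total count hidden in the tail instead of the typical count, which may be the more useful number depending on the story you want to tell.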
Cumulative count (~Pareto chart)
If you're interested in the share of total count, you might also look at the cumulative count and see how the big values make up a large share:
object_count %>%
mutate(cuml = cumsum(count)) %>%
ggplot(aes(obj_num)) +
geom_tile(aes(y = lag(cuml, default = 0) + count / 2,  # tile centre, so each tile spans lag(cuml) to cuml
height = count))
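If you want the numbers behind the cumulative view, a small base R sketch is below. The counts are made up for illustration; the point is only that each object occupies the interval from the previous cumulative total to its own, and that the cumulative share is the running total divided by the grand total.

```r
# Hypothetical counts, sorted so the biggest objects come first.
counts <- c(110, 111, 95, 40, 7, 3, 1, 1)
ord <- sort(counts, decreasing = TRUE)
cuml <- cumsum(ord)

pareto <- data.frame(
  rank  = seq_along(ord),
  count = ord,
  ymin  = cuml - ord,       # bottom of this object's slice
  ymax  = cuml,             # top of this object's slice
  share = cuml / sum(ord)   # cumulative share of the total count
)
print(pareto)
```

In this toy data the top three objects already account for about 86% of the total.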

Related

How to wrap graphs by categories while keeping the same width of bars with ggplot in R?

I am struggling with using facet_grid() and facet_wrap() with ggplot(). I would like to be able to wrap the different stacked bar charts for every two categories (of the variable Department here) but at the same time have the same width of bars. The first action can be achieved with facet_wrap() while the second one can be achieved with facet_grid(). I would like to combine the advantages of both functions. Do you have any idea how to solve this problem?
The data is:
ID<-c("001","002","003","004","005","006","007","008","009","010","NA","012","013")
Name<-c("Damon Bell","Royce Sellers",NA,"Cali Wall","Alan Marshall","Amari Santos","Evelyn Frye","Kierra Osborne","Mohammed Jenkins","Kara Beltran","Davon Harmon","Kaitlin Hammond","Jovany Newman")
Sex<-c("Male","Male","Male",NA,"Male","Male",NA,"Female","Male","Female","Male","Female","Male")
Age<-c(33,27,29,26,27,35,29,32,NA,25,34,29,26)
UKCountry<-c("Scotland","Wales","Scotland","Wales","Northern Ireland","Wales","Northern Ireland","Scotland","England","Northern Ireland","England","England","Wales")
Department<-c("Sports and travel","Sports and travel","Sports and travel","Health and Beauty Care","Sports and travel","Home and lifestyle","Sports and travel","Fashion accessories","Electronic accessories","Electronic accessories","Health and Beauty Care","Electronic accessories",NA)
The code is:
library(dplyr)
library(ggplot2)
data<-data.frame(ID,Name,Sex,Age,UKCountry,Department)
## Frequency Table
dDepartmentSexUKCountry <- data %>%
filter(!is.na(Department) & !is.na(Sex) & !is.na(UKCountry)) %>%
group_by(Department,Sex,UKCountry) %>%
summarise(Count = n()) %>%
mutate(Total = sum(Count), Percentage = round(Count/Total,3))
## Graph
dDepartmentSexUKCountry %>%
ggplot(aes(x=Sex,
y=Percentage,
fill=UKCountry)) +
geom_bar(stat="identity",
position="fill") +
geom_text(aes(label = paste0(round(Percentage*100,0),"%\n(", Count, ")")),
position=position_fill(vjust=0.5), color="white") +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_text(angle = 45,hjust = 1)) +
#facet_grid(cols = vars(Department),scales = "free", space = "free")
facet_wrap(. ~ Department, scales = "free", ncol = 2)
When using facet_wrap(), I get:
When using facet_grid(), I get:
Ideally, I would like to have (edited on Paint):
I have researched my issue and often I would find one or the other solution but not a combination of both.
Is the following acceptable?
I get this by removing scales = "free" from facet_wrap(). The columns are the same width. You may prefer not to have the open space where one gender does not have any data for the department. However, I think this is easier to read, as the category axis labels are in the same place on each plot (Female on left and Male on right), and this plot clearly conveys that there are some departments where Female or Male customers make no purchases.
Here is the code:
dDepartmentSexUKCountry %>%
ggplot(aes(x=Sex,
y=Percentage,
fill=UKCountry)) +
geom_bar(stat="identity",
position="fill") +
geom_text(aes(label = paste0(round(Percentage*100,0),"%\n(", Count, ")")),
position=position_fill(vjust=0.5), color="white") +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_text(angle = 45,hjust = 1)) +
facet_wrap(. ~ Department, ncol = 2)
Here's an approach that splits the data into a set number of rows and assembles a grid of plots with patchwork. This is necessary because facet_grid won't break data in multiple ways along the same dimension, i.e. it won't break data into groups along the x-axis but also wrap them around into multiple rows, and facet_wrap doesn't have the flexibility of free spacing. This is definitely more complex than it's worth for something small, but it's a process I've used for graphics that need to get a bunch of information together for publishing. It depends on your situation.
The basic idea here is to divvy up what will become bars into 2 rows of panels. It's a bit tricky because each bar is a combination of department and sex (hence using interaction), and you're not splitting by the number of observations, you're splitting by unique identifiers. You could do this different ways, but the way that made sense to me was with rleid to get group numbers, then scale that based on how many rows you need. After that, split and make the same type of plot for what will become each row. You need country to be a factor and for the fill scale to not drop missing factor levels so you can make sure all the plots have the same legend.
rows <- 2
# only difference between data here & OP is I ungrouped it
dept_ids <- dDepartmentSexUKCountry %>%
  ungroup() %>%  # needed so rleid() numbers runs across the whole table, not within each group
  mutate(UKCountry = as.factor(UKCountry),
         id = data.table::rleid(interaction(Department, Sex)),
         row = ceiling(id / max(id) * rows))
dept_ids
#> # A tibble: 9 × 8
#> Department Sex UKCountry Count Total Percentage id row
#> <chr> <chr> <fct> <int> <int> <dbl> <int> <dbl>
#> 1 Electronic accessories Female England 1 2 0.5 1 1
#> 2 Electronic accessories Female Northern Ire… 1 2 0.5 1 1
#> 3 Electronic accessories Male England 1 1 1 2 1
#> 4 Fashion accessories Female Scotland 1 1 1 3 1
#> 5 Health and Beauty Care Male England 1 1 1 4 2
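To make the id and row arithmetic above concrete, here is a small base R sketch. It uses rle() as a stand-in for data.table::rleid(), and the key values are shortened, made-up versions of the Department and Sex combinations, so treat it as an illustration of the idea rather than the exact computation on the real data.

```r
# Hypothetical keys, one per data row, already sorted by department and sex.
key <- c("Electronic.Female", "Electronic.Female", "Electronic.Male",
         "Fashion.Female", "Health.Male")

# Run-length ids: consecutive identical keys share an id (what rleid() returns).
runs <- rle(key)
id <- rep(seq_along(runs$lengths), runs$lengths)

# Spread the ids as evenly as possible over the requested number of panel rows.
rows <- 2
row <- ceiling(id / max(id) * rows)

print(data.frame(key, id, row))
```

With four distinct bars and two rows, the first two bars land in row 1 and the last two in row 2.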
plots <- dept_ids %>%
split(.$row) %>%
purrr::map(function(df) {
ggplot(df, aes(x=Sex,
y=Percentage,
fill=UKCountry)) +
geom_bar(stat="identity",
position="fill") +
geom_text(aes(label = paste0(round(Percentage*100,0),"%\n(", Count, ")")),
position=position_fill(vjust=0.5), color="white") +
theme(axis.ticks.x = element_blank()) +
facet_grid(cols = vars(Department),scales = "free", space = "free") +
scale_fill_discrete(drop = FALSE)
})
patchwork::wrap_plots(plots, nrow = rows, guides = "collect")
One issue with this is that you have duplicate x-axis titles. Since the title in this case is pretty self-explanatory, you could just drop it altogether, or you can turn it off in all the plots' themes, patch them together, and then turn it back on for the last plot. Whatever is last in line going into patchwork's assembly functions is what receives the theme setting.
plots %>%
purrr::map(~. + theme(axis.title.x = element_blank())) %>%
patchwork::wrap_plots(nrow = rows, guides = "collect") +
theme(axis.title.x = element_text())
Like I said, in many cases this will be more work than it's worth, but I tried to keep it flexible enough for larger-scale projects where it does make sense.

Finding multiple peak densities on facet wrapped ggplot for two datasets

I am currently attempting to plot densities of flies on julian dates, per year. The aim is to see when there are peak densities of flies, for two methods of data collection (group 1 and group 2). I have many rows of data, over the course of 10 years, for example, the data set looks like this:
year julian group
2000 214 1
2001 198 1
2001 224 1
2000 189 2
2000 214 2
2001 222 2
2001 259 2
2000 260 2
2000 212 1
Each row is a single observation.
This is my first time plotting using ggplots, so I am confused as to how to plot vertical peak lines for each year.
The code currently looks like this:
data$group <- as.factor(data$group)
plots <- ggplot(data, aes(x = julian, group = group)) +
geom_density(aes(colour = group),adjust = 2) + facet_wrap(~year, ncol = 2)
I have attempted to plot peaks using this code:
geom_vline(data = vline, aes(xintercept = density(data$julian)$x[which.max(density(data$julian)$y)]))
vline <- summarise(group_by(data,year, group=group), density(data$julian, group=group)$x[which.max(density(data$julian)$y)])
vline
However I assume it has found the peak density for all years and all groups.
Please may anyone help advise me on how to plot max densities for each year and group across each facet? Even better if there are multiple peaks, how would I find those, and a quantitative value for the peaks?
Thank you in advance, I am very new to ggplots.
Instead of trying to wrangle all computations into one line of code I would suggest to split it into steps like so. Instead of using your code to find the highest peak I make use of this answer which in principle should also find multiple peaks (see below):
library(dplyr)
library(ggplot2)
fun_peak <- function(x, adjust = 2) {
d <- density(x, adjust = adjust)
d$x[c(F, diff(diff(d$y) >= 0) < 0)]
}
vline <- data %>%
group_by(year, group) %>%
summarise(peak = fun_peak(julian))
#> `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
ggplot(data, aes(x = julian, group = group)) +
geom_density(aes(colour = group), adjust = 2) +
geom_vline(data = vline, aes(xintercept = peak)) +
facet_wrap(~year, ncol = 2)
And here is a small example with multiple peaks based on the example data in the linked answer:
x <- c(1,1,4,4,9)
data <- data.frame(
year = 2000,
julian = rep(c(1,1,4,4,9), 2),
group = rep(1:2, each = 5)
)
data$group <- as.factor(data$group)
vline <- data %>%
group_by(year, group) %>%
summarise(peak = fun_peak(julian, adjust = 1))
#> `summarise()` has grouped output by 'year', 'group'. You can override using the `.groups` argument.
ggplot(data, aes(x = julian, group = group)) +
geom_density(aes(colour = group), adjust = 1) +
geom_vline(data = vline, aes(xintercept = peak)) +
facet_wrap(~year, ncol = 2)
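The question also asked for a quantitative value for each peak. One way to get that, sketched here with a hypothetical helper find_peaks that is not part of the answer above, is to return the density height together with the x position. It uses the same sign-change trick: a peak is a point where the slope goes from non-negative to negative.

```r
# Hypothetical helper: local maxima of y, returned with their x positions.
find_peaks <- function(x, y) {
  # TRUE where the slope switches from non-negative to negative;
  # padded with FALSE at both ends so the index has the same length as y.
  is_peak <- c(FALSE, diff(diff(y) >= 0) < 0, FALSE)
  data.frame(x = x[is_peak], y = y[is_peak])
}

# Toy check on a vector with obvious peaks at positions 2 and 4.
toy <- find_peaks(x = 1:6, y = c(1, 3, 2, 5, 4, 1))
print(toy)

# Same idea on a kernel density estimate: x is the location, y the density height.
d <- density(c(1, 1, 4, 4, 9), adjust = 1)
peaks <- find_peaks(d$x, d$y)
print(peaks)
```

Inside a grouped pipeline, the data frame returned per year and group could then feed both geom_vline() (via x) and a label or table of peak heights (via y).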

r percentage by bin in histogram ggplot

I have a data set like this ->
library(ggplot2)
response <- c("Yes","No")
gend <- c("Female","Male")
purchase <- sample(response, 20, replace = TRUE)
gender <- sample(gend, 20, replace = TRUE)
df <- as.data.frame(purchase)
df <- cbind(df,gender)
so head(df) looks like this ->
purchase gender
1 Yes Female
2 No Male
3 No Female
4 No Female
5 Yes Female
6 No Female
Also, so you can validate my examples, here is table(df) for my particular sampling.
(please don't worry about matching my percentages)
gender
purchase Female Male
No 6 3
Yes 4 7
I want a "histogram" showing Gender, but split by Purchase.
I have gone this far ->
ggplot(df) +
geom_bar(aes(y = (..count..)/sum(..count..)),position = "dodge") +
aes(gender, fill = purchase)
which generates ->
[Figure: histogram with split bins, by percentage, but not at the aggregate level I want]
The Y axis has Percentage as I want, but it has each bar of the chart as a percentage of the whole chart.
What I want is for the two "Female" bars to each be a percentage of their respective "Purchase" group. So in the chart above I would like the four bars to be 66%, 36%, 33%, 64%, in that order.
I have tried with geom_histogram to no avail. I have checked SO, searched, ggplot documentation, and several books.
Regarding the suggestion to look at the previous question about facets; that does work, but I had hoped to keep the chart visually as it is above, as opposed to split into "two charts". So...
Anyone know how to do this?
Thanks.
Try something like this:
library(tidyverse)
df %>%
count(purchase, gender) %>%
ungroup %>%
group_by(gender) %>%
mutate(prop = prop.table(n)) %>%
ggplot(aes(gender, prop, group = purchase)) +
geom_bar(aes(fill = purchase), stat = "identity", position = "dodge")
The first 5 lines create a column prop (for "proportion"), which aggregates across gender.
To get there, you first count each purchase by gender (similar to the output of table(df)). Ungrouping then regrouping only by gender gives the aggregation we want.
Regarding the percentages you want, is the denominator based on gender, or purchase? In the example given above, 66% for female & no purchase would be a result of 6 divided by the sum of no purchases (6+3) rather than the sum of all females (6+4).
It's definitely possible to plot that, but I'm not sure if the result would be intuitive to interpret. I got confused myself for a while.
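The two possible denominators are easy to compare with prop.table() on the counts from the question. This is just a sketch of the arithmetic, with the table typed in by hand from the table(df) output shown there.

```r
# Counts from the question's table(df): rows are purchase, columns are gender.
tab <- matrix(
  c(6, 4, 3, 7), nrow = 2,
  dimnames = list(purchase = c("No", "Yes"), gender = c("Female", "Male"))
)

# Denominator = purchase totals (each row sums to 1): 6/9, 3/9, 4/11, 7/11.
by_purchase <- prop.table(tab, margin = 1)

# Denominator = gender totals (each column sums to 1): 6/10, 4/10, 3/10, 7/10.
by_gender <- prop.table(tab, margin = 2)

print(round(by_purchase, 2))
print(round(by_gender, 2))
```

The percentages listed in the question match the first table, where each purchase group is the denominator.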
The following hack makes use of the weight aesthetic. I've used purchase as the grouping variable here based on the expected output described in the question, though I think gender makes more sense (as per TTNK's answer above):
df <- data.frame(purchase = c(rep("No", 6), rep("Yes", 4), rep("No", 3), rep("Yes", 7)),
gender = c(rep("Female", 10), rep("Male", 10)))
ggplot(df %>%
group_by(purchase) %>% #change this to gender if that's the intended denominator
mutate(w = 1/n()) %>% ungroup()) +
aes(gender, fill = purchase, weight = w)+
geom_bar(aes(x = gender, fill = purchase), position = "dodge")+
scale_y_continuous(name = "percent", labels = scales::percent)

Making trends in line graphs more or less prominent based on their occurrence

This relates to the blog post by Simon, "Plotting individual observations and group means with ggplot2", and to a previous question of mine where Mr Snake gave a nice solution.
I have a dataset of 600 individuals. For each individual I have the value of an indicator for 4 years: 2014, 2015, 2016 and 2017. The indicator takes a value of 0-5 for 2014 and 1-4 for the other years. In each year the value can either stay the same as in the previous year or increase, but it cannot go down. I am trying to plot a separate line for each individual showing the trend of the indicator value over time, so the x-axis is time and the y-axis is the indicator value. Below is a minimal sample of 17 rows that illustrates my problem (each row is one individual; the full data has 600 rows).
df <- data.frame(c(1:17), c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1),
c(3,3,3,3,3,3,1,1,1,1,1,3,1,2,1,1,1), c(3,3,3,3,4,4,1,1,1,1,1,3,4,2,1,1,3), c(3,4,3,3,4,4,1,1,1,1,1,3,4,2,1,1,3))
colnames(df) <- c("individual_id", paste("indicator_level_", 14:17, sep=""))
I am using the following code to achieve this -
library(tidyverse)
df1 <-df %>%
gather(indicator_level_14, indicator_level_15, indicator_level_16, indicator_level_17, key="Level", value = "Level_value")
ggplot(df1, aes(x=Level, y=Level_value, color=as.factor(individual_id))) +
geom_line(aes(group = individual_id)) +
theme(axis.text.x = element_text(angle = 50, hjust = 1)) +
labs(color = "Sample") + theme(legend.position="none")
The trouble is that, since my levels are discrete and many individuals share the same trajectory over time, several lines overlap, making it difficult to see which trajectories are the most common. This gets worse when I plot the lines for all 600 individuals. Is there a way to make the thickness of each line proportional to the number of overlapping trajectories?
library(dplyr)
# Approach 1: line width proportional to the number of individuals at each (year, level) point
df1 %>%
group_by(Level, Level_value) %>% # Find each unique leg
mutate(count = length(individual_id)) %>% # Count how many are on each leg
ggplot(aes(Level, Level_value, group = individual_id, size = count)) +
geom_line() +
scale_size_continuous(range = c(0.5,3))
# Approach 2: semi-transparent lines, so overlapping trajectories appear darker
df1 %>%
ggplot(aes(Level, Level_value, group = individual_id)) +
geom_line(alpha = 0.25, size = 2)
# Approach 3: line width proportional to the number of individuals sharing the full four-year path
df %>%
group_by(indicator_level_14, # Find each unique full path
indicator_level_15,
indicator_level_16,
indicator_level_17) %>%
mutate(count = length(individual_id)) %>% # Count indivs on each path
gather(Level, Level_value, -individual_id, -count) %>%
ggplot(aes(Level, Level_value, group = individual_id, size = count)) +
geom_line() +
scale_size_continuous(range = c(0.5,3))
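To check what the "full path" count should be before mapping it to line width, the counting can be done in base R on the sample data from the question. This is a sketch only; the key format (levels joined with "-") is an arbitrary choice for illustration.

```r
# Sample data from the question.
df <- data.frame(c(1:17), c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1),
                 c(3,3,3,3,3,3,1,1,1,1,1,3,1,2,1,1,1),
                 c(3,3,3,3,4,4,1,1,1,1,1,3,4,2,1,1,3),
                 c(3,4,3,3,4,4,1,1,1,1,1,3,4,2,1,1,3))
colnames(df) <- c("individual_id", paste("indicator_level_", 14:17, sep = ""))

# One key per individual describing the whole trajectory, e.g. "0-3-3-3".
key <- do.call(paste, c(df[-1], sep = "-"))

# Number of individuals sharing each trajectory, sorted from most to least common.
path_counts <- sort(table(key), decreasing = TRUE)
print(path_counts)

# The same count attached to every individual, ready to map to line width.
df$n_same_path <- as.vector(path_counts[key])
```

In the sample, the flat "0-1-1-1" trajectory is the most common (7 individuals), followed by "0-3-3-3" (4 individuals).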

Setting facet-specific breaks in stat_contour

I'd like to show a contour plot using ggplot and stat_contour for two categories of my data with facet_grid. I want to highlight a particular level based on the data. Here's an analogous dummy example using the usual volcano data.
library(dplyr)
library(ggplot2)
v.plot <- volcano %>% reshape2::melt(.) %>%
mutate(dummy = Var1 > median(Var1)) %>%
ggplot(aes(Var1, Var2, z = value)) +
stat_contour(breaks = seq(90, 200, 12)) +
facet_grid(~dummy)
Plot 1:
Let's say within each factor level (here east and west halves, I guess), I want to find the mean height of the volcano and show that. I can calculate it manually:
volcano %>% reshape2::melt(.) %>%
mutate(dummy = Var1 > median(Var1)) %>%
group_by(dummy) %>%
summarise(h.bar = mean(value))
# A tibble: 2 × 2
dummy h.bar
<lgl> <dbl>
1 FALSE 140.7582
2 TRUE 119.3717
Which tells me that the mean heights on each half are 141 and 119. I can draw BOTH of those on BOTH facets, but not just the appropriate one on each side.
v.plot + stat_contour(breaks = c(141, 119), colour = "red", size = 2)
Plot 2:
And you can't put breaks= inside an aes() statement, so passing it in as a column in the original dataframe is out. I realize with this dummy example I could probably just do something like bins=2 but in my actual data I don't want the mean of the data, I want something else altogether.
Thanks!
I made another attempt at this problem and came up with a partial solution, but I'm forced to use a different geom.
volcano %>% reshape2::melt(.) %>%
mutate(dummy = Var1 > median(Var1)) %>%
group_by(dummy) %>%
mutate(h.bar = mean(value), # edit1
is.close = round(h.bar) == value) %>% #
ggplot(aes(Var1, Var2, z = value)) +
stat_contour(breaks = seq(90, 200, 12)) +
geom_point(colour = "red", size = 3, # edit 2
aes(alpha = is.close)) + #
scale_alpha_discrete(range = c(0,1)) + #
facet_grid(~dummy)
In edit 1 I added a mutate() to the above block to generate a variable identifying where value was "close enough" (rounded to the nearest integer) to the desired highlight point (the mean of the data for this example).
In edit 2 I added geom_point() to show the grid locations with the desired value, and hid the undesired ones using an alpha of 0 (totally transparent).
Plot 3:
The problem with this solution is that it's very gappy, and trying to bridge those with geom_path is a jumbled mess. I tried coarser rounding as well, and it just made things muddy.
Would love to hear other ideas! Thanks
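One other idea, offered as a sketch and not as something tried in the post above: compute the highlighted contour for each half separately with grDevices::contourLines(), which returns ordered paths, and collect them in a data frame that carries the facet variable. The column names (dummy, Var1, Var2) are chosen to match the melted volcano data; piece is a made-up grouping column so that separate contour segments are not joined together.

```r
# Split the row indices of the volcano matrix into the same two halves as `dummy`.
row_idx <- seq_len(nrow(volcano))
halves <- split(row_idx, row_idx > median(row_idx))  # names are "FALSE" and "TRUE"

# For each half, trace the contour at that half's own mean height.
paths <- do.call(rbind, lapply(names(halves), function(half) {
  rows <- halves[[half]]
  z <- volcano[rows, , drop = FALSE]
  lines <- contourLines(x = rows, y = seq_len(ncol(z)), z = z, levels = mean(z))
  do.call(rbind, lapply(seq_along(lines), function(i) {
    data.frame(
      dummy = half == "TRUE",
      piece = paste(half, i),      # one id per continuous contour segment
      Var1  = lines[[i]]$x,
      Var2  = lines[[i]]$y,
      level = lines[[i]]$level
    )
  }))
}))

print(aggregate(level ~ dummy, data = paths, FUN = unique))
```

Because paths has its own dummy column, adding something like geom_path(data = paths, aes(Var1, Var2, group = piece), inherit.aes = FALSE, colour = "red") to the faceted plot should draw each contour only in its own panel. I have not checked how well this carries over to the real data, where the highlighted level is something other than the mean.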
