Barplot with set x, y and fill categories in R

I have data of the following type and would like to create a stacked barplot showing the sum of Number on the y axis for different bins of Distance on the x axis. In effect it would be a kind of histogram, but with the sum of Number per bin on the y axis instead of frequencies. The bars would be stacked across all categories in Dest, each marked with a different colour.
Thanks so much.
library(ggplot2)
df <- data.frame(c(rep("A",20),rep("B",25),rep("C",35)),sample(1:30, 80,replace = TRUE),
rnorm(80,45,8))
colnames(df) <- c("Dest","Number","Distance")
ggplot(data = df, aes(x = Distance, y = Number, fill = Dest)) +
geom_histogram(colour = c("red","blue","green"))

Here are 2 solutions in case you want to be the one that specifies the (Distance) bins and not the histogram:
Option 1 (using ntile)
Here's a solution that allows you to specify the number of bins using ntile, which means that those bins will have more or less the same number of observations:
library(tidyverse)
# Example data: 80 rows across three destinations
df <- data.frame(
  Dest = c(rep("A", 20), rep("B", 25), rep("C", 35)),
  Number = sample(1:30, 80, replace = TRUE),
  Distance = rnorm(80, 45, 8)
)
df %>%
  group_by(bin = ntile(Distance, 3)) %>% # specify number of bins you want
  mutate(DistRange = paste0(round(min(Distance)), " - ", round(max(Distance)))) %>% # label each bin with its range
  ungroup() %>%
  group_by(Dest, bin, DistRange = fct_reorder(DistRange, bin)) %>% # keep labels in bin order
  summarise(sum_number = sum(Number)) %>%
  ungroup() %>%
  ggplot(aes(DistRange, sum_number, fill = Dest)) +
  geom_col()
Option 2 (using cut)
An alternative option using cut to specify ranges:
df %>%
  mutate(bin = cut(Distance, breaks = c(min(Distance) - 1, 40, 50, 55, max(Distance)))) %>% # specify ranges
  group_by(Dest, bin) %>%
  summarise(sum_number = sum(Number)) %>%
  ungroup() %>%
  ggplot(aes(bin, sum_number, fill = Dest)) +
  geom_col()
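To see how the two binning rules differ, here is a minimal sketch on a small, made-up vector (the distance values and break points are assumptions chosen only for illustration, not taken from the question's data). ntile splits by number of observations, while cut splits by fixed ranges:

```r
library(dplyr)

# Hypothetical distances, chosen only to illustrate the two binning rules
distance <- c(31, 38, 42, 44, 47, 49, 52, 58, 63)

# ntile(): three bins holding the same number of observations
by_count <- ntile(distance, 3)

# cut(): bins defined by fixed break points, so bin sizes may differ
by_range <- cut(distance, breaks = c(30, 40, 50, 55, 65))

table(by_count)
table(by_range)
```

With ntile the bar widths on the plot represent unequal distance ranges, so cut is probably the closer match when the bins should behave like histogram bins.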

Related

R: ggplot, filling density plot with different colors around the mean value

Problem
I am trying to fill the density plot with different colors around the mean. For example, the left part of density plot from a vertical line of the mean will be filled with blue, and the right part with red. I tried the below method with three facets. Within each facet, by setting fill = color, it separates the plot into two density plots around the mean. I want to have only one plot filled with two colors. Can I get some help here?
Sample Data and Current Method
library(tidyverse)
library(tibble)
library(data.table)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10002)
id <- sample(1:1000, 1000, replace=F)
set.seed(10003)
group <- sample(c('A','B','C'), 1000, replace=T)
set.seed(10001)
value1 <- sample(1:300, 1000, replace=T)
set.seed(10004)
value2 <- sample(1:300, 1000, replace=T)
sample <-
data.frame(id, group, value1, value2)
mu <-
  sample %>%
  gather(state, value, -group, -id) %>%
  group_by(group) %>%
  summarise(grp.mean = mean(value)) # per-group mean over value1 and value2
p <-
sample %>%
gather(state, value, -group, -id) %>%
left_join(
mu,
by = 'group'
) %>%
distinct %>%
mutate(color = ifelse(value <= grp.mean, 'leq', 'greater')) %>%
select(-grp.mean) %>%
ggplot(aes(x = value, fill = color)) +
geom_density(alpha=0.4) +
geom_vline(
data = mu,
aes(xintercept = grp.mean, color = group),
linetype = "dashed"
) +
facet_wrap(.~group)
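One possible approach, sketched here for a single group under stated assumptions, is to compute the density once with density() and then colour the area on each side of the mean, instead of letting geom_density estimate a separate density per fill value. The data x and the column names below are hypothetical stand-ins, not the question's sample data:

```r
library(ggplot2)

set.seed(1)
x <- rnorm(500, mean = 150, sd = 40) # hypothetical stand-in for one group's values

# Estimate the density once, over the whole sample
dens <- density(x)
dens_df <- data.frame(x = dens$x, y = dens$y)

# Label each point of the density curve by its side of the mean
dens_df$side <- ifelse(dens_df$x <= mean(x), "leq", "greater")

# Draw one ribbon per side under the same curve
p <- ggplot(dens_df, aes(x = x, ymin = 0, ymax = y, fill = side)) +
  geom_ribbon(alpha = 0.4) +
  geom_vline(xintercept = mean(x), linetype = "dashed")
p
```

For the faceted version, the same steps would presumably be repeated per group (computing one density and one mean for each) before binding the results into a single data frame for facet_wrap.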

How to add a gradient fill to a geom_density chart

I have a dataset where I'd like to plot the density of one column and add a gradient fill that is associated with another column.
For example, this code creates the following plot
library(datasets)
library(tidyverse)
df <- airquality
df %>%
group_by(Temp) %>%
mutate(count = n(),
avgWind = mean(Wind)) %>%
ggplot(aes(x = Temp, fill = avgWind)) +
geom_density()
What I'd like is for the plot to have a gradient fill that indicates what the average wind (avgWind) was at each temperature along the x-axis.
I've seen some examples that allow me to create a gradient fill that is associated with the values on the x-axis itself (in this case, Temp) or by percentile/quantiles, but I'd like the gradient fill to be associated with an additional variable.
It's sort of like this, but instead of a bar plot, I'd like to keep it as a smoothed density chart:
df %>%
group_by(Temp) %>%
mutate(count = n(),
avgWind = mean(Wind)) %>%
ggplot(aes(x = (Temp), fill = avgWind, group = Temp)) +
geom_bar(aes(y = (..count..)/sum(..count..)))
A filled shape such as a geom_polygon takes one fill value per group rather than a gradient that follows a variable, so the usual solution is to draw lots of line segments. For example, you could do something like this:
library("datasets")
library("tidyverse")
library("viridis")
df <- airquality
df <- df %>%
  group_by(Temp) %>%
  mutate(count = n(), avgWind = mean(Wind))
## Since we (presumably) want continuous fill, we need to interpolate to
## get avgWind at each Temp value.
## The edges are grey because KDE is estimating density
## Where we don't know the relationship between temp and avgWind
d2fun <- approxfun(df$Temp, df$avgWind)
#> Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
#> 'x' values
dens <- density(df$Temp)
dens_df <- data.frame(x = dens$x, y = dens$y, fill = d2fun(dens[["x"]]))
ggplot(dens_df) +
  geom_segment(aes(x = x, xend = x, y = 0, yend = y, color = fill)) +
  scale_color_viridis()
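The grey edges mentioned in the comments come from how approxfun treats values outside the range it was given: by default it returns NA there, and density() evaluates the curve slightly beyond the observed Temp range. A minimal sketch with made-up points (the x and y values are assumptions for illustration only):

```r
# Hypothetical known points: y is only defined for x between 1 and 3
f <- approxfun(x = c(1, 2, 3), y = c(10, 20, 30))

inside <- f(2.5)  # linear interpolation between (2, 20) and (3, 30)
outside <- f(0)   # outside the known range, so NA by default (rule = 1)

inside
outside
```

Passing rule = 2 to approxfun would instead reuse the value at the nearest end point, which removes the grey segments at the cost of extrapolating avgWind where no data exists.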

ggplot2: stacked barplot over different columns

I have the following sample data with three different cost-types and a year-column:
library(tidyverse)
# Sample data
costsA <- sample(100:200,30, replace=T)
costsB <- sample(100:140,30, replace=T)
costsC <- sample(20:20,30, replace=T)
year <- sample(c("2000", "2010", "2030"), 30, replace=T)
df <- data.frame(costsA, costsB, costsC, year)
My goal is to plot these costs in a stacked barplot, so that I can compare the mean-costs between the three year-categories. In order to do so I aggregated the values:
df %>% group_by(year) %>%
summarise(n=n(),
meanA = mean(costsA),
meanB = mean(costsB),
meanC = mean(costsC)) %>%
ggplot( ... ) + geom_bar()
But how can I plot the graph now? In the x-axis there should be the years and in the y-axis the stacked costs.
You have to reshape the summarised data into a tidy(-ish) long format before plotting. In the tidyverse, you can do that with the gather function, which converts multiple columns into two columns of key-value pairs. For instance:
df %>%
  group_by(year) %>%
  summarise(n = n(),
            meanA = mean(costsA),
            meanB = mean(costsB),
            meanC = mean(costsC)) %>%
  gather("key", "value", -c(year, n)) %>%
  ggplot(aes(x = year, y = value, group = key, fill = key)) +
  geom_col()
With gather("key", "value", -c(year, n)), the three summary columns (meanA, meanB, meanC) are converted into key-value pairs, while year and n are kept as they are.
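In current versions of tidyr, gather is superseded by pivot_longer. A minimal sketch of the same reshaping step, using a small hypothetical summary table (the numbers in means are made up and only stand in for the summarise output):

```r
library(dplyr)
library(tidyr)

# Hypothetical summary table standing in for the summarise() output
means <- data.frame(
  year  = c("2000", "2010", "2030"),
  n     = c(10L, 12L, 8L),
  meanA = c(150, 148, 155),
  meanB = c(120, 119, 121),
  meanC = c(20, 20, 20)
)

# One row per year and cost type, with the column name stored in `key`
long <- means %>%
  pivot_longer(
    cols = starts_with("mean"),
    names_to = "key",
    values_to = "value"
  )
long
```

The resulting long table can be passed to ggplot(aes(x = year, y = value, fill = key)) + geom_col() in the same way as the gather version.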

Percentages of a variable in another variable using dplyr and creating a boxplot with standard deviation

I have this df. I wish to turn glasses into a factor with the levels <=1.5 and >1.5. After that, I want to examine what percentage of each level has a ciss value above 16. Each level is considered one group, so each should count as 100%.
glasses <- c(1.0,1.1,1.1,1.6,1.2,1.7,2.2,5.2,8.2,2.5,3.0,3.3,3.0,3.0)
ciss <- c(2,9,10,54,65,11,70,54,0,65,8,60,47,2)
df <- cbind(glasses, ciss)
df
I want an outcome looking like
glasses   Percentage of ciss > 16
<=1.5     xx%
>1.5      xx%
I tried using dplyr
dfnew <- df %>% mutate(ani=cut(glasses, breaks=c(-Inf, 1.5, Inf),
labels=c("<=1.5",">1.5")))
dfnew %>% group_by(ani) %>% mutate(perc = ciss>16 / sum(ciss))
And lastly, I would like to demonstrate the percentages in boxplot (glasses on the x axis, percentages of ciss above 16 on the y axis).
Try this:
library(tidyverse)
# Input data
glasses <- c(1.0, 1.1, 1.1, 1.6, 1.2, 1.7, 2.2, 5.2, 8.2, 2.5, 3.0, 3.3, 3.0, 3.0)
ciss <- c(2, 9, 10, 54, 65, 11, 70, 54, 0, 65, 8, 60, 47, 2)
# Bind into a data frame
df <- data.frame(glasses, ciss)
df %>%
  mutate(typglass = if_else(glasses > 1.5, ">1.5", "<=1.5")) %>% # two glasses groups
  filter(ciss > 16) %>% # keep only rows with ciss above 16
  group_by(typglass) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>% # share of the ciss > 16 rows that falls in each group
  ggplot() +
  geom_bar(aes(x = typglass, y = freq, fill = typglass), stat = "identity", width = 0.5) +
  theme_classic()
This gives a bar chart with one bar per glasses group.
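Note that filtering first means the frequencies are shares of all ciss > 16 rows. If each glasses group should instead count as 100%, as the question describes, one possible variant is to take the mean of the logical condition within each group. This is a sketch under that reading of the question; the column name perc_above_16 is made up here:

```r
library(dplyr)

glasses <- c(1.0, 1.1, 1.1, 1.6, 1.2, 1.7, 2.2, 5.2, 8.2, 2.5, 3.0, 3.3, 3.0, 3.0)
ciss <- c(2, 9, 10, 54, 65, 11, 70, 54, 0, 65, 8, 60, 47, 2)
df <- data.frame(glasses, ciss)

perc <- df %>%
  mutate(typglass = if_else(glasses > 1.5, ">1.5", "<=1.5")) %>%
  group_by(typglass) %>%
  # mean of a logical vector = proportion of TRUE values within the group
  summarise(perc_above_16 = 100 * mean(ciss > 16))
perc
```

For this data the sketch yields 25% for the <=1.5 group (1 of 4 rows) and 60% for the >1.5 group (6 of 10 rows). The perc table can be plotted with geom_col(aes(x = typglass, y = perc_above_16)).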

ggplot2: comparing 2 groups through fraction of its members

Let's say we have 10000 users classified in 2 groups: lvl beginner and lvl pro.
Every user has a rank, going from 1 to 20.
The df:
# beginers
n <- 7000
user.id <- 1:n
lvl <- "beginer"
rank <- sample(1:20, n, replace = TRUE,
prob = seq(.9,0.1,length.out = 20))
df.beginer <- data.frame(user.id, rank, lvl)
# pros
n <- 3000
user.id <- 1:n
lvl <- "pro"
rank <- sample(1:20, n, replace = TRUE,
prob = seq(.9,0.3,length.out = 20))
df.pro <- data.frame(user.id, rank, lvl)
library(dplyr)
df <- bind_rows(df.beginer, df.pro)
df2 <- as_tibble(df) %>% group_by(lvl, rank) %>% mutate(count = n())
Problem 1:
I need a bar plot comparing each group side by side, but instead of giving me counts, I need percents, so the bars from each group will have the same max height (100%).
The plot I got so far:
library(ggplot2)
plot <- ggplot(df2, aes(rank))
plot + geom_bar(aes(fill=lvl), position="dodge")
Problem 2:
I need a line plot comparing each group, so we will have 2 lines, but instead of giving me counts, I need percents, so the lines from each group will have the same max height (100%).
The plot I got so far:
plot + geom_line(aes(y=count, color=lvl))
Problem 3:
Let's say that the ranks are cumulative, so a user who has rank 3 also has ranks 1 and 2, and a user who has rank 20 has all ranks from 1 to 20.
So, when plotting, I want the plot to start with rank 1 having 100% of users, rank 2 having somewhat fewer, rank 3 fewer still, and so on.
I got all this done in Tableau, but I really dislike it and want to show myself that R can handle all this stuff.
Thank you!
Three problems, three solutions:
Problem 1 - calculate percentage and use geom_col
df %>%
  group_by(rank, lvl) %>%
  summarise(count = n()) %>%
  group_by(lvl) %>%
  mutate(count_perc = count / sum(count)) %>% # calculate percentage
  ggplot(., aes(x = rank, y = count_perc)) +
  geom_col(aes(fill = lvl), position = 'dodge')
Problem 2 - pretty much the same as problem 1, except use geom_line instead of geom_col
df %>%
  group_by(rank, lvl) %>%
  summarise(count = n()) %>%
  group_by(lvl) %>%
  mutate(count_perc = count / sum(count)) %>%
  ggplot(., aes(x = rank, y = count_perc)) +
  geom_line(aes(colour = lvl))
Problem 3 - make use of arrange and cumsum
df %>%
  group_by(lvl, rank) %>%
  summarise(count = n()) %>% # count by level and rank
  group_by(lvl) %>%
  arrange(desc(rank)) %>% # sort descending
  mutate(cumulative_count = cumsum(count)) %>% # use cumsum
  mutate(cumulative_count_perc = cumulative_count / max(cumulative_count)) %>%
  ggplot(., aes(x = rank, y = cumulative_count_perc)) +
  geom_line(aes(colour = lvl))
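The cumulative step in problem 3 can be checked by hand on a tiny example. The counts below are hypothetical (10 users whose highest rank is 1, 2 or 3) and only illustrate why the cumulative sum has to run from the top rank down:

```r
# Hypothetical counts of users whose highest rank is 1, 2 and 3
count <- c(5, 3, 2)

# Users reaching at least each rank: sum the counts from the top rank down
reached <- rev(cumsum(rev(count)))

# Share of all users reaching at least each rank
reached_perc <- reached / sum(count)
reached_perc
```

Rank 1 comes out at 100% because every user has reached it, and the share can only fall as the rank increases, which is the shape the question asks for.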
