Imposing normal distribution to column bars by factor - r

I have a data frame with 3 columns and several rows, with this structure:
  Label Year Frequency
1     a    1     86.45
2     b    1     35.32
3     c    1     10.94
4     a    2     13.55
5     b    2     46.30
6     c    2     12.70
and so on, up to year 20. I plot it like this:
ggplot(data=df, aes(x=Year, y=Frequency, fill=Label))+
geom_col(position=position_dodge2(width = 0.1, preserve = "single"))+
scale_fill_manual(name=NULL,
labels=c("A", "B", "C"),
values=c("red", "cyan", "green")) +
scale_x_continuous(breaks = seq(0, 20, by = 1),
limits = c(0, 20)) +
scale_y_continuous(expand = c(0, 0),
limits = c(0, 90),
breaks = seq(0, 90, by = 10)) +
theme_bw()
What I want to do is add three normal distributions to the plot, so that each group of data (A, B, C) can be visually compared with the normal distribution that most closely matches it, using the same colors (the normal distribution for label A will be red, and so on).
From the example data shown here, I would expect to see a red distribution that is higher and narrower than the green distribution, which would be shorter and wider. How can I add them to the plot?
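One possible approach, given as a rough sketch rather than a definitive answer: treat Frequency as a weight on Year, compute a weighted mean and standard deviation per label, and draw the resulting normal curve scaled to the label's summed Frequency. The helper name normal_curve, the grid spacing and the moment-matching fit are all assumptions made for illustration; the question does not specify how the "most similar" normal distribution should be chosen.

```r
# Example rows from the question (the real data would run to year 20)
df <- data.frame(
  Label     = rep(c("a", "b", "c"), times = 2),
  Year      = rep(1:2, each = 3),
  Frequency = c(86.45, 35.32, 10.94, 13.55, 46.30, 12.70)
)

# Hypothetical helper: moment-matched normal curve for one label's bars
normal_curve <- function(d, grid = seq(0, 20, by = 0.1)) {
  w     <- d$Frequency / sum(d$Frequency)   # weights summing to 1
  mu    <- sum(w * d$Year)                  # weighted mean year
  sigma <- sqrt(sum(w * (d$Year - mu)^2))   # weighted standard deviation
  data.frame(
    Label     = d$Label[1],
    Year      = grid,
    # scale the density so its area equals the label's summed Frequency
    Frequency = sum(d$Frequency) * dnorm(grid, mean = mu, sd = sigma)
  )
}

curves <- do.call(rbind, lapply(split(df, df$Label), normal_curve))

# Overlay the curves on the bars, reusing the fill colours for the lines
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  cols <- c(a = "red", b = "cyan", c = "green")
  p <- ggplot(df, aes(x = Year, y = Frequency, fill = Label)) +
    geom_col(position = position_dodge2(preserve = "single")) +
    geom_line(data = curves, aes(colour = Label)) +
    scale_fill_manual(values = cols) +
    scale_colour_manual(values = cols)
}
```

Note that a curve's peak can exceed the tallest bar, so fixed y limits such as c(0, 90) may cut the line; leaving the y limits unset, or widening them, avoids that.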

Related

Normalize/proportionalize count of individual bins 2d histogram

I am trying to normalize the counts of individual bins in a 2d histogram. Group 3 has a substantially higher number of inputs than the others, but I want to compare bins across groups. So I am trying to show proportional values, so that the total count for each class adds up to, e.g., 100.
I reckon that this has to be done on the data frame beforehand. I have managed to normalize the values per group; however, I haven't managed to scale the counts so that they can be visualized this way with the 2d histogram function.
perClassNormalized <- Variables %>%
group_by(Class) %>%
mutate(Nor = procntStad/(max(procntStad)))
Variables is a data frame with about 10 variables (columns), each with x entries for one of 5 classes. The current total counts per class are: 1 = 639, 2 = 247, 3 = 9881, 4 = 1084, 5 = 823. So the number of inputs for class 3 is substantially higher than for the others.
Class variable1 variable2
    1         3         7
    1         2         3
    2         2         6
    2         5         8
    3         3         9
    3         2         1
    3         2         3
    3         8         4
    4         9         5
    5        10         2
This is the code for the image I currently have:
my_breaks = c(2, 10, 50, 100, 5000)
##
procentStadVSKlasse <- ggplot(perClassNormalized , aes(x = Class, y = (Nor))) + geom_bin2d(bins = 10) +
ylab("Percentage bebouwd oppervlak") + xlab("Norm klasse regionale kering") +
labs(title = "Bebouwd oppervlak") +
scale_fill_gradient(name = "count", trans = "log", breaks = my_breaks, labels = my_breaks,
low = '#55C667FF', high = '#FDE725FF') +
theme_bw() +
scale_x_discrete(limits = c(1,2,3,4,5)) +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.title.x = element_text(size=14),
axis.text.x = element_text(size=12),
axis.title.y = element_text(size=14))
The new image should probably look similar, but the normalization should improve the visualization and hopefully make the distinctions easier to spot.
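A minimal sketch of one way to do the normalization beforehand, assuming the goal is that the bin counts within each class sum to 100: bin the values yourself, count per class and bin, convert the counts to row-wise percentages, and draw the result as tiles. The data below are made up, and the column names bin and percent are hypothetical.

```r
set.seed(42)
# Made-up data with strongly unbalanced classes, values in [0, 1]
Variables <- data.frame(
  Class = rep(1:3, times = c(50, 20, 500)),
  Nor   = runif(570)
)

# Bin the y values into 10 equal-width bins
Variables$bin <- cut(Variables$Nor, breaks = seq(0, 1, by = 0.1),
                     include.lowest = TRUE)

# Count per class and bin, then turn each class (row) into percentages
counts <- table(Class = Variables$Class, bin = Variables$bin)
pct    <- prop.table(counts, margin = 1) * 100

# Long format, one row per class/bin combination, ready for plotting
pct_df <- as.data.frame(pct, responseName = "percent")

# Draw the percentages as tiles instead of letting geom_bin2d count rows
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(pct_df, aes(x = Class, y = bin, fill = percent)) +
    geom_tile()
}
```

Because every class now sums to 100, the fill scale is comparable across classes regardless of how many rows each class contributes.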

Trim scale_color_gradient() legend in ggplot

I am trying to 'trim' the legend of the plot below:
df <- data.frame(x = seq(0, 15 , 0.001),
y = seq(0, 15, 0.001))
ggplot(df, aes(x=x, y=y, col = y)) +
geom_line() +
scale_color_gradientn(colours = c("green", "black", "red", "red"),
values = rescale(x = c(0, 2, 4, 15), to = c(0,1), from = c(0, 15) ))
I can set the breaks and labels I need by adding breaks = c(0,2,4), labels = c("0", "2", "4+").
But when I add limits = c(0,4), the gradient gets messed up.
Question
Is it possible to 'trim' the legend, so that it shows the values from 0 to 4+ (i.e., and omits all values above)?
The following is what you probably want.
library(ggplot2)
library(scales)
df <- data.frame(x = seq(0, 15 , 0.001),
y = seq(0, 15, 0.001))
ggplot(df, aes(x=x, y=y, col = y)) +
geom_line() +
scale_color_gradientn(colours = c("green", "black", "red", "red"),
values = rescale(x = c(0, 2, 4, 15), from = c(0, 4)),
oob = squish,
limits = c(0, 4))
What's happening is as follows. Suppose we have some values in data space (meaning they haven't been rescaled yet).
# Colour positions in data space
print(col_pos_data <- c(0, 2, 4, 15))
#> [1] 0 2 4 15
By default, the scales::rescale() function brings everything into the [0,1] interval. However, when you set a custom range, any out of bounds values will be scaled linearly with the in bounds values. You'll notice that the 15 becomes 3.75 in this case.
# Colour positions in [0,1] interval
col_pos_scaled <- rescale(col_pos_data, from = c(0, 4))
print(col_pos_scaled)
#> [1] 0.00 0.50 1.00 3.75
However, by default ggplot enforces the limits of continuous scales by setting anything that exceeds the limits to NA, which is often removed later.
# Default ggplot limit enforcing
print(censor(col_pos_scaled))
#> [1] 0.0 0.5 1.0 NA
Now that is a bit too bad for your scale purposes, but one of the alternatives is to 'squish' the data. This brings any (finite) out of bounds values to the nearest limit.
Notice that the last value is no longer NA but set to the largest limit in the [0,1] interval.
print(scaled_squish <- squish(col_pos_scaled))
#> [1] 0.0 0.5 1.0 1.0
The same thing holds true for values in data space if the range is adjusted accordingly.
print(censor(col_pos_data, range = c(0, 4)))
#> [1] 0 2 4 NA
print(data_squish <- squish(col_pos_data, range = c(0, 4)))
#> [1] 0 2 4 4
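To make the difference between the two behaviours concrete, here is a rough base-R approximation of what censoring and squishing do. The names censor_base and squish_base are hypothetical stand-ins written for this illustration; they are not the scales implementations and ignore details such as the special handling of infinite values.

```r
# Censoring: anything outside the range becomes NA
censor_base <- function(x, range = c(0, 1)) {
  ifelse(x < range[1] | x > range[2], NA, x)
}

# Squishing: anything outside the range is moved to the nearest limit
squish_base <- function(x, range = c(0, 1)) {
  pmin(pmax(x, range[1]), range[2])
}

col_pos_data <- c(0, 2, 4, 15)
cen <- censor_base(col_pos_data, range = c(0, 4))
squ <- squish_base(col_pos_data, range = c(0, 4))
print(cen)
print(squ)
```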
Internally, ggplot rescales all data to the limits and the order of operations doesn't matter for squishing/rescaling, so the data values and the colour positions in [0,1] line up nicely.
# So when data values are rescaled, they match up the colours
identical(rescale(data_squish), scaled_squish)
#> [1] TRUE
Created on 2020-04-24 by the reprex package (v0.3.0)

ggplot: ranges of values as discrete linerange plots

I would like to make this plot:
Plot 1: The plot that I wanted
My data looks like this:
> head(ranges_example)
  labels Minimum Maximum error
1    One    -275    -240     1
2    Two    -265    -210     1
3  Three    -260    -215     1
4   Four    -273    -230     1
5   Five     NaN    -200     1
6    Six     NaN    -240     1
But, alas, I had to make that plot in Illustrator by modifying the plot that I did make in R, this one:
Plot 2: The plot that I got
And I made it using geom_linerange, specifically:
ggplot() +
geom_linerange(data = ranges_example,
mapping=aes(x = labels, ymin = Minimum, ymax = Maximum,
lwd = 1, color = error, alpha = 0.5),
position = position_dodge(width = 1)) +
scale_y_continuous(c(-240, -300)) +
coord_flip()
Plot 2 is good enough for this once--it takes maybe 15 minutes to turn it into Plot 1 in Illustrator--but I'll probably need to make a good few more of these.
The reason why I don't just remove the position_dodge statement is that then it just blends the colors together, like this:
I need them to be their own, distinct colors so that it's easy to tell them apart. The different shades mean different things and I need to be able to easily distinguish between and alter them.
How can I create a plot that looks more like Plot 1 right out of the box?
library(dplyr)
library(tibble)
library(ggplot2)
# Example data
ranges_example <- tribble(
  ~labels, ~Minimum, ~Maximum, ~error,
  "One", -275, -240, 1,
  "Two", -265, -210, 1,
  "One", -285, -215, 2,
  "Two", -275, -190, 2,
  "One", -300, -200, 3,
  "Two", -290, -180, 3)
# No position_dodge and no alpha: rows with the largest error are drawn first,
# so the smaller-error ranges are layered on top in their own colour
ggplot() +
  geom_linerange(data = ranges_example %>% arrange(-error),
                 mapping = aes(x = labels, ymin = Minimum, ymax = Maximum,
                               lwd = 1, color = error)) +
  scale_y_continuous(c(-240, -300)) +
  scale_color_continuous(high = "lightgreen", low = "forestgreen") +
  coord_flip() +
  theme_classic()

Selectively colored geom_hline

I am using hline from ggplot to construct an axis for a data set I am looking at. Essentially, I want to selectively color this axis based on a data frame. The data frame consists of positions (7684, 7685, ..., 7853), each of which corresponds to a letter: "a", "b", "c", or "d". I would like to map each letter to a color and use it to color that interval on the axis.
For example, row 1 of this data frame is (7684, "c"), so I would want to color the interval on the axis from 7684 to 7685 with the color for "c", which could be red, for instance. I have yet to think of a straightforward solution, and I am not sure whether hline is the way to go.
> df
   p nucleotide
1  c       7684
2  c       7685
3  t       7686
4  t       7687
5  a       7688
6  c       7689
7  a       7690
8  t       7691
9  a       7692
10 c       7693
This is a small snippet of what I am talking about. Basically, I want to associate df$p with colors and color the interval of the corresponding df$nucleotide.
You never use a for loop in ggplot and you should never use df$.. in an aesthetic.
library(dplyr)
library(ggplot2)
ggplot(df) +
geom_segment(aes(x = nucleotide, xend = lead(nucleotide), y = 1, yend = 1, color = p), size = 4)
#> Warning: Removed 1 rows containing missing values (geom_segment).
This takes us halfway. What it does is draw a segment from x to xend. x is mapped to the nucleotide value, and xend is mapped to lead(nucleotide), meaning the next value. This of course means the last row is dropped, as it does not have a next value.
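As a small illustration of why the last row goes missing, the shift performed by lead() can be approximated in base R by dropping the first element and padding with NA. This is only a sketch of the idea using the ten positions from the question, not the dplyr implementation.

```r
nucleotide <- 7684:7693

# Base-R approximation of dplyr::lead(nucleotide): shift left, pad with NA
xend <- c(nucleotide[-1], NA)

segments <- data.frame(x = nucleotide, xend = xend)
# The last segment has no end point, so ggplot removes it with a warning
print(tail(segments, 2))
```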
The following code takes care of that, admittedly in a hackish way, by adding a row to the df and then limiting scale_x. It may not be generalizable.
It also adds some graphical embellishment.
df %>%
add_row(p = '', nucleotide = max(.$nucleotide) + 1) %>%
ggplot() +
geom_segment(aes(x = nucleotide, xend = lead(nucleotide), y = 1, yend = 1, color = p), size = 4) +
geom_text(aes(x = nucleotide, y = 1, label = nucleotide), nudge_x = .5, size = 3) +
scale_x_continuous(breaks = NULL, limits = c(min(df$nucleotide), max(df$nucleotide) + 1)) +
scale_color_brewer(palette = 'Dark2', limits = c('a', 'c', 't'), direction = 1) +
theme(aspect.ratio = .2,
panel.background = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank())
#> Warning: Removed 1 rows containing missing values (geom_segment).
#> Warning: Removed 1 rows containing missing values (geom_text).
Data
df <- read.table(text = ' p nucleotide
1 c 7684
2 c 7685
3 t 7686
4 t 7687
5 a 7688
6 c 7689
7 a 7690
8 t 7691
9 a 7692
10 c 7693', header = T)

Plotting baseball pitches as qualitative variable by color

I was thinking of doing this in R, but I am new to it and would appreciate any help.
I have a dataset (pitches) of baseball pitches identified by
'pitchNumber' and 'outcome', e.g. S = swinging strike, B = ball, H = hit,
etc.
For example:
1 B ;
2 H ;
3 S ;
4 S ;
5 X ;
6 H; etc.
All I want is a graph that plots them in a line (cf. BHSSXB),
replacing each letter with a small bar colored to represent the letter, with a legend, and optionally the pitch number above the color. Somewhat like a sparkline.
Any suggestions on how to implement this would be much appreciated.
And the same graph using ggplot.
Data courtesy of @GavinSimpson.
library(ggplot2)
ggplot(baseball, aes(x=pitchNumber, y=1, ymin=0, ymax=1, colour=outcome)) +
  geom_point() +
  geom_linerange() +
  ylab(NULL) +
  xlab(NULL) +
  scale_y_continuous(breaks=c(0, 1)) +
  # Blank out the panel background, the minor grid and the y-axis text
  theme(
    panel.background=element_blank(),
    panel.grid.minor=element_blank(),
    axis.text.y = element_blank()
  )
Here is a base graphics idea from which to work. First some dummy data:
set.seed(1)
baseball <- data.frame(pitchNumber = seq_len(50),
outcome = factor(sample(c("B","H","S","S","X","H"),
50, replace = TRUE)))
> head(baseball)
  pitchNumber outcome
1           1       H
2           2       S
3           3       S
4           4       H
5           5       H
6           6       H
Next we define the colours we want:
## better colours - like ggplot for the cool kids
##cols <- c("red","green","blue","yellow")
cols <- head(hcl(seq(from = 0, to = 360,
length.out = nlevels(with(baseball, outcome)) + 1),
l = 65, c = 100), -1)
then plot the pitchNumber as a height 1 histogram-like bar (type = "h"), suppressing the normal axes, and we add on points to the tops of the bars to help visualisation:
with(baseball, plot(pitchNumber, y = rep(1, length(pitchNumber)), type = "h",
ylim = c(0, 1.2), col = cols[outcome],
ylab = "", xlab = "Pitch", axes = FALSE, lwd = 2))
with(baseball, points(pitchNumber, y = rep(1, length(pitchNumber)), pch = 16,
col = cols[outcome]))
Add on the x-axis and the plot frame, plus a legend:
axis(side = 1)
box()
## note: this assumes that the levels are in alphabetical order B,H,S,X...
legend("topleft", legend = c("Ball","Hit","Swinging Strike","X??"), lty = 1,
pch = 16, col = cols, bty = "n", ncol = 2, lwd = 2)
Gives this:
This is in response to your last comment on @Gavin's answer. I'm going to build off of the data provided by @Gavin and the ggplot2 plot by @Andrie. ggplot() supports the concept of faceting by a variable or variables. Here you want to facet by pitcher and at the pitch limit of 50 per row. We'll create a new variable that corresponds to each row we want to plot separately. The equivalent code in base graphics would entail adjusting mfrow or mfcol in par() and calling separate plots for each group of data.
#150 pitches represents a somewhat typical 9 inning game.
#Thanks to Gavin for sample data.
longGame <- rbind(baseball, baseball, baseball)
#Starter goes 95 pitches, middle relief throws 35, closer comes in for 20 and the glory
longGame$pitcher <- c(rep("S", 95), rep("M", 35), rep("C",20))
#Adjust pitchNumber accordingly
longGame$pitchNumber <- c(1:95, 1:35, 1:20)
#We want to show 50 pitches at a time, so will combine the pitcher name
#with which set of pitches this is
longGame$facet <- with(longGame, paste(pitcher, ceiling(pitchNumber / 50), sep = ""))
#Create the x-axis in increments of 1-50, by pitcher (ddply() comes from the plyr package)
library(plyr)
longGame <- ddply(longGame, "facet", transform, pitchFacet = rep(1:50, 5)[1:length(facet)])
#Convert facet to factor in the right order
longGame$facet <- factor(longGame$facet, levels = c("S1", "S2", "M1", "C1"))
#Thanks to Andrie for ggplot2 function. I change the x-axis and add a facet_wrap
ggplot(longGame, aes(x=pitchFacet, y=1, ymin=0, ymax=1, colour=outcome)) +
  geom_point() +
  geom_linerange() +
  facet_wrap(~facet, ncol = 1) +
  ylab(NULL) +
  xlab(NULL) +
  scale_y_continuous(breaks=c(0, 1)) +
  # Blank out the panel background, the minor grid and the y-axis text
  theme(
    panel.background=element_blank(),
    panel.grid.minor=element_blank(),
    axis.text.y = element_blank()
  )
You can obviously change the labels for the facet variable, but the above code will produce:
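As a side check on the faceting variables used above, the same row labels and within-row positions can be computed with plain vector arithmetic. This is a sketch using the pitch counts from the answer (95, 35 and 20); the modulo expression is an alternative written for this illustration and is not part of the original answer.

```r
# Pitcher and per-pitcher pitch number, as in the answer above
pitcher     <- c(rep("S", 95), rep("M", 35), rep("C", 20))
pitchNumber <- c(1:95, 1:35, 1:20)

# Row label: pitcher plus which block of 50 pitches this is
facet <- paste0(pitcher, ceiling(pitchNumber / 50))

# Position within the row, 1 to 50, restarting after every 50 pitches
pitchFacet <- (pitchNumber - 1) %% 50 + 1

print(table(factor(facet, levels = c("S1", "S2", "M1", "C1"))))
```

The starter's 95 pitches are split into a full row of 50 (S1) and a second row of 45 (S2), while the two relievers each fit on a single row.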

Resources