Visualizing the difference between two points with ggplot2 - r

I want to visualize the difference between two points with a line/bar in ggplot2.
Suppose we have some data on income and spending as a time series.
We would like to visualize not only them, but the balance (=income - spending) as well.
Furthermore, we would like to indicate whether the balance was positive (=surplus) or negative (=deficit).
I have tried several approaches, but none of them produced a satisfying result. Here we go with a reproducible example.
# Load libraries and create LONG data example data.frame
library(dplyr)
library(ggplot2)
library(tidyr)
df <- data.frame(year = rep(2000:2009, times=3),
var = rep(c("income","spending","balance"), each=10),
value = c(0:9, 9:0, rep(c("deficit","surplus"), each=5)))
df
1.Approach with LONG data
Unsurprisingly, it doesn't work with LONG data,
because the geom_linerange arguments ymin and ymax cannot be specified correctly. ymin=value, ymax=value is definitely the wrong way to go (expected behaviour). ymin=income, ymax=spending is obviously wrong, too (expected behaviour).
df %>%
ggplot() +
geom_point(aes(x=year, y=value, colour=var)) +
geom_linerange(aes(x=year, ymin=value, ymax=value, colour=net))
#>Error in function_list[[i]](value) : could not find function "spread"
2.Approach with WIDE data
I almost got it working with WIDE data.
The plot looks good, but the legend for the geom_point(s) is missing (expected behaviour).
Simply adding show.legend = TRUE to the two geom_point(s) doesn't solve the problem as it overprints the geom_linerange legend. Besides, I would rather have the geom_point lines of code combined in one (see 1.Approach).
df %>%
spread(var, value) %>%
ggplot() +
geom_linerange(aes(x=year, ymin=spending, ymax=income, colour=balance)) +
geom_point(aes(x=year, y=spending), colour="red", size=3) +
geom_point(aes(x=year, y=income), colour="green", size=3) +
ggtitle("income (green) - spending (red) = balance")
3.Approach using LONG and WIDE data
Combining the 1.Approach with the 2.Approach results in yet another unsatisfying plot. The legend does not differentiate between balance and var (=expected behaviour).
ggplot() +
geom_point(data=(df %>% filter(var=="income" | var=="spending")),
aes(x=year, y=value, colour=var)) +
geom_linerange(data=(df %>% spread(var, value)),
aes(x=year, ymin=spending, ymax=income, colour=balance))
Any (elegant) way out of this dilemma?
Should I use some other geom instead of geom_linerange?
Is my data in the right format?

Try
ggplot(df[df$var != "balance", ]) +
geom_point(
aes(x = year, y = value, fill = var),
size=3, pch = 21, colour = alpha("white", 0)) +
geom_linerange(
aes(x = year, ymin = income, ymax = spending, colour = balance),
data = spread(df, var, value)) +
scale_fill_manual(values = c("green", "red"))
Output:
The main idea is that we use two different types of aesthetics for colours (fill for the points, with the appropriate pch, and colour for the lines) so that we get separate legends for each.
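As a minimal sketch of the same idea with numeric columns: the data frames `wide` and `long` below are hypothetical stand-ins for the question's data, and the line colours are arbitrary choices. The points map `fill` and the line ranges map `colour`, so ggplot2 should build one legend for each.

```r
library(ggplot2)

# Hypothetical wide data: one row per year, numeric income and spending
wide <- data.frame(year = 2000:2009, income = 0:9, spending = 9:0)
wide$balance <- ifelse(wide$income >= wide$spending, "surplus", "deficit")

# Long copy of the two series, used only for the points
long <- data.frame(
  year  = rep(wide$year, times = 2),
  var   = rep(c("income", "spending"), each = nrow(wide)),
  value = c(wide$income, wide$spending)
)

p <- ggplot() +
  # the lines use the colour aesthetic ...
  geom_linerange(data = wide,
                 aes(x = year, ymin = spending, ymax = income, colour = balance)) +
  # ... while the points use fill (shape 21 is a fillable circle)
  geom_point(data = long, aes(x = year, y = value, fill = var),
             shape = 21, size = 3, colour = "transparent") +
  scale_fill_manual(values = c(income = "green", spending = "red")) +
  scale_colour_manual(values = c(deficit = "firebrick", surplus = "darkgreen"))
```

Printing `p` draws the plot. Naming the entries in `values` ties each colour to a level regardless of level order.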

Related

Additional x axis on ggplot

I'm aware there are similar posts but I could not get those answers to work in my case.
e.g. Here and here.
Example:
diamonds %>%
ggplot(aes(scale(price) %>% as.vector)) +
geom_density() +
xlim(-3, 3) +
facet_wrap(vars(cut))
Returns a plot:
Since I used scale, those numbers are the zscores or standard deviations away from the mean of each break.
I would like to add as a row underneath the equivalent non scaled raw number that corresponds to each.
Tried:
diamonds %>%
ggplot(aes(scale(price) %>% as.vector)) +
geom_density() +
xlim(-3, 3) +
facet_wrap(vars(cut)) +
geom_text(aes(label = price))
Gives:
Error: geom_text requires the following missing aesthetics: y
My primary question is how can I add the raw values underneath -3:3 of each break? I don't want to change those breaks, I still want 6 breaks between -3:3.
Secondary question, how can I get -3 and 3 to actually show up in the chart? They have been trimmed.
[edit]
I've been trying to make it work with geom_text but keep hitting errors:
diamonds %>%
ggplot(aes(x = scale(price) %>% as.vector)) +
geom_density() +
xlim(-3, 3) +
facet_wrap(vars(cut)) +
geom_text(label = price)
Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomText, :
object 'price' not found
I then tried changing my call to geom_text()
geom_text(data = diamonds, aes(price), label = price)
This results in the same error message.
You can make a custom labeling function for your axis. It takes each axis label and applies a custom transform. In your case you could paste together the z-score, a line break, and the z-score times the standard deviation plus the mean. Because of the distribution of prices in the diamonds data set, z-scores below about -1 correspond to negative prices. This may not be a problem in your own data. For clarity I have drawn a vertical line representing $0.
labeller <- function(x) {
paste0(x,"\n", scales::dollar(sd(diamonds$price) * x + mean(diamonds$price)))
}
diamonds %>%
ggplot(aes(scale(price) %>% as.vector)) +
geom_density() +
geom_vline(aes(xintercept = -0.98580251364833), linetype = 2) +
facet_wrap(vars(cut)) +
scale_x_continuous(labels = labeller, limits = c(-3, 3)) +
xlab("price")
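To check the arithmetic behind the labels, here is a small sketch. The helper name `z_to_price` is made up for illustration. It inverts the scaling and computes the z-score at which the raw price is $0, which should agree with the xintercept used in geom_vline above.

```r
library(ggplot2)  # provides the diamonds data set

mean_price <- mean(diamonds$price)
sd_price   <- sd(diamonds$price)

# Hypothetical helper: map z-scores back to raw prices
z_to_price <- function(z) mean_price + sd_price * z

# Round trip: scaling and then inverting recovers the original prices
z <- as.vector(scale(diamonds$price))
round_trip_ok <- isTRUE(all.equal(z_to_price(z), as.numeric(diamonds$price)))

# z-score at which the raw price equals $0
zero_price_z <- -mean_price / sd_price
```

Note that scale() is applied to the whole price column here, so the z-scores are relative to the overall mean and standard deviation, not to each facet.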
We can use the sec_axis functionality in scale_x_continuous. To use this functionality we need to manually scale your data. This will add a secondary axis at the top of the plot, not underneath. So it's not quite exactly what you're looking for.
library(tidyverse)
# manually scale the data
mean_price <- mean(diamonds$price)
sd_price <- sd(diamonds$price)
diamonds$price_scaled <- (diamonds$price - mean_price) / sd_price
# make the plot
ggplot(diamonds, aes(price_scaled))+
geom_density()+
facet_wrap(~cut)+
scale_x_continuous(sec.axis = sec_axis(~ mean_price + (sd_price * .)),
limits = c(-3, 4), breaks = -3:3)
You could cheat a bit by passing some dummy data to geom_text:
geom_text(data = tibble(label = round(((-3:3) * sd_price) + mean_price),
y = -0.25,
x = -3:3),
aes(x, y, label = label))

Visualize clock effect by coloring ggplot data based on criteria

I am trying to visualize the correlation between the scheduled arrival time and the actual arrival time of data in the flights package.
Doing this I generated the plot as shown here:
Code:
library(nycflights13)
library(ggplot2)
attach(flights)
ggplot(flights, aes(x = sched_arr_time, y = arr_time)) +
geom_point(size = 2, color='darkblue', alpha = 0.2) +
geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95, color='red')
I would like to color all data that is affected by the clock passing 00.00 orange.
How would I do this?
I guess you made a mistake in your question, because based on your conditions all points will be orange.
Anyway, if you would like to color your points conditionally, there are multiple ways of doing it. For example, you can create an extra column with an ifelse statement that holds the color of interest, map that column to color in aes, and call scale_color_identity so that the color names stored in the column are used as-is:
library(dplyr)
library(ggplot2)
flights %>% mutate(Color = ifelse(sched_arr_time < 2400 & arr_time > 0, "orange","darkblue")) %>%
ggplot(aes(x = sched_arr_time, y = arr_time, color = Color))+
geom_point(alpha = 0.2)+
geom_smooth(color = "red")+
scale_color_identity()
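One possible way to flag only the points affected by the clock wrapping at midnight is sketched below. This rests on an assumption not stated in the question: that a gap of more than 12 hours (1200 in HHMM notation) between scheduled and actual arrival means midnight was crossed. The `toy` data frame is made up and only mimics the HHMM integer format of the flights columns.

```r
library(ggplot2)

# Hypothetical toy data in HHMM integer format
toy <- data.frame(
  sched_arr_time = c(900, 1530, 2330, 2355,   15),
  arr_time       = c(905, 1545,   10,   40, 2350)
)

# Assumption: a difference of more than 12 hours signals a midnight wrap
toy$wrapped <- abs(toy$arr_time - toy$sched_arr_time) > 1200
toy$Color   <- ifelse(toy$wrapped, "orange", "darkblue")

p <- ggplot(toy, aes(x = sched_arr_time, y = arr_time, colour = Color)) +
  geom_point(size = 2) +
  scale_colour_identity()  # use the colour names in the column as-is
```

The same two lines that create `wrapped` and `Color` could be applied to flights inside mutate(). The threshold is a judgment call and may need tuning.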

How to adjust distances between years on x-axis and adjust line of geom_line(), in ggplot, R-studio?

I would like to create a line plot using ggplot's geom_line() where all distances between years are equal independent of the actual value the year-variable takes and where the dots of geom_point() are connected if there are only two years in between but not if the temporal distance is more than that.
Example:
my.data<-data.frame(
year=c(2001,2003,2005,NA,NA,NA,NA,NA,NA,2019),
value=c(runif(10)))
As for the plot I have tried two different things, both of which are not ideal:
Plotting year as continuous variable with breaks=year and minor_breaks=F, where, obviously the distances between the first three observations are much smaller than the distance between 2005 and 2019, and where, unfortunately, all dots are connected:
library(ggplot2)
library(dplyr)
my.data %>%
ggplot(aes(x=year,y=value)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks=c(2001,2003,2005,2019), minor_breaks=F) +
theme_minimal()
Removing NAs and plotting year as factor which yields equal spacing between the years, but obviously removes the lines between data points:
my.data %>%
filter(!is.na(year)) %>%
ggplot(aes(x=factor(year),y=value)) +
geom_line() +
geom_point() +
theme_minimal()
Are there any solutions to these issues? What am I overlooking?
First attempt:
Second attempt:
What I need (but ideally without the help of Paint):
my.data %>%
ggplot(aes(x=year)) +
geom_line(aes(y = ifelse(year <= 2005,value,NA))) +
geom_point(aes(y = value)) +
scale_x_continuous(breaks=c(2001,2003,2005,2019), minor_breaks=F) +
theme_minimal()
Maybe something like this would work.
I came up with a somewhat convoluted and not especially clean solution, but it might get the job done. I check with lead() whether each year should be connected to the next one, and "remove" the unwanted connections by turning them white. The dummy column is there to put all years in one line rather than two.
library(tidyr) # provides fill(); dplyr and ggplot2 are loaded as in the question
my.data = data.frame(year=c(2001,2003,2005,2008,2009,2012,2015,2016,NA,2019),
value=c(runif(10))) %>%
filter(!is.na(year)) %>%
mutate(grouped = if_else(lead(year) - year <= 2, "yes", "no")) %>%
fill(grouped, .direction = "down") %>%
mutate(dummy = "all")
my.data %>%
ggplot(aes(x = factor(year),y = value)) +
geom_line(aes(y = value, group = dummy, color = grouped), show.legend = FALSE) +
geom_point() +
scale_color_manual(values = c("yes" = "black", "no" = "white")) +
theme_classic()
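An alternative sketch that avoids painting segments white: start a new group whenever the gap to the previous year exceeds two, and map that group id to the group aesthetic so geom_line only connects points within a group. The column name `segment` and the values in `d` are made up for illustration.

```r
library(ggplot2)

# Hypothetical data: three closely spaced years and one distant year
d <- data.frame(year  = c(2001, 2003, 2005, 2019),
                value = c(0.2, 0.5, 0.4, 0.8))

# A new segment starts at the first row and after every gap of more than 2 years
d$segment <- cumsum(c(TRUE, diff(d$year) > 2))

p <- ggplot(d, aes(x = factor(year), y = value, group = segment)) +
  geom_line() +    # connects points only within the same segment
  geom_point() +
  theme_minimal()
```

Using factor(year) on the x axis keeps the years equally spaced, while the grouping controls which points are joined.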

dodge columns in ggplot2

I am trying to create a picture that summarises my data. Data is about prevalence of drug use obtained from different practices from different countries. Each practice has contributed with a different amount of data and I want to show all of this in my picture.
Here is a subset of the data to work on:
gr<-data.frame(matrix(0,36))
gr$drug<-c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b")
gr$practice<-c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r")
gr$country<-c("c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3","c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3")
gr$prevalence<-c(9.14,5.53,16.74,1.93,8.51,14.96,18.90,11.18,15.00,20.10,24.56,22.29,19.41,20.25,25.01,25.87,29.33,20.76,18.94,24.60,26.51,13.37,23.84,21.82,23.69,20.56,30.53,16.66,28.71,23.83,21.16,24.66,26.42,27.38,32.46,25.34)
gr$prop<-c(0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406,0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406)
gr$low.CI<-c(8.27,4.80,12.35,1.83,7.22,14.53,18.25,10.56,14.28,18.76,24.25,21.72,18.62,19.83,24.36,25.22,28.80,20.20,17.73,23.15,21.06,13.12,21.79,21.32,22.99,19.76,29.60,15.41,28.39,23.25,20.34,24.20,25.76,26.72,31.92,24.73)
gr$high.CI<-c(10.10,6.37,22.31,2.04,10.00,15.40,19.56,11.83,15.74,21.52,24.87,22.86,20.23,20.68,25.67,26.53,29.86,21.34,20.21,26.10,32.79,13.63,26.02,22.33,24.41,21.39,31.48,17.98,29.04,24.43,22.01,25.12,27.09,28.05,33.01,25.95)
The code I wrote is this
p<-ggplot(data=gr, aes(x=factor(drug), y=as.numeric(gr$prevalence), ymax=max(high.CI),position="dodge",fill=practice,width=prop))
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
p + theme_bw()+
geom_bar(stat="identity",position = position_dodge(0.9)) +
labs(x="Drug",y="Prevalence") +
geom_errorbar(ymax=gr$high.CI,ymin=gr$low.CI,position=position_dodge(0.9),width=0.25,size=0.25,colour="black",aes(x=factor(drug), y=as.numeric(gr$prevalence), fill=practice)) +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill=F)
The figure I obtain is this one where bars are all on top of each other while I want them "dodge".
I also obtain the following warning:
ymax not defined: adjusting position using y instead
Warning message:
position_dodge requires non-overlapping x intervals
Ideally I would get each bar near one another, with their error bars in the middle of its bar, all organised by country.
Also should I be concerned about the warning (which I clearly do not fully understand)?
I hope this makes sense. I hope I am close enough, but I don't seem to be going anywhere, some help would be greatly appreciated.
Thank you
ggplot's geom_bar() accepts the width parameter, but doesn't line them up neatly against one another in dodged position by default. The following workaround references the solution here:
library(dplyr)
# calculate x-axis position for bars of varying width
gr <- gr %>%
group_by(drug) %>%
arrange(practice) %>%
mutate(pos = 0.5 * (cumsum(prop) + cumsum(c(0, prop[-length(prop)])))) %>%
ungroup()
x.labels <- gr$practice[gr$drug == "a"]
x.pos <- gr$pos[gr$drug == "a"]
ggplot(gr,
aes(x = pos, y = prevalence,
fill = country, width = prop,
ymin = low.CI, ymax = high.CI)) +
geom_col(col = "black") +
geom_errorbar(size = 0.25, colour = "black") +
facet_wrap(~drug) +
scale_fill_manual(values = c("c1" = "gray79",
"c2" = "gray60",
"c3" = "gray39"),
guide = "none") +
scale_x_continuous(name = "Drug",
labels = x.labels,
breaks = x.pos) +
labs(title = "Drug usage by country and practice", y = "Prevalence") +
theme_classic()
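The pos calculation in the mutate() step places each bar at the midpoint between its left and right edges. A small worked sketch with made-up widths shows the arithmetic:

```r
# Hypothetical bar widths that sum to 1
prop <- c(0.2, 0.5, 0.3)

right <- cumsum(prop)                           # right edge of each bar
left  <- cumsum(c(0, prop[-length(prop)]))      # left edge of each bar
pos   <- 0.5 * (left + right)                   # bar centres

pos
# 0.10 0.45 0.85
```

Because every bar is centred at pos with width prop, adjacent bars touch without overlapping.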
There is a lot of information you are trying to convey here - to contrast drug A and drug B across countries using the barplots and accounting for proportions, you might use the facet_grid function. Try this:
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
gr$drug <- paste("Drug", gr$drug)
p<-ggplot(data=gr, aes(x=factor(practice), y=as.numeric(prevalence),
ymax=high.CI,ymin = low.CI,
fill=practice, width=prop))
p + theme_bw()+ facet_grid(drug~country, scales="free") +
geom_bar(stat="identity") +
labs(x="Practice",y="Prevalence") +
geom_errorbar(position=position_dodge(0.9), width=0.25,size=0.25,colour="black") +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill="none")
The width is too small in the C1 country and as you indicated the one clinic is quite influential.
Also, you can specify your aesthetics once in ggplot(aes(...)) without resetting them in each layer, and there is no need to include the data frame's name (e.g. gr$prevalence) inside aes.

Stacked Bar Plot for Temperature vs Home Runs

I am trying to make some changes to my plot, but am having difficulty doing so.
(1) I would like warm, avg, and cold to be filled in as the colors red, yellow, and blue, respectively.
(2) I am trying to make the y-axis read "Count" and have it be horizontally written.
(3) In the legend, I would like the title to be Temperatures, rather than variable
Any help making these changes would be much appreciated along with other suggestions to make the plot look nicer.
df <- read.table(textConnection(
'Statistic Warm Avg Cold
Homers(Away) 1.151 1.028 .841
Homers(Home) 1.202 1.058 .949'), header = TRUE)
library(ggplot2)
library(reshape2)
df <- melt(df, id = 'Statistic')
ggplot(
data = df,
aes(
y = value,
x = Statistic,
group = variable,
shape = variable,
fill = variable
)
) +
geom_bar(stat = "identity")
You are on the right lines by trying to reshape the data into long format. My preference is to use gather from the tidyr package for that. You can also create the variable names Temperatures and Count in the gather step.
The next step is to turn the 3 classes of temperature into a factor, ordered from cold, through average, to warm.
Now you can plot. You want position = "dodge" to get the bars side by side, since it makes no sense to stack the values in a single bar. Fill colours you specify using scale_fill_manual.
You rotate the y-axis title by manipulating axis.title.y.
So putting all of that together (plus a black/white theme):
library(dplyr)
library(tidyr)
library(ggplot2)
df %>%
gather(Temperatures, Count, -Statistic) %>%
mutate(Temperatures = factor(Temperatures, c("Cold", "Avg", "Warm"))) %>%
ggplot(aes(Statistic, Count)) +
geom_col(aes(fill = Temperatures), position = "dodge") +
scale_fill_manual(values = c("blue", "yellow", "red")) +
theme_bw() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5))
Result:
I'd question whether Count is a sensible variable name in this case.
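For reference, gather() has been superseded in tidyr by pivot_longer(). A sketch of the same reshaping step with the newer function, assuming tidyr 1.0.0 or later and the wide data as read in by the question (retyped here as `wide` so the example is self-contained):

```r
library(tidyr)

# The question's data in its original wide form
wide <- data.frame(
  Statistic = c("Homers(Away)", "Homers(Home)"),
  Warm = c(1.151, 1.202),
  Avg  = c(1.028, 1.058),
  Cold = c(0.841, 0.949)
)

# Same reshaping as gather(Temperatures, Count, -Statistic)
long <- pivot_longer(wide, cols = -Statistic,
                     names_to = "Temperatures", values_to = "Count")

# Order the levels from cold to warm, as in the pipeline above
long$Temperatures <- factor(long$Temperatures, levels = c("Cold", "Avg", "Warm"))
```

The resulting `long` can be passed to the same ggplot() call in place of the gather()/mutate() steps.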
You are almost there. To map specific colors to specific factor levels you can use scale_fill_manual and create your own scale:
scale_fill_manual(values=c("Warm"="red", "Avg"="yellow", "Cold"="blue")) +
Changing the y-axis title is also easy in ggplot:
ylab("Count") +
And to change the legend title you can use:
labs(fill='TEMPERATURE') +
Giving us:
ggplot(df, aes(y = value, x = Statistic, group= variable, fill = variable)) +
geom_bar(stat = "identity") +
scale_fill_manual(values=c("Warm"="red", "Avg"="yellow", "Cold"="blue")) +
labs(fill='TEMPERATURE') +
ylab("Count") +
xlab("") +
theme_bw() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5))
