Label Range between two points on scatterplot with the percent difference - r

I have a simple scatterplot showing sales difference between years at different ranges.
So, when the range is ">$400", sales are X in 2013 and X in 2014.
I am trying to add an annotation at certain points showing the percent difference from 2013 to 2014. Is that possible?
Here is the dput:
structure(list(Year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2014L,
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L,
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L,
2014L, 2014L), Range = structure(c(8L, 9L, 10L, 11L, 12L, 13L,
14L, 16L, 17L, 18L, 19L, 20L, 21L, 23L, 24L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 26L, 8L, 9L, 10L, 11L, 12L, 13L, 15L, 17L, 18L, 19L,
20L, 21L, 23L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 26L), .Label = c("$40M",
"$50M", "$60M", "$70M", "$71-80M", "$81-90M", "$91-100M", "$101-110M",
"$111-120M", "$121-130M", "$131-140M", "$141-150M", "$151-160M",
"$161-170M", "$171-180M", "$181-190M", "$191-200M", "$200-225M",
"$226-250M", "$251-275M", "$276-300M", "$301-325M", "$326-350M",
"$351-375M", "$376-400M", ">$400M"), class = "factor"), Avg_TOTALS = c(44732492.5,
42902206, 47355762, 49604750.6666667, 51132411, 51943986, 54798652.5,
61313778.5, 68577392, 74457422.6666667, 84805802.5, 96762417,
99355792, 172956681, 189815908, 31762600.8571429, 33042576.2857143,
34964083.8, 34349980.2, 35193407, 36049038.6666667, 42039793.3333333,
486133671, 35996925, 35496337.5, 39139472.5, 36993568.5, 39570379,
40139421.5, 43835119, 51358298.5, 53024160, 61185564, 67726723,
71481251, 89873814, 27746650.1428571, 27633867, 29855703.5714286,
29655265.2, 31163788.8, 29240507, 33810795.25, 192756973)), .Names = c("Year",
"Range", "Avg_TOTALS"), class = "data.frame", row.names = c(NA,
-44L))
And here is the chart I am currently generating:
library(ggplot2)
library(ggthemes) # provides theme_tufte()
library(scales) # provides dollar
orderlist = c("$40M", "$50M", "$60M", "$70M", "$71-80M", "$81-90M", "$91-100M", "$101-110M", "$111-120M", "$121-130M",
"$131-140M", "$141-150M", "$151-160M", "$161-170M", "$171-180M", "$181-190M", "$191-200M", "$200-225M",
"$226-250M", "$251-275M", "$276-300M", "$301-325M", "$326-350M", "$351-375M", "$376-400M", ">$400M")
myDF = transform(myDF, Range = factor(Range, levels = orderlist))
myChart <- ggplot(myDF, aes(x = Range, y = Avg_TOTALS)) +
geom_point(aes(color = factor(Year))) +
theme_tufte() +
theme(axis.text.x= element_text(angle = 90, hjust = 0)) +
labs(x = "Range", y = "Sales by Range", title = "MyChart")+
scale_y_continuous(breaks = c(50000000, 100000000, 200000000,
300000000,400000000, 500000000),
labels = dollar)
Which gives me:
And leads me to this question:
How would I add the percent difference between each of those points, with 2013 being the base year? Also, there are a few ranges where there were sales in only one of the two years- would it be possible to skip the percent labels on those? A condition in which data must exist in both years to be included?
Thanks for any help!

Here is one way; there are probably better ones. Let me briefly explain the code. I reproduced your plot, then pulled out the data that ggplot uses for the point layer, which I called foo. I created a master data frame containing every year/range combination and joined it to foo, so that missing data points show up as NA. The dplyr part then computes the proportion of each range's sales relative to the first year. Finally, I passed the result to annotate() to add the labels you wanted. Hope this helps.
DATA
mydf <- structure(list(Year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2014L,
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L,
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L,
2014L, 2014L), Range = structure(c(8L, 9L, 10L, 11L, 12L, 13L,
14L, 16L, 17L, 18L, 19L, 20L, 21L, 23L, 24L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 26L, 8L, 9L, 10L, 11L, 12L, 13L, 15L, 17L, 18L, 19L,
20L, 21L, 23L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 26L), .Label = c("$40M",
"$50M", "$60M", "$70M", "$71-80M", "$81-90M", "$91-100M", "$101-110M",
"$111-120M", "$121-130M", "$131-140M", "$141-150M", "$151-160M",
"$161-170M", "$171-180M", "$181-190M", "$191-200M", "$200-225M",
"$226-250M", "$251-275M", "$276-300M", "$301-325M", "$326-350M",
"$351-375M", "$376-400M", ">$400M"), class = "factor"), Avg_TOTALS = c(44732492.5,
42902206, 47355762, 49604750.6666667, 51132411, 51943986, 54798652.5,
61313778.5, 68577392, 74457422.6666667, 84805802.5, 96762417,
99355792, 172956681, 189815908, 31762600.8571429, 33042576.2857143,
34964083.8, 34349980.2, 35193407, 36049038.6666667, 42039793.3333333,
486133671, 35996925, 35496337.5, 39139472.5, 36993568.5, 39570379,
40139421.5, 43835119, 51358298.5, 53024160, 61185564, 67726723,
71481251, 89873814, 27746650.1428571, 27633867, 29855703.5714286,
29655265.2, 31163788.8, 29240507, 33810795.25, 192756973)), .Names = c("Year",
"Range", "Avg_TOTALS"), class = "data.frame", row.names = c(NA,
-44L))
library(ggplot2)
library(scales) # provides dollar
orderlist = c("$40M", "$50M", "$60M", "$70M", "$71-80M", "$81-90M", "$91-100M", "$101-110M", "$111-120M", "$121-130M",
"$131-140M", "$141-150M", "$151-160M", "$161-170M", "$171-180M", "$181-190M", "$191-200M", "$200-225M",
"$226-250M", "$251-275M", "$276-300M", "$301-325M", "$326-350M", "$351-375M", "$376-400M", ">$400M")
mydf = transform(mydf, Range = factor(Range, levels = orderlist))
g <- ggplot(mydf, aes(x = Range, y = Avg_TOTALS)) +
geom_point(aes(color = factor(Year))) +
#theme_tufte() +
theme(axis.text.x= element_text(angle = 90, hjust = 0))+
labs(x="Range", y = "Sales by Range", title = "MyChart")+
scale_y_continuous(breaks = c(50000000, 100000000, 200000000, 300000000,400000000, 500000000), labels = dollar)
library(dplyr)
foo <- ggplot_build(g)$data[[1]] %>%
arrange(group) %>%
mutate(year = c(rep("2013", times = 23), rep("2014", times = 21)))
master <- expand.grid(year = c("2013", "2014"), group = 1:24)
full_join(master, foo, by = c("year", c("group" = "x"))) %>%
group_by(group) %>%
mutate(prop = round(order_by(year, y / first(y)), 2)) %>%
summarise(y = first(y), prop = min(prop, na.rm = FALSE)) -> txt
g + annotate("text", x = txt$group, y = txt$y + 15000000, label = txt$prop)
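As an alternative that does not depend on ggplot's internal layer data, the percent difference can be computed from the data frame itself. The following is only a sketch on made-up toy values in the same shape as the question's data (the names sales, base_year, next_year and pct are hypothetical). An inner join on Range keeps only the ranges that have sales in both years, which covers the "skip if missing" condition.

```r
# Toy data shaped like the question's (Year, Range, Avg_TOTALS); values are made up
sales <- data.frame(
  Year = c(2013L, 2013L, 2013L, 2014L, 2014L),
  Range = c("$40M", "$50M", "$60M", "$40M", "$50M"),
  Avg_TOTALS = c(100, 200, 300, 90, 250)
)

# Split by year and inner-join on Range: ranges missing in either year drop out
base_year <- sales[sales$Year == 2013L, c("Range", "Avg_TOTALS")]
next_year <- sales[sales$Year == 2014L, c("Range", "Avg_TOTALS")]
pct <- merge(base_year, next_year, by = "Range", suffixes = c("_2013", "_2014"))

# Percent change with 2013 as the base year
pct$pct_diff <- 100 * (pct$Avg_TOTALS_2014 - pct$Avg_TOTALS_2013) / pct$Avg_TOTALS_2013
pct$label <- sprintf("%+.1f%%", pct$pct_diff)

# Height of the higher of the two points, useful as the label position
pct$y <- pmax(pct$Avg_TOTALS_2013, pct$Avg_TOTALS_2014)
```

With the real data, something like g + geom_text(data = pct, aes(x = Range, y = y, label = label), vjust = -1) should then place one label above each pair of points. It is probably worth converting pct$Range to a factor with the same levels as the plot data first, so the axis order is preserved.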


Linking Shiny Reactive inputs and input updates

I have a dataset of baby names that are ranked by popularity for each year.
What I currently have: a simple shiny app that filters on year based on a slider and a select button that identifies which column is the rank column to use (which also creates a color highlight). This will be two datasets in actuality, one for gender marked as M or gender marked as F, but I've left it simple for the example here.
What I would like to do: update this to be reactive to the values of a slider, which then updates a select option for which file column to sort and highlight.
The current approach works for simplicity, but the focus selector for the year obviously throws an error if it's a value that no longer exists in the slider selected range.
I've dug around and tried a few approaches, but I just haven't been able to get the reactivity portion to work successfully. I'm sure I'm missing something elementary but hitting a wall. Thank you for any input.
Example:
library(shiny)
library(tidyverse)
library(DT)
#Fake Data
dat <- structure(list(Name = c("Bill", "Sean", "Kirby", "Philbert",
"Bob", "Lucius", "Fry", "Tyron", "Lionel", "Alister", "Newt",
"Craig", "A-Aron", "Bill", "Sean", "Kirby", "Philbert", "Bob",
"Lucius", "Fry", "Tyron", "Lionel", "Alister", "Newt", "Craig",
"A-Aron", "Bill", "Sean", "Kirby", "Philbert", "Bob", "Lucius",
"Fry", "Tyron", "Lionel", "Alister", "Newt", "Craig", "A-Aron"
), rank = c(8L, 1L, 2L, 3L, 4L, 6L, 5L, 9L, 7L, 25L, 10L, 35L,
99L, 4L, 1L, 3L, 2L, 5L, 6L, 7L, 11L, 5L, 12L, 8L, 9L, 10L, 4L,
2L, 3L, 10L, 8L, 11L, 5L, 6L, 12L, 7L, 13L, 9L, 1L), year = c(2008L,
2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L,
2008L, 2008L, 2008L, 2009L, 2009L, 2009L, 2009L, 2009L, 2009L,
2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2010L, 2010L,
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L)), class = "data.frame", row.names = c(NA, -39L))
#Get years
years <- unique(dat$year)
ui <- fluidPage(
titlePanel("Top Ten Male Baby Names"),
sliderInput("range",
label = "Choose year range",
min = min(as.numeric(years)),
max = max(as.numeric(years)),
sep = "",
value = c(max(as.numeric(years))-1,max(as.numeric(years)))
),
selectInput("year",
label = "Choose year for rank",
choices = as.numeric(years),
selected = max(as.numeric(years))
)
,
mainPanel(
dataTableOutput("DataTable")
)
)
server <- function(input, output) {
output$DataTable <- renderDataTable({
dat1 <- dat %>%
filter((year >= input$range[1] & year <= input$range[2]) ) %>%
pivot_wider(id_cols = Name,
values_from = rank,
names_from = year) %>%
filter(.[colnames(.) == as.character(input$year)] <11) %>%
arrange(.[colnames(.)== as.character(input$year)])
datatable(dat1,
options = list(ordering=F,
lengthChange = F,
pageLength = -1)) %>%
formatStyle(input$year,
backgroundColor = "lightgreen"
)
})
}
shinyApp(ui, server)
You could set up an observeEvent to watch for changes to the sliderInput; if the selected year falls outside the slider's range, update the selection.
Note: you need to add the session param to the server function.
Also, since output$DataTable is filtered by both the range and the year, I've added a validate statement in case the user manually chooses a year that is not in the current range.
server <- function(input, output, session) {
# Observe for a change to slider input
observeEvent(input$range, {
sel <- as.numeric(input$year)
# input$range holds only the two endpoints, so compare against them instead of using %in%
if (sel < input$range[1] || sel > input$range[2]) {
sel <- min(input$range)
updateSelectInput(session, "year", selected = sel)
}
})
output$DataTable <- renderDataTable({
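# check the whole interval, not just the two slider endpoints
validate(need(as.numeric(input$year) >= input$range[1] && as.numeric(input$year) <= input$range[2], "Current selection not in range"))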
dat1 <- dat %>%
filter((year >= input$range[1] & year <= input$range[2]) ) %>%
pivot_wider(id_cols = Name,
values_from = rank,
names_from = year) %>%
filter(.[colnames(.) == as.character(input$year)] <11) %>%
arrange(.[colnames(.)== as.character(input$year)])
datatable(dat1,
options = list(ordering=F,
lengthChange = F,
pageLength = -1)) %>%
formatStyle(input$year,
backgroundColor = "lightgreen"
)
})
}
shinyApp(ui, server)
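One detail worth checking when testing a year against the slider: a range slider's value is just its two endpoints, so a membership test with %in% only matches those endpoints and misses the years in between. A minimal sketch of an interval check, outside of Shiny (year_in_range is a hypothetical helper, not part of any package):

```r
# Hypothetical helper: TRUE if `year` lies between the two slider endpoints
year_in_range <- function(year, range) {
  year <- as.numeric(year)  # selectInput returns the chosen value as a string
  year >= min(range) & year <= max(range)
}

slider <- c(2008, 2010)  # what input$range would look like for 2008 to 2010

endpoint_only  <- "2009" %in% slider               # compares against 2008 and 2010 only
interval_check <- year_in_range("2009", slider)    # compares against the whole interval
outside        <- year_in_range("2011", slider)
```

Inside the server function the same test could guard the call to updateSelectInput, which also accepts a choices argument if you would rather restrict the dropdown to the years currently in range.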

How to plot Unequal Interval on x axis of a time series data?

I am using plotrix package to visualize changes in the data using colors. The data is available here.
I am using below code for plotting the data.
library(plotrix)
my_colors1=c("red", "green","blue")
a<-read.csv("DataSt.csv")
x<-a$Year
y<-a$TP
clplot(x, y, main="",lwd=5,labels=y,levels=c(37,964,4377),col=my_colors1, showcuts=T, bty="n",xlab="Year", ylab = "numbers", axes=F)
axis(1, at = a$Year, las=2)
axis(2, at = seq(0, 4400, by = 100), las=2)
This produces the chart above.
I want to reduce the axis space between the years 1975 and 1989. How can I get unequal intervals on the x axis?
It's a bit dangerous to do this, given that the viewer might not notice the inconsistent spacing among the x-axis values. Nevertheless, the following example shows a possible solution that treats the x-values as factor levels. The problem is that that plotting function only allows numeric values. I therefore set up the plot with factors, and then use their numeric codes to draw interpolated values in between with segments:
a <- structure(list(Year = c(2021L, 2020L, 2019L, 2018L, 2017L, 2016L,
2015L, 2014L, 2013L, 2012L, 2011L, 2010L, 2009L, 2008L, 2007L,
2006L, 2005L, 2004L, 2003L, 2002L, 2001L, 2000L, 1999L, 1998L,
1997L, 1996L, 1995L, 1994L, 1993L, 1992L, 1991L, 1990L, 1989L,
1975L), TP = c(785L, 848L, 1067L, 1079L, 1263L, 678L, 1204L,
542L, 661L, 387L, 3534L, 4377L, 964L, 244L, 237L, 145L, 86L,
37L, 39L, 23L, 14L, 11L, 7L, 9L, 6L, 3L, 7L, 7L, 6L, 1L, 1L,
1L, 2L, 1L)), class = "data.frame", row.names = c(NA, -34L))
a$Year <- factor(a$Year)
a <- a[order(a$Year),]
head(a)
my_colors1=c("red", "green","blue")
plot(TP ~ Year, a, col = NA, border = NA, las = 2)
for(i in 2:nrow(a)){
b <- as.data.frame(approx(x = as.numeric(a$Year[(i-1):i]), y = a$TP[(i-1):i], n = 100))
b$col <- my_colors1[as.numeric(cut(b$y, breaks = c(-Inf,37,964,4377,Inf)))]
segments(x0 = b$x[-nrow(b)], x1 = b$x[-1], y0 = b$y[-nrow(b)], y1 = b$y[-1], col = b$col[-1])
}
abline(h = c(37,964), lty = 2)
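To see why the factor trick removes the gap, it may help to look at the pieces in isolation. This is a small sketch on made-up values (years, tp, pos, seg and tp_col are hypothetical names, and the thresholds are only similar to the ones above): the factor codes are the ranks of the years, approx() interpolates on that rank scale, and cut() picks a colour per value.

```r
# Hypothetical toy years with one large gap (1975 -> 1989)
years <- c(1975L, 1989L, 1990L, 1991L)
tp <- c(1, 2, 40, 1000)

# Converting to a factor replaces each year by its rank, so the gap disappears
pos <- as.numeric(factor(years))

# Interpolate between the first two points on the rank scale
seg <- approx(x = pos[1:2], y = tp[1:2], n = 5)

# Choose a colour for each value from two thresholds (three bins, three colours)
cols <- c("red", "green", "blue")
tp_col <- cols[as.numeric(cut(tp, breaks = c(-Inf, 37, 964, Inf)))]
```

Because 1975 and 1989 end up one unit apart, it is probably worth saying so in the axis label or caption so readers are not misled.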

Numeric year but month as character. How to change months into numeric?

So imagine I have a dataset where the column "date" contains years 2011-2017 and months for each year, however months are written out in letters. For example:
date: 11-Jan
I would like to make the months numeric so I get:
date: 11-01
Any suggestions on how I can tackle this problem?
Kind regards!
Make your input proper dates, parse them, then format them.
x <- c("11-Jan", "12-Feb")
Sys.setlocale("LC_TIME", "C") #parsing of months depends on locale
format(
as.Date(paste0(x, "-1"), format = "%y-%b-%d"),
"%y-%m"
)
#[1] "11-01" "12-02"
See help("strptime") for details on format strings.
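If changing the locale is not an option, a locale-independent variant is to look the abbreviation up in R's built-in month.abb vector. A minimal sketch, assuming the strings are always "yy-Mon" with English three-letter abbreviations (parts, yr, mon and out are hypothetical names):

```r
x <- c("11-Jan", "12-Feb", "17-Dec")

# Split each string into its year and month parts
parts <- strsplit(x, "-", fixed = TRUE)
yr  <- vapply(parts, `[`, character(1), 1)
mon <- vapply(parts, `[`, character(1), 2)

# month.abb is c("Jan", ..., "Dec"), so match() returns the month number
out <- sprintf("%s-%02d", yr, match(mon, month.abb))
```

Here out holds the year and zero-padded month as character strings; an abbreviation that is not in month.abb would come back as NA rather than raising an error.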
Assuming your data is like:
df1 <- structure(list(day_mon = c("16-Dec", "18-Nov", "12-Oct", "8-Oct",
"15-May", "29-Jun", "22-Feb", "25-May", "23-Jan", "24-Oct", "23-May",
"27-Sep", "9-Apr", "28-Oct", "18-Jan", "8-Apr", "7-Jan", "13-Dec",
"28-Nov", "24-May"), year = c(2012L, 2014L, 2011L, 2015L, 2015L,
2015L, 2011L, 2015L, 2012L, 2015L, 2011L, 2012L, 2014L, 2012L,
2013L, 2011L, 2017L, 2016L, 2014L, 2014L)),
row.names = c(
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L), class = "data.frame")
You can:
# Format the month and day: mon_day_fmt => character vector
df1$mon_day_fmt <- paste(
sprintf(
"%02d",
match(
gsub(
"\\d+\\-(\\w+)",
"\\1",
with(
df1,
day_mon
)
),
month.abb
)
),
sprintf(
"%02d",
as.integer(
gsub(
"^(\\d+)\\-\\w+$",
"\\1",
with(
df1,
day_mon
)
)
)
),
sep = "-"
)
# Create a date vector: date => Date Vector
df1$date <- as.Date(
paste(
df1$year,
df1$mon_day_fmt,
sep = "-"
)
)
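The same idea can be written more compactly by extracting the day and month once each. This is only a sketch on a three-row toy version of the data (day and mon are hypothetical names), assuming every value has the form "d-Mon":

```r
# Toy subset in the same shape as df1 above
df1 <- data.frame(
  day_mon = c("16-Dec", "8-Oct", "7-Jan"),
  year = c(2012L, 2015L, 2017L)
)

# Everything before the hyphen is the day, everything after it the month
day <- as.integer(sub("-.*$", "", df1$day_mon))
mon <- match(sub("^.*-", "", df1$day_mon), month.abb)

# Build ISO dates (yyyy-mm-dd) and parse them
df1$date <- as.Date(sprintf("%d-%02d-%02d", df1$year, mon, day))
```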

How to change the geom_point appearance?

Using this data:
Data2 <- structure(list(year = c(2008L, 2009L, 2010L, 2011L, 2012L, 2013L,
2014L, 2015L, 2016L, 2017L, 2018L, 2019L, 2008L, 2009L, 2010L,
2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 2019L,
2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L,
2017L, 2018L, 2019L), variable = c("var1", "var1", "var1", "var1",
"var1", "var1", "var1", "var1", "var1", "var1", "var1", "var1",
"var2", "var2", "var2", "var2", "var2", "var2", "var2", "var2",
"var2", "var2", "var2", "var2", "var3", "var3", "var3", "var3",
"var3", "var3", "var3", "var3", "var3", "var3", "var3", "var3"
), frequency = c(1L, 0L, 0L, 0L, 1L, 1L, 3L, 4L, 3L, 10L, 9L,
0L, 0L, 1L, 0L, 0L, 3L, 2L, 2L, 3L, 8L, 9L, 12L, 3L, 0L, 0L,
0L, 0L, 0L, 2L, 1L, 0L, 1L, 2L, 4L, 0L)), class = "data.frame", row.names = c(NA,
-36L))
I try to produce a plot like this one:
library(ggplot2)
ggplot(Data2, aes(year, variable, size = frequency, color = variable)) +
geom_point()+ theme(text = element_text(size = 18))
However I have some problem with the view of this graph:
How can I change the x axis to show all years, make the points bigger (I tried size, but as a fixed option it makes all circles the same size), and have the frequency scale start at 1 so that no circles are drawn for 0?
An easy way to get years to show up is to turn them into factors. As far as starting your size scale at one: I'm not sure if you want to remove points representing 0, or just start the scale from 1. The following code should give you an idea of how to do either, or both. To remove 0s from a plot it might be easiest to just turn them into NAs, which won't be plotted (it will throw a warning, but that's okay). You can change your size scale breaks using scale_size_continuous:
library(dplyr)
library(ggplot2)
Data2 %>%
mutate(frequency = ifelse(frequency == 0, NA, frequency),
year = as.factor(year)
) %>%
ggplot(aes(year, size = frequency, variable, color = variable)) +
geom_point() +
theme(text = element_text(size = 18)) +
scale_size_continuous(breaks = c(1, 4, 8, 12))
When the axis labels are not all showing up, or when they are superimposed, a standard trick is to rotate them. In this case I will rotate the x axis labels by 45 degrees.
library(ggplot2)
ggplot(Data2, aes(year, variable,
size = frequency, color = variable)) +
geom_point() +
theme(text = element_text(size = 18),
axis.text.x = element_text(angle = 45, hjust = 1))
A quick and dirty way to drop the zero frequencies is Data2[Data2 == 0] <- NA (note that this replaces zeros in every column).
Alternatively, only for the frequency column:
library(data.table)
setDT(Data2)
Data2[frequency==0, frequency:= NA]
For having proper years:
Data2$year = as.Date(strptime(Data2$year, "%Y"))
ggplot(Data2, aes(year, variable, size = frequency, color = variable)) +
geom_point()+
theme( axis.text.x = element_text(angle = 60, hjust = 1) )
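None of the snippets above changes the overall size of the circles. One option is the range argument of scale_size_continuous, which sets the smallest and largest point size. The sketch below combines that with the zero-to-NA step and explicit x-axis breaks on a made-up toy data frame (toy and the chosen sizes are assumptions, and the plot is only built if ggplot2 is installed):

```r
# Toy data in the same shape as Data2; values are made up
toy <- data.frame(
  year = rep(2008:2010, times = 2),
  variable = rep(c("var1", "var2"), each = 3),
  frequency = c(1L, 0L, 3L, 0L, 2L, 5L)
)

# Zeros become NA so that no circle is drawn for them
toy$frequency[toy$frequency == 0] <- NA

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(year, variable, size = frequency, color = variable)) +
    geom_point(na.rm = TRUE) +                       # na.rm silences the missing-value warning
    scale_x_continuous(breaks = unique(toy$year)) +  # one tick per year
    scale_size_continuous(range = c(2, 10),          # smallest and largest point size
                          breaks = c(1, 3, 5))       # legend starts at 1
}
```

Larger values in range give bigger circles overall while keeping them proportional to frequency.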

How to plot 2 categorical variables on X-axis and two continuous variables as "fill" using ggplot2 package?

I have a dataset that has two categorical variables, viz., Year and Category and two continuous variables TotalSales and AverageCount.
   Year       Category TotalSales AverageCount
1  2013      Beverages  102074.29     22190.06
2  2013     Condiments   55277.56     14173.73
3  2013    Confections   36415.75     12138.58
4  2013 Dairy Products   30337.39     24400.00
5  2013        Seafood   53019.98     27905.25
6  2014      Beverages   81338.06     35400.00
7  2014     Condiments   55948.82     19981.72
8  2014    Confections   44478.36     24710.00
9  2014 Dairy Products   84412.36     32466.00
10 2014        Seafood   65544.19     14565.37
In MS Excel, we can happily get a pivot-plot for the same table, with Year and Category as AXIS, TotalSales and AverageCount as sigma values.
Using R, how do I draw such a graph as shown in the image, where the categorical variables are shown as multiple layers in the same graph?
P.S. One option that I could see is, by splitting the data frame into two separate dataframes (One for year 2013 and another for year 2014 in our case) and draw two graphs on one single plot, arranged in multiple rows to get the same effect. But is there any way to draw it as shown above?
Sample data used above
dat <- structure(list(Year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L,
2014L, 2014L, 2014L, 2014L), Category = structure(c(1L, 2L, 3L,
4L, 5L, 1L, 2L, 3L, 4L, 5L), .Label = c("Beverages", "Condiments",
"Confections", "Dairy Products", "Seafood"), class = "factor"),
TotalSales = c(102074.29, 55277.56, 36415.75, 30337.39, 53019.98,
81338.06, 55948.82, 44478.36, 84412.36, 65544.19), AverageCount = c(22190.06,
14173.73, 12138.58, 24400, 27905.25, 35400, 19981.72, 24710,
32466, 14565.37)), .Names = c("Year", "Category", "TotalSales",
"AverageCount"), class = "data.frame", row.names = c(NA, -10L
))
You first need to reformat your data, as @EDi showed you in one of your older questions (ggplot : Multi variable (multiple continuous variable) plotting) and as @docendo discimus suggested in the comments.
library(reshape2)
dat_l <- melt(dat, id.vars = c("Year", "Category"))
Then you can use faceting like so:
library(ggplot2)
p <- ggplot(data = dat_l, aes(x = Category, y = value, group = variable, fill = variable))
p <- p + geom_bar(stat = "identity", width = 0.5, position = "dodge")
p <- p + facet_grid(. ~ Year)
p <- p + theme_bw()
p <- p + theme(axis.text.x = element_text(angle = 90))
p
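For readers who have not used melt before, the reshaping step stacks the two measure columns into a variable/value pair. A base-R sketch of the same wide-to-long idea on a two-row toy subset of the data (measure is a hypothetical name, and this is not how reshape2 implements it):

```r
# Two rows taken from the sample data above
dat <- data.frame(
  Year = c(2013L, 2014L),
  Category = c("Beverages", "Beverages"),
  TotalSales = c(102074.29, 81338.06),
  AverageCount = c(22190.06, 35400)
)

measure <- c("TotalSales", "AverageCount")

# Repeat the id columns once per measure and stack the measure columns
dat_l <- data.frame(
  Year = rep(dat$Year, times = length(measure)),
  Category = rep(dat$Category, times = length(measure)),
  variable = rep(measure, each = nrow(dat)),
  value = unlist(dat[measure], use.names = FALSE)
)
```

Each original row now appears once per measure, which is what lets ggplot map variable to fill and draw the dodged bars.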
If you are particularly interested in making the figure more consistent with an Excel-look, there are some strategies in the answer here that might be helpful: How do I plot charts with nested categories axes?.
Your original data in an easier to paste format:
dat <- structure(list(Year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L,
2014L, 2014L, 2014L, 2014L), Category = structure(c(1L, 2L, 3L,
4L, 5L, 1L, 2L, 3L, 4L, 5L), .Label = c("Beverages", "Condiments",
"Confections", "Dairy Products", "Seafood"), class = "factor"),
TotalSales = c(102074.29, 55277.56, 36415.75, 30337.39, 53019.98,
81338.06, 55948.82, 44478.36, 84412.36, 65544.19), AverageCount = c(22190.06,
14173.73, 12138.58, 24400, 27905.25, 35400, 19981.72, 24710,
32466, 14565.37)), .Names = c("Year", "Category", "TotalSales",
"AverageCount"), class = "data.frame", row.names = c(NA, -10L
))
