Adding total counts as horizontal lines to histograms in facet_grid() - r

Data:
I have a data frame comprising 4 variables and about 300k rows including a unique account ID, a start date in yyyy-mm-dd, a start year, and the total number of months to-date the customer has held an account active. Snippet of the data below (don't let the row numbers confuse, this is obviously a subset, if more data is necessary, let me know):
> head(ten.by.id)
    acct.id start_date strt.yr max_ten
1       155 1998-11-01    1998     175
19      902 2001-09-01    2001     143
39      995 2001-09-01    2001     143
59     1014 2000-10-01    2000     153
78     1017 2000-04-01    2000     160
100    1137 2000-11-01    2000     153
Problem (Why I want to render a faceted plot):
Showing a histogram of the entire dataset across all years renders the following:
Obviously, there are mixed distributions of information here, but the effect is unknown. First I thought I'd check for time domain effects with a visual. By using facets, I can provide a serial histogram of frequency distributions by year, overlaying the KDE plot for each year.
If multiple distributions were a product of something that occurred over time, I could spot check relevant shape changes (i.e. uni to multimodal). I used the code below to generate this plot:
# Each "+" must end a line, not start one; otherwise R stops at the first line.
maxten_time <- ggplot(ten.by.id, aes(max_ten)) +
  geom_histogram(colour="grey19", fill="orange", binwidth=2, stat="bin") +
  scale_y_continuous(breaks=seq(0,12000,by=100)) +
  scale_x_continuous(breaks=seq(0,180,by=45)) +
  labs(title ="Serial Distribution of Max Length of Tenure for all Customers by Start Date", x="Max Tenure(months)", y="# of Customers", colour="blue") +
  facet_grid(. ~ strt.yr) + geom_density(fill=NA, colour="orange", cex=1) + aes(y = ..count..)
Which renders the following:
Questions for recreating the faceted plot:
What I wish to do is add a horizontal line (or some other single marker) to each facet which indicates the total # of customer starts for each year. Can this be done in a faceted plot?
I would like to add an additional axis that spans across the facets to mark the number of months across all years (1 to 175). Am I reaching with ggplot to try to do this (i.e. since each facet is its own plot, would aligning the month markers across all facets even be possible)? I haven't seen any relevant examples on doing something quite like this.
The objective is merely to combine the horiz lines in each facet and the axis across facets into the entire plot. Any direction would be helpful.
Phillip
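For the first question, one common approach is to compute the totals in a separate data frame with one row per year and pass it to geom_hline(); facet_grid() then draws each line only in the panel whose strt.yr matches. The sketch below is not from the original post: it uses randomly generated stand-in data in place of ten.by.id, and the names yr.totals and n.starts are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for ten.by.id (random values, not the real data)
set.seed(1)
ten.by.id <- data.frame(
  strt.yr = sample(1998:2001, 500, replace = TRUE),
  max_ten = sample(1:175, 500, replace = TRUE)
)

# One row per start year: the number of customers who started that year
yr.totals <- aggregate(max_ten ~ strt.yr, data = ten.by.id, FUN = length)
names(yr.totals)[2] <- "n.starts"  # hypothetical column name

p <- ggplot(ten.by.id, aes(max_ten)) +
  geom_histogram(binwidth = 2, colour = "grey19", fill = "orange") +
  # yr.totals carries strt.yr, so each line lands in its own facet
  geom_hline(data = yr.totals, aes(yintercept = n.starts), linetype = 2) +
  facet_grid(. ~ strt.yr)
print(p)
```

Because a year's total is far larger than any single bin count, the line will stretch the y axis considerably; drawing the total as a text label with geom_text() on the same data frame may read better. As for the second question, with the default fixed scales facet_grid() already gives every panel the same x axis, so the month breaks line up across facets.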

Related

plotting multiple lines in ggplot R

I have neuroscientific data where we count synapses/cells in the cochlea and quantify these per frequency. We do this for animals of different ages. What I thus ideally want is the frequencies (5,10,20,30,40) in the x-axis and the amount of synapses/cells plotted on the y-axis (usually a numerical value from 10 - 20). The graph then will contain 5 lines of the different ages (6 weeks, 17 weeks, 43 weeks, 69 weeks and 96 weeks).
I try this with ggplot and first just want to plot one age. When I use the following command:
ggplot(mydata, aes(x=Frequency, y=puncta6)) + geom_line()
I get a graph, but no line and the following error: 'geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?'
So I found I have to adjust the code to:
ggplot(mydata, aes(x=Frequency, y=puncta6, group = 1)) + geom_line()
This works, except for the fact that my first data point (5 kHz) is now plotted behind my last data point (40 kHz). (This also happens without the 'group = 1' addition.) How do I solve this, or is there an easier way to plot this kind of data?
I couldn't add a file, so I added a photo of my code and graph with the 5 kHz data point oddly located, and a photo of my data in Excel.
example data
example code and graph
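A likely cause, assuming Frequency was read in as text or as a factor, is that the values are ordered as strings, which puts "5" after "40". The sketch below uses made-up numbers, and the column puncta17 is hypothetical (only puncta6 appears in the question). It converts Frequency to a number and stacks the ages into one long data frame so that each age gets its own line.

```r
library(ggplot2)

# Hypothetical data; the counts are invented for illustration
mydata <- data.frame(
  Frequency = c("5", "10", "20", "30", "40"),
  puncta6   = c(12, 15, 17, 16, 13),
  puncta17  = c(11, 14, 16, 15, 12)   # hypothetical second age
)

# As text, the frequencies sort as "10" "20" "30" "40" "5"
mydata$Frequency <- as.numeric(as.character(mydata$Frequency))

# Long format: one row per frequency and age
long <- data.frame(
  Frequency = rep(mydata$Frequency, times = 2),
  age       = rep(c("6 weeks", "17 weeks"), each = nrow(mydata)),
  puncta    = c(mydata$puncta6, mydata$puncta17)
)

p <- ggplot(long, aes(x = Frequency, y = puncta, colour = age, group = age)) +
  geom_line() +
  geom_point()
print(p)
```

With a numeric x axis the group = 1 workaround is no longer needed, because geom_line() groups by the colour variable.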

Chart showing current value and historic value relative to a range

I would like to recreate the following chart in R using ggplot. My data is in a similar table where, for each code (A, B, C, etc.), I have a current value, a value 12M ago, and the respective range (max, min) over the period.
My chart needs to show the current value in red, the value 12M ago in blue, and a line showing the max and min range.
I can produce this painstakingly in Excel using error bars, but I would like to reproduce it in R.
Any ideas on how I can do this using ggplot? Thanks.
Here's what I came up with, but just a note: please if you post your dataset, don't post an image, but instead post the result of dput(your.data.frame). The result of that is easily copy-pasted into the console in order to replicate your dataset, whereas I recreated your data frame manually. :/
A few points first regarding your data as is and the intended plot:
The red and blue hash marks used to indicate 12 months ago and today are not a geom I know of off the top of my head, so I'm using geom_point here to show them (easiest way). You can pick another geom if you wish to show them differently.
The ranges for high and low are already specified by those column names. I'll use those values for the required aesthetics in geom_errorbar.
You can use your data as is to plot and use two separate geom_point calls (one for "today" and one for "12M ago"), but that's going to make creating the legend more difficult than it needs to be, so the better option is to adjust the dataset to support having the legend created automatically. For that, we'll use the gather function from tidyr, being sure to just "gather together" the information in "today" and "12M ago" (my column name for that was different b/c you need to start with a letter in the data frame), but leave alone the columns for "high", "low", and the letters (called "category" in my dataframe).
Where df is the original data frame:
library(tidyr)  # provides gather() and re-exports %>%
df1 <- df %>% gather(time, value, -category, -high, -low)
The new dataframe (df1) looks like this (18 observations total):
  category high low    time value
1        A   82  28 M12.ago    81
2        B   82  54 M12.ago    80
3        C   80  65 M12.ago    75
4        D   76  34 M12.ago    70
5        E   94  51 M12.ago    93
6        F   72  61 M12.ago    65
where "time" has "M12.ago" or "today".
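In current versions of tidyr, gather() is superseded by pivot_longer(), so the same reshaping could also be sketched as below. The high, low and M12.ago values for A to F are copied from the rows shown above; the today values are invented placeholders, since they are not shown here.

```r
library(tidyr)

df <- data.frame(
  category = c("A", "B", "C", "D", "E", "F"),
  high     = c(82, 82, 80, 76, 94, 72),
  low      = c(28, 54, 65, 34, 51, 61),
  M12.ago  = c(81, 80, 75, 70, 93, 65),
  today    = c(60, 70, 68, 50, 75, 63)   # hypothetical values
)

# Stack M12.ago and today into a "time"/"value" pair; keep the other columns
df1 <- pivot_longer(df, cols = c(M12.ago, today),
                    names_to = "time", values_to = "value")
print(head(df1))
```

The row order differs from gather() (rows are grouped by category rather than by time), which does not affect the plot.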
For the plot, you apply category to x and value to y, and specify ymax and ymin with high and low, respectively for the geom_errorbar:
ggplot(df1, aes(x=category, y=value)) +
  geom_errorbar(aes(ymin=low, ymax=high), width=0.2) + ylim(0,100) +
  geom_point(aes(color=time), size=2) +
  # values expects a (named) vector rather than a list
  scale_color_manual(values=c('M12.ago'='blue', 'today'='red')) +
  theme_bw() + labs(color="") + theme(legend.position='bottom')
Giving you this:

set x/y limits in facet_wrap with scales = 'free'

I've seen similar questions asked, and this discussion about adding functionality to ggplot: "Setting x/y lim in facet_grid". In my research I often want to produce several panel plots, say for different simulation trials, where the axis limits remain the same to highlight differences between the trials. This is especially useful when showing the plot panels in a presentation. In each panel plot I produce, the individual plots require independent y axes as they're often weather variables: temperature, relative humidity, windspeed, etc. Using
ggplot() + ... + facet_wrap(~ ..., scales = 'free_y')
works great as I can easily produce plot panels of different weather variables.
When I compare between different plot panels, it's nice to have consistent axes. Unfortunately ggplot provides no way of setting the individual limits of each plot within a panel plot. It defaults to using the range of the given data. The Google Group discussion linked above discusses this shortcoming, but I was unable to find any updates as to whether this could be added. Is there a way to trick ggplot into setting the individual limits?
A first suggestion that somewhat sidesteps the solution I'm looking for is to combine all my data into one data table and use facet_grid on my variable and simulation
ggplot() + ... + facet_grid(variable~simulation, scales = 'free_y')
This produces a fine looking plot that displays the data in one figure, but can become unwieldy when considering many simulations.
To 'hack' the plotting into producing what I want, I first determined which limits I desired for each weather variable. These limits were found by looking at the greatest extents for all simulations of interest. Once determined I created a small data table with the same columns as my simulation data and appended it to the end. My simulation data had the structure
'year' 'month' 'variable' 'run' 'mean'
1973   1       'rhmax'    1     65.44
1973   2       'rhmax'    1     67.44
...    ...     ...        ...   ...
2011   12      'windmin'  200   0.4
So I created a new data table with the same columns
ylims.sims <- data.table(year = 1, month = 13,
variable = rep(c('rhmax','rhmin','sradmean','tmax','tmin','windmax','windmin'), each = 2),
run = 201, mean = c(20, 100, 0, 80, 100, 350, 25, 40, 12, 32, 0, 8, 0, 2))
Which gives
'year' 'month' 'variable' 'run' 'mean'
1      13      'rhmax'    201   20
1      13      'rhmax'    201   100
1      13      'rhmin'    201   0
1      13      'rhmin'    201   80
1      13      'sradmean' 201   100
1      13      'sradmean' 201   350
1      13      'tmax'     201   25
1      13      'tmax'     201   40
1      13      'tmin'     201   12
1      13      'tmin'     201   32
1      13      'windmax'  201   0
1      13      'windmax'  201   8
1      13      'windmin'  201   0
1      13      'windmin'  201   2
While the choice of year and run is arbitrary, the choice of month needs to be anything outside 1:12. I then appended this to my simulation data
# ylims.sims is the limits table created above
sim1data.ylims <- rbind(sim1data, ylims.sims)
ggplot() + geom_boxplot(data = sim1data.ylims, aes(x = factor(month), y = mean)) +
  facet_wrap(~variable, scales = 'free_y') + xlab('month') +
  xlim('1','2','3','4','5','6','7','8','9','10','11','12')
When I plot these data with the y limits, I limit the x-axis values to those in the original data. The appended data table with y limits has month values of 13. As ggplot still scales axes to the entire dataset, even when the axes are limited, this gives me the y limits I desire. Important to note that if there are data values greater than the limits you specify, this will not work.
Before: Notice the differences in the y limits for each weather variable between the panels.
After: Now the y limits remain consistent for each weather variable between the panels.
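A variation on the same idea, sketched here as an assumption rather than as the method used above, is to keep the limits in their own small data frame and add them as an invisible geom_blank() layer. The layer draws nothing but still takes part in scale training, so no dummy month value is needed. The data below is randomly generated, and only two of the weather variables are included to keep it short.

```r
library(ggplot2)

set.seed(42)
# Hypothetical stand-in for sim1data (random values)
sim1data <- data.frame(
  month    = rep(1:12, times = 20),
  variable = rep(c("tmax", "windmax"), each = 120),
  mean     = c(runif(120, 26, 39), runif(120, 1, 7))
)

# Desired y limits: two rows (min and max) per variable
ylims <- data.frame(
  variable = rep(c("tmax", "windmax"), each = 2),
  mean     = c(25, 40, 0, 8)
)

p <- ggplot(sim1data, aes(x = factor(month), y = mean)) +
  geom_boxplot() +
  # Invisible layer: only stretches each facet's y range to the given limits
  geom_blank(data = ylims, aes(y = mean), inherit.aes = FALSE) +
  facet_wrap(~variable, scales = "free_y") +
  xlab("month")
print(p)
```

The same caveat applies as with the appended rows: if the data contain values outside the requested limits, the axis simply expands to include them.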
I hope to edit this post in the coming days and add a reproducible example for better explanation. Please comment if you've heard anything about adding this functionality to ggplot.

Ordering bars in a stacked bar plot using ggplot

The following is a simplified version of my dataframe (without too much loss in generality)
sales<-data.frame(ItemID=c(1,3,7,9,10,12),
Salesman=c("Bob","Sue","Jane","Bob","Sue","Jane"),
ProfitLoss=c(10.00,9.00,9.50,-7.50,-11.00,-1.00))
which produces
ItemID Salesman ProfitLoss
1 1 Bob 10.0
2 3 Sue 9.0
3 7 Jane 9.5
4 9 Bob -7.5
5 10 Sue -11.0
6 12 Jane -1.0
The following produces a stacked bar plot of each salesman's sales, ordered by the overall profit for each salesman.
sales$Salesman<-reorder(sales$Salesman,-sales$ProfitLoss,FUN="sum") #to order the bars
profits<-sales[which(sales$ProfitLoss>0),]
losses<-sales[which(sales$ProfitLoss<0),]
ggplot()+
geom_bar(data=losses,aes(x=Salesman, y=ProfitLoss),stat="identity", color="white")+
geom_bar(data=profits,aes(x=Salesman, y=ProfitLoss),stat="identity", color="white")
This works exactly as I desire. My issue arises when one of the salesmen has a profit but no loss, or a loss but no profit. For instance, changing sales to
sales<-data.frame(ItemID=c(1,3,7,9,10),
Salesman=c("Bob","Sue","Jane","Bob","Sue"),
ProfitLoss=c(10.00,9.00,9.50,-7.50,-11.00))
and reapplying the previous steps produces
So, the salesmen are clearly out of order. For this example I can cheat and plot my profits before losses like
ggplot()+
geom_bar(data=profits,aes(x=Salesman, y=ProfitLoss),stat="identity", color="white")+
geom_bar(data=losses,aes(x=Salesman, y=ProfitLoss),stat="identity", color="white")
but that won't work for my real dataset.
Edit: In my real dataset, each salesman has more than two sales, and for each salesman I've stacked the bars so that the smallest bars in magnitude are closest to the x axis and the largest bars (i.e. biggest profit, biggest loss) are farthest from the x axis. For this reason, I need to call geom_bar() on both the profits dataframe and the losses dataframe. (I originally left this information out to try to avoid making my question too complex.)
The problem is the first plot call to geom_bar(losses dataset) only has two levels of salesman, hence the order is changed - that's why calling profits first still works (as there are still all levels). But your reordering works if you change the plot call
sales<-data.frame(ItemID=c(1,3,7,9,10),
Salesman=c("Bob","Sue","Jane","Bob","Sue"),
ProfitLoss=c(10.00,9.00,9.50,-7.50,-11.00))
#to order the bars
sales$Salesman<-reorder(sales$Salesman,-sales$ProfitLoss,FUN="sum")
# Changed plot call
ggplot(sales, aes(x = factor(Salesman), y = ProfitLoss)) +
geom_bar(stat = "identity",position="dodge",color="white")
-------------------------------------------------------------------------------
Following your edit: do you want the longest bars [i.e. the largest (profit + abs(losses))] furthest from the y-axis, rather than ordering by descending revenue? You can do this by changing the reorder function. Apologies if I misunderstand.
I changed Jane's data so that it is the longest overall bar
sales<-data.frame(ItemID=c(1,3,7,9,10),
                  Salesman=c("Bob","Sue","Jane","Bob","Sue"),
                  ProfitLoss=c(10.00,9.00,29.50,-7.50,-11.00))
# order by total bar length: sum of absolute profits and losses
sales$Salesman<-reorder(sales$Salesman,-sales$ProfitLoss,function(z) sum(abs(z)))
ggplot(sales, aes(x = factor(Salesman), y = ProfitLoss)) +
  geom_bar(stat = "identity",position="dodge",color="white")
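If the two separate layers for losses and profits need to stay, as the question's edit suggests, another option is to pin the x axis to the factor's levels so that the first layer cannot change the order. This is a sketch under that assumption, not part of the answer above; it uses geom_col(), which is equivalent to geom_bar(stat = "identity").

```r
library(ggplot2)

sales <- data.frame(ItemID = c(1, 3, 7, 9, 10),
                    Salesman = c("Bob", "Sue", "Jane", "Bob", "Sue"),
                    ProfitLoss = c(10.00, 9.00, 9.50, -7.50, -11.00))

# Order salesmen by descending overall profit
sales$Salesman <- reorder(factor(sales$Salesman), -sales$ProfitLoss, FUN = sum)
profits <- sales[sales$ProfitLoss > 0, ]
losses  <- sales[sales$ProfitLoss < 0, ]

p <- ggplot() +
  geom_col(data = losses,  aes(x = Salesman, y = ProfitLoss), colour = "white") +
  geom_col(data = profits, aes(x = Salesman, y = ProfitLoss), colour = "white") +
  # Fix the axis order regardless of which levels the first layer contains
  scale_x_discrete(limits = levels(sales$Salesman))
print(p)
```

Here Jane has no losses, yet she still appears first because the limits come from the full factor rather than from the losses layer.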

How to label percentage values inside stacked bar plot using R-base [duplicate]

I am new to R. I would like someone to explain how to add the values inside the individual stacked bars in a consistent way using the basic R plotting functions (base R). I tried to plot a stacked bar graph using base R, but the values appear in an inconsistent/illogical way: each village is supposed to total 100%, but the labels don't sum to 100%.
Here is the data that I am working on:
Village  100       200       300  400  500
Male     68.33333  53.33333  70   70   61.66667
Female   31.66667  46.66667  30   30   38.33333
In summary, there are five villages and the data showing the head of household interviewed by sex.
I have used the following command towards plotting the graph:
barplot(mydata,col=c("yellow","green"))
x<-barplot(mydata,col=c("yellow","green"))
text(x,mydata,labels=mydata,pos=3,offset=.5)
Please help to allocate the correct values in each bar
Thanks
This started as a comment but it seemed unfair to not turn into an answer. To answer your question (even on Stack Overflow) properly we need to know how "mydata" is structured. I assumed at first it was a data frame with 5 rows and 2 or 3 columns but in this case your code makes no sense. However, if this were how it is structured here is one way to do what I think you want:
mydata <- data.frame(
row.names =c(100, 200, 300, 400, 500),
Male =c(68.33333, 53.33333, 70, 70, 61.66667),
Female =c(31.66667, 46.66667, 30, 30, 38.33333))
x <- barplot(t(as.matrix(mydata)), col=c("yellow", "green"),
legend=TRUE, border=NA, xlim=c(0,8), args.legend=
list(bty="n", border=NA),
ylab="Cumulative percentage", xlab="Village number")
text(x, mydata$Male-10, labels=round(mydata$Male), col="black")
text(x, mydata$Male+10, labels=100-round(mydata$Male))
which produces the following:
An alternative would be to set the y value to 40 for all the male text labels, and 80 for all the females - this would have the advantage of less confusing jitter of the labels, and the disadvantage that the text vertical position is no longer notionally attached to data.
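A third option, sketched here under the same assumption about how mydata is structured, is to place each label at the vertical midpoint of its own segment. The midpoint is the cumulative height up to and including the segment minus half the segment's height. The names m and ymid are made up for this example.

```r
mydata <- data.frame(
  row.names = c(100, 200, 300, 400, 500),
  Male   = c(68.33333, 53.33333, 70, 70, 61.66667),
  Female = c(31.66667, 46.66667, 30, 30, 38.33333))

m <- t(as.matrix(mydata))   # rows: Male, Female; columns: villages
x <- barplot(m, col = c("yellow", "green"), border = NA,
             ylab = "Cumulative percentage", xlab = "Village number")

# Midpoint of each stacked segment, computed per village (column)
ymid <- apply(m, 2, function(col) cumsum(col) - col / 2)

# as.vector() runs down each column, so repeat every bar position per segment
text(rep(x, each = nrow(m)), as.vector(ymid), labels = round(as.vector(m)))
```

This keeps every label attached to its data and generalises to more than two segments per bar.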
Personally, I don't much like this barplot at all, although there are many far worse crimes against data visualisation than a straightforward bar plot. Numbers on plots are cluttering and detract from the visual impact of the actual mapping of data to colours, shapes and sizes. I'd rather a simple dot plot like:
library(ggplot2)
ggplot(mydata, aes(x=row.names(mydata), y=Male)) +
geom_point(size=4) +
coord_flip() +
labs(x="Village number\n", y="Percentage male") +
ylim(0,100) +
geom_hline(yintercept=50, linetype=2)
which gives
There is less redundant clutter in the plot, a higher data to ink ratio, etc. However in the end you need to produce the plot that will mean something for your audience.
