Ordering bars in a stacked bar plot using ggplot - r

The following is a simplified version of my dataframe (without too much loss in generality)
sales<-data.frame(ItemID=c(1,3,7,9,10,12),
Salesman=c("Bob","Sue","Jane","Bob","Sue","Jane"),
ProfitLoss=c(10.00,9.00,9.50,-7.50,-11.00,-1.00))
which produces
ItemID Salesman ProfitLoss
1 1 Bob 10.0
2 3 Sue 9.0
3 7 Jane 9.5
4 9 Bob -7.5
5 10 Sue -11.0
6 12 Jane -1.0
The following produces a stacked bar plot of each salesman's sales, ordered by the overall profit for each salesman.
sales$Salesman<-reorder(sales$Salesman,-sales$ProfitLoss,FUN="sum") #to order the bars
profits<-sales[which(sales$ProfitLoss>0),]
losses<-sales[which(sales$ProfitLoss<0),]
ggplot()+
geom_bar(data=losses,aes(x=Salesman, y=ProfitLoss),stat="identity", color="white")+
geom_bar(data=profits,aes(x=Salesman, y=ProfitLoss),stat="identity", color="white")
This works exactly as I desire. My issue arises when one of the salesmen has a profit but no loss, or a loss but no profit. For instance, changing sales to
sales<-data.frame(ItemID=c(1,3,7,9,10),
Salesman=c("Bob","Sue","Jane","Bob","Sue"),
ProfitLoss=c(10.00,9.00,9.50,-7.50,-11.00))
and reapplying the previous steps produces
So the salesmen are clearly out of order. For this example I can cheat and plot my profits before my losses, like
ggplot()+
geom_bar(data=profits,aes(x=Salesman, y=ProfitLoss),stat="identity", color="white")+
geom_bar(data=losses,aes(x=Salesman, y=ProfitLoss),stat="identity", color="white")
but that won't work for my real dataset.
Edit: In my real dataset, each salesman has more than two sales, and for each salesman I've stacked the bars so that the smallest bars in magnitude are closest to the x axis and the largest bars (i.e. biggest profit, biggest loss) are farthest from the x axis. For this reason, I need to call geom_bar() on both the profits dataframe and the losses dataframe. (I originally left this information out to try to avoid making my question too complex.)

The problem is that the first layer, geom_bar() on the losses data, only contains two of the salesman levels, so the order changes. That is also why calling profits first still works (all levels are present there). Your reordering works if you change the plot call:
sales<-data.frame(ItemID=c(1,3,7,9,10),
Salesman=c("Bob","Sue","Jane","Bob","Sue"),
ProfitLoss=c(10.00,9.00,9.50,-7.50,-11.00))
#to order the bars
sales$Salesman<-reorder(sales$Salesman,-sales$ProfitLoss,FUN="sum")
# Changed plot call
ggplot(sales, aes(x = factor(Salesman), y = ProfitLoss)) +
geom_bar(stat = "identity",position="dodge",color="white")
-------------------------------------------------------------------------------
Following your edit: do you want the longest bars [i.e. the largest (profit + abs(losses))] furthest from the y-axis, rather than ordering by descending revenue? You can do this by changing the function passed to reorder. Apologies if I misunderstand.
I changed Jane's data so that it is the longest overall bar
sales<-data.frame(ItemID=c(1,3,7,9,10),
Salesman=c("Bob","Sue","Jane","Bob","Sue"),
ProfitLoss=c(10.00,9.00,29.50,-7.50,-11.00))
sales$Salesman<-reorder(sales$Salesman,-sales$ProfitLoss,function(z) sum(abs(z)))
ggplot(sales, aes(x = factor(Salesman), y = ProfitLoss)) +
geom_bar(stat = "identity",position="dodge",color="white")
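If both geom_bar() layers have to stay (as the question's edit says), one option is to fix the x scale to the full set of factor levels, so the order no longer depends on which layer is drawn first. The following is only a sketch under that assumption, reusing the data from the question; scale_x_discrete(limits = ...) is a standard ggplot2 call, but whether it fits the real dataset is untested.

```r
library(ggplot2)

sales <- data.frame(
  ItemID = c(1, 3, 7, 9, 10),
  Salesman = c("Bob", "Sue", "Jane", "Bob", "Sue"),
  ProfitLoss = c(10.00, 9.00, 9.50, -7.50, -11.00)
)

# Order the salesmen by overall profit, as in the question
sales$Salesman <- reorder(sales$Salesman, -sales$ProfitLoss, FUN = sum)

profits <- sales[sales$ProfitLoss > 0, ]
losses  <- sales[sales$ProfitLoss < 0, ]

p <- ggplot() +
  geom_bar(data = losses, aes(x = Salesman, y = ProfitLoss),
           stat = "identity", color = "white") +
  geom_bar(data = profits, aes(x = Salesman, y = ProfitLoss),
           stat = "identity", color = "white") +
  # Assumption: pinning the limits to all levels keeps the reorder() order,
  # even though the losses layer only contains Bob and Sue
  scale_x_discrete(limits = levels(sales$Salesman))

print(p)
```

The limits are taken from the reordered factor, so changing the function passed to reorder() changes the bar order without touching the plot call.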

Related

Making a line graph with certain X + Y values expressed differently with lines of 33 user IDs in R

I'm trying to put ActivityDate on the X Axis, and Calories on the Y Axis, relating to how 33 different users ranged in their calorie burnings daily. I'm new to ggplot and visualizations as you can tell, so I'd appreciate the most basic solution that I can understand. Thank you so much.
I really tried several iterations of this code, and each one of them weren't quite right in how the visualization turned out. Here are a couple of my thoughts:
##first and foremost:
install.packages("tidyverse")
install.packages("here")
library(tidyverse)
library(here)
Attempt 1 Bar Graph
ggplot(data=trimmed_dactivity) + geom_bar(mapping=aes(x=Id, color=ActivityDate))
##Not probably the best for stakeholders, but if I could maybe have the bars a little closer together that might help, so I tried to identify the unique IDs. Perhaps the reason why they are so small is that they appear in long number format, and are not sequential, so it could be adding the extra space and making the bars so small because of the spaces of empty sequential numbers.
Attempt 2 Bar Graph
UId <- unique("Id")
ggplot(data=trimmed_dactivity) + geom_bar(mapping=aes(x=UId, color=ActivityDate))
##Facepalm, definitely not what I was looking for at all, but that was my effort to solve the above problem.
Attempt 3 Bar Graph
ggplot(data=trimmed_dactivity) + geom_bar(mapping=aes(x=ActivityDate, fill=Id)) + theme(axis.text.x = element_text(angle=45))
##The fill function does not work, and on the y-axis if you will, I don't know what "count" is referring to in this case, so could be useful except for those two issues.
##Finally, I switch to a line graph
Attempt 4 Line Graph
ggplot(data=trimmed_dactivity) + geom_line(mapping=aes(x=ActivityDate, y=Calories)) + theme(axis.text.x = element_text(angle=45))
##Now what I get is separate lines going up and down, and what I want is 33 separate lines representing unique Id numbers to travel along the x axis for time, and rise in the y axis for calories. Of course I'm not sure how to do that...
Any help with what I'm missing on this journey here?
what I want is 33 separate lines representing unique Id numbers…
It sounds like you want a spaghetti plot. To make one, map Id to color (or to group if you don’t want each id to be colored differently).
library(ggplot2)
ggplot(fakedata, aes(ActivityDate, Calories)) +
geom_line(aes(color = factor(Id)), show.legend = FALSE)
Example data:
set.seed(13)
fakedata <- expand.grid(
Id = 1:33,
ActivityDate = seq(as.Date("2016-04-13"), length.out = 10, by = "day")
)
fakedata$Calories <- round(rnorm(330, 2500, 500))
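For the group variant mentioned above (one line per Id, all in the same color), a minimal sketch with the same fake data could look like this. The grey color and the alpha value are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(13)
fakedata <- expand.grid(
  Id = 1:33,
  ActivityDate = seq(as.Date("2016-04-13"), length.out = 10, by = "day")
)
fakedata$Calories <- round(rnorm(330, 2500, 500))

# group = Id draws one line per user without mapping Id to a color
p <- ggplot(fakedata, aes(ActivityDate, Calories, group = Id)) +
  geom_line(colour = "grey40", alpha = 0.6)

print(p)
```

Without group (or color) mapped to Id, geom_line() connects all observations in date order as a single series, which produces the up-and-down pattern described in the question.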

struggling with scaling a secondary axis on a plot that is not a percentage

I'm getting crazy here, please help me!
I'm new to R and this is why. I have a graph here in which I'm trying to plot steps given against time needed to fall asleep (in minutes) and I decided to plot user ID on the x axis and the other two variables in a vertical axis of its own.
The result is as follows:
I'm not happy with many things: the scaling of the line plot and the scale of the secondary axis, the width of the columns in geom_col, and the axis labels (the user IDs have 10 digits each and show up in scientific notation, as a power of ten).
Can you please help me out with all I mentioned, especially with the scaling of the secondary axis?
I've searched and searched and can't do it.
The code is this one:
ggplot(data= sleep_steps) +
geom_col(mapping = aes(x=Id, y=AVGSteps), fill = 'cyan') +
geom_line(mapping = aes(x=Id,y=AVGMinToFallAsleep)) +
labs(title = "Relationship between Steps and Time to Fall Asleep") +
scale_y_continuous(sec.axis = sec_axis(~ . - 8*60*60, name = "Minutes to Fall Asleep"))
And the table is like this:
head(sleep_steps)
Id AVGSteps AVGKcal AVGMinToFallAsleep AVGTotalMinAsleep
1 1503960366 12116.742 1816.419 22.92000 360.2800
2 1644430081 7282.967 2811.300 52.00000 294.0000
3 1844505072 2580.065 1573.484 309.00000 652.0000
4 1927972279 916.129 2172.806 20.80000 417.0000
5 2026352035 5566.871 1540.645 31.46429 506.1786
6 2347167796 9519.667 2043.444 44.53333 446.8000
I'm clueless. Since it is not a percentage nor is a datetime variable, I'm not sure what to do. I've tried to change the trans argument in sec_axis function but no success. The structure of the data frame is all num.
Thank you!
You need Id as a factor to start with, because the IDs identify individuals; they are not actual numbers.
Insert this before the plot:
sleep_steps$Id <- as.factor(sleep_steps$Id)
Without the code for your data to check, I would also say that you need another fill colour for your second scale, but you are using geom_line which is not normally how you would plot individuals because they are not connected. You may need to reconsider that. Normally you would plot all your data with boxplots which would show the averages and the quartiles etc.
If you are looking for an actual RELATIONSHIP, then you need to look into an lm plot (a scatter plot with a fitted linear-model line).
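On the secondary-axis scaling itself, a common approach is to multiply the second variable by a factor so it fits the primary axis, then divide by the same factor inside sec_axis() so the labels show the original units. The sketch below assumes that approach; scale_factor is a hypothetical helper name, only the first three rows of the posted table are used, and points are drawn instead of a line because the Ids are unconnected individuals.

```r
library(ggplot2)

# First three rows of the table from the question
sleep_steps <- data.frame(
  Id = factor(c(1503960366, 1644430081, 1844505072)),
  AVGSteps = c(12116.742, 7282.967, 2580.065),
  AVGMinToFallAsleep = c(22.92, 52, 309)
)

# Hypothetical scaling factor: stretch minutes so their maximum matches the steps maximum
scale_factor <- max(sleep_steps$AVGSteps) / max(sleep_steps$AVGMinToFallAsleep)

p <- ggplot(sleep_steps, aes(x = Id)) +
  geom_col(aes(y = AVGSteps), fill = "cyan", width = 0.6) +
  # Plot the minutes on the primary (steps) scale
  geom_point(aes(y = AVGMinToFallAsleep * scale_factor)) +
  scale_y_continuous(
    name = "Average steps",
    # Undo the scaling so the right-hand axis reads in minutes
    sec.axis = sec_axis(~ . / scale_factor, name = "Minutes to Fall Asleep")
  ) +
  labs(title = "Relationship between Steps and Time to Fall Asleep")

print(p)
```

Converting Id to a factor also takes care of the scientific-notation labels and the very thin columns, since the x axis becomes discrete.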

How to create a bar chart from the percentage of one variable when the X-axis is another variable?

I'm making a simple bar chart but I can't seem to figure it out. I've got my data as laid out here:
Candidate  SkinTone  Elected
1          7         1
2          4         0
3          3         0
4          2         1
Skin tone refers to a person's skin tone (obvs) and elected is a dummy variable that denotes whether a candidate was elected or not. What I want to do is have every skin tone value (it goes from 1-11) as a tick on my x-axis and my y-axis should be the percentage of those candidates that have a "1" as their elected value. So, for example, this tiny data set should generate a chart that looks like this:
Final Bar Graph
The problem I encounter is that I'm not able to figure out how to get this graph's y-axis correctly. Using this code below, I can generate a graph that looks like the one below:
ggplot(data=data, mapping = aes(x=Tone, y=Elected)) +
geom_bar(stat='identity',
fill="yellow",
col="black",
width=1,
alpha=.2) +
coord_cartesian(xlim = c(0.5,11.5)) +
scale_x_continuous(breaks = 1*1:11,
expand = expansion(add = .5)) +
labs(title="Skin Tone Electoral Success Barplot", x="Skin Tone", y="Percentage of Candidates Elected")
Incorrect Bar Graph
However, this doesn't work for me as the y-axis is showing the count of candidates who had a 1 in the Elected variable instead of the percentage. In addition, I'm getting these black blocks in between each observation, which I haven't gotten before when using col=. Lastly, I also find trouble adding in a density line as geom_density() gives me an error saying I'm missing my y aesthetic.
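One way to get percentages on the y-axis is to summarise the data first and plot the summary. With stat='identity' on the raw rows, every candidate becomes its own stacked segment with a black outline, which would explain both the counts and the black blocks. The sketch below assumes the column names from the table above (SkinTone rather than Tone); the names candidates, pct and PctElected are made up for illustration.

```r
library(ggplot2)

# The small example table from the question
candidates <- data.frame(
  Candidate = 1:4,
  SkinTone = c(7, 4, 3, 2),
  Elected = c(1, 0, 0, 1)
)

# The mean of a 0/1 dummy is the share elected; multiply by 100 for a percentage
pct <- aggregate(Elected ~ SkinTone, data = candidates,
                 FUN = function(z) 100 * mean(z))
names(pct)[2] <- "PctElected"

p <- ggplot(pct, aes(x = SkinTone, y = PctElected)) +
  geom_col(fill = "yellow", colour = "black", width = 1, alpha = 0.2) +
  coord_cartesian(xlim = c(0.5, 11.5)) +
  scale_x_continuous(breaks = 1:11) +
  labs(title = "Skin Tone Electoral Success Barplot",
       x = "Skin Tone", y = "Percentage of Candidates Elected")

print(p)
```

Because there is now exactly one row per skin tone, each bar is a single rectangle with a single outline.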

Is something wrong with my ggplot2 or R code to plot an ordered ancestry stacked barplot?

I want to create an organized stacked barplot where bars with similar proportions appear together. I have a data frame of 10,000 individuals and each individual comes from three populations. Here is my data.
library(MCMCpack)
library(ggplot2)
library(tidyr)  # provides gather(), used below
n = 10000
alpha = c(0.1, 0.1, 0.1)
q <- as.data.frame(rdirichlet(n,alpha))
head(q)
individuals <- c(1:nrow(q))
q <- cbind(q, individuals)
head(q)
V1 V2 V3 individuals
1 0.0032720232 3.381345e-08 0.996727943 1
2 0.3354060035 4.433923e-01 0.221201688 2
3 0.0004121665 9.661220e-01 0.033465842 3
4 0.9966997182 3.234048e-03 0.000066234 4
5 0.7789280208 2.090134e-01 0.012058562 5
6 0.0005048727 9.408364e-02 0.905411485 6
# long format for ggplot2 plotting
qm <- gather(q, key, value, -individuals)
colnames(qm) <- c("individuals", "ancestry", "proportions")
head(qm)
individuals ancestry proportions
1 1 V1 0.0032720232
2 2 V1 0.3354060035
3 3 V1 0.0004121665
4 4 V1 0.9966997182
5 5 V1 0.7789280208
6 6 V1 0.0005048727
Without any kind of ordering of data, I plotted the stacked barplot as:
ggplot(qm) + geom_bar(aes(x = individuals, y = proportions, fill= ancestry), stat="identity")
I have three questions:
(1) I don't know how to make these individuals with similar proportions cluster together, and I have tried many solutions on stack exchange already but can't get them to work on my dataset!
(2) For some reason, it seems like when I implement the code to order individuals by decreasing/increasing proportions in one ancestry, the code sometimes works on toy datasets of lower dimensions I create, but when I try to plot 10,000 individuals, the code doesn't work anymore! Is this a problem in ggplot2 or am I doing something wrong? I would appreciate any answer to this thread to also plot n = 10,000 stacked barplots.
(3) Not sure if I'm imagining this, but in my stacked barplot, it seems like R is clustering the stacked bar plots in some order unknown to me -- because I can see regular gaps between the stacked plots. In reality, there should be no gaps and I'm not sure why this is happening.
I would appreciate any help since I have already worked on this code for an embarrassingly long amount of time!!
Since the variance of the proportions within each ancestry is very high, the bars look as if they are clustered with other ancestries. The plot is drawn correctly; however, the differences are hard to distinguish because the number of individuals is so high.
If you think the proportions in your data set would not lose their meaning, and could be interpreted in the same way, after being transformed into exponential or log values, you can try that.
The stacked bar with exponential of the proportions:
ggplot(qm) + geom_bar(aes(x = individuals, y = exp(proportions), fill= ancestry),
stat="identity")
If you don't want to have gaps between the bars, set width to 1.
ggplot(qm) + geom_bar(aes(x = individuals, y = exp(proportions), fill= ancestry),
stat="identity",
width=1)
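For question (1), grouping similar individuals together, one possible approach is to sort individuals by their dominant ancestry and then by how large that dominant proportion is, and to plot against the resulting position instead of the original index. This is a sketch of that idea only, not a definitive method: it uses a smaller n, simulates Dirichlet draws with normalised gamma variates instead of MCMCpack, and the names dominant, dom_value and position are made up for illustration.

```r
library(ggplot2)

set.seed(1)
n <- 200
# Dirichlet(0.1, 0.1, 0.1) draws via normalised gamma variates (stand-in for rdirichlet)
g <- matrix(rgamma(n * 3, shape = 0.1), ncol = 3)
q <- as.data.frame(g / rowSums(g))
names(q) <- c("V1", "V2", "V3")

# Index of the largest ancestry per individual, and its value
dominant  <- max.col(as.matrix(q[, c("V1", "V2", "V3")]), ties.method = "first")
dom_value <- apply(q[, c("V1", "V2", "V3")], 1, max)

# Sort by dominant ancestry, then by decreasing dominant proportion
ord <- order(dominant, -dom_value)
q$position <- order(ord)  # rank of each individual along the x axis

# Long format for plotting
qm <- data.frame(
  position = rep(q$position, times = 3),
  ancestry = rep(c("V1", "V2", "V3"), each = n),
  proportions = c(q$V1, q$V2, q$V3)
)

p <- ggplot(qm, aes(x = position, y = proportions, fill = ancestry)) +
  geom_col(width = 1)

print(p)
```

The same ordering step should carry over to n = 10000, since only the x positions change; with that many bars each one is narrower than a pixel, so width = 1 matters more there.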

R ggplot2 - geom_histogram: levels/color removed in plot due to limiting y-scale

For each year I have a number of distinct patients, each belonging to one of three levels. I would like to plot the relative frequency distribution of the three levels for each year. Let's say that 80% of the patients are labeled C and the other patients A or B. Since the majority has C, the distribution of A and B wouldn't be visible, so I changed the y-axis. I then ran into the following problem with ggplot: the colored columns for A and B are shown, but C has disappeared from the plot.
Here I made an example:
library(ggplot2)
# Data set
grp <- rep(c("A","B","C"), c(10,10,80))
year <- floor(runif(100)*10/3)
df <- data.frame(grp,year)
# Plot
ggplot(df,aes(year)) +
geom_histogram(aes(fill=grp),position="fill") +
scale_y_continuous(lim=c(0,0.5))
If I remove the last line (scale_y...) then I get the whole range from 0 to 1 and all levels (colors) are shown. With scale_y..., level (color) C disappears and only the grey background is visible. Does anyone know how I can keep the color for C from disappearing? Thanks for any hints.
As @Harpal already said, when you set limits inside scale_y_continuous(), all values that fall outside these limits are removed from the plot. If you need to "zoom" your plot to values from 0 to 0.5, use coord_cartesian() instead of scale_y_continuous().
ggplot(df,aes(year)) +
geom_histogram(aes(fill=grp),position="fill") +
coord_cartesian(ylim=c(0,0.5))
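To see why the scale limits drop level C, it can help to look at what happens to out-of-range values: by default, continuous scales replace them with NA (via scales::censor()), whereas coord_cartesian() only changes the visible window. The sketch below illustrates both; it swaps geom_histogram() for geom_bar() on factor(year) to avoid binning messages, which is a simplification and not what the original code does.

```r
library(ggplot2)

set.seed(1)
grp <- rep(c("A", "B", "C"), c(10, 10, 80))
year <- floor(runif(100) * 10 / 3)
df <- data.frame(grp, year)

# What scale limits do to values outside c(0, 0.5): they become NA
censored <- scales::censor(c(0.1, 0.2, 1.0), range = c(0, 0.5))
print(censored)

# Zooming with coord_cartesian() keeps all the data and only crops the view
p <- ggplot(df, aes(factor(year), fill = grp)) +
  geom_bar(position = "fill") +
  coord_cartesian(ylim = c(0, 0.5))

print(p)
```

With position = "fill", the segment for C extends up to 1, so its upper edge falls outside the limits and the whole rectangle is dropped when the limits are set on the scale.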
