Bar chart - bars jumped to y-axis - r

I was plotting a bar chart with the code which worked perfectly well until some of the data had a value of 0.
barwidth = 0.35
df1:
norms_number R2.c
1 0.011
2 0
3 0.015
4 0.011
5 0
6 0.012
df2:
norms_number R2.c
1 0.001
2 0
3 0.012
4 0.006
5 0
6 0.004
test <- ggplot()+
geom_bar(data=df1, aes(x=norms_number, y=R2.c),stat="identity", position="dodge", width = barwidth)+
geom_bar(data=df2, aes(x=norms_number+barwidth+0.03, y=R2.c),
stat="identity", position="dodge",width = barwidth)
my result was:
and I got a warning that position stack requires non-overlapping x intervals (but they are not overlapping?)
I looked into it and changed the DV to factor (from numeric), which half helped, because now the graph looks like this:
Why are the bars on the y axis? How else can I get around this weird error with values of 0?

First of all, you are intending to plot a bar chart where the heights are represented by a value rather than by number of cases. See here for more details, but you should be using geom_col instead of geom_bar.
With that being said, the warning and the result you are getting arise because, with x=norms_number+barwidth+0.03, you are trying to set the precise position of the second set of data (df2) relative to the first (df1) by hand.
In order for ggplot to dodge, it has to understand what to use as a basis for the dodge, and then it will separate (or "dodge") each observation containing the same x= aesthetic based upon that particular group used as the basis. Under normal circumstances, you would specify something like fill= inside aes(), and ggplot is smart enough to know that whatever you set as fill= will also be the basis for position='dodge' to function. In the absence of that (or if you wanted to override that), you would need to specify a group= aesthetic that would be used for dodging.
Ultimately, this means that you need to combine your datasets and provide ggplot a way of deciding how to dodge. This makes sense, since both of your dataframes are intended to be placed in the same plot, and both have identical x and y aesthetics. If you leave them as separate dataframes, you can overplot them in the same plot, but there is no good way to have ggplot use position='dodge', because it needs to see all the data in the geom_col call in order to know what to use as the basis for the dodge.
With all that being said, here's what I would recommend:
# combine datasets, but first make a marker called "origin"
# this will be used as a basis for the dodge and fill aesthetics
df1$origin <- 'df1'
df2$origin <- 'df2'
df <- rbind(df1, df2)
# need to change norms_number to a factor to allow for discrete axis
df$norms_number <- as.factor(df$norms_number)
You then use only one call to geom_col to get your plot. In the first case, I will use only the group= aesthetic to show you how ggplot uses this for the dodge mechanism:
ggplot(df, aes(x=norms_number, y=R2.c)) +
geom_col(position='dodge', width=0.35, aes(group=origin), color='black')
As mentioned, you can also just supply a fill= aesthetic, and ggplot will know to use that as the mechanism for dodging:
ggplot(df, aes(x=norms_number, y=R2.c)) +
geom_col(position='dodge', width=0.35, aes(fill=origin), color='black')
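For anyone who wants to run the recommendation end to end, here is a minimal self-contained sketch. The data frames are rebuilt from the values printed in the question; the object name p and the requireNamespace() guard are my own additions for illustration, not part of the original code.

```r
# Rebuild the question's two data frames (values copied from the post).
df1 <- data.frame(norms_number = 1:6,
                  R2.c = c(0.011, 0, 0.015, 0.011, 0, 0.012))
df2 <- data.frame(norms_number = 1:6,
                  R2.c = c(0.001, 0, 0.012, 0.006, 0, 0.004))

# Tag each row with its source, then stack the rows into one data frame.
df1$origin <- "df1"
df2$origin <- "df2"
df <- rbind(df1, df2)

# A factor x gives a discrete axis, so dodging happens within each category.
df$norms_number <- as.factor(df$norms_number)

# Build the plot only if ggplot2 is installed; call print(p) to draw it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = norms_number, y = R2.c, fill = origin)) +
    geom_col(position = "dodge", width = 0.35, color = "black")
}
```

The rows with a value of 0 stay in the data and should simply show up as zero-height bars in their dodged slot.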

I'm not sure whether you are trying to draw something more complicated, such as a bar over a bar. Either way, one option is geom_rect(), which lets you set the edges of every bar explicitly:
ggplot()+
geom_rect(data=df1,
aes(xmin=norms_number-barwidth,xmax=norms_number,
ymin=0,ymax=R2.c))+
geom_rect(data=df2,
aes(xmin=norms_number,xmax=norms_number+barwidth,
ymin=0,ymax=R2.c))+
scale_x_continuous(breaks=1:6)

Related

Log scale on bar plot brake axis values [duplicate]

I'm making the following bar plot with ggplot:
df %>% ggplot( aes(x= group,y= cases,fill=color ) ) +
geom_bar(stat="identity") +
theme_minimal()
Which gives the following result:
The issue is that the smaller colors are not visible, hence I tried to use a log scale:
df %>% ggplot( aes(x= group,y= cases,fill=color ) ) +
geom_bar(stat="identity") +
scale_y_log10(labels = comma) +
theme_minimal()
But this completely broke the scales: now I'm getting a 10 MM value from nowhere and the bar sizes are wrong.
The data I'm using for this is the following:
index,group,color,cases
1,4,4,9
2,4,3,61
3,1,1,5000
4,4,2,138
5,4,1,246
6,3,1,359
7,2,1,2000
8,3,2,57
9,1,2,153
10,2,2,130
11,2,3,15
12,1,3,23
13,3,3,11
14,2,4,1
TL;DR: You cannot and should not use a log scale with a stacked barplot. If you want to use a log scale, use a "dodged" barplot instead. You'll also have better luck using geom_col instead of geom_bar here and setting your fill= variable as a factor.
Geom_col vs. geom_bar
Try using geom_col in place of geom_bar. You can use coord_flip() if the direction is not to your liking. See here for reference, but the gist of the issue is that geom_bar should be used when you want to plot against "count", and geom_col should be used when you want to plot against "values". Here, your y-axis is "cases" (a value), so use geom_col.
The Problem with log scales and Stacked Barplots
With that being said, u/Dave2e is absolutely correct. The plot you are getting makes sense, because the underlying math being done to calculate the y-axis values is: log10(x) + log10(y) + log10(z) instead of what you expected, which was log10(x + y + z).
Let's use the numbers in your actual data frame for comparison here. In "group 1", you have the following:
index group color cases
3 1 1 5000
9 1 2 153
12 1 3 23
On the y-axis, the total height of a stacked bar (without a log scale) is the sum of all of its segments. In other words:
> 5000 + 153 + 23
[1] 5176
This means that each of the bars represents the correct relative size, and when you add them up (or stack them up), the total size of the bar is equivalent to the total sum. Makes sense.
Now consider the same case, but for a log10 scale:
> log10(5000) + log10(153) + log10(23)
[1] 7.245389
Or, just about 17.6 million once converted back (10^7.245389). The total height of the bar is still the sum of all individual bars (because that's what a stacked barplot is), and you can still compare the relative sizes, but the sum total of the individual logs does not equal the log of the sum:
> log10(5000 + 153 + 23)
[1] 3.713994
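The arithmetic above can be checked in a few lines of base R. This is only a sketch of the explanation given here, using the three "group 1" values from the question; the variable names are mine.

```r
cases <- c(5000, 153, 23)   # "group 1" rows from the question's data

sum_of_logs <- sum(log10(cases))   # what the stacked log-scale bar adds up to
log_of_sum  <- log10(sum(cases))   # what the asker expected to see

# Adding logs is the same as taking the log of the product, not of the sum.
product <- prod(cases)                   # 17595000
all.equal(sum_of_logs, log10(product))   # TRUE

10^sum_of_logs   # roughly 17.6 million, the "10 MM" range on the axis
10^log_of_sum    # 5176, the real group total
```

So the stacked bar tops out near the product of the segment values, which is why the axis jumps to tens of millions.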
Suggested Way to Change your Plot
Moral of the story: you can still use a log scale to "stretch out" the small bars, but don't stack them. Use position='dodge':
df %>% ggplot( aes(x= group,y= log10(cases),fill=as.factor(color) ) ) +
geom_col(position='dodge') +
theme_minimal()
Finally, position='dodge' (or position=position_dodge(width=...)) does not work with fill=color, since df$color is not a factor (it's numeric). This is also why your legend is showing a gradient for a categorical variable. That's why I used as.factor(color) in the ggplot call here, although you can also just apply that to the original dataset with df$color <- as.factor(df$color) and do the same thing.

Overlay two bar plots with geom_bar()

I'm trying to overlay two bar plots on top of each other, not beside.
The data is from the same dataset. I want 'Block' on the x-axis and 'Start' and 'End' as overlaying bar plots.
Block Start End
1 P1L 76.80 0.0
2 P1S 68.87 4.4
3 P2L 74.00 0.0
4 P2S 74.28 3.9
5 P3L 82.22 7.7
6 P3S 80.82 17.9
My script is
ggplot(data=NULL,aes(x=Block))+
geom_bar(data=my_data$Start,stat="identity",position ="identity",alpha=.3,fill='lightblue',color='lightblue4')+
geom_bar(data=my_data$End,stat="identity",position ="identity",alpha=.8,fill='pink',color='red')
I get Error: ggplot2 doesn't know how to deal with data of class numeric
I've also tried
ggplot(my_data,aes(x=Block,y=Start))+
geom_bar(data=my_data$End, stat="identity",position="identity",...)
Anyone know how I can make it happen? Thank you.
Edit:
How to get dodge overlaying bars?
I edit this post, because my next question is relevant as it's the opposite problem of my original post.
@P.merkle
I had to change my plot into four bars showing the mean values of all Blocks labeled L and S. The L stand for littoral, and S for Sublittoral. They were exposed for two treatments: Normal and reduced.
I've calculated the means, and their standard deviation.
I need four bars with their respective error bars:
Normal/Littoral , Reduced/Littoral , Normal/Sublittoral , Reduced/Sublittoral.
The problem is that when I plot it, both littoral bars and both sublittoral bars overlay each other. Now I want them not to overlap!
How can I make it happen? I've tried all sorts of position = 'dodge' and position = position_dodge(newdata$Force), without luck...
My newdata contain this information:
Zonation Force N mean sd se
1 Litoral Normal 6 0.000000 0.000000 0.000000
2 Litoral Redusert 6 5.873333 3.562868 1.454535
3 Sublitoral Normal 6 7.280000 2.898903 1.183472
4 Sublitoral Redusert 6 21.461667 4.153535 1.695674
My script is this:
ggplot(data=cdata,aes(x=newdata$Force,y=newdata$mean))+
geom_bar(stat="identity",position ="dodge",
alpha=.4,fill='red', color='lightblue4',width = .6)+
geom_errorbar(aes(ymin=newdata$mean-sd,ymax=newdata$mean+sd),
width=.2, position=position_dodge(.9))
The outcome is unfortunately this
Judging by the error bars, there are clearly four bars, but they overlap. Please, how can I solve this?
If you don't need a legend, Solution 1 might work for you. It is simpler because it keeps your data in wide format.
If you need a legend, consider Solution 2. It requires your data to be converted from wide format to long format.
Solution 1: Without legend (keeping wide format)
You can refine your aesthetics specification on the level of individual geoms (here, geom_bar):
ggplot(data=my_data, aes(x=Block)) +
geom_bar(aes(y=Start), stat="identity", position ="identity", alpha=.3, fill='lightblue', color='lightblue4') +
geom_bar(aes(y=End), stat="identity", position="identity", alpha=.8, fill='pink', color='red')
Solution 2: Adding a legend (converting to long format)
To add a legend, first use reshape2::melt to convert your data frame from wide format into long format.
This gives you two columns: the variable column ("Start" vs. "End") and the value column.
Now use the variable column to define your legend:
library(reshape2)
my_data_long <- melt(my_data, id.vars = c("Block"))
ggplot(data=my_data_long, aes(x=Block, y=value, fill=variable, color=variable, alpha=variable)) +
geom_bar(stat="identity", position ="identity") +
scale_colour_manual(values=c("lightblue4", "red")) +
scale_fill_manual(values=c("lightblue", "pink")) +
scale_alpha_manual(values=c(.3, .8))
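To make the wide-to-long step concrete, here is a base R sketch of the layout that melt(my_data, id.vars = "Block") is expected to produce. The data frame is retyped from the question; building the long table by hand is my own illustration and is not how reshape2 does it internally.

```r
my_data <- data.frame(
  Block = c("P1L", "P1S", "P2L", "P2S", "P3L", "P3S"),
  Start = c(76.80, 68.87, 74.00, 74.28, 82.22, 80.82),
  End   = c(0.0, 4.4, 0.0, 3.9, 7.7, 17.9)
)

# One row per Block/measurement pair, with the measurement name in
# "variable" and the number in "value".
my_data_long <- data.frame(
  Block    = rep(my_data$Block, times = 2),
  variable = factor(rep(c("Start", "End"), each = nrow(my_data)),
                    levels = c("Start", "End")),
  value    = c(my_data$Start, my_data$End)
)

head(my_data_long, 3)
```

The level order of variable (Start, then End) is what decides which entry of each scale_*_manual(values=...) vector goes to which series, so it is worth keeping explicit.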

Scatterplot in ggplot stacked like barplot

I want to create a scatterplot in ggplot where there are multiple y values for each x value. I want to add these y values and plot the sum against the x value.
>df
a b
1 2
1 2
2 1
2 4
3 1
3 5
I want a plot that plots the sums of the b values for each a
a b
1 4
2 5
3 6
I can do this for a barplot by making a stacked barplot:
ggplot(data=df, aes(x=df$a, y=df$b)) + geom_bar(stat="identity")
but if I do this with geom_point ggplot just plots each value of y without stacking.
I could use ddply for this, but that would require a number of more steps. If there is a more expedient way I'd appreciate it.
I searched the site for other answers. While there were plenty about "stacked scatterplots" they were all about overlaid plots.
I don't see anything stacked about your bar chart example. If you just want to summarize the values to a single point, you can use stat_summary
# fun= replaces fun.y=, which is deprecated since ggplot2 3.3.0
ggplot(data=df, aes(x=a, y=b)) + stat_summary(fun=sum, geom="point")
There are many ways to achieve this effect - of a 'histogram' but without bars, whose height is the sum of all values at the same X.
This type of graph is called a Cleveland Dot Plot, and is used because the conspicuous bars of a histogram can be a distraction or, at worst, misleading (see works by Cleveland, Tufte etc.).
One way to achieve this is to pre-process the data to do the sum, using functions such as table or hist or tapply or xtabs...
Note that base R has the function dotchart for the production of this type of graph.
dotchart(xtabs(rev(df)))
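As a sketch of the pre-processing route, the snippet below computes the per-a sums two ways in base R. The one-liner above appears to rely on the reversed data frame being read as the formula b ~ a; the explicit formula used here is my assumption about what it is doing, and the variable names are mine.

```r
df <- data.frame(a = c(1, 1, 2, 2, 3, 3),
                 b = c(2, 2, 1, 4, 1, 5))

# Sum b within each value of a.
sums_tapply <- tapply(df$b, df$a, sum)   # named vector: 1 -> 4, 2 -> 5, 3 -> 6
sums_xtabs  <- xtabs(b ~ a, data = df)   # same totals as a one-way table

sums_tapply
sums_xtabs
```

Either result can be handed to dotchart(), or turned into a data frame and plotted with geom_point.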
... but since we are discussing ggplot, which has powerful ways to summarise the data while plotting it, let's stick to MrFlick's theme of how to do it directly with ggplot operators (i.e. not preprocessed).
Using a weighted bin summary statistic:
# current ggplot2 rejects stat="bin" on a discrete x; stat="count" sums the weights instead
ggplot(data=df, aes(x=factor(a),weight=b)) + geom_point(stat="count")
You may want to adjust the lower y limit to 0 here.
By stacking the height of the points:
ggplot(data=df, aes(x=factor(a),y=b)) + geom_point(position="stack")
the additional dots visible on this plot are probably superfluous and definitely ambiguous, but highlight the fact of multiplicity in the source data.
Building a dotplot
This one is popular in newspapers, but usually has dollar bills instead of giant black holes:
ggplot(data=df, aes(x=factor(a),weight=b)) + geom_dotplot(method="histodot")
It's probably not what you are looking for, but it's worth being aware of.
You should also be aware that scales are difficult to get correct in this mode, so it's best used in a hand-tuned mode, with the y scale numbering turned off.

Plotting percent change for a large number of factors on same figure using ggplot by faceting or color-coding factors

Here is an example of the code I'm working with
x<-as.factor(rep(c("tree_mean","tree_qmean","tree_skew"),3))
factor<-c(rep("mfn2_burned_99",3),rep("mfna_burned_5_7",3),rep("mfna_burned_5_7_10_12",3))
y<-c(0.336457409,-0.347422910,-0.318945621,1.494109367, 0.003578698,-0.019985780,-0.484171146, 0.611589217,-0.322292664)
# data.frame() keeps y numeric; as.data.frame(cbind(...)) would turn every column into text
dat<-data.frame(x,factor,y)
head(dat)
x factor y
tree_mean mfn2_burned_99 -0.3364574
tree_qmean mfn2_burned_99 -0.3474229
tree_skew mfn2_burned_99 -0.3189456
tree_mean mfna_burned_5_7 -0.8269814
tree_qmean mfna_burned_5_7 -0.8088810
tree_skew mfna_burned_5_7 -2.5429226
tree_mean mfna_burned_5_7_10_12 -0.8601206
tree_qmean mfna_burned_5_7_10_12 -0.8474920
tree_skew mfna_burned_5_7_10_12 -2.9854178
I am trying to plot how much x deviates from 0, and facet it by each factor, as so:
ggplot(dat) +
geom_point(aes(x=x,y=y),shape=1,size=3)+
geom_linerange(aes(x=x,ymin=0,ymax=y))+
geom_hline(yintercept=0)+
facet_grid(factor~.)
This works fine when I have three factors (ignore the *: I had a significance column which I have since removed).
Example below:
However, I have 8 factors in total, and faceting obscures the plot such that the distance from zero for each x value gets very distorted.
Example below
So, my question is this: what would be a better way of coding/rendering this plot given my large number of x values and factors using faceting or color coding by factor in ggplot??
I would be very open to color-coding each distance for x by factor rather than faceting, but I have been beating my head against the wall trying to figure out how to even do that in ggplot (very new to ggplot), so I can't yet say if it would make the figure much more interpretable.
One option as you note is to color your point and/or linerange by a factor. You can then use position_dodge to move the points slightly on the x axis.
For example:
ggplot(dat, aes(color = factor)) +
geom_point(aes(x=x,y=y),shape=1,size=3, position = position_dodge(width = 0.5))+
geom_linerange(aes(x=x,ymin=0,ymax=y), position = position_dodge(width =0.5))+
geom_hline(yintercept=0)
I think this would still be difficult with many factors, but with 8 it might suit your purposes.

How can I create a (100%) stacked histogram in R?

My dataset:
I have data in the following format (here, imported from a CSV file). You can find an example dataset as CSV here.
PAIR PREFERENCE
1 5
1 3
1 2
2 4
2 1
2 3
… and so on. In total, there are 19 pairs, and the PREFERENCE ranges from 1 to 5, as discrete values.
What I'm trying to achieve:
What I need is a stacked histogram, e.g. a 100% high column, for each pair, indicating the distribution of the PREFERENCE values.
Something similar to the "100% stacked columns" in Excel, or (although not quite the same) a so-called "mosaic plot":
What I tried:
I figured it'd be easiest using ggplot2, but I don't even know where to start. I know I can create a simple bar chart with something like:
ggplot(d, aes(x=factor(PAIR), y=factor(PREFERENCE))) + geom_bar(position="fill")
… that however doesn't get me very far. So I tried this, and it gets me somewhat closer to what I'm trying to achieve, but it still uses the count of PREFERENCE, I suppose? Note the ylab being "count" here, and the values ranging to 19.
qplot(factor(PAIR), data=d, geom="bar", fill=factor(PREFERENCE_FIXED))
Results in:
So, what do I have to do to get the stacked bars to represent a histogram?
Or do they actually do this already?
If so, what do I have to change to get the labels right (e.g. have percentages instead of the "count")?
By the way, this is not really related to this question, and only marginally related to this (i.e. probably same idea, but not continuous values, instead grouped into bars).
Maybe you want something like this:
ggplot() +
geom_bar(data = dat,
aes(x = factor(PAIR),fill = factor(PREFERENCE)),
position = "fill")
where I've read your data into dat. This outputs something like this:
The y label is still "count", but you can change that manually by adding:
+ scale_x_discrete("Pairs") + scale_y_continuous("Votes")
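Since position = "fill" rescales every bar to a height of 1, the y axis really shows shares rather than counts. The sketch below uses only the six sample rows printed in the question (the full file has 19 pairs) to compute those shares in base R and, if ggplot2 and scales are installed, to label the axis in percent. The axis title and the object names are my own choices.

```r
# Six sample rows copied from the question.
dat <- data.frame(PAIR = c(1, 1, 1, 2, 2, 2),
                  PREFERENCE = c(5, 3, 2, 4, 1, 3))

# Share of each PREFERENCE value within each PAIR; every row sums to 1.
shares <- prop.table(table(dat$PAIR, dat$PREFERENCE), margin = 1)
round(shares, 2)

# Same picture in ggplot2, with the y axis labelled in percent.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("scales", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dat, aes(x = factor(PAIR), fill = factor(PREFERENCE))) +
    geom_bar(position = "fill") +
    scale_x_discrete("Pairs") +
    scale_y_continuous("Share of votes", labels = scales::percent)
}
```

Call print(p) to draw the plot; the shares table gives the exact segment heights to compare against.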

Resources