I want to create a scatterplot in ggplot where there are multiple y values for each x value. I want to add these y values and plot the sum against the x value.
>df
a b
1 2
1 2
2 1
2 4
3 1
3 5
I want a plot of the sums of the b values for each a:
a b
1 4
2 5
3 6
I can do this for a barplot by making a stacked barplot:
ggplot(data=df, aes(x=a, y=b)) + geom_bar(stat="identity")
but if I do this with geom_point ggplot just plots each value of y without stacking.
I could use ddply for this, but that would require several more steps. If there is a more expedient way, I'd appreciate hearing about it.
I searched the site for other answers. While there were plenty about "stacked scatterplots" they were all about overlaid plots.
I don't see anything stacked about your bar chart example. If you just want to summarize the values to a single point, you can use stat_summary:
ggplot(data=df, aes(x=a, y=b)) + stat_summary(fun=sum, geom="point")
(In ggplot2 versions before 3.3.0 the argument is named fun.y instead of fun.)
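To check what stat_summary actually plots, here is a minimal sketch that rebuilds the toy data from the question and inspects the computed layer. It assumes ggplot2 3.3.0 or later (where the argument is called `fun`); the object names `p` and `summed` are only illustrative.

```r
library(ggplot2)

# Toy data from the question
df <- data.frame(a = c(1, 1, 2, 2, 3, 3),
                 b = c(2, 2, 1, 4, 1, 5))

# One point per x value, placed at the sum of the y values
p <- ggplot(df, aes(x = a, y = b)) +
  stat_summary(fun = sum, geom = "point")

# layer_data() returns the data after the stat has been applied
summed <- layer_data(p)
print(summed[, c("x", "y")])
```

The `y` column of `summed` should hold the per-`a` sums that the question asks for, so the points can be checked without reading them off the plot.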
There are many ways to achieve this effect - of a 'histogram' but without bars, whose height is the sum of all values at the same X.
This type of graph is called a Cleveland Dot Plot, and is used because the conspicuous bars of a histogram can be a distraction or, at worst, misleading (see works by Cleveland, Tufte etc.).
One way to achieve this is to pre-process the data to do the sum, using functions such as table or hist or tapply or xtabs...
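As a rough illustration of the pre-processing route, the sketch below computes the per-`a` sums three ways in base R on the toy data from the question. The variable names are illustrative; any one of the three results could then be handed to a plotting function.

```r
# Toy data from the question
df <- data.frame(a = c(1, 1, 2, 2, 3, 3),
                 b = c(2, 2, 1, 4, 1, 5))

# tapply: sum b within each value of a (returns a named array)
sums_tapply <- tapply(df$b, df$a, sum)

# xtabs: the left-hand side of the formula is summed within each a
sums_xtabs <- xtabs(b ~ a, data = df)

# aggregate: the same sums as a data frame, convenient for ggplot
sums_df <- aggregate(b ~ a, data = df, FUN = sum)
print(sums_df)
```

The data frame from `aggregate` is probably the most convenient of the three for ggplot, since it can be plotted directly with `geom_point()`.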
Note that base R has the function dotchart for the production of this type of graph.
dotchart(xtabs(rev(df)))
... but since we are discussing ggplot, which has powerful ways to summarise the data while plotting it, let's stick to MrFlick's theme of how to do it directly with ggplot operators (i.e. not preprocessed).
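Using a weighted count summary statistic:
ggplot(data=df, aes(x=factor(a),weight=b)) + geom_point(stat="count")
(ggplot2 releases before 2.0.0 used stat="bin" here; current versions need stat="count" for a discrete x.) You may want to adjust the lower y limit to 0 here.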
By stacking the height of the points:
ggplot(data=df, aes(x=factor(a),y=b)) + geom_point(position="stack")
The additional dots visible on this plot are probably superfluous and definitely ambiguous, but they highlight the multiplicity in the source data.
Building a dotplot:
This one is popular in newspapers, but usually has dollar bills instead of giant black holes:
ggplot(data=df, aes(x=factor(a),weight=b)) + geom_dotplot(method="histodot")
It's probably not what you are looking for, but it's worth being aware of.
You should also be aware that scales are difficult to get correct in this mode, so it's best used in a hand-tuned mode, with the y scale numbering turned off.
I was plotting a bar chart with code that worked perfectly well until some of the data had a value of 0.
barwidth = 0.35
df1:
norms_number R2.c
1 0.011
2 0
3 0.015
4 0.011
5 0
6 0.012
df2:
norms_number R2.c
1 0.001
2 0
3 0.012
4 0.006
5 0
6 0.004
test <- ggplot()+
geom_bar(data=df1, aes(x=norms_number, y=R2.c),stat="identity", position="dodge", width = barwidth)+
geom_bar(data=df2, aes(x=norms_number+barwidth+0.03, y=R2.c),
stat="identity", position="dodge",width = barwidth)
My result was:
I also got a warning that position stack requires non-overlapping x intervals (but they are not overlapping?).
I looked into it and changed the DV from numeric to factor, which half helped, because now the graph looks like this:
Why are the bars on the y axis? How else can I get around this weird error with values of 0?
First of all, you are intending to plot a bar chart where the heights are represented by a value rather than by number of cases. See here for more details, but you should be using geom_col instead of geom_bar.
That said, the warning and the odd result arise because, with x=norms_number+barwidth+0.03, you are trying to specify the precise position of the second set of data (df2) relative to the first (df1).
In order for ggplot to dodge, it has to understand what to use as a basis for the dodge, and then it will separate (or "dodge") each observation containing the same x= aesthetic based upon the group used as that basis. Under normal circumstances, you would specify something like fill= in aes(), and ggplot is smart enough to know that whatever you set as fill= will also be the basis for position='dodge'. In the absence of that (or if you wanted to override it), you would need to specify a group= aesthetic to be used for dodging.
Ultimately, this means that you need to combine your datasets and provide ggplot a way of deciding how to dodge. This makes sense, since both of your dataframes are intended to be placed in the same plot, and both have identical x and y aesthetics. If you leave them as separate dataframes, you can overplot them in the same plot, but there is no good way to have ggplot use position='dodge', because it needs to see all the data in the geom_col call in order to know what to use as the basis for the dodge.
With all that being said, here's what I would recommend:
# combine datasets, but first make a marker called "origin"
# this will be used as a basis for the dodge and fill aesthetics
df1$origin <- 'df1'
df2$origin <- 'df2'
df <- rbind(df1, df2)
# need to change norms_number to a factor to allow for discrete axis
df$norms_number <- as.factor(df$norms_number)
You then use only one call to geom_col to get your plot. In the first case, I will use only the group= aesthetic to show you how ggplot uses this for the dodge mechanism:
ggplot(df, aes(x=norms_number, y=R2.c)) +
geom_col(position='dodge', width=0.35, aes(group=origin), color='black')
As mentioned, you can also just supply a fill= aesthetic, and ggplot will know to use that as the mechanism for dodging:
ggplot(df, aes(x=norms_number, y=R2.c)) +
geom_col(position='dodge', width=0.35, aes(fill=origin), color='black')
I'm not sure whether you are trying to draw something more complicated, like a bar over a bar. Anyhow, one way is to use geom_rect() if you want to have one over the other:
ggplot()+
geom_rect(data=df1,
aes(xmin=norms_number-barwidth,xmax=norms_number,
ymin=0,ymax=R2.c))+
geom_rect(data=df2,
aes(xmin=norms_number,xmax=norms_number+barwidth,
ymin=0,ymax=R2.c))+
scale_x_continuous(breaks=1:6)
I've been searching for a while, and I've found a number of answers for problems similar to mine, but not quite working when I try to implement them.
I'm trying to make a series of radar plots for different observations of performance. The data has been normalized so that the mean is 0 and the standard deviation is 1, and the y-axis on the plot has been set from -3 to 3 to make it easy to compare visually how well the subjects performed, with more extreme observations being worse. I would like to add colors associated with that scale, preferably such that -1 to 1 is green, the bands between +/- 1 and 2 are yellow, and the bands between +/- 2 and 3 are red. All the examples I've been able to find relating to color fills are based directly on the data or on factors rather than on a fixed scale, and anything I try seems not to show correctly. I'm not even sure whether ggplot can normally set a color scale in the way I'm looking for...
Here's the toy data I've been working with while working out the plotting (after reshaping):
variable <- c("time", "distance", "turns")
value <- c(0.9536197, 0.5842319, -2.1814528)
df <- data.frame(variable, value)
and here's my most recent attempt as far as ggplot code goes (using ggiraphExtra):
ggplot(df, aes(x=variable, y=value, group=1)) + geom_point() + geom_polygon() +
ggiraphExtra:::coord_radar() + ylim(-3,3) +
scale_fill_gradient(low="red", high="green")
and this is the output:
[Image: radar plot with solid green geom_polygon fill]
I'm trying to make a plot with mean and sd bars for three levels of a factor.
(After two hours of searching on the internet, then checking the Rbook and Rgraphs book I'm still not finding the answer. I think this is because it is a very simple question.)
I have a simple data frame with three columns: my categories, mean, sd.
I would like to do a plot with the mean by category and its sd bars, just like
this one (edit: link broken)
My dataframe looks like this
color mean.temp sd
black 37.93431 2.267125
red 37.01423 1.852052
orange 36.61345 1.339032
I'm so sorry for asking this dumb question but I sincerely couldn't find any simple answer to my simple question.
With ggplot:
read data:
df=read.table(text=' color mean.temp sd
1 black 37.93431 2.267125
2 red 37.01423 1.852052
3 orange 36.61345 1.339032',header=TRUE)
plotting:
# group=1 lets geom_line() connect the points across the discrete x axis
ggplot(df, aes(x=color, y=mean.temp, group=1)) +
geom_errorbar(aes(ymin=mean.temp-sd, ymax=mean.temp+sd), width=.2) +
geom_line() +
geom_point()
Create a data.frame holding your data:
foo <- data.frame(color=c("black","red","orange"),
mean.temp=c(37.93431,37.01423,36.61345),
sd=c(2.267125,1.852052,1.339032))
Now, we first plot the means as dots, making sure that we have enough room horizontally (xlim) and vertically (ylim), suppressing x axis annotation (xaxt="n") and all axis labeling (xlab="", ylab="").
plot(1:3,foo$mean.temp,pch=19,xlab="",ylab="",xaxt="n",xlim=c(0.5,3.5),
ylim=c(min(foo$mean.temp-foo$sd),max((foo$mean.temp+foo$sd))))
Next, we plot the standard deviations as lines. You could also use three separate lines() commands, which may be easier to read. Here, we instead collect the data into matrices via rbind(). R recycles the single NA across its row and automatically turns these matrices into vectors. The NAs are there so we don't join the end of one line to the beginning of the next one. (Try removing the NAs to see what happens.)
lines(rbind(1:3,1:3,NA),rbind(foo$mean.temp-foo$sd,foo$mean.temp+foo$sd,NA))
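To see why the NA rows work, it may help to look at the vectors that lines() effectively receives. The following sketch flattens the same matrices by hand; the names `x_mat`, `y_mat`, `x_vec` and `y_vec` are illustrative only.

```r
foo <- data.frame(color = c("black", "red", "orange"),
                  mean.temp = c(37.93431, 37.01423, 36.61345),
                  sd = c(2.267125, 1.852052, 1.339032))

# Each column describes one error bar: lower end, upper end, then a break
x_mat <- rbind(1:3, 1:3, NA)
y_mat <- rbind(foo$mean.temp - foo$sd, foo$mean.temp + foo$sd, NA)

# Matrices are stored column by column, so flattening keeps each bar together
x_vec <- as.vector(x_mat)
y_vec <- as.vector(y_mat)
print(x_vec)
# 1 1 NA 2 2 NA 3 3 NA
```

Every third element is NA, and lines() starts a new segment after each NA, which is what keeps the three bars separate.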
Finally, annotate the x axis:
axis(side=1,at=1:3,labels=foo$color)
I am fairly new to R and ggplot2 and am having some trouble plotting multiple variables in the same histogram plot.
My data is already grouped and just needs to be plotted. The data is by week and I need to plot the number for each category (A, B, C and D).
Date A B C D
01-01-2011 11 0 11 1
08-01-2011 12 0 3 3
15-01-2011 9 0 2 6
I want the Dates as the x axis and the counts plotted as different colors according to a generic y axis.
I am able to plot just one of the categories at a time, but am not able to find an example like mine.
This is what I use to plot one category. I am pretty sure I need to use position="dodge" to plot multiple as I don't want it to be stacked.
ggplot(df, aes(x=Date, y=A)) + geom_histogram(stat="identity") +
labs(title = "Number in Category A") +
ylab("Number") +
xlab("Date") +
theme(axis.text.x = element_text(angle = 90))
Also, this gives me a histogram with spaces in between the bars. Is there any way to remove this? I tried spaces=0 as you would do when plotting bar graphs, but it didn't seem to work.
I read some previous questions similar to mine, but the data was in a different format and I couldn't adapt it to fit my data.
This is some of the help I looked at:
Creating a histogram with multiple data series using multhist in R
http://www.cookbook-r.com/Graphs/Plotting_distributions_%28ggplot2%29/
I'm also not quite sure what the bin width is. I think it is how the data should be spaced or grouped, which doesn't apply to my question since it is already grouped. Please advise me if I am wrong about this.
Any help would be appreciated.
Thanks in advance!
You're not really plotting histograms, you're just plotting a bar chart that looks kind of like a histogram. I personally think this is a good case for faceting:
library(ggplot2)
library(reshape2) # for melt()
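melt_df <- melt(df, id.vars="Date") # long format: Date, variable, value
head(melt_df) # so you can see it
ggplot(melt_df, aes(Date,value,fill=Date)) +
geom_bar(stat="identity") + # bar heights come from value, not from counts
facet_wrap(~ variable)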
However, I think in general, that changes over time are much better represented by a line chart:
ggplot(melt_df,aes(Date,value,group=variable,color=variable)) + geom_line()
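For completeness, here is a hedged sketch of the side-by-side bars the question originally asked about. It rebuilds the three sample rows from the question, reshapes them to long format in base R (a stand-in for melt(), so reshape2 is not required), and dodges on the category. The name `long_df` is illustrative, and `geom_col()` assumes ggplot2 2.2.0 or later.

```r
library(ggplot2)

# The three sample rows from the question
df <- data.frame(Date = c("01-01-2011", "08-01-2011", "15-01-2011"),
                 A = c(11, 12, 9), B = c(0, 0, 0),
                 C = c(11, 3, 2), D = c(1, 3, 6))

# Base R equivalent of melt(): one row per Date/category pair
cats <- c("A", "B", "C", "D")
long_df <- data.frame(Date = rep(df$Date, times = length(cats)),
                      variable = rep(cats, each = nrow(df)),
                      value = unlist(df[cats], use.names = FALSE))

# fill = variable also tells position = "dodge" which bars to separate
p <- ggplot(long_df, aes(x = Date, y = value, fill = variable)) +
  geom_col(position = "dodge") +
  labs(x = "Date", y = "Number") +
  theme(axis.text.x = element_text(angle = 90))
print(p)
```

Because the dates are treated as discrete labels here, the gaps between weeks are just the default spacing between bar groups; raising the `width` argument of `geom_col()` narrows them.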
My dataset:
I have data in the following format (here, imported from a CSV file). You can find an example dataset as CSV here.
PAIR PREFERENCE
1 5
1 3
1 2
2 4
2 1
2 3
… and so on. In total, there are 19 pairs, and the PREFERENCE ranges from 1 to 5, as discrete values.
What I'm trying to achieve:
What I need is a stacked histogram, i.e. a column 100% high for each pair, indicating the distribution of the PREFERENCE values.
Something similar to the "100% stacked columns" in Excel, or (although not quite the same) a so-called "mosaic plot":
What I tried:
I figured it'd be easiest using ggplot2, but I don't even know where to start. I know I can create a simple bar chart with something like:
ggplot(d, aes(x=factor(PAIR), y=factor(PREFERENCE))) + geom_bar(position="fill")
… that however doesn't get me very far. So I tried this, and it gets me somewhat closer to what I'm trying to achieve, but it still uses the count of PREFERENCE, I suppose? Note the ylab being "count" here, and the values ranging to 19.
qplot(factor(PAIR), data=d, geom="bar", fill=factor(PREFERENCE_FIXED))
Results in:
So, what do I have to do to get the stacked bars to represent a histogram?
Or do they actually do this already?
If so, what do I have to change to get the labels right (e.g. have percentages instead of the "count")?
By the way, this is not really related to this question, and only marginally related to this (i.e. probably same idea, but not continuous values, instead grouped into bars).
Maybe you want something like this:
ggplot() +
geom_bar(data = dat,
aes(x = factor(PAIR),fill = factor(PREFERENCE)),
position = "fill")
where I've read your data into dat. This outputs something like this:
The y label is still "count", but you can change that manually by adding:
+ scale_x_discrete("Pairs") + scale_y_continuous("Votes")
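Since the question also asks for percentages on the axis, here is a rough sketch along the same lines. It uses only the six rows shown in the question as a stand-in for the full CSV, checks the per-pair shares with base R, and labels the y axis with `scales::percent` (the scales package is installed alongside ggplot2). The axis titles and the name `shares` are my own choices.

```r
library(ggplot2)

# Stand-in for the CSV: only the six rows shown in the question
dat <- data.frame(PAIR = c(1, 1, 1, 2, 2, 2),
                  PREFERENCE = c(5, 3, 2, 4, 1, 3))

# Share of each PREFERENCE value within each PAIR; every row sums to 1
shares <- prop.table(table(dat$PAIR, dat$PREFERENCE), margin = 1)
print(round(shares, 2))

# position = "fill" rescales each stacked bar to the same shares
p <- ggplot(dat, aes(x = factor(PAIR), fill = factor(PREFERENCE))) +
  geom_bar(position = "fill") +
  scale_y_continuous("Share of votes", labels = scales::percent) +
  labs(x = "Pair", fill = "Preference")
print(p)
```

The `shares` table is what each filled bar represents, so it can serve as a sanity check on the plot.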