Plot mean and standard deviation by category - r

I'm trying to draw a plot with mean and sd bars for the three levels of a factor.
(After two hours of searching on the internet, then checking the Rbook and Rgraphs book I'm still not finding the answer. I think this is because it is a very simple question.)
I have a simple data frame with three columns: my categories, mean, sd.
I would like to do a plot with the mean by category and its sd bars, just like
this one (edit: link broken)
My dataframe looks like this
color mean.temp sd
black 37.93431 2.267125
red 37.01423 1.852052
orange 36.61345 1.339032
I'm so sorry for asking this dumb question but I sincerely couldn't find any simple answer to my simple question.

With ggplot:
read data:
df=read.table(text=' color mean.temp sd
1 black 37.93431 2.267125
2 red 37.01423 1.852052
3 orange 36.61345 1.339032',header=TRUE)
plotting:
library(ggplot2)
# group=1 lets geom_line() connect the points across the discrete x axis
ggplot(df, aes(x=color, y=mean.temp, group=1)) +
geom_errorbar(aes(ymin=mean.temp-sd, ymax=mean.temp+sd), width=.2) +
geom_line() +
geom_point()
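If you are starting from raw measurements rather than an already summarised table, the three-column data frame can be built first. A minimal sketch, assuming a hypothetical data frame raw with one temperature reading per row (the name and the values are made up for illustration):

```r
# Hypothetical raw readings; 'raw' and its values are invented for illustration.
raw <- data.frame(
  color = rep(c("black", "red", "orange"), each = 4),
  temp  = c(36, 38, 40, 38,  35, 37, 39, 37,  36, 36, 38, 38)
)

# Mean and standard deviation per category.
means <- tapply(raw$temp, raw$color, mean)
sds   <- tapply(raw$temp, raw$color, sd)

# Same layout as the data frame in the question: color, mean.temp, sd.
df <- data.frame(
  color     = names(means),
  mean.temp = as.vector(means),
  sd        = as.vector(sds)
)
print(df)
```

The resulting df can be passed to the ggplot call unchanged.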

Create a data.frame holding your data:
foo <- data.frame(color=c("black","red","orange"),
mean.temp=c(37.93431,37.01423,36.61345),
sd=c(2.267125,1.852052,1.339032))
Now, we first plot the means as dots, making sure that we have enough room horizontally (xlim) and vertically (ylim), suppressing x axis annotation (xaxt="n") and all axis labeling (xlab="", ylab="").
plot(1:3,foo$mean.temp,pch=19,xlab="",ylab="",xaxt="n",xlim=c(0.5,3.5),
ylim=c(min(foo$mean.temp-foo$sd),max((foo$mean.temp+foo$sd))))
Next, we plot the standard deviations as lines. You could also use three separate lines commands, which may be easier to read. This way, we first collect the data into matrices via rbind(). R will automatically turn these matrices into vectors and recycle them. The NAs are there so we don't join the end of one line to the beginning of the next one. (Try removing the NAs to see what happens.)
lines(rbind(1:3,1:3,NA),rbind(foo$mean.temp-foo$sd,foo$mean.temp+foo$sd,NA))
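To see why the NA rows break the lines apart, it can help to look at what lines() receives once the matrix is flattened column by column. A small illustration using only base R:

```r
# Each column is one error bar: x, x, then NA as a separator.
xs <- rbind(1:3, 1:3, NA)
print(xs)

# lines() treats the matrix as a plain vector, read column by column;
# an NA in the coordinates ends the current line segment.
flat <- as.vector(xs)
print(flat)
```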
Finally, annotate the x axis:
axis(side=1,at=1:3,labels=foo$color)
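A common base R variation, not part of the answer above, is to draw the bars with arrows() so that they get flat caps. A minimal sketch, assuming the same foo data frame; the cap length of 0.05 inches is an arbitrary choice:

```r
foo <- data.frame(color = c("black", "red", "orange"),
                  mean.temp = c(37.93431, 37.01423, 36.61345),
                  sd = c(2.267125, 1.852052, 1.339032))

lower <- foo$mean.temp - foo$sd
upper <- foo$mean.temp + foo$sd

plot(1:3, foo$mean.temp, pch = 19, xlab = "", ylab = "", xaxt = "n",
     xlim = c(0.5, 3.5), ylim = range(lower, upper))

# angle = 90 with code = 3 turns both arrowheads into flat caps.
arrows(x0 = 1:3, y0 = lower, x1 = 1:3, y1 = upper,
       angle = 90, code = 3, length = 0.05)

axis(side = 1, at = 1:3, labels = foo$color)
```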

Related

R - Bar Plot with transparency based on values?

I have a dataset myData which contains x and y values for various Samples. I can create a line plot for a dataset which contains a few Samples with the following pseudocode, and it is a good way to represent this data:
myData <- data.frame(x = 290:450, X52241 = c(..., ..., ...), X75123 = c(..., ..., ...))
myData <- myData %>% gather(Sample, y, -x)
ggplot(myData, aes(x, y)) + geom_line(aes(color=Sample))
Which generates:
This turns into a Spaghetti Plot when I have a lot more Samples added, which makes the information hard to understand, so I want to represent the "hills" of each sample in another way. Preferably, I would like to represent the data as a series of stacked bars, one for each myData$Sample, with transparency inversely related to what is in myData$y. I've tried to represent that data in photoshop (badly) here:
Is there a way to do this? Creating faceted plots using facet_wrap() or facet_grid() doesn't give me what I want (far too many Samples). I would also be open to stacked ridgeline plots using ggridges, but I am not understanding how I would be able to convert absolute values to a stat(density) value needed to plot those.
Any suggestions?
Thanks to u/Joris for the helpful suggestion! Since I did not find this question elsewhere, I'll go ahead and post the pretty simple solution to my question here for others to find.
Basically, I needed to apply the alpha aesthetic via aes(alpha=y, ...). In theory, I could apply this over any geom. I tried geom_col(), which worked, but the best solution was to use geom_segment(), since all my "bars" were going to be the same length. Also note that I had to "slice" up the segments in order to avoid the problem of overplotting similar to those found here, here, and here.
ggplot(myData, aes(x, Sample)) +
geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=y), color='blue3', size=14)
That gives us the nice gradient:
Since the max y values are not the same for both lines, if I wanted to "match" the intensity I normalized the data (myDataNorm) and could make the same plot. In my particular case, I kind of preferred bars that did not have a gradient, but which showed a hard edge for the maximum values of y. Here was one solution:
ggplot(myDataNorm, aes(x, Sample)) +
geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=ifelse(y>0.9,1,0)), color='blue3', size=14) +
theme(legend.position='none')
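The normalization step itself is not shown. One plausible way to build myDataNorm, assuming "normalized" means scaling each Sample by its own maximum (that reading is an assumption, as are the toy values below):

```r
# Toy long-format data in the same shape as myData after gather();
# the y values are invented for illustration.
myData <- data.frame(
  x      = rep(290:295, times = 2),
  Sample = rep(c("X52241", "X75123"), each = 6),
  y      = c(1, 4, 9, 4, 1, 0,  2, 3, 5, 20, 5, 2)
)

# Divide every y by the maximum y of its own Sample, so each peak becomes 1.
myDataNorm <- myData
myDataNorm$y <- ave(myData$y, myData$Sample, FUN = function(v) v / max(v))

print(tapply(myDataNorm$y, myDataNorm$Sample, max))
```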
Better, but I did not like the faint-colored areas that were left. The final code is what gave me something that perfectly captured what I was looking for. I simply moved the ifelse() statement to apply to the x aesthetic, so the parts of the segment drawn were only those with high enough y values. Note my data "starts" at x=290 here. Probably more elegant ways to combine those x and xend terms, but whatever:
ggplot(myDataNorm, aes(x, Sample)) +
geom_segment(aes(
x=ifelse(y>0.9,x,290), xend=ifelse(y>0.9,x-1,290),
y=Sample, yend=Sample), color='blue3', size=14) +
xlim(290,400) # needed to show entire scale
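A possible alternative to the ifelse() calls, offered as a sketch rather than as what was actually used: drop the low rows before plotting. The toy myDataNorm below is invented for illustration, and the x limits are adjusted to its range.

```r
library(ggplot2)

# Toy normalized data; values are made up for illustration.
myDataNorm <- data.frame(
  x      = rep(290:295, times = 2),
  Sample = rep(c("X52241", "X75123"), each = 6),
  y      = c(0.1, 0.5, 1, 0.95, 0.4, 0,  0.2, 0.3, 0.92, 1, 0.5, 0.1)
)

# Keep only the slices whose y is above the threshold.
peaks <- subset(myDataNorm, y > 0.9)

# 'size' matches the code above; ggplot2 3.4.0 and later call this 'linewidth'.
p <- ggplot(peaks, aes(x, Sample)) +
  geom_segment(aes(x = x, xend = x - 1, y = Sample, yend = Sample),
               color = "blue3", size = 14) +
  xlim(289, 295)
```

One caveat with this approach: a Sample with no rows above the threshold disappears from the y axis entirely, which the ifelse() version avoids.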

How can I wrap a ggplot column n times after hitting threshold number

I have a barplot where one entry is so much larger than my other entries that it makes it difficult to do interesting analysis on the other, smaller-valued data points.
plt <- ggplot(dffd[dffd$Month==i & dffd$UniqueCarrier!="AA",],aes(x=UniqueCarrier,y=1,fill=DepDelay))+
geom_col()+
coord_flip()+
scale_fill_gradientn(breaks=late_breaks,labels=late_breaks,limits=c(0,150),colours=c('black','yellow','orange','red','darkred'))
When I remove it I get back to an interesting degree of interpretation but now I'm tossing out upwards of half the data and arguably the most important one to explore.
I was wondering if there is a way that I could set an interval on my bar plot, say 500 in this case, after which I can start another column for the same entry right under it and resume building up my bar plot. In this example, that would translate into WN splitting into 3 bars of length 500, 500, and ~400 stacked one below the other, all under that one WN label (ideally it shows the one tick for all three). Since I have a couple of other plots with disproportionately large entries, doing this as a layer during plotting is of great interest to me.
Typically, when you have such disproportionate values in your data set, you should either put your values on a log scale (or use some other transformation) or zoom in on the plot using coord_cartesian. I think you probably could hack your way around and create the desired plot, but it's going to be quite misleading in terms of visualisation and analysis.
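As a rough sketch of the zooming suggestion, assuming a hypothetical pre-counted data frame counts (the carriers and numbers are invented) and an arbitrary cut-off of 300:

```r
library(ggplot2)

# Hypothetical flight counts per carrier; values are invented for illustration.
counts <- data.frame(
  UniqueCarrier = c("WN", "DL", "UA", "B6"),
  n             = c(1400, 220, 180, 90)
)

# coord_cartesian() zooms the view without dropping any data,
# so the WN bar simply runs off the top of the panel.
p <- ggplot(counts, aes(x = UniqueCarrier, y = n)) +
  geom_col() +
  coord_cartesian(ylim = c(0, 300))
```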
EDIT:
Based on your comments, I have a rather hacky solution. The data you've pasted was not directly usable (a part of the dput output was missing and there's no DepDelay column, so I improvised).
The idea is to create an extra tag column based on the UniqueCarrier column and the max amount you want.
df2 <- df %>%
filter(UniqueCarrier != "AA" & Month == i) %>%
group_by(UniqueCarrier) %>%
mutate(tag = paste(UniqueCarrier, rep(seq(1, n()%/%500+1), each=500), sep="_")[1:n()])
This adds a tag column that basically says how many columns you'll have in each category.
plt <- ggplot(df2, aes(x=tag, y=1, fill=DepDelay)) +
geom_col() +
coord_flip() +
scale_fill_gradientn(breaks=late_breaks, labels=late_breaks,
limits=c(0,150),
colours=c('black','yellow','orange','red','darkred')) +
scale_x_discrete(labels=str_replace(sort(unique(df2$tag)), "_[:digit:]", ""))
plt
In the image above, I've used CarrierDelay with a break interval of 100. You can see that the WN label then repeats; there are ways to remove the extra ones (some more creative replacements in the scale_x_discrete labels).
If you want the columns to be ordered differently, just replace seq(1, n()%/%500+1) with seq(n()%/%500+1, 1).
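To see what the tag column looks like, here is a small base R sketch of the same chunking idea on toy data. The carriers, the row counts, and the chunk size of 2 (instead of 500) are made up for illustration, and ave() stands in for the group_by()/mutate() pipeline.

```r
# Toy stand-in for the flights data: one element per flight.
carrier <- c(rep("WN", 5), rep("DL", 2))
chunk <- 2  # rows per bar; the answer uses 500

# Within each carrier, number the rows and cut them into chunks of size 'chunk'.
chunk_id <- ave(seq_along(carrier), carrier,
                FUN = function(i) (seq_along(i) - 1) %/% chunk + 1)
tag <- paste(carrier, chunk_id, sep = "_")
print(tag)

# Strip the suffix again for axis labels; "[0-9]+$" also copes with 10 or more chunks.
labels <- sub("_[0-9]+$", "", tag)
print(labels)
```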

Scatterplot in ggplot stacked like barplot

I want to create a scatterplot in ggplot where there are multiple y values for each x value. I want to add these y values and plot the sum against the x value.
>df
a b
1 2
1 2
2 1
2 4
3 1
3 5
I want a plot that plots the sums of the b values for each a
a b
1 4
2 5
3 6
I can do this for a barplot by making a stacked barplot:
ggplot(data=df, aes(x=df$a, y=df$b)) + geom_bar(stat="identity")
but if I do this with geom_point ggplot just plots each value of y without stacking.
I could use ddply for this, but that would require a number of more steps. If there is a more expedient way I'd appreciate it.
I searched the site for other answers. While there were plenty about "stacked scatterplots" they were all about overlaid plots.
I don't see anything stacked about your bar chart example. If you just want to summarize the values to a single point, you can use stat_summary:
# the fun argument was called fun.y before ggplot2 3.3.0
ggplot(data=df, aes(x=a, y=b)) + stat_summary(fun=sum, geom="point")
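As a cross-check on what the summary computes, the same per-group sums can be produced by hand. A small base R sketch using the data from the question (the name summed is just for illustration):

```r
df <- data.frame(a = c(1, 1, 2, 2, 3, 3),
                 b = c(2, 2, 1, 4, 1, 5))

# Sum of b within each value of a.
sums <- tapply(df$b, df$a, sum)

summed <- data.frame(a = as.numeric(names(sums)), b = as.vector(sums))
print(summed)
```

The b column of summed holds 4, 5 and 6, matching the table in the question.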
There are many ways to achieve this effect - of a 'histogram' but without bars, whose height is the sum of all values at the same X.
This type of graph is called a Cleveland Dot Plot, and is used because the conspicuous bars of a histogram can be a distraction or at worst be misleading (see works by Cleveland, Tufte etc.).
One way to achieve this is to pre-process the data to do the sum, using functions such as table or hist or tapply or xtabs...
Note that base R has the function dotchart for the production of this type of graph.
dotchart(xtabs(rev(df)))
... but since we are discussing ggplot, which has powerful ways to summarise the data while plotting it, let's stick to MrFlick's theme of how to do it directly with ggplot operators (i.e. not preprocessed).
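Using a weighted count summary statistic (older ggplot2 versions accepted stat="bin" here; current versions need stat="count" for a discrete x):
ggplot(data=df, aes(x=factor(a),weight=b)) + geom_point(stat="count")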
you may want to adjust the lower y limit to 0 here.
By stacking the height of the points:
ggplot(data=df, aes(x=factor(a),y=b)) + geom_point(position="stack")
the additional dots visible on this plot are probably superfluous and definitely ambiguous, but highlight the fact of multiplicity in the source data.
Building a dotplot
This one is popular in newspapers, but usually has dollar bills instead of giant black holes:
ggplot(data=df, aes(x=factor(a),weight=b)) + geom_dotplot(method="histodot")
It's probably not what you are looking for, but it's worth being aware of.
You should also be aware that scales are difficult to get correct in this mode, so it's best used in a hand-tuned mode, with the y scale numbering turned off.

Plot a 'top 10' style list/ranking in R based on numerical column of dataframe

I have an R dataframe that contains a string variable and a numerical variable, and I would like to plot the top 10 strings, based on the value of the numerical variable.
I can of course get the top 10 entries pretty simply:
top10_rank <- rank[order(rank$numerical_var_name),]
My first approach to visualizing this was simply to attempt to plot it like this:
ggplot(data=top10_rank, aes(x = top10_rank$numerical_var_name, y = top10_rank$string_name)) + geom_point(size=3)
And to a first approximation this "works" - the problem is that the strings on the y axis are sorted alphabetically rather than by the numerical value.
My preference would be to find a way to plot the top 10 strings without having to bother showing the numerical variable at all - just basically as a list (even better would be if I could enumerate the list). I am attempting to plot this so it looks more pleasing than simply dumping the text to the screen.
Any ideas greatly appreciated!
The y-axis tick marks may be sorted alphabetically, but the points are drawn in the order (from left to right) of the top10_rank dataframe. What you need to do is change the order of the y-axis. Add scale_y_discrete(limits=top10_rank$String) to your ggplot call and it should work.
ggplot(data=top10_rank, aes(x = top10_rank$Number,
y = top10_rank$String)) + geom_point(size=3) + scale_y_discrete(limits=top10_rank$String)
Here is a link to a great resource on R graphics: R Graphics Cookbook
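Putting the pieces together, a minimal sketch of selecting and enumerating a top 10. The data frame rank_df and its values are hypothetical; the column names String and Number follow the answer above.

```r
# Hypothetical data: 12 strings with a score each.
rank_df <- data.frame(
  String = letters[1:12],
  Number = c(5, 12, 7, 1, 9, 3, 11, 2, 8, 10, 4, 6),
  stringsAsFactors = FALSE
)

# Sort by the numeric column, largest first, and keep the first 10 rows.
top10 <- head(rank_df[order(-rank_df$Number), ], 10)

# Enumerated labels such as "1. b", "2. g", ...
top10$label <- paste0(seq_len(nrow(top10)), ". ", top10$String)
print(top10$label)
```

The label column can be printed as a plain enumerated list, or rev(top10$String) can be passed as the limits of scale_y_discrete() so that the largest value ends up at the top of the axis.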

Why do geom_line() and geom_freqpoly() give back different graphs?

I am trying to get my head around ggplot2, which creates beautiful graphs as you probably all know :)
I have a dataset with some transactions of sold houses in it (courtesy of: http://support.spatialkey.com/spatialkey-sample-csv-data/ )
I would like to have a line chart that plots the cities on the x axis and 4 lines showing the number of transactions in my datafile per city for each of the 4 home types. Doesn't sound too hard, so I found two ways to do this.
1. using an intermediate table doing the counts and geom_line() to plot the results
2. using geom_freqpoly() on my raw dataframe
the basic charts look the same, however chart nr. 2 seems to be missing plots for all the 0 values of the counts (eg. for the cities right of SACRAMENTO, there is no data for Condo, Multi-Family or Unknown (which seems to be missing completely in this graph)).
I personally like the syntax of method number 2 more than that of number 1 (it's a personal thing probably).
So my question is: Am I doing something wrong or is there a method to have the 0 counts also plotted in method 2?
# line chart example
# setup the libraries
library(RCurl) # so we can download a dataset
library(ggplot2) # so we can make nice plots
library(gridExtra) # so we can put plots on a grid
# get the data in from the web straight into a dataframe (all data is from: http://support.spatialkey.com/spatialkey-sample-csv-data/)
data <- read.csv(text=getURL('http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'))
# create a data frame that counts the number of trx per city/type combination
df_city_type<-data.frame(table(data$city,data$type))
# correct the column names in the dataframe
names(df_city_type)<-c('city','type','qty')
# alternative 1: create a ggplot with a geom_line on the calculated values - to show the nr. trx per city (on the x axis) with a different colored line for each type
cline1<-ggplot(df_city_type,aes(x=city,y=qty,group=type,color=type)) + geom_line() + theme(axis.text.x=element_text(angle=90,hjust=0))
# alternative 2: create a ggplot with a geom_freqpoly on the source data - to show the nr. trx per city (on the x axis) with a different colored line for each type
c_line <- ggplot(na.omit(data),aes(city,group=type,color=type))
cline2<- c_line + geom_freqpoly() + theme(axis.text.x=element_text(angle=90,hjust=0))
# plot the two graphs in rows to compare, see that right of SACRAMENTO we miss two lines in plot 2, while they are in plot 1 (and we want them)
myplot<-grid.arrange(cline1,cline2)
As @joran pointed out, this gives a "similar" plot, when using "continuous" values:
ggplot(data, aes(x=as.numeric(factor(city)), group=type, colour=type)) +
geom_freqpoly(binwidth=1)
However, this is not exactly the same (compare the start of the graph), as the breaks are screwed up. Instead of binning from 1 to 39 with a binwidth of 1, it for some reason starts at 0.5 and runs to 39.5.
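One thing that may explain part of the difference: the intermediate table in alternative 1 contains explicit zero rows, because table() counts every combination of city and type whether or not it occurs in the data. A small sketch with made-up data (the sales data frame is hypothetical):

```r
# Toy transactions; cities and types are invented for illustration.
sales <- data.frame(
  city = c("A", "A", "B", "C"),
  type = c("Condo", "Residential", "Residential", "Residential")
)

# table() tabulates every city/type combination, including unobserved ones.
counts <- as.data.frame(table(city = sales$city, type = sales$type))
names(counts)[3] <- "qty"
print(counts)

# Only the combinations that actually occur in the data.
observed <- counts[counts$qty > 0, ]
print(observed)
```

Here counts has six rows, two of them with qty equal to 0, while observed has only four; a line drawn from counts passes through the zeros, one drawn from observed does not.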
