Using geom_bar() to stack values that add up? - r

I have a data frame in R which looks like:
Month numFlights numDelays onTime
1 2000-01-01 7520584 1299743 6220841
2 2000-02-01 6949127 1223397 5725730
3 2000-03-01 7808080 1390796 6417284
4 2000-04-01 7534239 1178425 6355814
5 2000-05-01 7720013 1236135 6483878
6 2000-06-01 7727349 1615408 6111941
7 2000-07-01 8000680 1652590 6348090
8 2000-08-01 7990440 1481498 6508942
9 2000-09-01 6811541 875381 5936160
10 2000-10-01 7026150 1046749 5979401
11 2000-11-01 6689783 987175 5702608
12 2000-12-01 6895454 1535196 5360258
What I'm looking to do is create a bar chart where, for each month (on the x axis), the bar height is the number of delayed flights plus the number of on-time flights, stacked. I tried figuring out how to do that using the example
qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(gear))
but my data isn't formatted the same way as mtcars. And I can't just use an alpha value, because I want to eventually add cancelled flights.
I know that some questions are very similar to this one, but not similar enough for me to work it out (this question is almost the same but I don't know how to manipulate the facet_grid yet.) Any ideas?
Thanks!

You can melt the data into long format and let geom_bar() stack the two series, for example:
library(reshape2)
dat.m <- melt(dat,id.vars='Month',measure.vars=c('numDelays','onTime'))
library(ggplot2)
library(scales)
ggplot(dat.m) +
geom_bar(aes(Month, value, fill = variable), stat = "identity") + # stat = "identity" plots the values as given instead of counting rows
scale_y_continuous(labels = comma) +
coord_flip() +
theme_bw()
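The same reshape-then-stack idea can also be sketched with tidyr's pivot_longer() and geom_col(), swapped in here for melt() and geom_bar(); this is an alternative, not what the answer above used. It uses only the first three rows from the question. Under this layout, a further category such as cancelled flights would presumably just be one more column listed in cols.

```r
library(ggplot2)
library(tidyr)

# First three rows of the data frame from the question
dat <- data.frame(
  Month     = as.Date(c("2000-01-01", "2000-02-01", "2000-03-01")),
  numDelays = c(1299743, 1223397, 1390796),
  onTime    = c(6220841, 5725730, 6417284)
)

# Long format: one row per (Month, category) pair
dat_long <- pivot_longer(dat, cols = c(numDelays, onTime),
                         names_to = "variable", values_to = "value")

# geom_col() is shorthand for geom_bar(stat = "identity"); bars stack by default
p <- ggplot(dat_long, aes(Month, value, fill = variable)) +
  geom_col()
```

Each stacked bar then reaches numDelays + onTime for its month, which matches the numFlights column in the question.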

Related

Making a time plot of the frequency that a certain value appears in data set

I have a dataset about a university's student body with 10 columns that represent different factors such as their student id, gender, ethnicity, etc.
For right now I'm just interested in the term they were admitted, and their ethnicity because I want to see how the number of students from different ethnic backgrounds has changed over time. So I created a new data frame with two columns called ethnicitydf:
> head(ethnicitydf)
admit_term ethn_desc
1 2011-10-01 White/Caucasian
2 2011-10-01 Filipino/Filipino-American
3 2011-10-01 White/Caucasian
4 2011-10-01 Latino/Other Spanish
5 2011-10-01 East Indian/Pakistani
6 2011-10-01 White/Caucasian
I'm not exactly sure how I would create a plot that has the admit_term (time) in the x-axis and the frequency that each ethnicity occurs for each admit_term. There are 12 unique ethnicities in the second column and I want to have the frequency of all 12 ethnicities for each admit_term (6 terms in total) in one graph, each ethnicity having a different color.
The first step I was thinking of was counting up each ethnicity for each term, using length(which(ethnicitydf$admit_term == "2011-10-01" & ethnicitydf$ethn_desc == "White/Caucasian")) for example, and recording the data in a new data frame, but I feel like there should be a faster and more efficient way of doing this. Maybe the use of a package? Could anybody help me out? Thank you!
A bar plot will do the counts for you.
library(ggplot2)
ethnicitydf <- data.frame(admit_term = sample(c("2011-10-01","2012-10-01","2013-10-01"), 100, TRUE),
ethn_desc =sample(c("White/Caucasian","Filipino/Filipino-American","East Indian/Pakistani"), 100, TRUE))
ggplot() +
geom_bar(data=ethnicitydf, mapping=aes(x=admit_term, fill=ethn_desc), position="dodge")
Created on 2019-07-03 by the reprex package (v0.3.0)
You can also just plot points if you have a lot of series, like this.
ggplot() +
geom_point(data=ethnicitydf, mapping=aes(x=admit_term, colour=ethn_desc), stat="count")
To get lines you will need to make sure your x axis is continuous rather than text (the line below turns the text dates into Date values), so that the points are grouped by ethnicity only.
ethnicitydf$admit_term <- as.Date(ethnicitydf$admit_term)
ggplot() +
geom_line(data=ethnicitydf, mapping=aes(x=admit_term, colour=ethn_desc), stat="count") +
geom_point(data=ethnicitydf, mapping=aes(x=admit_term, colour=ethn_desc), stat="count")
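If you would rather have the counts in a data frame of their own (which is what the question first set out to build), one possible sketch uses base R's table(). The four rows below are made-up toy data in the shape of ethnicitydf, and the column name n is my own choice.

```r
library(ggplot2)

# Toy data in the same shape as ethnicitydf (values are illustrative only)
ethnicitydf <- data.frame(
  admit_term = c("2011-10-01", "2011-10-01", "2011-10-01", "2012-10-01"),
  ethn_desc  = c("White/Caucasian", "Filipino/Filipino-American",
                 "White/Caucasian", "White/Caucasian")
)

# Cross-tabulate term by ethnicity, then flatten to one row per combination
counts <- as.data.frame(table(ethnicitydf), responseName = "n")

# table() turns the terms into factor levels; convert back to dates for a continuous x axis
counts$admit_term <- as.Date(as.character(counts$admit_term))

p <- ggplot(counts, aes(admit_term, n, colour = ethn_desc)) +
  geom_line() +
  geom_point()
```

Unlike stat="count", this keeps combinations that never occur as explicit rows with n equal to 0.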

geom_tile adds a third fill colour not in the data

It's difficult for me to create a reproducible example of this as the issue only seems to show as the size of the data frame goes up to too large to paste here. I hope someone will bear with me and help here. I'm sure I'm doing something stupid but reading the help and searching is failing (perhaps on the "stupid" issue.)
I have a data frame of 2,319 rows and three variables: clientID, month and nSlots where clientID is character, month is 1:12 and nSlots is 1:2.
> head(tmpDF2)
month clientID2 nSlots
21 1 8 1
30 2 8 1
31 4 8 1
28 5 8 1
25 6 8 1
24 7 8 1
Here's table(tmpDF2$nSlots)
> table(tmpDF2$nSlots, useNA = "always")
1 2 <NA>
1844 15 0
I'm trying to use ggplot and geom_tile to plot the attendance of clients, and I expect two colours for the tiles, one for each of the two values of nSlots, but when the size of the data frame goes up, I get a third colour. Here is the plot.
OK. Well I gather you can't see that so perhaps I should stop here! Aha, or maybe you can click through to that link. I hope so!
Here's the code then for what it's worth.
ggplot(data=tmpDF2,
aes(x=month,y=clientID2,fill=nSlots)) +
geom_tile() +
# geom_text(aes(label=nSlots)) +
theme(panel.background = element_blank()) +
theme(axis.text.x=element_text(angle=90,hjust=1)) +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.line=element_line()) +
ylab("clients")
The bizarre thing (to me) is that when I keep the number of rows small, the plot seems to work fine but as the number goes up, there's a point, and I've failed utterly to find if one row in the data or value of nrow(tmpDF2) triggers it, when this third colour, a paler value than the one in the legend, appears.
TIA,
Chris

Adding points to plot using ggplot2

Here are the first 9 rows (out of 54) and the first 8 columns (out of 1003) of my dataset
stream n rates means 1 2 3 4
1 Brooks 3 3.0 0.9629152 0.42707006 1.9353659 1.4333884 1.8566225
2 Siouxon 3 3.0 0.5831929 0.90503736 0.2838483 0.2838483 1.0023212
3 Speelyai 3 3.0 0.6199235 0.08554021 0.7359903 0.4841935 0.7359903
4 Brooks 4 7.5 0.9722707 1.43338843 1.8566225 0.0000000 1.3242210
5 Siouxon 4 7.5 0.5865031 0.50574543 0.5057454 0.2838483 0.4756304
6 Speelyai 4 7.5 0.6118634 0.32252396 0.4343109 0.6653132 2.2294652
7 Brooks 5 10.0 0.9637475 0.88984211 1.8566225 0.7741612 1.3242210
8 Siouxon 5 10.0 0.5804420 0.47501800 0.7383634 0.5482181 0.6430847
9 Speelyai 5 10.0 0.5959238 0.15079491 0.2615963 0.4738504 0.0000000
Here is a simple plot I have made using the values found in the means column for all rows with stream name Speelyai (18).
The means column is calculated by taking the mean for the entire row. Each column represents 1 simulation. So, the mean column is the mean of 1000 simulations. I would like to plot the actual simulation values on the plot as well. I think it would be informative to not only have the mean plotted (with a line) but also show the "raw" data (simulations) as points. I see that I can use the geom_point(), but am not sure how to get all the points for any row that has the stream name "Speelyai"
THANKS
As you can see, the scales are very different, which I would expect given that these points are results from simulations, or resampling of the original data. But how could I overlay these points on my original image in a way that still preserves the visual content? In this image the line looks almost flat, but in my original image we can see that it fluctuates quite a bit, just on a small scale...
Agree with @NickKennedy that it's a good idea to reshape your data from wide to long:
library(reshape)
x2<-melt(x,id=c("stream","n","rates"))
x2<-x2[which(x2$variable!="means"),] # this eliminates the entries for means
Now it's time to re-calculate the means:
library(data.table)
setDT(x2)
setkey(x2,"stream")
means.sp<-x2["Speelyai",.(mean.stream=mean(value)),by=rates]
So now you can plot:
library(ggplot2)
p<-ggplot(means.sp,aes(rates,mean.stream))+geom_line()
Which is exactly what you had, so now let's add the points:
p<-p+geom_point(data=x2[x2$stream=="Speelyai",],aes(rates,value))
Notice that in the call to geom_point you need to specifically declare data= as you are working with a different dataset to the one you specified in the call to ggplot.
========== EDIT TO ADD =============
Replying to your comments, and borrowing from the answer @akrun gave you here, you'll need to add the calculation of the error and then change the call to geom_point:
df2 <- data.frame(stream=c('Brooks', 'Siouxon', 'Speelyai'),
value=c(0.944062036, 0.585852702, 0.583984402), stringsAsFactors=FALSE)
x2$error <- x2$value-df2$value[match(x2$stream, df2$stream)]
And then change the call to geom_point:
geom_point(data=x2[x2$stream=="Speelyai",],aes(rates,error))
I would suggest reformatting your data in a long format rather than wide. For example:
library("tidyr")
library("ggplot2")
my_data_tidy <- gather(my_data, column, value, -c(stream, n, rates, means))
ggplot(subset(my_data_tidy, stream == "Speelyai"), aes(rates, value)) +
geom_point() +
stat_summary(fun = "mean", geom = "line") # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
Note this will also recalculate the means from your data. If you wanted to use your existing means, you could do:
ggplot(subset(my_data_tidy, stream == "Speelyai"), aes(rates, value)) +
geom_point() + geom_line(aes(rates, means), data = subset(my_data, stream == "Speelyai"))

How do you order a barplot by magnitude using qplot? [duplicate]

This question already has answers here:
Order Bars in ggplot2 bar graph
(16 answers)
Closed 8 years ago.
I use the arrange function to put my data frame in order by deaths, but when I try to do a bargraph of the top 5, they are in alphabetical order. How do I get them into order by value? Do I need to use ggplot?
library(dplyr)
library(ggplot2)
EventsByDeaths <- arrange(SumByEvent, desc(deaths))
> head(EventsByDeaths, 10)
Source: local data frame [10 x 3]
EVTYPE deaths damage
1 TORNADO 4662 2584635.60
2 EXCESSIVE HEAT 1418 53.80
3 HEAT 708 277.00
4 LIGHTNING 569 338956.35
5 FLASH FLOOD 567 759870.68
6 TSTM WIND 474 1090728.50
7 FLOOD 270 358109.37
8 RIP CURRENTS 204 162.00
9 HIGH WIND 197 170981.81
10 HEAT WAVE 172 1269.25
qplot(y=deaths, x=EVTYPE, data=EventsByDeaths[1:5,], geom="bar", stat="identity")
You could use the reorder() function
EventsByDeaths <- transform(EventsByDeaths, EVTYPE = reorder(EVTYPE, -deaths))
Then your original qplot call should work as desired. Hope this helps!
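For reference, here is a minimal sketch of the same reorder() idea using ggplot() and geom_col() instead of qplot(). The five rows are the top five from the question; top5 is just an illustrative name.

```r
library(ggplot2)

# Top five events by deaths, copied from the question's output
top5 <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "HEAT", "LIGHTNING", "FLASH FLOOD"),
  deaths = c(4662, 1418, 708, 569, 567)
)

# reorder() returns a factor whose levels are sorted by the second argument;
# negating deaths puts the largest value first
top5$EVTYPE <- reorder(top5$EVTYPE, -top5$deaths)

# Bars are drawn in factor-level order, not alphabetical order
p <- ggplot(top5, aes(EVTYPE, deaths)) +
  geom_col()
```

The key point is that ggplot2 orders a discrete axis by factor levels, so sorting the rows with arrange() alone has no effect on bar order.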

ggplot: Drawing separate y-lines when passing factor as argument

I want to create the following plot: The x-axis goes from 1 to 900 representing trial numbers. The y-axis shows three different lines with the moving average of reaction time. One line is shown for each difficulty level (Hard, Medium, Easy). Separate plots should be shown for each participant using facet_wrap.
Now this all works fine if I use ggplot's geom_smooth() function. Like this:
ggplot(cw_trials_f, aes(x=trial_number, y=as.numeric(correct), col=difficulty)) +
facet_wrap(~session_id) +
geom_smooth() +
ggtitle("Stroop Task")
The problem arises when I try to use zoo library's rollmean function. Here is what I tried:
ggplot(cw_trials_f, aes(x=trial_number, y=rollmean(as.numeric(correct)-1, 50, na.pad=T, align="right"), col=difficulty)) +
facet_wrap(~session_id) +
geom_line() +
ggtitle("Stroop Task")
It seems that this doesn't partition the data according to difficulty first and then apply the rollmean function, but the other way around. Thus only one line is shown but in all three colors. How can I have rollmean be applied to each category of trials (Easy, Medium, Hard) separately?
Here is some sample data:
session_id test_number trial_number trial_duration rule concordant switch correct reaction_time difficulty
1 11674020 1 1 1872 word concordant yes yes 1357 Easy
2 11674020 1 2 2839 word discordant no yes 2324 Medium
3 11674020 1 3 1525 color discordant yes no 1025 Hard
4 11674020 1 4 1544 color discordant no no 1044 Medium
5 11674020 1 5 1451 word concordant yes yes 952 Easy
6 11674020 1 6 1252 color concordant yes yes 746 Easy
So, I ended up following @joran's suggestion (thanks) from the comment above and did the following:
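library(plyr)
library(zoo)
# Compute both moving averages separately within each session_id x difficulty group
cw_trials_f <- ddply(cw_trials_f, .(session_id, difficulty), .fun = function(X) {
  transform(X,
    movrt = rollmean(X$reaction_time, 50, fill = NA, align = "right"),
    movacc = rollmean(as.numeric(X$correct) - 1, 50, fill = NA, align = "right"))
})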
This adds two additional columns to the data.frame with the moving averages of accuracy and reaction time.
Then this works fine to plot them:
ggplot(cw_trials_f, aes(x=trial_number, y=movacc, col=difficulty)) + geom_line() + facet_wrap(~session_id) + ggtitle("Stroop Task")
There is one (minor) disadvantage to this compared to what I originally wanted to do: It makes it a bit tedious and slow to try out different lengths for the moving average function.
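One way to make trying different window lengths less tedious might be to wrap the grouped moving average in a small helper that takes the window size as an argument. The sketch below is base R only and swaps stats::filter() in for rollmean(); moving_avg_by_group and the toy data are hypothetical names and values, not from the original post.

```r
# Right-aligned moving average of x, computed separately within each group.
# groups: a data frame (or list) of grouping columns; k: window length.
moving_avg_by_group <- function(x, groups, k) {
  g <- interaction(groups, drop = TRUE)
  pieces <- lapply(split(x, g), function(v) {
    # Groups shorter than the window have no complete window at all
    if (length(v) < k) return(rep(NA_real_, length(v)))
    # sides = 1 averages the current value and the k - 1 values before it
    as.numeric(stats::filter(v, rep(1 / k, k), sides = 1))
  })
  unsplit(pieces, g)
}

# Toy data: two sessions, two difficulty levels, interleaved trials
toy <- data.frame(
  session_id    = rep(c(1, 2), each = 4),
  difficulty    = rep(c("Easy", "Hard"), times = 4),
  reaction_time = c(10, 100, 20, 200, 30, 300, 40, 400)
)

toy$movrt <- moving_avg_by_group(toy$reaction_time,
                                 toy[c("session_id", "difficulty")], k = 2)
```

Changing the window is then a one-argument change (k = 50 for the real data), and the resulting column can be plotted with geom_line() as before.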
