Adding points to plot using ggplot2 - r

Here are the first 9 rows (out of 54) and the first 8 columns (out of 1003) of my dataset:
stream n rates means 1 2 3 4
1 Brooks 3 3.0 0.9629152 0.42707006 1.9353659 1.4333884 1.8566225
2 Siouxon 3 3.0 0.5831929 0.90503736 0.2838483 0.2838483 1.0023212
3 Speelyai 3 3.0 0.6199235 0.08554021 0.7359903 0.4841935 0.7359903
4 Brooks 4 7.5 0.9722707 1.43338843 1.8566225 0.0000000 1.3242210
5 Siouxon 4 7.5 0.5865031 0.50574543 0.5057454 0.2838483 0.4756304
6 Speelyai 4 7.5 0.6118634 0.32252396 0.4343109 0.6653132 2.2294652
7 Brooks 5 10.0 0.9637475 0.88984211 1.8566225 0.7741612 1.3242210
8 Siouxon 5 10.0 0.5804420 0.47501800 0.7383634 0.5482181 0.6430847
9 Speelyai 5 10.0 0.5959238 0.15079491 0.2615963 0.4738504 0.0000000
Here is a simple plot I have made using the values found in the means column for all rows with stream name Speelyai (18 rows).
The means column is calculated by taking the mean of the entire row. Each column represents 1 simulation, so the means column is the mean of 1000 simulations. I would like to plot the actual simulation values on the plot as well. I think it would be informative to have not only the mean plotted (with a line) but also the "raw" data (simulations) shown as points. I see that I can use geom_point(), but I am not sure how to get all the points for every row that has the stream name "Speelyai".
THANKS
As you can see, the scales are very different, which I would expect given that these points are results from simulations, i.e. resampling of the original data. But how could I overlay these points on my original image in a way that still preserves the visual content? In this image the line looks almost flat, but in my original image we can see that it fluctuates quite a bit, just on a small scale...

I agree with @NickKennedy that it's a good idea to reshape your data from wide to long:
library(reshape)
x2<-melt(x,id=c("stream","n","rates"))
x2<-x2[which(x2$variable!="means"),] # this eliminates the entries for means
Now it's time to re-calculate the means:
library(data.table)
setDT(x2)
setkey(x2,"stream")
means.sp<-x2["Speelyai",.(mean.stream=mean(value)),by=rates]
So now you can plot:
library(ggplot2)
p<-ggplot(means.sp,aes(rates,mean.stream))+geom_line()
Which is exactly what you had, so now let's add the points:
p<-p+geom_point(data=x2[x2$stream=="Speelyai",],aes(rates,value))
Notice that in the call to geom_point you need to specifically declare data= as you are working with a different dataset to the one you specified in the call to ggplot.
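A minimal self-contained sketch of the same idea, using made-up data rather than the dataset from the question: the line layer draws from a summary table, while the point layer is given its own data= argument. The names sims and means_by_rate and the simulated values are assumptions for illustration only.

```r
library(ggplot2)

set.seed(1)
# Hypothetical long-format data: 20 simulated values at each of 3 rates
sims <- data.frame(
  rates = rep(c(3, 7.5, 10), each = 20),
  value = runif(60, 0, 2)
)

# One mean per rate, computed from the long data
means_by_rate <- aggregate(value ~ rates, data = sims, FUN = mean)

# The line uses the plot-level data; the points override it with data = sims
p <- ggplot(means_by_rate, aes(rates, value)) +
  geom_line() +
  geom_point(data = sims, alpha = 0.4)
```

Both layers share the aes(rates, value) mapping from ggplot(), so the point layer only needs a different data frame with the same column names.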
========== EDIT TO ADD =============
Replying to your comments, and borrowing from the answer @akrun gave you, you'll need to add the calculation of the error:
df2 <- data.frame(stream=c('Brooks', 'Siouxon', 'Speelyai'),
value=c(0.944062036, 0.585852702, 0.583984402), stringsAsFactors=FALSE)
x2$error <- x2$value-df2$value[match(x2$stream, df2$stream)]
And then change the call to geom_point:
geom_point(data=x2[x2$stream=="Speelyai",],aes(rates,error))

I would suggest reformatting your data in a long format rather than wide. For example:
library("tidyr")
library("ggplot2")
my_data_tidy <- gather(my_data, column, value, -c(stream, n, rates, means))
ggplot(subset(my_data_tidy, stream == "Speelyai"), aes(rates, value)) +
geom_point() +
stat_summary(fun = "mean", geom = "line") # this argument was called fun.y before ggplot2 3.3.0
Note this will also recalculate the means from your data. If you wanted to use your existing means, you could do:
ggplot(subset(my_data_tidy, stream == "Speelyai"), aes(rates, value)) +
geom_point() + geom_line(aes(rates, means), data = subset(my_data, stream == "Speelyai"))

Related

Trying to make a graph with multiple lines using ggplot

I am new to R and I have been trying to make a line graph with multiple lines. I tried the 'plot' function but didn't get the desired result, so I am now trying ggplot.
I keep running into error:
Aesthetics must be either length 1 or the same as the data (100): x
and there's obviously no graph output.
Any help is much appreciated
I have rearranged my data, before it had 4 separate columns for different consumer types but now I have merged them and made a column that identifies each consumer.
This is the part of the code that generates the error
ggplot(data=consumers,aes(x=scenarios,y=unitary.bill)) +
geom_line(aes(color=consumer.type,group=consumer.type))
my data looks like this:
scenario unitary.bill consumer.type
1 1 0.076536835 net.cons
2 2 0.075835361 net.cons
3 3 0.076696548 net.cons
4 4 0.076431602 net.cons
5 5 0.076816135 net.cons
.........
27 2 0.076794287 smart.cons
28 3 0.075555555 smart.cons
29 4 0.077126955 smart.cons
30 5 0.077925161 smart.cons
.......
100 25 0.049247761 smart.pros
I expect a line graph with four lines in different colors (one per consumer type) and the scenarios on the x-axis.
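One common cause of this error is an aesthetic that refers to a name which is not a column of the data; ggplot2 then looks the name up in the calling environment and may find an unrelated object of a different length. In the printed data the column is called scenario, while the call maps x=scenarios. Whether that is the cause here is an assumption. The sketch below uses a made-up data frame with the printed column names (the fourth consumer type name is invented) to show a mapping that matches the header.

```r
library(ggplot2)

set.seed(1)
# Made-up stand-in for the asker's data: 4 consumer types x 25 scenarios.
# "other.type" is a hypothetical name; only three types appear in the question.
consumers <- data.frame(
  scenario      = rep(1:25, times = 4),
  unitary.bill  = runif(100, 0.04, 0.08),
  consumer.type = rep(c("net.cons", "smart.cons", "smart.pros", "other.type"),
                      each = 25)
)

# Check that every name used in aes() really is a column of the data
stopifnot(all(c("scenario", "unitary.bill", "consumer.type") %in% names(consumers)))

p <- ggplot(consumers, aes(x = scenario, y = unitary.bill,
                           colour = consumer.type, group = consumer.type)) +
  geom_line()
```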
Thanks for all the help from Camille and Infominer. My code now looks like this (I added some more details)
ggplot(data=consumers,aes(x = scenarios,y = unitary.bill, colour= SMCs)) +
geom_line(size=1) + scale_colour_manual(values=c("indianred1", "yellowgreen","lightpink","springgreen4"))+
ggtitle(" Unitary bill for each SMC type at the end of the scenario runs")+
scale_x_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25))
and the graph looks as I wanted it to. However, I would like to put some more distance between the title and the graph to make it prettier.
you can view the graph here
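For more room between the title and the panel, one option is to give the title a bottom margin through theme(). This is a minimal sketch on made-up data; the 20 pt value is an arbitrary choice to adjust by eye.

```r
library(ggplot2)

set.seed(1)
# Made-up data, only there so the plot can be drawn
df <- data.frame(x = 1:25, y = cumsum(rnorm(25)))

p <- ggplot(df, aes(x, y)) +
  geom_line() +
  ggtitle("Unitary bill for each SMC type at the end of the scenario runs") +
  scale_x_continuous(breaks = 1:25) +
  # margin(b = ...) adds space below the title, in points by default
  theme(plot.title = element_text(margin = margin(b = 20)))
```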

geom_tile adds a third fill colour not in the data

It's difficult for me to create a reproducible example of this, as the issue only seems to show up once the data frame is too large to paste here. I hope someone will bear with me and help. I'm sure I'm doing something stupid, but reading the help and searching have failed (perhaps because of the "stupid" issue).
I have a data frame of 2,319 rows and three variables: clientID, month and nSlots where clientID is character, month is 1:12 and nSlots is 1:2.
> head(tmpDF2)
month clientID2 nSlots
21 1 8 1
30 2 8 1
31 4 8 1
28 5 8 1
25 6 8 1
24 7 8 1
Here's table(tmpDF2$nSlots)
> table(tmpDF2$nSlots, useNA = "always")
1 2 <NA>
1844 15 0
I'm trying to use ggplot and geom_tile to plot the attendance of clients, and I expect two colours for the tiles, one for each of the two values of nSlots, but when the size of the data frame goes up I get a third colour. Here is the plot.
OK. Well I gather you can't see that so perhaps I should stop here! Aha, or maybe you can click through to that link. I hope so!
Here's the code then for what it's worth.
ggplot(data=tmpDF2,
aes(x=month,y=clientID2,fill=nSlots)) +
geom_tile() +
# geom_text(aes(label=nSlots)) +
theme(panel.background = element_blank()) +
theme(axis.text.x=element_text(angle=90,hjust=1)) +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.line=element_line()) +
ylab("clients")
The bizarre thing (to me) is that when I keep the number of rows small, the plot seems to work fine but as the number goes up, there's a point, and I've failed utterly to find if one row in the data or value of nrow(tmpDF2) triggers it, when this third colour, a paler value than the one in the legend, appears.
TIA,
Chris

Using geom_bar() to stack values that add up?

I have a data frame in R which looks like:
Month numFlights numDelays onTime
1 2000-01-01 7520584 1299743 6220841
2 2000-02-01 6949127 1223397 5725730
3 2000-03-01 7808080 1390796 6417284
4 2000-04-01 7534239 1178425 6355814
5 2000-05-01 7720013 1236135 6483878
6 2000-06-01 7727349 1615408 6111941
7 2000-07-01 8000680 1652590 6348090
8 2000-08-01 7990440 1481498 6508942
9 2000-09-01 6811541 875381 5936160
10 2000-10-01 7026150 1046749 5979401
11 2000-11-01 6689783 987175 5702608
12 2000-12-01 6895454 1535196 5360258
What I'm looking to do is create a bar chart where, for each month (on the x axis), the bar reaches the number of delayed flights + the number of flights on time. I tried figuring out how to do that using the example
qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(gear))
but my data isn't formatted the same way as mtcars. And I can't just use an alpha value, because I want to eventually add cancelled flights.
I know that some questions are very similar to this one, but not similar enough for me to work it out (this question is almost the same but I don't know how to manipulate the facet_grid yet.) Any ideas?
Thanks!
You can do this for example:
library(reshape2)
dat.m <- melt(dat,id.vars='Month',measure.vars=c('numDelays','onTime'))
library(ggplot2)
library(scales)
ggplot(dat.m) +
geom_bar(aes(Month,value,fill=variable),stat="identity")+ # stat="identity" uses value as the bar height; stacking is the default
scale_y_continuous(labels = comma) +
coord_flip() +
theme_bw()

ggplot: Drawing separate y-lines when passing factor as argument

I want to create the following plot: The x-axis goes from 1 to 900 representing trial numbers. The y-axis shows three different lines with the moving average of reaction time. One line is shown for each difficulty level (Hard, Medium, Easy). Separate plots should be shown for each participant using facet_wrap.
Now this all works fine if I use ggplot's geom_smooth() function. Like this:
ggplot(cw_trials_f, aes(x=trial_number, y=as.numeric(correct), col=difficulty)) +
facet_wrap(~session_id) +
geom_smooth() +
ggtitle("Stroop Task")
The problem arises when I try to use zoo library's rollmean function. Here is what I tried:
ggplot(cw_trials_f, aes(x=trial_number, y=rollmean(as.numeric(correct)-1, 50, na.pad=T, align="right"), col=difficulty)) +
facet_wrap(~session_id) +
geom_line() +
ggtitle("Stroop Task")
It seems that this doesn't partition the data according to difficulty first and then apply the rollmean function, but the other way around. Thus only one line is shown but in all three colors. How can I have rollmean be applied to each category of trials (Easy, Medium, Hard) separately?
Here is some sample data:
session_id test_number trial_number trial_duration rule concordant switch correct reaction_time difficulty
1 11674020 1 1 1872 word concordant yes yes 1357 Easy
2 11674020 1 2 2839 word discordant no yes 2324 Medium
3 11674020 1 3 1525 color discordant yes no 1025 Hard
4 11674020 1 4 1544 color discordant no no 1044 Medium
5 11674020 1 5 1451 word concordant yes yes 952 Easy
6 11674020 1 6 1252 color concordant yes yes 746 Easy
So, I ended up following @joran's suggestion (thanks) from the comment above and did the following:
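library(plyr) # ddply()
library(zoo) # rollmean()
# Compute the moving averages separately within each session and difficulty level
cw_trials_f <- ddply(cw_trials_f, .(session_id, difficulty), .fun = function(X)
transform(X,
movrt = rollmean(X$reaction_time, 50, na.pad=T, align="right"),
movacc = rollmean(as.numeric(X$correct)-1, 50, na.pad=T, align="right")))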
This adds two additional columns to the data.frame with the moving averages of accuracy and reaction time.
Then this works fine to plot them:
ggplot(cw_trials_f, aes(x=trial_number, y=movacc, col=difficulty)) + geom_line() + facet_wrap(~session_id) + ggtitle("Stroop Task")
There is one (minor) disadvantage to this compared to what I originally wanted to do: It makes it a bit tedious and slow to try out different lengths for the moving average function.
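One way to make trying different window lengths less tedious is to wrap the grouped computation in a small function that takes the window width as an argument. The sketch below is an illustration under assumptions: add_moving_avg and roll_mean_right are hypothetical helpers, and roll_mean_right is a base R stand-in for a right-aligned rolling mean padded with NA (it does not handle NA values in the input).

```r
# Right-aligned moving average in base R, padded with NA at the start
roll_mean_right <- function(v, k) {
  n <- length(v)
  if (n < k) return(rep(NA_real_, n))
  cs <- cumsum(c(0, v))
  c(rep(NA_real_, k - 1), (cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)]) / k)
}

# Hypothetical helper: add a moving average of column `col` with window k,
# computed separately for each session_id and difficulty level
add_moving_avg <- function(df, col, k, out = "mov") {
  df[[out]] <- ave(df[[col]], df$session_id, df$difficulty,
                   FUN = function(v) roll_mean_right(v, k))
  df
}

# Tiny made-up example: one session, two interleaved difficulty levels
toy <- data.frame(
  session_id    = 1,
  trial_number  = 1:8,
  difficulty    = rep(c("Easy", "Hard"), 4),
  reaction_time = c(10, 100, 20, 200, 30, 300, 40, 400)
)
toy2 <- add_moving_avg(toy, "reaction_time", k = 2, out = "movrt")
```

Changing the window is then a matter of calling add_moving_avg() again with a different k before re-plotting, since the rows keep their original order.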

How to indicate factors in ggplot with horizontal line and Text

My data looks like this example:
dataExample<-data.frame(Time=seq(1:10),
Data1=runif(10,5.3,7.5),
Data2=runif(10,4.3,6.5),
Application=c("Substance1","Substance1","Substance1",
"Substance1","Substance2","Substance2","Substance2",
"Substance2","Substance1","Substance1"))
dataExample
Time Data1 Data2 Application
1 1 6.511573 5.385265 Substance1
2 2 5.870173 4.512775 Substance1
3 3 6.822132 5.109790 Substance1
4 4 5.940528 6.281412 Substance1
5 5 7.269394 4.680380 Substance2
6 6 6.122454 6.015899 Substance2
7 7 5.660429 6.113362 Substance2
8 8 6.649749 4.344978 Substance2
9 9 7.252656 4.764667 Substance1
10 10 7.204440 5.835590 Substance1
I would like to indicate at which time any Substance was applied that is different from dataExample$Application[1].
Here is how I currently produce this plot, but I assume that there is a much easier way to do it with ggplot.
library(reshape2)
library(ggplot2)
plotDataExample<-function(DataFrame){
longDF<-melt(DataFrame,id.vars=c("Time","Application"))
p=ggplot(longDF,aes(Time,value,color=variable))+geom_line()
maxValue=max(longDF$value)
minValue=min(longDF$value)
yAppLine=maxValue+((maxValue-minValue)/20)
xAppLine1=min(longDF$Time[which(longDF$Application!=longDF$Application[1])])
xAppLine2=max(longDF$Time[which(longDF$Application!=longDF$Application[1])])
lineData=data.frame(x=c(xAppLine1,xAppLine2),y=c(yAppLine,yAppLine))
xAppText=xAppLine1+(xAppLine2-xAppLine1)/2
yAppText=yAppLine+((maxValue-minValue)/20)
appText=longDF$Application[which(longDF$Application!=longDF$Application[1])[1]]
textData=data.frame(x=xAppText,y=yAppText,appText=appText)
p=p+geom_line(data=lineData,aes(x=x, y=y),color="black")
p=p+geom_text(data=textData,aes(x=x,y=y,label = appText),color="black")
return(p)
}
plotDataExample(dataExample)
Question:
Do you know a better way to get a similar result, so that I could indicate more than one factor (e.g. Substance3, Substance4 ...)?
First, I made new sample data with more than 2 levels and with Substance2 occurring in two separate runs.
dataExample<-data.frame(Time=seq(1:10),
Data1=runif(10,5.3,7.5),
Data2=runif(10,4.3,6.5),
Application=c("Substance1","Substance1","Substance2",
"Substance2","Substance1","Substance1","Substance2",
"Substance2","Substance3","Substance3"))
I didn't wrap this in a function, so that each step is visible.
Add a new column, groups, to the original data frame. It contains an identifier for each run of Applications: whenever the substance changes, a new group starts.
dataExample$groups<-c(cumsum(c(1,tail(dataExample$Application,n=-1)!=head(dataExample$Application,n=-1))))
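The grouping line can be unpacked step by step. This is a small sketch on a made-up character vector (app and the shortened labels S1 to S3 are illustrative names), with an equivalent base R version based on rle() for comparison.

```r
# Made-up vector of applications, with S2 occurring in two separate runs
app <- c("S1", "S1", "S2", "S2", "S1", "S1", "S2", "S2", "S3", "S3")

# TRUE wherever an element differs from the one before it
changed <- tail(app, -1) != head(app, -1)

# Start at 1 and add 1 at every change, giving one id per consecutive run
groups <- cumsum(c(1, changed))

# Equivalent alternative using run-length encoding
r <- rle(app)
groups_rle <- rep(seq_along(r$lengths), r$lengths)
```

Because the id increases at every change, the two separate runs of S2 get different ids, which is what lets each run receive its own segment and label later on.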
Convert to long format data for lines of data.
longDF<-melt(dataExample,id.vars=c("Time","Application","groups"))
Calculate the positions for the substance labels with ddply() from the plyr package. subset() keeps only the rows whose Application differs from the first Application value; the data are then grouped by Application and groups. For each group this gives the start, middle and end positions on the x axis, with the y position set to the maximum value + 0.3.
library(plyr)
lineData<-ddply(subset(dataExample,Application != dataExample$Application[1]),
.(Application,groups),
summarise,minT=min(Time),maxT=max(Time),
meanT=mean(Time),ypos=max(longDF$value)+0.3)
Now plot longDF data with ggplot() and geom_line() and add segments above plot with geom_segment() and text with annotate() using new data frame lineData.
ggplot(longDF,aes(Time,value,color=variable))+geom_line()+
geom_segment(data=lineData,aes(x=minT,xend=maxT,y=ypos,yend=ypos),inherit.aes=FALSE)+
annotate("text",x=lineData$meanT,y=lineData$ypos+0.1,label=lineData$Application)
