ggplot: Drawing separate y-lines when passing factor as argument - r

I want to create the following plot: The x-axis goes from 1 to 900 representing trial numbers. The y-axis shows three different lines with the moving average of reaction time. One line is shown for each difficulty level (Hard, Medium, Easy). Separate plots should be shown for each participant using facet_wrap.
Now this all works fine if I use ggplot's geom_smooth() function. Like this:
ggplot(cw_trials_f, aes(x=trial_number, y=as.numeric(correct), col=difficulty)) +
facet_wrap(~session_id) +
geom_smooth() +
ggtitle("Stroop Task")
The problem arises when I try to use zoo library's rollmean function. Here is what I tried:
ggplot(cw_trials_f, aes(x=trial_number, y=rollmean(as.numeric(correct)-1, 50, na.pad=T, align="right"), col=difficulty)) +
facet_wrap(~session_id) +
geom_line() +
ggtitle("Stroop Task")
It seems that this doesn't partition the data according to difficulty first and then apply the rollmean function, but the other way around. Thus only one line is shown but in all three colors. How can I have rollmean be applied to each category of trials (Easy, Medium, Hard) separately?
Here is some sample data:
session_id test_number trial_number trial_duration rule concordant switch correct reaction_time difficulty
1 11674020 1 1 1872 word concordant yes yes 1357 Easy
2 11674020 1 2 2839 word discordant no yes 2324 Medium
3 11674020 1 3 1525 color discordant yes no 1025 Hard
4 11674020 1 4 1544 color discordant no no 1044 Medium
5 11674020 1 5 1451 word concordant yes yes 952 Easy
6 11674020 1 6 1252 color concordant yes yes 746 Easy

So, I ended up following @joran's suggestion (thanks) from the comment above and did the following:
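library(plyr)  # ddply
library(zoo)   # rollmean
# assumes correct is a factor (no/yes), so as.numeric() - 1 gives 0/1
cw_trials_f <- ddply(cw_trials_f, .(session_id, difficulty), .fun = function(X) transform(X,
  movrt  = rollmean(X$reaction_time, 50, fill = NA, align = "right"),
  movacc = rollmean(as.numeric(X$correct) - 1, 50, fill = NA, align = "right")))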
This adds two additional columns to the data.frame with the moving averages of accuracy and reaction time.
Then this works fine to plot them:
ggplot(cw_trials_f, aes(x=trial_number, y=movacc, col=difficulty)) + geom_line() + facet_wrap(~session_id) + ggtitle("Stroop Task")
There is one (minor) disadvantage to this compared to what I originally wanted to do: It makes it a bit tedious and slow to try out different lengths for the moving average function.
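One way to make trying different window lengths less tedious is to wrap the grouped computation in a small function that takes the window width as an argument. The following is only a sketch under assumptions: it uses base R's stats::filter() in place of zoo's rollmean, the names moving_avg and add_movacc and the toy data are made up for illustration, and it assumes correct is already coded 0/1 and that rows are ordered by trial within each group.

```r
# Hypothetical helper: right-aligned moving average of width k.
# The first k - 1 values of each series are NA, like rollmean(..., fill = NA, align = "right").
moving_avg <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

# Toy data standing in for cw_trials_f (values are invented for illustration).
set.seed(1)
toy <- data.frame(
  session_id   = rep(c("a", "b"), each = 30),
  trial_number = rep(1:30, times = 2),
  difficulty   = rep(c("Easy", "Medium", "Hard"), times = 20),
  correct      = as.numeric(rbinom(60, 1, 0.7))
)

# Hypothetical wrapper: add a moving-accuracy column computed separately
# for every session_id x difficulty combination.
add_movacc <- function(df, k) {
  df$movacc <- ave(df$correct, df$session_id, df$difficulty,
                   FUN = function(x) moving_avg(x, k))
  df
}

toy5 <- add_movacc(toy, 5)
head(toy5)
```

The returned data frame can be passed straight to the ggplot call above in place of cw_trials_f, so changing the window becomes a one-argument change.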

Related

How can I create a legend in ggplot which assigns names and colors to columns and not to values within a column of a dataframe?

I have been looking up ideas to create legends in ggplot, yet all solutions only offer legends which divide the data of a single column in a dataframe in different groups by color and name with group = "columnname".
This is the head of the given dataframe:
ewmSlots ewmValues ewmValues2 ewmValues3
1        0.7785078 0.7785078  0
2        0.7198410 0.7491744  0
3        0.7333798 0.7412771  0
4        0.9102729 0.8257750  0
5        0.7243151 0.7750450  0
6        0.8706777 0.8228614  0
Now I want a legend that shows ewmValues, ewmValues2 and ewmValues3 in their respective names and colors.
To give a simple example, the other solutions I found deal with data like this:
time   sex
lunch  male
dinner female
dinner male
lunch  female
where the legend shows sex with one color per sex, which is obviously not the issue I want to tackle here.
What about just melting the data (let's call your above example a data frame named ewm)?
# With melt (from the reshape2 package)
library(reshape2)
library(ggplot2)
ggplot(melt(ewm, id.vars = "ewmSlots"), aes(ewmSlots, value, color=variable)) +
geom_line(size=1.4) + labs(color="")
If you are opposed to melting, the below gives the exact same thing:
# Without melt
ggplot(ewm, aes(ewmSlots)) +
geom_line(aes(y=ewmValues, color="ewmValues"), size=1.4) +
geom_line(aes(y=ewmValues2, color="ewmValues2"), size=1.4) +
geom_line(aes(y=ewmValues3, color="ewmValues3"), size=1.4) +
labs(color="", y="value")
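To make clearer what melting does to the data, here is a base R sketch of the same wide-to-long reshaping, using the six rows shown in the question. It is not the reshape2 implementation, only an illustration of the resulting layout; the names value_cols and long are made up.

```r
# The head of the data frame from the question.
ewm <- data.frame(
  ewmSlots   = 1:6,
  ewmValues  = c(0.7785078, 0.7198410, 0.7333798, 0.9102729, 0.7243151, 0.8706777),
  ewmValues2 = c(0.7785078, 0.7491744, 0.7412771, 0.8257750, 0.7750450, 0.8228614),
  ewmValues3 = 0
)

# Stack the three value columns on top of each other and record,
# for every row, which column the value came from.
value_cols <- c("ewmValues", "ewmValues2", "ewmValues3")
long <- data.frame(
  ewmSlots = rep(ewm$ewmSlots, times = length(value_cols)),
  variable = factor(rep(value_cols, each = nrow(ewm)), levels = value_cols),
  value    = unlist(ewm[value_cols], use.names = FALSE)
)

head(long, 3)
```

Because the column name is now a value in the variable column, mapping color=variable gives one legend entry per original column.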

Calculating a ratio in a ggplot2 graph while retaining faceting variables

So I don't think this has been asked before, but SO search might just be getting confused by combinations of 'ratio' and 'faceting'. I'm trying to calculate a productivity ratio: the number of widgets produced per number of workers on a given day or period. I've got my data structured in a single data frame, with each widget produced each day by each worker in its own record, and other workers that worked that day but didn't produce a widget also in their own record, along with various metadata.
Something like this:
widget_ind employee_active_ind employee_id day      product_type employee_bu
1          1                   123         6/1/2021 pc           americas
0          1                   234         6/1/2021 mac          emea
0          1                   345         6/1/2021 mac          apac
1          1                   444         6/1/2021 mac          americas
1          1                   333         6/1/2021 pc           emea
0          1                   356         6/1/2021 pc           americas
I'm trying to find the ratio of widget_inds to employee_active_inds, over time, while retaining the metadata, so that I can filter or facet within the ggplot2 code, something like:
plot <- ggplot(data = df[df$employee_bu == 'americas',], aes(y = (widget_ind/employee_active_ind), x = day)) +
geom_bar(stat = 'identity', position = 'stack') +
facet_wrap(product_type ~ ., scales = 'fixed') # change these to look at different cuts of metadata
print(plot)
Retaining the metadata is appealing rather than making individual dataframes summarizing by the various combinations, but the results with no faceting aren't even correct (e.g. the ggplot is showing a barchart with a height of ~18 widgets per person; creating a summarized dataframe with no faceting is showing a ratio of less than 1 widget per person).
I'm currently getting this warning when I run the ggplot code:
Warning message:
Removed 9865 rows containing missing values (geom_bar).
Which doesn't make sense since in my data frame both widget_ind and employee_active_ind have no NA values, so calculating the ratio of the two should always work?
Edit 1: Clarifying employee_active_ind: I should not have any employee_active_ind = 0, but my current joins produce them (and it passes the reality sniff test; the process we are trying to model allows you to do work on day 1 that results in a widget on day 2, where you may not do any work, so wouldn't be counted as active on that day). I think I need to re-think my data structure. Even so, I'm assuming here that ggplot2 is acting like it would for a given bar chart; it's taking the number in each widget_ind record, for a given day (along with any facets and filters), and is then summing that set and displaying the result. The wrinkle I'm adding is dividing by the number of active employees on that day, and while you can have someone out on a given day, you'd never have everyone out. But that isn't what ggplot is doing, is it?
I agree with MrFlick, especially on the question concerning employee_active_ind of 0. If you have such rows, the row-wise division is a division by 0, which in R gives Inf (1/0) or NaN (0/0), and the NaN values could be what ggplot reports as missing.
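A small base R sketch may help show the difference between dividing row by row and dividing the sums. It uses the six sample rows from the question; the grouping by day and product_type is just one possible cut, and the name sums is made up. This is an illustration of the arithmetic, not a claim about what the asker's full data contains.

```r
# Division by zero in R does not raise an error:
c(1 / 0, 0 / 0)   # Inf and NaN; is.na(NaN) is TRUE

# The six sample rows from the question.
df <- data.frame(
  widget_ind          = c(1, 0, 0, 1, 1, 0),
  employee_active_ind = c(1, 1, 1, 1, 1, 1),
  employee_id         = c(123, 234, 345, 444, 333, 356),
  day                 = "6/1/2021",
  product_type        = c("pc", "mac", "mac", "mac", "pc", "pc"),
  employee_bu         = c("americas", "emea", "apac", "americas", "emea", "americas")
)

# Sum widgets and active employees per day and product type first,
# then divide the sums to get widgets per active employee.
sums <- aggregate(cbind(widget_ind, employee_active_ind) ~ day + product_type,
                  data = df, FUN = sum)
sums$ratio <- sums$widget_ind / sums$employee_active_ind
sums
```

A stacked bar of the per-row ratios adds up widget_ind / employee_active_ind over all rows instead, which would explain a bar height well above the summarized ratio.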

geom_tile adds a third fill colour not in the data

It's difficult for me to create a reproducible example of this as the issue only seems to show as the size of the data frame goes up to too large to paste here. I hope someone will bear with me and help here. I'm sure I'm doing something stupid but reading the help and searching is failing (perhaps on the "stupid" issue.)
I have a data frame of 2,319 rows and three variables: clientID, month and nSlots where clientID is character, month is 1:12 and nSlots is 1:2.
> head(tmpDF2)
month clientID2 nSlots
21 1 8 1
30 2 8 1
31 4 8 1
28 5 8 1
25 6 8 1
24 7 8 1
Here's table(tmpDF2$nSlots)
> table(tmpDF2$nSlots, useNA = "always")
   1    2 <NA>
1844   15    0
I'm trying to use ggplot and geom_tile to plot the attendance of clients and I expect two colours for the tiles depending on the two values of nSlots, but when the size of the data frame goes up, I am getting a third colour. Here is the plot.
OK. Well I gather you can't see that so perhaps I should stop here! Aha, or maybe you can click through to that link. I hope so!
Here's the code then for what it's worth.
ggplot(data=tmpDF2,
aes(x=month,y=clientID2,fill=nSlots)) +
geom_tile() +
# geom_text(aes(label=nSlots)) +
theme(panel.background = element_blank()) +
theme(axis.text.x=element_text(angle=90,hjust=1)) +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.line=element_line()) +
ylab("clients")
The bizarre thing (to me) is that when I keep the number of rows small, the plot seems to work fine but as the number goes up, there's a point, and I've failed utterly to find if one row in the data or value of nrow(tmpDF2) triggers it, when this third colour, a paler value than the one in the legend, appears.
TIA,
Chris

R column dataframe names number

I have a dataframe like this
geo 2001 2002
Spain 21 23
Germany 34 50
Italy 57 89
France 19 13
As the names of the 2nd and 3rd columns are treated as numbers, I'm not able to get a bar chart with ggplot2. Is there any way to have the column names treated as text?
data
pivot_dat <- read.table(text="geo 2001 2002
Spain 21 23
Germany 34 50
Italy 57 89
France 19 13", stringsAsFactors=FALSE, header=TRUE)
pivot_dat <- setNames(pivot_dat,c("geo","2001","2002"))
Here's how to do it :
library(ggplot2)
ggplot(pivot_dat, aes(x = geo, y = `2002`)) + geom_col()+ coord_flip()
By using backticks instead of single or double quotes, you make sure you pass a name to the function and not a string.
If you use quotes, ggplot will convert this character value to a factor and recycle it, so all bars will have the same length of 1 and a label of value "2002".
Note 1 :
You might want to learn the difference between geom_col and geom_bar :
?ggplot2::geom_bar
In short geom_col is geom_bar with stat = "identity", which is what you want here since you want to show on your plot the raw values from your table.
Note 2:
aes_string can be used to pass strings instead of names, but here a bare "2002" doesn't work because it is evaluated as a number, so the backticks are needed inside the string:
ggplot(pivot_dat, aes_string(x = "geo", y = "2002")) +
geom_col()+ coord_flip() # incorrect output
ggplot(pivot_dat, aes_string(x = "geo", y = "`2002`")) +
geom_col()+ coord_flip() # correct output
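As a possible alternative when the column name is held in a string, ggplot2 (version 3.0.0 and later, as far as I know) supports the .data pronoun inside aes(). The sketch below rebuilds the sample data directly; the variable year_col is hypothetical and only there to show the lookup by string.

```r
library(ggplot2)

# Same sample data as above; check.names = FALSE keeps "2001"/"2002" as column names.
pivot_dat <- data.frame(
  geo    = c("Spain", "Germany", "Italy", "France"),
  `2001` = c(21, 34, 57, 19),
  `2002` = c(23, 50, 89, 13),
  check.names = FALSE
)

year_col <- "2002"  # hypothetical variable holding the column name as a string

# .data[[year_col]] looks the column up by name, so no backticks are needed.
p <- ggplot(pivot_dat, aes(x = geo, y = .data[[year_col]])) +
  geom_col() +
  coord_flip()
print(p)
```

This avoids pasting backticks into a string and makes it easy to switch the plotted year by changing one variable.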
Without an example to see exactly what your problem is, and what you want, it is hard to give you a perfect answer. But here's the thing.
You can do a geom_bar with numeric data. There are 3 possible ways I see that you could have problems (but I may not be able to guess every way).
First, let's set up R for plotting.
library(readr)
library(ggplot2)
test <- read_csv("geo,2001,2002
Spain,21,23
Germany,34,50
Italy,57,89
France,19,13")
Next, let's make the first mistake: incorrectly referring to the column name. In the next example I tell ggplot to make a bar of the number 2001, not the column 2001! R has to decide whether we mean the number 2001 or the object named 2001, and an unquoted 2001 is always read as the number.
ggplot(test) +
geom_bar(aes(x=2001))
OK, that just gives you a bar at 2001, because you gave it a single number instead of a column. Let's fix that. Use backticks (``) to identify the column name 2001 instead of the number 2001.
ggplot(test) +
geom_bar(aes(x=`2001`))
This creates a perfectly workable bar chart. But maybe you don't want the spaces? That's the only possible reason you would use text instead of a number. But you want text so I'm going to show you how to use as.factor to do something similar (and more powerful).
ggplot(test) +
geom_bar(aes(x=as.factor(`2001`)))

Adding points to plot using ggplot2

Here are the first 9 rows (out of 54) and the first 8 columns (out of 1003) of my dataset:
stream n rates means 1 2 3 4
1 Brooks 3 3.0 0.9629152 0.42707006 1.9353659 1.4333884 1.8566225
2 Siouxon 3 3.0 0.5831929 0.90503736 0.2838483 0.2838483 1.0023212
3 Speelyai 3 3.0 0.6199235 0.08554021 0.7359903 0.4841935 0.7359903
4 Brooks 4 7.5 0.9722707 1.43338843 1.8566225 0.0000000 1.3242210
5 Siouxon 4 7.5 0.5865031 0.50574543 0.5057454 0.2838483 0.4756304
6 Speelyai 4 7.5 0.6118634 0.32252396 0.4343109 0.6653132 2.2294652
7 Brooks 5 10.0 0.9637475 0.88984211 1.8566225 0.7741612 1.3242210
8 Siouxon 5 10.0 0.5804420 0.47501800 0.7383634 0.5482181 0.6430847
9 Speelyai 5 10.0 0.5959238 0.15079491 0.2615963 0.4738504 0.0000000
Here is a simple plot I have made using the values found in the means column for all rows with stream name Speelyai (18).
The means column is calculated by taking the mean for the entire row. Each column represents 1 simulation. So, the mean column is the mean of 1000 simulations. I would like to plot the actual simulation values on the plot as well. I think it would be informative to not only have the mean plotted (with a line) but also show the "raw" data (simulations) as points. I see that I can use the geom_point(), but am not sure how to get all the points for any row that has the stream name "Speelyai"
THANKS
As you can see, the scales are much different, which I would expect, given these points are results from simulations, or resampling the original data. But how could I overlay these points on my original image in a way that still preserves the visual content? In this image the line looks almost flat, but in my original image we can see that it fluctuates quite a bit, just on a small scale...
Agree with @NickKennedy that it's a good idea to reshape your data from wide to long:
library(reshape)
x2<-melt(x,id=c("stream","n","rates"))
x2<-x2[which(x2$variable!="means"),] # this eliminates the entries for means
Now it's time to re-calculate the means:
library(data.table)
setDT(x2)
setkey(x2,"stream")
means.sp<-x2["Speelyai",.(mean.stream=mean(value)),by=rates]
So now you can plot:
library(ggplot2)
p<-ggplot(means.sp,aes(rates,mean.stream))+geom_line()
Which is exactly what you had, so now let's add the points:
p<-p+geom_point(data=x2[x2$stream=="Speelyai",],aes(rates,value))
Notice that in the call to geom_point you need to specifically declare data= as you are working with a different dataset to the one you specified in the call to ggplot.
========== EDIT TO ADD =============
Replying to your comments, and borrowing from the answer @akrun gave you here, you'll need to add the calculation of the error:
df2 <- data.frame(stream=c('Brooks', 'Siouxon', 'Speelyai'),
value=c(0.944062036, 0.585852702, 0.583984402), stringsAsFactors=FALSE)
x2$error <- x2$value-df2$value[match(x2$stream, df2$stream)]
And then change the call to geom_point:
geom_point(data=x2[x2$stream=="Speelyai",],aes(rates,error))
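The error calculation above relies on match() to look up each row's reference value by stream name. A minimal base R sketch of that lookup follows; the df2 values are the ones given in the answer, while the four-row x2 here is a cut-down stand-in built from the first sample values in the question, and the name idx is made up.

```r
# Cut-down stand-in for the long data: one simulated value per row.
x2 <- data.frame(
  stream = c("Brooks", "Siouxon", "Speelyai", "Brooks"),
  value  = c(0.42707006, 0.90503736, 0.08554021, 1.43338843),
  stringsAsFactors = FALSE
)

# Reference value per stream, as given in the answer.
df2 <- data.frame(
  stream = c("Brooks", "Siouxon", "Speelyai"),
  value  = c(0.944062036, 0.585852702, 0.583984402),
  stringsAsFactors = FALSE
)

# match() returns, for every row of x2, the row of df2 with the same stream.
idx <- match(x2$stream, df2$stream)

# Subtract the matched reference value from each simulated value.
x2$error <- x2$value - df2$value[idx]
x2
```

Indexing df2$value with idx repeats each reference value as often as its stream occurs in x2, so the subtraction lines up row by row without a merge.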
I would suggest reformatting your data in a long format rather than wide. For example:
library("tidyr")
library("ggplot2")
my_data_tidy <- gather(my_data, column, value, -c(stream, n, rates, means))
ggplot(subset(my_data_tidy, stream == "Speelyai"), aes(rates, value)) +
geom_point() +
stat_summary(fun = "mean", geom = "line") # fun replaced fun.y in ggplot2 3.3.0; older versions need fun.y
Note this will also recalculate the means from your data. If you wanted to use your existing means, you could do:
ggplot(subset(my_data_tidy, stream == "Speelyai"), aes(rates, value)) +
geom_point() + geom_line(aes(rates, means), data = subset(my_data, stream == "Speelyai"))
