geom_tile adds a third fill colour not in the data - r

It's difficult for me to create a reproducible example of this as the issue only seems to show as the size of the data frame goes up to too large to paste here. I hope someone will bear with me and help here. I'm sure I'm doing something stupid but reading the help and searching is failing (perhaps on the "stupid" issue.)
I have a data frame of 2,319 rows and three variables: clientID, month and nSlots where clientID is character, month is 1:12 and nSlots is 1:2.
> head(tmpDF2)
month clientID2 nSlots
21 1 8 1
30 2 8 1
31 4 8 1
28 5 8 1
25 6 8 1
24 7 8 1
Here's table(tmpDF2$nSlots)
> table(tmpDF2$nSlots, useNA = "always")
1 2 <NA>
1844 15 0
I'm trying to use ggplot and geom_tile to plot the attendance of clients and I expect two colours for the tiles depending on the two values of nSlots but when the size of the data frame goes up, I am getting a third colour. Here is is the plot.
OK. Well I gather you can't see that so perhaps I should stop here! Aha, or maybe you can click through to that link. I hope so!
Here's the code then for what it's worth.
ggplot(dat=tmpDF2,
aes(x=month,y=clientID2,fill=nSlots)) +
geom_tile() +
# geom_text(aes(label=nSlots)) +
theme(panel.background = element_blank()) +
theme(axis.text.x=element_text(angle=90,hjust=1)) +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.line=element_line()) +
ylab("clients")
The bizarre thing (to me) is that when I keep the number of rows small, the plot seems to work fine but as the number goes up, there's a point, and I've failed utterly to find if one row in the data or value of nrow(tmpDF2) triggers it, when this third colour, a paler value than the one in the legend, appears.
TIA,
Chris

Related

Calculating a ratio in a ggplot2 graph while retaining faceting variables

So I don't think this has been asked before, but SO search might just be getting confused by combinations of 'ratio' and 'faceting'. I'm trying to calculate a productivity ratio; number of widgets produced for number of workers on a given day or period. I've got my data structured in a single data frame, with each widget produced each day by each worker in it's own record, and other workers that worked that day but didn't produce a widget also in their own record, along with various metadata.
Something like this:
widget_ind
employee_active_ind
employee_id
day
product_type
employee_bu
1
1
123
6/1/2021
pc
americas
0
1
234
6/1/2021
mac
emea
0
1
345
6/1/2021
mac
apac
1
1
444
6/1/2021
mac
americas
1
1
333
6/1/2021
pc
emea
0
1
356
6/1/2021
pc
americas
I'm trying to find the ratio of widget_inds to employee_active_inds, over time, while retaining the metadata, so that i can filter or facet within the ggplot2 code, something like:
plot <- ggplot(data = df[df$employee_bu == 'americas',],aes(y = (widget_ind/employee_active_ind), x = day)) +
geom_bar(stat = 'identity', position = 'stack') +
facet_wrap(product_type ~ ., scales = 'fixed') + #change these to look at different cuts of metadata
print(plot)
Retaining the metadata is appealing rather than making individual dataframes summarizing by the various combinations, but the results with no faceting aren't even correct (e.g. the ggplot is showing a barchart with a height of ~18 widgets per person; creating a summarized dataframe with no faceting is showing a ratio of less than 1 widget per person).
I'm currently getting this error when I run the ggplot code:
Warning message:
Removed 9865 rows containing missing values (geom_bar).
Which doesn't make sense since in my data frame both widget_ind and employee_active_ind have no NA values, so calculating the ratio of the two should always work?
Edit 1: Clarifying employee_active_ind: I should not have any employee_active_ind = 0, but my current joins produce them (and it passes the reality sniff test; the process we are trying to model allows you to do work on day 1 that results in a widget on day 2, where you may not do any work, so wouldn't be counted as active on that day). I think I need to re-think my data structure. Even so, I'm assuming here that ggplot2 is acting like it would for a given bar chart; it's taking the number in each widget_ind record, for a given day (along with any facets and filters), and is then summing that set and displaying the result. The wrinkle I'm adding is dividing by the number of active employees on that day, and while you can have some one out on a given day, you'd never have everyone out. But that isn't what ggplot is doing is it?
I agree with MrFlick - especially the question concerning employee_active_ind of 0. If you have them, this could create NA values where something is divided by 0.

Trying to make a graph with multiple lines using ggplot

I am new to R and I have been trying to make a line graph with mupltiple lines. I have tried the 'plot' function but didn't get the desired result so I am now trying the ggplot.
I keep running into error:
Aesthetics must be either length 1 or the same as the data (100): x
and there's obviously no graph output.
Any help is much appreciated
I have rearranged my data, before it had 4 separate columns for different consumer types but now I have merged them and made a column that identifies each consumer.
This is the part of the code that generates the error
ggplot(data=consumers,aes(x=scenarios,y=unitary.bill)) +
geom_line(aes(color=consumer.type,group=consumer.type))
my data looks like this:
scenario unitary.bill consumer.type
1 1 0.076536835 net.cons
2 2 0.075835361 net.cons
3 3 0.076696548 net.cons
4 4 0.076431602 net.cons
5 5 0.076816135 net.cons
.........
27 2 0.076794287 smart.cons
28 3 0.075555555 smart.cons
29 4 0.077126955 smart.cons
30 5 0.077925161 smart.cons
.......
100 25 0.049247761 smart.pros
I expect the a line graph to have four different colors (each representing my consumer type) and the scenarios at the x-axis.
Thanks for all the help from Camille and Infominer. My code now looks like this (I added some more details)
ggplot(data=consumers,aes(x = scenarios,y = unitary.bill, colour= SMCs)) +
geom_line(size=1) + scale_colour_manual(values=c("indianred1", "yellowgreen","lightpink","springgreen4"))+
ggtitle(" Unitary bill for each SMC type at the end of the scenario runs")+
scale_x_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25))
and the graph looks as I wanted it to. However, if I could put some more distance between the title and the graph that will make it prettier.
you can view the graph here

Plotting coordinates for sequence alignment

I would like to reproduce this figure from a recent publication in R, but I am unsure how.
The plot's idea is simple. At the top is a representation of a full length virus sequence and each line underneath it depicts a sequenced isolate.
For each sequence, there are two pieces of information:
Where the sequence starts, ends, deleted, etc
For instance: sequence 1, starts at position 1 and ends at position 9000
But sequence 2, starts at position 1, ends at 2000, in the middle everything is deleted and then starts again at 8000-9000
Color-coded based on whether it is deleted, full-length, etc
Initially I thought I could use a bar graph, like this one I drew in Illustrator, where the x would be essentially each sequence line and the y would be the coordinates of where it maps. But I am not sure if this will allow me to designate "gaps", as seen in sequence 3 in my illustrator picture.
The data itself is organized as such:
Sequence name Mapped Start Mapped End
1 1 9000
2 4000 9000
3 1 2000
3 7000 9000
The data set only includes the mapped start and end positions, not the deleted positions.
Would appreciate hearing your input!
Thanks
I might recommend using a series of geoms, one for each sequence. If you have your data organized in a certain way, it would be fairly straightforward. For example, if your data is in long format as follows:
dat <- data.frame(sequence=c(1,2,2,2), start=c(1,1,2001,8000), stop=c(9000,2000,7999,9000), type=c("mapped","mapped","deletion","mapped"))
Which looks like...
sequence start stop type
1 1 9000 mapped
2 1 2000 mapped
2 2001 7999 deletion
2 8000 9000 mapped
You could do the following:
library(ggplot2)
g <- ggplot(data=dat, mapping=aes(ymin=0, ymax=1, xmin=start, xmax=stop, fill=type)) +
geom_rect() + facet_grid(sequence~., switch="y") +
labs(x="Position (BP)", y="Sequence / Strain", title="Mapped regions for all sequences") +
theme(axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
theme(plot.title = element_text(hjust = 0.5))
Which looks like

ggplot: Drawing separate y-lines when passing factor as argument

I want to create the following plot: The x-axis goes from 1 to 900 representing trial numbers. The y-axis shows three different lines with the moving average of reaction time. One line is shown for each difficulty level (Hard, Medium, Easy). Separate plots should be shown for each participant using facet_wrap.
Now this all works fine if I use ggplot's geom_smooth() function. Like this:
ggplot(cw_trials_f, aes(x=trial_number, y=as.numeric(correct), col=difficulty)) +
facet_wrap(~session_id) +
geom_smooth() +
ggtitle("Stroop Task")
The problem arises when I try to use zoo library's rollmean function. Here is what I tried:
ggplot(cw_trials_f, aes(x=trial_number, y=rollmean(as.numeric(correct)-1, 50, na.pad=T, align="right"), col=difficulty)) +
facet_wrap(~session_id) +
geom_line() +
ggtitle("Stroop Task")
It seems that this doesn't partition the data according to difficulty first and then apply the rollmean function, but the other way around. Thus only one line is shown but in all three colors. How can I have rollmean be applied to each category of trials (Easy, Medium, Hard) separately?
Here is some sample data:
session_id test_number trial_number trial_duration rule concordant switch correct reaction_time difficulty
1 11674020 1 1 1872 word concordant yes yes 1357 Easy
2 11674020 1 2 2839 word discordant no yes 2324 Medium
3 11674020 1 3 1525 color discordant yes no 1025 Hard
4 11674020 1 4 1544 color discordant no no 1044 Medium
5 11674020 1 5 1451 word concordant yes yes 952 Easy
6 11674020 1 6 1252 color concordant yes yes 746 Easy
So, I ended up following #joran's suggestion (thanks) from the comment above and did the following:
cw_trials_f <- ddply(cw_trials_f, .(session_id, difficulty), .fun = function(X) transform(X, movrt = rollmean(X$reaction_time, 50, na.pad=T, align="right"), movacc = rollmean(as.numeric(X$correct)-1, 50, na.pad=T, align="right")))
This adds two additional columns to the data.frame with the moving averages of accuracy and reaction time.
Then this works fine to plot them:
ggplot(cw_trials_f, aes(x=trial_number, y=movacc, col=difficulty)) + geom_line() + facet_wrap(~session_id) + ggtitle("Stroop Task")
There is one (minor) disadvantage to this compared to what I originally wanted to do: It makes it a bit tedious and slow to try out different lengths for the moving average function.

How to indicate factors in ggplot with horizontal line and Text

My data looks like this example:
dataExample<-data.frame(Time=seq(1:10),
Data1=runif(10,5.3,7.5),
Data2=runif(10,4.3,6.5),
Application=c("Substance1","Substance1","Substance1",
"Substance1","Substance2","Substance2","Substance2",
"Substance2","Substance1","Substance1"))
dataExample
Time Data1 Data2 Application
1 1 6.511573 5.385265 Substance1
2 2 5.870173 4.512775 Substance1
3 3 6.822132 5.109790 Substance1
4 4 5.940528 6.281412 Substance1
5 5 7.269394 4.680380 Substance2
6 6 6.122454 6.015899 Substance2
7 7 5.660429 6.113362 Substance2
8 8 6.649749 4.344978 Substance2
9 9 7.252656 4.764667 Substance1
10 10 7.204440 5.835590 Substance1
I would like to indicate at which time any Substance was applied that is different from dataExample$Application[1].
Here I show you the way I get this ploted, but I assume that there is a much easier way to do it with ggplot.
library(reshape2)
library(ggplot)
plotDataExample<-function(DataFrame){
longDF<-melt(DataFrame,id.vars=c("Time","Application"))
p=ggplot(longDF,aes(Time,value,color=variable))+geom_line()
maxValue=max(longDF$value)
minValue=min(longDF$value)
yAppLine=maxValue+((maxValue-minValue)/20)
xAppLine1=min(longDF$Time[which(longDF$Application!=longDF$Application[1])])
xAppLine2=max(longDF$Time[which(longDF$Application!=longDF$Application[1])])
lineData=data.frame(x=c(xAppLine1,xAppLine2),y=c(yAppLine,yAppLine))
xAppText=xAppLine1+(xAppLine2-xAppLine1)/2
yAppText=yAppLine+((maxValue-minValue)/20)
appText=longDF$Application[which(longDF$Application!=longDF$Application[1])[1]]
textData=data.frame(x=xAppText,y=yAppText,appText=appText)
p=p+geom_line(data=lineData,aes(x=x, y=y),color="black")
p=p+geom_text(data=textData,aes(x=x,y=y,label = appText),color="black")
return(p)
}
plotDataExample(dataExample)
Question:
Do you know a better way to get a similar result so that I could possibly indicate more than one factor (e.g. Substance3, Substance4 ...).
First, made new sample data to have more than 2 levels and twice repeated Substance2.
dataExample<-data.frame(Time=seq(1:10),
Data1=runif(10,5.3,7.5),
Data2=runif(10,4.3,6.5),
Application=c("Substance1","Substance1","Substance2",
"Substance2","Substance1","Substance1","Substance2",
"Substance2","Substance3","Substance3"))
Didn't make this as function to show each step.
Add new column groups to original data frame - this contains identifier for grouping of Applications - if substance changes then new group is formed.
dataExample$groups<-c(cumsum(c(1,tail(dataExample$Application,n=-1)!=head(dataExample$Application,n=-1))))
Convert to long format data for lines of data.
longDF<-melt(dataExample,id.vars=c("Time","Application","groups"))
Calculate positions for Substance identifiers. Used function ddply() from library plyr. For calculation only data that differs from first Application value are used (that's subset()). Then Application and groups are used for grouping of data. Calculated starting, middle and ending positions on x axis and y value taken as maximal value +0.3.
library(plyr)
lineData<-ddply(subset(dataExample,Application != dataExample$Application[1]),
.(Application,groups),
summarise,minT=min(Time),maxT=max(Time),
meanT=mean(Time),ypos=max(longDF$value)+0.3)
Now plot longDF data with ggplot() and geom_line() and add segments above plot with geom_segment() and text with annotate() using new data frame lineData.
ggplot(longDF,aes(Time,value,color=variable))+geom_line()+
geom_segment(data=lineData,aes(x=minT,xend=maxT,y=ypos,yend=ypos),inherit.aes=FALSE)+
annotate("text",x=lineData$meanT,y=lineData$ypos+0.1,label=lineData$Application)

Resources