Plot barplot as density plot in ggplot - r

Could anyone help me to plot the data below as a density plot where colour=variable?
> head(combined_length.m)
length seq mir variable value
1 22 TGAGGTATTAGGTTGTATGGTT mmu-let-7c-5p Ago1 8.622468
2 23 TGAGGGAGTAGGTTGTATGGTTT mmu-let-7c-5p Ago1 22.212471
3 21 TGAGGTAGTAGGTTGCATGGT mmu-let-7c-5p Ago1 9.745199
4 22 TGAGGTAGTATGTTGTATGGTT mmu-let-7c-5p Ago1 11.635982
5 22 TGAGTTAGTAGGTTGTATGGTT mmu-let-7c-5p Ago1 13.203627
6 20 TGAGGTAGTAGGCTGTATGG mmu-let-7c-5p Ago1 7.752571
ggplot(combined_length.m, aes(factor(length),value)) + geom_bar(stat="identity") + facet_grid(~variable) +
theme_bw(base_size=16)
I tried this without success:
ggplot(combined_length.m, aes(factor(length),value)) + geom_density(aes(fill=variable), size=2)
Error in data.frame(counts = c(167, 9324, 177, 150451, 62640, 74557, 4, :
arguments imply differing number of rows: 212, 6, 1, 4
I want something like this:
http://i.stack.imgur.com/qitOs.jpg

Using factor(length) for x seems to create problems. Just use length.
Also, density plots display the distribution of whatever you define as x, so by definition the y axis is the density at a given value of x. In your code you are trying to specify both x and y, which makes no sense for a density plot. You can specify a y in geom_density(...), but it only controls how the curve is scaled, as shown below. [Note: Your example has only one type of variable (Ago1), so I created an artificial dataset.]
set.seed(1) # for reproducible example
df <- data.frame(variable=rep(LETTERS[1:3],c(5,10,15)),
length =rpois(30,25),
value =rnorm(30,mean=20,sd=5))
library(ggplot2)
ggplot(df,aes(x=length))+geom_density(aes(color=variable))
In this representation, the area under each curve is 1. This is the same as setting y=..density..
ggplot(df,aes(x=length))+geom_density(aes(color=variable,y=..density..))
You can also set y=..count.. which scales based on the counts. In this example, since there are 15 observations for C and only 5 for A, the blue curve (C) has three times the area of the red curve (A).
ggplot(df,aes(x=length))+geom_density(aes(color=variable,y=..count..))
You can also set y=..scaled.. which adjusts the curves so the maximum value in each corresponds to 1.
ggplot(df,aes(x=length))+geom_density(aes(color=variable,y=..scaled..))
Finally, if you want to get rid of the extra lines that geom_density draws along the bottom and sides of each curve, use stat_density(...) with geom="line" instead:
ggplot(df,aes(x=length))+
stat_density(aes(color=variable),geom="line",position="identity")
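On recent ggplot2 versions (3.4.0 and later) the ..density.. style of notation is deprecated in favour of after_stat(). A minimal sketch of the same three scalings in the newer notation, using the same length column as the artificial dataset above (the object names p_density, p_count and p_scaled are only for illustration):

```r
library(ggplot2)

set.seed(1)  # same seed as above, so the length column is identical
df <- data.frame(variable = rep(LETTERS[1:3], c(5, 10, 15)),
                 length   = rpois(30, 25))

# area under each curve is 1
p_density <- ggplot(df, aes(x = length)) +
  geom_density(aes(color = variable, y = after_stat(density)))

# area is proportional to the number of observations in each group
p_count <- ggplot(df, aes(x = length)) +
  geom_density(aes(color = variable, y = after_stat(count)))

# each curve is rescaled so that its maximum is 1
p_scaled <- ggplot(df, aes(x = length)) +
  geom_density(aes(color = variable, y = after_stat(scaled)))

print(p_scaled)
```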

Related

Creating a Bar Plot with Proportions on ggplot

I'm trying to create a bar graph on ggplot that has proportions rather than counts, and I have c+geom_bar(aes(y=(..count..)/sum(..count..)*100)) but I'm not sure what either of the counts refer to. I tried putting in the data but it didn't seem to work. What should I input here?
This is the data I'm using
> describe(topprob1)
topprob1
n missing unique Info Mean
500 0 9 0.93 3.908
1 2 3 4 5 6 7 8 9
Frequency 128 105 9 15 13 172 39 12 7
% 26 21 2 3 3 34 8 2 1
You haven't provided a reproducible example, so here's an illustration with the built-in mtcars data frame. Compare the following two plots. The first gives counts. The second gives proportions, which are displayed in this case as percentages. ..count.. is an internal variable that ggplot creates to store the count values.
library(ggplot2)
library(scales)
ggplot(mtcars, aes(am)) +
geom_bar()
ggplot(mtcars, aes(am)) +
geom_bar(aes(y=..count../sum(..count..))) +
scale_y_continuous(labels=percent_format())
You can also use the ..prop.. computed variable together with the group aesthetic:
library(ggplot2)
library(scales)
ggplot(mtcars, aes(am)) +
geom_bar(aes(y=..prop.., group = 1)) +
scale_y_continuous(labels=percent_format())
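To see what the two counts refer to, it can help to compute the same numbers by hand. A small sketch, assuming the mtcars example above: ..count.. (written after_stat(count) in ggplot2 3.3.0 and later) is the number of rows in each bar, and sum(..count..) is the total over all bars, so the ratio should match prop.table(table(...)). The names by_hand and from_ggplot are only for illustration.

```r
library(ggplot2)
library(scales)

# proportions computed by hand: rows per value of am / total rows
by_hand <- prop.table(table(mtcars$am))
print(by_hand)

# the same proportions computed by ggplot2 from its internal counts
p <- ggplot(mtcars, aes(am)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = percent_format())

# layer_data() exposes the values ggplot2 actually plots
from_ggplot <- layer_data(p)$y
print(from_ggplot)
```

So nothing from your data needs to be typed into that expression; ggplot fills in the counts itself.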

set x/y limits in facet_wrap with scales = 'free'

I've seen similar questions asked, as well as this discussion about adding functionality to ggplot: "Setting x/y lim in facet_grid". In my research I often want to produce several panel plots, say for different simulation trials, where the axis limits remain the same to highlight differences between the trials. This is especially useful when showing the plot panels in a presentation. In each panel plot I produce, the individual plots require independent y axes, as they are often weather variables: temperature, relative humidity, wind speed, etc. Using
ggplot() + ... + facet_wrap(~ ..., scales = 'free_y')
works great as I can easily produce plot panels of different weather variables.
When I compare different plot panels, it's nice to have consistent axes. Unfortunately, ggplot provides no way of setting the individual limits of each plot within a panel plot; it defaults to using the range of the given data. The Google Group discussion linked above discusses this shortcoming, but I was unable to find any updates as to whether this could be added. Is there a way to trick ggplot into setting the individual limits?
A first suggestion that somewhat sidesteps the solution I'm looking for is to combine all my data into one data table and use facet_grid on my variable and simulation
ggplot() + ... + facet_grid(variable~simulation, scales = 'free_y')
This produces a fine looking plot that displays the data in one figure, but can become unwieldy when considering many simulations.
To 'hack' the plotting into producing what I want, I first determined which limits I desired for each weather variable. These limits were found by looking at the greatest extents for all simulations of interest. Once determined I created a small data table with the same columns as my simulation data and appended it to the end. My simulation data had the structure
'year' 'month' 'variable' 'run' 'mean'
1973 1 'rhmax' 1 65.44
1973 2 'rhmax' 1 67.44
... ... ... ... ...
2011 12 'windmin' 200 0.4
So I created a new data table with the same columns
ylims.sims <- data.table(year = 1, month = 13,
variable = rep(c('rhmax','rhmin','sradmean','tmax','tmin','windmax','windmin'), each = 2),
run = 201, mean = c(20, 100, 0, 80, 100, 350, 25, 40, 12, 32, 0, 8, 0, 2))
Which gives
'year' 'month' 'variable' 'run' 'mean'
1 13 'rhmax' 201 20
1 13 'rhmax' 201 100
1 13 'rhmin' 201 0
1 13 'rhmin' 201 80
1 13 'sradmean' 201 100
1 13 'sradmean' 201 350
1 13 'tmax' 201 25
1 13 'tmax' 201 40
1 13 'tmin' 201 12
1 13 'tmin' 201 32
1 13 'windmax' 201 0
1 13 'windmax' 201 8
1 13 'windmin' 201 0
1 13 'windmin' 201 2
While the choice of year and run is arbitrary, month needs to be a value outside 1:12. I then appended this to my simulation data
sim1data.ylims <- rbind(sim1data, ylims.sims)
ggplot() + geom_boxplot(data = sim1data.ylims, aes(x = factor(month), y = mean)) +
facet_wrap(~variable, scales = 'free_y') + xlab('month') +
xlim('1','2','3','4','5','6','7','8','9','10','11','12')
When I plot these data with the y limits, I limit the x-axis values to those in the original data. The appended data table with the y limits has month values of 13. Because ggplot still scales the axes to the entire dataset, even when the axes are limited, this gives me the y limits I want. Note that this will not work if the data contain values beyond the limits you specify.
Before: Notice the differences in the y limits for each weather variable between the panels.
After: Now the y limits remain consistent for each weather variable between the panels.
I hope to edit this post in the coming days and add a reproducible example for better explanation. Please comment if you've heard anything about adding this functionality to ggplot.
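In the meantime, here is a minimal reproducible sketch of the same idea on made-up data. Instead of appending dummy rows with month = 13, it passes the limits table to an invisible geom_blank() layer, which as far as I know is also the mechanism behind expand_limits(). The simulated values, the two variable names and the chosen limits are assumptions for illustration only.

```r
library(ggplot2)

set.seed(42)
# made-up stand-in for one simulation: 20 runs x 12 months x 2 variables
sim1data <- data.frame(
  month    = rep(1:12, times = 40),
  variable = rep(c("tmax", "windmax"), each = 240),
  mean     = c(rnorm(240, mean = 30, sd = 1), runif(240, min = 1, max = 6))
)

# desired y limits: one row for the lower and one for the upper bound
ylims.sims <- data.frame(
  variable = rep(c("tmax", "windmax"), each = 2),
  mean     = c(25, 40, 0, 8)
)

p <- ggplot(sim1data, aes(x = factor(month), y = mean)) +
  geom_boxplot() +
  # draws nothing, but its y values are included when the scales are trained
  geom_blank(data = ylims.sims, aes(y = mean), inherit.aes = FALSE) +
  facet_wrap(~variable, scales = "free_y", nrow = 1) +
  xlab("month")

print(p)
```

As with the original hack, this only widens the axes: data outside the requested limits will still stretch the panel.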

Adding bidirectional error bars to points on scatter plot in ggplot

I am trying to add x and y axis error bars to each individual point in a scatter plot.
Each point represents a standardized mean value for fitness for males and females (n=33).
I have found the geom_errorbar and geom_errorbarh functions and this example
ggplot2 : Adding two errorbars to each point in scatterplot
However my issue is that I want to specify the standard error for each point (which I have already calculated) from another column in my dataset which looks like this below
line MaleBL1 FemaleBL1 BL1MaleSE BL1FemaleSE
3 0.05343516 0.05615977 0.28666600 0.3142001
4 -0.53321642 -0.27279609 0.23929438 0.1350793
5 -0.25853484 -0.08283566 0.25904025 0.2984323
6 -1.11250479 0.03299387 0.23553281 0.2786233
7 -0.14784506 0.28781883 0.27872358 0.2657080
10 0.38168220 0.89476555 0.25620796 0.3108585
11 0.24466921 0.14419021 0.27386482 0.3322349
12 -0.06119015 1.42294820 0.32903199 0.3632367
14 0.38957538 1.66850680 0.30362671 0.4437925
15 0.05784842 -0.12453429 0.32319116 0.3372879
18 0.71964923 -0.28669563 0.16336556 0.1911489
23 0.03191843 0.13955703 0.34522310 0.1872229
28 -0.04598340 -0.35156017 0.27001451 0.1822967
'line' is the population (n=10 individuals in each) that each value comes from. My x and y variables are 'MaleBL1' and 'FemaleBL1', and the standard errors for each population, for males and females respectively, are 'BL1MaleSE' and 'BL1FemaleSE'.
So far code wise I have
p<-ggplot(BL1ggplot, aes(x=MaleBL1, y=FemaleBL1)) +
geom_point(shape=1) +
geom_smooth(method=lm)+ # add regression line
xmin<-(MaleBL1-BL1MaleSE)
xmax<-(MaleBL1+BL1MaleSE)
ymin<-(FemaleBL1-BL1FemaleSE)
ymax<-(FemaleBL1+BL1FemaleSE)
geom_errorbarh(aes(xmin=xmin,xmax=xmax))+
geom_errorbar(aes(ymin=ymin,ymax=ymax))
I think the last two lines are wrong with specifying the limits of the error bars. I just don't know how to tell R where to take the SE values for each point from the columns BL1MaleSE and BL1FemaleSE
Any tips greatly appreciated
You really should study some tutorials. You haven't understood ggplot2 syntax.
BL1ggplot <- read.table(text=" line MaleBL1 FemaleBL1 BL1MaleSE BL1FemaleSE
3 0.05343516 0.05615977 0.28666600 0.3142001
4 -0.53321642 -0.27279609 0.23929438 0.1350793
5 -0.25853484 -0.08283566 0.25904025 0.2984323
6 -1.11250479 0.03299387 0.23553281 0.2786233
7 -0.14784506 0.28781883 0.27872358 0.2657080
10 0.38168220 0.89476555 0.25620796 0.3108585
11 0.24466921 0.14419021 0.27386482 0.3322349
12 -0.06119015 1.42294820 0.32903199 0.3632367
14 0.38957538 1.66850680 0.30362671 0.4437925
15 0.05784842 -0.12453429 0.32319116 0.3372879
18 0.71964923 -0.28669563 0.16336556 0.1911489
23 0.03191843 0.13955703 0.34522310 0.1872229
28 -0.04598340 -0.35156017 0.27001451 0.1822967", header=TRUE)
library(ggplot2)
p<-ggplot(BL1ggplot, aes(x=MaleBL1, y=FemaleBL1)) +
geom_point(shape=1) +
geom_smooth(method=lm)+
geom_errorbarh(aes(xmin=MaleBL1-BL1MaleSE,
xmax=MaleBL1+BL1MaleSE),
height=0.2)+
geom_errorbar(aes(ymin=FemaleBL1-BL1FemaleSE,
ymax=FemaleBL1+BL1FemaleSE),
width=0.2)
print(p)
Btw., looking at the errorbars you should probably use Deming regression or Total Least Squares instead of OLS regression.
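For what it's worth, a rough sketch of what total least squares could look like in base R: the fitted line follows the first principal component of the centred data. tls_fit is a hypothetical helper name, not an existing function, the example data are made up, and this simple version assumes equal error variance in x and y, which Deming regression relaxes.

```r
# total least squares (orthogonal regression) via the first principal component
tls_fit <- function(x, y) {
  pc <- prcomp(cbind(x, y))  # centres the data by default
  # direction of the first principal component gives the slope
  slope <- pc$rotation[2, 1] / pc$rotation[1, 1]
  # the fitted line passes through the centroid of the data
  intercept <- mean(y) - slope * mean(x)
  c(intercept = intercept, slope = slope)
}

# made-up example: noise in both variables around the line y = 1 + 2x
set.seed(1)
truth <- seq(-1, 1, length.out = 50)
x <- truth + rnorm(50, sd = 0.1)
y <- 1 + 2 * truth + rnorm(50, sd = 0.1)

fit <- tls_fit(x, y)
print(fit)
```

The result could then be drawn with geom_abline(intercept = fit["intercept"], slope = fit["slope"]) in place of geom_smooth(method=lm).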

Multiple Plots in R

I want to plot 2 graphs in 1 frame. Basically I want to compare the results.
Anyways, the code I tried is:
plot(male,pch=16,col="red")
lines(male,pch=16,col="red")
par(new=TRUE)
plot(female,pch=16,col="green")
lines(female,pch=16,col="green")
When I run it, I DO get 2 plots in a frame BUT it changes my y-axis. Added my plot below. Anyways, y-axis values are -4,-4,-3,-3,...
It's like both of the plots display their own axis.
Please help.
Thanks
You don't need the second plot. Just use
> plot(male,pch=16,col="red")
> lines(male, pch=16, col = "red")
> lines(female, pch=16, col = "green")
> points(female, pch=16, col = "green")
Note: that will set the frame boundaries based on the first data set, so some data from the second plot could be outside the boundaries of the plot. You can fix it by e.g. setting the limits of the first plot yourself.
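For example, a sketch of setting the limits yourself, using made-up male and female vectors since the question's data are not shown:

```r
set.seed(1)
# made-up stand-ins for the question's data
male   <- rnorm(10, mean = 1.5, sd = 1.5)
female <- rnorm(10, mean = 1, sd = 1)

# y limits that cover both series
ylim <- range(c(male, female))

# type = "b" draws both the points and the connecting lines
plot(male, type = "b", pch = 16, col = "red", ylim = ylim, ylab = "value")
lines(female, type = "b", pch = 16, col = "green")
```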
For this kind of plot I usually like the plotting with ggplot2 much better. The main reason: It generalizes nicely to more than two lines without a lot of code.
The drawback for your sample data is that it is not available as a data.frame, which is required for ggplot2. Furthermore, in every case you need an x-variable to plot against. Thus, let us first create a data.frame out of your data.
dat <- data.frame(index=rep(1:10, 2), vals=c(male, female), group=rep(c('male', 'female'), each=10))
Which leaves us with
> dat
index vals group
1 1 -0.4334269341 male
2 2 0.8829902521 male
3 3 -0.6052638138 male
4 4 0.2270191965 male
5 5 3.5123679143 male
6 6 0.0615821014 male
7 7 3.6280155376 male
8 8 2.3508890457 male
9 9 2.9824432680 male
10 10 1.1938052833 male
11 1 1.3151289227 female
12 2 1.9956491556 female
13 3 0.8229389822 female
14 4 1.2062726250 female
15 5 0.6633392820 female
16 6 1.1331669670 female
17 7 -0.9002109636 female
18 8 3.2137052284 female
19 9 0.3113656610 female
20 10 1.4664434215 female
Note that my command assumes you have 10 data values each. That command would have to be adjusted according to your actual data.
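One way to avoid hard-coding the length is to build each group separately with seq_along() and stack them. A sketch with made-up vectors of different lengths:

```r
set.seed(1)
male   <- rnorm(10)  # made-up data
female <- rnorm(8)   # deliberately a different length

# seq_along() generates 1..n for whatever length each vector has
dat <- rbind(
  data.frame(index = seq_along(male),   vals = male,   group = "male"),
  data.frame(index = seq_along(female), vals = female, group = "female")
)
str(dat)
```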
Now we may use the mighty power of ggplot2:
library(ggplot2)
ggplot(dat, aes(x=index, y=vals, color=group)) + geom_point() + geom_line()
The call above has three elements. ggplot initializes the plot, tells R to use dat as the data source and defines the plot aesthetics, that is, which aesthetic properties of the plot (such as color, position, size, etc.) are influenced by your data. We use the x and y values as expected and additionally map the color aesthetic to the grouping variable, which makes ggplot automatically plot the two groups in different colors. Finally, we add two geometries that do pretty much what their names say: draw points and draw lines.
The result:
If you have your data saved in the standard way in R (in a data.frame), you end up with one line of code. And if after some thousand years of evolution you want to add another gender, it is still one line of code.

R lattice - trying to change labels colors with y.scale.components customisation

I am currently trying to customise a lattice parallel plot by changing the colors of its Y axis labels depending on the text of those labels. I created a customised yscale.components function, as described in many books/forums. However, after assigning a vector of new colors to the ans$left$labels$col parameter, only the default color (black) is used for the plot.
Here's the code:
test2 <- read.table(textConnection("
species evalue l1 l2 l3
Daphnia.pulex 1.0E-6 17 41 35
Daphnia.pulex 1.0E-10 11 30 25
Daphnia.pulex 1.0E-20 4 14 17
Daphnia.pulex 1.0E-35 4 8 15
Daphnia.pulex 1.0E-50 1 4 8
Daphnia.pulex 1.0E-75 0 2 6
Ixodes.scapularis 1.0E-6 7 20 118
Ixodes.scapularis 1.0E-10 6 17 107
Ixodes.scapularis 1.0E-20 4 6 46
Ixodes.scapularis 1.0E-35 2 3 14
Ixodes.scapularis 1.0E-50 0 0 5
Ixodes.scapularis 1.0E-75 0 0 2
")->con,header=T);close(con)
#data.frame to assign a color to the data, depending on species names on y axis
orga<-c("Daphnia.pulex","Ixodes.scapularis")
color<-c("cornsilk2","darkolivegreen1" );
phylum<-c("arthropoda","arthropoda" );
colorChooser<-data.frame(orga,color,phylum)
#function for custom rendering of left y axis labels
yscale.components.custom<-function(...) {
ans<-yscale.components.default(...)
#vector for new label colors, grey60 by default
new_colors<-c()
new_colors<-rep("grey60",length(ans$left$labels$labels))
# the for() loop checks each label and assigns the corresponding color from the colorChooser data.frame
n<-1
for (i in ans$left$labels$labels) {
new_colors[n]<-as.character(colorChooser$color[colorChooser$orga==i])
#got the color corresponding to the label, with the colorChooser dataframe
n<-n+1
}
print(length(new_colors))
cat(new_colors,sep="\n") #print the content of the generated color vector
ans$left$labels$col<-new_colors #assign this vector to col parameter
ans
}
#plot everything
bwplot( reorder(species,l1,median)~l1,
data=test2,
panel = function(..., box.ratio) {
panel.grid(h=length(colnames(cdata[,annot.arthro]))-1,v=0,col.line="grey80")
panel.violin(..., col = "white",varwidth = FALSE, box.ratio = box.ratio )
panel.bwplot(..., fill = NULL, box.ratio = .07)
},
yscale.components=yscale.components.custom
)
Here's the output of the cat() command inside the yscale.components.custom function. As you can see, the color labels are printed twice, although the vector assigned to ans$left$labels$col has length 2. Is there a second call that sets up the Y axis label colors? Where does it come from?
[1] 2
darkolivegreen1
cornsilk2
[1] 2
darkolivegreen1
cornsilk2
Any help is welcome. I don't understand why the colors are assigned to ans$left$labels$col but everything is drawn in black. I would also like to change the violin border colors using the same colorChooser data.frame, but that's another story...
After I asked Deepayan Sarkar, it turns out that the ans$left$labels$col value is apparently ignored during lattice execution.
I used a different solution with the "scales" argument. Unfortunately, this means I can no longer rely on the lattice reorder function (the "reorder" in my lattice formula) to order my data series by their median.
For the code mentioned above, I order them manually, then create a vector of colors in the corresponding order. Then I set up my axis label colors with
scales=list(col=c(color1,color2,...))
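A sketch of how that might look for the example above, with the species ordered by median by hand and the label colors looked up in the same order. Only the species and l1 columns are reproduced here, label_cols is a name made up for illustration, and the per-label coloring relies on the behaviour described in this answer rather than on documented behaviour.

```r
library(lattice)

# reduced copy of the question's data (species and l1 only)
test2 <- data.frame(
  species = rep(c("Daphnia.pulex", "Ixodes.scapularis"), each = 6),
  l1      = c(17, 11, 4, 4, 1, 0, 7, 6, 4, 2, 0, 0)
)
colorChooser <- data.frame(
  orga  = c("Daphnia.pulex", "Ixodes.scapularis"),
  color = c("cornsilk2", "darkolivegreen1")
)

# order the species by the median of l1 by hand (replaces reorder() in the formula)
med <- tapply(test2$l1, test2$species, median)
test2$species <- factor(test2$species, levels = names(sort(med)))

# one color per label, in the same order as the factor levels
label_cols <- as.character(
  colorChooser$color[match(levels(test2$species), colorChooser$orga)]
)

p <- bwplot(species ~ l1, data = test2,
            scales = list(y = list(col = label_cols)))
print(p)
```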
