ggplot stacked bar plot with x value and no y/fill value - r

I am trying to make a stacked bar plot with the X axis being time, Y axis being amount, and fill color being certain features.
My data looks something like this:
> base
Number Mut Time Percent
1 117 22:A->G 2 81.81
2 13 24:G->A 2 9.09
3 10 22:A->G 24:G->A 108:G->A 158:G->A 162:G->A 2 6.99
4 1 22:A->G 24:G->A 2 0.69
5 32 24:G->A 3 94.11
6 1 24:G->A 162:G->T 3 2.94
7 1 24:G->A 82:G->T 3 2.94
When I do a stacked bar graph in ggplot using the code:
ggplot(base,aes(x = Time, fill = Mut, y = Percent)) + geom_bar(stat='identity') + theme(legend.key.size = unit(.5, "cm")) + ylab("Number")
I get a graph that looks like this:
http://imgur.com/32XCkTm,yfCAJsx#0
My problem is I want there to be zero values for time = 1 and time = 4.
Something similar to this:
http://imgur.com/32XCkTm,yfCAJsx#1
Is there a way I can do this? Right now I add rows with 0 values for times 1 and 4 and set the fill feature (Mut) to one that already appears in the data:
> base
Reads Mut Time Percent
1 0 22:A->G 1 0.00
2 117 22:A->G 2 81.81
3 13 24:G->A 2 9.09
4 10 22:A->G 24:G->A 108:G->A 158:G->A 162:G->A 2 6.99
5 1 22:A->G 24:G->A 2 0.69
6 32 24:G->A 3 94.11
7 1 24:G->A 162:G->T 3 2.94
8 1 24:G->A 82:G->T 3 2.94
9 0 22:A->G 4 0.00
My problem is that I don't want to have to keep searching for a feature (Mut) that is already in the data. Is there a way to have ggplot automatically show x values for time = 1 and time = 4 with no bars, without having to add values to the data? I have been searching for hours and can't find any answers.
Thanks.

Add + scale_x_discrete(limits = 1:4)
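If Time is stored as a number, recent ggplot2 versions may warn about continuous limits being supplied to a discrete scale. One possible variant, sketched here on a hand-typed excerpt of the question's data, is to turn Time into a factor that carries all four levels and ask the scale to keep unused levels. The levels 1:4 are taken from the question; everything else is only an illustration.

```r
library(ggplot2)

# Hand-typed excerpt of the question's data (times 2 and 3 only)
base <- data.frame(
  Mut     = c("22:A->G", "24:G->A", "24:G->A", "24:G->A 162:G->T"),
  Time    = c(2, 2, 3, 3),
  Percent = c(81.81, 9.09, 94.11, 2.94)
)

# Declare all four time points as factor levels, including the empty ones
base$Time <- factor(base$Time, levels = 1:4)

p <- ggplot(base, aes(x = Time, y = Percent, fill = Mut)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(drop = FALSE) +  # keep the unused levels 1 and 4 on the axis
  ylab("Percent")
print(p)
```

With this approach no dummy rows are needed, because the empty time points live in the factor levels rather than in the data.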

Related

Gnuplot bar chart with personalized intervals on x-axis

I'm new to gnuplot and I would like to replicate this plot: https://images.app.goo.gl/DqygL2gfk3jZ7jsK6
I have a file.dat with continuous values between 0 and 100, and I would like to plot them subdivided into intervals (pident > 98, 90 < pident < 100, etc.), with the total occurrences on the y-axis.
I have looked everywhere for a way to do this, but I still cannot do it.
Thank you!
sample of the data, with the value and the counts:
33.18 5
43.296 1
33.19 1
27.168 5
71.429 11
30.698 9
47.934 1
43.299 3
30.699 3
37.092 2
24.492 2
24.493 2
24.494 7
47.938 1
24.497 1
37.097 8
37.099 2
33.824 7
51.111 15
59.025 2
62.553 2
62.554 2
57.867 2
33.826 2
62.555 1
33.827 5
62.556 2
33.828 1
59.028 1
46.429 11
51.117 1
75.158 2
27.621 1
27.623 1
27.624 2
37.5 113
37.6 2
32.313 8
27.626 3
37.7 3
32.314 1
67.797 3
27.628 2
32.316 2
37.9 1
61.044 1
43.81 5
32.317 8
32.318 2
43.82 4
32.319 2
43.83 2
37.551 3
61.048 1
48.993 6
29.43 2
This is the code I have tried so far (where I also calculate the mean):
#!/usr/bin/gnuplot -persist
set noytics
# Find the mean
mean= system("awk '{sum+=$1*$2; tot+=$2} END{print sum/tot}' hist.dat")
set arrow 1 from mean,0 to mean, graph 1 nohead ls 1 lc rgb "blue"
set label 1 sprintf(" Mean: %s", mean) at mean, screen 0.1
# Histogram
binwidth=10
bin(x,width)=width*floor(x/width)
plot 'hist.dat' using (bin($1,binwidth)):(1.0) smooth freq with boxes
This is the result:
The following script takes your data and sums up the second column within the defined bins.
If you have values equal to 100 in the first column, those values will fall into the bin 100-<110.
With Bin(x) = floor(x/BinWidth)*BinWidth + BinWidth*0.5, the bins are shifted by half a bin width, so that each box on the x-axis spans from the beginning of its bin to the end of its bin (rather than being centered on the beginning of the bin).
If you explicitly want xtic labels like in the example graph you've shown, i.e. 10-<20, 20-<30 etc., you would have to set the xtic labels manually.
Edit: Forgot the mean value. There is no need for calling awk. Gnuplot can do this for you as well, check help stats.
Code:
### create histogram
reset session
$Data <<EOD
33.18 5
43.296 1
33.19 1
27.168 5
71.429 11
30.698 9
47.934 1
43.299 3
30.699 3
37.092 2
24.492 2
24.493 2
24.494 7
47.938 1
24.497 1
37.097 8
37.099 2
33.824 7
51.111 15
59.025 2
62.553 2
62.554 2
57.867 2
33.826 2
62.555 1
33.827 5
62.556 2
33.828 1
59.028 1
46.429 11
51.117 1
75.158 2
27.621 1
27.623 1
27.624 2
37.5 113
37.6 2
32.313 8
27.626 3
37.7 3
32.314 1
67.797 3
27.628 2
32.316 2
37.9 1
61.044 1
43.81 5
32.317 8
32.318 2
43.82 4
32.319 2
43.83 2
37.551 3
61.048 1
48.993 6
29.43 2
EOD
# Histogram
BinWidth = 10
Bin(x) = floor(x/BinWidth)*BinWidth + BinWidth*0.5
# Mean
stats $Data u ($1*$2):2 nooutput
mean = STATS_sum_x/STATS_sum_y
set arrow 1 from mean, graph 0 to mean, graph 1 nohead lw 2 lc rgb "red" front
set label 1 sprintf("Mean: %.1f", mean) at mean, graph 1 offset 1,-0.7
set xlabel "Identity / %"
set xrange [0:100]
set xtics 10 out
set ylabel "The number of blast hits"
set style fill solid 0.3
set boxwidth BinWidth
set key noautotitle
set grid x,y
plot $Data using (Bin($1)):2 smooth freq with boxes lc "blue"
### end of code
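To cross-check the bin totals and the mean outside gnuplot, the same arithmetic can be sketched in R. This is only a rough check under assumptions: the names value, count, bin_width and bin_start are hypothetical, and only a few rows of the sample are typed in by hand.

```r
# A few rows from the sample data (value, count), typed in by hand
value <- c(33.18, 43.296, 27.168, 71.429, 37.5)
count <- c(5, 1, 5, 11, 113)

bin_width <- 10
# Lower edge of each bin, same rule as floor(x/BinWidth)*BinWidth
bin_start <- floor(value / bin_width) * bin_width

# Sum the counts per bin (the role of column 2 with "smooth freq")
bin_totals <- tapply(count, bin_start, sum)

# Count-weighted mean, the same ratio as STATS_sum_x / STATS_sum_y
weighted_mean <- sum(value * count) / sum(count)

print(bin_totals)
print(weighted_mean)
```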
Result:

geom_bar removed 3 rows with missing values

I'm trying to create a histogram using ggplot2 in R.
This is the code I'm using:
library(tidyverse)
dat_male$explicit_truncated <- trunc(dat_male$explicit_mean)
means2 <- aggregate(dat_male$IAT_D, by=list(dat_male$explicit_truncated,dat_male$id), mean, na.rm=TRUE)
colnames(means2) <- c("explicit", "id", "IAT_D")
sd2 <- aggregate(dat_male$IAT_D, by=list(dat_male$explicit_truncated,dat_male$id), sd, na.rm=TRUE)
length2 <- aggregate(dat_male$IAT_D, by=list(dat_male$explicit_truncated,dat_male$id), length)
se2 <- sd2$x / sqrt(length2$x)
means2$lo <- means2$IAT_D - 1.6*se2
means2$hi <- means2$IAT_D + 1.6*se2
ggplot(data = means2, aes(x = factor(explicit), y = IAT_D, fill = factor(id))) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_errorbar(aes(ymin=lo,ymax=hi, width=.2), position=position_dodge(0.9), data=means2) +
  xlab("Explicit attitude score") +
  ylab("D-score")
For some reason I get the following warning message:
Removed 3 rows containing missing values (geom_bar).
And I get the following histogram:
I really have no clue what is going on.
Please let me know if you need to see anything else of my code, I'm never really sure what to include.
dat_male is a dataset that looks like this (I have only included the variables that I mentioned in this question, as the dataset contains 68 variables):
id explicit_mean IAT_D explicit_truncated
5 1 3.1250 0.366158652 3
6 1 3.3125 0.373590066 3
9 1 3.6250 0.208096230 3
11 1 3.1250 0.661983618 3
15 1 2.3125 0.348246184 2
19 1 3.7500 0.562406383 3
28 1 2.5625 -0.292888526 2
35 1 4.3750 0.560039531 4
36 1 3.8125 -0.117455439 3
37 1 3.1250 0.074375196 3
46 1 2.5625 0.488265849 2
47 1 4.2500 -0.131005579 4
53 1 2.0625 0.193040876 2
55 1 2.6875 0.875420303 2
62 1 3.8750 0.579146056 3
63 1 3.3125 0.666095380 3
66 1 2.8125 0.115607820 2
68 1 4.3750 0.259929946 4
80 1 3.0000 0.502709149 3
means2 is a dataset I have used to calculate means, and that looks like this:
explicit id IAT_D lo hi
1 0 0 NaN NaN NaN
2 2 0 0.23501191 0.1091807 0.3608431
3 3 0 0.31478389 0.2311406 0.3984272
4 4 0 -0.24296625 -0.3241166 -0.1618159
5 1 1 -0.04010111 NA NA
6 2 1 0.21939286 0.1109138 0.3278719
7 3 1 0.29097806 0.1973051 0.3846511
8 4 1 0.22965463 0.1209229 0.3383864
Now that I see it in front of me, it probably has something to do with the NaNs?
From your dataset it seems like everything is alright.
The warnings that you get indicate that your data.frame has missing values (i.e. NaN and NA).
I actually got two warning messages:
Warning messages:
1: Removed 1 rows containing missing values
(geom_bar).
2: Removed 2 rows containing missing values
(geom_errorbar).
Regarding the plot: because IAT_D is NaN for explicit = 0, no bar is drawn for it. Similarly, because lo and hi are NA for explicit = 1, you don't get the corresponding error bar.
Dataset:
means2 <- read.table(text = " explicit id IAT_D lo hi
1 0 0 NaN NaN NaN
2 2 0 0.23501191 0.1091807 0.3608431
3 3 0 0.31478389 0.2311406 0.3984272
4 4 0 -0.24296625 -0.3241166 -0.1618159
5 1 1 -0.04010111 NA NA
6 2 1 0.21939286 0.1109138 0.3278719
7 3 1 0.29097806 0.1973051 0.3846511
8 4 1 0.22965463 0.1209229 0.3383864",
header = TRUE)
plot:
means2 %>%
  ggplot(aes(x = factor(explicit), y = IAT_D, fill = factor(id))) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_errorbar(aes(ymin=lo,ymax=hi, width=.2),
                position=position_dodge(0.9)) +
  xlab("Explicit attitude score") +
  ylab("D-score")
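Where the missing values may come from, and one way to avoid the warnings: a minimal sketch, assuming the NaN row stems from a group whose IAT_D values are all NA and the NA bounds from a group with a single observation. The question does not show those rows, so this is a guess. The names bars and errs are hypothetical.

```r
# Base R behaviour that can produce such entries in a summary table
mean(c(NA_real_, NA_real_), na.rm = TRUE)  # NaN: nothing left to average
sd(0.5)                                    # NA: sd needs at least two values

means2 <- data.frame(
  explicit = c(0, 2, 3, 4, 1, 2, 3, 4),
  id       = c(0, 0, 0, 0, 1, 1, 1, 1),
  IAT_D    = c(NaN, 0.23501191, 0.31478389, -0.24296625,
               -0.04010111, 0.21939286, 0.29097806, 0.22965463),
  lo       = c(NaN, 0.1091807, 0.2311406, -0.3241166,
               NA, 0.1109138, 0.1973051, 0.1209229),
  hi       = c(NaN, 0.3608431, 0.3984272, -0.1618159,
               NA, 0.3278719, 0.3846511, 0.3383864)
)

# Rows usable for bars and for error bars (is.na() is TRUE for NaN too)
bars <- means2[!is.na(means2$IAT_D), ]
errs <- means2[complete.cases(means2[, c("lo", "hi")]), ]

nrow(means2) - nrow(bars)  # rows geom_bar would drop
nrow(means2) - nrow(errs)  # rows geom_errorbar would drop
```

Passing bars to geom_bar() and errs to geom_errorbar() through their data arguments should then draw the same plot without the warnings.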

How to plot using multiple criteria in R?

Following are first 15 rows of my data:
> head(df,15)
frame.group class lane veh.count mean.speed
1 [22,319] 2 5 9 23.40345
2 [22,319] 2 4 9 24.10870
3 [22,319] 2 1 11 14.70857
4 [22,319] 2 3 8 20.88783
5 [22,319] 2 2 6 16.75327
6 (319,616] 2 5 15 22.21671
7 (319,616] 2 2 16 23.55468
8 (319,616] 2 3 12 22.84703
9 (319,616] 2 4 14 17.55428
10 (319,616] 2 1 13 16.45327
11 (319,616] 1 1 1 42.80160
12 (319,616] 1 2 1 42.34750
13 (616,913] 2 5 18 30.86468
14 (319,616] 3 3 2 26.78177
15 (616,913] 2 4 14 32.34548
'frame.group' contains time intervals, 'class' is the vehicle class (1 = motorcycles, 2 = cars, 3 = trucks), and 'lane' contains lane numbers. I want to create 3 scatter plots, one for each class, with frame.group on the x-axis and mean.speed on the y-axis. Within the scatter plot for one vehicle class, e.g. cars, I want 5 plots, i.e. one for each lane. I tried the following:
cars <- subset(df, class==2)
by(cars, lane, FUN = plot(frame.group, mean.speed))
There are two problems:
1) R does not plot as expected i.e. 5 plots for 5 different lanes.
2) Only one is plotted, and that one is a box plot, probably because I used intervals instead of numbers on the x-axis.
How can I fix the above issues? Please help.
Each time a new plot command is issued, R replaces the existing plot with the new plot. You can create a grid of plots by calling par(mfrow=c(1,5)) before plotting, which gives 1 row with 5 plots (other values give other numbers of rows and columns). If you want a scatterplot instead of a boxplot, you can use plot.default.
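The base-graphics route described above could look roughly like this. It is a sketch on a hand-typed excerpt of the question's data (cars only). Instead of calling plot.default on the factor directly, it plots the integer position of each interval, which also dispatches to plot.default, and labels the ticks by hand. The names lanes, grp and op are hypothetical.

```r
# Hand-typed excerpt of the question's data (cars only, class == 2)
cars <- data.frame(
  frame.group = factor(rep(c("[22,319]", "(319,616]"), each = 5),
                       levels = c("[22,319]", "(319,616]")),
  lane        = c(5, 4, 1, 3, 2, 5, 2, 3, 4, 1),
  mean.speed  = c(23.40345, 24.10870, 14.70857, 20.88783, 16.75327,
                  22.21671, 23.55468, 22.84703, 17.55428, 16.45327)
)

lanes <- sort(unique(cars$lane))
grp   <- levels(cars$frame.group)

op <- par(mfrow = c(1, length(lanes)))  # one row, one panel per lane
for (l in lanes) {
  d <- cars[cars$lane == l, ]
  # Plot the interval's position (1, 2, ...) so we get points, not boxes
  plot(as.integer(d$frame.group), d$mean.speed,
       xaxt = "n", xlim = c(0.5, length(grp) + 0.5),
       xlab = "frame.group", ylab = "mean.speed",
       main = paste("Lane", l))
  axis(1, at = seq_along(grp), labels = grp)  # label ticks with the intervals
}
par(op)  # restore the previous layout
```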
It is easier to do all this with the ggplot2 library instead of the base graphics, and the resulting plot will look much nicer:
library(ggplot2)
ggplot(cars,aes(x=frame.group,y=mean.speed))+geom_point()+facet_wrap(~lane)
See the ggplot2 documentation for more details: http://docs.ggplot2.org/current/

How to fill/shade space between timelines, plotted with ggplot?

I have a 12x13 matrix that looks like this:
monat beob werex_00 werex_11 werex_22 werex_33 werex_44 werex_55 werex_66 werex_77 werex_88 werex_99 Min Max
1 22.4930171 9.1418697 8.1558828 8.0312839 10.013298 8.8922567 9.395811 10.7933080 6.5136136 8.721697 10.279974 0.108381 59.65309
2 25.1414834 13.5886794 9.1694683 10.8709352 13.021066 10.3316655 10.579970 17.0555902 7.5915886 11.035921 13.366310 0.924013 66.94970
3 33.8286673 16.3800292 10.0202342 11.3072626 17.674761 16.1370288 15.018551 15.3331395 12.6856599 15.479521 13.929905 -0.794309 78.78572
4 22.0579421 11.9930633 8.4899130 8.2304118 12.987301 7.8763578 8.554007 12.4956321 9.4723508 7.057423 7.688662 -10.496481 49.01380
5 2.5535161 -2.4503375 -4.2354520 -3.6309377 -2.969866 -4.5876993 -5.383716 -3.2612018 -5.2054387 -2.780719 -4.359513 -19.579135 32.54282
6 -2.4405826 -8.8534136 -9.4666674 -7.4249244 -7.820072 -9.1485440 -8.546798 -7.8179739 -7.4222923 -10.978398 -12.644807 -22.821617 18.62139
7 -2.2580848 -6.7569968 -8.3390114 -8.8757506 -8.248305 -8.4171552 -7.760800 -5.7471163 -8.7864075 -6.239596 -8.870658 -22.933219 20.84375
8 -0.3448858 -5.6683742 -5.0467756 -5.7201820 -2.800106 -5.9640095 -5.011171 -3.3557601 -2.8967683 -4.407761 -6.146411 -17.042893 17.86556
9 3.3963303 0.4305926 -0.8554308 -0.9985536 -1.184610 -0.5520555 0.347758 -0.3838614 -0.2199835 -1.174712 -1.630363 -8.533647 19.66163
10 5.1839209 1.6050281 1.1578316 1.8503193 2.327975 1.6633771 1.557532 1.5563157 2.2776155 1.667714 1.333829 -4.686715 31.17342
11 9.2551418 4.4810518 2.9992301 4.9848408 3.824927 4.2413024 3.939119 5.4256008 3.5804488 4.965302 3.790589 -1.615777 43.90991
12 18.2233848 7.7648233 6.3344735 7.3477135 6.573620 7.1884950 7.428654 7.3119002 6.9405167 7.663072 8.342437 0.014096 62.83760
These are time series of a certain value. In the next step I plot them with ggplot(). To do so, I used melt() to get the matrix into shape for plotting:
R1_Grundwasserneubildung_Rg1Rg2_Monat_mean_druckreif <- melt(R1_Grundwasserneubildung_Rg1Rg2_Monat_mean, na.rm = FALSE, id.vars="monat")
The data now looks like this:
Monat Projektion value
1 1 beob 22.4930171
2 2 beob 25.1414834
3 3 beob 33.8286673
4 4 beob 22.0579421
5 5 beob 2.5535161
6 6 beob -2.4405826
7 7 beob -2.2580848
8 8 beob -0.3448858
9 9 beob 3.3963303
10 10 beob 5.1839209
11 11 beob 9.2551418
12 12 beob 18.2233848
13 1 werex_00 9.1418697
14 2 werex_00 13.5886794
15 3 werex_00 16.3800292
16 4 werex_00 11.9930633
17 5 werex_00 -2.4503375
18 6 werex_00 -8.8534136
19 7 werex_00 -6.7569968
20 8 werex_00 -5.6683742
21 9 werex_00 0.4305926
22 10 werex_00 1.6050281
23 11 werex_00 4.4810518
24 12 werex_00 7.7648233
25 1 werex_11 8.1558828
... ... ... ...
I also added some new names for the melted data (as already seen above):
names(R1_Grundwasserneubildung_Rg1Rg2_Monat_mean_druckreif)<-c("Monat","Projektion","value")
Next step defines some custom colors for the plot:
Projektionen_Farben<-c("#000000","#00EEEE","#EEAD0E","#006400","#BDB76B","#EE7600","#68228B","#8B0000","#1E90FF","#EE6363","#556B2F","#D6D6D6","#D6D6D6")
Now I plot the melted data:
ggplot(R1_Grundwasserneubildung_Rg1Rg2_Monat_mean_druckreif,
       aes(x=Monat,y=value,color=Projektion,group=Projektion)) +
  geom_line(size=0.8) +
  xlab("Monat") +
  ylab("Grundwasserneubildung [mm/Monat]") +
  ggtitle("Grundwasserneubildung") +
  theme_bw() +
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7,8,9,10,11,12),
                     labels = c("Jan","Feb","Mär","Apr","Mai","Jun","Jul","Aug","Sep","Okt","Nov","Dez")) +
  theme(axis.title=element_text(size=15,vjust = 0.3, face="bold"),
        title=element_text(size=15,vjust = 1.5,face="bold")) +
  scale_colour_manual(values = Projektionen_Farben)
Sorry, but I haven't got enough reputation to post a pic of the plot.
Now I want to fill/shade the space between the Max line and the Min line with, let's say, a light grey (alpha=.3). I have tried geom_ribbon() but haven't found the right way to define x, ymin and ymax as needed. Does anyone know a way to fill the space between these two lines?
Use your original data frame for the geom_ribbon() and provide columns Min and Max as ymin and ymax.
+ geom_ribbon(data=R1_Grundwasserneubildung_Rg1Rg2_Monat_mean,
              aes(x=monat,ymin=Min,ymax=Max),
              inherit.aes=FALSE,alpha=0.3,color="grey30")
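A self-contained sketch of the same idea on the first three months of the data: the lines come from the long (melted) data, the ribbon from the wide data. The short names wide and long are hypothetical, and leaving Min and Max out of the long data (so they are not also drawn as lines) is an optional choice made here, not part of the answer above.

```r
library(ggplot2)

# Wide data: first three months, only beob plus the Min/Max envelope
wide <- data.frame(
  monat = 1:3,
  beob  = c(22.4930171, 25.1414834, 33.8286673),
  Min   = c(0.108381, 0.924013, -0.794309),
  Max   = c(59.65309, 66.94970, 78.78572)
)

# Long data for the lines; Min and Max are left out on purpose
long <- data.frame(
  Monat      = wide$monat,
  Projektion = "beob",
  value      = wide$beob
)

p <- ggplot(long, aes(x = Monat, y = value, color = Projektion, group = Projektion)) +
  # Ribbon first, so the lines are drawn on top of it
  geom_ribbon(data = wide, aes(x = monat, ymin = Min, ymax = Max),
              inherit.aes = FALSE, fill = "grey50", alpha = 0.3) +
  geom_line() +
  theme_bw()
print(p)
```

inherit.aes = FALSE matters here because the wide data has no value or Projektion column for the ribbon layer to inherit.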

How to subset data for additional geoms while using facets in ggplot2?

I want additional 'geoms' to apply only to a subset of the initial data. I would like this subset to be taken from each of the units created by facets=~.
My attempts at subsetting either the data or the plotted variables lead to subsetting of the whole data set, rather than of the units created by 'facets=~', and in two different ways (apparently dependent on the sorting of the data).
This difficulty appears with any 'geom' when using 'facets'.
library(ggplot2)
test.data<-data.frame(factor=rep(c("small", "big"), each=9),
x=c(c(1,2,3,3,3,2,1,1,1), 2*c(1,2,3,3,3,2,1,1,1)),
y=c(c(1,1,1,2,3,3,3,2,1), 2*c(1,1,1,2,3,3,3,2,1)))
factor x y
1 small 1 1
2 small 2 1
3 small 3 1
4 small 3 2
5 small 3 3
6 small 2 3
7 small 1 3
8 small 1 2
9 small 1 1
10 big 2 2
11 big 4 2
12 big 6 2
13 big 6 4
14 big 6 6
15 big 4 6
16 big 2 6
17 big 2 4
18 big 2 2
qplot(data=test.data,
      x=x,
      y=y,
      geom="polygon",
      facets=~factor)+
  geom_polygon(data=test.data[c(2,3,4,5,6,2),],
               aes(x=x,
                   y=y),
               fill=I("red"))
qplot(data=test.data,
      x=x,
      y=y,
      geom="polygon",
      facets=~factor)+
  geom_polygon(aes(x=x[c(2,3,4,5,6,2)],
                   y=y[c(2,3,4,5,6,2)]),
               fill=I("red"))
The answer is to subset the data in a first step.
library(ggplot2)
library(plyr)
test.data<-data.frame(factor=rep(c("small", "big"), each=9),
                      x=c(c(1,2,3,3,3,2,1,1,1), 2*c(1,2,3,3,3,2,1,1,1)),
                      y=c(c(1,1,1,2,3,3,3,2,1), 2*c(1,1,1,2,3,3,3,2,1)))
subset.test<-ddply(.data=test.data,
                   .variables="factor",
                   function(data){
                     data[c(2,3,4,5,6,2),]})
qplot(data=test.data,
      x=x,
      y=y,
      geom="polygon",
      facets=~factor)+
  geom_polygon(data=subset.test,
               fill=I("red"))
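The same per-group subsetting can be sketched without plyr, using base R split() and lapply(), and with ggplot() in place of qplot(), which has been deprecated in recent ggplot2 releases. This is an alternative under those assumptions, not the answer's own code. The name pieces is hypothetical.

```r
library(ggplot2)

test.data <- data.frame(
  factor = rep(c("small", "big"), each = 9),
  x = c(c(1, 2, 3, 3, 3, 2, 1, 1, 1), 2 * c(1, 2, 3, 3, 3, 2, 1, 1, 1)),
  y = c(c(1, 1, 1, 2, 3, 3, 3, 2, 1), 2 * c(1, 1, 1, 2, 3, 3, 3, 2, 1))
)

# Take the same row positions within each facet group, then stack the pieces
pieces <- lapply(split(test.data, test.data$factor),
                 function(d) d[c(2, 3, 4, 5, 6, 2), ])
subset.test <- do.call(rbind, pieces)

p <- ggplot(test.data, aes(x = x, y = y)) +
  geom_polygon() +                                  # full outline per facet
  geom_polygon(data = subset.test, fill = "red") +  # subset, also split by facet
  facet_wrap(~factor)
print(p)
```

Because subset.test keeps the factor column, facet_wrap() routes each piece to its own panel.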
