ggplot boxplot only shows one box instead of 10, how to fix? - r

My data is in this format
  Responder_status variable value
1 good             AHSP     0.01
2 good             AHSP     1.16
3 poor             AHSP     0.00
4 good             HBB      0.25
It keeps going for all 10 variables, a row for each cell (792 cells). So in total I have 7920 rows. Here's the output of str.
'data.frame': 7920 obs. of 3 variables:
$ Responder_status: Factor w/ 3 levels "good","poor",..: 1 1 1 1 1 1 1 1 1 1 ...
$ variable : Factor w/ 10 levels "AHSP","APOC1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 8.76 1.62 10.35 2.58 0 ...
When I plot a boxplot for it like this:
library(ggplot2)
ggplot(data, aes(x=factor(variable), y=value))+ geom_boxplot(aes(fill=factor(Responder_status)))
or like this:
ggplot(data, aes(x=factor(variable), y=value, fill=factor(Responder_status))) + geom_boxplot()
I get the following plot:
Why do I only get the box for my final variable and not for all of them (what I want)?

You can try wrapping fill inside the aesthetic function like below:
library(ggplot2)
ggplot(data, aes(x=factor(variable), y=value, fill=factor(Responder_status)))+
geom_boxplot()

Related

Where should I reorder a bar graph so the bar groups follow the same sequence as the data frame?

I have a dataframe like this:
> str(mydata6)
'data.frame': 6 obs. of 4 variables:
$ Comparison : Factor w/ 6 levels "Decreased_Adult",..: 5 2 6 3 4 1
$ differential_IR_number: num 446 305 965 599 1799 ...
$ Stage : Factor w/ 3 levels "AdultvsE11","E14vsE11",..: 2 2 3 3 1 1
$ Change : Factor w/ 2 levels "Decrease","Increase": 2 1 2 1 2 1
Columns 1, 3 and 4 are factors and column 2 is numeric.
I used the following code to do a bargraph:
ggplot(mydata6, aes(x=Stage, y=differential_IR_number, fill=Change)) + #don't need to use "" for x= and y, comparing to the above code
geom_bar(stat = "identity", position = "stack") + #using stack to make decrease and increase stack with each other
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + #using theme function to change the labeling to be vertical
geom_text(aes(label=differential_IR_number), position=position_stack(vjust=0.5))
The result is following:
But I want the order to be E14vsE11, E18vsE11 and AdultvsE11. I tried to reorder/sort at different positions but none of them worked.
Why does the plot not follow the order of my data frame?
The order is the one of the levels of the factor. You can set the order you want as follows:
mydata6$Stage <- factor(mydata6$Stage, levels = c("E14vsE11", "E18vsE11", "AdultvsE11"))
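To see why this works, here is a minimal sketch on a toy vector (the name stage is made up for illustration and is not the asker's data). By default factor() sorts the levels alphabetically, and ggplot2 draws discrete axes in level order, so setting levels explicitly changes the plotting order without touching the rows.

```r
# Toy stand-in for mydata6$Stage (hypothetical vector, same labels as the question)
stage <- factor(c("E14vsE11", "E14vsE11", "E18vsE11",
                  "E18vsE11", "AdultvsE11", "AdultvsE11"))

# Default levels are sorted, which is the order the bars were drawn in
default_levels <- levels(stage)

# Re-create the factor with the levels in the desired order
stage <- factor(stage, levels = c("E14vsE11", "E18vsE11", "AdultvsE11"))
new_levels <- levels(stage)

print(default_levels)
print(new_levels)
```

The values themselves are unchanged; only the level order, and therefore the axis order, differs.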

How to change order for pyramid plots with ggplot2 to dataset order?

I have a dataset with climate suitability values (0-1) for tree species for both present and future.
I would like to visualise the data in a pyramid plot with the ggplot2 package, where present should be displayed on the left side of the plot and future on the right side, with the tree species in the order given in my raw dataset.
b2010<-read.csv("csi_before2010_abund_order.csv",header=T,sep = ";")
str(b2010)
'data.frame': 20 obs. of 7 variables:
$ species: Factor w/ 10 levels "Acer platanoides",..: 9 9 7 7 8 8 6 6 5 5 ...
$ time : Factor w/ 2 levels "future","present": 2 1 2 1 2 1 2 1 2 1 ...
$ grid1 : num 0.6001 0.5945 0.6366 0.0424 0.6941 ...
$ grid2 : num 0.6399 0.5129 0.6981 0.0399 0.711 ...
$ grid3 : num 0.6698 0.5212 0.6863 0.0446 0.6795 ...
$ mean : num 0.6366 0.5429 0.6737 0.0423 0.6949 ...
$ group : Factor w/ 1 level "before 2010": 1 1 1 1 1 1 1 1 1 1 ...
b2010$mean = ifelse(b2010$time == "future", b2010$mean * -1,b2010$mean)
head(b2010)
species time grid1 grid2 grid3 mean group
1 Tilia europaea present 0.60009009 0.63990200 0.66975713 0.63658307 before 2010
2 Tilia europaea future 0.59452874 0.51294094 0.52115256 -0.54287408 before 2010
3 Sorbus intermedia present 0.63659602 0.69813931 0.68629903 0.67367812 before 2010
4 Sorbus intermedia future 0.04242327 0.03990654 0.04460707 -0.04231229 before 2010
5 Tilia cordata present 0.69414478 0.71097034 0.67950863 0.69487458 before 2010
6 Tilia cordata future 0.55790818 0.53918493 0.51979470 -0.53896260 before 2010
ggplot(b2010, aes(x = factor(species), y = mean, fill = time)) +
geom_bar(stat = "identity") +
facet_share(~time, dir = "h", scales = "free", reverse_num = T) +
coord_flip()
Now, future and present are in the wrong order and also the species are ordered alphabetically, even though they are clearly "factors" and should therefore be ordered according to my dataset. I would very much appreciate your help.
Thank you and kind regards
You are misunderstanding how factors work. Bars are plotted in the order printed by levels(b2010$species), which is alphabetical by default, not in the order of the rows. To change this order, you have to set the levels manually, e.g.
b2010$species <- factor(b2010$species,
levels = c("Sorbus intermedia", "Tilia cordata", ...))
The levels can also be derived from some statistic, e.g. the mean. To do that, you would do something along the lines of
# order() returns row indices, so subset to "present" first and then sort
present <- b2010[b2010$time == "present", ]
myorder <- as.character(present$species[order(present$mean)])
b2010$species <- factor(b2010$species, levels = myorder)
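If the goal is simply the order in which species first appear in the dataset, as the question asks, one option is to take the levels from unique(). This is a sketch on a small hypothetical vector rather than the real b2010 data. Because coord_flip() places the first level at the bottom of the axis, reversing the levels is one way to get the first species at the top.

```r
# Hypothetical excerpt of the species column, in raw dataset order
species <- c("Tilia europaea", "Tilia europaea",
             "Sorbus intermedia", "Sorbus intermedia",
             "Tilia cordata", "Tilia cordata")

# Default: levels sorted alphabetically
alphabetical <- factor(species)

# Levels in order of first appearance in the data
data_order <- factor(species, levels = unique(species))

# Reversed, so the first species ends up at the top after coord_flip()
flipped_order <- factor(species, levels = rev(unique(species)))

print(levels(alphabetical))
print(levels(data_order))
print(levels(flipped_order))
```

The same idea would apply to the time column if present should come before future.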

R ggplot - Error stat_bin requires continuous x variable

My table is data.combined with following structure:
'data.frame': 1309 obs. of 12 variables:
$ Survived: Factor w/ 3 levels "0","1","None": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ Sex : num 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
$ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ Title : Factor w/ 4 levels "Master.","Miss.",..: 3 3 2 3 3 3 3 1 3 3 ...
I want to draw a graph to reflect the relationship between Title and Survived, categorized by Pclass. I used the following code:
ggplot(data.combined[1:891,], aes(x=Title, fill = Survived)) +
geom_histogram(binwidth = 0.5) +
facet_wrap(~Pclass) +
ggtitle ("Pclass") +
xlab("Title") +
ylab("Total count") +
labs(fill = "Survived")
However this results in error: Error: StatBin requires a continuous x variable the x variable is discrete. Perhaps you want stat="count"?
If I change variable Title into numeric: data.combined$Title <- as.numeric(data.combined$Title) then the code works but the label in the graph is also numeric (below). Please tell me why it happens and how to fix it. Thanks.
Btw, I use R 3.2.3 on Mac OS X El Capitan.
Graph: Instead of Mr, Miss,Mrs the x axis shows numeric values 1,2,3,4
To sum up the answers from the comments above:
1 - Replace geom_histogram(binwidth=0.5) with geom_bar(). Note that geom_bar() has no binwidth argument; the bar width is set with width instead.
2 - Alternatively, use stat_count(width = 0.5) in place of geom_bar() or geom_histogram(binwidth = 0.5).
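As a rough illustration of the first option, the sketch below counts a discrete variable with geom_bar(). The data frame toy and its values are invented for the example and are not the Titanic data. geom_histogram() bins a continuous x, whereas geom_bar() uses stat_count by default and so accepts a factor directly.

```r
library(ggplot2)

# Invented data: a discrete x variable, as in the question
toy <- data.frame(Title = factor(c("Mr.", "Mr.", "Miss.", "Mrs.", "Mr.")))

# geom_bar() counts rows per level; width controls how wide the bars are
p <- ggplot(toy, aes(x = Title)) +
  geom_bar(width = 0.5) +
  xlab("Title") +
  ylab("Total count")

# Inspect the counts computed for the bar layer
counts <- layer_data(p)$count
print(counts)
```

The axis keeps the factor labels, so there is no need to convert Title to numeric.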
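# Return the first title found in a passenger name, or "Other".
# fixed = TRUE makes grep treat "." literally instead of as a regex wildcard.
extractTitle <- function(Name) {
  Name <- as.character(Name)
  if (length(grep("Miss.", Name, fixed = TRUE)) > 0) {
    return("Miss.")
  } else if (length(grep("Master.", Name, fixed = TRUE)) > 0) {
    return("Master.")
  } else if (length(grep("Mrs.", Name, fixed = TRUE)) > 0) {
    return("Mrs.")
  } else if (length(grep("Mr.", Name, fixed = TRUE)) > 0) {
    return("Mr.")
  } else {
    return("Other")
  }
}
# Apply the function to every name; vapply avoids growing a vector inside a loop
titles <- vapply(as.character(data.combined$Name), extractTitle,
                 character(1), USE.NAMES = FALSE)
data.combined$title <- as.factor(titles)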
ggplot(data.combined[1:891,], aes(x = title, fill = Survived))+
geom_bar(width = 0.5) +
facet_wrap("Pclass")+
xlab("Title")+
ylab("total count")+
labs(fill = "Survived")
As stated above, use geom_bar() instead of geom_histogram(). See the sample code below (I wanted a separate graph for each month of birth date data):
ggplot(data = pf,aes(x=dob_day))+
geom_bar()+
scale_x_discrete(breaks = 1:31)+
facet_wrap(~dob_month,ncol = 3)
I had the same issue but none of the above solutions worked. Then I noticed that the column of the data frame I wanted to use for the histogram wasn't numeric:
df$variable<- as.numeric(as.character(df$variable))
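The detour through as.character() matters because calling as.numeric() directly on a factor returns the internal level codes, not the values. This is also why converting Title with as.numeric() in the question produced 1, 2, 3, 4 on the axis. A small sketch with a made-up factor f:

```r
# Made-up factor whose labels look like numbers
f <- factor(c("10", "5", "20"))

# Levels are sorted as strings: "10", "20", "5"
lev <- levels(f)

# Direct conversion gives the position of each value among the levels
codes <- as.numeric(f)

# Converting to character first recovers the actual numbers
values <- as.numeric(as.character(f))

print(lev)
print(codes)
print(values)
```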
I had the same error. In my original code, I read my .csv file with read_csv(). After I changed the file into .xlsx and read it with read_excel(), the code ran smoothly.

ggplot2: overlay control group line on graph panel set

I have a stacked areaplot made with ggplot2:
dists.med.areaplot<-qplot(starttime,value,fill=dists,facets=~groupname,
geom='area',data=MDist.median, stat='identity') +
labs(y='median distances', x='time(s)', fill='Distance Types')+
opts(title=subt) +
scale_fill_brewer(type='seq') +
facet_wrap(~groupname, ncol=2) + grect #grect adds the grey/white vertical bars
It looks like this:
I want to add an overlay of the profile of the control graph (bottom right) to all the graphs in the output (groupname==rowH is the control).
So far my best efforts have yielded this:
cline<-geom_line(aes(x=starttime,y=value),
data=subset(dists.med,groupname=='rowH'),colour='red')
dists.med.areaplot + cline
I need the 3 red lines to be 1 red line that skims the top of the dark blue section. And I need that identical line (the rowH line) to overlay each of the panels.
The dataframe looks like this:
> str(MDist.median)
'data.frame': 2880 obs. of 6 variables:
$ groupname: Factor w/ 8 levels "rowA","rowB",..: 1 1 1 1 1 1 1 1 1 1 ...
$ fCycle : Factor w/ 6 levels "predark","Cycle 1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ fPhase : Factor w/ 2 levels "Light","Dark": 2 2 2 2 2 2 2 2 2 2 ...
$ starttime: num 0.3 60 120 180 240 300 360 420 480 540 ...
$ dists : Factor w/ 3 levels "inadist","smldist",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 110 117 115 113 114 ...
The red line should be calculated as the sum of the value at each starttime, where groupname='rowH'. I have tried creating cline the following ways. Each results in an error or incorrect output:
#sums the entire y for all points and makes horizontal line
cline<-geom_line(aes(x=starttime,y=sum(value)),data=subset(dists.med,groupname=='rowH'),colour='red')
#using related dataset with pre-summed y's
> cline<-geom_line(aes(x=starttime,y=tot_dist),data=subset(t.med,groupname=='rowH'))
> dists.med.areaplot + cline
Error in eval(expr, envir, enclos) : object 'dists' not found
Thoughts?
ETA:
It appears that the issue I was having with 'dists' not found has to do with the fact that the initial plot, dists.med.areaplot was created via qplot. To avoid this issue, I can't build on a qplot. This is the code for the working plot:
cline.data <- subset(
ddply(MDist.median, .(starttime, groupname), summarize, value = sum(value)),
groupname == "rowH")
cline<-geom_line(data=transform(cline.data,groupname=NULL), colour='red')
dists.med.areaplot<-ggplot(MDist.median, aes(starttime, value)) +
grect + nogrid +
geom_area(aes(fill=dists),stat='identity') +
facet_grid(~groupname)+ scale_fill_brewer(type='seq') +
facet_wrap(~groupname, ncol=2) +
cline
resulting in this graphset:
This Learning R blog post should be of some help:
http://learnr.wordpress.com/2009/12/03/ggplot2-overplotting-in-a-faceted-scatterplot/
It might be worth computing the summary outside of ggplot with plyr.
cline.data <- ddply(MDist.median, .(starttime, groupname), summarize, value = sum(value))
cline.data.subset <- subset(cline.data, groupname == "rowH")
Then add it to the plot with
last_plot() + geom_line(data = transform(cline.data.subset, groupname = NULL), color = "red")
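The key step is removing the faceting column from the overlay data: a layer whose data lacks groupname is drawn in every panel. Below is a self-contained sketch of the same idea on invented data, using base R's aggregate() in place of plyr's ddply() (a substitution made for this example, not the answer's method). The data frame toy and its values are hypothetical.

```r
library(ggplot2)

# Invented data: 3 time points x 2 groups x 2 distance types
toy <- expand.grid(starttime = c(0, 60, 120),
                   groupname = c("rowA", "rowH"),
                   dists = c("inadist", "smldist"))
toy$value <- seq_len(nrow(toy))

# Sum the stacked values per time point and group
totals <- aggregate(value ~ starttime + groupname, data = toy, FUN = sum)

# Keep only the control group and drop the faceting column,
# so the line is repeated in every facet
control <- subset(totals, groupname == "rowH", select = -groupname)

p <- ggplot(toy, aes(starttime, value)) +
  geom_area(aes(fill = dists)) +
  geom_line(data = control, colour = "red") +
  facet_wrap(~groupname, ncol = 2)

print(control)
```

Because the line uses the summed values, it should follow the top of the stacked area in the control panel.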

stacked barchart with lattice: is my data too big?

I want a graph that looks similar to the example given in the lattice docs:
#EXAMPLE GRAPH, not my data
> barchart(yield ~ variety | site, data = barley,
+ groups = year, layout = c(1,6), stack = TRUE,
+ auto.key = list(points = FALSE, rectangles = TRUE, space = "right"),
+ ylab = "Barley Yield (bushels/acre)",
+ scales = list(x = list(rot = 45)))
I melted my data to obtain this "long" form dataframe:
> str(MDist)
'data.frame': 34560 obs. of 6 variables:
$ fCycle : Factor w/ 2 levels "Dark","Light": 2 2 2 2 2 2 2 2 2 2 ...
$ groupname: Factor w/ 8 levels "rowA","rowB",..: 1 1 1 1 1 1 1 1 1 1 ...
$ location : Factor w/ 96 levels "c1","c10","c11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ timepoint: num 1 2 3 4 5 6 7 8 9 10 ...
$ variable : Factor w/ 3 levels "inadist","smldist",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 55.7 75.3 99.2 45.9 73.8 79.3 73.5 69.8 67.6 ...
I want to create a stacked barchart for each groupname and fCycle. I tried this:
barchart(value~timepoint|groupname*fCycle, data=MDist, groups=variable,stack=T)
It doesn't throw any errors, but it's still thinking after 30 minutes. Is this because it doesn't know how to deal with the 36 values that contribute to each bar? How can I make this data easier for barchart to digest?
I don't know lattice well, but could it be because your timepoint variable is numeric, not a factor?
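A sketch of what that suggestion could look like, on invented data: collapse the replicates to one value per bar segment, then convert timepoint to a factor before calling barchart(). Everything here is an assumption for illustration, including the data frame toy, the third level name "lardist", and the choice of median as the summary; it has not been tested on the asker's data.

```r
library(lattice)

# Invented data: several replicate rows per timepoint/variable/group
set.seed(1)
toy <- expand.grid(timepoint = 1:5,
                   variable = c("inadist", "smldist", "lardist"),
                   groupname = c("rowA", "rowB"),
                   rep = 1:4)
toy$value <- runif(nrow(toy))

# One summary value per bar segment (median is an arbitrary choice here)
agg <- aggregate(value ~ timepoint + variable + groupname,
                 data = toy, FUN = median)

# Make the x variable categorical so barchart() draws one bar per timepoint
agg$timepoint <- factor(agg$timepoint)

p <- barchart(value ~ timepoint | groupname, data = agg,
              groups = variable, stack = TRUE)
print(p)
```

With both sides of the formula numeric, barchart() has to treat one of them as categorical, which may be what makes the original call so slow.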
