I'm struggling to get the right ordering of variables in a graph I made with ggplot2 in R.
Suppose I have a dataframe such as:
set.seed(1234)
my_df<- data.frame(matrix(0,8,4))
names(my_df) <- c("year", "variable", "value", "vartype")
my_df$year <- rep(2006:2007)
my_df$variable <- c(rep("VX",2),rep("VB",2),rep("VZ",2),rep("VD",2))
my_df$value <- runif(8, 5,10)
my_df$vartype<- c(rep("TA",4), rep("TB",4))
which yields the following table:
year variable value vartype
1 2006 VX 5.568517 TA
2 2007 VX 8.111497 TA
3 2006 VB 8.046374 TA
4 2007 VB 8.116897 TA
5 2006 VZ 9.304577 TB
6 2007 VZ 8.201553 TB
7 2006 VD 5.047479 TB
8 2007 VD 6.162753 TB
There are four variables (VX, VB, VZ and VD), belonging to two groups of variable types, (TA and TB).
I would like to plot the values as horizontal bars on the y axis, ordered vertically first by variable groups and then by variable names, faceted by year, with values on the x axis and fill colour corresponding to variable group.
(i.e. in this simplified example, the order should be, top to bottom, VB, VX, VD, VZ)
1) My first attempt has been to try the following:
ggplot(my_df,
aes(x=variable, y=value, fill=vartype, order=vartype)) +
# adding or removing the aesthetic "order=vartype" doesn't change anything
geom_bar() +
facet_grid(. ~ year) +
coord_flip()
However, the variables are listed in reverse alphabetical order, but not by vartype : the order=vartype aesthetic is ignored.
2) Following an answer to a similar question I posted yesterday, I tried the following, based on the post Order Bars in ggplot2 bar graph:
my_df$variable <- factor(
my_df$variable,
levels=rev(sort(unique(my_df$variable))),
ordered=TRUE
)
This approach does get the variables in vertical alphabetical order in the plot, but ignores the fact that the variables should be ordered first by variable groups (with TA-variables on top and TB-variables below).
3) The following gives the same as 2 (above):
my_df$vartype <- factor(
my_df$vartype,
levels=sort(unique(my_df$vartype)),
ordered=TRUE
)
... which has the same issues as the first approach (variables listed in reverse alphabetical order, groups ignored)
4) Another approach, based on the original answer to Order Bars in ggplot2 bar graph, also gives the same plot as 2, above:
my_df <- within(my_df,
vartype <- factor(vartype,
levels=names(sort(table(vartype),
decreasing=TRUE)))
)
I'm puzzled by the fact that, despite several approaches, the aesthetic order=vartype is ignored. Still, it seems to work in an unrelated problem: http://learnr.wordpress.com/2010/03/23/ggplot2-changing-the-default-order-of-legend-labels-and-stacking-of-data/
I hope that the problem is clear and welcome any suggestions.
Matteo
I posted a similar question yesterday, but unfortunately I made several mistakes when describing the problem and providing a reproducible example.
I've listened to several suggestions since, thoroughly searched Stack Overflow for similar questions and applied, to the best of my knowledge, every suggested combination of solutions, to no avail.
I'm posting the question again hoping to be able to solve my issue and, hopefully, be helpful to others.
This has little to do with ggplot, but is instead a question about generating an ordering of variables to use to reorder the levels of a factor. Here is your data, implemented using the various functions to better effect:
set.seed(1234)
df2 <- data.frame(year = rep(2006:2007),
variable = rep(c("VX","VB","VZ","VD"), each = 2),
value = runif(8, 5,10),
vartype = rep(c("TA","TB"), each = 4))
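Note that this way variable and vartype are factors (in R versions before 4.0.0, where data.frame() converts strings to factors by default; in later versions pass stringsAsFactors = TRUE to get the same result). If they aren't factors, ggplot() will coerce them and then you get left with alphabetical ordering. I have said this before and will no doubt say it again: get your data into the correct format before you start plotting or doing data analysis.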
You want the following ordering:
> with(df2, order(vartype, variable))
[1] 3 4 1 2 7 8 5 6
where you should note that we get the ordering by vartype first and only then by variable within the levels of vartype. If we use this to reorder the levels of variable we get:
> with(df2, reorder(variable, order(vartype, variable)))
[1] VX VX VB VB VZ VZ VD VD
attr(,"scores")
VB VD VX VZ
1.5 5.5 3.5 7.5
Levels: VB VX VD VZ
(ignore the attr(,"scores") bit and focus on the Levels). This has the right ordering, but ggplot() will draw them bottom to top and you wanted top to bottom. I'm not sufficiently familiar with ggplot() to know if this can be controlled, so we will also need to reverse the ordering using decreasing = TRUE in the call to order().
Putting this all together we have:
## reorder `variable` on `variable` within `vartype`
df3 <- transform(df2, variable = reorder(variable, order(vartype, variable,
decreasing = TRUE)))
Which when used with your plotting code:
ggplot(df3, aes(x=variable, y=value, fill=vartype)) +
geom_bar(stat = "identity") +
facet_grid(. ~ year) +
coord_flip()
produces a plot with the bars ordered, top to bottom, VB, VX, VD, VZ.
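An alternative that avoids averaging positions inside reorder() is to build the level order explicitly. This is only a sketch under the same example data; the helper names idx and lev are made up for illustration and are not part of the answer above.

```r
set.seed(1234)
df2 <- data.frame(year     = rep(2006:2007, times = 4),
                  variable = rep(c("VX", "VB", "VZ", "VD"), each = 2),
                  value    = runif(8, 5, 10),
                  vartype  = rep(c("TA", "TB"), each = 4))

# Row positions sorted by group first, then by name within each group
idx <- order(df2$vartype, df2$variable)

# First appearance of each name in that sorted order: VB, VX, VD, VZ
lev <- unique(as.character(df2$variable[idx]))

# Reverse the levels because the flipped axis draws the first level at the bottom
df2$variable <- factor(df2$variable, levels = rev(lev))
levels(df2$variable)
```

The resulting data frame can be passed to the same plotting code in place of df3.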
Question 1:
Let's assume I have carried out a "before-after" (repeated measures with two points in time) experiment with 100 subjects. Each subject ticks a score on a 1 to 3 numerical scale in the 'before' condition (at T1) and again, after some treatment has been applied, in the 'after' condition (at T2). The behavior of each subject in the experiment can be described as a 'transition from the score value at T1 to the score value at T2', e.g. from 2 to 3, from 1 to 1, from 3 to 1 and so on. The cartesian product tells us that 9 different transition types are theoretically possible. For each transition type, I calculated (in an external program) the count of observations.
This gives the following dataframe:
MyData1 <- data.frame(TransitionTypeID=seq(1:9), T1=c(1,1,1,2,2,2,3,3,3), T2=c(1,2,3,1,2,3,1,2,3), Count=c(2,14,0,18,12,8,23,12,11))
MyData1
For each score value (on the y-axis) I would like to plot a point and a line between T1 and T2 (on the x-axis), whereas the thickness of the line between T1 and T2 should (somehow) correspond to the count which is observed. The plot should simply visualize which transitions occur more often than others. Any hints?
Question 2:
While some pre-calculation steps of the above example have been carried out outside of R (in MS Access), I believe there must exist a way reaching the desired result from within R, i.e. using a dataframe with individual records for each subject and each point in time (i.e. with two rows per subject, one for the score at T1 and one for T2, hence in the 'long' format).
In that case the dataframe is something like this:
MyData2 <- data.frame(SubjectID=seq(1:100), Condition = c(rep("T1",100), rep("T2",100)), Score=floor(runif(200,min=1, max=4)))
library(ggplot2)
ggplot(data = MyData2, aes(x = Condition, y = Score, group = SubjectID)) + geom_line()
I get a nice plot showing the observed transitions, but obviously the individual lines for each subject are just plotted on top of each other, i.e. the thicknesses between T1 and T2 do not reflect the count of observations for each type of transition. Again: hints on how to achieve meaningful line thicknesses would be highly appreciated.
Here is a possible solution for Question 1.
MyData1 <- data.frame(TransitionTypeID=seq(1:9),
T1=c(1,1,1,2,2,2,3,3,3),
T2=c(1,2,3,1,2,3,1,2,3),
Count=c(2,14,0,18,12,8,23,12,11))
MyData1
df <- data.frame(x=rep(c("T1","T2"), each=nrow(MyData1)),
y=c(MyData1$T1,MyData1$T2),
ID=rep(MyData1$TransitionTypeID,2),
cnt=rep(MyData1$Count,2)
)
df_lb <- data.frame(x=rep(c("T1","T2"), each=3),
y=rep(1:3,2),
hj=rep(c(2,-1),each=3))
library(ggplot2)
pal <- colorRampPalette(c("white","blue","red"))
ggplot(data=df, aes(x=x, y=y, group=ID, size=cnt, color=cnt)) +
geom_line() +
geom_point(show.legend=F) +
labs(x="", y="Score") +
scale_color_gradientn(colours=pal(10)) +
geom_text(data=df_lb, aes(x=x, y=y, label=y), size=7, inherit.aes=F, hjust=df_lb$hj) +
theme_void()
For question #1
You get to your goal very tidily with geom_segment:
ggplot( data=cbind( MyData1 ),
aes(x=1, y=T1,xend=2, yend=T2 ,size=Count))+
geom_segment()
I suspect you will need to change the 0 Counts to NA to get that 1->3 transition to go away. I think a size of 0 should make a segment disappear, but apparently Hadley thinks otherwise.
Yep:
is.na(MyData1 ) <- MyData1==0
ggplot( data=cbind( MyData1 ), aes(x=1, y=T1,xend=2, yend=T2 ,size=Count))+geom_segment()
The same code as above now delivers the correct plot.
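For Question 2, the transition counts can probably be computed inside R from the long-format data. The following is a minimal sketch, not a tested answer from this thread: the seed, the use of sample() instead of floor(runif()), and the names wide and counts are assumptions made for illustration.

```r
set.seed(42)  # assumption: a seed is added only so the sketch is reproducible
MyData2 <- data.frame(SubjectID = rep(1:100, times = 2),
                      Condition = rep(c("T1", "T2"), each = 100),
                      Score     = sample(1:3, 200, replace = TRUE))

# One row per subject, with the T1 and T2 scores side by side
wide <- merge(MyData2[MyData2$Condition == "T1", c("SubjectID", "Score")],
              MyData2[MyData2$Condition == "T2", c("SubjectID", "Score")],
              by = "SubjectID", suffixes = c("_T1", "_T2"))

# Count every observed (T1, T2) pair; as.data.frame() on a table adds a Freq column
counts <- as.data.frame(table(T1 = wide$Score_T1, T2 = wide$Score_T2))
names(counts)[names(counts) == "Freq"] <- "Count"

# table() stores the scores as factors; convert them back to numbers for plotting
counts$T1 <- as.numeric(as.character(counts$T1))
counts$T2 <- as.numeric(as.character(counts$T2))
counts
```

The counts data frame has the same T1, T2 and Count columns as MyData1, so it should work with the geom_segment() call shown above.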
I have currently encountered a phenomenon in ggplot2, and I would be grateful if someone could provide me with an explanation.
I needed to plot a continuous variable on a histogram, and I needed to represent two categorical variables on the plot. The following dataframe is a good example.
library(ggplot2)
species <- rep(c('cat', 'dog'), 30)
numb <- rep(c(1,2,3,7,8,10), 10)
groups <- rep(c('A', 'A', 'B', 'B'), 15)
data <- data.frame(species=species, numb=numb, groups=groups)
Let the following code represent the categorisation of a continuous variable.
data$factnumb <- as.factor(data$numb)
If I would like to plot this dataset, the following two code snippets are completely interchangeable:
Note the difference after the fill= statement.
p <- ggplot(data, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(p):
q <- ggplot(data, aes(x=factnumb, fill=data$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(q):
However, when working with real-life continuous variables not all categories will contain observations, and I still need to represent the empty categories on the x-axis in order to get an approximation of the sample distribution. To demonstrate this, I used the following code:
data_miss <- data[which(data$numb!= 3),]
This results in a disparity between the levels of the categorical variable and the observations in the dataset:
> unique(data_miss$factnumb)
[1] 1 2 7 8 10
Levels: 1 2 3 7 8 10
And plotted the data_miss dataset, still including all of the levels of the factnumb variable.
pm <- ggplot(data_miss, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_fill_discrete(drop=FALSE) +
scale_x_discrete(drop=FALSE)+
scale_y_continuous(labels = scales::percent)
plot(pm):
qm <- ggplot(data_miss, aes(x=factnumb, fill=data_miss$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_x_discrete(drop=FALSE)+
scale_fill_discrete(drop=FALSE) +
scale_y_continuous(labels = scales::percent)
plot(qm):
In this case, when using fill=data_miss$species the filling of the plot changes (and for the worse).
I would be really happy if someone could clear this one up for me.
Is it just "luck", that in case of plot 1 and 2 the filling is identical, or I have stumbled upon some delicate mistake in the fine machinery of ggplot2?
Thanks in advance!
Kind regards,
Bernadette
Using data$variable inside aes() is never good, never recommended, and should never be used. Sometimes it still works, but aes(variable) always works, so you should always use aes(variable).
More explanation:
ggplot uses nonstandard evaluation. A standard-evaluating R function looks names up as ordinary objects, which at the top level means the global environment (plus attached packages). If I have data named mydata with a column named col1, and I do mean(col1), I get an error:
mydata = data.frame(col1 = 1:3)
mean(col1)
# Error in mean(col1) : object 'col1' not found
This error happens because col1 isn't in the global environment. It's just a column name of the mydata data frame.
The aes function does extra work behind the scenes, and knows to look at the columns of the layer's data, in addition to checking the global environment.
ggplot(mydata, aes(x = col1)) + geom_bar()
# no error
You don't have to use just a column inside aes though. To give flexibility, you can do a function of a column, or even some other vector that you happen to define on the spot (if it has the right length):
# these work fine too
ggplot(mydata, aes(x = log(col1))) + geom_bar()
ggplot(mydata, aes(x = c(1, 8, 11))) + geom_bar()
So what's the difference between col1 and mydata$col1? Well, col1 is the name of a column, and mydata$col1 is the actual values. ggplot will look for a column in your data named col1 and use that. mydata$col1 is just a vector; it's the full column. The difference matters because ggplot often does data manipulation. Whenever there are facets or aggregate functions, ggplot is splitting your data up into pieces and doing stuff. To do this effectively, it needs to be able to identify the data and column names. When you give it mydata$col1, you're not giving it a column name, you're just giving it a vector of values (whatever happens to be in that column), and things don't work.
So, just use unquoted column names in aes() without data$ and everything will work as expected.
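The splitting problem can be imitated in base R. The sketch below is only an analogy for what happens under facetting, not ggplot2's actual internals; the grp column and the names pieces, inside and outside are invented for illustration.

```r
mydata <- data.frame(col1 = 1:6, grp = rep(c("a", "b"), each = 3))

# Evaluating a bare column name inside the data, roughly what aes() arranges
with(mydata, mean(col1))

# Split the data by group, much as a facet would
pieces <- split(mydata, mydata$grp)

# A column name is looked up in each piece, so it follows the split
inside <- sapply(pieces, function(d) with(d, length(col1)))

# mydata$col1 is a fixed vector that knows nothing about the split
outside <- sapply(pieces, function(d) length(mydata$col1))

inside   # 3 values per piece
outside  # still all 6 values for every piece
```

Each piece holds three rows, but the external vector still has six values, so there is no reliable way to match it to the rows of a piece.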
I'm new to this site and relatively new to R. I need to make a grouped barchart and my issue is that I need my y axis to be the count of observations of a specific variable/column. I have several hundred observations and I need to graph the number of participants who have ever been dependent on alcohol (y-axis) by race and gender.
I did this:
AlcDep_byRG_Table <- table(baseline_Alc_Race_Gender$ALCDEP_3.[baseline_Alc_Race_Gender$ALCDEP_3. >= 3],
baseline_Alc_Race_Gender$PRACE[baseline_Alc_Race_Gender$ALCDEP_3. >= 3],
baseline_Alc_Race_Gender$PGENDER[baseline_Alc_Race_Gender$ALCDEP_3. >= 3])
m <- colSums(AlcDep_byRG_Table[,,2]) # rows, columns, slice; 2 = second slice, 1, male
f <- colSums(AlcDep_byRG_Table[,,3])
barplot(c(m,f), main = "Alcohol Dependence by Race and Gender", beside = TRUE, xaxt = "n")
BUT, I don't know how to make it a grouped/clustered bar chart
I tried using ifelse but then didn't know how to subset it like the above.
Thank you so much for your help!
edit: I tried ggplot
ggplot(baseline_Alc_Race_Gender, aes(x=PRACE, y=Alc_Dep, fill = PGENDER)) +
geom_bar(stat='identity')
but the graph has lots of horizontal lines in each bar - anyone know what the mistake could be? Thank you!
(edited)
# create a dataframe like yours
PART_ID<-c(1001,1002,1003,1004,1005)
Alc_Dep<-c(1, 0,1,1,1)
PRACE<-c(1,2,1,3,1)
PGENDER<-c(1,2,2,2,1)
baseline_Alc_Race_Gender <- data.frame(PART_ID,PGENDER,PRACE,Alc_Dep)
baseline_Alc_Race_Gender
# solve the factor issue
baseline_Alc_Race_Gender$PGENDER<-
factor(baseline_Alc_Race_Gender$PGENDER)
baseline_Alc_Race_Gender$PRACE<-factor(baseline_Alc_Race_Gender$PRACE)
baseline_Alc_Race_Gender$Alc_Dep<-factor(baseline_Alc_Race_Gender$Alc_Dep)
# plot it
# do not specify a y!
ggplot(baseline_Alc_Race_Gender, aes(x=PRACE, fill=PGENDER)) +
geom_bar(position=position_dodge())
Sarah,
It would help immensely if you could share a sample of your data. But I'll make something up that may be your case. Three columns race, gender, and a yes or no to alcohol dependence. This should work no matter how many cases you have. It may require you to uncomment the install ggplot2 command
d <- data.frame(column1=rep(c("race1","race2","race3"), each=4),
column2=rep(c("male", "female"), 6),
column3=rep(c("yes", "no"), 6))
d
# install.packages("ggplot2") you may need this
require(ggplot2)
ggplot(d, aes(x=column1, fill=column2)) +
geom_bar(position=position_dodge())
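Since the question started from base barplot(), here is a hedged sketch of a grouped chart without ggplot2. It assumes the made-up five-row data from the first answer, with Alc_Dep == 1 meaning "dependent"; the names baseline, dep and counts are illustrative only.

```r
baseline <- data.frame(
  PRACE   = factor(c(1, 2, 1, 3, 1)),
  PGENDER = factor(c(1, 2, 2, 2, 1)),
  Alc_Dep = c(1, 0, 1, 1, 1)
)

# Keep only participants flagged as dependent (assumed coding: 1 = yes)
dep <- baseline[baseline$Alc_Dep == 1, ]

# Rows = gender, columns = race; each cell is a count of participants
counts <- table(Gender = dep$PGENDER, Race = dep$PRACE)

# A matrix plus beside = TRUE gives one cluster of bars per column (race)
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "Race", ylab = "Number of dependent participants",
        main = "Alcohol Dependence by Race and Gender")
```

The key difference from barplot(c(m, f), ...) is that barplot() only groups bars when it receives a matrix or table; a plain vector is drawn as a single row of bars.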
I have a variable that can take on the values 0 or 1 for each entry in the data frame. At the same time, each of the values was generated in a certain condition.
Now, I want to plot the proportion of '1's per condition. Note, that the respective data entries in the two conditions are not balanced, i.e., condition 'a' could have 20 entries of 0 or 1, whereas condition 'b' could have 200 entries of 0 or 1.
Thanks to a few posts here, I have come this far:
x <- rbinom(378,1,.9)
cond <- rbinom(378,1,.7)+1
myDf <- data.frame(x,factor(cond,labels=c('a','b')))
names(myDf) <- c('val', 'cond')
g <- ggplot(data.frame(myDf),aes(x=val, fill=cond))
g + geom_histogram(aes(y=0.5*..density..), binwidth=0.5, position=position_dodge())
If you inspect the plot, you quickly see that one set of bars is superfluous.
--> How can I skip plotting the bars at x-axis tick 0? They are already represented by the bars at x-axis tick 1, because I am plotting proportions after all.
Edit: If you have an idea, how the difference in proportions could be tested for significance, feel free to check out this related question.
As Henrik described in the comments to my question, the problem can be solved by calculating the proportions first and then plotting them using geom_col().
Based on the code in the original question:
df <- aggregate(val ~ cond, myDf, function(x) sum(x)/length(x))
ggplot(df, aes(x = cond, y = val, fill = cond)) + geom_col()
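Because val only takes the values 0 and 1, sum(x)/length(x) is simply the mean. The sketch below cross-checks the aggregated proportions against a row-wise prop.table(); the seed and the names props and check are assumptions added for illustration.

```r
set.seed(1)  # assumption: seed added so the sketch is reproducible
x    <- rbinom(378, 1, .9)
cond <- rbinom(378, 1, .7) + 1
myDf <- data.frame(val = x, cond = factor(cond, labels = c("a", "b")))

# Proportion of 1s per condition; mean() of a 0/1 vector is that proportion
props <- aggregate(val ~ cond, myDf, mean)

# Same numbers from a contingency table normalised within each condition (row)
check <- prop.table(table(myDf$cond, myDf$val), margin = 1)[, "1"]

props
all.equal(props$val, as.numeric(check))
```

Computing the proportions within each condition also takes care of the unbalanced group sizes, since each condition is divided by its own number of entries.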
I've had a good look around this site and others on how to set the hjust and vjust according to a value in a particular column. The following shows how the data is structured (but is a simplified subset of many entries for many years):
YearStart <- c(2001,2002,2003,2001,2002,2003)
Team <- c("MU","MU","MU","MC","MC","MC")
Attendance <- c(67586,67601,67640,33058,34564,46834)
Position <- c(3,1,3,1,9,16)
offset <-c()
df <- data.frame(YearStart,Team,Attendance,Position)
so
> head(df)
YearStart Team Attendance Position
1 2001 MU 67586 3
2 2002 MU 67601 1
3 2003 MU 67640 3
4 2001 MC 33058 1
5 2002 MC 34564 9
6 2003 MC 46834 16
What I would like to achieve is a vjust value based on the Team. In the following, MU would be vjust=1 and MC would be vjust=-1 so I can control where the data label is located from the data group with which it is associated.
I've tried to hack around a couple of examples that use a function containing a lookup table (it's not a straight ifelse as I have many values for Team) but I can't seem to pass a string to the function through the aes method along these lines:
lut <- list(MU=1,MC=-1)
vj <-function(x){lut[[x]]}
p=ggplot(df, aes(YearStart, Attendance, label=Position, group=Team))+
geom_point()+
geom_text(aes(vjust = vj(Team) ) )
print(p)
The following is pseudo(ish)code which applies the labels twice to each group in each location above and below the points.
p=ggplot(df, aes(YearStart, Attendance, label=Position, group=Team))+
geom_point()+
geom_text(aes(Team="MU"), vjust=1)+
geom_text(aes(Team="MC"), vjust=-1)
print(p)
I've tried several other strategies for this and I can't tell whether I'm trying this from the wrong direction or I'm just missing a very trivial piece of ggplot syntax. I've accomplished a stop-gap solution by labelling them manually in Excel but that's not sustainable :-)
To specify an aesthetic, that aesthetic should be a column in your data.frame.
(Notice also that your lookup function should have single brackets, not double.)
And a final thought: vjust and hjust are strictly only defined between [0, 1] for left/bottom and right/top justification. In practise, however, it is usually possible to extend this. I find that settings of (-0.2, 1.2) work quite well, in most cases.
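# a named numeric vector, so the lookup returns plain numbers rather than a list
lut <- c(MU=-0.2, MC=1.2)
# as.character() makes the lookup go by name even if Team is a factor
vj <- function(x) unname(lut[as.character(x)])
df$offset <- vj(df$Team)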
library(ggplot2)
ggplot(df, aes(YearStart, Attendance, label=Position, group=Team)) +
geom_point(aes(colour=Team)) +
geom_text(aes(vjust = offset))
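One detail worth checking when building such a lookup column is how the index is interpreted. The sketch below uses a named numeric vector rather than a list (an assumption of this example) and shows why a factor index can silently swap the values; the names offset and swapped are illustrative.

```r
Team <- c("MU", "MU", "MU", "MC", "MC", "MC")
lut  <- c(MU = -0.2, MC = 1.2)

# Character index: values are looked up by name
offset <- unname(lut[as.character(Team)])
offset

# Factor index: R uses the integer codes (levels sort as MC = 1, MU = 2),
# so the first element of lut is returned for MC and the second for MU
swapped <- unname(lut[factor(Team)])
swapped
```

Converting with as.character() before indexing keeps the lookup by name whether Team is stored as a character vector or as a factor.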