Conditional text formatting with ggplot - r

I've had a good look around this site and others on how to set the hjust and vjust according to a value in a particular column. The following shows how the data is structured (but is a simplified subset of many entries for many years):
YearStart <- c(2001,2002,2003,2001,2002,2003)
Team <- c("MU","MU","MU","MC","MC","MC")
Attendance <- c(67586,67601,67640,33058,34564,46834)
Position <- c(3,1,3,1,9,16)
offset <-c()
df <- data.frame(YearStart,Team,Attendance,Position)
so
> head(df)
YearStart Team Attendance Position
1 2001 MU 67586 3
2 2002 MU 67601 1
3 2003 MU 67640 3
4 2001 MC 33058 1
5 2002 MC 34564 9
6 2003 MC 46834 16
What I would like to achieve is a vjust value based on the Team. In the following, MU would be vjust=1 and MC would be vjust=-1, so I can control where the data label is located relative to the data group with which it is associated.
I've tried to hack around a couple of examples that use a function containing a lookup table (it's not a straight ifelse as I have many values for Team) but I can't seem to pass a string to the function through the aes method along these lines:
lut <- list(MU=1,MC=-1)
vj <-function(x){lut[[x]]}
p=ggplot(df, aes(YearStart, Attendance, label=Position, group=Team))+
geom_point()+
geom_text(aes(vjust = vj(Team) ) )
print(p)
The following is pseudo(ish)code which applies the labels twice to each group in each location above and below the points.
p=ggplot(df, aes(YearStart, Attendance, label=Position, group=Team))+
geom_point()+
geom_text(aes(Team="MU"), vjust=1)+
geom_text(aes(Team="MC"), vjust=-1)
print(p)
I've tried several other strategies for this and I can't tell whether I'm trying this from the wrong direction or I'm just missing a very trivial piece of ggplot syntax. I've accomplished a stop-gap solution by labelling them manually in Excel but that's not sustainable :-)

To specify an aesthetic, that aesthetic should be a column in your data.frame.
(Notice also that your lookup function should have single brackets, not double.)
And a final thought: vjust and hjust are strictly defined only on [0, 1], where 0 means left/bottom and 1 means right/top justification. In practice, however, it is usually possible to go beyond this range. I find that settings of (-0.2, 1.2) work quite well in most cases.
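library(ggplot2)
# Named numeric vector as the lookup table; single brackets return one value per team
lut <- c(MU = -0.2, MC = 1.2)
# as.character() makes the lookup go by name even if Team is stored as a factor
vj <- function(x) unname(lut[as.character(x)])
df$offset <- vj(df$Team)
ggplot(df, aes(YearStart, Attendance, label=Position, group=Team)) +
geom_point(aes(colour=Team)) +
geom_text(aes(vjust = offset))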
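A minimal base-R sketch of the lookup step on its own, assuming the table is kept as a named numeric vector. The fallback of 0.5 for teams missing from the table and the extra team "LFC" are hypothetical additions for illustration only.

```r
# Lookup table as a named numeric vector (values taken from the answer above)
lut <- c(MU = -0.2, MC = 1.2)

# Hypothetical helper: map team codes to vjust values, falling back to a
# centred label (0.5) for any team that is not in the table
vj <- function(team, default = 0.5) {
  out <- unname(lut[as.character(team)])  # as.character() avoids indexing by factor codes
  out[is.na(out)] <- default
  out
}

teams <- c("MU", "MU", "MC", "LFC")
offset <- vj(teams)
print(offset)
```

The result is a plain numeric vector, so it can be stored as a data frame column and mapped with aes(vjust = offset).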

Related

Adding another label to third variable in barplot (ggplot2)

My goal is to create a barplot that visualises the percentages of three variables; however, my current graph does so in a rather confusing way.
A little bit of context: each of my variables can have one of two possible values:
Reference: null or overt
Variety: SING or GB
Register: S1A or S1B
Overall, the data frame looks like this (with a few more thousand lines):
Reference Register Variety
1 null S1A SING
2 null S1A SING
3 null S1A SING
4 null S1A SING
5 null S1A SING
6 null S1A SING
I have used the following code to create the barplot below:
data_raw <- read.csv("INPUT.csv", TRUE, ",")
data_2 <- data_raw %>%
count(Reference, Variety, Register) %>%
mutate(pct = n / sum(n),
pct_label = scales::percent(pct))
ggplot(data_2, aes(x= Reference, fill = Variety, y = pct)) +
geom_col() +
geom_text(aes(label = paste(pct_label, n, sep = "\n")),
lineheight = 0.8,
position = position_stack(vjust = 0.5)) +
scale_y_continuous(labels = scales::percent)
The third variable, Register, is represented by two separate values within a single-coloured box, e.g., 684/20.22% (S1B) and 931/27.52% (S1A) for the variety GB. While I can infer from my data which of these two values stands for S1A or S1B, I need this to be apparent from the barplot as well. For example, would it be possible to add a label to "684/20.22%" that indicates that it is the S1B value?
Another obvious problem is that the data for the x-value "null" contains very low percentages, making it hard to read. I'm not sure what would be the best way to handle this. Perhaps it would make sense to do away with the numbers altogether and rely on colours only.
I'd be very grateful for any suggestions or solutions to my problem. I'm still a beginner and hope to become better at using R for data analysis.
If you just want Register to appear in the label, adding it to the paste() call should work:
...
geom_text(aes(label = paste(Register, pct_label, n, sep = "\n")),
...
However, I think you may want to look for some other aesthetic options, such as adding stripes or making the bars semi-transparent to show the Register variable.
To spread out crowded labels, you can look at this post.
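The label change can be tried end to end with a short sketch. The simulated data frame below is a hypothetical stand-in for INPUT.csv, the counts are computed with base R rather than dplyr, and the explicit group aesthetic and white segment borders are additions made here so that bars and labels stack in the same order and the two registers within one colour can be told apart.

```r
library(ggplot2)

# Hypothetical stand-in for INPUT.csv: one row per observation
set.seed(1)
data_raw <- data.frame(
  Reference = sample(c("null", "overt"), 200, replace = TRUE, prob = c(0.2, 0.8)),
  Variety   = sample(c("SING", "GB"), 200, replace = TRUE),
  Register  = sample(c("S1A", "S1B"), 200, replace = TRUE)
)

# Count each combination and turn the counts into shares of the whole data set
data_2 <- as.data.frame(table(data_raw), responseName = "n")
data_2$pct <- data_2$n / sum(data_2$n)
data_2$pct_label <- sprintf("%.1f%%", 100 * data_2$pct)

# Register is the first line of each label, so every segment names its register.
# The shared group keeps the stacking order identical in both layers.
p <- ggplot(data_2, aes(x = Reference, y = pct, fill = Variety,
                        group = interaction(Register, Variety))) +
  geom_col(colour = "white") +
  geom_text(aes(label = paste(Register, pct_label, n, sep = "\n")),
            lineheight = 0.8,
            position = position_stack(vjust = 0.5)) +
  scale_y_continuous(labels = scales::percent)
print(p)
```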

ggplot plotting problems and error bars

So I have some data that I imported into R using read.csv.
d = read.csv("Flux_test_results_for_R.csv", header=TRUE)
rows_to_plot = c(1,2,3,4,5,6,13,14)
d[rows_to_plot,]
It looks like it worked fine:
> d[rows_to_plot,]
strain selective rate ci.low ci.high
1 4051 rif 1.97539e-09 6.93021e-10 5.63066e-09
2 4052 rif 2.33927e-09 9.92957e-10 5.51099e-09
3 4081 (mutS) rif 1.32915e-07 1.05363e-07 1.67671e-07
4 4086 (mutS) rif 1.80342e-07 1.49870e-07 2.17011e-07
5 4124 (mutL) rif 5.53369e-08 4.03940e-08 7.58077e-08
6 4125 (mutL) rif 1.42575e-07 1.14957e-07 1.76828e-07
13 4760-all rif 6.74928e-08 5.41247e-08 8.41627e-08
14 4761-all rif 2.49119e-08 1.91979e-08 3.23265e-08
So now I'm trying to plot the column "rate", with "strain" as labels, and ci.low and ci.high as boundaries for confidence intervals.
Using ggplot, I can't even get the plot to work. This gives a plot where all the dots are at 1 on the y-axis:
g <- ggplot(data=d[rows_to_plot,], aes(x=strain, y=rate))
g + geom_dotplot()
Attempt at error bars:
error_limits = aes(ymax = d2$ci.high, ymin = d2$ci.low)
g + geom_errorbar(error_limits)
As you can tell I'm a complete noob to plotting things in R, any help appreciated.
Answer update
There were two things going on. As per boshek's answer, which I selected, it seems that geom_point(), not geom_dotplot(), was the way to go.
The other issue was that originally, I filtered the data to only plot some rows, but I didn't also filter the error limits by row. So I switched to:
d2 = d[c(1,2,3,4,5,6,13,14),]
error_limits = aes(ymax = d2$ci.high, ymin = d2$ci.low)
g = ggplot(data=d2, ...etc...
A couple general comments. Get away from using attach. Though it has its uses, for beginners it can be very confusing. Get used to things like d$strain and d$selective. That said, once you call the dataframe with ggplot() you can refer to variables in that dataframe subsequently just by their names. Also you really need to ask questions with a reproducible example. This is a very important step in figuring out how to ask questions in R.
Now for the plot. I think this should work:
# ci.low and ci.high are already the interval endpoints, so use them directly
error_limits = aes(ymax = ci.high, ymin = ci.low)
ggplot(data=d[rows_to_plot,], aes(x=strain, y=rate)) +
geom_point() +
geom_errorbar(error_limits)
But of course this is untested because you haven't provided a reproducible example.
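For a version that can be run as-is, here is a sketch built from the eight rows printed in the question. It assumes that ci.low and ci.high are the interval endpoints themselves rather than distances from rate (every printed row satisfies ci.low < rate < ci.high); the log-scaled y axis and the bar width are choices made here, not part of the original answer.

```r
library(ggplot2)

# The eight rows printed in the question, typed in by hand
d2 <- data.frame(
  strain  = c("4051", "4052", "4081 (mutS)", "4086 (mutS)",
              "4124 (mutL)", "4125 (mutL)", "4760-all", "4761-all"),
  rate    = c(1.97539e-09, 2.33927e-09, 1.32915e-07, 1.80342e-07,
              5.53369e-08, 1.42575e-07, 6.74928e-08, 2.49119e-08),
  ci.low  = c(6.93021e-10, 9.92957e-10, 1.05363e-07, 1.49870e-07,
              4.03940e-08, 1.14957e-07, 5.41247e-08, 1.91979e-08),
  ci.high = c(5.63066e-09, 5.51099e-09, 1.67671e-07, 2.17011e-07,
              7.58077e-08, 1.76828e-07, 8.41627e-08, 3.23265e-08)
)

# Each interval already brackets its rate, so the limits are used as-is
stopifnot(all(d2$ci.low < d2$rate), all(d2$rate < d2$ci.high))

p <- ggplot(d2, aes(x = strain, y = rate)) +
  geom_point() +
  geom_errorbar(aes(ymin = ci.low, ymax = ci.high), width = 0.2) +
  scale_y_log10()  # assumption: log axis, since the rates span two orders of magnitude
print(p)
```

Because the limits are mapped inside aes() by column name, they are always taken from the same rows as the points, which avoids the mismatch described in the answer update.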

How to create a dotplot in R using ggplot2

I have data that looks like this
year  species            number.of.seed  dist.to.forest
2006                     0               -5
2006  Bridelia_speciosa  3               0
2006                     0               5
2006  Carapa             5               10
2006                     0               15
And I have created a bar chart, that shows for each year the number of different species found in seed traps and as shown by their distance from forest, which looks like this:
But I would like to use geom = "dotplot" and have a single dot represent each species I have counted: basically the same as the bar chart, but with 24 dots instead of the first bar in year 2006, 23 dots instead of the second bar, and so on. When I use geom = "dotplot" I just can't get it to work the way I want; I can get a single dot at 24 or 23, but not 24 dots. I have tried a number of solutions to similar problems on SO, but nothing is working. Many thanks in advance.
my code:
dat1<-read.csv(file="clean_06_14_inc_dist1.csv")
diversity<-function(years){
results<-data.frame()
dat2<-subset(dat1, dat1$year %in% years)
for (j in years){
for (i in seq(-5, 50, 5)){
dat3<-subset(dat2, dat2$year == j & dat2$dist.to.forest == i)
a<-length(unique(dat3$species))
new_row<-data.frame(year = j, dist.to.forest = i, number.of.species = a)
results<-rbind(results, new_row)
}}
print(results)
qplot(x = dist.to.forest, y =number.of.species, data = results, facets = .~year, geom = "bar", stat = "identity")
}
diversity(2006:2008)
I think your problem is that you are trying to draw a dotplot with both an x and a y value, as in your bar graph, whereas I believe dotplots are meant to be used like histograms, taking only one variable.
So, if I'm not mistaken, if you make your dataframe distinct in the variables you are interested in (since you wanted unique number of species), you can plot it straight away, basically something like
dat2 = unique(dat1[,c("year","species", "dist.to.forest")])
qplot(x=dist.to.forest, data=dat2, facets=.~year, geom="dotplot")
On a side note, I think you may be making this charting more complicated than it needs to be, and you may want to look into dplyr, which makes this kind of data manipulation a breeze.
library(dplyr)
library(ggplot2)
dat1 <- read.csv(file="clean_06_14_inc_dist1.csv")
dat2 <- dat1 %>%
filter(year %in% 2006:2008) %>%
group_by(year, dist.to.forest) %>%
summarise(number.of.species = n_distinct(species), .groups = "drop")
# geom_col() draws bars whose heights are the supplied y values
ggplot(dat2, aes(x=dist.to.forest, y=number.of.species)) +
geom_col() +
facet_grid(. ~ year)
Disclaimer: Since you did not provide any sample data, this code is just off the top of my head and may contain errors.
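The de-duplication idea can be checked without the CSV file. The sketch below uses a tiny made-up data frame (hypothetical rows, loosely modelled on the sample in the question) and base R only: unique() leaves one row per species, year and distance, and counting those rows per group gives the number of species.

```r
# Hypothetical seed-trap records; the same species can be trapped repeatedly
dat1 <- data.frame(
  year           = c(2006, 2006, 2006, 2006, 2007),
  species        = c("Carapa", "Carapa", "Bridelia_speciosa", "Carapa", "Carapa"),
  dist.to.forest = c(0, 0, 0, 10, 0)
)

# One row per distinct (year, species, distance): this is what a dotplot needs,
# because it draws one dot per row
dat2 <- unique(dat1[, c("year", "species", "dist.to.forest")])

# Counting the remaining rows per group gives the number of species
counts <- aggregate(species ~ year + dist.to.forest, data = dat2, FUN = length)
names(counts)[names(counts) == "species"] <- "number.of.species"
print(counts)
```

Passing dat2 to a dotplot then yields one dot per species, while counts matches what the bar chart shows.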

R plot function - axes for a line chart

Assume the following frequency table in R, which comes out of a survey:
1 2 3 4 5 8
m 5 16 3 16 5 0
f 12 25 3 10 3 1
NA 1 0 0 0 0 0
The rows stand for the gender of the survey respondent (male/female/no answer). The columns represent the answers to a question on a 5-point scale (let's say: 1 = agree fully, 2 = agree somewhat, 3 = neither agree nor disagree, 4 = disagree somewhat, 5 = disagree fully, 8 = no answer).
The data is stored in a dataframe called "slm", the gender variable is called "sex", the other variable is called "tv_serien".
My problem is that I can't find a (in my opinion) proper way to create a line chart where the x-axis represents the 5-point scale (plus the "don't know" answers) and the y-axis represents the frequencies for every point on the scale. Furthermore, I want to create two lines (one for males, one for females).
My solution so far is the following:
I create a plot without plotting the "content" and the x-axis:
plot(slm$tv_serien, xlim = c(1,6), ylim = c(0,100), type = "n", xaxt = "n")
The problem here is that it feels like cheating to specify xlim = c(1,6), because the raw scores of slm$tv_serien are 100 values. I also tried to plot the variable via plot(factor(slm$tv_serien)...), but that still creates a metric scale from 1 to 8 (because the "don't know" answer is coded as 8).
So my first question is how to tell R that it should take the six distinct values (1 to 5 and 8) and take that as the x-axis?
I create the new x axis with proper labels:
axis(1, 1:6, labels = c("1", "2", "3", "4", "5", "DK"))
At least that works pretty well. ;-)
Next I create the line for the males:
lines(1:5, table(slm$tv_serien[slm$sex == 1]), col = "blue")
The problem here is that there is no DK (=8) answer, so I manually have to specify x = 1:5 instead of 1:6 in the "normal" case. My question here is, how to tell R to also draw the line for nonexisting values? For example, what would have happened, if no male had answered with 3, but I want a continuous line?
At last I create the line for females, which works well:
lines(1:6, table(slm$tv_serien[slm$sex == 2]), col = "red")
To summarize:
How can I tell R to take the 6 distinct values of slm$tv_serien as the x axis?
How can I draw continuous lines even if the line contains "0"?
Thanks for your help!
PS: Attached you will find the current plot for the above-mentioned functions.
PPS: I tried to make a list from "1." to "4." but it seems that every new list element started again with "1.". Sorry.
Edit: Response to OP's comment.
This directly creates a line chart of OP's data. Below this is the original answer using ggplot, which produces a far superior output.
Given the frequency table you provided,
df <- data.frame(t(freqTable)) # transpose (more suitable for plotting)
df <- cbind(Response=rownames(df),df) # add row names as first column
x <- seq_len(nrow(df)) # positions 1..6, one per response category (1 to 5 and 8)
plot(x,df$f,type="b",col="red",
xaxt="n", ylab="Count",xlab="Response")
lines(x,df$m,type="b",col="blue")
axis(1,at=x,labels=c("Str.Agr.","Sl.Agr","Neither","Sl.Disagr","Str.Disagr","NA"))
Produces this, which seems like what you were looking for.
Original Answer:
Not quite what you asked for, but converting your frequency table to a data frame, df
df <- data.frame(freqTable)
df <- cbind(Gender=rownames(df),df) # append rownames (Gender)
df <- df[-3,] # drop unknown gender
df
# Gender X1 X2 X3 X4 X5 X8
# m m 5 16 3 16 5 0
# f f 12 25 3 10 3 1
library(ggplot2)
library(reshape2)
gg=melt(df)
labels <- c("Agree\nFully","Somewhat\nAgree","Neither Agree\nnor Disagree","Somewhat\nDisagree","Disagree\nFully", "No Answer")
ggp <- ggplot(gg,aes(x=variable,y=value))
ggp <- ggp + geom_bar(aes(fill=Gender), position="dodge", stat="identity")
ggp <- ggp + scale_x_discrete(labels=labels)
ggp <- ggp + theme(axis.text.x = element_text(angle=90, vjust=0.5))
ggp <- ggp + labs(x="", y="Frequency")
ggp
Produces this:
Or, this, which is much better:
ggp + facet_grid(Gender~.)
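On the two questions about the x axis and about categories with a count of zero: a sketch in base R, assuming the answers are converted to a factor with all six levels listed explicitly. The simulated vectors below are hypothetical stand-ins for slm$tv_serien and slm$sex.

```r
# Hypothetical responses standing in for slm$tv_serien and slm$sex
set.seed(42)
tv_serien <- c(sample(1:5, 99, replace = TRUE), 8)  # exactly one "no answer"
sex       <- c(sample(1:2, 99, replace = TRUE), 2)  # 1 = male, 2 = female

# Fixing the levels makes table() report a count, possibly 0, for every category
answers <- factor(tv_serien, levels = c(1:5, 8))
m <- table(answers[sex == 1])
f <- table(answers[sex == 2])

x <- seq_along(levels(answers))  # positions 1..6 on the x axis
plot(x, as.vector(f), type = "b", col = "red", xaxt = "n",
     ylim = c(0, max(m, f)), xlab = "Response", ylab = "Count")
lines(x, as.vector(m), type = "b", col = "blue")
axis(1, at = x, labels = c("1", "2", "3", "4", "5", "DK"))
```

Both tables now have six entries, so the male line can use the same x positions as the female line even though no male respondent chose 8.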

Order of legend entries in ggplot2 barplots with coord_flip()

I'm struggling to get the right ordering of variables in a graph I made with ggplot2 in R.
Suppose I have a dataframe such as:
set.seed(1234)
my_df<- data.frame(matrix(0,8,4))
names(my_df) <- c("year", "variable", "value", "vartype")
my_df$year <- rep(2006:2007)
my_df$variable <- c(rep("VX",2),rep("VB",2),rep("VZ",2),rep("VD",2))
my_df$value <- runif(8, 5,10)
my_df$vartype<- c(rep("TA",4), rep("TB",4))
which yields the following table:
year variable value vartype
1 2006 VX 5.568517 TA
2 2007 VX 8.111497 TA
3 2006 VB 8.046374 TA
4 2007 VB 8.116897 TA
5 2006 VZ 9.304577 TB
6 2007 VZ 8.201553 TB
7 2006 VD 5.047479 TB
8 2007 VD 6.162753 TB
There are four variables (VX, VB, VZ and VD), belonging to two groups of variable types, (TA and TB).
I would like to plot the values as horizontal bars on the y axis, ordered vertically first by variable groups and then by variable names, faceted by year, with values on the x axis and fill colour corresponding to variable group.
(i.e. in this simplified example, the order should be, top to bottom, VB, VX, VD, VZ)
1) My first attempt has been to try the following:
ggplot(my_df,
aes(x=variable, y=value, fill=vartype, order=vartype)) +
# adding or removing the aesthetic "order=vartype" doesn't change anything
geom_bar() +
facet_grid(. ~ year) +
coord_flip()
However, the variables are listed in reverse alphabetical order, but not by vartype : the order=vartype aesthetic is ignored.
2) Following an answer to a similar question I posted yesterday, I tried the following, based on the post Order Bars in ggplot2 bar graph:
my_df$variable <- factor(
my_df$variable,
levels=rev(sort(unique(my_df$variable))),
ordered=TRUE
)
This approach does get the variables in vertical alphabetical order in the plot, but ignores the fact that the variables should be ordered first by variable groups (with TA-variables on top and TB-variables below).
3) The following gives the same as 2 (above):
my_df$vartype <- factor(
my_df$vartype,
levels=sort(unique(my_df$vartype)),
ordered=TRUE
)
... which has the same issues as the first approach (variables listed in reverse alphabetical order, groups ignored)
4) Another approach, based on the original answer to Order Bars in ggplot2 bar graph, also gives the same plot as 2, above:
my_df <- within(my_df,
vartype <- factor(vartype,
levels=names(sort(table(vartype),
decreasing=TRUE)))
)
I'm puzzled by the fact that, despite several approaches, the aesthetic order=vartype is ignored. Still, it seems to work in an unrelated problem: http://learnr.wordpress.com/2010/03/23/ggplot2-changing-the-default-order-of-legend-labels-and-stacking-of-data/
I hope that the problem is clear and welcome any suggestions.
Matteo
I posted a similar question yesterday, but unfortunately I made several mistakes when describing the problem and providing a reproducible example.
I've listened to several suggestions since, thoroughly searched Stack Overflow for similar questions, and applied, to the best of my knowledge, every suggested combination of solutions, to no avail.
I'm posting the question again hoping to be able to solve my issue and, hopefully, be helpful to others.
This has little to do with ggplot, but is instead a question about generating an ordering of variables to use to reorder the levels of a factor. Here is your data, implemented using the various functions to better effect:
set.seed(1234)
df2 <- data.frame(year = rep(2006:2007),
variable = rep(c("VX","VB","VZ","VD"), each = 2),
value = runif(8, 5,10),
vartype = rep(c("TA","TB"), each = 4),
stringsAsFactors = TRUE) # needed in R >= 4.0 to get factor columns
Note that this way variable and vartype are factors. If they aren't factors, ggplot() will coerce them and then you get left with alphabetical ordering. I have said this before and will no doubt say it again; get your data into the correct format first before you start plotting / doing data analysis.
You want the following ordering:
> with(df2, order(vartype, variable))
[1] 3 4 1 2 7 8 5 6
where you should note that we get the ordering by vartype first and only then by variable within the levels of vartype. If we use this to reorder the levels of variable we get:
> with(df2, reorder(variable, order(vartype, variable)))
[1] VX VX VB VB VZ VZ VD VD
attr(,"scores")
VB VD VX VZ
1.5 5.5 3.5 7.5
Levels: VB VX VD VZ
(ignore the attr(,"scores") bit and focus on the Levels). This has the right ordering, but ggplot() will draw them bottom to top and you wanted top to bottom. I'm not sufficiently familiar with ggplot() to know if this can be controlled, so we will also need to reverse the ordering using decreasing = TRUE in the call to order().
Putting this all together we have:
## reorder `variable` on `variable` within `vartype`
df3 <- transform(df2, variable = reorder(variable, order(vartype, variable,
decreasing = TRUE)))
Which when used with your plotting code:
ggplot(df3, aes(x=variable, y=value, fill=vartype)) +
geom_bar(stat = "identity") + # bar heights come from `value`, not from row counts
facet_grid(. ~ year) +
coord_flip()
produces this:
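As an alternative way to arrive at the same level order, the sketch below sorts the rows and reads the levels off the sorted result instead of going through reorder(). This is a swapped-in technique, not the answer's own method, and the names ord and lev are introduced here for illustration.

```r
df2 <- data.frame(year = rep(2006:2007, times = 4),
                  variable = rep(c("VX", "VB", "VZ", "VD"), each = 2),
                  value = runif(8, 5, 10),
                  vartype = rep(c("TA", "TB"), each = 4))

# Sort the rows by vartype, then variable, and read the level order off the result
ord <- order(df2$vartype, df2$variable)
lev <- unique(as.character(df2$variable[ord]))  # "VB" "VX" "VD" "VZ"

# Reverse the levels because coord_flip() draws the first level at the bottom
df2$variable <- factor(df2$variable, levels = rev(lev))
print(levels(df2$variable))
```

With the levels set this way, the plotting code can be applied to df2 directly, and the bars read VB, VX, VD, VZ from top to bottom.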

Resources