Using apply functions with ggplot to plot a subset of dataframe columns - r

I have a dataframe df with many columns ...
I'd like to plot a subset of the columns, where c is a character vector naming the columns I'd like to plot.
I'm currently doing the following
df <-structure(list(Image.Name = structure(1:5, .Label = c("D1C1", "D2C2", "D4C1", "D5C3", "D6C2"), class = "factor"), Experiment = structure(1:5, .Label = c("020718 perfusion EPC_BC_HCT115_Day 5", "020718 perfusion EPC_BC_HCT115_Day 6", "020718 perfusion EPC_BC_HCT115_Day 7", "020718 perfusion EPC_BC_HCT115_Day 8", "020718 perfusion EPC_BC_HCT115_Day 9"), class = "factor"), Type = structure(c(2L, 1L, 1L, 2L, 1L), .Label = c("VMO", "VMT"), class = "factor"), Date = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "18-Apr-18", class = "factor"), Time = structure(1:5, .Label = c("12:42:02 PM", "12:42:29 PM", "12:42:53 PM", "12:43:44 PM", "12:44:23 PM"), class = "factor"), Low.Threshold = c(10L, 10L, 10L, 10L, 10L), High.Threshold = c(255L, 255L, 255L, 255L, 255L), Vessel.Thickness = c(7L, 7L, 7L, 7L, 7L), Small.Particles = c(0L, 0L, 0L, 0L, 0L), Fill.Holes = c(0L, 0L, 0L, 0L, 0L), Scaling.factor = c(0.001333333, 0.001333333, 0.001333333, 0.001333333, 0.001333333), X = c(NA, NA, NA, NA, NA), Explant.area = c(1.465629333, 1.093447111, 1.014612444, 1.166950222, 1.262710222), Vessels.area = c(0.255562667, 0.185208889, 0.195792, 0.153907556, 0.227996444), Vessels.percentage.area = c(17.43706003, 16.93807474, 19.29722044, 13.18887067, 18.05611774), Total.Number.of.Junctions = c(56L, 32L, 39L, 18L, 46L), Junctions.density = c(38.20884225, 29.26524719, 38.43832215, 15.42482246, 36.42957758), Total.Vessels.Length = c(12.19494843, 9.545333135, 10.2007416, 7.686755647, 11.94211976), Average.Vessels.Length = c(0.182014156, 0.153956986, 0.188902622, 0.08938088, 0.183724919), Total.Number.of.End.Points = c(187L, 153L, 145L, 188L, 167L), Average.Lacunarity = c(0.722820111, 0.919723402, 0.86403871, 1.115896082, 0.821753818)), .Names = c("Image.Name", "Experiment", "Type", "Date", "Time", "Low.Threshold", "High.Threshold", "Vessel.Thickness", "Small.Particles", "Fill.Holes", "Scaling.factor", "X", "Explant.area", "Vessels.area", "Vessels.percentage.area", "Total.Number.of.Junctions", 
"Junctions.density", "Total.Vessels.Length", "Average.Vessels.Length", "Total.Number.of.End.Points", "Average.Lacunarity"), row.names = c(NA, -5L), class = "data.frame")
doBarPlot <- function(x) {
p <- ggplot(x, aes_string(x="Type", y=colnames(x), fill="Type") ) +
stat_summary(fun.y = "mean", geom = "bar", na.rm = TRUE) +
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width=0.5, na.rm = TRUE) +
ggtitle("VMO vs. VMT") +
theme(plot.title = element_text(hjust = 0.5) )
print(p)
ggsave(sprintf("plots/%s_bars.pdf", colnames(x) ) )
return(p)
}
c = c('Total.Vessels.Length', 'Total.Number.of.Junctions', 'Total.Number.of.End.Points', 'Average.Lacunarity')
p[c] <- lapply(df[c], doBarPlot)
However this yields the following error :
Error: ggplot2 doesn't know how to deal with data of class numeric
Debugging shows that x inside doBarPlot is of type numeric rather than data.frame, so ggplot errors. However, test <- df[c] yields a variable of type data.frame.
Why is x a numeric?
What's the best way to apply doBarPlot without resorting to a loop?

As others have noted, the problem with your initial approach is that when you use lapply on a data frame, the elements that you are iterating over will be the column vectors, rather than 1-column data frames. However, even if you did iterate over 1-column data frames, your function would fail: the data frame supplied to the ggplot call wouldn't contain the Type column that you use in the plot.
Instead, you could modify the function to take two arguments: the full data frame, and the name of the column that you want to use on the y-axis.
doBarPlot <- function(data, y) {
p <- ggplot(data, aes_string(x = "Type", y = y, fill = "Type")) +
stat_summary(fun.y = "mean", geom = "bar", na.rm = TRUE) +
stat_summary(
fun.data = "mean_cl_normal",
geom = "errorbar",
width = 0.5,
na.rm = TRUE
) +
ggtitle("VMO vs. VMT") +
theme(plot.title = element_text(hjust = 0.5))
print(p)
ggsave(sprintf("plots/%s_bars.pdf", y))
return(p)
}
Then, you can use lapply to iterate over the character vector of columns you want to plot, while supplying the data frame via the ... as a fixed argument to your plotting function:
library(ggplot2)
cols <- c('Total.Vessels.Length', 'Total.Number.of.Junctions',
'Total.Number.of.End.Points', 'Average.Lacunarity')
p <- lapply(cols, doBarPlot, data = df)
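On recent ggplot2 releases, aes_string() is deprecated and stat_summary() takes fun rather than fun.y (since ggplot2 3.3.0). A minimal sketch of the same idea using the .data pronoun might look like the following. The names do_bar_plot, out_dir and toy are hypothetical, made up for illustration, and the error bars are left out so that the example does not depend on Hmisc.

```r
library(ggplot2)

# Hypothetical variant of doBarPlot(); assumes ggplot2 >= 3.3.0.
do_bar_plot <- function(data, y, out_dir = NULL) {
  # .data[[y]] looks the column up by its name, replacing aes_string()
  p <- ggplot(data, aes(x = Type, y = .data[[y]], fill = Type)) +
    stat_summary(fun = mean, geom = "bar", na.rm = TRUE) +
    labs(title = "VMO vs. VMT", y = y) +
    theme(plot.title = element_text(hjust = 0.5))
  # Save only when an output directory is given; pass the plot explicitly
  if (!is.null(out_dir)) {
    dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
    ggsave(file.path(out_dir, sprintf("%s_bars.pdf", y)), plot = p)
  }
  p
}

# Toy data with illustrative values, not the data from the question
toy <- data.frame(
  Type = c("VMO", "VMT", "VMO", "VMT"),
  Average.Lacunarity = c(0.92, 0.72, 0.86, 1.12),
  Total.Vessels.Length = c(9.5, 12.2, 10.2, 7.7)
)
cols <- c("Average.Lacunarity", "Total.Vessels.Length")

# Name the list so each plot can be retrieved by its column name
plots <- setNames(lapply(cols, do_bar_plot, data = toy), cols)
```

Naming the list means a single plot can be pulled out later as plots[["Average.Lacunarity"]], which is roughly what the p[c] assignment in the question was trying to achieve.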
Further, if you don't mind having all of the plots in one file, you could also use tidyr::gather to reshape your data into long form, and use facet_wrap in your plot (as suggested by @RichardTelford in his comment), avoiding the iteration and the need for a function altogether:
library(tidyverse)
df %>%
gather(variable, value, cols) %>%
ggplot(aes(x = Type, y = value, fill = Type)) +
facet_wrap(~ variable, scales = "free_y") +
stat_summary(fun.y = "mean", geom = "bar", na.rm = TRUE) +
stat_summary(
fun.data = "mean_cl_normal",
geom = "errorbar",
width = 0.5,
na.rm = TRUE
) +
ggtitle("VMO vs. VMT") +
theme(plot.title = element_text(hjust = 0.5))
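gather() still works, but tidyr (1.0.0 and later) has superseded it with pivot_longer(). As a rough sketch of the equivalent reshaping step, using a made-up toy data frame rather than the question's data:

```r
library(tidyr)

# Toy data with illustrative values, not the data from the question
toy <- data.frame(
  Type = c("VMO", "VMT", "VMO", "VMT"),
  Average.Lacunarity = c(0.92, 0.72, 0.86, 1.12),
  Total.Vessels.Length = c(9.5, 12.2, 10.2, 7.7)
)
cols <- c("Average.Lacunarity", "Total.Vessels.Length")

# all_of() makes explicit that cols is an external character vector,
# not a column of toy
long <- pivot_longer(
  toy,
  cols = all_of(cols),
  names_to = "variable",
  values_to = "value"
)
```

The resulting long data frame has one row per Type/variable combination and can be piped into the same ggplot() and facet_wrap() calls as above.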

The apply family of functions iterates over the elements of the object passed; for a data frame, those elements are its columns. A simple example to illustrate this:
lapply(mtcars, function(x) print(x))
With your code, you are passing a vector of each column in your df to the function doBarPlot. The ggplot2 package works with dataframes, not lists or vectors and therefore you get the error.
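The difference between what lapply() hands to the function and what single-bracket subsetting returns can be checked directly. This is a small base R sketch using the built-in mtcars data; the variable names are arbitrary.

```r
cols <- c("mpg", "hp")

# lapply()/vapply() over a data frame visit each column as a plain vector
col_classes <- vapply(mtcars[cols], function(x) class(x)[1], character(1))

# Single-bracket subsetting, by contrast, keeps the data.frame class
subset_is_df <- is.data.frame(mtcars[cols])

# Iterating over the column *names* lets the function look each column up
# in the full data frame, so other columns remain available
col_means <- vapply(cols, function(nm) mean(mtcars[[nm]]), numeric(1))
```

Here col_classes is "numeric" for both columns while subset_is_df is TRUE, which matches the behaviour described in the question.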
If you want to use your function, apply it directly to the subsetted df:
doBarPlot(df[ , c])
If you have a bunch of dataframes and you want to subset by the columns in c, check out this answer:
How to apply same function to every specified column in a data.table
or, alternatively, look into dplyr::select().

Related

Graphing continuous data points using date and time in R

I am very new to RStudio so my coding is rudimentary.
I have a data set that contains six (6) columns: date5m, time5m, T5m, date28m, time28m, T28m. The data set is temperature data at two depths (5m and 28m), each with an associated date and time stamp. My resulting graph appears to group all the data by day rather than displaying it continuously by the time at which it was collected. Any assistance would be appreciated.
library(ggplot2)
library(scales)
library(dplyr)
Aberdeen <- read.csv(file.choose(), header = TRUE)
head(Aberdeen)
Aberdeen$ï..date5m = as.Date(Aberdeen$ï..date5m, format = "%Y-%m-%d")
Aberdeen$date28m = as.Date(Aberdeen$date28m, format = "%Y-%m-%d")
ggplot() + geom_point(data = Aberdeen, aes(x = ï..date5m, y = T5m),
colour = "darkgreen", size=0.25, na.rm=TRUE) +
geom_point(data = Aberdeen, aes(x = date28m, y = T28m), colour = "forestgreen", size=0.25, na.rm=TRUE) +
labs(x = "Date", y = "Temperature (\u00B0C)") +
ggtitle("Aberdeen") +
theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
scale_x_date(date_breaks = "month", labels=date_format("%b-%Y"))
I would like for the graph to display the data in a continuous fashion using both date and time stamp like this:
Here are the first 10 lines of my data set:
structure(list(date5m = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "2018-06-01", class = "factor"), time5m =
structure(1:10, .Label = c("14:40:30",
"14:42:34", "14:44:39", "14:46:40", "14:48:43", "14:50:46", "14:52:51",
"14:54:56", "14:56:59", "14:59:03"), class = "factor"), T5m = c(9.1,
9.02, 9, 9.12, 9.12, 9.1, 9.06, 9.02, 8.98, 9.02), date28m =
structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "2018-06-01", class =
"factor"),
time28m = structure(1:10, .Label = c("14:39:00", "14:49:00",
"14:59:00", "15:10:00", "15:20:00", "15:30:00", "15:40:00",
"15:50:00", "16:00:00", "16:11:00"), class = "factor"), T28m = c(1.93,
1.93, 1.93, 1.93, 1.93, 1.93, 1.93, 1.93, 1.93, 1.91)), row.names = c(NA,
10L), class = "data.frame")
This turned out to be trickier than expected, since the date and time columns are not consistent across the rows.
I had to manipulate the column names to provide a consistent separator in the name. I also combined the date and time columns into a single datetime object in order to plot properly.
Once the data frame was converted from its original wide format into long format, the ggplot call became simpler.
"Aberdeen" is the name of the original dataframe from the read.csv statement (assumed to match the sample data posted). Please see the code comments for additional details:
library(tidyr)
library(dplyr)
library(stringr)
#Rename the columns to add a '_' separator between the letter and first number
#this is needed to make the separation and the pivot easier.
# See the tidyr pivot Vignette "Multiple observations per row"
names(Aberdeen) <- names(Aberdeen) %>% str_replace( "(\\D)(\\d)", "\\1_\\2")
#Adding a rownumber for tracking purposes
#Unite the date and time columns into 1 column
#reshape to long
dflong<-Aberdeen %>% mutate(rowid=row_number()) %>%
unite("datetime_5m", c(date_5m, time_5m)) %>%
unite("datetime_28m", c(date_28m, time_28m)) %>%
pivot_longer(cols= -rowid, names_to = c(".value", "depth"), names_sep="_")
#convert datetime column from character to datetime object:
dflong$datetime<-as.POSIXct(dflong$datetime, format="%Y-%m-%d_%H:%M:%S", tz="")
#plot grouping and coloring by the depth
ggplot(data = dflong, aes(x = datetime, y = T, group=depth, color=depth)) +
geom_point() +
labs(x = "Date", y = "Temperature (\u00B0C)") +
ggtitle("Aberdeen") +
theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
scale_x_datetime(date_breaks = "hour", labels=date_format("%b-%Y"))
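The two key steps above, renaming the columns and building a single date-time value, can be tried in isolation with base R. This is only a sketch: the two toy rows copy the shape of the sample data, and tz = "UTC" is an assumption, since the question does not state a time zone.

```r
# Toy rows in the same shape as the sample data
toy <- data.frame(
  date5m = c("2018-06-01", "2018-06-01"),
  time5m = c("14:40:30", "14:42:34"),
  T5m = c(9.10, 9.02),
  stringsAsFactors = FALSE
)

# Insert "_" between the first non-digit/digit pair: "date5m" -> "date_5m"
names(toy) <- sub("(\\D)(\\d)", "\\1_\\2", names(toy), perl = TRUE)

# Paste date and time together and parse them as one date-time value.
# tz = "UTC" is an assumption made for this example.
toy$datetime <- as.POSIXct(
  paste(toy$date_5m, toy$time_5m),
  format = "%Y-%m-%d %H:%M:%S",
  tz = "UTC"
)

# Readings are now separated by their true spacing in time
gap_secs <- as.numeric(difftime(toy$datetime[2], toy$datetime[1], units = "secs"))
```

Because as.Date() discards the time of day, every reading from the same day lands on the same x position; a POSIXct column keeps the spacing between readings.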

Bubble chart without axis with labels in R

I have the following data frame:
> dput(df)
structure(list(text = structure(c(9L, 10L, 1L, 7L, 5L, 12L, 1L,
11L, 5L, 8L, 2L, 13L, 2L, 5L, NA, 6L, 13L, 4L, NA, 5L, 4L, 3L
), .Label = c("add ", "change ", "clarify", "correct", "correct ",
"delete", "embed", "follow", "name ", "remove", "remove ", "specifiy ",
"update"), class = "factor"), ID = c(1052330L, 915045L, 931207L,
572099L, 926845L, 510057L, 927946L, 490640L, 928498L, 893872L,
956074L, 627059L, 508649L, 508657L, 1009304L, 493138L, 955579L,
144052L, 1011166L, 151059L, 930992L, 913074L)), .Names = c("text",
"ID"), class = "data.frame", row.names = c(NA, -22L))
I would like a bubble chart for my df in which each circle is labeled with a verb from the text column and with the number of IDs related to that verb. This is the code I have for the circles, but I don't know how to do the labeling:
> library(packcircles)
> library(ggplot2)
> packing <- circleProgressiveLayout(df)
> dat.gg <- circleLayoutVertices(packing)
> ggplot(data = dat.gg) +geom_polygon(aes(x, y, group = id, fill = factor(id)), colour = "black",show.legend = FALSE) +scale_y_reverse() +coord_equal()
You create a data.frame for your labels with the appropriate x and y coordinates and use geom_text:
library(packcircles)
library(ggplot2)
packing <- circleProgressiveLayout(df)
dat.gg <- circleLayoutVertices(packing)
cbind(df, packing) -> new_df
ggplot(data = dat.gg) +geom_polygon(aes(x, y, group = id, fill = factor(id)), colour = "black",show.legend = FALSE) +
scale_y_reverse() +coord_equal() +geom_text(data = new_df, aes(x, y,label = text))
For the Text and ID, you can do:
new_df$text2 <- paste0(new_df$text,"\n",new_df$ID)
ggplot(data = dat.gg) +geom_polygon(aes(x, y, group = id, fill = factor(id)), colour = "black",show.legend = FALSE) +
scale_y_reverse() +coord_equal() +geom_text(data = new_df, aes(x, y,label = text2))
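The label above pastes the ID itself under each verb. If what is wanted is the number of IDs per verb, as the question's wording suggests, the counts could be computed first. This is a base R sketch on made-up toy rows; df_toy, verbs and counts are hypothetical names.

```r
# Toy version of the question's data (illustrative rows only)
df_toy <- data.frame(
  text = c("add ", "correct", "correct ", "add ", NA, "update"),
  ID = c(931207L, 144052L, 926845L, 927946L, 1009304L, 627059L),
  stringsAsFactors = FALSE
)

# Trim trailing spaces so "correct" and "correct " count as the same verb
verbs <- trimws(df_toy$text)

# table() drops NA by default; the result has one row per verb
counts <- as.data.frame(
  table(verb = verbs),
  responseName = "n",
  stringsAsFactors = FALSE
)

# Two-line label: the verb, then the number of IDs attached to it
counts$label <- paste0(counts$verb, "\n", counts$n)
```

A table like counts, with one row per verb, could then serve as the source of both the bubble sizes and the labels.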

Creating multiple graphs based upon the column names

This is my first question on Stack Overflow; please correct me if I am not following the correct question protocols.
I am trying to create some graphs for data that has been collected over three time points (time 1, time 2, time 3) which equates to X1..., X2... and X3... at the beginning of column names. The graphs are also separated by the column $Group from the data frame.
I have no problem creating the graphs, I just have many variables (~170) and am wanting to compare time 1 vs time 2, time 2 vs time 3, etc. so am trying to work a shortcut to be running this kind of code rather than having to type out each one individually.
As indicated above, I have created variable names like X1... X2... which indicate the time that the variable was recorded i.e. X1BCSTCAT = time 1; X2BCSTCAT = time 2; X3BCSTCAT = time 3. Here is a small sample of what my data looks like:
df <- structure(list(ID = structure(1:6, .Label = c("101","102","103","118","119","120"), class = "factor"),
Group = structure(c(1L,1L,1L,2L,2L,2L), .Label = c("C8","TC"), class = "factor"),
Wave = structure(c(1L, 2L, 3L, 4L, 1L, 2L), .Label = c("A","B","C","D"), class = "factor"),
Yr = structure(c(1L, 2L, 1L, 2L, 1L, 2L), .Label = c("3","5"), class = c("ordered", "factor")),
Age.Yr. = c(10.936,10.936, 9.311, 10.881, 10.683, 11.244),
Training..hr. = c(10.667,10.333, 10.667, 10.333, 10.333, 10.333),
X1BCSTCAT = c(-0.156,0.637,-1.133,0.637,2.189,1.229),
X1BCSTCR = c(0.484,0.192, -1.309, 0.912, 1.902, 0.484),
X1BCSTPR = c(-1.773,0.859, 0.859, 0.12, -1.111, 0.12),
X2BCSTCAT = c(1.006, -0.379,-1.902, 0.444, 2.074, 1.006),
X2BCSTCR = c(0.405, -0.457,-1.622, 1.368, 1.981, 0.168),
X2BCSTPR = c(-0.511, -0.036,2.189, -0.036, -0.894, 0.949),
X3BCSTCAT = c(1.18, -1.399,-1.399, 1.18, 1.18, 1.18),
X3BCSTCR = c(0.967, -1.622, -1.622,0.967, 0.967, 1.255),
X3BCSTPR = c(-1.282, -1.282, 1.539,1.539, 0.792, 0.792)),
row.names = c(1L, 2L, 3L, 4L, 5L,8L), class = "data.frame")
Here is some working code to create one graph using ggplot for time 1 vs time 2 data on one variable:
library(ggplot2)
p <- ggplot(df, aes(x=df$X1BCSTCAT, y=df$X2BCSTCAT, shape = df$Group, color = df$Group)) +
geom_point() + geom_smooth(method=lm, aes(fill=df$Group), fullrange = TRUE) +
labs(title="BCSTCAT", x="Time 1", y = "Time 2") +
scale_color_manual(name = "Group",labels = c("C8","TC"),values = c("blue", "red")) +
scale_shape_manual(name = "Group",labels = c("C8","TC"),values = c(16, 17)) +
scale_fill_manual(name = "Group",labels = c("C8", "TC"),values = c("light blue", "pink"))
So I am really trying to create some kind of a shortcut where R will cycle through and match up variable names X1... vs X2... and so on and create the graphs. I assume there must be some way to plot either based upon matching column numbers e.g. df[,7] vs df[,10] and iterating through this process or plotting by actually matching the names (where the only difference in variable names is the number which indicates time).
I have previously cycled through creating individual graphs using the lapply function, but have no idea where to even start with trying to do this one.
A solution using the tidyeval approach. We will need ggplot2 v3.0.0 or later (remember to restart your R session after installing).
install.packages("ggplot2", dependencies = TRUE)
First we build a function that takes column and group names as inputs. Note the use of rlang::sym, rlang::quo_name & !!.
Then create 2 name vectors for x- & y- values so that we can loop through them simultaneously using purrr::map2.
library(rlang)
library(tidyverse)
df <- structure(list(ID = structure(1:6, .Label = c("101","102","103","118","119","120"), class = "factor"),
Group = structure(c(1L,1L,1L,2L,2L,2L), .Label = c("C8","TC"), class = "factor"),
Wave = structure(c(1L, 2L, 3L, 4L, 1L, 2L), .Label = c("A","B","C","D"), class = "factor"),
Yr = structure(c(1L, 2L, 1L, 2L, 1L, 2L), .Label = c("3","5"), class = c("ordered", "factor")),
Age.Yr. = c(10.936,10.936, 9.311, 10.881, 10.683, 11.244),
Training..hr. = c(10.667,10.333, 10.667, 10.333, 10.333, 10.333),
X1BCSTCAT = c(-0.156,0.637,-1.133,0.637,2.189,1.229),
X1BCSTCR = c(0.484,0.192, -1.309, 0.912, 1.902, 0.484),
X1BCSTPR = c(-1.773,0.859, 0.859, 0.12, -1.111, 0.12),
X2BCSTCAT = c(1.006, -0.379,-1.902, 0.444, 2.074, 1.006),
X2BCSTCR = c(0.405, -0.457,-1.622, 1.368, 1.981, 0.168),
X2BCSTPR = c(-0.511, -0.036,2.189, -0.036, -0.894, 0.949),
X3BCSTCAT = c(1.18, -1.399,-1.399, 1.18, 1.18, 1.18),
X3BCSTCR = c(0.967, -1.622, -1.622,0.967, 0.967, 1.255),
X3BCSTPR = c(-1.282, -1.282, 1.539,1.539, 0.792, 0.792)),
row.names = c(1L, 2L, 3L, 4L, 5L,8L), class = "data.frame")
# define a function that accept strings as input
pair_plot <- function(x_var, y_var, group_var) {
# convert strings to symbols
x_var <- rlang::sym(x_var)
y_var <- rlang::sym(y_var)
group_var <- rlang::sym(group_var)
# unquote symbols using !!
ggplot(df, aes(x = !! x_var, y = !! y_var, shape = !! group_var, color = !! group_var)) +
geom_point() + geom_smooth(method = lm, aes(fill = !! group_var), fullrange = TRUE) +
labs(title = "BCSTCAT", x = rlang::quo_name(x_var), y = rlang::quo_name(y_var)) +
scale_color_manual(name = "Group", labels = c("C8", "TC"), values = c("blue", "red")) +
scale_shape_manual(name = "Group", labels = c("C8", "TC"), values = c(16, 17)) +
scale_fill_manual(name = "Group", labels = c("C8", "TC"), values = c("light blue", "pink")) +
theme_bw()
}
# Test if the new function works
pair_plot("X1BCSTCAT", "X2BCSTCAT", "Group")
# Create 2 parallel lists
list_x <- colnames(df)[-c(1:6, (ncol(df)-2):(ncol(df)))]
list_x
#> [1] "X1BCSTCAT" "X1BCSTCR" "X1BCSTPR" "X2BCSTCAT" "X2BCSTCR" "X2BCSTPR"
list_y <- lead(colnames(df)[-(1:6)], 3)[1:length(list_x)]
list_y
#> [1] "X2BCSTCAT" "X2BCSTCR" "X2BCSTPR" "X3BCSTCAT" "X3BCSTCR" "X3BCSTPR"
# Loop through 2 lists simultaneously
# Supply inputs to pair_plot function using purrr::map2
map2(list_x, list_y, ~ pair_plot(.x, .y, "Group"))
Sample outputs:
#> [[1]]
#>
#> [[2]]
Created on 2018-05-24 by the reprex package (v0.2.0).
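The two name vectors above are built by position, which relies on the columns being in a fixed order. The question also mentions matching by name; a base R sketch of that could look like this. pair_names is a hypothetical helper, and it assumes the time index is a single digit right after the leading "X".

```r
# Hypothetical helper: pair each "X<from>..." column with its "X<to>..."
# counterpart by name. Assumes single-digit time indices.
pair_names <- function(nms, from, to) {
  x <- grep(paste0("^X", from), nms, value = TRUE)
  y <- sub(paste0("^X", from), paste0("X", to), x)
  keep <- y %in% nms  # drop variables that were not measured at time `to`
  data.frame(x = x[keep], y = y[keep], stringsAsFactors = FALSE)
}

# Column names in the style of the question's data frame
nms <- c("ID", "Group", "X1BCSTCAT", "X1BCSTCR",
         "X2BCSTCAT", "X2BCSTCR", "X3BCSTCAT", "X3BCSTCR")

pairs_12 <- pair_names(nms, 1, 2)
pairs_23 <- pair_names(nms, 2, 3)
```

The x and y columns of each result can be passed to map2() in place of list_x and list_y.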

Annotate faceted plot in ggplot2

I am working on the dataset reported here below (pre.sss)
pre.sss <- pre.sss <- structure(list(Pretest.num = c(63, 62, 61, 60, 59, 58, 57, 4,2, 1), stress = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L,1L), .Label = c("[0,6]", "(6,9]"), class = "factor"), time = c(1L,1L, 1L, 1L, 1L, 1L, 1L, 8L, 8L, 8L), after = structure(c(2L,2L, 2L, 2L, 2L, 2L, 1L, 1L, NA, 1L), .Label = c("no", "yes"), class = "factor"),id = c("call_fam", "call_fam", "call_fam", "call_fam", "call_fam","call_fam", "call_fam", "counselor", "counselor", "counselor")), .Names = c("Pretest.num", "stress", "time", "after","id"), reshapeLong = structure(list(varying = structure(list(after = c("after.call.fam", "after.speak", "after.send.email","after.send.card", "after.attend", "after.fam.mtg", "after.sup.grp","after.counselor")), .Names = "after", v.names = "after", times = 1:8),v.names = "after", idvar = "Pretest.num", timevar = "time"), .Names = c("varying","v.names", "idvar", "timevar")), row.names = c("63.1", "62.1","61.1", "60.1", "59.1", "58.1", "57.1", "4.8", "2.8", "1.8"), class = "data.frame")
and I need to plot the counts of several categorical variables according to a specific level of another categorical variable ('stress'), so a faceted bubble plot would do the job in my case.
So what I do is the following:
ylabels = c('call_fam' = "call fam.member for condolences",
'speak' = "speak to fam.member in person",
'send.email' = "send condolence email to fam.member",
'send.card' = "send condolence card/letter to fam.member",
'attend' = "attend funeral/wake",
'fam.mtg' = "provide fam.meeting",
'sup.grp' = "suggest attending support grp.",
'counselor' = "make referral to bereavement counselor" )
p = ggplot(pre.sss, aes(x = after, y = id)) +
geom_count(alpha = 0.5, col = 'darkblue') +
scale_size(range = c(1,30)) +
theme(legend.position = 'none') +
xlab("Response") +
ylab("What did you do after learning about death?") +
scale_y_discrete(labels = ylabels) +
facet_grid(.~ pre.sss$stress, labeller = as_labeller(stress.labels))
and I obtain the following image, exactly as I want.
Now I would like to label each bubble with the count with which the corresponding data appear in the dataset.
dat = data.frame(ggplot_build(p)$data[[1]][, c('x', 'y', 'PANEL', 'n')])
dat$PANEL = ifelse(dat$PANEL==1, "[0,6]", "(6-9]")
colnames(dat) = c('x', 'y', 'stress', 'n')
p + geom_text(aes(x, y, label = n, group = NULL), data = dat)
This gives me the following error I really can't understand.
> p + geom_text(aes(x, y, label=n, group=NULL), data=dat)
Error in `$<-.data.frame`(`*tmp*`, "PANEL", value = c(1L, 1L, 1L, 1L, :
replacement has 504 rows, data has 46
Can anybody help me with this?
Thanks!
EM
The function you refer to as your labeller function is still missing from this example. geom_count uses stat_sum, which calculates a variable n, the number of observations at each point. Because you can use this computed variable directly, you don't have to assign the plot to a variable and pull out its data, as you did with ggplot_build.
This should do what you're looking for:
ggplot(pre.sss, aes(x = after, y = id)) +
geom_count(alpha = 0.5, col = 'darkblue') +
# note the following line
stat_sum(mapping = aes(label = ..n..), geom = "text") +
scale_size(range = c(1,30)) +
theme(legend.position = 'none') +
xlab("Response") +
ylab("What did you do after learning about death?") +
scale_y_discrete(labels = ylabels) +
facet_grid(.~ stress)
The line I added computes the same thing as what's behind the scenes in geom_count, but gives it a text geom instead, with the label mapped to that computed parameter n.
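Since ggplot2 3.3.0, computed variables are usually written as after_stat(n) rather than ..n... Here is a small sketch of the same labeling idea on made-up data (toy is not the question's data set), with layer_data() used to inspect the counts that end up as labels:

```r
library(ggplot2)

# Toy data with illustrative values; assumes ggplot2 >= 3.3.0
toy <- data.frame(
  after = c("yes", "yes", "no", "yes", "no", "no"),
  id = c("call_fam", "call_fam", "call_fam",
         "counselor", "counselor", "counselor"),
  stress = c("[0,6]", "[0,6]", "[0,6]", "(6,9]", "(6,9]", "(6,9]")
)

p <- ggplot(toy, aes(x = after, y = id)) +
  geom_count(alpha = 0.5, colour = "darkblue") +
  # Same stat as geom_count(), drawn as text; n is computed by stat_sum()
  stat_sum(aes(label = after_stat(n)), geom = "text") +
  facet_grid(. ~ stress)

# Data behind the text layer: one row per x/y/panel combination
label_data <- layer_data(p, 2)
```

Each row of label_data corresponds to one bubble, and its n column holds the count shown as the label.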

How to fix the following output plot by R? [duplicate]

I have the following plot:
library(reshape)
library(ggplot2)
library(gridExtra)
require(ggplot2)
data2<-structure(list(IR = structure(c(4L, 3L, 2L, 1L, 4L, 3L, 2L, 1L
), .Label = c("0.13-0.16", "0.17-0.23", "0.24-0.27", "0.28-1"
), class = "factor"), variable = structure(c(1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L), .Label = c("Real queens", "Simulated individuals"
), class = "factor"), value = c(15L, 11L, 29L, 42L, 0L, 5L, 21L,
22L), Legend = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Real queens",
"Simulated individuals"), class = "factor")), .Names = c("IR",
"variable", "value", "Legend"), row.names = c(NA, -8L), class = "data.frame")
p <- ggplot(data2, aes(x =factor(IR), y = value, fill = Legend, width=.15))
data3<-structure(list(IR = structure(c(4L, 3L, 2L, 1L, 4L, 3L, 2L, 1L
), .Label = c("0.13-0.16", "0.17-0.23", "0.24-0.27", "0.28-1"
), class = "factor"), variable = structure(c(1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L), .Label = c("Real queens", "Simulated individuals"
), class = "factor"), value = c(2L, 2L, 6L, 10L, 0L, 1L, 4L,
4L), Legend = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Real queens",
"Simulated individuals"), class = "factor")), .Names = c("IR",
"variable", "value", "Legend"), row.names = c(NA, -8L), class = "data.frame")
q<- ggplot(data3, aes(x =factor(IR), y = value, fill = Legend, width=.15))
##the plot##
q + geom_bar(position='dodge', colour='black') + ylab('Frequency') + xlab('IR')+scale_fill_grey() +theme(axis.text.x=element_text(colour="black"), axis.text.y=element_text(colour="Black"))+ opts(title='', panel.grid.major = theme_blank(),panel.grid.minor = theme_blank(),panel.border = theme_blank(),panel.background = theme_blank(), axis.ticks.x = theme_blank())
I want the y-axis to display only integers. Whether this is accomplished through rounding or through a more elegant method isn't really important to me.
If you have the scales package, you can use pretty_breaks() without having to manually specify the breaks.
q + geom_bar(position='dodge', colour='black') +
scale_y_continuous(breaks= pretty_breaks())
This is what I use:
ggplot(data3, aes(x = factor(IR), y = value, fill = Legend, width = .15)) +
geom_col(position = 'dodge', colour = 'black') +
scale_y_continuous(breaks = function(x) unique(floor(pretty(seq(0, (max(x) + 1) * 1.1)))))
With scale_y_continuous() and argument breaks= you can set the breaking points for y axis to integers you want to display.
ggplot(data2, aes(x =factor(IR), y = value, fill = Legend, width=.15)) +
geom_bar(position='dodge', colour='black')+
scale_y_continuous(breaks=c(1,3,7,10))
You can use a custom breaks function. For example, this function is guaranteed to produce only integer breaks:
int_breaks <- function(x, n = 5) {
l <- pretty(x, n)
l[abs(l %% 1) < .Machine$double.eps ^ 0.5]
}
Use as
+ scale_y_continuous(breaks = int_breaks)
It works by taking the default breaks, and only keeping those that are integers. If it is showing too few breaks for your data, increase n, e.g.:
+ scale_y_continuous(breaks = function(x) int_breaks(x, n = 10))
The existing solutions did not work for me, and they did not explain how they work.
The breaks argument to the scale_*_continuous functions can be used with a custom function that takes the limits as input and returns breaks as output. By default, the axis limits will be expanded by 5% on each side for continuous data (relative to the range of data). The axis limits will likely not be integer values due to this expansion.
The solution I was looking for was to simply round the lower limit up to the nearest integer, round the upper limit down to the nearest integer, and then have breaks at integer values between these endpoints. Therefore, I used the breaks function:
brk <- function(x) seq(ceiling(x[1]), floor(x[2]), by = 1)
The required code snippet is:
scale_y_continuous(breaks = function(x) seq(ceiling(x[1]), floor(x[2]), by = 1))
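To see what this breaks function returns, it can be called directly on a pair of limits. The limits below are made up to resemble a 0 to 10 range after the default expansion. safe_int_breaks is a hypothetical variant that guards against limits containing no integer, a case in which seq() would otherwise stop with a "wrong sign in 'by' argument" error.

```r
brk <- function(x) seq(ceiling(x[1]), floor(x[2]), by = 1)

# Illustrative limits: a 0-10 data range expanded by 5% on each side
b <- brk(c(-0.5, 10.5))

# Hypothetical guarded variant: return no breaks when the limits
# contain no integer (e.g. c(0.2, 0.8))
safe_int_breaks <- function(x) {
  lo <- ceiling(x[1])
  hi <- floor(x[2])
  if (lo > hi) return(numeric(0))
  seq(lo, hi, by = 1)
}

none <- safe_int_breaks(c(0.2, 0.8))
some <- safe_int_breaks(c(0.2, 3.8))
```

Note that a step of 1 yields one break per integer, so this approach suits small counts; for wide ranges it would produce far too many breaks.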
The reproducible example from the original question is:
data3 <-
structure(
list(
IR = structure(
c(4L, 3L, 2L, 1L, 4L, 3L, 2L, 1L),
.Label = c("0.13-0.16", "0.17-0.23", "0.24-0.27", "0.28-1"),
class = "factor"
),
variable = structure(
c(1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L),
.Label = c("Real queens", "Simulated individuals"),
class = "factor"
),
value = c(2L, 2L, 6L, 10L, 0L, 1L, 4L,
4L),
Legend = structure(
c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
.Label = c("Real queens",
"Simulated individuals"),
class = "factor"
)
),
row.names = c(NA,-8L),
class = "data.frame"
)
ggplot(data3, aes(
x = factor(IR),
y = value,
fill = Legend,
width = .15
)) +
geom_col(position = 'dodge', colour = 'black') + ylab('Frequency') + xlab('IR') +
scale_fill_grey() +
scale_y_continuous(
breaks = function(x) seq(ceiling(x[1]), floor(x[2]), by = 1),
expand = expand_scale(mult = c(0, 0.05))
) +
theme(axis.text.x=element_text(colour="black", angle = 45, hjust = 1),
axis.text.y=element_text(colour="Black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks.x = element_blank())
I found this solution from Joshua Cook, and it worked pretty well.
integer_breaks <- function(n = 5, ...) {
fxn <- function(x) {
breaks <- floor(pretty(x, n, ...))
names(breaks) <- attr(breaks, "labels")
breaks
}
return(fxn)
}
q + geom_bar(position='dodge', colour='black') +
scale_y_continuous(breaks = integer_breaks())
The source is:
https://joshuacook.netlify.app/post/integer-values-ggplot-axis/
You can use the accuracy argument of scales::label_number() or scales::label_comma() for this:
fakedata <- data.frame(
x = 1:5,
y = c(0.1, 1.2, 2.4, 2.9, 2.2)
)
library(ggplot2)
# without the accuracy argument, you see .0 decimals
ggplot(fakedata, aes(x = x, y = y)) +
geom_point() +
scale_y_continuous(label = scales::comma)
# with the accuracy argument, all displayed numbers are integers
ggplot(fakedata, aes(x = x, y = y)) +
geom_point() +
scale_y_continuous(label = ~ scales::comma(.x, accuracy = 1))
# equivalent
ggplot(fakedata, aes(x = x, y = y)) +
geom_point() +
scale_y_continuous(label = scales::label_comma(accuracy = 1))
# this works with scales::label_number() as well
ggplot(fakedata, aes(x = x, y = y)) +
geom_point() +
scale_y_continuous(label = scales::label_number(accuracy = 1))
Created on 2021-08-27 by the reprex package (v2.0.0.9000)
All of the existing answers seem to require custom functions or fail in some cases.
This line makes integer breaks:
bad_scale_plot +
scale_y_continuous(breaks = scales::breaks_extended(Q = c(1, 5, 2, 4, 3)))
For more info, see the documentation ?labeling::extended (which is a function called by scales::breaks_extended).
Basically, the argument Q is a set of nice numbers that the algorithm tries to use for scale breaks. The original plot produces non-integer breaks (0, 2.5, 5, and 7.5) because the default value for Q includes 2.5: Q = c(1,5,2,2.5,4,3).
EDIT: as pointed out in a comment, non-integer breaks can occur when the y-axis has a small range. By default, breaks_extended() tries to make about n = 5 breaks, which is impossible when the range is too small. Quick testing shows that ranges wider than 0 < y < 2.5 give integer breaks (n can also be decreased manually).
This answer builds on @Axeman's answer to address the comment by kory that if the data only go from 0 to 1, no break is shown at 1. This seems to be caused by floating-point inaccuracy in pretty(): an output that prints as 1 is not identical to 1 (see the example at the end).
Therefore if you use
int_breaks_rounded <- function(x, n = 5) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
with
+ scale_y_continuous(breaks = int_breaks_rounded)
both 0 and 1 are shown as breaks.
Example to illustrate difference from Axeman's
testdata <- data.frame(x = 1:5, y = c(0,1,0,1,1))
p1 <- ggplot(testdata, aes(x = x, y = y))+
geom_point()
p1 + scale_y_continuous(breaks = int_breaks)
p1 + scale_y_continuous(breaks = int_breaks_rounded)
Both will work with the data provided in the initial question.
Illustration of why rounding is required
pretty(c(0,1.05),5)
#> [1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2
identical(pretty(c(0,1.05),5)[6],1)
#> [1] FALSE
Google brought me to this question. I'm trying to use real numbers in a y scale. The y scale numbers are in Millions.
The scales package comma method introduces a comma to my large numbers. This post on R-Bloggers explains a simple approach using the comma method:
library(scales)
big_numbers <- data.frame(x = 1:5, y = c(1000000:1000004))
big_numbers_plot <- ggplot(big_numbers, aes(x = x, y = y))+
geom_point()
big_numbers_plot + scale_y_continuous(labels = comma)
Enjoy R :)
One answer is in fact in the documentation of the pretty() function. As pointed out in "Setting axes to integer values in 'ggplot2'", the function already contains the solution; you just have to make it work for small values. One possibility is to write a new function, as the author of that post does; for me, a lambda function inside the breaks argument works:
... + scale_y_continuous(breaks = ~round(unique(pretty(.))))
It will round the unique set of values generated by pretty() creating only integer labels, no matter the scale of values.
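One detail worth checking is the order of round() and unique(). For a small range, rounding after unique() can leave repeated break values, whereas taking unique() last removes them. This base R sketch uses made-up limits of 0 to 1:

```r
# Illustrative limits for a variable that only ranges from 0 to 1
lims <- c(0, 1)

# Order used in the answer: unique() first, then round()
unique_first <- round(unique(pretty(lims)))

# Possible variant: round() first, then unique(), so duplicates are dropped
rounded_first <- unique(round(pretty(lims)))
```

Here unique_first still contains repeated 0s and 1s, while rounded_first is just 0 and 1. If repeated breaks cause trouble, breaks = ~unique(round(pretty(.))) is a possible variant of the same one-liner.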
If your values are integers, here is another way of doing this with group = 1 and as.factor(value):
library(tidyverse)
data3<-structure(list(IR = structure(c(4L, 3L, 2L, 1L, 4L, 3L, 2L, 1L
), .Label = c("0.13-0.16", "0.17-0.23", "0.24-0.27", "0.28-1"
), class = "factor"), variable = structure(c(1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L), .Label = c("Real queens", "Simulated individuals"
), class = "factor"), value = c(2L, 2L, 6L, 10L, 0L, 1L, 4L,
4L), Legend = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Real queens",
"Simulated individuals"), class = "factor")), .Names = c("IR",
"variable", "value", "Legend"), row.names = c(NA, -8L), class = "data.frame")
data3 %>%
mutate(value = as.factor(value)) %>%
ggplot(aes(x =factor(IR), y = value, fill = Legend, width=.15)) +
geom_col(position = 'dodge', colour='black', group = 1)
Created on 2022-04-05 by the reprex package (v2.0.1)
This is what I did
scale_x_continuous(labels = function(x) round(as.numeric(x)))

Resources