Automatically determine number of axis ticks for discrete variable - r

I want to automatically set the number of breaks and the position of the breaks itself for the axis of a discrete variable such that the labels which are plotted are actually readable.
For example in the code below, the resulting plot should only show a portion of the labels/the x-variable.
ggData <- data.frame(x=paste0('B',1:100), y=rnorm(100))
ggplot(ggData, aes_string('x', 'y')) +
geom_point(size=2.5, shape=19, na.rm = TRUE)
So far, I tried to use pretty, and pretty_breaks which are, however, not for discrete variables.

First, we turn the factor into a character vector and then into an ordered factor, so that the levels keep the order in which they appear in the data. Second, we subset ggData$x to create a vector (labels) with the ticks we want: in this example, every 10th element. Finally, we create the plot with scale_x_discrete, passing that vector (labels) to the breaks parameter.
ggData <- data.frame(x=paste0('B',1:100), y=rnorm(100))
ggData$x <- as.character(ggData$x)
ggData$x <- factor(ggData$x, levels=unique(ggData$x))
# start at 10, not 0: R silently drops index 0, so the result is the same (B10, B20, ..., B100)
labels <- ggData$x[seq(10, 100, by = 10)]
ggplot(ggData, aes_string('x', 'y')) +
geom_point(size=2.5, shape=19, na.rm = TRUE) +
scale_x_discrete(breaks=labels)
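To avoid hard-coding the step of 10, the same idea can be wrapped in a small helper that derives the step from the number of levels. This is only a sketch: every_nth_break and its argument n are hypothetical names introduced here, not part of ggplot2, and n is a rough target rather than a guaranteed label count.

```r
library(ggplot2)

# Hypothetical helper: keep roughly `n` evenly spaced levels as axis breaks
every_nth_break <- function(lv, n = 10) {
  step <- max(1, ceiling(length(lv) / n))
  lv[seq(1, length(lv), by = step)]
}

ggData <- data.frame(x = paste0("B", 1:100), y = rnorm(100))
# keep the levels in their order of appearance
ggData$x <- factor(ggData$x, levels = unique(ggData$x))

breaks <- every_nth_break(levels(ggData$x), n = 10)

p <- ggplot(ggData, aes(x, y)) +
  geom_point(size = 2.5, shape = 19, na.rm = TRUE) +
  scale_x_discrete(breaks = breaks)
print(p)
```

Raising or lowering n trades label density against readability; the points themselves are all still drawn.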

Related

R ggplot loop: in a for loop of ggplot histograms, how can you automatically set the y axis scale based on max frequency?

I have the following loop to produce several histograms based off certain columns (columns 2 to 5) in a larger dataset (df):
loop.vector <- 2:5
for (i in loop.vector){
x <- df[,i]
print(ggplot(df,aes(x=x)) + geom_histogram(binwidth=1)+scale_x_continuous(breaks=seq(0,max(x),1)))
}
I'd like to have my y-axis scale done automatically as I have for the x-axis, where it ranges between zero and whatever the maximum frequency value is, at increments of 1.
I know how to set these values manually if I were to plot, take a look at it, and enter the max y-axis value separately, but I'd like to do this automatically within the loop.
Thanks!
Answering the question: how to access max counts for a histogram plot?
The information you're missing on each plot in order to create your scale_y_continuous command is the maximum number of counts. There is a nice way to access this information once you have created a ggplot object, which is to use the built-in ggplot_build() function from ggplot2. For a given plot, myPlot, the following will give you a list of dataframes that are used for each layer in your plot:
ggplot_build(myPlot)$data
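As a quick illustration of what that list contains, here is a minimal sketch using the built-in mtcars dataset as a stand-in for your data (myPlot and hist_data are just illustrative names):

```r
library(ggplot2)

# stand-in plot: a histogram of an integer-valued column
myPlot <- ggplot(mtcars, aes(carb)) + geom_histogram(binwidth = 1)

# one data frame per layer; this plot has a single histogram layer
hist_data <- ggplot_build(myPlot)$data[[1]]

# each row is one bin; `count` holds the number of observations in it
print(head(hist_data[, c("x", "count")]))
print(max(hist_data$count))
```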
In the case of your example, you can access the count column of the first data frame (since you only have one histogram geom layer). Here's how you can write the loop to do what you need it to do. I'll use an example dataset that can show you the results. Note that I've also changed your scale_x_continuous line to be able to accommodate positive and negative numbers by using a combination of min(), max(), and the ceiling() and floor() functions:
set.seed(1234)
df <- data.frame(
y1=rnorm(100,10,1),
y2=rnorm(100,12,3),
y3=rnorm(100,5,4),
y4=rnorm(100,13,5))
for (i in 1:ncol(df)) {
p <- ggplot(df, aes(df[,i])) +
geom_histogram(alpha=0.5, color='black', fill='red', binwidth=1) +
scale_x_continuous(breaks=seq(floor(min(df[,i])),ceiling(max(df[,i])))) +
ggtitle(names(df)[i])
# get max counts
max_count <- max(ggplot_build(p)$data[[1]]$count)
p <- p + scale_y_continuous(breaks=seq(0,max_count,1))
print(p)
}
Is there a better way?
While that gets you what you need, it's typically hard to deal with multiple plots output to your graphics device iteratively. I would recommend reformatting the above code as a function, calling it with lapply(), and using something like plot_grid() from cowplot to display the output. This suggested approach is detailed in the code below:
myPlots <- function(data, column, fill_color = 'red') {
# column = character name of column
# fill_color = fill colour of the bars (the default matches the loop above)
p <- ggplot(data, aes_string(x=column)) +
geom_histogram(fill=fill_color, binwidth=1, alpha=0.5, color='black') +
scale_x_continuous(breaks=seq(floor(min(data[[column]])), ceiling(max(data[[column]])),1)) +
ggtitle(column)
# get max counts from the built plot, as in the loop above
max_count <- max(ggplot_build(p)$data[[1]]$count)
p <- p + scale_y_continuous(breaks=seq(0,max_count,1))
return(p)
}
library(cowplot)
plotList <- lapply(names(df), myPlots, data=df)
plot_grid(plotlist = plotList)
Figured it out - my values are integers, so what ended up working was a variation on Duck's response. See below:
loop.vector <- 2:5
for (i in loop.vector){
x <- df[,i]
print(ggplot(df,aes(x=x)) + geom_histogram(binwidth=1)+scale_x_continuous(breaks=seq(0,max(x),1))+scale_y_continuous(breaks=seq(0,max(table(x)),1)))
}
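The reason max(table(x)) is enough here: with integer values and binwidth=1, each distinct value gets its own bar, so the tallest bar equals the most frequent value's count. A tiny sketch with made-up numbers (x_demo is illustrative data, not from the question):

```r
# illustrative integer data
x_demo <- c(1, 1, 2, 3, 3, 3, 5)

# frequency of each distinct value
counts <- table(x_demo)
print(counts)

# the tallest bar, and the y-axis breaks derived from it
y_breaks <- seq(0, max(counts), 1)
print(y_breaks)  # 0 1 2 3
```

For non-integer data the bins no longer line up with distinct values, in which case the ggplot_build() approach above is the safer choice.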

Can I set free scales for aesthetics other than x and y (e.g. size) when using facet_grid?

facet_grid and facet_wrap have the scales parameter, which as far as I know allows each plot to adjust the scales of the x and/or y axis to the data being plotted. Since according to the grammar of ggplot x and y are just two among many aesthetics, and there's a scale for each aesthetic, I figured it would be reasonable to have the option of letting each aesthetic be free, but so far I didn't find a way to do it.
I was trying to set it in particular for size, since sometimes a variable lives in a different order of magnitude depending on the group I'm using for the facet, and having the same scale for every group blocks the possibility of seeing within-group variation.
A reproducible example:
set.seed(1)
x <- runif(20,0,1)
y <- runif(20,0,1)
groups <- c(rep('small', 10), rep('big', 10))
size_small <- runif(10,0,1)
size_big <- runif(10,0,1) * 1000
df <- data.frame(x, y, groups, sizes = c(size_small, size_big))
And an auxiliary function for plotting:
basic_plot <- function(df) ggplot(df) +
geom_point(aes(x, y, size = sizes, color = groups)) +
scale_color_manual(values = c('big' = 'red', 'small' = 'blue')) +
coord_cartesian(xlim=c(0,1), ylim=c(0,1))
If we plot the data as is, we get the following:
basic_plot(df)
[Image: non-faceted plot]
The blue dots are relatively small, but there is nothing we can do.
If we add the facet:
basic_plot(df) +
facet_grid(~groups, scales = 'free')
[Image: faceted plot]
The blue dots continue being small. But I would like to take advantage of the fact that I'm dividing the data in two, and allow the size scale to adjust to the data of each plot. I would like to have something like the following:
plot_big <- basic_plot(df[df$groups == 'big',])
plot_small <- basic_plot(df[df$groups == 'small',])
grid.arrange(plot_big, plot_small, ncol = 2)
[Image: what I want]
Can it be done without resorting to this kind of micromanaging, or a manual rescaling of the sizes like the following?
df %>%
group_by(groups) %>%
mutate(maximo = max(sizes),
sizes = scale(sizes, center = F)) %>%
basic_plot() +
facet_grid(~groups)
I can manage to do those things, I'm just trying to see if I'm not missing another option, or if I'm misunderstanding the grammar of graphics.
Thank you for your time!
As mentioned, original plot aesthetics are maintained when calling facet_wrap. Since you need grouped graphs, consider base::by (the subsetting data frame function) wrapped in do.call:
do.call(grid.arrange,
args=list(grobs=by(df, df$groups, basic_plot),
ncol=2,
top="Grouped Point Plots"))
Should you need to share a legend, I always use the grid_arrange_shared_legend wrapper from @Steven Lockton's answer (the function is defined in that answer):
do.call(grid_arrange_shared_legend, by(df, df$groups, basic_plot))
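To see what by() hands to grid.arrange, here is a minimal sketch that rebuilds data shaped like the question's example and inspects the result. The plotting function is a simplified stand-in for basic_plot, and the variable names (df_demo, plots) are illustrative.

```r
library(ggplot2)

set.seed(1)
df_demo <- data.frame(
  x = runif(20), y = runif(20),
  groups = rep(c("small", "big"), each = 10),
  sizes = c(runif(10), runif(10) * 1000)
)

# by() splits the data frame on `groups` and calls the function on each piece,
# so every plot trains its own size scale on its own subset
plots <- by(df_demo, df_demo$groups, function(d) {
  ggplot(d) + geom_point(aes(x, y, size = sizes))
})

print(length(plots))  # one plot per group
print(names(plots))   # the group labels
```

Because each element is an independent ggplot object, each gets its own legend as well, which is why the shared-legend wrapper is optional rather than required.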

Line chart with categorical values in ggplot2?

I am having a problem constructing a line chart. Here is the output of my line chart. Why is the output like this? I mean, the lines don't touch (they are not continuous). Maybe the issue is connected with my data format or type?
The code for line chart:
plotLine <- ggplot(sales_clean,aes(x=sales_clean$Date,y=sales_clean$Net_Rev,na.rm = FALSE))
plotLine + geom_line()
1) The issue is that sales_clean$Year is a factor.
2) ggplot interprets your x value as categorical and your y value as continuous, so the values are grouped by category as in a bar plot (with vertical lines in place of bars).
Please see the simulation:
library(ggplot2)
set.seed(123)
sales_clean <- data.frame(Year = rep(factor(2014:2018), 1000), Net_Rev = abs(rnorm(5000)))
plotLine <- ggplot(sales_clean, aes(Year, Net_Rev, na.rm = FALSE))
plotLine + geom_line()
3) One solution is to convert the factor to numeric and aggregate by Year.
Please see the result:
sales_clean$Year_num <- as.numeric(as.character(sales_clean$Year))
sales_clean_plot <- aggregate(Net_Rev ~ Year_num, sales_clean, sum)
plotLine <- ggplot(sales_clean_plot, aes(Year_num, Net_Rev, na.rm = FALSE))
plotLine + geom_line()
4) It is better not to use $ in ggplot's aes(), as the data.frame name is already given in the first argument of ggplot(). The code becomes clumsy and difficult to read.
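A different technique from the numeric conversion above, sketched here as an alternative: keep Year as a factor and set group = 1 in aes(), which tells geom_line() to connect all categories as a single series. The simulated data are the same as in the example above; sales_by_year is an illustrative name.

```r
library(ggplot2)

set.seed(123)
sales_clean <- data.frame(Year = rep(factor(2014:2018), 1000),
                          Net_Rev = abs(rnorm(5000)))

# aggregate first so there is one value per year
sales_by_year <- aggregate(Net_Rev ~ Year, sales_clean, sum)

# group = 1 puts every category in the same group, so one line is drawn
p <- ggplot(sales_by_year, aes(Year, Net_Rev, group = 1)) +
  geom_line()
print(p)
```

This keeps the axis discrete, so the years are evenly spaced and labelled exactly as the factor levels.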

How to plot deviation from mean

In R I have created a simple matrix of one column yielding a list of numbers with a set mean and a given standard deviation.
rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
r <- rnorm2(100,4,1)
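As a quick check of what rnorm2 returns: scale() standardizes the draws, so the sample mean and standard deviation match the requested values exactly (up to floating point), and the result is a one-column matrix. A small sketch (r_demo is an illustrative name):

```r
rnorm2 <- function(n, mean, sd) { mean + sd * scale(rnorm(n)) }
r_demo <- rnorm2(100, 4, 1)

print(dim(r_demo))                          # 100 rows, 1 column
print(all.equal(mean(r_demo), 4))           # sample mean is the requested mean
print(all.equal(sd(as.vector(r_demo)), 1))  # sample sd is the requested sd
```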
I now would like to plot how these numbers differ from the mean. I can do this in Excel as shown below:
But I would like to use ggplot2 to create a graph in R. In the Excel graph I have cheated by using a line graph, but if I could do this as columns it would be better. I have tried using a scatter plot but I can't work out how to turn this into deviations from the mean.
Perhaps you want:
rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
set.seed(101)
r <- rnorm2(100,4,1)
x <- seq_along(r) ## sets up a vector from 1 to length(r)
par(las=1,bty="l") ## cosmetic preferences
plot(x, r, col = "green", pch=16) ## draws the points
## if you don't want points at all, use
## plot(x, r, type="n")
## to set up the axes without drawing anything inside them
segments(x0=x, y0=4, x1=x, y1=r, col="green") ## connects them to the mean line
abline(h=4)
If you were plotting around 0 you could do this automatically with type="h":
plot(x,r-4,type="h", col="green")
To do this in ggplot2:
library("ggplot2")
theme_set(theme_bw()) ## my cosmetic preferences
ggplot(data.frame(x,r))+
geom_segment(aes(x=x,xend=x,y=mean(r),yend=r),colour="green")+
geom_hline(yintercept=mean(r))
Ben's answer using ggplot2 works great, but if you don't want to manually adjust the line width, you could do this:
# Half of Ben's data
rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
set.seed(101)
r <- rnorm2(50,4,1)
x <- seq_along(r) ## sets up a vector from 1 to length(r)
# New variable for the difference between each value and the mean
value <- r - mean(r)
ggplot(data.frame(x, value)) +
# geom_bar anchors each bar at zero (which is the mean minus the mean)
geom_bar(aes(x, value), stat = "identity"
, position = "dodge", fill = "green") +
# but you can change the y-axis labels with a function, to add the mean back on
scale_y_continuous(labels = function(x) {x + mean(r)})
In base R it's quite simple, just do:
plot(r, col = "green", type = "l")
abline(4, 0)
You also tagged ggplot2, so in that case it will be a bit more complicated, because ggplot requires creating a data frame and then melting it.
library(ggplot2)
library(reshape2)
df <- melt(data.frame(x = 1:100, mean = 4, r = r), 1)
ggplot(df, aes(x, value, color = variable)) +
geom_line()

Combine continuous and discrete color scale in ggplot2?

I am a ggplot2 newbie. I am making a scatter plot where the points are colored based on a third continuous variable. However, for some of the points, that continuous variable has either an Inf value or a NaN. How can I generate a continuous scale that has a special, separate color for Inf and another separate color for NaN?
One way to get this behavior is to subset the data, and make a separate layer for the special points, where the color is set. But I'd like the special colors to enter the legend as well, and think it would be cleaner to eliminate the need to subset the data.
Thanks!
Uri
I'm sure this can be made more efficient, but here's one approach. Essentially, we follow your advice of subsetting the data into the different parts, divide the continuous data into discrete bins, then patch everything back together and use a scale of our own choosing.
library(ggplot2)
library(RColorBrewer)
#Sample data
dat <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
#Use NaN rather than NA, so that as.character() below yields the label "NaN"
dat[sample(nrow(dat), 5), 3] <- NaN
dat[sample(nrow(dat), 5), 3] <- Inf
#Subset out the real values
dat.good <- dat[!(is.na(dat$z)) & is.finite(dat$z) ,]
#Create 6 breaks for them
dat.good$col <- cut(dat.good$z, 6)
#Grab the bad ones
dat.bad <- dat[is.na(dat$z) | is.infinite(dat$z) ,]
dat.bad$col <- as.character(dat.bad$z)
#Rbind them back together
dat.plot <- rbind(dat.good, dat.bad)
#Fix the level order so the two special colours below go to Inf and NaN, in that order
dat.plot$col <- factor(dat.plot$col, levels = c(levels(dat.good$col), "Inf", "NaN"))
#Make your own scale with RColorBrewer
yourScale <- c(brewer.pal(6, "Blues"), "red","green")
ggplot(dat.plot, aes(x,y, colour = col)) +
geom_point() +
scale_colour_manual("Intensity", values = yourScale)
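The labelling step is the part that makes the special values show up in the legend, so here it is in isolation on a few made-up numbers (z_demo and the other names are illustrative):

```r
# illustrative values: three finite numbers plus the two special cases
z_demo <- c(-1.2, 0.3, 2.5, Inf, NaN)

finite <- is.finite(z_demo)

# finite values are binned; cut() returns interval labels such as "(a,b]"
bin_labels <- as.character(cut(z_demo[finite], 3))

# the special values become their own labels
special_labels <- as.character(z_demo[!finite])
print(special_labels)  # "Inf" "NaN"

all_labels <- c(bin_labels, special_labels)
print(all_labels)
```

Each distinct label then becomes one entry in the manual colour scale, which is how Inf and NaN end up with their own legend keys.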
