ggplot2 geom_line() to skip NA values - r

I have a data set that has multiple NA values in it. When plotting this data, ggplot's geom_line() option joins lines across the NA values. Is there any way to have ggplot skip joining the lines across NA values?
Edit: A thousand apologies to all involved. I made a mistake in my manipulation of the data frame. I figured out my problem. My x axis was not continuous when I created a subset. The missing data had not been replaced by NAs, so the data was being linked because there were no NAs created in the subset between rows.

geom_line does make breaks for NAs in the y column, but it joins across NA values in the x column.
library(ggplot2)
# Set up a data frame with NAs in the 'x' column
independant <- c(0, 1, NA, 3, 4)
dependant <- 0:4
d <- data.frame(independant=independant, dependant=dependant)
# Note the unbroken line
ggplot(d, aes(x=independant, y=dependant)) + geom_line()
I assume that your NA values are in your as.POSIXlt(date). If so, one solution would be to map the columns with NA values to y, and then use coord_flip to make the y axis horizontal:
ggplot(d, aes(x=dependant, y=independant)) + geom_line() +
coord_flip()
Presumably your code would be:
ggplot(crew.twelves, aes(x=laffcu, y=as.POSIXlt(date))) + geom_line() +
coord_flip()
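The edit to the question points at the underlying cause: the subset had gaps in x but no rows holding NA, so geom_line() had nothing to break on. Below is a minimal sketch of one way to restore those rows, assuming a regularly spaced x. The data frame and the names obs and full are made up for illustration and are not from the question.

```r
# Hypothetical data: x = 3 and x = 4 were never recorded, so there are
# no rows (and therefore no NAs) for them
obs <- data.frame(x = c(1, 2, 5, 6), y = c(10, 12, 9, 11))

# Build the complete x grid and left-join the observations onto it;
# grid positions without an observation get y = NA
full <- merge(data.frame(x = seq(1, 6, by = 1)), obs, by = "x", all.x = TRUE)
full <- full[order(full$x), ]
print(full)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(full, ggplot2::aes(x = x, y = y)) +
    ggplot2::geom_line()
  print(p)
}
```

Because the gap is now encoded as NA in y rather than as absent rows, the line should be drawn in two separate segments.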

Related

Boxplots by base R and ggplot2 do not match

I have a simple dataset. When I generate a boxplot of the data with base R and with ggplot2 separately, the two do not match. In fact, the base R boxplot is consistent with the summary function.
library(tidyverse)
library(ggplotify)
library(patchwork)
df <- read.csv("test_boxplot_data.csv")
summary(df)
p1 <- as.ggplot(~boxplot(df$y, outline=FALSE))
p2 <- ggplot(df, aes(y=y)) + geom_boxplot(outlier.shape = NA) + ylim(0,100)
p1 + p2 + plot_layout(ncol = 2)
Generated plot kept here.
Any clue what is happening? It is also surprising that ggplot throws a warning that "Removed 845 rows containing non-finite values (stat_boxplot)" when there is no NA in the data.
The warning "Removed 845 rows containing non-finite values (stat_boxplot)" is the clue. The data contains 845 points > 100, and these points are dropped before the box plot statistics are calculated.
From the first line of help for ylim():
"This is a shortcut for supplying the limits argument to the individual scales. By default, any values outside the limits specified are replaced with NA. Be warned that this will remove data outside the limits and this can produce unintended results. For changing x or y axis limits without dropping data observations, see coord_cartesian()."
This should provide the desired graph:
ggplot(df, aes(y=y)) + geom_boxplot(outlier.shape = NA) +
coord_cartesian(ylim=c(0,100))
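A rough base R illustration of what the quoted help text describes. The vector y is made-up data, and the ifelse() call only mimics the "replace out-of-limit values with NA" step; it is not ggplot2's own code.

```r
# Hypothetical data: ten values inside [0, 100] and four above it
y <- c(seq(5, 95, by = 10), 150, 200, 300, 400)

# Mimic ylim(0, 100): values outside the limits become NA before any
# statistics are computed
y_ylim <- ifelse(y >= 0 & y <= 100, y, NA)

sum(is.na(y_ylim))            # 4 values were dropped
median(y)                     # 70, computed on all the data
median(y_ylim, na.rm = TRUE)  # 50, computed on the censored data
```

coord_cartesian() leaves the data untouched and only zooms the view, so the statistics stay those of the full data.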

How to connect across multiple consecutive missing data values using geom_line?

I have a similar problem to Q: Connecting across missing values with geom_line, but found the answers provided only connect the lines when there is one missing value only. If there are 2+ consecutive missing values the solutions offered do not apply.
I need to connect multiple observations made over time for individual trees. Sometimes measurements were missed such that there are missing values in my df, and sometimes an individual tree was missed more than one year in a row, such that there are multiple consecutive NAs.
When there is only one consecutive NA, using geom_line with this specification works a treat to connect across missing values:
geom_line(data = df[!is.na(df$y),])
When there is more than one consecutive NA (i.e. 2 measurements missed) geom_line will not draw across the missing data. Applying !is.na to the whole df does not solve the problem, nor does using geom_path.
Here is code to generate a df that replicates the issue:
x <- c(1,2,3,4,5,6,7,8,9)
tr1 <- c(20,25,18,16,22,12,NA,15,45)
tr2 <- c(12,NA,NA,NA,30,48,30,NA,NA)
df <- data.frame(x, tr1,tr2)
The following code can be used to graph a) tree1 with NA missing, b) tree1 with NA bridged, c) tree2 with the geom_line correction in the code but missing the expected line across NAs:
library(ggplot2)
library(gridExtra)
tree1 <- ggplot(df, aes(x, tr1)) + geom_point() +
geom_line()
tree1.fix <- ggplot(df, aes(x, tr1)) + geom_point() +
geom_line(data = df[!is.na(df$tr1),])
nofix <- ggplot(df, aes(x, tr2)) + geom_point() +
geom_line(data = df[!is.na(df$tr2),])
grid.arrange(tree1, tree1.fix, nofix, ncol = 3)
Any ideas?
geom_line() does not connect across any missing data (NA). And geom_point() does not plot missing data either. That is the correct default behaviour for missing data. NA cannot be placed on numerical axes.
What you are doing with df[!is.na(df$tr2),] is removing the missing data before sending it to geom_line(), tricking it into thinking that your data is complete.
To better understand this, print out df[!is.na(df$tr2), c("x", "tr2")]. That's the data that geom_line() receives. All of this data is displayed and connected. There are no NAs in that data, because you removed them.
In your "nofix" example, you get a line from x=1 to x=5, over three consecutive NAs.
So I assume that you mean that geom_line() does not continue after x=7?
But look at the data. There is no data after x=7. Every x>7 has y=NA. And if you remove NAs, then there is no data at all after x=7.
If your example had one more point, say x=10 y=10, then the line would continue from x=7 to x=10.
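The point can be checked directly on the data from the question. This is a small sketch; kept, df2 and kept2 are names introduced here for illustration.

```r
# tr2 from the question
x   <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
tr2 <- c(12, NA, NA, NA, 30, 48, 30, NA, NA)
df  <- data.frame(x, tr2)

# This is what geom_line() receives after the NA rows are removed
kept <- df[!is.na(df$tr2), c("x", "tr2")]
print(kept)
max(kept$x)   # 7: nothing is left to draw beyond this point

# With one extra observation at x = 10, the filtered data reaches x = 10
df2   <- rbind(df, data.frame(x = 10, tr2 = 10))
kept2 <- df2[!is.na(df2$tr2), c("x", "tr2")]
max(kept2$x)  # 10
```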

R - ggplot2 - difference between ggplot(data, aes(x=variable...)) and ggplot(data, aes(x=data$variable...)) [duplicate]

This question already has an answer here:
Issue when passing variable with dollar sign notation ($) to aes() in combination with facet_grid() or facet_wrap()
(1 answer)
Closed 4 years ago.
I have currently encountered a phenomenon in ggplot2, and I would be grateful if someone could provide me with an explanation.
I needed to plot a continuous variable on a histogram, and I needed to represent two categorical variables on the plot. The following dataframe is a good example.
library(ggplot2)
species <- rep(c('cat', 'dog'), 30)
numb <- rep(c(1,2,3,7,8,10), 10)
groups <- rep(c('A', 'A', 'B', 'B'), 15)
data <- data.frame(species=species, numb=numb, groups=groups)
Let the following code represent the categorisation of a continuous variable.
data$factnumb <- as.factor(data$numb)
If I want to plot this dataset, the following two pieces of code appear to be completely interchangeable:
Note the difference after the fill= statement.
p <- ggplot(data, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(p):
q <- ggplot(data, aes(x=factnumb, fill=data$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(q):
However, when working with real-life continuous variables, not all categories will contain observations, and I still need to represent the empty categories on the x-axis in order to get an approximation of the sample distribution. To demonstrate this, I used the following code:
data_miss <- data[which(data$numb!= 3),]
This results in a disparity between the levels of the categorical variable and the observations in the dataset:
> unique(data_miss$factnumb)
[1] 1 2 7 8 10
Levels: 1 2 3 7 8 10
I then plotted the data_miss dataset, still including all of the levels of the factnumb variable.
pm <- ggplot(data_miss, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_fill_discrete(drop=FALSE) +
scale_x_discrete(drop=FALSE)+
scale_y_continuous(labels = scales::percent)
plot(pm):
qm <- ggplot(data_miss, aes(x=factnumb, fill=data_miss$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_x_discrete(drop=FALSE)+
scale_fill_discrete(drop=FALSE) +
scale_y_continuous(labels = scales::percent)
plot(qm):
In this case, when using fill=data_miss$species the filling of the plot changes (and for the worse).
I would be really happy if someone could clear this one up for me.
Is it just "luck" that the filling is identical in plots 1 and 2, or have I stumbled upon some delicate mistake in the fine machinery of ggplot2?
Thanks in advance!
Kind regards,
Bernadette
Using data$variable inside aes() is never good, never recommended, and should never be done. Sometimes it still works, but aes(variable) always works, so you should always use aes(variable).
More explanation:
ggplot uses nonstandard evaluation. A standard-evaluating R function can only see objects in the environment it is called from, which for top-level code is the global environment (plus attached packages). If I have data named mydata with a column named col1, and I do mean(col1), I get an error:
mydata = data.frame(col1 = 1:3)
mean(col1)
# Error in mean(col1) : object 'col1' not found
This error happens because col1 isn't in the global environment. It's just a column name of the mydata data frame.
The aes function does extra work behind the scenes, and knows to look at the columns of the layer's data, in addition to checking the global environment.
ggplot(mydata, aes(x = col1)) + geom_bar()
# no error
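Base R has a comparable kind of data-aware lookup, which may make the idea more concrete. The sketch below is only an analogy for what aes() does, not ggplot2's implementation.

```r
mydata <- data.frame(col1 = 1:3)

# with() evaluates the expression using the data frame's columns as variables
with(mydata, mean(col1))         # 2

# The same idea spelled out: capture the expression unevaluated,
# then evaluate it with the data frame supplying the names
eval(quote(mean(col1)), mydata)  # 2
```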
You don't have to use just a column inside aes though. To give flexibility, you can do a function of a column, or even some other vector that you happen to define on the spot (if it has the right length):
# these work fine too
ggplot(mydata, aes(x = log(col1))) + geom_bar()
ggplot(mydata, aes(x = c(1, 8, 11))) + geom_bar()
So what's the difference between col1 and mydata$col1? Well, col1 is the name of a column, and mydata$col1 is the actual values. ggplot will look for a column named col1 in your data and use that. mydata$col1 is just a vector: it's the full column. The difference matters because ggplot often does data manipulation. Whenever there are facets or aggregate functions, ggplot splits your data up into pieces and works on each piece. To do this effectively, it needs to be able to identify the data and the column names. When you give it mydata$col1, you're not giving it a column name, you're just giving it a vector of values (whatever happens to be in that column), and things don't work.
So, just use unquoted column names in aes() without data$ and everything will work as expected.
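A rough base R picture of why a full vector fits poorly once the data is split, using the columns from the question. split() stands in here for faceting as an assumption made for illustration; ggplot2's internals are more involved.

```r
species <- rep(c("cat", "dog"), 30)
groups  <- rep(c("A", "A", "B", "B"), 15)
data    <- data.frame(species = species, groups = groups)

# Faceting by 'groups' amounts to working on one piece of the data at a time
pieces <- split(data, data$groups)
sapply(pieces, nrow)   # A: 30, B: 30

# A column name can be looked up inside each piece ...
table(pieces$A$species)

# ... but data$species is always the full 60-value vector, which no longer
# lines up with the 30 rows of a single piece
length(data$species)   # 60
```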

How to barplot select rows of data from a dataframe in R?

This is my first time submitting a question, so apologies in advance if my formatting is not optimal.
I have a dataframe with roughly 6,000 rows of data in 2 columns, and I want to be able to pull out individual rows (and multiple rows together) to barplot.
I read my file in as a dataframe, here is a very small subset:
gene log2
1 SMa0002 0.457418
2 SMa0005 1.116950
3 SMa0007 0.686749
4 SMa0009 0.169450
5 SMa0011 0.393365
6 SMa0013 0.601940
So what I would want to be able to do is have a barplot where the x axis is a number of genes (SMaXXX, SMaXXX, SMaXXX, etc.), and the y-axis is the log2 column. It only has (+) values displayed, but there are (-) values as well. I have no real preference about whether I use barplot or geom_bar in ggplot2, or another plotter.
I know how to just plot the dataframe;
ggplot(df, aes(x = gene, y = log2)) + geom_bar(stat = "identity")
I've tried playing around with using 'match' but I haven't been able to figure out how to make that work. Ideally the code is versatile so I can just punch in different SMaXXXX codes to generate many different plots.
Thanks for reading!
It seems that you just need a way to subset your data.frame when plotting, right?
Let's assume you've got a vector subset.genes of the genes you need to plot:
df=data.frame(gene=c("SMa0002","SMa0005","SMa0006","SMa0007","SMa0011","SMa0013"),
"log2"=runif(6), stringsAsFactors=F)
subset.genes=sample(unique(df$gene), 4, replace=F)
A couple of ways:
1°) Inside ggplot2
ggplot(df, aes(x = gene, y = log2)) + geom_bar(stat = "identity") +
scale_x_discrete(limits=subset.genes)
2°) before:
df2 <- subset(df, gene %in% subset.genes)
ggplot(df2, aes(x = gene, y = log2)) + geom_bar(stat = "identity")

stat_qq removes values when setting group

I am trying to make a QQ-plot in ggplot2, where a select few of the points should have a different shape. But when I map the shape to a variable in the aesthetics, stat_qq includes this variable to split the data (there are 2x3 factors involved).
Here is a reproducible example:
library(ggplot2)
set.seed(331)
df <- do.call(rbind, replicate(10, {expand.grid(method=factor(letters[1:3]), model=factor(LETTERS[1:2]))}, simplify=FALSE ))
df$x <- runif(nrow(df))
df$y <- rnorm(nrow(df), sd=0.2) + 1*as.integer(df$method)
df$top <- FALSE
df <- df[order(df$y, decreasing=TRUE),]
df$top[which(df$method=='a')[1:10]] <- TRUE
So far, I have managed to make a simple QQ-plot:
ggplot(df, aes(sample=y, colour=method)) + stat_qq() + facet_grid(.~model)
This is basically what I want, except that a handful of the points in method 'a' should have a different shape, as indicated by the variable 'top'.
From the code, we know that these correspond to the top 5 values in method 'a' in each model; i.e. the five leftmost red dots in each facet should have a different shape.
Here I have attempted to add it as an aesthetic:
ggplot(df, aes(sample=y, colour=method, shape=top)) + stat_qq() + facet_grid(.~model)
Now it is quite clear that stat_qq has used the variable 'top' to split the data set, as the top 5 data points are plotted parallel to the non-top points.
This is not as intended.
How can I instruct stat_qq how to group the data?
I could try the group-aesthetic:
ggplot(df, aes(sample=y, colour=method, shape=top, group=method)) + stat_qq() + facet_grid(.~model)
Warning messages:
1: Removed 10 rows containing missing values (geom_point).
2: Removed 10 rows containing missing values (geom_point).
But for some reason, this entirely removes all data points connected to the model.
Any ideas how to overcome this?
Since you want to violate one of the fundamental concepts of ggplot2 it would be easier to do the calculations outside of ggplot:
library(plyr)
df <- ddply(df, .(model, method),
transform, theo=qqnorm(y, plot.it=FALSE)[["x"]])
ggplot(df, aes(x=theo, y=y, colour=method, shape=top)) +
geom_point() + facet_grid(.~model)
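If you would rather avoid the plyr dependency, the same per-group calculation can be sketched in base R with ave(). This is an alternative offered as an assumption of equivalence, not the code from the answer above; the data set-up repeats the relevant lines from the question.

```r
set.seed(331)
df <- do.call(rbind, replicate(10,
  expand.grid(method = factor(letters[1:3]), model = factor(LETTERS[1:2])),
  simplify = FALSE))
df$y <- rnorm(nrow(df), sd = 0.2) + 1 * as.integer(df$method)

# Theoretical normal quantiles computed within each model x method group;
# qqnorm() returns them in the same order as its input
df$theo <- ave(df$y, df$model, df$method,
               FUN = function(v) qqnorm(v, plot.it = FALSE)$x)

head(df)
```

The resulting theo column can then be mapped to x in ggplot with geom_point(), as in the answer.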

Resources