Remove unused factor levels from a ggplot bar plot - r

I want to do the opposite of this question, and sort of the opposite of this question, though that's about legends, not the plot itself.
The other SO questions seem to be asking about how to keep unused factor levels. I'd actually like mine removed. I have several name variables and several columns (wide format) of variable attributes that I'm using to create numerous bar plots. Here's a reproducible example:
library(ggplot2)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5))
ggplot(df, aes(x=name,y=var1)) + geom_bar(stat="identity")
I get this:
I'd like only the names that have corresponding varn's show up in my bar plot (as in, there would be no empty space for B).
Reusing the base plot code will be quite easy if I can simply change my output file name and the y=var bit. If possible, I'd like not to have to subset my data frame and call droplevels on the result for each plot!
Update based on the na.omit() suggestion
Consider a revised data set:
library(ggplot2)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5), var3=c(NA,6,7))
ggplot(df, aes(x=name,y=var1)) + geom_bar(stat="identity")
I need to use na.omit() for plotting var1 because there's an NA present. But since na.omit makes sure values are present for all columns, the plot removes A as well since it has an NA in var3. This is more analogous to my data. I have 15 total responses with NAs peppered about. I only want to remove factor levels that don't have values for the current plotted y vector, not that have NAs in any vector in the whole data frame.

One easy option is to use na.omit() on your data frame df to remove the rows that contain NA
ggplot(na.omit(df), aes(x=name,y=var1)) + geom_bar(stat="identity")
Given your update, the following
ggplot(df[!is.na(df$var1), ], aes(x=name,y=var1)) + geom_bar(stat="identity")
works OK and only considers NAs in var1. Given that you are only plotting name and var1, you can also apply na.omit() to a data frame containing only those variables
ggplot(na.omit(df[, c("name", "var1")]), aes(x=name,y=var1)) + geom_bar(stat="identity")
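To reuse the same plotting code for every column, the per-column NA filter can be wrapped in a small function. This is only a sketch: plot_var is a made-up helper name, it uses geom_col() (equivalent to geom_bar(stat="identity")), and it assumes a ggplot2 version recent enough to support the .data pronoun inside aes().

```r
library(ggplot2)

df <- data.frame(name = c("A", "B", "C"),
                 var1 = c(1, NA, 2),
                 var2 = c(3, 4, 5),
                 var3 = c(NA, 6, 7))

# Hypothetical helper: drop only the rows where the plotted column is NA,
# then draw the bars. Other columns are ignored when filtering.
plot_var <- function(data, var) {
  kept <- data[!is.na(data[[var]]), , drop = FALSE]
  ggplot(kept, aes(x = name, y = .data[[var]])) +
    geom_col() +
    labs(y = var)
}

# One plot per value column; each keeps only the names that have a value.
plots <- lapply(c("var1", "var2", "var3"), plot_var, data = df)
```

Each element of plots can then be passed to ggsave() with its own file name.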

Notice that, when plotting, you're using only two columns of your data frame. So rather than passing the whole data.frame, you could take the relevant columns, df[,c("name", "var1")], apply na.omit to remove the unwanted rows (as Gavin Simpson suggests), na.omit(df[,c("name", "var1")]), and then plot this data.
My R/ggplot is quite rusty, and I realise that there are probably cleaner ways to achieve this.

A lot of time has passed since this question was originally asked. In 2021 if I was handling this I would use something like:
library(ggplot2)
library(tidyr)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5))
df %>%
  drop_na(var1) %>%
  ggplot(aes(name, var1)) +
  geom_col()
Created on 2021-12-03 by the reprex package (v2.0.1)

Related

Debt/GDP graph with ggplot

I have a dataset in which the first column is named central_government_debt_percent_of_gdp and contains a list of years, then several columns with the name of some countries that contain the debt/GDP ratio for each of them in every year.
You can see some of the data at this link:
I want to create a graph that shows the evolution of the ratio for each country, with separate lines. How can I do it with ggplot?
Do I have to add a geom_line for each country?
Should I do some data manipulation ?
As some have already mentioned, it would be appreciated if you provided a reproducible example. I will still try to answer your question, based on the link you included.
You need to do some data transformation, as your data is not yet in "tidy" format. This means: You want a column for every variable, a row for every observation and a cell should contain one value. For that, you need the pivot_longer() function.
library(tidyverse)
data %>%
  pivot_longer(
    cols = austria:germania,
    names_to = "countries",
    values_to = "values") %>%
  ggplot(aes(x = central_government_debt_percent_of_gdp,
             y = values,
             color = countries)) +
  geom_line()
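Since the real data isn't available, here is a self-contained sketch of the same reshaping on a toy table. The numbers are made up purely for illustration, and the column names other than the year column (austria, italy) are assumptions. A single geom_line() is enough once the data is in long format, because color = countries splits it into one line per country.

```r
library(tidyr)
library(ggplot2)

# Toy stand-in for the real data set; the values are invented for illustration.
data <- data.frame(
  central_government_debt_percent_of_gdp = c(2000, 2001, 2002),  # holds the years
  austria = c(60, 61, 62),
  italy   = c(100, 101, 102)
)

# Wide -> long: every column except the year column becomes a (country, value) pair.
long <- pivot_longer(
  data,
  cols = -central_government_debt_percent_of_gdp,
  names_to = "countries",
  values_to = "values"
)

ggplot(long, aes(x = central_government_debt_percent_of_gdp,
                 y = values,
                 color = countries)) +
  geom_line()
```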

Reorder in Ggplot2

df <- data.frame(Country = c("Indonesia","Indonesia","Brazil","Colombia","Mexico","Colombia","Costa Rica" ,"Mexico","Brazil","Costa Rica"),
Subject = c("Boys", "Girls","Boys","Boys","Boys","Girls","Boys","Girls","Girls","Girls"),
Value = c(358.000,383.000,400.000,407.000,415.000,417.000,419.000,426.000,426.000,434.000))
I'm trying to make a plot of Country vs Value, but ordering the points by the Value ascending for the Boys rows only. I know I can use something like:
df %>%
ggplot(aes(reorder(Country, Value), Value)) +
geom_point()
This does not take into account the Boys only rows in the subject column. How do I go about doing this?
Edit: The ordering can be done outside ggplot as:
df <- df %>%
arrange(Value, Subject)
However, I just cannot yet replicate it in the ggplot reorder. Included is an example of the data in question.
Arranging your data frame does not change the way the column Country will be ordered on the x axis. The priority for the order on the axis for discrete variables is:
If you supply a reorder or final specification in aes(), use that ordering
If the column is a factor, use the order of the levels of that factor
If the column is not a factor, order alphanumerically
As far as I know, you can only specify one column to use in reorder(), so the next step is to convert to a factor and specify the levels. The order in which the rows appear in the data frame does not matter, since each column is mapped to the plot independently of the row order. In fact, this is kind of the whole idea behind mapping.
Therefore, if you want this particular order, you'll have to convert the Country column into a factor and specify levels. You can do that separately, or pipe it all together using mutate(). Just note that we have to specify to use unique() values of the Country column to ensure we only provide each level one time in the order in which they appear in the sorted data frame.
# color and size added for clarity on the sorting
df %>%
  arrange(Subject, Value) %>%
  mutate(Country = factor(Country, levels = unique(Country))) %>%
  ggplot(aes(Country, Value, color = Subject)) +
  geom_point(size = 3)
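An alternative sketch that makes the "Boys only" ordering explicit: take the Boys rows, sort them by Value, and use the resulting country order as the factor levels. The names boys and country_order are just illustrative, and this assumes each country appears exactly once among the Boys rows.

```r
library(ggplot2)

df <- data.frame(
  Country = c("Indonesia", "Indonesia", "Brazil", "Colombia", "Mexico",
              "Colombia", "Costa Rica", "Mexico", "Brazil", "Costa Rica"),
  Subject = c("Boys", "Girls", "Boys", "Boys", "Boys",
              "Girls", "Boys", "Girls", "Girls", "Girls"),
  Value = c(358, 383, 400, 407, 415, 417, 419, 426, 426, 434)
)

# Order the countries by the Boys values only.
boys <- df[df$Subject == "Boys", ]
country_order <- as.character(boys$Country[order(boys$Value)])

# The factor levels, not the row order, decide the order on the x axis.
df$Country <- factor(df$Country, levels = country_order)

ggplot(df, aes(Country, Value, color = Subject)) +
  geom_point(size = 3)
```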

error filtering data: Faceting variables must have at least one value

I am trying to write some code using dplyr and a yeast dataset.
I read in the data with this code
gdat <- read_csv(file = "brauer2007_tidy1.csv")
I omitted the NAs by using this
gdat <- na.omit(gdat)
library(ggplot2)
Then I tried to filter some genes according to their column name "symbol" and used ggplot to make a plot
filter(gdat, symbol=="QRI7", symbol== "CFT2", symbol== "RIB2",
symbol=="EDC3", symbol=="VPS5", symbol=="AMN1" & rate=.05) %>%
ggplot(aes(x=rate,
y=expression,
group=1,
colour=nutrient)) +
geom_line(lwd=1.5) +
facet_wrap(~nutrient)
facet_wrap(~nutrient) is used to separate each gene's rate vs. expression graph according to the nutrient that is depleted, but this error keeps coming up:
error: Faceting variables must have at least one value
I checked each of these genes individually with the filter function, and all of them were displayed in R. But when I combine multiple genes and pipe the result to ggplot, I get this error.
Also, when I use "& rate=.05", I can't get only the values at rate 0.05.
Does anyone know how I can fix this problem? I have a deadline till tomorrow 17.30 and if somebody could help me I would be very glad, thanks.
I downloaded what I assume is the same dataset like this:
library(readr)
library(dplyr)
library(ggplot2)
gdat <- read_csv("https://raw.githubusercontent.com/bioconnector/workshops/master/data/brauer2007_tidy.csv")
So the first problem is your filter. If you are looking for any of those gene symbols, you need to use %in%. And rate requires a double equals ==:
gdat %>%
filter(symbol %in% c("QRI7", "CFT2", "RIB2", "EDC3", "VPS5", "AMN1"),
rate == 0.05)
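To see why the original filter returns nothing, note that conditions separated by commas in filter() are combined with AND, and no row can have two different symbols at once. A small sketch on a toy data frame (the rows, including the symbol "XYZ1", are invented for illustration):

```r
library(dplyr)

# Toy data: made-up rows, only meant to show how filter() combines conditions.
toy <- data.frame(
  symbol = c("QRI7", "CFT2", "RIB2", "XYZ1"),
  rate   = c(0.05, 0.05, 0.10, 0.05)
)

# Comma-separated conditions are ANDed: symbol cannot equal both values.
none <- filter(toy, symbol == "QRI7", symbol == "CFT2")

# %in% keeps rows whose symbol matches any of the listed values.
some <- filter(toy, symbol %in% c("QRI7", "CFT2"), rate == 0.05)

nrow(none)  # 0 rows, which is what leaves facet_wrap() with no values
nrow(some)  # 2 rows
```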
I don't think you want to filter for one rate and then use geom_line, because you will just get one vertical line at one value of x (rate).
Neither do I think you want to use geom_line for multiple values of rate, because there are several values for expression at each rate and a line will generate a nasty-looking zigzag.
And as you are faceting on nutrient, there's no need to color by nutrient. Perhaps you want to color by gene?
So you need to think about what makes a good visualisation of this data. Here's a simple example to get you started.
gdat %>%
filter(symbol %in% c("QRI7", "CFT2", "RIB2", "EDC3", "VPS5", "AMN1")) %>%
ggplot(aes(x=rate,
y=expression,
color = symbol)) +
geom_line() +
facet_wrap(~nutrient)

Create a Boxplot using ggplot2 from Excel File

My goal is to create a simple boxplot using the ggplot function. I've imported the data from an Excel file using read_excel. The data consists of 4 columns (one per treatment), and for each treatment there are around 600 values (see screenshot of head of data.frame in R). For ggplot, however, I don't have a clue what the x/y names of my data frame are to put in the aes() argument, and I don't know how to create x/y names for the aes function.
I just want to know what the x and y values to put in aes()...or how to define them if I haven't yet.
So far, code is:
library(readxl)
library(ggplot2)
CM<-read_excel(file.choose(new=FALSE))
CM<-data.frame(CM)
So for the graph: ggplot(CM, aes(??,??)
Didn't get very far...
Your data is not in the format expected by ggplot2, which follows the grammar of graphics. However, you can transform your data frame using tidyr, another part of the tidyverse (of which ggplot2 is also a part). You'd have to use gather(), like so:
library(tidyr)
library(ggplot2)
CM2 <- CM %>%
gather(key = "Names", value = "Values") # You get to choose these names; choose wisely. You'll use them later
ggplot(CM2, aes(x = Names, y = Values))+
geom_boxplot()
I kept the code as minimal as possible. Also, I created a second data.frame called CM2 because I don't know if you use the original later, but you can just write CM <- CM %>% [...] if you prefer.
You can find some reference on tidyr here.
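In current tidyr, gather() is superseded by pivot_longer(), so the same reshaping could be sketched as follows. The toy CM below, with its treatment column names and values, is invented for illustration, since the real Excel file isn't available.

```r
library(tidyr)
library(ggplot2)

# Toy stand-in for the imported Excel data: one column per treatment.
# Column names and values are made up for illustration.
CM <- data.frame(
  control = c(1.0, 1.2, 0.9),
  treat_a = c(2.1, 2.4, 2.0),
  treat_b = c(3.3, 3.1, 3.6),
  treat_c = c(4.0, 4.2, 3.9)
)

# Stack all columns: "Names" holds the treatment, "Values" the measurement.
CM2 <- pivot_longer(CM, cols = everything(),
                    names_to = "Names", values_to = "Values")

ggplot(CM2, aes(x = Names, y = Values)) +
  geom_boxplot()
```

So the x aesthetic is the column holding the treatment names and y is the column holding the measurements.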

