Create a Boxplot using ggplot2 from Excel File - r

My goal is to create a simple boxplot using the ggplot function. I've imported the data from an Excel file using read_excel. The data consists of 4 columns (one per treatment), each with around 600 values (see screenshot of the head of the data.frame in R). However, I don't know which names from my data frame should go in the x and y slots of aes(), or how to create such names if they don't exist yet.
I just want to know what x and y values to put in aes(), or how to define them if I haven't yet.
So far, code is:
library(readxl)
library(ggplot2)
CM<-read_excel(file.choose(new=FALSE))
CM<-data.frame(CM)
So for the graph: ggplot(CM, aes(??,??)
Didn't get very far...

Your data is not in the format supported by ggplot2, which follows the grammar of graphics. However, you can transform your data frame using tidyr, another part of the tidyverse (of which ggplot2 is a part). You'd have to use gather(), like so:
library(tidyr)
library(ggplot2)
CM2 <- CM %>%
gather(key = "Names", value = "Values") # You get to choose these names; choose wisely. You'll use them later
ggplot(CM2, aes(x = Names, y = Values))+
geom_boxplot()
I kept the code as minimal as possible. I also created a second data.frame called CM2 in case you use the original later, but you can just write CM <- CM %>% [...] if you prefer.
You can find some reference on tidyr here.
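As a side note, newer tidyr releases offer pivot_longer() for the same wide-to-long reshaping. The sketch below is only an illustration: the data frame CM and its column names (Control, TreatA, ...) are made-up stand-ins for the imported Excel data, not the asker's real values.

```r
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for the Excel import: four treatment columns,
# 600 values each. Replace with the real CM from read_excel().
set.seed(1)
CM <- data.frame(
  Control = rnorm(600, mean = 10),
  TreatA  = rnorm(600, mean = 12),
  TreatB  = rnorm(600, mean = 11),
  TreatC  = rnorm(600, mean = 13)
)

# Stack all columns into two: the former column name and its value.
CM_long <- pivot_longer(CM, cols = everything(),
                        names_to = "Names", values_to = "Values")

# One box per former column.
p <- ggplot(CM_long, aes(x = Names, y = Values)) +
  geom_boxplot()
p
```

The names "Names" and "Values" are free choices; whatever is passed to names_to and values_to is what goes into aes().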

Related

error filtering data: Faceting variables must have at least one value

I am trying to write some code using dplyr and a yeast dataset.
I read in the data with this code:
gdat <- read_csv(file = "brauer2007_tidy1.csv")
I omitted NAs using this:
gdat <- na.omit(gdat)
library(ggplot2)
Then I tried to filter some genes according to their column name "symbol" and used ggplot to make a plot
filter(gdat, symbol=="QRI7", symbol== "CFT2", symbol== "RIB2",
symbol=="EDC3", symbol=="VPS5", symbol=="AMN1" & rate=.05) %>%
ggplot(aes(x=rate,
y=expression,
group=1,
colour=nutrient)) +
geom_line(lwd=1.5) +
facet_wrap(~nutrient)
facet_wrap(~nutrient) is used to separate each gene's rate vs. expression graphs according to the nutrient which is depleted, but this error keeps coming:
error: Faceting variables must have at least one value
I checked these genes by using the filter function if all of them could be displayed on r and they did when I filtered them individually but when I combine multiple genes with ggplot I get this error.
Also when I use "&rate=.05" I can't get only the values which are at rate=.05.
Does anyone know how I can fix this problem? I have a deadline till tomorrow 17.30 and if somebody could help me I would be very glad, thanks.
I downloaded what I assume is the same dataset like this:
library(readr)
library(dplyr)
library(ggplot2)
gdat <- read_csv("https://raw.githubusercontent.com/bioconnector/workshops/master/data/brauer2007_tidy.csv")
So the first problem is your filter. If you are looking for any of those gene symbols, you need to use %in%. And rate requires a double equals ==:
gdat %>%
filter(symbol %in% c("QRI7", "CFT2", "RIB2", "EDC3", "VPS5", "AMN1"),
rate == 0.05)
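A small sketch of why the original filter returned no rows. Comma-separated conditions inside filter() are combined with AND, so no row can have symbol equal to "QRI7" and "CFT2" at the same time. The toy data frame below is made up; only its column names follow the question.

```r
library(dplyr)

# Hypothetical toy data, not the real yeast dataset.
toy <- data.frame(
  symbol = c("QRI7", "CFT2", "RIB2", "QRI7"),
  rate   = c(0.05, 0.05, 0.10, 0.10)
)

# Conditions separated by commas must ALL hold for a row to be kept.
and_rows <- filter(toy, symbol == "QRI7", symbol == "CFT2")
nrow(and_rows)  # 0

# %in% keeps a row if its symbol matches ANY of the listed values.
in_rows <- filter(toy, symbol %in% c("QRI7", "CFT2"), rate == 0.05)
nrow(in_rows)   # 2
```

An empty data frame passed on to facet_wrap() is what triggers "Faceting variables must have at least one value".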
I don't think you want to filter for one rate and then use geom_line, because you will just get one vertical line at one value of x (rate).
Neither do I think you want to use geom_line for multiple values of rate, because there are several values for expression at each rate and a line will generate a nasty-looking zigzag.
And as you are faceting on nutrient, there's no need to color by nutrient. Perhaps you want to color by gene?
So you need to think about what makes a good visualisation of this data. Here's a simple example to get you started.
gdat %>%
filter(symbol %in% c("QRI7", "CFT2", "RIB2", "EDC3", "VPS5", "AMN1")) %>%
ggplot(aes(x=rate,
y=expression,
color = symbol)) +
geom_line() +
facet_wrap(~nutrient)
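If several rows share the same gene, nutrient and rate, one possible way to avoid the zigzag mentioned above is to average expression within each group before drawing lines. This is only a sketch of that idea, not part of the original answer, and the toy data below is invented (only the column names mirror the dataset).

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: two expression values per symbol/nutrient/rate.
toy <- expand.grid(
  symbol   = c("QRI7", "CFT2"),
  nutrient = c("Glucose", "Ammonia"),
  rate     = c(0.05, 0.10, 0.15),
  rep      = 1:2,
  stringsAsFactors = FALSE
)
set.seed(42)
toy$expression <- rnorm(nrow(toy))

# Collapse replicates to one mean value per line point.
toy_mean <- toy %>%
  group_by(symbol, nutrient, rate) %>%
  summarise(expression = mean(expression), .groups = "drop")

p <- ggplot(toy_mean, aes(x = rate, y = expression, colour = symbol)) +
  geom_line() +
  facet_wrap(~nutrient)
p
```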

Are there any modification/another function to plot two numerical variable against one string variable?

I have a data set like this one: Names of mutations and two numerical variables representing values in two conditions (CIP and TIG):
I was able to plot one variable (e.g. CIP) in these mutation using the following code:
The data is named "Dotchart2":
dotchart(Dotchart2$`CIP resistance`,
labels = rownames((Dotchart2)), pch = 16, cex = 1, pt.cex = 2)
This appeared as follows:
Since I am comparing CIP vs TIG, I would like to have the same figure but showing another dots for the TIG for the same mutation (i.e. on each horizontal mutation line, there will be two dots of different color, one for CIP value and the other for TIG value). It should appear like this figure for instance
Could any of you provide a simplified code for this ?
I think you'll find your answer here.
In the link provided, @JoshO'Brien creates a dotchart plot using a lattice configuration:
autos_data <- read.table("~/Documents/R/test.txt", header=F)
library(lattice)
dotplot(V1~V2, data=autos_data)
This documentation does a thorough job of explaining and detailing graph styles (graph_type), data graphing (formula), and the data source (data=), resulting in the following:
library(lattice)
graph_type(formula, data=)
To do this easily in lattice or ggplot2 you first need to convert your data to long format. I don't have a data set handy in the right format, so I took the famous iris data set and converted it to a wide-format data set called iris_wide (see code at the bottom). I'm using tidyverse here: all of this can also be done in base R.
(To understand what's going on here you should definitely examine the iris_wide and iris_long objects.)
convert from wide to long format
library(tidyverse)
iris_long <- iris_wide %>%
pivot_longer(cols=-id,names_to="species",values_to="value")
lattice version
lattice::dotplot(id~value, data=iris_long, group=species,pch=16,
auto.key=TRUE)
ggplot version
ggplot(iris_long, aes(value,id,colour=species))+geom_point()
convert iris data from long to wide
To match your example, I'm selecting only two categories (species) and one variable (sepal length)
iris_wide <- (iris
%>% filter(Species %in% c("setosa","virginica"))
%>% select(Sepal.Length, Species)
%>% group_by(Species)
%>% mutate(id=seq(n()))
%>% pivot_wider(names_from=Species, values_from=Sepal.Length)
%>% head(10)
%>% mutate(id=LETTERS[seq(n())])
)
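Applied to a table shaped like the one in the question, the same wide-to-long step might look as follows. The mutation names and numbers are invented for illustration, and the column names CIP and TIG are assumptions standing in for the real headers (such as `CIP resistance`).

```r
library(tidyr)
library(ggplot2)

# Hypothetical values; only the layout mimics the question's data.
res_wide <- data.frame(
  mutation = c("mutA", "mutB", "mutC"),
  CIP      = c(4, 8, 2),
  TIG      = c(1, 2, 6)
)

# One row per mutation/drug pair.
res_long <- pivot_longer(res_wide, cols = c(CIP, TIG),
                         names_to = "drug", values_to = "resistance")

# Two dots of different colour on each horizontal mutation line.
p <- ggplot(res_long, aes(x = resistance, y = mutation, colour = drug)) +
  geom_point(size = 3)
p
```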

How to plot two variable (same unit %) from two columns in ggplot2? [duplicate]

I have a csv table in which there are three columns I would like to plot out as line graph using ggplot2 in R.
The x axis will use the data in column "DATE_Out"; the two variables on the y axis will come from columns "Percent_In" and "Percent_Out" respectively. Note that "Percent_In" and "Percent_Out" are two separate columns, not one column with a grouping variable.
Table Data Example
Could anyone give me some hints with the R code?
library(ggplot2)
library(reshape2)
tbl <- read.csv('table.csv')
tbl$DATE_Out <- as.Date(tbl$DATE_Out, format = '%m/%d/%Y')
tbl <- melt(tbl, id.vars = 'DATE_Out')
plt <- ggplot(data = tbl, aes(x = DATE_Out, y = value))
plt <- plt + geom_path(aes(colour = variable))  # refer to the column by name, not tbl$variable
plt + theme_minimal() + theme(legend.title=element_blank())
The tidyr package offers the gather function which is designed to do just this sort of thing.
library(dplyr)
library(tidyr)
View(iris)
iris %>%
gather('Measurement','Value',Sepal.Length,Sepal.Width) %>%
View
I prefer tidyr to reshape because, to me, the functionality is clearer and the functions are more versatile. For example, rather than having to specify all the other variables as id variables in melt, I can just specify the variables I wish to gather together. In most of my datasets that is a smaller, cleaner way to code. (See the help page for dplyr::select for more details on ways to select which columns are used.)
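With only two columns to draw, reshaping can also be skipped by adding one line layer per column. The sketch below assumes the column names from the question (DATE_Out, Percent_In, Percent_Out); the dates and percentages are invented.

```r
library(ggplot2)

# Hypothetical rows standing in for table.csv.
tbl <- data.frame(
  DATE_Out    = c("01/15/2020", "02/15/2020", "03/15/2020"),
  Percent_In  = c(10, 25, 40),
  Percent_Out = c(5, 30, 35)
)
tbl$DATE_Out <- as.Date(tbl$DATE_Out, format = "%m/%d/%Y")

# A constant string inside aes() gives each layer its own legend entry.
p <- ggplot(tbl, aes(x = DATE_Out)) +
  geom_line(aes(y = Percent_In,  colour = "Percent_In")) +
  geom_line(aes(y = Percent_Out, colour = "Percent_Out")) +
  labs(y = "Percent", colour = NULL)
p
```

This stays readable for two or three series; beyond that, the long format shown in the answers above usually scales better.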

Removing unused x-axis factors from each plot while creating multiple plots using the lapply function [duplicate]

I want to do the opposite of this question, and sort of the opposite of this question, though that's about legends, not the plot itself.
The other SO questions seem to be asking about how to keep unused factor levels. I'd actually like mine removed. I have several name variables and several columns (wide format) of variable attributes that I'm using to create numerous bar plots. Here's a reproducible example:
library(ggplot2)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5))
ggplot(df, aes(x=name,y=var1)) + geom_bar(stat="identity")
I get this:
I'd like only the names that have a corresponding value in the plotted var column to show up in my bar plot (as in, there would be no empty space for B).
Reusing the base plot code will be quite easy if I can simply change my output file name and the y=var bit. I'd like not to have to subset my data frame just to use droplevels on the result for each plot if possible!
Update based on the na.omit() suggestion
Consider a revised data set:
library(ggplot2)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5), var3=c(NA,6,7))
ggplot(df, aes(x=name,y=var1)) + geom_bar(stat="identity")
I need to use na.omit() for plotting var1 because there's an NA present. But since na.omit makes sure values are present for all columns, the plot removes A as well since it has an NA in var3. This is more analogous to my data. I have 15 total responses with NAs peppered about. I only want to remove factor levels that don't have values for the current plotted y vector, not that have NAs in any vector in the whole data frame.
One easy option is to use na.omit() on your data frame df to remove the rows with NA:
ggplot(na.omit(df), aes(x=name,y=var1)) + geom_bar(stat="identity")
Given your update, the following
ggplot(df[!is.na(df$var1), ], aes(x=name,y=var1)) + geom_bar(stat="identity")
works OK and only considers NAs in var1. Given that you are only plotting name and var1, you can instead apply na.omit() to a data frame containing only those variables:
ggplot(na.omit(df[, c("name", "var1")]), aes(x=name,y=var1)) + geom_bar(stat="identity")
Notice that, when plotting, you're using only two columns of your data frame. Rather than passing the whole data.frame, you could take the relevant columns, df[, c("name", "var1")], apply na.omit to remove the unwanted rows (as Gavin Simpson suggests), na.omit(df[, c("name", "var1")]), and then plot this data.
My R/ggplot is quite rusty, and I realise that there are probably cleaner ways to achieve this.
A lot of time has passed since this question was originally asked. In 2021 if I was handling this I would use something like:
library(ggplot2)
library(tidyr)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5))
df %>%
drop_na(var1) %>%
ggplot(aes(name, var1)) +
geom_col()
Created on 2021-12-03 by the reprex package (v2.0.1)
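Since the question is about producing many plots from one base recipe, the per-column NA filter can be wrapped in lapply(). This is a sketch under the assumption that the value columns are named var1, var2 and var3 as in the revised example; the helper objects vars, subsets and plots are names introduced here for illustration.

```r
library(ggplot2)

df <- data.frame(name = c("A", "B", "C"),
                 var1 = c(1, NA, 2),
                 var2 = c(3, 4, 5),
                 var3 = c(NA, 6, 7))

vars <- c("var1", "var2", "var3")

# For each variable, keep only rows where THAT variable is not NA.
subsets <- lapply(setNames(vars, vars), function(v) {
  df[!is.na(df[[v]]), c("name", v)]
})

# Build one bar plot per variable from its own subset.
plots <- lapply(setNames(vars, vars), function(v) {
  ggplot(subsets[[v]], aes(x = name, y = .data[[v]])) +
    geom_col() +
    ylab(v)
})

plots$var1  # bars for A and C only
```

Each element of plots can then be passed to ggsave() with a file name built from the variable name.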
