R: combine mpg trans columns into new dataframe containing two columns

I am working my way through the R for Data Science Manual, currently finishing chapter 3. I am trying to find a way to produce a plot combining the different types of automatic and manual transmission into two plots, instead of what I have currently:
# Install necessary packages
install.packages("tidyverse")
library(tidyverse)
# Create the plot
fuelbytrans <- ggplot(data = mpg) +
geom_jitter(
mapping = aes(x = displ, y = hwy, colour = fl),
size = 0.75) +
# Change labels for title and x and y axes
labs(
title = "Fuel consumption in the «mpg» dataset by transmission and engine displacement",
x = "Engine displacement",
y = "US miles per gallon")
# Run it
fuelbytrans
# Set colours and labels for fuel legend and position it on the bottom
# e (ethanol), d (diesel), r (regular [petrol, low octane]), p (premium [petrol, high octane]),
# c (CNG)
cols <- c( # source: http://colorbrewer2.org/#type=diverging&scheme=PRGn&n=5
"c" = "yellow",
"d" = "red",
"e" = "black",
"p" = "blue",
"r" = "darkgreen"
)
labels_fuel <- fuelbytrans +
scale_colour_manual(
name = "Fuel",
values = cols,
breaks = c("c", "d", "e", "p", "r"),
labels = c("CNG",
"diesel",
"ethanol",
"petrol,\nhigh octane",
"petrol,\nlow octane")) +
theme(legend.position = "bottom",
legend.background = element_rect(
fill = "gray90",
size = 2,
linetype = "dotted"
))
# Run it
labels_fuel
# Wrap by transmission type
labels_fuel + facet_wrap(~ trans, nrow = 1)
As you can see, what I get is 8 columns for automatic transmission, and two for manual; what I would like is just two columns, one for automatic and one for manual, concatenating the plots. I have presently no idea how to do this, and would appreciate all help.
If any information is missing, should have been written differently, or could otherwise be improved, please advise.
I am running RStudio 0.99.902. I am quite new to R.

You have more than 2 types of transmission in your data:
table(mpg$trans)
#   auto(av)   auto(l3)   auto(l4)   auto(l5)   auto(l6)
#          5          2         83         39          6
#   auto(s4)   auto(s5)   auto(s6) manual(m5) manual(m6)
#          3          3         16         58         19
You need to group them into 2 groups first, here is one option:
mpg = mpg %>%
mutate(trans2 = if_else(grepl("auto", trans), "auto", "manual"))
table(mpg$trans2)
#   auto manual
#    157     77
Then use the new trans2 variable for faceting (you need to rerun the plot).
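A minimal sketch of that rerun, assuming dplyr and ggplot2 are installed. The names `mpg2` and `p` are chosen here for illustration only, and the colour scale, labels and theme from the question are left out for brevity:

```r
library(dplyr)
library(ggplot2)

# Collapse the ten transmission codes into two groups
mpg2 <- ggplot2::mpg %>%
  mutate(trans2 = if_else(grepl("auto", trans), "auto", "manual"))

# Same jitter plot as in the question, now faceted by the new column
p <- ggplot(data = mpg2) +
  geom_jitter(mapping = aes(x = displ, y = hwy, colour = fl), size = 0.75) +
  facet_wrap(~ trans2, nrow = 1)

p
```

This should give one panel for automatic and one for manual; the `scale_colour_manual()` and `theme()` layers from the question can be added to `p` in the same way as before.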
Two more comments:
If you want to know more about an R function, call ?function_name in R. This will bring up the help page for that function. It usually includes examples that you can run from R to see what it does in action. (Plus here we are using grepl, so it would also be useful to Google the term "regular expressions", if you are not familiar with them).
Since you are reading r4ds, you need to get familiar with the "pipe operator" used in dplyr, tidyr and other tidyverse packages sooner rather than later. It can chain multiple function calls together in an easily readable way. Google it or take a look here. The call could also be written without the pipe like this:
mpg = mutate(mpg, trans2 = if_else(grepl("auto", trans), "auto", "manual"))
In this particular case, the pipe operator is actually not that useful. I am just so used to it I went for it automatically.

Related

Grouped barchart in r with 4 variables

I'm a beginner in r and I've been trying to find how I can plot this graphic.
I have 4 variables (% of gravel, % of sand, % of silt in five places). I'm trying to plot the percentages of these 3 types of sediment (y) in each station (x). So it's five groups in x axis and 3 bars per group.
Station % gravel % sand % silt
1 PRA1 28.430000 70.06000 1.507000
2 PRA3 19.515000 78.07667 2.406000
3 PRA4 19.771000 78.63333 1.598333
4 PRB1 7.010667 91.38333 1.607333
5 PRB2 18.613333 79.62000 1.762000
I tried plotting a grouped barchart with
grao <- read_excel("~/Desktop/Masters/Data/grao.xlsx")
colors <- c('#999999','#E69F00','#56B4E9','#94A813','#718200')
barplot(table(grao$Station, grao$`% gravel`, grao$`% sand`, grao$`% silt`), beside = TRUE, col = colors)
But this error message keeps happening:
'height' must be a vector or matrix
I also tried
ggplot(grao, aes(Station, color=as.factor(`% gravel`), shape=as.factor(`% sand`))) +
geom_bar() + scale_color_manual(values=c('#999999','#E69F00','#56B4E9','#94A813','#718200')) + theme(legend.position="top")
But it's creating a crazy graphic.
Could someone help me, please? I've been stuck for weeks now in this one.
Cheers
I think this may be what you are looking for:
#install.packages("tidyverse")
library(tidyverse)
df <- data.frame(
station = c("PRA1", "PRA3", "PRA4", "PRB1", "PRB2"),
gravel = c(28.4, 19.5, 19.7, 7.01, 18.6),
sand = c(70.06, 78.07, 78.63, 91, 79),
silt = c(1.5, 2.4, 1.6, 1.7, 1.66)
)
df2 <- df %>%
pivot_longer(cols = c("gravel", "sand", "silt"), names_to = "Sediment_Type", values_to = "Percentage")
ggplot(df2) +
geom_bar(aes(x = station, y = Percentage, fill = Sediment_Type ), stat = "identity", position = "dodge") +
theme_minimal() # theme_minimal() ships with ggplot2
This produces a grouped bar chart with three bars (gravel, sand, silt) for each station.
You need to "pivot" your data set "longer". Part of the tidy approach is ensuring that each column represents a single variable. In your initial data frame, the column names are really values of one variable ("Sediment_Type"), and the cells under them are values of another ("Percentage"). The function pivot_longer() gathers those columns and turns them into just two: one holding the former column names and one holding the values.
Once you've done this, ggplot will allow you to specify your x axis and then a grouping variable via "fill". You can switch these two around. If you end up with lots of data and grouping variables, faceting is also an option worth looking into!
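As a rough sketch, the same reshaping can be applied to the table from the question with its original column names. The names `grao_long`, `Sediment_Type` and `Percentage` are illustrative, and `names_prefix = "% "` is used here as one way to strip the leading "% " from the labels:

```r
library(tidyr)
library(ggplot2)

# Rebuild the question's table with its original column names
grao <- data.frame(
  Station = c("PRA1", "PRA3", "PRA4", "PRB1", "PRB2"),
  `% gravel` = c(28.430000, 19.515000, 19.771000, 7.010667, 18.613333),
  `% sand` = c(70.06000, 78.07667, 78.63333, 91.38333, 79.62000),
  `% silt` = c(1.507000, 2.406000, 1.598333, 1.607333, 1.762000),
  check.names = FALSE
)

# Gather every column except Station; names_prefix drops the leading "% "
grao_long <- pivot_longer(
  grao,
  cols = -Station,
  names_to = "Sediment_Type",
  names_prefix = "% ",
  values_to = "Percentage"
)

# geom_col() plots the values as given; "dodge" puts the bars side by side
ggplot(grao_long, aes(x = Station, y = Percentage, fill = Sediment_Type)) +
  geom_col(position = "dodge")
```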
Hope this helps,
Brennan
barplot wants a "matrix", ideally with both dimension names. You could transform your data like this (remove first column while using it for row names):
dat <- `rownames<-`(as.matrix(grao[,-1]), grao[,1])
You will see, that barplot already does the tabulation for you. However, you also could use xtabs (table might not be the right function for your approach).
# dat <- xtabs(cbind(X..gravel, X..sand, X..silt) ~ Station, grao) ## alternatively
I would advise you to use proper variable names, since special characters are not the best idea.
colnames(dat) <- c("gravel", "sand", "silt")
dat
# gravel sand silt
# PRA1 28.430000 70.06000 1.507000
# PRA3 19.515000 78.07667 2.406000
# PRA4 19.771000 78.63333 1.598333
# PRB1 7.010667 91.38333 1.607333
# PRB2 18.613333 79.62000 1.762000
Then barplot knows what's going on.
.col <- c('#E69F00','#56B4E9','#94A813') ## pre-define colors
barplot(t(dat), beside=T, col=.col, ylim=c(0, 100), ## barplot
main="Here could be your title", xlab="sample", ylab="perc.")
legend("topleft", colnames(dat), pch=15, col=.col, cex=.9, horiz=T, bty="n") ## legend
box() ## put it in a box
Data:
grao <- read.table(text=" Station '% gravel' '% sand' '% silt'
1 PRA1 28.430000 70.06000 1.507000
2 PRA3 19.515000 78.07667 2.406000
3 PRA4 19.771000 78.63333 1.598333
4 PRB1 7.010667 91.38333 1.607333
5 PRB2 18.613333 79.62000 1.762000 ", header=TRUE)

ggplot2 alternatives to fill in barplots, occurence of factor in multiple rows

I'm pretty new to R and I have a problem with plotting a barplot out of my data which looks like this:
condition answer
2 H
1 H
8 H
5 W
4 M
7 H
9 H
10 H
6 H
3 W
The data consists of 100 rows with the conditions 1 to 10, each randomly generated 10 times (10 times condition 1, 10 times condition 8, ...). Each condition also has an answer, which can be H for Hit, M for Miss or W for Wrong.
I want to plot the number of Hits for each condition in a barplot (for example 8 Hits out of 10 for condition 1,...) for that I tried to do the following in ggplot2
ggplot(data=test, aes(x=test$condition, fill=answer=="H"))+
geom_bar()+labs(x="Conditions", y="Hitrate")+
coord_cartesian(xlim = c(1:10), ylim = c(0:10))+
scale_x_continuous(breaks=seq(1,10,1))
And it looked like this:
This is actually exactly what I need, except for the red color, which covers everything. You can see that conditions 3 to 5 have no blue bar because there are no hits for these conditions.
Is there any way to get rid of this red color, and maybe to count the number of hits for the different conditions? I tried the count function of dplyr, but it only showed me the number of H answers for conditions that had at least one. Conditions 3 to 5 were simply ignored by count: there wasn't even a 0 in the output. I still need those numbers for the plot.
I'm sorry for this particular long post but I'm really at the end of knowledge considering this. I'd be open for suggestions or alternatives! Thanks in advance!
This is a situation where a little preprocessing goes a long way. I made sample data that would recreate the issue, i.e. has cases where there won't be any "H"s.
Instead of relying on ggplot to aggregate data in the way you want it, use proper tools. Since you mention dplyr::count, I use dplyr functions.
The preprocessing task is to count observations with answer "H", including cases where the count is 0. To make sure all combinations are retained, convert condition to a factor and set .drop = F in count, which is in turn passed to group_by.
library(dplyr)
library(ggplot2)
set.seed(529)
test <- data.frame(condition = rep(1:10, times = 10),
answer = c(sample(c("H", "M", "W"), 50, replace = T),
sample(c("M", "W"), 50, replace = T)))
hit_counts <- test %>%
mutate(condition = as.factor(condition)) %>%
filter(answer == "H") %>%
count(condition, .drop = F)
hit_counts
#> # A tibble: 10 x 2
#> condition n
#> <fct> <int>
#> 1 1 0
#> 2 2 1
#> 3 3 4
#> 4 4 2
#> 5 5 3
#> 6 6 0
#> 7 7 3
#> 8 8 2
#> 9 9 1
#> 10 10 1
Then just plot that. geom_col is the version of geom_bar for where you have your y-values already, instead of having ggplot tally them up for you.
ggplot(hit_counts, aes(x = condition, y = n)) +
geom_col()
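An alternative sketch that also keeps the zero counts, without the factor conversion or .drop: group by condition and sum a logical test. The small data set below is made up for illustration (it is not the asker's data), with condition 3 deliberately having no hits:

```r
library(dplyr)
library(ggplot2)

# Small made-up data set: conditions 1 and 2 have hits, condition 3 has none
test <- data.frame(
  condition = rep(1:3, each = 4),
  answer = c("H", "H", "M", "W",
             "H", "W", "W", "M",
             "M", "W", "M", "W")
)

# Summing a logical vector counts its TRUE values, so groups without "H" get 0
hit_counts <- test %>%
  group_by(condition) %>%
  summarise(n = sum(answer == "H"))

ggplot(hit_counts, aes(x = factor(condition), y = n)) +
  geom_col()
```

Because no rows are filtered out before counting, every condition present in the data keeps its row in `hit_counts`.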
One option is to just filter out anything but where answer == "H" from your dataset, and then plot.
An alternative is to use a grouped bar plot, made by setting position = "dodge":
test <- data.frame(condition = rep(1:10, each = 10),
answer = sample(c('H', 'M', 'W'), 100, replace = T))
ggplot(data=test) +
geom_bar(aes(x = condition, fill = answer), position = "dodge") +
labs(x="Conditions", y="Hitrate") +
coord_cartesian(xlim = c(1, 10), ylim = c(0, 10)) +
scale_x_continuous(breaks=seq(1,10,1))
Also note that if the condition is actually a categorical variable, it may be better to make it a factor:
test$condition <- as.factor(test$condition)
This means that you don't need the scale_x_continuous call, and that the grid lines will be cleaner.
Another option is to pick your fill colors explicitly and make FALSE transparent by using scale_fill_manual. Since FALSE comes alphabetically first, the first value to specify is FALSE, the second TRUE.
ggplot(data=test, aes(x=condition, fill=answer=="H"))+
geom_bar()+labs(x="Conditions", y="Hitrate")+
coord_cartesian(xlim = c(1, 10), ylim = c(0, 10))+
scale_x_continuous(breaks=seq(1,10,1)) +
scale_fill_manual(values = c(alpha("red", 0), "cadetblue")) +
guides(fill = "none")

Coloring Rarefaction curve lines by metadata (vegan package) (phyloseq package)

First time question asker here. I wasn't able to find an answer to this question in other posts (love stackexchange, btw).
Anyway...
I'm creating a rarefaction curve via the vegan package and I'm getting a very messy plot that has a very thick black bar at the bottom of the plot which is obscuring some low diversity sample lines.
Ideally, I would like to generate a plot with all of my lines (169; I could reduce this to 144) but make a composite graph, coloring by Sample Year and making different types of lines for each Pond (i.e: 2 sample years: 2016, 2017 and 3 ponds: 1,2,5). I've used phyloseq to create an object with all my data, then separated my OTU abundance table from my metadata into distinct objects (jt = OTU table and sampledata = metadata). My current code:
jt <- as.data.frame(t(j)) # transpose it to make it compatible with the following commands
rarecurve(jt
, step = 100
, sample = 6000
, main = "Alpha Rarefaction Curve"
, cex = 0.2
, color = sampledata$PondYear)
# A very small subset of the sample metadata
Pond Year
F16.5.d.1.1.R2 5 2016
F17.1.D.6.1.R1 1 2017
F16.1.D15.1.R3 1 2016
F17.2.D00.1.R2 2 2017
Here is an example of how to plot a rarefaction curve with ggplot. I used data available in the phyloseq package available from bioconductor.
to install phyloseq:
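if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("phyloseq") # biocLite() has been retired in favour of BiocManager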
library(phyloseq)
other libraries needed
library(tidyverse)
library(vegan)
data:
mothlist <- system.file("extdata", "esophagus.fn.list.gz", package = "phyloseq")
mothgroup <- system.file("extdata", "esophagus.good.groups.gz", package = "phyloseq")
mothtree <- system.file("extdata", "esophagus.tree.gz", package = "phyloseq")
cutoff <- "0.10"
esophman <- import_mothur(mothlist, mothgroup, mothtree, cutoff)
extract OTU table, transpose and convert to data frame
otu <- otu_table(esophman)
otu <- as.data.frame(t(otu))
sample_names <- rownames(otu)
out <- rarecurve(otu, step = 5, sample = 6000, label = T)
Now you have a list in which each element corresponds to one sample.
Clean the list up a bit:
rare <- lapply(out, function(x){
b <- as.data.frame(x)
b <- data.frame(OTU = b[,1], raw.read = rownames(b))
b$raw.read <- as.numeric(gsub("N", "", b$raw.read))
return(b)
})
label list
names(rare) <- sample_names
convert to data frame:
rare <- map_dfr(rare, function(x){
z <- data.frame(x)
return(z)
}, .id = "sample")
Let's see how it looks:
head(rare)
sample OTU raw.read
1 B 1.000000 1
2 B 5.977595 6
3 B 10.919090 11
4 B 15.826125 16
5 B 20.700279 21
6 B 25.543070 26
plot with ggplot2
ggplot(data = rare)+
geom_line(aes(x = raw.read, y = OTU, color = sample))+
scale_x_continuous(labels = scales::scientific_format())
vegan plot:
rarecurve(otu, step = 5, sample = 6000, label = T) #low step size because of low abundance
One can make an additional column of groupings and color according to that.
Here is an example of how to add another grouping. Let's assume you have a table of the form:
groupings <- data.frame(sample = c("B", "C", "D"),
location = c("one", "one", "two"), stringsAsFactors = F)
groupings
sample location
1 B one
2 C one
3 D two
where samples are grouped according to another feature. You could use lapply or map_dfr to go over groupings$sample and attach the matching location to rare (stored in the column loc in the code that follows).
rare <- map_dfr(groupings$sample, function(x){ #loop over samples
z <- rare[rare$sample == x,] #subset rare according to sample
loc <- groupings$location[groupings$sample == x] #subset groupings according to sample, if more than one grouping repeat for all
z <- data.frame(z, loc) #make a new data frame with the subsets
return(z)
})
head(rare)
sample OTU raw.read loc
1 B 1.000000 1 one
2 B 5.977595 6 one
3 B 10.919090 11 one
4 B 15.826125 16 one
5 B 20.700279 21 one
6 B 25.543070 26 one
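For what it's worth, the same labelling can probably be done in one step with a join. This is only a sketch: the `rare` data frame below is a small made-up stand-in for the one built above, and `rare_loc` is a name chosen here. Note that the new column keeps the name location rather than loc:

```r
library(dplyr)

# Toy stand-in for the `rare` data frame built above (values are illustrative)
rare <- data.frame(
  sample = c("B", "B", "C", "C", "D", "D"),
  OTU = c(1, 5.9, 1, 5.5, 1, 5.7),
  raw.read = c(1, 6, 1, 6, 1, 6)
)

groupings <- data.frame(sample = c("B", "C", "D"),
                        location = c("one", "one", "two"),
                        stringsAsFactors = FALSE)

# One join attaches the location of each sample to every row of `rare`
rare_loc <- left_join(rare, groupings, by = "sample")
rare_loc
```

If this variant is used, the plot would map color to location instead of loc.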
Let's make a decent plot out of this:
ggplot(data = rare)+
geom_line(aes(x = raw.read, y = OTU, group = sample, color = loc))+
geom_text(data = rare %>% #here we need coordinates of the labels
group_by(sample) %>% #first group by samples
summarise(max_OTU = max(OTU), #find max OTU
max_raw = max(raw.read)), #find max raw read
aes(x = max_raw, y = max_OTU, label = sample), check_overlap = T, hjust = 0)+
scale_x_continuous(labels = scales::scientific_format())+
theme_bw()
I know this is an older question but I originally came here for the same reason and along the way found out that in a recent (2021) update vegan has made this a LOT easier.
This is an absolutely bare-bones example.
Ultimately we're going to be plotting the final result in ggplot so you'll have full customization options, and this is a tidyverse solution with dplyr.
library(vegan)
library(dplyr)
library(ggplot2)
I'm going to use the dune data within vegan and generate a column of random metadata for the site.
data(dune)
metadata <- data.frame("Site" = as.factor(1:20),
"Vegetation" = rep(c("Cactus", "None")))
Now we will run rarecurve, but provide the argument tidy = TRUE which will export a dataframe rather than a plot.
One thing to note here is that I have also used the step argument. The default step is 1, and this means by default you will get one row per individual per sample in your dataset, which can make the resulting dataframe huge. Step = 1 for dune gave me over 600 rows. Reducing the step too much will make your curves blocky, so it will be a balance between step and resolution for a nice plot.
Then I piped a left join right into the rarecurve call
dune_rare <- rarecurve(dune,
step = 2,
tidy = TRUE) %>%
left_join(metadata)
Now it will be plottable in ggplot, with a color/colour call to whatever metadata you attached.
From here you can customize other aspects of the plot as well.
ggplot(dune_rare) +
geom_line(aes(x = Sample, y = Species, group = Site, colour = Vegetation)) +
theme_bw()

ggplot: how to choose the "proper" colors relating on a column

Suppose I have a simple dataframe to plot, in which I have to color the points related to the measure contained in a column. So, if I have:
dataframe
# X1 X2 pop
# 1 -0.11092652 -1.955598e-09 448053
# 2 -0.09999865 -2.310067e-10 418231
# 3 -0.05944755 -3.475013e-09 448473
# 4 0.51378848 1.631781e-09 119548
# 5 0.09438223 -9.606475e-10 323288
# 6 0.19349045 6.074025e-10 203153
# 7 0.06685609 3.210156e-10 208339
# 8 -0.10915456 -1.407190e-09 429178
# 9 -0.10348100 -1.401948e-09 1218038
# 10 -0.08607617 -7.356602e-10 383018
# 11 1.00343465 -2.423237e-08 209550
# 12 -0.05839148 1.503955e-09 287042
# 13 -0.09960163 2.167945e-10 973129
# 14 -0.05793417 2.510107e-09 187249
# 15 0.02191610 2.479708e-09 915225
# 16 0.48877872 1.338346e-08 462999
# 17 -0.10289556 1.472368e-09 1108776
# 18 -0.10316414 2.933469e-10 402422
# 19 -0.09545279 -2.926035e-10 274035
# 20 -0.06111044 3.464014e-09 230749
and I use ggplot in the following way:
ggplot(dataframe) +
ggtitle("Somehow useful spatialization")+ # Electricity / Gas
geom_point(aes(dataframe$X1, dataframe$X2), color = dataframe$pop, size=2 ) +
theme_classic(base_size = 16) +
guides(colour = guide_legend(override.aes = list(size=4)))+
xlab("X")+ylab("Y")
I obtain something like:
which is one possible representation.
Nevertheless, suppose that I want the points colored so that they represent the column pop, i.e., with colors going from (for example) light orange, through dark red, to black. How can I "scale" the column pop to obtain such a graphic?
EDIT:
> dput(dataframe)
structure(list(X1 = c(-0.110926520419347, -0.0999986452719714,
-0.0594475526112884, 0.513788479303472, 0.0943822277852107, 0.193490454204271,
0.0668560854540437, -0.109154563987586, -0.103480996064617, -0.0860761723229372,
1.00343465471568, -0.0583914756527933, -0.0996016272609995, -0.0579341671474729,
0.0219161022704227, 0.488778719096658, -0.102895564162661, -0.103164140322136,
-0.0954527927249849, -0.0611104428640883), X2 = c(-1.9555978205951e-09,
-2.31006712207053e-10, -3.47501251356368e-09, 1.63178106438806e-09,
-9.60647459243156e-10, 6.07402512804044e-10, 3.21015629676789e-10,
-1.40718981687972e-09, -1.40194842954735e-09, -7.35660154466167e-10,
-2.423237202138e-08, 1.50395541775022e-09, 2.16794489937917e-10,
2.51010717100061e-09, 2.47970820013341e-09, 1.33834570208731e-08,
1.47236816671351e-09, 2.93346922578509e-10, -2.92603459149485e-10,
3.46401369936372e-09), pop = c(448053L, 418231L, 448473L, 119548L,
323288L, 203153L, 208339L, 429178L, 1218038L, 383018L, 209550L,
287042L, 973129L, 187249L, 915225L, 462999L, 1108776L, 402422L,
274035L, 230749L)), .Names = c("X1", "X2", "pop"), row.names = c(NA,
20L), class = "data.frame")
With ggplot you can add your aesthetics (aes) in your initial ggplot call. Since you're already telling ggplot where the data is (in dataframe), you can refer to the variables directly by name (without dataframe$). For the color to follow a scale, it needs to be mapped as an aesthetic, inside the aes() call, and not passed as a static value. Once it is mapped as an aesthetic, we can customize how it behaves by adding a scale. Taking all this into account gives us the following code:
ggplot(dataframe, aes(x = X1, y = X2, color = pop)) +
ggtitle("Somehow useful spatialization")+ # Electricity / Gas
geom_point(size=2) +
theme_classic(base_size = 16) +
guides(colour = guide_legend(override.aes = list(size=4))) +
xlab("X")+ylab("Y") +
scale_color_gradient2(low = "green", mid = "red", high = "black", midpoint = mean(dataframe$pop))
This code gives the following graph. The colors could be further adjusted by playing around with the scale_color_gradient2 part. (Why green as low gives a better orange than actually choosing orange as the low color is beyond me, I just ended up there by coincidence)
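If the goal is specifically an orange to dark red to black ramp, one option worth trying is scale_colour_gradientn(), which interpolates through any list of colours in order. A hedged sketch, using only the first five rows of the question's data (rounded) and color names picked here as a guess at the palette that was asked for:

```r
library(ggplot2)

# First five rows of the question's data, rounded for brevity
dataframe <- data.frame(
  X1 = c(-0.111, -0.100, -0.059, 0.514, 0.094),
  X2 = c(-1.96e-09, -2.31e-10, -3.48e-09, 1.63e-09, -9.61e-10),
  pop = c(448053, 418231, 448473, 119548, 323288)
)

# The colours are interpolated in order, from the lowest to the highest pop value
p <- ggplot(dataframe, aes(x = X1, y = X2, colour = pop)) +
  geom_point(size = 2) +
  scale_colour_gradientn(colours = c("orange", "darkred", "black")) +
  theme_classic(base_size = 16)

p
```

Unlike scale_color_gradient2, this needs no midpoint; the colours are spread evenly over the range of pop unless a values argument says otherwise.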

ggplot function select multiple subset

I'm not an expert with the ggplot2 package. I have a subset selection problem.
Here is my code that produce this kind of graph...
g <- ggplot(merged_data,aes_string(x=Order,fill=var.y)) +
scale_y_continuous(expand = c(0.05,0)) +
xlab(paste("Order","Total number of sequences",sep=" - ")) +
ggtitle(main.str) +
geom_bar(position="fill",
subset = .(Order != ""),
width=0.6,hjust =0)+
geom_text(stat="bin",
subset = .(Order != ""),
color="black", hjust=1, vjust = 0.5, size=2,
aes_string(fill=NULL,x = Order,y = "0", label="..count.."))+
coord_flip()
For geom_bar and geom_text I select a subset of the data that removes empty names:
subset = .(eval(parse(text=var.x)) != "")
this is a simple example with only 2 bars.
Here is the data ...
Collector<- c("BK","YE_LD","BK","JB","JB",
"BK","BK","BK","JB","YE_LD")
Order<-c("A","B","B","B","A",
"B","B","A","B","B")
data <- data.frame(Order,Collector)
Now I want to add a cutoff to my subset, so that only the Order values with a minimum number of counts are shown.
So if I set cutoff = 4, I should get only the bar at the bottom, which has 7 counts; the bar at the top, with 3 counts, should not appear.
I have no idea how I can do this ...
Thanks for your help.
You could create a subset of the data and use this new object in ggplot. The following command will remove all Order conditions with less than four data points:
subset(data, Order %in% names(which(table(Order) >= 4)))
Order Collector
2 B YE_LD
3 B BK
4 B JB
6 B BK
7 B BK
9 B JB
10 B YE_LD
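Put together with a plot, this could look roughly as follows. It is a simplified sketch: `cutoff` and `kept` are names introduced here, and the text labels, title and axis settings from the question are omitted:

```r
library(ggplot2)

# Data from the question
Collector <- c("BK", "YE_LD", "BK", "JB", "JB",
               "BK", "BK", "BK", "JB", "YE_LD")
Order <- c("A", "B", "B", "B", "A",
           "B", "B", "A", "B", "B")
data <- data.frame(Order, Collector)

cutoff <- 4  # minimum number of rows an Order needs in order to be shown

# Keep only the Order values that occur at least `cutoff` times
kept <- subset(data, Order %in% names(which(table(Order) >= cutoff)))

# Stacked proportions per Order, flipped as in the question
ggplot(kept, aes(x = Order, fill = Collector)) +
  geom_bar(position = "fill", width = 0.6) +
  coord_flip()
```

With cutoff = 4 only Order B (7 rows) remains, so the plot shows a single bar.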
