Using the airline-safety dataset available here, I'm trying to create a heat map in R. I want to order the heat map so that the airlines with the highest number of fatal accidents are listed at the top.
I'm able to order the heat map by "value", but this orders the rows by value regardless of which group the value belongs to, i.e. incidents, fatal accidents or fatalities.
# load packages -----------
library(tidyverse)
library(ggplot2)
library(reshape2)
library(dplyr)
library(plyr)
library(scales)
library(forcats)
# read in the data
airlines <- read.csv("/Volumes/GoogleDrive/My Drive/Uni/DVN/AT2/Blog 2/airline_incidents.csv", header = TRUE)
# select relevant columns
airlines_00_14 <- airlines[,c(1,6,7,8)]
# create a long dataset
airlines_00_14.m <- melt(airlines_00_14)
# rescale values for heat map
airlines_00_14.m <- ddply(airlines_00_14.m, .(variable), transform, rescale = rescale(value))
# create heat map
(q <- airlines_00_14.m %>%
ggplot( aes(x = variable, y = reorder(airline, value))) +
geom_tile(aes(fill = rescale), colour = "white") +
scale_fill_gradient(low = "white", high = "steelblue"))
One way to do this is to create the order before you melt, like this:
# order by fatal accidents and generate air_order value
airlines_00_14 = airlines_00_14[order(airlines_00_14$fatal_accidents_00_14),]
airlines_00_14$air_order = seq_len(nrow(airlines_00_14))
Then, when you use reshape2::melt, set `id.vars = c("airline", "air_order")`:
# create a long dataset
airlines_00_14.m <- reshape2::melt(airlines_00_14,id.vars = c("airline", "air_order"))
Then, in your plot, use y=reorder(airline, air_order) instead of the current y=reorder(airline, value)
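If you would rather not carry a helper column, another option is to fix the factor levels of airline before melting. This is only a sketch of that idea, not the code from the question: the `toy` data frame and its values are invented stand-ins for the real file, and only the column names follow the question.

```r
library(ggplot2)

# Hypothetical stand-in for the airline data (values invented for illustration)
toy <- data.frame(
  airline = c("A", "B", "C"),
  incidents_00_14 = c(5, 1, 3),
  fatal_accidents_00_14 = c(0, 2, 1),
  fatalities_00_14 = c(0, 150, 40)
)

# Set the level order by fatal accidents (ascending).
# On a discrete y axis the last level is drawn at the top,
# so the airline with the most fatal accidents ends up first.
toy$airline <- factor(toy$airline,
                      levels = toy$airline[order(toy$fatal_accidents_00_14)])

# Long format; the factor levels of the id column survive the melt
toy_long <- reshape2::melt(toy, id.vars = "airline")

# Rescale each variable to [0, 1] separately, as in the question
toy_long$rescale <- ave(toy_long$value, toy_long$variable, FUN = scales::rescale)

p <- ggplot(toy_long, aes(x = variable, y = airline)) +
  geom_tile(aes(fill = rescale), colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue")
p
```

Because the order lives in the factor levels, no reorder() call is needed inside aes().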
I'm working on a project and a small part of it consists of drawing a world map with 43 countries on my list. My dataset is as follows:
How do I put this on the world map with different colors for development status as follows?
Data is here :
https://wetransfer.com/downloads/0960ed96fba15e9591a2e9c14ac852fa20220301181615/dc25f41a87fc2ba165a72ab6712dd8d020220301181640/832b5e
A quick example using fake data:
library(dplyr)
library(ggplot2)
# simulate data for subset of countries
mydata <- map_data("world") %>%
distinct(region) %>%
mutate(fakedata = runif(n())) %>%
slice_sample(n = 200)
# add simulated values and remove Antarctica
worldmap <- map_data("world") %>%
filter(region != "Antarctica") %>%
left_join(mydata, by = "region")
ggplot(worldmap) +
geom_polygon(aes(long, lat, group = group, fill = fakedata)) +
coord_quickmap() +
scale_fill_viridis_c(option = "plasma", na.value = NA) +
theme_void()
Also look into the {sf} package and geom_sf(), which among other things makes it easier to use different / less distorted / less biased map projections.
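Since the question asks for colors by development status rather than a continuous value, the same join works with a categorical column. A minimal sketch, assuming the status is stored as text: the `status` table, its four countries and the labels "Developed" / "Developing" are made up here because the real dataset is not shown.

```r
library(dplyr)
library(ggplot2)

# Hypothetical lookup table; replace with your own 43 countries.
# Names must match the spelling used in map_data("world")$region.
status <- data.frame(
  region = c("Brazil", "India", "Germany", "Japan"),
  dev_status = c("Developing", "Developing", "Developed", "Developed")
)

worldmap <- map_data("world") %>%
  filter(region != "Antarctica") %>%
  left_join(status, by = "region")

p <- ggplot(worldmap) +
  geom_polygon(aes(long, lat, group = group, fill = dev_status)) +
  coord_quickmap() +
  # one fixed color per category; countries not in the list stay grey
  scale_fill_manual(values = c(Developed = "steelblue", Developing = "orange"),
                    na.value = "grey90") +
  theme_void()
p
```

Countries whose names do not match exactly (for example "USA" versus "United States") will silently end up as NA, so it is worth checking the join result before plotting.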
Similar to @zephryl's answer, but using tmap. The first step is joining your data with the World data by country name. The next step is drawing the map.
library(dplyr)
library(tmap)
# Get World data
data("World")
# Dummy data frame similar to your data
df <- data.frame(location = World$name,
devStat = rnorm(length(World$name), 5, 2.5))
# Join by country name
# Just need to make sure that country names are written exactly the same
# in the two datasets
df2 <- World |>
left_join(df, by = c("name" = "location"))
# Create map
# Shape to add to the map
tm_shape(df2) +
# Draw the previous shape as polygons
# Set the attribute to which the polygons will be coloured
tm_polygons("devStat",
# select palette
palette = "-plasma",
# Set palette categories as order
style = "order",
# Horizontal legend
legend.is.portrait = FALSE) +
# Remove frame from layout
tm_layout(frame = FALSE,
# Put legend outside the frame
legend.outside = TRUE,
legend.outside.position = "top")
I am attempting to create heat maps with a large data set that has several factors. I'd like to get a bird's-eye view first, by plotting the heat map of all values and all factors. THEN, I'd like to subset the heat map plot by a variety of factors, but have ggplot2::geom_tile re-calculate the heat map so it plots the relative abundance based on whatever factors I've subsampled.
library(reshape2)
library(ggplot2)
library(dplyr)
#Test data
df <- data.frame(
Measurement = c(1:30),
CA = rep(rnorm(30, mean=20, sd=5)),
TX = rep(rnorm(30, mean=18, sd=5)),
NY = rep(rnorm(30, mean=34, sd=2))
)
df.melt <- melt(df,id = c("Measurement"))
Basic heat map plot code. My actual data includes several factors/columns from which I want to pull data for various comparisons.
#Basic plot
ggplot(data = df.melt,
aes(x = variable, y = Measurement, colors = value, fill = value)) +
geom_tile(color = "black") +
scale_fill_gradientn(colors = c("lightyellow", "darkred"))
I want the output colors to correspond to relative abundance by measurement, so I can look at relative changes across CA, TX, and NY. This would be my "Base plot".
df.melt.reabun <- df.melt %>% group_by(Measurement) %>%
mutate(RelAbun = value/sum(value))
df.melt.reabun <- as.data.frame(df.melt.reabun)
#New plot with relative abundance
ggplot(data = df.melt.reabun,
aes(x = variable, y = Measurement, colors = RelAbun, fill = RelAbun)) +
geom_tile(color = "black") +
scale_fill_gradientn(colors = c("lightyellow", "darkred"))
I also want to be able to re-plot any subset I like and have the relative abundance re-calculated automatically for the tiles.
#Assign plot object
heat <- ggplot(data = df.melt.reabun,
aes(x = variable, y = Measurement, colors = RelAbun, fill = RelAbun)) +
geom_tile(color = "black")+
scale_fill_gradientn(colors = c("lightyellow", "darkred"))
#Select variable to subset data
alt <- c("CA", "TX")
#Subset ggplot object
heat %+% subset(df.melt.reabun, variable %in% alt)
But this output is incorrect, because it is only showing relative abundance from the calculation that included CA, TX, and NY.
I want the relative abundance to re-calculate every time I subset the df to plot at this step: heat %+% subset()
I have a feeling I can smoothly combine group_by and geom_tile to do this automatically.. but I can't quite figure it out. Any help would be appreciated. I have MANY MANY combinations of heat maps I want to look at and I do NOT want to re-calculate the relative abundance "manually" each time.
It's generally advisable to do your data wrangling before passing the data frame to ggplot. In this case, something like the following could work:
subsetFun <- function(df, var.filter){
return(df %>%
filter(variable %in% var.filter) %>%
group_by(Measurement) %>%
mutate(RelAbun = value / sum(value)) %>%
ungroup())
}
heat %+% subsetFun(df.melt.reabun, alt)
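If you have many combinations to look at, the same idea can be wrapped into one helper that filters, recomputes and plots in a single call. This is a sketch only: `plotHeat` and the small `df.long` data frame are hypothetical names and data, laid out like df.melt in the question.

```r
library(dplyr)
library(ggplot2)

# Small made-up long data set with the same columns as df.melt
set.seed(1)
df.long <- data.frame(
  Measurement = rep(1:5, times = 3),
  variable = rep(c("CA", "TX", "NY"), each = 5),
  value = runif(15, min = 10, max = 40)
)

# Filter to the chosen variables, recompute relative abundance
# within each Measurement, then draw the heat map
plotHeat <- function(df, var.filter) {
  df %>%
    filter(variable %in% var.filter) %>%
    group_by(Measurement) %>%
    mutate(RelAbun = value / sum(value)) %>%
    ungroup() %>%
    ggplot(aes(x = variable, y = Measurement, fill = RelAbun)) +
    geom_tile(color = "black") +
    scale_fill_gradientn(colors = c("lightyellow", "darkred"))
}

p <- plotHeat(df.long, c("CA", "TX"))
p

# Sanity check: within each Measurement the shares should add up to 1
tapply(p$data$RelAbun, p$data$Measurement, sum)
```

Calling plotHeat(df.long, c("CA", "NY")) or any other combination recomputes the shares from the raw values each time, so nothing from a previous subset carries over.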
I have the following code :
library(ggplot2)
ggplot(data = diamonds, aes(x = cut)) +
geom_bar()
with this result.
I would like to sort the graph on the count descending.
There are multiple ways to do this (it is probably possible using only options within ggplot). One way is to use the dplyr library to summarise the data first and then plot the bar chart with ggplot:
# load the ggplot library
library(ggplot2)
# load the dplyr library
library(dplyr)
# load the diamonds dataset
data(diamonds)
# using dplyr:
# take the diamonds dataset
newData <- diamonds %>%
# group it by cut column
group_by(cut) %>%
# count number of observations of each type
summarise(count = n())
# change levels of the cut variable
# you tell R to order the cut variable according to number of observations (i.e. count variable)
newData$cut <- factor(newData$cut, levels = newData$cut[order(newData$count, decreasing = TRUE)])
# plot the ggplot
ggplot(data = newData, aes(x = cut, y = count)) +
geom_bar(stat = "identity")
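For the "within ggplot" route, one option is to reorder the factor inside aes() with forcats::fct_infreq(), which sorts levels by how often they occur. A short sketch, assuming the forcats package is installed:

```r
library(ggplot2)
library(forcats)

# fct_infreq() orders the levels by frequency, most common first,
# so geom_bar() draws the bars in descending order of count
p <- ggplot(data = diamonds, aes(x = fct_infreq(cut))) +
  geom_bar() +
  labs(x = "cut")
p
```

This skips the summarise step because geom_bar() still does the counting; only the level order changes.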
I need your help.
I am trying to make a stacked bar plot in R and have not succeeded so far. I have read several posts, but without success either.
As I am a newbie, this is the chart I want (I made it in Excel)
And this is how I have the data
Thank you in advance
I would use the package ggplot2 to create this plot, as it makes positioning text labels easier than the base graphics package does:
# First we create a dataframe using the data taken from your excel sheet:
myData <- data.frame(
Q_students = c(1000,1100),
Students_with_activity = c(950, 10000),
Average_debt_per_student = c(800, 850),
Week = c(1,2))
# The data in the dataframe above is in 'wide' format, to use ggplot
# we need to use the tidyr package to convert it to 'long' format.
library(tidyr)
myData <- gather(myData,
Condition,
Value,
Q_students:Average_debt_per_student)
# To add the text labels we calculate the midpoint of each bar and
# add this as a column to our dataframe using the package dplyr:
library(dplyr)
myData <- group_by(myData,Week) %>%
mutate(pos = cumsum(Value) - (0.5 * Value))
#We pass the dataframe to ggplot2 and then add the text labels using the positions which
#we calculated above to place the labels correctly halfway down each
#column using geom_text.
library(ggplot2)
# plot bars and add text
p <- ggplot(myData, aes(x = Week, y = Value)) +
geom_bar(aes(fill = Condition),stat="identity") +
geom_text(aes(label = Value, y = pos), size = 3)
#Add title
p <- p + ggtitle("My Plot")
#Plot p
p
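Depending on your ggplot2 version, the stacking order of the bars may not match the order used for the manual cumsum() positions, in which case the labels land in the wrong segment. A possible alternative is to let ggplot2 place the labels with position_stack(vjust = 0.5). The sketch below rebuilds the long data by hand (the name `myLong` is just for this example) so it runs on its own:

```r
library(ggplot2)

# Same numbers as above, already in long format
myLong <- data.frame(
  Week = rep(c(1, 2), times = 3),
  Condition = rep(c("Q_students", "Students_with_activity",
                    "Average_debt_per_student"), each = 2),
  Value = c(1000, 1100, 950, 10000, 800, 850)
)

p2 <- ggplot(myLong, aes(x = factor(Week), y = Value, fill = Condition)) +
  geom_col() +
  # stack the labels the same way as the bars and centre them in each segment
  geom_text(aes(label = Value, group = Condition),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(x = "Week", title = "My Plot")
p2
```

With this approach the pos column is not needed, since the label positions are derived from the same stacking as the bars.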
so <- data.frame(week1 = c(1000, 950, 800), week2 = c(1100, 10000, 850), row.names = c("Q students", "students with Activity", "average debt per student"))
barplot(as.matrix(so))
Below is the dataset.
https://docs.google.com/spreadsheet/ccc?key=0AjmK45BP3s1ydEUxRWhTQW5RczVDZjhyell5dUV4YlE#gid=0
Code:
counts = table(finaldata$satjob, finaldata$degree)
barplot(counts, xlab="Highest Degree after finishing 9-12th Grade",col = c("Dark Blue","Blueviolet","deepPink4","goldenrod"), legend =(rownames(counts)))
The below barplot is the result of the above code.
https://docs.google.com/file/d/0BzmK45BP3s1yVkx5OFlGQk5WVE0/edit
Now, I want to create the plot for the relative frequency table of "counts".
To create a relative frequency table, I need to divide each cell of a column by the column total to get the relative frequency for that cell, and do the same for the other columns. How do I go about doing it?
I have tried counts/sum(counts), but this does not work. counts[1:4]/sum(counts[1:4]) gives me the relative frequency of the first column only.
Help me obtain the same for the other columns in the same table.
I'm a big fan of plyr & ggplot2, so you may have to download a few packages for the below to work.
install.packages('ggplot2') # only have to run once
install.packages('plyr') # only have to run once
install.packages('scales') # only have to run once
library(plyr)
library(ggplot2)
library(scales)
# dat <- YOUR DATA
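# count rows per degree/satjob combination; plyr::count() adds a 'freq' column
dat_count <- count(dat, c('degree', 'satjob'))
# within each degree, divide each count by the degree total
dat_rel_freq <- ddply(dat_count, .(degree), transform, rel_freq = freq/sum(freq))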
ggplot(dat_rel_freq, aes(x = degree, y = rel_freq, fill = satjob)) +
geom_bar(stat = 'identity') +
scale_y_continuous(labels = percent) +
labs(title = 'Highest Degree After finishing 9-12th Grade\n',
x = '',
y = '',
fill = 'Job Satisfaction')
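If you want to stay in base R, prop.table() with margin = 2 divides every cell by its column total, which is the calculation described in the question. A sketch with simulated data, since the spreadsheet is not reproduced here: `finaldata_toy` and its category labels are invented, only the column names satjob and degree come from the question.

```r
# Simulated stand-in for finaldata (labels are made up for illustration)
set.seed(42)
finaldata_toy <- data.frame(
  satjob = sample(c("Very satisfied", "Mod. satisfied",
                    "A little dissatisfied", "Very dissatisfied"),
                  200, replace = TRUE),
  degree = sample(c("High school", "Bachelor", "Graduate"),
                  200, replace = TRUE)
)

counts <- table(finaldata_toy$satjob, finaldata_toy$degree)

# margin = 2: divide each cell by its column total
rel_freq <- prop.table(counts, margin = 2)

# every column now sums to 1
colSums(rel_freq)

barplot(rel_freq,
        xlab = "Highest Degree after finishing 9-12th Grade",
        col = c("darkblue", "blueviolet", "deeppink4", "goldenrod"),
        legend.text = rownames(rel_freq))
```

counts/sum(counts) divides by the grand total instead, which is why it did not give per-column frequencies.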