Sankey Diagrams in R? - r
I am trying to visualize my data flow with a Sankey Diagram in R.
I found this blog post linking to an R script that produces a Sankey Diagram; unfortunately, it's quite raw and somewhat limited (see below for sample code and data).
Does anyone know of other scripts (or maybe even a package) that are more developed? My end goal is to visualize both data flow and percentages through the relative size of the diagram components, as in these examples of Sankey Diagrams.
I posted a somewhat similar question on the r-help list, but after two weeks without any responses I'm trying my luck here on stackoverflow.
Thanks,
Eric
PS. I'm aware of the Parallel Sets Plot, but that is not what I'm looking for.
# thanks to, https://tonybreyal.wordpress.com/2011/11/24/source_https-sourcing-an-r-script-from-github/
sourc.https <- function(url, ...) {
# install and load the RCurl package
if (match('RCurl', nomatch=0, installed.packages()[,1])==0) {
install.packages(c("RCurl"), dependencies = TRUE)
require(RCurl)
} else require(RCurl)
# parse and evaluate each .R script
sapply(c(url, ...), function(u) {
eval(parse(text = getURL(u, followlocation = TRUE,
cainfo = system.file("CurlSSL", "cacert.pem",
package = "RCurl"))), envir = .GlobalEnv)
} )
}
# from https://gist.github.com/1423501
sourc.https("https://raw.github.com/gist/1423501/55b3c6f11e4918cb6264492528b1ad01c429e581/Sankey.R")
# My example (there is another example inside Sankey.R):
inputs = c(6, 144)
losses = c(6,47,14,7, 7, 35, 34)
unit = "n ="
labels = c("Transfers",
"Referrals\n",
"Unable to Engage",
"Consultation only",
"Did not complete the intake",
"Did not engage in Treatment",
"Discontinued Mid-Treatment",
"Completed Treatment",
"Active in \nTreatment")
SankeyR(inputs,losses,unit,labels)
# Clean up my mess
rm("inputs", "labels", "losses", "SankeyR", "sourc.https", "unit")
Sankey Diagram produced with the above code,
This plot can be created through the networkD3 package. It allows you to create interactive sankey diagrams. Here you can find an example. I also added a screenshot so you have an idea what it looks like.
# Load package
library(networkD3)
# Load energy projection data
URL <- paste0(
"https://cdn.rawgit.com/christophergandrud/networkD3/",
"master/JSONdata/energy.json")
Energy <- jsonlite::fromJSON(URL)
# Plot
sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
units = "TWh", fontSize = 12, nodeWidth = 30)
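To use sankeyNetwork with the data from the question rather than the energy example, the flows have to be reshaped into a node table and a link table. Below is a minimal sketch, assuming that each label in the question pairs with the value at the same position, and that a single hub node joins the inflows to the outflows. The hub name "All clients" and the helper node_index are made up for illustration; they are not part of networkD3 or of the original data.

```r
# Data from the question: two inflows and seven outflows (both sum to 150)
inputs <- c("Transfers" = 6, "Referrals" = 144)
losses <- c("Unable to Engage" = 6,
            "Consultation only" = 47,
            "Did not complete the intake" = 14,
            "Did not engage in Treatment" = 7,
            "Discontinued Mid-Treatment" = 7,
            "Completed Treatment" = 35,
            "Active in Treatment" = 34)

# Hypothetical hub node (an assumption, not in the original data)
hub <- "All clients"

nodes <- data.frame(name = c(names(inputs), hub, names(losses)),
                    stringsAsFactors = FALSE)

# networkD3 expects zero-based node indices in the links table
node_index <- function(x) match(x, nodes$name) - 1L

links <- rbind(
  data.frame(source = node_index(names(inputs)),
             target = node_index(hub),
             value  = unname(inputs)),
  data.frame(source = node_index(hub),
             target = node_index(names(losses)),
             value  = unname(losses))
)

# Only build the widget if networkD3 is installed
if (requireNamespace("networkD3", quietly = TRUE)) {
  sankey_plot <- networkD3::sankeyNetwork(
    Links = links, Nodes = nodes,
    Source = "source", Target = "target", Value = "value",
    NodeID = "name", units = "n", fontSize = 12, nodeWidth = 30
  )
  # print(sankey_plot) shows the interactive diagram
}
```

The link widths are proportional to the value column, so the relative sizes asked for in the question come for free once the data is in this shape.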
I have created a package (riverplot) that has a slightly different, but overlapping functionality compared to the Sankey function, and can produce plots like this one:
If you want to do it with R, your best bet seems to be @Roman's suggestion: hack the SankeyR function. For example, below is my very quick fix: orient the labels vertically, offset them slightly, and decrease the font size of the input referrals label to make it look a bit better. This modification only changes lines 171 and 223 in the SankeyR function:
#line171 - change oversized font size of input label
fontsize = max(0.5,frInputs[j]*1.5)#1.5 instead of 2.5
#line223 - srt changes from 35 to 90 to orient labels vertically,
#and offset adjusts them to get better alignment with arrows
text(txtX, txtY, fullLabel, cex=fontsize, pos=4, srt=90, offset=0.1)
I am no ace in trigonometry, but that is really what you need for changing the direction of the arrows. The ideal, in my view, would be to adjust the loss arrows so they are oriented horizontally rather than vertically. Otherwise, while my solution fixes the problem with label orientation, it doesn't make the diagram much more readable...
In addition to rCharts, Sankey diagrams can now also be generated in R with googleVis (version >= 0.5.0). For example, this post describes the generation of the following diagram using googleVis:
R's alluvial package will also do this (from ?alluvial).
# install.packages(c("alluvial"), dependencies = TRUE)
require(alluvial)
# Titanic data
tit <- as.data.frame(Titanic)
# 4d
alluvial( tit[,1:4], freq=tit$Freq, border=NA,
hide = tit$Freq < quantile(tit$Freq, .50),
col=ifelse( tit$Class == "3rd" & tit$Sex == "Male", "red", "gray") )
plotly has the same power as networkD3 package (example link).
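As a rough illustration of the plotly route, here is a sketch that turns the built-in Titanic table into Class -> Survived flows and hands them to plot_ly(type = "sankey"). The choice of those two variables and the names flows, src and tgt are my own assumptions for the example; like networkD3, plotly wants zero-based node indices.

```r
# Aggregate the built-in Titanic table to Class -> Survived flows
tit <- as.data.frame(Titanic)
flows <- aggregate(Freq ~ Class + Survived, data = tit, FUN = sum)

# Node labels: the four classes first, then the two survival outcomes
n_class <- nlevels(flows$Class)
node_labels <- c(levels(flows$Class),
                 paste("Survived:", levels(flows$Survived)))

# Zero-based indices into node_labels
src <- as.integer(flows$Class) - 1L
tgt <- n_class + as.integer(flows$Survived) - 1L

# Only build the figure if plotly is installed
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(
    type = "sankey",
    orientation = "h",
    node = list(label = node_labels, pad = 15, thickness = 20),
    link = list(source = src, target = tgt, value = flows$Freq)
  )
  # print(fig) shows the interactive diagram
}
```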
For completeness, there is also the ggalluvial package which is a ggplot2 extension for alluvial/Sankey diagrams.
Here is an example taken from the package's documentation
# devtools::install_github("corybrunson/ggalluvial", ref = "optimization")
library(ggalluvial)
titanic_wide <- data.frame(Titanic)
ggplot(data = titanic_wide,
aes(axis1 = Class, axis2 = Sex, axis3 = Age,
y = Freq)) +
scale_x_discrete(limits = c("Class", "Sex", "Age"), expand = c(.1, .05)) +
xlab("Demographic") +
geom_alluvium(aes(fill = Survived)) +
geom_stratum() + geom_text(stat = "stratum", label.strata = TRUE) +
theme_minimal() +
ggtitle("passengers on the maiden voyage of the Titanic",
"stratified by demographics and survival") +
theme(legend.position = 'bottom')
ggplot(titanic_wide,
aes(y = Freq,
axis1 = Survived, axis2 = Sex, axis3 = Class)) +
geom_alluvium(aes(fill = Class),
width = 0, knot.pos = 0, reverse = FALSE) +
guides(fill = FALSE) +
geom_stratum(width = 1/8, reverse = FALSE) +
geom_text(stat = "stratum", label.strata = TRUE, reverse = FALSE) +
scale_x_continuous(expand = c(0, 0),
breaks = 1:3, labels = c("Survived", "Sex", "Class")) +
scale_y_discrete(expand = c(0, 0)) +
coord_flip() +
ggtitle("Titanic survival by class and sex")
Created on 2018-11-13 by the reprex package (v0.2.1.9000)
Judging by these definitions this function, like the Parallel Sets Plot, lacks the capacity to split and combine flows (i.e. through more than one transition).
Since Sankey diagrams are directed weighted graphs, a package like qgraph might be useful.
The SankeyR function provides clearer labels if you sort the losses in descending order as the text is placed closer to the arrow heads without overlapping.
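A small sketch of that sorting step, using the data from the question. It assumes the labels vector lists the input labels first and then one label per loss, which matches the question's data (2 inputs + 7 losses = 9 labels); the names ord, losses_sorted and labels_sorted are just for illustration.

```r
inputs <- c(6, 144)
losses <- c(6, 47, 14, 7, 7, 35, 34)
labels <- c("Transfers",
            "Referrals\n",
            "Unable to Engage",
            "Consultation only",
            "Did not complete the intake",
            "Did not engage in Treatment",
            "Discontinued Mid-Treatment",
            "Completed Treatment",
            "Active in \nTreatment")

n_in <- length(inputs)

# Order of the losses from largest to smallest
ord <- order(losses, decreasing = TRUE)

# Reorder the losses and keep each loss label attached to its value;
# the input labels stay where they are at the front
losses_sorted <- losses[ord]
labels_sorted <- c(labels[seq_len(n_in)], labels[n_in + ord])

# Then call, with SankeyR and unit defined as in the question:
# SankeyR(inputs, losses_sorted, unit, labels_sorted)
```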
Have a look at //sankeybuilder.com, as it offers a ready-to-go solution where you can upload your data and play back variations over time. The transition works well (similar to the YouTube demo in your question). The SankeyTrend demo includes many time slots (years of data). Once it has loaded (it builds the Sankeys automatically), click the play button in the upper right-hand corner of the page to play back the time slots; you can even pause and resume. Demo url is here: SankeyTrend. Hope this helps your quest for the perfect Sankey diagram.
Just open sourced a package that uses an alluvial diagram to visualize workflow stages. Since history is kept when the alluvial form is used, there aren't any crossovers in the edges.
https://github.com/claytontstanley/shiny.alluvial
Related
Adding a customized legend to a R raster spplot map
I would like to ask for some advice on an R map that I am currently building with raster and spplot. I am a novice, so I apologize in advance if the methods I used are not optimal!
So: I have a raster object and almost got what I wanted, but I have trouble with the legend, and the result looks kind of childish. I'd like to get something a bit more "professional". I'd like to 1) improve the overall aesthetics and 2) add legends to my plot, such as the concentric bubble size legend proposed in this other post: create a concentric circle legend.
Here is what I have right now: death rate and exposure in France
What I think might improve the map:
Use a concentric circles bubble legend for hospital volume and put it in the top right corner
Add transparency to my points. Here I have 13 bubbles, but the real map has about 600, with many overlapping (especially in the Paris area).
Add a legend to my colour gradient
If you have any tips or comments, do not hesitate! I'm a beginner but eager to learn :)
I've enclosed simplified but complete code (13 hospitals instead of 600, data completely edited, variable names changed... so no need to interpret it!). I've edited it so that you can copy and paste it easily.
####################################################################
####################################################################
# 1) DATA PREPARATION
# Packages
library(raster)
library(rgeos)
library(latticeExtra)
library(sf)
# Mortality dataset
french_regions=c("IDF", "NE", "NO", "SE", "SO")
death_rates_reg=c(0.032,0.014,0.019,0.018,0.021)
region_mortality=data.frame(french_regions,death_rates_reg)
# Hospital dataset
hospital_id=1:13
expo=c(0.11,0.20,0.17,0.25,0.18,0.05,0.07,0.25,0.40,0.70,0.45,0.14,0.80)
volume=sample(1:200, 13, replace=TRUE)
lat=c(44.8236,48.8197,45.7599,45.2785,48.9183,50.61,43.6356,47.9877,48.8303,48.8302,48.8991,43.2915,48.7232)
long=c(-0.57979,7.78697,4.79666,6.3421,2.52365,3.03763,3.8914,-4.095,2.34038,2.31117,2.33083,5.56335,2.45025)
french_hospitals=data.frame(hospital_id,expo,volume,lat,long)
# French regions map object - merge of departments according to phone codes
formes <- getData(name="GADM", country="FRA", level=2)
formes$NAME_3=0
# NAME_3 = new mega-regions IDF, NE, NO, SE, SO
formes$NAME_3[formes$NAME_1=="Auvergne-Rhône-Alpes"]="SE"
formes$NAME_3[formes$NAME_1=="Bourgogne-Franche-Comté"]="NE"
formes$NAME_3[formes$NAME_1=="Bretagne"]="NO"
formes$NAME_3[formes$NAME_1=="Centre-Val de Loire"]="NO"
formes$NAME_3[formes$NAME_1=="Corse"]="SE"
formes$NAME_3[formes$NAME_1=="Grand Est"]="NE"
formes$NAME_3[formes$NAME_1=="Hauts-de-France"]="NE"
formes$NAME_3[formes$NAME_1=="Île-de-France"]="IDF"
formes$NAME_3[formes$NAME_1=="Normandie"]="NO"
formes$NAME_3[formes$NAME_1=="Nouvelle-Aquitaine"]="SO"
formes$NAME_3[formes$NAME_1=="Occitanie"]="SO"
formes$NAME_3[formes$NAME_1=="Pays de la Loire"]="NO"
formes$NAME_3[formes$NAME_1=="Provence-Alpes-Côte d'Azur"]="SE"
formes$NAME_3[formes$NAME_2=="Aude"]="SE"
formes$NAME_3[formes$NAME_2=="Gard"]="SE"
formes$NAME_3[formes$NAME_2=="Hérault"]="SE"
formes$NAME_3[formes$NAME_2=="Lozère"]="SE"
formes$NAME_3[formes$NAME_2=="Pyrénées-Orientales"]="SE"
groups = aggregate(formes, by = "NAME_3")
# Colour palettes
couleurs_death=colorRampPalette(c('gray100','gray50'))
couleurs_expo=colorRampPalette(c('green','gold','red','darkred'))
# Hospitals bubble sizes and colours
my_colours=couleurs_expo(401)
french_hospitals$bubble_color="Initialisation"
french_hospitals$indice=round(french_hospitals$expo*400,digits=0)+1
french_hospitals$bubble_size=french_hospitals$volume*(1.5/50)
for(i in 1:length(french_hospitals$bubble_color)){
  french_hospitals$bubble_color[i]=my_colours[french_hospitals$indice[i]]
}
####################################################################
####################################################################
# 2) MAP
# Assignation of death rates to regions
idx <- match(groups$NAME_3, region_mortality$french_regions)
concordance <- region_mortality[idx, "death_rates_reg"]
groups$outcome_char <- concordance
# First map: region colours = death rates
graphA=spplot(groups, "outcome_char", col.regions=couleurs_death(500),
  par.settings = list(fontsize = list(text = 12)),
  main=list(label=" ",cex=1),colorkey = list(space = "bottom", height = 0.85))
# Second map: hospital bubbles = exposure
GraphB=graphA + layer(panel.points(french_hospitals[,c(5,4)],col=french_hospitals$bubble_color,pch=20,
  cex=french_hospitals$bubble_size))
# Addition of the legend
Bubble_location=matrix(data=c(-4.0,-2.0,0.0,-4.0,-2.0,0.0,42.3,42.3,42.3,41.55,41.55,41.55),nrow=6,ncol=2)
GraphC1=GraphB + layer(panel.points(Bubble_location,
  col=c(my_colours[5],my_colours[125],my_colours[245],"black","black","black"),
  pch=19,cex=c(2.5,2.5,2.5,5.0,2.0,1.0)))
Bubble_location2=matrix(data=c(-3.4,-1.27,0.55, -3.65, -3.3 , -3.4,-1.52,0.48,42.31,42.31,42.31,42.55,41.9, 41.56,41.56,41.56),nrow=8,ncol=2)
GraphC2=GraphC1+layer(panel.text(Bubble_location2,
  label=c("0%","30%","60%", "Exposure:", "Hospital volume:", "125","50","25"),
  col="black", cex=1.0))
# Final map
GraphC2
Thank you in advance for your help! (I know this is a lot, so do not feel forced to dive into the code.)
It isn't pretty, but I think this can get you started, barring a more complete answer from someone else. I'd suggest using ggplot instead of spplot. The only thing you need to do is convert your sp object to sf to integrate with ggplot. The bubble legend needs a lot of guess and check, so I'll leave that up to you... Map layout design is still better in GIS software, in my opinion.
library(sf)
library(ggplot2)
# Convert sp to sf
groups_sf <- st_as_sf(groups)
# Make reference dataframe for concentric bubble legend
bubble_legend <- data.frame(x = c(8.5, 8.5, 8.5), y = c(50, 50, 50), size = c(3, 6, 9))
ggplot() +
  geom_sf(data = groups_sf) +
  geom_point(data = french_hospitals, aes(x = long, y = lat, color = indice, size = bubble_size), alpha = 0.7) +
  geom_point(data = bubble_legend, aes(x = x, y = y + size/50), size = bubble_legend$size, shape = 21, color = "black", fill = NA) +
  geom_text(data = bubble_legend, aes(x = x + 0.5, y = y + size/50, label = size), size = 3) +
  scale_color_gradient(low = "green", high = "red") +
  guides(size="none")
Let me know what you think. I can help troubleshoot more if there are any issues.
Thank you for your answer Skaqqs, very appreciated. This is in my opinion a good step forward!! I tried it quickly on the real data and it already looks way better, especially with the transparency. I can't really show more since that's sensitive data on a trendy topic and we want to keep it confidential as much as possible until article submission. I'll move on from this good starting base and update you. Thank you :)
Apply color gradient to ggraph's `geom_conn_bundle`
Context
I am using ggraph to arrange nodes (leaves of a tree) in a circular dendrogram and then add connections between some of the nodes (with hierarchical bundling via geom_conn_bundle):
library(ggraph)
library(igraph)
# Example data
edges <- data.frame(from="root", to=paste("leaf", seq(1,100), sep=""))
vertices <- data.frame(name = unique(c(as.character(edges$from), as.character(edges$to))) )
tree <- graph_from_data_frame( edges, vertices=vertices )
# Drawing nodes
pr <- ggraph(tree, layout = "dendrogram", circular = TRUE) +
  geom_edge_diagonal(alpha = 0.2)
# Example connection
pr <- pr + geom_conn_bundle(
  data = get_con(from = 23, to = 42),
  alpha=0.8, width=3, colour="skyblue", tension = 0.9 )
print(pr)
This nicely displays a nearly transparent dendrogram and some (in this example, one) connections in skyblue.
Problem / Desired output
What I'd like, though, is for the direction of the connection to be indicated by a color gradient (i.e. starting with green and slowly changing into red) instead of showing the connection in just one color (skyblue). How can I achieve such a color gradient using R and ggraph's geom_conn_bundle?
The following excerpt from Holten (2006) can serve as an example of how I'd like the connections to look:
Several of the ggraph geoms for drawing edges, including geom_conn_bundle and geom_edge_diagonal, have a calculated index stat. It's a number from 0 to 1 indicating how far along the edge a point is. Note that the simplified versions of these geoms (geom_*0) don't calculate it. Some mentions of it are in this blog post by the ggraph author.
In this case, map the index stat(index) to color inside your bundle's aes, then set a gradient scale with scale_edge_color_gradient (not scale_color_gradient, as I initially tried). In the example picture, I can't tell whether the width is also scaled, but the same would work, e.g. edge_width = stat(index).
library(ggraph)
library(igraph)
ggraph(tree, layout = "dendrogram", circular = TRUE) +
  geom_edge_diagonal(alpha = 0.2) +
  geom_conn_bundle(aes(color = stat(index)),
    data = get_con(from = 23, to = 42),
    alpha=0.8, width=3,
    # colour="skyblue",
    tension = 0.9
  ) +
  scale_edge_color_gradient(low = "green", high = "red")
Created on 2019-03-09 by the reprex package (v0.2.1)
How To Rotate Labels when using Mosaic Plots within VCD
I am trying to rotate the labels (not the variable names) on a plot I have created, and I am struggling to find and adapt any workable solution. As you can see, the labels are not readable in their current form. The plot looks into reasons for planning permission objections; the variables are income, politics, sex, and attitude to new homes in the respondent's region. Here is the code. The last iteration includes the labeling function, but it has no effect on the plot.
library(vcd)
mosaic(~Sex+HomsBultBPV+HHIncQV++PartyID, data=BSA, shade=TRUE,
  labeling_args = list(set_varnames = c(Sex="Gender",
    HomsBultBPV="Attitude To Homes Built in Area",
    PartyID="Political Affiliation",
    HHIncQV="Income Quartile",
    labeling= labeling_border(rot_labels = c(25,25,25,0),
      just_labels = c("left", "center", "center", "center")))))
The solution is this parameter, passed through mosaic()'s labeling argument: vcd::labeling_border(rot_labels = c(__, __))
Try out this code:
library(vcd)
mosaic(~ Sex + Age + Survived, data = Titanic,
  main = "Survival on the Titanic", shade = TRUE, legend = TRUE,
  labeling = vcd::labeling_border(rot_labels = c(45, 45)))
How to change plot background of a ctree object in R?
I'm making a tree using the party package for a poster, and the background of the poster is grey. I've been able to change the background of all of my other plots (box plots, scatter plots) to grey by using the command par(bg = "grey"), but this doesn't work for ctree.
For example, this makes a scatter plot on a grey background:
airq <- subset(airquality, !is.na(Ozone))
par(bg="grey")
plot(Temp ~ Wind, data = airq)
But this does not make a tree on a grey background:
library("party")
air.ct <- ctree(Ozone ~ ., data = airq)
par(bg = "grey")
plot(air.ct, inner_panel=node_inner(air.ct, pval = TRUE, id = FALSE),
  terminal_panel = node_boxplot(air.ct, id = FALSE))
Please help, my poster is due on Thursday!
Both the party package and its successor partykit are based on the grid package for visualization. Therefore, the par() function for base graphics is ignored when creating grid graphics. For the latter, there is a gpar() function, but it does not directly support setting a bg background. Therefore, in the current version of party or partykit, setting the background color is not possible via simple arguments, only by supplying adapted panel functions.
However, as this feature was already partially supported in some panel functions, I've adapted the partykit package on R-Forge to enable setting backgrounds. The most recent version of the package is required for this:
library("partykit")
packageDescription("partykit")$Version
## [1] "1.0-5"
The tree can be grown as in your example:
airq <- subset(airquality, !is.na(Ozone))
air.ct <- ctree(Ozone ~ ., data = airq)
Then we first add an empty page with a gray background:
grid.newpage()
grid.rect(gp = gpar(col = "gray", fill = "gray"))
Then the tree can be added:
plot(air.ct,
  ip_args = list(id = FALSE, fill = "gray"),
  ep_args = list(fill = "gray"),
  tp_args = list(id = FALSE, bg = "gray", fill = "slategray"),
  newpage = FALSE
)
To obtain this development version of partykit, please go to the R-Forge page of the package. There you can either check out the source package (see "SCM") and install it by hand, or you can wait until a new package has been built (see "R Packages"). The latter should hopefully be completed in a few hours.
Heat Table in R
I'm looking for a way to reproduce the kind of Heat Table shown below with R (and possibly ggplot2). The specific time axes are irrelevant; any rectangular table should do. I've searched Google for "heat map" and "heat table", but couldn't find any R package that did the trick. Thoughts?
require(ggplot2)
df <- data.frame(vaxis = rep(c(letters[1:5], "top"), each = 4),
  haxis = rep(c(letters[6:8], "right"), times = 6),
  value = rpois(24, lambda = 10))
df$color <- factor(ifelse(df$vaxis == "top" | df$haxis == "right", 1, 0))
ggplot(df, aes(x = haxis, y = vaxis, size = value, color = color)) + geom_point()
Just get your data in a similar format. You could write a function to make the "top" and "right" values normalized marginal sums. Of course, lots of tweaks are possible in naming, legends, theme, etc.
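One possible shape for such a function, as a rough sketch rather than the only way to do it. The name with_margins is made up, and the normalization (rescaling each set of marginal sums so that its largest value equals the largest cell) is just one assumption about what "normalized" should mean here. The corner cell (top, right) is left out.

```r
# Turn a named numeric matrix into the long format used above, adding a
# "right" column of row sums and a "top" row of column sums.
with_margins <- function(m) {
  stopifnot(is.matrix(m), !is.null(rownames(m)), !is.null(colnames(m)))
  # Assumed normalization: largest margin is drawn as large as the largest cell
  normalize <- function(x) x / max(x) * max(m)
  cells <- data.frame(vaxis = rep(rownames(m), times = ncol(m)),
                      haxis = rep(colnames(m), each = nrow(m)),
                      value = as.vector(m),
                      stringsAsFactors = FALSE)
  right <- data.frame(vaxis = rownames(m),
                      haxis = "right",
                      value = unname(normalize(rowSums(m))),
                      stringsAsFactors = FALSE)
  top <- data.frame(vaxis = "top",
                    haxis = colnames(m),
                    value = unname(normalize(colSums(m))),
                    stringsAsFactors = FALSE)
  out <- rbind(cells, right, top)
  # Same marker as in the answer: 1 for marginal points, 0 for ordinary cells
  out$color <- factor(ifelse(out$vaxis == "top" | out$haxis == "right", 1, 0))
  rownames(out) <- NULL
  out
}

# Small made-up matrix to show the output shape
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2,
            dimnames = list(c("a", "b"), c("f", "g", "h")))
df2 <- with_margins(m)

# Plot exactly as above:
# ggplot(df2, aes(x = haxis, y = vaxis, size = value, color = color)) + geom_point()
```

To keep "top" and "right" at the edges of the plot, you would probably also want to set the factor levels of vaxis and haxis explicitly, since ggplot2 sorts character axes alphabetically.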