How to plot Sankey Graph with R networkD3 values and percentage below each node - r

Good afternoon,
From the code below I am able to produce a Sankey chart, but it does not show the underlying values.
I tried to tweak the code in this thread but had no luck. I have never used JavaScript.
What I need is a chart that also shows the value and the percentage under each node, like the following picture.
Thanks
library(dplyr)
library(networkD3)
library(tidyverse)
library(readxl)
library(RColorBrewer)
df = data.frame(Source = c("ABC","CDE","MNB","PCI","UCD"),
Destination = c("Me","You","Him","Her","Her"),
Value = c(200,350,456,450,100))
## Reshape dataframe to long
df2 = pivot_longer(df, c(Destination, Source))
## make unique list for destination and source
dest = unique(as.character(df$Destination))
sources = unique(as.character(df$Source))
## Assign nodes number to each element of the chart
nodes2 = data.frame(node = append(dest,sources), nodeid = c(0:8))
res = merge(df,nodes2, by.x="Source", by.y = "node")
res = merge(res,nodes2, by.x="Destination", by.y = "node")
## Make links
links2 = res[, c("nodeid.x","nodeid.y","Value")]
colnames(links2) <- c("source", "target", "value")
## Add a 'group' column to each connection:
links2$group = as.factor(c("type_a","type_b","type_c","type_d","type_e"))
## defining nodes
nodes2["groups"] = nodes2$node
nodes2$groups = as.factor(nodes2$groups)
# Give a color for each group:
my_color <- 'd3.scaleOrdinal() .domain(["type_a","type_b","type_c","type_d","type_e","Me","You","Him","Her","Her"]) .range(["rgb(165,0,38,0.4)", "rgb(215,48,39, 0.4)", "rgb(244,109,67,0.4)", "rgb(253,174,97,0.4)", "rgb(254,224,139,0.4)",
"rgb(255,255,191,0.4)", "rgb(217,239,139,0.4)", "rgb(166,217,106,0.4)",
"rgb(102,189,99,0.4)","rgb(26,152,80,0.4)"])'
# plot graph
networkD3::sankeyNetwork(Links = links2, Nodes = nodes2,
Source = 'source',
Target = 'target',
Value = 'value',
NodeID = 'node',
units = 'Amount',
colourScale=my_color,
LinkGroup="group",
NodeGroup="groups",
fontFamily = "arial",
fontSize = 8,
nodeWidth = 8)

Update: below the original content there is a fully developed solution to your original request.
I'm still working on rendering the label on multiple lines instead of one, which is proving quite difficult with SVG text. In the meantime, here is a method that gets all of the desired information onto your diagram, even if it isn't styled exactly as you wished.
First I created the data to add to the plot. This has to be added to the widget after it's created. (It will just get stripped if you try to add it beforehand.)
This creates the before and after percentages and the aggregated sums (where needed).
# for this tiny data frame some of this grouping is redundant---
# however, this method could be used on a much larger scale
df3 <- df %>%
group_by(Source) %>%
mutate(sPerc = paste0(round(sum(Value) / sum(df$Value) * 100, 2), "%")) %>%
group_by(Destination) %>%
mutate(dPerc = paste0(round(sum(Value) / sum(df$Value) * 100, 2), "%")) %>%
pivot_longer(c(Destination, Source)) %>%
mutate(Perc = ifelse(name == "Destination",
dPerc, sPerc)) %>% # determine which % to retain
select(Value, value, Perc) %>% # only fields to add to widget
group_by(value, Perc) %>%
summarise(Value = sum(Value)) # get the sum for 'Her'
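As a sanity check on what df3 should contain, the same per-node totals and percentages can be recomputed with base R only. This is a minimal sketch, not part of the original answer; the names long and check are made up for illustration, and df is the data frame from the question.

```r
# Data frame from the question
df <- data.frame(Source = c("ABC", "CDE", "MNB", "PCI", "UCD"),
                 Destination = c("Me", "You", "Him", "Her", "Her"),
                 Value = c(200, 350, 456, 450, 100))

# Stack sources and destinations into one "node name" column
# (long and check are hypothetical helper names)
long <- data.frame(name = c(as.character(df$Source), as.character(df$Destination)),
                   Value = c(df$Value, df$Value))

# One row per node: total value and its share of the overall total
check <- aggregate(Value ~ name, data = long, FUN = sum)
check$Perc <- paste0(round(check$Value / sum(df$Value) * 100, 2), "%")
check
```

"Her" receives two links (450 and 100), so its row should show a total of 550; every other node carries the value of its single link.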
I saved the Sankey diagram with the object name plt. This next part adds the new data to the widget plt.
plt$x$nodes <- right_join(plt$x$nodes, df3, by = c("name" = "value"))
This final element adds the value and the percentages to the source and destination node labels.
htmlwidgets::onRender(plt, '
function(el, x) {
d3.select(el).selectAll(".node text")
.text(d => d.name + " " + d.Perc + " " + d.Value)
}')
Update: Multi-line labels
I guess I just needed to sleep on it. This update will get you multi-line text.
You also asked for resources on how you would go about doing this yourself. There are a few things at play here: JavaScript, SVG text, D3, and the package htmlwidgets. When you use onRender, it's important to know the script file that connects the package's R code to htmlwidgets. I would suggest starting by learning about htmlwidgets, for example how to create your own.
Alright-- back to answering the original question. This appends the new values using all of the content I originally provided, except the call to onRender.
htmlwidgets::onRender(plt, '
function(el, x) {
d3.select(el).selectAll(".node text").each(function(d){
var arr, val, anc
arr = " " + d.Perc + " " + d.Value;
arr = arr.split(" ");
val = d3.select(this).attr("x");
anc = d3.select(this).attr("text-anchor");
for(var i = 0; i < arr.length; i++) {
d3.select(this).append("tspan")
.text(arr[i])
.attr("dy", i ? "1.2em" : 0)
.attr("x", val)
.attr("text-anchor", anc)
.attr("class", "tspan" + i);
}
})
}')

Related

R - tmap_animation generates gif with green frames

I'm trying to create an animated GIF using tmap and display it in my Shiny app. When I use tm_shape() + tm_polygons() for a single date, the image produced is always OK. However, when I use tm_facets() and feed the result to tmap_animation, the resulting GIF has random dark green frames, as per the below.
Here is the code I am using to generate the animation:
data(World)
confirmed = read.csv('./data/time_series_covid_19_confirmed.csv', stringsAsFactors = T) %>%
select(-Province.State, -Country.Region) %>%
pivot_longer(!c('Lat', 'Long', 'iso3'), names_to = 'date', values_to = 'confirmed') %>%
mutate(date = as.Date(gsub('X', '', date), '%m.%d.%Y')) %>%
group_by(iso3, date) %>%
summarise(confirmed = sum(confirmed)) %>%
mutate(perc_change = 100 * ifelse(
lag(confirmed, default = 0) == 0 | confirmed < 1000 | lag(confirmed) > confirmed,
0,
(confirmed - lag(confirmed)) / lag(confirmed)
)
) %>%
inner_join(select(World, iso_a3, geometry), by=c('iso3' = 'iso_a3')) %>%
st_sf()
conf_anim = confirmed %>%
filter(date < '2020-02-28') %>%
tm_shape() + tm_polygons('perc_change', style='cont') +
# tm_fill('perc_change', palette='Blues', style='fixed',
# breaks=c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, Inf)) + tm_borders() +
tm_facets(along='date', free.coords = F)
tmap_animation(conf_anim, filename = './www/conf_anim.gif', delay=50)
Anyone know how I could fix this?
I had the same issue. I asked about it on the tmap GitHub issues tab and the team traced it to an issue with the gifski package tmap uses to generate gif animations. I found the same thing was happening for animations made with gifski without using tmap.
The workaround mtennekes suggested for now was to save the animation as an mp4 rather than gif, which uses the av package:
tmap_animation(conf_anim, filename = './www/conf_anim.mp4', delay=50)
Not a full fix, but hope it helps for your immediate application.
In case anyone else is struggling with this, I was able to create a workaround using tmap, gifski (which tmap uses to generate GIFs), dplyr, and the png library. This method is NOT fast and generating a GIF this way may take a long time. Another answer mentioned using an MP4 format instead of GIF - this method is far faster if that is an option for you.
Assuming you have an sf object named df_sf which you are trying to facet along facet_along_col_str and the first argument to tm_polygons is polygon_col_str, here is code that will help you generate a GIF:
library(tmap)
library(gifski)
library(png)
library(dplyr)
library(sf) # for st_drop_geometry()
generate_gif = function(sf_df, facet_along_col_str, polygon_col_str, gif_file, delay) {
  # render one frame first to learn the pixel dimensions tmap produces
  dims = get_dims(sf_df, facet_along_col_str, polygon_col_str)
  save_gif(get_maps(sf_df, facet_along_col_str, polygon_col_str),
           gif_file = gif_file, width = dims[2], height = dims[1], delay = delay)
}
get_dims = function(sf_df, facet_along, polygon_col) {
  # take the first facet value and save its map to a temporary PNG
  slice = st_drop_geometry(select(sf_df, !!as.symbol(facet_along)))[[1]][1]
  t = tm_shape(filter(sf_df, !!as.symbol(facet_along) == slice)) +
    tm_polygons(polygon_col)
  filename = tempfile(fileext = '.png')
  tmap_save(t, filename = filename)
  dims = dim(readPNG(filename)) # height, width, channels
  file.remove(filename)
  return(dims)
}
get_maps = function(sf_df, facet_along, polygon_col) {
  # print one map per facet value, in sorted order; save_gif() records each print as a frame
  slices = st_drop_geometry(select(sf_df, !!as.symbol(facet_along)))[[1]]
  slices = sort(unique(slices))
  for (i in seq_along(slices)) {
    t = tm_shape(filter(sf_df, !!as.symbol(facet_along) == slices[i])) +
      tm_polygons(polygon_col)
    print(t)
  }
}

if/else grepl "argument is of length zero"

I want to perform a set of operations (in R) on a number of data frames located within a list. In particular, for each one I create a "library" column, which is then used to determine which kind of filtering operation to perform. This is the actual code:
sampleList <- list(RNA1 = "data/not_processed/dedup.Bp1R4T2_S2.txt",
RNA2 = "data/not_processed/dedup.Bp1R4T3_S4.txt",
RNA3 = "data/not_processed/dedup.Bp1R5T2_S1.txt",
RNA4 = "data/not_processed/dedup.Bp1R5T3_S2.txt",
RNA5 = "data/not_processed/dedup.Bp1R14T5_S1.txt",
RNA6 = "data/not_processed/dedup.Bp1R14T6_S1.txt",
RNA7 = "data/not_processed/dedup.Bp1R14T6_S2.txt",
RNA8 = "data/not_processed/dedup.Bp1R14T7_S2.txt",
RNA9 = "data/not_processed/dedup.Bp1R14T8_S3.txt",
RNA10 = "data/not_processed/dedup.Bp1R14T9_S3.txt",
RNA11 = "data/not_processed/dedup.Bp1R14T9_S4.txt",
DNA1 = "data/not_processed/dedup.dna10_1_S4.txt",
DNA2 = "data/not_processed/dedup.dna10_2_S5.txt",
DNA3 = "data/not_processed/dedup.dna10_3_S6.txt",
DNA4 = "data/not_processed/dedup.dna50_1_S1.txt",
DNA5 = "data/not_processed/dedup.dna50_2_S2.txt",
DNA6 = "data/not_processed/dedup.dna50_3_S3.txt",
DNA7 = "data/not_processed/dedup.dna50_pcrcocktail_S7.txt")
batch <- lapply(names(sampleList),function(mysample){
aux <- read.table(sampleList[[mysample]], col.names=c(column1, column2, ..., ID, library, column4, etc...))
aux %>% mutate(library = mysample, R = Fw_ref + Rv_ref, A = Fw_alt + Rv_alt) %>% distinct(ID, .keep_all=T)
if (grepl("DNA", aux$library)){
aux %>% filter(aux$R>1 & aux$A>1)
} else {
aux %>% filter((aux$R+aux$A)>7 & aux$Fw_ref>=1 & aux$Rv_ref>=1 & aux$Fw_alt>=1 & aux$Rv_alt>=1)
}
aux
})
batch_file <- do.call(rbind, batch)
write.table(batch_file, "data/batch_file.txt", col.names = T, sep = "\t")
The possible values of the library column are DNA1 to DNA7, and RNA1 to 11. I tried also with "char" %in%, but it gives the same problem:
Error in if (grepl("DNA", aux$library)) { : argument is of length zero
It seems the if condition is not able to identify the value in library. However, when I applied the same if/else condition to the batch_file (unfiltered, obtained with this code minus the if/else part), it worked perfectly.
Many thanks in advance.
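No answer is recorded for this question. One plausible cause, assuming aux has no usable library column at the point of the if: the result of the mutate() pipeline is never assigned back to aux, so aux$library is NULL, grepl() returns a zero-length logical, and if() stops with "argument is of length zero". The filter() results are discarded for the same reason. The sketch below reproduces this in base R on made-up toy data (samples and its columns are stand-ins, not the real files), and tests the scalar sample name instead of the column.

```r
# Toy stand-ins for the real files (hypothetical data, not from the question)
samples <- list(
  DNA1 = data.frame(ID = 1:3, R = c(0, 2, 5), A = c(3, 2, 0)),
  RNA1 = data.frame(ID = 1:3, R = c(4, 1, 6), A = c(4, 1, 2))
)

# A missing column is NULL; grepl() on NULL gives logical(0), which if() rejects
aux <- samples$DNA1
length(grepl("DNA", aux$library))  # 0

batch <- lapply(names(samples), function(mysample) {
  aux <- samples[[mysample]]
  aux$library <- mysample          # assign, so the new column is actually kept
  if (grepl("DNA", mysample)) {    # test the single sample name, not a whole column
    aux <- aux[aux$R > 1 & aux$A > 1, ]
  } else {
    aux <- aux[(aux$R + aux$A) > 7, ]
  }
  aux                              # return the filtered data frame
})
batch_file <- do.call(rbind, batch)
```

With dplyr the same idea applies: write aux <- aux %>% mutate(...) and aux <- aux %>% filter(...), since the verbs return a new data frame rather than modifying aux in place.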

How to get the grouping right in R with Plotly

I am having trouble grouping my data in Plotly under R. To start with, I was using local data from a csv file, reading it with:
geogrid_data <- read.delim('geogrid.csv', row.names = NULL, stringsAsFactors = TRUE)
and the plotting went well, using the following:
library(plotly)
library(RColorBrewer)
x <- list(
title = 'Date'
)
p <- plotly::plot_ly(geogrid_data,
type = 'scatter',
x = ~ts_now,
y = ~absolute_v_sum,
text = paste('Table: ', geogrid_data$table_name,
'<br>Absolute_v_Sum: ', geogrid_data$absolute_v_sum),
hoverinfo = 'text',
mode = 'lines',
color = list(
color = colorRampPalette(RColorBrewer::brewer.pal(11,'Spectral'))(
length(unique(geogrid_data$table_name))
)
),
transforms = list(
list(
type = 'groupby',
groups = ~table_name
)
)
) %>% layout(showlegend = TRUE, xaxis = x)
Here is the output:
Then I was going to alter the data source to an Oracle database table, reading the data as follows, using the ROracle package:
# retrieve data into resultSet object
rs <- dbSendQuery(con, "SELECT * FROM GEOGRID_STATS")
# fetch records from the resultSet into a data.frame
geogrid_data <- fetch(rs)
# free resources occupied by resultSet
dbClearResult(rs)
dbUnloadDriver(drv)
# remove duplicates from dataframe (based on TABLE_NAME, TS_BEFORE, TS_NOW, NOW_SUM)
geogrid_data <- geogrid_data %>% distinct(TABLE_NAME, TS_BEFORE, TS_NOW, NOW_SUM, .keep_all = TRUE)
# alter date columns in place
geogrid_data$TS_BEFORE <- as.Date(geogrid_data$TS_BEFORE, format='%d-%m-%Y')
geogrid_data$TS_NOW <- as.Date(geogrid_data$TS_NOW, format='%d-%m-%Y')
and adjusting the plotting to:
p <- plotly::plot_ly(
type = 'scatter',
x = geogrid_data$TS_NOW,
y = geogrid_data$ABSOLUTE_V_SUM,
text = paste('Table: ', geogrid_data$TABLE_NAME,
'<br>Absolute_v_Sum: ', geogrid_data$ABSOLUTE_V_SUM,
'<br>Date: ', geogrid_data$TS_NOW),
hoverinfo = 'text',
mode = 'lines',
color = list(
color = colorRampPalette(RColorBrewer::brewer.pal(11,'Spectral'))(
length(unique(geogrid_data$TABLE_NAME))
)
),
transforms = list(
list(
type = 'groupby',
groups = geogrid_data$TABLE_NAME
)
)
) %>% layout(showlegend = TRUE, xaxis = x)
Unfortunately, this seems to lead to a problem with the grouping:
As you can see from the label text when hovering over the data point, the point represents data from NY_SKOV_PLANTEB_MW_POLY while the legend is set to show data from NY_BYGN_MW_POLY. Looking at other data points in this graph I found a wild mix of points of all sorts in this graph, some of them representing data of NY_BYGN_MW_POLY, most of them not.
Also the plotting with respect to the time line does not work any more, e.g. data are plotted with start on Dec. 11 - Dec. 10 - Dec. 10 - Dec. 12 - Dec. 20 - Dec. 17 - Dec. 16 - Dec. 15.
Where do I go wrong in handling the data, and what do I have to do to get it right?
Of course, one should look at the data... thanks Marco, after your question I did look at my data.
There are some points where I simply assumed things.
The reason why all data plotted fine with data from the csv file is simple. All information manually compiled in the csv file came from information in emails that have been ordered by date. Hence, I compiled the data in the csv file ordered by date and Plotly does not have any problems grouping the data by table_name.
After looking at my data I tidied up, keeping only the data I need to show in the plot and used dplyr to sort the data by time.
geogrid_data <- dplyr::arrange(geogrid_data, TS_NOW)
The data is sorted only by time, not by time and table name, because the grouping by table name is done by Plotly's groupby transform anyway.
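The parse-then-sort step can be sketched in base R as follows. The rows are invented for illustration (including the year) and deliberately out of date order; order() is used here as a base-R equivalent of dplyr::arrange().

```r
# Hypothetical rows, deliberately out of date order (not the real GEOGRID_STATS data)
geogrid_data <- data.frame(
  TABLE_NAME = c("NY_BYGN_MW_POLY", "NY_SKOV_PLANTEB_MW_POLY",
                 "NY_BYGN_MW_POLY", "NY_SKOV_PLANTEB_MW_POLY"),
  TS_NOW = c("11-12-2018", "10-12-2018", "10-12-2018", "12-12-2018"),
  ABSOLUTE_V_SUM = c(5, 7, 3, 9),
  stringsAsFactors = FALSE
)

# Parse the day-month-year strings into Date objects so they sort chronologically
geogrid_data$TS_NOW <- as.Date(geogrid_data$TS_NOW, format = "%d-%m-%Y")

# Base-R equivalent of dplyr::arrange(geogrid_data, TS_NOW)
geogrid_data <- geogrid_data[order(geogrid_data$TS_NOW), ]
```

Sorting before plotting matters because a line trace connects points in row order, so unsorted rows produce a line that jumps back and forth along the time axis.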

How to change chart height in hchart() function in R (highcharter package) without using pipe operator?

I built a Shiny app where I create some plots from hist() and density() objects, both saved in a list in an .RDS file by another script. So, in Shiny I only read the .RDS and make the plot.
Everything is working now, except that I cannot find how to change the height of the highchart plot using the hchart() function. The way my code is built, I cannot work with pipes "%>%", because I am using hchart inside a purrr::map() function.
To explain better I created a small example, that follows.
# Example of how the objects are structured
list <-
list(df1 = list(Sepal.Length = hist(iris$Sepal.Length, plot = FALSE)),
df2 = list(Sepal.Length = density(iris$Sepal.Length)))
# Example of a plot built with hchart function
list[['df2']]['Sepal.Length'] %>%
purrr::map(hchart, showInLegend = FALSE)
# Example of what does not work
list[['df2']]['Sepal.Length'] %>%
purrr::map(hchart, showInLegend = FALSE, height = 200)
Actually, I also would like to change more options of the chart, like colors, for example. But I am not finding a way with this solution I found.
Thanks in advance.
Wlademir.
I can see 2 main ways to do what you need (not sure why you can't use the pipe):
Option 1
Create a function to process every data and add the options inside that function:
get_hc <- function(d) {
hchart(d, showInLegend = FALSE) %>%
hc_size(height= 200) %>%
hc_title(text = "Purrr rocks")
}
Then:
list_of_charts <- list[['df2']]['Sepal.Length'] %>%
purrr::map(get_hc)
Option 2
You can call purrr::map successively:
list_of_charts <- list[['df2']]['Sepal.Length'] %>%
purrr::map(hchart, showInLegend = FALSE)
# change height
list_of_charts <- purrr::map(list_of_charts, hc_size, height = 200)
# change title
list_of_charts <- purrr::map(list_of_charts, hc_title, text = "Purrr rocks")
Or you can chain successive purrr::map calls with %>%:
list_of_charts <- list[['df2']]['Sepal.Length'] %>%
purrr::map(hchart, showInLegend = FALSE) %>%
purrr::map(hc_size, height = 200) %>%
purrr::map(hc_title, text = "Purrr rocks")

how to use dataframe name inside for loop to save different ggplot2 plots in R

I have a data frame (all.table) that I have subsetted into 3 different data frames named A1.table, B25.table, and C48.table:
all.table = read.table(file.path(input_file_name), header=T, sep = "\t")
A1.table = subset(all.table, ID == "A1")
B25.table = subset(all.table, ID == "B25")
C48.table = subset(all.table, ID == "C48")
For each graph type I want, I want to generate it based on all 4 tables
for (i in list(all.table, A1.table, B25.table, C48.table)){
ggplot(i, aes(x=Position, fill=Frequency)) + #other plot options
ggsave(file.path(full_output_path, "uniqueFileName.pdf"))
#additional plots
#additional saves
}
My problem is in the ggsave command: how to generate the 'uniqueFileName.pdf'. I would like to name the files in some form like all.table.graph1.pdf, all.table.graph2.pdf and A1.table.graph1.pdf, A1.table.graph2.pdf, etc.
My question is: how do I turn the name of the iterator i into a string, and append a '.graph1.pdf' string to it?
Coming from a Python background, this seems like it should be rather simple. I am not very versed in R (as is likely obvious from this question), and anything resembling an answer that I have found seems incredibly overcomplicated.
This is a workflow that uses the tidyverse suite of functions. iwalk is similar to lapply in base, but it requires a function that takes 2 arguments, and it automatically inputs the names of the list as the 2nd argument.
The short answer for what you want is paste0, which lets you combine strings.
library(tidyverse)
all.table %>%
  filter(ID %in% c("A1", "B25", "C48")) %>% # only needed if there are more IDs than the 3 explicitly listed
  split(., .$ID) %>% # creates the list of data frames
  c(list(all.table = all.table), .) %>% # adds "all.table" as a list element
  iwalk(function(df, label) {
    # build the plot first; add the remaining layers (...) to p here
    p <- ggplot(df, aes(x = Position, fill = Frequency))
    # then save it explicitly instead of adding ggsave() to the plot with `+`
    ggsave(file.path(full_output_path, paste0(label, ".graph1.pdf")), plot = p)
  })
Figured out a solution by looking for a python dictionary equivalent:
all.table = read.table(file.path(input_file_name), header=T, sep = "\t")
A1.table = subset(all.table, ID == "A1")
B25.table = subset(all.table, ID == "B25")
C48.table = subset(all.table, ID == "C48")
#Generate a named list of tables
list_of_tables = list(all = all.table, A1 = A1.table, B25 = B25.table, C48 = C48.table)
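for (i in seq_along(list_of_tables)) {
  # build the plot first (more options can be added to p here)
  p <- ggplot(list_of_tables[[i]], aes(x = Frequency, fill = Category))
  # names(list_of_tables)[i] is the label of the i-th table
  ggsave(file.path(full_output_path, paste0(names(list_of_tables)[i], ".graph1.pdf")), plot = p)
}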
I'm not sure if there is a downside to not using other libraries (ie tidyverse), but this seems like the simplest answer?
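The naming step on its own can be sketched without ggplot2. This is a minimal illustration, assuming a named list like the one above; the toy data frames, the tempdir() output path, and the file_names vector are stand-ins. Iterating over names() is the closest analogue to looping over a Python dictionary's keys.

```r
# Hypothetical stand-ins for the real tables
list_of_tables <- list(all = data.frame(x = 1), A1 = data.frame(x = 2),
                       B25 = data.frame(x = 3), C48 = data.frame(x = 4))
full_output_path <- tempdir()  # assumed output directory

file_names <- character(0)
for (label in names(list_of_tables)) {
  tbl <- list_of_tables[[label]]  # the data frame to plot for this label
  out_file <- file.path(full_output_path, paste0(label, ".graph1.pdf"))
  # a real loop would build the plot from tbl and save it to out_file here
  file_names <- c(file_names, out_file)
}
basename(file_names)
```

Looping over the names directly avoids index arithmetic, and the label is already a string that paste0() can combine with the suffix.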
