I've made a Sankey diagram with the R riverplot package (v0.5). The output looks OK at small size in RStudio, but when exported or zoomed in, the colours have dark outlines or gridlines.
I think it may be because the outlines of the shapes don't match the transparency I want to use for the fill.
I possibly need to find a way to get rid of the outlines altogether (rather than making them semi-transparent), as I think they're also the reason why flows with a value of zero still show up as thin lines.
My code is here:
#loading packages
library(readr)
library("riverplot", lib.loc="C:/Program Files/R/R-3.3.2/library")
library(RColorBrewer)
#loading data
Cambs_flows <- read_csv("~/RProjects/Cambs_flows4.csv")
#defining the edges
edges = rep(Cambs_flows, col.names = c("N1","N2","Value"))
edges <- data.frame(edges)
edges$ID <- 1:25
#defining the nodes
nodes <- data.frame(ID = c("Cambridge","S Cambs","Rest of E","Rest of UK","Abroad","to Cambridge","to S Cambs","to Rest of E","to Rest of UK","to Abroad"))
nodes$x = c(1,1,1,1,1,2,2,2,2,2)
nodes$y = c(1,2,3,4,5,1,2,3,4,5)
#picking colours
palette = paste0(brewer.pal(5, "Set1"), "90")
#plot styles
styles = lapply(nodes$y, function(n) {
list(col = palette[n], lty = 0, textcol = "black")
})
#matching nodes to names
names(styles) = nodes$ID
#defining the river
r <- makeRiver( nodes, edges,
node_labels = c("Cambridge","S Cambs","Rest of E","Rest of UK","Abroad","to Cambridge","to S Cambs","to Rest of E","to Rest of UK","to Abroad"),
node_styles = styles)
#Plotting
plot( r, plot_area = 0.9)
And my data is here:
dput(Cambs_flows)
structure(list(N1 = c("Cambridge", "Cambridge", "Cambridge",
"Cambridge", "Cambridge", "S Cambs", "S Cambs", "S Cambs", "S Cambs",
"S Cambs", "Rest of E", "Rest of E", "Rest of E", "Rest of E",
"Rest of E", "Rest of UK", "Rest of UK", "Rest of UK", "Rest of UK",
"Rest of UK", "Abroad", "Abroad", "Abroad", "Abroad", "Abroad"
), N2 = c("to Cambridge", "to S Cambs", "to Rest of E", "to Rest of UK",
"to Abroad", "to Cambridge", "to S Cambs", "to Rest of E", "to Rest of UK",
"to Abroad", "to Cambridge", "to S Cambs", "to Rest of E", "to Rest of UK",
"to Abroad", "to Cambridge", "to S Cambs", "to Rest of E", "to Rest of UK",
"to Abroad", "to Cambridge", "to S Cambs", "to Rest of E", "to Rest of UK",
"to Abroad"), Value = c(0L, 1616L, 2779L, 13500L, 5670L, 2593L,
0L, 2975L, 4742L, 1641L, 2555L, 3433L, 0L, 0L, 0L, 6981L, 3802L,
0L, 0L, 0L, 5670L, 1641L, 0L, 0L, 0L)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -25L), .Names = c("N1", "N2",
"Value"), spec = structure(list(cols = structure(list(N1 = structure(list(), class = c("collector_character",
"collector")), N2 = structure(list(), class = c("collector_character",
"collector")), Value = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("N1", "N2", "Value")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
The culprit is a line in riverplot::curveseg. We can hack this function to fix it, or there is a very simple workaround that does not require hacking the function at all. In fact, the simple solution is probably preferable in many cases, but I first explain how to hack the function, so we understand why the workaround also works. Scroll to the end of this answer if you only want the simple solution:
UPDATE: The change suggested below has now been implemented in riverplot version 0.6
To edit the function, you can use
trace(curveseg, edit=T)
Then find the line near the end of the function that reads
polygon(c(xx[i], xx[i + 1], xx[i + 1], xx[i]), c(yy[i],
yy[i + 1], yy[i + 1] + w, yy[i] + w), col = grad[i],
border = grad[i])
We can see here that the package authors chose not to pass the lty parameter to polygon (UPDATE: see this answer for an explanation of why the package author did it this way). Change this line by adding lty = 0 (or, if you prefer, border = NA) and it works as intended for the OP's case. (But note that this may not work well if you wish to render a PDF - see here.)
polygon(c(xx[i], xx[i + 1], xx[i + 1], xx[i]), c(yy[i],
yy[i + 1], yy[i + 1] + w, yy[i] + w), col = grad[i],
border = grad[i], lty=0)
As a side note, this also explains the somewhat odd behaviour reported in the comments that "if you run it twice, the second time the plot looks OK, although export it and the lines come back". When lty is not specified in a call to polygon, the default value it uses is lty = par("lty"). Initially, the default par("lty") is a solid line, but after running the riverplot function once, par("lty") gets set to 0 during a call to riverplot:::draw.nodes, thus suppressing the lines when riverplot is run a second time. But if you then try to export the image, opening a new device resets par("lty") to its default value.
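You can watch this happen in an interactive session (a quick sketch of the behaviour described above, using the r object from the question):
par("lty")                             # "solid" in a fresh device
plot(r, plot_area = 0.9)               # first plot: borders drawn with solid lines
par("lty")                             # now "blank", set inside riverplot:::draw.nodes
plot(r, plot_area = 0.9)               # second plot: the borders are suppressed
dev.copy2pdf(file = "sankey.pdf")      # a new device resets par("lty"), so the lines come back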
An alternative way to update the function with this edit is to use assignInNamespace to overwrite the package function with your own version. Like this:
curveseg.new = function (x0, x1, y0, y1, width = 1, nsteps = 50, col = "#ffcc0066",
grad = NULL, lty = 1, form = c("sin", "line"))
{
w <- width
if (!is.null(grad)) {
grad <- colorRampPaletteAlpha(grad)(nsteps)
}
else {
grad <- rep(col, nsteps)
}
form <- match.arg(form, c("sin", "line"))
if (form == "sin") {
xx <- seq(-pi/2, pi/2, length.out = nsteps)
yy <- y0 + (y1 - y0) * (sin(xx) + 1)/2
xx <- seq(x0, x1, length.out = nsteps)
}
if (form == "line") {
xx <- seq(x0, x1, length.out = nsteps)
yy <- seq(y0, y1, length.out = nsteps)
}
for (i in 1:(nsteps - 1)) {
polygon(c(xx[i], xx[i + 1], xx[i + 1], xx[i]),
c(yy[i], yy[i + 1], yy[i + 1] + w, yy[i] + w),
col = grad[i], border = grad[i], lty=0)
lines(c(xx[i], xx[i + 1]), c(yy[i], yy[i + 1]), lty = lty)
lines(c(xx[i], xx[i + 1]), c(yy[i] + w, yy[i + 1] + w), lty = lty)
}
}
assignInNamespace('curveseg', curveseg.new, ns = 'riverplot')
Now for the simple solution, which does not require changes to the function:
Just add the line par(lty=0) before you plot!!!
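In full, with the objects from the question:
par(lty = 0)               # make "blank" the default line type
plot(r, plot_area = 0.9)   # polygon borders now inherit lty = 0 and disappear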
Here is the author of the package. I am currently struggling to find a satisfactory solution to include in the next version of the package.
The problem is with how R renders PDFs as compared to bitmaps. In the original version of the package, I did indeed pass lty=0 to polygon() (you can still see it in the commented source code). However, polygons without borders look good only in png graphics. In the PDF output, thin white lines appear between the polygons. Take a look:
cc <- "#E41A1C90"
plot.new()
rect(0.2, 0.2, 0.4, 0.4, col=cc, border=NA)
rect(0.4, 0.2, 0.6, 0.4, col=cc, border=NA)
dev.copy2pdf(file="riverplot.pdf")
In X or on png, the output is correct. However, if rendered as PDF, you will see a thin white line between the rectangles:
When you render a riverplot graphic as PDF like the one above, this looks really bad:
I therefore forced the borders to be drawn, but forgot to check transparency. When no transparency is used, this looks OK -- the borders overlap with the polygons as well as with each other, but you cannot see it. The PDF is now acceptable. However, it messes up the figure if you use transparency.
EDIT:
I have now uploaded version 0.6 of riverplot to CRAN. Besides some new stuff (you can now add riverplot to any part of an existing drawing), by default it uses lty=0 again. However, there is now an option called "fix.pdf" which you can set to TRUE in order to draw the borders around the segments again.
Bottom line, and solutions for now:
Use riverplot 0.6
If you want to render a PDF, don't use transparency and use fix.pdf=TRUE (a usage sketch follows below)
If you want to use both transparency and PDF, help me solve the issue.
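For example (a rough sketch only; it assumes opaque node and edge colours, and that plot() passes fix.pdf through to riverplot()):
plot(r, plot_area = 0.9)                  # riverplot >= 0.6: borders off by default
plot(r, plot_area = 0.9, fix.pdf = TRUE)  # PDF output with opaque colours: redraw borders to hide the seams
dev.copy2pdf(file = "sankey.pdf")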
Following a former question I opened a few weeks ago:
Slope Chart - ggplot2
I face another issue, concerning the numeric values reported in the graph. Even when I specify the decimal digits I need (exactly 3) with either of these two commands:
y = round(y, digits = 3) at the end of the code
or
options(digits=3) at the beginning of the whole code
the graphical output doesn't give me the desired number of digits; the problem only concerns the trailing zeros. In the graph I wanted to have 0.800 (not 0.8) and 0.940 (not 0.94). It looks like R removes the trailing zeros. Below is the graphical output from R; I circled in red the numbers I intended to change.
Below is the whole code:
library(dplyr)
library(ggplot2)
#options(digits=3)
theme_set(theme_classic())
#### Data
df <- structure(list(group = c("Ups", "Ups", "Ups", "Ups", "Ups"),
yshift = c(0, 0, 0, 0, 0), x = structure(1:5, .Label = c("1 day",
"2 days", "3 days", "5 days", "7 days"), class = "factor"),
y = c(0.108, 0.8, 0.94, 1.511, 1.905), ypos = c(0.10754145,
0.8, 0.94, 1.5111111, 1.90544651164516)), row.names = c(1L,
3L, 5L, 7L, 9L), class = "data.frame")
# Define functions. Source: https://github.com/jkeirstead/r-slopegraph
plot_slopegraph <- function(df) {
ylabs <- subset(df, x==head(x,1))$group
yvals <- subset(df, x==head(x,1))$ypos
fontSize <- 3
gg <- ggplot(df,aes(x=x,y=ypos)) +
geom_line(aes(group=group),colour="grey80") +
geom_point(colour="white",size=8) +
geom_text(aes(label=y), size=fontSize, family="American Typewriter") +
scale_y_continuous(name="", breaks=yvals, labels=ylabs)
return(gg)
}
## Plot
plot_slopegraph(df) + labs(title="Monomer content after days of heating")
Am I making any mistake or missing something? Is there any other way to force the trailing zeros?
Thank you in advance for any reply or comment.
I like the scales package functions for things like this (though you could certainly use formatC or sprintf instead).
I've modified plot_slopegraph to use label = scales::label_number(accuracy = 0.001)(y) in the geom_text():
plot_slopegraph <- function(df) {
ylabs <- subset(df, x==head(x,1))$group
yvals <- subset(df, x==head(x,1))$ypos
fontSize <- 3
gg <- ggplot(df,aes(x=x,y=ypos)) +
geom_line(aes(group=group),colour="grey80") +
geom_point(colour="white",size=8) +
geom_text(aes(label=scales::label_number(accuracy = 0.001)(y)), size=fontSize, family="American Typewriter") +
scale_y_continuous(name="", breaks=yvals, labels=ylabs)
return(gg)
}
plot_slopegraph(df)
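If you'd rather avoid the scales dependency, sprintf or formatC (mentioned above) give the same fixed number of decimals, for example:
sprintf("%.3f", c(0.108, 0.8, 0.94))                    # "0.108" "0.800" "0.940"
formatC(c(0.108, 0.8, 0.94), format = "f", digits = 3)  # same result
so label = sprintf("%.3f", y) inside aes() would work as well.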
Complete beginner at R here trying to perform nonmetric multidimensional scaling on a 95x95 matrix of similarities where 8 corresponds to very similar and 1 corresponds to very dissimilar. I also have an additional column (96th) signifying type and ranging from 0 to 1.
First I load the data:
dsimilarity <- read.table("d95x95matrix.txt",
header = T,
row.names = c("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10", "Y11", "Y12", "Y13", "Y14", "Y15", "Y16", "Y17", "Y18", "Y19", "Y20",
"Y21", "Y22", "Y23", "Y24", "Y25", "Y26", "Y27", "Y28", "Y29", "Y30", "Y31", "Y32", "Y33", "Y34", "Y35", "Y36", "Y37", "Y38", "Y39", "Y40",
"Y41", "Y42", "Y43", "Y44", "Y45", "Y46", "Y47", "Y48", "Y49", "Y50", "Y51", "Y52", "Y53", "Y54", "Y55", "Y56", "Y57", "Y58", "Y59", "Y60",
"Y61", "Y62", "Y63", "Y64", "Y65", "Y66", "Y67", "Y68", "Y69", "Y70", "Y71", "Y72", "Y73", "Y74", "Y75", "Y76", "Y77", "Y78", "Y79", "Y80",
"Y81", "Y82", "Y83", "Y84", "Y85", "Y86", "Y87", "Y88", "Y89", "Y90", "Y91", "Y92", "Y93", "Y94", "Y95"))
I convert the matrix of similarities into a matrix of dissimilarities, and exclude the 96th column:
ddissimilarity <- dsimilarity; ddissimilarity[1:95, 1:95] = 8 - ddissimilarity[1:95, 1:95]
Then I perform the nonmetric MDS using the smacofSym function from the smacof package:
ordinal.mds.results <- smacofSym(ddissimilarity[1:95, 1:95],
type = c("ordinal"),
ndim = 2,
ties = "primary",
verbose = T )
I create a new data frame (I'm following a guide and don't really know what's going on here):
mds.config <- as.data.frame(ordinal.mds.results$conf)
All well and good thus far (to my knowledge). At this point I create an xyplot of the data and get a good result using this code:
xyplot(D2 ~ D1, data = mds.config,
aspect = 1,
main = "Figure 1. MDS solution",
panel = function (x, y) {
panel.xyplot(x, y, col = "black")
panel.text(x, y-.03, labels = rownames(mds.config),
cex = .75)
},
xlab = "MDS Axis 1",
ylab = "MDS Axis 2",
xlim = c(-1.1, 1.1),
ylim = c(-1.1, 1.1))
Now I want to create a figure that incorporates the type in the 96th column and assigns different colours to observations of the two different types. However, I can't quite figure out how to do so. Does anyone have any ideas of where I'm going wrong here?
xyplot(D2 ~ D1, data = mds.config ~ ddissimilarity[96:96, 96:96],
aspect = 1,
main = "Figure 1. MDS solution",
panel = function (x, y) {
panel.xyplot(x, y, col = "black")
panel.text(x, y-.03, labels = rownames(mds.config),
cex = .75)
},
xlab = "MDS Axis 1",
ylab = "MDS Axis 2",
xlim = c(-1.1, 1.1),
ylim = c(-1.1, 1.1),
group = "Type")
So I use the following function for plotting most of the data I have to plot. I created it from different chunks of code that I found online. So far I have never encountered any issue with it.
Here is the plotting function first.
library(ggplot2)
library(reshape2)
#' Plot a given mean with error bars
#' @param resultTable The table with all the results to plot
#' @param techniques The names of the techniques in the form of a list/vector
#' @param nbTechs The number of given techniques
#' @param ymin The minimum value for y
#' @param ymax The maximum value for y
#' @param xAxisLabel The label for the x (vertical) axis
#' @param yAxisLabel The label for the y (horizontal) axis
#' @return
#'
barChartTime <- function(resultTable, techniques, nbTechs = -1, ymin, ymax, xAxisLabel = "I am the X axis", yAxisLabel = "I am the Y Label"){
#tr <- t(resultTable)
if(nbTechs <= 0){
stop('Please give a positive number of Techniques, nbTechs');
}
tr <- as.data.frame(resultTable)
nbTechs <- nbTechs - 1 ; # seq will generate nb+1
#now need to calculate one number for the width of the interval
tr$CI2 <- tr$upperBound_CI - tr$mean_time
tr$CI1 <- tr$mean_time - tr$lowerBound_CI
#add a technique column
tr$technique <- factor(seq.int(0, nbTechs, 1));
breaks <- c(as.character(tr$technique));
print(tr)
g <- ggplot(tr, aes(x=technique, y=mean_time)) +
geom_bar(stat="identity",fill = I("#CCCCCC")) +
geom_errorbar(aes(ymin=mean_time-CI1, ymax=mean_time+CI2),
width=0, # Width of the error bars
size = 1.1
) +
#labs(title="Overall time per technique") +
labs(x = xAxisLabel, y = yAxisLabel) +
scale_y_continuous(limits = c(ymin,ymax)) +
scale_x_discrete(name="",breaks,techniques)+
coord_flip() +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),axis.title=element_text(size = rel(1.2), colour = "black"),axis.text=element_text(size = rel(1.2), colour = "black"),panel.grid.major = element_line(colour = "#DDDDDD"),panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())+
geom_point(size=4, colour="black") # dots
print(g)
}
Now, here is the data that I am using (a simplified version that reproduces the error):
EucliP,AngularP,EucliR,AngularR,EucliSp,AngularSp,EucliSl,AngularSl
31.6536,30.9863,64.394,92.7838,223.478,117.555,44.7374,25.4852
12.3592,40.7639,70.2508,176.55,10.3927,145.909,143.025,126.667
14.572,8.98445,113.599,150.551,47.1545,54.3019,10.7038,47.7004
41.7957,20.9542,55.1732,67.1647,52.364,41.3655,62.7036,75.65
135.868,83.7135,14.0262,69.7183,44.987,35.9599,19.5183,66.0365
33.5359,17.2129,6.95909,47.518,224.561,91.4999,67.1279,31.4079
25.7285,33.6705,17.4725,58.45,43.1709,113.847,28.9496,20.0574
48.4742,127.588,75.0804,89.1176,31.4494,27.9548,38.4563,126.248
31.9831,80.0161,19.9592,145.891,55.2789,142.738,94.5126,136.099
17.4044,52.3866,49.9976,150.891,104.936,77.2849,232.23,35.6963
153.359,151.897,41.8876,46.3893,79.5218,75.2011,68.9786,91.8972
And here is the code that I am using:
data = read.table("*Path_to_file*.csv", header=T, sep=",")
data$EucliPLog = (data$EucliP) #Before here I used to use a log transform that I tried to remove for some testing
data$EucliRLog = (data$EucliR) #Same thing
data$EucliSpLog = (data$EucliSp) #Same thing
data$EucliSlLog = (data$EucliSl) #Same thing
a1 = t.test(data$EucliPLog)$conf.int[1]
a2 = t.test(data$EucliPLog)$conf.int[2]
b1 = t.test(data$EucliRLog)$conf.int[1]
b2 = t.test(data$EucliRLog)$conf.int[2]
c1 = t.test(data$EucliSpLog)$conf.int[1]
c2 = t.test(data$EucliSpLog)$conf.int[2]
d1 = t.test(data$EucliSlLog)$conf.int[1]
d2 = t.test(data$EucliSlLog)$conf.int[2]
analysisData = c()
analysisData$ratio = c("Sl","Sp","R","P")
analysisData$pointEstimate = c(exp(mean(data$EucliSlLog)),exp(mean(data$EucliSpLog)),exp(mean(data$EucliRLog)),exp(mean(data$EucliPLog)))
analysisData$ci.max = c(exp(d2), exp(c2),exp(b2), exp(a2))
analysisData$ci.min = c(exp(d1), exp(c1),exp(b1), exp(a1))
datatoprint <- data.frame(factor(analysisData$ratio),analysisData$pointEstimate, analysisData$ci.max, analysisData$ci.min)
colnames(datatoprint) <- c("technique", "mean_time", "lowerBound_CI", "upperBound_CI ")
barChartTime(datatoprint,analysisData$ratio ,nbTechs = 4, ymin = 0, ymax = 90, "", "Title")
So if I use the log() that I mention in the comments of the last piece of code, everything works fine and I get my plots displayed. However, when I try removing the log I get the famous
Error in matrix(value, n, p) :
'data' must be of a vector type, was 'NULL'
I have tried looking for NULL values in my data but there are none, and I do not know where to look next. I would love to get some help with that.
Thanks in advance.
Edit: Here is the result of dput on datatoprint:
structure(list(technique = structure(c(3L, 4L, 2L, 1L), .Label = c("P",
"R", "Sl", "Sp"), class = "factor"), mean_time = c(1.04016257618464e+32,
1.64430609815788e+36, 7.5457775364611e+20, 3.85267453902928e+21
), lowerBound_CI = c(6.64977706609883e+50, 5.00358136618364e+57,
2.03872433045407e+30, 4.93863589006376e+35), `upperBound_CI ` = c(16270292584857.9,
540361462434140, 279286207454.44, 30055062.6409769)), .Names = c("technique",
"mean_time", "lowerBound_CI", "upperBound_CI "), row.names = c(NA,
-4L), class = "data.frame")
And the dput on analysisData:
structure(list(ratio = c("Sl", "Sp", "R", "P"), pointEstimate = c(1.04016257618464e+32,
1.64430609815788e+36, 7.5457775364611e+20, 3.85267453902928e+21
), ci.max = c(6.64977706609883e+50, 5.00358136618364e+57, 2.03872433045407e+30,
4.93863589006376e+35), ci.min = c(16270292584857.9, 540361462434140,
279286207454.44, 30055062.6409769)), .Names = c("ratio", "pointEstimate",
"ci.max", "ci.min"))
Without the log I don't have anything on display, because the values are above 10^40, whereas with the log they are below the upper limit (90).
I don't get the error you get, though.
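A quick check with the dput output above shows why (scale_y_continuous(limits = c(ymin, ymax)) silently drops observations outside the limits):
range(datatoprint$mean_time)
# roughly 7.5e+20 to 1.6e+36 without the log transform -- far above ymax = 90,
# so every bar and error bar is removed before plotting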
I'm trying to do a multivariate k-means cluster plot in R. I have 3 variables and 10 columns of data, plus the context (like Species for iris), so 11 variables. And my x is PeruReady, obviously.
Following a tutorial online I got this far:
PeruReady.km <- kmeans(PeruReady[, -1], 3, iter.max=1000)
tbl <- table(PeruReady[, 1], PeruReady.km$cluster)
PeruReady.dist <- dist(PeruReady[, -1])
PeruReady.mds <- cmdscale(PeruReady.dist)
c.chars <- c("*", "o", "+")[as.integer(PeruReady$Context)]
a.cols <- rainbow(3)[PeruReady$cluster]
plot(PeruReady.mds, col=a.cols, pch=c.chars, xlab="X", ylab="Y")
But my plot is coming up completely empty. What am I doing wrong?
With a small data set (demand.sm), your code worked just fine. Have you normalized all your numeric columns?
dput(demand.sm)
structure(list(Demand = c("rify la", "p quasi", "rify LD", "ventive",
"ekeeper", " de min", " risk g", " approv", "uest te", "", "al trai",
"cation", "ely inv", "rge tim", "get of ", "vey pro", "ent ONA",
"ble sel", "cipline", "tus rep", "ced-ran"), normalized = structure(c(-1.15780226157481,
-0.319393727330983, -1.15780226157481, -1.15780226157481, -0.319393727330983,
-0.319393727330983, -0.319393727330983, -0.319393727330983, 0.519014806912847,
0.519014806912847, 0.519014806912847, -0.738597994452898, -0.738597994452898,
2.19583187540051, 2.19583187540051, -1.15780226157481, -0.319393727330983,
-0.319393727330983, 0.519014806912847, 1.35742334115668, 0.519014806912847
), .Dim = c(21L, 1L), "`scaled:center`" = 3.76190476190476, "`scaled:scale`" = 2.38547190100328)), .Names = c("Demand",
"normalized"), row.names = c(NA, -21L), class = "data.frame")
clusters <- kmeans(demand.sm[ , "normalized"], 5)
demand.dist <- dist(demand.sm[ , "normalized"])
demand.mds <- cmdscale(demand.dist) # multidimensional scaling of data matrix, aka principal coordinates analysis
c.chars <- c("*", "o", "+")[as.integer(clusters$Context)]
a.cols <- rainbow(3)[clusters$cluster]
plot(demand.mds, col=a.cols, pch=c.chars, xlab="X", ylab="Y")
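If normalization turns out to be the issue, here is a minimal sketch (assuming PeruReady keeps the context in column 1 and numeric data in the remaining columns, as in your code):
PeruReady.scaled <- PeruReady
PeruReady.scaled[, -1] <- scale(PeruReady[, -1])   # centre and scale each numeric column
PeruReady.km <- kmeans(PeruReady.scaled[, -1], 3, iter.max = 1000)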
I posted a recent question about controlling x-y plots as two normal curves and have since realised I was making things too complicated. I have since managed to plot the points as ellipses, but this slightly overestimates the error, which ideally would be plotted as rhombuses.
The code I have to date is:
library(plotrix)   # draw.ellipse comes from the plotrix package
plot(c(-5,10), c(-5,5), xlab = expression(Age), ylab = expression(value), type="n")
draw.ellipse(Age, value, a=Age_error, b=value_error, col="grey70")
Which plots:
Is there some way to replace the ellipses with rhombuses whose height is controlled by 2 × value_error and width by 2 × Age_error?
My data frame is below
structure(list(Age = c(1L, 2L, 4L), value = c(3, -2, 0.01), Age_error = c(2,
1.4, 3), value_error = c(0.5, 1, 2.1)), .Names = c("Age", "value",
"Age_error", "value_error"), class = "data.frame", row.names = c(NA,
-3L))
Many thanks
You can use the my.symbols and ms.polygon functions in the TeachingDemos package to draw the rhombuses:
library(TeachingDemos)
plot(c(-5,10), c(-5,5), xlab = expression(Age), ylab = expression(value),
type="n")
my.symbols( Age, value, ms.polygon, n=4, xsize=2*Age_error,
ysize=2*value_error, linesfun=polygon, col='grey' )
Leave out linesfun and col if you don't want the rhombuses filled.
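For example:
my.symbols(Age, value, ms.polygon, n = 4, xsize = 2*Age_error, ysize = 2*value_error)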