VennDiagram with shares/relative numbers

Using R and VennDiagram 1.6.9, I want to draw a triple Venn diagram and display shares rather than absolute values. However, the internal consistency check can't deal with rounding errors:
draw.triple.venn(area1=0.89, area2=round(0.481, 2), area3=0.5,
                 n12=0.46, n23=0.4, n13=0.47, n123=0.38)
The error due to rounding is extremely small:
> round(0.48, 2)-0.46-0.4+0.38
[1] -5.551115e-17
Using the complete number, i.e. round(0.481, 3), it all works fine, but I don't want that (my real data has many more digits). Is there a way to override the internal consistency checks? Or is there maybe a better way to display shares?

Firstly, note that the draw.triple.venn function has the parameters print.mode and sigdigs, which might be helpful to you (see the sketch at the end of this answer). If those are not enough, you may try hacking the output by simply replacing the values of all labels with values formatted to your taste. Here is an example:
grid.newpage()
draw.triple.venn(area1=0.89, area2=0.481, area3=0.5,
                 n12=0.46, n23=0.4, n13=0.47, n123=0.38)
grobjs = grid.ls()  # List of all objects on the diagram
for (o in grobjs$name) {
    # Pick out all text labels
    if (grepl(".text.", o, fixed=TRUE)) {
        # Re-format their value
        old_value = as.numeric(grid.get(o)$label)
        new_value = sprintf("%0.2f", old_value)
        if (new_value != "NA") {
            grid.edit(o, label=new_value, redraw=FALSE)
        }
    }
}
grid.refresh()
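For completeness, here is a minimal sketch of the print.mode/sigdigs route mentioned above; the parameter values are illustrative assumptions, so check how the percentages are computed against your own data:
library(VennDiagram)
grid.newpage()
# Display the labels as percentages with 2 significant digits (illustrative values)
draw.triple.venn(area1=0.89, area2=0.481, area3=0.5,
                 n12=0.46, n23=0.4, n13=0.47, n123=0.38,
                 print.mode="percent", sigdigs=2)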


R ggplot2 stat_density_2d has a "..level.." feature, what other features are available?

I use ..density.. from time to time, and it's great. There are lots of examples of it in the ggplot2 book, as well as of ..count... Looking through the stat_density documentation, I learned about ..scaled... Seeing someone use ..n.. here on StackOverflow, I found out about that one too. Now I just wonder what else I'm missing.
Search engines seem to ignore the .s in search strings like "..n.. ggplot2", even if I escape them. Is there a general term for these variables? Are there more? Where can I find documentation on them?
Here are all of the ..*.. options mentioned in the ggplot2 help files (or at least those help files that can be brought up by typing ?"<func>", where "<func>" refers to one of the functions exported by ggplot2).
library(ggplot2)
## Read all of the ggplot2 help files and convert them to character vectors
ex <- unlist(lapply(ls("package:ggplot2"), function(g) {
    p = utils:::index.search(g, find.package(), TRUE)
    capture.output(tools::Rd2txt(utils:::.getHelpFile(p)))
}))
## Extract all mentions of "..*.." from the character vectors
pat <- "\\.\\.\\w*\\.\\."
m <- gregexpr(pat, ex)
unique(unlist(regmatches(ex,m)))
# [1] "..density.." "..count.." "..level.." "..scaled.." "..quantile.."
# [6] "..n.."
Or, to find out which help files document which ..*.., run this:
library(ggplot2)
ex <- sapply(ls("package:ggplot2"), function(g) {
    p = utils:::index.search(g, find.package(), TRUE)
    capture.output(tools::Rd2txt(utils:::.getHelpFile(p)))
}, simplify=FALSE, USE.NAMES=TRUE)
res <- lapply(ex, function(X) {
    m <- gregexpr("\\.\\.\\w*\\.\\.", X)
    unique(unlist(regmatches(X, m)))
})
res[sapply(res, length) > 0]
As of ggplot2 version 3.3.0 (2020-03-05), from the changelog:
The evaluation time of aesthetics can now be controlled to a finer degree. after_stat() supersedes the use of stat() and ..var..-notation, and is joined by after_scale() to allow for mapping to scaled aesthetic values. Remapping of the same aesthetic is now supported with stage(), so you can map a data variable to a stat aesthetic, and remap the same aesthetic to something else after statistical transformation.
So the ..var.. variables are moot, and you should use after_stat() instead.
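As a quick, minimal illustration of the replacement notation (using the built-in mpg data set):
library(ggplot2)
# Superseded ..var.. notation:
ggplot(mpg, aes(displ)) + geom_histogram(aes(y = ..density..))
# Equivalent after_stat() notation (ggplot2 >= 3.3.0):
ggplot(mpg, aes(displ)) + geom_histogram(aes(y = after_stat(density)))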

I want to use heatmap in my code but I am getting an error

heatmap(Web_Data$Timeinpage)
str(Web_Data)
heat = c(t(as.matrix(Web_Data$Timeinpage[,-1])))
heatmap(heat)
A few items to note here:
1) By including the c() operator in c(t(as.matrix(Web_Data$Timeinpage[,-1]))) you are creating a single vector, not a matrix. You can see this by running is.matrix(c(t(as.matrix(Web_Data$Timeinpage[,-1])))). heatmap (I believe) checks for a matrix because...
2) You need to provide a matrix with at least two rows and two columns for this function to work. Currently, you are giving it only one vector - time. You will need to provide some other feature of interest for it to work correctly, such as Continent.
3) If you intend to plot ONLY one field, you may consider doing as suggested here and using the image() function. (I included an example below.)
4) I find the heatmap function somewhat dated in look. You may want to consider more popular alternatives, such as ggplot2's geom_tile (see here, and the sketch at the end of this answer).
Below is example code that should produce an output:
# fake data
Web_Data <- data.frame("Timeinpage" = c(123,321,432,555,332,1221,2,43,0,NA,10,44),
                       OTHER = rep(c("good", "bad"), 6))
# a matrix with TWO rows from my data frame. Notice the c() is removed and I am
# not transposing. Also removing the , from [,-1], since Timeinpage is a vector,
# not a matrix. OTHER is recoded to numbers because heatmap() needs a numeric matrix.
heat <- matrix(c(Web_Data$Timeinpage[-1], as.numeric(factor(Web_Data$OTHER[-1]))), 2, 11)
# output
heatmap(heat)
# one column
heat2 <- as.matrix(sort(Web_Data$Timeinpage[-1])) # sorting as well; sort() drops the NA
# output
image(heat2)
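And here is the geom_tile sketch mentioned in point 4; it is a minimal illustration only, reusing the fake Web_Data from above to draw Timeinpage as a one-row heat strip:
library(ggplot2)
# long format: one tile per observation
df <- data.frame(page = seq_along(Web_Data$Timeinpage),
                 time = Web_Data$Timeinpage)
ggplot(df, aes(x = page, y = 1, fill = time)) +
    geom_tile()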

R Legend Variable Substitution

I always want my R code to be as flexible as possible; at present I have three (potentially more) curves to compare based on a parameter delta, but I don't want to hardcode the values of delta anywhere (or even how many values there are, if I can avoid it).
I am trying to make a legend that involves both Greek letters and variable substitution for the delta values, so each legend entry has a form like 'delta = 0.01', where delta is the Greek letter and 0.01 is determined by a variable. I have tried many different combinations of paste, substitute, bquote and expression, but they always end up with some verbatim code left over in the finished legend, OR fail to put 'delta' into symbolic form.
delta <- c(0.01, 0.05, 0.1)
plot(type="n", x=1:5, y=1:5) # the curves themselves are irrelevant
legend_text <- vector(length=length(delta)) # I don't think lists work either
for (i in 1:length(delta)) {
    legend_text[i] <- substitute(paste(delta, "=", D), list(D=delta[i]))
}
legend(x="topleft", fill=rainbow(length(delta)), legend=legend_text)
Since legend=substitute(paste(delta,"=",D), list(D=delta[1])) works for a single entry, I've also tried a 'semi-hardcoded' version, fixing the length of delta:
legend(x="topleft", fill=rainbow(length(delta)),
legend=c(substitute(paste(delta,"=",A), list(A=delta[1])),
substitute(paste(delta,"=",B), list(B=delta[2])),
substitute(paste(delta,"=",C), list(C=delta[3])) )
)
but this has the same issues as before.
Is there a way I can do this, or do I need to change the code by hand with each update of delta?
Try using lapply() with as.expression() to generate your legend labels, and use bquote() to create the individual expressions:
legend_text <- as.expression(lapply(delta, function(d) {
    bquote(delta == .(d))
}))
Note that with plotmath you need == to get an equals sign. Also no need for paste() since nothing is really a string here.
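Putting it together with the plot and legend calls from the question:
delta <- c(0.01, 0.05, 0.1)
plot(type="n", x=1:5, y=1:5)
# build one plotmath expression per delta value
legend_text <- as.expression(lapply(delta, function(d) bquote(delta == .(d))))
legend(x="topleft", fill=rainbow(length(delta)), legend=legend_text)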

R Sweave: digits number in xtable of prop.table

I'm making an xtableFtable in R Sweave and can't find a way to suppress the digits with this code. What am I doing wrong? I've read that this can happen if your values aren't numeric but factor or character, but is prop.table making them non-numeric? I'm lost...
library(xtable)
a <- ftable(prop.table(table(mtcars$mpg, mtcars$hp), margin=2)*100)
b <- xtableFtable(a, method = "compact", digits = 0)
print.xtableFtable(b, rotate.colnames = TRUE)
I've already tried with digits=c(0,0,0,0...) too.
You could use options(digits) to control how many digits will print. Try something like options(digits = 4) as the first line of your code (change 4 to whatever value you want between 1 and 22). See ?options for more information.
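For instance:
options(digits = 4)  # affects all subsequent printing in this session
print(pi)
# [1] 3.142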
Or round the values before printing
a = round(ftable(prop.table(table(mtcars$mpg, mtcars$hp), margin=2)*100), 2)
b = xtableFtable(a, method = "compact")
print.xtableFtable(b, rotate.colnames = TRUE)
The "digits" argument to xtableFtable seems to be unimplemented (as of my version, which is 1.8.3), since after playing around with it for half an hour nothing seems to make any difference.
There's a hint to this effect in the function documentation:
It is not recommended that users change the values of align, digits or display. First of all, alternative values have not been tested. Secondly, it is most likely that to determine appropriate values for these arguments, users will have to investigate the code for xtableFtable and/or print.xtableFtable.
It's probably just carried over from the xtable function (on which xtableFtable is surely based) as a TODO which the maintainer hasn't gotten around to yet.

Fuzzy merging in R - seeking help to improve my code

Inspired by the experimental fuzzy_join function from the statar package, I wrote a function myself which combines exact and fuzzy (string-distance based) matching. The merging job I have to do is quite big (resulting in multiple string distance matrices with a little less than one billion cells), and I had the impression that the fuzzy_join function is not written very efficiently with regard to memory usage, and that the parallelization is implemented in a weird manner (what is parallelized is the computation of the string distance matrices, if there are multiple fuzzy variables, and not the computation of the string distances itself). As in the fuzzy_join function, the idea is to match on exact variables where possible (to keep the matrices smaller) and then to proceed to fuzzy matching within these exactly matched groups. I think the function is self-explanatory. I am posting it here because I would like some feedback to improve it, and because I guess I am not the only one trying to do this kind of thing in R (although I admit that Python, SQL and the like would probably be more efficient in this context, but one should stick with the tools one feels most comfortable with, and doing the data cleaning and preparation in the same language is nice for reproducibility).
merge.fuzzy = function(a, b, .exact, .fuzzy, .weights, .method, .ncores) {
    require(stringdist)
    require(matrixStats)
    require(parallel)
    if (length(.fuzzy) != length(.weights)) {
        stop("fuzzy and weights must have the same length")
    }
    if (!any(class(a) == "data.table")) {
        stop("'a' must be of class data.table")
    }
    if (!any(class(b) == "data.table")) {
        stop("'b' must be of class data.table")
    }
    # convert everything to lower case
    a[, c(.fuzzy) := lapply(.SD, tolower), .SDcols = .fuzzy]
    b[, c(.fuzzy) := lapply(.SD, tolower), .SDcols = .fuzzy]
    a[, c(.exact) := lapply(.SD, tolower), .SDcols = .exact]
    b[, c(.exact) := lapply(.SD, tolower), .SDcols = .exact]
    # create row ids
    a[, "id.a" := as.numeric(.I), by = c(.exact, .fuzzy)]
    b[, "id.b" := as.numeric(.I), by = c(.exact, .fuzzy)]
    # one group id per combination of the exact variables
    c <- unique(rbind(a[, .exact, with = FALSE], b[, .exact, with = FALSE]))
    c[, "exa.id" := .GRP, by = .exact]
    a <- merge(a, c, by = .exact, all = FALSE)
    b <- merge(b, c, by = .exact, all = FALSE)
    ##############
    stringdi <- function(a, b, .weights, .by, .method, .ncores) {
        sdm <- list()
        if (is.null(.weights)) {.weights <- rep(1, length(.by))}
        if (nrow(a) < nrow(b)) {
            for (i in 1:length(.by)) {
                sdm[[i]] <- stringdistmatrix(a[[.by[i]]], b[[.by[i]]],
                                             method = .method, ncores = .ncores, useNames = TRUE)
            }
        } else { # if a is not shorter, switch sides; this enhances parallelization speed
            for (i in 1:length(.by)) {
                sdm[[i]] <- stringdistmatrix(b[[.by[i]]], a[[.by[i]]],
                                             method = .method, ncores = .ncores, useNames = FALSE)
            }
        }
        rsdm = nrow(sdm[[1]])
        csdm = ncol(sdm[[1]])
        sdm = matrix(unlist(sdm), ncol = length(.by)) # one column per fuzzy variable
        # weighted mean of the distances, ignoring missing values
        sdm = rowSums(sdm * .weights, na.rm = TRUE) / ((0 + !is.na(sdm)) %*% .weights)
        sdm = matrix(sdm, nrow = rsdm, ncol = csdm)
        if (nrow(a) >= nrow(b)) {
            sdm = t(sdm) # undo the side switch so rows correspond to a again
        }
        # use ids as row/column names
        rownames(sdm) <- a$id.a
        colnames(sdm) <- b$id.b
        mid <- max.col(-sdm, ties.method = "first")
        mid <- matrix(c(1:nrow(sdm), mid), ncol = 2)
        bestdis <- sdm[mid]
        res <- data.table(as.numeric(rownames(sdm)), as.numeric(colnames(sdm)[mid[, 2]]), bestdis)
        setnames(res, c("id.a", "id.b", "dist"))
        res
    }
    setkey(b, exa.id)
    distances = a[, stringdi(.SD, b[J(.BY[[1]])], .weights = .weights, .by = .fuzzy,
                             .method = .method, .ncores = .ncores), by = exa.id]
    a = merge(a, distances, by = c("exa.id", "id.a"))
    res = merge(a, b, by = c("exa.id", "id.b"))
    res
}
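To make the interface concrete, here is a hypothetical toy call; the column names and values are invented for illustration, and it assumes a stringdist version that still accepts the ncores argument:
library(data.table)
# toy data: match exactly on country, then fuzzily on name
a <- data.table(country = c("USA", "France"),
                name = c("John Smith", "Jean Dupont"))
b <- data.table(country = c("USA", "USA", "France", "France"),
                name = c("Jon Smith", "Jane Doe", "Jean Dupond", "Marie Curie"))
res <- merge.fuzzy(a, b, .exact = "country", .fuzzy = "name",
                   .weights = 1, .method = "osa", .ncores = 1)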
The following points would be interesting:
I am not quite sure how to code multiple exact matching variables in the data.table style I used above (which I believe is the fastest option).
Is it possible to have nested parallelization? That is, is it possible to use a parallel foreach loop on top of the computation of the string distance matrices?
I am also interested in ideas for making the whole thing more efficient, i.e. consuming less memory.
Maybe you can suggest a bigger "real world" data set so that I can create a working example. Unfortunately I cannot share even small samples of my data with you.
In the future it would also be nice to do something other than a classic inner join. So ideas with regard to this topic are also very much appreciated.
All your comments are welcome!
