How to use a custom font family in UpSetR plots?

I am trying to create a set plot using the UpSetR package; however, I'd like to control the font family. The ideal approach would be the theme() function from ggplot2, but UpSetR does not support this at the moment (there has been an open issue on GitHub since 2016), and adding it results in NULL.
Example to create test plot:
# R version ---------------------------------------------------------------
# platform x86_64-w64-mingw32
# arch x86_64
# os mingw32
# system x86_64, mingw32
# status
# major 3
# minor 5.1
# year 2018
# month 07
# day 02
# svn rev 74947
# language R
# version.string R version 3.5.1 (2018-07-02)
# nickname Feather Spray
# Package versions --------------------------------------------------------
# Assumes packages are already installed
packageVersion(pkg = "extrafont") == "0.17"
packageVersion(pkg = "UpSetR") == "1.3.3"
packageVersion(pkg = "ggplot2") == "3.1.0"
# Load UpSetR -------------------------------------------------------------
library(UpSetR)
library(extrafont)
library(ggplot2)
# Example -----------------------------------------------------------------
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"), header = TRUE, sep = ";")
upset(data = movies,
order.by = "freq",
keep.order = TRUE,
mainbar.y.label = "Example plot",
point.size = 4,
line.size = 1,
sets.x.label = NULL)
Ideally, UpSetR would support adding layers such as + theme() from ggplot2; however, an UpSetR plot cannot be extended with the "+" operator. For example, if + theme(text = element_text(family = "Times New Roman")) were added at the end of the call above, it would return NULL and produce no plot.
Can you please suggest a workaround (or a customization of a function in the package) that would support custom fonts in the example plot above? Alternatively, is there a way to force a default font family in all plots without specifying any family arguments manually?

One way of achieving this with another UpSet implementation would be:
# install if needed
if(!require(ggplot2movies)) install.packages('ggplot2movies')
if(!require(devtools)) install.packages('devtools')
devtools::install_github('krassowski/complex-upset')
movies = as.data.frame(ggplot2movies::movies)
genres = c('Action', 'Animation', 'Comedy', 'Drama', 'Documentary', 'Romance', 'Short')
library(ComplexUpset)
library(ggplot2)
upset(
movies, genres, min_size=45, width_ratio=0.1,
themes=upset_default_themes(text=element_text(family='Times New Roman'))
)
Disclaimer: I am the author of this implementation.

Related

How old is an installed R package?

Is it possible to get the year in which an installed R package was released using some R code? I can get the version, but then I have to look up on the internet when that version was released.
Background: I am working for the Swiss Federal Statistical Office and a small group is trying to get a better R environment (we are working for example with the dplyr version 0.7.4 from 2017... and it is not possible to install a newer version...).
Cheers
Renger
You can use the versions package to get the release date of a package version. The package pulls the published versions of the package from the MRAN snapshot server.
versions::installed.versions("dplyr")
# [1] "1.0.7"
versions::available.versions("dplyr")
# $dplyr
# version date available
# 1 1.0.7 2021-06-18 TRUE
# 2 1.0.6 2021-05-05 TRUE
# 3 1.0.5 2021-03-05 TRUE
# ...
Package age
So if you want to answer the specific question about the package age you can do the following:
how_old <- function(pkg, lib = .libPaths()[1], return_age = FALSE) {
pkg_ver <- versions::installed.versions(pkgs = pkg, lib = lib)
av_vers <- versions::available.versions(pkgs = pkg)
pkg_dte <- subset.data.frame(
x = as.data.frame(unname(av_vers)),
subset = version == pkg_ver,
select = date,
drop = TRUE
)
pkg_dte <- as.Date(pkg_dte)
if (return_age) {
return(epocakir::dob2age(dob = pkg_dte))
} else {
return(pkg_dte)
}
}
how_old("dplyr", return_age = TRUE)
Results
[1] "1123200s (~1.86 weeks)"
Package creation
Or, if you want to find out when the package was installed locally:
when_created <- function(pkg, lib = .libPaths()[1]) {
# Package will always have DESCRIPTION file so that's a safe bet
desc_file <- system.file("DESCRIPTION", package = pkg, lib.loc = lib)
info <- fs::file_info(desc_file)
info$birth_time
}
when_created("dplyr")
Results
# [1] "2021-06-25 08:47:21 BST"
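If the fs package is not available, a similar timestamp can be read with base R's file.info(). This is only a minimal sketch: the helper name when_modified is hypothetical, and it reports the modification time of the DESCRIPTION file, which is assumed here to be a reasonable proxy for the install time.

```r
# Hypothetical base-R helper: modification time of a package's DESCRIPTION.
# Assumption: the file is written at install time and not touched afterwards.
when_modified <- function(pkg, lib = .libPaths()) {
  desc_file <- system.file("DESCRIPTION", package = pkg, lib.loc = lib)
  if (!nzchar(desc_file)) stop("package not found: ", pkg)
  file.info(desc_file)$mtime  # POSIXct timestamp
}

stamp <- when_modified("stats")
print(stamp)
```

The stats package is used here only because it ships with every R installation; any installed package name should work the same way.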
As @Jonathan recommended, if the package has a citation, then you can call the year in the citation.
citation("dplyr")$year
An alternative is to get the date from a list of available versions of a package.
devtools::install_github("https://github.com/cran/versions")
library(versions)
x <- versions::available.versions(c("dplyr", "ggplot2"))
version_year <-
function(x,
package.name = "",
version = "") {
pckg <- x[[package.name]]
row <- which(pckg$version == version)
return(pckg$date[row])
}
version_year(x, "ggplot2", version = "2.0.0")
#[1] "2015-12-18"
As a last resort, you can find out when a package was created from its DESCRIPTION:
packageDescription(pkg)$Packaged
In fact, citation falls back to this very field if no other date was given (either as Date/Publication or via an explicit CITATION file).
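The fallback chain described above can be sketched in base R without any network access. The helper names extract_date and package_date are hypothetical, and the order of fields tried (Date/Publication, then Packaged, then Built) is an assumption made for this sketch rather than a documented rule.

```r
# Pull the first YYYY-MM-DD date out of a DESCRIPTION field, or NA if none.
extract_date <- function(field) {
  if (is.null(field) || is.na(field)) return(as.Date(NA))
  m <- regmatches(field, regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", field))
  if (length(m) == 0) as.Date(NA) else as.Date(m)
}

# Try several DESCRIPTION fields in turn (assumed order) for an installed package.
package_date <- function(pkg) {
  desc <- utils::packageDescription(pkg)
  for (f in c("Date/Publication", "Packaged", "Built")) {
    d <- extract_date(desc[[f]])
    if (!is.na(d)) return(d)
  }
  as.Date(NA)
}

pkg_date <- package_date("stats")
pkg_age  <- Sys.Date() - pkg_date  # a difftime, in days
print(pkg_date)
print(pkg_age)
```

Because this reads only the local DESCRIPTION file, it also works on machines that cannot reach CRAN, though the Built field reflects when the binary was built, not when the version was first published.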

How to link a couple of tip nodes in an inverted circular phylogenetic tree using ggtree in R

I want to create a figure of an annotated phylogenetic tree in a circular layout with ggtree in R. Some tip nodes must be linked by a curved line. I can achieve this with the geom_taxalink() function in the rectangular layout, but it doesn't work in the circular layout. This seems to be because geom_taxalink() uses geom_curve(), which doesn't support non-linear coordinates. I get the following message:
"Warning message:
geom_curve is not implemented for non-linear coordinates"
Below: reproducible code, the output I get, the output I want, and session info.
I'd appreciate any help to get the result I need.
Thanks!
Samuel
Example code:
library(treeio)
library(ggtree)
library(ggplot2)
raxml_file <- system.file("extdata/RAxML",
"RAxML_bipartitionsBranchLabels.H3",
package="treeio")
raxml <- read.raxml(raxml_file)
raxml <- as_tibble(raxml)
raxml$label <- gsub("_.*$", "", raxml$label)
raxml <- as.treedata(raxml)
my_tree <- ggtree(raxml, layout = "circular", branch.length = "none") +
geom_tiplab2(size = 3, hjust = 1) +
geom_taxalink("EU857082",
"YGSIV1534",
color = "red") +
scale_x_reverse(limits = c(100, 0))
ggsave("my_tree.png", my_tree,
width = 10, height = 10, units = "in",
dpi = 300)
Here is a link to a sample of the result I get:
Here is link to an example of the desired result:
Session info:
info <- sessionInfo()
toLatex(info, locale = FALSE)
# \begin{itemize}\raggedright
# \item R version 4.0.2 (2020-06-22), \verb|x86_64-pc-linux-gnu|
# \item Running under: \verb|Ubuntu 18.04.4 LTS|
# \item Matrix products: default
# \item BLAS: \verb|/usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1|
# \item LAPACK: \verb|/usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1|
# \item Base packages: base, datasets, graphics, grDevices, methods,
# stats, utils
# \item Other packages: ggplot2~3.3.2, ggtree~2.2.1, treeio~1.12.0
# \item Loaded via a namespace (and not attached): ape~5.4,
# aplot~0.0.4, assertthat~0.2.1, BiocManager~1.30.10, cli~2.0.2,
# colorspace~1.4-1, compiler~4.0.2, crayon~1.3.4, dplyr~1.0.0,
# ellipsis~0.3.1, fansi~0.4.1, farver~2.0.3, generics~0.0.2,
# glue~1.4.1, grid~4.0.2, gtable~0.3.0, jsonlite~1.7.0, labeling~0.3,
# lattice~0.20-41, lazyeval~0.2.2, lifecycle~0.2.0, magrittr~1.5,
# munsell~0.5.0, nlme~3.1-148, parallel~4.0.2, patchwork~1.0.1,
# pillar~1.4.6, pkgconfig~2.0.3, purrr~0.3.4, R6~2.4.1, Rcpp~1.0.5,
# rlang~0.4.7, rstudioapi~0.11, rvcheck~0.1.8, scales~1.1.1,
# tibble~3.0.3, tidyr~1.1.0, tidyselect~1.1.0, tidytree~0.3.3,
# tools~4.0.2, vctrs~0.3.1, withr~2.2.0
# \end{itemize}
The solution is to upgrade to version 2.3.2 (the latest version as of July 15, 2020), which is hosted on GitHub by the author of the package:
devtools::install_github("YuLab-SMU/ggtree")
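Before reinstalling, it may help to check whether the installed version is already recent enough. The sketch below uses only base R; the helper name needs_upgrade is hypothetical, and the version strings are taken from the session info and the answer above.

```r
# TRUE when `installed` is older than `required` (both version strings).
needs_upgrade <- function(installed, required) {
  utils::compareVersion(as.character(installed), required) < 0
}

needs_upgrade("2.2.1", "2.3.2")  # TRUE: the version in the session info is too old
needs_upgrade("2.3.2", "2.3.2")  # FALSE

# With ggtree installed, one would pass the real version instead:
# needs_upgrade(utils::packageVersion("ggtree"), "2.3.2")
```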

How do I refer to a specific DataFrame in a spark pipeline?

Suppose I have two Spark DataFrames with the same features and I want to build a pipeline to cross-validate them both. How can I refer to each table within a pipeline? I use sparklyr in R to do this, but I guess it should be the same with pyspark.
First, I can use the following code to build a linear regression and cross-validate it using ml_cross_validator():
suppressMessages(library(sparklyr))
suppressMessages(library(tidyverse))
sc <- spark_connect(master = "local")
copy_to(sc, mtcars, "mtcars")
mtcars <- tbl(sc, "mtcars")
pipeline <- ml_pipeline(sc) %>%
ft_r_formula(mpg ~ .) %>%
ml_linear_regression()
grid <- list(linear_regression = list(reg_param = 0))
cv <- ml_cross_validator(
sc,
estimator = pipeline, # use our pipeline to estimate the model
estimator_param_maps = grid, # use the params in grid
evaluator = ml_regression_evaluator(sc, metric_name = "rmse"), # how to evaluate the CV
num_folds = 2, # number of CV folds
seed = 2018
)
cv_model <- ml_fit(cv, mtcars)
cv_model$avg_metrics_df
#> rmse reg_param_1
#> 1 3.997882 0
Created on 2019-09-13 by the reprex package (v0.3.0)
But if I add another table with the same features:
mtcars_sample <- sdf_sample(mtcars, fraction = 0.8) %>%
sdf_register("mtcars_sample")
How can I refer to it within the pipeline?
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.6.0 (2019-04-26)
os macOS Mojave 10.14.6
system x86_64, darwin15.6.0
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Stockholm
date 2019-09-13

Why does tmap render 80 times faster than ggplot2? [Plotting shapefiles in R with ggplot2::geom_sf(), using XQuartz/X11 graphics device on macOS]

Update/Edit/Reprex: Rendering the same spatial data with the same graphics device takes 1 second with tmap versus 80 seconds with ggplot2, even though the tmap plot's R object is 80x larger. Is this due to a difference in internals and/or in how each package works with the graphics device?
library(ggplot2); library(sf);
library(tmap); library(tidyverse)
library(here) # project directory
data(World) # sf object from tmap; Provides Africa polygon
# 'd' data pulled from acleddata.com/data, restricted to Aug 18 2017 - Aug 18 2018, Region: N/S/E/W/Middle Africa only
d <- read.csv(here('2017-08-18-2018-08-18-Eastern_Africa-Middle_Africa-Northern_Africa-Southern_Africa-Western_Africa.csv'))
dsf <- st_as_sf(d, coords = c('longitude', 'latitude'), crs = 4326)
Data used:
'World' shapefile data from the tmap package itself, and
[acleddata.com/data][1], ACLED conflict events restricted to Africa between August 18 2017 and August 18 2018 (7.8 MB .csv; these filters:)
[![enter image description here][2]][2]
Plot rendering:
# ggplot2; build plot, assign to object
dev.cur() # RStudioGD on macOS: quartz
system.time(p <- ggplot(World %>% filter(continent == 'Africa')) +
geom_sf() +
geom_sf(data = dsf, aes(fill = event_type,
color = event_type)) +
ggthemes::theme_tufte() +
theme(legend.key.size = unit(.1, 'cm'),
legend.title = element_blank()))
# user system elapsed
# 0.016 0.001 0.017
object.size(p)
# 175312 bytes
# render
system.time(print(p))
# user system elapsed
# 84.755 0.432 85.418 # Note over 80 seconds
[![ggplot2 png][3]][3]
# tmap; build plot, assign to object
tmap_mode('plot')
system.time(tm <- tm_shape(World, filter =
(World$continent == 'Africa')) +
tm_polygons(group = 'Countries') +
tm_shape(dsf) +
tm_dots(col = 'event_type', group = 'event_type'))
# user system elapsed
# 0.000 0.000 0.001
object.size(tm)
# 14331968 bytes # This is 80x ggplot2 plot's object size
# 14331968/175312 = 81.75121
# render
dev.cur() # RStudioGD on macOS: quartz
system.time(print(tm))
# user system elapsed
# 1.438 0.038 1.484 # Note 1 second
[![tmap png][4]][4]
[Previous inquiry into geom_sf() & graphics devices, without the tmap comparison:]
TL;DR:
I am trying to speed up plotting by switching graphics devices to X11, since my default Quartz graphics device is slow. After downloading XQuartz (to connect to the X11 graphics device) and calling grDevices::X11(), I get errors that I don't understand.
X11(type = "cairo")
# Error in .External2(C_X11, d$display, d$width, d$height, d$pointsize, :
# unable to start device X11
# In addition: Warning message:
# In X11() : unable to open connection to X11 display 'cairo'
#> Warning in X11(type = "cairo"): unable to open connection to X11 display ''
And when I call R from a XQuartz.app terminal on macOS instead, the error message is slightly different:
X11(type = "cairo")
#> Error in .External2(C_X11, d$display, d$width, d$height, d$pointsize, : unable to start device X11cairo
End TL;DR
Broader Context:
When plotting large shapefiles with `ggplot2::geom_sf()`, the Quartz graphics device used on macOS is considerably slower than other devices. While this larger performance issue is being resolved, I want to change my device from Quartz to X11.
I downloaded XQuartz, following advice from the [RStudio forums][5], but my code doesn't successfully call X11, even when I launch R from XQuartz.
Proof, using the same data as the RStudio forum poster:
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, proj.4 4.9.3
library(ggplot2)
tmpzip <- tempfile(fileext = ".zip")
download.file("https://github.com/bcgov/bcmaps.rdata/blob/master/data-raw/ecoregions/ecoregions.zip?raw=true", destfile = tmpzip)
gdb_path <- unzip(tmpzip, exdir = tempdir())
ecoregions <- sf::read_sf(dirname(gdb_path[1]))
## Create ggplot2 object
ecoregions_gg <- ggplot() + geom_sf(data = ecoregions)
# Running quartz device - default macOS
system.time(print(ecoregions_gg))
#> user system elapsed
#> 128.980 0.774 130.375
### ^ Note two full minutes!
[![Shapefile][6]][6]
This default device runs for an unusually long 129 seconds given the size.
X11 should run faster according to the RStudio forum. A test on a (granted, faster) Windows 7 machine (32 GB RAM, 3.60 GHz) using its default graphics device (not Quartz), yielded:
#> user system elapsed
#> 2.16 2.24 4.46
### ^ Under five seconds elapsed
While people are troubleshooting the general geom_sf / Quartz performance problems ([Github Issue 1][7], [Github Issue 2][8]), how can I use my XQuartz install to run X11 and speed up my shapefile plotting?
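To compare devices on equal terms, the same drawing call can be timed on each device in turn. The following is a minimal base-R sketch: the helper name time_on_device is hypothetical, and the two file-based devices are stand-ins chosen because they open on any platform; the Quartz and X11 devices discussed above would be passed in the same way.

```r
# Time one drawing function on a freshly opened graphics device.
time_on_device <- function(plot_fun, open_device) {
  open_device()                      # open the device to benchmark
  on.exit(grDevices::dev.off())      # always close it again
  system.time(plot_fun())[["elapsed"]]
}

# Stand-in workload; for a ggplot object p this would be function() print(p)
draw <- function() plot(stats::rnorm(1e4), stats::rnorm(1e4), pch = ".")

timings <- c(
  pdf_null   = time_on_device(draw, function() grDevices::pdf(NULL)),
  postscript = time_on_device(draw, function()
    grDevices::postscript(tempfile(fileext = ".ps")))
)
print(timings)

# On macOS one could add, for example:
# time_on_device(draw, function() grDevices::quartz())
# time_on_device(draw, function() grDevices::X11(type = "cairo"))
```

Timing print(p) inside the helper rather than the construction of p matters here, because the measurements above show that building the plot object is nearly instantaneous and almost all of the time is spent rendering.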
[1]: http://acleddata.com/data
[2]: https://i.stack.imgur.com/iEA9j.png
[3]: https://i.stack.imgur.com/BCYzl.png
[4]: https://i.stack.imgur.com/5lgZB.png
[5]: https://community.rstudio.com/t/ggplot2-geom-sf-performance/3251
[6]: https://i.stack.imgur.com/ILhQP.png
[7]: https://github.com/tidyverse/ggplot2/issues/2655
[8]: https://github.com/tidyverse/ggplot2/issues/2718

Passing arguments to xlconnect functions with ellipses

I have a bunch of Excel files in one folder, and would like to write a single function as follows:
# takes a file path and sheetname for an excel workbook, passes on additional params
getxl_sheet <- function(wb_path, sheetname, ...) {
testbook <- XLConnect::loadWorkbook(wb_path)
XLConnect::readWorksheet(testbook, sheet = sheetname, ...)
}
However, when I run the following,
set.seed(31415)
x <- rnorm(15); y <- rnorm(15)
randvals <- data.frame(x=x, y=y)
XLConnect::writeWorksheetToFile("~/temp_rands.xlsx", randvals, "Sheet1")
my_vals <- getxl_sheet("~/temp_rands.xlsx", "Sheet1", endRow=5)
my_vals returns the entire 15 by 2 dataframe, as opposed to just stopping at the fifth row (likewise if I use 'endCol=1' for example, it gives both columns). On the other hand, passing additional arguments in base R hasn't been a problem:
my_plot <- function(...) {
plot(...)
}
#my_plot(x=x, y=y, pch=16, col="blue")
works as expected. What's the problem with the function defined above to read in xlsx files? Thanks.
devtools::session_info()
Session info---------------------------------------------------------------------
setting value
version R version 3.1.1 (2014-07-10)
system x86_64, darwin13.1.0
ui RStudio (0.98.1062)
language (EN)
collate en_US.UTF-8
tz America/New_York
Packages-------------------------------------------------------------------------
package * version date source
devtools 1.6.0.9000 2014-11-26 Github (hadley/devtools#bd9c252)
rJava 0.9.6 2013-12-24 CRAN (R 3.1.0)
rstudioapi 0.1 2014-03-27 CRAN (R 3.1.0)
XLConnect * 0.2.9 2014-08-14 CRAN (R 3.1.1)
XLConnectJars * 0.2.9 2014-08-14 CRAN (R 3.1.1)
The dots mechanism only works when the receiving function expects dots. Unlike plot.default, readWorksheet is not designed to handle an ellipsis, so you need to decode the arguments yourself:
getxl_sheetRCshort <- function(wb_path, sheetname, ...) {
# Capture the extra arguments, then pass the relevant ones on by name
arglist <- list(...)
testbook <- XLConnect::loadWorkbook(wb_path)
XLConnect::readWorksheet(testbook, sheet = sheetname,
endRow = arglist[['endRow']], endCol = arglist[['endCol']])
}
> my_vals <- getxl_sheetRCshort("~/temp_rands.xlsx", "Sheet1", endRow=5)
> my_vals
x y
1 1.6470129 -1.27323204
2 -1.1119872 -1.77141948
3 -1.5485456 1.40846809
4 -0.7483785 -0.09450125
You could make this even more general by matching against the entire formals() list of the readWorksheet function; there are worked examples on Stack Overflow that illustrate this. Fortunately, readWorksheet appears to tolerate the NULL that arglist[['endCol']] yields when no value is passed for 'endCol'.
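The formals() matching idea can be sketched as follows. To keep the example runnable without Excel files, it uses a hypothetical stand-in function read_rows instead of readWorksheet, and the helper name forward_known is also made up; whether formals() exposes endRow and endCol for readWorksheet itself depends on how XLConnect defines it, so treat this as an illustration of the technique only.

```r
# Hypothetical stand-in for a reader that has named optional arguments.
read_rows <- function(data, endRow = 0, endCol = 0) {
  if (endRow > 0) data <- data[seq_len(min(endRow, nrow(data))), , drop = FALSE]
  if (endCol > 0) data <- data[, seq_len(min(endCol, ncol(data))), drop = FALSE]
  data
}

# Forward only those extra arguments that `fun` declares in its formals.
forward_known <- function(fun, fixed, ...) {
  dots  <- list(...)
  known <- intersect(names(dots), names(formals(fun)))
  do.call(fun, c(fixed, dots[known]))
}

df  <- data.frame(x = 1:15, y = 15:1)
res <- forward_known(read_rows, list(data = df), endRow = 5, bogus = TRUE)
print(dim(res))
```

Arguments that the target function does not declare (bogus above) are silently dropped, which is convenient for a wrapper but can hide typos in argument names; a stricter version could warn about them.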

Resources