After submitting an R package to CRAN, I received the following suggestion:
"Is there some reference about the method you can add in the Description field in the form Authors (year) ?"
After some searching, I haven't found any instances of people putting DOIs in the DESCRIPTION file, only perhaps in the CITATION file, which does not seem to be what is asked for here. How might I go about resolving this issue? Thanks in advance!
Your searching may have been superficial. Limiting it to the subset of what I may have installed here so that I can grep:
edd@rob:~$ grep -l "<doi:.*>" /usr/local/lib/R/site-library/*/DESCRIPTION
/usr/local/lib/R/site-library/acepack/DESCRIPTION
/usr/local/lib/R/site-library/arules/DESCRIPTION
/usr/local/lib/R/site-library/datasauRus/DESCRIPTION
/usr/local/lib/R/site-library/ddalpha/DESCRIPTION
/usr/local/lib/R/site-library/DEoptimR/DESCRIPTION
/usr/local/lib/R/site-library/distr6/DESCRIPTION
/usr/local/lib/R/site-library/dqrng/DESCRIPTION
/usr/local/lib/R/site-library/earth/DESCRIPTION
/usr/local/lib/R/site-library/fastglm/DESCRIPTION
/usr/local/lib/R/site-library/fields/DESCRIPTION
/usr/local/lib/R/site-library/HardyWeinberg/DESCRIPTION
/usr/local/lib/R/site-library/jomo/DESCRIPTION
/usr/local/lib/R/site-library/lava/DESCRIPTION
/usr/local/lib/R/site-library/loo/DESCRIPTION
/usr/local/lib/R/site-library/lpirfs/DESCRIPTION
/usr/local/lib/R/site-library/mcmc/DESCRIPTION
/usr/local/lib/R/site-library/mice/DESCRIPTION
/usr/local/lib/R/site-library/party/DESCRIPTION
/usr/local/lib/R/site-library/plm/DESCRIPTION
/usr/local/lib/R/site-library/praznik/DESCRIPTION
/usr/local/lib/R/site-library/Rcpp/DESCRIPTION
/usr/local/lib/R/site-library/RcppSMC/DESCRIPTION
/usr/local/lib/R/site-library/RcppZiggurat/DESCRIPTION
/usr/local/lib/R/site-library/RProtoBuf/DESCRIPTION
/usr/local/lib/R/site-library/spam/DESCRIPTION
/usr/local/lib/R/site-library/SQUAREM/DESCRIPTION
/usr/local/lib/R/site-library/stabs/DESCRIPTION
/usr/local/lib/R/site-library/tweedie/DESCRIPTION
/usr/local/lib/R/site-library/xgboost/DESCRIPTION
edd@rob:~$
And, just to be plain, here are the first ten lines of the actual result set:
edd@rob:~$ grep -h "<doi:.*>" /usr/local/lib/R/site-library/*/DESCRIPTION | head -10
80:580-598. <doi:10.1080/01621459.1985.10478157>].
<doi:10.1080/01621459.1988.10478610>]. A good introduction to these two methods is in chapter 16 of
See Christian Borgelt (2012) <doi:10.1002/widm.1074>.
<doi:10.1145/3025453.3025912>.
Description: Contains procedures for depth-based supervised learning, which are entirely non-parametric, in particular the DDalpha-procedure (Lange, Mosler and Mozharovskyi, 2014 <doi:10.1007/s00362-012-0488-4>). The training data sample is transformed by a statistical depth function to a compact low-dimensional space, where the final classification is done. It also offers an extension to functional data and routines for calculating certain notions of statistical depth functions. 50 multivariate and 5 functional classification problems are included. (Pokotylo, Mozharovskyi and Dyckerhoff, 2019 <doi:10.18637/jss.v091.i05>).
Brest et al. (2006) <doi:10.1109/TEVC.2006.872133>.
Description: An R6 object oriented distributions package. Unified interface for 42 probability distributions and 11 kernels including functionality for multiple scientific types. Additionally functionality for composite distributions and numerical imputation. Design patterns including wrappers and decorators are described in Gamma et al. (1994, ISBN:0-201-63361-2). For quick reference of probability distributions including d/p/q/r functions and results we refer to McLaughlin, M. P. (2001). Additionally Devroye (1986, ISBN:0-387-96305-7) for sampling the Dirichlet distribution, Gentle (2009) <doi:10.1007/978-0-387-98144-4> for sampling the Multivariate Normal distribution and Michael et al. (1976) <doi:10.2307/2683801> for sampling the Wald distribution.
proposed by Marsaglia and Tsang (2000, <doi:10.18637/jss.v005.i08>).
Threefry engine (Salmon et al., 2011 <doi:10.1145/2063384.2063405>) as
Splines" <doi:10.1214/aos/1176347963>.
edd@rob:~$
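The same check can be run from within R. This is a minimal sketch using base R only; the helper name `extract_dois` is made up for illustration, and the sample strings are copied from the output above.

```r
# Hypothetical helper: pull every <doi:...> tag out of a character vector.
extract_dois <- function(x) {
  unlist(regmatches(x, gregexpr("<doi:[^>]+>", x)))
}

# Two Description fragments copied from the grep output above.
samples <- c(
  "See Christian Borgelt (2012) <doi:10.1002/widm.1074>.",
  "proposed by Marsaglia and Tsang (2000, <doi:10.18637/jss.v005.i08>)."
)
extract_dois(samples)

# Applied to whatever is installed locally (results depend on the library).
ip <- installed.packages(fields = "Description")
with_doi <- grepl("<doi:[^>]+>", ip[, "Description"])
head(rownames(ip)[with_doi])
```

Packages without a Description entry show up as NA in the matrix, which `grepl()` treats as a non-match.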
I've read elsewhere that the VGAM package can be used to model underdispersed count data via the genpoisson families. However, when I look up the help file for genpoisson0, genpoisson1, and genpoisson2 they all say the following:
"In theory the λ parameter is allowed to be negative to handle underdispersion, however this is no longer supported, hence 0 < λ < 1."
"In theory the \varphi parameter might be allowed to be less than unity to handle underdispersion but this is not supported."
"In theory the α parameter might be allowed to be negative to handle underdispersion but this is not supported."
Where can I go to handle underdispersion?
You can use quasi-likelihood methods, e.g. family = quasipoisson in glm() (in base R).
The glmmTMB package supports COM-Poisson (family = compois) and generalized Poisson (family = genpois) conditional distributions.
It's not clear to me whether the reasons discussed (briefly) here for why underdispersed generalized Poisson distributions are no longer supported in VGAM also apply to the implementation in glmmTMB ...
There is some discussion of the glmmTMB parameterizations/implementations of COM-Poisson and generalized Poisson in Brooks et al. (2019).
Brooks, Mollie E., Kasper Kristensen, Maria Rosa Darrigo, Paulo Rubim, María Uriarte, Emilio Bruna, and Benjamin M. Bolker. “Statistical Modeling of Patterns in Annual Reproductive Rates.” Ecology 100, no. 7 (2019): e02706. https://doi.org/10.1002/ecy.2706.
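As a rough illustration of the quasi-likelihood route, here is a minimal sketch on simulated data. The data-generating model (binomial counts with a log-linear mean) is an assumption chosen only because it yields counts whose variance is below their mean; it is not taken from the question.

```r
set.seed(1)
n  <- 500
x  <- runif(n)
mu <- exp(0.5 + 0.3 * x)   # assumed log-linear mean, always below 5

# Binomial counts with mean mu have variance mu * (1 - mu / 5) < mu,
# i.e. they are underdispersed relative to a Poisson with the same mean.
y <- rbinom(n, size = 5, prob = mu / 5)

fit <- glm(y ~ x, family = quasipoisson)

# Estimated dispersion; values below 1 point to underdispersion.
summary(fit)$dispersion

# With glmmTMB installed, the families mentioned above would be requested
# via family = compois or family = genpois in glmmTMB::glmmTMB() (not run here).
```

The quasi-Poisson fit leaves the coefficient estimates of a Poisson GLM unchanged and rescales the standard errors by the estimated dispersion, so it does not give a full likelihood for model comparison.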
I just submitted an R package to CRAN. I got this comment back:
If there are references describing the methods in your package, please add these in the description field of your DESCRIPTION file in the form
authors (year) <doi:...>
authors (year) <arXiv:...>
authors (year, ISBN:...)
or if those are not available: <https:...>
with no space after 'doi:', 'arXiv:', 'https:' and angle brackets for auto-linking.
(If you want to add a title as well please put it in quotes: "Title")
But I thought that the description field is limited to one paragraph, which means you can't include additional text besides the single paragraph in that field. So I was unsure what the exact formatting is for including references in the description field. My guess is below, but this format returns a note stating that the description is malformed.
Description: Text describing the package, blah blah blah.
More text goes here, etc etc etc.
Foo, B., and J. Baz. (1999) <doi:23232/xxxxx.00>
Smith, C. (2021) <https://something.etc/foo>
Note returned when running R CMD check:
checking DESCRIPTION meta-information ... NOTE
Malformed Description field: should contain one or more complete sentences.
This question is related but does not have a satisfactory answer so I am asking again.
I started with Julia Silge's blog post here:
cran <- tools::CRAN_package_db()
desc_with_doi <- grep("doi:", cran$Description, value = TRUE)
Here are some examples:
Given a protein multiple sequence alignment, it is daunting task to assess the effects of substitutions along sequence length. 'aaSEA' package is intended to help researchers to rapidly analyse property changes caused by single, multiple and correlated amino acid substitutions in proteins. Methods for identification of co-evolving positions from multiple sequence alignment are as described in : Pelé et al., (2017) <doi:10.4172/2379-1764.1000250>.
or
Estimate parameters of accumulated damage (load duration) models based on failure time data under a Bayesian framework, using Approximate Bayesian Computation (ABC). Assess long-term reliability under stochastic load profiles. Yang, Zidek, and Wong (2019) <doi:10.1080/00401706.2018.1512900>.
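Following the pattern of these two examples, the Description from the question could be reworked so that the references sit inside complete sentences of the single paragraph. The sketch below writes such a field to a temporary file and reads it back with read.dcf(); the package name `foo` is made up, the reference placeholders are the ones from the question, and whether R CMD check accepts the result is not verified here.

```r
# Placeholder text, names and identifiers are taken from the question.
# Continuation lines are indented so that they stay inside the one field.
desc_lines <- c(
  "Package: foo",
  "Description: Text describing the package, blah blah blah.",
  "    More text goes here, etc etc etc. The methods are described in",
  "    Foo and Baz (1999) <doi:23232/xxxxx.00> and Smith (2021)",
  "    <https://something.etc/foo>."
)
path <- tempfile()
writeLines(desc_lines, path)

# read.dcf() reads Debian-control-format files such as DESCRIPTION.
desc <- read.dcf(path)
colnames(desc)
desc[, "Description"]
```

The point of the layout is that the whole field ends with a full stop and the references are part of a sentence rather than separate trailing lines.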
Using a similar filter for "https" shows (unsurprisingly) a lot more generic website links than scholarly references, but e.g.:
Designed for studies where animals tagged with acoustic tags are expected\n to move through receiver arrays. This package combines the advantages of automatic sorting and checking \n of animal movements with the possibility for user intervention on tags that deviate from expected \n behaviour. The three analysis functions (explore(), migration() and residency()) \n allow the users to analyse their data in a systematic way, making it easy to compare results from \n different studies.\n CJS calculations are based on Perry et al. (2012) <https://www.researchgate.net/publication/256443823_Using_mark-recapture_models_to_estimate_survival_from_telemetry_data>.
ArXiv (there are only 24 packages with such links out of 17962 total at present):
Provides functions for model fitting and selection of generalised hypergeometric ensembles of random graphs (gHypEG).\n To learn how to use it, check the vignettes for a quick tutorial.\n Please reference its use as Casiraghi, G., Nanumyan, V. (2019) doi:10.5281/zenodo.2555300\n together with those relevant references from the one listed below.\n The package is based on the research developed at the Chair of Systems Design, ETH Zurich.\n Casiraghi, G., Nanumyan, V., Scholtes, I., Schweitzer, F. (2016) <arXiv:1607.02441>.\n Casiraghi, G., Nanumyan, V., Scholtes, I., Schweitzer, F. (2017) <doi:10.1007/978-3-319-67256-4_11>.\n Casiraghi, G., (2017) <arxiv:1702.02048>\n Casiraghi, G., Nanumyan, V. (2018) <arXiv:1810.06495>.\n Brandenberger, L., Casiraghi, G., Nanumyan, V., Schweitzer, F. (2019) <doi:10.1145/3341161.3342926>\n Casiraghi, G. (2019) <doi:10.1007/s41109-019-0241-1>.
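To compare how often each kind of tag occurs, the filters can be wrapped in one small function. This is a sketch only; `count_ref_tags` is a made-up helper name, and the sample strings are shortened fragments of the examples quoted above.

```r
# Hypothetical helper: count descriptions that contain each kind of tag.
count_ref_tags <- function(descriptions) {
  c(
    doi   = sum(grepl("<doi:", descriptions, fixed = TRUE)),
    # arXiv tags appear with mixed capitalisation, so ignore case here.
    arxiv = sum(grepl("<arxiv:", descriptions, ignore.case = TRUE)),
    https = sum(grepl("<https:", descriptions, fixed = TRUE))
  )
}

# Shortened fragments of the examples quoted above.
examples <- c(
  "Yang, Zidek, and Wong (2019) <doi:10.1080/00401706.2018.1512900>.",
  "Casiraghi, G., (2017) <arxiv:1702.02048>",
  "Casiraghi, G., Nanumyan, V. (2018) <arXiv:1810.06495>."
)
count_ref_tags(examples)

# On the full CRAN table (needs network access):
# count_ref_tags(tools::CRAN_package_db()$Description)
```

Each description is counted at most once per tag type, so the numbers are packages, not individual references.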
Elith et al. [1] describe a method of measuring the dissimilarity between the values used in model fitting and the values used in making predictions. In the context of species distribution modelling (ecological niche modelling) that prediction is a 'projection'. The method is called 'multivariate environmental similarity surface' (MESS) analysis. There is a function in the dismo package to estimate it (as well as a function built into the MAXENT java program).
q1: Does anyone know what units are reported by the dismo::mess function?
The dismo::mess function reports not only a MESS for each predictor (received and reported as a raster), but also reports a layer named 'rmess'. In the help file it is described as "an additional layer with the MESS values".
q2: How are the MESS values calculated?
q3: What is the rmess layer a measure of?
Thanks for your help!
[1] Elith, J., Kearney, M. & Phillips, S. 2010 The art of modelling range-shifting species. Methods in Ecology and Evolution 1, 330-342. (doi:10.1111/j.2041-210X.2010.00036.x).
You can see what dismo does by typing
dismo::mess
It calls .messi3, which you can see with
dismo:::.messi3
(I found the answer in the appendices...)
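The two lookups used in the answer work for any package. In the sketch below, stats:::Pillai serves purely as a stand-in for an internal function in a package that ships with R; the dismo part runs only if that package is installed and still contains .messi3.

```r
# `pkg::name` reaches exported objects, `pkg:::name` also reaches internal ones.
exported <- stats::median
internal <- stats:::Pillai   # stand-in for an unexported helper
is.function(internal)

# Printing a function shows its source; deparse() returns it as text.
head(deparse(exported))

# The same lookup for the function named in the answer, guarded so that
# nothing fails when dismo or .messi3 is not available.
if (requireNamespace("dismo", quietly = TRUE) &&
    exists(".messi3", envir = asNamespace("dismo"), inherits = FALSE)) {
  print(get(".messi3", envir = asNamespace("dismo")))
}
```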
I would like to know which algorithm the randomForest package in R uses to grow its decision trees. Is it CART, ID3, C4.5, or something else?
The description in ?randomForest states:
randomForest implements Breiman’s random forest algorithm (based on
Breiman and Cutler’s original Fortran code) for classification and
regression. It can also be used in unsupervised mode for assessing
proximities among data points.
Reference: Breiman L (2001). "Random Forests". Machine Learning. 45 (1): 5–32.
doi:10.1023/A:1010933404324.
According to Wikipedia (https://en.wikipedia.org/wiki/Random_forest):
The introduction of random forests proper was first made in a paper
by Leo Breiman. This paper describes a method of building a forest of
uncorrelated trees using a CART like procedure.
Reference: Breiman L (2001). "Random Forests". Machine Learning. 45 (1): 5–32.
doi:10.1023/A:1010933404324.
Therefore I would say it is CART.
In R, the randomForest package uses CART. There is also another R package called ranger, which can grow the trees faster.
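One way to see what the grown trees look like is to print a single tree from a fitted forest. This is a minimal sketch, assuming the randomForest package is installed and using the built-in iris data as a stand-in; it shows the node table only and is not a proof of which algorithm is used.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  rf <- randomForest::randomForest(Species ~ ., data = iris, ntree = 25)

  # One row per node: child node ids, split variable, split point,
  # node status and prediction.
  tree1 <- randomForest::getTree(rf, k = 1, labelVar = TRUE)
  print(head(tree1))
}
```

Each internal node splits on one variable at one threshold and has exactly two children, which is the binary-split structure associated with CART-like procedures.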
I am pretty new to machine learning, and I've stumbled upon an issue and can't seem to find a solution no matter how hard I google.
I have performed a multiclass classification using a randomForest algorithm and found a model that offers adequate prediction of my test sample. I then used varImpPlot() to determine which predictors are most important in determining the class assignments.
My problem: I would like to know why those predictors are most important. Specifically, I would like to be able to report that cases that fall into Class X hold Characteristics A (e.g., are male), B (e.g., are older), and C (e.g., have high IQ), while cases that fall into Class Y hold Characteristics D (female), E (younger), and F (low IQ), and so on for the rest of my classes.
I know that standard binary logistic regression allows you to say that cases with high values on Characteristic A are more likely to fall into class X, for example. So, I was hoping for something conceptually similar, but from a random forest classification model on multiple classes.
Is this a thing that can be done using random forest models? If yes, is there a function in randomForest or in caret (or even elsewhere) that can help me get past the varImpPlot() and varImp() table?
Thanks!
There is a package named ExplainPrediction that promises an explanation for random forest models. Here's the top of its DESCRIPTION file. The page given in the URL field has a link to an extensive citation list:
Package: ExplainPrediction
Title: Explanation of Predictions for Classification and Regression Models
Version: 1.3.0
Date: 2017-12-27
Author: Marko Robnik-Sikonja
Maintainer: Marko Robnik-Sikonja <marko.robnik@fri.uni-lj.si>
Description: Generates explanations for classification and regression models and visualizes them.
Explanations are generated for individual predictions as well as for models as a whole. Two explanation methods
are included, EXPLAIN and IME. The EXPLAIN method is fast but might miss explanations expressed redundantly
in the model. The IME method is slower as it samples from all feature subsets.
For the EXPLAIN method see Robnik-Sikonja and Kononenko (2008) <doi:10.1109/TKDE.2007.190734>,
and the IME method is described in Strumbelj and Kononenko (2010, JMLR, vol. 11:1-18).
All models in package 'CORElearn' are natively supported, for other prediction models a wrapper function is provided
and illustrated for models from packages 'randomForest', 'nnet', and 'e1071'.
License: GPL-3
URL: http://lkm.fri.uni-lj.si/rmarko/software/
Imports: CORElearn (>= 1.52.0),semiArtificial (>= 2.2.5)
Suggests: nnet,e1071,randomForest
Also:
Package: DALEX
Title: Descriptive mAchine Learning EXplanations
Version: 0.1.1
Authors@R: person("Przemyslaw", "Biecek", email = "przemyslaw.biecek@gmail.com", role = c("aut", "cre"))
Description: Machine Learning (ML) models are widely used and have various applications in classification
or regression. Models created with boosting, bagging, stacking or similar techniques are often
used due to their high performance, but such black-box models usually lack of interpretability.
'DALEX' package contains various explainers that help to understand the link between input variables and model output.
The single_variable() explainer extracts conditional response of a model as a function of a single selected variable.
It is a wrapper over packages 'pdp' and 'ALEPlot'.
The single_prediction() explainer attributes parts of model prediction to particular variables used in the model.
It is a wrapper over 'breakDown' package.
The variable_dropout() explainer assess variable importance based on consecutive permutations.
All these explainers can be plotted with generic plot() function and compared across different models.
Depends: R (>= 3.0)
License: GPL
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1.9000
Imports: pdp, ggplot2, ALEPlot, breakDown
Suggests: gbm, randomForest, xgboost
URL: https://pbiecek.github.io/DALEX/
BugReports: https://github.com/pbiecek/DALEX/issues
NeedsCompilation: no
Packaged: 2018-02-28 01:44:36 UTC; pbiecek
Author: Przemyslaw Biecek [aut, cre]
Maintainer: Przemyslaw Biecek <przemyslaw.biecek@gmail.com>
Repository: CRAN
Date/Publication: 2018-02-28 16:36:14 UTC
Built: R 3.4.3; ; 2018-04-03 03:04:04 UTC; unix
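Within randomForest itself, two functions go a step beyond varImpPlot(): importance() reports class-specific importance when the model is fitted with importance = TRUE, and partialPlot() shows how one predictor relates to one class. The sketch below assumes randomForest is installed and uses the built-in iris data as a stand-in for the data in the question.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  rf <- randomForest::randomForest(Species ~ ., data = iris, importance = TRUE)

  # With importance = TRUE the matrix has one column per class
  # in addition to the overall measures plotted by varImpPlot().
  imp <- randomForest::importance(rf)
  print(imp)

  # Partial dependence of one class on one predictor; plot = FALSE
  # returns the x and y values instead of drawing them.
  pd <- randomForest::partialPlot(rf, pred.data = iris, x.var = "Petal.Length",
                                  which.class = "versicolor", plot = FALSE)
  str(pd)
}
```

The per-class columns say which predictors matter for which class, and the partial dependence curve gives the direction, which is the closest analogue here to reading the sign of a logistic regression coefficient.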