Format for adding references in R package DESCRIPTION?

I just submitted an R package to CRAN. I got this comment back:
If there are references describing the methods in your package, please add these in the description field of your DESCRIPTION file in the form
authors (year) <doi:...>
authors (year) <arXiv:...>
authors (year, ISBN:...)
or if those are not available: <https:...>
with no space after 'doi:', 'arXiv:', 'https:' and angle brackets for auto-linking.
(If you want to add a title as well please put it in quotes: "Title")
But I thought that the Description field is limited to one paragraph, meaning you can't include additional text beyond that single paragraph. So I was unsure of the exact formatting for including references in the Description field. My guess is below, but this format returns a note stating that the description is malformed.
Description: Text describing the package, blah blah blah.
More text goes here, etc etc etc.
Foo, B., and J. Baz. (1999) <doi:23232/xxxxx.00>
Smith, C. (2021) <https://something.etc/foo>
Note returned when running R CMD check:
checking DESCRIPTION meta-information ... NOTE
Malformed Description field: should contain one or more complete sentences.
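(Comparing with the accepted CRAN examples below, it looks like the check wants every sentence, the references included, to end with a period; a guess at a version of the above that would pass:)
Description: Text describing the package, blah blah blah.
    More text goes here, etc etc etc.
    Foo, B., and J. Baz. (1999) <doi:23232/xxxxx.00>.
    Smith, C. (2021) <https://something.etc/foo>.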
This question is related but does not have a satisfactory answer so I am asking again.

I started with Julia Silge's blog post here:
cran <- tools::CRAN_package_db()  # data frame with one row per current CRAN package
desc_with_doi <- grep("doi:", cran$Description, value = TRUE)  # Descriptions citing a DOI
Here are some examples:
Given a protein multiple sequence alignment, it is daunting task to assess the effects of substitutions along sequence length. 'aaSEA' package is intended to help researchers to rapidly analyse property changes caused by single, multiple and correlated amino acid substitutions in proteins. Methods for identification of co-evolving positions from multiple sequence alignment are as described in : Pelé et al., (2017) <doi:10.4172/2379-1764.1000250>.
or
Estimate parameters of accumulated damage (load duration) models based on failure time data under a Bayesian framework, using Approximate Bayesian Computation (ABC). Assess long-term reliability under stochastic load profiles. Yang, Zidek, and Wong (2019) <doi:10.1080/00401706.2018.1512900>.
Using a similar filter for "https" shows (unsurprisingly) a lot more generic website links than scholarly references, but e.g.:
Designed for studies where animals tagged with acoustic tags are expected to move through receiver arrays. This package combines the advantages of automatic sorting and checking of animal movements with the possibility for user intervention on tags that deviate from expected behaviour. The three analysis functions (explore(), migration() and residency()) allow the users to analyse their data in a systematic way, making it easy to compare results from different studies. CJS calculations are based on Perry et al. (2012) <https://www.researchgate.net/publication/256443823_Using_mark-recapture_models_to_estimate_survival_from_telemetry_data>.
ArXiv (there are only 24 packages with such links out of 17962 total at present):
Provides functions for model fitting and selection of generalised hypergeometric ensembles of random graphs (gHypEG). To learn how to use it, check the vignettes for a quick tutorial. Please reference its use as Casiraghi, G., Nanumyan, V. (2019) doi:10.5281/zenodo.2555300 together with those relevant references from the one listed below. The package is based on the research developed at the Chair of Systems Design, ETH Zurich. Casiraghi, G., Nanumyan, V., Scholtes, I., Schweitzer, F. (2016) <arXiv:1607.02441>. Casiraghi, G., Nanumyan, V., Scholtes, I., Schweitzer, F. (2017) <doi:10.1007/978-3-319-67256-4_11>. Casiraghi, G., (2017) <arxiv:1702.02048> Casiraghi, G., Nanumyan, V. (2018) <arXiv:1810.06495>. Brandenberger, L., Casiraghi, G., Nanumyan, V., Schweitzer, F. (2019) <doi:10.1145/3341161.3342926> Casiraghi, G. (2019) <doi:10.1007/s41109-019-0241-1>.
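Counts like these are easy to reproduce against the same snapshot, reusing the cran data frame from the snippet above (a minimal sketch; the exact numbers drift as CRAN changes):
# Description fields linking an arXiv ID, versus the total package count
arxiv_hits <- grep("<arxiv:", cran$Description, ignore.case = TRUE, value = TRUE)
length(arxiv_hits)  # 24 at the time of writing
nrow(cran)          # 17962 at the time of writing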


GloVe word embeddings containing sentiment?

I've been researching sentiment analysis with word embeddings. I read papers that state that word embeddings ignore sentiment information of the words in the text. One paper states that among the top 10 words that are semantically similar, around 30 percent of words have opposite polarity e.g. happy - sad.
So, I computed word embeddings on my dataset (Amazon reviews) with the GloVe algorithm in R. Then I looked at the most similar words by cosine similarity and found that the nearest neighbours actually all share sentiment (e.g. beautiful - lovely - gorgeous - pretty - nice - love). I was therefore wondering how this is possible, since I expected the opposite from reading several papers. What could be the reason for my findings?
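For reference, here is a minimal sketch of the kind of pipeline I mean, using the text2vec package (reviews stands in for my character vector of review texts; parameter choices are illustrative):
library(text2vec)
# Tokenise the raw review texts
tokens <- word_tokenizer(tolower(reviews))
it <- itoken(tokens)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
vectorizer <- vocab_vectorizer(vocab)
# Term co-occurrence matrix with a symmetric 5-word window
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)
# Fit GloVe (argument names as in text2vec >= 0.6; older versions differ)
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv <- glove$fit_transform(tcm, n_iter = 20) + t(glove$components)
# Ten nearest neighbours of "beautiful" by cosine similarity
sims <- sim2(wv, wv["beautiful", , drop = FALSE], method = "cosine")
head(sort(sims[, 1], decreasing = TRUE), 10)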
Two of the many papers I read:
Yu, L. C., Wang, J., Lai, K. R. & Zhang, X. (2017). Refining Word Embeddings Using Intensity Scores for Sentiment Analysis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(3), 671-681.
Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T. & Qin, B. (2014). Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 1: Long Papers, 1555-1565.
Assumption:
When you say you computed GloVe embeddings, you mean you used pretrained GloVe.
Static word embeddings do not carry sentiment information about the input text at runtime.
The statement above means that word embedding algorithms (most of them, to my knowledge, such as GloVe and Word2Vec) are not designed or formulated to capture the sentiment of a word. In general, word embedding algorithms map words that are similar in meaning (based on statistical nearness and co-occurrence) close together. For example, "woman" and "girl" will lie near each other in the n-dimensional space of the embeddings. But that does not mean that any sentiment-related information is captured here.
Hence,
The words (beautiful - lovely - gorgeous - pretty - nice - love) being sentimentally similar to a given word is not a coincidence. We have to look at these words in terms of their meaning: all of them are similar in meaning, but we cannot say that they necessarily carry the same sentiment. These words lie near each other in GloVe's vector space because the model was trained on a corpus carrying enough information to group similar words together.
Also, please study the similarity scores; that will make it clearer.
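For intuition, the similarity score is just the cosine of the angle between two word vectors, and it carries no polarity information at all; a minimal illustration in plain R:
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cosine(c(1, 2, 3), c(2, 4, 6))     # 1: identical direction, regardless of "sentiment"
cosine(c(1, 2, 3), c(-1, 0.5, 1))  # smaller: the contexts overlap less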
"Among the top 10 words that are semantically similar, around 30 percent have opposite polarity"
Here, semantic similarity is less dependent on context, whereas sentiment is highly dependent on context. A single word cannot define sentiment.
Example:
Jack: "Your dress is beautiful, Gloria"!
Gloria: "Beautiful my foot!"
In both sentences, "beautiful" carries a completely different sentiment, yet both get the same embedding for the word "beautiful". Now replace "beautiful" with (lovely - gorgeous - pretty - nice): the semantic claim holds true, as described in one of the papers. And since sentiment is not captured by word embeddings, the other paper also stands true.
The point where confusion may have occurred is in treating two or more words with similar meanings as sentimentally similar. Sentiment information can be gathered at the sentence or document level, not at the word level.

DOI in CRAN submission of R package?

After submitting an R package to CRAN, I received the following suggestion:
"Is there some reference about the method you can add in the Description field in the form Authors (year) <doi:...>?"
After doing some searching, I haven't really found any instances of people putting DOIs in the Description field, except perhaps in the CITATION file, but that does not seem to be what is asked for here. May I ask how I might go about resolving this issue? Thanks in advance!
Your searching may have been superficial. Limiting it to the subset of what I may have installed here so that I can grep:
edd@rob:~$ grep -l "<doi:.*>" /usr/local/lib/R/site-library/*/DESCRIPTION
/usr/local/lib/R/site-library/acepack/DESCRIPTION
/usr/local/lib/R/site-library/arules/DESCRIPTION
/usr/local/lib/R/site-library/datasauRus/DESCRIPTION
/usr/local/lib/R/site-library/ddalpha/DESCRIPTION
/usr/local/lib/R/site-library/DEoptimR/DESCRIPTION
/usr/local/lib/R/site-library/distr6/DESCRIPTION
/usr/local/lib/R/site-library/dqrng/DESCRIPTION
/usr/local/lib/R/site-library/earth/DESCRIPTION
/usr/local/lib/R/site-library/fastglm/DESCRIPTION
/usr/local/lib/R/site-library/fields/DESCRIPTION
/usr/local/lib/R/site-library/HardyWeinberg/DESCRIPTION
/usr/local/lib/R/site-library/jomo/DESCRIPTION
/usr/local/lib/R/site-library/lava/DESCRIPTION
/usr/local/lib/R/site-library/loo/DESCRIPTION
/usr/local/lib/R/site-library/lpirfs/DESCRIPTION
/usr/local/lib/R/site-library/mcmc/DESCRIPTION
/usr/local/lib/R/site-library/mice/DESCRIPTION
/usr/local/lib/R/site-library/party/DESCRIPTION
/usr/local/lib/R/site-library/plm/DESCRIPTION
/usr/local/lib/R/site-library/praznik/DESCRIPTION
/usr/local/lib/R/site-library/Rcpp/DESCRIPTION
/usr/local/lib/R/site-library/RcppSMC/DESCRIPTION
/usr/local/lib/R/site-library/RcppZiggurat/DESCRIPTION
/usr/local/lib/R/site-library/RProtoBuf/DESCRIPTION
/usr/local/lib/R/site-library/spam/DESCRIPTION
/usr/local/lib/R/site-library/SQUAREM/DESCRIPTION
/usr/local/lib/R/site-library/stabs/DESCRIPTION
/usr/local/lib/R/site-library/tweedie/DESCRIPTION
/usr/local/lib/R/site-library/xgboost/DESCRIPTION
edd@rob:~$
And, just to be plain, here are the first ten lines of the actual result set:
edd@rob:~$ grep -h "<doi:.*>" /usr/local/lib/R/site-library/*/DESCRIPTION | head -10
80:580-598. <doi:10.1080/01621459.1985.10478157>].
<doi:10.1080/01621459.1988.10478610>]. A good introduction to these two methods is in chapter 16 of
See Christian Borgelt (2012) <doi:10.1002/widm.1074>.
<doi:10.1145/3025453.3025912>.
Description: Contains procedures for depth-based supervised learning, which are entirely non-parametric, in particular the DDalpha-procedure (Lange, Mosler and Mozharovskyi, 2014 <doi:10.1007/s00362-012-0488-4>). The training data sample is transformed by a statistical depth function to a compact low-dimensional space, where the final classification is done. It also offers an extension to functional data and routines for calculating certain notions of statistical depth functions. 50 multivariate and 5 functional classification problems are included. (Pokotylo, Mozharovskyi and Dyckerhoff, 2019 <doi:10.18637/jss.v091.i05>).
Brest et al. (2006) <doi:10.1109/TEVC.2006.872133>.
Description: An R6 object oriented distributions package. Unified interface for 42 probability distributions and 11 kernels including functionality for multiple scientific types. Additionally functionality for composite distributions and numerical imputation. Design patterns including wrappers and decorators are described in Gamma et al. (1994, ISBN:0-201-63361-2). For quick reference of probability distributions including d/p/q/r functions and results we refer to McLaughlin, M. P. (2001). Additionally Devroye (1986, ISBN:0-387-96305-7) for sampling the Dirichlet distribution, Gentle (2009) <doi:10.1007/978-0-387-98144-4> for sampling the Multivariate Normal distribution and Michael et al. (1976) <doi:10.2307/2683801> for sampling the Wald distribution.
proposed by Marsaglia and Tsang (2000, <doi:10.18637/jss.v005.i08>).
Threefry engine (Salmon et al., 2011 <doi:10.1145/2063384.2063405>) as
Splines" <doi:10.1214/aos/1176347963>.
edd@rob:~$
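The same check can also be run from inside R, for anyone without a Unix shell at hand (a small sketch using only base R):
# List installed packages whose Description field embeds a <doi:...> link
ip <- installed.packages(fields = "Description")
has_doi <- grepl("<doi:", ip[, "Description"], fixed = TRUE)
rownames(ip)[has_doi]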

Dynamic topic models/topic over time in R [closed]

I have a database of newspaper articles about water policy from 1998 to 2008. I would like to see how the newspaper coverage changes during this period. My question is: should I use a Dynamic Topic Model (DTM) or a Topics over Time (ToT) model to handle this task? Would they be significantly better than the traditional LDA model (in which I fit the topic model on the entire text corpus and plot the trend of each topic based on how each document is tagged)? If yes, is there a package I could use for the DTM/ToT model in R?
So it depends on what your research question is.
A dynamic topic model allows the words that are most strongly associated with a given topic to vary over time. The paper that introduces the model gives a great example of this using journal entries [1]. If you are interested in whether the characteristics of individual topics vary over time, then this is the correct approach.
I have not dealt with the ToT model before, but it appears similar to a structural topic model whose time covariates are continuous. This means that topics are fixed, but their relative prevalence and correlations can vary. If you group your articles into, say, months, then a structural or ToT model can show you whether certain topics become more or less prevalent over time.
So in sum, do you want the variation to be within topics or between topics? Do you want to study how the articles vary in the topics they speak on, or do you want to study how these articles construct certain topics?
In terms of R, you'll run into some problems. The stm package can deal with an STM with discrete time periods, but there is no pre-packaged implementation of a ToT model that I am aware of. For a DTM, I know there is a C++ implementation that was released with the introductory paper, and I have a Python version which I can find for you.
Note: I would never recommend using plain LDA for text documents. I would always take a correlated topic model as a base and build from there.
Edit: to explain more about the stm package.
This package is an implementation of the structural topic model [2]. The STM is an extension of the correlated topic model [3] that permits the inclusion of covariates at the document level. You can then explore the relationship between topic prevalence and these covariates. If you include a covariate for date, you can explore how individual topics become more or less important over time, relative to others. The package itself is excellent, fast and intuitive, and includes functions to choose the most appropriate number of topics, etc.
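For the newspaper question above, a minimal sketch of that workflow with stm (assuming a hypothetical data frame articles with text and date columns; K = 20 is an arbitrary placeholder):
library(stm)
# Preprocess the hypothetical data frame `articles` (columns: text, date)
processed <- textProcessor(articles$text, metadata = articles)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
out$meta$day <- as.numeric(out$meta$date)  # numeric time covariate
# Let topic prevalence vary smoothly with time via a spline on `day`
fit <- stm(out$documents, out$vocab, K = 20,
           prevalence = ~ s(day), data = out$meta)
# Estimate and plot how each topic's expected proportion moves over time
eff <- estimateEffect(1:20 ~ s(day), fit, metadata = out$meta)
plot(eff, covariate = "day", method = "continuous", topics = 1)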
[1] Blei, David M., and John D. Lafferty. "Dynamic topic models." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
[2] Roberts, Margaret E., et al. "Structural Topic Models for Open‐Ended Survey Responses." American Journal of Political Science 58.4 (2014): 1064-1082.
[3] Lafferty, John D., and David M. Blei. "Correlated topic models." Advances in neural information processing systems. 2006.

checking DESCRIPTION meta-information ... NOTE

I am developing a package in R, and when I run devtools::check() I get the following note.
checking DESCRIPTION meta-information ... NOTE
Malformed Description field: should contain one or more complete sentences.
I am not using the name of the package or the word "package" in the description. I am also using complete sentences in the description, yet I keep getting this NOTE. So I am wondering what a complete sentence means in this case.
Try adding periods to the ends of the sentences; that is, turn your existing Description field into "Functions to analyze methylation data can be found here. Highlight of this workflow is the comprehensive quality control report."
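In the DESCRIPTION file itself, that would look like the following (a hypothetical excerpt; note the terminal periods):
Description: Functions to analyze methylation data can be found here.
    Highlight of this workflow is the comprehensive quality control report.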
I've always found it a bit curious that the Description field shouldn't contain the word "package" or the name of the package, and yet also requires complete sentences (which need both a subject and a verb, leading to grammatically awkward constructs if you want your introductory sentence to have a subject without using the no-no words). I'm pretty sure that, grammatically speaking, very few packages have truly "complete" sentences in their Description fields.
I'm pretty sure the check just looks for capital letters and periods; I'd also avoid any special characters, just to be on the safe side.
The Description field of my package on CRAN is "Reads river network shape files and computes network distances. Also included are a variety of computation and graphical tools designed for fisheries telemetry research, such as minimum home range, kernel density estimation, and clustering analysis using empirical k-functions with a bootstrap envelope. Tools are also provided for editing the river networks, meaning there is no reliance on external software." Definitely not the greatest, but apparently it worked and CRAN liked it!
Adding a full stop/period/dot (.) at the end of the description text will remove this NOTE.

Can someone explain the difference between the ID3 and CART algorithms?

I have to create decision trees with the R software and the rpart Package.
In my paper I should first define the ID3 algorithm and then implement various decision trees.
I found out that the rpart package does not work with the ID3 algorithm; it uses the CART algorithm. I would like to understand the difference, and maybe explain it in my paper, but I have not found any literature that compares the two.
Can you help me? Do you know a paper where both are compared or can you explain the difference to me?
I don't have access to the original texts [1, 2], but using some secondary sources, key differences between these recursive ("greedy") partitioning ("tree") algorithms seem to be:
Type of learning:
ID3, as an "Iterative Dichotomiser," is for binary classification only
CART, or "Classification And Regression Trees," is a family of algorithms (including, but not limited to, binary classification tree learning). With rpart(), you can specify method='class' or method='anova', but rpart can infer this from the type of dependent variable (i.e., factor or numeric).
Loss functions used for split selection:
ID3, as other comments have mentioned, selects its splits based on information gain, which is the reduction in entropy between the parent node and the (weighted sum of the) child nodes.
CART, when used for classification, selects its splits to achieve the subsets that minimize Gini impurity.
Anecdotally, as a practitioner, I hardly ever hear the term ID3 used, whereas CART is often used as a catch-all term for decision trees. CART has a very popular implementation in R's rpart package. ?rpart notes that "In most details it follows Breiman et. al (1984) quite closely."
However, you can pass rpart(..., parms=list(split='information')) to override the default behavior and split on information gain instead.
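For example, the two split criteria can be compared side by side on a built-in dataset (a minimal sketch using the iris data; variable names are illustrative):
library(rpart)
# CART default: split on Gini impurity
fit_gini <- rpart(Species ~ ., data = iris, method = "class")
# ID3/C4.5-style: split on information gain
fit_info <- rpart(Species ~ ., data = iris, method = "class",
                  parms = list(split = "information"))
# Inspect and compare the chosen splits
print(fit_gini)
print(fit_info)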
[1] Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, 1(1), 81-106.
[2] Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
Read section "1 C4.5 and beyond" of this paper: http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf
It will clarify all your doubts; it helped me with mine.
Don't get discouraged by the title; it's about the differences between the various tree algorithms.
Anyway, a good paper to read through.
The ID3 algorithm can be used with categorical features and a categorical label, whereas CART also handles continuous features and a continuous label.
