tokens_compound() in quanteda changes the order of features

I found tokens_compound() in quanteda changes the order of tokens across different R sessions. That is, the result varies every time after restarting a session even if a seed value is fixed, though it does not change in a single session.
Here is the replication procedure:
Find collocations, compound tokens, and save them.
library(quanteda)
set.seed(12345)
data(data_corpus_inaugural)
toks <- data_corpus_inaugural %>%
  tokens(remove_punct = TRUE,
         remove_symbol = TRUE,
         padding = TRUE) %>%
  tokens_tolower()
col <- toks %>%
  textstat_collocations()
toks.col <- toks %>%
  tokens_compound(pattern = col[col$z > 3])
write(attr(toks.col, "types"), "col1.txt")
End the R session, restart it, and run the above code again with "col1.txt" replaced by "col2.txt".
Compare the two sets of tokens and find they are different.
col1 <- read.table("col1.txt")
col2 <- read.table("col2.txt")
identical(col1$V1, col2$V1) # This should return FALSE.
col1$V1[head(which(col1$V1 != col2$V1))]
col2$V1[head(which(col1$V1 != col2$V1))]
This does not matter in many cases, but the result of LDA (via {topicmodels}) changes across sessions. I suspect the feature order is the cause, because the LDA result is constant if I reset the order of features in the tokens object with as.list() followed by as.tokens() (dfm_sort() does not work for this), as sketched below.
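A minimal sketch of that workaround, assuming toks.col from the code above; the round trip through a plain list rebuilds the tokens object with its types re-indexed in document order:
# workaround sketch: rebuild the tokens object to re-index its types
toks.col <- as.tokens(as.list(toks.col))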
I wonder whether this happens only for me (Ubuntu 18.04.5, R 4.0.4, and quanteda 2.1.2) and would be happy to hear another (easier) solution.
Updated on Feb 20
For example, the output of LDA is not reproducible:
library(topicmodels)
lis <- list()
for (i in seq_len(2)) {
  set.seed(123)
  lis[[i]] <- tokens_compound(toks, pattern = col[col$z > 3]) %>%
    dfm() %>%
    convert(to = "topicmodels") %>%
    LDA(k = 5,
        method = "Gibbs",
        control = list(seed = 12345,
                       iter = 100))
}
head(lis[[1]]@gamma)
head(lis[[2]]@gamma)

An interesting investigation, but this is neither an error nor anything to be concerned about. Within a quanteda tokens object, the order of the types is not determinate after a processing step such as tokens_compound(). This is because the function is parallelised in C++, and how those threads operate is not fixed by set.seed() from R. But this will not affect the important part, which is the set of types, or anything about the tokens themselves. If you want the order of the types that you extract to be the same, you should sort them upon extraction.
library("quanteda")
## Package version: 2.1.2
toks <- data_corpus_inaugural %>%
tokens(
remove_punct = TRUE,
remove_symbol = TRUE,
padding = TRUE
) %>%
tokens_tolower()
col <- quanteda.textstats::textstat_collocations(toks)
It turns out that you do not need to save the output or restart R - this happens within a single session.
# types are differently indexed, but are the same set
lis <- list()
for (i in seq_len(2)) {
  set.seed(123)
  toks.col <- tokens_compound(toks, pattern = col[col$z > 3])
  lis <- c(lis, list(types = types(toks.col)))
}
dframe <- data.frame(lis)
sum(dframe$types != dframe$types.1)
## [1] 19898
head(dframe[dframe$types != dframe$types.1, ])
## types types.1
## 8897 at_this_second my_fellow_citizens
## 8898 to_take_the_oath_of_the_presidential_office no_people
## 8899 there_is on_earth
## 8900 occasion_for cause_to_be_thankful
## 8901 an_extended this_is_said
## 8902 there_was spirit_of
However the (unordered) set of types is identical:
# but
setequal(dframe$types, dframe$types.1)
## [1] TRUE
More important is that when we compare the values of each token, which is ordered, these are identical:
# tokens are the same
lis <- list()
for (i in seq_len(2)) {
  set.seed(123)
  toks.col <- tokens_compound(toks, pattern = col[col$z > 3])
  lis <- c(lis, list(toks = as.character(toks.col)))
}
dframe <- data.frame(lis)
all.equal(dframe$toks, dframe$toks.1)
## [1] TRUE
Created on 2021-02-18 by the reprex package (v1.0.0)
An additional comment, whose importance is underscored by this analysis: We strongly discourage direct access to object attributes. Use types(x) as above, not attr(x, "types"). The former will always work. The latter relies on our implementation of the object, which may change as we improve the package.
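Putting both points together, a minimal sketch of a session-stable extraction (assuming toks.col from the loops above): use the accessor rather than the attribute, and sort the result.
# session-stable extraction: the sorted vector of types is identical across runs
tt <- sort(types(toks.col))
head(tt)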

Related

k-fold cross validation in quanteda

I've been using the quanteda SML (supervised machine learning) workflow as described in the quanteda tutorial (https://tutorials.quanteda.io/machine-learning/nb/) and found it extremely helpful for setting up my own classification task. However, instead of the fixed held-out train/test split, I would like to use k-fold cross-validation. Could you point me towards the best way to implement it in the workflow? Is there an easy way to apply it in quanteda?
Many thanks
I tried to add a cross validation based on this example:
https://rdrr.io/github/quanteda/quanteda.classifiers/man/crossval.html
require(quanteda)
require(quanteda.textmodels)
require(caret)
corp_movies <- data_corpus_moviereviews
summary(corp_movies, 5)
# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)
# tokenize texts
toks_movies <- tokens(corp_movies, remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(pattern = stopwords("en")) %>%
  tokens_wordstem()
dfmt_movie <- dfm(toks_movies)
# get training set
dfmat_training <- dfm_subset(dfmt_movie, id_numeric %in% id_train)
# get test set (documents not in id_train)
dfmat_test <- dfm_subset(dfmt_movie, !id_numeric %in% id_train)
tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb)
dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_training))
actual_class <- dfmat_matched$sentiment
predicted_class <- predict(tmod_nb, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
tab_class
# confusionMatrix() comes from caret, loaded above
confusionMatrix(tab_class, mode = "everything", positive = "pos")
# n-fold cross validation
require(crossval)
dfmat <- dfm(toks_movies)
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)
crossval(tmod, k = 5, by_class = FALSE)
crossval(tmod, k = 5, by_class = FALSE, verbose = TRUE)
but it returns "Error in group.samples(Y) : argument "Y" is missing, with no default"
This should probably be a comment, but I cannot post comments yet. I think your problem is caused by calling crossval() from the wrong package. The link you shared refers to crossval() from the GitHub package quanteda/quanteda.classifiers, not the CRAN package crossval that you attached with require(crossval). The CRAN function has a different interface: it requires X and Y arguments, and their absence is the reason for your error.
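For what it's worth, a sketch of the intended usage, assuming the crossval() from the GitHub package quanteda/quanteda.classifiers, which takes a fitted textmodel as its first argument:
# remotes::install_github("quanteda/quanteda.classifiers")
library(quanteda.classifiers)
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)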

Why does quanteda's textmodel_wordfish run infinitely when I apply to quanteda.corpora's corpus of UK party manifestos?

I'm attempting to apply wordfish to quanteda.corpora's data_corpus_ukmanifestos, but it never seems to stop running. On the other hand, when I use the example code from quanteda's wordfish tutorial, wordfish is complete within seconds. Is this just a problem for me? Does this happen to others as well? How can I circumvent this problem?
This is the code I have right now. Like I said, wordfish works in seconds when run on the Irish budget speeches, but never stops running when applied to party manifestos.
## install/load packages
## install.packages(c("quanteda", "devtools"))
## devtools::install_github("quanteda/quanteda.corpora")
library(quanteda)
library(quanteda.corpora)
dfmat_irish <- dfm(data_corpus_irishbudget2010, remove_punct = TRUE)
tmod_wf <- textmodel_wordfish(dfmat_irish, dir = c(6, 5))
summary(tmod_wf)
dfmat_uk <- dfm(data_corpus_ukmanifestos, remove_punct = TRUE)
wf_uk <- textmodel_wordfish(dfmat_uk, dir = c(83, 74))
How do I get wordfish to work with this corpus?
Try trimming low-frequency words. The longer the time span of a time-series corpus, the more sparse your matrix. There are 101 manifestos in the UK corpus, going back to 1945. A lot of the terms are going to be very rare.
library("quanteda")
## Package version: 1.4.4
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
data(data_corpus_ukmanifestos, package = "quanteda.corpora")
system.time(
wf_uk2 <- dfm(data_corpus_ukmanifestos, remove_numbers = TRUE, remove_punct = TRUE) %>%
dfm_trim(min_termfreq = 10, min_docfreq = 20) %>%
textmodel_wordfish(dir = c(83, 74))
)
## user system elapsed
## 2.274 0.124 2.356
You could also use dfm_wordstem() to reduce the feature set further, but best to do this before the trim operation.
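A sketch of that ordering, stemming first and then trimming; the cutoffs are carried over from above and may need re-tuning once stemming has merged features:
wf_uk3 <- dfm(data_corpus_ukmanifestos, remove_numbers = TRUE, remove_punct = TRUE) %>%
  dfm_wordstem() %>%
  dfm_trim(min_termfreq = 10, min_docfreq = 20) %>%
  textmodel_wordfish(dir = c(83, 74))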

Generating Random Strings

I want to generate random strings of the form ABCDE1234E: each string is 5 uppercase letters, then 4 digits, then 1 uppercase letter.
I figured out a way to create this using the following code.
library(random)
string_5 <- as.vector(randomStrings(n = 5000, len = 5, digits = FALSE, upperalpha = TRUE,
                                    loweralpha = FALSE, unique = TRUE, check = TRUE))
number_4 <- as.vector(randomNumbers(n = 5000, min = 1111, max = 9999, col = 5, base = 10, check = TRUE))
string_1 <- as.vector(randomStrings(n = 5000, len = 1, digits = FALSE, upperalpha = TRUE,
                                    loweralpha = FALSE, unique = FALSE, check = TRUE))
PAN.Number <- paste(string_5, number_4, string_1, sep = "")
But these functions are taking a long time and the random library needs a network connection.
> system.time(string_5 <- as.vector(randomStrings(n=5000, len=5, digits=FALSE, upperalpha=TRUE,
+ loweralpha=FALSE, unique=TRUE, check=TRUE)))
user system elapsed
0.07 0.00 3.18
Is there any method that I could try to reduce the execution time?
I also tried using sample() but I couldn't figure it out.
Using "stringi" as suggested by #akrun will be faster, but the following is also very fast and does not require any additional packages:
myFun <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
Example output:
myFun(10)
## [1] "BZHOF3737P" "EPOWI0674X" "YYWEB2825M" "HQIXJ5187K" "IYIMB2578R"
## [6] "YSGBG6609I" "OBLBL6409Q" "PUMAL5632D" "ABRAT4481L" "FNVEN7870Q"
We can use stri_rand_strings from stringi
library(stringi)
sprintf("%s%s%s", stri_rand_strings(5, 5, '[A-Z]'),
stri_rand_strings(5, 4, '[0-9]'), stri_rand_strings(5, 1, '[A-Z]'))
Or more compactly
do.call(paste0, Map(stri_rand_strings, n = 5, length = c(5, 4, 1),
                    pattern = c('[A-Z]', '[0-9]', '[A-Z]')))
Benchmarks
system.time({
  do.call(paste0, Map(stri_rand_strings, n = 5000, length = c(5, 4, 1),
                      pattern = c('[A-Z]', '[0-9]', '[A-Z]')))
})
#  user  system elapsed
#     0       0       0
I was able to reproduce the slow timings with the OP's method, even for just one part of the expected output:
system.time(string_5 <- as.vector(randomStrings(n=5000, len=5, digits=FALSE, upperalpha=TRUE,
loweralpha=FALSE, unique=TRUE, check=TRUE)))
# user system elapsed
# 0.86 0.24 5.52
You can directly build what you want:
- sample 5 random capital letters
- sample 4 digits
- sample 1 random capital letter
digits <- 0:9
createRandString <- function() {
  v <- c(sample(LETTERS, 5, replace = TRUE),
         sample(digits, 4, replace = TRUE),
         sample(LETTERS, 1, replace = TRUE))
  return(paste0(v, collapse = ""))
}
This will be more easily controlled, and won't take as long.
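For example, to build a whole vector of IDs with this function, replicate() calls it once per string:
# 5000 IDs of the form ABCDE1234E
PAN.Number <- replicate(5000, createRandString())
head(PAN.Number)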
Your performance problem comes from using the random package in the first place: it's understandable that you could find the random::randomStrings() function in an internet search and think it's a good way to generate random strings for use in a program, but the random package is not intended for general-purpose programming. It works by querying the RANDOM.ORG server, which is intrinsically slower than R's built-in pseudo-random number generators.
From one of the vignettes from the random package:
There are a number of situations in which it is desirable to use non-deterministically determined random numbers. Examples include
- to seed distributed computing on different nodes with truly independent seeds;
- to obtain portable initializations for RNGs that do not depend on particular operating system or hardware features;
- to validate simulation results using non-deterministic random numbers;
- to provide indeterministic seeds used for lottery drawings or games ...
Note that most of these examples are about seeding or initializing (these are synonyms) R's built-in pseudo-random number generators, rather than replacing them ...
In case anyone came here looking for a way to generate random file names, here's what I used. I like it for its elegance.
library(dplyr)
runif(1, 1000000000000, 9999999999999) %>% round %>% as.character %>% paste0("/tmp/", ., ".png")
Note: you can easily change how many random strings it generates by changing the 1 in runif() to the number you want
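For example, with the same pipeline, changing the 1 to 3 yields three file names at once:
# three random file names (dplyr is loaded above for %>%)
runif(3, 1000000000000, 9999999999999) %>% round %>% as.character %>% paste0("/tmp/", ., ".png")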
You can use the ASCII table to get a fine control of your final string.
randString <- function(characters = 0, numbers = 0, symbols = 0, lowerCase = 0, upperCase = 0) {
  ASCII <- NULL
  if (symbols > 0)    ASCII <- c(ASCII, sample(c(33:47, 58:64, 91:96, 123:126), symbols))
  if (numbers > 0)    ASCII <- c(ASCII, sample(48:57, numbers))
  if (upperCase > 0)  ASCII <- c(ASCII, sample(65:90, upperCase))
  if (lowerCase > 0)  ASCII <- c(ASCII, sample(97:122, lowerCase))
  if (characters > 0) ASCII <- c(ASCII, sample(c(65:90, 97:122), characters))
  return(rawToChar(as.raw(sample(ASCII, length(ASCII)))))
}
Example:
randString(characters=5, numbers=4)
# [1] "9fKW75o1N"
We can now do this with rowwise() and mutate() from dplyr, together with stri_rand_strings() from stringi:
library(dplyr)
library(stringi)
# assumes an existing data frame `df`; adds one unique ID per row
df %>%
  rowwise() %>%
  mutate(unique_id = paste0(stri_rand_strings(1, 5, "[A-Z]"),
                            stri_rand_strings(1, 4, "[0-9]"),
                            stri_rand_strings(1, 1, "[A-Z]")))
This avoids the need to create a function.

Filter xts objects by common dates

I am stuck with the following code.
For reference, the code is taken from the following website (http://gekkoquant.com/2013/01/21/statistical-arbitrage-trading-a-cointegrated-pair/). I am running the code in RStudio.
library("quantmod")
startDate = as.Date("2013-01-01")
symbolLst<-c("WPL.AX","BHP.AX")
symbolData <- new.env()
getSymbols(symbolLst, env = symbolData, src = "yahoo", from = startDate)
stockPair <- list(
a =coredata(Cl(eval(parse(text=paste("symbolData$\"",symbolLst[1],"\"",sep="")))))
,b = coredata(Cl(eval(parse(text=paste("symbolData$\"",symbolLst[2],"\"",sep="")))))
,hedgeRatio = 0.70 ,name=title)
spread <- stockPair$a - stockPair$hedgeRatio*stockPair$b
I am getting the following error.
Error in stockPair$a - stockPair$hedgeRatio * stockPair$b :
non-conformable arrays
The reason these particular series don't match is that "WPL.AX" has an extra value (for the date 19-05-2014), so the matrix lengths differ from "BHP.AX". How can I solve this issue when loading the data?
I have also tested other stock pairs such as "ANZ","WBC" with the source = "google" which produces two of the same length arrays.
> length(stockPair$a)
[1] 360
> length(stockPair$b)
[1] 359
Add code such as this prior to the stockPair computation, to trim each xts set to the intersection of dates:
common_dates <- as.Date(Reduce(intersect, eapply(symbolData, index)))
symbolData <- eapply(symbolData, `[`, i=common_dates)
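A quick sanity check after the trim (a sketch): every series should now have the same number of rows.
# eapply() returned a named list of xts objects, all restricted to the common dates
sapply(symbolData, nrow)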
Your code works fine if you don't convert your xts object to matrix via coredata. Then Ops.xts will ensure that only the rows with the same index will be subtracted. And fortune(106) applies.
fortunes::fortune(106)
# If the answer is parse() you should usually rethink the question.
# -- Thomas Lumley
# R-help (February 2005)
stockPair <- list(
  a = Cl(symbolData[[symbolLst[1]]]),
  b = Cl(symbolData[[symbolLst[2]]]),
  hedgeRatio = 0.70,
  name = "title")
spread <- stockPair$a - stockPair$hedgeRatio * stockPair$b
Here's an alternative approach:
# merge stocks into a single xts object
stockPair <- do.call(merge, eapply(symbolData, Cl))
# ensure stockPair columns are in the same order as symbolLst, since
# eapply may loop over the environment in an order you don't expect
stockPair <- stockPair[,pmatch(symbolLst, colnames(stockPair))]
colnames(stockPair) <- c("a","b")
# add hedgeRatio and name as xts attributes
xtsAttributes(stockPair) <- list(hedgeRatio=0.7, name="title")
spread <- stockPair$a - attr(stockPair,'hedgeRatio')*stockPair$b

List and description of all packages in CRAN from within R

I can get a list of all the available packages with the function:
ap <- available.packages()
But how can I also get a description of these packages from within R, so I can have a data.frame with two columns: package and description?
Edit of an almost ten-year-old accepted answer. What you likely want is not to scrape (unless you want to practice scraping) but to use an existing interface: tools::CRAN_package_db(). Example:
> db <- tools::CRAN_package_db()[, c("Package", "Description")]
> dim(db)
[1] 18978 2
>
The function (currently) returns 66 columns, of which the two of interest here are a subset.
I actually think you want "Package" and "Title" as the "Description" can run to several lines. So here is the former, just put "Description" in the final subset if you really want "Description":
R> ## from http://developer.r-project.org/CRAN/Scripts/depends.R and adapted
R>
R> require("tools")
R>
R> getPackagesWithTitle <- function() {
+ contrib.url(getOption("repos")["CRAN"], "source")
+ description <- sprintf("%s/web/packages/packages.rds",
+ getOption("repos")["CRAN"])
+ con <- if(substring(description, 1L, 7L) == "file://") {
+ file(description, "rb")
+ } else {
+ url(description, "rb")
+ }
+ on.exit(close(con))
+ db <- readRDS(gzcon(con))
+ rownames(db) <- NULL
+
+ db[, c("Package", "Title")]
+ }
R>
R>
R> head(getPackagesWithTitle()) # I shortened one Title here...
Package Title
[1,] "abc" "Tools for Approximate Bayesian Computation (ABC)"
[2,] "abcdeFBA" "ABCDE_FBA: A-Biologist-Can-Do-Everything of Flux ..."
[3,] "abd" "The Analysis of Biological Data"
[4,] "abind" "Combine multi-dimensional arrays"
[5,] "abn" "Data Modelling with Additive Bayesian Networks"
[6,] "AcceptanceSampling" "Creation and evaluation of Acceptance Sampling Plans"
R>
Dirk has provided a terrific answer, and after finishing my solution and then seeing his, I debated for some time whether to post mine for fear of looking silly. But I decided to post it anyway, for two reasons:
- it is informative to beginning scrapers like myself
- it took me a while to do, so why not :)
I approached this thinking I'd need to do some web scraping and chose crantastic as the site to scrape from. First I'll provide the code, and then two scraping resources that have been very helpful to me as I learn:
library(RCurl)
library(XML)
URL <- "http://cran.r-project.org/web/checks/check_summary.html#summary_by_package"
packs <- na.omit(XML::readHTMLTable(doc = URL, which = 2, header = TRUE,
                                    strip.white = TRUE, as.is = FALSE, sep = ",",
                                    na.strings = c("999", "NA", " "))[, 1])
Trim <- function(x) {
  gsub("^\\s+|\\s+$", "", x)
}
packs <- unique(Trim(packs))
u1 <- "http://crantastic.org/packages/"
len.samps <- 10 # for demo purposes; use:
# len.samps <- length(packs) # for all of them
URL2 <- paste0(u1, packs[seq_len(len.samps)])
scraper <- function(urls) { # function to grab description
  doc <- htmlTreeParse(urls, useInternalNodes = TRUE)
  nodes <- getNodeSet(doc, "//p")[[3]]
  return(nodes)
}
info <- sapply(seq_along(URL2), function(i) try(scraper(URL2[i]), TRUE))
info2 <- sapply(info, function(x) { # replace errors with NA
  if (class(x)[1] != "XMLInternalElementNode") {
    NA
  } else {
    Trim(gsub("\\s+", " ", xmlValue(x)))
  }
})
pack_n_desc <- data.frame(package = packs[seq_len(len.samps)],
                          description = info2) # make a data frame of it all
Resources:
- talkstats.com thread on web scraping (great beginner examples)
- w3schools.com site on html stuff (very helpful)
I wanted to try to do this with an HTML scraper (rvest) as an exercise, since available.packages() in the OP does not contain the package Descriptions.
library('rvest')
url <- 'https://cloud.r-project.org/web/packages/available_packages_by_name.html'
webpage <- read_html(url)
data_html <- html_nodes(webpage,'tr td')
length(data_html)
P1 <- html_nodes(webpage,'td:nth-child(1)') %>% html_text(trim=TRUE) # XML: The Package Name
P2 <- html_nodes(webpage,'td:nth-child(2)') %>% html_text(trim=TRUE) # XML: The Description
P1 <- P1[lengths(P1) > 0 & P1 != ""] # Remove NULL and empty ("") items
length(P1); length(P2);
mdf <- data.frame(P1, P2, row.names=NULL)
colnames(mdf) <- c("PackageName", "Description")
# This is the problem! It lists large sets column-by-column,
# instead of row-by-row. Try with the full list to see what happens.
print(mdf, right=FALSE, row.names=FALSE)
# PackageName Description
# A3 Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels
# abbyyR Access to Abbyy Optical Character Recognition (OCR) API
# abc Tools for Approximate Bayesian Computation (ABC)
# abc.data Data Only: Tools for Approximate Bayesian Computation (ABC)
# ABC.RAP Array Based CpG Region Analysis Pipeline
# ABCanalysis Computed ABC Analysis
# For small sets we can use either:
# mdf[1:6,] #or# head(mdf, 6)
However, although this works quite well for a small array/data-frame subset, I ran into a display problem with the full list, where the data would be shown either column-by-column or unaligned. It would have been great to have this paged and properly formatted in a new window somehow. I tried using page(), but I couldn't get it to work very well.
EDIT:
The recommended method is not the above, but rather using Dirk's suggestion (from the comments below):
db <- tools::CRAN_package_db()
colnames(db)
mdf <- data.frame(db[, "Package"], db[, "Description"]) # name-based indexing is more robust than db[, 1] and db[, 52]
colnames(mdf) <- c("Package", "Description")
print(mdf, right=FALSE, row.names=FALSE)
However, this still suffers from the display problem mentioned...
