Generating Random Strings - r

I want to generate random strings of the form ABCDE1234E, i.e., each string contains 5 letters, then 4 digits, then 1 letter.
I figured out a way to create this using the following code.
library(random)
string_5 <- as.vector(randomStrings(n=5000, len=5, digits=FALSE, upperalpha=TRUE,
loweralpha=FALSE, unique=TRUE, check=TRUE))
number_4 <- as.vector(randomNumbers(n=5000, min=1111, max=9999, col=5, base=10, check=TRUE))
string_1 <- as.vector(randomStrings(n=5000, len=1, digits=FALSE, upperalpha=TRUE,
loweralpha=FALSE, unique=FALSE, check=TRUE))
PAN.Number <- paste(string_5,number_4,string_1,sep = "")
But these functions are taking a long time and the random library needs a network connection.
> system.time(string_5 <- as.vector(randomStrings(n=5000, len=5, digits=FALSE, upperalpha=TRUE,
+ loweralpha=FALSE, unique=TRUE, check=TRUE)))
user system elapsed
0.07 0.00 3.18
Is there any method that I could try to reduce the execution time?
I also tried using sample() but I couldn't figure it out.

Using "stringi" as suggested by @akrun will be faster, but the following is also very fast and does not require any additional packages:
myFun <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
Example output:
myFun(10)
## [1] "BZHOF3737P" "EPOWI0674X" "YYWEB2825M" "HQIXJ5187K" "IYIMB2578R"
## [6] "YSGBG6609I" "OBLBL6409Q" "PUMAL5632D" "ABRAT4481L" "FNVEN7870Q"
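One caveat: the randomStrings() call in the question used unique=TRUE, whereas independent sampling can occasionally produce the same string twice. A minimal sketch of one way to guarantee distinct values is to drop duplicates and top up until n strings remain. The helper name unique_ids() is hypothetical (it is not part of any package), and the digit part here samples 0:9999 so that "0000" is also possible.

```r
# Hypothetical helper: draw n distinct strings of the form AAAAA9999A (base R only).
unique_ids <- function(n) {
  ids <- character(0)
  while (length(ids) < n) {
    need <- n - length(ids)
    # five random capital letters, pasted position by position
    letters5 <- do.call(paste0, replicate(5, sample(LETTERS, need, TRUE), simplify = FALSE))
    # four digits, zero-padded
    digits4 <- sprintf("%04d", sample(0:9999, need, TRUE))
    # one trailing capital letter
    letter1 <- sample(LETTERS, need, TRUE)
    # keep only distinct strings; the loop tops up if any were dropped
    ids <- unique(c(ids, paste0(letters5, digits4, letter1)))
  }
  ids
}

set.seed(42)
ids <- unique_ids(5000)
length(ids)         # 5000
anyDuplicated(ids)  # 0, i.e. no repeats
```

With roughly 3 * 10^12 possible strings, a collision among 5000 draws is rare, so the loop almost always finishes in one pass.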

We can use stri_rand_strings from stringi
library(stringi)
sprintf("%s%s%s", stri_rand_strings(5, 5, '[A-Z]'),
stri_rand_strings(5, 4, '[0-9]'), stri_rand_strings(5, 1, '[A-Z]'))
Or more compactly
do.call(paste0, Map(stri_rand_strings, n=5, length=c(5, 4, 1),
pattern = c('[A-Z]', '[0-9]', '[A-Z]')))
Benchmarks
system.time({
do.call(paste0, Map(stri_rand_strings, n=5000, length=c(5, 4, 1),
pattern = c('[A-Z]', '[0-9]', '[A-Z]')))
})
# user system elapsed
# 0 0 0
I was able to reproduce the slow timings for just one part of the expected output using the OP's method:
system.time(string_5 <- as.vector(randomStrings(n=5000, len=5, digits=FALSE, upperalpha=TRUE,
loweralpha=FALSE, unique=TRUE, check=TRUE)))
# user system elapsed
# 0.86 0.24 5.52

You can directly perform what you want:
Sample random 5 capital letters
Sample 4 digits
Sample 1 random capital letter
digits = 0:9
createRandString<- function() {
v = c(sample(LETTERS, 5, replace = TRUE),
sample(digits, 4, replace = TRUE),
sample(LETTERS, 1, replace = TRUE))
return(paste0(v,collapse = ""))
}
This will be more easily controlled, and won't take as long.
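The function above returns one string per call. As a sketch, many strings can be collected with replicate(); the function is repeated here only so that the snippet runs on its own, and the seed value is arbitrary.

```r
digits <- 0:9

# Same idea as above: 5 capital letters, 4 digits, 1 capital letter.
createRandString <- function() {
  v <- c(sample(LETTERS, 5, replace = TRUE),
         sample(digits, 4, replace = TRUE),
         sample(LETTERS, 1, replace = TRUE))
  paste0(v, collapse = "")
}

set.seed(1)
# Call the generator 5000 times and collect the results in a character vector.
ids <- replicate(5000, createRandString())
head(ids, 3)
```

This calls sample() three times per string, so it will be slower than the fully vectorised answers, but it is still far faster than a network round trip.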

Your performance problem comes from using the random package in the first place: it's understandable that you could find the random::randomStrings() function in an internet search and think it's a good way to generate random strings for use in a program, but the random package is not intended for general-purpose programming. It works by querying the RANDOM.ORG server, which is intrinsically slower than R's built-in pseudo-random number generators.
From one of the vignettes from the random package:
There are a number of situations in which it is desirable to use non-deterministically determined
random numbers. Examples include
- to seed distributed computing on different nodes with truly indepedent seeds;
- to obtain portable initializations for RNGs that do not depend on particular operating system
or hardware features;
- to validate simulation results using non-deterministic random numbers;
- to provide indeterministic seeds used for lottery drawings or games ...
Note that most of these examples are about seeding or initializing (these are synonyms) R's built-in pseudo-random number generators, rather than replacing them ...
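To illustrate what seeding means for R's built-in generator, here is a small sketch (the seed value 2021 is arbitrary): resetting the seed replays exactly the same pseudo-random draw, which is the behaviour a non-deterministic source deliberately does not have.

```r
# Seed the built-in generator, draw five letters, then repeat with the same seed.
set.seed(2021)
first <- paste(sample(LETTERS, 5, replace = TRUE), collapse = "")

set.seed(2021)
second <- paste(sample(LETTERS, 5, replace = TRUE), collapse = "")

identical(first, second)  # TRUE: same seed, same sequence
```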

In case anyone came here looking for a way to generate random file names, here's what I used. I like it for its elegance
library(dplyr)
runif(1, 1000000000000, 9999999999999) %>% round %>% as.character %>% paste0("/tmp/", ., ".png")
Note: you can easily change how many random strings it generates by changing the 1 in runif() to the number you want
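If the goal is only a set of unused temporary file names, base R's tempfile() may be enough and avoids the extra package. A minimal sketch, where the "plot_" prefix is just an illustrative choice:

```r
# tempfile() is vectorised: recycling fileext three times yields three names.
files <- tempfile(pattern = "plot_", tmpdir = tempdir(), fileext = rep(".png", 3))
basename(files)
```

The names are placed in the session's temporary directory rather than a hard-coded /tmp/, so this also works on systems without that path.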

You can use the ASCII table to get a fine control of your final string.
randString <- function(characters=0, numbers=0, symbols=0, lowerCase=0, upperCase=0) {
ASCII <- NULL
if(symbols>0) ASCII <- c(ASCII, sample(c(33:47, 58:64, 91:96, 123:126), symbols))
if(numbers>0) ASCII <- c(ASCII, sample(48:57, numbers))
if(upperCase>0) ASCII <- c(ASCII, sample(65:90, upperCase))
if(lowerCase>0) ASCII <- c(ASCII, sample(97:122, lowerCase))
if(characters>0) ASCII <- c(ASCII, sample(c(65:90, 97:122), characters))
return( rawToChar(as.raw(sample(ASCII, length(ASCII)))) )
}
Example:
randString(characters=5, numbers=4)
# [1] "9fKW75o1N"
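For readers unfamiliar with the conversion used above, this small sketch shows how ASCII codes map to characters and back. It also shows drawing with replace = TRUE, which is an assumption on my part about what you may want: without it, sample() cannot return more values than there are codes in the range (for example, more than 10 digits).

```r
# ASCII codes 65:70 are "A".."F" and 48:52 are "0".."4".
codes <- c(65:70, 48:52)
s <- rawToChar(as.raw(codes))
s               # "ABCDEF01234"

# The reverse direction: character to integer code.
utf8ToInt("A")  # 65

# Twelve digits need replace = TRUE because only ten digit codes exist.
many_digits <- rawToChar(as.raw(sample(48:57, 12, replace = TRUE)))
nchar(many_digits)  # 12
```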

We can now do this with "rowwise" and "mutate" from dplyr, with library(stringi) for the stri_rand_strings function:
df %>%
rowwise() %>%
mutate(unique_id = paste0(stri_rand_strings(1, 5, "[A-Z]"), stri_rand_strings(1, 4, "[0-9]"), stri_rand_strings(1, 1, "[A-Z]")))
This avoids the need to create a function.

Related

k-fold cross validation in quanteda

I've been using the quanteda SML workflow as described in the quanteda tutorial (https://tutorials.quanteda.io/machine-learning/nb/) and found it extremely helpful to set up my own classification task. However, instead of the fixed held-out train/test sampling I would like to use a k-fold cross-validation. Could you point me towards the best way to implement it into the workflow? Is there an easy way to apply it in quanteda?
Many thanks
I tried to add a cross validation based on this example:
https://rdrr.io/github/quanteda/quanteda.classifiers/man/crossval.html
require(quanteda)
require(quanteda.textmodels)
require(caret)
corp_movies <- data_corpus_moviereviews
summary(corp_movies, 5)
# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)
# tokenize texts
toks_movies <- tokens(corp_movies, remove_punct = TRUE, remove_number = TRUE) %>%
tokens_remove(pattern = stopwords("en")) %>%
tokens_wordstem()
dfmt_movie <- dfm(toks_movies)
# get training set
dfmat_training <- dfm_subset(dfmt_movie, id_numeric %in% id_train)
# get test set (documents not in id_train)
dfmat_test <- dfm_subset(dfmt_movie, !id_numeric %in% id_train)
tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb)
dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_training))
actual_class <- dfmat_matched$sentiment
predicted_class <- predict(tmod_nb, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
tab_class
require(confusionMatrix)
confusionMatrix(tab_class, mode = "everything", positive = "pos")
#n-fold cross validation
require(crossval)
dfmat <- dfm(toks_movies)
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)
crossval(tmod, k = 5, by_class = FALSE)
crossval(tmod, k = 5, by_class = FALSE, verbose = TRUE)
but it returns "Error in group.samples(Y) : argument "Y" is missing, with no default"
This should probably be a comment, but I cannot post comments yet. I think the problem is that you are calling crossval() from the wrong package. The link you shared documents crossval() from the remote quanteda/quanteda.classifiers package, not the function from the separate crossval package that your code loads. The function you used has a different definition and presumably expects a different pipeline: it requires additional X and Y arguments, and their absence is the reason for your error.
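If installing the remote package is not an option, the folds can also be built by hand. The following is only a sketch in base R, with a hypothetical helper name kfold_indices(); it produces the train/test document indices, and the model fitting itself is left out.

```r
# Hypothetical helper: randomly assign n documents to k folds of (almost) equal size.
kfold_indices <- function(n, k) {
  fold_id <- sample(rep(seq_len(k), length.out = n))
  lapply(seq_len(k), function(i) {
    list(train = which(fold_id != i), test = which(fold_id == i))
  })
}

set.seed(300)
folds <- kfold_indices(n = 2000, k = 5)

# Each document is held out exactly once; with n = 2000 and k = 5 every fold has 400.
lengths(lapply(folds, `[[`, "test"))
```

Inside a loop over folds, the train and test indices would take the place of id_train in the question's dfm_subset() calls, with textmodel_nb() fitted and evaluated once per fold and the results averaged.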

tokens_compound() in quanteda changes the order of features

I found tokens_compound() in quanteda changes the order of tokens across different R sessions. That is, the result varies every time after restarting a session even if a seed value is fixed, though it does not change in a single session.
Here is the replication procedure:
Find collocations, compound tokens, and save them.
library(quanteda)
set.seed(12345)
data(data_corpus_inaugural)
toks <- data_corpus_inaugural %>%
tokens(remove_punct = TRUE,
remove_symbol = TRUE,
padding = TRUE) %>%
tokens_tolower()
col <- toks %>%
textstat_collocations()
toks.col <- toks %>%
tokens_compound(pattern = col[col$z > 3])
write(attr(toks.col, "types"), "col1.txt")
End and restart R session and run the above code again with "col1.txt" replaced by "col2.txt".
Compare the two sets of tokens and find they are different.
col1 <- read.table("col1.txt")
col2 <- read.table("col2.txt")
identical(col1$V1, col2$V1) # This should return FALSE.
col1$V1[head(which(col1$V1 != col2$V1))]
col2$V1[head(which(col1$V1 != col2$V1))]
This does not matter for many cases but the result of LDA (by {topicmodels}) changes in different sessions. I guess so because the result of LDA is constant if I reset the order of features in tokens by as.list() and thereafter as.tokens() (dfm_sort() does not work for this).
I wonder whether this happens only for me (Ubuntu 18.04.5, R 4.0.4, and quanteda 2.1.2) and would be happy to hear another (easier) solution.
Updated on Feb 20
For example, the output of LDA is not reproduced.
lis <- list()
for (i in seq_len(2)) {
set.seed(123)
lis[[i]] <- tokens_compound(toks, pattern = col[col$z > 3]) %>%
dfm() %>%
convert(to = "topicmodels") %>%
LDA(k = 5,
method = "Gibbs",
control = list(seed = 12345,
iter = 100))
}
head(lis[[1]]@gamma)
head(lis[[2]]@gamma)
An interesting investigation, but this is neither an error nor anything to be concerned about. Within a quanteda tokens object, the order of the types is not deterministic after a processing step such as tokens_compound(). This is because the function is parallelised in C++, and how its threads operate is not fixed by set.seed() in R. But this does not affect the important part, which is the set of types, or anything about the tokens themselves. If you want the order of the types that you extract to be the same, then you should sort them upon extraction.
library("quanteda")
## Package version: 2.1.2
toks <- data_corpus_inaugural %>%
tokens(
remove_punct = TRUE,
remove_symbol = TRUE,
padding = TRUE
) %>%
tokens_tolower()
col <- quanteda.textstats::textstat_collocations(toks)
It turns out that you do not need to save the output or restart R - this happens within a single session.
# types are differently indexed, but are the same set
lis <- list()
for (i in seq_len(2)) {
set.seed(123)
toks.col <- tokens_compound(toks, pattern = col[col$z > 3])
lis <- c(lis, list(types = types(toks.col)))
}
dframe <- data.frame(lis)
sum(dframe$types != dframe$types.1)
## [1] 19898
head(dframe[dframe$types != dframe$types.1, ])
## types types.1
## 8897 at_this_second my_fellow_citizens
## 8898 to_take_the_oath_of_the_presidential_office no_people
## 8899 there_is on_earth
## 8900 occasion_for cause_to_be_thankful
## 8901 an_extended this_is_said
## 8902 there_was spirit_of
However the (unordered) set of types is identical:
# but
setequal(dframe$types, dframe$types.1)
## [1] TRUE
More important is that when we compare the values of each token, which is ordered, these are identical:
# tokens are the same
lis <- list()
for (i in seq_len(2)) {
set.seed(123)
toks.col <- tokens_compound(toks, pattern = col[col$z > 3])
lis <- c(lis, list(toks = as.character(toks.col)))
}
dframe <- data.frame(lis)
all.equal(dframe$toks, dframe$toks.1)
## [1] TRUE
Created on 2021-02-18 by the reprex package (v1.0.0)
An additional comment, whose importance is underscored by this analysis: We strongly discourage direct access to object attributes. Use types(x) as above, not attr(x, "types"). The former will always work. The latter relies on our implementation of the object, which may change as we improve the package.
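As a small illustration of sorting upon extraction, here is a base R sketch with two made-up type vectors (the values are borrowed from the output above, but the vectors themselves are hypothetical). With a real tokens object the equivalent would be sort(types(toks.col)).

```r
# Two runs returning the same set of types in a different order (made-up example).
types_run1 <- c("at_this_second", "there_is", "no_people")
types_run2 <- c("no_people", "at_this_second", "there_is")

identical(types_run1, types_run2)              # FALSE: order differs
setequal(types_run1, types_run2)               # TRUE: same set
identical(sort(types_run1), sort(types_run2))  # TRUE: sorting makes the order reproducible
```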

How to coerce stslist.freq to dataframe

I am doing some descriptive sequence analysis using the "TraMineR" library. I want to report my findings via R Markdown in HTML format. For formatting tables I use "kable" and "kableExtra".
To get the frequencies and proportions of the most common sequences I use seqtab(). The result is an stslist.freq object. When I try to coerce it to a data frame, the data frame does not contain any frequencies or proportions.
I tried printing the result of seqtab() and storing what print() returns. This gives me the data frame I want. However, there are two "problems" with that: (1) I don't understand what is happening here and it seems like a "dirty" trick, (2) the output of the print command also ends up in my final HTML document unless I split the code into multiple chunks and disable the output of that specific chunk.
Here is some code to replicate the problem:
library("TraMineR")
#Data creation
data.long <- data.frame(
id=rep(1:50, each=4),
time = c(0,1,2,3),
status = sample(letters[1:2], 200, replace = TRUE),
weight=rep(runif(50, 0, 1), each=4)
)
#reshape
data.wide <- reshape(data.long, v.names = "status", idvar="id", direction="wide", timevar="time")
#sequence
sequence <- seqdef(data.wide,
var=c("status.0", "status.1", "status.2", "status.3"),
weights=data.wide$weight)
#frequencies of sequences
##doesn't work:
seqtab.df1 <- as.data.frame(seqtab(sequence))
##works:
seqtab.df2 <- print(seqtab(sequence))
I expect the dataframe to be the same as the one saved in seqtab.df2, however either without using the print command or with "silently" (no output printed) using the print command.
Thank you very much for your help and let me know if I forgot something to make answering the question possible!
If you look at the class() of the object returned by seqtab, it has the type
class(seqtab(sequence))
# [1] "stslist.freq" "stslist" "data.frame"
So if we look at exactly what happens in the print method for such an object, we get a clue about what is going on:
TraMineR:::print.stslist.freq
# function (x, digits = 2, width = 1, ...)
# {
# table <- attr(x, "freq")
# print(table, digits = digits, width = width, ...)
# }
# <bytecode: 0x0000000003e831f8>
# <environment: namespace:TraMineR>
We see that what it's really giving you is the "freq" attribute. You can extract this directly and skip the print()
attr(seqtab(sequence), "freq")
# Freq Percent
# a/3-b/1 4.283261 20.130845
# b/1-a/1-b/2 2.773341 13.034390
# a/2-b/1-a/1 2.141982 10.067073
# a/1-b/1-a/1-b/1 1.880359 8.837476
# a/1-b/2-a/1 1.723489 8.100203
# b/1-a/2-b/1 1.418302 6.665861
# b/2-a/1-b/1 1.365099 6.415813
# a/1-b/3 1.241644 5.835586
# a/1-b/1-a/2 1.164434 5.472710
# a/2-b/2 1.092656 5.135360
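For a table in the report, it can help to turn the row names into a regular column. The sketch below uses a made-up stand-in object so that it runs without TraMineR; I am assuming the "freq" attribute is a data frame with Freq and Percent columns, as the printed output suggests.

```r
# Stand-in for the seqtab() result: a data frame carrying a "freq" attribute.
tab <- data.frame(state = c("a", "b"))
attr(tab, "freq") <- data.frame(Freq = c(3, 1), Percent = c(75, 25),
                                row.names = c("a/4", "b/4"))

# Pull out the attribute and move the sequence labels into their own column.
freq_df <- attr(tab, "freq")
freq_df <- data.frame(Sequence = rownames(freq_df), freq_df, row.names = NULL)
freq_df
```

The resulting data frame can be passed to kable() directly, and nothing is printed as a side effect.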

Truncate decimal to specified places

This seems like it should be a fairly easy problem to solve but I am having some trouble locating an answer.
I have a vector which contains long decimals and I want to truncate it to a specific number of decimals. I do not wish to round it, but rather just remove the values beyond my desired number of decimals.
For example I would like 0.123456789 to return 0.1234 if I desired 4 decimal digits. This is not an issue of printing a specific number of digits but rather returning the original value truncated to a given number.
Thanks.
trunc(x*10^4)/10^4
yields 0.1234 like expected.
More generally,
trunc <- function(x, ..., prec = 0) base::trunc(x * 10^prec, ...) / 10^prec;
print(trunc(0.123456789, prec = 4)) # 0.1234
print(trunc(14035, prec = -2)) # 14000
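Two properties of this approach are worth seeing in a sketch. The helper name truncate_to() is hypothetical and is used here to avoid masking base::trunc. Truncation goes toward zero for negative numbers, and because the multiplication happens in binary floating point, a value that looks exact in decimal can end up one step lower than expected.

```r
# Hypothetical helper: truncate toward zero at a given number of decimal places.
truncate_to <- function(x, digits = 0) trunc(x * 10^digits) / 10^digits

# Works on vectors, and negative values are truncated toward zero (unlike floor()).
truncate_to(c(0.123456789, -0.123456789), 4)  # 0.1234 -0.1234

# Floating-point caveat: 0.29 * 100 is stored as 28.999999999999996,
# so truncating to 2 places gives 0.28 rather than 0.29.
truncate_to(0.29, 2)
```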
I used the techniques above for a long time. One day I ran into issues when copying the results to a text file, and I solved my problem this way:
trunc_number_n_decimals <- function(numberToTrunc, nDecimals){
numberToTrunc <- numberToTrunc + (10^-(nDecimals+5))
splitNumber <- strsplit(x=format(numberToTrunc, digits=20, scientific=FALSE), split="\\.")[[1]]
decimalPartTrunc <- substr(x=splitNumber[2], start=1, stop=nDecimals)
truncatedNumber <- as.numeric(paste0(splitNumber[1], ".", decimalPartTrunc))
return(truncatedNumber)
}
print(trunc_number_n_decimals(9.1762034354551236, 6), digits=14)
[1] 9.176203
print(trunc_number_n_decimals(9.1762034354551236, 7), digits=14)
[1] 9.1762034
print(trunc_number_n_decimals(9.1762034354551236, 8), digits=14)
[1] 9.17620343
print(trunc_number_n_decimals(9.1762034354551236, 9), digits=14)
[1] 9.176203435
This solution is very handy when it's necessary to write a number with many decimals, such as 16, to a file.
Just remember to convert the number to a string with format() before writing it to the file:
numberToWrite <- format(trunc_number_n_decimals(9.1762034354551236, 9), digits=20)
Not the most elegant way, but it'll work.
string_it<-sprintf("%06.9f", old_numbers)
pos_list<-gregexpr(pattern="\\.", string_it)
pos<-unlist(lapply(pos_list, '[[', 1)) # This returns a vector with the first elements
# You're probably going to have to play around with the pos offsets here
new_number<-as.numeric(substring(string_it, pos-1,pos+4))

using hash to determine whether 2 dataframes are identical (PART 01)

I created a dataset from the WHO ATC/DDD Index a few months ago, and I want to check whether the online database is still unchanged today. I downloaded it again and am trying to use the digest package in R to do the comparison.
The two datasets (in txt format) can be downloaded here. (I am aware that you may consider the files unsafe, but I don't know how to generate a dummy dataset that replicates the issue, so I uploaded the real data.)
And I have written a little script as below:
library(digest)
ddd.old <- read.table("ddd.table.old.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.new <- read.table("ddd.table.new.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.old[,"ddd"] <- as.character(ddd.old[,"ddd"])
ddd.new[,"ddd"] <- as.character(ddd.new[,"ddd"])
ddd.old <- data.frame(ddd.old, hash = apply(ddd.old, 1, digest),stringsAsFactors=FALSE)
ddd.new <- data.frame(ddd.new, hash = apply(ddd.new, 1, digest),stringsAsFactors=FALSE)
ddd.old <- ddd.old[order(ddd.old[,"hash"]),]
ddd.new <- ddd.new[order(ddd.new[,"hash"]),]
And something really interesting happens when I do the checking:
> table(ddd.old[,"hash"]%in%ddd.new[,"hash"]) #line01
TRUE
506
> table(ddd.new[,"hash"]%in%ddd.old[,"hash"]) #line02
TRUE
506
> digest(ddd.old[,"hash"])==digest(ddd.new[,"hash"]) #line03
[1] TRUE
> digest(ddd.old)==digest(ddd.new) #line04
[1] FALSE
line01 and line02 show that every row in ddd.old can be found in ddd.new, and vice versa.
line03 shows that the hash columns of the two data frames are the same.
line04 shows that the two data frames are different.
What is happening? Both data frames have identical rows (from line01 and line02) in the same order (from line03), yet they are different (from line04)?
Or have I misunderstood digest? Thanks.
Read in data as before.
ddd.old <- read.table("ddd.table.old.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.new <- read.table("ddd.table.new.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.old[,"ddd"] <- as.character(ddd.old[,"ddd"])
ddd.new[,"ddd"] <- as.character(ddd.new[,"ddd"])
Like Marek said, start by checking for differences with all.equal.
all.equal(ddd.old, ddd.new)
[1] "Component 6: 4 string mismatches"
[2] "Component 8: 24 string mismatches"
So we just need to look at columns 6 and 8.
different.old <- ddd.old[, c(6, 8)]
different.new <- ddd.new[, c(6, 8)]
Hash these columns.
hash.old <- apply(different.old, 1, digest)
hash.new <- apply(different.new, 1, digest)
And find the rows where they don't match.
different_rows <- which(hash.old != hash.new) #which is optional
Finally, combine the datasets.
cbind(different.old[different_rows, ], different.new[different_rows, ])
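Since the original files are not reproduced here, the same workflow can be tried on a toy example. This is only a sketch with made-up data (the column names code and ddd and their values are placeholders), and it compares the cells directly instead of hashing them.

```r
# Two small data frames that differ in a single cell (made-up data).
old <- data.frame(code = c("A01", "A02", "A03"), ddd = c("1", "2", "3"),
                  stringsAsFactors = FALSE)
new <- data.frame(code = c("A01", "A02", "A03"), ddd = c("1", "2.5", "3"),
                  stringsAsFactors = FALSE)

# all.equal() names the component (column) that differs.
all.equal(old, new)

# Element-wise comparison, then the rows with at least one mismatch.
changed <- which(rowSums(old != new) > 0)

# Show the old and new values side by side for the changed rows.
cbind(old[changed, ], ddd.new = new[changed, "ddd"])
```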

Resources