A faster way to generate a vector of UUIDs in R - r

The code below takes about 15 seconds to generate a vector of 10k UUIDs. I will need to generate 1M or more and I calculate that this will take 15 * 10 * 10 / 60 minutes, or about 25 minutes. Is there a faster way to achieve this?
library(uuid)
library(dplyr)
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), UUIDgenerate )
end_time <- Sys.time()
end_time - start_time
# Time difference of 15.072 secs
Essentially, I'm searching for a method for R that manages to achieve the performance boost described here for Java: Performance of Random UUID generation with Java 7 or Java 6
They should be RFC 4122 compliant but the other requirements are flexible.

Bottom line up front: no, there is currently no way to speed up generation of a lot of UUIDs with uuid without compromising the core premise of uniqueness. (Using uuid, that is.)
In fact, your suggestion to use use.time=FALSE has significantly bad ramifications (on windows). See below.
It is possible to get faster performance at scale, just not with uuid. See below.
uuid on Windows
Performance of uuid::UUIDgenerate should take into account the OS. More specifically, the source of randomness. It's important to look at performance, yes, where:
library(microbenchmark)
microbenchmark(
rf=replicate(1000, uuid::UUIDgenerate(FALSE)),
rt=replicate(1000, uuid::UUIDgenerate(TRUE)),
sf=sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
st=sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rf 8.675561 9.330877 11.73299 10.14592 11.75467 66.2435 100
# rt 89.446158 90.003196 91.53226 90.94095 91.13806 136.9411 100
# sf 8.570900 9.270524 11.28199 10.22779 12.06993 24.3583 100
# st 89.359366 90.189178 91.73793 90.95426 91.89822 137.4713 100
... so using use.time=FALSE is always faster. (I included the sapply examples for comparison with your answer's code, to show that replicate is never slower. Use replicate here unless you feel you need the numeric argument for some reason.)
However, there is a problem:
R.version[1:3]
# _
# platform x86_64-w64-mingw32
# arch x86_64
# os mingw32
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 20
Given that a UUID is intended to be unique each time called, this is disturbing, and is a symptom of insufficient randomness on windows. (Does WSL provide a way out for this? Another research opportunity ...)
uuid on Linux
For comparison, the same results on a non-windows platform:
microbenchmark(
rf=replicate(1000, uuid::UUIDgenerate(FALSE)),
rt=replicate(1000, uuid::UUIDgenerate(TRUE)),
sf=sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
st=sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rf 20.852227 21.48981 24.90932 22.30334 25.11449 74.20972 100
# rt 9.782106 11.03714 14.15256 12.04848 15.41695 100.83724 100
# sf 20.250873 21.39140 24.67585 22.44717 27.51227 44.43504 100
# st 9.852275 11.15936 13.34731 12.11374 15.03694 27.79595 100
R.version[1:3]
# _
# platform x86_64-pc-linux-gnu
# arch x86_64
# os linux-gnu
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 1000
(I'm slightly intrigued by the fact that use.time=FALSE on linux takes twice as long as on windows ...)
UUID generation with a SQL server
If you have access to a SQL server (you almost certainly do ... see SQLite ...), then you can deal with this scale problem by employing the server's implementation of UUID generation, recognizing that there are some slight differences.
(Side note: there are "V4" (completely random), "V1" (time-based), and "V1mc" (time-based and includes the system's mac address) UUIDs. uuid gives V4 if use.time=FALSE and V1 otherwise, encoding the system's mac address.)
Some performance comparisons on windows (all times in seconds):
# n uuid postgres sqlite sqlserver
# 1 100 0 1.23 1.13 0.84
# 2 1000 0.05 1.13 1.21 1.08
# 3 10000 0.47 1.35 1.45 1.17
# 4 100000 5.39 3.10 3.50 2.68
# 5 1000000 63.48 16.61 17.47 16.31
The use of SQL has some overhead that does not take long to overcome when done at scale.
PostgreSQL needs the uuid-ossp extension, installable with
CREATE EXTENSION "uuid-ossp"
Once installed/available, you can generate n UUIDs with:
n <- 3
pgcon <- DBI::dbConnect(...)
DBI::dbGetQuery(pgcon, sprintf("select uuid_generate_v1mc() as uuid from generate_series(1,%d)", n))
# uuid
# 1 53cd17c6-3c21-11e8-b2bf-7bab2a3c8486
# 2 53cd187a-3c21-11e8-b2bf-dfe12d92673e
# 3 53cd18f2-3c21-11e8-b2bf-d3c64c6ad73f
Other UUID functions exists. https://www.postgresql.org/docs/9.6/static/uuid-ossp.html
SQLite includes limited ability to do it, but this hack works well enough for a V4-style UUID (length n):
sqlitecon <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # or your own
DBI::dbGetQuery(sqlitecon, sprintf("
WITH RECURSIVE cnt(x) as (
select 1 union all select x+1 from cnt limit %d
)
select (hex(randomblob(4))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(6))) as uuid
from cnt", n))
# uuid
# 1 EE6B08DA-2991-BF82-55DD-78FEA48ABF43
# 2 C195AAA4-67FC-A1C0-6675-E4C5C74E99E2
# 3 EAC159D6-7986-F42C-C5F5-35764544C105
This takes a little pain to format it the same, a nicety at best. You might find small performance improvements by not clinging to this format.)
SQL Server requires temporarily creating a table (with newsequentialid()), generating a sequence into it, pulling the automatically-generated IDs, and discarding the table. A bit over-the-top, especially considering the ease of using SQLite for it, but YMMV. (No code offered, it doesn't add much.)
Other considerations
In addition to execution time and sufficient-randomness, there are various discussions around (uncited for now) with regards to database tables that indicate performance impacts by using non-consecutive UUIDs. This has to do with index pages and such, outside the scope of this answer.
However, assuming this is true ... with the assumption that rows inserted at around the same time (temporally correlated) are often grouped together (directly or sub-grouped), then it is a good thing to keep same-day data with UUID keys in the same db index-page, so V4 (completely random) UUIDs may decrease DB performance with large groups (and large tables). For this reason, I personally prefer V1 over V4.
Other (still uncited) discussions consider including a directly-traceable MAC address in the UUID to be a slight breach of internal information. For this reason, I personally lean towards V1mc over V1.
(But I don't yet have a way to do this well with RSQLite, so I'm reliant on having postgresql nearby. Fortunately, I use postgresql enough for other things that I keep an instance around with docker on windows.)

Providing the option use.time will significantly speed up the process. It can be set to either TRUE or FALSE, to determine if the UUIDs are time-based or not. In both cases, it will be significantly faster than not specifying this option.
For 10k UUIDs,
library(uuid)
library(dplyr)
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), function(ign) UUIDgenerate(FALSE) )
end_time <- Sys.time()
end_time - start_time
# 10k: 0.01399994 secs
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), function(ign) UUIDgenerate(TRUE) )
end_time <- Sys.time()
end_time - start_time
# 10k: 0.01100016 secs
Even scaling up to 100M, still gives a faster run-time than the original 15 seconds.
start_time <- Sys.time()
temp <- sapply( seq_along(1:100000000), function(ign) UUIDgenerate(FALSE) )
end_time <- Sys.time()
end_time - start_time
# 100M: 1.154 secs
start_time <- Sys.time()
temp <- sapply( seq_along(1:100000000), function(ign) UUIDgenerate(TRUE) )
end_time <- Sys.time()
end_time - start_time
# 100M: 3.7586 secs

Related

sparklyr connecting to kafka streams/topics

I'm having difficulty connecting to and retrieving data from a kafka instance. Using python's kafka-python module, I can connect (using the same connection parameters), see the topic, and retrieve data, so the network is viable, there is no authentication problem, the topic exists, and data exists in the topic.
On R-4.0.5 using sparklyr-1.7.2, connecting to kafka-2.8
library(sparklyr)
spark_installed_versions()
# spark hadoop dir
# 1 2.4.7 2.7 /home/r2/spark/spark-2.4.7-bin-hadoop2.7
# 2 3.1.1 3.2 /home/r2/spark/spark-3.1.1-bin-hadoop3.2
sc <- spark_connect(master = "local", version = "2.4",
config = list(
sparklyr.shell.packages = "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0"
))
system.time({
Z <- stream_read_kafka(
sc,
options = list(
kafka.bootstrap.servers="11.22.33.44:5555",
subscribe = "mytopic"))
})
# user system elapsed
# 0.080 0.000 10.349
system.time(collect(Z))
# user system elapsed
# 1.336 0.136 8.537
Z
# # Source: spark<?> [inf x 7]
# # … with 7 variables: key <lgl>, value <lgl>, topic <chr>, partition <int>, offset <dbl>, timestamp <dbl>, timestampType <int>
My first concern is that I'm not seeing data from the topic, I appear to be getting a frame suggesting (meta)data about topics in general, and there is nothing found. With this topic, there are 800 strings (json), modest-to-small sizes. My second concern is that it takes almost 20 seconds to realize this problem (though I suspect that's a symptom of the larger connection problem).
For confirmation, this works:
cons = import("kafka")$KafkaConsumer(bootstrap_servers="11.22.33.44:5555", auto_offset_reset="earliest", max_partition_fetch_bytes=10240000L)
cons$subscribe("mytopic")
msg <- cons$poll(timeout_ms=30000L, max_records=99999L)
length(msg)
# [1] 1
length(msg[[1]])
# [1] 801
as.character( msg[[1]][[1]]$value )
# [1] "{\"TrackId\":\"c839dcb5-...\",...}"
(and those commands complete almost instantly, nothing like the 8-10sec lag above).
The kafka instance to which I'm connecting is using ksqlDB, though I don't think that's a requirement in order to need to use the "org.apache.spark:spark-sql-kafka-.." java package.
(Ultimately I'll be using stateless/stateful procedures on streaming data, including joins and window ops, so I'd like to not have to re-implement that from scratch on the simple kafka connection.)

Unexpected results in benchmark of read.csv / fread [duplicate]

I can run a piece of code for 5 or 10 seconds using the following code:
period <- 10 ## minimum time (in seconds) that the loop should run for
tm <- Sys.time() ## starting data & time
while(Sys.time() - tm < period) print(Sys.time())
The code runs just fine for 5 or 10 seconds. But when I replace the period value by 60 for it to run for a minute, the code never stops. What is wrong?
As soon as elapsed time exceeds 1 minute, the default unit changes from seconds to minutes. So you want to control the unit:
while (difftime(Sys.time(), tm, units = "secs")[[1]] < period)
From ?difftime
If ‘units = "auto"’, a suitable set of units is chosen, the
largest possible (excluding ‘"weeks"’) in which all the absolute
differences are greater than one.
Subtraction of date-time objects gives an object of this class, by
calling ‘difftime’ with ‘units = "auto"’.
Alternatively use proc.time, which measures various times ("user", "system", "elapsed") since you started your R session in seconds. We want "elapsed" time, i.e., the wall clock time, so we retrieve the 3rd value of proc.time().
period <- 10
tm <- proc.time()[[3]]
while (proc.time()[[3]] - tm < period) print(proc.time())
If you are confused by the use of [[1]] and [[3]], please consult:
How do I extract just the number from a named number (without the name)?
How to get a matrix element without the column name in R?
Let me add some user-friendly reproducible examples. Your original code with print inside a loop is quite annoying as it prints thousands of lines onto the screen. I would use Sys.sleep.
test.Sys.time <- function(sleep_time_in_secs) {
t1 <- Sys.time()
Sys.sleep(sleep_time_in_secs)
t2 <- Sys.time()
## units = "auto"
print(t2 - t1)
## units = "secs"
print(difftime(t2, t1, units = "secs"))
## use '[[1]]' for clean output
print(difftime(t2, t1, units = "secs")[[1]])
}
test.Sys.time(5)
#Time difference of 5.005247 secs
#Time difference of 5.005247 secs
#[1] 5.005247
test.Sys.time(65)
#Time difference of 1.084357 mins
#Time difference of 65.06141 secs
#[1] 65.06141
The "auto" units is very clever. If sleep_time_in_secs = 3605 (more than an hour), the default unit will change to "hours".
Be careful with time units when using Sys.time, or you may be fooled in a benchmarking. Here is a perfect example: Unexpected results in benchmark of read.csv / fread. I had answered it with a now removed comment:
You got a problem with time units. I see that fread is more than 20 times faster. If fread takes 4 seconds to read a file, read.csv takes 80 seconds = 1.33 minutes. Ignoring the units, read.csv is "faster".
Now let's test proc.time.
test.proc.time <- function(sleep_time_in_secs) {
t1 <- proc.time()
Sys.sleep(sleep_time_in_secs)
t2 <- proc.time()
## print user, system, elapsed time
print(t2 - t1)
## use '[[3]]' for clean output of elapsed time
print((t2 - t1)[[3]])
}
test.proc.time(5)
# user system elapsed
# 0.000 0.000 5.005
#[1] 5.005
test.proc.time(65)
# user system elapsed
# 0.000 0.000 65.057
#[1] 65.057
"user" time and "system" time are 0, because both CPU and the system kernel are idle.

Timing R code with Sys.time()

I can run a piece of code for 5 or 10 seconds using the following code:
period <- 10 ## minimum time (in seconds) that the loop should run for
tm <- Sys.time() ## starting data & time
while(Sys.time() - tm < period) print(Sys.time())
The code runs just fine for 5 or 10 seconds. But when I replace the period value by 60 for it to run for a minute, the code never stops. What is wrong?
As soon as elapsed time exceeds 1 minute, the default unit changes from seconds to minutes. So you want to control the unit:
while (difftime(Sys.time(), tm, units = "secs")[[1]] < period)
From ?difftime
If ‘units = "auto"’, a suitable set of units is chosen, the
largest possible (excluding ‘"weeks"’) in which all the absolute
differences are greater than one.
Subtraction of date-time objects gives an object of this class, by
calling ‘difftime’ with ‘units = "auto"’.
Alternatively use proc.time, which measures various times ("user", "system", "elapsed") since you started your R session in seconds. We want "elapsed" time, i.e., the wall clock time, so we retrieve the 3rd value of proc.time().
period <- 10
tm <- proc.time()[[3]]
while (proc.time()[[3]] - tm < period) print(proc.time())
If you are confused by the use of [[1]] and [[3]], please consult:
How do I extract just the number from a named number (without the name)?
How to get a matrix element without the column name in R?
Let me add some user-friendly reproducible examples. Your original code with print inside a loop is quite annoying as it prints thousands of lines onto the screen. I would use Sys.sleep.
test.Sys.time <- function(sleep_time_in_secs) {
t1 <- Sys.time()
Sys.sleep(sleep_time_in_secs)
t2 <- Sys.time()
## units = "auto"
print(t2 - t1)
## units = "secs"
print(difftime(t2, t1, units = "secs"))
## use '[[1]]' for clean output
print(difftime(t2, t1, units = "secs")[[1]])
}
test.Sys.time(5)
#Time difference of 5.005247 secs
#Time difference of 5.005247 secs
#[1] 5.005247
test.Sys.time(65)
#Time difference of 1.084357 mins
#Time difference of 65.06141 secs
#[1] 65.06141
The "auto" units is very clever. If sleep_time_in_secs = 3605 (more than an hour), the default unit will change to "hours".
Be careful with time units when using Sys.time, or you may be fooled in a benchmarking. Here is a perfect example: Unexpected results in benchmark of read.csv / fread. I had answered it with a now removed comment:
You got a problem with time units. I see that fread is more than 20 times faster. If fread takes 4 seconds to read a file, read.csv takes 80 seconds = 1.33 minutes. Ignoring the units, read.csv is "faster".
Now let's test proc.time.
test.proc.time <- function(sleep_time_in_secs) {
t1 <- proc.time()
Sys.sleep(sleep_time_in_secs)
t2 <- proc.time()
## print user, system, elapsed time
print(t2 - t1)
## use '[[3]]' for clean output of elapsed time
print((t2 - t1)[[3]])
}
test.proc.time(5)
# user system elapsed
# 0.000 0.000 5.005
#[1] 5.005
test.proc.time(65)
# user system elapsed
# 0.000 0.000 65.057
#[1] 65.057
"user" time and "system" time are 0, because both CPU and the system kernel are idle.

Slow wordcloud in R

Trying to create a word cloud from a 300MB .csv file with text, but its taking hours on a decent laptop with 16GB of RAM. Not sure how long this should typically take...but here's my code:
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
dfTemplate <- read.csv("CleanedDescMay.csv", header=TRUE, stringsAsFactors = FALSE)
template <- dfTemplate
template <- Corpus(VectorSource(template))
template <- tm_map(template, removeWords, stopwords("english"))
template <- tm_map(template, stripWhitespace)
template <- tm_map(template, removePunctuation)
dtm <- TermDocumentMatrix(template)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
par(bg="grey30")
png(file="WordCloudDesc1.png", width=1000, height=700, bg="grey30")
wordcloud(d$word, d$freq, col=terrain.colors(length(d$word), alpha=0.9), random.order=FALSE, rot.per = 0.3, max.words=500)
title(main = "Top Template Words", font.main=1, col.main="cornsilk3", cex.main=1.5)
dev.off()
Any advice is appreciated!
Step 1: Profile
Have you tried profiling your full workflow yet with a small subset to figure out which steps are taking the most time? Profiling with RStudio here
If not, that should be your first step.
If the tm_map() functions are taking a long time:
If I recall correctly, I found working with stringi to be faster than the dedicated corpus tools.
My workflow wound up looking like the following for the pre-cleaning steps. This could definitely be optimized further -- magrittr pipes %>% do contribute to some additional processing time, but I feel like that's an acceptable trade-off for the sanity of not having dozens of nested parenthesis.
library(data.table)
library(stringi)
library(parallel)
## This function handles the processing pipeline
textCleaner <- function(InputText, StopWords, Words, NewWords){
InputText %>%
stri_enc_toascii(.) %>%
toupper(.) %>%
stri_replace_all_regex(.,"[[:cntrl:]]"," ") %>%
stri_replace_all_regex(.,"[[:punct:]]"," ") %>%
stri_replace_all_regex(.,"[[:space:]]+"," ") %>% ## Replaces multiple spaces with
stri_replace_all_regex(.,"^[[:space:]]+|[[:space:]]+$","") %>% ## Remove leading and trailing spaces
stri_replace_all_regex(.,"\\b"%s+%StopWords%s+%"\\b","",vectorize_all = FALSE) %>% ## Stopwords
stri_replace_all_regex(.,"\\b"%s+%Words%s+%"\\b",NewWords,vectorize_all = FALSE) ## Replacements
}
## Replacement Words, I would normally read in a .CSV file
Replace <- data.table(Old = c("LOREM","IPSUM","DOLOR","SIT"),
New = c("I","DONT","KNOW","LATIN"))
## These need to be defined globally
GlobalStopWords <- c("AT","UT","IN","ET","A")
GlobalOldWords <- Replace[["Old"]]
GlobalNewWords <- Replace[["New"]]
## Generate some sample text
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
## Running Single Threaded
system.time({
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
The process of cleaning text is embarrassingly parallel, so in theory you should be able some big time savings possible with multiple cores.
I used to run this pipeline in parallel, but looking back at it today, it turns out that the communication overhead makes this take twice as long with 8 cores as it does single threaded. I'm not sure if this was the same for my original use case, but I guess this may simply serve as a good example of why trying to parallelize instead of optimize can lead to more trouble than value.
## This function handles the cluster creation
## and exporting libraries, functions, and objects
parallelCleaner <- function(Text, NCores){
cl <- parallel::makeCluster(NCores)
clusterEvalQ(cl, library(magrittr))
clusterEvalQ(cl, library(stringi))
clusterExport(cl, list("textCleaner",
"GlobalStopWords",
"GlobalOldWords",
"GlobalNewWords"))
Text <- as.character(unlist(parallel::parLapply(cl, Text,
fun = function(x) textCleaner(x,
GlobalStopWords,
GlobalOldWords,
GlobalNewWords))))
parallel::stopCluster(cl)
return(Text)
}
## Run it Parallel
system.time({
DT[,CleanedText := parallelCleaner(Text = Text,
NCores = 8)]
})
# user system elapsed
# 6.700 5.099 131.429
If the TermDocumentMatrix(template) is the chief offender:
Update: I mentioned Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN recently that might be worth checking out: ngram Github Repository. Turns out I should have just tried it before explaining the really cumbersome process of building a command line tool from source-- this would have saved me a lot of time had been around 18 months ago!
It is a good deal more memory intensive and not quite as fast -- my memory usage peaked around 31 GB so that may or may not be a deal-breaker for you. All things considered, this seems like a really good option.
For the 500,000 paragraph case, ngrams clocks in at around 7 minutes of runtime:
#install.packages("ngram")
library(ngram)
library(data.table)
system.time({
ng1 <- ngram::ngram(DT[["CleanedText"]],n = 1)
ng2 <- ngram::ngram(DT[["CleanedText"]],n = 2)
ng3 <- ngram::ngram(DT[["CleanedText"]],n = 3)
pt1 <- setDT(ngram::get.phrasetable(ng1))
pt1[,Ngrams := 1L]
pt2 <- setDT(ngram::get.phrasetable(ng2))
pt2[,Ngrams := 2L]
pt3 <- setDT(ngram::get.phrasetable(ng3))
pt3[,Ngrams := 3L]
pt <- rbindlist(list(pt1,pt2,pt3))
})
# user system elapsed
# 411.671 12.177 424.616
pt[Ngrams == 2][order(-freq)][1:5]
# ngrams freq prop Ngrams
# 1: SED SED 75096 0.0018013693 2
# 2: AC SED 33390 0.0008009444 2
# 3: SED AC 33134 0.0007948036 2
# 4: SED EU 30379 0.0007287179 2
# 5: EU SED 30149 0.0007232007 2
You can try using a more efficient ngram generator. I use a command line tool called ngrams (available on github here) by Zheyuan Yu- partial implementation of Dr. Vlado Keselj 's Text-Ngrams 1.6 to take pre-processed text files off disk and generate a .csv output with ngram frequencies.
You'll need to build from source yourself using make and then interface with it using system() calls from R, but I found it to run orders of magnitude faster while using a tiny fraction of the memory. Using it, I was was able generate 5-grams from ~700MB of text input in well under an hour, the CSV result with all the output was 2.9 GB file with 93 million rows.
Continuing the example above, In my working directory, I have a folder, ngrams-master, in my working directory that contains the ngrams executable created with make.
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
I think I may have made a couple tweaks to get the output format how I wanted it, if you're interested I can try to find the changes I made to generate a .csvoutputs that differ from the default and upload to Github. (I did that project before I was familiar with the platform so I don't have a good record of the changes I made, live and learn.)
Update 2: I created a fork on Github, msummersgill/ngrams that reflects the slight tweaks I made to output results in a .CSV format. If someone was so inclined, I have a hunch that this could be wrapped up in a Rcpp based package that would be acceptable for CRAN submission -- any takers? I honestly have no clue how Ternary Search Trees work, but they seem to be significantly more memory efficient and faster than any other N-gram implementation currently available in R.
Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN, I haven't used it personally but it might be worth checking out as well: ngram Github Repository.
The Whole Shebang:
Using the same pipeline described above but with a size closer to what you're dealing with (ExampleText.txt comes out to ~274MB):
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
system.time({
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
While the example may not be a perfect representation due to the limited vocabulary generated by stringi::stri_rand_lipsum(), the total run time of ~4.2 minutes using less than 8 GB of RAM on 500,000 paragraphs has been fast enough for the corpuses (corpi?) I've had to tackle in the past.
If wordcloud() is the source of the slowdown:
I'm not familiar with this function, but #Gregor's comment on your original post seems like it would take care of this issue.
library(wordcloud)
GramSubset <- Grams[Ngrams == 2][1:500]
par(bg="gray50")
wordcloud(GramSubset[["Token"]],GramSubset[["Frequency"]],color = GramSubset[["Frequency"]],
rot.per = 0.3,font.main=1, col.main="cornsilk3", cex.main=1.5)

How to compute large object's hash value in R?

I have large objects in R, that barely fits in my 16GB memory (a data.table database of >4M records, >400 variables).
I'd like to have a hash function that will be used to confirm, that the database loaded into R is not modified.
One fast way to do that is to calculate the database's hash with the previously stored hash.
The problem is that digest::digest function copies (serializes) the data, and only after all data are serialized it will calculate the hash. Which is too late on my hardware... :-(
Does anyone know about a way around this problem?
There is a poor's man solution: save the object into the file, and calculate the hash of the file. But it introduces large, unnecessary overhead (I have to make sure there is a spare on HDD for yet another copy, and need to keep track of all the files that may not be automatically deleted)
Similar problem has been described in our issue tracker here:
https://github.com/eddelbuettel/digest/issues/33
The current version of digest can read a file to compute the hash.
Therefore, at least on Linux, we can use a named pipe which will be read by the digest package (in one thread) and from the other side data will be written by another thread.
The following code snippet shows how we can compute a MD5 hash from 10 number by feeding the digester first with 1:5 and then 6:10.
library(parallel)
library(digest)
x <- as.character(1:10) # input
fname <- "mystream.fifo" # choose name for your named pipe
close(fifo(fname, "w")) # creates your pipe if does not exist
producer <- mcparallel({
mystream <- file(fname, "w")
writeLines(x[1:5], mystream)
writeLines(x[6:10], mystream)
close(mystream) # sends signal to the consumer (digester)
})
digester <- mcparallel({
digest(fname, file = TRUE, algo = "md5") # just reads the stream till signalled
})
# runs both processes in parallel
mccollect(list(producer, digester))
unlink(fname) # named pipe removed
UPDATE: Henrik Bengtsson provided a modified example based on futures:
library("future")
plan(multiprocess)
x <- as.character(1:10) # input
fname <- "mystream.fifo" # choose name for your named pipe
close(fifo(fname, open="wb")) # creates your pipe if does not exists
producer %<-% {
mystream <- file(fname, open="wb")
writeBin(x[1:5], endian="little", con=mystream)
writeBin(x[6:10], endian="little", con=mystream)
close(mystream) # sends signal to the consumer (digester)
}
# just reads the stream till signalled
md5 <- digest::digest(fname, file = TRUE, algo = "md5")
print(md5)
## [1] "25867862802a623c16928216e2501a39"
# Note: Identical on Linux and Windows
Following up on nicola's comment, here's a benchmark of the column-wise idea. It seems it doesn't help much, at least not for these at this size. iris is 150 rows, long_iris is 3M (3,000,000).
library(microbenchmark)
#iris
nrow(iris)
microbenchmark(
whole = digest::digest(iris),
cols = digest::digest(lapply(iris, digest::digest))
)
#long iris
long_iris = do.call(bind_rows, replicate(20e3, iris, simplify = F))
nrow(long_iris)
microbenchmark(
whole = digest::digest(long_iris),
cols = digest::digest(lapply(long_iris, digest::digest))
)
Results:
#normal
Unit: milliseconds
expr min lq mean median uq max neval cld
whole 12.6 13.6 14.4 14.0 14.6 24.9 100 b
cols 12.5 12.8 13.3 13.1 13.5 23.0 100 a
#long
Unit: milliseconds
expr min lq mean median uq max neval cld
whole 296 306 317 311 316 470 100 b
cols 261 276 290 282 291 429 100 a

Resources