I would like to include -co options to compress the output raster when using gdalwarp from gdalUtilities in R.
I have tried some options (commented in the code), but I have not been successful in generating the compressed raster.
gdalUtilities::gdalwarp(srcfile = paste0(source_path, "/mask_30.tif"),
                        dstfile = paste0(writing_path, "/mask_30_gdalwarp.tif"),
                        cutline = paste0(source_path, "/amazon.shp"),
                        crop_to_cutline = TRUE,
                        multi = TRUE,
                        wo = "NUM_THREADS = 32",
                        co = "COMPRESS = DEFLATE")

# Other variants of the -co option I have tried:
# co = c("COMPRESS = DEFLATE", "ZLEVEL = 9")
# co = "COMPRESS = DEFLATE", co = "ZLEVEL = 9"
Additionally, I would like to use the multithreaded warping implementation. I am including the -multi and -wo "NUM_THREADS = 16" options (my computer has 32 cores), but I have not been able to decrease the runtime compared with the default -multi option, which uses two cores by default.
Any suggestions for compression and parallelization?
Many thanks in advance.
1 - COMPRESSION
Here is the solution to the file compression problem. To be honest, I ran into the same issue myself and racked my brains over it, only to find that the solution is quite simple (once you know it!): you must not put any spaces in the creation options (i.e. "COMPRESS=DEFLATE", not "COMPRESS = DEFLATE").
So, please find below a small reprex.
Reprex
library(gdalUtilities)
library(stars) # Loaded just to have a '.tif' image for the reprex
# Import a '.tif' image from the 'stars' library
tif <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
# Write the image to disk (in your working directory)
write_stars(tif, "image.tif")
# Size of the image on disk (in bytes)
file.size("image.tif")
#> [1] 2950880
# Compress the image
gdalUtilities::gdalwarp(srcfile = "image.tif",
                        dstfile = "image_gdalwarp.tif",
                        co = "COMPRESS=DEFLATE")
# Size of the compressed image on disk (in bytes)
file.size("image_gdalwarp.tif")
#> [1] 937920 # The image has been successfully compressed.
As @MarkAdler said, there is not much difference between the default compression level (i.e. 6) and level 9. That said, here is how to write the code to apply the desired compression level (still without spaces, and with the options in a list):
gdalUtilities::gdalwarp(srcfile = "image.tif",
                        dstfile = "image_gdalwarp_Z9.tif",
                        co = list("COMPRESS=DEFLATE", "ZLEVEL=9"))
file.size("image_gdalwarp_Z9.tif")
#> [1] 901542
Created on 2022-02-09 by the reprex package (v2.0.1)
2 - PARALLELIZATION
For parallelization across the processor cores, you should not use multi = TRUE. The argument wo = "NUM_THREADS=4" (again, without spaces ;-)) is enough on its own.
Just a clarification: I suspect you are confusing RAM with the number of cores. Computers usually have a 4- or 8-core processor; the 32 you indicate in your code probably refers to your machine's 32 GB of RAM.
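If in doubt, you can check how many logical cores your R session actually sees with the base parallel package (a quick check, not part of the original reprex):
library(parallel)
detectCores()   # number of logical cores available on this machine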
Reprex
library(gdalUtilities)
library(stars)
tif <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
write_stars(tif, "image.tif")
file.size("image.tif")
#> [1] 2950880
gdalUtilities::gdalwarp(srcfile = "image.tif",
                        dstfile = "image_gdalwarp_Z9_parallel.tif",
                        co = list("COMPRESS=DEFLATE", "ZLEVEL=9"),
                        wo = "NUM_THREADS=4") # Replace '4' by '8' if your processor has 8 cores
file.size("image_gdalwarp_Z9_parallel.tif")
#> [1] 901542
Created on 2022-02-09 by the reprex package (v2.0.1)
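If you want to verify whether NUM_THREADS helps on your own data, a hedged way to compare runtimes is to wrap the calls in system.time() (reusing the image.tif created in the reprex above; the output file names are arbitrary):
t_single <- system.time(
  gdalUtilities::gdalwarp(srcfile = "image.tif", dstfile = "warp_1thread.tif",
                          co = "COMPRESS=DEFLATE", wo = "NUM_THREADS=1")
)
t_multi <- system.time(
  gdalUtilities::gdalwarp(srcfile = "image.tif", dstfile = "warp_4threads.tif",
                          co = "COMPRESS=DEFLATE", wo = "NUM_THREADS=4")
)
rbind(single_thread = t_single, four_threads = t_multi)
Note that on an image this small the difference may be negligible; the gap should show up on rasters the size of yours.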
Related
I am trying to download occurrence data for multiple taxonomic groups from multiple regions using the R package rgbif. I would prefer a single download that combines all the regions (I have thousands of regions, so separate downloads would be impractical), but I have not found a way to do so. I can get occ_download to work with one region per query.
Here is my example code:
library(rgbif)
gbif_taxon_keys = c(212, 359)
# below are bbox of 3 regions;
# I have polygons as WKT, but they are clockwise
# (how to convert to counter clockwise??)
wkts = c("POLYGON((11.3431 47.2451,11.4638 47.2451,11.4638 47.2919,11.3431 47.2919,11.3431 47.2451))",
"POLYGON((12.9644 47.7608,13.0922 47.7608,13.0922 47.8453,12.9644 47.8453,12.9644 47.7608))",
"POLYGON((14.2284 48.2217,14.3669 48.2217,14.3669 48.3443,14.2284 48.3443,14.2284 48.2217))")
# this works
queries = occ_download_prep(
  pred_in("taxonKey", gbif_taxon_keys),
  pred("hasCoordinate", TRUE),
  pred("hasGeospatialIssue", FALSE),
  pred_within(wkts[1]),
  user = gbif_user, pwd = gbif_pwd,
  email = gbif_email)
out_test = occ_download_queue(.list = list(queries))
# now try to combine regions in one download
# this does not work
queries = occ_download_prep(
  pred_in("taxonKey", gbif_taxon_keys),
  pred("hasCoordinate", TRUE),
  pred("hasGeospatialIssue", FALSE),
  pred_within(wkts),
  user = gbif_user, pwd = gbif_pwd,
  email = gbif_email)
out_test = occ_download_queue(.list = list(queries))
Error: 'value' must be length 1
# this does not work neither (it runs though)
queries = occ_download_prep(
  pred_in("taxonKey", gbif_taxon_keys),
  pred("hasCoordinate", TRUE),
  pred("hasGeospatialIssue", FALSE),
  pred("geometry", paste0(wkts, collapse = ";")),
  user = gbif_user, pwd = gbif_pwd,
  email = gbif_email)
out_test = occ_download_queue(.list = list(queries))
<<gbif download metadata>>
Status: KILLED
From my download center on GBIF, it says "The download request was unsuccessful."
Can anyone help with this? Thanks!
I think I figured out how to do this: I combined all the polygons into a single multipolygon and it seems to work.
In other words, I turned the above 3 polygons into this:
wkts2 = "MULTIPOLYGON (((11.3431 47.2451, 11.4638 47.2451, 11.4638 47.2919, 11.3431 47.2919, 11.3431 47.2451)), ((12.9644 47.7608, 13.0922 47.7608, 13.0922 47.8453, 12.9644 47.8453, 12.9644 47.7608)), ((14.2284 48.2217, 14.3669 48.2217, 14.3669 48.3443, 14.2284 48.3443, 14.2284 48.2217)))"
then, I run:
queries = occ_download_prep(
  pred_in("taxonKey", gbif_taxon_keys),
  pred("hasCoordinate", TRUE),
  pred("hasGeospatialIssue", FALSE),
  pred_within(wkts2),
  user = gbif_user, pwd = gbif_pwd,
  email = gbif_email)
out_test = occ_download_queue(.list = list(queries))
It works for this example. @sckott may have better approaches.
Just concatenating the polygons into a multipolygon leads to overlaps, which aren't valid, and will lead to failing downloads.
Instead, use a GIS library to combine the polygons. This is the first I found for R:
library(sf)
x = st_as_sfc("POLYGON((5.032 52.237, 5.426 52.237, 5.426 52.525, 5.032 52.525, 5.032 52.237))")
y = st_as_sfc("POLYGON((5.234 52.033, 5.546 52.033, 5.546 52.311, 5.234 52.311, 5.234 52.033))")
u = st_union(x, y)
st_as_text(u)
[1] "POLYGON ((5.032 52.525, 5.426 52.525, 5.426 52.311, 5.546 52.311, 5.546 52.033, 5.234 52.033, 5.234 52.237, 5.032 52.237, 5.032 52.525))"
A quick check on Wicket shows we now have an 8-sided polygon, which should work as the within predicate.
I think, using this, you can probably put all your polygons into a single download. The limit is 10,000 points in total for a single download.
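If you have many polygons, a hedged sketch of the same approach (building on the sf code above; wkts is the vector of WKT strings from the question) is to union them all in one call and then check the vertex count against that limit:
library(sf)
polys  <- st_as_sfc(wkts)            # parse the WKT polygons
merged <- st_union(polys)            # dissolve any overlaps into one valid geometry
nrow(st_coordinates(merged))         # total number of vertices (must stay under 10,000)
st_as_text(merged)                   # WKT string to pass to pred_within()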
I am trying to find which objects are taking a lot of memory in my R session, but the problem is that the object might have been invisibly created with an unknown name in an unknown environment.
If the object is stored in .GlobalEnv or a known environment, I can easily use a strategy like ls(envir) + get() + object.size() (see lsos in this post, for example) to list all objects and their sizes, allowing me to identify the heavy ones.
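For a known environment, that strategy only takes a few lines (a minimal sketch of what I mean):
env   <- .GlobalEnv                   # or any environment I know about
objs  <- ls(env, all.names = TRUE)
sizes <- sapply(objs, function(x) object.size(get(x, envir = env)))
head(sort(sizes, decreasing = TRUE), 10)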
However, the object in question might not be stored in .GlobalEnv; it might be in some obscure environment implicitly created by an external package. How can I, in that case, identify which object is using a lot of RAM?
The best case study is ggplot2 creating .last_plot in a dedicated environment. Looking under the hood one can find that it is stored in environment(ggplot2:::.store$get), so one can find it and eventually remove it. But if I didn't know that location or name a priori, would there be a way to find that there is a heavy object called .last_plot somewhere in memory?
pryr::mem_used()
#> 34.7 MB
## example: implicit creation of heavy and hidden object by ggplot
path <- tempfile()
if (!file.exists(path)) {
  saveRDS(as.data.frame(matrix(rep(1, 1e07), ncol = 5)), path)
}
pryr::mem_used()
#> 34.9 MB
p1 <- ggplot2::ggplot(readr::read_rds(path), ggplot2::aes(V1))
rm(p1)
pryr::mem_used()
#> 127 MB
## Hidden object is not in .GlobalEnv
ls(.GlobalEnv, all.names = TRUE)
#> [1] "path"
## Here I know where to find it: environment(ggplot2:::.store$get)
ls(all.names = TRUE, envir = environment(ggplot2:::.store$get))
#> [1] ".last_plot"
pryr::object_size(get(".last_plot", environment(ggplot2:::.store$get))$data)
#> 80 MB
## But how could I have found this otherwise?
Created on 2020-11-03 by the reprex package (v0.3.0)
I don't think there's any existing way to do this. If you combine @AllanCameron's answer with my comment, where you'd also run ls(y) for the y environments calculated as
ns <- loadedNamespaces()
for (x in ns) {
  y <- loadNamespace(x)
  # look at the size of everything in y
}
you still won't find all the environments. I think you could do it if you also examined every object that might contain a reference to an environment (e.g. every function, formula, list, and various exotic objects) but it would be tricky not to miss something or count things more than once.
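To illustrate that last point (a hedged example, not from the original question): an ordinary closure can keep a large object alive in an environment that no scan of named environments will list, yet pryr::object_size reports it when you inspect the function itself.
make_fn <- function() {
  big <- matrix(rnorm(1e6), ncol = 10)   # captured in the closure, not in .GlobalEnv
  function(i) big[i, 1]
}
f <- make_fn()
ls(environment(f))        # "big" lives here
pryr::object_size(f)      # includes the size of the captured environment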
Edited to add: Actually, pryr::object_size is pretty smart at reporting on the environments attached to objects, so we'd get close by searching namespaces. For example, to find the top 20 objects:
pryr::mem_used()
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
#> 35 MB
path <- tempfile()
if (!file.exists(path)) {
  saveRDS(as.data.frame(matrix(rep(1, 1e07), ncol = 5)), path)
}
pryr::mem_used()
#> 35.2 MB
p1 <- ggplot2::ggplot(readr::read_rds(path), ggplot2::aes(V1))
rm(p1)
pryr::mem_used()
#> 127 MB
envs <- c(globalenv = globalenv(),
          sapply(loadedNamespaces(), function(ns) loadNamespace(ns)))
sizes <- lapply(envs, function(e) {
  objs <- ls(e, all = TRUE)
  sapply(objs, function(obj) pryr::object_size(get(obj, envir = e)))
})
head(sort(unlist(sizes), decreasing = TRUE), 20)
#> base..__S3MethodsTable__. utils..__S3MethodsTable__.
#> 96216872 83443704
#> grid..__S3MethodsTable__. ggplot2..__S3MethodsTable__.
#> 80945520 80636768
#> ggplot2..store methods..classTable
#> 80418936 10101152
#> graphics..__S3MethodsTable__. tools..check_packages
#> 9325608 5185880
#> compiler.inlineHandlers methods..genericTable
#> 3444600 2808440
#> Rcpp..__T__show:methods colorspace..__T__show:methods
#> 2474672 2447880
#> Rcpp..RcppClass Rcpp..__C__C++OverloadedMethods
#> 2127584 1990504
#> Rcpp..__C__RcppClass Rcpp..__C__C++Field
#> 1982576 1980176
#> Rcpp..__C__C++Constructor Rcpp..__T__$:base
#> 1979992 1939616
#> tools..install_packages Rcpp..__C__Module
#> 1904032 1899872
Created on 2020-11-03 by the reprex package (v0.3.0)
I don't know why those methods tables come out so large (I suspect it's because ggplot2 adds methods to those tables, so its environment gets captured); but somehow they are finding your object, because they aren't so big if I don't create it.
A hint about the issue is in the 5th object, listed as ggplot2..store (i.e. the object named .store in the ggplot2 namespace). That doesn't tell you to look in the environments of the functions held in .store, but at least it gets you started.
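Once ggplot2..store stands out, a hedged follow-up (reusing the names from the question) is to drill into the closures it holds:
store <- get(".store", envir = asNamespace("ggplot2"))
ls(environment(store$get), all.names = TRUE)                          # reveals .last_plot
pryr::object_size(get(".last_plot", envir = environment(store$get)))  # the heavy object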
Second edit:
Here are some tweaks to make the output a bit more readable.
# Unlist first, so we can clean up the names
sizes <- unlist(sizes)
# Replace the first dot with :::
names(sizes) <- sub(".", ":::", names(sizes), fixed = TRUE)
# Remove internal R objects
keep <- !grepl(".__", names(sizes), fixed = TRUE)
sizes <- sizes[keep]
With these changes, the output from sort(sizes, decreasing = TRUE) starts out as
ggplot2:::.store
80418936
base:::.userHooksEnv
47855920
base:::.Options
45016888
utils:::Rprof
44958416
If you do
unlist(lapply(search(), function(y) sapply(ls(y), function(x) object.size(get(x)))))
You will get a complete list of all the objects in all the environments on your search path, including their sizes. You can then sort these and find the offending objects.
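A hedged variant of the same idea that also records which search-path environment each object came from, then sorts by size:
sizes <- unlist(lapply(search(), function(e) {
  objs <- ls(e)
  setNames(sapply(objs, function(x) object.size(get(x, envir = as.environment(e)))),
           paste(e, objs, sep = "/"))
}))
head(sort(sizes, decreasing = TRUE), 10)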
I've written a script to automate the processing of my raster data, but it fails partway through. The loop looks in the unzipped directory for subdirectories which contain .tif files and stacks those with some basemaps in /base_layers. It's been failing with:
Error in matrix(unlist(ini), ncol = 2, byrow = TRUE) : 'data' must be of a vector type, was 'NULL'
and
In writeBin(v, x@file@con, size = x@file@dsize) :
problem writing to connection
and
Error in .local(.Object, ...) :
GDAL Error 3: Free disk space available is 875151360 bytes, whereas 2473962400 are at least necessary. You can disable this check by defining the CHECK_DISK_FREE_SPACE configuration option to FALSE.
FIXED I think, see answers
Ran in a Kubernetes pod with Ubuntu (24 GB of RAM, 50 GB of storage, 4 vCPUs) and also in a VM instance running Ubuntu 18.04 (2 vCPUs, 8 GB of RAM, 50 GB of storage). The pods were failing partway through because of memory issues, and the VM didn't complete either. The code typically fails when it reaches the mask raster stage, but only after running through a few iterations of the loop (so it all works at the start).
If anyone can point out if memory fragmentation is likely to occur in this script, or ways to clear the OS memory I would be forever grateful!
Script:
library(raster)
library(rgdal)
input = "/mnt/nfs/data/unzipped"
output = "/mnt/nfs/data/training" #path to where the data should go
#get paths to basemaps
DEM = raster("/mnt/nfs/data/base_layers/Clipped_filled_dem.tif")
flow_accum = raster("/mnt/nfs/data/base_layers/accumulation.tif")
slope = raster("/mnt/nfs/data/base_layers/Clipped_slope.tif")
aspect = raster("/mnt/nfs/data/base_layers/Clipped_filled_aspect.tif")
ruggedness = raster("/mnt/nfs/data/base_layers/Clipped_TRI.tif")
#get directories that have data that needs to be processed
directory = list.dirs(path = input, recursive = FALSE)
for (direct in directory) {
  subdirect = list.dirs(path = direct, recursive = FALSE)
  for (sub in subdirect) {
    files_for_raster <- list.files(path = sub, pattern = "*.tif$", full.names = TRUE)
    rasterstack = stack(files_for_raster)

    # name of datapull
    name = gsub(paste(direct, "/", sep = ""), "", sub)
    print(c("working in", name))

    # crop DEM to the extent of the satellite image
    DEMcrop = crop(DEM, rasterstack) # extent can be a raster
    flow_accumcrop = crop(flow_accum, rasterstack)
    slopecrop = crop(slope, rasterstack)
    aspectcrop = crop(aspect, rasterstack)
    ruggednesscrop = crop(ruggedness, rasterstack)
    print(c("cropped"))
    print(object.size(DEMcrop))
    print(object.size(rasterstack))

    # resample rasters, this will take a bit
    DEMcrop = resample(DEMcrop, rasterstack)
    flow_accumcrop = resample(flow_accumcrop, rasterstack)
    slopecrop = resample(slopecrop, rasterstack)
    aspectcrop = resample(aspectcrop, rasterstack)
    ruggednesscrop = resample(ruggednesscrop, rasterstack)
    print(c("resampled"))
    print(object.size(DEMcrop))
    print(object.size(rasterstack))

    # mask layers
    DEMcrop = mask(DEMcrop, raster::subset(rasterstack, 1))
    flow_accumcrop = mask(flow_accumcrop, raster::subset(rasterstack, 1))
    slopecrop = mask(slopecrop, raster::subset(rasterstack, 1))
    aspectcrop = mask(aspectcrop, raster::subset(rasterstack, 1))
    ruggednesscrop = mask(ruggednesscrop, raster::subset(rasterstack, 1))
    print(c("masked"))
    print(object.size(DEMcrop))
    print(object.size(rasterstack))

    # add baselayers to the raster stack
    finalstack = addLayer(rasterstack, DEMcrop, flow_accumcrop, slopecrop, aspectcrop, ruggednesscrop)
    print(names(finalstack))
    print(nlayers(finalstack))
    bands <- c("band1", "band2", "band3", "band4", "band5", "band6", "band7")
    type <- c("quality", "sr_ndvi", "DEM", "flow_accum", "slope", "aspect", "TRI")
    band_info <- data.frame(bands, type)
    print("finalstack")

    # create new output directory and save raster there
    output_subdirect = gsub(paste(input, "/", sep = ""), "", sub)
    dir.create(file.path(output, output_subdirect), recursive = TRUE)
    Sys.chmod(file.path(output, output_subdirect), mode = "777", use_umask = FALSE)
    print("created directory")
    write = file.path(output, output_subdirect)
    writeRaster(finalstack, format = "GTiff", filename = file.path(write, name, fsep = "/"),
                options = c("INTERLEAVE=BAND", "COMPRESS=NONE"), overwrite = TRUE)
    write.csv(band_info, file = paste(file.path(write, name, fsep = "/"), ".csv", sep = ""))
    print("done processing")

    rm(rasterstack, DEMcrop, flow_accumcrop, slopecrop, aspectcrop, ruggednesscrop)
    gc()
    print(gc())
    system("sudo sysctl -w vm.drop_caches=3")
  }
}
# useful functions
#mystack = stack("path/to/multilayer.tif") # multilayer.tif is an existing raster stack
#band1 = subset(mystack,subset=1) # subsets bands from raster stack
#removeTmpFiles(h=0) # removes temp files, can be used after writing raster stackes to delete all temp raster files
#rasterbrick<-brick(rasterstack) #can make a raster brick from a raster stack
I've included the print(object.size()) to try to see if my objects are growing in size throughout the code's execution.
The script was writing large temporary rasters to the /tmp folder, over 20 GB each time it executed, which filled up the disk pretty quickly. I'm still not sure why it kept writing new temporary rasters rather than clearing the old ones, but clearing the tmp directory during the script takes care of it.
This was a pretty good explanation of memory issues with the raster package. https://discuss.ropensci.org/t/how-to-avoid-space-hogging-raster-tempfiles/864
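A hedged sketch of how to keep those temporary files under control with raster's own options (removeTmpFiles() is the same helper mentioned in the comments above; the tmpdir path is only an example):
library(raster)
dir.create("/mnt/nfs/data/raster_tmp", showWarnings = FALSE)
rasterOptions(tmpdir = "/mnt/nfs/data/raster_tmp")   # point temp rasters at a disk with space
# ... then, at the end of each loop iteration, after writeRaster()/rm()/gc():
removeTmpFiles(h = 0)   # delete all raster temp files regardless of age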
Trying to create a word cloud from a 300 MB .csv file of text, but it's taking hours on a decent laptop with 16 GB of RAM. Not sure how long this should typically take... but here's my code:
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
dfTemplate <- read.csv("CleanedDescMay.csv", header=TRUE, stringsAsFactors = FALSE)
template <- dfTemplate
template <- Corpus(VectorSource(template))
template <- tm_map(template, removeWords, stopwords("english"))
template <- tm_map(template, stripWhitespace)
template <- tm_map(template, removePunctuation)
dtm <- TermDocumentMatrix(template)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
par(bg="grey30")
png(file="WordCloudDesc1.png", width=1000, height=700, bg="grey30")
wordcloud(d$word, d$freq, col=terrain.colors(length(d$word), alpha=0.9), random.order=FALSE, rot.per = 0.3, max.words=500)
title(main = "Top Template Words", font.main=1, col.main="cornsilk3", cex.main=1.5)
dev.off()
Any advice is appreciated!
Step 1: Profile
Have you tried profiling your full workflow yet with a small subset to figure out which steps are taking the most time? Profiling with RStudio here
If not, that should be your first step.
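If you haven't profiled before, here is a minimal sketch with base R's Rprof, mirroring the first steps of your script on a small subset (this assumes tm is already loaded and dfTemplate is the data frame from your code):
smallSample <- head(dfTemplate, 1000)
Rprof("wordcloud_profile.out")                                # start profiling
template <- Corpus(VectorSource(smallSample))
template <- tm_map(template, removeWords, stopwords("english"))
dtm <- TermDocumentMatrix(template)
Rprof(NULL)                                                   # stop profiling
summaryRprof("wordcloud_profile.out")$by.self                 # time spent per function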
If the tm_map() functions are taking a long time:
If I recall correctly, I found working with stringi to be faster than the dedicated corpus tools.
My workflow wound up looking like the following for the pre-cleaning steps. This could definitely be optimized further -- magrittr pipes %>% do add some processing time, but I feel like that's an acceptable trade-off for the sanity of not having dozens of nested parentheses.
library(data.table)
library(stringi)
library(parallel)
library(magrittr)  # for the %>% pipe used below

## This function handles the processing pipeline
textCleaner <- function(InputText, StopWords, Words, NewWords){
  InputText %>%
    stri_enc_toascii(.) %>%
    toupper(.) %>%
    stri_replace_all_regex(., "[[:cntrl:]]", " ") %>%
    stri_replace_all_regex(., "[[:punct:]]", " ") %>%
    stri_replace_all_regex(., "[[:space:]]+", " ") %>%                 ## Replace multiple spaces with a single space
    stri_replace_all_regex(., "^[[:space:]]+|[[:space:]]+$", "") %>%   ## Remove leading and trailing spaces
    stri_replace_all_regex(., "\\b"%s+%StopWords%s+%"\\b", "", vectorize_all = FALSE) %>%  ## Remove stopwords
    stri_replace_all_regex(., "\\b"%s+%Words%s+%"\\b", NewWords, vectorize_all = FALSE)    ## Apply replacements
}
## Replacement Words, I would normally read in a .CSV file
Replace <- data.table(Old = c("LOREM", "IPSUM", "DOLOR", "SIT"),
                      New = c("I", "DONT", "KNOW", "LATIN"))
## These need to be defined globally
GlobalStopWords <- c("AT","UT","IN","ET","A")
GlobalOldWords <- Replace[["Old"]]
GlobalNewWords <- Replace[["New"]]
## Generate some sample text
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
## Running Single Threaded
system.time({
  DT[, CleanedText := textCleaner(Text, GlobalStopWords, GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
The process of cleaning text is embarrassingly parallel, so in theory there should be big time savings possible with multiple cores.
I used to run this pipeline in parallel, but looking back at it today, it turns out that the communication overhead makes this take twice as long with 8 cores as it does single threaded. I'm not sure if this was the same for my original use case, but I guess this may simply serve as a good example of why trying to parallelize instead of optimize can lead to more trouble than value.
## This function handles the cluster creation
## and exporting libraries, functions, and objects
parallelCleaner <- function(Text, NCores){
  cl <- parallel::makeCluster(NCores)
  clusterEvalQ(cl, library(magrittr))
  clusterEvalQ(cl, library(stringi))
  clusterExport(cl, list("textCleaner",
                         "GlobalStopWords",
                         "GlobalOldWords",
                         "GlobalNewWords"))
  Text <- as.character(unlist(parallel::parLapply(cl, Text,
                                                  fun = function(x) textCleaner(x,
                                                                                GlobalStopWords,
                                                                                GlobalOldWords,
                                                                                GlobalNewWords))))
  parallel::stopCluster(cl)
  return(Text)
}
## Run it Parallel
## Run it Parallel
system.time({
  DT[, CleanedText := parallelCleaner(Text = Text, NCores = 8)]
})
# user system elapsed
# 6.700 5.099 131.429
If the TermDocumentMatrix(template) is the chief offender:
Update: Drew Schmidt and Christian Heckendorf have also recently submitted an R package named ngram to CRAN that might be worth checking out: ngram Github Repository. It turns out I should have just tried it before explaining the rather cumbersome process of building a command line tool from source -- it would have saved me a lot of time had it been around 18 months ago!
It is a good deal more memory intensive and not quite as fast -- my memory usage peaked around 31 GB so that may or may not be a deal-breaker for you. All things considered, this seems like a really good option.
For the 500,000 paragraph case, ngrams clocks in at around 7 minutes of runtime:
#install.packages("ngram")
library(ngram)
library(data.table)

system.time({
  ng1 <- ngram::ngram(DT[["CleanedText"]], n = 1)
  ng2 <- ngram::ngram(DT[["CleanedText"]], n = 2)
  ng3 <- ngram::ngram(DT[["CleanedText"]], n = 3)
  pt1 <- setDT(ngram::get.phrasetable(ng1))
  pt1[, Ngrams := 1L]
  pt2 <- setDT(ngram::get.phrasetable(ng2))
  pt2[, Ngrams := 2L]
  pt3 <- setDT(ngram::get.phrasetable(ng3))
  pt3[, Ngrams := 3L]
  pt <- rbindlist(list(pt1, pt2, pt3))
})
# user system elapsed
# 411.671 12.177 424.616
pt[Ngrams == 2][order(-freq)][1:5]
# ngrams freq prop Ngrams
# 1: SED SED 75096 0.0018013693 2
# 2: AC SED 33390 0.0008009444 2
# 3: SED AC 33134 0.0007948036 2
# 4: SED EU 30379 0.0007287179 2
# 5: EU SED 30149 0.0007232007 2
You can try using a more efficient ngram generator. I use a command line tool called ngrams (available on GitHub here) by Zheyuan Yu, a partial implementation of Dr. Vlado Keselj's Text-Ngrams 1.6, which takes pre-processed text files from disk and generates a .csv output with ngram frequencies.
You'll need to build it from source yourself using make and then interface with it via system() calls from R, but I found it to run orders of magnitude faster while using a tiny fraction of the memory. Using it, I was able to generate 5-grams from ~700 MB of text input in well under an hour; the CSV result with all the output was a 2.9 GB file with 93 million rows.
Continuing the example above, I have a folder in my working directory, ngrams-master, that contains the ngrams executable created with make.
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
I think I made a couple of tweaks to get the output format the way I wanted it; if you're interested, I can try to find the changes I made to generate .csv outputs that differ from the default and upload them to GitHub. (I did that project before I was familiar with the platform, so I don't have a good record of the changes I made -- live and learn.)
Update 2: I created a fork on GitHub, msummersgill/ngrams, that reflects the slight tweaks I made to output results in .CSV format. If someone were so inclined, I have a hunch this could be wrapped up in an Rcpp-based package that would be acceptable for CRAN submission -- any takers? I honestly have no clue how ternary search trees work, but they seem to be significantly more memory efficient and faster than any other n-gram implementation currently available in R.
Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN, I haven't used it personally but it might be worth checking out as well: ngram Github Repository.
The Whole Shebang:
Using the same pipeline described above but with a size closer to what you're dealing with (ExampleText.txt comes out to ~274MB):
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))

system.time({
  DT[, CleanedText := textCleaner(Text, GlobalStopWords, GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
While the example may not be a perfect representation due to the limited vocabulary generated by stringi::stri_rand_lipsum(), the total run time of ~4.2 minutes using less than 8 GB of RAM on 500,000 paragraphs has been fast enough for the corpuses (corpi?) I've had to tackle in the past.
If wordcloud() is the source of the slowdown:
I'm not familiar with this function, but @Gregor's comment on your original post seems like it would take care of this issue.
library(wordcloud)
GramSubset <- Grams[Ngrams == 2][1:500]
par(bg="gray50")
wordcloud(GramSubset[["Token"]], GramSubset[["Frequency"]], color = GramSubset[["Frequency"]],
          rot.per = 0.3, font.main = 1, col.main = "cornsilk3", cex.main = 1.5)
I am comparing the performance of R and Apache Spark on a local machine and R seems to be doing much better. Is that because I am not using a cluster or am I doing something wrong?
Create data (create_data.R):
options = commandArgs(trailingOnly = TRUE)
rows = as.numeric(options[1])
perday = 365 / (rows-1) * 6
dates = seq(as.Date('2010-01-01'), as.Date('2015-12-31'), by=perday)
rows = length(dates)
ids = sample(paste0("ID", seq(1:10000)), rows, replace=TRUE)
sales = rpois(rows,50)
categories = sample(paste("Category", sprintf("%02d",seq(1:10))), rows, replace=TRUE)
data = data.frame(dates, ids, sales, categories)
write.csv(data, "/home/phil/performance/data.csv", row.names=FALSE)
Test R (cut.R):
suppressMessages(suppressWarnings(require(dplyr, quietly=TRUE)))
data = read.csv("data.csv")
first_purchase = head(data[order(data$dates, data$ids),],1)
print(first_purchase)
Test Spark (cut.py):
from pyspark import SparkContext
sc = SparkContext("local")
rdd = sc.textFile("data.csv", 2)
# Get rid of header
header = rdd.take(1)[0]
rdd = rdd.filter(lambda line: line != header)
rdd = rdd.map(lambda line: line.split(","))
first_purchase = rdd.takeOrdered(1, lambda x: [x[0],x[1]])[0]
print(first_purchase)
Run complete test (run_tests.sh):
echo "Creating data"
Rscript create_data.R 5000000
wc -l data.csv
echo "Testing R"
time Rscript cut.R
echo "Testing Spark"
time spark-submit cut.py
Output of the tests:
$ . run_test.sh
Creating data
5000001 data.csv
Testing R
dates ids sales categories
1264 2010-01-01 ID10 60 Category 01
real 0m12.689s
user 0m12.498s
sys 0m0.187s
Testing Spark
[u'2010-01-01', u'"ID10"', u'60', u'"Category 01"']
real 0m17.029s
user 0m7.388s
sys 0m0.392s
I am running this on Ubuntu in a VirtualBox VM with Windows 7 as the host system, if that makes a difference.
Spark is a distributed computing framework, and its model is to break the work down into pieces (tasks), which are scheduled, serialized and shipped based on the DAG derived from the dependencies between the functional transformations defined on the RDD.
All that machinery comes with an overhead cost, even in local mode. It is not unexpected that R, having been designed for single-node execution, works faster here.
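If you want the comparison to exclude process startup (R's interpreter on one side, the JVM and SparkContext on the other), a hedged tweak is to time only the data work inside each script; on the R side, for example:
# cut.R variant: time only the sort/head step (startup and read time excluded)
data <- read.csv("data.csv")
timing <- system.time({
  first_purchase <- head(data[order(data$dates, data$ids), ], 1)
})
print(first_purchase)
print(timing)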
Try the same comparison on a cluster... oh... wait... R only runs in a single node (but not for long anymore).