Some references:
This is a follow-up to Why is processing a sorted array faster than processing an unsorted array?
The only post in the r tag that I found somewhat related to branch prediction was this one: Why sampling matrix row is very slow?
Explanation of the problem:
I was investigating whether processing a sorted array is faster than processing an unsorted one (same as the problem tested in Java and C – first link) to see if branch prediction is affecting R in the same manner.
See the benchmark examples below:
set.seed(128)
# or make a smaller vector with 1e7 elements
myvec <- rnorm(1e8, 128, 128)
myvecsorted <- sort(myvec)
mysumU = 0
mysumS = 0
SvU <- microbenchmark::microbenchmark(
  Unsorted = for (i in 1:length(myvec)) {
    if (myvec[i] > 128) {
      mysumU = mysumU + myvec[i]
    }
  },
  Sorted = for (i in 1:length(myvecsorted)) {
    if (myvecsorted[i] > 128) {
      mysumS = mysumS + myvecsorted[i]
    }
  },
  times = 10)
ggplot2::autoplot(SvU)
Question:
First, I want to know why the "Sorted" vector is not the fastest every time, and why the speedup is not of the same magnitude as in Java.
Second, why does the sorted execution time have higher variation than the unsorted one?
N.B. My CPU is an i7-6820HQ @ 2.70 GHz (Skylake, quad-core with hyperthreading).
Update:
To investigate the variation part, I did the microbenchmark with the vector of 100 million elements (n=1e8) and repeated the benchmark 100 times (times=100). Here's the associated plot with that benchmark.
Here's my sessioninfo:
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] compiler stats graphics grDevices utils datasets methods base
other attached packages:
[1] rstudioapi_0.10 reprex_0.3.0 cli_1.1.0 pkgconfig_2.0.3 evaluate_0.14 rlang_0.4.0
[7] Rcpp_1.0.2 microbenchmark_1.4-7 ggplot2_3.2.1
Interpreter overhead, and just being an interpreter, explains most of the average difference. I don't have an explanation for the higher variance.
R is an interpreted language, not JIT compiled to machine code like Java, or ahead-of-time like C. (I don't know much about R internals, just CPUs and performance, so I'm making a lot of assumptions here.)
The code that's running on the actual CPU hardware is the R interpreter, not exactly your R program.
Control dependencies in the R program (like an if()) become data dependencies in the interpreter. The current thing being executed is just data for the interpreter running on a real CPU.
Different operations in the R program become control dependencies in the interpreter. For example, evaluating myvec[i] then the + operator would probably be done by two different functions in the interpreter. And a separate function for > and for if() statements.
The classic interpreter loop is based around an indirect branch that dispatches from a table of function pointers. Instead of a taken/not-taken choice, the CPU needs a prediction for one of many recently-used target addresses. I don't know if R uses a single indirect branch like that, or if it tries to be fancier, like having the end of each interpreter block dispatch to the next one instead of returning to a main dispatch loop.
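As a purely illustrative sketch (real interpreters do this in C, and I don't know how R's evaluator is actually structured), the "table of function pointers" idea looks roughly like this: every operation is looked up in a table and called through one central dispatch point, and that call is the indirect branch the CPU has to predict.

# Toy "interpreter" with a dispatch table; the lookup-and-call inside run()
# plays the role of the single indirect branch discussed above.
ops <- list(
  LOAD = function(state, arg) { state$acc  <- state$data[arg]; state },
  GT   = function(state, arg) { state$flag <- state$acc > arg; state },
  ADD  = function(state, arg) { if (state$flag) state$sum <- state$sum + state$acc; state }
)

run <- function(program, state) {
  for (instr in program) {
    op    <- ops[[instr$op]]   # dispatch: which handler runs depends on the "program", i.e. on data
    state <- op(state, instr$arg)
  }
  state
}

# A tiny "program": for each element, load it, compare against 128, conditionally add.
state   <- list(data = c(200, 50, 300), acc = 0, flag = FALSE, sum = 0)
program <- unlist(lapply(seq_along(state$data), function(i) list(
  list(op = "LOAD", arg = i),
  list(op = "GT",   arg = 128),
  list(op = "ADD",  arg = NA)
)), recursive = FALSE)
run(program, state)$sum   # 500: sums the elements greater than 128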
Modern Intel CPUs (like Haswell and later) have IT-TAGE (Indirect TAgged GEometric history length) prediction. The taken/not-taken state of previous branches along the path of execution are used as an index into a table of predictions. This mostly solves the interpreter branch-prediction problem, allowing it to do a surprisingly good job, especially when the interpreted code (the R code in your case) does the same thing repeatedly.
Branch Prediction and the Performance of Interpreters - Don't Trust Folklore (2015) - Haswell's ITTAGE is a huge improvement for interpreters, invalidating the previous wisdom that a single indirect branch for interpreter dispatch was a disaster. I don't know what dispatch technique R actually uses; there are tricks that were useful before predictors got this good.
X86 prefetching optimizations: "computed goto" threaded code has more links.
https://comparch.net/2013/06/30/why-tage-is-the-best/
https://danluu.com/branch-prediction/ has some links about that at the bottom. Also mentions that AMD has used Perceptron predictors in Bulldozer-family and Zen: like a neural net.
The if() being taken does result in needing to do different operations, so it does actually still make some branching in the R interpreter more or less predictable depending on data. But of course as an interpreter, it's doing much more work at each step than a simple machine-code loop over an array.
So extra branch mispredicts are a much smaller fraction of the total time because of interpreter overhead.
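A quick (hedged) way to gauge how much of the time is interpreter overhead rather than the branch itself is to compare the interpreted loop against the vectorized equivalent, where the comparison and the sum both run in compiled code:

set.seed(128)
myvec <- rnorm(1e7, 128, 128)   # 1e7 instead of 1e8 so the comparison finishes quickly
microbenchmark::microbenchmark(
  interpreted = { s <- 0; for (x in myvec) if (x > 128) s <- s + x; s },
  vectorized  = sum(myvec[myvec > 128]),
  times = 10
)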
Of course, both your tests are with the same interpreter on the same hardware. I don't know what kind of CPU you have.
If it's Intel older than Haswell or AMD older than Zen, you might be getting a lot of mispredicts even with the sorted array, unless the pattern is simple enough for an indirect branch history predictor to lock onto. That would bury the difference in more noise.
Since you do see a pretty clear difference, I'm guessing that the CPU doesn't mispredict too much in the sorted case, so there is room for it to get worse in the unsorted case.
I am using the poLCA package to run latent class analysis (LCA) on data with 450,000 observations and 114 variables. As with most latent class analysis, I will need to run this multiple rounds for different numbers of classes. Each run takes about 12-20 hours depending on the number of classes selected.
Is there a way for me to utilize parallel processing to run this more efficiently? Otherwise, are there other ways to optimize this?
#Converting binary variables to 1 and 2
lca_dat1=lca_dat1+1
#Formula for LCA
f<-cbind(Abdominal_hernia,Abdominal_pain,
Acute_and_unspecified_renal_failure,Acute_cerebrovascular_disease,
Acute_myocardial_infarction,Administrative_social_admission,
Allergic_reactions,Anal_and_rectal_conditions,
Anxiety_disorders,Appendicitis_and_other_appendiceal_conditions,
Asthma,Bacterial_infection_unspecified_site,
Biliary_tract_disease,Calculus_of_urinary_tract,
Cancer_of_breast,Cardiac_dysrhythmias,
Cataract,Chronic_obstructive_pulmonary_disease_and_bronchiectasis,
Chronic_renal_failure,Chronic_ulcer_of_skin,
Coagulation_and_hemorrhagic_disorders,Coma_stupor_and_brain_damage,
Complication_of_device_implant_or_graft,Complications_of_surgical_procedures_or_medical_care,
Conditions_associated_with_dizziness_or_vertigo,Congestive_heart_failure_nonhypertensive,
Coronary_atherosclerosis_and_other_heart_disease,Crushing_injury_or_internal_injury,
Deficiency_and_other_anemia,Delirium_dementia_and_amnestic_and_other_cognitive_disorders,
Disorders_of_lipid_metabolism,Disorders_of_teeth_and_jaw,
Diverticulosis_and_diverticulitis,E_Codes_Adverse_effects_of_medical_care,
E_Codes_Adverse_effects_of_medical_drugs,E_Codes_Fall,
Epilepsy_convulsions,Esophageal_disorders,
Essential_hypertension,Fever_of_unknown_origin,
Fluid_and_electrolyte_disorders,Fracture_of_lower_limb,
Fracture_of_upper_limb,Gastritis_and_duodenitis,
Gastroduodenal_ulcer_except_hemorrhage,Gastrointestinal_hemorrhage,
Genitourinary_symptoms_and_illdefined_conditions,Gout_and_other_crystal_arthropathies,
Headache_including_migraine,Heart_valve_disorders,
Hemorrhoids,Hepatitis,Hyperplasia_of_prostate,
Immunizations_and_screening_for_infectious_disease,
Inflammation_infection_of_eye_except_that_caused_by_tuberculosis_or_sexually_transmitteddisease,Inflammatory_diseases_of_female_pelvic_organs,
Intestinal_infection,Intracranial_injury,
Joint_disorders_and_dislocations_traumarelated,Late_effects_of_cerebrovascular_disease,
Medical_examination_evaluation,Menstrual_disorders,
Mood_disorders,Nausea_and_vomiting,
Neoplasms_of_unspecified_nature_or_uncertain_behavior,Nephritis_nephrosis_renal_sclerosis,
Noninfectious_gastroenteritis,Nonspecific_chest_pain,
Nutritional_deficiencies,Open_wounds_of_extremities,
Open_wounds_of_head_neck_and_trunk,Osteoarthritis,
Other_aftercare,Other_and_unspecified_benign_neoplasm,
Other_circulatory_disease,
Other_connective_tissue_disease,
Other_diseases_of_bladder_and_urethra,Other_diseases_of_kidney_and_ureters,
Other_disorders_of_stomach_and_duodenum,Other_ear_and_sense_organ_disorders,
Other_endocrine_disorders,Other_eye_disorders,
Other_female_genital_disorders,Other_fractures,
Other_gastrointestinal_disorders,Other_infections_including_parasitic,
Other_injuries_and_conditions_due_to_external_causes,Other_liver_diseases,
Other_lower_respiratory_disease,Other_nervous_system_disorders,
Other_nontraumatic_joint_disorders,Other_nutritional_endocrine_and_metabolic_disorders,
Other_screening_for_suspected_conditions_not_mental_disorders_or_infectious_disease,
Other_skin_disorders,Other_upper_respiratory_disease,
Other_upper_respiratory_infections,Paralysis,
Pleurisy_pneumothorax_pulmonary_collapse,Pneumonia_except_that_caused_by_tuberculosis_or_sexually_transmitted_disease,
Poisoning_by_other_medications_and_drugs,Respiratory_failure_insufficiency_arrest_adult,
Retinal_detachments_defects_vascular_occlusion_and_retinopathy,Screening_and_history_of_mental_health_and_substance_abuse_codes,
Secondary_malignancies,Septicemia_except_in_labor,
Skin_and_subcutaneous_tissue_infections,Spondylosis_intervertebral_disc_disorders_other_back_problems,
Sprains_and_strains,Superficial_injury_contusion,
Syncope,Thyroid_disorders,Urinary_tract_infections)~1
#LCA for 1 class
lca1<-poLCA(f,lca_dat1,nclass=1,maxiter=3000,tol=1e-7,graph=F,nrep=5)
#LCA for 2 classes
lca2<-poLCA(f,lca_dat1,nclass=2,maxiter=3000,tol=1e-7,graph=T,nrep=5)
##Extract maximum posterior probability
posterior_lca2 <- as.data.frame(lca2$posterior)
posterior_lca2$max_pos <- apply(posterior_lca2, 1, max)
##Check number of maximum posterior probability that falls above 0.7
table(posterior_lca2$max_pos>0.7)
#LCA for 3 classes
lca3<-poLCA(f,lca_dat1,nclass=3,maxiter=3000,tol=1e-7,graph=T,nrep=5)
##Extract maximum posterior probability
posterior_lca3 <- as.data.frame(lca3$posterior)
posterior_lca3$max_pos <- apply(posterior_lca3, 1, max)
##Check number of maximum posterior probability that falls above 0.7
table(posterior_lca3$max_pos>0.7)
...
You can create a list with the different configurations you want to use. Then use either one of the *apply functions from the parallel package or %dopar% from foreach. Which parallel backend you can/should use depends on your OS.
Here an example with foreach:
library(foreach)
library(doParallel)
cl <- makeCluster(parallel::detectCores() - 1) # doParallel works on every OS; other backends exist too
registerDoParallel(cl)
foreach(nclass = 1:10) %dopar% {
  # do something with nclass
  sqrt(nclass)
}
stopCluster(cl)
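Applied to the question (a hedged sketch, assuming `f` and `lca_dat1` from your code are in the workspace), one model per number of classes could be fit in parallel like this; `.packages = "poLCA"` loads the library on each worker:

library(foreach)
library(doParallel)
library(poLCA)

cl <- makeCluster(4)   # one worker per class solution fit at the same time
registerDoParallel(cl)

# f and lca_dat1 are picked up from the calling environment;
# add .export = c("f", "lca_dat1") if your backend does not export them automatically
lca_fits <- foreach(nclass = 1:4, .packages = "poLCA") %dopar% {
  poLCA(f, lca_dat1, nclass = nclass, maxiter = 3000,
        tol = 1e-7, graphs = FALSE, nrep = 5)
}

stopCluster(cl)

# compare the class solutions, e.g. by BIC
sapply(lca_fits, function(m) m$bic)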
Here are my thoughts on this; they are neither brief nor compact, and less than exact. I have never used anywhere near so many manifest factors with poLCA, and I think you may be breaking some interesting computational ground by doing so. I use poLCA to predict electoral outcomes per voter (red, blue, purple); I can be wrong at that and not suffer a malpractice suit. I really don't know about the risk of using LCA in health analysis. I think of LCA as more of a social sciences tool, but I could be wrong about that as well. Anyway:
(1) I believe you want to look for the most "parsimonious" set of factors to produce a latent class, and limit them to a reduced subset that proves the most useful across all your data. That will help with CPU optimization. I have found personally that using manifests that are exceptionally "monotonic" is not (by default) necessarily a good thing, although experimenting with more or less "monotonic" factors does tell you something about your model.
I have found it more "machine learning" friendly/responsible to use the most widespread manifests and "sample split" the data into groups, recombining the posteriors after the LCA runs. This assumes that the most widespread factors affect the different subgroups quantitatively, but with variance across sample groups (e.g. red, blue, purple). I don't know that anyone else does this, but I gave up trying to build the "one LCA model that rules them all" from voterdb information. That didn't work.
(2) The poLCA library (like most latent class analysis) depends upon matrix multiplication. I have found poLCA more CPU-bound than memory-bound, but with 114 manifests you may experience bottlenecks at every nook and cranny of your motherboard. Whatever you can do to increase matrix multiplication efficiency helps. I believe I have found that Microsoft Open R's use of Intel's MKL is more efficient than the default CRAN numeric library. Sorry, I haven't completely tested that, nor do I understand why some numeric libraries might be more efficient than others for matrix multiplication. I only know that Microsoft Open R brags about this some, and it appears to me they have a point with MKL.
(3) Reworking your LCA code to use Matt Dowle's data.table library shows me efficiencies across the board in all my work. I create 'dat' as a data.table and run iterations of the best-optimized data.table function for poLCA and the posteriors. Combining data.table efficiency with some of Hadley Wickham's improved *ply functions (plyr library), which put the LCA runs into lists, works well for me:
rbindlist(plyr::llply(1:10,check_lc_pc)) # check_lc_pc is the poLCA function.
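The body of check_lc_pc isn't shown above, but a hedged sketch of that kind of wrapper (assuming `f` and `dat` are the formula and data.table from the question and this answer) might return one row of fit statistics per run, so that rbindlist() stacks them into a comparison table:

library(data.table)
library(poLCA)

check_lc_pc <- function(nclass) {
  lc <- poLCA(f, dat, nclass = nclass, maxiter = 3000, tol = 1e-7,
              nrep = 1, verbose = FALSE, graphs = FALSE)
  # one row per class solution: log-likelihood, AIC, BIC
  data.table(nclass = nclass, llik = lc$llik, aic = lc$aic, bic = lc$bic)
}

rbindlist(plyr::llply(1:10, check_lc_pc))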
(4) This is a simple tip (maybe even condescending), but you don't need to print all the standard error output once you are satisfied with your model, so set verbose = FALSE. Also, by making regular test runs, I can identify the best-optimized run for my model, reuse its starting values ('probs.start'), and leverage testing thereof:
lc <- poLCA(f,dat,nrep=1,probs.start=probs.start.new,verbose=FALSE)
poLCA produces a lot of output to the screen by default. Create a poLCA wrapper function with verbose = FALSE; a byte-compiling R (3.5) optimizes the output further.
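For example (a sketch of that wrapper idea, not this answer's exact code; `f` and `dat` are assumed from above):

# quiet wrapper around poLCA; cmpfun() byte-compiles it explicitly,
# although recent R versions byte-compile functions on first use anyway
lc_quiet <- compiler::cmpfun(function(nclass, probs.start = NULL) {
  poLCA(f, dat, nclass = nclass, nrep = 1,
        probs.start = probs.start, verbose = FALSE, graphs = FALSE)
})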
(5) I use Windows 10, and because of a fast SSD, fast DDR, and Microsoft "memory compression", I think I notice that the Windows 10 OS adapts to LCA runs with lots of "memory compression". I assume it is holding the same matrices in compressed memory because I am calling them repeatedly over time. I really like the Kaby Lake processors that "self-overclock"; I see my 7700HQ processor taking advantage of that during LCA runs. (It would seem that LCA runs would benefit from overclocking, but I don't like to overclock my processor on my own. That's too much risk for me.) I think it is useful to monitor memory use of your LCA runs from another R console with system calls to PowerShell and cmd memory-management functions. The one below lists the hidden "Memory Compression" process(!!):
ps_f <- function() { system("powershell -ExecutionPolicy Bypass -command $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
$t2 = $t1 | Select {
$_.Id;
[math]::Round($_.WorkingSet64/1MB);
[math]::Round($_.PrivateMemorySize64/1MB);
[math]::Round($_.VirtualMemorySize64/1MB) };
$t2 | ft * "); }
ps_all <- function() {ps();ps_e();ps_f();}
I also have this memory-reporting function for the R session used for the LCA runs, though of course it has to run before or after the run itself:
memory <- function() {
  as.matrix(list(
    paste0(shell('systeminfo | findstr "Memory"')), # Windows
    paste0("R Memory size (malloc) available: ", memory.size(TRUE), " MB"),
    paste0("R Memory size (malloc) in use: ", memory.size(), " MB"),
    paste0("R Memory limit (total alloc): ", memory.limit(), " MB")
  ))
}
There is work on the optimization functions for latent class analysis. I post a link here, but I don't think it helps us today as users of poLCA or LCA: http://www.mat.univie.ac.at/~neum/ms/fuchs-coap11.pdf. Maybe the discussion is good background, though. There is nothing simple about poLCA. This document by the developers, http://www.sscnet.ucla.edu/polisci/faculty/lewis/pdf/poLCA-JSS-final.pdf, is worth reading at least twice!
If anyone else has any thoughts on poLCA or LCA compression, I would appreciate further discussion as well. Once I started predicting voter outcomes for an entire state as opposed to my county, I had to think about optimization and the limits of poLCA and LCA/LCR.
Nowadays, there is a parallelized C++-based implementation of poLCA, named poLCAParallel, at https://github.com/QMUL/poLCAParallel. For me, it was much faster than the base package.
Problem description
I have 45000 short time series (length 9) and would like to compute the distances for a cluster analysis. I realize that this will result in (the lower triangle of) a matrix of size 45000x45000, a matrix with more than 2 billion entries. Unsurprisingly, I get:
> proxy::dist(ctab2, method="euclidean")
Error: cannot allocate vector of size 7.6 Gb
What can I do?
Ideas
Increase available/addressable memory somehow? However, these 7.6 GB are probably beyond some hard limit that cannot be extended? In any case, the system has 16 GB of memory and the same amount of swap. By "Gb", R seems to mean gigabyte, not gigabit, so 7.6 Gb already puts us dangerously close to a hard limit.
Perhaps a different distance computation method instead of euclidean, say DTW, might be more memory efficient? However, as explained below, the memory limit seems to be the resulting matrix, not the memory required at computation time.
Split the dataset into N chunks and compute the matrix in N^2 parts (actually only those parts relevant for the lower triangle) that can later be reassembled? (This might look similar to the solution to a similar problem proposed here.) It seems a rather messy solution, though, and I will need the 45K x 45K matrix in the end anyway, which seems to hit the same limit: the system also gives the memory allocation error when generating a 45K x 45K random matrix:
> N=45000; memorytestmatrix <- matrix( rnorm(N*N,mean=0,sd=1), N, N)
Error: cannot allocate vector of size 15.1 Gb
30K x 30K matrices are possible without problems, R gives the resulting size as
> print(object.size(memorytestmatrix), units="auto")
6.7 Gb
1 Gb more and everything would be fine, it seems. Sadly, I do not have any large objects that I could delete to make room. Also, ironically,
> system('free -m')
Warning message:
In system("free -m") : system call failed: Cannot allocate memory
I have to admit that I am not really sure why R refuses to allocate 7.6 Gb; the system certainly has more memory, although not a lot more. ps aux shows the R process as the single biggest memory user. Maybe there is an issue with how much memory R can address even if more is available?
Related questions
Answers to other questions related to R running out of memory, like this one, suggest using more memory-efficient methods of computation.
This very helpful answer suggests deleting other large objects to make room for the memory-intensive operation.
Here, the idea to split the data set and compute distances chunk-wise is suggested.
Software & versions
R version is 3.4.1. System kernel is Linux 4.7.6, x86_64 (i.e. 64bit).
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
Edit (Aug 27): Some more information
Updating the Linux kernel to 4.11.9 has no effect.
The bigmemory package may also run out of memory. It uses shared memory in /dev/shm/ of which the system by default (but depending on configuration) allows half the size of the RAM. You can increase this at runtime by doing (for instance) mount -o remount,size=12Gb /dev/shm, but this may still not allow usage of 12Gb. (I do not know why, maybe the memory management configuration is inconsistent then). Also, you may end up crashing your system if you are not careful.
R apparently actually allows access to the full RAM and can create objects up to that size. It just seems to fail for particular functions such as dist. I will add this as an answer, but my conclusions are a bit based on speculation, so I do not know to what degree this is right.
R apparently actually allows access to the full RAM. This works perfectly fine:
N=45000; memorytestmatrix <- matrix(nrow=N, ncol=N)
This is the same thing I tried before as described in the original question, but with a matrix of NA's instead of rnorm random variates. Reassigning one of the values in the matrix to a floating-point number (memorytestmatrix[1,1] <- 0.5) still works and recasts the matrix as a numeric (double) matrix.
Consequently, I suppose, you can have a matrix of that size, but you cannot do it the way the dist function attempts to do it. A possible explanation is that the function operates with multiple objects of that size in order to speed the computation up. However, if you compute the distances element-wise and change the values in place, this works.
library(mefa) # for the vec2dist function

euclidian <- function(series1, series2) {
  return((sum((series1 - series2)^2))^.5)
}

mx <- nrow(ctab2)
distMatrixE <- vec2dist(0.0, size = mx)
for (coli in 1:(mx - 1)) {
  for (rowi in (coli + 1):mx) {
    # Element indices in dist objects count the rows down column by column,
    # from left to right, in lower triangular matrices without the main diagonal.
    # From row and column indices, the element index for the dist object is computed like so:
    element <- (mx^2 - mx)/2 - ((mx - coli + 1)^2 - (mx - coli + 1))/2 + rowi - coli
    # ... and now, we replace the distances in place
    distMatrixE[element] <- euclidian(ctab2[rowi, ], ctab2[coli, ])
  }
}
(Note that addressing in dist objects is a bit tricky, since they are not matrices but 1-dimensional vectors of size (N²-N)/2 recast as lower triangular matrices of size N x N. If we go through rows and columns in the right order, it could also be done with a counter variable, but computing the element index explicitly is clearer, I suppose.)
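A quick way to convince yourself that the index formula is right is to check it against R's own dist() on a small example (purely illustrative):

set.seed(1)
small <- matrix(rnorm(5 * 3), nrow = 5)   # 5 short series of length 3
d  <- dist(small)
mx <- nrow(small)
coli <- 2; rowi <- 4
element <- (mx^2 - mx)/2 - ((mx - coli + 1)^2 - (mx - coli + 1))/2 + rowi - coli
d[element] == as.matrix(d)[rowi, coli]   # TRUE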
Also note that it may be possible to speed this up by computing more than one value at a time, for example a whole column of the lower triangle at once; see the sketch below.
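A hedged sketch of that idea, assuming `ctab2`, `mx` and `distMatrixE` from the code above: compute all distances from column `coli` to the rows below it in one vectorized call and write the whole block into the dist object at once, instead of element by element.

for (coli in 1:(mx - 1)) {
  rows <- (coli + 1):mx
  # squared differences of every later row against row coli, summed per row
  d <- sqrt(rowSums(sweep(ctab2[rows, , drop = FALSE], 2, ctab2[coli, ])^2))
  # index of the first element of this column in the dist vector (rowi = coli + 1)
  first <- (mx^2 - mx)/2 - ((mx - coli + 1)^2 - (mx - coli + 1))/2 + 1
  distMatrixE[first:(first + length(rows) - 1)] <- d
}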
There are good clustering algorithms that do not need a full distance matrix in memory, such as SLINK, DBSCAN, and OPTICS.
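As a hedged illustration, the dbscan package in R clusters directly on the data matrix (using a kd-tree for neighbor queries), so the 45K x 45K distance matrix never has to exist; eps and minPts below are placeholders that would need tuning for ctab2:

library(dbscan)   # provides dbscan() and optics()
res <- dbscan(ctab2, eps = 1.0, minPts = 5)
table(res$cluster)   # cluster sizes; cluster 0 is noise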
I am trying to get a program to run on my GPU, and to start with an easy sample I modified the first sample on http://www.jocl.org/samples/samples.html into the following little script: I run n simultaneous "threads" (what is the correct name for the GPU equivalent of a thread?), each of which performs 20000000/n independent tanh() computations. You can see my code here: http://pastebin.com/DY2pdJzL
The speed is by far not what I expected:
for n=1 it takes 12.2 seconds
for n=2 it takes 6.3 seconds
for n=3 it takes 4.4 seconds
for n=4 it takes 3.4 seconds
for n=5 it takes 3.1 seconds
for n=6 and beyond, it takes 2.7 seconds.
So after n=6 (be it n=8, n=20, n=100, n=1000 or n=100000), there is no performance increase, which means only 6 of these are computed in parallel. However, according to the specifications of my card there should be 80 cores: http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2
It is not a matter of overhead, since increasing or decreasing the 20000000 only changes all the execution times by a linear factor.
I have installed the AMD APP SDK and drivers that support OpenCL: see http://dl.dropbox.com/u/3060536/prtscr.png and http://dl.dropbox.com/u/3060536/prtsrc2.png for details (or at least I conclude from these that OpenCL is running correctly).
So I'm a bit clueless now, where to search for answer. Why can JOCL only do 6 parallel executions on my ATI Radeon HD 5450?
You are hard-coding the local work size to 1. Use a larger size, or let the driver choose one for you.
Also, your kernel is not written in an OpenCL style. You should take out the for loop and let the driver handle the iteration for you.