Issues factoring large prime that is 99 digits long

The number is 112887987371630998240814603336195913423482111436696007401429072377238341647882152698281999652360869
My code is below
import math

def getfactors(number):
    factors = []
    for potentialFactor in range(1, int(math.sqrt(number)) + 1):
        if number % potentialFactor == 0:
            factors.append(potentialFactor)
    return factors
and the input is
getfactors(112887987371630998240814603336195913423482111436696007401429072377238341647882152698281999652360869)
The program has been running for at least 3 hours and I still have no results from it. The code works with other numbers. Is there any algorithm or method that I could use to speed this up?

Your method will take far too long to factor the given number, since the two RSA primes are of similar size and there are no small factors to find. Even sieving with the Sieve of Eratosthenes won't help: n has 326 bits, so you would have to sieve up to its roughly 163-bit square root (about 2^163 ≈ 10^49 candidates), which is out of the question. The number is only slightly smaller than the first RSA challenge number, RSA-100, which has 330 bits.
Use existing tools instead:
CADO-NFS: http://cado-nfs.gforge.inria.fr/
A beginner's guide to NFS factoring: http://gilchrist.ca/jeff/factoring/nfs_beginners_guide.html
Factoring as a service: https://seclab.upenn.edu/projects/faas/
The experiments
I tried Pollard's p-1 algorithm; it was still running after one and a half days without producing a result. This was expected, since the bound B would need to be around 2^55 to reach a success probability of about 1/27. I stopped the experiment after CADO-NFS succeeded. This was a self-implemented Pollard's p-1; an optimized version can be found in GMP-ECM.
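For reference, here is a minimal stage-1 Pollard p-1 sketch in R with the gmp package (my own illustration, not the implementation used in this experiment; use GMP-ECM for real work):

library(gmp)

pollard_pm1_stage1 <- function(n, B = 100000) {
  n <- as.bigz(n)
  a <- as.bigz(2)
  p <- as.bigz(2)
  while (p <= B) {
    pk <- p
    while (pk * p <= B) pk <- pk * p   # largest prime power p^k <= B
    a <- powm(a, pk, n)                # a <- a^(p^k) mod n
    g <- gcd(a - 1, n)
    if (g > 1 && g < n) return(g)      # non-trivial factor found
    p <- nextprime(p)
  }
  NULL   # no factor with this bound; for this n, B would need to be ~2^55
}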
I also tried CADO-NFS. The stable version may not compile easily on new machines, so prefer the active development version from GitLab.
After ~6 hours with 4 cores, CADO-NFS produced the result. As expected, this is an RSA CTF challenge. Since I don't want to spoil the fun, here are SHA-512 hash commitments of the two primes, computed with OpenSSL:
echo -n "prime x" | openssl sha512
27c64b5b944146aa1e40b35bd09307d04afa8d5fa2a93df9c5e13dc19ab032980ad6d564ab23bfe9484f64c4c43a993c09360f62f6d70a5759dfeabf59f18386
faebc6b3645d45f76c1944c6bd0c51f4e0d276ca750b6b5bc82c162e1e9364e01aab42a85245658d0053af526ba718ec006774b7084235d166e93015fac7733d
Details of the experiment
CPU : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
RAM : 32GB - doesn't require much RAM, at least during polynomial selection and sieving.
Dedicated cores : 4
Test machine Ubuntu 20.04.1 LTS
CUDA - NO
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
cmake version 3.16.3
external libraries: Nothing out of Ubuntu's canonicals
CADO-NFS version : GitLab development version, cloned on 23-01-2021
The bit sizes;
n has 326 bits (the RSA-100 challenge has 330 bits and was factored by Lenstra in 1991)
p has 165 bits
q has 162 bits
The stable cado-nfs-2.3.0 did not compile, giving errors about HWLOC (HWLOC_TOPOLOGY_FLAG_IO_DEVICES). I asked a friend to test the compilation and it worked for them, on an older Linux version, so I decided to use the GitLab version.

What do you know about this number? If it is an RSA public key then it only has two large prime factors. If it is a random number then it will very probably have small prime factors. The type of number will determine how you want to approach factorising it.
Two ancillary functions will also be useful. First, a Sieve of Eratosthenes to build a list of primes up to, say, 50,000 or some other convenient limit. Second, a large-number primality test, such as Miller-Rabin, to check whether the residue is prime or not.
Use the Sieve of Eratosthenes to give you all the small primes up to a convenient limit. Test each prime in turn up to the square root of the target number. When you find a prime that divides the test number, divide it out to make the test number smaller; a prime may divide more than once. Once all the divisions for that prime are finished, reset the prime limit to the square root of the reduced number.
if (numToTest MOD trialFactor = 0)
    repeat
        listOfFactors.add(trialFactor)
        numToTest <- numToTest / trialFactor
    until (numToTest MOD trialFactor != 0)
    primeLimit <- sqrt(numToTest)
endif
Once the number you are testing has been reduced to 1, you have completely factored it.
If you run out of primes before completely factoring the target, it is worth running the Miller-Rabin test 64 times to see if the remainder is itself prime; potentially that can save you a lot of work trying to find non-existent factors of a large prime. If the remainder is composite then you can either try again with a larger sieve or use one of the heavy duty factoring methods: Quadratic Sieve, Elliptic Curve etc.
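A minimal R sketch of this whole procedure, using the gmp package purely for illustration (the pseudocode above is language-agnostic); gmp's isprime() stands in for the Miller-Rabin check on the remainder:

library(gmp)

# Sieve of Eratosthenes: all primes up to 'limit'
primes_upto <- function(limit) {
  is_p <- rep(TRUE, limit)
  is_p[1] <- FALSE
  for (i in 2:floor(sqrt(limit)))
    if (is_p[i]) is_p[seq(i * i, limit, by = i)] <- FALSE
  which(is_p)
}

trial_divide <- function(n, prime_limit = 50000) {
  n <- as.bigz(n)                        # pass very large n as a string
  factors <- c()                         # collected as character for simplicity
  for (p in primes_upto(prime_limit)) {
    if (as.bigz(p) * p > n) break        # p has passed the square root of the reduced n
    while (n %% p == 0) {                # a prime may divide in more than once
      factors <- c(factors, as.character(p))
      n <- n %/% p
    }
  }
  if (n > 1) {
    if (isprime(n) == 0)                 # 0 means "definitely composite"
      message("Composite cofactor remains; hand it to QS/ECM/NFS.")
    factors <- c(factors, as.character(n))
  }
  factors
}

trial_divide("600851475143")             # "71" "839" "1471" "6857"

For the 99-digit number in the question this will, of course, only confirm that there are no small factors; the composite cofactor then has to go to one of the heavy-duty methods.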

I have authored a library for the R programming language called RcppBigIntAlgos that can factor these types of numbers in a reasonable amount of time (not as fast as the excellent CADO-NFS library used in @kelalaka's answer).
As others have pointed out, your number is going to be very difficult to factor by any program you roll yourself unless you invest a considerable amount of time into it. For me personally, I have invested thousands of hours. Your mileage may vary.
Here is my test run in a vanilla R console (no special IDE):
numDig99 <- "112887987371630998240814603336195913423482111436696007401429072377238341647882152698281999652360869"
## install.packages("RcppBigIntAlgos") if necessary
library(RcppBigIntAlgos)
prime_fac <- quadraticSieve(numDig99, showStats=TRUE, nThreads=8)
Summary Statistics for Factoring:
112887987371630998240814603336195913423482111436696007401429072377238341647882152698281999652360869
| MPQS Time | Complete | Polynomials | Smooths | Partials |
|--------------------|----------|-------------|------------|------------|
| 11h 13m 25s 121ms | 100% | 11591331 | 8768 | 15707 |
| Mat Algebra Time | Mat Dimension |
|--------------------|--------------------|
| 1m 39s 519ms | 24393 x 24475 |
| Total Time |
|--------------------|
| 11h 15m 12s 573ms |
And just as in @kelalaka's answer, we obtain the same hashed values (again, run right in the R console):
system(sprintf("printf %s | openssl sha512", prime_fac[1]))
faebc6b3645d45f76c1944c6bd0c51f4e0d276ca750b6b5bc82c162e1e9364e01aab42a85245658d0053af526ba718ec006774b7084235d166e93015fac7733d
system(sprintf("printf %s | openssl sha512", prime_fac[2]))
27c64b5b944146aa1e40b35bd09307d04afa8d5fa2a93df9c5e13dc19ab032980ad6d564ab23bfe9484f64c4c43a993c09360f62f6d70a5759dfeabf59f18386
The algorithm implemented in RcppBigIntAlgos::quadraticSieve is the Multiple Polynomial Quadratic Sieve. There is a more efficient variant known as the Self-Initializing Quadratic Sieve; however, there isn't as much literature available on it.
Here are my machine specs:
MacBook Pro (15-inch, 2017)
Processor: 2.8 GHz Quad-Core Intel Core i7
Memory: 16 GB 2133 MHz LPDDR3
And here is my R info:
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RcppBigIntAlgos_1.0.1 gmp_0.6-0
loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3 Rcpp_1.0.5

Related

How does Branch Prediction affect performance in R?

Some references:
This is a follow-up to Why is processing a sorted array faster than processing an unsorted array?
The only post in the r tag that I found somewhat related to branch prediction was Why sampling matrix row is very slow?
Explanation of the problem:
I was investigating whether processing a sorted array is faster than processing an unsorted one (same as the problem tested in Java and C – first link) to see if branch prediction is affecting R in the same manner.
See the benchmark examples below:
set.seed(128)
# or making a vector with 1e7
myvec <- rnorm(1e8, 128, 128)
myvecsorted <- sort(myvec)
mysumU = 0
mysumS = 0
SvU <- microbenchmark::microbenchmark(
  Unsorted = for (i in 1:length(myvec)) {
    if (myvec[i] > 128) {
      mysumU = mysumU + myvec[i]
    }
  },
  Sorted = for (i in 1:length(myvecsorted)) {
    if (myvecsorted[i] > 128) {
      mysumS = mysumS + myvecsorted[i]
    }
  },
  times = 10)
ggplot2::autoplot(SvU)
Question:
First, I want to know why the "Sorted" vector is not the fastest all the time, and why the difference is not of the same magnitude as in Java?
Second, why does the sorted execution time have a higher variation than the unsorted one?
N.B. My CPU is an i7-6820HQ @ 2.70GHz Skylake, quad-core with hyperthreading.
Update:
To investigate the variation part, I did the microbenchmark with the vector of 100 million elements (n=1e8) and repeated the benchmark 100 times (times=100). Here's the associated plot with that benchmark.
Here's my sessioninfo:
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] compiler stats graphics grDevices utils datasets methods base
other attached packages:
[1] rstudioapi_0.10 reprex_0.3.0 cli_1.1.0 pkgconfig_2.0.3 evaluate_0.14 rlang_0.4.0
[7] Rcpp_1.0.2 microbenchmark_1.4-7 ggplot2_3.2.1
Interpreter overhead, and just being an interpreter, explains most of the average difference. I don't have an explanation for the higher variance.
R is an interpreted language, not JIT compiled to machine code like Java, or ahead-of-time like C. (I don't know much about R internals, just CPUs and performance, so I'm making a lot of assumptions here.)
The code that's running on the actual CPU hardware is the R interpreter, not exactly your R program.
Control dependencies in the R program (like an if()) become data dependencies in the interpreter. The current thing being executed is just data for the interpreter running on a real CPU.
Different operations in the R program become control dependencies in the interpreter. For example, evaluating myvec[i] then the + operator would probably be done by two different functions in the interpreter. And a separate function for > and for if() statements.
The classic interpreter loop is based around an indirect branch that dispatches from a table of function pointers. Instead of a taken/not-taken choice, the CPU needs a prediction for one of many recently-used target addresses. I don't know if R uses a single indirect branch like that or if it tries to be fancier, like having the end of each interpreter block dispatch to the next one instead of returning to a main dispatch loop.
Modern Intel CPUs (like Haswell and later) have IT-TAGE (Indirect TAgged GEometric history length) prediction. The taken/not-taken state of previous branches along the path of execution are used as an index into a table of predictions. This mostly solves the interpreter branch-prediction problem, allowing it to do a surprisingly good job, especially when the interpreted code (the R code in your case) does the same thing repeatedly.
Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore (2015) - Haswell's ITTAGE is a huge improvement for interpreters, invalidating the previous wisdom that a single indirect branch for interpreter dispatch was a disaster. I don't know what R actually uses; there are tricks that were useful.
X86 prefetching optimizations: "computed goto" threaded code has more links.
https://comparch.net/2013/06/30/why-tage-is-the-best/
https://danluu.com/branch-prediction/ has some links about that at the bottom. Also mentions that AMD has used Perceptron predictors in Bulldozer-family and Zen: like a neural net.
The if() being taken does result in needing to do different operations, so it does actually still make some branching in the R interpreter more or less predictable depending on data. But of course as an interpreter, it's doing much more work at each step than a simple machine-code loop over an array.
So extra branch mispredicts are a much smaller fraction of the total time because of interpreter overhead.
Of course, both your tests are with the same interpreter on the same hardware. I don't know what kind of CPU you have.
If it's Intel older than Haswell or AMD older than Zen, you might be getting a lot of mispredicts even with the sorted array, unless the pattern is simple enough for an indirect branch history predictor to lock onto. That would bury the difference in more noise.
Since you do see a pretty clear difference, I'm guessing that the CPU doesn't mispredict too much in the sorted case, so there is room for it to get worse in the unsorted case.

Parallel processing or optimization of latent class analysis

I am using the poLCA package to run latent class analysis (LCA) on data with 450,000 observations and 114 variables. As with most latent class analyses, I will need to run this multiple times for different numbers of classes. Each run takes about 12-20 hours depending on the number of classes selected.
Is there a way for me to utilize parallel processing to run this more efficiently? Otherwise, is there other ways to optimize this?
#Converting binary variables to 1 and 2
lca_dat1=lca_dat1+1
#Formula for LCA
f<-cbind(Abdominal_hernia,Abdominal_pain,
Acute_and_unspecified_renal_failure,Acute_cerebrovascular_disease,
Acute_myocardial_infarction,Administrative_social_admission,
Allergic_reactions,Anal_and_rectal_conditions,
Anxiety_disorders,Appendicitis_and_other_appendiceal_conditions,
Asthma,Bacterial_infection_unspecified_site,
Biliary_tract_disease,Calculus_of_urinary_tract,
Cancer_of_breast,Cardiac_dysrhythmias,
Cataract,Chronic_obstructive_pulmonary_disease_and_bronchiectasis,
Chronic_renal_failure,Chronic_ulcer_of_skin,
Coagulation_and_hemorrhagic_disorders,Coma_stupor_and_brain_damage,
Complication_of_device_implant_or_graft,Complications_of_surgical_procedures_or_medical_care,
Conditions_associated_with_dizziness_or_vertigo,Congestive_heart_failure_nonhypertensive,
Coronary_atherosclerosis_and_other_heart_disease,Crushing_injury_or_internal_injury,
Deficiency_and_other_anemia,Delirium_dementia_and_amnestic_and_other_cognitive_disorders,
Disorders_of_lipid_metabolism,Disorders_of_teeth_and_jaw,
Diverticulosis_and_diverticulitis,E_Codes_Adverse_effects_of_medical_care,
E_Codes_Adverse_effects_of_medical_drugs,E_Codes_Fall,
Epilepsy_convulsions,Esophageal_disorders,
Essential_hypertension,Fever_of_unknown_origin,
Fluid_and_electrolyte_disorders,Fracture_of_lower_limb,
Fracture_of_upper_limb,Gastritis_and_duodenitis,
Gastroduodenal_ulcer_except_hemorrhage,Gastrointestinal_hemorrhage,
Genitourinary_symptoms_and_illdefined_conditions,Gout_and_other_crystal_arthropathies,
Headache_including_migraine,Heart_valve_disorders,
Hemorrhoids,Hepatitis,Hyperplasia_of_prostate,
Immunizations_and_screening_for_infectious_disease,
Inflammation_infection_of_eye_except_that_caused_by_tuberculosis_or_sexually_transmitteddisease,Inflammatory_diseases_of_female_pelvic_organs,
Intestinal_infection,Intracranial_injury,
Joint_disorders_and_dislocations_traumarelated,Late_effects_of_cerebrovascular_disease,
Medical_examination_evaluation,Menstrual_disorders,
Mood_disorders,Nausea_and_vomiting,
Neoplasms_of_unspecified_nature_or_uncertain_behavior,Nephritis_nephrosis_renal_sclerosis,
Noninfectious_gastroenteritis,Nonspecific_chest_pain,
Nutritional_deficiencies,Open_wounds_of_extremities,
Open_wounds_of_head_neck_and_trunk,Osteoarthritis,
Other_aftercare,Other_and_unspecified_benign_neoplasm,
Other_circulatory_disease,
Other_connective_tissue_disease,
Other_diseases_of_bladder_and_urethra,Other_diseases_of_kidney_and_ureters,
Other_disorders_of_stomach_and_duodenum,Other_ear_and_sense_organ_disorders,
Other_endocrine_disorders,Other_eye_disorders,
Other_female_genital_disorders,Other_fractures,
Other_gastrointestinal_disorders,Other_infections_including_parasitic,
Other_injuries_and_conditions_due_to_external_causes,Other_liver_diseases,
Other_lower_respiratory_disease,Other_nervous_system_disorders,
Other_nontraumatic_joint_disorders,Other_nutritional_endocrine_and_metabolic_disorders,
Other_screening_for_suspected_conditions_not_mental_disorders_or_infectious_disease,
Other_skin_disorders,Other_upper_respiratory_disease,
Other_upper_respiratory_infections,Paralysis,
Pleurisy_pneumothorax_pulmonary_collapse,Pneumonia_except_that_caused_by_tuberculosis_or_sexually_transmitted_disease,
Poisoning_by_other_medications_and_drugs,Respiratory_failure_insufficiency_arrest_adult,
Retinal_detachments_defects_vascular_occlusion_and_retinopathy,Screening_and_history_of_mental_health_and_substance_abuse_codes,
Secondary_malignancies,Septicemia_except_in_labor,
Skin_and_subcutaneous_tissue_infections,Spondylosis_intervertebral_disc_disorders_other_back_problems,
Sprains_and_strains,Superficial_injury_contusion,
Syncope,Thyroid_disorders,Urinary_tract_infections)~1
#LCA for 1 class
lca1<-poLCA(f,lca_dat1,nclass=1,maxiter=3000,tol=1e-7,graph=F,nrep=5)
#LCA for 2 classes
lca2<-poLCA(f,lca_dat1,nclass=2,maxiter=3000,tol=1e-7,graph=T,nrep=5)
##Extract maximum posterior probability
posterior_lca2 <- as.data.frame(lca2$posterior)   # $posterior is a matrix; convert so $max_pos works
posterior_lca2$max_pos <- apply(posterior_lca2, 1, max)
##Check number of maximum posterior probability that falls above 0.7
table(posterior_lca2$max_pos>0.7)
#LCA for 3 classes
lca3<-poLCA(f,lca_dat1,nclass=3,maxiter=3000,tol=1e-7,graph=T,nrep=5)
##Extract maximum posterior probability
posterior_lca3 <- as.data.frame(lca3$posterior)   # convert the posterior matrix to a data frame
posterior_lca3$max_pos <- apply(posterior_lca3, 1, max)
##Check number of maximum posterior probability that falls above 0.7
table(posterior_lca3$max_pos>0.7)
...
You can create a list with the different configurations you want to use. Then use either one of the *apply functions from the parallel package or %dopar% from foreach. Which parallel backend you can/should use depends on your OS.
Here is an example with foreach:
library(foreach)
library(doParallel)

registerDoSEQ() # proper backend depends on the OS

foreach(nclass = 1:10) %dopar% {
  # do something with nclass
  sqrt(nclass)
}
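For instance, a cluster backend (portable across Windows, macOS and Linux) could look like the sketch below; f and lca_dat1 are the formula and data from the question, and the worker count is just an example:

library(foreach)
library(doParallel)
library(poLCA)

cl <- makeCluster(4)                  # e.g. one worker per physical core
registerDoParallel(cl)

lca_models <- foreach(nclass = 1:10, .packages = "poLCA") %dopar% {
  # each worker fits one model; f and lca_dat1 must be visible to the workers
  poLCA(f, lca_dat1, nclass = nclass, maxiter = 3000, tol = 1e-7,
        graphs = FALSE, nrep = 5, verbose = FALSE)
}

stopCluster(cl)

Keep in mind that each worker gets its own copy of the 450,000-row data set, so with many workers memory, rather than CPU, can become the bottleneck.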
Here are my not too brief or compact thoughts on this. They are less than exact. I have not ever used anywhere near so many manifest factors with poLCA and I think you may be breaking some interesting ground doing so computationally. I use poLCA to predict electoral outcomes per voter (red, blue, purple). I can be wrong on that and not suffer a malpractice suit. I really don't know about the risk of LCA use in health analysis. I think of LCA as more of a social sciences tool. I could be wrong about that as well. Anyway:
(1) I believe you want to look for the most "parsimonious" factors to produce a latent class and limit them to a reduced subset that proves the most useful for all your data. That will help with CPU optimization. I have found personally that using manifests that are exceptionally "monotonic" is not (by default) necessarily a good thing, although certainly experimenting with factors more or less "monotonic" talks to you about your model.
I have found it is more "machine learning" friendly/responsible to use the most widespread manifests and "sample split" your data into groups, recombining the posteriors after the LCA run. This assumes that the most widespread factors affect different subgroups quantitatively but with variance across sample groups (e.g. red, blue, purple). I don't know that anyone else does this, but I gave up trying to build the "one LCA model that rules them all" from voter-db information. That didn't work.
(2) The poLCA library (like most latent class analysis) depends upon matrix multiplication. I have found poLCA more CPU bound than memory bound, but with 114 manifests you may experience bottlenecks at every nook and cranny of your motherboard. Whatever you can do to increase matrix-multiplication efficiency helps. I believe I have found Microsoft R Open's use of Intel's MKL to be more efficient than the default CRAN numeric library. Sorry, I haven't completely tested that, nor do I understand why some numeric libraries might be more efficient than others for matrix multiplication. I only know that Microsoft R Open brags about this some, and it appears to me they have a point with MKL.
(3) Reworking your LCA code to use Matt Dowle's data.table library shows me efficiencies across the board in all my work. I create 'dat' as a data.table and run iterations to find the best-optimized data.table usage for poLCA and the posteriors. Combining data.table efficiency with some of Hadley Wickham's improved *ply functions (plyr library), which put LCA runs into lists, works well for me:
rbindlist(plyr::llply(1:10,check_lc_pc)) # check_lc_pc is the poLCA function.
(4) This is a simple tip (maybe even condescending), but you don't need to print all the standard-error output once you are satisfied with your model, so set verbose = FALSE. Also, by making regular test runs, I can determine the best starting values ('probs.start') for my model and reuse them:
lc <- poLCA(f,dat,nrep=1,probs.start=probs.start.new,verbose=FALSE)
poLCA produces a lot of output to the screen by default; call it with verbose=FALSE, and byte-compiled R (3.5) helps as well.
(5) I use Windows 10, and because of a fast SSD, fast DDR, and Microsoft "memory compression", I think I notice that the Windows 10 OS adapts to LCA runs with lots of memory compression. I assume it is holding the same matrices in compressed memory because I am calling them repeatedly over time. I really like the Kaby Lake processors that "self over-clock"; I see my 7700HQ taking advantage of that during LCA runs. (It would seem that LCA runs would benefit from overclocking, but I don't like to overclock my processor on my own; that's too much risk for me.) I think it is useful to monitor the memory use of your LCA runs from another R console, with system calls to PowerShell and cmd memory-management functions. The one below lists the hidden "Memory Compression" process(!!):
ps_f <- function() { system("powershell -ExecutionPolicy Bypass -command $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
$t2 = $t1 | Select {
$_.Id;
[math]::Round($_.WorkingSet64/1MB);
[math]::Round($_.PrivateMemorySize64/1MB);
[math]::Round($_.VirtualMemorySize64/1MB) };
$t2 | ft * "); }
ps_all <- function() {ps();ps_e();ps_f();}
I also have this memory-management function for the session used for the LCA runs, although of course it runs before or after, not during:
memory <- function() {
  as.matrix(list(
    paste0(shell('systeminfo | findstr "Memory"')),                          # Windows
    paste0("R Memory size (malloc) available: ", memory.size(TRUE), " MB"),
    paste0("R Memory size (malloc) in use: ", memory.size(), " MB"),
    paste0("R Memory limit (total alloc): ", memory.limit(), " MB")
  ))
}
There is work on optimization functions for latent class analysis. I post a link here, although I don't think it helps us today as users of poLCA or LCA: http://www.mat.univie.ac.at/~neum/ms/fuchs-coap11.pdf. Maybe the discussion is good background, though. There is nothing simple about poLCA; this document by the developers, http://www.sscnet.ucla.edu/polisci/faculty/lewis/pdf/poLCA-JSS-final.pdf, is worth reading at least twice!
If anyone else has any thoughts on poLCA or LCA optimization, I would appreciate further discussion as well. Once I started predicting voter outcomes for an entire state, as opposed to my county, I had to think about optimization and the limits of poLCA and LCA/LCR.
Nowadays, there is a parallelized C++-based implementation of poLCA, named poLCAParallel: https://github.com/QMUL/poLCAParallel. For me, it was much, much faster than the base package.

R running out of memory during time series distance computation

Problem description
I have 45000 short time series (length 9) and would like to compute the distances for a cluster analysis. I realize that this will result in (the lower triangle of) a matrix of size 45000x45000, a matrix with more than 2 billion entries. Unsurprisingly, I get:
> proxy::dist(ctab2, method="euclidean")
Error: cannot allocate vector of size 7.6 Gb
What can I do?
Ideas
Increase available/addressable memory somehow? However, these 7.6G are probably beyond some hard limit that cannot be extended? In any case, the system has 16GB memory and the same amount of swap. By "Gb", R seems to mean Gigabyte, not Gigabit, so 7.6Gb puts us already dangerously close to a hard limit.
Perhaps a different distance computation method instead of euclidean, say DTW, might be more memory efficient? However, as explained below, the memory limit seems to be the resulting matrix, not the memory required at computation time.
Split the dataset into N chunks and compute the matrix in N^2 parts (actually only those parts relevant for the lower triangle) that can later be reassembled? (This might look similar to the solution to a similar problem proposed here.) It seems to be a rather messy solution, though. Further, I will need the 45K x 45K matrix in the end anyway. However, this seems to hit the limit. The system also gives the memory allocation error when generating a 45K x 45K random matrix:
> N=45000; memorytestmatrix <- matrix( rnorm(N*N,mean=0,sd=1), N, N)
Error: cannot allocate vector of size 15.1 Gb
30K x 30K matrices are possible without problems, R gives the resulting size as
> print(object.size(memorytestmatrix), units="auto")
6.7 Gb
1 Gb more and everything would be fine, it seems. Sadly, I do not have any large objects that I could delete to make room. Also, ironically,
> system('free -m')
Warning message:
In system("free -m") : system call failed: Cannot allocate memory
I have to admit that I am not really sure why R refuses to allocate 7.6 Gb; the system certainly has more memory, although not a lot more. ps aux shows the R process as the single biggest memory user. Maybe there is an issue with how much memory R can address even if more is available?
Related questions
Answers to other questions related to R running out of memory, like this one, suggest to use a more memory efficient methods of computation.
This very helpful answer suggests to delete other large objects to make room for the memory intensive operation.
Here, the idea to split the data set and compute distances chunk-wise is suggested.
Software & versions
R version is 3.4.1. System kernel is Linux 4.7.6, x86_64 (i.e. 64bit).
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
Edit (Aug 27): Some more information
Updating the Linux kernel to 4.11.9 has no effect.
The bigmemory package may also run out of memory. It uses shared memory in /dev/shm/ of which the system by default (but depending on configuration) allows half the size of the RAM. You can increase this at runtime by doing (for instance) mount -o remount,size=12Gb /dev/shm, but this may still not allow usage of 12Gb. (I do not know why, maybe the memory management configuration is inconsistent then). Also, you may end up crashing your system if you are not careful.
R apparently actually allows access to the full RAM and can create objects up to that size. It just seems to fail for particular functions such as dist. I will add this as an answer, but my conclusions are a bit based on speculation, so I do not know to what degree this is right.
R apparently actually allows access to the full RAM. This works perfectly fine:
N=45000; memorytestmatrix <- matrix(nrow=N, ncol=N)
This is the same thing I tried before as described in the original question, but with a matrix of NA's instead of rnorm random variates. Reassigning one of the values in the matrix as float (memorytestmatrix[1,1]<-0.5) still works and recasts the matrix as a float matrix.
Consequently, I suppose, you can have a matrix of that size, but you cannot do it the way the dist function attempts to do it. A possible explanation is that the function operates with multiple objects of that size in order to speed the computation up. However, if you compute the distances element-wise and change the values in place, this works.
library(mefa) # for the vec2dist function

euclidian <- function(series1, series2) {
  return((sum((series1 - series2)^2))^.5)
}

mx <- nrow(ctab2)
distMatrixE <- vec2dist(0.0, size = mx)
for (coli in 1:(mx - 1)) {
  for (rowi in (coli + 1):mx) {
    # Element indices in dist objects count the rows down column by column, from left
    # to right, in lower triangular matrices without the main diagonal.
    # From row and column indices, the element index for the dist object is computed like so:
    element <- (mx^2 - mx)/2 - ((mx - coli + 1)^2 - (mx - coli + 1))/2 + rowi - coli
    # ... and now, we replace the distances in place
    distMatrixE[element] <- euclidian(ctab2[rowi, ], ctab2[coli, ])
  }
}
(Note that addressing in dist objects is a bit tricky, since they are not matrices but 1-dimensional vectors of size (N²-N)/2 recast as lower triangular matrices of size N x N. If we go through rows and columns in the right order, it could also be done with a counter variable, but computing the element index explicitly is clearer, I suppose.)
Also note that it may be possible to speed this up by computing more than one value at a time with vectorised operations such as sapply or colSums, as sketched below.
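For example (a sketch, assuming ctab2 is a plain numeric matrix), one whole column of the lower triangle can be filled at a time with vectorised arithmetic, which removes one of the two interpreted loops:

library(mefa) # for vec2dist, as above

mx <- nrow(ctab2)
distMatrixE <- vec2dist(0.0, size = mx)
offset <- 0
for (coli in 1:(mx - 1)) {
  rows <- (coli + 1):mx
  # Euclidean distances from observation 'coli' to all later observations at once
  d <- sqrt(colSums((t(ctab2[rows, , drop = FALSE]) - ctab2[coli, ])^2))
  # the elements of column 'coli' sit contiguously in the dist object
  distMatrixE[offset + seq_along(rows)] <- d
  offset <- offset + length(rows)
}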
There exist good algorithms that do not need a full distance matrix in memory.
For example, SLINK, DBSCAN, and OPTICS.
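For instance, with the dbscan package (my choice for illustration; the answer above does not name a specific implementation), DBSCAN and OPTICS run directly on the data matrix and never build the 45000 x 45000 distance matrix:

library(dbscan)

res_db  <- dbscan(ctab2, eps = 0.5, minPts = 5)   # eps and minPts need tuning for the data
res_opt <- optics(ctab2, minPts = 5)              # reachability ordering, useful for picking eps

Similarly, the fastcluster package offers single-linkage (SLINK-style) hierarchical clustering computed directly from the data without a stored distance matrix.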

OpenCL Accuracy and Performance Issues when using MacPro (Firepro D500)

I have run into a strange issue while running the same OpenCL kernel on multiple machines. Please see below:
| OS    | OpenCL version | GPU              | Output Accuracy |
|-------|----------------|------------------|-----------------|
| Linux | 2.0            | AMD R9 290X      | Good            |
| Mac   | 1.2            | Nvidia GT-750M   | Good            |
| Mac   | 1.2            | AMD FirePro D500 | Incorrect       |
| Linux | 1.1            | Nvidia Tesla K20 | Good            |
I posted on Apple forums, and the only reply I have received is that I should disable fast path math. I am not enabling it anywhere.
In terms of performance, the code runs two times slower on the Firepro when compared to the other discrete GPUs (Tesla and R9) in the list.
Can someone please tell what could be going on? I am happy to share the code if needed.
Here is the OpenCL kernel (some of the variable/function names are not proper): http://pastebin.com/Kt4TinXt
Here is how it is called from the host:
sentence_length = 1024
num_sentences = 6
count = 0
for (sentence in textfile)
{
    sentences += sentence
    count++
    if (count == num_sentences - 1)
        enqueuekernel(sentences)
}
A sentence is basically a group of 1024 words. The level of parallelism is at the word level. I chose to use 128 work-items per word, because that allowed me to keep neu1 and neu1e in the shared memory. I tried other combinations like 'layer1_size' work items per word, or 1 wavefront per word, but that did not give good performance at all. Even now, the performance is not that great, but it gives me around 2.8X (compared to 6 core Xeon) on the R9 and Tesla.
Please let me know if more detail is needed!

GPU programming via JOCL uses only 6 out of 80 shader cores?

I am trying to get a program running on my GPU and, to start with an easy sample, I modified the first sample on http://www.jocl.org/samples/samples.html to run the following little script: I run n simultaneous "threads" (what's the correct name for the GPU equivalent of a thread?), each of which performs 20000000/n independent tanh() computations. You can see my code here: http://pastebin.com/DY2pdJzL
The speed is by far not what I expected:
for n=1 it takes 12.2 seconds
for n=2 it takes 6.3 seconds
for n=3 it takes 4.4 seconds
for n=4 it takes 3.4 seconds
for n=5 it takes 3.1 seconds
for n=6 and beyond, it takes 2.7 seconds.
So after n=6 (be it n=8, n=20, n=100, n=1000 or n=100000), there is no performance increase, which means only 6 of these are computed in parallel. However, according to the specifications of my card there should be 80 cores: http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2
It is not a matter of overhead, since increasing or decreasing the 20000000 changes all the execution times only by a linear factor.
I have installed the AMD APP SDK and drivers that support OpenCL: see http://dl.dropbox.com/u/3060536/prtscr.png and http://dl.dropbox.com/u/3060536/prtsrc2.png for details (or at least I conclude from these that OpenCL is running correctly).
So I'm a bit clueless now, where to search for answer. Why can JOCL only do 6 parallel executions on my ATI Radeon HD 5450?
You are hard-coding the local work size to 1. Use a larger size or let the driver choose one for you.
Also, your kernel is not designed in an OpenCL style. You should take out the for loop and let the driver handle the iterating for you.

Resources