Parallel processing or optimization of latent class analysis - r

I am using poLCA package to run latent class analysis (LCA) on a data with 450,000 observations and 114 variables. As with most latent class analysis, I will need to run this multiple rounsd for different number of classes. Each run takes about 12-20 hours depending on the number of class selected.
Is there a way for me to utilize parallel processing to run this more efficiently? Otherwise, is there other ways to optimize this?
#Converting binary variables to 1 and 2
lca_dat1=lca_dat1+1
#Formula for LCA
f<-cbind(Abdominal_hernia,Abdominal_pain,
Acute_and_unspecified_renal_failure,Acute_cerebrovascular_disease,
Acute_myocardial_infarction,Administrative_social_admission,
Allergic_reactions,Anal_and_rectal_conditions,
Anxiety_disorders,Appendicitis_and_other_appendiceal_conditions,
Asthma,Bacterial_infection_unspecified_site,
Biliary_tract_disease,Calculus_of_urinary_tract,
Cancer_of_breast,Cardiac_dysrhythmias,
Cataract,Chronic_obstructive_pulmonary_disease_and_bronchiectasis,
Chronic_renal_failure,Chronic_ulcer_of_skin,
Coagulation_and_hemorrhagic_disorders,Coma_stupor_and_brain_damage,
Complication_of_device_implant_or_graft,Complications_of_surgical_procedures_or_medical_care,
Conditions_associated_with_dizziness_or_vertigo,Congestive_heart_failure_nonhypertensive,
Coronary_atherosclerosis_and_other_heart_disease,Crushing_injury_or_internal_injury,
Deficiency_and_other_anemia,Delirium_dementia_and_amnestic_and_other_cognitive_disorders,
Disorders_of_lipid_metabolism,Disorders_of_teeth_and_jaw,
Diverticulosis_and_diverticulitis,E_Codes_Adverse_effects_of_medical_care,
E_Codes_Adverse_effects_of_medical_drugs,E_Codes_Fall,
Epilepsy_convulsions,Esophageal_disorders,
Essential_hypertension,Fever_of_unknown_origin,
Fluid_and_electrolyte_disorders,Fracture_of_lower_limb,
Fracture_of_upper_limb,Gastritis_and_duodenitis,
Gastroduodenal_ulcer_except_hemorrhage,Gastrointestinal_hemorrhage,
Genitourinary_symptoms_and_illdefined_conditions,Gout_and_other_crystal_arthropathies,
Headache_including_migraine,Heart_valve_disorders,
Hemorrhoids,Hepatitis,Hyperplasia_of_prostate,
Immunizations_and_screening_for_infectious_disease,
Inflammation_infection_of_eye_except_that_caused_by_tuberculosis_or_sexually_transmitteddisease,Inflammatory_diseases_of_female_pelvic_organs,
Intestinal_infection,Intracranial_injury,
Joint_disorders_and_dislocations_traumarelated,Late_effects_of_cerebrovascular_disease,
Medical_examination_evaluation,Menstrual_disorders,
Mood_disorders,Nausea_and_vomiting,
Neoplasms_of_unspecified_nature_or_uncertain_behavior,Nephritis_nephrosis_renal_sclerosis,
Noninfectious_gastroenteritis,Nonspecific_chest_pain,
Nutritional_deficiencies,Open_wounds_of_extremities,
Open_wounds_of_head_neck_and_trunk,Osteoarthritis,
Other_aftercare,Other_and_unspecified_benign_neoplasm,
Other_circulatory_disease,
Other_connective_tissue_disease,
Other_diseases_of_bladder_and_urethra,Other_diseases_of_kidney_and_ureters,
Other_disorders_of_stomach_and_duodenum,Other_ear_and_sense_organ_disorders,
Other_endocrine_disorders,Other_eye_disorders,
Other_female_genital_disorders,Other_fractures,
Other_gastrointestinal_disorders,Other_infections_including_parasitic,
Other_injuries_and_conditions_due_to_external_causes,Other_liver_diseases,
Other_lower_respiratory_disease,Other_nervous_system_disorders,
Other_nontraumatic_joint_disorders,Other_nutritional_endocrine_and_metabolic_disorders,
Other_screening_for_suspected_conditions_not_mental_disorders_or_infectious_disease,
Other_skin_disorders,Other_upper_respiratory_disease,
Other_upper_respiratory_infections,Paralysis,
Pleurisy_pneumothorax_pulmonary_collapse,Pneumonia_except_that_caused_by_tuberculosis_or_sexually_transmitted_disease,
Poisoning_by_other_medications_and_drugs,Respiratory_failure_insufficiency_arrest_adult,
Retinal_detachments_defects_vascular_occlusion_and_retinopathy,Screening_and_history_of_mental_health_and_substance_abuse_codes,
Secondary_malignancies,Septicemia_except_in_labor,
Skin_and_subcutaneous_tissue_infections,Spondylosis_intervertebral_disc_disorders_other_back_problems,
Sprains_and_strains,Superficial_injury_contusion,
Syncope,Thyroid_disorders,Urinary_tract_infections)~1
#LCA for 1 class
lca1<-poLCA(f,lca_dat1,nclass=1,maxiter=3000,tol=1e-7,graph=F,nrep=5)
#LCA for 2 classes
lca2<-poLCA(f,lca_dat1,nclass=2,maxiter=3000,tol=1e-7,graph=T,nrep=5)
##Extract maximum posterior probability
posterior_lca2=lca2$posterior
posterior_lca2$max_pos=apply(posterior_lca2,1,max)
##Check number of maximum posterior probability that falls above 0.7
table(posterior_lca2$max_pos>0.7)
#LCA for 3 classes
lca3<-poLCA(f,lca_dat1,nclass=3,maxiter=3000,tol=1e-7,graph=T,nrep=5)
##Extract maximum posterior probability
posterior_lca3=lca3$posterior
posterior_lca3$max_pos=apply(posterior_lca3,1,max)
##Check number of maximum posterior probability that falls above 0.7
table(posterior_lca3$max_pos>0.7)
...

You can create a list with the different configurations you want to use. Then use either one of the *apply functions from the parallel package or %dopar% from foreach. Which parallel backend you can/should use depends on your OS.
Here an example with foreach:
library(foreach)
library(doParallel)
registerDoSEQ() # proper backend depends on the OS
foreach(nclass = 1:10) %dopar% {
# do something with nclass
sqrt(nclass)
}

Here are my not too brief or compact thoughts on this. They are less than exact. I have not ever used anywhere near so many manifest factors with poLCA and I think you may be breaking some interesting ground doing so computationally. I use poLCA to predict electoral outcomes per voter (red, blue, purple). I can be wrong on that and not suffer a malpractice suit. I really don't know about the risk of LCA use in health analysis. I think of LCA as more of a social sciences tool. I could be wrong about that as well. Anyway:
(1) I believe you want to look for the most "parsimonious" factors to produce a latent class and limit them to a reduced subset that proves the most useful for all your data. That will help with CPU optimization. I have found personally that using manifests that are exceptionally "monotonic" is not (by default) necessarily a good thing, although certainly experimenting with factors more or less "monotonic" talks to you about your model.
I have found it is more "machine learning" friendly/responsible to use the most widespread manifests and "sample split" your data into groups; recombining the posteriors post LCA run. This assumes that that the most widespread factors affect different subgroups quantitatively but with variance for sample groups (e.g. red, blue, purple). I don't know that anyone else does this, but I gave up trying to build the "one LCA model that rules them all" from voterdb information. That didn't work.
(2)The poLCA library (like most latent class analysis) depends upon matrix multiplication. I have found poLCA more CPU bound than memory bound but with 114 manifests you may experience bottlenecks at every nook and cranny of your motherboard. Whatever you can do to increase matrix multiplication efficiency helps. I believe I have found that Microsoft Open R use of Intel's MKL MKLs more efficient than the default CRAN numeric library. Sorry, I haven't completely tested that nor do I understand why some numeric libraries might be more efficient than others for matrix multiplication. I only know that Microsoft Open R brags about this some and it appears to me they have a point with MKL MKL.
(3) Reworking your LCS code into Matt Dowles data.table library shows me efficiencies across the board on all my work. I create 'dat' as an data.table and run iterations for the best optimized data.table function for poLCA and posteriors. Combining data.table efficiency with some of Hadley Wickham's improved *ply functions (plyr library) that puts LCA runs into lists works well for me:
rbindlist(plyr::llply(1:10,check_lc_pc)) # check_lc_pc is the poLCA function.
(4) This is a simple tip (maybe even condescending), but you don't need to list all standard error data once you are satisfied with your model so verbose = FALSE. Also, by making regular test runs, I can determine the poLCA run optimizated best for my model the best ('probs.start') and leverage testing thereof:
lc <- poLCA(f,dat,nrep=1,probs.start=probs.start.new,verbose=FALSE)
poLCA produces a lot of output to the screen by default. Create a poLCA function with verbose=FALSE and a byte-compiled R 3.5 optimizes output.
(5) I use Windows 10 and because of fast SSD, fast DDR, and Microsoft "memory compression" I think I notice that the the Windows 10 OS adapts to lca runs with lots of "memory compression". I assume that it is holding the same matrices in compressed memory because I am calling them repeatedly over time. I really like the Kaby Lake processors that "self over-clock". I see my processor 7700HQ taking advantage of that during LCA runs. (It would seem like LCA runs would benefit from over clocking. I don't like to overclock my processor on my own. That's too much risk for me.) I think it is useful to monitor memory use of your LCA runs from another R console with system calls to Powershell and cmd memory management functions. The one below list the hidden "Memory Compression" process(!!):
ps_f <- function() { system("powershell -ExecutionPolicy Bypass -command $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
$t2 = $t1 | Select {
$_.Id;
[math]::Round($_.WorkingSet64/1MB);
[math]::Round($_.PrivateMemorySize64/1MB);
[math]::Round($_.VirtualMemorySize64/1MB) };
$t2 | ft * "); }
ps_all <- function() {ps();ps_e();ps_f();}
I have this memory management function for your session used for the lca runs, but of course, that runs before or after:
memory <- function() {
as.matrix(list(
paste0(shell('systeminfo | findstr "Memory"')), # Windows
paste0("R Memory size (malloc) available: ",memory.size(TRUE)," MB"),
paste0("R Memory size (malloc) in use: ",memory.size()," MB"),
paste0("R Memory limit (total alloc): ",memory.limit()," MB")
There is work on the optimization functions for latent class analysis. I post a link here but I don't think that helps us today as users of poLCA or LCA: http://www.mat.univie.ac.at/~neum/ms/fuchs-coap11.pdf. But maybe the discussion is good background. There is nothing simple about poLCA. This document by the developers: http://www.sscnet.ucla.edu/polisci/faculty/lewis/pdf/poLCA-JSS-final.pdf is worth reading at least twice!
If anyone else has any thoughts of poLCA or LCA compression, I would appreciate further discussion as well. Once I started predicting voter outcomes of entire state as opposed to my county, I had to think about optimization and the limits of poLCA and LCA/LCR.

Nowadays, there is a parallized cpp-based impementation of poLCA, named poLCAParallel in https://github.com/QMUL/poLCAParallel . For me, it was much much faster than the base package.

Related

How to improve igraph single_paths calculation efficiency

I'm using the function all_simple_paths from the igraph R package: (1) to generate the list of all simple paths in networks (object List_paths_Mp); and (2) to calculate the total number of simple paths (object n_paths).
I'm using the function in the form:
pathsMp <- unlist(lapply(V(graphMp), function(x)all_simple_paths(graphMp, from = x)), recursive =FALSE)
List_paths_Mp <- lapply(1:length(pathsMp), function(x)as_ids(pathsMp[[x]]))
n_paths<-length(List_paths_Mp)
Where:
Mp is a square matrix with either 1 or 0 values, and graphMp is the igraph graph objected obtained through the function graph_from_adjacency_matrix.
The function does what I need, but with the increase in the number of variables and interactions the processing time to identify and store the different single paths in the network grows too much and it takes very long to get the results.
In particular, using a network with 11 variables and 60 interactions, there is a total of 146338 possible simple paths. And this already takes a long time to compute. Using a bigger network, with 13 variables and 91 interactions, causes the program to take even longer times to process (after 2 hours the function still didn't run its course, and when called to stop it crashed R).
Is there a way to increase the efficiency of the task (i.e. to get results in a faster way)? Has anyone ever encountered a similar problem and found a solution? And, I know, I could use a CPU with higher processing power, but the point is to have the function to run efficiently (as much as possible) in a normal personal computer.
Edit: here I do the calculations from the graph object, but if someone else has any idea of doing the same from the adjacency matrix, I would welcome it too!

How to speed up the generation of a latin hypercube (LHS) design

I'm trying to generate an optimized LHS (Latin Hypercube Sampling) design in R, with sample size N = 400 and d = 7 variables, but it's taking forever. My pc is an HP Z820 workstation with 12 cores, 32 Mb RAM, Windows 7 64 bit, and I'm running Microsoft R Open which is a multicore version of R. The code has been running for half an hour, but I still don't see any results:
library(lhs)
lhs_design <- optimumLHS(n = 400, k = 7, verbose = TRUE)
It seems a bit weird. Is there anything I could do to speed it up? I heard that parallel computing may help with R, but I don't know how to use it, and I have no idea if it speeds up only code that I write myself, or if it could speed up an existing package function such as optimumLHS. I don't have to use the lhs package necessarily - my only requirement is that I would like to generate an LHS design which is optimized in terms of S-optimality criterion, maximin metric, or some other similar optimality criterion (thus, not just a vanilla LHS). If worse comes to worst, I could even accept a solution in a different environment than R, but it must be either MATLAB or a open source environment.
Just a little code to check performance.
library(lhs)
library(ggplot2)
performance<-c()
for(i in 1:100){
ptm<-proc.time()
invisible(optimumLHS(n = i, k = 7, verbose = FALSE))
time<-print(proc.time()-ptm)[[3]]
performance<-rbind(performance,data.frame(time=time, n=i))
}
ggplot(performance,aes(x=n,y=time))+
geom_point()
Not looking too good. It seems to me you might be in for a very long wait indeed. Based on the algorithm, I don't think there is a way to speed things up via parallel processing, since to optimize the separation between sample points, you need to know the location of the all the sample points. I think your only option for speeding this up will be to take a smaller sample or get (access)a faster computer. It strikes me that since this is something that only really has to be done once, is there a resource where you could just get a properly sampled and optimized distribution already computed?
So it looks like ~650 hours for my machine, which is very comparable to yours, to compute with n=400.

kmeans: Quick-TRANSfer stage steps exceeded maximum

I am running k-means clustering in R on a dataset with 636,688 rows and 7 columns using the standard stats package: kmeans(dataset, centers = 100, nstart = 25, iter.max = 20).
I get the following error: Quick-TRANSfer stage steps exceeded maximum (= 31834400), and although one can view the code at http://svn.r-project.org/R/trunk/src/library/stats/R/kmeans.R - I am unsure as to what is going wrong. I assume my problem has to do with the size of my dataset, but I would be grateful if someone could clarify once and for all what I can do to mitigate the issue.
I just had the same issue.
See the documentation of kmeans in R via ?kmeans:
The Hartigan-Wong algorithm
generally does a better job than either of those, but trying
several random starts (‘nstart’> 1) is often recommended. In rare
cases, when some of the points (rows of ‘x’) are extremely close,
the algorithm may not converge in the “Quick-Transfer” stage,
signalling a warning (and returning ‘ifault = 4’). Slight
rounding of the data may be advisable in that case.
In these cases, you may need to switch to the Lloyd or MacQueen algorithms.
The nasty thing about R here is that it continues with a warning that may go unnoticed. For my benchmark purposes, I consider this to be a failed run, and thus I use:
if (kms$ifault==4) { stop("Failed in Quick-Transfer"); }
Depending on your use case, you may want to do something like
if (kms$ifault==4) { kms = kmeans(X, kms$centers, algorithm="MacQueen"); }
instead, to continue with a different algorithm.
If you are benchmarking K-means, note that R uses iter.max=10 per default. It may take much more than 10 iterations to converge.
Had the same problem, seems to have something to do with available memory.
Running Garbage Collection before the function worked for me:
gc()
or reference:
Increasing (or decreasing) the memory available to R processes
#jlhoward's comment:
Try
kmeans(dataset, algorithm="Lloyd", ..)
I got the same error message, but in my case it helped to increase the number of iterations iter.max. That contradicts the theory of memory overload.

R: clarification on memory management

Suppose I have a matrix bigm. I need to use a random subset of this matrix and give it to a machine learning algorithm such as say svm. The random subset of the matrix will only be known at runtime. Additionally there are other parameters that are also chosen from a grid.
So, I have code that looks something like this:
foo = function (bigm, inTrain, moreParamsList) {
parsList = c(list(data=bigm[inTrain, ]), moreParamsList)
do.call(svm, parsList)
}
What I am seeking to know is whether R uses new memory to save that bigm[inTrain, ] object in parsList. (My guess is that it does.) What commands can I use to test such hypotheses myself? Additionally, is there a way of using a sub-matrix in R without using new memory?
Edit:
Also, assume I am calling foo using mclapply (on Linux) where bigm resides in the parent process. Does that mean I am making mc.cores number of copies of bigm or do all cores just use the object from the parent?
Any functions and heuristics of tracking memory location and consumption of objects being made in different cores?
Thanks.
I am just going to put in here what I find from my research on this topic:
I don't think using mclapply makes mc.cores copies of bigm based on this from the manual for multicore:
In a nutshell fork spawns a copy (child) of the current process, that can work in parallel
to the master (parent) process. At the point of forking both processes share exactly the
same state including the workspace, global options, loaded packages etc. Forking is
relatively cheap in modern operating systems and no real copy of the used memory is
created, instead both processes share the same memory and only modified parts are copied.
This makes fork an ideal tool for parallel processing since there is no need to setup the
parallel working environment, data and code is shared automatically from the start.
For your first part of the question, you can use tracemem :
This function marks an object so that a message is printed whenever the internal code copies the object
Here an example:
a <- 1:10
tracemem(a)
## [1] "<0x000000001669cf00"
b <- a ## b and a share memory (no message)
d <- stats::rnorm(10)
invisible(lm(d ~ a+log(b)))
## tracemem[0x000000001669cf00 -> 0x000000001669e298] ## object a is copied twice
## tracemem[0x000000001669cf00 -> 0x0000000016698a38]
untracemem(a)
You already found from the manual that mclapply isn't supposed to make copies of bigm.
But each thread needs to make its own copy of the smaller training matrix as it varies across the threads.
If you'd parallelize with e.g. snow, you'd need to have a copy of the data in each of the cluster nodes. However, in that case you could rewrite your problem in a way that only the smaller training matrices are handed over.
The search term for the general investigation of memory consumption behaviour is memory profiling. Unfortunately, AFAIK the available tools are not (yet) very comfortable, see e.g.
Monitor memory usage in R
Memory profiling in R - tools for summarizing

R: SVM performance using custom kernel (user defined kernel) is not working in kernlab

I'm trying to use user defined kernel. I know that kernlab offer user defined kernel(custom kernel functions) in R. I used data spam including package kernlab.
(number of variables=57 number of examples =4061)
I'm defined kernel's form,
kp=function(d,e){
as=v*d
bs=v*e
cs=as-bs
cs=as.matrix(cs)
exp(-(norm(cs,"F")^2)/2)
}
class(kp)="kernel"
It is the transformed kernel for gaussian kernel, where v is the continuously changed values that are inverse of standard deviation vector about each variables, for example:
v=(0.1666667,........0.1666667)
The training set defined 60% of spam data (preserving the proportions of the different classes).
if data's type is spam, than data's type = 1 for train svm
m=ksvm(xtrain,ytrain,type="C-svc",kernel=kp,C=10)
But this step is not working. It's always waiting for a response.
So, I ask you this problem, why? Is it because the number of examples are too big? Is there any other R package that can train SVMs for user defined kernel?
First, your kernel looks like a classic RBF kernel, with v = 1/sigma, so why do you use it? You can use a built-in RBF kernel and simply set the sigma parameter. In particular - instead of using frobenius norm on matrices you could use classic euclidean on the vectorized matrices.
Second - this is working just fine.
> xtrain = as.matrix( c(1,2,3,4) )
> ytrain = as.factor( c(0,0,1,1) )
> v= 0.01
> m=ksvm(xtrain,ytrain,type="C-svc",kernel=kp,C=10)
> m
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 10
Number of Support Vectors : 4
Objective Function Value : -39.952
Training error : 0
There are at least two reasons for you still waiting for results:
RBF kernels induce the most hard problem to optimize for SVM (especially for large C)
User defined kernels are far less efficient then builtin
As I am not sure, whether ksvm actually optimizes the user-defined kernel computation (in fact I'm pretty sure it does not), you could try to build the kernel matrix ( K[i,j] = K(x_i,x_j) where x_i is i'th training vector) and provide ksvm with it. You can achieve this by
K <- kernelMatrix(kp,xtrain)
m <- ksvm(K,ytrain,type="C-svc",kernel='matrix',C=10)
Precomputing kernel matrix can be quite long process, but then optimization itself will be much faster, so it is a good method if you want to test many different C values (which you for sure should do). Unfortunately this requires O(n^2) memory, so if you use more then 100 000 vectors, you will need really great amount of RAM.

Resources