I am using the poLCA package to run latent class analysis (LCA) on a dataset with 450,000 observations and 114 variables. As with most latent class analyses, I will need to run this multiple rounds for different numbers of classes. Each run takes about 12-20 hours depending on the number of classes selected.
Is there a way for me to use parallel processing to run this more efficiently? Otherwise, are there other ways to optimize this?
#Convert binary variables from 0/1 to 1/2 (poLCA requires outcome categories coded as positive integers)
lca_dat1 <- lca_dat1 + 1
#Formula for LCA
f<-cbind(Abdominal_hernia,Abdominal_pain,
Acute_and_unspecified_renal_failure,Acute_cerebrovascular_disease,
Acute_myocardial_infarction,Administrative_social_admission,
Allergic_reactions,Anal_and_rectal_conditions,
Anxiety_disorders,Appendicitis_and_other_appendiceal_conditions,
Asthma,Bacterial_infection_unspecified_site,
Biliary_tract_disease,Calculus_of_urinary_tract,
Cancer_of_breast,Cardiac_dysrhythmias,
Cataract,Chronic_obstructive_pulmonary_disease_and_bronchiectasis,
Chronic_renal_failure,Chronic_ulcer_of_skin,
Coagulation_and_hemorrhagic_disorders,Coma_stupor_and_brain_damage,
Complication_of_device_implant_or_graft,Complications_of_surgical_procedures_or_medical_care,
Conditions_associated_with_dizziness_or_vertigo,Congestive_heart_failure_nonhypertensive,
Coronary_atherosclerosis_and_other_heart_disease,Crushing_injury_or_internal_injury,
Deficiency_and_other_anemia,Delirium_dementia_and_amnestic_and_other_cognitive_disorders,
Disorders_of_lipid_metabolism,Disorders_of_teeth_and_jaw,
Diverticulosis_and_diverticulitis,E_Codes_Adverse_effects_of_medical_care,
E_Codes_Adverse_effects_of_medical_drugs,E_Codes_Fall,
Epilepsy_convulsions,Esophageal_disorders,
Essential_hypertension,Fever_of_unknown_origin,
Fluid_and_electrolyte_disorders,Fracture_of_lower_limb,
Fracture_of_upper_limb,Gastritis_and_duodenitis,
Gastroduodenal_ulcer_except_hemorrhage,Gastrointestinal_hemorrhage,
Genitourinary_symptoms_and_illdefined_conditions,Gout_and_other_crystal_arthropathies,
Headache_including_migraine,Heart_valve_disorders,
Hemorrhoids,Hepatitis,Hyperplasia_of_prostate,
Immunizations_and_screening_for_infectious_disease,
Inflammation_infection_of_eye_except_that_caused_by_tuberculosis_or_sexually_transmitteddisease,Inflammatory_diseases_of_female_pelvic_organs,
Intestinal_infection,Intracranial_injury,
Joint_disorders_and_dislocations_traumarelated,Late_effects_of_cerebrovascular_disease,
Medical_examination_evaluation,Menstrual_disorders,
Mood_disorders,Nausea_and_vomiting,
Neoplasms_of_unspecified_nature_or_uncertain_behavior,Nephritis_nephrosis_renal_sclerosis,
Noninfectious_gastroenteritis,Nonspecific_chest_pain,
Nutritional_deficiencies,Open_wounds_of_extremities,
Open_wounds_of_head_neck_and_trunk,Osteoarthritis,
Other_aftercare,Other_and_unspecified_benign_neoplasm,
Other_circulatory_disease,
Other_connective_tissue_disease,
Other_diseases_of_bladder_and_urethra,Other_diseases_of_kidney_and_ureters,
Other_disorders_of_stomach_and_duodenum,Other_ear_and_sense_organ_disorders,
Other_endocrine_disorders,Other_eye_disorders,
Other_female_genital_disorders,Other_fractures,
Other_gastrointestinal_disorders,Other_infections_including_parasitic,
Other_injuries_and_conditions_due_to_external_causes,Other_liver_diseases,
Other_lower_respiratory_disease,Other_nervous_system_disorders,
Other_nontraumatic_joint_disorders,Other_nutritional_endocrine_and_metabolic_disorders,
Other_screening_for_suspected_conditions_not_mental_disorders_or_infectious_disease,
Other_skin_disorders,Other_upper_respiratory_disease,
Other_upper_respiratory_infections,Paralysis,
Pleurisy_pneumothorax_pulmonary_collapse,Pneumonia_except_that_caused_by_tuberculosis_or_sexually_transmitted_disease,
Poisoning_by_other_medications_and_drugs,Respiratory_failure_insufficiency_arrest_adult,
Retinal_detachments_defects_vascular_occlusion_and_retinopathy,Screening_and_history_of_mental_health_and_substance_abuse_codes,
Secondary_malignancies,Septicemia_except_in_labor,
Skin_and_subcutaneous_tissue_infections,Spondylosis_intervertebral_disc_disorders_other_back_problems,
Sprains_and_strains,Superficial_injury_contusion,
Syncope,Thyroid_disorders,Urinary_tract_infections)~1
#LCA for 1 class
lca1 <- poLCA(f, lca_dat1, nclass = 1, maxiter = 3000, tol = 1e-7, graphs = FALSE, nrep = 5)
#LCA for 2 classes
lca2 <- poLCA(f, lca_dat1, nclass = 2, maxiter = 3000, tol = 1e-7, graphs = TRUE, nrep = 5)
##Extract maximum posterior probability
posterior_lca2 <- as.data.frame(lca2$posterior)
posterior_lca2$max_pos <- apply(posterior_lca2, 1, max)
##Check how many observations have a maximum posterior probability above 0.7
table(posterior_lca2$max_pos > 0.7)
#LCA for 3 classes
lca3 <- poLCA(f, lca_dat1, nclass = 3, maxiter = 3000, tol = 1e-7, graphs = TRUE, nrep = 5)
##Extract maximum posterior probability
posterior_lca3 <- as.data.frame(lca3$posterior)
posterior_lca3$max_pos <- apply(posterior_lca3, 1, max)
##Check how many observations have a maximum posterior probability above 0.7
table(posterior_lca3$max_pos > 0.7)
...
You can create a list with the different configurations you want to use. Then use either one of the *apply functions from the parallel package or %dopar% from foreach. Which parallel backend you can/should use depends on your OS.
Here is an example with foreach:
library(foreach)
library(doParallel)
registerDoSEQ()  # placeholder: this registers the sequential backend; swap in a parallel backend suited to your OS (e.g. registerDoParallel())
foreach(nclass = 1:10) %dopar% {
  # do something with nclass
  sqrt(nclass)
}
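Building on that, a minimal sketch with a real parallel backend, using the question's f and lca_dat1 (assumptions: doParallel works on your OS, and each worker can hold its own copy of the 450,000-row dataset in RAM):
library(poLCA)
library(foreach)
library(doParallel)

cl <- makeCluster(3)   # one worker per model fitted concurrently; each worker gets a copy of the data
registerDoParallel(cl)

models <- foreach(nclass = 1:6, .packages = "poLCA") %dopar% {
  poLCA(f, lca_dat1, nclass = nclass, maxiter = 3000, tol = 1e-7,
        graphs = FALSE, nrep = 5, verbose = FALSE)
}

stopCluster(cl)
Each run is independent, so the whole set finishes in roughly the time of the slowest single model, at the cost of the memory used by the per-worker data copies.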
Here are my not too brief or compact thoughts on this. They are less than exact. I have not ever used anywhere near so many manifest factors with poLCA and I think you may be breaking some interesting ground doing so computationally. I use poLCA to predict electoral outcomes per voter (red, blue, purple). I can be wrong on that and not suffer a malpractice suit. I really don't know about the risk of LCA use in health analysis. I think of LCA as more of a social sciences tool. I could be wrong about that as well. Anyway:
(1) I believe you want to look for the most "parsimonious" set of manifest variables that produces a useful latent class model, and limit yourself to that reduced subset. That will help with CPU time. I have found personally that using manifests that are exceptionally "monotonic" is not necessarily a good thing by default, although experimenting with more or less "monotonic" factors does tell you something about your model.
I have found it is more "machine learning" friendly/responsible to use the most widespread manifests and "sample split" your data into groups, recombining the posteriors after the LCA runs. This assumes that the most widespread factors affect the different subgroups quantitatively, but with variance across sample groups (e.g. red, blue, purple). I don't know that anyone else does this, but I gave up trying to build the "one LCA model that rules them all" from voterdb information. That didn't work.
(2) The poLCA library (like most latent class analysis) depends heavily on matrix multiplication. I have found poLCA more CPU-bound than memory-bound, but with 114 manifests you may hit bottlenecks at every nook and cranny of your motherboard. Whatever you can do to increase matrix multiplication efficiency helps. I believe Microsoft Open R's use of Intel's MKL is more efficient than the default CRAN numeric library. Sorry, I haven't completely tested that, nor do I understand why some numeric libraries would be more efficient than others for matrix multiplication. I only know that Microsoft Open R brags about this, and it appears to me they have a point with MKL.
(3) Reworking your LCA code around Matt Dowle's data.table library shows me efficiencies across the board on all my work. I create 'dat' as a data.table and iterate to find the best-optimized data.table usage for poLCA and the posteriors. Combining data.table efficiency with some of Hadley Wickham's improved *ply functions (plyr library), which put the LCA runs into lists, works well for me:
rbindlist(plyr::llply(1:10,check_lc_pc)) # check_lc_pc is the poLCA function.
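check_lc_pc is the answerer's own wrapper and is not shown; purely as an illustration, a hypothetical version could fit one model per class count on the data.table dat from above and return a one-row summary of fit statistics:
check_lc_pc <- function(nclass) {
  # fit one poLCA model for this class count, then return its fit statistics as one row
  lc <- poLCA(f, dat, nclass = nclass, maxiter = 3000, tol = 1e-7,
              nrep = 5, graphs = FALSE, verbose = FALSE)
  data.table::data.table(nclass = nclass, llik = lc$llik, aic = lc$aic, bic = lc$bic)
}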
(4) This is a simple tip (maybe even condescending), but you don't need the full standard-error output once you are satisfied with your model, so set verbose = FALSE. Also, by making regular test runs, I can determine the best-optimized starting values ('probs.start') for my model and leverage them in testing:
lc <- poLCA(f, dat, nrep = 1, probs.start = probs.start.new, verbose = FALSE)
poLCA produces a lot of output to the screen by default. Create a wrapper around poLCA with verbose = FALSE; a byte-compiled R (3.5+) also keeps that overhead down.
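As a sketch of where starting values like probs.start.new can come from (lc_search is a made-up name here; it relies on poLCA's documented probs.start return value, which for nrep > 1 corresponds to the run with the best log-likelihood):
lc_search <- poLCA(f, dat, nclass = 3, nrep = 10, maxiter = 3000, verbose = FALSE)
probs.start.new <- lc_search$probs.start  # starting values behind the best of the 10 runs
lc <- poLCA(f, dat, nclass = 3, nrep = 1, probs.start = probs.start.new, verbose = FALSE)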
(5) I use Windows 10, and because of a fast SSD, fast DDR, and Microsoft "memory compression", I think I notice that the Windows 10 OS adapts to LCA runs with lots of "memory compression". I assume it is holding the same matrices in compressed memory because I am calling them repeatedly over time. I really like the Kaby Lake processors that "self over-clock"; I see my 7700HQ taking advantage of that during LCA runs. (It would seem that LCA runs would benefit from overclocking. I don't like to overclock my processor on my own; that's too much risk for me.) I think it is useful to monitor memory use of your LCA runs from another R console with system calls to PowerShell and cmd memory management functions. The one below lists the hidden "Memory Compression" process(!!):
ps_f <- function() { system("powershell -ExecutionPolicy Bypass -command $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
$t2 = $t1 | Select {
$_.Id;
[math]::Round($_.WorkingSet64/1MB);
[math]::Round($_.PrivateMemorySize64/1MB);
[math]::Round($_.VirtualMemorySize64/1MB) };
$t2 | ft * "); }
ps_all <- function() {ps();ps_e();ps_f();}
I also have this memory-management function for the session used for the LCA runs; of course, it has to run before or after an LCA run:
memory <- function() {
  as.matrix(list(
    paste0(shell('systeminfo | findstr "Memory"')), # Windows
    paste0("R Memory size (malloc) available: ", memory.size(TRUE), " MB"),
    paste0("R Memory size (malloc) in use: ", memory.size(), " MB"),
    paste0("R Memory limit (total alloc): ", memory.limit(), " MB")
  ))
}
There is work on optimization methods for latent class analysis. I'll post a link here, although I don't think it helps us today as users of poLCA or LCA: http://www.mat.univie.ac.at/~neum/ms/fuchs-coap11.pdf. But maybe the discussion is good background. There is nothing simple about poLCA. This paper by the developers, http://www.sscnet.ucla.edu/polisci/faculty/lewis/pdf/poLCA-JSS-final.pdf, is worth reading at least twice!
If anyone else has any thoughts on poLCA or LCA compression, I would appreciate further discussion as well. Once I started predicting voter outcomes for an entire state as opposed to my county, I had to think about optimization and the limits of poLCA and LCA/LCR.
Nowadays, there is a parallelized, C++-based implementation of poLCA named poLCAParallel, at https://github.com/QMUL/poLCAParallel. For me, it was much, much faster than the base package.
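Per its README it is intended as a near drop-in replacement for poLCA, so usage should look roughly like the sketch below; the exact function and argument names (including how the thread count is controlled) are assumptions here, so check the repository's README before relying on them:
# remotes::install_github("QMUL/poLCAParallel")
library(poLCAParallel)
# same call shape as base poLCA, but backed by the parallel C++ implementation (assumed drop-in interface)
lca2 <- poLCA(f, lca_dat1, nclass = 2, maxiter = 3000, tol = 1e-7, nrep = 5, verbose = FALSE)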
I am a complete beginner to GPGPU and OpenCL. I am unable to answer the following two questions about GPGPU in general:
a) Suppose I have a piece of code suitable to be run on a GPU (it executes the exact same set of instructions on multiple data items). Assume I already have my data on the GPU. Is there any way to look at the specifications of the CPU and GPU and estimate the potential speed gain? For example, how can I estimate the speed gain (excluding the time taken to transfer data to the GPU) if I ran the code on AMD's R9 295X2 GPU (http://www.amd.com/en-us/products/graphics/desktop/r9/2...) instead of an Intel i7-4770K processor (http://ark.intel.com/products/75123)?
b) Is there any way to estimate the amount of time it would take to transfer data to the gpu?
Thank you!
Thank you for the responses! Given the large number of factors influencing speed gains, trying and testing is certainly a good idea. However, I do have a question about the GFLOPS approach mentioned in some responses; the GFLOPS metric is what I was looking at before posting the question.
I would think that GFLOPS would be a good way to estimate potential performance gains for SIMD-type operations, given that it takes into account differences in clock speed, core count, and floating-point operations per cycle. However, when I crunch the numbers using GFLOPS specifications, something does not seem right.
The Good:
The GFLOPS-based estimate seems to match the observed speed gain for the toy kernel below. For an input integer n, the kernel computes the sum (1+2+3+...+n) in a brute-force way, so for large integers it is fairly compute-heavy. I ran the kernel for every int from 1000 to 60000 on the GPU and on the CPU (sequentially on the CPU, without threading) and measured the timings.
__kernel void calculate(__global int* input, __global int* output) {
    size_t id = get_global_id(0);
    int inp_num = input[id];
    int sum = 0;
    // brute-force sum 1 + 2 + ... + inp_num
    for (int i = 0; i <= inp_num; ++i)
        sum += i;
    output[id] = sum;
}
GPU on my laptop:
NVS 5400M (www.nvidia.com/object/nvs_techspecs.html)
GFLOPS, single precision: 253.44 (en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units)
CPU on my Laptop:
intel i7-3720QM, 2.6 GHz
GFLOPS (assuming single precision): 83.2 (download.intel.com/support/processors/corei7/sb/core_i7-3700_m.pdf). The Intel document does not specify whether this is single or double precision.
CPU Time: 3.295 sec
GPU Time: 0.184 sec
Speed gains per core: 3.295/0.184 ~18
Theoretical estimate of the speed gain when using all 4 CPU cores: 18/4 ~ 4.5
Speed Gains based on FLOPS: (GPU FLOPS)/(CPU FLOPS) = (253.44/83.2) = 3.0
For the above example, the GFLOPS-based estimate seems consistent with the one obtained from experimentation, provided the Intel documentation indeed specifies FLOPS for single and not double precision. I did try to search for more links to the FLOPS specification of the Intel processor in my laptop. The observed speed gain also seems good, given that I have a modest GPU.
The Problem:
The FLOPS-based approach seems to give a much lower than expected speed gain, after factoring in GPU price, when comparing AMD's R9 295X2 GPU (www.amd.com/en-us/products/graphics/desktop/r9/295x2#) with Intel's i7-4770K (ark.intel.com/products/75123):
AMD's FLOPS, single precision: 11.5 TFLOPS (from the link above)
Intel's FLOPS, single precision: (number of cores) x (FLOPS per cycle per core) x (clock speed) = 4 x 32 (peak; www.pcmag.com/article2/0,2817,2419798,00.asp) x 3.5 GHz = 448 GFLOPS
Speed gain based on FLOPS = (11.5 TFLOPS)/(448 GFLOPS) ~ 26
AMD GPU price: $1500
Intel CPU price: $300
For every AMD R9 295X2 GPU, I can buy 5 Intel i7-4770K CPUs, which reduces the effective speed gain to 26/5 ~ 5. However, this estimate is not at all consistent with the 100-200x increase in speed one would expect. The low estimate of the speed gain from the GFLOPS approach makes me think something is incorrect in my analysis, but I am not sure what.
You need to examine the kernel(s). I myself am learning CUDA, so I couldn't tell you exactly what you'd do with OpenCL.
But I would figure out roughly how many floating point operations one single instance of the kernel will perform. Then find the number of floating point operations per second each device can handle.
(number of kernel instances to be launched) * (floating-point operations per kernel instance) / (device throughput in FLOPS) = time to execute
The number of kernel instances launched will depend on your data.
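As an illustrative plug-in of numbers (made up for the example): 1,000,000 kernel instances * 100 floating-point operations each = 10^8 FLOPs; on a 250 GFLOPS device that is 10^8 / (250 * 10^9) ≈ 0.4 ms of pure compute time. Memory traffic and launch overhead usually add to this, so treat it as a lower bound.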
A) Normally this question never gets a precise answer, since we are not talking about 1.05x speed gains. When the problem is suitable, big enough to hide any overheads (say, 100k work-items), and the data is already on the GPU, then we are talking about speedups of 100-300x. Normally nobody cares whether it is 250x or 251x.
The estimate is difficult to make, since the platforms are completely different: not only in clock speeds, but also in memory latency and caches, as well as bus speeds and processing elements.
I cannot give you a clear answer on this, other than: try it and measure.
B) The time to copy the memory is completely dependent on the GPU-CPU bus speed (the PCI bus), and that is the hardware limit; in practice you will always get less than that when copying. Generally you can apply the rule of three (simple proportionality) to work out the time needed, but there is always a small driver overhead that depends on the platform and device. So copying 100 bytes is usually very slow, but copying some MB runs at close to the bus speed.
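As a rough worked example: copying 100 MB over a PCIe 3.0 x16 link (about 16 GB/s theoretical, typically 10-12 GB/s sustained) takes on the order of 100 MB / 12 GB/s ≈ 8 ms, plus the fixed per-call driver overhead mentioned above.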
The memory copy speed is usually not a design constraint when creating a GPGPU app, since it can be hidden in many ways (pinned memory, etc.), so that nobody notices any slowdown due to memory operations.
You should not decide whether a problem is suitable for the GPU just by looking at the time lost on memory copies. Better measures are whether the problem itself is suitable, and whether you have enough data to keep the GPU busy (otherwise it is faster to do it directly on the CPU).
Potential speed gain depends highly on the algorithm implementation. It is difficult to forecast the performance level unless you are developing some very simple application (like the simplest image filter). In some cases, estimates can be made using memory system performance as the basis, since many algorithms are bandwidth-bound.
You can calculate the transfer time by dividing the amount of data by the GPU memory bandwidth for device-internal operations. Look at the hardware specifications to get it, or calculate it if you know the memory frequency and bus width. For host-device operations, the PCI-E bus speed is usually the limit.
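As a worked illustration with round numbers: moving 1 GB inside a GPU with 320 GB/s memory bandwidth takes about 1/320 s ≈ 3 ms, while moving the same 1 GB from host to device over a PCI-E 3.0 x16 link (~16 GB/s theoretical) takes at least 1/16 s ≈ 60 ms.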
If the code is easy (which is what the lightweight GPU cores need) and is not memory-bound, then you can approximate it like this:
Sample kernel:
Read two 32-bit floats from memory and
do calculations on them at least 20-30 times.
Then write to memory once.
New: GPU
Old: CPU
Gain ratio = ((New/Old) - 1 ) *100 (%)
New = 5000 cores * 2 ALU-FPUs per core * 1.0 GHz frequency = 10000 GFLOPS
Old = 10 cores * 8 ALU-FPUs per core * 4.0 GHz frequency = 320 GFLOPS
((New/Old) - 1) * 100 ===> roughly 3000% speed gain (about 31x).
This holds when the code mostly uses registers and local memory, rarely hitting global memory.
If the code is hard (heavy branching + emulated recursion + non-uniform work), expect only a 3-5x speed gain; for serial code it can of course be equal to or slower than the CPU.
When the code is memory-bound, the ceiling is roughly 1 TB/s (GPU) divided by 40 GB/s (CPU), i.e. about 25x.
If each iteration needs to upload data to the GPU, there will be a PCI-e bandwidth bottleneck too.
Workloads are usually classified into 2 categories:
bandwidth bound - more time is spent on fetches from global memory, so even increasing the clock frequency doesn't help. Problems like sorting. Bandwidth capacity is measured in GB/s.
compute bound - directly proportional to compute horsepower. Problems like matrix multiplication. Compute capacity is measured in GFLOPS.
There is a tool, clpeak, which tries to measure these programmatically.
It is very important to classify your problem in order to measure its performance and choose the right device (knowing its limits).
For example, if you compare the Intel HD 4000 and the i7-3630 (both on the same chip) in https://github.com/krrishnarraj/clpeak/tree/master/results/Intel%28R%29_OpenCL
the i7 is comparatively better at bandwidth (plus no transfer overheads),
while in terms of compute, the GPU is 4-5 times faster than the i7.
I don't have an HD 5850, but how can I find its maximum work-group size for OpenCL? And what is the preferred floating-point vector width for the HD 5850? I suspected it was 5, but that did not work on a friend's computer that has a 5850. I tried width 4, but it was not fast enough; now I suspect the work-group size is not optimal. I am doing N-body for 25k, 50k, and 100k particles, using float8 variables for x, y, z, vx, vy, vz.
Thanks.
If you need the OpenCL specifics at development time but don't have access to the hardware, try http://clbenchmark.com. For example, the HD 5850 page is here: http://clbenchmark.com/device-environment.jsp?config=11975982. It shows CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT=4.
Use clGetDeviceInfo to poll for CL_DEVICE_MAX_WORK_GROUP_SIZE. I think the 5850 will have this at 256, but that may not be optimal for your kernel.
Use the same technique to poll for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, which I think is 4 on your card.