I am looking for some advice on how to choose a workstation. My budget is around $5000.
I simulate structural economic models using Julia. My code typically uses big arrays (over which I iterate with for loops) and involves large Monte Carlo simulations and minimisation algorithms. I parallelise as much as I can.
As I understand it, it would be beneficial for me to have a machine with as many cores as possible and quite a lot of RAM. However, I am not sure how to balance these two. What is the trade-off? Also, does the quality of the cores matter?
Is there anything else that I should take into account apart from RAM and cores/CPU?
Any help is appreciated.
Thanks!
It depends on whether your algorithm is parallelizable and, if so, whether it is constrained by memory bandwidth or by compute power. Most algorithms are bandwidth-bound. Large arrays also sound GPU-parallelizable, provided the array elements are independent of each other.
The best performance here comes from CPUs with as many memory channels as possible. Normal desktop CPUs usually have 2-channel memory, AMD Threadripper has 4 channels, and Threadripper Pro has 8. So a ~24-core Threadripper Pro with 8x8 GB or 8x16 GB of memory may fit within your budget.
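To make the bandwidth constraint concrete, here is a minimal C/OpenMP sketch of a STREAM-style triad (the array size and constants are illustrative, not taken from the question): each iteration does a single multiply-add but moves 24 bytes, so once the memory channels are saturated, adding more cores barely helps.

    /* stream_triad.c -- illustrative bandwidth-bound loop.
       Build with e.g.:  gcc -O2 -fopenmp stream_triad.c -o stream_triad */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 25)   /* ~33M doubles per array; size is illustrative */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];   /* 1 multiply-add per 24 bytes moved */
        double t1 = omp_get_wtime();

        /* read b and c, write a: 3 arrays of 8-byte doubles */
        printf("effective bandwidth: %.1f GB/s\n",
               3.0 * N * 8 / (t1 - t0) / 1e9);
        free(a); free(b); free(c);
        return 0;
    }

Running something like this with different OMP_NUM_THREADS settings typically shows throughput flattening out after a few threads on a 2-channel desktop CPU, which is the behaviour the extra memory channels of a Threadripper Pro are meant to relieve.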
If you parallelize a lot, maybe consider using a GPU; Julia also supports GPU parallelization. When running highly parallelizable code, a single GPU can be about as powerful as hundreds or even thousands of CPU cores; the speedup really is substantial. Memory bandwidth on a GPU is also roughly an order of magnitude higher than on a CPU.
The main catch is that GPUs have very limited, non-expandable memory, and GPUs with a lot of memory tend to become disproportionately expensive. If 24 GB is enough for your workloads, go for an RTX 3090. If you parallelize most of your code on the GPU, the CPU does not matter nearly as much, and you can choose a normal desktop CPU, for example the 16-core AMD Ryzen 9 5950X with 4x16 GB (2-channel), and stick entirely with consumer/gamer hardware, which is much more powerful for much less money.
I have used mclapply quite a bit and love it. It is a memory hog but very convenient. Alas, now I have a different problem that is not simply embarrassingly parallel.
Can R (esp Unix R) employ multiple CPU cores on a single computer, sharing the same memory space, without resorting to copying full OS processes, so that
there is minimal process overhead; and
modifications of global data by one CPU are immediately visible to the other CPUs?
If yes, can R lock some memory just like files (flock)?
I suspect that the answer is no and learning this definitively would be very useful. If the answer is yes, please point me the right way.
regards,
/iaw
You can use the Rdsm package for distributed shared memory parallelism, i.e. multiple R processes using the same memory space.
Besides that, you can employ multi-threaded BLAS/LAPACK (e.g. OpenBLAS or Intel MKL) and you can use C/C++ (and probably Fortran) code together with OpenMP. See assembling a matrix from diagonal slices with mclapply or %dopar%, like Matrix::bandSparse for an example.
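To illustrate the C/OpenMP route, here is a minimal sketch; the file name, function name, and the R calls in the comments are hypothetical, not from Rdsm or any other package. The OpenMP threads run inside a single R process, so they all see the same copy of the data and nothing is forked or copied per worker.

    /* par_colsums.c -- hypothetical illustration.
       Build:  R CMD SHLIB par_colsums.c
               (enable OpenMP, e.g. PKG_CFLAGS="-fopenmp" PKG_LIBS="-fopenmp")
       Use from R:
           dyn.load("par_colsums.so")
           .C("par_colsums", as.double(x), as.integer(nrow(x)),
              as.integer(ncol(x)), sums = double(ncol(x)))                    */
    #include <omp.h>

    void par_colsums(double *x, int *nrow, int *ncol, double *out)
    {
        int n = *nrow, p = *ncol;
        #pragma omp parallel for          /* threads share the process's memory */
        for (int j = 0; j < p; j++) {
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += x[i + (long)j * n];  /* R matrices are column-major */
            out[j] = s;
        }
    }

One caveat: R's own API is not thread-safe, so the threaded region should only touch plain numeric buffers like this, not R objects.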
Have you taken a look at Microsoft R Open (available for Linux), which ships with the Intel Math Kernel Library (MKL)?
I've seen very good performance improvements without rewriting code.
https://mran.microsoft.com/documents/rro/multithread
I am working on optimization of ADAS algorithms, which are written in C++.
I want to optimize those algorithms using OpenCL.
I have gone through some basic OpenCL documentation.
I learned that the kernel code, which does the optimization, is written in C.
But I want to know how the kernel internally splits the work into different work-items.
How does a single statement do the work of a for loop?
Please share your knowledge of OpenCL with me.
Tr,
Ashwin
First of all, the C code is not doing the optimization; parallelism is. Optimization with OpenCL only works on algorithms that can heavily exploit parallelism. If you use OpenCL like regular C, you will probably slow your algorithm down, because moving data between the host and the device takes a lot of time.
Secondly, the kernel is not splitting the work into different work-items. Instead, the programmer splits it by launching many instances of the same kernel code in parallel; each instance is a work-item. You set how many work-items are launched via the global_work_size argument of clEnqueueNDRangeKernel.
If you have a for loop whose iterations do not depend on each other, it could be a good candidate to optimize with OpenCL. It is even better if the loop does a lot of computation but little data goes into and out of it. In that case you turn the body of the loop into an OpenCL kernel and launch it with a global_work_size equal to the loop's total iteration count, as sketched below.
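To make that concrete, here is a hypothetical sketch (the kernel and variable names are made up, not from the question). A host-side loop such as for (int i = 0; i < n; i++) out[i] = a[i] * a[i] + b[i]; becomes a kernel in which each work-item handles exactly one value of i; there is no loop left in the kernel, because launching it with clEnqueueNDRangeKernel and a global_work_size of at least n creates one work-item per iteration.

    /* OpenCL C kernel: one work-item per former loop iteration (illustrative) */
    __kernel void square_add(__global const float *a,
                             __global const float *b,
                             __global float *out,
                             const int n)
    {
        int i = get_global_id(0);   /* this work-item's iteration index */
        if (i < n)                  /* guard: global size is often rounded up to a
                                       multiple of the work-group size */
            out[i] = a[i] * a[i] + b[i];
    }

The device's scheduler then maps those work-items onto its compute units in groups, which is how a "single statement" ends up covering the whole iteration space of the original for loop.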
I am working on an analysis of big data, which is based on social network data combined with data on the social network users from other internal sources, such as a CRM database.
I realize there are a lot of good memory profiling, CPU benchmarking, and HPC packages and code snippets out there. I'm currently using the following:
system.time() to measure the current CPU usage of my functions
Rprof(tf <- "rprof.log", memory.profiling=TRUE) to profile memory usage
Rprofmem("Rprofmem.out", threshold = 10485760) to log objects that exceed 10MB
require(parallel) to give me multicore and parallel functionality for use in my functions
source('http://rbenchmark.googlecode.com/svn/trunk/benchmark.R') to benchmark CPU usage differences in single core and parallel modes
sort( sapply(ls(),function(x){format(object.size(get(x)), units = "Mb")})) to list object sizes
print(object.size(x=lapply(ls(), get)), units="Mb") to give me total memory used at the completion of my script
The tools above give me lots of good data points and I know that many more tools exist to provide related information as well as to minimize memory use and make better use of HPC/cluster technologies, such as those mentioned in this StackOverflow post and from CRAN's HPC task view. However, I don't know a straightforward way to synthesize this information and forecast my CPU, RAM and/or storage requirements as the size of my input data increases over time from increased usage of the social network that I'm analyzing.
Can anyone give examples or make recommendations on how to do this? For instance, is it possible to make a chart or a regression model or something like that that shows how many CPU cores I will need as the size of my input data increases, holding constant CPU speed and amount of time the scripts should take to complete?
I currently have 2 Extra Small web roles (MVC4) running on Azure Cloud Services (Windows Server 2012). I logged in over RDP and checked the resource usage in Task Manager, and found that memory usage is very high: one instance is about 92% used with only 56 MB of free memory left, and the other is at 86% with 150 MB free. The website is very slow. Is it possible that the poor performance is caused by the low memory? Do you think it would be better to upgrade the VM size to Small or larger?
Thanks a lot
Honestly, only you can determine the best instance size. From Small (1 core, 1.75 GB, 100 Mbps NIC) to Extra Large (8 cores, 14 GB, 800 Mbps NIC), machines scale in a straightforward way, and you should pick the smallest instance size that can properly and efficiently run your app, then scale out/in as necessary. The A6/A7 machines are significantly larger (A6: 4 cores, 28 GB, 1000 Mbps NIC; A7: 8 cores, 56 GB, 2000 Mbps NIC), and the Extra Small is very limited (shared core, 768 MB, 5 Mbps NIC). Extra Small instances may have issues running certain workloads.
So: you may be hitting the Extra Small resource limitations with your particular app. You should do some empirical testing on Small through Extra Large to see where your app performs well under low volume, then pick that size and use multiple instances to handle heavier load.
When picking the size, you'll probably reach a bottleneck with a specific resource (CPU, RAM, network), and you'll need to pick based on that. For example, if you really need 6GB RAM, you're now looking at a Large, even if you're barely utilizing CPU.
More details on instance sizes here.
It's easy to scale up to Small first and then go to Large later if needed. You will more than double your memory with a Small at 1.75 GB. Plus, on the Extra Small you are using a shared CPU core, whereas on a Small you don't share the core.
Going to a Large with 7 GB of memory would be overkill, I think.
Is MPI widely used today in HPC?
A substantial majority of the multi-node simulation jobs that run on clusters everywhere use MPI. The most popular alternatives include things like GASNet, which supports PGAS languages; the infrastructure for Charm++; and Linda spaces probably get an honourable mention, just due to the number of core-hours spent running Gaussian. In HPC, UPC, Co-array Fortran/HPF, PVM, etc. end up dividing the tiny fraction that is left.
Any time you read in the science news about a simulation of a supernova, or about Formula One racing teams using simulation to "virtual wind-tunnel" their cars before making design changes, there's an excellent chance that it is MPI under the hood.
It's arguably a shame that it is so widely used by technical computing people, in the sense that there aren't more popular general-purpose, higher-level tools getting the same uptake, but that's where we are at the moment.
I worked for 2 years in the HPC area and can say that 99% of cluster applications were written using MPI.
MPI is widely used in high performance computing, but some machines try to boost performance by deploying shared-memory compute nodes, which usually use OpenMP. In those cases the application would use both MPI and OpenMP to get optimal performance. Some systems also use GPUs to improve performance; I am not sure how well MPI supports that particular execution model.
But the short answer would be yes. MPI is widely used in HPC.
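As a rough sketch of that hybrid pattern (the program below is illustrative, not taken from any of these answers): MPI handles communication between nodes, while OpenMP uses the shared memory within each node.

    /* hybrid.c -- minimal MPI + OpenMP sketch (illustrative).
       Build:  mpicc -fopenmp hybrid.c -o hybrid
       Run:    mpirun -np 4 ./hybrid                                        */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long n = 100000000;                 /* total terms, arbitrary */
        long lo = rank * n / size, hi = (rank + 1) * n / size;

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)   /* threads within a node */
        for (long i = lo; i < hi; i++)
            local += 1.0 / (1.0 + (double)i);

        double total = 0.0;                       /* combine across ranks */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %f (from %d ranks)\n", total, size);

        MPI_Finalize();
        return 0;
    }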
It's widely used on clusters. Often it's the only way that a certain machine supports multi-node jobs. There are other abstractions like UPC or StarP, but those are usually implemented with MPI.
Yes. For example, the Top500 supercomputers are benchmarked using LINPACK, which is MPI-based.
Speaking about HPC, MPI is still the main tool nowadays. Although GPUs are making strong inroads into HPC, MPI remains number one.