Running R models on AWS - do multiple vCPUs function like a multicore system?

I'm running models in the R package 'secr'. The simplest models take days to complete on a 4 GB MacBook, and I've already done everything possible within the model's setup to decrease run time. Parallel (multicore) processing is possible and straightforward in secr, but the benefits are minimal and run time may actually increase. Am I likely to see an improvement in run time if I switch to a high-powered virtual machine in the cloud (e.g. an AWS EC2 instance with 16 GB RAM and 4 vCPUs), or do the EC2 instance's four vCPUs function like a multicore system (in which case I would only benefit from one vCPU despite having four)?
I've asked this question in a couple of different forums and received conflicting answers.

You can think of the vCPUs just like a multicore system. They would appear as multiple cores to any software running on the system.
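One way to check this on a given instance is to ask the operating system how many logical CPUs it reports. A minimal sketch in Python (the helper name `visible_cpus` is made up for illustration; in R, `parallel::detectCores()` serves the same purpose):

```python
import os

def visible_cpus():
    """Return the number of logical CPUs the OS reports (vCPUs on a VM)."""
    count = os.cpu_count()  # may be None if the count cannot be determined
    return count if count is not None else 1

if __name__ == "__main__":
    print("Logical CPUs reported by the OS:", visible_cpus())
```

On a 4-vCPU instance this would be expected to report 4, the same number a 4-core physical machine without hyperthreading would show.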

Good question. It depends. You may see an improvement in runtime if you switch to an EC2 instance type with better virtual hardware specifications. AWS runs a custom version of the Xen hypervisor, and you're getting vCPUs, as you pointed out. Performance will depend on the variability of the other guests' workloads. If the vCPUs are all assigned to instances, and each instance is running CPU-heavy workloads, you're going to see a downward trend in performance. It depends on the usage pattern of all the instances running on the hypervisor. This article from Citrix explains some of the nuances of balancing vCPU time between instances on Xen and why performance will vary:
Citrix on Xen vCPU Performance
The instance type matters, not only the vCPUs and RAM. Avoid the T2 instances because they are 'burstable' and CPU performance will certainly vary. This article from AWS recommends trying M4 instance types for parallelization with R:
Running R on AWS
For specific types of EC2 instances you can control the C-state (sleep levels a core can enter when it is idle) and P-State (desired performance in frequency from a core). This would allow you to tune your instance performance for your workload. The following link explains in detail what instance types allow for C-State and P-State control and shows you how to use the utility "stress" to benchmark and tune different configurations.
EC2: Processor State Control
It would be best to design a test when you first provision the instance to see if the instance type meets your performance requirements, and then run the test again later to see if the performance benchmark holds.
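A minimal sketch of such a repeatable test, assuming a fixed CPU-bound workload stands in for the real model run. `cpu_task` and `benchmark` are hypothetical names, not part of any AWS tooling:

```python
import statistics
import time

def cpu_task(n=200_000):
    """A fixed, purely CPU-bound workload (stand-in for a real model run)."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def benchmark(task, repeats=5):
    """Time `task` several times; return (median, min, max) in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), min(timings), max(timings)

if __name__ == "__main__":
    median, fastest, slowest = benchmark(cpu_task)
    print(f"median={median:.4f}s min={fastest:.4f}s max={slowest:.4f}s")
```

Running the same script right after provisioning and again hours or days later gives two sets of numbers to compare. A wide gap between min and max within one run, or a drift in the median between runs, would suggest contention from other guests.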

Related

Hyperthreading makes my code run slower?

Some multithreaded code I just wrote appears to run slower under hyperthreaded CPUs - i.e. disabling hyperthreading makes it run FASTER. Is this normal?

This depends entirely on use case. A subjective term like normal has a lot of leeway! There are use cases where Hyper-Threading (HT) makes sense, and cases where it will have a performance impact.
One such case of performance decrease is for applications making heavy use of AVX instructions. The AVX instructions are carried out in the vector processing unit (VPU), of which there is one per core in Intel Xeon processors. Additional threads will block when trying to access the VPU if it is not available, leading to no performance improvement with the use of HT.
If you have, say, 4 cores with HT, allowing you to run 8 threads, you will only actually be able to run 4 VPU instructions at a time, so the other 4 threads will be blocked until those complete. The additional overhead of the blocking and scheduling will usually net you a lower throughput than if you were running 4 threads on 4 cores with HT disabled.
Likewise, when running just 4 threads on the 8 logical cores, the OS scheduler can place the threads on any logical core, so two threads may still end up sharing one physical core, with one blocking while it waits for the other to complete. Some newer applications and job schedulers can now coordinate with the OS to "pin" threads to physical cores, allowing HT to stay enabled without oversubscribing the number of threads running on a core. Over time this will probably get better, but it does require awareness on the developer's part.
For more general-purpose use cases, like a generic server handling many types of workloads, running additional threads on a single core with HT is usually a performance gain.
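As a rough illustration of pinning, here is a sketch using Python's `os.sched_setaffinity`, which exists on Linux but not on macOS or Windows. The helper name `pin_to_cpus` is made up, and which logical CPU ids share a physical core is system-specific, so the ids you pass are an assumption you have to verify on your own machine (e.g. with `lscpu`):

```python
import os

def pin_to_cpus(cpus):
    """Restrict the current process to the given logical CPU ids, if supported.

    Returns the resulting affinity set, or None where the OS lacks the call.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS and Windows
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently use
    target = set(cpus) & allowed        # ignore ids that are not available
    if not target:
        return allowed                  # nothing valid requested; leave as-is
    os.sched_setaffinity(0, target)     # 0 means "the calling process"
    return os.sched_getaffinity(0)
```

On a 4-core HT machine where the ids 0 to 3 happened to map to four distinct physical cores, `pin_to_cpus([0, 1, 2, 3])` would keep the process off the sibling hyperthreads.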

Max number of workers/slaves for parallel job snow

I'm running a foreach loop with the snow back-end on a Windows machine. I have 8 cores to work with. The R script is executed via a system call embedded in a Python script, so there would be an active Python instance too.
Is there any benefit to not having #workers=#cores and instead having #workers<#cores, so there is always an opening for system processes or the Python instance?
It successfully runs with #workers=#cores, but do I take a performance hit by saturating the cores (max possible threads) with the R worker instances?

It will depend on:
1) Your processor (specifically hyperthreading)
2) How much info has to be copied to/from the different images
3) Whether you're implementing this over multiple boxes (LAN)
For 1) hyperthreading helps. I know my machine does it, so I typically have twice as many workers as cores, and my code completes in about 85% of the time it takes when I match the number of workers with cores. It won't improve more than that.
2) If you're not forking (using sockets, for instance), you're working as if you're in a distributed memory paradigm, which means creating one copy in memory for every worker. This can take a non-trivial amount of time. Also, multiple images on the same machine may take up a lot of space, depending on what you're working on. I often match the number of workers with the number of cores because doubling the workers would make me run out of memory.
This is compounded by 3) network speeds over multiple workstations. Locally between machines our switch will transfer things at about 20 MB/second, which is 10x faster than my internet download speed at home but is a snail's pace compared to making copies in the same box.
You might consider increasing R's nice value so that the Python process has priority when it needs to do something.
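The trade-off between cores and memory can be sketched as a small helper. This is only an illustration: `choose_workers` is a made-up name, and the per-worker memory figure is something you would have to measure for your own job:

```python
def choose_workers(logical_cpus, mem_total_gb, mem_per_worker_gb, reserve_cpus=0):
    """Pick a worker count limited by both CPU and memory.

    reserve_cpus leaves room for other processes (e.g. the calling script).
    Each socket worker is assumed to hold its own full copy of the data.
    """
    by_cpu = max(1, logical_cpus - reserve_cpus)
    by_mem = max(1, int(mem_total_gb // mem_per_worker_gb))
    return min(by_cpu, by_mem)

if __name__ == "__main__":
    # 8 logical CPUs, 16 GB RAM, 4 GB per worker: memory is the limit
    print(choose_workers(8, 16, 4))
    # Plenty of memory, one CPU kept free for the Python wrapper
    print(choose_workers(8, 64, 2, reserve_cpus=1))
```

Whichever limit is hit first decides the count, which is why doubling the workers for hyperthreading only pays off when memory allows it.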

How to get cpu cores utilizations in xen?

I have installed Xen as the hypervisor, with dom0 and some paravirtualized machines as domU VMs on it.
I know xentop is used for checking the performance of the system and the virtual machines, and I can read its output to measure a virtual machine's CPU utilization. But it only gives the total CPU usage!
So, is there any tool or any way to get CPU usage per core?

I think you may be able to get what you want from XenMon. http://www.virtuatopia.com/index.php/Xen_Monitoring_Tools_and_Techniques
Also, try using xentop with the VCPUs option: -v or press V when inside xentop.
If you are using XenServer, then there are lots of interesting host metrics available, much more than the base Xen setup. Check out http://support.citrix.com/servlet/KbServlet/download/38321-102-714737/XenServer-6.5.0_Administrators%20Guide.pdf, Chapter 9.
It's all particularly good if you use XenCenter to view those metrics, but then I would say that since I wrote a significant portion of it ;-)

Create a cluster of co-workers' Windows 7 PCs for parallel processing in R?

I am running the termstrc yield curve analysis package in R across 10 years of daily bond price data for 5 different countries. This is highly compute-intensive: it takes 3200 seconds per country with a standard lapply, and if I use foreach and %dopar% (with doSNOW) on my 2009 i7 Mac, using all 4 cores (8 with hyperthreading), I get this down to 850 seconds. I need to re-run this analysis every time I add a country (to compute inter-country spreads), and I have 19 countries to go, with many more credit yield curves to come in the future. The time taken is starting to look like a major issue. By the way, the termstrc analysis function in question is accessed in R but is written in C.
Now, we're a small company of 12 people (read: limited budget), all equipped with 8 GB RAM, i7 PCs, of which at least half are used for mundane word processing / email / browsing style tasks, that is, using 5% maximum of their performance. They are all networked using gigabit (but not 10-gigabit) Ethernet.
Could I cluster some of these underused PCs using MPI and run my R analysis across them? Would the network be affected? Each iteration of the yield curve analysis function takes about 1.2 seconds, so I'm assuming that if the granularity of parallel processing is to pass a whole function iteration to each cluster node, 1.2 seconds should be quite large compared with the gigabit Ethernet lag?
Can this be done? How? And what would the impact be on my co-workers? Can they continue to read their emails while I'm taxing their machines?
I note that Open MPI seems not to support Windows anymore, while MPICH seems to. Which would you use, if any?
Perhaps run an Ubuntu virtual machine on each PC?
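The granularity reasoning can be checked with a back-of-the-envelope model. The 3200 seconds per country and 1.2 seconds per iteration come from the figures above; the worker count and the per-task overhead are pure assumptions, and the model ignores load imbalance and start-up cost:

```python
def estimated_runtime(serial_seconds, task_seconds, workers, overhead_per_task):
    """Estimate wall-clock time when independent tasks are spread over workers.

    Assumes perfectly divisible work and a fixed communication overhead
    (in seconds) added to every task.
    """
    n_tasks = serial_seconds / task_seconds
    return n_tasks * (task_seconds + overhead_per_task) / workers

if __name__ == "__main__":
    # Hypothetical: 6 PCs x 4 cores = 24 workers, 10 ms overhead per task
    for overhead in (0.0, 0.01, 0.1):
        t = estimated_runtime(3200, 1.2, workers=24, overhead_per_task=overhead)
        print(f"overhead={overhead:.2f}s -> about {t:.0f}s per country")
```

As long as the per-task overhead stays small relative to 1.2 seconds, the estimate barely moves, which supports the assumption that whole-iteration granularity is coarse enough for a gigabit LAN.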

Yes you can. There are a number of ways. One of the easiest is to use redis as a backend (as easy as calling sudo apt-get install redis-server on an Ubuntu machine; rumor has it that you could have a redis backend on a Windows machine too).
By using the doRedis package, you can very easily enqueue jobs on a task queue in redis, and then use one, two, ... idle workers to query the queue. Best of all, you can easily mix operating systems, so yes, your co-workers' Windows machines qualify. Moreover, you can use one, two, three, ... clients as you see fit and need, and scale up or down. The queue does not know or care; it simply supplies jobs.
Best of all, the vignette in the doRedis package has working examples of a mix of Linux and Windows clients used to make a bootstrapping example go faster.
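To make the "queue does not know or care" idea concrete, here is a toy sketch of the pull model in Python. It is an analogy only, not the doRedis API: an in-process `queue.Queue` stands in for redis, threads stand in for worker machines, and `run_with_workers` is a made-up name. Python threads won't give CPU parallelism, so this shows the scheduling pattern rather than a speedup:

```python
import queue
import threading

def run_with_workers(jobs, n_workers, func):
    """Run func over jobs using n_workers threads that pull from a shared queue."""
    tasks = queue.Queue()
    for index, job in enumerate(jobs):
        tasks.put((index, job))

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                index, job = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            value = func(job)
            with lock:
                results[index] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Reassemble results in the original job order
    return [results[i] for i in range(len(results))]

if __name__ == "__main__":
    print(run_with_workers([1, 2, 3, 4], n_workers=2, func=lambda x: x * x))
```

The result is the same whether one worker or five drain the queue; only the elapsed time changes. That is what makes it easy to add or remove machines as they become idle.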

Perhaps not the answer you were looking for, but - this is one of those situations where an alternative is sooo much better that it's hard to ignore.
The cost of AWS clusters is ridiculously low (my emphasis) for exactly these types of computing problems. You pay only for what you use. I can guarantee you that you will save money (at the very least in opportunity costs) by not spending the time trying to convert 12 Windows machines into a cluster. For your purposes, you could probably even do this for free. (IIRC, they still offer free computing time on clusters.)
References:
Using AWS for parallel processing with R
http://blog.revolutionanalytics.com/2011/01/run-r-in-parallel-on-a-hadoop-cluster-with-aws-in-15-minutes.html
http://code.google.com/p/segue/
http://www.vcasmo.com/video/drewconway/8468
http://aws.amazon.com/ec2/instance-types/
http://aws.amazon.com/ec2/pricing/
Some of these instances are so powerful you probably wouldn't even need to figure out how to set up your work on a cluster (given your current description). As you can see from the references, costs are ridiculously low, ranging from $1-4 per hour of compute time.
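A rough sketch of the arithmetic behind that claim. The 850 seconds per country and the 19 remaining countries are from the question, and the $1-4 hourly range is from the pricing mentioned above. The function name is made up, and real billing details (minimum billing increments, storage, data transfer) are ignored:

```python
def compute_cost(run_seconds, dollars_per_hour):
    """Cost of a run billed pro rata by the hour (a simplification)."""
    return run_seconds / 3600.0 * dollars_per_hour

if __name__ == "__main__":
    seconds = 19 * 850  # 19 countries at 850 s each, assuming similar speed
    for rate in (1.0, 4.0):
        print(f"${rate:.0f}/hour -> ${compute_cost(seconds, rate):.2f}")
```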

What about OpenCL?
This would require rewriting the C code, but would allow potentially large speedups. The GPU has immense computing power.

Single-CPU programs running on Hyper-Threading-enabled quadcore CPU

I'm a researcher in statistical pattern recognition, and I often run simulations that run for many days. I'm running Ubuntu 12.04 with Linux 3.2.0-24-generic, which, as I understand, supports multicore and hyper-threading. With my Intel Core i7 Sandy Bridge Quadcore with HTT, I often run 4 simulations (programs that take a long time) at the same time. Before I ask my question, here are the things that I already (think I) know.
My OS (Ubuntu 12.04) detects 8 CPUs due to hyper-threading.
The scheduler in my OS is clever enough never to schedule two programs to run on two logical (virtual) cores belonging to the same physical core, because the OS supports SMT (Simultaneous Multi-Threading).
I have read the Wikipedia page on Hyper-Threading.
I have read the HowStuffWorks page on Sandy Bridge.
OK, my question is as follows. When I run 4 simulations (programs) on my computer at the same time, they each run on a separate physical core. However, due to hyper-threading, each physical core is split into two logical cores. Therefore, is it true that each of the physical cores is only using half of its full capacity to run each of my simulations?
Thank you very much in advance. If any part of my question is not clear, please let me know.

This answer is probably late, but I see that nobody offered an accurate description of what's going on under the hood.
To answer your question, no, one thread will not use half a core.
Only one thread works inside the core at a time, but that one thread can saturate the core's whole processing power.
Assume thread 1 and thread 2 belong to core #0. Thread 1 can saturate the whole core's processing power, while thread 2 waits for the other thread to end its execution. It's a serialized execution, not parallel.
At a glance, it looks like that extra thread is useless. I mean, the core can only process 1 thread at a time, right?
Correct, but there are situations in which the cores are actually idling because of 2 important factors:
cache miss
branch misprediction
Cache miss
When it receives a task, the CPU searches inside its own cache for the memory addresses it needs to work with. In many scenarios the memory data is so scattered that it is physically impossible to keep all the required address ranges inside the cache (since the cache does have a limited capacity).
When the CPU doesn't find what it needs inside the cache, it has to access the RAM. The RAM itself is fast, but it pales compared to the CPU's on-die cache. The RAM's latency is the main issue here.
While the RAM is being accessed, the core is stalled. It's not doing anything. This is not noticeable because all these components work at a ridiculous speed anyway and you wouldn't notice it through some CPU load software, but it stacks additively. One cache miss after another and another hampers the overall performance quite noticeably.
This is where the second thread comes into play. While the core is stalled waiting for data, the second thread moves in to keep the core busy. Thus, you mostly negate the performance impact of core stalls.
I say mostly because the second thread can also stall the core if it hits a cache miss of its own, but the likelihood of both threads being stalled on a cache miss at the same time is much lower than that of a single thread stalling.
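The effect can be illustrated with a toy simulation of the description above. This is a deliberately simplified model, not how real hardware works in detail: one core runs one thread at a time, every thread alternates a fixed compute burst with a fixed memory stall, and the tick counts are invented:

```python
def simulate(n_threads, bursts, compute_ticks, stall_ticks):
    """Simulate threads sharing one core; return (total_ticks, core_utilization).

    Each thread performs `bursts` cycles of compute followed by a memory stall.
    While one thread is stalled, another ready thread may use the core.
    """
    remaining = [bursts] * n_threads   # bursts left per thread
    ready_at = [0] * n_threads         # tick at which each thread can run again
    now = 0                            # current time on the core
    busy = 0                           # ticks the core spent computing
    while any(remaining):
        candidates = [i for i in range(n_threads) if remaining[i] > 0]
        i = min(candidates, key=lambda j: ready_at[j])
        now = max(now, ready_at[i])    # core idles if nobody is ready yet
        now += compute_ticks           # run this thread's compute burst
        busy += compute_ticks
        remaining[i] -= 1
        ready_at[i] = now + stall_ticks  # thread now waits on memory
    total = max(ready_at)              # includes the final stall
    return total, busy / total

if __name__ == "__main__":
    for threads in (1, 2):
        total, util = simulate(threads, bursts=100, compute_ticks=4, stall_ticks=2)
        print(f"{threads} thread(s): {total} ticks, utilization {util:.2f}")
```

With these made-up numbers, one thread leaves the core idle a third of the time, while two threads finish twice the work in far less than twice the time because each fills the other's stalls.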
Branch misprediction
A branch is a point where the code has more than one possible execution path. The most basic branching code would be an if statement.
Modern CPUs have branch prediction algorithms embedded into their microcode which try to predict the execution path of a piece of code. These predictors are actually quite sophisticated and although I don't have solid data on prediction rate, I do recall reading some articles a while back stating that Intel's Sandy Bridge architecture has an average successful branch prediction rate of over 90%.
When the CPU hits a piece of branching code, it chooses one path (the path the predictor thinks is the right one) and executes it. Meanwhile, another part of the core evaluates the branching expression to see whether the branch predictor was right. This is called speculative execution.
This works similarly to 2 different threads: one evaluates the expression, and the other executes one of the possible paths in advance.
From here we have 2 possible scenarios:
The predictor was correct. Execution continues normally from the speculative branch which was already being executed while the code path was being decided upon.
The predictor was wrong. The entire pipeline which was processing the wrong branch has to be flushed and start over from the correct branch.
OR, the readily available thread can come in and simply execute while the mess caused by the misprediction is resolved. This is the second use of hyperthreading.
Branch prediction on average speeds up execution considerably since it has a very high rate of success. But performance does incur quite a penalty when the prediction is wrong.
Branch prediction is not a major factor of performance degradation since, like I said, the correct prediction rate is quite high.
But cache misses are a problem and will continue to be a problem in certain scenarios.
From my experience hyperthreading does help out quite a bit with 3D rendering (which I do as a hobby). I've noticed improvements of 20-30% depending on the size of the scenes and materials/textures required. Huge scenes use huge amounts of RAM making cache misses far more likely. Hyperthreading helps a lot in overcoming these misses.

Since you are running on a Linux kernel, you are in luck, because the scheduler is smart enough to make sure your tasks are divided between your physical cores.
Linux became hyperthreading-aware in kernel 2.4.17 (ref: http://kerneltrap.org/node/391)
Note that the reference is from the old O(1) scheduler. Linux now uses the CFS scheduling algorithm which was introduced in kernel 2.6.23 and should be even better.
But as already suggested, you can experiment by disabling hyperthreading in the BIOS and seeing whether your particular workload runs faster or slower with hyperthreading enabled. If you start 8 tasks instead of 4, you will probably find that the total execution time for 8 tasks with hyperthreading is shorter than for two separate runs of 4 tasks, but again, the best thing to do is to experiment. Good luck!

If you really want just 4 dedicated cores, you should be able to disable hyperthreading in your BIOS. Also, and this part I'm less clear on, I believe that the processor is smart enough to do more work on a single thread if its second logical core is idle.

No, it's not exactly true. A hyperthreaded core is not two cores. Some things can run in parallel, but not as much as on two separate cores.
