I noticed an interesting problem. If I run the following code in R 2.12.0 (32-bit) on a Windows machine with a 3.00 GHz Core 2 Duo CPU and 2 GB of RAM, it runs in less than one second. If I run it on a Unix box with sparc-sun-solaris2.10 (also 32-bit, though the Unix box could run 64-bit), it takes 84 seconds. The clock speed of the Unix box is 2.5 GHz. If I run top while the code is running, I notice that my R process only uses up to ~3.2% of the available CPU, even if more is available. Could this be part of the problem? I read the installation manual, but nothing jumped out at me as the obvious solution. Is the Unix operating system somehow limiting available resources while Windows is not? Or is there some preferable way to compile R from source that was not used here? I apologize if I have not given enough information to answer the problem; this is not really my area of expertise.
t0 <- proc.time()[[3]]           # elapsed time before the loop
x <- rnorm(10000)                # 10,000 standard-normal draws
for (i in 1:10000) {
  sd(x)                          # recompute the standard deviation each iteration
}
print(proc.time()[[3]] - t0)     # elapsed seconds for the whole loop
Processors such as the T1 or T2 have a number of cores, and each core has a number of strands (hardware-level context switching). If you can run a multithreaded application, you get high throughput; a typical intended use case would be a Java-based web server handling, say, 20-40 connections at the same time.
The downside of this type of processor is that the single-threaded performance of these SPARC chips is quite low. It looks like Oracle is aware of the issue; current development on the T4 focuses on improving single-threaded speed.
The T1 processor exposes 32 logical CPUs to the operating system. If that is your machine, and the value displayed by top is a percentage of total computing power, then 1/32 ≈ 3.125%, which is close to what you saw.
To squeeze all the performance from a T1 processor, you need to make R use multiple CPUs, for example via the multicore package.
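As a rough illustration (not from the original post), the 10,000 sd() calls from the example above could be spread across several cores with mclapply from the multicore package (now part of R's parallel package). Note that this relies on fork(), so it works on Unix-like systems but not on Windows:
library(parallel)                      # on older installs: library(multicore)

x <- rnorm(10000)

# Split the 10,000 repetitions into 4 chunks, one per worker process.
res <- mclapply(1:4, function(chunk) {
  replicate(2500, sd(x))               # each worker does a quarter of the calls
}, mc.cores = 4)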
I've been using R 3.1.2 on an early-2014 13" MacBook Air with 8 GB of RAM and a 1.7 GHz Intel Core i7, running OS X Mavericks.
Recently, I've started to work with substantially larger data frames (2+ million rows and 500+ columns) and I am running into performance issues. In Activity Monitor, I'm seeing virtual memory sizes of 64GB, 32GB paging files, etc. and the "memory pressure" indicator is red.
Can I just throw more hardware at this problem? Since the MacBook Air tops out at 8 GB of physical memory, I was thinking about buying a Mac Pro with 64 GB of memory. Before I spend the $5K+, I wanted to ask if there are any inherent limitations in R other than the ones that I've read about here: R Memory Limits, or if anyone who has a Mac Pro has experienced any issues running R/RStudio on it. I've searched using Google and haven't come up with anything specific about running R on a Mac Pro.
Note that I realize I'll still be using 1 CPU core unless I rewrite my code. I'm just trying to solve the memory problem first.
Several thoughts:
1) It's a lot more cost-effective to use a cloud service like https://www.dominodatalab.com (not affiliated). Amazon AWS would also work; the benefit of Domino is that it takes the work out of managing the environment, so you can focus on the data science.
2) You might want to redesign your processing pipeline so that not all your data needs to be loaded in memory at the same time (soon you will find you need 128 GB, and then what?). Read up on memory mapping, using databases, and splitting your pipeline into steps that can be executed independently of each other (a quick search turned up http://user2007.org/program/presentations/adler.pdf); a small sketch of chunked processing follows below. Running out of memory is a common problem when working with real-life datasets; throwing more hardware at the problem is not always your best option (though sometimes it really can't be avoided).
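As a sketch of that second point (the file name, chunk size, and per-chunk work are made up for illustration), a large CSV can be processed through a connection one chunk at a time, so that only one chunk is ever held in memory:
con <- file("big_table.csv", open = "r")        # hypothetical file name
chunk_size <- 100000                            # rows per chunk (illustrative)

first <- read.csv(con, nrows = chunk_size, header = TRUE)
col_names <- names(first)
total <- sum(first[[1]])                        # per-chunk work: sum first column (stand-in for real processing)

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = col_names),
    error = function(e) NULL)                   # read.csv errors at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + sum(chunk[[1]])
}
close(con)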
Has anyone tried to compile glibc with -march=corei7 to see if there's any performance improvement over the version that ships by default with a Linux x86_64 distribution? GCC is compiled with -march=i686. I think (but am not sure) that the math library is also compiled the same way. Can anybody confirm this?
Most Linux distributions for x86 compile using only i686 instructions, but ask the compiler to schedule them for later processors. I haven't really followed later developments.
A long while back, shipping different versions of the system libraries for different processor lines was common, but the performance differences were soon deemed too small to be worth the cost. And machines have become more uniform in performance in the meantime.
One thing that always has to be remembered is that today's machines are memory bound: a memory access takes a few hundred times longer than an instruction, and the gap is growing. Not to mention that this machine (an oldish laptop, top-of-the-line some two years back) has 4 cores (8 threads), all battling to get data and instructions from memory. Making the code run a tiny bit faster, so the CPU can wait longer for RAM, isn't very productive.
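A rough way to see that memory bottleneck from R itself (the sizes here are arbitrary): summing the same vector through a sequential index and through a randomly shuffled index does identical arithmetic, but the random gather defeats the caches and prefetcher and is noticeably slower:
n <- 2e7
x <- runif(n)
idx_seq  <- seq_len(n)           # 1, 2, 3, ...: sequential access
idx_rand <- sample.int(n)        # the same indices in random order

system.time(sum(x[idx_seq]))     # prefetch- and cache-friendly
system.time(sum(x[idx_rand]))    # same arithmetic, mostly cache misses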
The peak GFLOPS of the cores of a desktop i7-4770K @ 4 GHz is 4 GHz × 8 (AVX single-precision lanes) × 4 (two FMA units × 2 operations each) × 4 cores = 512 GFLOPS. But the latest Intel IGP (Iris Pro 5100/5200) has a peak of over 800 GFLOPS, so some algorithms will run even faster on the IGP, and combining the cores with the IGP would be better still. Additionally, the IGP keeps eating up more silicon; the Iris Pro 5100 now takes up over 30% of the die. It seems clear which direction Intel desktop processors are headed.
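Just to spell out that arithmetic (using the factors as stated above):
ghz <- 4; avx_lanes <- 8; flops_per_lane <- 4; cores <- 4
ghz * avx_lanes * flops_per_lane * cores    # = 512 GFLOPS peak, single precision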
As far as I have seen, however, the Intel IGP is mostly ignored by programmers, with the exception of OpenCL/OpenGL. I'm curious to know how one can program the Intel HD Graphics hardware for compute (e.g. SGEMM) without OpenCL.
Added comment:
There is no Intel support for HD Graphics and OpenCL on Linux. I found Beignet, which is an open-source attempt to add support on Linux, at least for Ivy Bridge HD Graphics. I have not tried it. Presumably the people developing Beignet know how to program the HD Graphics hardware without OpenCL.
Keep in mind that there is a performance hit to copy the data to the video card and back, so this must be taken into account. AMD is close to releasing APU chips that have unified memory for the CPU and GPU on the same die, which will go a long way towards alleviating this problem.
The way the GPU used to be utilized before CUDA and OpenCL was to represent the memory to be operated on as a texture, using DirectX or OpenGL. Thank goodness we don't have to do that anymore!
AMD is really pushing the APU / OpenCL model, so more programs should take advantage of the GPU via OpenCL - if the performance trade off is there. Currently, GPU computing is a bit of a niche market relegated to high performance computing or number crunching that just isn't needed for web browsing and word processing.
It doesn't make sense any more for vendors to let you program using low-level ISA.
It's very hard and most programmers won't use it.
It keeps them from adjusting the ISA in future revisions.
So programmers use a language (like C99 in OpenCL) and the runtime does ISA-specific optimizations right on the user's machine.
An example of what this enables: AMD switched from VLIW vector machines to scalar machines and existing kernels still ran (most ran faster). You couldn't do this if you wrote ISA directly.
Programming a coprocessor like Iris without OpenCL is rather like driving a car without the steering wheel.
OpenCL is designed to expose the requisite parallelism that Iris needs to achieve its theoretical performance. You can't just spawn hundreds of threads or processes on it and expect performance. Having blocks of threads doing the same thing, at the same time, on similar memory addresses, is the whole crux of the matter.
Maybe you can think of a better paradigm than OpenCL for achieving that goal; but until you do, I suggest you try learning some OpenCL. If you are into Python, pyopencl is a great place to start.
Possible Duplicate:
Parallel processing in R limited
I've written some code using R's multicore package, and I'm running it on a 24-core machine. In fact there are only 12 physical cores, but they are hyperthreaded, so it looks like there are 24.
Here's what's strange: all the threads run on the same single core! So each one only uses a tiny slice of CPU, instead of running on its own core and chewing up all the available cores.
For simplicity, I'm just running 4 threads:
mclapply(1:30, function(size) {
  # time-consuming stuff that is CPU bound (think "forecast.ets" et al)
}, mc.cores = 4, mc.preschedule = FALSE)
Prior to running this, there is already an R process running on one core, using 100% of that core's capacity. Next, I launch the multicore job, and the 4 extra workers all fight for that same core: they each get about 12% of one core, or roughly 1% of the available processing power, when each should be able to get 100% of a core. Also, the original R process now only gets 50% of the core.
The OS is Ubuntu 12.04 (64-bit), the hardware is Intel, and R is version 2.15.2 ("Trick or Treat").
Thoughts? (I know I could just use snowfall, but I have a ton of variables, and I really don't want to have to sfExport all of them!)
Edit: oh, I guess there's some global lock somewhere? But still, why would there be a conflict between two completely separate R processes? I can run two R processes in parallel just fine, with each taking 100% of a core's CPU.
Edit 2: Thanks to Dirk's pointer, I rebuilt OpenBLAS, and it's looking much healthier now!
A possible culprit is a side effect of the OpenBLAS library, which sets CPU affinity so that processes stick to one core. See Parallel processing in R limited for a discussion and a link to further discussion on the r-sig-hpc list, which has a fix.
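One workaround that has been reported for this (assuming a Linux system with the taskset utility available; rebuilding OpenBLAS without processor affinity, e.g. with NO_AFFINITY=1, is the more permanent fix) is to reset the affinity mask of the running R process before forking, so that the mclapply() workers inherit an unrestricted mask:
# Linux only: widen this R process's CPU affinity to all cores;
# child processes forked by mclapply() inherit the new mask.
system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))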
I have a basic question.
If I run an executable (Release build, Visual Studio 2010) on two computers that have the same CPU speed but run two different Windows operating systems, e.g. Windows 7 vs. XP, should I expect to see different CPU usage when I measure it using Task Manager? Is the CPU speed the only factor in measuring CPU usage?
Thanks.
Sar
Different OS's? Yes.
Operating systems are the go-between for the programs you run and the bare metal they run on. As OSes change and evolve, they naturally add and remove features that consume resources: things that run in the background, or changes to the manner in which the OS talks to the hardware.
Also, the measurement of CPU usage is done by the OS. There isn't a tachometer on chips saying "running at 87% of redline", but rather that "tach" is constructed largely by the OS.
After better understanding your situation: I would suggest taking a look at Performance Monitor (perfmon.exe), which ships with both XP and Windows 7 and gives you much finer-grained detail about processor usage. Another (very good) option would be to run a profiler on your application on both OSes and compare the results; that would likely be the best way to benchmark your application on the two systems.
Even on the same OS you should expect to see different usages, because there are so many factors that determine CPU usage.
The percentage of CPU usage listed in Task Manager is not a very good indication of much of anything, except to say that a program either is or is not using the CPU. That particular statistic is derived from task-switching statistics, and task switching is very sensitive to basically everything going on in the computer, from network access to memory speed to CPU temperature.