I'm running a notebook in JupyterLab. I am loading some large Monte Carlo chains as numpy arrays with shape (500000, 150). I have 10 chains, which I load into a list in the following way:
import numpy as np

chains = []
for i in range(10):
    chain = np.loadtxt('my_chain_{}.txt'.format(i))
    chains.append(chain)
If I load 5 chains then all works well. If I try to load 10 chains, after about 6 or 7 I get the error:
Kernel Restarting
The kernel for my_code.ipynb appears to have died. It will restart automatically.
I have tried loading the chains in different orders to make sure there is not a problem with any single chain. It always fails when loading number 6 or 7 no matter the order, so I think the chains themselves are fine.
I have also tried loading 5 chains into one list and then, in the next cell, loading the other 5 into a second list, but the failure still happens when I get to chain 6 or 7, even when I split things up like this.
So it seems like the problem is that I'm loading too much data into the notebook, or something like that. Does this seem right? Is there a workaround?
It is indeed possible that you are running out of memory, though it is unlikely that your system as a whole is running out (unless it's a very small system). More likely you are hitting Jupyter's own limits: it is typical behavior for the kernel to die and restart when Jupyter exceeds its memory limits; see here, here and here.
Consider that if you are using the float64 datatype (the default), the memory usage (in megabytes) per array is:
N_rows * N_cols * 64 / 8 / 1024 / 1024
For N_rows = 500000 and N_cols = 150, that's 572 megabytes per array. You can verify this directly using numpy's dtype and nbytes attributes (noting that the output is in bytes, not bits):
chain = np.loadtxt('my_chain_{}.txt'.format(i))
print(chain.dtype)
print(chain.nbytes / 1024 / 1024)
If you are trying to load 10 of these arrays, that's about 6 gigabytes.
One workaround is increasing the memory limits for jupyter, per the posts referenced above. Another simple workaround is using a less memory-intensive floating point datatype. If you don't really need the digits of accuracy afforded by float64 (see here), you could simply use a smaller floating point representation, e.g. float32:
chain = np.loadtxt('my_chain_{}.txt'.format(i), dtype=np.float32)
chains.append(chain)
Given that you can get to 6 or 7 already, halving the data usage of each chain should be enough to get you to 10.
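As a quick sanity check, a sketch along these lines (using zero-filled arrays of the same shape instead of your actual files) shows the halving directly:

import numpy as np

shape = (500_000, 150)
a64 = np.zeros(shape, dtype=np.float64)
a32 = np.zeros(shape, dtype=np.float32)

print(a64.nbytes / 1024 / 1024)  # ~572 MB per chain with float64
print(a32.nbytes / 1024 / 1024)  # ~286 MB per chain with float32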
You could be running out of memory.
Try loading the chains one at a time and concatenating them as you go:
chains = []
for i in range(10):
    chain = np.loadtxt('my_chain_{}.txt'.format(i))
    chains.append(chain)
    if i > 0:
        # merge the newly loaded chain into the first entry, then drop the extra copy
        chains[0] = np.concatenate((chains[0], chains[1]), axis=0)
        chains.pop(1)
I've been using OpenCL for a little while now for hobby purposes. I was wondering if someone could explain how I should think about global and local work sizes. I've been playing around with them for a bit, but I cannot seem to wrap my head around it.
I have this piece of code; the kernel is run with a global work size of 8 and a local work size of 4:
__kernel void foo(__global int *bar)
{
    bar[get_global_id(0)] = get_local_id(0);
}
This results in bar looking like this:
{0, 1, 2, 3, 0, 1, 2, 3, 4}
I know why this happens, given the work sizes I've used, but I can't seem to wrap my head around how I should interpret it.
Does this mean that there are 4 threads working locally and 8 globally, so I have 4 * 8 threads running in total? And if so, what makes those 4 working locally special?
Or does it mean the body of the kernel simply has two counters, one local and one global? But then, what is the point of that?
I know I might be a bit vague and my question might seem dumb, but I don't know how to use this more optimally or how I should think about it.
Global size is the total number of work items.
Work groups subdivide this total workload, and local size defines the size of each group within the global size.
So for a global work size of 8 and a local size of 4, each in 1 dimension, you will have 2 groups. Your get_global_id(0) will be different for each thread: 0…7. get_local_id(0) will return 0…3 for the 4 different threads within each group. This is what you're seeing in indices 0 through 7 of your output.
This also means that if your global work size is 8, only the first 8 items of bar will be set by your kernel. So anything beyond that (the value 4 at index 8 in your output) is undefined.
Does this mean that there are 4 threads working locally and 8 globally, so I have 4 * 8 threads running in total? And if so, what makes those 4 working locally special?
You're overthinking it. There are 8 threads in total. They are subdivided into 2 groups of 4 threads. What is "local" about the threads in those groups is that they share access to the same local memory. Threads which are not in the same group can only "communicate" via global memory.
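To make the mapping concrete, here is a small sketch (plain Python, purely illustrative) of how the IDs relate for a global size of 8 and a local size of 4. The relationship get_global_id(0) == group_id * local_size + get_local_id(0) is exactly what produces the repeating 0..3 pattern your kernel writes into bar:

global_size = 8
local_size = 4

for global_id in range(global_size):
    group_id = global_id // local_size  # which work-group this thread belongs to
    local_id = global_id % local_size   # the thread's ID within its group
    print(global_id, group_id, local_id)

# global_id: 0 1 2 3 4 5 6 7
# group_id:  0 0 0 0 1 1 1 1
# local_id:  0 1 2 3 0 1 2 3   <- the values written into bar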
Using local memory can hugely improve efficiency for some workloads:
It's very fast.
Threads in a work group can use barriers to ensure they are in lock-step, i.e. they can wait for one another to guarantee another thread has written to a specific local memory location. (Threads in different groups cannot wait for each other.)
But:
Local memory is small (typically a few dozen KiB), and using all of it in one group usually has further efficiency penalties.
Local memory must be filled with data inside the kernel, and its contents are lost when the kernel completes. (Except for device-scheduled kernels in OpenCL 2.)
There are tight limits on group size due to hardware limitations.
So if you are not using local memory, work groups and therefore local work size are essentially irrelevant to you.
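If you do want to experiment with local memory, here is a minimal sketch of the idea (the host side uses pyopencl, and the kernel name and rotation logic are just illustrations, not taken from your code). Each thread publishes its value to local memory, the barrier guarantees the whole group has written before anyone reads, and then each thread reads its neighbour's value, which is only possible between threads of the same work-group:

import numpy as np
import pyopencl as cl

src = """
__kernel void rotate_in_group(__global const int *in_data, __global int *out_data,
                              __local int *scratch)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    scratch[lid] = in_data[gid];   // publish my value to the group
    barrier(CLK_LOCAL_MEM_FENCE);  // wait until the whole group has written

    // read the value written by the next thread in the same group
    out_data[gid] = scratch[(lid + 1) % get_local_size(0)];
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, src).build()

host_in = np.arange(8, dtype=np.int32)
mf = cl.mem_flags
buf_in = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=host_in)
buf_out = cl.Buffer(ctx, mf.WRITE_ONLY, host_in.nbytes)

local_size = 4
prg.rotate_in_group(queue, (8,), (local_size,), buf_in, buf_out,
                    cl.LocalMemory(host_in.itemsize * local_size))

host_out = np.empty_like(host_in)
cl.enqueue_copy(queue, host_out, buf_out)
print(host_out)  # rotation wraps within each group of 4: [1 2 3 0 5 6 7 4]

With a global size of 8 and a local size of 4, the rotation wraps around within each group of 4, which is exactly the group structure described above.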
I can create an array of a million elements like this:
Array(1:1_000_000)
Vector{Int64} with 1000000 elements
but if I try to create an array of a billion elements I get this:
Array(1:1_000_000_000)
Julia has exited.
Press Enter to start a new session.
Is Julia not able to handle a billion elements in an array or what am I doing wrong here?
You are creating an Array of Int64, each of which needs to be stored in memory:
julia> sizeof(3)
8
So at some point you're bound to run out of memory - this is not due to some inherent limit on the number of elements in an array, but rather the size of the overall array, which in turn depends on the size of each element. Consider:
julia> sizeof(Int8(3))
1
julia> [Int8(1) for _ in 1:1_000_000_000]
1000000000-element Array{Int8,1}:
1
1
1
⋮
1
1
1
So filling the array with a smaller data type (8-bit rather than 64-bit integers) allows me to create an array with more elements.
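To put numbers on that (a quick back-of-the-envelope sketch, written in Python only because it is plain arithmetic): a billion Int64 values need roughly 7.5 GiB, while a billion Int8 values fit in under 1 GiB:

n = 1_000_000_000

print(n * 8 / 1024**3)  # Int64: 8 bytes each, ~7.45 GiB in total
print(n * 1 / 1024**3)  # Int8:  1 byte each,  ~0.93 GiB in total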
While there is no limit on how big an Array can be in Julia, there is obviously the limit of available RAM (mentioned in the other answer). Basically, you can assume that all of your available system memory can be allocated for a Julia process. sizeof is a good way to calculate how much RAM you will need.
However, if you actually do big array computing in Julia, the above limit can be circumvented in many ways:
Use massive-memory machines from a major cloud computing provider. I use Julia on AWS Linux and it works like a charm; you can get up to 4 TB of RAM on a virtual machine and 24 TB of RAM on a bare-metal machine. While it is not a Julia solution, sometimes it is the easiest and cheapest way to go.
Sometimes your data is sparse, i.e. you do not actually use all of those memory cells. In such cases consider SparseArrays. In other cases your sparse data is structured in some specific way (e.g. non-zero values only on or near the diagonal), in which case use BandedMatrices.jl. It is worth noting that there is even a Julia package for infinite algebra. Basically, whatever you find in the JuliaMatrices organization is worth looking at.
You can use memory mapping, which means that most of your array lives on disk and only part of it is held in RAM at any time (see the sketch after this list). This way you are limited by your disk space rather than your RAM.
You can use DistributedArrays.jl and have a single huge Array hosted on several machines.
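As an illustration of the memory-mapping idea mentioned above (sketched with numpy's memmap in Python purely for familiarity; in Julia the equivalent lives in the Mmap standard library), the array is backed by a file on disk and only the pages you touch need to be resident in RAM:

import numpy as np

# hypothetical file name; creates a disk-backed array of a billion Int8 values
big = np.memmap('big_array.bin', dtype=np.int8, mode='w+', shape=(1_000_000_000,))

big[:1000] = 1   # only the touched pages are pulled into RAM
big.flush()      # write dirty pages back to the file on disk
print(big[:10])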
Hope it will be useful for you or other people trying to do big data algebra in Julia.
I'm trying to generate an optimized LHS (Latin Hypercube Sampling) design in R, with sample size N = 400 and d = 7 variables, but it's taking forever. My PC is an HP Z820 workstation with 12 cores, 32 GB of RAM, Windows 7 64-bit, and I'm running Microsoft R Open, which is a multicore build of R. The code has been running for half an hour, but I still don't see any results:
library(lhs)
lhs_design <- optimumLHS(n = 400, k = 7, verbose = TRUE)
It seems a bit weird. Is there anything I could do to speed it up? I have heard that parallel computing may help with R, but I don't know how to use it, and I have no idea whether it speeds up only code that I write myself or whether it could also speed up an existing package function such as optimumLHS. I don't have to use the lhs package necessarily; my only requirement is that I would like to generate an LHS design which is optimized in terms of the S-optimality criterion, the maximin metric, or some other similar optimality criterion (thus, not just a vanilla LHS). If worse comes to worst, I could even accept a solution in a different environment than R, but it must be either MATLAB or an open source environment.
Here is a little code to check performance:
library(lhs)
library(ggplot2)

performance <- c()
for (i in 1:100) {
    ptm <- proc.time()
    invisible(optimumLHS(n = i, k = 7, verbose = FALSE))
    time <- print(proc.time() - ptm)[[3]]
    performance <- rbind(performance, data.frame(time = time, n = i))
}

ggplot(performance, aes(x = n, y = time)) +
    geom_point()
Not looking too good. It seems to me you might be in for a very long wait indeed. Based on the algorithm, I don't think there is a way to speed things up via parallel processing, since to optimize the separation between sample points you need to know the locations of all the sample points. I think your only options for speeding this up are to take a smaller sample or to get (access to) a faster computer. It strikes me that, since this is something that only really has to be done once, perhaps there is a resource where you could simply get a properly sampled and optimized design that has already been computed?
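For what it's worth, one way to turn the benchmark above into an estimate for n = 400 is to fit a simple polynomial to the measured times and extrapolate. A rough sketch of the idea (in Python via numpy; times.csv is a hypothetical export of the performance data frame above, and the cubic degree is only a guess at the scaling):

import numpy as np

# hypothetical export of the performance data frame above, with columns time and n
time, n = np.loadtxt('times.csv', delimiter=',', skiprows=1, unpack=True)

coeffs = np.polyfit(n, time, deg=3)    # assume roughly cubic growth in n
est_seconds = np.polyval(coeffs, 400)  # extrapolate to n = 400
print(est_seconds / 3600, 'hours')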
So it looks like roughly 650 hours on my machine, which is very comparable to yours, to compute the n = 400 design.
I have 3 OpenCL devices on my MacBook Pro, so I am trying a slightly more complicated calculation with a small example.
I create a context containing the 3 devices, two GPUs and one CPU, and then create 3 command queues, one for each of them.
Then I create a big global buffer, big but not bigger than the smallest maximum buffer size available on any one of the devices, and create 3 sub-buffers from this input buffer, with carefully calculated sizes. Another, not-so-big output buffer is also created, with 3 small sub-buffers created on it.
After setting up the kernel, setting the arguments and so on, everything looks good. The first two devices accept the kernel and start to run, but the third one refuses it and returns CL_INVALID_WORK_GROUP_SIZE.
I don't want to post the source code here, as there is nothing special about it and I am sure there is no bug in it.
I logged the following:
command queue 0
device: Iris Pro max work group size 512
local work size(32 * 16) = 512
global work size(160 * 48) = 7680
number of work groups = 15
command queue 1
device: GeForce GT 750M max work group size 1024
local work size(32 * 32) = 1024
global work size(160 * 96) = 15360
number of work groups = 15
command queue 2
device: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz max work group size 1024
local work size(32 * 32) = 1024
global work size(160 * 96) = 15360
number of work groups = 15
I checked that the first two outputs are correct, as expected, so the kernel and host code must be correct.
There is only one possibility I can think of: is there any limit when using the CPU and a GPU at the same time while sharing one buffer object?
Thanks in advance.
OK, I figured out the problem. The CPU reports a max work item size of (1024, 1, 1), so the local work size cannot be (32 x 32).
But I still have a problem when using a local work size bigger than (1, 1). I'll keep trying.
From Intel's OpenCL guide:
https://software.intel.com/en-us/node/540486
Querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE always returns 1, even for a very simple kernel without a barrier. In that case, the work group size can be 128 (it's a 1D work group), but it cannot be 256.
The conclusion is that in some cases it is better not to use it. :(
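For anyone hitting the same error, it can help to query the relevant limits per device before picking a local work size. Here is a small sketch of the queries (using pyopencl only for brevity; the noop kernel is a hypothetical placeholder, and the same information is available from clGetDeviceInfo and clGetKernelWorkGroupInfo in C):

import pyopencl as cl

ctx = cl.create_some_context()
# hypothetical minimal kernel, just something to query work-group info against
prg = cl.Program(ctx, "__kernel void noop(__global int *x) { }").build()

for dev in ctx.devices:
    print(dev.name)
    print("  max work group size:", dev.max_work_group_size)
    print("  max work item sizes:", dev.max_work_item_sizes)
    print("  preferred work group size multiple:",
          prg.noop.get_work_group_info(
              cl.kernel_work_group_info.PREFERRED_WORK_GROUP_SIZE_MULTIPLE, dev))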
I am currently evaluating AES encryption and decryption speed on my laptop and my workstation.
When executing
cryptsetup benchmark -c aes --key-size 128
I get normal results of almost 200 MB/s without the AES-NI extension.
When I load the extension with
modprobe aesni-intel
and perform the same benchmark, I get completely unrealistic results,
for example 68021 MB/s on decrypt.
Any suggestions as to what might be causing these unrealistic results?
BTW: the OS on the laptop is Ubuntu; the workstation runs Gentoo.
I uninstalled the prebuilt Ubuntu package and installed cryptsetup from source. With
make check
the make script performs a single test, and those results are fine, but when I install it via
make install
I again get these weird results.
Unrealistic benchmark results are usually caused by a wrong (as in totally invalid) approach to benchmarking.
Judging from their benchmark source, the benchmark core is (in horrendous pseudo-code)
totalTime = 0
totalSize = 0
while (totalTime < 1000) {
    (sampleTime, sampleSize) = processSingleSample
    totalTime += sampleTime
    totalSize += sampleSize
}
speed = totalSize / totalTime
Imagine a situation where the execution time of processSingleSample is close to zero, i.e. below the resolution of the clock being used: each iteration steadily increases totalSize, but on many iterations the measured time will not increase at all. In the end totalTime is 1000 while totalSize is arbitrarily large, hence the resulting "speed" is arbitrarily large.
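To make the failure mode concrete, here is a tiny sketch (plain Python; the 1 ms clock resolution and the 1 MB sample size are made-up stand-ins, not cryptsetup's actual values). Once the work being timed is shorter than the clock's resolution, the measured interval rounds down to zero and any size divided by it is meaningless:

import time

def coarse_clock_ms():
    # stand-in for a low-resolution timer: we can only see whole milliseconds
    return int(time.perf_counter() * 1000)

sample_size_mb = 1.0                      # assume each sample processes 1 MB

start = coarse_clock_ms()
sum(range(1000))                          # stand-in for one very fast AES-NI sample
elapsed_ms = coarse_clock_ms() - start    # almost always 0 with a 1 ms clock

if elapsed_ms == 0:
    print("measured 0 ms -> size / time is unbounded; the reported speed is garbage")
else:
    print(sample_size_mb / (elapsed_ms / 1000.0), "MB/s")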
This benchmarking approach can still be useful when each individual iteration takes a significant amount of time, but in this particular case (especially after you enable AES-NI, which decreases the time for each individual iteration even further) it is just not the right one.