I am currently evaluating AES encryption and decryption speed on my laptop and my workstation.
When executing
cryptsetup benchmark -c aes --key-size 128
I get normal results of almost 200 MB/s without the AES-NI extension.
When I load the extension with
modprobe aesni-intel
and run the same benchmark, I get completely unrealistic results,
for example 68021 MB/s on decrypt.
Any suggestions as to what might be causing these unrealistic results?
BTW: the OS on the laptop is Ubuntu; the workstation runs Gentoo.
Update: I uninstalled the prebuilt Ubuntu package and installed cryptsetup from source. When I run
make check
the build performs a single test, and those results are fine, but when I install it via
make install
I again get these weird results.
Unrealistic benchmark results are usually caused by a wrong (as in totally invalid) approach to benchmarking.
Judging from their benchmark source, the benchmark core is (in horrendous pseudo-code)
totalTime = 0
totalSize = 0
while ( totalTime < 1000 ) {
    (sampleTime, sampleSize) = processSingleSample
    totalTime += sampleTime
    totalSize += sampleSize
}
speed = totalSize / totalTime
Imagine a situation where the execution time of processSingleSample is close to zero: each iteration steadily increases totalSize, but on some iterations totalTime does not increase at all. In the end totalTime is 1000 while totalSize is arbitrarily large, hence the resulting "speed" is arbitrarily large.
This benchmarking approach can still be useful when each individual iteration takes a significant amount of time, but in this particular case (especially after you enable AES-NI, which shrinks the time of each individual iteration even further) it is simply not the right one.
I'm running a notebook in JupyterLab. I am loading some large Monte Carlo chains as numpy arrays, each with shape (500000, 150). I have 10 chains, which I load into a list in the following way:
chains = []
for i in range(10):
    chain = np.loadtxt('my_chain_{}.txt'.format(i))
    chains.append(chain)
If I load 5 chains then all works well. If I try to load 10 chains, after about 6 or 7 I get the error:
Kernel Restarting
The kernel for my_code.ipynb appears to have died. It will restart automatically.
I have tried loading the chains in different orders to make sure there is not a problem with any single chain. It always fails when loading number 6 or 7 no matter the order, so I think the chains themselves are fine.
I have also tried to load 5 chains in one list and then, in the next cell, load the other 5, but the failure still happens when I get to number 6 or 7, even when I split it up like this.
So it seems like the problem is that I'm loading too much data into the notebook, or something like that. Does this seem right? Is there a workaround?
It is indeed possible that you are running out of memory, though unlikely that it's actually your system that is running out of memory (unless it's a very small system). It is typical behavior that if jupyter exceeds its memory limits, the kernel will die and restart, see here, here and here.
Consider that if you are using the float64 datatype by default, the memory usage (in megabytes) per array is:
N_rows * N_cols * 64 / 8 / 1024 / 1024
For N_rows = 500000 and N_cols = 150, that's 572 megabytes per array. You can verify this directly using numpy's dtype and nbytes attributes (noting that the output is in bytes, not bits):
chain = np.loadtxt('my_chain_{}.txt'.format(i))
print(chain.dtype)
print(chain.nbytes / 1024 / 1024)
If you are trying to load 10 of these arrays, that's about 6 gigabytes.
One workaround is increasing the memory limits for Jupyter, per the posts referenced above. Another simple workaround is using a less memory-intensive floating-point datatype. If you don't really need the digits of accuracy afforded by float64 (see here), you could simply use a smaller floating-point representation, e.g. float32:
chain = np.loadtxt('my_chain_{}.txt'.format(i), dtype=np.float32)
chains.append(chain)
Given that you can get to 6 or 7 already, halving the data usage of each chain should be enough to get you to 10.
You could be running out of memory.
Try to load the chains one by one and then concatenate them.
chains = []
for i in range(10):
    chain = np.loadtxt('my_chain_{}.txt'.format(i))
    chains.append(chain)
    if i > 0:
        # merge the new chain into the running array, then drop the extra copy
        chains[0] = np.concatenate((chains[0], chains[1]), axis=0)
        chains.pop(1)
This is my first post on this forum, so if I don't do things correctly, please tell me and I'll correct it!
So here's my problem: at the end of one of my R programs, I run a simple loop using repeat. This loop performs a series of calculations for each month of the year, 137 times over, making a total of 1,646 repetitions. This takes time: about 50 minutes.
numero_profil_automatisation = read.csv2("./import_csv/numero_profil_automatisation.csv", header=TRUE, sep=";")

n = 1
repeat {
  annee = "2020"
  annee_profil = "2020"
  mois_de_debut = "1"
  mois_de_fin = "12"
  operateur = numero_profil_automatisation[n, 1]
  profil_choisi = numero_profil_automatisation[n, 3]
  couplage = numero_profil_automatisation[n, 4]
  subvention = numero_profil_automatisation[n, 5]
  type = numero_profil_automatisation[n, 6]
  seuil = 0.05
  graine = 4356

  resultat = calcul_depense(annee, annee_profil, mois_de_debut, mois_de_fin, operateur,
                            profil_choisi, couplage, subvention, type, graine, seuil)

  nom = paste("./resultat_csv/", operateur, an, type, couplage, subvention,
              profil, ".csv", sep="_")
  write.csv2(resultat, nom)

  n <- n + 1
  if (n == 138) break
}
I would like to optimize the code, so I talked to a friend of mine, a computational developer (who doesn't "know" R), who advised me, among other things, to parallelize the calculations. My new work computer has 4 cores (R detects 8 logical cores), so I could save a lot of time.
Being an economist-statistician and not a developer, I'm completely out of my depth on this subject. I looked at some forums and articles and found a piece of code that worked on my previous computer, which had only 2 cores (R detected 4 logical cores): it divided the computing time by almost 2, from 2 h to 1 h. On my new computer, this piece of code doesn't change the computing time at all; with or without it, the run takes around 50 minutes (better processor, more RAM).
Below are the two lines of code I added just above the code shown earlier, with of course the corresponding packages loaded at the beginning of the script (I can share those if needed).
no_cores <- availableCores() - 1
plan(multicore, workers = no_cores)
Do you have any idea why it seemed to work on my previous computer but not on the new one, when nothing has changed except the computer? Or is there a corrective action I should take?
I have ~200 .Rds datasets that I perform various operations on in a pipeline of multiple scripts. In most of these scripts I've begun with a for loop and upgraded to a foreach. My problem is that the dataset objects are of very different sizes (their on-disk sizes in MB vary widely).
So if I optimise core number usage (I have a 12-core, 16 GB RAM machine at the office and a 16-core, 32 GB RAM machine at home), it'll whip through the first 90 without incident, but then the larger files bunch up and max out the total RAM allocation (remember, Rds files are compressed, so these are larger in RAM than on disk, but the variability in file size at least gives an indication of the problem). This causes workers to crash, which typically leaves me with 1 to 3 cores running through the remainder of the big files (using .errorhandling = "pass"). I'm thinking it would be great to optimise the core number based on the number and RAM size of workers and the total available RAM, and I figured others might have been in a similar dilemma and developed strategies to address this. Some approaches I've thought of but not tried:
Approach 1: first loop or list through the files on disk, potentially by opening and closing them, use object.size() to get their sizes in RAM, sort largest to smallest, cut the list halfway, reverse the order of the second half, and intersperse them: smallest, biggest, 2nd smallest, 2nd biggest, etc. Two workers (or any even-numbered multiple) should therefore be working on the 'mean' RAM usage. However: worker 1 will finish its job faster than any other job in the stack and then go on to job 3, the 2nd smallest, likely finish that really quickly as well, and then take job 4, the 2nd largest, while worker 2 is still on the largest, meaning that by job 4 this approach has the machine processing the 2 largest RAM objects concurrently, the opposite of what we want.
Approach 2: sort objects by size-in-RAM, small to large. Starting from object 1, iteratively add subsequent objects' RAM usage until either the total available RAM or the core count would be exceeded. Foreach on that batch. Repeat. This would work but requires some convoluted coding (probably a for loop wrapper around the foreach which passes the foreach its task list each time; a rough sketch follows the list of approaches below). Also, if there are a lot of tasks which won't exceed the RAM (per my example), the cores-limited batching means all 12 or 16 have to complete before the next 12 or 16 are started, introducing inefficiency.
Approach 3: sort small to large as per Approach 2. Run foreach with all cores. This will churn through the small ones maximally efficiently until the tasks get bigger, at which point workers will start to crash, reducing the number of workers sharing the RAM and thus increasing the chance that the remaining workers can continue. Conceptually this means cores-1 tasks fail and need to be re-run, but the code is easy and should work fast. I already have code that checks the output directory and removes tasks from the jobs list if they've already been completed, which means I could just re-run this approach; however, I should anticipate further losses, and therefore further reruns, unless I lower the core number.
Approach 4: as Approach 3, but somehow close the worker (reduce the core number) BEFORE the task is assigned, meaning the task doesn't have to trigger a RAM overrun and fail in order to reduce the worker count. This would also mean not having to restart RStudio.
Approach 5: ideally there would be some intelligent queueing system in foreach that would do all this for me, but beggars can't be choosers! Conceptually this would be similar to Approach 4, above: for each worker, don't start the next task until there's sufficient RAM available.
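To make Approach 2 concrete, here is the rough, untested sketch I have in mind. All names are placeholders: obj_size_gb would hold the per-file size-in-RAM estimates, ram_budget_gb is whatever RAM I allow one batch to use, rds_files is the vector of input paths, and process_file() stands in for the real pipeline step.
library(foreach)
library(doParallel)
registerDoParallel(cores = 12)

# greedily fill batches: stop when the RAM budget or the core count would be exceeded
ord <- order(obj_size_gb)                        # small to large
batches <- list(); current <- integer(0); used <- 0
for (i in ord) {
  too_big  <- used + obj_size_gb[i] > ram_budget_gb
  too_many <- length(current) >= 12
  if ((too_big || too_many) && length(current) > 0) {
    batches[[length(batches) + 1]] <- current    # close the current batch
    current <- integer(0); used <- 0
  }
  current <- c(current, i)
  used <- used + obj_size_gb[i]
}
batches[[length(batches) + 1]] <- current

results <- vector("list", length(obj_size_gb))
for (b in batches) {
  # the "for loop wrapper around the foreach": one parallel pass per batch
  results[b] <- foreach(i = b, .errorhandling = "pass") %dopar% {
    process_file(rds_files[i])
  }
}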
Any thoughts appreciated from folks who've run into similar issues. Cheers!
I've thought a bit about this too.
My problem is a bit different: I don't get any crashes, but rather slowdowns due to swapping when there is not enough RAM.
Things that may work:
randomize the order of the iterations so that the load is approximately evenly distributed (without needing to know the timings in advance)
similar to Approach 5, add some barriers (make workers wait in a while loop with Sys.sleep()) while there is not enough memory (e.g. as determined via the {memuse} package).
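For instance, a minimal sketch of such a barrier; the threshold and polling interval are arbitrary, and free_ram_gib() is a hypothetical helper you would write around memuse::Sys.meminfo():
# sketch only: free_ram_gib() is a hypothetical wrapper around memuse::Sys.meminfo()
wait_for_ram <- function(min_gib = 4, poll_sec = 30) {
  while (free_ram_gib() < min_gib) {
    Sys.sleep(poll_sec)   # park this worker until enough memory is free again
  }
}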
Things I do in practice:
always store the result of each foreach iteration on disk and test whether it has already been computed (i.e. whether its RDS file already exists)
skip some iterations if needed
rerun the "intensive" iterations using less cores
I'm running CPLEX from IBM ILOG CPLEX Optimization Studio 12.6.
Here I'm facing a weird issue: solving the same optimization problem (a pure LP) multiple times in a row yields different results.
The aim is to solve once, then iteratively modify the coefficient matrix, and re-solve the problem. However, we experienced that the changes between iterations did not correspond to the modifications.
This led us to try re-solving the problem without making any modifications in between, which still returned different results.
The catch is that we still do one major modification before we start iterating, and our hypothesis is that this change (cplex.setCoef(...) on about 10,000 rows) is done asynchronously, so that it is only partially done during the first re-solution iterations.
However, we cannot seem to find any documentation stating that this method is asynchronous, nor any way to ensure synchronous execution, so that all the changes are done before CPLEX restarts.
Does anyone know if this is the case? Is there any way to delay restart until cplex.setCoef(...) is done? The problem is quite huge, but the representative lines are:
functionUsingSetCoefOn10000rows();
for (var j = 0; j < 100; j++) {
    cplex.solve();
    writeln("Iteration " + j + ": " + cplex.getObjValue());
    for (var k = 0; k < 100000; k++) {
        doBusyWork(); // just to kill time
    }
}
which outputs
Iteration 0: 1529486959.814946
Iteration 1: 1544325969.750444
Iteration 2: 1549669732.757587
Iteration 3: 1551818419.584333
...
Iteration 33: 1564007987.849925
...
Iteration 98: 1564007987.849925
Iteration 99: 1564007987.849925
Last-minute update
Reducing the number of calls to cplex.setCoef to about 2500 removes the issue, and all iterations return the same objective value. Sadly, we do need to change all the 10,000 coefficients.
Edit: the OPL scripting and engine logs are here: http://goo.gl/ywJhkm and here: http://goo.gl/v2Qhm9
Sorry that this is not really an answer, but it is too big to go as a comment...
I don't think that the setCoef() calls would be asynchronous and left incomplete - that would be very surprising. Such behaviour would be far too unpredictable, and too many other people would have run into problems with it. However, CPLEX itself will use multiple threads to solve a problem, and that means it can generate different solutions each time it runs. The example objective values that you show do seem to change significantly, so a few questions/observations:
1: The numbers seem to be monotonically increasing - are they all increasing like this until they reach the maximum value? It looks like some kind of convergence behaviour. On re-running, CPLEX will start from a previous solution if it can. Check that there isn't some other CPLEX parameter stopping the search early, such as an iteration limit, a time limit, or a wider solution optimality tolerance.
2: Have you looked at the CPLEX logs from each run to see what CPLEX is doing in each run?
3: If you have doubts about the model being solved, try dumping out the model as an LP file and check the values in each iteration. They should all be the same in your case. You can also try solving the LP file in the CPLEX standalone optimiser to see what value that gives.
4: Have you tried setting the parameters to make CPLEX use a different LP algorithm (e.g. primal simplex, barrier etc)?
I'm taking my first steps in OpenCL (and CUDA) for my internship. All well and good: I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The code for the main is here: http://pastebin.com/i4A6kPfn, and the kernel is here: http://pastebin.com/Wefrqifh. I start measuring time after segmentPunten(segmentArray, begin, eind); has returned, and I stop measuring after the last clEnqueueReadBuffer.
The computation time on an Nvidia GT440 is 38.6 seconds, on a GT555M 35.5 seconds, on an Athlon II X4 5.6 seconds, and on an Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times so high, and what solutions are there for this?
What is it supposed to do (short version): calculate how much noise load is produced by an airplane passing by.
Long version: there are several Observer Points (OPs), which are the points at which the sound of a passing airplane is measured. The flight path is segmented into 10,000 segments; this is done in the function segmentPunten. The double for loop in the main gives the OPs their coordinates. There are two kernels. The first one calculates the distance from a single OP to a single segment; this is then saved in the array "afstanden". The second kernel calculates the sound load at an OP from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten)
{
    // ...
    for (i = 0; i < aantalSegmenten; i++) {
        double distance = afstanden[threadID * aantalSegmenten + i];
        // ...
    }
    // ...
}
It looks like aantalSegmenten is being set to 1000. You have a loop in each kernel that accesses global memory 1000 times. Without crawling through the code, I'm guessing that many of these accesses overlap when considering your computation as a whole. Is this the case? Will two work items access the same global memory? If so, you will see a potentially huge win on the GPU from rewriting your algorithm to partition the work such that each global memory location is read only once, saving it into local memory. After that, each work item in the work group that needs that location can read it quickly.
As an aside, the CL specification allows you to omit the leading __ from CL keywords like global and kernel. I don't think many newcomers to CL realize that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it the kernel compiles, data transfer, or actual kernel execution?
As mentioned above, you can get rid of the kernel compiles by caching the results. I believe some OpenCL implementations (the Apple one, at least) already do this automatically. With others, you may need to do the caching manually. Here are instructions for the caching.
If the performance bottleneck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently, when a block of threads performs a read from memory, the addresses are spread out through memory, which is a real killer for GPU performance. You'd ideally want to index the array with something like afstanden[ndx*NUM_THREADS + threadID], which would make accesses from a work group load a contiguous block of memory. This is much faster than the current, essentially random, memory lookups.
First of all, you are not measuring the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To make a fair comparison, measure the computation time from the first "non-static" part of your program (for example, from the first clSetKernelArg call to the last clEnqueueReadBuffer).
If the execution time is still too high, then you can use some kind of profiler (such as the Visual Profiler from NVidia) and read the OpenCL Best Practices guide, which is included in the CUDA Toolkit documentation.
As for the raw kernel execution time: consider (and measure) whether you really need double precision for your calculation, because double-precision calculations are artificially slowed down on consumer-grade NVidia cards.