JMH - why do I need Blackhole.consumeCPU()?

I'm trying to understand why it is wise to use Blackhole.consumeCPU().
Something I found about Blackhole.consumeCPU() on Google:
Sometimes when we run a benchmark across multiple threads we also
want to burn some CPU cycles to simulate CPU busyness when running our
code. This can't be a Thread.sleep, as we really want to burn CPU.
Blackhole.consumeCPU(long) gives us the capability to do this.
My example code:
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class StringConcatAvgBenchmark {

    StringBuilder stringBuilder1;
    StringBuilder stringBuilder2;
    StringBuffer stringBuffer1;
    StringBuffer stringBuffer2;
    String string1;
    String string2;

    /*
     * re-initializing the values for every iteration
     */
    @Setup(Level.Iteration)
    public void init() {
        stringBuilder1 = new StringBuilder("foo");
        stringBuilder2 = new StringBuilder("bar");
        stringBuffer1 = new StringBuffer("foo");
        stringBuffer2 = new StringBuffer("bar");
        string1 = new String("foo");
        string2 = new String("bar");
    }

    @Benchmark
    @Warmup(iterations = 10)
    @Measurement(iterations = 100)
    @BenchmarkMode(Mode.AverageTime)
    public StringBuilder stringBuilder() {
        // the operation is very thin, so also consume some CPU
        Blackhole.consumeCPU(100);
        // returning the value to avoid dead code elimination
        return stringBuilder1.append(stringBuilder2);
    }

    @Benchmark
    @Warmup(iterations = 10)
    @Measurement(iterations = 100)
    @BenchmarkMode(Mode.AverageTime)
    public StringBuffer stringBuffer() {
        Blackhole.consumeCPU(100);
        // returning the value to avoid dead code elimination
        return stringBuffer1.append(stringBuffer2);
    }

    @Benchmark
    @Warmup(iterations = 10)
    @Measurement(iterations = 100)
    @BenchmarkMode(Mode.AverageTime)
    public String stringPlus() {
        Blackhole.consumeCPU(100);
        return string1 + string2;
    }

    @Benchmark
    @Warmup(iterations = 10)
    @Measurement(iterations = 100)
    @BenchmarkMode(Mode.AverageTime)
    public String stringConcat() {
        Blackhole.consumeCPU(100);
        // returning the value to avoid dead code elimination
        return string1.concat(string2);
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(StringConcatAvgBenchmark.class.getSimpleName())
                .threads(1).forks(1).shouldFailOnError(true).shouldDoGC(true)
                .jvmArgs("-server").build();
        new Runner(options).run();
    }
}
Why are the results of this benchmark better with Blackhole.consumeCPU(100)?
EDIT:
Output with blackhole.consumeCPU(100):
Benchmark                      Mode  Cnt    Score    Error  Units
StringBenchmark.stringBuffer   avgt   10  398,843 ± 38,666  ns/op
StringBenchmark.stringBuilder  avgt   10  387,543 ± 40,087  ns/op
StringBenchmark.stringConcat   avgt   10  410,256 ± 33,194  ns/op
StringBenchmark.stringPlus     avgt   10  386,472 ± 21,704  ns/op
Output without blackhole.consumeCPU(100):
Benchmark                      Mode  Cnt   Score   Error  Units
StringBenchmark.stringBuffer   avgt   10  51,225 ± 19,254  ns/op
StringBenchmark.stringBuilder  avgt   10  49,548 ±  4,126  ns/op
StringBenchmark.stringConcat   avgt   10  50,373 ±  1,408  ns/op
StringBenchmark.stringPlus     avgt   10  87,942 ±  1,701  ns/op
My question was why the author of this code uses Blackhole.consumeCPU(100) here.
I think I know now why: the benchmarks are too quick without some delay.
With Blackhole.consumeCPU(100) you can measure each benchmark better and receive more significant results.
Is that right?

Adding an artificial delay would not normally improve a benchmark.
But there are some cases where the operation you are measuring contends over some resource, and you need a backoff that only consumes CPU and hopefully does nothing else. See e.g. the case in:
http://shipilev.net/blog/2014/nanotrusting-nanotime/
The benchmark in the original question is not such a case, so I'd speculate that Blackhole.consumeCPU is used there without a good reason, or at least that the reason is not called out specifically in the comments. Don't do that.
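For illustration only, here is a minimal JMH sketch of the kind of contended benchmark where a CPU-only backoff is useful. It is not from the post above; the class name, the shared AtomicLong, and the backoffTokens parameter are all hypothetical:

import java.util.concurrent.atomic.AtomicLong;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.infra.Blackhole;

// All threads hammer the same AtomicLong; the @Param'd backoff lets you observe
// how the cost of the contended increment changes as each thread does more
// unrelated CPU-only work between accesses.
@State(Scope.Benchmark)
@Threads(4)
public class ContendedCounterBenchmark {

    private final AtomicLong counter = new AtomicLong();

    @Param({"0", "10", "100", "1000"})
    public long backoffTokens;

    @Benchmark
    public long incrementWithBackoff() {
        long v = counter.incrementAndGet();   // the contended operation under test
        Blackhole.consumeCPU(backoffTokens);  // pure CPU backoff: no sleeping, no memory traffic
        return v;                             // return the value to avoid dead code elimination
    }
}

Here consumeCPU simulates unrelated work between contended accesses; it has no place in a benchmark like the one in the question, where each method already performs a self-contained operation.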

Related

Simultaneously read and write to buffer

I am in the process of learning Julia and I'd like to do some buffer manipulation.
What I want to achieve is the following:
I've got a buffer that I can write to and read from at the same time, meaning that the speed with which I add a value to the Fifo buffer approximately equals the speed with which I read from the buffer. Reading and writing will happen in separate threads so it can occur simultaneously.
Additionally, I want to be able to control the values that I write into the buffer based on user input. For now, this is just a simple console prompt asking for a number, which I then want to write into the stream continuously. The prompt refreshes and asks for a new number to write into the stream, but the prompt is non-blocking, meaning that in the background the old number is written to the buffer until I enter a new number, which is then written to the buffer continuously.
This is my preliminary code for simultaneous reading and writing of the stream:
using Causal

CreateBuffer(size...) = Buffer{Fifo}(Float32, size...)

function writetobuffer(buf::Buffer, n::Float32)
    while !isfull(buf)
        write!(buf, fill(n, 2, 1))
    end
end

function readfrombuffer(buf::Buffer)
    while true
        while !isempty(buf)
            @show read(buf)
        end
    end
end

n_channels = 2
sampling_rate = 8192
duration = 2
n_frames = sampling_rate * duration

sbuffer = CreateBuffer(n_channels, n_frames)

print("Please enter a number: ")
n = parse(Float32, readline())

s1 = Threads.@spawn writetobuffer(sbuffer, n)
s2 = Threads.@spawn readfrombuffer(sbuffer)
s1 = fetch(s1)
s2 = fetch(s2)
I am not sure how to integrate the user input in a way that keeps writing and reading the latest number the user put in. I looked at the documentation for channels, but didn't manage to get it working in a way that was non-blocking for the stream writing. I don't know what the correct approach is (channels, events, Julia's multithreading) to enable this functionality.
How would I go about including this?
I managed to get it working, but I think it could be improved:
using Causal

CreateBuffer(size...) = Buffer{Fifo}(Float32, size...)

function writeToBuffer(buf::Buffer, n::Float32)
    write!(buf, fill(n, 2, 1))
end

function readFromBuffer()
    global soundbuffer
    println("Starting")
    sleep(0.5)
    while true
        while !isempty(soundbuffer)
            read(soundbuffer)
        end
    end
    println("Exiting...")
end

function askForInput()::Float32
    print("Please enter a number: ")
    a = parse(Float32, readline())
    return(a)
end

function inputAndWrite()
    global soundbuffer
    old_num::Float32 = 440
    new_num::Float32 = 440
    while true
        @async new_num = askForInput()
        while (new_num == old_num)
            writeToBuffer(soundbuffer, new_num)
        end
        old_num = new_num
        println("Next iteration with number " * string(new_num))
    end
end

n_channels = 2
sampling_rate = 8192
duration = 2
n_frames = sampling_rate * duration

soundbuffer = CreateBuffer(n_channels, n_frames)

s1 = Threads.@spawn inputAndWrite()
s2 = Threads.@spawn readFromBuffer()
s1 = fetch(s1)
s2 = fetch(s2)

Time complexity of Letter Combinations of a Phone Number

Here is LeetCode question 17:
Given a string containing digits from 2-9 inclusive, return all possible letter combinations that the number could represent. Return the answer in any order.
(https://leetcode.com/problems/letter-combinations-of-a-phone-number/)
Below is my DFS recursive code:
import java.util.ArrayList;
import java.util.List;

class Solution {
    public static final String[] map = {"", "", "abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"};

    public List<String> letterCombinations(String digits) {
        List<String> result = new ArrayList<String>();
        if (digits == null || digits.length() == 0) { return result; }
        int curr_index = 0;
        StringBuilder prefix = new StringBuilder("");
        update_result(digits, prefix, curr_index, result);
        return result;
    }

    private void update_result(String digits, StringBuilder prefix, int curr_index, List<String> result) {
        if (curr_index >= digits.length()) {
            result.add(prefix.toString());
            return;
        } else {
            String letters = map[digits.charAt(curr_index) - '0'];
            for (int i = 0; i < letters.length(); i++) {
                prefix.append(letters.charAt(i));
                update_result(digits, prefix, curr_index + 1, result);
                prefix.deleteCharAt(prefix.length() - 1);
            }
            return;
        }
    }
}
In the LeetCode solutions, it says the time complexity is O(n*4^n), where n is the length of the input. I understand the 4^n part, but I have trouble understanding where the remaining extra n comes from.
My analysis of my code is: T(n) = O(1) + 4T(n-1). (The for loop runs at most 4 times, each time recursing with the length decremented by 1, and updating the prefix inside the loop takes constant time.)
This solves to 1 + 4 + 4^2 + ... + 4^n = O(4^n).
Can anyone help with why the solution says the time complexity is O(n*4^n)?
I agree with you, the time complexity should be O(4^n).
I have several ideas why it could be O(n * 4^n):
StringBuilder's append method can take O(n) time when its capacity reaches the threshold, because that means copying all elements to a new array. But in your case the maximum length of the resulting string is 4 (from the problem constraints), whereas the default capacity is 16 (https://docs.oracle.com/javase/8/docs/api/java/lang/StringBuilder.html#StringBuilder-java.lang.String-). Since 4 < 16, append always takes O(1) time here.
StringBuilder's deleteCharAt method takes O(n) time in the worst case because of an array copy. But in your case you are removing only the last character, which takes O(1) time.
If you had used String instead of StringBuilder, concatenating or deleting a single character would take O(n) time.
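To make that last point concrete, here is a hypothetical sketch (not the OP's code; it reuses the map array from the question) of a String-based variant. Every prefix + letter allocates a new String and copies the whole prefix, so each step costs O(n) instead of O(1):

// Hypothetical String-based variant of update_result, for illustration only.
// `prefix + letters.charAt(i)` copies O(prefix.length()) characters on every call,
// so the total work grows to O(n * 4^n) rather than O(4^n).
private void update_result_with_string(String digits, String prefix,
                                        int curr_index, List<String> result) {
    if (curr_index >= digits.length()) {
        result.add(prefix);
        return;
    }
    String letters = map[digits.charAt(curr_index) - '0'];
    for (int i = 0; i < letters.length(); i++) {
        update_result_with_string(digits, prefix + letters.charAt(i),
                                  curr_index + 1, result);
    }
}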
Well, I don't know Java very well, but in Python the reason it is O(n * 4^n) is that the concatenation takes O(n). On YouTube there is a YouTuber who says it is because of the height of the tree (meaning n is the height of the tree), but I think he is wrong because that does not make sense (at least to me).
Note that O(3^m * 4^k * n) is also valid (m is the number of digits with 3 possibilities and k is the number of digits with 4 possibilities).
It's because the width of the tree is O(4^n) and the height of the tree is O(n). Watch this video for a better explanation than I can give: https://www.youtube.com/watch?v=0snEunUacZY

Why does Featuretools slow down when I increase the number of Dask workers?

I'm using an Amazon SageMaker Notebook that has 72 cores and 144 GB RAM, and I carried out 2 tests with a sample of the whole data to check if the Dask cluster was working.
The sample has 4500 rows and 735 columns from 5 different "assets" (I mean 147 columns for each asset). The code filters the columns and creates a feature matrix for each filtered DataFrame.
First, I initialized the cluster as follows, I received 72 workers, and the run took 17 minutes. (I assume I created 72 workers with one core each.)
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(processes=True, n_workers=72, threads_per_worker=72)
client = Client(cluster)  # client connected to the local cluster, referenced below

def main():
    import featuretools as ft
    list_columns = list(df_concat_02.columns)
    list_df_features = []
    from tqdm.notebook import tqdm
    for asset in tqdm(list_columns, total=len(list_columns)):
        dataframe = df_sma.filter(regex="^" + asset, axis=1).reset_index()
        es = ft.EntitySet()
        es = es.entity_from_dataframe(entity_id='MARKET', dataframe=dataframe,
                                      index='index',
                                      time_index='Date')
        fm, features = ft.dfs(entityset=es,
                              target_entity='MARKET',
                              trans_primitives=['divide_numeric'],
                              agg_primitives=[],
                              max_depth=1,
                              verbose=True,
                              dask_kwargs={'cluster': client.scheduler.address})
        list_df_features.append(fm)
    return list_df_features

if __name__ == "__main__":
    list_df = main()
Second, I initialized the cluster as follows, I received 9 workers, and the run took 3.5 minutes. (I assume I created 9 workers with 8 cores each.)
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(processes=True)
client = Client(cluster)  # client connected to the local cluster, referenced below

def main():
    import featuretools as ft
    list_columns = list(df_concat_02.columns)
    list_df_features = []
    from tqdm.notebook import tqdm
    for asset in tqdm(list_columns, total=len(list_columns)):
        dataframe = df_sma.filter(regex="^" + asset, axis=1).reset_index()
        es = ft.EntitySet()
        es = es.entity_from_dataframe(entity_id='MARKET', dataframe=dataframe,
                                      index='index',
                                      time_index='Date')
        fm, features = ft.dfs(entityset=es,
                              target_entity='MARKET',
                              trans_primitives=['divide_numeric'],
                              agg_primitives=[],
                              max_depth=1,
                              verbose=True,
                              dask_kwargs={'cluster': client.scheduler.address})
        list_df_features.append(fm)
    return list_df_features

if __name__ == "__main__":
    list_df = main()
For me it's mind-blowing, because I thought that 72 workers could carry out the work faster! Since I'm not a specialist in either Dask or Featuretools, I guess that I'm setting something wrong.
I would appreciate any kind of help and advice!
Thank you!
You are correctly setting dask_kwargs in DFS. I think the slowdown happens as a result of additional overhead and fewer cores per worker. The more workers there are, the more overhead there is from transmitting data. Additionally, 8 cores in 1 worker can be leveraged to make computations run faster than 1 core in each of 8 workers.

Strange slowdown when using struct and loop in Julia

I made a speed comparison between two functions, with and without using a struct, as below, and the performance difference is huge: 0.07899 vs 0.0011 [sec]. The strange thing is that the contents of idxset in test1() and test2() are exactly the same (1...10000), but the processing times of the loops over them are different. Note that the measurements were performed only for the loops.
Could you explain how to improve my code with struct and why this happens?
struct Data
    bool
end

function test1()
    N = 10^5
    data = Data(trues(N))
    idxset = findall(data.bool)
    s = 0.0
    @time for i in idxset
        s += i^2
    end
    return s
end

function test2()
    N = 10^5
    bool = trues(N)
    idxset = findall(bool)
    s = 0.0
    @time for i in idxset
        s += i^2
    end
    return s
end

test1()
test2()
struct Data
    bool
end
doesn't have any type information on bool, so the type of data.bool cannot be inferred, leading to uninferred types in your function and slow code. data.bool being uninferred probably makes idxset uninferred, which makes each i uninferred and slows down the arithmetic. Check this with @code_warntype. Fix this with:
struct Data
    bool::BitArray{1}
end

Efficient method for imposing (some cases of) periodic boundary conditions on floats?

Some cases of periodic boundary conditions (PBC) can be imposed very efficiently on integers by simply doing:
myWrappedWithinPeriodicBoundary = myUIntValue & mask
This works when the boundary is the half open range [0, upperBound), where the (exclusive) upperBound is 2^exp so that
mask = (1 << exp) - 1
For example:
let pbcUpperBoundExp = 2 // so the periodic boundary will be [0, 4)
let mask = (1 << pbcUpperBoundExp) - 1
for x in -7 ... 7 { print(x & mask, terminator: " ") }
(in Swift) will print:
1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Question: Is there any (roughly similar) efficient method for imposing (some cases of) PBCs on floating point-numbers (32 or 64-bit IEEE-754)?
There are several reasonable approaches:
fmod(x,1)
modf(x,&dummy) — has the advantage of knowing its divisor statically, but in my testing comes from libc.so.6 even with -ffast-math
x-floor(x) (suggested by Jens in a comment) — supports negative inputs directly
Manual bit-twiddling direct implementation
Manual bit-twiddling implementation of floor
The first two preserve the sign of their input; you can add 1 if it's negative.
The two bit manipulations are very similar: you identify which significand bits correspond to the integer portion, and mask them (for the direct implementation) or the rest (to implement floor) off. The direct implementation can be completed either with a floating-point division or with a shift to reassemble the double manually; the former is 28% faster even given hardware CLZ. The floor implementation can immediately reconstitute a double: floor never changes the exponent of its argument unless it returns 0. About 20 lines of C are required.
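As a rough sketch of the floor-style masking described above (the answer's implementation is in C; this Java version is hypothetical, handles only finite positive doubles, and is not the answer's exact code):

// Floor of a finite positive double by clearing the fractional significand bits.
// For x = 1.m * 2^e with 0 <= e < 52, the low (52 - e) significand bits hold the
// fraction; clearing them truncates toward zero, which is floor for positive x.
static double floorViaBits(double x) {
    long bits = Double.doubleToRawLongBits(x);
    int unbiasedExp = (int) ((bits >>> 52) & 0x7FF) - 1023;
    if (unbiasedExp < 0) {
        return 0.0;   // 0 < x < 1 (including subnormals): floor is 0
    }
    if (unbiasedExp >= 52) {
        return x;     // no fractional bits left; x is already an integer
    }
    long fracMask = (1L << (52 - unbiasedExp)) - 1;
    return Double.longBitsToDouble(bits & ~fracMask);  // clear the fractional bits
}

Note that, as the answer says, the exponent field is left untouched unless the result is 0, so the masked bits already form a valid double.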
The following timing is with double and gcc -O3, with timing loops over representative inputs into which the operative code was inlined.
fmod: 41.8 ns
modf: 19.6 ns
floor: 10.6 ns
With -ffast-math:
fmod: 26.2 ns
modf: 30.0 ns
floor: 21.9 ns
Bit manipulation:
direct: 18.0 ns
floor: 20.6 ns
The manual implementations are competitive, but the floor technique is the best. Oddly, two of the three library functions perform better without -ffast-math: that is, as a PLT function call than as an inlined builtin function.
I'm adding this answer to my own question since it describes the, at the time of writing, best solution I have found. It's in Swift 4.1 (should be straightforward to translate into C) and it has been tested in various use cases:
extension BinaryFloatingPoint {

    /// Returns the value after restricting it to the periodic boundary
    /// condition [0, 1).
    /// See https://forums.swift.org/t/why-no-fraction-in-floatingpoint/10337
    @_transparent
    func wrappedToUnitRange() -> Self {
        let fract = self - self.rounded(.down)
        // Have to clamp to just below 1 because very small negative values
        // will otherwise return an out of range result of 1.0.
        // Turns out this:
        if fract >= 1.0 { return Self(1).nextDown } else { return fract }
        // is faster than this:
        //return min(fract, Self(1).nextDown)
    }

    @_transparent
    func wrapped(to range: Range<Self>) -> Self {
        let measure = range.upperBound - range.lowerBound
        let recipMeasure = Self(1) / measure
        let scaled = (self - range.lowerBound) * recipMeasure
        return scaled.wrappedToUnitRange() * measure + range.lowerBound
    }

    @_transparent
    func wrappedIteratively(to range: Range<Self>) -> Self {
        var v = self
        let measure = range.upperBound - range.lowerBound
        while v >= range.upperBound { v = v - measure }
        while v < range.lowerBound { v = v + measure }
        return v
    }
}
On my MacBook Pro with a 2 GHz Intel Core i7,
a hundred million (probably inlined) calls to wrapped(to range:) on random (finite) Double values take 0.6 seconds, which is about 166 million calls per second (not multithreaded). Whether the range is statically known or not, or has bounds or a measure that is a power of two, etc., can make some difference, but not as much as one could perhaps have thought.
wrappedToUnitRange() takes about 0.2 seconds, meaning 500 million calls per second on my system.
Given the right scenario, wrappedIteratively(to range:) is as fast as wrappedToUnitRange().
The timings have been made by comparing a baseline test (which does not wrap some value, but still uses it to compute e.g. a simple xor checksum) to the same test where the value is wrapped. The differences in time between these are the times I have given for the wrapping calls.
I have used the Swift development toolchain 2018-02-21, compiling with -O -whole-module-optimization -static-stdlib -gnone. And care has been taken to make the tests relevant, i.e. preventing dead code removal, using truly random input of different distributions, etc. Writing the wrapping functions generically, like this extension on BinaryFloatingPoint, turned out to be optimized into code equivalent to separate specialized versions for e.g. Float and Double.
It would be interesting to see someone more skilled than me investigating this further (C or Swift or any other language doesn't matter).
EDIT:
For anyone interested, here are some versions for simd float2:
extension float2 {

    @_transparent
    func wrappedInUnitRange() -> float2 {
        return simd.fract(self)
    }

    @_transparent
    func wrappedToMinusOneToOne() -> float2 {
        let scaled = (self + float2(1, 1)) * float2(0.5, 0.5)
        let scaledFract = scaled - floor(scaled)
        let wrapped = simd_muladd(scaledFract, float2(2, 2), float2(-1, -1))
        // Note that we have to make sure the result is not out of bounds, like
        // simd fract does:
        let oneNextDown = Float(bitPattern:
            0b0_01111110_11111111111111111111111)
        let oneNextDownFloat2 = float2(oneNextDown, oneNextDown)
        return simd.min(wrapped, oneNextDownFloat2)
    }

    @_transparent
    func wrapped(toLowerBound lowerBound: float2,
                 upperBound: float2) -> float2
    {
        let measure = upperBound - lowerBound
        let recipMeasure = simd_precise_recip(measure)
        let scaled = (self - lowerBound) * recipMeasure
        let scaledFract = scaled - floor(scaled)
        // Note that we have to make sure the result is not out of bounds, like
        // simd fract does:
        let wrapped = simd_muladd(scaledFract, measure, lowerBound)
        let maxX = upperBound.x.nextDown // For some reason, this won't be
        let maxY = upperBound.y.nextDown // optimized even when upperBound is
        // statically known, and there is no similar simd function available.
        let maxValue = float2(maxX, maxY)
        return simd.min(wrapped, maxValue)
    }
}
I asked some related simd questions here which might be of interest.
EDIT2:
As can be seen in the above Swift Forums thread:
// Note that tiny negative values like:
let x: Float = -1e-08
// May produce results outside the [0, 1) range:
let wrapped = x - floor(x)
print(wrapped < 1.0) // false
// which may result in out-of-bounds table accesses
// in common usage, so it's probably better to use:
let correctlyWrapped = simd_fract(x)
print(correctlyWrapped < 1.0) // true
I have since updated the code to account for this.
