Dask losing workers over time - graph

Here is the MCVE to demonstrate losing workers over time. This is a follow-up to
Distributing graphs to across cluster nodes
The example is not quite minimal, but it does give an idea of our typical work patterns. The sleep is necessary to trigger the problem; it occurs in the full application because a large graph has to be generated from previous results.
When I run this on a cluster, I use dask-ssh to get 32 workers over 8 nodes:
dask-ssh --nprocs 4 --nthreads 1 --scheduler-port 8786 --log-directory `pwd` --hostfile hostfile.$JOBID &
sleep 10
It should run in less than about 10 minutes with the full set of workers. I follow the execution on the diagnostics screen. Under Events, I see the workers being added, but then I sometimes (not always) see a number of workers removed, usually leaving only those on the node hosting the scheduler.
""" Test to illustrate losing workers under dask/distributed.
This mimics the overall structure and workload of our processing.
Tim Cornwell 9 Sept 2017
realtimcornwell#gmail.com
"""
import numpy
from dask import delayed
from distributed import Client
# Make some randomly located points on 2D plane
def init_sparse(n, margin=0.1):
numpy.random.seed(8753193)
return numpy.array([numpy.random.uniform(margin, 1.0 - margin, n),
numpy.random.uniform(margin, 1.0 - margin, n)]).reshape([n, 2])
# Put the points onto a grid and FFT, skip to save time
def grid_data(sparse_data, shape, skip=100):
grid = numpy.zeros(shape, dtype='complex')
loc = numpy.round(shape * sparse_data).astype('int')
for i in range(0, sparse_data.shape[0], skip):
grid[loc[i,:]] = 1.0
return numpy.fft.fft(grid).real
# Accumulate all psfs into one psf
def accumulate(psf_list):
lpsf = 0.0 * psf_list[0]
for p in psf_list:
lpsf += p
return lpsf
if __name__ == '__main__':
import sys
import time
start=time.time()
# Process nchunks each of length len_chunk 2d points, making a psf of size shape
len_chunk = int(1e6)
nchunks = 16
shape=[512, 512]
skip = 100
# We pass in the scheduler from the invoking script
if len(sys.argv) > 1:
scheduler = sys.argv[1]
client = Client(scheduler)
else:
client = Client()
print("On initialisation", client)
sparse_graph = [delayed(init_sparse)(len_chunk) for i in range(nchunks)]
sparse_graph = client.compute(sparse_graph, sync=True)
print("After first sparse_graph", client)
xfr_graph = [delayed(grid_data)(s, shape=shape, skip=skip) for s in sparse_graph]
xfr = client.compute(xfr_graph, sync=True)
print("After xfr", client)
tsleep = 120.0
print("Sleeping now for %.1f seconds" % tsleep)
time.sleep(tsleep)
print("After sleep", client)
sparse_graph = [delayed(init_sparse)(len_chunk) for i in range(nchunks)]
# sparse_graph = client.compute(sparse_graph, sync=True)
xfr_graph = [delayed(grid_data)(s, shape=shape, skip=skip) for s in sparse_graph]
psf_graph = delayed(accumulate)(xfr_graph)
psf = client.compute(psf_graph, sync=True)
print("*** Successfully reached end in %.1f seconds ***" % (time.time() - start))
print(numpy.max(psf))
print("After psf", client)
client.shutdown()
exit()
Grep'ing a typical run for Client shows:
On initialisation <Client: scheduler='tcp://sand-8-17:8786' processes=16 cores=16>
After first sparse_graph <Client: scheduler='tcp://sand-8-17:8786' processes=16 cores=16>
After xfr <Client: scheduler='tcp://sand-8-17:8786' processes=16 cores=16>
After sleep <Client: scheduler='tcp://sand-8-17:8786' processes=4 cores=4>
After psf <Client: scheduler='tcp://sand-8-17:8786' processes=4 cores=4>
Thanks,
Tim

It's not entirely clear why this works, but it did. We were using dask-ssh but needed more control over the creation of the workers. Eventually we settled on:
scheduler=$(head -1 hostfile.$JOBID)
hostIndex=0
for host in `cat hostfile.$JOBID`; do
    echo "Working on $host ...."
    if [ "$hostIndex" = "0" ]; then
        echo "run dask-scheduler"
        ssh $host dask-scheduler --port=8786 &
        sleep 5
    fi
    echo "run dask-worker"
    ssh $host dask-worker --host ${host} --nprocs NUMBER_PROCS_PER_NODE \
        --nthreads NUMBER_THREADS \
        --memory-limit 0.25 --local-directory /tmp $scheduler:8786 &
    sleep 1
    hostIndex="1"
done
echo "Scheduler and workers now running"

Related

Why can't I handle 300 get responses with async?

As part of my homework project, I'm working with imdb.com pages.
For one task I need to make 320 GET requests and turn the responses into BeautifulSoup objects later on.
I'm trying to do that the async way, and so far I have this:
import asyncio
import time

import aiohttp


def get_tasks(session, url_links):
    tasks = []
    num = 1  # debugging purposes
    for url in url_links:
        tasks.append(session.get(url, headers={'Accept-Language': 'en', 'X_FORWARDED_FOR': '2.21.184.0'}, ssl=False))
        time.sleep(1)  # avoid 503 status_code
        print(f"Number of responses get_tasks: {num}")  # debugging purposes
        num += 1  # debugging purposes
    return tasks


# Getting response.texts
results = []


async def get_response_texts(url_links):
    async with aiohttp.ClientSession() as session:
        tasks = get_tasks(session, url_links)
        responses = await asyncio.gather(*tasks)
        t1 = time.perf_counter()
        num = 1
        for response in responses:
            results.append(await response.text())
            print(f"{num} responses processed")  # debugging purposes
            num += 1
        t2 = time.perf_counter()
        print(f'Asynchronous execution: Finished in {t2 - t1} seconds\n')


if __name__ == '__main__':
    links = [...]  # a list of URLs to films as strings
    asyncio.run(get_response_texts(links))
    print(len(results))
Here comes the problem: when I process 100 requests, things seem all right, but when I make 300, I get asyncio.exceptions.TimeoutError.
Why is that, and how can I avoid it and make 320 requests asynchronously?
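One likely factor, offered as a hedged sketch rather than a definitive diagnosis: the time.sleep(1) in get_tasks only delays creating the coroutine objects, not the requests themselves, which all start together inside asyncio.gather. With 300+ requests in flight against one server, some responses may not arrive before aiohttp's default total timeout (5 minutes), raising the TimeoutError. A common mitigation is to cap concurrency with a semaphore and set an explicit timeout; MAX_CONCURRENT and fetch_text below are illustrative names, not part of the original code:

import asyncio

import aiohttp

MAX_CONCURRENT = 20  # illustrative value; tune for the target server


async def fetch_text(session, semaphore, url):
    # The semaphore caps how many requests are in flight at once, so each
    # one finishes well before the client timeout.
    async with semaphore:
        async with session.get(url, headers={'Accept-Language': 'en'}, ssl=False) as response:
            return await response.text()


async def fetch_all(url_links):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    # Relax the default 5-minute total timeout; keep a per-read timeout instead.
    timeout = aiohttp.ClientTimeout(total=None, sock_read=60)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(fetch_text(session, semaphore, url) for url in url_links))


# texts = asyncio.run(fetch_all(links))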

Chainer: custom extension for early stopping based on time limit

I have a trainer that already has a stop trigger based on the total number of epochs:
trainer = training.Trainer(updater, (epochs, 'epoch'))
Now I would like to add a stopping condition based on the total elapsed time starting from some point in the code (which may be different from the elapsed_time stored inside the trainer):
global_start = time.time()
# Some long preprocessing
expensive_processing()
# Trainer starts here and its internal elapsed time
# does not take into account the preprocessing
trainer.run()
What I tried was to define an extension as follows:
trainer.global_start = global_start
trainer.global_elapsed_time = 0.0


def stop_training():
    return True


def check_time_limit(my_trainer):
    my_trainer.global_elapsed_time = time.time() - my_trainer.global_start
    # If reach the time limit, then set the stop_trigger as a callable
    # that is always True
    if my_trainer.global_elapsed_time > args.time_limit * 3600:
        my_trainer.stop_trigger = stop_training


# Add the extension to trainer
trainer.extend(check_time_limit, trigger=(1000, 'iteration'))
Running the code, I get a "The previous value of epoch_detail is not saved" error. What did I do wrong?
Thank you so much in advance for your help!

How can I calculate the average packet inter-arrival time in Wireshark from a .pcapng file?

I captured some packets in Wireshark, and I have to analyse them.
Of course, I've tried the analysis tools under Wireshark's Statistics menu, but I cannot find an effective way to get the average packet inter-arrival time.
Is there an efficient way to find it?
I just wrote a pdml2flow plugin for this: pdml2flow-frame-inter-arrival-time.
Currently it is not published on PyPI, so you need to git clone && python setup.py install the plugin yourself. But once you have done that you can:
Print inter arrival times from your capture dump.capture:
$ tshark -r dump.capture -Tpdml | pdml2flow +frame-inter-arrival-time
{"inter_arrival_times": [7.152557373046875e-07, 0.0, 0.1733696460723877], "frames": null}
{"inter_arrival_times": [3.7670135498046875e-05, 2.3126602172851562e-05], "frames": null}
{"inter_arrival_times": [0.16418147087097168, 0.0007672309875488281, 0.16009950637817383, 0.00016069412231445312, 0.0007240772247314453, 0.15914177894592285, 3.814697265625e-05, 5.245208740234375e-06], "frames": null}
{"inter_arrival_times": [0.1608715057373047, 0.15995335578918457, 2.384185791015625e-07, 2.384185791015625e-07, 2.384185791015625e-07, 0.15888381004333496], "frames": null}
{"inter_arrival_times": [0.16829872131347656, 0.0007762908935546875, 0.14913678169250488, 0.000125885009765625, 0.000736236572265625, 10.19379997253418], "frames": null}
Print inter arrival times with a different flow aggregation. For example by interface, if you captured from multiple interfaces:
$ tshark -r dump.capture -Tpdml | pdml2flow -f frame.interface_name +frame-inter-arrival-time
{"inter_arrival_times": [7.152557373046875e-07, 0.0, 0.00018739700317382812, 3.7670135498046875e-05, 2.3126602172851562e-05, 0.008971691131591797, 0.16414976119995117, 4.76837158203125e-07, 3.123283386230469e-05, 0.0007672309875488281, 0.16007304191589355, 2.6464462280273438e-05, 0.00016069412231445312, 0.0007240772247314453, 0.1590421199798584, 2.384185791015625e-07, 2.384185791015625e-07, 2.384185791015625e-07, 9.894371032714844e-05, 3.814697265625e-05, 5.245208740234375e-06, 0.0006232261657714844, 0.15811824798583984, 0.010167837142944336, 1.2636184692382812e-05, 0.0007762908935546875, 0.14911913871765137, 1.7642974853515625e-05, 0.000125885009765625, 0.000736236572265625, 0.16014313697814941, 0.035120248794555664, 0.2039034366607666, 1.907348632, ... ] }
Print inter arrival times without flow aggregation:
$ tshark -r dump.capture -Tpdml | pdml2flow +frame-inter-arrival-time --no_flow
0.0
7.152557373046875e-07
0.0
0.00018739700317382812
3.7670135498046875e-05
2.3126602172851562e-05
0.008971691131591797
0.16414976119995117
4.76837158203125e-07
3.123283386230469e-05
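To get the average that the question asks about, the --no_flow output (one inter-arrival time per line) can be piped into a short script; a minimal sketch (the file name average.py is only illustrative):

#!/usr/bin/env python3
# Usage:
#   tshark -r dump.capture -Tpdml | pdml2flow +frame-inter-arrival-time --no_flow | ./average.py
import sys

# Read one floating-point inter-arrival time per line and report the mean.
times = [float(line) for line in sys.stdin if line.strip()]
if times:
    print("average inter-arrival time: {:.6f} s".format(sum(times) / len(times)))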
The plugin logic is implemented in:
plugin/plugin.py [a305598]: Calculates the frame inter arrival time
# vim: set fenc=utf8 ts=4 sw=4 et :
from pdml2flow.plugin import Plugin2

from argparse import ArgumentParser
from json import dumps

argparser = ArgumentParser('Calculate inter arrival times of frames in a flow or on an interface')

DEFAULT_NO_FLOW = False
argparser.add_argument(
    '--no_flow',
    action = 'store_true',
    dest = 'no_flow',
    default = DEFAULT_NO_FLOW,
    help = 'Calculate inter arrival time to the previous frame on the interface, not in the flow [default: {}]'.format(
        DEFAULT_NO_FLOW
    )
)

PRINT_FRAMES = False
argparser.add_argument(
    '--frames',
    action = 'store_true',
    dest = 'frames',
    default = PRINT_FRAMES,
    help = 'Print the frames alongside the inter arrival time [default: {}]'.format(
        PRINT_FRAMES,
    )
)


def _get_frame_time(x):
    return x['frame']['time_epoch']['raw']


class Plugin(Plugin2):

    @staticmethod
    def help():
        """Return a help string."""
        return argparser.format_help()

    def __init__(self, *args):
        """Called once during startup."""
        self._args = argparser.parse_args(args)
        self._last_frame_time = None

    def flow_end(self, flow):
        """Calculate and print the frame inter-arrival time."""
        if not self._args.no_flow:
            inter_arrival_times = []
            prev_t = None
            for t in _get_frame_time(flow.frames):
                if prev_t:
                    inter_arrival_times.append(
                        t - prev_t
                    )
                prev_t = t
            print(
                dumps({
                    'inter_arrival_times': inter_arrival_times,
                    'frames': flow.frames if self._args.frames else None
                })
            )

    def frame_new(self, frame, flow):
        """Calculate and print the frame inter-arrival time."""
        if self._args.no_flow:
            frame_time_now = _get_frame_time(frame)[0]
            if not self._last_frame_time:
                self._last_frame_time = frame_time_now
            print(
                frame_time_now - self._last_frame_time
            )
            self._last_frame_time = frame_time_now


if __name__ == '__main__':
    print(Plugin.help())
I hope this helps. Feedback / feature or change requests are always welcome. :)

How to shrink consecutive class B IP networks into a bigger one

After reading List of IP Space used by Facebook:
The "real" list is the last answer, but I wonder how Igy (whose answer is marked as the solution) managed to shrink the list a lot by merging consecutive networks into bigger ones (decreasing the network mask accordingly for each run of consecutive networks). Is there a tool, or is it only done manually?
This is a HUGE improvement for a firewall, where the number of rules counts (the shorter the better).
A simple solution is to use netaddr:
import netaddr

ips = netaddr.IPSet()
for addr in all_addrs:  # all_addrs: any iterable of CIDR strings
    ips.add(addr)

ips.compact()
for cidr in ips.iter_cidrs():
    print(str(cidr))
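As a side note, Python's standard library can do the same aggregation without third-party packages via ipaddress.collapse_addresses (Python 3.3+); a minimal sketch using the example networks shown further below:

import ipaddress

# Overlapping and adjacent networks collapse into the smallest covering set.
networks = [ipaddress.ip_network(n) for n in (
    '192.168.0.0/24', '192.168.0.128/25', '192.168.0.248/30',
    '192.168.1.0/24', '192.168.100.0/24',
)]
for cidr in ipaddress.collapse_addresses(networks):
    print(cidr)  # 192.168.0.0/23 and 192.168.100.0/24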
The following Python 3 script can do what you want:
#!/usr/bin/python3
# -*- coding: utf-8 -*-

from functools import reduce
import sys


def add_mark(regions, mark, k):
    r = regions[:]
    i = 0
    j = len(r)
    while i < j:
        m = (i + j) // 2
        if mark < r[m][0]:
            j = m
        elif r[m][0] < mark:
            i = m + 1
        else:
            r[m][1] += k
            if r[m][1] == 0:
                del r[m]
            return r
    r.insert(i, [mark, k])
    return r


def add_region(regions, start, end):
    return add_mark(add_mark(regions, start, 1), end, -1)


def parse_network(n):
    pos = n.find('/')
    return ip_to_int(n[:pos]), 2**(32 - int(n[pos+1:]))


def ip_to_int(ip):
    return reduce(lambda a, b: 256*a + b, map(int, ip.split('.')))


def print_summary(r):
    if len(r) == 0:
        return
    start = None
    level = 0
    for item in r:
        level += item[1]
        if start is None:
            start = item[0]
        elif level == 0:
            summarize_networks(start, item[0])
            start = None


def summarize_networks(start, end):
    while start < end:
        mask = 32
        amount = 1
        while start % amount == 0 and start + amount - 1 < end:
            mask -= 1
            amount *= 2
        mask += 1
        amount //= 2
        print('{0}/{1}'.format(int_to_ip(start), mask))
        start += amount


def int_to_ip(n):
    n, o4 = divmod(n, 256)
    n, o3 = divmod(n, 256)
    o1, o2 = divmod(n, 256)
    return '.'.join(map(str, [o1, o2, o3, o4]))


def main():
    regions = []
    while True:
        line = sys.stdin.readline()
        if len(line) == 0:
            break
        for item in line.strip().split():
            start, amount = parse_network(item)
            regions = add_region(regions, start, start + amount)
    print_summary(regions)


if __name__ == "__main__":
    main()
Example:
./unique_networks.py <<EOF
192.168.0.0/24
192.168.0.128/25
192.168.0.248/30
192.168.1.0/24
192.168.100.0/24
EOF
192.168.0.0/23
192.168.100.0/24
For Facebook's list of
204.15.20.0/22
69.63.176.0/20
66.220.144.0/20
66.220.144.0/21
69.63.184.0/21
69.63.176.0/21
74.119.76.0/22
69.171.255.0/24
173.252.64.0/18
69.171.224.0/19
69.171.224.0/20
103.4.96.0/22
69.63.176.0/24
173.252.64.0/19
173.252.70.0/24
31.13.64.0/18
31.13.24.0/21
66.220.152.0/21
66.220.159.0/24
69.171.239.0/24
69.171.240.0/20
31.13.64.0/19
31.13.64.0/24
31.13.65.0/24
31.13.67.0/24
31.13.68.0/24
31.13.69.0/24
31.13.70.0/24
31.13.71.0/24
31.13.72.0/24
31.13.73.0/24
31.13.74.0/24
31.13.75.0/24
31.13.76.0/24
31.13.77.0/24
31.13.96.0/19
31.13.66.0/24
173.252.96.0/19
69.63.178.0/24
31.13.78.0/24
31.13.79.0/24
31.13.80.0/24
31.13.82.0/24
31.13.83.0/24
31.13.84.0/24
31.13.85.0/24
31.13.86.0/24
31.13.87.0/24
31.13.88.0/24
31.13.89.0/24
31.13.90.0/24
31.13.91.0/24
31.13.92.0/24
31.13.93.0/24
31.13.94.0/24
31.13.95.0/24
69.171.253.0/24
69.63.186.0/24
31.13.81.0/24
179.60.192.0/22
179.60.192.0/24
179.60.193.0/24
179.60.194.0/24
179.60.195.0/24
185.60.216.0/22
45.64.40.0/22
185.60.216.0/24
185.60.217.0/24
185.60.218.0/24
185.60.219.0/24
129.134.0.0/16
157.240.0.0/16
204.15.20.0/22
69.63.176.0/20
69.63.176.0/21
69.63.184.0/21
66.220.144.0/20
69.63.176.0/20
networks, this script summarizes them to:
31.13.24.0/21
31.13.64.0/18
45.64.40.0/22
66.220.144.0/20
69.63.176.0/20
69.171.224.0/19
74.119.76.0/22
103.4.96.0/22
129.134.0.0/16
157.240.0.0/16
173.252.64.0/18
179.60.192.0/22
185.60.216.0/22
204.15.20.0/22
With the help of sds' answer (netaddr is beautiful, it even sorts the output) I came up with the following to convert the Facebook IP ranges to an ipset:
ipset create facebook4 hash:net comment
whois -h whois.radb.net -- '-i origin AS32934' | awk '/^route:/ {print $2}' | ./netaddr-compact.py | sed 's/^/ipset add facebook4 /' | sh -x
ipset create facebook6 hash:net family inet6 comment
whois -h whois.radb.net -- '-i origin AS32934' | awk '/^route6:/ {print $2}' | ./netaddr-compact.py | sed 's/^/ipset add facebook6 /' | sh -x
ipset create facebook list:set comment
ipset add facebook facebook4
ipset add facebook facebook6
The netaddr-compact.py file is simple:
#!/usr/bin/env python3
import netaddr  # on ubuntu: apt install python3-netaddr
import fileinput

ips = netaddr.IPSet()
for addr in fileinput.input():
    ips.add(addr)

ips.compact()
for cidr in ips.iter_cidrs():
    print(str(cidr))

Programming Logic - Splitting up Tasks Between Threads

Let's say you want 5 threads to process data simultaneously. Also assume you have 89 tasks to process.
Off the bat you know 89 / 5 = 17 with a remainder of 4. The best way to split up the tasks would be to have 4 (the remainder) threads process 18 (17 + 1) tasks each and then have 1 (threads minus remainder) thread process 17.
This will eliminate the remainder. Just to verify:
Thread 1: Tasks 1-18 (18 tasks)
Thread 2: Tasks 19-36 (18 tasks)
Thread 3: Tasks 37-54 (18 tasks)
Thread 4: Tasks 55-72 (18 tasks)
Thread 5: Tasks 73-89 (17 tasks)
Giving you a total of 89 tasks completed.
I need a way of getting the start and end of each thread's range mathematically/programmatically, where the following should print exactly what I have listed above:
$NumTasks = 89
$NumThreads = 5

$Remainder = $NumTasks % $NumThreads
$DefaultNumTasksAssigned = floor($NumTasks / $NumThreads)

For $i = 1 To $NumThreads
    if $i <= $Remainder Then
        $NumTasksAssigned = $DefaultNumTasksAssigned + 1
    else
        $NumTasksAssigned = $DefaultNumTasksAssigned
    endif
    $Start = ??????????
    $End = ??????????
    print Thread $i: Tasks $Start-$End ($NumTasksAssigned tasks)
Next
This should also work for any number of $NumTasks.
Note: Please stick to answering the math at hand and avoid suggesting or assuming the situation.
Why? Rather than predetermining the scheduling order, stick all of the tasks on a queue, and then have each thread pull them off one by one when it's ready. That way your tasks will basically run "as fast as possible".
If you pre-allocate, then one thread may be doing a particularly long bit of processing and blocking the running of all the tasks stuck behind it. Using the queue, as each task finishes and a thread frees up, it grabs the next task and keeps going.
Think of it like a bank with one line per teller versus one line and a lot of tellers. In the former, you might get stuck behind the person depositing coins and counting them out one by one; in the latter, you get to the next available teller while Mr. PocketChange counts away.
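For illustration, a minimal Python sketch of this queue-based approach (worker and process are just illustrative names, and process stands in for the real work):

import queue
import threading

NUM_THREADS = 5
NUM_TASKS = 89

# All tasks go onto one shared queue up front.
task_queue = queue.Queue()
for task_id in range(1, NUM_TASKS + 1):
    task_queue.put(task_id)


def process(task_id):
    print(f"processing task {task_id}")


def worker():
    while True:
        try:
            task_id = task_queue.get_nowait()  # grab the next free task
        except queue.Empty:
            return  # nothing left to do
        process(task_id)
        task_queue.task_done()


threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()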
I second Will Hartung's remark. You may just feed them one task at a time (or a few tasks at a time, depending on whether there's much overhead, i.e. whether individual tasks typically complete very fast relative to the cost of starting/recycling threads). Your subsequent comments effectively explain that your "threads" carry a heavy creation cost, hence your desire to feed them once with as much work as possible rather than wasting time creating a new "thread" for each small batch of work.
Anyway... on to the math question...
If you'd like to assign tasks just once, the following formula, plugged in lieu of the ?????????? in your logic, should do the trick:
$Start = 1
         + (($i - 1) * ($DefaultNumTasksAssigned + 1))
         - max(0, $i - 1 - $Remainder)
$End = $Start + $NumTasksAssigned - 1
The formula is explained as follows:
The 1 is there because your display/logic is one-based, not zero-based.
The second term adds ($DefaultNumTasksAssigned + 1) for each preceding thread, as if every one of them had received an extra task.
The third term corrects for that: it subtracts one task for each preceding thread that did not actually receive an extra task.
It is 0 for the first ($Remainder + 1) threads
and then grows by one per thread.
The formula for $End is easier; the only trick is the minus 1, which is there because the Start and End values are inclusive (so, for example, between 1 and 19 there are 19 tasks, not 18).
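As a cross-check, the closed form can be verified against the sample outputs shown further below with a small helper (task_range is just an illustrative name):

import math


def task_range(i, num_tasks, num_threads):
    """Start/end task numbers (inclusive, one-based) for one-based thread index i."""
    remainder = num_tasks % num_threads
    default = math.floor(num_tasks / num_threads)
    # Same closed form as above: assume every preceding thread took an extra
    # task, then subtract one for each preceding thread that did not.
    start = 1 + (i - 1) * (default + 1) - max(0, i - 1 - remainder)
    num_assigned = default + 1 if i <= remainder else default
    return start, start + num_assigned - 1


# task_range(5, 89, 5) == (73, 89); task_range(4, 11, 5) == (8, 9); task_range(11, 89, 11) == (82, 89)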
The following slightly modified piece of logic should also work; it avoids the "fancy" formula by keeping a running tab of the $Start variable rather than recomputing it each time.
$NumTasks = 89
$NumThreads = 5

$Remainder = $NumTasks % $NumThreads
$DefaultNumTasksAssigned = floor($NumTasks / $NumThreads)

$Start = 1
For $i = 1 To $NumThreads
    if $i <= $Remainder Then   // fixed here! need <= because $i is one-based
        $NumTasksAssigned = $DefaultNumTasksAssigned + 1
    else
        $NumTasksAssigned = $DefaultNumTasksAssigned
    endif
    $End = $Start + $NumTasksAssigned - 1
    print Thread $i: Tasks $Start-$End ($NumTasksAssigned tasks)
    $Start = $Start + $NumTasksAssigned
Next
Here's a Python transcription of the above
>>> def ShowWorkAllocation(NumTasks, NumThreads):
...     Remainder = NumTasks % NumThreads
...     DefaultNumTasksAssigned = math.floor(NumTasks / NumThreads)
...     Start = 1
...     for i in range(1, NumThreads + 1):
...         if i <= Remainder:
...             NumTasksAssigned = DefaultNumTasksAssigned + 1
...         else:
...             NumTasksAssigned = DefaultNumTasksAssigned
...         End = Start + NumTasksAssigned - 1
...         print("Thread ", i, ": Tasks ", Start, "-", End, "(", NumTasksAssigned, ")")
...         Start = Start + NumTasksAssigned
...
>>>
>>> ShowWorkAllocation(89, 5)
Thread 1 : Tasks 1 - 18 ( 18 )
Thread 2 : Tasks 19 - 36 ( 18 )
Thread 3 : Tasks 37 - 54 ( 18 )
Thread 4 : Tasks 55 - 72 ( 18 )
Thread 5 : Tasks 73 - 89 ( 17 )
>>> ShowWorkAllocation(11, 5)
Thread 1 : Tasks 1 - 3 ( 3 )
Thread 2 : Tasks 4 - 5 ( 2 )
Thread 3 : Tasks 6 - 7 ( 2 )
Thread 4 : Tasks 8 - 9 ( 2 )
Thread 5 : Tasks 10 - 11 ( 2 )
>>>
>>> ShowWorkAllocation(89, 11)
Thread 1 : Tasks 1 - 9 ( 9 )
Thread 2 : Tasks 10 - 17 ( 8 )
Thread 3 : Tasks 18 - 25 ( 8 )
Thread 4 : Tasks 26 - 33 ( 8 )
Thread 5 : Tasks 34 - 41 ( 8 )
Thread 6 : Tasks 42 - 49 ( 8 )
Thread 7 : Tasks 50 - 57 ( 8 )
Thread 8 : Tasks 58 - 65 ( 8 )
Thread 9 : Tasks 66 - 73 ( 8 )
Thread 10 : Tasks 74 - 81 ( 8 )
Thread 11 : Tasks 82 - 89 ( 8 )
>>>
I think you've solved the wrong half of your problem.
It's going to be virtually impossible to precisely determine the time it will take to complete all your tasks, unless all of the following are true:
your tasks are 100% CPU-bound: that is, they use 100% CPU while running and don't need to do any I/O
none of your tasks have to synchronize with any of your other tasks in any way
you have exactly as many threads as you have CPUs
the computer that is running these tasks is not performing any other interesting tasks at the same time
In practice, most of the time, your tasks are I/O-bound rather than CPU-bound: that is, you are waiting for some external resource such as reading from a file, fetching from a database, or communicating with a remote computer. In that case, you only make things worse by adding more threads, because they're all contending for the same scarce resource.
Finally, unless you have some really weird hardware, it's unlikely you can actually have exactly five threads running simultaneously. (Usually processor configurations come in multiples of at least two.) Usually the sweet spot is at about 1 thread per CPU if your tasks are very CPU-bound, about 2 threads per CPU if the tasks spend half their time being CPU-bound and half their time doing IO, etc.
tl;dr: We need to know a lot more about what your tasks and hardware look like before we can advise you on this question.
