Interrupt ZStream mapMPar processing - zio

I have the following code which, because of Excel's max row limitation, is restricted to ~1 million rows:
ZStream.unwrap(generateStreamData).mapMPar(32) { m =>
  streamDataToCsvExcel
}
All fairly straightforward, and it works perfectly. I keep track of the number of rows streamed and then stop writing data. However, I also want to interrupt all the child fibers spawned by mapMPar, something like this:
ZStream.unwrap(generateStreamData).interruptWhen(effect.true).mapMPar(32) { m =>
  streamDataToCsvExcel
}
Unfortunately the process is interrupted immediately here. I'm probably missing something obvious...
Editing the post as it needs some clarity.
My stream of data is generated by an expensive process in which data is pulled from a remote server (that data is itself calculated by an expensive process), using n fibers.
I then process the stream and stream the result out to the client.
Once the processed row count reaches ~1 million, I need to stop pulling data from the remote server (i.e. interrupt all the fibers) and end the process.

Here's what I can come up with after your clarification. The ZIO 1.x version is a bit uglier because of the lack of .dropRight.
Basically, we can use takeUntilM to count the size of the elements we've received and stop once we reach the maximum size (and then use .dropRight, or the additional filter, to discard the last element that would take us over the limit).
This ensures both that:
- you only run streamDataToCsvExcel up to the last possible message before hitting the size limit, and
- because streams are lazy, expensiveQuery only runs for as many messages as fit within the limit (or N+1 if the last value is discarded because it would go over the limit).
import zio._
import zio.stream._

object Main extends zio.App {
  override def run(args: List[String]): URIO[zio.ZEnv, ExitCode] = {
    val expensiveQuery = ZIO.succeed(Chunk(1, 2))
    val generateStreamData = ZIO.succeed(ZStream.repeatEffect(expensiveQuery))
    def streamDataToCsvExcel = ZIO.unit

    def count(ref: Ref[Int], size: Int): UIO[Boolean] =
      ref.updateAndGet(_ + size).map(_ > 10)

    for {
      counter <- Ref.make(0)
      _ <- ZStream
             .unwrap(generateStreamData)
             .takeUntilM(next => count(counter, next.size)) // Count size of messages and stop when it's reached
             .filterM(_ => counter.get.map(_ <= 10))        // Filter last message from `takeUntilM`. Ideally should be .dropRight(1) with ZIO 2
             .mapMPar(32)(_ => streamDataToCsvExcel)
             .runDrain
    } yield ExitCode.success
  }
}
If relying on the laziness of streams doesn't work for your use case, you can trigger an interrupt of some sort from the takeUntilM condition.
For example, you could update the count function to:
def count(ref: Ref[Int], size: Int): UIO[Boolean] =
  ref.updateAndGet(_ + size).map(_ > 10)
    .tapSome { case true => someFiber.interrupt }
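If you do want to go the interruptWhen route from the original attempt, here is a minimal sketch (assuming ZIO 1.x and reusing the toy expensiveQuery / streamDataToCsvExcel stand-ins from above) of one way to do it: complete a Promise from the count condition, so the interruption fires only once the limit is reached rather than immediately. How far the interrupt propagates through mapMPar depends on your real pipeline, so treat this as a starting point rather than a drop-in solution.

import zio._
import zio.stream._

object InterruptOnLimit extends zio.App {
  override def run(args: List[String]): URIO[zio.ZEnv, ExitCode] = {
    val expensiveQuery       = ZIO.succeed(Chunk(1, 2)) // stand-in for the remote pull
    def streamDataToCsvExcel = ZIO.unit                 // stand-in for the row writer

    // Count the rows seen so far; once over the limit, complete the promise
    // (which drives interruptWhen) and tell takeUntilM to stop.
    def count(ref: Ref[Int], stop: Promise[Nothing, Unit], size: Int): UIO[Boolean] =
      ref.updateAndGet(_ + size).flatMap { total =>
        ZIO.when(total > 10)(stop.succeed(()).unit).as(total > 10)
      }

    for {
      counter <- Ref.make(0)
      stop    <- Promise.make[Nothing, Unit]
      _       <- ZStream
                   .repeatEffect(expensiveQuery)
                   .interruptWhen(stop) // cuts off the expensive upstream pull once the promise completes
                   .takeUntilM(chunk => count(counter, stop, chunk.size))
                   .mapMPar(32)(_ => streamDataToCsvExcel)
                   .runDrain
    } yield ExitCode.success
  }
}

Completing the Promise inside the predicate keeps the stopping logic in one place: takeUntilM still ends the stream gracefully at the limit, while the promise-driven interruptWhen makes sure nothing upstream keeps pulling in the meantime.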

Related

How to map from a stream the sum of a Duration field in Java?

I am trying to create a map that holds an activity and the total duration of that activity, knowing that the same activity appears multiple times with different durations.
Normally, I would have solved it like this:
Map<String, Duration> result2 = new HashMap<String, Duration>();
for (MonitoredData m : lista) {
    if (result2.containsKey(m.getActivity()))
        result2.replace(m.getActivity(), result2.get(m.getActivity()).plus(m.getDuration()));
    else
        result2.put(m.getActivity(), m.getDuration());
}
Now I am trying to do this with a stream, but I can't figure out how to put the sum in there:
Function<Duration, Duration> totalDuration = x -> x.plus(x);
Map<String, Duration> result2 = lista.stream().collect(
    Collectors.groupingBy(MonitoredData::getActivity,
        Collectors.groupingBy(totalDuration.apply(), Collectors.counting()))
);
I tried in various ways to group them, to map them directly, or to sum them directly in the brackets, but I'm stuck.
Use the 3-argument version of the toMap collector:
import static java.util.stream.Collectors.toMap;

Map<String, Duration> result = lista.stream()
    .collect(toMap(MonitoredData::getActivity, MonitoredData::getDuration, Duration::plus));
Also, note that the Map interface got some nice additions in Java 8. One of them is merge. With that, even your iterative for loop can be rewritten to be much cleaner:
for (MonitoredData m : lista) {
    result.merge(m.getActivity(), m.getDuration(), Duration::plus);
}

How do I limit the number of concurrent processes spawned by Proc::Async in Perl 6?

I want to process a list of files in a subtask in my script, and I'm using Proc::Async to spawn the subprocesses doing the work. The downside is that if I have a large list of files to process, it will spawn many subprocesses. How do I limit the number of concurrent subprocesses that Proc::Async spawns?
You can explicitly limit the number of Proc::Async processes using this react block technique which Jonathan Worthington demonstrated in his concurrency/parallelism/asynchrony talk at the 2019 German Perl Workshop (see slide 39, for example). I'm using the Linux command echo N as my "external process" in the code below.
#!/bin/env perl6
my @items = <foo bar baz>;

for @items -> $item {
    start { say "Planning on processing $item" }
}

# Run 2 processes at a time
my $degree = 2;

react {
    # Start $degree processes at first
    run-one-process for 1..$degree;

    # Run one, run-one again when it ends, thus maintaining $degree active processes at a time
    sub run-one-process {
        my $item = @items.shift // return;
        my $proc = Proc::Async.new('echo', "processing $item");
        my @output;

        # Capture output
        whenever $proc.stdout.lines { push @output, $_; }

        # Print all the output, then start the next process
        whenever $proc.start {
            @output.join("\n").say;
            run-one-process
        }
    }
}
Old Answer:
Based on Jonathan Worthington's talk Parallelism, Concurrency, and Asynchrony in Perl 6 (video, slides), this sounds most like parallelism (i.e. choosing to do multiple things at once; see slide 18). Asynchrony is reacting to things in the future, the timing of which we cannot control; see slides 39 and 40. As @raiph pointed out in his comment, you can have one, the other, or both.
If you care about the order of results, then use hyper, but if the order isn't important, then use race.
In this example, adapted from Jonathan Worthington's slides, you build a pipeline of steps in which data is processed in batches of 32 filenames using 4 workers:
sub MAIN($data-dir) {
    my $filenames = dir($data-dir).race(batch => 32, degree => 4);
    my $data      = $filenames.map(&slurp);
    my $parsed    = $data.map(&parse-climate-data);
    my $european  = $parsed.grep(*.continent eq 'Europe');
    my $max       = $european.max(by => *.average-temp);
    say "$max.place() is the hottest!";
}

Filtering tab completion in input task implementation

I'm currently implementing an SBT plugin for Gatling.
One of its features will be to open the last generated report in a new browser tab from SBT.
As each run can have a different "simulation ID" (basically a simple string), I'd like to offer tab completion on simulation ids.
An example:
Running the Gatling SBT plugin will produce several folders (named from the simulation ID plus the date of report generation) in target/gatling, for example mysim-20140204234534, myothersim-20140203124534 and yetanothersim-20140204234534.
Let's call the task lastReport.
If someone starts typing lastReport my, I'd like to filter the tab completion to only suggest mysim and myothersim.
Getting the simulation ID is a breeze, but how can I help the parser and filter out suggestions so that it only suggests an existing simulation ID?
To sum up, I'd like to do what testOnly does, in a way: I only want to suggest things that make sense in my context.
Thanks in advance for your answers,
Pierre
Edit: As I got a bit stuck after my latest tries, here is the code of my inputTask, in its current state:
package io.gatling.sbt

import sbt._
import sbt.complete.{ DefaultParsers, Parser }

import io.gatling.sbt.Utils._

object GatlingTasks {

  val lastReport = inputKey[Unit]("Open last report in browser")

  val allSimulationIds = taskKey[Set[String]]("List of simulation ids found in reports folder")

  val allReports = taskKey[List[Report]]("List of all reports by simulation id and timestamp")

  def findAllReports(reportsFolder: File): List[Report] = {
    val allDirectories = (reportsFolder ** DirectoryFilter.&&(new PatternFilter(reportFolderRegex.pattern))).get
    allDirectories.map(file => (file, reportFolderRegex.findFirstMatchIn(file.getPath).get)).map {
      case (file, regexMatch) => Report(file, regexMatch.group(1), regexMatch.group(2))
    }.toList
  }

  def findAllSimulationIds(allReports: Seq[Report]): Set[String] =
    allReports.map(_.simulationId).distinct.toSet

  def openLastReport(allReports: List[Report], allSimulationIds: Set[String]): Unit = {
    def simulationIdParser(allSimulationIds: Set[String]): Parser[Option[String]] =
      DefaultParsers.ID.examples(allSimulationIds, check = true).?

    def filterReportsIfSimulationIdSelected(allReports: List[Report], simulationId: Option[String]): List[Report] =
      simulationId match {
        case Some(id) => allReports.filter(_.simulationId == id)
        case None     => allReports
      }

    Def.inputTaskDyn {
      val selectedSimulationId = simulationIdParser(allSimulationIds).parsed
      val filteredReports = filterReportsIfSimulationIdSelected(allReports, selectedSimulationId)
      val reportsSortedByDate = filteredReports.sorted.map(_.path)
      Def.task(reportsSortedByDate.headOption.foreach(file => openInBrowser((file / "index.html").toURI)))
    }
  }
}
Of course, openLastReport is called using the results of the allReports and allSimulationIds tasks.
I think I'm close to a functioning input task but I'm still missing something...
Def.inputTaskDyn returns a value of type InputTask[T] and doesn't perform any side effects. The result needs to be bound to an InputKey, like lastReport. The return type of openLastReport is Unit, which means that openLastReport will construct a value that will be discarded, effectively doing nothing useful. Instead, have:
def openLastReport(...): InputTask[...] = ...
lastReport := openLastReport(...).evaluated
(Or, the implementation of openLastReport can be inlined into the right hand side of :=)
You probably don't need inputTaskDyn, but just inputTask. You only need inputTaskDyn if you need to return a task. Otherwise, use inputTask and drop the Def.task.
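As a concrete illustration of that last point, here is a minimal sketch of the non-dynamic shape, keeping the names from the question (Report, openInBrowser, and the two helper defs simulationIdParser and filterReportsIfSimulationIdSelected, assumed to be in scope); how the allReports and allSimulationIds values are supplied at the call site is elided, as in the snippet above:

def openLastReport(allReports: List[Report], allSimulationIds: Set[String]): Def.Initialize[InputTask[Unit]] =
  Def.inputTask {
    // Parse an optional simulation id, offering the known ids for tab completion
    val selectedSimulationId = simulationIdParser(allSimulationIds).parsed
    val filteredReports = filterReportsIfSimulationIdSelected(allReports, selectedSimulationId)
    // Open the most recent matching report, if there is one
    filteredReports.sorted.map(_.path).headOption
      .foreach(file => openInBrowser((file / "index.html").toURI))
  }

// bound to the key as shown above:
// lastReport := openLastReport(...).evaluated

Because the body no longer returns a task, there is no inner Def.task and no need for inputTaskDyn; Def.inputTask yields an Initialize[InputTask[Unit]] that .evaluated can bind to lastReport.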

Geb: Waiting/sleeping between tests

Is there a way to wait a set amount of time between tests? I need a solution to compensate for server lag. When creating a record, it takes a little bit of time before the record is searchable in my environment.
In the following code example, how would I wait 30 seconds between the first test and the second test, and have no wait time between the second and third tests?
class MySpec extends GebReportingSpec {
    // First Test
    def "should create a record named myRecord"() {
        given:
        to CreateRecordsPage
        when:
        name_field = "myRecord"
        and:
        saveButton.click()
        then:
        at IndexPage
    }

    // Second Test
    def "should find record named myRecord"() {
        given:
        to SearchPage
        when:
        search_query = "myRecord"
        and:
        searchButton.click()
        then:
        // haven't figured this part out yet, but would look for "myRecord" on the results page
    }

    // Third Test
    def "should delete the record named myRecord"() {
        // do the delete
    }
}
You probably don't want to wait a set amount of time - it will make your tests slow. You would ideally want to continue as soon as the record is added. You can use Geb's waitFor {} to poll for a condition to be fulfilled.
// Second Test
def "should find record named myRecord"() {
    when:
    to SearchPage
    then:
    waitFor(30) {
        search_query = "myRecord"
        searchButton.click()
        // verify that the record was found
    }
}
This will poll every half a second, for up to 30 seconds, for the condition to be fulfilled, passing as soon as it is and failing if it's still not fulfilled after 30 seconds.
To see what options you have for setting the waiting time and interval, have a look at the section on waiting in The Book of Geb. You might also want to check out the section on implicit assertions in waitFor blocks.
If your second feature method depends on the success of the first one, then you should probably consider annotating this specification with @Stepwise.
You should always try to use waitFor and check conditions wherever possible. However, if you find there isn't a specific element you can check for, or any other condition to check, you can use this to wait for a specified amount of time:
def sleepForNSeconds(int n) {
    def originalMilliseconds = System.currentTimeMillis()
    waitFor(n + 1, 0.5) {
        (System.currentTimeMillis() - originalMilliseconds) > (n * 1000)
    }
}
I had to use this while waiting for some chart library animations to complete before capturing a screenshot in a report.
Thread.sleep(30000)
also does the trick. Of course, I still agree to "use waitFor whenever possible".

plone.memoize cache depending on function's return value

I'm trying to cache the return value of a function only in case it's not None.
In the following example, it makes sense to cache the result of someFunction for an hour if it managed to obtain data from some-url.
If the data could not be obtained, it does not make sense to cache the result for an hour (or more), but rather for about 5 minutes (so the server for some-domain.com has some time to recover).
import socket
import time
import urllib2

from plone.memoize import ram

def _cachekey(method, self, lang):
    return (lang, time.time() // (60 * 60))

@ram.cache(_cachekey)
def someFunction(self, lang='en'):
    try:
        data = urllib2.urlopen('http://some-url.com/data.txt', timeout=10).read()
    except socket.timeout:
        data = None
    except urllib2.URLError:
        data = None
    return expensive_compute(data)
Calling method(self, lang) in _cachekey would not make a lot of sense.
As this code would be too long for a comment, I'll post it here in the hope it'll help others:
import socket
import time
import urllib2

# initialize cache
from zope.app.cache import ram
my_cache = ram.RAMCache()
my_cache.update(maxAge=3600, maxEntries=20)

_marker = object()

def _cachekey(lang):
    return (lang, time.time() // (60 * 60))

def someFunction(self, lang='en'):
    cached_result = my_cache.query(_cachekey(lang), _marker)
    if cached_result is _marker:
        # not found: download, compute and add to cache
        try:
            data = urllib2.urlopen('http://some-url.com/data.txt', timeout=10).read()
        except socket.timeout:
            data = None
        except urllib2.URLError:
            data = None
        if data is not None:
            # cache the computed value for 1 hour
            computed = expensive_compute(data)
            my_cache.set(computed, (lang, time.time() // (60 * 60)))
        else:
            # allow the download server to recover for 5 minutes instead of
            # trying to download on every page load
            computed = None
            my_cache.set(None, (lang, time.time() // (60 * 5)))
        return computed
    return cached_result
In this case, you should not try to special-case a None return value through the decorator, as decorator-cached results can depend only on the input values.
Instead, you should build the caching mechanism inside your function and not rely on a decorator.
This then becomes a generic, non-Plone-specific Python problem of how to cache values.
Here is an example how to build your manual caching using RAMCache:
https://developer.plone.org/performance/ramcache.html#using-custom-ram-cache
