Spark - "too many open files" in shuffle - bigdata

Using Spark 1.1
I have two datasets. One is very large and the other was reduced (using some 1:100 filtering) to a much smaller scale. I need to reduce the large dataset to the same scale by joining only those items from the smaller list with their corresponding counterparts in the larger list (the lists contain elements that share a mutual join field).
I am doing that using the following code. The "if (joinKeys != null)" part is the relevant part: the smaller list is "joinKeys", the larger list is "keyedEvents".
private static JavaRDD<ObjectNode> createOutputType(JavaRDD jsonsList, final String type, String outputPath, JavaPairRDD<String, String> joinKeys) {
    outputPath = outputPath + "/" + type;
    JavaRDD events = jsonsList.filter(new TypeFilter(type));
    // This is in case we need to narrow the list to match some other list of ids... Recommendation List, for example... :)
    if (joinKeys != null) {
        JavaPairRDD<String, ObjectNode> keyedEvents = events.mapToPair(new KeyAdder("requestId"));
        JavaRDD<ObjectNode> joinedEvents = joinKeys.join(keyedEvents).values().map(new PairToSecond());
        events = joinedEvents;
    }
    JavaPairRDD<String, Iterable<ObjectNode>> groupedEvents = events.mapToPair(new KeyAdder("sliceKey")).groupByKey();
    // Convert jsons to strings and add "\n" at the end of each
    JavaPairRDD<String, String> groupedStrings = groupedEvents.mapToPair(new JsonsToStrings());
    groupedStrings.saveAsHadoopFile(outputPath, String.class, String.class, KeyBasedMultipleTextOutputFormat.class);
    return events;
}
The thing is, when running this job, I always get the same error:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2757 in stage 13.0 failed 4 times, most recent failure: Lost task 2757.3 in stage 13.0 (TID 47681, hadoop-w-175.c.taboola-qa-01.internal): java.io.FileNotFoundException: /hadoop/spark/tmp/spark-local-20141201184944-ba09/36/shuffle_6_2757_2762 (Too many open files)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.<init>(FileOutputStream.java:221)
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
I already increased my ulimits, by doing the following on all cluster machines:
echo "* soft nofile 900000" >> /etc/security/limits.conf
echo "root soft nofile 900000" >> /etc/security/limits.conf
echo "* hard nofile 990000" >> /etc/security/limits.conf
echo "root hard nofile 990000" >> /etc/security/limits.conf
echo "session required pam_limits.so" >> /etc/pam.d/common-session
echo "session required pam_limits.so" >> /etc/pam.d/common-session-noninteractive
But that didn't fix my problem...

It turns out that the bdutil framework works in such a way that the user "hadoop" is the one running the job. The script that deploys the cluster created a file, /etc/security/limits.d/hadoop.conf, that overrode the ulimit settings for the "hadoop" user, which I wasn't aware of. Deleting this file, or alternatively setting the desired ulimits there, resolved the problem.
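For example, rather than deleting the file, the desired limits can be set for the "hadoop" user in the same file bdutil created (a minimal sketch, reusing the limit values from the commands above):
echo "hadoop soft nofile 900000" >> /etc/security/limits.d/hadoop.conf
echo "hadoop hard nofile 990000" >> /etc/security/limits.d/hadoop.conf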

Related

path not being detected by Nextflow

I'm new to nf-core/Nextflow and, needless to say, the documentation does not always reflect what is actually implemented. I'm defining the basic pipeline below:
nextflow.enable.dsl=2

process RUNBLAST {
    input:
    val thr
    path query
    path db
    path output

    output:
    path output

    script:
    """
    blastn -query ${query} -db ${db} -out ${output} -num_threads ${thr}
    """
}

workflow {
    //println "I want to BLAST $params.query to $params.dbDir/$params.dbName using $params.threads CPUs and output it to $params.outdir"
    RUNBLAST(params.threads, params.query, params.dbDir, params.output)
}
Then I'm executing the pipeline with:
nextflow run main.nf --query test2.fa --dbDir blast/blastDB
Then I get the following error:
N E X T F L O W ~ version 22.10.6
Launching `main.nf` [dreamy_hugle] DSL2 - revision: c388cf8f31
Error executing process > 'RUNBLAST'
Error executing process > 'RUNBLAST'
Caused by:
Not a valid path value: 'test2.fa'
Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run
I know test2.fa exists in the current directory:
(nfcore) MN:nf-core-basicblast jraygozagaray$ ls
CHANGELOG.md conf other.nf
CITATIONS.md docs pyproject.toml
CODE_OF_CONDUCT.md lib subworkflows
LICENSE main.nf test.fa
README.md modules test2.fa
assets modules.json work
bin nextflow.config workflows
blast nextflow_schema.json
I also tried "file" instead of "path", but that is deprecated and raises other kinds of errors.
It would be helpful to know how to fix this so I can get started with the pipeline-building process.
Shouldn't Nextflow copy the file to the execution path?
Thanks
You get the above error because params.query is not actually a path value. It's probably just a simple String or GString. The solution is to instead supply a file object, for example:
workflow {
    query = file(params.query)
    BLAST( query, ... )
}
Note that a value channel is implicitly created by a process when it is invoked with a simple value, like the above file object. If you need to be able to BLAST multiple query files, you'll instead need a queue channel, which can be created using the fromPath factory method, for example:
params.query = "${baseDir}/data/*.fa"
params.db = "${baseDir}/blastdb/nt"
params.outdir = './results'

db_name = file(params.db).name
db_path = file(params.db).parent

process BLAST {

    publishDir(
        path: "${params.outdir}/blast",
        mode: 'copy',
    )

    input:
    tuple val(query_id), path(query)
    path db

    output:
    tuple val(query_id), path("${query_id}.out")

    """
    blastn \\
        -num_threads ${task.cpus} \\
        -query "${query}" \\
        -db "${db}/${db_name}" \\
        -out "${query_id}.out"
    """
}
workflow {

    Channel
        .fromPath( params.query )
        .map { file -> tuple(file.baseName, file) }
        .set { query_ch }

    BLAST( query_ch, db_path )
}
Note that the usual way to specify the number of threads/cpus is to use the cpus directive, which can be configured using a process selector in your nextflow.config. For example:
process {

    withName: BLAST {
        cpus = 4
    }
}
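With params like the ones sketched above (query, outdir), a run could then look like this (a usage sketch with assumed values); the glob is quoted so the shell passes it to Nextflow unexpanded:
nextflow run main.nf --query 'data/*.fa' --outdir results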

Constraint stream breaks when switching from OptaPlanner 7.46.0 to 8.0.0

I have a constraint that crashes in the latest OptaPlanner 8.0.0, but used to work fine on 7.46.0.
As expected, IntelliJ's code inspection (and the debugger) shows that after the first join, the stream is a TriConstraintStream. The runtime class makes more sense to me than the class OptaPlanner is trying to cast to.
When I leave out the last groupBy, the error goes away, so that clause seems to be causing the issue.
Did something change in the way join and groupBy work?
It seems that the underlying OptaPlanner code was refactored for 8.0.0, so I have trouble seeing what exactly changed in OptaPlanner.
Should I add something to ensure that a TriJoin is used instead of a BiJoin?
I could not find any relevant notes in the migration documentation.
protected Constraint preventProductionShortage(ConstraintFactory factory) {
    return factory.from(Demand.class)
            .groupBy(Demand::getSKU,
                    Demand::getWeekNumber
            ) // BiConstraintStream
            .join(Demand.class,
                    equal((sku, weekNumber) -> sku, Demand::getSKU),
                    greaterThanOrEqual((sku, weekNumber) -> weekNumber, Demand::getWeekNumber)
            ) // TriConstraintStream
            .groupBy((sku, weekNumber, totalDemand) -> sku,
                    (sku, weekNumber, totalDemand) -> weekNumber,
                    sum((sku, weekNumber, totalDemand) -> totalDemand.getOrderQuantity())
            ) // TriConstraintStream
            .penalize("Penalty", HardMediumSoftScore.ONE_MEDIUM,
                    (sku_weekNumber, demandQty, productionQty) -> 1);
}
Stack trace:
java.lang.ClassCastException: class org.optaplanner.core.impl.score.stream.tri.CompositeTriJoiner cannot be cast to class org.optaplanner.core.impl.score.stream.bi.AbstractBiJoiner (org.optaplanner.core.impl.score.stream.tri.CompositeTriJoiner and org.optaplanner.core.impl.score.stream.bi.AbstractBiJoiner are in unnamed module of loader 'app')
at org.optaplanner.core.impl.score.stream.drools.common.rules.BiJoinMutator.<init>(BiJoinMutator.java:40)
at org.optaplanner.core.impl.score.stream.drools.common.rules.UniRuleAssembler.join(UniRuleAssembler.java:70)
at org.optaplanner.core.impl.score.stream.drools.common.rules.AbstractRuleAssembler.join(AbstractRuleAssembler.java:179)
at org.optaplanner.core.impl.score.stream.drools.common.ConstraintSubTree.getRuleAssembler(ConstraintSubTree.java:94)
at org.optaplanner.core.impl.score.stream.drools.common.ConstraintSubTree.getRuleAssembler(ConstraintSubTree.java:89)
at org.optaplanner.core.impl.score.stream.drools.common.ConstraintGraph.generateRule(ConstraintGraph.java:431)
at org.optaplanner.core.impl.score.stream.drools.common.ConstraintGraph.lambda$generateRule$57(ConstraintGraph.java:423)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
at org.optaplanner.core.impl.score.stream.drools.common.ConstraintGraph.generateRule(ConstraintGraph.java:424)
at org.optaplanner.core.impl.score.stream.drools.DroolsConstraintFactory.buildSessionFactory(DroolsConstraintFactory.java:101)
at org.optaplanner.core.impl.score.director.stream.ConstraintStreamScoreDirectorFactory.<init>(ConstraintStreamScoreDirectorFactory.java:77)
at org.optaplanner.test.impl.score.stream.DefaultConstraintVerifier.verifyThat(DefaultConstraintVerifier.java:63)
at org.optaplanner.test.impl.score.stream.DefaultConstraintVerifier.verifyThat(DefaultConstraintVerifier.java:32)
at com.ohly.planner.constraints.ConstraintsTest.weekShortageSingleSKU(ConstraintsTest.java:61)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220)
at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)
Process finished with exit code -1
[edit] For completeness, here is the new function as suggested by Lukáš Petrovický:
protected Constraint preventProductionShortage(ConstraintFactory factory) {
    return factory.from(Demand.class)
            .join(Demand.class,
                    equal(Demand::getSKU),
                    greaterThanOrEqual(demand -> demand.getWeekNumber()))
            .groupBy((d, d2) -> d.getSKU(),
                    (d, d2) -> d.getWeekNumber(),
                    sum((d, d2) -> d2.getOrderQuantity())
            )
            ...
[/edit]
This was an unfortunate bug not caught by the existing test coverage.
The fix is aimed at OptaPlanner 8.0.1, incl. test coverage improvement.
That said, I would argue that the constraint is not very efficient. Unless I'm missing some key implications, the following is semantically very similar, yet much faster:
protected Constraint preventProductionShortage(ConstraintFactory factory) {
    return factory.from(Demand.class)
            .join(Demand.class,
                    equal(Demand::getSKU),
                    greaterThanOrEqual(demand -> demand.getWeekNumber()))
            .groupBy(..., ..., sum((demand, demand2) -> ...))
            .penalize("Penalty", HardMediumSoftScore.ONE_MEDIUM);
}
Note how I eliminated the first use of groupBy(). There may be some difference though in how many tuples are penalized this way, which may or may not be what you want. Feel free to open another question on that.

Spark using map in cluster mode

I have an immutable map in my class. When I run my code in local mode, there is no problem and I can reach every key in the map. However, when I run my code in cluster mode, the nodes throw an error about not finding a key in the map.
Here is what I've tried so far:
- Broadcast the immutable map over the cluster:
broadcast = sc.broadcast(my_immutable_map)
- Parallelize the map as a pair RDD:
my_map_rdd = sc.parallelize( my_immutable_map.toSeq)
When I examine the logs, I see a "key not found" exception.
My error stacktrace is as follows:
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 15.0 failed 4 times, most recent failure: Lost task 1.3 in stage 15.0 (TID 25, datanode1.big.com): java.util.NoSuchElementException: key not found: 905053199731
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at havelsan.CDRGenerator$.generate_random_target(CDRGenerator.scala:95)
at havelsan.CDRGenerator$$anonfun$main$2$$anonfun$6.apply(CDRGenerator.scala:167)
at havelsan.CDRGenerator$$anonfun$main$2$$anonfun$6.apply(CDRGenerator.scala:165)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1197)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1205)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Can you explain how Spark distributes maps and how it is possible that some nodes can't find some keys in this map, please? By the way, my Spark version is 1.6.0.
What am I missing?
UPDATE
This part is for initializing the map on the driver:
...
var pd = sc.textFile( "hdfs://...")
my_immutable_map = pd.map( line => line.split(":") ).map{ line => (line(0), line(1).split(","))}.collectAsMap
...
broadcast = sc.broadcast(my_immutable_map)
my_map_rdd = sc.parallelize( my_immutable_map.toSeq)
And this is the part where I get the error:
def my_func(key: String): String = {
    ...
    my_value = broadcast.value(key)
    ...
}
my_func is called inside a map as:
my_another_rdd.map { line =>
    val key = line.split(",")(0)
    my_func(key)
}
The solution I found is to pass the broadcast value to the function as a parameter. Still, I couldn't find a solution for the parallelize method.
https://stackoverflow.com/a/34912887/4668959
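A minimal sketch of that workaround, reusing the (assumed) names from the snippets above: the broadcast variable is passed into the function explicitly, and the lookup goes through its .value:
def my_func(key: String, bc: org.apache.spark.broadcast.Broadcast[scala.collection.Map[String, Array[String]]]): String = {
    // Look the key up in the broadcast map; getOrElse avoids the
    // NoSuchElementException from the stack trace when a key is missing.
    // The mkString is just a placeholder for the real logic.
    bc.value.getOrElse(key, Array.empty[String]).mkString(",")
}
my_another_rdd.map { line =>
    val key = line.split(",")(0)
    my_func(key, broadcast)
}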

Using powershell loop with net.exe command error

I'm trying to use PowerShell to write a script that calls net.exe's delete on a collection of computers that have 3 or fewer files open. I'm fairly new at this, obviously, as I'm getting odd errors.
Using the example at Microsoft's blog, I made the function below out of net session.
Function Get-ActiveNetSessions
{
    # Converts the output of the net session command into PSObjects
    $output = net session | Select-String -Pattern \\
    $output | foreach {
        $parts = $_ -split "\s+", 4
        New-Object -Type PSObject -Property @{
            Computer = $parts[0].ToString();
            Username = $parts[1];
            Opens    = $parts[2];
            IdleTime = $parts[3];
        }
    }
}
which does produce a workable object that I can apply logic to.
I can also use the following to pull all computers with fewer than three opens into a variable:
$computerList = Get-ActiveNetSessions | Where-Object {$_.Opens -clt 3} | Select-Object {$_.Computer}
What fails is the loop below
ForEach ($computer in $computerList)
{
    net session $computer /delete
}
with the error
net : The syntax of this command is:
At line:5 char:5
net session $computer /delete
CategoryInfo :NotSpecified (The syntax of this command is::String) [], RemoteException
FullyQualifiedErrorId : NativeCommandError
NET SESSION
[\\computername] [/DELETE] [/LIST]
Trying to run it with a call of $computer = $computer.ToString() ahead of the execution so it sees a string causes the script to hang without dropping the sessions, forcing me to close and reopen the ISE.
What should I do to get this loop working? Any help is appreciated.
Net session expects a \\ before the server name, it looks like. Have you given that a try?
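A sketch of that suggestion, untested: prefix each name with \\ when calling net.exe (assuming the Computer field doesn't already carry the \\ prefix). Note also that Select-Object {$_.Computer} yields objects with a calculated property rather than plain strings, so -ExpandProperty is one way to get bare names first:
$computerList = Get-ActiveNetSessions |
    Where-Object { $_.Opens -clt 3 } |
    Select-Object -ExpandProperty Computer
ForEach ($computer in $computerList)
{
    # net session wants \\computername, per the syntax message in the error above
    net session "\\$computer" /delete
}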

Time command equivalent in PowerShell

What is the flow of execution of the time command in detail?
I have a user-created function in PowerShell which will compute the execution time of a command in the following way:
It will open a new PowerShell window.
It will execute the command.
It will close the PowerShell window.
It will get the different execution times using the GetProcessTimes function.
Is the time command in Unix also implemented in the same way?
The Measure-Command cmdlet is your friend.
PS> Measure-Command -Expression {dir}
You could also get execution time from the command history (last executed command in this example):
$h = Get-History -Count 1
$h.EndExecutionTime - $h.StartExecutionTime
I've been doing this:
Time {npm --version ; node --version}
With this function, which you can put in your $profile file:
function Time([scriptblock]$scriptblock, $name)
{
    <#
    .SYNOPSIS
    Run the given scriptblock, and say how long it took at the end.
    .DESCRIPTION
    .PARAMETER scriptBlock
    The script block to execute and time.
    .PARAMETER name
    Use this for long scriptBlocks to avoid quoting the entire script block in the final output line.
    .EXAMPLE
    time { ls -recurse }
    .EXAMPLE
    time { ls -recurse } "All the things"
    #>
    if (!$stopWatch)
    {
        $script:stopWatch = new-object System.Diagnostics.StopWatch
    }
    $stopWatch.Reset()
    $stopWatch.Start()
    . $scriptblock
    $stopWatch.Stop()
    if ($name -eq $null) {
        $name = "$scriptblock"
    }
    "Execution time: $($stopWatch.ElapsedMilliseconds) ms for $name"
}
Measure-Command works, but it swallows the stdout of the command being run. (Also see Timing a command's execution in PowerShell)
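One common workaround, not part of the answer above (a sketch): pipe the command to Out-Host inside the script block so its output still reaches the console while Measure-Command reports only the elapsed time.
Measure-Command { npm --version | Out-Host }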
If you need to measure the time taken by something, you can follow this blog entry.
Basically, it suggests using the .NET StopWatch class:
$sw = [System.Diagnostics.Stopwatch]::StartNew()
# The code you measure
$sw.Stop()
Write-Host $sw.Elapsed
