Spark using map in cluster mode - dictionary

I have a immutable map in my class. When I run my code in local mode, there is no problem and I can reach every key in the map. However, when I run my code in cluster mode, nodes throw error about not finding the key in the map.
What I've tried up to now are these;
-Broadcast the immutable map over cluster.
broadcast = sc.broadcast(my_immutable_map)
-Parallelize the map as pair RDD
my_map_rdd = sc.parallelize( my_immutable_map.toSeq)
When i examine the logs, I see key not found exception.
My error stacktrace is as follows:
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 15.0 failed 4 times, most recent failure: Lost task 1.3 in stage 15.0 (TID 25, datanode1.big.com): java.util.NoSuchElementException: key not found: 905053199731
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at havelsan.CDRGenerator$.generate_random_target(CDRGenerator.scala:95)
at havelsan.CDRGenerator$$anonfun$main$2$$anonfun$6.apply(CDRGenerator.scala:167)
at havelsan.CDRGenerator$$anonfun$main$2$$anonfun$6.apply(CDRGenerator.scala:165)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1197)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1205)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Can you explain how spark distribute maps and how it is possible that some nodes can't find some keys in this map, please? Btw my spark version is 1.6.0
What am I missing?
UPDATE
This part is for initializing the map on driver.
...
var pd = sc.textFile( "hdfs://...")
my_immutable_map = pd.map( line => line.split(":") ).map{ line => (line(0), line(1).split(","))}.collectAsMap
...
broadcast = sc.broadcast(my_immutable_map)
my_map_rdd = sc.parallelize( my_immutable_map.toSeq)
And this is the part where i got the error.
def my_func(key:String):String={
...
my_value = broadcast.value(key)
...
}
my_func is called inside a map as;
my_another_rdd.map{ line =>
val key = line.split(",")(0)
my_func(key)
}

The solution that i found is to pass the broadcast value to the function as a parameter. Still, I couldn't find a solution for parallelize method.
https://stackoverflow.com/a/34912887/4668959

Related

How to write unittest cases for checking database connection with SQL server in Python?

I have just started using unittest in Python for writing test cases, I have a function that makes the connection with SQL server.
sql_connection.py
def getConnection():
connection = pyodbc.connect("Driver={ODBC Driver 13 for SQL Server};"
"Server="+appConfig['sql_server']['server']+";"
"Database="+appConfig['sql_server']['database']+";"
"UID="+appConfig['sql_server']['uid']+";"
"PWD="+appConfig['sql_server']['password']+";"
"Trusted_Connection=no;",
)
return connection
I have tried below test case for checking database connect or not.
test_connection.py
import pyodbc
getConnection1=getConnection()
class TestDatabseConnection(unittest.TestCase):
def test_getConnection(self):
try:
db_connection = getConnection1.connection
except pyodbc.Error as ex:
sqlstate = ex.args[1]
print(sqlstate)
self.fail(
"getConnection() raised pyodbc.OperationalError. " +
"Connection to database failed. Detailed error message: " + sqlstate)
self.assertIsNone(db_connection)
But still not able to get succeed.
======================================================================
ERROR: test_getConnection (__main__.TestDatabseConnection)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_connection.py", line 23, in test_getConnection
db_connection = getConnection1.connection
AttributeError: 'pyodbc.Connection' object has no attribute 'connection'
----------------------------------------------------------------------
Ran 1 test in 0.001s
FAILED (errors=1)
Please help me out in this.
A unit test for your getConnection could look like below. I would suggest using a patch and mock from unittest.mock. With a unit test you are only interested in testing the functionality of getConnection and should "Mock" all other function calls within the function. If you want to test the full potential of pyodbc.connect then I would suggest a functional test that actually connects to the databse which would no longer be a unit test. For more information on patch and Mock checkout the docs. These are very powerful and make unit testing fun and easy! unittest.mock
import unittest
from unittest.mock import patch, Mock
import pyodbc
def getConnection():
appConfig = {'sql_server': {'server':'', 'database':'', 'uid':'', 'password':''}}
connection = pyodbc.connect("Driver={ODBC Driver 13 for SQL Server};"
"Server="+appConfig['sql_server']['server']+";"
"Database="+appConfig['sql_server']['database']+";"
"UID="+appConfig['sql_server']['uid']+";"
"PWD="+appConfig['sql_server']['password']+";"
"Trusted_Connection=no;",
)
return connection
#patch('pyodbc_example.pyodbc')
class TestDatabseConnection(unittest.TestCase):
def test_getConnection(self, pyodbc_mock):
pyodbc_mock.connect.return_value = Mock()
connection = getConnection()
self.assertEqual(connection, pyodbc_mock.connect.return_value)

Resource 7bed8adc-9ed9-49dc-b15e-6660e2fc3285 transitioned to failure state ERROR when use openstacksdk to create_server

When I create the openstack server, I get bellow Exception:
Resource 7bed8adc-9ed9-49dc-b15e-6660e2fc3285 transitioned to failure state ERROR
My code is bellow:
server_args = {
"name":server_name,
"image_id":image_id,
"flavor_id":flavor_id,
"networks":[{"uuid":network.id}],
"admin_password": admin_password,
}
try:
server = user_conn.conn.compute.create_server(**server_args)
server = user_conn.conn.compute.wait_for_server(server)
except Exception as e: # there I except the Exception
raise e
When create_server, my server_args data is bellow:
{'flavor_id': 'd4424892-4165-494e-bedc-71dc97a73202', 'networks': [{'uuid': 'da4e3433-2b21-42bb-befa-6e1e26808a99'}], 'admin_password': '123456', 'name': '133456', 'image_id': '60f4005e-5daf-4aef-a018-4c6b2ff06b40'}
My openstacksdk version is 0.9.18.
In the end, I find the flavor data is too big for openstack compute node, so I changed it to a small flavor, so I create success.

Internal error in the mapping processor: java.lang.NullPointerException

I'm trying to my map local pojo to an autogenerated domain objects using mapstruct. Expect for a specific complex structure everything else seems to map and the mapper implementation class gets generation. Below is the error that I get.
My mapper class is:
#Mappings({
#Mapping(source = "sourcefile", target = "sourceFILE"),
#Mapping(source = "id", target = "ID"),
#Mapping(source = "reg", target = "regID"),
#Mapping(source = "itemDetailsType", target = "ItemDetailsType") //This is the structure that does not map
})
AutoGenDomainType map(LocalPojo localPojo);
#Mappings({
#Mapping(source = "line", target = "LINE"),
#Mapping(source = type", target = "TYPE")
})
ItemDetailsType map(ItemDetailsTypes itemDetailsType);
Error:
Internal error in the mapping processor: java.lang.NullPointerException at org.mapstruct.ap.internal.processor.creation.MappingResolverImpl$ResolvingAttempt.hasCompatibleCopyConstructor(MappingResolverImpl.java:547) at org.mapstruct.ap.internal.processor.creation.MappingResolverImpl$ResolvingAttempt.isPropertyMappable(MappingResolverImpl.java:522) at org.mapstruct.ap.internal.processor.creation.MappingResolverImpl$ResolvingAttempt.getTargetAssignment(MappingResolverImpl.java:202) at org.mapstruct.ap.internal.processor.creation.MappingResolverImpl$ResolvingAttempt.access$100(MappingResolverImpl.java:153) at org.mapstruct.ap.internal.processor.creation.MappingResolverImpl.getTargetAssignment(MappingResolverImpl.java:121) at
.....
.....
[ERROR]
[ERROR] Found 1 error and 16 warnings.
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project uwo-services: Compilation failure
The target object ItemDetailsType does have other properties that need not be mapped. The error says compilation issue, but I dont find any. Also I have tried adding have tried the unmappedTargetPolicy = ReportingPolicy.IGNORE at my mapper class level just to avoid if this is caused by the unmapped properties, but still no solution.
This is a known bug in MapStruct. The bug is reported in #729, it has been fixed in 1.1.0.Final. You are using 1.0.0.Final. I would highly suggest switching to the either 1.1.0.Final or 1.2.0.Beta2.
Once you update you will see a better error message and you will know exactly what the problem in the mapping is.
By looking at this first it looks like that target in #Mapping(source = "itemDetailsType", target = "ItemDetailsType") is wrong. Are you sure that you need a capital letter there?

SQLITE_ERROR: Connection is closed when connecting from Spark via JDBC to SQLite database

I am using Apache Spark 1.5.1 and trying to connect to a local SQLite database named clinton.db. Creating a data frame from a table of the database works fine but when I do some operations on the created object, I get the error below which says "SQL error or missing database (Connection is closed)". Funny thing is that I get the result of the operation nevertheless. Any idea what I can do to solve the problem, i.e., avoid the error?
Start command for spark-shell:
../spark/bin/spark-shell --master local[8] --jars ../libraries/sqlite-jdbc-3.8.11.1.jar --classpath ../libraries/sqlite-jdbc-3.8.11.1.jar
Reading from the database:
val emails = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:sqlite:../data/clinton.sqlite", "dbtable" -> "Emails")).load()
Simple count (fails):
emails.count
Error:
15/09/30 09:06:39 WARN JDBCRDD: Exception closing statement
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (Connection is closed)
at org.sqlite.core.DB.newSQLException(DB.java:890)
at org.sqlite.core.CoreStatement.internalClose(CoreStatement.java:109)
at org.sqlite.jdbc3.JDBC3Statement.close(JDBC3Statement.java:35)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$anon$$close(JDBCRDD.scala:454)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1$$anonfun$8.apply(JDBCRDD.scala:358)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1$$anonfun$8.apply(JDBCRDD.scala:358)
at org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:60)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
at org.apache.spark.scheduler.Task.run(Task.scala:90)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
res1: Long = 7945
I got the same error today, and the important line is just before the exception:
15/11/30 12:13:02 INFO jdbc.JDBCRDD: closed connection
15/11/30 12:13:02 WARN jdbc.JDBCRDD: Exception closing statement
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (Connection is closed)
at org.sqlite.core.DB.newSQLException(DB.java:890)
at org.sqlite.core.CoreStatement.internalClose(CoreStatement.java:109)
at org.sqlite.jdbc3.JDBC3Statement.close(JDBC3Statement.java:35)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$anon$$close(JDBCRDD.scala:454)
So Spark succeeded to close the JDBC connection, and then it fails to close the JDBC statement
Looking at the source, close() is called twice:
Line 358 (org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD, Spark 1.5.1)
context.addTaskCompletionListener{ context => close() }
Line 469
override def hasNext: Boolean = {
if (!finished) {
if (!gotNext) {
nextValue = getNext()
if (finished) {
close()
}
gotNext = true
}
}
!finished
}
If you look at the close() method (line 443)
def close() {
if (closed) return
you can see that it checks the variable closed, but that value is never set to true.
If I see it correctly, this bug is still in the master. I have filed a bug report.
Source: JDBCRDD.scala (lines numbers differ slightly)

Spark - "too many open files" in shuffle

Using Spark 1.1
I have 2 datasets. One is very large and the other was reduced (using some 1:100 filtering) to much smaller scale. I need to reduce the large dataset to the same scale, by joining only those items from the smaller list with their corresponding counterparts in the larger list (those lists contain elements that have a mutual join field).
I am doing that using the following code:
The "if(joinKeys != null)" part is the relevant part
Smaller list is "joinKeys", larger list is "keyedEvents"
private static JavaRDD<ObjectNode> createOutputType(JavaRDD jsonsList, final String type, String outputPath,JavaPairRDD<String,String> joinKeys) {
outputPath = outputPath + "/" + type;
JavaRDD events = jsonsList.filter(new TypeFilter(type));
// This is in case we need to narrow the list to match some other list of ids... Recommendation List, for example... :)
if(joinKeys != null) {
JavaPairRDD<String,ObjectNode> keyedEvents = events.mapToPair(new KeyAdder("requestId"));
JavaRDD < ObjectNode > joinedEvents = joinKeys.join(keyedEvents).values().map(new PairToSecond());
events = joinedEvents;
}
JavaPairRDD<String,Iterable<ObjectNode>> groupedEvents = events.mapToPair(new KeyAdder("sliceKey")).groupByKey();
// Add convert jsons to strings and add "\n" at the end of each
JavaPairRDD<String, String> groupedStrings = groupedEvents.mapToPair(new JsonsToStrings());
groupedStrings.saveAsHadoopFile(outputPath, String.class, String.class, KeyBasedMultipleTextOutputFormat.class);
return events;
}
Thing is when running this job, I always get the same error:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2757 in stage 13.0 failed 4 times, most recent failure: Lost task 2757.3 in stage 13.0 (TID 47681, hadoop-w-175.c.taboola-qa-01.internal): java.io.FileNotFoundException: /hadoop/spark/tmp/spark-local-20141201184944-ba09/36/shuffle_6_2757_2762 (Too many open files)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.<init>(FileOutputStream.java:221)
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
I already increased my ulimits, by doing the following on all cluster machines:
echo "* soft nofile 900000" >> /etc/security/limits.conf
echo "root soft nofile 900000" >> /etc/security/limits.conf
echo "* hard nofile 990000" >> /etc/security/limits.conf
echo "root hard nofile 990000" >> /etc/security/limits.conf
echo "session required pam_limits.so" >> /etc/pam.d/common-session
echo "session required pam_limits.so" >> /etc/pam.d/common-session-noninteractive
But doesn't fix my problem...
The bdutil framework works in a way the user "hadoop" is the one running the job. The script that deploys the cluster, created a file /etc/security/limits.d/hadoop.conf that overrided the ulimit settings for "hadoop" user, which I wasn't aware of. By deleting this file, or alternatively setting the desired ulimits there, I was able to resolve the problem.

Resources