If Spark computes all its RDD operations in-memory anyway, what difference does it make to persist an RDD in memory?
We persist an RDD so that we can apply more than one action to it, or call an action on it later. Once an RDD is persisted, Spark skips the stages that would otherwise have to be recomputed for each action. All transformations in Spark are lazily evaluated, which means they only run when you call an action. So the first time you call collect(), all the transformations are executed and one of the RDDs is persisted; if you then run another action such as count, Spark does not re-execute everything: it skips the stages before the persisted RDD and only executes the non-persisted part. For example:
val list = sc.parallelize(List(1,23,5,4,3,2))
val rdd1 = list.map(_+1)
val rdd2 = rdd1.map(_+5).cache
rdd2.collect
rdd2.count
In the example above, when rdd2.collect is called it executes all of the transformations above it, and since rdd2 is marked with cache, its result is kept in memory. When count is called afterwards, Spark does not execute those transformations again; it uses the persisted rdd2 to compute the result.
I tried the sample code below to create new items in DynamoDB. Based on the docs for DynamoDB and boto3, the sample code should add the items in batches, but just from reading the code it looks like put_item is being called on each iteration of the for loop. Any thoughts? Also, am I right that there is no batch operation for updating items, and we have to call update_item one at a time?
import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')
with table.batch_writer() as writer:
    for item in somelist:
        writer.put_item(Item=item)
Note that you call the put_item() method on the writer object. This writer object is a batch writer, a wrapper around the original table object. The wrapper doesn't send every put_item() request individually. Instead, as its name suggests, the batch writer collects up to 25 writes in memory, and only on the 25th call does it send all 25 writes as a single DynamoDB BatchWriteItem request.
Then, when the with block ends, the writer object is destroyed, and this sends the final partial batch as one last BatchWriteItem request.
As you can see, boto3 makes efficient batched writing very transparent and easy: the batch writer buffers writes internally and sends each batch automatically. It's like magic.
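To make the buffering behaviour concrete, here is a minimal sketch of the pattern the batch writer follows. This is an illustration only, not boto3's actual implementation (the real BatchWriter also retries UnprocessedItems and can deduplicate keys), and it assumes the items are already in DynamoDB's low-level attribute-value format (e.g. {'id': {'S': '1'}}):

import boto3

class SimpleBatchWriter:
    # Illustrative buffering writer, not boto3's real BatchWriter.
    def __init__(self, table_name, flush_amount=25):
        self._client = boto3.client('dynamodb')
        self._table_name = table_name
        self._flush_amount = flush_amount
        self._buffer = []

    def put_item(self, Item):
        # Buffer the write instead of sending it right away.
        self._buffer.append({'PutRequest': {'Item': Item}})
        if len(self._buffer) >= self._flush_amount:
            self._flush()

    def _flush(self):
        if self._buffer:
            # One BatchWriteItem call covers the whole buffer.
            self._client.batch_write_item(
                RequestItems={self._table_name: self._buffer})
            self._buffer = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # The final partial batch is sent when the with block exits.
        self._flush()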
I am pushing multiple values to XCOM based on values returned from a database. As the number of values returned may vary, I am using the index as the key.
How do I, in the next task, retrieve all the values from the previous task? Currently I am only getting the last XCom from t1, but I would like all of them.
Here is the source code for xcom_pull.
You'll see it has some filter logic and defaulting behaviour. I believe you are doing xcom_pull()[-1] or the equivalent, but you can use the task_ids argument to provide an ordered list of the explicit task_ids that you want to pull XCom data from. Alternatively, you can use the keys that you pushed the data up with.
So in your case, where you want all the data emitted from the last task instance and that alone, you just need to pass the task_id of the relevant task to the xcom_pull method.
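As a rough sketch of what that could look like (the task id 'push_values', the extra 'count' key, and the function names are my own illustration, not part of your DAG), these would be the python_callables of two PythonOperator tasks:

def push_values(**context):
    ti = context['ti']
    values = ['a', 'b', 'c']                      # e.g. rows returned from the database
    for i, v in enumerate(values):
        ti.xcom_push(key=str(i), value=v)         # index used as the XCom key
    ti.xcom_push(key='count', value=len(values))  # tell the next task how many there are

def pull_values(**context):
    ti = context['ti']
    count = ti.xcom_pull(task_ids='push_values', key='count')
    # Pull every indexed value back from the upstream task.
    return [ti.xcom_pull(task_ids='push_values', key=str(i))
            for i in range(count)]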
I have an HBase base table A on which I call a Mapper and Reducer class from my code. On the result I obtain, table A-first, I have to call the same Mapper and Reducer again. Is it possible to do this at all? I have to do the recursion 3 times, as table A is extremely large and my MapReduce job aggregates the table 3 ways on timestamps.
And is it possible to do this without creating table A-first first, i.e. send the data obtained from the Reducer back to the Mapper (the table is still created, but the MapReduce runs on the Reducer output rather than reading table A-first)?
I am calling PutItem with a ConditionExpression that looks like:
attribute_exists(id) AND object_version = :x
In other words, I only want to update an item if the following conditions are true:
The object needs to exist
My update must be on the latest version of the object
Right now, if the check fails, I don't know which condition was false. Is there a way to get information on which conditions were false? Probably not but who knows...
Conditional expressions in DynamoDB allow for atomic write operations that are strongly consistent for a single object, even in a distributed system, thanks to Paxos.
One standard approach is to simply read the object first and perform your check in your client application code. If one of the conditions doesn't match, you know directly which one was invalid, without a failed write operation. The reason for having DynamoDB also perform the check is that another application or thread may have modified the object between your check and your write. If the write fails, you read the object again and repeat the check.
Another approach is to skip the read before the write and just read the object after the failed write, doing the check in your code to determine which condition actually failed.
The one or two additional reads against the table are required because you want to know which specific condition failed. This feature is not provided by DynamoDB so you'll have to do the check yourself.
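As a minimal boto3 sketch of that second approach (read only after a failed write), where the table name, key attribute and version attribute are assumptions based on your expression:

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')  # table name assumed for illustration

def conditional_put(item, expected_version):
    try:
        table.put_item(
            Item=item,
            ConditionExpression='attribute_exists(id) AND object_version = :x',
            ExpressionAttributeValues={':x': expected_version},
        )
    except ClientError as e:
        if e.response['Error']['Code'] != 'ConditionalCheckFailedException':
            raise
        # The write was rejected; read the item to see which condition failed.
        current = table.get_item(Key={'id': item['id']}).get('Item')
        if current is None:
            print('Failed because the object does not exist')
        elif current.get('object_version') != expected_version:
            print('Failed because the object is at version',
                  current.get('object_version'))

Keep in mind the item can change again between the failed write and the get_item, so this diagnosis is only best-effort.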
The command builder builds parameter objects. What is meant by a parameter object? Why are they created?
Why does the DeriveParameters method need an extra round trip to the data store?
I have only used it once (when I needed to import some data and was too lazy to build out my SQL statements): to automatically generate the insert, update, and delete statements needed to reconcile changes made to a DataSet with the associated database.
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlcommandbuilder.aspx