MapReduce operation recursively

I have an HBase base table A on which I run my Mapper and Reducer classes from my code. On the result I obtain, table A-first, I have to run the same Mapper and Reducer again. Is this possible at all? I have to do the recursion 3 times, as table A is extremely large and my MapReduce job aggregates the table 3 ways on timestamps.
And is it possible to do this without creating table A-first first, i.e. send the data obtained from the Reducer back to the Mapper (the table is still created, but the MapReduce runs on the Reducer's output rather than reading table A-first)?

Related

How to create/update an item in DynamoDB?

I tried the sample code below to create new items in DynamoDB. Based on the docs for DynamoDB and boto3, the sample code adds the items to DynamoDB in a batch, but from the code it looks like put_item is being called in each iteration of the for loop below. Any thoughts? Also, I understand that for updating items there is no batch operation and we have to call update_item one at a time?
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')

with table.batch_writer() as writer:
    for item in somelist:
        writer.put_item(Item=item)
Note that you call the put_item() method on the writer object. This writer object is a batch writer, a wrapper around the original table object, and it doesn't send every put_item() request individually. Instead, as its name suggests, the batch writer collects batches of up to 25 writes in memory, and only on the 25th call does it send all 25 writes as a single DynamoDB BatchWriteItem request.
Then, at the end of the loop, the writer object is destroyed when the with block ends, and this sends the final partial batch as one last BatchWriteItem request.
As you can see, the boto3 batch writer makes efficient batched writing very transparent and easy: it buffers internally and sends each batch automatically. It's like magic.
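As for the second part of the question: correct, BatchWriteItem only supports puts and deletes, so there is no batch equivalent for updates and each item needs its own update_item() call. A minimal sketch, assuming a hypothetical table whose partition key is named id and a hypothetical status attribute being updated:

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')

# There is no batch API for updates: one UpdateItem request per item.
for item in somelist:
    table.update_item(
        Key={'id': item['id']},                     # 'id' is a hypothetical key attribute
        UpdateExpression='SET #s = :s',
        ExpressionAttributeNames={'#s': 'status'},  # hypothetical attribute being updated
        ExpressionAttributeValues={':s': item['status']},
    )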

Apply multiple functions to update a table using Kusto

I want to produce/update the output of a table using several functions, because each function will create separate columns. For me it would be relatively practical to write several functions for this.
Updating a table using a single function is covered in the documentation: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/updatepolicy
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/alter-table-update-policy-command
But this case is not covered. Is it even possible? If so, how?
Is this the right way to do it? .set-or-append TABLE_NAME <| FUNCTION1 <| FUNCTION2 <| FUNCTION3
You can chain update policies as much as you need (as long as it does not create a circular reference). This means that Table B can have an update policy that runs a function over Table A, and Table C can have an update policy that runs a function over Table B.
If you don't need the intermediate tables, you can set their retention policy to 0 days, which means that no data will actually be retained in these tables.
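For example, here is a minimal sketch of the chaining, using hypothetical table and function names (Function1 and Function2 are assumed to return rows matching the schemas of TableB and TableC respectively):

// TableB is populated by running Function1 over newly ingested TableA data (hypothetical names).
.alter table TableB policy update @'[{"IsEnabled": true, "Source": "TableA", "Query": "Function1()", "IsTransactional": false}]'

// TableC is populated by running Function2 over newly ingested TableB data, chaining the policies.
.alter table TableC policy update @'[{"IsEnabled": true, "Source": "TableB", "Query": "Function2()", "IsTransactional": false}]'

// If TableB is only an intermediate step, keep no data in it.
.alter-merge table TableB policy retention softdelete = 0d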

Is there a way to clone a table in Kusto?

Is there a way to clone a table in Kusto exactly, so that it has all the extents of the original table? Even if it's not possible to retain the extents, is there at least a performant way to copy a table to a new table? I tried the following:
.set new_table <| existing_table;
It ran forever and got a timeout error. Is there a way to copy such that the Kusto engine recognizes this is just a dumb copy, so instead of going through the Kusto engine it simply does a blob copy on the back end and points the new table to the copied blobs, bypassing the whole Kusto processing route?
1. Copying the schema and data of one table to another is possible using the command you mentioned (another option for copying the data is to export its content into cloud storage, then ingest the resulting storage artifacts using Kusto's ingestion API or a tool that uses it, e.g. LightIngest or ADF).
Of course, if the source table has a lot of data, then you would want to split this command into multiple ones, each dealing with a subset of the source data (which you can 'partition', for example, by time).
Below is just one example (it obviously depends on how much data you have in the source table):
.set-or-append [async] new_table <| existing_table | where ingestion_time() > X and ingestion_time() < X + 1h
.set-or-append [async] new_table <| existing_table | where ingestion_time() >= X+1h and ingestion_time() < X + 2h
...
Note that async is optional, and is there to avoid a potential client-side timeout (by default after 10 minutes). The command itself continues to run on the backend for up to a non-configurable timeout of 60 minutes (though it's strongly advised to avoid such long-running commands, e.g. by performing the "partitioning" mentioned above).
2. To your other question: There's no option to copy data between tables without re-ingesting the data (an extent / data shard currently can't belong to more than 1 table).
3. If you need to "duplicate" data being ingested into table T1 continuously into table T2, and both T1 and T2 are in the same database, you can achieve that using an update policy.
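A minimal sketch of such a policy, assuming T2 already exists with a schema that matches T1 (the query here simply selects everything newly ingested into T1):

.alter table T2 policy update @'[{"IsEnabled": true, "Source": "T1", "Query": "T1", "IsTransactional": false}]'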

Lambda function was always running when attached to DynamoDB as a trigger

I am trying to invoke a Lambda function from DynamoDB whenever a new record is inserted. I attached it to one DynamoDB table, but the Lambda function gets invoked multiple times even if I add only one record.
Can someone give some insight into this?
Thank you

Spark In-Memory Computation

If Spark computes all its RDD operations in memory anyway, then what difference does it make to persist an RDD in memory?
We persist an RDD so that we can apply more than one action to it, or call an action on it later. After persisting, Spark will skip all the stages that would otherwise need to be recomputed to execute the action. In Spark, all transformations are lazily evaluated, which means they are only executed when you call an action. So the first time you call collect(), all transformations are executed and one of the RDDs is persisted; if you then run another action such as count(), Spark will not re-execute all transformations, it just skips everything up to the persisted RDD and executes the non-persisted part. For example:
val list = sc.parallelize(List(1, 23, 5, 4, 3, 2))
val rdd1 = list.map(_ + 1)
val rdd2 = rdd1.map(_ + 5).cache  // mark rdd2 to be persisted in memory
rdd2.collect                      // first action: runs both map transformations and caches rdd2
rdd2.count                        // second action: reuses the cached rdd2, no recomputation
As in the above example, when rdd2.collect is called it executes all of the above transformations. Since rdd2 is already cached, when count is called it will not execute those transformations again; it uses the persisted RDD to calculate the result.
