Impala Data Locality - cloudera

I have a question about data locality in Impala.
Let's say I have a cluster with 10 data nodes (each data node runs an impalad),
and I execute a query in Impala such as SELECT * FROM big_table WHERE dt='2017' AND blabla GROUP BY blabla ORDER BY blabla (let's say it's a big query).
Let's also say the files under the partition (dt='2017') are on data nodes 1, 3, and 5.
When I execute the query, will the coordinator use only daemons 1, 3, and 5 for data locality, or will it use all the daemons, with the other daemons reading this data remotely?

Short answer to your question: it uses only daemons 1, 3, and 5, for data locality.
This is generally a scheduling problem. Impala makes such decisions in simple-scheduler.cc:
// We schedule greedily in this order:
// cached collocated replicas > collocated replicas > remote (cached or not) replicas.
Impala will not use another backend to scan data on a datanode if a backend is colocated with that datanode. Fragments without a scan node, such as a partitioned aggregation node, are placed on the same hosts as their input fragments:
// there is no leftmost scan; we assign the same hosts as those of our
// leftmost input fragment (so that a partitioned aggregation fragment
// runs on the hosts that provide the input data)
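The priority order quoted above can be illustrated with a small Python sketch. This is not Impala's actual code: the `pick_replica` function, the replica dictionaries, and the host names are hypothetical, and the real scheduler also has to balance load across backends.

```python
def pick_replica(replicas, backend_hosts):
    """Pick one replica of a scan range, preferring local and cached copies.

    replicas: list of dicts with 'host' and 'cached' keys (hypothetical shape).
    backend_hosts: set of hosts that run an impalad.
    """
    def rank(replica):
        collocated = replica["host"] in backend_hosts
        if collocated and replica["cached"]:
            return 0  # cached collocated replica
        if collocated:
            return 1  # collocated replica
        return 2      # remote replica, cached or not

    # min() keeps the first replica among those with the best rank.
    return min(replicas, key=rank)


replicas = [
    {"host": "dn2", "cached": True},
    {"host": "dn3", "cached": False},
    {"host": "dn1", "cached": True},
]
print(pick_replica(replicas, {"dn1", "dn3"})["host"])  # dn1
```

A remote replica is chosen only when no backend runs on any host that holds a copy of the data.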

AWS Neptune Query gremlins slowness on cold call

I'm currently running some queries that show a big performance gap between the first call (up to 2 minutes) and the following ones (around 5 seconds).
This difference in duration can be seen through the Gremlin REST API in both execution and profile mode.
As the query loads a large amount of data, I expect the issue comes from Neptune's caching behavior in its default configuration. I was not able to find any way to improve this behavior through configuration and would be glad to have some advice on how to reduce the duration of the first call.
Context:
The Neptune database is running on a db.r5.8xlarge instance, and during execution CPU always stays below 20%. I'm also the only user on this instance during the tests.
As we don't have differential inputs, the database is recreated on a weekly basis and switched to production once the loader has loaded everything. Our database therefore has a short lifetime.
The database contains slightly more than 1,000,000,000 nodes and far more edges (probably around 10,000,000,000). Those edges are split across 10 types of labels, and most of them are not used in the current query.
Query:
// recordIds is a table of 50 ids.
g.V(recordIds).hasLabel("record")
// Convert local id to Neptune id.
.out('local_id')
// Go to the tree parent link (either the node itself if the edge comes back, or the real parent).
.bothE('tree_top_parent').inV()
// Remove duplicates.
.dedup()
// Follow the tree parent link backward to get all children. This step loads a large number of nodes that are members of the same tree.
.in('tree_top_parent')
.not(values('some flag').is('Q'))
// Limit not reached; the result is between 80k and 100k nodes.
.limit(200000)
// Convert back to local id for the 80k to 100k selected nodes.
.in('local_id')
.id()
Neptune's architecture comprises a shared cluster "volume" (where all data is persisted and replicated 6 times across 3 availability zones) and a series of decoupled compute instances (one writer and up to 15 read replicas in a single cluster). No data is persisted on the instances; however, approximately 65% of the memory capacity on an instance is reserved for a buffer pool cache. As data is read from the underlying cluster volume, it is stored in the buffer pool cache until the cache fills. Once the cache fills, a least-recently-used (LRU) eviction policy clears buffer pool cache space for newer reads.
It is common to see first reads be slower due to the need to fetch objects from the underlying storage. One can improve this by writing and issuing "prefetch" queries that pull in objects that they think they might need in the near future.
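To make the cold-versus-warm difference concrete, here is a minimal Python sketch of an LRU buffer pool. It is only an illustration of the general technique, not Neptune's implementation: the `BufferPool` class, its capacity, and the page ids are all hypothetical.

```python
from collections import OrderedDict


class BufferPool:
    """Toy LRU cache standing in for an instance's buffer pool (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> data, least recently used first
        self.misses = 0

    def read(self, page_id):
        if page_id in self.pages:
            # Cache hit: mark the page as most recently used.
            self.pages.move_to_end(page_id)
            return self.pages[page_id]
        # Cache miss: this stands in for a slow fetch from the cluster volume.
        self.misses += 1
        data = f"page-{page_id}"
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        return data


pool = BufferPool(capacity=3)
for page_id in (1, 2, 3):   # cold call: every read is a miss
    pool.read(page_id)
for page_id in (1, 2, 3):   # warm call: every read is a hit
    pool.read(page_id)
print(pool.misses)  # 3
```

In this model, a "prefetch" query is simply an earlier pass over the same pages, so that the query that matters runs against a warm cache.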
If you have a use case that is filling buffer pool cache and constantly seeing buffer pool cache misses (a metric one can see in the CloudWatch metrics for Neptune), then you may also want to consider using one of the "d" instance types (ex: r5d.8xlarge) and enabling the Lookup Cache feature [1]. This feature specifically focuses on improving access to property values/literals at query time by keeping them in a directly attached NVMe store on the instance.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-lookup-cache.html

What's the difference between distributed hashtable technology and the bitcoin blockchain?

This question could go into a bitcoin forum, but I am trying to understand it from a programming point of view.
There are technologies used for distributed storage, like distributed hash tables (say Kademlia or similar). How is the bitcoin blockchain different from distributed hash tables? Or does distributed hash table technology perhaps underpin the bitcoin blockchain? And why is the bitcoin blockchain hailed as such a breakthrough compared to DHTs?
Distributed Hash Table
A DHT is simply a key-value store distributed across a number of
nodes in a network. The keys are distributed among nodes with a
deterministic algorithm. Each node is responsible for a portion of
the hash table.
A routing algorithm makes it possible to perform requests in the hash
table without knowing every node of the network.
For example, in the Chord
DHT (a relatively simple DHT implementation) each
node is assigned an identifier and is responsible for the keys that
are closest to its identifier.
Imagine there are 4 nodes with the identifiers 2a6c, 7811, a20f, e9c3.
The data with the identifier 2c92 will be stored on the node 2a6c.
Imagine now that you only know the node 7811 and you are looking
for the data with the identifier eabc.
You ask the node 7811 for the data eabc. 7811 doesn't have it, so
it asks the node e9c3, which sends it to node 7811, which sends it back
to you.
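The key placement in this example can be reproduced with a few lines of Python. This is a sketch of the simplified "closest identifier" rule as described above, not of Chord itself (Chord assigns a key to its successor on the ring, and this sketch ignores wrap-around); the `responsible_node` function is a hypothetical name.

```python
# Node identifiers from the example, written as hexadecimal integers.
NODES = [0x2A6C, 0x7811, 0xA20F, 0xE9C3]


def responsible_node(key, nodes=NODES):
    """Return the node whose identifier is numerically closest to the key."""
    return min(nodes, key=lambda node: abs(node - key))


print(f"{responsible_node(0x2C92):04x}")  # 2a6c
print(f"{responsible_node(0xEABC):04x}")  # e9c3
```

Because the rule is deterministic, any node can work out which identifier a key belongs to without asking a central directory.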
A clever algorithm makes it possible to find data in O(log(N))
jumps without storing the entire routing table of the
network (the addresses of all the nodes). Basically, you ask the
node you know that is closest to the data identifier, which itself asks
the closest node it knows, and so on, reducing the size of the jump at
each step.
A DHT is very scalable because the data is uniformly distributed
among nodes and lookup time generally grows in O(log(N)).
Blockchain
A blockchain is also a distributed data structure, but its purpose
is completely different.
Think of it as a history, or a ledger. The purpose is to store a
continuously growing list of records without the possibility of
tampering and revision.
It is mainly used in the bitcoin currency system for keeping
track of transactions. Its property of being tamper-proof lets everybody
know the exact balance of an account by knowing its history of
transactions.
In a blockchain, each node of the network stores the full data.
So it is absolutely not the same idea as the DHT, in which data
is divided among nodes. Every new entry in the blockchain must
be validated by a process called mining, whose details are out of the
scope of this answer, but this process ensures consensus on the
data.
Both are distributed data structures, but they serve
different purposes. A DHT aims to provide an efficient (in terms of
lookup time and storage footprint) structure to divide data on a
network, and a blockchain aims to provide a tamper-proof data
structure.
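As a rough illustration of why a ledger of this kind resists revision, the Python sketch below links each record to the hash of the previous one. It deliberately leaves out mining and consensus, and the function names and sample entries are hypothetical; it shows only the hash-chaining idea, not bitcoin's actual block format.

```python
import hashlib
import json


def block_hash(block):
    """Hash a block's full content, including its link to the previous block."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, data):
    """Append a record that stores the hash of the block before it."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev_hash, "data": data})


def is_valid(chain):
    """Check that every block still matches the hash stored by its successor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
for entry in ["alice pays bob 5", "bob pays carol 2", "carol pays dave 1"]:
    append_block(chain, entry)

print(is_valid(chain))                  # True
chain[0]["data"] = "alice pays bob 500"
print(is_valid(chain))                  # False
```

Changing an old record breaks the link stored in the next block, so every later block would have to be rewritten as well.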
In computing, a hash table (hash map) is a data structure which implements an associative array abstract data type, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
A blockchain, by contrast, is a digital ledger in which transactions made in bitcoin or another cryptocurrency are recorded chronologically and publicly.

In an IPv4 network, after what period do the identifiers generated by a host wrap around?

Every host in an IPv4 network has a 1-second resolution real-time clock with battery backup. Each host needs to generate up to 1000 unique identifiers per second. Assume that each host has a globally unique IPv4 address. Design a 50-bit globally unique ID for this purpose. After what period (in seconds) will the identifiers generated by a host wrap around?
There is a system which generates 1000 IDs per second.
You need to design a 50-bit globally unique ID.
Now, the IPv4 address of each host is already given to be unique. So including the IP address in this ID ensures that IDs generated on one host can't conflict with those generated on another (thus ensuring global uniqueness).
This leaves only part of the full 50-bit space available for the "unique" part of each generated ID; random or sequential doesn't matter.
The question can be reworded as follows:
How many bits are left for the "unique" part? And assuming that each host generates 1000 IDs per second, how many seconds pass before you wrap around the available bit space?
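The arithmetic can be worked through in a few lines of Python. The bit counts follow from the problem statement (a 50-bit ID and a 32-bit IPv4 address); the two layouts for the remaining bits are assumptions made for illustration, since the question does not prescribe one.

```python
import math

ID_BITS = 50
IPV4_BITS = 32          # an IPv4 address is 32 bits long
IDS_PER_SECOND = 1000

remaining_bits = ID_BITS - IPV4_BITS                  # 18 bits per host

# Layout A (assumption): a plain per-host counter over all remaining bits.
counter_wrap_seconds = 2 ** remaining_bits / IDS_PER_SECOND

# Layout B (assumption): a per-second sequence number plus a timestamp
# taken from the 1-second clock.
sequence_bits = math.ceil(math.log2(IDS_PER_SECOND))  # 10 bits cover 1000 IDs
timestamp_bits = remaining_bits - sequence_bits       # 8 bits
clock_wrap_seconds = 2 ** timestamp_bits

print(remaining_bits)        # 18
print(counter_wrap_seconds)  # 262.144
print(clock_wrap_seconds)    # 256
```

Layout B makes use of the battery-backed clock mentioned in the question, which lets a host keep generating non-repeating IDs after a restart without having to persist a counter.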

Riak: Using n_val = 3 and only 3 nodes

I'm starting with Riak and so far everything is going great. I'm not concerned about performance at the moment because I'm mainly using it as a backup store. I've read all docs I could find (the littleriakbook.com was great to explain the concepts) but I still seem to not grasp some parts.
The situation is that I can only use 3 physical nodes/servers at the moment (instead of the recommended 5). I want all data to be replicated to all three nodes. Essentially, if up to 2 nodes go down, I still want to be able to read from and write to the remaining node. And when the nodes come back up, they should synchronise again.
I've set it all up, and riak-admin diag shows me that not all data fulfils the n_val requirement. How can I make sure that all three nodes are (eventually) identical copies? Is it possible to trigger a redistribution of the data that doesn't fulfil the requirement?
With only 3 nodes, it is not possible to fulfil the n_val requirement and ensure that the three copies stored of any object always will be on different nodes. The reason for this is in how Riak distributes replicas.
When storing or retrieving an object, Riak will calculate a hash value based on the bucket and key, and map this value to a specific partition on the ring. Once this partition has been determined the other N-1 replicas are always placed on the following N-1 partitions. If we assume we have a ring size of 64 and name these partitions 1-64, an object that hashes into partition 10 and belongs to a bucket with n_val set to 3 will also be stored in partitions 11 and 12.
With 3 nodes you will often see that the partitions are spread out alternating between the physical nodes. This means that for most partitions, the replicas will be on different physical nodes. For the last 2 partitions of the ring, 63 and 64 in our case, storage will however need to wrap around onto partitions 1 and 2. As 64 can not be evenly divided by 3, objects that do hash into these last partitions will therefore only be stored on 2 different physical nodes.
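The wrap-around effect can be checked with a short Python sketch. It assumes a strictly alternating (round-robin) assignment of partitions to nodes, which is a simplification of how Riak actually claims partitions; the node names and helper functions are hypothetical.

```python
RING_SIZE = 64
NODES = ["node1", "node2", "node3"]
N_VAL = 3


def owner(partition):
    """Assumed round-robin ownership: partitions 1, 2, 3, 4 map to node1, node2, node3, node1."""
    return NODES[(partition - 1) % len(NODES)]


def replica_partitions(first):
    """The partition an object hashes into, plus the following N_VAL - 1 partitions."""
    return [(first - 1 + i) % RING_SIZE + 1 for i in range(N_VAL)]


under_replicated = [
    p for p in range(1, RING_SIZE + 1)
    if len({owner(q) for q in replica_partitions(p)}) < N_VAL
]
print(under_replicated)  # [63, 64]
```

Under this assumption, only objects that hash into partitions 63 and 64 end up with two of their three replicas on the same physical node.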
When a node fails or becomes unavailable in Riak, the remaining nodes will temporarily take responsibility for the partitions belonging to the lost node. These are known as fallback partitions and will initially be empty. As data is updated or inserted, these partitions will keep track of it and hand it back to the owning node once it becomes available. If Active Anti-Entropy is enabled, it will over time synchronise the fallback partition with the other partitions in the background.

Who can explain the 'Replication' in dynamo paper?

In the Dynamo paper: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
The Replication section says:
To account for node failures, preference list contains more than N
nodes.
I want to know why. And does this 'node' mean virtual node?
This is done to increase Dynamo's availability. If the top N nodes in the preference list are healthy, the other nodes will not be used. But if some of the top N nodes are unavailable, nodes further down the list will be used in their place. For write operations, this is called hinted handoff.
The diagram makes sense both for physical nodes and virtual nodes.
I also don't understand the part you're talking about.
Background:
My understanding of the paper is that, since Dynamo's default replication factor is 3, each node N is responsible for the ring range from N-3 to N (while also being the coordinator for the ring range N-1 to N).
That explains why:
node B holds keys from F to B
node C holds keys from G to C
node D holds keys from A to D
And since range A-B falls within all those ranges, nodes B, C and D are the ones that have that range of key hashes.
Section 4.3 (Replication) of the paper states:
To address this, the preference list for a key is constructed by skipping positions in the ring to ensure that the list contains only distinct physical nodes.
How can the preference list contain more than N nodes if it is constructed by skipping virtual ones?
IMHO they should have stated something like this:
To account for node failures, ring range N-3 to N may contain more than N nodes, N physical nodes plus x virtual nodes.
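One way to read the two quoted sentences together is sketched below in Python: the ring is walked clockwise from the key, virtual nodes whose physical host is already listed are skipped, and the list is made longer than N so that spare hosts remain when some are down. This is an interpretation under stated assumptions, not code from the paper; the ring layout, host names, and function names are hypothetical.

```python
from bisect import bisect_right


def preference_list(ring, key_pos, length):
    """Collect distinct physical nodes clockwise from the key's position.

    ring: sorted list of (position, physical_node) pairs, one per virtual node.
    """
    positions = [pos for pos, _ in ring]
    start = bisect_right(positions, key_pos) % len(ring)
    chosen = []
    for step in range(len(ring)):
        _, node = ring[(start + step) % len(ring)]
        if node not in chosen:  # skip virtual nodes of an already listed host
            chosen.append(node)
        if len(chosen) == length:
            break
    return chosen


def first_healthy(pref_list, n, down):
    """Pick the first n reachable nodes from the preference list."""
    return [node for node in pref_list if node not in down][:n]


# Hypothetical ring: hosts A and B each own two virtual nodes.
ring = [(10, "A"), (20, "B"), (30, "A"), (40, "C"), (50, "B"), (60, "D")]
prefs = preference_list(ring, key_pos=15, length=4)
print(prefs)                                 # ['B', 'A', 'C', 'D']
print(first_healthy(prefs, 3, down={"A"}))   # ['B', 'C', 'D']
```

With N = 3 and a list of 4 hosts, the failure of A still leaves three distinct physical hosts to hold the replicas.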
The distributed DBMS Dynamo DB falls in the class of systems that sacrifice Consistency.
So, the system is inconsistent even though it is highly available. Because network partitions are a given in Distributed Systems you cannot not pick Partition Tolerance.
Addressing your questions:
To account for node failures, preference list contains more than N nodes. I want to know why?
One fact of large-scale distributed systems is that, in a system of thousands of nodes, node failure is the norm.
You are bound to have a few nodes failing in such a big system. You don't treat it as an exceptional condition. You prepare for such situations. How do you prepare?
For Data: You simply replicate your data on multiple nodes.
For Execution: You perform the same execution on multiple nodes. This is called speculative execution. As soon as you get the first result from the multiple executions you ran, you cancel the other executions.
That's the answer right there - you replicate your data to prepare for the case when node(s) may fail.
To account for node failures, preference list contains more than N nodes. Does this 'node' mean virtual node?
I wanted to ensure that I always have access to my house. So I copied my house's keys and gave them to another family member of mine. This guy put those keys in a safe in our house. Now when we all go out, I'm under the illusion that we have other set of keys so in case I lose mine, we can still get in the house. But... those keys are in the house itself. Losing my keys simply means I lose the access to my house. This is what would happen if we replicate the data on virtual nodes instead of physical nodes.
A virtual node is not a separate physical node, so when the physical node that this virtual node is mapped to fails, the virtual node goes away as well.
This 'node' cannot mean virtual node if the aim is high availability, which is the aim in Dynamo DB.
