I understand that the benchmark for commodity hardware is around 10 nodes able to process 1 million tuples (each of size 10 MB?) per second. However, the term "commodity hardware" is vague, so to pin it down a little: is each node 8-core?
Also, is this benchmark speed for a fully transactional Storm cluster, or for a cluster configured for maximum efficiency?
I have an MPI program which traverses a graph to solve a problem.
If some rank finds another branch of the graph, it sends that task to another random rank. All ranks wait and receive another task after they complete one.
I have 4 processors. When I look at the CPU usage while the program runs, I usually see 2-3 processors at max while 1-2 processors idle, because the tasks are not split equally among the ranks.
To solve this issue I need to know which rank is not already busy solving some task, so that when some rank finds another branch in the graph it can see which rank is free to work on that branch and send it the task.
Q: How can I balance the workload between the ranks?
Note: I don't know the length or the size of the graph, so I can't split the tasks into ranges for each rank at startup. I have to visit each node on the fly and check whether it solves the graph problem; if not, I send the next node's branches to other ranks. (One possible pattern is sketched below.)
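One common fix for exactly this pattern is to stop sending branches to random ranks and instead keep a dynamic work queue: one rank tracks who is idle and hands each newly discovered branch to a free worker. Below is a minimal sketch of that idea; the tag names, the integer task payload, and the fake branch expansion are placeholders for illustration, not the asker's real data structures.

```c
/* Sketch: manager/worker work queue. Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

enum { TAG_IDLE = 1, TAG_NEW_TASK = 2, TAG_DO_TASK = 3, TAG_STOP = 4 };

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                      /* manager: owns the task queue */
        int queue[1024], head = 0, tail = 0;
        int waiting[64], nwait = 0;
        queue[tail++] = 0;                /* root node of the graph */
        for (;;) {
            int msg;
            MPI_Status st;
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_NEW_TASK && tail < 1024)
                queue[tail++] = msg;              /* a rank found a branch */
            else if (st.MPI_TAG == TAG_IDLE)
                waiting[nwait++] = st.MPI_SOURCE; /* rank has nothing to do */
            while (nwait > 0 && head < tail)      /* hand work to idle ranks */
                MPI_Send(&queue[head++], 1, MPI_INT,
                         waiting[--nwait], TAG_DO_TASK, MPI_COMM_WORLD);
            if (head == tail && nwait == size - 1)
                break;          /* queue drained and every worker is idle */
        }
        int stop = 0;
        for (int i = 0; i < nwait; i++)
            MPI_Send(&stop, 1, MPI_INT, waiting[i], TAG_STOP, MPI_COMM_WORLD);
    } else {                    /* worker: solve a branch, report idle, repeat */
        for (;;) {
            int task, dummy = 0;
            MPI_Status st;
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_IDLE, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            /* ...visit the node here; any sub-branches go back to the
             * manager instead of to a random rank: */
            if (task < 10) {    /* fake expansion so the example terminates */
                int child = 2 * task + 1;
                MPI_Send(&child, 1, MPI_INT, 0, TAG_NEW_TASK, MPI_COMM_WORLD);
                child = 2 * task + 2;
                MPI_Send(&child, 1, MPI_INT, 0, TAG_NEW_TASK, MPI_COMM_WORLD);
            }
        }
    }
    MPI_Finalize();
    return 0;
}
```

The manager always knows which ranks are free, so no processor idles while the queue is non-empty; if the single manager ever becomes a bottleneck, work stealing directly between ranks is the usual alternative.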
Studying Raft, I can't understand one thing. For example, I have a cluster of 6 nodes and 3 partitions with a replication factor of 3. Let's say a network error occurs, and now 3 nodes cannot see the remaining 3 nodes, while both halves remain available to clients. Suppose the write SET 5 arrives at the first half: will it be committed, given that the replication factor is 3 and the majority would be 2? Does that mean you can get split brain using the Raft protocol?
With 6 nodes, the majority is 4. So if you have two partitions of three nodes each, neither partition will be able to elect a leader or commit new values.
When a Raft cluster is created, it is configured with a specific number of nodes, and a majority of those nodes is required either to elect a leader or to commit a log entry.
In a Raft cluster, every node has a replica of the data, so we could say the replication factor equals the cluster size. But I don't think I've ever seen the term "replication factor" used in a consensus context.
A few notes on cluster size.
Traditionally, cluster size is 2*N+1, where N is the number of nodes the cluster can lose and still be operational - the remaining nodes still form a majority that can elect a leader or commit log entries. Based on that, a cluster of 3 nodes may lose 1 node; a cluster of 5 may lose 2.
There is not much point (from a consensus point of view) in having a cluster of size 4 or 6. With 4 nodes total, the cluster can survive only one node going offline - it can't survive two, as the remaining two are not a majority and won't be able to elect a leader or agree on progress. The same logic applies to 6 nodes - that cluster can survive only two nodes going off. A cluster of 4 nodes is simply more expensive: it supports the same single-node outage we can already get with just 3 nodes, so it costs more with no availability benefit.
There is one case where cluster designers do pick a cluster of size 4 or 6: when the system allows stale reads and those reads can be executed by any node in the cluster. To support a larger volume of potentially stale reads, a cluster owner adds more nodes to handle the load.
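As a quick concrete check of the sizing argument above, a few lines of C reproduce the numbers (majority = n/2 + 1):

```c
#include <stdio.h>

int main(void) {
    for (int n = 3; n <= 7; n++) {
        int quorum = n / 2 + 1;      /* majority needed to elect or commit */
        int survivable = n - quorum; /* nodes the cluster can lose */
        printf("%d nodes: quorum %d, survives %d failure(s)\n",
               n, quorum, survivable);
    }
    return 0;
}
```

The output shows a 4-node cluster tolerating the same single failure as a 3-node one, and 6 tolerating the same two failures as 5.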
I came across this interesting database the other day and have read some docs on its official site. I have some questions regarding Raft Groups in TiKV (here).
Suppose we have a cluster of around 100 nodes and the replication factor is 3. Does that mean we end up with a lot of tiny Raft "bubbles", each containing only 3 members, which do leader election and log replication inside the "bubble"?
Or do we have one single fat Raft "bubble" which contains 100 nodes?
Please help to shed some light here, thank you!
a lot of tiny Raft "bubbles", each containing only 3 members,
The tiny Raft bubble in your context is a Raft group in TiKV, comprised of 3 replicas (by default). Data is auto-sharded into Regions in TiKV, with each Region corresponding to a Raft group. To support large data volumes, Multi-Raft is implemented. So you can think of Multi-Raft as many tiny Raft "bubbles" distributed evenly across your nodes.
Check the image for Raft in TiKV here
we only have one single fat Raft "bubble" which contains 100 nodes?
No. A Raft group does not contain nodes; rather, its replicas are contained in (hosted on) nodes.
For more details, see: What is Multi-raft in TiKV
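To make the layout above concrete, here is a purely illustrative sketch - the struct, key ranges, and node IDs are invented, not TiKV's actual API. Each Region covers a key range, and its Raft group's 3 replicas land on 3 of the cluster's nodes, so any one node hosts replicas from many different groups:

```c
#include <stdio.h>

/* Invented types for illustration: one Region = one Raft group,
 * whose 3 replicas are placed on 3 of the cluster's nodes. */
struct region {
    const char *start_key, *end_key; /* key range this shard covers */
    int replica_nodes[3];            /* node IDs hosting the 3 replicas */
};

int main(void) {
    struct region regions[] = {
        { "a", "g", { 1, 42, 77 } },
        { "g", "p", { 5, 42, 90 } }, /* node 42 serves two groups */
        { "p", "z", { 1, 13, 90 } },
    };
    for (int i = 0; i < 3; i++)
        printf("region [%s, %s) -> raft group on nodes %d, %d, %d\n",
               regions[i].start_key, regions[i].end_key,
               regions[i].replica_nodes[0], regions[i].replica_nodes[1],
               regions[i].replica_nodes[2]);
    return 0;
}
```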
In this case it means that you have roughly 33 shards ("bubbles"), each replicated across 3 of the nodes.
A replication factor of 3 is quite common in distributed systems. In my experience, databases use a replication factor of 3 (in 3 different locations) as a sweet spot between durability and latency; 6 (in 3 locations) when they lean heavily towards durability; and 9 (in 3 locations) when they never-ever want to lose data. The 9-node databases are extremely stable (Paxos/Raft-based), and I have only seen them used as the configuration store for the 3-node and 6-node databases, which can use a more performant protocol (though Raft is pretty performant, too).
I have multiple questions about the disk space Corda needs over time and could not find any information online.
How much disk space does a Corda transaction need?
How much disk space does Corda need over the course of 10 years with 4.5 million transactions per month on average (without attachments etc.)?
The size of a transaction is not fixed. It will depend on the states, contracts, attachments and other components used.
We do not have any rough guides currently, but we will likely be doing some tests shortly in the run-up to the release of Corda's enterprise version. This will give an idea of the storage requirements of running a node.
As was said, the answer is that it depends on the transaction size. The average Bitcoin transaction runs about 560 bytes, giving around 2,000 transactions per 1 MB block. Ethereum averages about 2 KB per transaction, so it can store 500 per 1 MB block, and from the best numbers I can get, Hyperledger runs about 5 KB per transaction, around 205 per block. Assuming Corda will be somewhere in this spectrum, and assuming you follow the less-is-more axiom (store as little as possible in the block, defer all else to SideDB or off-chain storage), let's choose something easy to calculate with, say a 1 KB per-transaction average for Corda. That is 1,000 transactions per block. With the 1 KB size, multiply TPS * seconds of processing in a day * actual processing days per year (and then by the transaction size and the number of years) to get your number. In your case, (4,500,000 * 1024 * 12 * 10) / (1024^3) should give you the total in GiB - it comes to about 515 gigabytes at a 1 KB transaction size.
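As a quick sanity check of that arithmetic (keeping the same assumed 1 KB average transaction size):

```c
#include <stdio.h>

int main(void) {
    /* Assumptions from the estimate above: 4.5 million transactions per
     * month, 1 KiB (1024 bytes) per transaction, 10 years of operation. */
    double tx_per_month = 4.5e6;
    double bytes_per_tx = 1024.0;
    double months       = 12.0 * 10.0;

    double total_bytes = tx_per_month * bytes_per_tx * months;
    double gib = total_bytes / (1024.0 * 1024.0 * 1024.0);
    printf("estimated storage: %.0f GiB\n", gib); /* prints ~515 */
    return 0;
}
```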
I tried the CorDapp example of an ultra-simple IOU transaction to measure this. A single IOU transaction contains the identities of two counterparties and one notary, plus a single double value (requiring 8 bytes).
Looking at the database, I see that the serialised transaction requires 11 kB.
I am asking for alternative ways for serialisation in: Corda: Large serialized transaction size: Are there alternatives to current serialization design?
Does the average data and instruction access time of the CPU depend on the execution time of an instruction?
For example, if the miss ratio is 0.1, 50% of instructions need memory access, the L1 access time is 3 clock cycles, the miss penalty is 20 cycles, and instructions execute in 1 cycle, what is the average memory access time?
I assume you're talking about a CISC architecture where compute instructions can have memory references. If you have a sequence of ADDs that access memory, then memory requests will come more often than with a sequence of the same number of DIVs, because the DIVs take longer. This won't affect the time of an individual memory access - only locality of reference affects the average memory access time.
If you're talking about a RISC arch, then we have separate memory access instructions. If memory instructions have a miss rate of 10%, then the average access latency will be the L1 access time (3 cycles for hit or miss) plus the L1 miss penalty times the miss rate (0.1 * 20), totaling an average access time of 5 cycles.
If half of your instructions are memory instructions, then that would factor into clocks per instruction (CPI), which would depend on miss rate and also dependency stalls. CPI will also be affected by the extent to which memory access time can overlap computation, which would be the case in an out-of-order processor.
I can't answer your question much better because you're not being very specific. To do well in a computer architecture class, you will have to learn how to compute average access times and CPI.
Well, I'll go ahead and answer your question, but then, please read my comments below to put things into a modern perspective:
Time = Cycles * (1/Clock_Speed) [ unit check: seconds = clocks * seconds/clocks ]
So, to get the exact time you'll need to know the clock speed of your machine; for now, my answer will be in terms of cycles.
Avg_mem_access_time_in_cycles = cache_hit_time + miss_rate*miss_penalty
= 3 + 0.1*20
= 5 cycles
Remember, here I'm assuming your miss rate of 0.1 means 10% of cache accesses miss the cache. If you mean 10% of instructions, then the per-access miss rate is actually double that, 0.1 / 0.5 = 0.2 (because only 50% of instrs are memory ops).
Now, if you want the average CPI (cycles per instr):
CPI = mem_instr% * Avg_mem_access_time + non_mem_instr% * Avg_instr_exec_time
    = 0.5*5 + 0.5*1 = 3 cycles per instruction
Finally, if you want the average instr execution time, you need to multiply 3 by the reciprocal of the frequency (clock speed) of your machine.
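Putting those same numbers into a few lines of C, under the same simplified model as the equations above:

```c
#include <stdio.h>

int main(void) {
    /* Numbers from the question: 3-cycle L1 hit, 10% miss rate (of
     * accesses), 20-cycle miss penalty, 50% memory instructions,
     * 1-cycle execution for the rest. */
    double hit_time = 3.0, miss_rate = 0.1, miss_penalty = 20.0;
    double mem_frac = 0.5, exec_cycles = 1.0;

    double amat = hit_time + miss_rate * miss_penalty;              /* 5 */
    double cpi  = mem_frac * amat + (1.0 - mem_frac) * exec_cycles; /* 3 */
    printf("AMAT = %.1f cycles, CPI = %.1f\n", amat, cpi);
    return 0;
}
```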
Comments:
Comp. Arch classes basically teach you a very simplified model of what the hardware is doing. Current architectures are much, much more complex, and such a model (i.e. the equations above) is very unrealistic. For one thing, access time to the various levels of cache can be variable (depending on where the responding cache physically sits on the multi- or many-core CPU); access time to memory (which is typically 100s of cycles) is also variable, depending on contention for resources (e.g. bandwidth), etc. Finally, in modern CPUs instructions typically execute in parallel (ILP), depending on the width of the processor pipeline. This means that adding up instruction execution latencies is basically wrong (unless your processor is a single-issue processor that only executes one instruction at a time and blocks other instructions on miss events such as cache misses and branch mispredicts). However, for educational purposes and for "average" results, the equations are okay.
One more thing: if you have a multi-level cache hierarchy, then the miss penalty of the level 1 cache will be as follows:
L1$ miss penalty = L2_access_time + L2_miss_rate*L2_miss_penalty
(Note the L2 miss rate there: an L1 miss always costs the L2 access time, plus the L2 miss penalty on the fraction of those accesses that also miss in L2.) If you have an L3 cache, you expand L2_miss_penalty the same way, and so on.
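For a concrete feel of that recursion, here is a small sketch; the L2 numbers below are invented purely for illustration:

```c
#include <stdio.h>

int main(void) {
    /* L1 numbers from the question; the L2 numbers are made up
     * purely to show the recursion. */
    double l1_hit = 3.0, l1_miss_rate = 0.1;
    double l2_hit = 12.0, l2_miss_rate = 0.4, l2_miss_penalty = 100.0;

    /* L1 miss penalty = L2 access time + L2_miss_rate * L2_miss_penalty */
    double l1_miss_penalty = l2_hit + l2_miss_rate * l2_miss_penalty; /* 52 */
    double amat = l1_hit + l1_miss_rate * l1_miss_penalty;           /* 8.2 */
    printf("L1 miss penalty = %.1f, AMAT = %.1f cycles\n",
           l1_miss_penalty, amat);
    return 0;
}
```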