I want to use Cassandra on my server, and the server has five disks. Can I partition data across the disks? In other words, how do I store data on these disks using the partition key? I have read that Cassandra can partition data across multiple servers; how do I do the same across multiple disks?
Thanks.
Yes, Cassandra can make use of multiple disks.
In the cassandra.yaml file there is a property called data_file_directories. Under it you can give a list of data directories, so you would list one for each disk, using paths appropriate for your OS that point at the different physical disks.
That's all you need to do and Cassandra will evenly distribute inserted data across the specified data directories.
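For example, on a Linux host with five data disks mounted separately, that section of cassandra.yaml might look like the sketch below; the mount points are just placeholders for wherever your disks are actually mounted:

    data_file_directories:
        - /mnt/disk1/cassandra/data   # one entry per physical disk (example paths)
        - /mnt/disk2/cassandra/data
        - /mnt/disk3/cassandra/data
        - /mnt/disk4/cassandra/data
        - /mnt/disk5/cassandra/data

Note that this spreads data files across local disks, which is independent of how the partition key distributes rows across the nodes of a cluster.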
You can use ceph-volume lvm create --filestore --data example_vg/data_lv --journal example_vg/journal_lv to create a Ceph volume, but I want to know how many volumes Ceph can support. Can it be infinite?
Ceph can serve an infinite number of volumes to clients, but that is not really your question.
ceph-volume is used to prepare a disk to be consumed by Ceph for serving capacity to users. The prepared volume is served by an OSD, which joins the RADOS cluster and adds its capacity to the cluster's.
If your question is how many disks you can attach to a single cluster today, the sensible answer is "a few thousand". You can push farther using a few tricks, and the achievable scale increases over time, but I would say 2,500-5,000 OSDs is a reasonable limit today.
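For a sense of how that capacity is added in practice, each data disk is typically prepared as its own OSD. A rough sketch with BlueStore (the current default backend) might look like this, where the device names are only examples:

    # one OSD per data disk; /dev/sdb and /dev/sdc are example devices
    ceph-volume lvm create --data /dev/sdb
    ceph-volume lvm create --data /dev/sdc

    # check how many OSDs the cluster now has and how many are up/in
    ceph osd stat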
Recently I read the Dynamo paper, which describes Amazon's key/value storage system. Dynamo uses consistent hashing as its partitioning algorithm, and to address load balancing and heterogeneity it applies a "virtual node" mechanism. Here are my questions:
1. It is described that "the number of virtual nodes that a node is responsible for can be decided based on its capacity", but what capacity is that? Is it compute capacity, network bandwidth, or disk volume?
2. What technology is used to partition a node into "virtual nodes"? Is a virtual node just a process, or perhaps a Docker container or a virtual machine?
Without going into specifics, for #1 the answer would be: all of the above. The capacity may be determined empirically for different node types by running some load tests and noting the results, much the same way you would determine the capacity of a web server.
And for your second question, the paper just says that you should think of nodes from a logical standpoint. To satisfy #1, positions in the ring are assigned such that one or more of them map to the same physical hardware. So a virtual node is just a logical mapping, one more layer of abstraction on top of the physical layer. If you are familiar with file systems, think of a virtual node like an inode versus a disk cylinder (a comparison that is perhaps slightly dated).
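To make that "logical mapping" concrete, here is a minimal Java sketch (not from the paper; the names and weights are made up) of a consistent-hash ring where each physical node is assigned a number of virtual nodes proportional to a capacity weight determined empirically, as in #1:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Minimal consistent-hash ring with capacity-weighted virtual nodes.
    public class WeightedRing {
        private final SortedMap<Long, String> ring = new TreeMap<>();

        // 'capacity' is an empirically chosen weight: a node with weight 4
        // gets 4x as many virtual nodes (and so roughly 4x the keys) as weight 1.
        public void addNode(String nodeId, int capacity) {
            for (int i = 0; i < capacity; i++) {
                ring.put(hash(nodeId + "#vnode" + i), nodeId);
            }
        }

        // A key belongs to the first virtual node at or after its hash,
        // wrapping around to the start of the ring if necessary.
        public String nodeFor(String key) {
            long h = hash(key);
            SortedMap<Long, String> tail = ring.tailMap(h);
            return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
        }

        private static long hash(String s) {
            try {
                byte[] d = MessageDigest.getInstance("MD5")
                        .digest(s.getBytes(StandardCharsets.UTF_8));
                long h = 0;
                for (int i = 0; i < 8; i++) {   // first 8 digest bytes -> ring position
                    h = (h << 8) | (d[i] & 0xff);
                }
                return h;
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }

        public static void main(String[] args) {
            WeightedRing ring = new WeightedRing();
            ring.addNode("small-node", 100);   // weights are illustrative only
            ring.addNode("big-node", 400);     // ~4x the capacity, ~4x the keys
            System.out.println(ring.nodeFor("user:42"));
        }
    }

Here a "virtual node" is nothing more than an extra entry in the map pointing back at the same physical node, which matches the answer above: it is an abstraction, not a separate process, container, or VM.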
I would like to use Chronicle Map to read a serialized map from a network file share and then process it locally. We would have 100+ machines reading prepared data from a map - lots of iteration but never writing. Can I just have one process create the map on a network file share and then have each 'consumer' load and process the map? Maps would have no more than 1 million keys, 1K values. Or do we need to use the UDP/TCP replication feature?
Network file stores don't guarantee when an update will ever be visible to a reader. You can't safely open a file on one machine while that file is being modified by another machine.
You need to either replicate the data, or access the data remotely on a smaller set of nodes, e.g. via Engine.
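If you go the "replicate the data" route by shipping a copy of the file to each machine, a minimal sketch with the Chronicle Map 3 builder might look like this; the file name, key type, and size hints are assumptions based on the numbers in the question:

    import java.io.File;
    import java.io.IOException;
    import net.openhft.chronicle.map.ChronicleMap;

    public class PreparedMapSketch {
        public static void main(String[] args) throws IOException {
            // Hypothetical local path: the producer writes this file once, then it is
            // copied (rsync, scp, ...) to each consumer machine before being opened.
            File file = new File("prepared-data.dat");

            try (ChronicleMap<CharSequence, CharSequence> map = ChronicleMap
                    .of(CharSequence.class, CharSequence.class)
                    .name("prepared-data")
                    .entries(1_000_000)         // up to ~1 million keys, per the question
                    .averageKeySize(32)         // assumed average key length in bytes
                    .averageValueSize(1_024)    // assuming "1K values" means ~1 KB each
                    .createPersistedTo(file)) { // creates the file, or opens it if it already exists
                map.put("some-key", "some-value");         // producer side
                System.out.println(map.get("some-key"));   // consumer side: lookups/iteration only
            }
        }
    }

Consumers would open their own local copy with the same builder call; what the answer above warns against is pointing createPersistedTo at a file that another machine is still modifying over a network share.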
Amazon DynamoDB allows the customer to provision the throughput of reads and writes independently. I have read the Amazon Dynamo paper about the system that preceded DynamoDB and read about how Cassandra and Riak implemented these ideas.
I understand how it is possible to increase the throughput of these systems by adding nodes to the cluster which then divides the hash keyspace of tables across more nodes, thereby allowing greater throughput as long as access is relatively random across hash keys. But in systems like Cassandra and Riak this adds throughput to both reads and writes at the same time.
How is DynamoDB architected differently such that it is able to scale reads and writes independently? Or is it not, and is Amazon just charging for them independently even though it essentially has to allocate enough nodes to cover the greater of the two?
You are correct that adding nodes to a cluster should increase the amount of available throughput, but that is on a per-cluster basis, not a per-table basis. The DynamoDB cluster is a shared resource across many tables and many accounts. It's like an EC2 node: you are paying for a virtual machine, but that virtual machine is hosted on a physical machine shared among several EC2 virtual machines, and depending on the instance type you get a certain amount of memory, CPU, network IO, etc.
What you are paying for when you pay for throughput is IO, and reads and writes can be throttled independently. Paying for more throughput does not cause Amazon to partition your table across more nodes. The only thing that causes a table to be partitioned further is the size of your table growing to the point where more partitions are needed to store its data. The maximum size of a partition, from what I have gathered talking to DynamoDB engineers, is based on the size of the SSDs of the nodes in the cluster.
The trick with provisioned throughput is that it is divided among the partitions. So if you have a hot partition, you can get throttling and ProvisionedThroughputExceededExceptions even if your total requests don't exceed the total read or write throughput. This is contrary to what your question assumes: you would expect that if your table is divided among more partitions/nodes you would get more throughput, but in reality it is the opposite, unless you scale your provisioned throughput along with the size of your table.
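To make that division concrete with hypothetical numbers: if a table is provisioned with 1,000 read capacity units and its data has ended up spread across 4 partitions, each partition is allotted roughly 1,000 / 4 = 250 RCU. A hot key driving 400 reads per second at a single partition will then be throttled, even though the table as a whole may be consuming well under its provisioned 1,000 RCU.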
I'm developing a distributed application, and I have a SQLite database that must be shared between the distributed servers.
If I'm on serverA and change a SQLite row, that change must appear on the other servers instantly; and if a server was offline and then comes back online, it must update itself so its data matches the other servers.
I'm trying to build an HA service with small SQLite databases.
I'm thinking of something like MongoDB or RethinkDB, because their replication works well and the data remains available regardless of which servers happen to be online.
Is there a library or some other SQL-based approach to share data between servers?
I used the Raft consensus protocol to replicate my SQLite database. You can find the system here:
https://github.com/rqlite/rqlite
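As a quick sketch of what using it looks like, rqlite exposes an HTTP API; the port and endpoints below are its documented defaults, so adjust them to your deployment:

    # write through the leader's HTTP API
    curl -XPOST 'localhost:4001/db/execute?pretty' \
        -H 'Content-Type: application/json' \
        -d '["CREATE TABLE foo (id INTEGER NOT NULL PRIMARY KEY, name TEXT)"]'

    # read back from the cluster
    curl -G 'localhost:4001/db/query?pretty' --data-urlencode 'q=SELECT * FROM foo'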
Here are some options:
LiteReplica:
It supports master-slave replication for SQLite3 databases using a single master (writable node) and one or many replicas (read-only nodes).
If a device goes offline and then comes back online, the secondary/slave databases are updated incrementally from the primary/master one.
LiteSync:
It implements multi-master replication, so we can write to the database on any node, even while the device is offline.
With both, we open the database using a modified URI, like this:
file:/path/to/app.db?replica=master&bind=tcp://0.0.0.0:4444
AergoLite:
Blockchain based, it has the highest level of security. It stores immutable relational data, secured by a distributed consensus with low resource usage.
Disclosure: I am the author of these solutions
You can synchronize SQLite databases by embedding SymmetricDS in your application. It supports occasionally connected clients, so it will capture changes and sync them when a server comes online. It supports several different database platforms and can be used as a library or as a standalone service.
You can also use CopyCat, which supports SQLite as well as a few other database types.
Marmot looks good:
https://github.com/maxpert/marmot
From their docs:
What & Why?
Marmot is a distributed SQLite replicator with leaderless, eventually consistent replication. It allows you to build robust replication between your nodes by building on top of the fault-tolerant NATS JetStream. This means that if you are running a read-heavy website based on SQLite, you should easily be able to scale it out by adding more replicated SQLite nodes. SQLite is probably the most ubiquitous DB, existing almost everywhere; Marmot aims to make it even more ubiquitous for server-side applications by building a replication layer on top.