I am using DynamoDB global tables for one of my services, and it is provisioned with the same RCUs/WCUs (rWCUs) for all regions, as per the general recommendations. I am not using on-demand mode.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/globaltables_reqs_bestpractices.html#globaltables_reqs_bestpractices.tables
I understand that the WCUs should be kept consistent across regions to allow writes to replicate. However, the read traffic for my service varies quite a lot across regions, so I was wondering whether it is OK to configure different RCUs per region. The documentation doesn't specifically mention anything about RCUs.
It is safe to keep different RCUs in different regions. A typical use case is an active-passive multi-region architecture.
But if failover from one region to another is automatic, you should make sure the passive region can absorb the resulting burst of read traffic.
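For what it's worth, here is a minimal sketch of doing that with boto3; the table name, regions, and capacity numbers below are hypothetical, and it simply keeps the WCU value identical everywhere while sizing RCUs to each region's read traffic:

import boto3

TABLE_NAME = "orders"        # hypothetical global table, already replicated to these regions
SHARED_WCU = 500             # WCUs kept identical across all replicas
RCU_PER_REGION = {           # RCUs sized to each region's read traffic
    "us-east-1": 1000,
    "eu-west-1": 300,
    "ap-southeast-1": 100,
}

for region, rcu in RCU_PER_REGION.items():
    client = boto3.client("dynamodb", region_name=region)
    client.update_table(
        TableName=TABLE_NAME,
        ProvisionedThroughput={
            "ReadCapacityUnits": rcu,
            "WriteCapacityUnits": SHARED_WCU,
        },
    )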
Currently there are only two Firestore multi-region locations, but if I select one of these, will access from other regions be quicker?
When I accessed an app that uses the Tokyo region from Germany, I was surprised at how slow Firestore was.
Will access from other regions be quicker?
Before you use Cloud Firestore, you must choose a location for your database. To reduce latency and increase availability, store your data close to the users and services that need it. Select a regional location for lower costs, for lower write latency if your application is sensitive to latency, or for co-location with other GCP resources.
Will selecting the Firestore multi-region speed up access from regions around the world?
Multi-region locations can withstand the loss of entire regions and maintain availability without losing data. Global apps can take advantage of Firestore's multi-region deployment. Having multiple servers distributed worldwide reduces latency for end users, increases performance, and ensures data will not be lost in the event of a catastrophic failure of a single datacenter region.
Note:
When choosing the database location, it is important to know the best practices you can follow. That said, when you create your database instance, select the database location closest to your users and compute resources. Far-reaching network hops are more error-prone and increase query latency.
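If you want to see the effect concretely, a quick way is to time a single read from a client in each region of interest; here is a rough sketch with the Python client (the project ID and document path are hypothetical):

import time
from google.cloud import firestore  # pip install google-cloud-firestore

db = firestore.Client(project="my-project")          # hypothetical project ID
doc_ref = db.collection("users").document("probe")   # hypothetical document

start = time.perf_counter()
doc_ref.get()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Single document read took {elapsed_ms:.1f} ms")

Running this from Germany against a Tokyo-hosted database versus a European location makes the difference in round-trip latency obvious.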
What's your preferred strategy for dealing with DAX's maintenance windows?
DynamoDB itself has no MWs and is very highly available. When DAX is introduced into the mix, if it's the sole access point of clients to DDB then it becomes a SPOF. How do you then handle degradation gracefully during DAX scheduled downtimes?
My thinking was to not use the DAX Client directly but introduce some abstraction layer that allows it to fall back to direct DDB access when DAX is down. Is that a good approach?
A DAX maintenance window doesn't take the cluster offline, unless it is a one-node cluster. DAX provides availability through multiple nodes in the cluster. For a multi-node cluster, each node goes through maintenance in a specific order so that the cluster remains available. With retries configured on the DAX client, your workload shouldn't see an impact during maintenance windows.
Beyond maintenance windows, cluster nodes should be spread across multiple AZs so the cluster stays available if an AZ goes down.
An abstraction layer that falls back to DDB is not a bad idea. But you need to make sure you have enough provisioned capacity configured to handle the resulting load spike.
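A sketch of that fallback layer, assuming the Python amazondax client; the cluster endpoint and table name are hypothetical, and production code would want finer-grained error classification and retries rather than a blanket except:

import boto3
from amazondax import AmazonDaxClient

TABLE = "my-table"   # hypothetical table name
ddb = boto3.client("dynamodb", region_name="us-east-1")
dax = AmazonDaxClient(
    endpoint_url="daxs://my-cluster.abc123.dax-clusters.us-east-1.amazonaws.com",  # hypothetical endpoint
    region_name="us-east-1",
)

def get_item(key):
    # Prefer the DAX cache; fall back to DynamoDB if the cluster is unreachable.
    try:
        return dax.get_item(TableName=TABLE, Key=key)
    except Exception:
        # Fallback reads bypass the cache, so the table's provisioned
        # capacity must be sized for this spike.
        return ddb.get_item(TableName=TABLE, Key=key)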
What does big data have to do with cloud computing?
I am trying to understand the relationship between big data and cloud computing.
They overlap, but one is not dependent on the other. Cloud computing lets companies rent infrastructure over time rather than paying up-front for hardware and maintaining it themselves.
In general, cloud vendors allow you to rent large pools of servers and build networks of servers (clusters).
You can pay for servers with large storage drives and install software like the Hadoop Distributed File System (HDFS), Ceph, GlusterFS, etc. These systems present a single "shared filesystem": the more servers you combine into it, the more data you can store.
Now, that's just storage. Hopefully, these servers also have a reasonable amount of memory and CPU. Other technologies such as YARN (with Hadoop), Apache Mesos, and Kubernetes/Docker let you create resource pools to deploy distributed applications that spread over all those servers and read the data stored on all those other machines.
The above is mostly block storage, though. A cheaper alternative is object storage such as Amazon S3, which is a Hadoop-compatible filesystem. There are other object storage solutions, but people use S3 because it is more highly available (via replication) and can be secured more easily with access keys and policies.
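As a small illustration of that last point, a PySpark job can read from S3 through the Hadoop-compatible s3a:// connector exactly the way it would read from HDFS (the bucket and path below are hypothetical, and the hadoop-aws connector has to be on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-example").getOrCreate()

# Only the URI scheme distinguishes object storage from HDFS here;
# swap s3a:// for hdfs:// and the rest of the code is unchanged.
df = spark.read.json("s3a://my-data-lake/events/2024/*.json")
df.groupBy("event_type").count().show()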
Big data and cloud computing are two of the most widely used technologies in today's information technology world. Together they are helping business, education, healthcare, research & development, and other fields grow rapidly and expand what they can do.
With cloud computing, we can store and retrieve data from anywhere at any time, whereas big data is about large data sets that are processed to extract the necessary information.
A customer can shift to cloud computing when they need rapid deployment and scaling of their applications; if the application deals with highly sensitive data and requires strict compliance, that needs to be weighed before putting things in the cloud. Big data, on the other hand, is used where traditional methods and frameworks are ineffective. Big data is not a replacement for relational database systems; it solves specific problems related to large data sets, and most big data solutions are not aimed at small data.
Typical big data technologies are Hadoop, MapReduce, and HDFS, while cloud computing comes in several deployment models: public, private, hybrid, and community cloud.
Cloud computing gives enterprises a cost-effective and flexible way to work with the vast volumes of information we call big data. Because of big data and cloud computing, it is now much easier to start an IT company than ever before. When the combination of big data and cloud computing was first introduced, it opened the road to endless possibilities, and various fields have seen drastic changes made possible by it. It changed the decision-making process for companies and gave a huge advantage to analysts, who could base their conclusions on concrete data.
What is the right approach to use to configure OpenSplice DDS to support 100,000 or more nodes?
Can I use a hierarchical naming scheme for partition names, so "headquarters.city.location_guid_xxx" would prevent packets from leaving a location, and "company.city*" would allow samples to align across a city, and so on? Or would all the nodes know about all these partitions just in case they wanted to publish to them?
The durability services will choose a master when they come up. If one durability service is running on a Raspberry Pi in a remote location over a 3G link, what is to prevent it from trying to become the master for "headquarters" and crashing?
I am experimenting with durability settings such that a remote node uses location_guid_xxx, but the "headquarters" cloud server uses a "Headquarters" scope.
On the remote client I might do this:
<Merge scope="Headquarters" type="Ignore"/>
<Merge scope="location_guid_xxx" type="Merge"/>
so a location won't be master for the universe, but can a durability service within a location still be master for that location?
If I have 100,000 locations does this mean I have to have all of them listed in the "Merge scope" in the ospl.xml file located at headquarters? I would think this alone might limit the size of the network I can handle.
I am assuming that this product will handle this sort of Internet of Things scenario. Has anyone else tried it?
Considering the scale of your system, I think you should seriously consider using Vortex Cloud (see these slides: http://slidesha.re/1qMVPrq). Vortex Cloud will allow you to scale your system better as well as deal with NATs/firewalls. Besides that, you'll be able to use TCP/IP to communicate from your Raspberry Pi to the cloud instance, thus avoiding any problems related to NATs/firewalls.
Before getting to your durability question, there is something else I'd like to point out. If you try to build a flat system with 100K nodes, you'll generate quite a bit of discovery information. Besides generating traffic, this will take up memory in your end applications. If you use Vortex Cloud instead, we play tricks to limit the discovery information. To give you an example, if you have a data writer matching 100K data readers, with Vortex Cloud the data writer would only match one endpoint, reducing the discovery information by a factor of 100K.
Finally, concerning your durability question, you could configure some durability services as alignee-only. In that case they will never become master.
HTH.
A+
Amazon DynamoDB allows the customer to provision the throughput of reads and writes independently. I have read the Amazon Dynamo paper about the system that preceded DynamoDB and read about how Cassandra and Riak implemented these ideas.
I understand how it is possible to increase the throughput of these systems by adding nodes to the cluster which then divides the hash keyspace of tables across more nodes, thereby allowing greater throughput as long as access is relatively random across hash keys. But in systems like Cassandra and Riak this adds throughput to both reads and writes at the same time.
How is DynamoDB architected differently such that it can scale reads and writes independently? Or is it not, and Amazon is simply charging for them independently even though it essentially has to allocate enough nodes to cover the greater of the two?
You are correct that adding nodes to a cluster should increase the amount of available throughput, but that would be on a cluster basis, not a table basis. The DynamoDB cluster is a shared resource across many tables and many accounts. It's like an EC2 node: you are paying for a virtual machine, but that virtual machine is hosted on a real machine shared among several EC2 virtual machines, and depending on the instance type you get a certain amount of memory, CPU, network IO, etc.
What you are paying for when you pay for throughput is IO, and reads and writes can be throttled independently. Paying for more throughput does not cause Amazon to partition your table across more nodes. The only thing that causes a table to be partitioned further is the size of your table growing to the point where more partitions are needed to store its data. The maximum size of a partition, from what I have gathered talking to DynamoDB engineers, is based on the size of the SSDs of the nodes in the cluster.
The trick with provisioned throughput is that it is divided among the partitions. So if you have a hot partition, you could get throttling and ProvisionedThroughputExceededExceptions even if your total requests aren't exceeding the total read or write throughput. This is contrary to what your question assumes: you would expect that if your table is divided among more partitions/nodes you'd get more throughput, but in reality it is the opposite unless you scale your throughput with the size of your table.
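To make the hot-partition point concrete, here is a toy calculation with made-up numbers showing how one partition can throttle while the table as a whole is under its provisioned limit:

TABLE_RCU = 3000
NUM_PARTITIONS = 4
per_partition_rcu = TABLE_RCU / NUM_PARTITIONS   # each partition gets 750 RCU

total_read_rate = 2000                           # well under the 3000 RCU table limit
hot_partition_rate = 0.8 * total_read_rate       # 80% of reads hash to one partition -> 1600 RCU

print(f"Per-partition budget: {per_partition_rcu:.0f} RCU")
print(f"Hot partition load:   {hot_partition_rate:.0f} RCU")
if hot_partition_rate > per_partition_rcu:
    print("-> ProvisionedThroughputExceededException despite spare table-level capacity")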