I can't find a clear comparison between Cloudflare Workers KV and DynamoDB in AWS. Can anyone explain the difference in simpler terms?
Although there are some similarities (e.g. both DynamoDB and Workers KV are offered as managed services), I would say they are more different than they are alike.
Workers KV is always eventually consistent, whereas DynamoDB can be strongly consistent for read-after-write operations.
DynamoDB has additional capabilities such as local and global secondary indexes allowing you to have different access patterns for the same underlying data.
Workers KV is heavily optimized for reads with infrequent writes, whereas DynamoDB doesn't have the same limitation (though DynamoDB also does better at reading data than writing in terms of throughput).
DynamoDB also has other features such as stream processing which allows you to do out of band processing in reaction to changes to the data stored in the database.
I'm not sure about the security model for Workers KV but DynamoDB allows you to configure strict access policies to the tables.
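To make the consistency difference concrete, here is a minimal Kotlin sketch using the AWS SDK for Java v2; the table and key names are made up for illustration. Setting consistentRead(true) requests a strongly consistent read, which has no equivalent in Workers KV:

import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.AttributeValue
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest

fun main() {
    // Hypothetical table and key names, purely for illustration.
    val client = DynamoDbClient.create()

    val request = GetItemRequest.builder()
        .tableName("Users")
        .key(mapOf("userId" to AttributeValue.builder().s("user-123").build()))
        // Opt in to a strongly consistent read; the default (false) is
        // eventually consistent, which is the only mode Workers KV offers.
        .consistentRead(true)
        .build()

    val item = client.getItem(request).item()
    println(item)
}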
I'm running a simple ADF pipeline that copies data from a data lake to Cosmos DB (SQL API).
With database throughput set to Autopilot 4,000 RU/s, the run took ~11 minutes and I see 207 throttled requests. With Autopilot 20,000 RU/s, the run took ~7 minutes and I see 744 throttled requests. Why is that? Thank you!
Change the indexing policy from Consistent to None before running the ADF copy activity, then change it back to Consistent when done.
Azure Cosmos DB supports two indexing modes:
Consistent: The index is updated synchronously as you create, update or delete items. This means that the consistency of your read queries will be the consistency configured for the account.
None: Indexing is disabled on the container. This is commonly used when a container is used as a pure key-value store without the need for secondary indexes. It can also be used to improve the performance of bulk operations. After the bulk operations are complete, the index mode can be set to Consistent and then monitored using the IndexTransformationProgress until complete.
How to modify the indexing policy: see "Modifying the indexing policy" in the Azure Cosmos DB documentation.
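For reference, here is a rough Kotlin sketch of toggling the indexing mode programmatically, assuming the Azure Cosmos DB Java SDK v4; the endpoint, key, database, and container names are placeholders (you can also make the same change from the portal):

import com.azure.cosmos.CosmosClientBuilder
import com.azure.cosmos.models.IndexingMode
import com.azure.cosmos.models.IndexingPolicy

fun main() {
    // Placeholder endpoint, key, database, and container names.
    val client = CosmosClientBuilder()
        .endpoint("https://<your-account>.documents.azure.com:443/")
        .key("<your-key>")
        .buildClient()

    val container = client.getDatabase("mydb").getContainer("mycontainer")

    // Read the current container definition and disable indexing before the bulk load.
    val properties = container.read().properties
    properties.setIndexingPolicy(IndexingPolicy().setIndexingMode(IndexingMode.NONE).setAutomatic(false))
    container.replace(properties)

    // ...run the ADF copy activity, then restore consistent indexing afterwards
    // and monitor IndexTransformationProgress until it completes.
    properties.setIndexingPolicy(IndexingPolicy().setIndexingMode(IndexingMode.CONSISTENT).setAutomatic(true))
    container.replace(properties)
}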
What is the replication type of DynamoDB?
I'm assuming it is peer-to-peer based on online results, but can anyone confirm or deny this?
People often assume a connection between the Dynamo paper and DynamoDB, and therefore claim that DynamoDB uses leaderless replication.
However, while DynamoDB is based on many principles of Dynamo, it is not an implementation of Dynamo. See reference 4 of the Wikipedia article on Dynamo for a quote explaining that it uses single-leader replication.
Dynamo had a multi-leader design requiring the client to resolve version conflicts, whereas DynamoDB uses synchronous replication across multiple data centers for high durability and availability.
I am using DynamoDB and I am getting ProvisionedThroughputExceededException on both reads and writes.
How can I solve this?
Can using DAX ensure that I do not get this error?
DAX is a write-through cache, not a write-back cache, which means that on a cache miss DAX calls DynamoDB on your behalf to fetch the data. In this model, you are still responsible for managing DynamoDB table capacity, so DAX by itself does not guarantee you will avoid this error.
You may want to consider using auto scaling together with DAX, but whether that helps depends on your access patterns.
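If you do hit throttling, the usual mitigations are raising or auto scaling provisioned capacity and retrying with exponential backoff (the AWS SDKs already retry internally). A minimal Kotlin sketch of the backoff idea, using the AWS SDK for Java v2, purely for illustration:

import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest
import software.amazon.awssdk.services.dynamodb.model.GetItemResponse
import software.amazon.awssdk.services.dynamodb.model.ProvisionedThroughputExceededException

// Retry a throttled read with exponential backoff and jitter.
// The SDK already retries internally; this just shows the idea.
fun getItemWithBackoff(client: DynamoDbClient, request: GetItemRequest, maxAttempts: Int = 5): GetItemResponse {
    var delayMs = 100L
    repeat(maxAttempts - 1) {
        try {
            return client.getItem(request)
        } catch (e: ProvisionedThroughputExceededException) {
            Thread.sleep(delayMs + (0..delayMs).random())
            delayMs *= 2
        }
    }
    return client.getItem(request) // final attempt, let the exception propagate
}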
I'm thinking about learning JanusGraph to use in my new big project, but I can't understand some things.
JanusGraph can be used like any database and supports "insert", "update", and "delete" operations, so JanusGraph will write data into Cassandra or another database to store it, right?
Where does JanusGraph store the nodes, edges, attributes, etc.? It will write these into the database, right?
Is this data loaded into memory by JanusGraph, or is it read from Cassandra all the time?
Does the data that JanusGraph reads have to be loaded into JanusGraph on every query, or will it do selects against the database to retrieve only the data I need?
Is the data retrieved from the database only what I need, or will JanusGraph read all records in the database all the time?
Should I use JanusGraph in my project in production or should I wait until it becomes production ready?
I'm developing a kind of social network that needs to store friendships, posts, comments, and user blocks, and also do some Elasticsearch queries. In this case, what database backend should I use?
JanusGraph will write data into Cassandra or another database to store it, right?
Where does JanusGraph store the nodes, edges, attributes, etc.? It will write these into the database, right?
JanusGraph will write the data into whatever storage backend you configure it to use, including Cassandra. It writes this data into the underlying database using the data model roughly outlined here.
Is this data loaded into memory by JanusGraph, or is it read from Cassandra all the time?
Is the data retrieved from the database only what I need, or will JanusGraph read all records in the database all the time?
JanusGraph will only load into memory the vertices and edges that you touch during a query/traversal. So if you do something like:
graph.traversal().V().hasLabel("My Amazing Label");
JanusGraph will read and load into memory only the vertices with that label, so you don't need to worry about initialising a graph connection and then waiting for the entire graph to be serialised into memory before you can query. JanusGraph is a lazy reader.
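As a small Kotlin illustration (the config path and label are placeholders), nothing is fetched from the storage backend until the traversal is actually iterated:

import org.janusgraph.core.JanusGraphFactory

fun main() {
    // Placeholder config file path.
    val graph = JanusGraphFactory.open("conf/janusgraph-cql.properties")
    val g = graph.traversal()

    // Nothing is read from the storage backend yet: a traversal is lazy
    // until you iterate it (e.g. with toList(), next(), or hasNext()).
    val traversal = g.V().hasLabel("My Amazing Label")

    // Only now are the matching vertices fetched and materialised in memory.
    val vertices = traversal.toList()
    println("Loaded ${vertices.size} vertices")

    graph.close()
}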
Should I use Janus in my project in production or should I wait until it becomes production ready?
That is entirely up to you and your use case. JanusGraph is already being used in production, as can be seen at the bottom of the page linked here. JanusGraph was forked from, and improves on, TitanDB, which is also used in several production deployments. So if you are wondering "is it ready", I would say yes; it's clearly ready given its existing uses.
what database backend should I use?
Again, that's entirely up to you. I use Cassandra because it can scale horizontally and I find it easier to work with. It also seems to suit all different sizes of data.
I have toyed with Google Bigtable and that seems very powerful as well. However, it's only really suited for very big data, and it's also cloud-only, whereas Cassandra can be hosted locally very easily.
I have not used Janus with HBase or BerkeleyDB so I can't comment there.
It's very simple to change between backends, though (all you need to do is adjust some configs and check your dependencies are in place), so during development feel free to play around with the backends. You only really need to commit to a backend when you go to production or are more sure of each one.
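For example, here is a rough Kotlin sketch of opening the same graph against two different backends purely through configuration; hostnames and directories are placeholders:

import org.janusgraph.core.JanusGraph
import org.janusgraph.core.JanusGraphFactory

// Open JanusGraph against different storage backends just by changing
// configuration values; hostnames and paths are illustrative.
fun openWithCassandra(): JanusGraph =
    JanusGraphFactory.build()
        .set("storage.backend", "cql")                // Cassandra via the CQL driver
        .set("storage.hostname", "127.0.0.1")
        .set("index.search.backend", "elasticsearch") // optional mixed-index backend
        .set("index.search.hostname", "127.0.0.1")
        .open()

fun openWithBerkeley(): JanusGraph =
    JanusGraphFactory.build()
        .set("storage.backend", "berkeleyje")         // embedded BerkeleyDB, no server needed
        .set("storage.directory", "/tmp/janusgraph-berkeley")
        .open()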
When considering what storage backend to use for a new project, it's important to consider what tradeoffs you'd like to make. In my personal projects, I've enjoyed using NoSQL graph databases due to the following advantages over relational DBs:
Not needing to migrate schemas increases productivity when rapidly iterating on a new project
Traversing a heavily normalized data-model is not as expensive as with JOINs in an RDBMS
Most include in-memory configurations which are great for experimenting & testing.
Support for multi-machine clusters and Partition Tolerance.
Here are sample JanusGraph and Neo4j backends written in Kotlin:
https://github.com/pm-dev/janusgraph-exploration
https://github.com/pm-dev/neo4j-exploration
The main advantage of JanusGraph is the flexibility of plugging in whichever storage backend you'd like.
Background: I will be using .NET 4.0, Azure SDK 1.7, and Azure Table Storage.
Problem
How do I most efficiently (i.e. with the fastest processing time) read N entries, where N is large (thousands to millions of entities) and each entity is very small (<200 bytes), from a set of Azure tables, given that I know the PartitionKey and RowKey for each entity up front, i.e. [(P1,R1),(P2,R2),...,(PN,RN)]?
What is the most efficient way to 'batch' process such a request? Naturally, underneath there will be a need to async/parallelise the fetches without blocking threads on I/O or synchronisation locks. Ideally I should see the CPU reach >80% utilisation on the server making the calls to Azure Table Storage, as this processing should be CPU bound rather than I/O or memory bound.
Since you are asking for the "fastest" processing time to read from Azure Storage, here are some general tips that improved my performance (the top ones are the most important):
Ensure the Azure Storage account was created after July 2012. This is Gen2 of Azure Storage and it includes storage on SSD drives.
In your case, table storage has increased scalability targets for partitions for Gen2 of Azure Storage: http://blogs.msdn.com/b/windowsazure/archive/2012/11/02/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx
10 Gbps network vs 1 Gbps network
A single partition can process 20,000 entities/second
Raise the .NET default connection limit, which caps concurrent outbound requests (I think this might be addressed in the new SDK, but I'm not sure): http://social.msdn.microsoft.com/Forums/en-US/windowsazuredata/thread/d84ba34b-b0e0-4961-a167-bbe7618beb83
You can "warm up" Azure Storage: the more transactions it sees, the more of the controller/drive cache it will use. However, it may be expensive to constantly hit your storage in this way.
You can use MULTIPLE Azure Storage accounts. This can distribute your load very efficiently (sharding): http://tk.azurewebsites.net/2012/08/26/hacking-azure-for-more-disk-performance/
You have several ways to architect/design in Table Storage. You have the partition key and the row key, but you also have the table itself. Remember this is NoSQL, so you can have 100 tables with the same structure serving different data. That can be a performance boost in itself, and you can also store these tables in different Azure Storage accounts. RowKey -> PartitionKey -> Table -> multiple storage accounts can all be thought of as "indexes" for faster access.
I don't know your data, but since you will be searching on PartitionKey (I assume), maybe instead of storing 1,000,000 really small records per PartitionKey you could store them in a zip file, fetch and unzip it quickly, and then parallel-query it with LINQ once it is on the local server. Playing with caching will always help since you have a lot of small objects; you could probably fit entire partitions in memory. Another option might be to store a partition key with column data that is binary/comma-separated, etc.
You say you are on the Azure 1.7 SDK... I had problems with it and the StorageClient 2.0 library, so I used the 1.8 SDK with the StorageClient 2.0 library. Something of note (not necessarily about performance), since they may have improved the efficiency of the libraries over the last 2+ years.
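The question is framed around .NET 4.0, but the fan-out pattern itself is SDK-agnostic. Here is a rough Kotlin coroutines sketch with a purely hypothetical fetchEntity wrapper standing in for whichever table client you use; the key idea is bounding the parallelism (and, on .NET, raising the default connection limit mentioned above) so many point reads are in flight at once without exhausting connections or memory:

import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

// Hypothetical entity fetcher: wrap whichever table SDK you use
// (the 2012-era StorageClient, or a current Tables SDK) behind this.
suspend fun fetchEntity(partitionKey: String, rowKey: String): ByteArray = TODO()

// Fan out many point reads with a bounded level of parallelism so the
// client stays busy without exhausting connections or memory.
fun fetchAll(keys: List<Pair<String, String>>, parallelism: Int = 64): List<ByteArray> =
    runBlocking(Dispatchers.IO) {
        val limiter = Semaphore(parallelism)
        keys.map { (pk, rk) ->
            async { limiter.withPermit { fetchEntity(pk, rk) } }
        }.awaitAll()
    }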