Does Airflow have a maximum record limit for Variables? - airflow

I know that Variables in Airflow are stored in the internal metadata DB.
Does Airflow have a maximum record limit for Variables?
Is there no limit to the number of Variables as long as disk space is available?
I could not find a specific description in the official documents.

There is no limit to the number of variables you can store.
Variables are stored in a table in the MySQL/Postgres/MsSQL database that you use as a backend, and tables don't have a maximum number of records.
Some tables may cause performance issues as they grow larger, but the variables table probably won't be the one to cause such problems.
Should this become a concern for you, you can always use an alternative secrets backend; with one of those, connections/variables are not stored in the Airflow database but in other storage of your choice (for example Google Secret Manager, Vault, etc.). You can see the full list in this doc. You can also roll your own secrets backend if you wish.
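
As a minimal sketch of what a Variable amounts to (the key name is hypothetical), each set/get is simply one row written to or read from the variable table in the metadata DB:

    from airflow.models import Variable

    # Each call writes/reads a single row in the `variable` table of the metadata DB.
    Variable.set("my_config_key", "some value")              # hypothetical key
    value = Variable.get("my_config_key", default_var=None)
    print(value)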

Related

DolphinDB Python API: Issues with partitionColName when writing data to dfs table with compo domain

It's my understanding that the PartitionedTableAppender method of the DolphinDB Python API can implement concurrent data writes. I'm trying to write data to a DFS table with a COMPO domain where the partitions are determined by the values of "datetime" and "symbol". The data I'd like to write contains records of 150 symbols for one day. This is what I tried:
But it seems only one partitioning column can be specified in partitionColName. Please let me know if I'm going about this the wrong way.
Just specify one partitioning column in this case, even though the table uses a COMPO domain. Based on the given information, it is suggested to set partitionColName to "symbol", so that the writes can be executed concurrently. The script still works if you set it to "datetime", but the data cannot be written concurrently, because it only contains one day's records and therefore only one partition is involved.
Refer to the basic operating rules when you are using PartitionedTableAppender:
DolphinDB does not allow multiple writers to write data to one partition at the same time. Therefore, when multiple threads are writing to the same table concurrently, it is recommended to make sure each of them writes to a different partition. The Python API provides a convenient way to do this by dividing the data by partition automatically.
With DolphinDB server version 1.30 or above, we can write to DFS tables with the PartitionedTableAppender object in the Python API. The user needs to first specify a connection pool. The system obtains the partitioning information before assigning the data to the connection pool for concurrent writing. A partition can only be written to by one connection pool at one time.
Therefore, only one partitioning column needs to be specified, even for a table with a COMPO domain. Just pick a highly differentiated partitioning column so that many partitions are created and distributed across the connection pool; the data can then be written to the DFS table concurrently.
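
A minimal sketch of that setup (host, port, credentials, database path, table name, and the sample data are placeholders; the keyword names follow the DolphinDB Python API as I recall it, so verify against your client version):

    import dolphindb as ddb
    import pandas as pd

    # Connection pool with several threads so different partitions can be written in parallel.
    pool = ddb.DBConnectionPool("localhost", 8848, 8, "admin", "123456")

    # Appender keyed on the "symbol" level of the COMPO partition;
    # rows are split by symbol and dispatched to the pool automatically.
    appender = ddb.PartitionedTableAppender(
        dbPath="dfs://level_db",        # placeholder database path
        tableName="quotes",             # placeholder table name
        partitionColName="symbol",
        dbConnectionPool=pool,
    )

    df = pd.DataFrame({
        "datetime": pd.to_datetime(["2023-01-02 09:30:00"] * 3),
        "symbol": ["AAA", "BBB", "CCC"],
        "price": [10.0, 20.0, 30.0],
    })
    rows_written = appender.append(df)
    print(rows_written)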

Which of the Azure Cosmos DB types of database should I use to log simple events from my mobile application?

I would like to set up event logging for my application. Simple information such as date (YYYYMMDD), activity and appVersion. Later I would like to query this to give me some simple information such as how many times a certain activity occurred for each month.
From what I see there are a few different types of database in Cosmos, such as NoSQL and Cassandra.
Which would be the most suitable to meet my simple needs?
You can use the Cosmos DB SQL API for storing this data. It has rich querying capabilities and also great support for aggregate functions.
One thing you would need to keep in mind is your data partitioning strategy and design your container's partition key accordingly. Considering you're going to do data aggregation on a monthly basis, I would recommend creating a partition key for year and month so that data for a month (and year) stays in a single logical partition. However, please note that a logical partition can only contain 10GB data (including indexes) so you may have to rethink your partitioning strategy if you expect the data to go above 10GB.
A cheaper alternative for you would be Azure Table Storage; however, it doesn't have such rich querying capabilities and it doesn't have aggregation support. With some code (running in Azure Functions, for example), you can aggregate the data yourself.
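
A rough sketch of that layout with the azure-cosmos Python SDK (account URL, key, and all names are placeholders; the yearMonth field is just one way to implement the month-based partition key suggested above):

    from azure.cosmos import CosmosClient, PartitionKey

    client = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<your-key>")
    db = client.create_database_if_not_exists("appLogs")          # placeholder name
    container = db.create_container_if_not_exists(
        id="events",
        partition_key=PartitionKey(path="/yearMonth"),
    )

    # One event per document; yearMonth keeps a month's data in one logical partition.
    container.create_item({
        "id": "evt-0001",
        "yearMonth": "202401",
        "date": "20240115",
        "activity": "login",
        "appVersion": "1.2.3",
    })

    # How many times did "login" occur in January 2024?
    counts = list(container.query_items(
        query="SELECT VALUE COUNT(1) FROM c WHERE c.activity = @activity",
        parameters=[{"name": "@activity", "value": "login"}],
        partition_key="202401",
    ))
    print(counts[0])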

DynamoDB table structure

We are looking to use AWS DynamoDB for storing application logs. Logs from multiple components in our system would be stored here. We are expecting a lot of writes and only a minimal number of reads.
The client that we use for writing into DynamoDB generates a UUID for the partition key, but using this makes it difficult to actually search.
The most prominent search cases are:
Search based on Component / Date / Date time
Search based on JobId / File name
Search based on Log Level
From what I have read so far, using a UUID for the partition key is not suitable for our case. I am currently thinking about using either / for our partition key and an ISO 8601 timestamp as our sort key. Does this sound like a reasonable / widely used setup for such a use case?
If not, kindly suggest alternatives that can be used.
Using a UUID as the partition key will distribute the data efficiently across internal partitions, so you will be able to utilize all of the provisioned capacity.
Using a sortable (ISO format) timestamp as the range/sort key will store the data in order, so it will be possible to retrieve it in order.
However, for retrieving logs by anything other than the timestamp, you may have to create indexes (GSIs), which are charged separately.
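
A hedged sketch of what such a base table could look like with boto3 (the table, attribute, and index names are made up; the GSI on component is just one example of a "search by something else" index):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Base table: UUID partition key spreads writes; ISO 8601 timestamp sort key keeps order.
    # A GSI on (component, ts) covers "search by component and date" style queries.
    dynamodb.create_table(
        TableName="app_logs",                       # placeholder table name
        AttributeDefinitions=[
            {"AttributeName": "log_id", "AttributeType": "S"},
            {"AttributeName": "ts", "AttributeType": "S"},
            {"AttributeName": "component", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "log_id", "KeyType": "HASH"},
            {"AttributeName": "ts", "KeyType": "RANGE"},
        ],
        GlobalSecondaryIndexes=[{
            "IndexName": "component-ts-index",
            "KeySchema": [
                {"AttributeName": "component", "KeyType": "HASH"},
                {"AttributeName": "ts", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }],
        BillingMode="PAY_PER_REQUEST",
    )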
Hope your logs are precious enough to store in DynamoDB instead of CloudWatch ;)
In general, DynamoDB seems like a bad solution for storing logs:
It is more expensive than CloudWatch
It has poor querying capabilities unless you start utilising global secondary indexes, which will double or triple your expenses
Unless you use a random UUID for the hash key, you risk creating hot partitions/keys in your DB (for example, using a component ID as the primary or global secondary key might result in throttling if some component writes much more often than the others)
But assuming you already know these drawbacks and you still want to use DynamoDB, here is what I would recommend:
Use JobId or Component name as the hash key (one as the primary key, one as a GSI)
Use the timestamp as the sort key
If you need to search by log level often, you can create another local secondary index, or you can combine the level and the timestamp into a single sort key. If you only care about searching for ERROR-level logs most of the time, it might be better to create a sparse GSI for that.
Create a new table each day (let's call it the "hot table"), and only store that day's logs in it. This table will have high write throughput. Once the day is over, reduce its write throughput significantly (maybe to 0) and only leave some read capacity. This way you reduce the risk of running into the 10 GB limit per hash key that DynamoDB has (see the sketch after this list).
This approach also has an advantage in terms of log retention: it is very easy and cheap to remove logs older than X days this way. By keeping the capacity of old tables very low, you also avoid very high costs. For more complicated ad-hoc analysis, use EMR.
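
For illustration, a hedged boto3 sketch of the per-day "hot table" pattern with Component as the hash key and an ISO timestamp sort key (all table, key, and attribute names are assumptions, not anything prescribed above):

    from datetime import datetime, timezone

    import boto3
    from boto3.dynamodb.conditions import Attr, Key

    dynamodb = boto3.resource("dynamodb")

    # One table per day, e.g. "logs-2024-05-01" (the naming scheme is made up).
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    table = dynamodb.Table(f"logs-{today}")

    # Write: Component as hash key, ISO 8601 timestamp as sort key.
    table.put_item(Item={
        "component": "payment-service",
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": "ERROR",
        "job_id": "job-42",
        "message": "charge failed",
    })

    # Read: ERROR logs for one component in a time window, newest first.
    resp = table.query(
        KeyConditionExpression=Key("component").eq("payment-service")
        & Key("ts").between(f"{today}T00:00:00", f"{today}T23:59:59"),
        FilterExpression=Attr("level").eq("ERROR"),
        ScanIndexForward=False,
    )
    print(len(resp["Items"]))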

High write concurrency backend for storing large set/array based data?

The problem:
I have a web service that needs to check membership of a given string against a set of strings, where the number of elements in the set is under constant growth, potentially numbering in the hundreds of millions.
If the string is not a member of the set, it gets added to the set. The string size will be a constant 32 bytes. Only one set variable is required, no other variables need to be persisted.
This check is performed as part of a callback on a webhook, thus performance is critical.
While my use case pretty much fits a bloom filter perfectly, I'm having trouble finding a solution to deal with the persistent storage vs i/o concurrency portion of the problem.
Environment:
DigitalOcean/Linux/Python/Flask, but open to change if required
Possible Solutions:
redis, storing the variable in a set and then querying via sismember for a nice O(1) solution. This is what we are currently using, but this solution doesn't scale well with a large number of keys, given that everything must fit in memory, and it also has issues with write concurrency when traffic increases.
sqlite, with WAL mode turned on. Concerned about lock contention when the server gets hit with a significant number of webhook requests (SQLITE_BUSY). A local server file also doesn't scale across host machines.
postgres, seems like a nice middle-ground solution, but we might have to deal with lock contention here as well for write concurrency.
cassandra, given its focus on write performance. Overkill for storing a single column though?
custom bloom filter backend, not sure if something like this exists that provides the functionality of a bloom filter with a high i/o concurrency storage backend.
Thoughts?
The Redis solution can scale well with data sharding. You can set up several Redis instances (or use Redis Cluster), split your data into several parts, i.e. shards, and save each part in a different Redis instance.
When you want to check the membership of a given string, you send the sismember command to the corresponding Redis instance. Take this answer as an example of how to split data with hash functions.
Also, you can implement a bloom filter with Redis (GETBIT and SETBIT). Just a reminder: a bloom filter has a false-positive problem.
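
A rough sketch of the sharding idea with redis-py (hosts, the set name, and the number of shards are placeholders; the hash-mod scheme is just the simplest way to split keys, not something this answer prescribes):

    import hashlib

    import redis

    # One client per shard (hosts/ports are placeholders).
    shards = [
        redis.Redis(host="10.0.0.1", port=6379),
        redis.Redis(host="10.0.0.2", port=6379),
        redis.Redis(host="10.0.0.3", port=6379),
    ]

    def shard_for(value: str) -> redis.Redis:
        """Pick a shard deterministically from the value's hash."""
        idx = int(hashlib.sha256(value.encode()).hexdigest(), 16) % len(shards)
        return shards[idx]

    def seen_before(value: str) -> bool:
        """Check membership on the shard that owns this value."""
        return bool(shard_for(value).sismember("members", value))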
First, you don't need to use sismember. Just call sadd systematically and test the returned value: if it's 0, the value was already in the set and so was not added. Doing so will very easily reduce the number of requests to Redis.
Second, the description of your problem looks like a perfect match for HBase, which is made for storing very large data sets and querying them using bloom filters. But you'll probably find it's overkill, just like Cassandra.
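
A tiny sketch of the sadd-only pattern (the set name is a placeholder):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def check_and_add(value: str) -> bool:
        """Return True if the value was already in the set.

        SADD returns the number of elements actually added, so 0 means the
        member was already present; one round trip does both the membership
        test and the insert.
        """
        added = r.sadd("members", value)   # "members" is a placeholder set name
        return added == 0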

Maximum records can be stored at Riak database

Can anyone give an example of the maximum record limit in a Riak database with specific hardware details? Please help me with this. I'm going to build a CDR information system. Will Riak be a suitable choice of database?
Riak uses the 2^160 SHA-1 hash value to identify the partitions to store data in. Data is then stored in the identified partitions based on the bucket and key name. The size of the hash space is therefore not related to the amount of data that can be stored. Two different objects that happen to hash to the same value will therefore not overwrite each other.
When working with Riak, it is important to model your data correctly and consider how it needs to be retrieved and queried during the design process. Ideally you should try to ensure that the vast majority of your queries can be done through direct key access. It is often recommended to de-normalise your data and use natural keys. For CDRs this may mean creating an object holding all CDRs for a subscriber per day. These objects can be named based on the subscriber id and date, making it easy to retrieve data directly by key. It is also often more efficient to retrieve a few larger objects than many small ones and perform filtering in the application rather than try to just get the exact data that is needed. I have described this approach in greater detail here.
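
As a rough illustration of that modelling advice with the Riak Python client (the bucket name, subscriber id, and key scheme are made-up examples; the client calls follow the python-riak docs as I recall them, so verify against your client version):

    import riak

    client = riak.RiakClient(host="127.0.0.1", pb_port=8087)
    bucket = client.bucket("cdrs_daily")          # placeholder bucket name

    # Natural key: one object holds all CDRs for a subscriber on a given day.
    key = "subscriber-12345_2024-05-01"           # "<subscriber_id>_<YYYY-MM-DD>"

    obj = bucket.new(key, data={
        "subscriber_id": "subscriber-12345",
        "date": "2024-05-01",
        "cdrs": [
            {"called": "555-0100", "start": "2024-05-01T08:15:00Z", "duration_s": 120},
            {"called": "555-0199", "start": "2024-05-01T09:02:00Z", "duration_s": 45},
        ],
    })
    obj.store()

    # Direct key access: fetch the whole day's records and filter in the application.
    fetched = bucket.get(key)
    print(len(fetched.data["cdrs"]))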
The limit to the number of records (or key/value pairs) you can store in Riak is governed only by the size of the hash space: 2^160. According to WolframAlpha, this is the number:
1461501637330902918203684832716283019655932542976
In other words, go nuts. :)
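
For what it's worth, the value is easy to verify in Python:

    # 2^160, the size of Riak's SHA-1 hash space.
    print(2 ** 160)
    # 1461501637330902918203684832716283019655932542976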
