DolphinDB Python API: Issues with partitionColName when writing data to a DFS table with a COMPO domain

It's my understanding that the PartitionedTableAppender method of the DolphinDB Python API can implement concurrent data writes. I'm trying to write data to a DFS table with a COMPO domain, where the partitions are determined by the values of "datetime" and "symbol". The data I'd like to write contains records for 150 symbols on a single day. I tried specifying both partitioning columns, but it seems only one partitioning column can be specified in partitionColName. Please let me know if I'm going about this the wrong way.

Only one partitioning column should be specified in this case, even though the table uses a COMPO domain. Based on the given information, it is suggested to set partitionColName to "symbol", so the writes can be performed concurrently. The script still works if you set it to "datetime", but the data cannot be written concurrently, because it contains only one day's records and therefore falls into a single partition.
Refer to the basic rules for using PartitionedTableAppender:
DolphinDB does not allow multiple writers to write data to one partition at the same time. Therefore, when multiple threads are writing to the same table concurrently, it is recommended to make sure each of them writes to a different partition. The Python API provides a convenient way of doing this by dividing the data by partition automatically. With DolphinDB server version 1.30 or above, we can write to DFS tables with the PartitionedTableAppender object in the Python API. The user needs to first specify a connection pool. The system obtains the partitioning information and then assigns the data to the connection pool for concurrent writing. A partition can only be written to by one connection pool at one time.
Therefore, only one partitioning column needs to be specified even for a table with a COMPO domain. Just pick a partitioning column with many distinct values, so that the data is split into many partitions that are spread over the connection pool; the data can then be written to the DFS table concurrently.
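For reference, here is a minimal sketch of the suggested approach; the host, port, credentials, database path, table name, and schema below are placeholders rather than details from the original question:

    import dolphindb as ddb
    import pandas as pd

    # Connection pool used for concurrent writes (placeholder host/port/credentials).
    pool = ddb.DBConnectionPool("localhost", 8848, 8, "admin", "123456")

    # One day's records for many symbols (toy data for illustration).
    df = pd.DataFrame({
        "datetime": pd.to_datetime(["2022-06-01 09:30:00"] * 3),
        "symbol": ["AAA", "BBB", "CCC"],
        "price": [10.0, 20.0, 30.0],
    })

    # Even though the database uses a COMPO domain (datetime + symbol), only one
    # partitioning column is passed. Using "symbol" splits the DataFrame into many
    # per-symbol groups that can be written concurrently over the pool.
    appender = ddb.PartitionedTableAppender("dfs://compoDB", "quotes", "symbol", pool)
    rows_written = appender.append(df)
    print(rows_written)

    pool.shutDown()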

Related

Does Airflow have a maximum record limit for Variables?

I know that Variables in Airflow are stored in the internal metadata DB.
Does Airflow have a maximum record limit for Variables?
Is there no limit to the number of variables as long as disk space is available?
I could not find a specific description in the official documents.
There is no limit to the number of variables you can store.
Variables are kept in a table in the MySQL/Postgres/MSSQL database that you use as the backend, and tables don't have a maximum number of records.
Some tables may cause performance issues as they grow, but the variables table probably won't be the one to cause such problems.
Should this be a concern for you, you can always use an alternative secrets backend; with this, the connections/variables will not be stored in the Airflow database but in other storage of your choice (for example, Google Secret Manager, Vault, etc.). You can see the full list in this doc. You can also roll your own secrets backend if you wish.
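If it helps to see what a Variable really is, the short sketch below sets and reads one against a configured Airflow metadata database; the key name is made up. Each call maps to a single row insert/update or lookup in the variable table.

    from airflow.models import Variable

    # Each Variable is simply one row in the `variable` table of the metadata DB.
    Variable.set("report_batch_size", "500")                    # insert/update one row
    batch_size = Variable.get("report_batch_size", default_var="100")
    print(batch_size)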

Highly efficient key-value storage in C

What would be the most efficient key-value pair storage algorithm that fulfills the following design goals?
Data is stored on disk to avoid data loss if the software / computer does not shut down normally
Data is read/written by a single application
Runs on desktops so it needs to use the minimal amount of memory, processing and storage so as to minimize the impact to computer performance for the user
Must support multiple (2 to 3) inserts/updates per second
Expect to have only a few thousand total records but some of these records will be updated frequently (i.e. many times more updates than inserts)
Data is only retrieved a few (2 to 3) times a day
Data written and retrieved using a single primary numeric key (i.e. short)
Will need to frequently update a secondary field (i.e. “key” or “column”) that is used for filtering and sorting the data on retrieval
The primary key cannot be changed (i.e. does not need to be changed)
Records do not need to be removed (i.e. deleting not supported)
The application will also store unstructured (i.e. whatever desired by the user) data associated with the keys (primary and secondary)
The data associated with the keys can be updated
Data is always retrieved as an ordered list, either:
a. Starting from the beginning, or
b. Filtered by the secondary key
Planning to use this in a C application. The primary criterion for selection is that it be as lightweight and fast as possible.
For C/C++ I'm aware of these available options:
https://www.sqlite.org/
https://github.com/erthink/libmdbx
After a lot of research, I decided to use LevelDB for this solution. It was easy to build as a static library, and very simple to use within my code. It is super fast and has small file sizes.

Maximum records that can be stored in a Riak database

Can anyone give an example of the maximum record limit in a Riak database with specific hardware details? Please help me with this. I'm going to build a CDR information system. Will Riak be suitable as my database?
Riak hashes the bucket and key name into a 2^160 SHA-1 hash space to identify the partitions in which to store data. The data is then stored in the identified partitions based on the bucket and key name. The size of the hash space is therefore not related to the amount of data that can be stored, and two different objects that happen to hash to the same value will not overwrite each other.
When working with Riak, it is important to model your data correctly and consider how it needs to be retrieved and queried during the design process. Ideally you should try to ensure that the vast majority of your queries can be done through direct key access. It is often recommended to de-normalise your data and use natural keys. For CDRs this may mean creating an object holding all CDRs for a subscriber per day. These objects can be named based on the subscriber id and date, making it easy to retrieve data directly by key. It is also often more efficient to retrieve a few larger objects than many small ones and perform filtering in the application rather than try to just get the exact data that is needed. I have described this approach in greater detail here.
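A rough sketch of that keying scheme with the Python Riak client might look like the following; the bucket name, subscriber id, and connection details are made up:

    import json
    import riak

    client = riak.RiakClient(protocol="pbc", host="127.0.0.1", pb_port=8087)
    bucket = client.bucket("cdrs_daily")

    # Natural key: one object per subscriber per day, holding all of that day's CDRs.
    subscriber_id = "46700123456"
    day = "2015-03-01"
    key = "{}_{}".format(subscriber_id, day)

    cdrs = [
        {"called": "46700654321", "start": "2015-03-01T08:12:44Z", "duration_s": 137},
        {"called": "46700999888", "start": "2015-03-01T11:03:02Z", "duration_s": 41},
    ]

    bucket.new(key, data=cdrs).store()   # stored as JSON by default

    # Later retrieval is a direct key access, no index or MapReduce required.
    daily_cdrs = bucket.get(key).data
    print(json.dumps(daily_cdrs, indent=2))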
The limit to the number of records (or key/value pairs) you can store in Riak is governed only by the size of the hash space: 2^160. According to WolframAlpha, this is the number:
1461501637330902918203684832716283019655932542976
In other words, go nuts. :)

Riak and time-sorted records

I'd like to sort some records, stored in Riak, by a function of each record's score and "age" (current time - creation date). What is the best way to do a "time-sensitive" query in Riak? Thus far, the options I'm aware of are:
Realtime mapreduce - Do the entire calculation in a mapreduce job, at query-time
ETL job - Periodically do the query in a background job, and store the result back into riak
Punt it to the app layer - Don't sort using Riak at all, and instead use an application-level layer to sort and cache the records.
MapReduce seems the best on paper; however, I've read mixed reports about the real-world latency of Riak MapReduce.
MapReduce is quite an expensive operation and is not recommended as a real-time querying tool. It works best when run over a limited set of data in batch mode, where the number of concurrent MapReduce jobs can be controlled, so I would not recommend the first option.
Having a process periodically process/aggregate data for a specific time slice, as described in the second option, could work and would allow efficient access to the prepared data through direct key access. If you are using the LevelDB backend, the aggregation process could be based around a secondary index holding a timestamp. One downside, however, could be that newly inserted records may not show up in the results immediately, which may or may not be a problem in your scenario.
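As a sketch of that second option with the Python client and the LevelDB backend's secondary indexes (bucket, index, and key names are made up):

    import time
    import riak

    client = riak.RiakClient(protocol="pbc", host="127.0.0.1", pb_port=8087)
    bucket = client.bucket("scored_records")

    # On write: store the creation time in the value and mirror it in an integer 2i index.
    created = int(time.time())
    obj = bucket.new("record-0001", data={"score": 42, "created": created})
    obj.add_index("created_int", created)
    obj.store()

    # Periodic job: range-query the index for the last hour, rank by score/age in the
    # application, and store the prepared result back under a well-known key.
    now = int(time.time())
    recent_keys = list(bucket.get_index("created_int", now - 3600, now))

    ranked = []
    for k in recent_keys:
        rec = bucket.get(k).data
        age = max(now - rec["created"], 1)
        ranked.append((rec["score"] / float(age), k))
    ranked.sort(reverse=True)

    bucket.new("ranking_last_hour", data=[k for _, k in ranked]).store()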
If you need the computed records to be accurate and will perform a significant number of these queries, you may be better off updating the computed summary records as part of the writing and updating process.
In general it is a good idea to make sure that you can get the data you need as efficiently as possible, preferably through direct key access, and then perform filtering of data that is not required, as well as sorting and aggregation, on the application side.

Passing whole dataset to stored procedure in MSSQL 2005

How do I pass a dataset object to a stored procedure? The dataset comprises multiple tables and I'll need to be able to access them from within the SQL.
You can use a table-valued parameter to pass a single table in SQL Server 2008: http://msdn.microsoft.com/en-us/library/bb675163.aspx
or
refer to this article and use a SQL CLR procedure to pass a dataset: http://blogs.msdn.com/b/jpapiez/archive/2005/09/26/474059.aspx
It looks like you can do this with SQL Server 2008 or newer (at least with a DataTable). Here are the links:
http://www.eggheadcafe.com/community/aspnet/10/10138579/passing-dataset-to-stored-procedure.aspx
http://www.sqlteam.com/article/sql-server-2008-table-valued-parameters
As the article from MusiGenesis' answer states
In SQL Server 2005 and earlier, it is not possible to pass a table variable as a parameter to a stored procedure. When multiple rows of data need to be sent to SQL Server, developers either had to send one row at a time or come up with other workarounds to meet requirements. While a VB.Net developer recently informed me that there is a SqlBulkCopy object available in .Net to send multiple rows of data to SQL Server at once, the data still cannot be passed to a stored proc.
At the risk of stating the obvious, here are two more approaches:
Parametrize your processing procedure
You might re-evaluate whether you truly need to pass a general table variable. While sometimes this cannot be avoided, the fact that this was a late addition to MS SQL Server's feature set is partly because you can usually get around it by structuring your stored procedures and the flow of your data processing.
If you are able to 'parametrize' your process, then you should be able to let stored procedures retrieve the full dataset based on a limited number of parameters.
This will make the process less flexible, but it will also make it more controlled, which is not a bad thing (much as a database that interfaces with applications only at the level of stored procedures is more robust, this approach, by limiting flexibility, reduces the number of possible cases and consequently the number of possibly unhandled cases, read: security holes and general bugs).
Temp tables
Besides the above, there is always the temp-table approach, which can be more or less complicated depending on the scope of sharing you need for the data (sharing can be between DB users, app users, connections, processes, etc.).
A nice side effect is that such an approach allows persistence of the process (which brings you closer to having undo, redo, and the ability to continue interrupted work).
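For what it's worth, here is roughly what the temp-table route can look like from Python with pyodbc; the connection string, table definition, and procedure name are placeholders, and the stored procedure is assumed to read from #staging on the same connection:

    import pyodbc

    # Placeholder connection string; adjust driver, server, and database for your setup.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes;"
    )
    cur = conn.cursor()

    # 1. Create a session-scoped temp table that the procedure can see on this connection.
    cur.execute("CREATE TABLE #staging (id INT PRIMARY KEY, name NVARCHAR(100), amount DECIMAL(10, 2))")

    # 2. Bulk-load the "dataset" rows into it.
    rows = [(1, "alpha", 10.50), (2, "beta", 22.00), (3, "gamma", 7.25)]
    cur.executemany("INSERT INTO #staging (id, name, amount) VALUES (?, ?, ?)", rows)

    # 3. Call the procedure, which reads from #staging (hypothetical procedure name).
    cur.execute("EXEC dbo.usp_process_staging")
    conn.commit()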
