Currently, I have created the following partition scheme:
dbDate = database("", VALUE, 2010.01.01..2050.01.01)
dbSymbol = database("", HASH, [SYMBOL, 40])
db=database(db_name, COMPO, [dbDate, dbSymbol]);
When using the following code to query, I found that Local Variables also loads other date partitions:
db_name = "dfs://tick_database"
tb_name = "stock";
tb = loadTable(db_name, tb_name)
select * from tb where date(time)== 2021.08.24 and symbol == `000001.SZ
As the amount of data increases, Local Variables will soon reach the 4 GB upper limit during a select. How can I avoid this, or will the database handle this situation automatically so that there is no out-of-memory condition?
After executing tb = loadTable(db_name, tb_name), information about tb is displayed, but it shows only the partition information. Only double-clicking it loads the specific partition data from disk into memory, which lets users query a partitioned table more conveniently. loadTable only loads metadata; the result can be understood as a table object that occupies very little memory. When you then execute a SQL query against it, only the data of the partitions involved in the query is loaded into memory. You can use getMemoryStat() to compare the memory used by the system before and after.
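As a rough check of that behavior (a sketch, not part of the original answer; it reuses the database, table, and filter from the question), you can capture getMemoryStat() before and after the query and compare the two results:

// only the 2021.08.24 date partition and the hash bucket of `000001.SZ should be loaded
before = getMemoryStat()
t = select * from loadTable("dfs://tick_database", "stock") where date(time) == 2021.08.24 and symbol == `000001.SZ
after = getMemoryStat()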
I'm trying to import a rather large (~200M docs) DocumentDB database into Azure Search, but I'm finding the indexer times out after ~24 hrs. When the indexer restarts, it starts again from the beginning rather than from where it got to, meaning I can't get more than ~40M docs into the search index. The data source has a high water mark set like this:
var source = new DataSource();
source.Name = DataSourceName;
source.Type = DataSourceType.DocumentDb;
source.Credentials = new DataSourceCredentials(myEnvDef.ConnectionString);
source.Container = new DataContainer(myEnvDef.CollectionName, QueryString);
source.DataChangeDetectionPolicy = new HighWaterMarkChangeDetectionPolicy("_ts");
serviceClient.DataSources.Create(source);
The high water mark appears to work correctly when testing on a small db.
Should the high water mark be respected when the indexer fails like this, and if not, how can I index such a large data set?
The reason the indexer is not making incremental progress even while timing out after 24 hours (the 24 hour execution time limit is expected) is that a user-specified query (QueryString argument passed to the DataContainer constructor) is used. With a user-specified query, we cannot guarantee and therefore cannot assume that the query response stream of documents will be ordered by the _ts column, which is a necessary assumption to support incremental progress.
So, if a custom query isn't required for your scenario, consider not using it.
Alternatively, consider partitioning your data and creating multiple datasource / indexer pairs that all write into the same index. You can use the Datasource.Container.Query parameter to provide a DocumentDB query that partitions your data using a WHERE filter. That way, each of the indexers will have less work to do, and with sufficient partitioning, will fit under the 24 hour limit. Moreover, if your search service has multiple search units, multiple indexers will run in parallel, further increasing the indexing throughput and decreasing the overall time to index your entire dataset.
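For example (the property name and cutoff value below are purely illustrative assumptions, not part of the original answer), the first pair's Container.Query might be

SELECT * FROM c WHERE c.someRangeProperty < 1000000

and the second pair's

SELECT * FROM c WHERE c.someRangeProperty >= 1000000

so each indexer scans a disjoint slice of the collection while both write into the same index.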
I run into "stack depth limit exceeded" when trying to store a row from R in PostgreSQL. To handle bulk upserts I have been using a query like this:
sql_query_data <- sprintf("BEGIN;
CREATE TEMPORARY TABLE
ts_updates(ts_key varchar, ts_data hstore, ts_frequency integer) ON COMMIT DROP;
INSERT INTO ts_updates(ts_key, ts_data) VALUES %s;
LOCK TABLE %s.timeseries_main IN EXCLUSIVE MODE;
UPDATE %s.timeseries_main
SET ts_data = ts_updates.ts_data,
ts_frequency = ts_updates.ts_frequency
FROM ts_updates
WHERE ts_updates.ts_key = %s.timeseries_main.ts_key;
INSERT INTO %s.timeseries_main
SELECT ts_updates.ts_key, ts_updates.ts_data, ts_updates.ts_frequency
FROM ts_updates
LEFT OUTER JOIN %s.timeseries_main ON (%s.timeseries_main.ts_key = ts_updates.ts_key)
WHERE %s.timeseries_main.ts_key IS NULL;
COMMIT;",
values, schema, schema, schema, schema, schema, schema, schema)
So far this query has worked quite well for updating millions of records while keeping the number of inserts low. Whenever I ran into stack size problems, I simply split my records into multiple chunks and went on from there.
However, this strategy faces some trouble now. I don't have a lot of records anymore, but a handful in which the hstore is a little bit bigger. But it's not really 'large' by any means. I read suggestions by @Craig Ringer, who advises not to go near the 1 GB limit. So I assume the size of the hstore itself is not the problem, but I receive this message:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: stack depth limit exceeded
HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
)
EDIT: I did increase the limit to 7 MB and ran into the same error stating 7 MB is not enough. This is really odd to me, because the query itself is only 1.7 MB (I checked by pasting it into a text file). Can anybody shed some light on this?
Increase the max_stack_depth as suggested by the hint. From the [official documentation](http://www.postgresql.org/docs/9.1/static/runtime-config-resource.html):
The ideal setting for this parameter is the actual stack size limit enforced by the kernel (as set by ulimit -s or local equivalent), less a safety margin of a megabyte or so.
and
The default setting is two megabytes (2MB), which is conservatively small and unlikely to risk crashes.
Superusers can alter this setting per connection, or it can be set for all users through the postgresql.conf file (requires a PostgreSQL server restart).
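For instance (a minimal sketch; the 7 MB value just mirrors the edit above), a superuser can raise it for the current session with

SET max_stack_depth = '7MB';

or set max_stack_depth = 7MB in postgresql.conf to apply it to all connections, as described above.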
Without creating a trigger, are there any V$ views that show when either a Tablespace or datafile was last accessed or used?
To give you an idea of why: I'm looking to do some reorg, and it would be nice to know if I can take that particular object or tablespace offline.
DBA_HIST_SEG_STAT records the number of reads per tablespace per snapshot. The DBA_HIST_ tables are only periodically refreshed, normally once per hour. To retrieve the latest data, a very similar query using V$SEGMENT_STATISTICS would need to be UNIONed to the query below.
Finding the information per data file is trickier. That information is in DBA_HIST_ACTIVE_SESS_HISTORY, usually in the column P1 when P1TEXT = 'file#'. But that information is only a sample; it's very possible that a single read of a data file would not be captured.
Note that using the DBA_HIST_ tables requires the Diagnostics Pack license.
select name, begin_interval_time, end_interval_time, sum(logical_reads_delta)
from dba_hist_seg_stat
join dba_hist_snapshot using (snap_id, dbid, instance_number)
join v$tablespace using (ts#)
group by v$tablespace.name, begin_interval_time, end_interval_time
having sum(logical_reads_delta) > 0
order by v$tablespace.name, begin_interval_time desc
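A hedged sketch of the V$SEGMENT_STATISTICS counterpart mentioned above (covering activity since instance startup rather than per AWR snapshot) could look like this:

select tablespace_name, sum(value) as logical_reads
from v$segment_statistics
where statistic_name = 'logical reads'
group by tablespace_name
having sum(value) > 0
order by tablespace_name

and the sampled per-data-file information could be pulled from ASH with something like

select p1 as file#, count(*) as samples
from dba_hist_active_sess_history
where p1text = 'file#'
group by p1
order by samples desc

keeping in mind, as noted above, that ASH is only a sample and may miss individual reads.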
Background:
I'm working on a SQLite tile cache database (similar to the MBTiles specification), consisting for now of just a single table Tiles with the following columns:
X [INTEGER] - horizontal tile index (not map coordinate)
Y [INTEGER] - vertical tile index (not map coordinate)
Z [INTEGER] - zoom level of a tile
Data [BLOB] - stream with a tile image data (currently PNG images)
All the coordinate-to-tile calculations are done in the application, so the SQLite R*Tree module and the corresponding TADSQLiteRTree class are of no use to me. All I need is to load the Data field blob stream of a record found by given X, Y, Z values as fast as possible.
Besides this database, the application will also have a memory cache implemented by a hash table, like this TTileCache type:
type
TTileIdent = record
X: Integer;
Y: Integer;
Z: Integer;
end;
TTileData = TMemoryStream;
TTileCache = TDictionary<TTileIdent, TTileData>;
The workflow when asking for a certain tile (with the X, Y, Z values already calculated) will be simple. I will first ask the memory cache for the tile (the cache is partially filled from the above table at application startup); if the tile is not found there, I will ask the database, and if it is not found there either, download it from the tile server.
Question:
Which AnyDAC (FireDAC) component(s) would you use for frequent querying of 3 integer column values in a SQLite table (with, let's say, 100k records), with optional loading of the found blob stream?
Would you use:
a query type component (I'd say repeatedly executing the same prepared query should be efficient, shouldn't it?)
a memory table (I'm afraid of its size, since there might be several GB stored in the tiles table, or is it somehow streamed, for instance?)
something different?
Definitely use TADQuery. Unless you set the query to Unidirectional, it will buffer all the records returned from the database in memory (fetching them in rowsets of 50 by default). Since you are dealing with blobs, your query should be written to retrieve the minimum number of records you need.
Use a parameterized query, like the following:
SELECT * FROM ATable
WHERE X = :X AND Y = :Y AND Z = :Z
Once you have initially opened the query, you can change the parameters, then use the Refresh method to retrieve the next record.
A memory table could not be used to retrieve data from the database; it would have to be populated via a query. It could be used to replace your TTileCache records, but I would not recommend it, because it would have more overhead than your memory cache implementation.
I would use TFDQuery with a query like the following. Assuming you're about to display the fetched tiles on a map, you may consider fetching all tiles for the missing (non-cached) tile area at once, rather than always fetching tiles one by one for your tile grid:
SELECT
X,
Y,
Data
FROM
Tiles
WHERE
(X BETWEEN :HorzMin AND :HorzMax) AND
(Y BETWEEN :VertMin AND :VertMax) AND
(Z = :Zoom)
For the above query I would consider excluding fiBlobs from the FetchOptions to save some I/O time in cases when the user moves the map view whilst you're reading tiles from the result set and the requested area is out of the visible view (you stop reading and never read the rest of them).
I have an SQLite3 database with a table whose primary key consists of two integers, and I'm trying to insert lots of data into it (i.e. around 1 GB or so).
The issue I'm having is that creating the primary key also implicitly creates an index, which in my case bogs down inserts to a crawl after a few commits (and that would be because the database file is on NFS... sigh).
So, I'd like to somehow temporarily disable that index. My best plan so far involved dropping the primary key's automatic index, however it seems that SQLite doesn't like it and throws an error if I attempt to do it.
My second best plan would involve the application making transparent copies of the database on the network drive, making modifications and then merging them back. Note that, as opposed to most SQLite/NFS questions, I don't need access concurrency.
What would be a correct way to do something like that?
UPDATE:
I forgot to specify the flags I'm already using:
PRAGMA synchronous = OFF
PRAGMA journal_mode = OFF
PRAGMA locking_mode = EXCLUSIVE
PRAGMA temp_store = MEMORY
UPDATE 2:
I'm in fact inserting items in batches; however, each successive batch is slower to commit than the previous one (I'm assuming this has to do with the size of the index). I tried batches of between 10k and 50k tuples, each one being two integers and a float.
You can't remove the implicit index, since it's the only address of a row.
Merge your 2 integer keys into a single long key = (key1 << 32) + key2 and make this the INTEGER PRIMARY KEY in your schema (in that case you will have only 1 index); see the sketch after this list.
Set the page size for the new DB to at least 4096.
Remove ANY additional index except the primary key.
Insert the data in SORTED order so that the primary key is always growing.
Reuse prepared commands; don't create them from a string each time.
Set the page cache size to as much memory as you have left (remember that the cache size is in number of pages, not number of bytes).
Commit every 50000 items.
If you have additional indexes, create them only AFTER ALL the data is in the table.
If you are able to merge the keys (I think you're using 32-bit keys, while SQLite uses 64-bit integers, so it's possible) and fill in the data in sorted order, I bet you will fill your first GB with the same performance as the second, and both will be fast enough.
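A minimal sketch of the merged-key layout described above (the table and column names are made up for illustration; the value column matches the two-integers-and-a-float tuples mentioned in the question):

-- run on a fresh database, before any table is created
PRAGMA page_size = 4096;
-- cache size is given in pages; 100000 pages of 4096 bytes is roughly 400 MB
PRAGMA cache_size = 100000;

CREATE TABLE points (
    id  INTEGER PRIMARY KEY,  -- (key1 << 32) + key2, the only index
    val REAL
);

-- insert in ascending id order, committing every ~50000 rows
BEGIN;
INSERT INTO points (id, val) VALUES ((1 << 32) + 1, 0.5);
INSERT INTO points (id, val) VALUES ((1 << 32) + 2, 0.7);
COMMIT;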
Are you doing the INSERT of each new row as an individual transaction?
If you use BEGIN TRANSACTION and INSERT rows in batches, then I think the index will only get rebuilt at the end of each transaction.
See faster-bulk-inserts-in-sqlite3.