How to insert an element into the middle of a JSON array in SQLite?

I found the json_insert function in the JSON section of the SQLite documentation, but it does not seem to work the way I expected.
e.g. select json_insert('[3,2,1]', '$[3]', 4) as result;
The result column returns '[3,2,1,4]', which is correct.
But for select json_insert('[3,2,1]', '$[1]', 4) as result;
I am expecting something like '[3,2,4,1]' to be returned, instead of '[3,2,1]'.
Am I missing something? I don't see an alternative to json_insert.
P.S. I am trying this on https://sqlime.org/#demo.db; the SQLite version is 3.37.2.

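The documentation states that json_insert() will not overwrite values ("Overwrite if already exists? - No"). Because '$[1]' already exists in your array, the call leaves the array unchanged; only a path just past the end, such as '$[3]', appends a value. That means you can't use it to insert elements in the middle of the array.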
My interpretation: The function is primarily meant to insert keys into an object, where this kind of behavior makes more sense - not changing the length of an array is a sacrifice for consistency.
You could shoehorn it into SQLite by turning the JSON array into a table, appending your element, sorting the result, and turning it all back into a JSON array:
select json_group_array(x.value) from (
select key, value from json_each('[3,2,1]')
union
select 1.5, 4 -- key 1.5 sorts after index 1 and before index 2
order by 1
) x
This will produce '[3,2,4,1]'.
But you can probably see that this won't scale, and even if there were a built-in function that did this for you, it wouldn't scale either: the whole JSON string has to be parsed and rebuilt on every change, and that kind of string manipulation is slow. It might work well enough for one-offs, or when done infrequently.
In the long run, I would recommend properly normalizing your database structure instead of storing "non-blob" data in JSON blobs. Manipulating normalized data is much easier than manipulating JSON, not to mention faster by probably orders of magnitude.
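If the array can be modified in application code instead of in SQL, a different route is to read the JSON text, insert the element there, and write the text back. The following is only a minimal sketch of that alternative, not the SQL approach from above; the table `docs`, its columns, and the helper name `insert_into_json_array` are hypothetical.

```python
import json
import sqlite3


def insert_into_json_array(conn, row_id, index, element):
    """Read the JSON text, insert the element in Python, write the text back."""
    (text,) = conn.execute(
        "SELECT data FROM docs WHERE id = ?", (row_id,)
    ).fetchone()
    items = json.loads(text)
    items.insert(index, element)  # everything from `index` on shifts right
    new_text = json.dumps(items, separators=(",", ":"))
    conn.execute("UPDATE docs SET data = ? WHERE id = ?", (new_text, row_id))
    return new_text


# Hypothetical table holding one JSON array per row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute("INSERT INTO docs VALUES (1, '[3,2,1]')")

result = insert_into_json_array(conn, 1, 2, 4)
print(result)  # [3,2,4,1]
```

This still rewrites the whole array on every change, so the scaling concern above applies here too.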

Related

How to insert large number of nodes into Neo4J

I need to insert about 1 million nodes into Neo4j. Each node must be unique, so every time I insert a node I have to check that the same node does not exist yet. The relationships must be unique as well.
I'm using Python and Cypher:
uq = 'CREATE CONSTRAINT ON (a:ipNode8) ASSERT a.ip IS UNIQUE'
...
queryProbe = 'MERGE (a:ipNode8 {ip:"' + prev + '"})'
...
queryUpdateRelationship= 'MATCH (a:ipNode8 {ip:"' + prev + '"}),(b:ipNode8 {ip:"' + next + '"}) MERGE (a)-[:precede]->(b)'
The problem is that after putting 40-50K nodes into Neo4j, the insertion speed drops off quickly and I cannot insert anything else.
Your question is quite open ended. In addition to @InverseFalcon's recommendations, here are some other things you can investigate to speed things up.
Read the Performance Tuning documentation, and follow the recommendations. In particular, you might be running into memory-related issues, so the Memory Tuning section may be very helpful.
Your Cypher query(ies) can probably be sped up. For instance, if it makes sense, you can try something like the following. The data parameter is expected to be a list of objects having the format {a: 123, b: 234}. You can make the list as long as appropriate (e.g., 20K) to avoid running out of memory on the server while it processes the list within a single transaction. (This query assumes that you also want to create b if it does not exist.)
UNWIND {data} AS d
MERGE (a:ipNode8 {ip: d.a})
MERGE (b:ipNode8 {ip: d.b})
MERGE (a)-[:precede]->(b)
There are also periodic execution APOC procedures that you might be able to use.
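On the client side, the list for the data parameter could be built in batches along these lines. This is only a sketch: the helper name `batched_params` is made up, the batch size is whatever works for your server, and the query text uses the `$data` parameter syntax of newer Neo4j versions in place of `{data}`. Sending each batch to the server is left as a comment, since it depends on the driver you use.

```python
from itertools import islice

# Same query as above, written with the newer $data parameter syntax.
UNWIND_QUERY = """
UNWIND $data AS d
MERGE (a:ipNode8 {ip: d.a})
MERGE (b:ipNode8 {ip: d.b})
MERGE (a)-[:precede]->(b)
"""


def batched_params(pairs, batch_size=20000):
    """Yield lists of {a, b} maps, each at most batch_size long."""
    it = iter(pairs)
    while True:
        chunk = [{"a": a, "b": b} for a, b in islice(it, batch_size)]
        if not chunk:
            return
        yield chunk


pairs = [("10.0.0.1", "10.0.0.2"), ("10.0.0.2", "10.0.0.3"), ("10.0.0.3", "10.0.0.4")]
batches = list(batched_params(pairs, batch_size=2))

# Assumption: with the official Python driver, each batch would be sent as
# one transaction, roughly: session.run(UNWIND_QUERY, data=batch)
for batch in batches:
    print(len(batch), batch[0])
```

Passing the values as a parameter also avoids building the Cypher text by string concatenation.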
For mass inserts like this, it's best to use LOAD CSV with periodic commit or the import tool.
I believe it's also best practice to use a parameterized query instead of appending values into a string.
Also, you created a unique property constraint on :ipNode8, but not :ipNode, which is the first one you MERGE. Seems like you'll need a unique constraint for that one too.

Delete a key-value pair in BerkeleyDB

Is there any way to delete a key-value pair where the key starts with sub-string1 and ends with sub-string2 in BerkeleyDB without iterating through all the keys in the DB?
For ex:
$sub1 = "B015";
$sub2 = "5646";
I want to delete
$key = "B015HGUJJ75646"
Note: It is guaranteed that there will be only one key for the combination of $sub1 and $sub2.
This can be done by taking an iterator over the DB and checking every key against the condition, but that will be very inefficient for large DBs. Is there any way to do it without iterating through the complete DB?
If you're using a RECNO database, you're probably out of luck. But, if you can use a BTREE, you have a couple of options.
First, and probably easiest, is to iterate over only the portion of the database that makes sense. Assuming you're using the default key comparison function, you can use DB_SET_RANGE to position the starting cursor (iterator) at the first key that is greater than or equal to your partial key string. In your example, this might be "B0150000000000". You then scan forwards with DB_NEXT, looking at each key in turn. When you either find the key you're looking for or reach a key that doesn't start with "B015", you're done.
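The shape of that scan can be sketched without BerkeleyDB at all. In the sketch below a sorted Python list stands in for the BTREE, `bisect_left` stands in for DB_SET_RANGE, and stepping through the list stands in for DB_NEXT; the function name `find_key` and the sample keys (other than the one from the question) are made up.

```python
from bisect import bisect_left


def find_key(sorted_keys, prefix, suffix):
    """Return the key that starts with prefix and ends with suffix, or None."""
    # Stand-in for DB_SET_RANGE: first key >= prefix.
    i = bisect_left(sorted_keys, prefix)
    # Stand-in for DB_NEXT: walk forward while still inside the prefix range.
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        if sorted_keys[i].endswith(suffix):
            return sorted_keys[i]
        i += 1
    return None  # left the prefix range without a match


keys = sorted(["A999XX5646", "B015AAAA0000", "B015HGUJJ75646", "B016ZZ5646"])
found = find_key(keys, "B015", "5646")
print(found)  # B015HGUJJ75646
```

Only the keys sharing the prefix are visited, so the cost depends on how many keys start with "B015" rather than on the size of the database.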
Another technique that could be applicable to your situation is to redefine the key comparison function. If, as you state, there is only one key for each combination of $sub1 and $sub2, then perhaps you only need to compare those sections of the keys to guarantee uniqueness? Here's an example of a full string comparison (I'm assuming you're using Perl, just from the syntax you supplied above) from https://www2.informatik.hu-berlin.de/Themen/manuals/perl/DB_File.html :
sub Compare
{
my ($key1, $key2) = @_ ;
"\L$key1" cmp "\L$key2" ;
}
$DB_BTREE->{compare} = 'Compare' ;
So, if you can rig things such that you're only comparing the starting and ending four characters, you should be able to drop the database iterator directly onto the key you're interested in.

Dynamically generating pzPVStream from a view

Is there some way to create a view that returns a pzPVStream that can be natively parsed by Pega when it executes an RDB?
For instance, maybe a query (in MS SQL Server) that resembled:
SELECT test_tbl_outer.ID, (
select *, 'My-Int-TestClass' as "pxObjClass"
from {class:My-Int-TestClass} as test_tbl_inner
where test_tbl_inner.ID=test_tbl_outer.ID
FOR XML RAW('pagedata'), TYPE, ELEMENTS
) as pzPVStream
from {class:My-Int-TestClass} as test_tbl_outer
This gets an invalid signature error (the SQL query does work directly however), and if I try to shove a signature string onto the column ('PR6d' or previous) I just get a different error regarding headers.
So at this point, I do realize that the pzPVstream is not stored as xml but as some sort of packed & compressed string. Is there a way for me to create a valid pzPVstream on the fly? Maybe something similar to what pr_read_from_stream does but in reverse?
The use case is that we'd like to pull a whole mess of data from an existing data warehouse. And it would be nice if we could pull all the multi-value data (many,many joins deep) over in one trip. We are not too concerned with the size of this object as we plan on pulling this data one way or another.
The pzPVStream is a compressed blob that represents a work object. It is stored as a single column in a table.
When it is read using obj-browse or obj-open activities, the blob is decompressed and all the encompassed properties are mapped to the clipboard.
This value has a proprietary format; the values are obfuscated.

Dynamic query and caching

I have two problem sets. What I am preferably looking for is a solution which combines both.
Problem 1: I have a table of, let's say, 20 rows (table 1). I am reading 150,000 rows from another table (table 2). For each row read from table 2, I have to match it against a specific row of table 1, comparing only a few columns rather than the whole row (e.g. table2.col1 = table1.col && table2.col2 = table1.col2). Is there a way to cache table 1 so that I don't have to query it again and again?
Problem 2: I want to generate the query string dynamically, i.e. if parameter 2 is null, leave it out of the WHERE clause. The only option I see is EXECUTE IMMEDIATE, which will be very slow.
So what I am asking is: how can I use a dynamic query to compare against table 1? Any ideas?
For problem 1, as mentioned in the comments, let the database handle it. That's what it does really well. If it is something being hit often, then the blocks for the table should remain in the database buffer cache if the buffer cache is sized appropriately. Part of DBA tuning would be to identify appropriate sizing, pinning tables into the "keep" pool, etc. But probably not something that needs worrying over.
If the desire is just to simplify writing the queries rather than performance, then views or stored procs can simplify the repetitive use of the join.
For problem 2, a query in a format like this might work for you:
SELECT id, val
FROM myTable
WHERE filter = COALESCE(v_filter, filter)
If the input parameter v_filter is null, then just automatically match the existing column. This assumes the existing filter column itself is never null (since you can't use = for null comparisons). Also, it assumes that there are other indexed portions in the WHERE clause since a function like COALESCE isn't going to be able to take advantage of an index.
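Here is a small self-contained sketch of that pattern. It runs against an in-memory SQLite database purely so it can be tried anywhere; the sample rows and the helper name `fetch` are made up, and the bind parameter plays the role of v_filter.

```python
import sqlite3

# One static statement covers both the "filter given" and "filter is null" cases.
QUERY = "SELECT id, val FROM myTable WHERE filter = COALESCE(?, filter) ORDER BY id"


def fetch(conn, v_filter=None):
    """Return (id, val) rows; a None filter matches every row."""
    return conn.execute(QUERY, (v_filter,)).fetchall()


conn = sqlite3.connect(":memory:")
# filter is declared NOT NULL, matching the assumption stated above.
conn.execute(
    "CREATE TABLE myTable (id INTEGER PRIMARY KEY, val TEXT, filter TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "b", "y"), (3, "c", "x")],
)

print(fetch(conn, "x"))  # [(1, 'a'), (3, 'c')]
print(fetch(conn))       # [(1, 'a'), (2, 'b'), (3, 'c')]
```

Because the statement text never changes, no dynamic SQL is needed for the optional parameter.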
For problem 1 you just join the tables. If there is an equijoin and one table is quite small and the other large then you're likely to get a hash join. This is effectively a caching mechanism, and the total cost of reading the tables and performing the join is only very slightly higher than that of reading the tables (as long as the hash table fits in memory).
It does not make a difference if the query is constructed and run through execute immediate -- the RDBMS hash join will still act as an effective cache.
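To see why a hash join behaves like a cache, here is a rough Python picture of its two phases. It only illustrates the idea and is not how any particular RDBMS implements it; the function name `hash_join`, the column names, and the sample rows are invented, and the small table is assumed to have unique join keys.

```python
def hash_join(small_rows, large_rows, key_columns):
    """Yield (large_row, small_row) pairs that agree on key_columns."""
    # Build phase: the small table is read once into an in-memory lookup.
    lookup = {}
    for row in small_rows:
        lookup[tuple(row[c] for c in key_columns)] = row
    # Probe phase: each large-table row costs one dictionary lookup.
    for row in large_rows:
        match = lookup.get(tuple(row[c] for c in key_columns))
        if match is not None:
            yield row, match


table1 = [
    {"col1": 1, "col2": "a", "label": "first"},
    {"col1": 2, "col2": "b", "label": "second"},
]
table2 = [
    {"col1": 2, "col2": "b", "amount": 10},
    {"col1": 9, "col2": "z", "amount": 20},
    {"col1": 1, "col2": "a", "amount": 30},
]

joined = [
    (big["amount"], small["label"])
    for big, small in hash_join(table1, table2, ("col1", "col2"))
]
print(joined)  # [(10, 'second'), (30, 'first')]
```

The small table is read exactly once, which is the effect the question was trying to get from a manual cache.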

SQLite - Get a specific row index for a Sorted/Filtered Query

I'm creating a caching system to take data from an SQLite database table using a sorted/filtered query and display it. The tables I'm pulling from can be potentially very large and, of course, I need to minimize impact on memory by only retaining a maximum number of rows in memory at any given time. This is easily done by using LIMIT and OFFSET to load only the records I need and update the cache as needed. Implementing this is trivial. The problem I'm having is determining where the insertion index is for a new record inserted into a particular query so I can update my UI appropriately. Is there an easy way to do this? So far the ideas I've had are:
1. Dump the entire cache, re-count the query results (there's no guarantee the new row will be included), refresh the cache and refresh the entire UI. I hope it's obvious why that's not really desirable.
2. Use my own algorithm to determine whether the new row is included in the current query, whether it is included in the current cached results, and at what index it should be inserted if it's within the current cached scope. The biggest downside of this approach is its complexity and the risk that my own sorting/filtering algorithm won't match SQLite's.
Of course, what I want is to be able to ask SQLite: Given 'Query A' what is the index of 'Row B', without loading the entire query results. However, so far I haven't been able to find a way to do this.
I don't think it matters, but this is all occurring on an iOS device, using the Objective-C programming language.
More Info
The Query and subsequent cache is based off of user input. Essentially the user can re-sort and filter (or search) to alter the results they're seeing. My reticence in simply recreating the cache on insertions (and edits, actually) is to provide a 'smoother' UI experience.
I should point out that I'm leaning toward option "2" at the moment. I played around with creating my own caching/indexing system by loading all the records in a table and performing the sort/filter in memory using my own algorithms. So much of the code needed to determine whether and/or where a particular record is in the cache is already there, so I'm slightly predisposed to use it. The danger lies in having a cache that doesn't match the underlying query. If I include a record in the cache that the query wouldn't return, I'll be in trouble and probably crash.
You don't need record numbers.
Save the values of the ordered field in the first and last records of the LIMITed query result.
Then you can use these to check whether the new record falls into this range.
In other words, assuming that you order by the Name field, and that the original query was this:
SELECT Name, ...
FROM mytab
WHERE some_conditions
ORDER BY Name
LIMIT x OFFSET y
then try to get at the new record with a similar query:
SELECT 1
FROM mytab
WHERE some_conditions
AND PrimaryKey = LastInsertedValue
AND Name BETWEEN CachedMin AND CachedMax
Similarly, to find out before (or after) which record the new record was inserted, start directly after the inserted record and use a limit of one, like this:
SELECT Name
FROM mytab
WHERE some_conditions
AND Name > MyInsertedName
AND Name BETWEEN CachedMin AND CachedMax
ORDER BY Name
LIMIT 1
This doesn't give you a number; you still have to check where the returned Name is in your cache.
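Put together, the two queries could be wired up roughly like this. It is only a sketch: it assumes names are unique, leaves out the extra filter conditions, and the `id` column, the sample names, and the helper name `insertion_index` are made up.

```python
import sqlite3


def insertion_index(conn, cached_names, new_id, new_name):
    """Return the position of the new row inside the cached page, or None."""
    cached_min, cached_max = cached_names[0], cached_names[-1]
    # Does the new record fall into the cached range at all?
    in_range = conn.execute(
        "SELECT 1 FROM mytab WHERE id = ? AND Name BETWEEN ? AND ?",
        (new_id, cached_min, cached_max),
    ).fetchone()
    if in_range is None:
        return None
    # Which cached record comes directly after the inserted one?
    successor = conn.execute(
        "SELECT Name FROM mytab WHERE Name > ? AND Name BETWEEN ? AND ? "
        "ORDER BY Name LIMIT 1",
        (new_name, cached_min, cached_max),
    ).fetchone()
    if successor is None:
        return len(cached_names)
    # SQLite returns a name, not a number; look it up in the cache.
    return cached_names.index(successor[0])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytab (id INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO mytab (Name) VALUES (?)",
    [("Alice",), ("Bob",), ("Dave",), ("Erin",), ("Frank",)],
)
cached = [
    r[0]
    for r in conn.execute("SELECT Name FROM mytab ORDER BY Name LIMIT 3 OFFSET 1")
]

carol_id = conn.execute("INSERT INTO mytab (Name) VALUES ('Carol')").lastrowid
zed_id = conn.execute("INSERT INTO mytab (Name) VALUES ('Zed')").lastrowid
print(cached, insertion_index(conn, cached, carol_id, "Carol"))
```

A result of None means the new row lies outside the cached page, so the visible rows don't need to change.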
Typically you'd expect a cache to be invalidated whenever the underlying data changes. I think dropping it and starting over will be your simplest and most maintainable solution. I would recommend it unless you have a very good reason not to.
You could write another query that just returned the row count (example below) to see if your cache should be invalidated. That would save recreating the cache when it did not change.
SELECT name,address FROM people WHERE area_code=970;
SELECT COUNT(rowid) FROM people WHERE area_code=970;
The information you'd need from sqlite to know when your cache was invalidated would require some rather intimate knowledge of how the query and/or index was working. I would say that is fairly high coupling.
Otherwise, you'd want to know where it was inserted with regards to the sorting. You would probably key each page on the sorted field. Delete anything greater than the insert/delete field. Any time you change the sorting you'd drop everything.
Something like the below would be a start if you were using C++. I realize you aren't doing C++, but hopefully it is evident as to what I'm trying to do.
struct Person {
std::string name;
std::string addr;
};
struct Page {
std::string key;
std::vector<Person> persons;
struct Less {
bool operator()(const Page &lhs, const Page &rhs) const {
return lhs.key.compare(rhs.key) < 0;
}
};
};
typedef std::set<Page, Page::Less> pages_t;
pages_t pages;
void insert(const Person &person) {
if (sql_insert(person)) {
Page probe;
probe.key = person.name; // assumes pages are keyed on the sorted name field
pages_t::iterator drop_cache_start = pages.lower_bound(probe);
//... drop this page and everything after it
}
}
You'd have to do some wrangling to get different datatypes of key to work nicely, but it's possible.
Theoretically you could just leave the pages out of it and only use the objects themselves. The database would no longer "own" the data though. If you only fill pages from the database, then you'll have less data consistency worries.
This may be a bit off topic, but you aren't re-implementing views, are you? A view doesn't cache per se, but it isn't clear if caching is a requirement of your project.
The solution I came up with is not exactly simple, but it's currently working well. I realized that the index of a record in a query result is also the count of all its preceding records. What I needed to do was 'convert' all the ORDER statements in the query to a series of WHERE statements that would return only the preceding records and take a count of those records. It's trickier than it sounds (or maybe not...it sounds tricky). The biggest issue I had was making sure the query was, in fact, sorted in a way I could predict. This meant I needed to have an order column in the Order Parameters that was based off of a column with unique values. So, whenever a user sorts on a column, I append to the statement another order parameter on a unique column (I used a "Modified Date Stamp") to break ties.
Creating the WHERE portion of the statement requires more than just tacking on a bunch of ANDs. It's easier to demonstrate. Say you have 3 Order columns: "LastName" ASC, "FirstName" DESC, and "Modified Stamp" ASC (the tie breaker). The WHERE statement would have to look something like this ('?' = record value):
WHERE
"LastName" < ? OR
("LastName" = ? AND "FirstName" > ?) OR
("LastName" = ? AND "FirstName" = ? AND "Modified Stamp" < ?)
Each set of WHERE parameters grouped together by parentheses is a tie breaker. If the record values of "LastName" are equal, we must then look at "FirstName", and finally "Modified Stamp". Obviously, this statement can get really long if you're sorting by a bunch of order parameters.
There's still one problem with the above solution. Comparisons involving NULL never evaluate to true, and yet SQLite sorts NULL values first in ascending order. Therefore, in order to deal with NULL values appropriately you've gotta add another layer of complication. First, every equality comparison (=) must be replaced by IS. Second, every < comparison must be paired with an OR ... IS NULL so that rows holding NULL are counted as preceding. This turns the above condition into:
WHERE
("LastName" < ? OR "LastName" IS NULL) OR
("LastName" IS ? AND "FirstName" > ?) OR
("LastName" IS ? AND "FirstName" IS ? AND ("Modified Stamp" < ? OR "Modified Stamp" IS NULL))
I then take a count of the RowID using the above WHERE parameter.
It turned out easy enough for me to do mostly because I had already constructed a set of objects to represent various aspects of my SQL Statement which could be assembled to generate the statement. I can't even imagine trying to manipulate a SQL statement like this any other way.
So far, I've tested using this on several iOS devices with up to 10,000 records in a table and I've had no noticeable performance issues. Of course, it's designed for single record edits/insertions so I don't really need it to be super fast/efficient.
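For anyone who wants to try the idea, here is a small sketch of the count-of-preceding-rows query, using the simpler WHERE clause without the NULL handling. The table name `people`, the sample rows, and the helper name `row_index` are invented for the example; the column names and sort directions are the ones described above.

```python
import sqlite3

# Rows that sort before the given record under
# ORDER BY "LastName" ASC, "FirstName" DESC, "Modified Stamp" ASC.
PRECEDING_COUNT = """
SELECT COUNT(rowid) FROM people
WHERE "LastName" < :last
   OR ("LastName" = :last AND "FirstName" > :first)
   OR ("LastName" = :last AND "FirstName" = :first AND "Modified Stamp" < :stamp)
"""

ORDERED = """
SELECT "LastName", "FirstName", "Modified Stamp" FROM people
ORDER BY "LastName" ASC, "FirstName" DESC, "Modified Stamp" ASC
"""


def row_index(conn, last, first, stamp):
    """Index of the record in the sorted result = number of preceding rows."""
    params = {"last": last, "first": first, "stamp": stamp}
    return conn.execute(PRECEDING_COUNT, params).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE people ("LastName" TEXT, "FirstName" TEXT, "Modified Stamp" INTEGER)'
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        ("Smith", "Zoe", 1),
        ("Smith", "Adam", 2),
        ("Jones", "Bob", 3),
        ("Smith", "Adam", 4),
        ("Brown", "Cy", 5),
    ],
)

ordered = conn.execute(ORDERED).fetchall()
for position, record in enumerate(ordered):
    print(position, row_index(conn, *record), record)
```

Each printed line shows the position from the full sorted query next to the counted index, so the two can be compared directly.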
