Recursive Query using HQL - recursion

I have this Table
CREATE TABLE IF NOT EXISTS `branch` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `studcount` int(11) DEFAULT NULL,
  `username` varchar(64) NOT NULL,
  `branch_fk` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `FKADAF25A2A445F1AF` (`branch_fk`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=14 ;
ALTER TABLE `branch`
  ADD CONSTRAINT `FKADAF25A24CEE7BFF` FOREIGN KEY (`branch_fk`) REFERENCES `branch` (`id`);
As you can see, each row has a foreign key pointing to another branch row (a self-relation).
I want a query in HQL (HQL preferred) that takes my username (or id) and returns a List<String> (of usernames) or List<Integer> (of ids) containing all of my sub-branches.
Let me show an example:
id  studcount  username  branch_fk
 1  312        user01    NULL
 2  111        user02    1
 3  432        user03    1
 4  543        user04    2
 5  433        user05    3
 6  312        user06    5
 7  312        user06    2
 8  312        user06    7
When I call GetSubBranch(3) I want it to return:
5, 6
and when I call GetSubBranch(2) I want it to return:
4, 7, 8

I believe there is no portable SQL to do this.
Even more, I think several major databases' SQL cannot express this.
Therefore, this capability is not part of what you can do in HQL. Sorry :-(
I read about a few ways to go. Most of them involve tradeoffs depending on the number of levels (fixed in advance? how many?), the number of records (hundreds? millions?), etc.:
1. Do the recursive queries yourself, going down one level at a time (with an in(ids) clause), until some level comes back empty; see the sketch after this list.
2. Do a query with a fixed number of left joins (the depth needs to be known in advance; or you may need to repeat the query to find the remaining records, see point 1).
3. Have the denormalized information available somewhere: it could be a denormalized table copying the hierarchy. But I would prefer a cached in-memory copy, which may be filled completely in a single request and then updated or invalidated, depending on your other requirements (table size, maximum depth, write frequency, etc.).
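A minimal sketch of option 1, in plain SQL against the branch table above (the same shape works from HQL by querying the mapped entity instead); the application issues one query per level, collecting ids until a level comes back empty:

-- Level 1: direct children of branch 3
SELECT id FROM branch WHERE branch_fk IN (3);   -- returns 5
-- Level 2: feed the ids collected so far back in
SELECT id FROM branch WHERE branch_fk IN (5);   -- returns 6
-- Level 3: empty result, so the recursion stops
SELECT id FROM branch WHERE branch_fk IN (6);   -- returns nothing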

One may also have a look at 'nested sets'. Querying then becomes a matter of 'BETWEEN :L AND :R'. But the topological/hierarchical sort order is lost (compared to recursive/hierarchical queries), and inserting new items is quite costly, as it requires updates on several if not all rows. A sketch follows.
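For illustration, a minimal nested-sets query (assuming hypothetical lft/rgt boundary columns maintained alongside the tree; these are not part of the original schema):

-- every descendant's left boundary falls strictly inside its ancestor's interval
SELECT child.id
FROM branch AS parent
JOIN branch AS child
  ON child.lft > parent.lft AND child.lft < parent.rgt
WHERE parent.id = 3;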

Related

SQLite is very slow when performing .import on a large table

I'm running the following:
.mode tabs
CREATE TABLE mytable(mytextkey TEXT PRIMARY KEY, field1 INTEGER, field2 REAL);
.import mytable.tsv mytable
mytable.tsv is approx. 6 GB and 50 million rows. The process takes an extremely long time (hours) to run and it also completely throttles the performance of the entire system, I'm guessing because of temporary disk IO.
I don't understand why it takes so long and why it thrashes the disk so much, when I have plenty of free physical RAM it could use for temporary write.
How do I improve this process?
PS: Yes, I did search for a previous question and answer, but nothing I found helped.
In SQLite, a normal rowid table uses a 64-bit integer primary key. If the PK in the table definition is anything but a single INTEGER column, it is instead treated as a unique index, and each inserted row has to update both the original table and that index, doubling the work (and in your case effectively doubling the storage requirements).

If you instead make your table a WITHOUT ROWID one, the PK is a true PK and doesn't require an extra index table. That change alone should roughly halve both the time it takes to import your dataset and the size of the database.

(If you have other indexes on the table, or use that PK as a foreign key in another table, it might not be worth making the change in the long run, as it'll increase the amount of space needed for those tables by potentially a lot given the lengths of your keys; in that case, see Schwern's answer.)
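A sketch of that change, reusing the table definition from the question:

CREATE TABLE mytable(
  mytextkey TEXT PRIMARY KEY,
  field1 INTEGER,
  field2 REAL
) WITHOUT ROWID;  -- the text PK is now the table's real key; no separate hidden index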
Sorting the input on the key column first can help too on large imports because there's less random access of b-tree pages and moving of data within those pages. Everything goes into the same page until it fills up and a new one is allocated and any needed rebalancing is done.
You can also turn on some unsafe settings that aren't recommended in normal usage because they can result in data loss or outright corruption; but if that happens during an import because of a freak power outage or whatever, you can always just start over. In particular, set the synchronous mode and the journal type to OFF. That results in fewer disc writes over the course of the import.
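Concretely, those two settings (issued before the import; not safe for normal operation):

PRAGMA synchronous = OFF;   -- don't wait for the OS to confirm each write
PRAGMA journal_mode = OFF;  -- no rollback journal at all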
My assumption is the problem is the text primary key. This requires building a large and expensive text index.
The primary key is a long nucleotide sequence (anywhere from 20 to 300 characters), field1 is an integer (between 1 and 1500), and field2 is a relative log ratio (between -10 and +10, roughly).
Text primary keys have few advantages and many drawbacks.
They require large, slow indexes. Slow to build, slow to query, slow to insert.
Text is tempting to change, which is exactly what you don't want a primary key to do.
Any table referencing it also has to store and index the text, adding to the bloat.
Joins with this table will be slower due to the text primary key.
Consider what happens when you make a new table which references this one.
create table othertable(
    myreference references mytable, -- this is text
    something integer,
    otherthing integer
);
othertable now must store a copy of the entire sequence; instead of being simple integers, it has a text column, bloating the table. And it must build its own text index, bloating the database and slowing down joins and inserts.
Instead, use a normal, integer, autoincrementing primary key and make the sequence column unique (which is also indexed). This provides all the benefits of a text primary key with none of the drawbacks.
create table sequences(
    id integer primary key autoincrement,
    sequence text not null unique,
    field1 integer not null,
    field2 real not null
);
Now references to sequences are a simple integer.
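For example, the referencing table from before becomes all integers (a sketch reusing the hypothetical othertable):

create table othertable(
    sequence_id integer references sequences(id), -- a simple integer now
    something integer,
    otherthing integer
);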
Because the SQLite import process is not very customizable, getting your data into this table in SQLite efficiently requires a couple steps.
First, import your data into a table which does not yet exist. Make sure it has header fields matching your desired column names.
$ cat test.tsv
sequence field1 field2
d34db33f 1 1.1
f00bar 5 5.5
somethings 9 9.9
sqlite> .import test.tsv import_sequences
As there's no indexing happening, this process should go pretty quickly. SQLite made a table called import_sequences with every column of type text.
sqlite> .schema import_sequences
CREATE TABLE import_sequences(
"sequence" TEXT,
"field1" TEXT,
"field2" TEXT
);
sqlite> select * from import_sequences;
sequence field1 field2
---------- ---------- ----------
d34db33f 1 1.1
f00bar 5 5.5
somethings 9 9.9
Now we create the final production table.
sqlite> create table sequences(
...> id integer primary key autoincrement,
...> sequence text not null unique,
...> field1 integer not null,
...> field2 real not null
...> );
For efficiency, normally you'd add the unique constraint after the import, but SQLite has very limited ability to alter a table and cannot alter an existing column except to change its name.
Now transfer the data from the import table into sequences. The primary key will be automatically populated.
insert into sequences (sequence, field1, field2)
select sequence, field1, field2
from import_sequences;
Because the sequence must be indexed this might not import any faster, but it will result in a much better and more efficient schema going forward. If you want efficiency consider a more robust database.
Once you've confirmed the data came over correctly, drop the import table.
The following settings helped speed things up tremendously.
PRAGMA journal_mode = OFF;   -- no rollback journal
PRAGMA cache_size = 7500000; -- a much larger page cache
PRAGMA synchronous = 0;      -- don't wait for writes to hit disk
PRAGMA temp_store = 2;       -- keep temporary tables and indexes in memory

SQLite database size: more rows vs. more columns

Initial situation
Suppose I have a simple table that looks like this:
CREATE TABLE AppData (
id INTEGER PRIMARY KEY,
elementId VARCHAR(36),
timestampMs INTEGER,
enterTypeA SMALLINT,
exitTypeA SMALLINT,
enterTypeB SMALLINT,
exitTypeB SMALLINT
);
CREATE UNIQUE INDEX app_data_index ON AppData (timestampMs DESC, elementId);
The index is added because a lot of queries are performed that select entities based on timestampMs and elementId.
Each minute I store enter and exit values of different types for different elements. E.g.:
elementId  timestampMs    enterTypeA  exitTypeA  enterTypeB  exitTypeB
1          1559383200000  4           3          1           5
2          1559383200000  8           2          3           7
1          1559383260000  2           2          4           0
2          1559383260000  1           0          9           2
Problem description
New types need to be added to the database, and more may follow in the future. So I tried two different approaches:
Approach 1:
Adding more columns for new types:
CREATE TABLE AppData (
id INTEGER PRIMARY KEY,
elementId VARCHAR(36),
timestampMs INTEGER,
enterTypeA SMALLINT,
exitTypeA SMALLINT,
enterTypeB SMALLINT,
exitTypeB SMALLINT,
enterTypeC SMALLINT,
exitTypeC SMALLINT
);
CREATE UNIQUE INDEX app_data_index ON AppData (timestampMs DESC, elementId);
Approach 2:
A new row for each type (which means a larger index):
CREATE TABLE AppData (
id INTEGER PRIMARY KEY,
elementId VARCHAR(36),
timestampMs INTEGER,
enterValue SMALLINT,
exitValue SMALLINT,
type SMALLINT
);
CREATE UNIQUE INDEX app_data_index ON AppData (timestampMs DESC, elementId, type);
Personally I prefer approach 2, because it reduces duplication.
I've tested both approaches, inserting test data for 10 days with 5 elements and 3 types. The results showed that the database size of approach 1 is much smaller than that of approach 2 (which from my point of view is reasonably logical, since approach 2 has three times as many rows):
Approach 1: 8.2 MB | 144'000 entries
Approach 2: 24.6 MB | 432'000 entries
Question
As far as I can see, the size of the index in both solutions is about 50% of the database size, so it's clear that the database size of approach 2 will always be larger.
Do more rows instead of more columns in SQLite always make such a big difference in database size?
So far I haven't found a way to reduce the size of approach 2 any further. Perhaps this isn't possible due to the index?
The issue of which of the two versions takes up more space is not as important as which is the proper database structure for your needs. The second version is preferable, for several reasons:
If you need to restrict the table to only certain types, a simple WHERE clause will suffice. In the first version, you basically always get back every type when querying.
Aggregation is possible in the second version. You may easily aggregate all timestamps by type (see the sketch after this list). This is much harder to do in the first version.
If you need to link any of the columns in the second version to other tables, it is fairly straightforward. On the other hand, in the first version, you would need to potentially link each separate enter/exit column.
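A minimal aggregation sketch against the second schema (the time bounds are chosen arbitrarily for illustration):

-- total enter/exit counts per type over a time window
SELECT type,
       SUM(enterValue) AS totalEnter,
       SUM(exitValue)  AS totalExit
FROM AppData
WHERE timestampMs BETWEEN 1559383200000 AND 1559469600000
GROUP BY type;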
Regarding storage, storing the same amount of data in either scheme should be fairly similar, certainly within an order of magnitude and probably within a factor of 2. The design issue seems to be the bigger problem.

Is it possible to (emulate?) AUTOINCREMENT on a compound-PK in Sqlite?

According to the SQLite docs, the only way to get an auto-increment column is on the primary key.
I need a compound primary key, but I also need auto-incrementing. Is there a way to achieve both of these in SQLite?
Relevant portion of my table as I would write it in PostgreSQL:
CREATE TABLE tstage (
id SERIAL NOT NULL,
node INT REFERENCES nodes(id) NOT NULL,
PRIMARY KEY (id,node),
-- ... other columns
);
The reason for this requirement is that all nodes eventually dump their data to a single centralized node where, with a single-column PK, there would be collisions.
The documentation is correct.
However, it is possible to reimplement the autoincrement logic in a trigger:
CREATE TABLE tstage (
    id INT, -- allow NULL to be handled by the trigger
    node INT REFERENCES nodes(id) NOT NULL,
    PRIMARY KEY (id, node)
);
CREATE TABLE tstage_sequence (
    seq INTEGER NOT NULL
);
INSERT INTO tstage_sequence VALUES(0);

CREATE TRIGGER tstage_id_autoinc
AFTER INSERT ON tstage
FOR EACH ROW
WHEN NEW.id IS NULL
BEGIN
    UPDATE tstage_sequence SET seq = seq + 1;
    UPDATE tstage
    SET id = (SELECT seq FROM tstage_sequence)
    WHERE rowid = NEW.rowid;
END;
(Or use a common my_sequence table with the table name if there are multiple tables.)
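A quick usage check (the node value 7 is just an example):

INSERT INTO tstage (node) VALUES (7);
INSERT INTO tstage (node) VALUES (7);
SELECT id, node FROM tstage;
-- 1|7
-- 2|7   (ids filled in by the trigger)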
A trigger works, but it is complex. More simply, you could avoid serial ids altogether. One approach is to use a GUID. Unfortunately I couldn't find a way to have SQLite generate the GUID for you by default, so you'd have to generate it in your application. There also isn't a GUID type, but you could store it as a string or a binary blob.
Or, perhaps there is something in your other columns that would serve as a suitable key. If you know that inserts won't happen more frequently than the resolution of your timestamp format of choice (SQLite offers several, see section 1.2), then maybe (node, timestamp_column) is a good primary key.
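A sketch of that composite key (assuming inserts never collide within the timestamp's resolution; strftime with %f gives millisecond precision):

CREATE TABLE tstage (
    node INT NOT NULL REFERENCES nodes(id),
    created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%f', 'now')),
    PRIMARY KEY (node, created_at)
);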
Or, you could use SQLite's AUTOINCREMENT, but set the starting number on each node via the sqlite_sequence table so that the generated serials won't collide. Since the rowid in SQLite is a 64-bit number, you could do this by generating a unique 32-bit number for each node (IP addresses are a convenient, probably unique, 32-bit number) and shifting it left 32 bits, or equivalently, multiplying it by 4294967296. The 64-bit rowid then becomes effectively two concatenated 32-bit numbers, NODE_ID, RECORD_ID, guaranteed not to collide unless one node generates over four billion records.
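A sketch of that seeding step (the node id 5 and the table layout are assumptions for illustration; the sqlite_sequence table exists once any AUTOINCREMENT table has been created, and it may be written to directly):

CREATE TABLE tstage (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    node INT NOT NULL
);
-- push this node's serials into its own 32-bit range
INSERT INTO sqlite_sequence (name, seq) VALUES ('tstage', 5 * 4294967296);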
How about...
ASSUMPTIONS
Only need uniqueness in PK, not sequential-ness
Source table has a PK
Create the central table with one extra column, the node number...
CREATE TABLE tstage (
    node INTEGER NOT NULL,
    id INTEGER NOT NULL, -- or whatever the source table PK is
    PRIMARY KEY (node, id)
    -- ... other columns
);
When you roll up the data into the centralized node, insert the number of the source node into 'node' and set 'id' to the source table's PRIMARY KEY column value...
INSERT INTO tstage (node, id, ...) VALUES (:nodenumber, :sourcetable_id, ...);
There's no need to maintain another autoincrementing column on the central table, because node + source-table id will always be unique.

Sqlite strange sort order

Here is an interesting one: it is only happening with one database file, not with any others that I have. I cured this problem, but thought it was quite interesting.
I have a table -
<partial table>
CREATE TABLE [horsestats] (
[horseID] INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
[name] VARCHAR(30) NULL,
[flatrating] INTEGER DEFAULT '0' NULL
);
All of the zero values in the table were set by the default value; any other value has been set by my software. About 70% of the records have a value set. So, if we run this -
Select horsestats.flatrating FROM horsestats WHERE horsestats.flatrating<>0 ORDER BY horsestats.flatrating DESC LIMIT 20;
We don't really expect this -
flatrating
0
0
0
etc
In fact, only values that are 0 are listed; none of the non-zero values appear in the output. If we reverse it, we might expect the rows whose value is 0 to be listed:
Select horsestats.flatrating FROM horsestats WHERE horsestats.flatrating=0 ORDER BY horsestats.flatrating DESC LIMIT 20;
But no, there are no records returned.
So what does this one get us (this is where I started, because it is the first set that my software needs):
Select horsestats.flatrating FROM horsestats ORDER BY horsestats.flatrating DESC;
I bet your socks you guess wrong. It gets this:
flatrating
0
0
0
0
130
128
127
126
125
124
124
As I said, this doesn't happen on any other database or table that I have. I'm going to fix it now by explicitly setting all values of zero to zero; I suspect this will put it right.
Actually, it didn't. If I run:
UPDATE horsestats SET horsestats.flatrating='0' WHERE horsestats.flatrating='0';
the problem remains, so it looks like I have to write that database file off as corrupt. In this case that is OK, because I do have to load the majority of the data from elsewhere in a pre-load for the software.
So the question is Why?
Could SQLite be doing a strange mix of text and numeric sorting? It's the only thing I can think of that would give that sort order. Also, the value zero in this table does not seem to be numerically zero, though it behaves as expected once it is passed to my software.
I think your problem is that you're quoting the zero - this makes it a string. Make a table like this:
CREATE TABLE [horsestats] (
[horseID] INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
[name] VARCHAR(30) NULL,
[flatrating] INTEGER DEFAULT 0 NULL
);
and it seems to work. Alternatively, run an unquoted version of your update command:
UPDATE horsestats SET flatrating=0 WHERE flatrating='0';
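A quick way to confirm what is actually stored (a sketch; typeof() reports each value's storage class):

SELECT flatrating, typeof(flatrating), COUNT(*)
FROM horsestats
GROUP BY flatrating, typeof(flatrating);
-- rows created via DEFAULT '0' report 'text'; values set by the software report 'integer'.
-- Text sorts after all integers in SQLite, so a DESC sort puts the text '0' rows first.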

SQL Server 2005: Basic Insert / Record Logic Help Needed

I am designing a social networking site that has a "wall" feature like the others out there today. The database has an alerts table that stores any user action worthy of sharing with his friends. For example, when a user updates his status, all of his friends are notified. The table below shows two status updates from two unique users. The first (AlertId 689 and 690) is submitted by AccountId 53. Since he has one friend - AccountId 57 - that row is added to the table, so when this user logs on he will see Account 53's update on his wall. In the same manner, the other user's status update has four rows, because he has three friends.
[AlertId] [AccountId] [CreateDate] [Timestamp]         [AlertTypeId] [IsHidden] [Body]
689       57          2010-08-10   0x0000000000018725  10            0          HTML
690       53          2010-08-10   0x0000000000018726  10            0          HTML
691       53          2010-08-10   0x000000000001872B  10            0          HTML
692       52          2010-08-10   0x000000000001872C  10            0          HTML
693       51          2010-08-10   0x000000000001872D  10            0          HTML
694       57          2010-08-10   0x000000000001872E  10            0          HTML
Now, a user can comment on any given item, in this case a status update. When AddComment is submitted, we use ObjectRecordId (the primary key of the alert being commented on) to identify which status update is being commented on (FYI - the ObjectId tells us it's a status update):
public void AddComment(string comment)
{
    if (_webContext != null)
    {
        var c = new Comment
        {
            Body = comment,
            CommentByAccountId = _webContext.CurrentUser.AccountId,
            CommentByUserName = _webContext.CurrentUser.UserName,
            CreateDate = DateTime.Now,
            SystemObjectId = _view.ObjectId,
            SystemObjectRecordId = _view.ObjectRecordId
        };
        _commentRepository.SaveComment(c);
    }
    _view.ClearComments();
    LoadComments();
}
Now, the problem is that when a user comments on a friend's status update, he will be using the AlertId (the ObjectRecordId in the Comments table) corresponding to his own account in the alerts table. The result is that comments are only viewable by the commenter and none of his friends:
[CommentId] [Body]        [CommentById] [CommentByName] [ObjectId] [ObjectRecordId] [Delete]
97          hello world.  57            GrumpyCat       7          690              0
Of course the solution to this is to do something similar to what I did in the alerts table - when somebody makes a comment, make a corresponding row for every friend in the comments table. But how do I access the AlertIds of all of my friends' copies of the status update in the Alerts table and map them to the ObjectRecordId column in the comments table? Since I can only access the status updates corresponding to my account (and their corresponding AlertIds), I don't know what the AlertIds are for the same status update in my friends' accounts.
The only solution I can think of right now is stuffing a hidden field with all of my friends' corresponding AlertIds, so that when I comment on an item I already know what they are. But this feels sloppy, and I'd like to know if there are any better ideas out there?
For what it is worth, here is the CREATE TABLE of dbo.Alerts:
CREATE TABLE [dbo].[Alerts](
[AlertId] [bigint] IDENTITY(1,1) NOT NULL,
[AccountId] [int] NOT NULL,
[CreateDate] [datetime] NOT NULL CONSTRAINT [DF_Alerts_CreateDate] DEFAULT (getdate()),
[Timestamp] [timestamp] NOT NULL,
[AlertTypeId] [int] NOT NULL,
[IsHidden] [bit] NOT NULL CONSTRAINT [DF_Alerts_IsHidden] DEFAULT ((0)),
[Message] [varchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
CONSTRAINT [PK_Alerts] PRIMARY KEY CLUSTERED
(
[AlertId] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
And, here is dbo.Comments:
CREATE TABLE [dbo].[Comments](
[CommentId] [bigint] IDENTITY(1,1) NOT NULL,
[Timestamp] [timestamp] NOT NULL,
[Body] [varchar](2000) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[CreateDate] [smalldatetime] NOT NULL,
[CommentByAccountId] [int] NOT NULL,
[CommentByUserName] [varchar](250) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[SystemObjectId] [int] NOT NULL,
[SystemObjectRecordId] [bigint] NOT NULL,
[FlaggedForDelete] [bit] NOT NULL CONSTRAINT [DF_Comments_FlaggedForDelete] DEFAULT ((0)),
CONSTRAINT [PK_Comments] PRIMARY KEY CLUSTERED
(
[CommentId] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
I am using SQL Server 2005. Thanks in advance.
Update
I have some real concerns about your design; I've laid them out using scenarios. I named one of my concerns earlier, which is that I don't see any way of tying an alert back to a comment.
Scenario: a friend posts on his wall saying "hey, I'm giving away my old computer, let me know if you want it". Of course you weren't able to access the site for two weeks for some good reason. Now when you finally get back on and see the alert for your friend's posting, you want to go check it out, BUT! there is nothing tying this alert back to a comment. So when you click it, you just go to your friend's wall and not straight to the posting. You should be able to click an alert and go straight to the comment/post, but I don't see any way of doing this right now.
Secondly, I don't see any way of replying to a comment.
Scenario: I go to friend X's page and see that he's in Texas this week for business, and I want to comment on that. So I write "hey, bring me back a present" in the text box and submit it. Now what happens to this comment? It goes in the comments table with a comment ID, and it has my ID attached to it, but where does anything in the database say that it is a reply to a comment?
I think if you solve some of these other design issues the issue will probably fix itself, or if I'm way off or there are other tables in the picture that aren't included let me know.
Original Post
It looks like you need an extra column in the Alerts table, at least as far as I can tell. Here is the question I asked myself: how do I tell, just by looking at any record in the Alerts table, what comment it belongs to? I can't, as far as I know. This means the alert is very general: "hey, this user said something, but I don't know what, and if he removes his comment this little alert will still be here because it's not attached...".
So, I think you need a column in the Alerts table that links it back to the original comment/posting/whatever. Then you can use that original "CommentID" (?) to make the posting, and everything works out clean and pretty. A sketch follows.
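A minimal sketch of that change (the column name is an assumption, not part of the original schema):

ALTER TABLE [dbo].[Alerts]
    ADD [SourceCommentId] [bigint] NULL; -- the comment/posting this alert was generated from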
I know I didn't directly answer your actual question... but I think your table design might need some work.
