Find records in the same table matching multiple criteria - sqlite

I have a sqlite table containing metadata extracted from thousands of audio files in a directory tree. The objective of the extraction is to run a series of queries against the table to identify and rectify anomalies in the underlying metadata. The corrected metadata is then written back from the table to the underlying files. The underlying files are grouped into albums with each album in a directory of its own. Table structure relevant to my question is as follows:
__path: unique identifier being the path and source filename combined
__dirpath: in simple terms represents the directory from which the file represented by a table record was drawn. Records making up an album will have the same __dirpath
__discnumber: number designating the disc number from which the track originates. The field can be blank or contain a string 1,2,3... etc.
I'd like to identify all records, grouped by identical __dirpath, where __discnumber equals 1.

SELECT DISTINCT __dirpath,
                __discnumber
FROM alib
WHERE __dirpath IN (
        SELECT __dirpath
        FROM alib
        GROUP BY __dirpath
        HAVING count( * ) > 0
      )
  AND __discnumber = 1
ORDER BY __dirpath,
         __discnumber;
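As written, the IN (...) subquery with HAVING count(*) > 0 matches every __dirpath, so it filters nothing; the __discnumber = 1 test does all the work. If the eventual goal is to single out whole albums in which every track is tagged disc 1 (so the tag is effectively redundant), a grouped variant along these lines may be closer. This is only a sketch, untested against your data:

SELECT __path, __dirpath, __discnumber
FROM alib
WHERE __dirpath IN (
        SELECT __dirpath
        FROM alib
        GROUP BY __dirpath
        -- keep only directories in which every record is tagged disc 1
        HAVING sum(CASE WHEN __discnumber = 1 THEN 1 ELSE 0 END) = count(*)
      )
ORDER BY __dirpath, __path;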

Related

Minimising Sqlite db size when indexing 2M files with long paths

I need to index around 2 million files on each of several Linux systems, and I am worried the naive way to do this might create an unnecessarily large data file because of the longish path names (i.e. the text of an average path, perhaps /home/user/Android/gradle/blah/blah/blah/filename, times 2 million).
Assuming I put the filename in a column of its own and the path in a different column, with identical text (i.e. the full path) repeated frequently in the table, will SQLite automatically store the full text once and just use a pointer to it in each row? If not, is there a way I can instruct it to do this behind the scenes without having to code it? Coding it will be a PITA (I'm quite rusty with SQL at the moment), and if I put the text in a separate table I'm wondering whether it will slow everything down at run time too.
I will probably use Perl for this. The intention is to find replicated data across machines with slow interconnections, so: index the files, make hashes of all files, transfer the db files, and test against the other machines.
TIA, Pete
Here is a very basic schema and corresponding query for storing file paths. Similar queries could be crafted to, e.g., get all files for a particular folder (sketched after the main query below), or get the relative path of a file. Other metadata can be added to either table or to an auxiliary table. I am not claiming efficiency for any particular purpose, only that this schema avoids storing redundant path strings.
First the schema and sample data. Notice the recursive foreign-key relationship on the folders table, referring to itself:
BEGIN TRANSACTION;
DROP TABLE IF EXISTS files;
DROP TABLE IF EXISTS folders;
CREATE temp TABLE folders (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    parent_id INTEGER,
    name NOT NULL,
    --! UNIQUE(parent_id, name), --* Changed to unique index definition
    FOREIGN KEY (parent_id) REFERENCES folders (id)
);
--* Multiple identical null values were allowed with a unique constraint defined on the table.
--* Instead define a unique index that explicitly converts null values to an effective id value
CREATE UNIQUE INDEX folders_unique_root ON folders
    (coalesce(parent_id, -1), name);
CREATE temp TABLE files (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    folder_id INTEGER,
    name NOT NULL,
    UNIQUE(folder_id, name),
    FOREIGN KEY (folder_id) REFERENCES folders (id)
);
INSERT INTO folders
    (id, parent_id, name)
VALUES
    --* Although id is AUTOINCREMENT, explicitly assign here to make hierarchical relationships obvious
    (1, null, 'home'),
    (2, 1, 'dev'),
    (3, 2, 'SO'),
    (4, 1, 'docs'),
    (5, 1, 'sys');
INSERT INTO files
    (folder_id, name)
VALUES
    (1, 'readme.txt'),
    (3, 'recursive.sql'),
    (3, 'foobar'),
    (4, 'homework.txt');
COMMIT;
Now a query for recovering full paths of the files. I'll add a few comments, but to understand each detail I defer to the official docs for the WITH statement and Window functions:
WITH RECURSIVE
    file_path_levels AS (
        --* Prime the recursion with file rows
        SELECT id AS file_id,
               1 AS level,
               folder_id,
               name
        FROM files
        --WHERE name == 'foobar' --For selecting particular file
        UNION ALL
        --* Continue recursion by joining to next folder in relationship
        SELECT file_id,
               level + 1,
               folders.parent_id,
               folders.name
        FROM file_path_levels  --* Refer to own name to cause recursion
        LEFT JOIN folders      --* Get row of parent folder
            ON file_path_levels.folder_id = folders.id
        WHERE folders.id is not null  --Necessary condition to terminate recursion
    ),
    file_path_steps AS (
        --* Now concatenate folders into path string
        SELECT file_id,
               level,
               '/' || group_concat(name, '/')
                   OVER (PARTITION BY file_id
                         ORDER BY level DESC)
                   AS full_path
        FROM file_path_levels
    )
SELECT file_id, full_path
FROM file_path_steps
WHERE level == 1;
This produces:
file_id   full_path
1         /home/readme.txt
2         /home/dev/SO/recursive.sql
3         /home/dev/SO/foobar
4         /home/docs/homework.txt
It's possible to inspect intermediate results by replacing the final query with one that retrieves rows from the other named CTEs, optionally removing WHERE conditions to see what each step produces. This helps in learning what each part does. For example, try this after the WITH clause:
SELECT *
FROM file_path_levels
ORDER BY file_id, level;
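Along the same lines, the "all files for a particular folder" query mentioned at the top is a plain join; a minimal sketch against the sample data above (direct children only, not files in subfolders):

SELECT folders.name AS folder,
       files.name AS file
FROM files
JOIN folders ON folders.id = files.folder_id
WHERE folders.name = 'SO';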

Oracle BI Publisher - Dynamic number of columns

I'm creating a report in BI Publisher using the BI Publisher Desktop tool for Word.
What I need is a table with a dynamic number of columns.
Let's imagine I'm listing stock by store: each line is an item, and I need a column for each store in the database, but this must be dynamic because a store can be created or deleted at any moment.
The number of stores, i.e., the number of columns that need to exist, is obtained from an SQL query included in the report as a data set.
The query will be something like SELECT COUNT(*) AS STORE_COUNT FROM STORE; in a data set named G_1, so the number of columns is the variable G_1::STORE_COUNT.
Is there any way this can be achieved?
I'm developing the report using an .rtf file, so any related help would be appreciated.
Thank you very much.
Create an .rtf file with the column names mapped to an .xdo or .xdm file. The mapped column in the .xdo or .xdm file should be in the cursor or the SELECT statement of your stored procedure or function.

sqlite3 - the philosophy behind sqlite design for this scenario

Suppose we have a file with just one table named TableA, and this table has just one column named Text.
Let's say we populate TableA with 3,000,000 strings like these (each line a record):
Many of our patients are incontinent.
Many of our patients are severely disturbed.
Many of our patients need help with dressing.
If I save the file at this point, it is ~326 MB.
Now let's say we want to increase the speed of our queries, and therefore we make the Text column the primary key (or create an index on it).
If I save the file at this point, it is ~700 MB.
our query:
SELECT Text FROM "TableA" where Text like '% home %'
for the table without index: ~5.545s
for the indexed table: ~2.231s
As far as I know, when we create an index on a column or make a column the primary key, the SQLite engine doesn't need to refer to the table itself (if no other column is requested in the query); it uses the index for the query, and hence query execution gets faster.
My question is: in the scenario above, where we have just one column and also make that column the primary key, why does SQLite hold on to what seems to be unnecessary data (in this case ~326 MB)? Why not keep just the index/primary-key data?
In SQLite, table rows are stored in the order of the internal rowid column.
Therefore, indexes must be stored separately.
In SQLite 3.8.2 or later, you can create a WITHOUT ROWID table which is stored in order of its primary key values.
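Applied to the scenario in the question, that looks something like the following sketch (assuming SQLite 3.8.2+ and the one-column table described above). The table is then stored as a single B-tree keyed on Text, so the strings are not held twice:

CREATE TABLE TableA (
    Text TEXT PRIMARY KEY
) WITHOUT ROWID;

Note that a query with a leading wildcard such as LIKE '% home %' still has to scan the whole tree; WITHOUT ROWID mainly avoids the duplicated storage, not the scan.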

Importing fields from multiple columns in an Excel spreadsheet into a single row in Access

We get new data for our database from an online form that outputs as an Excel sheet. To normalize the data for the database, I want to combine multiple columns into one row.
For example, I want data like this:
ID | Home Phone | Cell Phone | Work Phone
1  | 555-1234   | 555-3737   | 555-3837
To become this:
PhoneID | ID | Phone Number | Phone Type
1       | 1  | 555-1234     | Home
2       | 1  | 555-3737     | Cell
3       | 1  | 555-3837     | Work
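One way to produce that shape in Access is a union query over the three phone columns; a sketch with hypothetical table and field names (tblImport holding the raw spreadsheet rows):

SELECT ID, [Home Phone] AS PhoneNumber, "Home" AS PhoneType FROM tblImport
UNION ALL
SELECT ID, [Cell Phone], "Cell" FROM tblImport
UNION ALL
SELECT ID, [Work Phone], "Work" FROM tblImport;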
To import the data, I have a button that finds the spreadsheet and then runs a bunch of queries to add the data.
How can I write a query to append this data to the end of an existing table without ending up with duplicate records? The data pulled from the website is all stored and archived in an Excel sheet that will be updated without removing the old data (we don't want to lose this extra backup), so with each import, I need it to disregard all of the previously entered data.
I was able to make a query that lists everything out in the correct format from the original spreadsheet (I imported the external spreadsheet into an unnormalized table in Access to test it), but when I try to append it to the phone number table, it adds all of the data again each time. I can remove the duplicates with another query afterwards, but I'd rather not leave it like that.
There are several possible approaches to this problem; which one you choose may depend on the size of the dataset relative to the number of updates being processed. Basically, the choices are:
1) Add a unique index to the destination table, so that Access will refuse to add a duplicate record. You'll need to handle the possible warning ("Access was unable to add xxx records due to index violations" or similar).
2) Import the incoming data to a staging table, then outer join the staging table to the destination table and append only records where the key field(s) in the destination table are null (i.e., there's no matching record in the destination table).
I have used both approaches in the past - I like the index approach for its simplicity, and I like the staging approach for its flexibility, because you can do a lot with the incoming data before you append it if you need to.
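For the staging approach, the append query ends up looking something like this sketch (hypothetical table and field names; tblStaging holds the freshly unpivoted rows and tblPhone is the destination):

INSERT INTO tblPhone (ID, PhoneNumber, PhoneType)
SELECT s.ID, s.PhoneNumber, s.PhoneType
FROM tblStaging AS s
LEFT JOIN tblPhone AS p
    ON (s.ID = p.ID AND s.PhoneNumber = p.PhoneNumber AND s.PhoneType = p.PhoneType)
WHERE p.ID IS NULL;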
You could run a delete query on the table where you store the imported data and then run your imports, assuming that table is simply being refreshed each time. The delete query removes all records, and the import then repopulates the table, so there are no duplicates.

Understanding the ORA_ROWSCN behavior in Oracle

So this is essentially a follow-up question on Finding duplicate records.
We perform data imports from text files every day, and we ended up importing 10163 records spread across 182 files twice. On running the query mentioned above to find duplicates, the total count of records we got is 10174, which is 11 records more than what is contained in the files. I put this down to the possibility that two records which are exactly the same, but both valid, were also being counted by the query. So I thought it would be best to use a timestamp field and simply find all the records inserted by today's run (which is what added the duplicate rows). I used ORA_ROWSCN in the following query:
select count(*) from my_table
where TRUNC(SCN_TO_TIMESTAMP(ORA_ROWSCN)) = '01-MAR-2012'
;
However, the count is still higher, i.e. 10168. Now, I am pretty sure that the total number of lines in the files is 10163, from running wc -l *.txt in the folder that contains all the files.
Is it possible to find out which rows are actually inserted twice?
By default, ORA_ROWSCN is stored at the block level, not at the row level. It is only stored at the row level if the table was originally built with ROWDEPENDENCIES enabled. Assuming that you can fit many rows of your table in a single block and that you're not using the APPEND hint to insert the new data above the existing high water mark of the table, you are likely inserting new data into blocks that already have some existing data in them. By default, that is going to change the ORA_ROWSCN of every row in the block causing your query to count more rows than were actually inserted.
Since ORA_ROWSCN is only guaranteed to be an upper-bound on the last time there was DML on a row, it would be much more common to determine how many rows were inserted today by adding a CREATE_DATE column to the table that defaults to SYSDATE or to rely on SQL%ROWCOUNT after your INSERT ran (assuming, of course, that you are using a single INSERT statement to insert all the rows).
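A sketch of that CREATE_DATE approach, using the table name from the question (note that existing rows will pick up the default value as of the ALTER, so only rows loaded afterwards are dated accurately):

ALTER TABLE my_table ADD (create_date DATE DEFAULT SYSDATE);

-- rows loaded on 1 March 2012
SELECT COUNT(*)
FROM my_table
WHERE create_date >= DATE '2012-03-01'
  AND create_date <  DATE '2012-03-02';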
Generally, using ORA_ROWSCN and the SCN_TO_TIMESTAMP function is going to be a problematic way to identify when a row was inserted even if the table is built with ROWDEPENDENCIES. ORA_ROWSCN returns an Oracle SCN, a System Change Number. This is a unique identifier for a particular change (i.e. a transaction). As such, there is no direct link between an SCN and a time: my database might be generating SCNs a million times more quickly than yours, and my SCN 1 may be years apart from your SCN 1. The Oracle background process SMON maintains a table that maps SCN values to approximate timestamps, but it only maintains that data for a limited period of time; otherwise, your database would end up with a multi-billion row table that was just storing SCN-to-timestamp mappings. If the row was inserted more than, say, a week ago (the exact limit depends on the database and database version), SCN_TO_TIMESTAMP won't be able to convert the SCN to a timestamp and will return an error.
