SQLite Table Structure for Creating a 1:Many Index Relationship - sqlite

Problem
I can't create a table with an index column that references multiple rows in the table. The picture example below shows what I'm trying to create.
Overview
Imagine an SQLite table that will hold stock dividend payments. The index column is set to the ticker symbols; however, each ticker symbol refers to multiple records, which are organized by a time stamp. The documentation on SQLite, and about 15 other tutorials, all seem to focus on indexing where there is a 1:1 relationship between an index entry and a record. I would like to create an index with a 1:many relationship.
The lookup would find the appropriate stock by symbol, and then (probably) use a secondary index on the dates in the first column. But I cannot find any examples where others have tried to set up this structure, which makes me think either I don't have the right approach, or this is just a special case.

I don't think your problem is actually a problem. Putting an index on a column doesn't mean it has to contain unique values; it's perfectly reasonable for values in an indexed column to repeat. Of course, there are diminishing returns: if you have a million rows and only five distinct values in a column, an index on that column isn't really going to do much for you.
A good rule of thumb is to start with an index on the column(s) you use in your WHERE clause, as in the sketch below. Then run the queries and see whether you get satisfactory performance.
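For illustration, a minimal sketch of such a table (all names invented), with a plain non-unique index on the ticker column and an optional composite index covering the symbol-then-date lookup:

CREATE TABLE dividends (
    pay_date TEXT NOT NULL,   -- time stamp, the "first column" mentioned above
    ticker   TEXT NOT NULL,   -- repeats across many rows: the 1:many relationship
    amount   REAL NOT NULL
);

-- A plain (non-unique) index: many rows may share the same ticker.
CREATE INDEX idx_dividends_ticker ON dividends(ticker);

-- Or a composite index that also covers the secondary ordering by date.
CREATE INDEX idx_dividends_ticker_date ON dividends(ticker, pay_date);

-- Typical lookup: all payments for one symbol, ordered by date.
SELECT pay_date, amount FROM dividends WHERE ticker = 'XYZ' ORDER BY pay_date;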

Related

A way to search multiple lists and return a query for which list it belongs to

I have a csv file with multiple lists (see picture). What I want to do is query every single value so it tells me which lists the value is found in.
E.g. I query the number 898774 and it tells me: 898774 - prim6, in set 1, set 2 and set 4.
I did find a quick workaround by making one big list in Excel, removing dupes, and then manually searching for each number. Doable for a small number of sets, but not for thousands of them.
I created a vector for each column and started a search with which(sapply), but then remembered I needed the names. Just a little bit outside my knowledge.
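One way to sidestep the per-column search entirely (a sketch in SQLite rather than R, with invented names) is to reshape the data into one long table of (set, value) pairs, after which the lookup is a single query:

-- Hypothetical layout after importing the CSV: one row per (set, value) pair.
CREATE TABLE set_members (
    set_name TEXT    NOT NULL,
    value    INTEGER NOT NULL
);

-- Which sets contain 898774?
SELECT value, GROUP_CONCAT(set_name, ', ') AS found_in
FROM set_members
WHERE value = 898774
GROUP BY value;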

Set table vs multi set table performance

I have to prepare a table where I will keep weekly results for some aggregated data. The table will have 30 fields (10 CHARACTERs, 20 DECIMALs), and I expect about 250k new rows weekly.
In my head I can see two scenarios:
A SET table, relying on Teradata to prevent duplicate rows - it should silently skip duplicate entries while inserting new data.
A MULTISET table with a UPI - it will raise an error upon inserting a duplicate row.
The INSERT statement is going to be executed through VBA in Excel, where handling possible Teradata errors is not a problem.
Which scenario will be faster to run in a year's time, when there will be circa 14 million rows?
Is there any other way to have it done?
Regards
On a high level, since your table will hold a comparatively large number of rows, it is advisable not to use a SET table; go with a MULTISET table instead.
For more info you can refer to this link
http://www.dwhpro.com/teradata-multiset-tables/
Why do you care about duplicate rows? When you store weekly aggregates there should be no duplicates at all. And duplicate rows are not the same as duplicate primary key values.
Simply choose a PI that best fits your join/access patterns (maybe partitioned by date). To avoid any potential duplicates, you might simply use MERGE instead of INSERT, as sketched below.
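A hedged sketch in Teradata SQL (names invented; the real table would carry all 30 fields):

CREATE MULTISET TABLE weekly_results (
    week_start  DATE          NOT NULL,
    metric_name CHAR(10)      NOT NULL,
    amount      DECIMAL(18,2)
)
UNIQUE PRIMARY INDEX (week_start, metric_name);

-- MERGE updates a row when its UPI value already exists and inserts it
-- otherwise, so reloading a week cannot raise a duplicate-row error the
-- way a plain INSERT into a MULTISET table with a UPI would.
MERGE INTO weekly_results t
USING (SELECT DATE '2011-07-01' AS week_start,
              'CALLS'           AS metric_name,
              42.00             AS amount) s
ON t.week_start = s.week_start AND t.metric_name = s.metric_name
WHEN MATCHED THEN
    UPDATE SET amount = s.amount
WHEN NOT MATCHED THEN
    INSERT (week_start, metric_name, amount)
    VALUES (s.week_start, s.metric_name, s.amount);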

How do I ignore a column in a tSQLt AssertEqualsTable?

Is it possible to ignore certain columns that are almost definitely going to be different in a tSQLt AssertEqualsTable? Examples would be primary keys from the two results tables, insert/update date stamps, and so on.
I have been working around this by selecting only the relevant columns into new temp tables and comparing those instead, but this means extra work and extra places to make mistakes. Not a lot, sure, but it adds up over dozens or hundreds of tests.
A built-in or simple way to say 'compare these two tables but ignore columns X and Y' would be very useful. Is there a better solution than the one I'm using?
All you need to do is populate an #expected table with the columns you are interested in. When AssertEqualsTable does the comparison it will ignore any columns in the #actual table that don't exist in the #expected table.
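For example (a sketch with invented table and column names), the #actual table below carries an Id and a date stamp, but the assertion only compares the two columns defined in #expected:

CREATE TABLE #expected (CustomerName NVARCHAR(100), Amount DECIMAL(10, 2));
INSERT INTO #expected VALUES (N'Alice', 10.00);

-- #actual has extra columns (Id, InsertedAt); per the behaviour described
-- above, they are ignored because they do not exist in #expected.
SELECT Id, CustomerName, Amount, InsertedAt
INTO #actual
FROM dbo.SomeResultTable;

EXEC tSQLt.AssertEqualsTable @Expected = '#expected', @Actual = '#actual';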

When to include an index (automated heuristic)

I have a piece of software which takes in a database and uses it to produce graphs based on what the user wants (primarily queries of the form SELECT AVG(<input1>) AS x, AVG(<input2>) AS y FROM <input3> WHERE <key> IN (<vals..>) AND ...). This works nicely.
I have a simple script that is passed an (often large) number of files, each describing a row:
name=foo
x=12
y=23.4
....... etc.......
The script goes through each file, saving the variable names and an INSERT query for each. It then loads the variable names, sort | uniq's them, and makes a CREATE TABLE statement out of them (SQLite, amusingly enough, is OK with having all columns be NUMERIC, even if they actually end up containing text data). Once this is done, it executes the INSERTs (in a single transaction, otherwise it would take ages). A sketch of the generated SQL is shown below.
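For concreteness, the SQL generated for the sample file above might look like this (table name invented):

CREATE TABLE results (name NUMERIC, x NUMERIC, y NUMERIC);

BEGIN TRANSACTION;
INSERT INTO results (name, x, y) VALUES ('foo', 12, 23.4);
-- ... one INSERT per input file ...
COMMIT;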
To improve performance, I added a basic index on each column. However, this increases database size fairly significantly, and only provides a moderate improvement.
Data comes in three basic types:
single value, indicating things like program version, etc.
a few values (<10), indicating things like input parameters used
many values (>1000), primarily output data.
The first type obviously shouldn't need an index, since it will never be sorted upon.
The second type should have an index, because it will commonly be filtered by.
The third type probably shouldn't need an index, because it will be used in output.
It would be annoying to determine which type a particular value is before it is put in the database, but it is possible.
My question is twofold:
Is there some hidden cost to extraneous indexes, beyond the size increase that I have seen?
Is there a better way to index for filtration queries of the form WHERE foo IN (5) AND bar IN (12,14,15)? Note that I don't know which columns the user will pick, beyond the fact that they will be type 2 columns.
Read the relevant documentation:
Query Planning;
Query Optimizer Overview;
EXPLAIN QUERY PLAN.
The most important thing for optimizing queries is avoiding I/O, so tables with less than ten rows should not be indexed because all the data fits into a single page anyway, so having an index would just force SQLite to read another page for the index.
Indexes are important when you are looking up records in a big table.
Extraneous indexes make table updates slower, because each index needs to be updated as well.
SQLite can use at most one index per table in a query.
This particular query could be optimized best by having a single index on the two columns foo and bar.
However, creating such indexes for all possible combinations of lookup columns is most likely not worth the effort.
If the queries are generated dynamically, the best idea probably is to create one index for each column that has good selectivity, and rely on SQLite to pick the best one.
And don't forget to run ANALYZE.
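Putting that together, a sketch (with an invented table name t) of the indexing options and the planner check:

-- One single-column index per selective ("type 2") column, letting
-- SQLite pick the most selective one at query time.
CREATE INDEX idx_t_foo ON t(foo);
CREATE INDEX idx_t_bar ON t(bar);

-- Alternatively, one composite index tailored to this exact filter shape.
CREATE INDEX idx_t_foo_bar ON t(foo, bar);

ANALYZE;  -- collect statistics so the planner can judge selectivity

-- Verify which index the planner actually chose:
EXPLAIN QUERY PLAN
SELECT AVG(x) AS x, AVG(y) AS y
FROM t
WHERE foo IN (5) AND bar IN (12, 14, 15);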

Do we have to use fact table for reports?

I am working on building a data mart for reporting purpose.
I am new to this field and looking for help.
I have a fact table and two dimension tables.
The fact table has only 3 fields, its primary key and foreign key references to two dimension tables.
The two dimension tables have data related to 1) phone numbers and 2) extension numbers.
(I cannot combine these dimension tables because they have different information.)
As you can see, my fact table does not have any quantitative columns.
I want to generate a report that displays phone numbers and corresponding extensions.
I can get this information by performing a join on the two dimension tables.
So my question is: do I have to use the fact table for the report? That is, should I first get the key from the phonenumber table, join to the fact table, get the extension key, and join to the extension table?
OR
Should I simply join the two dimension tables to generate the report, since that is possible in this case?
Do we have to involve the fact table?
Thanks for reading.
Any help is appreciated.
do I have to use the fact table for the report? That is, should I first get the key from the phonenumber table, join to the fact table, get the extension key, and join to the extension table?
Often, this is necessary.
Should I simply join the two dimension tables to generate the report, since that is possible in this case?
Sometimes, this works, also.
Do we have to involve the fact table?
Depends on the relationships.
If you have a "hierarchy" of dimensional information, then the two dimensions could be directly related. In this case, the fact table doesn't tie them together. The fact ties to the detailed dimension; the detailed dimension ties to the summary. This is rare.
Dimensions change.
If you have two or more Slowly Changing Dimensions, then your dimensions may include lots of "previous" relationship information.
Fact 1: Phone xxx-xxx-xxxx, Extension yyyy
Fact 2: Phone xxx-xxx-xxxx, Extension zzzz
Then another load applies an SCD rule to modify zzzz to aaaa, as of 7/1/11. You may have the old dimension values available, as well as the new dimension values, with an applicable date range.
Now, the fact (and the date range) are required to define which copy of the dimension value you're going to get.
Fact 2: Phone xxx-xxx-xxxx, Extension zzzz, from beginning to before 7/1/11.
Fact 2: Phone xxx-xxx-xxxx, Extension aaaa, from 7/1/11 to end.
So, you may need the fact, dimensions and time to sort out the relationships.
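A sketch of that last case (all names invented; as_of_date on the fact and the valid_from/valid_to range on the dimension are the assumed SCD columns):

SELECT p.phone_number,
       e.extension
FROM fact_phone_ext f
JOIN dim_phone p
  ON p.phone_key = f.phone_key
JOIN dim_extension e
  ON e.extension_key = f.extension_key
 AND f.as_of_date BETWEEN e.valid_from AND e.valid_to;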
