SQL Join on Int for Decimal Value Lookup Table - asp.net

I'm doing a project in ASP.NET Core 2.1 (EF, MVC, SQL Server) and have a table called Orders, which in the end will drive a grid (i.e. ledger) of trades and various calculations on those numbers (no paging, so it could run hundreds or thousands of records long).
In that Orders table is a property/column named Size. Size will basically be a lot value from 0.01 to maybe 10.0 in increments of 0.01, so 1000 different values to start, and I'm guessing 95% of people will use values less than 5.0.
So originally, I thought I would use an OrderSize join table like the one below, with an FK constraint on the Orders table's Size column (i.e. SizeId):
| SizeId (int) | Value (decimal(9,2)) |
| 1 | 0.01 |
| 2 | 0.02 |
| ... | ... |
| 1000 | 10.00 |
That OrderSize table will most likely never change (i.e. ~1000 decimal records), and the Size value in the Orders table could get quite repetitive if I just dump decimals in there, hence the reason for the join table.
However, the more I'm learning about SQL, the more I realize I have no clue what I'm doing, and the bytes of space I'm saving might create a whole other performance-robbing situation, or who knows what.
I'm guessing the SizeId int for the join uses 4 bytes, and the actual decimal Value another 5 bytes, so I'm not even sure I'm saving much space?
I realize both methods will probably work OK, especially on smaller queries. However, what is technically the correct way to do this? And are there any other gotchas or no-nos I should be considering when eventually calculating my grid values, like you would in an account ledger (i.e. assuming the join is the way to go)? Thank you!

It really depends on what your main objective is behind using a lookup table. If it's only about storage space, then there are other ways you can design your database (using partitions and archiving bigger tables on cheaper storage).
That would be more scalable than using the lookup-table approach (what happens if there is more than one decimal field in the Orders table: do you create a lookup table for each decimal field?).
You will also have to consider indexes on the Orders table when joining to the OrderSize table if you decide to go down that route. It can potentially lead to more frequent index scans if the join key is not part of an index on the Orders table, thereby causing slower query performance.
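For illustration only, here is a minimal T-SQL sketch of the two designs being weighed; the table and column definitions are assumptions based on the question:

-- Option 1: store the decimal directly on Orders (decimal(9,2) takes 5 bytes)
create table Orders_Direct (
    OrderId int identity primary key,
    Size    decimal(9,2) not null       -- e.g. 0.01 .. 10.00
);

-- Option 2: a 4-byte int FK on Orders plus a ~1000-row lookup table
create table OrderSize (
    SizeId int primary key,
    Value  decimal(9,2) not null
);

create table Orders_Lookup (
    OrderId int identity primary key,
    SizeId  int not null references OrderSize (SizeId)
);

-- Reading option 2 always needs a join to recover the decimal value
select o.OrderId, s.Value as Size
from Orders_Lookup o
join OrderSize s on s.SizeId = o.SizeId;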

Related

Does ClickHouse support quick retrieval of any column?

I tried to use ClickHouse to store 4 billion rows, deployed on a single machine with a 48-core CPU, 256 GB of memory, and a mechanical hard disk.
My data has ten columns, and I want to quickly search any column through SQL statements, such as:
select * from table where key='mykeyword'; or select * from table where school='Yale';
I use ORDER BY to establish a sort key, e.g. order by (key, school, ...).
But when I search, only queries on the first field of the sort key (key) are very fast. When searching on other fields, the query is very slow or even runs out of memory (the memory allocation is already large enough).
So I'd like to ask the experts: does ClickHouse support this kind of high-performance search on each column, similar to MySQL indexes? I also tried to create a secondary index for each column, but the performance did not improve.
You should try to understand how sparse primary indexes work, and how exactly the right ORDER BY clause in CREATE TABLE helps your query performance.
ClickHouse will never work the same way as MySQL.
Try to use PRIMARY KEY and ORDER BY in the CREATE TABLE statement, and put fields with low value cardinality first in the PRIMARY KEY.
Don't try to select ALL columns; SELECT * ... is really an antipattern.
Moreover, a secondary data-skipping index may help you (but I'm not sure):
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes
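As an illustration only, here is a rough sketch of a MergeTree table that puts a lower-cardinality column first in the sort key and adds a data-skipping index on another column (table and column names are hypothetical, and the index type and granularity would need tuning against real data):

CREATE TABLE events
(
    school  String,
    key     String,
    payload String,
    -- bloom_filter skip index lets ClickHouse skip granules that cannot contain the value
    INDEX idx_key key TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY (school, key);   -- low-cardinality column first

-- select only the columns you need instead of SELECT *
SELECT key, school FROM events WHERE school = 'Yale';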

websql performance, can we shard tables

I am using WebSQL to store data in a PhoneGap application. One of the tables has a lot of data, say from 2000 to 10000 rows. So when I read from this table, which is just a simple select statement, it is very slow. I then debugged and found that as the size of the table increases, the performance decreases exponentially. I read somewhere that to get performance you have to divide the table into smaller chunks; is that possible, and how?
One idea is to look for something to group the rows by and consider breaking them into separate tables based on some common category, instead of one shared table for everything.
I would also consider fine-tuning the queries to make sure they are optimal for the given table.
Make sure you're not just running a simple SELECT query without a WHERE clause to limit the result set.
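Web SQL is backed by SQLite, so before splitting tables it is worth trying an index plus a restrictive WHERE clause; a minimal sketch with hypothetical table and column names:

CREATE INDEX IF NOT EXISTS idx_items_category ON items (category);

-- fetch only the rows (and columns) you actually need instead of selecting the whole table
SELECT id, name, category
FROM items
WHERE category = 'books'
LIMIT 50;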

Storing Weighted Graph Time Series in Cassandra

I am new to Cassandra, and I want to brainstorm storing time series of weighted graphs in Cassandra, where the edge weight is incremented at each time step but also decayed as a function of time. For example,
w_ij(t+1) = w_ij(t)*exp(-dt/tau) + 1
My first shot involves two CQL v3 tables:
First, I create a partition key by concatenating the id of the graph and the two nodes incident on the particular edge, e.g. G-V1-V2. I do this in order to be able to use the "ORDER BY" directive on the second component of the composite keys described below, which is type timestamp. Call this string the EID, for "edge id".
TABLE 1
- a time series of edge updates
- PRIMARY KEY: EID, time, weight
TABLE 2
- values of "last update time" and "last weight"
- PRIMARY KEY: EID
- COLUMNS: time, weight
Upon each tick, I fetch and update the time and weight values stored in TABLE 2. I use these values to compute the time delta and new weight. I then insert these values in TABLE 1.
Are there any terrible inefficiencies in this strategy? How should it be done? I already know that the update procedure for TABLE 2 is not idempotent and could result in inconsistencies, but I can accept that for the time being.
EDIT: One thing I might do is merge the two tables into a single time series table.
You should avoid any kind of read-before-write when it comes to Cassandra (and any other database where you can't do a compare-and-swap operation for the write).
First of all: which queries and query patterns does your application have?
Furthermore, I would be interested in how often a new weight for each edge will be calculated and stored. Every second, hour, day?
Would it be possible to hold the last weight of each edge in memory, so you could avoid the read before the write? Possibly some sort of lazy-loading mechanism for this value would be feasible.
If your queries will allow this data model, I would try to build a solution with a single column family.
I would avoid reading before writing in Cassandra as it really isn't a great fit. Reads are expensive, considerably more so than writes, and to sustain performance you'll need a large number of nodes for a relatively small number of queries. What you're suggesting doesn't really lend itself to being a good fit for Cassandra, as there doesn't appear to be any way to avoid reading before you write. Even if you use a single table you will still need to fetch the last update entries to perform your write. While it certainly could be done, I think there are better tools for the job.
Having said that, this would be perfectly feasible if you could keep all the data in table 2 in memory, and potentially utilise the row cache. As long as table 2 is small enough that the majority of its rows fit in memory, your reads will be significantly faster, which may make up for the need to perform a read on every write. This would be quite a challenge, however, and you would need to ensure that only the "last update time" for each row is kept in memory and that disk rarely needs to be touched.
Anyway, another design you may want to look at is one where you use not only Cassandra but also a cache in front of Cassandra to store the last updated times. This could run alongside Cassandra or on a separate node, but it could be an in-memory store of the last update times only; when you need to update a row you query the cache and write your full row to Cassandra (you could even write the last update time there if you wished). You could use something like Redis to perform this function, and that way you wouldn't need to worry about tombstones or forcing everything to be stored in memory, and so on.
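To make the two-table layout from the question concrete, here is a rough CQL sketch; the table names and column types are guesses:

-- TABLE 1: time series of edge updates, newest first within each edge
CREATE TABLE edge_updates (
    eid    text,          -- concatenated edge id, e.g. 'G-V1-V2'
    time   timestamp,
    weight double,
    PRIMARY KEY (eid, time)
) WITH CLUSTERING ORDER BY (time DESC);

-- TABLE 2: last update time and last weight per edge
CREATE TABLE edge_latest (
    eid    text PRIMARY KEY,
    time   timestamp,
    weight double
);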

How to work with data stored in database?

Good day everyone,
I have some questions about how to do calculations on data stored in the database. For example, I have a table:
| ID | Item name | quantity of items | item price | date |
and, for example, I have stored 10000 records.
The first thing I need to do is pick up items from a date interval, so I won't need the whole database for my calculations. Then, once I get the items from that date interval, I have to add some columns, for example to calculate:
full price = quantity of items * item price
and store them in a new table for each item. So the table for the items picked from the date interval should look like this:
| ID | Item name | quantity of items | item price | date | full price |
The point is that I don't know how to store the items I picked with the date interval. Do I have to create some temporary table, or something?
This will be an ASP.NET web application, and for calculations in the database I think I will use SQL queries. Maybe there is an easier way to do it? Thank you for your time.
Like other people have said, you can perform these calculations on the fly rather than store them.
However, to answer your question, a query like this should do the trick.
I haven't tested this, so the syntax might be off a touch, but it will get you on the right track.
Ultimately you have to do an INSERT with a SELECT:
insert into itemFullPrice
select id, itemname, itemqty, itemprice, [date], itemqty * itemprice as fullprice
from items
where [date] between '2012/10/01' and '2012/11/01'
Again, don't shoot me if I have got the syntax a little off; it's a busy day today :D
Having 10000 records, it would not be a good idea to use temporary tables.
You'd be better off having another table, called ProductsPriceHistory, where you periodically calculate and store, let's say, monthly reports (a rough sketch of such a rollup follows below).
This way, your reports would be faster and you wouldn't have to make the calculations every time you want to get your report.
Be aware this approach is OK if your date intervals are fixed, I mean monthly, quarterly, yearly, etc.
If your date intervals are dynamic, e.g. from 10/20/2011 to 10/25/2011, from 05/20/2011 to 08/13/2011, etc., this approach wouldn't work.
Another approach is to do the calculations in ASP.NET.
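A hedged sketch of such a monthly rollup, assuming SQL Server 2012+ (for DATEFROMPARTS) and assuming ProductsPriceHistory already exists with matching columns; the column names follow the earlier answer and are otherwise hypothetical:

insert into ProductsPriceHistory (ItemName, PeriodStart, TotalQuantity, TotalFullPrice)
select itemname,
       datefromparts(year([date]), month([date]), 1),  -- first day of the month
       sum(itemqty),
       sum(itemqty * itemprice)
from items
group by itemname, year([date]), month([date])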
Even with 10000 records, your best bet is to calculate something like this on the fly. This is what structured databases were designed to do.
For instance:
SELECT [quantity of items] * [item price] AS [full price]
, [MyTable].*
FROM [MyTable]
More complex calculations that involve JOINs to 3 or more tables and thousands of records might lend themselves to storing values.
There are a few approaches:
use a SQL query to calculate it on the fly - this way nothing extra is stored in the database
use the same or another table to store the result of the calculation
use a calculated field
If you have a low database load (a few queries per minute, a few thousand rows per fetch), then use the first approach.
If calculating on the fly performs poorly (millions of records, x fetches per second...), try the second or third approach.
The third one is OK if your database supports calculated and persisted fields, as SQL Server does.
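For the third approach, a minimal sketch of a persisted computed column in SQL Server (table and column names are hypothetical and follow the earlier examples):

alter table items
add fullprice as (itemqty * itemprice) persisted  -- stored on disk, indexable, kept in sync automatically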
EDIT:
Typically, as others have said, you will perform the calculation in your query. That is fine as long as your project is simple enough.
First, when the table where you store all the items and their prices starts being hit with inserts/updates/deletes from multiple clients, you don't want to block or be blocked by others. You have to understand that e.g. an update on table X can block your select from table X until it is finished (look up page/row locks). This means you are going to love a parallel denormalized structure (a table with the product and the calculated values stored alongside it). This is where e.g. reporting comes into play.
Second, when the calculation is simple enough (a*b) and done over not-so-many records, it's OK. When you have e.g. 10M records and you have to correlate each row with several other rows and do some aggregation over some groups, there is a chance that a calculated/persisted field will save you time: you can get results 10-100 times faster with this approach.
You should separate concerns in your application:
aspx pages for presentation
SQL Server for data persistence
some kind of intermediate "business" layer for extra logic like fullprice = p * q
E.g. if you are using LINQ to SQL for data retrieval, it is very simple to add the fullprice to your entities. The same goes for Entity Framework. Also, if you want, you can already do the computation of p*q in the SQL SELECT. Only if performance really becomes an issue should you start thinking about temporary tables, views with clustered indexes, etc.
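If it ever does come to that, a view with a clustered index (an indexed view) is one way to materialize p*q in SQL Server; this is a hedged sketch only, with hypothetical schema and names (indexed views carry restrictions such as SCHEMABINDING and two-part table names):

create view dbo.ItemFullPrice
with schemabinding
as
select id, itemname, itemqty, itemprice,
       itemqty * itemprice as fullprice
from dbo.items;
go
-- the unique clustered index is what materializes the view's rows
create unique clustered index IX_ItemFullPrice on dbo.ItemFullPrice (id);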

ASP.Net MVC - Data Design - Single Wide Record versus Many small record retrievals

I am designing a web application that we estimate may have about 1500 unique users per hour. (We have no stats for concurrent users.) I am using ASP.NET MVC 3 with an Oracle 11g backend, and all retrieval will be through packaged stored procedures, not inline SQL. The application is read-only.
Table A has about 4 million records in it.
Table B has about 4.5 million records.
Table C has less than 200,000 records.
There are two other tiny lookup tables that are also linked to table A.
Tables B and C both have a 1 to 1 relationship to Table A - Tables A and B are required, C is not. Tables B and C contain many string columns (some up to 256 characters).
A search will always return 0, 1, or 2 records from Table A, with its mate in Table B and any related data in C and the lookup tables.
My data access process would create a connection and command, execute the query, return a reader, load the appropriate object from that reader, close the connection, and dispose.
My question is this....
Is it better (as performance goes) to return a single, wide record set all at once (using only one connection) or is it better to query one table right after the other (using one connection for each query), returning narrower records and joining them in the code?
EDIT:
Clarification - I will always need all the data I would bring over in either option. Both options will eventually result in the same amount of data displayed on the screen as was brought from the DB. But one would use a single connection getting it all at once (but wider, so maybe slower?) and the other would use multiple connections, one right after the other, each getting a smaller amount at a time. I don't know if the impact of the number of connections should influence the decision here.
Also - I have the freedom to denormalize the table design, if I decide it's appropriate.
You only ever want to pull as much data as you need. Whichever way moves less from the database over to your code is the way you want to go. I would pick your second suggestion.
-Edit-
Since you need to pull all of the records regardless, you will only want to establish a connection once. Since you're getting the same amount of data either way, you should try to save as much memory as possible by keeping the number of connections down.
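For reference, a hedged sketch of what the single wide result set could look like inside the packaged procedure (Oracle syntax; table, column, and parameter names are hypothetical):

-- one round trip: A and B are required (inner join), C is optional (outer join)
select a.*, b.*, c.*
from table_a a
join table_b b on b.a_id = a.id
left join table_c c on c.a_id = a.id
where a.search_key = p_search_key;  -- hypothetical procedure parameter; returns 0, 1, or 2 rows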
