Unix Remove Duplicate uniq vs Sybase - ignore_dup_key - unix

I have a file with 20 million records, about 30% of which are duplicate values.
We are considering two approaches:
1. Writing a shell script to remove the duplicates after the file is uploaded to the Unix box.
2. Creating a table in Sybase with ignore_dup_key and then BCPing the file as-is into the table, so that the table eliminates the duplicates.
I have read that as the duplicate percentage increases, ignore_dup_key impacts performance.
How does the performance of the Unix uniq method compare? Which one is more suitable here?
Inputs are welcome!

Doing a BCP into a table with an ignore_dup_key unique index should be fastest, not least because it is much easier and simpler to implement.
Here is why: ultimately, in either scenario you end up inserting a set of rows into the database table and building an index for those inserted rows. That amount of work is equal in both cases.
Now, the BCP method uses the existing index to identify and discard duplicate keys. This is handled quite efficiently inside ASE, as the row is discarded before being inserted. The number of duplicates does NOT affect this efficiency when you only want to discard the duplicates (whoever said otherwise was incorrectly informed).
If you did this duplicate-filtering outside ASE, you would need to figure out a sorting method that discards records based on the uniqueness of only part of the record (the key). That is less trivial than it sounds and also requires system resources to perform the sort. Those resources are better spent doing the sort (i.e. the index creation) inside ASE -- which you already have to do anyway for the rows being finally inserted.
Regardless, the BCP method is much more convenient than external sorting since it requires less work (fewer steps) from you. That is probably an even more important consideration.
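As a rough illustration, here is a minimal sketch of the ignore_dup_key setup (the table name load_target and the column names are made up for this example; substitute your real key):

    -- Hypothetical staging table; replace the key column with your actual key.
    create table load_target (
        record_key varchar(30)  not null,
        payload    varchar(255) null
    )
    go

    -- Unique index with ignore_dup_key: duplicate keys arriving via BCP are
    -- silently discarded instead of aborting the batch.
    create unique index ux_load_target_key
        on load_target (record_key)
        with ignore_dup_key
    go

The file is then loaded with the normal bcp utility (something like bcp yourdb..load_target in datafile.txt -c ...); rows whose key already exists are rejected by the index rather than failing the load.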
For further reading, my book "Tips, Tricks & Recipes for Sybase ASE" has a few sections dedicated to ignore_dup_key.

Without testing both approaches, you cannot say for sure which is faster. But the Sybase approach will most probably be faster, since databases are optimized to parallelize your workload.


Do a lot of sequences in MariaDB have an impact on performance?

We make heavy use of MariaDB sequences for key-based naming-series generation. We eventually realised that we have hit 5,000+ sequences, with many more (40K-50K) to come.
So far we have not seen any major impact on performance; however, knowing that every sequence internally creates a table, will this cause any major impact in the future?
We also use the desc <table> command a lot, which scans information_schema.
I don't have specifics, but...
The OS probably has trouble with 50K tables -- each table is one or more files in the OS.
AUTO_INCREMENT is extremely well optimized; use that whenever practical.
Consider MariaDB's third sequencing option: the pseudo tables like seq_1_to_10, which probably take very little overhead.
I find that SHOW CREATE TABLE is more descriptive than desc. But why do you need it "a lot"? Once an hour is "rather often" for that query. (I am looking at the STATUS value Com_create_table; I suspect DESCRIBE increments that.)
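To make the three options concrete, here is a small sketch (the sequence, table, and column names are invented for illustration):

    -- Option 1: a named sequence -- each one is backed by its own table object.
    CREATE SEQUENCE invoice_seq START WITH 1 INCREMENT BY 1;
    SELECT NEXTVAL(invoice_seq);

    -- Option 2: AUTO_INCREMENT -- no extra object per series, heavily optimized.
    CREATE TABLE invoice (
        id   BIGINT AUTO_INCREMENT PRIMARY KEY,
        note VARCHAR(100)
    );

    -- Option 3: the SEQUENCE storage engine's pseudo tables -- nothing is
    -- stored on disk, the rows are generated on the fly.
    SELECT seq FROM seq_1_to_10;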

"SELECT * FROM..." VS "SELECT ID FROM..." Performance [duplicate]

As someone who is fairly new to SQL (I don't use it much), I'm sure there is an answer to this question out there, but I didn't know what to search for to find it, so I apologize.
Question: if I have a bunch of rows in a database with many columns, but I only need to get back the IDs, which of these is faster, or are they the same speed?
SELECT * FROM...
vs
SELECT ID FROM...
You asked specifically about performance, as opposed to all the other reasons to avoid SELECT *, so I will limit my answer to performance.
On my system, SQL Profiler initially indicated less CPU overhead for the ID-only query, but with the small number of rows involved, each query took the same amount of time.
I think that was really only due to the ID-only query being run first, though. On re-run (in the opposite order), they showed equally little CPU overhead.
With extremely high column and row counts, or extremely wide rows, there may be a perceptible difference in the database engine, but nothing glaring here.
Where you will really see the difference is in sending the result set back across the network! The ID-only result set will typically be much smaller of course - i.e. less to send back.
Never use * to return all columns in a table; it's lazy. You should only extract the data you need.
So SELECT field FROM ... is faster.
There are several reasons you should never (never ever) use SELECT * in production code:
since you're not giving your database any hints as to what you want, it will first need to check the table's definition in order to determine the columns on that table. That lookup will cost some time - not much in a single query - but it adds up over time.
in SQL Server (not sure about other databases), if you need a subset of columns, there's always a chance a non-clustered index might be covering that request (contain all columns needed). With a SELECT *, you're giving up on that possibility right from the get-go. In this particular case, the data would be retrieved from the index pages (if those contain all the necessary columns) and thus disk I/O and memory overhead would be much less compared to doing a SELECT *.... query.
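To illustrate the covering-index point, here is a SQL Server-style sketch (the table, column, and index names are hypothetical):

    -- A narrow non-clustered index on ID alone can satisfy "SELECT ID FROM Orders"
    -- without touching the much wider table pages.
    CREATE NONCLUSTERED INDEX IX_Orders_ID
        ON dbo.Orders (ID);

    -- Covered: only the narrow index needs to be read.
    SELECT ID FROM dbo.Orders;

    -- Not covered: every column must be fetched, so the whole clustered
    -- index (or heap) has to be read.
    SELECT * FROM dbo.Orders;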
Long story short, selecting only the columns you need will generally be faster: SELECT * forces the engine to read and return every column and rules out covering indexes. This is a best practice you should adopt very early on.
For the second part, you should probably post a separate question instead of piggybacking off this one. That makes it easier to distinguish what you are asking about.

Sqlite3 database performance

I want to create a database. Simple, I think: just to store phone numbers, dates, times and notes.
Is it better (for database performance) to use a new table for every phone number and its notes, or one table with all the information in it?
The right way is to normalize your data (hence, use as many tables as needed).
If you split your data into several tables (assuming you use indexes), write performance will be better.
Regarding read performance, it depends on the size of the data (namely the notes), but I would argue that having more tables is also better - except if indexing is out of the question (there is no real reason for that) and you would otherwise need to join tables to get the data. Even then, I don't think it would be a big trade-off.
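As a sketch of what the normalized layout might look like in SQLite (the table and column names are illustrative, not prescribed):

    -- One table for the phone numbers, one for the notes that reference them.
    CREATE TABLE phone (
        id     INTEGER PRIMARY KEY,
        number TEXT NOT NULL UNIQUE
    );

    CREATE TABLE note (
        id       INTEGER PRIMARY KEY,
        phone_id INTEGER NOT NULL REFERENCES phone(id),
        noted_at TEXT NOT NULL,          -- date/time, e.g. an ISO-8601 string
        body     TEXT
    );

    -- Index the foreign key so per-number lookups stay fast.
    CREATE INDEX idx_note_phone_id ON note(phone_id);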
SQLite can write millions of rows per second and read even more; are you sure you need to ask this question?

Split large sqlite table by sessionid field

I am relatively new to sql(ite), and I'm learning as I go while working on a new project.
We have millions of transaction rows in one "data" table, one of the fields being "sessionid".
Since I want to concentrate on in-session activity for now, I primarily need to look only at transactions from the same sessions.
My intuition is that it would be a lot faster if I separated the database by session into many single-session tables, rather than always querying for a single sessionid and then proceeding. My question: is that correct? Will it make a difference?
Even if not: could you help me out and tell me how I could split the rows of the one "data" table into many session-specific tables, with the rows themselves staying the same? Plus one table that relates sessionIds to their tables?
Thanks!
A friend just told me that the splitting-into-tables approach would be extremely inflexible, and that I should instead add an index on the sessionId values to access single sessions faster. Any thoughts on that, and on how to do it best?
First of all, have you hit any specific performance bottleneck with it so far? If yes, please describe it.
Having one table per session would probably speed up lookups and index maintenance (for INSERTs).
SQLite doesn't impose a limit on the number of tables, so you should be okay.
One other solution that provides easier maintenance is to create one table per day/week.
Depending on how long your sessions last, this could be feasible or not.
Related: https://stackoverflow.com/a/811862/89771
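For the indexing approach mentioned in the question's edit, a minimal SQLite sketch could look like this (assuming the table and column really are called data and sessionid):

    -- A plain index on sessionid lets SQLite jump straight to the rows of one
    -- session instead of scanning the whole table.
    CREATE INDEX idx_data_sessionid ON data(sessionid);

    -- Per-session queries can then use the index (12345 is a placeholder id):
    SELECT * FROM data WHERE sessionid = 12345;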

Post-processing in SQL vs. in code

I have a general inquiry related to processing rows from a query. In general, I always try to format/process my rows in SQL itself, using numerous CASE WHEN statements to pre-format my db result, limiting rows and filling columns based on other columns.
However, you can also opt to just select all your rows and do the post-processing in code (asp.NET in my case). What do you guys think is the best approach in terms of performance?
Thanks in advance,
Stijn
I would recommend doing the processing in the code, unless you have network bandwidth considerations. The simple reason for this is that it is generally easier to make code changes than database changes. Furthermore, performance is more often related to the actual database query and disk access than to the amount of data returned.
However, I'm assuming that you are referring to "minor" formatting changes to the result. Standard WHERE clauses should naturally be done in the database.
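For context, the kind of in-query formatting being discussed looks roughly like this (the table and column names are hypothetical); the CASE labelling could just as well be done afterwards in .NET code, while the WHERE filtering should stay in SQL:

    SELECT
        order_id,
        CASE WHEN status = 1 THEN 'Open'
             WHEN status = 2 THEN 'Shipped'
             ELSE 'Unknown'
        END AS status_label
    FROM orders
    WHERE order_date >= '2015-01-01';   -- filtering stays in the database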
