How to overcome Row size too large (> 8126) error on Google-Cloud MySQL5.7 Second Generation - innodb

Google Cloud MySQL Engine supports the InnoDB storage engine only.
I am getting the following error when creating a table with 300 columns.
[Err] 1118 - Row size too large (> 8126).
Changing some columns to TEXT or BLOB may help. In the current row format, the BLOB prefix of 0 bytes is stored inline.
I also tried creating the table with some of the columns as TEXT types and others as BLOB types, but it did not work.
Even modifying innodb_log_file_size is not possible, as it is not allowed on the Google Cloud SQL platform.

"Vertical Partitioning"
A table with lots of columns is pushing several limits; you hit one of them. There are several reasonable workarounds; vertical partitioning may be the best, especially if many of the columns are TEXT/BLOB.
Instead of a single table, have multiple tables with the same PRIMARY KEY, except that one may be AUTO_INCREMENT. JOIN them together as needed to collect the columns. You could even have VIEWs to hide the fact that you split up the table. I recommend grouping the columns logically, based on the application and which columns are needed 'together'.
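A minimal sketch of that layout, with made-up table and column names (a "core" table for the hot columns and an "extra" table for the bulky ones):
CREATE TABLE customer_core (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  email VARCHAR(255) NOT NULL
) ENGINE=InnoDB;
-- Bulky, rarely-read columns share the same PRIMARY KEY value
CREATE TABLE customer_extra (
  id INT UNSIGNED PRIMARY KEY,
  notes TEXT,
  preferences TEXT,
  FOREIGN KEY (id) REFERENCES customer_core (id)
) ENGINE=InnoDB;
-- Optional view so the application still sees one "table"
CREATE VIEW customer AS
  SELECT c.id, c.name, c.email, e.notes, e.preferences
  FROM customer_core c
  LEFT JOIN customer_extra e ON e.id = c.id;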
Do not splay an array of things across columns; instead, have another table with multiple rows to handle the repetition. Example of what to avoid: address1, state1, country1, address2, state2, country2.
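For example (hypothetical names), the repeated address columns become rows in a child table:
CREATE TABLE customer_address (
  customer_id INT UNSIGNED NOT NULL,
  seq TINYINT UNSIGNED NOT NULL,
  address VARCHAR(255),
  state VARCHAR(50),
  country_code CHAR(2) CHARACTER SET ascii,
  PRIMARY KEY (customer_id, seq)
) ENGINE=InnoDB;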
Do not use CHAR or BINARY except for truly fixed-length columns. Most such columns are very short. Also, most CHAR columns should be CHARACTER SET ascii, not utf8. (Think country_code, zipcode, md5.)
innodb_log_file_size is only indirectly related to your question. What is its value?
Directly related is innodb_page_size, which defaults to 16K and which virtually no one ever changes. I would expect cloud engines to prohibit changing it.
(I'm with Bill on desiring more info about your schema -- so we can be more specific about how to help you.)

You don't have many options here. InnoDB's default page size is 16KB, and you must design your tables so at least two rows fit in a page. That's where the limit of 8126 bytes per row comes from.
Variable-length columns like VARCHAR, VARBINARY, BLOB, and TEXT can be longer, because data exceeding the row size limit can be stored on overflow pages. To take advantage of this, you must enable the Barracuda file format and choose ROW_FORMAT=DYNAMIC.
In config:
[mysqld]
innodb_file_per_table = ON
innodb_file_format = Barracuda
innodb_default_row_format = DYNAMIC
I don't know if these settings are already enabled in Google Cloud SQL, or if they allow you to change these settings.
Read https://dev.mysql.com/doc/refman/5.7/en/innodb-row-format.html for more information.
Again, the advantage of DYNAMIC row format only applies to variable-length data types. If you have 300 columns that are fixed-length, like CHAR, then it doesn't help.
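With those settings in effect, a table can also request the row format explicitly; a sketch with illustrative names:
CREATE TABLE wide_table (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  col1 VARCHAR(500),
  col2 TEXT,
  -- ... more variable-length columns ...
  col300 TEXT
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC;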
By the way, innodb_log_file_size has nothing to do with this error about row size.

In order to do what you want to do on a Cloud SQL instance, first run this to turn off the innodb_strict_mode variable:
SET innodb_strict_mode = 0;
After that you should be able to create your table.

Related

DynamoDB Scan Vs Query on same data

I have a use case where I have to return all elements of a table in Dynamo DB.
Suppose my table has a partition key (Column X) with the same value, say "monitor", in all rows, and a sort key (Column Y) with distinct values.
Will there be any difference in execution time between the two approaches below, or are they the same?
Scanning the whole table.
Querying based on the partition key "monitor".
You should use the parallel scan feature. Basically, you're doing multiple scans at once on different segments of the table. Watch out for higher RCU usage, though.
Avoid using Scan as far as possible.
Scan will fetch all the rows from the table, and you will also have to use pagination to iterate over all of them. It is more like a SELECT * FROM table; SQL operation.
Use Query if you want to fetch all the rows for a given partition key. If you know which partition key you want the results for, you should use Query, because it uses the partition key to read only the matching items.
Direct answer
To the best of my knowledge, in the specific case you are describing, Scan will be marginally slower (especially in the first response). This assumes you do not do any filtering (i.e., FilterExpression is empty).
Further thoughts
DynamoDB can potentially store huge amounts of data. By "huge" I mean "more than can fit in any machine's RAM". If you need to 'return all elements of a table', you should ask yourself: what happens if that table grows such that all elements no longer fit in memory? You do not have to handle this right now (I believe that as of now the table is rather small), but you do need to keep in mind the possibility of going back to this code and fixing it so that it addresses this concern.
Questions I would ask myself if I were in your position:
(1) Can I somehow set a limit on the number of items I need to read (say, read only the first 1000 items)?
(2) How is this information (the list of items) used? Is it sent back to a JS application running inside a browser which displays it to a user? If so, what will the user do with a huge list of items?
(3) Can I work on the items one at a time (or 10 or 100 at a time)? If so, I only need to hold one (or 10 or 100) items in memory at once, not the entire list.
In general, in DynamoDB, scan operations are used as described in (3): read one item (or several items) at a time, do some processing, and then move on to the next item.

How to determine the largest length of Progress OpenEdge ABL fields

In OpenEdge ABL / Progress 4GL, a field can be defined with a FORMAT, but that is only the default format for it to be displayed. Thus, a CHARACTER field with FORMAT 'X(10)' could store thousands of characters past the first ten.
The database I'm using contains millions of rows in some of the tables I'm concerned with. Is there any system table or Progress-internal program I can use to determine the longest length of a given field? I'm looking for anything more efficient than full-table scans. I'm on Progress OpenEdge 11.5.
"dbtool" will scan the db and find fields whose width exceeds the "sql width". By default that is 2x the format that was defined for character fields.
https://knowledgebase.progress.com/articles/Article/P24496/
Of course it has to scan the table to do that, so it may not meet your "more efficient than table scans" criterion. FWIW, dbtool is reasonably efficient.
If the fields that you are concerned about are problematic because of potential SQL access you might also want to look into "authorized data truncation" via the -SQLTruncateTooLarge parameter which will truncate the data on the fly.
Another option would be -SQLWidthUpdate which automatically adjusts the SQL width on the fly. That requires an upgrade to at least 11.6.
Both of these might solve your problem without periodic table scans.
If it's actually the character format you want to adjust to match the data, I suppose what you could do is to use dbtool to adjust the SQL width of all the fields, and then set the character format to be half the SQL width.

Organizing tables with data-heavy rows to optimize access times

I am working with a sqlite3 database of around 70 gigabytes right now. This db has three tables: one with about 30 million rows, and two more with ~150 and ~300 million each, with each table running from 6-11 columns.
The table with the fewest rows is consuming the bulk of the space, as it contains a raw data column of zipped BLOBs, generally running between 1 and 6 kilobytes per row; all other columns in the database are numeric, and the zipped data is immutable so inefficiency in modification is not a concern.
I have noticed that creating indexes on the numeric columns of this table:
[15:52:36] Query finished in 723.253 second(s).
takes several times as long as creating a comparable index on the table with five times as many rows:
[15:56:24] Query finished in 182.009 second(s).
[16:06:40] Query finished in 201.977 second(s).
Would it be better practice to store the BLOB data in a separate table to access with JOINs? The extra width of each row is the most likely candidate for the slow scan rate of this table.
My current suspicions are:
This is mostly due to the way data is read from disk, making skipping medium-sized amounts of data impractical and yielding a very low ratio of usable data per sector read from the disk by the operating system, and
It is therefore probably standard practice (which, as a relative newcomer to relational databases, I did not know) to avoid putting larger, variable-width data in the same table as other data that may need to be scanned without indexes,
but I would appreciate some feedback from someone with more knowledge in the field.
In the SQLite file format, all the column values in a row are simply appended together, and stored as the row value. If the row is too large to fit into one database page, the remaining data is stored in a linked list of overflow pages.
When SQLite reads a row, it reads only as much as needed, but must start at the beginning of the row.
Therefore, when you have a blob (or a large text value), you should move it to the end of the column list so that it is possible to read the other columns' values without having to go through the overflow page list:
CREATE TABLE t (
id INTEGER PRIMARY KEY,
a INTEGER,
[...],
i REAL,
data BLOB NOT NULL
);
With a single table, the first bytes of the blob value are still stored inside the table's database pages, which decreases the number of rows that can be stored in one page.
If the other columns are accessed often, then it might make sense to move the blob to a separate table (a separate file should not be necessary). This allows the database to go through more rows at once when reading a page, but increases the effort needed to look up the blob value.
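A sketch of that split, reusing the hypothetical table t from above:
-- Narrow table: scanning the numeric columns stays cheap
CREATE TABLE t (
  id INTEGER PRIMARY KEY,
  a INTEGER,
  i REAL
);
-- Blob payload moved out; read only when actually needed
CREATE TABLE t_data (
  id INTEGER PRIMARY KEY REFERENCES t(id),
  data BLOB NOT NULL
);
-- Fetch the blob only for the rows that match
SELECT t.id, t.a, d.data
FROM t JOIN t_data d ON d.id = t.id
WHERE t.a = 42;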

Should I be worried about the settings table getting huge?

I have a pretty fat settings table in SQL Server 2012, now with over 100 columns. As the name suggests, this table keeps track of all kinds of setting values within our website. It used to have fewer than 50 columns, but its size has since doubled.
The reason I store setting values in a database is that users need the ability to change these settings via the UI.
Should I really be worried about this table getting bigger and bigger over time? Or will I have to find some other way to store settings data, e.g. saving it to files?
First, you don't need to store settings in a database in order to let users update them at runtime. You can simply store them in a settings file that gets updated whenever the user makes changes; an XML config file works well for this.
If, however, the application is network based, and you want the settings to follow the user from machine to machine, it makes more sense to put it in a database.
Second, yes... 100 columns is huge. Instead of storing each setting in a separate column, you might consider storing each setting in a separate row, with a common row format of ID, SettingName, SettingValue, and (maybe) DefaultValue. Then your table can grow as large as you like.
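A minimal sketch of that row-per-setting layout (table and column names are only illustrative):
CREATE TABLE Settings (
  ID INT IDENTITY(1,1) PRIMARY KEY,
  SettingName NVARCHAR(100) NOT NULL UNIQUE,
  SettingValue NVARCHAR(MAX) NULL,
  DefaultValue NVARCHAR(MAX) NULL
);
-- Adding a new setting becomes an INSERT instead of an ALTER TABLE
INSERT INTO Settings (SettingName, SettingValue, DefaultValue)
VALUES ('HomePageTitle', 'Welcome', 'My Site');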
We are using JSON to store user settings. The table contains only two columns: the user ID and the setting string. This string is quite long, but it doesn't matter. You can also use XML to store this data.
This is a worse solution if you need to modify the data by hand, but it is faster to fetch from your DB and process on the client or on the ASP.NET server.
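A sketch of that two-column table (SQL Server 2012 has no native JSON type, so the string is simply stored as NVARCHAR; names are illustrative):
CREATE TABLE UserSettings (
  UserId INT PRIMARY KEY,
  SettingsJson NVARCHAR(MAX) NOT NULL
  -- e.g. '{"theme":"dark","pageSize":25}'
);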
I imagine you are concerned about performance on huge tables?
One question is how many rows are in this table. 100 columns with 10,000 rows is not a real problem; 100 columns over 10 million rows is a slightly different ballgame. Not worse or better, just different.
The same considerations apply for small and large tables:
1. Are you indexing properly?
2. Is your I/O fine?
3. Is your space fine?
4. Are you querying efficiently?
There is no right answer for this; it depends on why you have big column counts and whether it's hitting your overall performance.
We run thousands of tables with more than 150 columns with no problems, even with millions of rows among them, and I can't complain about performance.
And this is relatively de-normalized data, so lots of text.

When to include an index (automated heuristic)

I have a piece of software which takes in a database, and uses it to produce graphs based on what the user wants (primarily queries of the form SELECT AVG(<input1>) AS x, AVG(<input2>) AS y FROM <input3> WHERE <key> IN (<vals..>) AND ...). This works nicely.
I have a simple script that is passed an (often large) number of files, each describing a row:
name=foo
x=12
y=23.4
....... etc.......
The script goes through each file, saving the variable names and an INSERT query for each. It then loads the variable names, sort | uniq's them, and makes a CREATE TABLE statement out of them (sqlite, amusingly enough, is OK with having all columns be NUMERIC, even if they actually end up containing text data). Once this is done, it executes the INSERTs (in a single transaction; otherwise it would take ages).
To improve performance, I added a basic index on each column. However, this increases the database size quite significantly, and only provides a moderate improvement.
Data comes in three basic types:
single value, indicating things like program version, etc.
a few values (<10), indicating things like input parameters used
many values (>1000), primarily output data.
The first type obviously shouldn't need an index, since it will never be sorted upon.
The second type should have an index, because it will commonly be filtered by.
The third type probably shouldn't need an index, because it will be used in output.
It would be annoying to determine which type a particular value is before it is put in the database, but it is possible.
My question is twofold:
Is there some hidden cost to extraneous indexes, beyond the size increase that I have seen?
Is there a better way to index for filtration queries of the form WHERE foo IN (5) AND bar IN (12,14,15)? Note that I don't know which columns the user will pick, beyond the fact that it will be a type 2 column.
Read the relevant documentation:
Query Planning;
Query Optimizer Overview;
EXPLAIN QUERY PLAN.
The most important thing for optimizing queries is avoiding I/O, so tables with fewer than ten rows should not be indexed: all the data fits into a single page anyway, so having an index would just force SQLite to read another page for the index.
Indexes are important when you are looking up records in a big table.
Extraneous indexes make table updates slower, because each index needs to be updated as well.
SQLite can use at most one index per table in a query.
This particular query could be optimized best by having a single index on the two columns foo and bar.
However, creating such indexes for all possible combinations of lookup columns is most likely not worth the effort.
If the queries are generated dynamically, the best idea probably is to create one index for each column that has good selectivity, and rely on SQLite to pick the best one.
And don't forget to run ANALYZE.
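A sketch of both options in SQLite; foo and bar are the column names from the question, and t stands in for your generated table:
-- Best for the specific query WHERE foo IN (5) AND bar IN (12,14,15)
CREATE INDEX idx_t_foo_bar ON t(foo, bar);
-- More general: one index per selective "type 2" column,
-- and let the query planner pick whichever fits the query
CREATE INDEX idx_t_foo ON t(foo);
CREATE INDEX idx_t_bar ON t(bar);
-- Refresh the planner's statistics after building indexes
ANALYZE;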

Resources