I am building an application in ASP.NET, C#, MVC3 and SQL Server 2008.
A form is presented to a user to fill out (name, email, address, etc).
I would like to allow the admin of the application to add extra, dynamic questions to this form.
The number of extra questions and the type of data returned will vary.
For instance, the admin could add 0, 1 or more of the following types of questions:
Do you have a full, clean driving licence?
Rate your driving skills from 1 to 5.
Describe the last time you went on a long journey.
etc ...
Note that the answers provided could be binary (Q.1), integer (Q.2) or free text (Q.3).
What is the best way of storing variable data like this in MS SQL?
Any help would be greatly appreciated.
Thanks in advance.
I would create a table with the following columns and store the name of the variable along with its value in the appropriate column, leaving the other value columns null.
id: int (primary key)
name: varchar(100)
value_bool: bit (nullable)
value_int: int (nullable)
value_text: varchar(100) (nullable)
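A minimal T-SQL sketch of that table (the table name and example values are just illustrative), with one row per answered extra question and exactly one value column populated per row:

CREATE TABLE ExtraAnswers (
    id         INT IDENTITY(1,1) PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,   -- the question / variable name
    value_bool BIT NULL,                -- populated for yes/no answers
    value_int  INT NULL,                -- populated for numeric answers
    value_text VARCHAR(100) NULL        -- populated for free-text answers
);

-- Example rows:
INSERT INTO ExtraAnswers (name, value_bool) VALUES ('CleanLicence', 1);
INSERT INTO ExtraAnswers (name, value_int)  VALUES ('DrivingSkill', 4);
INSERT INTO ExtraAnswers (name, value_text) VALUES ('LastLongJourney', 'Drove to Cork last month.');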
Unless space is an issue, I would use a single VARCHAR(MAX) column for the value; it stores numbers and text alike.
Edit: Actually, as Aaron points out below, VARCHAR(MAX) will give you up to 2 billion characters (enough for a book). You might go with VARCHAR(8000) or the like instead, which does give you up to 8,000 characters. Since it is VARCHAR, it does not consume space for unused length (so a 0 or 1 will not take up 8,000 characters' worth of storage, only 1).
Google Cloud MySQL Engine supports the InnoDB storage engine only.
I am getting the following error when creating a table with 300 columns.
[Err] 1118 - Row size too large (> 8126).
Changing some columns to TEXT or BLOB may help. In the current row format, the BLOB prefix of 0 bytes is stored inline.
I tried creating the table with some columns as TEXT and others as BLOB as well, but it did not work.
Even modifying innodb_log_file_size is not possible, as it is not allowed on the Google Cloud SQL platform.
"Vertical Partitioning"
A table with lots of columns is pushing several limits; you hit one of them. There are several reasonable workarounds, Vertical Partitioning may be the best, especially if many are TEXT/BLOB.
Instead of a single table, have multiple tables with the same PRIMARY KEY, except that one may be AUTO_INCREMENT. JOIN them together as needed to collect the columns. You could even have VIEWs to hide the fact that you split up the table. I recommend grouping the columns by some logical grouping based on the application and which columns are needed 'together'.
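A hypothetical sketch of such a split (all table and column names are invented for illustration): two tables sharing the same PRIMARY KEY, plus a VIEW that reassembles the columns:

CREATE TABLE t_main (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    name  VARCHAR(100),
    email VARCHAR(100)
) ENGINE=InnoDB;

CREATE TABLE t_details (
    id        INT PRIMARY KEY,                 -- same value as t_main.id
    biography TEXT,
    notes     TEXT,
    FOREIGN KEY (id) REFERENCES t_main(id)
) ENGINE=InnoDB;

CREATE VIEW t_all AS
SELECT m.id, m.name, m.email, d.biography, d.notes
FROM t_main AS m
LEFT JOIN t_details AS d ON d.id = m.id;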
Do not splay an array of things across columns; instead, have another table with multiple rows to handle the repetition. Example: address1, state1, country1, address2, state2, country2.
Do not use CHAR or BINARY except for truly fixed-length columns. Most such columns are very short. Also, most CHAR columns should be CHARACTER SET ascii, not utf8. (Think country_code, zipcode, md5.)
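To illustrate the last two points (again with invented names): the repeated address group becomes a child table with one row per address, and the truly fixed-length code column is declared as ascii CHAR:

CREATE TABLE t_address (
    main_id      INT NOT NULL,
    seq          TINYINT NOT NULL,              -- 1, 2, 3, ... per parent row
    address      VARCHAR(255),
    state        VARCHAR(50),
    country_code CHAR(2) CHARACTER SET ascii,   -- short, truly fixed-length
    PRIMARY KEY (main_id, seq),
    FOREIGN KEY (main_id) REFERENCES t_main(id)
) ENGINE=InnoDB;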
innodb_log_file_size is only indirectly related to your question. What is its value?
Directly related is innodb_page_size, which defaults to 16K, and virtually no one ever changes. I would expect Cloud Engines to prohibit changing it.
(I'm with Bill on desiring more info about your schema -- so we can be more specific about how to help you.)
You don't have many options here. The InnoDB default page size is 16KB, and you must design your tables so that at least two rows fit in a page. That's where the limit of 8126 bytes per row comes from.
Variable-length columns like VARCHAR, VARBINARY, BLOB, and TEXT can be longer, because data exceeding the row size limit can be stored on extra pages. To take advantage of this, you must enable the Barracuda table format, and choose ROW_FORMAT=DYNAMIC.
In config:
[mysqld]
innodb_file_per_table = ON
innodb_file_format = Barracuda
innodb_default_row_format = DYNAMIC
I don't know if these settings are already enabled in Google Cloud SQL, or if they allow you to change these settings.
Read https://dev.mysql.com/doc/refman/5.7/en/innodb-row-format.html for more information
Again, the advantage of DYNAMIC row format only applies to variable-length data types. If you have 300 columns that are fixed-length, like CHAR, then it doesn't help.
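For example (an illustrative sketch only), a wide table whose long columns are variable-length can opt in per table once Barracuda is available:

CREATE TABLE wide_table (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    col1 TEXT,
    col2 TEXT
    -- ... more variable-length columns ...
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC;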
By the way, innodb_log_file_size has nothing to do with this error about row size.
In order to do what you want to do on a Cloud SQL instance, first off run this to set the innodb_strict_mode variable:
SET innodb_strict_mode = 0 ;
After that you should be able to create your table.
I was on "another" programming forum, and we were talking about getting the next number from an auto-increment field BEFORE an insert takes place (there is a way using ADOX). This was in an MS-Access database btw.
Anyway, the discussion veered off into the area of SHOULD you use auto-increment fields for things like invoice numbers, PO numbers, bill of lading numbers, or anything else that needs a unique, incrementing number.
My thoughts were "why not"? Other people are arguing that an Invoice number (for instance) should be managed as a separate table and incremented with code, not using an auto-number field.
Can someone give me a good reason why that would be true?
I've used auto-number fields for years for just this type of thing and have never had a single problem.
Your thoughts?
I have always avoided auto_increment numbering. As it turns out, for good reason. But originally my reason was simply that it was what the professor told us.
Facebook had a major breach a few years ago, simply because they were using AUTO_INCREMENT fields for user IDs. It doesn't take a calculator to figure out that if my ID is 10320 there is likely someone with ID 10319, and so on.
When debugging (or proofing a design), having a key that reflects the data it represents is a heck of a lot easier.
Having keys that reflect the data reduces the potential for corrupted data (typos and user guessing).
Meaningful keys require the developer to think about their data. I have never come across a table using meaningful keys that was not normalized.
Other than the fact that deadlines often run tight, there is no great reason for auto-increment.
Normally I use an autonumbering field for the ID so I don't need to think about how it is generated.
Recordset operations like inserts and deletes alter the sequence, skipping blocks of numbers.
When you manage CustomerIDs, invoice numbers and so on, it's better to have full control over them instead of leaving them under the system's control.
You can create a function that generates the desired numbers for you using a rule (e.g. the invoice number can be a function that includes the invoicing date).
With autonumbering you can't manage this.
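A hedged sketch of that idea in MySQL-flavoured SQL (the table, the column names and the formatting rule are all invented): keep the next number in a table of your own and derive the visible invoice number from a rule such as year plus a zero-padded counter:

CREATE TABLE invoice_counter (
    invoice_year INT PRIMARY KEY,
    next_seq     INT NOT NULL
);

-- Inside one transaction: reserve the next number, then format it.
UPDATE invoice_counter SET next_seq = next_seq + 1 WHERE invoice_year = 2024;
SELECT CONCAT(invoice_year, '-', LPAD(next_seq, 6, '0')) AS invoice_number
FROM invoice_counter
WHERE invoice_year = 2024;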
Beyond that, there are NO FIXED RULES about what to do and what not to do.
It's just your practice and experience and the degree of freedom you want to have.
Bye:-)
I've decided to use a GUID as the primary key for many of my project's DB tables. I think it is a good practice, especially with scalability, backup and restore in mind. The problem is that I don't want to use the regular GUID, and I'm searching for an alternative approach. I was actually interested to know what Pinterest is using as a primary key. When you look at the URL you see something like this:
http://pinterest.com/pin/275001120966638272/
I prefer the numerical representation, even if it is stored as a string. Is there any way to achieve this?
Furthermore, YouTube also uses a different kind of hashing technique which I can't figure out:
http://www.youtube.com/watch?v=kOXFLI6fd5A
This reminds me of a URL-shortener-like scheme.
I prefer the shorter form, but I know it isn't guaranteed to be unique. I first thought about doing something like this:
DateTime dt1970 = new DateTime(1970, 1, 1);
DateTime current = DateTime.Now;
TimeSpan span = current - dt1970;
Console.WriteLine(span.TotalMilliseconds);
Result Example:
1350433430523.66
This prints the total milliseconds since 1970. But what happens if I have hundreds of thousands of writes per second?
I mainly prefer a non-BIGINT auto-increment solution because it is much less of a headache to scale the DB using 3rd-party tools, and backup/restore is less problematic because I can transfer data between servers and such if I want.
Another, more tailored approach is to fit the solution to my application. In the database, the primary key will also contain the username (which is unique and can't be changed by the user), so I can combine the numerical value of the name with the millisecond count, which gives me a unique numerical string. Because a single user doesn't insert data at such a high rate, the numerical ID is guaranteed to be unique. I could also remove the last 5 digits and still get a unique ID, because I assume a user won't insert data more than once per second at most, but I probably won't do that (what do you think about this idea?)
So I ask for your help. My data is expected to grow very big: 2 TB a year, with tens of thousands of new rows each second. I want the URLs to look as "friendly" as possible, and I prefer not to use the 'regular' GUID.
I am developing my app using ASP.NET 4.5 and MySQL
Thanks.
Collision Table
For YouTube-like IDs you can see this answer. They basically keep a database table of all the random video IDs they generate. When they request a new one, they check the table for collisions. If they find a collision, they try to generate a new one.
Long Primary Keys
You could use a long (e.g. 275001120966638272) as a primary key; however, if you have multiple servers generating unique identifiers you'll have to partition them somehow or introduce a global lock, so that no two servers generate the same identifier.
Twitter Snowflake IDs
One solution to the partitioning problem with long IDs is to use snowflake IDs. This is what Twitter uses to generate its IDs. All generated IDs are made up of the following parts:
Epoch timestamp in millisecond precision - 41 bits (gives us 69 years with a custom epoch)
Configured machine id - 10 bits (gives us up to 1024 machines)
Sequence number - 12 bits (A local counter per machine that rolls over every 4096)
One extra bit is reserved for future purposes. Since the IDs use the timestamp as the first component, they are time-sortable (which is very important for query performance).
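A rough sketch of how such an ID could be composed, written here in MySQL syntax purely for illustration (the custom epoch, machine id and sequence values are made up; a real implementation keeps the counter in application code):

SET @custom_epoch = UNIX_TIMESTAMP('2012-01-01') * 1000;
SET @ms      = CAST(UNIX_TIMESTAMP(NOW(3)) * 1000 AS UNSIGNED) - @custom_epoch;  -- 41-bit timestamp
SET @machine = 1;     -- configured machine id (10 bits, 0..1023)
SET @seq     = 42;    -- local sequence number (12 bits, 0..4095)
SELECT (@ms << 22) | (@machine << 12) | @seq AS snowflake_id;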
Base64-Encoded GUIDs
You can use ShortGuid, which encodes a GUID as a base64 string. The downside is that the output is a little ugly (e.g. 00amyWGct0y_ze4lIsj2Mw) and it's case sensitive, which may not be good for URLs if you are lower-casing them.
Base32-Encoded GUIDs
There is also base32 encoding of GUIDs; see this answer. These are slightly longer than the ShortGuid above (e.g. lt7fz44kdqlu5pt7wnyzmu4ov4), but the advantage is that they can be all lower case.
Multiple Factors
One alternative I have been thinking about is to introduce multiple factors, e.g. if Pinterest used a username and an ID for extra uniqueness:
https://pinterest.com/some-user/1
Here the ID 1 is unique to the user some-user and could be the number of posts they've made, i.e. their next post would be 2. You could also use YouTube's approach with their video IDs but make them specific to a user; this could lead to some ridiculously short URLs.
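An illustrative sketch of that per-user numbering (table and column names invented): a composite primary key makes the short number unique within each user rather than globally:

CREATE TABLE post (
    username VARCHAR(30) NOT NULL,
    post_seq INT NOT NULL,           -- 1, 2, 3, ... per user
    body     TEXT,
    PRIMARY KEY (username, post_seq)
);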
The first, simplest and most practical scheme for unique keys is an increasing numbering sequence in write order. It represents the record number inside one database and provides unique numbering on a local scale: this is the application-level requirement most often met.
Next, a numerical approach based on concatenating time and counters is commonly used to ensure that concurrent transactions in the same batch get unique IDs before writing.
When the system becomes highly threaded and distributed, as in highly concurrent situations, some constraints need to be relaxed before they become a penalty for scaling.
Universally unique identifier as primary key
Yes, it's a good practice.
A key reference system can provide independence from the underlying database system.
This provides one more level of integrity for the database in the scenarios you evoke: backup, restore, scaling, migration, and perhaps proving authenticity.
The article Generating Globally Unique Identifiers for Use with MongoDB by Alexander Marquardt (a Senior Consulting Engineer at MongoDB) covers the question in detail and gives some insight into database internals.
UUIDs are 128 bits long. They introduce enough entropy to ensure practical uniqueness of labels. They can be represented as 32-character hex strings, which is enough for about 3.4 × 10^38 distinct values.
Here are a few more questions that can occur when considering the overall principle and the analysis:
should the database primary key and the Uniform Resource Locator be kept as two different entities?
does this numbering destroy sequentiality in the system?
does providing a machine host number (h), followed by a user number (u) and a time (t), along with a write index (i), guarantee that the PK "huti" stays unique?
Now considering the DB system:
primary keys should be kept numerical (even if hexadecimal); the database system relies on them, and this has performance implications.
their size should be fixed, so the system can answer rapidly whether it is potentially dealing with a PK or not.
Hashids
The hashing technique YouTube uses is Hashids.
It's a good choice:
the hashes are short and the length can be controlled,
the alphabet can be customized,
it is reversible (and as such useful as a short reference to the primary keys),
it can use a salt,
it is designed to hash positive numbers.
However, it is a hash, and as such there is a probability that collisions happen. They can be detected: the unique constraint is violated before the value is stored, and in such a case the generation should simply be run again.
Consider the comment on this answer to figure out how much entropy it is possible to get from a shortened sha1+b64 recipe.
Anticipating the collision scenario calls for an estimate of the future size of the database, that is, the potential number of records. Recommended reading: Z. Bloom, How Long Does An ID Need To Be?
Milliseconds since epoch
Cited from the previous article, which provides most of the answer to the problem at hand in a nicely concise style:
It may not be necessary for you to encode every time since 1970
however. If you are only interested in keeping recent records close to
each other, you only need enough values to ensure that you don’t have
more values with the same prefix than your database can cache at once
What you could do is convert a GUID into a purely numeric string by converting all the letters in the GUID into numbers. Here is an example of what that would look like. It's a bit long, but if that is not a problem this could be one way of generating the keys.
1004234499987310234371029731000544986101469898102
Here is the code I used to generate the string above. I would probably recommend using a long primary key instead, though; although it can be a bit of a pain, it's probably a safer way to go than the function below.
string generateKey()
{
    // Take a new GUID, strip the dashes, and replace every letter with its
    // numeric character code so the resulting key contains digits only.
    Guid guid = Guid.NewGuid();
    string newKey = "";
    foreach (char c in guid.ToString().Replace("-", "").ToCharArray())
    {
        if (char.IsLetter(c))
        {
            newKey += (int)c;   // e.g. 'a' becomes "97"
        }
        else
        {
            newKey += c;        // digits are kept as-is
        }
    }
    return newKey;
}
Edit:
I did some testing taking only the first 20 digits, and out of 5,000,000 generated keys 4,999,978 were unique. When using the first 25 digits it was 5,000,000 out of 5,000,000. I would recommend doing some more testing if you go with this method.
For a website I'm creating I need to search a few tables such as Articles, Products and maybe the ForumThread and ForumPosts tables. I currently have a very simple LIKE search query against each of these tables' title columns, which are VARCHAR(255). The title columns are indexed too.
In the future, however, I want to search the Description fields too, which are VARCHAR(MAX), and I'm guessing this will be very slow once there are lots of records.
Now I came across full text search and have the following questions about it:
Will full text search speed up these kinds of simple search operations?
Can I still use a LIKE query in similar ways or do I need to rewrite all search queries?
Maybe this isn't full-text-search related, but how can I search multiple tables? I'm now querying each table one by one.
If I enable full text search, will this eat more RAM (since I'm on a 1 GB RAM VPS right now)?
As you can see I have absolutely no experience with this, and even after reading theory I'm still a little confused about what it really does.
I hope someone can give me a little guidance on this.
Thank you for your time.
Kind regards,
Mark
The big problem with your LIKE-based queries is that they almost certainly can't use normal indexes. So it won't do you any good to add an index on the description column to help with performance. Full Text queries consist of two parts: 1) changing your query to use (for example) the CONTAINS() keyword instead of LIKE and 2) creating a different kind of index that the queries using these keywords will be able to take advantage of.
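A minimal T-SQL sketch of those two parts, with illustrative names (PK_Articles stands for whatever unique key index your Articles table actually has):

CREATE FULLTEXT CATALOG SearchCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.Articles (Title, Description)
    KEY INDEX PK_Articles ON SearchCatalog;

-- Instead of: WHERE Title LIKE '%bike%'
SELECT ArticleId, Title
FROM dbo.Articles
WHERE CONTAINS((Title, Description), 'bike');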
Here's the thing: it's not just the size of the field that determines whether full text will have a big impact. It's also the number of rows. You may have a simple nvarchar(100) that's only expected to hold a short phrase, but if you have to search millions of rows full text can still search this faster. The key there is the "have to search" part - if you have other filters that can significantly limit the working set, your LIKE query might still do fine. Another scenario is an nvarchar(max) field with only a few dozen rows, but each of those records has as much text as a novel. In this case, you'll still want to use a full text index.
There are two other important considerations for full text searches. One is that they tend to hog disk space. This isn't hugely important for most databases, but it is worth mentioning. The other is that they often need to be manually re-calculated, such that an article isn't ready for searching the moment it's added to the DB.
An alternative that sits somewhere between full-text searching and simple LIKE searches, which will give you much better performance, some weighting ability, and also simplify searching multiple tables, is to build your own keyword index, e.g. create a table:
keyword count tableid columnid rowid
------- ----- ------- -------- -----
varchar int int int int
You would of course need triggers or a service of some kind to keep this up to date, but what you end up with is a lightweight cross reference of the counts of all relevant keywords and where they appear. Your search queries then only need to look up the keywords in this index.
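A hypothetical search against such a table (assuming it is called keyword_index, with, say, tableid 1 = Articles and 2 = Products): the query only touches this one index table, and you then fetch the matching rows by id:

SELECT tableid, rowid, SUM([count]) AS relevance
FROM keyword_index
WHERE keyword IN ('mountain', 'bike')
GROUP BY tableid, rowid
ORDER BY relevance DESC;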
This only works for keywords, though, so if you want to let people search on phrases it won't work. You'll also have to incorporate logic to deal with things like plurals and irrelevant words. On the other hand it is extremely fast. If performance is becoming a problem for LIKE searches and you need more than just keywords searching, full-text searching is probably the best way to go.
Full-text search is really intended for when your application needs to do intensive searching of BIG blocks of text rather than simple fields of text for storing names, descriptions etc.
For example, I've used it for such things as quickly searching through the content of books/CVs. It actually creates word-by-word indexes of all the content stored, and it will probably be overkill if you're not working with massive bits of text.
One design change you could make instead is to use NVARCHAR(MAX) instead of VARCHAR; this gives you the ability to handle Unicode text (from most known human writing systems) and should be large enough for your needs as outlined above.
I have a curious question about efficiency. Say I have a field in a database that is just a numeric code that represents something else; for example, a value of 1 means the term is 30 days.
Would it be better (more efficient) to code a SELECT statement like this...
SELECT
CASE TermId
WHEN 1 THEN '30 days'
WHEN 2 THEN '60 days'
END AS Term
FROM MyTable
...and bind the results directly to the GridView, or would it be better to evaluate the TermId field in RowDataBound event of the GridView and change the cell text accordingly?
Don't worry about extensibility or anything like that, I am only concerned about the differences in overall efficiency. For what it's worth, the database resides on the web server.
Efficiency probably wouldn't matter here - code maintainability does though.
Ask yourself - will these values change? What if they do? What would I need to do after 2 years of use if these values change?
If it becomes evident that keeping them in SQL would mean better maintainability (easier to change), then do it in a stored procedure. If it's easier to change them in code later, then do that.
The benefits from doing either are quite low, as the code doesn't look complex at all.
For a number of reasons, I would process the translation in the grid view.
Reason #1: SQL resource is shared. Grid is distributed. Better scalability.
Reason #2: Lower bandwidth to transmit a couple integers vs. strings.
Reason #3: Code can be localized for other languages without affecting the SQL Server code.
A field called TermID in a database table implies a foreign key to another table (perhaps called Term).
If this is the case, then perhaps that table has (or should have) a Description field which could hold the "30 days" text. You could/should join to that table to retrieve the descriptive text.
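For example, assuming a Term table with TermId and Description columns (illustrative names), the query becomes a plain join instead of a CASE expression:

SELECT t.Description AS Term
FROM MyTable AS m
INNER JOIN Term AS t ON t.TermId = m.TermId;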
While this join might not improve efficiency, it is a lightweight enough join not to get in the way.