I know the benefits of using CRUD, and that there are also some disadvantages, but I'd like some expert feedback and advice on the process below for writing data to a database, particularly regarding best practice and the possible pros and cons.
I've come across two basic methods of creating records in my time as a developer. The first (and usually the least helpful, in my experience) is to create a stub record immediately and use its populated fields (including the PK) wherever they are needed. This usually leads to a raft of orphaned records floating around the database with no real purpose.
The second way is to only hold a stub in memory, giving (what would be) the object's PK field a default value of, for instance, -1 to represent a new record. This keeps database access to a minimum, especially if the record is not needed later.
Personally, I've found the second way a lot more forgiving and straightforward than the first. The question I'd like to pose, though, is whether to rule out plain CRUD in favour of a stored procedure that carries out both the INSERT and UPDATE aspects of the process based on the aforementioned default value, something like...
BEGIN
    IF @record_id = -1
        INSERT ....
    ELSE
        UPDATE ....
END
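For concreteness, a minimal sketch of such a procedure (the table, column, and parameter names here are placeholders, not a real schema):

CREATE PROCEDURE dbo.uspSaveRecord
    @record_id INT,          -- -1 signals a brand-new record
    @name      VARCHAR(50)
AS
BEGIN
    IF @record_id = -1
        INSERT INTO dbo.MyRecord (Name)   -- PK assumed to be an IDENTITY column
        VALUES (@name);
    ELSE
        UPDATE dbo.MyRecord
        SET    Name = @name
        WHERE  RecordId = @record_id;
END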
Any feedback would be appreciated.
As a rule of thumb, I tend to write upsert procedures, but I base the "match" on the unique constraint, not the surrogate key.
For example.
dbo.Employee
EmployeeUUID is the PK, Surrogate Key
SSN is a unique constraint.
dbo.uspEmployeeUpsert would look something like this:
Insert into dbo.Employee (EmployeeUUID, LastName, FirstName, SSN)
Select NEWID(), LastName, FirstName, SSN
from #SomeHolderTable holder
where not exists (select null from dbo.Employee innerRealTable
                  where innerRealTable.SSN = holder.SSN)

Update e
Set EmployeeUUID = holder.EmployeeUUID
  , LastName  = ISNULL(holder.LastName, e.LastName)   /* or COALESCE */
  , FirstName = COALESCE(holder.FirstName, e.FirstName)
from dbo.Employee e
inner join #SomeHolderTable holder on e.SSN = holder.SSN
You can also use the MERGE statement.
And you can match on the surrogate key (EmployeeUUID in this case) instead of the SSN.
What is #SomeHolderTable you ask?
I like to pass XML to the stored procedure, shred it into a table variable or #temp table, then write the logic for the C and U. The D(elete) is possible as well, but I usually isolate it in a separate procedure.
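As a rough sketch of the shredding step, using the newer nodes()/value() syntax (SQL 2005+; the XML shape and column types here are assumptions, not a spec -- the OPENXML links below show the older way):

DECLARE @EmployeeXml XML;  -- passed in as a procedure parameter in practice

CREATE TABLE #SomeHolderTable
(
    EmployeeUUID UNIQUEIDENTIFIER,
    LastName     VARCHAR(64),
    FirstName    VARCHAR(64),
    SSN          CHAR(11)
);

INSERT INTO #SomeHolderTable (EmployeeUUID, LastName, FirstName, SSN)
SELECT emp.value('(EmployeeUUID/text())[1]', 'UNIQUEIDENTIFIER'),
       emp.value('(LastName/text())[1]',     'VARCHAR(64)'),
       emp.value('(FirstName/text())[1]',    'VARCHAR(64)'),
       emp.value('(SSN/text())[1]',          'CHAR(11)')
FROM   @EmployeeXml.nodes('/Employees/Employee') AS x(emp);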
Why do I do it this way?
Because I can update 1 or 100 or 1000 or N records with one db hit.
My logic seldom changes, and is isolated to one place.
Now, there is a small performance hit for shredding the XML.
But I find it acceptable 99% of the time.
Every once in a while, I write a non-set-based upsert routine, but that is reserved for heavily used, performance-critical procedures.
That's my take.
You can see the "set based" part of this approach (with the older OPENXML syntax) at this article:
http://msdn.microsoft.com/en-us/library/ff647768.aspx
Find the phrase : "Perform bulk updates and inserts by using OpenXML"
Here is the "more code" version of what the above URL talks about:
http://support.microsoft.com/kb/315968
EDIT
if not exists (select 1 from dbo.Employee e
               inner join #SomeHolderTable holder on e.SSN = holder.SSN)
BEGIN
    Insert into dbo.Employee (EmployeeUUID, LastName, FirstName, SSN)
    Select NEWID(), LastName, FirstName, SSN
    from #SomeHolderTable holder
    where not exists (select null from dbo.Employee innerRealTable
                      where innerRealTable.SSN = holder.SSN)
END
I wouldn't necessarily do this, but it's an option if you want a "boolean check".
So, with my uniqueidentifier setup, I pass down an "empty Guid" (00000000-0000-0000-0000-000000000000, Guid.Empty in C#) to the procedure when I know I have a new item. That would be my "-1" check in your scenario.
That's one value you could test with an "if exists" check.
It kinda depends on how many hands you have in the pot.
Also, I didn't mention that when I have a lot of hands in the pot, I'll shred the XML, then wrap my C and U statements in a BEGIN TRAN and COMMIT TRAN (with a ROLLBACK in there as well). That way my create/update is atomic: all or nothing.
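A sketch of that atomic wrapper (TRY/CATCH needs SQL 2005+ and THROW needs SQL 2012+; use RAISERROR on older versions):

BEGIN TRY
    BEGIN TRAN;
        -- INSERT ... (the "C")
        -- UPDATE ... (the "U")
    COMMIT TRAN;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRAN;
    THROW;   -- re-raise so the caller sees the failure
END CATCH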
The MERGE statement will do this as well, but the pros and cons of MERGE are a different topic.
Related
Similar to this question and this solution for PostgreSQL (in particular "INSERT missing FK rows at the same time"):
Suppose I am making an address book with a "Groups" table and a "Contact" table. When I create a new Contact, I may want to place them into a Group at the same time. So I could do:
INSERT INTO Contact VALUES (
    'Bob',
    (SELECT group_id FROM Groups WHERE name = 'Friends')
)
But what if the "Friends" Group doesn't exist yet? Can we insert this new Group efficiently?
The obvious thing is to do a SELECT to test whether the Group already exists; if not, do an INSERT. Then do an INSERT into Contact with the sub-SELECT above.
Or I can constrain Groups.name to be UNIQUE, do an INSERT OR IGNORE, then INSERT into Contact with the sub-SELECT (see the sketch below).
I can also keep my own cache of which Groups exist, but that seems like I'm duplicating functionality of the database in the first place.
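A sketch of that UNIQUE plus INSERT OR IGNORE approach, assuming SQLite and hypothetical Contact(name, group_id) columns:

CREATE UNIQUE INDEX IF NOT EXISTS idx_groups_name ON Groups(name);

-- Create the group only if it is missing; the UNIQUE constraint makes this a no-op otherwise.
INSERT OR IGNORE INTO Groups (name) VALUES ('Friends');

-- The sub-SELECT is now guaranteed to find a group_id.
INSERT INTO Contact (name, group_id)
VALUES ('Bob', (SELECT group_id FROM Groups WHERE name = 'Friends'));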
My guess is that there is no way to do this in one query, since INSERT does not return anything and cannot be used in a subquery. Is that intuition correct? What is the best practice here?
"My guess is that there is no way to do this in one query, since INSERT does not return anything and cannot be used in a subquery. Is that intuition correct?"
You could use a trigger and a small modification of the tables, and then you could do it with a single query.
For example, consider the following.
Purely for convenience of producing the demo:-
DROP TRIGGER IF EXISTS add_group_if_not_exists;
DROP TABLE IF EXISTS contact;
DROP TABLE IF EXISTS groups;
One-time setup SQL :-
CREATE TABLE IF NOT EXISTS groups (id INTEGER PRIMARY KEY, group_name TEXT UNIQUE);
INSERT INTO groups VALUES(-1,'NOTASSIGNED');
CREATE TABLE IF NOT EXISTS contact (id INTEGER PRIMARY KEY, contact TEXT, group_to_use TEXT, group_reference INTEGER DEFAULT -1 REFERENCES groups(id));
CREATE TRIGGER IF NOT EXISTS add_group_if_not_exists
AFTER INSERT ON contact
BEGIN
INSERT OR IGNORE INTO groups (group_name) VALUES(new.group_to_use);
UPDATE contact SET group_reference = (SELECT id FROM groups WHERE group_name = new.group_to_use), group_to_use = NULL WHERE id = new.id;
END;
SQL that would be used on an ongoing basis :-
INSERT INTO contact (contact,group_to_use) VALUES
('Fred','Friends'),
('Mary','Family'),
('Ivan','Enemies'),
('Sue','Work colleagues'),
('Arthur','Fellow Rulers'),
('Amy','Work colleagues'),
('Henry','Fellow Rulers'),
('Canute','Fellow Ruler')
;
The number of values and the actual values would vary.
SQL Just for demonstration of the result
SELECT * FROM groups;
SELECT contact,group_name FROM contact JOIN groups ON group_reference = groups.id;
Results
This results in :-
1) The groups (noting that the group "NOTASSIGNED" is intrinsic to the working of the above and hence added initially). You have to be careful regarding mistakes like "Fellow Ruler" instead of "Fellow Rulers", which silently creates a new group. The id -1 was used because it would not be a normal, automatically generated value.
2) The contacts with their respective groups.
Efficient insertion
That could likely be debated from here to eternity, so I leave it for the fence-sitters/destroyers to decide :). However, some considerations:-
It works and appears to do what is wanted.
It's a little wasteful due to the additional carrier column (group_to_use).
It tries to minimise the waste by clearing that column once it has been used (the trigger sets it to NULL; an empty string would also work, though NULL is likely more efficient, if potentially confusing for some).
There will obviously be an overhead, BUT compared to the alternatives it is probably negligible; it might matter if you were loading every Facebook user, but if the inserts are driven by user input it is likely irrelevant.
What is the best practice here?
Fences again. :)
Note: hopefully obvious, but the DROP statements are purely for convenience; all of the other SQL up until the INSERT is run once to set up the tables and trigger in preparation for the single INSERT that adds a group if necessary.
I've been tasked with creating an application that allows users to enter data into a web form; the data will be saved and eventually used to populate PDF form fields.
I'm having trouble thinking of a good way to store the field values in a database, as the forms will be dynamic (based on the PDF fields).
In the app itself I will pass the data around in a hash table (fieldname, fieldvalue), but I don't know the best way to map the hash to database values.
I'm using MS SQL server 2000 and asp.net webforms. Has anyone worked on something similar?
Have you considered using a document database here? This is just the sort of problem they solve a lot better than traditional RDBMS solutions. Personally, I'm a big fan of RavenDb. Another pretty decent option is CouchDb. I'd avoid MongoDb, as it really isn't a safe place for data in its current implementation.
Even if you can't use a document database, you can make SQL pretend to be one by setting up your tables to have some metadata in traditional columns with a payload field that is serialized XML or json. This will let you search on metadata while staying out of EAV-land. EAV-land is a horrible place to be.
UPDATE
I'm not sure if a good guide exists, but the concept is pretty simple. The basic idea is to break out the parts you want to query on into "normal" columns in a table -- this lets you query in standard manners. When you find the record(s) you want, you can then grab the CLOB and deserialize it as appropriate. In your case you would have a table that looked something like:
SurveyAnswers
Id INT IDENTITY
FormId INT
SubmittedBy VARCHAR(255)
SubmittedAt DATETIME
FormData TEXT
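For instance, a retrieval might look like the following (the FormId value and date are made up for illustration); the payload itself is deserialized in application code:

SELECT Id, FormData
FROM   SurveyAnswers
WHERE  FormId = 42
  AND  SubmittedAt >= '20110101'
-- FormData (serialized XML/JSON) is parsed app-side; SQL 2000 cannot query inside it.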
A few protips:
a) Use a text-based serialization routine. It gives you a fighting chance to fix data errors and really helps debugging.
b) For SQL 2000, you might want to consider breaking the CLOB (the TEXT field holding your payload data) into a separate table. It's been a long time since I used SQL 2000, but my recollection is that TEXT columns did bad things to table performance.
What you're describing is called Entity Attribute Value (EAV), and this model can be a royal pain to deal with, so you should limit your use of it as much as possible.
For example, if there are fields that are almost always present in the forms (First Name, Last Name, Email, etc.), you should make them real columns in a table.
The reason is that if you don't, sooner or later somebody is going to realize they have these names and emails and ask you to build this query:
SELECT
    fname.Value fname,
    lname.Value lname,
    email.Value email,
    ....
FROM
    form f
    INNER JOIN formFields fname
        ON f.FormId = fname.FormID
        AND fname.AttributeName = 'fname'
    INNER JOIN formFields lname
        ON f.FormId = lname.FormID
        AND lname.AttributeName = 'lname'
    INNER JOIN formFields email
        ON f.FormId = email.FormID
        AND email.AttributeName = 'email'
    ....
when you could have written this
SELECT
common.fname,
common.lname,
common.email,
....
FROM
form f
INNER JOIN common c
on f.FormId = c.FormId
Also, get off of SQL 2000 as soon as you can, because you're really going to miss the UNPIVOT clause.
It's also probably not a bad idea to look at previous SO EAV questions to get an idea of the problems people have encountered in the past.
I'd suggest mirroring the same structure:
Form
-----
form_id
User
created
FormField
-------
formField_id
form_id
name
value
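A sketch of saving one submission into that structure (column names assumed from the outline above; SQL 2000 lacks multi-row VALUES, hence the separate INSERTs):

INSERT INTO Form (form_id, [User], created) VALUES (1, 'jsmith', GETDATE())

INSERT INTO FormField (formField_id, form_id, name, value) VALUES (1, 1, 'fname', 'John')
INSERT INTO FormField (formField_id, form_id, name, value) VALUES (2, 1, 'lname', 'Smith')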
I have the following statements in Oracle 11g:
CREATE TYPE person AS OBJECT (
name VARCHAR2(10),
age NUMBER
);
CREATE TYPE person_varray AS VARRAY(5) OF person;
CREATE TABLE people (
somePeople person_varray
)
How can I select the name value for a person? i.e.
SELECT somePeople(person(name)) FROM people
Thanks
I'm pretty sure that:
What you're doing isn't what I'd be doing. It sort of completely violates relational principles, and you're going to end up with an object/type system in Oracle that you might not be able to change once it's been laid down. The best use I've seen for SQL TYPEs (not PL/SQL types) is basically being able to cast a ref cursor back for pipelined functions.
You have to unnest the collection before you can query it relationally, like so:
SELECT NAME FROM
(SELECT SP.* FROM PEOPLE P, TABLE(P.SOMEPEOPLE) SP)
That'll give you all rows, because there's nothing in your specifications (like a PERSON_ID attribute) to restrict the rows.
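For illustration, if PEOPLE carried some identifying column, say a hypothetical PERSON_GROUP_ID, you could restrict the unnested rows like so:

SELECT P.PERSON_GROUP_ID, SP.NAME, SP.AGE
FROM PEOPLE P,
     TABLE(P.SOMEPEOPLE) SP
WHERE P.PERSON_GROUP_ID = 1;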
The Oracle Application Developer's Guide - Object Relational Features discusses all of this in much greater depth, with examples.
To insert:-
insert into people values (
person_varray(person('Ram','24'))
);
To select :-
select * from people;
SELECT NAME FROM (SELECT SP.* FROM PEOPLE P, TABLE(P.somePeople) SP)
While inserting a row into the people table, use the constructor of person_varray and then the constructor of the person type for each person. The above INSERT command creates a single row in the people table.
select somePeople from people;

SOMEPEOPLE(NAME, AGE)
---------------------------------------------------
PERSON_VARRAY(PERSON('Ram', 24))
To update:-
update people
set somePeople =
person_varray
(
person('SaAM','23')
)
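Note that the UPDATE above replaces the entire VARRAY. To change just one element, you would typically read, modify, and rewrite the collection in PL/SQL; a rough sketch (assuming the table holds a single row):

DECLARE
    v_people person_varray;
BEGIN
    SELECT somePeople INTO v_people FROM people WHERE ROWNUM = 1;
    v_people(1).age := 25;  -- modify the first element in place
    UPDATE people SET somePeople = v_people;
END;
/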
A sample scenario: I have a form with one question and multiple answers as checkboxes, so the user can choose more than one. The table for storing answers is as below:
QuestionAnswers
(
UserID int,
QuestionID int,
AnswerID int
)
What is the best way of updating those answers in the database using a stored proc? At different jobs I've seen the whole spectrum, from simply deleting all previous answers and inserting new ones, to passing the stored proc a list of answers to remove and a list of answers to add.
In my current project performance and scalability are pretty important, so I'm wondering what's the best way of doing it?
Thanks!
Andrey
If I had a choice of table design, and the following statements are true:
You know the maximum number of choices per question.
Each choice is a simple checked/unchecked.
Each answer can be classified as correct/wrong rather than marked on some scale (like 70% right).
Then, considering performance, I would consider the following table instead of the one you presented:
QuestionAnswers
(
UserID int,
QuestionID int,
Choice1 bit,
Choice2 bit,
...
ChoiceMax bit
)
Yes, it is ugly in terms of normalization, but that denormalization will buy performance and simplify queries: just one update/insert per question. (And I would update first, and insert only if the number of affected rows is zero.)
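A sketch of that update-first pattern (shown with just two of the choice columns):

UPDATE QuestionAnswers
SET Choice1 = @Choice1, Choice2 = @Choice2
WHERE UserID = @UserID AND QuestionID = @QuestionID

IF @@ROWCOUNT = 0
    INSERT INTO QuestionAnswers (UserID, QuestionID, Choice1, Choice2)
    VALUES (@UserID, @QuestionID, @Choice1, @Choice2)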
Detecting whether the answer was correct will also be simpler, with the following table:
QuestionCorrectAnswers
(
QuestionID int,
Choice1 bit,
Choice2 bit,
...
ChoiceMax bit
)
All you need to do is look up the row in QuestionCorrectAnswers with the same combination of choices as the user answered.
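For example, the grading lookup could be a single join against the answer key (again shown with just two choice columns):

SELECT CASE WHEN ca.QuestionID IS NOT NULL THEN 1 ELSE 0 END AS IsCorrect
FROM QuestionAnswers qa
LEFT JOIN QuestionCorrectAnswers ca
    ON ca.QuestionID = qa.QuestionID
   AND ca.Choice1 = qa.Choice1
   AND ca.Choice2 = qa.Choice2
WHERE qa.UserID = @UserID
  AND qa.QuestionID = @QuestionID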
If the questions are always the same, then you'd never delete anything - just run an update query on all changed Answers.
Update QuestionAnswers
SET AnswerID = @AnswerID
WHERE UserID = @UserID AND QuestionID = @QuestionID
If for some reason you still need to do some delete/insert, I'd check which QuestionIDs already exist (for the given UserID), so you do a minimum of deletes and inserts.
Updates are just far faster than a delete followed by an insert, plus you don't make any identity columns skyrocket.
I presume you load the QuestionAnswers from the DB upon entering the page, so the user can see which answers he/she gave last time; if so, you already have the data needed to determine what to delete, insert and update.
Andrey: If the user (UserID = 1) selects choices a (AnswerID = 1) and b (AnswerID = 2) for question 1 (QuestionID = 1) and later switches to c (AnswerID = 3) and d (AnswerID = 4), you would have to check, then delete the old records and add the new. If you follow the other approach, you would not check whether a particular record exists (so that you can update it); you would just delete the old records and insert new ones. Anyway, since you are not storing any identity columns, I would go with the latter approach.
Here is a simple solution: every answer gets an integer bit value, and this value must be unique within its question.
For example, you have Question1 and four predefined answers:
[Answer] [Bit value]
answer1 0x00001
answer2 0x00002
answer3 0x00004
answer4 0x00008
...
So your SQL INSERT/UPDATE will be:
declare @checkedMask int
set @checkedMask = 0x00009 -- answer 1 and answer 4 are checked

declare @questionId int
set @questionId = 1

-- delete answers that are no longer checked
delete r
--select r.*
from QuestionResult r
inner join QuestionAnswer a
    on r.QuestionId = a.QuestionId
   and r.AnswerId = a.AnswerId
where r.QuestionId = @questionId
  and (a.mask & @checkedMask) = 0

-- insert newly checked answers that are not already stored
insert QuestionResult (AnswerId, QuestionId)
select
    a.AnswerId,
    a.QuestionId
from QuestionAnswer a
where a.QuestionId = @questionId
  and (a.mask & @checkedMask) > 0
  and not exists (select AnswerId from QuestionResult r
                  where r.QuestionId = @questionId
                    and r.AnswerId = a.AnswerId)
Sorry to resurrect an old thread. I would have thought the only realistic solution is to delete all responses for that question and create new rows where the checkbox is ticked. Having a column per answer may be efficient as far as updates go, but the inflexibility of this approach is just not an option: you need to be able to add options to a question without having to redesign your database.
Just delete and re-insert. That's what databases are designed to do: store and retrieve lots of rows of data.
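A sketch of that approach, assuming the new answer set arrives in a #NewAnswers temp table:

DELETE FROM QuestionAnswers
WHERE UserID = @UserID AND QuestionID = @QuestionID

INSERT INTO QuestionAnswers (UserID, QuestionID, AnswerID)
SELECT @UserID, @QuestionID, AnswerID
FROM #NewAnswers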
I disagree that regent's answer is denormalized. As long as each answer is not dependent on another non-key column, and is only dependent on the key, it is in third normal form. It is no different from a table with the following fields for a customer name:
CustomerName
(
name_prefix
name_first
name_mi
name_last
name_suffix
city
state
zip
)
Same as
QuestionAnswers
(
Q1answer1
Q1answer2
Q1answerN
)
There really is no difference between the "question" of a name, with its multiple answers that may or may not be filled out, and the "question" on the form, with its multiple answers that may or may not be selected.
I have recently stumbled upon a problem with selecting relationship details from one table and inserting them into another table; I hope someone can help.
I have a table structure as follows:
ID (PK)   Name        ParentID
1         Myname      0
2         nametwo     1
3         namethree   2
This is the table I need to select from to get all the relationship data, as there could be an unlimited number of sub-links (is there a function I can create to loop through this?).
Then, once I have all the data, I need to insert it into another table, and the IDs will have to change, as the IDs must go in order (e.g. I cannot have ID 2 be a sub-item of ID 3). I am hoping I can use the same function for both the selecting and the inserting.
If you are using SQL Server 2005 or above, you may use recursive queries to get your information. Here is an example:
With tree (id, Name, ParentID, [level])
As (
Select id, Name, ParentID, 1
From [myTable]
Where ParentID = 0
Union All
Select child.id
,child.Name
,child.ParentID
,parent.[level] + 1 As [level]
From [myTable] As [child]
Inner Join [tree] As [parent]
On [child].ParentID = [parent].id)
Select * From [tree];
This query will return the rows matched by the anchor portion (Where ParentID = 0) and all sub-rows recursively. Does this help you?
I'm not sure I understand what you want to have happen with your insert. Can you provide more information in terms of the expected result when you are done?
Good luck!
For the retrieval part, you can take a look at Common Table Expressions (CTEs). This feature provides recursive operations in SQL.
For the insertion part, you can use the CTE above to regenerate the IDs and insert accordingly (see the sketch below).
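A sketch of that insertion, reusing the recursive CTE from the earlier answer and assigning new, ordered IDs with ROW_NUMBER() (the target table [myNewTable] is an assumed name). Because a parent always has a lower [level] than its children, numbering by level guarantees every parent's new ID is smaller than its children's:

With tree (id, Name, ParentID, [level])
As (
    Select id, Name, ParentID, 1
    From [myTable]
    Where ParentID = 0
    Union All
    Select child.id, child.Name, child.ParentID, parent.[level] + 1
    From [myTable] As [child]
    Inner Join [tree] As [parent]
        On [child].ParentID = [parent].id
),
renumbered As (
    -- new IDs in level order, so parents are numbered before children
    Select id, Name, ParentID,
           Row_Number() Over (Order By [level], id) As newId
    From tree
)
Insert Into [myNewTable] (id, Name, ParentID)
Select r.newId, r.Name, Coalesce(p.newId, 0)   -- remap ParentID; the root keeps 0
From renumbered r
Left Join renumbered p
    On r.ParentID = p.id;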
I hope this URL helps: Self-Joins in SQL
This is the problem of finding the transitive closure of a graph in SQL. SQL does not support this directly, which leaves you with three common strategies:
use a vendor specific SQL extension
store the Materialized Path from the root to the given node in each row
store the Nested Sets, that is the interval covered by the subtree rooted at a given node when nodes are labeled depth first
The first option is straightforward, and if you don't need database portability is probably the best. The second and third options have the advantage of being plain SQL, but require maintaining some de-normalized state. Updating a table that uses materialized paths is simple, but for fast queries your database must support indexes for prefix queries on string values. Nested sets avoid needing any string indexing features, but can require updating a lot of rows as you insert or remove nodes.
If you're fine with always using MSSQL, I'd use the vendor specific option Adrian mentioned.
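As a tiny illustration of the materialized-path option above (the path column and its values are assumptions): each row stores its root-to-node ID path, so fetching a whole subtree becomes an indexed prefix query.

-- e.g. the row for namethree (id 3) would store path = '1/2/3'
SELECT *
FROM myTable
WHERE path LIKE '1/2/%'   -- all descendants of node 2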