Avoid DataNucleus joins? - jdo

I'm experimenting with moving a JDBC webapp to JDO DataNucleus 2.1.1.
Assume I have some classes that look something like this:
public class Position {
    private Integer id;
    private String title;
}
public class Employee {
    private Integer id;
    private String name;
    private Position position;
}
The contents of the Position SQL table really don't change very often. Using JDBC, I read the entire table into memory (with the ability to refresh periodically or at-will). Then, when I read an Employee into memory, I simply retrieve the position ID from the Employee table and use that to obtain the in-memory Position instance.
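For context, a minimal sketch of that JDBC-era pattern (class shape and the Position(Integer, String) constructor are assumptions based on the description above, not the actual app):
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PositionCache {
    private final Map<Integer, Position> byId = new ConcurrentHashMap<Integer, Position>();

    // Read the whole POSITION table into memory; called at startup and on periodic/at-will refresh.
    public void refresh(Connection conn) throws SQLException {
        Map<Integer, Position> fresh = new HashMap<Integer, Position>();
        Statement st = conn.createStatement();
        try {
            ResultSet rs = st.executeQuery("SELECT ID, TITLE FROM MYSCHEMA.\"POSITION\"");
            while (rs.next()) {
                fresh.put(rs.getInt("ID"), new Position(rs.getInt("ID"), rs.getString("TITLE")));
            }
        } finally {
            st.close();
        }
        byId.clear();
        byId.putAll(fresh);
    }

    // Employee loading resolves its POSITION_ID against this map instead of joining.
    public Position get(Integer positionId) {
        return byId.get(positionId);
    }
}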
However, using DataNucleus, if I iterate over all Positions:
Extent<Position> extent = pm.getExtent(Position.class, true);
Iterator<Position> iter = extent.iterator();
while (iter.hasNext()) {
    Position position = iter.next();
    System.out.println(position.toString());
}
And then later, with a different PersistenceManager, iterate over all Employees, obtaining their Position:
Extent<Employee> extent = pm.getExtent(Employee.class, true);
Iterator<Employee> iter = extent.iterator();
while (iter.hasNext()) {
    Employee employee = iter.next();
    System.out.println(employee.getPosition());
}
Then DataNucleus appears to produce SQL joining the two tables when I obtain an Employee's Position:
SELECT A0.POSITION_ID,B0.ID,B0.TITLE FROM MYSCHEMA.EMPLOYEE A0 LEFT OUTER JOIN MYSCHEMA."POSITION" B0 ON A0.POSITION_ID = B0.ID WHERE A0.ID = <1>
My understanding is that DataNucleus will use a cached Position instance, when available. (Is that correct?) However, I'm concerned that the joins will degrade performance. I'm not yet far enough along to run benchmarks. Are my fears misplaced? Should I continue, and benchmark? Is there a way to have DataNucleus avoid the join?
<jdo>
<package name="com.example.staff">
<class name="Position" identity-type="application" schema="MYSCHEMA" table="Position">
<inheritance strategy="new-table"/>
<field name="id" primary-key="true">
<column name="ID" jdbc-type="integer"/>
</field>
<field name="title">
<column name="TITLE" jdbc-type="varchar"/>
</field>
</class>
</package>
</jdo>
<jdo>
<package name="com.example.staff">
<class name="Employee" identity-type="application" schema="MYSCHEMA" table="EMPLOYEE">
<inheritance strategy="new-table"/>
<field name="id" primary-key="true">
<column name="ID" jdbc-type="integer"/>
</field>
<field name="name">
<column name="NAME" jdbc-type="varchar"/>
</field>
<field name="position" table="Position">
<column name="POSITION_ID" jdbc-type="int" />
<join column="ID" />
</field>
</class>
</package>
</jdo>
I guess what I'm hoping to be able to do is tell DataNucleus to go ahead and read the POSITION_ID int as part of the default fetch group, and see if the corresponding Position is already cached. If so, then set that field. If not, then do the join later, if called upon. Better yet, go ahead and stash that int ID somewhere, and use it if getPosition() is later called. That would avoid the join in all cases.
I would think that knowing the class and the primary key value would be enough to avoid the naive case, but I don't yet know enough about DataNucleus.
With the helpful feedback I've received, my .jdo is now cleaned up. However, after adding the POSITION_ID field to the default fetch group, I'm still getting a join.
SELECT 'com.example.staff.Employee' AS NUCLEUS_TYPE,A0.ID,A0."NAME",A0.POSITION_ID,B0.ID,B0.TITLE FROM MYSCHEMA.EMPLOYEE A0 LEFT OUTER JOIN MYSCHEMA."POSITION" B0 ON A0.POSITION_ID = B0.ID
I understand why it is doing that: the naive method will always work. I was just hoping it was capable of more. Although DataNucleus might not read all columns from the result set, returning the cached Position instead, it is still calling upon the datastore to access a second table, with all that entails, including possible disk seeks and reads. The fact that it will throw that work away is little consolation.
What I was hoping to do was tell DataNucleus that all Positions will be cached, trust me on that. And if for some reason you find one that isn't, blame me for the cache miss. I understand that you'll have to (transparently) perform a separate select on the Position table. (Even better, pin any Positions you do have to go fetch due to a cache miss. That way there won't be a cache miss on the object again.)
That is what I'm doing now using JDBC, by way of a DAO. One of the reasons for investigating a persistence layer was to ditch these DAOs. It is difficult to imagine moving to a persistence layer that can't move beyond naive fetches resulting in expensive joins.
As soon as Employee has not only a Position, but a Department, and other fields, an Employee fetch causes a half dozen tables to be accessed, even though all of those objects are already pinned in the cache, and are addressable given their class and primary key. In fact, I can implement this myself, changing Employee.position to an Integer, creating an IntIdentity, and passing it to PersistenceManager.getObjectByID().
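For reference, a minimal sketch of that manual lookup (assuming application identity with an int key, as in the metadata above; positionId stands for the value read from POSITION_ID):
// Resolve by class + primary key; validate=false means the datastore need
// not be consulted if a cached instance (or a hollow stand-in) can be returned.
Object oid = new javax.jdo.identity.IntIdentity(Position.class, positionId);
Position position = (Position) pm.getObjectById(oid, false);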
What I think I'm hearing is that DataNucleus is not capable of this optimization. Is that right? It's fine, just not what I expected.

By default, a join will not be done when the Employee entity is fetched from the datastore; it will only be done when Employee.position is actually read (this is called lazy loading).
Additionally, this second fetch can be avoided using the level 2 cache. First check that the level 2 cache is actually enabled (in DataNucleus 1.1 it is disabled by default; in 2.0 it is enabled by default). You should probably then "pin" the class so that Position entities will be cached indefinitely:
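(A sketch of the pinning call, using the JDO DataStoreCache API; "pmf" is assumed to be your PersistenceManagerFactory, and the parameter order of pinAll differs between JDO versions:)
javax.jdo.datastore.DataStoreCache cache = pmf.getDataStoreCache();
cache.pinAll(true, Position.class); // pin Position instances (and subclasses) in the L2 cache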
The level 2 cache can cause issues if other applications use the same database, however, so I would recommend only enabling it for classes such as Position which are rarely changed. For other classes, set the "cacheable" attribute to false (default is true).
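For example, a sketch of marking a hypothetical frequently-updated class as non-cacheable in the .jdo metadata:
<class name="SomeVolatileClass" cacheable="false">
...
</class>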
EDITED TO ADD:
The <join> tag in your metadata is not suitable for this situation. In fact, you don't need to specify the relationship explicitly at all; DataNucleus will figure it out from the types. But you are right when you say that you need POSITION_ID to be read in the default fetch group. This can all be achieved with the following change to your metadata:
<field name="position" default-fetch-group="true">
<column name="POSITION_ID" jdbc-type="int" />
</field>
EDITED TO ADD:
Just to clarify, after making the metadata change described above I ran the test code which you provided (backed by a MySQL database) and I saw only these two queries:
SELECT 'com.example.staff.Position' AS NUCLEUS_TYPE,`THIS`.`ID`,`THIS`.`TITLE` FROM `POSITION` `THIS` FOR UPDATE
SELECT 'com.example.staff.Employee' AS NUCLEUS_TYPE,`THIS`.`ID`,`THIS`.`NAME`,`THIS`.`POSITION_ID` FROM `EMPLOYEE` `THIS` FOR UPDATE
If I run only the second part of the code (the Employee extent), then I see only the second query, without any access to the POSITION table at all. Why? Because DataNucleus initially provides "hollow" Position objects and the default implementation of Position.toString() inherited from Object doesn't access any internal fields. If I override the toString() method to return the position's title, and then run the second part of your sample code, then the calls to the database are:
SELECT 'com.example.staff.Employee' AS NUCLEUS_TYPE,`THIS`.`ID`,`THIS`.`NAME`,`THIS`.`POSITION_ID` FROM `EMPLOYEE` `THIS` FOR UPDATE
SELECT `A0`.`TITLE` FROM `POSITION` `A0` WHERE `A0`.`ID` = <2> FOR UPDATE
SELECT `A0`.`TITLE` FROM `POSITION` `A0` WHERE `A0`.`ID` = <1> FOR UPDATE
(and so on, one fetch per Position entity). As you can see, there are no joins being performed, and so I'm surprised to hear that your experience is different.
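For reference, the override used in that second run was nothing more exotic than this sketch:
@Override
public String toString() {
    // Reading a persistent field forces the hollow Position to load its state.
    return title;
}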
Regarding your description of how you hope caching should work, that is how the level 2 cache ought to work when a class is pinned. In fact, I wouldn't even bother trying to pre-load Position objects into the cache at application start-up. Just let DN cache them cumulatively.
It's true that you may have to accept some compromises if you adopt JDO...you'll have to relinquish the absolute control that you get with hand-rolled JDBC-based DAOs. But in this case at least you should be able to achieve what you want. It really is one of the archetypal use cases for the level 2 cache.

Adding on to Todd's reply, to clarify a few things.
1. A <join> tag on a 1-1 relation means nothing. Well, it could be interpreted as saying "create a join table to store this relationship", but DataNucleus doesn't support such a concept, since best practice is to use a FK in either the owner or the related table. So remove the <join>.
2. A "table" attribute on a 1-1 relation suggests that the field is stored in a secondary table, yet you don't want that either, so remove it.
3. You retrieve Position objects, so it issues something like:
SELECT 'org.datanucleus.test.Position' AS NUCLEUS_TYPE,A0.ID,A0.TITLE FROM "POSITION" A0
4. You retrieve Employee objects, so it issues something like:
SELECT 'org.datanucleus.test.Employee' AS NUCLEUS_TYPE,A0.ID,A0."NAME" FROM EMPLOYEE A0
Note that it doesn't retrieve the FK for the position here, since that field is not in the default fetch group (it is lazily loaded).
5. You access the position field of an Employee object, so the FK needs retrieving (since DataNucleus doesn't yet know which Position relates to this Employee), so it issues:
SELECT A0.POSITION_ID,B0.ID,B0.TITLE FROM EMPLOYEE A0 LEFT OUTER JOIN "POSITION" B0 ON A0.POSITION_ID = B0.ID WHERE A0.ID = ?
6. At this point it doesn't need to retrieve the Position object itself, since it is already present (in the cache), so that object is returned.
All of this is expected behaviour IMHO. You could put the "position" field of Employee into its default fetch group and that FK would be retrieved in step 4, hence removing one SQL call.

Related

Maintaining Order in @ElementCollection using Set JPA

I've been having a lot of issues with JPA and Hibernate. The biggest problem was that it issued so many queries it was very slow.
So to counter that, I've switched all relationships to Set and have been using fetch joins to fetch the object graph in one request.
As a result, though, I can't switch to List to get an ordered collection of values, because the repository throws an exception at startup saying that Hibernate doesn't handle simultaneous persistent bags as soon as I try to join more than one entity.
So I found a new JPA 2.0 feature called @OrderColumn, but I'm not sure how to apply it to an @ElementCollection:
@ElementCollection(fetch = FetchType.LAZY)
@CollectionTable(name = "variant_option_vals",
        joinColumns = @JoinColumn(name = "variantOption_id"))
@OrderColumn(name = "sequence")
private Set<String> optionValues;
I added the annotation but the sequence column gets added to the base table, not the optionValues table.
Alternatively, is there a way to keep the natural order for the items? i.e keep the order the items were entered, and avoid the simultaneous bags issue?
The @OrderColumn annotation can't be used with Sets, only with Lists.
The OrderColumn documentation says:
Specifies a column that is used to maintain the persistent order of a list. The persistence provider is responsible for maintaining the order upon retrieval and in the database. The persistence provider is responsible for updating the ordering upon flushing to the database to reflect any insertion, deletion, or reordering affecting the list.
For Sets, the only option is to use the @OrderBy annotation, using a field of the Entity for ordering, not the insertion order of the Set.
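For comparison, a sketch of the question's mapping switched to a List so that @OrderColumn applies (names taken from the question; the annotation sits on the field, beside @CollectionTable, and the sequence column is created in the collection table):
@ElementCollection(fetch = FetchType.LAZY)
@CollectionTable(name = "variant_option_vals",
        joinColumns = @JoinColumn(name = "variantOption_id"))
@OrderColumn(name = "sequence") // stored in variant_option_vals, preserves insertion order
private List<String> optionValues;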
Have you tried using a SortedSet as the implementation of that set?

Entity Framework 5 .Include() is not loading records if there are related entities that have been deleted

I have a database with a 1..many relationship between two tables, call them Color and Car. A Color is associated 1..many with Cars. In my case, it's critical that Colors can be deleted any time. No cascade delete, so if a Color is deleted, the Car's Color_ID field points to something that doesn't exist. This is OK. They are related via a FK named Color_ID.
The problem comes in when I do this:
var query = context.Cars.Include(x => x.Colors);
This only returns Cars that have an associated Color record that exists. What I really want is ALL the Cars, even if their color doesn't exist, so I can do model binding with a GridView, i.e.
<asp:Label runat="server" Text='<%# Item.Colors == null ? "Color Deleted!" : Item.Colors %>' />
All of this works fine if I remove the .Include() and resort to lazy loading. Then Item.Car.Color is null. Perfect. However I'm seriously concerned about doing way too many database queries for a massive result set, which is certainly possible.
One solution to avoid excessive db queries is to return an anonymous type from the datasource query with all the specific related bits of info that I need for the grid and convert all my "Item" style bindings to good 'ol Eval(). But then I lose the strong typing, and the simplicity that Value Provider attributes bring. I'd hate to re-write all that.
Am I right, do I have to choose one or the other? How can I shape my query to return all the Car records, even if there is no Color record? I think I'm screwed if I try to eager load with .Include(). I need like a .IncludeWithNulls() or something.
UPDATE: Just thought of this. I don't know how ugly this is as far as query cost, but it works. Is there a better way??
var query = context.Cars.Include(x => x.Colors);
var query2 = context.Cars.Where(x => !context.Colors.Any(y => y.Color_ID == x.Color_ID));
return query.Union(query2);
The problem was an incorrect end multiplicity. What I really needed was not 1..many but 0..many. That way, Entity Framework generates a left outer join instead of an inner join from the .Include(). Which makes sense, there may be zero actual Color records in the example above. The thing that confused me was that in the SQL database, I never set those foreign key fields to nullable because at the time of creation, they always required a valid foreign key. So I set them to nullable and fixed up my .edmx table and everything is working. I did have to add a few more null checks here and there such as the one in my question above, that weren't strictly necessary before, since the .Include is now pulling in records that reference missing related entities, but no big deal.
So I lose out on the non-null checking at the db level, but I gain some consistent logic in my LINQ queries for how those tables actually relate and what I expect to get back.

Efficient way to load lists of objects from database to instantiate a single object

My situation
I have a C# object which contains some lists. One of these lists is, for example, a list of tags: a list of C# "SystemTag" objects. I want to instantiate this object in the most efficient way.
In my database structure, I have the following tables:
dbObject - the table which contains some basic information about my c# object
dbTags - a list of all available tags
dbTagConnections - a list which has 2 fields: TagID and ObjectID (to make sure an object can have several tags)
(I have several other similar types of data)
This is how I do it now...
Retrieve my object from the DB using an ID
Send the DB object to an "Object factory" pattern, which then realises we have to get the tags (and other lists). It then sends a call to the DAL layer using the ID of our C# object
The DAL layer retrieves the data from the DB
This data is sent to a "TagFactory" pattern, which converts it to tags
We are back to the Object Factory
This is really inefficient, and we make many calls to the database. It especially causes problems as I have 4+ types of lists.
What have I tried?
I am not really good at SQL, but I've tried the following query:
SELECT * FROM dbObject p
LEFT JOIN dbTagConnection c on p.Id = c.PointId
LEFT JOIN dbTags t on c.TagId = t.dbTagId
WHERE ....
However, this retrieves as many rows as there are tag connections, so I don't see joins as a good way to do this.
Other info...
Using .NET Framework 4.0
Using LINQ to SQL (BLL and DAL layer with Factory patterns in the BLL to convert from DAL objects)
...
So - how do I solve this as efficiently as possible? :-) Thanks!
At first sight I don't see your current way of working as "inefficient" (with the information provided). I would replace the query:
SELECT * FROM dbObject p
LEFT JOIN dbTagConnection c on p.Id = c.PointId
LEFT JOIN dbTags t on c.TagId = t.dbTagId
WHERE ...
with two calls to the DAL's methods: first to retrieve the object's main data (1), and then another to get only the data of the related tags (2), so that your factory can fill up the object's tag list:
(1)
SELECT * FROM dbObject WHERE Id=@objectId
(2)
SELECT t.* FROM dbTags t
INNER JOIN dbTagConnection c ON c.TagId = t.dbTagId
INNER JOIN dbObject p ON p.Id = c.PointId
WHERE p.Id=@objectId
If you have many object types but the amount of data is small (meaning you are not going to manage big volumes), then I would look for an ORM-based solution such as Entity Framework.
I (still) feel comfortable writing SQL queries in the DAOs to keep all queries sent to the DB server under control, but ultimately that is because our situation requires it. I don't see any inconvenience in having to query the database to recover, first, the object data (SELECT * FROM dbObject WHERE ID=@myId) and fill the object instance, and then query the DB again to recover all the satellite data that you may need (the Tags in your case).
You would have to be more specific about your scenario so that we can provide recommendations tailored to it. Hope this is useful to you anyway.
We used stored procedures that returned multiple result sets in a similar situation in a previous project using Java, MS SQL Server, and plain JDBC.
The stored procedure takes the ID of the object to be retrieved and returns the row used to build the primary object, followed by the records of each one-to-many relationship with the primary object. This allowed us to build the object in its entirety in a single database interaction.
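A sketch of that pattern in plain JDBC (the procedure name and result-set layout are illustrative assumptions, not the original project's code):
try (CallableStatement cs = conn.prepareCall("{call GetObjectWithChildren(?)}")) {
    cs.setInt(1, objectId);
    cs.execute();
    // First result set: the primary object's row.
    try (ResultSet rs = cs.getResultSet()) {
        if (rs.next()) {
            // build the primary object from rs
        }
    }
    // Subsequent result sets: one per one-to-many relationship (e.g. tags).
    while (cs.getMoreResults()) {
        try (ResultSet rs = cs.getResultSet()) {
            while (rs.next()) {
                // add each child row to the matching list on the object
            }
        }
    }
}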
Have you thought about using Entity Framework? You would then interact with your database in the same way as you would interact with any other type of class in your application.
It's really simple to set up, and you create the relationships between your database tables in the entity designer; this gives you all the foreign keys you need to fetch related objects. If you already have all your keys set up in the database, then the entity designer will use those instead. Creating all the objects is as simple as selecting 'Create model from database', and when you make changes to your database you simply right-click in the designer and choose 'Update model from database'.
The framework takes care of all the SQL for you, so you don't need to worry about that, in most cases.
A great starting place to get up and running with this would be here, and here
Once you have it all set up you can use LINQ to easily query the database.
You will find this a lot more efficient than going down the table adapter route (assuming that's what you're doing at the moment?)
Sorry if I missed something and you're already using this. :)
As far as I can tell, your database already exists and you are familiar enough with SQL.
You might want to use a micro ORM like PetaPoco.
To use it, you have to write classes that match the tables you have in the database (there are T4 generators to do this automatically with Visual Studio 2010). You can then write wrappers to create richer business objects (you can use ValueInjecter to do it; it is the simplest I have ever used), or you can use them as they are.
PetaPoco handles insert/update operations, and it retrieves generated IDs automatically.
Because PetaPoco handles multiple relationships too, it seems to fit your requirements.

EF4: Filtering out referenced entities that do not exist

I have an Entity Framework 4 design that allows referenced tables to be deleted (no cascade delete) without modifying the entities pointing to them. So for example entity A has a foreign key reference to entity B in the ID field. B can be deleted (and there are no FK constraints in the database to stop that), so if I look at A.B.ID it is always a valid field (since all this does is return the ID field in A) even if there is no record B with that ID due to a previous deletion. This is by design, I don't want cascading deletes, I need the A records to stick around for a while for auditing purposes.
The problem is that filtering out the non-existing deleted records is not as easy as it sounds. So for example if I do this:
from c in A
select c.B.somefield;
This results in an OUTER JOIN in the generated SQL, so it picks up all the A records even if they refer to missing B records. So, the hack I've been using to solve this (since I can't figure out a better way!) is to add a where clause that checks a string field in the referenced B records. If that field in the B entity is null, then I assume B doesn't exist.
from c in A
where c.B.somestringfield != null
select c.B.somefield;
This seems to work IF B.somestringfield is a string. If it is an integer, it doesn't work!
This is all such a hack to me. I've thought of a few solutions but they are just not practical:
Query all tables that reference B when a B is deleted and null out their foreign keys. This is so ugly; I don't want to have to remember to do this if I add another entity that references B in the future. Not to mention the huge performance delay resolving all the references whenever I delete something.
Add a string field to every table that I can count on being there that I can check to see if the entity exists. Blech, I don't want to add a database field just for this.
Implement a soft delete and keep all the referential integrity intact - essentially set up cascading deletes, but this is going to result in huge database bloat since I can't clean up a massive amount of records due to the references. No go.
I thought I had this problem licked with the "check if a field in the referenced entity is null" trick but it breaks under conditions that I don't completely understand (what if I don't have any strings in the referenced table? What kinds of fields will work? Integers won't.)
As an example if I have an integer field "count" in entity B and I check to see if it's null like:
from c in A
where c.B.count != null
select c.B.count;
I get a bunch of records with null for count mixed in with the results, and in fact the query bombs out with an "InvalidOperationException: The cast to value type 'Int32' failed because the materialized value is null. Either the result type's generic parameter or the query must use a nullable type."
So I need to do
from c in A
where c.B.count != null
select new { count = (int?)c.B.count };
to even see the null records. So this is pretty baffling to me how that query can result in null records in the results at all.
I just discovered something: if I do an explicit join like this, the SQL is an INNER JOIN and everything works great:
from c in A
join j in B on c.B.ID equals j.ID
select c;
But this sucks. I'll have to modify a ton of queries to add explicit join clauses instead of enjoying the convenience of the relationship fields I get with the EF. It kinda defeats the purpose and adds a bunch more code to maintain.
Your first code snippet creates an OUTER JOIN because B is an optional navigation property of entity A. For a required navigation property, EF would create an INNER JOIN (explained in more detail here: https://stackoverflow.com/a/7640489/270591).
So, the only alternative I see to your last code snippet (using explicit join in LINQ) - aside from using direct SQL - is to make your navigation property required.
This is still a very ugly hack in my opinion, and it might have unexpected behaviour in other situations. Making a navigation property required or optional gives the relationship a "semantic meaning" for EF, namely: if there is a foreign key != NULL, there must be a related entity, and EF expects that you have not removed the enforcement of the FK constraint in the database.

insert data from a asp.net form to a sql database with foreign key constraints

I have two tables:
asset: assetid (PK), empid (FK)
employee: empid (PK)
Now, I have a form to populate the asset table, but it can't because of the foreign key constraint. What should I do?
Thanks,
Tk
Foreign keys are created for a good reason - to prevent orphan rows at a minimum. Create the corresponding parent and then use the appropriate value as the foreign key value on the child table.
You should think about this update as a series of SQL statements, not just one statement. You'll process the statements in order of dependency, as in the example below.
Asset
PK AssetID
AssetName
FK EmployeeID
etc...
Employee
PK EmployeeID
EmployeeName
etc...
If you want to "add" a new asset, you'll first need to know which employee it will be assigned to. If it will be assigned to a new employee, you'll need to add them first.
Here is an example of adding an asset named 'BOOK' for a new employee named 'Zach'.
DECLARE @EmployeeFK AS INT;
INSERT INTO EMPLOYEE (EmployeeName) VALUES ('Zach');
SELECT @EmployeeFK = @@IDENTITY;
INSERT INTO ASSET (AssetName, EmployeeID) VALUES ('BOOK', @EmployeeFK);
The important thing to notice above is that we grab the new identity (aka EmployeeID) assigned to 'Zach', so we can use it when we add the new asset.
If I understand you correctly, you are trying to build the data graph locally before persisting it to the database? That is, create the parent and child records within the application and persist it all at once?
There are a couple of approaches to this. One approach people take is to use GUIDs as the unique identifiers for the data. That way you don't need to get the next ID from the database; you can just create the graph locally and persist the whole thing. There's been a debate on this approach between the software and database camps for a long time, because while it makes a lot of sense in many ways (hit the database less often, maintain relationships before persisting, uniquely identify data across systems), it turns out to be a significant resource hit on the database.
Another approach is to use an ORM that will handle the persistence mapping for you. Something like NHibernate, for example. You would create your parent object and the child objects would just be properties on that. They wouldn't have any concept of foreign keys and IDs and such, they'd just be objects in code related by being set as properties on each other (such as a "blog post" object with a generic collection of "comment" objects, etc.). This graph would be handed off to the ORM which would use its knowledge of the mapping between the objects and the persistence to send it off to the database in the correct order, perhaps giving back the same object but with ID numbers populated.
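The answer describes this with NHibernate (.NET); as a hedged illustration of the same idea in the Java analogue (JPA/Hibernate), with hypothetical entity names:
import java.util.ArrayList;
import java.util.List;
import javax.persistence.*;

@Entity
public class BlogPost {
    @Id @GeneratedValue
    private Long id;

    // Cascade lets the ORM persist the children along with the parent,
    // in dependency order, filling in the generated foreign keys itself.
    @OneToMany(mappedBy = "post", cascade = CascadeType.ALL)
    private List<Comment> comments = new ArrayList<Comment>();
}

@Entity
class Comment {
    @Id @GeneratedValue
    private Long id;

    @ManyToOne
    private BlogPost post;
}
// Usage: build the graph in code, then a single em.persist(post) inserts the
// post first and each comment after it, using the post's generated ID as the FK.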
Or is this not what you're asking? It's a little unclear, to be honest.
