What are the rules for deciding when a new catalog should be created? - plone

I'd like to learn about using catalogs correctly.
I have about 30 useful content types, about 50 indexes in catalog.xml, and about 45 metadata columns. Just three types account for most of the site's data - and I may need millions of those objects. I've been reading, and there's lots to do, but I want to get the basic configuration right before I begin all that.
This page told me that any non-default indexes should not be added to the portal_catalog. I've even read people explaining how removing one or two of the default indexes makes a performance difference.
My question is: what are the rules for dividing up the indexes into different catalogs, and for selecting which catalog(s) index which type(s)?
So far I have created one additional catalog, used to catalog all indexes for my 'site-setup' objects (which I have excluded from the portal_catalog). The site-setup indexes are used very often, but modified more rarely than others, so I thought it made sense to separate them from objects which are reindexed more often. I'm not sure if that's the main consideration though.
Another, similar question (a good example of the kind of thing I want to solve): how would you handle something like secondary workflow review_state variables? I give each workflow's review_state variable its own index (and search on them quite often), but some of my workflows are used on only a few types. (My most prolific objects have secondary workflows...)
I'd be very grateful for advice!
Campbell

This won't cover everything, but I'll bring up some points.
Anything not in the portal_catalog won't work with collections, the folder_contents view, the getFolderContents method, search, collection portlets, related items (I think), and anything else that assumes you're using the portal_catalog.
I like to use an additional catalog when I need to be able to query the data but it only affects a sub-set of the content objects.
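As a hedged sketch of what querying such an extra catalog can look like (the tool id 'my_setup_catalog' and the 'review_state' index are hypothetical names, not anything from the question):

    # Minimal sketch, assuming a custom ZCatalog registered in the site as
    # 'my_setup_catalog' with a 'review_state' FieldIndex -- both names are
    # made up for illustration.
    from Products.CMFCore.utils import getToolByName

    def find_setup_objects(context, state='published'):
        catalog = getToolByName(context, 'my_setup_catalog')
        # Query only the indexes this catalog actually defines
        brains = catalog.searchResults(review_state=state)
        return [brain.getObject() for brain in brains]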
Use collective.indexing to speed up indexing operations.
Mount the catalogs on their own mount points so you can cache them differently from the rest of the site (so you can cache the whole catalog). Then you can even serve the catalogs from a dedicated ZEO server.
Also, if your content doesn't have to be cataloged by the portal_catalog (with all the constraints listed), you may even want to think about whether you need it as a full-fledged (Archetypes or Dexterity) type in the first place. You can use the slimmer repoze.catalog to catalog arbitrary objects (which could be very simple data) for whatever your purpose is and get even more performance. Or, better yet, look into Solr for indexing and get VERY good performance.
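For a rough idea of what repoze.catalog usage looks like, here is a minimal sketch; the 'color' index, the Record class and the docids are made up for illustration:

    # Cataloguing plain Python objects with repoze.catalog
    from repoze.catalog.catalog import Catalog
    from repoze.catalog.indexes.field import CatalogFieldIndex
    from repoze.catalog.query import Eq

    class Record(object):
        def __init__(self, color):
            self.color = color

    def get_color(obj, default):
        # Discriminator: tells the index which value to store for an object
        return getattr(obj, 'color', default)

    catalog = Catalog()
    catalog['color'] = CatalogFieldIndex(get_color)

    # docids are plain integers you manage yourself
    catalog.index_doc(1, Record('red'))
    catalog.index_doc(2, Record('blue'))

    count, docids = catalog.query(Eq('color', 'red'))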
One more thing: depending on the type of data you're storing, you could even look into using a relational database as a data store. But I don't know what kind of queries, indexes, data, etc. you have...
30 different types seems like a lot but I don't know what your use case is. Care to share? Perhaps there is a better way to do it.

Related

How to transfer a structure from one Plone to another

I have a Plone instance which contains some structures which I need to copy to a new Plone instance (but much more which should not be copied). Those structures are document trees ("books" of Archetypes folders and documents) which use resources (e.g. images and animations, by UID) outside those trees (in a separate structure which of course contains lots of resources not needed by the ones which need to be copied).
I tried already to copy the whole data and delete the unneeded parts, but this takes very (!) long, so I'm looking for a better way.
Thus, the idea is to traverse my little forest of document trees and transfer them and the resources they need (sparsely rebuilding that separate structure) to the new Plone instance. I have full access to both of them.
Is there a suggested way to accomplish this? Or should I export all of them, including the resources structure, and delete all unneeded stuff afterwards?
I found out that each time that I make this type of migration by hand, I make mistakes that force me to do it again.
OTOH, if migration is automated, I can run it, find out what I did wrong, fix the migration, and do it all over again until I am satisfied.
In this context, to automate your migration, I advise you to look at collective.transmogrifier.
I recommend jsonmigrator - which is a twist on collective.transmogrifier mentioned by Godefroid. See my blog on it here
You can even use it to migrate from Archetypes to Dexterity types (you just need matching field names and, roughly speaking, matching types).
Trying to select the resources to import will be tricky though. Perhaps you can find a way to iterate through your document trees and "touch" (in the Unix sense) any resource that you are using, then copy across only the resources whose "timestamp" indicates that they have been touched.
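A rough, untested sketch of that "touch" idea, assuming Archetypes content so getRefs() and UID() are available - it just records the UIDs of everything the trees reference, so only those resources get transferred:

    # Collect the UIDs of all resources referenced from the document trees
    # to be copied, so only those get exported/transferred.
    needed_uids = set()

    def collect_refs(folder):
        for obj in folder.objectValues():
            # Archetypes reference fields (e.g. images used by UID)
            getRefs = getattr(obj, 'getRefs', None)
            if getRefs is not None:
                for ref in getRefs():
                    needed_uids.add(ref.UID())
            if getattr(obj, 'isPrincipiaFolderish', False):
                collect_refs(obj)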

node_load or direct query?

What rule of thumb do you use for deciding to use node_load() or just writing a direct db_query()?
In a situation I'm looking at right now I need to get some node data and resolve data on two nodereference fields. So that would be 3 calls to node_load(). At some point here, would it be more efficient to construct the query with Joins directly?
This is for use in a self contained module that won't be distributed or used anywhere else, so I don't believe I need to worry about subverting node modification hooks (or do I?).
Edit:
Thinking about my question more, node_load() is only really applicable when you have one node to grab (and then maybe drilling down further into node references like in my example). But as soon as you need to return more than one node based on some criteria, you're pretty much forced to use db_query(), right? Does Drupal have any abstracted API for writing queries like this?
Not a full answer (not sure myself), just some hints.
node_load() uses a static cache (in Drupal 7, you can even use the entity_cache module to make it a permanent cache). If the nodes you are loading are used a second time on the same page, that call will be free.
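In rough Python terms (purely to illustrate the pattern - this is not Drupal code), a static cache looks something like this:

    # Conceptual sketch of node_load()'s static cache: the expensive lookup
    # happens at most once per node id per request.
    _node_cache = {}

    def expensive_db_fetch(nid):
        # Stand-in for the real database query
        return {'nid': nid, 'title': 'example'}

    def load_node(nid):
        if nid not in _node_cache:
            _node_cache[nid] = expensive_db_fetch(nid)
        return _node_cache[nid]

    first = load_node(42)   # hits the "database"
    second = load_node(42)  # served from the cache, effectively free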
Querying CCK tables is tricky. The schema structure can change completely based on configuration, for example when switching between single and multiple values.
The reasoning behind using API methods for DB calls over direct DB calls is that they provide a DB abstraction layer, so your app can move between supported database engines, and can gracefully handle any schema changes (however unlikely) that core or a module may make to the tables in question. It's also likely easier, as Berdir says, for CCK fields and node reference fields, but that depends on which you are more confident with: the Drupal API and PHP, or MySQL... The payoff of doing it the Drupal way is increased future productivity and a better understanding of the codebase and what is possible :)
Oh, and my rule of thumb is: do it the Drupal way if at all possible ("possible" varying with the app's time/cost/performance/whatever requirements).

How do I do this in Drupal?

I'm currently evaluating Drupal to see if we can use it to replace our framework. My problem is I have these legacy tables which I want to try to reflect in Drupal. It involves a join table. There's quite a lot of this kind of relationship in our existing web app, so I am looking for possible ways to solve it.
Thank you for your insight!
There are several ways to do this, and it's hard to know which is best with no context about what you're actually doing with the data, but here are some options:
One way to do this is to make a content type representing each table (using CCK) with the foreign keys represented by type-specific node reference fields. Doing everything as nodes gives you a bunch of prebuilt functionality around nodes, but has a bit of overhead you may want to avoid.
Another option is to leave your database just like it is now. Drupal can do direct database queries, or you can use the Data module to expose your tables to Views.
Another option, if those referenced tables really only have one non-ID field, is to do the project_companies_assignments as nodes and the other three as taxonomies. But this won't work if those are really more complex entities, and wouldn't be very flexible if they might become more complex.
What about using hook_views_api and exposing your legacy tables in hook_views_data? I tried something like this myself - not sure if that is what you want...
Try it and let me know if it works for you.
http://drupalwalla.blogspot.com/2011/09/how-do-you-expose-your-legacy-database.html
Going with Views and CCK, optionally with the additional Data module, has one huge disadvantage: it comes with complexity.
My preferred alternative is to write your own module. Drupal offers little help with database abstraction; it does not come with a proper ORM or the like. But with some simple CRUD functions for the data in the database, a few simple forms in front, and a menu callback with some pages to present the data, you can - quite often - get your data model worked out much faster than going the route of the overly complex, often poorly documented CCK and Views modules. KISS.

When not to use a Drupal node?

I've recently created a very simple CRUD table where the user stores some data. For the data, I created a custom node type. The functionality works great for creating, editing, and deleting data in the CRUD table using the basic node functionality (I'm actually amazed how fast and easy it was to program the basic functionality with proper access controls using only a tiny bit of code)...
Since the data isn't meant to be treated the same way as 'content' such as a blog post (no title, no body, no comments, no revisions, shouldn't show up on the ?q=node page, no previews, no teasers, etc.)... I find that I'm spending most of my time 'turning off' and modifying the stuff that Drupal does automatically for nodes.
I know it's a matter of taste, but where should one draw the line on what should be treated as a node and what shouldn't? In other words, would it be better to program this stuff from scratch without using nodes?
Using nodes for custom data has quite a few additional benefits besides easy edit/update/delete functionality:
possible categorization via taxonomy
implicit 'ownership' via author tracking
implicit tracking of creation/modification time
basic access control by default, expandable by a huge selection of modules
flexible query generation/listing/filtering via views
possible ad hoc extensions/annotations via CCK fields
possible definition of workflows, actions and the like
a huge number of hooks to programmatically intercept/adjust almost every usage aspect/scenario
commenting, voting, rating and tons of other functionality provided by all contributed modules that work on/with nodes ...
Given all this, I'd say you need a very good reason to not use nodes to store data in Drupal. Nodes are simply the fundamental building blocks for just about everything in the Drupal ecosystem, and the overhead of removing some unwanted default 'features' seems pretty small in comparison to the gains.
That said, one possible reason/argument to handle data separately from the node system might be if that data is directly aimed at annotating other nodes (think taxonomy). But since you can easily reference nodes from other nodes (with lots of different options on how to do this), the argument is not too strong.
Another (much stronger) argument would be data integrity - Drupal is not very strong (to put it politely) concerning normalized, relational data storage, referential integrity, transaction handling and other related topics. If you have requirements in that direction, you might have no choice but to skip the node concept and create and maintain a separate data island within the system on your own.
It also helps to remember that a node doesn't need to be public. Some nodes are private/internal and can be controlled further with access controls. The way you are doing it, whatever you're doing, puts all the work of scaling and extending it on your shoulders.
I would probably approach it with CCK/Taxonomy depending on what I was doing. That way, I get the added benefit of Views/Panels/etc module integration without writing any additional code.

asp.net profile provider

I know there are already some questions on this topic on the site...
I am just trying to understand whether it's safe to use the ASP.NET Profile Provider with a high-traffic website.
The way I see it, it's laid out inefficiently. You store the property name (which is a string) and the property value (which is also a string). If you are just trying to store, say, age in the profile, you are unnecessarily storing the string "age" in the database over and over, whereas with a self-created table you could just add a column titled age, with no redundancy.
(I am just trying to make sure I am not missing something about it, because I am fairly new to it.)
The profile provider uses an EAV (Entity-Attribute-Value) design deliberately, because profiles in general very commonly have a sparsely populated schema - that is, there are many potential attributes, but only a few will be used for a given single entity, and the few that are used varies widely from one entity to the next.
Let's use a totally arbitrary example - let's say only one in 10 of your users want to provide their age. Making that a column now seems more like a waste, no?
But what if your application makes age mandatory? OK, that column gets populated for everyone. But what if you need to make a note in the profile "user doesn't want to see this obscure dialog anymore"? Do you really want a column for every single dialog in your application, tracking whether a user wants to see it? Probably not. When you get into the little one-off details of an application of any significant scope, EAV actually becomes the more economical choice.
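To make the trade-off concrete, here is a purely illustrative sketch (Python data structures standing in for tables, with made-up attribute names) of why a sparse schema favours EAV:

    # Illustrative only: wide-table vs. EAV for sparsely populated profiles.
    # With a wide table, every user pays for every possible column:
    wide_rows = [
        {'user': 'alice', 'age': 30,   'hide_tip_dialog': None, 'theme': None},
        {'user': 'bob',   'age': None, 'hide_tip_dialog': True, 'theme': None},
    ]

    # With EAV, only the attributes actually set are stored:
    eav_rows = [
        ('alice', 'age', '30'),
        ('bob', 'hide_tip_dialog', 'true'),
    ]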
In general, it scales quite well (far better than you probably think). In your specific case, it doesn't matter - as always, use what works and fix performance problems when they come up. Whatever the scalability limitations of the profile provider are, you'll know when you hit them. I guarantee two things: (1) you'll have to fix a lot of other performance problems you didn't expect before you have to fix that; and (2) if your site is getting enough traffic to break the profile provider, that's a good problem to have.
I agree with Rex M, unless you need to do things like sort all your users by age or run other procedures on aggregate profile data. Then you could consider rolling your own. But for just storing properties that you access here and there on a per-user basis, Rex M is right.
I do know what you mean. Wouldn't it make sense to supplement the profile provider's table with another table that has columns for the mandatory fields? Or do you think the overhead of the join would make it not worth it?