For example, can I put 100.000.000 documents in one plone folder?
In recent Plone releases, folders use a BTree-backed storage, so you can store as many objects in there as you like.
The biggest folder that I can access in a production environment currently stores 25k items.
You will, of course, need to deal with large numbers of items in one location appropriately. The usual caveats about really large numbers of content in a site apply.
Related
I have two environments. One is development and another is production. Lets say I have folder in production which has all my metadata like ILs, joins, DS, Analysis, scripts etc. Now in development I have the same folder but with new enhancements done.
Now, I want to compare that what are the changes that have been done and as per the result I will be able to understand the impact.
So, could you please tell me that what is way to compare that two folders of development and production environment?
For the requirements posted here, you can create information link on top of LIB_ITEMS table to fetch details of library item details from Spotfire database. An another set of activity is performed at this link, but approach can be used for your requirements as well.
I have a Plone instance which contains some structures which I need to copy to a new Plone instance (but much more which should not be copied). Those structures are document trees ("books" of Archetypes folders and documents) which use resources (e.g. images and animations, by UID) outside those trees (in a separate structure which of course contains lots of resources not needed by the ones which need to be copied).
I tried already to copy the whole data and delete the unneeded parts, but this takes very (!) long, so I'm looking for a better way.
Thus, the idea is to traverse my little forest of document trees and transfer them and the resources they need (sparsely rebuilding that separate structure) to the new Plone instance. I have full access to both of them.
Is there a suggested way to accomplish this? Or should I export all of them, including the resources structure, and delete all unneeded stuff afterwards?
I found out that each time that I make this type of migration by hand, I make mistakes that force me to do it again.
OTOH, if migration is automated, I can run it, find out what I did wrong, fix the migration, and do it all over again until I am satisfied.
In this context, to automate your migration, I advise you to look at collective.transmogrifrier.
I recommend jsonmigrator - which is a twist on collective.transmogrifier mentioned by Godefroid. See my blog on it here
You can even use it to migrate from Archetypes to Dexterity types (you just need matching fieldnames (and matching types roughly speaking).
Trying to select the resources to import will be tricky though. Perhaps you can find a way to iterate through your document trees & "touch" (in a unix sense) any resource that you are using. Then copy across only resources whose "timestamp" indicates that they have been touched.
In the past I've heard that it can be dangerous to rename the Plone site object(change it's id).
Is renaming dangerous? What are the potential issues when renaming?
The problem seems to be in the references (from http://plone.293351.n2.nabble.com/Renaming-the-Plone-object-via-ZMI-td4428462.html):
I am trying to rename the Plone object via the Zope Management Interface, however as a side effect, all references I have on "reference_catalog" are lost, including LinguaPlone translations, related content and the path criteria from collections.
I am using Plone 3.1.7, and you can reproduce the problem by creating a vanilla Plone object, add some Documents with related content pointing to each other, add a Folder and Collection, and add a path criterion to the collection, pointing to that folder. Check "reference_catalog". After that, rename the portal object and "reference_catalog" will be empty.
Is there any sollution for this problem?"
Which was answered with:
You need to update your catalog, since the plone object's id is the
first element of the path of all your content.
http://plone.org/documentation/error/portal-content-has-gone-missing/
In the thread there are some links to scripts to avoid losing references.
Any reference to an object that is stored as a path will be a problem. As Yuri's answer points out, that includes the paths in catalogs. That's a relatively easy one to deal with, by doing a full rebuild of the catalog. There are other issues that may be harder to find, such as paths within collection criteria and portlet data.
I'm looking for the best solution to allow our users to upload XLS spreadsheet so that they can be used to populate tables in our data warehouse (DW).
Our users are heavy Business Object (BO) users, and BO lets you export to XLS. When they have data in a spreadsheet that needs to be loaded to the DW, they need a process to upload the data in the XLS to the DW's db. As a result, we end up with many of these "interfaces" when I think that what we really need is a programmatic automated feed. Using Excel as a data source for inter-system feeds, in my gut, just seems like a bad idea to me.
Question #1: I'd like to see if you agree and why or why not.
OK, there is no swimming against that tide, so I now take as a given that XLS uploads are here to stay for us. Now I need to find the best solution. First, I'll explain what we do now and then what I don't like about it:
Via web pages, we provide empty XLS files (no rows) with a defined set of columns. Each file is intended to be used to update a different target dest table. In each spreadsheet is an "upload" button. Pushing the Upload button results in the macro in the spreadsheet serializing the contents of the file to CSV and FTPing the data to server folder. Periodically, a scheduler fires off an Informatica ETL job that uses the CSV file as input and loads the data into a custom XLS-specific staging table and then, if the records pass edits, into the appropriate target table. Any errors encountered are logged to an error table. For each XLS file uploaded, the data ends up in a separate staging and error table that is specific for the file.
Some of the things I don't like include about our process are:
1) The macro code in the XLS is too exposed, includes passwords for example, can be tampered with and there are issues ensuring that the users are using the latest XLS templates.
2) Business Rule edits are placed in the ETL program, where they should probably be, but because we would like to catch the errors ASAP, i.e, in the spreadsheet, edits are also added to the macro code. This results in duplication of business edits. I want these rules in one place and centrally controlled. IMHO, I think putting any macro code in the XLS introduces a maintenance issue, even calls to stored procedures (some of which we have) or calls to web services (we haven't yet tried to call .NET Web Services from XLS macros.)
3) Every XLS file upload template has its own process with distinct set of staging and error tables and a custom screen for reporting errors encountered. It seems like we need a more generalized re-usable solution.
Besides often getting data exported to XLS from BO, the users like also Excel because it is easier to edit a large number of records and less clunkier than editing individual records via a web interface.
This is the general direction that I am thinking:
First, I want the users to have the ease of editing of Excel with editing, but without including embedded macros in the spreadsheet. I experimented with Farpoint's Grid with Excel compatibility...
http://www.fpoint.com/netproducts/spreadweb/tour/excel.aspx
...and I found that it was quite easy to allow a user the ability to open up an XLS file that resides on their PC and have it open up in a browser and be able to easily access the data read from server-side .NET web code. Excel isn't running locally in their browser, but the functionality of Excel is reproduced, presumably through a lot of client sided scripting that I expect would be a real pain to duplicate myself. You can even cut and paste from a local spreadsheet into the web's spreadsheet. This sounds great, by biggest problem is cost. Our company is near death and won't allow us to purchase any new software.
Next, I want to identify the common components across all spreadsheet upload processing and come up with generic processing code. For example, I imagine a table which defines each of our spreadsheets and the format of each including the column names and data type definitions, perhaps in terms of their destination columns instead of hard coding. Based on this table template definition, I can generate XLS templates for download from this table definition. I can also perform simple generic edits to ensure that the data entered matches the table definition. And one common web page can be used to present the data and allow report data type mismatch errors and allow for the user to correct them. I would also define a common table for storing the data in a "staging" table, using a table with two columns, submission #, row num, name and value, perhaps. No more "custom everything" is the goal.
Next I need to decide where to put the business rules. My dept's mgt firmly believes that all loading of data should be done by Informatica ETL batch processes and therefore the rules/edits belong "in Informatica". I have zero experience with Informatica tools, I am more of a .NET guy. I am therefore unsure as to how these rules are implemented but I suspect that they are not reusable in the sense that they can be used by a .NET web page to validate a particular record against. You see, in some cases, when the user is not performing a bulk upload, they do have the ability to edit a specific record and I would like the same edits that were applied by the ETL bulk insert process to be applied to an individual update attempt to a single record via a web page. If the solution to write a single web service or stored procedure that can be called from either the web page doing an update of a single record or called thousands of times for each record in a bulk upload? The latter sounds inefficient.
Your thoughts on anything above would be very much welcomed.
From a cost perspective, the efforts you'll need to go through to re-create spreadsheet functionality on the web will exceed the cost of Farpoint or other controls. Even if you made $20 an hour, do you think you could complete a working product in under 2 weeks? I think you have the facts on your side when you discussed maintenance issues if you allow ETL functionality to exist in Excel - you have twice the amount of work to maintain the transformation rules. I think you need to convince management that in order to create a maintainable, robust solution you need some flexible utilities.
Farpoint is a good choice. There is also SpreadsheetGear that is a .Net engine that interprets Excel macros and can run on a web server. It has a Win32 control that allows you to create a WinForms solution with very Excel interface functionality. Last time I checked there was no web control for the product. It does an excellent job of providing Excel capability for processing large amounts of data.
Good luck. I think you will find a good solution since you seem to have a good grasp of the pro's and con's of all the different potential solutions.
I'm trying to start using plain text files to store data on a server, rather than storing them all in a big MySQL database. The problem is that I would likely be generating thousands of folders and hundreds of thousands of files (if I ever have to scale).
What are the problems with doing this? Does it get really slow? Is it about the same performance as using a Database?
What I mean:
Instead of having a database that stores a blog table, then has a row that contains "author", "message" and "date" I would instead have:
A folder for the specific post, then *.txt files inside that folder than has "author", "message" and "date" stored in them.
This would be immensely slower reading than a database (file writes all happen at about the same speed--you can't store a write in memory).
Databases are optimized and meant to handle such large amounts of structured data. File systems are not. It would be a mistake to try to replicate a database with a file system. After all, you can index your database columns, but it's tough to index the file system without another tool.
Databases are built for rapid data access and retrieval. File systems are built for data storage. Use the right tool for the job. In this case, it's absolutely a database.
That being said, if you want to create HTML files for the posts and then store those locales in a DB so that you can easily get to them, then that's definitely a good solution (a la Movable Type).
But if you store these things on a file system, how can you find out your latest post? Most prolific author? Most controversial author? All of those things are trivial with a database, and very hard with a file system. Stick with the database, you'll be glad you did.
It is really depends:
What is file size
What durability requirements do you have?
How many updates do you perform?
What is file system?
It is not obvious that MySQL would be faster:
I did once such comparison for small object in order to use it as sessions storage for CppCMS. With one index (Key Only) and Two indexes (primary key and secondary timeout).
File System: XFS ext3
-----------------------------
Writes/s: 322 20,000
Data Base \ Indexes: Key Only Key+Timeout
-----------------------------------------------
Berkeley DB 34,400 1,450
Sqlite No Sync 4,600 3,400
Sqlite Delayed Commit 20,800 11,700
As you can see, with simple Ext3 file system was faster or as fast as Sqlite3 for storing data because it does not give you (D) of ACID.
On the other hand... DB gives you many, many important features you probably need, so
I would not recommend using files as storage unless you really need it.
Remember, DB is not always the bottle neck of the system
Forget about long-winded answers, here's the simplest reasons why storing data in plaintext files is a bad idea:
It's near-impossible to query. How would you sort blog posts by date? You'd have to read all the files and compare their date, or maintain your own index file (basically, write your own database system.)
It's a nightmare to backup. tar cjf won't cut it, and if you try you may end up with an inconsistent snapshot.
There's probably a dozen other good reasons not to use files, it's hard to monitor performance, very hard to debug, near impossible to recover in case of error, there's no tools to handle them, etc...
I think the key here is that there will be NO indexing on your data. SO to retrieve anything in say a search would be rediculously slow compared to an indexed database. Also, IO operations are expensive, a database could be (partially) in memory, which makes the data available much faster.
You don't really say why you won't use a database yourself... But in the scenario you are describing I would definitely use a DB over folder any day, for a couple of reasons. First of all, the blog scenario seems very simple but it is very easy to imagine that you, someday, would like to expand it with more functionality such as search, more post details, categories etc.
I think that growing the model would be harder to do in a folder structure than in a DB.
Also, databases are usually MUCH faster that file access due to indexing and memory caching.
IIRC Fudforum used the file-storage for speed reasons, it can be a lot faster to grab a file than to search a DB index, retrieve the data from the DB and send it to the user. You're trading the filesystem interface with the DB and DB-library interfaces.
However, that doesn't mean it will be faster or slower. I think you'll find writing is quicker on the filesystem, but reading faster on the DB for general issues. If, like fudforum, you have relatively immutable data that you want to show several posts in one, then a file-basd approach may be a lot faster: eg they don't have to search for every related post, they stick it all in 1 text file and display it once. If you can employ that kind of optimisation, then your file-based approach will work.
Also, mail servers work in the file-based approach too, the Maildir format stores each email message as a file in a directory, not in a database.
one thing I would say though, you'll be better storing everything in 1 file, not 3. The filesystem is better at reading (and caching) a single file than it is with multiple ones. So if you want to store each message as 3 parts, save them all in a single file, read it to get any of the parts and just display the one you want to show.
...and then you want to search all posts by an author and you get to read a million files instead of a simple SQL query...
Databases are NOT faster. Think about it: In the end they store the data in the filesystem as well. So the question if a database is faster depends strongly on the access path.
If you have only one access path, which correlates with your file structure the file system might be way faster then a database. Just make sure you have some caching available for the filesystem.
Of course you do loose all the nice things of a database:
- transactions
- flexible ways to index data, and therefore access data in a flexible way reasonably fast.
- flexible (though ugly) query language
- high recoverability.
The scaling really depends on the filesystem used. AFAIK most file system have some kind of upper limit for number of files (totally or per directory), though on the new ones this is often very high. For hundreds and thousands of files with some directory structure to keep directories to a reasonable size it should be possible to find a well performing file system.
#Eric's comment:
It depends on what you need. If you only need the content of exact on file per query, and you can determine the location and name of the file in a deterministic way the direct access is faster than what a database does, which is roughly:
access a bunch of index entries, in order to
access a bunch of table rows (rdbms typically read blocks that contain multiple rows), in order to
pick a single row from the block.
If you look at it: you have indexes and additional rows in memory, which make your caching inefficient, where is the the speedup of a db supposed to come from?
Databases are great for the general case. But if you have a special case, there is almost always a special solution that is better in some sense.
if you are preferred to go away with RDBMS, why dont u try the other open source key value or document DBs (Non- relational Dbs)..
From ur posting i understand that u r not goin to follow any ACID properties of relational db.. it would be better to adapt other key value dbs (mongodb,coutchdb or hyphertable) instead of your own file system implementation.. it will give better performance than the existing approaches..
Note: I am not also expert in this.. just started working on MongoDB and find useful in similar scenarios. just wanted to share in case u r not aware of these approaches