Identifying Files in Plone BlobStorage

Files in var/blobstorage can be listed and sorted by size with Unix commands, which puts the biggest files at the top of the list. How can I identify which IDs/paths in a Plone site these files belong to?

There is no 'supported' way to do this. You could probably write a script to inspect the ZODB storage, but it'd be complicated. If you want to find the biggest files in your Plone site, you're probably better off writing a script that runs in Plone and uses portal_catalog to search for all File objects (or whatever content type is most likely to hold big files), calling get_size() on each. That should return the (cached) size, and you can then delete whatever you want to clean up.
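A minimal sketch of the kind of in-Plone script the answer describes, run for example via bin/instance run; the Zope root name "app" and the site id "Plone" are assumptions about your deployment, and you may need a suitable security manager to load private content:

    # Sketch: list the biggest File objects via portal_catalog.
    # "app" is the Zope root provided by bin/instance run; the site id
    # "Plone" is an assumption about your deployment.
    from Products.CMFCore.utils import getToolByName

    site = app['Plone']
    catalog = getToolByName(site, 'portal_catalog')

    sizes = []
    for brain in catalog(portal_type='File'):
        obj = brain.getObject()
        sizes.append((obj.get_size(), brain.getPath()))

    # Biggest files first; print the top 50 with their paths in the site.
    for size, path in sorted(sizes, reverse=True)[:50]:
        print(size, path)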

Related

.goutputstream-XXXXX - possible to relocate?

I've been trying to create a union file system for a college project. One of its features that differentiates it from unionfs is the fact that there are no copy-ups. This means that if a file is located in a certain branch, it will remain there even if it is written to.
But my current problem with that is the fact that .goutputstream-XXXXX are created, renamed, and deleted whenever a write operation occurs. This is actually OK if the file being written to is in the highest priority branch (i.e. the default branch where files can be created), but makes my kernel crash if I try to write to a file in a lower branch.
How do I deal with this? How can I rig it so that all .goutputstream-XXXXX files are written to only one location? These .goutputstream-XXXXX files seem to be intricately connected to the files they correspond to, and seem to work only in the same directory as the file being written to.
I also noticed that .goutputstream-XXXXX files appear when a directory is read. What are they for, anyway?
There has been a bug submitted to the Ubuntu Launchpad in which the creation of .goutputstream-xxxxx files is discussed:
https://bugs.launchpad.net/ubuntu/+source/lightdm/+bug/984785
From what I can see, these files are created when shutting down without logging out first, but several other programs, like Evince or maybe gedit, can produce them as well.
Maybe lightdm has something to do with the creation of these files.
Which distribution did you use? Maybe changing the distribution would help.
.goutputstream-XXXXX files are created by gedit, and there is no simple way (via a menu or settings) to relocate them.

How do I download efficiently with rsync?

A couple of questions related to one theme: downloading efficiently with Rsync.
Currently, I move files from an 'upload' folder onto a local server using rsync. Files to be moved are often dumped there, and I regularly run rsync so the files don't build up. I use '--remove-source-files' to remove files that have been transferred.
1) The '--delete' options that remove destination files come in several variants that let you choose when the files are removed. Something similar would be handy for '--remove-source-files', since it seems that, by default, rsync only removes the source files after all the files have been transferred, rather than after each one. Other than writing a script to make rsync transfer files one-by-one, is there a better way to do this?
2) On the same note, if a large (single) file is transferred, it can only be deleted after the whole thing has been successfully moved. It strikes me that I might be able to use 'split' to break the file into smaller chunks, so each chunk can be deleted as the download progresses; is there a better way to do this?
Thanks.
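No answer is attached here, but the "transfer files one-by-one" workaround from question 1 is easy to script; a rough sketch with hypothetical paths, invoking rsync once per file so each source file is removed as soon as its own transfer finishes:

    # Sketch: call rsync once per file so --remove-source-files takes effect
    # after each file rather than after the whole batch. Paths are hypothetical.
    import os
    import subprocess

    SRC = '/srv/upload'                 # hypothetical upload folder
    DEST = 'backuphost:/srv/incoming/'  # hypothetical rsync destination

    for name in sorted(os.listdir(SRC)):
        path = os.path.join(SRC, name)
        if not os.path.isfile(path):
            continue
        subprocess.run(['rsync', '-a', '--remove-source-files', path, DEST],
                       check=True)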

How to let humans and programs access the same file without stepping on each other's toes

Suppose I have a file, urls.txt, that contains a list of URLs I'm monitoring. My monitoring script edits that file occasionally, say, to indicate whether each URL is reachable. I'd like to also manually edit that file, to add to or change the list of URLs. How can I allow that such that I don't have to think about it when manually editing?
Here are some possible answers. What would you do?
Engage in hackery like having the program check for the lockfiles that vim or emacs create. Since this is just for me, this would actually work.
If the human edits always take precedence, just always have the human clobber the program's changes (e.g., ignore the editor's warning that the file has changed on disk). The program can then just redo its changes on its next loop. Still, changing the file while the user is editing it is not so nice.
Never let a human touch a file that a program makes ongoing modifications to. Rethink the design and have one file that only the human edits and another file that only the program edits.
Give the human a custom tool to edit the file that does the appropriate file locking. That could be as crude as locking the file and then launching an editor (see the sketch below), or a custom interface (perhaps a simple command-line interface) for inserting/changing/deleting entries in the file.
Use a database instead of a flat file and then the locking is all taken care of automatically.
(Note that I concocted the URL monitoring example to make this more concrete and because what I actually have in mind is perhaps too weird and distracting -- this question is strictly about how to let humans and programs both modify the same state file.)
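A rough sketch of the "lock the file, then launch an editor" option above, assuming Unix advisory locks; the monitoring script would have to take the same flock() on urls.txt before rewriting it, and the use of $EDITOR is an assumption:

    # Sketch: hold an advisory lock on urls.txt while a human edits it.
    # The monitor must flock() the same file before writing its changes.
    import fcntl
    import os
    import subprocess

    with open('urls.txt', 'r+') as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks while the monitor holds the lock
        try:
            subprocess.run([os.environ.get('EDITOR', 'vi'), 'urls.txt'],
                           check=True)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)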
I'd use a database since that's basically what you're going to have to build to achieve what you want. Why re-invent the wheel?
If a full-blown DBMS is too much of a load, split the data into two files and synchronize them periodically. Whether a URL is reachable doesn't sound like something the user would be changing, so it should not be editable by them.
During the synchronization process (which would have to lock out both the monitor and the user, although it could be a sub-function of the monitor), remove entries in the monitor file that aren't in the user file. Also, add to the monitor file the entries that have been added to the user file (and start monitoring them).
But, I'd go the database method with a special front-end for the user, since you can get relatively good light-weight databases nowadays.
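As an illustration of how light-weight that can be, a minimal sketch using SQLite; the table and column names are invented for the example, and SQLite serialises concurrent writers, so the monitor and a small front-end can share the same database file safely:

    # Sketch: the monitor and a human-facing front-end both talk to SQLite,
    # which handles the locking. Table and column names are invented here.
    import sqlite3

    conn = sqlite3.connect('urls.db')
    conn.execute('''CREATE TABLE IF NOT EXISTS urls (
                        url TEXT PRIMARY KEY,
                        reachable INTEGER)''')

    # Human-facing change: add a URL to monitor.
    with conn:
        conn.execute('INSERT OR IGNORE INTO urls (url) VALUES (?)',
                     ('http://example.com/',))

    # Monitor-facing change: record the latest reachability result.
    with conn:
        conn.execute('UPDATE urls SET reachable = ? WHERE url = ?',
                     (1, 'http://example.com/'))

    conn.close()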
Use a sensible version control system!
(Git would work well here).
That said, the nature of the problem implies that a real database would be best - and they will generally have either database-level, table-level, or row-level locking - but then put any scripts you need into version control.
I would go with option 3. In fact, I would have the program read the human-edited input file, and append the results of each query to a log file. In this way, you can also analyse the reachability of sites over time. You can also have the program maintain a file that indicates the current reachability state of each site in the input file, as a snapshot of the current state.
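A minimal sketch of that split, where the human owns urls.txt and the program owns its own log; the file names and the check_url() helper are assumptions:

    # Sketch: the human edits urls.txt; the program only reads it and appends
    # results to a separate log file it alone writes. Names are assumptions.
    import time
    import urllib.request

    def check_url(url):
        try:
            urllib.request.urlopen(url, timeout=10)
            return True
        except Exception:
            return False

    with open('urls.txt') as f:                 # human-owned file
        urls = [line.strip() for line in f if line.strip()]

    with open('reachability.log', 'a') as log:  # program-owned file
        for url in urls:
            log.write('%s %s %s\n' % (time.strftime('%Y-%m-%dT%H:%M:%S'),
                                      url, check_url(url)))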
One other option is using two files, one for automated access and one for manual. You'd need a way in the user file to indicate modifications or deletions but you'd have similar problems in some of the other solutions as well.

What's the best way to sync large amounts of data around the world?

I have a great deal of data to keep synchronized over 4 or 5 sites around the world, around half a terabyte at each site. This changes (either adds or changes) by around 1.4 Gigabytes per day, and the data can change at any of the four sites.
A large percentage (30%) of the data consists of duplicate packages (perhaps packaged-up JDKs), so the solution would have to include a way of noticing that such things are already lying around on the local machine and grabbing them there instead of downloading them from another site.
Version control is not an issue; this is not a codebase per se.
I'm just interested if there are any solutions out there (preferably open-source) that get close to such a thing?
My baby script using rsync doesn't cut the mustard any more, I'd like to do more complex, intelligent synchronization.
Thanks
Edit : This should be UNIX based :)
Have you tried Unison?
I've had good results with it. It's basically a smarter rsync, which may be what you want. There is a listing comparing file syncing tools here.
Sounds like a job for BitTorrent.
For each new file at each site, create a BitTorrent seed file and put it into a centralized, web-accessible dir.
Each site then downloads (via BitTorrent) all the files. This will get you bandwidth sharing and automatic reuse of local copies.
The actual recipe will depend on your needs.
For example, you can create 1 bittorrent seed for each file on each host, and set modification time of the seed file to be the same as the modification time of the file itself. Since you'll be doing it daily (hourly?) it's better to use something like "make" to (re-)create seed files only for new or updated files.
Then you copy all seed files from all hosts to the centralized location ("tracker dir") with option "overwrite only if newer". This gets you a set of torrent seeds for all newest copies of all files.
Then each host downloads all seed files (again, with the "overwrite only if newer" setting) and starts a BitTorrent download on all of them. This will download/redownload all the new/updated files.
Rinse and repeat, daily.
BTW, there will be no "downloading from itself", as you said in the comment. If a file is already present on the local host, its checksum will be verified and no downloading will take place.
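A rough sketch of the make-style step above (re-create seed files only for new or updated files); transmission-create is used here only as one example of a torrent-creation tool, and the directory names and tracker URL are assumptions:

    # Sketch: regenerate a .torrent seed only when the data file is newer,
    # and mirror the file's mtime onto the seed so the comparison (and the
    # "overwrite only if newer" copy) keeps working. Flat seed naming here;
    # a real script would mirror the directory layout.
    import os
    import subprocess

    DATA_DIR = '/data/site'                           # hypothetical data dir
    SEED_DIR = '/data/seeds'                          # hypothetical seed dir
    TRACKER = 'http://tracker.example.com/announce'   # hypothetical tracker

    for root, _, files in os.walk(DATA_DIR):
        for name in files:
            path = os.path.join(root, name)
            seed = os.path.join(SEED_DIR, name + '.torrent')
            if (os.path.exists(seed)
                    and os.path.getmtime(seed) >= os.path.getmtime(path)):
                continue
            subprocess.run(['transmission-create', '-o', seed, '-t', TRACKER,
                            path], check=True)
            st = os.stat(path)
            os.utime(seed, (st.st_atime, st.st_mtime))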
How about something along the lines of Red Hat's Global Filesystem, so that the whole structure is split across every site onto multiple devices, rather than having it all replicated at each location?
Or perhaps a commercial network storage system such as from LeftHand Networks (disclaimer - I have no idea on cost, and haven't used them).
You have a lot of options:
You can try setting up a replicated database to store the data.
Use a combination of rsync or lftp and custom scripts, but that doesn't seem to suit you.
Use git repos with maximum compression and sync between them using some scripts.
Since the amount of data is rather large, and probably important, either do some custom development or hire an expert ;)
Check out Super Flexible... it's pretty cool. I haven't used it in a large-scale environment, but on a 3-node system it seemed to work perfectly.
Sounds like a job for Foldershare
Have you tried the detect-renamed patch for rsync (http://samba.anu.edu.au/ftp/rsync/dev/patches/detect-renamed.diff)? I haven't tried it myself, but I wonder whether it will detect not just renamed but also duplicated files. If it won't detect duplicated files, then, I guess, it might be possible to modify the patch to do so.
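If the patch turns out not to handle duplicates, a local pass that indexes files by content hash would at least identify them (the ~30% of duplicate packages mentioned above) so they can be copied or hard-linked locally instead of fetched again over the WAN; a sketch, with an assumed data directory:

    # Sketch: group local files by SHA-256 so exact duplicates (e.g. identical
    # packaged-up JDKs) can be reused locally. The data root is an assumption.
    import hashlib
    import os

    def sha256_of(path, bufsize=1 << 20):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(bufsize)
                if not chunk:
                    break
                h.update(chunk)
        return h.hexdigest()

    by_hash = {}
    for root, _, files in os.walk('/data/site'):   # hypothetical data root
        for name in files:
            path = os.path.join(root, name)
            by_hash.setdefault(sha256_of(path), []).append(path)

    for digest, paths in by_hash.items():
        if len(paths) > 1:
            print(digest, paths)                   # candidates for local reuse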

Is there a way to tell what files are not being used in a web application project

I have a project with literally thousands of image files that aren't being used. The main problem is that they are intermixed with images that are.
Is there a way to get a list of all project artifacts which aren't referenced?
EDIT: Assuming I don't have access to the web logs... Is there an option?
Basically, no, there isn't a straightforward way that always works. Image references could be built from user input or other context, so spidering your website means you would have to execute all code paths; otherwise you might throw away images that you actually need.
But now for the specific case of Chris, you could use multiple approaches:
1) search for each image's occurrences in your code (maybe automate this with Visual Studio plug-ins or so)
2) remove everything and start browsing your website, adding back all images that are not found (this depends on the ratio of unused images versus used images)
3) search your code for all occurrences of .png, .jpg, .gif (and so on), keep those images, and throw everything else away
...
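A rough sketch combining the first and third approaches: collect image names, scan the source tree for them, and report the ones that never appear. The directory name and extension lists are assumptions, and dynamically built image names (e.g. "icon_" + state + ".png") will be missed:

    # Sketch: report images whose file names never appear in any source file.
    # Paths and extension lists are assumptions about the project layout.
    import os

    project = '/path/to/project'           # hypothetical project root
    image_exts = ('.png', '.jpg', '.jpeg', '.gif')
    source_exts = ('.aspx', '.ascx', '.cs', '.css', '.js', '.html')

    images, sources = [], []
    for root, _, files in os.walk(project):
        for name in files:
            path = os.path.join(root, name)
            if name.lower().endswith(image_exts):
                images.append(name)
            elif name.lower().endswith(source_exts):
                sources.append(path)

    blob = ''
    for path in sources:
        with open(path, errors='ignore') as f:
            blob += f.read()

    for name in images:
        if name not in blob:
            print('possibly unused:', name)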
Another approach -
Assuming all the image files are under one folder, try renaming the folder. The warnings in Visual Studio will tell you the files you need. :)
Access your web server logs, parse them for GETs of the desired file pattern, unique them, then compare them against your reference list.
Or, look at the file access dates (you may need to turn this feature on if you are the sysop).
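If you do have the logs, a rough sketch of that approach; the log path, web root, and the assumption of a common-log-format access log are all placeholders:

    # Sketch: collect every image path actually requested, then diff against
    # what is on disk. Log path and web root are assumptions.
    import os
    import re

    requested = set()
    with open('/var/log/apache2/access.log') as log:   # hypothetical log path
        for line in log:
            m = re.search(r'"GET (\S+\.(?:png|jpe?g|gif))', line, re.IGNORECASE)
            if m:
                requested.add(m.group(1).lstrip('/'))

    on_disk = set()
    webroot = '/var/www/site'                          # hypothetical web root
    for root, _, files in os.walk(webroot):
        for name in files:
            if name.lower().endswith(('.png', '.jpg', '.jpeg', '.gif')):
                on_disk.add(os.path.relpath(os.path.join(root, name), webroot))

    for unused in sorted(on_disk - requested):
        print(unused)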
This was from a previous post.
At a file level:
Use wget to aggressively spider the site and then process the HTTP server logs to get the list of files accessed; diff this with the files in the site:

diff \
    <(sed some_rules httpd_log | sort -u) \
    <(ls /var/www/whatever | sort -u) \
    | grep something
