Shark external table performance - bigdata

How does querying from an external table in Shark located on the local filesystem compare to using data located on HDFS in terms of query performance? I plan to use a single high-end server for running Shark queries and was wondering if it's absolutely necessary to install Hadoop/HDFS.

Generally, if you already intend to run on a single high-end server, there's no need to set up HDFS. In that case you should actually see somewhat better performance than with HDFS installed on a single machine, since you avoid the overhead of extra round-trips to localhost just to fetch file metadata, and the extra indirection of HDFS mapping files onto a series of opaque blocks which are themselves files on your local filesystem.
Note that you'll still automatically benefit from Shark going through Hadoop's RawLocalFileSystem (the default "Hadoop filesystem" loaded when HDFS is not explicitly set up), so Shark will effectively behave as if it were talking to an HDFS equivalent. This means that if you later do need to run on a distributed cluster, it should be a simple matter of changing fs.default.name, and everything else will work the same way you're used to on a single-machine setup.
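For reference, a minimal sketch of what that one-line switch might look like in Hadoop's core-site.xml (the namenode.example.com host and port are placeholders, not values from the question):

<!-- single-machine setup: fs.default.name points at the plain local filesystem -->
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
</property>

<!-- later, when moving to a real cluster, the same property points at the NameNode -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>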

Related

What are the pros and cons of locking the actual file vs an empty lock file?

My program is writing to a binary file, and there could be multiple instances of the program accessing the same binary file for the same user. In Unix/Linux, I see some programs (particularly daemon processes) locking an empty lock file instead of the actual shared data that needs to be locked (so instead of locking ~/.data/foo they lock ~/.data/foo.lck). What are the pros and cons of locking the actual file vs an empty lock file?
flock is not supported over NFS or other network filesystems on all versions of Unix (Linux didn't even support it there until 2.6.12). On the other hand, O_CREAT|O_EXCL is much more reliable over many more filesystems, and has been for much longer.
Even on systems that do support flock on network filesystems (or cases where you don't need that flexibility), O_CREAT|O_EXCL together with flock is very useful because it distinguishes between a clean shutdown and a non-clean shutdown. flock helpfully goes away automatically, but it also, unhelpfully, doesn't distinguish why it went away.
flocking the file itself prevents atomic replacement (write a copy, erase the old file, rename), or any other scheme where you might erase the existing file. In other words, "the actual file" doesn't always have the same inode over the entire run of the program, so a separate lock file is much more convenient in those cases as well. This is very common in those foo.lck setups, because often you're locking foo for a short period of time and might erase it in the process.
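For what it's worth, here is a minimal sketch in C of the separate-lock-file approach, combining the two mechanisms discussed above; the /tmp/foo.lck path is a placeholder and error handling is pared down to the essentials:

/* Sketch: take an exclusive advisory lock on a separate lock file,
 * rather than on the data file itself. Paths are placeholders. */
#include <sys/file.h>   /* flock */
#include <fcntl.h>      /* open, O_CREAT, O_EXCL */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* O_CREAT|O_EXCL fails if the lock file already exists, so a
     * leftover file from a crash is detectable (an unclean shutdown). */
    int fd = open("/tmp/foo.lck", O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) {
        perror("lock file already exists (or cannot be created)");
        return 1;
    }

    /* flock is released automatically when the process exits, clean or not. */
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        perror("flock");
        return 1;
    }

    /* ... work on the real data file here; it can be atomically
     * replaced (write copy, rename) without disturbing the lock ... */

    flock(fd, LOCK_UN);
    close(fd);
    unlink("/tmp/foo.lck");   /* clean shutdown: remove the marker */
    return 0;
}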
I see three cons of an empty lock file:
The directory's permissions must allow you to create a file.
In case of disk-space issues, creating it might fail.
If your program crashes, the lock file is still present.
I see one con of modifying the actual file's name:
If your program crashes, your file has been altered (only the filename, but it might cause confusion).
Obviously, I see one big advantage of the empty lock file:
your original file does not change at all.
By the way, I believe this question is better suited for the SoftwareEngineering community.

I can't pack my Data.fs because it's too large (more than 500GB)

Unfortunately, I have a ZODB Data.fs of more than 500GB in my Plone site (Plone 5.05), so I have no way to use bin/zeopack to pack it, and it's seriously affecting performance. What should I do?
I assume you're running out of space on the volume containing your data.
First, try turning off pack-keep-old in your zeoserver settings:
[zeoserver]
recipe = plone.recipe.zeoserver
...
pack-keep-old = false
This will disable the creation of a .old copy of the Data.fs file and matching blobs. That may allow you to complete your pack.
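As a rough sketch, assuming a standard buildout layout where plone.recipe.zeoserver generates the pack script for you:

bin/buildout   # regenerate the ZEO scripts/config with pack-keep-old = false
bin/zeopack    # pack the storage; no Data.fs.old copy will be kept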
Alternatively, create a matching Zope/Plone install on a separate machine or volume with more storage and copy over the data files. Run zeopack there. Copy the now-packed storage back.

Can you use rsync to replicate block changes in a Berkeley DB file?

I have a Berkeley DB file that is quite large (~1GB) and I'd like to replicate small changes that occur (weekly) to an alternate location without having the entire file be re-written at the target location.
Does rsync handle Berkeley DBs properly with its block-level algorithm?
Does anyone have an alternative that only writes the changes to the Berkeley DB files that are the targets of replication?
Thanks!
Rsync handles files perfectly well at the block level. The problems with databases can come into play in a number of ways:
Caching
File locking
Synchronization/transaction logs
If you can ensure that no application has the Berkeley DB open during the rsync, then rsync should work fine and offer a significant advantage over copying the entire file. However, depending on the configuration and version of BDB, there are transaction logs to consider. You probably want to investigate the same mechanisms BDB provides for backups and hot backups. It also has a "snapshot" feature that might better facilitate a working solution.
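If you do go the rsync route, here is a sketch of the sort of invocation that avoids rewriting the whole file on the target (paths and host are placeholders; by default rsync writes a temporary copy on the receiver and renames it into place, so --inplace is what limits the writes to the changed blocks):

# placeholders; run only while nothing has the database open
rsync -av --inplace /data/mydb.bdb backupuser@remote-host:/data/mydb.bdb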
You should probably read this carefully: http://www.cs.sunysb.edu/documentation/BerkeleyDB/ref/transapp/archival.html
I'd also recommend you consider using replication as an alternative solution that is blessed by BDB https://idlebox.net/2010/apidocs/db-5.1.19.zip/programmer_reference/rep.html
They now call this High Availability -> http://www.oracle.com/technetwork/database/berkeleydb/overview/high-availability-099050.html

Is it better to execute a file over the network or copy it locally first?

My WinForms app needs to run an executable that's sitting on a share. The exe is about 50MB (it's a setup.exe type of file). My app will run on many different machines/networks with varying speeds (some fast, but some awfully slow, barely reaching 10BaseT speeds).
Is it better to execute the file straight from the share, or is it more efficient to copy it locally and then execute it? I'm asking in terms of annoying the user the least.
Locally is better. A copy will read each byte of the file a single time, no more, no less. As you execute, you may revisit code that has fallen out of cache, and it gets pulled over the network again.
As a setup program, I would assume that the engine will want to do some kind of CRC or other integrity check too, which means it's reading the entire file anyway.
It is always better to execute it locally than to run it over the network.
If your application is small and does not need to load many different resources at runtime, then it is OK to run it over the network. It might even be preferable, because when you run it over the network the code is read (downloaded and loaded into memory) once, as opposed to manually downloading the file and then running it, which reads the code twice. For example, you could run a clock-widget application over the network.
On the other hand, if your application does read a lot of resources at runtime, then it is absolutely a bad idea to run it over the network, because each read of a resource goes over the network, which is very slow. For example, you probably don't want to be running Eclipse over the network.
Another factor to take into consideration is how many concurrent users will be accessing the application at the same time. If there are many, you should copy the application locally and run it from there.
I believe the OS always copies the file to a local temp folder before it is actually executed. There are no round trips to/from the network after it gets a copy; it only happens once. This is sort of like how a browser works: it first retrieves the file, saves it locally, then runs it off of the local temp location where it saved it. In other words, there is no need to copy it manually unless you want to keep a copy for yourself.

What's the best way to sync large amounts of data around the world?

I have a great deal of data to keep synchronized over 4 or 5 sites around the world, around half a terabyte at each site. This changes (either adds or changes) by around 1.4 Gigabytes per day, and the data can change at any of the four sites.
A large percentage (30%) of the data consists of duplicate packages (perhaps packaged-up JDKs), so the solution would have to include a way of noticing that such things are already lying around on the local machine and grabbing them there instead of downloading them from another site.
Version control is not an issue; this is not a codebase per se.
I'm just interested if there are any solutions out there (preferably open-source) that get close to such a thing?
My baby script using rsync doesn't cut the mustard any more; I'd like to do more complex, intelligent synchronization.
Thanks
Edit : This should be UNIX based :)
Have you tried Unison?
I've had good results with it. It's basically a smarter rsync, which maybe is what you want. There is a listing comparing file syncing tools here.
Sounds like a job for BitTorrent.
For each new file at each site, create a BitTorrent seed file and put it into a centralized, web-accessible dir.
Each site then downloads (via BitTorrent) all the files. This will get you bandwidth sharing and automatic local-copy reuse.
The actual recipe will depend on your needs.
For example, you can create one BitTorrent seed for each file on each host, and set the modification time of the seed file to match the modification time of the file itself. Since you'll be doing this daily (hourly?), it's better to use something like "make" to (re-)create seed files only for new or updated files.
Then you copy all the seed files from all the hosts to the centralized location ("tracker dir") with the "overwrite only if newer" option. This gets you a set of torrent seeds for the newest copies of all files.
Then each host downloads all the seed files (again with the "overwrite only if newer" setting) and starts a BitTorrent download on all of them. This will download/re-download all the new/updated files.
Rinse and repeat, daily.
BTW, there will be no "downloading from itself", as you said in the comment. If a file is already present on the local host, its checksum will be verified and no downloading will take place.
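A rough sketch of the seed-file shuffle described above (tracker-host and the directory names are placeholders; creating the .torrent files themselves depends on whichever tool you use, e.g. mktorrent):

# push locally generated .torrent seeds to the central "tracker dir",
# overwriting only if newer (--update skips files that are newer on the receiver)
rsync -av --update seeds/ tracker-host:/srv/seeds/
# pull everyone's seeds back down the same way, then point your
# bittorrent client at the seeds/ directory to fetch the actual data
rsync -av --update tracker-host:/srv/seeds/ seeds/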
How about something along the lines of Red Hat's Global Filesystem, so that the whole structure is split across every site onto multiple devices, rather than having it all replicated at each location?
Or perhaps a commercial network storage system such as from LeftHand Networks (disclaimer - I have no idea on cost, and haven't used them).
You have a lot of options:
You can try setting up a replicated DB to store the data.
Use a combination of rsync or lftp and custom scripts, though that doesn't seem to suit you.
Use git repos with maximum compression and sync between them using some scripts.
Since the amount of data is rather large, and probably important, either do some custom development or hire an expert ;)
Check out Super Flexible... it's pretty cool. I haven't used it in a large-scale environment, but on a 3-node system it seemed to work perfectly.
Sounds like a job for Foldershare
Have you tried the detect-renamed patch for rsync (http://samba.anu.edu.au/ftp/rsync/dev/patches/detect-renamed.diff)? I haven't tried it myself, but I wonder whether it will detect not just renamed but also duplicated files. If it won't detect duplicated files, then, I guess, it might be possible to modify the patch to do so.
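In case it helps, the patch has to be applied to the rsync sources and rsync rebuilt; a sketch of the usual routine (the resulting --detect-renamed option name is an assumption based on the patch's name, so double-check against the patch itself):

# from an unpacked rsync source tree matching the patch's version
patch -p1 < detect-renamed.diff
./configure && make
# the rebuilt rsync is then expected to accept the new option
./rsync -av --detect-renamed /src/ remote-host:/dst/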
