I downloaded the latest data dump from Freebase - it is a 22 GB gzip file. However, the archive appears to contain only one file inside, which is 1.6 GB.
Specifically, when I import the compressed .gz file with Apache Jena (tdbloader), the data is incomplete. George Clooney is missing from the database.
EDIT: Here's what I see when I inspect the dump:
You can't tell how big the uncompressed file is using gzip --list because it's buggy (and documented as such on its man page).
http://www.freebsd.org/cgi/man.cgi?query=gzip#end
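One way to see the real uncompressed size (a quick sketch; the file name is just an example) is to stream the archive through gzip and count the bytes, since the size field in the gzip header is only 32 bits and wraps around for anything over 4 GB:
gzip -dc freebase-rdf-latest.gz | wc -c
This takes a while on a 22 GB archive, but it reports the true size rather than the wrapped value shown by --list.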
Like Tom Morris said,
You can't tell how big the uncompressed file is using gzip --list because it's buggy (and documented as such on its man page). http://www.freebsd.org/cgi/man.cgi?query=gzip#end
The problem is that Apache Jena relies on the gzip size information to know when to stop importing the file into the DB. The Freebase website recommends not unzipping the archive; however, because of this bug you actually have to, otherwise you end up with an incomplete database. I will keep this question up, because someone else might find this info useful.
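In practice that means decompressing the dump completely and pointing tdbloader at the plain file. A minimal sketch, assuming the dump is called freebase-rdf-latest.gz and giving it an .nt extension so Jena picks the N-Triples parser by suffix (adjust the names and the --loc directory to your setup):
gunzip -c freebase-rdf-latest.gz > freebase-rdf-latest.nt
tdbloader --loc=/path/to/tdb freebase-rdf-latest.nt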
Is there a way to look back at previous code in a file? For instance, could I revert to an earlier saved version, or somehow see the changes the file's code went through?
Short answer: no.
Unless your changes are tracked in some sort of source control (like git) or you keep backups of your files, there is no way to see previous versions of a script file. An .R script is just a plain text file; it does not store its own history, just as other documents or images on your computer don't. Sorry.
If that's something you want to do in the future, RStudio makes it easy to integrate with git. Check out this guide: Connect RStudio to Git and GitHub.
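For completeness, here is roughly what that workflow looks like from the command line (RStudio's Git pane does the same things with buttons; analysis.R is just a placeholder name):
git init
git add analysis.R
git commit -m "first version of analysis.R"
# later: see the history and what changed
git log --oneline -- analysis.R
git diff HEAD~1 -- analysis.R
# restore the file as it was in an earlier commit (put a real hash in place of <hash>)
git checkout <hash> -- analysis.R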
I'm working on some IDL code that retrieves Unix-compressed (.Z) data files through FTP. I know IDL can work with .gz files via the /COMPRESS keyword, but it doesn't seem capable of playing nicely with .Z compression.
What are my options for working with these files? The files I am downloading come from another institution, so I have no control over the compression being used. Downloading and decompressing the files manually before running the code is an absolute last resort: I don't always know in advance which files I need from the FTP site, so the code grabs the ones it needs based on the parameters at run time.
I'm currently running on Windows 7 but once the code is finished it will be used on a Unix system as well (computer cluster).
You can use SPAWN as you note in your comment (assuming you can find an equivalent of the Unix uncompress command that runs on Windows), or, for higher speed, you can use an external C function with CALL_EXTERNAL to do the decompression. Just by coincidence, I posted an answer on Stack Exchange the other day with just such a C function to decompress .Z files here.
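As a side note, gzip itself can decompress LZW .Z files, so if you can get a gzip build for Windows, the command you SPAWN (something along the lines of SPAWN, 'gzip -d ' + filename from IDL) could be as simple as the following sketch, where datafile.Z is a placeholder:
gzip -d datafile.Z
# or keep the original and write the decompressed copy elsewhere:
gzip -dc datafile.Z > datafile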
Does anybody know if there is a way to configure IExpress (presumably via the SED file) not to compress the files it builds into an installer package? The reason is that the files I'm packaging are already compressed (except for setup.exe, which is very small), so the extra compression only adds to the build time without saving any additional space.
I have seen on this SED Overview that there are some options to control compression type. I have tried various configurations, but none of them seem to make a difference. The IExpress build process uses the Microsoft makecab utility, and it doesn't appear to pass the correct parameters to makecab when the SED file specifies NONE for CompressionType.
According to MSDN there is a way to disable compression in cabinet files. I just need to figure out how to tell IExpress to do it.
As an aside, another motivation for disabling this compression is that I've noticed Microsoft Security Essentials seems to take particular interest in IExpress Packages. It appears to decompress them to scan the contents whenever the file is copied, which can take a significant amount of time on a 100MB package. I was thinking that the scanning might go quicker if it didn't have to decompress the package first.
I built a .sed file with IExpress, then added
Compress=0
just before the line InsideCompressed=0. Seems to work!
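For reference, a sketch of where that line ends up in the SED file (the [Options] section name is what a typical IExpress-generated SED uses; all the other generated keys stay as they are):
[Options]
Compress=0
InsideCompressed=0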
I backed up a large number of files to S3 from a PC before switching to a Mac several months ago. Now that I'm trying to open the files, I've realized they were all compressed by the S3 GUI tool I used, so I cannot open them.
I can't remember what program I used to upload the files, and the standard decompression commands from the command line are not working, e.g.,
unzip
bunzip2
tar -zxvf
How can I determine what the compression type is of the file? Alternatively, what other decompression techniques can I try?
PS - I know the files are not corrupted because I tested downloading and opening them back when I originally uploaded to S3.
You can use Universal Extractor (open source) to determine compression types.
Here is a link: http://legroom.net/software/uniextract/
The small downside is that it looks at the file extension first, but I've managed to change the extension myself for an unknown file and it works almost always, e.g. .rar or .exe.
EDIT:
I found a huge list of archive programs; maybe one of them will work? It's ridiculously big:
http://www.maximumcompression.com/data/summary_mf.php
http://www.maximumcompression.com/index.html
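It can also help to identify the format from the file's magic bytes, which ignores the extension entirely. A quick sketch (mystery-backup is a placeholder name):
file mystery-backup
# or inspect the first bytes yourself:
# 1f 8b = gzip, 42 5a 68 ("BZh") = bzip2, 50 4b 03 04 = zip, 1f 9d = compress (.Z)
xxd -l 8 mystery-backup
Once you know the format, you can pick the matching decompression tool instead of guessing.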
Having read this past question about git, I would like to ask whether something like that exists, but which:
can be done programmatically (file list) on each machine;
works for Mercurial.
The reason for this is that I would like to include in my public dotfiles repository some configuration files that store passwords in plaintext. I know I could write a wrapper script around hg(1), but I would like to know whether there are alternative approaches, just for the sake of curiosity.
Thank you.
You could use a pair of pre-commit and post-update hooks to encrypt/decrypt as necessary. See http://hgbook.red-bean.com/read/handling-repository-events-with-hooks.html for more details.
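A rough sketch of what that could look like in the repository's .hg/hgrc (the names after the dot are arbitrary labels, and the two scripts are hypothetical wrappers around whatever encryption tool you use):
[hooks]
precommit.encrypt = ./encrypt-dotfiles.sh
update.decrypt = ./decrypt-dotfiles.sh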
However, it's worth pointing out that if you're storing encrypted text in your repo you'll be unable to create meaningful diffs -- essentially everything will be like a binary file but also poorly compressible.
Mercurial has a filter system that lets you mangle files when they are read from the repository or written back. If you have a program like the SSH agent running that lets you do non-interactive encryption and decryption, then this might just be workable.
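Concretely, that means an [encode]/[decode] pair in your hgrc, something like the sketch below: encode runs when a matching file is written into the repository, decode when it is checked back out. The file pattern and the use of gpg here are just assumptions, and non-interactive use presumes an agent or preset passphrase as mentioned above.
[encode]
dotfiles/secrets.conf = pipe: gpg --batch --yes --encrypt --recipient you@example.com
[decode]
dotfiles/secrets.conf = pipe: gpg --batch --yes --decrypt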
As Ryan points out, this will necessarily lead to a bigger repository since each encrypted version of your files will look completely different from the previous version. Mercurial detects this and stores the versions uncompressed (encrypted files cannot be compressed anyway). Since you will use this for dotfiles, you can ignore the space overhead, but it's something to take into consideration if you will be versioning bigger files in encrypted form.
Please post to the Mercurial mailing list with your experiences so that other users can benefit from them too.