R: Generic Function to Uncompress Files

I need to read multiple compressed files with different formats of compression. I do not wish to manually uncompress all the files. I would like R to handle the uncompression and reading independent of the compression format. This is where I'm stuck.
I could construct a function with a switch-case sort of structure (zip - unzip, gz - gzfile, etc.), but I would like to know if there is already some function that can uncompress files irrespective of the compression format.
Any suggestions are appreciated. Thanks a lot!
PS:
I found out that read.table can read (some, if not all) compressed files. However, I've been inching towards data.table::fread (because it is much faster), and that does not seem to be able to read compressed files (http://r.789695.n4.nabble.com/fread-on-gzipped-files-td4663116.html - yet?). I would prefer temporarily uncompressing and using fread rather than using read.table.

Then here's an upvote :-)
Btw I don't think there is a generic "uncompress" function that does the magic for you (like in any of the shell languages). The options may simply be too broad -- but I suspect you would cover 80% of the cases with zip/tar/rar.
Just write a simple uncompress <- function(type = c("zip", "tgz", "tar", "arj :-)))")) {...} as was your original intention.
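For what it's worth, a rough sketch of that switch-based idea, assuming the format can be guessed from the file extension and that the archive contains a single delimited text file; this is an illustration, not a tested implementation:

read_compressed <- function(path) {
  # unpack into a throwaway directory
  exdir <- tempfile("unpacked")
  dir.create(exdir)
  ext <- tolower(tools::file_ext(path))

  plain <- switch(ext,
    zip = unzip(path, exdir = exdir)[1],
    gz = ,
    bz2 = {
      con <- if (ext == "gz") gzfile(path, open = "rt") else bzfile(path, open = "rt")
      out <- file.path(exdir, sub("\\.(gz|bz2)$", "", basename(path)))
      writeLines(readLines(con), out)   # decompress to a plain text file
      close(con)
      out
    },
    stop("Unsupported compression format: ", ext)
  )

  data.table::fread(plain)
}

# dt <- read_compressed("data.csv.gz")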

Related

(How) can I merge audio files in R using the av package?

I have a bunch of audio files, which I extracted from mp4 video files using the av package. Now I want to merge all the audio files into one long output mp3.
My question: Is there a way to merge audio files in R using the av package?
I.e. when having a vector of file paths/names such as
files <- c("file1.mp3", "file2.mp3", "file3.mp3")
I am looking for a function or concise workaround within R that could handle this, maybe similar to:
av_function_that_should_exist_already(files, output = "big_fat_file.mp3")
Note 1: I do not want to paste an ffmpeg command into the terminal. If I wanted to use the terminal or some script, I could have done that. What I would like to do is solve this completely within R, preferably using av. (I want to avoid pulling in yet another library and reworking my whole code into a library mixtape when everything else already works just fine.)
Note 2: I have already checked this post: How to concatenate multiple .wav files from a list in R? I am specifically asking about av in this question, preferably not about other packages.
So, I just want to know if this is possible or not (and if maybe I'm just not seeing it). I haven't found anything in the documentation, which is mostly about converting audio and video files, not about concatenating audio or video files such as mp3 or aac.
I was thinking that this should be possible using something like:
av_audio_convert(files, output = "big_fat_file.mp3")
However, this just leads to "file1.mp3" being written to "big_fat_file.mp3" in this example: given a vector of file names, only the first element gets processed by av_audio_convert.
Thanks for your help and ideas in advance,
Cat

blown up .sav file size using haven::write_sav()

I am writing SPSS .sav files from R using the package haven, which works very well for me in general. However, I have noticed that the .sav file size written to disk using write_sav() seems to be much bigger than necessary. Whenever I open and re-save a .sav file written by write_sav() in SPSS, the file size is reduced by a factor of up to ~10!
This matters to me as I am writing rather big data to SPSS for others, and sometimes SPSS refuses to open a very big file. Maybe this problem would not arise if write_sav() stored the data more efficiently, in a "real" native SPSS way?
Does anyone know this issue and maybe has a helpful comment on it?
An SPSS installation is needed to replicate this issue.
It's not clear from the Haven write_sav() documentation, but it sounds like it is saving them as uncompressed .sav files. The default for (most) SPSS installations would be to save as compressed files. SPSS has an extra compression option of 'zCompressed' which will produce even smaller files but these generally can't be opened outside of SPSS.
You can experiment with this like so:
Save outfile = 'Uncompressed file.sav'
/UnCompressed.
Save outfile = 'Compressed file.sav'
/Compressed.
Save outfile = 'ZCompressed file.zsav'
/ZCompressed.
Note the .zsav file extension isn't necessary (could be .sav) but it's considered best practice to use this to make it clear where compatibility might be an issue.
See https://www.ibm.com/support/knowledgecenter/en/SSLVMB_21.0.0/com.ibm.spss.statistics.help/syn_save_compressed_uncompressed.htm for more info.
What form does your actual data take? Is it Codepage or Unicode, and what is Haven doing? Since SPSS 16.0 and the introduction of the UNICODE setting, there has been a tripling of string field widths when converting from Codepage to Unicode. This is a pain best suffered only once. Get your data to Unicode and then stay there.
See https://www.ibm.com/support/knowledgecenter/SSLVMB_26.0.0/statistics_reference_project_ddita/spss/base/syn_set_unicode.html for more information.
If the output size is a problem, you could have a look at my package readspss. Using compression and zsav you should be able to get the best available compression. Compression in sav files depends on how the file is written. SPSS has different compression methods to store numeric information: numerics can be stored only as doubles (no compression) or as a mix of doubles and int8_t (compression 1). zsav uses zlib to compress whatever the initial input was (compression 2). Eight int8 values take the space of one double, hence the difference in file size.
There are three variants of the SPSS (.sav) file format:
Uncompressed (.sav). This is haven's default output, but is rarely used in my experience.
Compressed (.sav). This is what most people use, and it has been the default save format for SPSS for many, many years.
Zcompressed (.zsav, but sometimes .sav). Added to SPSS a few years ago, but doesn't seem to be used much. You can get this from haven by adding compress = TRUE to write_sav().
I have submitted a pull request to make the compressed (2) format the default.
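As a small illustration (not from the thread): comparing the file sizes that write_sav() produces with and without compression. The data frame here is made up, and depending on your haven version the compress argument takes TRUE/FALSE or one of "none"/"byte"/"zsav", so check ?write_sav for the form your installed version expects.

library(haven)

# toy data frame, just to have something sizeable to write
df <- data.frame(x = rnorm(1e5), grp = sample(letters, 1e5, replace = TRUE))

write_sav(df, "plain.sav")                          # older haven default: uncompressed
write_sav(df, "compressed.zsav", compress = TRUE)   # zsav (zlib) compression

file.size("plain.sav") / file.size("compressed.zsav")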

R How to read in a function

I'm currently implementing a tool in R and I got stuck on a problem. I already looked in the forums and didn't find anything.
I have many .csv files, which are somehow correlated with each other. The problem is I don't know yet how (this depends on the input of the user of the tool). Now I would like to read in a csv file that contains an arbitrary function f, e.g. f: a=b+max(c,d), and then the inputs, e.g. a="Final Sheet", b="Sheet1", c="Sheet2", d="Sheet3". (Maybe I didn't explain it very well; if so, I will upload a picture.)
Now my question is: can I somehow read that csv file in, such that I can later use the function f in the program? (Of course the given function has to be one that R knows.)
I hope you understand my problem and I would appreciate any help or idea!!
I would not combine data files with R source. Much easier to keep them separate. You put your functions in separate script files and then source() them as needed, and load your data with read.csv() etc.
"Keep It Simple" :-)
I am sure there's a contorted way of reading in the source code of a function from a text file and then eval()ing it somehow (a rough sketch follows), but I am not sure it would be worth the effort.
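For what it's worth, a rough sketch of that parse()/eval() route, assuming the csv stores the formula as a plain string like "b + max(c, d)"; the file name and column names here are hypothetical:

# "functions.csv" and its 'formula' column are made up for illustration
defs <- read.csv("functions.csv", stringsAsFactors = FALSE)

# turn the stored string into an unevaluated R expression
f_body <- parse(text = defs$formula[1])[[1]]

# wrap it in a function that supplies the inputs
f <- function(b, c, d) eval(f_body, list(b = b, c = c, d = d))

f(b = 1, c = 2, d = 3)   # evaluates 1 + max(2, 3), i.e. 4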

read.sas7bdat unable to read compressed file

I am trying to read a .sas7bdat file in R. When I use the command
library(sas7bdat)
read.sas7bdat("filename")
I get the following error:
Error in read.sas7bdat("county2.sas7bdat") : file contains compressed data
I do not have experience with SAS, so any help will be highly appreciated.
Thanks!
According to the sas7bdat vignette [vignette('sas7bdat')], COMPRESS=BINARY (or COMPRESS=YES) is not currently supported as of 2013 (and this was the vignette active on 6/16/2014 when I wrote this). COMPRESS=CHAR is supported.
These are basically internal compression routines, intended to make filesizes smaller. They're not as good as gz or similar (not nearly as good), but they're supported by SAS transparently while writing SAS programs. Obviously they change the file format significantly, hence the lack of implementation yet.
If you have SAS, you need to write these to an uncompressed dataset.
options compress=no;
libname lib '//drive/path/to/files';
data lib.want;
set lib.have;
run;
That's the simplest way (of many), assuming you have a libname defined as lib as above and change have and want to names that are correct (have should be the filename without extension of the file, in most cases; want can be changed to anything logical with A-Z or underscore only, and 32 or fewer characters).
If you don't have SAS, you'll have to ask your data provider to make the data available uncompressed, or in a different format. If you're getting this from a PUDS somewhere on the web, you might post where you're getting it from and there might be a way to help you identify an uncompressed source.
This admittedly is not a pure R solution, but in many situations (e.g. if you aren't on a pc and don't have the ability to write the SAS file yourself) the other solutions posted are not workable.
Fortunately, Python has a module (https://pypi.python.org/pypi/sas7bdat) which supports reading compressed SAS data sets - it's certainly better using this than needing to acquire SAS if you don't already have it. Once you extract the file and save it to text via Python, you can then access it in R.
from sas7bdat import SAS7BDAT
import pandas as pd

InFileName = "myfile.sas7bdat"
OutFileName = "myfile.txt"

with SAS7BDAT(InFileName) as f:
    df = f.to_data_frame()

df.to_csv(path_or_buf = OutFileName, sep = "\t", encoding = 'utf-8', index = False)
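Back in R, the tab-delimited file written by the snippet above can then be read in the usual way (the file name matches the OutFileName used above):

df <- read.delim("myfile.txt", stringsAsFactors = FALSE)
# or, faster, with data.table:
# df <- data.table::fread("myfile.txt")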
The haven package can read compressed SAS-files:
library(haven)
df <- read_sas("sasfile.sas7bdat")
But only SAS files compressed using compress=char, not compress=binary.
So haven will be able to read this SAS-file:
data output.compressed_data_char (compress=char);
set inputdata;
run;
But not this SAS-file:
data output.compressed_data_binary (compress=binary);
set inputdata;
run;
https://cran.r-project.org/package=haven
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001002773.htm
"RevoScaleR" is a good package to read SAS data sets (compressed or uncompressed).You can use rxImport function of this package. Below is the example
# Import the library
library(RevoScaleR)

# Read the data
R_df_name <- rxImport("fake_path/file_name.sas7bdat")
The speed of this function is far better than haven/sas7bdat/sas7bdat.parso. I hope this helps anyone who struggles to read SAS data sets in R.
Cheers!
I found R to be the easiest for this kind of challenge, especially with compressed sas7bdat files; three simple lines:
library(haven)
data <- read_sas("yourfile.sas7bdat")
and then transform it to csv
write.csv(data,"data.csv")

Is there a way to read and write in-memory files in R?

I am trying to use R to analyze large DNA sequence files (fastq files, several gigabytes each), but the standard R interface to these files (ShortRead) has to read the entire file at once. This doesn't fit in memory, so it causes an error. Is there any way that I can read a few (thousand) lines at a time, stuff them into an in-memory file, and then use ShortRead to read from that in-memory file?
I'm looking for something like Perl's IO::Scalar, for R.
I don’t know much about R, but have you had a look at the mmap package?
It looks like ShortRead is soon to add a "FastqStreamer" class that does what I want.
Well, I don't know about readFastq accepting something other than a file...
But if it can, or for other functions that accept connections, you can use the R function pipe() to open a unix connection and then combine the unix commands head and tail with some pipes.
For example, to get lines 91 to 100, you would use this:
head file.txt -n 100 | tail -n 10
So you can just read the file in chunks.
If you have to, you can always use these unix utilities to create a temporary file, then read that in with ShortRead. It's a pain, but if it can only take a file, at least it works.
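Wrapped in R, that shell pipeline would look something like this (a minimal sketch; the file name is just illustrative):

con <- pipe("head -n 100 file.txt | tail -n 10", open = "r")
chunk <- readLines(con)   # lines 91 to 100 of file.txt
close(con)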
Incidentally, the general answer to how to do an in-memory file in R (like Perl's IO::Scalar) is the textConnection function. Sadly, though, the ShortRead package cannot handle textConnection objects as inputs, so while the idea I described in the question (reading a file in small chunks into in-memory files that are then parsed bit by bit) is certainly possible for many applications, it does not work for my particular application, since ShortRead does not like textConnections. So the solution is the FastqStreamer class described above.
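For illustration, the textConnection() idea in miniature; the sample lines are made up:

lines <- c("id,value", "a,1", "b,2")
df <- read.csv(textConnection(lines))   # textConnection() behaves like an in-memory file
df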
