Blown-up .sav file size using haven::write_sav() - R

I am writing SPSS .sav files from R using the haven package, which generally works very well for me. However, I have noticed that the .sav files written to disk by write_sav() seem to be much bigger than necessary: whenever I open and re-save a file written by write_sav() in SPSS, the file size shrinks by a factor of up to ~10!
This matters to me because I am writing rather big data to SPSS for others, and sometimes SPSS refuses to open a very big file. Maybe this problem would not arise if write_sav() stored the data more efficiently, in a "real" native SPSS way?
Is anyone familiar with this issue, and does anyone have a helpful comment on it?
An SPSS installation is needed to replicate this issue.

It's not clear from the haven write_sav() documentation, but it sounds like it is saving the data as uncompressed .sav files. The default for (most) SPSS installations is to save compressed files. SPSS has an extra compression option, 'zCompressed', which produces even smaller files, but these generally can't be opened outside of SPSS.
You can experiment with this like so:
Save outfile = 'Uncompressed file.sav'
/UnCompressed.
Save outfile = 'Compressed file.sav'
/Compressed.
Save outfile = 'ZCompressed file.zsav'
/ZCompressed.
Note that the .zsav file extension isn't necessary (it could be .sav), but it's considered best practice to use it, to make clear where compatibility might be an issue.
See https://www.ibm.com/support/knowledgecenter/en/SSLVMB_21.0.0/com.ibm.spss.statistics.help/syn_save_compressed_uncompressed.htm for more info.

What form does your actual data take? Is it Codepage or Unicode, and what is haven doing? Since SPSS 16.0 and the introduction of the UNICODE setting, string field widths triple when converting from Codepage to Unicode. This is a pain best suffered only once: get your data to Unicode and then stay there.
See https://www.ibm.com/support/knowledgecenter/SSLVMB_26.0.0/statistics_reference_project_ddita/spss/base/syn_set_unicode.html for more information.

If the output size is a problem, you could have a look at my package readspss. Using compression and zsav, you should be able to get the best available compression. Compression in .sav files depends on how the file is written. SPSS has different methods for storing numeric information: numerics can be stored only as doubles (no compression) or as a mix of doubles and int8_t (compression 1). zsav uses zlib to compress whatever the initial input was (compression 2). Eight int8_t values fit in the space of one double, hence the difference in file size.
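For instance, a minimal sketch (the argument names here are assumptions based on the description above; check the readspss documentation for the actual interface):
# Sketch only: 'compress' is an assumed argument name, and mydata is a
# placeholder data frame; consult the readspss help for the real signature.
library(readspss)
write.sav(mydata, filepath = "data.zsav", compress = TRUE)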

There are three variants of the SPSS (.sav) file format:
Uncompressed (.sav). This is haven's default output, but it is rarely used in my experience.
Compressed (.sav). This is what most people use, and it has been the default save format for SPSS for many, many years.
zCompressed (.zsav, but sometimes .sav). This was added to SPSS a few years ago but doesn't seem to be used much. You can get it from haven by adding compress = TRUE to write_spss().
I have submitted a pull request to make the compressed (2) format the default.
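In the meantime, based on the interface described in this answer, writing the variants from haven looks roughly like this (a sketch; df stands in for your data frame):
library(haven)
write_sav(df, "data.sav")                   # haven's default: uncompressed
write_sav(df, "data.zsav", compress = TRUE) # the zCompressed variant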

Related

Converting a (probable) ENVI file to decimals using R or Excel

I got an output file from a spectrometer, which is supposed to contain a series of decimal numbers. The file looks like this:
™pQH1JHxþFH$ÏFH÷~EHa×BHäBHßdBH.#H²Ï=HL=HŒÚ<Hê‰:H­P:Hoõ9H¢Ž6Hº7H¨Y5H ?1H½¶.Hø²0HøŽ2H8æ.H.î,HŒt/H&1H͸0Hí.Hvî,H$ª+HµX+HCý*H·W+H!º+HP+HfØ(Hû'H†Ù'H|U(HQ`)Hn*H
})H'Hó%HÂ%H¶¨&H&H|•&H\
I have been reading a lot without getting to a solution. My silly question is: is that an ENVI file or an ASCII file? Or something else? How can I see the numbers I need to use? I tried some online converters without success.
The starting point would be to get these numbers so I can write R code to make graphs. Thanks a lot for your time.
It looks like you opened the binary file of the mass spectrometer. Almost all vendors keep their formats secret, so the only way forward is to export the data to an open format. Most vendors supply some kind of data analysis software, and export functions are often present. The most common open data formats are mzXML and mzML.
For converting, have a look at the msconvert program from ProteoWizard.
Once you have converted the data, one of the R packages you can start with is XCMS.
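For example, once msconvert has produced an open-format file, reading it with xcms might look roughly like this (a sketch; the file name is a placeholder):
# Assumes the vendor file was already converted outside R, e.g. with:
#   msconvert rawfile.raw --mzXML
library(xcms)
raw <- xcmsRaw("rawfile.mzXML") # read the converted open-format file
plotTIC(raw)                    # plot the total ion chromatogram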

How to convert .dat + .sps to .sav on command line

I get a lot of datasets that arrive as .dat files with syntax files for converting to SPSS (.sps). I'm an R user, so I need to convert the .dat file into a .sav that R can read.
In the past, I've used PSPP to do this manually. (I can't afford SPSS!) But I'd MUCH prefer a programmatic solution.
I thought pspp-convert would do the trick, but there's something I'm not understanding about how that works in terms of inputting the syntax file:
My files are:
data.dat
data.sps (which correctly points to data.dat)
I tried
pspp-convert data.sps data.sav
But get
`data.sps' is not a system or portable file.
Makes sense since the input is supposed to be a portable file. Am I trying to do something beyond the scope of this CLI?
Generally speaking, there MUST be some way to apply an SPS file to a DAT file to get a SAV file (or any other portable file) back, right?
From an SPSS Statistics point of view, a .dat file extension most often means the data is in a fixed ASCII text format. You would need the accompanying codebook to tell you what variables to read and in what formats. The SPSS Statistics command syntax file (.sps) does this for you. But this file is simply the list of SPSS Statistics commands used to read the ASCII data. It is not a data file itself.
Elsewhere you've referenced these files as "portable files". An SPSS Statistics portable file (.por) is a very special case of an ASCII file, structured to be read and written by SPSS Statistics. In any case, if your preferred tool takes an SPSS Statistics portable file (.por), these *.dat files likely aren't it.
Assuming these *.dat files are fixed ASCII text files, you'll need to discern how the information therein is stored and then use a likely tool for reading ASCII text.
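If PSPP is installed, one programmatic route is to run the syntax file in batch mode and then read the result back into R with haven. A minimal sketch, assuming data.sps itself ends with a SAVE OUTFILE command that writes data.sav:
# Run PSPP on the syntax file; pspp must be on the PATH, and data.sps must
# contain a line like: SAVE OUTFILE='data.sav'.
system2("pspp", args = "data.sps")
library(haven)
df <- read_sav("data.sav")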

Fixed Width EBCDIC Files in R

I'm trying to read some mainframe data encoded as EBCDIC into R, and am at a loss. I'd like to avoid using an external program to convert the files, since I'm operating in a corporate environment.
You can find the example files here, with both ASCII and EBCDIC versions. Note that there are no linebreaks in the EBCDIC versions of the file -- instead, I'd be specifying the width of each line manually. R has the IBM500 encoding available in my environment, which should be the correct one for these files.
However, when I run the following commands, R seems to fail entirely.
layout <- read.fwf("EBCDIC_LAYOUT", widths = c(80), fileEncoding='ibm500')
data <- read.fwf("EBCDIC_ZIPCODE", widths = c(32), fileEncoding='ibm500')
Where might I go from here?
Related -- some of the files I expect to use will be fairly large (1 GB or so). Preferably, I'd like a solution that scales reasonably well. (I tried packages like LaF, but they don't have the option to select encoding.)
Thank you very much!
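One manual route to try is to bypass read.fwf entirely: read the raw bytes, convert them with iconv(), and split the result into fixed-width records yourself. A rough sketch, assuming your build of R supports the IBM500 code page in iconv() as well:
# Read the whole file as raw bytes, convert EBCDIC -> UTF-8, then cut the
# resulting string into 32-character records (the width used above).
bytes <- readBin("EBCDIC_ZIPCODE", what = "raw", n = file.size("EBCDIC_ZIPCODE"))
txt <- iconv(list(bytes), from = "IBM500", to = "UTF-8")
starts <- seq(1, nchar(txt), by = 32)
records <- substring(txt, starts, starts + 31)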

read.sas7bdat unable to read compressed file

I am trying to read a .sas7bdat file in R. When I use the command
library(sas7bdat)
read.sas7bdat("filename")
I get the following error:
Error in read.sas7bdat("county2.sas7bdat") : file contains compressed data
I do not have experience with SAS, so any help will be highly appreciated.
Thanks!
According to the sas7bdat vignette [vignette('sas7bdat')], COMPRESS=BINARY (or COMPRESS=YES) is not currently supported as of 2013 (this was the vignette in effect on 6/16/2014, when I wrote this). COMPRESS=CHAR is supported.
These are basically internal compression routines intended to make file sizes smaller. They're not nearly as good as gz or similar, but SAS supports them transparently while writing SAS programs. Obviously they change the file format significantly, hence the lack of implementation so far.
If you have SAS, you need to re-write these to an uncompressed dataset:
options compress=no;
libname lib '//drive/path/to/files';
data lib.want;
  set lib.have;
run;
That's the simplest way (of many), assuming you have a libname defined as lib as above. Change have and want to the correct names: have should usually be the filename without its extension, and want can be anything logical using A-Z or underscores only, 32 or fewer characters.
If you don't have SAS, you'll have to ask your data provider to make the data available uncompressed, or in a different format. If you're getting this from a PUDS somewhere on the web, you might post where you're getting it from, and there might be a way to help you identify an uncompressed source.
This is admittedly not a pure R solution, but in many situations (e.g. if you aren't on a PC and don't have the ability to write the SAS file yourself) the other solutions posted are not workable.
Fortunately, Python has a module (https://pypi.python.org/pypi/sas7bdat) which supports reading compressed SAS data sets - it's certainly better to use this than to acquire SAS if you don't already have it. Once you extract the file and save it to text via Python, you can then access it in R.
from sas7bdat import SAS7BDAT
import pandas as pd

InFileName = "myfile.sas7bdat"
OutFileName = "myfile.txt"

with SAS7BDAT(InFileName) as f:
    df = f.to_data_frame()

df.to_csv(path_or_buf = OutFileName, sep = "\t", encoding = 'utf-8', index = False)
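Back in R, the tab-delimited file written above can then be read in the usual way:
# Read the tab-separated text file produced by the Python script.
df <- read.delim("myfile.txt", stringsAsFactors = FALSE)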
The haven package can read compressed SAS files:
library(haven)
df <- read_sas("sasfile.sas7bdat")
But it reads only SAS files compressed using compress=char, not compress=binary.
So haven will be able to read this SAS file:
data output.compressed_data_char (compress=char);
  set inputdata;
run;
But not this SAS file:
data output.compressed_data_binary (compress=binary);
  set inputdata;
run;
https://cran.r-project.org/package=haven
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001002773.htm
"RevoScaleR" is a good package to read SAS data sets (compressed or uncompressed).You can use rxImport function of this package. Below is the example
Importing library
library(RevoScaleR)
Reading data
R_df_name <- rxImport("fake_path/file_name.sas7bdat")
The speed of this function is far better than haven/sas7bdat/sas7bdat.parso. I hope this helps anyone who struggles to read SAS data sets in R.
Cheers!
I found R to be the easiest for this kind of challenge, especially with compressed sas7bdat files - three simple lines:
library(haven)
data <- read_sas("yourfile.sas7bdat")
and then transform it to CSV:
write.csv(data, "data.csv")

Reading large files into R

I am a newbie to R, but I am aware that it chokes on "big" files. I am trying to read a 200MB data file. I have tried it in CSV format and also converted it to tab-delimited txt, but in both cases I use up my 4GB of RAM before the file loads.
Is it normal that R would use 4GB of memory to load a 200MB file, or could there be something wrong with the file that causes R to keep reading a bunch of nothingness in addition to the data?
From ?read.table
Less memory will be used if colClasses is specified as one of the six atomic vector classes.
...
Using nrows, even as a mild over-estimate, will help memory usage.
Use both of these arguments.
Ensure that you properly specify numeric for your numeric data. See here: Specifying colClasses in the read.csv
And do not under-estimate nrows.
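For example (a sketch; the column classes and row count are placeholders to be replaced with values that match your file):
# One character column and two numeric columns, with nrows set to a mild
# over-estimate of the true row count.
df <- read.csv("bigfile.csv",
               colClasses = c("character", "numeric", "numeric"),
               nrows = 1200000)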
If you're running 64-bit R, you might try the 32-bit version. It will use less memory to hold the same data.
See here also: Extend memory size limit in R
