Exported file from R using `haven` cannot be opened by SAS

When exporting data from R using haven::write_sas(), the resulting sas7bdat file is not recognized (i.e. cannot be loaded) by SAS EG/9.4. Although there are several other packages such as foreign that provide alternative approaches, I was hoping to find a relatively automated way to push a dataset from my R session directly into SAS.
When using haven, the file is created but cannot be opened by SAS EG or SAS 9.4:
# Load package
library(haven)
# Save data
write_sas(mtcars, "mtcars.sas7bdat")
Using foreign as an alternative to haven:
library(foreign)
write.foreign(df = mtcars,
              datafile = 'mtcars.txt',
              codefile = 'mtcars.sas',
              dataname = 'libraryname.tablename', # Destination in SAS to save the data
              package = 'SAS')
Running the SAS code output from foreign is successful.
* Written by R;
* write.foreign(df = mtcars, datafile = "mtcars.txt", codefile = "mtcars.sas", ;
DATA libraryname.tablename ;
INFILE "mtcars.txt"
     DSD
     LRECL= 43 ;
INPUT
 mpg
 cyl
 disp
 hp
 drat
 wt
 qsec
 vs
 am
 gear
 carb
;
RUN;
However, neither of these methods helps with automatically pushing the data directly from R into a SAS library, which would be preferable.

There is a lengthy discussion on GitHub describing some of the challenges when exporting data from R for use in SAS via haven. In addition to providing a solution on how to automate data transfer from R to SAS, I hope this can serve as an answer to some related questions.
If one wants to use tools designed by SAS for interoperability with R, RSWAT on GitHub is likely a more robust option. However, this assumes that you have access to SAS Cloud Analytics Services configured for this purpose.
If you are working with SAS 9.4 on your machine and perhaps also connect to SAS servers (e.g. using rsubmit; commands), it should be relatively straightforward to pass a dataset directly from R into a SAS library. There are three steps:
Format the dataset for SAS. Although foreign will handle most of the formatting changes, I prefer converting factors back to characters and replacing NA with "". I find this ensures that colleagues need no special formatting to open the final table in SAS.
# Example data
data <- data.frame(ID = c(123, NA, 125),
                   disease = factor(c('syphilis', 'gonorrhea', NA)),
                   AdmitDate = as.Date(c("2014-04-05", NA, "2016-02-03")),
                   DOB = as.Date(c("1990-01-01", NA, NA)))

# Function defined for converting factors and blanks
# (dplyr is loaded for the %>% pipe)
library(dplyr)
convert_format_r2sas <- function(data){
  data <- data %>%
    dplyr::mutate_if(is.factor, as.character) %>%
    dplyr::mutate_if(is.character, tidyr::replace_na, replace = "")
  return(data)
}

# Convert some formatting
data <- convert_format_r2sas(data)
Use foreign to export the data and associated code
library(foreign)
# Ensure the data and code files are saved in an easily accessible location
# (ideally in or downstream of your R project directory)
write.foreign(df = data,
              datafile = 'data.txt',
              codefile = 'data.sas',
              dataname = 'libraryname.tablename', # Destination in SAS to save the data
              package = 'SAS')
Pass the code to the local SAS installation using a custom function. You may need to adjust the location of sas.exe as well as the configuration file. This works both when passing a list of SAS file paths and when passing SAS code written directly in R as a character vector.
# Define function for passing the code to SAS and uploading data
# (may require tweaking the local SAS installation location and configuration file)
pass_code_to_sas <- function(sas_file_list = NULL, inputstring = NULL,
                             sas_path = "C:/LocationTo/SASHome/SASFoundation/9.4/sas.exe",
                             configFile = "C:/LocationTo/SASHome/SASFoundation/9.4/SASV9.CFG") {

  # If provided a list of scripts, check they are all valid
  if(!is.null(sas_file_list)){
    if(!is.list(sas_file_list) || !all(purrr::map_lgl(sas_file_list, file.exists))){
      stop("You entered an invalid file location or did not provide the locations as a list of characters")
    }
  }

  # Write the combined code to a temporary .sas file
  sink(file.path(R.home(), "temp_codePass.sas"))
  if(!is.null(sas_file_list)){
    for(i in 1:length(sas_file_list)){
      cat(readLines(sas_file_list[[i]]), sep = "\n")
    }
  }
  cat(inputstring)
  sink()

  # Output message to view what code was sent...
  message(paste0("The above info was passed to SAS: ",
                 if(!is.null(sas_file_list)){for(i in 1:length(sas_file_list)){cat(readLines(sas_file_list[[i]]), sep = "\n")}},
                 print(inputstring)))

  # Run SAS in batch mode on the temporary file
  system2(sas_path,
          args = paste0(
            "\"", file.path(R.home(), "temp_codePass.sas"), "\"",
            if(!is.null(configFile)) { paste0(" -config \"", configFile, "\"")}
          )
  )

  # Delete the temporary SAS file
  file.remove(file.path(R.home(), "temp_codePass.sas"))
}

# Pass data to SAS
pass_code_to_sas(sas_file_list = list('path2codefile/data.sas'))
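The same function can also receive SAS code written directly in R via the inputstring argument. A minimal sketch, for example to verify the upload (the PROC CONTENTS step and the libraryname.tablename destination simply reuse the names from the foreign export above):
# Verify the upload by running an inline PROC CONTENTS on the new table
check_code <- "proc contents data = libraryname.tablename; run;"
pass_code_to_sas(inputstring = check_code)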

Related

Sparklyr performance in R compared to other on-disk solutions like SAS: removing duplicates using distinct takes hours in Sparklyr, seconds in SAS

I was hoping to receive some clarification on optimizing Sparklyr performance in R on my local machine.
I have imported a CSV file with 211 million rows (the CSV is 17 gigabytes, so it won't fit in memory), with just a few columns, and I would like to select only the distinct values for one of the columns. To accomplish this I imported the data as "test" using spark_read_csv with memory = FALSE and a data-generated schema saved separately in its own object (the import took a few minutes).
After importing with that function, I ran very basic code to deduplicate one column.
It has been running for 2 hours, so I decided to try using SAS. I was able to accomplish what I needed in a few minutes.
This seems very problematic to me; even though I am using a local machine, this does not seem like a very difficult problem.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local", version = "2.3")

download <- function(datapath, dataname) {
  spec_with_r <- sapply(read.csv(datapath, nrows = 1000), class)
  # spec_explicit <- c(x = "character", y = "numeric")
  system.time(dataname <- spark_read_csv(
    sc,
    path = datapath,
    columns = spec_with_r,
    memory = FALSE
  ))
  return(dataname)
}

test <- download("./data/metastases17.csv", test)
test2 <- test %>% select(DX) %>% distinct()

Write Stata dataframe in R [duplicate]

I am getting an error while converting an R data frame into Stata format. I am able to convert the numbers into a Stata file, but when I include strings I get the following error:
library(foreign)
write.dta(newdata, "X.dta")
Error in write.dta(newdata, "X.dta") :
empty string is not valid in Stata's documented format
I have a few string variables like location, name, etc. which have missing values, which is probably causing this problem. Is there a way to handle this?
I've had this error many times before, and it's easy to reproduce:
library(foreign)
test <- data.frame(a = "", b = 1, stringsAsFactors = FALSE)
write.dta(test, 'example.dta')
One solution is to use factor variables instead of character variables, e.g.,
for (colname in names(test)) {
  if (is.character(test[[colname]])) {
    test[[colname]] <- as.factor(test[[colname]])
  }
}
Another is to change the empty strings to something else and change them back in Stata.
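For example, a minimal sketch of that workaround (the "." placeholder is an assumption; you would recode it back to empty strings once the data is in Stata):
# Starting again from character columns, as in the original example
test <- data.frame(a = "", b = 1, stringsAsFactors = FALSE)
# Replace empty strings with a placeholder ("." here) before export
test[] <- lapply(test, function(x) if (is.character(x)) replace(x, x == "", ".") else x)
write.dta(test, 'example.dta')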
This is purely a problem with write.dta, because Stata is perfectly fine with empty strings. But since foreign is frozen, there's not much you can do about that.
Update: (2015-12-04) A better solution is to use write_dta in the haven package:
library(haven)
test <- data.frame(a = "", b = 1, stringsAsFactors = FALSE)
write_dta(test, 'example.dta')
This way, Stata reads string variables properly as strings.
You could use the great readstata13 package (which kindly imports only the Rcpp package).
readstata13::save.dta13(mtcars, 'mtcars.dta')
The function can already save in the Stata 15/16 MP file format (experimental), which is the next update after the Stata 13 format.
readstata13::save.dta13(mtcars, 'mtcars15.dta', version="15mp")
Note: Of course, this also works with OP's data:
readstata13::save.dta13(data.frame(a="", b=1), 'my_data.dta')

How to use Rscript with readr to get data from AWS S3

I have some R code using the readr package that works well on a local computer: I use list.files to find files with a specific extension and then use readr to operate on those files.
My question: I want to do something similar with files in AWS S3, and I am looking for some pointers on how to adapt my current R code to do the same.
Thanks in advance.
What I want:
Given an AWS folder/file structure like this
- /folder1/subfolder1/quant.sf
- /folder1/subfolder2/quant.sf
- /folder1/subfolder3/quant.sf
and so on, where every subfolder has the same file 'quant.sf', I would like to get a data frame with the S3 paths, and then use the R code shown below to operate on all the quant.sf files.
Below, I am showing R code that works currently with data on a Linux machine.
get_quants <- function(path1, ...) {
  additionalPath = list(...)
  suppressMessages(library(tximport))
  suppressMessages(library(readr))
  salmon_filepaths = file.path(path = path1, list.files(path1, recursive = TRUE, pattern = "quant.sf"))
  samples = data.frame(samples = gsub(".*?quant/salmon_(.*?)/quant.sf", "\\1", salmon_filepaths))
  row.names(samples) = samples[, 1]
  names(salmon_filepaths) = samples$samples
  # If no tx2gene available, we will only get isoform-level counts
  salmon_tx_data = tximport(salmon_filepaths, type = "salmon", txOut = TRUE)
  ## Get transcript count summarization
  write.csv(as.data.frame(salmon_tx_data$counts), file = "tx_NumReads.csv")
  ## Get TPM
  write.csv(as.data.frame(salmon_tx_data$abundance), file = "tx_TPM_Abundance.csv")
  if (length(additionalPath) > 0) {
    tx2geneFile = additionalPath[[1]]
    my_tx2gene = read.csv(tx2geneFile, sep = "\t", stringsAsFactors = F, header = F)
    salmon_tx2gene_data = tximport(salmon_filepaths, type = "salmon", txOut = FALSE, tx2gene = my_tx2gene)
    ## Get gene count summarization
    write.csv(as.data.frame(salmon_tx2gene_data$counts), file = "tx2gene_NumReads.csv")
    ## Get TPM
    write.csv(as.data.frame(salmon_tx2gene_data$abundance), file = "tx2gene_TPM_Abundance.csv")
  }
}
I find it easiest to use the aws.s3 R package for this. In that case you would use the s3read_using() and s3write_using() functions to read from and save to S3, like this:
library(aws.s3)
my_tx2gene=s3read_using(FUN=read.csv, object="[path_in_s3_to_file]",sep = "\t",stringsAsFactors = F, header=F)
It basically is a wrapper around whatever function you want to use for file input/output. Works great with read_json, saveRDS, or anything else!
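To get the data frame of quant.sf paths that the question asks for, one option is to list the bucket contents with aws.s3 and filter on the file name. A minimal sketch, assuming a bucket called "my-bucket" and the folder1/ prefix shown above:
library(aws.s3)
library(dplyr)

# List objects under the prefix and keep only the quant.sf files
# ("my-bucket" and "folder1/" are placeholders; substitute your own bucket and prefix)
quant_paths <- get_bucket_df(bucket = "my-bucket", prefix = "folder1/") %>%
  filter(grepl("quant\\.sf$", Key)) %>%
  select(Key)

head(quant_paths)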

Notation issues with read.csv.sql in R

I am using read.csv.sql to conditionally read in data (my data set is extremely large, so this was the solution I chose to filter it and reduce its size prior to reading it in). I was running into memory issues when reading in the full data and then filtering it, so it is important that I use the conditional read so that only the subset is read in, rather than the full data set.
Here is a small data set so my problem can be reproduced:
write.csv(iris, "iris.csv", row.names = F)
I am finding that the notation you have to use with read.csv.sql is extremely awkward. The following is the first way I tried reading in the file; it works, but it is messy:
library(sqldf)
csvFile <- "iris.csv"
spec <- 'setosa'
sql <- paste0("select * from file where Species = '\"", spec,"\"'")
d1 <- read.csv.sql(file = csvFile, sql = sql)
I then found a slightly cleaner way of writing the same notation:
sql <- paste0("select * from file where Species = '", spec,"'")
d2 <- read.csv.sql(file = csvFile, sql = sql,
                   filter = list('gawk -f prog', prog = '{ gsub(/"/, ""); print }'))
Next, I wanted to read in a case where I select multiple values from the same column, so I tried this and it works:
d3 <- read.csv.sql(file = csvFile,
                   sql = "select * from file where Species in
                          ('\"setosa\"', '\"versicolor\"') ")
However, I want to avoid hard coding the values like that so I tried:
spec2 <- c('setosa', 'versicolor')
sql2 <- paste0("select * from file where Species in '", spec2,"'")
d4 <- read.csv.sql(file = csvFile, sql = sql2,
                   filter = list('gawk -f prog', prog = '{ gsub(/"/, ""); print }'))
But this does not work (it seems to read only the first value from the vector and tries to match it as a table). I'm sure this is another notation issue, and I would like help to clear up this chunk of code.
Also, if you have any tips/tricks on using read.csv.sql and dealing with the notation issues, I would like to hear them!
The problem is that sqldf provides text preprocessing facilities, but the code shown in the question does not use them, making it overly complex.
1) Regarding text substitution, use fn$ (from gsubfn, which sqldf automatically loads) as discussed on the GitHub page for sqldf. Assuming that we used quote = FALSE in the write.csv, since SQLite does not handle quotes natively:
spec <- 'setosa'
out <- fn$read.csv.sql("iris.csv", "select * from file where Species = '$spec' ")
spec <- c("setosa", "versicolor")
string <- toString(sprintf("'%s'", spec)) # add quotes and make comma-separated
out <- fn$read.csv.sql("iris.csv", "select * from file where Species in ($string) ")
2) Regarding deleting double quotes, a simpler way would be to use the following filter= argument:
read.csv.sql("iris.csv", filter = "tr -d \\042") # Windows
or
read.csv.sql("iris.csv", filter = "tr -d \\\\042") # Linux / bash
depending on your shell. The first one worked for me on Windows (with Rtools installed and on the PATH) and the second worked for me on Linux with bash. It is possible that other variations could be needed for other shells.
2a) Another possibility for removing quotes is to install the free csvfix utility (available on Windows, Linux and Mac) on your system and then use the following filter= argument which should work in all shells since it does not involve any characters that are typically interpreted specially by either R or most shells. Thus the following should work on all platforms.
read.csv.sql("iris.csv", filter = "csvfix echo -smq")
2b) Another cross platform utility that could be used is xsv. The eol= argument is only needed on Windows since xsv produces UNIX style line endings but won't hurt on other platforms so the following line should work on all platforms.
read.csv.sql("iris.csv", eol = "\n", filter = "xsv fmt")
2c) sqldf also includes an awk program (csv.awk) that can be used. It outputs UNIX style newlines so specify eol = "\n" on Windows. On other platforms it won't hurt if you specify it but you can omit it if you wish since that is the default on those platforms.
csv.awk <- system.file("csv.awk", package = "sqldf")
rm_quotes_cmd <- sprintf('gawk -f "%s"', csv.awk)
read.csv.sql("iris.csv", eol = "\n", filter = rm_quotes_cmd)
3) Regarding general tips, note that the verbose = TRUE argument to read.csv.sql can be useful for seeing what is going on.
read.csv.sql("iris.csv", verbose = TRUE)

Getting an SPSS data file into R

At my company, we are thinking of gradually phasing out SPSS in favour of R. During the transition, though, we'll still have data coming in the SPSS data file format (.sav).
I'm having issues importing this SPSS datafile into R. When I import an SPSS file into R, I want to retain both the values and value labels for the variables. The read.spss() function from foreign package gives me option to retain either values OR value labels of a variable but not both.
AFAIK, R does allow factor variables to have values (levels) and value labels (level labels). I was just wondering if it's possible to somehow modify the read.spss() function to incorporate this.
Alternatively, I came across spss.system.file() function from memisc package which supposedly allows this to happen, but it asks for a separate syntax file (codes.file), which is not necessarily available to me always.
Here's a sample data file.
I'd appreciate any help resolving this issue.
Thanks.
I do not know how to read in SPSS metadata; I usually read .csv files and add the metadata back, or write a small one-off Perl script to do the job. What I wanted to mention is that a recently published R package, Rz, may assist you with bringing SPSS data into R. I have had a quick look at it and it seems useful.
There is a way to read an SPSS data file into R via an ODBC driver.
1) There is an IBM SPSS Statistics Data File Driver. I could not find the download link; I got it from my SPSS provider. The Standalone Driver is all you need; you do not need SPSS to install or use the driver.
2) Create a DSN for the SPSS data driver.
3) Using the RODBC package you can read any SPSS data file into R. Value labels for each variable can be retrieved as separate tables, which you can then use in R however you wish.
Here is a working example on Windows (I do not have SPSS on my computer now) that reads your example data file into R. I have not tested this on Linux, but it probably works there too, because there is an SPSS data driver for Linux as well.
require(RODBC)
# Create connection
# Change the DSN name and CP_CONNECT_STRING according to your setting
con <- odbcDriverConnect("DSN=spss_ehsis;SDSN=SAVDB;HST=C:\\Program Files\\IBM\\SPSS\\StatisticsDataFileDriver\\20\\Standalone\\cfg\\oadm.ini;PRT=StatisticsSAVDriverStandalone;CP_CONNECT_STRING=C:\\temp\\data_expt.sav")
# List of tables
Tables <- sqlTables(con)
Tables
# List of table names to extract
table.names <- Tables$TABLE_NAME[Tables$TABLE_SCHEM != "SYSTEM"]
# Function to query a table by name
sqlQuery.tab.name <- function(table) {
sqlQuery(con, paste0("SELECT * FROM [", table, "]"))
}
# Retrieve all tables
Data <- lapply(table.names, sqlQuery.tab.name)
# See the data
lapply(Data, head)
# Close connection
close(con)
For example, we can see that value labels are defined for two variables:

[[5]]
  VAR00002 VAR00002_label
1        1           Male
2        2         Female

[[6]]
  VAR00003 VAR00003_label
1        2        Student
2        3       Employed
3        4     Unemployed
Additional information
Here is a function that reads SPSS data after the connection to the SPSS data file has been made. The function allows you to specify the list of variables to be selected. If value.labels = T, the selected variables that have value labels in the SPSS data file are converted to R factors with the labels attached.
I have to say I am not satisfied with the performance of this solution. It works well for small data files, but the RAM limit is reached quite often for large SPSS data files (even when only a subset of variables is selected).
get.spss <- function(channel, variables = NULL, value.labels = F) {

  VarNames <- sqlQuery(channel = channel,
                       query = "SELECT VarName FROM [Variables]",
                       as.is = T)$VarName

  if (is.null(variables)) variables <- VarNames else {
    if (any(!variables %in% VarNames)) stop("Wrong variable names")
  }

  if (value.labels) {
    ValueLabelTableName <- sqlQuery(channel = channel,
                                    query = "SELECT VarName FROM [Variables]
                                             WHERE ValueLabelTableName is not null",
                                    as.is = T)$VarName
    ValueLabelTableName <- intersect(variables, ValueLabelTableName)
  }

  variables <- paste(variables, collapse = ", ")

  data <- sqlQuery(channel = channel,
                   query = paste("SELECT", variables, "FROM [Cases]"),
                   as.is = T)

  if (value.labels) {
    for (var in ValueLabelTableName) {
      VL <- sqlQuery(channel = channel,
                     query = paste0("SELECT * FROM [VLVAR", var, "]"),
                     as.is = T)
      data[, var] <- factor(data[, var], levels = VL[, 1], labels = VL[, 2])
    }
  }

  return(data)
}
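A minimal usage sketch, reusing the connection con from the example above and the two labelled variables shown in the earlier output (VAR00002 and VAR00003):
# Read only the two labelled variables, converting their coded values to factors
mydata <- get.spss(channel = con, variables = c("VAR00002", "VAR00003"), value.labels = TRUE)
str(mydata)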
My work is going through the same transition.
read.spss() returns the variable labels as an attribute of the object you create with it. So in the example below I have a data frame called rvm, which was created by read.spss() with to.data.frame=TRUE. It has 3,500 variables with short names a1, a2, etc. but long labels for each variable in SPSS. I can access the variable labels with
cbind(attributes(rvm)$variable.labels)
which returns a list of all 3,500 variables' full names, ending with
…
x23 "Other Expenditure Uncapped Daily Expenditure In Region"
x24 "Accommodation Expenditure In Region"
x25 "Food/Meals/Drink Expenditure In Region"
x26 "Local Transport Expenditure In Region"
x27 "Sightseeing/Attractions Expenditure In Region"
x28 "Event/Conference Expenditure In Region"
x29 "Gambling/Casino Expenditure In Region"
x30 "Gifts/Souvenirs Expenditure In Region"
x31 "Other Shopping Expenditure In Region"
x0 "Accommodation Daily Expenditure In Region"
What to do with these is another matter, but at least I have them, and if I want I can put them in some other object for safekeeping, searching with grep, etc.
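For example, a quick sketch of that kind of lookup (the "Expenditure" pattern is just an illustration):
# Keep the labels in a separate vector and search them by keyword
var_labels <- attributes(rvm)$variable.labels
grep("Expenditure", var_labels, value = TRUE)        # matching long labels
names(var_labels)[grep("Expenditure", var_labels)]   # corresponding short variable names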
Since you have SPSS available, I recommend installing the "Essentials for R" plugin (free of charge, but you need to register; also see the installation instructions), which allows you to run R within SPSS. The plugin includes an R package with functions that transfer the active SPSS data frame to R (and back), including labeled factor levels, dates, and German umlauts - details that are otherwise notoriously difficult. In my experience, it is more reliable than R's own foreign package.
Once you have everything set up, open the data in SPSS, and run something like the following code in the syntax window:
begin program r.
myDf <- spssdata.GetDataFromSPSS(missingValueToNA=TRUE,
                                 factorMode="labels",
                                 rDate="POSIXct")
save(myDf, file="d:/path/to/your/myDf.Rdata")
end program.
Essentials for R plugin link (apparently breaks markdown link syntax):
https://www.ibm.com/developerworks/mydeveloperworks/wikis/home/wiki/We70df3195ec8_4f95_9773_42e448fa9029/page/Downloads%20for%20IBM®%20SPSS®%20Statistics?lang=en
Nowadays, the package haven provides the functionality to achieve what you want (and much more).
The function read_sav() can import *.sav and *.zsav files and returns a tibble. The variable labels are automatically stored in the labels attribute of the corresponding variables within that tibble. The class labelled preserves the original semantics and allows us to associate arbitrary labels with numeric or character vectors. If needed, we can use the function as_factor() to coerce labeled objects, i.e. objects of the class labelled, and even all labeled vectors within data.frames or tibbles (at once) to factors.
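A minimal sketch of that workflow, reusing the data_expt.sav file name from the ODBC example above (the variable name is an assumption based on that example's output):
library(haven)
# Import the SPSS file; labelled vectors keep their value labels
df <- read_sav("data_expt.sav")
# Inspect the value labels attached to one variable (VAR00002 as in the earlier output)
attr(df$VAR00002, "labels")
# Convert all labelled vectors in the tibble to factors at once
df_factors <- as_factor(df)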
