R SAS to xlsx conversion script

I am attempting to write a script that allows me to quickly convert a folder of SAS datasets into .xlsx and save them in a different folder. Here is my current code:
require(haven)
require(openxlsx)
setwd(choose.dir())
lapply(list.files(pattern="*.sas7bdat"), function(x) {
openxlsx::write.xlsx(haven::read_sas(x), path = paste0(choose.dir(),x,".xlsx"))
})
I keep getting the following error and I am not sure why:
Error in saveWorkbook(wb = wb, file = file, overwrite = overwrite) :
argument "file" is missing, with no default
Final Code (thanks @oliver):
require(haven)
require(openxlsx)
setwd(choose.dir())
lapply(list.files(pattern="*.sas7bdat"), function(x) {
openxlsx::write.xlsx(haven::read_sas(x), file = paste0(gsub("\\.sas7bdat$", "", basename(x)), ".xlsx"))
})

The problem is that write.xlsx doesn't have a path argument; it uses a file argument instead. This is documented in the function as well, see help("write.xlsx"):
outdir <- choose.dir() # <== choose the directory only once
lapply(list.files(pattern="*.sas7bdat"), function(x) {
# Obtain the basename of the file, without SAS extension
x_basename <- gsub('\\.sas7bdat$', '', basename(x))
# Write the file to Excel
openxlsx::write.xlsx(haven::read_sas(x),
# Use "file" instead of "path"
file = paste0(outdir, x_basename, ".xlsx"))
})
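Putting it together, here is a minimal sketch of the whole pipeline that reads every .sas7bdat file from one folder and writes the .xlsx files into another; the two choose.dir() calls (Windows-only) and the folder layout are assumptions, not part of the original answers:
require(haven)
require(openxlsx)
indir <- choose.dir()   # folder holding the .sas7bdat files (assumed)
outdir <- choose.dir()  # folder that will receive the .xlsx files (assumed)
lapply(list.files(indir, pattern = "\\.sas7bdat$", full.names = TRUE), function(x) {
# build the output name from the file stem
out <- file.path(outdir, paste0(gsub("\\.sas7bdat$", "", basename(x)), ".xlsx"))
openxlsx::write.xlsx(haven::read_sas(x), file = out)
})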

Related

How do I apply the same action to all Excel Files in the directory?

I need to shape the data stored in Excel files and save it as new .csv files. I figured out what specific actions should be done, but can't understand how to use lapply.
All Excel files have the same structure. Each of the .csv files should have the name of the original file.
## the original actions successfully performed on a single file
library(readxl)
library("reshape2")
DataSource <- read_excel("File1.xlsx", sheet = "Sheet10")
DataShaped <- melt(subset(DataSource [-(1),], select = - c(ng)), id.vars = c ("itemname","week"))
write.csv2(DataShaped, "C:/Users/Ol/Desktop/Meta/File1.csv")
## my attempt to apply to the rest of the files in the directory
lapply(Files, function (i){write.csv2((melt(subset(read_excel(i,sheet = "Sheet10")[-(1),], select = - c(ng)), id.vars = c ("itemname","week"))))})
R returns the result to the console but doesn't create any files. The result resembles the .csv structure.
Could anybody explain what I am doing wrong? I'm new to R and would be really grateful for the help.
Answer
Thanks to the prompt answer from @Parfait, the code is working! So glad. Here it is:
library(readxl)
library(reshape2)
Files <- list.files(full.names = TRUE)
lapply(Files, function(i) {
write.csv2(
melt(subset(read_excel(i, sheet = "Decomp_Val")[-(1),],
select = -c(ng)),id.vars = c("itemname","week")),
file = sub("\\.xlsx$", ".csv", i))
})
It reads an Excel file in the directory, drops the first row (but keeps the headers) and the column named "ng", melts the data by the id variables "itemname" and "week", and writes the result as a .csv to the working directory under the name of the original file. And then - rinse and repeat.
Simply pass an actual file path to write.csv2. Otherwise, as noted in the docs (?write.csv), the default value of the file argument is the empty string "":
file: either a character string naming a file or a connection open for writing. "" indicates output to the console.
Below concatenates the Excel file stem to the specified path directory with .csv extension:
path <- "C:/Users/Ol/Desktop/Meta/"
lapply(Files, function (i){
write.csv2(
melt(subset(read_excel(i, sheet = "Sheet10")[-(1),],
select = -c(ng)),
id.vars = c("itemname","week")),
file = paste0(path, sub("\\.xlsx$", ".csv", i))
)
})

How to find all files sourcing a particular file?

Suppose I have file_a.R. It is sourced via R's base source function by some other files, file_b.R and file_c.R, which are located in the same folder or a subfolder. Is there an easy way to get the paths of file_b.R and file_c.R given the path of file_a.R?
EDIT:
If you want to get all links between R files, and some files that are sourced in those files, you can use something like that:
library(stringr)
#Get all R file paths in the working directory and subdirectories
filelist <- list.files(pattern = "[.]R$", recursive = TRUE)
#Extract one file's sources
getSources <- function(file, pattern) {
#Store all file lines in a character vector
lines <- readLines(file, warn = FALSE)
#Extract R-filenames starting with "pattern" in all lines containing "source"
sources <- lapply(lines, function(x) {
if (length(grep("source", x)) > 0) {
str_extract(x, paste0(pattern, ".*[.]R"))
}
else{
NA
}
})
#Remove NA (lines without source)
sources <- sources[!is.na(sources)]
#Return a list
list(path = file,
pattern = pattern,
sources = unlist(sources))
}
#Example
corresp <- lapply(X = filelist, FUN = getSources, pattern = "file")
It will return, for each R file, a list of:
$path: the R file's path
$pattern: the pattern used to match sources
$sources: the names of the sourced files
And you'll be able to see if anything is sourced anywhere, including file_a.R.
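For example, to pull out just the files that source file_a.R, you could filter that list (a sketch, assuming corresp was built as above):
# keep only the entries whose sources mention file_a.R
sourcing_files <- Filter(function(x) any(grepl("file_a\\.R$", x$sources)), corresp)
# the paths of the files that source it
sapply(sourcing_files, `[[`, "path")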

Using read_csv with path to a file (readr's package)

I'm having difficulty reading a csv file with the read_csv function from the readr package using the file's path.
My file ("test.csv") is located in the 'data' folder, which sits at the root of my project (the working directory):
wd <- getwd()
data_path <- "data"
file.exists(file.path(wd, data_path, "test.csv")) # Returns TRUE
library(readr)
data.1 <- read_csv(file = file.path(wd, data_path, "test.csv")) # Does not work
The log provides me with the following error:
Error in withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning")) :
argument "x" is missing, with no default
However, it works perfectly with the standard read.csv function:
data.1 <- read.csv("data/mockup_data_v1.csv", header = TRUE, sep = ",") # OK
Could you please let me know how to proceed to use read_csv from readr package with file path as an argument?
As you've already set your working directory, you should be able to just read the file with:
data.1 <- read_csv("data/test.csv")
Because R looks in your working directory by default, passing the relative path is enough; there is no need to build the absolute path from getwd() yourself.
All you need to do is join the pieces with a path separator and you should be good to go (paste's default separator is a space, which would break the path):
library(readr)
wd <- getwd()
data_path <- "data"
data.1 <- read_csv(paste(wd, data_path, "test.csv", sep = "/"))
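A more idiomatic alternative, if you do want to build the path explicitly, is file.path(), which inserts the separator for you; this is a sketch of the same call, not part of the original answer:
library(readr)
data.1 <- read_csv(file.path(getwd(), "data", "test.csv"))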

Read multiple files and save data into one dataframe in R

I am trying to read multiple files and then combine them into one data frame. The code that I am using is as follows:
library(plyr)
mydata = ldply(list.files(path="Data load for stations/data/Predicted",pattern = "txt"), function(filename) {
dum = read.table(filename,skip=5, header=F, sep=" ")
#If you also want to add the filename as a column
dum$filename = filename
return(dum)
})
The error that I am getting is as follows:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'mobdata201001.txt': No such file or directory
The data files can be found on https://www.dropbox.com/sh/827kmkrwd0irehk/BFbftkks42
Any help is highly appreciated.
Alternatively, you can use the argument full.names in list.files:
list.files(path="Data load for stations/data/Predicted",
pattern = "txt", full.names=TRUE)
It will automatically prepend the full path to each file name.
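Combined with the original ldply call, that would look like this (a sketch of the same loop, assuming the directory from the question):
library(plyr)
mydata <- ldply(list.files(path = "Data load for stations/data/Predicted",
pattern = "txt", full.names = TRUE), function(filename) {
dum <- read.table(filename, skip = 5, header = FALSE, sep = " ")
dum$filename <- filename  # keep the source file name as a column
dum
})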
Try the following code:
library(plyr)
path <- "Data load for stations/data/Predicted/"
filenames <- paste0(path, list.files(path, pattern = "txt"))
mydata = ldply(filenames, function(filename) {
dum = read.table(filename,skip=5, header=F, sep=" ")
#If you also want to add the filename as a column
dum$filename = filename
return(dum)
})
I think what is happening is that you're generating a list of file names relative to the path given to list.files, and then asking read.table to open those names without the rest of the path...

Automate zip file reading in R

I need to automate R to read a csv datafile that's inside a zip file.
For example, I would type:
read.zip(file = "myfile.zip")
And internally, what would be done is:
Unzip myfile.zip to a temporary folder
Read the only file contained in it using read.csv
If there is more than one file inside the zip file, an error is thrown.
My problem is to get the name of the file contained in the zip file, in order to provide it to the read.csv command. Does anyone know how to do it?
UPDATE
Here's the function I wrote based on @Paul's answer:
read.zip <- function(zipfile, row.names=NULL, dec=".") {
# Create a name for the dir where we'll unzip
zipdir <- tempfile()
# Create the dir using that name
dir.create(zipdir)
# Unzip the file into the dir
unzip(zipfile, exdir=zipdir)
# Get the files inside the dir
files <- list.files(zipdir)
# Throw an error if there's more than one
if(length(files)>1) stop("More than one data file inside zip")
# Get the full name of the file
file <- paste(zipdir, files[1], sep="/")
# Read the file, naming the arguments so they are not matched
# positionally to read.csv's header and sep
read.csv(file, row.names = row.names, dec = dec)
}
Since I'll be working with more files inside tempdir(), I created a new dir inside it so I don't get confused with the files. I hope it may be useful!
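Usage is then a one-liner (a hypothetical call, assuming myfile.zip holds exactly one csv):
df <- read.zip("myfile.zip")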
Another solution using unz:
read.zip <- function(file, ...) {
zipFileInfo <- unzip(file, list=TRUE)
if(nrow(zipFileInfo) > 1)
stop("More than one data file inside zip")
else
read.csv(unz(file, as.character(zipFileInfo$Name)), ...)
}
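Because the extra arguments are passed through ..., read.csv options can be forwarded directly; the file name and options here are hypothetical:
df <- read.zip("myfile.zip", sep = ";", stringsAsFactors = FALSE)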
You can use unzip to unzip the file. I just mention this as it is not clear from your question whether you knew that. As for reading the file: once you've extracted it to a temporary dir (?tempdir), just use list.files to find the files that were dumped into the temporary directory. In your case this is just one file, the file you need. Reading it using read.csv is then quite straightforward:
l = list.files(temp_path, full.names = TRUE)
read.csv(l[1])
assuming your tempdir location is stored in temp_path.
I found this thread as I was trying to automate reading multiple csv files from a zip. I adapted the solution to the broader case. I haven't tested it for weird filenames or the like, but this is what worked for me, so I thought I'd share:
read.csv.zip <- function(zipfile, ...) {
# Create a name for the dir where we'll unzip
zipdir <- tempfile()
# Create the dir using that name
dir.create(zipdir)
# Unzip the file into the dir
unzip(zipfile, exdir=zipdir)
# Get a list of csv files in the dir
files <- list.files(zipdir)
files <- files[grep("\\.csv$", files)]
# Create a list of the imported csv files
# (simplify = FALSE guarantees a list even if the files
# happen to have the same number of columns)
csv.data <- sapply(files, function(f) {
fp <- file.path(zipdir, f)
return(read.csv(fp, ...))
}, simplify = FALSE)
return(csv.data)
}
If you have zcat installed on your system (which is the case for Linux, macOS, and Cygwin), you could also use:
zipfile<-"test.zip"
myData <- read.delim(pipe(paste("zcat", zipfile)))
This solution also has the advantage that no temporary files are created.
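On systems without zcat, Info-ZIP's unzip offers the same streaming via its -p flag, which writes the member to stdout; this assumes the unzip binary is on your PATH and is not part of the original answer:
myData <- read.delim(pipe(paste("unzip -p", zipfile)))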
Here is an approach I am using that is based heavily on @Corned Beef Hash Map's answer. Here are some of the changes I made:
- My approach makes use of the data.table package's fread(), which can be fast (generally, if it's zipped, sizes might be large, so you stand to gain a lot of speed here!).
- I also adjusted the output format so that it is a named list, where each element of the list is named after the file. For me, this was a very useful addition.
- Instead of using regular expressions to sift through the files grabbed by list.files, I make use of list.files()'s pattern argument.
- Finally, by relying on fread() and by making pattern an argument to which you could supply something like "" or NULL or ".", you can use this to read in many types of data files; in fact, you can read in multiple types at once (if your .zip contains .csv and .txt files and you want both, e.g.). If there are only some types of files you want, you can specify the pattern to use only those, too.
Here is the actual function:
read.csv.zip <- function(zipfile, pattern="\\.csv$", ...){
# Create a name for the dir where we'll unzip
zipdir <- tempfile()
# Create the dir using that name
dir.create(zipdir)
# Unzip the file into the dir
unzip(zipfile, exdir=zipdir)
# Get a list of csv files in the dir
files <- list.files(zipdir, recursive = TRUE, pattern = pattern)
# Create a list of the imported csv files
# (simplify = FALSE guarantees a list)
csv.data <- sapply(files,
function(f){
fp <- file.path(zipdir, f)
dat <- fread(fp, ...)
return(dat)
},
simplify = FALSE
)
# Use csv names to name list elements
names(csv.data) <- basename(files)
# Return data
return(csv.data)
}
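A hypothetical call, reading every csv inside archive.zip into a named list (fread lives in data.table, so attach it first):
library(data.table)
csv.data <- read.csv.zip("archive.zip", pattern = "\\.csv$")
names(csv.data)  # one element per file, named after it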
The following refines the above answers. FUN could be read.csv, cat, or anything you like, provided its first argument accepts a file path. E.g.
head(read.zip.url("http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/Downloads/ICD-9-CM-v32-master-descriptions.zip", filename = "CMS32_DESC_LONG_DX.txt"))
read.zip.url <- function(url, filename = NULL, FUN = readLines, ...) {
zipfile <- tempfile()
download.file(url = url, destfile = zipfile, quiet = TRUE)
zipdir <- tempfile()
dir.create(zipdir)
unzip(zipfile, exdir = zipdir) # no files argument, so extract everything
files <- list.files(zipdir)
if (is.null(filename)) {
if (length(files) == 1) {
filename <- files
} else {
stop("multiple files in zip, but no filename specified: ", paste(files, collapse = ", "))
}
} else { # filename specified
stopifnot(length(filename) == 1)
stopifnot(filename %in% files)
}
do.call(FUN, args = c(list(file.path(zipdir, filename)), list(...)))
}
Another approach that uses fread from the data.table package
fread.zip <- function(zipfile, ...) {
# Function reads data from a zipped csv file
# Uses fread from the data.table package
## Create the temporary directory or flush CSVs if it exists already
if (!file.exists(tempdir())) {dir.create(tempdir())
} else {file.remove(list.files(tempdir(), full.names = TRUE, pattern = "\\.csv$"))
}
## Unzip the file into the dir
unzip(zipfile, exdir=tempdir())
## Get path to file ("\\.csv$" is a regular expression, unlike the glob "*.csv")
file <- list.files(tempdir(), pattern = "\\.csv$", full.names = TRUE)
## Throw an error if there's more than one
if(length(file)>1) stop("More than one data file inside zip")
## Read the file
fread(file,
na.strings = c(""), # read empty strings as NA
...
)
}
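A hypothetical call (data.table must be attached so that fread is found):
library(data.table)
dt <- fread.zip("myfile.zip")  # errors if the zip holds more than one csv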
Based on the answer/update by @joão-daniel:
# unzipped file location
outDir <- "~/Documents/unzipFolder"
# get all the zip files
zipF <- list.files(path = "~/Documents/", pattern = "\\.zip$", full.names = TRUE)
# unzip all your files
purrr::map(.x = zipF, .f = unzip, exdir = outDir)
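Once everything is unzipped, the extracted files can be read back in one pass; this sketch assumes they are csv files:
files <- list.files(outDir, pattern = "\\.csv$", full.names = TRUE)
dat <- lapply(files, read.csv)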
I just wrote a function based on the read.zip at the top that may help...
read.zip <- function(zipfile, internalfile=NA, read.function=read.delim, verbose=TRUE, ...) {
# function based on http://stackoverflow.com/questions/8986818/automate-zip-file-reading-in-r
# check the files within zip
unzfiles <- unzip(zipfile, list=TRUE)
if (is.na(internalfile) || is.numeric(internalfile)) {
internalfile <- unzfiles$Name[ifelse(is.na(internalfile),1,internalfile[1])]
}
# Create a name for the dir where we'll unzip
zipdir <- tempfile()
# Create the dir using that name
if (verbose) cat("Directory created:", zipdir, "\n")
dir.create(zipdir)
# Unzip the file into the dir
if (verbose) cat("Unzipping file:", internalfile, "...")
unzip(zipfile, files=internalfile, exdir=zipdir)
if (verbose) cat("Done!\n")
# Get the full name of the file
file <- paste(zipdir, internalfile, sep="/")
if (verbose)
on.exit({
cat("Done!\nRemoving temporary files:", file, "\n")
file.remove(file)
unlink(zipdir, recursive = TRUE) # file.remove() cannot delete a directory
})
else
on.exit({file.remove(file); unlink(zipdir, recursive = TRUE)})
# Read the file
if (verbose) cat("Reading file...")
read.function(file, ...)
}
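A hypothetical call, reading the second member of an archive as csv with the progress messages switched off:
df <- read.zip("myfile.zip", internalfile = 2, read.function = read.csv, verbose = FALSE)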
