Reproducible Research: Convert sas7bdat data files to csv files by invoking statTransfer using GNU make

QUESTION:
I'm very new to GNU Make. Is there a better way to programmatically convert statistical datasets from sas7bdat to csv files and keep them in sync with each other using GNU Make to promote reproducible research? Would you approach this problem differently from a coding perspective or is there a better way to promote reproducible research? Can I add an additional pre-requisite (i.e. statTransferOptions.txt) while using static pattern rules?
The solution needs to:
Find all sas7bdat files in all subdirectories
Read statTransfer options
Convert the sas7bdat file to csv file using statTransfer command line tool with options
Given the current limitations of statTransfer, I think this will require a two step process:
Build statTransfer command file (.stcmd) for each SAS data file (.sas7bdat)
Build csv file for each stcmd file by executing statTransfer (st) using options in stcmd file
target stcmd and csv files should reside in same subdirectory as pre-requisite sas7bdat file
Find out-of-date stcmd and csv files and update them if a new sas7bdat file exists or if base option file changes
CONTEXT:
I have inherited a large statistical report which is published annually. In previous years, analysis was done in SAS. We are now using R. Some of the sas7bdat files generated by SAS Enterprise Guide do not import correctly with the sas7bdat package. StatTransfer, a commercial product, has a command-line interface and does convert sas7bdat files to csv files properly; however, there are options that improve conversion (e.g., writing of date formats). The sas7bdat files are in multiple subdirectories corresponding to the type of dataset and the year.
This approach was further prompted by:
Gandrud, Christopher (2013-06-21). Reproducible Research with R and RStudio (Chapman & Hall/CRC The R Series) (pp. 104-105). Chapman and Hall/CRC. Kindle Edition.
TROUBLESHOOTING:
This almost does what I want: Recursive wildcards in GNU make?
SUGGESTED MAKEFILE?
RDIR := .
######
#PREP#
######
# Use BASH shell to create list of source sas7bdat files
SASDATA = $(shell find $(RDIR) -type f -name '*.sas7bdat')
# Use pattern substring functions to define variable list of filenames
# to be used as targets in recipes
STCMD_OUT = $(patsubst $(RDIR)/%.sas7bdat, $(RDIR)/%.stcmd, $(SASDATA))
CSV_OUT = $(patsubst $(RDIR)/%.sas7bdat, $(RDIR)/%.csv, $(SASDATA))
#########
#TARGETS#
#########
all: $(STCMD_OUT) $(CSV_OUT)
# I think the name "static pattern rules" is misleading
# but I found this to be helpful:
# http://www.gnu.org/software/make/manual/make.html#Static-Pattern
# can I add statTransferOptions.txt as a pre-requisite while using static pattern rules?
$(STCMD_OUT): $(RDIR)/$(@D)/%.stcmd: $(RDIR)/$(@D)/%.sas7bdat
	cp $(RDIR)/statTransferOptions.txt $@
	echo copy $(RDIR)/$< delim $(RDIR)/$(basename $<).csv -v >> $@
	echo quit >> $@
$(CSV_OUT): $(RDIR)/$(@D)/%.csv: $(RDIR)/$(@D)/%.stcmd
	st $(RDIR)/$<
clean:
	rm $(STCMD_OUT)
	rm $(CSV_OUT)
REVISED MAKEFILE AFTER INPUT FROM SO:
RDIR := .
######
#PREP#
######
# Create list of source sas7bdat files
SASDATA := $(shell find $(RDIR) -type f -name '*.sas7bdat')
STCMD_OUT := $(patsubst $(RDIR)/%.sas7bdat, $(RDIR)/%.stcmd, $(SASDATA))
CSV_OUT := $(patsubst $(RDIR)/%.sas7bdat, $(RDIR)/%.csv, $(SASDATA))
#########
#TARGETS#
#########
all: $(STCMD_OUT) $(CSV_OUT)
$(STCMD_OUT): %.stcmd: %.sas7bdat statTransferOptions.txt
	cp $(RDIR)/statTransferOptions.txt $@
	echo copy $(RDIR)/$< delim $(RDIR)/$(basename $<).csv -v -y >> $@
	echo quit >> $@
$(CSV_OUT): %.csv: %.stcmd
	st $(RDIR)/$<
clean:
	rm $(STCMD_OUT)
	rm $(CSV_OUT)
However, the correct option might be to debug the CRAN sas7bdat package so that the entire toolchain is available, rather than invoking the proprietary statTransfer.

In SO, we generally don't have the time or energy (or, often, interest) to go read related papers, options, alternatives, etc. It works best if you simply and clearly specify the code you have problems with (in this case, the makefile which is provided so that's great), the exact problem you have including error messages or incorrect outputs (this is not obvious from your question), what you wanted to happen that did not happen, because this is not always clear, and perhaps any additional thoughts or directions you've tried and have not worked.
I'm not sure exactly what the problem you're having is, but I see a number of issues with your makefile. First, this will work but is highly inefficient:
SASDATA = $(shell find $(RDIR) -type f -name '*.sas7bdat')
You should use the := form of assignment here. Probably you should use it when setting STCMD_OUT and CSV_OUT as well, although this is less critical.
Most important, though, these rules are not right:
$(STCMD_OUT): $(RDIR)/$(@D)/%.stcmd: $(RDIR)/$(@D)/%.sas7bdat
You cannot use automatic variables like $@ (or any of their alternative forms) in the target or prerequisite lists. The automatic variables are only defined within the recipe of the rule. You can use secondary expansion for this, but I'm not sure why you're trying to do this. Why not just use:
$(STCMD_OUT): %.stcmd: %.sas7bdat
? Ditto for the other static pattern rule?
As for your question, yes, it's perfectly fine to add extra prerequisites such as statTransferOptions.txt to the static pattern rule.
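To check what each generated .stcmd file should contain without running make, the recipe from the revised makefile can be sketched as a plain shell function. This is only an illustration: write_stcmd, the scratch directory, and the PLACEHOLDER-OPTION line are hypothetical names, not real StatTransfer options, and the copy line simply repeats what the recipe echoes.

```shell
#!/bin/sh
# Hypothetical helper mirroring the .stcmd recipe in the revised makefile.
# $1: base options file (statTransferOptions.txt in the question)
# $2: path to a .sas7bdat file
write_stcmd() {
    options=$1
    sasfile=$2
    base=${sasfile%.sas7bdat}   # strip the extension, keep the directory
    cp "$options" "$base.stcmd"
    # Append the copy and quit lines exactly as the recipe echoes them.
    echo "copy $sasfile delim $base.csv -v -y" >> "$base.stcmd"
    echo "quit" >> "$base.stcmd"
}

# Demo in a scratch directory with an empty stand-in data file.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/2012"
: > "$demo_dir/2012/a.sas7bdat"
# Placeholder text only; a real options file would hold StatTransfer options.
echo "PLACEHOLDER-OPTION" > "$demo_dir/statTransferOptions.txt"

write_stcmd "$demo_dir/statTransferOptions.txt" "$demo_dir/2012/a.sas7bdat"
cat "$demo_dir/2012/a.stcmd"
```

Because the options file is copied first, any change to it changes every .stcmd file, which is why listing it as an extra prerequisite of the static pattern rule keeps the outputs in sync.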

Related

How can a Makefile replacement pattern produce more than one output per input?

In our code base we have a code generator which takes foo.xyz and produces two source files foo-in.c and foo-out.c.
In an application's Makefile I would like to list the sources as:
SOURCES=main.c gadget.c foo.xyz
Then the corresponding OBJECTS variable should expand to:
OBJECTS=main.o gadget.o foo-in.o foo-out.o
but I'm unable to find whether it is possible to do this expansion generically using GNU Make. The common $(SOURCES:.c=.o) replacement pattern replaces a single source file with a single object file.
How can I write a substitution pattern which will produce multiple output files per input file?
Well, while writing the question I found a usable solution.
SOURCES=main.c gadget.c foo.xyz
OBJECTS=$(patsubst %.c,%.o,$(filter %.c,$(SOURCES))) \
$(patsubst %.xyz,%-in.o,$(filter %.xyz,$(SOURCES))) \
$(patsubst %.xyz,%-out.o,$(filter %.xyz,$(SOURCES)))
app: $(OBJECTS)
	$(LD) -o $@ $(LDFLAGS) $(OBJECTS)
%-in.c %-out.c: %.xyz
	# Very special codegen rule
	touch $(patsubst %.xyz,%-in.c,$<)
	touch $(patsubst %.xyz,%-out.c,$<)
When converting from $(SOURCES) to $(OBJECTS), apply two separate patsubst calls to the filtered-out .xyz files. This way, both the %-in.o and %-out.o files end up in the object list.
Another solution could be to create an intermediate sources list using the same trick but substituting xyz with the corresponding -in.c and -out.c patterns. Then the objects list could be created in the traditional way. An added benefit of this method would be that creating a rule which generates all source code files is trivial.
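As a sanity check on what the OBJECTS expansion above should produce, the same one-to-many mapping can be sketched in plain shell. This is not part of the makefile; the loop and the variable names are only an illustration of the mapping rule (one object per .c file, two per .xyz file).

```shell
#!/bin/sh
# Hypothetical shell mirror of the OBJECTS expansion.
sources="main.c gadget.c foo.xyz"
objects=""
for src in $sources; do
    case $src in
        *.c)   objects="$objects ${src%.c}.o" ;;
        *.xyz) objects="$objects ${src%.xyz}-in.o ${src%.xyz}-out.o" ;;
    esac
done
objects=${objects# }   # drop the leading space
echo "$objects"        # prints: main.o gadget.o foo-in.o foo-out.o
```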

How to make a single makefile that applies the same command to sub-directories?

For clarity, I am running this on windows with GnuWin32 make.
I have a set of directories with markdown files in them at several different levels - theoretically they could be in the branch nodes, but I think currently they are only in the leaf nodes. I have a set of pandoc/LaTeX commands to run to turn the markdown files into PDFs - and obviously only want to recreate the PDFs if the markdown file has been updated, so a makefile seems appropriate.
What I would like is a single makefile in the root, which iterates over any and all sub-directories (to any depth) and applies the make rule I'll specify for running pandoc.
From what I've been able to find, recursive makefiles require you to have a makefile in each sub-directory (which seems like an administrative overhead that I would like to avoid) and/or require you to list out all the sub-directories at the start of the makefile (again, would prefer to avoid this).
Theoretical folder structure:
root
|-make
|-Folder AB
| |-File1.md
| \-File2.md
|-Folder C
| \-File3.md
\-Folder D
  |-Folder E
  | \-File4.md
  \-Folder F
    \-File5.md
How do I write a makefile to deal with this situation?
Here is a small set of Makefile rules that hopefully will get you going:
%.pdf : %.md
	pandoc -o $@ --pdf-engine=xelatex $^
PDF_FILES=FolderA/File1.pdf FolderA/File2.pdf \
FolderC/File3.pdf FolderD/FolderE/File4.pdf FolderD/FolderF/File5.pdf
all: ${PDF_FILES}
Let me explain what is going on here. First we have a pattern rule that tells make how to convert a Markdown file to a PDF file. The --pdf-engine=xelatex option is here just for the purpose of illustration.
Then we need to tell Make which files to consider. We put the names together in a single variable, PDF_FILES. The value of this variable can be built by a separate script that scans all subdirectories for .md files.
Note that one has to be extra careful if filenames or directory names contain spaces.
Then we ask Make to check if any of the PDF_FILES should be updated.
If you have other targets in your makefile, make sure that all is the first non-pattern target, or call make as make all
Updating the Makefile
If the shell function works for you and basic utilities such as sed and find are available, you could make your makefile dynamic with a single line.
%.pdf : %.md
	pandoc -o $@ --pdf-engine=xelatex $^
PDF_FILES:=$(shell find -name "*.md" | xargs echo | sed 's/\.md/\.pdf/g' )
all: ${PDF_FILES}
MadScientist suggested just that in the comments
Otherwise you could implement a script using the tools available on your operating system and add an additional target update: that would compute the list of files and replace the line starting with PDF_FILES with an updated list of files.
Final version of the code that worked for Windows, based on @DmitiChubarov and @MadScientist's suggestions is as follows:
%.pdf: %.md
	pandoc $^ -o $@
PDF_FILES:=$(shell dir /s /b *.md | sed "s/\.md/\.pdf/g")
all: ${PDF_FILES}
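The file-list step can be tried on its own before wiring it into $(shell ...). The sketch below assumes a POSIX shell with find and sed; the scratch directory and file names are made up for the demo. It anchors the substitution at the end of each name (\.md$), so a ".md" appearing in the middle of a path is left alone, which the unanchored global form would rewrite.

```shell
#!/bin/sh
# Build a small scratch tree resembling the question's layout.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/FolderA" "$demo_dir/FolderD/FolderE"
: > "$demo_dir/FolderA/File1.md"
: > "$demo_dir/FolderD/FolderE/File4.md"
: > "$demo_dir/FolderA/notes.txt"   # not markdown, must be ignored

# List every .md file at any depth and map it to the matching .pdf name.
pdf_files=$(cd "$demo_dir" && find . -type f -name '*.md' | sed 's/\.md$/.pdf/' | sort)
echo "$pdf_files"
```

As noted above, names containing spaces would still need extra care once the list is handed to make.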

How to rename multiple filenames in cshell script?

I have a C shell script containing the following two lines, which create a directory and copy some files into it. The files being copied look like abc.hello, abc.name, abc.date, and so on. How can I strip the abc and copy them over as .hello, .name, .date, and so forth? I'm new to this, so any help will be appreciated!
mkdir -p $home_dir$param
cp /usr/share/skel/* $home_dir$param
You're looking for something like basename:
In Bash, for example, you could get the base name, file suffix like this:
filepath=/my/folder/readme.txt
filename=$(basename "$filepath") # $filename == "readme.txt"
extension="${filename##*.}" # $extension == "txt"
rootname="${filename%.*}" # $rootname == "readme"
ADDENDUM:
The key takeaway is "basename". Refer to the "man basename" page I linked to above. Here's another example that should make things clearer:
basename readme.txt .txt # prints "readme"
"basename" is a standard *nix command. It works in any shell; it's available on most any platform.
Going forward, I would strongly discourage you from writing scripts in csh, if you can avoid it:
bash vs csh vs others - which is better for application maintenance?
Csh Programming Considered Harmful
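Applied to the original question, the basename and parameter-expansion ideas above might look like the following sketch. It is written for sh/bash rather than csh, in line with the advice above, and the two scratch directories are hypothetical stand-ins for /usr/share/skel and $home_dir$param.

```shell
#!/bin/sh
# Hypothetical stand-ins for the skeleton directory and the new home directory.
skel_dir=$(mktemp -d)
dest_dir=$(mktemp -d)
: > "$skel_dir/abc.hello"
: > "$skel_dir/abc.name"
: > "$skel_dir/abc.date"

for path in "$skel_dir"/abc.*; do
    name=$(basename "$path")             # e.g. abc.hello
    cp "$path" "$dest_dir/${name#abc}"   # strip the leading "abc", giving .hello
done

ls -A "$dest_dir"   # -A shows the resulting dot files
```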

How to use mv command to rename multiple files in unix?

I am trying to rename multiple files with extension xyz[n] to extension xyz
example :
mv *.xyz[1] to *.xyz
but the error is coming as - " *.xyz No such file or directory"
I don't know whether mv can work directly with *, but this would work:
find ./ -name "*.xyz\[*\]" | while IFS= read -r line
do
    mv "$line" "${line%.*}.xyz"
done
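When all the files sit in one directory, a plain glob loop is a possible alternative to find. The sketch below assumes a POSIX shell; the scratch directory and file names are invented for the demo. The brackets are escaped so the shell treats them as literal characters instead of a character class.

```shell
#!/bin/sh
demo_dir=$(mktemp -d)
: > "$demo_dir/abc.xyz[1]"
: > "$demo_dir/def.xyz[2]"
: > "$demo_dir/ghi.jpg"

for f in "$demo_dir"/*.xyz\[*\]; do
    [ -e "$f" ] || continue     # the glob matched nothing
    mv -- "$f" "${f%\[*\]}"     # drop the trailing [n]
done

ls "$demo_dir"
```

This also shows why the single mv command in the question fails: the shell expands the patterns before mv runs, so mv never sees a source-to-target mapping.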
Let's say we have some files as shown below. Now I want to remove the part -(ab...) from those file names.
> ls -1 foo*
foo-bar-(ab-4529111094).txt
foo-bar-foo-bar-(ab-189534).txt
foo-bar-foo-bar-bar-(ab-24937932201).txt
So the expected file names would be :
> ls -1 foo*
foo-bar-foo-bar-bar.txt
foo-bar-foo-bar.txt
foo-bar.txt
>
Below is a simple way to do it.
> ls -1 | nawk '/foo-bar-/{old=$0;gsub(/-\(.*\)/,"",$0);system("mv \""old"\" "$0)}'
for detailed explanation check here
Here is another way using the automated tools of StringSolver. Let us say your first file is named abc.xyz[1] a second named def.xyz[1] and a third named ghi.jpg (not the same extension as the previous two).
First, filter the files you want by giving examples (ok and notok are any words such that the first describes the accepted files):
filter abc.xyz[1] ok def.xyz[1] ok ghi.jpg notok
Then perform the move with the filter it created:
mv abc.xyz[1] abc.xyz
mv --filter --all
The second line generalizes the first transformation on all files ending with .xyz[1].
The last two lines can also be abbreviated in just one, which performs the moves and immediately generalizes it:
mv --filter --all abc.xyz[1] abc.xyz
DISCLAIMER: I am a co-author of this work for academic purposes. Other examples are available on youtube.
I don't think mv can operate on multiple files directly without a loop.
Use the rename command instead. It takes a regular expression, which is easy to use once mastered and more powerful. (This is the Perl-based rename; the util-linux rename takes different arguments.)
rename 's/^text-to-replace/new-text-you-want/' text-to-replace*
For example, to rename all .jar files in a directory to .jar_bak:
rename 's/\.jar$/.jar_bak/' *.jar

How can I convert a dictionary file (.dic) with an affix file (.aff) to create a list of words?

I'm looking at a dictionary file (".dic") and its associated "aff" file. What I'm trying to do is combine the rules in the "aff" file with the words in the "dic" file to create a global list of all words contained within the dictionary file.
The documentation behind these files is difficult to find. Does anyone know of a resource that I can learn from?
Is there any code out there that will already do this (am I duplicating an effort that I don't need to)?
thanks!
Following Pillowcase's answer, here is an example of usage:
# Download dictionary
wget -O ./dic/es_ES.aff "https://raw.githubusercontent.com/sbosio/rla-es/master/source-code/hispalabras-0.1/hispalabras/es_ES.aff"
wget -O ./dic/es_ES.dic "https://raw.githubusercontent.com/sbosio/rla-es/master/source-code/hispalabras-0.1/hispalabras/es_ES.dic"
# Compile program
wget -O ./dic/unmunch.cxx "https://raw.githubusercontent.com/hunspell/hunspell/master/src/tools/unmunch.cxx"
wget -O ./dic/unmunch.h "https://raw.githubusercontent.com/hunspell/hunspell/master/src/tools/unmunch.h"
g++ -o ./dic/unmunch ./dic/unmunch.cxx
# Generate dictionary
./dic/unmunch ./dic/es_ES.dic ./dic/es_ES.aff 2> /dev/null > ./dic/es_ES.txt.bk
sort ./dic/es_ES.txt.bk > ./dic/es_ES.txt # Optional
rm ./dic/es_ES.txt.bk # Optional
You need a utility called munch.exe to apply the aff rules to the dic file.
These could be Hunspell dictionary files. Unfortunately, the command to create a "global" or unmunched wordlist only fully supports simple .aff and .dic files.
From the documentation.
unmunch: list all recognized words of a MySpell dictionary
Syntax:
unmunch dic_file affix_file
Try it and see what happens. For generating all wordforms for one word only, look here.

Resources