I'd like to store a knit()ted document directly in R as an R object, i.e. as a character vector.
I know I can do this by knit()ing to a tempfile() and then importing the result, like so:
library(knitr)
library(readr)
ex_file <- tempfile(fileext = ".tex")
knitr::knit(text = "foo", output = ex_file)
knitted_obj <- readr::read_file(ex_file)
knitted_obj
returns
# [1] "foo\n"
as intended.
Is there a way to do this without using a tempfile() and by directly "piping" the result to a vector?
Why on earth would I want this, you ask?
The *.tex string will be programmatically saved to disc and rendered to PDF later. Reading the rendered *.tex back from disc in downstream functions would make the code more complicated.
Caching is just a whole lot easier, as is moving that cache to a different machine.
I am just really scared of side effects in general and file system shenanigans across machines/OSes in particular. I want to isolate those to as few (print(), save(), plot()) functions as possible.
Does that make me a bad (or just OCD) R developer?
It should be as straightforward as a single line like this:
knitted_obj = knitr::knit(text = "foo")
You may want to read the help page ?knitr::knit again to know what it returns.
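In other words, when output is NULL (the default) and text is supplied, knit() does not write a file at all: it returns the knitted result as a character vector. A quick check (the exact trailing newline may differ from the read_file() version):
# With output = NULL (the default) and a text input, knit() returns the
# knitted result as a character vector instead of writing a file.
knitted_obj <- knitr::knit(text = "foo", quiet = TRUE)
knitted_obj
# [1] "foo"   (a trailing newline may or may not be present, unlike read_file())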
You can use con <- textConnection("varname", "w") to create a connection that writes its output to variable varname, and use output=con in the call to knit(). For example:
library(knitr)
con <- textConnection("knitted_obj", "w")
knit(text="foo", output = con)
close(con)
knitted_obj
returns the same as your tempfile approach, except for the newline. Multiple lines will show up as different elements of knitted_obj. I haven't timed it, but text connections have a reputation for being slow, so this isn't necessarily as fast as writing to the file system.
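If you want a single string comparable to the read_file() result (e.g. to hand to a downstream function that renders the PDF), you can collapse the vector yourself; a small sketch:
# textConnection() captures one line per element; paste them back together.
knitted_one_string <- paste(knitted_obj, collapse = "\n")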
Related
Is it possible to get R to write a plot in bitmap format (e.g. PNG) to standard output? If so, how?
Specifically I would like to run Rscript myscript.R | other_prog_that_reads_a_png_from_stdin. I realise it's possible to create a temporary file and use that, but it's inconvenient as there will potentially be many copies of this pipeline running at the same time, necessitating schemes for choosing unique filenames and removing them afterwards.
I have so far tried setting outf <- file("stdout") and then running either bitmap(file=outf, ...) or png(filename=outf, ...), but both complain ('file' must be a non-empty character string and invalid 'filename' argument, respectively), which is in line with the official documentation for these functions.
Since I was able to persuade R's read.table() function to read from standard input, I'm hoping there's a way. I wasn't able to find anything relevant here on SO by searching for [r] stdout plot, or any of the variations with stdout replaced by "standard output" (with or without double quotes), and/or plot replaced by png.
Thanks!
Unfortunately {grDevices} (and, by implication, {ggplot2}) fundamentally does not seem to support this.
The obvious approach to work around this is: let a graphics device write to a temporary file, and then read that temporary file back into the R session and write it to stdout.
But this fails because, on the one hand, the data cannot be read into a string: character strings in R do not support embedded null characters (if you try you’ll get an error such as “nul character not allowed”). On the other hand, readBin and writeBin fail because writeBin categorically refuses to write to any device that’s hooked up to stdout, which is in text mode (ignoring the fact that, on POSIX systems, the two are identical).
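Both failure modes are easy to reproduce (a quick sketch; the exact error wording may vary between R versions):
# (1) Binary image data cannot be stored in an R character string:
rawToChar(as.raw(c(0x89, 0x50, 0x4e, 0x47, 0x00)))
# Error: embedded nul in string

# (2) writeBin() refuses stdout(), which is a text-mode connection:
writeBin(as.raw(1:4), stdout())
# Error: can only write to a binary connection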
This can only be circumvented in incredibly hacky ways, e.g. by opening a binary pipe to a command such as cat:
dev_stdout = function (underlying_device = png, ...) {
    # Open the underlying graphics device on a temporary file and return
    # that file's path, to be handed to dev_stdout_off() later.
    filename = tempfile()
    underlying_device(filename, ...)
    filename
}

dev_stdout_off = function (filename) {
    # Close the device, then stream the finished file to stdout through a
    # binary pipe to `cat`, and clean up the temporary file afterwards.
    dev.off()
    on.exit(unlink(filename))
    fake_stdout = pipe('cat', 'wb')
    on.exit(close(fake_stdout), add = TRUE)
    writeBin(readBin(filename, 'raw', file.info(filename)$size), fake_stdout)
}
To use it:
tmp_dev = dev_stdout()
contour(volcano)
dev_stdout_off(tmp_dev)
On systems where /dev/stdout exists (which are most but not all POSIX systems), the dev_stdout_off function can be simplified slightly by removing the command redirection:
dev_stdout_off = function (filename) {
    dev.off()
    on.exit(unlink(filename))
    fake_stdout = file('/dev/stdout', 'wb')
    on.exit(close(fake_stdout), add = TRUE)
    writeBin(readBin(filename, 'raw', file.info(filename)$size), fake_stdout)
}
This might not be a complete answer, but it's the best I've got: can you open a connection using the stdout() command? I know that png() will switch the output device to a file, but that's not what you want, so it might work to simply substitute stdout() for the file. I don't know enough about standard outputs to test this theory, however.
The help page suggests that this connection might be text-only. In that case, a solution might be to generate a random string to use as a filename, and pass the name of the file through stdout so that the next step in your pipeline knows where to find your file.
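A hedged sketch of that filename-passing idea (the plot and file are just illustrations):
# Render to a uniquely named temporary file, then print its path on stdout
# so the next command in the pipeline knows where to find (and remove) it.
out_file <- tempfile(fileext = ".png")
png(filename = out_file)
contour(volcano)
dev.off()
cat(out_file, "\n")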
I'm attempting to make my code more modular: data loading and cleaning in one script, analysis in another, etc. If I were using R scripts, this would be a simple matter of calling source on data_setup.R inside analysis.R, but I'd like to document the decisions I'm making in an Rmarkdown document for both data setup and analysis. So I'm trying to write some sort of source_rmd function that will allow me to source the code from data_setup.Rmd into analysis.Rmd.
What I've tried so far:
The answer to How to source R Markdown file like `source('myfile.r')`? doesn't work if there are any repeated chunk names (a problem since the chunk named setup has special behavior in RStudio's notebook handling). How to combine two RMarkdown (.Rmd) files into a single output? wants to combine entire documents, not just the code from one, and also requires unique chunk names. I've tried using knit_expand as recommended in Generate Dynamic R Markdown Blocks, but I have to name chunks with variables in double curly-braces, and I'd really like a way to make this easy for my collaborators to use as well. And using knit_child as recommended in How to nest knit calls to fix duplicate chunk label errors? still gives me duplicate label errors.
After some further searching, I've found a solution. There is a package option in knitr that can be set to change the behavior for handling duplicate chunks, appending a number after their label rather than failing with an error. See https://github.com/yihui/knitr/issues/957.
To set this option, use options(knitr.duplicate.label = 'allow').
For the sake of completeness, the full code for the function I've written is
source_rmd <- function(file, local = FALSE, ...){
  options(knitr.duplicate.label = 'allow')

  # Extract the R code from the .Rmd into a temporary .R file, source it,
  # and remove the temporary file again on exit.
  tempR <- tempfile(tmpdir = ".", fileext = ".R")
  on.exit(unlink(tempR))
  knitr::purl(file, output = tempR, quiet = TRUE)

  # Source into the global environment by default; with local = TRUE,
  # source into the caller's environment instead.
  envir <- if (isTRUE(local)) parent.frame() else globalenv()
  source(tempR, local = envir, ...)
}
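Usage inside analysis.Rmd is then simply (file name taken from the question):
# Pull the data-setup code into the current session.
source_rmd("data_setup.Rmd")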
I am writing an R script that reads in a template .R file and a list of dates, and creates a bunch of folders corresponding to the dates, each containing a copy of the .R file in which text substitution has been performed in R to customize the script for the given date.
I'm stuck on the part where I write out the .R file though, because the formatting and/or character representation keeps getting screwed up.
Here's a minimal, reproducible example:
RMapsDemo <- readLines("https://raw.githubusercontent.com/hack-r/RMapsDemo/master/RMapsDemo.R")
RMapsDemo <- gsub("## File: RMapsDemo.R", "## File: RMapsDemo.R ####", RMapsDemo)
save(RMapsDemo, file = "RMapsDemo.R") # Doesn't work right
save(RMapsDemo, file = "RMapsDemo.R", ascii = T) # Doesn't work right
dput(RMapsDemo, file = "RMapsDemo.R") # Close, but no cigar
dput(RMapsDemo, file = "RMapsDemo.R", control = c("keepNA", "keepInteger")) # Close, but no cigar
Ricardo Saporta pointed out the solution in the comments -- use writeLines.
I feel stupid for not thinking of this myself. It works beautifully.
writeLines(RMapsDemo, con = "RMapsDemo.R")
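Tying it back to the per-date workflow described in the question, a hedged sketch (the dates and the substituted marker are made up for illustration):
template <- readLines("https://raw.githubusercontent.com/hack-r/RMapsDemo/master/RMapsDemo.R")
dates    <- c("2015-01-01", "2015-02-01")  # illustrative dates

for (d in dates) {
  dir.create(d, showWarnings = FALSE)
  # Customize the template for this date; the pattern/replacement below is
  # hypothetical - use whatever placeholder your real template contains.
  customized <- gsub("## File: RMapsDemo.R",
                     paste0("## File: RMapsDemo.R ## ", d),
                     template)
  writeLines(customized, con = file.path(d, "RMapsDemo.R"))
}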
I have a bunch of CSV files and I would like to perform the same analysis (in R) on the data within each file. Firstly, I assume each file must be read into R (as opposed to running a function on the CSV and providing output, like a sed script).
What is the best way to input numerous CSV files to R, in order to perform the analysis and then output separate results for each input?
Thanks (btw I'm a complete R newbie)
You could go for Sean's option, but it's going to lead to several problems:
You'll end up with a lot of unrelated objects in the environment, with the same name as the file they belong to. This is a problem because...
For loops can be pretty slow, and because you've got this big pile of unrelated objects, you're going to have to rely on for loops over the filenames for each subsequent piece of analysis - otherwise, how the heck are you going to remember what the objects are named so that you can call them?
Calling objects by pasting their names in as strings - which you'll have to do, because, again, your only record of what the object is called is in this list of strings - is a real pain. Have you ever tried to call an object when you can't write its name in the code? I have, and it's horrifying.
A better way of doing it might be with lapply().
# List files
filelist <- list.files(pattern = "*.csv")

# Now we use lapply to perform a set of operations
# on each entry in the list of filenames.
to_dispose_of <- lapply(filelist, function(x) {
    # Read in the file specified by 'x' - an entry in filelist
    data.df <- read.csv(x, skip = 1, header = TRUE)
    # Store the filename, minus .csv. This will be important later.
    filename <- substr(x = x, start = 1, stop = (nchar(x) - 4))
    # Your analysis work goes here. You only have to write it out once
    # to perform it on each individual file.
    ...
    # Eventually you'll end up with a data frame or a vector of analysis
    # to write out. Great! Since you've kept the value of x around,
    # you can do that trivially
    write.table(x = data_to_output,
                file = paste0(filename, "_analysis.csv"),
                sep = ",")
})
And done.
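If you would rather keep the results in the session instead of writing each one straight back to disk, a small variation is to return the data and name the list by file (a sketch; the downstream analysis is up to you):
# Read every file into one named list; later steps can lapply() over it
# instead of juggling loose, individually named objects.
filelist  <- list.files(pattern = "*.csv")
data_list <- lapply(filelist, read.csv, skip = 1, header = TRUE)
names(data_list) <- tools::file_path_sans_ext(filelist)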
You can try the following code after putting all the csv files in the same directory.
names = list.files(pattern = "*.csv")  # csv file names
for (i in 1:length(names)) { assign(names[i], read.csv(names[i], skip = 1, header = TRUE)) }
Hope this helps!
I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid ending up with a folder full of output (plots, RData files, csv, etc.) and, after some time, no clue where each file came from or how it was produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is.
So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function MetaInfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, and the function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. Particularly, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by @hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
  # message  - character string - Any message to be written into the information
  #            file (e.g., data used).
  # filename - character string - the name of the txt file (including relative
  #            path). Should be the same as the output file it describes (RData,
  #            csv, pdf).
  #
  if (is.null(filename))
  {
    stop('Provide an output filename - parameter filename.')
  }
  filename <- paste(filename, '.txt', sep='')
  # Try to get as close as possible to getting the file name from which the
  # function is called.
  source.file <- lapply(sys.frames(), function(x) x$ofile)
  source.file <- Filter(Negate(is.null), source.file)
  t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
              silent=TRUE)
  if (class(t.sf) == 'try-error')
  {
    source.file <- NULL
  }
  func <- deparse(sys.call(-1))
  # MetaInfo isn't always called from within another function, so func could
  # return as NULL or as general environment.
  if (any(grepl('eval', func, ignore.case=TRUE)))
  {
    func <- NULL
  }
  time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
  git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
  meta <- list(Message=message,
               Source=paste(source.file, ' on ', time, sep=''),
               Functions=func,
               System=Sys.info(),
               Session=sessionInfo(),
               Git.hash=git.h)
  sink(file=filename)
  print(meta)
  sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')

RandomPlot <- function(x, y)
{
  fn <- 'random_plot'
  pdf(file=paste(fn, '.pdf', sep=''))
  plot(x, y)
  MetaInfo(message=NULL, filename=fn)
  dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called ProjectTemplate that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There are also a number of helpful documents, like this one: Workflow of statistical data analysis by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question betrays a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file is produced. You should in fact be able to wipe out any output, and reproduce it by rerunning code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or ProjectTemplate, which I will try to pick up on. Thanks again for the suggestions, @noah and @alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like its predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is a package developed by RStudio called pins that might address this problem.
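A minimal sketch of how pins might be used for this, assuming the newer pins API (board_local(), pin_write(), pin_read()); the pin name is made up:
library(pins)

# Register a local board and pin an R object to it under a name.
board <- board_local()
pin_write(board, volcano, name = "volcano_matrix")

# Later (or after syncing the board to another machine), read it back.
vol <- pin_read(board, "volcano_matrix")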