I sometimes work with lots of large objects, and because of memory issues between chunks it would be nice to have a fresh start. Consider the following example:
Warning: I have 8 GB of RAM. If you don't have much, this example might eat it all up.
<<chunk1>>=
a <- 1:200000000
@
<<chunk2>>=
b <- 1:200000000
@
<<chunk3>>=
c <- 1:200000000
@
The solution in this case is:
<<chunk1>>=
a <- 1:200000000
@
<<chunk2>>=
rm(a)
gc()
b <- 1:200000000
@
<<chunk3>>=
rm(b)
gc()
c <- 1:200000000
@
However, in my example (which I cannot post because it relies on a large dataset), even after I remove all of the objects and run gc(), R does not release all of the memory (only some of it). The reason is found in ?gc:
However, it can be useful to call ‘gc’ after a large object has
been removed, as this may prompt R to return memory to the
operating system.
Note the important word may. The R documentation hedges with may like this in a lot of situations, so this is not a bug.
Is there a chunk option that lets knitr start a new R session for each chunk?
My recommendation would be to create an individual .Rnw file for each of the major tasks, knit them to .tex files, and then use \include or \input in a parent .Rnw file to build the full project. Control the building of the project via a makefile.
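If you drive the build from R rather than make, a minimal sketch could look roughly like the following (the file names analysis1.Rnw, analysis2.Rnw, analysis3.Rnw and parent.Rnw are hypothetical). Each child document is knitted in its own fresh R process, so its memory is returned to the OS when that process exits:
# Sketch only: knit each child in a separate, fresh R process,
# then build the parent that \input{}s the generated .tex files.
rscript <- file.path(R.home("bin"), "Rscript")
children <- c("analysis1.Rnw", "analysis2.Rnw", "analysis3.Rnw")  # hypothetical names
for (child in children) {
  system2(rscript, c("-e", shQuote(sprintf("knitr::knit('%s')", child))))
}
knitr::knit2pdf("parent.Rnw")  # parent.Rnw pulls in analysis1.tex, analysis2.tex, ...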
However, to address the specific question of using a fresh R session for each chunk, you could use the R package subprocess to spawn an R session, run the needed code, extract the results, and then kill the spawned session.
A simple example .Rnw file:
\documentclass{article}
\usepackage{fullpage}
\begin{document}
<<include = FALSE>>=
knitr::opts_chunk$set(collapse = FALSE)
@
<<>>=
library(subprocess)
# define a function to identify the R binary
R_binary <- function() {
  R_exe <- ifelse(tolower(.Platform$OS.type) == "windows", "R.exe", "R")
  return(file.path(R.home("bin"), R_exe))
}
@
<<>>=
# Start a subprocess running vanilla R.
subR <- subprocess::spawn_process(R_binary(), c("--vanilla --quiet"))
Sys.sleep(2) # wait for the process to spawn
# write to the process
subprocess::process_write(subR, "y <- rnorm(100, mean = 2)\n")
subprocess::process_write(subR, "summary(y)\n")
# read from the process
subprocess::process_read(subR, PIPE_STDOUT)
# kill the process before moving on.
subprocess::process_kill(subR)
@
<<>>=
print(sessionInfo(), local = FALSE)
@
\end{document}
Knitting this generates a PDF with the subprocess output shown inline.
See below for my reprex of my issues with source, <-, <<-, environments, etc.
There are 3 files: testrun.R, which calls inputs.R and CODE.R.
# testrun.R (file 1)
today <<- "abcdef"
source("inputs.R")
for (DC in c("a", "b")) {
  usedlater_3 <- paste("X", DC, used_later2)
  print(usedlater_3)
  source("CODE.R", local = TRUE)
}
final_output <- paste(OD_output, used_later2, usedlater_3)
print(final_output)
# #---- file 2
# # inputs.R
# used_later1 <- paste(today, "_later")
# used_later2 <- "l2"
#
# #---- file 3
# # CODE.R
# OD_output <- paste(DC, today, used_later1, usedlater_2, usedlater_3)
I'm afraid I didn't learn R or CS in a proper way so I'm trying to catch up now. Any bigger picture lessons would be helpful. Previously, I've been relying on a global environment where I keep everything (and save/keep between sessions), but now I'm trying to make everything reproducible, so I'm using RStudio to run local jobs that start from scratch.
I've been trying different combinations of <-, <<-, and source(local = TRUE) (instead of local = FALSE). I do use functions for pieces of code where I know the inputs I need and the outputs I want, but as you can see, CODE.R uses variables from testrun.R, from the loop inside testrun.R, and from inputs.R. Converting some of the code into functions might help, but I'd like to know of alternatives as well, given this case.
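For example, the kind of function-based refactor I have in mind would pass everything CODE.R needs explicitly (a rough sketch only; the function name and argument list are just illustrative):
# CODE.R rewritten as a function with explicit inputs (sketch)
run_code <- function(DC, today, used_later1, used_later2, usedlater_3) {
  paste(DC, today, used_later1, used_later2, usedlater_3)
}
# and in testrun.R:
for (DC in c("a", "b")) {
  usedlater_3 <- paste("X", DC, used_later2)
  OD_output <- run_code(DC, today, used_later1, used_later2, usedlater_3)
}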
Finally, here is my own troubleshooting log so you can see my thought process:
first run: the variable today wasn't found, so I made today <<- "abcdef" a double-arrow assignment
second run: DC not found, so I will switch to local = TRUE
third run: but now usedlater_2 is not found, so I will change usedlater_2 to <<-. (What about usedlater_1? Why didn't that show up as an error? We'll see...)
result of third run: usedlater_2 is still not found when CODE.R needs it. Out of ideas. Note: used_later2 was found when creating usedlater_3 in the for loop in testrun.R.
First question here; I hope I did the asking part right.
I'm trying to write a short piece of R code that will create a vector with the lengths of all of the audio files in my 'Music' folder. I'm using RStudio 0.98.501 with R 3.0.3 on i686-pc-linux-gnu (32-bit). I use the tuneR package to extract info about the lengths of the songs. Here's the problem: the first MP3 file is read fine, but when I do the same with the second MP3, I get 'R Session aborted, R encountered a fatal error, the session will be terminated'.
I'm working on an Intel® Atom™ CPU N2800 @ 1.86GHz × 4 with 2 GB of memory, running Ubuntu 13.10.
My code is below; just change the directory to the one where your Music folder is.
library(tuneR)
# Set your working directory here
ddpath <- "/home/daniel/"
wdpath <- ddpath
setwd(wdpath)
# Create a character vector with all filenames
filenames <- list.files("Music", pattern="*.mp3",
full.names=TRUE, recursive=TRUE)
# How many audio files do we have?
numTracks <- length(filenames)
# Vector to store lengths
lengthVector <- numeric(0)
# Here problem arises
for (i in 1:numTracks){
  numWave <- readMP3(filenames[i])
  lengthSec <- length(numWave@left)/numWave@samp.rate
  lengthVector <- c(lengthVector, lengthSec)
  rm(numWave)
}
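This doesn't fix whatever makes readMP3 crash, but as a workaround sketch each file could be read in a throwaway Rscript child process, so a crash only kills that child and the length (or NA) comes back over stdout. This assumes tuneR is installed and that the file paths contain no single quotes:
rscript <- file.path(R.home("bin"), "Rscript")
get_length <- function(f) {
  # build a one-liner for the child process; breaks if f contains a single quote
  expr <- sprintf("w <- tuneR::readMP3('%s'); cat(length(w@left) / w@samp.rate)", f)
  out <- system2(rscript, c("-e", shQuote(expr)), stdout = TRUE)
  if (length(out) == 0) return(NA_real_)  # child crashed or printed nothing
  as.numeric(out[1])
}
lengthVector <- vapply(filenames, get_length, numeric(1))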
I am trying to quit and restart R from within R. The reason for this is that my job takes up a lot of memory, and none of the common options for cleaning R's workspace reclaim the RAM taken up by R. gc(), closeAllConnections(), and rm(list = ls(all = TRUE)) clear the workspace, but when I examine the processes in the Windows Task Manager, R's usage of RAM remains the same. The memory is only reclaimed when the R session is restarted.
I have tried the suggestion from this post:
Quit and restart a clean R session from within R?
but it doesn't work on my machine. It closes R, but doesn't open it again. I am running R x64 3.0.2 through RGui (64-bit) on Windows 7. Perhaps it is just a simple tweak of the first line in the above post:
makeActiveBinding("refresh", function() { shell("Rgui"); q("no") }, .GlobalEnv)
but I am unsure how it needs to be changed.
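One guess (only a sketch, untested): Rgui may simply not be on the PATH for shell(), so spelling out the full path to Rgui.exe of the currently running R and not waiting for the new process might be enough:
rgui <- file.path(R.home("bin"), "Rgui.exe")   # assumption: the original fails because "Rgui" is not on PATH
makeActiveBinding("refresh",
                  function() { shell(paste0('"', rgui, '"'), wait = FALSE); q("no") },
                  .GlobalEnv)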
Here is the code. It is not fully reproducible, because it needs a large set of files that are read in and scraped. What eats memory is scrape.func(); everything else is pretty small. In the code, I apply the scrape function to all files in one folder. Eventually, I would like to apply it to a set of folders, each with a large number of files (~12,000 per folder; 50+ folders). Doing so at present is impossible, since R runs out of memory pretty quickly.
library(XML)
library(R.utils)
## define scraper function
scrape.func <- function(file.name){
  require(XML)
  ## read in (zipped) html file
  txt <- readLines(gunzip(file.name))
  ## parse html
  doc <- htmlTreeParse(txt, useInternalNodes = TRUE)
  ## extract information
  top.data <- xpathSApply(doc, "//td[@valign='top']", xmlValue)
  id <- top.data[which(top.data == "I.D.:") + 1]
  pub.date <- top.data[which(top.data == "Data publicarii:") + 1]
  doc.type <- top.data[which(top.data == "Tipul documentului:") + 1]
  ## tie into dataframe
  df <- data.frame(id, pub.date, doc.type, stringsAsFactors = FALSE)
  ## clean up (this has to happen before return(), or it is never reached)
  closeAllConnections()
  rm(txt, top.data, doc)
  gc()
  return(df)
}
## where to store the scraped data
file.create("/extract.top.data.2008.1.csv")
## extract the list of files from the target folder
write(list.files(path = "/2008/01"),
file = "/list.files.2008.1.txt")
## count the number of files
length.list <- length(readLines("/list.files.2008.1.txt"))
length.list <- length.list - 1
## read in filename by filename and scrape
for (i in 0:length.list){
  ## read in line by line
  line <- scan("/list.files.2008.1.txt", '',
               skip = i, nlines = 1, sep = '\n', quiet = TRUE)
  ## catch the full path
  filename <- paste0("/2008/01/", as.character(line))
  ## scrape
  data <- scrape.func(filename)
  ## append output to results file
  write.table(data, file = "/extract.top.data.2008.1.csv",
              append = TRUE, sep = ",", col.names = FALSE)
  ## rezip the html
  filename2 <- sub(".gz", "", filename)
  gzip(filename2)
}
Many thanks in advance,
Marko
I also did some web scraping and ran directly into the same problem as you, and it drove me crazy. Although I'm running a modern OS (Windows 10), the memory is still not released from time to time. After having a look at the R FAQ I went for CleanMem, where you can set up an automated memory cleaner to run every 5 minutes or so. Be sure to run
rm(list = ls())
gc()
closeAllConnections()
beforehand, so that R releases the memory.
Then use CleanMem so that the OS will notice there is free memory.
Suppose I have an object x in my current session:
x <- 1
How can I use this object in an Sweave or knitr document, without having to assign it explicitly:
\documentclass{article}
\begin{document}
<<>>=
print(x)
@
\end{document}
The reason I am asking is that I want to write an R script that imports data and then produces a report for each subject using an Sweave template.
I would take a slightly different approach to this, since using global variables reduces the reproducibility of the analysis. I use brew + Sweave/knitr to achieve this. Here is a simple example.
# brew template: "template.brew"
\documentclass{article}
\begin{document}
<<>>=
print(<%= x %>)
@
\end{document}
# function to write report
write_report <- function(x){
rnw_file <- sprintf('file_%s.rnw', x)
brew::brew('template.brew', rnw_file)
Sweave(rnw_file)
tex_file <- sprintf('file_%s.tex', x)
tools::texi2pdf(tex_file, clean = TRUE, quiet = TRUE)
}
# produce reports
dat <- 1:10
plyr::l_ply(dat, function(x) write_report(x))
I think it just works. If your Sweave file is named "temp.Rnw", just run
> x <- 5
> Sweave("temp.Rnw")
You'll have to worry about naming the resulting output properly so each report doesn't get overwritten.
Both Sweave and knitr make use of the global environment (see globalenv()) when evaluating R code chunks, so whatever is in your global environment can be used in your document. (Strictly speaking, knitr uses the parent frame, parent.frame(), which is globalenv() in most cases.)
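So a per-subject loop can simply assign into the global environment before each knit. A minimal sketch (the file name template.Rnw, the data file pattern, and the use of knitr::knit2pdf are my assumptions, not part of the question):
subjects <- c("s01", "s02", "s03")              # hypothetical subject ids
for (subj in subjects) {
  x <- read.csv(sprintf("data_%s.csv", subj))   # x ends up in globalenv()
  knitr::knit2pdf("template.Rnw", output = sprintf("report_%s.tex", subj))
}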
Another option I have used in the past is to have the Sweave code read a file written from the main session.
In my R session:
write.csv(x, "tabletoberead.csv")
In my Sweave document:
<<label=label, echo=FALSE>>=
datatobeused<-read.csv("tabletoberead.csv")
...more manipulations on data ....
@
Obviously you should include code to stop if the file can't be found.
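For instance, the guard could be as simple as this sketch:
if (!file.exists("tabletoberead.csv")) stop("input file 'tabletoberead.csv' not found")
datatobeused <- read.csv("tabletoberead.csv")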
...besides the fact that Rscript is invoked with #!/usr/bin/env Rscript and littler with #!/usr/local/bin/r (on my system) in the first line of the script file. I've found certain differences in execution speed (it seems like littler is a bit slower).
I've created two dummy scripts, ran each 1000 times, and compared the average execution times.
Here's the Rscript file:
#!/usr/bin/env Rscript
btime <- proc.time()
x <- rnorm(100)
print(x)
print(plot(x))
etime <- proc.time()
tm <- etime - btime
sink(file = "rscript.r.out", append = TRUE)
cat(paste(tm[1:3], collapse = ";"), "\n")
sink()
print(tm)
and here's the littler file:
#!/usr/local/bin/r
btime <- proc.time()
x <- rnorm(100)
print(x)
print(plot(x))
etime <- proc.time()
tm <- etime - btime
sink(file = "little.r.out", append = TRUE)
cat(paste(tm[1:3], collapse = ";"), "\n")
sink()
print(tm)
As you can see, they are almost identical (only the first line and the sink file argument differ). The output is sinked to a text file, then imported into R with read.table. I created a bash script to execute each script 1000 times and then calculated the averages.
Here's the bash script:
for i in `seq 1000`
do
./$1
echo "####################"
echo "Iteration #$i"
echo "####################"
done
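For reference, the sinked timings can then be read back in roughly like this (a sketch; the L*/R* column names in the output below are just read.table defaults):
lit <- read.table("little.r.out", sep = ";")   # columns: user, system, elapsed
rsc <- read.table("rscript.r.out", sep = ";")
colMeans(lit)
sapply(lit, median)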
And the results are:
# littler script
> mean(lit)
user system elapsed
0.489327 0.035458 0.588647
> sapply(lit, median)
L1 L2 L3
0.490 0.036 0.609
# Rscript
> mean(rsc)
user system elapsed
0.219334 0.008042 0.274017
> sapply(rsc, median)
R1 R2 R3
0.220 0.007 0.258
Long story short: besides the (obvious) execution-time difference, is there some other difference? The more important question is: why should or shouldn't you prefer littler over Rscript (or vice versa)?
A couple of quick comments:
The path /usr/local/bin/r is arbitrary; you can use /usr/bin/env r as well, as we do in some examples. As I recall, that limits what other arguments you can give to r, as it takes only one argument when invoked via env.
I don't understand your benchmark, and why you'd do it that way. We do have timing comparisons in the sources, see tests/timing.sh and tests/timing2.sh. Maybe you want to split the test between startup and graph creation or whatever you are after.
Whenever we ran those tests, littler won. (It still won when I re-ran them just now.) That made sense to us because if you look at the sources to Rscript.exe, it works differently, setting up the environment and a command string before eventually calling execv(cmd, av). littler can start a little quicker.
The main price is portability. The way littler is built, it won't make it to Windows. Or at least not easily. OTOH we have RInside ported so if someone really wanted to...
Littler came first in September 2006 versus Rscript which came with R 2.5.0 in April 2007.
Rscript is now everywhere where R is. That is a big advantage.
Command-line options are a little more sensible for littler in my view.
Both work with the CRAN packages getopt and optparse for option parsing; see the small sketch after these comments.
So it's a personal preference. I co-wrote littler, learned a lot doing that (e.g. for RInside), and still find it useful -- so I use it dozens of times each day. It drives CRANberries. It drives cran2deb. Your mileage may, as they say, vary.
Disclaimer: littler is one of my projects.
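A tiny option-parsing sketch that runs under either front end (the --n flag is purely illustrative; with littler you would, I believe, pass its argv vector to parse_args explicitly):
#!/usr/bin/env Rscript
library(optparse)
parser <- OptionParser(option_list = list(
  make_option("--n", type = "integer", default = 100, help = "number of random draws")
))
opts <- parse_args(parser)          # under littler: parse_args(parser, args = argv)
print(summary(rnorm(opts$n)))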
Postscriptum: I would have written the test as
fun <- function() { x <- rnorm(100); print(x); print(plot(x)) }
replicate(N, system.time(fun())["elapsed"])
or even
mean(replicate(N, system.time(fun())["elapsed"]), trim = 0.05)
to get rid of the outliers. Moreover, you essentially measure only I/O (a print and a plot), both of which will come from the R library, so I would expect little difference.