Consider the following code:
library(cachem)
library(memoise)
cache.dir <- "/Users/abcd/Desktop/temp_cache/"
cache <- cachem::cache_disk(dir = cache.dir, max_size = 1024^2)
fun <- function (x) {x^2}
fun.memo <- memoise(f = fun, cache = cache)
res.1 <- fun.memo(x = 2)
res.2 <- fun.memo(x = 3)
So far so good. I can compute fun.memo once and retrieve its results later by calling it again.
Now I have the following "problem": I have a lengthy script with several memoised function calls. At the end I just want to further process the output of the last function call, which depends on the output of memoised function calls further up in the script. Now it would be nice if I can somehow retrieve the memoised objects directly from the .rds file in cache.dir. This would avoid the lengthy script on top (not for performance reasons [memoise], but to avoid lengthy code). I am thinking about something like:
setwd(cache.dir)
res.y <- readRDS(paste0(my.hash.1, ".rds"))
res.z <- readRDS(paste0(my.hash.2, ".rds"))
However, I can't generate those hashes in the filenames again:
rlang::hash(x = res.1)
rlang::hash(x = res.2)
rlang::hash(x = fun)
rlang::hash(x = fun.memo)
all yield different hashes. It seems that the hash generated within memoise is not the hash that gets written into the .rds filename.
I know that retrieving the objects like that is sub-optimal since then it is not clear what arguments they resulted from. Still it would be nice to avoid the lengthy code on top. Of course, I could wrap all the preceding code into a function or a script and source() it, but that's not the point here. Any advice?
I think you are somewhat wasting your time; but if you look inside memoise internals you can see how the keys are determined and you can hack your way to what the chosen key hashes are ...
because in this case there arent _additionals I can boil it down ...
library(cachem)
library(memoise)
cache.dir <- tempdir(check=TRUE)
cache <- cachem::cache_disk(dir = cache.dir, max_size = 1024^2)
fun <- function (x) {x^2}
fun.memo <- memoise(f = fun, cache = cache)
res.1 <- fun.memo(x = 2)
list.files(path=cache.dir)
# [1] "37513c63752949a0ae8d9befd52c6ad1.rds" ....
rlang::hash(c(
rlang::hash(list(formals(fun), as.character(body(fun)))),
list(x=2)))
# 37513c63752949a0ae8d9befd52c6ad1
Please don't do this :D
Related
I am trying to use the quasi-quotation syntax (quo, exprs, !!, etc.) as well as the foreach function to create several new variables by means of a named list of expressions to be evaluated inside the rxDataStep function, specifically, the transforms argument. I am getting the following error:
Error in rxLinkTransformComponents(transforms = transforms, transformFunc = transformFunc, : 'transforms' must be of the form list(...)
I have a dataset which includes a number of variables with I need to log-transform in order to perform further analyses. I have been using the rx functions from the "RevoScaleR" package for roughly three years and totally missed the "tidyverse"/pipeline method of data transformation techniques. I do occasionally dabble with these tools but prefer to stick with the aforementioned rx functions giving my relative familiarity and the fact that they have served me very well thus far.
As a MWE:
Required libraries:
library(foreach)
library(rlang)
Creating variables which need to be log-transformed.
vars <- foreach(i = 10:20, .combine = "cbind") %do% rnorm(10, i)
Dataframe with identifier and above variables.
data_in <- data.frame(id = 1:10, vars)
Object which creates the expressions of the log-transformed variables; this creates a named list.
log_vars <- foreach(i = names(data_in[-1]), .final = function(x) set_names(x, paste0(names(data_in[-1]), "_log"))) %do%
expr(log10(!!sym(i)))
Now attempting to add the variables to the existing dataframe.
data_out <- rxDataStep(inData = data_in, transforms = log_vars, transformObjects = list(log_vars = log_vars))
The resulting error is the following:
Error in rxLinkTransformComponents(transforms = transforms, transformFunc = transformFunc, : 'transforms' must be of the form list(...)
I simply cannot understand the error given that log_vars is defined as a named list. One can check this with str and typeof.
I have tried a slightly different way of defining the new variables:
log_vars <- unlist(foreach(i = names(data_in[-1]), j = paste0(names(data_in[-1]), "_log")) %do%
exprs(!!j := log10(!!sym(i))))
I have to use unlist given that exprs delivers a list as output already. Either way, I get the same error as before.
Naturally, I expect to have 10 new variables named result.1_log, result.2_log, etc. inserted into the dataframe. Instead, I receive the above error and the new dataframe is not created.
I suspected that the rx functions do not like working with the quasi-quotation syntax, however, I have used it before when having to identify subjects with NA values of certain variables. This was done using the rowSelection argument of rxDataStep. I do realise that rowSelection requires a single, logical expression while transforms requires a named list of expressions.
Any help would be much appreciated since this type of data transformation will keep up again in my analyses. I do suspect that I simply do not understand the inner workings of the quasi-quotation syntax or perhaps how lists work in general but, hopefully there is a simple fix.
I am using Microsoft R Open 3.4.3.
My session info is the following:
R Services Information:
Local R: C:\Program Files\Microsoft\ML Server\R_SERVER\
Version: 1.3.40517.1016
Operating System: Microsoft Windows 10.0.17134
CPU Count: 4
Physical Memory: 12169 MB, 6810 MB free
Virtual Memory: 14025 MB, 7984 MB free
Video controller[1]: Intel(R) HD Graphics 620
GPU[1]: Intel(R) HD Graphics Family
Video memory[1]: 1024 MB
Connected users: 1
I'm not quite sure what you're trying to do as I think you've made things too complicated.
If all you want to do is take the log of each # in each data point, then I show two approaches below.
Approach #1 is static, you know the fixed # of columns and hard code it. It's a bit faster for rxDataStep to run in this approach.
Approach #2 is a bit more dynamic, taking advantage of a transformFunc. transformFunc works in chunks, so it can be used safely in a clustered fashion. rxDataStep knows how to integrate the chunks together. But there will be a bit of a performance hit for it.
You might have been trying to find a hybrid approach - dynamically build the list for the transforms parameter in the rxDataStep. I haven't found a way to get that to work. Here's a similar question for doing it in rxSetVarInfo (Change a dynamic variable name with rxSetVarInfo) but using that approach hasn't yielded success for me yet.
Let me know if I've completely missed the mark!
library(foreach)
library(rlang)
startSize <- 10
endSize <- 20
vars <- foreach(i = startSize:endSize, .combine = "cbind") %do% rnorm(10, i)
data_in <- data.frame(vars)
tempInput <- tempfile(fileext = ".xdf")
tempOutput <- tempfile(fileext = ".xdf")
rxImport(inData = data_in, outFile = tempInput, overwrite = T)
rxGetInfo(tempInput, getVarInfo = T)
### Approach #1
print("Approach #1")
rxDataStep(inData = tempInput, outFile = tempOutput, overwrite = T,
transforms = list(
log_R1 = log10(result.1),
log_R2 = log10(result.2),
log_R3 = log10(result.3),
log_R4 = log10(result.4),
log_R5 = log10(result.5),
log_R6 = log10(result.6),
log_R7 = log10(result.7),
log_R8 = log10(result.8),
log_R9 = log10(result.9),
log_R10 = log10(result.10),
log_R11 = log10(result.11)))
rxGetInfo(tempOutput, getVarInfo = T)
### Approach #2
print("Approach #2")
logxform <- function(dataList) {
numRowsInChunk <- length(dataList$result.1)
for (j in 1:columnDepth) {
dataList[[paste0("log_R",j)]] <- rep(0, times=numRowsInChunk)
for (i in 1:numRowsInChunk) {
dataList[[paste0("log_R",j)]][i] <- log10(dataList[[paste0("result.",j)]][i])
}
}
return(dataList)
}
rxDataStep(inData = tempInput, outFile = tempOutput, overwrite = T,
transformObjects = list(columnDepth = endSize - startSize + 1),
transformFunc = logxform)
rxGetInfo(tempOutput, getVarInfo = T)
I've got a file with 2m+ lines.
To avoid memory overload, I want to read these lines in chunks and then perform further processing with the lines in the chunk.
I read that readLines is the fastest but I could not find a way to read chunks with readlines.
raw = readLines(target_file, n = 500)
But what I'd want is to then have a readLines for n = 501:1000, e.g.
raw = readLines(target_file, n = 501:1000)
Is there a way to do this in R?
Maybe this helps someone in the future:
The readr package has just what I was looking for: a function to read lines in chunks.
read_lines_chunked reads a file in chunks of lines and then expects a callback to be run on these chunks.
Let f be the function needed for storing a chunk for later use:
f = function(x, pos){
filename = paste("./chunks/chunk_", pos, ".RData", sep="")
save(x, file = filename)
}
Then I can use this in the main wrapper as:
read_lines_chunked(file = target_json
, chunk_size = 10000
, callback = SideEffectChunkCallback$new(f)
)
Works.
I don't know how many variables (columns) you have, but data.table::fread is a very fast alternative to what you want:
require(data.table)
raw <- fread(target_file)
I'm a newbye in R and I've seen several posts about downloading more stocks, but for a reason or another they don't work as suggested.
My purpose is to download a vector of stocks and create a whole xts-matrix containing only Close prices for every stock (so a nobservations x 3 columns).
Anyway, I'd like to start from a basic script that doesn't work properly:
library(quantmod)
ticker=c("KO","AAPL","^GSPC")
for (i in 1:length(ticker)) {
simbol=as.xts(na.omit(getSymbols(ticker[i],from="2016-01-01",auto.assign=F)))
new=Cl(simbol)
merge(new[i])
}
It would be even better to write a function(symbols) that allows me to call whenever I need to just change the name of the stocks to download.
Thanks to everyone
This is how I would do what you want with a function wrapper (which is a pretty common kind of manipulation with xts):
ticker=c("KO","AAPL","^GSPC")
collect_close_series <- function(ticker) {
# Preallocate a list to store the result from each loop iteration (Note: lapply is another alternative to a direct loop)
lst <- vector("list", length(ticker))
for (i in 1:length(ticker)) {
symbol <- na.omit(getSymbols(ticker[i],from="2016-01-01",auto.assign = FALSE))
lst[[i]] <- Cl(symbol)
}
# You have a list of close prices. You can combine the objects in the list compactly using do.call; this is a common "data manipulation pattern" with xts objects.
rr <- do.call(what = merge, lst)
rr
}
out <- collect_close_series(ticker)
More advanced (better code design): You could write cleaner code by writing a function that handles each symbol (rather than a function that wraps and passes in all the symbols together) and then run lapply on it:
per_sym_close <- function(tick) {
symbol <- na.omit(getSymbols(tick,from="2016-01-01",auto.assign = FALSE))
Cl(symbol)
}
out2 <- do.call(merge, lapply(X = ticker, FUN = per_sym_close))
This gives the same result.
Hope this helps getting you started toward writing good R code!
I am working with a list where each element is also a list, comprised of R data.tables. My task is to grab the nth element of each sublist, and then stack those data.tables into a larger data.table. So, from a list of twenty lists, each having twelve elements, I end up with a total list of twelve elements, where each element is a data.table.
I'm not having difficulty with the code to do this, but I am having some confusion about what is happening with R's memory management in this case. It is relatively simple to do the extraction, like this (just to show context, not a MWE on its own):
lst_new <- lapply(X = list_indices,
FUN = function(idx) {return(rbindlist(l = lapply(X = lst_old,FUN = `[[`,idx)))})
My question is, why is R not releasing the memory that was originally allocated to lst_old when I delete it? More generally, why is that my rbind operations seem to hold onto memory after the object is removed? Below is a minimal working example.
library(data.table)
# Create list elements of large enough size
uFunc_MakeElement <- function() {
clicode <- paste(sample(x = c(letters,LETTERS),size = 4,replace = T),collapse = "")
column_data <- replicate(n = 100,expr = {sample(x = c(0:20),size = 600000,replace = T)},simplify = FALSE)
names(column_data) <- paste("var",1:100,sep = "")
return(as.data.table(cbind(clicode = clicode,as.data.frame(column_data))))
}
lst_big <- replicate(n = 15,expr = uFunc_MakeElement(),simplify = FALSE)
# At this point, the rsession is consuming 4.01GB according to top (RES)
# According to RStudio, lst_big was 3.4Gb
# Transform to a data.table
dt_big <- rbindlist(l = lst_big)
# According to top, RES was 7.293Gb
rm(lst_big)
# RES does not change
dt_big <- rbind(dt_big,NULL)
# RES goes to 0.010t
gc()
# RES goes back down to 6.833Gb
I'm not sure why, when I remove lst_big after creating the new data.table using rbindlist, I am not having the memory returned to me. Even after manually calling gc (which you should not have to do), I still don't get back the memory that seems to be allocated to lst_big. Am I doing something wrong? Is there a better way to concatenate data.tables so that I do not leak memory?
(Tagging this with RStudio in case there's a chance it's somehow related to the IDE. This example is coming from RStudio Server running on an Ubuntu 14.04 box).
EDITED TO ADD: I just noticed that this memory usage issue remains even if I overwrite the list itself (rather than creating a new list, I just assign the output of my operations to the old list).
I know I can use ls() and rm() to see and remove objects that exist in my environment.
However, when dealing with "old" .RData file, one needs to sometimes pick an environment a part to find what to keep and what to leave out.
What I would like to do, is to have a GUI like interface to allow me to see the objects, sort them (for example, by there size), and remove the ones I don't need (for example, by a check-box interface). Since I imagine such a system is not currently implemented in R, what ways do exist? What do you use for cleaning old .RData files?
Thanks,
Tal
I never create .RData files. If you are practicing reproducible research (and you should be!) you should be able to source in R files to go from input data files to all outputs.
When you have operations that take a long time it makes sense to cache them. If often use a construct like:
if (file.exists("cache.rdata")) {
load("cache.rdata")
} else {
# do stuff ...
save(..., file = "cache.rdata")
}
This allows you to work quickly from cached files, and when you need to recalculate from scratch you can just delete all the rdata files in your working directory.
Basic solution is to load your data, remove what you don't want and save as new, clean data.
Another way to handle this situation is to control loaded RData by loading it to own environment
sandbox <- new.env()
load("some_old.RData", sandbox)
Now you can see what is inside
ls(sandbox)
sapply(ls(sandbox), function(x) object.size(get(x,sandbox)))
Then you have several posibilities:
write what you want to new RData: save(A, B, file="clean.RData", envir=sandbox)
remove what you don't want from environment rm(x, z, u, envir=sandbox)
make copy of variables you want in global workspace and remove sandbox
I usually do something similar to third option. Load my data, do some checks, transformation, copy final data to global workspace and remove environments.
You could always implement what you want. So
Load the data
vars <- load("some_old.RData")
Get sizes
vars_size <- sapply(vars, function(x) object.size(get(x)))
Order them
vars <- vars[order(vars_size, decreasing=TRUE)]
vars_size <- vars_size [order(vars_size, decreasing=TRUE)]
Make dialog box (depends on OS, here is Windows)
vars_with_size <- paste(vars,vars_size)
vars_to_save <- select.list(vars_with_size, multiple=TRUE)
Remove what you don't want
rm(vars[!vars_with_size%in%vars_to_save])
To nice form of object size I use solution based on getAnywhere(print.object_size)
pretty_size <- function(x) {
ifelse(x >= 1024^3, paste(round(x/1024^3, 1L), "Gb"),
ifelse(x >= 1024^2, paste(round(x/1024^2, 1L), "Mb"),
ifelse(x >= 1024 , paste(round(x/1024, 1L), "Kb"),
paste(x, "bytes")
)))
}
Then in 4. one can use paste(vars, pretty_size(vars_size))
You may want to check out the RGtk2 package.
You can very easily create an interface with Glade Interface Designer and then attach whatever R commands you want to it.
If you want a good starting point where to "steal" ideas on how to use RGtk2, install the rattle package and run rattle();. Then look at the source code and start making your own interface :)
I may have a go at it and see if I can come out with something simple.
EDIT: this is a quick and dirty piece of code that you can play with. The big problem with it is that for whatever reason the rm instruction does not get executed, but I'm not sure why... I know that it is the central instruction, but at least the interface works! :D
TODO:
Make rm work
I put all the variables in the remObjEnv environment. It should not be listed in the current variable and it should be removed when the window is closed
The list will only show objects in the global environment, anything inside other environment won't be shown, but that's easy enough to implement
probably there's some other bug I haven't thought of :D
Enjoy
# Our environment
remObjEnv <<- new.env()
# Various required libraries
require("RGtk2")
remObjEnv$createModel <- function()
{
# create the array of data and fill it in
remObjEnv$objList <- NULL
objs <- objects(globalenv())
for (o in objs)
remObjEnv$objList[[length(remObjEnv$objList)+1]] <- list(object = o,
type = typeof(get(o)),
size = object.size(get(o)))
# create list store
model <- gtkListStoreNew("gchararray", "gchararray", "gint")
# add items
for (i in 1:length(remObjEnv$objList))
{
iter <- model$append()$iter
model$set(iter,
0, remObjEnv$objList[[i]]$object,
1, remObjEnv$objList[[i]]$type,
2, remObjEnv$objList[[i]]$size)
}
return(model)
}
remObjEnv$addColumns <- function(treeview)
{
colNames <- c("Name", "Type", "Size (bytes)")
model <- treeview$getModel()
for (n in 1:length(colNames))
{
renderer <- gtkCellRendererTextNew()
renderer$setData("column", n-1)
treeview$insertColumnWithAttributes(-1, colNames[n], renderer, text=n-1)
}
}
# Builds the list.
# I seem to have some problems in correctly build treeviews from glade files
# so we'll just do it by hand :)
remObjEnv$buildTreeView <- function()
{
# create model
model <- remObjEnv$createModel()
# create tree view
remObjEnv$treeview <- gtkTreeViewNewWithModel(model)
remObjEnv$treeview$setRulesHint(TRUE)
remObjEnv$treeview$getSelection()$setMode("single")
remObjEnv$addColumns(remObjEnv$treeview)
remObjEnv$vbox$packStart(remObjEnv$treeview, TRUE, TRUE, 0)
}
remObjEnv$delObj <- function(widget, treeview)
{
model <- treeview$getModel()
selection <- treeview$getSelection()
selected <- selection$getSelected()
if (selected[[1]])
{
iter <- selected$iter
path <- model$getPath(iter)
i <- path$getIndices()[[1]]
model$remove(iter)
}
obj <- as.character(remObjEnv$objList[[i+1]]$object)
rm(obj)
}
# The list of the current objects
remObjEnv$objList <- NULL
# Create the GUI.
remObjEnv$window <- gtkWindowNew("toplevel", show = FALSE)
gtkWindowSetTitle(remObjEnv$window, "R Object Remover")
gtkWindowSetDefaultSize(remObjEnv$window, 500, 300)
remObjEnv$vbox <- gtkVBoxNew(FALSE, 5)
remObjEnv$window$add(remObjEnv$vbox)
# Build the treeview
remObjEnv$buildTreeView()
remObjEnv$button <- gtkButtonNewWithLabel("Delete selected object")
gSignalConnect(remObjEnv$button, "clicked", remObjEnv$delObj, remObjEnv$treeview)
remObjEnv$vbox$packStart(remObjEnv$button, TRUE, TRUE, 0)
remObjEnv$window$showAll()
Once you've figured out what you want to keep, you can use the function -keep- from package gdata does what its name suggests.
a <- 1
b <- 2
library(gdata)
keep(a, all = TRUE, sure = TRUE)
See help(keep) for details on the -all- and -sure- options.
all: whether hidden objects (beginning with a .) should be removed, unless explicitly kept.
sure: whether to perform the removal, otherwise return names of objects that would have been removed.
This function is so useful that I'm surprised it isn't part of R itself.
The OS X gui does have such a thing, it's called the Workspace Browser. Quite handy.
I've also wished for an interface that shows the session dependency between objects, i.e. if I start from a plot() and work backwards to find all the objects that were used to create it. This would require parsing the history.
It doesn't have checkboxes to delete with, rather you select the file(s) then click delete. However, the solution below is pretty easy to implement:
library(gWidgets)
options(guiToolkit="RGtk2")
## make data frame with files
out <- lapply((x <- list.files()), file.info)
out <- do.call("rbind", out)
out <- data.frame(name=x, size=as.integer(out$size), ## more attributes?
stringsAsFactors=FALSE)
## set up GUI
w <- gwindow("Browse directory")
g <- ggroup(cont=w, horizontal=FALSE)
tbl <- gtable(out, cont=g, multiple=TRUE)
size(tbl) <- c(400,400)
deleteThem <- gbutton("delete", cont=g)
enabled(deleteThem) <- FALSE
## add handlers
addHandlerClicked(tbl, handler=function(h,...) {
enabled(deleteThem) <- (length(svalue(h$obj, index=TRUE)) > 0)
})
addHandlerClicked(deleteThem, handler=function(h,...) {
inds <- svalue(tbl, index=TRUE)
files <- tbl[inds,1]
print(files) # replace with rm?
})
The poor guy answer could be :
ls()
# spot the rank of the variables you want to remove, for example 10 to 25
rm(list= ls()[[10:25]])
# repeat until satisfied
To clean the complete environment you can try:
rm(list(ls())