I am trying to use the quasi-quotation syntax (quo, exprs, !!, etc.), together with the foreach function, to create several new variables from a named list of expressions to be evaluated inside the rxDataStep function, specifically via the transforms argument. I am getting the following error:
Error in rxLinkTransformComponents(transforms = transforms, transformFunc = transformFunc, : 'transforms' must be of the form list(...)
I have a dataset which includes a number of variables that I need to log-transform in order to perform further analyses. I have been using the rx functions from the "RevoScaleR" package for roughly three years and largely missed the "tidyverse"/pipeline approach to data transformation. I do occasionally dabble with these tools but prefer to stick with the aforementioned rx functions, given my relative familiarity with them and the fact that they have served me very well thus far.
As an MWE:
Required libraries:
library(foreach)
library(rlang)
Creating variables which need to be log-transformed.
vars <- foreach(i = 10:20, .combine = "cbind") %do% rnorm(10, i)
Dataframe with identifier and above variables.
data_in <- data.frame(id = 1:10, vars)
Now build the expressions for the log-transformed variables; this creates a named list.
log_vars <- foreach(i = names(data_in[-1]), .final = function(x) set_names(x, paste0(names(data_in[-1]), "_log"))) %do%
expr(log10(!!sym(i)))
Now attempting to add the variables to the existing dataframe.
data_out <- rxDataStep(inData = data_in, transforms = log_vars, transformObjects = list(log_vars = log_vars))
The resulting error is the following:
Error in rxLinkTransformComponents(transforms = transforms, transformFunc = transformFunc, : 'transforms' must be of the form list(...)
I simply cannot understand the error given that log_vars is defined as a named list. One can check this with str and typeof.
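For example, the check looks something like this (output abbreviated; shown only to illustrate that log_vars really is a named list of unevaluated calls):
str(log_vars[1:2])
# List of 2
#  $ result.1_log: language log10(result.1)
#  $ result.2_log: language log10(result.2)
typeof(log_vars)
# [1] "list"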
I have tried a slightly different way of defining the new variables:
log_vars <- unlist(foreach(i = names(data_in[-1]), j = paste0(names(data_in[-1]), "_log")) %do%
exprs(!!j := log10(!!sym(i))))
I have to use unlist given that exprs delivers a list as output already. Either way, I get the same error as before.
Naturally, I expect to have 11 new variables named result.1_log, result.2_log, etc. inserted into the dataframe. Instead, I receive the above error and the new dataframe is not created.
I suspected that the rx functions do not like working with the quasi-quotation syntax; however, I have used it before when identifying subjects with NA values for certain variables, using the rowSelection argument of rxDataStep. I do realise that rowSelection requires a single logical expression while transforms requires a named list of expressions.
Any help would be much appreciated, since this type of data transformation will keep coming up in my analyses. I suspect that I simply do not understand the inner workings of the quasi-quotation syntax, or perhaps how lists work in general, but hopefully there is a simple fix.
I am using Microsoft R Open 3.4.3.
My session info is the following:
R Services Information:
Local R: C:\Program Files\Microsoft\ML Server\R_SERVER\
Version: 1.3.40517.1016
Operating System: Microsoft Windows 10.0.17134
CPU Count: 4
Physical Memory: 12169 MB, 6810 MB free
Virtual Memory: 14025 MB, 7984 MB free
Video controller[1]: Intel(R) HD Graphics 620
GPU[1]: Intel(R) HD Graphics Family
Video memory[1]: 1024 MB
Connected users: 1
I'm not quite sure what you're trying to do as I think you've made things too complicated.
If all you want to do is take the log of each value in each data point, then I show two approaches below.
Approach #1 is static: you know the fixed number of columns and hard-code the transforms. rxDataStep runs a bit faster with this approach.
Approach #2 is a bit more dynamic, taking advantage of a transformFunc. A transformFunc works in chunks, so it can be used safely in a clustered fashion, and rxDataStep knows how to integrate the chunks together. But there will be a bit of a performance hit for it.
You might have been trying to find a hybrid approach: dynamically building the list for the transforms parameter in the rxDataStep call. I haven't found a way to get that to work. Here's a similar question about doing it with rxSetVarInfo (Change a dynamic variable name with rxSetVarInfo), but that approach hasn't yielded success for me yet.
Let me know if I've completely missed the mark!
library(foreach)
library(rlang)
startSize <- 10
endSize <- 20
vars <- foreach(i = startSize:endSize, .combine = "cbind") %do% rnorm(10, i)
data_in <- data.frame(vars)
tempInput <- tempfile(fileext = ".xdf")
tempOutput <- tempfile(fileext = ".xdf")
rxImport(inData = data_in, outFile = tempInput, overwrite = T)
rxGetInfo(tempInput, getVarInfo = T)
### Approach #1
print("Approach #1")
rxDataStep(inData = tempInput, outFile = tempOutput, overwrite = T,
transforms = list(
log_R1 = log10(result.1),
log_R2 = log10(result.2),
log_R3 = log10(result.3),
log_R4 = log10(result.4),
log_R5 = log10(result.5),
log_R6 = log10(result.6),
log_R7 = log10(result.7),
log_R8 = log10(result.8),
log_R9 = log10(result.9),
log_R10 = log10(result.10),
log_R11 = log10(result.11)))
rxGetInfo(tempOutput, getVarInfo = T)
### Approach #2
print("Approach #2")
logxform <- function(dataList) {
  # dataList is the chunk of columns handed to this function by rxDataStep
  numRowsInChunk <- length(dataList$result.1)
  # columnDepth is passed in through transformObjects in the rxDataStep call below
  for (j in 1:columnDepth) {
    dataList[[paste0("log_R", j)]] <- rep(0, times = numRowsInChunk)
    for (i in 1:numRowsInChunk) {
      dataList[[paste0("log_R", j)]][i] <- log10(dataList[[paste0("result.", j)]][i])
    }
  }
  return(dataList)
}
rxDataStep(inData = tempInput, outFile = tempOutput, overwrite = T,
transformObjects = list(columnDepth = endSize - startSize + 1),
transformFunc = logxform)
rxGetInfo(tempOutput, getVarInfo = T)
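For a quick sanity check of what these transforms produce, the same log columns can be computed in plain R on the small in-memory data_in created above (a sketch; this bypasses rxDataStep entirely and only works because the example data fits in memory):
# log10 of every result column, with a log_R prefix as in Approach #1
log_cols <- log10(data_in)
names(log_cols) <- paste0("log_R", seq_along(data_in))
check_out <- cbind(data_in, log_cols)
head(check_out[, c("result.1", "log_R1")])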
Related
Recently I've been playing with doing some parallel processing in R using future (and future.apply and furrr), which has mostly been great, but I've stumbled onto something that I can't explain. It's possible that this is a bug somewhere, but it may also be sloppy coding on my part. If anyone can explain this behavior it would be much appreciated.
The setup
I'm running simulations on different subgroups of my data. For each group, I want to run the simulation n times and then calculate some summary stats on the results. Here is some example code to reproduce my basic setup and demonstrate the issue I'm seeing:
library(tidyverse)
library(future)
library(future.apply)
# Helper functions
#' Calls out to `free` to get total system memory used
sys_used <- function() {
.f <- system2("free", "-b", stdout = TRUE)
as.numeric(unlist(strsplit(.f[2], " +"))[3])
}
#' Write time, memory usage, and PID to the log file in CSV format
#' @param .f the file to write to
#' @param .id identifier for the row to be written
mem_string <- function(.f, .id) {
.s <- paste(.id, Sys.time(), sys_used(), Sys.getpid(), sep = ",")
write_lines(.s, .f, append = TRUE)
}
# Inputs
fake_inputs <- 1:16
nsim <- 100
nrows <- 1e6
log_file <- "future_mem_leak_log.csv"
if (fs::file_exists(log_file)) fs::file_delete(log_file)
test_cases <- list(
list(
name = "multisession-sequential",
plan = list(multisession, sequential)
),
list(
name = "sequential-multisession",
plan = list(sequential, multisession)
)
)
# Test code
for (.t in test_cases) {
plan(.t$plan)
# loop over subsets of the data
final_out <- future_lapply(fake_inputs, function(.i) {
# loop over simulations
out <- future_lapply(1:nsim, function(.j) {
# in real life this would be doing simulations,
# but here we just create "results" using rnorm()
res <- data.frame(
id = rep(.j, nrows),
col1 = rnorm(nrows) * .i,
col2 = rnorm(nrows) * .i,
col3 = rnorm(nrows) * .i,
col4 = rnorm(nrows) * .i,
col5 = rnorm(nrows) * .i,
col6 = rnorm(nrows) * .i
)
# write memory usage to file
mem_string(log_file, .t$name)
# in real life I would write res to file to read in later, but here we
# only return head of df so we know the returned value isn't filling up memory
res %>% slice_head(n = 10)
})
})
# clean up any leftover objects before testing the next plan
try(rm(final_out))
try(rm(out))
try(rm(res))
}
The outer loop is for testing two parallelization strategies: whether to parallelize over the subsets of data or over the 100 simulations.
Some caveats
I realize that parallelizing over the simulations is not the ideal design, and also that chunking the data to send 10-20 simulations to each core would be more efficient, but that's not really the point here. I'm just trying to understand what is happening in memory.
I also considered that maybe plan(multicore) would be better here (though I'm not sure it would), but I'm more interested in figuring out what's happening with plan(multisession).
The results
I ran this on an 8-vCPU Linux EC2 instance (I can give more specs if people need them) and created a plot from the results (plotting code at the bottom for reproducibility).
First off, plan(list(multisession, sequential)) is faster (as expected, see caveat above), but what I'm confused about is the memory profile. The total system memory usage remains pretty constant for plan(list(multisession, sequential)) which I would expect, because I assumed the res object is overwritten each time through the loop.
However, the memory usage for plan(list(sequential, multisession)) steadily grows as the program runs. It appears that each time through the loop the res object is created and then hangs around in limbo somewhere, taking up memory. In my real example this got large enough that it filled my entire (32GB) system memory and killed the process about halfway through.
Plot twist: it only happens when nested
And here's the part that really has me confused! When I change the outer future_lapply to just a regular lapply and set plan(multisession), I don't see it! From my reading of the "Future: Topologies" vignette this should be the same as plan(list(sequential, multisession)), but the plot doesn't show the memory growing at all (in fact, it's almost identical to plan(list(multisession, sequential)) in the above plot).
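For reference, the non-nested variant looks roughly like this (a sketch reusing fake_inputs, nsim, nrows, mem_string, and log_file from the setup above; the "lapply-multisession" label is just an arbitrary identifier for the log rows):
# Non-nested version: plain lapply over the subsets, futures only for the simulations
plan(multisession)
final_out <- lapply(fake_inputs, function(.i) {
  future_lapply(1:nsim, function(.j) {
    res <- data.frame(id = rep(.j, nrows), col1 = rnorm(nrows) * .i)
    mem_string(log_file, "lapply-multisession")
    res %>% slice_head(n = 10)
  })
})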
Note on other options
I actually originally found this with furrr::future_map_dfr() but to be sure it wasn't a bug in furrr, I tried it with future.apply::future_lapply() and got the results shown. I tried to code this up with just future::future() and got very different results, but quite possibly because what I coded up wasn't actually equivalent. I don't have much experience with using futures directly without the abstraction layer provided by either furrr or future.apply.
Again, any insight on this is much appreciated.
Plotting code
library(tidyverse)
logDat <- read_csv("future_mem_leak_log.csv",
col_names = c("plan", "time", "sys_used", "pid")) %>%
group_by(plan) %>%
mutate(
start = min(time),
time_elapsed = as.numeric(difftime(time, start, units = "secs"))
)
ggplot(logDat, aes(x = time_elapsed/60, y = sys_used/1e9, group = plan, colour = plan)) +
geom_line() +
xlab("Time elapsed (in mins)") + ylab("Memory used (in GB)") +
ggtitle("Memory Usage\n list(multisession, sequential) vs list(sequential, multisession)")
I am trying to create a loop where I select one file name from a list of file names and use that one file to run read.capthist, and subsequently discretize, secr.fit, and derived, and save the outputs using save. The list contains 10 files with identical rows and columns; the only difference between them is the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package they are 'captfile' types), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try and select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training, I've learned R on my own and used stack overflow a lot for solving my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)
setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files
for (i in 1:length(lst)) {
capt <- lst[i]
femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
save(fit, file="C:/temp/fit.Rdata")
D.fit <- derived(fit)
save(D.fit, file="C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally, I would also like my outputs to have unique identifiers: since I am simulating data and will have to compare all the results, I don't want each iteration to overwrite the previous output.
I know I can use this code by bringing in each file and running it separately (this code works for non-simulation runs of a couple of data sets), but as I'm hoping to run 100 simulations, that would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!
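For what it's worth, here is a minimal sketch of the kind of loop being described (untested; it assumes read.capthist can take the file name directly, as in the standalone runs, and that simtraps is defined as in the code above):
library(secr)
files <- list.files(pattern = "^female")  # pattern is a regular expression
for (i in seq_along(files)) {
  capt <- files[i]  # a single file name, passed straight to read.capthist
  femsimCH <- read.capthist(capt, simtraps, fmt = "XY", detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = "proximity")
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = "HEX", method = "BFGS", trace = FALSE, CL = TRUE)
  D.fit <- derived(fit)
  # unique file names so each iteration keeps its own output
  save(fit, file = paste0("C:/temp/fit_", i, ".Rdata"))
  save(D.fit, file = paste0("C:/temp/D.fit_", i, ".Rdata"))
}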
I'm a newbie in R, pre-processing a big data set of a million lines to label the connected components and write the output to a file. It is taking an awful lot of time using a for loop and cat(). Is there a faster way to write the output file in R? I am sharing a sample of the code. Any alternative methods, or a rewrite as a function that makes it more efficient, would be highly appreciated.
library(igraph)
#Simple example of undirected graph
g <- graph_from_literal(a--b, a--c, b--c, d--e)
plot(g)
#Connected components
#The option, mode, is ignored for undirected graphs
comp <- components(g, mode = "weak")
#output to a file
fout <- file("output.txt", "w")
for (v in V(g)) {
vn <- V(g)$name[v]
comp_id <- comp$membership[vn][[1]]
comp_size <- comp$csize[comp_id]
cat(sprintf("%s\t%s\t%s\n", vn, comp_id, comp_size), file=fout)
}
close(fout)
It seems like everything is vectorized and no for loop is needed. This gives the same output and uses data.table::fwrite, which will be quite a bit faster than cat.
vv = V(g)
vn = vv$name
comp_id = comp$membership[vv$name]
comp_size = comp$csize[comp_id]
data.table::fwrite(data.table::data.table(vn, comp_id, comp_size), "output.txt", col.names = FALSE, sep = "\t")
If you don't want the data table dependency, you could use base::write.table, which would still be better than pasting together strings with tabs yourself.
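For reference, that base-R variant would be a sketch like this (same vectorized objects as above, untested):
# Base R only: build a data.frame from the vectors and write it tab-separated
out <- data.frame(vn, comp_id, comp_size)
write.table(out, "output.txt", col.names = FALSE, row.names = FALSE, sep = "\t", quote = FALSE)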
I faced a similar problem, i.e. how to write 3 million (short) lines to a text file. I found that using writeChar sped up the file-writing process considerably (from several minutes to seconds).
Below, I have replaced cat with writeChar in your code:
g <- graph_from_literal(a--b, a--c, b--c, d--e)
plot(g)
#Connected components
#The option, mode, is ignored for undirected graphs
comp <- components(g, mode = "weak")
# first clean the file if it exists
fout <- file("output.txt", "wb")
close(fout)
# switch in appending mode
fout <- file("output.txt", "ab")
for (v in V(g)) {
vn <- V(g)$name[v]
comp_id <- comp$membership[vn][[1]]
comp_size <- comp$csize[comp_id]
# set eos = NULL to avoid NULL terminators
writeChar(sprintf("%s\t%s\t%s\n", vn, comp_id, comp_size), con = fout, eos = NULL)
}
close(fout)
(Caveat emptor: I don't have any of your data, so this is untested.)
Instead of doing the write each time within your loop, generate a vector of the strings (one file line each) and write once at the end. This type of file I/O is much more efficient.
all_lines <- sapply(V(g), function(v) {
vn <- V(g)$name[v]
comp_id <- comp$membership[vn][[1]]
comp_size <- comp$csize[comp_id]
sprintf("%s\t%s\t%s", vn, comp_id, comp_size)  # no trailing newline; writeLines adds one per element
})
writeLines(all_lines, "output.txt")
Using sapply is one of R's efficiencies, doing things as "vectors of things". It is not strictly necessary (this could be done with a for loop, though several precautions need to be taken to avoid being grossly inefficient, especially when dealing with a million lines; see the sketch after this paragraph), but once one can "grok" the intent of vector mechanics, it might become easier to understand and work with.
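For completeness, a preallocated for-loop version of the same idea would look something like this (untested; the main precaution is to allocate all_lines once up front rather than growing it inside the loop):
vs <- V(g)
all_lines <- character(length(vs))  # preallocate, do not grow in the loop
for (k in seq_along(vs)) {
  vn <- vs$name[k]
  comp_id <- comp$membership[vn][[1]]
  comp_size <- comp$csize[comp_id]
  all_lines[k] <- sprintf("%s\t%s\t%s", vn, comp_id, comp_size)
}
writeLines(all_lines, "output.txt")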
I am working with a list where each element is also a list, composed of R data.tables. My task is to grab the nth element of each sublist and then stack those data.tables into a larger data.table. So, from a list of twenty lists, each having twelve elements, I end up with a list of twelve elements, where each element is a data.table.
I'm not having difficulty with the code to do this, but I am having some confusion about what is happening with R's memory management in this case. It is relatively simple to do the extraction, like this (just to show context, not a MWE on its own):
lst_new <- lapply(X = list_indices,
FUN = function(idx) {return(rbindlist(l = lapply(X = lst_old,FUN = `[[`,idx)))})
My question is, why is R not releasing the memory that was originally allocated to lst_old when I delete it? More generally, why is it that my rbind operations seem to hold onto memory after the object is removed? Below is a minimal working example.
library(data.table)
# Create list elements of large enough size
uFunc_MakeElement <- function() {
clicode <- paste(sample(x = c(letters,LETTERS),size = 4,replace = T),collapse = "")
column_data <- replicate(n = 100,expr = {sample(x = c(0:20),size = 600000,replace = T)},simplify = FALSE)
names(column_data) <- paste("var",1:100,sep = "")
return(as.data.table(cbind(clicode = clicode,as.data.frame(column_data))))
}
lst_big <- replicate(n = 15,expr = uFunc_MakeElement(),simplify = FALSE)
# At this point, the rsession is consuming 4.01GB according to top (RES)
# According to RStudio, lst_big was 3.4Gb
# Transform to a data.table
dt_big <- rbindlist(l = lst_big)
# According to top, RES was 7.293Gb
rm(lst_big)
# RES does not change
dt_big <- rbind(dt_big,NULL)
# RES goes to 0.010t
gc()
# RES goes back down to 6.833Gb
I'm not sure why, when I remove lst_big after creating the new data.table using rbindlist, I am not having the memory returned to me. Even after manually calling gc (which you should not have to do), I still don't get back the memory that seems to be allocated to lst_big. Am I doing something wrong? Is there a better way to concatenate data.tables so that I do not leak memory?
(Tagging this with RStudio in case there's a chance it's somehow related to the IDE. This example is coming from RStudio Server running on an Ubuntu 14.04 box).
EDITED TO ADD: I just noticed that this memory usage issue remains even if I overwrite the list itself (rather than creating a new list, I just assign the output of my operations to the old list).
I have a fitted model that I'd like to apply to score a new dataset stored as a CSV. Unfortunately, the new dataset is rather large, and the predict procedure runs out of memory if I do it all at once. So, I'd like to convert the procedure below, which worked fine for small sets, into a batch mode that processes 500 lines at a time and then outputs a file for each scored 500.
I understand from this answer (What is a good way to read line-by-line in R?) that I can use readLines for this. So, I'd be converting from:
trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
newdata <- as.data.frame(read.csv('newstuff.csv'), stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=filename)
to something like:
trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
con <- file("newstuff.csv", open = "r")
i = 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
i = i+1
newdata <- as.data.frame(mylines, stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=paste(filename,i,'.csv',sep=''))
}
close(con)
However, when I print the mylines object inside the loop, it isn't parsed into columns the way read.csv output is: the headers are still mixed in with the data, and none of the under-the-hood column splitting that turns the raw lines into an ncol-wide object happens.
Whenever I find myself writing barbaric things like cutting off the first row and wrapping the columns myself, I generally suspect R has a better way to do things. Any suggestions for how I can get read.csv-like output from a readLines CSV connection?
You can read your data into memory in chunks with read.csv by using the skip and nrows arguments. In pseudo-code:
read_chunk = function(start, n) {
  # header = FALSE because skip jumps past the header line; supply col.names if needed
  read.csv(file, skip = start, nrows = n, header = FALSE)
}
start_indices = (0:no_chunks) * chunk_size + 1
lapply(start_indices, function(x) {
  dat = read_chunk(x, chunk_size)
  pred = predict(fit, dat)
  write.csv(pred)
})
Alternatively, you could put the data into an SQLite database and use the RSQLite package to query the data in chunks. See also this answer, or do some digging with [r] large csv on SO.
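If you would rather keep the readLines connection from your attempt, another option is to hand each chunk of lines back to read.csv via its text argument (a sketch, untested; it reads the header once up front, assumes a simple comma-separated header, and reuses the filename object from your code):
con <- file("newstuff.csv", open = "r")
header_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # read the header line once
i <- 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
  i <- i + 1
  # parse the raw lines with read.csv so the chunk gets proper columns and types
  newdata <- read.csv(text = paste(mylines, collapse = "\n"), header = FALSE,
                      col.names = header_names, stringsAsFactors = FALSE)
  preds <- predict(fit, newdata)
  write.csv(preds, file = paste0(filename, i, ".csv"))
}
close(con)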