I have a very large pairs list that I need to break down into single linkage communities. So far I have been able to do this entirely in R just fine. But I need to prepare for the eventuality that the entire list may be too large to hold in memory, or for igraph's R implementation to handle. A very simple version of this task looks like:
library(igraph)
df <- data.frame("p1" = c("a", "a", "d", "d"),
"p2" = c("b", "c", "e", "f"),
"val" = c(0.5, 0.75, 0.25, 0.35))
g <- graph_from_data_frame(d = df,
directed = FALSE)
sg <- groups(components(g))
sg <- sapply(sg,
function(x) induced_subgraph(graph = g,
vids = x),
USE.NAMES = FALSE,
simplify = FALSE)
if df is incredibly large - on the scale of hundreds of millions, to tens of billions of rows, is there a way for me to extract individual positions of sg without having to build g in it's entirety? It's relatively easy for me to store representations of df outside of R either as a compressed txt file or as a sqlite database.
To adress the problem with igraph's R implementation (assuming the dataset is still holdable in RAM, otherwise see #Paul Brodersen's answer):
The solution below works by specifying one element of the graph and then going over all connections until no further edges are found. It therefore creates the subgraph without building the whole graph. It looks a bit hacky compared to a recursive function but scales better.
library(igraph)
reduce_graph <- function(df, element) {
stop = F
elements_to_inspect <- element
rows_graph <-0
while(stop ==F) {
graph_parts <- df[df$p1 %in% elements_to_inspect |
df$p2 %in% elements_to_inspect,]
elements_to_inspect <- unique(c(unique(graph_parts$p1),
unique(graph_parts$p2)))
if(dim(graph_parts)[1] == rows_graph) {
stop <-TRUE
} else {
rows_graph <- dim(graph_parts)[1]
}
}
return(graph_parts)
}
df <- data.frame("p1" = c("a", "a", "d", "d","o"),
"p2" = c("b", "c", "e", "f","u"),
"val" = c(100, 0.75, 0.25, 0.35,1))
small_graph <- reduce_graph(df, "f")
g <- graph_from_data_frame(d = small_graph,
directed = FALSE)
sg <- groups(components(g))
sg <- sapply(sg,
function(x) induced_subgraph(graph = g,
vids = x),
USE.NAMES = FALSE,
simplify = FALSE)
One can test the speed on a bigger dataset.
##larger dataset with lots of sparse graphs.
set.seed(100)
p1 <- as.character(sample(1:10000000, 1000000, replace=T))
p2 <- as.character(sample(1:10000000, 1000000, replace=T))
val <- rep(1, 1000000)
df <- data.frame("p1" = p1,
"p2" = p2,
"val" = val)
small_graph <- reduce_graph(df, "9420672") #has 3 pairwise connections
g <- graph_from_data_frame(d = small_graph,
directed = FALSE)
sg <- groups(components(g))
sg <- sapply(sg,
function(x) induced_subgraph(graph = g,
vids = x),
USE.NAMES = FALSE,
simplify = FALSE)
Building groups and subgraph takes one second, compared to multiple minutes for the whole graph on my machine. This of course depends on how sparsely connected the graphs are.
I am trying to simulate data and adjust the model for choice-based conjoint analysis (mlogit). However, I am getting the error (Error in solve.default (H, g [! Fixed])). I believe it is because sometimes I have more than one choice for the same respondent, but I understand that this is one of the possibilities of this model and so I could not find a solution, someone who has worked with discrete choice and used the mlogit package can help me? Below is my code:
rm(list = ls())
cat("\014")
library(dplyr)
library(conjoint)
set.seed(0)
n <- 1000
#- create dummy data
data = expand.grid(Cor = c("black", "white"),
Brand = c("X", "Y"),
Price = c("low", "high"))
levn <- rbind("black", "white","X", "Y", "low", "high")
data$trat <- c("A", "B", "C", "D", "E", "F", "G", "H")
UA <- 7
UB <- 6.5
UC <- 6
UD <- 5.5
UE <- 5
UF <- 4.5
UG <- 4
UH <- 3.5
data$utility <- c(UA, UB, UC, UD, UE, UF, UG, UH)
data <- bind_rows(replicate(n, data, simplify = FALSE))
erro <- rnorm(n)
data$erro <- erro
data$determinist <- floor(rowSums(data[,5:6]))
data$id <- rep(1:n, each = 8)
data <- data %>% group_by(id) %>% mutate(determinist = (determinist == max(determinist)))
data$choice <- ifelse(data$determinist=="TRUE",1,0)
library(mlogit)
cbc.mlogit <- mlogit.data(data=data, choice="choice", shape="long", varying=1:3, alt.levels=paste("pos", 1:8), id.var="id")
cbc.ml <- mlogit(choice ~ 0 + Cor + Brand + Price, data = cbc.mlogit)
Using the nnet package I got a solution, however, I don't think it's correct, because in the choice-based joint analysis the intercept is zero.
library("nnet")
model <- multinom(choice ~ Cor + Brand + Price, data = cbc.mlogit)
summary(model)
Already researched other posts and could not solve, any help is welcome!
I am building an agent-based model with R but I have memory issues by trying to work with large 3D arrays. In the 3D arrays, the first dimension corresponds to the time (from 1 to 3650 days), the second dimension defines the properties of individuals or landscape cells and the third dimension represents each individual or landscape cell. At each time step (1 day), each 3D array is filled using several functions. Ideally, I would like to run my ABM over large landscapes (for example, 90000 cells) containing a large number of individuals (for example, 720000). Actually, this is impossible because of memory issues.
Currently, the 3D arrays are defined at initialization so that data are stored in the array at each time step. However, to fill one 3D array at t from the model, I need to only keep data at t – 1 and t – tf – 1, where tf is a duration parameter that is fixed (e.g., tf = 320 days). Here an example of one model function that is used to fill one 3D array (there are 3 duration parameters):
library(ff)
library(magrittr)
library(dplyr)
library(tidyr)
library(gtools)
## Define parameters
tf1 <- 288
tf2 <- 292
tf3 <- 150
## Define 3D array
col_array <- c(letters[seq( from = 1, to = 9 )])
s_array <- ff(-999, dim=c(3650, 9, 2500), dimnames=list(NULL, col_array, as.character(seq(1, 2500, 1))),
filename="s_array.ffd", vmode="double", overwrite = t) ## 3th dimension = patch ID
## Define initial array
initial_s_array <- matrix(sample.int(100, size = 2500*9, replace = TRUE), nrow = 2500, ncol = 9, dimnames=list(NULL, col_array))
## Loop over time
line <- 1
for(t in 1:3650){
print(t)
s_array[t,c("a", "b", "c", "d", "e", "f", "g", "h", "i"),] <- func(v_t_1 = ifelse((t - 1) >= 1, list(s_array[(t - 1),,]), NA)[[1]],
v_t_tf1_1 = ifelse((t - tf1 - 1) >= 1, list(s_array[(t - tf1 - 1),,]), NA)[[1]],
v_t_tf2_1 = ifelse((t - tf2 - 1) >= 1, list(s_array[(t - tf2 - 1),,]), NA)[[1]],
v_t_tf3_1 = ifelse((t - tf3 - 1) >= 1, list(s_array[(t - tf3 - 1),,]), NA)[[1]],
v_t_0 = initial_s_array, columns_names = col_array)
line <- line + 1
}
func <- function(v_t_1, v_t_tf1_1, v_t_tf2_1, v_t_tf3_1, v_t_0, columns_names){
## Data at t-1
dt_t_1 <- (ifelse(!(all(is.na(v_t_1))),
list(v_t_1 %>%
as.data.frame.table(stringsAsFactors = FALSE) %>%
dplyr::mutate_all(as.character)), NA))[[1]]
## Data at t-tf1-1
dt_t_tf1_1 <- (ifelse(!(all(is.na(v_t_tf1_1))),
list(v_t_tf1_1 %>%
as.data.frame.table(stringsAsFactors = FALSE) %>%
dplyr::mutate_all(as.character)), NA))[[1]]
## Data at t-tf2-1
dt_t_tf2_1 <- (ifelse(!(all(is.na(v_t_tf2_1))),
list(v_t_tf2_1 %>%
as.data.frame.table(stringsAsFactors = FALSE) %>%
dplyr::mutate_all(as.character)), NA))[[1]]
## Data at t-tf3-1
dt_t_tf3_1 <- (ifelse(!(all(is.na(v_t_tf3_1))),
list(v_t_tf3_1 %>%
as.data.frame.table(stringsAsFactors = FALSE) %>%
dplyr::mutate_all(as.character)), NA))[[1]]
## Format data at t-1
dt_t_1_reshape <- (ifelse(!(all(is.na(dt_t_1))),
list(dt_t_1 %>%
dplyr::rename(ID = Var2) %>%
tidyr::spread(Var1, Freq) %>%
dplyr::select(ID, columns_names) %>%
dplyr::arrange(match(ID, mixedsort(colnames(v_t_1))))), NA))[[1]]
## Format data at t-tf1-1
dt_t_tf1_1_reshape <- (ifelse(!(all(is.na(dt_t_tf1_1))),
list(dt_t_tf1_1 %>%
dplyr::rename(ID = Var2) %>%
tidyr::spread(Var1, Freq) %>%
dplyr::select(ID, columns_names) %>%
dplyr::arrange(match(ID, mixedsort(colnames(v_t_tf1_1))))), NA))[[1]]
## Format data at t-tf2-1
dt_t_tf2_1_reshape <- (ifelse(!(all(is.na(dt_t_tf2_1))),
list(dt_t_tf2_1 %>%
dplyr::rename(ID = Var2) %>%
tidyr::spread(Var1, Freq) %>%
dplyr::select(ID, columns_names) %>%
dplyr::arrange(match(ID, mixedsort(colnames(v_t_tf2_1))))), NA))[[1]]
## Format data at t-tf3-1
dt_t_tf3_1_reshape <- (ifelse(!(all(is.na(dt_t_tf3_1))),
list(dt_t_tf3_1 %>%
dplyr::rename(ID = Var2) %>%
tidyr::spread(Var1, Freq) %>%
dplyr::select(ID, columns_names) %>%
dplyr::arrange(match(ID, mixedsort(colnames(v_t_tf3_1))))), NA))[[1]]
## Retrieve data
a_t_1 <- (ifelse((!(all(is.na(dt_t_1_reshape)))), list(as.numeric(dt_t_1_reshape[,c("a")])), list(v_t_0[,c("a")])))[[1]]
d_t_1 <- (ifelse((!(all(is.na(dt_t_1_reshape)))), list(as.numeric(dt_t_1_reshape[,c("d")])), list(v_t_0[,c("d")])))[[1]]
g_t_1 <- (ifelse((!(all(is.na(dt_t_1_reshape)))), list(as.numeric(dt_t_1_reshape[,c("f")])), list(v_t_0[,c("f")])))[[1]]
a_t_tf1_1 <- (ifelse(!(all(is.na(dt_t_tf1_1_reshape))), list(as.numeric(dt_t_tf1_1_reshape[,c("a")])), 0))[[1]]
d_t_tf2_1 <- (ifelse(!(all(is.na(dt_t_tf2_1_reshape))), list(as.numeric(dt_t_tf2_1_reshape[,c("d")])), 0))[[1]]
g_t_tf3_1 <- (ifelse(!(all(is.na(dt_t_tf3_1_reshape))), list(as.numeric(dt_t_tf3_1_reshape[,c("f")])), 0))[[1]]
b_t_1 <- (ifelse((!(all(is.na(dt_t_1_reshape)))), list(as.numeric(dt_t_1_reshape[,c("b")])), list(v_t_0[,c("b")])))[[1]]
e_t_1 <- (ifelse((!(all(is.na(dt_t_1_reshape)))), list(as.numeric(dt_t_1_reshape[,c("e")])), list(v_t_0[,c("e")])))[[1]]
h_t_1 <- (ifelse((!(all(is.na(dt_t_1_reshape)))), list(as.numeric(dt_t_1_reshape[,c("h")])), list(v_t_0[,c("h")])))[[1]]
b_t_tf1_1 <- (ifelse(!(all(is.na(dt_t_tf1_1_reshape))), list(as.numeric(dt_t_tf1_1_reshape[,c("b")])), 0))[[1]]
e_t_tf2_1 <- (ifelse(!(all(is.na(dt_t_tf2_1_reshape))), list(as.numeric(dt_t_tf2_1_reshape[,c("e")])), 0))[[1]]
h_t_tf3_1 <- (ifelse(!(all(is.na(dt_t_tf3_1_reshape))), list(as.numeric(dt_t_tf3_1_reshape[,c("h")])), 0))[[1]]
## Define discrete equations
a_t <- round(0.4*a_t_1 + 0.5*a_t_tf1_1)
b_t <- round(0.5*b_t_1 + 0.6*b_t_tf1_1)
c_t <- a_t + b_t
d_t <- round(0.7*d_t_1 + 0.7*d_t_tf2_1)
e_t <- round(0.9*e_t_1 + 0.4*e_t_tf2_1)
f_t <- d_t + e_t
g_t <- round(0.3*g_t_1 + 0.2*g_t_tf3_1)
h_t <- round(0.5*h_t_1 + 0.1*h_t_tf3_1)
i_t <- g_t + h_t
## Update the values
dt_array <- as.matrix(cbind(a_t, b_t, c_t, d_t, e_t, f_t, g_t, h_t, i_t))
## print(dt_array)
## Build the output matrix
dt_array <- t(dt_array)
return(dt_array)
}
The function “func” takes as argument data at t – 1 and t – tf – 1 providing one 3D array “s_array”. The function returns a data frame that is used to fill the 3D array.
I think that I could reduce the first dimension of my arrays by keeping only data at t – 1 and t – tf – 1 (rather than keep data at each time step from 1 to 3650 days). However, I don’t know how to manage these new 3D arrays in the ABM at each time step (i.e., how to initialize the 3D arrays and to only store data at t – 1 and t – tf – 1)?
Edit:
I have tested the example with 90000 observations for the 3rd dimension. The number of rows (i.e., 3650) in each array is too large.
> s_array <- ff(-999, dim=c(3650, 9, 90000), dimnames=list(NULL, col_array, as.character(seq(1, 90000, 1))),
+ filename="s_array.ffd", vmode="double", overwrite = TRUE)
Error in if (length < 0 || length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In ff(-999, dim = c(3650, 9, 90000), dimnames = list(NULL, col_array, :
NAs introduced by coercion to integer range
Is there a way to reduce the number of rows and apply the function that is used to fill the arrays ?
The reason I said that R probably isn't ideal is because of its copy-on-modify semantics,
so every time you change something in any array/matrix/data frame,
a copy must be made.
I think R can do some clever things with its memory management,
but still.
Using ff to avoid having all time slices in RAM at the same time is probably indeed advantageous,
but your code is probably changing storage formats far too often,
juggling data structures back and forth.
I think I packed the logic in a couple R6 class
(along with some improvements),
maybe it can get you started with improving your code's memory usage:
suppressPackageStartupMessages({
library(R6)
library(ff)
})
SArray <- R6::R6Class(
"SArray",
public = list(
time_slices = NULL,
initialize = function(sdim, sdimnames) {
self$time_slices <- lapply(1L:sdim[1L], function(ignored) {
ff(NA_real_, vmode="double", dim=sdim[-1L], dimnames=sdimnames[-1L])#, FF_RETURN=FALSE)
})
names(self$time_slices) <- sdimnames[[1L]]
}
)
)
`[.SArray` <- function(s_array, i, j, ...) {
s_array$time_slices[[i]][j,]
}
`[<-.SArray` <- function(s_array, i, j, ..., value) {
s_array$time_slices[[i]][j,] <- value
s_array
}
dim.SArray <- function(x) {
c(length(x$time_slices), dim(x$time_slices[[1L]]))
}
ABM <- R6::R6Class(
"ABM",
public = list(
s_array = NULL,
tf1 = NULL,
tf2 = NULL,
tf3 = NULL,
initialize = function(sdim, sdimnames, tfs) {
self$tf1 <- tfs[1L]
self$tf2 <- tfs[2L]
self$tf3 <- tfs[3L]
self$s_array <- SArray$new(sdim, sdimnames)
},
init_abm = function(seed = NULL) {
set.seed(seed)
sdim <- dim(self$s_array)
s_init <- matrix(sample.int(100L, size = 6L * sdim[3L], replace=TRUE),
nrow=6L, ncol=sdim[3L],
dimnames=list(c("a", "b", "d", "e", "g", "h"),
as.character(1:sdim[3L])))
self$a(1L, s_init["a", ])
self$b(1L, s_init["b", ])
self$c(1L)
self$d(1L, s_init["d", ])
self$e(1L, s_init["e", ])
self$f(1L)
self$g(1L, s_init["g", ])
self$h(1L, s_init["h", ])
self$i(1L)
private$t <- 1L
invisible()
},
can_advance = function() {
private$t < dim(self$s_array)[1L]
},
advance = function(verbose = FALSE) {
t <- private$t + 1L
if (verbose) print(t)
self$a(t)
self$b(t)
self$c(t)
self$d(t)
self$e(t)
self$f(t)
self$g(t)
self$h(t)
self$i(t)
private$t <- t
invisible()
},
# get time slice at t - tf - 1 for given letter
s_tf = function(t, tf, letter) {
t_tf_1 <- t - tf - 1L
if (t_tf_1 > 0L)
self$s_array[t_tf_1, letter, ]
else
0
},
# discrete equations
a = function(t, t_0) {
if (t < 2L) {
t <- 1L
t_1 <- t_0
t_tf_1 <- 0
}
else {
t_1 <- self$s_array[t - 1L, "a", ]
t_tf_1 <- self$s_tf(t, self$tf1, "a")
}
self$s_array[t, "a", ] <- round(0.4 * t_1 + 0.5 * t_tf_1)
invisible()
},
b = function(t, t_0) {
if (t < 2L) {
t <- 1L
t_1 <- t_0
t_tf_1 <- 0
}
else {
t_1 <- self$s_array[t - 1L, "b", ]
t_tf_1 <- self$s_tf(t, self$tf1, "b")
}
self$s_array[t, "b", ] <- round(0.5 * t_1 + 0.6 * t_tf_1)
invisible()
},
c = function(t) {
if (t < 1L) stop("t must be positive")
a_t <- self$s_array[t, "a", ]
b_t <- self$s_array[t, "b", ]
self$s_array[t, "c", ] <- a_t + b_t
invisible()
},
d = function(t, t_0) {
if (t < 2L) {
t <- 1L
t_1 <- t_0
t_tf_1 <- 0
}
else {
t_1 <- self$s_array[t - 1L, "d", ]
t_tf_1 <- self$s_tf(t, self$tf2, "d")
}
self$s_array[t, "d", ] <- round(0.7 * t_1 + 0.7 * t_tf_1)
invisible()
},
e = function(t, t_0) {
if (t < 2L) {
t <- 1L
t_1 <- t_0
t_tf_1 <- 0
}
else {
t_1 <- self$s_array[t - 1L, "e", ]
t_tf_1 <- self$s_tf(t, self$tf2, "e")
}
self$s_array[t, "e", ] <- round(0.9 * t_1 + 0.4 * t_tf_1)
invisible()
},
f = function(t) {
if (t < 1L) stop("t must be positive")
d_t <- self$s_array[t, "d", ]
e_t <- self$s_array[t, "e", ]
self$s_array[t, "f", ] <- d_t + e_t
invisible()
},
g = function(t, t_0) {
if (t < 2L) {
t <- 1L
t_1 <- t_0
t_tf_1 <- 0
}
else {
t_1 <- self$s_array[t - 1L, "g", ]
t_tf_1 <- self$s_tf(t, self$tf3, "g")
}
self$s_array[t, "g", ] <- round(0.3 * t_1 + 0.2 * t_tf_1)
invisible()
},
h = function(t, t_0) {
if (t < 2L) {
t <- 1L
t_1 <- t_0
t_tf_1 <- 0
}
else {
t_1 <- self$s_array[t - 1L, "h", ]
t_tf_1 <- self$s_tf(t, self$tf3, "h")
}
self$s_array[t, "h", ] <- round(0.5 * t_1 + 0.1 * t_tf_1)
invisible()
},
i = function(t) {
if (t < 1L) stop("t must be positive")
g_t <- self$s_array[t, "g", ]
h_t <- self$s_array[t, "h", ]
self$s_array[t, "i", ] <- g_t + h_t
invisible()
}
),
private = list(
t = NULL
)
)
max_t <- 10
abm <- ABM$new(c(max_t, 9, 2500),
list(NULL, letters[1:9], as.character(1:2500)),
c(288L, 292L, 150L))
abm$init_abm()
while (abm$can_advance()) {
abm$advance(TRUE)
}
anyNA(abm$s_array[])
# FALSE
Some of the functions under discrete equations encapsulate the logic for initialization when t < 2L.
The SArray class separates the 3D array into a list of 2D arrays to work around the .Machine$integer.max limit.
sample data:
x <- sample(c("aa", "bb", "cc", NA), 5000, replace = TRUE)
y <- sample(c("mm", "nn", "pp", NA), 5000, replace = TRUE)
z <- sample(c("uu", "vv", "ww", NA), 5000, replace = TRUE)
m <- sample(c(1:99, NA), replace = TRUE)
data <- data.frame(x, y, z, m)
creating a vector of Target variable for rpart
colname <- names(data[ , -m])
passing reference by colname index as target variable to rpart for imputing NA values in x, y , z:
for(i in seq_along(colname)) {
fm <- as.formula(paste0(colname[i], "~ ."))
mod <- rpart(fm, data = data, method = "class")
i1 <- is.na(data[[colname[i]]])
pred <- predict(mod, data[[colname[i]]][i1], type = "class")
data[[colname[i]]][i1] <- pred[i1]
}
i get the following error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = attr(object, :
variable lengths differ (found for 'm')
We create the formula with paste and then do the assignment only for the NA elements
library(rpart)
for(i in seq_along(colname)) {
fm <- as.formula(paste0(colname[i], "~ ."))
m <- rpart(fm, data = data, method = "class")
i1 <- is.na(data[[colname[i]]])
pred <- predict(m, data[c(colname[i], "m")], type = "class")[i1]
data[[colname[i]]][i1] <- pred[i1]
}
colSums(is.na(data))
# x y z
# 0 0 0
Update
With the updated dataset, subset the data in the rpart
colname <- names(data)[1:3]
for(i in seq_along(colname)) {
fm <- as.formula(paste0(colname[i], "~ ."))
m <- rpart(fm, data = data[colname], method = "class")
i1 <- is.na(data[[colname[i]]])
pred <- predict(m, data[[colname[i]]][i1], type = "class")
data[[colname[i]]][i1] <- pred[i1]
}
data
set.seed(24)
x <- sample(c("aa", "bb", "cc", NA), 5000, replace = TRUE)
y <- sample(c("mm", "nn", "pp", NA), 5000, replace = TRUE)
z <- sample(c("uu", "vv", "ww", NA), 5000, replace = TRUE)
data <- data.frame(x, y, z)
I am trying to perform a selection on a xdf file with rxDataStep.
I am using rowSelection and it works when I use explicit values but not when I use a variable, e.g.:
this works:
tmp <- rxDataStep(alias.Xdf, transforms = list(TT_AMOUNT = DC_AMOUNT * RT_AMOUNT, UNIT_PRICE = RT_VALUE / TT_AMOUNT), varsToKeep = c('DC_AMOUNT', 'RT_AMOUNT', 'RT_VALUE'),
rowSelection = (A_ID == 1646041))
but this does not:
x <- 1646041
tmp <- rxDataStep(alias.Xdf, transforms = list(TT_AMOUNT = DC_AMOUNT * RT_AMOUNT, UNIT_PRICE = RT_VALUE / TT_AMOUNT), varsToKeep = c('DC_AMOUNT', 'RT_AMOUNT', 'RT_VALUE'),
rowSelection = (A_ID == x))
it fails with the message:
ERROR: The sample data set for the analysis has no variables.
Caught exception in file: CxAnalysis.cpp, line: 3848. ThreadID: 31156 Rethrowing.
Caught exception in file: CxAnalysis.cpp, line: 5375. ThreadID: 31156 Rethrowing.
What is wrong here? I've been strugling with this for hours, tried every single sintax I found on the web.
Thanks.
We may need to pass it on the transformObjects
library(RevoScaleR)
rxDataStep(alias.Xdf, transforms = list(TT_AMOUNT = DC_AMOUNT * RT_AMOUNT,
UNIT_PRICE = RT_VALUE / TT_AMOUNT),
varsToKeep = c('DC_AMOUNT', 'RT_AMOUNT', 'RT_VALUE'),
rowSelection = (A_ID == x1), transformObjects = list(x1=x))
Using a reproducible example
set.seed(100)
myData <- data.frame(x = 1:100, y = rep(c("a", "b", "c", "d"), 25),
z = rnorm(100), w = runif(100))
z1 <- 2
myDataSubset <- rxDataStep(inData = myData,
varsToKeep = c("x", "w", "z"),
rowSelection = z > zNew,
transformObjects = list(zNew=z1))
#Rows Read: 100, Total Rows Processed: 100, Total Chunk Time: 0.007 seconds
myDataSubset
# x w z
#1 20 0.03609544 2.310297
#2 64 0.79408518 2.581959
#3 96 0.07123327 2.445683
This can be also done with dplyr
library(dplyr)
myData %>%
select(x, w, z) %>%
filter(z > z1)
# x w z
#1 20 0.03609544 2.310297
#2 64 0.79408518 2.581959
#3 96 0.07123327 2.445683