Using tapply, ave functions for ff vectors in R

I have been trying to use tapply, ave, and ddply to create statistics by group of a variable (age, sex), but I haven't been able to use any of the above-mentioned R commands successfully.
library("ff")
df <- as.ffdf(data.frame(a=c(1,1,1:3,1:5), b=c(10:1), c=(1:10)))
tapply(df$a, df$b, length)
The error message I get is
Error in as.vmode(value, vmode) :
argument "value" is missing, with no default
or
Error in byMean(df$b, df$a) : object 'index' not found

There is currently no tapply or ave method for ff vectors implemented in package ff.
But what you can do is use the functionality in ffbase.
Let's illustrate this on a bigger dataset:
require(ffbase)
a <- ffrep.int(ff(1:100000), times=500) ## 50Mio records on disk - not in RAM
b <- ffrandom(n=length(a), rfun = runif)
c <- ffseq_len(length(a))
df <- ffdf(a = a, b = b, c = c) ## on disk
dim(df)
For your simple aggregation, you can use binned_sum, from which you can extract the length easily as follows. Note that binned_sum needs an ff factor object as the bin, which can be obtained by doing as.character.ff as shown.
df$groupbyfactor <- as.character(df$a)
agg <- binned_sum(x=df$b, bin=df$groupbyfactor, nbins = length(levels(df$groupbyfactor)))
head(agg)
agg[, "count"]
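If you want the counts labelled with the group they belong to, something along these lines should do it (a sketch, assuming binned_sum returns its bins in the order of the factor levels):
counts <- data.frame(group = levels(df$groupbyfactor), count = agg[, "count"])
head(counts)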
For more complex aggregations you can use ffdfdply in ffbase. What I frequently do is combine it with some data.table statements like this:
require(data.table)
agg <- ffdfdply(df, split=df$groupbyfactor, FUN=function(x){
  x <- as.data.table(x)
  result <- x[, list(b.mean = mean(b), b.median = median(b), b.length = length(b),
                     whatever = b[c == max(c)][1]), by = list(a)]
  result <- as.data.frame(result)
  result
})
class(agg)
agg <- as.data.frame(agg) ## Puts the data in RAM!
ffdfdply will put your data in RAM in chunks of groups of the split elements, so that you can apply a function which needs its data in RAM, like the data.table statements above. The results of all chunks are then combined into a new ffdf, which you can use further or pull into RAM if your RAM allows that size.
The sizes of the chunks are controlled by getOption("ffbatchbytes"): the more RAM you have, the better, as it allows you to get more data into each chunk.
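For example, a hedged sketch of raising the chunk size to roughly 250 MB (the exact value to use depends on your machine):
## allow ffdfdply and friends to pull ~250 MB into RAM per chunk
options(ffbatchbytes = 250 * 1024^2)
getOption("ffbatchbytes")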

Related

calling the results of each iteration in a series

I have run a loop and the results are saved in the list com. Now I have to call the results of each iteration (# iterations = 2000) and compute the mean of the values, as below:
l <- rbindlist(list(com[[1]], com[[2]], com[[3]], ... com[[2000]]))[, .(values = mean(values)),
by = variables][order(variables)]
I am a beginner in R. What would be the easy way of doing this?
Until you provide a data example, this will only be guesswork. I assume the results in your list com are numeric vectors. If not, this solution may not work.
This is base R, not data.table.
Example data:
set.seed(1)
com <- list(rnorm(100), rnorm(100), rnorm(100), rnorm(100), rnorm(100))
We bind the results together using do.call:
l <- do.call("rbind", com)
Now we use the vectorized rowMeans:
rowMeans(l)
> rowMeans(l)
[1] 0.10888737 -0.03780808 0.02967354 0.05160186 -0.03913424
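If the elements of com are instead data.frames or data.tables with columns variables and values, as your own rbindlist code suggests, a data.table version could look roughly like this (a sketch, assuming those column names):
library(data.table)
# dummy list standing in for the 2000 iteration results
com <- replicate(2000, data.table(variables = letters[1:3], values = rnorm(3)),
                 simplify = FALSE)
# bind all iterations at once, then average per variable
l <- rbindlist(com)[, .(values = mean(values)), by = variables][order(variables)]
l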

Convert R apply statement to lapply for parallel processing

I have the following R "apply" statement:
for(i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation))
{
  matrix_of_sums[,i] <-
    apply(simulation_results[,colnames(simulation_results) %in%
            dataframe_stuff_that_needs_lookup_from_simulation[i,]], 1, sum)
}
So, I have the following data structures:
simulation_results: A matrix with column names that identify every possible piece of desired simulation lookup data for 2000 simulations (rows).
dataframe_stuff_that_needs_lookup_from_simulation: Contains, among other items, fields whose values match the column names in the simulation_results data structure.
matrix_of_sums: When the code is run, a 2000 row x 250,000 column (# of simulations x items being simulated) structure meant to hold the simulation results.
So the apply call is looking up the dataframe's column values for each of the 250,000 rows, computing the sum across the matching columns, and storing it in the matrix_of_sums structure.
Unfortunately, this processing takes a very long time. I have explored the use of rowSums as an alternative, and it has cut the processing time in half, but I would like to try multi-core processing to see if that cuts processing time even more. Can someone help me convert the code above from "apply" to "lapply"?
Thanks!
With base R parallel, try
library(parallel)
cl <- makeCluster(detectCores())
# export the objects the PSOCK workers need to see
clusterExport(cl, c("simulation_results",
                    "dataframe_stuff_that_needs_lookup_from_simulation"))
matrix_of_sums <- parLapply(cl, 1:nrow(dataframe_stuff_that_needs_lookup_from_simulation), function(i)
  rowSums(simulation_results[,colnames(simulation_results) %in%
            dataframe_stuff_that_needs_lookup_from_simulation[i,]]))
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
You could also try foreach %dopar%
library(doParallel) # will load parallel, foreach, and iterators
cl <- makeCluster(detectCores())
registerDoParallel(cl)
matrix_of_sums <- foreach(i = 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) %dopar% {
  rowSums(simulation_results[,colnames(simulation_results) %in%
            dataframe_stuff_that_needs_lookup_from_simulation[i,]])
}
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
I wasn't quite sure how you wanted your output at the end, but it looks like you're doing a cbind of each result. Let me know if you're expecting something else however.
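As an aside, when the result list is long, binding everything in a single call is usually faster than the pairwise Reduce (same result, minor variation):
ans <- do.call(cbind, matrix_of_sums)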
Without really having any applicable or sample data to go off of, the process would look like this:
Create a holding matrix (matrix_of_sums)
Loop by row through the lookup table (dataframe_stuff_that_needs_lookup_from_simulation)
Find the matching column indices within the simulation results (simulation_results)
Bind the rowSums into the holding matrix (matrix_of_sums)
I recreated a sample set which is meaningless but goes through the same steps, so the approach should carry over to your data:
# Holding matrix which will be our end-goal; mclapply runs FUN in forked
# children, so assignments with `<<-` would not reach the parent process.
# Instead, each worker returns its rowSums and we bind them afterwards.
res <- parallel::mclapply(1:nrow(ts_df), function(i){
  # Store the row in its own variable for ease
  d <- ts_df[i,]
  # rowSums over the simulation columns matching this row
  rowSums(
    sim_df[, which(colnames(sim_df) %in% colnames(d))]
  )
}, mc.cores = parallel::detectCores(), mc.allow.recursive = TRUE)
# cbind the per-row results into the holding matrix
msums <- do.call(cbind, res)

clusterMap split list of data.frames

I'm working with two lists of data.frames and currently run something similar to this (simplified version of what I'm doing):
df1 <- data.frame("a","a1","L","R","b","c",1,2,3,4)
df2 <- data.frame("a","a1","L","R","b","c",4,4,4,4,4,44)
df3 <- data.frame(7,7,7,7)
df4 <- data.frame(5,5,5,5,9,9)
L1 <- list(df1,df2)
L2 <- list(df3,df4)
myfun <- function(x,y) {
  difa = rowSums(abs(x[c(T,F)] - x[c(F,T)]))
  difb = sum(abs(as.numeric(y[-c(1:6)])[c(T,F)] - as.numeric(y[-c(1:6)])[c(F,T)]))
  diff <- difa + difb
  return(diff)
}
output1 <- mapply(myfun, x = L2, y = L1)
Each list contains the same number of data frames, and each dataframe in one list corresponds to a dataframe in the other list. The dataframes in one list contain a single row, while the dataframes in the second list contain a varying number of rows; hence the use of sum and rowSums. The number of numeric columns is also variable, but always the same between corresponding dataframes.
I'm looking to use parallel processing to speed up the computation when dealing with 1-10 million dataframes per list. I tried the following:
library(parallel)
if(detectCores() > 1) {no_cores <- detectCores() - 1}
if(.Platform$OS.type == "unix") {ptype <- "FORK"}
cl <- makeCluster(no_cores, type = ptype)
clusterMap(cl, myfun, x = L2, y = L1)
stopCluster(cl)
However, due to the large amount of data I'm using, this quickly fills up memory, presumably because the entire lists of data frames get loaded onto each cluster worker. I'm new to parallel processing in R and have read that splitting the data into chunks according to the number of available cores is required for some parallel functions that don't do it automatically, so I tried the following, which does not work:
library(parallel)
if(detectCores() > 1) {no_cores <- detectCores() - 1}
if(.Platform$OS.type == "unix") {ptype <- "FORK"}
cl <- makeCluster(no_cores, type = ptype)
output1 <- clusterMap(cl, myfun, x = split(L2, ceiling(seq_along(L2)/no_cores)), y = split(L1, ceiling(seq_along(L1)/no_cores)))
stopCluster(cl)
Can someone help a newbie out? Most of the information I've been reading uses parApply/parLapply/etc. I was able to use mcmapply, but since it only supports forking, I cannot rely on it: my code has to run on both unix and windows systems, hence my test of OS.type to decide whether to fork.
UPDATE: I think it is working correctly in the sense that it's handing out chunks to different workers, but the data type is not playing nice with the binary operators inside the workers. The issue appears to be that each chunk is now a list of data.frames, so myfun receives lists rather than data.frames and they are treated as non-numeric on the workers.
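One hedged way to make the chunked call line up with myfun is to let each worker loop over its own sub-list, i.e. wrap myfun in mapply inside the worker (a sketch built on the objects defined above, not a tested fix):
chunk_fun <- function(x_chunk, y_chunk) mapply(myfun, x = x_chunk, y = y_chunk)
chunks_x <- split(L2, ceiling(seq_along(L2) / no_cores))
chunks_y <- split(L1, ceiling(seq_along(L1) / no_cores))
cl <- makeCluster(no_cores, type = ptype)
clusterExport(cl, "myfun")   # needed for PSOCK clusters; harmless for FORK
res <- clusterMap(cl, chunk_fun, x_chunk = chunks_x, y_chunk = chunks_y)
stopCluster(cl)
output1 <- unlist(res, use.names = FALSE)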

R data.table efficient replication by group

I am running into some memory allocation problems trying to replicate some data by groups using data.table and rep.
Here is some sample data:
ob1 <- as.data.frame(cbind(c(1999),c("THE","BLACK","DOG","JUMPED","OVER","RED","FENCE"),c(4)),stringsAsFactors=FALSE)
ob2 <- as.data.frame(cbind(c(2000),c("I","WALKED","THE","BLACK","DOG"),c(3)),stringsAsFactors=FALSE)
ob3 <- as.data.frame(cbind(c(2001),c("SHE","PAINTED","THE","RED","FENCE"),c(1)),stringsAsFactors=FALSE)
ob4 <- as.data.frame(cbind(c(2002),c("THE","YELLOW","HOUSE","HAS","BLACK","DOG","AND","RED","FENCE"),c(2)),stringsAsFactors=FALSE)
sample_data <- rbind(ob1,ob2,ob3,ob4)
colnames(sample_data) <- c("yr","token","multiple")
What I am trying to do is replicate the tokens (in the present order) by the multiple for each year.
The following code works and gives me the answer I want:
good_solution1 <- ddply(sample_data, "yr", function(x) data.frame(rep(x[,2],x[1,3])))
good_solution2 <- data.table(sample_data)[, rep(token,unique(multiple)),by = "yr"]
The issue is that when I scale this up to 40mm+ rows, I get into memory issues for both possible solutions.
If my understanding is correct, these solutions are essentially doing an rbind, which allocates every time.
Does anyone have a better solution?
I looked at set() for data.table but was running into issues because I wanted to keep the tokens in the same order for each replication.
One way is:
require(data.table)
dt <- data.table(sample_data)
# multiple seems to be a character, convert to numeric
dt[, multiple := as.numeric(multiple)]
setkey(dt, "multiple")
dt[J(rep(unique(multiple), unique(multiple))), allow.cartesian=TRUE]
Everything except the last line should be straightforward. The last line subsets by the key column with the help of J(.). Each value in J(.) is matched against the key column and the matched subset is returned.
That is, if you do dt[J(1)] you'll get the subset where multiple = 1. And if you note carefully, doing dt[J(rep(1,2))] gives you the same subset, but twice. Note that there's a difference between passing dt[J(1,1)] and dt[J(rep(1,2))]. The former matches the values (1,1) against the first two key columns of the data.table respectively, whereas the latter subsets by matching 1 and 1 against the first key column of the data.table.
So, if we pass the same value of the key column 2 times in J(.), the matching subset gets duplicated. We use this trick to pass 1 once, 2 twice, etc., and that's what the rep(.) part does: rep(.) gives 1,2,2,3,3,3,4,4,4,4.
And if the join results in more rows than max(nrow(dt), nrow(i)) (where i is the rep vector inside J(.)), you have to explicitly use allow.cartesian = TRUE to perform the join (I believe this is a new feature from data.table 1.8.8).
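A quick illustration of the duplication trick on the sample data (a sketch, using the dt created above):
dt[J(1)]            # the subset where multiple == 1, once
dt[J(rep(1, 2))]    # the same subset, repeated twice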
Edit: Here's some benchmarking I did on "relatively" big data. I don't see any spike in memory allocation in either method. But I've yet to find a way to monitor peak memory usage within a function in R. I am sure I've seen such a post here on SO, but it escapes me at the moment. I'll write back again. For now, here's the test data and some preliminary results in case anyone is interested/wants to run it for themselves.
# dummy data
set.seed(45)
yr <- 1900:2013
sz <- sample(10:50, length(yr), replace = TRUE)
token <- unlist(sapply(sz, function(x) do.call(paste0, data.frame(matrix(sample(letters, x*4, replace=T), ncol=4)))))
multiple <- rep(sample(500:5000, length(yr), replace=TRUE), sz)
DF <- data.frame(yr = rep(yr, sz),
                 token = token,
                 multiple = multiple, stringsAsFactors = FALSE)
# Arun's solution
ARUN.DT <- function(dt) {
  setkey(dt, "multiple")
  idx <- unique(dt$multiple)
  dt[J(rep(idx, idx)), allow.cartesian=TRUE]
}
# Ricardo's solution
RICARDO.DT <- function(dt) {
  setkey(dt, "yr")
  newDT <- setkey(dt[, rep(NA, list(rows=length(token) * unique(multiple))), by=yr][, list(yr)], 'yr')
  newDT[, tokenReps := as.character(NA)]
  # Add the rep'd tokens into newDT, using recycling
  newDT[, tokenReps := dt[.(y)][, token], by=list(y=yr)]
  newDT
}
# create data.table
require(data.table)
DT <- data.table(DF)
# benchmark both versions
require(rbenchmark)
benchmark(res1 <- ARUN.DT(DT), res2 <- RICARDO.DT(DT), replications=10, order="elapsed")
# test replications elapsed relative user.self sys.self
# 1 res1 <- ARUN.DT(DT) 10 9.542 1.000 7.218 1.394
# 2 res2 <- RICARDO.DT(DT) 10 17.484 1.832 14.270 2.888
But as Ricardo says, it may not matter if you run out of memory. So, in that case, there has to be a trade-off between speed and memory. What I'd like to verify is the peak memory used in both methods here to say definitively if using Join is better.
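One hedged way to approximate the peak memory used around a call is to reset gc()'s "max used" counters just before it and read them back afterwards (a rough sketch, not a precise profiler):
gc(reset = TRUE)     # reset the "max used" statistics
res1 <- ARUN.DT(DT)
gc()                 # the "max used" column now reflects the peak since the reset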
You can try allocating the memory for all the rows first, and then populating them iteratively.
e.g.:
# make sure `sample_data$multiple` is an integer
sample_data$multiple <- as.integer(sample_data$multiple)
# create data.table
S <- data.table(sample_data, key='yr')
## Allocate the memory first
newDT <- data.table(yr = rep(sample_data$yr, sample_data$multiple), key="yr")
# optionally, drop the original data.frame once it is no longer needed
rm(sample_data)
newDT[, tokenReps := as.character(NA)]
# Add the rep'd tokens into newDT, using recycling
newDT[, tokenReps := S[.(y)][, token], by=list(y=yr)]
Two notes:
(1) sample_data$multiple is currently a character and thus gets coerced when passed to rep (in your original example). It might be worth double-checking whether that is also the case in your real data.
(2) I used the following to determine the number of rows needed per year
S[, list(rows=length(token) * unique(multiple)), by=yr]

How to create vector matrix of movie ratings using R project?

Suppose I am using this data set of movie ratings: http://www.grouplens.org/node/73
It contains ratings in a file formatted as
userID::movieID::rating::timestamp
Given this, I want to construct a feature matrix in R project, where each row corresponds to a user and each column indicates the rating that the user gave to the movie (if any).
Example, if the data file contains
1::1::1::10
2::2::2::11
1::2::3::12
2::1::5::13
3::3::4::14
Then the output matrix would look like:
UserID, Movie1, Movie2, Movie3
1, 1, 3, NA
2, 5, 2, NA
3, NA, NA, 4
So is there some built-in way to achieve this in R? I wrote a simple Python script to do the same thing, but I bet there are more efficient ways to accomplish this.
You can use the dcast function, in the reshape2 package, but the resulting data.frame may be huge (and sparse).
d <- read.delim(
"u1.base",
col.names = c("user", "film", "rating", "timestamp")
)
library(reshape2)
d <- dcast( d, user ~ film, value.var = "rating" )
If your fields are separated by double colons, you cannot use the sep argument of read.delim, which has to be only one character.
If you already do some preprocessing outside R, it is easier to do it there (e.g., in Perl, it would just be s/::/\t/g), but you can also do it in R: read the file as a single column, split the strings, and concatenate the result.
d <- read.delim("a")
d <- as.character( d[,1] ) # vector of strings
d <- strsplit( d, "::" ) # List of vectors of strings of characters
d <- lapply( d, as.numeric ) # List of vectors of numbers
d <- do.call( rbind, d ) # Matrix
d <- as.data.frame( d )
colnames( d ) <- c( "user", "movie", "rating", "timestamp" )
From the web site pointed to in a previous question, it appears that you want to represent
> print(object.size(integer(10000 * 72000)), units="Mb")
2746.6 Mb
which should be 'easy' with the 8 GB you reference in another question. Also, the total length is less than the maximum vector length in R, so that should be OK too. But see the end of the response for an important caveat!
I created, outside R, a tab-delimited version of the data file. I then read in the information I was interested in
what <- list(User=integer(), Film=integer(), Rating=numeric(), NULL)
x <- scan(fl, what)
the 'NULL' drops the unused timestamp data. The 'User' and 'Film' entries are not sequential, and numeric() on my platform takes up twice as much memory as integer(), so I converted User and Film to factor, and Rating to integer() by doubling (the original scores are 1 to 5 in increments of 1/2).
x <- list(User=factor(x$User), Film=factor(x$Film),
Rating=as.integer(2 * x$Rating))
I then allocated the matrix
ratings <- matrix(NA_integer_,
                  nrow=length(levels(x$User)),
                  ncol=length(levels(x$Film)),
                  dimnames=list(levels(x$User), levels(x$Film)))
and use the fact that a two-column matrix can be used to index another matrix
ratings[cbind(x$User, x$Film)] <- x$Rating
This is the step where memory use is at its maximum. I'd then remove the unneeded variable
rm(x)
The gc() function tells me how much memory I've used...
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 140609 7.6 407500 21.8 350000 18.7
Vcells 373177663 2847.2 450519582 3437.2 408329775 3115.4
... a little over 3 Gb, so that's good.
Having done that, you'll now run into serious problems. kmeans (from your response to questions on an earlier answer) will not work with missing values:
> m = matrix(rnorm(100), 5)
> m[1,1]=NA
> kmeans(m, 2)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
and as a very rough rule of thumb I'd expect ready-made R solutions to require 3-5 times as much memory as the starting data size. Have you worked through your analysis with a smaller data set?
Quite simply, you can represent it as a sparse matrix, using sparseMatrix from the Matrix package.
Just create a three-column coordinate list, i.e. triplets of the form (i, j, value), say in a data.frame named myDF. Then execute mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x, dims = c(numRows, numCols)). You need to decide the number of rows and columns; otherwise, the maximum indices will be used to decide the size of the matrix.
It's just that simple. Storing sparse data in a dense matrix is inappropriate, if not grotesque.
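For instance, a minimal sketch using the toy ratings from the question (the object names here are just illustrative):
library(Matrix)
myDF <- data.frame(i = c(1, 2, 1, 2, 3),   # userID
                   j = c(1, 2, 2, 1, 3),   # movieID
                   x = c(1, 2, 3, 5, 4))   # rating
mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x, dims = c(3, 3))
mySparseMat   # unrated cells are stored as structural zeros, not NA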
