I have a function written as:
setlower <- function(df) {
  for (j in seq_along(df)) {
    # overwrite each column in place with its lower-cased version
    data.table::set(df, j = j, value = stringi::stri_trans_tolower(df[[j]]))
  }
  invisible(df)
}
I have a much larger package that calls this function on multiple data.tables in an lapply. Profiling shows that a majority of the time for each data.table I am processing is spent in the setlower function. Is there any way to speed this up? I know both data.table and stringi are supposed to be about as fast as you can get for data frame and string operations, but I didn't know if there is a potentially faster way to ensure that any given data frame is converted entirely to lower case.
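One possible refinement, a minimal sketch assuming the tables may contain non-character columns: skip everything that is not character, so the conversion only touches columns it can actually change.

setlower <- function(df) {
  for (j in seq_along(df)) {
    # assumption: non-character columns should be left untouched
    if (is.character(df[[j]])) {
      data.table::set(df, j = j, value = stringi::stri_trans_tolower(df[[j]]))
    }
  }
  invisible(df)
}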
How is it possible that storing data into an H2O matrix is slower than into a data.table?
# Packages used: "h2o" and "data.table"
library(h2o)
library(data.table)
h2o.init(nthreads = -1)   # start the H2O server before creating frames

# Create the matrices
matrix1 <- data.table(matrix(rnorm(1000 * 1000), ncol = 1000, nrow = 1000))
matrix2 <- h2o.createFrame(1000, 1000)

# data.table element-by-element store
for (i in 1:1000) {
  matrix1[i, 1] <- 3
}

# H2O frame element-by-element store
for (i in 1:1000) {
  matrix2[i, 1] <- 3
}
Thanks!
H2O is a client/server architecture. (See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/architecture.html)
So what you've shown is a very inefficient way to populate an H2O frame in H2O memory. Every write is going to turn into a network call. You almost certainly don't want this.
For your example, since the data isn't large, a reasonable approach is to do the initial assignment in a local data frame (or data.table) and then push it to H2O in one step with as.h2o():
h2o_frame = as.h2o(matrix1)
head(h2o_frame)
This pushes an R data frame from the R client into an H2O frame in H2O server memory. (And you can do as.data.table() to do the opposite.)
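For example, the reverse direction is one line (a sketch using the h2o_frame created above):

dt_back <- as.data.table(h2o_frame)   # pull the H2O frame back into the R client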
data.table Tips:
For data.table, prefer the in-place := syntax, which avoids copies. For example, to set column 3 of row i to 42:
matrix1[i, 3 := 42]
H2O Tips:
The fastest way to read data into H2O is by ingesting it with the pull method, h2o.importFile(); this is parallel and distributed (see the sketch after these tips).
The as.h2o() trick shown above works well for small datasets that easily fit in memory of one host.
If you want to watch the network messages between R and H2O, call h2o.startLogging().
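A minimal sketch of the h2o.importFile() ingest path (the file path is a placeholder, not from the original answer):

h2o.init(nthreads = -1)
big_frame <- h2o.importFile("path/to/data.csv")   # parallel, distributed read
dim(big_frame)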
I can't answer your question because I don't know h2o. However, I can make a guess.
Your code to fill the data.table is slow because of R's "copy-on-modify" semantics. If you update your table by reference, you will speed up your code enormously.
# Slow: each assignment triggers copy-on-modify
for (i in 1:1000) {
  matrix1[i, 1] <- 3
}

# Fast: set() assigns by reference, in place
for (i in 1:1000) {
  set(matrix1, i, 1L, 3)
}
With set() my loop takes 3 milliseconds, while your loop takes 18 seconds (about 6000 times slower).
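To reproduce the measurement (a sketch; exact timings vary by machine, and the matrix mirrors the one from the question):

library(data.table)
matrix1 <- data.table(matrix(rnorm(1000 * 1000), ncol = 1000, nrow = 1000))
system.time(for (i in 1:1000) matrix1[i, 1] <- 3)       # copy-on-modify: seconds
system.time(for (i in 1:1000) set(matrix1, i, 1L, 3))   # by reference: milliseconds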
I suppose h2o works in a similar way, but with some extra work done because it is a special object. Maybe some message-passing communication with the H2O cluster?
I have two data frames and I am doing a grouping operation based on weighted scores. I used profvis to profile the code and found that looping through the data frames to check and add group labels is a costly operation. I understand we can use lapply, but I am not sure how to pass two data frames and a new variable to it. Please help; I just need to reduce the time and space complexity of this code using apply functions.
rank1 <- c()
occup_cats <- c()
for (i in 1:length(data_set$primary_occupation)) {
  for (j in 1:length(occup_cat_prop$Category)) {
    if (as.character(data_set$primary_occupation[i]) == as.character(occup_cat_prop$income_source[j])) {
      rank1[i] <- occup_cat_prop$prop[j]
      occup_cats[i] <- as.character(occup_cat_prop$Category[j])
    }
  }
}
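A vectorized sketch that removes both loops, assuming each income_source value appears at most once in occup_cat_prop (match() keeps the first hit, while the loop above keeps the last):

idx <- match(as.character(data_set$primary_occupation),
             as.character(occup_cat_prop$income_source))
rank1      <- occup_cat_prop$prop[idx]
occup_cats <- as.character(occup_cat_prop$Category[idx])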
I have a large dataset I am reading into R. I want to apply the unique() function to it so I can work with it better, but when I try to do so, I get this error:
clients <- unique(clients)
Error: cannot allocate vector of size 27.9 Mb
So I am trying to apply this function part by part by doing this:
clientsmd <- data.frame()
n <- 7316738   # number of observations in the dataset
t <- 0
for (i in 1:200) {
  clientsm  <- clients[1 + (t * round((n/200))):(t + 1) * round((n/200)), ]
  clientsm  <- unique(clientsm)
  clientsmd <- rbind(clientsm)
  t <- t + 1
}
But I get this:
Error in `[.default`(xj, i) : subscript too large for 32-bit R
I have been told that I could do this more easily with packages such as "ff" or "bigmemory" (or others), but I don't know how to use them for this purpose.
I'd appreciate any kind of guidance, whether it is telling me why my code won't work or showing me how I could take advantage of these packages.
Is clients a data.frame or a data.table? data.table can handle quite large amounts of data compared to data.frame.
library(data.table)
clients <- data.table(clients)
clientsUnique <- unique(clients)
or
duplicateIndex <- duplicated(clients)
will give a logical vector marking the rows that are duplicates.
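You can then use it to drop those rows (a one-line sketch):

clients <- clients[!duplicateIndex, ]   # keep only the first occurrence of each row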
Increase your memory limit as shown below and then try executing again.
memory.limit(4000) ## windows specific command
You could use the distinct() function from the dplyr package:
df %>% distinct(ID)
where ID is a column that should be unique in your data frame.
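If the goal is instead to drop rows that are duplicated across all columns, a minimal sketch:

library(dplyr)
clients <- distinct(clients)   # keep one copy of each fully duplicated row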
I have some functions like this:
myf = function(x) {
  # many similar statements involving indexing x
  do1(x[, indexfunc1()])
  do2(x[, indexfunc1()])
  do3(x[, indexfunc1()])
  do4(x[, indexfunc1()])
  do5(x[, indexfunc1()])
}
In all these functions, I need to extract columns or rows of x, and these functions are used in some loops. The problem is that sometimes we also have data in a transposed format, which means that for these data we have to work with t(x). This is very inefficient and time consuming, since these matrices are often huge.
Is there a smart way to deal with this? It would be very annoying to have to change the code manually.
Well, first of all, if your doX functions expect the transpose of the matrix, you are going to be calling t() somewhere, for example:
do1(t(x[indexfunc(), ]))
So your options are:
1. Transpose x once at the top.
2. Transpose at each doX call.
3. Rewrite your doX functions so they take an optional isTranspose argument.
Option 3 will be the most work, but also the most efficient. The situation where it would make sense to use option 2 is when x is huge but you are only selecting a small number of rows/columns each time. In that case you could do something like this:
matrixSelect <- function(x, subset, dim = 1) {
  if (dim == 1)
    t(x[subset, ])
  else
    x[, subset]
}
and then write
myf = function(x, dim = 2) {
  # many similar statements involving indexing x
  do1(matrixSelect(x, indexfunc1(), dim))
  # etc
}
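As a quick check of matrixSelect on a small matrix (a sketch, not part of the original answer):

m <- matrix(1:6, nrow = 2)     # a 2 x 3 matrix
matrixSelect(m, 1, dim = 1)    # row 1, via the transposing branch
matrixSelect(m, 1, dim = 2)    # column 1, returned directly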
This question might be very simple, but I cannot find a good way to solve it:
I have a dataset with many subgroups which need to be analysed both all together and on their own. Therefore, I want to use subsets for the groups and use them for the later analysis. Both the definition of the subsets and the analysis should be done partly with loops, in order to save space and to ensure that the same analysis is applied to all subgroups.
Here is an example of my code using an example dataframe from the boot package:
library(boot)
data(aids)
qlist <- c("1", "2", "3", "4")
for (i in length(qlist)) {
  paste("aids.sub.", qlist[i], sep = "") <- subset(aids, quarter == qlist[i])
}
The variable that contains the subgroups in my dataset is stored as a string, which is why I added the qlist part; it would not be required otherwise.
Make a list of the subsets with lapply:
lapply(qlist, function(x) subset(aids, quarter == x))
Equivalently, avoiding subset():
lapply(qlist, function(x) aids[aids$quarter == x, ])
It is likely that using a list will make the subsequent code easier to write and understand. You can subset the list to get a single data frame (just as you can use one of the variables created below), but you can also iterate over it (using for or lapply) without having to construct variable names.
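For example, naming the list elements keeps both access patterns simple (a small sketch reusing qlist from the question):

subs <- lapply(qlist, function(x) aids[aids$quarter == x, ])
names(subs) <- paste0("aids.sub.", qlist)
head(subs[["aids.sub.1"]])   # pick out a single subset
lapply(subs, nrow)           # iterate over all subsets at once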
To do the job as you are asking, use assign:
for (i in qlist) {
  assign(paste("aids.sub.", i, sep = ""), subset(aids, quarter == i))
}
Note the removal of the length() function, and that this is iterating directly over qlist.
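If you later need those variables back in one place, a sketch using base R's get() and mget() with the same naming scheme:

head(get("aids.sub.1"))                    # inspect one generated subset
subs <- mget(paste0("aids.sub.", qlist))   # collect all of them into a named list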