expand.grid - try to solve "cannot allocate vector of size" issue - r

I need to create a huge data.frame of combinations, but I don't need all of them. As I saw here, expand.grid has no way to specify a condition for which combinations to throw out.
So I decided to go step by step. For example I have
variants<-9 # number of possible variants
aa<-c(0:variants) # vector of possible variants
ab<-c(0:variants)
ac<-c(0:variants)
ad<-c(0:variants)
ae<-c(0:variants)
af<-c(0:variants)
ag<-c(0:variants)
ah<-c(0:variants)
ai<-c(0:variants)
aj<-c(0:variants)
If I try to
expand.grid(aa,ab,ac,ad,ae,af,ag,ah,ai,aj)
I get the "cannot allocate vector of size" error.
So I tried to go step by step, like this:
step<-2 # it is a condition for subsetting the grid
grid_2<-expand.grid(aa,ab)
sub_grid_2<-grid_2[abs(grid_2[,1]-grid_2[,2])<=step,]
which gives me the combinations I need. To save memory, I then add another column like this:
fun_grid_list_3<-function(x){
a<-sub_grid_2[x,1]
b<-sub_grid_2[x,2]
d<-rep(c(1:variants))
c<-data.frame(Var1=rep(a,variants),Var2=rep(b,variants),Var3=d)
return(c)
}
sublist_grid_3<-mclapply(c(1:nrow(sub_grid_2)),fun_grid_list_3,mc.cores=detectCores(),mc.preschedule=FALSE)
sub_grid_3=ldply(sublist_grid_3)
But the problem starts once I get to a grid of 8 or more variables. It takes a very long time, even though it should just be adding a number to another data frame. Maybe I am wrong and it truly needs that much time, but I hope there is a more efficient way to do this.
All I need is to create an expand.grid of 2 variables and subset it by a condition, then add another column that respects the subsetted grid (add c(0:variants) to every row, which of course creates more rows), subset again by the condition, and so on.
Can anybody help make this faster? I hoped that running the function through mclapply would be the fastest way, but maybe not.
Thanks to anyone ...
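One possible way to avoid the per-row function calls is to grow the grid one column at a time with vectorised indexing. This is only a sketch: the helper name grow_grid is made up for illustration, and it assumes the condition for each new column is the same as in the two-column step (the new value must be within step of the previous column), which the question does not spell out. The small variants value is only there to keep the example quick.

```r
variants <- 3          # small value so the example runs instantly
step <- 1              # assumed condition: adjacent columns differ by at most `step`
vals <- 0:variants

# Hypothetical helper: cross an already-filtered grid with `vals`
# and keep only the rows that satisfy the assumed condition.
grow_grid <- function(grid, vals, step) {
  k <- ncol(grid)
  # repeat every existing row once per candidate value
  idx <- rep(seq_len(nrow(grid)), each = length(vals))
  new_col <- rep(vals, times = nrow(grid))
  # compare the last existing column with the candidate value
  keep <- abs(grid[idx, k] - new_col) <= step
  out <- grid[idx[keep], , drop = FALSE]
  out[[paste0("Var", k + 1)]] <- new_col[keep]
  rownames(out) <- NULL
  out
}

# start from the filtered two-column grid, then add columns 3 and 4
grid <- expand.grid(Var1 = vals, Var2 = vals)
grid <- grid[abs(grid$Var1 - grid$Var2) <= step, ]
for (k in 3:4) grid <- grow_grid(grid, vals, step)
```

Each step only ever materialises the surviving rows times length(vals), so the full expand.grid is never built. With ten columns and a loose condition the result can still be large, so this helps only as much as the condition actually prunes.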

Related

Count number of 0's on each column

My problem is quite simple, but I have not been able to solve it. I have a tibble data frame and want to know how many 0's each column has. I tried using sum(dataframe$column == 0) on each column, but that seems inefficient since I want to apply it to a number of different data frames. Is there a more automatic way to do it?
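A minimal sketch of one common approach, assuming all columns are numeric: comparing the whole data frame with 0 gives a logical matrix, and colSums counts the TRUE values per column. The data frames df and df2 below are made-up examples, and a plain data.frame is used here in place of a tibble.

```r
# made-up example data
df <- data.frame(a = c(0, 1, 0), b = c(2, 0, 3), c = c(1, 1, 1))

# df == 0 is a logical matrix; colSums counts the TRUEs in each column
zero_counts <- colSums(df == 0, na.rm = TRUE)

# the same call applied to several data frames at once
df2 <- data.frame(a = c(0, 0), b = c(5, 0))
all_counts <- lapply(list(first = df, second = df2),
                     function(d) colSums(d == 0, na.rm = TRUE))
```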

Not aggregating correctly

The goal of this code is to create a loop that aggregates each company's word frequency by a certain principle vector I created and adds the result to a list. The problem is that, after I run this, it only prints the 7 principles that I have rather than the word frequencies alongside them. The word frequencies are a certain column of the FREQBYPRINC.AG data frame. Running this code without the loop and just testing a single column works without any problem. For some reason, the loop doesn't give me the correct data frames for the list. Any suggestions?
list.agg<-vector("list",ncol(FREQBYPRINC.AG)-2)
for (i in 1:14){
attach(FREQBYPRINC.AG)
list.agg[i]<-aggregate(FREQBYPRINC.AG[,i+1],by=list(Type=principle),FUN=sum,na.rm=TRUE)
}
The code looks like it should work, but there is a small bug in how each result is stored.
Since you created list.agg as a list, you need to assign into it with double square brackets. With single brackets, only the first component of each aggregate() result (the principle column) is kept, which is why you see only the principles. Try this:
list.agg<-vector("list",ncol(FREQBYPRINC.AG)-2)
for (i in 1:14){
# attach() is dropped here, so principle is referenced through the data frame
list.agg[[i]]<-aggregate(FREQBYPRINC.AG[,i+1],by=list(Type=FREQBYPRINC.AG$principle),FUN=sum,na.rm=TRUE)
}
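To see the effect on a small scale, here is a sketch with a made-up data frame toy (two word columns instead of fourteen). Each element stored with double brackets is a complete two-column data frame from aggregate().

```r
# made-up stand-in for FREQBYPRINC.AG
toy <- data.frame(principle = c("a", "b", "a", "b"),
                  w1 = 1:4,
                  w2 = 5:8)

res <- vector("list", 2)
for (i in 1:2) {
  # [[i]] stores the whole data frame returned by aggregate()
  res[[i]] <- aggregate(toy[, i + 1],
                        by = list(Type = toy$principle),
                        FUN = sum, na.rm = TRUE)
}
```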

How to order a matrix by all columns

Ok, I'm stuck in a dumbness loop. I've read thru the helpful ideas at How to sort a dataframe by column(s)? , but need one more hint. I'd like a function that takes a matrix with an arbitrary number of columns, and sorts by all columns in sequence. E.g., for a matrix foo with N columns,
does the equivalent of foo[order(foo[,1],foo[,2],...foo[,N]),] . I am happy to use a with or by construction, and if necessary define the colnames of my matrix, but I can't figure out how to automate the collection of arguments to order (or to with) .
Or, I should say, I could build the entire bloody string with paste and then call it, but I'm sure there's a more straightforward way.
The most elegant (for certain values of "elegant") way would be to turn it into a data frame, and use do.call:
foo[do.call(order, as.data.frame(foo)), ]
This works because a data frame is just a list of variables with some associated attributes, and can be passed to functions expecting a list.
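A small worked example of the same idea, using a made-up two-column matrix. The unname() call is an extra precaution not in the answer above: it keeps column names such as "decreasing" from being matched to order's own arguments.

```r
# made-up matrix: column 1 is 2,1,2,1 and column 2 is 3,9,1,4
foo <- matrix(c(2, 1, 2, 1,
                3, 9, 1, 4), ncol = 2)

# each column becomes one argument to order(), in sequence
ord <- do.call(order, unname(as.data.frame(foo)))
sorted <- foo[ord, ]
```

Here sorted has rows (1,4), (1,9), (2,1), (2,3): ordered by the first column, with ties broken by the second.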

Best way of storing data in 100 objects for later retrieval?

When doing sequencing, I normally apply TraMineR's seqdef function on a dataset to generate a single sequence object:
sequence_object <- seqdef(data)
However, let's say I want to loop through a dataframe and generate 1 sequence object per every chunk of 10 columns. Then I would do something like this:
colpicks <- seq(10,1000,by=10)
mapply(function(start,stop) seqdef(df[,start:stop]), colpicks-9, colpicks)
Now, I want to store these objects in some suitable manner. Two questions:
What is the most suitable way of storing (or maybe just automatically naming) 100 objects, so that I can easily loop through each of them at a later point?
How can I modify my code above so that it stores the data per your answer to (1)?
"Most suitable" is completely subjective and dependent on your goal.
I'm assuming this question is related to your previous question, and thus I would suggest setting the SIMPLIFY argument of mapply to FALSE (note the upper case; the lower-case simplify belongs to sapply)
myMatrixList <- mapply(.... , SIMPLIFY=FALSE)
However, even that is not necessary, as you can just combine the sapply from the previous question and skip the middle step
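As a rough sketch of the list-based storage (not the sapply variant from the earlier question, which is not shown here), the chunks can be kept in one named list and looped over later. To keep the example self-contained, TraMineR's seqdef is replaced by plain column subsetting, and the data frame and the naming scheme are made up for illustration.

```r
# made-up data: 5 rows, 30 columns (stand-in for the real 1000-column data)
df <- as.data.frame(matrix(seq_len(5 * 30), nrow = 5))
colpicks <- seq(10, 30, by = 10)

# SIMPLIFY = FALSE makes mapply return a plain list, one element per chunk;
# in the real code the function body would call seqdef(df[, start:stop])
chunks <- mapply(function(start, stop) df[, start:stop],
                 colpicks - 9, colpicks, SIMPLIFY = FALSE)

# hypothetical naming scheme so each chunk can be retrieved by name
names(chunks) <- paste0("seq_", colpicks - 9, "_", colpicks)

# later: loop over all stored objects
chunk_widths <- vapply(chunks, ncol, integer(1))
```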

using value of a function & nested function in R

I wrote a function in R - called "filtre": it takes a dataframe, and for each line it says whether it should go in say bin 1 or 2. At the end, we have two data frames that sum up to the original input, and corresponding respectively to all lines thrown in either bin 1 or 2. These two sets of bin 1 and 2 are referred to as filtre1 and filtre2. For convenience the values of filtre1 and filtre2 are calculated but not returned, because it is an intermediary thing in a bigger process (plus they are quite big data frame). I have the following issue:
(i) When I later want to use filtre1 (or filtre2), they simply don't show up, as if their values were stuck inside the function and not recognised anywhere else. That would oblige me to copy the whole function every time I want to use them, which is quite painful and heavy.
I suspect this is a rather simple thing, but I did search on the web and did not find the answer really (I was not sure of best key words). Sorry for any inconvenience.
Thxs / g.
It's pretty hard to know the best way of achieving what you want, as you do not provide a proper example, but I'll give it a try. If your variables filtre1 and filtre2 are defined inside your function and you do not return them, then of course they do not show up in your environment. But you could just return the classification and build filtre1 and filtre2 afterwards:
# example data
df<-data.frame(id=1:20,x=sample(1:20,20,replace=TRUE))
filtre<-function(df){
# example function; this could of course be done by bins<-df$x<10
bins<-numeric(nrow(df))
for(i in seq_len(nrow(df))){
# test row i only; if(df$x<10) would test the whole column at once
if(df$x[i]<10)
bins[i]<-1
}
return(bins)
}
bins<-filtre(df)
filtre1<-df[bins==1,]
filtre2<-df[bins==0,]
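If you would rather get both data frames straight out of the function, one alternative is to return them together in a named list. This is only a sketch under the same made-up rule (x < 10 goes to bin 1); your real filter would replace that line.

```r
set.seed(1)  # fixed seed so the example is reproducible
df <- data.frame(id = 1:20, x = sample(1:20, 20, replace = TRUE))

filtre <- function(df) {
  # assumed example rule: rows with x < 10 go to bin 1, the rest to bin 2
  in_bin1 <- df$x < 10
  # return both parts in one named list so they exist outside the function
  list(filtre1 = df[in_bin1, ], filtre2 = df[!in_bin1, ])
}

res <- filtre(df)
filtre1 <- res$filtre1
filtre2 <- res$filtre2
```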
