Find powerset of all unique combinations of vector of strings - r

I am trying to find all of the unique groupings of a vector/list of items, length 39. Below is the code I have:
x <- c("Dominion","progress","scarolina","tampa","tva","TminKTYS",
"TmaxKTYS","TminKBNA","TmaxKBNA","TminKMEM","TmaxKMEM",
"TminKCRW","TmaxKCRW","TminKROA","TmaxKROA","TminKCLT",
"TmaxKCLT","TminKCHS","TmaxKCHS","TminKATL","TmaxKATL",
"TminKCMH","TmaxKCMH","TminKJAX","TmaxKJAX","TminKLTH",
"TmaxKLTH","TminKMCO","TmaxKMCO","TminKMIA","TmaxKMIA",
"TminKPTA","TmaxKTPA","TminKPNS","TmaxKPNS","TminKLEX",
"TmaxKLEX","TminKSDF","TmaxKSDF")
# Generate a list with the combinations
zz <- sapply(seq_along(x), function(y) combn(x,y))
# Filter out all the duplicates
sapply(zz, function(z) t(unique(t(z))))
However, the code causes my computer to run out of memory. Is there a better way to do this? I realize I have a large list. thanks.

To calculate all unique subsets, you are simply creating all binary vectors with the same length as the cardinality of the original set of items. If there are 39 items, then you are looking at all binary vectors of length 39. Each element of each vector identifies, yes or no, whether or not the item is in the corresponding subset.
As there are 39 items, and each can either be in or not-in a given subset, then there are 2^39 possible subsets. Excluding the empty set, i.e. the all-0 vector, you have 2^39 - 1 possible subsets.
That is, as @joran said, about 549B vectors. Even if you store them as bare binary vectors, which is the most compact representation (i.e. without strings), you will need 549B * 39 bits to return all of the subsets. I don't think you want to store this: that's about 2.68E12 bytes. If you insist on using the characters, you're likely to be in the many tens of terabytes.
It's certainly feasible to buy a system that can support this, but not very cost-effective.
At a meta-level, it is very likely, as @JD said, that this is not the path you really need to go down. I recommend posting a new question and maybe it can be refined here or on the statistics-related SE site.
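To make the arithmetic concrete, here is a minimal sketch. The helper subset_from_index is a hypothetical name of mine, not something from the question; it decodes a single integer index into the subset it represents, so subsets can be visited one at a time instead of being stored all at once.

```r
n <- 39
n_subsets <- 2^n - 1               # number of non-empty subsets
bytes_needed <- n_subsets * n / 8  # n bits per subset, 8 bits per byte

# Hypothetical helper: bit i of k says whether items[i] is in subset k.
# Plain arithmetic is used instead of bitwAnd() so that it also works
# for indices beyond the 32-bit integer range.
subset_from_index <- function(items, k) {
  bits <- (k %/% 2^(seq_along(items) - 1)) %% 2 == 1
  items[bits]
}

# Small demonstration on 3 items (7 non-empty subsets)
small <- c("a", "b", "c")
all_subsets <- lapply(1:(2^length(small) - 1), subset_from_index, items = small)
```

For 39 items you would still have to loop over roughly 5.5e11 indices, so this only avoids the memory problem, not the running time.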

You might try using expand.grid.
Create a data frame from all combinations of the supplied vectors or
factors. See the description of the return value for precise details
of the way this is done.
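A small sketch of how expand.grid could be used here, assuming one TRUE/FALSE inclusion flag per item (the three-item vector is just for illustration). Note that the result still has 2^n rows, so on its own this does not avoid the memory problem for 39 items.

```r
items <- c("a", "b", "c")

# One column of c(FALSE, TRUE) per item; each row is one inclusion pattern
grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(items)))

# Turn each row of flags into the corresponding subset of items
subsets <- lapply(seq_len(nrow(grid)), function(i) items[unlist(grid[i, ])])
```

The first row is all FALSE, so the first element of subsets is the empty set.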

Related

subset indexing in r

I have a dataframe ma
it has a factor called type
type has the following levels: I210, I210plus, I210plusc, KV2c, KV2cplus
I'd like to put some of these factors in a vector, say, selected_types
so, selected_types<-c("I210plusc","KV2c")
then, have this command subset the dataframe ma
ma1<-subset(ma, type==selected_types)
such that ma1 would be a subset of ma consisting of only the observations that had
type I210plusc and KV2c
however, when I do this, the number of observations in the resulting dataframe ma1 is less than the sum of the occurrences of the two types in selected_types from the original ma
Any ideas on what I'm doing incorrectly?
Thank you
I originally had this in a comment, but it's a bit lengthy, plus I wanted to add to it. Here are some details on what's happening:
what you're doing with == is recycling your length-two vector, so that every odd row is compared to "I210plusc" and every even row to "KV2c". Your final result is therefore the data frame of odd rows that are "I210plusc" and even rows that are "KV2c".
An alternate solution that might make the issue clear is as follows:
subset(ma, type == selected_types[[1]] | type == selected_types[[2]])
Or, more gracefully:
subset(ma, type %in% selected_types)
The %in% operator returns a logical vector of same length as type with TRUE for every position in type that "is in" selected_types (hence the name of the operator).
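Here is a small made-up example of the difference. The six-row ma below is invented for illustration and is not the asker's data.

```r
ma <- data.frame(
  type = factor(c("I210", "KV2c", "I210plusc", "KV2c", "KV2c", "I210plusc"))
)
selected_types <- c("I210plusc", "KV2c")

# == recycles selected_types: rows 1, 3, 5 are tested against "I210plusc",
# rows 2, 4, 6 against "KV2c", so some matching rows are missed
by_eq <- subset(ma, type == selected_types)

# %in% tests every row against the whole set
by_in <- subset(ma, type %in% selected_types)
```

Here by_eq keeps only rows 2, 3 and 4, while by_in keeps all five rows whose type is one of the selected ones.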

Iterate process in R using range of vectors derived from matrix

I must first apologize as I have no programming background, so please forgive me if this question is overly simplistic or if it has been addressed repeatedly. I would be very willing to help clarify my issue if it is not clear from my explanation.
I have two sets of data matrices. "A":
[Ac1] [Ac2] ... [Ac500]
[Ar1] 25 30 ... 15
[Ar2] 7 54 ... 41
...
[Ar25000]
and
"B" which is similar in the number of columns, but not the number of rows
[Bc1] [Bc2] ... [Bc500]
[Br1] 25 30 ... 15
[Br2] 7 54 ... 41
...
[Br20000]
I'm running a module ("npSeq") in R that takes matrix A as a constant input, together with a horizontal vector that includes all of the values from one row of matrix B, ex [1]. The module returns a separate list of values. I will need to run the analysis independently for each row of matrix B, saving all of the returned lists, which I will then need to combine.
However I would like to know if there is a way to automate the process so that the module runs using a vector derived from row [Br1], saves the returned list, and then runs the process again using the vector derived from row [Br2]. Repeating the process until [Br20000].
Again I'm sorry that this is worded so poorly. I wish I understood enough of the terminology to state my problem more clearly.
You can use lapply to loop over B's row indices:
result.list <- lapply(1:nrow(B), function(i) npSeq(A, B[i, ]))
Note that this is not going to be much (any?) faster than using a for loop. It is just a short and clean equivalent. 20,000 iterations does sound like a lot so it may take a while depending on how slow the function is.
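A self-contained sketch of the same pattern, since npSeq itself isn't shown here. analyze_row is a hypothetical stand-in for the real function, and the tiny matrices are made up.

```r
A <- matrix(1:20, nrow = 4)  # stand-in for the 25000 x 500 matrix
B <- matrix(1:15, nrow = 3)  # stand-in for the 20000 x 500 matrix

# Hypothetical placeholder for npSeq(A, b): returns a list per row of B
analyze_row <- function(A, b) {
  list(total = sum(b), dot = as.vector(A %*% b))
}

# One call per row of B; results are kept in a list, in row order
result.list <- lapply(seq_len(nrow(B)), function(i) analyze_row(A, B[i, ]))

# Combining afterwards: pull one component out of every result
totals <- sapply(result.list, `[[`, "total")
dots <- do.call(rbind, lapply(result.list, `[[`, "dot"))
```

How you combine the results depends on what the real function returns; sapply and do.call(rbind, ...) are just two common options.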

Make the sum of all the subtractions of a vector elements in R

Hello, I am new to R and I can't find a way to do exactly what I want. I have a vector of x numbers, and what I want to do is sort it in increasing order and then make subtractions like this (let's say the vector has 100 numbers, for example):
[x(100)-x(99)]+[x(99)-x(98)]+[x(98)-x(97)]+[x(97)-x(96)]+...[x(2)-x(1)]
and then divide all that sum by the number of elements the vector has, in this case 100.
The only thing that I am able to do at the moment is order the vector with:
sort(nameOfTheVector)
Sorry for my bad English.
diff returns suitably lagged and iterated differences. In your case you want the default single lag. sum will return the sum of any arguments passed to it, so....
sum(diff(sort(nameOfTheVector))) / length(nameOfTheVector)
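Worth noting: the sum of consecutive differences telescopes, so every intermediate term cancels and only the largest minus the smallest value remains. A quick check on a made-up vector:

```r
x <- c(7, 2, 9, 4)

# Sum of consecutive differences of the sorted vector, divided by its length
by_diff <- sum(diff(sort(x))) / length(x)

# Same quantity without sorting: (max - min) / n
by_range <- (max(x) - min(x)) / length(x)
```

Both give 1.75 here, so if that ratio is all you need, sorting isn't required.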

Different behaviour of intersect on vectors and factors

I try to compare multiple vectors of Entrez IDs (integer vectors) by using Reduce(intersect,...). The vectors are selected from a database using "DISTINCT" so a single vector does not contain duplicates.
length(factor(c(l1$entrez)))
gives the same length (and the same IDs w/o the length function) as
length(c(l1$entrez))
When I compare multiple vectors with
length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
or
length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
the result is not the same. I know that factor!=originalVector but I cannot understand why the result differs although the length and the levels of the initial factors/vectors are the same.
Could somebody please explain the different behaviour of the intersect function on vectors and factors? Is it that the intersect of two factor lists are again factorlists and then duplicates are treated differently?
Edit - Example:
> head(l1)
entrez
1 1
2 503538
3 29974
4 87769
5 2
6 144568
> head(l2)
entrez
1 1743
2 1188
3 8915
4 7412
5 51082
6 5538
The lists contain around 500 to 20K Entrez IDs. So the vectors contain pure integer and should give the intersect among all tested vectors.
> length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
[1] 514
> length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
[1] 338
> length(Reduce(intersect,list(l1$entrez,l2$entrez,l3$entrez,l4$entrez)))
[1] 494
I have to apologize profusely. The different behaviour of the intersect function may be caused by a problem with the data. I have found fields in the dataset containing comma-separated Entrez IDs (22038, 23207, ...). I should have had a more detailed look at the data first. Thank you for the answers and your time. Although I do not understand the different results yet, I am sure that this is the cause of the different behaviour. Can somebody confirm that?
As Roman says, an example would be very helpful.
Nevertheless, one possibility is that your variables l1$entrez, l2$entrez etc have the same levels but in different orders.
intersect converts its arguments via as.vector, which turns factors into character variables. This is usually the right thing to do, as it means that varying level order doesn't make any difference to the result.
Passing factor(l1$entrez) as an argument to intersect also removes the impact of varying level order, as it effectively creates a new factor with level ordering set to the default. However, if you pass c(l1$entrez), you strip the factor attributes off your variable and what you're left with is the raw integer codes which will depend on level ordering.
Example:
a <- factor(letters[1:3], levels=letters)
b <- factor(letters[1:3], levels=rev(letters))
# returns 1 2 3
intersect(c(factor(a)), c(factor(b)))
# returns integer(0)
intersect(c(a), c(b))
I don't see any reason why you should use c() in here. Just let R handle factors by itself (although to be fair, there are other scenarios where you do want to step in).
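One caveat, as far as I know: since R 4.1.0, c() applied to a factor returns a factor rather than the bare integer codes, so the c() calls above may behave differently on a recent R. The sketch below uses as.integer() and as.character() to make the two cases explicit regardless of version.

```r
a <- factor(letters[1:3], levels = letters)
b <- factor(letters[1:3], levels = rev(letters))

# Integer codes depend on the level order: 1 2 3 versus 26 25 24
codes_a <- as.integer(a)
codes_b <- as.integer(b)
by_code <- intersect(codes_a, codes_b)

# Labels do not depend on the level order
by_label <- intersect(as.character(a), as.character(b))
```

by_code is empty even though both factors hold the same three values, while by_label contains "a", "b" and "c".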

Explain R tapply description

I understand what tapply() does in R. However, I cannot parse this description of it from the documentation:
Apply a Function Over a "Ragged" Array
Description:
Apply a function to each cell of a ragged array, that is to each
(non-empty) group of values given by a unique combination of the
levels of certain factors.
Usage:
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
When I think of tapply, I think of group by in sql. You group values in X together by its parallel factor levels in INDEX and apply FUN to those groups. I have read the description of tapply 100 times and still can't figure out how what it says maps to how I understand tapply. Perhaps someone can help me parse it?
@joran's great answer helped me understand it (so please vote for his - I would have added it as comment if it wasn't too long for that), but this may be of help to some:
In quite a few languages, you have two-dimensional arrays. Depending on the language, these arrays have fixed dimensions (i.e. each row has the same number of columns), or the number of items per row is allowed to differ. So instead of:
A: 1 2 3
B: 4 5 6
C: 7 8 9
You could get something like
A: 1 3
B: 4 5 6
C: 8
This is called a ragged array because, well, the right side of it looks ragged.
In typical R-style, we might represent this as two vectors:
values<-c(1,3,4,5,6,8)
names<-c("A", "A", "B", "B", "B", "C")
So calling tapply with these two vectors as its first two arguments indeed allows us to apply a function to each 'row' of our ragged array.
Let's see what the R documentation says on the subject:
The combination of a vector and a labelling factor is an example of what is sometimes called a ragged array, since the subclass sizes are possibly irregular. When the subclass sizes are all the same the indexing may be done implicitly and much more efficiently, as we see in the next section.
The list of factors you supply via INDEX together specify a collection of subsets of X, of possibly different lengths (hence, the 'ragged' descriptor). And then FUN is applied to each subset.
EDIT: @Joris makes an excellent point in the comments. It may be helpful to think of tapply(X,Y,...) as a wrapper for sapply(split(X,Y),...) in that if Y is a list of grouping factors, it builds a new, single grouping factor based on their unique levels, splits X accordingly and applies FUN to each piece.
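A quick sketch of that equivalence on a tiny ragged array, with values grouped by the labels A, B and C. The variable names are mine, and sum is just an example function.

```r
values <- c(1, 3, 4, 5, 6, 8)
groups <- c("A", "A", "B", "B", "B", "C")

# Apply sum to each group of values sharing a label
via_tapply <- tapply(values, groups, sum)

# Same idea spelled out: split into groups, then apply to each piece
via_split <- sapply(split(values, groups), sum)
```

Both give A = 4, B = 15 and C = 8; tapply returns a one-dimensional array while the sapply version returns a plain named vector.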
EDIT: Here's an illustrative example:
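library(lattice)  # provides the barley data set
library(plyr)     # provides ddply
library(reshape2) # assumption: melt() is taken from reshape2; neither lattice nor plyr provides it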
set.seed(123)
#Make this example unbalanced
dat <- barley[sample(1:120,50),]
#Suppose we want the avg yield by year/site:
table(dat$year,dat$site)
#That's what they mean by 'ragged' array; there are different
# numbers of obs at each comb of levels
#In plyr we could use ddply:
ddply(dat,.(year,site),.fun=function(x){mean(x$yield)})
#Which gives the same result (listed in a diff order) as:
melt(tapply (dat$yield, list (dat$year, dat$site), mean))
