All N Combinations of All Subsets - r

Given a vector of elements, I would like to obtain a list of all possible n-length combinations of subsets of elements. For example, given the (simplest) sequence 1:2, I would like to obtain a list object of the form
{ {{1},{1}}, {{1},{2}}, {{2},{2}}, {{1},{1,2}}, {{2},{1,2}}, {{1,2},{1,2}} }
when n=2.
I was able to generate a list of all non-empty subsets using the following:
listOfAllSubsets <- function(s) {
  n <- length(s)
  unlist(lapply(1:n, function(k) {
    combn(s, k, simplify = FALSE)
  }), recursive = FALSE)
}
However, I'm not sure of the best way to proceed from here. Essentially, I want a Cartesian product of this list with itself (for n=2).
Any suggestions? A non-iterative solution would be preferable (i.e., no for loops).

It is easier to start with a Cartesian product of the indices. Duplicates can then be avoided by keeping only the tuples whose indices are sorted.
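combosn <- function(items, n) {
  i <- seq_along(items)
  # all n-tuples of indices into items
  idx <- do.call(expand.grid, rep(list(i), n))
  # keep only sorted tuples, so each unordered combination appears once
  idx <- idx[!apply(idx, 1, is.unsorted), , drop = FALSE]
  apply(idx, 1, function(x) items[x])
}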
ss<-listOfAllSubsets(1:2)
str(combosn(ss,2))
List of 6
$ :List of 2
..$ : int 1
..$ : int 1
$ :List of 2
..$ : int 1
..$ : int 2
$ :List of 2
..$ : int 2
..$ : int 2
$ :List of 2
..$ : int 1
..$ : int [1:2] 1 2
$ :List of 2
..$ : int 2
..$ : int [1:2] 1 2
$ :List of 2
..$ : int [1:2] 1 2
..$ : int [1:2] 1 2
Or, for n=3,
str(combosn(ss,3))
List of 10
$ :List of 3
..$ : int 1
..$ : int 1
..$ : int 1
$ :List of 3
..$ : int 1
..$ : int 1
..$ : int 2
$ :List of 3
..$ : int 1
..$ : int 2
..$ : int 2
$ :List of 3
..$ : int 2
..$ : int 2
..$ : int 2
$ :List of 3
..$ : int 1
..$ : int 1
..$ : int [1:2] 1 2
$ :List of 3
..$ : int 1
..$ : int 2
..$ : int [1:2] 1 2
$ :List of 3
..$ : int 2
..$ : int 2
..$ : int [1:2] 1 2
$ :List of 3
..$ : int 1
..$ : int [1:2] 1 2
..$ : int [1:2] 1 2
$ :List of 3
..$ : int 2
..$ : int [1:2] 1 2
..$ : int [1:2] 1 2
$ :List of 3
..$ : int [1:2] 1 2
..$ : int [1:2] 1 2
..$ : int [1:2] 1 2
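
The sorted-index filter above builds the full grid of length(items)^n tuples before discarding most of them. A minimal sketch of an alternative, assuming only base R: generate the sorted index tuples directly from combn. The function name combos_with_rep is hypothetical and not part of the answer above.

```r
# Sketch: n-element combinations with repetition, without building the full grid.
# combos_with_rep is a hypothetical name used for illustration only.
combos_with_rep <- function(items, n) {
  k <- length(items)
  # Each combn() result is a strictly increasing n-tuple drawn from 1..(k + n - 1).
  # Subtracting 0..(n - 1) maps it to a non-decreasing n-tuple drawn from 1..k.
  idx <- combn(k + n - 1, n, simplify = FALSE)
  lapply(idx, function(ix) items[ix - seq(0, n - 1)])
}

ss <- list(1L, 2L, 1:2)           # the non-empty subsets of 1:2
length(combos_with_rep(ss, 2))    # 6, i.e. choose(3 + 2 - 1, 2)
length(combos_with_rep(ss, 3))    # 10
```

The order of the combinations differs from combosn, but the same set of combinations is produced.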

This is what I would do, with, e.g., s=1:2:
1) Represent subsets with a 0/1 matrix for each element's membership.
subsets = as.matrix(do.call(expand.grid,replicate(length(s),0:1,simplify=FALSE)))
which gives
Var1 Var2
[1,] 0 0
[2,] 1 0
[3,] 0 1
[4,] 1 1
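Here, the first row is the empty subset; the second, {1}; the third, {2}; and the fourth, {1,2}. To get the subset itself, use mysubset = s[subsets[row,]==1], where row is the row of the subset you want. The comparison with 1 matters: indexing s with the raw 0/1 values would select by position rather than by membership.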
2) Represent pairs of subsets as pairs of rows of the matrix:
pairs <- expand.grid(Row1=1:nrow(subsets),Row2=1:nrow(subsets))
which gives
Row1 Row2
1 1 1
2 2 1
3 3 1
4 4 1
5 1 2
6 2 2
7 3 2
8 4 2
9 1 3
10 2 3
11 3 3
12 4 3
13 1 4
14 2 4
15 3 4
16 4 4
Here, the fourteenth row corresponds to the second and fourth rows of subsets, so {1} & {1,2}. This assumes the order of the pair matters (which is implicit in taking the Cartesian product). To recover the subsets, use mypairosubsets=lapply(pairs[p,],function(r) s[subsets[r,]==1]) where p is the row of the pair you want.
Expanding beyond pairs to the P(s)^n case (where P(s) is the power set of s) would look like
setsosets = as.matrix(do.call(expand.grid,replicate(n,1:nrow(subsets),simplify=FALSE)))
Here, each row will have a vector of numbers. Each number corresponds to a row in the subsets matrix.
Making copies of the elements of s is probably not necessary for whatever you are doing after this. However, you could do it from here by using lapply(1:nrow(pairs),function(p)lapply(pairs[p,],function(r) s[subsets[r,]==1])), which starts like...
[[1]]
[[1]]$Row1
integer(0)
[[1]]$Row2
integer(0)
[[2]]
[[2]]$Row1
[1] 1
[[2]]$Row2
integer(0)
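
The membership-matrix approach can be sketched end to end as follows. This is a minimal illustration assuming s = 1:2 and n = 2; the helper get_subset and the names tuples and result are hypothetical. Note that the 0/1 row is compared with 1 before indexing, because a numeric 0/1 vector would index s by position.

```r
s <- 1:2

# Membership matrix: one row per subset, one 0/1 column per element of s.
subsets <- as.matrix(do.call(expand.grid,
                             replicate(length(s), 0:1, simplify = FALSE)))

# Hypothetical helper: turn a row of the matrix back into the subset itself.
get_subset <- function(row) s[subsets[row, ] == 1]

get_subset(3)   # 2
get_subset(4)   # 1 2

# All ordered n-tuples of subsets, stored as row numbers into `subsets`.
n <- 2
tuples <- as.matrix(do.call(expand.grid,
                            replicate(n, seq_len(nrow(subsets)), simplify = FALSE)))

# Materialise every tuple as a list of n subsets.
result <- lapply(seq_len(nrow(tuples)),
                 function(p) lapply(tuples[p, ], get_subset))
length(result)  # 16, i.e. (2^2)^2 ordered pairs, empty subset included
```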

allSubsets <- function(n,                     # size of the initial set
                       m,                     # number of subsets per combination
                       includeEmpty = FALSE)  # should the empty set be considered a subset?
{
  # m can't exceed the number of possible subsets
  if (includeEmpty)
    stopifnot(m <= 2^n)
  else
    stopifnot(m <= 2^n - 1)
  # get all m-combinations of subset indices, one vector per combination
  if (includeEmpty) {
    ll <- split(t(combn(2^n, m)), seq(choose(2^n, m)))
  } else
    ll <- split(t(combn(2^n - 1, m)), seq(choose(2^n - 1, m)))
  # get the subsets of the initial set (of size n)
  subsets <- apply(do.call(expand.grid, rep(list(c(FALSE, TRUE)), n)),
                   1, which)
  # remove the empty subset if desired
  if (!includeEmpty)
    subsets <- subsets[-1]
  # convert the subsets to plain vectors
  subsets <- lapply(subsets, as.vector)
  # return the list of combinations of subsets
  apply(t(mapply('[', list(subsets), ll)), 1, function(x) x)
}
# returns a list where each element is a list of length 2 with
# distinct subsets of an initial set of size 4
x = allSubsets(4, 2, FALSE)

Related

Building a table/dataframe/something exportable from Desc function output in R

I'm definitely a noob, though I have used R for various small tasks for several years.
For the life of me, I cannot figure out how to get the results from the "Desc" function into something I can work with. When I save the x<-Desc(mydata) the class(x) shows up as "Desc." In R studio it is under Values and says "List of 1." Then when I click on x it says ":List of 25" in the first line. There is a list of data in this object, but I cannot for the life of me figure out how to grab any of it.
Clearly I have a severe misunderstanding of the R data structures, but I have been searching for the past 90 minutes to no avail so figured I would reach out.
In short, I just want to pull certain aspects (N, mean, UB, LB, median) of the descriptive statistics provided from the Desc results for multiple datasets and build a little table that I can then work with.
Thanks for the help.
Say you have a dataframe, x, where:
x <- data.frame(i=c(1,2,3),j=c(4,5,6))
You could set:
desc.x <- Desc(x)
And access the info on any given column like:
desc.x$i
desc.x$i$mean
desc.x$j$sd
The same goes for any other statistic Desc computes. The $ is the key here: it is how you access the named fields of the list that Desc returns.
Edit: In case you pass a single column (as the asker does), or simply a vector to Desc, you are then returned a 1 item list. The same principle applies but the usual syntax is different. Now you would use:
desc.x <- Desc(df$my.col)
desc.x[[1]]$mean
In the future, the way to attack this is to either look in the environment window in RStudio and play around trying to figure out how to access the fields, check the source code on github or elsewhere, or (best first choice) use str(desc.x), which gives us:
> str(desc.x)
List of 1
$ :List of 25
..$ xname : chr "data.frame(i = c(1, 2, 3), j = c(4, 5, 6))$i"
..$ label : NULL
..$ class : chr "numeric"
..$ classlabel: chr "numeric"
..$ length : int 3
..$ n : int 3
..$ NAs : int 0
..$ main : chr "data.frame(i = c(1, 2, 3), j = c(4, 5, 6))$i (numeric)"
..$ unique : int 3
..$ 0s : int 0
..$ mean : num 2
..$ meanSE : num 0.577
..$ quant : Named num [1:9] 1 1.1 1.2 1.5 2 2.5 2.8 2.9 3
.. ..- attr(*, "names")= chr [1:9] "min" ".05" ".10" ".25" ...
..$ range : num 2
..$ sd : num 1
..$ vcoef : num 0.5
..$ mad : num 1.48
..$ IQR : num 1
..$ skew : num 0
..$ kurt : num -2.33
..$ small :'data.frame': 3 obs. of 2 variables:
.. ..$ val : num [1:3] 1 2 3
.. ..$ freq: num [1:3] 1 1 1
..$ large :'data.frame': 3 obs. of 2 variables:
.. ..$ val : num [1:3] 3 2 1
.. ..$ freq: num [1:3] 1 1 1
..$ freq :Classes ‘Freq’ and 'data.frame': 3 obs. of 5 variables:
.. ..$ level : Factor w/ 3 levels "1","2","3": 1 2 3
.. ..$ freq : int [1:3] 1 1 1
.. ..$ perc : num [1:3] 0.333 0.333 0.333
.. ..$ cumfreq: int [1:3] 1 2 3
.. ..$ cumperc: num [1:3] 0.333 0.667 1
..$ maxrows : num 12
..$ x : num [1:3] 1 2 3
- attr(*, "class")= chr "Desc"
"List of 1" means you access it by desc.x[[1]], and below that follow the $s. When you see something like num[1:3] that means it's an atomic vector so you access the first member like var$field$numbers[1]
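
To build the small table the question asks for, one option is to pull the fields out of each result and bind the rows together. The sketch below does not call Desc; it uses a hypothetical stand-in, mock_desc, that returns a list shaped like the str() output above (fields n, mean, meanSE, quant). The "median" name inside quant and the t-based confidence bounds computed from mean and meanSE are assumptions, since the str() output does not show them.

```r
# Hypothetical stand-in for Desc(vec): a list of 1 holding the fields seen in str().
mock_desc <- function(x) {
  list(list(n = length(x),
            mean = mean(x),
            meanSE = sd(x) / sqrt(length(x)),
            quant = c(median = median(x))))   # assumed element name
}

# Hypothetical helper: one row of summary statistics per result.
extract_stats <- function(d) {
  f <- d[[1]]                                   # "List of 1", so take element 1
  half <- qt(0.975, df = f$n - 1) * f$meanSE    # assumed 95% t interval
  data.frame(N = f$n,
             mean = f$mean,
             LB = f$mean - half,
             UB = f$mean + half,
             median = unname(f$quant["median"]))
}

datasets <- list(a = c(1, 2, 3), b = c(4, 5, 6, 7))
tab <- do.call(rbind, lapply(datasets, function(x) extract_stats(mock_desc(x))))
tab
```

With a real result, replacing mock_desc(x) by the value returned from Desc should work as long as str() shows the same field names.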

How to cast a dataframe to a DocumentTermMatrix?

I am trying to use tidytext to transform a tibble of word frequencies into a DocumentTermMatrix, but the function doesn't seem to work as expected. I start from AssociatedPress, which I know is a DocumentTermMatrix, tidy it, and cast it back, but the output is not the same as the original matrix. What am I doing wrong?
library(topicmodels)
library(tidytext)
library(dplyr)
data(AssociatedPress)
ap_td <- tidy(AssociatedPress)
tt <- ap_td %>%
  cast_dtm(document, term, count)
The element $dimnames$Docs is non-NULL after I cast ap_td, but it was NULL in AssociatedPress:
str(tt)
List of 6
$ i : int [1:302031] 1 16 35 72 84 93 101 111 155 161 ...
$ j : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : chr [1:2246] "1" "2" "3" "4" ...
..$ Terms: chr [1:10473] "adding" "adult" "ago" "alcohol" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
str(AssociatedPress)
List of 6
$ i : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
$ v : num [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : NULL
..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
cast_dtm also raises a warning:
Warning message: Trying to compute distinct() for variables not found in the data:
- row_col, column_col
This is an error, but only a warning is raised for compatibility reasons. The operation will return the input unchanged.
On GitHub, I found this issue which should have been fixed now.
I don't get your warning message using tidytext 0.1.9.900 and R 3.5.0.
The two DTMs are identical in the number of terms, rows and columns. All the counts are correct as well.
The difference is indeed between tt$dimnames$Docs and AssociatedPress$dimnames$Docs.
The reason is that if the DTM has no document ids before tidying, as is the case with AssociatedPress, the tidy function assigns AssociatedPress$i to the document variable of the tidy data frame (ap_td). Casting this back into a DTM fills $dimnames$Docs with the document values from ap_td. So in the end the AssociatedPress$i values end up in tt$dimnames$Docs.
You can see that if you compare the $i from AssociatedPress with the Docs from tt.
all.equal(unique(as.character(AssociatedPress$i)), unique(tt$dimnames$Docs))
[1] TRUE
Or comparing AssociatedPress with ap_td (all.equal compares only its first two arguments, so each pair is checked separately):
all.equal(unique(as.character(AssociatedPress$i)), unique(as.character(ap_td$document)))
[1] TRUE
If you want to follow the logic yourself, you can check all the functions used on the github page for the sparse_tidiers. Start with tidy.DocumentTermMatrix and follow the function calls to tidy.simple_triplet_matrix and finally to tidy_triplet.
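
The mechanism can be illustrated without either package. The sketch below is a toy model in base R, not tidytext's actual code: the list dtm imitates a triplet matrix without document names, and the two steps imitate tidying and casting. All object names here are hypothetical.

```r
# Toy stand-in for a simple_triplet_matrix whose Docs dimnames are NULL.
dtm <- list(i = c(1L, 1L, 2L), j = c(1L, 3L, 2L), v = c(2, 1, 1),
            nrow = 2L, ncol = 3L,
            dimnames = list(Docs = NULL,
                            Terms = c("aaron", "abandon", "abandoned")))

# "Tidy" step: with no document names, fall back to the row index as the id.
docs <- dtm$dimnames$Docs
if (is.null(docs)) docs <- seq_len(dtm$nrow)
tidy_df <- data.frame(document = docs[dtm$i],
                      term = dtm$dimnames$Terms[dtm$j],
                      count = dtm$v)

# "Cast" step: the document ids become character row names of the new matrix.
new_docs <- unique(as.character(tidy_df$document))
new_docs   # "1" "2", where the original had NULL
```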

I would like to merge one data frame with each vector in a list of vectors. Output should be a list of data frames

str(list) # the list
List of 11
$ : int [1:62850] 1013128473 1010310348 1048245573 1034384956 1041152164 1044038741 1018034270 1028472668 1028965885 1009487677 ...
$ : int [1:76934] 1013175201 1008463364 1016595579 1015077603 1036297925 1033985605 1004670509 1002708962 1035740487 1033948421 ...
$ : int [1:63141] 1023522277 1028419750 1035072196 1015895913 1044665345 1045384789 1003817549 1007103029 1034294940 1048731747 ...
$ : int [1:66286] 1004375117 1015143512 1013554405 1029388459 1042758662 1002010773 1014659880 1010136990 1042787992 1034111995 ...
$ : int [1:59295] 1026598712 1046781801 1047773468 1029647490 1000445831 1004654396 1026574333 1028210894 1031396631 1017077460 ...
$ : int [1:39513] 1008628321 1031342452 1036618138 1025299916 1059540334 1044636981 1025831775 1020671796 1016064196 1000573822 ...
$ : int [1:52616] 1007104357 1035072196 1045300736 1013342439 1021471188 1014648594 1047521123 1006283327 1018237501 1052887674 ...
$ : int [1:53865] 1043482304 1006375883 1065831792 1025658285 1025898360 1042188555 1010986410 1036297925 1016468595 1042017564 ...
$ : int [1:74030] 1049026709 1076616323 1013343981 1009441716 1004974596 1032515221 1059905172 1011514112 1005423064 1006931636 ...
$ : int [1:62171] 1024128835 1006168791 1003374715 1042188555 1016219766 1002708962 1035781234 1039706286 1011430434 1055809196 ...
$ : int [1:66560] 1020967137 1029327077 1026256246 1046334023 1035156221 1017504075 1035065786 1043426434 1034294940 1019105475 ...
str(df) # the data frame
'data.frame': 3727518 obs. of 5 variables:
$ A: int 10001676 10001676 10002575 10002990 10003466 10005485 10005736 10005949 10006562 10007119 ...
$ 1: int 1020565642 1020565642 1008628321 1038358741 1045031612 1025102185 1011873328 1002079752 1028579827 1026598712 ...
$ 2: Factor w/ 2 levels "ÇäËì","ÐßÑ": 2 2 2 2 2 2 2 2 2 2 ...
$ 3: int 1 4 1 1 1 1 20 1 1 1 ...
$ 4: int 64 64 66 63 69 59 84 83 65 64 ...
I want to merge each vector in the list with the data frame by "A".
What I tried was:
for(n in 1:length(list))
{
newlist[[n]] <- merge(df, list[[n]], by.x = "A")
}
Error in merge.data.frame(rd_info, newengagementspermonth[[n]], by.x = "NEWNINUMBER") :
'by.x' and 'by.y' specify different numbers of columns
The input is a list of 11 vectors and a data frame. The output should be a list of 11 data frames, each with a number of rows equal to the length of the corresponding vector.
You could do something like this. First, explicitly transform each object in the list into a data.frame. Then, merge it with df. You need to specify by.x and by.y since the data.frames do not have the same names.
newlist <- lapply(lapply(list,as.data.frame),function(x) merge(x,df,by.x="X[[i]]",by.y="A",all.x=TRUE))
With sample data:
list <- list(1:8,1:10,2:15)
df <- data.frame(A=1:15,
b=rnorm(15))
Output:
str(newlist)
List of 3
$ :'data.frame': 8 obs. of 2 variables:
..$ X[[i]]: int [1:8] 1 2 3 4 5 6 7 8
..$ b : num [1:8] 0.0127 0.2082 -0.271 0.421 -0.538 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ X[[i]]: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ b : num [1:10] 0.0127 0.2082 -0.271 0.421 -0.538 ...
$ :'data.frame': 14 obs. of 2 variables:
..$ X[[i]]: int [1:14] 2 3 4 5 6 7 8 9 10 11 ...
..$ b : num [1:14] 0.208 -0.271 0.421 -0.538 0.506 ...
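
The column name "X[[i]]" is whatever as.data.frame happens to generate inside lapply, so it can be fragile. A minimal variant, assuming the key column in df is called A as in the sample data, names the column explicitly when wrapping each vector. The object name ids is hypothetical and is used here to avoid masking the base function list.

```r
set.seed(1)
ids <- list(1:8, 1:10, 2:15)                 # sample list of key vectors
df <- data.frame(A = 1:15, b = rnorm(15))    # sample data frame keyed by A

# Wrap each vector in a one-column data frame named A, then merge on A.
# all.x = TRUE keeps every key from the vector, even without a match in df.
newlist <- lapply(ids, function(v) merge(data.frame(A = v), df, by = "A", all.x = TRUE))

sapply(newlist, nrow)   # 8 10 14
```

Each resulting data frame has one row per key here because the keys in df are unique; duplicated keys in df would produce additional rows.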

Topicmodels transposes the term document matrix

I am trying to run an LDA using the topicmodels package in R. The example given in the manual uses Associated Press data and works nicely. However, when I try it on my own data I get topics whose terms are the document names. I have traced the problem to the fact that my term document matrix is the transpose of what it should be (rows and columns are swapped).
The example TDM:
str(AssociatedPress)
List of 6
$ i : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
$ v : int [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : NULL
..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
Whereas my TDM has Terms as rows and Docs as columns:
List of 6
$ i : int [1:10489] 1 3 4 13 20 24 25 26 27 28 ...
$ j : int [1:10489] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:10489] 1 1 1 1 2 1 67 1 44 3 ...
$ nrow : int 5903
$ ncol : int 9
$ dimnames:List of 2
..$ Terms: chr [1:5903] "\u2439aa" "aars" "\u2439ab" "\u242dab" ...
..$ Docs : chr [1:9] "art111130.txt" "art111131.txt" "art111132.txt" "art111133.txt" ...
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
This causes LDA(art_tdm,3) to build topics based on doc names, not terms within docs. Is this a change in the codebase of the tm package? I can't imagine what I would be doing to cause this transposition in my code:
art_cor<-Corpus(DirSource(directory = "tmptxts"))
art_tdm<-TermDocumentMatrix(art_cor)
Any help would be appreciated.
On the one hand you have an object of class "TermDocumentMatrix", and on the other you have one of class "DocumentTermMatrix".
You probably just need to do this:
art_tdm<-DocumentTermMatrix(art_cor)

using chisq.test in R (chi-squared tests)

I am trying to read a csv file, create 3 matrices out of each row of the file, and then apply a chi-squared test using chisq.test(matrix), but somehow this method seems to fail.
It gives me the following error:
Error in sum(x) : invalid 'type' (list) of argument
On the other hand, if I simply create a matrix passing some numbers then it works fine.
I also tried running str on the two types of matrices.
For the matrix that I create from a row of the csv file, str gives:
List of 12
$ : int 3
$ : int 7
$ : int 3
$ : int 1
$ : int 7
$ : int 3
$ : int 1
$ : int 1
$ : int 1
$ : int 0
$ : int 2
$ : int 0
- attr(*, "dim")= int [1:2] 4 3
For the matrix created from some numbers, str gives:
num [1:2, 1:3] 1 2 3 4 5 6
Can someone please tell me what is going on here?
The problem is that your data structure is an array of lists, and for chisq.test() you need an array of numeric values.
One solution is to coerce your data into numeric, using as.numeric(). I demonstrate this below. Another solution would be to convert the results of your read.csv() into numeric first, before you create the array.
# Recreate data
x <- structure(array(list(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)), dim=c(3,4))
str(x)
List of 12
$ : num 1
$ : num 2
$ : num 3
$ : num 4
$ : num 5
$ : num 6
$ : num 7
$ : num 8
$ : num 9
$ : num 10
$ : num 11
$ : num 12
- attr(*, "dim")= int [1:2] 3 4
# Convert to numeric array
x <- array(as.numeric(x), dim=dim(x))
str(x)
num [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
chisq.test(x)
Pearson's Chi-squared test
data: x
X-squared = 0.6156, df = 6, p-value = 0.9961
Warning message:
In chisq.test(x) : Chi-squared approximation may be incorrect
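
The second solution, converting before building the matrix, can be sketched as follows. The data frame csv is a hypothetical stand-in for the result of read.csv(); the point is that a single row of a data frame is still a list, so it needs unlist() before it goes into matrix().

```r
# Hypothetical stand-in for the result of read.csv().
csv <- data.frame(a = c(3L, 1L), b = c(7L, 2L), c = c(3L, 5L),
                  d = c(1L, 4L), e = c(7L, 6L), f = c(3L, 8L))
row1 <- csv[1, ]                  # one row of a data frame is still a list

# Building the matrix straight from the list gives list cells, which
# chisq.test() cannot sum.
bad <- matrix(as.list(row1), nrow = 2)
is.list(bad)                      # TRUE

# Flatten to a plain numeric vector first, then build the matrix.
good <- matrix(unlist(row1), nrow = 2)
is.numeric(good)                  # TRUE

res <- chisq.test(good)           # may warn about small expected counts
res$parameter                     # df = (2 - 1) * (3 - 1) = 2
```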
