I have three different events (1,2,3) with different probabilities (0.15, 0.76, 0.09) and I would like to draw 5 times with replacement.
I can now determine the number of possible combinations using
nsimplex(3,5) ### =21
from the combinat package.
And I can determine the probabilities of each combination using
mySimplex <- xsimplex(3,5)
myProbs<-c(0.15, 0.76, 0.09)
results<- apply(mySimplex,2,dmultinom,prob=myProbs)
Further, I can of course determine the number of permutations by calculating 3^5= 243.
But how do I know how often each permutation of the same combination is drawn without counting them manually? That is, how many permutations are in each of my combinations?
If I understand that correctly, there are 243 permutations building 21 different combinations. Now my question is: how many permutations build each combination? E.g. the combination {1,1,1,1,1} is produced by only one permutation, whereas others are created by several.
I guess you can get to this using the probabilities for each combination, but I do not know how. Or is there any other way to easily determine that in R?
Thank you in advance.
The number of permutations of a indistinguishable copies of item 1, b of item 2, c of item 3, where a + b + c = N, is N! / (a! b! c!).
For example if you had (a,b,c) = (3,1,1) then there are 5!/(3! 1! 1!) = 20 arrangements.
cbaaa  bacaa  abaac  aacab
cabaa  baaca  acbaa  aabca
caaba  baaac  acaba  aabac
caaab  abcaa  acaab  aaabc
bcaaa  abaca  aacba  aaacb
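These can be generated in R as well, for example with permn from the combinat package already used in the question (a sketch; permn produces all 5! = 120 orderings and unique() collapses the duplicates):
library(combinat)
arr <- unique(permn(c("a", "a", "a", "b", "c")))   # distinct arrangements only
length(arr)   # 20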
In general, we can calculate the number as follows
nperm <- function(...) {
  args <- as.numeric(list(...))
  num <- lfactorial(sum(args))   # log(N!)
  den <- sum(lfactorial(args))   # log(a!) + log(b!) + ...
  round(exp(num - den))          # N! / (a! b! c! ...), computed on the log scale for stability
}
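A quick sanity check against the worked example above:
nperm(3, 1, 1)   # 20, matching the arrangements listed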
So, e.g.,
x <- expand.grid(0:5, 0:5, 0:5)
x <- x[rowSums(x) == 5, ]   # keep only compositions summing to 5
x[, "nperm"] <- apply(x, 1, function(row) do.call(nperm, as.list(row)))
Var1 Var2 Var3 nperm
5 0 0 1
4 1 0 5
3 2 0 10
2 3 0 10
1 4 0 5
0 5 0 1
4 0 1 5
3 1 1 20
2 2 1 30
1 3 1 20
0 4 1 5
3 0 2 10
2 1 2 30
1 2 2 30
0 3 2 10
2 0 3 10
1 1 3 20
0 2 3 10
1 0 4 5
0 1 4 5
0 0 5 1
And sum(x[,"nperm"]) == 243, as expected.
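The nperm function also composes directly with the xsimplex output from the question (a sketch, redefining mySimplex as in the question):
library(combinat)                  # for xsimplex()
mySimplex <- xsimplex(3, 5)        # the 21 combinations, one per column
counts <- apply(mySimplex, 2, function(col) do.call(nperm, as.list(col)))
sum(counts)                        # 243 = 3^5, as above
# sanity check: the 21 combination probabilities sum to 1
probs <- apply(mySimplex, 2, dmultinom, prob = c(0.15, 0.76, 0.09))
all.equal(sum(probs), 1)           # TRUE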
To make this reproducible I would have needed to use set.seed(<some_value>), but this is one attempt at using sample to draw distinct combinations (without considering the permutations distinct). If the permutations are to be considered distinct, take out the sort step:
table( # get the counts of distinct combinations
  apply( # this will collapse values by column
    replicate(100000, # yields a 100,000-column matrix
      # note: these probabilities are the answerer's choice, not the question's (0.15, 0.76, 0.09)
      {sample(c("1", "2", "3"), 5, replace = TRUE, prob = c(.5, .25, .25))}),
    2, function(x) paste(sort(x), collapse = ".")))
1.1.1.1.1 1.1.1.1.2 1.1.1.1.3 1.1.1.2.2 1.1.1.2.3 1.1.1.3.3 1.1.2.2.2
3090 7705 8144 7851 15408 7649 3997
1.1.2.2.3 1.1.2.3.3 1.1.3.3.3 1.2.2.2.2 1.2.2.2.3 1.2.2.3.3 1.2.3.3.3
11731 11554 3940 949 3844 5955 4019
1.3.3.3.3 2.2.2.2.2 2.2.2.2.3 2.2.2.3.3 2.2.3.3.3 2.3.3.3.3 3.3.3.3.3
961 99 506 990 997 510 101
A.Webb suggests we compare theory (dmultinom) to practice:
dmultinom(c(4,1,0),prob=c(0.5,0.25,0.25))*2
[1] 0.15625
So the prediction for the first value, 0.5^5 * 100000 = 3125, looks arguably accurate against the simulated 3090, and likewise for the second and third values: 7812.5 each versus the simulated 7705 and 8144.
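As a fuller cross-check, the predicted counts for all 21 combinations can be computed in one pass (a sketch reusing xsimplex; note these probabilities are the simulation's, not the question's):
simProbs <- c(0.5, 0.25, 0.25)
expected <- 100000 * apply(combinat::xsimplex(3, 5), 2, dmultinom, prob = simProbs)
round(expected)   # compare against the simulated table() counts above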
This is an example.
df <- data.frame(item=letters[1:5], n=c(3,2,2,1,1))
df
item n
1 a 3
2 b 2
3 c 2
4 d 1
5 e 1
The items need to be grouped so that each group has a sample size (sum of n) of at least 4.
This would be one solution, following the sort order of df:
item n cluster
1 a 3 1
2 b 2 1
3 c 2 2
4 d 1 2
5 e 1 2
How to get all possible unique solutions?
Further, the code should also not allow any clusters to have a sample size less than 4.
Below is a brute-force approach using the package partitions. The idea is that we find every partition of the rows of df, then sum each group and check that the requirement has been met.
df <- data.frame(item=letters[1:5], n=c(3,2,2,1,1))
minSize <- 4
funGetClusters <- function(df, minSize) {
  allParts <- partitions::listParts(nrow(df))   # every set partition of the row indices
  goodInd <- which(sapply(allParts, function(p) {
    all(sapply(p, function(x) sum(df$n[x])) >= minSize)   # every group meets the minimum size
  }))
  allParts[goodInd]
}
clusterBreakdown <- funGetClusters(df, minSize)
allDfs <- lapply(clusterBreakdown, function(p) {
  copyDf <- df
  copyDf$cluster <- 1L
  clustInd <- 2L
  for (i in p[-1]) {               # the first group keeps cluster 1; number the rest upward
    copyDf$cluster[i] <- clustInd
    clustInd <- clustInd + 1L      # increment so a third, fourth, ... group gets its own label
  }
  copyDf
})
Here is the output:
allDfs
[[1]]
item n cluster
1 a 3 1
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 1
[[2]]
item n cluster
1 a 3 1
2 b 2 2
3 c 2 2
4 d 1 1
5 e 1 1
[[3]]
item n cluster
1 a 3 2
2 b 2 1
3 c 2 1
4 d 1 2
5 e 1 1
[[4]]
item n cluster
1 a 3 2
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 2
[[5]]
item n cluster
1 a 3 2
2 b 2 1
3 c 2 2
4 d 1 1
5 e 1 1
[[6]]
item n cluster
1 a 3 2
2 b 2 2
3 c 2 1
4 d 1 1
5 e 1 1
It should be noted that there is a combinatorial explosion as the number of rows increases: with just 10 rows we would already have to test 115975 different partitions.
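To see this growth directly, one can count the set partitions (this enumerates all of them, so it is itself slow for larger inputs):
length(partitions::listParts(10))   # 115975, the 10th Bell number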
As #chinsoon comments, RcppAlgos could be a good choice for an acceptable solution in larger cases. Disclaimer: I am the author. I have answered similar questions with much larger inputs and have had good success:
Allocating tasks to parallel workers so that expected cost is roughly equal
Split a set into n unequal subsets with the key deciding factor being that the elements in the subset aggregate and equal a predetermined amount?
#AllanCameron also has a great answer and a nice methodology for attacking this problem. You should give that a read as well.
Lastly, the following vignette by Robin K. S. Hankin (author of the partitions package) and Luke J. West is not only a great read, but very applicable to problems like the one presented here.
Set Partitions in R
I'm not a very experienced R user, so I'm seeking advice on how to optimize what I've built and on which direction to take next.
I have one reference data frame, it contains four columns with integer values and one ID.
df <- matrix(ncol=5,nrow = 10)
colnames(df) <- c("A","B","C","D","ID")
# df
for (i in 1:10){
df[i,1:4] <- sample(1:5,4, replace = TRUE)
}
df <- data.frame(df)
df$ID <- make.unique(rep(LETTERS,length.out=10),sep='')
df
A B C D ID
1 2 4 3 5 A
2 5 1 3 5 B
3 3 3 5 3 C
4 4 3 1 5 D
5 2 1 2 5 E
6 5 4 4 5 F
7 4 4 3 3 G
8 2 1 5 5 H
9 4 4 1 3 I
10 4 2 2 2 J
The second data frame holds manual user input. I want to turn this into a Shiny app later on, which is another reason I'm asking about optimization: my code doesn't seem very neat to me.
df.man <- data.frame(matrix(ncol=5,nrow=1))
colnames(df.man) <- c("A","B","C","D","ID")
df.man$ID <- c("man")
df.man$A <- 4
df.man$B <- 4
df.man$C <- 3
df.man$D <- 4
df.man
A B C D ID
4 4 3 4 man
I want to filter rows from the reference sequentially, following these rules:
If there is an exact match in a whole row between the reference table and the manual input, then extract that row (or rows) from the reference and show it; if not, reduce the number of matching columns from right to left until there is a match, but never on fewer than two variables (columns A and B).
With my limited knowledge, I've written this:
# subtraction manual from reference
df <- df %>% dplyr::mutate(Adiff=A-df.man$A)%>%
dplyr::mutate(Bdiff=B-df.man$B)%>%
dplyr::mutate(Cdiff=C-df.man$C) %>%
dplyr::mutate(Ddiff=D-df.man$D)
# check how many values in a row have zero difference and filter those
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0),
"less then two exact match")
))
tbl_df(df0[,1:5])
# A tibble: 1 x 5
A B C D ID
<int> <int> <int> <int> <chr>
1 4 4 3 3 G
It works and finds ID G, but it looks ugly to me. So the first question is: what would be the recommended way to improve this? Are there any functions, packages, or anything else I'm missing?
Second question: I want to complicate the condition.
Imagine we have reference data set.
A B C D ID
2 4 3 5 A
5 1 3 5 B
3 3 5 3 C
4 3 1 5 D
2 1 2 5 E
5 4 4 5 F
4 4 3 3 G
2 1 5 5 H
4 4 1 3 I
4 2 2 2 J
Manual input is
A B C D ID
4 4 2 2 man
Filtering rules should be the following:
If there is an exact match in a whole row between the reference table and the manual input, then extract that row (or rows) from the reference and show it; if not, reduce the number of matching columns from right to left until there is a match, but never on fewer than two variables (columns A and B).
From those rows with only two matching variables, filter the ones that have a +/- 1 difference in the columns to the right. In the example above, I should end up with cases G and I from the reference table.
Continuing the way I did above, I would do the following:
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)))>0,
df01 <- df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)),
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1)))>0,
df01<- df0%>%filter(Cdiff %in% (-1:1)),
"NA"))
It will be about 11 columns at the end, but I assume it doesn't matter so much.
Keeping in mind this objective - how would you suggest to proceed?
Thanks!
This is a lot to sort through, but I have some ideas that might be helpful.
First, you could keep your df a matrix, and use row names for your letters. Something like:
set.seed(2)
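# (The construction of df was omitted in the original; a plausible version,
#  assumed here, is below. The values printed next are those shown in the
#  original answer and may not reproduce exactly under newer sampling defaults.)
df <- matrix(sample(1:5, 40, replace = TRUE), nrow = 10,
             dimnames = list(LETTERS[1:10], c("A", "B", "C", "D")))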
df
A B C D
A 5 1 5 1
B 4 5 1 2
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
F 1 5 5 2
G 2 3 4 3
H 1 1 5 1
I 2 4 5 5
J 4 2 5 5
And for demonstration, you could use a vector for the manual input:
# Complete match example
vec.man <- c(3, 1, 5, 3)
To check for complete matches between the manual input and the reference (all 4 columns), you can do:
df[apply(df, 1, function(x) all(x == vec.man)), ]
A B C D
3 1 5 3
If you don't have a complete match, you can calculate differences between df and vec.man with sweep:
# Change example vec.man
vec.man <- c(3, 1, 5, 2)
df.diff <- sweep(df, 2, vec.man)
A B C D
A 2 0 0 -1
B 1 4 -4 0
C 0 0 -2 0
D 0 0 -4 2
E 0 0 0 1
F -2 4 0 0
G -1 2 -1 1
H -2 0 0 -1
I -1 3 0 3
J 1 1 0 3
The rows whose diffs have the longest run of leading zeros are your best matches (the same as dropping columns from right to left iteratively). The quality of each row's match is therefore the position of the first non-zero element in that row:
df.best <- apply(df.diff, 1, function(x) which(x!=0)[1])
A B C D E F G H I J
1 1 3 3 4 1 1 1 1 1
You can see that the best match is E which was non-zero in the 4th column (last column did not match). You can extract rows that have 4 in df.best as your best matches:
df.match <- df[which(df.best == max(df.best, na.rm = T)), ]
A B C D
3 1 5 3
Finally, if only 2 columns match and you want all rows whose 3rd column is off by at most 1, you can check the position of the best match (it should be 3) and then compare the absolute differences with the vector c(0, 0, 1), meaning two exact matches followed by a 3rd column off by +/- 1:
# Example vec.man with only 2 matches
vec.man <- c(3, 1, 6, 9)
# (recompute df.diff and df.best as above with the new vec.man)
df.match
A B C D
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
if (max(df.best, na.rm = T) == 3) {
vec.alt = c(0, 0, 1)
df[apply(df.diff[,1:3], 1, function(x) all(abs(x) == vec.alt)), ]
}
A B C D
3 1 5 3
This should be scalable for 11 columns and 4 matches.
To generalize for different numbers of columns, #IlyaT suggested:
n.cols <- max(df.best, na.rm=TRUE)
vec.alt <- c(rep(0, each=n.cols-1), 1)
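Putting the pieces together, the steps might be wrapped into a helper like this sketch (the function name and the handling of exact matches are assumptions, not part of the original answer):
closestRows <- function(df, vec.man) {
  df.diff <- sweep(df, 2, vec.man)                          # column-wise differences
  df.best <- apply(df.diff, 1, function(x) which(x != 0)[1])
  df.best[is.na(df.best)] <- ncol(df) + 1L                  # exact matches rank past the last column
  df[df.best == max(df.best), , drop = FALSE]
}
closestRows(df, c(3, 1, 5, 2))   # returns row E for the matrix above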
For a chemistry project at school I want to calculate molecular masses of all possible combinations of molecular formulas including carbon (1 atom up to 100), oxygen (1 up to 50), hydrogen (1 up to 200), nitrogen (1 up to 20) and sulfur (1 up to 10) and save the results in one vector and the corresponding molecular formula string in another vector. The masses are numeric values: 12, 16, 1, 14 and 32. The strings are "C", "O", "H", "N", "S".
I want to delete molecular formulas that make no sense, like C1 O100 H0 N20 S10, from the string and the corresponding mass too. To be more specific, only keep the ones with an O/C ratio between 0 and 1, an H/C ratio between 1 and 2, an N/C ratio between 0 and 0.2, and an S/C ratio between 0 and 0.1.
Is there an easy way to do this? Is a for loop the only way, or is there a faster one (maybe arrays?), and how can I take the element ratios into account?
I would be very happy for some ideas or basic code to solve this.
..so #Gregor, excluding the ratios that don't make sense before the whole list is created would probably be better? #Barker Yes, atoms like nitrogen should go from 0 to the max. I am very new to R, so when I try a loop I end up with only the last value calculated... (a reduced number of dimensions).
z=matrix(0,1,5*20*10*2*2)
C=12
O=16
H=1
N=14
S=32
for( u in 1:length(z)) {
for(i in 1:5) {
for (j in 1:20) {
for(k in 1:10 ) {
for(l in 0:1) {
for(m in 0:1){
z[1,u] <- C*i+H*j+O*k+N*l+S*m
}
}
}
}
}
}
Does anyone know where the mistake is here?
expand.grid is a good place to start in generating combinations. For example, to create a data.frame with combinations of H and C you could do this
mol = expand.grid(C = 1:3, H = 1:4)
mol
# C H
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 1 3
# 8 2 3
# 9 3 3
# 10 1 4
# 11 2 4
# 12 3 4
You can add the other elements in expand.grid as well, and adjust the inputs up to 1:200 or however many you want. If your computer has enough memory, you'll be able to create the full data frame as specified in your question, though at 100 × 50 × 200 × 20 × 10 = 200 million rows it is very big. If you can reduce the total number of combinations to around 1MM, it will be much easier on your memory.
The next step would be to delete rows that don't meet your ratio criteria. Here's one example, to make sure that the number of H is between 1 and 2 times the number of C:
mol = mol[mol$H >= mol$C & mol$H <= 2 * mol$C, ]
mol
# C H
# 1 1 1
# 4 1 2
# 5 2 2
# 8 2 3
# 9 3 3
# 11 2 4
# 12 3 4
Repeat steps like that for all your conditions.
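Applied to the question's actual constraints, the filtering might look like the sketch below (molFull is a hypothetical name so the small mol example can continue underneath; ranges are reduced here so the initial grid stays small, and the ratio bounds are the question's):
molFull <- expand.grid(C = 1:20, O = 1:10, H = 1:40, N = 0:4, S = 0:2)
keep <- with(molFull, O / C <= 1 &
                      H / C >= 1 & H / C <= 2 &
                      N / C <= 0.2 &
                      S / C <= 0.1)
molFull <- molFull[keep, ]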
Finally you can calculate the weights and put it in a new column:
mol$weight = with(mol, C * 12 + H * 1)
mol
# C H weight
# 1 1 1 13
# 4 1 2 14
# 5 2 2 26
# 8 2 3 27
# 9 3 3 39
# 11 2 4 28
# 12 3 4 40
You could use matrix multiplication for the weight calculation, but there's no need with a small number of possible elements. If you had 20 or more possible input elements it would make sense to do it that way.
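For reference, the matrix-multiplication version might look like this (a sketch; masses is an assumed helper vector aligned with mol's element columns):
masses <- c(C = 12, H = 1)   # atomic masses per column
mol$weight2 <- as.vector(as.matrix(mol[, c("C", "H")]) %*% masses)
all(mol$weight2 == mol$weight)   # TRUE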
Bonus! Formulas can be created with paste or paste0:
mol$formula = paste0("C", mol$C, " H", mol$H)
mol
# C H weight formula
# 1 1 1 13 C1 H1
# 4 1 2 14 C1 H2
# 5 2 2 26 C2 H2
# 8 2 3 27 C2 H3
# 9 3 3 39 C3 H3
# 11 2 4 28 C2 H4
# 12 3 4 40 C3 H4
Of course, most of these still won't make chemical sense - C1 H1 isn't something that would really exist, but maybe you can come up with even smarter conditions to get rid of more of the impossibilities!
I'm wondering how I can make sure all interactions of the factors are included when using aggregate, even if they don't appear in the given dataset.
dff <- data.frame(a=as.factor(c(rep(1,3), rep(2,4), rep(3,3))),
b=as.factor(c(rep("A", 4), rep("B",6))),
c=sample(100,10))
levels(dff$b) <- c(levels(dff$b), "C")
levels(dff$a) <- c(levels(dff$a), 10)
dff$b
#[1] A A A A B B B B B B
#Levels: A B C
dff$a
#[1] 1 1 1 2 2 2 2 3 3 3
#Levels: 1 2 3 10
aggregate(c~a+b, dff, sum)
# a b c
#1 1 A 233
#2 2 A 78
#3 2 B 212
#4 3 B 73
what I want is
a b c
1 1 A 233
2 1 B 0
3 1 C 0
4 2 A 78
5 2 B 212
6 2 C 0
7 3 A 0
8 3 B 73
9 3 C 0
10 10 A 0
11 10 B 0
12 10 C 0
NA is fine too.
The reason I want it in this format is that I need to interact dff$c with results from other datasets, and they may be of different lengths if not all factor levels are accounted for. I'm trying to avoid merge and instead use vector calculations.
Thank you in advance.
If your aggregation function is just going to be sum, you can use xtabs, which creates an object of class "table". As such, you can call data.frame on it, which dispatches to the respective "method" and creates a "long" data.frame.
data.frame(xtabs(c ~ b + a, dff))
# b a Freq
# 1 A 1 121
# 2 B 1 0
# 3 C 1 0
# 4 A 2 89
# 5 B 2 203
# 6 C 2 0
# 7 A 3 0
# 8 B 3 126
# 9 C 3 0
# 10 A 10 0
# 11 B 10 0
# 12 C 10 0
This is similar to #nicola's suggestion to use as.data.frame.table, which explicitly calls the method for something that is not explicitly of the class "table" but can be treated as one.
One advantage of this approach (and all the others that follow) is that you can use different functions other than sum.
as.data.frame.table(tapply(dff$c, dff[c("a","b")], sum))
If merge is OK, you can continue with your aggregate step. In this case, we use expand.grid on the levels of your factor vectors:
merge(expand.grid(lapply(dff[c(1, 2)], levels)),
aggregate(c~a+b, dff, sum, drop = FALSE), all = TRUE)
A similar approach can be taken in "data.table":
library(data.table)
as.data.table(dff)[, sum(c), by = .(a, b)][
  do.call(CJ, lapply(dff[c(1, 2)], levels)), on = c("a", "b")]
Or using "dplyr" + "tidyr" (which essentially hides the merge, but ultimately uses left_join to create the missing combinations):
library(dplyr)
library(tidyr)
dff %>%
group_by(a, b) %>%
summarise(c = sum(c)) %>%
complete(a, b, fill = list(c = 0))
This might be a simple question, but I was hoping someone could point me in the right direction. I have a sample dataset:
dfrm <- list(L = c("A","B","P","C","D","E","P","F"), J=c(2,2,1,2,2,2,1,2), K=c(4,3,10,16,21,3,17,2))
dfrm <-as.data.frame(dfrm)
dfrm
L J K
1 A 2 4
2 B 2 3
3 P 1 10
4 C 2 16
5 D 2 21
6 E 2 3
7 P 1 17
8 F 2 2
Column J specifies the type of variable that is defined in K. I want to be able to take the mean of the K values that have a 1 assigned next to them. In this example those values are 10 and 17:
T = c(10,17)
mean(T)
13.5
Next I want to be able to assign a pass/fail rank, where pass = 1 and fail = 0, to identify whether the number in column K is larger than the mean.
The final data set should look like:
cdfrm <- list(L = c("A","B","P","C","D","E","P","F"), J=c(2,2,1,2,2,2,1,2), K=c(4,3,10,16,21,3,17,2),C = c(0,0,0,1,1,0,1,0))
cdfrm <-as.data.frame(cdfrm)
cdfrm
L J K C
1 A 2 4 0
2 B 2 3 0
3 P 1 10 0
4 C 2 16 1
5 D 2 21 1
6 E 2 3 0
7 P 1 17 1
8 F 2 2 0
This seems so basic; I'm sorry, I just don't know what I'm overthinking.
There are two steps in the solution. The first is to calculate the mean for the value you are interested in. In other words, take the mean of a subset of values in your data.frame. R has a handy function to calculate subsets, called subset. Here it is in action:
meanK <- mean(subset(dfrm, subset = J == 1, select = K))
meanK
   K
13.5
(Note: in recent versions of R, mean() is defunct for data frames; mean(subset(dfrm, J == 1)$K) returns the same 13.5.)
Next, you want to compare column K in your data frame with the mean value we have just calculated. This is a straightforward vector comparison:
dfrm$Pass <- dfrm$K>meanK
dfrm
L J K Pass
1 A 2 4 FALSE
2 B 2 3 FALSE
3 P 1 10 FALSE
4 C 2 16 TRUE
5 D 2 21 TRUE
6 E 2 3 FALSE
7 P 1 17 TRUE
8 F 2 2 FALSE
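If you need the 0/1 coding from the desired output rather than TRUE/FALSE, wrap the comparison in as.integer:
dfrm$C <- as.integer(dfrm$K > meanK)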
Here's how to do it in one line:
transform(dfrm, C = K > sapply(split(dfrm$K, dfrm$J), mean)[J])
split groups the values of K according to the values of J, sapply(..., mean) calculates group-wise means, and indexing with [J] expands those means back to one value per row before the comparison.
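A closely related base-R idiom uses ave, which returns each row's group mean aligned with the original rows (an alternative sketch, not part of the original answers):
dfrm$C <- as.integer(dfrm$K > ave(dfrm$K, dfrm$J, FUN = mean))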