conditional sampling without replacement - r

I am attempting to write a simulation that involves randomly re-assigning items to categories with some restrictions.
Let's say I have a collection of pebbles 1 to N distributed across buckets A through J:
set.seed(100)
df1 <- data.frame(pebble = 1:100,
bucket = sample(LETTERS[1:10], 100, T),
stringsAsFactors = F)
head(df1)
#> pebble bucket
#> 1 1 D
#> 2 2 C
#> 3 3 F
#> 4 4 A
#> 5 5 E
#> 6 6 E
I want to randomly re-assign pebbles to buckets. Without restrictions I could do it like so:
random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
colSums(table(random.permutation.df1))
#> A B C D E F G H I J
#> 4 7 13 14 12 11 11 10 9 9
colSums(table(df1))
#> A B C D E F G H I J
#> 4 7 13 14 12 11 11 10 9 9
Importantly this re-assigns pebbles while ensuring that each bucket retains the same number (because we are sampling without replacement).
However, I have a set of restrictions such that certain pebbles cannot be assigned to certain buckets. I encode the restrictions in df2:
df2 <- data.frame(pebble = sample(1:100, 10),
bucket = sample(LETTERS[1:10], 10, T),
stringsAsFactors = F)
df2
#> pebble bucket
#> 1 33 I
#> 2 39 I
#> 3 5 A
#> 4 36 C
#> 5 55 J
#> 6 66 A
#> 7 92 J
#> 8 95 H
#> 9 2 C
#> 10 49 I
The logic here is that pebbles 33 and 39 cannot be placed in bucket I, or pebble 5 in bucket A, etc. I would like to permute which pebbles are in which bucket subject to these restrictions.
So far, I've thought of tackling it in a loop as below, but this does not result in buckets retaining the same number of pebbles:
perms <- character(0)
cnt <- 1
for (p in df1$pebble) {
  perms[cnt] <- sample(df1$bucket[!df1$bucket %in% df2$bucket[df2$pebble == p]], 1)
  cnt <- cnt + 1
}
table(perms)
#> perms
#> A B C D E F G H I J
#> 6 7 12 22 15 1 14 7 7 9
I then tried sampling positions, and then removing that position from the available buckets and the available remaining positions. This is also not working, and I suspect it is because I am sampling my way into branches of the tree that do not yield solutions.
set.seed(42)
perms <- character(0)
cnt <- 1
ids <- 1:nrow(df1)
bckts <- df1$bucket
for (p in df1$pebble) {
  id <- sample(ids[!bckts %in% df2$bucket[df2$pebble == p]], 1)
  perms[cnt] <- bckts[id]
  bckts <- bckts[-id]
  ids <- ids[ids != id]
  cnt <- cnt + 1
}
table(perms)
#> perms
#> A B C D E F G J
#> 1 1 4 1 2 1 2 2
Any thoughts or advice much appreciated (and apologies for the length).
EDIT:
I foolishly forgot to clarify that I was previously solving this by just resampling until I got a draw that didn't violate any of the conditions in df2. However, I now have so many conditions that this makes my code take too long to run. I am still open to brute-forcing it if I can figure out a way to make the brute force faster.

I have a solution. I managed to write it in base R, but the data.table version is easier to understand and write:
random.permutation.df2 <- data.frame(pebble = df1$pebble,
                                     bucket = rep(NA, length(df1$pebble)))
for (bucket in unique(df1$bucket)) {
  # rows still unassigned whose pebble is allowed in this bucket
  open <- is.na(random.permutation.df2$bucket) &
          !random.permutation.df2$pebble %in% df2$pebble[df2$bucket == bucket]
  N <- sum(open)
  # fill this bucket's quota among the open rows, leaving the rest NA for later buckets
  random.permutation.df2$bucket[open] <-
    sample(c(rep(bucket, sum(df1$bucket == bucket)),
             rep(NA, N - sum(df1$bucket == bucket))))
}
The idea is to sample the authorised pebbles for each bucket: those not forbidden by df2 and those not already assigned. You then sample a vector of the right length, mixing NAs (placeholders for the buckets still to come) with the current bucket's value, and voilà.
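A quick sanity check of the result (just a sketch to run after the loop above): bucket sizes should match df1, and merging with df2 should return no rows.
all(table(random.permutation.df2$bucket) == table(df1$bucket))   # bucket sizes preserved
nrow(merge(random.permutation.df2, df2)) == 0                    # no forbidden pebble/bucket pair used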
The same solution is easier to read with data.table:
library(data.table)
random.permutation.df2 <- setDT(random.permutation.df2)
df2 <- setDT(df2)
for (bucketi in unique(df1$bucket)) {
  random.permutation.df2[is.na(bucket) & !pebble %in% df2[bucket == bucketi, pebble],
                         bucket := sample(c(rep(bucketi, sum(df1$bucket == bucketi)),
                                            rep(NA, .N - sum(df1$bucket == bucketi))))]
}
It satisfies both conditions (same bucket sizes, no forbidden assignments):
> colSums(table(df1))
A B C D E F G H I J
4 7 13 14 12 11 11 10 9 9
> colSums(table(random.permutation.df2))
A B C D E F G H I J
4 7 13 14 12 11 11 10 9 9
To verify that there isn't any contradiction with df2
> df2
pebble bucket
1: 37 D
2: 95 H
3: 90 C
4: 80 C
5: 31 D
6: 84 G
7: 76 I
8: 57 H
9: 7 E
10: 39 A
> random.permutation.df2[pebble %in% df2$pebble,.(pebble,bucket)]
pebble bucket
1: 7 D
2: 31 H
3: 37 J
4: 39 F
5: 57 B
6: 76 E
7: 80 F
8: 84 B
9: 90 H
10: 95 D

Here is a brute-force approach where one simply keeps trying until a valid solution is found:
set.seed(123)
df1 <- data.frame(pebble = 1:100,
bucket = sample(LETTERS[1:10], 100, T),
stringsAsFactors = F)
df2 <- data.frame(pebble = sample(1:100, 10),
bucket = sample(LETTERS[1:10], 10, T),
stringsAsFactors = F)
random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
The random permutation violates one of the conditions, so we try new ones:
merge(random.permutation.df1, df2)
#> pebble bucket
#> 1 60 J
while (TRUE) {
  random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
  if (nrow(merge(random.permutation.df1, df2)) == 0)
    break
}
New permutation matches the condition:
merge(random.permutation.df1, df2)
#> [1] pebble bucket
#> <0 rows> (or 0-length row.names)
colSums(table(random.permutation.df1))
#> A B C D E F G H I J
#> 7 12 11 9 14 7 11 11 11 7
colSums(table(df1))
#> A B C D E F G H I J
#> 7 12 11 9 14 7 11 11 11 7
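If the restrictions are dense enough that valid permutations are rare, the loop above can spin for a very long time; a small variation (just a sketch) caps the number of attempts:
max.tries <- 10000   # assumed cap, adjust as needed
found <- FALSE
for (i in seq_len(max.tries)) {
  random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
  if (nrow(merge(random.permutation.df1, df2)) == 0) {
    found <- TRUE
    break
  }
}
found   # FALSE means no valid permutation was found within max.tries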

Related

Same column (different row) operations in R

I have a big database and I'm trying to create a new column, starting from an existing one, by taking the difference between elements in consecutive cells (same column, different row):
existing_column   new_column
A                 A-B
B                 B-C
C                 C-D
D                 D-E
...               ...
Z                 Z-NULL
The way I'm doing it is to duplicate the existing column into a dummy one, remove its first element, add NULL as the last element, and then subtract the dummy column from the existing one ... is there a better way? Thank you
exist <-c("A","B","C","D","E")
db<-data.frame(exist)
dummy<-exist[-1]
dummy[length(dummy)+1]<-"NULL"
new_col<-paste(exist,"-",dummy)
new_col
db<-data.frame(exist,new_col)
db
Does this work:
library(dplyr)
df <- data.frame(existing_column = LETTERS)
df %>% mutate(new_column = paste(existing_column, lead(existing_column, default = 'NULL'), sep = '-'))
existing_column new_column
1 A A-B
2 B B-C
3 C C-D
4 D D-E
5 E E-F
6 F F-G
7 G G-H
8 H H-I
9 I I-J
10 J J-K
11 K K-L
12 L L-M
13 M M-N
14 N N-O
15 O O-P
16 P P-Q
17 Q Q-R
18 R R-S
19 S S-T
20 T T-U
21 U U-V
22 V V-W
23 W W-X
24 X X-Y
25 Y Y-Z
26 Z Z-NULL
Try the code below
transform(
df,
new_column = paste(existing_column, c(existing_column[-1], NA), sep = "-")
)
which gives
existing_column new_column
1 A A-B
2 B B-C
3 C C-D
4 D D-E
5 E E-F
6 F F-G
7 G G-H
8 H H-I
9 I I-J
10 J J-K
11 K K-L
12 L L-M
13 M M-N
14 N N-O
15 O O-P
16 P P-Q
17 Q Q-R
18 R R-S
19 S S-T
20 T T-U
21 U U-V
22 V V-W
23 W W-X
24 X X-Y
25 Y Y-Z
26 Z Z-NA
If you are working with numeric data just represented as characters in your example, you can use mutate() and lead()
df <- data.frame(old_col = sample(1:10))
df %>% mutate(new_col = old_col - lead(old_col, default = 0))
old_col new_col
1 10 4
2 6 -3
3 9 8
4 1 -1
5 2 -5
6 7 3
7 4 1
8 3 -5
9 8 3
10 5 5
In case a fast data.table version is needed:
dt[, new_column:=paste(exist, shift(exist, type="lead"), sep="-")]
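dt isn't defined in the post; a minimal setup to make the line above runnable, assuming the same example column (here named exist, as in the benchmark below):
library(data.table)
dt <- data.table(exist = LETTERS)   # hypothetical stand-in for the real data
dt[, new_column := paste(exist, shift(exist, type = "lead"), sep = "-")]
head(dt)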
Edit: turns out it isn't much faster:
df = data.table(exist = rep(letters, 80000))
m = microbenchmark::microbenchmark(
  a = df %>% mutate(new_column = paste(exist, lead(exist, default = 'NULL'), sep = '-')),

  b = transform(
    df,
    new_column = paste(exist, c(exist[-1], NA), sep = "-")
  ),

  d = df[, new_column := paste(exist, shift(exist, type = "lead"), sep = "-")]
)
m
Unit: milliseconds
expr min lq mean median uq max neval
a 292.2430 309.6150 342.0191 323.9778 361.0937 603.8449 100
b 349.4509 383.3391 475.0177 423.8864 472.0276 2136.2970 100
d 294.6786 302.8530 332.3989 315.6228 340.9642 641.8345 100

Getting the length of a list

I am attempting to decipher a list res which has structure as per below:
How would I go about converting this to a 21 (row) by 2 (column) dataframe?
I can do it by manually hard-coding the 21:
data.frame(matrix(unlist(res), nrow=21 ))
However, I would like to use length(res), which unfortunately returns 1.
As it is a list, use [[ to index it to get the matrix, and then convert it to a data frame.
data.frame(res[[1]])
Or use unlist with recursive = FALSE
data.frame(unlist(res[[1]], recursive = FALSE))
Using a reproducible example,
res <- list(matrix(letters,ncol = 2))
data.frame(res[[1]])
# X1 X2
#1 a n
#2 b o
#3 c p
#4 d q
#5 e r
#6 f s
#7 g t
#8 h u
#9 i v
#10 j w
#11 k x
#12 l y
#13 m z
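As a side note, the hard-coded nrow = 21 from the question can also be avoided by reading the row count off the matrix itself (a sketch, assuming res wraps a single matrix):
n <- nrow(res[[1]])                        # 21 in the question's case, 13 in the example above
data.frame(matrix(unlist(res), nrow = n))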
You can also use magrittr::extract2
res %>% magrittr::extract2(1)
## A tibble: 21 x 2
# V1 V2
# <chr> <chr>
# 1 O M
# 2 W S
# 3 C Q
# 4 L C
# 5 M K
# 6 R M
# 7 U Q
# 8 I T
# 9 K J
#10 H V
## … with 11 more rows
or use purrr::flatten_dfc
purrr::flatten_dfc(res)
## A tibble: 21 x 2
# V1 V2
# <chr> <chr>
# 1 O M
# 2 W S
# 3 C Q
# 4 L C
# 5 M K
# 6 R M
# 7 U Q
# 8 I T
# 9 K J
#10 H V
## … with 11 more rows
Sample data
set.seed(2018)
res <- list(
as_tibble(matrix(sample(LETTERS, 21 * 2, replace = T), nrow = 21, ncol = 2))
)

R: Restricted permutations more efficient way than using for loops

I am trying to permute a character vector a of variable length, picking 3 elements every time, without repetition. Ordering counts only for the first element, not for the second and third (e.g. abc != bac != cab, but abc = acb and bca = bac). Each set of 3 permuted elements should be a row in a dataframe b.
A vector with letters a,b,c,d,e would result in this expected output:
abc
abd
abe
acd
ace
ade
bac
bad
bae
bcd
bce
bde
cab
cad
cae
cbd
cbe
cde
dab
dac
dae
dbc
dbe
dce
eab
eac
ead
ebc
ebd
ecd
Using 3 for loops I think I was able to achieve this output, but it is slow if the vector is long.
a = letters[1:5]
aL = length(a)
b <- data.frame(var1 = character(),
var2 = character(),
var3 = character(),
stringsAsFactors = FALSE)
# restricted permutations for moderation
pracma::tic()
for (i in 1:aL) {
  for (j in 1:(aL - 1)) {
    for (k in (j + 1):aL) {
      if (j != i & k != i) {
        b <- rbind(b, data.frame(a[i], a[j], a[k]))
      }
    }
  }
}
pracma::toc()
#> elapsed time is 0.070000 seconds
b
#> a.i. a.j. a.k.
#> 1 a b c
#> 2 a b d
#> 3 a b e
#> 4 a c d
#> 5 a c e
#> 6 a d e
#> 7 b a c
#> 8 b a d
#> 9 b a e
#> 10 b c d
#> 11 b c e
#> 12 b d e
#> 13 c a b
#> 14 c a d
#> 15 c a e
#> 16 c b d
#> 17 c b e
#> 18 c d e
#> 19 d a b
#> 20 d a c
#> 21 d a e
#> 22 d b c
#> 23 d b e
#> 24 d c e
#> 25 e a b
#> 26 e a c
#> 27 e a d
#> 28 e b c
#> 29 e b d
#> 30 e c d
Created on 2019-07-17 by the reprex package (v0.2.1)
How can I achieve the same outcome in less time? Is recursion faster?
Any help is greatly appreciated. Thank you.
I propose the following solution:
a = letters[1:5]
A = t(combn(a, 3))                          # all possible three-letter combinations,
                                            # disregarding the order
Full = rbind(A, A[, 3:1], A[, c(2, 3, 1)])  # put each element of every combination
                                            # in first place once
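If a data frame like the b in the question is wanted, the matrix can be converted directly (a sketch):
b <- as.data.frame(Full, stringsAsFactors = FALSE)
names(b) <- c("var1", "var2", "var3")
head(b)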
Here's one option for your specific example:
library(gtools)
library(dplyr)
# example vector
vec = letters[1:5]
# vectorised function to rearrange elements (based on your restriction)
f = function(x1,x2,x3) paste0(c(x1, sort(c(x2,x3))), collapse = " ")
f = Vectorize(f)
permutations(length(vec), 3, vec) %>% # get permutations
data.frame(., stringsAsFactors = F) %>% # save as data frame
mutate(vec = f(X1,X2,X3)) %>% # apply function to each row
distinct(vec, .keep_all = T) # keep distinct vec values
# X1 X2 X3 vec
# 1 a b c a b c
# 2 a b d a b d
# 3 a b e a b e
# 4 a c d a c d
# 5 a c e a c e
# 6 a d e a d e
# 7 b a c b a c
# ...
Not clear if you want your output to be 3 separate columns with 1 element each, or one column with the vector, so I'm keeping both for you to choose from. You can keep columns {X1, X2, X3} or just vec.
The following is a straightforward rewrite of the triple for loop as a triple lapply loop.
t1 <- system.time({
  for (i in 1:aL) {
    for (j in 1:(aL - 1)) {
      for (k in (j + 1):aL) {
        if (j != i & k != i) {
          b <- rbind(b, data.frame(a[i], a[j], a[k]))
        }
      }
    }
  }
})

t2 <- system.time({
  d <- lapply(1:aL, function(i) {
    tmp <- lapply(1:(aL - 1), function(j) {
      tmp <- lapply((j + 1):aL, function(k) {
        if (j != i & k != i) c(a[i], a[j], a[k])
      })
      do.call(rbind, tmp)
    })
    do.call(rbind, tmp)
  })
  d <- do.call(rbind.data.frame, d)
  names(d) <- paste("a", 1:3, sep = ".")
})
all.equal(b, d)
#[1] "Names: 3 string mismatches"
rbind(t1, t2)
# user.self sys.self elapsed user.child sys.child
#t1 0.051 0 0.051 0 0
#t2 0.017 0 0.018 0 0

R- random sample of groups in a data.table

How can I randomly sample, e.g., three groups within a data.table so that the result contains those three groups with all of their rows from the original data.table?
library(data.table)
dat <- data.table(ids=1:20, groups=sample(x=c("A","B","C", "D", "E", "F"), 20,
replace=TRUE))
I know how to select 10 rows randomly from a data.table:
dat.sampl1 <- as.data.table(sapply(dat[], sample, 10))
And also how to sample by group
dat[,.SD[sample(.N, min(.N,3))], by = groups]
But how to randomly sample groups? So the result should look like:
ids groups
1 F
11 F
3 F
18 F
8 A
9 A
10 A
17 A
19 A
12 E
14 E
16 E
Do you mean something like:
set.seed(123)
dat <- data.table(ids=1:20, groups=sample(x=c("A","B","C", "D", "E", "F"), 20,
replace=TRUE))
dat[groups %in% sample(unique(dat[, groups]), size = 3)][order(groups)]
# ids groups
# 1: 3 C
# 2: 10 C
# 3: 12 C
# 4: 7 D
# 5: 9 D
# 6: 14 D
# 7: 4 F
# 8: 5 F
# 9: 8 F
# 10: 11 F
# 11: 16 F
# 12: 20 F
If you want to sample groups with replacement, you can do the following, where A has been sampled twice:
dat[unique(dat[, list(groups)])[sample(.N, 3, replace = TRUE)], on = "groups"]
# ids groups
# 1: 3 C
# 2: 10 C
# 3: 12 C
# 4: 6 A
# 5: 15 A
# 6: 18 A
# 7: 6 A
# 8: 15 A
# 9: 18 A
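To see the effect of replacement, counting rows per group in such a draw shows a twice-sampled group contributing its rows twice (a sketch):
res <- dat[unique(dat[, list(groups)])[sample(.N, 3, replace = TRUE)], on = "groups"]
res[, .N, by = groups]   # a group drawn twice shows double its original row count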
This works with a single line of base R, using %in% to check against an index generated with the sample function:
df1[df1[,'groups'] %in% sample(unique(df1[,'groups']), size = 3, replace = F), ]
For example:
> df1 <- data.frame("ids" = 1:20, "groups" = sample(LETTERS[1:4], size = 20, replace = T))
> df2 <- df1[df1[,'groups'] %in% sample(unique(df1[,'groups']), size = 3, replace = F), ]
> df2[order(df2[,'groups']),]
ids groups
4 4 B
6 6 B
18 18 B
20 20 B
1 1 C
2 2 C
3 3 C
9 9 C
12 12 C
16 16 C
19 19 C
7 7 D
11 11 D

mutate based on conditional sum in a group

Say I have a dataframe like this:
library(dplyr)

set.seed(1)
n <- 20
df <- data.frame(ID = sample(1:5, n, replace = TRUE),
                 Fac1 = sample(letters[1:5], n, replace = TRUE),
                 Fac2 = sample(LETTERS[10:15], n, replace = TRUE),
                 Val1 = sample(1:10, n, replace = TRUE)) %>%
  arrange(ID) %>%
  group_by(ID, Fac1) %>%
  summarise(Val1 = sum(Val1), Fac2 = first(Fac2)) %>%
  group_by(ID, Fac2) %>%
  mutate(Val2 = sum(Val1))
df
ID Fac1 Val1 Fac2 Val2
1 1 b 9 N 9
2 1 c 9 O 9
3 2 a 4 K 4
4 2 b 10 M 18
5 2 c 4 L 4
6 2 d 8 M 18
7 2 e 10 N 10
8 3 d 14 N 14
9 4 b 8 L 22
10 4 c 14 L 22
11 4 d 9 K 9
12 4 e 6 N 6
13 5 a 13 M 13
14 5 b 3 N 3
ID is a grouping variable. Rows with an Fac1 value of e should have their Fac2 value changed to be the same as the other row in the group where Fac1 is either b or c, provided the sum of Val2 for the two rows is greater than 20. (I've simplified this to the point where you probably don't get why, but just work with me.)
This is what I have tried so far:
result <- df %>%
  group_by(ID) %>%
  mutate(Fac2 = case_when(
    Fac1 == "e" &
      sum(Val2, ifelse(Fac1 %in% c("b", "c"), Val2, 0)) > 20 ~
      ifelse(sum(Val2, ifelse(Fac1 %in% c("b", "c"), Val2, 0)) > 20,
             as.character(Fac2),
             NA_character_),
    TRUE ~ as.character(Fac2)
  ))
It doesn't work properly because it is summing the first value of Val2 in the group rather than only doing so when Fac1 is b or c.
Any ideas?
Adding desired outcome:
ID Fac1 Val1 Fac2 Val2
1 1 b 9 N 9
2 1 c 9 O 9
3 2 a 4 K 4
4 2 b 10 M 18
5 2 c 4 L 4
6 2 d 8 M 18
7 2 e 10 M 10 **Changed to M b/c row 4 is M and 10 + 18 > 20
8 3 d 14 N 14
9 4 b 8 L 22
10 4 c 14 L 22
11 4 d 9 K 9
12 4 e 6 L 6 **Changed to L b/c row 10 is L and 6 + 22 > 20
13 5 a 13 M 13
14 5 b 3 N 3
I'm having a hard time following what you want the values to be changed to.
But when I have multiple conditions or decisions that need to be made in a sequence, I use a loop and a series of if statements to go through the data frame. I prefer while loops, so that's what I'll use in the example.
counter <- 1
stopper <- nrow(df)
while (counter <= stopper) {
  fac1 <- df$Fac1[counter]
  if (fac1 == 'e') {
    if ([INSERT NEXT CONDITION]) {
      # Change whichever value you're trying to change, using counter to reference the correct row.
    } else {
      # Change whichever value you're trying to change, using counter to reference the correct row.
    }
  }
  counter <- counter + 1
}
For me, simplifying the code makes it a lot easier for me to keep track of what decisions are being made. It also allows for complex decisions that are difficult to get functions to work with.
I was able to get the desired result with this code. I made a new column containing the result of the test for which value should replace Fac2, which wasn't strictly necessary but makes the code more readable and debuggable.
The key thing was to use first(na.omit()) to get the value from a different row in the same group which met the condition.
result <- df %>%
  group_by(ID) %>%
  mutate(Max_bc_Val = ifelse(Val2 == max(ifelse(Fac1 %in% c("b", "c"), Val2, 0)),
                             ifelse(Fac1 %in% c("b", "c"), as.character(Fac2), NA),
                             NA)) %>%
  mutate(Fac2 = case_when(
    Fac1 == "e" ~ ifelse(is.na(first(na.omit(Max_bc_Val))),
                         NA_character_,
                         first(na.omit(Max_bc_Val))),
    TRUE ~ as.character(Fac2)
  ))
This works but doesn't seem like the best solution. Any other ideas?

Resources