I recently asked a question about improving performance in my code (Faster method than "while" loop to find chain of infection in R).
Background:
I'm analyzing large tables (300 000 - 500 000 rows) that store data output by a disease simulation model. In the model, animals on a landscape infect other animals. For example, in the example pictured below, animal a1 infects every animal on the landscape, and the infection moves from animal to animal, branching off into "chains" of infection.
In my original question, I asked how I could output a data.frame corresponding to animal "d2"'s "chain" of infection (see below, outlined in green, for an illustration of one "chain"). The suggested solution worked well for one animal.
In reality, I will need to calculate chains for about 400 animals, corresponding to a subset of all animals (allanimals table).
I've included a link to an example dataset that is large enough to play with.
Here is the code for one chain, starting with animal 5497370, and note that I've slightly changed column names from my previous question, and updated the code!
The code:
allanimals <- read.csv("https://www.dropbox.com/s/0o6w29lz8yzryau/allanimals.csv?raw=1",
stringsAsFactors = FALSE)
# Here's an example animal
ExampleAnimal <- 5497370
ptm <- proc.time()
allanimals_ID <- setdiff(unique(c(allanimals$ID, allanimals$InfectingAnimal_ID)), -1)
infected <- rep(NA_integer_, length(allanimals_ID))
infected[match(allanimals$ID, allanimals_ID)] <-
match(allanimals$InfectingAnimal_ID, allanimals_ID)
path <- rep(NA_integer_, length(allanimals_ID))
curOne <- match(ExampleAnimal, allanimals_ID)
i <- 1
while (!is.na(nextOne <- infected[curOne])) {
path[i] <- curOne
i <- i + 1
curOne <- nextOne
}
chain <- allanimals[path[seq_len(i - 1)], ]
chain
proc.time() - ptm
# check it out
chain
I'd like to output chains for each animal in "sel.set":
library(dplyr)
sel.set <- allanimals %>%
filter(HexRow < 4 & Year == 130) %>%
pull("ID")
If possible, I'd like to store the "chain" data.frames in a list whose length equals the number of chains.
So I'll return the indices to access the data frame rather than all data frame subsets. You'll just need to use lapply(test, function(path) allanimals[path, ]) or with a more complicated function inside the lapply if you want to do other things on the data frame subsets.
One could think of just lapply on the solution for one animal:
get_path <- function(animal) {
curOne <- match(animal, allanimals_ID)
i <- 1
while (!is.na(nextOne <- infected[curOne])) {
path[i] <- curOne
i <- i + 1
curOne <- nextOne
}
path[seq_len(i - 1)]
}
sel.set <- allanimals %>%
filter(HexRow < 4 & Year == 130) %>%
pull("ID")
system.time(
test <- lapply(sel.set, get_path)
) # 0.66 seconds
We could rewrite this function as a recursive function (this will introduce my third and last solution).
system.time(
sel.set.match <- match(sel.set, allanimals_ID)
) # 0
get_path_rec <- function(animal.match) {
`if`(is.na(nextOne <- infected[animal.match]),
NULL,
c(animal.match, get_path_rec(nextOne)))
}
system.time(
test2 <- lapply(sel.set.match, get_path_rec)
) # 0.06
all.equal(test2, test) # TRUE
This solution is about 10 times as fast. I am not sure why, but a likely contributor is that path[i] <- curOne inside get_path makes a local copy of the full-length global path vector on every call, which the recursive version avoids.
Why did I want a recursive function? I thought there might be many cases where you want the paths of both animalX and animalY, where animalY infected animalX. When computing the path of animalX, you would then recompute the whole path of animalY.
So I wanted to use memoization to store already computed results, and memoization works well with recursive functions. So my last solution:
get_path_rec_memo <- memoise::memoize(get_path_rec)
memoise::forget(get_path_rec_memo)
system.time(
test3 <- lapply(sel.set.match, get_path_rec_memo)
) # 0.12
all.equal(test3, test) # TRUE
Unfortunately, this is slower than the second solution. Hope it will be useful for the whole dataset.
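One possible reason the memoised version does not pay off: the recursive calls inside get_path_rec still go to the un-memoised function, so only top-level results are cached. Below is a minimal sketch of a variant that also caches intermediate paths. The environment-based cache, the name get_path_cached and the toy infected vector are assumptions made for illustration, not part of the answer above.

```r
# Toy parent vector (hypothetical data): infected[k] is the index of the
# animal that infected animal k; NA marks an animal with no recorded infector.
infected <- c(NA, 1L, 2L, 2L, 4L)

# Hypothetical cache: one entry per index whose path has been computed.
path_cache <- new.env()

get_path_cached <- function(k) {
  key <- as.character(k)
  hit <- path_cache[[key]]          # NULL when the key is not cached yet
  if (!is.null(hit)) return(hit)
  nextOne <- infected[k]
  # The recursion calls the cached function itself, so shared tails are reused
  res <- if (is.na(nextOne)) integer(0) else c(k, get_path_cached(nextOne))
  assign(key, res, envir = path_cache)
  res
}

paths <- lapply(c(5L, 3L), get_path_cached)
```

With the data from the question, this would presumably be called as lapply(sel.set.match, get_path_cached). Note that the sketch returns integer(0) rather than NULL for an animal with no infector, and that very long chains can still hit R's recursion limit.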
I have a set of numbers that I want to split into subsets via set partitioning. The deciding factor in how these subsets are generated is that the sum of the elements in each subset should be as close as possible to a number generated by a pre-determined distribution. The subsets need not be the same size and each element can only be in one subset. I had previously been given guidance on this problem via the greedy algorithm (Link here), but I have found that some of the larger numbers in the set drastically skewed the results. I would therefore like to use some form of set partitioning for this problem.
A deeper underlying issue, which I would really like to correct for future problems, is that I find I am drawn to the “brute force” approach with these types of problems (as you can see from my code below, which attempts to use folds to solve the problem via “brute force”). This is obviously a completely inefficient way to tackle the problem, and so I would like to tackle these minimization-type problems with a more intelligent approach going forward. Any advice is therefore greatly appreciated.
library(groupdata2)
library(dplyr)
set.seed(345)
j <- runif(500,0,10000000)
dist <- c(.3,.2,.1,.05,.065,.185,.1)
s_diff <- 9999999999
for (i in 1:100) {
x <- fold(j, k = length(dist), method = "n_rand")
if (abs(sum(j) * dist[1] - sum(j[which(x$.folds==1)])) < abs(s_diff)) {
s_diff <- abs(sum(j) * dist[1] - sum(j[which(x$.folds==1)]))
x_fin <- x
}
}
This is just a simplified version only looking at the first ‘subset’. s_diff would be the smallest difference between the theoretical and actual results simulated, and x_fin would be which subset each element would be in (ie which fold it corresponds to). I was then looking to remove the elements that fell into the first subset and continue from there, but I know my method is inefficient.
Thanks in advance!
This is not a trivial problem, as you will probably gather from the complete lack of answers at 10 days, even with a bounty. As it happens, I think it is a great problem for thinking about algorithms and optimizations, so thanks for posting.
The first thing I would note is that you are absolutely right that this is not the kind of problem with which to try brute force. You may get near to a correct answer, but with a non-trivial number of samples and distribution points, you won't find the optimum solution. You need an iterative approach that moves elements about only if they make the fit better, and the algorithm needs to stop when it can't make it any better.
My approach here is to split the problem into three stages:
1. Cut the data into approximately the correct bins as a first approximation.
2. Move elements from the bins that are a bit too big to the ones that are a bit too small. Do this iteratively until no more moves will optimize the bins.
3. Swap the elements between columns to fine-tune the fit, until the swaps are optimal.
The reason to do it in this order is that each step is computationally more expensive, so you want to pass a better approximation to each step before letting it do its thing.
Let's start with a function to cut the data into approximately the correct bins:
cut_elements <- function(j, dist)
{
# Specify the sums that we want to achieve in each partition
partition_sizes <- dist * sum(j)
# The cumulative partition sizes give us our initial cuts
partitions <- cut(cumsum(j), cumsum(c(0, partition_sizes)))
# Name our partitions according to the given distribution
levels(partitions) <- levels(cut(seq(0,1,0.001), cumsum(c(0, dist))))
# Return our partitioned data as a data frame.
data.frame(data = j, group = partitions)
}
We want a way to assess how close this approximation (and each subsequent approximation) is to our answer. We can plot against the target distribution, but it will also be helpful to have a numerical figure to assess the goodness of fit to include on our plot. Here, I will use the sum of the squares of the differences between the sample bins and the target bins. We'll use the log to make the numbers more comparable. The lower the number, the better the fit.
library(dplyr)
library(ggplot2)
library(tidyr)
compare_to_distribution <- function(df, dist, title = "Comparison")
{
df %>%
group_by(group) %>%
summarise(estimate = sum(data)/sum(df$data)) %>%
mutate(group = factor(cumsum(dist))) %>%
mutate(target = dist) %>%
pivot_longer(cols = c(estimate, target)) ->
plot_info
log_ss <- log(sum((plot_info$value[plot_info$name == "estimate"] -
plot_info$value[plot_info$name == "target"])^2))
ggplot(data = plot_info, aes(x = group, y = value, fill = name)) +
geom_col(position = "dodge") +
labs(title = paste(title, ": log sum of squares =", round(log_ss, 2)))
}
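As a small numeric illustration of the goodness-of-fit figure computed in the function above, here is the same log sum of squares worked out by hand. The three-bin target and estimate vectors are made-up values used only for this sketch.

```r
# Hypothetical bin proportions, for illustration only
target   <- c(0.3, 0.2, 0.5)
estimate <- c(0.31, 0.19, 0.5)

# Log of the sum of squared differences between estimate and target;
# a more negative value means a closer fit
log_ss <- log(sum((estimate - target)^2))
```

Two bins that are each off by 0.01 give a sum of squares of 2e-4, so log_ss is about -8.5; every further tenfold reduction in the errors lowers it by roughly 4.6.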
So now we can do:
cut_elements(j, dist) %>% compare_to_distribution(dist, title = "Cuts only")
We can see that the fit is already pretty good with a simple cut of the data, but we can do a lot better by moving appropriately sized elements from the over-sized bins to the under-sized bins. We do this iteratively until no more moves will improve our fit. We use two nested while loops, which should make us worry about computation time, but we have started with a close match, so we shouldn't get too many moves before the loop stops:
move_elements <- function(df, dist)
{
ignore_max = length(dist);
while(ignore_max > 0)
{
ignore_min = 1
match_found = FALSE
while(ignore_min < ignore_max)
{
group_diffs <- sort(tapply(df$data, df$group, sum) - dist*sum(df$data))
group_diffs <- group_diffs[ignore_min:ignore_max]
too_big <- which.max(group_diffs)
too_small <- which.min(group_diffs)
swap_size <- (group_diffs[too_big] - group_diffs[too_small])/2
which_big <- which(df$group == names(too_big))
candidate_row <- which_big[which.min(abs(swap_size - df[which_big, 1]))]
if(df$data[candidate_row] < 2 * swap_size)
{
df$group[candidate_row] <- names(too_small)
ignore_max <- length(dist)
match_found <- TRUE
break
}
else
{
ignore_min <- ignore_min + 1
}
}
if (match_found == FALSE) ignore_max <- ignore_max - 1
}
return(df)
}
Let's see what that has done:
cut_elements(j, dist) %>%
move_elements(dist) %>%
compare_to_distribution(dist, title = "Cuts and moves")
You can see now that the match is so close we are struggling to see whether there is any difference between the target and the partitioned data. That's why we needed the numerical measure of GOF.
Still, let's get this fit as good as possible by swapping elements between columns to fine-tune them. This step is computationally expensive, but again we are already giving it a close approximation, so it shouldn't have much to do:
swap_elements <- function(df, dist)
{
ignore_max = length(dist);
while(ignore_max > 0)
{
ignore_min = 1
match_found = FALSE
while(ignore_min < ignore_max)
{
group_diffs <- sort(tapply(df$data, df$group, sum) - dist*sum(df$data))
too_big <- which.max(group_diffs)
too_small <- which.min(group_diffs)
current_excess <- group_diffs[too_big]
current_defic <- group_diffs[too_small]
current_ss <- current_excess^2 + current_defic^2
all_pairs <- expand.grid(df$data[df$group == names(too_big)],
df$data[df$group == names(too_small)])
all_pairs$diff <- all_pairs[,1] - all_pairs[,2]
all_pairs$resultant_big <- current_excess - all_pairs$diff
all_pairs$resultant_small <- current_defic + all_pairs$diff
all_pairs$sum_sq <- all_pairs$resultant_big^2 + all_pairs$resultant_small^2
improvements <- which(all_pairs$sum_sq < current_ss)
if(length(improvements) > 0)
{
swap_this <- improvements[which.min(all_pairs$sum_sq[improvements])]
r1 <- which(df$data == all_pairs[swap_this, 1] & df$group == names(too_big))[1]
r2 <- which(df$data == all_pairs[swap_this, 2] & df$group == names(too_small))[1]
df$group[r1] <- names(too_small)
df$group[r2] <- names(too_big)
ignore_max <- length(dist)
match_found <- TRUE
break
}
else ignore_min <- ignore_min + 1
}
if (match_found == FALSE) ignore_max <- ignore_max - 1
}
return(df)
}
Let's see what that has done:
cut_elements(j, dist) %>%
move_elements(dist) %>%
swap_elements(dist) %>%
compare_to_distribution(dist, title = "Cuts, moves and swaps")
Pretty close to identical. Let's quantify that:
df <- cut_elements(j, dist) %>%
move_elements(dist) %>%
swap_elements(dist)
tapply(df$data, df$group, sum)/sum(j)
# (0,0.3] (0.3,0.5] (0.5,0.6] (0.6,0.65] (0.65,0.715] (0.715,0.9]
# 0.30000025 0.20000011 0.10000014 0.05000010 0.06499946 0.18500025
# (0.9,1]
# 0.09999969
So, we have an exceptionally close match: each partition is less than one part in one million away from the target distribution. Quite impressive considering we only had 500 measurements to put into 7 bins.
In terms of retrieving your data, we haven't touched the ordering of j within the data frame df:
all(df$data == j)
# [1] TRUE
and the partitions are all contained within df$group. So if we want a single function to return just the partitions of j given dist, we can just do:
partition_to_distribution <- function(data, distribution)
{
cut_elements(data, distribution) %>%
move_elements(distribution) %>%
swap_elements(distribution) %>%
`[`(,2)
}
In conclusion, we have created an algorithm that creates an exceptionally close match. However, that's no good if it takes too long to run. Let's test it out:
microbenchmark::microbenchmark(partition_to_distribution(j, dist), times = 100)
# Unit: milliseconds
# expr min lq mean median uq
# partition_to_distribution(j, dist) 47.23613 47.56924 49.95605 47.78841 52.60657
# max neval
# 93.00016 100
Only 50 milliseconds to fit 500 samples. Seems good enough for most applications. The run time grows much faster than linearly with larger samples (about 10 seconds on my PC for 10,000 samples), but by that point the relative fineness of the samples means that cut_elements %>% move_elements already gives you a log sum of squares of below -30 and would therefore be an exceptionally good match without the fine tuning of swap_elements. These would only take about 30 ms with 10,000 samples.
To add to the excellent answer by @AllanCameron, here is a solution that utilizes the highly efficient function comboGeneral from RcppAlgos*.
library(RcppAlgos)
partDist <- function(v, d, tol_ratio = 0.0001) {
tot_sum <- d * sum(v)
orig_len <- length(v)
tot_len <- d * orig_len
df <- do.call(rbind, lapply(1L:(length(d) - 1L), function(i) {
len <- as.integer(tot_len[i])
vals <- comboGeneral(v, len,
constraintFun = "sum",
comparisonFun = "==",
limitConstraints = tot_sum[i],
tolerance = tol_ratio * tot_sum[i],
upper = 1)
ind <- match(vals, v)
v <<- v[-ind]
data.frame(data = as.vector(vals), group = rep(paste0("g", i), len))
}))
len <- orig_len - nrow(df)
rbind(df, data.frame(data = v,
group = rep(paste0("g", length(d)), len)))
}
The idea is that we find a subset of v (e.g. j in the OP's case) such that the sum is within a tolerance of sum(v) * d[i] for some index i (d is equivalent to dist in the OP's example). After we find a solution (N.B. we are putting a cap on the number of solutions by setting upper = 1), we assign its elements to a group and then remove them from v. We then iterate until we are left with just enough elements in v to be assigned to the last distributed value (e.g. dist[length(dist)]).
Here is an example using the OP's data:
set.seed(345)
j <- runif(500,0,10000000)
dist <- c(.3,.2,.1,.05,.065,.185,.1)
system.time(df_op <- partDist(j, dist, 0.0000001))
user system elapsed
0.019 0.000 0.019
And using the function for plotting by @AllanCameron we have:
df_op %>% compare_to_distribution(dist, "RcppAlgos OP Ex")
What about a larger sample with the same distribution:
set.seed(123)
j <- runif(10000,0,10000000)
## N.B. Very small ratio
system.time(df_huge <- partDist(j, dist, 0.000000001))
user system elapsed
0.070 0.000 0.071
The results:
df_huge %>% compare_to_distribution(dist, "RcppAlgos Large Ex")
As you can see, the solution scales very well. We can speed up execution by loosening tol_ratio at the expense of the quality of the result.
For reference, with the large data set the solution given by @AllanCameron takes just under 3 seconds and gives a similar log sum of squares value (~44):
system.time(allan_large <- partition_to_distribution(j, dist))
user system elapsed
2.261 0.675 2.938
* I am the author of RcppAlgos
I have the following code:
n <- 1e6
no_clm <- rpois(n,30)
hold <- data.frame("x" = double(n))
c = 1
for (i in no_clm){
ctl <- sum(rgamma(i,30000)-2000)
hold[c,1] <- ctl
#hold <- rbind(hold,df)
c = c +1
}
Unfortunately this code is quite slow. I've narrowed the slowdown to hold[c,1] <- ctl. If I remove this then the code runs near instantly.
How can I make this efficient? I need to store the results to some sort of dataframe or list in a fast fashion. In reality the actual code is more complex than this but the slowing point is the assigning.
Note that the above is just an example, in reality I have multiple calculations on the rgamma samples and each of these calculations are then stored in a large dataframe.
Try this
hold=data.frame(sapply(no_clm,function(x){
return(sum(rgamma(x,30000)-2000))
}))
It looks like you can just use one call to rgamma, as you are iterating over the number of observations parameter.
So if you do one call and then split the vector into the lengths required (no_clm), you can just iterate over that list and sum:
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame("x" = double(n))
# total observations to use for rgamma
total_clm <- sum(no_clm)
# get values
gammas <- rgamma(total_clm, 30000) - 2000
# split into list of lengths dictated by no_clm
hold$x <- sapply(split(gammas, cumsum(sequence(no_clm) == 1)), sum)
This took 5.919892 seconds
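The same single-call idea can also be written without split() and sapply(), by labelling each draw with its group and summing per group with rowsum() from base R. This is only a sketch under stated assumptions: n is reduced from the question's 1e6 so it runs quickly, and the helper names grp, sums and x are introduced here for illustration.

```r
set.seed(1)
n <- 1000                      # smaller than the question's 1e6, to keep the sketch quick
no_clm <- rpois(n, 30)

# One call to rgamma for all draws, as in the answer above
gammas <- rgamma(sum(no_clm), 30000) - 2000

# Label every draw with the index of the claim count it belongs to
grp <- rep(seq_along(no_clm), times = no_clm)

# Start from zeros so that any entry with no_clm == 0 keeps a sum of 0
x <- numeric(n)
sums <- rowsum(gammas, grp)    # one-column matrix; rownames are the group ids
x[as.integer(rownames(sums))] <- sums[, 1]

hold <- data.frame(x = x)
```

Building the labels with rep() also keeps the result at length n if some no_clm entry happens to be 0, a case in which the cumsum(sequence(no_clm) == 1) grouping would produce fewer groups than rows.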
Move into sapply() loop instead of a for loop and then realise 2000 * no_clm can be moved outside the loop (to minimise number of function calls).
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame(x = sapply(no_clm, function(i) sum(rgamma(i, 30000))) - 2000 * no_clm)
You may observe a speed pickup using data.table:
library(data.table)
dt = data.table(no_clm)
dt[, hold := sapply(no_clm, function(x) sum(rgamma(x, 30000)-2000))]
I am new to R and have written a function that needs to be run multiple times to generate the final dataset.
The number of runs is determined by the vector of unique years, and the function gives an output for each of these years.
Still, I am not getting the right output.
Desired output: for example, if it takes 10 samples from each year, then after the 10th run I should have 100 rows of correct output.
create_strsample <- function(n1,n2){
yr <- c(2010,2011,2012,2013)
for(i in 1:length(yr)){
k1<-subset(data,format(as.Date(data$account_opening_date),"%Y")==yr[i])
r1 <-sample(which(!is.na(k1$account_closing_date)),n1,replace=FALSE)
r2<-sample(which(is.na(k1$account_closing_date)),n2,replace=FALSE)
#final.data <-k1[c(r1,r2),]
sample.data <- lapply(yr, function(x) {f.data<-create_strsample(200,800)})
k1 <- do.call(rbind,k1)
return(k1)
}
final <- do.call(rbind,sample.data)
return(final)
}
stratified.sample.data <- create_strsample(200,800)
An MWE would have been nice, but I'll give you a template for this kind of question. Note that this is not optimized for speed (or anything else), but only for ease of understanding.
As noted in the comments, that call to create_strsample in the loop looks weird and probably isn't what you really want.
data <- data.frame() # we need an empty, but existing variable for the first loop iteration
for (i in 1:10) {
temp <- runif(1,max=i) # do something...
data <- rbind(data,temp) # ... and add this to 'data'
} # repeat 10 times
rm(temp) # don't need this anymore
That return(k1) in the loop also looks wrong.
I tried this later, after your suggestion @herbaman, for the desired output minus the lapply.
create_strsample <- function(n1,n2){
final.data <- NULL
yr <- c(2010,2011,2012,2013)
for(i in 1:length(yr)){
k1<-subset(data,format(as.Date(data$account_opening_date),"%Y")==yr[i])
r1 <- k1[sample(which(!is.na(k1$account_closing_date)),n1,replace=FALSE), ]
r2 <- k1[sample(which(is.na(k1$account_closing_date)),n2,replace=FALSE), ]
sample.data <- rbind(r1,r2)
final.data <- rbind(final.data, sample.data)
}
return(final.data)
}
stratified.sample.data <- create_strsample(200,800)
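The same per-year sampling can also be sketched with lapply() and a single do.call(rbind, ...), which avoids growing final.data inside the loop. The accounts data frame below is a made-up stand-in for the asker's data, and sample_year is a hypothetical helper; only the two column names are taken from the code above.

```r
set.seed(42)

# Hypothetical stand-in for the asker's `data`: 400 accounts opened 2010-2013
accounts <- data.frame(
  account_opening_date = as.Date("2010-01-01") + sample(0:1460, 400, replace = TRUE),
  account_closing_date = as.Date(NA)
)
closed <- sample(nrow(accounts), 200)
accounts$account_closing_date[closed] <- accounts$account_opening_date[closed] + 30

# Sample n_closed closed and n_open still-open accounts opened in one year
sample_year <- function(df, year, n_closed, n_open) {
  k <- df[format(df$account_opening_date, "%Y") == as.character(year), ]
  is_closed <- !is.na(k$account_closing_date)
  rbind(k[sample(which(is_closed), n_closed), ],
        k[sample(which(!is_closed), n_open), ])
}

years <- 2010:2013
strat <- do.call(rbind, lapply(years, function(y) sample_year(accounts, y, 5, 5)))
```

Here strat has 10 rows per year, 5 closed and 5 open. As in the original code, sample() will stop with an error if a year has fewer matching rows than requested.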
I made a matrix-based population model. However, I would like to run more than one simultaneously in order to represent different groups of animals, so that dispersing individuals can move between matrices. I originally just repeated everything to get a second matrix, but then I realised that because I run the model using a for loop and break() under certain conditions (when that specific matrix should stop running, i.e. that group has died out) it is, understandably, stopping the whole model rather than just that single matrix.
I was wondering if anyone had any suggestions on the best way to code the model so that instead of breaking, and stopping the whole for loop, it just stops running across that specific matrix. I'm a little stumped. I have included a single run of one matrix below.
Also, if anyone has a more efficient way of creating and running 9 matrices than writing everything out 9 times, advice would be much appreciated.
n.steps <- 100
mats <- array(0,c(85,85,n.steps))
ns <- array(0,c(85,n.steps))
ns[1,1]<-0
ns[12,1]<-rpois(1,3)
ns[24,1]<-rpois(1,3)
ns[85,1] <- 1
birth<-4
nextbreed<-12
for (i in 2:n.steps){
# set up an empty matrix;
mat <- matrix(0,nrow=85,ncol=85)
surv.age.1 <- 0.95
x <- 2:10
diag(mat[x,(x-1)]) <- surv.age.1
surv.age.a <- 0.97
disp <- 1:74
disp <- disp*-0.001
disp1<-0.13
disp<-1-(disp+disp1)
survdisp<-surv.age.a*disp
x <- 11:84
diag(mat[x,(x-1)])<-survdisp
if (i == nextbreed) {
pb <- 1
} else {
pb <- 0
}
if (pb == 1) {
(nextbreed <- nextbreed+12)
}
mat[1,85] <- pb*birth
mat[85,85]<-1
death<-sample(c(replicate(1000,
sample(c(1,0), prob=c(0.985, 1-0.985), size = 1))),1)
if (death == 0) {
break()}
mats[,,i]<- mat
ns[,i] <- mat%*%ns[,i-1]
}
group.size <- apply(ns[1:85,],2,sum)
plot(group.size)
View(mat)
View(ns)
As somebody else suggested on Twitter, one solution might be to simply turn the matrix into all 0s whenever death happens. It looks to me like death is the probability that a local population disappears? In which case it seems to make good biological sense to just turn the entire population matrix into 0s.
A few other small changes: I made a list of replicate simulations so I could summarize them easily.
If I understand correctly,
death<-sample(c(replicate(1000,sample(c(1,0), prob=c(0.985, 1-0.985), size =1))),1)
says "a local population dies completely with probability 1.5%". In which case, I think you could replace it with rbinom(). I did that below and my plots look similar to those I made with your code.
Hope that helps!
lots <- replicate(100, simplify = FALSE, expr = {
for (i in 2:n.steps){
# set up an empty matrix;
mat <- matrix(0,nrow=85,ncol=85)
surv.age.1 <- 0.95
x <- 2:10
diag(mat[x,(x-1)]) <- surv.age.1
surv.age.a <- 0.97
disp <- 1:74
disp <- disp*-0.001
disp1<-0.13
disp<-1-(disp+disp1)
survdisp<-surv.age.a*disp
x <- 11:84
diag(mat[x,(x-1)])<-survdisp
if (i == nextbreed) {
pb <- 1
} else {
pb <- 0
}
if (pb == 1) {
(nextbreed <- nextbreed+12)
}
mat[1,85] <- pb*birth
mat[85,85]<-1
death<-rbinom(1, size = 1, prob = 0.985)
if (death == 0) {
mat[] <- 0
}
mats[,,i]<- mat
ns[,i] <- mat%*%ns[,i-1]
}
ns
})
lapply(lots, FUN = function(x) apply(x[1:85,],2,sum))
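For the original question of running several groups in one loop without break() stopping all of them, one option is to keep a per-group alive flag and skip extinct groups with next. This is a rough sketch under stated assumptions: the 2-stage toy_mat, the starting sizes and the number of groups are made up, and only the 0.985 survival probability comes from the question.

```r
set.seed(1)
n.groups <- 3     # hypothetical number of groups (the question mentions 9)
n.steps  <- 50
n.stages <- 2     # toy stage count; the question's model uses 85

# One population array per group: stages x time steps
ns <- replicate(n.groups, {
  m <- matrix(0, n.stages, n.steps)
  m[, 1] <- c(5, 5)
  m
}, simplify = FALSE)

alive <- rep(TRUE, n.groups)
toy_mat <- matrix(c(0, 0.9, 1.1, 0), nrow = n.stages)   # hypothetical projection matrix

for (i in 2:n.steps) {
  for (g in seq_len(n.groups)) {
    if (!alive[g]) next                      # extinct group: leave its zeros, skip it
    if (rbinom(1, size = 1, prob = 0.985) == 0) {
      alive[g] <- FALSE                      # this group dies out from step i onwards
      next
    }
    ns[[g]][, i] <- toy_mat %*% ns[[g]][, i - 1]
  }
}

# Total group size per time step, one column per group
group.size <- sapply(ns, colSums)
```

Because every group lives in the same list and is advanced inside the same time step, moving dispersers between groups could be added inside the inner loop; that part is not sketched here.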
I'm working on subsets of data from multiple time periods and I'd like to do column and level reduction on my training set and then apply the same actions to other datasets of the same structure.
dataframeReduce in the Hmisc package is what I've been using, but applying the function to different datasets results in slightly different actions.
trainPredictors<-dataframeReduce(trainPredictors,
fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors<-dataframeReduce(testPredictors,
fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors<-testPredictors[,names(trainPredictors)]
The final line ends up erroring because testPredictors has a column removed that trainPredictors retains. All other sets should have the transformations that were applied to trainPredictors applied to them.
Does anyone know how to apply the same cleanup actions to multiple datasets either using dataframeReduce or another function/block of code?
An example
Using the function NAins from http://trinkerrstuff.wordpress.com/2012/05/02/function-to-generate-a-random-data-set/
NAins <- NAinsert <- function(df, prop = .1){
n <- nrow(df)
m <- ncol(df)
num.to.na <- ceiling(prop*n*m)
id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
rows <- id %/% m + 1
cols <- id %% m + 1
sapply(seq(num.to.na), function(x){
df[rows[x], cols[x]] <<- NA
}
)
return(df)
}
library("Hmisc")
trainPredictors<-NAins(mtcars, .1)
testPredictors<-NAins(mtcars, .3)
trainPredictors<-dataframeReduce(trainPredictors,
fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors<-dataframeReduce(testPredictors,
fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors<-testPredictors[,names(trainPredictors)]
If your goal is to have the same variables with the same levels, then you need to avoid using dataframeReduce a second time. Instead, use the same columns as produced by the dataframeReduce operation on the train-set, and apply factor reduction logic to the test-set in a manner that gives whatever degree of homology the subsequent comparison operations need. If a predict operation is planned, then you need to get the levels to be the same, and you need to modify the code in dataframeReduce that works on the levels:
if (is.category(x) || length(unique(x)) == 2) {
tab <- table(x)
if ((min(tab)/n) < minprev) {
if (is.category(x)) {
x <- combine.levels(x, minlev = minprev)
s <- "grouped categories"
if (length(levels(x)) < 2)
s <- paste("prevalence<", minprev, sep = "")
}
else s <- paste("prevalence<", minprev, sep = "")
}
}
So a better problem statement is likely to produce a better strategy. This will probably require both knowing what levels are in the entire set and in the train and test sets as well as what testing or predictions are anticipated (but not yet stated).
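As a rough base R sketch of the "reduce once, then reuse the decisions" idea, the code below derives the column selection and factor levels from the training set only and then applies them to the test set. The toy data frames and the simple missing-fraction rule are stand-ins chosen for illustration; they do not reproduce everything dataframeReduce does.

```r
# Toy data; column names and values are hypothetical
train <- data.frame(a = 1:4,
                    b = factor(c("x", "y", "x", "y")),
                    c = c(NA, NA, NA, 1))
test  <- data.frame(a = 5:8,
                    b = factor(c("x", "z", "y", "x")),
                    c = 1:4)

# Stand-in for the reduction step on the training set:
# drop columns with more than 20% missing values
keep <- names(train)[colMeans(is.na(train)) <= 0.2]
train_red <- train[, keep, drop = FALSE]

# Apply the same decisions to the test set instead of re-deriving them
test_red <- test[, keep, drop = FALSE]
for (nm in keep) {
  if (is.factor(train_red[[nm]])) {
    # Levels not seen in training (here "z") become NA
    test_red[[nm]] <- factor(as.character(test_red[[nm]]),
                             levels = levels(train_red[[nm]]))
  }
}
```

Whether unseen levels should become NA or be merged into a catch-all level depends on the planned prediction step, which the question does not state.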