In R, need to find best combination of 8 columns, only being able to select one value from each row - r

In R, I'm attempting to find the best combination of 8 different columns of values but with the caveat of only being able to select one value from each row. It sounds relatively simple, but I'm trying to avoid a nasty looping scenario to evaluate all possible options, so I'm hopeful there is a function available that could make this a possibility. There are scenarios where I will need to run this on datasets with over 2000 rows, so efficiency is really important.
Here is an example:
I've been racking my brain and searching forever, but every scenario and solution I'm able to find can maximize a series of columns but can't handle the condition of only allowing a single value per row. Are there any functions where this is possible?

I will take a risk here and assume that I interpreted you right: you seek the group of 8 numbers in that table that has the maximum sum, given, of course, that no two of them share a column or a row.
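There is no easy answer to this question. I am not a computer scientist, and I originally took this for an NP-hard problem; strictly speaking it is the assignment problem, which can be solved exactly in polynomial time. The approach I describe below is a brute-force search over column orderings, so its efficiency will always be a problem as columns are added. Fortunately, in practical terms, I think you can get an answer for a table of 2000+ rows in a matter of seconds, as long as the number of columns remains small.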
The algorithm I used to tackle this problem is essentially a depth-first search that takes advantage of existing functions in R to make it faster. You can think of your problem as jumping from column to column, each time selecting the highest value, with a twist: every time you select a value, all cells in that row are set to zero. So, in essence, when you get to the last column, there will only be one value to choose.
However, due to this nature of excluding rows, your results will be different depending on the order you choose to visit the columns (let's call that a path). Thus, you have to test all paths.
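To make the path dependence concrete, here is a minimal sketch on a 2 x 2 matrix. The function greedy_walk is a stripped-down stand-in I wrote for illustration (it is not the walker function shown further down), and the matrix values are made up.

```r
# Hypothetical helper: visit the columns in the order given by `path`,
# take the largest remaining value in each, then zero out the used row.
greedy_walk <- function(path, m) {
  picked <- numeric(length(path))
  for (a in seq_along(path)) {
    col <- m[, path[a]]
    r <- which.max(col)   # row holding the largest remaining value
    picked[a] <- col[r]
    m[r, ] <- 0           # a used row cannot be picked again
  }
  picked
}

m <- matrix(c(10, 9,
               9, 1), nrow = 2, byrow = TRUE)

sum(greedy_walk(c(1, 2), m))  # column 1 first: picks 10, then 1
sum(greedy_walk(c(2, 1), m))  # column 2 first: picks 9, then 9
```

Visiting column 1 first locks in the 10 and leaves only the 1 for column 2, whereas the reverse order collects both 9s, which is why every ordering has to be tried.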
So our code must be something of the sort:
1- Enumerate all paths (all permutations of column numbers);
2- For each path, "walk" it taking the maximum value of each column and transforming to 0 the values in its row. Store the values;
3- For each set of values, calculate its sum and select based on that.
Below is the code I have used to do it:
library(combinat) # provides permn(), which enumerates all permutations
# Create fake data
data = sample(1:25)
data = matrix(data, 5, 5)
# Walking function
walker = function(path, data) {
  bestn = numeric(length(path))    # placeholder for the max value of each column
  usedrows = numeric(length(path)) # placeholder for the row of each max value
  data.reduced = data              # work on a copy of the data
  for (a in seq_along(path)) {     # iterate through the columns in path order
    bestn[a] = max(data.reduced[, path[a]])          # find the maximum value
    usedrows[a] = which.max(data.reduced[, path[a]]) # find the maximum value's row
    data.reduced[usedrows[a], ] = 0 # set all values in that row to 0
    data.reduced[, path[a]] = 0     # set the current column to 0
  }
  return(bestn)
}
# Build all permutations, walk each one, sum the picked values, and keep the best
paths = permn(1:5)
values = lapply(paths, walker, data)
values.sum = sapply(values, sum)
values[[which.max(values.sum)]]
The code can handle a 2000 x 5 matrix in less than a second on a laptop. I did not include that run here because, the more rows there are, the less the results depend on the path taken, and it is harder to follow the progress with large numbers.

This problem can be solved simply as a binary integer optimization problem, here using the ROI and ompr optimization packages. ompr is a model formulation manager that calls ROI functions for optimization and processing. Here is an example:
require(ROI)
require(ROI.plugin.glpk)
require(ompr)
require(ompr.roi)
set.seed(7)
n <- runif(77, 80, 120)
n <- c(n, rep(0, 179))
n <- sample(n)
m <- matrix(n, ncol = 8)
nrows <- nrow(m)
ncols <- ncol(m)
model <- MIPModel() %>%
  add_variable(x[i, j], i=1:nrows, j=1:ncols, type='binary', lb=0) %>%
  set_objective(sum_expr(colwise(m[i, j]) * x[i, j], i=1:nrows, j=1:ncols), 'max') %>%
  add_constraint(sum_expr(x[i, j], i=1:nrows) <= 1, j=1:ncols) %>%
  add_constraint(sum_expr(x[i, j], j=1:ncols) <= 1, i=1:nrows)
result <- solve_model(model, with_ROI(solver = "glpk", verbose = TRUE))
<SOLVER MSG> ----
GLPK Simplex Optimizer, v4.47
40 rows, 256 columns, 512 non-zeros
* 0: obj = 0.000000000e+000 infeas = 0.000e+000 (0)
* 20: obj = 9.321807877e+002 infeas = 0.000e+000 (0)
OPTIMAL SOLUTION FOUND
GLPK Integer Optimizer, v4.47
40 rows, 256 columns, 512 non-zeros
256 integer variables, all of which are binary
Integer optimization begins...
+ 20: mip = not found yet <= +inf (1; 0)
+ 20: >>>>> 9.321807877e+002 <= 9.321807877e+002 0.0% (1; 0)
+ 20: mip = 9.321807877e+002 <= tree is empty 0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
<!SOLVER MSG> ----
solution <- get_solution(result, x[i, j])
solution <- subset(solution, value != 0)
solution
    variable  i j value
27         x 27 1     1
43         x 11 2     1
88         x 24 3     1
99         x  3 4     1
146        x 18 5     1
173        x 13 6     1
209        x 17 7     1
246        x 22 8     1
The first code chunk generates a 32 x 8 random matrix; sample shuffles the values so that about 30% of the cells are non-zero. The two constraints limit each column and each row to at most one active variable. You can use this code directly for a matrix of any dimension.
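Written out, the model above appears to be the following binary program. The symbols are my own shorthand rather than part of the original answer: R and C stand for nrows and ncols, m_{ij} is the matrix entry, and x_{ij} = 1 means that cell (i, j) is selected.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i=1}^{R} \sum_{j=1}^{C} m_{ij}\, x_{ij} \\
\text{subject to}\quad & \sum_{i=1}^{R} x_{ij} \le 1 \quad \text{for } j = 1, \dots, C
  && \text{(at most one pick per column)} \\
& \sum_{j=1}^{C} x_{ij} \le 1 \quad \text{for } i = 1, \dots, R
  && \text{(at most one pick per row)} \\
& x_{ij} \in \{0, 1\}
\end{aligned}
```

The first constraint family corresponds to the add_constraint call indexed by j, and the second to the one indexed by i.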

Related

How to select a specific amount of rows before and after predefined values

I am trying to select relevant rows from a large time-series data set. The tricky bit is, that the needed rows are before and after certain values in a column.
# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)
In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:
ones<-which(df$y==1)
selection <- NULL
for (i in ones) {
  jj <- (i-2):(i+4)
  selection <- c(selection,jj)
}
df$selection <- 0
df$selection[selection] <- 1
This, arguably, scales poorly for more values. For df$y==2 I would have to repeat with:
twos<-which(df$y==2)
selection <- NULL
for (i in twos) {
  jj <- (i-2):(i+4)
  selection <- c(selection,jj)
}
df$selection[selection] <- 2
The ideal scenario would be a function similar to this imaginary function selector(data=df$y, values=c(1,2), before=2, after=5, afterafter = FALSE, beforebefore=FALSE), where values is fed with the critical values, before with the number of rows to select before each one, and after correspondingly with the number of rows after.
Meanwhile, afterafter would allow for the possibility of going from a certain row until a certain row after the value, e.g. after=5, afterafter=10 (and the same in the other direction with beforebefore).
Any tips and suggestions are very welcome!
Thanks!
This is easy enough with rep and its each argument.
df$y[rep(which(df$y == 2), each=7L) + -2:4] <- 2
Here, rep repeats the row indices that match your criterion 7 times each (two before, the value itself, and four after; the L indicates that the argument should be an integer). Add the values -2 through 4 to get the indices to select. Now, replace.
Note that for some comparisons, == will not be adequate due to numerical precision. See the SO post why are these numbers not equal for a detailed discussion of this topic. In these cases, you could use something like
which(abs(df$y - 2) < 0.001)
or whatever precision measure will work for your problem.
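The same idea can be wrapped in a small function that handles several critical values at once. This is only a sketch: select_around is a hypothetical name, and dropping indices that fall outside the vector is my own assumption about what should happen near the ends.

```r
# Hypothetical helper: for each value in `values`, mark `before` rows ahead
# of and `after` rows behind every occurrence in y.
select_around <- function(y, values, before = 2L, after = 4L) {
  selection <- numeric(length(y))        # 0 means "not selected"
  offsets <- -before:after               # e.g. -2, -1, 0, 1, 2, 3, 4
  for (v in values) {
    hits <- which(y == v)
    idx <- rep(hits, each = length(offsets)) + offsets
    idx <- idx[idx >= 1 & idx <= length(y)]  # drop positions past either end
    selection[idx] <- v
  }
  selection
}

# Example data laid out as in the question
y <- rep(0, 100)
y[c(13, 44, 80)] <- 1
y[c(20, 34, 92)] <- 2
sel <- select_around(y, values = c(1, 2), before = 2, after = 4)
which(sel == 1)
```

If windows for different values overlap, the value processed last wins in this sketch; that may or may not be the behaviour you want.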

find all unique combinations of n numbers between 1 and k

I want a list of all possible sets of five (or n) numbers between 1 and 63 (or more generalizably 1 and k)
If computing time wasn't an issue, I could do something like
# Get all combinations of numbers between 1 and 63
indexCombinations <- expand.grid(1:63, 1:63, 1:63, 1:63, 1:63)
# Throw out the rows that have more than one of the same number in them
allDifferent <- apply(indexCombinations, 1, function(x){
  length(x) == length(unique(x))
  } # function
) # apply
indexCombinationsValid <- indexCombinations[allDifferent,]
# And then just take the unique values
indexCombinationsValidUnique <- unique(indexCombinationsValid)
Finding the unique values, I am concerned, is going to be prohibitively slow. Furthermore, I end up having to create a bunch of rows that I never use. I was wondering if anyone has a more elegant and efficient way of getting a data frame or matrix of unique combinations of five numbers (or n numbers) between one and some upper limit.
Credit to @SymbolixAU for a very elegant solution, which I re-post here as an answer:
n <- 1:63; x <- combn(n, m = 5)
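For anyone unsure what shape combn returns, here is a small sketch on a range that is easy to inspect. The choice of 1:5 and m = 3 is mine, purely for illustration.

```r
# All 3-element subsets of 1:5; combn() returns one subset per column
x <- combn(1:5, m = 3)
dim(x)            # 3 rows (elements per subset), 10 columns (subsets)

# Transpose to get one combination per row, as a data frame
combos <- as.data.frame(t(x))
nrow(combos)

# Number of 5-element subsets of 1:63, computed without building them
choose(63, 5)     # 7028847
```

Since choose(63, 5) is about 7 million, the full result for the original problem is a 5 x 7028847 matrix, far smaller than the 63^5 rows produced by expand.grid.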

simulate x percentage of missing and error in data in r

I would like to do two things to my fairly large data set of about 10 K x 50 K. The following is a smaller set of 200 x 10000.
First I want to generate 5% missing values, which is perhaps simple and can be done with a simple trick:
# dummy data
set.seed(123)
# matrix of X variable
xmat <- matrix(sample(0:4, 2000000, replace = TRUE), ncol = 10000)
colnames(xmat) <- paste ("M", 1:10000, sep ="")
rownames(xmat) <- paste("sample", 1:200, sep = "")
Generate missing values at 5% random places in the data.
N <- 2000000*0.05 # 5% random missing values
inds_miss <- round ( runif(N, 1, length(xmat)) )
xmat[inds_miss] <- NA
Now I would like to generate errors (meaning a value different from the one I have in the above matrix). The above matrix has values of 0 to 4. So what I would like to do:
(1) I would like to replace a value x with another value that is not x (for example, 0 can be replaced by a random sample of values that are not 0, i.e. 1, 2, 3 or 4; similarly, 1 can be replaced by a value that is not 1, i.e. 0, 2, 3 or 4). The indices where a random value is to be substituted can be generated simply with:
inds_err <- round ( runif(N, 1, length(xmat)) )
If I randomly sample values from 0:4 and assign them at these indices, this will sometimes replace a value with the same value (0 with 0, 1 with 1 and so on) without creating an error.
errorg <- sample(0:4, length(inds_err), replace = TRUE)
xmat[inds_err] <- errorg
(2) So what I would like to do is introduce errors in xmat, which already has missing values. However, I do not want the NAs generated in the above step to be replaced with a value (0 to 4). So inds_err should not contain any member of the vector inds_miss.
So, the summary rules:
(1) The missing values should not be replaced with error values.
(2) The existing value must be replaced with a different value (which is the definition of error here); with plain random sampling there is a 1/5 probability of drawing the same value again.
How can it be done? I need a fast solution that can be used on my large dataset.
You can try this:
inds_err <- setdiff(round ( runif(2*N, 1, length(xmat)) ),inds_miss)[1:N]
xmat[inds_err]<-(xmat[inds_err]+sample(4,N,replace=TRUE))%%5
With the first line you generate 2*N possible error indices, then you remove the ones belonging to inds_miss and take the first N. With the second line you add a random number between 1 and 4 to the values you want to change and then take the result mod 5. In this way you can be sure that the new value will be different from the original and still in the 0-4 range.
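A quick way to convince yourself that the modular shift works is to run it on a small matrix and compare against a copy of the original. This sketch makes two assumptions of its own: the matrix is shrunk to 20 x 100, and the indices are drawn with sample instead of round(runif(...)), so that exactly N distinct cells are hit in each step.

```r
set.seed(123)
xmat <- matrix(sample(0:4, 2000, replace = TRUE), ncol = 100)
orig <- xmat                               # untouched copy for comparison
N <- round(length(xmat) * 0.05)            # 5% of the cells

# Missing values at N distinct positions
inds_miss <- sample(length(xmat), N)
xmat[inds_miss] <- NA

# Error positions drawn only from cells that are not missing
candidates <- setdiff(seq_along(xmat), inds_miss)
inds_err <- sample(candidates, N)

# Shift by 1..4 modulo 5: stays in 0..4 and never lands on the old value
xmat[inds_err] <- (xmat[inds_err] + sample(4, N, replace = TRUE)) %% 5

sum(is.na(xmat)) == N                      # no NA was overwritten
all(xmat[inds_err] != orig[inds_err])      # every error differs from the original
```

Both checks should come back TRUE, whatever the seed.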
Here's an if/else solution that could work for you. It is a for loop, so I'm not sure whether that will be fast enough for you. It could possibly be vectorized in some way to make it faster.
# vector of options
vec <- 0:4
# simple logic-based solution that leaves NA values unchanged
for (i in inds_err) {
  if (is.na(xmat[i])) {
    next
  } else {
    # draw from the values that differ from the current one
    xmat[i] <- sample(vec[vec != xmat[i]], 1)
  }
}

how to select a matrix column based on column name

I have a table with shortest paths obtained with:
g<-barabasi.game(200)
geodesic.distr <- table(shortest.paths(g))
geodesic.distr
# 0 1 2 3 4 5 6 7
# 117 298 3002 2478 3342 3624 800 28
I then build a matrix with 100 rows and same number of columns as length(geodesic.distr):
geo<-matrix(0, nrow=100, ncol=length(unlist(labels(geodesic.distr))))
colnames(geo) <- unlist(labels(geodesic.distr))
Now I run 100 experiments where I create preferential attachment-based networks with
for(i in seq(1:100)){
  bar <- barabasi.game(vcount(g))
  geodesic.distr <- table(shortest.paths(bar))
  distance <- unlist(labels(geodesic.distr))
  for(ii in distance){
    geo[i,ii]<-WHAT HERE?
  }
}
and for each experiment, I'd like to store in the matrix how many paths I have found.
My question is: how to select the right column based on the column name? In my case, some names produced by the simulated network may not be present in the original one, so I need not only to find the right column by its name, but also the closest one (suppose my max value is 7, I may end up with a path of length 9 which is not present in the geo matrix, so I want to add it to the column named 7)
There is actually a problem with your approach. The length of the geodesic.distr table is stochastic, and you are allocating a matrix to store 100 realizations based on a single run. What if one of the 100 runs gives you a longer geodesic.distr vector? I assume you want to make the allocated matrix bigger in this case. Or, even better, you want to run the 100 realizations first, and allocate the matrix after you know its size.
Another potential problem is that if you do table(shortest.paths(bar)), then you are (by default) considering undirected distances, so you will end up with a symmetric matrix and count all distances (except for self-distances) twice. This may or may not be what you want.
Anyway, here is a simple way, with the matrix allocated after the 100 runs:
dists <- lapply(1:100, function(x) {
  bar <- barabasi.game(vcount(g))
  table(shortest.paths(bar))
})
maxlen <- max(sapply(dists, length))
geo <- t(sapply(dists, function(d) c(d, rep(0, maxlen-length(d)))))
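For the literal question of writing into a column by name, a minimal sketch in base R might look like this. The table tab is a made-up stand-in for table(shortest.paths(bar)), and folding distances beyond the last column into that column follows the rule described in the question.

```r
# Pre-allocated matrix with columns named "0".."7", as in the question
geo <- matrix(0, nrow = 2, ncol = 8,
              dimnames = list(NULL, as.character(0:7)))

# Made-up stand-in for table(shortest.paths(bar)); note the distance 9
tab <- table(c(0, 1, 1, 2, 9, 9, 9))

# Clamp each observed distance to the largest column that exists
max_col <- max(as.integer(colnames(geo)))
d <- pmin(as.integer(names(tab)), max_col)

# Merge counts that now share a column, then assign by column name
counts <- tapply(as.vector(tab), d, sum)
geo[1, names(counts)] <- counts
geo[1, ]
```

Indexing with a character vector selects columns by name, so the three paths of length 9 end up in the column named 7.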

Matching elements in a list

Just starting to program in R... Got stumped on this one, perhaps because I don't know where to begin.
Define a random variable to be equal to the number of trials before there is a match. So if you have a list of numbers, (4,5,7,11,3,11,12,8,8,1....), the first value of the random variable is 6 because by then there are two 11's.(4,5,7,11,3,11) The second value is 3 because then you have 2 8's..12,8,8.
The code below creates the list of numbers, u, by simulating from a uniform distribution.
Thank-you for any help or pointers. I've included a full description of the problem I am solving below if anyone is interested (trying to learn by coding a statistics text).
set.seed(1); u = matrix(runif(1000), nrow=1000)
u[u > 0 & u <= 1/12] <- 1
u[u > 1/12 & u <= 2/12] <- 2
u[u > 2/12 & u <= 3/12] <- 3
u[u > 3/12 & u <= 4/12] <- 4
u[u > 4/12 & u <= 5/12] <- 5
u[u > 5/12 & u <= 6/12] <- 6
u[u > 6/12 & u <= 7/12] <- 7
u[u > 7/12 & u <= 8/12] <- 8
u[u > 8/12 & u <= 9/12] <- 9
u[u > 9/12 & u <= 10/12] <- 10
u[u > 10/12 & u <= 11/12] <- 11
u[u > 11/12 & u < 12/12] <- 12
table(u); u[1:10,]
Example 2.6-3 Concepts in Probability and Stochastic Modeling, Higgins
Suppose we were to ask people at random in which month they were born. Let the random variable X denote the number of people we would need to ask before we found two people born in the same month. The possible values for X are 2,3,...13. That is, at least two people must be asked in order to have a match and no more than 13 need to be asked. With the simplifying assumption that every month is an equally likely candidate for a response, a computer simulation was used to estimate the probability mass function of X. The simulation generated birth months until a match was found. Based on 1000 repetitions of this experiment, the following empirical distribution and sample statistics were obtained...
R has a steep initial learning curve. I don't think it's fair to assume this is your homework, and yes, it's possible to find solutions if you know what you're looking for. However, I remember it being difficult at times to research problems online simply because I didn't know what to search for (I wasn't familiar enough with the terminology).
Below is an explanation of one approach to solving the problem in R. Read the commented code and try and figure out exactly what it's doing. Still, I would recommend working through a good beginner resource. From memory, a good one to get up and running is icebreakeR, but there are many out there...
# set the number of simulations
nsim <- 10000
# Create a matrix, with nsim columns, and fill it with something.
# The something with which you'll populate it is a random sample,
# with replacement, of month names (held in a built-in vector called
# 'month.abb'). We're telling the sample function that it should take
# 13*nsim samples, and these will be used to fill the matrix, which
# has nsim columns (and hence 13 rows). We've chosen to take samples
# of length 13, because as your textbook states, 13 is the maximum
# number of month names necessary for a month name to be duplicated.
mat <- matrix(sample(month.abb, 13*nsim, replace=TRUE), ncol=nsim)
# If you like, take a look at the first 10 columns
mat[, 1:10]
# We want to find the position of the first duplicated value for each column.
# Here's one way to do this, but it might be a bit confusing if you're just
# starting out. The 'apply' family of functions is very useful for
# repeatedly applying a function to columns/rows/elements of an object.
# Here, 'apply(mat, 2, foo)' means that for each column (2 represents columns,
# 1 would apply to rows, and 1:2 would apply to every cell), do 'foo' to that
# column. Our function below extends this a little with a custom function. It
# says: for each column of mat in turn, call that column 'x' and perform
# 'match(1, duplicated(x))'. This match function will return the position
# of the first '1' in the vector 'duplicated(x)'. The vector 'duplicated(x)'
# is a logical (boolean) vector that indicates, for each element of x,
# whether that element has already occurred earlier in the vector (i.e. if
# the month name has already occurred earlier in x, the corresponding element
# of duplicated(x) will be TRUE (which equals 1), else it will be FALSE (0)).
# So the match function returns the position of the first duplicated month
# name (well, actually the second instance of that month name). e.g. if
# x consists of 'Jan', 'Feb', 'Jan', 'Mar', then duplicated(x) will be
# FALSE, FALSE, TRUE, FALSE, and match(1, duplicated(x)) will return 3.
# Referring back to your textbook problem, this is x, a realisation of the
# random variable X.
# Because we've used the apply function, the object 'res' will end up with
# nsim realisations of X, and these can be plotted as a histogram.
res <- apply(mat, 2, function(x) match(1, duplicated(x)))
hist(res, breaks=seq(0.5, 13.5, 1))
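If you would rather work from a single long stream of numbers, as in the question (count until the first match, then start counting afresh), something along these lines may help. The function name first_match_lengths is hypothetical, and the restart-after-each-match rule is my reading of the example given in the question.

```r
# Hypothetical helper: walk along v, record how many values were needed to
# see the first repeat, then start over from the next value.
first_match_lengths <- function(v) {
  out <- integer(0)
  start <- 1L
  while (start <= length(v)) {
    pos <- match(TRUE, duplicated(v[start:length(v)]))
    if (is.na(pos)) break      # no further repeat in the remaining values
    out <- c(out, pos)
    start <- start + pos       # continue right after the matching value
  }
  out
}

# The example sequence from the question
first_match_lengths(c(4, 5, 7, 11, 3, 11, 12, 8, 8, 1))
```

For the example sequence this gives 6 and then 3, as described in the question; the trailing 1 is left over because no further match occurs.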
