I am working in R with a database that has two variables: Camouflage and Detection. The values are binary: 0 for conspicuous, 1 for camouflaged; 1 for detected, 0 for undetected. However, during my analysis I added values called Unknown to the Detection variable. I would like to permute the Unknowns with 1s and 0s and see whether each permutation affects the significance of the glm I am using. A permutation may change all Unknowns to 0 or to 1, or some to 1 and others to 0: a random permutation. It may be simple; I am just not really fluent in R.
Try this:
camouf <- c(1, NA, 0, 0, 1, 0, NA, NA, NA, 0, 1)

perm <- function(vec, chance = 0.5) {
  unknown <- which(is.na(vec))  # positions of the Unknown (NA) values
  vec[unknown] <- sample(0:1, size = length(unknown),
                         prob = c(1 - chance, chance), replace = TRUE)
  return(vec)
}
perm(camouf) # do it once
replicate(50, perm(camouf)) # do it many times
It defines a function perm that does what you call a permutation: it takes a vector of 0s and 1s and puts a random 0 or 1 at the places where the original had an NA. The probability of drawing a 1 can be set via the chance argument.
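To see how each random fill affects significance, you can refit the glm for every permuted vector and collect the p-values. A minimal sketch, assuming your data are in a data frame df with a 0/1 Camouflage column and a Detection column whose Unknown entries are coded as NA (the name df and the formula Detection ~ Camouflage are assumptions here):
pvals <- replicate(100, {
  dat <- df
  dat$Detection <- perm(dat$Detection)  # fill the Unknowns with random 0/1
  fit <- glm(Detection ~ Camouflage, family = binomial, data = dat)
  summary(fit)$coefficients["Camouflage", "Pr(>|z|)"]  # p-value for Camouflage
})
summary(pvals)  # how much the significance varies across random fills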
In R, I'm attempting to find the best combination of 8 different columns of values but with the caveat of only being able to select one value from each row. It sounds relatively simple, but I'm trying to avoid a nasty looping scenario to evaluate all possible options, so I'm hopeful there is a function available that could make this a possibility. There are scenarios where I will need to run this on datasets with over 2000 rows, so efficiency is really important.
Here is an example:
I've been racking my brain and searching forever, but every scenario and solution I'm able to find can maximize a series of columns but can't handle the condition of only allowing a single value per row. Are there any functions where this is possible?
I will take a risk here and assume that I interpreted you right: that you seek the group of 8 numbers in that table that has the maximum sum, given, of course, that no two of them share a column or a row.
There is no easy answer to this question. I am not a computer scientist, but I believe this is what is called an NP-hard problem, so efficiency will always be a concern. Fortunately, in practical terms, I think you can get an answer for a 2000+ row table in a matter of seconds, as long as the number of columns remains small.
The approach I used to attack this problem is essentially a depth-first search that takes advantage of existing R functions to make it faster. You can think of your problem as jumping from column to column, each time selecting the highest value, with a twist: every time you select a value, all cells in that row are turned to zero. So in essence, when you get to the last column, there will be only one value left to choose.
However, because rows are excluded along the way, your results will differ depending on the order in which you visit the columns (let's call that a path). Thus, you have to test all paths.
So our code must do something of the sort:
1- Enumerate all paths (all permutations of column numbers);
2- For each path, "walk" it, taking the maximum value of each column and setting the values in its row to 0. Store the values;
3- For each set of values, calculate its sum and choose the best path based on that.
Below is the code I have used to do it:
library(combinat)  # loads the permn function, which enumerates all permutations

# Create fake data
data <- sample(1:25)
data <- matrix(data, 5, 5)

# Walking function
walker <- function(path, data) {
  bestn <- numeric(length(path))     # placeholder for the max value of each column
  usedrows <- numeric(length(path))  # placeholder for the row of each max value
  data.reduced <- data               # copy data to a new object
  for (a in 1:length(path)) {                         # iterate through columns in path order
    bestn[a] <- max(data.reduced[, path[a]])          # find the maximum value
    usedrows[a] <- which.max(data.reduced[, path[a]]) # find the maximum value's row
    data.reduced[usedrows[a], ] <- 0                  # set all values in that row to 0
    data.reduced[, path[a]] <- 0                      # set the current column to 0
  }
  return(bestn)
}

# Create all permutations, walk each one, get their sums, and choose based on that
paths <- permn(1:5)
values <- lapply(paths, walker, data)
values.sum <- sapply(values, sum)
values[[which.max(values.sum)]]
The code can handle a 2000 x 5 matrix in less than a second on a laptop. I just did not use one here because, with more rows, the results become increasingly independent of the path taken, and it is harder to see what is happening with large numbers.
This problem can be solved simply as a binary integer optimization problem, here using the ROI and ompr optimization packages. ompr is a formulation manager that calls ROI functions for optimization and processing. Here is an example:
require(ROI)
require(ROI.plugin.glpk)
require(ompr)
require(ompr.roi)

set.seed(7)
n <- runif(77, 80, 120)   # 77 nonzero values
n <- c(n, rep(0, 179))    # pad with zeros to 256 cells
n <- sample(n)
m <- matrix(n, ncol = 8)
nrows <- nrow(m)
ncols <- ncol(m)

model <- MIPModel() %>%
  add_variable(x[i, j], i = 1:nrows, j = 1:ncols, type = 'binary', lb = 0) %>%
  set_objective(sum_expr(colwise(m[i, j]) * x[i, j], i = 1:nrows, j = 1:ncols), 'max') %>%
  add_constraint(sum_expr(x[i, j], i = 1:nrows) <= 1, j = 1:ncols) %>%  # at most one pick per column
  add_constraint(sum_expr(x[i, j], j = 1:ncols) <= 1, i = 1:nrows)      # at most one pick per row

result <- solve_model(model, with_ROI(solver = "glpk", verbose = TRUE))
<SOLVER MSG> ----
GLPK Simplex Optimizer, v4.47
40 rows, 256 columns, 512 non-zeros
* 0: obj = 0.000000000e+000 infeas = 0.000e+000 (0)
* 20: obj = 9.321807877e+002 infeas = 0.000e+000 (0)
OPTIMAL SOLUTION FOUND
GLPK Integer Optimizer, v4.47
40 rows, 256 columns, 512 non-zeros
256 integer variables, all of which are binary
Integer optimization begins...
+ 20: mip = not found yet <= +inf (1; 0)
+ 20: >>>>> 9.321807877e+002 <= 9.321807877e+002 0.0% (1; 0)
+ 20: mip = 9.321807877e+002 <= tree is empty 0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
<!SOLVER MSG> ----
solution <- get_solution(result, x[i, j])
solution <- subset(solution, value != 0)
solution
variable i j value
27 x 27 1 1
43 x 11 2 1
88 x 24 3 1
99 x 3 4 1
146 x 18 5 1
173 x 13 6 1
209 x 17 7 1
246 x 22 8 1
The first code chunk generates a 32 x 8 random matrix; the sample call gives roughly a 30% fill of nonzero values. The constraints restrict each column and each row to at most one active variable. You can use this code directly for a matrix of any dimension.
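If you also want the total of the selected cells, ompr's objective_value() returns it; for the run above it should match the 9.3218e+02 value reported in the GLPK log:
objective_value(result)  # optimal sum of the selected cells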
I'm trying to create a function where the user passes in a vector (file) and the function returns the number of subintervals of different sizes that fit without exceeding the dimensions of the vector. The function has worked well for many values, but for some specific values it exceeds the dimensions and I don't know why. A reproducible example follows; note that when box_size = 101 (size of the subintervals) the function returns ninbox = 5 (number of subintervals) and not 4 as it should.
file <- rnorm(500); N <- length(file)
box_size <- c(92, 101, 111)
j <- 1; ninbox2 <- 0; aux_ninbox <- 0; aux_ninbox[1] <- box_size[j]; ninbox <- 0; sum_box <- 0
for (j in 1:length(box_size)) {
  while (aux_ninbox <= N) {
    ninbox <- ninbox + 1
    sum_box[ninbox] <- box_size[j]
    aux_ninbox <- sum_box[ninbox] + aux_ninbox
    ninbox2[j] <- ninbox
  }
  aux_ninbox <- 0; aux_ninbox[1] <- box_size[j]; ninbox <- 0; sum_box <- 0
}
ninbox2
For instance, if the size of the subinterval is 101 and the size of the vector is 500, then the function should count how many subintervals of size 101 fit in 500. In this case, 101+101+101+101 = 404 (4 subintervals), because the next one would exceed the dimension of the vector (500). The function has an error, because it returns 5 for intervals of size 101. But note that for intervals of size 92 and 111 it works perfectly!
Okay, now I get it. I simplified your code a little bit:
file <- rnorm(500)
N <- length(file)
box_size <- c(92, 101, 111)
ninbox2 <- 0
for (j in 1:length(box_size)) {
  aux_ninbox <- box_size[j]  # initialize with the current box size, before the while-loop
  ninbox <- 0
  sum_box <- 0
  while (aux_ninbox <= N) {
    ninbox <- ninbox + 1
    sum_box[ninbox] <- box_size[j]
    aux_ninbox <- sum_box[ninbox] + aux_ninbox
    ninbox2[j] <- ninbox
  }
}
ninbox2
There were a few issues with where your variables were assigned:
aux_ninbox[1] <- box_size[j]
should not be assigned at the end of your for-loop body. You have to put it at the beginning, before the while-loop starts. That caused your error, since the second iteration of your for-loop again used the box_size of 92 for the calculation.
If you just want to get the number of boxes, you could simply use
N %/% box_size
This divides the length of your given vector by the sizes of your boxes, ignoring the remainders. See "Arithmetic Operators" in the R help for more information about integer division (%/%) and modulo (%%).
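For the example above, this gives the counts for all three box sizes at once:
N %/% box_size
[1] 5 4 4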
I want to conditionally replace values in a specific vector in a 3D array, the replacement value being drawn from a probability calculation. For some reason the replacement value is the same for all values of the vector, rather than being calculated for each vector element individually. I must have something simple wrong in my syntax.
library(abind)
pop <- array(c(1,0,1,1,1, 0,0,0,0,0, 2,0,2,3,5), dim = c(1,5,3))
pop <- abind(pop, pop, along = 1)
so the particular vector I want to work on is
pop[dim(pop)[1], ,1]
[1] 1 0 1 1 1
What I want to achieve is to leave the zero values alone, and if the value is one, run a random binomial test to see if it changes to zero, and if it does change, do the insertion. I'm told that ifelse is vectorized, but with this syntax it is not operating individually on each element of the vector. When I try to produce a new vector as such
ifelse (pop[dim(pop)[1], ,1] == 1, rbinom(1,1,0.5), 0)
I get either no change
> ifelse (pop[dim(pop)[1], ,1] == 1, rbinom(1,1,0.5), 0)
[1] 1 0 1 1 1
or alternatively it changes all values.
> ifelse (pop[dim(pop)[1], ,1] == 1, rbinom(1,1,0.5), 0)
[1] 0 0 0 0 0
I'm expecting some of the values in the array to be changed, but not "all or nothing". What am I doing wrong? Also if there is a simple elegant way to do the substitution back into the original 3d array I'd be grateful. Thx. J
I think I did find a solution using the modify_if function of the purrr package (piped with %<>% from magrittr); pliv1 here is my probability value.
pop[dim(pop)[1], ,1] %<>% modify_if(~ .x == 1, ~ rbinom(1,1,pliv1))
HTH, J
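For completeness: the all-or-nothing behaviour in the question comes from rbinom(1, 1, 0.5) producing a single draw, which ifelse then recycles over the whole vector. A base-R sketch that makes one draw per element (using the question's 0.5 probability) and writes the result back into the 3D array:
v <- pop[dim(pop)[1], , 1]
ones <- which(v == 1)                    # positions holding a 1
v[ones] <- rbinom(length(ones), 1, 0.5)  # one independent draw per position
pop[dim(pop)[1], , 1] <- v               # substitute back into the original array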
I am trying to select relevant rows from a large time-series data set. The tricky bit is that the needed rows lie before and after certain values in a column.
# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)
In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:
ones <- which(df$y == 1)
selection <- NULL
for (i in ones) {
  jj <- (i-2):(i+4)
  selection <- c(selection, jj)
}
df$selection <- 0
df$selection[selection] <- 1
This, arguably, scales poorly for more values. For df$y==2 I would have to repeat it with:
twos <- which(df$y == 2)
selection <- NULL
for (i in twos) {
  jj <- (i-2):(i+4)
  selection <- c(selection, jj)
}
df$selection[selection] <- 2
The ideal scenario would be a function doing something similar to this imaginary call: selector(data = df$y, values = c(1,2), before = 2, after = 5, afterafter = FALSE, beforebefore = FALSE), where values is fed the critical values, and before and after the number of rows to select before and after each hit.
Here, afterafter would allow going from a certain number of rows after the value up to a further number of rows after it, e.g. after = 5, afterafter = 10 (and beforebefore the same, going in the other direction).
Any tips and suggestions are very welcome!
Thanks!
This is easy enough with rep and its each argument.
df$selection[rep(which(df$y == 2), each=7L) + -2:4] <- 2
Here, rep repeats each row index that matches your criterion 7 times (two before, the value itself, and four after; the L indicates that the argument should be an integer). Adding the values -2 through 4 then yields the indices to select. Now, replace.
Note that for some comparisons, == will not be adequate due to numerical precision. See the SO post "Why are these numbers not equal?" for a detailed discussion of this topic. In these cases, you could use something like
which(abs(df$y - 2) < 0.001)
or whatever precision measure will work for your problem.
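For what it's worth, here is a minimal sketch of the imaginary selector() function (handling only values, before, and after, and dropping indices that fall outside the data; the name and interface are taken from the question, with after = 4 to match the (i-2):(i+4) window in the code above):
selector <- function(y, values, before, after) {
  sel <- integer(length(y))
  for (v in values) {
    idx <- rep(which(y == v), each = before + after + 1) + (-before):after
    sel[idx[idx >= 1 & idx <= length(y)]] <- v  # clip out-of-range indices
  }
  sel
}
df$selection <- selector(df$y, values = c(1, 2), before = 2, after = 4)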
I would like to do two things to my fairly large data set, about 10K x 50K. The following is a smaller set of 200 x 10000.
First I want to generate 5% missing values, which is perhaps simple and can be done with a simple trick:
# dummy data
set.seed(123)
# matrix of X variable
xmat <- matrix(sample(0:4, 2000000, replace = TRUE), ncol = 10000)
colnames(xmat) <- paste("M", 1:10000, sep = "")
rownames(xmat) <- paste("sample", 1:200, sep = "")
Generate missing values at 5% random places in the data.
N <- 2000000 * 0.05  # 5% random missing values
inds_miss <- round(runif(N, 1, length(xmat)))
xmat[inds_miss] <- NA
Now I would like to generate errors (meaning values different from what is in the above matrix). The above matrix has values of 0 to 4. So what I would like to do:
(1) I would like to replace a value x with another value that is not x (for example, 0 can be replaced by a random sample of what is not 0, i.e. 1, 2, 3 or 4; similarly, 1 can be replaced by what is not 1, i.e. 0, 2, 3 or 4). Indices where random values can be replaced can simply be generated with:
inds_err <- round(runif(N, 1, length(xmat)))
If I randomly sample the values 0:4 and replace at these indices, this will sometimes replace a value with the same value (0 with 0, 1 with 1, and so on) without creating an error.
errorg <- sample(0:4, length(inds_err), replace = TRUE)
xmat[inds_err] <- errorg
(2) I would like to introduce errors into the xmat that already has missing values; however, I do not want the NAs generated in the step above to be replaced with a value (0 to 4). So inds_err should not contain any member of the vector inds_miss.
So, in summary, the rules are:
(1) The missing values should not be replaced with error values.
(2) An existing value must be replaced with a different value (which is the definition of error here); with naive random sampling there is a 1/5 probability of drawing the same value again.
How can it be done? I need a fast solution that can be used on my large dataset.
You can try this:
inds_err <- setdiff(round(runif(2*N, 1, length(xmat))), inds_miss)[1:N]
xmat[inds_err] <- (xmat[inds_err] + sample(4, N, replace = TRUE)) %% 5
With the first line you generate 2*N candidate error indices, then you subtract the ones belonging to inds_miss and take the first N. With the second line you add a random number between 1 and 4 to each value you want to change and then take the result mod 5. In this way you are sure that the new value will be different from the original and still in the 0-4 range.
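One caveat: setdiff also removes duplicates, so an unlucky draw could leave fewer than N usable indices (NAs at the tail of inds_err). A quick sanity check of both rules, assuming you saved a copy xmat_orig <- xmat before applying the replacement line:
stopifnot(!anyNA(inds_err))                            # enough candidates survived setdiff
stopifnot(!anyNA(xmat[inds_err]))                      # rule 1: no NA was overwritten
stopifnot(all(xmat[inds_err] != xmat_orig[inds_err]))  # rule 2: every error value differs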
Here's an if/else solution that could work for you. It is a for loop, so I'm not sure whether that will be fast enough; possibly it can be vectorized in some way.
# vector of options
vec <- 0:4
# simple logic-based solution if you just don't want the NAs changed
for (i in inds_err) {  # loop over the error indices themselves
  if (is.na(xmat[i])) {
    next                                       # leave missing values alone
  } else {
    xmat[i] <- sample(vec[vec != xmat[i]], 1)  # draw one of the four other values
  }
}