Randomly remove some numeric data from a matrix in R?

I have a large data matrix with many numeric values (counts) in it. I would like to remove 10% of all counts. So, for example, a matrix which looks like this:
30 10
0 20
The sum of all counts here is 60. 10% of 60 is 6. So I want to randomly remove 6. A correct output could be:
29 6
0 19
(As you can see, it removed 1 from 30, 4 from 10, and 1 from 20.) There cannot be negative values.
How could I program this in R?
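A minimal sketch of one possible approach (added for illustration, not taken from the answers below, and assuming integer counts): expand the counts into individual events, drop 10% of them at random without replacement, and re-tabulate. Because only existing events are removed, no cell can become negative.
remove_fraction <- function(X, frac = 0.10) {
  events <- rep(seq_along(X), times = X)        # one entry per unit of count
  n_drop <- round(length(events) * frac)        # how many counts to remove
  keep   <- events[-sample(length(events), n_drop)]
  kept   <- tabulate(keep, nbins = length(X))   # re-count per cell
  matrix(kept, nrow = nrow(X), dimnames = dimnames(X))
}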

Here is a way. It subtracts 1 from positive matrix elements until the total to remove is reached.
subtract_int <- function(X, n){
  # indices of the non-zero cells
  inx <- which(X != 0, arr.ind = TRUE)
  N <- nrow(inx)
  while(n > 0){
    # pick a random non-zero cell and decrement it if still positive
    i <- sample(N, 1)
    if(X[ inx[i, , drop = FALSE] ] > 0){
      X[ inx[i, , drop = FALSE] ] <- X[ inx[i, , drop = FALSE] ] - 1
      n <- n - 1
    }
    # refresh the index list once a cell has reached zero
    if(any(X[inx] == 0)){
      inx <- which(X != 0, arr.ind = TRUE)
      N <- nrow(inx)
    }
  }
  X
}
set.seed(2021)
to_remove <- round(sum(A)*0.10)
subtract_int(A, to_remove)
# [,1] [,2]
#[1,] 30 6
#[2,] 0 18
Data
A <- structure(c(30, 0, 10, 20), .Dim = c(2L, 2L))

Maybe this helps you at least to get on the right track. It's nothing more than a draft though:
randomlyRemove <- function(matrix) {
  sum_mat <- sum(matrix)
  while (sum_mat > 0) {
    # shrink the remaining amount to subtract by a random share
    sum_mat <- sum_mat - runif(1, min = 0, max = sum_mat)
    # pick a random cell (integer row/column indices)
    x <- round(runif(1, 1, dim(matrix)[1]), digits = 0)
    y <- round(runif(1, 1, dim(matrix)[2]), digits = 0)
    matrix[x, y] <- matrix[x, y] - sum_mat
  }
  return(matrix)
}
You might want to play with the random number generation to get more evenly distributed subtractions.
Edit: added round(digits = 0) to get only integer (dimension) values and modified the random (dimension) value generation to start from 1 (not zero).
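One way to act on that suggestion (a sketch of an assumed refinement, not part of the original draft): remove one count at a time and weight each draw by the current cell values, so larger cells absorb proportionally more of the removal and no cell can drop below zero.
randomlyRemoveWeighted <- function(mat, frac = 0.10) {
  n <- round(sum(mat) * frac)
  for (k in seq_len(n)) {
    i <- sample(length(mat), 1, prob = mat)  # cells at 0 get zero probability
    mat[i] <- mat[i] - 1
  }
  mat
}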

I think we can make it work using sample. This solution is a lot more compact.
The data
A <- structure(c(30, 0, 11, 20), .Dim = c(2L, 2L))
sum(A)
#> [1] 61
The logic
UseThese <- (1:length(A))[A > 0]                        # indices eligible for subtraction (> 0)
Sample <- sample(UseThese, sum(A)*0.1, replace = TRUE)  # draw the indices to decrement
counts <- table(Sample)                                 # how many times each index was drawn
A[as.integer(names(counts))] <- A[as.integer(names(counts))] - as.vector(counts)
Check the result
A
#> [,1] [,2]
#> [1,] 28 8
#> [2,] 0 19
sum(A) # should be the value above minus 6
#> [1] 55
One disadvantage of this solution is that it could lead to negative values. So check with:
any(A < 0)
#> [1] FALSE
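If negatives do occur, one simple remedy (an added sketch, not part of the answer) is to redraw until the subtraction leaves every cell non-negative; for a 10% removal this typically succeeds after a few attempts.
subtract_nonneg <- function(A, n) {
  repeat {
    B <- A
    use <- which(B > 0)
    drawn <- table(sample(use, n, replace = TRUE))   # counts per drawn index
    B[as.integer(names(drawn))] <- B[as.integer(names(drawn))] - as.vector(drawn)
    if (!any(B < 0)) return(B)                       # accept only non-negative results
  }
}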

Related

R: Find set of columns which contain most 1s in matrix of 0 and 1

I have a matrix of 1s and 0s where the rows are individuals and the columns are events. A 1 indicates that an event happened to an individual and a 0 that it did not.
I want to find which set of (in the example) 5 columns/events covers the most rows/individuals.
Test Data
#Make test data
set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30
My attempt
My initial attempt was just based on combining the set of 5 columns with the highest colMeans:
#Get top 5 columns with highest row coverage
col_set <- head(sort(colMeans(d), decreasing = T), 5)
#Have a look the set
col_set
>
197 199 59 80 76
0.2666667 0.2666667 0.2333333 0.2333333 0.2000000
#Check row coverage of the column set
sum(apply(d[,colnames(d) %in% names(col_set)], 1, sum) > 0) / 30 #top 5
>
[1] 0.7
However, this set does not cover the most rows. I tested this by pseudo-randomly sampling 10,000 different sets of 5 columns, and then finding the set with the highest coverage:
#Get 5 random columns using colMeans as prob in sample
##Random sample 10,000 times
set.seed(123)
result <- lapply(1:10000, function(x){
  col_set2 <- sample(colMeans(d), 5, F, colMeans(d))
  cover <- sum(apply(d[,colnames(d) %in% names(col_set2)], 1, sum) > 0) / 30 #random 5
  list(set = col_set2, cover = cover)
})
##Have a look at the best set
result[which.max(sapply(result, function(x) x[["cover"]]))]
>
[[1]]
[[1]]$set
59 169 262 68 197
0.23333333 0.10000000 0.06666667 0.16666667 0.26666667
[[1]]$cover
[1] 0.7666667
The reason for supplying the colMeans to sample is that the columns with the highest coverages are the ones I am most interested in.
So, using pseudo-random sampling I can collect a set of columns with higher coverage than when just using the top 5 columns. However, since my actual data sets are larger than the example I am looking for a more efficient and rational way of finding the set of columns with the highest coverage.
EDIT
For the interested, I decided to microbenchmark the 3 solutions provided:
#Defining G. Grothendieck's coverage function outside his solutions
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
#G. Grothendieck top solution
solution1 <- function(d){
  cols <- tail(as.numeric(names(sort(colSums(d)))), 20)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}
#G. Grothendieck "Older solution"
solution2 <- function(d){
  require(lpSolve)
  ones <- rep(1, 300)
  res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
  m <- matrix(res$solution[1:3000] == 1, 300)
  cols <- which(rowSums(m) > 0)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}
#user2554330 solution
bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    result[i] <- which.max(colMeans(d))
    d <- d[d[,result[i]] != 1,, drop = FALSE]
  }
  result
}
#Benchmarking...
microbenchmark::microbenchmark(solution1 = solution1(d),
                               solution2 = solution2(d),
                               solution3 = bestCols(d), times = 10)
>
Unit: microseconds
expr min lq mean median uq max neval
solution1 390811.850 497155.887 549314.385 578686.3475 607291.286 651093.16 10
solution2 55252.890 71492.781 84613.301 84811.7210 93916.544 117451.35 10
solution3 425.922 517.843 3087.758 589.3145 641.551 25742.11 10
This looks like a relatively hard optimization problem, because of the ways columns interact. An approximate strategy would be to pick the column with the highest mean; then delete the rows with ones in that column, and repeat. You won't necessarily find the best solution this way, but you should get a fairly good one.
For example,
set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30
bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    result[i] <- which.max(colMeans(d))
    d <- d[d[,result[i]] != 1,, drop = FALSE]
  }
  cat("final dim is ", dim(d))
  result
}
col_set <- bestCols(d)
sum(apply(d[,colnames(d) %in% col_set], 1, sum) > 0) / 30 #top 5
This gives 90% coverage.
The following provides a heuristic to find an approximate solution. Find the N = 20 columns, say, with the most ones, cols, and then use brute force to evaluate every subset of 5 columns out of those 20. The subset with the highest coverage is shown below; its coverage is 93.3%.
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
N <- 20
cols <- tail(as.numeric(names(sort(colSums(d)))), N)
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1] 90 123 197 199 286
coverage(co[, itop])
## [1] 0.9333333
Repeating this for N = 5, 10, 15 and 20 we get coverages of 83.3%, 86.7%, 90% and 93.3%. The higher the N, the better the coverage, but also the longer the run time.
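A small wrapper (added here for illustration; coverage() and d are as defined above) makes that trade-off easy to explore:
coverage_for_N <- function(N) {
  cols <- tail(as.numeric(names(sort(colSums(d)))), N)
  co <- combn(cols, 5)
  max(apply(co, 2, coverage))
}
sapply(c(5, 10, 15, 20), coverage_for_N)  # should reproduce the coverages quoted above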
Older solution
We can approximate the problem with a knapsack problem that chooses the 5 columns with the largest numbers of ones, using integer linear programming.
We get the 10 best solutions to this approximate problem and take all columns which appear in at least one of them. There are 14 such columns, and we then use brute force to find which subset of 5 of those 14 columns has the highest coverage.
library(lpSolve)
ones <- rep(1, 300)
res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
# each column of m is logical 300-vector defining possible soln
m <- matrix(res$solution[1:3000] == 1, 300)
# cols is the set of columns which are in any of the 10 solutions
cols <- which(rowSums(m) > 0)
length(cols)
## [1] 14
# use brute force to find the 5 best columns among cols
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1] 90 123 197 199 286
coverage(co[, itop])
## [1] 0.9333333
You can iteratively test whether there is a better column and exchange it with one currently in the selection.
n <- 5 #Number of columns / events
i <- rep(1, n)
for(k in 1:10) { #How many times to iterate
  tt <- i
  for(j in seq_along(i)) {
    x <- +(rowSums(d[,i[-j]]) > 0)
    i[j] <- which.max(colSums(x == 0 & d == 1))
  }
  if(identical(tt, i)) break
}
sort(i)
#[1] 90 123 197 199 286
mean(rowSums(d[,i]) > 0)
#[1] 0.9333333
Since the initial condition influences the result, you can also use random starts.
n <- 5 #Number of columns / events
x <- apply(d, 2, function(x) colSums(x == 0 & d == 1))
diag(x) <- -1
idx <- which(!apply(x==0, 1, any))
x <- apply(d, 2, function(x) colSums(x != d))
diag(x) <- -1
x[upper.tri(x)] <- -1
idx <- unname(c(idx, which(apply(x==0, 1, any))))
res <- sample(idx, n)
for(l in 1:100) {
  i <- sample(idx, n)
  for(k in 1:10) { #How many times to iterate
    tt <- i
    for(j in seq_along(i)) {
      x <- +(rowSums(d[,i[-j]]) > 0)
      i[j] <- which.max(colSums(x == 0 & d == 1))
    }
    if(identical(tt, i)) break
  }
  if(sum(rowSums(d[,i]) > 0) > sum(rowSums(d[,res]) > 0)) res <- i
}
sort(res)
#[1] 90 123 197 199 286
mean(rowSums(d[,res]) > 0)
#[1] 0.9333333

Creating multiple pseudo-random matrices at the same time in R, and comparing the sign at each position?

I can make one pseudo-random matrix with the following:
nc <- 14
nr <- 14
set.seed(111)
M <- matrix(sample(c(runif(58, min = -1, max = 0),
                     runif(71, min = 0, max = 0),
                     runif(nr*nc - 129, min = 0, max = +1))),
            nrow = nr, nc = nc)
The more important question: I need 1000 matrices with the same numbers of negative, positive and zero values; only their locations within the matrices should vary.
I can make matrices one by one, but I want to do this task faster.
The less important question: Once I have the 1000 matrices, I need to determine, for every position in the matrices, how many positive, negative or zero values occurred there. For example:
MATRIX_A
     [,1]
[9,] -0.2
MATRIX_B
     [,1]
[9,] -0.5
MATRIX_C
     [,1]
[9,]  0.1
MATRIX_D
     [,1]
[9,]  0.0
MATRIX_E
     [,1]
[9,]  0.9
What I need:
FINAL_MATRIX_positive
     [,1]
[9,] (2/5*100) = 40%, or 0.4, or 2
because out of the 5 matrices, 2 had a positive value at this position. I also need the same for negative and zero values.
If it isn't possible to do this in R, I can compare them "manually" in Excel.
Thank you for your help!
Actually you are almost there!
You can try the code below, where replicate repeats the random matrix generation 1000 times, and Reduce collects the statistics for each position:
nc <- 14
nr <- 14
N <- 1000
lst <- replicate(
  N,
  matrix(sample(
    c(
      runif(58, min = -1, max = 0),
      runif(71, min = 0, max = 0),
      runif(nr * nc - 129, min = 0, max = +1)
    )
  ), nrow = nr, nc = nc),
  simplify = FALSE
)
pos <- Reduce(`+`, lapply(lst, function(M) M > 0)) / N
neg <- Reduce(`+`, lapply(lst, function(M) M < 0)) / N
zero <- Reduce(`+`, lapply(lst, function(M) M == 0)) / N
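As a quick sanity check (an added illustration, not part of the answer), the three proportions must sum to 1 at every position, and any single position, e.g. row 9 of column 1 from the question, can be inspected directly:
stopifnot(all(abs(pos + neg + zero - 1) < 1e-12))  # proportions sum to 1 everywhere
pos[9, 1]                                          # share of positive values at that position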
I use a function for your simulation scheme:
my_sim <- function(n_neg = 58, n_0 = 71, n_pos = 67){
  res <- c(runif(n_neg, min = -1, max = 0),
           rep(0, n_0),
           runif(n_pos, min = 0, max = +1))
  return(sample(res))
}
Then, I simulate your matrices (I store them in a list):
N <- 1000
nr <- 14
nc <- nr
set.seed(111)
my_matrices <- list()
for(i in 1:N){
  my_matrices[[i]] <- matrix(my_sim(), nrow = nr, ncol = nc)
}
Finally, I compute the proportion of positive numbers for the position row 1 and column 9:
sum(sapply(my_matrices, function(x) x[1,9]) > 0)/N
# [1] 0.366
However, if you are interested in all the positions, these lines will do the job:
aux <- lapply(my_matrices, function(x) x > 0)
FINAL_MATRIX_positive <- 0
for(i in 1:N){
  FINAL_MATRIX_positive <- FINAL_MATRIX_positive + aux[[i]]
}
FINAL_MATRIX_positive <- FINAL_MATRIX_positive/N
# row 1, column 9
FINAL_MATRIX_positive[1, 9]
# [1] 0.366

Choose closest x elements by index in a list/vector

If I have a vector such as x <- c(1,2,3,4,5,6,7,8,9), I want a function f(vector, index, num)
that takes the vector and gives me the num "closest" elements to the element at that index.
Examples:
f(x,3,4) = c(1,2,4,5)
f(x,1,5) = c(2,3,4,5,6)
f(x,8,3) = c(6,7,9)
Since with an odd num we also need to choose whether the extra element comes from the left or the right side, let's go with the left side (but the right side is OK too),
i.e. f(x,4,5) = c(1,2,3,5,6) and f(x,7,3) = c(5,6,8)
I hope my question is clear, thank you for any help/responses!
Edit: The original vector c(1:9) is arbitrary; the vector could be a vector of strings, or a vector of length 1000 with shuffled numbers and repeats, etc.
i.e c(1,7,4,2,3,7,2,6,234,56,8)
num_closest_by_indices <- function(v, idx, num) {
  # Try the base case, where idx is not within (num/2) of the edge
  i <- abs(seq_along(v) - idx)
  i[idx] <- +Inf # sentinel so we never return the element itself
  # If there are not enough elements in the base case, incrementally add more
  for (cutoff_idx in seq(floor(num/2), num)) {
    if (sum(i <= cutoff_idx) >= num) {
      # This will add two extra indices every iteration. Strictly, if we have an
      # even length, we should add the leftmost one first and `continue`, to
      # break ties towards the left.
      return(v[i <= cutoff_idx])
    }
  }
}
Here's an illustration of this algorithm: we rank the indices in order of desirability, then pick the lowest num legal ones:
> seq_along(x)
1 2 3 4 5 6 7 8 9
> seq_along(x) - idx
-2 -1 0 1 2 3 4 5 6
> i <- abs(seq_along(x) - idx)
2 1 0 1 2 3 4 5 6
> i[idx] <- +Inf # sentinel to prevent us returning the element itself
2 1 Inf 1 2 3 4 5 6
Now we can just find num elements with smallest values (break ties arbitrarily, unless you have a preference (left)).
Our first guess is all indices <= (num/2) ; this might not be enough if index is within (num/2) of the start/end.
> i <= 2
TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
> v[i <= 2]
1 2 4 5
So, adapting @dash2's code to handle the corner cases where some indices are illegal (nonpositive, or > length(x)), i.e. not %in% 1:length(x): then min(elems) would be the number of illegal indices which we cannot pick, hence we must pick abs(min(elems)) more.
Notes:
in the end the code is simpler and faster if we handle it with three piecewise cases. Aww.
it actually seems to simplify things if we pick (num+1) indices, then remove idx before returning the answer. Using result[-idx] to remove it.
Like so:
f <- function (vec, elem, n) {
  elems <- seq(elem - ceiling(n/2), elem + floor(n/2))
  if (max(elems) > length(vec)) elems <- elems - (max(elems) - length(vec))
  if (elems[1] < 1) elems <- elems + (1 - elems[1])
  elems <- setdiff(elems, elem)
  vec[elems]
}
Giving results:
> f(1:9, 1, 5)
[1] 2 3 4 5 6
> f(1:9, 9, 5)
[1] 4 5 6 7 8
> f(1:9, 2, 5)
[1] 1 3 4 5 6
> f(1:9, 4, 5)
[1] 1 2 3 5 6
> f(1:9, 4, 4)
[1] 2 3 5 6
> f(1:9, 2, 4)
[1] 1 3 4 5
> f(1:9, 1, 4)
[1] 2 3 4 5
> f(1:9, 9, 4)
[1] 5 6 7 8
Start the function with the variable argument x first, and the reference table and n after:
.nearest_n <- function(x, table, n) {
The algorithm assumes that table is numeric, without any duplicates, and all values finite; n has to be less than or equal to the length of the table.
  ## assert & setup
  stopifnot(
    is.numeric(table), !anyDuplicated(table), all(is.finite(table)),
    n <= length(table)
  )
Sort the table and then 'clamp' the maximum and minimum values.
  ## sort and clamp
  table <- c(-Inf, sort(table), Inf)
  len <- length(table)
Find the interval in table where x occurs; findInterval() uses an efficient search. Use the interval index as the initial lower index, and add 1 for the upper index, making sure to stay in-bounds.
  ## where to start?
  lower <- findInterval(x, table)
  upper <- min(lower + 1L, len)
Find the nearest n neighbors by comparing the lower and upper index distance to x, record the nearest value, and increment the lower or upper index as appropriate, making sure to stay in-bounds.
  ## find
  nearest <- numeric(n)
  for (i in seq_len(n)) {
    if (abs(x - table[lower]) < abs(x - table[upper])) {
      nearest[i] <- table[lower]
      lower <- max(1L, lower - 1L)
    } else {
      nearest[i] <- table[upper]
      upper <- min(len, upper + 1L)
    }
  }
Then return the solution and finish the function:
  nearest
}
The code might seem verbose, but is actually relatively efficient because the only operations on the entire vector (sort(), findInterval()) are implemented efficiently in R.
A particular advantage of this approach is that it can be vectorized in its first argument, calculating the test for using lower (use_lower = ...) as a vector and using pmin() / pmax() as clamps.
.nearest_n <- function(x, table, n) {
  ## assert & setup
  stopifnot(
    is.numeric(table), !anyDuplicated(table), all(is.finite(table)),
    n <= length(table)
  )
  ## sort and clamp
  table <- c(-Inf, sort(table), Inf)
  len <- length(table)
  ## where to start?
  lower <- findInterval(x, table)
  upper <- pmin(lower + 1L, len)
  ## find
  nearest <- matrix(0, nrow = length(x), ncol = n)
  for (i in seq_len(n)) {
    use_lower <- abs(x - table[lower]) < abs(x - table[upper])
    nearest[,i] <- ifelse(use_lower, table[lower], table[upper])
    lower[use_lower] <- pmax(1L, lower[use_lower] - 1L)
    upper[!use_lower] <- pmin(len, upper[!use_lower] + 1L)
  }
  # return
  nearest
}
For instance
> set.seed(123)
> table <- sample(100, 10)
> sort(table)
[1] 5 29 41 42 50 51 79 83 86 91
> .nearest_n(c(30, 20), table, 4)
[,1] [,2] [,3] [,4]
[1,] 29 41 42 50
[2,] 29 5 41 42
Generalize this by taking any argument and coercing it to the required form using a reference look-up table, table0, and the indices into it, table1:
nearest_n <- function(x, table, n) {
  ## coerce to common form
  table0 <- sort(unique(c(x, table)))
  x <- match(x, table0)
  table1 <- match(table, table0)
  ## find nearest
  m <- .nearest_n(x, table1, n)
  ## result in original form
  matrix(table0[m], nrow = nrow(m))
}
As an example...
> set.seed(123)
> table <- sample(c(letters, LETTERS), 30)
> nearest_n(c("M", "Z"), table, 5)
[,1] [,2] [,3] [,4] [,5]
[1,] "o" "L" "O" "l" "P"
[2,] "Z" "z" "Y" "y" "w"

Binary coding of pairwise comparisons

I'm working on a questionnaire where there are always three statements presented at a time and participants have to rank order these according to their preferences (3 = most preferred, 1 = least preferred).
For further analyses I have to transform these rankings into pairwise comparisons within each block of three. Below is a code doing this for the first six items (2 blocks) of the questionnaire.
data <- matrix(c(1,2,3,1,2,3,2,1,3,3,1,2),2,6)
i1i2 <- ifelse(data[,1] > data[,2], 1, 0)
i1i3 <- ifelse(data[,1] > data[,3], 1, 0)
i2i3 <- ifelse(data[,2] > data[,3], 1, 0)
i4i5 <- ifelse(data[,4] > data[,5], 1, 0)
i4i6 <- ifelse(data[,4] > data[,6], 1, 0)
i5i6 <- ifelse(data[,5] > data[,6], 1, 0)
result <- cbind(i1i2, i1i3, i2i3, i4i5, i4i6, i5i6)
print(result)
I extended this code to fit a 45 item questionnaire and it works fine. Now, I'd like to write a function which automatically does this job for n items. I experimented with while and for loops but couldn't succeed.
Can anyone please give me a hint/ reference to the relevant functions I need/ an example on how to do this?
Related: Brown, A., & Maydeu-Olivares, A. (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71(3), 460–502.
First off, remove the ifelse calls from the comparisons and apply a single one at the end instead:
i1i2 <- data[,1] > data[,2]
i1i3 <- data[,1] > data[,3]
i2i3 <- data[,2] > data[,3]
…
result <- ifelse(cbind(i1i2, i1i3, i2i3, i4i5, i4i6, i5i6), 1, 0)
Next, avoid unnecessary repetition.
three_way_compare = function (data, index) {
  cbind(data[, index + 0] > data[, index + 1],
        data[, index + 0] > data[, index + 2],
        data[, index + 1] > data[, index + 2])
}
result = ifelse(do.call(cbind, lapply(seq(1, ncol(data), by = 3),
                                      three_way_compare, data = data)), 1, 0)
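As a quick check (added illustration) on the 2 x 6 example from the question: seq(1, ncol(data), by = 3) gives the block starts 1 and 4, so the result has one column per pairwise comparison.
data <- matrix(c(1,2,3,1,2,3,2,1,3,3,1,2), 2, 6)
result <- ifelse(do.call(cbind, lapply(seq(1, ncol(data), by = 3),
                                       three_way_compare, data = data)), 1, 0)
dim(result)  # 2 rows, 6 comparison columns (3 per block of three items)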
While there are probably more efficient alternatives, you could convert your matrix to a list of vectors of length 3 and apply the ifelse statements to them through a function.
Update:
If you have multiple rows in your matrix, you need to use t(data) inside split() to get the correct values.
# Put data in lists of 3
blocks <- split(t(data), ceiling(seq_along(data)/3))
# Define function
comparison <- function(x) {
  i1 <- ifelse(x[1] > x[2], 1, 0)
  i2 <- ifelse(x[1] > x[3], 1, 0)
  i3 <- ifelse(x[2] > x[3], 1, 0)
  return(cbind(i1, i2, i3))
}
# Apply function to list
lapply(blocks,comparison)
# $`1`
# i1 i2 i3
# [1,] 0 0 1
#
# $`2`
# i1 i2 i3
# [1,] 0 1 1
#
# $`3`
# i1 i2 i3
# [1,] 1 0 0
#
# $`4`
# i1 i2 i3
# [1,] 0 0 1
# Or unlist to get vector
unlist(lapply(blocks,comparison))
# 11 12 13 21 22 23 31 32 33 41 42 43
# 0 0 1 0 1 1 1 0 0 0 0 1

R: recode previous/following n observations

I have a dataframe of 0/1 dummy variables. Each dummy variable takes the value 1 only once. For each column, I want to set the n observations preceding/following the observation with the value 1 to a particular value (say 1).
So for a single vector, with n = 1:
c(0, 0, 1, 0, 0)
I would want to get
c(0, 1, 1, 1, 0)
What would be a good general approach with n columns, allowing for a different number of preceding/following observations to replace (e.g. n-1 before & n after)?
Thanks for help!
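For orientation, here is a hedged sketch of one general shape such a function could take (the function and argument names are made up; the answers below give concrete solutions). It applies a per-column window around the single 1, allowing different numbers of preceding and following observations per column:
recode_window <- function(df, pre, post) {
  as.data.frame(Map(function(x, p, q) {
    i <- which(x == 1)                            # position of the single 1
    x[max(1, i - p):min(length(x), i + q)] <- 1   # clamp the window to the vector bounds
    x
  }, df, pre, post))                              # pre/post: one entry per column
}
# e.g. recode_window(df, pre = c(1, 2), post = c(1, 0))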
x <- c(0, 0, 1, 0, 0)
ind <- which(x == 1)
x[(ind - 1):(ind + 1)] <- 1
Another option:
f <- function(x, pre, post) {
  idx <- which.max(x)
  x[max(1, (idx - pre)):min(length(x), (idx + post))] <- 1
  x
}
Sample data:
df <- data.frame(x = c(0, 0, 1, 0, 0), y = c(0, 1, 0, 0, 0))
Application:
df[] <- lapply(df, f, pre=2, post=1)
#df
# x y
#1 1 1
#2 1 1
#3 1 1
#4 1 0
#5 0 0
What you can do is the following:
vec <- c(0, 0, 1, 0, 0)
sapply(1:length(vec), function(i) {
  minval <- max(0, i - 1)
  maxval <- min(i + 1, length(vec))
  return(sum(vec[minval:maxval]))
})
# [1] 0 1 1 1 0
Or to put it in a function (same code but a bit more compact)
f <- function(vec){
  sapply(1:length(vec), function(i)
    sum(vec[max(0, i-1):min(i+1, length(vec))]))
}
f(vec)
# [1] 0 1 1 1 0
Speedtest
To compare the different solutions, I quickly ran a benchmark using microbenchmark, and the winner is clearly @Shenglin's code. Always nice to see simple solutions (as well as to see how complicated some of my own solutions can be).
fDavid <- function(vec){
  sapply(1:length(vec), function(i)
    sum(vec[max(0, i-1):min(i+1, length(vec))]))
}
fHeroka <- function(vec){
  res <- vec
  test <- which(vec==1)
  #create indices to be replaced
  n=1 #variable n
  replace_indices <- c(test+(1:n),test-(1:n))
  #filter out negatives (may happen with larger n)
  replace_indices <- replace_indices[replace_indices>0]
  #replace items in 'res' that need to be replaced with 1
  res[replace_indices] <- 1
}
fShenglin <- function(vec){
  ind <- which(vec==1)
  vec[(ind-1):(ind+x)] <- 1
}
vect <- sample(0:1, size = 1000, replace = T)
library(microbenchmark)
microbenchmark(fHeroka(vect), fDavid(vect), fShenglin)
# # Unit: nanoseconds
# expr min lq mean median uq max
# fHeroka(vect) 38929 42999 54422.57 49546 61755.5 145451
# fDavid(vect) 2463805 2577935 2875024.99 2696844 2849548.5 5994596
# fShenglin 0 0 138.63 1 355.0 1063
# neval cld
# 100 a
# 100 b
# 100 a
# Warning message:
# In microbenchmark(fHeroka(vect), fDavid(vect), fShenglin) :
# Could not measure a positive execution time for 30 evaluations.
This might be a start:
myv <- c(0, 0, 1, 0, 0)
#make a copy
res <- myv
#check where the ones are
test <- which(myv==1)
#create indices to be replaced
n=1 #variable n
replace_indices <- c(test+(1:n),test-(1:n))
#filter out negatives (may happen with larger n)
replace_indices <- replace_indices[replace_indices>0]
#replace items in 'res' that need to be replaced with 1
res[replace_indices] <- 1
res
> res
[1] 0 1 1 1 0
This could be a solution:
dat <- data.frame(x = c(0,0,1,0,0,0), y = c(0,0,0,1,0,0), z = c(0,1,0,0,0,0))
which_to_change <- data.frame(prev = c(2,2,1), foll = c(1,1,3))
for(i in 1:nrow(which_to_change)){
  dat[(which(dat[,i]==1) - which_to_change[i,1]):(which(dat[,i]==1) + which_to_change[i,2]), i] <- 1
}
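For reference (an added illustration), running the loop above on this sample dat gives:
dat
#   x y z
# 1 1 0 1
# 2 1 1 1
# 3 1 1 1
# 4 1 1 1
# 5 0 1 1
# 6 0 0 0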
