I can't nest an apply function to repeat it over many other combinations of columns.
I need to get the percentage of sp1==1 & s1==1 and of sp2==1 & s1==1 with respect to s1, and in the same way with respect to s2, s3, ... s1000. Here is a short example:
x <- data.frame("sp1"=rep(0:1, times=5),
"sp2"=rep(0:1, each=5),
"s1" = rep(0:1, times=10),
"s2" = rep(0:1, each=2),
"s3" = rep(1:0, each=2))
> x
sp1 sp2 s1 s2 s3
1 0 0 0 0 1
2 1 0 1 0 1
3 0 0 0 1 0
4 1 0 1 1 0
5 0 0 0 0 1
6 1 1 1 0 1
7 0 1 0 1 0
8 1 1 1 1 0
9 0 1 0 0 1
10 1 1 1 0 1
11 0 0 0 1 0
12 1 0 1 1 0
13 0 0 0 0 1
14 1 0 1 0 1
15 0 0 0 1 0
16 1 1 1 1 0
17 0 1 0 0 1
18 1 1 1 0 1
19 0 1 0 1 0
20 1 1 1 1 0
Now I wrote a function to calculate the percentage with respect to s1:
r <- as.data.frame(sapply(x[,1:2],
function(i) sum(i ==1 & x$s1 == 1)/sum(i ==1)))
> r
sapply(x[, 1:2], function(i) sum(i == 1 & x$s1 == 1)/sum(i == 1))
sp1 1.0
sp2 0.6
I want to build a data frame with all percentages of sp1, sp2, sp3, ... sp200 with respect to s1, s2, s3, ... s1000:
> r
s1 s2 s3 ... s1000
sp1 1.0 0.5 0.5
sp2 0.6 0.5 0.5
...
sp200
I've tried to write a function with both groups of variables, one for the sp's and another for the s's:
intento <- as.data.frame(sapply(i=x[,1:2],
j=x[,3:5],
function(i,j)sum(i ==1 & j == 1)/sum(i ==1)))
But logically that's not the way:
Error in match.fun(FUN) : argument "FUN" is missing, with no default
We can separate the columns based on their names and use nested sapply calls on them:
sp_cols <- grep("^sp", names(x))
s_cols <- grep("^s\\d+", names(x))
sapply(x[sp_cols], function(i) sapply(x[s_cols],
function(j) sum(i == 1 & j == 1)/sum(i == 1)))
If the columns contain only 1's and 0's, this can be reduced to:
sapply(x[s_cols], function(i) sapply(x[sp_cols], function(j) sum(i & j)/sum(j)))
# s1 s2 s3
#sp1 1.0 0.5 0.5
#sp2 0.6 0.5 0.5
You're looking for outer. Your function just needs to be Vectorized.
FUN <- Vectorize(function(i,j) sum(x[i] == 1 & x[j] == 1)/sum(x[i] == 1))
outer(1:2, 3:5, FUN)
# [,1] [,2] [,3]
# [1,] 1.0 0.5 0.5
# [2,] 0.6 0.5 0.5
You could refine this using grep to find the columns automatically
outer(grep("sp", names(x)), grep("s\\d+", names(x)), FUN)
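A small follow-up (my addition, not part of the original answer): outer() returns an unnamed matrix, so you may want to attach the column names for readability:
sp_cols <- grep("sp", names(x))
s_cols <- grep("s\\d+", names(x))
res <- outer(sp_cols, s_cols, FUN)
dimnames(res) <- list(names(x)[sp_cols], names(x)[s_cols])
res
#      s1  s2  s3
# sp1 1.0 0.5 0.5
# sp2 0.6 0.5 0.5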
A similar approach is to use lapply(x, function(x) which(x == 1)) and then reuse that result down the road. The thought process is that we might as well store the information instead of repeatedly recomputing it.
# column selection as suggested by @Ronak
sp_cols <- grep("^sp", names(x))
s_cols <- grep("^s\\d+", names(x))
x_l_zero <- lapply(x, function(x) which(x == 1))
sapply(x_l_zero[s_cols]
, function(x) sapply(x_l_zero[sp_cols]
, function(y) length(intersect(x,y))/length(y)))
s1 s2 s3
sp1 1.0 0.5 0.5
sp2 0.6 0.5 0.5
@Ronak has the fastest solution, which is more or less the OP's own code with the issue addressed.
Unit: microseconds
expr min lq mean median uq max neval
jay.sf_outer_FUN 1190.8 1240.85 1360.103 1284.50 1337.30 2627.0 100
cole_which_apply 268.4 289.00 454.609 306.05 322.00 7610.7 100
ronak_1_unsimple 181.3 193.95 321.863 209.95 233.40 6227.4 100
ronak_2_simple 228.5 241.25 342.354 250.65 276.05 7478.4 100
akrun_dplyr 5218.7 5506.05 6108.997 5721.80 6081.65 25147.3 100
Code for performance:
library(microbenchmark)
library(tidyverse)
##data set
x <- data.frame("sp1"=rep(0:1, times=5),
"sp2"=rep(0:1, each=5),
"s1" = rep(0:1, times=10),
"s2" = rep(0:1, each=2),
"s3" = rep(1:0, each=2))
#for jay.sf
FUN <- Vectorize(function(i,j) sum(x[i] == 1 & x[j] == 1)/sum(x[i] == 1))
#names of columns
sp_cols <- grep("^sp", names(x))
s_cols <- grep("^s\\d+", names(x))
sp_cols_nam <- grep("^sp", names(x), value = TRUE)
s_cols_nam <- grep("^s\\d+", names(x), value = TRUE)
#benchmark
microbenchmark(
jay.sf_outer_FUN = {
outer(sp_cols, s_cols, FUN)
}
, cole_which_apply = {
x_l_zero <- lapply(x, function(x) which(x == 1))
sapply(x_l_zero[s_cols], function(x) sapply(x_l_zero[sp_cols] , function(y) length(intersect(x,y))/length(y)))
}
, ronak_1_unsimple = {
sapply(x[sp_cols], function(i) sapply(x[s_cols],
function(j) sum(i == 1 & j == 1)/sum(i == 1)))
}
, ronak_2_simple = {
sapply(x[s_cols], function(i) sapply(x[sp_cols], function(j) sum(i & j)/sum(j)))
}
, akrun_dplyr = {
crossing(nm1 = sp_cols_nam,
nm2 = s_cols_nam) %>%
mutate(val = pmap_dbl(., ~ sum(x[..1] ==1 & x[..2] == 1)/sum(x[..1]))) %>%
spread(nm2, val)
}
)
Here is an option with the tidyverse:
library(tidyverse)
crossing(nm1 = names(x)[startsWith(names(x), "sp")],
nm2 = grep("^s\\d+", names(x), value = TRUE)) %>%
mutate(val = pmap_dbl(., ~ sum(x[..1] ==1 & x[..2] == 1)/sum(x[..1]))) %>%
spread(nm2, val)
# A tibble: 2 x 4
# nm1 s1 s2 s3
# <chr> <dbl> <dbl> <dbl>
#1 sp1 1 0.5 0.5
#2 sp2 0.6 0.5 0.5
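A side note (my addition): in current tidyr, spread() is superseded by pivot_wider(), so a roughly equivalent formulation would be:
library(tidyverse)
crossing(nm1 = names(x)[startsWith(names(x), "sp")],
         nm2 = grep("^s\\d+", names(x), value = TRUE)) %>%
  mutate(val = pmap_dbl(., ~ sum(x[..1] == 1 & x[..2] == 1)/sum(x[..1]))) %>%
  pivot_wider(names_from = nm2, values_from = val)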
Related
Consider a vector:
int = c(1, 1, 0, 5, 2, 0, 0, 2)
I'd like to get, for each element, the distance (in positions) to the closest subsequent occurrence of a specified value. The first argument of the function should be the vector, and the second should be the value whose closest subsequent occurrence one wants to find.
For instance,
f(int, 0)
# [1] 2 1 0 2 1 0 0 NA
Here, the first element of the vector (1) is two positions away from the first subsequent 0 (3 - 1 = 2), so it should return 2. The second element is 1 position away from a 0 (3 - 2 = 1). When there are no subsequent values that match the specified value, return NA (here that is the case for the last element, because no subsequent value is 0).
Other examples:
f(int, 1)
# [1] 0 0 NA NA NA NA NA NA
f(int, 2)
# [1] 4 3 2 1 0 2 1 0
f(int, 3)
# [1] NA NA NA NA NA NA NA NA
This should also work for character vectors:
char = c("A", "B", "C", "A", "A")
f(char, "A")
# [1] 0 2 1 0 0
Find the location of each value (numeric or character)
int = c(1, 1, 0, 5, 2, 0, 0, 2)
value = 0
idx = which(int == value)
## [1] 3 6 7
Expand the index to indicate the nearest value of interest, using an NA after the last value in int.
nearest = rep(NA, length(int))
nearest[1:max(idx)] = rep(idx, diff(c(0, idx)))
## [1] 3 3 3 6 6 6 7 NA
Use simple arithmetic to find the difference between the index of the current value and the index of the nearest value
abs(seq_along(int) - nearest)
## [1] 2 1 0 2 1 0 0 NA
Written as a function
f <- function(x, value) {
idx = which(x == value)
nearest = rep(NA, length(x))
if (length(idx)) # non-NA values only if `value` in `x`
nearest[1:max(idx)] = rep(idx, diff(c(0, idx)))
abs(seq_along(x) - nearest)
}
We have
> f(int, 0)
[1] 2 1 0 2 1 0 0 NA
> f(int, 1)
[1] 0 0 NA NA NA NA NA NA
> f(int, 2)
[1] 4 3 2 1 0 2 1 0
> f(char, "A")
[1] 0 2 1 0 0
> f(char, "B")
[1] 1 0 NA NA NA
> f(char, "C")
[1] 2 1 0 NA NA
The solution doesn't involve recursion or R-level loops, so it should be fast even for long vectors.
Look for matches from the nth position to the end of the vector, then take the first one:
f <- function(v, x){
sapply(seq_along(v), function(i){
which(v[ i:length(v) ] == x)[ 1 ] - 1
})
}
f(int, 0)
# [1] 2 1 0 2 1 0 0 NA
f(int, 1)
# [1] 0 0 NA NA NA NA NA NA
f(int, 2)
# [1] 4 3 2 1 0 2 1 0
f(int, 3)
# [1] NA NA NA NA NA NA NA NA
f(char, "A")
# [1] 0 2 1 0 0
Using sequence:
f <- function(v, x){
d = diff(c(0, which(v == x)))
vec <- sequence(d, d-1, by = -1)
length(vec) <- length(v)
vec
}
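To see how the sequence() call builds the countdown runs, here is a small trace (my own illustration; note that the from and by arguments of sequence() require R >= 4.0.0):
int = c(1, 1, 0, 5, 2, 0, 0, 2)
which(int == 0)                    # positions of the matches: 3 6 7
d <- diff(c(0, which(int == 0)))   # run lengths up to each match: 3 3 1
sequence(d, d - 1, by = -1)        # countdowns: 2 1 0 2 1 0 0
# length(vec) <- length(v) then pads the tail with NA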
Output
int = c(1, 1, 0, 5, 2, 0, 0, 2)
char = c("A", "B", "C", "A", "A")
f(int, 0)
# [1] 2 1 0 2 1 0 0 NA
f(int, 1)
# [1] 0 0 NA NA NA NA NA NA
f(int, 2)
# [1] 4 3 2 1 0 2 1 0
f(char, "A")
# [1] 0 2 1 0 0
Benchmark (n = 1000):
set.seed(123)
int = sample(0:100, size = 1000, replace = TRUE)
library(microbenchmark)
library(ggplot2) # needed for autoplot()
bm <- microbenchmark(
fSequence(int, 0),
fzx8754(int, 0),
fRecursive(int, 0),
fMartinMorgan(int, 0),
fMap2dbl(int, 0),
fReduce(int, 0),
fAve(int, 0),
fjblood94(int, 0),
times = 10L,
setup = gc(FALSE)
)
autoplot(bm)
Martin Morgan's solution seems to be the quickest, followed by this answer's sequence solution, sbarbit's recursive solution, and jblood94's for loop solution.
Functions used:
fSequence <- function(v, x){
vec <- sequence(diff(c(0, which(v == x))), diff(c(0, which(v == x))) - 1, by = -1)
length(vec) <- length(v)
vec
}
fzx8754 <- function(v, x){
sapply(seq_along(v), function(i){
which(v[ i:length(v) ] == x)[ 1 ] - 1
})
}
fRecursive <- function(lookup,val ) {
ind <- which(lookup == val)[1] -1
if (length(lookup) > 1) {
c(ind, fRecursive(lookup[-1], val))
} else {
ind
}
}
fMartinMorgan <- function(x, value) {
idx = which(x == value)
nearest = rep(NA, length(x))
nearest[1:max(idx)] = rep(idx, diff(c(0, idx)))
abs(seq_along(x) - nearest)
}
library(purrr) # fMap2dbl() uses map2_dbl()
fMap2dbl <- function(int, num)
{
n <- length(int)
map2_dbl(num, 1:n, ~ ifelse(length(which(.x == int[.y:n])) == 0, NA,
min(which(.x == int[.y:n])) - 1))
}
fReduce <- function(vec, value) {
replace(
Reduce(
function(x, y)
x + (y * x) ,
vec != value,
right = TRUE,
accumulate = TRUE
),
max(tail(which(vec == value), 1), 0) < seq_along(vec),
NA
)
}
fAve <- function(init, k) {
ave(
seq_along(init),
c(0, head(cumsum(init == k), -1)),
FUN = function(x) if (any(x == k)) rev(seq_along(x) - 1) else NA
)
}
fjblood94 <- function(v, val) {
out <- integer(length(v))
if (v[length(v)] != val) out[length(v)] <- NA_integer_
for (i in (length(v) - 1L):1) {
if (v[i] == val) {
out[i] <- 0L
} else {
out[i] <- out[i + 1L] + 1L
}
}
return(out)
}
Here f is defined as a recursive function that calls itself over shorter tails of the lookup vector:
f <- function(lookup,val ) {
ind <- which(lookup == val)[1] -1
if (length(lookup) > 1) {
c(ind, f(lookup[-1], val))
} else {
ind
}
}
Here is an approach using Reduce() and then some fiddling to get the NA values.
f <- function(vec, value) {
replace(
Reduce(
function(x, y)
x + (y * x) ,
vec != value,
right = TRUE,
accumulate = TRUE
),
max(tail(which(vec == value), 1), 0) < seq_along(vec),
NA
)
}
f(int, 0)
[1] 2 1 0 2 1 0 0 NA
f(int, 1)
[1] 0 0 NA NA NA NA NA NA
f(int, 2)
[1] 4 3 2 1 0 2 1 0
f(int, 3)
[1] NA NA NA NA NA NA NA NA
char = c("A", "B", "C", "A", "A")
f(char, "A")
[1] 0 2 1 0 0
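To see why the right fold counts down to the next match, here is an illustrative trace (my own annotation of the answer's logic):
int <- c(1, 1, 0, 5, 2, 0, 0, 2)
flags <- int != 0   # TRUE where the value does not match
# f(element, accumulated) = element + accumulated * element:
# a match (FALSE) resets the count to 0, a non-match adds 1 to the count so far.
Reduce(function(x, y) x + (y * x), flags, right = TRUE, accumulate = TRUE)
# [1] 2 1 0 2 1 0 0 1
# Counts past the last match are meaningless; replace() turns them into NA.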
Another possible solution, based on purrr::map2_dbl:
library(purrr)
int = c(1, 1, 0, 5, 2, 0, 0, 2)
f <- function(int, num)
{
n <- length(int)
map2_dbl(num, 1:n, ~ ifelse(length(which(.x == int[.y:n])) == 0, NA,
min(which(.x == int[.y:n])) - 1))
}
f(int, 0)
#> [1] 2 1 0 2 1 0 0 NA
f(int, 1)
#> [1] 0 0 NA NA NA NA NA NA
f(int, 2)
#> [1] 4 3 2 1 0 2 1 0
f(int, 3)
#> [1] NA NA NA NA NA NA NA NA
char = c("A", "B", "C", "A", "A")
f(char, "A")
#> [1] 0 2 1 0 0
A single-pass for loop is simple and efficient:
f1 <- function(v, val) {
out <- integer(length(v))
if (v[length(v)] != val) out[length(v)] <- NA_integer_
for (i in (length(v) - 1L):1) {
if (v[i] == val) {
out[i] <- 0L
} else {
out[i] <- out[i + 1L] + 1L
}
}
return(out)
}
int <- c(1, 1, 0, 5, 2, 0, 0, 2)
chr <- c("A", "B", "C", "A", "A")
f1(int, 0)
#> [1] 2 1 0 2 1 0 0 NA
f1(chr, "A")
#> [1] 0 2 1 0 0
Benchmarking against other solutions:
f2 <- function(v, x){
sapply(seq_along(v), function(i){
which(v[ i:length(v) ] == x)[ 1 ] - 1
})
}
f3 <- function(lookup,val ) {
ind <- which(lookup == val)[1] -1
if (length(lookup) > 1) {
c(ind, f3(lookup[-1], val))
} else {
ind
}
}
f4 <- function(x, value) {
idx = which(x == value)
nearest = rep(NA, length(x))
nearest[1:max(idx)] = rep(idx, diff(c(0, idx)))
abs(seq_along(x) - nearest)
}
f5 <- function(vec, value) {
replace(
Reduce(
function(x, y)
x + (y * x) ,
vec != value,
right = TRUE,
accumulate = TRUE
),
max(tail(which(vec == value), 1), 0) < seq_along(vec),
NA
)
}
microbenchmark::microbenchmark(f1 = {f1(int, 0); f1(chr, "A")},
f2 = {f2(int, 0); f2(chr, "A")},
f3 = {f3(int, 0); f3(chr, "A")},
f4 = {f4(int, 0); f4(chr, "A")},
f5 = {f5(int, 0); f5(chr, "A")},
check = "equal")
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> f1 6.0 7.50 8.990 8.40 9.60 18.3 100
#> f2 54.2 61.45 71.752 65.55 79.40 131.8 100
#> f3 25.5 28.60 33.393 30.75 35.90 105.2 100
#> f4 22.3 26.30 30.599 28.00 32.65 82.4 100
#> f5 59.7 64.55 73.474 69.10 75.70 157.0 100
A base R option using ave + cumsum
f <- function(init, k) {
ave(
seq_along(init),
c(0, head(cumsum(init == k), -1)),
FUN = function(x) if (any(x == k)) rev(seq_along(x) - 1) else NA
)
}
and you will see
> f(int, 0)
[1] 2 1 0 2 1 0 0 NA
> f(int, 1)
[1] 0 0 NA NA NA NA NA NA
> f(int, 2)
[1] 4 3 2 1 0 2 1 0
> f(int, 3)
[1] NA NA NA NA NA NA NA NA
I have a data frame with several binary variables: x1, x2, ... x100. I want to replace the entry 1 in each column with the number in the name of the column, i.e.:
data$x2[data$x2 == 1] <- 2
data$x3[data$x3 == 1] <- 3
data$x4[data$x4 == 1] <- 4
data$x5[data$x5 == 1] <- 5
...
How can I achieve this in a loop?
Using col:
# example data
set.seed(1); d <- as.data.frame(matrix(sample(0:1, 12, replace = TRUE), nrow = 3))
names(d) <- paste0("x", seq(ncol(d)))
d
# x1 x2 x3 x4
# 1 0 0 0 1
# 2 1 1 0 0
# 3 0 0 1 0
ix <- d == 1
d[ ix ] <- col(d)[ ix ]
d
# x1 x2 x3 x4
# 1 0 0 0 4
# 2 1 2 0 0
# 3 0 0 3 0
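For reference (my addition): col(d) returns a matrix of column indices with the same shape as d, which is why indexing it with ix picks out the right column number for every 1:
col(d)
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    1    2    3    4
# [3,]    1    2    3    4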
dplyr approach (using zx8754's data):
library(dplyr)
d %>%
mutate(across(starts_with('x'), ~ . * as.numeric(gsub('x', '', cur_column()))))
#> x1 x2 x3 x4
#> 1 0 0 0 4
#> 2 1 2 0 0
#> 3 0 0 3 0
Created on 2021-05-26 by the reprex package (v2.0.0)
Here is a base R solution with a lapply loop.
data[-1] <- lapply(names(data)[-1], function(k){
n <- as.integer(sub("[^[:digit:]]*", "", k))
data[data[[k]] == 1, k] <- n
data[[k]]
})
data
Test data.
set.seed(2021)
data <- replicate(6, rbinom(10, 1, 0.5))
data <- as.data.frame(data)
names(data) <- paste0("x", 1:6)
A solution based on a simple for loop is below (otherwise similar to the accepted answer using lapply):
for (i in 2:100) {
k <- paste0('x', i)
data[data[[k]] == 1, k] <- i
}
I have data about machines in the following form
Number of rows - 900k
Data
A B C D E F G H I J K L M N
---- -- --- ---- --- --- --- --- --- --- --- --- --- ---
1 1 1 1 1 1 1 1 1 1 0 1 1 0 0
2 0 0 0 0 1 1 1 0 1 1 0 0 1 0
3 0 0 0 0 0 0 0 1 1 1 1 1 0 0
1 indicates that the machine was active and 0 indicates that it was inactive.
I want my output to look like
A B C D E F G H I J K L M N
---- -- --- ---- --- --- --- --- --- --- --- --- --- ---
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
2 0 0 0 0 1 1 1 1 1 1 0 0 1 0
3 0 0 0 0 0 0 0 1 1 1 1 1 0 0
Basically, all I am trying to do is look for zeros in a particular row, and if a zero is surrounded by ones on either side, replace that 0 with 1.
For example, in row 1 there is a zero in column J, but columns I and K are both 1, so I replace that 0 with 1 because it is surrounded by 1s.
The code I am using is this:
for(i in 1:nrow(data)) {
for(j in 2:13) {
if(data[i,j]==0 && data[i,j-1]==1 && data[i,j+1]==1){
data[i,j] = 1
}
}
}
Is there a way to reduce the computation time for this? It takes me almost 30 minutes to run in R. Any help would be appreciated.
This is faster because it does not need to iterate over the rows:
for(j in 2:13) {
data[,j] = ifelse(data[,j-1] * data[,j+1]==1,1,data[,j])
}
Or, a little more optimized, without using ifelse:
for(j in 2:(ncol(data) - 1)) {
data[data[, j - 1] * data[, j + 1] == 1, j] <- 1
}
You could also use gsub to replace any instances of 101 with 111 using the following code:
# df1 is the input data frame of 0/1 columns; numLetters is its number of columns
collapsed <- gsub('101', '111', apply(df1, 1, paste, collapse = ''))
data <- as_tibble(t(matrix(unlist(sapply(collapsed, strsplit, split = '')), nrow = numLetters)))
names(data) <- LETTERS[1:numLetters]
Here's a comparison of everyone's solutions:
library(data.table)
library(rbenchmark)
library(tidyverse)
set.seed(1)
numLetters <- 13
df <- as_tibble(matrix(round(runif(numLetters * 100)), ncol = numLetters))
names(df) <- LETTERS[1:numLetters]
benchmark(
'gsub' = {
data <- df
collapsed <- gsub('101', '111', apply(data, 1, paste, collapse = ''))
data <- as_tibble(t(matrix(unlist(sapply(collapsed, strsplit, split = '')), nrow = numLetters)))
names(data) <- LETTERS[1:numLetters]
},
'for_orig' = {
data <- df
for(i in 1:nrow(data)) {
for(j in 2:(ncol(data) - 1)) {
if(data[i, j] == 0 && data[i, j - 1] == 1 && data[i, j + 1] == 1) {
data[i, j] = 1
}
}
}
},
'for_norows' = {
data <- df
for(j in 2:(ncol(data) - 1)) {
data[, j] = ifelse(data[, j - 1] * data[, j + 1] == 1, 1, data[, j])
}
},
'vectorize' = {
data <- df
for(i in seq(ncol(data) - 2) + 1) {
condition <- data[, i - 1] == data[, i + 1] & data[, i - 1] == 1 & data[, i] == 0
data[which(condition), i] <- 1
}
},
'index' = {
data <- df
idx <- apply(data, 1, function(x) c(0, diff(x)))
data[which(idx == -1 & lead(idx == 1), arr.ind = TRUE)[, 2:1]] <- 1
},
replications = 100
)
The indexing solution (which has since been deleted) wins hands down in terms of computation time for a 100-row, 13-column data frame.
        test replications elapsed relative user.self sys.self user.child sys.child
3 for_norows          100    1.19    7.438      1.19        0         NA        NA
2   for_orig          100    9.29   58.063      9.27        0         NA        NA
1       gsub          100    0.28    1.750      0.28        0         NA        NA
5      index          100    0.16    1.000      0.16        0         NA        NA
4  vectorize          100    0.87    5.438      0.87        0         NA        NA
Cut the time by using vectorized operations. As you are planning to do the same thing for every row, this can be done with vectorized conditional statements.
for(i in seq(ncol(data) - 2) + 1){ #<== all but last and first column
#Find all neighbouring columns that are equal, where the center column is equal to 0
condition <- data[, i - 1] == data[, i + 1] & data[, i - 1] == 1 & data[, i] == 0
#Overwrite only the values that holds the condition
data[which(condition), i] <- 1
}
You can avoid loops altogether and use indexing to replace all the values at once:
nc <- ncol(df)
df[, 2:(nc - 1)][df[, 1:(nc - 2)] * df[, 3:nc] == 1] <- 1
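A quick check of the indexing approach on a tiny made-up data frame (my own example, not from the answer):
df <- data.frame(A = c(1, 0, 0), B = c(0, 0, 1), C = c(1, 1, 0), D = c(0, 1, 1))
nc <- ncol(df)
df[, 2:(nc - 1)][df[, 1:(nc - 2)] * df[, 3:nc] == 1] <- 1
df
#   A B C D
# 1 1 1 1 0
# 2 0 0 1 1
# 3 0 1 1 1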
Let's say I have something like this:
set.seed(0)
the.df <- data.frame( x=rep(letters[1:3], each=4),
n=rep(0:3, 3),
val=round(runif(12)))
the.df
x n val
1 a 0 1
2 a 1 0
3 a 2 0
4 a 3 1
5 b 0 1
6 b 1 0
7 b 2 1
8 b 3 1
9 c 0 1
10 c 1 1
11 c 2 0
12 c 3 0
Within each x, starting from n==2 (going from small to large), I want to set val to 0 if the previous val (in terms of n) is 0; otherwise, leave it as is.
For example, in the subset x=="b", I first ignore the two rows where n < 2. Now, in Row 7, because the previous val is 0 (the.df$val[the.df$x=="b" & the.df$n==1]), I set val to 0 (the.df$val[the.df$x=="b" & the.df$n==2] <- 0). Then on Row 8, now that val for the previous n is 0 (we just set it), I also want to set val here to 0 (the.df$val[the.df$x=="b" & the.df$n==3] <- 0).
Imagine that the data.frame is not sorted. Therefore procedures that depend on the order would require a sort. I also can't assume that adjacent rows exist (e.g., the row the.df[the.df$x=="a" & the.df$n==1, ] might be missing).
The trickiest part seems to be evaluating val in sequence. I can do this using a loop but I imagine that it would be inefficient (I have millions of rows). Is there a way I can do this more efficiently?
EDIT: wanted output
the.df
x n val wanted
1 a 0 1 1
2 a 1 0 0
3 a 2 0 0
4 a 3 1 0
5 b 0 1 1
6 b 1 0 0
7 b 2 1 0
8 b 3 1 0
9 c 0 1 1
10 c 1 1 1
11 c 2 0 0
12 c 3 0 0
Also, I don't mind making new columns (e.g., putting the wanted values there).
Using data.table I would try the following
library(data.table)
setDT(the.df)[order(n),
val := if(length(indx <- which(val[2:.N] == 0L)))
c(val[1:(indx[1L] + 1L)], rep(0L, .N - (indx[1L] + 1L))),
by = x]
the.df
# x n val
# 1: a 0 1
# 2: a 1 0
# 3: a 2 0
# 4: a 3 0
# 5: b 0 1
# 6: b 1 0
# 7: b 2 0
# 8: b 3 0
# 9: c 0 1
# 10: c 1 1
# 11: c 2 0
# 12: c 3 0
This will simultaneously order the data by n (as you said it's not ordered in real life) and recreate val by condition (meaning that if the condition isn't satisfied, val is left untouched).
Hopefully in the near future this will be implemented, and then the code could potentially be
setDT(the.df)[order(n), val[n > 2] := if(val[2L] == 0) 0L, by = x]
which could be a great improvement both performance- and syntax-wise.
A base R approach might be
df <- the.df[order(the.df$x, the.df$n),]
df$val <- ave(df$val, df$x, FUN=fun)
As for fun, @DavidArenburg's answer in plain R, written a bit more poetically, might be
fun0 <- function(v) {
idx <- which.max(v[2:length(v)] == 0L) + 1L
if (length(idx))
v[idx:length(v)] <- 0L
v
}
It seems like a good idea to formulate the solution as an independent function first, because then it is easy to test. fun0 fails for some edge cases, e.g.,
> fun0(0)
[1] 0 0 0
> fun0(1)
[1] 0 0 0
> fun0(c(1, 1))
[1] 1 0
A better version is
fun1 <- function(v) {
tst <- tail(v, -1) == 0L
if (any(tst)) {
idx <- which.max(tst) + 1L
v[idx:length(v)] <- 0L
}
v
}
And even better, following @Arun:
fun <- function(v)
if (length(v) > 2) c(v[1], cummin(v[-1])) else v
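Putting the pieces together on the example data (a minimal sketch of the answer above, nothing new):
set.seed(0)
the.df <- data.frame(x = rep(letters[1:3], each = 4),
                     n = rep(0:3, 3),
                     val = round(runif(12)))
fun <- function(v) if (length(v) > 2) c(v[1], cummin(v[-1])) else v
df <- the.df[order(the.df$x, the.df$n), ]
df$val <- ave(df$val, df$x, FUN = fun)
df$val
# [1] 1 0 0 0 1 0 0 0 1 1 0 0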
This is competitive (same order of magnitude) with the data.table solution, with ordering and return occurring in less than 1 s for the ~10M-row data.frame of @m-dz's timings. At about a second for millions of rows, it doesn't seem worthwhile to pursue further optimization.
Nonetheless, when there are a very large number of small groups (e.g., 2M groups each of size 5), an improvement is to avoid the tapply() function call by using the group identity to offset the minimum. For instance,
df <- df[order(df$x, df$n),]
grp <- match(df$x, unique(df$x)) # strictly sequential groups
keep <- duplicated(grp) # ignore the first of each group
df$val[keep] <- cummin(df$val[keep] - grp[keep]) + grp[keep]
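An illustrative check on the example data (my own addition, assuming the.df from the question): the offset keeps each group's shifted values at or below everything in the previous group, so the running minimum never leaks a smaller value across a group boundary.
df <- the.df[order(the.df$x, the.df$n), ]
grp <- match(df$x, unique(df$x))    # 1 1 1 1 2 2 2 2 3 3 3 3
keep <- duplicated(grp)
df$val[keep] - grp[keep]            # -1 -1  0 -2 -1 -1 -2 -3 -3
df$val[keep] <- cummin(df$val[keep] - grp[keep]) + grp[keep]
df$val
# [1] 1 0 0 0 1 0 0 0 1 1 0 0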
Hmmm, should be pretty efficient if you switch to data.table...
library(data.table)
# Define the.df as a data.table (or use data.table::setDT() function)
set.seed(0)
the.df <- data.table(
x = rep(letters[1:3], each = 4),
n = rep(0:3, 3),
val = round(runif(12))
)
m_dz <- function() {
setorder(the.df, x, n)
repeat{
# Get IDs of rows to change
# ids <- which(the.df[, (n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0)])
ids <- the.df[(n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0), , which = TRUE]
# If no IDs break
if(length(ids) == 0){
break
}
# Set val to 0
# for (i in ids) set(the.df, i = i, j = "val", value = 0)
set(the.df, i = ids, j = "val", value = 0)
}
return(the.df)
}
Edit: The above function is slightly modified thanks to @jangorecki's comments, i.e. it uses which = TRUE and set(the.df, i = ids, j = "val", value = 0), which made the timings much more stable (no very high max timings).
Edit: timing comparison with @DavidArenburg's answer on a slightly bigger table, with m_dz() updated (@FoldedChromatin's answer skipped because of different results).
My function is slightly faster in terms of median and upper quartile, but there is quite a big spread in the timings (see max...); I cannot figure out why. Hopefully the timing methodology is correct (returning the result to a different object, etc.).
Anything bigger will kill my PC :(
set.seed(0)
groups_ids <- replicate(300, paste(sample(LETTERS, 5, replace=TRUE), collapse = ""))
size1 <- length(unique(groups_ids))
size2 <- round(1e7/size1)
df1 <- data.table(
x = rep(groups_ids, each = size2), # 52 * 500 = 26000
n = rep(0:(size2-1), size1),
val = round(runif(size1*size2))
)
df2 <- copy(df1)
# m-dz
m_dz <- function() {
setorder(df1, x, n)
repeat{
ids <- df1[(n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0), , which = TRUE]
if(length(ids) == 0){
break
}
set(df1, i = ids, j = "val", value = 0)
}
return(df1)
}
# David Arenburg
DavidArenburg <- function() {
setorder(df2, x, n)
df2[, val := if(length(indx <- which.max(val[2:.N] == 0) + 1L)) c(val[1:indx], rep(0L, .N - indx)), by = x]
return(df2)
}
library(microbenchmark)
microbenchmark(
res1 <- m_dz(),
res2 <- DavidArenburg(),
times = 100
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# res1 <- m_dz() 247.4136 268.5005 363.0117 288.4216 312.7307 7071.0960 100 a
# res2 <- DavidArenburg() 270.6074 281.3935 314.7864 303.5229 328.1210 525.8095 100 a
identical(res1, res2)
# [1] TRUE
Edit: (Old) results for even bigger table:
set.seed(0)
groups_ids <- replicate(300, paste(sample(LETTERS, 5, replace=TRUE), collapse = ""))
size1 <- length(unique(groups_ids))
size2 <- round(1e8/size1)
# Unit: seconds
# expr min lq mean median uq max neval cld
# res1 <- m_dz() 5.599855 5.800264 8.773817 5.923721 6.021132 289.85107 100 a
# res2 <- m_dz2() 5.571911 5.836191 9.047958 5.970952 6.123419 310.65280 100 a
# res3 <- DavidArenburg() 9.183145 9.519756 9.714105 9.723325 9.918377 10.28965 100 a
Why not just use by?
> set.seed(0)
> the.df <- data.frame( x=rep(letters[1:3], each=4),
n=rep(0:3, 3),
val=round(runif(12)))
> the.df
x n val
1 a 0 1
2 a 1 0
3 a 2 0
4 a 3 1
5 b 0 1
6 b 1 0
7 b 2 1
8 b 3 1
9 c 0 1
10 c 1 1
11 c 2 0
12 c 3 0
> Mod.df<-by(the.df,INDICES=the.df$x,function(x){
x$val[x$n==2]=0
Which=which(x$n==2 & x$val==0)+1
x$val[Which]=0
x})
> do.call(rbind,Mod.df)
x n val
a.1 a 0 1
a.2 a 1 0
a.3 a 2 0
a.4 a 3 0
b.5 b 0 1
b.6 b 1 0
b.7 b 2 0
b.8 b 3 0
c.9 c 0 1
c.10 c 1 1
c.11 c 2 0
c.12 c 3 0
I have a pair of binary variables (1's and 0's), and my professor wants me to create a new binary variable that takes the value 1 if both of the previous variables are 1 (i.e., x = 1 and y = 1) and the value zero otherwise.
How would I do this in R?
Thanks!
JMC
Here's one example with some sample data to play with:
set.seed(1)
A <- sample(0:1, 10, replace = TRUE)
B <- sample(0:1, 10, replace = TRUE)
A
# [1] 0 0 1 1 0 1 1 1 1 0
B
# [1] 0 0 1 0 1 0 1 1 0 1
as.numeric(A + B == 2)
# [1] 0 0 1 0 0 0 1 1 0 0
as.numeric(rowSums(cbind(A, B)) == 2)
# [1] 0 0 1 0 0 0 1 1 0 0
as.numeric(A == 1 & B == 1)
# [1] 0 0 1 0 0 0 1 1 0 0
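If the two variables are columns of a data frame, the same idea applies directly (the names dat, x, and y below are made up for illustration):
dat <- data.frame(x = A, y = B)
dat$both <- as.numeric(dat$x == 1 & dat$y == 1)
head(dat, 3)
#   x y both
# 1 0 0    0
# 2 0 0    0
# 3 1 1    1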
Update (to introduce some more alternatives and share a link and a benchmark)
set.seed(1)
A <- sample(0:1, 1e7, replace = TRUE)
B <- sample(0:1, 1e7, replace = TRUE)
fun1 <- function() ifelse(A == 1 & B == 1, 1, 0)
fun2 <- function() as.numeric(A + B == 2)
fun3 <- function() as.numeric(A & B)
fun4 <- function() as.numeric(A == 1 & B == 1)
fun5 <- function() as.numeric(rowSums(cbind(A, B)) == 2)
library(microbenchmark)
microbenchmark(fun1(), fun2(), fun3(), fun4(), fun5(), times = 5)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1() 4842.8559 4871.7072 5022.3525 5093.5932 10424.6589 5
# fun2() 220.8336 220.9867 226.1167 229.1225 472.4408 5
# fun3() 440.7427 445.9342 461.0114 462.6184 488.6627 5
# fun4() 604.1791 613.9284 630.4838 645.2146 682.4689 5
# fun5() 373.8088 373.8532 373.9460 435.0385 1084.6227 5
As can be seen, ifelse is indeed much slower than the other approaches mentioned here. See this SO question and answer for some more details about the efficiency of ifelse.