I want to omit rows where NA appears in both of two columns.
I'm familiar with na.omit, is.na, and complete.cases, but can't figure out how to use these to get what I want. For example, I have the following dataframe:
(df <- structure(list(x = c(1L, 2L, NA, 3L, NA),
y = c(4L, 5L, NA, 6L, 7L),
z = c(8L, 9L, 10L, 11L, NA)),
.Names = c("x", "y", "z"),
class = "data.frame",
row.names = c(NA, -5L)))
   x  y  z
1  1  4  8
2  2  5  9
3 NA NA 10
4  3  6 11
5 NA  7 NA
and I want to remove only those rows where NA appears in both the x and y columns (ignoring anything in z), to give
   x y  z
1  1 4  8
2  2 5  9
4  3 6 11
5 NA 7 NA
Does anyone know an easy way to do this? Using na.omit, is.na, or complete.cases is not working.
df[!with(df,is.na(x)& is.na(y)),]
# x y z
#1 1 4 8
#2 2 5 9
#4 3 6 11
#5 NA 7 NA
I benchmarked this on a slightly bigger dataset. Here are the results:
set.seed(237)
df <- data.frame(x = sample(c(NA, 1:20), 1e6, replace = TRUE),
                 y = sample(c(NA, 1:10), 1e6, replace = TRUE),
                 z = sample(c(NA, 5:15), 1e6, replace = TRUE))
f1 <- function() df[!with(df,is.na(x)& is.na(y)),]
f2 <- function() df[rowSums(is.na(df[c("x", "y")])) != 2, ]
f3 <- function() df[ apply( df, 1, function(x) sum(is.na(x))>1 ), ]
library(microbenchmark)
microbenchmark(f1(), f2(), f3(), unit="relative")
Unit: relative
#expr min lq median uq max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000 100
# f2() 1.044812 1.068189 1.138323 1.129611 0.856396 100
# f3() 26.205272 25.848441 24.357665 21.799930 22.881378 100
dplyr solution
require("dplyr")
df %>% filter_at(.vars = vars(x, y), .vars_predicate = any_vars(!is.na(.)))
This can be modified to handle any number of columns via the .vars argument.
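For example, a minimal sketch extending the same call to a third column (here the z column from the example data; any set of columns can go inside vars()):
df %>% filter_at(.vars = vars(x, y, z), .vars_predicate = any_vars(!is.na(.)))
# keeps every row in which at least one of x, y, z is non-NA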
Update: dplyr 1.0.4
df %>%
filter(!if_all(c(x, y), is.na))
See similar answer: https://stackoverflow.com/a/66136167/6105259
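The if_all form generalizes in the same way; a minimal sketch (the column sets are only illustrative):
df %>% filter(!if_all(c(x, y, z), is.na))       # drop rows that are NA in x, y and z
df %>% filter(!if_all(everything(), is.na))     # drop rows that are NA in every column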
You can use apply to flag the rows (here, rows with more than one NA):
sel <- apply( df, 1, function(x) sum(is.na(x))>1 )
Then you can drop those rows:
df[ !sel, ]
To ignore the z column, just omit it from the apply:
sel <- apply( df[,c("x","y")], 1, function(x) sum(is.na(x))>1 )
If they all have to be TRUE, just change the function up a little:
sel <- apply( df[,c("x","y")], 1, function(x) all(is.na(x)) )
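Putting it together on the example data, dropping the flagged rows gives the desired result:
df[!sel, ]
#    x y  z
# 1  1 4  8
# 2  2 5  9
# 4  3 6 11
# 5 NA 7 NA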
The other solutions here are more specific to this particular problem, but apply is worth learning as it solves many other problems. The cost is speed (usual caveats about small datasets and speed testing apply):
> microbenchmark( df[!with(df,is.na(x)& is.na(y)),], df[rowSums(is.na(df[c("x", "y")])) != 2, ], df[ apply( df, 1, function(x) sum(is.na(x))>1 ), ] )
Unit: microseconds
expr min lq median uq max neval
df[!with(df, is.na(x) & is.na(y)), ] 67.148 71.5150 76.0340 86.0155 1049.576 100
df[rowSums(is.na(df[c("x", "y")])) != 2, ] 132.064 139.8760 145.5605 166.6945 498.934 100
df[apply(df, 1, function(x) sum(is.na(x)) > 1), ] 175.372 184.4305 201.6360 218.7150 321.583 100
Use rowSums with is.na, like this:
> df[rowSums(is.na(df[c("x", "y")])) != 2, ]
x y z
1 1 4 8
2 2 5 9
4 3 6 11
5 NA 7 NA
Jumping on the benchmarking wagon, and to demonstrate what I meant about this being a fairly easy solution to generalize, consider the following:
## Sample data with 10 columns and 1 million rows
set.seed(123)
df <- data.frame(replicate(10, sample(c(NA, 1:20),
1e6, replace = TRUE)))
First, here's what things look like if you're just interested in two columns. Both solutions are pretty legible and short. Speed is quite close.
f1 <- function() {
df[!with(df, is.na(X1) & is.na(X2)), ]
}
f2 <- function() {
df[rowSums(is.na(df[1:2])) != 2, ]
}
library(microbenchmark)
microbenchmark(f1(), f2(), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# f1() 745.8378 1100.764 1128.047 1199.607 1310.236 20
# f2() 784.2132 1101.695 1125.380 1163.675 1303.161 20
Next, let's look at the same problem, but this time, we are considering NA values across the first 5 columns. At this point, the rowSums approach is slightly faster and the syntax does not change much.
f1_5 <- function() {
df[!with(df, is.na(X1) & is.na(X2) & is.na(X3) &
is.na(X4) & is.na(X5)), ]
}
f2_5 <- function() {
df[rowSums(is.na(df[1:5])) != 5, ]
}
microbenchmark(f1_5(), f2_5(), times = 20)
# Unit: seconds
# expr min lq median uq max neval
# f1_5() 1.275032 1.294777 1.325957 1.368315 1.572772 20
# f2_5() 1.088564 1.169976 1.193282 1.225772 1.275915 20
And here is the very basic dplyr solution:
library(dplyr)
df %>%
filter(!(is.na(x) & is.na(y)))
x y z
1 1 4 8
2 2 5 9
3 3 6 11
4 NA 7 NA
Related
For example, I have a table as follows:
DT <- data.table(
A = c(1,1,1,2,2,2,3,3,3),
B = c(1,2,3,1,2,3,1,2,3),
key = "A"
)
I want to delete rows under conditions such as A == 2 and B == 1, since there is already a row with A == 1 and B == 2.
In short, I want to delete every row whose (A, B) pair already appears symmetrically in an earlier row. How can I achieve this?
Maybe not the most efficient, but this leverages the duplicated.matrix method:
DT[!duplicated(apply(cbind(A, B), 1L, sort), MARGIN = 2L)]
# A B
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 2
# 5: 2 3
# 6: 3 3
Another option:
DT[, g := paste(B, A, sep="_")][A < B, g := paste(A, B, sep="_")][!duplicated(g), !"g"]
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 2
5: 2 3
6: 3 3
So ...
make a grouping variable g by pasting B and A,
flip it to A and B on the subset where A < B, so each unordered pair gets a canonical key,
dedupe on the grouping variable.
The last step could alternatively be unique(DT, by = "g").
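A minimal sketch of that variant (like the chained one-liner above, it adds a temporary g column to DT by reference):
DT[, g := paste(B, A, sep="_")][A < B, g := paste(A, B, sep="_")]
unique(DT, by = "g")[, !"g"]
#    A B
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 2
# 5: 2 3
# 6: 3 3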
If you only have two columns, then you could do:
unique(do.call(function(A,B)data.table(A=pmin(A,B),B=pmax(A,B)),DT))
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 2
5: 2 3
6: 3 3
Another method for the case where there are only 2 columns, using an anti-join:
dupes <- unique(DT[B > A])[unique(DT[A < B]), on=c("A"="B", "B"="A")]
ans <- unique(DT)[!dupes, on=.(A, B)]
timing code:
library(data.table)
set.seed(0L)
nr <- 1e5
nElem <- 1e3
mat <- matrix(sample(nElem, nr*2, replace=TRUE), ncol=2)
DT <- as.data.table(mat)
setnames(DT, c("A", "B"))
DT2 <- copy(DT)
library(microbenchmark)
mtd1 <- function() unique(data.frame(A=pmin(mat[, 1], mat[, 2]), B=pmax(mat[, 1], mat[, 2])))
mtd2 <- function() DT[!duplicated(apply(cbind(A, B), 1L, sort), MARGIN = 2L)]
mtd3 <- function() DT2[, g := paste(B, A, sep="_")][A < B, g := paste(A, B, sep="_")][!duplicated(g), !"g"]
mtd4 <- function() {
dupes <- unique(DT[B > A])[unique(DT[A < B]), on=c("A"="B", "B"="A")]
ans <- unique(DT)[!dupes, on=.(A, B)]
}
microbenchmark(mtd1(),mtd2(),mtd3(),mtd4(),times=3L)
some timings:
Unit: milliseconds
expr min lq mean median uq max neval
mtd1() 118.62051 129.50581 153.77216 140.39111 171.34799 202.30487 3
mtd2() 3500.47877 3552.80879 3732.67006 3605.13882 3848.76571 4092.39260 3
mtd3() 89.22901 92.94830 97.22658 96.66759 101.22536 105.78313 3
mtd4() 28.61628 32.37641 50.90126 36.13654 62.04375 87.95096 3
But the fastest is eddi's method (from "data.table with two string columns of set elements, extract unique rows with each row unsorted"):
mtd5 <- function() DT[DT[, .I[1L], by=.(pmin(A, B), pmax(A, B))]$V1]
microbenchmark(mtd1(),mtd2(),mtd3(),mtd4(),mtd5(),times=3L)
timings:
Unit: milliseconds
expr min lq mean median uq max neval
mtd1() 149.62224 150.70685 175.66394 151.79146 188.68479 225.57813 3
mtd2() 4126.51014 4140.72876 4277.37907 4154.94738 4352.81353 4550.67968 3
mtd3() 126.01679 131.26463 134.63642 136.51247 138.94624 141.38000 3
mtd4() 39.24141 42.42815 45.65804 45.61489 48.86635 52.11781 3
mtd5() 12.58396 16.68156 18.21613 20.77915 21.03221 21.28527 3
I am trying to take the results of a which(..., arr.ind = TRUE) call and remove the rows that are not the first to "connect" with one another.
Examples:
#  example 1        example 2        example 3
   row col          row col          row col
   1   4            2   3            1   3
   2   4            2   4            2   5
   4   5            3   5            3   5
   3   6            2   7            4   6
   4   6            3   7            5   6
   3   7            4   7            6   8
   4   7            5   7            9   10
# should become (trimmed.mtx)
   row col          row col          row col
   1   4            2   3            1   3
   4   5            3   5            3   5
                    5   7            5   6
                                     6   8
These examples can be read in using:
example1 <- structure(list(row = c(1L, 2L, 4L, 3L, 4L, 3L, 4L), col = c(4L, 4L, 5L, 6L, 6L, 7L, 7L)), .Names = c("row", "col"), class = "data.frame", row.names = c(NA, -7L))
example2 <- structure(list(row = c(2L, 2L, 3L, 2L, 3L, 4L, 5L), col = c(3L, 4L, 5L, 7L, 7L, 7L, 7L)), .Names = c("row", "col"), class = "data.frame", row.names = c(NA, -7L))
example3 <- structure(list(row = c(1L, 2L, 3L, 4L, 5L, 6L, 9L), col = c(3L, 5L, 5L, 6L, 6L, 8L, 10L)), .Names = c("row", "col"), class = "data.frame", row.names = c(NA, -7L))
The purpose of this is to take a dist matrix of Euclidean distances and turn it into a sequence of point-to-point distances that skip distances below a certain threshold. While there may be other ways to solve this problem, I am very interested in figuring out the best way to do this by filtering out rows from the which-matrix.
Reproducible example of my intended use:
set.seed(81417) # Aug 14th, 2017
# Generate fake location data (temporally sequential)
x <- as.matrix(cbind(x = rnorm(10, 10, 3), y = rnorm(10, 10, 3)))
# Find euclidean point-to-point distances and remove distances that are less than:
value = 5
# I attempted to do so by calculating an entire Euclidean distance matrix (dist())
# and then finding a path from point-to-nearest-point
# using distances that are greater than the value
d <- as.matrix(dist(x[,c("x","y")]))
d[lower.tri(d)] <- 0
mtx <- which(d > value, arr.ind = T)
mtx
# Change from EVERY point-to-point distance (mtx) > value
# to only the "connecting" points that exceed the skipping value
trimmed.mtx <- {?}
# final result
cbind(x[unique(c(trimmed.mtx)),],d[trimmed.mtx])
This is a perfect problem for Rcpp. Observe:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerMatrix findConnections(IntegerMatrix m) {
int i = 0, j = 0, k = 1, n = m.nrow();
// initialize matrix with same dimensions as m
IntegerMatrix myConnections(n, 2);
while (i < n) {
// Populate with "connected" row
myConnections(j,_) = m(i,_);
// Search for next connection
while (k < n && m(i, 1) != m(k, 0)) {k++;}
i = k;
j++;
}
// Subset matrix and output result
IntegerMatrix subMatrix(j, 2);
for (i = 0; i < j; i++) {subMatrix(i,_) = myConnections(i,_);}
return subMatrix;
}
findConnections(as.matrix(example3))
[,1] [,2]
[1,] 1 3
[2,] 3 5
[3,] 5 6
[4,] 6 8
Here are the benchmarks on example3 provided by the OP:
microbenchmark(get_path(example3),
foo(example3),
f(example3),
findConnections(as.matrix(example3)))
Unit: microseconds
expr min lq mean median uq max neval cld
get_path(example3) 3345.999 3519.0255 6361.76978 3714.014 3892.9930 202511.942 100 b
foo(example3) 215.514 239.3230 360.81086 257.180 278.3200 10256.384 100 a
f(example3) 936.355 1034.4645 1175.60323 1073.668 1142.4270 9676.755 100 a
findConnections(as.matrix(example3)) 52.135 60.3445 71.62075 67.528 80.4585 103.858 100 a
Here are some benchmarks on a larger example (didn't include get_path as it was taking a very long time):
set.seed(6221)
x <- as.matrix(cbind(x = rnorm(1000, 10, 3), y = rnorm(1000, 10, 3)))
value = 5
d <- as.matrix(dist(x[,c("x","y")]))
d[lower.tri(d)] <- 0
mtxLarge <- which(d > value, arr.ind = T)
mtxLargeFoo <- data.frame(mtxLarge, row.names = NULL) ## this is for the function foo
## as we don't want to include
## the time it takes to create
## a data.frame every time.
microbenchmark(foo(mtxLargeFoo),
f(mtxLarge),
findConnections(as.matrix(mtxLarge)), times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval cld
foo(mtxLargeFoo) 3168.479 3376.909 2660.377 3424.276 2319.434 1960.161 10 b
f(mtxLarge) 8307.009 8436.569 6420.919 8319.151 5184.557 4610.922 10 c
findConnections(as.matrix(mtxLarge)) 1.000 1.000 1.000 1.000 1.000 1.000 10 a
Test for equality:
a <- findConnections(as.matrix(mtxLarge))
b <- foo(mtxLargeFoo)
c <- f(mtxLarge)
sapply(1:2, function(x) all(a[, x] == b[, x]) && all(b[, x] == c[, x]))
[1] TRUE TRUE
UPDATE
If Rcpp isn't your flavor, here is a Base R translation of the above code that is still faster than the other solutions:
findConnectionsBase <- function(m) {
n <- nrow(m)
myConnections <- matrix(NA_integer_, nrow = n, ncol = 2)  # NA-filled; unused rows are dropped at the end
i <- j <- 1L
k <- 2L
while (i <= n) {
myConnections[j, ] <- m[i, ]
while (k <= n && m[i, 2] != m[k, 1]) {k <- k + 1L}
i <- k
j <- j + 1L
}
myConnections[!is.na(myConnections[,1]), ]
}
microbenchmark(get_path(example3),
foo(example3),
f(example3),
BaseR = findConnectionsBase(as.matrix(example3)),
Rcpp = findConnections(as.matrix(example3)))
Unit: microseconds
expr min lq mean median uq max neval cld
get_path(example3) 3128.844 3204.3765 6057.18995 3406.137 3849.274 188685.016 100 b
foo(example3) 239.734 251.4325 399.71418 267.648 301.309 12455.441 100 a
f(example3) 899.409 961.3950 1145.72695 1014.555 1127.237 9583.982 100 a
BaseR 79.638 89.2850 103.63571 97.905 111.657 212.230 100 a
Rcpp 48.850 55.8290 64.24807 61.781 69.170 123.151 100 a
And for the larger example:
microbenchmark(foo(mtxLargeFoo),
f(mtxLarge),
BaseR = findConnectionsBase(as.matrix(mtxLarge)),
Rcpp = findConnections(as.matrix(mtxLarge)), times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval cld
foo(mtxLargeFoo) 2651.9626 2555.0515 1606.2785 1703.0256 1711.4850 671.9115 10 c
f(mtxLarge) 6812.7195 6433.2009 3976.6135 4218.1703 4105.1138 1642.2768 10 d
BaseR 787.9947 733.4528 440.2043 478.9412 435.4744 167.7491 10 b
Rcpp 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 10 a
Here is an idea using the igraph package along with zoo,
get_path <- function(df){
g1 <- graph_from_data_frame(df)
l1 <- all_simple_paths(g1, 1)
ind1 <- as.numeric(names(l1[[which.max(lengths(l1))]]))
final_df <- setNames(as.data.frame(rollapply(ind1, 2, c)),
c('row', 'col'))
return(final_df)
}
which gives the following,
library(igraph)
library(zoo)
get_path(example1)
row col
1 1 4
2 4 5
get_path(example2)
row col
1 2 3
2 3 5
3 5 7
get_path(example3)
row col
1 1 3
2 3 5
3 5 6
4 6 8
FUNCTION
foo = function(df){
#Initiate with a value of 1 (first row)
inds = 1
while(TRUE){
# Look for the first index where the 'row' is equal to the value
# in 'col' at the index specified by the last value of 'inds'
temp = tail(inds, 1)
ind = temp + which(df[["row"]][(temp+1):NROW(df)] == df[["col"]][temp])[1]
#Append 'ind' to 'inds'
inds = c(inds, ind)
#Iterate until the end of the rows or when NA is encountered
if (ind == NROW(df) | is.na(ind)){
#Return the subset of the df with appropriate rows
return(df[inds[!is.na(inds)],])
}
}
}
USAGE
foo(example1)
# row col
#1 1 4
#3 4 5
foo(example2)
# row col
#1 2 3
#3 3 5
#7 5 7
foo(example3)
# row col
#1 1 3
#3 3 5
#5 5 6
#6 6 8
foo(data.frame(mtx, row.names = NULL))
# row col
#1 1 3
#5 3 4
#11 4 7
This function is applicable to matrices and data.frames with two columns.
f <- function(x){
res <- x[1, ] # first row as defined
tmpCol <- x[1,2] # the target column for the "connection"
while (TRUE){ # loop until we explicitly return
connectingRow <- x[which(x[, 1] == tmpCol)[1], ] # get first matching row
# if no connecting row is found, the lookup yields NA and we return the result
# collected so far; otherwise continue
if (any(is.na(connectingRow))) return(res)
# append connecting matches and set new tmpCol for reiteration.
res <- rbind(res, connectingRow)
tmpCol <- res[nrow(res), 2]
}
}
f(example1)
# row col
# 1 1 4
# 3 4 5
f(example2)
# row col
# 1 2 3
# 3 3 5
# 7 5 7
Benchmarking
Benchmark comparison between #d.b's foo() and the above proposed f()
microbenchmark(f(mtx), foo(mtx))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# f(mtx) 18.204 19.058 22.61003 20.053 20.7640 64.851 100 a
# foo(mtx) 14.506 15.075 73.97871 15.360 15.9285 5740.151 100 a
For all you fans of functional programming, here's a recursive solution. R is not optimized for this, but it most closely represents the abstract process that the OP is describing.
connected_rows <- function(df, next.row.val = NULL){
if(is.null(next.row.val)){
return(
rbind(
head(df,1),
Recall(
df = tail(df,-1),
next.row.val = head(df$col,1)
)
)
)
} else {
next.row <- match(next.row.val,df$row)
if(is.na(next.row)){
return(NULL)
} else {
return(
rbind(
df[next.row,],
Recall(
df = tail(df,-next.row),
next.row.val = df$col[next.row]
)
)
)
}
}
}
connected_rows(example1)
# row col
# 1 1 4
# 3 4 5
connected_rows(example2)
# row col
# 1 2 3
# 3 3 5
# 7 5 7
connected_rows(example3)
# row col
# 1 1 3
# 3 3 5
# 5 5 6
# 6 6 8
I have a dataframe with three columns:
set.seed(123)
df <- data.frame(x = abs(rnorm(10)), y = abs(rnorm(10)), z = abs(rnorm(10)))
df
x y z
1 0.56047565 1.2240818 1.0678237
2 0.23017749 0.3598138 0.2179749
3 1.55870831 0.4007715 1.0260044
4 0.07050839 0.1106827 0.7288912
5 0.12928774 0.5558411 0.6250393
6 1.71506499 1.7869131 1.6866933
7 0.46091621 0.4978505 0.8377870
8 1.26506123 1.9666172 0.1533731
9 0.68685285 0.7013559 1.1381369
10 0.44566197 0.4727914 1.2538149
I want to construct a dataframe with the same number of rows, having in each row the column names of df ordered by the corresponding row values in df. I have a for-loop based approach that works but is too slow for a large dataframe, so I am looking for a faster, vectorized approach. Here is the for-loop based approach:
df_names <- df
df_names[,] <- NA
df_names
x y z
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
7 NA NA NA
8 NA NA NA
9 NA NA NA
10 NA NA NA
for(r in 1:nrow(df)) {
sorted_row <- sort(df[r,], decreasing = TRUE)
df_names[r,] <- colnames(sorted_row)
}
df_names
x y z
1 y z x
2 y x z
3 x z y
4 z y x
5 z y x
6 y x z
7 z y x
8 y x z
9 z y x
10 z y x
How do I do this faster using the apply family or vectorization?
Revised: I merged all attempts and corrections by #rawr; #rawr's approach was the best so far, with a 30x saving. #989 then added a much faster approach; see the accepted answer by #989.
library(microbenchmark)
set.seed(123)
df <- data.frame(x = abs(rnorm(1000)), y = abs(rnorm(1000)), z = abs(rnorm(1000)))
get_name_df_with_for = function(df) {
df_names <- df
df_names[,] <- NA
for(r in 1:nrow(df)) {
df_names[r,] <- colnames(sort(df[r,], decreasing = TRUE))
}
return(df_names)
}
get_name_df_with_apply = function(df) {
df_names <- data.frame(t(apply(df, 1, function(row) names(sort(row, decreasing = TRUE)))))
return(df_names)
}
get_name_df_with_apply_names = function(df) {
df_names <- data.frame(t(apply(df, 1, function(row) names(row)[(order(row, decreasing = TRUE))])))
return(df_names)
}
get_name_df_double_t = function(df) {
df_names <- data.frame(t(apply(t(df), 2, function(col) names(sort(col, decreasing = TRUE)))))
return(df_names)
}
microbenchmark(
"for" = get_name_df_with_for(df),
"double_transpose" = get_name_df_double_t(df),
"apply" = get_name_df_with_apply(df),
"apply_with_names" = get_name_df_with_apply_names(df),
times = 10
)
Unit: milliseconds
expr min lq mean median uq max neval
for 417.08341 424.37019 446.00655 451.67451 459.64900 480.33351 10
double_transpose 28.46577 29.96637 32.44685 33.02763 33.51309 36.77468 10
apply 27.54800 28.27331 38.02239 30.36667 37.29727 71.46596 10
apply_with_names 12.35264 12.59502 14.16868 13.92946 15.80656 17.22005 10
If the number of columns in your df is just three, here is a faster solution using max.col. It is about 8x faster than the fastest solution proposed in the other answer when nrow(df)=100.
The case in which nrow(df)=100
library(microbenchmark)
set.seed(123)
size <- 100
df <- data.frame(x = abs(rnorm(size)), y = abs(rnorm(size)), z = abs(rnorm(size)))
f1 <- function(df){
vec <- unlist(t(df))
sq <- seq(0,(nrow(df)-1)*3,3)
m1 <- max.col(df)
# -----------------------
vec[sq+m1] <- -Inf
m2 <- max.col(matrix(vec, ncol=3, byrow=T))
vec[sq+m2] <- -Inf
# -----------------------
m3 <- max.col(matrix(vec, ncol=3, byrow=T))
nm <- names(df)
cbind(nm[m1], nm[m2], nm[m3])
}
all(f1(df)==get_name_df_with_for(df))
# [1] TRUE
all(f1(df)==get_name_df_with_apply(df))
# [1] TRUE
all(f1(df)==get_name_df_with_apply_names(df))
# [1] TRUE
all(f1(df)==get_name_df_double_t(df))
# [1] TRUE
microbenchmark(f1(df), "f2"=get_name_df_with_for(df), "f3"=get_name_df_with_apply(df),
"f4"=get_name_df_with_apply_names(df), "f5"=get_name_df_double_t(df))
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(df) 395.643 458.0905 470.8278 472.633 492.7355 701.464 100
# f2 59262.146 61773.0865 63098.5840 62963.223 64309.4780 74246.953 100
# f3 5491.521 5637.1605 6754.3912 5801.619 5956.4545 90457.611 100
# f4 3392.689 3463.9055 3603.1546 3569.125 3707.2795 4237.012 100
# f5 5513.335 5636.3045 5954.9277 5781.089 5971.2115 8622.017 100
Significantly faster when nrow(df)=1000
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(df) 693.765 769.8995 878.3698 815.6655 846.4615 3559.929 100
# f2 627876.429 646057.8155 671925.4799 657768.6270 694047.9940 797900.142 100
# f3 49570.397 52038.3515 54334.0501 53838.8465 56181.0515 62517.965 100
# f4 28892.611 30046.8180 31961.4085 31262.4040 33057.5525 48694.850 100
# f5 49866.379 51491.7235 54413.8287 53705.3970 55962.0575 75287.600 100
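The same max.col trick extends to any number of columns by repeatedly masking the current maximum; the helper below is a hypothetical generalization, not part of the original answer:
order_names_by_row <- function(df) {
  m  <- as.matrix(df)
  nm <- names(df)
  out <- matrix(NA_character_, nrow(m), ncol(m))
  for (j in seq_len(ncol(m))) {
    idx <- max.col(m, ties.method = "first")   # column holding each row's current maximum
    out[, j] <- nm[idx]
    m[cbind(seq_len(nrow(m)), idx)] <- -Inf    # mask it so the next pass finds the runner-up
  }
  out
}
all(order_names_by_row(df) == f1(df))   # should be TRUE here, as the data has no ties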
I can think of several ways to turn a matrix (data frame) of this type:
dat = data.frame(
x1 = rep(c('a', 'b'), 100),
x2 = rep(c('x', 'y'), 100)
)
head(dat)
x1 x2
1 a x
2 b y
3 a x
4 b y
5 a x
6 b y
Into a binary (indicator) matrix (or data frame) like this:
a b x y
1 0 1 0
0 1 0 1
...
(This structure is, of course, trivial and only for illustrative purposes!)
Many thanks!
We can use table
tbl <- table(rep(1:nrow(dat),2),unlist(dat))
head(tbl, 2)
# a b x y
# 1 1 0 1 0
# 2 0 1 0 1
Or a possibly more efficient option would be
library(Matrix)
sM <- sparse.model.matrix(~ -1 + x1 +x2, dat,
contrasts.arg = lapply(dat, contrasts, contrasts = FALSE))
colnames(sM) <- sub(".*\\d", "", colnames(sM))
head(sM, 2)
# 2 x 4 sparse Matrix of class "dgCMatrix"
# a b x y
#1 1 . 1 .
#2 . 1 . 1
It can be converted to an ordinary binary matrix with as.matrix:
head(as.matrix(sM),2)
# a b x y
#1 1 0 1 0
#2 0 1 0 1
There are some good solutions posted already, but none are optimal for performance. We can optimize performance by looping over each input column, and then looping over each factor level index within each input column and doing a straight integer comparison of the factor indexes. It's not the most concise or elegant piece of code, but it's fairly straightforward and fast:
do.call(cbind,lapply(dat,function(col)
`colnames<-`(do.call(cbind,lapply(seq_along(levels(col)),function(i)
as.integer(as.integer(col)==i)
)),levels(col))
));
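Note that this relies on levels(), so the columns must be factors (the data.frame() default before R 4.0). On R >= 4.0, where strings stay character, convert first; a minimal sketch:
dat[] <- lapply(dat, factor)   # on R >= 4.0, data.frame() no longer converts strings to factors
The same caveat applies to the sparse.model.matrix approach above, since contrasts() also expects factors.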
Performance:
library(Matrix);
library(data.table);
library(microbenchmark);
bgoldst <- function(dat) do.call(cbind,lapply(dat,function(col) `colnames<-`(do.call(cbind,lapply(seq_along(levels(col)),function(i) as.integer(as.integer(col)==i))),levels(col))));
akrun1 <- function(dat) table(rep(1:nrow(dat),2),unlist(dat));
akrun2 <- function(dat) sparse.model.matrix(~-1+x1+x2,dat,contrasts.arg=lapply(dat,contrasts,contrasts=FALSE));
davidar <- function(dat) { dat[,rowid:=.I]; dcast(melt(dat,id='rowid'),rowid~value,length); }; ## requires a data.table
dataminer <- function(dat) t(apply(dat,1,function(x) as.numeric(unique(unlist(dat))%in%x)));
N <- 100L; dat <- data.frame(x1=rep(c('a','b'),N),x2=rep(c('x','y'),N)); datDT <- setDT(copy(dat));
identical(unname(bgoldst(dat)),matrix(as.vector(akrun1(dat)),ncol=4L));
## [1] TRUE
identical(unname(bgoldst(dat)),unname(matrix(as.integer(as.matrix(akrun2(dat))),ncol=4L)));
## [1] TRUE
identical(bgoldst(dat),as.matrix(davidar(datDT)[,rowid:=NULL]));
## [1] TRUE
identical(unname(bgoldst(dat)),matrix(as.integer(dataminer(dat)),ncol=4L));
## [1] TRUE
N <- 100L;
dat <- data.frame(x1=rep(c('a','b'),N),x2=rep(c('x','y'),N)); datDT <- setDT(copy(dat));
microbenchmark(bgoldst(dat),akrun1(dat),akrun2(dat),davidar(datDT),dataminer(dat));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst(dat) 67.570 92.374 106.2853 99.6440 121.2405 188.596 100
## akrun1(dat) 581.182 652.386 773.6300 690.6605 916.4625 1192.299 100
## akrun2(dat) 4429.208 4836.119 5554.5902 5145.3135 5977.0990 11263.537 100
## davidar(datDT) 5064.273 5498.555 6104.7621 5664.9115 6203.9695 11713.856 100
## dataminer(dat) 47577.729 49529.753 55217.3726 53190.8940 60041.9020 74346.268 100
N <- 1e4L;
dat <- data.frame(x1=rep(c('a','b'),N),x2=rep(c('x','y'),N)); datDT <- setDT(copy(dat));
microbenchmark(bgoldst(dat),akrun1(dat),akrun2(dat),davidar(datDT));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(dat) 1.775617 1.820949 2.299493 1.84725 1.972124 8.362336 100
## akrun1(dat) 38.954524 41.109257 48.409613 45.60304 52.147633 162.365472 100
## akrun2(dat) 16.915832 17.762799 21.288200 19.20164 23.775180 46.494055 100
## davidar(datDT) 36.151684 38.366715 42.875940 42.38794 45.916937 58.695008 100
N <- 1e5L;
dat <- data.frame(x1=rep(c('a','b'),N),x2=rep(c('x','y'),N)); datDT <- setDT(copy(dat));
microbenchmark(bgoldst(dat),akrun1(dat),akrun2(dat),davidar(datDT));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(dat) 17.16473 22.97654 35.01815 26.76662 31.75562 152.6188 100
## akrun1(dat) 501.72644 626.14494 671.98315 680.91152 727.88262 828.8313 100
## akrun2(dat) 212.12381 242.65505 298.90254 272.28203 357.65106 429.6023 100
## davidar(datDT) 368.04924 461.60078 500.99431 511.54921 540.39358 638.3840 100
If you have a data.frame as you are showing (not a matrix), you could also recast the data:
library(data.table)
setDT(dat)[, rowid := .I] # Creates a row index
res <- dcast(melt(dat, id = "rowid"), rowid ~ value, length) # long/wide format
head(res)
# rowid a b x y
# 1 1 1 0 1 0
# 2 2 0 1 0 1
# 3 3 1 0 1 0
# 4 4 0 1 0 1
# 5 5 1 0 1 0
# 6 6 0 1 0 1
Some benchmarks
dat = data.frame(
x1 = rep(c('a', 'b'), 1e3),
x2 = rep(c('x', 'y'), 1e3)
)
library(data.table)
library(Matrix)
library(microbenchmark)
dat2 <- copy(dat)
microbenchmark("akrun1 : " = table(rep(1:nrow(dat),2),unlist(dat)),
"akrun2 : " = sparse.model.matrix(~ -1 + x1 +x2, dat, contrasts.arg = lapply(dat, contrasts, contrasts = FALSE)),
"DatamineR : " = t(apply(dat,1, function(x) as.numeric(unique(unlist(dat)) %in% x))),
"David Ar : " = {setDT(dat2)[, rowid := .I] ; dcast(melt(dat2, id = "rowid"), rowid ~ value, length)},
times = 10L)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# akrun1 : 3.826075 4.061904 6.654399 5.165376 11.26959 11.82029 10 a
# akrun2 : 5.269531 5.713672 8.794434 5.943422 13.34118 20.01961 10 a
# DatamineR : 3199.336286 3343.774160 3410.618547 3385.756972 3517.22133 3625.70909 10 b
# David Ar : 8.092769 8.254682 11.030785 8.465232 15.44893 19.83914 10 a
The apply solution is highly inefficient and will take forever on a bigger data set. Here is a comparison on a bigger data set, excluding the apply solution:
dat = data.frame(
x1 = rep(c('a', 'b'), 1e4),
x2 = rep(c('x', 'y'), 1e4)
)
dat2 <- copy(dat)
microbenchmark("akrun1 : " = table(rep(1:nrow(dat),2),unlist(dat)),
"akrun2 : " = sparse.model.matrix(~ -1 + x1 +x2, dat, contrasts.arg = lapply(dat, contrasts, contrasts = FALSE)),
#"DatamineR : " = t(apply(dat,1, function(x) as.numeric(unique(unlist(dat)) %in% x))),
"David Ar : " = {setDT(dat2)[, rowid := .I] ; dcast(melt(dat2, id = "rowid"), rowid ~ value, length)},
times = 100L)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# akrun1 : 38.66744 41.27116 52.97982 42.72534 47.17203 161.0420 100 b
# akrun2 : 17.02006 18.93534 27.27582 19.35580 20.72022 153.2397 100 a
# David Ar : 34.15915 37.91659 46.11050 38.58536 41.40412 149.0038 100 b
It seems the Matrix package shines for bigger data sets.
It is probably worth comparing different scenarios with more columns/unique values too.
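For example, a sketch of such a wider setup (the column names and level counts are only illustrative):
library(Matrix)
set.seed(1)
dat_wide <- data.frame(
  x1 = factor(sample(letters[1:5], 1e4, replace = TRUE)),
  x2 = factor(sample(LETTERS[1:5], 1e4, replace = TRUE)),
  x3 = factor(sample(month.abb, 1e4, replace = TRUE))
)
sM <- sparse.model.matrix(~ -1 + x1 + x2 + x3, dat_wide,
                          contrasts.arg = lapply(dat_wide, contrasts, contrasts = FALSE))
dim(sM)   # 10000 rows and 5 + 5 + 12 = 22 indicator columns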
One alternative using apply
head(t(apply(dat,1, function(x) as.numeric(unique(unlist(dat)) %in% x))))
[,1] [,2] [,3] [,4]
[1,] 1 0 1 0
[2,] 0 1 0 1
[3,] 1 0 1 0
[4,] 0 1 0 1
[5,] 1 0 1 0
[6,] 0 1 0 1
How can I check whether the elements of a list are all of equal length?
E.g.:
l <- list(c(1:3),c(2:7),c(12:13))
[[1]]
[1] 1 2 3
[[2]]
[1] 2 3 4 5 6 7
[[3]]
[1] 12 13
I have a long list with many entries and want a way to check whether every element has the same length.
For the list above it should return FALSE, as the lengths differ (3, 6, 2).
Try this:
length(unique(sapply(l, length))) == 1
# [1] FALSE
Or #PierreLafortune's way:
length(unique(lengths(l))) == 1L
Or #CathG's way:
all(sapply(l, length) == length(l[[1]]))
#or
all(lengths(l) == length(l[[1]]))
Some benchmarking:
#data
set.seed(123)
l <- lapply(round(runif(1000,1,100)), runif)
library(microbenchmark)
library(ggplot2)
#benchmark
bm <- microbenchmark(
zx8754 = length(unique(sapply(l, length))) == 1,
PierreLafortune=length(unique(lengths(l))) == 1L,
CathG_1 = all(lengths(l) == length(l[[1]])),
CathG_2 = all(sapply(l, length) == length(l[[1]])),
times = 10000)
# result
bm
Unit: microseconds
expr min lq mean median uq max neval cld
zx8754 326.605 355.281 392.39741 364.034 377.618 84109.597 10000 d
PierreLafortune 23.545 25.960 30.24049 27.168 28.375 3312.829 10000 b
CathG_1 9.056 11.471 13.49464 12.679 13.584 1832.847 10000 a
CathG_2 319.965 343.207 371.50327 351.659 364.940 3531.068 10000 c
#plot benchmark
autoplot(bm)
I would use:
length(unique(lengths(l))) == 1L
[1] FALSE