Identify and remove single-valued columns from a table in R

I have a reasonably large dataset (~250k rows and 400 cols, roughly 0.5 GB) in which a number of columns are single valued (i.e. they contain only one value). To remove these columns from the dataset I use data[, apply(data, 2, function(x) length(unique(x)) != 1)], which works fine. I was wondering if there might be a more efficient way of doing this. On my PC it takes:
> system.time(apply(data, 2, function(x) length(unique(x))))
# user system elapsed
# 34.37 0.71 35.15
That isn't so bad for one dataset, but I'd like to repeat it multiple times on different datasets.

You can use lapply instead:
data[, unlist(lapply(data, function(x) length(unique(x)) > 1L))]
Note that I added unlist to convert the resulting list to a vector of TRUE / FALSE values which will be used for the subsetting.
Edit: here's a little benchmark:
library(microbenchmark)
a <- runif(1e4)
b <- 99
c <- sample(LETTERS, 1e4, TRUE)
df <- data.frame(a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c)
microbenchmark(
apply = {df[, apply(df, 2, function(x) length(unique(x)) != 1)]},
lapply = {df[, unlist(lapply(df, function(x) length(unique(x)) > 1L))]},
unit = "relative",
times = 100)
#Unit: relative
# expr min lq median uq max neval
#apply 41.29383 40.06719 39.72256 39.16569 28.54078 100
#lapply 1.00000 1.00000 1.00000 1.00000 1.00000 100
Note that apply will first convert the data.frame to a matrix and then perform the operation, which is less efficient. So in most cases where you're working with data.frames, you can (and should) avoid apply and use e.g. lapply instead.
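One way to see the coercion for yourself (a minimal illustration, not from the original answer), using the sample df constructed above:
apply(df, 2, class)   # every column reported as "character" -- apply worked on as.matrix(df)
sapply(df, class)     # the real column classes, because sapply/lapply iterate over the data.frame's columns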

You may also try:
set.seed(40)
df <- as.data.frame(matrix(sample(letters[1:3], 3*10,replace=TRUE), ncol=10))
Filter(function(x) (length(unique(x))>1), df)
Or
df[,colSums(df[-1,]==df[-nrow(df),])!=(nrow(df)-1)] #still better than `apply`
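To see why the second option works: df[-1, ] == df[-nrow(df), ] compares every row with the one above it, so a column in which all nrow(df) - 1 consecutive pairs agree must be single valued. A quick equivalence check (a sketch, not from the original answer; it works on either sample df defined in this question):
# both expressions should flag the same columns as single valued
identical(unname(colSums(df[-1, ] == df[-nrow(df), ]) == nrow(df) - 1),
          vapply(df, function(x) length(unique(x)) == 1, logical(1), USE.NAMES = FALSE))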
Including these in the speed comparison as well (using #beginneR's sample data):
microbenchmark(
new ={Filter(function(x) (length(unique(x))>1), df)},
new1={df[,colSums(df[-1,]==df[-nrow(df),])!=(nrow(df)-1)]},
apply = {df[, apply(df, 2, function(x) length(unique(x)) != 1)]},
lapply = {df[, unlist(lapply(df, function(x) length(unique(x)) > 1L))]},
unit = "relative",
times = 100)
# Unit: relative
# expr min lq median uq max neval
# new 1.0000000 1.0000000 1.000000 1.0000000 1.000000 100
# new1 4.3741503 4.5144133 4.063634 3.9591345 1.713178 100
# apply 23.9635826 24.0895813 21.361140 20.7650416 5.757233 100
#lapply 0.9991514 0.9979483 1.002005 0.9958308 1.002603 100
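Since the question mentions repeating this on several datasets, it may be convenient to wrap the fastest variant in a small helper. A minimal sketch (the name drop_constant_cols is made up, not from the answers above):
drop_constant_cols <- function(dat) {
  # keep only the columns that take more than one distinct value
  keep <- vapply(dat, function(x) length(unique(x)) > 1L, logical(1))
  dat[, keep, drop = FALSE]
}
str(drop_constant_cols(df))   # usage on either sample df defined above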

Related

Efficiently find set differences and generate random sample

I have a very large data set with categorical labels a and a vector b that contains all possible labels in the data set:
a <- c(1,1,3,2) # artificial data
b <- c(1,2,3,4) # fixed categories
Now I want to find for each observation in a the set of all remaining categories (that is, the elements of b excluding the given observation in a). From these remaining categories, I want to sample one at random.
My approach using a loop is
goal <- numeric() # container for results
for (i in 1:4) {
  d <- setdiff(b, a[i]) # find the categories except the one observed in the data
  goal[i] <- sample(d, 1) # sample one of the remaining categories randomly
}
goal
[1] 4 4 1 1
However, this has to be done a large number of times and applied to very large data sets. Does anyone have a more efficient version that leads to the desired result?
EDIT:
The function by akrun is unfortunately slower than the original loop. If anyone has a creative idea with a competitive result, I'm happy to hear it!
We can use vapply
vapply(a, function(x) sample(setdiff(b, x), 1), numeric(1))
set.seed(24)
a <- sample(c(1:4), 10000, replace=TRUE)
b <- 1:4
system.time(vapply(a, function(x) sample(setdiff(b, x), 1), numeric(1)))
# user system elapsed
# 0.208 0.007 0.215
It turns out that resampling the labels that are equal to the labels in the data is an even faster approach, using
# draw unconstrained samples first, then redraw only the positions that collided with a
test <- sample(b, length(a), replace = TRUE)
resample <- (a == test)
while (sum(resample) > 0) {
  test[resample] <- sample(b, sum(resample), replace = TRUE)
  resample <- (a == test)
}
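Packaged as a reusable function, the same rejection-resampling idea could look like this (a sketch; the name resample_exclude is made up, and it assumes every value of a occurs in b and that b has at least two elements):
resample_exclude <- function(a, b) {
  out <- sample(b, length(a), replace = TRUE)
  bad <- (a == out)
  while (any(bad)) {
    # on average only a 1/length(b) fraction of the remaining draws collides and needs redrawing
    out[bad] <- sample(b, sum(bad), replace = TRUE)
    bad <- (a == out)
  }
  out
}
goal <- resample_exclude(a, b)
all(goal != a)   # guaranteed by construction
#> TRUE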
Updated Benchmarks for N=10,000:
Unit: microseconds
expr min lq mean median uq max neval
loop 14337.492 14954.595 16172.2165 15227.010 15585.5960 24071.727 100
akrun 14899.000 15507.978 16271.2095 15736.985 16050.6690 24085.839 100
resample 87.242 102.423 113.4057 112.473 122.0955 174.056 100
shree(data = a, labels = b) 5195.128 5369.610 5472.4480 5454.499 5574.0285 5796.836 100
shree_mapply(data = a, labels = b) 1500.207 1622.516 1913.1614 1682.814 1754.0190 10449.271 100
Update: here's a faster version using mapply. It avoids calling sample() for every element, so it is a bit quicker:
mapply(function(x, y) b[!b == x][y], a, sample(length(b) - 1, length(a), replace = T))
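A quick sanity check that the mapply version stays within the allowed labels (a sketch, not part of the original answer):
res <- mapply(function(x, y) b[!b == x][y], a, sample(length(b) - 1, length(a), replace = TRUE))
all(res %in% b) && all(res != a)   # every draw is a valid label and never the excluded one
#> TRUE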
Here's a version without setdiff (setdiff can be a bit slow), although I think even more optimization is possible:
vapply(a, function(x) sample(b[!b == x], 1), numeric(1))
Benchmarks:
set.seed(24)
a <- sample(c(1:4), 1000, replace=TRUE)
b <- 1:4
microbenchmark::microbenchmark(
akrun = vapply(a, function(x) sample(setdiff(b, x), 1), numeric(1)),
shree = vapply(a, function(x) sample(b[!b == x], 1), numeric(1)),
shree_mapply = mapply(function(x, y) b[!b == x][y], a, sample(length(b) - 1, length(a), replace = T))
)
Unit: milliseconds
expr min lq mean median uq max neval
akrun 28.7347 30.66955 38.319655 32.57875 37.45455 237.1690 100
shree 5.6271 6.05740 7.531964 6.47270 6.87375 45.9081 100
shree_mapply 1.8286 2.01215 2.628989 2.14900 2.54525 7.7700 100

How can I count the number of variables in an R quosure?

Let's say I have a function that takes in a data frame and a varying number of variables from that data frame using non-standard evaluation (NSE). Is there a faster/more straightforward way to count the number of provided variables than select()ing these variables and counting the columns?
# Works but seems non-ideal
nvar <- function(df, vars) {
  vars_en <- rlang::enquo(vars)
  df_sub <- dplyr::select(df, !!vars_en)
  ncol(df_sub)
}
nvar(mtcars, mpg:hp)
#> 4
Highly doubtful (I realize this may receive downvotes). I think the most sensible alternative is to simply select from the colnames of the data.frame, using tidyselect::vars_select:
nvar1 <- function(df, vars) {
  vars_en <- rlang::enquo(vars)
  ans <- vars_select(names(df), !!vars_en)
  length(ans)
}
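A quick usage check (not part of the original answer), mirroring the mtcars example from the question:
library(tidyselect)
nvar1(mtcars, mpg:hp)
#> 4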
But even this is slower than the select(df, ...) %>% ncol() approach:
library(microbenchmark)
library(nycflights13)
library(tidyselect)
nvar <- function(df, vars) {
  vars_en <- rlang::enquo(vars)
  df_sub <- dplyr::select(df, !!vars_en)
  ncol(df_sub)
}
identical(nvar(nycflights13::flights, day:sched_arr_time), nvar1(nycflights13::flights, day:sched_arr_time))
# TRUE
microbenchmark(nvar(nycflights13::flights, day:sched_arr_time), nvar1(nycflights13::flights, day:sched_arr_time), unit='relative', times=100L)
# Unit: relative
# expr min lq mean median uq max neval
# nvar(nycflights13::flights, day:sched_arr_time) 1.000000 1.000000 1.00000 1.000000 1.000000 1.0000000 100
# nvar1(nycflights13::flights, day:sched_arr_time) 1.685793 1.680676 1.60114 1.688626 1.660196 0.9878235 100

Compare Matrices in R efficiently

I have an array a with some matrices in it. Now I need to efficiently check how many different matrices I have and which indices (in ascending order) they have in the array. My approach is the following: paste the columns of the matrices into character vectors and look at the frequency table, like this:
n <- 10 #observations
a <- array(round(rnorm(2*2*n), 1), c(2, 2, n))
paste_a <- apply(a, c(3), paste, collapse=" ") #paste by column
names(paste_a) <- 1:n
freq <- as.numeric( table(paste_a) ) # frequencies of different matrices (in ascending order)
indizes <- as.numeric(names(sort(paste_a[!duplicated(paste_a)])))
nr <- length(freq) #number of different matrices
However, as you increase n to large numbers, this gets very inefficient (it's mainly paste() that's getting slower and slower). Does anyone have a better solution?
Here is a "real" dataset with 100 observations where some matrices are actual duplicates (as opposed to my example above): https://pastebin.com/aLKaSQyF
Thank you very much.
Since your actual data is made up of the integers 0, 1, 2, 3, why not take advantage of base 4? Integers are much faster to compare than entire matrix objects. (All occurrences of a below refer to the real data set from the link.)
Base4Approach <- function() {
  toBase4 <- sapply(1:dim(a)[3], function(x) {
    v <- as.vector(a[,,x])
    pows <- which(v > 0)
    coefs <- v[pows]
    sum(coefs * (4^pows))
  })
  myDupes <- which(duplicated(toBase4))
  a[,,-(myDupes)]
}
And since the question is about efficiency, let's benchmark:
MartinApproach <- function() {
  ### commented this out for comparison reasons
  # dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
  a <- a[,,!duplicated(a, MARGIN = 3)]
  nr <- dim(a)[3]
  a
}
identical(MartinApproach(), Base4Approach())
[1] TRUE
microbenchmark(Base4Approach(), MartinApproach())
Unit: microseconds
expr min lq mean median uq max neval
Base4Approach() 291.658 303.525 339.2712 325.4475 352.981 636.361 100
MartinApproach() 983.855 1000.958 1160.4955 1071.9545 1187.321 3545.495 100
The approach by #d.b. doesn't really do the same thing as the previous two approaches (it only identifies the duplicates and doesn't remove them).
DBApproach <- function() {
  a[, , 9] = a[, , 1]
  # Convert to list
  mylist = lapply(1:dim(a)[3], function(i) a[1:dim(a)[1], 1:dim(a)[2], i])
  temp = sapply(mylist, function(x) sapply(mylist, function(y) identical(x, y)))
  temp2 = unique(apply(temp, 1, function(x) sort(which(x))))
  # The indices in 'a' where the matrices are the same
  temp2[lengths(temp2) > 1]
}
However, Base4Approach still dominates:
microbenchmark(Base4Approach(), MartinApproach(), DBApproach())
Unit: microseconds
expr min lq mean median uq max neval
Base4Approach() 298.764 324.0555 348.8534 338.899 356.0985 476.475 100
MartinApproach() 1012.601 1087.9450 1204.1150 1110.662 1162.9985 3224.299 100
DBApproach() 9312.902 10339.4075 11616.1644 11438.967 12413.8915 17065.494 100
Update courtesy of #alexis_laz
As mentioned in the comments by #alexis_laz, we can do much better.
AlexisBase4Approach <- function() {
  toBase4 <- colSums(a * (4 ^ (0:(prod(dim(a)[1:2]) - 1))), dims = 2)
  myDupes <- which(duplicated(toBase4))
  a[,,-(myDupes)]
}
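Here the weight vector (4^(0:3) for the 2 x 2 slices in this question) is recycled over each slice, and colSums(..., dims = 2) then collapses the first two dimensions, so every slice is reduced to a single base-4 code. A toy illustration with a made-up 2 x 2 x 2 array (not part of the original answer):
toy <- array(c(1, 0, 2, 3,  1, 0, 2, 3), dim = c(2, 2, 2))  # two identical slices
colSums(toy * (4 ^ (0:3)), dims = 2)
#> 225 225    # 1*1 + 0*4 + 2*16 + 3*64; identical codes mean identical slices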
microbenchmark(Base4Approach(), MartinApproach(), DBApproach(), AlexisBase4Approach(), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
Base4Approach() 11.67992 10.55563 8.177654 8.537209 7.128652 5.288112 100
MartinApproach() 39.60408 34.60546 27.930725 27.870019 23.836163 22.488989 100
DBApproach() 378.91510 342.85570 262.396843 279.190793 231.647905 108.841199 100
AlexisBase4Approach() 1.00000 1.00000 1.000000 1.000000 1.000000 1.000000 100
## Still gives accurate results
identical(MartinApproach(), AlexisBase4Approach())
[1] TRUE
My first attempt was actually really slow, so here is a slightly changed version of yours:
dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
a <- a[,,!duplicated(a, MARGIN = 3)]
nr <- dim(a)[3] #number of different matrices
idx <- dimnames(a)[[3]] # indices of left over matrices
I don't know if this is exactly what you want, but here is a way you can extract the indices where the matrices are the same. More processing may be necessary to get exactly what you want.
#DATA
n <- 10
a <- array(round(rnorm(2*2*n),1), c(2,2,n))
a[, , 9] = a[, , 1]
temp = unique(apply(X = sapply(1:dim(a)[3], function(i)
                      sapply(1:dim(a)[3], function(j) identical(a[, , i], a[, , j]))),
                    MARGIN = 1,
                    FUN = function(x) sort(which(x))))
temp[lengths(temp) > 1]
#[[1]]
#[1] 1 9

R: scan vectors once instead of 4 times?

Suppose I have two equal length logical vectors.
Computing the confusion matrix the easy way:
c(sum(actual == 1 & predicted == 1),
sum(actual == 0 & predicted == 1),
sum(actual == 1 & predicted == 0),
sum(actual == 0 & predicted == 0))
requires scanning the vectors 4 times.
Is it possible to do that in a single pass?
PS. I tried table(2*actual+predicted) and table(actual,predicted) but both are obviously much slower.
PPS. Speed is not my main consideration here, I am more interested in understanding the language.
You could try using data.table
library(data.table)
DT <- data.table(actual, predicted)
setkey(DT, actual, predicted)[,.N, .(actual, predicted)]$N
data
set.seed(24)
actual <- sample(0:1, 10 , replace=TRUE)
predicted <- sample(0:1, 10, replace=TRUE)
Benchmarks
Using data.table_1.9.5 and dplyr_0.4.0
library(microbenchmark)
set.seed(245)
actual <- sample(0:1, 1e6 , replace=TRUE)
predicted <- sample(0:1, 1e6, replace=TRUE)
library(dplyr)   # for data_frame/group_by/summarise/count used below
f1 <- function() {
  DT <- data.table(actual, predicted)
  setkey(DT, actual, predicted)[, .N, .(actual, predicted)]$N
}
f2 <- function() {table(actual, predicted)}
f3 <- function() {
  data_frame(actual, predicted) %>%
    group_by(actual, predicted) %>%
    summarise(n())
}
microbenchmark(f1(), f2(), f3(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval cld
#f1() 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 20 a
#f2() 20.818410 22.378995 22.321816 22.56931 22.140855 22.984667 20 b
#f3() 1.262047 1.248396 1.436559 1.21237 1.220109 2.504662 20 a
Including count from dplyr and tabulate in the benchmarks as well, on a slightly bigger dataset:
set.seed(498)
actual <- sample(0:1, 1e7 , replace=TRUE)
predicted <- sample(0:1, 1e7, replace=TRUE)
f4 <- function() {
  data_frame(actual, predicted) %>%
    count(actual, predicted)
}
f5 <- function() {tabulate(4 - actual - 2*predicted, 4)}
Update
Including another data.table solution (provided by #Arun) also in the benchmarks
f6 <- function() {setDT(list(actual, predicted))[,.N, keyby=.(V1,V2)]$N}
microbenchmark(f1(), f3(), f4(), f5(), f6(), unit='relative', times=20L)
#Unit: relative
#expr min lq mean median uq max neval cld
#f1() 2.003088 1.974501 2.020091 2.015193 2.080961 1.924808 20 c
#f3() 2.488526 2.486019 2.450749 2.464082 2.481432 2.141309 20 d
#f4() 2.388386 2.423604 2.430581 2.459973 2.531792 2.191576 20 d
#f5() 1.034442 1.125585 1.192534 1.217337 1.239453 1.294920 20 b
#f6() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20 a
Like this:
tabulate(4 - actual - 2*predicted, 4)
(tabulate here is much faster than table because it knows the output will be a vector of length 4).
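The arithmetic maps each (actual, predicted) pair onto the bins 1 to 4 in exactly the order used in the question, so a single pass over the vectors is enough. A small check with made-up vectors (the names act and pred are mine, to avoid clobbering the actual/predicted defined above):
# actual = 1, predicted = 1  ->  4 - 1 - 2 = 1
# actual = 0, predicted = 1  ->  4 - 0 - 2 = 2
# actual = 1, predicted = 0  ->  4 - 1 - 0 = 3
# actual = 0, predicted = 0  ->  4 - 0 - 0 = 4
act  <- c(1, 0, 1, 0, 1)
pred <- c(1, 1, 0, 0, 1)
tabulate(4 - act - 2*pred, 4)
#> [1] 2 1 1 1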
There is table which computes a cross tabulation and should give similar results if actual and predicted contain only zeros and ones:
table(actual, predicted)
Internally, this works by pasting the vectors together -- horribly inefficient. The coercion to character seems to happen even when tabulating a single vector, which is likely also the reason for the poor performance of table(actual*2 + predicted).
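A rough way to feel the difference yourself (a sketch with made-up data; timings vary by machine and are therefore not shown):
x <- sample(0:3, 1e6, replace = TRUE)
system.time(table(x))            # builds a named table, coercing the values to character
system.time(tabulate(x + 1, 4))  # pure integer counting into 4 bins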

aggregate a matrix (or data.frame) by column name groups in R

I have a large matrix with about 3000 columns x 3000 rows. I'd like to aggregate (calculate the mean), grouped by column names, for every row. Each column is named similar to this (and the names appear in random order):
Tree Tree House House Tree Car Car House
I would need the result (the aggregated mean for every row) to have the following columns:
Tree House Car
The tricky part (at least for me) is that I do not know all the column names and they are all in random order!
You could try
res1 <- vapply(unique(colnames(m1)), function(x)
          rowMeans(m1[, colnames(m1) == x, drop = FALSE], na.rm = TRUE),
          numeric(nrow(m1)))
Or
res2 <- sapply(unique(colnames(m1)), function(x)
          rowMeans(m1[, colnames(m1) == x, drop = FALSE], na.rm = TRUE))
identical(res1,res2)
#[1] TRUE
Another option might be to reshape into long form and then do the aggregation
library(data.table)
res3 <-dcast.data.table(setDT(melt(m1)), Var1~Var2, fun=mean)[,Var1:= NULL]
identical(res1, as.matrix(res3))
[1] TRUE
Benchmarks
It seems like the first two methods are slightly faster for a 3000*3000 matrix
set.seed(24)
m1 <- matrix(sample(0:40, 3000*3000, replace=TRUE),
ncol=3000, dimnames=list(NULL, sample(c('Tree', 'House', 'Car'),
3000,replace=TRUE)))
library(microbenchmark)
f1 <- function() {
  vapply(unique(colnames(m1)), function(x)
    rowMeans(m1[, colnames(m1) == x, drop = FALSE], na.rm = TRUE),
    numeric(nrow(m1)))
}
f2 <- function() {
  sapply(unique(colnames(m1)), function(x)
    rowMeans(m1[, colnames(m1) == x, drop = FALSE], na.rm = TRUE))
}
f3 <- function() {
  dcast.data.table(setDT(melt(m1)), Var1 ~ Var2, fun = mean)[, Var1 := NULL]
}
microbenchmark(f1(), f2(), f3(), unit="relative", times=10L)
# Unit: relative
# expr min lq mean median uq max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f2() 1.026208 1.027723 1.037593 1.034516 1.028847 1.079004 10
# f3() 4.529037 4.567816 4.834498 4.855776 4.930984 5.529531 10
data
set.seed(24)
m1 <- matrix(sample(0:40, 10*40, replace=TRUE), ncol=10,
dimnames=list(NULL, sample(c("Tree", "House", "Car"), 10, replace=TRUE)))
I came up with my own solution. I first transpose the matrix (called test_mean) so the columns become rows, then:
# removing numbers from rownames
rownames(test_mean)<-gsub("[0-9.]","",rownames(test_mean))
#aggregate by rownames
test_mean<-aggregate(test_mean, by=list(rownames(test_mean)), FUN=mean)
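For reference, a minimal end-to-end sketch of this route on the small m1 sample data above (the name test_mean is the OP's; here it is assumed to be the transpose of m1):
test_mean <- t(m1)
# duplicated names may have been made unique with suffixes like Tree.1, so strip digits and dots
rownames(test_mean) <- gsub("[0-9.]", "", rownames(test_mean))
# one row per group (Tree/House/Car), one column per original row of m1
test_mean <- aggregate(test_mean, by = list(rownames(test_mean)), FUN = mean)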
matrixStats::rowMeans2, with some coercive help from data.table, for the win!
Adding it to benchmarking from #akrun we get:
library(matrixStats)   # for rowMeans2
f4 <- function() {
  ucn <- unique(colnames(m1))
  as.matrix(setnames(setDF(lapply(ucn, function(n) rowMeans2(m1, cols = colnames(m1) == n))),
                     ucn))
}
> all.equal(f4(),f1())
[1] TRUE
> microbenchmark(f1(), f2(), f3(), f4(), unit="relative", times=10L)
Unit: relative
expr min lq mean median uq max neval cld
f1() 1.837496 1.841282 1.823375 1.834471 1.818822 1.749826 10 b
f2() 1.760133 1.825352 1.817355 1.826257 1.838439 1.793824 10 b
f3() 15.451106 15.606912 15.847117 15.586192 16.626629 16.104648 10 c
f4() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
