Summarize distribution of factors in R data frame

Say I have a data.frame like this:
X1 X2 X3
1 A B A
2 A C B
3 B A B
4 A A C
I would like to count the occurrences of A, B, C, etc. in each column, and return the result as
A_count B_count C_count
X1 3 1 0
X2 2 1 1
X3 1 2 1
I'm sure this question has a thousand duplicates, but I can't seem to find an answer that works for me :(
By running
apply(mydata, 2, table)
I get something like
$X1
B A
1 3
$X2
A C B
2 1 1
But it's not exactly what I want and if I try to build it back into a data frame, it doesn't work because I don't get the same number of columns for every row (like $X1 above where there are no C's).
What am I missing?
Many thanks!

You can convert each column to a factor whose levels cover every value appearing anywhere in the data frame, then tabulate. I would also recommend using lapply() instead of apply(), as apply() coerces its input to a matrix.
df <- read.table(text = "X1 X2 X3
1 A B A
2 A C B
3 B A B
4 A A C", header = TRUE, stringsAsFactors = TRUE)  # factors needed so levels() works below

do.call(
  rbind,
  lapply(df, function(x) table(factor(x, levels = levels(unlist(df)))))
)
# A B C
# X1 3 1 0
# X2 2 1 1
# X3 1 2 1
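Another compact base R option along the same lines (user20650's approach, which shows up in the benchmarks below) stacks all columns into long format and cross-tabulates values against column names:

```r
# stack() produces a two-column (values, ind) data frame; table() then
# cross-tabulates values against the originating column, and t() puts the
# original columns on the rows (dimnames headers omitted from this sketch)
t(table(stack(lapply(df, as.character))))
#    A B C
# X1 3 1 0
# X2 2 1 1
# X3 1 2 1
```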

Supposing your data frame is x, I would simply do:
do.call(rbind, tapply(unlist(x, use.names = FALSE),
                      rep(1:ncol(x), each = nrow(x)),
                      table))
# A B C
#1 3 1 0
#2 2 1 1
#3 1 2 1
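The trick here is that unlist() flattens the data frame column by column, while rep(1:ncol(x), each = nrow(x)) builds a matching column index for every element, so tapply() tabulates within each column. A quick sketch of the labelling, using the question's data as plain character columns:

```r
x <- data.frame(X1 = c("A", "A", "B", "A"),
                X2 = c("B", "C", "A", "A"),
                X3 = c("A", "B", "B", "C"))
unlist(x, use.names = FALSE)    # "A" "A" "B" "A"  "B" "C" "A" "A"  "A" "B" "B" "C"
rep(1:ncol(x), each = nrow(x))  # 1 1 1 1  2 2 2 2  3 3 3 3 -- column id per element
```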
Benchmarking
# a function to generate toy data
#   `k` factor levels
#   `n` rows
#   `p` columns
datsim <- function(n, p, k) {
  as.data.frame(replicate(p, sample(LETTERS[1:k], n, TRUE), simplify = FALSE),
                col.names = paste0("X", 1:p), stringsAsFactors = TRUE)
}
# try `n = 100`, `p = 500` and `k = 3`
x <- datsim(100, 500, 3)
## DirtySockSniffer's answer
system.time(do.call(rbind, lapply(x, function(u) table(factor(u, levels=levels(unlist(x)))))))
# user system elapsed
# 21.240 0.068 21.365
## my answer
system.time(do.call(rbind, tapply(unlist(x, use.names = FALSE), rep(1:ncol(x), each = nrow(x)), table)))
# user system elapsed
# 0.108 0.000 0.111
DirtySockSniffer's answer can be improved by computing the factor levels only once:
## improved DirtySockSniffer's answer
system.time({
  clevels <- levels(unlist(x, use.names = FALSE))
  do.call(rbind, lapply(x, function(u) table(factor(u, levels = clevels))))
})
# user system elapsed
# 0.108 0.000 0.108
Also consider user20650's answer:
## Let's try a large `n`, `p`, `k`
x <- datsim(200, 5000, 5)
system.time(t(table(stack(lapply(x, as.character)))))
# user system elapsed
# 0.592 0.052 0.646
While my answer does:
system.time(do.call(rbind, tapply(unlist(x, use.names = FALSE), rep(1:ncol(x), each = nrow(x)), table)))
# user system elapsed
# 1.844 0.056 1.904
The improved version of DirtySockSniffer's answer does:
system.time({
  clevels <- levels(unlist(x, use.names = FALSE))
  do.call(rbind, lapply(x, function(u) table(factor(u, levels = clevels))))
})
# user system elapsed
# 1.240 0.012 1.263


Efficiently change elements in data based on neighbouring elements

Let me delve right in. Imagine you have data that looks like this:
df <- data.frame(one   = c(1,  1, NA, 13),
                 two   = c(2, NA, 10, 14),
                 three = c(NA, NA, 11, NA),
                 four  = c(4,  9, 12, NA))
This gives us:
df
# one two three four
# 1 1 2 NA 4
# 2 1 NA NA 9
# 3 NA 10 11 12
# 4 13 14 NA NA
Each row holds measurements for weeks 1, 2, 3 and 4, respectively (one column per week). Suppose the numbers represent some accumulated measure since the last time a measurement happened. For example, in row 1, the "4" in column "four" represents a cumulative value of weeks 3 and 4.
Now I want to "even out" these numbers (feel free to correct my terminology here) by evenly spreading out each measurement over all weeks before it in which no measurement took place. For instance, row 1 should read
1 2 2 2
since the 4 in the original data represents the cumulative value of 2 weeks (week "three" and "four"), and 4/2 is 2.
The final end result should look like this:
df
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
I struggle a bit with how to best approach this. One candidate solution would be to get the indices of all missing values, then count the lengths of runs (NAs occurring multiple times), and use that to fill in the values somehow. However, my real data is large, and I think such a strategy might be time-consuming. Is there an easier and more efficient way?
A base R solution would be to first identify the indices that need to be replaced, then determine groupings of those indices, finally assigning grouped values with the ave function:
clean <- function(x) {
  to.rep <- which(is.na(x) | c(FALSE, head(is.na(x), -1)))
  groups <- cumsum(c(TRUE, head(!is.na(x[to.rep]), -1)))
  x[to.rep] <- ave(x[to.rep], groups, FUN = function(y) {
    rep(tail(y, 1) / length(y), length(y))
  })
  return(x)
}
t(apply(df, 1, clean))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
If efficiency is important (your question implies it is), then an Rcpp solution could be a good option:
library(Rcpp)
cppFunction(
"NumericVector cleanRcpp(NumericVector x) {
  const int n = x.size();
  NumericVector y(x);
  int consecNA = 0;
  for (int i = 0; i < n; ++i) {
    if (R_IsNA(x[i])) {
      ++consecNA;
    } else if (consecNA > 0) {
      const double replacement = x[i] / (consecNA + 1);
      for (int j = i - consecNA; j <= i; ++j) {
        y[j] = replacement;
      }
      consecNA = 0;
    } else {
      consecNA = 0;
    }
  }
  return y;
}")
t(apply(df, 1, cleanRcpp))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
We can compare performance on a larger instance (10000 x 100 matrix):
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
all.equal(apply(mat, 1, clean), apply(mat, 1, cleanRcpp))
# [1] TRUE
system.time(apply(mat, 1, clean))
# user system elapsed
# 4.918 0.035 4.992
system.time(apply(mat, 1, cleanRcpp))
# user system elapsed
# 0.093 0.016 0.120
In this case the Rcpp solution provides roughly a 40x speedup compared to the base R implementation.
Here's a base R solution that's nearly as fast as josilber's Rcpp function:
spread_left <- function(df) {
  nc <- ncol(df)
  x  <- rev(as.vector(t(as.matrix(cbind(df, -Inf)))))
  ii <- cumsum(!is.na(x))
  f  <- tabulate(ii)
  v  <- x[!duplicated(ii)]
  xx <- v[ii] / f[ii]
  xx[xx == -Inf] <- NA
  m  <- matrix(rev(xx), ncol = nc + 1, byrow = TRUE)[, seq_len(nc)]
  as.data.frame(m)
}
spread_left(df)
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
It manages to be relatively fast by vectorizing everything and completely avoiding time-expensive calls to apply(). (The downside is that it's also relatively obfuscated; to see how it works, do debug(spread_left) and then apply it to the small data.frame df in the OP.)
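To see the core idea without stepping through the debugger, here is the same computation sketched on a single reversed row (a simplification: the real function processes all rows at once, appending -Inf as a row separator):

```r
x  <- rev(c(1, 2, NA, 4))  # row 1 of df, reversed: 4 NA 2 1
ii <- cumsum(!is.na(x))    # 1 1 2 3 -- each NA joins the group of the measurement
                           #            that follows it in the original order
f  <- tabulate(ii)         # 2 1 1   -- group sizes
v  <- x[!duplicated(ii)]   # 4 2 1   -- the non-NA value of each group
rev(v[ii] / f[ii])         # 1 2 2 2 -- each value spread evenly over its group
```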
Here are benchmarks for all currently posted solutions:
library(rbenchmark)
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
df <- as.data.frame(mat)
## First confirm that it produces the same results
identical(spread_left(df), as.data.frame(t(apply(mat, 1, clean))))
# [1] TRUE
## Then compare its speed
benchmark(josilberR    = t(apply(mat, 1, clean)),
          josilberRcpp = t(apply(mat, 1, cleanRcpp)),
          Josh         = spread_left(df),
          Henrik       = t(apply(df, 1, fn)),
          replications = 10)
# test replications elapsed relative user.self sys.self
# 4 Henrik 10 38.81 25.201 38.74 0.08
# 3 Josh 10 2.07 1.344 1.67 0.41
# 1 josilberR 10 57.42 37.286 57.37 0.05
# 2 josilberRcpp 10 1.54 1.000 1.44 0.11
Another base possibility. I first create a grouping variable (grp), over which the 'spread' is then made with ave.
fn <- function(x) {
  grp <- rev(cumsum(!is.na(rev(x))))
  res <- ave(x, grp, FUN = function(y) sum(y, na.rm = TRUE) / length(y))
  res[grp == 0] <- NA
  res
}
t(apply(df, 1, fn))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
I was thinking that if NAs are relatively rare, it might be better to make the edits by reference. (I'm guessing this is how the Rcpp approach works.) Here's how it can be done in data.table, borrowing @Henrik's function almost verbatim and converting to long format:
require(data.table) # 1.9.5
fill_naseq <- function(df) {
  # switch to long format
  DT  <- data.table(id = (1:nrow(df)) * ncol(df), df)
  mDT <- setkey(melt(DT, id.vars = "id"), id)
  mDT[, value := as.numeric(value)]
  mDT[, badv  := is.na(value)]
  mDT[
    # subset to rows that need modification
    badv | shift(badv),
    # apply @Henrik's function, more or less
    value := {
      g = ave(!badv, id, FUN = function(x) rev(cumsum(rev(x)))) + id
      ave(value, g, FUN = function(x) { n = length(x); x[n] / n })
    }]
  # revert to wide format
  (setDF(dcast(mDT, id ~ variable)[, id := NULL]))
}
identical(fill_naseq(df),spread_left(df)) # TRUE
To show the best-case scenario for this approach, I simulated so that NAs are very infrequent:
nr = 1e4
nc = 100
nafreq = 1/1e4
mat <- matrix(sample(c(NA, 1:3),
                     nr * nc,
                     replace = TRUE,
                     prob = c(nafreq, rep((1 - nafreq)/3, 3))),
              nrow = nr)
df <- as.data.frame(mat)
benchmark(F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
# test replications elapsed relative user.self
# 1 F 10 3.82 1.394 3.72
# 2 Josh 10 2.74 1.000 2.70
# I don't have Rcpp installed and so left off josilber's even faster approach
So, it's still slower. However, with data kept in a long format, reshaping wouldn't be necessary:
DT  <- data.table(id = (1:nrow(df)) * ncol(df), df)
mDT <- setkey(melt(DT, id.vars = "id"), id)
mDT[, value := as.numeric(value)]
fill_naseq_long <- function(mDT) {
  mDT[, badv := is.na(value)]
  mDT[badv | shift(badv), value := {
    g = ave(!badv, id, FUN = function(x) rev(cumsum(rev(x)))) + id
    ave(value, g, FUN = function(x) { n = length(x); x[n] / n })
  }]
  mDT
}
benchmark(F2 = fill_naseq_long(mDT), F = fill_naseq(df), Josh = spread_left(df),
          replications = 10)[1:5]
# test replications elapsed relative user.self
# 2 F 10 3.98 8.468 3.81
# 1 F2 10 0.47 1.000 0.45
# 3 Josh 10 2.72 5.787 2.69
Now it's a little faster. And who doesn't like keeping their data in long format? This also has the advantage of working even if we don't have the same number of observations per "id".

Speedy/elegant way to unite many pairs of columns

Is there an elegant/fast R way to combine all pairs of columns in a data.frame?
For example, using mapply() and paste() we can turn this data.frame:
mydf <- data.frame(a.1 = letters, a.2 = 26:1, b.1 = letters, b.2 = 1:26)
head(mydf)
a.1 a.2 b.1 b.2
1 a 26 a 1
2 b 25 b 2
3 c 24 c 3
4 d 23 d 4
5 e 22 e 5
6 f 21 f 6
into this data.frame:
mydf2 <- mapply(function(x, y) {
    paste(x, y, sep = ".")
  },
  mydf[, seq(1, ncol(mydf), by = 2)],
  mydf[, seq(2, ncol(mydf), by = 2)])
head(mydf2)
a.1 b.1
[1,] "a.26" "a.1"
[2,] "b.25" "b.2"
[3,] "c.24" "c.3"
[4,] "d.23" "d.4"
[5,] "e.22" "e.5"
[6,] "f.21" "f.6"
However, this feels clumsy and is a bit slow when applied to big datasets. Any suggestions, perhaps using a Hadley package?
EDIT:
The ideal solution would easily scale to large numbers of columns, such that the names of the columns would not need to be included in the function call. Thanks!
It's amusing to note that the OP's solution appears to be the fastest one:
f1 <- function(mydf) {
  mapply(function(x, y) {
      paste(x, y, sep = ".")
    },
    mydf[, seq(1, ncol(mydf), by = 2)],
    mydf[, seq(2, ncol(mydf), by = 2)])
}
f.thelatemail <- function(mydf) {
  mapply(paste, mydf[c(TRUE, FALSE)], mydf[c(FALSE, TRUE)], sep = ".")
}
require(dplyr)
f.on_the_shores_of_linux_sea <- function(mydf) {
  transmute(mydf, x1 = paste0(a.1, '.', a.2), x2 = paste0(b.1, '.', b.2))
}
f.jazurro <- function(mydf) {
  odd <- seq(1, ncol(mydf), 2)
  lapply(odd, function(x) paste(mydf[, x], mydf[, x + 1], sep = ".")) %>%
    do.call(cbind, .)
}
library(data.table)
f.akrun <- function(mydf) {
  res <- as.data.table(matrix(, ncol = ncol(mydf)/2, nrow = nrow(mydf)))
  indx <- seq(1, ncol(mydf), 2)
  setDT(mydf)
  for (j in seq_along(indx)) {
    set(res, i = NULL, j = j, value = paste(mydf[[indx[j]]],
                                            mydf[[indx[j] + 1]], sep = '.'))
  }
  res
}
mydf <- data.frame(a.1 = letters, a.2 = 26:1, b.1 = letters, b.2 = 1:26)
mydf <- mydf[rep(1:nrow(mydf),5000),]
library(rbenchmark)
benchmark(f1(mydf),f.thelatemail(mydf),f.on_the_shores_of_linux_sea(mydf),f.jazurro(mydf),f.akrun(mydf))
Results:
# test replications elapsed relative user.self sys.self user.child sys.child
# 5 f.akrun(mydf) 100 14.000 75.269 13.673 0.296 0 0
# 4 f.jazurro(mydf) 100 0.388 2.086 0.314 0.071 0 0
# 3 f.on_the_shores_of_linux_sea(mydf) 100 15.585 83.790 15.293 0.280 0 0
# 2 f.thelatemail(mydf) 100 26.416 142.022 25.736 0.639 0 0
# 1 f1(mydf) 100 0.186 1.000 0.169 0.017 0 0
[Updated Benchmark]
I've added one solution from @thelatemail, which I missed in the original answer, and one solution from @akrun:
f.thelatemail2 <- function(mydf) {
  data.frame(Map(paste, mydf[c(TRUE, FALSE)], mydf[c(FALSE, TRUE)], sep = "."))
}
f.akrun2 <- function(mydf) {
  setDT(mydf)
  indx <- as.integer(seq(1, ncol(mydf), 2))
  mydf2 <- copy(mydf)
  for (j in indx) {
    set(mydf2, i = NULL, j = j, value = paste(mydf2[[j]],
                                              mydf2[[j + 1]], sep = "."))
  }
  mydf2[, indx, with = FALSE]
}
Benchmark:
library(rbenchmark)
benchmark(f1(mydf),f.thelatemail(mydf), f.thelatemail2(mydf), f.on_the_shores_of_linux_sea(mydf),f.jazurro(mydf),f.akrun(mydf),f.akrun2(mydf))
# test replications elapsed relative user.self sys.self user.child sys.child
# 6 f.akrun(mydf) 100 13.247 69.356 12.897 0.340 0 0
# 7 f.akrun2(mydf) 100 12.746 66.733 12.405 0.339 0 0
# 5 f.jazurro(mydf) 100 0.327 1.712 0.254 0.073 0 0
# 4 f.on_the_shores_of_linux_sea(mydf) 100 16.347 85.586 15.838 0.445 0 0
# 2 f.thelatemail(mydf) 100 26.307 137.733 25.536 0.708 0 0
# 3 f.thelatemail2(mydf) 100 15.938 83.445 15.136 0.750 0 0
# 1 f1(mydf) 100 0.191 1.000 0.156 0.036 0 0
I'm not sure this is the best approach. See if the below code gives any speed improvement
require(dplyr)
transmute(mydf,x1=paste0( a.1,'.', a.2),x2=paste0( b.1,'.', b.2))
Answer updated based on comment :-)
An option using set from data.table. It should be fast for large datasets as it modifies by reference and the overhead of [.data.table is avoided. Assuming that the columns are ordered for each pair of columns.
library(data.table)
res <- as.data.table(matrix(, ncol = ncol(mydf)/2, nrow = nrow(mydf)))
indx <- seq(1, ncol(mydf), 2)
setDT(mydf)
for (j in seq_along(indx)) {
  set(res, i = NULL, j = j, value = paste(mydf[[indx[j]]],
                                          mydf[[indx[j] + 1]], sep = '.'))
}
head(res)
# V1 V2
#1: a.26 a.1
#2: b.25 b.2
#3: c.24 c.3
#4: d.23 d.4
#5: e.22 e.5
#6: f.21 f.6
Instead of creating a new result dataset, we can also update the same or a copy of the original dataset. There will be some warnings about type conversion, but I guess this would be a bit faster (not benchmarked)
setDT(mydf)
mydf2 <- copy(mydf)
for (j in indx) {
  set(mydf2, i = NULL, j = j, value = paste(mydf2[[j]],
                                            mydf2[[j + 1]], sep = "."))
}
mydf2[, indx, with = FALSE]
Benchmarks
I tried the benchmarks on a slightly bigger data with many columns.
data
set.seed(24)
d1 <- as.data.frame(matrix(sample(letters, 500*10000, replace = TRUE),
                           ncol = 500), stringsAsFactors = FALSE)
set.seed(4242)
d2 <- as.data.frame(matrix(sample(1:200, 500*10000, replace = TRUE),
                           ncol = 500))
d3 <- cbind(d1,d2)
mydf <- d3[,order(c(1:ncol(d1), 1:ncol(d2)))]
mydf1 <- copy(mydf)
Compared f1 and f.jazurro (the fastest, from @Marat Talipov's post) with f.akrun2:
library(microbenchmark)
microbenchmark(f1(mydf), f.jazurro(mydf), f.akrun2(mydf1),
               unit = 'relative', times = 20L)
#Unit: relative
#            expr      min        lq     mean   median       uq      max neval cld
#        f1(mydf) 3.420448 2.3217708 2.714495 2.653178 2.819952 2.736376    20   b
# f.jazurro(mydf) 1.000000 1.0000000 1.000000 1.000000 1.000000 1.000000    20  a
# f.akrun2(mydf1) 1.204488 0.8015648 1.031248 1.042262 1.097136 1.066671    20  a
In this run, f.jazurro is slightly better than f.akrun2. I think it would be an interesting comparison if I increased the group size, the number of rows, etc.
For what it's worth seven years later, here is a trick using the glue package and eval() + parse(). I don't know how it compares to the other answers, but it works pretty darn well for me.
library(tidyr)  # unite() comes from tidyr
mydf <- data.frame(a.1 = letters, a.2 = 26:1, b.1 = letters, b.2 = 1:26)
mydf2 <- mydf
vars <- c('a', 'b')
eval(parse(text = glue::glue(
  'mydf2 <- mydf2 |> unite({vars}, c(`{vars}.1`, `{vars}.2`), na.rm = TRUE, sep = ".")'
)))
mydf2
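If you'd rather avoid eval()/parse(), a plain base R loop over the prefixes gives the same pasted values (a sketch, not benchmarked; note the united columns end up appended on the right rather than replacing the pairs in place):

```r
mydf2 <- mydf
for (v in vars) {
  cols <- paste0(v, c(".1", ".2"))                 # e.g. "a.1" "a.2"
  mydf2[[v]] <- do.call(paste, c(mydf2[cols], sep = "."))  # paste the pair row-wise
  mydf2[cols] <- NULL                              # drop the original pair
}
head(mydf2)
```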

Apply a function over all combinations of arguments

I would like to be able to apply a function to all combinations of a set of input arguments. I have a working solution (below) but would be surprised if there's not a better / more generic way to do this using, e.g. plyr, but so far have not found anything. Is there a better solution?
# Apply function FUN to all combinations of arguments and append results to
# data frame of arguments
cmapply <- function(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE,
                    USE.NAMES = TRUE)
{
  l <- expand.grid(..., stringsAsFactors = FALSE)
  r <- do.call(mapply, c(
    list(FUN = FUN, MoreArgs = MoreArgs, SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES),
    l
  ))
  if (is.matrix(r)) r <- t(r)
  cbind(l, r)
}
examples:
# calculate sum of combinations of 1:3, 1:3 and 1:2
cmapply(arg1=1:3, arg2=1:3, 1:2, FUN=sum)
# paste input arguments
cmapply(arg1=1:3, arg2=c("a", "b"), c("x", "y", "z"), FUN=paste)
# function returns a vector
cmapply(a=1:3, b=2, FUN=function(a, b) c("x"=b-a, "y"=a+b))
This function isn't necessarily any better, just slightly different:
library(data.table)  # for CJ(), setnames() and data.table()
rcapply <- function(FUN, ...) {
  ## Cross-join all vectors
  DT <- CJ(...)
  ## Get the original names
  nl <- names(list(...))
  ## Make names, if all are missing
  if (length(nl) == 0L) nl <- make.names(1:length(list(...)))
  ## Fill in any missing names
  nl[!nzchar(nl)] <- paste0("arg", 1:length(nl))[!nzchar(nl)]
  setnames(DT, nl)
  ## Call the function using all columns of every row
  DT2 <- DT[,
    as.data.table(as.list(do.call(FUN, .SD))),  ## Use all columns...
    by = .(rn = 1:nrow(DT))][                   ## ...by every row
      , rn := NULL]                             ## Remove the temp row number
  ## Add res to names of unnamed result columns
  setnames(DT2, gsub("(V)([0-9]+)", "res\\2", names(DT2)))
  return(data.table(DT, DT2))
}
head(rcapply(arg1=1:3, arg2=1:3, 1:2, FUN=sum))
## arg1 arg2 arg3 res1
## 1: 1 1 1 3
## 2: 1 1 2 4
## 3: 1 2 1 4
## 4: 1 2 2 5
## 5: 1 3 1 5
## 6: 1 3 2 6
head(rcapply(arg1=1:3, arg2=c("a", "b"), c("x", "y", "z"), FUN=paste))
## arg1 arg2 arg3 res1
## 1: 1 a x 1 a x
## 2: 1 a y 1 a y
## 3: 1 a z 1 a z
## 4: 1 b x 1 b x
## 5: 1 b y 1 b y
## 6: 1 b z 1 b z
head(rcapply(a=1:3, b=2, FUN=function(a, b) c("x"=b-a, "y"=a+b)))
## a b x y
## 1: 1 2 1 3
## 2: 2 2 0 4
## 3: 3 2 -1 5
A slight simplification of your original code:
cmapply <- function(FUN, ..., MoreArgs = NULL)
{
  l <- expand.grid(..., stringsAsFactors = FALSE)
  r <- .mapply(FUN = FUN, dots = l, MoreArgs = MoreArgs)
  r <- simplify2array(r, higher = FALSE)
  if (is.matrix(r)) r <- t(r)
  return(cbind(l, r))
}
This does not require a do.call.
It does miss the SIMPLIFY and USE.NAMES arguments, but the way you are using them seems to make them unusable anyway: if SIMPLIFY = FALSE, the cbind() will fail, and USE.NAMES = TRUE does not do anything because the names get lost in the cbind() anyway.
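For instance, the simplified version behaves like the original on a small call (the result column r is named after cbind()'s deparsed argument):

```r
# expand.grid() varies the first argument fastest, so the rows come out
# (1,3), (2,3), (1,4), (2,4); .mapply() then sums each row
cmapply(a = 1:2, b = 3:4, FUN = sum)
#   a b r
# 1 1 3 4
# 2 2 3 5
# 3 1 4 5
# 4 2 4 6
```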

Removal of constant columns in R

I was using the prcomp function when I received this error
Error in prcomp.default(x, ...) :
cannot rescale a constant/zero column to unit variance
I know I can scan my data manually but is there any function or command in R that can help me remove these constant variables?
I know this is a very simple task, but I have never been across any function that does this.
Thanks,
The problem here is that your column variance is equal to zero. You can check which columns of a data frame are constant this way, for example:
df <- data.frame(x=1:5, y=rep(1,5))
df
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# Supply names of columns that have 0 variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y"
So if you want to exclude these columns, you can use :
df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]
EDIT : In fact it is simpler to use apply instead. Something like this :
df[,apply(df, 2, var, na.rm=TRUE) != 0]
I guess this Q&A is a popular Google search result, but the accepted answer is a bit slow for a large matrix; also, I do not have enough reputation to comment on the first answer, so I am posting a new answer.
For each column of a large matrix, checking whether the maximum is equal to the minimum is sufficient.
df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]
Here is a test: it reduces the time by more than 90% compared to the first answer, and it is also faster than the approach from the second comment on the question.
ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]) # my method
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 })]) # Keith's method
time1
# user system elapsed
# 22.267 0.194 22.626
time2
# user system elapsed
# 2.073 0.077 2.155
time3
# user system elapsed
# 6.702 0.060 6.790
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
Since this Q&A is a popular Google search result, but the accepted answer is a bit slow for a large matrix and @raymkchow's version is slow with NAs, I propose a new version using exponential search and data.table power.
This is a function I implemented in the dataPreparation package.
First build an example data.table, with more rows than columns (which is usually the case), and 10% of NAs:
library(data.table)
library(dataPreparation)  # for which_are_constant()
ncol = 1000
nrow = 100000
df <- matrix(sample(1:(ncol*nrow), ncol*nrow, replace = FALSE), ncol = ncol)
df <- apply(df, 2, function(x) { x[sample(c(1:nrow), floor(nrow/10))] <- NA; x })  # add 10% of NAs
df[, sample(1:ncol, 70, replace = FALSE)] <- rep(1, times = nrow)  # df is a large matrix
df <- as.data.table(df)
Then benchmark all approaches:
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0, with = F]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = F]) # raymkchow
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 }), with = F]) # Keith's method
time4 <- system.time(df4 <- df[,-which_are_constant(df, verbose=FALSE)]) # My method
The results are the following:
time1 # Variance approach
# user system elapsed
# 2.55 1.45 4.07
time2 # Min = max approach
# user system elapsed
# 2.72 1.5 4.22
time3 # length(unique()) approach
# user system elapsed
# 6.7 2.75 9.53
time4 # Exponential search approach
# user system elapsed
# 0.39 0.07 0.45
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE
dataPreparation::which_are_constant is 10 times faster than the other approaches. Plus, the more rows you have, the more worthwhile it is to use.
The janitor package has the function remove_constant that can help delete constant columns.
Let's create some synthetic data for illustration:
library(janitor)
test_dat <- data.frame(A=1, B=1:10, C= LETTERS[1:10])
test_dat
This is test_dat:
> test_dat
A B C
1 1 1 A
2 1 2 B
3 1 3 C
4 1 4 D
5 1 5 E
6 1 6 F
7 1 7 G
8 1 8 H
9 1 9 I
10 1 10 J
then the function remove_constant can help delete the constant column:
remove_constant(test_dat)
remove_constant(test_dat, na.rm= TRUE)
Using either of the above two calls, we will get:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
NOTE: use the argument na.rm = TRUE to make sure that a column containing one value plus NAs is also deleted. For example,
test_dat_with_NA <- data.frame(A=c(1, NA), B=1:10, C= LETTERS[1:10])
test_dat_with_NA
This is test_dat_with_NA:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
then the call
remove_constant(test_dat_with_NA)
does not delete column A:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
while the call
remove_constant(test_dat_with_NA, na.rm = TRUE)
does delete column A, which holds only the value 1 and NAs:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
If you are after a dplyr solution that returns the non-constant variables in a df, I'd recommend the following. Optionally, you can add %>% colnames() if the column names are desired:
library(dplyr)
df <- data.frame(x = 1:5, y = rep(1,5))
# returns dataframe
var_df <- df %>%
select_if(function(v) var(v, na.rm=TRUE) != 0)
var_df %>% colnames() # returns column names
tidyverse version of Keith's comment:
df %>% purrr::keep(~length(unique(.x)) != 1)

Replacing column names using a data frame in R

I have the matrix
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE,
            dimnames = list(c("s1", "s2", "s3"), c("tom", "dick", "bob")))
tom dick bob
s1 1 2 3
s2 4 5 6
s3 7 8 9
#and the data frame
current <- c("tom", "dick", "harry", "bob")
replacement <- c("x", "y", "z", "b")
df <- data.frame(current, replacement)
current replacement
1 tom x
2 dick y
3 harry z
4 bob b
#I need to replace the existing names i.e. df$current with df$replacement if
#colnames(m) are equal to df$current thereby producing the following matrix
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE,
            dimnames = list(c("s1", "s2", "s3"), c("x", "y", "b")))
x y b
s1 1 2 3
s2 4 5 6
s3 7 8 9
Any advice? Should I use an 'if' loop? Thanks.
You can use which to match the colnames from m with the values in df$current. Then, when you have the indices, you can subset the replacement colnames from df$replacement.
colnames(m) = df$replacement[which(df$current %in% colnames(m))]
In the above:
%in% tests for TRUE or FALSE for any matches between the objects being compared.
which(df$current %in% colnames(m)) identifies the indexes (in this case, the row numbers) of the matched names.
df$replacement[...] is the basic way to subset the column df$replacement returning only the rows matched with step 2.
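Spelling the steps out on the example data (note this works here because df$current lists the matched names in the same order as they appear in colnames(m); for arbitrary orderings the match()-based approach below is safer):

```r
df$current %in% colnames(m)          # TRUE TRUE FALSE TRUE -- "harry" has no match
which(df$current %in% colnames(m))   # 1 2 4
df$replacement[c(1, 2, 4)]           # "x" "y" "b" (a factor in R < 4.0)
```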
A slightly more direct way to find the indices is to use match:
> id <- match(colnames(m), df$current)
> id
[1] 1 2 4
> colnames(m) <- df$replacement[id]
> m
x y b
s1 1 2 3
s2 4 5 6
s3 7 8 9
As discussed below, %in% is generally more intuitive to use, and the difference in efficiency is marginal unless the sets are relatively large, e.g.
> n <- 50000 # size of full vector
> m <- 10000 # size of subset
> query <- paste("A", sort(sample(1:n, m)))
> names <- paste("A", 1:n)
> all.equal(which(names %in% query), match(query, names))
[1] TRUE
> library(rbenchmark)
> benchmark(which(names %in% query))
test replications elapsed relative user.self sys.self user.child sys.child
1 which(names %in% query) 100 0.267 1 0.268 0 0 0
> benchmark(match(query, names))
test replications elapsed relative user.self sys.self user.child sys.child
1 match(query, names) 100 0.172 1 0.172 0 0 0
