I'd like to show the names of columns in a large dataframe that contain missing values. Basically, I want the equivalent of complete.cases(df) but for columns, not rows. Some of the columns are non-numeric, so something like
names(df[is.na(colMeans(df))])
returns "Error in colMeans(df) : 'x' must be numeric." So, my current solution is to transpose the dataframe and run complete.cases, but I'm guessing there's some variant of apply (or something in plyr) that's much more efficient.
nacols <- function(df) {
  names(df[, !complete.cases(t(df))])
}
w <- c("hello","goodbye","stuff")
x <- c(1,2,3)
y <- c(1,NA,0)
z <- c(1,0, NA)
tmp <- data.frame(w,x,y,z)
nacols(tmp)
[1] "y" "z"
Can someone show me a more efficient function to identify columns that have NAs?
This is the fastest way that I know of:
unlist(lapply(df, function(x) any(is.na(x))))
EDIT:
I see everyone else wrote theirs out as a complete function, so here it is in full:
nacols <- function(df) {
  colnames(df)[unlist(lapply(df, function(x) any(is.na(x))))]
}
And if you microbenchmark the four solutions on a Windows 7 machine:
Unit: microseconds
expr min lq median uq max
1 ANDRIE 85.380 91.911 106.375 116.639 863.124
2 MANOEL 87.712 93.778 105.908 118.971 8426.886
3 MOIRA 764.215 798.273 817.402 876.188 143039.632
4 TYLER 51.321 57.853 62.518 72.316 1365.136
Edit: At the time I wrote this, anyNA did not exist (or I was unaware of it). It may speed things up further; per the help page for ?anyNA:
The generic function anyNA implements any(is.na(x)) in a possibly faster way (especially for atomic vectors).
nacols <- function(df) {
  colnames(df)[unlist(lapply(df, function(x) anyNA(x)))]
}
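For example, on the tmp data frame from the question this variant gives the same result:

# reusing the example data frame from the question
tmp <- data.frame(w = c("hello", "goodbye", "stuff"),
                  x = c(1, 2, 3),
                  y = c(1, NA, 0),
                  z = c(1, 0, NA))
nacols(tmp)
# [1] "y" "z"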
Here is one way:
colnames(tmp)[colSums(is.na(tmp)) > 0]
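The intermediate colSums(is.na(tmp)) just counts the NAs per column, so the comparison with 0 flags every column containing at least one:

colSums(is.na(tmp))
# w x y z 
# 0 0 1 1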
Hope it helps,
Manoel
One way...
nacols <- function(x) {
  y <- sapply(x, function(xx) any(is.na(xx)))
  names(y[y])
}
nacols(tmp)
[1] "y" "z"
Explanation: since the result y is a logical vector, names(y[y]) returns the names of y for only those cases where y is TRUE.
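For the tmp example above, the intermediate logical vector looks like this; indexing it by itself keeps only the TRUE entries, whose names are "y" and "z":

sapply(tmp, function(xx) any(is.na(xx)))
#     w     x     y     z 
# FALSE FALSE  TRUE  TRUE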
Related
Given is a vector
x <- c("A","B","C","A","C","B","C","A")
Now, for every element x_i in x, I want to find the index until which the vector beginning at x_i has no duplicates and save those indices in a vector (same length as x).
Example:
Starting with the first element "A" in x, the answer is 3: beginning at "A", we have "B", "C", and then another "A" at index 4.
At index 4, the subvector "A","B","C","A" contains the duplicate "A", hence the answer is 3.
We continue with the second element "B". We then get the sequence "B","C","A","C", which has a duplicate, namely "C". Hence the answer is 4. (It's not 3, because "B" is already the second element, so we have to add 1.)
Now the third element, "C": we get the sequence "C","A","C", which already contains a duplicate. Hence the answer is 4. (The unique part "C","A" has length 2, but "C" is already the third element, so we have to add 2.)
I hope it is clear what to do.
This question is purely about performance. I already have a working function:
uniqueTill <- function(x) {
  res <- unlist(lapply(1:length(x), function(i) which(duplicated(x[i:length(x)]))[1] + i - 2))
  res[is.na(res)] <- length(x)
  res
}
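For reference, on the example vector above this returns (assuming I have applied the rule correctly):

uniqueTill(x)
# [1] 3 4 4 6 6 8 8 8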
We can now check the function with:
inds <- uniqueTill(x)
# check that the subvectors are in fact unique
lapply(1:length(inds), function(k) x[k:inds[k]])
lapply(1:length(inds), function(k) any(duplicated(x[k:inds[k]])))
# check that including the next element of the original vector makes the subvectors non-unique
lapply(1:length(inds), function(k) x[k:(inds[k]+1)])
lapply(1:length(inds), function(k) any(duplicated(x[k:(inds[k]+1)]))) # here the last 3 are FALSE since there are no more duplicates, but that's intended
I already got a small performance increase with the following function:
(This function does not give the desired output because of the behaviour of which.min on vectors containing only FALSE, which comes into play when we reach the end of the vector where no more duplicates occur. Maybe you also have an idea how to fix this little problem.)
uniqueTill2 <- function(x) {
  res <- unlist(lapply(1:length(x), function(i) which.min(!duplicated(x[i:length(x)])) + i - 2))
  res[is.na(res)] <- length(x)
  res
}
# newer version
uniqueTill2 <- function(x) {
  unlist(lapply(1:length(x), function(i) {
    dups <- duplicated(x[i:length(x)])
    if (!any(dups)) return(length(x))
    which.max(dups) + i - 2
  }))
}
Instead of calling which in every iteration and then taking its first index, it is enough to find the index of the first minimum via which.min.
To do that we negate the duplicated vector, so that a duplicated element giving TRUE (= 1 numerically) becomes FALSE (= 0 numerically); which.min then returns the position of the first duplicate.
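A small illustration of that trick on the example vector (the case i = 1):

dups <- duplicated(x)   # FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
which.min(!dups)        # 4, the position of the first duplicate
which.max(dups)         # 4 as well, without the negation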
I got the following benchmarks:
x <- c("A","B","C","A","C","B","C","A")
microbenchmark::microbenchmark(uniqueTill(x),uniqueTill2(x))
Unit: microseconds
expr min lq mean median uq max neval
uniqueTill(x) 63.002 66.5010 74.50998 68.4015 73.3005 200.300 100
uniqueTill2(x) 56.401 58.8505 63.29607 60.8505 63.4510 119.901 100
xx <- rep(x,1e3)
microbenchmark::microbenchmark(uniqueTill(xx),uniqueTill2(xx),times=10)
Unit: milliseconds
expr min lq mean median uq max neval
uniqueTill(xx) 752.8640 761.6679 822.4813 845.1330 850.2489 906.9668 10
uniqueTill2(xx) 719.3167 725.2254 744.7558 727.9224 747.4992 812.6559 10
Do you have any ideas how to speed things up? Some Rcpp would also be very nice.
Here's an example of what I mean; this code outputs the right thing:
list1 = list(c(1,2,3,4), c(5,6,7), c(8,9), c(10, 11))
matrix1 = rbind(c(1,2), c(1,5), c(8, 10))
compare <- function(list.t, matrix.t) {
  pairs <- 0
  for (i in 1:nrow(matrix.t)) {
    for (j in 1:length(list.t)) {
      if (length(intersect(matrix.t[i,], list.t[[j]])) == 2) {
        pairs <- pairs + 1
      }
    }
  }
  return(pairs / nrow(matrix.t))
}
compare(list1, matrix1)
# = 0.33333
I hope that makes sense. I'm trying to take an n-by-2 matrix and see whether the two elements of each row of the matrix are also found together in some section of the list. So, in the example above, the first row of the matrix is (1,2), and this pair is found in the first section of the list. The (1,5) and (8,10) pairs are not found in any section of the list. That's why the output is 0.3333 (1/3).
I'm wondering if anyone knows a way that doesn't use two for-loops to compare each row to each section? I have larger matrices and lists, and so this is too slow.
Thank you for any help!
Wouldn't this work just the same? You could call sapply over the list and compare with all rows of the matrix simultaneously.
> list1 = list(c(1,2,3,4), c(5,6,7), c(8,9), c(10, 11))
> matrix1 = rbind(c(1,2), c(1,5), c(8, 10))
> s <- sapply(seq_along(list1), function(i){
length(intersect(list1[[i]], matrix1)) == 2
})
> sum(s)/nrow(matrix1)
# [1] 0.3333333
If we call your function f1(), and this sapply version of the same function f2(), we get the following difference in speed.
> library(microbenchmark)
> microbenchmark(f1(), f2())
# Unit: microseconds
# expr min lq median uq max neval
# f1() 245.017 261.2240 268.843 281.7350 1265.706 100
# f2() 113.727 117.7045 125.478 135.6945 268.310 100
Hopefully that's the increase in efficiency you're looking for.
This is offered in the spirit of an R golf challenge for your problem: a compact but potentially inscrutable solution:
mean( apply(matrix1, 1,
function(x) any( {lapply(list1, function(z) {all(x %in% z) } )}) )
)
[1] 0.3333333
The inner lapply tests whether a particular element of list1 contains both of the items in the two-element vector passed as a row of matrix1. Then the any function tests whether any of the 4 list elements met the challenge for that row. The intermediate logical vector c(TRUE,FALSE,FALSE) is converted into a fraction by mean. (It's still really two nested loops.)
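The per-row result can be inspected directly (written with sapply here so that any() receives a plain logical vector rather than a list):

apply(matrix1, 1, function(x) any(sapply(list1, function(z) all(x %in% z))))
# [1]  TRUE FALSE FALSE
mean(c(TRUE, FALSE, FALSE))
# [1] 0.3333333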
I'm only just getting a handle on vectorizing code with R (links to useful examples would help), and I'm trying to find a faster way of handling this loop. a,b,c all have a bunch of numbers in them and I'm trying to find any particular number that occurs in all 3 columns. The loop works, but is super slow:
for (i in 1:length(a)) {
  if (any(a[i] == b))
    if (any(a[i] == c))
      print(a[i])
}
Is there an apply function that would work really well here?
Maybe this?
x <- 1:5
y <- 4:10
z <- 4:8
> Reduce(intersect,list(x,y,z))
[1] 4 5
I see you have accepted @joran's solution, but it is really a hidden loop. This is a "vectorized" solution:
> x <- 1:5
> y <- 4:10
> z <- 4:8
> x[ (x %in% y) & (x %in% z) ]
[1] 4 5
You could also count the total number of times each value appears (assuming there are no duplicates within each vector; if there are, run unique on them first). This code returns the desired numbers as characters; they can be converted back as needed.
x <- 1:5; y <- 4:10; z <- 4:8
foo <- table(c(x,y,z))
names(foo)[foo==3]
## [1] "4" "5"
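Converting back is just a matter of coercing the names, for example:

as.numeric(names(foo)[foo == 3])
# [1] 4 5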
You can also improve your for loop by using intersect inside the loop (basically that is what is done within Reduce):
intersect.list <- function(list) { ## code from stabperf package
  if (is.null(list)) return(NA)
  # handle empty list
  if (length(list) < 1) return(NA)
  # start with the first element of the list
  int <- list[[1]]
  for (v in list[-1]) { int <- intersect(int, v) }
  return(int)
}
intersect.list(list(x,y,z))
4 5
Benchmarking:
library(microbenchmark)
set.seed(1)
N <- 1e6
x <- sample(1:100,N,rep=T)
y <- sample(1:100,N,rep=T)
z <- sample(1:100,N,rep=T)
vectorized <- function()x[ (x %in% y) & (x %in% z) ]
microbenchmark(intersect.list(list(x,y,z)),
               vectorized(),
               Reduce(intersect, list(x,y,z)), times=10)
Unit: milliseconds
expr min lq median uq max neval
intersect.list(list(x, y, z)) 73.2862 75.14838 76.77792 85.54216 121.8442 10
vectorized() 131.9560 132.40266 134.47248 139.93902 172.7829 10
Reduce(intersect, list(x, y, z)) 88.4308 90.06320 92.72929 128.05930 133.2982 10
As you can see, the for loop is slightly faster than the Reduce and vectorized solutions.
I want to check whether a vector y contains another vector x
y <- c(0,0,0,NA,NA,0)
x <- c(0,0,0,0)
In this case, it should give me FALSE because there is no run of four consecutive zeros in y. But if we look at the vector y2, the result should be TRUE.
y2 <- c(0,0,NA,0,0,0,0)
EDIT:
I tried to use %in%, but it seems to only work for individual elements of vectors, not for whole vectors. The solution doesn't have to apply to more general problems; it is enough if it works for this particular case.
You can use a combination of grepl and paste. Here you need to collapse each vector into a single string using the collapse argument of paste.
> grepl(paste(x,collapse=";"),paste(y2,collapse=";"))
[1] TRUE
> grepl(paste(x,collapse=";"),paste(y,collapse=";"))
[1] FALSE
> grepl(paste(c(123),collapse=";"),paste(c(12,3),collapse=";"))
[1] FALSE
Use this:
any(apply(embed(y,length(y)-length(x)+1),2,identical,x))
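To see why this works: each column of embed(y, length(y) - length(x) + 1) is one contiguous window of y with the same length as x, so identical gets applied to every possible window.

embed(y, length(y) - length(x) + 1)
#      [,1] [,2] [,3]
# [1,]    0    0    0
# [2,]   NA    0    0
# [3,]   NA   NA    0
# [4,]    0   NA   NA
# the columns are y[3:6], y[2:5], and y[1:4]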
Just for those who might wonder, a time test of the answers.
findit1<-function(x,y) any(apply(embed(y,length(y)-length(x)+1),2,identical,x))
findit2<-function(x,y) grepl(paste(x,collapse=";"),paste(y,collapse=";"))
x<-c(0,1,1,0,0,0,1,0,1)
y<-sample(c(0,1),1e5,replace=TRUE)
Rgames> microbenchmark(findit1(x,y),findit2(x,y))
Unit: milliseconds
expr min lq median uq max neval
findit1(x, y) 403.79291 449.9028 457.8320 466.4996 603.6573 100
findit2(x, y) 99.09317 100.7774 101.4513 102.1728 119.8970 100
EDIT:
using eddi's rle answer,
Rgames> findit3<-function(x,y) sum(length(x) <= rle(y)$lengths[rle(y)$values %in% 0])
Rgames> x<-c(0,0,0,0,0)
Rgames> microbenchmark(findit1(x,y),findit2(x,y),findit3(x,y))
Unit: milliseconds
expr min lq median uq max neval
findit1(x, y) 340.63570 383.39450 414.6791 456.38786 532.98017 100
findit2(x, y) 99.72606 101.11308 101.9399 103.20869 117.91149 100
findit3(x, y) 23.39226 24.39826 31.8478 35.10592 53.15408 100
But in the general case of any sequence in x I doubt there's a way to massage rle or seqle to do this. I'll have to go play with things for a while. :-)
For this particular case of 0's only in x, just use rle:
sum(length(x) <= rle(y2)$lengths[rle(y2)$values %in% 0]) > 0
#[1] TRUE
sum(length(x) <= rle(y)$lengths[rle(y)$values %in% 0]) > 0
#[1] FALSE
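For reference, rle(y2) shows why this works: the run of zeros after the NA has length 4, which is at least length(x):

rle(y2)
# Run Length Encoding
#   lengths: int [1:3] 2 1 4
#   values : num [1:3] 0 NA 0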
The OP didn't ask for this, but here's a way to find where the instances of x occur.
I used "9" as my tagging character on the assumption that "9" never shows up in y. Clearly one could choose some other character.
> bar<-gsub(paste(x,collapse=""),'9',paste(y,collapse=""))
> rab<-as.numeric(unlist(strsplit(bar,'')))
> rle(rab==9)
Run Length Encoding
lengths: int [1:3123] 49 1 49 1 20 1 6 1 78 1 ...
values : logi [1:3123] FALSE TRUE FALSE TRUE FALSE TRUE ...
I really like this method:
# Second vector contains all the elements of the first
all(1:10 %in% 1:10)
# [1] TRUE
# Second vector does not contain all elements of the first
all(1:11 %in% 1:10)
# [1] FALSE
Yet another option:
length(x) == max(nchar(strsplit(paste(y,collapse=''),"NA")[[1]]))
length(x) == max(nchar(strsplit(paste(y2,collapse=''),"NA")[[1]]))
I also think there should be a smarter way, e.g. somehow utilizing cumsum (making it reset to 0 at every occurrence of NA, then taking the maximum and comparing it to the length of x). After some Internet searching I have:
length(x) == max(sapply(split(y, replace(cumsum(is.na(y)), is.na(y), -1))[-1],length))
length(x) == max(sapply(split(y2, replace(cumsum(is.na(y2)), is.na(y2), -1))[-1],length))
Or maybe start with which(is.na(y)) and then somehow calculate the maximum gap between consecutive elements of the result.
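A minimal sketch of that last idea, assuming (as in this question) that y contains only 0s and NAs and x is a run of 0s: pad the NA positions with the vector boundaries and compare the largest gap with length(x).

na.pos <- which(is.na(y))
gaps <- diff(c(0, na.pos, length(y) + 1)) - 1   # lengths of the zero runs between NAs
length(x) <= max(gaps)
# [1] FALSE
na.pos2 <- which(is.na(y2))
length(x) <= max(diff(c(0, na.pos2, length(y2) + 1)) - 1)
# [1] TRUE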
Similar to Ferdinand's (nice use of embed(), BTW), this will return a vector of all matches (empty if none):
which(sapply(1:(length(y)-length(x)+1), function(z) identical(x, y[z:(z+length(x)-1)])))
I need your help with this:
I have a list:
list(c(0,1), c(1,1), c(3,2))
How can I get the sum:
(0-1)+(1-1)+(3-2)
I'm not a big fan of Reduce; do.call is usually faster. In this case the unlist solution seems to have a slight edge:
EDIT: @ds440 for the win!
expr min lq median uq max
1 do.call(sum, lapply(List, function(z) -diff(z))) 63.132 67.7520 70.061 72.7560 291.406
2 ds(List) 6.930 10.5875 11.935 12.7040 51.584
3 Reduce("+", lapply(List, function(x) -sum(diff(x)))) 78.530 81.6100 83.727 87.1915 855.355
4 sum(-sapply(List, diff)) 88.155 91.4260 94.121 97.2005 955.442
5 sum(-unlist(lapply(List, diff))) 57.358 60.4375 61.785 63.5170 145.126
Where ds is the approach by @ds440 wrapped in a function.
This probably isn't the fastest way to calculate it, and it certainly uses more resources, but here's a completely different take on it:
> mylist = list(c(0,1), c(1,1), c(3,2))
> a = matrix(unlist(mylist), ncol=2, byrow=T)
> sum(a[,1]-a[,2])
Try this
# Sum of the first differences of your list
> (Sumlist <- lapply(List, function(x) -sum(diff(x))))
[[1]]
[1] -1 # this is (0-1)
[[2]]
[1] 0 # this is (1-1)
[[3]]
[1] 1 # this is (3-2)
# Total sum of your list
> Reduce('+', Sumlist) # this is (0-1)+(1-1)+(3-2)
[1] 0
If that pattern, taking differences of first minus second element, is consistent, then just write an anonymous function in a call to sapply or lapply.
mylist <- list(c(0,1), c(1,1), c(3,2))
sapply(mylist, FUN = function(x) {x[1] - x[2]}) ## takes differences
sum(sapply(mylist, FUN = function(x) {x[1] - x[2]})) ## put it all together
This can also be achieved (as @AnandaMahto and @Jilber did) with the diff function. diff(c(0, 1)) gives the 2nd element minus the 1st, so we need to use -diff for 1st minus 2nd.
sum(-sapply(mylist, FUN = diff))
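On the example list this matches the hand calculation (0-1), (1-1), (3-2) and gives the expected total:

-sapply(mylist, FUN = diff)
# [1] -1  0  1
sum(-sapply(mylist, FUN = diff))
# [1] 0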