Find names of columns which contain missing values - r

I want to find all the names of columns with NA or missing data and store these column names in a vector.
# create matrix
a <- c(1,2,3,4,5,NA,7,8,9,10,NA,12,13,14,NA,16,17,18,19,20)
cnames <- c("aa", "bb", "cc", "dd", "ee")
mymatrix <- matrix(a, nrow = 4, ncol = 5, byrow = TRUE)
colnames(mymatrix) <- cnames
mymatrix
# aa bb cc dd ee
# [1,] 1 2 3 4 5
# [2,] NA 7 8 9 10
# [3,] NA 12 13 14 NA
# [4,] 16 17 18 19 20
The desired result: columns "aa" and "ee".
My attempt:
bad <- character()
for (j in 1:4){
tmp <- which(colnames(mymatrix[j, ]) %in% c("", "NA"))
bad <- tmp
}
However, I keep getting integer(0) as my output. Any help is appreciated.

Like this?
colnames(mymatrix)[colSums(is.na(mymatrix)) > 0]
# [1] "aa" "ee"
Or, as suggested by @thelatemail:
names(which(colSums(is.na(mymatrix)) > 0))
# [1] "aa" "ee"
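To see why the one-liner works, it can help to look at the intermediate objects. The following is a minimal sketch that rebuilds the question's matrix; the names na_mask, na_counts and na_cols are only illustrative and do not appear in the answers.

```r
# Rebuild the example matrix from the question
a <- c(1,2,3,4,5,NA,7,8,9,10,NA,12,13,14,NA,16,17,18,19,20)
mymatrix <- matrix(a, nrow = 4, ncol = 5, byrow = TRUE,
                   dimnames = list(NULL, c("aa", "bb", "cc", "dd", "ee")))

# Logical matrix of the same shape: TRUE where a value is missing
na_mask <- is.na(mymatrix)

# Number of missing values per column (a named numeric vector)
na_counts <- colSums(na_mask)
na_counts
# aa bb cc dd ee
#  2  0  0  0  1

# Keep the names of the columns with at least one NA
na_cols <- colnames(mymatrix)[na_counts > 0]
na_cols
# [1] "aa" "ee"
```

The question's loop fails because colnames() is called on a single row, which is a plain vector without column names, so there is nothing for which() to match.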

R 3.1 introduced an anyNA function, which is more convenient and faster:
colnames(mymatrix)[ apply(mymatrix, 2, anyNA) ]
Old answer:
If the matrix has many rows, apply + any may run a bit faster than colSums, because any() can stop at the first TRUE it finds in each column.
apply(is.na(mymatrix), 2, any)
# aa bb cc dd ee
# TRUE FALSE FALSE FALSE TRUE
colnames(mymatrix)[apply(is.na(mymatrix), 2, any)]
# [1] "aa" "ee"

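If you have a data frame with non-numeric columns, this solution is more general (building on previous answers). Note that sapply iterates over the columns of a data frame but over the individual elements of a matrix, so the object must be a data frame; mydf below is assumed to be one, for example as.data.frame(mymatrix).
R 3.1+:
names(which(sapply(mydf, anyNA)))
or, for older versions:
names(which(sapply(mydf, function(x) any(is.na(x)))))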
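As a small illustration of the data frame case, here is a sketch with a made-up data frame that mixes column types. The data frame mydf and its columns are hypothetical and only serve the example.

```r
# Hypothetical data frame with integer, character, numeric and logical columns
mydf <- data.frame(
  id    = 1:4,
  name  = c("a", NA, "c", "d"),
  score = c(1.5, 2.5, NA, 4.0),
  ok    = c(TRUE, FALSE, TRUE, TRUE)
)

# sapply() visits each column of a data frame; anyNA() returns one logical per column
has_na <- sapply(mydf, anyNA)

# Names of the columns that contain at least one NA
na_cols <- names(which(has_na))
na_cols
# [1] "name"  "score"
```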

Related

summarise information reported by a named vector

I want to summarise the information held in a named character vector; see below:
X<- c("BB", "BB", "CC", "CC", "CC", "EE", "EE")
names(X) <- c(1, 2, 2, 2, 3, 3, 4)
The character vector is:
X
1 2 2 2 3 3 4
"BB" "BB" "CC" "CC" "CC" "EE" "EE"
"CC" occurs twice at position 2; this should be reported in the Times row.
Expected output:
1 2 2 3 3 4 # Position
1 1 2 1 1 1 # Times
"BB" "BB" "CC" "CC" "EE" "EE" # Character
Tried:
table (names(X))
data.frame(X)
We can group by both the values of 'X' and the names of 'X' and count the rows in each group
library(data.table)
data.table(X, nm = names(X))[, .N, .(X, nm)]
# X nm N
#1: BB 1 1
#2: BB 2 1
#3: CC 2 2
#4: CC 3 1
#5: EE 3 1
#6: EE 4 1
Or a similar option with the tidyverse (tibble() replaces the deprecated data_frame())
library(dplyr)
tibble(X, nm = names(X)) %>%
count(X, nm)
Or with aggregate from base R
aggregate(cbind(n = rep(1, length(X))) ~ X + names(X), FUN = sum)
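Another base R route, not taken from the answers above, is to cross-tabulate the values against the names with table() and drop the empty cells. A minimal sketch; the column labels Character and Position and the object name counts are chosen here for illustration.

```r
X <- c("BB", "BB", "CC", "CC", "CC", "EE", "EE")
names(X) <- c(1, 2, 2, 2, 3, 3, 4)

# Two-way table of value by position, flattened to one row per combination
counts <- as.data.frame(table(Character = X, Position = names(X)),
                        stringsAsFactors = FALSE)

# Keep only the combinations that actually occur
counts <- counts[counts$Freq > 0, ]
counts$Freq
# [1] 1 1 2 1 1 1
```

The Freq column corresponds to the Times row in the expected output.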

Verify whether row and column names are the same between matrices in R

Say I have matrices one and two:
> one <- matrix(1:9, nrow=3, ncol=3, dimnames=list(c("X","Y","Z"), c("A", "B", "C")))
> one
A B C
X 1 4 7
Y 2 5 8
Z 3 6 9
> two <- matrix(1:9, nrow=3, ncol=3, dimnames=list(c("X","Y","Z"), c("WRONG", "B", "C")))
> two
WRONG B C
X 1 4 7
Y 2 5 8
Z 3 6 9
Is there a command that can produce a logical value to verify whether the column and row names of matrix one are the same as those in matrix two?
You are looking for identical(). For row names -
identical(rownames(one), rownames(two))
# [1] TRUE
And the same for colnames(). For all dimnames(), same thing -
identical(dimnames(one), dimnames(two))
# [1] FALSE
For row and column individually at the same time -
Map(identical, dimnames(one), dimnames(two))
# [[1]]
# [1] TRUE
#
# [[2]]
# [1] FALSE
Update: In response to your comment, for multiple matrices (here with a third matrix named three) you may try
length(unique(lapply(list(one, two, three), dimnames))) == 1
If this returns FALSE, you know that at least one set of dimnames is different.
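The multi-matrix check can be wrapped in a small helper. This is only a sketch: the function name same_dimnames is made up, and three is a hypothetical third matrix defined here as a copy of one.

```r
one <- matrix(1:9, nrow = 3, ncol = 3,
              dimnames = list(c("X", "Y", "Z"), c("A", "B", "C")))
two <- matrix(1:9, nrow = 3, ncol = 3,
              dimnames = list(c("X", "Y", "Z"), c("WRONG", "B", "C")))
three <- one  # hypothetical third matrix, identical to 'one'

# TRUE when every matrix passed in has exactly the same dimnames
same_dimnames <- function(...) {
  dn <- lapply(list(...), dimnames)
  length(unique(dn)) == 1
}

same_dimnames(one, three)
# [1] TRUE
same_dimnames(one, two, three)
# [1] FALSE
```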
If there is a need to identify this for each row and column, you could do this
cbind(unlist(dimnames(one)), unlist(dimnames(one)) %in% unlist(dimnames(two)))
# [,1] [,2]
#row1 "X" "TRUE"
#row2 "Y" "TRUE"
#row3 "Z" "TRUE"
#col1 "A" "FALSE"
#col2 "B" "TRUE"
#col3 "C" "TRUE"
Or else another alternative would be
do.call(`%in%`, list(dimnames(one), dimnames(two)))
# for row and column separately
# [1] TRUE FALSE

cbind coerces a data frame to matrix

I'm having trouble when using cbind. Prior to using cbind the object is a data.frame of two character vectors.
After I add a column using cbind, the data.frame object changes class to matrix. I've tried as.vector, declaring h as an empty character vector, etc. but couldn't fix it. Thank you for any suggestions and help.
output <- data.frame(h = character(), st = character()) ## empty dataframe
st <- state.abb
h <- (rep("a", 50))
output <- cbind(output$h, h) ## output changes to matrix class here
output <- cbind(output, st) ## adding a second column
I guess you may not need cbind().
output <- data.frame(state = state.abb, h = rep("a", 50))
head(output)
state h
1 AL a
2 AK a
3 AZ a
4 AR a
5 CA a
6 CO a
@Ken I'm not sure what you actually want to obtain, but it may be easier if the variables are kept in a list. Below is an example.
state <- state.abb
h <- rep("a", 50)
lst <- list(state = state, h = h)
mat <- as.matrix(do.call(cbind, lst))
head(mat)
state h
[1,] "AL" "a"
[2,] "AK" "a"
[3,] "AZ" "a"
[4,] "AR" "a"
[5,] "CA" "a"
[6,] "CO" "a"
df <- as.data.frame(do.call(cbind, lst))
head(df)
state h
1 AL a
2 AK a
3 AZ a
4 AR a
5 CA a
6 CO a
As a complement, note that you can use single-bracket indexing to make something close to your original code work:
Data:
output <- data.frame(h = letters[1:5],st = letters[6:10])
h2 <- (rep("a", 5))
This won't work:
cbind(output$h, h2)
# h2
# [1,] "1" "a"
# [2,] "2" "a"
# [3,] "3" "a"
# [4,] "4" "a"
# [5,] "5" "a"
class(cbind(output$h, h2)) # matrix
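The result is a matrix, and the factor has been coerced to its integer codes. (This output is from R < 4.0, where data.frame() converted strings to factors by default; in R >= 4.0 the column stays character, but the result is still a matrix.)
This will work: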
cbind(output["h"], h2)
# h h2
# 1 a a
# 2 b a
# 3 c a
# 4 d a
# 5 e a
class(cbind(output["h"], h2)) # data.frame
Note that with double brackets (output[["h"]]) you get the same unwanted result as with the dollar notation.
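The three indexing styles can be compared side by side. A sketch using the same toy data; is.matrix() and is.data.frame() are used instead of class() because class() of a matrix reports c("matrix", "array") in R >= 4.0. The result names (res_dollar and so on) are illustrative.

```r
output <- data.frame(h = letters[1:5], st = letters[6:10])
h2 <- rep("a", 5)

# $ and [[ ]] extract a bare vector, so cbind() falls back to building a matrix
res_dollar <- cbind(output$h, h2)
res_double <- cbind(output[["h"]], h2)
is.matrix(res_dollar)      # TRUE
is.matrix(res_double)      # TRUE

# [ ] keeps a one-column data frame, so cbind() returns a data frame
res_single <- cbind(output["h"], h2)
is.data.frame(res_single)  # TRUE

# Binding onto the whole data frame also keeps the data frame class
res_whole <- cbind(output, h2)
names(res_whole)
# [1] "h"  "st" "h2"
```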

How do I get which.max to return the row name and not an index number

After running correlations I need to identify the row holding the maximum value in each column. I am using which.max, but I cannot get the row name; instead I get an index number, which is of no use to me. Each row has a name.
apply(my.data,2,which.max)
# create example data
set.seed(1)
df <- data.frame(col1=runif(100), col2=runif(100))
row.names(df) <- paste0("row", 1:100)
# get max
rownames(df[apply(df, 2, which.max), ])
# [1] "row18" "row4"
The result from running correlations should be a matrix, so here's an example using a matrix.
> M <- matrix(c(1,5,3,17,6,8,9,2,3,10,8,4), 4, 3)
> rownames(M) <- letters[1:4]
> M
## [,1] [,2] [,3]
## a 1 6 3
## b 5 8 10
## c 3 9 8
## d 17 2 4
> rownames(M)[apply(M, 2, which.max)]
## [1] "d" "c" "b"
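If it is useful to see which column each row name belongs to, the result can be labelled with the column names. A minimal sketch on the same matrix; the column names v1, v2, v3 and the object name max_rows are made up for the example.

```r
M <- matrix(c(1, 5, 3, 17, 6, 8, 9, 2, 3, 10, 8, 4), 4, 3,
            dimnames = list(letters[1:4], c("v1", "v2", "v3")))

# Index of the maximum in each column, translated to a row name
max_rows <- rownames(M)[apply(M, 2, which.max)]

# Label each result with the column it came from
names(max_rows) <- colnames(M)
max_rows
#  v1  v2  v3
# "d" "c" "b"
```

Indexing rownames() directly, rather than subsetting a data frame and then reading its row names, also avoids the suffixes (such as "row18.1") that data frame subsetting adds when the same row is selected more than once.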

Named rows and cols for matrices in R

Is it possible to have named rows and columns in matrices?
For example:
[,a] [,b]
[a,] 1 , 2
[b,] 3 , 4
Is it even reasonable to have such a thing for exploring the data?
Sure. Use dimnames:
> a <- matrix(1:4, nrow = 2)
> a
[,1] [,2]
[1,] 1 3
[2,] 2 4
> dimnames(a) <- list(c("A", "B"), c("AA", "BB"))
> a
AA BB
A 1 3
B 2 4
With dimnames, you can provide a list of (first) rownames and (second) colnames for your matrix. Alternatively, you can specify rownames(x) <- whatever and colnames(x) <- whatever.
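Once the names are set, they can be used for indexing, which is where they tend to help when exploring data. A short sketch, assuming the same 2 x 2 matrix as above, with the names supplied directly through the dimnames argument of matrix().

```r
# Same matrix as above, with dimnames given at construction time
a <- matrix(1:4, nrow = 2,
            dimnames = list(c("A", "B"), c("AA", "BB")))

# Single element by row and column name
a["B", "AA"]
# [1] 2

# A whole row, returned as a named vector
a["A", ]
# AA BB
#  1  3

# rownames() and colnames() read the same information back
rownames(a)
# [1] "A" "B"
colnames(a)
# [1] "AA" "BB"
```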
