R Sum complete cases of two columns - r

How can I sum the number of complete cases of two columns?
With c equal to:
a b
[1,] NA NA
[2,] 1 1
[3,] 1 1
[4,] NA 1
Applying something like
rollapply(c, 2, function(x) sum(complete.cases(x)),fill=NA)
I'd like to get back a single number, 2 in this case. This will be for a large data set with many columns, so I'd like to use rollapply across the whole set instead of simply doing sum(complete.cases(a,b)).
Am I overthinking it?
Thanks!

Did you try sum(complete.cases(x))?!
set.seed(123)
x <- matrix( sample( c(NA,1:5) , 15 , TRUE ) , 5 )
# [,1] [,2] [,3]
#[1,] 1 NA 5
#[2,] 4 3 2
#[3,] 2 5 4
#[4,] 5 3 3
#[5,] 5 2 NA
sum(complete.cases(x))
#[1] 3
To find the complete.cases() of the first two columns:
sum(complete.cases(x[,1:2]))
#[1] 4
And to apply to two columns of a matrix across the whole matrix you could do this:
# Bigger data for example
set.seed(123)
x <- matrix( sample( c(NA,1:5) , 50 , TRUE ) , 5 )
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,] 1 NA 5 5 5 4 5 2 NA NA
#[2,] 4 3 2 1 4 3 5 4 2 1
#[3,] 2 5 4 NA 3 3 4 1 2 2
#[4,] 5 3 3 1 5 1 4 1 2 1
#[5,] 5 2 NA 5 3 NA NA 1 NA 5
# Column indices
id <- seq( 1 , ncol(x) , by = 2 )
#[1] 1 3 5 7 9
apply( cbind(id,id+1) , 1 , function(i) sum(complete.cases(x[,c(i)])) )
#[1] 4 3 4 4 3
complete.cases() works row-wise across the whole data.frame or matrix returning TRUE for those rows which are not missing any data. A minor aside, "c" is a bad variable name because c() is one of the most commonly used functions.
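As a minimal illustration with the two columns from the question (rebuilt here under the name pair, which is my own choice to avoid masking c()), complete.cases() flags the rows that contain no NA and sum() counts them:

```r
# The question's data: columns a and b, four rows
pair <- cbind(a = c(NA, 1, 1, NA),
              b = c(NA, 1, 1, 1))

# One logical per row: TRUE only when no value in that row is NA
is_complete <- complete.cases(pair)
is_complete
# [1] FALSE  TRUE  TRUE FALSE

# Number of complete rows
n_complete <- sum(is_complete)
n_complete
# [1] 2
```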

You can calculate the number of complete cases in neighboring matrix columns using rollapply like this:
m <- matrix(c(NA,1,1,NA,1,1,1,1),ncol=4)
# [,1] [,2] [,3] [,4]
#[1,] NA 1 1 1
#[2,] 1 NA 1 1
library(zoo)
rowSums(rollapply(is.na(t(m)), 2, function(x) !any(x)))
#[1] 0 1 2

This should work for both a matrix and a data.frame:
> sum(apply(c, 1, function(x)all(!is.na(x))))
[1] 2
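and you could simply iterate through a large matrix M, storing one count per pair of adjacent columns:
agreement <- numeric(ncol(M) - 1)
for (i in seq_len(ncol(M) - 1)) {
  # columns i and i+1 (named pair rather than c, so c() is not masked)
  pair <- M[, c(i, i + 1)]
  agreement[i] <- sum(apply(pair, 1, function(x) all(!is.na(x))))
}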
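The same counts over adjacent column pairs can probably be obtained without an explicit loop. The sketch below is one possible vectorized version in base R; the helper name count_adjacent_complete is hypothetical and not part of any package, and it assumes the input has at least two columns.

```r
# Hypothetical helper: complete-case counts for column pairs (1,2), (2,3), ...
count_adjacent_complete <- function(M) {
  ok <- !is.na(as.matrix(M))               # TRUE where a value is present
  left  <- ok[, -ncol(ok), drop = FALSE]   # columns 1 .. n-1
  right <- ok[, -1, drop = FALSE]          # columns 2 .. n
  unname(colSums(left & right))            # rows where both neighbours are present
}

# The small matrix from the rollapply answer
m <- matrix(c(NA, 1, 1, NA, 1, 1, 1, 1), ncol = 4)
count_adjacent_complete(m)
# [1] 0 1 2

# The two columns from the question
ab <- cbind(a = c(NA, 1, 1, NA), b = c(NA, 1, 1, 1))
count_adjacent_complete(ab)
# [1] 2
```

Because it works on a logical matrix in one pass, this avoids calling apply() once per column pair, which may matter for wide data.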

Related

How to order a matrix by the numeric or alphabetic values of the column vectors in R?

The title with the following example should be self-explanatory:
m = unique(replicate(5, sample(1:5, 5, rep=F)), MARGIN = 2)
m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 1 4 3
[2,] 5 1 5 1 2
[3,] 4 3 3 3 1
[4,] 3 4 4 5 5
[5,] 2 2 2 2 4
But what I want is instead:
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 3 4 5
[2,] 5 5 2 1 1
[3,] 3 4 1 3 3
[4,] 4 3 5 5 4
[5,] 2 2 4 2 2
Ideally, I would like to find a method that allows the same process to be carried out when the column vectors are words (alphabetic order).
I tried things like m[ , sort(m)] but nothing did the trick...
m[, order(m[1, ])] will order the columns by the first row. m[, order(m[1, ], m[2, ])] will order by the first row, using the second row as a tie-breaker. Getting fancy, m[, do.call(order, split(m, row(m)))] will order the columns by the first row, using all subsequent rows as tie-breakers. This will work for character data just as well as for numeric data.
set.seed(47)
m = replicate(5, sample(1:5, 5, rep=F))
m
# [,1] [,2] [,3] [,4] [,5]
# [1,] 5 4 1 5 1
# [2,] 2 2 3 2 3
# [3,] 3 5 5 1 2
# [4,] 4 3 2 3 5
# [5,] 1 1 4 4 4
m[, do.call(order, split(m, row(m)))]
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 1 4 5 5
# [2,] 3 3 2 2 2
# [3,] 2 5 5 1 3
# [4,] 5 2 3 3 4
# [5,] 4 4 1 4 1
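For the alphabetic case mentioned in the question, the same call can be tried on a character matrix. The matrix words below is made up for illustration, and the ordering of strings depends on the locale, so this is a sketch that assumes plain lowercase ASCII words:

```r
# Made-up character matrix: each column is a vector of words
words <- matrix(c("pear",  "fig",
                  "apple", "kiwi",
                  "pear",  "date",
                  "apple", "banana"), nrow = 2)

# Order columns by row 1, breaking ties with row 2 (and any later rows)
ord <- do.call(order, split(words, row(words)))
ord
# [1] 4 2 3 1

sorted_words <- words[, ord]
sorted_words
#      [,1]     [,2]    [,3]   [,4]
# [1,] "apple"  "apple" "pear" "pear"
# [2,] "banana" "kiwi"  "date" "fig"
```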

Extracting unique rows in a 3+ column matrix

Using R, I am trying to extract unique rows in a matrix, where a "unique row" is subject to all the values in a given row.
For example if I had this data set:
x = matrix(c(1,1,1,2,2,5,1,2,2,1,2,1,5,3,5,2,1,1),6,3)
Rows 1 & 6, and rows 4 & 5 are duplicated since (1,1,5) = (5,1,1) and (2,1,2) = (2,2,1).
Ultimately, I'm trying to end up with something in the form of:
y = matrix(c(1,1,1,2,1,2,2,1,5,3,5,2),4,3)
or
z = matrix(c(1,1,2,5,2,2,2,1,3,5,1,1),4,3)
The order doesn't matter as long as only one of the unique rows remains. I've searched online, but functions such as unique() and duplicated() have only worked for exact matching rows.
Thanks in advance for any help you provide.
Another answer: use sets. Slightly modified matrix:
library(sets)
x <- matrix(c(1,1,1,2,2,5,5, 1,2,2,1,2,1,5, 5,3,5,2,1,1,1),7,3)
x
[,1] [,2] [,3]
[1,] 1 1 5
[2,] 1 2 3
[3,] 1 2 5
[4,] 2 1 2
[5,] 2 2 1
[6,] 5 1 1
[7,] 5 5 1
If (5,1,1) = (5,5,1) you can use just ordinary sets:
a <- sapply(1:nrow(x), function(i) as.set(x[i,]))
x[!duplicated(a),]
[,1] [,2] [,3]
[1,] 1 1 5
[2,] 1 2 3
[3,] 1 2 5
[4,] 2 1 2
Note: rows 6 and 7 are both gone.
If (5,1,1) != (5,5,1), use generalized sets:
b <- sapply(1:nrow(x), function(i) as.gset(x[i,]))
x[!duplicated(b),]
[,1] [,2] [,3]
[1,] 1 1 5
[2,] 1 2 3
[3,] 1 2 5
[4,] 2 1 2
[5,] 5 5 1
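If installing the sets package is not an option, a base R sketch that should give the generalized-set behaviour (value counts matter, order does not) is to sort each row and call duplicated() on the sorted copy. This assumes the matrix has no NA values, since sort() drops them and the rows would then have unequal lengths. The names x6, sorted_rows and unique_rows are my own.

```r
# The original 6 x 3 matrix from the question
x6 <- matrix(c(1,1,1,2,2,5, 1,2,2,1,2,1, 5,3,5,2,1,1), 6, 3)

# Sort within each row so that (1,1,5) and (5,1,1) become identical
sorted_rows <- t(apply(x6, 1, sort))

# Keep the first occurrence of each sorted row, returning the original rows
unique_rows <- x6[!duplicated(sorted_rows), ]
unique_rows
#      [,1] [,2] [,3]
# [1,]    1    1    5
# [2,]    1    2    3
# [3,]    1    2    5
# [4,]    2    1    2
```

This reproduces the matrix y given in the question.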

replace NA of a matrix with some values

I have this matrix:
mat=matrix(c(1,1,1,2,2,2,3,4,NA,
4,4,4,4,4,3,5,6,4,
3,3,5,5,6,8,0,9,NA,
1,1,1,1,1,4,5,6,1),nrow=4,byrow=TRUE)
print(mat)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] 1 1 1 2 2 2 3 4 NA
# [2,] 4 4 4 4 4 3 5 6 4
# [3,] 3 3 5 5 6 8 0 9 NA
# [4,] 1 1 1 1 1 4 5 6 1
I need to replace the NA values with values derived from another matrix:
mat2=matrix(c(24,1,3,2, 4,4,4,4, 3,2,2,5, 1,3,5,1),nrow=4,byrow=TRUE)
[,1] [,2] [,3] [,4]
[1,] 24 1 3 2
[2,] 4 4 4 4
[3,] 3 2 2 5
[4,] 1 3 5 1
and the subset with the index of the rows with NA of the first matrix "mat":
subset=c(1,3)
I want to replace each NA in the matrix with the column index of the maximum value in the corresponding row of mat2.
In this case, I will have "1" for the first row and "4" for the third one; I don't care about rows 2 and 4.
Use this
mat[subset,9] <- apply(mat2[subset,],1,which.max)
mat[which(is.na(mat))] <- apply(mat2,1,max)[which(is.na(mat), arr.ind = TRUE)[,1]]
This should replace every NA value with the maximum value from the same row in mat2. I don't have an open core to debug on so I hope this works. If you have any questions or it crashes just comment.
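Since the question asks for the column index of the row maximum rather than the maximum itself, here is a sketch that does not need the subset vector. It uses max.col() from base R with ties.method = "first"; the variable na_pos is my own name, and the data are copied from the question.

```r
mat <- matrix(c(1,1,1,2,2,2,3,4,NA,
                4,4,4,4,4,3,5,6,4,
                3,3,5,5,6,8,0,9,NA,
                1,1,1,1,1,4,5,6,1), nrow = 4, byrow = TRUE)
mat2 <- matrix(c(24,1,3,2, 4,4,4,4, 3,2,2,5, 1,3,5,1), nrow = 4, byrow = TRUE)

# (row, col) positions of every NA in mat
na_pos <- which(is.na(mat), arr.ind = TRUE)

# Column index of the maximum in each row of mat2 (first one on ties)
best_col <- max.col(mat2, ties.method = "first")

# Fill each NA with the index that belongs to its own row
mat[na_pos] <- best_col[na_pos[, "row"]]
mat[, 9]
# [1] 1 4 4 1
```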

Add a column with the value of the object with max frequency

I have this matrix:
mat=matrix(c(1,1,1,2,2,2,3,4,
4,4,4,4,4,3,5,6,
3,3,5,5,6,8,0,9,
1,1,1,1,1,4,5,6),nrow=4,byrow=TRUE)
print(mat)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 1 1 2 2 2 3 4
[2,] 4 4 4 4 4 3 5 6
[3,] 3 3 5 5 6 8 0 9
[4,] 1 1 1 1 1 4 5 6
and a subset with the index of the row I want to apply my function:
subset=c(2,4)
I would like to add a new column in the matrix "mat" which contains, only for the subset I specified, the value of the object with the max frequency in the row.
In this case:
for row number 1, I would like to have an empty cell in the new column,
for row number 2, I would like to have the value "4" in the new column,
for row number 3, I would like to have an empty cell in the new column,
for row number 4, I would like to have the value "1" in the new column.
EDIT:
Thanks for the code in the answer!
Now I need to replace the NA values with other values.
I have another matrix:
mat2=matrix(c(24,1,3,2, 4,4,4,4, 3,2,2,5, 1,3,5,1),nrow=4,byrow=TRUE)
[,1] [,2] [,3] [,4]
[1,] 24 1 3 2
[2,] 4 4 4 4
[3,] 3 2 2 5
[4,] 1 3 5 1
and the subset:
subset=c(1,3)
I want to replace the NA values of the matrix (the remaining rows outside the first subset) with the column index of the maximum value in the corresponding row of mat2.
In this case, I will have "1" for the first row and "4" for the third one.
You are looking for the mode. Unfortunately, R doesn't provide a built-in function for the statistical mode (the existing mode() reports an object's storage mode instead). But it is not too hard to write your own:
## create mode function
modeValue <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
## add new column with NA
smat <- cbind(mat, NA)
## calculate mode for subset
smat[subset, ncol(smat)] <- apply(smat[subset, , drop=FALSE], 1, modeValue)
smat
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] 1 1 1 2 2 2 3 4 NA
# [2,] 4 4 4 4 4 3 5 6 4
# [3,] 3 3 5 5 6 8 0 9 NA
# [4,] 1 1 1 1 1 4 5 6 1
Here is a function that will work. It calculates the mode for every row and then sets the result to NA for rows outside the requested subset:
myFunc <- function(x, myRows) {
  # use the argument x, not the global mat
  myModes <- apply(x, 1, FUN=function(i) {
    temp <- table(i)
    as.numeric(names(temp)[which.max(temp)])
  })
  myModes[setdiff(seq.int(nrow(x)), myRows)] <- NA
  myModes
}
For the example, this returns
myFunc(mat, c(2,4))
[1] NA 4 NA 1
To add this to your matrix, just use cbind:
cbind(mat, myFunc(mat, c(2,4)))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 1 1 2 2 2 3 4 NA
[2,] 4 4 4 4 4 3 5 6 4
[3,] 3 3 5 5 6 8 0 9 NA
[4,] 1 1 1 1 1 4 5 6 1
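The two mode approaches above agree on the example, but they can differ when a row has a tie. The sketch below restates both as standalone functions (the names mode_first_seen and mode_smallest are mine) and runs them on a made-up tied vector:

```r
# unique()/tabulate() approach: on ties, returns the value that appears first
mode_first_seen <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# table() approach: on ties, returns the smallest value, because table() sorts
mode_smallest <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

tied <- c(3, 3, 2, 2)   # 3 and 2 both occur twice
mode_first_seen(tied)
# [1] 3
mode_smallest(tied)
# [1] 2
```

Which behaviour is preferable depends on the data; the point is only that the choice is worth making deliberately.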

In R, using `unique()` with extra conditions to extract submatrices: easy solution without plyr

In R, let M be the matrix
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 3 3
[3,] 2 4 5
[4,] 6 7 8
I would like to select the submatrix m
[,1] [,2] [,3]
[1,] 1 3 3
[2,] 2 4 5
[3,] 6 7 8
using unique on M[,1], specifying to keep the row with the maximal value in the second column of M.
At the end, the algorithm should keep row [2,] from the set {[1,], [2,]}. Unfortunately unique() returns a vector with the actual values, and not row numbers, after elimination of duplicates.
Is there a way to get the answer without the package plyr?
Thanks a lot,
Avitus
Here's how:
is.first.max <- function(x) seq_along(x) == which.max(x)
M[as.logical(ave(M[, 2], M[, 1], FUN = is.first.max)), ]
# [,1] [,2] [,3]
# [1,] 1 3 3
# [2,] 2 4 5
# [3,] 6 7 8
You're looking for duplicated.
m <- as.matrix(read.table(text="1 2 3
1 3 3
2 4 5
6 7 8"))
m <- m[order(m[,2], decreasing=TRUE), ]
m[!duplicated(m[,1]),]
# V1 V2 V3
# [1,] 6 7 8
# [2,] 2 4 5
# [3,] 1 3 3
Not the most efficient:
M <- matrix(c(1,1,2,6,2,3,4,7,3,3,5,8),4)
t(sapply(unique(M[,1]),function(i) {temp <- M[M[,1]==i,,drop=FALSE]
temp[which.max(temp[,2]),]
}))
# [,1] [,2] [,3]
#[1,] 1 3 3
#[2,] 2 4 5
#[3,] 6 7 8
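Since the question specifically asks for row numbers, here is one more base R sketch that returns them while keeping the original row order. It combines order() and duplicated() as in the second answer; the names ord and keep_rows are my own.

```r
M <- matrix(c(1,1,2,6, 2,3,4,7, 3,3,5,8), 4)

# Sort row indices by column 1, then by column 2 descending within each group
ord <- order(M[, 1], -M[, 2])

# The first row of each group is now the one with the largest column-2 value
keep_rows <- sort(ord[!duplicated(M[ord, 1])])
keep_rows
# [1] 2 3 4

M[keep_rows, ]
#      [,1] [,2] [,3]
# [1,]    1    3    3
# [2,]    2    4    5
# [3,]    6    7    8
```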
