Removing all NAs while retaining the most data possible

I have a 37x21 matrix in R which contains many NAs. For my analysis, I need to get rid of all the NAs. I could remove all rows containing an NA, all columns containing an NA, or some combination of the two.
I want to remove specific rows and columns in such a way that I remove all NAs but retain the highest number of data cells possible.
E.g., removing all ROWS with an NA results in a 10x21 matrix (10 x 21 = 210 cells of data), while removing all COLUMNS with an NA results in a 37x12 matrix (37 x 12 = 444 cells of data). But instead of either of these extremes, I want to remove the combination of rows and columns that retains the highest number of cells of data. How would I go about this?

Here is one way using the first algorithm that I could think of. The approach is to iteratively remove a row or column that has at least one NA and the fewest non-NA values in the matrix (so that each removal loses the fewest cells). To do this, I build a data frame of the rows and columns with their counts of NA and non-NA values, along with their dimension and index. At the moment, ties are resolved by deleting rows before columns and earlier indexes before later ones.
I am not sure that this gives the global maximum (it only takes one branch at ties), but it should do better than deleting only rows or only columns. In this example we get 210 by deleting rows and 74 by deleting columns, but 272 with the new approach. The code could also probably be optimised if you need it for much larger matrices or for many more NAs.
set.seed(1)
mat <- matrix(sample(x = c(1:10, NA), size = 37 * 21, replace = TRUE), ncol = 21)
# filter rows
prod(dim(mat[apply(mat, 1, function(x) all(!is.na(x))), ]))
#> [1] 210
# filter cols
prod(dim(mat[, apply(mat, 2, function(x) all(!is.na(x)))]))
#> [1] 74
delete_row_col <- function(m) {
  to_delete <- rbind(
    data.frame(
      dim = "row",
      index = seq_len(nrow(m)),
      nas = rowSums(is.na(m)),
      non_nas = rowSums(!is.na(m)),
      stringsAsFactors = FALSE
    ),
    data.frame(
      dim = "col",
      index = seq_len(ncol(m)),
      nas = colSums(is.na(m)),
      non_nas = colSums(!is.na(m)),
      stringsAsFactors = FALSE
    )
  )
  to_delete <- to_delete[to_delete$nas > 0, ]
  to_delete <- to_delete[to_delete$non_nas == min(to_delete$non_nas), ]
  if (nrow(to_delete) == 0) {
    return(m)
  } else if (to_delete$dim[1] == "row") {
    m <- m[-to_delete$index[1], ]
  } else {
    m <- m[, -to_delete$index[1]]
  }
  return(m)
}
remove_matrix_na <- function(m) {
  while (any(is.na(m))) {
    m <- delete_row_col(m)
  }
  return(m)
}
prod(dim(remove_matrix_na(mat)))
#> [1] 272
Created on 2019-07-06 by the reprex package (v0.3.0)

Here's a way using mixed integer programming (MIP). I have used the ompr package for the mathematical modelling and the open-source GLPK solver. The model is explained in comments in the code. MIP approaches, when successful, guarantee an optimal solution, as indicated by solver_status(model) shown in the code.
This approach will easily scale up to handle large matrices.
library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
set.seed(1)
mat <- matrix(sample(x = c(1:10, NA), size = 37 * 21, replace = TRUE), ncol = 21)
# filtering all rows with NA retains 126 cells
prod(dim(mat[apply(mat, 1, function(x) all(!is.na(x))), , drop = F]))
# [1] 126
# filtering all cols with NA retains 37 cells
prod(dim(mat[, apply(mat, 2, function(x) all(!is.na(x))), drop = F]))
# [1] 37
m <- +!is.na(mat) # gets logical matrix; 0 if NA else 1
nr <- nrow(m)
nc <- ncol(m)
model <- MIPModel() %>%
  # keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
  add_variable(keep[i, j], i = 1:nr, j = 1:nc, type = "binary") %>%
  # rm_row[i] is 1 if row i is selected for removal else 0
  add_variable(rm_row[i], i = 1:nr, type = "binary") %>%
  # rm_col[j] is 1 if column j is selected for removal else 0
  add_variable(rm_col[j], j = 1:nc, type = "binary") %>%
  # maximize good cells kept
  set_objective(sum_expr(keep[i, j], i = 1:nr, j = 1:nc), "max") %>%
  # cells in a row can be kept only when the row is not selected for removal
  add_constraint(sum_expr(keep[i, j], j = 1:nc) <= nc * (1 - rm_row[i]), i = 1:nr) %>%
  # cells in a column can be kept only when the column is not selected for removal
  add_constraint(sum_expr(keep[i, j], i = 1:nr) <= nr * (1 - rm_col[j]), j = 1:nc) %>%
  # an NA cell (m[i,j] == 0) forces its row or its column to be removed
  add_constraint(m[i, j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>%
  # solve using the free glpk solver
  solve_model(with_ROI(solver = "glpk"))
Get the solution:
solver_status(model)
# [1] "optimal"    ("optimal" guarantees optimality)
# get rows to remove
rm_rows <- model %>%
  get_solution(rm_row[i]) %>%
  filter(value > 0) %>%
  pull(i)
# [1] 1 3 4 6 7 8 10 14 18 19 20 21 22 23 24 28 30 33 34 35 37
# get columns to remove
rm_cols <- model %>%
  get_solution(rm_col[j]) %>%
  filter(value > 0) %>%
  pull(j)
# [1] 6 14 15 16 17
result <- mat[-rm_rows, -rm_cols]
# result has retained more cells as compared to
# removing just rows (126) or just columns (37)
prod(dim(result))
# [1] 256
This should also be possible with the lpSolve package, but that involves building the constraint matrix manually, which is cumbersome.
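For what it's worth, here is a rough sketch of what that manual formulation might look like. It reuses the 0/1 matrix m and the nr/nc values from above and mirrors the keep/rm_row/rm_col model; the index helper kix() and the dense constraint matrix are my own construction, so treat it as an illustration of the bookkeeping involved rather than a tested solution.
library(lpSolve)
n_keep <- nr * nc                          # one keep[i,j] variable per cell
n_var  <- n_keep + nr + nc                 # then rm_row[1..nr], rm_col[1..nc]
kix    <- function(i, j) (j - 1L) * nr + i # column-major index of keep[i,j]
# Row constraints: sum_j keep[i,j] + nc * rm_row[i] <= nc
row_con <- t(sapply(1:nr, function(i) {
  v <- numeric(n_var)
  v[kix(i, 1:nc)] <- 1
  v[n_keep + i] <- nc
  v
}))
# Column constraints: sum_i keep[i,j] + nr * rm_col[j] <= nr
col_con <- t(sapply(1:nc, function(j) {
  v <- numeric(n_var)
  v[kix(1:nr, j)] <- 1
  v[n_keep + nr + j] <- nr
  v
}))
# NA constraints: rm_row[i] + rm_col[j] >= 1 for every NA cell
na_cells <- which(m == 0, arr.ind = TRUE)
na_con <- t(apply(na_cells, 1, function(cell) {
  v <- numeric(n_var)
  v[n_keep + cell[1]] <- 1
  v[n_keep + nr + cell[2]] <- 1
  v
}))
res <- lp(direction    = "max",
          objective.in = c(rep(1, n_keep), rep(0, nr + nc)), # maximise kept cells
          const.mat    = rbind(row_con, col_con, na_con),
          const.dir    = c(rep("<=", nr + nc), rep(">=", nrow(na_cells))),
          const.rhs    = c(rep(nc, nr), rep(nr, nc), rep(1, nrow(na_cells))),
          all.bin      = TRUE)
rm_rows <- which(res$solution[n_keep + (1:nr)] > 0.5)
rm_cols <- which(res$solution[n_keep + nr + (1:nc)] > 0.5)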

Introducing missing values using number of IDs in R randomly

I have a dataset with many facilities, each with a unique FacilityID and variables clustered at the facility level. I would like to select a number of randomly chosen IDs and then introduce missing values for a given number of reported values within each facility.
Below is a sample of the dataset.
h <- data.frame(cbind(FacilityID = rep(1:5,each=12),X1=rnorm(60,0,1)))
The data has 5 FacilityIDs with 12 values reported for each ID for a variable X1.
I would like to perform the following:
For 2 IDs selected randomly, 3 missing values are assigned randomly within the IDs
For 1 ID selected randomly, 4 missing values are assigned randomly within the IDs
Here is a tidyverse solution.
Use sample to pick the 3 IDs. Within a group, sample(row_number()) <= n randomly selects n rows, so <= 3 marks 3 rows and <= 4 marks 4.
library(tidyverse)
ids <- sample(unique(h$FacilityID), 3)
h %>%
  group_by(FacilityID) %>%
  mutate(
    X1 = case_when(
      FacilityID %in% ids[1:2] & sample(row_number()) <= 3 ~ NA_real_,
      FacilityID %in% ids[3] & sample(row_number()) <= 4 ~ NA_real_,
      TRUE ~ X1
    )
  )
It's not clear whether you want these two operations to be performed together or individually.
Individually you could do:
# Set 3 values from 2 IDs to NA
for (i in sample(unique(h$FacilityID), 2)) {
  h$X1[sample(which(h$FacilityID == i), 3)] <- NA
}
# Set 4 values from 1 ID to NA:
h$X1[sample(which(h$FacilityID == sample(unique(h$FacilityID), 1)), 4)] <- NA
If you want to perform both operations at once on the same data set you can do:
IDs <- sample(unique(h$FacilityID), 3)
for (i in IDs) {
  if (i == IDs[3]) {
    h$X1[sample(which(h$FacilityID == i), 4)] <- NA
  } else {
    h$X1[sample(which(h$FacilityID == i), 3)] <- NA
  }
}
Base R:
# Set seed for reproducibility:
set.seed(2020)
# Store no_nas, the number of nas to introduce per facility: no_nas => integer vector
no_nas <- c(rep(3, 2), 4)
# Store n, the number of facilities to sample: n => integer scalar
n <- length(no_nas)
# Subset the data.frame to the randomly sampled FacilityIDs and
# assign NA values within each: facidsample => data.frame
facidsample <- do.call(rbind, Map(function(x, y) {
  i <- h[h$FacilityID == x, ]
  i$X1[sample(seq_len(nrow(i)), y)] <- NA_real_
  i
}, sample(unique(h$FacilityID), n), no_nas))
# Combine sampled data with the original set less the nullified entries: j => data.frame
j <- rbind(h[h$FacilityID %in% setdiff(h$FacilityID, facidsample$FacilityID), ],
           facidsample)
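A quick sanity check on the result j is to count the NAs introduced per facility; two facilities should show 3, one should show 4, and the rest 0:
tapply(is.na(j$X1), j$FacilityID, sum)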

Delete rows after a negative value in multiple data frames

I have multiple data frames, each an individual sequence consisting of the same columns. I need to delete all the rows after a negative value is encountered in the column "OnsetTime": not the row containing the negative value itself, but every row after it. All sequences have 16 rows in total.
I think it should be possible with a loop, but I have no experience with loops in R, and I have 499 data frames for which I am currently deleting the rows of each sequence one by one, like this:
sequence_6 <- sequence_6[-c(11:16), ]
sequence_7 <- sequence_7[-c(11:16), ]
sequence_9 <- sequence_9[-c(6:16), ]
Is there a faster way of doing this? An example of a sequence can be seen here: example sequence.
Regarding this example, I want to delete rows 7 through 16.
Data
Since the odd web configuration at work prevents me from accessing your data, I created three dataframes based on random numbers
set.seed(123); data_1 <- data.frame( value = runif(25, min = -0.1) )
set.seed(234); data_2 <- data.frame( value = runif(20, min = -0.1) )
set.seed(345); data_3 <- data.frame( value = runif(30, min = -0.1) )
First, you could create a list containing all your dataframes:
list_df <- list(data_1, data_2, data_3)
Now you can go through this list with a for loop. Since there are several steps, I find it convenient to use the package dplyr because it allows for a more readable notation:
library(dplyr)
for (i in 1:length(list_df)) {
  min_row <-
    list_df[[i]] %>%
    mutate(id = row_number()) %>% # add a column with the row number
    filter(value < 0) %>%         # keep the rows with negative values
    summarise(min(id)) %>%        # get the first such row number
    as.numeric()                  # turn this into a scalar (not a dataframe)
  list_df[[i]] <- list_df[[i]] %>% slice(1:min_row) # keep rows 1 to min_row
}
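One caveat: if a data frame contains no negative values, filter() returns zero rows, min(id) becomes Inf (with a warning), and slice(1:min_row) errors. A small guard on the last step of the loop avoids this:
# only trim when a negative value was actually found
if (is.finite(min_row)) {
  list_df[[i]] <- list_df[[i]] %>% slice(1:min_row)
}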
Hope it helps!
We can get the datasets into a list, assuming that the object names start with 'sequence' followed by an underscore and one or more digits. Then use lapply to loop over the list and subset the rows based on the condition:
lst1 <- lapply(mget(ls(pattern = "^sequence_\\d+$")), function(x) {
  i1 <- Reduce(`|`, lapply(x, `<`, 0))
  # or use rowSums
  # i1 <- rowSums(x < 0) > 0
  i2 <- which(i1)[1]
  x[seq(i2), ]
})
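If you need the trimmed data frames back as individual objects rather than as a list, you can write the list back into the global environment:
list2env(lst1, envir = .GlobalEnv)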
data
set.seed(42)
sequence_6 <- as.data.frame(matrix(sample(-1:10, 16 * 5, replace = TRUE), nrow = 16))
sequence_7 <- as.data.frame(matrix(sample(-2:10, 16 * 5, replace = TRUE), nrow = 16))
sequence_9 <- as.data.frame(matrix(sample(-2:10, 16 * 5, replace = TRUE), nrow = 16))

Find the Most Recent Matching in an Array [R]

Imagine an array of numbers called A. At each position of A, you want to find the index of the most recent earlier item with a matching value. You could easily do this with a for loop as follows:
A <- c(1, 1, 2, 2, 1, 2, 2)
answer <- rep(NA, length(A)) # initialise the result vector
for (i in seq_along(A)) {
  if (i > 1 && sum(A[1:(i - 1)] == A[i]) > 0) {
    answer[i] <- max(which(A[1:(i - 1)] == A[i]))
  } else {
    answer[i] <- NA
  }
}
But I want to vectorize this for loop (because I'll be applying this principle to a very large data set). I tried using sapply:
answer = sapply(A, FUN = function(x){max(which(A == x))})
As you can see, I need some way of reducing the array to only values that come before x. Any advice?
We can use seq_along to loop over the index of each element, then subset the vector and get the max index where the value last occurred.
c(NA, sapply(seq_along(A)[-1], function(x) max(which(A[1:(x-1)] == A[x]))))
#[1] NA 1 -Inf 3 2 4 6
We can change the -Inf to NA if needed in that format
inds <- c(NA, sapply(seq_along(A)[-1], function(x) max(which(A[1:(x-1)] == A[x]))))
inds[is.infinite(inds)] <- NA
inds
#[1] NA 1 NA 3 2 4 6
The above method gives a warning; to avoid it, we can add a check on the length:
c(NA, sapply(seq_along(A)[-1], function(x) {
  inds <- which(A[1:(x - 1)] == A[x])
  if (length(inds) > 0) max(inds) else NA
}))
#[1] NA 1 NA 3 2 4 6
Here's an approach with dplyr which is more verbose, but easier for me to grok. We start with recording the row_number, make a group for each number we encounter, then record the prior matching row.
library(dplyr)
A2 <- A %>%
  as_tibble() %>%
  mutate(row = row_number()) %>%
  group_by(value) %>%
  mutate(last_match = lag(row)) %>%
  ungroup()
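The last_match column then holds the same result as the other answers: A2$last_match is NA 1 NA 3 2 4 6. (Newer tibble versions discourage as_tibble() on a bare vector; tibble(value = A) builds the same starting tibble.)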
You can do:
sapply(seq_along(A) - 1, function(x)
  ifelse(any(a <- A[x + 1] == A[sequence(x)]), max(which(a)), NA))
[1] NA 1 NA 3 2 4 6
Here's a function that I made (based upon Ronak's answer):
lastMatch = function(A){
uniqueItems = unique(A)
firstInstances = sapply(uniqueItems, function(x){min(which(A == x))}) #for NA
notFirstInstances = setdiff(seq(A),firstInstances)
lastMatch_notFirstInstances = sapply(notFirstInstances, function(x) max(which(A[1:(x-1)] == A[x])))
X = array(0, dim = c(0, length(A)))
X[firstInstances] = NA
X[notFirstInstances] = lastMatch_notFirstInstances
return(X)
}
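A quick check against the earlier results:
lastMatch(A)
# [1] NA  1 NA  3  2  4  6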

How to find values less than -1 in each row for every 12 columns in R?

I have a 100x120 matrix and I am trying to find values <= -1 in each row, separately for every 12 columns. I have tried several times but failed. It is easy to find which values are <= -1, but I do not know how to handle every 12 columns separately and store the results for each row. Thanks for any help.
set.seed(100)
Mydata <- sample(x = -3:3, size = 100 * 120, replace = TRUE)
Mydata <- matrix(data = Mydata, nrow = 100, ncol = 120)
results <- which(Mydata <= -1, arr.ind = TRUE)
You can use the apply function to apply the which function across each column for each row at a time. If I misinterpreted what you wanted, you can adjust the MARGIN argument accordingly.
# MARGIN = 1 to apply across rows
dd <- apply(Mydata, MARGIN = 1, function(x) which(x <= -1))
dd[1] # which columns in row 1 have a value <= -1
You can do this using a combination of apply functions and seq()
#Example Data
set.seed(100)
Mydata <- sample(x = -3:3, size = 100 * 120, replace = TRUE)
Mydata <- matrix(data = Mydata, nrow = 100, ncol = 120)
# Solution:
Myseq <- sapply(0:9, function(x) seq(1, 12, 1) + 12 * x) # 12 x 10 matrix of column indices
sapply(1:dim(Myseq)[2], function(x) which(Mydata[, Myseq[, x]] <= -1))
This results in a list with:
each element representing one of your 10 groups of 12 columns
each value within an element giving the position (within that 100x12 submatrix) of any value less than or equal to -1 in those 12 columns.
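If what you ultimately need is a count per row within each 12-column block rather than the positions, a compact sketch (assuming the blocks are columns 1-12, 13-24, and so on) is:
# count values <= -1 per row, separately for each block of 12 columns
blocks <- split(1:120, rep(1:10, each = 12))
counts <- sapply(blocks, function(cols) rowSums(Mydata[, cols] <= -1))
dim(counts)
# [1] 100  10    (one row per matrix row, one column per block)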

Joint Occurrence of variables in R

I want to count the individual and combined occurrence of variables (1 represents presence and 0 represents absence). This can be obtained by multiple uses of the table function (see MWE below). Is there a more efficient approach to get the required output given below?
set.seed(12345)
A <- rbinom(n = 100, size = 1, prob = 0.5)
B <- rbinom(n = 100, size = 1, prob = 0.6)
C <- rbinom(n = 100, size = 1, prob = 0.7)
df <- data.frame(A, B, C)
table(A)
A
0 1
48 52
table(B)
B
0 1
53 47
table(C)
C
0 1
34 66
table(A, B)
B
A 0 1
0 25 23
1 28 24
table(A, C)
C
A 0 1
0 12 36
1 22 30
table(B, C)
C
B 0 1
0 21 32
1 13 34
table(A, B, C)
, , C = 0
B
A 0 1
0 8 4
1 13 9
, , C = 1
B
A 0 1
0 17 19
1 15 15
Required Output
I am requiring something like the following:
A = 52
B = 47
C = 66
A + B = 24
A + C = 30
B + C = 34
A + B + C = 15
Expanding on Sumedh's answer, you can also do this dynamically without having to specify the filter every time. This will be useful if you have more than 3 columns to combine.
You can do something like this:
lapply(seq_len(ncol(df)), function(i) {
  # Generate all the combinations of i elements of the column names
  tmp_i <- utils::combn(names(df), i)
  # Each column of tmp_i holds the elements of one combination
  apply(tmp_i, 2, function(x) {
    dynamic_formula <- as.formula(paste("~", paste(x, "== 1", collapse = " & ")))
    df %>%
      filter_(.dots = dynamic_formula) %>%
      summarize(Count = n()) %>%
      mutate(type = paste0(sort(x), collapse = ""))
  }) %>%
    bind_rows()
}) %>%
  bind_rows()
This will:
1) generate all the combinations of the columns of df: first the combinations with one element (A, B, C), then the ones with two elements (AB, AC, BC), etc. This is the outer lapply.
2) for every combination, build a dynamic formula. For AB, for instance, the formula will be A == 1 & B == 1, exactly as Sumedh suggested. This is the dynamic_formula bit.
3) filter the dataframe with the dynamically generated formula and count the number of rows.
4) bind everything together (the two bind_rows).
The output will be
Count type
1 52 A
2 47 B
3 66 C
4 24 AB
5 30 AC
6 34 BC
7 15 ABC
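Note that filter_() has since been deprecated. A sketch of the same loop in current dplyr, using if_all() in place of the pasted formula, would be:
lapply(seq_len(ncol(df)), function(i) {
  apply(utils::combn(names(df), i), 2, function(x) {
    df %>%
      filter(if_all(all_of(x), ~ .x == 1)) %>% # keep rows where all chosen columns are 1
      summarize(Count = n()) %>%
      mutate(type = paste0(sort(x), collapse = ""))
  }) %>%
    bind_rows()
}) %>%
  bind_rows()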
EDITED TO ADD: I see now that you don't want to get the exclusive counts (i.e. A and AB should both include all As).
I got more than a little nerd-sniped by this today, particularly as I wanted to solve it using base R with no packages. The below should do that.
There is a very easy (in principle) solution that simply uses xtabs(), which I've illustrated below. However, to generalize it for any potential number of dimensions, and then to apply it to a variety of combinations, actually was harder. I strove to avoid using the dreaded eval(parse()).
set.seed(12345)
A <- rbinom(n = 100, size = 1, prob = 0.5)
B <- rbinom(n = 100, size = 1, prob = 0.6)
C <- rbinom(n = 100, size = 1, prob = 0.7)
df <- data.frame(A, B, C)
# Turn strings off
options(stringsAsFactors = FALSE)
# Obtain the n-way frequency table
# This table can be directly subset using []
# It is a little tricky to pass the arguments
# I'm trying to avoid eval(parse())
# But still give a solution that isn't bound to a specific size
xtab_freq <- xtabs(formula = formula(x = paste("~", paste(names(df), collapse = " + "))),
                   data = df)
# Demonstrating what I mean
# All A
sum(xtab_freq["1",,])
# [1] 52
# AC
sum(xtab_freq["1",,"1"])
# [1] 30
# Using lapply(), we pass names(df) to combn() with m values of 1, 2, and 3
# The output of combn() goes through list(), then is unlisted with recursive FALSE
# This gives us a list of vectors
# Each one being a combination in which we are interested
lst_combs <- unlist(lapply(X = 1:3, FUN = combn, x = names(df), list), recursive = FALSE)
# For nice output naming, I just paste the values together
names(lst_combs) <- sapply(X = lst_combs, FUN = paste, collapse = "")
# This is a function I put together
# Generalizes process of extracting values from a crosstab
# It does it in this fashion to avoid eval(parse())
uFunc_GetMargins <- function(crosstab, varvector, success) {
  # Obtain the dimname-names (the names within each dimension)
  # From that, get the regular dimnames
  xtab_dnn <- dimnames(crosstab)
  xtab_dn <- names(xtab_dnn)
  # Use match() to get a numeric vector for the margins
  # This can be used in margin.table()
  tgt_margins <- match(x = varvector, table = xtab_dn)
  # Obtain a margin table
  marginal <- margin.table(x = crosstab, margin = tgt_margins)
  # To extract the value, figure out which marginal cell contains
  # all variables of interest set to success
  # sapply() goes over all the elements of the dimname names
  # and finds the numeric index in that dimension where the name == success
  # We subset the resulting vector by tgt_margins
  # (to only get the cells in our marginal table)
  # Then, use prod() to multiply them together and get the location
  tgt_cell <- prod(sapply(X = xtab_dnn,
                          FUN = match,
                          x = success)[tgt_margins])
  # Return as named list for ease of stacking
  return(list(count = marginal[tgt_cell]))
}
# Doing a call of mapply() lets us get the results
do.call(what = rbind.data.frame,
        args = mapply(FUN = uFunc_GetMargins,
                      varvector = lst_combs,
                      MoreArgs = list(crosstab = xtab_freq,
                                      success = "1"),
                      SIMPLIFY = FALSE,
                      USE.NAMES = TRUE))
# count
# A 52
# B 47
# C 66
# AB 24
# AC 30
# BC 34
# ABC 15
I ditched the prior solution that used aggregate.
Using dplyr,
Occurrence of only A:
library(dplyr)
df %>% filter(A == 1) %>% summarise(Total = nrow(.))
Occurrence of A and B:
df %>% filter(A == 1, B == 1) %>% summarise(Total = nrow(.))
Occurrence of A, B, and C:
df %>% filter(A == 1, B == 1, C == 1) %>% summarise(Total = nrow(.))
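For single combinations, the same counts are also available in base R without filtering, e.g. sum(df$A == 1) for A alone, or sum(df$A == 1 & df$B == 1 & df$C == 1) for the joint occurrence of all three.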
