Using R to select rows in a data set with matching missing observations

I have determined how to identify all unique patterns of missing observations in a data set. Now I would like to select all rows in that data set with a given pattern of missing observations, and to do this iteratively, so that if there are n patterns of missing observations in the data set I end up with n data sets, each containing only one pattern of missing observations.
I know how to do this, but my method is not very efficient and is not general. I am hoping to learn a more efficient and more general approach because my real data sets are much larger and more variable than in the example below.
Here is an example data set and the code I am using. I do not bother to include the code I used to create the matrix zzz from the matrix dd, but can add that code if it helps.
dd <- matrix(c(
   1,  0,  1,  1,
  NA,  1,  1,  0,
  NA,  0,  0,  0,
  NA,  1, NA,  1,
  NA,  1,  1,  1,
   0,  0,  1,  0,
  NA,  0,  0,  0,
   0, NA, NA, NA,
   1, NA, NA, NA,
   1,  1,  1,  1,
  NA,  1,  1,  0),
  nrow = 11, byrow = TRUE)
zzz <- matrix(c(
   1,  1,  1,  1,
  NA,  1,  1,  1,
  NA,  1, NA,  1,
   1, NA, NA, NA),
  nrow = 4, byrow = TRUE)
for (jj in 1:dim(zzz)[1]) {
  ddd <- dd[
    ((dd[, 1] %in% c(0, 1) & zzz[jj, 1] %in% c(0, 1)) |
     (is.na(dd[, 1]) & is.na(zzz[jj, 1]))) &
    ((dd[, 2] %in% c(0, 1) & zzz[jj, 2] %in% c(0, 1)) |
     (is.na(dd[, 2]) & is.na(zzz[jj, 2]))) &
    ((dd[, 3] %in% c(0, 1) & zzz[jj, 3] %in% c(0, 1)) |
     (is.na(dd[, 3]) & is.na(zzz[jj, 3]))) &
    ((dd[, 4] %in% c(0, 1) & zzz[jj, 4] %in% c(0, 1)) |
     (is.na(dd[, 4]) & is.na(zzz[jj, 4]))), ]
  print(ddd)
}
The 4 resulting data sets in this example are:
a)
1 0 1 1
0 0 1 0
1 1 1 1
b)
NA 1 1 0
NA 0 0 0
NA 1 1 1
NA 0 0 0
NA 1 1 0
c)
NA 1 NA 1
d)
0 NA NA NA
1 NA NA NA
Is there a more general and more efficient method of doing the same thing? In the example above the 4 resulting data sets are not saved, but I do save them with my real data.
Thank you for any advice.
Mark Miller

# Missing value patterns (TRUE = missing, FALSE = present)
patterns <- unique(is.na(dd))
result <- list()
for (i in seq_len(nrow(patterns))) {
  # Rows whose NA pattern matches pattern i
  rows <- apply(dd, 1, function(u) all(is.na(u) == patterns[i, ]))
  # drop = FALSE keeps a matrix even when only one row matches
  result <- append(result, list(dd[rows, , drop = FALSE]))
}

Not completely sure I understand the question, but here's a stab at it...
The first thing you want to do is figure out which elements are NA, and which aren't. For that, you can use the is.na() function.
is.na(dd)
will generate a matrix of the same size as dd containing TRUE where the value was NA, and FALSE elsewhere.
You then want to find the unique patterns in your matrix. For that, you want the unique() function; its matrix method accepts a MARGIN argument, which lets you find only the unique rows of a matrix.
zzz <- unique(is.na(dd), MARGIN = 1)
creates a matrix similar to your zzz matrix, except that it contains TRUE where yours has NA and FALSE where yours has 1; you could, of course, substitute NA for each TRUE and 1 for each FALSE to make it identical to your matrix.
You can then go in a few different directions from here to sort these into separate data sets. Unfortunately, I think you're going to need one loop here.
results <- list()
for (r in 1:nrow(dd)) {
  # Index of the pattern (row of zzz) that row r matches
  ind <- which(apply(zzz, 1, function(x) all(x == is.na(dd[r, ]))))
  if (length(results) >= ind && !is.null(results[[ind]])) {
    results[[ind]] <- rbind(results[[ind]], dd[r, ])
  } else {
    results[[ind]] <- dd[r, ]
  }
}
At that point, you have a list which contains all of the rows of dd, sorted by pattern of NAs. You'll find that the pattern expressed in row 1 of zzz will be matched with row 1 of results, and the same for the rest of the rows.
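Beyond the loop above, the whole grouping can also be done in one step by turning each row's NA pattern into a character key and using split() - a base-R sketch of the same idea:

```r
# The example data from the question
dd <- matrix(c(
   1,  0,  1,  1,
  NA,  1,  1,  0,
  NA,  0,  0,  0,
  NA,  1, NA,  1,
  NA,  1,  1,  1,
   0,  0,  1,  0,
  NA,  0,  0,  0,
   0, NA, NA, NA,
   1, NA, NA, NA,
   1,  1,  1,  1,
  NA,  1,  1,  0), nrow = 11, byrow = TRUE)

# One string per row encoding its NA pattern, e.g. "FALSEFALSEFALSEFALSE"
key <- apply(is.na(dd), 1, paste, collapse = "")

# split() groups the row indices by key; then index dd with each group.
# drop = FALSE keeps single-row groups as matrices.
result <- lapply(split(seq_len(nrow(dd)), key),
                 function(i) dd[i, , drop = FALSE])

length(result)  # 4: one matrix per distinct pattern of missing values
```

Because split() handles any number of patterns and columns automatically, nothing in this sketch depends on the shape of dd.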

Is there a way to tell case_when something like "otherwise, leave values as they are"? [duplicate]

In a survey I have two vectors, one containing respondents' answers to a question (which includes NAs), and one that is a dummy for a specific NA code (i.e. it's 1 for all respondents with a specific NA value, such as "don't know" or "don't wish to say").
It could look something like this.
a <- c(0, 1, 2, 3, 4, NA, NA, 7)
b <- c(0, 0, 0, 0, 0, 0, 1, 0)
Now I want to modify a in such a way that it maintains all the observations, but gets assigned a different value (let's say 99) if b=1.
The end result should look something like this.
> a
[1] 0 1 2 3 4 NA 99 7
I can get to that outcome with workaround solutions, but it would be great to know if there is a way to get there in a straightforward manner.
One way using dplyr:
library(dplyr)
a <- c(0, 1, 2, 3, 4, NA, NA, 7)
b <- c(0, 0, 0, 0, 0, 0, 1, 0)
dat <- tibble(
  A = a,
  B = b
)

dat2 <- dat %>%
  mutate(
    A = if_else(B == 1, 99, A)
  )
or, a very simple direct way: a[b == 1] <- 99
Under the assumption that both vectors have the same length, you can create an "index" vector of logicals from b and use it to index a's elements for assignment:
b.index <- b == 1
a[b.index] <- 99
# or in one line
a[b == 1] <- 99
a
[1] 0 1 2 3 4 NA 99 7
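To answer the title question directly: in case_when, a final TRUE ~ ... clause acts as the "otherwise" branch, so handing it the original vector leaves all unmatched values (including NAs) as they are. A sketch using the vectors above (requires dplyr):

```r
library(dplyr)

a <- c(0, 1, 2, 3, 4, NA, NA, 7)
b <- c(0, 0, 0, 0, 0, 0, 1, 0)

# TRUE catches every element not matched by an earlier condition,
# and "~ a" returns that element's original value unchanged
a <- case_when(
  b == 1 ~ 99,
  TRUE   ~ a
)
a
# [1]  0  1  2  3  4 NA 99  7
```

Newer versions of dplyr also provide a .default argument to case_when for the same purpose.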

How to assign values to a new column in R depending on values from other columns

I am creating a new column in my data frame of >900,000 rows, and I want its values to be:
NA if any of the values in a set of columns are NA
df$newcol[is.na(df$somecol_1) | is.na(df$somecol_2) | is.na(df$somecol_3)] <- NA
0 if all of the values in a set of columns are 0
df$newcol[df$somecol_1==0 & df$somecol_2==0 & df$somecol_3==0] <- 0
1 if any of the values in a set of columns are 1 while none is NA. This is the tricky part, as it creates a myriad of combinations with my ten columns. The whole data frame has >50 columns; I have ten columns of interest for this procedure, but here I present only three:
df$newcol[df$somecol_1==1 & df$somecol_2==0 & df$somecol_3==0] <- 1
df$newcol[df$somecol_1==1 & df$somecol_2==1 & df$somecol_3==0] <- 1
df$newcol[df$somecol_1==1 & df$somecol_2==1 & df$somecol_3==1] <- 1
df$newcol[df$somecol_1==0 & df$somecol_2==1 & df$somecol_3==0] <- 1
df$newcol[df$somecol_1==0 & df$somecol_2==1 & df$somecol_3==1] <- 1
df$newcol[df$somecol_1==0 & df$somecol_2==0 & df$somecol_3==1] <- 1
df$newcol[df$somecol_1==1 & df$somecol_2==0 & df$somecol_3==1] <- 1
I have a feeling I am overthinking this; there must be an easier way to handle case 3. Writing out the different combinations of columns as shown above would take forever with ten, and a loop would be too slow given the size of the dataset.
Dummy data:
df <- NULL
df$somecol_1 <- c(1,0,0,NA,0,1,0,NA,1,1)
df$somecol_2 <- c(NA,1,0,0,0,1,0,NA,0,0)
df$somecol_3 <- c(0,0,0,0,0,0,0,0,0,0)
df <- as.data.frame(df)
Based on the above, I want the new column to be
df$newcol <- c(NA,1,0,NA,0,1,0,NA,1,1)
as.numeric(rowSums(df) >= 1)
#[1] NA 1 0 NA 0 1 0 NA 1 1
rowSums will give NA if there are any missing values. It will be 0 if all the values are 0, it will be 1 otherwise (assuming your data is all either NA, 0, or 1).
(Using akrun's sample data)
We can use rowSums
nm1 <- grep('somecol', names(df))
df$newcol <- NA^(rowSums(is.na(df[nm1])) > 0) *(rowSums(df[nm1], na.rm = TRUE) > 0)
df$newcol
#[1] NA 1 0 NA 0 1 0 NA 1 1
data
df <- structure(list(
  somecol_1 = c(1, 0, 0, NA, 0, 1, 0, NA, 1, 1),
  somecol_2 = c(NA, 1, 0, 0, 0, 1, 0, NA, 0, 0),
  somecol_3 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
), class = "data.frame", row.names = c(NA, -10L))
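The NA^(...) expression above may look cryptic; it works because NA^0 is defined to be 1 in R while NA^1 is NA, so the term becomes a multiplier that is 1 for complete rows and NA for rows with any missing value. A quick illustration:

```r
NA^0  # [1] 1  : zero missing values in the row -> multiplier 1
NA^1  # [1] NA : at least one missing value -> result forced to NA

# So NA^(condition) * value yields `value` when the condition is FALSE
# and NA when it is TRUE:
NA^c(FALSE, TRUE) * c(1, 1)
# [1]  1 NA
```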
df$newcol <- ifelse(apply(df, 1, sum) >= 1, 1, 0)
should do the trick. First, apply sums every row; then:
whenever there are NA values in a row (case 1), the sum is NA, so the new entry is NA; when there are only 0s (case 2), the sum is 0 (which isn't >= 1) and the third ifelse argument makes the new entry 0; and when there is at least one 1 (case 3), the sum is at least 1 and the second ifelse argument makes the new entry 1.
Edit: as you want to apply these conditions only to some columns - say columns 1-7, 9, and 23-24 - you can use the code on just that part of the df:
df$newcol <- as.numeric(rowSums(df[, c(1:7, 9, 23:24)]) >= 1)
Note: I used the simplified code from akrun's and Gregor's answers.
If you prefer, there are ways to select columns by name: Extracting specific columns from a data frame

Logical vector across many columns

I am trying to run a logical or statement across many columns in data.table but I am having trouble coming up with the code. My columns have a pattern like the one shown in the table below. I could use a regular logical vector if needed, but I was wondering if I could figure out a way to iterate across a1, a2, a3, etc. as my actual dataset has many "a" type columns.
Thanks in advance.
library(data.table)
x <- data.table(a1 = c(1, 4, 5, 6), a2 = c(2, 4, 1, 10), z = c(9, 10, 12, 12))
# this works but does not work for lots of a1, a2, a3 colnames
# because code is too long and unwieldy
x[a1 == 1 | a2 == 1 , b:= 1]
# this is broken and returns the following error
x[colnames(x)[grep("a", names(x))] == 1, b := 1]
Error in `[.data.table`(x, colnames(x)[grep("a", names(x))] == 1, `:=`(b, :
i evaluates to a logical vector length 2 but there are 4 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
Output looks like below:
a1 a2 z b
1: 1 2 9 1
2: 4 4 10 NA
3: 5 1 12 1
4: 6 10 12 NA
Try using a mask:
x$b <- 0
x[rowSums(ifelse(x[, list(a1, a2)] == 1, 1, 0)) > 0, b := 1]
Now imagine you have 100 a-columns and they are the first 100 columns in your data table. Then you can select the columns by position:
x[rowSums(ifelse(x[, 1:100, with = FALSE] == 1, 1, 0)) > 0, b := 1]
ifelse(x[, list(a1, a2)] == 1, 1, 0) returns a data table that only has the values 1 where there is a 1 in the a columns. Then I used rowSums to sum horizontally, and if any of these sums is > 0, it means there was a 1 in at least one of the columns of a given row, so I simply selected those rows and set b to 1.
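For what it's worth, this can scale to many a-columns without hard-coding positions by selecting them by name pattern; the ifelse() step is also not needed, since the TRUE/FALSE comparison already sums as 1/0. A sketch, assuming data.table is available:

```r
library(data.table)

x <- data.table(a1 = c(1, 4, 5, 6), a2 = c(2, 4, 1, 10), z = c(9, 10, 12, 12))

# Pick the a-columns by regex rather than by position
acols <- grep("^a", names(x), value = TRUE)

# Flag rows where any selected column equals 1
x[rowSums(x[, ..acols] == 1) > 0, b := 1]
x
# rows 1 and 3 get b = 1; the others are left as NA
```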

How to check if any two adjacent elements in a vector are non-zero

I have a list of all possible binary 12-length vectors in R via
all_possible_permutations <- expand.grid(replicate(12, 0:1, simplify = FALSE))
I'd like to flag all vectors where two non-zero cells are adjacent to one another.
So for instance
1 0 1 0 1 0 1 0 1 0 1 0 <- Not Flagged
1 1 0 1 0 1 0 1 0 1 0 1 <- Flagged (due to the first 2)
For any binary vector x, we can use the following logic to detect whether it contains two adjacent 1s:
flag <- function (x) sum(x == 1 & c(diff(x) == 0, FALSE)) > 0
x <- c(1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
flag(x)
#[1] TRUE
x <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
flag(x)
#[1] FALSE
So we can apply this to all columns of your data frame DF:
sapply(DF, flag)
As r2evans commented, this also works:
flag <- function (x) any(x == 1 & c(diff(x) == 0, FALSE))
Using sum gives you a by-product: it tells you the number of matches.
Gosh, you want to apply flag for every row not every column of DF. So I should not use sapply. In this case, let's do a full vectorization:
MAT <- t(DF)
result <- colSums(MAT == 1 & rbind(diff(MAT) == 0, FALSE)) > 0
table(result)
#FALSE TRUE
# 377 3719
In this case colSums cannot be replaced by any. The full vectorization uses more memory, but is probably worthwhile.
You can use rle since the data is binary, i.e. 0s and 1s:
flag <- function(x) any(with(rle(x), lengths[values == 1] > 1))
If the data is not binary, but you still want to check whether two adjacent elements are non-zero:
flag <- function(x) any(with(rle(x > 0), lengths[values] > 1))
which is the generalized version, covering both the binary and the non-binary case.
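To see why the rle() approach works, here is what it returns on a small vector:

```r
x <- c(1, 1, 0, 1, 0, 1)
r <- rle(x)
r$lengths  # 2 1 1 1 1 : how long each run is
r$values   # 1 0 1 0 1 : the value each run repeats
# A run of 1s longer than 1 means two adjacent 1s exist somewhere
any(r$lengths[r$values == 1] > 1)
# [1] TRUE
```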

Merge data frames of different length at unique values (or count number of occurrences) in R?

I have carried out three experiments, each resulting in a list of numbers.
data1 = c(1,1,1,2,2)
data2 = c(2,2,3,3,3,4)
data3 = c(1,1,1,4,4,4,4,4,4, 5, 6)
Now I want to count the occurrences of each number in each of the experiments. I do this with table, since hist uses class mids (though the nice thing about hist would be that I can give it the list of unique values).
# save histograms
result = list()
result$values[[1]] = as.data.frame(table(data1), stringsAsFactors=F)
result$values[[2]] = as.data.frame(table(data2), stringsAsFactors=F)
result$values[[3]] = as.data.frame(table(data3), stringsAsFactors=F)
str(result)
Now I only have a list of data frames of different lengths, but I'd like a single data frame containing columns of the same length (I want to subtract them):
nerv=data.frame(names=c(1, 2, 3, 4, 5, 6))
nerv[[2]] = c(3, 2, 0, 0, 0, 0)
nerv[[3]] = c(0, 2, 3, 1, 0, 0)
nerv[[4]] = c(3, 0, 0, 6, 1, 1)
Is it somehow possible to tell table() which values to count? Or is there another function that counts the values of one list within another (e.g. count unique(data1, data2, data3) in data1)?
Or should I merge the data.frames and fill zeros into empty spaces?
This will generate the data frame:
lev <- unique(c(data1, data2, data3))  # the unique values
data.frame(names = lev,
           do.call(cbind,
                   lapply(list(data1, data2, data3),
                          function(x) table(factor(x, levels = lev)))))
The trick is to transform the numeric vectors to factors with specified levels. The function table uses all levels.
The output:
names X1 X2 X3
1 1 3 0 3
2 2 2 2 0
3 3 0 3 0
4 4 0 1 6
5 5 0 0 1
6 6 0 0 1
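The key point - that table() on a factor reports every level, including values absent from the data - can be seen in isolation:

```r
lev <- c(1, 2, 3, 4)
x <- c(1, 1, 3)

# Without explicit levels, table() only reports values that occur
table(x)
# x
# 1 3
# 2 1

# With explicit levels, absent values get a zero count
tab <- table(factor(x, levels = lev))
tab
# 1 2 3 4
# 2 0 1 0
```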
