Selecting Rows with Missing Data in a Range of Columns - r

There are several ways to identify and manipulate individual cells with missing data in R, e.g., with complete.cases or even rowSums.
However, I've not been able to find---or figure out myself---an expedient way to select rows that have missing data within a subsetted range of columns.
For example, in dataframe df:
df <- data.frame(D1 = c('A', 'B', 'C', 'D'),
                 D2 = c(NA, 0, 1, 1),
                 V1 = c(11, NA, 33, NA),
                 V2 = c(111, 222, NA, NA))
df
#   D1 D2 V1  V2
# 1  A NA 11 111
# 2  B  0 NA 222
# 3  C  1 33  NA
# 4  D  1 NA  NA
I would like to select all rows that have missing data in both columns V1 and V2, thus selecting row D but not rows B or C (or A).
I have a larger range of columns than in that toy example, so spelling out each column with &, e.g. is.na(V1) & is.na(V2) & ..., could make for a long command.
N.B., a similar SO question addresses selecting rows where none are NAs.

You can try this (using dplyr):
library(dplyr)
df %>% filter(is.na(V1) & is.na(V2))
Output:
  D1 D2 V1 V2
1  D  1 NA NA

You can use dplyr::if_all. You can select the columns very flexibly with tidyselect, for instance using :, c(), starts_with(), etc.
library(dplyr)
df %>%
  filter(if_all(V1:V2, is.na))
# D1 D2 V1 V2
#1 D 1 NA NA
Also works (this shows the flexibility of tidyselect):
filter(df, if_all(3:4, is.na))
filter(df, if_all(starts_with("V"), is.na))
filter(df, if_all(c(V1, V2), is.na))
filter(df, if_all((last_col()-1):last_col(), is.na))
filter(df, if_all(num_range("V", 1:2), is.na))
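The complement is if_any: to get rows with missing data in at least one of the selected columns instead (here rows B, C, and D), swap it in:
filter(df, if_any(V1:V2, is.na))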

Related

Using dplyr to select rows containing non-missing values in several specified columns

Here is my data:
data <- data.frame(a = c(1, NA, 3), b = c(2, 4, NA), c = c(NA, 1, 2))
I wish to select only the rows with no missing data in columns a AND b. For my example, only the first row will be selected.
I could use
data %>% filter(!is.na(a) & !is.na(b))
to achieve my purpose. But I wish to do it using if_any/if_all, if possible. I tried data %>% filter(if_all(c(a, b), !is.na)) but this returns an error. My question is how to do it in dplyr through if_any/if_all.
data %>%
  filter(if_all(c(a, b), ~ !is.na(.)))
  a b  c
1 1 2 NA
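Note that !is.na on its own is not a function object, which is why the original attempt errored; the ~ builds a purrr-style lambda. Base R's Negate() is another way to get a proper function:
data %>%
  filter(if_all(c(a, b), Negate(is.na)))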
We could use filter with if_all
library(dplyr)
data %>%
  filter(if_all(c(a, b), complete.cases))
Output:
  a b  c
1 1 2 NA
This could do the trick: filter_at with all_vars in dplyr (note that filter_at has since been superseded by across/if_all):
data %>%
  filter_at(vars(a, b), all_vars(!is.na(.)))
Output:
# a b c
#1 1 2 NA

Recode missing values in multiple columns: mutate with across and ifelse

I am working with an SPSS file that has been exported as tab delimited. In SPSS, you can set values to represent different types of missing and the dataset has 98 and 99 to indicate missing.
I want to convert them to NA but only in certain columns (V2 and V3 in the example data, leaving V1 and V4 unchanged).
library(dplyr)
testdf <- data.frame(V1 = c(1, 2, 3, 4),
                     V2 = c(1, 98, 99, 2),
                     V3 = c(1, 99, 2, 3),
                     V4 = c(98, 99, 1, 2))
outdf <- testdf %>%
  mutate(across(V2:V3), . = ifelse(. %in% c(98, 99), NA, .))
I haven't used across before and cannot work out how to make mutate write the ifelse result back into the same columns. I suspect I am overthinking this, but can't find any similar examples that use both across and ifelse. I'd like a tidyverse answer, preferably dplyr or tidyr.
You need the syntax to be slightly different to make it work; check ?across for more info.
You need a ~ to create a valid function (or use \(.), or function(.)).
The formula has to go inside the across() call, not after it.
library(dplyr)
testdf %>%
  mutate(across(V2:V3, ~ ifelse(. %in% c(98, 99), NA, .)))
# V1 V2 V3 V4
# 1 1 1 1 98
# 2 2 NA NA 99
# 3 3 NA 2 1
# 4 4 2 3 2
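For reference, the three function forms mentioned above are equivalent (the \(x) shorthand needs R 4.1+):
testdf %>% mutate(across(V2:V3, ~ ifelse(. %in% c(98, 99), NA, .)))
testdf %>% mutate(across(V2:V3, \(x) ifelse(x %in% c(98, 99), NA, x)))
testdf %>% mutate(across(V2:V3, function(x) ifelse(x %in% c(98, 99), NA, x)))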
Note that an alternative is replace:
testdf %>%
  mutate(across(V2:V3, ~ replace(., . %in% c(98, 99), NA)))
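Another option, if you prefer a dedicated helper, is dplyr's na_if(); it replaces one value per call, so nest it for both missing-value codes:
testdf %>%
  mutate(across(V2:V3, ~ na_if(na_if(., 98), 99)))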
Base R option using lapply with an ifelse like this:
cols <- c("V2", "V3")
testdf[, cols] <- lapply(testdf[, cols], function(x) ifelse(x %in% c(98, 99), NA, x))
testdf
#> V1 V2 V3 V4
#> 1 1 1 1 98
#> 2 2 NA NA 99
#> 3 3 NA 2 1
#> 4 4 2 3 2
Base R, using logical matrix indexing (any value greater than 97 in the selected columns becomes NA):
cols <- c("V2", "V3")
testdf[, cols][testdf[, cols] > 97] <- NA

Tidy way to add column if missing from data frame

I am looking for a tidy way to add a missing column if not present in the dataset. For example, df1 does not contain column "c".
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4))
desired output:
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4), c=c(NA, NA, NA, NA))
Assuming you don't want to overwrite the column if it is already present in your data, you can use add_column (from the tibble package) along with an if condition that checks whether the column already exists.
library(dplyr)
library(tibble)
df1 <- data.frame(a = c(1:3, NA), b = c(NA, 2:4))
if (!'c' %in% names(df1)) df1 <- df1 %>% add_column(c = NA)
df1
# a b c
#1 1 NA NA
#2 2 2 NA
#3 3 3 NA
#4 NA 4 NA
Tidy way would be dplyr::mutate I guess.
library(dplyr)
df1 <- df1 %>%
  mutate(c = c(NA))
No need to specify multiple NA as it will be recycled to fill all rows of the data frame.
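If several columns may or may not be present, a small base R sketch handles them all at once (the needed vector here is a hypothetical list of required column names):
needed <- c("b", "c", "d")                          # columns that should exist (assumed)
missing_cols <- setdiff(needed, names(df1))
if (length(missing_cols)) df1[missing_cols] <- NA   # add each missing column, filled with NA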

Elements of list column matching rows in other data.frame

I have the following two data.frames:
df1 <- data.frame(Var1 = c(3, 4, 8, 9),
                  Var2 = c(11, 32, 1, 7))
> df1
Var1 Var2
1 3 11
2 4 32
3 8 1
4 9 7
df2 <- data.frame(ID = c('A', 'B', 'C'),
                  ball = I(list(c("3", "11", "12"), c("4", "1"), c("9", "32"))))
> df2
ID ball
1 A 3, 11, 12
2 B 4, 1
3 C 9, 32
Note that column ball in df2 is a list.
I want to select the ID in df2 with elements in column ball that match a row in df1.
The ideal output would look like this:
> df3
ID ball1 ball2
1 A 3 11
Does anyone have an idea how to do this efficiently? The original data consists of millions of rows in both data.frames.
A data.table solution would work much more quickly than this base R solution but here is a possibility.
Your data:
df1 <- data.frame(Var1 = c(3, 4, 8, 9),
                  Var2 = c(11, 32, 1, 7))
df2 <- data.frame(ID = c('A', 'B', 'C'),
                  ball = I(list(c("3", "11", "12"), c("4", "1"), c("9", "32"))))
The process:
df2$ID <- as.character(df2$ID)  # just in case they are factor levels
n <- nrow(df1) * nrow(df2)      # upper bound on the number of matches, so df3 is big enough
df3 <- data.frame(ID = character(n),
                  Var1 = numeric(n), Var2 = numeric(n),
                  stringsAsFactors = FALSE)  # to make sure we get the ID as a string
count <- 0  # counter
for (i in 1:nrow(df1)) {
  for (j in 1:nrow(df2)) {
    if (all(df1[i, ] %in% df2$ball[[j]])) {
      count <- count + 1
      df3$ID[count] <- df2$ID[j]
      df3$Var1[count] <- df1$Var1[i]
      df3$Var2[count] <- df1$Var2[i]
    }
  }
}
df3_final <- df3[df3$ID != "", ]  # drop the overallocated empty rows; safer than -which(), which keeps zero rows when nothing matches
df3_final
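As a rough sketch of a loop-free variant (same df1/df2 as above), you can build all row pairs once and index, rather than growing df3 element by element; for millions of rows, a keyed data.table join would be the next step up:
pairs <- expand.grid(i = seq_len(nrow(df1)), j = seq_len(nrow(df2)))
hit <- mapply(function(i, j) all(unlist(df1[i, ]) %in% df2$ball[[j]]),
              pairs$i, pairs$j)
df3 <- data.frame(ID    = df2$ID[pairs$j[hit]],
                  ball1 = df1$Var1[pairs$i[hit]],
                  ball2 = df1$Var2[pairs$i[hit]])
df3
#   ID ball1 ball2
# 1  A     3    11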

Removing empty rows of a data file in R

I have a dataset with empty rows. I would like to remove them:
myData <- myData[-which(apply(myData, 1, function(x) all(is.na(x)))), ]
It works OK. But now I would like to add a column in my data and initialize the first value:
myData$newCol[1] <- -999
Error in `$<-.data.frame`(`*tmp*`, "newCol", value = -999) :
replacement has 1 rows, data has 0
Unfortunately it doesn't work and I don't really understand why and I can't solve this.
It worked when I removed one line at a time using:
TgData = TgData[2:nrow(TgData),]
Or anything similar.
It also works when I use only the first 13,000 rows.
But it doesn't work with my actual data, with 32,000 rows.
What did I do wrong? It seems to make no sense to me.
I assume you want to remove rows that are all NAs; the fix is shown after the demonstration below. The reason your code breaks: when no row is all-NA, which() returns integer(0), and negative indexing with an empty vector keeps zero rows rather than all of them, leaving an empty data frame.
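A quick demonstration of that gotcha:
x <- data.frame(a = 1:3, b = 4:6)
x[-which(x$a > 99), ]  # which() gives integer(0), so this returns 0 rows, not all 3
With that in mind, you can remove the all-NA rows as follows: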
data <- rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(NA, NA, NA), c(4, 8, NA)) # sample data
data
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 NA 4
[3,] 4 6 7
[4,] NA NA NA
[5,] 4 8 NA
data[rowSums(is.na(data)) != ncol(data),]
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 NA 4
[3,] 4 6 7
[4,] 4 8 NA
If you want to remove rows that have at least one NA, just change the condition :
data[rowSums(is.na(data)) == 0,]
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 6 7
If you have empty rows, not NAs, you can do:
data[!apply(data == "", 1, all),]
To remove both (NAs and empty):
data <- data[!apply(is.na(data) | data == "", 1, all),]
Here are some dplyr options (note that filter_all is superseded in current dplyr; see the if_any/if_all answer further down):
# sample data
df <- data.frame(a = c('1', NA, '3', NA), b = c('a', 'b', 'c', NA), c = c('e', 'f', 'g', NA))
library(dplyr)
# remove rows where all values are NA:
df %>% filter_all(any_vars(!is.na(.)))
df %>% filter_all(any_vars(complete.cases(.)))
# remove rows where any value is NA (i.e. keep only complete rows):
df %>% filter_all(all_vars(!is.na(.)))
df %>% filter_all(all_vars(complete.cases(.)))
# or more succinctly:
df %>% filter(complete.cases(.))
df %>% na.omit
# dplyr and tidyr:
library(tidyr)
df %>% drop_na
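Note that drop_na also accepts tidyselect column specifications if only some columns should be considered:
df %>% drop_na(a, b)  # drop rows with NA in a or b, ignoring c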
Alternative solution for all-NA rows using the janitor package:
library(janitor)
myData %>% remove_empty("rows")
This is similar to some of the answers above, but with this you can specify whether to remove rows whose share of missing values is greater than or equal to a given percentage (with the argument pct):
drop_rows_all_na <- function(x, pct = 1) x[!rowSums(is.na(x)) >= ncol(x) * pct, ]
where x is a dataframe and pct is the threshold of NA-filled data you want to get rid of:
pct = 1 means remove rows that have 100% of their values NA.
pct = .5 means remove rows that have at least half their values NA.
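For example, with #sbha's df (defined further down):
drop_rows_all_na(df)             # pct = 1: drops only the all-NA fourth row
drop_rows_all_na(df, pct = 1/3)  # drops rows with a third or more NAs (rows 2 and 4)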
Using dplyr's if_all/if_any
Drop rows with any NA OR Select rows with no NA value.
df %>% filter(!if_any(a:c, is.na))
# a b c
#1 1 a e
#2 3 c g
#Also
df %>% filter(if_all(a:c, Negate(is.na)))
Drop rows with all NA values or select rows with at least one non-NA value.
df %>% filter(!if_all(a:c, is.na))
# a b c
#1 1 a e
#2 <NA> b f
#3 3 c g
#Also
df %>% filter(if_any(a:c, Negate(is.na)))
Data (from #sbha above):
df <- data.frame(a = c('1', NA, '3', NA),
                 b = c('a', 'b', 'c', NA),
                 c = c('e', 'f', 'g', NA))
Here's yet another answer if you just want a handy function wrapper. Also, many of the above solutions remove a row with ANY NAs, whereas this one only removes rows that are ALL NAs.
data <- rbind(c(1, 2, 3), c(1, NA, 4), c(4, 6, 7), c(NA, NA, NA), c(4, 8, NA))  # sample data
data
rmNArows <- function(d) {
  goodRows <- apply(d, 1, function(x) sum(is.na(x)) != ncol(d))
  d[goodRows, ]
}
rmNArows(data)
