Tidy way to add column if missing from data frame - r

I am looking for a tidy way to add a missing column if not present in the dataset. For example, df1 does not contain column "c".
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4))
desired output:
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4), c=c(NA, NA, NA, NA))

Assuming you don't want to overwrite the column if it is already present in your data, you can use add_column (from the tibble package) along with an if condition that checks whether the column already exists.
library(dplyr)
library(tibble) # provides add_column()
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4))
if(!'c' %in% names(df1)) df1 <- df1 %>% add_column(c = NA)
df1
# a b c
#1 1 NA NA
#2 2 2 NA
#3 3 3 NA
#4 NA 4 NA

Tidy way would be dplyr::mutate I guess.
library(dplyr)
df1 <- df1 %>%
mutate(c = c(NA))
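There is no need to specify multiple NAs, since the single value is recycled to fill all rows of the data frame. Note that, unlike the add_column approach, this overwrites c if it already exists.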
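If several columns may be missing at once, the same "add only if absent" check can be wrapped in a small helper. This is a minimal sketch in base R; the name add_missing_cols and its arguments are hypothetical and not part of dplyr or tibble.

```r
# Hypothetical helper: add every column named in `cols` that `df` lacks,
# filled with `fill`; columns that already exist are left untouched.
add_missing_cols <- function(df, cols, fill = NA) {
  for (col in setdiff(cols, names(df))) {
    df[[col]] <- rep(fill, nrow(df))
  }
  df
}

df1 <- data.frame(a = c(1:3, NA), b = c(NA, 2:4))
df2 <- add_missing_cols(df1, c("b", "c", "d"))  # "b" exists, "c" and "d" are added
names(df2)
```

Using rep(fill, nrow(df)) rather than a bare NA means the helper should also cope with a zero-row data frame.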

Related

Using dplyr to select rows containing non-missing values in several specified columns

Here is my data
data <- data.frame(a= c(1, NA, 3), b = c(2, 4, NA), c=c(NA, 1, 2))
I wish to select only the rows with no missing data in columns a AND b. For my example, only the first row will be selected.
I could use
data %>% filter(!is.na(a) & !is.na(b))
to achieve my purpose. But I wish to do it using if_any/if_all, if possible. I tried data %>% filter(if_all(c(a, b), !is.na)) but this returns an error. My question is how to do it in dplyr through if_any/if_all.
data %>%
filter(if_all(c(a,b), ~!is.na(.)))
a b c
1 1 2 NA
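The attempt in the question fails because !is.na is not a function: R tries to negate the function object itself. A hedged sketch of two ways to pass "is not NA" as a function (assuming dplyr 1.0.4 or later, where if_all is available):

```r
library(dplyr)  # if_all() assumes dplyr >= 1.0.4

data <- data.frame(a = c(1, NA, 3), b = c(2, 4, NA), c = c(NA, 1, 2))

# `!is.na` applies `!` to the function object itself, which is an error
negating_fails <- inherits(try(!is.na, silent = TRUE), "try-error")

# Two equivalent ways to supply the predicate as a function
res_lambda <- data %>% filter(if_all(c(a, b), ~ !is.na(.x)))
res_negate <- data %>% filter(if_all(c(a, b), Negate(is.na)))
```

Both calls keep only the first row of the example data.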
We could use filter with if_all
library(dplyr)
data %>%
filter(if_all(c(a,b), complete.cases))
Output:
a b c
1 1 2 NA
This could do the trick - use filter_at and all_vars in dplyr:
data %>%
filter_at(vars(a, b), all_vars(!is.na(.)))
Output:
# a b c
#1 1 2 NA
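Note that filter_at and the other scoped verbs are superseded in current dplyr in favour of if_all/if_any. For comparison, the same selection can be sketched in base R by running complete.cases on just the columns of interest (the variable names cols and kept are only illustrative):

```r
data <- data.frame(a = c(1, NA, 3), b = c(2, 4, NA), c = c(NA, 1, 2))

cols <- c("a", "b")                                   # columns that must be non-missing
kept <- data[complete.cases(data[cols]), , drop = FALSE]
kept
```

Column c is not checked, so its NA in the first row does not cause that row to be dropped.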

Selecting Rows with Missing Data in a Range of Columns

There are several ways to identify and manipulate individual cells with missing data in R, e.g., with complete.cases or even rowSums.
However, I've not been able to find---or figure out myself---an expedient way to select rows that have missing data within a subsetted range of columns.
For example, in dataframe df:
df <- data.frame(D1 = c('A', 'B', 'C', 'D'),
D2 = c(NA, 0, 1, 1),
V1 = c(11, NA, 33, NA),
V2 = c(111, 222, NA, NA)
)
df
# D1 D2 V1 V2
# A NA 11 111
# B 0 NA 222
# C 1 33 NA
# D 1 NA NA
I would like to select all rows that have missing data in both columns V1 and V2, thus selecting row D but not rows B or C (or A).
I have a larger range of columns than given in that toy example, so selecting a set of columns with, e.g., && could make for a long command.
N.B., a similar SO question addresses selecting rows where none are NAs.
You can try this:
df %>% filter(is.na(V1) & is.na(V2))
OUTPUT
D1 D2 V1 V2
1 D 1 NA NA
You can use dplyr::if_all. You can select the columns very flexibly with tidyselect, for instance using :, c, starts_with...
library(dplyr)
df %>%
filter(if_all(V1:V2, is.na))
# D1 D2 V1 V2
#1 D 1 NA NA
Also works (this shows the flexibility of tidyselect):
filter(df, if_all(3:4, is.na))
filter(df, if_all(starts_with("V"), is.na))
filter(df, if_all(c(V1, V2), is.na))
filter(df, if_all((last_col()-1):last_col(), is.na))
filter(df, if_all(num_range("V", 1:2), is.na))
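Since the question mentions rowSums, here is a sketch of a base R equivalent that scales to a longer range of columns. It assumes the columns of interest can be picked by a name pattern; the pattern "^V" and the variable names are illustrative only.

```r
df <- data.frame(D1 = c("A", "B", "C", "D"),
                 D2 = c(NA, 0, 1, 1),
                 V1 = c(11, NA, 33, NA),
                 V2 = c(111, 222, NA, NA))

cols <- grep("^V", names(df), value = TRUE)              # "V1" "V2"
all_missing <- rowSums(is.na(df[cols])) == length(cols)  # TRUE where every V column is NA
df[all_missing, , drop = FALSE]
```

Comparing the per-row NA count with length(cols) avoids writing out one is.na() term per column.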

Find last of several columns that is not NA (tidyverse)

Not sure what I'm doing wrong, but I'm struggling to get the per-row index of the last column (among several columns) that is not NA.
Using tidyverse and across, I'm getting as many output columns as input columns where I'd expect one single output column with the index of the respective column.
dat <- data.frame(id = c(1, 2, 3),
x = c(1, NA, NA),
y = c(NA, NA, NA),
z = c(3, 1, NA))
I tried the following (among others, inspired by this one: Return last data frame column which is not NA):
dat %>%
mutate(last = across(-id, ~max.col(!is.na(.x), ties.method="last")))
Expected outcome would be:
id x y z last
1 1 1 NA 3 3
2 2 NA NA 1 3
3 3 NA NA NA NA
The problems with your current flow:
across is going to pass one column at a time to the function/expression; your code needs a row or a matrix/frame. For this, across is not appropriate.
Your desired output of NA for the last row is inconsistent with the logic: !is.na(.x) should return c(F,F,F), which still has a max. Your logic then requires a custom function, since you need to handle it differently.
Try this adaptation of max.col into a custom function:
max.col.notna <- function (m, ties.method = c("random", "first", "last")) {
ties.method <- match.arg(ties.method)
tieM <- which(ties.method == eval(formals()[["ties.method"]]))
out <- .Internal(max.col(as.matrix(m), tieM))
m[] <- !m %in% c(0,NA) # 'm[] <-' is required to maintain the matrix shape
replace(out, rowSums(m) == 0, NA_integer_)
}
dat %>%
mutate(last = max.col.notna(!is.na(select(., -id)), ties.method = "last"))
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
Note: I've edited/changed the function several times, trying to ensure a consistent API to the intent of this custom function. As it stands now, the notna in the function name to me reflects a sense of "emptiness" (either 0 or NA). With this logic, the function is usable with logical (as here) and numeric data. Perhaps it's overkill, but I prefer APIs that operate consistently/predictably across input classes.
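If a custom wrapper around .Internal feels like too much, the same idea can be sketched with the exported max.col plus a separate fix-up step for rows that are entirely NA. This is an alternative under the same logic, not the function above:

```r
dat <- data.frame(id = c(1, 2, 3),
                  x = c(1, NA, NA),
                  y = c(NA, NA, NA),
                  z = c(3, 1, NA))

not_na <- !is.na(dat[-1])                      # logical matrix, one column per x/y/z
last <- max.col(not_na, ties.method = "last")  # last TRUE in each row
last[rowSums(not_na) == 0] <- NA               # all-NA rows have no valid index
dat$last <- last
dat
```

The indices count positions within the selected columns (x = 1, y = 2, z = 3), not positions in the full data frame.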
tidyverse isn't really suitable for row-wise operations. Most of the time, reshaping the data into long format (as shown in @Rui Barradas's answer) is a good approach.
Here is one way using rowwise keeping the data wide.
library(dplyr)
dat %>%
rowwise() %>%
mutate(last = {ind = which(!is.na(c_across(x:z)));
if(length(ind)) tail(ind, 1) else NA})
# id x y z last
# <dbl> <dbl> <lgl> <dbl> <int>
#1 1 1 NA 3 3
#2 2 NA NA 1 3
#3 3 NA NA NA NA
An R base solution:
dat$last = apply(dat[,2:4], 1,
# index of the last non-NA value in each row, or NA if the whole row is NA
FUN = function(x) { i <- which(!is.na(x)); if (length(i)) max(i) else NA_integer_ })
dat
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
You want to use c_across() and rowwise() to do this. rowwise() works similarly to group_by_all(), except it is more explicit. c_across() creates flat vectors out of columns (whereas across() creates tibbles).
If we first define a separate function that returns the index of the last non-NA value, or NA if there are none:
get_last <- function(x){
y <- c(NA,which(!is.na(x)))
y[length(y)]
}
We can then apply that function to c_across() over the variables we need, but only after converting the data into a rowwise_df using rowwise()
dat %>%
rowwise() %>%
mutate(last = get_last(c_across(x:z)))
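To see why prepending NA works, it can help to call the helper on plain vectors first. A small sketch, repeating get_last from above so it runs on its own:

```r
get_last <- function(x) {
  y <- c(NA, which(!is.na(x)))  # the leading NA guarantees y is never empty
  y[length(y)]                  # last index found, or the NA placeholder
}

get_last(c(1, NA, 3))    # last non-NA is at position 3
get_last(c(1, NA, NA))   # last non-NA is at position 1
get_last(c(NA, NA, NA))  # nothing found, so the placeholder NA is returned
```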
base R
df <- data.frame(id = c(1, 2, 3),
x = c(1, NA, NA),
y = c(NA, NA, NA),
z = c(3, 1, NA))
df$last <- apply(df[-1], 1, function(x) max(as.vector(!is.na(x)) * seq_len(length(x))))
df$last[df$last == 0] <- NA
df
#> id x y z last
#> 1 1 1 NA 3 3
#> 2 2 NA NA 1 3
#> 3 3 NA NA NA NA
Created on 2020-12-29 by the reprex package (v0.3.0)
Starting with a vector of NAs, you could step through each col and if the given element passes your check_fun returning TRUE, assign the index of that col to that element. The difference from the other answers here is that this does not check the condition row-wise or create a matrix from the data. Not sure whether creating two new temp vectors for each column is better/worse than just converting the entire data to a matrix first though.
library(tidyverse) # purrr and dplyr
last_matching_ind <- function(dat, check_fun){
check_fun <- as_mapper(check_fun)
reduce2(dat, seq_along(dat), .init = NA_integer_,
function(prev, dat, ind) if_else(check_fun(dat), ind, prev) )
}
dat %>%
mutate(last = last_matching_ind(dat[-1], ~ !is.na(.x)))
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA

Remove rows which have NA into specific columns and conditions

df1 <- data.frame(id = c(1,2,3,4), stock = c("stock2", NA, NA, NA), bill = c("stock3", "bill2", NA, NA))
I would like to remove the rows that have missing values in both columns (stock, bill).
Example output
data.frame(id = c(1,2), stock = c("stock2", NA), bill = c("stock3", "bill2"))
We can use rowSums to create a logical vector in base R
df1[rowSums(is.na(df1[-1])) < ncol(df1)-1,]
# id stock bill
#1 1 stock2 stock3
#2 2 <NA> bill2
Or using filter_at from dplyr
library(dplyr)
df1 %>%
filter_at(-1, any_vars(!is.na(.)))
# id stock bill
#1 1 stock2 stock3
#2 2 <NA> bill2
We can also specify the column names within vars
df1 %>%
filter_at(vars(stock, bill), any_vars(!is.na(.)))
NOTE: This would also work when there are many columns to compare.
Here are two ways using base R or dplyr
# the data frame with your values
df <- data.frame(
id = c(1,2,3,4),
stock = c("stock2", NA, NA, NA),
bill = c("stock3", "bill2", NA, NA)
)
# base R way
df[!(is.na(df$stock) & is.na(df$bill)), ]
# dplyr way
library(dplyr)
filter(df, !(is.na(stock) & is.na(bill)))
We can check for NA values in the dataframe and use apply to select rows which have at least one non-NA value.
df[apply(!is.na(df[-1]), 1, any), ]
# id stock bill
#1 1 stock2 stock3
#2 2 <NA> bill2
We can also use Reduce and lapply with same effect
df[Reduce(`|`, lapply(df[-1], function(x) !is.na(x))), ]
#OR
#df[Reduce(`|`, lapply(df[-1], complete.cases)), ]

Removing empty rows of a data file in R

I have a dataset with empty rows. I would like to remove them:
myData<-myData[-which(apply(myData,1,function(x)all(is.na(x)))),]
It works OK. But now I would like to add a column in my data and initialize the first value:
myData$newCol[1] <- -999
Error in `$<-.data.frame`(`*tmp*`, "newCol", value = -999) :
replacement has 1 rows, data has 0
Unfortunately it doesn't work and I don't really understand why and I can't solve this.
It worked when I removed one line at a time using:
TgData = TgData[2:nrow(TgData),]
Or anything similar.
It also works when I use only the first 13,000 rows.
But it doesn't work with my actual data, which has 32,000 rows.
What did I do wrong? It seems to make no sense to me.
I assume you want to remove rows that are all NAs. Then, you can do the following :
data <- rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(NA, NA, NA), c(4, 8, NA)) # sample data
data
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 NA 4
[3,] 4 6 7
[4,] NA NA NA
[5,] 4 8 NA
data[rowSums(is.na(data)) != ncol(data),]
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 NA 4
[3,] 4 6 7
[4,] 4 8 NA
If you want to remove rows that have at least one NA, just change the condition :
data[rowSums(is.na(data)) == 0,]
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 6 7
If you have empty rows, not NAs, you can do:
data[!apply(data == "", 1, all),]
To remove both (NAs and empty):
data <- data[!apply(is.na(data) | data == "", 1, all),]
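As for why the original command can leave zero rows: one common cause with the -which(...) idiom is that which() returns integer(0) when no row matches, and indexing with -integer(0) selects nothing rather than everything. Whether this is what happened with the asker's data is an assumption; the sketch below only demonstrates the pitfall on made-up data.

```r
myData <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))  # no all-NA rows

all_na <- apply(myData, 1, function(r) all(is.na(r)))
idx <- which(all_na)                  # integer(0): nothing to remove

n_negative <- nrow(myData[-idx, ])    # 0 rows: -integer(0) is still integer(0)
n_logical  <- nrow(myData[!all_na, ]) # 3 rows: logical indexing keeps everything
```

Logical indexing, as in the rowSums and apply versions above, does not have this problem.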
Here are some dplyr options:
# sample data
df <- data.frame(a = c('1', NA, '3', NA), b = c('a', 'b', 'c', NA), c = c('e', 'f', 'g', NA))
library(dplyr)
# remove rows where all values are NA:
df %>% filter_all(any_vars(!is.na(.)))
df %>% filter_all(any_vars(complete.cases(.)))
# remove rows where any value is NA:
df %>% filter_all(all_vars(!is.na(.)))
df %>% filter_all(all_vars(complete.cases(.)))
# or more succinctly:
df %>% filter(complete.cases(.))
df %>% na.omit
# dplyr and tidyr:
library(tidyr)
df %>% drop_na
Alternative solution for rows of NAs using janitor package
myData %>% remove_empty("rows")
This is similar to some of the above answers, but with this, you can specify if you want to remove rows with a percentage of missing values greater-than or equal-to a given percent (with the argument pct)
drop_rows_all_na <- function(x, pct=1) x[!(rowSums(is.na(x)) >= ncol(x)*pct),]
Where x is a data frame and pct is the fraction of NA values at or above which a row is removed.
pct = 1 removes rows in which 100% of the values are NA.
pct = .5 removes rows in which at least half of the values are NA.
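A short usage sketch of the threshold idea on made-up data. The function is restated under the hypothetical name drop_rows_na_pct, with the condition written as "keep rows below the threshold" and drop = FALSE added so that a one-column input stays a data frame:

```r
# Hypothetical restatement: keep rows whose NA count is below ncol(x) * pct
drop_rows_na_pct <- function(x, pct = 1) {
  x[rowSums(is.na(x)) < ncol(x) * pct, , drop = FALSE]
}

# NA counts per row: 0, 1, 2, 4
d <- data.frame(a = c(1, NA, NA, NA),
                b = c(1, 2, NA, NA),
                c = c(1, 2, 3, NA),
                d = c(1, 2, 3, NA))

n_all  <- nrow(drop_rows_na_pct(d, pct = 1))    # drops only the all-NA row
n_half <- nrow(drop_rows_na_pct(d, pct = 0.5))  # also drops the row with 2 of 4 NA
```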
Using dplyr's if_all/if_any
Drop rows with any NA OR Select rows with no NA value.
df %>% filter(!if_any(a:c, is.na))
# a b c
#1 1 a e
#2 3 c g
#Also
df %>% filter(if_all(a:c, Negate(is.na)))
Drop rows with all NA values or select rows with at least one non-NA value.
df %>% filter(!if_all(a:c, is.na))
# a b c
#1 1 a e
#2 <NA> b f
#3 3 c g
#Also
df %>% filter(if_any(a:c, Negate(is.na)))
data
Using data from @sbha -
df <- data.frame(a = c('1', NA, '3', NA),
b = c('a', 'b', 'c', NA),
c = c('e', 'f', 'g', NA))
Here's yet another answer if you just want a handy function wrapper. Also, many of the above solutions remove a row with ANY NAs, whereas this one only removes rows that are ALL NAs.
data <- rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(NA, NA, NA), c(4, 8, NA)) # sample data
data
rmNArows<-function(d){
goodRows<-apply(d,1,function(x) sum(is.na(x))!=ncol(d))
d[goodRows,]
}
rmNArows(data)
