Remove rows which have NA in specific columns and conditions - r

data.frame(id = c(1,2,3,4), stock = c("stock2", NA, NA, NA), bill = c("stock3", "bill2", NA, NA))
I would like to remove the rows that have missing values in both columns (stock, bill).
Example output:
data.frame(id = c(1,2), stock = c("stock2", NA), bill = c("stock3", "bill2"))

We can use rowSums to create a logical vector in base R
df1[rowSums(is.na(df1[-1])) < ncol(df1)-1,]
# id stock bill
#1 1 stock2 stock3
#2 2 <NA> bill2
Or using filter_at from dplyr
library(dplyr)
df1 %>%
filter_at(-1, any_vars(!is.na(.)))
# id stock bill
#1 1 stock2 stock3
#2 2 <NA> bill2
We can also specify the column names within vars
df1 %>%
filter_at(vars(stock, bill), any_vars(!is.na(.)))
NOTE: This would also work when there are many columns to compare.
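As a minimal base R sketch of the same idea with the columns named explicitly: it rebuilds df1 from the question's data, and cols is a hypothetical name introduced here for the columns to check.

```r
# df1 is assumed to be the data frame from the question
df1 <- data.frame(
  id    = c(1, 2, 3, 4),
  stock = c("stock2", NA, NA, NA),
  bill  = c("stock3", "bill2", NA, NA)
)

# cols is a hypothetical helper vector: the columns to check for NA
cols <- c("stock", "bill")

# keep rows that have at least one non-NA value among cols
kept <- df1[rowSums(!is.na(df1[cols])) > 0, ]
kept
```

Because the columns are selected by name, extending cols is enough when more columns need to be compared.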

Here are two ways using base R or dplyr
# the data frame with your values
df <- data.frame(
id = c(1,2,3,4),
stock = c("stock2", NA, NA, NA),
bill = c("stock3", "bill2", NA, NA)
)
# base R way
df[!(is.na(df$stock) & is.na(df$bill)), ]
# dplyr way
library(dplyr)
filter(df, !(is.na(stock) & is.na(bill)))

We can check for NA values in the dataframe and use apply to select rows which have at least one non-NA value.
df[apply(!is.na(df[-1]), 1, any), ]
# id stock bill
#1 1 stock2 stock3
#2 2 <NA> bill2
We can also use Reduce and lapply to the same effect
df[Reduce(`|`, lapply(df[-1], function(x) !is.na(x))), ]
#OR
#df[Reduce(`|`, lapply(df[-1], complete.cases)), ]

Related

Using dplyr to select rows containing non-missing values in several specified columns

Here is my data
data <- data.frame(a= c(1, NA, 3), b = c(2, 4, NA), c=c(NA, 1, 2))
I wish to select only the rows with no missing data in column a AND b. For my example, only the first row will be selected.
I could use
data %>% filter(!is.na(a) & !is.na(b))
to achieve my purpose. But I wish to do it using if_any/if_all, if possible. I tried data %>% filter(if_all(c(a, b), !is.na)) but this returns an error. My question is how to do it in dplyr through if_any/if_all.
data %>%
filter(if_all(c(a,b), ~!is.na(.)))
a b c
1 1 2 NA
We could use filter with if_all
library(dplyr)
data %>%
filter(if_all(c(a,b), complete.cases))
-output
a b c
1 1 2 NA
This could do the trick - use filter_at and all_vars in dplyr:
data %>%
filter_at(vars(a, b), all_vars(!is.na(.)))
Output:
# a b c
#1 1 2 NA
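The attempt with !is.na in the question fails because ! is applied to the function object itself rather than producing a new function. A base R sketch of two alternatives, assuming the same data as in the question (complete_ab and via_negate are names made up for this example):

```r
data <- data.frame(a = c(1, NA, 3), b = c(2, 4, NA), c = c(NA, 1, 2))

# complete.cases() is TRUE for rows with no NA in the supplied columns
complete_ab <- data[complete.cases(data[c("a", "b")]), ]

# Negate(is.na) builds a real function that returns !is.na(x)
not_na <- Negate(is.na)
via_negate <- data[Reduce(`&`, lapply(data[c("a", "b")], not_na)), ]

complete_ab
```

Both results should contain only the first row, since it is the only one where a and b are both present.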

Selecting Rows with Missing Data in a Range of Columns

There are several ways to identify and manipulate individual cells with missing data in R, e.g., with complete.cases or even rowSums.
However, I've not been able to find---or figure out myself---an expedient way to select rows that have missing data within a subsetted range of columns.
For example, in dataframe df:
df <- data.frame(D1 = c('A', 'B', 'C', 'D'),
D2 = c(NA, 0, 1, 1),
V1 = c(11, NA, 33, NA),
V2 = c(111, 222, NA, NA)
)
df
#   D1 D2 V1  V2
# 1  A NA 11 111
# 2  B  0 NA 222
# 3  C  1 33  NA
# 4  D  1 NA  NA
I would like to select all rows that have missing data in both columns V1 and V2, thus selecting row D but not rows B or C (or A).
I have a larger range of columns than given in that toy example, so chaining one is.na() condition per column with, e.g., & could make for a long command.
N.B., a similar SO question addresses selecting rows where none are NAs.
You can try this:
df %>% filter(is.na(V1) & is.na(V2))
OUTPUT
D1 D2 V1 V2
1 D 1 NA NA
You can use dplyr::if_all. You can select the columns very flexibly with tidyselect, for instance using :, c, starts_with...
library(dplyr)
df %>%
filter(if_all(V1:V2, is.na))
# D1 D2 V1 V2
#1 D 1 NA NA
Also works (this shows the flexibility of tidyselect):
filter(df, if_all(3:4, is.na))
filter(df, if_all(starts_with("V"), is.na))
filter(df, if_all(c(V1, V2), is.na))
filter(df, if_all((last_col()-1):last_col(), is.na))
filter(df, if_all(num_range("V", 1:2), is.na))
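For comparison, the same selection can be sketched in base R with rowSums. This assumes the toy df from the question; choosing the columns with a "^V" name pattern is only an illustrative assumption, and cols and all_missing are names made up here.

```r
df <- data.frame(
  D1 = c("A", "B", "C", "D"),
  D2 = c(NA, 0, 1, 1),
  V1 = c(11, NA, 33, NA),
  V2 = c(111, 222, NA, NA)
)

# pick the range of columns by name pattern (assumption for this sketch)
cols <- grep("^V", names(df), value = TRUE)

# a row qualifies when its NA count equals the number of checked columns
all_missing <- df[rowSums(is.na(df[cols])) == length(cols), ]
all_missing
```

The command stays the same length however many columns match the pattern.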

If statement with two conditions and NA

I am looking to use a conditional statement to access date rows which are before 0021-01-11 and have NA value in a specific column (People_vaccinated for example). For those rows I wanted to impute with zero.
I want to use an IF statement with (condition1 AND condition 2).
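Condition 1 can be is.na(df$People_vaccinated) (a comparison such as df$People_vaccinated == NA always returns NA, so it cannot serve as a test) and condition 2 can be df$date < 'given date'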
Maybe this will help -
df <- data.frame(Date = c('0021-01-07', '0021-01-08','0021-01-11', '0021-01-12'),
a = c(2, NA, 3, NA),
b = c(1, NA, 2, 3))
ind <- match('0021-01-11', df$Date)
df$a[1:ind][is.na(df$a[1:ind])] <- 0
df
# Date a b
#1 0021-01-07 2 1
#2 0021-01-08 0 NA
#3 0021-01-11 3 2
#4 0021-01-12 NA 3
Or using dplyr -
library(dplyr)
df <- df %>%
mutate(a = replace(a,
row_number() <= match('0021-01-11', Date) & is.na(a), 0))
df
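The match() approach relies on the rows being sorted by date and also covers the row of the cutoff date itself. If the two conditions should be combined directly, a sketch along these lines might be closer to the question; it assumes the Date column holds ISO-formatted strings that as.Date can parse, and cutoff, dates and impute are names introduced for this example.

```r
df <- data.frame(
  Date = c("0021-01-07", "0021-01-08", "0021-01-11", "0021-01-12"),
  a = c(2, NA, 3, NA),
  b = c(1, NA, 2, 3)
)

cutoff <- as.Date("0021-01-11")  # cutoff date taken from the question
dates  <- as.Date(df$Date)       # assumes ISO "YYYY-MM-DD" strings

# condition 1 AND condition 2, combined with the vectorized &
impute <- is.na(df$a) & dates < cutoff
df$a[impute] <- 0
df
```

Here only rows strictly before the cutoff are imputed; use <= instead of < if the cutoff date should be included.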

Tidy way to add column if missing from data frame

I am looking for a tidy way to add a missing column if not present in the dataset. For example, df1 does not contain column "c".
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4))
desired output:
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4), c=c(NA, NA, NA, NA))
Assuming you don't want to overwrite the column if it is already present in your data, you can use add_column (from the tibble package) along with an if condition to check whether the column is already present.
library(dplyr)
library(tibble) # provides add_column()
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4))
if(!'c' %in% names(df1)) df1 <- df1 %>% add_column(c = NA)
df1
# a b c
#1 1 NA NA
#2 2 2 NA
#3 3 3 NA
#4 NA 4 NA
A tidy way would be dplyr::mutate, although note that it overwrites c if the column already exists.
library(dplyr)
df1 <- df1 %>%
mutate(c = NA)
There is no need to specify multiple NA values, as a single NA is recycled to fill all rows of the data frame.
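When several columns may be missing at once, a base R sketch along these lines could be used. The vector wanted is a hypothetical list of required column names, not something from the question.

```r
df1 <- data.frame(a = c(1:3, NA), b = c(NA, 2:4))

# wanted is a hypothetical list of columns the data frame should have
wanted <- c("a", "b", "c")

# add only the columns that are not there yet, filled with NA
missing_cols <- setdiff(wanted, names(df1))
if (length(missing_cols) > 0) {
  df1[missing_cols] <- NA
}
df1
```

Existing columns are left untouched because setdiff() only returns the names that are absent.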

For Loop to check if any rows have a specific set of values

I'm trying to run a for-loop to check if any of my rows contain a specific set of values. I already know you can simply apply a function to remove the set from the dataframe, but I want to know how to run a for-loop as well. Thanks!
This is my dataframe:
df <- as.data.frame(matrix(NA, nrow = 12, ncol = 3))
df$V1 <- c('1','1','2','3','3','3','4','4','5','5','5','5')
df$V2 <- c('CCC','BBB','AAA','AAA','EEE','BBB','AAA','DDD','EEE','EEE','BBB','CCC')
df$V3 <- c(100,90,80,85,66,98,62,74,56,85,77,66)
colnames(df) <- c('ID','Secondary_ID','Number')
Grouping the Data so there is only 1 unique ID per row
library(dplyr)
library(tidyr)
df_2 <- df%>%
group_by(ID)%>%
summarise(Key_s = paste0(Secondary_ID, collapse = ','))%>%
separate(Key_s, into = c('1','2','3','4'))
I know that you can remove the specific set like this:
remove_this <- c('BBB','CCC')
df_remove <- apply(df_2, 1, function(x) !any(x %in% remove_this))
final_dataframe <- df_2[df_remove,]
I'm trying to run a for-loop which creates another column called output: if the row contains the specific set then "Yes", else "No".
Something like this:
output <- as.character(nrow(df_2))
for(i in 1:nrow(df_2)){
if(df_2[i,] %in% remove_this){
df_2$output <- "Yes"
}else{df_2$output <- "No"}
}
Reverse the test to see if the contents of remove_this are in the row.
df_2$output <- NA # initialize the column
for(i in 1:nrow(df_2)){
df_2$output[i] <- ifelse(all(remove_this %in% df_2[i,]), 'Yes', 'No')
}
You don't need to create a for loop:
remove_this <- c('BBB','CCC')
df_remove <- apply(df_2, 1, function(x) !any(x %in% remove_this))
df_2 %>%
mutate(output = c("No", "Yes")[df_remove + 1L])
# A tibble: 5 x 6
ID `1` `2` `3` `4` output
<chr> <chr> <chr> <chr> <chr> <chr>
1 1 CCC BBB NA NA No
2 2 AAA NA NA NA Yes
3 3 AAA EEE BBB NA No
4 4 AAA DDD NA NA Yes
5 5 EEE EEE BBB CCC No
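The "trick" is to convert the logical values FALSE and TRUE of df_remove into integer indices (1 and 2) which are used to subset the vector c("No", "Yes"). Note that df_remove is TRUE for rows containing none of the values in remove_this, so "Yes" here marks the rows that would be kept; swap the labels to c("Yes", "No") to flag the rows that contain them instead.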
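The two answers test different things: one asks whether all values of remove_this occur in a row, the other whether any of them does. The sketch below contrasts both in base R. It uses a hand-built stand-in for df_2 with made-up column names k1 to k4 (an assumption, so that no extra packages are needed) and leaves the ID column out of the comparison.

```r
# hand-built stand-in for df_2; column names k1..k4 are made up
df_2 <- data.frame(
  ID = c("1", "2", "3", "4", "5"),
  k1 = c("CCC", "AAA", "AAA", "AAA", "EEE"),
  k2 = c("BBB", NA, "EEE", "DDD", "EEE"),
  k3 = c(NA, NA, "BBB", NA, "BBB"),
  k4 = c(NA, NA, NA, NA, "CCC"),
  stringsAsFactors = FALSE
)
remove_this <- c("BBB", "CCC")

# compare only the key columns, not the ID
keys <- as.matrix(df_2[-1])

# "Yes" if the row contains at least one of the values
has_any <- apply(keys, 1, function(x) any(remove_this %in% x))
# "Yes" only if the row contains every value
has_all <- apply(keys, 1, function(x) all(remove_this %in% x))

df_2$output_any <- ifelse(has_any, "Yes", "No")
df_2$output_all <- ifelse(has_all, "Yes", "No")
df_2
```

Row 3 is where the two readings differ: it contains BBB but not CCC, so it is flagged under the "any" test only.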
