data.frame(id = c(1,2,3,4), stock = c("stock2", NA, NA, NA), bill = c("stock3", "bill2", NA, NA))
I would like to remove the rows which have missing values in both columns (stock, bill).
Example output:
data.frame(id = c(1,2), stock = c("stock2", NA), bill = c("stock3", "bill2"))
We can use rowSums to create a logical vector in base R
df1[rowSums(is.na(df1[-1])) < ncol(df1)-1,]
# id stock bill
#1 1 stock2 stock3
#2 2 <NA> bill2
Or using filter_at from dplyr
library(dplyr)
df1 %>%
filter_at(-1, any_vars(!is.na(.)))
# id stock bill
#1 1 stock2 stock3
#2 2 <NA> bill2
We can also specify the column names within vars
df1 %>%
filter_at(vars(stock, bill), any_vars(!is.na(.)))
NOTE: This would also work when there are many columns to compare.
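For instance, with a wider table a tidyselect helper keeps the call short. A minimal sketch, assuming hypothetical value columns named val1, val2, val3:
library(dplyr)
df_wide <- data.frame(id = 1:3,
                      val1 = c("a", NA, NA),
                      val2 = c("b", "c", NA),
                      val3 = c(NA, NA, NA))
# keep rows with at least one non-NA value among the val* columns
df_wide %>%
  filter_at(vars(starts_with("val")), any_vars(!is.na(.)))
#  id val1 val2 val3
#1  1    a    b <NA>
#2  2 <NA>    c <NA>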
Here are two ways, using base R and dplyr:
# the data frame with your values
df <- data.frame(
id = c(1,2,3,4),
stock = c("stock2", NA, NA, NA),
bill = c("stock3", "bill2", NA, NA)
)
# base R way
df[!(is.na(df$stock) & is.na(df$bill)), ]
# dplyr way
library(dplyr)
filter(df, !(is.na(stock) & is.na(bill)))
We can check for NA values in the dataframe and use apply to select rows which have at least one non-NA value.
df[apply(!is.na(df[-1]), 1, any), ]
# id stock bill
#1 1 stock2 stock3
#2 2 <NA> bill2
We can also use Reduce and lapply to the same effect
df[Reduce(`|`, lapply(df[-1], function(x) !is.na(x))), ]
#OR
#df[Reduce(`|`, lapply(df[-1], complete.cases)), ]
Related
Here is my data
data <- data.frame(a= c(1, NA, 3), b = c(2, 4, NA), c=c(NA, 1, 2))
I wish to select only the rows with no missing data in columns a AND b. For my example, only the first row will be selected.
I could use
data %>% filter(!is.na(a) & !is.na(b))
to achieve my purpose. But I wish to do it using if_any/if_all, if possible. I tried data %>% filter(if_all(c(a, b), !is.na)) but this returns an error. My question is how to do it in dplyr through if_any/if_all.
data %>%
filter(if_all(c(a,b), ~!is.na(.)))
a b c
1 1 2 NA
We could use filter with if_all
library(dplyr)
data %>%
filter(if_all(c(a,b), complete.cases))
Output:
a b c
1 1 2 NA
This could do the trick - use filter_at and all_vars in dplyr:
data %>%
filter_at(vars(a, b), all_vars(!is.na(.)))
Output:
# a b c
#1 1 2 NA
There are several ways to identify and manipulate individual cells with missing data in R, e.g., with complete.cases or even rowSums.
However, I've not been able to find---or figure out myself---an expedient way to select rows that have missing data within a subsetted range of columns.
For example, in dataframe df:
df <- data.frame(D1 = c('A', 'B', 'C', 'D'),
D2 = c(NA, 0, 1, 1),
V1 = c(11, NA, 33, NA),
V2 = c(111, 222, NA, NA)
)
df
#   D1 D2 V1  V2
# 1  A NA 11 111
# 2  B  0 NA 222
# 3  C  1 33  NA
# 4  D  1 NA  NA
I would like to select all rows that have missing data in both columns V1 and V2, thus selecting row D but not rows B or C (or A).
I have a larger range of columns than given in that toy example, so selecting a set of columns with, e.g., && could make for a long command.
N.B., a similar SO question addresses selecting rows where none are NAs.
You can try this:
df %>% filter(is.na(V1) & is.na(V2))
OUTPUT
D1 D2 V1 V2
1 D 1 NA NA
You can use dplyr::if_all. You can select the columns very flexibly with tidyselect, for instance using :, c, starts_with...
library(dplyr)
df %>%
filter(if_all(V1:V2, is.na))
# D1 D2 V1 V2
#1 D 1 NA NA
Also works (this shows the flexibility of tidyselect):
filter(df, if_all(3:4, is.na))
filter(df, if_all(starts_with("V"), is.na))
filter(df, if_all(c(V1, V2), is.na))
filter(df, if_all((last_col()-1):last_col(), is.na))
filter(df, if_all(num_range("V", 1:2), is.na))
I am looking to use a conditional statement to access date rows which are before 0021-01-11 and have an NA value in a specific column (People_vaccinated, for example). For those rows I want to impute zero.
I want to use an IF statement with (condition1 AND condition 2).
Condition1 can be df$People_vaccinated == NA and condition2 can be df$date < 'given date'
Maybe this will help -
df <- data.frame(Date = c('0021-01-07', '0021-01-08','0021-01-11', '0021-01-12'),
a = c(2, NA, 3, NA),
b = c(1, NA, 2, 3))
ind <- match('0021-01-11', df$Date)
df$a[1:ind][is.na(df$a[1:ind])] <- 0
df
# Date a b
#1 0021-01-07 2 1
#2 0021-01-08 0 NA
#3 0021-01-11 3 2
#4 0021-01-12 NA 3
Or using dplyr -
library(dplyr)
df <- df %>%
mutate(a = replace(a,
row_number() <= match('0021-01-11', Date) & is.na(a), 0))
df
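One note on the conditions mentioned in the question: df$People_vaccinated == NA will not work, because any comparison with NA returns NA rather than TRUE/FALSE; is.na() is the test to use. A rough sketch of the same replace idea using the question's column names (date and People_vaccinated are assumed here, with toy values), imputing rows on or before the cutoff date to mirror the indexing above:
library(dplyr)
vax <- data.frame(date = as.Date(c('0021-01-07', '0021-01-08', '0021-01-11', '0021-01-12')),
                  People_vaccinated = c(10, NA, 20, NA))
vax <- vax %>%
  mutate(People_vaccinated = replace(People_vaccinated,
                                     date <= as.Date('0021-01-11') & is.na(People_vaccinated),
                                     0))
vax
#        date People_vaccinated
#1 0021-01-07                10
#2 0021-01-08                 0
#3 0021-01-11                20
#4 0021-01-12                NA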
I am looking for a tidy way to add a missing column if not present in the dataset. For example, df1 does not contain column "c".
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4))
desired output:
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4), c=c(NA, NA, NA, NA))
Assuming you don't want to overwrite the column if it is already present in your data, you can use add_column along with an if condition that checks whether the column already exists.
library(dplyr)
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4))
if(!'c' %in% names(df1)) df1 <- df1 %>% add_column(c = NA)
df1
# a b c
#1 1 NA NA
#2 2 2 NA
#3 3 3 NA
#4 NA 4 NA
A tidy way would be dplyr::mutate, I guess.
library(dplyr)
df1 <- df1 %>%
mutate(c = c(NA))
No need to specify multiple NA as it will be recycled to fill all rows of the data frame.
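One small caveat (an addition, not from the question): a bare NA is logical, so if the new column should have a particular type, a typed missing constant can be used instead, e.g.:
library(dplyr)
df1 <- df1 %>%
  mutate(c = NA_character_)  # recycled to every row, stored as character rather than logical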
I'm trying to run a for-loop to check if any of my rows contain a specific set of values. I already know you can simply apply a function to remove the set from the dataframe, but I want to know how to run a for-loop as well. Thanks!
This is my dataframe:
df <- as.data.frame(matrix(NA, nrow = 12, ncol = 3))
df$V1 <- c('1','1','2','3','3','3','4','4','5','5','5','5')
df$V2 <- c('CCC','BBB','AAA','AAA','EEE','BBB','AAA','DDD','EEE','EEE','BBB','CCC')
df$V3 <- c(100,90,80,85,66,98,62,74,56,85,77,66)
colnames(df) <- c('ID','Secondary_ID','Number')
Grouping the data so there is only one unique ID per row:
library(dplyr)
library(tidyr)
df_2 <- df %>%
  group_by(ID) %>%
  summarise(Key_s = paste0(Secondary_ID, collapse = ',')) %>%
  separate(Key_s, into = c('1','2','3','4'))
I know that you can remove the specific set like this:
remove_this <- c('BBB','CCC')
df_remove <- apply(df_2, 1, function(x) !any(x %in% remove_this))
final_dataframe <- df_2[df_remove,]
I'm trying to run a for-loop which creates another column called output and fills it with "Yes" if the row contains the specific set, else "No".
Something like this:
output <- as.character(nrow(df_2))
for(i in 1:nrow(df_2)){
if(df_2[i,] %in% remove_this){
df_2$output <- "Yes"
}else{df_2$output <- "No"}
}
Reverse the test to see if the contents of remove_this are in the row.
df_2$output <- NA # initialize the column
for(i in 1:nrow(df_2)){
df_2$output[i] <- ifelse(all(remove_this %in% df_2[i,]), 'Yes', 'No')
}
You don't need to create a for loop:
remove_this <- c('BBB','CCC')
df_remove <- apply(df_2, 1, function(x) !any(x %in% remove_this))
df_2 %>%
mutate(output = c("No", "Yes")[df_remove + 1L])
# A tibble: 5 x 6
ID `1` `2` `3` `4` output
<chr> <chr> <chr> <chr> <chr> <chr>
1 1 CCC BBB NA NA No
2 2 AAA NA NA NA Yes
3 3 AAA EEE BBB NA No
4 4 AAA DDD NA NA Yes
5 5 EEE EEE BBB CCC No
The "trick" is to convert the logical values FALSE and TRUE of df_remove into integer indices which are used to subset the vector c("No", "Yes").