Replace strings containing only blanks with NA - r

I have a dataframe containing columns named Q1 through Q98. These columns contain strings ("This is a string"), yet some entries contain only a varying number of blanks (" ", "   "). I would like to replace all entries containing only blanks with NA.
Consider the dataframe created by the following code:
df<-data.frame(Q1=c("Test test","Test"," "," "),Q2=c("Sample sample"," ","Sample","Sample"))
The solution would modify the above dataframe df such that df$Q1[3:4] and df$Q2[2] are NA.
I have already tried using grepl(" ", df), but this matches every entry that contains a blank, not only those which consist purely of blanks.

One dplyr possibility could be:
library(dplyr)
df %>%
  mutate_all(~ ifelse(nchar(trimws(.)) == 0, NA_character_, .))
         Q1            Q2
1 Test test Sample sample
2      Test          <NA>
3      <NA>        Sample
4      <NA>        Sample
Or the same with base R:
df[] <- lapply(df, function(x) ifelse(nchar(trimws(x)) == 0, NA_character_, x))
Or:
df %>%
  mutate_all(~ na_if(trimws(.), ""))
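In current dplyr releases mutate_all() is superseded by across(). A minimal sketch of the same idea, assuming dplyr 1.0.0 or later and character (not factor) columns. Note that trimws() also strips leading and trailing whitespace from the non-blank strings:

```r
library(dplyr)

# Example data from the question; the number of blanks varies on purpose
df <- data.frame(
  Q1 = c("Test test", "Test", " ", "   "),
  Q2 = c("Sample sample", "  ", "Sample", "Sample"),
  stringsAsFactors = FALSE
)

# Trim every column, then turn the now-empty strings into NA
res <- df %>%
  mutate(across(everything(), ~ na_if(trimws(.x), "")))

is.na(res$Q1)
is.na(res$Q2)
```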

A dplyr+stringr option
library(dplyr)
library(stringr)
df %>% mutate_all(~str_replace(., "^\\s+$", NA_character_))
# Q1 Q2
#1 Test test Sample sample
#2 Test <NA>
#3 <NA> Sample
#4 <NA> Sample

You can search for strings with a start ^, then one or more spaces +, then an end $.
df[sapply(df, function(x) grepl('^ +$', x))] <- NA
# Q1 Q2
# 1 Test test Sample sample
# 2 Test <NA>
# 3 <NA> Sample
# 4 <NA> Sample
Some other possibilities
df[] <- lapply(df, function(x) replace(x, grep('^ +$', x), NA))
#or
replace(df, sapply(df, function(x) grepl('^ +$', x)), NA)
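To see why the anchored pattern behaves differently from the grepl(" ", ...) attempt in the question, here is a small base R sketch on a plain character vector (the vector x is made up for illustration):

```r
x <- c("Test test", "Test", " ", "   ", "")

# Unanchored: TRUE for any string that contains a blank anywhere
contains_blank <- grepl(" ", x)
contains_blank
# [1]  TRUE FALSE  TRUE  TRUE FALSE

# Anchored: TRUE only when the whole string is one or more spaces
only_spaces <- grepl("^ +$", x)
only_spaces
# [1] FALSE FALSE  TRUE  TRUE FALSE

# \\s* also covers tabs and other whitespace, and matches the empty string
blank_or_empty <- grepl("^\\s*$", x)
blank_or_empty
# [1] FALSE FALSE  TRUE  TRUE  TRUE
```

Whether the empty string should count as blank depends on the data, so the choice between + and * is worth making deliberately.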

Apply sub to every column, replacing whitespace-only strings with NA:
df[] <- lapply(df, FUN = sub, pattern = "^\\s*$", replacement = NA)

We can do this in base R
df[trimws(as.matrix(df)) == ''] <- NA
df
# Q1 Q2
#1 Test test Sample sample
#2 Test <NA>
#3 <NA> Sample
#4 <NA> Sample
Or with replace
library(dplyr)
df %>%
mutate_all(list(~ replace(., trimws(.)=="", NA)))
# Q1 Q2
#1 Test test Sample sample
#2 Test <NA>
#3 <NA> Sample
#4 <NA> Sample
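If the real data frame holds other columns besides Q1 through Q98, the replacement can be limited to those columns. A base R sketch, assuming the Q columns are character vectors; the id column and the sample values are hypothetical:

```r
# Hypothetical data: an id column that must stay untouched plus two Q columns
df <- data.frame(
  id = 1:3,
  Q1 = c("a", " ", "b"),
  Q2 = c("  ", "c", ""),
  stringsAsFactors = FALSE
)

# Select only the columns named Q followed by digits
q_cols <- grep("^Q[0-9]+$", names(df), value = TRUE)

# Replace entries that are empty after trimming with NA, column by column
df[q_cols] <- lapply(df[q_cols], function(x) replace(x, trimws(x) == "", NA))
df
```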

Related

Fill a new column from multiple columns if they exist [duplicate]

This question already has answers here:
How to implement coalesce efficiently in R
Example data.frame:
df <- data.frame(col_1=c("A", NA, NA), col_2=c(NA, "B", NA), col_3=c(NA, NA, "C"), other_col=rep("x", 3), stringsAsFactors=FALSE)
df
col_1 col_2 col_3 other_col
1 A <NA> <NA> x
2 <NA> B <NA> x
3 <NA> <NA> C x
I can create a new column new_col filled with non-NA values from the 3 columns col_1, col_2 and col_3:
df %>%
mutate(new_col = case_when(
!is.na(col_1) ~ col_1,
!is.na(col_2) ~ col_2,
!is.na(col_3) ~ col_3,
TRUE ~ "none"))
col_1 col_2 col_3 other_col new_col
1 A <NA> <NA> x A
2 <NA> B <NA> x B
3 <NA> <NA> C x C
However, sometimes the number of columns from which I pick the new_col value can vary.
How could I check that the columns exist before applying the previous case_when command?
The following triggers an error:
df %>%
select(-col_3) %>%
mutate(new_col = case_when(
!is.null(.$col_1) & !is.na(col_1) ~ col_1,
!is.null(.$col_2) & !is.na(col_2) ~ col_2,
!is.null(.$col_3) & !is.na(col_3) ~ col_3,
TRUE ~ "none"))
Error: Problem with `mutate()` input `new_col`.
x object 'col_3' not found
ℹ Input `new_col` is `case_when(...)`.
I like Adam's answer, but if you want to combine the values of col_1 and col_2 when both are non-NA, use unite():
library(tidyverse)
df %>%
unite(new_col, starts_with("col"), remove = FALSE, na.rm = TRUE)
Edit to respond to: "How could I check that the columns exist before applying the previous case_when command?"
You won't need to check with this command. And if your columns to unite aren't named consistently, replace starts_with("col") with c("your_name_1", "your_name_2", "etc.")
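A small sketch of how unite() behaves when more than one column holds a value, assuming tidyr 1.0.0 or later (for na.rm) and character columns; df2 is made-up data without col_3:

```r
library(tidyr)

# Hypothetical data: row 1 has values in both columns, col_3 does not exist
df2 <- data.frame(
  col_1 = c("A", NA),
  col_2 = c("B", "C"),
  other_col = c("x", "x"),
  stringsAsFactors = FALSE
)

# starts_with() only picks columns that are present, so no existence check is needed
res <- unite(df2, new_col, starts_with("col"), remove = FALSE, na.rm = TRUE)
res$new_col
# [1] "A_B" "C"
```

The values are joined with the default separator "_", which differs from the case_when() version that keeps only the first non-NA value.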
You can use coalesce.
library(dplyr)
# vector of all the columns you might want
candidate_cols <- paste("col", 1:3, sep = "_")
# convert to symbol only the ones in the dataframe
check_cols <- syms(intersect(candidate_cols, names(df)))
# coalesce over the columns to check
df %>%
mutate(new_col = coalesce(!!!check_cols))
# col_1 col_2 col_3 other_col new_col
#1 A <NA> <NA> x A
#2 <NA> B <NA> x B
#3 <NA> <NA> C x C
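The same idea can be sketched without dplyr, including the "none" fallback from the question. This assumes character columns and that at least one candidate column is present; the data below drops col_3 on purpose:

```r
df <- data.frame(
  col_1 = c("A", NA, NA),
  col_2 = c(NA, "B", NA),
  other_col = rep("x", 3),
  stringsAsFactors = FALSE
)

# Keep only the candidate columns that actually exist
candidate_cols <- paste("col", 1:3, sep = "_")
present <- intersect(candidate_cols, names(df))

# Walk the columns left to right, filling NAs from the next column
first_non_na <- Reduce(function(a, b) ifelse(is.na(a), b, a), df[present])

# Rows where every candidate is NA get the fallback value
df$new_col <- ifelse(is.na(first_non_na), "none", first_non_na)
df$new_col
# [1] "A"    "B"    "none"
```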

Error when duplicating a row conditionally - R

I have a data frame with columns A, B, C as follows:
A <- c("NX300", "BT400", "GD200")
B <- c("M0102", "N0703", "M0405")
C <- c(NA, "M0104", "N0404")
df <- data.frame (A,B,C)
I would like to duplicate a row whenever a value in C is not NA and replace the value of B with NA for the duplicated row. This is the desired output:
A1 <- c("NX300", "BT400", "BT400", "GD200", "GD200")
B1 <- c("M0102", "N0703", NA, "M0405", NA)
C1 <- c(NA, NA, "M0104", NA, "N0404")
df1 <- data.frame(A1,B1,C1)
To achieve this, I first tried duplicating the rows, without replacing B with NA just yet, but I get the following error:
rbind(df, df[,is.na(C)==FALSE])
Error: object "C" not found
Can anyone help please?
Define a function newrows which accepts a row x and returns it or the duplicated rows and then apply it to each row. No packages are used.
newrows <- function(x) {
if (is.na(x$C)) x
else rbind(replace(x, "C", NA), replace(x, "B", NA))
}
do.call("rbind", by(df, 1:nrow(df), newrows))
giving:
A B C
1 NX300 M0102 <NA>
2.2 BT400 N0703 <NA>
2.21 BT400 <NA> M0104
3.3 GD200 M0405 <NA>
3.31 GD200 <NA> N0404
An option would be
library(dplyr)
library(tidyr)  # uncount() comes from tidyr
df %>%
  mutate(i1 = 1 + !is.na(C)) %>%
  uncount(i1) %>%
  mutate(B = replace(B, duplicated(B), NA)) %>%
  group_by(A) %>%
  mutate(C = replace(C, duplicated(C, fromLast = TRUE), NA))
If sorting does not matter, and continuing your first steps you can try:
x <- rbind(df, cbind(df[!is.na(df$C),1:2], C=NA))
x$B[!is.na(x$C)] <- NA
x
# A B C
#1 NX300 M0102 <NA>
#2 BT400 <NA> M0104
#3 GD200 <NA> N0404
#21 BT400 N0703 <NA>
#31 GD200 M0405 <NA>
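If the original row order should be kept, one base R sketch is to repeat the row indices and then blank out B or C depending on whether a row is the first or second copy. The helper names has_c, idx and second are made up for this example. Note also that the attempt in the question failed because the column has to be referenced as df$C and the condition belongs in the row position, before the comma:

```r
df <- data.frame(
  A = c("NX300", "BT400", "GD200"),
  B = c("M0102", "N0703", "M0405"),
  C = c(NA, "M0104", "N0404"),
  stringsAsFactors = FALSE
)

has_c <- !is.na(df$C)

# Repeat each row index once, or twice when C is present: 1 2 2 3 3
idx <- rep(seq_len(nrow(df)), times = 1 + has_c)
out <- df[idx, ]

# TRUE marks the second copy of a duplicated row
second <- duplicated(idx)

out$C[!second] <- NA  # first copies keep B and lose C
out$B[second] <- NA   # second copies keep C and lose B
rownames(out) <- NULL
out
```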

Add a boolean column to a data.frame indicating whether specific columns are all NAs

I have a data.frame, which has NA's in several columns:
df <- data.frame(a0 = 1:3, a1 = c("A","B",NA), a2 = c("a",NA,NA),
a3 = rep(NA,3), stringsAsFactors = FALSE)
I would like to add a new column, all.na, indicating for each row whether the columns c("a1","a2","a3") are all NA.
It can be done using sapply:
df$all.na <- sapply(1:nrow(df), function(x) all(is.na(df[x,c("a1","a2","a3")])))
But I'm looking for something faster.
I thought using dplyr::mutate might be a good solution but:
> df %>% dplyr::mutate(all(is.na(c(a1,a2,a3))))
a0 a1 a2 a3 all(is.na(c(a1, a2, a3)))
1 1 A a NA FALSE
2 2 B <NA> NA FALSE
3 3 <NA> <NA> NA FALSE
Doesn't give me the desired outcome.
Any idea how to get dplyr::mutate to give:
df$all.na <- c(FALSE, FALSE, TRUE)
On this?
We could use rowwise with do
library(dplyr)
cols <- c("a1","a2","a3")
df %>%
rowwise() %>%
do( (.) %>% as.data.frame %>%
mutate(all.na = all(is.na(.[cols]))))
# a0 a1 a2 a3 all.na
# <int> <chr> <chr> <lgl> <lgl>
#1 1 A a NA FALSE
#2 2 B NA NA FALSE
#3 3 NA NA NA TRUE
Or a more general approach using tidyverse gather and spread
library(tidyverse)
df %>%
gather(key, value, -a0) %>%
group_by(a0) %>%
mutate(all.na = all(is.na(value))) %>%
spread(key, value)
However, in base R there is a better approach using is.na and rowSums
df$all.na <- rowSums(is.na(df[cols])) == length(cols)
df
# a0 a1 a2 a3 all.na
#1 1 A a NA FALSE
#2 2 B <NA> NA FALSE
#3 3 <NA> <NA> NA TRUE
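A short sketch of what the rowSums approach computes, and why the mutate() attempt in the question returned FALSE for every row: c(a1, a2, a3) concatenates the three whole columns, so all() yields a single value that is recycled. The intermediate names na_mat, all_na and whole_cols are made up for illustration:

```r
df <- data.frame(a0 = 1:3, a1 = c("A", "B", NA), a2 = c("a", NA, NA),
                 a3 = rep(NA, 3), stringsAsFactors = FALSE)
cols <- c("a1", "a2", "a3")

# is.na() on a data frame returns a logical matrix with one cell per entry
na_mat <- is.na(df[cols])

# Number of NAs per row; a row is all-NA when the count equals length(cols)
na_count <- rowSums(na_mat)
all_na <- na_count == length(cols)

# One value for the whole data frame, not one per row
whole_cols <- all(is.na(c(df$a1, df$a2, df$a3)))
```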
This can also be achieved using apply row-wise (MARGIN = 1), but this offers no speed improvement.
df$all.na <- apply(df[cols], 1, function(x) all(is.na(x)))
Here is one option with tidyverse making use of pmap
library(tidyverse)
df %>%
mutate(all.na = pmap_lgl(.[cols], ~ all(is.na(c(...)))))
# a0 a1 a2 a3 all.na
#1 1 A a NA FALSE
#2 2 B <NA> NA FALSE
#3 3 <NA> <NA> NA TRUE
Or another option is to convert to logical vector with map and reduce it back to a single logical vector
df %>%
mutate(all.na = map(.[cols], is.na) %>%
reduce(`&`))
With base R, this can be achieved using Reduce and lapply
df$all.na <- Reduce(`&`, lapply(df[cols], is.na))
data
cols <- c("a1","a2","a3")

How to split a column and keep only unique values using tidyr

I'm working on a data.table with a column like this:
A <- c("a;b;c","a;a;b","d;a;b","f;f;f")
df <- data.frame(A)
I would like to separate this column into 3 columns like this:
seg1 seg2 seg3
1 a b c
2 a b <NA>
3 d a b
4 f <NA> <NA>
The catch is that when I split each row by ";" I need to keep only the unique values within that row.
Here's a tidyverse approach. We split the character in A, keep only the unique values, paste the result back together and separate into three columns:
library(tidyverse)
df %>%
  mutate(A = map_chr(strsplit(as.character(A), ";"),
                     ~ paste(unique(.x), collapse = ";"))) %>%
  separate(A, into = c("seg1", "seg2", "seg3"), sep = ";", fill = "right")
Which gives:
# seg1 seg2 seg3
#1 a b c
#2 a b <NA>
#3 d a b
#4 f <NA> <NA>
Another option uses str_split_fixed from stringr and drops the duplicates row by row:
library(stringr)
A <- c("a;b;c","a;a;b","d;a;b","f;f;f")
df <- data.frame(A)
df <- str_split_fixed(df$A, ";", 3)
df <- apply(X = df,
FUN = function(x){
return(x[!duplicated(x)][1:ncol(df)])
},
MARGIN = 1)
df <- t(df)
df <- as.data.frame(df)
names(df) <- c("seg1", "seg2", "seg3")
df
# seg1 seg2 seg3
# 1 a b c
# 2 a b <NA>
# 3 d a b
# 4 f <NA> <NA>
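For comparison, a base R sketch of the same split, de-duplicate and pad steps, assuming at most three unique values per row (the count n and the helper names are chosen for this example):

```r
A <- c("a;b;c", "a;a;b", "d;a;b", "f;f;f")

# Split each string on ";" and keep the unique pieces in order of appearance
parts <- lapply(strsplit(A, ";", fixed = TRUE), unique)

# Pad every piece vector with NA up to n elements
n <- 3
padded <- lapply(parts, function(p) {
  length(p) <- n
  p
})

# Stack the padded vectors into a matrix, then convert to a data frame
out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(out) <- paste0("seg", seq_len(n))
out
```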

separate() in tidyr with NA

I have a question related to separate() in the tidyr package. When there is no NA in a data frame, separate() works. I have been using this function a lot. But, today I had a case in which there were NAs in a data frame. separate() returned an error message. I could be very silly. But, I wonder if tidyr may not be designed for this kind of data cleaning. Or is there any way separate() can work with NAs? Thank you very much for taking your time.
Here is an updated sample based on the comments. Say I want to separate characters in y and create new columns. If I remove the row with NA, separate() will work. But, I do not want to delete the row, what could I do?
library(dplyr)
library(tidyr)
x <- c("a-1","b-2","c-3")
y <- c("d-4","e-5", NA)
z <- c("f-6", "g-7", "h-8")
foo <- data.frame(x,y,z, stringsAsFactors = FALSE)
ana <- foo %>%
  separate(y, c("part1", "part2"))
# > foo
# x y z
# 1 a-1 d-4 f-6
# 2 b-2 e-5 g-7
# 3 c-3 <NA> h-8
# > ana <- foo %>%
# + separate(y, c("part1", "part2"))
# Error: Values not split into 2 pieces at 3
One way would be:
res <- foo %>%
mutate(y=ifelse(is.na(y), paste0(NA,"-", NA), y)) %>%
separate(y, c('part1', 'part2'))
res[res=='NA'] <- NA
res
# x part1 part2 z
#1 a-1 d 4 f-6
#2 b-2 e 5 g-7
#3 c-3 <NA> <NA> h-8
You can use the extra option in separate().
Here's an example from Hadley's GitHub issue page:
> df <- data.frame(x = c("a", "a b", "a b c", NA))
> df
x
1 a
2 a b
3 a b c
4 <NA>
> df %>% separate(x, c("a", "b"), extra = "merge")
a b
1 a <NA>
2 a b
3 a b c
4 <NA> <NA>
> df %>% separate(x, c("a", "b"), extra = "drop")
a b
1 a <NA>
2 a b
3 a b
4 <NA> <NA>
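For a version that does not depend on how separate() treats missing values, here is a base R sketch using strsplit(), which returns NA for a missing input. It assumes every non-missing value contains exactly one "-":

```r
foo <- data.frame(
  x = c("a-1", "b-2", "c-3"),
  y = c("d-4", "e-5", NA),
  z = c("f-6", "g-7", "h-8"),
  stringsAsFactors = FALSE
)

# Each element is a vector of pieces; the NA input yields a single NA
pieces <- strsplit(foo$y, "-", fixed = TRUE)

# Indexing past the end of a vector gives NA, so the NA row stays NA in both parts
foo$part1 <- vapply(pieces, `[`, character(1), 1)
foo$part2 <- vapply(pieces, `[`, character(1), 2)
foo$y <- NULL
foo
```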
