I have a data frame with columns A, B, C as follows:
A <- c("NX300", "BT400", "GD200")
B <- c("M0102", "N0703", "M0405")
C <- c(NA, "M0104", "N0404")
df <- data.frame (A,B,C)
Instead, I would like to duplicate a row whenever a value in C is not NA and replace the value of B with NA for the duplicated row. This is the desired output:
A1 <- c("NX300", "BT400", "BT400", "GD200", "GD200")
B1 <- c("M0102", "N0703", NA, "M0405", NA)
C1 <- c(NA, NA, "M0104", NA, "N0404")
df1 <- data.frame(A1,B1,C1)
To achieve this, I tried duplicating the row, without replacing B with NA just yet, but I get the following error code:
rbind(df, df[,is.na(C)==FALSE])
Error: object "C" not found
Can anyone help please?
Define a function newrows which accepts a row x and returns it or the duplicated rows and then apply it to each row. No packages are used.
newrows <- function(x) {
if (is.na(x$C)) x
else rbind(replace(x, "C", NA), replace(x, "B", NA))
}
do.call("rbind", by(df, 1:nrow(df), newrows))
giving:
A B C
1 NX300 M0102 <NA>
2.2 BT400 N0703 <NA>
2.21 BT400 <NA> M0104
3.3 GD200 M0405 <NA>
3.31 GD200 <NA> N0404
An option would be
library(dplyr)
df %>%
mutate(i1 = 1 + !is.na(C)) %>%
uncount(i1) %>%
mutate(B = replace(B, duplicated(B), NA)) %>%
group_by(A) %>%
mutate(C = replace(C, duplicated(C, fromLast = TRUE), NA))
If sorting does not matter, and continuing your first steps you can try:
x <- rbind(df, cbind(df[!is.na(df$C),1:2], C=NA))
x$B[!is.na(x$C)] <- NA
x
# A B C
#1 NX300 M0102 <NA>
#2 BT400 <NA> M0104
#3 GD200 <NA> N0404
#21 BT400 N0703 <NA>
#31 GD200 M0405 <NA>
Related
We are looking to rename columns in a dataframe in R, however the columns may be missing and this throws an error:
my_df <- data.frame(a = c(1,2,3), b = c(4,5,6))
my_df %>% dplyr::rename(aa = a, bb = b, cc = c)
Error: Can't rename columns that don't exist.
x Column `c` doesn't exist.
our desired output is this, which creates a new column with NA values if the original column does not exist:
> my_df
aa bb c
1 1 4 NA
2 2 5 NA
3 3 6 NA
A possible solution:
library(tidyverse)
my_df <- data.frame(a = c(1,2,3), b = c(4,5,6))
cols <- c(a = NA_real_, b = NA_real_, c = NA_real_)
my_df %>% add_column(!!!cols[!names(cols) %in% names(.)]) %>%
rename(aa = a, bb = b, cc = c)
#> aa bb cc
#> 1 1 4 NA
#> 2 2 5 NA
#> 3 3 6 NA
You can use a named vector with any_of() to rename that won't error on missing variables. I'm uncertain of a dplyr way to then create the missing vars but it's easy enough in base R.
library(dplyr)
cols <- c(aa = "a", bb = "b", cc = "c")
my_df %>%
rename(any_of(cols)) %>%
`[<-`(., , setdiff(names(cols), names(.)), NA)
aa bb cc
1 1 4 NA
2 2 5 NA
3 3 6 NA
Here is a solution using the data.table function setnames. I've added a second "missing" column "d" to demonstrate generality.
library(tidyverse)
library(data.table)
my_df <- data.frame(a = c(1,2,3), b = c(4,5,6))
curr <- names(my_df)
cols <- data.frame(new=c("aa","bb","cc","dd"), old = c("a", "b", "c","d")) %>%
mutate(exist = old %in% curr)
foo <- filter(cols, exist)
bar <- filter(cols, !exist)
setnames(my_df, new = foo$new)
my_df[, bar$old] <- NA
my_df
#> my_df
# aa bb c d
#1 1 4 NA NA
#2 2 5 NA NA
#3 3 6 NA NA
This question already has answers here:
How to implement coalesce efficiently in R
(9 answers)
Closed 2 years ago.
Example data.frame:
df <- data.frame(col_1=c("A", NA, NA), col_2=c(NA, "B", NA), col_3=c(NA, NA, "C"), other_col=rep("x", 3), stringsAsFactors=F)
df
col_1 col_2 col_3 other_col
1 A <NA> <NA> x
2 <NA> B <NA> x
3 <NA> <NA> C x
I can create a new column new_col filled with non-NA values from the 3 columns col_1, col_2 and col_3:
df %>%
mutate(new_col = case_when(
!is.na(col_1) ~ col_1,
!is.na(col_2) ~ col_2,
!is.na(col_3) ~ col_3,
TRUE ~ "none"))
col_1 col_2 col_3 other_col new_col
1 A <NA> <NA> x A
2 <NA> B <NA> x B
3 <NA> <NA> C x C
However, sometimes the number of columns from which I pick the new_col value can vary.
How could I check that the columns exist before applying the previous case_when command?
The following triggers an error:
df %>%
select(-col_3) %>%
mutate(new_col = case_when(
!is.null(.$col_1) & !is.na(col_1) ~ col_1,
!is.null(.$col_2) & !is.na(col_2) ~ col_2,
!is.null(.$col_3) & !is.na(col_3) ~ col_3,
TRUE ~ "none"))
Error: Problem with `mutate()` input `new_col`.
x object 'col_3' not found
ℹ Input `new_col` is `case_when(...)`.
I like Adam's answer, but if you want to be able to combine from col_1 and col_2 (assuming they both have values), you should use unite()
library(tidyverse)
df %>%
unite(new_col, starts_with("col"), remove = FALSE, na.rm = TRUE)
Edit to respond to: "How could I check that the columns exist before applying the previous case_when command?"
You won't need to check with this command. And if your columns to unite aren't named consistently, replace starts_with("col") with c("your_name_1", "your_name_2", "etc.")
You can use coalesce.
library(dplyr)
# vector of all the columns you might want
candidate_cols <- paste("col", 1:3, sep = "_")
# convert to symbol only the ones in the dataframe
check_cols <- syms(intersect(candidate_cols, names(df)))
# coalesce over the columns to check
df %>%
mutate(new_col = coalesce(!!!check_cols))
# col_1 col_2 col_3 other_col new_col
#1 A <NA> <NA> x A
#2 <NA> B <NA> x B
#3 <NA> <NA> C x C
This question already has an answer here:
Replace NA with mode based on ID attribute
(1 answer)
Closed 2 years ago.
I'd like to fill the NA-values in F2-column, based on the the most common F2-value when grouped by F1-column.
F1 F2
1 A C
2 B D
3 A NA
4 A C
5 B NA
Desired outcome:
F1 F2
1 A C
2 B D
3 A C
4 A C
5 B D
Thank you for help
Here is a base R solution. First define a function for Mode (Taken from here) and then apply it to you data frame, i.e.
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
df$F2 <- with(df, ave(F2, F1, FUN = function(i) replace(i, is.na(i), Mode(i))))
df
# F1 F2
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
Here is one way using dplyr :
library(dplyr)
df %>%
group_by(F1) %>%
mutate(F2 = replace(F2, is.na(F2),
names(sort(table(F2), decreasing = TRUE)[1])))
# F1 F2
# <chr> <chr>
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
In case of ties, preference is given to lexicographic order.
Try this:
First in df2 I get max count by the variable F1 where F2 is not missing. That will give you the most common F2 value when groups by F1. I join it back onto the original data.frame and use a mutate to fill by the new variable F2_fill and then remove it from this variable from the data.frame.
library(tidyverse)
df <- tribble(
~F1, ~F2,
'A', 'C',
'B' , 'D',
'A' ,NA,
'A', 'C',
'B', NA)
df2 <- df %>%
group_by(F1) %>%
count(F2) %>%
filter(!is.na(F2), n == max(n)) %>%
select(-n) %>%
rename(F2_fill = F2)
df3 <- left_join(df,df2, by="F1") %>%
mutate(F2 = ifelse(is.na(F2), F2_fill,F2)) %>%
select(-F2_fill)
You can use ave with table and which.max and subsetting with is.na when it is a character.
i <- is.na(x$F2)
x$F2[i] <- ave(x$F2, x$F1, FUN=function(y) names(which.max(table(y))))[i]
x
# F1 F2
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
Data:
x <- data.frame(F1 = c("A", "B", "A", "A", "B")
, F2 = c("C", "D", NA, "C", NA))
I'm working on a data.table with a column like this:
A <- c("a;b;c","a;a;b","d;a;b","f;f;f")
df <- data.frame(A)
I would like to separate this column into 3 columns like this:
seg1 seg2 seg3
1 a b c
2 a b <NA>
3 d a b
4 f <NA> <NA>
The thing here is that when i split each row by ";" i need to keep unique of the row.
Here's a tidyverse approach. We split the character in A, keep only the unique values, paste the result back together and separate into three columns:
library(tidyverse)
df %>%
mutate(A = map(strsplit(as.character(A), ";"),
.f = ~ paste(unique(.x), collapse = ";"))) %>%
separate(A, into = c("seg1", "seg2", "seg3"))
Which gives:
# seg1 seg2 seg3
#1 a b c
#2 a b <NA>
#3 d a b
#4 f <NA> <NA>
library(stringr)
A <- c("a;b;c","a;a;b","d;a;b","f;f;f")
df <- data.frame(A)
df <- str_split_fixed(df$A, ";", 3)
df <- apply(X = df,
FUN = function(x){
return(x[!duplicated(x)][1:ncol(df)])
},
MARGIN = 1)
df <- t(df)
df <- as.data.frame(df)
names(df) <- c("seg1", "seg2", "seg3")
df
# seg1 seg2 seg3
# 1 a b c
# 2 a b <NA>
# 3 d a b
# 4 f <NA> <NA>
I have a question related to separate() in the tidyr package. When there is no NA in a data frame, separate() works. I have been using this function a lot. But, today I had a case in which there were NAs in a data frame. separate() returned an error message. I could be very silly. But, I wonder if tidyr may not be designed for this kind of data cleaning. Or is there any way separate() can work with NAs? Thank you very much for taking your time.
Here is an updated sample based on the comments. Say I want to separate characters in y and create new columns. If I remove the row with NA, separate() will work. But, I do not want to delete the row, what could I do?
x <- c("a-1","b-2","c-3")
y <- c("d-4","e-5", NA)
z <- c("f-6", "g-7", "h-8")
foo <- data.frame(x,y,z, stringsAsFactors = F)
ana <- foo %>%
separate(y, c("part1", "part2"))
# > foo
# x y z
# 1 a-1 d-4 f-6
# 2 b-2 e-5 g-7
# 3 c-3 <NA> h-8
# > ana <- foo %>%
# + separate(y, c("part1", "part2"))
# Error: Values not split into 2 pieces at 3
One way would be:
res <- foo %>%
mutate(y=ifelse(is.na(y), paste0(NA,"-", NA), y)) %>%
separate(y, c('part1', 'part2'))
res[res=='NA'] <- NA
res
# x part1 part2 z
#1 a-1 d 4 f-6
#2 b-2 e 5 g-7
#3 c-3 <NA> <NA> h-8
You can use extra option in separate.
Here's an example from hadley's github issue page
> df <- data.frame(x = c("a", "a b", "a b c", NA))
> df
x
1 a
2 a b
3 a b c
4 <NA>
> df %>% separate(x, c("a", "b"), extra = "merge")
a b
1 a <NA>
2 a b
3 a b c
4 <NA> <NA>
> df %>% separate(x, c("a", "b"), extra = "drop")
a b
1 a <NA>
2 a b
3 a b
4 <NA> <NA>