exists function doesn't work as expected with dplyr transmute - r

I would like to create the same output as df3 using dplyr transmute. But somehow it takes only the first row of columns a and b, not the whole column. Any ideas?
df = data.frame(a=1:10, b=2:11)
df2 <- df %>%
transmute(
newcol = ifelse(exists("a", df)==TRUE,a, NA),
newcol2 = ifelse(exists("b", df)==TRUE,b, NA),
newcol3 = ifelse(exists("c", df)==TRUE,c, NA),
)
df2
df3 = data.frame(newcol=1:10, newcol2=2:11, newcol3 = NA)
df3

The problem is that exists("a", df) returns a length-1 logical vector, and ifelse returns a result shaped like its test, so it returns a length-1 numeric vector. That single value is then recycled, which is why the first number in each column is repeated. You can use if(condition) a else NA instead:
library(dplyr)
df = data.frame(a=1:10, b=2:11)
df2 <- df %>%
transmute(
newcol = if(exists("a", df)) a else NA,
newcol2 = if(exists("b", df)) b else NA,
newcol3 = if(exists("c", df)) c else NA
)
df2
#> newcol newcol2 newcol3
#> 1 1 2 NA
#> 2 2 3 NA
#> 3 3 4 NA
#> 4 4 5 NA
#> 5 5 6 NA
#> 6 6 7 NA
#> 7 7 8 NA
#> 8 8 9 NA
#> 9 9 10 NA
#> 10 10 11 NA
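The length mismatch can be checked outside of transmute. A minimal base R sketch (the names test, res_ifelse and res_if are only for illustration):

```r
df <- data.frame(a = 1:10, b = 2:11)

# exists() answers a single yes/no question, so the test has length 1
test <- exists("a", df)

# ifelse() returns a result shaped like `test`, so only the first element of df$a survives
res_ifelse <- ifelse(test, df$a, NA)

# if/else returns the chosen object unchanged, so the whole column is kept
res_if <- if (test) df$a else NA

length(res_ifelse)
length(res_if)
```

Inside transmute, the length-1 result is what gets recycled down the new column.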

An option with map
library(dplyr)
library(purrr)
map_dfc(c('a', 'b', 'c'),
~ if(exists(.x, df)) df %>% select(all_of(.x)) else df %>% transmute(!! .x := NA))
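If purrr is not needed, the same check can be wrapped in a small helper. This is a sketch only: col_or_na is a hypothetical name, and it tests membership with %in% names(data) rather than exists().

```r
# Hypothetical helper: return the column if present, otherwise a column of NAs
col_or_na <- function(data, name) {
  if (name %in% names(data)) data[[name]] else rep(NA, nrow(data))
}

df <- data.frame(a = 1:10, b = 2:11)
res <- data.frame(
  newcol  = col_or_na(df, "a"),
  newcol2 = col_or_na(df, "b"),
  newcol3 = col_or_na(df, "c")
)
res
```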

Related

Expand each group to the max n of rows

How can I expand a group to length of the max group:
df <- structure(list(ID = c(1L, 1L, 2L, 3L, 3L, 3L), col1 = c("A",
"B", "O", "U", "L", "R")), class = "data.frame", row.names = c(NA,
-6L))
ID col1
1 A
1 B
2 O
3 U
3 L
3 R
Desired Output:
1 A
1 B
NA NA
2 O
NA NA
NA NA
3 U
3 L
3 R
You can take advantage of the fact that df[n_bigger_than_nrow,] gives a row of NAs
dplyr
max_n <- max(count(df, ID)$n)
df %>%
group_by(ID) %>%
summarise(cur_data()[seq(max_n),])
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups`
#> argument.
#> # A tibble: 9 × 2
#> # Groups: ID [3]
#> ID col1
#> <int> <chr>
#> 1 1 A
#> 2 1 B
#> 3 1 <NA>
#> 4 2 O
#> 5 2 <NA>
#> 6 2 <NA>
#> 7 3 U
#> 8 3 L
#> 9 3 R
base R
n <- tapply(df$ID, df$ID, length)
max_n <- max(n)
i <- lapply(n, \(x) c(seq(x), rep(Inf, max_n - x)))
i <- Map(`+`, i, c(0, cumsum(head(n, -1))))
df <- df[unlist(i),]
rownames(df) <- NULL
df$ID <- rep(as.numeric(names(i)), each = max_n)
df
#> ID col1
#> 1 1 A
#> 2 1 B
#> 3 1 <NA>
#> 4 2 O
#> 5 2 <NA>
#> 6 2 <NA>
#> 7 3 U
#> 8 3 L
#> 9 3 R
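The out-of-range indexing that this answer relies on can be checked in isolation. A small sketch on a toy data frame (padded is an illustrative name):

```r
df <- data.frame(ID = c(1L, 1L, 2L), col1 = c("A", "B", "O"))

# Row 4 does not exist, so it comes back as a row of NAs
padded <- df[c(1, 2, 4), ]
padded
```

In the dplyr version the ID column stays filled because it is the grouping variable. The base R version restores it explicitly in the df$ID <- ... line.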
Here's a base R solution.
Split the data frame by the ID column, then use lapply to iterate over the pieces and rbind each with a data frame of NAs if it has fewer rows than the largest group (max(table(df$ID)), here 3).
do.call(rbind,
lapply(split(df, df$ID),
\(x) rbind(x, data.frame(ID = NA, col1 = NA)[rep(1, max(table(df$ID)) - nrow(x)), ]))
)
ID col1
1.1 1 A
1.2 1 B
1.3 NA <NA>
2.3 2 O
2.1 NA <NA>
2.1.1 NA <NA>
3.4 3 U
3.5 3 L
3.6 3 R
Here is a possible tidyverse solution. We can use add_row inside of summarise to add n rows to each group. I use max(count(df, ID)$n) to get the max group length, then I subtract the number of rows in each group from that to get the number of rows that need to be added for each group. I use rep to produce the correct number of values that we need to add for each group. Finally, I replace ID with NA when there is an NA in col1.
library(tidyverse)
df %>%
group_by(ID) %>%
summarise(add_row(cur_data(),
col1 = rep(NA_character_,
unique(max(count(df, ID)$n) - n()))),
.groups = "drop") %>%
mutate(ID = replace(ID, is.na(col1), NA))
Output
ID col1
<int> <chr>
1 1 A
2 1 B
3 NA NA
4 2 O
5 NA NA
6 NA NA
7 3 U
8 3 L
9 3 R
Or another option without using add_row:
library(dplyr)
# Get maximum number of rows for all groups
N = max(count(df,ID)$n)
df %>%
group_by(ID) %>%
summarise(col1 = c(col1, rep(NA, N-length(col1))), .groups = "drop") %>%
mutate(ID = replace(ID, is.na(col1), NA))
Another option could be:
df %>%
group_split(ID) %>%
map_dfr(~ rows_append(.x, tibble(col1 = rep(NA_character_, max(pull(count(df, ID), n)) - group_size(.x)))))
ID col1
<int> <chr>
1 1 A
2 1 B
3 NA NA
4 2 O
5 NA NA
6 NA NA
7 3 U
8 3 L
9 3 R
A base R option using merge + rle
merge(
transform(
data.frame(ID = with(rle(df$ID), rep(values, each = max(lengths)))),
q = ave(ID, ID, FUN = seq_along)
),
transform(
df,
q = ave(ID, ID, FUN = seq_along)
),
all = TRUE
)[-2]
gives
ID col1
1 1 A
2 1 B
3 1 <NA>
4 2 O
5 2 <NA>
6 2 <NA>
7 3 U
8 3 L
9 3 R
A data.table option may also work
> setDT(df)[, .(col1 = `length<-`(col1, max(df[, .N, ID][, N]))), ID]
ID col1
1: 1 A
2: 1 B
3: 1 <NA>
4: 2 O
5: 2 <NA>
6: 2 <NA>
7: 3 U
8: 3 L
9: 3 R
An option using tidyr::complete on ID and row_new, then using row_old to replace ID with NA in the added rows.
library (tidyverse)
df %>%
group_by(ID) %>%
mutate(
row_new = row_number(),
row_old = row_number()) %>%
ungroup() %>%
complete(ID, row_new) %>%
mutate(ID = if_else(is.na(row_old),
NA_integer_,
ID)) %>%
select(-matches("row_"))
# A tibble: 9 x 2
ID col1
<int> <chr>
1 1 A
2 1 B
3 NA <NA>
4 2 O
5 NA <NA>
6 NA <NA>
7 3 U
8 3 L
9 3 R
n <- max(table(df$ID))
df %>%
group_by(ID) %>%
summarise(col1 =`length<-`(col1, n), .groups = 'drop') %>%
mutate(ID = `is.na<-`(ID, is.na(col1)))
# A tibble: 9 x 2
ID col1
<int> <chr>
1 1 A
2 1 B
3 NA NA
4 2 O
5 NA NA
6 NA NA
7 3 U
8 3 L
9 3 R
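This last snippet calls two base R replacement functions in prefix form: `length<-` pads a vector with NA up to the requested length, and `is.na<-` sets the selected elements to NA. A small sketch of both on toy vectors (the variable names are illustrative):

```r
col1 <- c("A", "B")

# Extend the vector to length 3; the new element is NA
padded <- `length<-`(col1, 3)

id <- c(1L, 1L, 1L)

# Blank out the ID wherever the padded column is NA
id_masked <- `is.na<-`(id, is.na(padded))

padded
id_masked
```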
Another base R solution using sequence.
print(
df[
sequence(
abs(rep(i <- rle(df$ID)$lengths, each = 2) - c(0L, max(i))),
rep(cumsum(c(1L, i))[-length(i) - 1L], each = 2) + c(0L, nrow(df)),
),
],
row.names = FALSE
)
#> ID col1
#> 1 A
#> 1 B
#> NA <NA>
#> 2 O
#> NA <NA>
#> NA <NA>
#> 3 U
#> 3 L
#> 3 R

Function breaks when looped within dplyr::case_when()

I have a function that extracts the min or max of a range of values (within a character string) that appears to work fine on individual cases.
However, when I try to use it within case_when() it does not behave as expected.
Reproducible example
library(dplyr)
library(tibble)
library(stringr)
val_from_range <- function(.str, .fun = "min"){
str_extract_all(.str, "\\d*\\.?\\d+") |>
unlist() |>
as.numeric() |>
(\(x) if (.fun == "min") x |> min()
else if (.fun == "max") x |> max())()
}
tibble(x = c("5-6", "4", "6-9", "5", "NA")) |>
mutate(min = case_when(str_detect(x, "-") ~ val_from_range(x, "min"))) |>
mutate(max = case_when(str_detect(x, "-") ~ val_from_range(x, "max")))
# A tibble: 5 x 3
x min max
<chr> <dbl> <dbl>
1 5-6 4 9
2 4 NA NA
3 6-9 4 9
4 5 NA NA
5 NA NA NA
However, I want:
# A tibble: 5 x 3
x min max
<chr> <dbl> <dbl>
1 5-6 5 6
2 4 NA NA
3 6-9 6 9
4 5 NA NA
5 NA NA NA
The function performs as expected on individual cases
> val_from_range("5-6", "min")
[1] 5
> val_from_range("5-6", "max")
[1] 6
> val_from_range("5-6-8-10", "max")
[1] 10
Any help would be greatly appreciated. Thanks in advance.
A couple of changes are required. The function works for only one string at a time. If you pass in more than one, it pools the numbers from all of them and returns a single overall value.
val_from_range("5-6", "min")
#[1] 5
val_from_range(c("5-6", "8-10"), "min")
#[1] 5
To pass the strings one by one you can use rowwise. Secondly, case_when still evaluates the function for values that do not satisfy the condition, so it returns a warning for the "NA" value. We can use if/else here to avoid that.
library(dplyr)
library(stringr)
tibble(x = c("5-6", "4", "6-9", "5", "NA")) %>%
rowwise() %>%
mutate(min = if(str_detect(x, "-")) val_from_range(x, "min") else NA,
max = if(str_detect(x, "-")) val_from_range(x, "max") else NA) %>%
ungroup
# x min max
# <chr> <dbl> <dbl>
#1 5-6 5 6
#2 4 NA NA
#3 6-9 6 9
#4 5 NA NA
#5 NA NA NA
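An alternative to rowwise is to make the helper itself work element by element. The following is a base R sketch under that assumption: val_from_range_vec is a hypothetical name, not part of the original answer, and it uses regmatches/gregexpr in place of str_extract_all.

```r
# Hypothetical vectorised variant: one result per input string
val_from_range_vec <- function(.str, .fun = c("min", "max")) {
  fun <- match.fun(match.arg(.fun))
  # One character vector of matched numbers per input string
  nums <- regmatches(.str, gregexpr("\\d*\\.?\\d+", .str, perl = TRUE))
  vapply(
    nums,
    function(v) if (length(v) == 0) NA_real_ else fun(as.numeric(v)),
    numeric(1)
  )
}

x <- c("5-6", "4", "6-9", "5", "NA")
is_range <- grepl("-", x, fixed = TRUE)

# Only keep the result where the string actually contains a range
res_min <- ifelse(is_range, val_from_range_vec(x, "min"), NA_real_)
res_max <- ifelse(is_range, val_from_range_vec(x, "max"), NA_real_)
```

Because the helper returns one value per element, it could also be used on the right-hand side of case_when.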

Delete duplicate rows based on condition in another column

Let's say I have this data frame:
df <- data.frame(
a = c(NA,6,6,8),
x= c(1,2,2,4),
y = c(NA,2,NA,NA),
z = c("apple", 2, "2", NA),
d = c(NA, 5, 5, 5),stringsAsFactors = FALSE)
Rows 2 and 3 are duplicates and row 3 has an NA value. I want to delete the duplicate row with the NA value so that it looks like this:
df <- data.frame(
a = c(NA,6,8),
x= c(1,2,4),
y = c(NA,2,NA),
z = c("apple", 2, NA),
d = c(NA, 5, 5),stringsAsFactors = FALSE)
I tried this but it doesn't work:
df2 <- df %>% group_by (a,x,z,d) %>% filter(y == max(y))
Any suggestions?
df %>%
arrange_all() %>%
filter(!duplicated(fill(., everything())))
a x y z d
1 NA 1 NA apple NA
2 6 2 2 2 5
3 8 4 NA <NA> 5
df %>% arrange(a,x,z,d) %>% distinct(a,x,z,d,.keep_all=TRUE)
a x y z d
1 6 2 2 2 5
2 8 4 NA <NA> 5
3 NA 1 NA apple NA
Fill NA values with previous non-NA and select unique rows with distinct.
library(dplyr)
library(tidyr)
df %>% fill(everything()) %>% distinct()
# a x y z d
#1 NA 1 NA apple NA
#2 6 2 2 2 5
#3 8 4 NA <NA> 5
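Note that fill() copies values down from whichever row sits above, so the fill-based answers depend on row order. A base R sketch that instead keeps, for each (a, x, z, d) combination, a row with a non-NA y when one exists (the key columns are taken from the question's attempt; the variable names are illustrative):

```r
df <- data.frame(
  a = c(NA, 6, 6, 8),
  x = c(1, 2, 2, 4),
  y = c(NA, 2, NA, NA),
  z = c("apple", "2", "2", NA),
  d = c(NA, 5, 5, 5),
  stringsAsFactors = FALSE
)
key <- c("a", "x", "z", "d")

# Put rows with a non-NA y first, so duplicated() keeps them
sorted <- df[order(is.na(df$y)), ]
res <- sorted[!duplicated(sorted[key]), ]

# Restore the original row order
res <- res[order(as.integer(rownames(res))), ]
res
```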

A function to fill in a column with NA of the same type

I have a data frame with many columns of different types. I would like to replace each column with NA of the corresponding class.
for example:
df = data_frame(x = c(1,2,3), y = c("a", "b", "c"))
df[, 1:2] <- NA
yields a data frame with two logical columns, rather than numeric and character.
I know I can tell R:
df[,1] = as.numeric(NA)
df[,2] = as.character(NA)
But how do I do this collectively in a loop for all columns with all possible types of NA?
You can use this "trick" :
df[1:nrow(df),1] <- NA
df[1:nrow(df),2] <- NA
The [1:nrow(df), ] index tells R to replace every value in the column with NA, so the logical NA is coerced to the column's original type instead of the column itself being replaced.
Also, if you have a lot of columns to replace and the data_frame has a lot of rows, I suggest storing the row indexes and reusing them:
rowIdxs <- 1:nrow(df)
df[rowIdxs ,1] <- NA
df[rowIdxs ,2] <- NA
df[rowIdxs ,3] <- NA
...
As cleverly suggested by @RonakShah, you can also use:
df[TRUE, 1] <- NA
df[TRUE, 2] <- NA
...
As pointed out by @Cath, both methods still work when you select more than one column, e.g.:
df[TRUE, 1:3] <- NA
# or
df[1:nrow(df), 1:3] <- NA
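The difference between the two forms of assignment can be checked directly. A minimal sketch, assuming a plain base R data.frame rather than a tibble (df_cols, df_rows and the classes_* names are illustrative):

```r
df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"), stringsAsFactors = FALSE)

# No row index: each column is replaced wholesale by the logical NA
df_cols <- df
df_cols[, 1:2] <- NA
classes_cols <- sapply(df_cols, class)

# With a row index: NA is assigned into the existing vectors, which keep their type
df_rows <- df
df_rows[TRUE, 1:2] <- NA
classes_rows <- sapply(df_rows, class)

classes_cols
classes_rows
```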
Another solution that applies to all the columns can be to specify the non-NAs and replace with NA, i.e.
df[!is.na(df)] <- NA
which gives,
# A tibble: 3 x 2
x y
<dbl> <chr>
1 NA <NA>
2 NA <NA>
3 NA <NA>
Another way to change all columns at once while keeping the variables' classes:
df[] <- lapply(df, function(x) {type <- class(x); x <- NA; class(x) <- type; x})
df
# A tibble: 3 x 2
# x y
# <dbl> <chr>
#1 NA <NA>
#2 NA <NA>
#3 NA <NA>
As @digEmAll noted in the comments, there is a similar but shorter way:
df[] <- lapply(df, function(x) as(NA,class(x)))
Using dplyr::na_if:
library(dplyr)
df %>%
mutate(x = na_if(x, x),
y = na_if(y, y))
# # A tibble: 3 x 2
# x y
# <dbl> <chr>
# 1 NA NA
# 2 NA NA
# 3 NA NA
If we want to mutate only subset of columns to NA, then:
# dataframe with extra column that stay unchanged
df = data_frame(x = c(1,2,3), y = c("a", "b", "c"), z = c(4:6))
df %>%
mutate_at(vars(x, y), funs(na_if(.,.)))
# # A tibble: 3 x 3
# x y z
# <dbl> <chr> <int>
# 1 NA NA 4
# 2 NA NA 5
# 3 NA NA 6
Using bind_cols() from dplyr you can also do:
df <- data_frame(x = c(1,2,3), y = c("a", "b", "c"))
classes <- sapply(df, class)
df[,1:2] <- NA
bind_cols(lapply(colnames(df), function(x){eval(parse(text=paste0("as.", classes[names(classes[x])], "(", df[,x],")")))}))
V1 V2
<dbl> <chr>
1 NA NA
2 NA NA
3 NA NA
Please note that this will change the colnames.
Another approach using dplyr:
df <- tibble(x = c(1,2,3), y = c("a", "b", "c"))
df
#> # A tibble: 3 x 2
#> x y
#> <dbl> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
df %>%
mutate(across(everything(), ~as(NA, class(.x))))
#> # A tibble: 3 x 2
#> x y
#> <dbl> <chr>
#> 1 NA <NA>
#> 2 NA <NA>
#> 3 NA <NA>
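One more variant, sketched here as an alternative that does not go through class names at all: indexing a vector by NA_integer_ returns a missing value of the vector's own type, and attributes such as factor levels or the Date class are kept. The extra columns f and d are made up for illustration.

```r
df <- data.frame(
  x = c(1, 2, 3),
  y = c("a", "b", "c"),
  f = factor(c("u", "v", "u")),
  d = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
  stringsAsFactors = FALSE
)

# Indexing by NA returns a missing value of the column's own type
out <- df
out[] <- lapply(df, function(col) col[rep(NA_integer_, length(col))])

sapply(out, class)
```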

Using cummean with group_by and ignoring NAs

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
value=c(NA,2,3,4,5,NA,7,8))
I'd like to add a new column to the above dataframe which takes the cumulative mean of the value column, not taking into account NAs. Is it possible to do this with dplyr? I've tried
df <- df %>% group_by(category) %>% mutate(new_col=cummean(value))
but cummean just doesn't know what to do with NAs.
EDIT: I do not want to count NAs as 0.
You could use ifelse to treat NAs as 0 for the cummean call:
library(dplyr)
df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
value=c(NA,2,3,4,5,NA,7,8))
df %>%
group_by(category) %>%
mutate(new_col = cummean(ifelse(is.na(value), 0, value)))
Output:
# A tibble: 8 x 3
# Groups: category [2]
category value new_col
<fct> <dbl> <dbl>
1 cat1 NA 0.
2 cat1 2. 1.00
3 cat2 3. 3.00
4 cat1 4. 2.00
5 cat2 5. 4.00
6 cat2 NA 2.67
7 cat1 7. 3.25
8 cat2 8. 4.00
EDIT: Now I see this isn't the same as ignoring NAs.
Try this one instead. I group by a column which specifies if the value is NA or not, meaning cummean can run without encountering any NAs:
library(dplyr)
df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
value=c(NA,2,3,4,5,NA,7,8))
df %>%
group_by(category, isna = is.na(value)) %>%
mutate(new_col = ifelse(isna, NA, cummean(value)))
Output:
# A tibble: 8 x 4
# Groups: category, isna [4]
category value isna new_col
<fct> <dbl> <lgl> <dbl>
1 cat1 NA TRUE NA
2 cat1 2. FALSE 2.00
3 cat2 3. FALSE 3.00
4 cat1 4. FALSE 3.00
5 cat2 5. FALSE 4.00
6 cat2 NA TRUE NA
7 cat1 7. FALSE 4.33
8 cat2 8. FALSE 5.33
An option is to remove the NA values before calculating cummean. With this method, rows with an NA value are not included in the cummean calculation. Not sure if OP wants to treat NA values as 0 in the calculation.
df %>% mutate(rn = row_number()) %>%
filter(!is.na(value)) %>%
group_by(category) %>%
mutate(new_col = cummean(value)) %>%
ungroup() %>%
right_join(mutate(df, rn = row_number()), by="rn") %>%
select(category = category.y, value = value.y, new_col) %>%
as.data.frame()
# category value new_col
# 1 cat1 NA NA
# 2 cat1 2 2.000000
# 3 cat2 3 3.000000
# 4 cat1 4 3.000000
# 5 cat2 5 4.000000
# 6 cat2 NA NA
# 7 cat1 7 4.333333
# 8 cat2 8 5.333333
I needed something similar, but could not replace NAs with 0, so I created this simple function, which works with dplyr. Hope this helps.
cummean.na <- function(x, na.rm = TRUE)
{
# example input: x = c(NA, seq(1, 10, 1)); na.rm = TRUE
n <- length(x)
op <- rep(NA_real_, n)
# NA positions stay NA; other positions get the mean of all values so far
for(i in seq_len(n)) {op[i] <- if (is.na(x[i])) NA_real_ else mean(x[1:i], na.rm = na.rm)}
return(op)
}
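Custom function to calculate "cummean", ignoring NAs and carrying forward the previous cumulative mean value to the next NA value (leading NAs stay NA, since there is nothing to carry forward):
cummean.na <-
function(x) {
tmp_ind <- cumsum(!is.na(x))
tmp_ind[tmp_ind == 0] <- NA # a zero index would silently drop leading NAs
x_nona <- x[!is.na(x)]
out <- dplyr::cummean(x_nona)[tmp_ind]
return(out)
}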
Example output:
> cummean.na(1:5)
[1] 1.0 1.5 2.0 2.5 3.0
> cummean.na(c(1, 2, 3, NA, 4, 5))
[1] 1.0 1.5 2.0 2.0 2.5 3.0
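For the variant the question asks for (NA rows stay NA and do not count toward the mean), a base R sketch without dplyr is shown here. cummean_skip_na is a hypothetical name, and ave() applies it within each category.

```r
# Hypothetical helper: running mean over the non-NA values seen so far
cummean_skip_na <- function(x) {
  filled <- ifelse(is.na(x), 0, x)           # NAs add nothing to the running sum
  out <- cumsum(filled) / cumsum(!is.na(x))  # divide by the count of non-NA values
  out[is.na(x)] <- NA                        # keep NA where the input was NA
  out
}

df <- data.frame(
  category = c("cat1", "cat1", "cat2", "cat1", "cat2", "cat2", "cat1", "cat2"),
  value = c(NA, 2, 3, 4, 5, NA, 7, 8)
)

# ave() runs the helper separately for each category and keeps the row order
df$new_col <- ave(df$value, df$category, FUN = cummean_skip_na)
df
```

To carry the previous mean forward onto NA rows instead, the out[is.na(x)] <- NA line can be dropped, at the cost of a NaN for any leading NA.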

Resources