R dplyr method to replace all empty factors with NA

Instead of writing a data frame out and reading it back in to turn all empty factor values into NA, as with this argument,
na.strings=c("","NA")
I wanted to just apply a function to all the columns and substitute the empties with NA. I've selected the factor columns so far but don't know what to do next.
df %>% select_if(is.factor) %>% ....
How would I be able to do this, preferably with dplyr and/or apply methods?

We can use mutate_if
df <- df %>%
mutate_if(is.factor, funs(factor(replace(., .=="", NA))))
With dplyr 0.8.0, we can also do
df %>%
mutate_if(is.factor, na_if, y = "")
or replace funs (which is being deprecated) with list, as @Frederick mentioned in the comments
df %>%
mutate_if(is.factor, list(~ na_if(., "")))
Or using base R, we can assign NA to the specific level
j1 <- sapply(df, is.factor)
# setting the "" level to NA drops that level and marks its elements as missing
df[j1] <- lapply(df[j1], function(x) {levels(x)[levels(x) == ""] <- NA; x})
data
df <- data.frame(col1 = c("", "A", "B", ""), col2 = c("A", "", "", "C"),
                 col3 = 1:4, stringsAsFactors = TRUE) # needed in R >= 4.0 to get factor columns
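In dplyr 1.0.0 and later, mutate_if() and funs() are superseded by across(), so the same idea can be sketched as follows. This is only an illustration, not part of the original answers: empty_to_na is a hypothetical helper name, and stringsAsFactors = TRUE is set explicitly on the assumption that R 4.0 or later is in use.

```r
library(dplyr)

# Example data; stringsAsFactors = TRUE is explicit because R >= 4.0
# no longer converts strings to factors by default.
df <- data.frame(col1 = c("", "A", "B", ""),
                 col2 = c("A", "", "", "C"),
                 col3 = 1:4,
                 stringsAsFactors = TRUE)

# Hypothetical helper: turn the "" level of a factor into NA.
# Assigning NA to a level drops that level and marks its elements as missing.
# A factor without a "" level is returned unchanged.
empty_to_na <- function(x) {
  levels(x)[levels(x) == ""] <- NA
  x
}

# Apply the helper to every factor column and leave the others alone.
df_clean <- df %>%
  mutate(across(where(is.factor), empty_to_na))

df_clean
```

Working on the levels rather than the values also removes the unused "" level, so no separate droplevels() call should be needed.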

Related

add labels to gt table with for loop over column names

I have the following data table.
library(dplyr)
library(gt)
library(stringr) # for str_trim() and str_remove()
df <- tibble(
`model 2000` = c("a", "b"),
`car 2022` = c("f", "d")
)
I would like to loop over a vector of column names and perform a string replace, then append this to the gt table
my_cols <- colnames(df)
for(i in my_cols){
df <- df %>%
gt() %>%
cols_label(
i = str_trim(str_remove(i, "2020|2021|2022"))
)
df
}
I want to be able to change the names after the GT table is created using this for loop but when the loop passes the values in my_cols, they aren't recognized... any help?
Here is the error:
Error: All column names provided must exist in the input `.data` table.
The best way to do this is to eschew looping, build a named vector, and splice it inside cols_label():
my_cols <- setNames(str_trim(str_remove(names(df), "2020|2021|2022")), names(df))
df %>%
gt() %>%
cols_label(!!!my_cols)
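To see what actually gets spliced, the label mapping can be built and inspected on its own. The sketch below uses base R stand-ins (sub() and trimws()) for str_remove() and str_trim(), and assumes the column names "model 2021" and "car 2022" from the input shown further down; it does not need gt to run.

```r
# Column names assumed from the question's input.
old_names <- c("model 2021", "car 2022")

# Remove the first matching year, then trim the leftover whitespace.
new_labels <- trimws(sub("2020|2021|2022", "", old_names))

# Names are the existing columns; values are the labels to display.
my_cols <- setNames(new_labels, old_names)
print(my_cols)
```

Each element has the form `existing column` = "new label", which is the shape cols_label() expects once the vector is spliced with !!!.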
If for some reason you must use a loop, you need to create the gt() object outside of the loop, otherwise after the first iteration you're passing an object that is already class gt_tbl to the gt() function which causes an error. You also need to use the Walrus operator := instead of = and the LHS needs to be a symbol or a glue string.
my_cols <- names(df)
df <- df %>%
gt()
for(i in my_cols) {
df <- df %>%
cols_label("{i}" := str_trim(str_remove(i, "2020|2021|2022"))) # or !!sym(i) := ...
}
df
You can use the .list option in cols_label().
my_cols <- colnames(df)
df %>%
gt() %>%
cols_label(
.list = setNames(as.list(str_trim(str_remove(my_cols, "2020|2021|2022"))), my_cols)
)
However, it seems much easier to just rename the columns before creating the table:
my_cols <- colnames(df)
df %>%
  rename_with(~ str_trim(str_remove(.x, "2020|2021|2022")), .cols = my_cols) %>%
  gt()
Input:
df <- tibble(
`model 2021` = c("a", "b"),
`car 2022` = c("f", "d")
)

Combine row values into character vector by condition

I have a data.frame where values are repeated in col1.
col1 <- c("A", "A", "B", "B", "C")
col2 <- c(1995, 1997, 1999, 2000, 2005)
df <- data.frame(col1, col2)
I want to combine values in col2 that correspond to the same letter in col1 into one cell, so that col2 shows a range of values for a particular letter in col1. I do this by splitting the data.frame by col1, applying fun, and binding the split data.frames back together.
library(tidyverse)
split_df <- split(df, df$col1)
fun <- function(df) {
if (length(unique(df$col2)) > 1) {
df$col2 <- paste(min(df$col2),
max(df$col2),
sep = "-")
df <- distinct(df)
}
return(df)
}
split_df <- lapply(split_df, fun)
df <- do.call(rbind, split_df)
This works, but I am wondering if there is a more intuitive or more efficient solution?
Base R way using aggregate -
aggregate(col2~col1, df, function(x) paste0(unique(range(x)), collapse = '-'))
# col1 col2
#1 A 1995-1997
#2 B 1999-2000
#3 C 2005
Same can also be written with dplyr -
library(dplyr)
df %>%
group_by(col1) %>%
summarise(col2 = paste0(unique(range(col2)), collapse = '-'))
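Both versions rely on unique(range(x)) to avoid a special case for groups with a single value: range() returns the minimum and maximum, and unique() collapses them to one element when they are equal. A small base R sketch of just that step, where year_range is a hypothetical helper name:

```r
# Hypothetical helper: collapse a numeric vector to "min-max",
# or to a single value when min and max coincide.
year_range <- function(x) paste(unique(range(x)), collapse = "-")

r_two   <- year_range(c(1995, 1997))        # two distinct years
r_one   <- year_range(2005)                 # a single year
r_three <- year_range(c(1999, 2003, 2000))  # only the extremes are kept

# The same helper applied per group with tapply().
col1 <- c("A", "A", "B", "B", "C")
col2 <- c(1995, 1997, 1999, 2000, 2005)
by_group <- tapply(col2, col1, year_range)
print(by_group)
```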
One option would be the tidyverse, where you can accomplish this a little more succinctly. The basic idea is the same:
library(tidyverse)
new.result <- df %>%
group_by(col1) %>%
summarize(
col2 = ifelse(n() == 1, as.character(col2), paste(min(col2), max(col2), sep = '-'))
)
col1 col2
<chr> <chr>
1 A 1995-1997
2 B 1999-2000
3 C 2005
A different (but possibly overcomplicated) approach assumes that you have at most two years per grouping. We can pivot the start and end years into their own columns, and then paste them together directly. This requires a little more data transformation but avoids having to check explicitly for groups with 1 year:
df %>%
group_by(col1) %>%
mutate(n = row_number()) %>%
pivot_wider(names_from = n, values_from = col2) %>%
rowwise() %>%
mutate(
vec = list(c(`1`, `2`)),
col2 = paste(vec[!is.na(vec)], collapse = '-')
) %>%
select(col1, col2)

Match text from one column with another column (vlookup + like)

I'm trying to perform a match of 2 columns but without success. I have one DF1 with 2 columns, Id and JSON. In the second DF2, I have one column with a pattern to be matched in each row for DF1$json (something like vlookup + like function).
As an output, I'd like to get DF1$Id but only where any of DF2 is matched with DF1$json.
I've tried some combinations with str_detect but it doesn't work on non-vector values. Maybe some tricks with grep or stringr functions?
For example:
str_detect(DF1$json, fixed(DF2[1,1], ignore_case = TRUE))
df1 <- data.frame(
Id = c("AA", "BB", "CC", "DD"),
json = c("{xxx:yyy:zzz};{mmm:zzz:vvv}", "{ccc:yyy:zzz};{ddd:zzz:vvv}", "{ttt:yyy:zzz};{mmm:zzz:vvv}", "{uuu:yyy:zzz};{mmm:zzz:vvv}")
)
matches <- c("mmm:zzz:vvv", "mmm:yyy:zzz")
library(stringr) # needed for str_extract_all()
Solution using data.table
library(data.table)
setDT(df1)
df1[, match := any(str_extract_all(json, "(?<=\\{).+?(?=\\})")[[1]] %in% matches), by = Id]
df1[match == TRUE, .(Id)]
Solution using dplyr
library(dplyr)
df1 %>%
group_by(Id) %>%
mutate(match = any(str_extract_all(json, "(?<=\\{).+?(?=\\})")[[1]] %in% matches)) %>%
filter(match == TRUE) %>%
select(Id)
Or just directly filter()
df1 %>%
group_by(Id) %>%
filter(any(str_extract_all(json, "(?<=\\{).+?(?=\\})")[[1]] %in% matches)) %>%
select(Id)
Output of both methods
Id
1: AA
2: CC
3: DD
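The solutions above compare whole {...} segments with %in%. If a "like"-style substring match is wanted instead, a base R sketch along these lines may be closer to the question. It is an alternative, not the answerer's method; contains_any is a hypothetical helper, and fixed = TRUE is used on the assumption that the patterns are literal text rather than regular expressions.

```r
df1 <- data.frame(
  Id = c("AA", "BB", "CC", "DD"),
  json = c("{xxx:yyy:zzz};{mmm:zzz:vvv}", "{ccc:yyy:zzz};{ddd:zzz:vvv}",
           "{ttt:yyy:zzz};{mmm:zzz:vvv}", "{uuu:yyy:zzz};{mmm:zzz:vvv}"),
  stringsAsFactors = FALSE
)
matches <- c("mmm:zzz:vvv", "mmm:yyy:zzz")

# Hypothetical helper: TRUE if `text` contains at least one of `patterns`
# as a literal substring.
contains_any <- function(text, patterns) {
  any(vapply(patterns, function(p) grepl(p, text, fixed = TRUE), logical(1)))
}

# One logical per row of df1, then keep the matching Ids.
hit <- vapply(df1$json, contains_any, logical(1),
              patterns = matches, USE.NAMES = FALSE)
df1$Id[hit]
```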
Does this give you the expected result?
my_df <- data.frame("id" = c("AA", "BB", "CC", "DD"),
"json" = c("{x:y:z};{m:z:v}", "{c:y:z};{d:z:v}", "{t:y:z};{m:z:v}", "{u:y:z};{m:z:v}"),
"pattern" = c("m:z:v", "t:y:z", "m:z:v", "t"),
stringsAsFactors = FALSE)
my_f <- function(x) {
my_var <- paste(grep(pattern = my_df[x, "pattern"], x = my_df$json), collapse = " ")
return (my_var)
}
my_df$Value <- lapply(1:nrow(my_df), my_f)

How to use mutate rowwise with complex row operation?

How can I use mutate to achieve the below?
bd_diag_date <- df %>%
apply(1, function(dates) last(na.omit(dates))) %>%
as.data.frame() %>%
`colnames<-`("diag_date")
I tried the code below, but it didn't work, and I can't figure out why. It says Error: Column 'diagnosis_date' is of unsupported type symbol. Should I assume mutate takes any function that can be applied to a vector? If not, what kind of operation does it accept?
bd_diag_date <- df %>%
rowwise() %>%
{mutate(., diag_date=last(na.omit(all_vars(.))))}
I also have a more general question: how can I debug this? Every time I encounter this kind of problem I have to search Stack Exchange, but I feel this isn't the right way to improve my dplyr skills.
We can use pmap
library(dplyr)
library(purrr)
df %>%
mutate(diag_date = pmap(., ~ last(na.omit(c(...)))))
If the columns are numeric, we can use pmap_dbl; plain pmap returns a list column
df %>%
mutate(diag_date = pmap_dbl(., ~ last(na.omit(c(...)))))
# col1 col2 col3 diag_date
#1 1 NA 2 2
#2 NA 2 NA 2
#3 3 4 NA 4
If we need to return only a single column, use transmute
df %>%
transmute(diag_date = pmap_dbl(., ~ last(na.omit(c(...)))))
Or with group_split and map
df %>%
  group_split(grp = row_number(), .keep = FALSE) %>% # `keep` is deprecated in favour of `.keep`
  map_dfr(~ .x %>%
    transmute(diag_date = last(na.omit(unlist(.)))))
Or using base R with max.col
df$diag_date <- df[cbind(seq_len(nrow(df)), max.col(!is.na(df), 'last'))]
data
df <- data.frame(col1 = c(1, NA, 3), col2 = c(NA, 2, 4), col3 = c(2, NA, NA))
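With dplyr 1.0.0 or later, rowwise() combined with c_across() is another way to express a per-row operation inside mutate(), which is close to what the question attempted. The following is a sketch under that version assumption; last_non_na is a hypothetical helper, and it assumes all selected columns share a type.

```r
library(dplyr)

df <- data.frame(col1 = c(1, NA, 3), col2 = c(NA, 2, 4), col3 = c(2, NA, NA))

# Hypothetical helper: last non-missing element of a vector,
# or NA when every element is missing.
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA else x[[length(x)]]
}

res <- df %>%
  rowwise() %>%                                            # treat each row as its own group
  mutate(diag_date = last_non_na(c_across(everything()))) %>%
  ungroup()                                                # drop the row-wise grouping again

res
```

c_across() gathers the selected columns of the current row into one vector, so any function that works on a plain vector can be used here.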

Add multiple columns with mutate using column-based conditions, without using explicit column name + POSIX

I have a dataframe of data: 1 column is POSIX, the rest is data.
I need to remove selectively some data from a group of columns and add these "new" columns to the original dataframe.
I can "easily" do it in base R (I am an old-style user). I'd like to do it more compactly with mutate_at or another function, although I am running into several issues.
A solution homemade with base R could be
df <- data.frame(
  "date" = seq.POSIXt(as.POSIXct(format(Sys.time(), "%F %T"), tz = "UTC"), length.out = 20, by = "min"),
  "a.1" = rnorm(20, 0, 3),
  "a.2" = rnorm(20, 1, 2),
  "b.1" = rnorm(20, 1, 4),
  "b.2" = rnorm(20, 3, 4)
)
df1 <- lapply(df[,grep("^a",names(df))], function(x) replace(x, which(x > 0 & x < 0.2), NA))
df1 <- data.frame(matrix(unlist(df1), nrow = nrow(df), byrow = F)) ## convert to data.frame
names(df1) <- grep("^a",names(df),value=T) ## rename columns
df1 <- cbind.data.frame("date"=df$date, df1) ## add date
Can anyone help me in setting up something working with dplyr + transmute?
So far I have come up with something like:
df %>%
select(starts_with("a.")) %>%
transmute(
case_when(
.>0.2 ~ NA,
)
) %>%
cbind.data.frame(df)
But I am quite stuck, since I can't combine transmute with case_when: all the examples I found use the column names explicitly in case_when, but I can't, since I won't know the column names in advance. I will only know the prefix of the columns that I need to transmute.
Thanks,
Alex
We can use transmute_at if the intention is to return only those columns specified in the vars
library(dplyr)
df %>%
  transmute_at(vars(starts_with('a')), ~ case_when(. > 0.2 ~ NA_real_, TRUE ~ .)) %>%
  bind_cols(df %>% select(date), .)
If we need all the columns to return, but only change the columns of interest in vars, then we need mutate_at instead of transmute_at
df %>%
  mutate_at(vars(starts_with('a')), ~ case_when(. > 0.2 ~ NA_real_, TRUE ~ .)) %>%
  select(date, starts_with('a')) # only needed if we are selecting a subset of columns
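In dplyr 1.0.0 and later, the scoped verbs mutate_at() and transmute_at() are superseded by across(), so the same replacement can be sketched as below. The small fixed data frame is made up for illustration (the original uses random values), and the > 0.2 condition follows the dplyr attempt in the question rather than the base R version.

```r
library(dplyr)

# Small deterministic stand-in for the question's data (values are invented).
df <- data.frame(
  date = as.POSIXct("2024-01-01 00:00:00", tz = "UTC") + 60 * 0:3,
  a.1 = c(-1.0, 0.1, 0.5, 2.0),
  a.2 = c(0.3, -0.2, 0.2, 0.0),
  b.1 = c(1, 2, 3, 4)
)

res <- df %>%
  # Replace values above 0.2 with NA in every column whose name starts with "a."
  mutate(across(starts_with("a."), ~ replace(.x, .x > 0.2, NA))) %>%
  # Keep the date plus the modified columns only.
  select(date, starts_with("a."))

res
```

Because the columns are picked by prefix, nothing in the pipeline depends on knowing the full column names in advance.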
