Combine row values into character vector by condition - r

I have a data.frame where values are repeated in col1.
col1 <- c("A", "A", "B", "B", "C")
col2 <- c(1995, 1997, 1999, 2000, 2005)
df <- data.frame(col1, col2)
I want to combine values in col2 that correspond to the same letter in col1 into one cell, so that col2 shows a range of values for a particular letter in col1. I do this by splitting the data.frame by col1, applying fun, and binding the split data.frames back together.
library(tidyverse)
split_df <- split(df, df$col1)
fun <- function(df) {
if (length(unique(df$col2)) > 1) {
df$col2 <- paste(min(df$col2),
max(df$col2),
sep = "-")
df <- distinct(df)
}
return(df)
}
split_df <- lapply(split_df, fun)
df <- do.call(rbind, split_df)
This works, but I am wondering if there is a more intuitive or more efficient solution?

Base R way using aggregate -
aggregate(col2~col1, df, function(x) paste0(unique(range(x)), collapse = '-'))
# col1 col2
#1 A 1995-1997
#2 B 1999-2000
#3 C 2005
Same can also be written with dplyr -
library(dplyr)
df %>%
group_by(col1) %>%
summarise(col2 = paste0(unique(range(col2)), collapse = '-'))
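To see why unique(range(x)) removes the need for an explicit single-value check, here is a minimal base R sketch. The helper name collapse_range is hypothetical and only used for illustration; it is not part of either answer.

```r
# Hypothetical helper (not from the answers above): collapse a numeric
# vector into "min-max", or into a single value when min and max coincide.
collapse_range <- function(x) {
  # range() always returns c(min, max); unique() drops the duplicate
  # when the group holds only one distinct value.
  paste0(unique(range(x)), collapse = "-")
}

collapse_range(c(1995, 1997))  # "1995-1997"
collapse_range(2005)           # "2005"
```

The same function can be passed directly as the FUN argument of aggregate or used inside summarise.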

One option would be the tidyverse, where you can accomplish this a little more succinctly. The basic idea is the same:
library(tidyverse)
new.result <- df %>%
group_by(col1) %>%
summarize(
col2 = ifelse(n() == 1, as.character(col2), paste(min(col2), max(col2), sep = '-'))
)
col1 col2
<chr> <chr>
1 A 1995-1997
2 B 1999-2000
3 C 2005
A different (but possibly overcomplicated) approach assumes that you have at most two years per grouping. We can pivot the start and end years into their own columns, and then paste them together directly. This requires a little more data transformation but avoids having to check explicitly for groups with 1 year:
df %>%
group_by(col1) %>%
mutate(n = row_number()) %>%
pivot_wider(names_from = n, values_from = col2) %>%
rowwise() %>%
mutate(
vec = list(c(`1`, `2`)),
col2 = paste(vec[!is.na(vec)], collapse = '-')
) %>%
select(col1, col2)

Related

How to get unique occurrences of these character strings separated by ";"?

So I have a column with values in this structure:
library(tibble)
df <- tribble(
~col,
"AA_BB;AA_AA;AA_BB",
"BB_BB;AA_AA",
"AA_BB",
"BB_AA;BB_AA;AA_AA;BB_AA"
)
So each row has items separated by a ";". The first row has items AA_BB, AA_AA and AA_BB. I want the first row to be transformed to "AA_BB;AA_AA" and the last row to be transformed to "BB_AA;AA_AA".
I thought about using separate, but the result didn't really help me (especially since I don't know how many columns there can be at most).
df %>%
separate(col, into = c("A", "B", "C", "D"), sep = ";")
Any tips on how to do this?
We can split the column, get the unique elements and paste
library(dplyr)
library(stringr)
library(purrr)
df %>%
mutate(col = map_chr(strsplit(col, ";"), ~ str_c(unique(.x), collapse=";")))
-output
# A tibble: 4 x 1
# col
# <chr>
#1 AA_BB;AA_AA
#2 BB_BB;AA_AA
#3 AA_BB
#4 BB_AA;AA_AA
Or split with separate_rows, then do a group by paste after getting the distinct rows
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_rows(col, sep=";") %>%
distinct %>%
group_by(rn) %>%
summarise(col = str_c(col, collapse=";"), .groups = 'drop') %>%
select(col)
In base R, you can split the string on semi-colon, keep only unique strings and paste them together.
df$col1 <- sapply(strsplit(df$col, ';'), function(x)
paste0(unique(x), collapse = ';'))
df
# A tibble: 4 x 2
# col col1
# <chr> <chr>
#1 AA_BB;AA_AA;AA_BB AA_BB;AA_AA
#2 BB_BB;AA_AA BB_BB;AA_AA
#3 AA_BB AA_BB
#4 BB_AA;BB_AA;AA_AA;BB_AA BB_AA;AA_AA
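The split, unique, paste idea can also be wrapped in a small reusable function. This is only a sketch: the name dedupe_items and its sep argument are hypothetical, and vapply is used in place of sapply so the result is always a character vector.

```r
# Hypothetical helper: remove repeated items from delimiter-separated strings,
# keeping the first occurrence of each item in its original order.
dedupe_items <- function(x, sep = ";") {
  parts <- strsplit(x, sep, fixed = TRUE)  # list with one character vector per string
  vapply(parts, function(p) paste(unique(p), collapse = sep), character(1))
}

x <- c("AA_BB;AA_AA;AA_BB", "BB_BB;AA_AA", "AA_BB", "BB_AA;BB_AA;AA_AA;BB_AA")
dedupe_items(x)
# "AA_BB;AA_AA" "BB_BB;AA_AA" "AA_BB" "BB_AA;AA_AA"
```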

How to use mutate rowwise with complex row operation?

How can I use mutate to achieve the below?
bd_diag_date <- df %>%
apply(1, function(dates) last(na.omit(dates))) %>%
as.data.frame() %>%
`colnames<-`("diag_date")
I tried the code below, but it didn't work. I can't work out why; it fails with Error: Column 'diagnosis_date' is of unsupported type symbol. Should I assume mutate accepts any function that can be applied to a vector? If not, what kind of operation does it accept?
bd_diag_date <- df %>%
rowwise() %>%
{mutate(., diag_date=last(na.omit(all_vars(.))))}
I also have a more general question: how can I debug this? Every time I encounter this kind of problem I have to search Stack Exchange, but I feel this isn't the right way to improve my dplyr skills.
We can use pmap
library(dplyr)
library(purrr)
df %>%
mutate(diag_date = pmap(., ~ last(na.omit(c(...)))))
If the columns are numeric, we can use pmap_dbl; plain pmap returns a list column
df %>%
mutate(diag_date = pmap_dbl(., ~ last(na.omit(c(...)))))
# col1 col2 col3 diag_date
#1 1 NA 2 2
#2 NA 2 NA 2
#3 3 4 NA 4
If we need to return only a single column, use transmute
df %>%
transmute(diag_date = pmap_dbl(., ~ last(na.omit(c(...)))))
Or with group_split and map
df %>%
group_split(grp = row_number(), keep = FALSE) %>%
map_dfr(~ .x %>%
transmute(diag_date = last(na.omit(unlist(.)))))
Or using base R with max.col
df$diag_date <- df[cbind(seq_len(nrow(df)), max.col(!is.na(df), 'last'))]
data
df <- data.frame(col1 = c(1, NA, 3), col2 = c(NA, 2, 4), col3 = c(2, NA, NA))
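The max.col one-liner is compact, so it may help to see it broken into steps. The following is a sketch on the example data only; the intermediate names present, last_col and idx are introduced here for illustration.

```r
df <- data.frame(col1 = c(1, NA, 3), col2 = c(NA, 2, 4), col3 = c(2, NA, NA))

# Logical matrix: TRUE where a value is present
present <- !is.na(df)

# For each row, the index of the last column holding TRUE
# (ties among the maxima are resolved towards the last column)
last_col <- max.col(present, ties.method = "last")

# Two-column (row, column) index matrix picks one cell per row
idx <- cbind(seq_len(nrow(df)), last_col)
diag_date <- as.matrix(df)[idx]

last_col   # 3 2 2
diag_date  # 2 2 4
```

Note that a row consisting only of NA would resolve to the last column, so the extracted value for that row would be NA.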

How to change column name according to another dataframe in R?

df1 <- data.frame(
cola = c('1',NA,'c','1','1','e','1',NA,'c','d'),
colb = c("A",NA,"C","D",'a','b','c','d','c','d'),
colc = c('a',NA,'c','d','a',NA,'c',NA,'c','d'),stringsAsFactors = TRUE)
df2<-data.frame(name=c('cola','colc','colb'),
altname=c('a','c','b'))
df1 %>% table %>% data.frame(.)
Result of above codes as:
cola colb colc Freq
1 1 a a 1
2 c a a 0
I want to change columns name of result based on df2(for example,change colb to b ),the expected result as:
a b c Freq
1 1 a a 1
2 c a a 0
How to do it?
We can just remove the substring with rename_at
library(stringr)
library(dplyr)
df1 %>%
table %>%
data.frame(.) %>%
rename_at(1:3, ~ str_remove(., "col"))
Or if it needs to be from 'df2'
df1 %>%
table %>%
data.frame(.) %>%
rename_at(1:3, ~ setNames(as.character(df2$altname), df2$name)[.])
Update
If some of the column names in 'df1' are not in the key/val columns of 'df2', an option is
df1 %>%
table %>%
data.frame(.) %>%
rename_at(1:3, ~ coalesce(setNames(as.character(df2$altname), df2$name)[.], .))
Or using base R
out <- df1 %>% table %>% data.frame(.)
names(out) <- sub("col", "", names(out))
If it needs to be based on a second dataset
names(out)[-4] <- as.character(df2$altname)[match(names(out)[-4], df2$name)]
Or with substring
names(out) <- substring(names(out), 4)
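The lookup-based renaming can be sketched in base R so that names without a match in 'df2' (such as Freq) are left unchanged. The vector old_names below stands in for names(out) and is an assumption made to keep the example self-contained.

```r
df2 <- data.frame(name = c("cola", "colc", "colb"),
                  altname = c("a", "c", "b"))

# Stand-in for names(out); "Freq" has no entry in df2
old_names <- c("cola", "colb", "colc", "Freq")

# Position of each old name in the lookup table (NA when absent)
idx <- match(old_names, as.character(df2$name))

# Start from the old names and overwrite only the matched ones
new_names <- old_names
new_names[!is.na(idx)] <- as.character(df2$altname)[idx[!is.na(idx)]]

new_names  # "a" "b" "c" "Freq"
```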

Replace multiple `summarize` statements by function

I'm currently repeating a lot of code, since I always need to summarize the same columns for different groups. How can I do this effectively by writing the summarize function (which is always the same) only once, but defining the output name and group_by arguments case by case?
A minimum example:
col1 <- c("UK", "US", "UK", "US")
col2 <- c("Tech", "Social", "Social", "Tech")
col3 <- c("0-5years", "6-10years", "0-5years", "0-5years")
col4 <- 1:4
col5 <- 5:8
df <- data.frame(col1, col2, col3, col4, col5)
result1 <- df %>%
group_by(col1, col2) %>%
summarize(sum1 = sum(col4, col5))
result2 <- df %>%
group_by(col2, col3) %>%
summarize(sum1 = sum(col4, col5))
result3 <- df %>%
group_by(col1, col3) %>%
summarize(sum1 = sum(col4, col5))
Using combn:
combn(colnames(df)[1:3], 2, FUN = function(x){
df %>%
group_by(across(all_of(x))) %>% # replaces the deprecated .dots argument (dplyr >= 1.0.0)
summarize(sum1 = sum(col4, col5))
}, simplify = FALSE)
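To make the combn step more concrete, the sketch below first lists the pairs of grouping columns it generates and then summarises each pair. Base aggregate is swapped in for the dplyr pipeline here purely to keep the example dependency-free; the helper column total and the list names are assumptions for illustration.

```r
df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# All pairs of grouping columns: (col1, col2), (col1, col3), (col2, col3)
group_pairs <- combn(c("col1", "col2", "col3"), 2, simplify = FALSE)
names(group_pairs) <- vapply(group_pairs, paste, character(1), collapse = "_")

# Per-row total, so that summing it per group equals sum(col4, col5)
df$total <- df$col4 + df$col5

# One summary data frame per pair, stored in a named list
results <- lapply(group_pairs, function(g) {
  aggregate(df["total"], by = df[g], FUN = sum)
})

names(results)  # "col1_col2" "col1_col3" "col2_col3"
```

Keeping the results in a named list avoids creating one variable per combination.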
To use dplyr inside your own functions, you can use tidy evaluation. This is needed because dplyr uses non-standard evaluation: its arguments are captured as expressions instead of being evaluated like normal R code. I recommend reading this:
https://tidyeval.tidyverse.org/modifying-inputs.html#modifying-quoted-expressions
summarizefunction <- function(data, ..., sumvar1, sumvar2) {
groups <- enquos(...)
sumvar1 <- enquo(sumvar1)
sumvar2 <- enquo(sumvar2)
result <- data %>%
group_by(!!!groups) %>%
summarise(sum1 = sum(!!sumvar1, !!sumvar2))
return(result)
}
summarizefunction(df, col1, col2, sumvar1 = col4, sumvar2 = col5)
You can use enquo to quote an argument, which prevents it from being evaluated immediately. You can then use the !! (called bang-bang) operator to unquote it inside the dplyr call. I think this is the most flexible and reusable solution, even though it takes a little more code up front.
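A shorter variant of the same idea is sketched below. It assumes dplyr 1.0.0 or later, where the embrace operator {{ }} combines quoting and unquoting in one step and across(all_of()) accepts grouping columns as a character vector. The function name summarise_by and its argument names are hypothetical.

```r
library(dplyr)

df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# Hypothetical wrapper: group_cols is a character vector of column names,
# var1 and var2 are bare column names forwarded with {{ }}.
summarise_by <- function(data, group_cols, var1, var2) {
  data %>%
    group_by(across(all_of(group_cols))) %>%
    summarise(sum1 = sum({{ var1 }}, {{ var2 }}), .groups = "drop")
}

result1 <- summarise_by(df, c("col1", "col2"), col4, col5)
result2 <- summarise_by(df, c("col2", "col3"), col4, col5)
```

The output name is fixed to sum1 in this sketch; only the grouping columns and the summed variables change from call to call.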
You can also use purrr::partial in these situations :
library(purrr)
summarize45 <- partial(summarize, sum1 = sum(col4, col5))
result1b <- df %>%
group_by(col1, col2) %>%
summarize45()
identical(result1, result1b)
# [1] TRUE
Or pushing it further :
gb_df <- partial(group_by, df)
result1c <- gb_df(col1, col2) %>% summarize45()
identical(result1, result1c)
# [1] TRUE
First, you'll need a function that converts the column names (passed as strings) into symbols:
library(tidyverse)
res_func <- function(x, y){
df %>%
group_by(!!as.symbol(x), !!as.symbol(y)) %>%
summarize(sum1 = sum(col4, col5))
}
This works like a charm:
res_func("col1", "col2")
# A tibble: 4 x 3
# Groups: col1 [2]
col1 col2 sum1
<fct> <fct> <int>
1 UK Social 10
2 UK Tech 6
3 US Social 8
4 US Tech 12
We can use assign to create a function that names your frame against the parameters you've passed in through the function:
res_func2 <- function(x, y){
assign(paste0("result_", x, y),
df %>%
group_by(!!as.symbol(x), !!as.symbol(y)) %>%
summarize(sum1 = sum(col4, col5)),
envir = parent.frame())
}
This creates a new df called result_col1col2 by simply running res_func2("col1", "col2")
> result_col1col2
# A tibble: 4 x 3
# Groups: col1 [2]
col1 col2 sum1
<fct> <fct> <int>
1 UK Social 10
2 UK Tech 6
3 US Social 8
4 US Tech 12

R dplyr method to replace all empty factors with NA

Instead of writing and reading a dataframe to fill all empty factors in this method,
na.strings=c("","NA")
I wanted to just apply a function to all the columns and substitute the empties with NA. I've selected the factor columns so far but don't know what to do next.
df %>% select_if(is.factor) %>% ....
How would I be able to do this, preferably with dplyr and/or apply methods?
We can use mutate_if
df <- df %>%
mutate_if(is.factor, funs(factor(replace(., .=="", NA))))
With dplyr 0.8.0, we can also do
df %>%
mutate_if(is.factor, na_if, y = "")
or replace funs (which is being deprecated) with list, as @Frederick mentioned in the comments
df %>%
mutate_if(is.factor, list(~ na_if(., "")))
Or using base R we can assign the specific levels to NA
j1 <- sapply(df, is.factor)
df[j1] <- lapply(df[j1], function(x) {levels(x)[levels(x) == ""] <- NA; x}) # setting a level to NA turns its elements into NA
data
df <- data.frame(col1 = c("", "A", "B", ""), col2 = c("A", "", "", "C"),
col3 = 1:4, stringsAsFactors = TRUE) # needed in R >= 4.0 to get factor columns
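As a quick check of what "empty factor to NA" should produce on this data, here is a base R sketch. It assumes the columns really are factors, hence stringsAsFactors = TRUE, and it uses element-wise replacement followed by droplevels, which is one of several equivalent ways to do this.

```r
df <- data.frame(col1 = c("", "A", "B", ""), col2 = c("A", "", "", "C"),
                 col3 = 1:4, stringsAsFactors = TRUE)

# Identify the factor columns
is_fct <- vapply(df, is.factor, logical(1))

df[is_fct] <- lapply(df[is_fct], function(x) {
  x[x == ""] <- NA   # blank entries become NA
  droplevels(x)      # remove the now-unused "" level
})

is.na(df$col1)   # TRUE FALSE FALSE TRUE
is.na(df$col2)   # FALSE TRUE TRUE FALSE
levels(df$col2)  # "A" "C"
```

The integer column col3 is left untouched because it is not selected by is_fct.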
