How to rename column names containing "(N)"? - r

I'd like to remove the "(N)" from the column names.
Example data:
df <- tibble(
name = c("A", "B", "C", "D"),
`id (N)` = c(1, 2, 3, 4),
`Number (N)` = c(3, 1, 2, 8)
)
I got this far, but I can't figure out the rest of the regex:
df %>%
rename_with(stringr::str_replace,
pattern = "[//(],N//)]", replacement = "")
But the N at the start of "Number (N)" is removed instead:
name id N) umber (N)
1 A 1 3
2 B 2 1
3 C 3 2
4 D 4 8

One-liner: rename_with(df, ~str_remove_all(., ' \\(N\\)'))
or, without stringr (dplyr plus base R's sub): rename_with(df, ~sub(' \\(N\\)', '', .))
We can use rename_with() from the dplyr package to apply a function to the column names (in this case str_remove_all() from the stringr package),
and use \\ to escape the parentheses:
library(dplyr)
library(stringr)
df %>%
rename_with(~str_remove_all(., ' \\(N\\)'))
name id Number
<chr> <dbl> <dbl>
1 A 1 3
2 B 2 1
3 C 3 2
4 D 4 8
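To see why a character class misbehaves here while the escaped pattern works, the following base R sketch may help. The class "[(N)]" is an illustrative stand-in for the pattern in the question (not the exact pattern used there), and the variable names are made up for this example.

```r
# Column names from the question
nms <- c("name", "id (N)", "Number (N)")

# A character class matches ONE character out of the set, so sub() removes
# only the first "(", "N" or ")" it finds in each name
broken <- sub("[(N)]", "", nms)

# Escaping the parentheses matches the literal text " (N)"
fixed <- sub(" \\(N\\)", "", nms)

# fixed = TRUE treats the pattern as plain text, so no escaping is needed
fixed_literal <- sub(" (N)", "", nms, fixed = TRUE)

print(broken)  # "name" "id N)" "umber (N)"
print(fixed)   # "name" "id" "Number"
```

The first result reproduces the damaged names shown in the question, which suggests the original pattern was being read as a set of single characters rather than as the sequence " (N)".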

A possible solution:
library(tidyverse)
df <- tibble(
name = c("A", "B", "C", "D"),
`id (N)` = c(1, 2, 3, 4),
`Number (N)` = c(3, 1, 2, 8)
)
df %>% names %>% str_remove("\\s*\\(N\\)\\s*") %>% set_names(df,.)
#> # A tibble: 4 × 3
#> name id Number
#> <chr> <dbl> <dbl>
#> 1 A 1 3
#> 2 B 2 1
#> 3 C 3 2
#> 4 D 4 8

Perhaps you can try
setNames(df, gsub("\\s\\(.*\\)", "", names(df)))
which gives
name id Number
<chr> <dbl> <dbl>
1 A 1 3
2 B 2 1
3 C 3 2
4 D 4 8

A simple solution is
colnames(df) <- gsub(" \\(N\\)", "", colnames(df))

Related

How to remove unique rows from dataframe using tidyverse

I want to remove unique rows based on a variable:
Letters Val
A 1
A 1
B 1
B 3
In this case, the entries with A are removed because their Val values are all the same, resulting in:
Letters Val
B 1
B 3
I tried using count and then filtering for n > 1; however, Val is lost in the process.
In essence, how do I filter(count(letters) > 1)?
md <- tibble::tribble(
~Letters, ~Val,
"A", 1,
"A", 1,
"B", 1,
"B", 3
)
library(dplyr)
md |>
group_by(Letters, Val) |>
filter(n() == 1)
#> # A tibble: 2 × 2
#> # Groups: Letters, Val [2]
#> Letters Val
#> <chr> <dbl>
#> 1 B 1
#> 2 B 3
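The code above counts rows per (Letters, Val) pair. If the intent is instead to keep only the letters that have more than one distinct Val, a base R sketch could look like this. The two readings give the same result on this data but can differ on other data, and `n_vals` is just an illustrative name.

```r
md <- data.frame(
  Letters = c("A", "A", "B", "B"),
  Val     = c(1, 1, 1, 3)
)

# For every row, count the distinct Val values within its Letters group
n_vals <- ave(md$Val, md$Letters, FUN = function(v) length(unique(v)))

# Keep rows whose group has more than one distinct value; Val is preserved
result <- md[n_vals > 1, ]
print(result)
```

With dplyr, the same idea can be written as group_by(Letters) followed by filter(n_distinct(Val) > 1).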

Can't add rows to grouped data frames

This is a follow-up to the question "How to add a row to a dataframe modifying only some columns".
After solving that question, I wanted to apply the solution provided by stefan to a larger dataframe with group_by.
My dataframe:
df <- structure(list(test_id = c(1, 1, 1, 1, 1, 1, 1, 1), test_nr = c(1,
1, 1, 1, 2, 2, 2, 2), region = c("A", "B", "C", "D", "A", "B",
"C", "D"), test_value = c(3, 1, 1, 2, 4, 2, 4, 1)), class = "data.frame", row.names = c(NA,
-8L))
test_id test_nr region test_value
1 1 1 A 3
2 1 1 B 1
3 1 1 C 1
4 1 1 D 2
5 1 2 A 4
6 1 2 B 2
7 1 2 C 4
8 1 2 D 1
I now want to add a new row to each group with this code, which gives an error:
df %>%
group_by(test_nr) %>%
add_row(test_id = .$test_id[1], test_nr = .$test_nr[1], region = "mean", test_value = mean(.$test_value))
Error: Can't add rows to grouped data frames.
Run `rlang::last_error()` to see where the error occurred.
My expected output would be:
test_id test_nr region test_value
1 1 1 A 3.00
2 1 1 B 1.00
3 1 1 C 1.00
4 1 1 D 2.00
5 1 1 MEAN 1.75
6 1 2 A 4.00
7 1 2 B 2.00
8 1 2 C 4.00
9 1 2 D 1.00
10 1 2 MEAN 2.75
I have tried so far:
library(tidyverse)
df %>%
group_by(test_nr) %>%
group_split() %>%
map_dfr(~ .x %>%
add_row(!!! map(.[4], mean)))
test_id test_nr region test_value
<dbl> <dbl> <chr> <dbl>
1 1 1 A 3
2 1 1 B 1
3 1 1 C 1
4 1 1 D 2
5 NA NA NA 1.75
6 1 2 A 4
7 1 2 B 2
8 1 2 C 4
9 1 2 D 1
10 NA NA NA 2.75
How could I modify columns 1 to 3 to place my values there?
I actually recently made a little helper function for exactly this. The idea
is to use group_modify() to take the group data, and
bind_rows() the summary statistics calculated with summarise().
This is what it looks like in code:
add_summary_rows <- function(.data, ...) {
group_modify(.data, function(x, y) bind_rows(x, summarise(x, ...)))
}
And here’s how that would work with your data:
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(
test_id = c(1, 1, 1, 1, 1, 1, 1, 1),
test_nr = c(1, 1, 1, 1, 2, 2, 2, 2),
region = c("A", "B", "C", "D", "A", "B", "C", "D"),
test_value = c(3, 1, 1, 2, 4, 2, 4, 1)
)
df %>%
group_by(test_id, test_nr) %>%
add_summary_rows(
region = "MEAN",
test_value = mean(test_value)
)
#> # A tibble: 10 x 4
#> # Groups: test_id, test_nr [2]
#> test_id test_nr region test_value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 1 A 3
#> 2 1 1 B 1
#> 3 1 1 C 1
#> 4 1 1 D 2
#> 5 1 1 MEAN 1.75
#> 6 1 2 A 4
#> 7 1 2 B 2
#> 8 1 2 C 4
#> 9 1 2 D 1
#> 10 1 2 MEAN 2.75
You can combine your two approaches:
df %>%
split(~test_nr) %>%
map_dfr(~ .x %>%
add_row(test_id = .$test_id[1],
test_nr = .$test_nr[1],
region = "mean",
test_value = mean(.$test_value)))
You could achieve your target with this Base R one-liner:
merge( df, aggregate( df, by = list( df$test_nr ), FUN = mean ), all = TRUE )[ , 1:4 ]
aggregate provides the rows you need, and merge inserts them into the right places in your dataframe. You don't need the last column of the combined dataframe, so keep only the first four columns. The code produces some warnings for the region column, which can be disregarded. In the new rows, the region column is NA rather than showing the function name (MEAN).
Making it a little more generic:
f <- "mean"
df1 <- merge( df, aggregate( df, by = list( df$test_id, df$test_nr ),
FUN = f ), all = TRUE )[ , 1:4 ]
df1$region[ is.na( df1$region ) ] <- toupper( f )
Here, you also aggregate by test_id, you can change the function in one place, and its name is printed in the region column:
> df1
test_id test_nr region test_value
1 1 1 A 3.00
2 1 1 B 1.00
3 1 1 C 1.00
4 1 1 D 2.00
5 1 1 MEAN 1.75
6 1 2 A 4.00
7 1 2 B 2.00
8 1 2 C 4.00
9 1 2 D 1.00
10 1 2 MEAN 2.75
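The warnings mentioned above come from averaging the character column region. A possible variation, sketched here under the assumption that only test_value needs summarising, uses the formula interface of aggregate so that region is never passed to mean. The names `means` and `out` are illustrative.

```r
df <- data.frame(
  test_id    = rep(1, 8),
  test_nr    = rep(c(1, 2), each = 4),
  region     = rep(c("A", "B", "C", "D"), times = 2),
  test_value = c(3, 1, 1, 2, 4, 2, 4, 1)
)

# Summarise only the numeric column, grouped by the two id columns
means <- aggregate(test_value ~ test_id + test_nr, data = df, FUN = mean)
means$region <- "MEAN"

# Append the summary rows, then sort so each one follows its group;
# order() keeps tied rows in their original order, so MEAN stays last
out <- rbind(df, means[names(df)])
out <- out[order(out$test_id, out$test_nr), ]
rownames(out) <- NULL
print(out)
```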

Adding dataset identifier variable in full_join in R

I want to automatically add a new dataset identifier variable when using full_join() in R.
df1 <- tribble(~ID, ~x,
"A", 1,
"B", 2,
"C", 3)
df2 <- tribble(~ID, ~y,
"D", 4,
"E", 5,
"F", 6)
combined <- df1 %>% dplyr::full_join(df2)
I know from ?full_join that it joins all rows from df1 followed by df2. But, I couldn't find an option to create an index variable automatically.
Currently, I'm adding an extra variable in df1 first
df1 <- tribble(~ID, ~x, ~dataset,
"A", 1, 1,
"B", 2, 1,
"C", 3, 1)
and following it up with df1 %>% dplyr::full_join(df2) %>% dplyr::mutate(dataset = replace_na(dataset, 2))
Any suggestions to do it in a better way?
I'm not sure whether it's more efficient than your approach, but if there are never any overlapping columns other than ID, you could try
df1 %>%
full_join(df2) %>%
mutate(dataset = as.numeric(is.na(x))+1)
ID x y dataset
<chr> <dbl> <dbl> <dbl>
1 A 1 NA 1
2 B 2 NA 1
3 C 3 NA 1
4 D NA 4 2
5 E NA 5 2
6 F NA 6 2
But to be safe, it might be better to define the index beforehand:
df1 %>%
mutate(dataset = 1) %>%
full_join(df2 %>% mutate(dataset = 2))
ID x y dataset
<chr> <dbl> <dbl> <dbl>
1 A 1 NA 1
2 B 2 NA 1
3 C 3 NA 1
4 D NA 4 2
5 E NA 5 2
6 F NA 6 2
New data
df1 <- tribble(~ID, ~x,~y,
"A", 1,1,
"B", 2,1,
"C", 3,1)
df2 <- tribble(~ID, ~x,~y,
"D", 4,1,
"E", 5,1,
"F", 6,1)
full_join(df1, df2)
ID x y
<chr> <dbl> <dbl>
1 A 1 1
2 B 2 1
3 C 3 1
4 D 4 1
5 E 5 1
6 F 6 1
Instead of a "join", maybe try bind_rows from dplyr:
library(dplyr)
bind_rows(df1, df2, .id = "dataset")
This will bind rows, and the missing columns are filled in with NA. In addition, you can specify an ".id" argument with an identifier. If you provide a list of dataframes, the labels are taken from names in the list. If not, a numeric sequence is used (as seen below).
Output
dataset ID x y
<chr> <chr> <dbl> <dbl>
1 1 A 1 NA
2 1 B 2 NA
3 1 C 3 NA
4 2 D NA 4
5 2 E NA 5
6 2 F NA 6
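As a small illustration of the named-list behaviour described above, the sketch below passes a named list to bind_rows. The labels "first" and "second" are arbitrary names chosen for this example, and plain data frames are used in place of tribble to keep it short.

```r
library(dplyr)

df1 <- data.frame(ID = c("A", "B", "C"), x = c(1, 2, 3))
df2 <- data.frame(ID = c("D", "E", "F"), y = c(4, 5, 6))

# The list names become the values of the identifier column
combined <- bind_rows(list(first = df1, second = df2), .id = "dataset")
print(combined)
```

Using names instead of the default numeric sequence keeps the identifier meaningful if the order of the inputs changes later.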

How to replace columns with NA in a tibble with imputed columns from another tibble in R

I want to replace the columns containing NA in df with the imputed values in df2 to get df3.
I can do it with left_join and coalesce, but I think this method doesn't generalize well. Is there a better way?
library(tidyverse)
df <- tibble(c = c("a", "a", "a", "b", "b", "b"),
d = c(1, 2, 3, 1, 2, 3),
x = c(1, NA, 3, 4, 5,6),
y = c(1, 2, NA, 4, 5, 6),
z = c(1, 2, 7, 4, 5, 6))
# I want to replace NA in df by df2
df2 <- tibble(c = c("a", "a", "a"),
d = c(1, 2, 3),
x = c(1, 2, 3),
y = c(1, 2, 2))
# to get
df3 <- tibble(c = c("a", "a", "a", "b", "b", "b"),
d = c(1, 2, 3, 1, 2, 3),
x = c(1, 2, 3, 4, 5, 6),
y = c(1, 2, 2, 4, 5, 6),
z = c(1, 2, 7, 4, 5, 6))
# is there a better solution than coalesce?
df3 <- df %>% left_join(df2, by = c("c", "d")) %>%
mutate(x = coalesce(x.x, x.y),
y = coalesce(y.x, y.y)) %>%
select(-x.x, -x.y, -y.x, -y.y)
Created on 2021-06-17 by the reprex package (v2.0.0)
Here's a custom function that coalesces all .x and .y columns, optionally renaming and removing columns.
#' Coalesce all columns duplicated in a previous join.
#'
#' Find all columns resulting from duplicate names after a join
#' operation (e.g., `dplyr::*_join` or `base::merge`), then coalesce
#' them pairwise.
#'
#' @param x data.frame
#' @param suffix character, length 2, the same string suffixes
#' appended to column names of duplicate columns; should be the same
#' as provided to `dplyr::*_join(., suffix=)` or `base::merge(.,
#' suffixes=)`
#' @param clean logical, whether to remove the suffixes from the LHS
#' columns and remove the RHS columns
#' @param strict logical, whether to enforce same-classes in the LHS
#' (".x") and RHS (".y") columns; while it is safer to set this to
#' true (default), sometimes the conversion of classes might be
#' acceptable, for instance, if one '.x' column is 'numeric' and its
#' corresponding '.y' column is 'integer', then relaxing the class
#' requirement might be acceptable
#' @return 'x', coalesced, optionally cleaned
#' @export
coalesce_all <- function(x, suffix = c(".x", ".y"),
clean = FALSE, strict = TRUE) {
nms <- colnames(x)
Xs <- endsWith(nms, suffix[1])
Ys <- endsWith(nms, suffix[2])
# x[Xs] <- Map(dplyr::coalesce, x[Xs], x[Ys])
# x[Xs] <- Map(data.table::fcoalesce, x[Xs], x[Ys])
x[Xs] <- Map(function(dotx, doty) {
if (strict) stopifnot(identical(class(dotx), class(doty)))
isna <- is.na(dotx)
replace(dotx, isna, doty[isna])
} , x[Xs], x[Ys])
if (clean) {
names(x)[Xs] <- gsub(glob2rx(paste0("*", suffix[1]), trim.head = TRUE), "", nms[Xs])
x[Ys] <- NULL
}
x
}
In action:
df %>%
left_join(df2, by = c("c", "d")) %>%
coalesce_all()
# # A tibble: 6 x 7
# c d x.x y.x z x.y y.y
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 1 1 1 1 1
# 2 a 2 2 2 2 2 2
# 3 a 3 3 2 7 3 2
# 4 b 1 4 4 4 NA NA
# 5 b 2 5 5 5 NA NA
# 6 b 3 6 6 6 NA NA
df %>%
left_join(df2, by = c("c", "d")) %>%
coalesce_all(clean = TRUE)
# # A tibble: 6 x 5
# c d x y z
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 1 1 1
# 2 a 2 2 2 2
# 3 a 3 3 2 7
# 4 b 1 4 4 4
# 5 b 2 5 5 5
# 6 b 3 6 6 6
I included two coalesce functions as alternatives to the base-R within the Map. One advantage is the strict argument: dplyr::coalesce will silently allow integer and numeric to be coalesced, while data.table::fcoalesce does not. If that is desirable, use what you prefer. (Another advantage is that both of the non-base coalesce functions accept an arbitrary number of columns to coalesce, which is not required in this implementation.)
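To make the class check and the base R fill step concrete, here is a minimal sketch on two short vectors. The vectors and the names `same_class` and `filled` are made up for illustration; the fill line mirrors the replace() call inside the Map above.

```r
x <- c(1, NA, 3)     # "numeric", like a '.x' column
y <- c(1L, 2L, 2L)   # "integer", like its '.y' counterpart

# This is the comparison that strict = TRUE enforces with stopifnot()
same_class <- identical(class(x), class(y))
print(same_class)  # FALSE, so strict = TRUE would stop here

# With the check relaxed, NA positions in x are filled from y
isna <- is.na(x)
filled <- replace(x, isna, y[isna])
print(filled)  # 1 2 3
```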
You can mutate all columns at once by using across() together with the .names and .keep arguments, like this:
library(dplyr, warn.conflicts = FALSE)
df %>% left_join(df2, by = c("c", "d")) %>%
mutate(across(ends_with('.x'), ~ coalesce(., get(sub('\\.x$', '.y', cur_column()))),
.names = '{gsub(".x$", "", .col)}'), .keep = 'unused')
#> # A tibble: 6 x 5
#> c d z x y
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 1 1 1
#> 2 a 2 2 2 2
#> 3 a 3 7 3 2
#> 4 b 1 4 4 4
#> 5 b 2 5 5 5
#> 6 b 3 6 6 6
I tried another method: filter on c, drop all columns of df that contain NA, join with df2, and then bind the remaining rows of df to the result.
df3 <- df %>% filter(c == "a") %>% select_if(~ !any(is.na(.))) %>%
left_join(df2, by = c("c", "d"))
df3 <- bind_rows(df %>% filter(!c == "a"), df3) %>% arrange(c,d)
df3
#> # A tibble: 6 x 5
#> c d x y z
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 1 1 1
#> 2 a 2 2 2 2
#> 3 a 3 3 2 7
#> 4 b 1 4 4 4
#> 5 b 2 5 5 5
#> 6 b 3 6 6 6
We can use {powerjoin}
library(powerjoin)
power_left_join(df, df2, by = c("c", "d"), conflict = coalesce_xy)
#> # A tibble: 6 × 5
#> c d z x y
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 1 1 1
#> 2 a 2 2 2 2
#> 3 a 3 7 3 2
#> 4 b 1 4 4 4
#> 5 b 2 5 5 5
#> 6 b 3 6 6 6
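Another option that avoids the join and coalesce steps is dplyr's rows_patch(), which fills only the NA cells of the first table with values from the second, matched by key. This is a different technique from the answers above, sketched here with plain data frames and assuming dplyr 1.0.0 or later. It requires every key in df2 to exist in df and the keys in df2 to be unique.

```r
library(dplyr)

df <- data.frame(
  c = c("a", "a", "a", "b", "b", "b"),
  d = c(1, 2, 3, 1, 2, 3),
  x = c(1, NA, 3, 4, 5, 6),
  y = c(1, 2, NA, 4, 5, 6),
  z = c(1, 2, 7, 4, 5, 6)
)
df2 <- data.frame(
  c = c("a", "a", "a"),
  d = c(1, 2, 3),
  x = c(1, 2, 3),
  y = c(1, 2, 2)
)

# Only NA cells in df are overwritten; existing values and column z are kept
df3 <- rows_patch(df, df2, by = c("c", "d"))
print(df3)
```

Because no suffixed columns are created, this generalises to any number of imputed columns without listing them.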

substitute value in dataframe based on conditional

I have the following data set
library(dplyr)
df<- data.frame(c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
c(1, 1, 2, 2, 2, 3, 1, 2, 2, 2, 3, 3),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60))
colnames(df)<-c("name", "year", "val")
We summarize this by grouping df by name and year, then computing the average and the number of entries in each group:
asd <- (df %>%
group_by(name,year) %>%
summarize(average = mean(val), `ave_number` = n()))
This gives the following desired output
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 50 1
4 b 1 20 1
5 b 2 20 3
6 b 3 50 2
Now, I would like to substitute all entries of asd$average where asd$ave_number < 2 according to the following lookup table, based on year:
replacer<- data.frame(c(1,2,3),
c(100,200,300))
colnames(replacer)<-c("year", "average")
In other words, I would like to end up with
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1 #substituted
4 b 1 100 1 #substituted
5 b 2 20 3
6 b 3 50 2
Is there a way to achieve this with dplyr? I guess I have to use the %>%-operator, something like this (not working code)
asd %>%
group_by(name, year) %>%
summarize(average = ifelse(n() < 2, #SOMETHING#, mean(val)))
Here's what I would do:
colnames(replacer) <- c("year", "average_replacer") #To avoid duplicate of variable name
asd <- left_join(asd, replacer, by = "year") %>%
mutate(average = ifelse(ave_number < 2, average_replacer, average)) %>%
select(-average_replacer)
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
Regarding the following:
I guess I have to use the %>%-operator
You don't ever have to use the pipe operator. It is there for convenience because you can string (or "pipe") functions one after another, as you would with a train of thought. It's kind of like having a flow in your code.
You can do this easily by using a named vector of replacement values by year instead of a data frame. If you're set on a data frame, you'd be using joins.
replacer <- setNames(c(100,200,300),c(1,2,3))
asd <- df %>%
group_by(name,year) %>%
summarize(average = mean(val),
ave_number = n()) %>%
mutate(average = if_else(ave_number < 2, replacer[year], average))
Source: local data frame [6 x 4]
Groups: name [2]
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
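One caveat with replacer[year]: indexing a vector with a number selects by position, so it works here only because the years are 1, 2 and 3. A sketch of a name-based lookup, which would also cope with years such as 2019 or 2020, is shown below in base R. The asd data frame is rebuilt by hand from the summary above, and `lookup` is an illustrative name.

```r
# Summary table from the question, typed in directly
asd <- data.frame(
  name       = rep(c("a", "b"), each = 3),
  year       = rep(c(1, 2, 3), times = 2),
  average    = c(50, 40, 50, 20, 20, 50),
  ave_number = c(2L, 3L, 1L, 1L, 3L, 2L)
)

# Named vector: the names are the years, the values are the replacements
replacer <- c("1" = 100, "2" = 200, "3" = 300)

# Converting year to character forces lookup by name, not by position
lookup <- unname(replacer[as.character(asd$year)])
asd$average <- ifelse(asd$ave_number < 2, lookup, asd$average)
print(asd)
```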
