dplyr: create new column with values from other specified columns [duplicate]


This question already has answers here:
How can I use one column to determine where I get the value for another column?
(3 answers)
Closed 4 years ago.
I have a tibble:
library(tibble)
library(dplyr)
(
data <- tibble(
a = 1:3,
b = 4:6,
mycol = c('a', 'b', 'a')
)
)
#> # A tibble: 3 x 3
#> a b mycol
#> <int> <int> <chr>
#> 1 1 4 a
#> 2 2 5 b
#> 3 3 6 a
Using dplyr::mutate I'd like to create a new column called value which uses a value from either column a or b, depending on which column name is specified in the mycol column.
(
desired <- tibble(
a = 1:3,
b = 4:6,
mycol = c('a', 'b', 'a'),
value = c(1, 5, 3)
)
)
#> # A tibble: 3 x 4
#> a b mycol value
#> <int> <int> <chr> <dbl>
#> 1 1 4 a 1
#> 2 2 5 b 5
#> 3 3 6 a 3
Here we're just using the values from column a all the time.
data %>%
mutate(value = a)
#> # A tibble: 3 x 4
#> a b mycol value
#> <int> <int> <chr> <int>
#> 1 1 4 a 1
#> 2 2 5 b 2
#> 3 3 6 a 3
Here we're just assigning the values of mycol to the new column rather than getting the values from the appropriate column.
data %>%
mutate(value = mycol)
#> # A tibble: 3 x 4
#> a b mycol value
#> <int> <int> <chr> <chr>
#> 1 1 4 a a
#> 2 2 5 b b
#> 3 3 6 a a
I've tried various combinations of !!, quo(), etc. but I don't fully understand what's going on under the hood in terms of NSE.
@Jaap has marked this as a duplicate, but I'd still like to see a dplyr/tidyverse approach using NSE rather than base R, if possible.

Here is one approach (df here is the tibble that the question calls data):
df %>%
mutate(value = ifelse(mycol == "a", a, b))
#output
# A tibble: 3 x 4
a b mycol value
<int> <int> <chr> <int>
1 1 4 a 1
2 2 5 b 5
3 3 6 a 3
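And here is a more general way in base R. For every row it selects the column named in mycol, then keeps the diagonal of the resulting matrix:
df$value <- diag(as.matrix(df[,df$mycol]))
A more complex example: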
df <- tibble(
a = 1:4,
b = 4:7,
c = 5:8,
mycol = c('a', 'b', 'a', "c"))
df$value <- diag(as.matrix(df[,df$mycol]))
#output
# A tibble: 4 x 5
a b c mycol value
<int> <int> <int> <chr> <int>
1 1 4 5 a 1
2 2 5 6 b 5
3 3 6 7 a 3
4 4 7 8 c 8
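The diag() approach builds an n-by-n matrix and keeps only its diagonal, which may get expensive for large inputs. Below is a minimal sketch of an alternative that indexes a matrix with a two-column (row, column) index, so each row picks exactly one cell. It assumes the candidate columns are known up front; the names cols, lookup and df2 are illustrative and not from the original answer.

```r
library(dplyr)
library(tibble)

df <- tibble(
  a = 1:3,
  b = 4:6,
  mycol = c("a", "b", "a")
)

# Assumption: the columns that mycol may name are listed explicitly
cols <- c("a", "b")

# Base R: a two-column index matrix selects one (row, column) cell per row
lookup <- as.matrix(df[cols])
df$value <- lookup[cbind(seq_len(nrow(df)), match(df$mycol, cols))]

# The same idea inside mutate(), for an ungrouped data frame
df2 <- df %>%
  select(-value) %>%
  mutate(value = cbind(a, b)[cbind(row_number(), match(mycol, cols))])
```

If mycol names a column that is not in cols, match() returns NA and the corresponding value should come out as NA rather than raising an error.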

Related

Create new column based on previous column by group; if missing, use NA

I am trying to select a value by group from one column and pass it as the value in another column, extended to the whole group. This is similar to the question asked here. But some groups do not have this value: in that case, I need to fill the column with NAs. How can I do this?
Dummy example:
dd1 <- data.frame(type = c(1,1,1),
grp = c('a', 'b', 'd'),
val = c(1,2,3))
dd2 <- data.frame(type = c(2,2),
grp = c('a', 'b'),
val = c(8,2))
dd3 <- data.frame(type = c(3,3),
grp = c('b', 'd'),
val = c(7,4))
dd <- rbind(dd1, dd2, dd3)
Create new column:
dd %>%
group_by(type) %>%
mutate(#val_a = ifelse(grp == 'a', val , NA),
val_a2 = val[grp == 'a'])
Expected outcome:
type grp val val_a # pass in `val_a` the value of the group 'a'
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA # value for 'a' is missing from group 3

You were close with your first approach; use any to apply the condition to all observations in the group:
dd %>%
group_by(type) %>%
mutate(val_a = ifelse(any(grp == "a"), val[grp == "a"] , NA))
type grp val val_a
<dbl> <chr> <dbl> <dbl>
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA

Try this:
dd %>%
group_by(type) %>%
mutate(val_a2 = val[which(c(grp == 'a'))[1]])
# # A tibble: 7 x 4
# # Groups: type [3]
# type grp val val_a2
# <dbl> <chr> <dbl> <dbl>
# 1 1 a 1 1
# 2 1 b 2 1
# 3 1 d 3 1
# 4 2 a 8 8
# 5 2 b 2 8
# 6 3 b 7 NA
# 7 3 d 4 NA
This also controls against the possibility that there could be more than one match, which may cause bad results (with or without a warning).
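To see why the which(...)[1] form behaves this way, here is a small base R sketch outside of any grouping. The vectors are made up for illustration: one "group" has no 'a', the other has two.

```r
# A group without any 'a' (like type 3 in the question)
val <- c(7, 4)
grp <- c("b", "d")

# Logical subsetting with no match returns a zero-length vector,
# which cannot be recycled to the size of the group
no_match <- val[grp == "a"]

# which(...)[1] is NA_integer_ when nothing matches, and indexing
# with a single NA integer returns exactly one NA
first_or_na <- val[which(grp == "a")[1]]

# A group with two matches: only the first one is used
val2 <- c(8, 2, 5)
grp2 <- c("a", "b", "a")
first_match <- val2[which(grp2 == "a")[1]]
```

In both cases the result has length one, so mutate() can recycle it across the group.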

Filter groups using a lagged column

I'm working on creating some error reports, and one of the items I'm trying to address is potential errors within the ID column id_1. I've made an alternative ID column, which I'm calling id_2, from various identifying features within the rows. To help, I've also created a date_lag column on date to catch items that were entered within a specific period after the initial entry. The main problem is returning the entire group that meets the criteria, including the first entry, which has an NA in date_lag. If I allow the NA values through, I get more than just the items I'm looking for (id_1 1 and 2 below).
Example:
#id_1 where potential errors lie
#id_2 alternative id col I'm using to test
df <- data.table(id_1 = c(1:4, 1:4),
id_2 = c(rep(c("b", "a"), c(2, 2))),
date = c(rep(1,4),rep(20,2), rep(10,2)))
df %>%
group_by(id_2) %>%
mutate(date_lag = date - lag(date)) %>%
filter(between(date_lag, 0, 10) | is.na(date_lag))
# A tibble: 6 x 4
# Groups: id_1 [4]
id_1 id_2 date date_lag
<int> <chr> <dbl> <dbl>
1 b 1 NA
2 b 1 0
3 a 1 NA
4 a 1 0
2 b 20 0
3 a 10 9
4 a 10 0
Expected:
# A tibble: 6 x 4
# Groups: id_2 [4]
id_1 id_2 value val_lag
<int> <chr> <dbl> <dbl>
3 a 1 NA
4 a 1 NA
3 a 10 9
4 a 10 9

Perhaps, we can use diff
library(dplyr)
df %>%
group_by(id_1) %>%
filter(between(diff(date), 0, 10))
-output
# A tibble: 4 x 3
# Groups: id_1 [2]
# id_1 id_2 date
# <int> <chr> <dbl>
#1 3 a 1
#2 4 a 1
#3 3 a 10
#4 4 a 10
Concatenate with NA, as diff returns a vector one element shorter than the original data:
df %>%
group_by(id_2) %>%
filter(between(c(NA, diff(date)), 0, 10))
# A tibble: 5 x 3
# Groups: id_2 [2]
# id_1 id_2 date
# <int> <chr> <dbl>
#1 2 b 1
#2 4 a 1
#3 2 b 20
#4 3 a 10
#5 4 a 10
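As a quick check of the padding step, the sketch below compares c(NA, diff(x)) with the x - lag(x) form used in the question, on a made-up vector. Note that the unpadded diff(date) variant appears to work in filter() only because every id_1 group has exactly two rows, so diff() returns a single value that gets recycled.

```r
library(dplyr)

# Illustrative values, not taken from the question's data
date <- c(1, 1, 10, 10)

# diff() drops the first element, so pad with NA to keep the length
via_diff <- c(NA, diff(date))

# Same quantity computed with dplyr::lag()
via_lag <- date - lag(date)

# between() propagates the NA; filter() would drop that row
keep <- between(via_diff, 0, 10)
```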

How to summarize across multiple columns with condition on another (grouped) column with dplyr?

I need to summarize a data.frame across multiple columns in a generic way:
the first summarize operation is easy, e.g. a simple median, and is straightforward;
the second summarize then includes a condition on another column, e.g. taking the value where there is a minimum (by group) in another column:
set.seed(4)
myDF = data.frame(i = rep(1:3, each=3),
j = rnorm(9),
a = sample.int(9),
b = sample.int(9),
c = sample.int(9),
d = 'foo')
# i j a b c d
# 1 1 0.2167549 4 5 5 foo
# 2 1 -0.5424926 7 7 4 foo
# 3 1 0.8911446 3 9 1 foo
# 4 2 0.5959806 8 6 8 foo
# 5 2 1.6356180 6 8 3 foo
# 6 2 0.6892754 1 4 6 foo
# 7 3 -1.2812466 9 1 7 foo
# 8 3 -0.2131445 5 2 2 foo
# 9 3 1.8965399 2 3 9 foo
myDF %>% group_by(i) %>% summarize(across(where(is.numeric), median, .names="med_{col}"),
best_a = a[[which.min(j)]],
best_b = b[[which.min(j)]],
best_c = c[[which.min(j)]])
# # A tibble: 3 x 8
# i med_j med_a med_b med_c best_a best_b best_c
# * <int> <dbl> <int> <int> <int> <int> <int> <int>
# 1 1 0.217 4 7 4 7 7 4
# 2 2 0.689 6 6 6 8 6 8
# 3 3 -0.213 5 2 7 9 1 7
How can I define this second summarize operation in a generic way (i.e., not manually as done above)?
Hence I would need something like this (which obviously does not work as j is not recognized):
myfns = list(med = ~median(.),
best = ~.[[which.min(j)]])
myDF %>% group_by(i) %>% summarize(across(where(is.numeric), myfns, .names="{fn}_{col}"))
# Error: Problem with `summarise()` input `..1`.
# x object 'j' not found
# ℹ Input `..1` is `across(where(is.numeric), myfns, .names = "{fn}_{col}")`.
# ℹ The error occurred in group 1: i = 1.

Use another across to get corresponding values in column a:c where j is minimum.
library(dplyr)
myDF %>%
group_by(i) %>%
summarize(across(where(is.numeric), median, .names="med_{col}"),
across(a:c, ~.[which.min(j)],.names = 'best_{col}'))
# i med_j med_a med_b med_c best_a best_b best_c
#* <int> <dbl> <int> <int> <int> <int> <int> <int>
#1 1 0.217 4 7 4 7 7 4
#2 2 0.689 6 6 6 8 6 8
#3 3 -0.213 5 2 7 9 1 7
To do it in the same across statement :
myDF %>%
group_by(i) %>%
summarize(across(where(is.numeric), list(med = median,
best = ~.[which.min(j)]),
.names="{fn}_{col}"))
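Here is a self-contained sketch of the second across() call, so it can be run without set.seed(). The rows for i = 1 and i = 2 are copied from the printed myDF in the question; the names small and res are just for illustration. As far as I understand, the inline lambda can see j because across() evaluates it with the grouped data available, whereas the formulas stored in myfns outside the pipeline could not.

```r
library(dplyr)

# Rows for i = 1 and i = 2, copied from the printed myDF in the question
small <- data.frame(
  i = rep(1:2, each = 3),
  j = c(0.2167549, -0.5424926, 0.8911446, 0.5959806, 1.6356180, 0.6892754),
  a = c(4L, 7L, 3L, 8L, 6L, 1L),
  b = c(5L, 7L, 9L, 6L, 8L, 4L),
  c = c(5L, 4L, 1L, 8L, 3L, 6L)
)

# For each group, take the value of a, b and c in the row where j is smallest
res <- small %>%
  group_by(i) %>%
  summarize(across(a:c, ~ .x[which.min(j)], .names = "best_{col}"))
```

The resulting best_* columns should match the first two rows of the output shown in the question.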

subtract first or second value from each row [duplicate]

This question already has answers here:
R subtract value for the same ID (from the first ID that shows)
(3 answers)
subtract first value from each subset of dataframe
(2 answers)
Closed 4 years ago.
I'm manipulating my data using dplyr, and after grouping my data, I would like to subtract all values by the first or second value in my group (i.e., subtracting a baseline). Is it possible to perform this in a single pipe step?
MWE:
test <- tibble(one=c("c","d","e","c","d","e"), two=c("a","a","a","b","b","b"), three=1:6)
test %>% group_by(`two`) %>% mutate(new=three-three[.$`one`=="d"])
My desired output is:
# A tibble: 6 x 4
# Groups: two [2]
one two three new
<chr> <chr> <int> <int>
1 c a 1 -1
2 d a 2 0
3 e a 3 1
4 c b 4 -1
5 d b 5 0
6 e b 6 1
However I am getting this as the output:
# A tibble: 6 x 4
# Groups: two [2]
one two three new
<chr> <chr> <int> <int>
1 c a 1 -1
2 d a 2 NA
3 e a 3 1
4 c b 4 -1
5 d b 5 NA
6 e b 6 1

We can use the first from dplyr
test %>%
group_by(two) %>%
mutate(new=three- first(three))
# A tibble: 6 x 4
# Groups: two [2]
# one two three new
# <chr> <chr> <int> <int>
#1 c a 1 0
#2 d a 2 1
#3 e a 3 2
#4 c b 4 0
#5 d b 5 1
#6 e b 6 2
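If we are subsetting the 'three' values based on the string "c" in 'one', then we don't need .$. Inside the pipe, . is the whole ungrouped tibble, so .$one is the full column 'one' (all six rows) rather than the values within the current group. That over-long logical index is what produces the NA values in the question's output.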
test %>%
group_by(`two`) %>%
mutate(new=three-three[one=="c"])
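For the output the question asks for, the baseline is the row where one is "d" rather than "c". A minimal sketch of that variant follows; it assumes each group contains exactly one "d" row, and the name res is illustrative.

```r
library(dplyr)
library(tibble)

test <- tibble(
  one = c("c", "d", "e", "c", "d", "e"),
  two = c("a", "a", "a", "b", "b", "b"),
  three = 1:6
)

# Within each group of two, subtract the value of three in the "d" row.
# Assumption: exactly one "d" per group, so the baseline has length one.
res <- test %>%
  group_by(two) %>%
  mutate(new = three - three[one == "d"]) %>%
  ungroup()
```

If a group could have no "d" row or more than one, three[which(one == "d")[1]] would be a more defensive choice.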

library(tidyverse)
tibble(
one = c("c", "d", "e", "c", "d", "e"),
two = c("a", "a", "a", "b", "b", "b"),
three = 1:6
) -> test_df
test_df %>%
group_by(two) %>%
mutate(new = three - three[1])
## # A tibble: 6 x 4
## # Groups: two [2]
## one two three new
## <chr> <chr> <int> <int>
## 1 c a 1 0
## 2 d a 2 1
## 3 e a 3 2
## 4 c b 4 0
## 5 d b 5 1
## 6 e b 6 2

How to append a sequential count of a column into a new column from a grouped column using dplyr

I have the following data frame:
library(tidyverse)
dat <- data.frame(foo=c(1, 1, 2, 3, 3, 3), bar=c('a', 'a', 'b', 'b', 'c', 'd'))
dat
#> foo bar
#> 1 1 a
#> 2 1 a
#> 3 2 b
#> 4 3 b
#> 5 3 c
#> 6 3 d
What I want to do is to create a new column with bar column tagged with the sequential count of its member, resulting in:
foo bar new_column
1 a a.sample.1
1 a a.sample.2
2 b b.sample.1
3 b b.sample.2
3 c c.sample.1
3 d d.sample.1
I'm stuck with this code:
> dat %>% group_by(bar) %>% summarise(n=n())
# A tibble: 4 x 2
bar n
<fctr> <int>
1 a 2
2 b 2
3 c 1
4 d 1

You can use group_by %>% mutate:
dat %>% group_by(bar) %>% mutate(new_column = paste(bar, 'sample', 1:n(), sep = "."))
# A tibble: 6 x 3
# Groups: bar [4]
# foo bar new_column
# <dbl> <fctr> <chr>
#1 1 a a.sample.1
#2 1 a a.sample.2
#3 2 b b.sample.1
#4 3 b b.sample.2
#5 3 c c.sample.1
#6 3 d d.sample.1

dat%>%group_by(bar)%>%mutate(new_column=paste0(bar,'.','sample.',row_number()))
# A tibble: 6 x 3
# Groups: bar [4]
foo bar new_column
<dbl> <fctr> <chr>
1 1 a a.sample.1
2 1 a a.sample.2
3 2 b b.sample.1
4 3 b b.sample.2
5 3 c c.sample.1
6 3 d d.sample.1
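For comparison, the same within-group counter can be sketched in base R with ave(), which applies a function to each group and returns the results in the original row order. This is only an alternative illustration and not part of the answers above.

```r
dat <- data.frame(
  foo = c(1, 1, 2, 3, 3, 3),
  bar = c("a", "a", "b", "b", "c", "d")
)

# ave() runs seq_along() within each level of bar, giving 1, 2, ... per group
counter <- ave(seq_along(dat$bar), dat$bar, FUN = seq_along)

dat$new_column <- paste(dat$bar, "sample", counter, sep = ".")
```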
