create a duplicate column using tidyeval [duplicate] - r

I am trying to create a duplicate column using tidyeval. In each iteration of a loop, the name of the column to duplicate varies and is obtained with a regular expression. For example,
library(tidyverse)
a <- str_subset(string = names(mtcars), pattern = "^a")
a
# am
to get the column to be duplicated.
I have no idea how to use this string to duplicate the column (into a new column a2). I tried various combinations of the code below, but I am struggling to get my head around tidy evaluation.
# a <- enquo(a)
mtcars %>%
as_tibble() %>%
mutate(a2 := {{a}})
# mutate(a2 := !!a)
# mutate(a2 := vars(!!!a))
# # A tibble: 32 x 12
# mpg cyl disp hp drat wt qsec vs am gear carb am2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 am
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 am
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 am
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 am
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 am
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 am
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 am
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 am
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 am
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 am
(I am looking for am2 to be a copy of am here, so 1 and 0 in each row, not "am")

If only one column is selected, e.g. am, convert the string to a symbol with sym() and unquote it with !!:
a <- "am"
mtcars %>%
mutate("{a}2" := !!sym(a))
# mpg cyl disp hp drat wt qsec vs am gear carb am2
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 1
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 0
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 0
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 0
# 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 0
# 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 0
If more than one column is selected, e.g. mpg and cyl, you can use the .names argument in across().
a <- c("mpg", "cyl")
mtcars %>%
mutate(across(all_of(a), ~ ., .names = "{col}2"))
# mpg cyl disp hp drat wt qsec vs am gear carb mpg2 cyl2
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 21.0 6
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 21.0 6
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 22.8 4
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 21.4 6
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 18.7 8
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 18.1 6
# 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 14.3 8
# 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 24.4 4
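Since the question mentions doing this inside a loop, the single-column approach above could be wrapped roughly as follows. This is only a sketch: duplicate_cols() and its suffix argument are hypothetical names introduced here, and it uses the .data pronoun as an alternative to !!sym(), assuming dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical helper (not from the original answers): copy every column named
# in `cols` into a new column whose name is the old name plus `suffix`.
duplicate_cols <- function(data, cols, suffix = "2") {
  for (col in cols) {
    # "{col}{suffix}" builds the new name; .data[[col]] looks the column up by string
    data <- data %>% mutate("{col}{suffix}" := .data[[col]])
  }
  data
}

# Column names found with a regular expression, as in the question
a <- grep("^a", names(mtcars), value = TRUE)
res <- duplicate_cols(mtcars, a)
head(res[, c("am", "am2")])
```

The .data[[col]] form avoids creating symbols by hand, which some people find easier to read when the column name is already a string.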

Related

Calculate means including all factor levels but one

Using the dataframe mtcars I would like to add the column qsec_control which is calculated as the mean(qsec) of all rows that don't have the same cyl as the current row (e.g. if cyl == 6, it would take mean(qsec[cyl != 6])).
The question feels somewhat dumb, but I can't figure out how to do this.
This solution groups by cyl, then uses dplyr::cur_group_rows() to index into mtcars$qsec:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(qsec_control = mean(
mtcars$qsec[-cur_group_rows()]
)) %>%
ungroup()
# A tibble: 32 × 12
mpg cyl disp hp drat wt qsec vs am gear carb qsec_cont…¹
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 17.8
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 17.8
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 17.2
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 17.8
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 18.7
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 17.8
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 18.7
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 17.2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 17.2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 17.8
# … with 22 more rows, and abbreviated variable name ¹​qsec_control
Replicating zephryl's answer in data.table:
library(data.table)
data(mtcars)
setDT(mtcars)
mtcars[, qsec_control := mtcars[-.I, mean(qsec)] , by = .(cyl)]
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb cyl2 qsec_control
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 6 17.81280
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 6 17.81280
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4 17.17381
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6 17.81280
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 8 18.68611
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 6 17.81280
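The same quantity can also be obtained without negative row indexing, by subtracting each group's sum and count from the overall totals. The following base R sketch is not from the answers above; the intermediate variable names are illustrative only.

```r
# Work on a fresh copy of the built-in data set
cars <- datasets::mtcars

# Overall total and row count
total_sum <- sum(cars$qsec)
total_n   <- nrow(cars)

# Per-row group total and group size, aligned with the original row order
group_sum <- ave(cars$qsec, cars$cyl, FUN = sum)
group_n   <- ave(cars$qsec, cars$cyl, FUN = length)

# Mean of qsec over all rows that belong to a different cyl group
cars$qsec_control <- (total_sum - group_sum) / (total_n - group_n)

head(cars[, c("cyl", "qsec", "qsec_control")])
```

This relies on the identity mean(x[other rows]) = (sum of all rows minus sum of own group) / (number of all rows minus size of own group), so it needs only one pass over the data per summary.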

Mutate across multiple columns using dplyr

I am trying to calculate rowwise averages for a number of columns. Could somebody please explain why the code below only calculates the mean for the two variables in the code (var_1 and var_13), rather than the mean for all 13 columns?
df %>%
rowwise() %>%
mutate(varmean = mean(var_1:var_13)) -> df
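Inside mutate(), var_1:var_13 is the ordinary sequence operator applied to the two values in each row, not a column selection, so mean() only ever sees a sequence running from var_1 to var_13. Two possibilities using dplyr: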
library(dplyr)
mtcars %>%
rowwise() %>%
mutate(varmean = mean(c_across(mpg:vs)))
This returns
# A tibble: 32 x 12
# Rowwise:
mpg cyl disp hp drat wt qsec vs am gear carb varmean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 40.0
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 40.1
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 31.7
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 52.8
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 73.2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 47.7
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 81.2
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 33.1
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 36.7
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 42.8
# ... with 22 more rows
And without rowwise(), using base R's rowMeans():
mtcars %>%
mutate(varmean = rowMeans(across(mpg:vs)))
returns
mpg cyl disp hp drat wt qsec vs am gear carb varmean
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 39.99750
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 40.09938
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 31.69750
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 52.76687
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 73.16375
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 47.69250
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 81.24000
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 33.12250
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 36.69625
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 42.80750
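To see the difference between the question's code and c_across() on a small case, here is a sketch with a made-up three-column tibble (the data and the column names seq_mean and row_mean are invented for illustration):

```r
library(dplyr)

# Toy data: the middle column is deliberately far from the outer two
df <- tibble(
  var_1 = c(1, 2),
  var_2 = c(10, 20),
  var_3 = c(3, 6)
)

res <- df %>%
  rowwise() %>%
  mutate(
    # var_1:var_3 is seq(var_1, var_3) for the row, so var_2 is ignored
    seq_mean = mean(var_1:var_3),
    # c_across() collects the values of all selected columns for the row
    row_mean = mean(c_across(var_1:var_3))
  ) %>%
  ungroup()

res
```

In the first row, var_1:var_3 is the sequence 1, 2, 3, whose mean is 2, whereas the true row mean of 1, 10 and 3 is 14/3.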

Best possible way to add the likely event and its probability from a cross table in R

Using the mtcars dataset, I have created a cross table as follows -
tab = with(mtcars, ftable(gear, cyl))
tab
Here is how it looks -
cyl 4 6 8
gear
3 1 2 12
4 8 4 0
5 2 1 2
For this crosstable, I have calculated the row-wise probability
tab_prob = tab %>% prop.table(1) %>% round(4) * 100
tab_prob
cyl 4 6 8
gear
3 6.67 13.33 80.00
4 66.67 33.33 0.00
5 40.00 20.00 40.00
I want to add two columns to the original mtcars dataset
Column 1 cyl_exp - Fill in the expected outcome based on cross-table. For example, in mtcars dataset, if the number of gears is 3, this new column (refer to the tab cross table) should have the value 8, since there is 80% probability that if the number of gears is 3, then cyl should be 8.
Column 2 cyl_prob - Write the probability from table tab_prob in this column based on the value in cyl_exp column.
Here is the expected outcome -
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb cyl_prob cyl_exp
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 66.67 4
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 66.67 4
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 66.67 4
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 80.00 8
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 80.00 8
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 80.00 8
Is there an easy way to accomplish this?
Thanks!
With data.table, I would do it this way:
library(data.table)
mtcars <- as.data.table(mtcars, keep.rownames = TRUE)
tab <- mtcars[, .N, by = .(gear, cyl)]
tab[, prob := N/sum(N), by = .(gear)]
tab <- tab[order(-prob, cyl)][!duplicated(gear)]
mtcars[tab, `:=`(cyl_exp = i.cyl, cyl_prob = i.prob), on = .(gear)]
# > head(mtcars)
# rn mpg cyl disp hp drat wt qsec vs am gear carb cyl_exp cyl_prob
# 1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 4 0.6666667
# 2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 4 0.6666667
# 3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4 0.6666667
# 4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 8 0.8000000
# 5: Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 8 0.8000000
# 6: Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 8 0.8000000
Here's a way to do this in dplyr:
library(dplyr)
mtcars %>%
count(cyl_exp = cyl, gear, name = 'cyl_prob') %>%
group_by(gear) %>%
mutate(cyl_prob = prop.table(cyl_prob) * 100) %>%
slice(which.max(cyl_prob)) %>%
inner_join(mtcars, by = 'gear')
# cyl_exp gear cyl_prob mpg cyl disp hp drat wt qsec vs am carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 8 3 80 21.4 6 258 110 3.08 3.22 19.4 1 0 1
# 2 8 3 80 18.7 8 360 175 3.15 3.44 17.0 0 0 2
# 3 8 3 80 18.1 6 225 105 2.76 3.46 20.2 1 0 1
# 4 8 3 80 14.3 8 360 245 3.21 3.57 15.8 0 0 4
# 5 8 3 80 16.4 8 276. 180 3.07 4.07 17.4 0 0 3
# 6 8 3 80 17.3 8 276. 180 3.07 3.73 17.6 0 0 3
# 7 8 3 80 15.2 8 276. 180 3.07 3.78 18 0 0 3
# 8 8 3 80 10.4 8 472 205 2.93 5.25 18.0 0 0 4
# 9 8 3 80 10.4 8 460 215 3 5.42 17.8 0 0 4
#10 8 3 80 14.7 8 440 230 3.23 5.34 17.4 0 0 4
# … with 22 more rows
I have kept the data in long format so that it is easier to join. The first part of the answer creates the cross table.
mtcars %>%
count(cyl_exp = cyl, gear, name = 'cyl_prob') %>%
group_by(gear) %>%
mutate(cyl_prob = prop.table(cyl_prob) * 100)
# cyl_exp gear cyl_prob
# <dbl> <dbl> <dbl>
#1 4 3 6.67
#2 4 4 66.7
#3 4 5 40
#4 6 3 13.3
#5 6 4 33.3
#6 6 5 20
#7 8 3 80
#8 8 5 40
From here we keep only the row with highest probability for each gear and join the data.
I used regular table and prop.table instead of ftable. I propose the following solution:
library(dplyr)
# Counts and row-wise percentages of cyl within each gear
tab <- table(mtcars$gear, mtcars$cyl)
tab_prob <- round(prop.table(tab, margin = 1) * 100, 2)
# Most frequent cyl for a given gear
exp_cyl <- function(x) {
  as.numeric(names(which.max(tab[toString(x), ])))
}
# Percentage of that most frequent cyl
prob_cyl <- function(x) {
  round(max(tab_prob[toString(x), ]), 2)
}
df <- mtcars
df %>% mutate(cyl_prob = sapply(gear, prob_cyl), cyl_exp = sapply(gear, exp_cyl))
Output :
mpg cyl disp hp drat wt qsec vs am gear carb cyl_prob cyl_exp
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 66.67 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 66.67 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 66.67 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 80.00 8
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 80.00 8
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 80.00 8
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 80.00 8
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 66.67 4
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 66.67 4
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 66.67 4
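A variant of the same idea builds a small lookup table once and then indexes into it with match(), instead of calling a function per row. This is a base R sketch; lookup, best and out are names made up here. Note that gear 5 is a tie (40% for both 4 and 8 cylinders), and which.max() returns the first maximum, so 4 is reported.

```r
cars <- datasets::mtcars

# Row-wise percentages of cyl within each gear
tab      <- table(cars$gear, cars$cyl)
tab_prob <- prop.table(tab, margin = 1) * 100

# For every gear: position of the most frequent cyl (first one on ties)
best <- apply(tab_prob, 1, which.max)

# One row per gear with the expected cyl and its percentage
lookup <- data.frame(
  gear     = as.numeric(rownames(tab_prob)),
  cyl_exp  = as.numeric(colnames(tab_prob)[best]),
  cyl_prob = round(apply(tab_prob, 1, max), 2)
)

# Attach the lookup values to each car via its gear
idx <- match(cars$gear, lookup$gear)
out <- cars
out$cyl_exp  <- lookup$cyl_exp[idx]
out$cyl_prob <- lookup$cyl_prob[idx]

head(out)
```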

data.table question: How not to overwrite the original data frame

With dplyr, when I use e.g. mutate(), I can run my data manipulations without an assignment, experiment, and only assign once I like the result. With data.table, however, it seems that performing an operation overwrites the original data frame, so if I change my mind I have to reload the data and start over. When I use the pipe, I often run the code up to the next pipe to check that everything is okay...
library(data.table, warn.conflicts = FALSE)
#> Warning: package 'data.table' was built under R version 3.6.1
library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 3.6.1
df <- as.data.table(mtcars)
# dplyr version
mtcars %>%
as_tibble() %>%
mutate(am = 2*am)
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 2 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 2 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 2 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
# Here I still have my original data frame mtcars.
df[, am := 2*am]
head(df)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 2 4 4
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 2 4 4
#> 3: 22.8 4 108 93 3.85 2.320 18.61 1 2 4 1
#> 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
df[cyl ==6, am := 2*am]
head(df)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 4 4 4
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 4 4 4
#> 3: 22.8 4 108 93 3.85 2.320 18.61 1 2 4 1
#> 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Created on 2019-07-11 by the reprex package (v0.3.0)
So here, if I just want to add a filter with data.table, I am going to multiply am by 2 again... Is this how data.table works? Is there a way not to overwrite the data frame? Or should I always make a copy when I am afraid of making a mistake?
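One possible way to experiment without touching the original is sketched below. It relies on two data.table behaviours: := modifies the table by reference, and copy() returns a deep copy. The names trial and preview are illustrative only.

```r
library(data.table)

df <- as.data.table(datasets::mtcars)

# copy() makes a deep copy, so := here modifies only the copy, not df
trial <- copy(df)[cyl == 6, am := 2 * am]

# Without := nothing is assigned by reference: this just returns a new table
preview <- df[, .(cyl, am_doubled = 2 * am)]

head(df)       # still the original values
head(trial)    # am doubled where cyl == 6
head(preview)
```

Note that a plain assignment such as df2 <- df does not create a copy in data.table; both names refer to the same table, so a later := on either one is visible through both.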

use dplyr mutate to create new columns based on a vector of column names

I would like to take the log of some columns, and create new columns that are all named log[original column name].
The code below works, but how can I pass the vector called columnstolog into mutate? Thank you.
library(dplyr)
data(mtcars)
columnstolog <- c('mpg', 'cyl', 'disp', 'hp')
mtcars %>% mutate(logmpg = log(mpg))
mtcars %>% mutate(logcyl = log(cyl))
Use mutate_at, if you can bear with _log being appended to the original column names:
mtcars %>% mutate_at(columnstolog, list(log = log))
# mpg cyl disp hp drat wt qsec vs am gear carb mpg_log cyl_log disp_log hp_log
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 3.044522 1.791759 5.075174 4.700480
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3.044522 1.791759 5.075174 4.700480
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3.126761 1.386294 4.682131 4.532599
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3.063391 1.791759 5.552960 4.700480
#5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 2.928524 2.079442 5.886104 5.164786
# ...
Ignoring the "use dplyr" part...
require(data.table)
mtcars <- as.data.table(mtcars)
mtcars[, paste0('log', columnstolog) := lapply(.SD, log), .SDcols = columnstolog]
You could also make use of rowwise() from the dplyr package:
mtcars %>%
rowwise %>%
mutate(logmpg = log(mpg),
logcyl = log(cyl))
# A tibble: 32 x 13
mpg cyl disp hp drat wt qsec vs am gear carb logmpg logcyl
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 3.044522 1.791759
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3.044522 1.791759
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3.126761 1.386294
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3.063391 1.791759
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 2.928524 2.079442
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 2.895912 1.791759
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 2.660260 2.079442
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 3.194583 1.386294
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 3.126761 1.386294
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 2.954910 1.791759
# ... with 22 more rows
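With dplyr 1.0 or later, across() together with its .names argument can produce the exact log[original column name] names asked for in the question. A minimal sketch, assuming that dplyr version (the result name logged is made up here):

```r
library(dplyr)

columnstolog <- c("mpg", "cyl", "disp", "hp")

logged <- datasets::mtcars %>%
  # all_of() selects the columns named in the character vector;
  # .names = "log{.col}" prefixes each new column with "log"
  mutate(across(all_of(columnstolog), log, .names = "log{.col}"))

head(logged)
```

Because .names is given, the original columns are kept and the logged versions are added alongside them.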
