Setting names intuitively with dplyr across() function - r

I want to manipulate several columns to create new columns with names that are variants of the names of the columns being manipulating.
dplyr 1.0.0's across() function seems like the tool for the job, but the .names argument seems to have limited functionality. Here's what I want to do:
tmp <- iris %>%
mutate(across(starts_with('Sepal'),
~ .x - Petal.Length,
.names = gsub('Sepal', '', "{col}")))
but the gsub function doesn't work. I can work around this in the following way:
tmp <- iris %>%
mutate(across(starts_with('Sepal'),
~ .x - Petal.Length,
.names = "mod_{col}"))
names(tmp) <- gsub("mod_Sepal", "mod_", names(tmp))
but that requires more code and is harder to keep track of. Am I missing something here and is there a simpler way to set the new column names with across?

We can use rename_at after the mutate step
library(dplyr)
library(stringr)
iris %>%
mutate(across(starts_with('Sepal'),
~ .x - Petal.Length)) %>%
rename_at(vars(starts_with("Sepal")), ~ str_remove(., "Sepal"))
According to ?across
.names - The default (NULL) is equivalent to "{col}" for the single function case
And there is no option to remove the already existing column name, but, we can add a suffix or prefix

You can pass a function to .names as -
library(dplyr)
iris %>%
mutate(across(starts_with('Sepal'), ~ .x - Petal.Length,
.names = "{gsub('Sepal.', '', {col}, fixed = TRUE)}"))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Length Width
#1 5.1 3.5 1.4 0.2 setosa 3.7 2.1
#2 4.9 3.0 1.4 0.2 setosa 3.5 1.6
#3 4.7 3.2 1.3 0.2 setosa 3.4 1.9
#4 4.6 3.1 1.5 0.2 setosa 3.1 1.6
#5 5.0 3.6 1.4 0.2 setosa 3.6 2.2
#6 5.4 3.9 1.7 0.4 setosa 3.7 2.2

Related

Rename several columns using start with in r

I want to rename multiple columns that starts with the same string.
However, all the codes I tried did not change the columns.
For example this:
df %>% rename_at(vars(matches('^oldname,\\d+$')), ~ str_replace(., 'oldname', 'newname'))
And also this:
df %>% rename_at(vars(starts_with(oldname)), funs(sub(oldname, newname, .))
Are you familiar with a suitable code for rename?
Thank you!
Take iris for example, you can use rename_with() to replace those column names started with "Petal" with a new string.
head(iris) %>%
rename_with(~ sub("^Petal", "New", .x), starts_with("Petal"))
Sepal.Length Sepal.Width New.Length New.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
You can also use rename_at() in this case, although rename_if(), rename_at(), and rename_all() have been superseded by rename_with().
head(iris) %>%
rename_at(vars(starts_with("Petal")), ~ sub("^Petal", "New", .x))

How to tidily create multiple columns from sets of columns?

I'm looking to use a non-across function from mutate to create multiple columns. My problem is that the variable in the function will change along with the crossed variables. Here's an example:
needs=c('Sepal.Length','Petal.Length')
iris %>% mutate_at(needs, ~./'{col}.Width')
This obviously doesn't work, but I'm looking to divide Sepal.Length by Sepal.Width and Petal.Length by Petal.Width.
I think your needs should be something which is common in both the columns.
You can select the columns based on the pattern in needs and divide the data based on position. !! and := is used to assign name of the new columns.
library(dplyr)
library(rlang)
needs = c('Sepal','Petal')
purrr::map_dfc(needs, ~iris %>%
select(matches(.x)) %>%
transmute(!!paste0(.x, '_divide') := .[[1]]/.[[2]]))
# Sepal_divide Petal_divide
#1 1.457142857 7.000000000
#2 1.633333333 7.000000000
#3 1.468750000 6.500000000
#4 1.483870968 7.500000000
#...
#...
If you want to add these as new columns you can do bind_cols the above with iris.
Here is a base R approach based that the columns you want to divide have a similar name pattern,
res <- sapply(split.default(iris[-ncol(iris)], sub('\\..*', '', names(iris[-ncol(iris)]))), function(i) i[1] / i[2])
iris[names(res)] <- res
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Petal.Length Sepal.Sepal.Length
#1 5.1 3.5 1.4 0.2 setosa 7.00 1.457143
#2 4.9 3.0 1.4 0.2 setosa 7.00 1.633333
#3 4.7 3.2 1.3 0.2 setosa 6.50 1.468750
#4 4.6 3.1 1.5 0.2 setosa 7.50 1.483871
#5 5.0 3.6 1.4 0.2 setosa 7.00 1.388889
#6 5.4 3.9 1.7 0.4 setosa 4.25 1.384615

Summing across selected columns (using select() methods) in dplyr [duplicate]

This question already has answers here:
dplyr mutate rowSums calculations or custom functions
(7 answers)
Closed 3 years ago.
Summing across columns by listing their names is fairly simple:
iris %>% rowwise() %>% mutate(sum = sum(Sepal.Length, Sepal.Width, Petal.Length))
However, say there are a lot more columns, and you are interested in extracting all columns containing "Sepal" without manually listing them out. Specifically, I'm looking for a method in the same way select() in dplyr allows you to subset columns with with contains(), starts_with(), etc.
There are ways to use mutate_all() + sum() + join() in order to fulfill the same result as this query, but I am more interested in seeing something as close to the solution as the code below:
iris %>% rowwise() %>% mutate(sum = sum(contains(colnames(.), "Sepal")))
If I understand correctly, basically you're trying to do:
library(dplyr)
iris %>% mutate(sum = rowSums(select(., contains("Sepal"))))
First few rows:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3

dplyr filter by the first column

Is it possible to filter in dplyr by the position of a column?
I know how to do it without dplyr
iris[iris[,1]>6,]
But how can I do it in dplyr?
Thanks!
Besides the suggestion by #thelatemail, you can also use filter_at and pass the column number to vars parameter:
iris %>% filter_at(1, all_vars(. > 6))
all(iris %>% filter_at(1, all_vars(. > 6)) == iris[iris[,1] > 6, ])
# [1] TRUE
No magic, just use the item column number as per above, rather than the variable (column) name:
library("dplyr")
iris %>%
filter(iris[,1] > 6)
Which as #eipi10 commented is better as
iris %>%
filter(.[[1]] > 6)
dply >= 1.0.0
Scoped verbs (_if, _at, _all) and by extension all_vars() and any_vars() have been superseded by across(). In the case of filter the functions if_any and if_all have been created to combine logic across multiple columns to aid in subsetting (these verbs are available in dplyr >= 1.0.4):
if_any() and if_all() are used with to apply the same predicate function to a selection of columns and combine the results into a single logical vector.
The first argument to across, if_any, and if_any is still tidy-select syntax for column selection, which includes selection by column position.
Single Column
In your single column case you could do any with the same result:
iris %>%
filter(across(1, ~ . > 6))
iris %>%
filter(if_any(1, ~ . > 6))
iris %>%
filter(if_all(1, ~ . > 6))
Multiple Columns
If you're apply a predicate function or formula across multiple columns then across might give unexpected results and in this case you should use if_any and if_all:
iris %>%
filter(if_all(c(2, 4), ~ . > 2.3)) # by column position
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.3 3.3 6.0 2.5 virginica
2 7.2 3.6 6.1 2.5 virginica
3 5.8 2.8 5.1 2.4 virginica
4 6.3 3.4 5.6 2.4 virginica
5 6.7 3.1 5.6 2.4 virginica
6 6.7 3.3 5.7 2.5 virginica
Notice this returns rows where all selected columns have a value greater than 2.3, which is a subset of rows where any of the selected columns meet the logic:
iris %>%
filter(if_any(ends_with("Width"), ~ . > 2.3)) # same columns selection as above
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 6.7 3.3 5.7 2.5 virginica
7 6.7 3.0 5.2 2.3 virginica
8 6.3 2.5 5.0 1.9 virginica
9 6.5 3.0 5.2 2.0 virginica
10 6.2 3.4 5.4 2.3 virginica
11 5.9 3.0 5.1 1.8 virginica
The output above was shorted to be more compact for this example.

using 'ifelse' in R: variable taking static value

I am trying to create new variable in a dataset based on the value of an indicator. The following is the code for the same:
prac_data <- head(iris,10)
COPY_IND='Y' ##declaring the indicator to be 'Y'
prac_data <- prac_data %>% mutate(New_Var=ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
I get the following output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species New_Var
1 5.1 3.5 1.4 0.2 setosa 5.1
2 4.9 3.0 1.4 0.2 setosa 5.1
3 4.7 3.2 1.3 0.2 setosa 5.1
4 4.6 3.1 1.5 0.2 setosa 5.1
5 5.0 3.6 1.4 0.2 setosa 5.1
6 5.4 3.9 1.7 0.4 setosa 5.1
7 4.6 3.4 1.4 0.3 setosa 5.1
8 5.0 3.4 1.5 0.2 setosa 5.1
9 4.4 2.9 1.4 0.2 setosa 5.1
10 4.9 3.1 1.5 0.1 setosa 5.1
I actually want to copy the variable 'Sepal.Length' in the 'New_Var' for every observation if indicator(COPY_IND) is Yes('Y').
If I do the the following, I get the desired response:
if (COPY_IND=='Y')
{
prac_data$New_Var <- prac_data$Sepal.Length
} else {prac_data$New_Var <- 'N'}
I just want to understand why R treats both 'if-else' approaches differently?
Is there another better elegant way to the same?
Thanks in advance!!
Actually, this might be easier to read as an answer.
From ifelse() help: "ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE".
Your test is just a single value, so ifelse() returns a single value, either Sepal.Length[1] or N, which is then duplicated across the whole column.
You need rowwise() on your way: prac_data <- prac_data %>% rowwise() %>% mutate(New_Var = ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
COPY_IND is always "Y" in your case, then the code could be simplified to prac_data$New_Var = prac_data$Sepal.Length. Even if you want to use ifelse statement row-wisely, it is better to follow the instructions in the help document
Further note that if(test) yes else no is much more efficient and often much preferable to ifelse(test, yes, no) whenever test is a simple true/false result, i.e., when length(test) == 1.
I guess the desired COPY_IND should be one column of the data frame/vector rather than a single fixed value. In this case, you code generate the right answer, e.g. keep the first five number:
library(dplyr)
prac_data <- head(iris,10)
prac_data$COPY_IND=c(rep('Y',5),rep('N',5))
#COPY_IND=c(rep('Y',5),rep('N',5)) works too
prac_data <- prac_data %>% mutate(New_Var=ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
generates the right column.

Resources