Summing across selected columns (using select() methods) in dplyr [duplicate] - r

Summing across columns by listing their names is fairly simple:
iris %>% rowwise() %>% mutate(sum = sum(Sepal.Length, Sepal.Width, Petal.Length))
However, say there are many more columns, and you want to sum all columns whose names contain "Sepal" without listing them manually. Specifically, I'm looking for a method that works the way select() in dplyr lets you subset columns with contains(), starts_with(), etc.
There are ways to combine mutate_all() + sum() + join() to get the same result, but I am more interested in a solution that stays as close as possible to the code below:
iris %>% rowwise() %>% mutate(sum = sum(contains(colnames(.), "Sepal")))

If I understand correctly, basically you're trying to do:
library(dplyr)
iris %>% mutate(sum = rowSums(select(., contains("Sepal"))))
First few rows:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
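With dplyr 1.0.0 or later, the same row sum can also be written with across() or c_across(), which accept the select() helpers directly. The following is a minimal sketch, assuming a recent dplyr is installed; the names with_across and with_c_across are only illustrative.

```r
library(dplyr)

# Vectorised: across() returns the selected columns as a data frame for rowSums()
with_across <- iris %>%
  mutate(sum = rowSums(across(contains("Sepal"))))

# Row-wise: c_across() collects the selected columns of each row into a vector
with_c_across <- iris %>%
  rowwise() %>%
  mutate(sum = sum(c_across(contains("Sepal")))) %>%
  ungroup()

head(with_across)
```

The rowSums() form is usually the better choice on large data because it avoids iterating row by row; the c_across() form is closer to the rowwise() syntax in the question.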

Related

Rename several columns using start with in r

I want to rename multiple columns that start with the same string.
However, none of the code I tried changed the column names.
For example this:
df %>% rename_at(vars(matches('^oldname,\\d+$')), ~ str_replace(., 'oldname', 'newname'))
And also this:
df %>% rename_at(vars(starts_with(oldname)), funs(sub(oldname, newname, .))
Is there a suitable way to rename these columns?
Thank you!
Taking iris as an example, you can use rename_with() to replace the "Petal" prefix of every column name starting with "Petal" with a new string.
head(iris) %>%
rename_with(~ sub("^Petal", "New", .x), starts_with("Petal"))
Sepal.Length Sepal.Width New.Length New.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
You can also use rename_at() in this case, although rename_if(), rename_at(), and rename_all() have been superseded by rename_with().
head(iris) %>%
rename_at(vars(starts_with("Petal")), ~ sub("^Petal", "New", .x))
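Applied to the situation in the question, the same pattern might look like the sketch below. The data frame df and its column names (oldname1, oldname2, other) are hypothetical stand-ins, since the original data was not shown.

```r
library(dplyr)

# Hypothetical data frame with two columns sharing the "oldname" prefix
df <- data.frame(oldname1 = 1:3, oldname2 = 4:6, other = 7:9)

# Rename only the columns selected by starts_with(); the others are untouched
renamed <- df %>%
  rename_with(~ sub("^oldname", "newname", .x), starts_with("oldname"))

names(renamed)
```

Note that the prefix is passed as a quoted string to both starts_with() and sub(), and that the result has to be assigned (or printed) because rename_with() does not modify df in place.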

Can we create/mutate several new columns at once in tidyverse [duplicate]

Let me clarify I am not looking at mutate_at or mutate(across(..., ...)) type of syntax here. I just want to know how to create several new columns at once inside tidyverse pipe syntax.
Take the iris dataset as an example.
I want to create, say, 10 (or 100 or more) new columns according to a rule like this:
the first new column (variable), say V1, is just Petal.Length * 1,
the second new column, say V2, is Petal.Length * 2,
and so on up to V10, which is Petal.Length * 10,
without explicitly writing the name and formula for each of these columns, which would be cumbersome if I wanted to create, say, 100 new columns.
You can use the map functions from purrr:
library(dplyr)
library(purrr)
df <- iris %>% head
value <- 1:5
bind_cols(df,
map_dfc(value, ~df %>% transmute(!!paste0('col', .x) := Petal.Length * .x)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species col1 col2 col3 col4 col5
#1 5.1 3.5 1.4 0.2 setosa 1.4 2.8 4.2 5.6 7.0
#2 4.9 3.0 1.4 0.2 setosa 1.4 2.8 4.2 5.6 7.0
#3 4.7 3.2 1.3 0.2 setosa 1.3 2.6 3.9 5.2 6.5
#4 4.6 3.1 1.5 0.2 setosa 1.5 3.0 4.5 6.0 7.5
#5 5.0 3.6 1.4 0.2 setosa 1.4 2.8 4.2 5.6 7.0
#6 5.4 3.9 1.7 0.4 setosa 1.7 3.4 5.1 6.8 8.5
In base R, this can be done with lapply :
df[paste0('col', value)] <- lapply(value, `*`, df$Petal.Length)
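Another option that stays inside a single mutate() call is to build the column expressions programmatically and splice them in with !!!. This is a sketch under the assumption that rlang is available (it is installed with dplyr); the names multipliers, new_cols and result are illustrative.

```r
library(dplyr)
library(rlang)

multipliers <- 1:10

# Build one unevaluated expression per new column: Petal.Length * 1, * 2, ...
new_cols <- lapply(multipliers, function(k) expr(Petal.Length * !!k))
names(new_cols) <- paste0("V", multipliers)

# !!! splices the named expressions into mutate() as if they were typed by hand
result <- head(iris) %>% mutate(!!!new_cols)

names(result)
```

Changing multipliers to 1:100 would create V1 to V100 without any further edits.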

Setting names intuitively with dplyr across() function

I want to manipulate several columns to create new columns with names that are variants of the names of the columns being manipulated.
dplyr 1.0.0's across() function seems like the tool for the job, but the .names argument seems to have limited functionality. Here's what I want to do:
tmp <- iris %>%
mutate(across(starts_with('Sepal'),
~ .x - Petal.Length,
.names = gsub('Sepal', '', "{col}")))
but the gsub function doesn't work. I can work around this in the following way:
tmp <- iris %>%
mutate(across(starts_with('Sepal'),
~ .x - Petal.Length,
.names = "mod_{col}"))
names(tmp) <- gsub("mod_Sepal", "mod_", names(tmp))
but that requires more code and is harder to keep track of. Am I missing something here and is there a simpler way to set the new column names with across?
We can use rename_at after the mutate step
library(dplyr)
library(stringr)
iris %>%
mutate(across(starts_with('Sepal'),
~ .x - Petal.Length)) %>%
rename_at(vars(starts_with("Sepal")), ~ str_remove(., "Sepal"))
According to ?across
.names - The default (NULL) is equivalent to "{col}" for the single function case
And there is no option to remove the already existing column name, but, we can add a suffix or prefix
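To illustrate the prefix/suffix behaviour described above, here is a small sketch that adds a suffix through .names and then strips the "Sepal." part afterwards with rename_with(). The suffix "_diff" and the object names are arbitrary choices for the example, not something from the question.

```r
library(dplyr)

# .names is a glue template; {.col} is replaced by each selected column name
with_suffix <- head(iris) %>%
  mutate(across(starts_with("Sepal"), ~ .x - Petal.Length,
                .names = "{.col}_diff"))

# Strip the leading "Sepal." only from the newly created columns
stripped <- with_suffix %>%
  rename_with(~ sub("^Sepal\\.", "", .x), ends_with("_diff"))

names(stripped)
```

Because the original Sepal columns are kept, the renaming step targets only the columns ending in "_diff".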
The .names argument is a glue specification, so you can call a function inside the braces:
library(dplyr)
iris %>%
mutate(across(starts_with('Sepal'), ~ .x - Petal.Length,
.names = "{gsub('Sepal.', '', {col}, fixed = TRUE)}"))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Length Width
#1 5.1 3.5 1.4 0.2 setosa 3.7 2.1
#2 4.9 3.0 1.4 0.2 setosa 3.5 1.6
#3 4.7 3.2 1.3 0.2 setosa 3.4 1.9
#4 4.6 3.1 1.5 0.2 setosa 3.1 1.6
#5 5.0 3.6 1.4 0.2 setosa 3.6 2.2
#6 5.4 3.9 1.7 0.4 setosa 3.7 2.2

apply ntile function to list of data frames with different bucket sizes

I would like to use the ntile function from dplyr or a similar function on a list of data frames but using a different n for each data frame. My list contains 150 data frames so a manual solution like the one below will not work. How can I rewrite the code below to act on the list of data frames and return the list of data frames with the new column?
library(tidyverse)
iris_list=split(iris,iris$Species)
iris_setosa=iris_list[[1]]
iris_versicolor=iris_list[[2]]
iris_virginica=iris_list[[3]]
iris_setosa$n3=ntile(iris_setosa$Sepal.Length,3)
iris_versicolor$n5=ntile(iris_versicolor$Sepal.Length,5)
iris_virginica$n7=ntile(iris_virginica$Sepal.Length,7)
The final result should be this
final_list=list(iris_setosa,iris_versicolor,iris_virginica)
head(final_list[[1]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species n3
1 5.1 3.5 1.4 0.2 setosa 2
2 4.9 3.0 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5.0 3.6 1.4 0.2 setosa 2
6 5.4 3.9 1.7 0.4 setosa 3
There are several ways to achieve this, depending on what type of object you want in the end.
One way would be to use base::expand.grid and purrr::pmap like this:
percentiles = list(3,5,7)
iris_list %>%
map("Sepal.Length") %>%
expand.grid(percentiles) %>%
pmap(~ntile(..1,..2))
First, you only need the Sepal.Length variable from each dataset, so you use purrr::map to extract it.
Then, expand.grid creates a data frame of all combinations of its arguments. Here, with 2 lists of 3 members each, it returns a data frame of 3x3=9 rows: setosa 3, versicolor 3, virginica 3, setosa 5, ...
Finally, pmap iterates over the rows of that data frame and applies ntile, with the first column (from iris_list) as the first argument and the second column (percentiles) as the second argument. Unfortunately, purrr does not handle names well here, although that appears to be by design.
EDIT:
Your edit is really a different question, so here is another answer:
iris_list %>%
map(~mutate(.x, n3=ntile(Sepal.Length,3),
n5=ntile(Sepal.Length,5), n7=ntile(Sepal.Length,7)))
I've found a way that works
n_size=data.frame(Species=c("setosa","versicolor","virginica"),size=c(3,5,7))
iris_bin=iris %>% inner_join(n_size,by="Species") %>%
group_by(Species)%>%
mutate(bin=ntile(Sepal.Length,size[1])) %>%
arrange(Species,Sepal.Length,bin)
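If the result should stay a list of data frames, with a different n applied to each element, purrr::map2 can walk over the list and a vector of bucket counts in parallel. This is a sketch, assuming the bucket counts are stored in a vector in the same order as the list; n_sizes, add_ntile and the generated column names (n3, n5, n7) are illustrative.

```r
library(dplyr)
library(purrr)

iris_list <- split(iris, iris$Species)
n_sizes <- c(3, 5, 7)  # assumed: one bucket count per data frame, same order

# Add a column named "n<k>" holding the ntile of Sepal.Length with k buckets
add_ntile <- function(df, k) {
  mutate(df, !!paste0("n", k) := ntile(Sepal.Length, k))
}

final_list <- map2(iris_list, n_sizes, add_ntile)

head(final_list[[1]])
```

map2 keeps the names of iris_list, so final_list$versicolor and final_list$virginica are available as well. With 150 data frames, n_sizes would simply be a vector of length 150.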

Use mixedsort on a select group of cells, R

I have a data.frame of cells containing a mix of numbers and characters.
For example
data(iris)
iris$comb<-paste(iris$Sepal.Length,'-',iris$Species)
iris$comb2<-paste(iris$Sepal.Width,'-',iris$Species)
head(iris[,6:7])
comb comb2
1 5.1 - setosa 3.5 - setosa
2 4.9 - setosa 3 - setosa
3 4.7 - setosa 3.2 - setosa
4 4.6 - setosa 3.1 - setosa
5 5 - setosa 3.6 - setosa
6 5.4 - setosa 3.9 - setosa
I want to sort groups of cells based on their numeric value, and I can do this with gtools::mixedsort(). However, I have several columns that need this, and I only want to sort every 3 rows in a column, independently of the rest of the column. The (extremely) long way to do this would be
library(gtools)
mixedsort(iris[1:3,6],decreasing=TRUE)
mixedsort(iris[4:6,6],decreasing=TRUE)
I'm just not sure how to loop through little bunches of cells like this. I would very much appreciate any help.
We create a grouping variable with gl, so that every 3 consecutive rows form one group, and then use mutate_at to apply the function to the columns of interest within each group:
library(gtools)
library(dplyr)
iris %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate_at(vars(matches("comb")), ~ mixedsort(., decreasing = TRUE)) %>%
ungroup() %>%
select(-grp)
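To see what the grouping variable looks like, the sketch below builds it for a small hypothetical row count using base R only. The values of n_rows and chunk are made up for the illustration.

```r
n_rows <- 7   # hypothetical row count, deliberately not a multiple of 3
chunk  <- 3   # number of consecutive rows per group

# gl() repeats each level `chunk` times and truncates to `n_rows` values
grp_gl <- as.integer(gl(n_rows, chunk, n_rows))

# Equivalent construction with integer division
grp_div <- (seq_len(n_rows) - 1) %/% chunk + 1

grp_gl
# 1 1 1 2 2 2 3
```

Rows 1 to 3 fall into group 1, rows 4 to 6 into group 2, and any leftover rows form a final, shorter group.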
