dplyr mutating multiple columns by prefix and suffix

I have a problem that I can replicate using the iris dataset: there are many groups of variables (sharing a prefix in the name), each with two different suffixes. I want to take a ratio within each of these groups but can't find a tidyverse solution. I would have thought mutate_at() might be able to help.
In the iris dataset, for example, I want to generate a Petal ratio of Length / Width from the Petal columns, and likewise for Sepal. I don't want to do this manually in a mutate() because I have lots of variable groups, and they could change over time.
I do have a solution that works using base R (in the code below) but I wanted to know if there was a tidyverse solution that achieved the same.
# libs ----
library(tidyverse)
# data ----
df <- iris
glimpse(df)
# set up column vectors ----
length_cols <- names(df) %>% str_subset("Length") %>% sort()
width_cols <- names(df) %>% str_subset("Width") %>% sort()
new_col_names <- names(df) %>% str_subset("Length") %>% str_replace("\\.Length$", ".Ratio") %>% sort()
length_cols
width_cols
new_col_names
# make new cols ----
df[, new_col_names] <- df[, length_cols] / df[, width_cols]
df %>% head()
Thanks,
Gareth

Here is one possibility using purrr::map:
library(tidyverse)
df <- map(c("Petal", "Sepal"), ~ iris %>%
    mutate(
        !!paste0(.x, ".Ratio") := !!as.name(paste0(.x, ".Length")) / !!as.name(paste0(.x, ".Width")))) %>%
    reduce(left_join)
head(df)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Ratio
#1 5.1 3.5 1.4 0.2 setosa 7.00
#2 4.9 3.0 1.4 0.2 setosa 7.00
#3 4.7 3.2 1.3 0.2 setosa 6.50
#4 4.6 3.1 1.5 0.2 setosa 7.50
#5 5.0 3.6 1.4 0.2 setosa 7.00
#6 5.4 3.9 1.7 0.4 setosa 4.25
# Sepal.Ratio
#1 1.457143
#2 1.633333
#3 1.468750
#4 1.483871
#5 1.388889
#6 1.384615
Explanation: We map the prefixes "Petal" and "Sepal" to iris by extracting for each prefix the columns with suffixes "Length" and "Width", and calculate a new corresponding prefix + ".Ratio" column; reduce merges both data.frames.
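A join-free variant may be worth considering. One caveat with the reduce(left_join) step, as far as I can tell: iris contains a duplicated row, so joining on all shared columns can return more rows than the original 150. The sketch below instead computes each ratio as a plain vector and binds the results on as new columns. The names prefixes, ratios and df are illustrative, and the prefixes are derived from the column names on the assumption that every *.Length column has a matching *.Width column.

```r
library(dplyr)
library(purrr)
library(stringr)

# Derive the group prefixes ("Sepal", "Petal") from the *.Length columns.
prefixes <- names(iris) %>%
  str_subset("\\.Length$") %>%
  str_remove("\\.Length$")

# One ratio vector per prefix, stored in a list named "<prefix>.Ratio".
ratios <- prefixes %>%
  set_names(paste0(prefixes, ".Ratio")) %>%
  map(~ iris[[paste0(.x, ".Length")]] / iris[[paste0(.x, ".Width")]])

# Attach the new columns; the row count stays the same as in iris.
df <- bind_cols(iris, as.data.frame(ratios))
head(df)
```

Because nothing is joined, adding a new prefix group only requires that the corresponding Length and Width columns exist.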

Related

Set column names conditionally using dplyr

I want to set column names based on the number of columns.
For example,
iris1 <- iris[, 1:4]
if (ncol(iris1) == 4) colnames(iris1) <- c("a", "b", "c", "d")
if (ncol(iris1) == 5) colnames(iris1) <- c("a", "b", "c", "d", "e")
I am looking for a way to do that using the dplyr pipeline. Something like this:
iris1 %>%
setNames(ifelse(ncol(.)==4,c("a","b","c","d"),c("a","b","c","d","e")))
UPDATE:
akrun's answer gave me this idea which works for me in this particular use-case.
cnames <- c("a","b","c","d","e")
iris1 %>% setNames(cnames[1:ncol(.)])
This solution cannot be generalised. Better solutions are welcome.

If this is based on a user input 'n', then we can use rename_at
library(dplyr)
n <- 4
iris %>%
rename_at(seq_len(n), ~ letters[seq_len(n)])
which can be wrapped into a function
rename_fn <- function(dat, n){
dat %>%
rename_at(seq_len(n), ~ letters[seq_len(n)])
}
rename_fn(iris, 4)
rename_fn(iris, 5)
If the goal is to change all the columns of the dataset, then an easier option is set_names (from purrr), using the cnames vector from the question
iris %>%
set_names(cnames[seq_len(ncol(.))])
Or in base R
setNames(iris, cnames[seq_len(ncol(iris))])

If you want to rename all the columns, you should probably use rename_all
library(dplyr)
iris1 %>% rename_all(~cnames[seq_along(.)]) %>% head
# a b c d
#1 5.1 3.5 1.4 0.2
#2 4.9 3.0 1.4 0.2
#3 4.7 3.2 1.3 0.2
#4 4.6 3.1 1.5 0.2
#5 5.0 3.6 1.4 0.2
#6 5.4 3.9 1.7 0.4
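In dplyr 1.0.0 and later, rename_with() covers the same ground as rename_at() and rename_all(). A minimal sketch, assuming that version is available; cnames is the vector from the question's update, and rename_first_n is a hypothetical helper name:

```r
library(dplyr)

cnames <- c("a", "b", "c", "d", "e")

# Rename every column, however many there are.
iris1 <- iris[, 1:4]
renamed_all <- iris1 %>% rename_with(~ cnames[seq_along(.x)])

# Rename only the first n columns, leaving the rest untouched.
rename_first_n <- function(dat, n) {
  dat %>% rename_with(~ cnames[seq_along(.x)], .cols = all_of(seq_len(n)))
}
renamed_some <- rename_first_n(iris, 2)

names(renamed_all)
names(renamed_some)
```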

Extracting columns from Data Frame based on a "formula"

I have some data which looks like:
library(dplyr)
library(stringr)
data(iris)
d <- iris %>%
    select(Species, everything()) %>%
    rename(Y = 1) %>%
    rename_at(vars(-c(1)), ~str_c("X", seq_along(.)))
Data:
Y X1 X2 X3 X4
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3.0 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5.0 3.6 1.4 0.2
6 setosa 5.4 3.9 1.7 0.4
I add a random variable:
d$noise <- rnorm(nrow(d))
I am trying to extract just the Y, X1, X2... XN variables (dynamically). What I currently have is:
d %>%
select("Y", cat(paste0("X", seq_along(2:ncol(.)), collapse = ", ")))
This doesn't work since it takes into account the noise column and doesn't work even without the noise column.
So I am trying to create a new data frame which just extracts the Y, X1, X2...XN columns.

dplyr provides two select helper functions that you could use --- contains for literal strings or matches for regular expressions.
In this case you could do
d %>%
select("Y", contains("X"))
or
d %>%
select("Y", matches("^X\\d+$"))
The first one works in the example you provided but would fail if you have other variables whose names contain an "X" (contains() ignores case by default, so a lowercase "x" counts too). The second is more robust: anchored with ^ and $, it only captures variables whose names are "X" followed by one or more digits.

We can also use starts_with
d %>%
select(Y, starts_with('X'))
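To see how these helpers differ in practice, here is a small sketch that rebuilds d as in the question and adds one extra column, Xtra, which is purely hypothetical and exists only to show the difference between a prefix match and an anchored regular expression:

```r
library(dplyr)

d <- iris %>%
  select(Species, everything()) %>%
  rename(Y = 1) %>%
  rename_with(~ paste0("X", seq_along(.x)), .cols = -Y)
d$noise <- rnorm(nrow(d))
d$Xtra <- 0  # hypothetical column whose name also starts with "X"

loose  <- d %>% select(Y, starts_with("X"))      # prefix match
strict <- d %>% select(Y, matches("^X\\d+$"))    # "X" followed only by digits

names(loose)   # picks up Xtra as well
names(strict)  # Y and the numbered X columns only
```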

In a for loop, how do I insert the variable i inside the "starts_with" quotation?

I have a big data frame, with species in rows and samples in columns. There are 30 samples, with 12 replicates each. The column names are written like this: sample.S1.01, sample.S1.02, ..., sample.S30.11, sample.S30.12.
I would like to create 30 new tables, each containing the 12 replicates of one sample.
I have this command that works perfectly for one sample at a time:
dt<- tab_sp_sum %>%
select(starts_with("sample.S1."))
assign(paste("tab_sp_1"), dt)
But when I put this in a for loop, it doesn't work anymore.
I think it's due to the fact that the variable i is included in the starts_with quotation, and I don't know how to write it.
for (i in 1:30){
dt<- tab_sp_sum %>%
select(starts_with("sample.S",i,".", sep=""))
assign(paste("tab_sp",i,sep="_"), dt)
}
The last line works well: 30 tables are created with the right names, but they are empty.
Any suggestions?
Thank you

Instead of using assign and storing the results in separate objects, try using a list. Create the prefixes that you want to select with paste0, then use map to create a list of data frames.
library(dplyr)
library(purrr)
df_names <- paste0("sample.S", 1:30, ".")
df1 <- map(df_names, ~tab_sp_sum %>% select(starts_with(.x)))
You can then use df1[[1]], df1[[2]] to access individual dataframes.
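In base R, we can use lapply with startsWith to select the columns that start with each element of df_names. (A regex built with paste0("^", x) would treat the dots as wildcards, so the pattern for "sample.S1." would also match sample.S10.01.)
df1 <- lapply(df_names, function(x)
    tab_sp_sum[startsWith(names(tab_sp_sum), x)])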
Using it with built-in iris dataset
df_names <- c("Sepal", "Petal")
df1 <- map(df_names, ~iris %>% select(starts_with(.x)))
head(df1[[1]])
# Sepal.Length Sepal.Width
#1 5.1 3.5
#2 4.9 3.0
#3 4.7 3.2
#4 4.6 3.1
#5 5.0 3.6
#6 5.4 3.9
head(df1[[2]])
# Petal.Length Petal.Width
#1 1.4 0.2
#2 1.4 0.2
#3 1.3 0.2
#4 1.5 0.2
#5 1.4 0.2
#6 1.7 0.4
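For the loop in the question itself, the immediate fix is to build the prefix with paste0() before handing it to starts_with(), which takes the prefix as a single string and does not paste several pieces together. A small sketch on made-up data (this tab_sp_sum is a hypothetical stand-in with only three samples), returning a named list rather than calling assign():

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the real table.
tab_sp_sum <- data.frame(
  sample.S1.01  = 1:3, sample.S1.02  = 4:6,
  sample.S2.01  = 7:9, sample.S10.01 = 10:12
)

sample_ids <- c(1, 2, 10)

# One data frame per sample, named tab_sp_1, tab_sp_2, tab_sp_10.
tabs <- sample_ids %>%
  set_names(paste0("tab_sp_", sample_ids)) %>%
  map(~ tab_sp_sum %>% select(starts_with(paste0("sample.S", .x, "."))))

names(tabs$tab_sp_1)
names(tabs$tab_sp_10)
```

Keeping the trailing "." in the prefix is what stops sample 1 from also picking up the columns of sample 10.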

We can use split in base R
nm1 <- paste(c("Sepal", "Petal"), collapse="|")
nm2 <- grep(nm1, names(iris), value = TRUE)
out <- split.default(iris[nm2], sub("\\..*", "", nm2))
head(out[[1]])
# Petal.Length Petal.Width
#1 1.4 0.2
#2 1.4 0.2
#3 1.3 0.2
#4 1.5 0.2
#5 1.4 0.2
#6 1.7 0.4
head(out[[2]])
# Sepal.Length Sepal.Width
#1 5.1 3.5
#2 4.9 3.0
#3 4.7 3.2
#4 4.6 3.1
#5 5.0 3.6
#6 5.4 3.9
Or in tidyverse (all_of() makes it explicit that nm2 is an external character vector)
iris %>%
    select(all_of(nm2)) %>%
    split.default(str_remove(nm2, "\\..*"))

Difference between pull and select in dplyr?

It seems like dplyr::pull() and dplyr::select() do the same thing. Is there a difference besides that dplyr::pull() only selects 1 variable?

First, it makes sense to see what class each function creates.
library(dplyr)
mtcars %>% pull(cyl) %>% class()
#> 'numeric'
mtcars %>% select(cyl) %>% class()
#> 'data.frame'
So pull() creates a vector -- which, in this case, is numeric -- whereas select() creates a data frame.
Basically, pull() is equivalent to writing mtcars$cyl or mtcars[, "cyl"], whereas select() removes all of the columns except for cyl but maintains the data frame structure.

You could see select as an analogue of [ or magrittr::extract and pull as an analogue of [[ (or $) or magrittr::extract2 for data frames (an analogue of [[ for lists would be purrr::pluck).
library(dplyr)
library(magrittr) # for extract() and extract2()
df <- iris %>% head
All of these give the same output:
df %>% pull(Sepal.Length)
df %>% pull("Sepal.Length")
a <- "Sepal.Length"; df %>% pull(!!quo(a))
df %>% extract2("Sepal.Length")
df %>% `[[`("Sepal.Length")
df[["Sepal.Length"]]
# all of them:
# [1] 5.1 4.9 4.7 4.6 5.0 5.4
And all of these give the same output:
df %>% select(Sepal.Length)
a <- "Sepal.Length"; df %>% select(!!quo(a))
df %>% select("Sepal.Length")
df %>% extract("Sepal.Length")
df %>% `[`("Sepal.Length")
df["Sepal.Length"]
# all of them:
# Sepal.Length
# 1 5.1
# 2 4.9
# 3 4.7
# 4 4.6
# 5 5.0
# 6 5.4
pull and select can take literal, character, or numeric indices, while the others take character or numeric only.
One important difference is how they handle negative indices.
For select, negative indices mean columns to drop.
For pull, they mean counting from the last column.
df %>% pull(-Sepal.Length)
df %>% pull(-1)
# [1] setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
A strange result, but Sepal.Length is converted to its position, 1, and column -1 is Species (the last column).
This feature is not supported by [[ and extract2:
df %>% `[[`(-1)
df %>% extract2(-1)
df[[-1]]
# Error in .subset2(x, i, exact = exact) :
# attempt to select more than one element in get1index <real>
Negative indices to drop columns are supported by [ and extract though.
df %>% select(-Sepal.Length)
df %>% select(-1)
df %>% `[`(-1)
df[-1]
# Sepal.Width Petal.Length Petal.Width Species
# 1 3.5 1.4 0.2 setosa
# 2 3.0 1.4 0.2 setosa
# 3 3.2 1.3 0.2 setosa
# 4 3.1 1.5 0.2 setosa
# 5 3.6 1.4 0.2 setosa
# 6 3.9 1.7 0.4 setosa
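The equivalences above can be checked programmatically rather than by eye. A short sketch, using only dplyr and base R (the object names v, s, last_col and dropped are illustrative):

```r
library(dplyr)

df <- head(iris)

v <- df %>% pull(Sepal.Length)    # bare vector
s <- df %>% select(Sepal.Length)  # one-column data frame

class(v)
class(s)

# pull() matches [[ ; a negative index counts from the last column.
identical(v, df[["Sepal.Length"]])
last_col <- df %>% pull(-1)
identical(last_col, df$Species)

# select() with a negative index drops that column, like df[-1].
dropped <- df %>% select(-1)
names(dropped)
```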

dplyr rowwise mutate without hardcoding names

I want to do something like this
df <- iris %>%
rowwise %>%
mutate(new_var = sum(Sepal.Length, Sepal.Width))
Except I want to do it without typing the variable names, e.g.
names_to_add <- c("Sepal.Length", "Sepal.Width")
df <- iris %>%
rowwise %>%
[some function that uses names_to_add]
I attempted a few things e.g.
df <- iris %>%
rowwise %>%
mutate(new_var = sum(sapply(names_to_add, get, envir = as.environment(.))))
but still can't figure it out. I'll take an answer that plays around with lazyeval or something that's simpler. Note that the sum function here is just a placeholder and my actual function is much more complex, although it returns one value per row. I'd also rather not use data.table

You should check out all the dplyr functions that end with _, for example mutate_, summarise_, etc.
names_to_add <- ("sum(Sepal.Length, Sepal.Width)")
df <- iris %>%
rowwise %>% mutate_(names_to_add)
Edit
The results of the code:
df <- iris %>%
rowwise %>% mutate(new_var = sum(Sepal.Length, Sepal.Width))
names_to_add <- ("sum(Sepal.Length, Sepal.Width)")
df2 <- iris %>%
rowwise %>% mutate_(new_var = names_to_add)
identical(df, df2)
[1] TRUE

Edit
I edited the answer and it solves the problem; I wonder why it was downvoted. We use SE (standard evaluation), passing a string as input to mutate_. More info: vignette("nse","dplyr")
x <- "Sepal.Length + Sepal.Width"
df <- mutate_(iris, x)
head(df)
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length + Sepal.Width
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
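The underscore verbs such as mutate_() are deprecated in current dplyr. A minimal sketch of the same row-wise computation with c_across(), assuming dplyr 1.0.0 or later; sum() is still only a placeholder for the real function, which would receive the selected values of each row as one vector:

```r
library(dplyr)

names_to_add <- c("Sepal.Length", "Sepal.Width")

df <- iris %>%
  rowwise() %>%
  # c_across() collects the named columns of the current row into a vector.
  mutate(new_var = sum(c_across(all_of(names_to_add)))) %>%
  ungroup()

head(df$new_var)
```

Changing names_to_add is then enough to change which columns feed the function; no column name is typed inside mutate().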
