Extracting columns from Data Frame based on a "formula"

I have some data which looks like:
library(dplyr)
library(stringr)
data(iris)
# store the renamed data as d, which is used below
d <- iris %>%
  select(Species, everything()) %>%
  rename(Y = 1) %>%
  rename_at(vars(-c(1)), ~str_c("X", seq_along(.)))
Data:
Y X1 X2 X3 X4
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3.0 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5.0 3.6 1.4 0.2
6 setosa 5.4 3.9 1.7 0.4
I add a random variable:
d$noise <- rnorm(nrow(d)) # one random value per row
I am trying to extract just the Y, X1, X2... XN variables (dynamically). What I currently have is:
d %>%
select("Y", cat(paste0("X", seq_along(2:ncol(.)), collapse = ", ")))
This doesn't work: it would count the noise column, and it fails even without the noise column.
So I am trying to create a new data frame that contains just the Y, X1, X2, ..., XN columns.

dplyr provides two select helper functions that you could use: contains for literal strings or matches for regular expressions.
In this case you could do
d %>%
select("Y", contains("X"))
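or
d %>%
select("Y", matches("^X\\d+$"))
The first one works in the example you provided but would also pick up any other variable whose name contains an "X" (contains ignores case by default, so a lowercase "x" counts too). The second is more robust: because the pattern is anchored with ^ and $, it only captures variables whose names are "X" followed by one or more digits. Without the anchors, matches("X\\d+") would also match names that merely contain such a substring.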
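To make the difference concrete, here is a minimal sketch on a made-up data frame. The columns max_val and X1b are hypothetical and exist only to provoke false matches; ignore.case = FALSE is passed explicitly because the helpers are case-insensitive by default.

```r
library(dplyr)

# Hypothetical data: two real predictors plus columns that merely contain an "x"
d <- data.frame(Y = c("a", "b"), X1 = 1:2, X2 = 3:4,
                max_val = 5:6, X1b = 7:8, noise = 9:10)

# contains() keeps every name that contains "X", ignoring case by default
loose <- names(select(d, Y, contains("X")))
loose

# An anchored, case-sensitive regex keeps only names of the form X<digits>
strict <- names(select(d, Y, matches("^X\\d+$", ignore.case = FALSE)))
strict
```

Here loose also includes max_val and X1b, while strict contains only Y, X1 and X2.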

We can also use starts_with:
d %>%
select(Y, starts_with('X'))

Related

Operating on list of strings representing column names?

I'm currently trying to automate a data task that requires taking in a list of column names in string format, then summing those columns (rowwise). For example, suppose there is a character vector as follows:
> list
[1] "colname1" "colname2" "colname3"
How would I go about passing in this list to some function like sum() in tidyverse? That is, I would like to run something like the following:
df <- df %>%
rowwise %>%
mutate(new_var = sum(list))
Any suggestions would be greatly, greatly appreciated. Thanks.

You could use rowSums here. For example:
library(dplyr)
mycols <- colnames(iris)[3:4]
mycols
[1] "Petal.Length" "Petal.Width"
Then:
iris %>%
mutate(new_var = rowSums(.[, mycols])) %>%
head()
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.6
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.7
5 5.0 3.6 1.4 0.2 setosa 1.6
6 5.4 3.9 1.7 0.4 setosa 2.1

You can pass the vector of column names to c_across. Wrap it in all_of() so that it is treated as an external character vector of column names.
library(dplyr)
df <- df %>% rowwise() %>% mutate(new_var = sum(c_across(all_of(list))))
df
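As a runnable sketch of both suggestions on iris, the code below compares the row-wise c_across() approach with a vectorised rowSums() call. The vector name cols is an assumption made for this example (it avoids masking the base function list).

```r
library(dplyr)

# Hypothetical vector of column names to sum
cols <- c("Petal.Length", "Petal.Width")

# Row-wise approach: all_of() marks `cols` as an external character vector
res1 <- iris %>%
  rowwise() %>%
  mutate(new_var = sum(c_across(all_of(cols)))) %>%
  ungroup()

# Vectorised alternative: rowSums() over the selected columns
res2 <- iris %>%
  mutate(new_var = rowSums(across(all_of(cols))))

# Both approaches should agree (attributes such as names are ignored)
all.equal(res1$new_var, res2$new_var, check.attributes = FALSE)
```

On larger data the rowSums() version is usually preferable, since rowwise() evaluates the expression once per row.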

renaming columns to Y X1 X2 X3 X4 .. XN

I have some data such as:
data(iris)
I want to rename the columns such that Species is the Y variable and all other variables are the predictors.
What I currently have doesn't give me the result I am looking for.
iris %>%
select(Species, everything()) %>% # move the Y variable to the "front"
rename(Y = 1) %>%
rename_at(vars(2:ncol(.)), ~ paste("X", seq(2:ncol(.)), sep = ""))
Expected output would be colnames:
Y, X1, X2, X3, X4, X5... XN

What went wrong
The mistake in your code is that it assumes the second . (the one in the anonymous function) is a tibble, when in fact it is a character vector of column names. Hence ncol(.) is inappropriate and should be length(.). There is also no need for seq(), and given the output you requested, the numbering should start from 1. You would have been fine with:
iris %>%
select(Species, everything()) %>%
rename(Y = 1) %>%
rename_at(vars(2:ncol(.)), ~ paste("X", 1:length(.), sep = ""))
The other answers provide alternative ways of expressing this operation. A possibly cleaner version would be
iris %>%
select(Species, everything()) %>%
rename(Y = 1) %>%
rename_with(~ str_c("X", seq_along(.)), -1)
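The point about the argument type can be checked directly. The sketch below spells the renaming function out in full (the argument name nms is chosen for illustration) and asserts that it receives a character vector rather than a data frame.

```r
library(dplyr)

renamed <- iris %>%
  select(Species, everything()) %>%
  rename(Y = 1) %>%
  rename_with(function(nms) {
    # `nms` holds the names of the selected columns, not the data itself
    stopifnot(is.character(nms))
    paste0("X", seq_along(nms))
  }, .cols = -1)

names(renamed)
```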

I'm rearranging your steps to avoid having to do any subsetting in creating the names. Instead, give the first column the name X0 knowing you're going to change it to Y.
library(dplyr)
iris %>%
select(Species, everything()) %>%
setNames(paste0("X", seq_along(.) - 1)) %>%
rename(Y = 1) %>%
head()
#> Y X1 X2 X3 X4
#> 1 setosa 5.1 3.5 1.4 0.2
#> 2 setosa 4.9 3.0 1.4 0.2
#> 3 setosa 4.7 3.2 1.3 0.2
#> 4 setosa 4.6 3.1 1.5 0.2
#> 5 setosa 5.0 3.6 1.4 0.2
#> 6 setosa 5.4 3.9 1.7 0.4

You can set the colnames directly instead of using the sometimes finicky rename functions:
iris %>%
select(Species, everything()) %>% # move the Y variable to the "front"
`colnames<-`(c('Y', paste0("X", seq_len(ncol(.) - 1)))) %>%
head
Y X1 X2 X3 X4
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3.0 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5.0 3.6 1.4 0.2
6 setosa 5.4 3.9 1.7 0.4
This question explains why `colnames<-` works as a function in the pipe:
use %>% with replacement functions like colnames()<-

Solution with base r functions:
colnames(iris) <- c("X1", "X2", "X3", "X4", "Y") # rename columns
iris[,c(5,1,2,3,4)] # reorder
# Y X1 X2 X3 X4
# 1 setosa 5.1 3.5 1.4 0.2
# 2 setosa 4.9 3.0 1.4 0.2
# 3 setosa 4.7 3.2 1.3 0.2
# 4 setosa 4.6 3.1 1.5 0.2
# 5 setosa 5.0 3.6 1.4 0.2

In a for loop, how do I insert the variable i inside the "starts_with" quotation?

I have this big dataframe, with species in rows and samples in columns. There are 30 samples, with 12 replicates each. The column names are written as such : sample.S1.01; sample.S1.02.....sample.S30.11; sample.S30.12.
I would like to create 30 new tables containing the 12 replicates for each samples.
I have this command line that works perfectly for one sample at a time :
dt<- tab_sp_sum %>%
select(starts_with("sample.S1."))
assign(paste("tab_sp_1"), dt)
But when I put this in a for loop, it doesn't work anymore.
I think it's due to the fact that the variable i is included in the starts_with quotation, and I don't know how to write it.
for (i in 1:30){
dt<- tab_sp_sum %>%
select(starts_with("sample.S",i,".", sep=""))
assign(paste("tab_sp",i,sep="_"), dt)
}
The last line works well: 30 tables are created with the right names, but they are empty.
Any suggestions?
Thank you

Instead of using assign and storing the results in separate objects, try using a list. Create the prefixes that you want to select with paste0 and then use map to create a list of data frames.
library(dplyr)
library(purrr)
df_names <- paste0("sample.S", 1:30, ".")
df1 <- map(df_names, ~tab_sp_sum %>% select(starts_with(.x)))
You can then use df1[[1]], df1[[2]] to access individual dataframes.
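In base R, we can use lapply with startsWith to select the columns whose names start with each element of df_names. (A regex built with paste0("^", x) would treat the dots as wildcards, so "sample.S1." would also match the sample.S10 to sample.S19 columns.)
df1 <- lapply(df_names, function(x)
  tab_sp_sum[startsWith(names(tab_sp_sum), x)])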
Using it with built-in iris dataset
df_names <- c("Sepal", "Petal")
df1 <- map(df_names, ~iris %>% select(starts_with(.x)))
head(df1[[1]])
# Sepal.Length Sepal.Width
#1 5.1 3.5
#2 4.9 3.0
#3 4.7 3.2
#4 4.6 3.1
#5 5.0 3.6
#6 5.4 3.9
head(df1[[2]])
# Petal.Length Petal.Width
#1 1.4 0.2
#2 1.4 0.2
#3 1.3 0.2
#4 1.5 0.2
#5 1.4 0.2
#6 1.7 0.4
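If you still want to refer to each table by name, as the assign() call in the question did, one option is to name the list elements. The sketch below uses a small made-up stand-in for tab_sp_sum (three samples with two replicates each, including an S10 sample to show that the S1 prefix does not pick it up); the values and the tab_sp_ naming scheme are assumptions for illustration.

```r
library(dplyr)
library(purrr)

# Toy stand-in for tab_sp_sum (hypothetical values)
tab_sp_sum <- data.frame(
  sample.S1.01 = 1:3, sample.S1.02 = 4:6,
  sample.S2.01 = 7:9, sample.S2.02 = 10:12,
  sample.S10.01 = 13:15, sample.S10.02 = 16:18
)

ids <- c(1, 2, 10)
prefixes <- paste0("sample.S", ids, ".")

# Name the prefixes first so that map() returns a named list
tabs <- prefixes %>%
  set_names(paste0("tab_sp_", ids)) %>%
  map(~ select(tab_sp_sum, starts_with(.x)))

names(tabs)
names(tabs$tab_sp_1)
```

Individual tables are then available as tabs$tab_sp_1, tabs[["tab_sp_10"]], and so on, without creating 30 separate objects in the workspace.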

We can use split in base R
nm1 <- paste(c("Sepal", "Petal"), collapse="|")
nm2 <- grep(nm1, names(iris), value = TRUE)
out <- split.default(iris[nm2], sub("\\..*", "", nm2))
head(out[[1]])
# Petal.Length Petal.Width
#1 1.4 0.2
#2 1.4 0.2
#3 1.3 0.2
#4 1.5 0.2
#5 1.4 0.2
#6 1.7 0.4
head(out[[2]])
# Sepal.Length Sepal.Width
#1 5.1 3.5
#2 4.9 3.0
#3 4.7 3.2
#4 4.6 3.1
#5 5.0 3.6
#6 5.4 3.9
Or in tidyverse (dplyr and stringr), wrapping nm2 in all_of() since it is an external character vector:
iris %>%
  select(all_of(nm2)) %>%
  split.default(str_remove(nm2, "\\..*"))

dplyr mutating multiple columns by prefix and suffix

I have a problem that I can replicate using the iris dataset, where there are many groups of variables (sharing the same prefix in their names) with two different suffixes. I want to take a ratio within each of these groups but can't find a tidyverse solution. I would have thought mutate_at() might have been able to help.
In the iris dataset you could consider for Petal columns I want to generate a Petal proportion of Length / Width. Similarly I want to do this for Sepal. I don't want to manually do this in a mutate() because I have lots of variable groups, and this could change over time.
I do have a solution that works using base R (in the code below) but I wanted to know if there was a tidyverse solution that achieved the same.
# libs ----
library(tidyverse)
# data ----
df <- iris
glimpse(df)
# set up column vectors ----
length_cols <- names(df) %>% str_subset("Length") %>% sort()
width_cols <- names(df) %>% str_subset("Width") %>% sort()
new_col_names <- names(df) %>% str_subset("Length") %>% str_replace(".Length", ".Ratio") %>% sort()
length_cols
width_cols
new_col_names
# make new cols ----
df[, new_col_names] <- df[, length_cols] / df[, width_cols]
df %>% head()
Thanks,
Gareth

Here is one possibility using purrr::map:
library(tidyverse)
df <- map(c("Petal", "Sepal"), ~ iris %>%
    mutate(
        !!paste0(.x, ".Ratio") := !!as.name(paste0(.x, ".Length")) / !!as.name(paste0(.x, ".Width")))) %>%
    reduce(left_join)
head(df)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Ratio
#1 5.1 3.5 1.4 0.2 setosa 7.00
#2 4.9 3.0 1.4 0.2 setosa 7.00
#3 4.7 3.2 1.3 0.2 setosa 6.50
#4 4.6 3.1 1.5 0.2 setosa 7.50
#5 5.0 3.6 1.4 0.2 setosa 7.00
#6 5.4 3.9 1.7 0.4 setosa 4.25
# Sepal.Ratio
#1 1.457143
#2 1.633333
#3 1.468750
#4 1.483871
#5 1.388889
#6 1.384615
Explanation: We map the prefixes "Petal" and "Sepal" to iris by extracting for each prefix the columns with suffixes "Length" and "Width", and calculate a new corresponding prefix + ".Ratio" column; reduce merges both data.frames.
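One caveat with the join: left_join matches on all shared columns, so if the data contain identical rows (iris has a duplicated row), the merged result can end up with more rows than the input. A possible alternative, sketched below, folds over the prefixes and adds each ratio column to the same data frame, so no join is needed. The names prefixes, d and p are chosen for illustration.

```r
library(dplyr)
library(purrr)

prefixes <- c("Petal", "Sepal")

# Start from iris and add one <prefix>.Ratio column per prefix
df <- reduce(prefixes, function(d, p) {
  mutate(d, !!paste0(p, ".Ratio") :=
           .data[[paste0(p, ".Length")]] / .data[[paste0(p, ".Width")]])
}, .init = iris)

dim(df)
head(df$Petal.Ratio)
```

Adding a new variable group then only requires extending prefixes.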

Using variables for column functions in mutate()

How can I use variables in place of column names in dplyr strings? As an example say I want to add a column to the iris dataset called sum that is the sum of Sepal.Length and Sepal.Width. In short I want a working version of the below code.
x = "Sepal.Length"
y = "Sepal.Width"
head(iris%>% mutate(sum = x+y))
Currently, running the code outputs "Evaluation error: non-numeric argument to binary operator" as R evaluates x and y as character vectors. How do I instead get R to evaluate x and y as column names of the dataframe? I know that the answer is to use some form of lazy evaluation, but I'm having trouble figuring out exactly how to configure it.
Note that the proposed duplicate: dplyr - mutate: use dynamic variable names does not address this issue. The duplicate answers this question:
Not my question: How do I do:
var = "sum"
head(iris %>% mutate(var = Sepal.Length + Sepal.Width))

I think the recommended way is to use sym():
iris %>% mutate(sum = !!sym(x) + !!sym(y)) %>% head
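Another option that may read more plainly is the .data pronoun, which looks a column up by the string held in a variable. This is a minimal sketch using the x and y from the question; the result name res is an assumption.

```r
library(dplyr)

x <- "Sepal.Length"
y <- "Sepal.Width"

# .data[[x]] refers to the column whose name is stored in x
res <- iris %>% mutate(sum = .data[[x]] + .data[[y]])
head(res$sum)
```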

It also works with get():
library(dplyr)
data("iris")
x <- "Sepal.Length"
y <- "Sepal.Width"
# get() looks up the column named by the string in x (or y)
head(iris %>% mutate(sum = get(x) + get(y)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
