Can we create/mutate several new columns at once in tidyverse [duplicate] - r

To be clear, I am not looking for mutate_at or mutate(across(..., ...)) syntax here. I just want to know how to create several new columns at once inside a tidyverse pipe.
Take the iris dataset as an example.
I want to create, say, 10 (or 100 or more) new columns following a rule like this:
the first new column (variable), say V1, is just Petal.Length * 1,
the second new column, say V2, is Petal.Length * 2,
and so on up to V10, which is Petal.Length * 10,
without explicitly writing the name and formula for each column, which would be cumbersome if I wanted to create, say, 100 new columns.

You can use purrr's map functions:
library(dplyr)
library(purrr)
df <- iris %>% head
value <- 1:5
bind_cols(df,
map_dfc(value, ~df %>% transmute(!!paste0('col', .x) := Petal.Length * .x)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species col1 col2 col3 col4 col5
#1 5.1 3.5 1.4 0.2 setosa 1.4 2.8 4.2 5.6 7.0
#2 4.9 3.0 1.4 0.2 setosa 1.4 2.8 4.2 5.6 7.0
#3 4.7 3.2 1.3 0.2 setosa 1.3 2.6 3.9 5.2 6.5
#4 4.6 3.1 1.5 0.2 setosa 1.5 3.0 4.5 6.0 7.5
#5 5.0 3.6 1.4 0.2 setosa 1.4 2.8 4.2 5.6 7.0
#6 5.4 3.9 1.7 0.4 setosa 1.7 3.4 5.1 6.8 8.5
In base R, this can be done with lapply:
df[paste0('col', value)] <- lapply(value, `*`, df$Petal.Length)
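Another way to stay within a single mutate() call, without across, is to build a named list of expressions and splice it in with !!!. This is only a sketch: the column names col1 to col5 follow the answer above, and new_cols and result are names made up here for illustration.

```r
library(dplyr)
library(rlang)

df <- head(iris)
value <- 1:5

# Build one unevaluated expression per multiplier, e.g. Petal.Length * 3L,
# and name each one after the column it should create.
new_cols <- lapply(value, function(k) expr(Petal.Length * !!k))
names(new_cols) <- paste0("col", value)

# !!! splices the named list into mutate() as if each entry were typed by hand.
result <- df %>% mutate(!!!new_cols)
names(result)
```

Scaling this to 100 columns should only require changing value to 1:100.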


How to tidily create multiple columns from sets of columns?

I'm looking to use a non-across function from mutate to create multiple columns. My problem is that the variable in the function will change along with the crossed variables. Here's an example:
needs=c('Sepal.Length','Petal.Length')
iris %>% mutate_at(needs, ~./'{col}.Width')
This obviously doesn't work, but I'm looking to divide Sepal.Length by Sepal.Width and Petal.Length by Petal.Width.
I think needs should hold something common to both columns in each pair, such as the shared prefix.
You can select the columns matching each pattern in needs and divide them by position (first matched column over the second). !! and := are used to set the names of the new columns.
library(dplyr)
library(rlang)
needs = c('Sepal','Petal')
purrr::map_dfc(needs, ~iris %>%
select(matches(.x)) %>%
transmute(!!paste0(.x, '_divide') := .[[1]]/.[[2]]))
# Sepal_divide Petal_divide
#1 1.457142857 7.000000000
#2 1.633333333 7.000000000
#3 1.468750000 6.500000000
#4 1.483870968 7.500000000
#...
#...
If you want to add these as new columns, you can bind_cols the result above with iris.
Here is a base R approach that relies on the columns you want to divide sharing a name pattern:
res <- sapply(split.default(iris[-ncol(iris)], sub('\\..*', '', names(iris[-ncol(iris)]))), function(i) i[1] / i[2])
iris[names(res)] <- res
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Petal.Length Sepal.Sepal.Length
#1 5.1 3.5 1.4 0.2 setosa 7.00 1.457143
#2 4.9 3.0 1.4 0.2 setosa 7.00 1.633333
#3 4.7 3.2 1.3 0.2 setosa 6.50 1.468750
#4 4.6 3.1 1.5 0.2 setosa 7.50 1.483871
#5 5.0 3.6 1.4 0.2 setosa 7.00 1.388889
#6 5.4 3.9 1.7 0.4 setosa 4.25 1.384615
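Both answers above depend on column order or name patterns to decide which column is the numerator. If that feels fragile, a plain loop that spells out the Length and Width names may be easier to check. A minimal sketch, assuming every prefix in needs has both a .Length and a .Width column; out and the _divide suffix are just illustrative names.

```r
needs <- c("Sepal", "Petal")
out <- datasets::iris  # work on a copy so the built-in data stays untouched

for (p in needs) {
  # Look columns up by full name, so their order in the data frame does not matter.
  out[[paste0(p, "_divide")]] <- out[[paste0(p, ".Length")]] / out[[paste0(p, ".Width")]]
}

head(out[c("Sepal_divide", "Petal_divide")])
```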

Creating variables from list objects in R

I'm trying to create a binary set of variables that uses data across multiple columns.
I have a dataset where I'm trying to create a binary variable where any column with a specific name will be indexed for a certain value. I'll use iris as an example dataset.
Let's say I want to create a variable where any column with the string "Sepal" and any row in those columns with the values of 5.1, 3.0, and 4.7 will become "Class A" while values with 3.1, 5.0, and 5.4 will be "Class B". So let's look at the first few entries of iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The first 3 rows should then be under "Class A", while rows 4-6 will be under "Class B". I tried writing this code to do that:
mutate(iris, Class = if_else(
vars(contains("Sepal")), any_vars(. %in% c(5.1,3.0, 4.7))), "Class A",
ifelse(vars(contains("Sepal")), any_vars(. %in% c(3.1, 5.0, 5.4))), "Class B",NA)
and received the error
Error: `condition` must be a logical vector, not a `quosures/list` object
So I've realized I need lapply here, but I'm not even sure where to begin to write this because I'm not sure how to represent the entire part of selecting columns with "Sepal" in the name and also include the specific values in those rows as one list object to provide to lapply
This is clearly the wrong syntax
lapply(vars(contains("Sepal")), any_vars(. %in% c(5.1,3.0, 4.7)))
Examples using case_when will also be accepted as answers.
If you want to do this using dplyr, you can use rowwise with the new c_across:
library(dplyr)
iris %>%
rowwise() %>%
mutate(Class = case_when(
any(c_across(contains("Sepal")) %in% c(5.1,3.0, 4.7)) ~ 'Class A',
any(c_across(contains("Sepal")) %in% c(3.1,5.0,5.4)) ~ 'Class B')) %>%
head
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Class
# <dbl> <dbl> <dbl> <dbl> <fct> <chr>
#1 5.1 3.5 1.4 0.2 setosa Class A
#2 4.9 3 1.4 0.2 setosa Class A
#3 4.7 3.2 1.3 0.2 setosa Class A
#4 4.6 3.1 1.5 0.2 setosa Class B
#5 5 3.6 1.4 0.2 setosa Class B
#6 5.4 3.9 1.7 0.4 setosa Class B
However, note that using %in% on floating-point values is not reliable: it tests for exact equality, so two numbers that print identically may still fail to match. If interested, you may read Why are these numbers not equal?
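To make the caveat about %in% on numeric values concrete, here is a small sketch. near_in is a hypothetical helper written for this illustration (it is not part of dplyr or base R), and the tolerance is an arbitrary choice.

```r
x <- 0.1 + 0.2

x == 0.3      # FALSE: the two doubles differ in their last bits
x %in% 0.3    # FALSE for the same reason

# Hypothetical helper: TRUE where x is within tol of any target value.
near_in <- function(x, targets, tol = 1e-8) {
  vapply(x, function(v) any(abs(v - targets) < tol), logical(1))
}

near_in(x, c(0.3, 0.7))   # TRUE
```

Swapping `... %in% c(...)` for `near_in(..., c(...))` inside the case_when should classify these rows the same way while tolerating rounding noise.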

How do I identify duplicates except for one column, and replace that column with max [duplicate]

I am trying to find data where three out of four columns are duplicated, and then to remove the duplicates but keep the row with the largest number among the otherwise identical rows.
I found a very helpful question on Stack Overflow which I think gets me about halfway there.
I will base my question on the example in that question. (The example has more columns than what I am working on, but I don't think that really matters.)
require(tidyverse)
x = iris%>%select(-Petal.Width)
dups = x[x%>%duplicated(),]
answer = iris%>%semi_join(dups)
> answer
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.1 1.5 0.1 setosa
3 4.8 3.0 1.4 0.1 setosa
4 5.1 3.5 1.4 0.3 setosa
5 4.9 3.1 1.5 0.2 setosa
6 4.8 3.0 1.4 0.3 setosa
7 5.8 2.7 5.1 1.9 virginica
8 6.7 3.3 5.7 2.1 virginica
9 6.4 2.8 5.6 2.1 virginica
10 6.4 2.8 5.6 2.2 virginica
11 5.8 2.7 5.1 1.9 virginica
12 6.7 3.3 5.7 2.5 virginica
That article introduced me to code that will identify all rows where everything is equal except petal width:
iris[duplicated(iris[,-4]) | duplicated(iris[,-4], fromLast = TRUE),]
This is great but I don't know how to progress from here. I would like to have rows 2 and 5 to collapse into a single row that is equal to row 5. Similarly 9 & 10, should become just 10, and 8 & 12 become just 12.
The data set I have has more than 2 rows in some sets of duplicates, so I haven't had any luck using arrange functions to order them and delete the smallest row.
This should do what you want
iris %>%
group_by(Sepal.Length,
Sepal.Width,
Petal.Length,
Species) %>%
filter(Petal.Width == max(Petal.Width)) %>%
filter(row_number() == 1) %>%
ungroup()
The second filtering is to get rid of duplicates if the Petal.Width is also identical for two entries. Does this work for you?
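With dplyr 1.0.0 or later, the two filter steps can likely be collapsed into slice_max(). A sketch, assuming the same grouping columns as above; with_ties = FALSE keeps a single row when the maximum is tied, and deduped is just an illustrative name.

```r
library(dplyr)

deduped <- iris %>%
  group_by(Sepal.Length, Sepal.Width, Petal.Length, Species) %>%
  # Keep the one row per group with the largest Petal.Width.
  slice_max(Petal.Width, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(deduped)
```

Because each group contributes exactly one row, this should also handle groups with more than two duplicates.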

Summing across selected columns (using select() methods) in dplyr [duplicate]

Summing across columns by listing their names is fairly simple:
iris %>% rowwise() %>% mutate(sum = sum(Sepal.Length, Sepal.Width, Petal.Length))
However, say there are a lot more columns, and you are interested in summing all columns containing "Sepal" without manually listing them out. Specifically, I'm looking for a method that works the same way select() in dplyr lets you subset columns with contains(), starts_with(), etc.
There are ways to use mutate_all() + sum() + join() in order to fulfill the same result as this query, but I am more interested in seeing something as close to the solution as the code below:
iris %>% rowwise() %>% mutate(sum = sum(contains(colnames(.), "Sepal")))
If I understand correctly, basically you're trying to do:
library(dplyr)
iris %>% mutate(sum = rowSums(select(., contains("Sepal"))))
First few rows:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
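On dplyr 1.0.0 or later, the same idea can be written without the . placeholder by handing across() to rowSums(). This is a sketch of an alternative, not the answer's own code; summed is an illustrative name.

```r
library(dplyr)

summed <- iris %>%
  # across() returns a data frame of the selected columns;
  # rowSums() then adds them up row by row.
  mutate(sum = rowSums(across(contains("Sepal"))))

head(summed)
```

Any tidyselect helper, such as starts_with() or matches(), can be used in place of contains().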

Defining functions (in rollapply) using lines of a dataframe

First of all, I have a dataframe (let's call it "years") with 5 rows and 10 columns. I need to build a new one by computing (x1-x2)/x1, where x1 is the first element and x2 the second element of a column in "years", then (x2-x3)/x2, and so forth. I thought rollapply would be the best tool for the task, but I can't figure out how to define such a function to pass to rollapply.
I'm new to R, so I hope my question is not too basic. Anyway, I couldn't find a similar question here so I'd be really thankful if someone could help me.
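You can use transform, diff and length; there is no need for rollapply. Note that diff gives x2 - x1, so the result below is (x2-x1)/x1; negate it to get (x1-x2)/x1 as in the question.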
> df <- head(iris,5) # some data
> transform(df, New = c(NA, diff(Sepal.Length)/Sepal.Length[-length(Sepal.Length)] ))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species New
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3.0 1.4 0.2 setosa -0.03921569
3 4.7 3.2 1.3 0.2 setosa -0.04081633
4 4.6 3.1 1.5 0.2 setosa -0.02127660
5 5.0 3.6 1.4 0.2 setosa 0.08695652
diff.zoo in the zoo package with the arithmetic = FALSE argument divides each number by the prior one in each column, so 1 minus that ratio gives (x1-x2)/x1 (DF here is assumed to be your all-numeric data frame):
library(zoo)
as.data.frame(1 - diff(zoo(DF), arithmetic = FALSE))
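For the original question, where the same calculation is needed for every column, a base R sketch could apply a small helper column by column. The years data below is invented for illustration (the real one has 5 rows and 10 columns), and rel_change is a hypothetical helper name.

```r
# Made-up stand-in for the "years" data frame; all columns must be numeric.
years <- data.frame(a = c(10, 8, 6, 3, 3), b = c(2, 4, 4, 5, 10))

# (x1 - x2) / x1, (x2 - x3) / x2, ...: negate diff() and divide by the earlier value.
rel_change <- function(x) -diff(x) / head(x, -1)

result <- as.data.frame(lapply(years, rel_change))
result
```

The result has one row fewer than the input, because each value is compared with the one after it.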
