I essentially want to recode and rename a range of variables in a dataframe. I am looking for a way to do this in a single step.
Example in pseudo-code:
require(dplyr)
df <- iris %>% head()
df %>% mutate(
paste0("x", 1:3) = across( # In the example I want to rename
Sepal.Length:Petal.Length, # the variables I've selected
~ .x + 1 # and recoded to "x1" ... "x3"
)
)
df
Desired output:
x1 x2 x3 Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Maybe rename_with() is what you want. After that you can manipulate these renamed columns with mutate(across(...)).
library(dplyr)
df %>%
rename_with(~ paste0("x", seq_along(.x)), Sepal.Length:Petal.Length) %>%
mutate(across(x1:x3, ~ .x * 10))
x1 x2 x3 Petal.Width Species
1 51 35 14 0.2 setosa
2 49 30 14 0.2 setosa
3 47 32 13 0.2 setosa
4 46 31 15 0.2 setosa
5 50 36 14 0.2 setosa
6 54 39 17 0.4 setosa
If you want to manipulate and rename a range of columns in one step, try the argument .names in across().
df %>%
mutate(across(Sepal.Length:Petal.Length, ~ .x * 10,
.names = "x{seq_along(.col)}"),
.keep = "unused", .after = 1)
x1 x2 x3 Petal.Width Species
1 51 35 14 0.2 setosa
2 49 30 14 0.2 setosa
3 47 32 13 0.2 setosa
4 46 31 15 0.2 setosa
5 50 36 14 0.2 setosa
6 54 39 17 0.4 setosa
Hint: You can use seq_along() to create a sequence 1, 2, ... along with the selected columns, or match() to get the positions of the selected columns in the data, i.e. .names = "x{match(.col, names(df))}".
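As a minimal sketch of the match() variant from the hint, assuming df is head(iris) as in the question. The two non-adjacent columns are picked only for illustration; the new names then carry each column's position in df:

```r
library(dplyr)

df <- head(iris)

# Recode two non-adjacent columns and name each result after the
# position of the source column in df.
out <- df %>%
  mutate(across(c(Sepal.Length, Petal.Length), ~ .x * 10,
                .names = "x{match(.col, names(df))}"),
         .keep = "unused")

names(out)
```

Note that names(df) is looked up outside the pipeline, so df has to be the same data frame that is piped into mutate().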
The code below lets you pass the column numbers to a for loop. I'm not sure if this is what you're going for.
require(dplyr)
df <- iris %>% head()
for(i in 1:3){
names(df)[i] <- paste0("x",i)
}
df
Outputs:
x1 x2 x3 Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
You could add consecutive numbers to n columns with the same prefix this way:
df <- iris %>% head()
n <- 3
colnames(df)[1:n] <- sprintf("x%s",1:n)
output:
# x1 x2 x3 Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
Or rename any nonconsecutive set of columns, starting again from the original names:
df <- iris %>% head()
n <- c(1,3,5)
colnames(df)[n] <- sprintf("x%s",n)
# x1 Sepal.Width x3 Petal.Width x5
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
Related
Consider the iris dataset. Let's say I want to create a column, count, that holds the number of "sepal" columns whose values are between 1 and 5.
Here's what I have:
iris %>% rowwise() %>%
mutate(count = sum(if_any(contains("sepal", ignore.case = TRUE),
.fns = ~ between(.x, 1, 5)))) %>%
arrange(desc(count))
But the output is not what I want.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species count
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 5.1 3.5 1.4 0.2 setosa 1 # Should be 1
2 4.9 3 1.4 0.2 setosa 1 # Should be 2
3 4.7 3.2 1.3 0.2 setosa 1 # Should be 2
4 4.6 3.1 1.5 0.2 setosa 1 # Should be 2
5 5 3.6 1.4 0.2 setosa 1 # Should be 2
6 5.4 3.9 1.7 0.4 setosa 1 # Should be 1
7 4.6 3.4 1.4 0.3 setosa 1 # Should be 2
8 5 3.4 1.5 0.2 setosa 1 # Should be 2
9 4.4 2.9 1.4 0.2 setosa 1 # Should be 2
10 4.9 3.1 1.5 0.1 setosa 1 # Should be 2
I can use case_when or if_else for the two columns but the actual dataset has a lot more columns. So I'm looking for a dplyr solution where I don't have to type out all the columns.
library(tidyverse)
iris %>%
mutate(
count = rowSums(across(contains("Sepal"), ~ between(.x, 1, 5)))
)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species count
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 2
3 4.7 3.2 1.3 0.2 setosa 2
4 4.6 3.1 1.5 0.2 setosa 2
5 5.0 3.6 1.4 0.2 setosa 2
6 5.4 3.9 1.7 0.4 setosa 1
7 4.6 3.4 1.4 0.3 setosa 2
8 5.0 3.4 1.5 0.2 setosa 2
9 4.4 2.9 1.4 0.2 setosa 2
10 4.9 3.1 1.5 0.1 setosa 2
EDIT:
With c_across. To my understanding, c_across has to be used with rowwise() to perform rowwise aggregation and calculation.
iris %>%
rowwise() %>%
mutate(count = sum(between(c_across(contains("Sepal")), 1, 5)))
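To see why the attempt in the question never gets past 1, here is a small sketch on the first three rows of iris (the names d, any_hit and n_hit are only for illustration). if_any() returns a single TRUE/FALSE per row, so summing it gives at most 1, whereas across() keeps one logical column per selected column:

```r
library(dplyr)

d <- head(iris, 3)

# if_any() collapses the selected columns to one TRUE/FALSE per row
any_hit <- d %>%
  transmute(hit = if_any(contains("Sepal"), ~ between(.x, 1, 5)))

# across() keeps one logical column per selected column,
# so rowSums() can count the TRUE values in each row
n_hit <- d %>%
  transmute(n = rowSums(across(contains("Sepal"), ~ between(.x, 1, 5))))

any_hit$hit  # TRUE TRUE TRUE
n_hit$n      # 1 2 2
```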
I have a tibble data frame in R and I want to add a new column whose name comes from the value of a variable. What is the easiest way to achieve this?
# let us generate a whole set of new features
iris_tbl = as_tibble(iris)
iris_tbl <- iris_tbl %>% mutate(hello=1)
> print(iris_tbl)
# A tibble: 150 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species hello
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5 3.6 1.4 0.2 setosa 1
6 5.4 3.9 1.7 0.4 setosa 1
7 4.6 3.4 1.4 0.3 setosa 1
8 5 3.4 1.5 0.2 setosa 1
9 4.4 2.9 1.4 0.2 setosa 1
10 4.9 3.1 1.5 0.1 setosa 1
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
# This is the example that does not work as desired
var_name='hello'
iris_tbl = as_tibble(iris)
iris_tbl <- iris_tbl %>% mutate(var_name=1)
> print(iris_tbl)
# A tibble: 150 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species var_name
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5 3.6 1.4 0.2 setosa 1
6 5.4 3.9 1.7 0.4 setosa 1
7 4.6 3.4 1.4 0.3 setosa 1
8 5 3.4 1.5 0.2 setosa 1
9 4.4 2.9 1.4 0.2 setosa 1
10 4.9 3.1 1.5 0.1 setosa 1
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
In the first example, mutate creates a column called 'hello'. In the second example, mutate names the column 'var_name' instead of 'hello', which is the name I want.
Any suggestions on how to make this as easy as possible?
Enter the command ?dplyr_data_masking
If you read through that, you can see there are at least 2 ways you can get your desired result.
iris_tbl <- iris_tbl %>% mutate("{var_name}" := 1)
Or
iris_tbl <- iris_tbl %>% mutate({{var_name}} := 1)
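A short sketch of how the dynamically named column can be used afterwards, assuming var_name holds a string as in the question. The "_doubled" suffix and the objects out and out2 are made up for illustration:

```r
library(dplyr)

var_name <- "hello"   # the column name held in a variable

out <- as_tibble(iris) %>%
  mutate(
    "{var_name}" := 1,                              # glue-style name on the left of :=
    "{var_name}_doubled" := .data[[var_name]] * 2   # read the new column back via .data
  )

# Another idiom: unquote the string with !!
out2 <- as_tibble(iris) %>% mutate(!!var_name := 1)

names(out)
```

The .data[[var_name]] form is the counterpart on the right-hand side: it looks up a column by the string stored in the variable.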
For this example, I'm going to use the iris dataset built into R.
How can I get the same output without copying and pasting the syntax below?
package
library(dplyr)
Input
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
Manual Solution
I have to subset my dataset based on the column names.
I know how to do this "manually" but it would require a lot of copying and pasting on my current dataset.
Sepal <- iris %>% select(contains("Sepal"))
Petal <- iris %>% select(contains("Petal"))
Output
head(Sepal)
# Sepal.Length Sepal.Width
# 1 5.1 3.5
# 2 4.9 3.0
# 3 4.7 3.2
# 4 4.6 3.1
# 5 5.0 3.6
# 6 5.4 3.9
head(Petal)
# Petal.Length Petal.Width
# 1 1.4 0.2
# 2 1.4 0.2
# 3 1.3 0.2
# 4 1.5 0.2
# 5 1.4 0.2
# 6 1.7 0.4
How can I automate this process? I think I can use the purrr package here, but I couldn't find a way to do it.
You can use
library(tidyverse)
map(set_names(c("Sepal", "Petal")), ~ select(iris, starts_with(.x)))
output (head)
$Sepal
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
$Petal
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
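If typing out the prefixes is itself the repetitive part, they can be derived from the column names. This is only a sketch under the assumption that the group is whatever precedes the first "." in each numeric column name; num_cols, prefixes and parts are illustrative names:

```r
library(dplyr)
library(purrr)

# Assumption: the group is the part of the name before the first "."
num_cols <- names(iris)[sapply(iris, is.numeric)]
prefixes <- unique(sub("\\..*", "", num_cols))

# One data frame per prefix, in a list named after the prefixes
parts <- map(set_names(prefixes), ~ select(iris, starts_with(.x)))

names(parts)
map(parts, colMeans)   # e.g. apply the same summary to every subset
```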
An option is also to use split.default on the substring of column names to return a named list of data.frames
library(dplyr)
library(stringr)
head(iris) %>%
select(-Species) %>%
split.default(str_remove(names(.), "\\..*"))
$Petal
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
$Sepal
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
I am new to the tidyverse. I want to join all columns but one (as the names of the other columns might vary). Here is an example with iris that obviously does not work. Thanks :)
library(tidyverse)
dat <- as_tibble(iris)
dat %>% mutate(New = str_c(!Sepal.Length, sep="_"))
We can use select to select the columns that we want to paste and apply str_c with do.call.
library(tidyverse)
dat %>% mutate(New = do.call(str_c, c(select(., !Sepal.Length), sep="_")))
However, using unite would be simpler.
dat %>% unite(New, !Sepal.Length, sep="_", remove= FALSE)
# Sepal.Length New Sepal.Width Petal.Length Petal.Width Species
# <dbl> <chr> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5_1.4_0.2_setosa 3.5 1.4 0.2 setosa
# 2 4.9 3_1.4_0.2_setosa 3 1.4 0.2 setosa
# 3 4.7 3.2_1.3_0.2_setosa 3.2 1.3 0.2 setosa
# 4 4.6 3.1_1.5_0.2_setosa 3.1 1.5 0.2 setosa
# 5 5 3.6_1.4_0.2_setosa 3.6 1.4 0.2 setosa
# 6 5.4 3.9_1.7_0.4_setosa 3.9 1.7 0.4 setosa
# 7 4.6 3.4_1.4_0.3_setosa 3.4 1.4 0.3 setosa
# 8 5 3.4_1.5_0.2_setosa 3.4 1.5 0.2 setosa
# 9 4.4 2.9_1.4_0.2_setosa 2.9 1.4 0.2 setosa
#10 4.9 3.1_1.5_0.1_setosa 3.1 1.5 0.1 setosa
# … with 140 more rows
Using base R:
dat <- iris
cols <- grepl("Sepal.Length", names(dat))
tmp <- dat[, !cols]
dat$new <- apply(tmp, 1, paste0, collapse = "_")
head(dat)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
#> 1 5.1 3.5 1.4 0.2 setosa 3.5_1.4_0.2_setosa
#> 2 4.9 3.0 1.4 0.2 setosa 3.0_1.4_0.2_setosa
#> 3 4.7 3.2 1.3 0.2 setosa 3.2_1.3_0.2_setosa
#> 4 4.6 3.1 1.5 0.2 setosa 3.1_1.5_0.2_setosa
#> 5 5.0 3.6 1.4 0.2 setosa 3.6_1.4_0.2_setosa
#> 6 5.4 3.9 1.7 0.4 setosa 3.9_1.7_0.4_setosa
Created on 2021-02-01 by the reprex package (v1.0.0)
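One thing to be aware of with apply(): it converts the data frame to a character matrix first, which pads the numbers to a common format (hence "3.0" above where the other answers give "3"). A sketch of an alternative in base R that avoids this by pasting the columns directly; keep is just an illustrative name:

```r
dat <- iris

# Exact name match rather than a regular expression
keep <- names(dat) != "Sepal.Length"

# paste() is vectorised over the columns, so no row-wise apply() is needed
dat$new <- do.call(paste, c(dat[keep], sep = "_"))

head(dat$new, 2)
# "3.5_1.4_0.2_setosa" "3_1.4_0.2_setosa"
```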
We can also use reduce:
library(dplyr)
library(purrr)
library(stringr)
dat %>%
mutate(New = select(., -Sepal.Length) %>%
reduce(str_c, sep="_"))
# A tibble: 150 x 6
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species New
# <dbl> <dbl> <dbl> <dbl> <fct> <chr>
# 1 5.1 3.5 1.4 0.2 setosa 3.5_1.4_0.2_setosa
# 2 4.9 3 1.4 0.2 setosa 3_1.4_0.2_setosa
# 3 4.7 3.2 1.3 0.2 setosa 3.2_1.3_0.2_setosa
# 4 4.6 3.1 1.5 0.2 setosa 3.1_1.5_0.2_setosa
# 5 5 3.6 1.4 0.2 setosa 3.6_1.4_0.2_setosa
# 6 5.4 3.9 1.7 0.4 setosa 3.9_1.7_0.4_setosa
# 7 4.6 3.4 1.4 0.3 setosa 3.4_1.4_0.3_setosa
# 8 5 3.4 1.5 0.2 setosa 3.4_1.5_0.2_setosa
# 9 4.4 2.9 1.4 0.2 setosa 2.9_1.4_0.2_setosa
#10 4.9 3.1 1.5 0.1 setosa 3.1_1.5_0.1_setosa
# … with 140 more rows
My current data frame in R has the following dimensions:
nrow: 605
ncol: 1514
The first column holds the class label, and my dataset has only two classes: setosa and iris.
test[1:5,]
class id1 id2...
1: setosa 2 4.....
2: setosa 2 5 .....
3: setosa 5 4 .....
4: iris 5 9......
5: iris 7 9 ....
However, the data frame is currently ordered: rows 2 to 233 correspond to class setosa, and class iris runs from row 234 to the end. I want to rearrange the dataset so that the samples are mixed up.
The expected output should be in the following form:
If I look at df[1:10,], i.e. the first 10 rows of the data frame, I should see samples of both iris and setosa. Any ideas or suggestions on how to do this?
library( tidyverse )
iris[1:10,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# 7 4.6 3.4 1.4 0.3 setosa
# 8 5.0 3.4 1.5 0.2 setosa
# 9 4.4 2.9 1.4 0.2 setosa
# 10 4.9 3.1 1.5 0.1 setosa
df <- iris %>%
group_by( Species ) %>%
mutate( id = row_number() ) %>%
arrange( id ) %>%
select ( -id )
df[1:10,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 7 3.2 4.7 1.4 versicolor
# 3 6.3 3.3 6 2.5 virginica
# 4 4.9 3 1.4 0.2 setosa
# 5 6.4 3.2 4.5 1.5 versicolor
# 6 5.8 2.7 5.1 1.9 virginica
# 7 4.7 3.2 1.3 0.2 setosa
# 8 6.9 3.1 4.9 1.5 versicolor
# 9 7.1 3 5.9 2.1 virginica
# 10 4.6 3.1 1.5 0.2 setosa
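The code above interleaves the classes in a fixed order. If a random order is acceptable instead, a minimal sketch would be the following, assuming dplyr 1.0.0 or later for slice_sample(). The seed is arbitrary, and shuffled and shuffled_base are illustrative names:

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the shuffle is reproducible

# Sample 100% of the rows without replacement, i.e. a random permutation
shuffled <- iris %>% slice_sample(prop = 1)

# Base R equivalent: index the rows by a random permutation
shuffled_base <- iris[sample(nrow(iris)), ]

head(shuffled, 10)
```

A random shuffle does not guarantee that every class shows up in the first few rows, although with classes of similar size it is very likely.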