I am new to the tidyverse. I want to paste together all columns except one (the names of the other columns might vary). Here is an example with iris that, obviously, does not work. Thanks :)
library(tidyverse)
dat <- as_tibble(iris)
dat %>% mutate(New = str_c(!Sepal.Length, sep="_"))
We can use select to select the columns that we want to paste and apply str_c with do.call.
library(tidyverse)
dat %>% mutate(New = do.call(str_c, c(select(., !Sepal.Length), sep="_")))
However, using unite would be simpler.
dat %>% unite(New, !Sepal.Length, sep="_", remove= FALSE)
# Sepal.Length New Sepal.Width Petal.Length Petal.Width Species
# <dbl> <chr> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5_1.4_0.2_setosa 3.5 1.4 0.2 setosa
# 2 4.9 3_1.4_0.2_setosa 3 1.4 0.2 setosa
# 3 4.7 3.2_1.3_0.2_setosa 3.2 1.3 0.2 setosa
# 4 4.6 3.1_1.5_0.2_setosa 3.1 1.5 0.2 setosa
# 5 5 3.6_1.4_0.2_setosa 3.6 1.4 0.2 setosa
# 6 5.4 3.9_1.7_0.4_setosa 3.9 1.7 0.4 setosa
# 7 4.6 3.4_1.4_0.3_setosa 3.4 1.4 0.3 setosa
# 8 5 3.4_1.5_0.2_setosa 3.4 1.5 0.2 setosa
# 9 4.4 2.9_1.4_0.2_setosa 2.9 1.4 0.2 setosa
#10 4.9 3.1_1.5_0.1_setosa 3.1 1.5 0.1 setosa
# … with 140 more rows
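One practical difference between the two approaches above shows up with missing values. The following is a minimal sketch on a small made-up data frame (`df`, `id`, `a` and `b` are illustrative names, not from the question): str_c() returns NA as soon as one piece is NA, whereas unite() pastes the literal text "NA" unless na.rm = TRUE is set.

```r
library(tidyr)
library(stringr)

# Hypothetical toy data with one missing value in column `a`
df <- data.frame(id = 1:2, a = c("x", NA), b = c("y", "z"),
                 stringsAsFactors = FALSE)

# str_c() propagates NA: any missing piece makes the whole result NA
via_str_c <- str_c(df$a, df$b, sep = "_")

# unite() keeps the text "NA" by default ...
united_keep <- unite(df, New, !id, sep = "_", remove = FALSE)

# ... and drops missing pieces when na.rm = TRUE
united_drop <- unite(df, New, !id, sep = "_", remove = FALSE, na.rm = TRUE)
```

Which behaviour is preferable depends on whether a partially missing row should still produce a key.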
Using base R:
dat <- iris
cols <- names(dat) == "Sepal.Length"  # exact match; grepl() would treat "." as a regex wildcard
tmp <- dat[, !cols]
dat$new <- apply(tmp, 1, paste0, collapse = "_")
head(dat)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
#> 1 5.1 3.5 1.4 0.2 setosa 3.5_1.4_0.2_setosa
#> 2 4.9 3.0 1.4 0.2 setosa 3.0_1.4_0.2_setosa
#> 3 4.7 3.2 1.3 0.2 setosa 3.2_1.3_0.2_setosa
#> 4 4.6 3.1 1.5 0.2 setosa 3.1_1.5_0.2_setosa
#> 5 5.0 3.6 1.4 0.2 setosa 3.6_1.4_0.2_setosa
#> 6 5.4 3.9 1.7 0.4 setosa 3.9_1.7_0.4_setosa
Created on 2021-02-01 by the reprex package (v1.0.0)
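A note on the base approach: apply() first converts the data frame to a character matrix, and that conversion formats the numeric columns, which is why the output above shows "3.0" where the tidyverse answers show "3". If that matters, a vectorised alternative is to hand the columns straight to paste(). This is only a sketch; `keep` is a name chosen here for illustration.

```r
dat <- iris

# Keep every column except the one to exclude (exact name match, no regex)
keep <- setdiff(names(dat), "Sepal.Length")

# paste() is vectorised over the columns, so no row-wise apply() is needed
dat$new <- do.call(paste, c(dat[keep], sep = "_"))

head(dat$new, 2)
#> [1] "3.5_1.4_0.2_setosa" "3_1.4_0.2_setosa"
```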
We can use reduce:
library(dplyr)
library(purrr)
library(stringr)
dat %>%
mutate(New = select(., -Sepal.Length) %>%
reduce(str_c, sep="_"))
# A tibble: 150 x 6
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species New
# <dbl> <dbl> <dbl> <dbl> <fct> <chr>
# 1 5.1 3.5 1.4 0.2 setosa 3.5_1.4_0.2_setosa
# 2 4.9 3 1.4 0.2 setosa 3_1.4_0.2_setosa
# 3 4.7 3.2 1.3 0.2 setosa 3.2_1.3_0.2_setosa
# 4 4.6 3.1 1.5 0.2 setosa 3.1_1.5_0.2_setosa
# 5 5 3.6 1.4 0.2 setosa 3.6_1.4_0.2_setosa
# 6 5.4 3.9 1.7 0.4 setosa 3.9_1.7_0.4_setosa
# 7 4.6 3.4 1.4 0.3 setosa 3.4_1.4_0.3_setosa
# 8 5 3.4 1.5 0.2 setosa 3.4_1.5_0.2_setosa
# 9 4.4 2.9 1.4 0.2 setosa 2.9_1.4_0.2_setosa
#10 4.9 3.1 1.5 0.1 setosa 3.1_1.5_0.1_setosa
# … with 140 more rows
Related
Consider the iris dataset. Let's say I want to create a column, count, that gives for each row the number of "sepal" columns whose value is between 1 and 5.
Here's what I have:
iris %>% rowwise() %>%
mutate(count = sum(if_any(contains("sepal", ignore.case = TRUE),
.fns = ~ between(.x, 1, 5)))) %>%
arrange(desc(count))
But the output is not what I want.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species count
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 5.1 3.5 1.4 0.2 setosa 1 # Should be 1
2 4.9 3 1.4 0.2 setosa 1 # Should be 2
3 4.7 3.2 1.3 0.2 setosa 1 # Should be 2
4 4.6 3.1 1.5 0.2 setosa 1 # Should be 2
5 5 3.6 1.4 0.2 setosa 1 # Should be 2
6 5.4 3.9 1.7 0.4 setosa 1 # Should be 1
7 4.6 3.4 1.4 0.3 setosa 1 # Should be 2
8 5 3.4 1.5 0.2 setosa 1 # Should be 2
9 4.4 2.9 1.4 0.2 setosa 1 # Should be 2
10 4.9 3.1 1.5 0.1 setosa 1 # Should be 2
I can use case_when or if_else for the two columns but the actual dataset has a lot more columns. So I'm looking for a dplyr solution where I don't have to type out all the columns.
library(tidyverse)
iris %>%
mutate(
count = rowSums(across(contains("Sepal"), ~ between(.x, 1, 5)))
)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species count
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 2
3 4.7 3.2 1.3 0.2 setosa 2
4 4.6 3.1 1.5 0.2 setosa 2
5 5.0 3.6 1.4 0.2 setosa 2
6 5.4 3.9 1.7 0.4 setosa 1
7 4.6 3.4 1.4 0.3 setosa 2
8 5.0 3.4 1.5 0.2 setosa 2
9 4.4 2.9 1.4 0.2 setosa 2
10 4.9 3.1 1.5 0.1 setosa 2
EDIT:
With c_across. To my understanding, c_across has to be used with rowwise() to perform rowwise aggregation and calculation.
iris %>%
rowwise() %>%
mutate(count = sum(between(c_across(contains("Sepal")), 1, 5)))
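As for why the attempt in the question returned 1 on every row: if_any() collapses the selected columns into a single TRUE/FALSE per row, so sum() can never add up more than one TRUE. A small sketch on the first three rows makes the difference visible (`any_in` is an illustrative column name):

```r
library(dplyr)

res <- iris %>%
  head(3) %>%
  mutate(
    # one logical per row: is ANY Sepal column within [1, 5]?
    any_in = if_any(contains("Sepal"), ~ between(.x, 1, 5)),
    # one logical per column, then added up across each row
    count = rowSums(across(contains("Sepal"), ~ between(.x, 1, 5)))
  )

res[, c("any_in", "count")]
```

Here `any_in` is TRUE for all three rows, while `count` is 1, 2, 2, matching the desired output.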
I have a tibble data frame in R and I want to add a new column whose name comes from the value of a variable. What is the easiest way to achieve this?
# let us generate a whole set of new features
iris_tbl = as_tibble(iris)
iris_tbl <- iris_tbl %>% mutate(hello=1)
> print(iris_tbl)
# A tibble: 150 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species hello
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5 3.6 1.4 0.2 setosa 1
6 5.4 3.9 1.7 0.4 setosa 1
7 4.6 3.4 1.4 0.3 setosa 1
8 5 3.4 1.5 0.2 setosa 1
9 4.4 2.9 1.4 0.2 setosa 1
10 4.9 3.1 1.5 0.1 setosa 1
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
# This is the example that does not work as desired
var_name='hello'
iris_tbl = as_tibble(iris)
iris_tbl <- iris_tbl %>% mutate(var_name=1)
> print(iris_tbl)
# A tibble: 150 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species var_name
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5 3.6 1.4 0.2 setosa 1
6 5.4 3.9 1.7 0.4 setosa 1
7 4.6 3.4 1.4 0.3 setosa 1
8 5 3.4 1.5 0.2 setosa 1
9 4.4 2.9 1.4 0.2 setosa 1
10 4.9 3.1 1.5 0.1 setosa 1
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
In the first example the column name created from mutate is actually a column called 'hello'. In the second example mutate names the column 'var_name', instead of 'hello' which is the desired outcome.
Any suggestions on how to make this as easy as possible?
Run ?dplyr_data_masking in the console.
If you read through that help page, you will see there are at least two ways to get your desired result:
iris_tbl <- iris_tbl %>% mutate("{var_name}" := 1)
Or
iris_tbl <- iris_tbl %>% mutate({{var_name}} := 1)
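To check that the name really comes from the variable, and to show how the same idea looks inside a function, here is a short sketch. The helper `add_const()` and its arguments `name` and `value` are hypothetical names introduced only for this example; `!!` is shown as a further equivalent spelling.

```r
library(dplyr)

var_name <- "hello"

# Glue-style name on the left-hand side of :=
out1 <- as_tibble(iris) %>% mutate("{var_name}" := 1)

# Unquoting the string with !! gives the same column name
out2 <- as_tibble(iris) %>% mutate(!!var_name := 1)

# Hypothetical helper: the same idea wrapped in a function.
# .env$value makes sure `value` is taken from the function's
# arguments and not from a column that happens to share the name.
add_const <- function(df, name, value) {
  mutate(df, "{name}" := .env$value)
}
out3 <- add_const(as_tibble(iris), "world", 0)

names(out1)[6]
#> [1] "hello"
```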
I'm new to R and am trying to learn how to create my own functions.
While the following function works fine:
#---------------------
# this works fine
#---------------------
func <- function(df) {
new_df <- unite(df, key, c("Sepal.Length","Sepal.Width"), sep = " ", remove = FALSE, na.rm = FALSE)
return(new_df)
}
new_iris <- func(iris)
this version, in which the columns passed to unite are now a function parameter:
#---------------------
# this does not work
#---------------------
func <- function(df, keycols) {
new_df <- unite(df, key, keycols, sep = " ", remove = FALSE, na.rm = FALSE)
return(new_df)
}
keycols <- quote(c("Sepal.Length","Sepal.Width"))
new_iris <- func(iris, keycols)
generates the following error message:
Error: Must subset columns with a valid subscript vector.
x Subscript has the wrong type language.
i It must be numeric or character.
Is there a way to pass c("Sepal.Length","Sepal.Width") as a parameter? Or some way to make the keycols a parameter for the above user defined function?
Thanks for any guidance.
One safe way to achieve this is to use the curly-curly operator {{ }} from the rlang package:
library(tidyr)
library(rlang)
iris <- tibble::as_tibble(iris)
# using curly curly from {rlang} ------------------------------------------
func <- function(df, keycols) {
new_df <- unite(df, "key", {{keycols}}, sep = " ", remove = FALSE, na.rm = FALSE)
return(new_df)
}
func(iris, c(Sepal.Length, Sepal.Width)) # passing directly the columns
#> # A tibble: 150 × 6
#> key Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <chr> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
func(iris, c("Sepal.Length", "Sepal.Width")) # passing columns as character vector
#> # A tibble: 150 × 6
#> key Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <chr> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
Created on 2022-07-08 by the reprex package (v2.0.1)
To understand why and how this solution works, see the "Programming with dplyr" vignette.
You may adopt either of these methods
library(tidyr)
func <- function(df, ...){
unite(df, key, ..., sep=" ", remove = FALSE, na.rm = FALSE)
}
func(iris, Sepal.Length, Sepal.Width)
#> key Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 4.7 3.2 1.3 0.2 setosa
keycols <- c("Sepal.Length", "Sepal.Width")
func <- function(df, cols){
unite(df, key, !!cols, sep=" ", remove = FALSE, na.rm = FALSE)
}
func(iris, keycols)
#> key Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 4.6 3.1 1.5 0.2 setosa
Created on 2022-07-08 by the reprex package (v2.0.1)
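When the column names already live in a character vector, another option is tidyselect's all_of(), which states explicitly that the names come from an ordinary variable rather than from the data. The sketch below assumes a wrapper called `unite_cols()` (a made-up name). Note that the call in the question failed because quote() hands over an unevaluated call, which is the "language" type the error message complains about; a plain character vector is all that is needed.

```r
library(tidyr)

# Hypothetical wrapper: `cols` is an ordinary character vector of column names
unite_cols <- function(df, cols) {
  unite(df, "key", tidyselect::all_of(cols), sep = " ", remove = FALSE)
}

keycols <- c("Sepal.Length", "Sepal.Width")  # no quote() needed
new_iris <- unite_cols(iris, keycols)

head(new_iris$key, 2)
#> [1] "5.1 3.5" "4.9 3"
```

all_of() also raises an error if one of the names is missing from the data, which tends to surface typos early.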
My current dataframe in R has the following dimensions
nrows=605
ncol: 1514
The first column indicates the class/ label and my dataset has only two classes namely: setosa and iris.
test[1:5,]
class id1 id2...
1: setosa 2 4.....
2: setosa 2 5 .....
3: setosa 5 4 .....
4: iris 5 9......
5: iris 7 9 ....
However, the dataframe is currently ordered by class: rows 2 to 233 correspond to class setosa, and class iris runs from row 234 to the end. I want the dataset to be rearranged so that the samples are mixed up.
The expected output should have the following form:
If I look at df[1:10,], i.e. the first 10 rows of the dataframe, I should see samples of both iris and setosa. Any ideas or suggestions on how to do this?
library( tidyverse )
iris[1:10,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# 7 4.6 3.4 1.4 0.3 setosa
# 8 5.0 3.4 1.5 0.2 setosa
# 9 4.4 2.9 1.4 0.2 setosa
# 10 4.9 3.1 1.5 0.1 setosa
df <- iris %>%
group_by( Species ) %>%
mutate( id = row_number() ) %>%
arrange( id ) %>%
select ( -id )
df[1:10,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 7 3.2 4.7 1.4 versicolor
# 3 6.3 3.3 6 2.5 virginica
# 4 4.9 3 1.4 0.2 setosa
# 5 6.4 3.2 4.5 1.5 versicolor
# 6 5.8 2.7 5.1 1.9 virginica
# 7 4.7 3.2 1.3 0.2 setosa
# 8 6.9 3.1 4.9 1.5 versicolor
# 9 7.1 3 5.9 2.1 virginica
# 10 4.6 3.1 1.5 0.2 setosa
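The code above interleaves the classes in a fixed, repeating order. If a random mix is wanted instead, shuffling the row order is enough. A minimal sketch, again using iris as a stand-in for the real data; the seed value is arbitrary and only there for reproducibility:

```r
library(dplyr)

set.seed(123)  # arbitrary seed, only so the shuffle is reproducible

# Base R: index the rows with a random permutation of 1..n
shuffled_base <- iris[sample(nrow(iris)), ]

# dplyr (>= 1.0.0): sampling 100% of the rows without replacement shuffles them
shuffled_dplyr <- iris %>% slice_sample(prop = 1)

shuffled_base[1:10, "Species"]
```

With a random shuffle the first ten rows will usually contain several classes, but unlike the interleaving above this is not guaranteed.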
Using dplyr, you can do something like this:
iris %>% head %>% mutate(sum=Sepal.Length + Sepal.Width)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
But above, I referenced the columns by their names. How can I use the column indices 1 and 2 to achieve the same result?
Here is what I have so far, but I feel it's not very elegant.
iris %>% head %>% mutate(sum=apply(select(.,1,2),1,sum))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
You can try:
iris %>% head %>% mutate(sum = .[[1]] + .[[2]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
I'm a bit late to the game, but my personal strategy in cases like this is to write my own tidyverse-compliant function that will do exactly what I want. By tidyverse-compliant, I mean that the first argument of the function is a data frame and that the output is a vector that can be added to the data frame.
sum_cols <- function(x, col1, col2){
x[[col1]] + x[[col2]]
}
iris %>%
head %>%
mutate(sum = sum_cols(x = ., col1 = 1, col2 = 2))
An alternative to reusing . in mutate that will respect grouping is to use dplyr::cur_data_all() (deprecated since dplyr 1.1.0 in favour of pick()). From help(cur_data_all):
cur_data_all() gives the current data for the current group (including grouping variables)
Consider the following:
iris %>% group_by(Species) %>% mutate(sum = .[[1]] + .[[2]]) %>% head
#Error: Problem with `mutate()` column `sum`.
#ℹ `sum = .[[1]] + .[[2]]`.
#ℹ `sum` must be size 50 or 1, not 150.
#ℹ The error occurred in group 1: Species = setosa.
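If instead you use cur_data_all(), which returns only the rows of the current group, the sizes match. The example below is ungrouped; note that select() returns a one-column data frame rather than a vector, which is why the new column is labelled Sepal.Length instead of sum in the printed output: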
iris %>% mutate(sum = select(cur_data_all(),1) + select(cur_data_all(),2)) %>% head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length
#1 5.1 3.5 1.4 0.2 setosa 8.6
#2 4.9 3.0 1.4 0.2 setosa 7.9
#3 4.7 3.2 1.3 0.2 setosa 7.9
#4 4.6 3.1 1.5 0.2 setosa 7.7
#5 5.0 3.6 1.4 0.2 setosa 8.6
#6 5.4 3.9 1.7 0.4 setosa 9.3
The same approach works with the extract operator ([[).
iris %>% mutate(sum = cur_data()[[1]] + cur_data()[[2]]) %>% head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
#1 5.1 3.5 1.4 0.2 setosa 8.6
#2 4.9 3.0 1.4 0.2 setosa 7.9
#3 4.7 3.2 1.3 0.2 setosa 7.9
#4 4.6 3.1 1.5 0.2 setosa 7.7
#5 5.0 3.6 1.4 0.2 setosa 8.6
#6 5.4 3.9 1.7 0.4 setosa 9.3
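For completeness, here is a sketch of the grouped case that failed with `.`, written with pick(), the replacement for cur_data() and cur_data_all() in newer dplyr. It assumes dplyr 1.1.0 or later.

```r
library(dplyr)  # pick() assumes dplyr >= 1.1.0

res <- iris %>%
  group_by(Species) %>%
  # pick(1, 2) selects columns 1 and 2 (here Sepal.Length and Sepal.Width)
  # for the rows of the current group only, so the sizes always match
  mutate(sum = rowSums(pick(1, 2))) %>%
  ungroup()

head(res$sum, 3)
```

The first three sums are 8.6, 7.9 and 7.9, the same values as in the ungrouped examples above.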
What do you think about this version?
Inspired by #SavedByJesus's answer.
applySum <- function(df, ...) {
assertthat::assert_that(...length() > 0, msg = "one or more column indexes are required")
mutate(df, Sum = apply(as.data.frame(df[, c(...)]), 1, sum))
}
iris %>%
head(2) %>%
applySum(1, 2)
#
### output
#
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
#
### you can select and sum more than two columns with the same function
#
iris %>%
head(2) %>%
applySum(1, 2, 3, 4)
#
### output
#
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sum
1 5.1 3.5 1.4 0.2 setosa 10.2
2 4.9 3.0 1.4 0.2 setosa 9.5
To address the issue that #pluke is asking about in the comments: dplyr's mutate() has no direct syntax for referring to a column by its index.
Not a perfect solution, but you can use base R to get around this:
iris$sum <- iris[[1]] + iris[[2]]  # adds a new column instead of overwriting column 1
This can now (packageVersion("dplyr") >= 1.0.0) be done very nicely with the combination of dplyr::rowwise() and dplyr::c_across().
library(dplyr)
packageVersion("dplyr")
#> [1] '1.0.10'
iris %>%
head %>%
rowwise() %>%
mutate(sum = sum(c_across(c(1, 2))))
#> # A tibble: 6 × 6
#> # Rowwise:
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 8.6
#> 2 4.9 3 1.4 0.2 setosa 7.9
#> 3 4.7 3.2 1.3 0.2 setosa 7.9
#> 4 4.6 3.1 1.5 0.2 setosa 7.7
#> 5 5 3.6 1.4 0.2 setosa 8.6
#> 6 5.4 3.9 1.7 0.4 setosa 9.3
Created on 2022-11-01 with reprex v2.0.2