I'm currently trying to automate a data task that requires taking in a list of column names in string format, then summing those columns (rowwise). i.e., suppose there is some list as follows:
> list
[1] "colname1" "colname2" "colname3"
How would I go about passing in this list to some function like sum() in tidyverse? That is, I would like to run something like the following:
df <- df %>%
rowwise %>%
mutate(new_var = sum(list))
Any suggestions would be greatly, greatly appreciated. Thanks.
You could use rowSums here. For example:
library(dplyr)
mycols <- colnames(iris)[3:4]
mycols
[1] "Petal.Length" "Petal.Width"
Then:
iris %>%
mutate(new_var = rowSums(.[, mycols])) %>%
head()
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.6
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.7
5 5.0 3.6 1.4 0.2 setosa 1.6
6 5.4 3.9 1.7 0.4 setosa 2.1
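A variation on the same idea, offered as a sketch rather than the answerer's own code: it assumes dplyr 1.1.0 or later (pick() was added in that release) and wraps the character vector in all_of() so it is treated explicitly as a set of column names.

```r
library(dplyr)

# Column names held as strings, as in the answer above
mycols <- colnames(iris)[3:4]

# pick() returns the selected columns as a data frame inside mutate(),
# so rowSums() can be applied to it directly
res <- iris %>%
  mutate(new_var = rowSums(pick(all_of(mycols))))

head(res)
```

This avoids relying on the magrittr `.` placeholder, so it should behave the same on grouped data frames or with the native `|>` pipe.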
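You can pass the vector of column names to c_across(). Because it is an external character vector, wrap it in all_of() so the selection is unambiguous.
library(dplyr)
df <- df %>% rowwise() %>% mutate(new_var = sum(c_across(all_of(list)))) %>% ungroup()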
df
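To try the c_across() approach end to end, here is a minimal sketch on a made-up data frame. The data and the name `cols` are assumptions for illustration (the question does not show its data); `cols` is used instead of `list` to avoid shadowing the base function.

```r
library(dplyr)

# Hypothetical data: three columns to sum plus one that should be ignored
df <- tibble(
  colname1 = 1:3,
  colname2 = 4:6,
  colname3 = 7:9,
  other    = c(100, 200, 300)
)

cols <- c("colname1", "colname2", "colname3")

out <- df %>%
  rowwise() %>%                                      # compute one row at a time
  mutate(new_var = sum(c_across(all_of(cols)))) %>%  # sum only the named columns
  ungroup()                                          # drop the rowwise grouping

out
```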
Related
I want to use the str_detect function, passing a variable as the first argument. In theory, this could look something like this.
# create the variable
var = names(mtcars)[1]
mtcars %>%
mutate(
new_var = case_when(str_detect(var, "^2"), "two", "other")
)
Now I'm not sure how to insert the variable var correctly into the str_detect function. I guess some tidy-eval is necessary, but I'm not sure....
Using mtcars as an example for string manipulation is not very helpful, so I'm switching over to iris. Also, your case_when specification was wrong, so I'm using if_else for this example.
You can use !!(sym(var))
library(tidyverse)
var <- "Species"
iris %>%
mutate(
new_var = if_else(str_detect(!!sym(var), "set"), "two", "other")
)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
1 5.1 3.5 1.4 0.2 setosa two
2 4.9 3.0 1.4 0.2 setosa two
3 4.7 3.2 1.3 0.2 setosa two
4 4.6 3.1 1.5 0.2 setosa two
5 5.0 3.6 1.4 0.2 setosa two
6 5.4 3.9 1.7 0.4 setosa two
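An alternative to !!sym(var) is the .data pronoun, which looks a column up by its string name. The sketch below is a variation on the answer above, not the answerer's code; the explicit as.character() is a precaution added here because Species is a factor.

```r
library(dplyr)
library(stringr)

var <- "Species"

res <- iris %>%
  mutate(
    # .data[[var]] retrieves the column whose name is stored in var
    new_var = if_else(str_detect(as.character(.data[[var]]), "set"), "two", "other")
  )

head(res)
```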
I'd like to select the first N values of each variable (column) in a data set, where N varies by column and row and is given in another table. An example with the iris data is below:
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
## Create a fake external table
library(data.table)
ext.tab <- data.table(species=c("setosa","versicolor", "virginica" ),N1=c(1:3),N2=c(3:5),N3=c(5:7),N4=c(7:9))
head(ext.tab)
species N1 N2 N3 N4
1: setosa 1 3 5 7
2: versicolor 2 4 6 8
3: virginica 3 5 7 9
Now, for Iris setosa, I'd like to get the single largest value (N1 in ext.tab) of column 1 (Sepal.Length in the iris data), then the three largest values (N2 in ext.tab) of column 2 (Sepal.Width), then the five largest values (N3) of column 3 (Petal.Length), and so forth. Then I'd like to move on to Iris versicolor and do the same.
The result can be either a table or a list for each species with the values themselves or row indices for each variable (column). Any idea of a fast way to implement that?
Here is a tidyverse approach using a custom function. The function takes the variable and group names as character scalars and the number of maximum values as a numeric. Inside the function is a dplyr pipeline using the .data pronoun. Then, I reshaped ext.tab to long form and applied get_maximum() row-wise.
library(tidyverse)
get_maximum <- \(.x, .group, .n_max, .dat) {
.dat %>%
filter(Species == .group) %>%
arrange(desc(.data[[.x]])) %>%
slice(seq_len(.n_max)) %>%
pull(.data[[.x]])
}
dat <- as_tibble(ext.tab) %>%
pivot_longer(-species) %>%
mutate(name = recode(
name,
N1 = "Sepal.Length",
N2 = "Sepal.Width",
N3 = "Petal.Length",
N4 = "Petal.Width"
)) %>%
rowwise() %>%
mutate(max_num = list(
get_maximum(name, species, value, iris)
)) %>%
ungroup()
If you need the unique maximum values, you can add distinct() inside the custom function.
get_maximum_unique <- \(.x, .group, .n_max, .dat) {
.dat %>%
filter(Species == .group) %>%
distinct(.data[[.x]]) %>%
arrange(desc(.data[[.x]])) %>%
slice(seq_len(.n_max)) %>%
pull(.data[[.x]])
}
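The arrange/slice pair can also be expressed with slice_max(). The sketch below is an assumption-laden variant, not part of the answer: the name top_values is hypothetical, and with_ties = FALSE is chosen so that exactly n rows come back even when values are tied.

```r
library(dplyr)

# Hypothetical helper: the n largest values of column `col` within one species
top_values <- function(dat, col, group, n) {
  out <- dat %>%
    filter(Species == .env$group) %>%                    # .env$ marks `group` as a function argument
    slice_max(.data[[col]], n = n, with_ties = FALSE)    # keep exactly n rows
  out[[col]]
}

top_values(iris, "Sepal.Width", "setosa", 3)
```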
Here is an option using data.table. I have taken the liberty of renaming the columns of ext.tab to match those of iris.
cols <- setdiff(names(ext.tab), "Species")
iris[ext.tab, on=.(Species), by=.EACHI,
.(.(mapply(function(x, n) -head(sort(-x, partial=n), n),
x=mget(cols), n=mget(paste0("i.", cols)), SIMPLIFY=FALSE)))]$V1
data:
library(data.table)
iris <- as.data.table(iris)
ext.tab <- data.table(Species=c("setosa", "versicolor", "virginica"),
Sepal.Length=c(1:3),
Sepal.Width=c(3:5),
Petal.Length=c(5:7),
Petal.Width=c(7:9))
output:
[[1]]
[[1]]$Sepal.Length
[1] 5.8
[[1]]$Sepal.Width
[1] 4.4 4.2 4.1
[[1]]$Petal.Length
[1] 1.9 1.9 1.7 1.7 1.7
[[1]]$Petal.Width
[1] 0.4 0.4 0.6 0.4 0.5 0.4 0.4
[[2]]
[[2]]$Sepal.Length
[1] 7.0 6.9
[[2]]$Sepal.Width
[1] 3.4 3.3 3.2 3.2
[[2]]$Petal.Length
[1] 5.1 4.8 4.9 5.0 4.9 4.8
[[2]]$Petal.Width
[1] 1.7 1.6 1.6 1.8 1.5 1.5 1.6 1.5
[[3]]
[[3]]$Sepal.Length
[1] 7.7 7.9 7.7
[[3]]$Sepal.Width
[1] 3.8 3.8 3.6 3.4 3.4
[[3]]$Petal.Length
[1] 6.4 6.3 6.7 6.9 6.7 6.6 6.1
[[3]]$Petal.Width
[1] 2.5 2.5 2.4 2.5 2.4 2.4 2.3 2.3 2.3
Short explanation:
iris[ext.tab, on=.(Species), ...] performs a join that keeps every row of ext.tab.
by=.EACHI means the expression is evaluated for each row of ext.tab.
x=mget(cols) gets the columns in iris.
n=mget(paste0("i.", cols)) gets the number of values required for each column (the i. prefix refers to the columns of ext.tab).
-head(sort(-x, partial=n), n) performs a partial sort and extracts the first n values, i.e. the n largest values of x.
SIMPLIFY=FALSE and .(.( )) are simply required to return the results as a list.
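The partial-sort step can be tried in isolation with base R. This is a small sketch on a made-up vector (the values are arbitrary). It also shows why the values in the output above are not in descending order: a partial sort only guarantees which values land in the first n positions, not their order.

```r
# Made-up values for illustration
x <- c(3, 1, 4, 1, 5, 9, 2, 6)
n <- 3

# Negate so the largest values become the smallest, partially sort so the
# first n positions hold the n smallest of -x, then negate back
top <- -head(sort(-x, partial = n), n)

# The order within `top` is not guaranteed; sort fully if order matters
sort(top, decreasing = TRUE)
```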
I have a dataset like iris; any help would be appreciated.
iris %>% head %>% mutate(sum = .[[1]] + .[[2]]) #works
iris %>% head %>% mutate(max = max(.[1], .[2])) #doesnt work
Expected answer, find the max(1st column, 2nd column)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species max
1 5.1 3.5 1.4 0.2 setosa 5.1
2 4.9 3.0 1.4 0.2 setosa 4.9
3 4.7 3.2 1.3 0.2 setosa 4.7
4 4.6 3.1 1.5 0.2 setosa 4.6
5 5.0 3.6 1.4 0.2 setosa 5.0
6 5.4 3.9 1.7 0.4 setosa 5.4
many thanks in advance
We need the elementwise max, which can be achieved with pmax:
iris %>%
head %>%
mutate(max= pmax(.[[1]] , .[[2]]) )
The issue with max is that its usage is
max(..., na.rm = FALSE)
Here, the ... signifies
numeric or character arguments
So, it is taking the max value of all the columns passed into the function, rather than the elementwise max of the columns
The + operator is a different function and is always elementwise. If we use sum instead (the natural counterpart to compare with max), it behaves the same way as max: it collapses all of its arguments into a single value.
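The difference is easy to see on two short vectors. The vectors and the `cols` name below are made up for illustration; the last lines show one possible way to apply pmax when the columns are given as strings, which goes slightly beyond what the answer covers.

```r
a <- c(1, 5, 3)
b <- c(4, 2, 6)

overall_max <- max(a, b)    # one value across all inputs
row_max     <- pmax(a, b)   # one value per position
overall_sum <- sum(a, b)    # one value, like max
row_sum     <- a + b        # one value per position, like pmax

# With column names stored as strings, do.call() passes each column to pmax
cols <- c("Sepal.Length", "Sepal.Width")
iris_max <- do.call(pmax, unname(as.list(head(iris)[cols])))
iris_max
```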
How can I use variables in place of column names in dplyr strings? As an example say I want to add a column to the iris dataset called sum that is the sum of Sepal.Length and Sepal.Width. In short I want a working version of the below code.
x = "Sepal.Length"
y = "Sepal.Width"
head(iris%>% mutate(sum = x+y))
Currently, running the code outputs "Evaluation error: non-numeric argument to binary operator" as R evaluates x and y as character vectors. How do I instead get R to evaluate x and y as column names of the dataframe? I know that the answer is to use some form of lazy evaluation, but I'm having trouble figuring out exactly how to configure it.
Note that the proposed duplicate: dplyr - mutate: use dynamic variable names does not address this issue. The duplicate answers this question:
Not my question: How do I do:
var = "sum"
head(iris %>% mutate(var = Sepal.Length + Sepal.Width))
I think the recommended way is to use sym:
iris %>% mutate(sum = !!sym(x) + !!sym(y)) %>% head
It also works with get():
> rm(list = ls())
> data("iris")
>
> library(dplyr)
>
> x <- "Sepal.Length"
> y <- "Sepal.Width"
>
> head(iris %>% mutate(sum = get(x) + get(y)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
In R I'm trying to profile the columns of a data frame. This is the data frame:
> library(MASS)
> data<-iris[1:5,1:4]
> data
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
I want the result of the profiling to look something like this:
min max mean
Sepal.Length 4.6 5.1 5
Sepal.Width 3.0 3.6 5
Petal.Length 1.3 1.5 3
Petal.Width 0.2 0.2 1
There could be many more functions I want to apply to the columns.
I'm able to get the data I want with this command:
library(dplyr)
data %>% summarise_all(funs(min, max, mean))
However, neither the shape nor the row/column names are as desired. Is there an elegant way of achieving what I want?
A one-liner with base R:
t(sapply(data, summary))[, c('Min.', 'Max.', 'Mean')]
Or with each() from plyr:
library(plyr)
t(sapply(data, each(min,max,mean)))
Using dplyr and tidyr, which allows any functions to be used:
library(dplyr)
library(tidyr)
data %>%
gather() %>%
group_by(key) %>%
summarise_all(funs(min, max, mean))
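In more recent releases, funs() is deprecated in dplyr and gather() is superseded by pivot_longer() in tidyr. A sketch of the same reshape-then-summarise idea with the newer functions follows, assuming tidyr 1.0.0 or later; the name `profile` is chosen here for illustration.

```r
library(dplyr)
library(tidyr)

data <- iris[1:5, 1:4]

profile <- data %>%
  # One row per (column name, value) pair
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  # Add further summary functions here as needed
  summarise(min = min(value), max = max(value), mean = mean(value))

profile
```

The result has one row per original column, with the column name in key. Rows come back in alphabetical order of key rather than in the original column order.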