How to do rowSums over many columns in ``dplyr`` or ``tidyr``?

For example, is it possible to do this in dplyr:
new_name <- "Sepal.Sum"
col_grep <- "Sepal"
iris <- cbind(iris, tmp_name = rowSums(iris[,grep(col_grep, names(iris))]))
names(iris)[names(iris) == "tmp_name"] <- new_name
This adds up all the columns that contain "Sepal" in the name and creates a new variable named "Sepal.Sum".
Importantly, the solution needs to rely on a grep (or dplyr::matches, dplyr::one_of, etc.) when selecting the columns for the rowSums function, and the name of the new column must be dynamic.
My application has many new columns being created in a loop, so an even better solution would use mutate_each_ to generate many of these new columns.

Here is a dplyr solution that uses the contains() selection helper inside select():
iris %>% mutate(Sepal.Sum = iris %>% rowwise() %>% select(contains("Sepal")) %>% rowSums()) -> iris2
head(iris2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
And here are the benchmarks:
Unit: milliseconds
expr
iris2 <- iris %>% mutate(Sepal.Sum = iris %>% rowwise() %>% select(contains("Sepal")) %>% rowSums())
min lq mean median uq max neval
1.816496 1.86304 2.132217 1.928748 2.509996 5.252626 100
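With dplyr 1.0.0 or later, the same idea can be written without the nested pipeline, and the new column name can come from a variable. The following is a minimal sketch under that version assumption, not the original answer's method; the `patterns` vector and the `iris3` name are made up for illustration.

```r
library(dplyr)

new_name <- "Sepal.Sum"
col_grep <- "Sepal"

# across() selects the matching columns; `!!new_name :=` sets the name dynamically
iris2 <- iris %>%
  mutate(!!new_name := rowSums(across(matches(col_grep))))

# Hypothetical lookup of new column name -> pattern, for creating several sums in a loop.
# Each pattern must not match a sum column created in an earlier iteration.
patterns <- c(Sepal.Sum = "Sepal", Petal.Sum = "Petal")
iris3 <- iris
for (nm in names(patterns)) {
  iris3 <- iris3 %>%
    mutate(!!nm := rowSums(across(matches(patterns[[nm]]))))
}
head(iris3)
```

The loop reassigns `iris3` on each pass, so a later pattern would also pick up any earlier sum column whose name it matches; selecting from the untouched `iris` instead avoids that.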

I didn't want to post this as a comment because it's too long.
There is not much difference in timing between the proposed solutions (except the data.table solution, which appears slower), and none stands out as clearly more elegant.
library(dplyr)
library(data.table)
new_name <- "Sepal.Sum"
col_grep <- "Sepal"
# Make iris bigger
data(iris)
for(i in 1:18){
iris <- bind_rows(iris, iris)
}
iris1 <- iris
system.time({
# Base solution
iris1 <- cbind(iris1, tmp_name = rowSums(iris1[,grep(col_grep, names(iris1))]))
names(iris1)[names(iris1) == "tmp_name"] <- new_name
})
# 1.26
system.time({
# less elegant dplyr solution
iris %>% select(matches(col_grep)) %>% rowSums() %>%
data.frame(.) %>% bind_cols(iris, .) %>% setNames(., c(names(iris), new_name))
})
# 1.14
system.time({
# bit more elegant dplyr solution
iris %>% mutate(tmp_name = rowSums(.[] %>% select(matches(col_grep)))) %>%
rename_(.dots = setNames("tmp_name", new_name))
})
# 1.12
data(iris)
# Make iris bigger
for(i in 1:18){
iris <- rbindlist(list(iris, iris))
}
system.time({
setDT(iris)[, tmp_name := rowSums(.SD[,grep(col_grep, names(iris)), with = FALSE])]
setnames(iris, "tmp_name", new_name)
})
# 2.39

Related

How to dynamically create variables and combine it to the dataframe in r?

I am running kmeans for multiple number of clusters and then trying to combine cluster results to the original dataframe.
I am using the code below from the post https://stats.stackexchange.com/questions/10838/produce-a-list-of-variable-name-in-a-for-loop-then-assign-values-to-the to create variables dynamically, and modifying it to fit my needs.
original code in the above post:
x <- as.list(rnorm(10000))
names(x) <- paste("a", 1:length(x), sep = "")
list2env(x , envir = .GlobalEnv)
Now applying this on iris data:
library(tidyverse)
library(ggthemes)
library(factoextra)
this works fine in creating 3 list of clusters:
# running for 1 to 3 clusters
lapply(1:3,
function(cluster_num){
cluster_res_list <- as.list(kmeans(iris %>% select(-Species), cluster_num, nstart = 25))
names(cluster_res_list) <- paste("iris_clus", 1:length(cluster_res_list), sep="_")
list2env(cluster_res_list, envir = .GlobalEnv)
# iris_df <- cbind(iris, cluster_res_list)
} )
Issue: When I try to combine them with the original dataset I am getting an error: Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class ‘"kmeans"’ to a data.frame
lapply(1:3,
function(cluster_num){
cluster_res_list <- as.list(kmeans(iris %>% select(-Species), cluster_num, nstart = 25))
names(cluster_res_list) <- paste("iris_clus", 1:length(cluster_res_list), sep="_")
list2env(cluster_res_list, envir = .GlobalEnv)
# to combine each cluster result to original df
iris_df <- cbind(iris, cluster_res_list)
} )

The output from kmeans can be viewed as a matrix using the fitted function. The row names of the matrix identify the clusters. If you want to add a column to the original data frame that identifies the cluster assignment, then something like this would work.
Using 3 clusters as an example:
cluster_num <- 3
iris %>%
select(-Species) %>%
kmeans(centers = cluster_num, nstart = 25) %>%
fitted() %>%
row.names() %>%
tibble(iris_clus = .) %>%
cbind(iris) %>%
tail()
iris_clus Sepal.Length Sepal.Width Petal.Length Petal.Width Species
145 2 6.7 3.3 5.7 2.5 virginica
146 2 6.7 3.0 5.2 2.3 virginica
147 1 6.3 2.5 5.0 1.9 virginica
148 2 6.5 3.0 5.2 2.0 virginica
149 2 6.2 3.4 5.4 2.3 virginica
150 1 5.9 3.0 5.1 1.8 virginica
Inserting this into the lapply from your example
lapply(1:3, function(cluster_num) {
iris %>%
select(-Species) %>%
kmeans(centers = cluster_num, nstart = 25) %>%
fitted() %>%
row.names() %>%
tibble(iris_clus = .) %>%
cbind(iris)
})
Here's one way to combine it all into one data set, with one column per model:
clusters <- Reduce(cbind, lapply(1:3, function(cluster_num) {
result <- iris %>%
select(-Species) %>%
kmeans(centers = cluster_num, nstart = 25) %>%
fitted() %>%
row.names() %>%
tibble(iris_clus = .)
names(result) <- paste("iris_clus", cluster_num, sep = "_")
return(result)
}))
cbind(iris, clusters)
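The error in the question arises because a kmeans result is a list whose components (centers, sizes, and so on) have different lengths, so cbind cannot coerce it to a data frame. Only the cluster component has one entry per row. A minimal base R sketch that reads that component directly, assuming the same iris data and 1 to 3 clusters (the names `features`, `ks` and `iris_df` are illustrative):

```r
set.seed(42)  # kmeans uses random starts; fix the seed for reproducibility

features <- iris[, names(iris) != "Species"]
ks <- 1:3

# One integer vector of cluster assignments per value of k
clusters <- lapply(ks, function(k) {
  kmeans(features, centers = k, nstart = 25)$cluster
})
names(clusters) <- paste("iris_clus", ks, sep = "_")

# A named list of equal-length vectors converts cleanly to a data frame
iris_df <- cbind(iris, as.data.frame(clusters))
head(iris_df)
```

This avoids list2env and keeps all results in one data frame, with one column per model.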

How to write a function to rename multiples columns at once?

df1 <- df %>%
rename(newcol1 = oldcol1) %>%
rename(newcol2 = oldcol2) %>%
rename(newcol3 = oldcol3) %>%
rename(newcol4 = oldcol4) %>%
rename(newcol5 = oldcol5)
I am trying to write a function, which I just learned, that will do the same thing as above.
renaming = function(df, oldcol, newcol) {
rename(df, newcol = oldcol)
}
But I am not sure how to handle multiple columns.
Any help would be much appreciated!

Using base R
names(df) <- c("newname1", "newname2", "newname3") # for all varnames
names(df)[c(1,3,4)] <- c("newname1", "newname3", "newname4") # for varnames 1,3,4
names(df)[names(df) == "oldname"] <- "newname" # for one varname
Using data.table
setnames(dt, old=c("oldname1", "oldname2"), new=c("newname1", "newname2"))
Using dplyr/tidyverse
df %>% rename(newname1 = oldname1, newname2 = oldname2)
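To wrap the dplyr variant in a function that takes the old and new names as character vectors, one option is to build a named lookup vector and pass it through all_of(). This is a sketch assuming dplyr 1.0.0 or later; `rename_cols` is a hypothetical helper name, not a dplyr function.

```r
library(dplyr)

# Hypothetical helper: rename several columns given parallel character vectors
rename_cols <- function(df, oldcol, newcol) {
  stopifnot(length(oldcol) == length(newcol))
  lookup <- setNames(oldcol, newcol)  # names are the new names, values the old ones
  rename(df, all_of(lookup))
}

df1 <- rename_cols(iris,
                   oldcol = c("Sepal.Length", "Sepal.Width"),
                   newcol = c("sl", "sw"))
names(df1)
```

Because all_of() is strict, the call stops with an error if any name in `oldcol` is missing from the data frame.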

You can use set_names from the tidyverse package purrr.
Reproducible example:
> df <- iris
> df1 <- df %>%
purrr::set_names(c("d","x","y","z","a"))
> df1
d x y z a
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa

R dplyr summarise multiple functions to selected variables

I have a dataset for which I want to summarise by mean, but also calculate the max to just 1 of the variables.
Let me start with an example of what I would like to achieve:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean))
which give me the following result
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fctr> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.8 4.4 1.9 0.5
2 versicolor 7.0 3.4 5.1 1.8
3 virginica 7.9 3.8 6.9 2.5
Is there an easy way to add, for example, max(Petal.Width) to summarise?
So far I have tried the following:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean)) %>%
mutate(Max.Petal.Width = max(iris$Petal.Width))
But with this approach I lose both the group_by and the filter from the code above, so it gives the wrong results.
The only solution I have been able to achieve is the following:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean,max)) %>%
select(Species:Petal.Width_mean,Petal.Width_max) %>%
rename(Max.Petal.Width = Petal.Width_max) %>%
rename_(.dots = setNames(names(.), gsub("_.*$","",names(.))))
Which is a bit convoluted and involves a lot of typing to just add a column with a different summarisation.
Thank you

Although this is an old question, it remains an interesting problem for which I have two solutions that I believe should be available to whoever finds this page.
Solution one
My own take:
mapply(summarise_at,
.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst(mean, max),
MoreArgs = list(.tbl = iris %>% group_by(Species) %>% filter(Sepal.Length > 5))) %>%
reduce(merge, by = "Species")
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
# 1 setosa 5.314 3.714 1.509 0.2773 0.5
# 2 versicolor 5.998 2.804 4.317 1.3468 1.8
# 3 virginica 6.622 2.984 5.573 2.0327 2.5
Solution two
An elegant solution using package purrr from the tidyverse itself, inspired by this discussion:
list(.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst("mean" = mean, "max" = max)) %>%
pmap(~ iris %>% group_by(Species) %>% filter(Sepal.Length > 5) %>% summarise_at(.x, .y)) %>%
reduce(inner_join, by = "Species")
# A tibble: 3 x 6
Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.31 3.71 1.51 0.277 0.5
2 versicolor 6.00 2.80 4.32 1.35 1.8
3 virginica 6.62 2.98 5.57 2.03 2.5
Short discussion
The data.frame and tibble are the desired result, the last column being the max of petal.width and the other ones the means (by group and filter) of all other columns.
Both solutions hinge on three realizations:
1. summarise_at accepts two lists, one of n variables and one of m functions, and applies all m functions to all n variables, producing m × n columns in a tibble. The solution therefore has to make this function loop over pairs, each consisting of a group of variables and the one function to apply to them.
2. In R, operations on corresponding elements of two lists are done with mapply, or with map2, pmap and their variants from purrr, dplyr's fellow tidyverse package. These take lists of equal length and apply a given operation to the elements matched by position.
3. Because the result is a list rather than a tibble or a data.frame, you need to combine its elements with reduce, using inner_join or merge.
Note that the means I obtain are different from those of the OP, but they are the means I obtain with his reproducible example as well (maybe we have two different versions of the iris dataset?).

If you wanted to do something more complex like that, you could write your own version of summarize_at. With this version you supply triplets of column names, functions, and naming rules. Here's a rough start:
my_summarise_at<-function (.tbl, ...)
{
dots <- list(...)
stopifnot(length(dots)%%3==0)
vars <- do.call("append", Map(function(.cols, .funs, .name) {
cols <- select_colwise_names(.tbl, .cols)
funs <- as.fun_list(.funs, .env = parent.frame())
val<-colwise_(.tbl, funs, cols)
names <- sapply(names(val), function(x) gsub("%", x, .name))
setNames(val, names)
}, dots[seq_along(dots)%%3==1], dots[seq_along(dots)%%3==2], dots[seq_along(dots)%%3==0]))
summarise_(.tbl, .dots = vars)
}
environment(my_summarise_at)<-getNamespace("dplyr")
And you can call it with
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
my_summarise_at("Sepal.Length:Petal.Width", mean, "%_mean",
"Petal.Width", max, "%_max")
For the names we just replace the "%" with the default name. The idea is just to dynamically build the summarize_ expression. The summarize_at function is really just a convenience wrapper around that basic function.

If you are trying to do everything with dplyr (which might be easier to remember), then you can leverage the new across function which will be available from dplyr 1.0.0.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(across(Sepal.Length:Petal.Width, mean)) %>%
cbind(iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(across(Petal.Width, max)) %>%
select(-Species)
)
The only difficulty is combining two calculations on the same column, Petal.Width, within the grouped data: you have to repeat the grouping and the filter, but you can nest that second pipeline inside the cbind.
This returns the desired result:
Species Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Width
1 setosa 5.313636 3.713636 1.509091 0.2772727 0.5
2 versicolor 5.997872 2.804255 4.317021 1.3468085 1.8
3 virginica 6.622449 2.983673 5.573469 2.0326531 2.5
If the task called for only one calculation per column, rather than two on Petal.Width, it could be written elegantly as:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(
across(Sepal.Length:Petal.Length, mean),
across(Petal.Width, max)
)
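Both calculations on Petal.Width can also share a single summarise() call if the maximum is computed first under a new name, before across() replaces Petal.Width with its mean. This is a sketch assuming dplyr 1.0.0 or later; the column name Max.Petal.Width is taken from the question, and `res` is an illustrative name.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>%
  summarise(
    # Computed first, while Petal.Width still holds the raw values
    Max.Petal.Width = max(Petal.Width),
    # Then every measurement column is replaced by its group mean
    across(Sepal.Length:Petal.Width, mean)
  )
res
```

The order matters: summarise() evaluates its arguments in sequence, so placing the max after the across() call would take the maximum of the already-summarised mean.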

I was looking for something similar and tried the following. It works well and is much easier to read than the other suggested solutions.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise(MeanSepalLength=mean(Sepal.Length),
MeanSepalWidth = mean(Sepal.Width),
MeanPetalLength=mean(Petal.Length),
MeanPetalWidth=mean(Petal.Width),
MaxPetalWidth=max(Petal.Width))
# A tibble: 3 x 6
Species MeanSepalLength MeanSepalWidth MeanPetalLength MeanPetalWidth MaxPetalWidth
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.31 3.71 1.51 0.277 0.5
2 versicolor 6.00 2.80 4.32 1.35 1.8
3 virginica 6.62 2.98 5.57 2.03 2.5
In the summarise() call, name each new column and pass the column to summarise to the function of your choice.

Rename multiple variables within a pipeline

The pipeline metaphor enabled by packages like dplyr and magrittr is incredibly useful and does great things for making your code readable in R (a daunting task!)
How can one make a pipeline that ends by renaming all the variables in a data frame to a pre-determined list?
Here is what I tried. First, simple sample data to test on:
> library(dplyr)
> iris %>% head(n=3) %>% select(-Species) %>% t %>% as.data.frame -> test.data
> test.data
1 2 3
Sepal.Length 5.1 4.9 4.7
Sepal.Width 3.5 3.0 3.2
Petal.Length 1.4 1.4 1.3
Petal.Width 0.2 0.2 0.2
This doesn't work:
> test.data %>% rename(a=1,b=2,c=3)
Error: Arguments to rename must be unquoted variable names. Arguments a, b, c are not.
I wasn't able to figure out the precise meaning of this error from reading the documentation on rename. My other attempt avoids an error by using curly braces to define a code block, but the renaming doesn't actually happen:
> test.data %>% { names(.) <- c('a','b','c')}

You were nearly correct; use setNames {stats} instead of rename (zx8754 answered in a comment before me).
From the documentation:
setNames: This is a convenience function that sets the names on an object and returns the object. It is most useful at the end of a function definition where one is creating the object to be returned and would prefer not to store it under a name just so the names can be assigned.
Your example (close; just replace rename with setNames):
iris %>%
head(n=3) %>%
select(-Species) %>%
t %>%
as.data.frame %>%
rename(a=1,b=2,c=3)
Answer
iris %>%
head(n=3) %>%
select(-Species) %>%
t %>%
as.data.frame %>%
setNames(c('a','b','c'))
Another Example
name_list <- c('a','b','c')
iris %>%
head(n=3) %>%
select(-Species) %>%
t %>%
as.data.frame %>%
setNames(name_list)

The way I got this to work, I needed the tee operator from the magrittr package:
> library(magrittr)
> test.data %T>% { names(.) <- c('a','b','c')} -> renamed.test.data
> renamed.test.data
a b c
Sepal.Length 5.1 4.9 4.7
Sepal.Width 3.5 3.0 3.2
Petal.Length 1.4 1.4 1.3
Petal.Width 0.2 0.2 0.2
Note that for a data frame with normal (i.e. not numbers) variable names, you can do this:
> # Rename it with rename in a normal pipe
> renamed.test.data %>% rename(x=a,y=b,z=c) -> renamed.again.test.data
> renamed.again.test.data
x y z
Sepal.Length 5.1 4.9 4.7
Sepal.Width 3.5 3.0 3.2
Petal.Length 1.4 1.4 1.3
Petal.Width 0.2 0.2 0.2
The above trick (edit: or, even better, using setNames) is still useful, though, because sometimes you already have the list of names in a character vector and you just want to set them all at once without worrying about writing out each replacement pair.

We can rename the numeric variable names with dplyr::rename by enclosing them in backticks (`).
library(dplyr)
iris %>%
head(n=3) %>% select(-Species) %>% t %>% as.data.frame %>%
dplyr::rename(a=`1`, b=`2`, c=`3`)
# a b c
# Sepal.Length 5.1 4.9 4.7
# Sepal.Width 3.5 3.0 3.2
# Petal.Length 1.4 1.4 1.3
# Petal.Width 0.2 0.2 0.2
Alternatively, we can set the column names with stats::setNames, magrittr::set_names or purrr::set_names.
library(dplyr)
library(magrittr)
library(purrr)
iris %>%
head(n=3) %>% select(-Species) %>% t %>% as.data.frame %>%
stats::setNames(c("a", "b", "c"))
iris %>%
head(n=3) %>% select(-Species) %>% t %>% as.data.frame %>%
magrittr::set_names(c("a", "b", "c"))
iris %>%
head(n=3) %>% select(-Species) %>% t %>% as.data.frame %>%
purrr::set_names(c("a", "b", "c"))
# All three calls above produce the same result:
# a b c
# Sepal.Length 5.1 4.9 4.7
# Sepal.Width 3.5 3.0 3.2
# Petal.Length 1.4 1.4 1.3
# Petal.Width 0.2 0.2 0.2
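If you prefer to stay within dplyr, rename_with() offers another route, assuming dplyr 1.0.0 or later. It expects a function that maps the old names to new ones, so a formula that simply returns the prepared character vector is enough. A minimal sketch (`new_names` and `renamed` are illustrative names):

```r
library(dplyr)

new_names <- c("a", "b", "c")

renamed <- iris %>%
  head(n = 3) %>%
  select(-Species) %>%
  t() %>%
  as.data.frame() %>%
  # The function must return one new name per selected column
  rename_with(~ new_names)

names(renamed)
```

The same call also accepts a transformation such as toupper, or a .cols argument to restrict which columns are renamed.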

How do I summarise only part of a table?

I have two related use-cases in which I need to summarise just parts of a table, specified in a way similar to filter.
In a nutshell, I want something like this:
iris %>%
use_only(Species == 'setosa') %>%
summarise_each(funs(sum), -Species) %>%
mutate(Species = 'setosa_sum') %>%
use_all()
To yield this:
Source: local data frame [101 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 250.3 171.4 73.1 12.3 setosa_sum
2 7.0 3.2 4.7 1.4 versicolor
3 6.4 3.2 4.5 1.5 versicolor
4 6.9 3.1 4.9 1.5 versicolor
5 5.5 2.3 4.0 1.3 versicolor
…
So instead of grouping by the value of a column, I use a filtering criterion to operate on a view of the table, without actually losing the rest of the table (unlike filter).
How do I smartly implement use_only/use_all? Even better, is this functionality already contained in dplyr and how do I use it?
It’s of course quite easy to generate the result above, but I need to do something similar for many different cases, with complex and variable criteria for filtering.

I implemented this with the approach of having use_only save the rest of the table into a global option dplyr_use_only_rest, and having use_all bind it back together.
use_only <- function(.data, ...) {
if (!is.null(.data$.index)) {
stop("data cannot already have .index column, would be overwritten")
}
filt <- .data %>%
mutate(.index = row_number()) %>%
filter(...)
rest <- .data %>% slice(-filt$.index)
options(dplyr_use_only_rest = rest)
select(filt, -.index)
}
use_all <- function(.data, ...) {
rest <- getOption("dplyr_use_only_rest")
if (is.null(rest)) {
stop("called use_all() without earlier use_only()")
}
options(dplyr_use_only_rest = NULL)
bind_rows(.data, rest)
}
I recognize setting global options is less than ideal design for functional programming, but I don't think there's another way to ensure that the remainder of the data frame passes through any intermediate functions untouched. Adding an extra attribute to the object wouldn't survive functions such as do or summarize.
At this point,
iris %>%
use_only(Species == 'setosa') %>%
summarise_each(funs(sum), -Species) %>%
mutate(Species = 'setosa_sum') %>%
use_all()
returns, as desired:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 250.3 171.4 73.1 12.3 setosa_sum
2 7.0 3.2 4.7 1.4 versicolor
3 6.4 3.2 4.5 1.5 versicolor
4 6.9 3.1 4.9 1.5 versicolor
5 5.5 2.3 4.0 1.3 versicolor
...
Any intermediate steps could be used in place of summarize_each and mutate (do, filter, etc) and they would happen only to the specified rows. You could even add or remove columns (the remainder would be filled in with NAs).

I think your approach of searching for a function to satisfy that particular syntax is too restrictive. This is what I would do using data.table (I'm not sure if dplyr allows for variable rows like this yet, I know it's been an FR for a while):
library(data.table)
dt = as.data.table(iris)
dt[, if (Species == 'setosa') lapply(.SD, sum) else .SD, by = Species]
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1: setosa 250.3 171.4 73.1 12.3
# 2: versicolor 7.0 3.2 4.7 1.4
# 3: versicolor 6.4 3.2 4.5 1.5
# 4: versicolor 6.9 3.1 4.9 1.5
# 5: versicolor 5.5 2.3 4.0 1.3
# ---
You can also add [Species == 'setosa', Species := 'setosa_sum'] at the end to modify the name in place. It should be straightforward to extend to multiple criteria/whatever function.

You can create a new column to group by:
iris %>%
mutate( group1 = ifelse(Species == "setosa", "", row_number())) %>%
group_by( group1, Species ) %>%
summarise_each(funs(sum), -Species, -group1) %>%
ungroup() %>%
select(-group1)
Update - as more general solution
library(lazyeval)
use_only_ <- function(x, condition, ...) {
condition <- as.lazy(condition, parent.frame())
mutate_(x, .group = condition) %>%
group_by_(".group", ...)
}
use_only <- function(x, condition, ...) {
use_only_(x, lazy(condition), ...)
}
use_all <- function(x) {
ungroup(x) %>%
select(- .group)
}
Use use_only with any condition in the context of data frame and calling environment. In this case:
iris %>%
use_only( ifelse(Species == "setosa", "", row_number()), "Species") %>%
summarise_each(funs(sum), -Species, -.group) %>%
use_all()
The use_only_ can be used with formula or string. For example:
condition <- ~ifelse(Species == "setosa", "", row_number())
or
condition <- "ifelse(Species == 'setosa', '', row_number())"
And call:
iris %>%
use_only_(condition, "Species") %>%
summarise_each(funs(sum), -Species, -.group) %>%
use_all()
When mutate-ing between the use_only and use_all calls you must take care to change only values inside marked group.
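A simpler alternative that avoids both global state and a helper grouping column is to pass the row condition and the transformation to one function, which splits the data, transforms the selected rows, and binds the rest back. This is a minimal sketch assuming dplyr 1.0.0 or later; `summarise_part` and its arguments are hypothetical names, not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: apply .f to the rows where `keep` is TRUE,
# then append the untouched remaining rows
summarise_part <- function(.data, keep, .f) {
  stopifnot(is.logical(keep), length(keep) == nrow(.data))
  bind_rows(.f(.data[keep, , drop = FALSE]),
            .data[!keep, , drop = FALSE])
}

result <- iris %>%
  mutate(Species = as.character(Species)) %>%  # allow the new label "setosa_sum"
  summarise_part(
    keep = iris$Species == "setosa",
    .f = function(d) {
      d %>%
        summarise(across(-Species, sum)) %>%
        mutate(Species = "setosa_sum")
    }
  )
head(result)
```

The trade-off is that the intermediate steps live inside `.f` rather than in the main pipeline, but nothing is stored outside the call.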
