I would like to index by column name within the sum() command using the sequence operator ":".
library(dbplyr)
library(tidyverse)
# Build the example data: a label column X plus 56 numeric columns
# X.1 ... X.56, each containing c(1, 2, 3).
df <- data.frame(X = c("A", "B", "C"))
for (i in 1:56) df[[paste0("X.", i)]] <- c(1, 2, 3)
Is there a quicker way to do this? The following provides the correct result. However, for large datasets (larger than this one) it becomes very laborious to deal with, especially when pivot_wider() is used and the columns are not created beforehand (as above).
df %>% rowwise() %>% mutate(
  Result_column = case_when(
    X == "A" ~ sum(c(X.1, X.2, X.3, X.4, X.5)),
    X == "B" ~ sum(c(X.4, X.5)),
    X == "C" ~ sum(c(X.3, X.4, X.5, X.6, X.7, X.8, X.9, X.10, X.11, X.12,
                     X.13, X.14, X.15, X.16, X.17, X.18, X.19, X.20, X.21,
                     X.22, X.23, X.24, X.25, X.26, X.27, X.28, X.29, X.30,
                     X.31, X.32, X.33, X.34, X.35, X.36, X.37, X.38, X.39,
                     X.40, X.41, X.42, X.43, X.44, X.45, X.46, X.47, X.48,
                     X.49, X.50, X.51, X.52, X.53, X.54, X.55, X.56))
  )
) %>% dplyr::select(Result_column)
The following shows how it would look with "select"-style syntax, which is what I would like to use. However, it does not produce the correct numerical result. One could shorten the code by ~50 entries by using the sequence operator ":".
df %>% rowwise() %>% mutate(
  Result_column = case_when(
    X == "A" ~ sum(c(X.1:X.5)),
    X == "B" ~ sum(c(X.4:X.5)),
    X == "C" ~ sum(c(X.3:X.56))
  )
) %>% dplyr::select(Result_column)
Below is a related question; however, it is not the same, because what is needed here is not every column that starts with "X" but rather a sequence of columns.
Using mutate rowwise over a subset of columns
EDIT:
The code provided below by cnbrowlie is correct.
df %>% mutate(
  Result_column = case_when(
    X == "A" ~ sum(c(X.1:X.5)),
    X == "B" ~ sum(c(X.4:X.5)),
    X == "C" ~ sum(c(X.3:X.56))
  )
) %>% dplyr::select(Result_column)
This can be done with dplyr >= 1.0.0 using rowSums() (which computes the sum for a row across multiple columns) and across() (which superseded vars() as a method for specifying columns in a data frame, allowing the use of ":" to select sequences of columns). Note that plain X.1:X.5 inside sum() fails because, within rowwise(), it is evaluated as a numeric sequence running from the value of X.1 to the value of X.5, not as a range of columns:
df %>% rowwise() %>% mutate(
  Result_column = case_when(
    X == "A" ~ rowSums(across(X.1:X.5)),
    X == "B" ~ rowSums(across(X.4:X.5)),
    X == "C" ~ rowSums(across(X.3:X.56))
  )
) %>% dplyr::select(Result_column)
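As a quick sanity check on the example df above, where every X.n column holds c(1, 2, 3): row "A" sums five 1s, row "B" two 2s, and row "C" fifty-four 3s, so the expected result is 5, 4, 162.

res <- df %>% rowwise() %>% mutate(
  Result_column = case_when(
    X == "A" ~ rowSums(across(X.1:X.5)),
    X == "B" ~ rowSums(across(X.4:X.5)),
    X == "C" ~ rowSums(across(X.3:X.56))
  )
) %>% dplyr::select(Result_column)

# Expected sums: five 1s, two 2s, fifty-four 3s.
stopifnot(res$Result_column == c(5, 4, 162))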
I have some table data that has been scattered across around 1000 variables in a dataset. Most are split across 2 variables, and I can piece the data together using coalesce(); however, this is pretty inefficient for some variables, which are instead spread across more than 10. Is there a better/more efficient way?
The syntax I have written so far is:
scattered_data <- df %>%
  select(id, contains("MASS9A_E2")) %>%
  # this brings in all the variables for this one question that start with this string
  mutate(speciality = coalesce(
    MASS9A_E2_C4_1, MASS9A_E2_C4_2, MASS9A_E2_C4_3, MASS9A_E2_C4_4,
    MASS9A_E2_C4_5, MASS9A_E2_C4_6, MASS9A_E2_C4_7, MASS9A_E2_C4_8,
    MASS9A_E2_C4_9, MASS9A_E2_C5_1, MASS9A_E2_C5_2, MASS9A_E2_C5_3,
    MASS9A_E2_C5_4, MASS9A_E2_C5_5, MASS9A_E2_C5_6, MASS9A_E2_C5_7,
    MASS9A_E2_C5_8, MASS9A_E2_C5_9
  ))
I have this for 28 MASS questions, so I would really love to be able to collapse these down a bit quicker.
You can use do.call() to pass all columns except id as the input to coalesce().
library(dplyr)
df %>%
  select(id, contains("MASS9A_E2")) %>%
  # `.` is the data coming out of the select() above, so coalesce() only sees
  # the MASS9A_E2 columns; select(df, -id) would pull in every non-id column of df.
  mutate(speciality = do.call(coalesce, select(., -id)))
In addition, you can call coalesce() iteratively with Reduce():
df %>%
  select(id, contains("MASS9A_E2")) %>%
  mutate(speciality = Reduce(coalesce, select(., -id)))
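The question doesn't include a reproducible df, so here is a minimal sketch with hypothetical MASS9A_E2 column names (names and values are made up for illustration) showing that both variants pick the first non-missing value per row:

library(dplyr)

df <- tibble(
  id             = 1:3,
  MASS9A_E2_C4_1 = c("cardiology", NA, NA),
  MASS9A_E2_C5_1 = c(NA, "oncology", NA),
  MASS9A_E2_C5_2 = c(NA, NA, "neurology")
)

df %>%
  select(id, contains("MASS9A_E2")) %>%
  mutate(speciality = Reduce(coalesce, select(., -id))) %>%
  pull(speciality)
#> [1] "cardiology" "oncology"  "neurology"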
I have a df with 30 columns and 2000 rows.
From the df, I selected several variables by name and calculated the mean of Value over each group of 3 rows, using the group and type variables.
But there are only 3 variables (group, type, res) in the output data.
How can I tell it to keep the selected variables in the output df? Is there anything wrong with this code?
output <- data %>%
select(group, type, A, B, C, Value) %>%
group_by(group = gl(n()/3, 3), type) %>%
summarise(res = mean(Value))
Thanks in advance!
As others have pointed out, summarize only returns the grouping variables and the variables specified in summarize. This is by design: summarize returns a single row for each group, so there must be a single value for each variable.
The function used in summarize must return a single value (so that's covered), and using group_by with variables ensures that those variables are the same within each group. But for the other variables, there could be several different values within the group: which one would summarize choose? Instead of making a guess, it drops those variables.
There are several options to get around this; which one is best depends on your data and what you want to do with it (a small sketch of some of them follows this list):
Add these variables as grouping variables. This is the preferred method, but obviously it only works if the structure of the data allows it. For example, in a hypothetical dataset, if you want to group by city but want to preserve the state variable, using group_by(city, state) will divide into groups the same way as group_by(city) since city and state are linked (for example, "Boston" will always be with "MA").
Define them in summarize and choose only the first value to be the value for that group, as in #thc's answer. Note that you will lose any other values of those variables and it's not always clear which value will be kept and which will be lost.
Use mutate instead - this will keep the original number of rows rather than collapsing to 1 per group, but will ensure that you don't lose any data.
Join them into a comma- (or otherwise) separated string by adding A = paste(A, collapse = ', ') to the summarize for each variable you want to keep (note it is collapse, not sep, that combines a vector into one string). This preserves the information, at the expense of making it difficult to work with in any future steps.
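A minimal sketch of options 1, 2, and 4, using hypothetical data shaped like the question's (dplyr >= 1.0 for the .groups argument):

library(dplyr)

# Hypothetical data: two 3-row groups, with A constant within each group.
data <- tibble(
  group = rep(1:2, each = 3),
  type  = "t1",
  A     = rep(c("a1", "a2"), each = 3),
  Value = 1:6
)

# Option 1: add A as a grouping variable (possible because A is constant
# within each group).
data %>%
  group_by(group, type, A) %>%
  summarise(res = mean(Value), .groups = "drop")

# Option 2: keep the first value of A in each group (thc's approach).
data %>%
  group_by(group, type) %>%
  summarise(res = mean(Value), A = A[1], .groups = "drop")

# Option 4: collapse all values of A into one string per group.
data %>%
  group_by(group, type) %>%
  summarise(res = mean(Value), A = paste(A, collapse = ", "), .groups = "drop")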
You can include them in summarise instead, e.g.:
output <- data %>%
select(group, type, A, B, C, Value) %>%
group_by(group = gl(n()/3, 3), type) %>%
summarise(res = mean(Value), A = A[1], B = B[1], C = C[1])
I believe this is the fastest approach under dplyr if you have a very large data.frame.
I noticed that the order of the dplyr functions in a pipeline impacts the result. For example:
iris %>%
group_by(Species) %>%
mutate(Sum = sum(Sepal.Length))
produces different results than this:
iris %>%
mutate(Sum = sum(Sepal.Length)) %>%
group_by(Species)
Can anyone explain the reason for this? And if there is a specific order in which they have to be defined, please mention it.
Thank you
FYI: iris is a built-in dataset in R; use data(iris) to load it. I was trying to add a new column: the sum of sepal lengths for each species.
Yes, the order matters.
The pipe is equivalent to:
iris <- group_by(iris, Species)
iris <- mutate(iris, Sum = sum(Sepal.Length))
If you change the order, you change the result. If you group by Species first, you'll get the sum per species (I guess that's what you want).
However, if you group by Species after the sum, the Sum column will hold the sepal lengths summed over all species.
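To make the difference concrete:

library(dplyr)

# Grouping first: Sum is the per-species total, e.g. 250.3 for setosa.
iris %>%
  group_by(Species) %>%
  mutate(Sum = sum(Sepal.Length)) %>%
  distinct(Species, Sum)

# Grouping last: Sum was computed before any grouping, so every row gets the
# grand total of all 150 sepal lengths, 876.5.
iris %>%
  mutate(Sum = sum(Sepal.Length)) %>%
  group_by(Species) %>%
  distinct(Species, Sum)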
Yes, the order matters, because each part of the pipe is evaluated on its own, from the first pipe-part through to the last, and the result of the previous part (or the original dataset) is piped forward into the next one. That means that if you use group_by after the mutate, as in your example, the mutate is done without grouping.
One side effect is that you can build complex, long pipes in which you control the order of operations by positioning each one at the right place in the pipe, without needing to start a new pipe after each operation is finished, as in the sketch below.
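For example, a single pipe can regroup mid-stream so that each step sees exactly the grouping it needs (a small sketch):

library(dplyr)

# One pipe, two grouping stages: a per-species total first, then each
# species' share of the overall total after ungrouping.
iris %>%
  group_by(Species) %>%
  mutate(SpeciesSum = sum(Sepal.Length)) %>%
  ungroup() %>%
  mutate(Share = SpeciesSum / sum(Sepal.Length))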
Given a grouped tbl, can I extract one or a few groups?
Such a function can be useful when prototyping code, e.g.:
mtcars %>%
group_by(cyl) %>%
select_first_n_groups(2) %>%
do({'complicated expression'})
Surely, one can do an explicit filter before grouping, but that can be cumbersome.
With a bit of dplyr along with some nesting/unnesting (supported by the tidyr package), you can build a small helper to get the first (or any) group:
library(tidyr)  # for nest()/unnest()

# Note: this helper masks dplyr::first().
first <- function(x) x %>% nest() %>% ungroup() %>% slice(1) %>% unnest(data)

mtcars %>% group_by(cyl) %>% first()
By adjusting the slicing you could also extract the nth or any range of groups by index, but typically the first or the last is what most users want.
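For instance, a sketch of such a variant (nth_group is a hypothetical name):

# Extract the nth group by index: same idea, only the slice changes.
nth_group <- function(x, n) x %>% nest() %>% ungroup() %>% slice(n) %>% unnest(data)

mtcars %>% group_by(cyl) %>% nth_group(2)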
The name is inspired by functional APIs, which all call it first (see the stdlibs of e.g. Kotlin, Python, Scala, Java, Spark).
Edit: Faster Version
A more scalable version (>50x faster on large datasets) that avoids nesting would be:
first_group <- function(x) x %>%
  select(group_cols()) %>%
  distinct() %>%
  ungroup() %>%
  slice(1) %>%
  { semi_join(x, .) }
Another positive side effect of this improved version is that it fails if no grouping is present in x.
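Usage is the same as with the nested version:

# Returns only the rows of the first group; semi_join() messages the join columns.
mtcars %>% group_by(cyl) %>% first_group()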
Try this, where groups is a vector of group numbers. Here 1:2 means the first two groups:
# attr(data, "indices") came from old dplyr internals and no longer exists in
# current versions; group_rows() returns the equivalent row indices per group
# (1-based, so the "+ 1" is no longer needed).
select_groups <- function(data, groups, ...)
  data[sort(unlist(group_rows(data)[groups])), ]
mtcars %>% group_by(cyl) %>% select_groups(1:2)
The selected rows appear in the original order. If you prefer the rows to appear in the order in which the groups are specified (e.g. in the above example, the rows of the first group followed by the rows of the second group), then remove the sort.