How to use dplyr `rowwise()` with column numbers instead of column names

library(tidyverse)
df <- tibble(col1 = c(5, 2), col2 = c(6, 4), col3 = c(9, 9))
df %>% rowwise() %>% mutate(col4 = sd(c(col1, col3)))
# # A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
# 1 5 6 9 2.83
# 2 2 4 9 4.95
After asking a series of questions I can finally calculate standard deviation across rows. See my code above.
But I can't use column names in my production code, because the database I pull from changes the column names periodically. Luckily, the relative column positions are always the same.
So I'll just use column numbers instead. And let's check to make sure I can just swap things in and out:
identical(df$col1, df[[1]])
# [1] TRUE
Yes, I can just swap df[[1]] in place of df$col1. I think I do it like this.
df %>% rowwise() %>% mutate(col4 = sd(c(.[[1]], .[[3]])))
# # A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
# 1 5 6 9 3.40
# 2 2 4 9 3.40
df %>% rowwise() %>% {mutate(col4 = sd(c(.[[1]], .[[3]])))}
# Error in mutate_(.data, .dots = compat_as_lazy_dots(...)) :
# argument ".data" is missing, with no default
Nope, neither of these works: the first returns results that differ from my original, and the second throws an error. And I can't use apply (I asked a separate question that explains why).
df %>% mutate(col4 = apply(.[, c(1, 3)], 1, sd))
How do I apply dplyr rowwise() with column numbers instead of names?

The issue with using .[[1]] or .[[3]] after rowwise() (which groups by row, so each group holds a single row) is that . refers to the whole data frame rather than the current row, so the extraction ignores the grouping and returns the entire column. To avoid that, we can create a row_number() column before rowwise() and then subset the columns with that index:
library(dplyr)
df %>%
mutate(rn = row_number()) %>% # create a sequence of row index
rowwise %>%
mutate(col4 = sd(c(.[[1]][rn[1]], .[[3]][rn[1]]))) %>% #extract with index
select(-rn)
#Source: local data frame [2 x 4]
#Groups: <by row>
# A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
#1 5 6 9 2.83
#2 2 4 9 4.95
Another option is map_dbl() from purrr, where we loop over row_number() and subset the corresponding element of each column:
library(purrr)
df %>%
mutate(col4 = map_dbl(row_number(), ~ sd(c(df[[1]][.x], df[[3]][.x]))))
# A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
#1 5 6 9 2.83
#2 2 4 9 4.95
Or another option is pmap (if we don't want to use row_number())
df %>%
mutate(col4 = pmap_dbl(.[c(1, 3)], ~ sd(c(...))))
# A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
#1 5 6 9 2.83
#2 2 4 9 4.95
Of course, the easiest way would be to use rowSds from matrixStats, as described in the post this question was marked as a duplicate of.
NOTE: None of the above methods requires any reshaping.
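As a further sketch, not part of the original answer and assuming dplyr 1.0.0 or later, c_across() accepts column positions inside a rowwise() pipeline, so the row-wise grouping is respected without an index column. The data frame is the one from the question; the last line reproduces the pooled value that .[[1]] and .[[3]] produced there.

```r
library(dplyr)

df <- tibble(col1 = c(5, 2), col2 = c(6, 4), col3 = c(9, 9))

# c_across() selects columns 1 and 3 by position within the current row
res <- df %>%
  rowwise() %>%
  mutate(col4 = sd(c_across(c(1, 3)))) %>%
  ungroup()

# For comparison: df[[1]] and df[[3]] are whole columns, so this is the
# standard deviation of all four values pooled together (about 3.40)
pooled <- sd(c(df[[1]], df[[3]]))
```

Calling ungroup() at the end drops the row-wise grouping so later steps operate on the whole table again.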

Since you don't necessarily know the column names, but know the positions of the columns for which you need standard deviation, etc., I'd reshape into long data and add an ID column. You can gather by position instead of column name, either by giving the numbers of the column that should become the key, or the numbers of the columns to omit from the key. That way, you don't need to specify those values by column because you'll have them all in one column already. Then you can join those summary values back to your original wide-shaped data.
library(dplyr)
library(tidyr)
df <- tibble(col1 = c(5, 2), col2 = c(6, 4), col3 = c(9, 9)) %>%
mutate(id = row_number())
df %>%
gather(key, value, 1, 3) %>% # gather columns 1 and 3 by position; id was already added above
group_by(id) %>%
summarise(sd = sd(value)) %>%
inner_join(df, by = "id")
#> # A tibble: 2 x 5
#> id sd col1 col2 col3
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2.83 5 6 9
#> 2 2 4.95 2 4 9
Rearrange columns by position as you need.
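The same reshaping idea can be sketched with pivot_longer(), which tidyr documents as the successor to gather(). This assumes tidyr 1.0.0 or later; the intermediate name row_sd is only for illustration.

```r
library(dplyr)
library(tidyr)

df <- tibble(col1 = c(5, 2), col2 = c(6, 4), col3 = c(9, 9)) %>%
  mutate(id = row_number())

# Reshape columns 1 and 3 (selected by position) into long form,
# then compute one standard deviation per original row
row_sd <- df %>%
  pivot_longer(cols = c(1, 3), names_to = "key", values_to = "value") %>%
  group_by(id) %>%
  summarise(sd = sd(value))

# Join the per-row summary back onto the wide data
res <- left_join(df, row_sd, by = "id")
```

Joining with df on the left keeps the original columns first, so less rearranging is needed afterwards.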

Another approach: transpose the data (which converts it to a matrix), compute the standard deviation of rows 1 and 3 for each column, append the result as a new row, transpose back, and convert to a tibble.
df %>%
t %>%
rbind(col4 = c(sd(.[c(1, 3),1]), sd(.[c(1, 3),2]))) %>%
t %>%
as_tibble()

Related

How to filter out groups empty for 1 column in Tidyverse

tibble(
A = c("A","A","B","B"),
x = c(NA,NA,NA,1),
y = c(1,2,3,4),
) %>% group_by(A) -> df
desired output:
tibble(
A = c("B","B"),
x = c(NA,1),
y = c(3,4),
)
I want to find all groups for which all elements of x, and of x only, are NA, then remove those groups. "B" is kept because it has at least one non-NA element.
I tried:
df %>%
filter(all(!is.na(x)))
but that filters out a group if it contains at least one NA. I need the right condition, and all(!is.na(x)) is not it.
This will remove groups of column A if all elements of x are NA:
library(dplyr)
df %>%
group_by(A) %>%
filter(! all(is.na(x)))
# A tibble: 2 × 3
# Groups: A [1]
# A x y
# <chr> <dbl> <dbl>
#1 B NA 3
#2 B 1 4
Note that group "A" was removed because both of its values in column x are NA.
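The difference between the two conditions can be checked on plain vectors. This is a small base R sketch; x_a and x_b are illustrative names holding the x values of groups "A" and "B" from the question.

```r
x_a <- c(NA, NA)  # group "A": every value is NA
x_b <- c(NA, 1)   # group "B": one value is present

# all(!is.na(x)) is TRUE only when no value is NA, which is too strict here
strict <- c(A = all(!is.na(x_a)), B = all(!is.na(x_b)))

# !all(is.na(x)) is TRUE when at least one value is not NA
keep <- c(A = !all(is.na(x_a)), B = !all(is.na(x_b)))

# any(!is.na(x)) is the same condition written with any()
keep_any <- c(A = any(!is.na(x_a)), B = any(!is.na(x_b)))
```

Here strict is FALSE for both groups, which is why the attempt in the question dropped "B" as well, while keep and keep_any are TRUE only for "B".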
We can use any with complete.cases
library(dplyr)
df %>%
group_by(A) %>%
filter(any(complete.cases(x))) %>%
ungroup
-output
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
In dplyr 1.1.0 and later, we can use .by in filter, so we don't need group_by/ungroup
df %>%
filter(any(complete.cases(x)), .by = 'A')
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4

how to pass column names including space in R

assume my column names are: User ID and name
how should I pass this column name to functions like what I have below?
df %>%
group_by(User ID) %>%
count(name)
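Apparently, group_by() and similar functions do not accept column names that contain spaces.
Wrap the column name in backticks. A tibble keeps names with spaces as they are, whereas data.frame() converts them to syntactic names (User.ID) unless you pass check.names = FALSE: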
library(tidyverse)
df <- tibble(`User ID` = 1:2, x = 5:6)
df %>%
group_by(`User ID`) %>%
summarise(total = sum(x))
#> # A tibble: 2 × 2
#> `User ID` total
#> <int> <int>
#> 1 1 5
#> 2 2 6
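The backtick approach also works for a plain data.frame, as long as the spaced name survives construction. A minimal sketch follows; the sample values are made up for illustration, and check.names = FALSE is what prevents the rename to User.ID.

```r
library(dplyr)

# check.names = FALSE stops data.frame() from renaming "User ID" to "User.ID"
df <- data.frame(
  `User ID` = c(1, 1, 2),
  name = c("a", "b", "a"),
  check.names = FALSE
)

# Backticks let the non-syntactic name be used like any other column
res <- df %>%
  group_by(`User ID`) %>%
  count(name)
```

If the data already comes from elsewhere (for example a file reader), names(df) shows whether the column is still called "User ID" or has been converted.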

`unnest_longer` two columns at once?

I have a dataframe with two sets of columns v and t amongst others.
library(tidyverse)
(df <-
tibble(id = 1,
v1 = 1, v2 = 2, v3 = 3,
t1 = "a", t2 = "b", t3 = "c"
)
)
#> # A tibble: 1 × 7
#> id v1 v2 v3 t1 t2 t3
#> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 1 1 2 3 a b c
I want my output to be three rows long. I think one way I can do this is by nesting the similar columns, and unnest_longer. But this is not allowed.
## unnest_longer can't handle multiple cols
df %>%
nest(v = c(v1, v2, v3),
t = c(t1, t2, t3)) %>%
unnest_longer(c("v", "t"))
#> Error: Must extract column with a single valid subscript.
#> x Subscript `var` has size 2 but must be size 1.
Is it possible to unnest_longer multiple columns at once?
According to ?unnest_longer, the function takes only a single column:
.col, col -
List-column to extract components from.
whereas the argument in unnest is cols (which can unnest more than one column)
Perhaps the OP wanted pivot_longer instead of nest/unnest, i.e. reshape to 'long' format by selecting every column except 'id' in cols, using the special .value in names_to, and capturing the non-digit prefix (\\D+) of each column name with names_pattern:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -id, names_to = ".value",
names_pattern = "^(\\D+).*")
# A tibble: 3 × 3
# id v t
# <dbl> <dbl> <chr>
#1 1 1 a
#2 1 2 b
#3 1 3 c
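If the numeric suffix should be kept as well, a variation on the call above can capture it in a second group. This is a sketch under the assumption that every column other than id follows the letters-then-digits pattern; the column name idx is chosen here for illustration.

```r
library(dplyr)
library(tidyr)

df <- tibble(id = 1, v1 = 1, v2 = 2, v3 = 3, t1 = "a", t2 = "b", t3 = "c")

# First capture group becomes the output column (.value),
# second capture group is stored in a new column called idx
res <- df %>%
  pivot_longer(
    cols = -id,
    names_to = c(".value", "idx"),
    names_pattern = "^(\\D+)(\\d+)$"
  )
```

The idx column is character because it is parsed from the column names; convert it with as.integer() if a numeric index is needed.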

applying function to each group using dplyr and return specified dataframe

I used group_map for the first time and think I do it correctly. This is my code:
library(REAT)
df <- data.frame(value = c(1,1,1, 1,0.5,0.1, 0,0,0,1), group = c(1,1,1, 2,2,2, 3,3,3,3))
haves <- df %>%
group_by(group) %>%
group_map(~gini(.x$value, coefnorm = TRUE))
The thing is that haves is a list rather than a data frame. What would I have to do to obtain this df
wants <- data.frame(group = c(1,2,3), gini = c(0,0.5625,1))
group gini
1 0.0000
2 0.5625
3 1.0000
Thanks!
You can use dplyr::summarize:
df %>%
group_by(group) %>%
summarize(gini = gini(value, coefnorm = TRUE))
#> # A tibble: 3 x 2
#> group gini
#> <dbl> <dbl>
#> 1 1 0
#> 2 2 0.562
#> 3 3 1
According to the documentation, group_map always produces a list. group_modify is an alternative that produces a tibble if the function does, but gini just outputs a vector. So, you could do something like this...
df %>%
group_by(group) %>%
group_modify(~tibble(gini = gini(.x$value, coefnorm = TRUE)))
# A tibble: 3 x 2
# Groups: group [3]
group gini
<dbl> <dbl>
1 1 0
2 2 0.562
3 3 1
Using data.table
library(data.table)
setDT(df)[, .(gini = gini(value, coefnorm = TRUE)), group]
For grouped datasets, we can use the .data pronoun if we don't want to refer to columns by unquoted name:
library(dplyr)
df %>%
group_by(group) %>%
summarize(gini = gini(.data$value, coefnorm = TRUE))
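To see how the list from group_map() relates to the data frame the question asks for, here is a sketch that avoids the REAT dependency. value_range() is a hypothetical stand-in for gini(): any function that returns a single number per group behaves the same way.

```r
library(dplyr)

df <- data.frame(value = c(1, 1, 1, 1, 0.5, 0.1, 0, 0, 0, 1),
                 group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

# Hypothetical stand-in for gini(): returns one number per group
value_range <- function(x) max(x) - min(x)

grouped <- df %>% group_by(group)

# group_map() returns a list with one element per group
as_list <- grouped %>% group_map(~ value_range(.x$value))

# Pair the list with the group keys to rebuild a data frame
from_list <- tibble(group = group_keys(grouped)$group,
                    range_width = unlist(as_list))

# summarise() produces the same table directly
direct <- grouped %>% summarise(range_width = value_range(value))
```

Rebuilding the table from the list works, but summarise() does the bookkeeping of group keys itself, which is why it is usually the simpler choice for one value per group.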

R: dplyr and row_number() does not enumerate as expected

I want to enumerate each record of a dataframe/tibble that results from a grouping. The index should follow a defined order. If I use row_number(), it does enumerate, but within each group. I want it to enumerate without considering the former grouping.
Here is an example. To make it simple I used the most minimal dataframe:
library(dplyr)
df0 <- data.frame( x1 = rep(LETTERS[1:2],each=2)
, x2 = rep(letters[1:2], 2)
, y = floor(abs(rnorm(4)*10))
)
df0
# x1 x2 y
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
Now, I group this table:
df1 <- df0 %>% group_by(x1,x2) %>% summarize(y=sum(y))
This gives me an object of class tibble:
# A tibble: 4 x 3
# Groups: x1 [?]
# x1 x2 y
# <fct> <fct> <dbl>
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
I want to add a row number to this table using row_number():
df2 <- df1 %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# A tibble: 4 x 4
# Groups: x1 [2]
# x1 x2 y index
# <fct> <fct> <dbl> <int>
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 1
# 4 B a 0 2
row_number() enumerates within the former grouping. This was not my intention. It can be avoided by converting the tibble to a data frame first:
df2 <- df2 %>% as.data.frame() %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# x1 x2 y index
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 3
# 4 B a 0 4
My question is: is this behaviour intended?
If yes: is it not very dangerous to incorporate former data processing into tibble? Which type of processing is incorporated?
At the moment I will convert tibble into dataframe to avoid this kind of unexpected results.
To elaborate on my comment: yes, retaining grouping is intended, and in many cases useful. It's only dangerous if you don't understand how group_by works—and that's true of any function. To undo group_by, you call ungroup.
Take a look at the group_by docs, as they're very thorough and explain how this function interacts with others, how grouping is layered, etc. The docs also explain how each call to summarise removes a layer of grouping—it might be there that you got confused about what's going on.
For example, you can group by x1 and x2, summarize y, and create a row number, which will give you the rows according to x1 (summarise removed a layer of grouping, i.e. drops the x2 grouping). Then ungrouping allows you to get row numbers based on the entire data frame.
library(dplyr)
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(group_row = row_number()) %>%
ungroup() %>%
mutate(all_df_row = row_number())
#> # A tibble: 4 x 5
#> x1 x2 y group_row all_df_row
#> <fct> <fct> <dbl> <int> <int>
#> 1 A a 12 1 1
#> 2 A b 2 2 2
#> 3 B a 10 1 3
#> 4 B b 23 2 4
A use case that I rely on at work probably every day is to get sums within multiple groups (again, x1 and x2), then to find the shares of those values within their larger group (after peeling away a layer of grouping, this is x1) with mutate. Again, I then ungroup to show the shares within the entire data frame instead.
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(share_in_group = y / sum(y)) %>%
ungroup() %>%
mutate(share_all_df = y / sum(y))
#> # A tibble: 4 x 5
#> x1 x2 y share_in_group share_all_df
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 A a 12 0.857 0.255
#> 2 A b 2 0.143 0.0426
#> 3 B a 10 0.303 0.213
#> 4 B b 23 0.697 0.489
Created on 2018-10-11 by the reprex package (v0.2.1)
As camille nicely showed, there are good reasons for wanting to have the result of summarize() retain additional layers of grouping and it's a documented behaviour so not really dangerous or unexpected per se.
However one additional tip is that if you are just going to call ungroup() after summarize() you might as well use summarize(.groups = "drop") which will return an ungrouped tibble and save you a line of code.
library(tidyverse)
df0 <- data.frame(
x1 = rep(LETTERS[1:2], each = 2),
x2 = rep(letters[1:2], 2),
y = floor(abs(rnorm(4) * 10))
)
df0 %>%
group_by(x1,x2) %>%
summarize(y=sum(y), .groups = "drop") %>%
arrange(desc(y)) %>%
mutate(index = row_number())
#> # A tibble: 4 x 4
#> x1 x2 y index
#> <chr> <chr> <dbl> <int>
#> 1 A b 8 1
#> 2 A a 2 2
#> 3 B a 2 3
#> 4 B b 1 4
Created on 2022-02-06 by the reprex package (v2.0.1)
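The leftover grouping discussed in these answers can be inspected directly with group_vars(). The sketch below uses fixed y values (the ones printed for df0 in the question) instead of rnorm(), so the result is reproducible; the names within_x1 and overall are illustrative.

```r
library(dplyr)

# Fixed y values taken from the question's printed df0
df0 <- data.frame(x1 = rep(LETTERS[1:2], each = 2),
                  x2 = rep(letters[1:2], 2),
                  y  = c(12, 24, 0, 12))

# "drop_last" spells out the behaviour described above: only x2 is dropped
df1 <- df0 %>%
  group_by(x1, x2) %>%
  summarise(y = sum(y), .groups = "drop_last")

group_vars(df1)           # the grouping that survives summarise()
group_vars(ungroup(df1))  # nothing is left after ungroup()

# row_number() restarts within each x1 group while the grouping is present
within_x1 <- df1 %>% arrange(desc(y)) %>% mutate(index = row_number())

# After ungroup(), row_number() runs over the whole table
overall <- df1 %>% ungroup() %>% arrange(desc(y)) %>% mutate(index = row_number())
```

Checking group_vars() before a mutate() is a cheap way to confirm which grouping, if any, a windowed function such as row_number() will respect.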
