Dplyr recursively grow a dataframe

I have the following values
values <- seq(1,3)
I would like to have the resulting dataframe
# A tibble: 6 x 2
facet values
<dbl> <dbl>
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
The facet column records the iteration of the recursive append: facet k holds the first k values.
My current solution is:
facet1 <- values %>% head(1) %>% tibble(facet = 1, values = .)
facet2 <- values %>% head(2) %>% tibble(facet = 2, values = .)
facet3 <- values %>% head(3) %>% tibble(facet = 3, values = .)
bind_rows(facet1, facet2, facet3)
A more general solution is needed [Edit]
The current solutions will not work for my use case because they exploit the fact that, in my previous example, the values inside each facet happen to equal the sequence 1, 2, ..., facet.
Here is a more general reproducible example where the values are unrelated to the facets.
set.seed(42)
values <- rnorm(3,0,.2)
df_recursive <- tribble(~facet, ~values,
1, 0.27,
2, 0.27,
2, -0.11,
3, 0.27,
3, -0.11,
3, 0.07)
# A tibble: 6 x 2
facet values
<dbl> <dbl>
1 1 0.27
2 2 0.27
3 2 -0.11
4 3 0.27
5 3 -0.11
6 3 0.07

Here is an option:
library(tidyverse)
values <- seq(1,3)
map_dfr(values, ~tibble(facet = .x, values = 1:.x))
#> # A tibble: 6 x 2
#> facet values
#> <int> <int>
#> 1 1 1
#> 2 2 1
#> 3 2 2
#> 4 3 1
#> 5 3 2
#> 6 3 3
EDIT:
These approaches can all be adapted to your use case. For example:
set.seed(42)
values <- rnorm(3,0,.2)
map_dfr(1:length(values), ~tibble(facet = .x, values = values[1:.x]))
#> # A tibble: 6 x 2
#> facet values
#> <int> <dbl>
#> 1 1 0.274
#> 2 2 0.274
#> 3 2 -0.113
#> 4 3 0.274
#> 5 3 -0.113
#> 6 3 0.0726
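The adapted call above can also be wrapped in a small function. This is only a sketch: grow_prefixes is a made-up helper name, and it swaps map_dfr for lapply plus bind_rows. It uses seq_along(values) rather than 1:length(values), because for an empty vector 1:length(values) iterates over c(1, 0) and would produce spurious NA rows.

```r
library(dplyr)

# grow_prefixes() is a hypothetical helper name, not part of any package.
# Facet i holds the first i elements of `values`.
grow_prefixes <- function(values) {
  pieces <- lapply(seq_along(values), function(i) {
    tibble(facet = i, values = values[seq_len(i)])
  })
  bind_rows(pieces)
}

set.seed(42)
vals <- rnorm(3, 0, .2)
out <- grow_prefixes(vals)            # 6 rows: facets 1, 2, 2, 3, 3, 3
empty <- grow_prefixes(numeric(0))    # zero rows instead of NA rows
```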

One possible solution:
library(dplyr)
tibble(facet=1:3) %>%
group_by(facet) %>%
summarise(values = seq_len(facet)) %>%
ungroup()
# A tibble: 6 x 2
facet values
<int> <int>
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
library(data.table)
data.table(facet=1:3)[, .(values = seq_len(facet)), by=facet]
facet values
<int> <int>
1: 1 1
2: 2 1
3: 2 2
4: 3 1
5: 3 2
6: 3 3
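A note on the dplyr version: since dplyr 1.1.0, returning more than one row per group from summarise() is deprecated, and reframe() is the intended replacement. A minimal sketch, assuming dplyr >= 1.1.0 and using the arbitrary values from the question's edit (vals is just an illustrative variable name):

```r
library(dplyr)

set.seed(42)
vals <- rnorm(3, 0, .2)

# reframe() may return any number of rows per group and drops the grouping,
# so no ungroup() is needed afterwards.
res <- tibble(facet = seq_along(vals)) %>%
  group_by(facet) %>%
  reframe(values = vals[seq_len(facet)])
```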

A base R solution, toying around with rep and sequence:
v <- seq(1, 3)
data.frame(facet = rep(v, v), values = sequence(v))
facet values
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
This can be adapted to any vector:
set.seed(42)
values <- rnorm(3, 0, .2)
v <- seq_along(values)
data.frame(facet = rep(v, v), values = values[sequence(v)])
facet values
1 1 0.27419169
2 2 0.27419169
3 2 -0.11293963
4 3 0.27419169
5 3 -0.11293963
6 3 0.07262568
And a base R version of AndS.'s solution:
do.call(rbind.data.frame, lapply(v, \(x) data.frame(facet = x, values = seq(x))))

Related

Is there a way to get subdataframes with purrr in magrittr pipes workflow without using data.frame name?

That is, I was interested in doing the same as in the example, but with purrr functions.
tibble(a, b = a * 2, c = 1) %>%
{lapply(X = names(.), FUN = function(.x) select(., 1:.x))}
[[1]]
# A tibble: 5 x 1
a
<int>
1 1
2 2
3 3
4 4
5 5
[[2]]
# A tibble: 5 x 2
a b
<int> <dbl>
1 1 2
2 2 4
3 3 6
4 4 8
5 5 10
[[3]]
# A tibble: 5 x 3
a b c
<int> <dbl> <dbl>
1 1 2 1
2 2 4 1
3 3 6 1
4 4 8 1
5 5 10 1
I could only do it by first naming the tibble (foo <- tibble(a, b = a * 2, c = 1)) and calling select(foo, ...) inside map, but I wanted to avoid that, since I want to keep mutating the data frame inside the pipe workflow.
Thank you!

You can use map in the following way :
library(dplyr)
library(purrr)
tibble(a = 1:5, b = a * 2, c = 1) %>%
{map(names(.), function(.x) select(., 1:.x))}
Depending on your actual use case, you can also use imap, which passes each column's values (.x) along with its name (.y).
tibble(a = 1:5, b = a * 2, c = 1) %>%
imap(function(.x, .y) select(., 1:.y))
#$a
# A tibble: 5 x 1
# a
# <int>
#1 1
#2 2
#3 3
#4 4
#5 5
#$b
# A tibble: 5 x 2
# a b
# <int> <dbl>
#1 1 2
#2 2 4
#3 3 6
#4 4 8
#5 5 10
#$c
# A tibble: 5 x 3
# a b c
# <int> <dbl> <dbl>
#1 1 2 1
#2 2 4 1
#3 3 6 1
#4 4 8 1
#5 5 10 1
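If you would rather not rely on a column name appearing on the right-hand side of 1:.x, the same idea can be sketched with column positions. This is only an illustration, not part of the answer above; subs is an arbitrary variable name.

```r
library(dplyr)
library(purrr)

# Iterate over column positions; all_of() makes it explicit that
# seq_len(i) is an external vector of locations, not a column.
subs <- tibble(a = 1:5, b = a * 2, c = 1) %>%
  {map(seq_along(.), function(i) select(., all_of(seq_len(i))))}

# subs[[1]] has column a, subs[[2]] has a and b, subs[[3]] has a, b and c
```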

How to create combinations of values of one variable by group using tidyverse in R

I am using the combn function in R to get all the combinations of the values of variable y taking each time 2 values, grouping by the values of x. My expected final result is the tibble c.
But when I try to do it in tidyverse something is (very) wrong.
library(tidyverse)
df <- tibble(x = c(1, 1, 1, 2, 2, 2, 2),
y = c(8, 9, 7, 3, 5, 2, 1))
# This is what I want
a <- combn(df$y[df$x == 1], 2)
a <- rbind(a, rep(1, ncol(a)))
b <- combn(df$y[df$x == 2], 2)
b <- rbind(b, rep(2, ncol(b)))
c <- cbind(a, b)
c <- tibble(c)
c <- t(c)
# but using tidyverse it does not work
df %>% group_by(x) %>% mutate(z = combn(y, 2))
#> Error: Problem with `mutate()` input `z`.
#> x Input `z` can't be recycled to size 3.
#> i Input `z` is `combn(y, 2)`.
#> i Input `z` must be size 3 or 1, not 2.
#> i The error occurred in group 1: x = 1.
Created on 2020-11-18 by the reprex package (v0.3.0)

Try with combn
out = df %>% group_by(x) %>% do(data.frame(t(combn(.$y, 2))))
# A tibble: 9 x 3
# Groups: x [2]
x X1 X2
<dbl> <dbl> <dbl>
1 1 8 9
2 1 8 7
3 1 9 7
4 2 3 5
5 2 3 2
6 2 3 1
7 2 5 2
8 2 5 1
9 2 2 1

If you have dplyr v1.0.2, you can do this
df %>% group_by(x) %>% group_modify(~as_tibble(t(combn(.$y, 2L))))
Output
# A tibble: 9 x 3
# Groups: x [2]
x V1 V2
<dbl> <dbl> <dbl>
1 1 8 9
2 1 8 7
3 1 9 7
4 2 3 5
5 2 3 2
6 2 3 1
7 2 5 2
8 2 5 1
9 2 2 1

An option with summarise and unnest
library(dplyr)
library(tidyr)
df %>%
group_by(x) %>%
summarise(y = list(as.data.frame(t(combn(y, 2)))), .groups = 'drop') %>%
unnest(c(y))
# A tibble: 9 x 3
# x V1 V2
# <dbl> <dbl> <dbl>
#1 1 8 9
#2 1 8 7
#3 1 9 7
#4 2 3 5
#5 2 3 2
#6 2 3 1
#7 2 5 2
#8 2 5 1
#9 2 2 1
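For reference, the mutate() call in the question fails because mutate() has to return one value per row of the group (3 for x = 1), whereas combn(y, 2) returns a matrix with 2 rows and choose(n, 2) columns. A base R sketch of the same split-apply-combine idea follows; the column names first and second are arbitrary choices, not something the question prescribes.

```r
df <- data.frame(x = c(1, 1, 1, 2, 2, 2, 2),
                 y = c(8, 9, 7, 3, 5, 2, 1))

# For each group of x, transpose combn() so that every pair becomes a row.
pairs <- do.call(rbind, lapply(split(df, df$x), function(g) {
  m <- t(combn(g$y, 2))
  data.frame(x = g$x[1], first = m[, 1], second = m[, 2])
}))
rownames(pairs) <- NULL   # drop the "1.1", "1.2", ... names added by rbind
```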

counting the number of observations row wise using dplyr

I have a dataset that looks like this:
sample <- tibble(x = c (1,2,3,NA), y = c (5, NA,2, NA))
sample
# A tibble: 4 x 2
x y
<dbl> <dbl>
1 1 5
2 2 NA
3 3 2
4 NA NA
Now I want to create a new variable z that counts how many non-missing observations there are in each row. For the sample dataset above, the first value of z should be 2 because both x and y have values. Similarly, for the 2nd row z is 1 because one value is missing, and for the 4th row z is 0 because the row has no observations at all.
The expected dataset looks like this:
x y z
<dbl> <dbl> <dbl>
1 1 5 2
2 2 NA 1
3 3 2 2
4 NA NA 0
I want to do this for a small subset of variables, not the whole dataset.

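Using base R. The first line counts across all columns, the second across columns selected by name, and the third adds up per-column checks, which becomes unwieldy when the number of columns is substantial. Note that is.finite() additionally excludes Inf and -Inf, which !is.na() would count as observations.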
sample$z1 <- rowSums(!is.na(sample))
sample$z2 <- rowSums(!is.na(sample[c("x", "y")]))
sample$z3 <- is.finite(sample$x) + is.finite(sample$y)
> sample
# A tibble: 4 x 5
x y z1 z2 z3
<dbl> <dbl> <dbl> <dbl> <int>
1 1 5 2 2 2
2 2 NA 1 1 1
3 3 2 2 2 2
4 NA NA 0 0 0

We can use
library(dplyr)
sample %>%
rowwise %>%
mutate(z = sum(!is.na(cur_data()))) %>%
ungroup
-output
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <int>
#1 1 5 2
#2 2 NA 1
#3 3 2 2
#4 NA NA 0
To count over selected columns only:
sample %>%
rowwise %>%
mutate(z = sum(!is.na(select(cur_data(), x:y))))
Or with rowSums on a logical matrix
sample %>%
mutate(z = rowSums(!is.na(cur_data())))
-output
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 5 2
#2 2 NA 1
#3 3 2 2
#4 NA NA 0
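As a side note, cur_data() is deprecated from dplyr 1.1.0 onwards. A sketch of the selected-columns case that avoids it uses c_across(), available since dplyr 1.0.0. The extra column w and the name obs are made up here, only to show that unselected columns are ignored.

```r
library(dplyr)

# w is a hypothetical extra column that should NOT be counted
obs <- tibble(x = c(1, 2, 3, NA), y = c(5, NA, 2, NA), w = c(NA, 1, NA, 1))

res <- obs %>%
  rowwise() %>%
  mutate(z = sum(!is.na(c_across(c(x, y))))) %>%   # count non-NA among x and y
  ungroup()
```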

apply function with selected columns example:
set.seed(7)
vals <- sample(c(1:20, NA, NA), 20)
sample <- matrix(vals, ncol = 5)
# Select columns 1, 3, 4
cols <- c(1, 3, 4)
rowcnts <- apply(sample[ , cols], 1, function(x) length(x[!is.na(x)]))
sample <- cbind(sample, rowcnts)
> sample
rowcnts
[1,] 10 15 16 NA 12 2
[2,] 19 8 14 18 9 3
[3,] 7 17 6 4 1 3
[4,] 2 3 13 NA 5 2

R max function returns pseudo values when used within 'dplyr'

I used R's max function in combination with the summarise function from the dplyr package and had a typo in the max function's argument na.rm.
Mistakenly I wrote ns.rm = T and the script worked without any warning message and returned wrong values.
When I replace na.rm with ns.rm on a simple vector (outside dplyr), the function returns the right value, and if the input vector contains NA it returns NA, again without any warning that an unknown argument was used.
Here is an example:
if(!require('magrittr')) install.packages('magrittr')
if(!require('dplyr')) install.packages('dplyr')
tab <- data.frame("grp1" = sort(rep(1:4, 5)),
"grp2" = rep(c(1:2), 10),
"val" = rnorm(20))
tab
grp1 grp2 val
1 1 1 0.03536351
2 1 2 1.04237251
3 1 1 0.82735937
4 1 2 0.29040424
5 1 1 0.30194926
6 2 2 -0.96649026
7 2 1 -0.97388257
8 2 2 -0.13111541
9 2 1 -0.48337864
10 2 2 -0.73471857
11 3 1 -0.88536656
12 3 2 -1.30442575
13 3 1 1.18816751
14 3 2 -0.90334058
15 3 1 -0.53102641
16 4 2 -0.69266762
17 4 1 -0.64776312
18 4 2 0.01354644
19 4 1 0.78058285
20 4 2 -0.06647959
### Using max function within dplyr
## Right way
tab %>%
group_by(grp1, grp2) %>%
summarise("max_val" = max(val, na.rm = T))
# A tibble: 8 x 3
# Groups: grp1 [4]
grp1 grp2 max_val
<int> <int> <dbl>
1 1 1 0.827
2 1 2 1.04
3 2 1 -0.483
4 2 2 -0.131
5 3 1 1.19
6 3 2 -0.903
7 4 1 0.781
8 4 2 0.0135
## with a typo in na.rm argument
tab %>%
group_by(grp1, grp2) %>%
summarise("max_val" = max(val, ns.rm = T))
# A tibble: 8 x 3
# Groups: grp1 [4]
grp1 grp2 max_val
<int> <int> <dbl>
1 1 1 1
2 1 2 1.04
3 2 1 1
4 2 2 1
5 3 1 1.19
6 3 2 1
7 4 1 1
8 4 2 1
### Using max function on a vector
max(c(1, 2, 3), ns.rm = T)
[1] 3
max(c(1, 2, 3), ns.rm = T)
[1] 3
max(c(1, 2, 3), na.rm = T)
[1] 3
max(c(1, 2, 3, NA), ns.rm = T)
[1] NA
max(c(1, 2, 3, NA), na.rm = T)
[1] 3
Does anybody know if ns.rm is a legitimate input argument of any R function?
If not, why is there no warning that an invalid argument was supplied?

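No, ns.rm is not a legitimate argument. The signature of max is max(..., na.rm = FALSE), so any argument whose name does not match na.rm exactly is absorbed by ... and treated as one more value to compare. Here ns.rm = T is added to the values passed to max, and T is coerced to 1.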
max(c(1, 2, 3), ns.rm = T)
#[1] 3
is actually interpreted as
max(c(1, 2, 3), 1)
#[1] 3
and
max(c(0.1, 0.2, 0.33), ns.rm = T)
#[1] 1
is interpreted as
max(c(0.1, 0.2, 0.33), 1)
and
max(c(1, 2, 3, NA), ns.rm = T)
#[1] NA
is actually
max(c(1, 2, 3, NA), 1)
#[1] NA
Similarly, for the dataframe
set.seed(123)
tab <- data.frame(grp1 = sort(rep(1:4, 5)),
grp2 = rep(c(1:2), 10),
val = rnorm(20))
By using the right way, we get numbers as
library(dplyr)
tab %>% group_by(grp1, grp2) %>% summarise(max_val = max(val, na.rm = T))
# grp1 grp2 max_val
# <int> <int> <dbl>
#1 1 1 1.56
#2 1 2 0.0705
#3 2 1 0.461
#4 2 2 1.72
#5 3 1 1.22
#6 3 2 0.360
#7 4 1 0.701
#8 4 2 1.79
Now if we use ns.rm = T
tab %>% group_by(grp1, grp2) %>% summarise(max_val = max(val, ns.rm = T))
# grp1 grp2 max_val
# <int> <int> <dbl>
#1 1 1 1.56
#2 1 2 1
#3 2 1 1
#4 2 2 1.72
#5 3 1 1.22
#6 3 2 1
#7 4 1 1
#8 4 2 1.79
Notice that every group whose max_val was less than 1 above now shows 1 when ns.rm is used, because T is interpreted as 1 and becomes the maximum.
Also note that this is not limited to ns.rm; any argument name other than na.rm behaves the same way.
max(c(0.1, 0.2, 0.33), a = T)
#[1] 1
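One way to guard against this kind of typo is a thin wrapper that takes a single vector instead of `...`, so R itself rejects unknown argument names. This is only a sketch; strict_max is a made-up name, not an existing function.

```r
# The misspelled argument is swallowed by `...` and compared as the value 1
typo_result <- max(c(0.1, 0.2, 0.33), ns.rm = TRUE)

# strict_max() is a hypothetical helper: it has no `...`,
# so a misspelled argument name is an error rather than extra data.
strict_max <- function(x, na.rm = FALSE) max(x, na.rm = na.rm)

ok <- strict_max(c(1, 2, 3, NA), na.rm = TRUE)

# Capture the error message instead of stopping the script
typo_error <- tryCatch(
  strict_max(c(0.1, 0.2, 0.33), ns.rm = TRUE),
  error = function(e) conditionMessage(e)
)
```

Inside summarise() you would then write max_val = strict_max(val, na.rm = TRUE), and a typo in the argument name would stop the pipeline instead of silently changing the result.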

adding grouping indicator for repeating sequences

I thought this would be simple, but I failed and can't find an answer anywhere.
Example data looks like this. I have nro running from 1 to x and restarting at random points. I would like to create an ind variable that is 1 for the first run, 2 for the second, and so on.
tbl <- tibble(nro = c(rep(1:3, 1), rep(1:5, 1), rep(1:4, 1)))
End result should look like this:
tibble(nro = c(rep(1:3, 1), rep(1:5, 1), rep(1:4, 1)),
ind = c(rep(1, 3), rep(2, 5), rep(3, 4)))
# A tibble: 12 x 2
nro ind
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 4 2
8 5 2
9 1 3
10 2 3
11 3 3
12 4 3
I thought I could do something with ifelse, but it failed miserably:
tbl %>%
mutate(ind = ifelse(nro < lag(nro), 1 + lag(ind), 1))
I assume this needs some kind of loop.

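For sequences of the same length
If every run has the same length (the output below uses three runs of 1:4 rather than the tbl from the question), you can group_by your nro variable and then take the row_number():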
tbl %>%
group_by(nro) %>%
mutate(ind = row_number())
# A tibble: 12 x 2
# Groups: nro [4]
# nro ind
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 1 2
# 6 2 2
# 7 3 2
# 8 4 2
# 9 1 3
# 10 2 3
# 11 3 3
# 12 4 3
For sequences of varying length
Inspired by docendo discimus's comment:
tbl <- tibble(nro = c(rep(1:3, 1), rep(1:5, 1), rep(1:4, 1)))
tbl %>%
mutate(ind = cumsum(nro == 1))
However, this only works for sequences that begin with 1, since only the TRUE values of nro == 1 are accumulated.
For the general case, consider this instead:
tbl %>% mutate(dif = nro - lag(nro)) %>%
mutate(dif = ifelse(is.na(dif), nro, dif)) %>%
mutate(ind = cumsum(dif < 0) + 1) %>%
select(-dif)
# A tibble: 12 x 2
# nro ind
# <int> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 4 2
# 8 5 2
# 9 1 3
# 10 2 3
# 11 3 3
# 12 4 3
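The lag-based pipeline above can also be condensed with diff(). This is a sketch under the same assumption: a new run starts wherever the value drops, so a run that restarts at a value greater than or equal to the last value of the previous run is not detected. It also assumes a non-empty vector.

```r
nro <- c(1:3, 1:5, 1:4)

# TRUE for the first element and wherever the value decreases;
# the running sum of those flags is the run index.
ind <- cumsum(c(TRUE, diff(nro) < 0))

# Runs do not have to start at 1
ind_offset <- cumsum(c(TRUE, diff(c(2:4, 3:6)) < 0))
```

Inside a pipe this becomes tbl %>% mutate(ind = cumsum(c(TRUE, diff(nro) < 0))).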
