I came across a problem that forced me to use a loop instead of my preferred dplyr pipe flow.
I want to group rows based on consecutive observations of the same value.
For example, if the first four observations of type equal a, those four observations should be assigned to the same group. Order matters, so I can't use dplyr::group_by and dplyr::summarize.
The code below should explain the problem fairly well. I was wondering if anyone could propose a less verbose way to do this, preferably using tidyverse packages rather than data.table.
library(tidyverse)
# Create some test data
df <- tibble(
id = 1:20,
type = c(rep("a", 5), rep("b", 5), rep("a", 5), rep("b", 5)),
val = runif(20)
)
df
#> # A tibble: 20 x 3
#> id type val
#> <int> <chr> <dbl>
#> 1 1 a 0.0606
#> 2 2 a 0.501
#> 3 3 a 0.974
#> 4 4 a 0.0833
#> 5 5 a 0.752
#> 6 6 b 0.0450
#> 7 7 b 0.367
#> 8 8 b 0.649
#> 9 9 b 0.846
#> 10 10 b 0.896
#> 11 11 a 0.178
#> 12 12 a 0.295
#> 13 13 a 0.206
#> 14 14 a 0.233
#> 15 15 a 0.851
#> 16 16 b 0.179
#> 17 17 b 0.801
#> 18 18 b 0.326
#> 19 19 b 0.269
#> 20 20 b 0.584
# Solve problem with a loop
count <- 1
df$consec_group <- NA
for (i in 1:nrow(df)) {
current <- df$type[i]
lag <- ifelse(i == 1, NA, df$type[i - 1])
lead <- ifelse(i == nrow(df), NA, df$type[i + 1])
if (lead %>% is.na) {
df$consec_group[i] <- ifelse(current == lag, count, count + 1)
} else {
df$consec_group[i] <- count
if (current != lead) count <- count + 1
}
}
df
#> # A tibble: 20 x 4
#> id type val consec_group
#> <int> <chr> <dbl> <dbl>
#> 1 1 a 0.0606 1
#> 2 2 a 0.501 1
#> 3 3 a 0.974 1
#> 4 4 a 0.0833 1
#> 5 5 a 0.752 1
#> 6 6 b 0.0450 2
#> 7 7 b 0.367 2
#> 8 8 b 0.649 2
#> 9 9 b 0.846 2
#> 10 10 b 0.896 2
#> 11 11 a 0.178 3
#> 12 12 a 0.295 3
#> 13 13 a 0.206 3
#> 14 14 a 0.233 3
#> 15 15 a 0.851 3
#> 16 16 b 0.179 4
#> 17 17 b 0.801 4
#> 18 18 b 0.326 4
#> 19 19 b 0.269 4
#> 20 20 b 0.584 4
Created on 2019-03-14 by the reprex package (v0.2.1)
This grouping of consecutive type occurrences is really just an intermediate step. My end goal is to manipulate val for a given consec_group, based on the values of val that occurred within the previous consec_group. Advice on relevant packages would be appreciated.
You say "no data.tables", but are you sure? It's so *** fast and easy (in this case)...
library(data.table)
setDT(df)[, groupid := rleid(type)][]
# id type val groupid
# 1: 1 a 0.624078793 1
# 2: 2 a 0.687361541 1
# 3: 3 a 0.817702740 1
# 4: 4 a 0.669857208 1
# 5: 5 a 0.100977936 1
# 6: 6 b 0.418275823 2
# 7: 7 b 0.660119857 2
# 8: 8 b 0.876015209 2
# 9: 9 b 0.473562143 2
# 10: 10 b 0.284474633 2
# 11: 11 a 0.034154862 3
# 12: 12 a 0.391760387 3
# 13: 13 a 0.383107868 3
# 14: 14 a 0.729583433 3
# 15: 15 a 0.006288375 3
# 16: 16 b 0.530179235 4
# 17: 17 b 0.802643704 4
# 18: 18 b 0.409618633 4
# 19: 19 b 0.309363642 4
# 20: 20 b 0.021918512 4
If you insist on using the tidyverse/dplyr, you can (of course) still use the rleid() function as follows:
df %>% mutate( groupid = data.table::rleid(type) )
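If you would rather not depend on data.table at all, the same run ids can be built from dplyr's own lag() and cumsum(). This is only a minimal sketch, not the method from the answer above: a new run starts wherever type differs from the previous row, and the running count of those change points is the group id. It assumes type contains no NA values (an NA would propagate through cumsum()). Recent dplyr releases (1.1.0 and later) also provide consecutive_id() for this purpose.

```r
library(dplyr)

# Same layout as the question's test data (val omitted, it is not needed here)
df <- tibble(
  id   = 1:20,
  type = c(rep("a", 5), rep("b", 5), rep("a", 5), rep("b", 5))
)

res <- df %>%
  mutate(
    # TRUE wherever the value changes from the previous row; the first row
    # is compared with itself, so it never counts as a change
    changed = type != lag(type, default = first(type)),
    # running number of changes seen so far, shifted so ids start at 1
    groupid = cumsum(changed) + 1L
  ) %>%
  select(-changed)

res
```

The helper column changed is purely illustrative and is dropped again at the end.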
Benchmarks on a larger sample:
library(tidyverse)
library(data.table)
# Create some large test data
df <- tibble(
id = 1:200000,
type = sample(letters[1:26], 200000, replace = TRUE),
val = runif(200000)
)
dt <- as.data.table(df)
microbenchmark::microbenchmark(
dplyr.rleid = df %>% mutate( groupid = data.table::rleid(type) ),
data.table.rleid = dt[, groupid := rleid(type)][],
rle = df %>% mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)}),
rle2 = df %>% mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths))),
transform = transform(df, ID = with(rle(df$type), rep(seq_along(lengths), lengths))),
times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr.rleid 3.153626 3.278049 3.410363 3.444949 3.502792 3.582626 10
# data.table.rleid 2.965639 3.065959 3.173992 3.145643 3.259672 3.507009 10
# rle 13.059774 14.042797 24.364176 26.126176 29.460561 36.874054 10
# rle2 12.641319 13.553846 30.951152 24.698338 34.139786 102.791719 10
# transform 12.330717 22.419128 22.725242 25.532084 26.187634 26.702794 10
You can emulate rleid() with base R's rle() like this:
df %>%
mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)})
id type val ID_rleid
<int> <chr> <dbl> <int>
1 1 a 0.0430 1
2 2 a 0.858 1
3 3 a 0.504 1
4 4 a 0.318 1
5 5 a 0.469 1
6 6 b 0.144 2
7 7 b 0.173 2
8 8 b 0.0706 2
9 9 b 0.958 2
10 10 b 0.557 2
11 11 a 0.358 3
12 12 a 0.973 3
13 13 a 0.982 3
14 14 a 0.177 3
15 15 a 0.599 3
16 16 b 0.627 4
17 17 b 0.454 4
18 18 b 0.682 4
19 19 b 0.690 4
20 20 b 0.713 4
Or a modification (originally proposed by @d.b) that makes it handier:
df %>%
mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths)))
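Once the run ids exist, the end goal stated in the question (changing val in one run based on val in the previous run) could be approached by summarising each run, shifting the summary with lag(), and joining it back. The sketch below is just one possible reading of that goal: the choice of the mean as the summary, the seed, and the names run_stats, prev_run_mean and val_vs_prev are all assumptions made for illustration.

```r
library(dplyr)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
df <- tibble(
  id   = 1:20,
  type = rep(rep(c("a", "b"), each = 5), 2),
  val  = runif(20)
)

# Label runs of identical consecutive values (the rle() idiom from above)
df <- df %>%
  mutate(groupid = with(rle(type), rep(seq_along(lengths), lengths)))

# One row per run, with the previous run's mean attached via lag()
run_stats <- df %>%
  group_by(groupid) %>%
  summarise(run_mean = mean(val)) %>%
  mutate(prev_run_mean = lag(run_mean))

# Join back and use the previous run's mean in a row-level calculation
res <- df %>%
  left_join(run_stats, by = "groupid") %>%
  mutate(val_vs_prev = val - prev_run_mean)

res
```

Rows in the first run have no predecessor, so prev_run_mean and val_vs_prev are NA there.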
Related
I have the following dataframe called df (dput below):
group value
1 A 4
2 A 2
3 A 4
4 A 3
5 A 1
6 A 5
7 B 3
8 B 2
9 B 1
10 B 2
11 B 2
12 B 2
I would like to calculate the percentage of values on the mode value per group. Here is the code to calculate the mode per group:
# Mode function
mode <- function(codes){
which.max(tabulate(codes))
}
library(dplyr)
# Calculate mode per group
df %>%
group_by(group) %>%
mutate(mode_value = mode(value))
#> # A tibble: 12 × 3
#> # Groups: group [2]
#> group value mode_value
#> <chr> <dbl> <int>
#> 1 A 4 4
#> 2 A 2 4
#> 3 A 4 4
#> 4 A 3 4
#> 5 A 1 4
#> 6 A 5 4
#> 7 B 3 2
#> 8 B 2 2
#> 9 B 1 2
#> 10 B 2 2
#> 11 B 2 2
#> 12 B 2 2
Created on 2022-11-28 with reprex v2.0.2
But I am not sure how to calculate the percentage of values on the mode per group which should look like this:
group value mode_value perc_on_mode
1 A 4 4 0.33
2 A 2 4 0.33
3 A 4 4 0.33
4 A 3 4 0.33
5 A 1 4 0.33
6 A 5 4 0.33
7 B 3 2 0.67
8 B 2 2 0.67
9 B 1 2 0.67
10 B 2 2 0.67
11 B 2 2 0.67
12 B 2 2 0.67
So I was wondering if anyone knows how to calculate the percentage of values on the mode value per group?
dput of df:
df <- structure(list(group = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), value = c(4, 2, 4, 3, 1, 5, 3, 2, 1, 2,
2, 2)), class = "data.frame", row.names = c(NA, -12L))
You could try:
df %>%
group_by(group) %>%
mutate(mode_value = mode(value),
perc_on_mode = mean(value == mode_value))
Output:
# A tibble: 12 x 4
# Groups: group [2]
group value mode_value perc_on_mode
<chr> <dbl> <int> <dbl>
1 A 4 4 0.333
2 A 2 4 0.333
3 A 4 4 0.333
4 A 3 4 0.333
5 A 1 4 0.333
6 A 5 4 0.333
7 B 3 2 0.667
8 B 2 2 0.667
9 B 1 2 0.667
10 B 2 2 0.667
11 B 2 2 0.667
12 B 2 2 0.667
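One caveat worth noting: which.max(tabulate(codes)) only returns the right mode when the values are positive whole numbers, because tabulate() counts by integer bin. A hedged sketch of a more general helper is below; the name stat_mode is hypothetical, and ties are resolved in favour of the value that appears first.

```r
library(dplyr)

df <- data.frame(
  group = rep(c("A", "B"), each = 6),
  value = c(4, 2, 4, 3, 1, 5, 3, 2, 1, 2, 2, 2)
)

# Hypothetical helper: most frequent value of x (first one wins on ties).
# match() maps each value to its position in unique(x), so tabulate()
# counts occurrences without requiring positive integer input.
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

res <- df %>%
  group_by(group) %>%
  mutate(
    mode_value   = stat_mode(value),
    perc_on_mode = mean(value == mode_value)  # share of rows equal to the mode
  ) %>%
  ungroup()

res
```

Because stat_mode() returns an element of x itself, it also works for character or non-integer numeric columns.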
By modifying the mode function:
mode <- function(codes){
tab <- tabulate(codes)
mode_value <- which.max(tab)
data.frame(value = codes, mode_value, perc_on_mode = tab[mode_value]/length(codes))
}
# Calculate mode per group
df %>%
group_by(group) %>%
do(mode(.$value))
#> # A tibble: 12 x 4
#> # Groups: group [2]
#> group value mode_value perc_on_mode
#> <chr> <dbl> <int> <dbl>
#> 1 A 4 4 0.333
#> 2 A 2 4 0.333
#> 3 A 4 4 0.333
#> 4 A 3 4 0.333
#> 5 A 1 4 0.333
#> 6 A 5 4 0.333
#> 7 B 3 2 0.667
#> 8 B 2 2 0.667
#> 9 B 1 2 0.667
#> 10 B 2 2 0.667
#> 11 B 2 2 0.667
#> 12 B 2 2 0.667
Or with data.table:
library(data.table)
mode <- function(codes){
tab <- tabulate(codes)
mode_value <- which.max(tab)
list(mode_value, tab[mode_value]/length(codes))
}
setDT(df)[, c("mode_value", "perc_on_mode") := mode(value), group][]
#> group value mode_value perc_on_mode
#> 1: A 4 4 0.3333333
#> 2: A 2 4 0.3333333
#> 3: A 4 4 0.3333333
#> 4: A 3 4 0.3333333
#> 5: A 1 4 0.3333333
#> 6: A 5 4 0.3333333
#> 7: B 3 2 0.6666667
#> 8: B 2 2 0.6666667
#> 9: B 1 2 0.6666667
#> 10: B 2 2 0.6666667
#> 11: B 2 2 0.6666667
#> 12: B 2 2 0.6666667
I have a dataframe grouped by grp:
df <- data.frame(
v = rnorm(25),
grp = c(rep("A",10), rep("B",15)),
size = 2)
I want to flag the run-length of intervals determined by size. For example, for grp == "A", size is 2, and the number of rows is 10. So the interval should have length 10/2 = 5. This code, however, creates intervals with length 2:
df %>%
group_by(grp) %>%
mutate(
interval = (row_number() -1) %/% size)
# A tibble: 25 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <dbl>
1 -0.166 A 2 0
2 -1.12 A 2 0
3 0.941 A 2 1
4 -0.913 A 2 1
5 0.486 A 2 2
6 -1.80 A 2 2
7 -0.370 A 2 3
8 -0.209 A 2 3
9 -0.661 A 2 4
10 -0.177 A 2 4
# … with 15 more rows
How can I flag the correct run-length of the size-determined intervals? The desired output is this:
# A tibble: 25 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <dbl>
1 -0.166 A 2 0
2 -1.12 A 2 0
3 0.941 A 2 0
4 -0.913 A 2 0
5 0.486 A 2 0
6 -1.80 A 2 1
7 -0.370 A 2 1
8 -0.209 A 2 1
9 -0.661 A 2 1
10 -0.177 A 2 1
# … with 15 more rows
If I interpreted your question correctly, this small change should do the trick?
df %>%
group_by(grp) %>%
mutate(
interval = (row_number() -1) %/% (n()/size))
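To see what that division does when the group size is not a multiple of size, here is a small check on data shaped like the question's (the seed is an assumption added for reproducibility). For grp A, n()/size is 10/2 = 5, so rows 1 to 5 get interval 0 and rows 6 to 10 get interval 1. For grp B, n()/size is 15/2 = 7.5, so the first interval covers 8 rows and the second covers 7.

```r
library(dplyr)

set.seed(1)  # assumption: only for reproducibility
df <- data.frame(
  v    = rnorm(25),
  grp  = c(rep("A", 10), rep("B", 15)),
  size = 2
)

res <- df %>%
  group_by(grp) %>%
  mutate(interval = (row_number() - 1) %/% (n() / size)) %>%
  ungroup()

# Number of rows per group and interval
count(res, grp, interval)
```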
You can use gl:
df %>%
group_by(grp) %>%
mutate(interval = gl(first(size), ceiling(n() / first(size)))[1:n()])
output
# A tibble: 26 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <fct>
1 -1.12 A 2 1
2 3.04 A 2 1
3 0.235 A 2 1
4 -0.0333 A 2 1
5 -2.73 A 2 1
6 -0.0998 A 2 1
7 0.976 A 2 2
8 0.414 A 2 2
9 0.912 A 2 2
10 1.98 A 2 2
11 1.17 A 2 2
12 -0.509 B 2 1
13 0.704 B 2 1
14 -0.198 B 2 1
15 -0.538 B 2 1
16 -2.86 B 2 1
17 -0.790 B 2 1
18 0.488 B 2 1
19 2.17 B 2 1
20 0.501 B 2 2
21 0.620 B 2 2
22 -0.966 B 2 2
23 0.163 B 2 2
24 -2.08 B 2 2
25 0.485 B 2 2
26 0.697 B 2 2
I am trying to assign quantile groups to stacked data so that, for each category (r1 and r2 in my example), the values are classified into 5 groups. I can manage to do this using ntile() as follows.
r1<-rnorm(10,0,1)
r2<-rnorm(10,2,4)
df<-cbind(r1,r2)
df<-melt(df)
df<-df%>%group_by(Var2) %>% mutate(group=ntile(value,5))
However, what should I do if I want to exclude the top and bottom 10% when sorting into groups? Ideally, I would like to keep those top and bottom values in the output table, with their group code shown as NA.
Thanks to anyone who can help!
Your question is a little ambiguous. It is not clear whether you wish to exclude the top and bottom 10% from the quintile calculation (so that you are getting equal quintiles of the 10-90th centiles of the original data), or whether you want to do the quintiles first on all the data, then exclude the first and last 10%. Doing it the second way will give you smaller 1st and 5th quintiles, so I assume you mean the first method:
df %>%
group_by(Var2) %>%
mutate(group = ntile(value, 10)) %>%
mutate(group = ntile(ifelse(group %% 9 == 1, NA, value), 5))
#> # A tibble: 20 x 4
#> # Groups: Var2 [2]
#> Var1 Var2 value group
#> <int> <fct> <dbl> <int>
#> 1 1 r1 -0.626 1
#> 2 2 r1 0.184 2
#> 3 3 r1 -0.836 NA
#> 4 4 r1 1.60 NA
#> 5 5 r1 0.330 3
#> 6 6 r1 -0.820 1
#> 7 7 r1 0.487 3
#> 8 8 r1 0.738 5
#> 9 9 r1 0.576 4
#> 10 10 r1 -0.305 2
#> 11 1 r2 8.05 NA
#> 12 2 r2 3.56 2
#> 13 3 r2 -0.485 1
#> 14 4 r2 -6.86 NA
#> 15 5 r2 6.50 5
#> 16 6 r2 1.82 1
#> 17 7 r2 1.94 2
#> 18 8 r2 5.78 4
#> 19 9 r2 5.28 3
#> 20 10 r2 4.38 3
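The condition group %% 9 == 1 is a compact way of picking deciles 1 and 10, since both leave a remainder of 1 when divided by 9. A more explicit spelling of the same first method is sketched below. It builds the long data directly instead of going through melt(), which is an assumption made to keep the example free of reshape2; the column name decile is also just illustrative.

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  Var2  = rep(c("r1", "r2"), each = 10),
  value = c(rnorm(10, 0, 1), rnorm(10, 2, 4))
)

res <- df %>%
  group_by(Var2) %>%
  mutate(
    decile = ntile(value, 10),
    # blank out the lowest and highest decile, then form quintiles from
    # what is left; ntile() keeps NA inputs as NA in its result
    group = ntile(if_else(decile %in% c(1L, 10L), NA_real_, value), 5)
  ) %>%
  ungroup()

res
```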
In case you meant the second method, you could do it like this:
df %>%
group_by(Var2) %>%
mutate(group = ntile(value, 5)) %>%
mutate(group = ifelse(ntile(value, 10) %% 9 == 1, NA, group))
#> # A tibble: 20 x 4
#> # Groups: Var2 [2]
#> Var1 Var2 value group
#> <int> <fct> <dbl> <int>
#> 1 1 r1 -0.626 2
#> 2 2 r1 0.184 3
#> 3 3 r1 -0.836 NA
#> 4 4 r1 1.60 NA
#> 5 5 r1 0.330 3
#> 6 6 r1 -0.820 1
#> 7 7 r1 0.487 4
#> 8 8 r1 0.738 5
#> 9 9 r1 0.576 4
#> 10 10 r1 -0.305 2
#> 11 1 r2 8.05 NA
#> 12 2 r2 3.56 3
#> 13 3 r2 -0.485 1
#> 14 4 r2 -6.86 NA
#> 15 5 r2 6.50 5
#> 16 6 r2 1.82 2
#> 17 7 r2 1.94 2
#> 18 8 r2 5.78 4
#> 19 9 r2 5.28 4
#> 20 10 r2 4.38 3
Created on 2022-02-19 by the reprex package (v2.0.1)
Setup and data used
library(dplyr)
library(reshape2)
set.seed(1)
r1 <- rnorm(10,0,1)
r2 <- rnorm(10,2,4)
df <- cbind(r1,r2)
df <- melt(df)
I have a tibble in R with 11 observations, one for every month except June, which has none.
My data frame (tibble) looks like this:
library(tidyverse)
A = c(1,2,3,4,5,7,8,9,10,11,12)
B = rnorm(11,0,1)
Data = tibble(A,B);Data
But I want to add an observation with value 0 for June to this time series.
Something like:
d = c(6,0);d
newdata = rbind(Data,d)
order(newdata$A)
but the 12 (December) appears. Any help?
Two approaches:
(1) We can use add_row for this. However, d must be named and we need to splice it into add_row with the triple bang !!! operator. Then we can arrange the data so that the months are sorted from 1 to 12. Of course you can specify the values directly in add_row, as in @Chris's answer, without the need for an external vector.
library(dplyr)
A = c(1,2,3,4,5,7,8,9,10,11,12)
B = rnorm(11,0,1)
Data = tibble(A,B)
d = c(A = 6, B = 0)
newdata <- Data %>%
add_row(!!! d) %>%
arrange(A)
# check
newdata
#> # A tibble: 12 x 2
#> A B
#> <dbl> <dbl>
#> 1 1 1.22
#> 2 2 0.0729
#> 3 3 0.597
#> 4 4 -1.26
#> 5 5 0.928
#> 6 6 0
#> 7 7 -1.08
#> 8 8 0.704
#> 9 9 -0.119
#> 10 10 -0.462
#> 11 11 -0.00388
#> 12 12 1.56
order(newdata$A)
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12
(2) We can use tidyr::complete, as suggested by @Ronak in the comments, although we use a slightly different specification with full_seq:
library(tidyr)
Data %>%
complete(A = full_seq(A, 1), fill = list(B = 0))
#> # A tibble: 12 x 2
#> A B
#> <dbl> <dbl>
#> 1 1 -0.258
#> 2 2 -1.18
#> 3 3 -0.165
#> 4 4 0.775
#> 5 5 0.926
#> 6 6 0
#> 7 7 0.343
#> 8 8 1.10
#> 9 9 0.359
#> 10 10 0.934
#> 11 11 -0.444
#> 12 12 0.184
Created on 2021-09-21 by the reprex package (v2.0.1)
You can define the additional row in add_row:
library(dplyr)
Data %>%
add_row(A = 6, B = 0) %>%
arrange(A)
# A tibble: 12 x 2
A B
<dbl> <dbl>
1 1 -0.547
2 2 -0.564
3 3 -0.890
4 4 -0.477
5 5 -0.998
6 6 0
7 7 -0.776
8 8 0.0645
9 9 0.959
10 10 -0.110
11 11 -0.511
12 12 -0.911
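As for why the 12 showed up in the original attempt: order() does not return sorted data, it returns the row positions that would sort it. The appended June row sits at position 12, so 12 appears in sixth place of the result. A small sketch of using that index to reorder, with the new row bound as a one-row tibble (a slight variation on the question's rbind(Data, d), chosen here for illustration):

```r
library(tibble)

set.seed(1)  # assumption: only for reproducibility
Data <- tibble(A = c(1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12), B = rnorm(11, 0, 1))

# Append June as the 12th row
newdata <- rbind(Data, tibble(A = 6, B = 0))

# Positions that would sort A: the 12 here is a row index, not a month
idx <- order(newdata$A)
idx
#> [1]  1  2  3  4  5 12  6  7  8  9 10 11

# Use the index to actually reorder the rows
sorted <- newdata[idx, ]
sorted
```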
How can I create a new column whose starting value is 1 and whose following values are the product of the previous value of a column (b) and the previous value of the new column itself (d)?
These data are made up, but they have the same structure as my data:
> a <- rep(1:10, 3)
> b <- runif(30)
> c <- tibble(a,b)
> c
# A tibble: 30 x 2
a b
<int> <dbl>
1 1 0.945
2 2 0.280
3 3 0.464
4 4 0.245
5 5 0.917
6 6 0.913
7 7 0.144
8 8 0.481
9 9 0.873
10 10 0.754
# ... with 20 more rows
Then I try to calculate column d:
> c <- c %>%
+ group_by(a) %>%
+ mutate(d = accumulate(lag(b, k = 1), `*`, .init = 1))
and it should look like this
# A tibble: 30 x 3
# Groups: a [10]
a b d
<int> <dbl> <dbl>
1 1 0.945 1 <--- b[1] * d[1] = d[2]
2 2 0.280 0.945
3 3 0.464 0.265
4 4 0.245 0.123
5 5 0.917 0.03
#...
But instead I am getting this error message.
Error: Column `d` must be length 3 (the group size) or one, not 4
The problem is that initializing accumulate with .init prepends an extra first element, so the result is one element longer than the input vector.
You could try this:
library(dplyr)
library(purrr)
c %>%
group_by(a) %>%
mutate(d = accumulate(b[(2:length(b))-1], `*`,.init=1)) %>%
arrange(a)
# a b d
# <int> <dbl> <dbl>
# 1 1 0.266 1
# 2 1 0.206 0.266
# 3 1 0.935 0.0547
# 4 2 0.372 1
# 5 2 0.177 0.372
# … with 25 more rows
Data
library(tibble)
set.seed(1)
a <- rep(1:10, 3)
b <- runif(30)
c <- tibble(a,b)
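The length mismatch behind the error is easy to see on a short vector. The sketch below uses three made-up values for b; head(b, -1) is simply another way of dropping the last element, and cumprod() is shown as a base R equivalent for this particular running product.

```r
library(purrr)

b <- c(0.5, 0.2, 0.4)  # made-up values, one group of size 3

# With .init the result has length(b) + 1 elements, which no longer
# fits a group of size 3
too_long <- accumulate(b, `*`, .init = 1)
length(too_long)
#> [1] 4

# Dropping the last element of b first brings the length back to 3
d <- accumulate(head(b, -1), `*`, .init = 1)
length(d)
#> [1] 3

# Base R equivalent for a running product that starts at 1
d_base <- cumprod(c(1, head(b, -1)))
all.equal(d, d_base)
#> [1] TRUE
```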
Using dplyr together with purrr's accumulate, I would do this:
c %>%
mutate(d = 1*accumulate(.x = b[-length(b)],
.init = 1,
.f = `*`))
# # A tibble: 30 x 3
# a b d
# <int> <dbl> <dbl>
# 1 1 0.562 1
# 2 2 0.668 0.562
# 3 3 0.100 0.375
# 4 4 0.242 0.0376
# 5 5 0.0646 0.00907
# 6 6 0.373 0.000586
# 7 7 0.664 0.000219
# 8 8 0.915 0.000145
# 9 9 0.848 0.000133
# 10 10 0.952 0.000113
# # ... with 20 more rows