Building sum of dynamic number of rows in dplyr - r

My df looks something like the first three columns of the following:
ID VAL LENGTH SUM
1 1 1 1
1 1 1 1
1 1 2 2
1 1 2 2
2 0 1 0
2 3 1 0
2 4 2 3
I want to add a fourth column, SUM, defined as the sum of the first LENGTH values of VAL within each ID group (for example, LENGTH = 2 means the sum of the group's first two VAL entries).
How do I do that?

You could do:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(SUM = sapply(LENGTH, function(x) sum(VAL[seq_len(x)]))) # seq_len(x) stays correct when x is 0, unlike 1:x
Output:
# A tibble: 7 x 4
# Groups: ID [2]
ID VAL LENGTH SUM
<int> <int> <int> <dbl>
1 1 1 1 1
2 1 1 1 1
3 1 1 2 2
4 1 1 2 2
5 2 0 1 0
6 2 3 1 0
7 2 4 2 3
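A vectorised alternative to the sapply() call is to index the group's cumulative sum by LENGTH. This is only a sketch and not part of the original answer: it assumes every LENGTH is at least 1 and no larger than the number of rows in its group, and the data frame is rebuilt here from the question's table.

```r
library(dplyr)

# Example data reconstructed from the question's first three columns
df <- data.frame(
  ID     = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  VAL    = c(1L, 1L, 1L, 1L, 0L, 3L, 4L),
  LENGTH = c(1L, 1L, 2L, 2L, 1L, 1L, 2L)
)

res <- df %>%
  group_by(ID) %>%
  # cumsum(VAL)[k] is the sum of the group's first k values
  mutate(SUM = cumsum(VAL)[LENGTH]) %>%
  ungroup()

res
```

This avoids calling sum() once per row, which may matter for large groups.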

Related

Code values in new column based on whether values in another column are unique

Given the following data I would like to create a new column new_sequence based on the condition:
If an id occurs only once, the new value should be 0. If an id occurs several times, the new value should be numbered according to the values in sequence.
dat <- tibble(id = c(1,2,3,3,3,4,4),
sequence = c(1,1,1,2,3,1,2))
# A tibble: 7 x 2
id sequence
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 3 2
5 3 3
6 4 1
7 4 2
So, for the example data I am looking to produce the following output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
I tried the code below, but it does not work: the first occurrence of every id is coded as 0, including ids that occur several times.
dat %>% mutate(new_sequence = ifelse(!duplicated(id), 0, sequence))
Use dplyr::add_count() rather than !duplicated():
library(dplyr)
dat %>%
add_count(id) %>%
mutate(new_sequence = ifelse(n == 1, 0, sequence)) %>%
select(!n)
Output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
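The difference between the two conditions can be checked in base R. The sketch below uses ave() to compute the group size as a stand-in for add_count(); the variable names first_occurrence and group_size are illustrative only.

```r
id       <- c(1, 2, 3, 3, 3, 4, 4)
sequence <- c(1, 1, 1, 2, 3, 1, 2)

# !duplicated() flags the first row of every id, not only ids that occur once
first_occurrence <- !duplicated(id)

# Size of the group each row belongs to (what add_count() stores in n)
group_size <- ave(id, id, FUN = length)

attempt      <- ifelse(first_occurrence, 0, sequence)
new_sequence <- ifelse(group_size == 1, 0, sequence)
```

Here attempt sets the first rows of ids 3 and 4 to 0 as well, whereas new_sequence only zeroes ids 1 and 2.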
You can also try the following: after grouping by id, check whether the number of rows in the group, n(), is 1. Use a plain if/else rather than ifelse(), because n() == 1 is a single value per group while sequence has one value per row, and ifelse() returns a result only as long as its condition.
dat %>%
group_by(id) %>%
mutate(new_sequence = if(n() == 1) 0 else sequence)
Output
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
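The length issue can be seen outside dplyr. In this small sketch, is_single plays the role of n() == 1 for one group of three rows; the names are made up for illustration.

```r
sequence  <- c(1, 2, 3)
is_single <- length(sequence) == 1  # FALSE for this three-row group

# ifelse() returns something shaped like the condition, so only sequence[1] survives
vectorised <- ifelse(is_single, 0, sequence)

# A plain if/else returns the whole chosen branch
plain <- if (is_single) 0 else sequence
```

Inside a grouped mutate(), the length-one ifelse() result would be recycled across the group, which is why the plain if/else is used above.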

R update values within a grouped df with information from updated previous value

I would like to conditionally mutate variables (var1, var2) within groups (id) at different timepoints (timepoint), using the previously updated/mutated values, according to this function:
change_function <- function(value,pastvalue,timepoint){
if(timepoint==1){valuenew=value} else
if(value==0){valuenew=pastvalue-1}
if(value==1){valuenew=pastvalue}
if(value==2){valuenew=pastvalue+1}
return(valuenew)
}
pastvalue is the mutated/updated value at the previous timepoint (timepoint - 1), for timepoints 2 to 4.
Here is an example and output file:
``` r
#example data
df <- data.frame(id=c(1,1,1,1,2,2,2,2),timepoint=c(1,2,3,4,1,2,3,4),var1=c(1,0,1,2,2,2,1,0),var2=c(2,0,1,2,3,2,1,0))
df
#> id timepoint var1 var2
#> 1 1 1 1 2
#> 2 1 2 0 0
#> 3 1 3 1 1
#> 4 1 4 2 2
#> 5 2 1 2 3
#> 6 2 2 2 2
#> 7 2 3 1 1
#> 8 2 4 0 0
#desired output
output <- data.frame(id=c(1,1,1,1,2,2,2,2),timepoint=c(1,2,3,4,1,2,3,4),var1=c(1,0,0,1,2,3,3,2),var2=c(2,1,1,2,3,4,4,3))
output
#> id timepoint var1 var2
#> 1 1 1 1 2
#> 2 1 2 0 1
#> 3 1 3 0 1
#> 4 1 4 1 2
#> 5 2 1 2 3
#> 6 2 2 3 4
#> 7 2 3 3 4
#> 8 2 4 2 3
```
<sup>Created on 2020-11-23 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup>
My Approach: use my function using dplyr::mutate_at
library(dplyr)
df %>%
group_by(id) %>%
mutate_at(.vars=vars(var1,var2),
.funs=funs(.=change_function(.,dplyr::lag(.),timepoint)))
However, this does not work because if/else is not vectorized.
Update 1:
Using a nested ifelse function does not give the desired output either, because lag() supplies the original previous values rather than the updated ones:
change_function <- function(value,pastvalue,timepoint){
ifelse((timepoint==1),value,
ifelse((value==0),pastvalue-1,
ifelse((value==1),pastvalue,
ifelse((value==2),pastvalue+1,NA))))
}
library(dplyr)
df %>%
group_by(id) %>%
mutate_at(.vars=vars(var1,var2),
.funs=funs(.=change_function(.,dplyr::lag(.),timepoint)))
id timepoint var1 var2 var1_. var2_.
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 2 1 2
2 1 2 0 0 0 1
3 1 3 1 1 0 0
4 1 4 2 2 2 2
5 2 1 2 3 2 3
6 2 2 2 2 3 4
7 2 3 1 1 2 2
8 2 4 0 0 0 0
Update 2:
According to the comments, purrr::accumulate could be used.
Thanks to akrun I could get the correct function:
# write a vectorized function
change_function <- function(prev, new) {
change=if_else(new==0,-1,
if_else(new==1,0,1))
if_else(is.na(new), new, prev + change)
}
# use purrr::accumulate
library(purrr)
df %>%
group_by(id) %>%
mutate_at(.vars=vars(var1,var2),
.funs=funs(accumulate(.,change_function)))
# A tibble: 8 x 4
# Groups: id [2]
id timepoint var1 var2
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 2
2 1 2 0 1
3 1 3 0 1
4 1 4 1 2
5 2 1 2 3
6 2 2 3 4
7 2 3 3 4
8 2 4 2 3
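The same running update can be sketched in base R with Reduce(accumulate = TRUE), which avoids both purrr and funs() (deprecated in recent dplyr releases). This is an alternative to the code above, not the accepted solution; it assumes rows are already sorted by timepoint within id and that the values are only 0, 1 or 2 after the first timepoint. The helper names step and run_update are made up here.

```r
df <- data.frame(
  id        = c(1, 1, 1, 1, 2, 2, 2, 2),
  timepoint = c(1, 2, 3, 4, 1, 2, 3, 4),
  var1      = c(1, 0, 1, 2, 2, 2, 1, 0),
  var2      = c(2, 0, 1, 2, 3, 2, 1, 0)
)

# One update step: 0 lowers the running value by 1, 1 keeps it, 2 raises it by 1
step <- function(prev, new) prev + (new - 1)

# Running update within one group; the first value is the starting point
run_update <- function(x) Reduce(step, x, accumulate = TRUE)

res <- df
for (v in c("var1", "var2")) {
  # ave() applies run_update separately to each id
  res[[v]] <- ave(df[[v]], df$id, FUN = run_update)
}

res
```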

Rows sequence by group using two columns

Suppose I have the following df
data <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,2,3,3,3),
Value = c(1,1,0,1,0,1,1,1,0,0,1,0,0,0),
Result = c(1,1,2,3,4,5,5,1,2,2,3,1,1,1))
How can I obtain column Result from the first two columns?
I have tried different approaches using rle, seq, cumsum and cur_group_id but can't get the Result column easily
library(data.table)
library(dplyr)
data %>%
group_by(ID) %>%
mutate(Result2 = rleid(Value))
This gives us:
ID Value Result Result2
<dbl> <dbl> <dbl> <int>
1 1 1 1 1
2 1 1 1 1
3 1 0 2 2
4 1 1 3 3
5 1 0 4 4
6 1 1 5 5
7 1 1 5 5
8 2 1 1 1
9 2 0 2 2
10 2 0 2 2
11 2 1 3 3
12 3 0 1 1
13 3 0 1 1
14 3 0 1 1
Does this work:
library(dplyr)
data %>% group_by(ID) %>% mutate(r = rep(seq_along(rle(ID*Value)$values), rle(ID*Value)$lengths))
# A tibble: 14 x 4
# Groups: ID [3]
ID Value Result r
<dbl> <dbl> <dbl> <int>
1 1 1 1 1
2 1 1 1 1
3 1 0 2 2
4 1 1 3 3
5 1 0 4 4
6 1 1 5 5
7 1 1 5 5
8 2 1 1 1
9 2 0 2 2
10 2 0 2 2
11 2 1 3 3
12 3 0 1 1
13 3 0 1 1
14 3 0 1 1
We could use rle with ave in base R
data$Result2 <- with(data, ave(Value, ID, FUN =
function(x) inverse.rle(within.list(rle(x), values <- seq_along(values)))))
data$Result2
#[1] 1 1 2 3 4 5 5 1 2 2 3 1 1 1
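A shorter base R variant, offered as a sketch rather than as one of the original answers, counts the value changes within each ID with diff() and cumsum(). The column name Result2 is reused from above.

```r
data <- data.frame(
  ID     = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  Value  = c(1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0),
  Result = c(1, 1, 2, 3, 4, 5, 5, 1, 2, 2, 3, 1, 1, 1)
)

# Start each group at 1 and add 1 every time Value differs from the previous row
data$Result2 <- ave(data$Value, data$ID,
                    FUN = function(x) cumsum(c(1, diff(x) != 0)))

data$Result2
```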

counting indicator respect of 2 groups

I have groups, persons within each group, and an indicator. How can I count, for each group, the number of distinct persons whose indicator is 1?
group person ind
1 1 1
1 1 1
1 2 1
2 1 0
2 2 1
2 2 1
Desired output: in the first group, 2 persons have ind equal to 1, and in the second group, one person does, so:
group person ind count
1 1 1 2
1 1 1 2
1 2 1 2
2 1 0 1
2 2 1 1
2 2 1 1
Could do:
library(dplyr)
df %>%
group_by(group) %>%
mutate(
count = n_distinct(person[ind == 1])
)
Output:
# A tibble: 6 x 4
# Groups: group [2]
group person ind count
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 1 2
3 1 2 1 2
4 2 1 0 1
5 2 2 1 1
6 2 2 1 1
Or in data.table:
library(data.table)
setDT(df)[, count := uniqueN(person[ind == 1]), by = group]
An option using base R (the data frame is named df1 here; this relies on person never being 0):
df1$count <- with(df1, ave(ind* person, group, FUN =
function(x) length(unique(x[x!=0]))))
df1$count
#[1] 2 2 2 1 1 1
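A base R sketch that does not depend on the ind * person trick passes row indices to ave() and counts distinct persons directly. The data frame below is reconstructed from the question's table; this is an illustrative variant, not one of the posted answers.

```r
df <- data.frame(
  group  = c(1, 1, 1, 2, 2, 2),
  person = c(1, 1, 2, 1, 2, 2),
  ind    = c(1, 1, 1, 0, 1, 1)
)

# For each group's row indices i, count distinct persons among rows with ind == 1
count_flagged <- function(i) length(unique(df$person[i][df$ind[i] == 1]))

# ave() recycles the single count over all rows of the group
df$count <- ave(seq_len(nrow(df)), df$group, FUN = count_flagged)

df$count
```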

Create a combination ID number from a set of factors in R

Can anyone help me compute a new variable that numbers each distinct combination of some factors?
Assuming there are 4 within-subject factors (A, B, C, D) with 8 repetitions of each combination for each of 10 subjects, this is how my data could look, to represent its actual structure:
library(AlgDesign) # for generating a factorial design
df <-gen.factorial(c(2,2,2,2,8,10), factors = "all",
varNames = c("A", "B", "C", "D", "replication", "Subject"))
> head(df)
A B C D replication Subject
1 1 1 1 1 1 1
2 2 1 1 1 1 1
3 1 2 1 1 1 1
4 2 2 1 1 1 1
5 1 1 2 1 1 1
6 2 1 2 1 1 1
> tail(df)
A B C D replication Subject
1275 1 2 1 2 8 10
1276 2 2 1 2 8 10
1277 1 1 2 2 8 10
1278 2 1 2 2 8 10
1279 1 2 2 2 8 10
1280 2 2 2 2 8 10
In this example replication was simply generated in order to force 8 reps, but it doesn't "code" the combination itself.
My original data has only the variables A, B, C, D and Subject, and I'd like to compute replication so that it has distinct values for each combination of A, B, C, D.
library(AlgDesign)
library(dplyr)
df <-gen.factorial(c(2,2,2,2,8,10), factors = "all",
varNames = c("A", "B", "C", "D", "replication", "Subject"))
df %>%
rowwise() %>% # for each row
mutate(factors = paste0(c(A,B,C,D), collapse = "_")) %>% # create a combination of your factors
ungroup() %>% # forget the row grouping
mutate(replication_upd = as.numeric(factor(factors))) # create a number based on the combination you have
# # A tibble: 1,280 x 8
# A B C D replication Subject factors replication_upd
# <fct> <fct> <fct> <fct> <fct> <fct> <chr> <dbl>
# 1 1 1 1 1 1 1 1_1_1_1 1
# 2 2 1 1 1 1 1 2_1_1_1 9
# 3 1 2 1 1 1 1 1_2_1_1 5
# 4 2 2 1 1 1 1 2_2_1_1 13
# 5 1 1 2 1 1 1 1_1_2_1 3
# 6 2 1 2 1 1 1 2_1_2_1 11
# 7 1 2 2 1 1 1 1_2_2_1 7
# 8 2 2 2 1 1 1 2_2_2_1 15
# 9 1 1 1 2 1 1 1_1_1_2 2
#10 2 1 1 2 1 1 2_1_1_2 10
# # ... with 1,270 more rows
You can remove any unnecessary variables. I left them there so you can see how the process works.
Another option is this
# create a look up table based on unique combinations and assign them a number
df %>% distinct(A,B,C,D) %>% mutate(replication_upd = row_number()) -> look_up
# join back to original dataset
df %>% inner_join(look_up, by=c("A","B","C","D")) %>% as_tibble() # as_tibble() replaces the deprecated tbl_df()
# # A tibble: 1,280 x 7
# A B C D replication Subject replication_upd
# <fct> <fct> <fct> <fct> <fct> <fct> <int>
# 1 1 1 1 1 1 1 1
# 2 2 1 1 1 1 1 2
# 3 1 2 1 1 1 1 3
# 4 2 2 1 1 1 1 4
# 5 1 1 2 1 1 1 5
# 6 2 1 2 1 1 1 6
# 7 1 2 2 1 1 1 7
# 8 2 2 2 1 1 1 8
# 9 1 1 1 2 1 1 9
# 10 2 1 1 2 1 1 10
# # ... with 1,270 more rows
Note that the first approach numbers the combinations by the sorted order of the new factors variable (i.e. it orders by A, B, C, D), while the second approach numbers each unique combination by the order in which it first appears in your dataset.
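For comparison, here is a base R sketch that builds a combination ID with interaction(). It uses expand.grid() in place of AlgDesign so that it runs without extra packages, and the column names combo_id and rep_within are made up for illustration. The last step reads the question as also asking for a running replication number within each Subject and combination; that reading is an assumption.

```r
# Stand-in for the factorial design: 2^4 combinations x 8 replications x 10 subjects
df <- expand.grid(
  A = factor(1:2), B = factor(1:2), C = factor(1:2), D = factor(1:2),
  replication = 1:8, Subject = 1:10
)

# One integer per distinct A/B/C/D combination
df$combo_id <- as.integer(interaction(df$A, df$B, df$C, df$D, drop = TRUE))

# Running count of rows within each Subject x combination (assumed intent)
df$rep_within <- ave(seq_len(nrow(df)), df$Subject, df$combo_id, FUN = seq_along)

head(df)
```

With interaction()'s default ordering the first factor varies fastest, so the numbering differs from the alphabetical order of the pasted strings in the first approach.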
