Count number of value increases/decreases per group in R

I have a df with entries in 10 columns, grouped by unit and year. I want to calculate a) how often the values in each column increased and b) how often the values in each column decreased from one year to the next (e.g. from 2010 to 2011, 2011 to 2012 and so on), per group.
This is my df:
df <- data.frame(unit = rep(1:250, 4),
                 year = rep(c(2012, 2013, 2014, 2015), each = 250),
                 replicate(10, sample(0:50000, 1000, rep = TRUE)))
So a solution should show how often unit 1 had increases and decreases in X1 from one year to the next, how often unit 1 had increases/decreases in X2, and so forth.
A tidyverse solution would be preferable ;)

Here is one solution that produces a wide format. Each of the Xs gets 2 new columns of counts, X_incr and X_decr:
# example data
df <- data.frame(unit = rep(1:250, 4),
                 year = rep(c(2012, 2013, 2014, 2015), each = 250),
                 replicate(10, sample(0:50000, 1000, rep = TRUE)))

library(dplyr)

# functions to count increases and decreases between consecutive rows
f_incr = function(x) sum(lead(x) > x, na.rm = TRUE)
f_decr = function(x) sum(lead(x) < x, na.rm = TRUE)

df %>%
  group_by(unit) %>%                                   # for each unit
  summarise_at(vars(matches("X")), funs(incr = f_incr, # apply functions
                                        decr = f_decr))
# # A tibble: 250 x 21
# unit X1_incr X2_incr X3_incr X4_incr X5_incr X6_incr X7_incr X8_incr X9_incr X10_incr X1_decr X2_decr
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 1 0 2 1 1 1 1 1 2 2 2 3
# 2 2 1 2 1 2 0 1 1 3 2 2 2 1
# 3 3 3 1 1 1 2 1 1 2 2 2 0 2
# 4 4 1 1 2 1 1 1 1 1 2 1 2 2
# 5 5 3 2 2 1 2 2 1 2 2 2 0 1
# 6 6 1 2 1 2 2 2 1 2 2 1 2 1
# 7 7 1 2 1 1 2 0 2 3 1 1 2 1
# 8 8 2 1 1 2 2 1 1 2 1 1 1 2
# 9 9 1 2 3 1 2 2 1 1 2 2 2 1
#10 10 2 1 2 2 2 2 0 1 2 1 1 2
# # ... with 240 more rows, and 8 more variables: X3_decr <int>, X4_decr <int>, X5_decr <int>, X6_decr <int>,
# # X7_decr <int>, X8_decr <int>, X9_decr <int>, X10_decr <int>
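As an aside, funs() has since been deprecated; the same computation can be sketched with across() (assuming dplyr >= 1.0):
library(dplyr)

df %>%
  group_by(unit) %>% # rows are assumed to be ordered by year within each unit
  summarise(across(starts_with("X"),
                   list(incr = f_incr, decr = f_decr),
                   .names = "{.col}_{.fn}"))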
Or if you prefer a format where each X has 2 rows of counts (X_incr and X_decr):
library(tidyr)

df %>%
  group_by(unit) %>%
  summarise_at(vars(matches("X")), funs(incr = f_incr,
                                        decr = f_decr)) %>%
  gather(type, counts, -unit)
# # A tibble: 5,000 x 3
# unit type counts
# <int> <chr> <int>
# 1 1 X1_incr 1
# 2 2 X1_incr 1
# 3 3 X1_incr 3
# 4 4 X1_incr 1
# 5 5 X1_incr 3
# 6 6 X1_incr 1
# 7 7 X1_incr 1
# 8 8 X1_incr 2
# 9 9 X1_incr 1
#10 10 X1_incr 2
# # ... with 4,990 more rows
Or this:
df %>%
gather(type,value,-unit,-year) %>% # reshape data
group_by(unit, type) %>% # for each combination
summarise(incr = f_incr(value), # get increasing counts
decr = f_decr(value)) %>% # get decreasing counts
arrange(type, unit) %>% # order (just for visualisation purposes)
ungroup() # forget the grouping
# # A tibble: 2,500 x 4
# unit type incr decr
# <int> <chr> <int> <int>
# 1 1 X1 1 2
# 2 2 X1 1 2
# 3 3 X1 3 0
# 4 4 X1 1 2
# 5 5 X1 3 0
# 6 6 X1 1 2
# 7 7 X1 1 2
# 8 8 X1 2 1
# 9 9 X1 1 2
#10 10 X1 2 1
# # ... with 2,490 more rows
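gather() has likewise been superseded by pivot_longer(); the long-format pipeline could be sketched as (reusing f_incr and f_decr from above):
library(tidyr)

df %>%
  pivot_longer(-c(unit, year), names_to = "type", values_to = "value") %>%
  group_by(unit, type) %>%
  summarise(incr = f_incr(value),
            decr = f_decr(value),
            .groups = "drop")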

I hope I understand question (a) correctly: you are trying to see, for each row, how many times the value increases (first from X1 to X2, then from X2 to X3, and so on).
I am using apply to iterate over each row. The idea is to overlay the second-through-last values on the first-through-second-to-last values and check whether each later value is bigger (increase) or smaller (decrease) than the one before it. Summing the resulting logical values counts how many increases or decreases occur in that row. Note the switch from '>' to '<'.
increases <- apply(df[, 3:12], 1, function(x) sum(x[2:length(x)] > x[1:(length(x) - 1)]))
decreases <- apply(df[, 3:12], 1, function(x) sum(x[2:length(x)] < x[1:(length(x) - 1)]))
For question (b) you can subtract the subset where year equals 2012 from the subset where year equals 2013, then test whether the values are greater than 0 for increases and smaller than 0 for decreases. colSums then shows for how many units the increase or decrease holds.
Increase:
colSums((subset(df, year==2013) - subset(df, year==2012))>0)[3:12]
Decrease:
colSums((subset(df, year==2013) - subset(df, year==2012))<0)[3:12]
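The same idea extends to every consecutive pair of years; a sketch (assuming rows within each year block are ordered identically by unit, as in the example data; swap > for < to count decreases):
years <- sort(unique(df$year))

# column-wise increase counts for each consecutive pair of years
incr <- sapply(seq_along(years)[-1], function(i) {
  colSums((subset(df, year == years[i])[3:12] -
           subset(df, year == years[i - 1])[3:12]) > 0)
})
colnames(incr) <- paste(years[-length(years)], years[-1], sep = " -> ")
incr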

Related

Issues using first() and mutate() with group_by()

I am using mutate to create a column depending on the first value of a group
library(tidyverse)
test = data.frame(grp = c(1,1,1,2,2,2), x = c(1,2,3,1,2,3), y = c(1,2,3,1,2,3))
test
grp x y
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 1
5 2 2 2
6 2 3 3
test %>%
  group_by(grp) %>%
  mutate(y = ifelse(grp[[1]] == x[[1]], y - 1, y))
grp x y
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 2 1 1
5 2 2 1
6 2 3 1
However, the output is not as I expected.
The expected output is:
grp x y
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 1
3 1 3 2
4 2 1 1
5 2 2 2
6 2 3 3
Can you please explain what is happening and how best to get my expected solution?
You need to remove the index [[1]] from grp: because grp[[1]] == x[[1]] compares two single values, the condition has length 1, so ifelse() returns one value that gets recycled across the whole group. Since grp is the grouping variable you should avoid indexing it; just use it as is, i.e.
library(dplyr)

test %>%
  group_by(grp) %>%
  mutate(new_y = ifelse(grp == first(x), y - 1, y))
# A tibble: 6 × 4
# Groups: grp [2]
grp x y new_y
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 0
2 1 2 2 1
3 1 3 3 2
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3
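To see why the original code collapses each group to a single value: ifelse() returns a result the same length as its condition, and grp[[1]] == x[[1]] has length 1, so dplyr recycles that one value across the group. A quick illustration:
# a length-1 condition yields a length-1 result; only the first
# element of the chosen branch survives
ifelse(TRUE, c(0, 1, 2), c(1, 2, 3))
#> [1] 0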
Because of the x[[1]], you are always comparing each row's group value with the x value of the group's first row. I think you want grp == x within ifelse().

Use dynamically generated column names in dplyr

I have a data frame with multiple columns. The user provides a vector of column names, and I want to count the maximum number of times a value appears across those columns, per row.
library(dplyr)

set.seed(42)
df <- tibble(
  var1 = sample(c(1:3), 10, replace = TRUE),
  var2 = sample(c(1:3), 10, replace = TRUE),
  var3 = sample(c(1:3), 10, replace = TRUE)
)
select_vars <- c("var1", "var3")
df %>%
  rowwise() %>%
  mutate(consensus = max(table(unlist(c(var1, var3)))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
This does exactly what I want, but when I try to use a vector of variables I can't get it to work:
df %>%
  rowwise() %>%
  mutate(consensus = max(unlist(table(select_vars))))
You can wrap it in c(!!!syms()) to get it working, and you don't need the unlist, apparently. But honestly, I'm not sure what you are trying to do, or why table is needed here. Do you just want to check whether var2 and var3 have the same value, returning 2 if they do and 1 if not?
library(dplyr)

df <- tibble(
  var1 = sample(c(1:3), 10, replace = TRUE),
  var2 = sample(c(1:3), 10, replace = TRUE),
  var3 = sample(c(1:3), 10, replace = TRUE)
)

select_vars <- c("var2", "var3")

df %>%
  rowwise() %>%
  mutate(consensus = max(table(c(!!!syms(select_vars)))))
#> # A tibble: 10 x 4
#> # Rowwise:
#> var1 var2 var3 consensus
#> <int> <int> <int> <int>
#> 1 2 3 2 1
#> 2 3 1 3 1
#> 3 3 1 1 2
#> 4 3 3 3 2
#> 5 1 1 2 1
#> 6 2 1 3 1
#> 7 3 2 3 1
#> 8 1 2 3 1
#> 9 2 1 2 1
#> 10 2 1 1 2
Created on 2021-07-22 by the reprex package (v0.3.0)
In the OP's code, we need select():
library(dplyr)

df %>%
  rowwise() %>%
  mutate(consensus = max(table(unlist(select(cur_data(), select_vars)))))
-output
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Or just subset from cur_data(), which returns only the data while keeping the group attributes:
df %>%
  rowwise() %>%
  mutate(consensus = max(table(unlist(cur_data()[select_vars]))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Or using pmap:
library(purrr)

df %>%
  mutate(consensus = pmap_dbl(cur_data()[select_vars], ~ max(table(c(...)))))
# A tibble: 10 x 4
var1 var2 var3 consensus
<int> <int> <int> <dbl>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
As these are rowwise operations, we can get some efficiency by using collapse functions:
library(collapse)

tfm(df, consensus = dapply(slt(df, select_vars), MARGIN = 1,
                           FUN = function(x) fmax(tabulate(x))))
# A tibble: 10 x 4
var1 var2 var3 consensus
* <int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Benchmarks
As noted above, collapse is faster (run on a slightly bigger dataset):
df1 <- df[rep(seq_len(nrow(df)), 1e5), ]

system.time({
  tfm(df1, consensus = dapply(slt(df1, select_vars), MARGIN = 1,
                              FUN = function(x) fmax(tabulate(x))))
})
#   user  system elapsed
#  5.257   0.123   5.323

system.time({
  df1 %>%
    mutate(consensus = pmap_dbl(cur_data()[select_vars], ~ max(table(c(...)))))
})
#   user  system elapsed
# 54.813   0.517  55.246
The rowwise() operation was taking too much time, so the execution was stopped:
system.time({
  df1 %>%
    rowwise() %>%
    mutate(consensus = max(table(unlist(select(cur_data(), select_vars)))))
})
# Timing stopped at: 575.5 3.342 581.3
What you need is to use the verb all_of:
df %>%
  rowwise() %>%
  mutate(consensus = max(table(unlist(all_of(select_vars)))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 2 3 3 1
2 2 2 2 1
3 1 2 2 1
4 2 3 3 1
5 1 2 1 1
6 2 1 2 1
7 2 2 2 1
8 3 1 2 1
9 2 1 3 1
10 3 2 1 1
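Note that outside a selection context all_of() simply returns the character vector itself, so table() above tabulates the column names rather than the row values, which is why consensus is 1 in every row (newer tidyselect versions raise an error here instead). A rowwise version that does select by name uses c_across(); a sketch, assuming dplyr >= 1.0:
library(dplyr)

df %>%
  rowwise() %>%
  mutate(consensus = max(table(c_across(all_of(select_vars))))) %>%
  ungroup()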

Create a combination ID number from a set of factors in R

Can anyone help me out in computing a new variable that will number each distinct combination of some factors?
Assuming there are 4 within-subject factors (A, B, C, D) with 8 repetitions of each combination for each of 10 subjects, this is what my data could look like to represent its actual structure:
library(AlgDesign) # for generating a factorial design

df <- gen.factorial(c(2, 2, 2, 2, 8, 10), factors = "all",
                    varNames = c("A", "B", "C", "D", "replication", "Subject"))
> head(df)
A B C D replication Subject
1 1 1 1 1 1 1
2 2 1 1 1 1 1
3 1 2 1 1 1 1
4 2 2 1 1 1 1
5 1 1 2 1 1 1
6 2 1 2 1 1 1
> tail(df)
A B C D replication Subject
1275 1 2 1 2 8 10
1276 2 2 1 2 8 10
1277 1 1 2 2 8 10
1278 2 1 2 2 8 10
1279 1 2 2 2 8 10
1280 2 2 2 2 8 10
In this example replication was simply generated in order to force 8 reps, but it doesn't "code" the combination itself.
My original data has only the variables A, B, C, D and Subject, and I'd like to compute replication so that it has a distinct value for each combination of A, B, C, D.
library(AlgDesign)
library(dplyr)

df <- gen.factorial(c(2, 2, 2, 2, 8, 10), factors = "all",
                    varNames = c("A", "B", "C", "D", "replication", "Subject"))

df %>%
  rowwise() %>%                                               # for each row
  mutate(factors = paste0(c(A, B, C, D), collapse = "_")) %>% # create a combination of your factors
  ungroup() %>%                                               # forget the row grouping
  mutate(replication_upd = as.numeric(factor(factors)))       # number based on the combination
# # A tibble: 1,280 x 8
# A B C D replication Subject factors replication_upd
# <fct> <fct> <fct> <fct> <fct> <fct> <chr> <dbl>
# 1 1 1 1 1 1 1 1_1_1_1 1
# 2 2 1 1 1 1 1 2_1_1_1 9
# 3 1 2 1 1 1 1 1_2_1_1 5
# 4 2 2 1 1 1 1 2_2_1_1 13
# 5 1 1 2 1 1 1 1_1_2_1 3
# 6 2 1 2 1 1 1 2_1_2_1 11
# 7 1 2 2 1 1 1 1_2_2_1 7
# 8 2 2 2 1 1 1 2_2_2_1 15
# 9 1 1 1 2 1 1 1_1_1_2 2
#10 2 1 1 2 1 1 2_1_1_2 10
# # ... with 1,270 more rows
You can remove any unnecessary variables. I left them there so you can see how the process works.
Another option is this:
# create a look-up table based on unique combinations and assign them a number
df %>% distinct(A, B, C, D) %>% mutate(replication_upd = row_number()) -> look_up

# join back to the original dataset
df %>% inner_join(look_up, by = c("A", "B", "C", "D")) %>% as_tibble()
# # A tibble: 1,280 x 7
# A B C D replication Subject replication_upd
# <fct> <fct> <fct> <fct> <fct> <fct> <int>
# 1 1 1 1 1 1 1 1
# 2 2 1 1 1 1 1 2
# 3 1 2 1 1 1 1 3
# 4 2 2 1 1 1 1 4
# 5 1 1 2 1 1 1 5
# 6 2 1 2 1 1 1 6
# 7 1 2 2 1 1 1 7
# 8 2 2 2 1 1 1 8
# 9 1 1 1 2 1 1 9
# 10 2 1 1 2 1 1 10
# # ... with 1,270 more rows
Note that the first approach picks the numbers based on the new variable we create (i.e. it orders by the combined A_B_C_D string), while the second approach uses the initial row order of your dataset to pick the number for each unique combination.
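In more recent dplyr (>= 1.0), cur_group_id() gives this kind of combination number directly; a sketch (like the first approach, the numbering follows the sorted group order rather than the row order):
library(dplyr)

df %>%
  group_by(A, B, C, D) %>%
  mutate(replication_upd = cur_group_id()) %>%
  ungroup()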

The dplyr way to get grouped differences

I am trying to figure out the dplyr way to do grouped differences.
Here is some fake data:
library(tidyverse)

df <- crossing(year = seq(1, 4), week = seq(1, 3)) %>%
  mutate(value = c(rep(4, 3), rep(3, 3), rep(2, 3), rep(1, 3)))
df
year week value
<int> <int> <dbl>
1 1 1 4
2 1 2 4
3 1 3 4
4 2 1 3
5 2 2 3
6 2 3 3
7 3 1 2
8 3 2 2
9 3 3 2
10 4 1 1
11 4 2 1
12 4 3 1
What I would like is year1 - year2, year2 - year3, and year3 - year4. The result would look like the following:
year week diffs
<int> <int> <dbl>
1 1 1 1
2 1 2 1
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 1
8 3 2 1
9 3 3 1
Edit:
I apologize. I was trying to make a simple reprex, but I messed up a lot. Please let me know what the proper etiquette is; I don't want to ruffle any feathers.
I did not know about -diff(). What I am actually looking for is the percent difference, ((new - old) / old) * 100, and I am not able to find a straightforward way to use diff to get that value.
I am starting from the largest year, so what I have is the above code plus an arrange(desc(year)); I would be trimming the smallest year, not the largest.
If this edit is worth a separate question, let me know.
If you don't have missing years for each week:
df %>%
  arrange(year) %>%
  group_by(week) %>%
  mutate(diffs = value - lead(value)) %>%
  na.omit() %>%
  select(-value)
# A tibble: 9 x 3
# Groups: week [3]
# year week diffs
# <int> <int> <dbl>
#1 1 1 1
#2 1 2 1
#3 1 3 1
#4 2 1 1
#5 2 2 1
#6 2 3 1
#7 3 1 1
#8 3 2 1
#9 3 3 1
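The same pattern covers the percent difference asked about in the edit; a sketch, assuming 'old' is the earlier year and 'new' the later one:
df %>%
  arrange(year) %>%
  group_by(week) %>%
  # percent change from the previous year: ((new - old) / old) * 100
  mutate(pct_diff = (value - lag(value)) / lag(value) * 100) %>%
  na.omit()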
You can use diff, but it needs adjusting, as it subtracts the other way and returns a vector that's one shorter than what it's passed:
library(tidyverse)

diffed <- crossing(year = seq(1, 4),
                   week = seq(1, 3)) %>%
  mutate(value = rep(4:1, each = 3)) %>%
  group_by(week) %>%
  mutate(value = c(-diff(value), NA)) %>%
  drop_na(value)

diffed
#> # A tibble: 9 x 3
#> # Groups: week [3]
#> year week value
#> <int> <int> <int>
#> 1 1 1 1
#> 2 1 2 1
#> 3 1 3 1
#> 4 2 1 1
#> 5 2 2 1
#> 6 2 3 1
#> 7 3 1 1
#> 8 3 2 1
#> 9 3 3 1
Using dplyr and do:
library(dplyr)

df %>%
  group_by(week) %>%
  do(cbind(.[-nrow(.), 1:2], diffs = -diff(.$value)))
# # A tibble: 9 x 3
# # Groups: week [3]
# year week diffs
# <int> <int> <dbl>
# 1 1 1 1
# 2 2 1 1
# 3 3 1 1
# 4 1 2 1
# 5 2 2 1
# 6 3 2 1
# 7 1 3 1
# 8 2 3 1
# 9 3 3 1
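do() has since been superseded; the same result can be sketched with group_modify() (assuming dplyr >= 1.0):
library(dplyr)

df %>%
  group_by(week) %>%
  group_modify(~ tibble(year = .x$year[-nrow(.x)],
                        diffs = -diff(.x$value)))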

dplyr: how to count cycles in the records

For example, if I have records like:
A B
1 2
2 3
3 1
1 2
2 1
Let's say one cycle is from 1 (to 2 to 3) back to 1, so I need my data frame to be like:
No. A B
cycle1 1 2
cycle1 2 3
cycle1 3 1
cycle2 1 2
cycle2 2 1
Or, even better for me, I just need to record which occurrence of the same record each row is, like:
Time A B
Time1 1 2
Time1 2 3
Time1 3 1
Time2 1 2
Time1 2 1
I need to do this because I have to use dplyr's summarize function for a calculation, but I cannot group the data by A and B directly. The order of the data is also important.
Is this what you want?
library(zoo)

T1 <- which(df$A == 1)              # rows where a new cycle starts
T2 <- paste('cycle', seq_along(T1))
df$No <- NA
df$No[T1] <- T2                     # label the start of each cycle
df$No <- na.locf(df$No)             # carry the label forward
df
A B No
1 1 2 cycle 1
2 2 3 cycle 1
3 3 1 cycle 1
4 1 2 cycle 2
5 2 1 cycle 2
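The same labels can be built without zoo, since cumsum() of the A == 1 indicator increments at each cycle start (a sketch, with the same assumption that every cycle begins at A == 1):
library(dplyr)

df %>%
  mutate(No = paste('cycle', cumsum(A == 1)))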
For the second format, number each occurrence of the same (A, B) record; the occurrence count keeps each row's identity through later calculations:
library(dplyr)

df %>%
  group_by(A, B) %>%
  mutate(Time = paste('Time', row_number()))
A B Time
<int> <int> <chr>
1 1 2 Time 1
2 2 3 Time 1
3 3 1 Time 1
4 1 2 Time 2
5 2 1 Time 1
Create an augmented 'diff' variable, c(NA, diff(your_var)). Within a run of increasing values the differences are positive; a new cycle starts wherever the value drops, so take the cumulative sum of the 'difference is negative' indicator to get the group number. (My first iteration of the algorithm wasn't quite correct, so I modified it slightly.)
dat %>% as_tibble() %>% mutate(G = cumsum(c(-1, diff(A)) < 0))
# A tibble: 5 x 3
A B G
<int> <int> <int>
1 1 2 1
2 2 3 1
3 3 1 1
4 1 2 2
5 2 1 2
dat %>% as_tibble() %>% mutate(G = paste0("time", cumsum(c(-1, diff(A)) < 0)))
# A tibble: 5 x 3
A B G
<int> <int> <chr>
1 1 2 time1
2 2 3 time1
3 3 1 time1
4 1 2 time2
5 2 1 time2
One could also test for A == 1, but then sequences like 1, 2, 3, 2, 3, 4 would not get properly split.
