issues using first and mutate with group_by - r

I am using mutate to create a column depending on the first value of a group
library(tidyverse)
test = data.frame(grp = c(1,1,1,2,2,2), x = c(1,2,3,1,2,3), y = c(1,2,3,1,2,3))
test
grp x y
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 1
5 2 2 2
6 2 3 3
test %>% group_by(grp) %>%
mutate(y = ifelse(grp[[1]] == x[[1]], y-1, y))
grp x y
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 2 1 1
5 2 2 1
6 2 3 1
However output is not as I expected.
Expected output is
grp x y
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 1
3 1 3 2
4 2 1 1
5 2 2 2
6 2 3 3
Can you please explain what is happening and how best to get my expected solution?

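The comparison grp[[1]] == x[[1]] has length 1, and ifelse() returns a result of the same length as its test. So within each group ifelse() yields a single value (the first element of either y-1 or y), and mutate() recycles that one value across every row of the group. Remove the index [[1]] from grp so that the test has one element per row. Since grp is the grouping variable, it is constant within each group anyway, so just use it as is, i.e.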
library(dplyr)
test %>%
group_by(grp) %>%
mutate(new_y = ifelse(grp == first(x), y-1, y))
# A tibble: 6 × 4
# Groups: grp [2]
grp x y new_y
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 0
2 1 2 2 1
3 1 3 3 2
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3
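The recycling can be reproduced outside dplyr. A minimal base R sketch (the vectors stand in for the first group of test; single and per_row are illustrative names): a length-one test makes ifelse() return only the first element, while a full-length test returns one result per row.

```r
y <- c(1, 2, 3)
grp <- c(1, 1, 1)
x <- c(1, 2, 3)

# Length-one test: only the first element of y - 1 comes back
single <- ifelse(grp[[1]] == x[[1]], y - 1, y)

# Full-length test: one result per row
per_row <- ifelse(grp == x[[1]], y - 1, y)

print(single)
print(per_row)
```

Inside mutate() the single value is then recycled to the size of the group, which is why every row of a group ended up with the same y.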

Because of the x[[1]], you are always comparing the group values of each row with the x value of the first row. I think you want grp==x within ifelse()

Related

Code values in new column based on whether values in another column are unique

Given the following data I would like to create a new column new_sequence based on the condition:
If an id is present only once, the new value should be 0. If an id is present several times, the new value should be numbered according to the values present in sequence.
dat <- tibble(id = c(1,2,3,3,3,4,4),
sequence = c(1,1,1,2,3,1,2))
# A tibble: 7 x 2
id sequence
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 3 2
5 3 3
6 4 1
7 4 2
So, for the example data I am looking to produce the following output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
I have tried with the code below, that does not work since all unique values are coded as 0
dat %>% mutate(new_sequence = ifelse(!duplicated(id), 0, sequence))
Use dplyr::add_count() rather than !duplicated():
library(dplyr)
dat %>%
add_count(id) %>%
mutate(new_sequence = ifelse(n == 1, 0, sequence)) %>%
select(!n)
Output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
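You can also try the following. After grouping by id, check whether the number of rows in the group, n(), is 1. Use a plain if ... else instead of ifelse() here: the test n() == 1 has length 1, so ifelse() would return only a single value, whereas if ... else returns the whole sequence vector for groups with more than one row.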
dat %>%
group_by(id) %>%
mutate(new_sequence = if(n() == 1) 0 else sequence)
Output
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
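For reference, the original attempt fails because !duplicated(id) is TRUE for the first row of every id, including ids that occur several times. A base R sketch of one way around that, flagging every row whose id appears more than once (column names follow the question; repeated is just an illustrative name):

```r
dat <- data.frame(id = c(1, 2, 3, 3, 3, 4, 4),
                  sequence = c(1, 1, 1, 2, 3, 1, 2))

# TRUE for every row whose id occurs more than once
repeated <- dat$id %in% dat$id[duplicated(dat$id)]

# Keep sequence for repeated ids, 0 for ids that occur only once
dat$new_sequence <- ifelse(repeated, dat$sequence, 0)
print(dat)
```

Here the test passed to ifelse() has one element per row, so no grouping is needed.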

Use dynamically generated column names in dplyr

I have a data frame with multiple columns, the user provides a vector with the column names, and I want to count maximum amount of times an element appears
set.seed(42)
df <- tibble(
var1 = sample(c(1:3),10,replace=T),
var2 = sample(c(1:3),10,replace=T),
var3 = sample(c(1:3),10,replace=T)
)
select_vars <- c("var1", "var3")
df %>%
rowwise() %>%
mutate(consensus=max(table(unlist(c(var1,var3)))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
This does exactly what I want, but when I try to use a vector of variable names I can't get it to work
df %>%
rowwise() %>%
mutate(consensus=max(unlist(table(select_vars))))
You can wrap it in c(!!!syms()) to get it working, and the unlist is not needed. But honestly, I'm not sure what you are trying to do or why table is needed here. Do you just want to check whether var2 and var3 have the same value, giving 2 if they do and 1 if they don't?
library(dplyr)
df <- tibble(
var1 = sample(c(1:3),10,replace=T),
var2 = sample(c(1:3),10,replace=T),
var3 = sample(c(1:3),10,replace=T)
)
select_vars <- c("var2", "var3")
df %>%
rowwise() %>%
mutate(consensus=max(table(c(!!!syms(select_vars)))))
#> # A tibble: 10 x 4
#> # Rowwise:
#> var1 var2 var3 consensus
#> <int> <int> <int> <int>
#> 1 2 3 2 1
#> 2 3 1 3 1
#> 3 3 1 1 2
#> 4 3 3 3 2
#> 5 1 1 2 1
#> 6 2 1 3 1
#> 7 3 2 3 1
#> 8 1 2 3 1
#> 9 2 1 2 1
#> 10 2 1 1 2
Created on 2021-07-22 by the reprex package (v0.3.0)
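To see what the !!!syms() idiom does, it can help to build the expression without evaluating it. A small sketch using rlang directly (call_built is an illustrative name): syms() turns the strings into symbols, and !!! splices them into the call as separate arguments.

```r
library(rlang)

select_vars <- c("var2", "var3")

# Capture the call instead of evaluating it
call_built <- expr(c(!!!syms(select_vars)))
print(call_built)
```

Inside mutate() the same splicing happens before the expression is evaluated, so the row-wise code ends up calling c() on the selected columns themselves rather than on the character vector of names.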
In the OP's code, we need select
library(dplyr)
df %>%
rowwise() %>%
mutate(consensus=max(table(unlist(select(cur_data(), select_vars))) ))
Output:
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Or just subset from cur_data() which would only return the data keeping the group attributes
df %>%
rowwise %>%
mutate(consensus = max(table(unlist(cur_data()[select_vars]))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Or using pmap
library(purrr)
df %>%
mutate(consensus = pmap_dbl(cur_data()[select_vars], ~ max(table(c(...)))))
# A tibble: 10 x 4
var1 var2 var3 consensus
<int> <int> <int> <dbl>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
As these are rowwise operations, we can gain some efficiency by using collapse functions
library(collapse)
tfm(df, consensus = dapply(slt(df, select_vars), MARGIN = 1,
FUN = function(x) fmax(tabulate(x))))
# A tibble: 10 x 4
var1 var2 var3 consensus
* <int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Benchmarks
As noted above, collapse is faster (run on a slightly bigger dataset)
df1 <- df[rep(seq_len(nrow(df)), 1e5), ]
system.time({
tfm(df1, consensus = dapply(slt(df1, select_vars), MARGIN = 1,
FUN = function(x) fmax(tabulate(x))))
})
#user system elapsed
# 5.257 0.123 5.323
system.time({
df1 %>%
mutate(consensus = pmap_dbl(cur_data()[select_vars], ~ max(table(c(...)))))
})
#user system elapsed
# 54.813 0.517 55.246
The rowwise operation is taking too much time, so stopped the execution
system.time({
df1 %>%
rowwise() %>%
mutate(consensus=max(table(unlist(select(cur_data(), select_vars))) ))
})
Timing stopped at: 575.5 3.342 581.3
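If neither purrr nor collapse is available, the same per-row count can be sketched in base R with apply(). This is only an illustration on a tiny hand-made data frame (df_small is not the benchmark data, and no timing is claimed), and tabulate() assumes the values are positive integers, as they are in the question.

```r
df_small <- data.frame(var1 = c(1L, 1L, 2L),
                       var2 = c(1L, 2L, 3L),
                       var3 = c(1L, 3L, 2L))
select_vars <- c("var1", "var3")

# For each row, count how often the most frequent value occurs
df_small$consensus <- apply(as.matrix(df_small[select_vars]), 1,
                            function(r) max(tabulate(r)))
print(df_small)
```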
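all_of() is a selection helper rather than a verb, so it has to be used inside a selecting function; within rowwise(), c_across() provides that context (shown here with select_vars <- c("var1", "var3") from the question)
df %>%
rowwise() %>%
mutate(consensus=max(table(c_across(all_of(select_vars)))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 2 3 3 1
2 2 2 2 2
3 1 2 2 1
4 2 3 3 1
5 1 2 1 2
6 2 1 2 2
7 2 2 2 2
8 3 1 2 1
9 2 1 3 1
10 3 2 1 1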

expand_grid is creating a data frame column, how can I convert them to simple columns

library(tidyr)
dat <- expand_grid(df = data.frame(x = 1:2, y = c(2, 1)), z = 1:3)
dat
yields
# A tibble: 6 x 2
z df$x $y
<int> <int> <dbl>
1 1 1 2
2 1 2 1
3 2 1 2
4 2 2 1
5 3 1 2
6 3 2 1
I would like to remove df from the data frame, having 3 simple columns, z, x, and y
I have tried unnest but got nowhere. Any advice?
Just don't name the data frame:
library(tidyr)
expand_grid(data.frame(x = 1:2, y = c(2, 1)), z = 1:3)
# A tibble: 6 x 3
x y z
<int> <dbl> <int>
1 1 2 1
2 1 2 2
3 1 2 3
4 2 1 1
5 2 1 2
6 2 1 3
If the dataset is already created, then convert to a regular data.frame with do.call, and rename the columns that start with df
library(dplyr)
library(stringr)
do.call(data.frame, dat) %>%
rename_at(vars(starts_with('df')), ~ str_remove(., 'df\\.'))
# x y z
#1 1 2 1
#2 1 2 2
#3 1 2 3
#4 2 1 1
#5 2 1 2
#6 2 1 3
Or another option is to pull the column 'df' and then bind with the rest
dat %>%
pull(df) %>%
bind_cols(z = dat %>%
pull(z))
# x y z
#1 1 2 1
#2 1 2 2
#3 1 2 3
#4 2 1 1
#5 2 1 2
#6 2 1 3
Or using crossing in this case
crossing(data.frame(x = 1:2, y = c(2, 1)), z = 1:3)
# A tibble: 6 x 3
# x y z
# <int> <dbl> <int>
#1 1 2 1
#2 1 2 2
#3 1 2 3
#4 2 1 1
#5 2 1 2
#6 2 1 3
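If dat already exists, tidyr also has unpack(), which plays the role for data frame columns that unnest() plays for list-columns. A short sketch, assuming dat was built as in the question with the data frame passed as df (flat is an illustrative name):

```r
library(tidyr)

dat <- expand_grid(df = data.frame(x = 1:2, y = c(2, 1)), z = 1:3)

# Spread the packed df column back into ordinary columns
flat <- unpack(dat, df)
print(flat)
```

The unpacked columns keep their own names (x and y), so no renaming step is needed afterwards.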

Count number of value increases/decreases per group

I have a df with entries in 10 columns grouped by unit and year. I want to calculate a) how often the values per column increased and b) how often the values per column decreased from one year to the other (e.g. from 2010 to 2011, 2011 to 2012 and so on) per group.
This is my df
df <- data.frame(unit=rep(1:250, 4),
year=rep(c(2012, 2013, 2014, 2015), each=250),
replicate(10,sample(0:50000,1000,rep=TRUE)))
So a solution should show information how often unit 1 in X1 had increases and decreases from one year to the other, how often unit 1 had increases/decreases in X2 and so forth
A tidyverse solution would be preferable ;)
One solution that produces a wide format. Each one of the Xs will get 2 new columns of counts: X_incr and X_decr:
# example data
df <- data.frame(unit=rep(1:250, 4),
year=rep(c(2012, 2013, 2014, 2015), each=250),
replicate(10,sample(0:50000,1000,rep=TRUE)))
library(dplyr)
# function to count increases and decreases
f_incr = function(x) sum(lead(x) > x, na.rm = TRUE)
f_decr = function(x) sum(lead(x) < x, na.rm = TRUE)
df %>%
group_by(unit) %>% # for each unit
summarise_at(vars(matches("X")), list(incr = f_incr, # apply functions (list() replaces the deprecated funs())
decr = f_decr))
# # A tibble: 250 x 21
# unit X1_incr X2_incr X3_incr X4_incr X5_incr X6_incr X7_incr X8_incr X9_incr X10_incr X1_decr X2_decr
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 1 0 2 1 1 1 1 1 2 2 2 3
# 2 2 1 2 1 2 0 1 1 3 2 2 2 1
# 3 3 3 1 1 1 2 1 1 2 2 2 0 2
# 4 4 1 1 2 1 1 1 1 1 2 1 2 2
# 5 5 3 2 2 1 2 2 1 2 2 2 0 1
# 6 6 1 2 1 2 2 2 1 2 2 1 2 1
# 7 7 1 2 1 1 2 0 2 3 1 1 2 1
# 8 8 2 1 1 2 2 1 1 2 1 1 1 2
# 9 9 1 2 3 1 2 2 1 1 2 2 2 1
#10 10 2 1 2 2 2 2 0 1 2 1 1 2
# # ... with 240 more rows, and 8 more variables: X3_decr <int>, X4_decr <int>, X5_decr <int>, X6_decr <int>,
# # X7_decr <int>, X8_decr <int>, X9_decr <int>, X10_decr <int>
Or if you prefer a format where each X has 2 rows of counts (X_incr and X_decr):
library(tidyr)
df %>%
group_by(unit) %>%
summarise_at(vars(matches("X")), list(incr = f_incr, # list() replaces the deprecated funs()
decr = f_decr)) %>%
gather(type, counts, -unit)
# # A tibble: 5,000 x 3
# unit type counts
# <int> <chr> <int>
# 1 1 X1_incr 1
# 2 2 X1_incr 1
# 3 3 X1_incr 3
# 4 4 X1_incr 1
# 5 5 X1_incr 3
# 6 6 X1_incr 1
# 7 7 X1_incr 1
# 8 8 X1_incr 2
# 9 9 X1_incr 1
#10 10 X1_incr 2
# # ... with 4,990 more rows
Or this:
df %>%
gather(type,value,-unit,-year) %>% # reshape data
group_by(unit, type) %>% # for each combination
summarise(incr = f_incr(value), # get increasing counts
decr = f_decr(value)) %>% # get decreasing counts
arrange(type, unit) %>% # order (just for visualisation purposes)
ungroup() # forget the grouping
# # A tibble: 2,500 x 4
# unit type incr decr
# <int> <chr> <int> <int>
# 1 1 X1 1 2
# 2 2 X1 1 2
# 3 3 X1 3 0
# 4 4 X1 1 2
# 5 5 X1 3 0
# 6 6 X1 1 2
# 7 7 X1 1 2
# 8 8 X1 2 1
# 9 9 X1 1 2
#10 10 X1 2 1
# # ... with 2,490 more rows
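With dplyr 1.0.0 or later, the same counts can be written with across() in place of summarise_at(). A sketch on a small hand-made data frame so the result is easy to check by eye; small, count_incr and count_decr are illustrative names. It uses diff() rather than lead(), which compares consecutive values the same way when there are no missing values, and it assumes rows should be ordered by year within each unit.

```r
library(dplyr)

small <- data.frame(unit = c(1, 1, 1, 2, 2, 2),
                    year = c(2012, 2013, 2014, 2012, 2013, 2014),
                    X1 = c(1, 2, 3, 3, 2, 1),
                    X2 = c(5, 5, 6, 1, 3, 2))

count_incr <- function(x) sum(diff(x) > 0)  # year-to-year increases
count_decr <- function(x) sum(diff(x) < 0)  # year-to-year decreases

res <- small %>%
  arrange(unit, year) %>%          # make sure years are in order
  group_by(unit) %>%
  summarise(across(starts_with("X"),
                   list(incr = count_incr, decr = count_decr)))
print(res)
```

The new columns are named by joining the original column name and the function name, e.g. X1_incr and X1_decr.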
I hope I understand question (a) correctly: for each row, you want to see how many times the value increases (first from X1 to X2, then from X2 to X3, and so on).
I am using apply to iterate over each row. It overlays the second through last values on the first through second-to-last values and checks whether each value is bigger or smaller than the one before it. Summing the resulting boolean values gives the number of increases or decreases in that row. Note the switch from '>' to '<'.
increases <- apply(df[,3:12], 1, function(x) {sum(x[2:length(x)] > x[1:(length(x)-1)])})
decreases <- apply(df[,3:12], 1, function(x) {sum(x[2:length(x)] < x[1:(length(x)-1)])})
For question (b) you can subtract the subset where year equals 2012 from the subset where year equals 2013, and test whether values are bigger than 0 for increases and smaller than 0 for decreases. Then use colSums to see for how many 'units' the increase or decrease is true.
Increase:
colSums((subset(df, year==2013) - subset(df, year==2012))>0)[3:12]
Decrease:
colSums((subset(df, year==2013) - subset(df, year==2012))<0)[3:12]

The dplyr way to get grouped differences

I am trying to figure out the dplyr way to do grouped differences.
Here is some fake data:
crossing(year=seq(1,4),week=seq(1,3)) %>%
mutate(value = c(rep(4,3),rep(3,3),rep(2,3),rep(1,3)))
year week value
<int> <int> <dbl>
1 1 1 4
2 1 2 4
3 1 3 4
4 2 1 3
5 2 2 3
6 2 3 3
7 3 1 2
8 3 2 2
9 3 3 2
10 4 1 1
11 4 2 1
12 4 3 1
What I would like is year1-year2, year2-year3, and year3-year4. The result would look like the following.
year week diffs
<int> <int> <dbl>
1 1 1 1
2 1 2 1
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 1
8 3 2 1
9 3 3 1
Edit:
I apologize. I was trying to make a simple reprex, but I messed up a lot.
Please let me know what the proper etiquette is. I don't want to ruffle any feathers.
I did not know that -diff() was a function. What I am actually looking for is percent difference ((new-old)/old)*100 and I am not able to find a straight forward way to use diff to get that value.
I am starting from the largest year. Adding an arrange(desc(year)) to the above code is what I have. I would be trimming the smallest year, not the largest.
If this edit is worth a separate question, let me know.
If you don't have missing years for each week:
df %>%
arrange(year) %>%
group_by(week) %>%
mutate(diffs = value - lead(value)) %>%
na.omit() %>% select(-value)
# A tibble: 9 x 3
# Groups: week [3]
# year week diffs
# <int> <int> <dbl>
#1 1 1 1
#2 1 2 1
#3 1 3 1
#4 2 1 1
#5 2 2 1
#6 2 3 1
#7 3 1 1
#8 3 2 1
#9 3 3 1
You can use diff, but it needs adjusting, as it subtracts the other way and returns a vector that's one shorter than what it's passed:
library(tidyverse)
diffed <- crossing(year = seq(1,4),
week = seq(1,3)) %>%
mutate(value = rep(4:1, each = 3)) %>%
group_by(week) %>%
mutate(value = c(-diff(value), NA)) %>%
drop_na(value)
diffed
#> # A tibble: 9 x 3
#> # Groups: week [3]
#> year week value
#> <int> <int> <int>
#> 1 1 1 1
#> 2 1 2 1
#> 3 1 3 1
#> 4 2 1 1
#> 5 2 2 1
#> 6 2 3 1
#> 7 3 1 1
#> 8 3 2 1
#> 9 3 3 1
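The two adjustments are easy to check on a single vector. A tiny base R illustration (value stands in for one week's values across years 1 to 4; d and padded are illustrative names):

```r
value <- c(4, 3, 2, 1)

d <- diff(value)          # later minus earlier, one element shorter
padded <- c(-d, NA)       # flip the sign and pad to the original length

print(d)
print(padded)
```

The trailing NA lands on the last year, which is the row that drop_na() then removes.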
Using dplyr and do:
library(dplyr)
df %>% group_by(week) %>% do(cbind(.[-nrow(.),1:2],diffs=-diff(.$value)))
# # A tibble: 9 x 3
# # Groups: week [3]
# year week diffs
# <int> <int> <dbl>
# 1 1 1 1
# 2 2 1 1
# 3 3 1 1
# 4 1 2 1
# 5 2 2 1
# 6 3 2 1
# 7 1 3 1
# 8 2 3 1
# 9 3 3 1
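For the percent difference asked for in the edit, a minimal sketch with lag(), assuming "old" means the previous year within the same week. If the comparison should instead start from the largest year, sort with desc(year) before grouping. The pct_diff column name is made up for the example.

```r
library(dplyr)

df <- data.frame(year = rep(1:4, each = 3),
                 week = rep(1:3, times = 4),
                 value = rep(4:1, each = 3))

pct <- df %>%
  arrange(week, year) %>%
  group_by(week) %>%
  mutate(pct_diff = (value - lag(value)) / lag(value) * 100) %>%
  ungroup() %>%
  filter(!is.na(pct_diff))   # the first year of each week has no "old" value
print(pct)
```

Computing the ratio directly from lag() avoids having to reshape the output of diff().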

Resources