I'm trying to create a rank indicator over two columns, in this case both Account and DATE.
For example:
df <- data.frame(
Account = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3),
DATE = c(201901, 201902, 201903, 201904, 201902, 201903, 201904, 201905, 201906, 201907, 201904, 201905))
> df
Account DATE
1 201901
1 201902
1 201903
1 201904
2 201902
2 201903
2 201904
2 201905
2 201906
2 201907
3 201904
3 201905
I've tried rank() and order(), as well as rank(rank()) and order(order()), but with no luck:
df <- df %>%
mutate("rank" = rank(Account, DATE))
Account DATE rank
1 201901 2.5
1 201902 2.5
1 201903 2.5
1 201904 2.5
2 201902 7.5
2 201903 7.5
2 201904 7.5
2 201905 7.5
2 201906 7.5
2 201907 7.5
3 201904 11.5
3 201905 11.5
But what I want is to rank the dates descending within each account, so it should look like this:
Account DATE RANK
1 201901 4
1 201902 3
1 201903 2
1 201904 1
2 201902 6
2 201903 5
2 201904 4
2 201905 3
2 201906 2
2 201907 1
3 201904 2
3 201905 1
library("dplyr")
df %>%
  group_by(Account) %>%
  mutate("rank" = rank(desc(DATE)))
#> # A tibble: 12 x 3
#> # Groups: Account [3]
#> Account DATE rank
#> <dbl> <dbl> <dbl>
#> 1 1 201901 4
#> 2 1 201902 3
#> 3 1 201903 2
#> 4 1 201904 1
#> 5 2 201902 6
#> 6 2 201903 5
#> 7 2 201904 4
#> 8 2 201905 3
#> 9 2 201906 2
#> 10 2 201907 1
#> 11 3 201904 2
#> 12 3 201905 1
Created on 2020-03-09 by the reprex package (v0.3.0.9001)
We can use a descending order to create the ranks:
library(dplyr)
df %>%
group_by(Account) %>%
mutate("rank" = order(DATE, decreasing = TRUE))
Output:
# A tibble: 12 x 3
# Groups: Account [3]
Account DATE rank
<dbl> <dbl> <int>
1 1 201901 4
2 1 201902 3
3 1 201903 2
4 1 201904 1
5 2 201902 6
6 2 201903 5
7 2 201904 4
8 2 201905 3
9 2 201906 2
10 2 201907 1
11 3 201904 2
12 3 201905 1
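Note that order() returns row positions rather than ranks, so it coincides with the descending ranks here only because each account's dates already appear in ascending order. If the rows may arrive unsorted, a rank-based sketch using dplyr's min_rank() with desc() is safer:

```r
library(dplyr)

df <- data.frame(
  Account = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3),
  DATE = c(201901, 201902, 201903, 201904, 201902, 201903, 201904,
           201905, 201906, 201907, 201904, 201905))

# min_rank(desc(DATE)) ranks dates descending within each account,
# regardless of the physical row order inside the group
ranked <- df %>%
  group_by(Account) %>%
  mutate(rank = min_rank(desc(DATE))) %>%
  ungroup()
```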
Here you go:
df <- df %>% group_by(Account) %>% mutate(ranking = rank(desc(DATE)))
In base R:
# split the row indices by account (grep works here because the
# account ids are single digits), then order each account's dates
# descending -- this relies on the dates already being sorted ascending
sortdata <- lapply(1:3, grep, df[, 1])
for (i in sortdata) {
  df[i, 3] <- order(df[i, 2], decreasing = TRUE)
}
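A vectorized base R alternative, sketched with ave() plus rank() on the negated dates; it avoids the loop and does not depend on the rows being pre-sorted:

```r
df <- data.frame(
  Account = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3),
  DATE = c(201901, 201902, 201903, 201904, 201902, 201903, 201904,
           201905, 201906, 201907, 201904, 201905))

# rank(-x) ranks descending; ave() applies it within each Account
df$rank <- ave(df$DATE, df$Account, FUN = function(x) rank(-x))
```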
I have a data frame like this:
plazo monto
   20     2
   50     3
I need to add rows for values from 1 to the value of plazo and expand my data frame like below:
plazo monto Semana
   20     2      1
   20     2      2
   20     2      3
   20     2      …
   20     2     20
   50     3      1
   50     3      2
   50     3      3
   50     3      …
   50     3     50
We can create a list column with values from 1:plazo for each row and then unnest that column.
df1 <- data.frame(plazo = c(2, 5), monto = c(2,3))
library(tidyverse)
df1 %>%
rowwise() %>%
mutate(Semana = list(1:plazo)) %>%
unnest(Semana)
#> # A tibble: 7 x 3
#> plazo monto Semana
#> <dbl> <dbl> <int>
#> 1 2 2 1
#> 2 2 2 2
#> 3 5 3 1
#> 4 5 3 2
#> 5 5 3 3
#> 6 5 3 4
#> 7 5 3 5
We may use uncount:
library(dplyr)
library(tidyr)
df1 %>%
uncount(plazo, .id = 'Semana', .remove = FALSE)
Output:
plazo monto Semana
1 2 2 1
2 2 2 2
3 5 3 1
4 5 3 2
5 5 3 3
6 5 3 4
7 5 3 5
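For comparison, a base R sketch of the same expansion: repeat each row plazo times with rep(), then number the copies with sequence():

```r
df1 <- data.frame(plazo = c(2, 5), monto = c(2, 3))

# repeat row i plazo[i] times; sequence() restarts the counter
# (1, 2, ..., plazo[i]) for each original row
out <- df1[rep(seq_len(nrow(df1)), df1$plazo), ]
out$Semana <- sequence(df1$plazo)
rownames(out) <- NULL
```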
I want to create a lag variable for a value that is nested in three groups:
For example:
df <- data.frame(wave = c(1,1,1,1,1,1,2,2,2,2,2,2),
party = rep(c("A", "A", "A", "B", "B", "B"), 2),
inc = rep(c(1,2,3), 4),
value = c(1, 10, 100, 3, 30, 300, 6, 60, 600, 7, 70, 700))
Data:
wave party inc value
1 1 A 1 1
2 1 A 2 10
3 1 A 3 100
4 1 B 1 3
5 1 B 2 30
6 1 B 3 300
7 2 A 1 6
8 2 A 2 60
9 2 A 3 600
10 2 B 1 7
11 2 B 2 70
12 2 B 3 700
What I need is the following:
wave party inc value lag
1 1 A 1 1 NA
2 1 A 2 10 NA
3 1 A 3 100 NA
4 1 B 1 3 NA
5 1 B 2 30 NA
6 1 B 3 300 NA
7 2 A 1 6 1
8 2 A 2 60 10
9 2 A 3 600 100
10 2 B 1 7 3
11 2 B 2 70 30
12 2 B 3 700 300
Where a respondent of income group (inc) 1, of party A in wave 2 has the lagged value of inc 1, party A in wave 1, etc.
I tried:
df %>% group_by(wave) %>% mutate(lag = lag(value))
Which gives me:
wave party inc value lag
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 1 1 NA
2 1 A 2 10 1
3 1 A 3 100 10
4 1 B 1 3 100
5 1 B 2 30 3
6 1 B 3 300 30
7 2 A 1 6 NA
8 2 A 2 60 6
9 2 A 3 600 60
10 2 B 1 7 600
11 2 B 2 70 7
12 2 B 3 700 70
I tried:
df %>% group_by(party, wave) %>% mutate(lag = lag(value))
Which gives me:
wave party inc value lag
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 1 1 NA
2 1 A 2 10 1
3 1 A 3 100 10
4 1 B 1 3 NA
5 1 B 2 30 3
6 1 B 3 300 30
7 2 A 1 6 NA
8 2 A 2 60 6
9 2 A 3 600 60
10 2 B 1 7 NA
11 2 B 2 70 7
12 2 B 3 700 70
I tried:
df %>% group_by(party, wave, inc) %>% mutate(lag = lag(value))
Which gives me:
wave party inc value lag
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 1 1 NA
2 1 A 2 10 NA
3 1 A 3 100 NA
4 1 B 1 3 NA
5 1 B 2 30 NA
6 1 B 3 300 NA
7 2 A 1 6 NA
8 2 A 2 60 NA
9 2 A 3 600 NA
10 2 B 1 7 NA
11 2 B 2 70 NA
12 2 B 3 700 NA
I can continue like this. I tried different versions using df %>% arrange() and the order_by() function within lag. But for some reason I cannot figure out how to get the right lagged variable.
You could achieve your desired result by grouping only by party and inc:
library(dplyr)
df <- data.frame(wave = c(1,1,1,1,1,1,2,2,2,2,2,2),
party = rep(c("A", "A", "A", "B", "B", "B"), 2),
inc = rep(c(1,2,3), 4),
value = c(1, 10, 100, 3, 30, 300, 6, 60, 600, 7, 70, 700))
df %>%
group_by(party, inc) %>%
mutate(lag = lag(value)) %>%
ungroup()
#> # A tibble: 12 x 5
#> wave party inc value lag
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 A 1 1 NA
#> 2 1 A 2 10 NA
#> 3 1 A 3 100 NA
#> 4 1 B 1 3 NA
#> 5 1 B 2 30 NA
#> 6 1 B 3 300 NA
#> 7 2 A 1 6 1
#> 8 2 A 2 60 10
#> 9 2 A 3 600 100
#> 10 2 B 1 7 3
#> 11 2 B 2 70 30
#> 12 2 B 3 700 300
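One caveat worth noting: grouping by party and inc gives the right lag here because the rows are already ordered by wave within each group. If that ordering is not guaranteed, dplyr's lag() accepts an order_by argument; a sketch:

```r
library(dplyr)

df <- data.frame(wave = c(1,1,1,1,1,1,2,2,2,2,2,2),
                 party = rep(c("A", "A", "A", "B", "B", "B"), 2),
                 inc = rep(c(1,2,3), 4),
                 value = c(1, 10, 100, 3, 30, 300, 6, 60, 600, 7, 70, 700))

# order_by = wave makes the time ordering explicit instead of
# depending on the physical row order within each group
res <- df %>%
  group_by(party, inc) %>%
  mutate(lag = lag(value, order_by = wave)) %>%
  ungroup()
```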
I have the data frame
df <- data.frame(e_1=c(1,2,3,4,5), e_2=c(1,3,5,7,9), e_3=c(2,4,6,8,1),
e_4=c(1,2,4,5,7), e_5=c(1,8,9,6,4), Lanes=c(3,4,3,5,4))
I tried to use:
max(combn(df[,(1:5)],df$Lanes,FUN = function(i) rowSums(df[,(1:5)][i])))
I get the error
Error in combn(df[, (1:5)], df$Lanes, FUN = function(i) rowSums(df[, (1:5)][i])) : length(m) == 1L is not TRUE
I guess you can try using combn row-wise, e.g.,
df$comb <- apply(df,1,function(v) max(combn(v[1:5],v["Lanes"],sum)))
such that
> df
e_1 e_2 e_3 e_4 e_5 Lanes comb
1 1 1 2 1 1 3 4
2 2 3 4 2 8 4 17
3 3 5 6 4 9 3 20
4 4 7 8 5 6 5 30
5 5 9 1 7 4 4 25
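One thing to keep in mind with apply(df, 1, ...): it converts each row to a vector, which would coerce everything to character if the data frame ever gained a non-numeric column. An index-based sketch avoids that coercion:

```r
df <- data.frame(e_1 = c(1,2,3,4,5), e_2 = c(1,3,5,7,9), e_3 = c(2,4,6,8,1),
                 e_4 = c(1,2,4,5,7), e_5 = c(1,8,9,6,4), Lanes = c(3,4,3,5,4))

# for each row, take the e_* values as a numeric vector and pick the
# Lanes-sized combination with the largest sum
df$comb <- sapply(seq_len(nrow(df)), function(i)
  max(combn(unlist(df[i, 1:5]), df$Lanes[i], sum)))
```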
A solution using dplyr and purrr could look as follows.
library(dplyr)
library(purrr)
library(rlang)
df <- data.frame(e_1=c(1,2,3,4,5), e_2=c(1,3,5,7,9), e_3=c(2,4,6,8,1),
e_4=c(1,2,4,5,7), e_5=c(1,8,9,6,4), Lanes=c(3,4,3,5,4))
df %>%
mutate(eSum = pmap(list(!!!parse_exprs(colnames(.))),
~ max(colSums(combn(c(..1, ..2, ..3, ..4, ..5), ..6)))))
# e_1 e_2 e_3 e_4 e_5 Lanes eSum
# 1 1 1 2 1 1 3 4
# 2 2 3 4 2 8 4 17
# 3 3 5 6 4 9 3 20
# 4 4 7 8 5 6 5 30
# 5 5 9 1 7 4 4 25
An option with c_across from dplyr:
library(dplyr)
df %>%
rowwise %>%
mutate(Comb = max(combn(c_across(starts_with('e')), Lanes, FUN = sum)))
# A tibble: 5 x 7
# Rowwise:
# e_1 e_2 e_3 e_4 e_5 Lanes Comb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 2 1 1 3 4
#2 2 3 4 2 8 4 17
#3 3 5 6 4 9 3 20
#4 4 7 8 5 6 5 30
#5 5 9 1 7 4 4 25
I have two datasets- one is a baseline and the other is a follow up dataset.
DF1 is the baseline (cross-sectional) data with id, date, score1, score2, level, and grade.
DF2 has id, date, score1, and score2, in a long format with multiple rows per id.
df1 <- as.data.frame(cbind(id = c(1,2,3),
date = c("2020-06-03","2020-07-02","2020-06-11"),
score1 =c(6,8,5),
score2=c(1,1,6),
baselevel=c(2,2,2),
basegrade=c("B","B","A")))
df2 <- as.data.frame(cbind(id =c(1,1,1,1,2,2,2,3,3,3),
date = c("2020-06-10","2020-06-17","2020-06-24",
"2020-07-01", "2020-07-03", "2020-07-10","2020-07-17", "2020-06-14",
"2020-06-22", "2020-06-29"),
score1 = c(3,1,7,8,8,6,5,5,3,5),
score2 = c(1,4,5,4,1,1,2,6,7,1)) )
This is what I want as a result of merging the two dfs.
id date score1 score 2 baselevel basegrade
1 2020-06-03 6 1 2 "B"
1 2020-06-10 3 1 2 "B"
1 2020-06-17 1 4 2 "B"
1 2020-06-24 7 5 2 "B"
1 2020-07-01 8 4 2 "B"
2 2020-07-02 8 1 2 "B"
2 2020-07-03 8 1 2 "B"
2 2020-07-10 6 1 2 "B"
2 2020-07-17 5 2 2 "B"
3 2020-06-11 5 6 1 "A"
3 2020-06-14 5 6 1 "A"
3 2020-06-22 3 7 1 "A"
3 2020-06-29 5 1 1 "A"
I tried the two different merge calls below, but I still get NAs. What am I missing here?
Any help would be appreciated!
dfcombined1 <- merge(df1, df2, by=c("id","date"), all= TRUE)
dfcombined2 <- merge(df1, df2, by=intersect(names(df1), names(df2)), all= TRUE)
You can use bind_rows() in dplyr.
library(dplyr)
library(tidyr)
bind_rows(df1, df2) %>%
group_by(id) %>%
fill(starts_with("base"), .direction = "updown") %>%
arrange(date, .by_group = T)
# # A tibble: 13 x 6
# # Groups: id [3]
# id date score1 score2 baselevel basegrade
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2020-06-03 6 1 2 B
# 2 1 2020-06-10 3 1 2 B
# 3 1 2020-06-17 1 4 2 B
# 4 1 2020-06-24 7 5 2 B
# 5 1 2020-07-01 8 4 2 B
# 6 2 2020-07-02 8 1 2 B
# 7 2 2020-07-03 8 1 2 B
# 8 2 2020-07-10 6 1 2 B
# 9 2 2020-07-17 5 2 2 B
# 10 3 2020-06-11 5 6 2 A
# 11 3 2020-06-14 5 6 2 A
# 12 3 2020-06-22 3 7 2 A
# 13 3 2020-06-29 5 1 2 A
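Incidentally, every column prints as <chr> because the example data was built with as.data.frame(cbind(...)), which first creates a character matrix. Building the frames with data.frame() directly keeps the scores numeric; a self-contained sketch of the same bind_rows() + fill() approach:

```r
library(dplyr)
library(tidyr)

# data.frame() keeps numeric columns numeric; cbind() would coerce
# everything to character via an intermediate matrix
df1 <- data.frame(id = c(1, 2, 3),
                  date = c("2020-06-03", "2020-07-02", "2020-06-11"),
                  score1 = c(6, 8, 5), score2 = c(1, 1, 6),
                  baselevel = c(2, 2, 2), basegrade = c("B", "B", "A"))
df2 <- data.frame(id = c(1,1,1,1,2,2,2,3,3,3),
                  date = c("2020-06-10","2020-06-17","2020-06-24","2020-07-01",
                           "2020-07-03","2020-07-10","2020-07-17",
                           "2020-06-14","2020-06-22","2020-06-29"),
                  score1 = c(3,1,7,8,8,6,5,5,3,5),
                  score2 = c(1,4,5,4,1,1,2,6,7,1))

res <- bind_rows(df1, df2) %>%
  group_by(id) %>%
  fill(baselevel, basegrade, .direction = "updown") %>%
  arrange(date, .by_group = TRUE) %>%
  ungroup()
```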
I think you are looking for this:
library(tidyverse)
#Code
df1 %>% bind_rows(df2) %>% arrange(id) %>% group_by(id) %>% fill(baselevel) %>% fill(basegrade)
Output:
# A tibble: 13 x 6
# Groups: id [3]
id date score1 score2 baselevel basegrade
<fct> <fct> <fct> <fct> <fct> <fct>
1 1 2020-06-03 6 1 2 B
2 1 2020-06-10 3 1 2 B
3 1 2020-06-17 1 4 2 B
4 1 2020-06-24 7 5 2 B
5 1 2020-07-01 8 4 2 B
6 2 2020-07-02 8 1 2 B
7 2 2020-07-03 8 1 2 B
8 2 2020-07-10 6 1 2 B
9 2 2020-07-17 5 2 2 B
10 3 2020-06-11 5 6 2 A
11 3 2020-06-14 5 6 2 A
12 3 2020-06-22 3 7 2 A
13 3 2020-06-29 5 1 2 A
A double merge, maybe:
merge(merge(df1[-(3:4)], df2, all.y=TRUE)[-(3:4)], df1[-(2:4)], all.x=TRUE)
# id date score1 score2 baselevel basegrade
# 1 1 2020-06-10 3 1 2 B
# 2 1 2020-06-17 1 4 2 B
# 3 1 2020-06-24 7 5 2 B
# 4 1 2020-07-01 8 4 2 B
# 5 2 2020-07-03 8 1 2 B
# 6 2 2020-07-10 6 1 2 B
# 7 2 2020-07-17 5 2 2 B
# 8 3 2020-06-14 5 6 2 A
# 9 3 2020-06-22 3 7 2 A
# 10 3 2020-06-29 5 1 2 A
I have a question about a dataset with financial transactions:
Account_from Account_to Value
1 1 2 25.0
2 1 3 30.0
3 2 1 28.0
4 2 3 10.0
5 2 3 12.0
6 3 1 40.0
7 3 1 30.0
8 3 1 20.0
Each row represents a transaction. I would like to create an extra column with a variable containing the information of the number of interactions with each unique account.
That it would look like the following:
Account_from Account_to Value Count_interactions_out Count_interactions_in
1 1 2 25.0 2 2
2 1 3 30.0 2 2
3 2 1 28.0 2 1
4 2 3 10.0 2 1
5 2 3 12.0 2 1
6 3 1 40.0 1 2
7 3 1 30.0 1 2
8 3 1 20.0 1 2
Account 3 only sends to account 1, therefore Count_interactions_out is 1. However, it receives interactions from accounts 1 and 2, therefore Count_interactions_in is 2.
How can I apply this to the whole dataset?
Thanks
Here's an approach using dplyr:
library(dplyr)
financial.data %>%
group_by(Account_from) %>%
mutate(Count_interactions_out = nlevels(factor(Account_to))) %>%
ungroup() %>%
group_by(Account_to) %>%
mutate(Count_interactions_in = nlevels(factor(Account_from))) %>%
ungroup()
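The nlevels(factor(...)) idiom can be written more directly with dplyr's n_distinct(). A sketch of the same approach, with the data recreated so it is self-contained (note that, like the answer above, it attaches the in-count to the receiving account, Account_to, on each row):

```r
library(dplyr)

financial.data <- data.frame(
  Account_from = c(1, 1, 2, 2, 2, 3, 3, 3),
  Account_to   = c(2, 3, 1, 3, 3, 1, 1, 1),
  Value        = c(25, 30, 28, 10, 12, 40, 30, 20))

# n_distinct() counts the unique partner accounts within each group
res <- financial.data %>%
  group_by(Account_from) %>%
  mutate(Count_interactions_out = n_distinct(Account_to)) %>%
  group_by(Account_to) %>%
  mutate(Count_interactions_in = n_distinct(Account_from)) %>%
  ungroup()
```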
Here is a base R solution using ave():
df <- cbind(df,
  with(df, list(
    Count_interactions_out = ave(Account_to, Account_from, FUN = function(x) length(unique(x))),
    Count_interactions_in = ave(Account_from, Account_to, FUN = function(x) length(unique(x)))[match(Account_from, Account_to)])))
such that
> df
Account_from Account_to Value Count_interactions_out Count_interactions_in
1 1 2 25 2 2
2 1 3 30 2 2
3 2 1 28 2 1
4 2 3 10 2 1
5 2 3 12 2 1
6 3 1 40 1 2
7 3 1 30 1 2
8 3 1 20 1 2
or
df <- within(df, list(
  Count_interactions_out <- ave(Account_to, Account_from, FUN = function(x) length(unique(x))),
  Count_interactions_in <- ave(Account_from, Account_to, FUN = function(x) length(unique(x)))[match(Account_from, Account_to)]))
such that
> df
Account_from Account_to Value Count_interactions_in Count_interactions_out
1 1 2 25 2 2
2 1 3 30 2 2
3 2 1 28 1 2
4 2 3 10 1 2
5 2 3 12 1 2
6 3 1 40 2 1
7 3 1 30 2 1
8 3 1 20 2 1