Calculate direct dependencies among values of a dataframe in R

A data frame is given and the objective is to calculate the direct dependency value between two columns of the data frame.
c1 c2 N
a b 30
a c 5
a d 10
c a 5
b a 10
What we are looking for is the direct dependency value for each pair of values; for example, for a and b this value is ab - ba = 20.
The final result should be like this:
c1 c2 N DepValue
a b 30 ab - ba = 20
a c 5 ac - ca = 0
a d 10 ad - 0 = 10
c a 5 ca - ac = 0
b a 10 ba - ab = 20
Thank you for your help.

D <- read.table(header=TRUE, stringsAsFactors = FALSE, text=
"c1 c2 N
a b 30
a c 5
a d 10
c a 5
b a 10")
N12 <- D$N
names(N12) <- paste0(D$c1, D$c2)
N21 <- N12[paste0(D$c2, D$c1)]
D$depValue <- D$N - ifelse(is.na(N21), 0, N21)
result:
> D
c1 c2 N depValue
1 a b 30 20
2 a c 5 0
3 a d 10 10
4 c a 5 0
5 b a 10 -20
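Note that this gives -20 for the b-a row, whereas the question's expected output lists 20. A minimal sketch of the same reverse-pair lookup written with match(), assuming each (c1, c2) pair occurs only once; the names rev_idx, rev_N and absDepValue are hypothetical and only illustrate how the unsigned value could be obtained:

```r
D <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
"c1 c2 N
a b 30
a c 5
a d 10
c a 5
b a 10")

# For every row, find the row holding the reversed pair (c2, c1); NA if none.
rev_idx <- match(paste(D$c2, D$c1), paste(D$c1, D$c2))

# N of the reversed pair, or 0 when the reversed pair does not exist.
rev_N <- ifelse(is.na(rev_idx), 0, D$N[rev_idx])

D$depValue    <- D$N - rev_N       # signed:   20 0 10 0 -20
D$absDepValue <- abs(D$depValue)   # unsigned: 20 0 10 0  20
D
```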

One option is to group by the pmin and pmax of c1 and c2, so that a-b and b-a fall into the same group, and take the difference between the two values of N. This returns NA for groups with only one row, so we replace those NAs with the first value in the group.
library(dplyr)
df %>%
group_by(group1 = pmin(c1, c2), group2 = pmax(c1, c2)) %>%
mutate(dep = N[1] - N[2],
dep = replace(dep, is.na(dep), N[1])) %>%
ungroup() %>%
select(-group1, -group2)
# c1 c2 N dep
# <chr> <chr> <int> <int>
#1 a b 30 20
#2 a c 5 0
#3 a d 10 10
#4 c a 5 0
#5 b a 10 20

An idea via base R is to build an order-independent key from columns c1 and c2, split on that key and subtract N within each group, i.e.
i1 <- paste(pmin(df$c1, df$c2), pmax(df$c1, df$c2))
i1
#[1] "a b" "a c" "a d" "a c" "a b"
do.call(rbind, lapply(split(df, i1), function(i) {i['DepValue'] <- Reduce(`-`, i$N); i}))
# c1 c2 N DepValue
#a b.1 a b 30 20
#a b.5 b a 10 20
#a c.2 a c 5 0
#a c.4 c a 5 0
#a d a d 10 10
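The split/rbind approach above reorders the rows by group. If the original row order matters, the same grouping key can probably be fed to ave() instead. This is only a sketch under the assumption that df holds the question's data; key is a hypothetical name:

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
"c1 c2 N
a b 30
a c 5
a d 10
c a 5
b a 10")

# Order-independent key: "a b" for both a-b and b-a.
key <- paste(pmin(df$c1, df$c2), pmax(df$c1, df$c2))

# ave() applies the function per group and writes the result back
# in the original row order (a single value is recycled over the group).
df$DepValue <- ave(df$N, key, FUN = function(n) Reduce(`-`, n))
df$DepValue
# [1] 20  0 10  0 20
```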

Related

R: creating combinations of elements within a group and adding up numbers associated with combinations in a new data frame

I have the following dataset:
Letter ID Number
A A1 1
A A2 2
A A3 3
B B1 1
B B2 2
B B3 3
B B4 4
My aim is first to create all possible combinations of IDs within the same "Letter" group. For example, for the letter A, it would be only three combinations: A1-A2, A2-A3, and A1-A3. The same IDs ordered differently don't count as a new combination, so for example A1-A2 is the same as A2-A1.
Then, within those combinations, I want to add up the numbers from the "Number" column associated with those IDs. So for the combination A1-A2, which are associated with 1 and 2 in the "Number" column, this would result in the number 1+2=3.
Finally, I want to place the ID combinations, added numbers and original Letter in a new data frame. Something like this:
Letter Combination Add.Number
A A1-A2 3
A A2-A3 5
A A1-A3 4
B B1-B2 3
B B2-B3 5
B B3-B4 7
B B1-B3 4
B B2-B4 6
B B1-B4 5
How can I do this in R, ideally using the package dplyr?
library(dplyr)
letter <- c("A","A","A","B","B","B","B")
df <-
data.frame(letter) %>%
group_by(letter) %>%
mutate(
number = row_number(),
id = paste0(letter,number)
)
df %>%
full_join(df,by = "letter") %>%
filter(number.x < number.y) %>%
mutate(
combination = paste0(id.x,"-",id.y),
add_number = number.x + number.y) %>%
select(letter,combination,add_number)
# A tibble: 9 x 3
# Groups: letter [2]
letter combination add_number
<chr> <chr> <int>
1 A A1-A2 3
2 A A1-A3 4
3 A A2-A3 5
4 B B1-B2 3
5 B B1-B3 4
6 B B1-B4 5
7 B B2-B3 5
8 B B2-B4 6
9 B B3-B4 7
In base R, using combn:
df <- data.frame(
Letter = c("A","A","A","B","B","B","B"),
Id = c("A1","A2","A3","B1","B2","B3","B4"),
Number = c(1,2,3,1,2,3,4))
# combinations
l<-lapply(split(df$Id, df$Letter) ,function(x)
setNames(data.frame(t(combn(x,2))), c("L1","L2")))
n<-lapply(split(df$Number, df$Letter) ,function(x)
setNames(data.frame(t(combn(x,2))), c("N1","N2")))
# rbind all
result <- do.call(rbind, mapply(cbind, Letter=names(l), l, n, SIMPLIFY = FALSE))
result$combination <- paste(result$L1, result$L2, sep="-")
result$sum <- result$N1 + result$N2
result
#> Letter L1 L2 N1 N2 combination sum
#> A.1 A A1 A2 1 2 A1-A2 3
#> A.2 A A1 A3 1 3 A1-A3 4
#> A.3 A A2 A3 2 3 A2-A3 5
#> B.1 B B1 B2 1 2 B1-B2 3
#> B.2 B B1 B3 1 3 B1-B3 4
#> B.3 B B1 B4 1 4 B1-B4 5
#> B.4 B B2 B3 2 3 B2-B3 5
#> B.5 B B2 B4 2 4 B2-B4 6
#> B.6 B B3 B4 3 4 B3-B4 7
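A more compact variant is possible because combn() accepts a function that is applied to each combination. The sketch below assumes the same example data; the output column names Combination and Add.Number are taken from the question, and res is a hypothetical name:

```r
df <- data.frame(
  Letter = c("A", "A", "A", "B", "B", "B", "B"),
  Id     = c("A1", "A2", "A3", "B1", "B2", "B3", "B4"),
  Number = c(1, 2, 3, 1, 2, 3, 4),
  stringsAsFactors = FALSE)

# For each Letter group, paste every pair of Ids and sum every pair of Numbers.
# Both calls enumerate the pairs in the same order, so the columns line up.
res <- do.call(rbind, lapply(split(df, df$Letter), function(g) {
  data.frame(
    Letter      = g$Letter[1],
    Combination = combn(g$Id, 2, paste, collapse = "-"),
    Add.Number  = combn(g$Number, 2, sum),
    stringsAsFactors = FALSE)
}))
rownames(res) <- NULL
res
```

One caveat: combn() treats a single positive integer as seq_len(x), so a group with only one row would need separate handling.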

R: expand grid of all possible combinations within groups and apply functions across all the pairs

data <- tibble(time = c(1,1,2,2), a = c(1,2,3,4), b =c(4,3,2,1), c = c(1,1,1,1))
The result will look like this
result <- tibble(
t = c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
firm1 = c("a","a","a","b","b","b","c","c","c","a","a","a","b","b","b","c","c","c"),
firm2 = c("a","b","c","a","b","c","a","b","c","a","b","c","a","b","c","a","b","c"),
value = c(6,10,5,10,14,9,5,9,4,14,10,9,10,6,5,9,5,4))
result
The function could be
function(x, y){sum(x, y)}
Basically I am looking for a tidy solution to expand.grid data at each point of time and apply functions across columns. Can anyone help?
I tried this, but I could not have time in front of the pairs.
expected_result<-expand.grid(names(data[-1]), names(data[-1])) %>%
mutate(value = map2(Var1, Var2, ~ fun1(data[.x], data[.y])))
expected_result
Use expand.grid to get all possible combinations of columns, split the data by time and apply fun to each row of tmp.
library(dplyr)
library(purrr)
tmp <- expand.grid(firm1 = names(data[-1]), firm2 = names(data[-1]))
fun <- function(x, y) sum(x, y)
result <- data %>%
group_split(time) %>%
map_df(~cbind(time = .x$time[1], tmp,
value = apply(tmp, 1, function(x) fun(.x[[x[1]]], .x[[x[2]]]))))
result
# time firm1 firm2 value
#1 1 a a 6
#2 1 b a 10
#3 1 c a 5
#4 1 a b 10
#5 1 b b 14
#6 1 c b 9
#7 1 a c 5
#8 1 b c 9
#9 1 c c 4
#10 2 a a 14
#11 2 b a 10
#12 2 c a 9
#13 2 a b 10
#14 2 b b 6
#15 2 c b 5
#16 2 a c 9
#17 2 b c 5
#18 2 c c 4
You may also do this in base R -
result <- do.call(rbind, by(data, data$time, function(x) {
cbind(time = x$time[1], tmp,
value = apply(tmp, 1, function(y) fun(x[[y[1]]], x[[y[2]]])))
}))
We may use
library(dplyr)
library(tidyr)
library(purrr)
data1 <- data %>%
group_by(time) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), .groups = 'drop') %>%
pivot_longer(cols = -time) %>%
group_split(time)
map_dfr(data1, ~ {dat <- .x
crossing(firm1 = dat$name, firm2 = dat$name) %>%
mutate(value = c(outer(dat$value, dat$value, FUN = `+`))) %>%
mutate(time = first(dat$time), .before = 1)})
-output
# A tibble: 18 × 4
time firm1 firm2 value
<dbl> <chr> <chr> <dbl>
1 1 a a 6
2 1 a b 10
3 1 a c 5
4 1 b a 10
5 1 b b 14
6 1 b c 9
7 1 c a 5
8 1 c b 9
9 1 c c 4
10 2 a a 14
11 2 a b 10
12 2 a c 9
13 2 b a 10
14 2 b b 6
15 2 b c 5
16 2 c a 9
17 2 c b 5
18 2 c c 4
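The outer() trick above works because sum(x, y) equals sum(x) + sum(y), so only the per-time column sums are needed. A base R sketch of the same idea, assuming a plain data.frame with the question's values; pieces and res are hypothetical names, and this shortcut would not carry over to a function that cannot be split into per-column terms:

```r
data <- data.frame(time = c(1, 1, 2, 2),
                   a = c(1, 2, 3, 4),
                   b = c(4, 3, 2, 1),
                   c = c(1, 1, 1, 1))

# One piece per time point, without the time column.
pieces <- split(data[-1], data$time)

res <- do.call(rbind, Map(function(g, t) {
  cs <- colSums(g)               # per-firm totals at this time point
  m  <- outer(cs, cs, `+`)       # m[i, j] = total of firm i + total of firm j
  data.frame(time  = as.numeric(t),
             firm1 = names(cs)[row(m)],
             firm2 = names(cs)[col(m)],
             value = c(m),
             stringsAsFactors = FALSE)
}, pieces, names(pieces)))
rownames(res) <- NULL
res
```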

Extract rows where value appears in any of multiple columns

Let's say I have two data.frames:
name_df = read.table(text = "player_name
a
b
c
d
e
f
g", header = T)
game_df = read.table(text = "game_id winner_name loser_name
1 a b
2 b a
3 a c
4 a d
5 b c
6 c d
7 d e
8 e f
9 f a
10 g f
11 g a
12 f e
13 a d", header = T)
name_df contains a unique list of all the winner_name and loser_name values in game_df. I want to create a new data.frame that has, for each person in name_df, one row for every game in which that name (e.g. a) appears in either the winner_name or the loser_name column.
So I essentially want to merge game_df with name_df, but the key column (name) can appear in either winner_name or loser_name.
So, for just a and b the final output would look something like:
final_df = read.table(text = "player_name game_id winner_name loser_name
a 1 a b
a 2 b a
a 3 a c
a 4 a d
a 9 f a
a 11 g a
a 13 a d
b 1 a b
b 2 b a
b 5 b c", header = T)
We can loop over the elements of 'player_name' in 'name_df' and filter the rows of 'game_df' where the name appears in either 'winner_name' or 'loser_name'.
library(dplyr)
library(purrr)
map_dfr(setNames(name_df$player_name, name_df$player_name),
~ game_df %>%
filter(winner_name %in% .x|loser_name %in% .x), .id = 'player_name')
Or if there are many columns, use if_any
map_dfr(setNames(name_df$player_name, name_df$player_name),
~ {
nm1 <- .x
game_df %>%
filter(if_any(c(winner_name, loser_name), ~ . %in% nm1))
}, .id = 'player_name')
Dedicated to our dear teacher and mentor @akrun
I think we can also make use of the add_row() function you first taught me the other day. Unbelievable!!!
library(dplyr)
library(purrr)
library(tibble)
game_df %>%
rowwise() %>%
mutate(player_name = winner_name) %>%
group_split(game_id) %>%
map_dfr(~ add_row(.x, game_id = .x$game_id, winner_name = .x$winner_name,
loser_name = .x$loser_name, player_name = .x$loser_name)) %>%
arrange(player_name) %>%
relocate(player_name)
# A tibble: 26 x 4
player_name game_id winner_name loser_name
<chr> <int> <chr> <chr>
1 a 1 a b
2 a 2 b a
3 a 3 a c
4 a 4 a d
5 a 9 f a
6 a 11 g a
7 a 13 a d
8 b 1 a b
9 b 2 b a
10 b 5 b c
# ... with 16 more rows
This can be directly expressed in SQL:
library(sqldf)
sqldf("select *
from name_df
left join game_df on winner_name = player_name or loser_name = player_name")
Without using purrr: I think this is an appropriate use case for tidyr::unite with the argument remove = F, where we first unite the winners' and losers' names and then use tidyr::separate_rows to split the new column into rows.
library(tidyr)
library(dplyr)
game_df %>% unite(Player_name, winner_name, loser_name, remove = F, sep = ', ') %>%
separate_rows(Player_name) %>%
relocate(Player_name) %>%
arrange(Player_name)
# A tibble: 26 x 4
Player_name game_id winner_name loser_name
<chr> <int> <chr> <chr>
1 a 1 a b
2 a 2 b a
3 a 3 a c
4 a 4 a d
5 a 9 f a
6 a 11 g a
7 a 13 a d
8 b 1 a b
9 b 2 b a
10 b 5 b c
# ... with 16 more rows
A base R approach:
result <- do.call(rbind, lapply(name_df$player_name, function(x)
cbind(playername = x,
subset(game_df, winner_name == x | loser_name == x))))
rownames(result) <- NULL
result
# playername game_id winner_name loser_name
#1 a 1 a b
#2 a 2 b a
#3 a 3 a c
#4 a 4 a d
#5 a 9 f a
#6 a 11 g a
#7 a 13 a d
#8 b 1 a b
#...
#...
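Since every game belongs to exactly two players, the same table can also be built without looping over names, by stacking one copy of game_df keyed on the winner and one keyed on the loser. This is a sketch only, assuming no game lists the same name as both winner and loser; long and final are hypothetical names:

```r
game_df <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
"game_id winner_name loser_name
1 a b
2 b a
3 a c
4 a d
5 b c
6 c d
7 d e
8 e f
9 f a
10 g f
11 g a
12 f e
13 a d")

# One copy of every game per participant: first keyed by winner, then by loser.
long <- rbind(
  data.frame(player_name = game_df$winner_name, game_df, stringsAsFactors = FALSE),
  data.frame(player_name = game_df$loser_name,  game_df, stringsAsFactors = FALSE))

# Sort by player, then by game, to match the layout asked for in the question.
final <- long[order(long$player_name, long$game_id), ]
rownames(final) <- NULL
head(final, 10)
```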

Count of unique elements of each row in a data frame in R

I have a data frame like below:
Group1 Group2 Group3 Group4
A B A B
A C B A
B B B B
A C B D
A D C A
I want to add a new column to the data frame which will have the count of unique elements in each row. Desired output:
Group1 Group2 Group3 Group4 Count
A B A B 2
A C B A 3
B B B B 1
A C B D 4
A D C A 3
I am able to find such a count for each row using
length(unique(c(df[,c(1,2,3,4)][1,])))
I want to do the same thing for all rows in the data frame. I tried apply() with var=1 but without success. Also, it would be great if you could provide a more elegant solution to this.
We can use apply with MARGIN = 1 to loop over the rows
df1$Count <- apply(df1, 1, function(x) length(unique(x)))
df1$Count
#[1] 2 3 1 4 3
Or using tidyverse
library(dplyr)
df1 %>%
rowwise() %>%
do(data.frame(., Count = n_distinct(unlist(.))))
# A tibble: 5 × 5
# Group1 Group2 Group3 Group4 Count
#* <chr> <chr> <chr> <chr> <int>
#1 A B A B 2
#2 A C B A 3
#3 B B B B 1
#4 A C B D 4
#5 A D C A 3
We can also use a regex to do this faster. It relies on the assumption that each cell holds only a single character.
nchar(gsub("(.)(?=.*?\\1)", "", do.call(paste0, df1), perl = TRUE))
#[1] 2 3 1 4 3
More detailed explanation is given here
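To see what the regex does, here is a small worked sketch on the pasted rows of the example data (rows and dedup are hypothetical names). The pattern matches any character that occurs again later in the string and deletes it, so exactly one copy of each distinct character survives:

```r
# Each string is one row of the example data with its cells pasted together.
rows <- c("ABAB", "ACBA", "BBBB", "ACBD", "ADCA")

# (.) captures one character; the lookahead (?=.*?\\1) succeeds only if the
# same character appears again further right, in which case it is removed.
dedup <- gsub("(.)(?=.*?\\1)", "", rows, perl = TRUE)
dedup
# [1] "AB"   "CBA"  "B"    "ACBD" "DCA"

# The number of remaining characters is the number of distinct values per row.
nchar(dedup)
# [1] 2 3 1 4 3
```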
Using duplicated in base R:
df$Count <- apply(df,1,function(x) sum(!duplicated(x)))
# Group1 Group2 Group3 Group4 Count
#1 A B A B 2
#2 A C B A 3
#3 B B B B 1
#4 A C B D 4
#5 A D C A 3
Although there are some great solutions above, you can also use data.table:
DATA:
df <- data.frame(g1 = c("A", "A", "B", "A", "A"), g2 = c("B", "C", "B", "C", "D"), g3 = c("A", "B", "B", "B", "C"), g4 = c("B", "A", "B", "D", "A"), stringsAsFactors = FALSE)
Code:
EDIT: After David Arenberg's comment, added (.I) instead of 1:nrow(df). Thanks for the valuable comments.
library(data.table)
setDT(df)[, id := .I ]
df[, count := uniqueN(c(g1, g2, g3, g4)), by=id ]
df
Output:
> df
g1 g2 g3 g4 id count
1: A B A B 1 2
2: A C B A 2 3
3: B B B B 3 1
4: A C B D 4 4
5: A D C A 5 3

R: Shift some rows by one column across table

I have data frame X that looks like this. It has 4 columns and 5 rows.
name age gender class
A 12 M C1
B 10 F C2
C M C1 N/A
D F C2 N/A
E F C1 N/A
I would like to shift all data from column 2 (age), from row 3 onward, one column to the right so that gender and class align, leaving the wrongly filled age cells blank. My resulting set should look like:
name age gender class
A 12 M C1
B 10 F C2
C N/A M C1
D N/A F C2
E N/A F C1
Please note: this is a situation from a very large dataset with 4 million records and 52 columns.
Any help will be much appreciated. Thanks in advance!
Like this:
nc <- ncol(dfr)
dfr[-(1:2), 3:nc] <- dfr[-(1:2), 2:(nc-1)]
dfr[-(1:2), 2] <- NA
The negative indices in the rows mean 'everything but rows 1 and 2'.
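On a 4-million-row table the affected rows are unlikely to be known by position, so they would have to be detected first. A minimal sketch on the sample data, under the assumption (not stated in the question) that a misaligned row is one whose last column is missing; bad is a hypothetical name:

```r
dfr <- data.frame(name   = LETTERS[1:5],
                  age    = c("12", "10", "M", "F", "F"),
                  gender = c("M", "F", "C1", "C2", "C1"),
                  class  = c("C1", "C2", NA, NA, NA),
                  stringsAsFactors = FALSE)

# Assumption: rows with a missing last column are the shifted ones.
bad <- is.na(dfr$class)
nc  <- ncol(dfr)

# Move columns 2..(nc-1) one position to the right for the bad rows only,
# then blank out the age cells that held misplaced values.
dfr[bad, 3:nc] <- dfr[bad, 2:(nc - 1)]
dfr[bad, 2]    <- NA

# age held a mix of numbers and letters, so it was character; convert it back.
dfr$age <- as.numeric(dfr$age)
dfr
```

If genuinely missing class values can occur in well-formed rows, a different test (for example, checking whether age is non-numeric) would be needed.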
> df <- data.frame("name" = LETTERS[1:5],
+ "age" = c(12, 10, "M","F","F"),
+ "gender" = c("M", "F", "C1", "C2", "C1"),
+ "class" = c("C1", "C2", NA,NA,NA))
> df
name age gender class
1 A 12 M C1
2 B 10 F C2
3 C M C1 <NA>
4 D F C2 <NA>
5 E F C1 <NA>
> df[3:nrow(df),3:ncol(df)] <- df[3:nrow(df),2:(ncol(df)-1)]
> df
name age gender class
1 A 12 M C1
2 B 10 F C2
3 C M M C1
4 D F F C2
5 E F F C1
> df[3:nrow(df),2] <- NA
> df
name age gender class
1 A 12 M C1
2 B 10 F C2
3 C <NA> M C1
4 D <NA> F C2
5 E <NA> F C1
