parent child structure in R dataframe

parent child structure in R dataframe - r

I have a csv that contains an org structure as follows plus some additional columns. I use R to create charts and it works great !.
The challenge is when trying to create the charts for a subset manager and its children/grandchildren.
Is there any filtering that is possible in dplr or any alternative package?
Sample format:
emp_id mgr_id nest_id
A A 0
B A 1
C B 2
D C 3
D1 D 4
D2 D 4
E C 3
E1 E 4
F C 3
G B 2
H G 3
The subset I need is for manager "C"
Scenario 1:emp_id==C should contain all nodes of 'D','D1','D2','E','E1','F'
expected structure:
manager,all_children
C D
C D1
C D2
C E
C E1
C F
Scenario 2:emp_id==C should contain all above nodes but retain mgr_id structure for 'D','E'
expected structure:
manager,all_children
C D
C E
C F
D D1
D D2
E E1

Consider the base package with by which creates a df list for every level of mgr_id (not just C):
SCENARIO 1
dfList <- by(df, df$mgr_id, function(i){
names(i) <- paste0(names(i), "_") # SUFFIX UNDERSCORE (TO AVOID DUP COLUMNS)
child <- merge(i, df, by.x="mgr_id_", by.y="emp_id")[,1:2]
grandchild <- merge(child, df, by.x="emp_id_", by.y="mgr_id")[c("mgr_id_", "emp_id")]
names(child) <- gsub("*_$", "", names(child)) # REMOVE LAST UNDERSCORE
names(grandchild) <- gsub("*_$", "", names(grandchild)) # REMOVE LAST UNDERSCORE
rbind(child, grandchild)
})
dfList$C
# mgr_id emp_id
# 1 C D
# 2 C E
# 3 C F
# 4 C D1
# 5 C D2
# 6 C E1
SCENARIO 2 (where the selected columns change in grandchild and then first column rename)
dfList <- by(df, df$mgr_id, function(i){
names(i) <- paste0(names(i), "_") # SUFFIX UNDERSCORE (TO AVOID DUP COLUMNS)
child <- merge(i, df, by.x="mgr_id_", by.y="emp_id")[,1:2]
grandchild <- merge(child, df, by.x="emp_id_", by.y="mgr_id")[c("emp_id_", "emp_id")]
names(child) <- gsub("*_$", "", names(child)) # REMOVE LAST UNDERSCORE
names(grandchild) <- gsub(".*_$", "", names(grandchild)) # REMOVE LAST UNDERSCORE
names(grandchild)[1] <- "mgr_id"
rbind(child, grandchild)
})
dfList$C
# mgr_id emp_id
# 1 C D
# 2 C E
# 3 C F
# 4 D D1
# 5 D D2
# 6 E E1

Here is one solution using functions from dplyr and data.table. dt3 is the output for scenario 1, while dt4 is the output for scenario 2.
# Load packages
library(dplyr)
library(data.table)
# Create example data frame
dt <- read.table(text = "emp_id mgr_id nest_id
A A 0
B A 1
C B 2
D C 3
D1 D 4
D2 D 4
E C 3
E1 E 4
F C 3
G B 2
H G 3",
header = TRUE, stringsAsFactors = FALSE)
# Process the data
dt2 <- dt %>%
# Filter levels lower than 1
filter(nest_id > 1) %>%
mutate(group_id = ifelse(nest_id > 2, 0, 1)) %>%
# Create "run_id", which will be used to fill manager label
mutate(run_id = rleid(group_id)) %>%
mutate(run_id = ifelse(run_id %% 2 == 0, run_id - 1, run_id)) %>%
group_by(run_id) %>%
mutate(manager = first(emp_id)) %>%
# Select for manager C
filter(manager %in% "C") %>%
ungroup() %>%
# Remove rows if manager == emp_id
filter(manager != emp_id) %>%
rename(all_children = emp_id)
# Scenario 1
dt3 <- dt2 %>% select(manager, all_children)
# Scenario 2
dt4 <- dt2 %>%
select(manager = mgr_id, all_children) %>%
arrange(manager, all_children)

Related

Defining indices for row sequences more succintly

I have a dataframe like this:
set.seed(123)
df <- data.frame(A = sample(LETTERS[1:5], 50, replace = TRUE),
B = sample(LETTERS[1:5], 50, replace = TRUE))
I want to filter the dataframe on two parameters: (i) the target rows that match a certain criterion and (ii) a certain number of rows that precede the target rows. Specifically, I want to filter rows where A == "A" & B == "A" as well as the five rows preceding the target row. I can do this with a two-step operation: first by defining a function, and second by using the function as input for slice:
Sequ <- function(col1, col2) {
# get row indices of target row with function `which`
inds <- which(col1 == "A" & col2 == "A")
# sort row indices of the rows before target row AND target row itself
sort(unique(c(inds-5, inds-4, inds-3,inds-2, inds-1, inds)))
}
library(dplyr)
df %>%
slice(Sequ(col1 = A, col2 = B))
A B
1 D C
2 D B
3 C B
4 C D
5 B B
6 A A
7 E B
8 E D
9 D C
10 D D
11 A A
12 C C
13 D E
14 B E
15 B E
16 B A
17 A A
18 C D
19 C B
20 B D
21 A B
22 A A
But surely there must be a more efficient replacement for this part: sort(unique(c(inds-5, inds-4, inds-3,inds-2, inds-1, inds))). In case I want to filter not just the preceding 5 but, say, 10 or 100 rows this way of defining each index individually becomes quickly impractical. How can this part be coded more economically?

1) Define bothA which takes a matrix and returns TRUE if any row is all A's. Then use rollapply to apply it as a moving window.
library(zoo)
bothA <- function(x) any(rowSums(rbind(x) == "A") == 2)
ok <- rollapply(df, 6, bothA, align = "left", partial = TRUE, by.column = FALSE)
df[ok, ]
2) or in a pipe
df %>%
filter(rollapply(., 6, bothA, align = "left", partial = TRUE, by.column = FALSE))
3) This also works:
ok <- rollapply(rowSums(df == "A") == 2, 6, any, align = "left", partial = TRUE)
df[ok, ]

Here is a dplyr solution that can be directly used in a pipe, with no need for filter.
Sequ <- function(x, col1, col2, value = "A"){
x %>%
mutate(grp = lag(cumsum({{col1}} == value & {{col2}} == value), default = 0)) %>%
group_by(grp) %>%
slice_tail(n = 5) %>%
ungroup() %>%
select(-grp)
}
df %>% Sequ(A, B)
## A tibble: 23 x 2
# A B
# <chr> <chr>
# 1 B D
# 2 C C
# 3 E A
# 4 D B
# 5 A A
# 6 C D
# 7 E E
# 8 C E
# 9 C C
#10 A A
## … with 13 more rows

One dplyr and purrr solution could be:
df %>%
filter(!row_number() %in% unlist(map(which(A == "A" & B == "A"), ~ (.x-5):.x)))

Creating Nodes and Edges Dataframes from Tidy Dataframes

I have a data frame that's of this structure:
df <- data.frame(var1 = c(1,1,1,2,2,3,3,3,3),
cat1 = c("A","B","D","B","C","D","E","B","A"))`
> df
var1 cat1
1 1 A
2 1 B
3 1 D
4 2 B
5 2 C
6 3 D
7 3 E
8 3 B
9 3 A
And I am looking to create both nodes and edges data frames from it, so that I can draw a network graph, using VisNetwork. This network will show the number/strength of connections between the different cat1 values, as grouped by the var1 value.
I have the nodes data frame sorted:
nodes <- data.frame(id = unique(df$cat1))
> nodes
id
1 A
2 B
3 D
4 C
5 E
What I'd like help with is how to process df in the following manner:
for each distinct value of var1 in df, tally up the group of nodes that are common to that value of var1 to give an edges dataframe that ultimately looks like the one below. Note that I'm not bothered about the direction of flow along the edges. Just that they are connected is all I need.
> edges
from to value
1 A B 2
2 A D 2
3 A E 1
4 B C 1
5 B D 2
6 B E 1
7 D E 1
With thanks in anticipation,
Nevil
Update: I found here a similar problem, and have adapted that code to give, which is getting close to what I want, but not quite there...
> df %>% group_by(var1) %>%
filter(n()>=2) %>% group_by(var1) %>%
do(data.frame(t(combn(.$cat1, 2,function(x) sort(x))),
stringsAsFactors=FALSE))
# A tibble: 10 x 3
# Groups: var1 [3]
var1 X1 X2
<dbl> <chr> <chr>
1 1. A B
2 1. A D
3 1. B D
4 2. B C
5 3. D E
6 3. B D
7 3. A D
8 3. B E
9 3. A E
10 3. A B

I don't know if there is already a suitable function to achieve this task. Here is a detailed procedure to do it. Whith this, you should be able to define you own function. Hope it helps!
# create an adjacency matrix
mat <- table(df)
mat <- t(mat) %*% mat
as.table(mat) # look at your adjacency matrix
# since the network is not directed, we can consider only the (strictly) upper triangular matrix
mat[lower.tri(mat, diag = TRUE)] <- 0
as.table(mat) # look at the new adjacency matrix
library(dplyr)
edges <- as.data.frame(as.table(mat))
edges <- filter(edges, Freq != 0)
colnames(edges) <- c("from", "to", "value")
edges <- arrange(edges, from)
edges # output
# from to value
#1 A B 2
#2 A D 2
#3 A E 1
#4 B C 1
#5 B D 2
#6 B E 1
#7 D E 1

here's a couple other ways...
in base R...
values <- unique(df$var1[duplicated(df$var1)])
do.call(rbind,
lapply(values, function(i) {
nodes <- as.character(df$cat1[df$var1 == i])
edges <- combn(nodes, 2)
data.frame(from = edges[1, ],
to = edges[2, ],
value = i,
stringsAsFactors = F)
})
)
in tidyverse...
library(dplyr)
library(tidyr)
df %>%
group_by(var1) %>%
filter(n() >= 2) %>%
mutate(cat1 = as.character(cat1)) %>%
summarise(edges = list(data.frame(t(combn(cat1, 2)), stringsAsFactors = F))) %>%
unnest(edges) %>%
select(from = X1, to = X2, value = var1)
in tidyverse using tidyr::complete...
library(dplyr)
library(tidyr)
df %>%
group_by(var1) %>%
mutate(cat1 = as.character(cat1)) %>%
mutate(i.cat1 = cat1) %>%
complete(cat1, i.cat1) %>%
filter(cat1 < i.cat1) %>%
select(from = cat1, to = i.cat1, value = var1)
in tidyverse using tidyr::expand...
library(dplyr)
library(tidyr)
df %>%
group_by(var1) %>%
mutate(cat1 = as.character(cat1)) %>%
expand(cat1, to = cat1) %>%
filter(cat1 < to) %>%
select(from = cat1, to, value = var1)

selecting values of one dataframe based on partial string in another dataframe

I have two dataframes (DF1 and DF2)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
DF1
parties
A, B
C
A
C, D
.
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B,C)
names(DF2) <- c("party","party.number")
.
DF2
party party.number
A 1
B 2
C 3
D 4
E 5
F 6
G 7
H 8
I 9
J 10
The desired result should be an additional column in DF1 which contains the party numbers taken from DF2 for each row in DF1.
Desired result (based on DF1):
parties party.numbers
A, B 1, 2
C 3
A 1
C, D 3, 4
I strongly suspect that the answer involves something like str_match(DF1$parties, DF2$party.number) or a similar regular expression, but I can't figure out how to put two (or more) party numbers into the same row (DF2$party.numbers).

One option is gsubfn by matching the pattern as upper-case letter, as replacement use a key/value list
library(gsubfn)
DF1$party.numbers <- gsubfn("[A-Z]", setNames(as.list(DF2$party.number),
DF2$party), as.character(DF1$parties))
DF1
# parties party.numbers
#1 A, B 1, 2
#2 C 3
#3 A 1
#4 C, D 3, 4

An alternative solution using tidyverse. You can reshape DF1 to have one string per row, then join DF2 and then reshape back to your initial form:
library(tidyverse)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B,C)
names(DF2) <- c("party","party.number")
DF1 %>%
group_by(id = row_number()) %>%
separate_rows(parties) %>%
left_join(DF2, by=c("parties"="party")) %>%
summarise(parties = paste(parties, collapse = ", "),
party.numbers = paste(party.number, collapse = ", ")) %>%
select(-id)
# # A tibble: 4 x 2
# parties party.numbers
# <chr> <chr>
# 1 A, B 1, 2
# 2 C 3
# 3 A 1
# 4 C, D 3, 4

Replace values of multiple columns from one dataframe using another dataframe with conditions

Hi I have two data frames as followed:
df1:
ID x y z
1 a b c
2 a b c
3 a b c
4 a b c
and df2:
ID x y
2 d NA
3 NA e
and I am after a result like this:
df1:
ID x y z
1 a b c
2 d b c
3 a e c
4 a b c
I have been trying to use the match function as suggested by some other posts but I keep getting the issue where my df1 dataframe being replaced with NA values from df2.
This is the code I have been using without luck
for (i in names(df2)[2:length(names(df2))]) {
df1[i] <- df2[match(df1$ID, df2$ID)]
}
Thanks

Your code didn't work for me so I change it a little but it works. If you are reading data from an external file use the stringAsFactor = FALSE when you read it so you don't run into problems.
df1 = data.frame("ID" = 1:4,"x" = rep("a",4), "y" =rep("b",4),"z" = rep("c",4),
stringsAsFactors=FALSE)
df2 = data.frame("ID" = 2:3,"x" = c("d",NA), "y" = c(NA,"e"),stringsAsFactors=FALSE)
for(i in 1:nrow(df2)){
new_data = df2[i,-which(apply(df2[i,],2,is.na))]
pos = as.numeric(new_data[1])
col_replace = intersect(colnames(new_data),colnames(df1))
df1[pos,col_replace] = new_data
}

A solution using dplyr. The idea is to convert both data frames to long format, conduct join and replace the values, and convert the format back to wide format. df5 is the final output.
library(dplyr)
library(tidyr)
df3 <- df1 %>% gather(Col, Value, -ID)
df4 <- df2 %>% gather(Col, Value, -ID, na.rm = TRUE)
df5 <- df3 %>%
left_join(df4, by = c("ID", "Col")) %>%
mutate(Value.x = ifelse(!is.na(Value.y), Value.y, Value.x)) %>%
select(ID, Col, Value.x) %>%
spread(Col, Value.x)
df5
# ID x y z
# 1 1 a b c
# 2 2 d b c
# 3 3 a e c
# 4 4 a b c
DATA
df1 <- read.table(text = "ID x y z
1 a b c
2 a b c
3 a b c
4 a b c",
header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "ID x y
2 d NA
3 NA e",
header = TRUE, stringsAsFactors = FALSE)

As mentioned by alistaire this is an update join. It is available with the data.table package:
library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on = "ID", x := ifelse(is.na(i.x), x, i.x)]
df1[df2, on = "ID", y := ifelse(is.na(i.y), y, i.y)]
df1
ID x y z
1: 1 a b c
2: 2 d b c
3: 3 a e c
4: 4 a b c
If there are many columns with replacement values, it might be worthwhile to follow www's suggestion to do the replacement after reshaping to long format where column names are treated as data:
library(data.table)
melt(setDT(df1), "ID")[
melt(setDT(df2), "ID", na.rm = TRUE), on = .(ID, variable), value := i.value][
, dcast(.SD, ID ~ variable)]
ID x y z
1: 1 a b c
2: 2 d b c
3: 3 a e c
4: 4 a b c
Data
df1 <- fread(
"ID x y z
1 a b c
2 a b c
3 a b c
4 a b c")
df2 <- fread(
"ID x y
2 d NA
3 NA e")

Function that ignores missing columns

Say I have the following two data frames:
col1 <- c("a","b","c","d","e")
col2 <- c("A","B","C","D","E")
col1a <- c("a","b","c","d","e")
col2a <- c("A","B","C","D","E")
df1 <- data.frame(col1, col2)
df2 <- data.frame(col1a, col2a)
colnames(df1) <- c("c1","c2")
colnames(df2) <- c("c1","c3")
And I have the following function to rename column headers:
library(dplyr)
col_rename <- function(x) x %>% rename(new_c1 = c1, new_c2 = c2, new_c3 = c3)
When I run this function, I get an error because the columns in the function does not match the columns in the data frame.
df1 <- col_rename(df1)
Error: `c3` contains unknown variables
How can I make the function run only on the present columns, and ignore the ones not present, without removing or changing the column names specified in the function?
EDIT:
I can see how the example was a bit confusing. I have many dataframes with many columns. These columns are shared by some dataframes but not all. However, I want to rename all columns specified by the function, regardless of what is present in the dataframe. It looks something like this:
col1 <- c(1:5)
col2 <- c(1:5)
col3 <- c(1:5)
col4 <- c(1:5)
df1 <- data.frame(col1,col2,col3,col4)
df2 <- data.frame(col1,col2,col3,col4)
colnames(df1) <- c("c1","c2","c6","c8")
colnames(df2) <- c("c1","c3","c2","c8")
AB_rename <- function(x) x %>% rename(aa=col1,bb=col2,
cc=col3,dd=col4,
ee=col5,ff=col6,
gg=col7,hh=col8)
Therefore I cannot follow the example of #Ycw, as they do not all follow the same rename rule. How do I make this ignore columns that are not present?

Here is a workaround to use setNames for the col_rename function.
col_rename <- function(x) setNames(x, paste0("new_", names(x)))
col_rename(df1)
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
col_rename(df2)
new_c1 new_c3
1 a A
2 b B
3 c C
4 d D
5 e E
Or use the select_all function from the dplyr.
library(dplyr)
df1 %>% select_all(function(x) paste0("new_", x))
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
This (~) also works for select_all
df2 %>% select_all(~paste0("new_", .))
new_c1 new_c3
1 a A
2 b B
3 c C
4 d D
5 e E
rename_all also works well
library(dplyr)
df1 %>% rename_all(~paste0("new_", .))
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
Update
This is an update to address OP's updated question.
We can create a named vector showing the relationship between old column names and new column names. And defined a function to change the name based on the setNames function.
# Create name vector
vec <- paste0("c", 1:8)
names(vec) <- c("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh")
# Create the function
AB_rename <- function(x, name_vec){
old_colname <- names(x)
new_colname <- name_vec[name_vec %in% old_colname]
x2 <- setNames(x, names(new_colname))
return(x2)
}
AB_rename(df1, vec)
aa bb ff hh
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

parent child structure in R dataframe - r

Related

Defining indices for row sequences more succintly

Creating Nodes and Edges Dataframes from Tidy Dataframes

selecting values of one dataframe based on partial string in another dataframe

Replace values of multiple columns from one dataframe using another dataframe with conditions

Function that ignores missing columns

Categories

Resources