I have a dataset with origin ("from"), destination ("to") and price as below:
from to price
A B 28109
A D 2356
A E 4216
B A 445789
B D 123
D A 45674
D B 1979
I want to sum the price for each route, counting the return direction as well. For example, A - B consists of the following rows:
from to price
A B 28109
B A 445789
Then take the sum of the prices (28109 + 445789 = 473898). The output should look like this:
route total_price
A - B 473898
A - D 48030
A - E 4216
B - D 2102
I was thinking of running a for loop, but my data set is very large (800k rows). Any help will be highly appreciated. Thanks a lot in advance.
You can do this by sorting the from-to pairs, then grouping on that sorted pair and summing.
Edit: See @JasonAizkalns's answer for a tidyverse equivalent.
library(data.table)
setDT(df)
df[, .(total_price = sum(price))
, by = .(route = paste(pmin(from, to), '-', pmax(from, to)))]
# route total_price
# 1: A - B 473898
# 2: A - D 48030
# 3: A - E 4216
# 4: B - D 2102
@Frank notes that this result hides the fact that route "A - E" is not complete, in the sense that the original data has no row with from == 'E' and to == 'A'. He has offered a good way of capturing that information (and more), and I've added some other options below.
df[, .(total_price = sum(price), complete = .N > 1)
, by = .(route = paste(pmin(from, to), '-', pmax(from, to)))]
# route total_price complete
# 1: A - B 473898 TRUE
# 2: A - D 48030 TRUE
# 3: A - E 4216 FALSE
# 4: B - D 2102 TRUE
df[, .(total_price = sum(price), paths_counted = .(paste(from, '-', to)))
, by = .(route = paste(pmin(from, to), '-', pmax(from, to)))]
# route total_price paths_counted
# 1: A - B 473898 A - B,B - A
# 2: A - D 48030 A - D,D - A
# 3: A - E 4216 A - E
# 4: B - D 2102 B - D,D - B
Data used
df <- fread('
from to price
A B 28109
A D 2356
A E 4216
B A 445789
B D 123
D A 45674
D B 1979')
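For readers who do not have data.table installed, the same idea (put each pair in a fixed order, group, sum) can be sketched in base R. This is only an illustration, assuming from and to are character columns; the data frame name `trips` is made up for the example.

```r
# Toy data copied from the question; `trips` is a hypothetical name.
trips <- data.frame(
  from  = c("A", "A", "A", "B", "B", "D", "D"),
  to    = c("B", "D", "E", "A", "D", "A", "B"),
  price = c(28109, 2356, 4216, 445789, 123, 45674, 1979),
  stringsAsFactors = FALSE
)

# pmin()/pmax() work element-wise on character vectors, so each pair is
# written in alphabetical order regardless of the direction of the row.
trips$route <- paste(pmin(trips$from, trips$to), "-", pmax(trips$from, trips$to))

# Sum the price within each undirected route.
totals <- aggregate(price ~ route, data = trips, FUN = sum)
names(totals)[names(totals) == "price"] <- "total_price"
totals
```

Because the key is built with vectorised functions, no explicit loop over the 800k rows is needed.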
You could do a self-join and then things are pretty straightforward:
library(tidyverse)
df <- readr::read_table("
from to price
A B 28109
A D 2356
A E 4216
B A 445789
B D 123
D A 45674
D B 1979
")
df %>%
inner_join(df, by = c("from" = "to")) %>%
filter(to == from.y) %>%
mutate(
route = paste(from, "-", to),
total_price = price.x + price.y
)
#> # A tibble: 6 x 7
#> from to price.x from.y price.y route total_price
#> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 A B 28109 B 445789 A - B 473898
#> 2 A D 2356 D 45674 A - D 48030
#> 3 B A 445789 A 28109 B - A 473898
#> 4 B D 123 D 1979 B - D 2102
#> 5 D A 45674 A 2356 D - A 48030
#> 6 D B 1979 B 123 D - B 2102
Created on 2019-03-20 by the reprex package (v0.2.1)
Because I like @IceCreamToucan's answer better, here's the tidyverse equivalent:
df %>%
group_by(route = paste(pmin(from, to), "-", pmax(from, to))) %>%
summarise(total_price = sum(price))
Also one tidyverse possibility:
df %>%
nest(from, to) %>%
mutate(route = unlist(map(data, function(x) paste(sort(x), collapse = "_")))) %>%
group_by(route) %>%
summarise(total_price = sum(price))
route total_price
<chr> <int>
1 A_B 473898
2 A_D 48030
3 A_E 4216
4 B_D 2102
Here, the code first nests the "from" and "to" values of each row into a list column. Second, it sorts the elements within each list entry and pastes them together, separated by _. Finally, it groups by the combined key and sums the price.
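The "sort the elements of each row, then paste them" step can also be sketched without nesting, using base R's apply(). This is a minimal illustration rather than a replacement for the code above; the name `legs` is hypothetical.

```r
# Toy data copied from the question; `legs` is a hypothetical name.
legs <- data.frame(
  from  = c("A", "A", "A", "B", "B", "D", "D"),
  to    = c("B", "D", "E", "A", "D", "A", "B"),
  price = c(28109, 2356, 4216, 445789, 123, 45674, 1979),
  stringsAsFactors = FALSE
)

# For every row, sort the two endpoints and join them with "_".
route <- apply(legs[c("from", "to")], 1, function(x) paste(sort(x), collapse = "_"))

# Sum the price for each combined key; the result is a named vector-like array.
totals <- tapply(legs$price, route, sum)
totals
```

Unlike pmin()/pmax(), sorting each row also generalises to keys built from more than two columns, at the cost of a row-wise loop inside apply().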
Or involving a wide-to-long transformation:
df %>%
rowid_to_column() %>%
gather(var, val, -c(rowid, price)) %>%
arrange(rowid, val) %>%
group_by(rowid) %>%
summarise(route = paste(val, collapse = "_"),
price = first(price)) %>%
group_by(route) %>%
summarise(total_price = sum(price))
Here, the code first reshapes the data from wide to long, leaving the row ID and "price" untouched. Second, it arranges the rows by row ID and by the values that came from "from" and "to". Third, it groups by row ID and pastes the sorted values together, separated by _. Finally, it groups by this combined key and sums the price.
I'd do...
library(data.table)
setDT(df)
pts = df[, unique(c(from, to))]
rDT = CJ(P1 = pts, P2 = pts)[P1 < P2]
rDT[df, on=.(P1 = from, P2 = to), r12 := i.price]
rDT[df, on=.(P2 = from, P1 = to), r21 := i.price]
rDT[, r := r12 + r21]
P1 P2 r12 r21 r
1: A B 28109 445789 473898
2: A D 2356 45674 48030
3: A E 4216 NA NA
4: B D 123 1979 2102
5: B E NA NA NA
6: D E NA NA NA
This makes it clear where the data is incomplete.** You could filter with rDT[!is.na(r)] to keep only the complete records.
** This is also addressed in @JasonAizkalns's and @IceCreamToucan's answers, but it contrasts with the OP's requested output.
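The same "keep both directions side by side" idea can be sketched in base R with a full outer merge() of the data against itself. It is only a sketch and assumes each directed pair appears at most once (otherwise the merge would multiply rows); the names `legs`, `price_out` and `price_back` are made up for the example. Unlike the CJ() version, it lists only routes that occur in the data in at least one direction.

```r
# Toy data copied from the question; `legs` is a hypothetical name.
legs <- data.frame(
  from  = c("A", "A", "A", "B", "B", "D", "D"),
  to    = c("B", "D", "E", "A", "D", "A", "B"),
  price = c(28109, 2356, 4216, 445789, 123, 45674, 1979),
  stringsAsFactors = FALSE
)

# Match every row with the row that runs in the opposite direction.
# all = TRUE keeps one-way routes; their missing leg shows up as NA.
both <- merge(legs, legs,
              by.x = c("from", "to"), by.y = c("to", "from"),
              all = TRUE, suffixes = c("_out", "_back"))

# Keep one row per undirected route and add the two legs.
both <- both[both$from < both$to, ]
both$total_price <- both$price_out + both$price_back
both
```

Rows where total_price is NA are the incomplete routes, here A - E.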
I have the following dataset:
individual sequence_special_drug all_drugs
A NA A
A NA B
A 1 C
A 2 D
A NA B
A NA Z
A 2 D
A NA Z
A 2 D
A NA A
A 3 E
I would like to create the following dataset in R:
Individual sequence_special_drug special_drug prior_special_drug prior_traditional_drug during_special_drug
A 1 C none A, B none
A 2 D C none B, Z, Z
A 3 E C, D A, B, B, Z, Z, A none
Is there a quick way to do this? I have many individuals, but these are all the possible scenarios. A special_drug is identified by the sequence number; those with 'NA' are a traditional_drug.
prior_special_drug will contain any special_drug previously identified, so for the first special_drug C there is no previous special_drug, for the second special_drug D, there is one previous special_drug that is C, and for the third special_drug there are two previous special_drugs C and D.
prior_traditional_drug is the same but will contain anything that has been identified in sequence_special_drug as NA. So for the first special_drug (C), the two prior_traditional_drugs are A and B. For the third special_drug, the prior_traditional_drugs are A, B, B, Z, Z, A.
during_special_drug will contain every traditional_drug that was recorded during the administration of a special_drug. This can be identified in the dataset through the repetition of sequence_special_drug (e.g. 2 -> NA NA -> 2 -> NA -> 2), which gives B, Z, Z.
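To make the three definitions concrete, here is a minimal base R sketch for a single individual. It follows one reading of the rules above (everything before the first occurrence of a sequence number counts as "prior", and NA rows between its first and last occurrence count as "during"); the helper names `collapse_or_none` and `summarise_special` are hypothetical.

```r
# Toy data for individual A, copied from the question.
seq_no <- c(NA, NA, 1, 2, NA, NA, 2, NA, 2, NA, 3)
drugs  <- c("A", "B", "C", "D", "B", "Z", "D", "Z", "D", "A", "E")

# Collapse a character vector to "x, y, z", or "none" when it is empty.
collapse_or_none <- function(x) {
  if (length(x) == 0) "none" else paste(x, collapse = ", ")
}

summarise_special <- function(seq_no, drugs) {
  ks <- sort(unique(seq_no[!is.na(seq_no)]))
  rows <- lapply(ks, function(k) {
    idx    <- which(seq_no == k)      # rows where special drug k appears
    first  <- min(idx)
    before <- seq_len(first - 1)      # every row before its first occurrence
    inside <- first:max(idx)          # from first to last occurrence
    data.frame(
      sequence_special_drug  = k,
      special_drug           = drugs[first],
      prior_special_drug     = collapse_or_none(unique(drugs[before][!is.na(seq_no[before])])),
      prior_traditional_drug = collapse_or_none(drugs[before][is.na(seq_no[before])]),
      during_special_drug    = collapse_or_none(drugs[inside][is.na(seq_no[inside])]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

res <- summarise_special(seq_no, drugs)
res
```

For several individuals, the same function could be applied to each piece returned by split() on the individual column.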
EDIT - For 2 individuals:
dat <- read.table(
text = "
individual sequence_special_drug all_drugs
A NA A
A NA B
A 1 C
A 2 D
A NA B
A NA Z
A 2 D
A NA Z
A 2 D
A NA A
A 3 E
B 1 D
B NA B
B NA Z
B 1 D
B NA Z
B 1 D
B NA A
B 2 E",
header = TRUE)
I would expect:
- WRONG "none" line 3 under prior_traditional_drug -
Individual sequence_special_drug special_drug prior_special_drug prior_traditional_drug during_special_drug
A 1 C none A, B none
A 2 D C none B, Z, Z
A 3 E C, D A, B, B, Z, Z, A none
B 1 D none none B, Z, Z
B 2 E D B, Z, Z, A none
- RIGHT "A, B" line 3 under prior_traditional_drug -
Individual sequence_special_drug special_drug prior_special_drug prior_traditional_drug during_special_drug
A 1 C none A, B none
A 2 D C A, B B, Z, Z
A 3 E C, D A, B, B, Z, Z, A none
B 1 D none none B, Z, Z
B 2 E D B, Z, Z, A none
But with my own dataset I obtained the following error message:
> special_drug <- example_data %>%
+ nest_by(individual) %>%
+ mutate(
+ spec_drug = list(get_all_drugs(data))
+ ) %>%
+ unnest(spec_drug) %>%
+ select(-data) %>%
+ ungroup()
`summarise()` has grouped output by 'sequence_special_drug'. You can override using the `.groups` argument.
Error: Problem with `mutate()` input `spec_drug`.
x Problem with `mutate()` input `flag3`.
x `false` must be a list, not a character vector.
ℹ Input `flag3` is `if_else(flag1 == 1, list(character(0)), flag3)`.
ℹ Input `spec_drug` is `list(get_all_drugs(data))`.
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
Error in is_rlang_error(parent) :
argument "parent" is missing, with no default
My own dataset is more like this:
example_data <- read.table(
text = "
individual sequence_special_drug all_drugs
77779 NA Name1
77779 1 Name2
77779 1 Name2
77779 1 Name2
77779 2 Name3
4444 NA Name1
4444 1 Name4
4444 2 Name3
4444 3 Name7",
header = TRUE)
But the dataset below also generates the same error message:
example_data <- read.table(
text = "
individual sequence_special_drug all_drugs
A NA A
A 1 C
A 2 D
A 2 D
A 2 D
A 3 E
B NA B
B 1 D
B 2 E
B 3 F",
header = TRUE)
Here is my suggestion using {tidyverse}. I wrote a function to get each column and then put them together in get_all_drugs(). Then, I ran the function through the nested data by individual, as in the example below.
library(tidyverse)
example_data <- read.table(
text = "
individual sequence_special_drug all_drugs
A NA A
A NA B
A 1 C
A 2 D
A NA B
A NA Z
A 2 D
A NA Z
A 2 D
A NA A
A 3 E
B 1 D
B NA B
B NA Z
B 1 D
B NA Z
B 1 D
B NA A
B 2 E",
header = TRUE)
get_special_drugs <- function(.data) {
.data %>%
filter(sequence_special_drug != 0) %>%
distinct() %>%
select(sequence_special_drug, special_drug = all_drugs) %>%
mutate(prior_special_drug = as.list(accumulate(special_drug, c))) %>%
rowwise() %>%
mutate(prior_special_drug = list(
prior_special_drug[prior_special_drug != special_drug]
)) %>%
ungroup()
}
fix_drug_sequence <- function(.data) {
.data %>%
mutate(
seq_drug = replace_na(sequence_special_drug, 0),
flag = if_else(seq_drug == 0 & seq_drug != lead(seq_drug),
lead(seq_drug),
seq_drug),
flag = if_else(flag == 0 & flag != lead(flag),
lead(flag),
flag)
) %>%
select(-sequence_special_drug) %>%
rename(sequence_special_drug = flag)
}
get_prior_traditional_drug <- function(...) {
fix_drug_sequence(...) %>%
group_by(sequence_special_drug) %>%
mutate(
flag1 = max(seq_drug == sequence_special_drug & row_number() == 1),
) %>%
group_by(sequence_special_drug, flag1) %>%
summarise(
flag2 = list(all_drugs[seq_drug == 0])
) %>%
ungroup() %>%
mutate(
flag3 = as.list(accumulate(flag2, append)),
flag3 = if_else(flag1 == 1, lag(flag3), flag3)
) %>%
select(sequence_special_drug, prior_traditional_drug = flag3)
}
get_during_special_drugs <- function(...) {
fix_drug_sequence(...) %>%
group_by(sequence_special_drug) %>%
mutate(
flag = cumsum(seq_drug == sequence_special_drug)
) %>%
filter(flag > 0) %>%
summarise(
during_special_drug = list(all_drugs[seq_drug == 0])
)
}
get_all_drugs <- function(.data) {
spec_drug <- get_special_drugs(.data)
prior_traditional <- get_prior_traditional_drug(.data)
during_spec <- get_during_special_drugs(.data)
list(spec_drug, prior_traditional, during_spec) %>%
reduce(left_join, by = "sequence_special_drug")
}
special_drug <- example_data %>%
nest_by(individual) %>%
mutate(
spec_drug = list(get_all_drugs(data))
) %>%
unnest(spec_drug) %>%
select(-data) %>%
ungroup()
special_drug
Here is my inelegant solution, which only works for this specific problem, but it may be useful as a hint.
library(data.table)
dt <- fread(
"
individual sequence_special_drug all_drugs
A NA A
A NA B
A 1 C
A 2 D
A NA B
A NA Z
A 2 D
A NA Z
A 2 D
A NA A
A 3 E
"
)
df <- unique(na.omit(dt))
setnames(df,"all_drugs","special_drug")
df
#> individual sequence_special_drug special_drug
#> 1: A 1 C
#> 2: A 2 D
#> 3: A 3 E
## add row identifier in dt
dt[,rd:=rowid(individual)]
## create prior_special_drug
df[,prior_special_drug:=shift(special_drug)]
df[3,4] <- df[special_drug < "E", paste(special_drug,collapse = ", ")]
df
#> individual sequence_special_drug special_drug prior_special_drug
#> 1: A 1 C <NA>
#> 2: A 2 D C
#> 3: A 3 E C, D
special.drug = df$special_drug
special.drug
#> [1] "C" "D" "E"
posi <- c(
dt[,first(.I[all_drugs==special.drug[1]])], #first position of C
dt[,first(.I[all_drugs==special.drug[2]])], #first position of D
dt[,last(.I[all_drugs==special.drug[2]])], #last position of D
dt[,last(.I[all_drugs==special.drug[3]])] #last position of E
)
posi
#> [1] 3 4 9 11
# dt[is.na(sequence_special_drug) & rd < posi[1], all_drugs]
# dt[is.na(sequence_special_drug) & rd %between% posi[2:3], all_drugs]
# dt[is.na(sequence_special_drug) & rd < posi[4], all_drugs]
drug <- c(
paste(dt[is.na(sequence_special_drug) & rd < posi[1], all_drugs],collapse = ", "),
paste(dt[is.na(sequence_special_drug) & rd %between% posi[2:3], all_drugs],collapse = ", "),
paste(dt[is.na(sequence_special_drug) & rd < posi[4], all_drugs],collapse = ", ")
)
drug
#> [1] "A, B" "B, Z, Z" "A, B, B, Z, Z, A"
## create prior_traditional_drug and during_special_drug
df[,prior_traditional_drug := drug]
df[,prior_traditional_drug := ifelse(special_drug == "D",NA,prior_traditional_drug)]
df[,during_special_drug := drug]
df[,during_special_drug := ifelse(special_drug %in% c("C","E"),NA,during_special_drug)]
## replace NA with "none" in df
for (jj in 1:ncol(df))
set(df,
i = which(is.na(df[[jj]])),
j = jj,
v = "none"
)
df
#> individual sequence_special_drug special_drug prior_special_drug
#> 1: A 1 C none
#> 2: A 2 D C
#> 3: A 3 E C, D
#> prior_traditional_drug during_special_drug
#> 1: A, B none
#> 2: none B, Z, Z
#> 3: A, B, B, Z, Z, A none
Created on 2021-06-06 by the reprex package (v2.0.0)
I have one data frame like:
a b c d e f g
1 Car 10/02 01/02 30/02 14 1 NA
2 Car 10/02 07/02 20/02 0 NA 7
I want to get :
a b c d e f g
1 Car 10/02 01/02 20/02 14 1 7
That is, group by (a, b), take the minimum date for c and d, take the maximum for e, and take the non-null value for f and g.
How can I do this in R?
Using dplyr we can group_by a and b, convert c and d to actual dates using dmy from lubridate and select the minimum date, select maximum value of e and non-NA value of f and g.
library(dplyr)
library(lubridate)
df %>%
group_by(a, b) %>%
summarise(c = c[which.min(dmy(paste0(c, "/19")))],
d = d[which.min(dmy(paste0(d, "/19")))],
e = max(e),
f = f[!is.na(f)],
g = g[!is.na(g)])
# a b c d e f g
# <fct> <fct> <fct> <fct> <dbl> <int> <int>
#1 Car 10/02 01/02 20/02 14 1 7
As shown in the example, I am assuming there is only one non-NA value for f and g in each group. If there can be more than one, use which.max to select the first non-NA value from those columns instead, e.g. f[which.max(!is.na(f))].
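The which.max idiom mentioned above can be checked in isolation. In this small sketch, `first_non_na` is a hypothetical helper name; it relies on which.max() returning the position of the first TRUE in a logical vector.

```r
# Return the first non-NA element of x.
# which.max() on a logical vector gives the index of the first TRUE.
first_non_na <- function(x) x[which.max(!is.na(x))]

first_non_na(c(NA, 3, 5))   # picks 3, the first non-NA value
first_non_na(c(NA, NA))     # no TRUE at all: index 1 is used, so the result is NA
```

Inside summarise() this would be written as f = first_non_na(f), which returns exactly one value per group even when several values are present.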
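Using dplyr, and assuming your data frame is called df, first replace NA with 0 so that the f and g columns can be summed directly within each group:
df[is.na(df)] <- 0
library(dplyr)
# note: min() on "dd/mm" strings compares them as text, so c and d must be character (not factor)
# and the result only matches date order when the dates fall in the same month
df_1 <- df %>%
  group_by(a, b) %>%
  summarise(c = min(c), d = min(d), e = max(e), f = sum(f), g = sum(g))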
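One caveat with taking min() of the raw "dd/mm" strings is that the comparison is textual. The short sketch below shows where that differs from date order and one way around it. The year 2019 is an assumption (the data has none), and `dates` and `parsed` are illustrative names.

```r
dates <- c("01/03", "15/02")   # 1 March and 15 February

# Text comparison: "01/03" sorts first because "0" comes before "1".
min(dates)

# Parse with an assumed year, then pick the string belonging to the earliest date.
parsed <- as.Date(paste0(dates, "/2019"), format = "%d/%m/%Y")
earliest <- dates[which.min(parsed)]
earliest
```

With the two example rows in the question all dates are in February, so both approaches agree there.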
Using data.table. I had to change one of your dates because there is no February 30th.
x <- data.frame(
  a = c("Car", "Car"), b = c("10/02", "10/02"), c = c("01/02", "07/02"), d = c("28/02", "20/02"), e = c(14, 0), f = c(1, NA), g = c(NA, 7))
library(data.table) # you may need to install it first
x <- data.table(x)
# convert to dates (as.Date fills in the current year because the format has none)
x$c <- as.Date(x$c, "%d/%m")
x$d <- as.Date(x$d, "%d/%m")
# group as you specified
x[ ,
  .(
    c = min(c),
    d = min(d),
    e = max(e),
    f = max(f, na.rm = TRUE),
    g = max(g, na.rm = TRUE)
  ),
  by = c("a", "b")
]
My data set looks like this, and the letters are factor variables. The Pred columns are the predicted observations for each ID and the Real columns are the real observations. I want to calculate the overall accuracy of the predicted values for each ID.
ID Pred1 Pred2 Pred3 Real1 Real2 Real3
1 A C E A D B
2 A B D E C C
3 E C A A B D
4 D A B B B D
5 B A C C A B
I want to mutate a column called 'Score' that gives, for each ID, the proportion of Pred1, Pred2 and Pred3 values that are matched in the Real1, Real2 and Real3 columns.
I only care about whether a 'Pred' value appears in any of the 'Real' columns; the order does not matter.
If only Pred1 is found in one of Real1, Real2 and Real3, the score is 1/3.
If Pred1 and Pred2 (but not Pred3) are both found anywhere in the Real columns (for example, Pred1 may be found in Real2 or Real3), the score is 2/3.
So I want something like below.
ID Pred1 Pred2 Pred3 Real1 Real2 Real3 Score
1 A C E A D B 1/3
2 A B D E C C 0
3 E C A A B D 1/3
4 D A B B B D 2/3
5 B A C C A B 1
I am trying to write a function and tried something like
ifelse("Pred1" %in% c("Real1","Real2","Real3"), 1/3, 0)
but it didn't work well (I got error messages about coercing to logical, which I didn't know how to solve).
I have been trying different things too, but I keep getting stuck with errors.
Can anyone help, please? Thank you in advance!
Comparing factors that have different levels is awkward, so we can first convert the columns from factor to character.
df[-1] <- lapply(df[-1], as.character)
Find the indices of the Pred and Real columns, then for every row count how many of the predicted values are present among the real ones.
pred_cols <- grep("^Pred", names(df))
real_cols <- grep("^Real", names(df))
df$Score <- sapply(1:nrow(df), function(x)
sum(df[x, pred_cols] %in% df[x, real_cols]))/length(pred_cols)
df
# ID Pred1 Pred2 Pred3 Real1 Real2 Real3 Score
#1 1 A C E A D B 0.33
#2 2 A B D E C C 0.00
#3 3 E C A A B D 0.33
#4 4 D A B B B D 0.67
#5 5 B A C C A B 1.00
Since this is a row-wise comparison, we can also use apply with MARGIN = 1 and the same logic. With this approach we don't need to convert the columns to character explicitly, because apply coerces the data frame to a character matrix first.
apply(df, 1, function(x) sum(x[pred_cols] %in% x[real_cols]))/length(pred_cols)
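For completeness, here is a self-contained run of the apply() variant on the data from the question, keeping the columns as factors to show that no explicit conversion is needed. The data frame name `preds` is made up for the example, and mean() of the logical vector is used as a shortcut for "matches divided by number of Pred columns".

```r
# Data from the question, deliberately stored as factors.
preds <- data.frame(
  ID    = 1:5,
  Pred1 = c("A", "A", "E", "D", "B"),
  Pred2 = c("C", "B", "C", "A", "A"),
  Pred3 = c("E", "D", "A", "B", "C"),
  Real1 = c("A", "E", "A", "B", "C"),
  Real2 = c("D", "C", "B", "B", "A"),
  Real3 = c("B", "C", "D", "D", "B"),
  stringsAsFactors = TRUE
)

pred_cols <- grep("^Pred", names(preds))
real_cols <- grep("^Real", names(preds))

# apply() turns the data frame into a character matrix, so each row `x`
# is a plain character vector and %in% compares the labels directly.
preds$Score <- apply(preds, 1, function(x) mean(x[pred_cols] %in% x[real_cols]))
preds
```

Note that a value predicted twice for the same ID would be counted twice by this rule.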
One tidyverse possibility could be:
bind_cols(df %>%
gather(var, val, -matches("(Real|ID)")) %>%
select(ID, val), df %>%
gather(var2, val2, -matches("(Pred|ID)")) %>%
select(val2)) %>%
group_by(ID) %>%
summarise(res = paste0(sum(val %in% val2), "/3")) %>%
left_join(df, by = c("ID" = "ID"))
ID res Pred1 Pred2 Pred3 Real1 Real2 Real3
<int> <chr> <fct> <fct> <fct> <fct> <fct> <fct>
1 1 1/3 A C E A D B
2 2 0/3 A B D E C C
3 3 1/3 E C A A B D
4 4 2/3 D A B B B D
5 5 3/3 B A C C A B
It first reshapes the Pred columns and the Real columns from wide to long format separately. Second, it binds the two results together column-wise. Finally, it groups by "ID", counts the number of matching cases and joins the result back to the original df.
Or if the number of pairs is not fixed to 3:
bind_cols(df %>%
gather(var, val, -matches("(Real|ID)")) %>%
select(ID, val), df %>%
gather(var2, val2, -matches("(Pred|ID)")) %>%
select(val2)) %>%
add_count(ID) %>%
group_by(ID) %>%
summarise(res = paste(sum(val %in% val2), first(n), sep = "/")) %>%
left_join(df, by = c("ID" = "ID"))
Or if you want a numeric variable as the result:
bind_cols(df %>%
gather(var, val, -matches("(Real|ID)")) %>%
select(ID, val), df %>%
gather(var2, val2, -matches("(Pred|ID)")) %>%
select(val2)) %>%
add_count(ID) %>%
group_by(ID) %>%
summarise(res = sum(val %in% val2)/first(n)) %>%
left_join(df, by = c("ID" = "ID"))
ID res Pred1 Pred2 Pred3 Real1 Real2 Real3
<int> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
1 1 0.333 A C E A D B
2 2 0 A B D E C C
3 3 0.333 E C A A B D
4 4 0.667 D A B B B D
5 5 1 B A C C A B
Hi, I have two data frames as follows:
df1:
ID x y z
1 a b c
2 a b c
3 a b c
4 a b c
and df2:
ID x y
2 d NA
3 NA e
and I am after a result like this:
df1:
ID x y z
1 a b c
2 d b c
3 a e c
4 a b c
I have been trying to use the match function as suggested in some other posts, but I keep running into the problem that the values in df1 get replaced with the NA values from df2.
This is the code I have been using, without luck:
for (i in names(df2)[2:length(names(df2))]) {
df1[i] <- df2[match(df1$ID, df2$ID)]
}
Thanks
Your code didn't work for me, so I changed it a little and now it works. If you are reading the data from an external file, pass stringsAsFactors = FALSE when you read it so you don't run into problems with factors.
df1 = data.frame("ID" = 1:4,"x" = rep("a",4), "y" =rep("b",4),"z" = rep("c",4),
stringsAsFactors=FALSE)
df2 = data.frame("ID" = 2:3,"x" = c("d",NA), "y" = c(NA,"e"),stringsAsFactors=FALSE)
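for (i in seq_len(nrow(df2))) {
  # keep only the non-NA columns of this row (a logical index also works when the row has no NA)
  new_data = df2[i, !is.na(unlist(df2[i, ])), drop = FALSE]
  # locate the matching row in df1 by ID instead of assuming ID equals the row number
  pos = match(new_data$ID, df1$ID)
  col_replace = intersect(colnames(new_data), colnames(df1))
  df1[pos, col_replace] = new_data[col_replace]
}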
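The match-based attempt from the question can also be made to work without looping over rows. This is a sketch of one way to do it in base R, not the method of any answer here: look up each df2 row in df1 once, then overwrite column by column while skipping the NA cells. It assumes the IDs are unique in df1.

```r
df1 <- data.frame(ID = 1:4, x = rep("a", 4), y = rep("b", 4), z = rep("c", 4),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = 2:3, x = c("d", NA), y = c(NA, "e"),
                  stringsAsFactors = FALSE)

# Row position in df1 for every row of df2 (NA if the ID is not in df1).
idx <- match(df2$ID, df1$ID)

# Update each shared column, ignoring NA replacement values.
for (col in setdiff(intersect(names(df1), names(df2)), "ID")) {
  keep <- !is.na(idx) & !is.na(df2[[col]])
  df1[[col]][idx[keep]] <- df2[[col]][keep]
}
df1
```

The loop runs once per column rather than once per row, which tends to matter when df2 has many rows and few columns.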
A solution using dplyr and tidyr. The idea is to convert both data frames to long format, join them, replace the values where a replacement exists, and convert the result back to wide format. df5 is the final output.
library(dplyr)
library(tidyr)
df3 <- df1 %>% gather(Col, Value, -ID)
df4 <- df2 %>% gather(Col, Value, -ID, na.rm = TRUE)
df5 <- df3 %>%
left_join(df4, by = c("ID", "Col")) %>%
mutate(Value.x = ifelse(!is.na(Value.y), Value.y, Value.x)) %>%
select(ID, Col, Value.x) %>%
spread(Col, Value.x)
df5
# ID x y z
# 1 1 a b c
# 2 2 d b c
# 3 3 a e c
# 4 4 a b c
DATA
df1 <- read.table(text = "ID x y z
1 a b c
2 a b c
3 a b c
4 a b c",
header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "ID x y
2 d NA
3 NA e",
header = TRUE, stringsAsFactors = FALSE)
As mentioned by alistaire, this is an update join, which is available in the data.table package:
library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on = "ID", x := ifelse(is.na(i.x), x, i.x)]
df1[df2, on = "ID", y := ifelse(is.na(i.y), y, i.y)]
df1
ID x y z
1: 1 a b c
2: 2 d b c
3: 3 a e c
4: 4 a b c
If there are many columns with replacement values, it might be worthwhile to follow www's suggestion to do the replacement after reshaping to long format where column names are treated as data:
library(data.table)
melt(setDT(df1), "ID")[
melt(setDT(df2), "ID", na.rm = TRUE), on = .(ID, variable), value := i.value][
, dcast(.SD, ID ~ variable)]
ID x y z
1: 1 a b c
2: 2 d b c
3: 3 a e c
4: 4 a b c
Data
df1 <- fread(
"ID x y z
1 a b c
2 a b c
3 a b c
4 a b c")
df2 <- fread(
"ID x y
2 d NA
3 NA e")
I have a CSV file that contains an org structure as follows, plus some additional columns. I use R to create charts from it, and that works great.
The challenge comes when I try to create the charts for a single manager together with its children and grandchildren.
Is there any way to do this kind of filtering in dplyr, or in an alternative package?
Sample format:
emp_id mgr_id nest_id
A A 0
B A 1
C B 2
D C 3
D1 D 4
D2 D 4
E C 3
E1 E 4
F C 3
G B 2
H G 3
The subset I need is for manager "C"
Scenario 1: emp_id == C should return all descendant nodes: 'D', 'D1', 'D2', 'E', 'E1', 'F'.
Expected structure:
manager all_children
C D
C D1
C D2
C E
C E1
C F
Scenario 2: emp_id == C should return all of the above nodes, but keep the mgr_id structure for 'D' and 'E'.
Expected structure:
manager all_children
C D
C E
C F
D D1
D D2
E E1
Consider base R's by(), which creates a list of data frames, one for every level of mgr_id (not just C):
SCENARIO 1
dfList <- by(df, df$mgr_id, function(i){
names(i) <- paste0(names(i), "_") # SUFFIX UNDERSCORE (TO AVOID DUP COLUMNS)
child <- merge(i, df, by.x="mgr_id_", by.y="emp_id")[,1:2]
grandchild <- merge(child, df, by.x="emp_id_", by.y="mgr_id")[c("mgr_id_", "emp_id")]
names(child) <- sub("_$", "", names(child)) # REMOVE TRAILING UNDERSCORE
names(grandchild) <- sub("_$", "", names(grandchild)) # REMOVE TRAILING UNDERSCORE
rbind(child, grandchild)
})
dfList$C
# mgr_id emp_id
# 1 C D
# 2 C E
# 3 C F
# 4 C D1
# 5 C D2
# 6 C E1
SCENARIO 2 (the columns selected for grandchild change, and its first column is then renamed)
dfList <- by(df, df$mgr_id, function(i){
names(i) <- paste0(names(i), "_") # SUFFIX UNDERSCORE (TO AVOID DUP COLUMNS)
child <- merge(i, df, by.x="mgr_id_", by.y="emp_id")[,1:2]
grandchild <- merge(child, df, by.x="emp_id_", by.y="mgr_id")[c("emp_id_", "emp_id")]
names(child) <- sub("_$", "", names(child)) # REMOVE TRAILING UNDERSCORE
names(grandchild) <- sub("_$", "", names(grandchild)) # REMOVE TRAILING UNDERSCORE
names(grandchild)[1] <- "mgr_id"
rbind(child, grandchild)
})
dfList$C
# mgr_id emp_id
# 1 C D
# 2 C E
# 3 C F
# 4 D D1
# 5 D D2
# 6 E E1
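The by() approach above goes two levels deep (children and grandchildren). If the hierarchy can be deeper, one possible sketch is a breadth-first walk down the tree in base R. The names `org` and `descendants` are hypothetical, and the code assumes the only self-reference is the top node (A reports to A).

```r
# Org structure copied from the question; `org` is a hypothetical name.
org <- data.frame(
  emp_id = c("A", "B", "C", "D", "D1", "D2", "E", "E1", "F", "G", "H"),
  mgr_id = c("A", "A", "B", "C", "D",  "D",  "C", "E",  "C", "B", "G"),
  stringsAsFactors = FALSE
)

# Collect everyone below `root`, one level at a time.
descendants <- function(org, root) {
  found    <- character(0)
  frontier <- root
  while (length(frontier) > 0) {
    # direct reports of the current level, ignoring self-references
    kids <- org$emp_id[org$mgr_id %in% frontier & org$emp_id != org$mgr_id]
    kids <- setdiff(kids, c(found, root))   # guard against revisiting nodes
    found    <- c(found, kids)
    frontier <- kids
  }
  found
}

kids <- descendants(org, "C")

# Scenario 1: every descendant listed under manager C.
scenario1 <- data.frame(manager = "C", all_children = kids, stringsAsFactors = FALSE)

# Scenario 2: same nodes, but each keeps its own direct manager.
scenario2 <- org[org$emp_id %in% kids, c("mgr_id", "emp_id")]
```

Both results contain the same six employees; they differ only in which manager is shown next to each one.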
Here is one solution using functions from dplyr and data.table. dt3 is the output for scenario 1, while dt4 is the output for scenario 2.
# Load packages
library(dplyr)
library(data.table)
# Create example data frame
dt <- read.table(text = "emp_id mgr_id nest_id
A A 0
B A 1
C B 2
D C 3
D1 D 4
D2 D 4
E C 3
E1 E 4
F C 3
G B 2
H G 3",
header = TRUE, stringsAsFactors = FALSE)
# Process the data
dt2 <- dt %>%
# Keep only levels deeper than 1 (drops nest_id 0 and 1)
filter(nest_id > 1) %>%
mutate(group_id = ifelse(nest_id > 2, 0, 1)) %>%
# Create "run_id", which will be used to fill manager label
mutate(run_id = rleid(group_id)) %>%
mutate(run_id = ifelse(run_id %% 2 == 0, run_id - 1, run_id)) %>%
group_by(run_id) %>%
mutate(manager = first(emp_id)) %>%
# Select for manager C
filter(manager %in% "C") %>%
ungroup() %>%
# Remove rows if manager == emp_id
filter(manager != emp_id) %>%
rename(all_children = emp_id)
# Scenario 1
dt3 <- dt2 %>% select(manager, all_children)
# Scenario 2
dt4 <- dt2 %>%
select(manager = mgr_id, all_children) %>%
arrange(manager, all_children)