R: reconstruct manager-employee data

I have a list of employee information with employee ids and direct line manager ids. I want to rearrange the data so that it lists every level of manager for each employee.
I want to create a loop that finds line managers repeatedly.
Here is the code to create a sample dataset:
employee_id <- 1:10
manager_id <- c(1, 1, 2, 3, 4, 2, 3, 1, 4, 5)
hr <- data.frame(employee_id, manager_id)
Here is what I expect, using employee_id 4 as an example:
employee_id managerL1 managerL2 managerL3
          4         3         2         1
I should also mention that this is a simplified example. In the real data I'm working with, manager and employee ids are not sequential; they are random numbers with prefixes. The ids themselves carry no information about managerial level. The levels are driven purely by the data.

This seems to require an iterative solution.
Start with the level 1 managers of our employees. The row index of the employee who is the manager of each employee is
i <- 1
idx <- match(hr$manager_id, hr$employee_id)
The manager's manager is hr$manager_id[idx], and we can use the same match() approach iteratively. Record and repeat until there is just a single employee as manager. (Note that the manager_id column already holds the level-1 manager, so manager_1 records level 2, and so on.)
repeat {
    idx <- match(hr$manager_id[idx], hr$employee_id)    # step up one level
    hr[[paste0("manager_", i)]] <- hr$employee_id[idx]  # record that manager
    if (length(unique(idx)) == 1)                       # everyone has reached the top
        break
    i <- i + 1
}
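On the sample data this pads each chain with the boss's id once it reaches the top. A quick check on employee 4 (a sketch of the expected result; note that manager_id itself is the level-1 manager):
hr[hr$employee_id == 4, ]
#   employee_id manager_id manager_1 manager_2 manager_3 manager_4
# 4           4          3         2         1         1         1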
A variant might allow for one or more top-level managers by using NA as their manager, and stopping appropriately:
hr <- data.frame(employee_id, manager_id)  # start from a fresh copy
hr$manager_id[1] <- NA  # the boss; there could be several top-level managers...
i <- 1
idx <- match(hr$manager_id, hr$employee_id)
repeat {
    idx <- match(hr$manager_id[idx], hr$employee_id)    # step up one level
    hr[[paste0("manager_", i)]] <- hr$employee_id[idx]  # record that manager
    if (all(is.na(idx)))                                # every chain has terminated
        break
    i <- i + 1
}
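On a fresh copy of the sample data the chains now terminate in NA instead of repeating the boss. The same check on employee 4 (again a sketch, with manager_id as the level-1 manager):
hr[hr$employee_id == 4, ]
#   employee_id manager_id manager_1 manager_2 manager_3 manager_4 manager_5
# 4           4          3         2         1        NA        NA        NA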

Here is an option with tidyverse:
library(tidyverse)
hr %>%
    uncount(manager_id, .remove = FALSE) %>%
    group_by(employee_id) %>%
    mutate(new_id = row_number(), nm1 = str_c('manager_', new_id)) %>%
    spread(nm1, new_id)
# A tibble: 10 x 7
# Groups:   employee_id [10]
#    employee_id manager_id manager_1 manager_2 manager_3 manager_4 manager_5
#          <int>      <dbl>     <int>     <int>     <int>     <int>     <int>
#  1           1          1         1        NA        NA        NA        NA
#  2           2          1         1        NA        NA        NA        NA
#  3           3          2         1         2        NA        NA        NA
#  4           4          3         1         2         3        NA        NA
#  5           5          4         1         2         3         4        NA
#  6           6          2         1         2        NA        NA        NA
#  7           7          3         1         2         3        NA        NA
#  8           8          1         1        NA        NA        NA        NA
#  9           9          4         1         2         3         4        NA
# 10          10          5         1         2         3         4         5
Or with map and spread:
hr %>%
    mutate(new_id = map(manager_id, seq)) %>%
    unnest %>%
    mutate(nm1 = str_c('manager_', new_id)) %>%
    spread(nm1, new_id)
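Note that both tidyverse snippets lean on the toy data's sequential ids: uncount(manager_id) and seq(manager_id) enumerate 1, ..., manager_id, which coincides with the real chain of managers only in this constructed example. Since the real ids are non-sequential, a per-employee walk may be safer; here is a minimal base R sketch (chain() is a helper name introduced here, not part of any package):
# chain() walks up the management chain for one employee, stopping at a
# self-managed boss, an NA manager, or any cycle
chain <- function(id, lookup) {
    out <- c()
    m <- lookup[[as.character(id)]]
    while (!is.null(m) && !is.na(m) && !(m %in% c(id, out))) {
        out <- c(out, m)
        m <- lookup[[as.character(m)]]
    }
    out
}
lookup <- setNames(as.list(manager_id), employee_id)
chain(4, lookup)
# [1] 3 2 1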

Related

Unique ID for interconnected cases

I have the following data frame, which shows which cases are interconnected:
   DebtorId DupDebtorId
1:        1           2
2:        1           3
3:        1           4
4:        5           1
5:        5           2
6:        5           3
7:        6           7
8:        7           6
My goal is to assign a unique group ID to each group of cases. The desired output is:
   DebtorId group
1:        1     1
2:        2     1
3:        3     1
4:        4     1
5:        5     1
6:        6     2
7:        7     2
My train of thought:
library(data.table)
example <- data.table(
    DebtorId = c(1, 1, 1, 5, 5, 5, 6, 7),
    DupDebtorId = c(2, 3, 4, 1, 2, 3, 7, 6)
)
unique_pairs <- example[!duplicated(t(apply(example, 1, sort))), ]  # get unique pairs of DebtorId and DupDebtorId
unique_pairs[, group := .GRP, by = .(DebtorId)]  # assign a group ID for each DebtorId
unique_pairs[, num := rowid(group)]
groups <- dcast(unique_pairs, group + DebtorId ~ num, value.var = 'DupDebtorId')  # reshape to wide, one row per group ID
# create a new data table with unique cases to assign group IDs to
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))), group = NA)
newdt$group <- as.numeric(newdt$group)
# loop through the mapped groups, picking the first group ID that contains the case
for (i in 1:nrow(newdt)) {
    a <- newdt[i]$DebtorId
    b <- min(which(groups[, -1] == a, arr.ind = TRUE)[, 1])
    newdt[i]$group <- b
}
Output:
   DebtorId group
1:        1     1
2:        2     1
3:        3     1
4:        4     1
5:        5     2
6:        6     3
7:        7     3
There are 2 problems with my approach:

1. From the output, you can see that it fails to recognize that case 5 belongs to group 1.
2. The final loop is agonizingly slow, which would render it useless for my real use case of 1M rows, and the traditional := way does not work with which().

I'm not sure whether my approach can be optimized, or whether there is a better way of doing this altogether.
This functionality already exists in igraph, so if you don't need to do it yourself, you can build a graph from your data frame and then extract the cluster membership. stack() is just an easy way to convert a named vector to a data frame.
library(igraph)
g <- graph.data.frame(example)
df_membership <- clusters(g)$membership
stack(df_membership)
#>   values ind
#> 1      1   1
#> 2      1   5
#> 3      2   6
#> 4      2   7
#> 5      1   2
#> 6      1   3
#> 7      1   4
Above, values corresponds to group and ind to DebtorId.
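If you need it in exactly the shape of the desired output, one possible follow-up (a sketch) converts the stacked factor back to integers and sorts:
out <- stack(df_membership)
out <- data.frame(DebtorId = as.integer(as.character(out$ind)), group = out$values)
out[order(out$DebtorId), ]
#   DebtorId group
# 1        1     1
# 5        2     1
# 6        3     1
# 7        4     1
# 2        5     1
# 3        6     2
# 4        7     2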

R - dataframe - every x rows new number in other column

my question is:
I have a matrix of 200.000 rows and 3 different columns (productID, week, order).
I want to put the productID (starting with 1) in the product column and create 26 rows for each ID. Than I want to put 1-26 in the week column for every ID.
I know it's not that hard, but I keep making mistakes.
Thank you so much for your help!
Are you looking for something like this?
tibble(productID = 1:4, week = 5:8, order = "Test") %>%
    tidyr::complete(week = 1:26, productID = 1:4, fill = list(order = NA_character_))
# A tibble: 104 x 3
    week productID order
   <int>     <int> <chr>
 1     1         1 NA
 2     1         2 NA
 3     1         3 NA
 4     1         4 NA
 5     2         1 NA
 6     2         2 NA
 7     2         3 NA
 8     2         4 NA
 9     3         1 NA
10     3         2 NA
# ... with 94 more rows
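The same grid can also be built in base R with expand.grid; a sketch, again assuming 4 products, 26 weeks, and an empty order column:
# productID varies fastest here, matching the tibble output above
prod_week <- expand.grid(productID = 1:4, week = 1:26)
prod_week$order <- NA_character_
head(prod_week, 4)
#   productID week order
# 1         1    1  <NA>
# 2         2    1  <NA>
# 3         3    1  <NA>
# 4         4    1  <NA>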

create id variable from table of duplicates

I have a dataframe where each row has a unique identifier, but some rows are actually duplicates.
fdf <- data.frame(name = c("fred", "ferd", "frad", "eric", "eirc", "george"),
                  id = 1:6)
fdf
#>     name id
#> 1   fred  1
#> 2   ferd  2
#> 3   frad  3
#> 4   eric  4
#> 5   eirc  5
#> 6 george  6
I have determined which rows are duplicated, and this information is stored in a second dataframe as pairs of the unique ids. So the key tells me row 1 is the same individual as rows 2 and 3, etc.
key <- data.frame(id1 = c(1, 1, 2, 4), id2 = c(2, 3, 3, 5))
key
#>   id1 id2
#> 1   1   2
#> 2   1   3
#> 3   2   3
#> 4   4   5
I'm struggling to think up a straightforward way to use the key to create an id variable in my original dataframe. Desired output would be:
fdf$realid <- c(1, 1, 1, 2, 2, 3)
fdf
#>     name id realid
#> 1   fred  1      1
#> 2   ferd  2      1
#> 3   frad  3      1
#> 4   eric  4      2
#> 5   eirc  5      2
#> 6 george  6      3
Edit for clarity
Keys here are the set of true connections between rows in the data.frame fdf. Thus you can imagine starting with the set of all feasible connections:
# id1 id2
# 1 2
# 1 3
# 1 4
# ...
# 6 4
# 6 5
determining which are true connections (based on the other variables in each observation).
# id1 id2 match
# 1 2 match
# 1 3 no match
# 1 4 match
# ...
# 6 4 no match
# 6 5 no match
and sub-setting to the cases that are matches.
The easiest way would be to recreate the key data frame in the following format (i.e. which id belongs to which realid):
key <- data.frame(id = c(1, 2, 3, 4, 5, 6),
                  realid = c(1, 1, 1, 2, 2, 3))
Then it is just a matter of merging fdf and key together with merge:
fdf <- merge(fdf, key, by = "id")
fdf
  id   name realid
1  1   fred      1
2  2   ferd      1
3  3   frad      1
4  4   eric      2
5  5   eirc      2
6  6 george      3
I didn't find a 'straightforward way', but the following seems to work well.
First you check which IDs belong together in a group, by checking whether there's 'overlap', i.e. whether the intersection between two rows in key is non-empty:
check_overlap <- function(pair1, pair2) {
    newset <- intersect(pair1, pair2)
    length(newset) != 0
}
Then we can apply this function to the rows in key against the other rows. If a row has been matched already, it is automatically removed from key, like this:
check_overlaps <- function(key) {
    cont <- data.frame()
    i <- 1
    while (nrow(key) > 0) {
        ids <- apply(key, 1, check_overlap, key[1, ])
        vals <- unique(unlist(key[ids, ]))
        key <- key[!ids, ]
        cont <- rbind(cont, cbind(vals, rep(i, length(vals))))
        i <- i + 1
    }
    return(cont)
}
new_ids <- check_overlaps(key)
#   vals V2
# 1    1  1
# 2    2  1
# 3    3  1
# 4    4  2
# 5    5  2
The problem with merging fdf and new_ids, however, is that some old IDs may not occur in key at all, even though they should still be mapped to a new ID. You can patch key beforehand like this:
for (val in unique(fdf$id)) {
    if (!(val %in% unlist(key))) {
        key <- rbind(key, c(val, val))
    }
}
new_ids2 <- check_overlaps(key)
#   vals V2
# 1    1  1
# 2    2  1
# 3    3  1
# 4    4  2
# 5    5  2
# 6    6  3
Which is easy to merge with fdf:
merge(fdf, new_ids2, by.x = "id", by.y = "vals")
#   id   name V2
# 1  1   fred  1
# 2  2   ferd  1
# 3  3   frad  1
# 4  4   eric  2
# 5  5   eirc  2
# 6  6 george  3
If I understand your question correctly, it can be solved by creating groups of matching ids and creating a new (real) id out of these groups:
# determine the groups of ids
id_groups <- list()
i <- 1
for (id in unique(key$id1)) {
    if (!(id %in% unlist(id_groups))) {
        id_groups[[i]] <- c(id, key$id2[key$id1 == id])
        i <- i + 1
    }
}
# add ids without a match
id_groups <- c(id_groups, setdiff(fdf$id, unlist(id_groups)))
# for every id in fdf, set real_id to the index of the group in id_groups that contains the id
fdf$real_id <- sapply(fdf$id, function(id) {
    which(sapply(id_groups, function(group) id %in% group))
})
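As in the interconnected-cases question above, realid is really just the connected component that each id falls in, so igraph can do the heavy lifting here too. A minimal sketch, assuming key and fdf as defined in the question:
library(igraph)
# edges come from key; passing fdf$id as the vertex set keeps unmatched
# ids (like 6) as their own singleton components
g <- graph_from_data_frame(key, vertices = data.frame(id = fdf$id))
fdf$realid <- unname(components(g)$membership[as.character(fdf$id)])
fdf$realid
# [1] 1 1 1 2 2 3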

R - Replace values in a specific even column based on values from a specific odd column - Application to the whole dataframe

My data frame:
data <- data.frame(A = c(1, 5, 6, 8, 7), qA = c(1, 2, 2, 3, 1), B = c(2, 5, 6, 8, 4), qB = c(2, 2, 1, 3, 1))
For the pair A and qA (= quality of A): I want the values of A whose quality value is 1 or 3 to be replaced by NA.
And the same for the pair B and qB.
The final data has to look like this:
desired_data <- data.frame(A = c(NA, 5, 6, NA, NA), qA = c(1, 2, 2, 3, 1), B = c(2, 5, NA, NA, NA), qB = c(2, 2, 1, 3, 1))
My question is how to do this.
I have a big dataframe with about 90 columns, so I need code that doesn't rely on the column names.
To help, I have this bit of code which selects the columns starting with the letter "q":
data[, grep("^[q]", colnames(data))]
You could just do this...
data[, seq(1, ncol(data), 2)][(data[, seq(2, ncol(data), 2)] == 1) |
                              (data[, seq(2, ncol(data), 2)] == 3)] <- NA
data
   A qA  B qB
1 NA  1  2  2
2  5  2  5  2
3  6  2 NA  1
4 NA  3 NA  3
5 NA  1 NA  1
One solution is to separate the data into two tables and use vectorisation in base R:
data <- data.frame(A = c(1, 5, 6, 8, 7), qA = c(1, 2, 2, 3, 1), B = c(2, 5, 6, 8, 4), qB = c(2, 2, 1, 3, 1))
data
#>   A qA B qB
#> 1 1  1 2  2
#> 2 5  2 5  2
#> 3 6  2 6  1
#> 4 8  3 8  3
#> 5 7  1 4  1
quality <- data[, grep("^[q]", colnames(data))]
data2 <- data[, setdiff(colnames(data), names(quality))]
data2[quality == 1 | quality == 3] <- NA
data2
#>    A  B
#> 1 NA  2
#> 2  5  5
#> 3  6 NA
#> 4 NA NA
#> 5 NA NA
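To write the masked values back into the original frame, one possible final step is (a sketch):
data[, names(data2)] <- data2
data
#>    A qA  B qB
#> 1 NA  1  2  2
#> 2  5  2  5  2
#> 3  6  2 NA  1
#> 4 NA  3 NA  3
#> 5 NA  1 NA  1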

how to replace the NAs in a data frame with the average for the corresponding group

I have a data frame like this:
nums id
1233 1
3232 2
2334 3
3330 1
1445 3
3455 3
7632 2
NA 3
NA 1
And I can get the average nums for each id by using:
id_avg <- aggregate(nums ~ id, data = dat, FUN = mean)
What I would like to do is to replace each NA with the average nums of the corresponding id. For example, if the average nums of ids 1, 2, 3 were 1000, 2000, 3000 respectively, the NA with id == 3 would be replaced by 3000, and the last NA, whose id == 1, by 1000.
I tried the following code to achieve this:
temp <- dat[is.na(dat$nums), ]$id
dat[is.na(dat$nums), ]$nums <- id_avg[id_avg[, "id"] == temp, ]$nums
However, the second part
id_avg[id_avg[, "id"] == temp, ]$nums
always gives me NA, which means I keep assigning NA to the very NAs I want to replace.
I don't know where I went wrong. Or do you have a better method to do this?
Thank you
The comparison id_avg[, "id"] == temp recycles temp along the id column instead of looking up each element, so it selects the wrong rows. Since id_avg is sorted by id and the ids here are 1, 2, 3, you can fix it by indexing by position instead:
dat[is.na(dat$nums), ]$nums <- id_avg$nums[temp]
      nums id
1 1233.000  1
2 3232.000  2
3 2334.000  3
4 3330.000  1
5 1445.000  3
6 3455.000  3
7 7632.000  2
8 2411.333  3
9 2281.500  1
What you want is contained in the zoo package:
library(zoo)
na.aggregate.default(dat, by = dat$id)
      nums id
1 1233.000  1
2 3232.000  2
3 2334.000  3
4 3330.000  1
5 1445.000  3
6 3455.000  3
7 7632.000  2
8 2411.333  3
9 2281.500  1
Here is a dplyr way:
dat %>%
    group_by(id) %>%
    mutate(nums = replace(nums, is.na(nums), as.integer(mean(nums, na.rm = TRUE))))
# Source: local data frame [9 x 2]
# Groups: id [3]
#    nums    id
#   <int> <int>
# 1  1233     1
# 2  3232     2
# 3  2334     3
# 4  3330     1
# 5  1445     3
# 6  3455     3
# 7  7632     2
# 8  2411     3
# 9  2281     1
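Note the as.integer() truncates the group means so that nums stays an integer column. If you would rather keep the fractional part, a coalesce() variant works (a sketch; it converts nums to double first):
dat %>%
    group_by(id) %>%
    mutate(nums = coalesce(as.numeric(nums), mean(nums, na.rm = TRUE)))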
You essentially want to merge id_avg back into the original data frame by the id column, so you can also use match() to follow your original logic:
dat$nums[is.na(dat$nums)] <- id_avg$nums[match(dat$id[is.na(dat$nums)], id_avg$id)]
dat
#        nums id
# 1: 1233.000  1
# 2: 3232.000  2
# 3: 2334.000  3
# 4: 3330.000  1
# 5: 1445.000  3
# 6: 3455.000  3
# 7: 7632.000  2
# 8: 2411.333  3
# 9: 2281.500  1
