Say I have two data frames, df1 and df2, as follows:
df1:
EmployeeID Skill
1 A
1 B
1 C
2 B
2 D
2 C
2 F
3 A
3 J
df2:
Opportunity.ID Skill
12345 A
12345 B
56788 C
56788 B
56788 F
09988 H
What I'm looking to do is build a new data frame with every EmployeeID that has all the skills required for a specific Opportunity.ID, not just one of them. This is why a simple merge or left/right join will not be enough.
In our case, what I would like to have is:
Opportunity.ID Employee.ID
12345 1
56788 2
09988 NA
Note that employee 3 should not be assigned to opportunity 12345 because he only has one skill among the two required.
Any help would be greatly appreciated.
Here's one way using dplyr:
library(dplyr)
df2 %>%
left_join(df1, by = "Skill") %>%
group_by(Opportunity.ID) %>%
mutate(test = ave(Skill, EmployeeID, FUN = function(x) all(Skill %in% x))) %>%
ungroup() %>%
filter(test != "FALSE") %>%
distinct(Opportunity.ID, EmployeeID)
# A tibble: 3 x 2
Opportunity.ID EmployeeID
<int> <int>
1 12345 1
2 56788 2
3 9988 NA
There is probably a better solution, but with the data.table-package I came to the following approach:
library(data.table) # load the package
setDT(df1) # convert 'df1' to a 'data.table'
setDT(df2) # convert 'df2' to a 'data.table'
df2[, .(EmployeeID = df1[.SD[, .(Skill, n = .N)], on = .(Skill)
][, .(ne = .N), by = .(EmployeeID, n)
][n == ne, EmployeeID])
, by = Opportunity.ID]
which gives:
Opportunity.ID EmployeeID
1: 12345 1
2: 56788 2
3: 9988 NA
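For comparison, the "employee must have every required skill" check can also be sketched in plain base R. This is only an illustration built on the sample data from the question; the helper names skills_by_emp and needs_by_opp are made up here, and Opportunity.ID is kept as character so the leading zero survives.

```r
# Sample data from the question (Opportunity.ID kept as character)
df1 <- data.frame(EmployeeID = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                  Skill = c("A", "B", "C", "B", "D", "C", "F", "A", "J"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Opportunity.ID = c("12345", "12345", "56788", "56788", "56788", "09988"),
                  Skill = c("A", "B", "C", "B", "F", "H"),
                  stringsAsFactors = FALSE)

# One character vector of skills per employee / per opportunity
skills_by_emp <- split(df1$Skill, df1$EmployeeID)
needs_by_opp  <- split(df2$Skill, df2$Opportunity.ID)

res <- do.call(rbind, lapply(names(needs_by_opp), function(opp) {
  # TRUE for employees whose skills cover every required skill
  ok  <- vapply(skills_by_emp,
                function(s) all(needs_by_opp[[opp]] %in% s),
                logical(1))
  emp <- names(skills_by_emp)[ok]
  if (length(emp) == 0) emp <- NA_character_  # nobody qualifies
  data.frame(Opportunity.ID = opp,
             EmployeeID = as.integer(emp),
             stringsAsFactors = FALSE)
}))
res
```

If several employees qualify for the same opportunity, this returns one row per qualifying employee.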
Suppose I have a DT as -
id values valid_types
1 2|3 100|200
2 4 200
3 2|1 500|100
The valid_types column tells me which types are valid for each entry. There are 4 types in total (100, 200, 500, 2000). Each entry lists its valid types and the corresponding values as |-separated character strings.
I want to transform this to a DT which has the types as columns and their corresponding values.
Expected:
id 100 200 500
1 2 3 NA
2 NA 4 NA
3 1 NA 2
I thought I could take both the columns and split them on | which would give me two lists. I would then combine them by setting the keys as names of the types list and then convert the final list to a DT.
But the idea I came up with is very convoluted and not really working.
Is there a better/easier way to do this?
Here is another data.table approach:
dcast(
DT[, lapply(.SD, function(x) strsplit(x, "\\|")[[1L]]), by = id],
id ~ valid_types, value.var = "values"
)
Using tidyr library you can use separate_rows with pivot_wider :
library(tidyr)
df %>%
separate_rows(values, valid_types, sep = '\\|', convert = TRUE) %>%
pivot_wider(names_from = valid_types, values_from = values)
# id `100` `200` `500`
# <int> <int> <int> <int>
#1 1 2 3 NA
#2 2 NA 4 NA
#3 3 1 NA 2
A data.table way would be :
library(data.table)
library(splitstackshape)
setDT(df)
dcast(cSplit(df, c('values', 'valid_types'), sep = '|', direction = 'long'),
id~valid_types, value.var = 'values')
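The split-then-reshape idea from the question can also be sketched without extra packages. This is a rough base R version, assuming both columns are character and that each row has the same number of values and types; the names long and wide are just illustrative.

```r
# Sample data from the question, with both columns as character
DT <- data.frame(id = 1:3,
                 values = c("2|3", "4", "2|1"),
                 valid_types = c("100|200", "200", "500|100"),
                 stringsAsFactors = FALSE)

# Split both columns on the literal "|"
vals  <- strsplit(DT$values, "|", fixed = TRUE)
types <- strsplit(DT$valid_types, "|", fixed = TRUE)

# Long format: one row per (id, type, value)
long <- data.frame(id = rep(DT$id, lengths(vals)),
                   type = unlist(types),
                   value = as.integer(unlist(vals)),
                   stringsAsFactors = FALSE)

# Wide format: ids as rows, types as columns, NA where a type is absent
wide <- tapply(long$value, list(long$id, long$type), function(v) v[1])
wide
```

The result is a matrix with ids as row names; as.data.frame(wide) turns it into a data frame if needed.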
I have a dataframe like this
id start end
1 20/06/88 24/07/89
1 27/07/89 13/04/93
1 14/04/93 6/09/95
2 3/01/92 11/02/94
2 30/03/94 16/04/96
2 17/04/96 18/08/97
that I would like to merge with this other dataframe
id date
1 26/08/88
2 10/05/96
The resulting merged dataframe should look like this
id start end date
1 20/06/88 24/07/89 26/08/88
1 27/07/89 13/04/93 NA
1 14/04/93 6/09/95 NA
2 3/01/92 11/02/94 NA
2 30/03/94 16/04/96 NA
2 17/04/96 18/08/97 10/05/96
In practice I want to merge the two dataframes based on id and on the fact that date must lie within the interval spanned by the start and end vars of the first dataframe.
Do you have any suggestions on how to do this? I tried to use the fuzzyjoin package, but I ran into memory issues.
Many thanks to everyone
Might be a dupe; I will remove this when I find a good target. In the meantime, we could use fuzzyjoin
library(tidyverse)
library(fuzzyjoin)
df1 %>%
mutate_at(2:3, as.Date, "%d/%m/%y") %>%
fuzzy_left_join(
df2 %>% mutate(date = as.Date(date, "%d/%m/%y")),
by = c("id" = "id", "start" = "date", "end" = "date"),
match_fun = list(`==`, `<`, `>`))
# id.x start end id.y date
#1 1 1988-06-20 1989-07-24 1 1988-08-26
#2 1 1989-07-27 1993-04-13 NA <NA>
#3 1 1993-04-14 1995-09-06 NA <NA>
#4 2 1992-01-03 1994-02-11 NA <NA>
#5 2 1994-03-30 1996-04-16 NA <NA>
#6 2 1996-04-17 1997-08-18 2 1996-05-10
All that remains is tidying up the id columns.
Sample data
df1 <- read.table(text = "
id start end
1 20/06/88 24/07/89
1 27/07/89 13/04/93
1 14/04/93 6/09/95
2 3/01/92 11/02/94
2 30/03/94 16/04/96
2 17/04/96 18/08/97", header = T)
df2 <- read.table(text = "
id date
1 26/08/88
2 10/05/96 ", header = T)
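You can use sqldf for complex joins. Note that the comparisons are only chronological if start, end and date have been converted to Date first; as dd/mm/yy strings they would be compared lexically.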
require(sqldf)
sqldf("SELECT df1.*,df2.date,df2.id as id2
FROM df1
LEFT JOIN df2
ON df1.id = df2.id AND
df1.start < df2.date AND
df1.end > df2.date")
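If memory is the concern, a lighter sketch in base R is possible under one assumption that the question does not state explicitly: df2 holds at most one date per id. In that case match() can look up each row's candidate date directly, with no intermediate cross join. The bounds are treated as inclusive here, unlike the strict < and > used in the answers above.

```r
# Sample data from the question, parsed as Date
df1 <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  start = as.Date(c("20/06/88", "27/07/89", "14/04/93",
                    "3/01/92", "30/03/94", "17/04/96"), "%d/%m/%y"),
  end = as.Date(c("24/07/89", "13/04/93", "6/09/95",
                  "11/02/94", "16/04/96", "18/08/97"), "%d/%m/%y"))
df2 <- data.frame(id = c(1, 2),
                  date = as.Date(c("26/08/88", "10/05/96"), "%d/%m/%y"))

# Candidate date for every row of df1 (assumes one date per id in df2)
cand <- df2$date[match(df1$id, df2$id)]

# Keep the date only where it falls inside [start, end]
inside <- !is.na(cand) & cand >= df1$start & cand <= df1$end
df1$date <- cand
df1$date[!inside] <- NA
df1
```

With several dates per id this shortcut no longer applies, and a proper non-equi join (as in the answers above) is needed.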
I have a statistical routine that does not like exact duplicate rows (ignoring the ID), because they result in null distances.
So I first detect and remove the duplicates, apply my routine, and then merge back the records that were set aside.
For simplicity, consider I use rownames as ID/key.
I have found following way to achieve my result in base R:
data <- data.frame(x=c(1,1,1,2,2,3),y=c(1,1,1,4,4,3))
# check duplicates and get their ID -- cf. https://stackoverflow.com/questions/12495345/find-indices-of-duplicated-rows
dup1 <- duplicated(data)
dupID <- rownames(data)[dup1 | duplicated(data[nrow(data):1, ])[nrow(data):1]]
# keep only those records that do have duplicates, to avoid running the following steps on all rows
datadup <- data[dupID,]
# "hash" row
rowhash <- apply(datadup, 1, paste, collapse="_")
idmaps <- split(rownames(datadup),rowhash)
idmaptable <- do.call("rbind",lapply(idmaps,function(vec)data.frame(mappedid=vec[1],otherids=vec[-1],stringsAsFactors = FALSE)))
Which gives me what I want, ie deduplicated data (easy) and mapping table.
> (data <- data[!dup1,])
x y
1 1 1
4 2 4
6 3 3
> idmaptable
mappedid otherids
1_1.1 1 2
1_1.2 1 3
2_4 4 5
I wonder whether there is a simpler or more effective method (data.table / dplyr accepted). Any alternative to propose?
With data.table...
library(data.table)
setDT(data)
# tag groups of dupes
data[, g := .GRP, by=x:y]
# do whatever analysis
f = function(DT) Reduce(`+`, DT)
resDT = unique(data, by="g")[, res := f(.SD), .SDcols = x:y][]
# "update join" the results back to the main table if needed
data[resDT, on=.(g), res := i.res ]
The OP skipped a central part of the example (usage of the deduped data), so I just made up f.
A solution using tidyverse. I usually don't store information in the row names, so I created ID and ID2 to store information. But of course, you can change that based on your needs.
library(tidyverse)
idmaptable <- data %>%
rowid_to_column() %>%
group_by(x, y) %>%
filter(n() > 1) %>%
unite(ID, x, y) %>%
mutate(ID2 = 1:n()) %>%
group_by(ID) %>%
mutate(ID_type = ifelse(row_number() == 1, "mappedid", "otherids")) %>%
spread(ID_type, rowid) %>%
fill(mappedid) %>%
drop_na(otherids) %>%
mutate(ID2 = 1:n())
idmaptable
# A tibble: 3 x 4
# Groups: ID [2]
ID ID2 mappedid otherids
<chr> <int> <int> <int>
1 1_1 1 1 2
2 1_1 2 1 3
3 2_4 1 4 5
Some improvements to your base R solution,
df <- data[duplicated(data)|duplicated(data, fromLast = TRUE),]
do.call(rbind, lapply(split(rownames(df),
do.call(paste, c(df, sep = '_'))), function(i)
data.frame(mapped = i[1],
others = i[-1],
stringsAsFactors = FALSE)))
Which gives,
mapped others
1_1.1 1 2
1_1.2 1 3
2_4 4 5
And of course,
unique(data)
x y
1 1 1
4 2 4
6 3 3
I have the following table:
id origin destination price
1 A B 2
1 C D 2
2 A B 3
3 B E 6
3 E C 6
3 C F 6
Basically what I want to do is to group it by id, select the first element from origin, and keep the last element from destination resulting in this table.
id origin destination price
1 A D 2
2 A B 3
3 B F 6
I know how to select the first and last row, but not to do what I want.
df %>%
group_by(id) %>%
slice(c(1, n())) %>%
ungroup()
Is it possible to do this with dplyr or even with data.table?
A solution with library(data.table):
unique(setDT(df)[, "origin" := origin[1] , by = id][, "destination" := destination[.N], by = id][, "price" := price[1] , by = id][])
A shortcut suggested by Imo:
setDT(df)[, .(origin=origin[1], destination=destination[.N], price=price[1]), by=id]
A base R approach using split:
do.call(rbind, lapply(split(df, df$id),
function(a) with(a, data.frame(origin=head(origin,1), destination=tail(destination,1), price=head(price,1)))))
# origin destination price
#1 A D 2
#2 A B 3
#3 B F 6
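Since the question also asks about dplyr, here is a minimal sketch using summarise() with first() and last(). It assumes the rows within each id are already in the intended order, as in the sample data.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(id = c(1, 1, 2, 3, 3, 3),
                 origin = c("A", "C", "A", "B", "E", "C"),
                 destination = c("B", "D", "B", "E", "C", "F"),
                 price = c(2, 2, 3, 6, 6, 6),
                 stringsAsFactors = FALSE)

# One row per id: first origin, last destination, first price
res <- df %>%
  group_by(id) %>%
  summarise(origin = first(origin),
            destination = last(destination),
            price = first(price))
res
```

Unlike the base R version above, this keeps id as a regular column in the result.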
I have data with a grouping variable ("from") and values ("number"):
from number
1 1
1 1
2 1
2 2
3 2
3 2
I want to subset the data and select groups which have two or more unique values. In my data, only group 2 has more than one distinct 'number', so this is the desired result:
from number
2 1
2 2
Several possibilities, here's my favorite
library(data.table)
setDT(df)[, if(+var(number)) .SD, by = from]
# from number
# 1: 2 1
# 2: 2 2
Basically, for each group we check whether there is any variance; if there is, we return the group's rows.
With base R, I would go with
df[as.logical(with(df, ave(number, from, FUN = var))), ]
# from number
# 3 2 1
# 4 2 2
Edit: for non-numeric data you could try the new uniqueN function from the devel version of data.table (or use length(unique(number)) > 1 instead)
setDT(df)[, if(uniqueN(number) > 1) .SD, by = from]
You could try
library(dplyr)
df1 %>%
group_by(from) %>%
filter(n_distinct(number)>1)
# from number
#1 2 1
#2 2 2
Or using base R
indx <- rowSums(!!table(df1))>1
subset(df1, from %in% names(indx)[indx])
# from number
#3 2 1
#4 2 2
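Or (this keeps groups with no repeated values at all, which gives the same result here because every group has exactly two rows)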
df1[with(df1, !ave(number, from, FUN=anyDuplicated)),]
# from number
#3 2 1
#4 2 2
Using the variance idea shared by David, but done the dplyr way:
library(dplyr)
df %>%
group_by(from) %>%
mutate(variance=var(number)) %>%
filter(variance!=0) %>%
select(from,number)
#Source: local data frame [2 x 2]
#Groups: from
#from number
#1 2 1
#2 2 2