Create a mapping table of duplicated id / keys

Create a mapping table of duplicated id / keys - r

I do have a statistical routine that does not like row exact duplicates (without ID) as resulting into null distances.
So I first detect duplicates which I remove, apply my routines and merge back records left aside.
For simplicity, consider I use rownames as ID/key.
I have found following way to achieve my result in base R:
data <- data.frame(x=c(1,1,1,2,2,3),y=c(1,1,1,4,4,3))
# check duplicates and get their ID -- cf. https://stackoverflow.com/questions/12495345/find-indices-of-duplicated-rows
dup1 <- duplicated(data)
dupID <- rownames(data)[dup1 | duplicated(data[nrow(data):1, ])[nrow(data):1]]
# keep only those records that do have duplicates to preveng running folowing steps on all rows
datadup <- data[dupID,]
# "hash" row
rowhash <- apply(datadup, 1, paste, collapse="_")
idmaps <- split(rownames(datadup),rowhash)
idmaptable <- do.call("rbind",lapply(idmaps,function(vec)data.frame(mappedid=vec[1],otherids=vec[-1],stringsAsFactors = FALSE)))
Which gives me what I want, ie deduplicated data (easy) and mapping table.
> (data <- data[!dup1,])
x y
1 1 1
4 2 4
6 3 3
> idmaptable
mappedid otherids
1_1.1 1 2
1_1.2 1 3
2_4 4 5
I wonder whether there is a simpler or more effective method (data.table / dplyr accepted). Any alternative to propose?

With data.table...
library(data.table)
setDT(data)
# tag groups of dupes
data[, g := .GRP, by=x:y]
# do whatever analysis
f = function(DT) Reduce(`+`, DT)
resDT = unique(data, by="g")[, res := f(.SD), .SDcols = x:y][]
# "update join" the results back to the main table if needed
data[resDT, on=.(g), res := i.res ]
The OP skipped a central part of the example (usage of the deduped data), so I just made up f.

A solution using tidyverse. I usually don't store information in the row names, so I created ID and ID2 to store information. But of course, you can change that based on your needs.
library(tidyverse)
idmaptable <- data %>%
rowid_to_column() %>%
group_by(x, y) %>%
filter(n() > 1) %>%
unite(ID, x, y) %>%
mutate(ID2 = 1:n()) %>%
group_by(ID) %>%
mutate(ID_type = ifelse(row_number() == 1, "mappedid", "otherids")) %>%
spread(ID_type, rowid) %>%
fill(mappedid) %>%
drop_na(otherids) %>%
mutate(ID2 = 1:n())
idmaptable
# A tibble: 3 x 4
# Groups: ID [2]
ID ID2 mappedid otherids
<chr> <int> <int> <int>
1 1_1 1 1 2
2 1_1 2 1 3
3 2_4 1 4 5

Some improvements to your base R solution,
df <- data[duplicated(data)|duplicated(data, fromLast = TRUE),]
do.call(rbind, lapply(split(rownames(df),
do.call(paste, c(df, sep = '_'))), function(i)
data.frame(mapped = i[1],
others = i[-1],
stringsAsFactors = FALSE)))
Which gives,
mapped others
1_1.1 1 2
1_1.2 1 3
2_4 4 5
And of course,
unique(data)
x y
1 1 1
4 2 4
6 3 3

Related

R: conditionally mutate a variable when columns match in different dataframes

I am attempting to write some R code that assesses whether or not two dataframes have any matches in their columns. If there are matches, one of the columns in the second dataframe should assign a "link" (via the links variable) to the first dataframe using the id column of the first dataframe.
In the event that there are multiple matches, I am trying to get the "link" variable to randomly select one of the matching id's.
Some reproducible code:
library(dplyr)
df1 = data.frame(ids = c(1:5),
var = c("a","a","c","b","b"))
df2 = data.frame(var = c('c','a','b','b','d'),
links = 0)
Ideally, I would like a resulting dataframe that looks like:
var links
1 c 3
2 a 1 or 2
3 b 4 or 5
4 b 4 or 5
5 d 0
where observations in the links column randomly select ids from df1 when df1$var matches df2$var. In the dataframe above, this is denoted by "or".
Note 1: The links column should be a numeric, I only made it character to allow to write the word "or".
Note 2: If there is not a match between df1$var and df2$var, the links column should remain a 0.
So far, I've gone this route, but I'm unsure about what to put after the ~
linked_df = df2 %>%
mutate(links=case_when(links==0 & var %in% df1$var ~
sample(c(df1$ids),n(),replace=T) # unsure about this line
TRUE ~ links)

I think this is what you want. I've left the ids column in the result, but
it can be removed when the sampling is complete.
library(dplyr)
library(tidyr)
df1_nest = df1 %>%
group_by(var) %>%
summarize(ids = list(ids))
safe_sample = function(x, ...) {
if(length(x) == 1) return(x)
sample(x, ...)
}
set.seed(47)
df2 %>%
left_join(df1_nest) %>%
mutate(
links = sapply(ids, \(x) if(is.null(x)) 0L else safe_sample(x, size = 1))
)
# Joining, by = "var"
# var links ids
# 1 c 3 3
# 2 a 1 1, 2
# 3 b 4 4, 5
# 4 b 5 4, 5
# 5 d 0 NULL

Something like this could do the trick, just a map of a filter of the first dataframe:
df2 %>%
as_tibble() %>%
mutate(links = map(var, ~sample(filter(df1, var == .)$ids), 1),
index = row_number()) %>%
unnest(links, keep_empty = TRUE) %>%
group_by(index) %>%
slice_sample(n = 1) %>%
ungroup() %>%
select(-index)
# # A tibble: 5 × 2
# var links
# <chr> <int>
# 1 c 1
# 2 a 1
# 3 b 4
# 4 b 5
# 5 d NA

Use apply functions within %>%

Below I create a function that deletes a specific column if there is only one unique value in it. Can I somehow use lapply within %>% to avoid calling the function three times? Or even call the function for all columns?
df <- tibble(col1 = sample(1:6), col2 = sample(1:6), col3 = 3, col4 = 4)
condDelCol <- function(mycolumn, mydataframe) {
if(length(unique(mydataframe[[mycolumn]])) == 1) { mydataframe[[mycolumn]] = NULL }
mydataframe
}
df %>%
condDelCol("col2", .) %>%
condDelCol("col3", .) %>%
condDelCol("col4", .)

With dplyr, an option is select_if
library(dplyr)
df %>%
select_if(~ n_distinct(.) > 1)
# A tibble: 6 x 2
# col1 col2
# <int> <int>
#1 1 6
#2 6 1
#3 5 5
#4 3 4
#5 4 2
#6 2 3
Or another way is base R by looping over the columns with sapply, create a logical vector, extract the column names that have only single unique value and assign (<-) it to NULL
i1 <- sapply(df, function(x) length(unique(x)))
df[names(which(i1 == 1))] <- NULL
Or with Filter
Filter(var, df)

You could use this one as well. It ignores the columns for which the standard deviation is 0.
df[, sapply(df, sd) != 0]
# A tibble: 6 x 2
col1 col2
<int> <int>
1 1 3
2 5 6
3 6 1
4 2 2
5 3 4
6 4 5
or if you want to use the pipe operator
df %>%
select(which(sapply(df, sd) != 0))

Filter Dataframe with Tidy Evaluation

I am handling a large dataset. First, for certain columns (X1, X2, ...), I am trying to identify a range of value (a, b) consists of repeated value (a > n, b > n). Next, I wish to filter row based on the condition which matches respective columns to result given in the previous step.
Here is a reproducible example simulating the scenario I am facing,
library(tidyverse)
set.seed(1122)
vecs <- lapply(X = 1:2, function(x) rep(c(1, 2, 3), times = 10) %>% sample() %>% head(10))
names(vecs) <- paste0("col_", 1:2)
dat <- vecs %>% as.data.frame()
dat
col_1 col_2
1 3 2
2 1 1
3 1 1
4 1 2
5 1 2
6 3 3
7 3 3
8 2 1
9 1 3
10 2 2
I am able to identify the range by the following method,
# Which col has repeated value more than 3 appearances?
more_than_3 <- function(df, var){
var <- rlang::sym(var)
df %>%
group_by(!!var) %>%
summarise(n = n()) %>%
filter(n > 3) %>%
pull(!!var) %>%
range()
}
cols_name <- c("col_1", "col_2")
some_range <- purrr::map(cols_name, more_than_3, df = dat)
names(some_range) <- cols_name
some_range
$col_1
[1] 1 1
$col_2
[1] 2 2
However, to filter out values that fall outside the upper limit, this is what I do.
dat %>%
filter(col_1 <= some_range[["col_1"]][2],
col_2 <= some_range[["col_2"]][2])
col_1 col_2
1 1 1
2 1 1
3 1 2
4 1 2
I believe there must be a more efficient and elegant way of filtering the result based on tidy evaluation. Can someone point me to the right direction?
Many thanks in advance.

First let's try to create a small function that creates a single filter expression for one column. This function will take a symbol and then transform to string but it could be the other way around:
new_my_filter_call_upper <- function(sym, range) {
col_name <- as.character(sym)
col_range <- range[[col_name]]
if (is.null(col_range)) {
stop(sprintf("Can't find column `%s` to compute range", col_name), call. = FALSE)
}
expr(!!sym < !!col_range[[2]])
}
Let's try it:
new_my_filter_call_upper(quote(foobar), some_range)
#> Error: Can't find column `foobar` to compute range
# It works!
new_my_filter_call_upper(quote(col_1), some_range)
#> col_1 < 3
Now we're ready to create a pipeline verbs that will take a data frame and bare column names.
# Probably cleaner to pass range as argument. Prefix with dot to allow
# columns named `range`.
my_filter <- function(.data, ..., .range) {
# ensyms() guarantees there won't be complex expressions
syms <- rlang::ensyms(...)
# Now let's map the function to create many calls:
calls <- purrr::map(syms, new_my_filter_call_upper, range = .range)
# And we're ready to filter with those expressions:
dplyr::filter(.data, !!!calls)
}
Let's try it:
dat %>% my_filter(col_1, col_2, .range = some_range)
#> col_1 col_2 NA.
#> 1 2 1 1
#> 2 2 2 1

We could use map2
library(purrr)
map2(dat, some_range, ~ .x < .y[2]) %>%
reduce(`&`) %>%
dat[.,]
# col_1 col_2
#1 2 2
#2 1 1
#3 1 2
#6 1 1
Or with pmap
pmap(list(dat, some_range %>%
map(2)), `<`) %>%
reduce(`&`) %>%
dat[.,]

Multiple values in one cell

I have data looking somewhat similar to this:
number type results
1 5 x, y, z
2 6 a
3 8 x
1 5 x, y
Basically, I have data in Excel that has commas in a couple of individual cells and I need to count each value that is separated by a comma, after a certain requirement is met by subsetting.
Question: How do I go about receiving the sum of 5 when subsetting the data with number == 1 and type == 5, in R?

If we need the total count, then another option is str_count after subsetting
library(stringr)
with(df, sum(str_count(results[number==1 & type==5], "[a-z]"), na.rm = TRUE))
#[1] 5
Or with gregexpr from base R
with(df, sum(lengths(gregexpr("[a-z]", results[number==1 & type==5])), na.rm = TRUE))
#[1] 5
If there are no matching pattern for an element, use
with(df, sum(unlist(lapply(gregexpr("[a-z]",
results[number==1 & type==5]), `>`, 0)), na.rm = TRUE))

Here is an option using dplyr and tidyr. filter function can filter the rows based on conditions. separate_rows can separate the comma. group_by is to group the data. tally can count the numbers.
dt2 <- dt %>%
filter(number == 1, type == 5) %>%
separate_rows(results) %>%
group_by(results) %>%
tally()
# # A tibble: 3 x 2
# results n
# <chr> <int>
# 1 x 2
# 2 y 2
# 3 z 1
Or you can use count(results) only as the following code shows.
dt2 <- dt %>%
filter(number == 1, type == 5) %>%
separate_rows(results) %>%
count(results)
DATA
dt <- read.table(text = "number type results
1 5 'x, y, z'
2 6 a
3 8 x
1 5 'x, y'",
header = TRUE, stringsAsFactors = FALSE)

Here is a method using base R. You split results on the commas and get the length of each list, then add these up grouping by number.
aggregate(sapply(strsplit(df$results, ","), length), list(df$number), sum)
Group.1 x
1 1 5
2 2 1
3 3 1
Your data:
df = read.table(text="number type results
1 5 'x, y, z'
2 6 'a'
3 8 'x'
1 5 'x, y'",
header=TRUE, stringsAsFactors=FALSE)

Proper idiom for adding zero count rows in tidyr/dplyr

Suppose I have some count data that looks like this:
library(tidyr)
library(dplyr)
X.raw <- data.frame(
x = as.factor(c("A", "A", "A", "B", "B", "B")),
y = as.factor(c("i", "ii", "ii", "i", "i", "i")),
z = 1:6
)
X.raw
# x y z
# 1 A i 1
# 2 A ii 2
# 3 A ii 3
# 4 B i 4
# 5 B i 5
# 6 B i 6
I'd like to tidy and summarise like this:
X.tidy <- X.raw %>% group_by(x, y) %>% summarise(count = sum(z))
X.tidy
# Source: local data frame [3 x 3]
# Groups: x
#
# x y count
# 1 A i 1
# 2 A ii 5
# 3 B i 15
I know that for x=="B" and y=="ii" we have observed count of zero, rather than a missing value. i.e. the field worker was actually there, but because there wasn't a positive count no row was entered into the raw data. I can add the zero count explicitly by doing this:
X.fill <- X.tidy %>% spread(y, count, fill = 0) %>% gather(y, count, -x)
X.fill
# Source: local data frame [4 x 3]
#
# x y count
# 1 A i 1
# 2 B i 15
# 3 A ii 5
# 4 B ii 0
But that seems a little bit of a roundabout way of doing things. Is there a cleaner idiom for this?
Just to clarify: My code already does what I need it to do, using spread then gather, so what I'm interested in is finding a more direct route within tidyr and dplyr.

Since dplyr 0.8 you can do it by setting the parameter .drop = FALSE in group_by:
X.tidy <- X.raw %>% group_by(x, y, .drop = FALSE) %>% summarise(count=sum(z))
X.tidy
# # A tibble: 4 x 3
# # Groups: x [2]
# x y count
# <fct> <fct> <int>
# 1 A i 1
# 2 A ii 5
# 3 B i 15
# 4 B ii 0
This will keep groups made of all the levels of factor columns so if you have character columns you might want to convert them (thanks to Pake for the note).

The complete function from tidyr is made for just this situation.
From the docs:
This is a wrapper around expand(), left_join() and replace_na that's
useful for completing missing combinations of data.
You could use it in two ways. First, you could use it on the original dataset before summarizing, "completing" the dataset with all combinations of x and y, and filling z with 0 (you could use the default NA fill and use na.rm = TRUE in sum).
X.raw %>%
complete(x, y, fill = list(z = 0)) %>%
group_by(x,y) %>%
summarise(count = sum(z))
Source: local data frame [4 x 3]
Groups: x [?]
x y count
<fctr> <fctr> <dbl>
1 A i 1
2 A ii 5
3 B i 15
4 B ii 0
You can also use complete on your pre-summarized dataset. Note that complete respects grouping. X.tidy is grouped, so you can either ungroup and complete the dataset by x and y or just list the variable you want completed within each group - in this case, y.
# Complete after ungrouping
X.tidy %>%
ungroup %>%
complete(x, y, fill = list(count = 0))
# Complete within grouping
X.tidy %>%
complete(y, fill = list(count = 0))
The result is the same for each option:
Source: local data frame [4 x 3]
x y count
<fctr> <fctr> <dbl>
1 A i 1
2 A ii 5
3 B i 15
4 B ii 0

You can use tidyr's expand to make all combinations of levels of factors, and then left_join:
X.tidy %>% expand(x, y) %>% left_join(X.tidy)
# Joining by: c("x", "y")
# Source: local data frame [4 x 3]
#
# x y count
# 1 A i 1
# 2 A ii 5
# 3 B i 15
# 4 B ii NA
Then you may keep values as NAs or replace them with 0 or any other value.
That way isn't a complete solution of the problem too, but it's faster and more RAM-friendly than spread & gather.

plyr has the functionality you're looking for, but dplyr doesn't (yet), so you need some extra code to include the zero-count groups, as shown by #momeara. Also see this question. In plyr::ddply you just add .drop=FALSE to keep zero-count groups in the final result. For example:
library(plyr)
X.tidy = ddply(X.raw, .(x,y), summarise, count=sum(z), .drop=FALSE)
X.tidy
x y count
1 A i 1
2 A ii 5
3 B i 15
4 B ii 0

You could explicitly make all possible combinations and then joining it with the tidy summary:
x.fill <- expand.grid(x=unique(x.tidy$x), x=unique(x.tidy$y)) %>%
left_join(x.tidy, by=("x", "y")) %>%
mutate(count = ifelse(is.na(count), 0, count)) # replace null values with 0's

You can also use the data.table package and its Cross Join CJ() function for that.
require(data.table)
X = data.table(X.raw)[
CJ(y = y,
x = x,
unique = TRUE),
on = .(x, y)
][ , .(z = sum(z)), .(x, y) ][ order(x, y) ]
X
# filling the NAs with 0s
setnafill(X, fill = 0, cols = 'z')
X
# x y z
# 1: A i 1
# 2: A ii 5
# 3: B i 15
# 4: B ii 0
Though it's not initially asked for, I'm adding a data.table solution here for the sake of completeness and to also link to the related data.table question.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Create a mapping table of duplicated id / keys - r

Related

R: conditionally mutate a variable when columns match in different dataframes

Use apply functions within %>%

Filter Dataframe with Tidy Evaluation

Multiple values in one cell

Proper idiom for adding zero count rows in tidyr/dplyr

Categories

Resources