How to anonymize data without losing duplicates

I need to anonymize data containing client numbers. About half of them are duplicate values, as these clients appear more than once.
How can I anonymize in R so that duplicates are transformed into the same value?
Thanks in advance!

Suppose your data looks like this:
df <- data.frame(id = c("A", "B", "C", "A", "B", "C"), value = rnorm(6),
stringsAsFactors = FALSE)
df
#> id value
#> 1 A -0.8238857
#> 2 B -0.1553338
#> 3 C -0.6297834
#> 4 A -0.4616377
#> 5 B 0.1643057
#> 6 C -0.6719061
And your list of new ID strings (which can be created randomly - see footnote) looks like this:
newIds <- c("newId1", "newId2", "newId3")
Then you should first ensure that your id column is a factor:
df$id <- as.factor(df$id)
Then you should probably store the original client IDs for safe lookup later:
lookup <- data.frame(key = newIds, value = levels(df$id))
lookup
#> key value
#> 1 newId1 A
#> 2 newId2 B
#> 3 newId3 C
Now all you need to do is overwrite the factor levels:
levels(df$id) <- newIds
df
#> id value
#> 1 newId1 -0.8238857
#> 2 newId2 -0.1553338
#> 3 newId3 -0.6297834
#> 4 newId1 -0.4616377
#> 5 newId2 0.1643057
#> 6 newId3 -0.6719061
Note: If you want to create random strings for the ids you can do this:
sapply(seq_along(levels(df$id)), function(x) paste0(sample(LETTERS, 5), collapse = ""))
#> [1] "TWABF" "YSBUF" "WVQEY"
Created on 2020-03-02 by the reprex package (v0.3.0)
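As a compact variant of the steps above, the following sketch maps each distinct client ID to one random string, so duplicates always share the same anonymized value. The helper name anonymize_ids and the default ID length are assumptions made for illustration. Unlike the one-liner in the note, it redraws if two random strings happen to collide.

```r
# Hypothetical helper: map every distinct ID to one random string.
anonymize_ids <- function(ids, n_chars = 5) {
  original <- unique(ids)
  repeat {
    new_ids <- vapply(
      seq_along(original),
      function(i) paste0(sample(LETTERS, n_chars, replace = TRUE), collapse = ""),
      character(1)
    )
    # Redraw in the unlikely event that two random strings collide.
    if (!anyDuplicated(new_ids)) break
  }
  # Keep the mapping so the original IDs can be looked up later.
  lookup <- data.frame(key = new_ids, value = original, stringsAsFactors = FALSE)
  list(ids = new_ids[match(ids, original)], lookup = lookup)
}

set.seed(1)
df <- data.frame(id = c("A", "B", "C", "A", "B", "C"), value = rnorm(6),
                 stringsAsFactors = FALSE)
res <- anonymize_ids(df$id)
df$id <- res$ids
```

The lookup table in res$lookup should be stored separately from the anonymized data, since it reverses the anonymization.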

Related

Split character vector values by fixed delimiter - return data frame with one row for each vector value

The delimiter is present in all vector values, and only once, thus each vector value should result in exactly one pair and the result should be a two column data frame.
I am not unhappy about my solution, but wondered if there might be cool functions around that make this easier. Open for any package, but base R preferred.
test <- rep("a,b", 5)
# expected result
data.frame(t(do.call(cbind, strsplit(test, ","))))
#> X1 X2
#> 1 a b
#> 2 a b
#> 3 a b
#> 4 a b
#> 5 a b

You can use tidyr::separate():
library(tidyr)
test <- data.frame(x = rep("a,b", 5))
separate(test, x, into = c("X1", "X2"), sep = ",")
#> X1 X2
#> 1 a b
#> 2 a b
#> 3 a b
#> 4 a b
#> 5 a b

You can use tidyr::extract(); note that the regex below captures exactly one character on each side of the comma:
library(tidyr)
data.frame(test) %>%
extract(col = test,
into = c("X1", "X2"),
regex = "(.),(.)")
X1 X2
1 a b
2 a b
3 a b
4 a b
5 a b
Data:
test <- rep("a,b", 5)
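Since base R was preferred, here is a minimal base R sketch that treats each string as one comma-separated record and lets read.table() do the splitting. The column names X1 and X2 are taken from the expected result in the question; colClasses = "character" is an added assumption so that values such as "T" or "1" are not converted to other types.

```r
test <- rep("a,b", 5)

# Parse each vector element as one comma-separated line.
res <- read.table(text = test, sep = ",",
                  col.names = c("X1", "X2"),
                  colClasses = "character")
res
```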

Function to remove data frames in a list of data frames based on name

I have a list of lists of data frames named Data_list, and I want to remove all the data frames whose full name contains "TAB" or "tab". How can I do this with lapply, an anonymous function and grepl?
New_Data_list <- lapply(Data_list, function(x)x[names(x) != "Tab"])

list with named data frames and list of named data frames
This uses purrr::lmap and purrr::map to remove data frames by name when a list of data frames is itself nested inside the list, as suggested in a comment on the earlier answer:
data
df <- data.frame(a = 1:2,
b = c("a", "b"))
df_ls2 <- list(df2 = df,
df2tab = df,
df2TABdf = df,
TABdf2 = df,
df2_5 = df)
df_ls <- list(df1 = df,
df1tab = df,
df1TABdf = df,
TABdf1 = df,
df1_3 = df,
df1_4 = df_ls2)
# df_ls
code
library(stringr)
library(purrr)
lmap(df_ls, function(x) x[!str_detect(names(x), "tab|TAB")]) %>%
map(function(x) x[!str_detect(names(x), "tab|TAB")])
#> $df1
#> a b
#> 1 1 a
#> 2 2 b
#>
#> $df1_3
#> a b
#> 1 1 a
#> 2 2 b
#>
#> $df1_4
#> $df1_4$df2
#> a b
#> 1 1 a
#> 2 2 b
#>
#> $df1_4$df2_5
#> a b
#> 1 1 a
#> 2 2 b
comment
lmap removes named data frames in the list
map removes named data frames from the child list of data frames
I've not checked whether this works for deeper nesting of lists of data frames.
List of named data frames
First answer for removing data frames conditionally where the list contains only named data frames:
data
df <- data.frame(a = 1:2,
b = c("a", "b"))
df_ls <- list(df1 = df,
dftab = df,
dfTABdf = df,
TABdf = df,
df3 = df)
# df_ls
code
library(stringr)
df_ls[!str_detect(names(df_ls), "tab|TAB")]
#> $df1
#> a b
#> 1 1 a
#> 2 2 b
#>
#> $df3
#> a b
#> 1 1 a
#> 2 2 b
Created on 2020-07-08 by the reprex package (v0.3.0)

You can do this in base R with grep():
Data_list[grep("tab", names(Data_list), ignore.case = TRUE, invert = TRUE)]
or with grepl() (also base R):
Data_list[!grepl("tab", names(Data_list), ignore.case = TRUE)]
Example data:
Data_list <- list(
df_tab = iris[1:2, 1:2],
df_TAB = iris[1:2, 3:5],
df_good = iris[1:2, 4:5]
)
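For lists nested to an arbitrary depth, the same grepl() filter can be applied recursively. This is only a sketch: the helper name drop_tab and the example list are hypothetical, and it assumes that unnamed lists should be kept as they are.

```r
# Hypothetical helper: drop elements whose name contains "tab"
# (case-insensitive), descending into nested plain lists.
drop_tab <- function(x) {
  nms <- names(x)
  if (!is.null(nms)) {
    x <- x[!grepl("tab", nms, ignore.case = TRUE)]
  }
  lapply(x, function(el) {
    # Data frames are lists too, so exclude them from the recursion.
    if (is.list(el) && !is.data.frame(el)) drop_tab(el) else el
  })
}

df <- data.frame(a = 1:2, b = c("a", "b"))
nested <- list(df1 = df, df1tab = df,
               inner = list(df2 = df, TABdf2 = df,
                            deeper = list(df3 = df, df3_tab = df)))
cleaned <- drop_tab(nested)
str(cleaned, max.level = 3)
```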

Separate data frame into ID columns (with only one unique value), and variable columns (with more than one value)

I have a data frame with several ID cols containing only one unique value and columns that actually contain variables. How to separate those?
I have come up with the following approach using a conditional statement in sapply, but I wondered if there may be a more elegant way to do that?
I am happy with any package, and any output where the data frames are separated, this can also be in a list. Each frame does not need to be assigned to a new object.
mydf <- data.frame(a = 'a', b = 'b', val1 = 1:10, val2 = 10:1)
head(mydf,3)
#> a b val1 val2
#> 1 a b 1 10
#> 2 a b 2 9
#> 3 a b 3 8
id_cols <- mydf[sapply(names(mydf), function(x) {length(unique(mydf[[x]])) == 1})]
variable_cols <- mydf[sapply(names(mydf), function(x) {length(unique(mydf[[x]])) != 1})]
head(id_cols, 3)
#> a b
#> 1 a b
#> 2 a b
#> 3 a b
head(variable_cols, 3)
#> val1 val2
#> 1 1 10
#> 2 2 9
#> 3 3 8
Created on 2020-04-02 by the reprex package (v0.3.0)

A very slightly shorter way would be:
# Var is TRUE for columns with more than one unique value
Var = lengths(lapply(mydf, unique)) > 1
id_cols = mydf[, !Var]
variable_cols = mydf[, Var]
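Because the question allows the result to be a list, another option is to split the columns in one step. This sketch relies on split.default() treating a data frame as a list of columns; the group labels "id_cols" and "variable_cols" reuse the names from the question.

```r
mydf <- data.frame(a = "a", b = "b", val1 = 1:10, val2 = 10:1)

# TRUE for columns with more than one distinct value.
is_variable <- lengths(lapply(mydf, unique)) > 1

# Split the columns (not the rows) into two data frames by that flag.
parts <- split.default(mydf, ifelse(is_variable, "variable_cols", "id_cols"))

head(parts$id_cols, 3)
head(parts$variable_cols, 3)
```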

For each observation, find a corresponding centile on a subset determined by factor

Assume I have a data frame like so:
df<-data.frame(f=rep(c("a", "b", "c", "d"), 100), value=rnorm(400))
I want to create a new column, which will contain a centile that an observation belongs to, calculated separately on each factor level.
What would be a reasonably simple and efficient way to do that? The closest I came to a solution was
df$newColumn<-findInterval(df$value, tapply(df$value, df$f, quantile, probs=seq(0, 0.99, 0.01))$df[, "f"])
However, this just gives zeros to all observations. The tapply returns a four-element list of quantile vectors and I'm not sure how to access a relevant element for each observation to pass as an argument for the findInterval function.
The number of rows in the data frame could reach a few millions, so speed is an issue too. The factor column will always have four levels.

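With dplyr (note that quantile() defaults to quartiles, which is what the output below shows; pass probs = seq(0, 0.99, 0.01), as in the question, to get centiles):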
library(dplyr)
df %>%
group_by(f) %>%
mutate(quant = findInterval(value, quantile(value)))
#> Source: local data frame [400 x 3]
#> Groups: f [4]
#>
#> f value quant
#> <fctr> <dbl> <int>
#> 1 a 0.51184061 3
#> 2 b 0.44362348 3
#> 3 c -1.04869448 1
#> 4 d -2.41772425 1
#> 5 a 0.10738332 3
#> 6 b -0.58630348 1
#> 7 c 0.34376820 3
#> 8 d 0.68322738 4
#> 9 a 1.00232314 4
#> 10 b 0.05499391 3
#> # ... with 390 more rows
With data.table (quantile() defaults to quartiles; add probs for centiles):
library(data.table)
dt <- setDT(df)
dt[, quant := findInterval(value, quantile(value)), by = f]
dt
#> f value quant
#> 1: a 0.3608395 3
#> 2: b -0.1028948 2
#> 3: c -2.1903336 1
#> 4: d 0.7470262 4
#> 5: a 0.5292031 3
#> ---
#> 396: d -1.3475332 1
#> 397: a 0.1598605 3
#> 398: b -0.4261003 2
#> 399: c 0.3951650 3
#> 400: d -1.4409000 1
Data:
df <- data.frame(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))

I think that data.table is faster; however, a solution without using packages is as follows.
Define a function based on cut (or findInterval) together with quantile:
cut2 <- function(x) {
  # labels = FALSE returns the integer bin index (1 to 100) instead of a factor
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.01)), include.lowest = TRUE, labels = FALSE)
}
Then apply it within each factor level using ave:
df$newColumn <- ave(df$value, df$f, FUN = cut2)
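The findInterval attempt from the question can also be made to work per group with ave, which passes each group's values to the function separately. This is a minimal sketch: the helper name centile and the seed are assumptions, and the breaks use probs = seq(0, 0.99, 0.01) as in the question, so the result runs from 1 to 100.

```r
set.seed(42)
df <- data.frame(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))

# Hypothetical helper: centile (1 to 100) of each value within one group.
centile <- function(x) {
  breaks <- quantile(x, probs = seq(0, 0.99, 0.01))
  findInterval(x, breaks)
}

# ave() applies centile() separately to the values of each level of f.
df$centile <- ave(df$value, df$f, FUN = centile)
head(df)
```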

left_join two data frames and overwrite

I'd like to merge two data frames so that values from df2 overwrite the corresponding values in df1, whether those are NA or already present. The question "Merge data frames and overwrite values" provides a data.table option, but I'd like to know if there is a way to do this with dplyr. I've tried all of the _join options, but none seems to do this.
Here is an example:
df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1,2,NA, 4))
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))
Desired output:
y x1
1 A 5
2 B 6
3 C 7
4 D 4

I think what you want is to keep the values of df2 and only add the rows of df1 that are not present in df2, which is what anti_join does:
"anti_join return all rows from x where there are not matching values in y, keeping just columns from x."
My solution:
library(dplyr)
df3 <- anti_join(df1, df2, by = "y") %>% bind_rows(df2)
Warning messages:
1: In anti_join_impl(x, y, by$x, by$y) :
joining factors with different levels, coercing to character vector
2: In rbind_all(x, .id) : Unequal factor levels: coercing to character
> df3
Source: local data frame [4 x 2]
y x1
(chr) (dbl)
1 D 4
2 A 5
3 B 6
4 C 7
This line gives the desired output (in a different order), but you should pay attention to the warning messages: when working with your own dataset, be sure to read y as a character variable (for example with stringsAsFactors = FALSE).

This is the idiom I now use because, in addition, it keeps columns that are not part of the update table. I use different names from those in the OP, but the flavor is similar.
The one extra thing I do is create a variable for the keys used in the join, as I use it in a few spots. Otherwise, it does what is desired.
By itself it doesn't handle conditions such as "update this row only if a value is NA"; you should apply that condition when creating the join table.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
.keys <- c("key1", "key2")
.base_table <- tribble(
~key1, ~key2, ~val1, ~val2,
"A", "a", 0, 0,
"A", "b", 0, 1,
"B", "a", 1, 0,
"B", "b", 1, 1)
.join_table <- tribble(
~key1, ~key2, ~val2,
"A", "b", 100,
"B", "a", 111)
# This works
df_result <- .base_table %>%
# Pull off rows from base table that match the join table
semi_join(.join_table, .keys) %>%
# Drop cols from base table that are in join table, except for the key columns
select(-matches(setdiff(names(.join_table), .keys))) %>%
# Left join on the join table columns
left_join(.join_table, .keys) %>%
# Remove the matching rows from the base table, and bind on the newly joined result from above.
bind_rows(.base_table %>% anti_join(.join_table, .keys))
df_result %>%
print()
#> # A tibble: 4 x 4
#> key1 key2 val1 val2
#> <chr> <chr> <dbl> <dbl>
#> 1 A b 0 100
#> 2 B a 1 111
#> 3 A a 0 0
#> 4 B b 1 1
Created on 2019-12-12 by the reprex package (v0.3.0)
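For comparison, the same overwrite can be sketched in base R (not dplyr) with match(), which also keeps the original row order of df1. This assumes the key column y is unique in both data frames and is read as character; the name result is chosen for illustration.

```r
df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1, 2, NA, 4),
                  stringsAsFactors = FALSE)
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7),
                  stringsAsFactors = FALSE)

# Position in df1 of every key in df2 (NA when a key is missing from df1).
idx <- match(df2$y, df1$y)
found <- !is.na(idx)

# Overwrite only the matched rows; unmatched rows of df1 stay untouched.
result <- df1
result$x1[idx[found]] <- df2$x1[found]
result
```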
