I am trying to deidentify data using the duawranglr package in R presented in this example: https://cran.r-project.org/web/packages/duawranglr/vignettes/securing_data.html.
As an example, I created a data frame:
data <- data.frame(
Name = c("Kate", "Jane", "Rod", "Jan", "Martin"),
V1 = c(16, 20, 34, 25, 26),
V2 = c(3, 7, 5, 3, 2)
)
I am trying to create unique, hexadecimal strings without a crosswalk that correspond to the Name column, using the deid_dua function.
data <- deid_dua(data, id_col = "Name", new_id_name = "DID", write_crosswalk = TRUE, id_length = 12)
The error that I keep getting is:
Error in data.frame(old = old_ids, new = new_ids, stringsAsFactors = FALSE) :
arguments imply differing number of rows: 5, 0
At first I thought the issue was with the name column being a factor. However, I receive the same error after converting it to character using the stringsAsFactors = FALSE statement in data.frame. I'm also not sure based on the CRAN example if I need these statements:
admin_file <- system.file('extdata', 'admin_data.csv', package = 'duawranglr')
df <- read_dua_file(admin_file)
df
Do they apply if you're not importing the data? The example doesn't explain very well what they are for.
Here's a much simpler solution:
# create a custom 8-character random alphanumeric identifier string called ID:
library(stringi)
data$ID <- stri_rand_strings(nrow(data), 8)
# remove the name column to create a de-identified dataset
data_deidentified <- data[,-1]
Your data_deidentified dataframe will look something like this:
V1 V2 ID
1 16 3 V2Hziep8
2 20 7 vFeQW1OQ
3 34 5 E5vcWYfm
4 25 3 VLbHzU3H
5 26 2 acCbXiO1
And obviously retain the original data dataframe as your crosswalk. You can make the ID variable longer by changing the '8' value in that call.
Now if you have duplicate names in your data, you will need to do a few extra steps:
# note that I've modified the original dataframe to include two "Martin" values:
data <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin", "Martin"),
V1 = c(16, 20, 34, 25, 26, 28),
V2 = c(3, 7, 5, 3, 2, 5))
# get list of unique names and convert to dataframe
names <- data.frame('Name' = unique(data$Name))
# assign ID string to each unique name
names$ID <- stri_rand_strings(nrow(names), 8)
# now merge back into original df
data <- merge(data, names)
Your result is:
Name V1 V2 ID
1 Jan 25 3 e8da7lO4
2 Jane 20 7 pGeeklL1
3 Kate 16 3 5yYAtO9B
4 Martin 26 2 BwC6jPBh
5 Martin 28 5 BwC6jPBh
6 Rod 34 5 f3xvGbu2
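Since the question asked for hexadecimal strings specifically, here is a minimal base R sketch of the same idea that needs no extra package and keeps the original row order. `make_hex_ids` is a hypothetical helper written for this illustration, not part of stringi or duawranglr, and it relies on R's ordinary random number generator, so treat it as a convenience rather than a security guarantee.

```r
# Hypothetical helper: draw n random hexadecimal strings of length len,
# redrawing the whole batch if any two of them collide.
make_hex_ids <- function(n, len = 12) {
  hex_chars <- c(0:9, letters[1:6])
  repeat {
    ids <- replicate(n, paste(sample(hex_chars, len, replace = TRUE), collapse = ""))
    if (anyDuplicated(ids) == 0) return(ids)
  }
}

data <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin", "Martin"),
                   V1 = c(16, 20, 34, 25, 26, 28),
                   V2 = c(3, 7, 5, 3, 2, 5),
                   stringsAsFactors = FALSE)

# One ID per unique name, stored as a named lookup vector
unique_names <- unique(data$Name)
id_lookup <- setNames(make_hex_ids(length(unique_names)), unique_names)

# Look each row's name up in the vector; repeated names share an ID
data$ID <- unname(id_lookup[data$Name])
```

Unlike merge(), indexing a named vector does not reorder the rows. The IDs differ on every run unless you call set.seed() first.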
I get an error if I don't set a crosswalk first, but this is fairly trivial:
library(duawranglr)
df <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin"),
V1 = c(16, 20, 34, 25, 26),
V2 = c(3, 7, 5, 3, 2))
# You only have a single column to obscure, so you only need a one-cell data frame to set up
set_dua_cw(data.frame(secure = "Name"))
#> -- duawranglr note -------------------------------------------------------------------
#> DUA crosswalk has been set!
# Simultaneously secure the data and write the crosswalk
df <- deid_dua(df,
id_col = "Name",
new_id_name = "ID",
write_crosswalk = TRUE,
id_length = 12,
crosswalk_filename = "cw.csv")
print(df)
#> ID V1 V2
#> 1 950dce035280 16 3
#> 2 6b95d061b59f 20 7
#> 3 00a5d8ab2a4c 34 5
#> 4 ea03e704d806 25 3
#> 5 3eba984ebcba 26 2
And you can see the contents of the crosswalk by reading the csv file's contents
read.csv("cw.csv")
#> Name ID
#> 1 Kate 950dce035280
#> 2 Jane 6b95d061b59f
#> 3 Rod 00a5d8ab2a4c
#> 4 Jan ea03e704d806
#> 5 Martin 3eba984ebcba
And if you want to get the names back in the future, you can do:
cw <- read.csv("cw.csv")
df$Name <- cw$Name[match(df$ID, cw$ID)]
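A small sketch of how match() lines IDs up when the shared data is no longer in the same row order as the crosswalk. match(x, table) returns, for each element of x, its position in table, so the IDs you want to look up go first. The data frames and short IDs here are made up for illustration.

```r
# Crosswalk as written to disk (illustrative values)
cw <- data.frame(Name = c("Kate", "Jane", "Rod"),
                 ID = c("aa1", "bb2", "cc3"),
                 stringsAsFactors = FALSE)

# De-identified data that has since been reordered
shared <- data.frame(ID = c("cc3", "aa1", "bb2"),
                     V1 = c(34, 16, 20),
                     stringsAsFactors = FALSE)

# For each ID in shared, find its row in the crosswalk and pull the name
shared$Name <- cw$Name[match(shared$ID, cw$ID)]
shared
```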
I'm a little late, but as the package author, I'll try to clear up some confusion.
tl;dr
The answer @Allan Cameron gave worked for me, but if all you want to do is hash your IDs, then @mh765's solution is probably the best.
Longer explanation of duawranglr purpose
duawranglr assumes you have a restricted data frame and that you want to do two things so that you can share it:
1. Drop columns which contain restricted data elements (like DOB or other identifying information)
2. Convert unique identifiers into another unique ID that can't be used to back into the original IDs (in case the original IDs are also restricted, like SSNs)
Since you aren't trying to do #1, then it makes sense to have a DUA crosswalk that only has one column with one element: the name of your ID column (per @Allan Cameron).
But let's say you have two potential levels of security and in the second, you can't include V1. Then your DUA crosswalk might look like this:
library(duawranglr)
## your data frame
df <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin"),
V1 = c(16, 20, 34, 25, 26),
V2 = c(3, 7, 5, 3, 2))
## create dua crosswalk
dua_cw <- data.frame(secure_level_i = c("Name",""),
secure_level_ii = c("Name", "V1"))
## show cw (level_i won't allow name; level_ii won't allow name or V1)
dua_cw
secure_level_i secure_level_ii
1 Name Name
2 V1
## set the dua cw
set_dua_cw(dua_cw)
-- duawranglr note -------------------------------------------------------------
DUA crosswalk has been set!
Now you can set the level of security. Let's say you set it at secure_level_i, meaning it's okay to keep V1 in the final data frame you share:
## set DUA level
set_dua_level("secure_level_i", deidentify_required = TRUE, id_column = "Name")
-- duawranglr note -------------------------------------------------------------
Unique IDs in [ Name ] must be deidentified; use -deid_dua()-.
Now you can use deid_dua() as you wanted to hash your IDs, in this case, names.
## deidentify data (don't need to set id_col since we set it in set_dua_level)
df <- deid_dua(df,
new_id_name = "DID",
write_crosswalk = TRUE,
id_length = 12,
crosswalk_filename = "cw.csv")
## show result
df
DID V1 V2
1 d164bb624da2 16 3
2 a8b33e3b0230 20 7
3 a1d287cbdde7 34 5
4 1c00ba576e1a 25 3
5 a870564b3365 26 2
## show crosswalk
read.csv("cw.csv")
Name DID
1 Kate d164bb624da2
2 Jane a8b33e3b0230
3 Rod a1d287cbdde7
4 Jan 1c00ba576e1a
5 Martin a870564b3365
## check restrictions to see if you can save data
check_dua_restrictions(df)
-- duawranglr note -------------------------------------------------------------
Data set has passed check and may be saved.
If, however, you set_dua_level() to "secure_level_ii", then you won't pass the last check since you'll still have V1 in your data.
## set new more secure level
set_dua_level("secure_level_ii", deidentify_required = TRUE, id_column = "Name")
-- duawranglr note -------------------------------------------------------------
Unique IDs in [ Name ] must be deidentified; use -deid_dua()-.
## check again
check_dua_restrictions(df)
-- duawranglr note -------------------------------------------------------------
The following variables are not allowed at the current data usage level
restriction [ secure_level_ii ] and MUST BE REMOVED before saving:
- V1
To pass under the new level, you'll need to drop V1 from your data frame.
## drop
df$V1 <- NULL
## check again
check_dua_restrictions(df)
-- duawranglr note -------------------------------------------------------------
Data set has passed check and may be saved.
As a final note, your id_col must contain unique IDs. The names work in the toy example because they are unique, but as others have noted, repeated names for different observations won't work with duawranglr.
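A quick base R sketch for checking that requirement before calling deid_dua(). It uses only duplicated() and is independent of duawranglr; the data frame is a made-up example with one repeated name.

```r
df <- data.frame(Name = c("Kate", "Jane", "Martin", "Martin"),
                 V1 = c(16, 20, 26, 28),
                 stringsAsFactors = FALSE)

# Values that appear more than once in the would-be ID column
dup_names <- unique(df$Name[duplicated(df$Name)])

if (length(dup_names) > 0) {
  message("Repeated IDs found: ", paste(dup_names, collapse = ", "))
}
```

If anything is reported, the column is not a unique identifier as it stands.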
Related
I have a dataframe with a column that's really a list of integer vectors (not just single integers).
# make example dataframe
starting_dataframe <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
starting_dataframe$player_indices <-
list(as.integer(1),
as.integer(c(2, 5)),
as.integer(3),
as.integer(4),
as.integer(c(6, 7)))
I want to replace the integers with character strings according to a second concordance dataframe.
# make concordance dataframe
example_concord <-
data.frame(last_names = c("Rapinoe",
"Wambach",
"Naeher",
"Morgan",
"Dahlkemper",
"Mitts",
"O'Reilly"),
player_ids = as.integer(c(1,2,3,4,5,6,7)))
The desired result would look like this:
# make dataframe of desired result
desired_result <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
desired_result$player_indices <-
list(c("Rapinoe"),
c("Wambach", "Dahlkemper"),
c("Naeher"),
c("Morgan"),
c("Mitts", "O'Reilly"))
I can't for the life of me figure out how to do it and failed to find a similar case here on stackoverflow. How do I do it? I wouldn't mind a dplyr-specific solution in particular.
I suggest creating a "lookup dictionary" of sorts, and lapply across each of the ids:
example_concord_idx <- setNames(as.character(example_concord$last_names),
example_concord$player_ids)
example_concord_idx
# 1 2 3 4 5 6
# "Rapinoe" "Wambach" "Naeher" "Morgan" "Dahlkemper" "Mitts"
# 7
# "O'Reilly"
starting_dataframe$result <-
lapply(starting_dataframe$player_indices,
function(a) example_concord_idx[a])
starting_dataframe
# first_names player_indices result
# 1 Megan 1 Rapinoe
# 2 Abby 2, 5 Wambach, Dahlkemper
# 3 Alyssa 3 Naeher
# 4 Alex 4 Morgan
# 5 Heather 6, 7 Mitts, O'Reilly
(Code golf?)
Map(`[`, list(example_concord_idx), starting_dataframe$player_indices)
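One caveat worth sketching: subsetting a named vector with integers indexes by position, not by name. The lookup above works because player_ids happen to be 1 to 7 in order. If the IDs were arbitrary, converting them to character first would look them up by name. The IDs 10, 20, 30 below are made up for illustration.

```r
concord <- data.frame(last_names = c("Rapinoe", "Wambach", "Naeher"),
                      player_ids = c(10L, 20L, 30L),
                      stringsAsFactors = FALSE)
idx <- setNames(concord$last_names, concord$player_ids)

players <- list(10L, c(20L, 30L))

# Positional indexing: element 10 does not exist, so this is NA
idx[10L]

# Name-based indexing: convert the integer IDs to character first
by_name <- lapply(players, function(a) unname(idx[as.character(a)]))
by_name
```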
For tidyverse enthusiasts, I adapted the second half of the accepted answer by r2evans to use map() and %>%:
require(tidyverse)
starting_dataframe <-
starting_dataframe %>%
mutate(
result = map(.x = player_indices, .f = function(a) example_concord_idx[a])
)
Definitely won't win code golf, though!
Another way is to unlist the list-column, and relist it after modifying its contents:
df1$player_indices <- relist(df2$last_names[unlist(df1$player_indices)], df1$player_indices)
df1
#> first_names player_indices
#> 1 Megan Rapinoe
#> 2 Abby Wambach, Dahlkemper
#> 3 Alyssa Naeher
#> 4 Alex Morgan
#> 5 Heather Mitts, O'Reilly
Data
## initial data.frame w/ list-column
df1 <- data.frame(first_names = c("Megan", "Abby", "Alyssa", "Alex", "Heather"), stringsAsFactors = FALSE)
df1$player_indices <- list(1, c(2,5), 3, 4, c(6,7))
## lookup data.frame
df2 <- data.frame(last_names = c("Rapinoe", "Wambach", "Naeher", "Morgan", "Dahlkemper",
"Mitts", "O'Reilly"), stringsAsFactors = FALSE)
NB: I set stringsAsFactors = FALSE to create character columns in the data.frames, but it works just as well with factor columns instead.
I would like to add a new column to a data.frame that converts from the numeric value in the first column to the corresponding string (if any) from a subsequent matching column i.e. the column name partially matches this value in the first column.
In this example, I wish to add a value for 'Highest_Earner', which depends on the value in the Earner_Number column:
> df1 <- data.frame("Earner_Number" = c(1, 2, 1, 5),
"Earner5" = c("Max", "Alex", "Ben", "Mark"),
"Earner1" = c("John", "Dora", "Micelle", "Josh"))
> df1
Earner_Number Earner5 Earner1
1 1 Max John
2 2 Alex Dora
3 1 Ben Micelle
4 5 Mark Josh
The result should be:
> df1
Earner_Number Earner5 Earner1 Highest_Earner
1 1 Max John John
2 2 Alex Dora Neither
3 1 Ben Micelle Micelle
4 5 Mark Josh Mark
I have tried cutting the data.frame into various smaller pieces, but was wondering if someone had a somewhat cleaner method?
# Convert the name columns to character so the nested ifelse() returns names rather than factor codes.
df1$Earner5 <- as.character(df1$Earner5)
df1$Earner1 <- as.character(df1$Earner1)
# Use a nested ifelse() to build the new column.
df1$Highest_Earner <- ifelse(df1$Earner_Number == 5, df1$Earner5,
                             ifelse(df1$Earner_Number == 1, df1$Earner1, "Neither"))
dplyr approach
library(tidyverse)
df <- tibble("Earner_Number" = c(1,2,1,5), "Earner5" = c('Max', 'Alex','Ben','Mark'), "Earner1" = c("John","Dora","Micelle",'Josh'))
df %>%
mutate(Highest_Earner = case_when(Earner_Number == 1 ~ Earner1,
Earner_Number == 5 ~ Earner5,
TRUE ~ 'Neither'))
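Both answers hard-code the values 1 and 5. If there are many Earner columns, a base R sketch that builds the column name from Earner_Number and picks the matching cell may scale better. It assumes every candidate column is named "Earner" followed by the number.

```r
df1 <- data.frame(Earner_Number = c(1, 2, 1, 5),
                  Earner5 = c("Max", "Alex", "Ben", "Mark"),
                  Earner1 = c("John", "Dora", "Micelle", "Josh"),
                  stringsAsFactors = FALSE)

# Column each row should read from, e.g. "Earner1"; NA position if absent
target_cols <- paste0("Earner", df1$Earner_Number)
col_pos <- match(target_cols, names(df1))

# Matrix indexing with (row, column) pairs; rows with an NA column give NA
picked <- as.matrix(df1)[cbind(seq_len(nrow(df1)), col_pos)]

df1$Highest_Earner <- ifelse(is.na(col_pos), "Neither", picked)
df1
```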
I have two datasets that I want to merge. One of the columns that I want to use as a key to merge has the values in a list. If any of those values appear in the second dataset’s column, I want the value in the other column to be merged into the first dataset – which might mean there are multiple values, which should be presented as a list.
That is quite hard to explain but hopefully this example data makes it clearer.
Example data
library(data.table)
mother_dt <- data.table(mother = c("Penny", "Penny", "Anya", "Sam", "Sam", "Sam"),
child = c("Violet", "Prudence", "Erika", "Jake", "Wolf", "Red"))
mother_dt [, children := .(list(unique(child))), by = mother]
mother_dt [, child := NULL]
mother_dt <- unique(mother_dt , by = "mother")
child_dt <- data.table(child = c("Violet", "Prudence", "Erika", "Jake", "Wolf", "Red"),
age = c(10, 8, 9, 6, 5, 2))
So for example, the first row in my new dataset would have “Penny” in the mother column, a list containing “Violet” and “Prudence” in the children column, and a list containing 10 and 8 in the age column.
I've tried the following:
combined_dt <- mother_dt[, child_age := ifelse(child_dt$child %in% children,
                                               .(list(unique(child_dt$age))), NA)]
But that just contains a list of all the ages in the final row.
I appreciate this is probably quite unusual behaviour but is there a way to achieve it?
Edit: The final datatable would look like this:
final_dt <- data.table(mother = c("Penny", "Anya", "Sam"),
children = c(list(c("Violet", "Prudence")), list(c("Erika")), list(c("Jake", "Wolf", "Red"))),
age = c(list(c(10, 8)), list(c(9)), list(c(6, 5, 2))))
The easiest way I can think of is, first unlist the children, then merge, then list again:
mother1 <- mother_dt[,.(children=unlist(children)),by=mother]
mother1[child_dt,on=c(children='child')][,.(children=list(children),age=list(age)),by=mother]
You can do something like this-
library(splitstackshape)
newm <- mother_dt[,.(children=unlist(children)),by=mother]
final_dt <- merge(newm,child_dt,by.x = "children",by.y = "child")
> aggregate(. ~ mother, data = final_dt, toString)
mother children age
1 Anya Erika 9
2 Penny Prudence, Violet 8, 10
3 Sam Jake, Red, Wolf 6, 2, 5
You could do it the following way, which has the advantage of preserving duplicates in mother column when they exist.
mother_dt$age <- lapply(
mother_dt$children,
function(x,y) y[x],
y = setNames(child_dt$age, child_dt$child))
mother_dt
# mother children age
# 1: Penny Violet,Prudence 10, 8
# 2: Anya Erika 9
# 3: Sam Jake,Wolf,Red 6,5,2
It translates nicely into tidyverse syntax:
library(tidyverse)
mutate(mother_dt, age = map(children,~.y[.], deframe(child_dt)))
# mother children age
# 1 Penny Violet, Prudence 10, 8
# 2 Anya Erika 9
# 3 Sam Jake, Wolf, Red 6, 5, 2
I'm fairly new to R and I'm running into the following problem.
Let's say I have the following data frames:
sale_df <- data.frame("Cheese" = c("cheese-01", "cheese-02", "cheese-03"), "Number_of_sales" = c(4, 8, 23))
id_df <- data.frame("ID" = c(1, 2, 3), "Name" = c("Leerdammer", "Gouda", "Mozerella"))
What I want to do is match the numbers of the first column of id_df to the numbers in the string of the first column of sale_df.
Then I want to replace the value in sale_df by the value in the second column of id_df, i.e. I want cheese-01 to become "Leerdammer".
Does anyone have any idea how I could solve this?
With tidyverse :
library(tidyverse)
sale_df %>%
  mutate(ID = as.numeric(str_extract(Cheese, "(?<=cheese-).*"))) %>%
  inner_join(id_df, by = "ID")
# Cheese Number_of_sales ID Name
#1 cheese-01 4 1 Leerdammer
#2 cheese-02 8 2 Gouda
#3 cheese-03 23 3 Mozerella
Assuming that all entries for Cheese in sale_df will start with cheese-, here is a simple solution.
sale_df$CheeseID <- as.numeric(substring(sale_df$Cheese, 8))
merge(sale_df, id_df, by.x = "CheeseID", by.y = "ID", all.x = TRUE)
sale_df$Number_of_sales <- id_df$Name[match(as.numeric(gsub("\\D", "", sale_df$Cheese)), id_df$ID)]
> sale_df
Cheese Number_of_sales
1 cheese-01 Leerdammer
2 cheese-02 Gouda
3 cheese-03 Mozerella
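The question asked for the Cheese column itself to be replaced, with the sales figures kept. A base R sketch of that variant, using the same digit-stripping idea and assuming every Cheese value contains exactly one numeric ID:

```r
sale_df <- data.frame(Cheese = c("cheese-01", "cheese-02", "cheese-03"),
                      Number_of_sales = c(4, 8, 23),
                      stringsAsFactors = FALSE)
id_df <- data.frame(ID = c(1, 2, 3),
                    Name = c("Leerdammer", "Gouda", "Mozerella"),
                    stringsAsFactors = FALSE)

# Strip everything that is not a digit, then convert "01" to 1
cheese_id <- as.numeric(gsub("\\D", "", sale_df$Cheese))

# For each sale, find the row of id_df holding that ID and take its name
sale_df$Cheese <- id_df$Name[match(cheese_id, id_df$ID)]
sale_df
```

Any ID missing from id_df would come back as NA rather than raising an error.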
I have a dataset like this.
df = data.frame( name= c("Tommy", "John", "Dan"), age = c(20, NA, NA) )
I tried to set 15 y.o. to John and Dan.
df[ ( df$age != 20) , ]$age = 15
But I got an error as follows,
Error in `[<-.data.frame`(`*tmp*`, (df$age != 20), , value = list(name = c(NA_integer_, : missing values are not allowed in subscripted assignments of data frames
What is a nice way to set new values to these missing cells?
If you want to modify all cells that are not 20, including other valid values for age, I would do the following:
# Creating a data frame with another valid age
df = data.frame( name= c("Tommy", "John", "Dan","Bob"), age = c(20, NA, NA,12) )
# Substitute values different than 20 for 15
df[df$age!=20 | is.na(df$age),"age"] <- 15
name age
1 Tommy 20
2 John 15
3 Dan 15
4 Bob 15
We can use is.na
library(data.table)
setDT(df)[is.na(age), age:= 15]
Try this:
df$age[is.na(df$age)] <- 15
or using your style of syntax:
df[is.na(df$age), ]$age = 15
The error you get arises because df$age != 20 produces the following:
[1] FALSE NA NA
Comparing NA with any value returns NA rather than TRUE or FALSE, so the missing ages are never evaluated as "not equal to 20", and a data frame assignment does not accept NA in its row index.
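A short sketch of that behaviour on a plain vector, and of how combining the comparison with is.na() gives a usable index. The extra value 12 is added only to show that non-missing values other than 20 are also caught.

```r
age <- c(20, NA, NA, 12)

# Comparison propagates NA
age != 20
#> [1] FALSE    NA    NA  TRUE

# TRUE | NA is TRUE, so the combined condition has no NA left
to_replace <- is.na(age) | age != 20
to_replace
#> [1] FALSE  TRUE  TRUE  TRUE

age[to_replace] <- 15
age
#> [1] 20 15 15 15
```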