Data Masking in Dataframe [duplicate] - r

This question already has answers here:
How to create example data set from private data (replacing variable names and levels with uninformative place holders)?
(3 answers)
Closed 3 years ago.
I have a dataframe with 8 unique values
data<-data.frame(id=c("ab","cc","cc","dd","ee","ff","ee","ff","ab","dd","gg",1,"air"))
>data
id
1 ab
2 cc
3 cc
4 dd
5 ee
6 ff
7 ee
8 ff
9 ab
10 dd
11 gg
12 1
13 air
I create another dataframe holding 8 unique values that are to be used as replacements
library(random)
replacements<-data.frame(value=randomStrings(n=8, len=2, digits=FALSE,loweralpha=TRUE, unique=TRUE, check=TRUE))
replacements
V1
1 SJ
2 fH
3 TZ
4 Mr
5 oZ
6 kZ
7 fe
8 ql
I want to replace all unique values from data dataframe with values in replacement dataframe in below way
All ab values replaced by SJ
All cc values replaced by fH
All dd values replaced by TZ
All ee values replaced by Mr
All ff values replaced by oZ
All gg values replaced by kZ
All 1 values replaced by fe
All air values replaced by ql
Currently, I am achieving this by:
data<-data.frame(id=c("ab","cc","cc","dd","ee","ff","ee","ff","ab","dd","gg",1,"air"))
data$id<-as.character(data$id)
replacements<-data.frame(value=randomStrings(n=8, len=2, digits=FALSE,loweralpha=TRUE, unique=TRUE, check=TRUE))
replacements$V1<-as.character(replacements$V1)
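original <- data$id  # compare against the untouched ids so earlier replacements are never overwritten
ids <- unique(original)
for(i in seq_along(ids)){
  data$id[original == ids[i]] <- replacements$V1[i]
}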
>data
id
1 SJ
2 fH
3 fH
4 TZ
5 Mr
6 oZ
7 Mr
8 oZ
9 SJ
10 TZ
11 kZ
12 fe
13 ql
Is there any base function in R to achieve this? Is there a better approach than this for masking data?

I would suggest using merge(), but to do that you would first need to add a column of unique data$id to replacements, as both data.frames need to have a column in common.
Here's data:
> data
id
1 ab
2 cc
3 cc
4 dd
5 ee
6 ff
7 ee
8 ff
9 ab
10 dd
11 gg
12 1
13 air
Here's replacements:
> replacements
V1
1 VS
2 Of
3 bH
4 iJ
5 jm
6 kH
7 cm
8 rQ
So add unique data$id to replacements:
replacements$id <- unique(data$id)
Giving:
V1 id
1 VS ab
2 Of cc
3 bH dd
4 iJ ee
5 jm ff
6 kH gg
7 cm 1
8 rQ air
Then merge data with replacements using id (note that merge() does not keep the original row order of data, even with sort = FALSE):
data <- merge(data, replacements, by = "id", all.x = TRUE, sort = FALSE)
Giving:
id V1
1 ab VS
2 ab VS
3 cc Of
4 cc Of
5 dd bH
6 dd bH
7 ee iJ
8 ee iJ
9 ff jm
10 ff jm
11 gg kH
12 1 cm
13 air rQ
If you really wanted to keep only the new id column, you could drop the original id and rename the new column:
data <- data[, 2, drop = FALSE]
colnames(data) <- "id"
Giving:
id
1 VS
2 VS
3 Of
4 Of
5 bH
6 bH
7 iJ
8 iJ
9 jm
10 jm
11 kH
12 cm
13 rQ
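If the original row order matters, a named lookup vector is one alternative to merge(). The following is a minimal sketch, not part of the answer above: the `codes` vector is a hypothetical fixed stand-in for the random strings, and `lookup` and `masked` are illustrative names.

```r
# Example ids from the question
data <- data.frame(
  id = c("ab", "cc", "cc", "dd", "ee", "ff", "ee", "ff", "ab", "dd", "gg", 1, "air"),
  stringsAsFactors = FALSE
)

# Hypothetical fixed replacement codes (stand-ins for the random strings)
codes <- c("SJ", "fH", "TZ", "Mr", "oZ", "kZ", "fe", "ql")

# Lookup table: one code per unique id, in order of first appearance
lookup <- setNames(codes, unique(data$id))

# Index the lookup by name; the rows stay in their original order
masked <- data.frame(id = unname(lookup[data$id]), stringsAsFactors = FALSE)
masked$id
```

Because the lookup is indexed by the original ids, no sorting or re-joining is involved, and the result lines up row for row with the input.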

Masking data using the CRC32 hash algorithm:
library(data.table)
library(digest)
data<-data.frame(id=c("ab","cc","cc","dd","ee","ff","ee","ff","ab","dd","gg",1,"air"))
setDT(data)
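# Hash each unique value once, then look the hashes up by name
anonymize <- function(x, algo="crc32"){
  x <- as.character(x)  # index by name, not by factor code
  unq_hashes <- vapply(unique(x), function(object) digest(object, algo=algo), FUN.VALUE="", USE.NAMES=TRUE)
  unname(unq_hashes[x])
}
cols_to_mask <- c("id")
# Wrap the column vector in parentheses on the left of := (the older with=FALSE form is deprecated)
data[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols=cols_to_mask]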
Reference: Data anonymization in R

Related

How to do iterations in R?

I'm operating with a dataset that contains the values of same variables at different points in time. In the example below I have the values of variables a and b at time points 1 and 2.
> set.seed(1)
> data <- data.frame(matrix(sample(16), ncol = 4))
> names(data) <- paste(rep(c("a", "b"), each = 2), 1:2, sep = "")
> data
a1 a2 b1 b2
1 5 3 14 13
2 6 10 1 8
3 9 11 2 4
4 12 15 7 16
Now, suppose I want to calculate a new variable for both time points so that it would contain the sum of a and b (instead of the NAs as in example below). Since my actual dataset contains about 15 different variables and 10 time points (so 150 columns), I want to automate this calculation of 10 new variables.
> data[, paste("ab", 1:2, sep = "")] <- NA
> data
a1 a2 b1 b2 ab1 ab2
1 5 3 14 13 NA NA
2 6 10 1 8 NA NA
3 9 11 2 4 NA NA
4 12 15 7 16 NA NA
I've previously used Stata where I could create a simple 'foreach' loop to do this. Something like below.
foreach t of numlist 1/2 {
generate ab`t' = a`t' + b`t'
}
But I've read that loops are discouraged in R, and I have no idea how to loop over variable names like that in R.
So what would be the correct solution for my problem in R?
This will replicate the same foreach loop you used in Stata.
for(i in 1:2){
data[, paste("ab", i, sep="")] <-
data[,paste("a", i, sep="")] + data[, paste("b", i, sep="")]
}
The output looks like this:
> data
a1 a2 b1 b2 ab1 ab2
1 15 1 16 12 31 13
2 10 7 14 3 24 10
3 2 5 9 4 11 9
4 6 8 13 11 19 19
To do this the R way:
make use of native iteration via an *apply function
use the built-in rowSums (as in @Sotos's answer)
make use of assignment into the data.frame, that is `[<-`
All together:
data[paste0('ab', 1:2)] <- sapply(1:2,
function(i)
rowSums(data[paste0(c('a', 'b'), i)]))
data
# a1 a2 b1 b2 ab1 ab2
# 1 5 3 14 13 19 16
# 2 6 10 1 8 7 18
# 3 9 11 2 4 11 15
# 4 12 15 7 16 19 31
P.S. In a program, use vapply instead. You need to provide an additional argument specifying the shape of the output, but it's safer and sometimes faster.
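A minimal sketch of the vapply variant, assuming the four-row example data from the question (typed in by hand here so the result does not depend on the random seed). The extra argument is `FUN.VALUE`, which declares that each call returns one number per row:

```r
# The question's example values, entered directly
data <- data.frame(
  a1 = c(5, 6, 9, 12), a2 = c(3, 10, 11, 15),
  b1 = c(14, 1, 2, 7), b2 = c(13, 8, 4, 16)
)

# vapply checks that every result is a numeric vector of length nrow(data)
data[paste0("ab", 1:2)] <- vapply(
  1:2,
  function(i) rowSums(data[paste0(c("a", "b"), i)]),
  FUN.VALUE = numeric(nrow(data))
)
data
```

If a column were missing or non-numeric, vapply would stop with an error rather than silently returning a list, which is the safety the remark above refers to.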
You can do without iteration:
data$ab1 <- data$a1 + data$b1
data$ab2 <- data$a2 + data$b2
or
data <- transform(data, ab1=a1+b1, ab2=a2+b2)
BTW:
It is better not to name an object data because data= is often a parameter in functions.
Here is one way to do it. We iterate over the unique values of the column names and we calculate the rowSums when those unique values match the colname values.
sapply(unique(sub('\\D', '', names(data))),
function(i) rowSums(data[,grepl(i, sub('\\D', '', names(data)))]))
# 1 2
#[1,] 17 23
#[2,] 24 22
#[3,] 14 10
#[4,] 15 11

R: Add two data frames with same dimensions

I have df1:
Type CA AR Total
alpha 2 3 5
beta 1 5 6
gamma 6 2 8
delta 8 1 9
I have df2:
Type CA AR Total
alpha 3 4 7
beta 2 6 8
gamma 9 1 10
delta 4 1 5
I want to add the values in both the data frames to get 1 data frame with this result:
Type CA AR Total
alpha 5 7 12
beta 3 11 14
gamma 15 3 18
delta 12 2 14
Example --> (alpha, CA) = 2 (from df1) + 3 (from df2) = 5 (resulting df)
Does anyone know how to do this? It's not exactly merge I think because merge will override the value, where as, I want to add the value.
Thanks in advance!!
+ is vectorised, this is just a simple operation in R
cbind(df1[1], df1[-1] + df2[-1])
# Type CA AR Total
# 1 alpha 5 7 12
# 2 beta 3 11 14
# 3 gamma 15 3 18
# 4 delta 12 2 14
If the rows of your data sets are not in the same order, you can align them with match (as mentioned in the comments):
cbind(df1[1], df1[, -1] + df2[match(df1$Type, df2$Type), -1])
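You can sum the numeric columns and keep the Type column as it is. (Adding the whole data frames with df1 + df2 fails or warns, because Type is not numeric.)
df_tot <- df1
df_tot[-1] <- df1[-1] + df2[-1]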
You can do with dplyr + magrittr, if you want to go that route:
library("dplyr")
library("magrittr")
df1 %>% select(-Type) %>%
  add(df2 %>% select(-Type)) %>%
  mutate(Type = df1$Type)
Note: this assumes df1 and df2 are ordered in the same manner.
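When the row order cannot be relied on, another option is to stack the two data frames and sum by Type. This is a sketch in base R rather than one of the answers above; df2 is deliberately shuffled here to show that order does not matter, and `df_tot` is an illustrative name.

```r
df1 <- data.frame(
  Type = c("alpha", "beta", "gamma", "delta"),
  CA = c(2, 1, 6, 8), AR = c(3, 5, 2, 1), Total = c(5, 6, 8, 9)
)
# Same values as the question's df2, but with the rows in a different order
df2 <- data.frame(
  Type = c("delta", "gamma", "beta", "alpha"),
  CA = c(4, 9, 2, 3), AR = c(1, 1, 6, 4), Total = c(5, 10, 8, 7)
)

# Stack the rows, then sum every other column within each Type
df_tot <- aggregate(. ~ Type, data = rbind(df1, df2), FUN = sum)
df_tot
```

Note that aggregate() returns the groups sorted by Type, so the rows come back in alphabetical order rather than in the order of df1.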

Adding extra column name and row name to a table in R?

I have a table which I generated with the table() function, and I then use xtable to print it as follows:
FF NF NN Sum
FF 8 0 0 8
NF 7 8 0 15
NN 3 1 4 8
I want to add an additional column name and a rowname in the following format.
Time2
Time1 FF NF NN Sum
FF 8 0 0 8
NF 7 8 0 15
NN 3 1 4 8
I looked into xtable but couldn't find anything. colnames() changes the names of the existing columns, rownames() does the same to the rownames.
You've got a couple of options.
The first is to add those names to the table object "by hand".
## An example of a table object with unnamed dimnames
x <- with(warpbreaks, table(unname(wool), unname(tension)))
x
# L M H
# A 9 9 9
# B 9 9 9
names(dimnames(x)) <- c("Time1", "Time2")
x
# Time2
# Time1 L M H
# A 9 9 9
# B 9 9 9
The second (and typically preferable) option is to supply the names in your initial call to table(), like this:
table(Time1 = warpbreaks[[2]], Time2 = warpbreaks[[3]])
# Time2
# Time1 L M H
# A 9 9 9
# B 9 9 9
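A third variant, sketched here as an assumption about what fits the question best, uses the `dnn` argument of table() to name the dimensions and addmargins() to append a Sum column like the one in the question. It uses the built-in warpbreaks data as above; `tab` and `tab_sum` are illustrative names.

```r
# Name the dimensions at creation time with dnn
tab <- table(warpbreaks$wool, warpbreaks$tension, dnn = c("Time1", "Time2"))

# Append a "Sum" column holding the row totals
tab_sum <- addmargins(tab, margin = 2)
tab_sum
```

Whether xtable prints the dimension labels is a separate matter and is not covered by this sketch.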

How to merge two data frames using (parts of) text values? [duplicate]

This question already has answers here:
R: I have to do Softmatch in String
(2 answers)
Closed 9 years ago.
I have two data frames, both with columns containing text. I want to merge those data frames using (imperfect) matches between the text columns. For example, if cell 1 of the text column of data frame 1 contains (part of) a word that resembles (part of) a word in cell 2 of the text column of data frame 2, then I want the data frames to be merged on these cells. What is the best way to do this in R?
I am not sure if my question is clear enough, but if so, does anyone know of an R package or a function that can help me do this kind of merging?
Many thanks in advance!
Try the RecordLinkage package.
Here is a possible solution where the merge works based on generally how "close" the two "words" match:
library(reshape2)
library(RecordLinkage)
set.seed(16)
l <- LETTERS[1:10]
ex1 <- data.frame(lets = paste(l, l, l, sep = ""), nums = 1:10)
ex2 <- data.frame(lets = paste(sample(l), sample(l), sample(l), sep = ""),
nums = 11:20)
ex1
# lets nums
# 1 AAA 1
# 2 BBB 2
# 3 CCC 3
# 4 DDD 4
# 5 EEE 5
# 6 FFF 6
# 7 GGG 7
# 8 HHH 8
# 9 III 9
# 10 JJJ 10
ex2
# lets nums
# 1 GDJ 11
# 2 CFH 12
# 3 DBE 13
# 4 BED 14
# 5 FJB 15
# 6 JHG 16
# 7 AII 17
# 8 ICC 18
# 9 EGF 19
# 10 HAA 20
lets <- melt(outer(ex1$lets, ex2$lets, FUN = "levenshteinDist"))
lets <- lets[lets$value < 2, ] # adjust the "< 2" as necessary
cbind(ex1[lets$Var1, ], ex2[lets$Var2, ])
# lets nums lets nums
# 9 III 9 AII 17
# 3 CCC 3 ICC 18
# 1 AAA 1 HAA 20
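The same idea can be sketched without extra packages using adist() from base R's utils package, which computes a matrix of generalized Levenshtein distances. The small data frames below are made up for illustration, and the `< 2` threshold is the same adjustable assumption as above.

```r
ex1 <- data.frame(lets = c("AAA", "BBB", "CCC"), nums = 1:3,
                  stringsAsFactors = FALSE)
ex2 <- data.frame(lets = c("HAA", "XYZ", "ICC"), nums = 11:13,
                  stringsAsFactors = FALSE)

# Distance matrix: rows correspond to ex1, columns to ex2
d <- adist(ex1$lets, ex2$lets)

# Row/column positions of all pairs within one edit of each other
hits <- which(d < 2, arr.ind = TRUE)

# Combine the matching rows side by side under distinct column names
matched <- data.frame(
  lets1 = ex1$lets[hits[, "row"]], nums1 = ex1$nums[hits[, "row"]],
  lets2 = ex2$lets[hits[, "col"]], nums2 = ex2$nums[hits[, "col"]],
  stringsAsFactors = FALSE
)
matched
```

The distance matrix has one entry per pair of rows, so for large inputs it may be worth restricting the candidates before computing distances.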

Creating a function to split data frames multiple times then recombine

I'm working on a large dataset in R with 3 factors: FY (6 levels), Region (10 levels), and Service (24 levels). I need to sum my numeric vector, SumOfUnits, at all three levels, and the only way I can think to do this is to split the data frames up into first: 6 data frames, split by FY, then split those 6 into 10 data frames, split on region, then those 10 into the 24 Services, then I can finally take the sum of the numeric vector and recombine all of the data frames into one. This data frame would then have 6*10*24 (1440) rows and 4 columns. The way I'm currently doing it involves a lot of splitting, so I thought there might be a function I could write that I could use at each level of the split, but I haven't used "function" very much in R so I'm not sure what to write (if there even is something). I also imagine there is probably a more efficient way to get the formatted data set, so I welcome all suggestions.
Here are a few lines from my data frame:
FY Region Service SumOfUnits
1 2006 1 Medication 13
2 2006 1 Medication 1
3 2006 1 Screening & Assessment 38
4 2006 1 Screening & Assessment 13
5 2006 1 Screening & Assessment 41
6 2006 1 Screening & Assessment 67
7 2006 1 Screening & Assessment 222
8 2006 1 Residential Treatment 38
9 2006 1 Residential Treatment 1558
This is the code I've been using for my splits:
# Creating a data frame by year
X <- split(MIC, MIC$FY)
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, ])
#Assign the dataframes in the list Y to individual objects
A <- Y[[1]]
B <- Y[[2]]
C <- Y[[3]]
D <- Y[[4]]
E <- Y[[5]]
Q <- Y[[6]]
#Creating 10 dataframes from 2006 split by region
X <- split(A, A$Region)
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, ])
Reg1 <- Y[[1]]
Reg2 <- Y[[2]]
Reg3<- Y[[3]]
Reg4 <- Y[[4]]
Reg5<- Y[[5]]
Reg6 <- Y[[6]]
Reg7 <- Y[[7]]
Reg8 <- Y[[8]]
Reg9 <- Y[[9]]
Reg10<- Y[[10]]
#Creating 24 dataframes: for 2006, region 1
X <- split(Reg1, Reg1$Service)
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, ])
Serv1 <- Y[[1]]
Serv2 <- Y[[2]]
Serv3<- Y[[3]]
Serv4 <- Y[[4]]
Serv5<- Y[[5]]
#etc...
I would want a sample of my data to look something like this:
FY Region Service SumOfUnits
2006 1 Medication 4300
2006 2 Medication 3299
2006 3 Medication 2198
2007 1 Medication 5467
2007 2 Medication 3214
2007 3 Medication 9807
ddply from the plyr package is a nice function for this:
library(plyr)
ddply(MIC, .(FY, Region, Service), summarize, sumOfUnits=sum(SumOfUnits))
It returns one row per FY, Region and Service combination that occurs in the data, with the summed units.
For MIC =
FY Region Service SumOfUnits
1 2006 1 A 1
2 2006 2 B 4
3 2007 1 C 3
4 2007 2 D 2
5 2007 2 E 7
6 2006 1 A 3
7 2007 1 D 3
8 2007 2 B 4
9 2007 2 B 6
returns:
FY Region Service sumOfUnits
1 2006 1 A 4
2 2006 2 B 4
3 2007 1 C 3
4 2007 1 D 3
5 2007 2 B 10
6 2007 2 D 2
7 2007 2 E 7
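For comparison, a sketch of the same summary in base R with aggregate(), using the small MIC example above. This is an alternative to the plyr call, not a claim about what plyr does internally, and `totals` is an illustrative name.

```r
MIC <- data.frame(
  FY = c(2006, 2006, 2007, 2007, 2007, 2006, 2007, 2007, 2007),
  Region = c(1, 2, 1, 2, 2, 1, 1, 2, 2),
  Service = c("A", "B", "C", "D", "E", "A", "D", "B", "B"),
  SumOfUnits = c(1, 4, 3, 2, 7, 3, 3, 4, 6)
)

# Sum SumOfUnits within each FY / Region / Service combination
totals <- aggregate(SumOfUnits ~ FY + Region + Service, data = MIC, FUN = sum)

# Order the rows the same way as the ddply output
totals <- totals[order(totals$FY, totals$Region, totals$Service), ]
totals
```

Like ddply, this only returns combinations that actually occur, so the result can have fewer than 6*10*24 rows if some combinations are absent from the data.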

Resources