How can I de- and re-classify data?

Some of the data I work with contains sensitive information (names of persons, dates, locations, etc.). But I sometimes need to share "the numbers" with others to get help with statistical analysis, or to process them on more powerful machines where I can't control who looks at the data.
Ideally I would like to work like this:
1. Read the data into R (look at it, clean it, etc.).
2. Select a data frame that I want to de-classify, run it through a package, and receive two "files": the de-classified data and a translation file. The latter I keep to myself.
3. Share, manipulate and process the de-classified data without worries.
4. Re-classify the processed data using the translation file.
I suppose that this can also be useful when uploading data for processing "in the cloud" (Amazon, etc.).
Have you been in this situation? I first thought about writing a "randomize" function myself, but then I realized there is no end to how sophisticated this could get (for example, offsetting time-stamps without losing their order). Maybe there is already an established method or tool?
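A rough sketch of the time-stamp idea I mean (my own illustration; it assumes a Date vector called timestamps, so the offset is in days):
offset <- sample(1000:10000, 1)           # one secret offset, kept private
timestamps_shared <- timestamps + offset  # de-classified: order and intervals preserved
# re-classify later with: timestamps_shared - offset
But I suspect a proper tool handles many more cases than this.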
Thanks to everyone who contributes to the [r] tag here at Stack Overflow!

One way to do this is with match. First I make a small data frame:
foo <- data.frame(person = c("Mickey", "Donald", "Daisy", "Scrooge"), score = rnorm(4))
foo
   person      score
1  Mickey  0.3186301
2  Donald -0.5817907
3   Daisy  0.7145327
4 Scrooge -0.8252594
Then I make a key:
set.seed(100)
key <- as.character(foo$person[sample(1:nrow(foo))])
Obviously you must save this key somewhere safe. Now I can encode the persons:
foo$person <- match(foo$person, key)
foo
  person      score
1      2  0.3186301
2      1 -0.5817907
3      4  0.7145327
4      3 -0.8252594
If I want the person names again I can index the key:
key[foo$person]
[1] "Mickey" "Donald" "Daisy" "Scrooge"
Or use transform; this also works if the data has changed, as long as the person IDs remain the same:
foo <- rbind(foo, foo[sample(1:4), ], foo[sample(1:4, 2), ], foo)
foo
   person      score
1       2  0.3186301
2       1 -0.5817907
3       4  0.7145327
4       3 -0.8252594
21      1 -0.5817907
41      3 -0.8252594
31      4  0.7145327
15      2  0.3186301
32      4  0.7145327
16      2  0.3186301
11      2  0.3186301
12      1 -0.5817907
13      4  0.7145327
14      3 -0.8252594
transform(foo, person = key[person])
    person      score
1   Mickey  0.3186301
2   Donald -0.5817907
3    Daisy  0.7145327
4  Scrooge -0.8252594
21  Donald -0.5817907
41 Scrooge -0.8252594
31   Daisy  0.7145327
15  Mickey  0.3186301
32   Daisy  0.7145327
16  Mickey  0.3186301
11  Mickey  0.3186301
12  Donald -0.5817907
13   Daisy  0.7145327
14  Scrooge -0.8252594

Can you simply assign a GUID to each row from which you have removed the sensitive information? As long as your colleagues without the security clearance don't touch the GUID, you can incorporate any changes and additions they make by joining on the GUID. It then becomes a matter of generating ersatz values for the columns you have purged: LastName1, LastName2, City1, City2, and so on.
EDIT: You'd keep a table for each purged column (e.g. City, State, Zip, FirstName, LastName), each containing the distinct set of real classified values in that column alongside an integer alias. "Jones" could then be represented in the sanitized dataset as, say, LastName22, "Schenectady" as City343, and "90210" as Zipcode716. This gives your colleagues valid values to work with (they'd see the same number of distinct cities as in your real data, just with anonymized names), and the interrelationships of the anonymized data are preserved.
EDIT2: If the goal is to give your colleagues sanitized data that is still statistically meaningful, date columns require special processing. For example, if your colleagues need to do statistical computations on the age of a person, you have to give them something close to the original date: not so close that it could be revealing, yet not so far that it could skew the analysis.
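A rough R sketch of this scheme (my own illustration, not the answerer's code; it assumes a data frame df with a sensitive LastName column and uses the uuid package):
library(uuid)  # for UUIDgenerate()
# one stable GUID per row; colleagues must leave this column untouched
df$guid <- replicate(nrow(df), UUIDgenerate())
# private lookup table mapping real values to integer-suffixed aliases
lastname_key <- data.frame(real = unique(df$LastName), stringsAsFactors = FALSE)
lastname_key$alias <- paste0("LastName", seq_len(nrow(lastname_key)))
df$LastName <- lastname_key$alias[match(df$LastName, lastname_key$real)]
# later: merge(colleague_results, df, by = "guid"), then restore the real
# values via match(df$LastName, lastname_key$alias)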

Sounds like a Statistical Disclosure Control problem. Take a look at the sdcMicro package.
EDIT: Just realized that you have a slightly different problem. The point of Statistical Disclosure Control is to "damage" the data so that the risk of disclosure is reduced. By "damaging" the data you lose some information; this is the price you pay for the reduced risk of disclosure. Your data will contain less information, so an analysis of it may give different or weaker results than the same analysis done on the original data.
It depends on what you are going to do with your data.
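A minimal sketch of the sdcMicro workflow (the column names are my own assumptions; check the package vignette for the real API):
library(sdcMicro)
# wrap the data in an SDC object, declaring quasi-identifiers (assumed names)
sdc <- createSdcObj(dat,
                    keyVars = c("age", "region", "sex"),
                    numVars = "income")
sdc <- microaggregation(sdc)   # replace numeric values by small-group means
safe <- extractManipData(sdc)  # the "damaged" but shareable data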

Related

Cluster analysis in R on large data set

I have a data set with rankings as the column names and about 15,000 contestants. My data looks like:
contestant   1   2   3   4
       101  13   0   5  12
        14   0   1  34   6
       ...
       500   0   2  23   3
I've been working on doing cluster analysis on this dataset. Dendrograms are obviously not very helpful here--they produce a thick block of lines because of the large number of entries.
I'm wondering if there is a better way to do cluster analysis with this type of data. I've tried fviz_cluster() and similar commands, and have gone through multiple tutorials, but most of them guided me through making dendrograms, and their data all seem different from mine (comparing two variables, etc.) and much smaller. Essentially, I'm asking which types of cluster analysis may work well with this type of data.
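One common alternative for data of this size (a sketch, not a definitive answer; it assumes the rankings sit in a numeric data frame df) is k-means with a PCA projection instead of a dendrogram:
library(factoextra)                 # for fviz_cluster()
df_scaled <- scale(df)              # standardize the ranking columns
set.seed(1)
km <- kmeans(df_scaled, centers = 4, nstart = 25)
# plotting points only (no labels) stays readable at 15,000 rows
fviz_cluster(km, data = df_scaled, geom = "point")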

find every combination of elements in a column of a dataframe, which add up to a given sum in R

I'm trying to make my life easier by writing a menu creator, which is supposed to assemble a weekly menu from a list of my favourite dishes, in order to get a little more variety.
I gave every dish a value for how many days it approximately lasts and tried to arrange the dishes to end up with menus worth 7 days of food.
I've already tried solutions for knapsack functions from here, including dynamic programming, but I'm not experienced enough to get the hang of it, because all of these solutions target only the single most efficient option, not every combination that fills the knapsack.
library(adagio)
# create some data
dish <- c('Schnitzel','Burger','Steak','Salad','Falafel','Salmon','Mashed potatoes','MacnCheese','Hot Dogs')
days_the_food_lasts <- c(2, 2, 1, 1, 3, 1, 2, 2, 4)
price_of_the_food <- c(20, 20, 40, 10, 15, 18, 10, 15, 15)
data <- data.frame(dish, days_the_food_lasts, price_of_the_food)
# give each dish a distinct id
data$rownumber <- 1:nrow(data)
# set a limit for how many days should be covered by the dishes
food_needed_for_days <- 7
# knapsack function from the adagio library as an example; all other
# solutions I found to the knapsack problem behaved the same way
most_expensive_food <- knapsack(days_the_food_lasts, price_of_the_food, food_needed_for_days)
data[data$rownumber %in% most_expensive_food$indices, ]
# output
       dish days_the_food_lasts price_of_the_food rownumber
1 Schnitzel                   2                20         1
2    Burger                   2                20         2
3     Steak                   1                40         3
4     Salad                   1                10         4
6    Salmon                   1                18         6
Simplified:
I need a solution to a single-objective knapsack problem that returns every possible combination of dishes adding up to 7 days of food.
Thank you very much in advance
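A brute-force sketch (my own, not from adagio): with only 9 dishes there are just 2^9 = 512 subsets, so every combination summing to exactly 7 days can be enumerated directly:
target <- 7
n <- length(days_the_food_lasts)
menus <- list()
for (k in 1:n) {
  combos <- combn(n, k, simplify = FALSE)   # all index sets of size k
  keep <- combos[vapply(combos, function(i)
    sum(days_the_food_lasts[i]) == target, logical(1))]
  menus <- c(menus, keep)
}
lapply(menus, function(i) dish[i])          # each element is one valid menu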

Specify multiple conditions in long form data in R

How do I index the rows I need, given multiple conditions?
id <- c(65,65,65,65,65,900,900,900,900,900,900,211,211,211,211,211,211,211,45,45,45,45,45,45,45)
age <- c(19,22,23,24,25,21,26,31,32,37,38,22,23,25,28,29,31,32,30,31,36,39,42,44,48)
stat <- c('intern','reg','manage1','left','reg','manage1','manage2','left','reg',
          'reg','left','intern','left','intern','reg','left','reg','manage1','reg','left','intern','manage1','left','reg','manage2')
mydf <- data.frame(id, age, stat)
I need to create 5 variables:
1. m01time & m12time: the number of years elapsed before becoming a level-1 manager (manage1), and then from manage1 to manage2, regardless of whether it happens at the same job (numeric, in years).
2. change: whether or not they experienced a job change between manage1 and manage2, i.e. whether 'left' occurs somewhere in between (0 or 1).
3 & 4: m1p & m2p: the position held just before becoming manage1 and manage2 (intern, reg, or manage1).
There's a lot of information I don't need here that I am not sure how to ignore (all the jobs 211 went through before going to one where they become a manager).
The end result should look something like this:
   id m01time m12time change    m1p     m2p
1  65       4      NA     NA    reg    <NA>
2 900      NA       5      0   <NA> manage1
3 211       1      NA     NA    reg    <NA>
4  45       3       9      1 intern     reg
I tried to use ifelse with lag() and lead() to capture some conditions, but other parts (such as capturing a 'left' somewhere in between) feel more like for-loop jobs, and I am not sure how to handle them.
I'd calculate the first three variables differently than m1p and m2p. Maybe there's an elegant unified approach that I don't see at the moment.
So for the last position before manager you could do:
library(data.table)
mydt <- data.table(mydf)
# .I holds global row numbers, so index the full stat column, not the group subset
mydt[, .(m1p = mydt$stat[.I[stat == "manage1"] - 1],
         m2p = mydt$stat[.I[stat == "manage2"] - 1]), by = id]
The other variables are more conveniently calculated in a wide data format:
dt <- dcast(unique(mydt, by = c("id", "stat")),
            formula = id ~ stat, value.var = "age")
dt[, .(id, m01time = manage1 - intern,
       m12time = manage2 - manage1,
       change = as.integer(manage1 < left & left < manage2))]
Two caveats:
- reshaping might be quite costly on larger data sets
- I (over-)simplified your dummy data by ignoring duplicates of id and stat

Consolidate data table factor levels in R

Suppose I have a very large data table, one column of which is "ManufacturerName". The data was not entered uniformly, so it's pretty messy. For example, there may be observations like:
ABC Inc
ABC, Inc
ABC Incorporated
A.B.C.
...
Joe Shmos Plumbing
Joe Shmo Plumbing
...
I am looking for an automated way in R to try and consider similar names as one factor level. I have learned the syntax to manually do this, for example:
levels(df$ManufacturerName) <- list(ABC=c("ABC", "A.B.C", ....), JoeShmoPlumbing=c(...))
But I'm trying to think of an automated solution. Obviously it's not going to be perfect, as I can't anticipate every permutation in the data table, but maybe something that searches the factor levels, strips out punctuation/special characters, and creates levels based on common first words. Or any other ideas. Thanks!
Look into the stringdist package. For starters, you could do something like this:
library(stringdist)
x <- c("ABC Inc", "ABC, Inc", "ABC Incorporated", "A.B.C.", "Joe Shmos Plumbing", "Joe Shmo Plumbing")
d <- stringdistmatrix(x)
#    1  2  3  4  5
# 2  1
# 3  9 10
# 4  6  7 15
# 5 16 16 16 18
# 6 15 15 15 17  1
For more help, see ?stringdistmatrix, or search Stack Overflow for fuzzy matching, approximate string matching, string distance functions, and agrep.
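One way to go from the distance matrix to consolidated levels (a sketch; the cut height is a knob you would tune by inspecting the results):
hc <- hclust(d)               # d from stringdistmatrix() is already a dist object
groups <- cutree(hc, h = 5)   # lower h = stricter grouping; adjust to taste
split(x, groups)              # inspect which names were lumped together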

Sorting data in R

I have a dataset that I need to sort by participant (RECORDING_SESSION_LABEL) and by trial_number. However, none of the sort functions I have tried in R puts the variables in the numeric order I want. The participant variable comes out fine, but the trial ID variable comes out in the wrong order for what I need.
using:
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(trial_number)),]
Participant ID comes out as:
118 118 118 etc. 211 211 211 etc. 306 306 306 etc.(which is fine)
trial_number comes out as:
1 1 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 2 2 20 20 .... (which is not what I want - it seems to be sorting lexically rather than numerically)
What I would like is for trial_number to be ordered like this within each participant number:
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 ....
I have checked that these variables are not factors and are numeric, and I have also tried without the as.numeric, but with no joy. Looking around, I saw suggestions that sort() and mixedsort() might do the trick in place of order, but both come up with errors. I am slowly pulling my hair out over what I think should be a simple thing. Can anybody shed some light on how to get what I need?
Even though you claim it is not a factor, it does behave exactly as if it were a factor. Testing if something is a factor can be tricky since a factor is just an integer vector with a levels attribute and a class label. If it is a factor, your code needs to have a call to as.character() nested inside the as.numeric():
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(as.character(trial_number))),]
To be really sure if it's a factor, I recommend the str() function:
str(trial_number)
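To see why the as.character() step matters, here is the classic trap in miniature:
tn <- factor(c(10, 2, 1))
as.numeric(tn)                # 3 2 1  -- the internal level codes
as.numeric(as.character(tn))  # 10 2 1 -- the values you actually meant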
I think it may be worthwhile to design your own function in this case. It wouldn't be too hard: basically a bubble sort with a few alterations. You could convert each number to a string and begin by binning the strings by their number of digits, then sort within each bin by converting the least significant digits back to numeric and comparing. If you're interested, I could come up with some code for this; however, the answers above beat me to the punch with built-in functions. I've never used those functions, so I'm not sure they'll work as you intend, but there's no use reinventing the wheel.
