Data Masking in SAS: Scrambling Sensitive observations at character level - encryption

I'm working with client data in SAS that contains sensitive customer identification information. The challenge is to mask the field in such a way that it remains numeric/alphabetic/alphanumeric. I found a way using the bitwise functions in SAS (BXOR, BOR, BAND), but the output is full of special characters that SAS can't handle (sort, merge, etc.).
I also thought of scrambling the field itself, based on a key, but haven't been able to see it through. The requirements are:
1) It HAS to be key based.
2) It HAS to be reversible.
3) The masked/scrambled field has to be numeric/alphabetic/alphanumeric only so it can be used in SAS.
4) The field to be masked contains both letters and numbers, has varying lengths, and spans millions of observations.
Any tips on how to achieve this masking/scrambling would be greatly appreciated :(

Here is a simple key-based solution. I present the data step solution here, and will then present an FCMP version in a bit. I keep everything in the range of 48 to 127 (numbers, letters, and common characters such as # > < etc.); that's not quite alphanumeric, but I can't imagine why it would matter in this case. You could reduce it further to only truly alphanumeric characters using this same method, but it would make the key much worse (only 62 values) and be clunky to work with (as you have 3 noncontiguous ranges).
data construct_key;
  length keystr $1500;
  do _t = 1 to 1500;
    _rannum = ceil(ranuni(7)*80);
    *if _rannum=12 then _rannum=-15;
    substr(keystr,_t,1)=byte(47+_rannum);
  end;
  call symput('keystr',keystr);
run;
%put %bquote(&keystr);
data encrypted;
  set sashelp.class;
  retain key "&keystr";
  length name_encrypt $30;
  do _t = 1 to length(name);
    substr(name_encrypt,_t,1) = byte(mod(rank(substr(name,_t,1)) + rank(substr(key,1,1))-94,80)+47);
    key = substr(key,2);
  end;
  keep name:;
run;
data unencrypted;
  set encrypted;
  retain key "&keystr";
  length name_unenc $30;
  do _t = 1 to length(name_encrypt);
    substr(name_unenc,_t,1) = byte(
      mod(80+rank(substr(name_encrypt,_t,1)) - rank(substr(key,1,1)),80)
      +47);
    key = substr(key,2);
  end;
run;
This solution provides a medium level of encryption: a key with 80 possible values per position is not strong enough to deter a truly sophisticated hacker, but it is strong enough for most purposes. You need to pass either the key itself or the seed used to generate the key in order to unencrypt; if you use this multiple times, make sure to pick a new seed each time (and not something related to the data). If you seed with zero (or a nonpositive integer) you will effectively guarantee a new key each time, but you will have to pass the key itself rather than the seed, which may present some data security issues (obviously, the key itself can be obtained by a malicious user, and would have to be stored in a different location than the data). Passing the key by way of the seed is probably better, as you could pass that verbally over the telephone or through some sort of prearranged list of seeds.
I'm not sure I recommend this sort of approach in general; a superior approach may well be to simply encrypt the entire SAS dataset using a superior encryption method (PGP, for example). Your exact solution may vary, but if you have for example some customer information that isn't actually necessary for most steps of your process, you may be better off separating that information from the rest of the (non-sensitive) data and only incorporating that when it's needed.
For example, I have a process whereby I pull a sample for a client for a healthcare survey. I select valid records from a dataset that has no information for the customer except a numeric unique identifier; once I have narrowed the sample down to the valid records, I attach the customer information from a separate dataset and create the mailing files (which are stored in an encrypted directory). That keeps the data nonsensitive for as long as possible. It's not perfect - the unique numeric identifier still means there is a tie back, even if it's not to anything someone would know outside of the project - but it keeps things safe as long as possible on our end.
Here is the FCMP version:
%let keylength=5;
%let seed=15;
proc fcmp outlib=work.funcs.test;
  subroutine encrypt(value $,key $);
    length key $&keylength.;
    outargs value,key;
    do _t = 1 to lengthc(value);
      substr(value,_t,1) = byte(mod(rank(substr(value,_t,1)) + rank(substr(key,1,1))-62,96)+31);
      key = substr(key,2)||substr(key,1,1);
    end;
  endsub;
  subroutine unencrypt(value $,key $);
    length key $&keylength.;
    outargs value,key;
    do _t = 1 to lengthc(value);
      substr(value,_t,1) = byte(mod(96+rank(substr(value,_t,1)) - rank(substr(key,1,1)),96)+31);
      key = substr(key,2)||substr(key,1,1);
    end;
  endsub;
  subroutine gen_key(seed,keystr $);
    outargs keystr;
    length keystr $&keylength.;
    do _t = 1 to &keylength.;
      _rannum = ceil(ranuni(seed)*80);
      substr(keystr,_t,1)=byte(47+_rannum);
    end;
  endsub;
quit;
options cmplib=work.funcs;
data encrypted;
  set sashelp.class;
  length key $&keylength.;
  retain key ' '; *the missing is to avoid the uninitialized variable warning;
  if _n_ = 1 then call gen_key(&seed,key);
  call encrypt(name,key);
  drop key;
run;
data unencrypted;
  set encrypted;
  length key $&keylength.;
  retain key ' ';
  if _n_ = 1 then call gen_key(&seed,key);
  call unencrypt(name,key);
run;
This is somewhat more robust; it allows characters from 32 to 127 rather than from 48, meaning it deals with space successfully. (Tab will still not decode properly - it would become a 'k'.) You pass the seed to call gen_key and then it uses that key for the remainder of the process.
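For readers who need strictly alphanumeric output, the same shifting idea can be restricted to the 62 letters and digits. The following is a minimal Python sketch of that variant, not a translation of the SAS code: `ALPHABET`, `make_key`, `mask` and `unmask` are hypothetical names, the key restarts for every value instead of carrying over between observations, and Python's `random` module is not cryptographically secure, so this only illustrates the mechanics.

```python
import random
import string

# Assumption: the field contains only these 62 characters.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def make_key(seed, length):
    """Derive a list of shifts (0..61) from a seed, like gen_key does in spirit."""
    rng = random.Random(seed)
    return [rng.randrange(len(ALPHABET)) for _ in range(length)]


def _shift(text, key, sign):
    out = []
    for i, ch in enumerate(text):
        if ch not in INDEX:
            raise ValueError("character %r is outside the alphanumeric alphabet" % ch)
        # Add (or subtract) the key shift for this position, wrapping modulo 62.
        pos = (INDEX[ch] + sign * key[i % len(key)]) % len(ALPHABET)
        out.append(ALPHABET[pos])
    return "".join(out)


def mask(text, key):
    return _shift(text, key, +1)


def unmask(text, key):
    return _shift(text, key, -1)


key = make_key(seed=15, length=5)
masked = mask("AB12cd9", key)
print(masked, unmask(masked, key))
```

Because every output character is drawn from the same 62-character alphabet, the masked value keeps its length and stays sortable and mergeable.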
It goes without saying that this is not guaranteed to function for your purposes and/or to be a secure solution and you should consult with a security professional if you have substantial security needs. This post is not warranted for any purpose and any and all liability arising from its use is disclaimed by the poster.

SAS have an article on their website on how to encrypt specific variables. Hopefully this will help you.
link

Related

Finding similar hashes

I'm trying to find 2 different plain text words that create very similar hashes.
I'm using the hashing method 'whirlpool', but I don't really need my question answered for whirlpool specifically; if you can do it using md5 or something easier, that's OK.
The similarity I'm looking for is that the hashes contain the same number of each character (it doesn't matter how much they're jumbled up),
i.e.
plaintext 'test'
hash 1: abbb5 has 1 a, 3 b's, one 5
plaintext 'blahblah'
hash 2: b5bab must have the same, but it doesn't matter in what order.
I'm sure I can read up on how they're created, break it down and reverse it, but I am just wondering if what I'm talking about occurs.
I'm wondering because I haven't found a match of what I'm explaining (I created a PoC to run through random words / letters until it recreated a similar match), but then again it would take forever doing it the way I was doing it, and I was wondering if anyone with real knowledge of hashes / encryption would help me out.
So you can do it like this:
1) create an empty sorted map
2) create a 64 bit counter (you don't need more than 2^63 inputs, in all probability, since you would be dead before they would be calculated - unless quantum crypto really takes off)
3) use the counter as input, probably easiest to encode it in 8 bytes;
4) use this as input for your hash function;
5) encode output of hash in hex (use ASCII bytes, for speed);
6) sort hex on number / alphabetically (same thing really)
7) check if sorted hex result is a key in the map
8) if it is, show hex result, the old counter from the map & the current counter (and stop)
9) if it isn't, put the sorted hex result in the map, with the counter as value
10) increase counter, goto 3
That's all folks. Results for SHA-1:
011122344667788899999aaaabbbcccddeeeefff for both 320324 and 429678
I don't know why you want to do this for hex, the hashes will be so large that they won't look too much alike. If your alphabet is smaller, your code will run (even) quicker. If you use whole output bytes (i.e. 00 to FF instead of 0 to F) instead of hex, it will take much more time - a quick (non-optimized) test on my machine shows it doesn't finish in minutes and then runs out of memory.
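The steps above can be sketched in Python roughly as follows. The big-endian 8-byte counter encoding and the optional `digest_chars` truncation (there only so a demonstration finishes quickly) are assumptions of this sketch, so it is not guaranteed to reproduce the exact counters quoted above. A plain dict stands in for the sorted map.

```python
import hashlib


def sorted_hex_digest(counter, digest_chars=None):
    """Hash an 8-byte counter with SHA-1 and return its hex digits in sorted order."""
    data = counter.to_bytes(8, "big")  # assumption: big-endian encoding
    hex_digest = hashlib.sha1(data).hexdigest()
    if digest_chars is not None:
        hex_digest = hex_digest[:digest_chars]  # truncation is for demos only
    return "".join(sorted(hex_digest))


def find_anagram_pair(limit, digest_chars=None):
    """Return (old_counter, new_counter, sorted_hex) for the first match, else None."""
    seen = {}
    for counter in range(limit):
        key = sorted_hex_digest(counter, digest_chars)
        if key in seen:
            return seen[key], counter, key
        seen[key] = counter
    return None


print(find_anagram_pair(100000, digest_chars=6))
```

With `digest_chars=6` a match within 100000 inputs is guaranteed, because six hex characters have only 54,264 distinct sorted forms. Dropping the truncation searches full SHA-1 digests, which takes far more iterations and memory.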

Generate unique random numbers on a large number range

What I'm asking is whether there is a way to generate unique random numbers without helper structures.
I mean, are there existing mathematical functions (or algorithms) that natively generate each random number only once over a range? (I would rather not write some kind of hash function specific to this problem.)
This is because I want to generate a lot of unique integers chosen between 0 and 10.000.000.000 (about 60% of the range), so a random repetition is not improbable, and storing previously generated numbers in a structure for subsequent lookup (even a well-optimized one, like a bit array) could be too expensive in both space and time.
P.S.
(Note that when I write random I really mean pseudo-random.)
If you want to ensure uniqueness then do not use a hash function, but instead use an encryption function to encrypt the numbers 0, 1, 2, 3 ... Since encryption is reversible then every number (up to the block size) is uniquely encrypted and will produce a unique result.
You can either write a simple Feistel cypher with a convenient block size or else use the Hasty Pudding cypher, which allows a large range of block sizes. Whenever an input number generates too large an output, then just go to the next input number.
Changing the key of the cypher will generate a different series of output numbers. The same series of numbers can be regenerated whenever needed by remembering the key and starting again with 0, 1, 2 ... There is no need to store the entire sequence. As you say, the sequence is pseudo-random and so can be regenerated easily if you know the key.
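As an illustration of the Feistel approach, here is a minimal Python sketch. It is an assumption-laden toy rather than a vetted cipher: the SHA-256 based round function, the four rounds and the names `feistel_encrypt`, `feistel_decrypt` and `unique_numbers` are all choices made for this example. Outputs that fall outside the wanted range are skipped, as described above.

```python
import hashlib


def round_value(key, round_no, half, half_bits):
    """Keyed pseudo-random function of one half block (illustrative choice)."""
    msg = key + bytes([round_no]) + half.to_bytes(8, "big")
    digest = hashlib.sha256(msg).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)


def feistel_encrypt(x, key, half_bits, rounds=4):
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    for r in range(rounds):
        left, right = right, left ^ round_value(key, r, right, half_bits)
    return (left << half_bits) | right


def feistel_decrypt(y, key, half_bits, rounds=4):
    mask = (1 << half_bits) - 1
    left, right = y >> half_bits, y & mask
    for r in reversed(range(rounds)):
        left, right = right ^ round_value(key, r, left, half_bits), left
    return (left << half_bits) | right


def unique_numbers(key, limit, half_bits):
    """Yield every value in [0, limit) exactly once, in key-dependent order."""
    for counter in range(1 << (2 * half_bits)):
        value = feistel_encrypt(counter, key, half_bits)
        if value < limit:  # too large: just move on to the next input
            yield value


print(list(unique_numbers(b"demo-key", 20, 3)))
```

For the range in the question, `half_bits=17` gives a 34-bit block, which covers values up to 10.000.000.000; only the key and the current counter need to be stored.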
Instead of pseudo-random numbers, you could try so-called quasi-random numbers, which are more accurately called low-discrepancy sequences. [1]
[1] https://en.wikipedia.org/wiki/Low-discrepancy_sequence

What is the name for encoding/encrypting with noise padding?

I want code to render n bits with n + x bits, non-sequentially. I'd Google it but my Google-fu isn't working because I don't know the term for it.
For example, the input value in the first column (2 bits) might be encoded as any of the output values in the comma-delimited second column (4 bits) below:
0 1,2,7,9
1 3,8,12,13
2 0,4,6,11
3 5,10,14,15
My goal is to take a list of integer IDs, and transform them in a way they can still be used for persistent URLs, but that can't be iterated/enumerated sequentially, and where a client cannot determine programmatically if a URL in a search result set has been visited previously without visiting it again.
I would term this process "encoding". You'll see something similar done to permit the use of communications channels that have special symbols that are not permitted in data. Examples: uuencoding and base64 encoding.
That said, you still need to ensure (and at first blush you appear to have ensured) that there is only one correct decoding, and you have to accept the increase in size of the output (in the case above, the output will be double the size of the input, bit for bit).
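A minimal Python sketch of the table from the question shows why the decoding is unambiguous: each of the 16 output codes belongs to exactly one input value. The names `CODES`, `DECODE`, `encode` and `decode` are made up for this example.

```python
import secrets

# The mapping from the question: each 2-bit value has four possible 4-bit codes.
CODES = {
    0: (1, 2, 7, 9),
    1: (3, 8, 12, 13),
    2: (0, 4, 6, 11),
    3: (5, 10, 14, 15),
}

# Reverse lookup; it only works because no code appears under two values.
DECODE = {code: value for value, codes in CODES.items() for code in codes}


def encode(value):
    """Pick one of the valid codes for this value at random."""
    return secrets.choice(CODES[value])


def decode(code):
    return DECODE[code]


for value in range(4):
    code = encode(value)
    print(value, "->", code, "->", decode(code))
```

The same input can produce different outputs on different calls, yet every output decodes back to a single input.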
I think you'd be better off encrypting the number with a cheap cypher + a constant secret key stored on your server(s), adding a random character or four at the end, and a cheap checksum, and simply reject any responses that don't have a valid checksum.
<encrypt(secret)>
<integer>+<random nonsense>
</encrypt>
+
<checksum()>
<integer>+<random nonsense>
</checksum>
Then decrypt the first part (remember, cheap == fast), validate the ciphertext using the checksum, throw off the random nonsense, and use the integer you stored.
There are probably some cryptographic no-no's here, but let's face it, the cost of this algorithm being broken is a touch on the low side.

How to decide if the chosen password is correct?

If an encrypted file exists and someone wants to decrypt it, there are several methods to try.
For example, if you chose a brute force attack, that's easy: just try all possible keys and you will find the correct one. For this question, it doesn't matter that this might take too long.
But trying keys means the following steps:
1) Choose key
2) Decrypt data with key
3) Check if decryption was successful
Besides the problem that you would need to know the algorithm that was used for the encryption, I cannot imagine how one would do #3.
Here is why: After decrypting the data, I get some "other" data. In case of an encrypted plain text file in a language that I can understand, I can now check if the result is a text in that language.
If it would be a known file type, I could check for specific file headers.
But since one tries to decrypt something secret, it is most likely unknown what kind of information there will be if correctly decrypted.
How would one check if a decryption result is correct if it is unknown what to look for?
Like you suggest, one would expect the plaintext to be of some known format, e.g., a JPEG image, a PDF file, etc. The idea would be that it is very unlikely that a given ciphertext can be decrypted into both a valid JPEG image and a valid PDF file (but see below).
But it is actually not that important. When one talks about a cryptosystem being secure, one (roughly) talks about the odds of you being able to guess the plaintext corresponding to a given ciphertext. So I pick a random message m and encrypt it: c = E(m). I give you c and if you cannot guess m, then we say the cryptosystem is secure, otherwise it's broken.
This is just a simple security definition. There are other definitions that require the system to be able to hide known plaintexts (semantic security): you give me two messages, I encrypt one of them, and you will not be able to tell which message I chose.
The point is, that in these definitions, we are not concerned with the format of the plaintexts, all we require is that you cannot guess the plaintext that was encrypted. So there is no step 3 :-)
By not considering your step 3, we make the question of security as clear as possible: instead of arguing about how hard it is to guess which format you used (zip, gzip, bzip2, ...) we only talk about the odds of breaking the system compared to the odds of guessing the key. It is an old principle that you should concentrate all your security in the key -- it simplifies things dramatically when your only assumption is the secrecy of the key.
Finally, note that some encryption schemes make it impossible for you to verify whether you have the correct key, since all keys are legal. The one-time pad is an extreme example of such a scheme: you take your plaintext m, choose a perfectly random key k and compute the ciphertext as c = m XOR k. This gives you a completely random ciphertext; it is perfectly secure (the only perfectly secure cryptosystem, btw).
When searching for an encryption key, you cannot know when you've found the right one. This is because c could be an encryption of any file with the same length as m: if you encrypt the message m' with the key k' = c XOR m', you'll see that you get the same ciphertext again, thus you cannot know if m or m' was the original message.
Instead of thinking of exclusive-or, you can think of the one-time pad like this: I give you the number 42 and tell you that it is the sum of two integers (negative, positive, you don't know). One integer is the message, the other is the key and 42 is the ciphertext. Like above, it makes no sense for you to guess the key -- if you want the message to be 100, you claim the key is -58, if you want the message to be 0, you claim the key is 42, etc. The one-time pad works exactly like this, but on bit values instead.
About reusing the key in a one-time pad: let's say my key is 7 and you see the ciphertexts 10 and 20, corresponding to plaintexts 3 and 13. From the ciphertexts alone, you now know that the difference in plaintexts is 10. If you somehow gain knowledge of one of the plaintexts, you can now derive the other! If the numbers correspond to individual letters, you can begin looking at several such differences and try to solve the resulting crossword puzzle (or let a program do it based on frequency analysis of the language in question).
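Both one-time pad observations can be checked with a few lines of Python. This is only an illustrative sketch; the messages and the helper name `xor_bytes` are made up for the example.

```python
import secrets


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


message = b"attack at dawn"
key = secrets.token_bytes(len(message))
ciphertext = xor_bytes(message, key)

# Any other message of the same length is "explained" by some other key.
other_message = b"retreat at six"
other_key = xor_bytes(ciphertext, other_message)
print(xor_bytes(other_message, other_key) == ciphertext)

# Reusing the key leaks the XOR (the "difference") of the two plaintexts.
second_message = b"defend at noon"
second_ciphertext = xor_bytes(second_message, key)
leak = xor_bytes(ciphertext, second_ciphertext)
print(leak == xor_bytes(message, second_message))
```

Both comparisons hold for every random key, which is why a brute-force search over one-time pad keys cannot single out the real message, and why a pad must never be used twice.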
You could use heuristics like the unix file command does to check for a known file type. If you have decrypted data that has no recognizable type, decrypting it won't help you anyway, since you can't interpret it, so it's still as good as encrypted.
I wrote a tool a little while ago that checked if a file was possibly encrypted by simply checking the distribution of byte values, since encrypted files should be indistinguishable from random noise. The assumption here then is that an improperly decrypted file retains the random nature, while a properly decrypted file will exhibit structure.
#!/usr/bin/env python3
import math
import os
import sys

MAGIC_COEFF = 3


def get_random_bytes(filename):
    """Return the first block of the file plus up to BLOCKS randomly spaced blocks."""
    BLOCK_SIZE = 1024 * 1024
    BLOCKS = 10
    # Binary mode, so that each element of the result is an integer byte value.
    with open(filename, "rb") as f:
        data = bytearray(f.read(BLOCK_SIZE))
        if len(data) < BLOCK_SIZE:
            return data
        f.seek(0, 2)
        file_len = f.tell()
        index = BLOCK_SIZE
        cnt = 0
        while index < file_len and cnt < BLOCKS:
            f.seek(index)
            data.extend(f.read(BLOCK_SIZE))
            index += ord(os.urandom(1)) * BLOCK_SIZE
            cnt += 1
    return data


def failed_n_gram(n, data):
    """Return True if some n-byte value occurs far more or less often than chance."""
    print("\t%d-gram analysis" % n)
    N = len(data) // n
    states = 2 ** (8 * n)
    print("\tN: %d states: %d" % (N, states))
    if N < states:
        print("\tinsufficient data")
        return False
    histo = [0] * states
    P = 1.0 / states
    expected = N // states * 1.0
    # I forgot how this was derived, or what it is suppose to be
    # (sqrt(N*P*(1-P)) is the standard deviation of a binomial count)
    magic = math.sqrt(N * P * (1 - P)) * MAGIC_COEFF
    print("\texpected: %f magic: %f" % (expected, magic))
    idx = 0
    while idx < len(data) - n:
        # Pack the n bytes starting at idx into a single integer.
        val = 0
        for x in range(n):
            val = val << 8
            val = val | data[idx + x]
        histo[val] += 1
        idx += 1
        count = histo[val]
        if count - expected > magic:
            print("\tfailed: %s occured %d times" % (hex(val), count))
            return True
    # need this check because the absence of certain bytes is also
    # a sign something is up
    for i in range(len(histo)):
        count = histo[i]
        if expected - count > magic:
            print("\tfailed: %s occured %d times" % (hex(i), count))
            return True
    print("")
    return False


def main():
    for f in sys.argv[1:]:
        print(f)
        rand_bytes = get_random_bytes(f)
        if failed_n_gram(3, rand_bytes):
            continue
        if failed_n_gram(2, rand_bytes):
            continue
        if failed_n_gram(1, rand_bytes):
            continue


if __name__ == "__main__":
    main()
I find this works reasonably well:
$ entropy.py ~/bin/entropy.py entropy.py.enc entropy.py.zip
/Users/steve/bin/entropy.py
    1-gram analysis
    N: 1680 states: 256
    expected: 6.000000 magic: 10.226918
    failed: 0xa occured 17 times
entropy.py.enc
    1-gram analysis
    N: 1744 states: 256
    expected: 6.000000 magic: 10.419895
entropy.py.zip
    1-gram analysis
    N: 821 states: 256
    expected: 3.000000 magic: 7.149270
    failed: 0x0 occured 11 times
Here .enc is the source run through:
openssl enc -aes-256-cbc -in entropy.py -out entropy.py.enc
And .zip is self-explanatory.
A few caveats:
It doesn't check the entire file, just the first MB, then random blocks from the file. So if a file was random data with, say, a jpeg appended, it will fool the program. The only way to be sure is to check the entire file.
In my experience, the code reliably detects when a file is unencrypted (since nearly all useful data has structure), but due to its statistical nature may sometimes misdiagnose an encrypted/random file.
As has been pointed out, this kind of analysis will fail for OTP, since you can make it say anything you want.
Use at your own risk, and most certainly not as the only means of checking your results.
One way is to compress the source data with some standard algorithm like zip. If you can unzip the result after decryption, it was decrypted correctly. Compression is almost always done by encryption programs prior to encryption, both because it's another step the brute-forcer will need to repeat for each trial and lose time on, and because encrypted data is almost surely incompressible (size doesn't decrease after compression with a chained algorithm).
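A small Python sketch of that check, under loud assumptions: a toy single-byte XOR stands in for the real cipher (it is not secure in any sense), zlib stands in for zip, and `toy_xor`, `decompresses_cleanly` and `brute_force` are names invented for this example.

```python
import zlib


def toy_xor(data, key_byte):
    """Toy 'cipher': XOR every byte with one key byte (stand-in for a real one)."""
    return bytes(b ^ key_byte for b in data)


def decompresses_cleanly(candidate):
    """Step 3 of the brute-force loop: is the trial decryption a valid zlib stream?"""
    try:
        zlib.decompress(candidate)
        return True
    except zlib.error:
        return False


def brute_force(ciphertext):
    """Try every key and keep those whose output decompresses without error."""
    return [k for k in range(256) if decompresses_cleanly(toy_xor(ciphertext, k))]


plaintext = b"customer record 42; " * 20
secret_key = 0x5A
ciphertext = toy_xor(zlib.compress(plaintext), secret_key)

candidates = brute_force(ciphertext)
print(candidates)
```

The zlib header and trailing checksum make it very unlikely that a wrong key yields a stream that decompresses, so the list of candidates is normally just the real key.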
Without a more clearly defined scenario, I can only point to cryptanalysis methods. I would say it's generally accepted that validating the result is an easy part of cryptanalysis. In comparison to decrypting even a known cypher, a thorough validation check costs little cpu.
Are you seriously asking questions like this?
Well, if it was known what's inside, then you would not need to decrypt it anyway, right?
Somehow this has nothing to do with programming; the question is more mathematical. I took some encryption math classes at my university.
And you cannot confirm without a lot of data points.
Sure, you can tell if your result makes sense and is clearly meaningful in plain English (or whatever language is used), but to answer your question:
If you were able to decrypt, you should be able to encrypt as well.
So encrypt the result using the reverse process of decryption, and if you get the same results you might be golden... if not, something is possibly wrong.

Obscure / encrypt an order number as another number: symmetrical, "random" appearance?

Client has a simple increasing order number (1, 2, 3...). He wants end-users to receive an 8- or 9- digit (digits only -- no characters) "random" number. Obviously, this "random" number actually has to be unique and reversible (it's really an encryption of the actualOrderNumber).
My first thought was to just shuffle some bits. When I showed the client a sample sequence, he complained that subsequent obfuscOrderNumbers were increasing until they hit a "shuffle" point (point where the lower-order bits came into play). He wants the obfuscOrderNumbers to be as random-seeming as possible.
My next thought was to deterministically seed a linear congruential pseudo-random-number generator and then take the actualOrderNumber th value. But in that case, I need to worry about collisions -- the client wants an algorithm that is guaranteed not to collide in at least 10^7 cycles.
My third thought was "eh, just encrypt the darn thing," but if I use a stock encryption library, I'd have to post-process it to get the 8-or-9 digits only requirement.
My fourth thought was to interpret the bits of actualOrderNumber as a Gray-coded integer and return that.
My fifth thought was: "I am probably overthinking this. I bet someone on StackOverflow can do this in a couple lines of code."
Pick an 8- or 9-digit number at random, say 839712541. Then, take your order number's binary representation (for this example, I'm not using 2's complement), pad it out to the same number of bits (30), reverse it, and xor the flipped order number and the magic number. For example:
1 = 000000000000000000000000000001
Flip = 100000000000000000000000000000
839712541 = 110010000011001111111100011101
XOR = 010010000011001111111100011101 = 302841629
2 = 000000000000000000000000000010
Flip = 010000000000000000000000000000
839712541 = 110010000011001111111100011101
XOR = 100010000011001111111100011101 = 571277085
To get the order numbers back, xor the output number with your magic number, convert to a bit string, and reverse.
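The worked example above can be reproduced with a short Python sketch. The function names are made up for this illustration; `BITS` and `MAGIC` are the 30-bit width and magic number from the example.

```python
BITS = 30
MAGIC = 839712541  # the randomly picked 9-digit number from the example


def reverse_bits(value, bits=BITS):
    """Reverse the order of the bits in a zero-padded, fixed-width binary string."""
    return int(format(value, "0{}b".format(bits))[::-1], 2)


def obfuscate(order_number):
    if not 0 <= order_number < (1 << BITS):
        raise ValueError("order number does not fit in %d bits" % BITS)
    return reverse_bits(order_number) ^ MAGIC


def deobfuscate(code):
    # XOR with the magic number first, then undo the bit reversal.
    return reverse_bits(code ^ MAGIC)


for n in (1, 2):
    print(n, obfuscate(n), deobfuscate(obfuscate(n)))
```

Note that a 30-bit value can be as large as 1,073,741,823, so some inputs produce a 10-digit result; if the 9-digit limit is strict, the outputs need to be checked against it.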
Hash function? http://www.partow.net/programming/hashfunctions/index.html
Will the client require the distribution of obfuscated consecutive order numbers to look like anything in particular?
If you do not want to complicate yourself with encryption, use a combination of bit shuffling with a bit of random salting (if you have bits/digits to spare) XOR-superimposed over some fixed constant (or some function of something that would be readily available alongside the obfuscated order ID at any time, such as perhaps the customer_id who placed the order?)
EDIT
It appears that all the client desires is for an outside party to not be able to infer the progress of sales. In this case a shuffling solution (bit-mapping, e.g. original bit 1 maps to obfuscated bit 6, original bit 6 maps to obfuscated bit 3, etc.) should be more than sufficient. Add some random bits if you really want to make it harder to crack, provided that you have the additional bits available (e.g. assuming original order numbers go only up to 6 digits, but you're allowed 8-9 in the obfuscated order number, then you can use 2-3 digits for randomness before performing bit-mapping). Possibly XOR the result for additional intimidation (an inquisitive party might attempt to generate two consecutive obfuscated orders, XOR them against each other to get rid of the XOR constant, and would then have to deduce which of the non-zero bits come from the salt, and which ones came from an increment, and whether he really got two consecutive order numbers or not... He would have to repeat this for a significant number of what he'd hope are consecutive order numbers in order to crack it.)
EDIT2
You can, of course, allocate completely random numbers for the obfuscated order IDs, store the correspondence to persistent storage (e.g. DB) and perform collision detection as well as de-obfuscation against same storage. A bit of overkill if you ask me, but on the plus side it's the best as far as obfuscation goes (and you implement whichever distribution function your soul desires, and you can change the distribution function anytime you like.)
In a 9-digit number, the first digit is a random index between 0 and 7 (or 1-8). Put another random digit at that position. The rest is the "real" order number:
Orig order: 100
Random index: 5
Random digit: 4 (guaranteed, rolled a dice :) )
Result: 500040100
Orig Nr: 101
Random index: 2
Random digit: 6
Result: 200001061
You can decide that the 5th (or any other) digit is the index.
Or, if you can live with real order numbers of 6 digits, then you can introduce "secondary" index as well. And you can reverse the order of the digits in the "real" order nr.
I saw this rather late, (!) hence my rather belated response. It may be useful to others coming along later.
You said: "My third thought was "eh, just encrypt the darn thing," but if I use a stock encryption library, I'd have to post-process it to get the 8-or-9 digits only requirement."
That is correct. Encryption is reversible and guaranteed to be unique for a given input. As you point out, most standard encryptions do not have the right block size. There is one however, Hasty Pudding Cipher which can have any block size from 1 bit upwards.
Alternatively you can write your own. Given that you don't need something the NSA can't crack, then you can construct a simple Feistel cipher to meet your needs.
If your order ID is unique, you can simply make a prefix and add/mix that prefix with your order ID.
Something like this:
long pre = DateTime.Now.Ticks % 100;
string prefix = pre.ToString();
string number = prefix + YOURID.ToString();
<?PHP
// Digit substitution table: plain digit => cipher digit.
$cry = array(0=>5,1=>3,2=>9,3=>2,4=>7,5=>6,6=>1,7=>8,8=>0,9=>4);

// Encode the digits of $e into a $k-digit string padded with random digits.
function enc($e,$cry,$k){
    $e = (string)$e; // cast so that $ct[$n] indexes digits even when an integer is passed
    // The first output digit stores the length via $cry, which only has keys 0-9.
    if(strlen($e)>9)die("max encrypt digits is 9");
    if(strlen($e) >= $k)die("Request encrypt must be lesser than its length");
    if(strlen($e) ==0)die("must pass some numbers");
    $ct = $e;
    $jump = ($k-1)-strlen($e);
    $ency = $cry[(strlen($e))];
    $n = 0;
    for($a=0;$a<$k-1;$a++){
        if($jump > 0){
            if($a%2 == 1){
                $ency .=rand(0,9);
                $jump -=1;
            }else{
                if(isset($ct[$n])){
                    $ency.=$cry[$ct[$n]];
                    $n++;
                }else{
                    $ency .=rand(0,9);
                    $jump -=1;
                }
            }
        }else{
            $ency.= $cry[$ct[$n]];
            $n++;
        }
    }
    return $ency;
}

// Reverse of enc(): read the length from the first digit, then skip the padding.
function dec($e,$cry){
    //$decy = substr($e,6);
    $ar = str_split($e,1);
    $len = array_search($ar[0], $cry);
    $jump = strlen($e)-($len+1);
    $val = "";
    for($i=1;$i<strlen($e);$i++){
        if($i%2==0){
            if($jump >0){
                //$val .=array_search($e[$i], $cry);
                $jump--;
            }else{
                $val .=array_search($e[$i], $cry);
            }
        }else{
            if($len > 0){
                $val .=array_search($e[$i], $cry);
                $len--;
            }else{
                $jump--;
            }
        }
    }
    return $val;
}

if(isset($_GET["n"])){
    $n = $_GET["n"];
}else{
    $n = 1000;
}
$str = 1253;
$str = enc($str,$cry,15);
echo "Encrypted Value : ".$str ."<br/>";
$str = dec($str,$cry);
echo "Decrypted Value : ".$str ."<br/>";
?>
