Our team is currently exploring the ways to encrypt PII data on the field level within BigQuery and we found out the following way to encrypt/decrypt using Crypto-JS:
#standardSQL
CREATE TEMPORARY FUNCTION encrypt(_text STRING) RETURNS STRING LANGUAGE js AS
"""
let key = CryptoJS.enc.Utf8.parse("<key>");
let options = { iv: CryptoJS.enc.Utf8.parse("<iv>"), mode: CryptoJS.mode.CBC };
let _encrypt = CryptoJS.AES.encrypt(_text, key, options);
return _encrypt;
""";
CREATE TEMPORARY FUNCTION decrypt(_text STRING) RETURNS STRING LANGUAGE js AS
"""
let key = CryptoJS.enc.Utf8.parse("<key>");
let options = { iv: CryptoJS.enc.Utf8.parse("<iv>"), mode: CryptoJS.mode.CBC };
let _decrypt = CryptoJS.AES.decrypt(_text, key, options).toString(CryptoJS.enc.Utf8);
return _decrypt;
""" OPTIONS (library="gs://path/to/Crypto-JS/crypto-js.js");
-- query to encrypt fields
SELECT
<fields>, encrypt(<pii-fields>)
FROM
`<project>.<dataset>.<table>`
-- query to decrypt fields
SELECT
<fields>, decrypt(<pii-fields>)
FROM
`<project>.<dataset>.<table>`
I am trying to benchmark the performance of AES CBC encryption & decryption using Crypto JS library in the big query before deploying it into our production. We found out the rate of data to encrypt & decrypt is growing exponential per records with increasing number of data compared to the usual query. However with the increasing number of data to process, the progress of processing per record & record processing time is improving.
As there are no available documentation regarding this, could someone from the community help provide better ways, optimize query, best practices to use field level encryption & decryption within the big query?
BigQuery now supports encryption functions. From the documentation, here is a self-contained example that creates some keysets and uses them to encrypt data. In practice, you would want to store the keysets in a real table so that you can later use them to decrypt the ciphertext.
WITH CustomerKeysets AS (
SELECT 1 AS customer_id, KEYS.NEW_KEYSET('AEAD_AES_GCM_256') AS keyset UNION ALL
SELECT 2, KEYS.NEW_KEYSET('AEAD_AES_GCM_256') UNION ALL
SELECT 3, KEYS.NEW_KEYSET('AEAD_AES_GCM_256')
), PlaintextCustomerData AS (
SELECT 1 AS customer_id, 'elephant' AS favorite_animal UNION ALL
SELECT 2, 'walrus' UNION ALL
SELECT 3, 'leopard'
)
SELECT
pcd.customer_id,
AEAD.ENCRYPT(
(SELECT keyset
FROM CustomerKeysets AS ck
WHERE ck.customer_id = pcd.customer_id),
pcd.favorite_animal,
CAST(pcd.customer_id AS STRING)
) AS encrypted_animal
FROM PlaintextCustomerData AS pcd;
Edit: if you want to decrypt using AES-CBC with PKCS padding (it's not clear what kind of padding you are using in your example) you can use the KEYS.ADD_KEY_FROM_RAW_BYTES function to create a keyset, then call AEAD.DECRYPT_STRING or AEAD.DECRYPT_BYTES. For example:
SELECT
AEAD.DECRYPT_STRING(
KEYS.ADD_KEY_FROM_RAW_BYTES(b'', 'AES_CBC_PKCS', b'1234567890123456'),
FROM_HEX('deed2a88e73dccaa30a9e6e296f62be27db30db16f76d3f42c85d31db3f46376'),
'')
This returns abcdef. The IV is expected to be the first 16 bytes of the ciphertext.
Related
Problem Statement : I would like write a BigQuery UDF to de·crypt table columns
Setup :
PII information is encrypted in table columns with tink package
KEK (Key Encryption Key ) is sitting in KMS
DEK (data encryption Key) is sitting in Cloud storage
I have written BigQuery EXTERNAL Table to access DEK json i.e. select encryptedKeyset from my_project.my_dataset.external_table_for_decrypted_keys will give me required dek key
This is my sample code
CREATE OR REPLACE FUNCTION my_proj.my_dataset.udf_decrypt_column(table_name string, column_name string)
BEGIN
DECLARE KMS_RESOURCE_NAME STRING;
DECLARE FIRST_LEVEL_KEYSET STRING;
SET KMS_RESOURCE_NAME= "gcp-kms://projects/dev/locations/us/keyRings/dev/cryptoKeys/dev-kek";
SET FIRST_LEVEL_KEYSET = (select encryptedKeyset from my_project.my_dataset.external_table_for_decrypted_keys`);
SELECT
AEAD.DECRYPT_STRING(KEYS.KEYSET_CHAIN(KMS_RESOURCE_NAME,
from_base64(FIRST_LEVEL_KEYSET)),
from_base64(column_name),
"") as decrypted_name
FROM table_name
Issues/Question :
Declare variables do not work in Functions (while they work in procedures). So my question is how to assign values to variables in UDF
How to run SQL & assign value to variable in UDF. In my case I want to fetch column encryptedKeyset from external_table_for_decrypted_keys & assign to FIRST_LEVEL_KEYSET (in declare section)
Any idea/pointer how to achieve this ? Thanks in advance for your reply.
I have a table that is filled with some value, for setting the value I use a stored procedure that also calculate a hash function and save in database.
In case of updating value hash should be recalculated. For recalculating hash I use the following procedure:
DELIMITER $$
CREATE PROCEDURE `sp_UpdateHash`(IN rkey int)
Begin
DECLARE AuthCode VarChar(10);
SET #input = concat('SELECT r_ac into #AuthCode
FROM table_rec
where r_key=',rkey);
PREPARE squery FROM #input;
EXECUTE squery;
SET #hashed = SHA2(#AuthCode,256);
select #hashed;
DEALLOCATE PREPARE squery;
end;
and procedure just for calculating hash:
CREATE PROCEDURE `sp_GetHash`(IN AuthCode VarChar(10))
BEGIN
DECLARE hashed VarChar(64);
SET hashed = SHA2(AuthCode,256);
select hashed as 'Hash';
END
AuthCode identical, but hash is different when I try to process value after select command I get a wrong code. If I compare two hashes with other results, for example from an online generator, the result is similar to the second function: sp_GetHash
Do you have any idea why?
The problem was in one field that has a different coding from the table, and when I use it in the query it has a different size.
When a table is being used as a hash key, does it have anything to do with the hex id you get when you print out the table?
For example,
obj = {}
print(obj)
You might get something like table: 153CF5A0. Is this value used in the hashing process of this table?
I noticed that if you create two identical tables that aren't references of each other, they hash to different things:
obj1 = {}
obj2 = {}
map = {}
map[obj1] = 'obj1'
map[obj2] = 'obj2'
print(map[obj1]) -- obj1
print(map[obj2]) -- obj2
Yes, the memory address of a table is used as a key for hashing when the table is used to index another table. See the source code (tables are handled in the default case).
I'm familiar with MySQL and am starting to use Amazon DynamoDB for a new project.
Assume I have a MySQL table like this:
CREATE TABLE foo (
id CHAR(64) NOT NULL,
scheduledDelivery DATETIME NOT NULL,
-- ...other columns...
PRIMARY KEY(id),
INDEX schedIndex (scheduledDelivery)
);
Note the secondary Index schedIndex which is supposed to speed-up the following query (which is executed periodically):
SELECT *
FROM foo
WHERE scheduledDelivery <= NOW()
ORDER BY scheduledDelivery ASC
LIMIT 100;
That is: Take the 100 oldest items that are due to be delivered.
With DynamoDB I can use the id column as primary partition key.
However, I don't understand how I can avoid full-table scans in DynamoDB. When adding a secondary index I must always specify a "partition key". However, (in MySQL words) I see these problems:
the scheduledDelivery column is not unique, so it can't be used as a partition key itself AFAIK
adding id as unique partition key and using scheduledDelivery as "sort key" sounds like a (id, scheduledDelivery) secondary index to me, which makes that index pratically useless
I understand that MySQL and DynamoDB require different approaches, so what would be a appropriate solution in this case?
It's not possible to avoid a full table scan with this kind of query.
However, you may be able to disguise it as a Query operation, which would allow you to sort the results (not possible with a Scan).
You must first create a GSI. Let's name it scheduled_delivery-index.
We will specify our index's partition key to be an attribute named fixed_val, and our sort key to be scheduled_delivery.
fixed_val will contain any value you want, but it must always be that value, and you must know it from the client side. For the sake of this example, let's say that fixed_val will always be 1.
GSI keys do not have to be unique, so don't worry if there are two duplicated scheduled_delivery values.
You would query the table like this:
var now = Date.now();
//...
{
TableName: "foo",
IndexName: "scheduled_delivery-index",
ExpressionAttributeNames: {
"#f": "fixed_value",
"#d": "scheduled_delivery"
},
ExpressionAttributeValues: {
":f": 1,
":d": now
},
KeyConditionExpression: "#f = :f and #d <= :d",
ScanIndexForward: true
}
In SQLite I can run the following query to get a list of columns in a table:
PRAGMA table_info(myTable)
This gives me the columns but no information about what the primary keys may be. Additionally, I can run the following two queries for finding indexes and foreign keys:
PRAGMA index_list(myTable)
PRAGMA foreign_key_list(myTable)
But I cannot seem to figure out how to view the primary keys. Does anyone know how I can go about doing this?
Note: I also know that I can do:
select * from sqlite_master where type = 'table' and name ='myTable';
And it will give the the create table statement which shows the primary keys. But I am looking for a way to do this without parsing the create statement.
The table_info DOES give you a column named pk (last one) indicating if it is a primary key (if so the index of it in the key) or not (zero).
To clarify, from the documentation:
The "pk" column in the result set is zero for columns that are not
part of the primary key, and is the index of the column in the primary
key for columns that are part of the primary key.
Hopefully this helps someone:
After some research and pain the command that worked for me to find the primary key column name was:
SELECT l.name FROM pragma_table_info("Table_Name") as l WHERE l.pk = 1;
For the ones trying to retrieve a pk name in android, and while using the ROOM library.
#Oogway101's answer was throwing an error: "no such column [your_table_name] ... etc.. etc...
my way of query submition was:
String pkSearch = "SELECT l.name FROM pragma_table_info(" + tableName + ") as l WHERE l.pk = 1;";
database.query(new SimpleSQLiteQuery(pkSearch)
I tried using the (") quotations and still error.
String pkSearch = "SELECT l.name FROM pragma_table_info(\"" + tableName + "\") as l WHERE l.pk = 1;";
So my solution was this:
String pragmaInfo = "PRAGMA table_info(" + tableName + ");";
Cursor c = database.query(new SimpleSQLiteQuery(pragmaInfo));
String id = null;
c.moveToFirst();
do {
if (c.getInt(5) == 1) {
id = c.getString(1);
}
} while (c.moveToNext() && id == null);
Log.println(Log.ASSERT, TAG, "AbstractDao: pk is: " + id);
The explanation is that:
A) PRAGMA table_info returns a cursor with various indices, the response is atleast of length 6... didnt check more...
B) index 1 has the column name.
C) index 5 has the "pk" value, either 0 if it is not a primary key, or 1 if its a pk.
You can define more than one pk so this will not bring an accurate result if your table has more than one (IMHO more than one is bad design and balloons the complexity of the database beyond human comprehension).
So how will this fit into the #Dao? (you may ask...)
When making the Dao "abstract" you have access to a default constructor which has the database in it:
from the docummentation:
An abstract #Dao class can optionally have a constructor that takes a Database as its only parameter.
this is the constructor that will grant you access to the query.
There is a catch though...
You may use the Dao during a database creation with the .addCallback() method:
instance = Room.databaseBuilder(context.getApplicationContext(),
AppDatabase2.class, "database")
.addCallback(
//You may use the Daos here.
)
.build();
If you run a query in the constructor of the Dao, the database will enter a feedback loop of infinite instantiation.
This means that the query MUST be used LAZILY (just at the moment the user needs something), and because the value will never change, it can be stored. and never re-queried.