Writing BigQuery UDF to decrypt column data - encryption

Problem Statement: I would like to write a BigQuery UDF to decrypt table columns.
Setup:
PII is encrypted in table columns with the Tink package.
The KEK (key encryption key) is stored in Cloud KMS.
The DEK (data encryption key) is stored in Cloud Storage.
I have created a BigQuery external table to access the DEK JSON, i.e. select encryptedKeyset from my_project.my_dataset.external_table_for_decrypted_keys gives me the required DEK.
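For reference, the external table over the DEK file might be declared roughly like this (the GCS URI and the format are illustrative placeholders, not part of the original setup):
CREATE EXTERNAL TABLE my_project.my_dataset.external_table_for_decrypted_keys (
  encryptedKeyset STRING
)
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://my-dek-bucket/dek_keyset.json']  -- hypothetical location of the Tink-written DEK file
);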
This is my sample code:
CREATE OR REPLACE FUNCTION my_proj.my_dataset.udf_decrypt_column(table_name string, column_name string)
BEGIN
DECLARE KMS_RESOURCE_NAME STRING;
DECLARE FIRST_LEVEL_KEYSET STRING;
SET KMS_RESOURCE_NAME= "gcp-kms://projects/dev/locations/us/keyRings/dev/cryptoKeys/dev-kek";
SET FIRST_LEVEL_KEYSET = (select encryptedKeyset from `my_project.my_dataset.external_table_for_decrypted_keys`);
SELECT
AEAD.DECRYPT_STRING(KEYS.KEYSET_CHAIN(KMS_RESOURCE_NAME,
from_base64(FIRST_LEVEL_KEYSET)),
from_base64(column_name),
"") as decrypted_name
FROM table_name
Issues/Questions:
DECLARE variables do not work in functions (while they work in procedures). So my question is: how do I assign values to variables in a UDF?
How do I run SQL and assign the result to a variable in a UDF? In my case I want to fetch the encryptedKeyset column from external_table_for_decrypted_keys and assign it to FIRST_LEVEL_KEYSET (in the declare section).
Any idea/pointer on how to achieve this? Thanks in advance for your reply.
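One possible workaround (a sketch only, not verified against this exact setup): a SQL UDF cannot use DECLARE/SET or a dynamic table name, but it can take the ciphertext and the wrapped DEK as arguments and keep the KMS resource name as a literal, while the calling query supplies the table and fetches encryptedKeyset once via a scalar subquery. The table and column names below are illustrative.
CREATE OR REPLACE FUNCTION my_proj.my_dataset.udf_decrypt_column(
    ciphertext STRING, first_level_keyset STRING)
RETURNS STRING
AS (
  AEAD.DECRYPT_STRING(
    KEYS.KEYSET_CHAIN(
      'gcp-kms://projects/dev/locations/us/keyRings/dev/cryptoKeys/dev-kek',
      FROM_BASE64(first_level_keyset)),
    FROM_BASE64(ciphertext),
    '')
);

-- Caller fetches the wrapped DEK once and passes it in, together with the column to decrypt:
SELECT
  my_proj.my_dataset.udf_decrypt_column(
    t.encrypted_name,
    (SELECT encryptedKeyset
     FROM my_project.my_dataset.external_table_for_decrypted_keys)) AS decrypted_name
FROM my_project.my_dataset.my_table AS t;  -- my_table and encrypted_name are placeholders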

Related

Can I use a scalar function as the table name in .create table control command?

I have a stored function that generates an identity name based on a set of parameters like this:
.create function
with (docstring = 'Returns the table name for the specified data log provider.')
GetTableName(param1: string, param2: string)
{
    // Some string cleansing and concatenation
    let tableName = strcat(param1, param2);
    tableName
}
I want to use this function to create a table. Tried the following options with no success:
.create table GetTableName('value1', 'value2') (Timestamp: datetime)
.create table [GetTableName('value1', 'value2')] (Timestamp: datetime)
I'm guessing the command expects the table name to be a string literal. Is there any way to accomplish this?
Control commands that create tables cannot include query execution (and vice versa: queries cannot run control commands).
The restriction exists for security reasons.
You can achieve the scenario using client code that makes two calls:
1) Derive the table name.
2) Generate the table creation command and send it to the server.

Field Level Encryption In Big Query

Our team is currently exploring ways to encrypt PII data at the field level within BigQuery, and we found the following way to encrypt/decrypt using Crypto-JS:
#standardSQL
CREATE TEMPORARY FUNCTION encrypt(_text STRING) RETURNS STRING LANGUAGE js AS
"""
  let key = CryptoJS.enc.Utf8.parse("<key>");
  let options = { iv: CryptoJS.enc.Utf8.parse("<iv>"), mode: CryptoJS.mode.CBC };
  let _encrypt = CryptoJS.AES.encrypt(_text, key, options);
  return _encrypt.toString();  // serialize the CipherParams object to a base64 string
""" OPTIONS (library="gs://path/to/Crypto-JS/crypto-js.js");
CREATE TEMPORARY FUNCTION decrypt(_text STRING) RETURNS STRING LANGUAGE js AS
"""
  let key = CryptoJS.enc.Utf8.parse("<key>");
  let options = { iv: CryptoJS.enc.Utf8.parse("<iv>"), mode: CryptoJS.mode.CBC };
  let _decrypt = CryptoJS.AES.decrypt(_text, key, options).toString(CryptoJS.enc.Utf8);
  return _decrypt;
""" OPTIONS (library="gs://path/to/Crypto-JS/crypto-js.js");
-- query to encrypt fields
SELECT
<fields>, encrypt(<pii-fields>)
FROM
`<project>.<dataset>.<table>`
-- query to decrypt fields
SELECT
<fields>, decrypt(<pii-fields>)
FROM
`<project>.<dataset>.<table>`
I am trying to benchmark the performance of AES-CBC encryption and decryption with the Crypto-JS library in BigQuery before deploying it to production. We found that, compared to the plain query, the time spent encrypting and decrypting grows steeply as the number of records increases, although the per-record processing time improves as the volume of data grows.
As there is no documentation available on this, could someone from the community suggest better approaches, query optimizations, or best practices for field-level encryption and decryption within BigQuery?
BigQuery now supports encryption functions. From the documentation, here is a self-contained example that creates some keysets and uses them to encrypt data. In practice, you would want to store the keysets in a real table so that you can later use them to decrypt the ciphertext.
WITH CustomerKeysets AS (
  SELECT 1 AS customer_id, KEYS.NEW_KEYSET('AEAD_AES_GCM_256') AS keyset UNION ALL
  SELECT 2, KEYS.NEW_KEYSET('AEAD_AES_GCM_256') UNION ALL
  SELECT 3, KEYS.NEW_KEYSET('AEAD_AES_GCM_256')
), PlaintextCustomerData AS (
  SELECT 1 AS customer_id, 'elephant' AS favorite_animal UNION ALL
  SELECT 2, 'walrus' UNION ALL
  SELECT 3, 'leopard'
)
SELECT
  pcd.customer_id,
  AEAD.ENCRYPT(
    (SELECT keyset
     FROM CustomerKeysets AS ck
     WHERE ck.customer_id = pcd.customer_id),
    pcd.favorite_animal,
    CAST(pcd.customer_id AS STRING)
  ) AS encrypted_animal
FROM PlaintextCustomerData AS pcd;
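For completeness, decryption follows the same pattern. A sketch, assuming the keysets and ciphertext from the example above were saved to tables (my_dataset.CustomerKeysets and my_dataset.EncryptedCustomerData are hypothetical names):
SELECT
  ecd.customer_id,
  AEAD.DECRYPT_STRING(
    (SELECT keyset
     FROM my_dataset.CustomerKeysets AS ck
     WHERE ck.customer_id = ecd.customer_id),
    ecd.encrypted_animal,
    CAST(ecd.customer_id AS STRING)
  ) AS favorite_animal
FROM my_dataset.EncryptedCustomerData AS ecd;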
Edit: if you want to decrypt using AES-CBC with PKCS padding (it's not clear what kind of padding you are using in your example) you can use the KEYS.ADD_KEY_FROM_RAW_BYTES function to create a keyset, then call AEAD.DECRYPT_STRING or AEAD.DECRYPT_BYTES. For example:
SELECT
AEAD.DECRYPT_STRING(
KEYS.ADD_KEY_FROM_RAW_BYTES(b'', 'AES_CBC_PKCS', b'1234567890123456'),
FROM_HEX('deed2a88e73dccaa30a9e6e296f62be27db30db16f76d3f42c85d31db3f46376'),
'')
This returns abcdef. The IV is expected to be the first 16 bytes of the ciphertext.

Amazon Redshift - table columns declared as varchar(max) but forced as varchar(255)

I'm coding a data extraction tool to load data from Google Search Console (GSC from now on) and store it in an Amazon Redshift (AR from now on) database. I wrote a function that parses the elements of the data frame coming from GSC to determine the field structure when creating tables in AR.
This is the R function I created:
get_table_fields <- function (d) {
  r <- FALSE
  if (is.data.frame(d)) {
    r <- vector()
    t <- d[1, ]
    c <- colnames(t)
    for (k in c) {
      v <- t[, k]
      if (is.character(v)) {
        r[k] <- "nvarchar(max)"
      } else if (!is.na(as.Date(as.character(v), format = c("%Y-%m-%d")))) {
        r[k] <- "date"
      } else if (is.numeric(v)) {
        r[k] <- ifelse(grepl(".", v, fixed = TRUE), "real", "integer")
      }
    }
  }
  return(r)
}
So far, so good. I pass the full data frame and the function extracts all relevant information from the first row, giving me the structure needed to create a table on AR.
This is the code I use to extract data from GSC and write it onto AR:
# retrieve the table fields schema
s_fields <- get_table_fields(data)
# compose the table creation definition out of the fields schema
d_fields <- paste(toString(sapply(names(s_fields), function (x) {
return(sprintf('"%s" %s', x, s_fields[x]))
})))
# compose the table creation query
c_query <- sprintf("CREATE TABLE IF NOT EXISTS %s (%s);", t_table_name, d_fields)
if (nrow(data) > 0) {
# create the table if it doesn't exist
dbSendUpdate(db, c_query)
# delete previous saved records for the specified date
dbSendUpdate(db, sprintf("DELETE FROM %s WHERE date = '%s' AND gsc_domain = '%s';", t_table_name, date_range[d], config.gsc.domain))
# upload the Google Search Console (GSC) data to Amazon Redshift (AR)
dbWriteTable(db, t_table_name, data, append = TRUE, row.names = FALSE)
}
db is the database connection object, declared like this:
# initialize the Amazon Redshift JDBC driver
driver <- JDBC("com.amazon.redshift.jdbc42.Driver", "drivers/RedshiftJDBC42-1.2.16.1027.jar", identifier.quote = "`")
# connect to the Amazon Redshift database instance
db <- dbConnect(driver, sprintf("jdbc:redshift://%s:%s/%s?user=%s&password=%s", config.ar.host, config.ar.port, config.ar.database, config.ar.user, config.ar.password))
t_table_name is a string concatenating the different dimensions in the GSC extraction definition, prefixed with gsc_by and joined with underscores, so if we wanted to extract date, page and device, the table name would be gsc_by_date_page_device.
So, basically, this code gathers a data frame from GSC and makes sure the table for the specified extraction exists, creating it if it doesn't. It then removes any previously saved data for that date (so that re-launching the extraction doesn't duplicate entries) and stores the new data in AR.
The problem is that either the AR database or the Amazon Redshift JDBC driver seems to be forcing my column definitions to varchar(255) instead of the nvarchar(max) or varchar(max) I'm trying to use. I've tried different combinations, but the result is always the same:
<simpleError in .local(conn, statement, ...): execute JDBC update query failed in dbSendUpdate ([Amazon](500310) Invalid operation: Value too long for character type
Details:
-----------------------------------------------
error: Value too long for character type
code: 8001
context: Value too long for type character varying(255)
query: 116225
location: funcs_string.hpp:395
process: padbmaster [pid=29705]
-----------------------------------------------;)>
If I print the c_query variable (the table creation query) before sending the query, it prints out correctly:
CREATE TABLE IF NOT EXISTS gsc_by_date_query_device ("date" date, "query" nvarchar(max), "device" nvarchar(max), "clicks" integer, "impressions" integer, "ctr" real, "position" integer, "gsc_domain" nvarchar(max));
CREATE TABLE IF NOT EXISTS gsc_by_date_query_country_device ("date" date, "query" nvarchar(max), "country" nvarchar(max), "device" nvarchar(max), "countryName" nvarchar(max), "clicks" integer, "impressions" integer, "ctr" real, "position" integer, "gsc_domain" nvarchar(max));
CREATE TABLE IF NOT EXISTS gsc_by_date_page_device ("date" date, "page" nvarchar(max), "device" nvarchar(max), "clicks" integer, "impressions" integer, "ctr" real, "position" real, "gsc_domain" nvarchar(max));
If I execute this in SQLWorkbench/J (the tool I'm using for checking), it creates the table correctly, and even so, what fails is the data insertion.
Can you give me a hint on what I'm doing wrong, or on how I can specify text columns bigger than 256 characters? I'm having a nightmare with this and I think I've tried everything I could.
I've written an extensive blog post explaining a lot of nuances of reading/writing data to/from Amazon Redshift: https://auth0.com/blog/a-comprehensive-guide-for-connecting-with-r-to-redshift/
In particular, the best way to read data with R is using the RPostgres library, and to write data I recommend using the R package I created: https://github.com/sicarul/redshiftTools
In particular, it does not have the issue you are reporting: varchars are created based on the length of the strings using the calculateCharSize function: https://github.com/sicarul/redshiftTools/blob/master/R/table_definition.R#L2
Though, as a best practice, I'd say that unless it's a temporary or staging table, you should always create the table yourself; that way you can control sort keys, dist keys and compression, which are very important for performance in Amazon Redshift.
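For example, a hand-written DDL for one of the GSC tables might look roughly like this (a sketch only; the varchar sizes, encodings, dist key and sort key are illustrative assumptions, not tuned recommendations for this workload):
CREATE TABLE IF NOT EXISTS gsc_by_date_page_device (
  "date"        date,
  "page"        varchar(4096) ENCODE zstd,
  "device"      varchar(32)   ENCODE zstd,
  "clicks"      integer,
  "impressions" integer,
  "ctr"         real,
  "position"    real,
  "gsc_domain"  varchar(256)  ENCODE zstd
)
DISTSTYLE KEY
DISTKEY ("page")
SORTKEY ("date");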
If you already have created the table, you can do something like:
rs_replace_table(data, dbcon=db, table_name=t_table_name, bucket="mybucket", split_files=4)
If you haven't created the table, you can do practically the same thing with rs_create_table
You'll need an S3 bucket and AWS keys with access to it, since this package uploads to S3 and then points Redshift at that bucket; it's the fastest way to bulk-upload data.

SHA returns a different result in MariaDB

I have a table that is filled with some values; for setting a value I use a stored procedure that also calculates a hash and saves it in the database.
When a value is updated, the hash should be recalculated. For recalculating the hash I use the following procedure:
DELIMITER $$
CREATE PROCEDURE `sp_UpdateHash`(IN rkey INT)
BEGIN
  DECLARE AuthCode VARCHAR(10);
  SET @input = CONCAT('SELECT r_ac INTO @AuthCode
    FROM table_rec
    WHERE r_key = ', rkey);
  PREPARE squery FROM @input;
  EXECUTE squery;
  SET @hashed = SHA2(@AuthCode, 256);
  SELECT @hashed;
  DEALLOCATE PREPARE squery;
END$$
DELIMITER ;
and a procedure just for calculating the hash:
CREATE PROCEDURE `sp_GetHash`(IN AuthCode VARCHAR(10))
BEGIN
  DECLARE hashed VARCHAR(64);
  SET hashed = SHA2(AuthCode, 256);
  SELECT hashed AS 'Hash';
END
The AuthCode is identical, but the hash is different: when I process the value fetched by the SELECT command, I get a wrong code. If I compare the two hashes against other results, for example from an online generator, the matching result is the one from the second procedure, sp_GetHash.
Do you have any idea why?
The problem was one field that has a different character set from the table; when I use it in the query it has a different size.
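A minimal sketch of that kind of fix (the utf8mb4 target and the TRIM are assumptions; the point is to make sure both procedures hash exactly the same bytes):
-- Inside sp_UpdateHash, normalize the fetched value before hashing:
SET @hashed = SHA2(CONVERT(TRIM(@AuthCode) USING utf8mb4), 256);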

SQLite query to find primary keys

In SQLite I can run the following query to get a list of columns in a table:
PRAGMA table_info(myTable)
This gives me the columns but no information about what the primary keys may be. Additionally, I can run the following two queries for finding indexes and foreign keys:
PRAGMA index_list(myTable)
PRAGMA foreign_key_list(myTable)
But I cannot seem to figure out how to view the primary keys. Does anyone know how I can go about doing this?
Note: I also know that I can do:
select * from sqlite_master where type = 'table' and name ='myTable';
And it will give me the create table statement, which shows the primary keys. But I am looking for a way to do this without parsing the create statement.
PRAGMA table_info DOES give you a column named pk (the last one) indicating whether the column is part of the primary key (if so, it holds the column's index within the key) or not (zero).
To clarify, from the documentation:
The "pk" column in the result set is zero for columns that are not
part of the primary key, and is the index of the column in the primary
key for columns that are part of the primary key.
Hopefully this helps someone:
After some research and pain the command that worked for me to find the primary key column name was:
SELECT l.name FROM pragma_table_info("Table_Name") as l WHERE l.pk = 1;
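If the table has a composite primary key, the same table-valued pragma (available since SQLite 3.16.0) can list all of the key columns in order:
SELECT name FROM pragma_table_info('myTable') WHERE pk > 0 ORDER BY pk;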
For those trying to retrieve a primary key name in Android while using the Room library:
@Oogway101's answer was throwing an error: "no such column [your_table_name]", etc.
My way of submitting the query was:
String pkSearch = "SELECT l.name FROM pragma_table_info(" + tableName + ") as l WHERE l.pk = 1;";
database.query(new SimpleSQLiteQuery(pkSearch));
I tried using (") quotation marks and still got an error:
String pkSearch = "SELECT l.name FROM pragma_table_info(\"" + tableName + "\") as l WHERE l.pk = 1;";
So my solution was this:
String pragmaInfo = "PRAGMA table_info(" + tableName + ");";
Cursor c = database.query(new SimpleSQLiteQuery(pragmaInfo));
String id = null;
c.moveToFirst();
do {
    if (c.getInt(5) == 1) {
        id = c.getString(1);
    }
} while (c.moveToNext() && id == null);
Log.println(Log.ASSERT, TAG, "AbstractDao: pk is: " + id);
The explanation is that:
A) PRAGMA table_info returns a cursor with various indices; the response is at least of length 6... didn't check more...
B) Index 1 has the column name.
C) Index 5 has the "pk" value, either 0 if it is not a primary key, or 1 if it is a pk.
You can define more than one pk column, so this will not bring an accurate result if your table has more than one (IMHO more than one is bad design and balloons the complexity of the database beyond human comprehension).
So how will this fit into the @Dao? (you may ask...)
When making the @Dao abstract, you have access to a constructor that takes the database:
From the documentation:
An abstract @Dao class can optionally have a constructor that takes a Database as its only parameter.
This is the constructor that will grant you access to run the query.
There is a catch though...
You may use the Dao during database creation with the .addCallback() method:
instance = Room.databaseBuilder(context.getApplicationContext(),
AppDatabase2.class, "database")
.addCallback(
//You may use the Daos here.
)
.build();
If you run a query in the constructor of the Dao, the database will enter a feedback loop of infinite instantiation.
This means that the query MUST be used LAZILY (just at the moment the user needs something), and because the value will never change, it can be stored and never re-queried.
