AWS EMR writing to KMS Encrypted S3 Parquet Files - encryption

I am using AWS EMR 5.0, Spark 2.0, Scala 2.11, and S3 encrypted with KMS (SSE with a custom key), writing Parquet files. I can read the encrypted Parquet files with no problem. However, when I write, I get a warning. Simplified code looks like:
val headerHistory = spark.read.parquet("s3://<my bucket>/header_1473640645")
headerHistory.write.parquet("s3://<my bucket>/temp/")
but generates a warning:
16/09/15 13:11:11 WARN S3V4AuthErrorRetryStrategy: Attempting to re-send the request to my bucket.s3.amazonaws.com with AWS V4 authentication. To avoid this warning in the future, please use region-specific endpoint to access buckets located in regions that require V4 signing.
Do I need an option? Do I need to set some environment variable?

Thank you for providing additional details.
Yes, it is a known issue with SSE-KMS when using EMRFS (the library under the hood for S3 communication).
The problem is that when server-side encryption with KMS is enabled, the S3 client in EMRFS crafts the request without specifying the signer type.
To be conservative, the client tries V2 signing first and then retries with V4 if the first attempt fails. This behavior slows down the overall process.
EMRFS will be patched to use V4 on the first attempt; this should be fixed in the next EMR release.
As mentioned, it doesn't break the job.
Please keep an eye out for upcoming emr-5.x releases (no ETA):
https://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html
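In the meantime, one hedged workaround (not from the original answer) is to go through the Hadoop S3A connector instead of EMRFS and pin the region-specific endpoint yourself, so the client signs with V4 on the first attempt. The region below is an assumption, and whether S3A's SSE-KMS write support is available on your Hadoop version needs checking.
// Sketch only: assumes the bucket lives in eu-central-1 and that the S3A
// connector (and its KMS support) is available on the cluster classpath.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com") // region-specific endpoint; SSE-KMS requests require V4 signing
val headerHistory = spark.read.parquet("s3a://<my bucket>/header_1473640645")
headerHistory.write.parquet("s3a://<my bucket>/temp/")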

How do I specify encryption type when using s3remote for DVC

I have just started to explore DVC. I am trying with S3 as my DVC remote.
When I run the dvc push command, I get the generic error saying
An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
which I know for a fact is the error I get when I don't specify the encryption.
It is similar to running aws s3 cp with the --sse flag, or specifying ServerSideEncryption when using the boto3 library. How can I specify the encryption type when using DVC? Since DVC uses boto3 underneath, there must be an easy way to do this.
Got the answer for this immediately in the DVC Discord channel! By default, no encryption is used; we need to specify which server-side encryption algorithm should be used.
Running dvc remote modify worked for me!
dvc remote modify my-s3-remote sse AES256
There are a bunch of things that can be configured here. All this does is add an entry of sse = AES256 under the ['remote "my-s3-remote"'] section inside the .dvc/config file.
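For reference, the resulting .dvc/config section looks roughly like this (the url value is just an illustrative placeholder for whatever the remote was originally added with):
['remote "my-s3-remote"']
    url = s3://my-bucket/dvc-storage
    sse = AES256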
More on this here
https://dvc.org/doc/command-reference/remote/modify

RavenDb patch api in embedded version of the server

Is there any difference in patch api in embedded and standard version of the server?
Is there a need to configure document store in some way to enable patch api?
I'm writing a test which uses embedded Raven. The code works correctly on the standard version, but in the test it doesn't. I constantly receive the patch result DocumentDoesNotExists. I've checked with the debugger and the document exists in the store, so it is not a problem with the test.
Here you can find a repro of my issue: https://gist.github.com/pblachut/c2e0e227fa3beb51f4f9403505c292bb
I've reached out to RavenDB support and have an answer to my question.
There should be no difference between the embedded and the normal version of the server. The problem was that I did not explicitly specify which database the batch command should be invoked against. As a result, I was trying to patch a document in the system database.
var result = await documentStore.AsyncDatabaseCommands.ForDatabase("testDb").BatchAsync(new[] {command});
I assumed that the database name would be taken from the session (because I get the documentStore from there), but the database name should always be passed explicitly.
var documentStore = session.Advanced.DocumentStore;

Encrypt/Decrypt - Cipher - JceSecurity Restriction

I am trying to encrypt/decrypt data using javax.crypto.Cipher with the transformation AES/ECB/PKCS5Padding.
My problem is that when I run the code on my local machine, encryption/decryption works fine; however, when I run the same code on the server, the system throws an exception during Cipher initialization with that transformation.
On doing detailed analysis and checking the code inside Cipher.java, I found the problem is inside the method Cipher.initCryptoPermission(), where the system checks JceSecurity.isRestricted().
On my local machine, JceSecurity.isRestricted() returns FALSE; however, on the server the same method returns TRUE. Because of this, on the server the system does not assign the right permissions to the Cipher.
I am not sure where exactly the JceSecurity restriction is set. Appreciate your help.
On deeper digging I found the real problem and the solution.
Under JAVA_HOME/jre/lib/security there are two jar files, local_policy.jar and US_export_policy.jar. Inside local_policy.jar there is a file called default_local.policy, which stores all the cryptography permissions.
On my local machine the file had AllPermission, hence there was no restriction in JceSecurity for me and I was allowed to use the AES encryption algorithm, but the server had the limited version as shipped with the Java bundle.
Replacing local_policy.jar with the unrestricted (unlimited permissions) version did the trick.
Reading more about it on the Internet, I found that Java ships the restricted version in the download package because some countries have restrictions on the types of cryptography algorithms that may be used, hence you must check with your organisation before replacing the jar files.
Jar files with no restrictions can be found on the Oracle (Java) site at the following location: Download link
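As a quick diagnostic (a sketch, not part of the original answer), you can ask the Cipher API for the maximum allowed AES key length on a given JVM: 128 indicates the restricted policy files are in place, while 2147483647 (Integer.MAX_VALUE) indicates the unlimited policy is installed.
import javax.crypto.Cipher;

public class JcePolicyCheck {
    public static void main(String[] args) throws Exception {
        // Returns 128 under the default (restricted) policy files,
        // Integer.MAX_VALUE once the unlimited-strength policy jars are installed.
        int maxKeyLen = Cipher.getMaxAllowedKeyLength("AES");
        System.out.println("Max allowed AES key length: " + maxKeyLen);
    }
}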

Error:1411809D:SSL routines - When trying to make https call from inside R module in AzureML

I have an experiment in AzureML which has a R module at its core. Additionally, I have some .RData files stored in Azure blob storage. The blob container is set as private (no anonymous access).
Now, I am trying to make an HTTPS call from inside the R script to the Azure blob storage container in order to download some files. I am using the httr package's GET() function and have properly set up the URL, authentication, etc. The code works in R on my local machine, but the same code gives me the following error when called from inside the R module in the experiment:
error:1411809D:SSL routines:SSL_CHECK_SERVERHELLO_TLSEXT:tls invalid ecpointformat list
Apparently this is an error from the underlying OpenSSL library (which got fixed a while ago). Some suggested workarounds I found were to set sslversion = 3 and ssl_verifypeer = 1, or to turn off verification with ssl_verifypeer = 0. Both of these approaches returned the same error.
I am guessing that this has something to do with the internal Azure certificate/validation? Or maybe I am missing or overlooking something?
Any help or ideas would be greatly appreciated. Thanks in advance.
Regards
After a while, an answer came back from the support team, so I am going to post the relevant part as an answer here for anyone who lands here with the same problem.
"This is a known issue. The container (a sandbox technology known as "drawbridge" running on top of Azure PaaS VM) executing the Execute R module doesn't support outbound HTTPS traffic. Please try to switch to HTTP and that should work."
They also said that a fix is on the way:
"We are actively looking at how to fix this bug. "
Here is the original link as a reference.
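For anyone wondering what the suggested change looks like in practice, here is a minimal sketch; it assumes the blob is reachable with a SAS token (the account, container, file name and token are placeholders), since the only real change is requesting over plain HTTP instead of HTTPS:
library(httr)
# Same request as before, but over plain HTTP, because the sandbox reportedly
# blocks outbound HTTPS. Account, container, file and SAS token are placeholders.
blob_url <- "http://<storage-account>.blob.core.windows.net/<container>/mydata.RData?<sas-token>"
resp <- GET(blob_url)
stop_for_status(resp)
writeBin(content(resp, "raw"), "mydata.RData")
load("mydata.RData")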
hth

Running AWS commands from commandline on a ShellCommandActivity

My original problem is that I want to increase my DynamoDB write throughput before I run the pipeline, and then decrease it when I'm done uploading (I do this at most once a day, so I'm fine with the limitations on decreases).
The only way I found to do it is through a shell script that issues the API commands to alter the throughput. How does that work with my access_key and secret_key when the instance is a resource that the pipeline creates for me? (I can't log in to set the ~/.aws/config file and don't really want to create an AMI just for this.)
Should I write the script in bash? Can I use the Ruby/Python AWS SDK packages, for example? (I prefer the latter.)
How do I pass my credentials to the script? Do I have runtime variables (like #startedDate) that I can pass as arguments to the activity with my key and secret? Do I have any other way to authenticate with either the command line tools or the SDK packages?
If there is another way to solve my original problem, please let me know. I only arrived at the ShellCommandActivity solution because I couldn't find anything else in the documentation/forums.
Thanks!
OK, found it: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-roles.html
The resourceRole in the default object of your pipeline is the role assigned to resources (Ec2Resource) that are created as part of the pipeline activation.
The default one is configured to have all your permissions, and the AWS command line tools and SDK packages automatically look for those credentials, so there is no need to update ~/.aws/config or pass credentials manually.
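As an illustration of what the ShellCommandActivity could run under that resourceRole (a sketch assuming the role has dynamodb:UpdateTable permission; the table name and capacity values are placeholders):
#!/bin/bash
# Runs on the Ec2Resource created by the pipeline; the AWS CLI picks up the
# instance-profile credentials from resourceRole automatically, so no keys are passed in.
aws dynamodb update-table \
    --table-name MyTable \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=1000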
