Running AWS commands from the command line in a ShellCommandActivity - amazon-dynamodb

My original problem is that I want to increase my DynamoDB write throughput before I run the pipeline, and then decrease it when I'm done uploading (I do this at most once a day, so I'm fine with the limitations on decreasing).
The only way I found to do it is through a shell script that issues the API commands to alter the throughput. How does that work with my AWS access_key and secret_key when the resource is one the pipeline creates for me? (I can't log in to set up the ~/.aws/config file, and I don't really want to create a custom AMI just for this.)
Should I write the script in bash? Can I use the Ruby/Python AWS SDK packages, for example? (I'd prefer the latter.)
How do I pass my credentials to the script? Are there runtime variables (like #startedDate) that I can pass as arguments to the activity with my key and secret? Is there any other way to authenticate with either the command-line tools or the SDK packages?
If there is another way to solve my original problem, please let me know. I only arrived at the ShellCommandActivity solution because I couldn't find anything else in the documentation or forums.
Thanks!

OK, found it: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-roles.html
The resourceRole in the default object of your pipeline is the one assigned to resources (Ec2Resource) that are created as part of pipeline activation.
The default one is configured to have all your permissions, and the AWS command-line tools and SDK packages automatically look for those credentials, so there is no need to update ~/.aws/config or pass credentials manually.
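For anyone who wants to make the throughput change via the SDK rather than the CLI, here is a minimal Python (boto3) sketch of the idea; the table name and capacity numbers are placeholders, and the point is simply that no keys are passed in because the instance profile from resourceRole is picked up automatically:

import boto3

# Credentials come from the Ec2Resource's instance profile (resourceRole);
# nothing is read from ~/.aws/config and no keys are passed in.
dynamodb = boto3.client("dynamodb")

def set_throughput(table_name, read_units, write_units):
    # Placeholder table name and capacity values, for illustration only.
    dynamodb.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )

# Bump write capacity before the upload, then drop it again afterwards.
set_throughput("my-table", read_units=10, write_units=1000)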

Related

How do I specify encryption type when using s3remote for DVC

I have just started to explore DVC and am trying S3 as my DVC remote. When I run the dvc push command, I get the generic error saying:
An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
which, I know for a fact, is the error I get when I don't specify the encryption.
It is similar to running aws s3 cp with the --sse flag, or specifying ServerSideEncryption when using the boto3 library. How can I specify the encryption type when using DVC? Since DVC uses boto3 underneath, there must be an easy way to do this.
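For comparison, the boto3 call referred to above looks roughly like this; a minimal sketch with placeholder bucket and key names:

import boto3

s3 = boto3.client("s3")

# Equivalent of `aws s3 cp ... --sse AES256`: ask S3 to encrypt the object
# server-side with AES-256. Bucket, key, and body are placeholders.
s3.put_object(
    Bucket="my-bucket",
    Key="data/file.bin",
    Body=b"contents",
    ServerSideEncryption="AES256",
)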
I got the answer for this immediately in the DVC Discord channel! By default, no encryption is used; we have to specify which server-side encryption algorithm should be used.
Running dvc remote modify worked for me!
dvc remote modify my-s3-remote sse AES256
There are a bunch of things we can configure here. All this does is add an entry sse = AES256 under the ['remote "my-s3-remote"'] section of the .dvc/config file.
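For reference, the resulting section of .dvc/config should then look roughly like this (the url line is a placeholder that was set when the remote was originally added):

['remote "my-s3-remote"']
    url = s3://your-bucket/path
    sse = AES256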
More on this here: https://dvc.org/doc/command-reference/remote/modify

Execute R Script on AWS via API

I have an R package that I would like to host through Amazon Web Services so that it is accessible via an API. The script should take a couple of input values and return the R output in JSON format. The API should also be able to handle multiple requests simultaneously.
So, for example, a call to http://sampleapi.com/?location=USA&state=Florida would run the R package and return the output data to the calling application.
Has anyone done this before or know of resources you can point me to that would explain how to do so? Thanks!
Thanks for all the suggestions. I decided to use Ruby for the API, with the rinruby and rails-api gems, and will host it through AWS Elastic Beanstalk. See this question for how I am setting it up: Ruby API - Accept parameters and execute script

Error:1411809D:SSL routines - when trying to make an HTTPS call from inside an R module in AzureML

I have an experiment in AzureML which has an R module at its core. Additionally, I have some .RData files stored in Azure Blob Storage. The blob container is set as private (no anonymous access).
Now I am trying to make an HTTPS call from inside the R script to the Azure Blob Storage container in order to download some files. I am using the httr package's GET() function and have properly set up the URL, authentication, etc. The code works in R on my local machine, but the same code gives me the following error when called from inside the R module in the experiment:
error:1411809D:SSL routines:SSL_CHECK_SERVERHELLO_TLSEXT:tls invalid ecpointformat list
Apparently this is an error from the underlying OpenSSL library (which was fixed a while ago). Some suggested workarounds I found were to set sslversion = 3 and ssl_verifypeer = 1, or to turn off verification with ssl_verifypeer = 0. Both approaches returned the same error.
I am guessing that this has something to do with the internal Azure certificate/validation? Or maybe I am missing or overlooking something?
Any help or ideas would be greatly appreciated. Thanks in advance.
Regards
After a while, an answer came back from the support team, so I am going to post the relevant part as an answer here for anyone who lands here with the same problem.
"This is a known issue. The container (a sandbox technology known as "drawbridge" running on top of Azure PaaS VM) executing the Execute R module doesn't support outbound HTTPS traffic. Please try to switch to HTTP and that should work."
They also said that a fix is on the way:
"We are actively looking at how to fix this bug."
Here is the original link as a reference.
Hope this helps.

Authenticating Service Accounts on Google Compute Engine with BigQuery, via R Studio Server

I'm looking to call BigQuery from RStudio, installed on a Google Compute Engine instance.
I have the bq Python tool installed on the instance, and I was hoping to use its service account and system() to have R call the bq command-line tool and so get the data.
However, I run into authentication problems, where it asks for a browser key. I'm pretty sure there is no need to get the key because of the service account, but I don't know how to construct the authentication from within R (it runs on RStudio, so it will have multiple users).
I can get an authentication token like this:
library(RCurl)
library(RJSONIO)
metadata <- getURL('http://metadata/computeMetadata/v1beta1/instance/service-accounts/default/token')
tokendata <- fromJSON(metadata)
tokendata$access_token
But how do I then use this to generate a .bigqueryrc token? It's the lack of this that triggers the authentication attempt.
This works ok:
system('/usr/local/bin/bq')
showing me that bq is installed OK.
But when I try something like:
system('/usr/local/bin/bq ls')
I get this:
Welcome to BigQuery! This script will walk you through the process of initializing your .bigqueryrc configuration file.
First, we need to set up your credentials if they do not already exist.
******************************************************************
** No OAuth2 credentials found, beginning authorization process **
******************************************************************
Go to the following link in your browser:
https://accounts.google.com/o/oauth2/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fbigquery&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&response_type=code&client_id=XXXXXXXX.apps.googleusercontent.com&access_type=offline
Enter verification code: You have encountered a bug in the BigQuery CLI. Google engineers monitor and answer questions on Stack Overflow, with the tag google-bigquery: http://stackoverflow.com/questions/ask?tags=google-bigquery
etc.
Edit:
I have managed to get bq functioning from RStudio system() commands by skipping the authentication step: I logged in to the terminal as the user running RStudio, authenticated there by signing in via the browser, then logged back into RStudio and called system("bq ls") etc., so this is enough to get me going :)
However, I would still prefer it if bq could be authenticated within RStudio itself, as many users may log in and I'll need to authenticate via the terminal for each of them. The service account documentation, and the fact that I can get an authentication token, hint that this should be easier.
For the time being, you need to run 'bq init' from the command line to set up your credentials prior to invoking bq from a script in GCE. However, the next release of bq will include support for GCE service accounts via a new --use_gce_service_account flag.
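A rough sketch of what that looks like from the shell; the flag name comes from the note above, and the exact syntax may differ by bq version:

# Run once interactively to create the ~/.bigqueryrc credentials file.
bq init
# Once the new flag ships, something like this should skip the browser flow.
bq --use_gce_service_account ls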

Connect R to POP Email Server (Gmail)

Is it possible to have R connect to Gmail's POP server and read/download the messages in a specific folder of mine? I have been storing emails and would like to go back and start analyzing subject lines, etc.
Basically, I need a way to export a folder in my Gmail account, and I would like to do this programmatically if at all possible.
Thanks in advance!
I am not sure this can be done with a single command. Maybe there is a package out there that I am not aware of which can accomplish it, but until you find one, perhaps the following process would be a solution...
Consider got-your-back (http://code.google.com/p/got-your-back/wiki/GettingStarted#Step_4%3a_Performing_A_Backup) which "is a command line tool that backs up and restores your Gmail account".
You can invoke it like this (given that Python is available on your machine):
python gyb.py --email foo@bar.com --search "from:pip@pop.com" --folder "mail_from_pip"
After completion you'll find all the emails matching the --search in the specified --folder, along with a sqlite database. (posted by dukedave, Dec 4 '11)
So depending on your OS you should be able to invoke the above command from within R and then access the downloaded mails in the respective folder.
GotYourBack is a good backup utility, but for downloading metadata for analysis, you might want something that doesn't first require you to fetch the entire content of all your email.
I've recently used the gmailr package to do a similar analysis.
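For what it's worth, the raw POP3 side of this is quite small outside of R. Here is a hedged Python sketch using the standard-library poplib against Gmail's POP endpoint; the credentials are placeholders, and POP access has to be enabled in the Gmail settings first:

import poplib
from email.parser import BytesParser

# Gmail's POP3-over-SSL endpoint; the credentials below are placeholders.
conn = poplib.POP3_SSL("pop.gmail.com", 995)
conn.user("you@gmail.com")
conn.pass_("your-app-password")

# List the messages, then retrieve each one and print its subject line.
num_messages = len(conn.list()[1])
for i in range(1, num_messages + 1):
    raw = b"\n".join(conn.retr(i)[1])
    msg = BytesParser().parsebytes(raw)
    print(msg["Subject"])

conn.quit()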
