I'm trying to use Dask to load multiple JSON files from AWS S3 into memory in a SageMaker Jupyter notebook.
When I submit 10 or 20 workers, everything runs smoothly. However, when I submit 100 workers, between 30% and 50% of them fail with the following error: 'Unable to locate credentials'.
Initially I was using boto3. To rule it out as the cause, I switched to s3fs, but the same error occurs.
Which workers fail with the NoCredentialsError is random from run to run, as is the exact number of failed downloads.
SageMaker handles all the AWS credentials through its IAM role, so I have no access to key pairs or anything like that. The ~/.aws/config file contains only the default region setting; nothing about credentials.
This seems to be a very common use case for Dask, so it is clearly capable of the task; where am I going wrong?
Any help would be much appreciated! Code and error output below. In this example, 29 workers failed due to credentials.
Thanks,
Patrick
import boto3
import json
import logging
import multiprocessing
from dask.distributed import Client, LocalCluster
import s3fs
import os
THREADS_PER_DASK_WORKER = 4
CPU_COUNT = multiprocessing.cpu_count()
HTTP_SUCCESSFUL_REQUEST_CODE = 200
S3_BUCKET_NAME = '-redacted-'
keys_100 = ['-redacted-']
keys_10 = ['-redacted-']
def dispatch_workers(workers):
    cluster_workers = min(len(workers), CPU_COUNT)
    cluster = LocalCluster(n_workers=cluster_workers, processes=True,
                           threads_per_worker=THREADS_PER_DASK_WORKER)
    client = Client(cluster)
    data = []
    data_futures = []
    for worker in workers:
        data_futures.append(client.submit(worker))
    for future in data_futures:
        try:
            tmp_flight_data = future.result()
            if future.status == 'finished':
                data.append(tmp_flight_data)
            else:
                logging.error(f"Future status = {future.status}")
        except Exception as err:
            logging.error(err)
    del data_futures
    cluster.close()
    client.close()
    return data

def _get_object_from_bucket(key):
    s3 = s3fs.S3FileSystem(anon=False)  # uses default credentials
    with s3.open(os.path.join(S3_BUCKET_NAME, key)) as f:
        return json.loads(f.read())

def get_data(keys):
    objects = dispatch_workers(
        [lambda key=key: _get_object_from_bucket(key) for key in keys]
    )
    return objects
data = get_data(keys_100)
Output:
ERROR:root:Unable to locate credentials
(the same line is repeated 29 times in total, once per failed worker)
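For what it's worth, one workaround I'm considering is below. This is only a sketch, based on my assumption that the failures come from every worker hitting the instance-metadata endpoint for the IAM role at roughly the same time: resolve the role credentials once on the client with boto3 and pass them to the workers explicitly.

import json
import boto3
import s3fs

# Assumption: each S3FileSystem() created on a worker looks up the IAM role via
# the instance-metadata endpoint, and ~100 simultaneous lookups get throttled.
# Resolve the credentials once here and ship them with the task instead.
frozen = boto3.Session().get_credentials().get_frozen_credentials()

def _get_object_from_bucket_explicit(key):
    s3 = s3fs.S3FileSystem(key=frozen.access_key,
                           secret=frozen.secret_key,
                           token=frozen.token)
    with s3.open(f"{S3_BUCKET_NAME}/{key}") as f:
        return json.loads(f.read())

(frozen is a plain namedtuple, so it pickles cleanly when the closure is shipped to the workers.) Is that a reasonable direction, or is there a more idiomatic way in Dask?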
Related
I am trying to use pygsheets from within a Jupyter notebook and cannot get it to work, while the same piece of code works fine from within ipython.
from pathlib import Path
import pygsheets
creds = Path(r"/path/to/client_secret.json")
gc = pygsheets.authorize(client_secret=creds)
book = gc.open_by_key("__key__of__sheet__")
wks = book.worksheet_by_title("Sheet1")
wks.clear(start="A2")
When called from within ipython everything works fine, whereas from within a Jupyter notebook I get
RefreshError: ('invalid_grant: Token has been expired or revoked.', {'error': 'invalid_grant', 'error_description': 'Token has been expired or revoked.'})
I run both pieces from within the same conda environment. Any suggestions on how to narrow down the problem (and solutions) are very welcome!
It turns out that the current working directory of my Jupyter notebook was not the same as that of my plain ipython session. pygsheets stores the token it uses for authentication with Google in the current working directory, and if the token .json file in that directory is invalid, authentication will fail.
You can add the parameter credentials_directory=... to point at a directory that already holds a validated token file.
The solution I ended up with was
gc = pygsheets.authorize(client_secret=creds, credentials_directory=creds.parent)
That way token and credential files are in the same directory.
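A quick way to confirm the mismatch (just a diagnostic sketch; it assumes the cached token is the .json file pygsheets drops into the working directory) is to print the working directory and any cached .json files from both ipython and the notebook:

import os
from pathlib import Path

# Run this in both ipython and the Jupyter notebook and compare the output;
# a stale token .json in one of the two directories explains the RefreshError.
print(os.getcwd())
print(list(Path.cwd().glob("*.json")))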
I am trying to set up a Google Colab notebook that runs in R and can read a GCS bucket from a GCP project. I am using the googleCloudStorageR package. To authenticate and read the bucket, the initial Colab notebook runs the following commands from a Python cell:
!gcloud auth login
!gcloud config set project project_name
!gcloud sql instances describe project_name
How can I run the above commands in R using the googleCloudStorageR package? The documentation for the package mentions using the gcs_auth function, which reads an authentication JSON file. However, since I will be accessing the buckets through a Colab notebook running R, I do not want to use an authentication file and instead want to authenticate and connect to GCP storage in real time from the Colab notebook. Thank you!
Figured this out. In a Colab notebook, run the following code snippet:
install.packages("httr")
install.packages("R.utils")
install.packages("googleCloudStorageR")
if (file.exists("/usr/local/lib/python3.6/dist-packages/google/colab/_ipython.py")) {
  library(R.utils)
  library(httr)
  reassignInPackage("is_interactive", pkgName = "httr", function() return(TRUE))
}
library(googleCloudStorageR)
options(
  rlang_interactive = TRUE,
  gargle_oauth_email = "email_address",
  gargle_oauth_cache = TRUE
)
token <- gargle::token_fetch(scopes = "https://www.googleapis.com/auth/cloud-platform")
googleAuthR::gar_auth(token = token)
There is an issue with the gargle authentication that the googleCloudStorageR package uses. A workaround, similar to the one listed at https://github.com/r-lib/gargle/issues/140, is to generate a token for the cloud-platform scope, which gives us a token object that we then pass to the gar_auth function.
I am trying to transition from a SQLite DB to PostgreSQL (based on the guide here: https://www.ryanmerlin.com/2019/07/apache-airflow-installation-on-ubuntu-18-04-18-10/ ) and am getting Airflow's 'mushroom cloud' error at the initial screen of the webserver UI.
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.6/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/airflow/.local/lib/python3.6/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  ...
  ...
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/www/utils.py", line 93, in is_accessible
    (not current_user.is_anonymous and current_user.is_superuser())
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/contrib/auth/backends/password_auth.py", line 114, in is_superuser
    return hasattr(self, 'user') and self.user.is_superuser()
AttributeError: 'NoneType' object has no attribute 'is_superuser'
Looking at the webserver logs does not reveal much...
[airflow@airflowetl airflow]$ tail airflow-webserver.*
==> airflow-webserver.err <==
/home/airflow/.local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
==> airflow-webserver.log <==
==> airflow-webserver.out <==
[2019-12-18 10:20:36,553] {settings.py:213} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=72725
==> airflow-webserver.pid <==
72745
One thing that may be useful to note (since this appears to be due to some kind of password issue) is that, before trying to switch to Postgres, I had set a bcrypt password according to the docs (https://airflow.apache.org/docs/stable/security.html#password) with the script here:
import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser
user = PasswordUser(models.User())
user.username = 'airflow'
user.email = 'myemail@co.org'
user.password = 'mypasword'
session = settings.Session()
session.add(user)
session.commit()
session.close()
exit()
Anyone know what could be going on here or how to debug further?
Rerunning my user/password script fixed the problem.
I assume this is connected to the switch from the previously used SQLite DB to the new Postgres server. The user/auth info is stored in the Airflow backend DB (I am rather new to Airflow, so was not aware of these internals), and since I am switching backends, the new backend does not have it yet. I therefore needed to rerun the script to write a new user/password to the Postgres backend before I could log in with a password (my airflow.cfg uses auth_backend = airflow.contrib.auth.backends.password_auth).
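In case it helps anyone double-check the same thing, here is a small sketch (it assumes the same Airflow 1.x models used in the script above are available) that lists the users stored in whatever backend airflow.cfg currently points at, so you can see whether the user made it into the new Postgres DB:

from airflow import models, settings

# List the users stored in the configured metadata DB; an empty list after
# switching backends means the create-user script needs to be rerun.
session = settings.Session()
print([u.username for u in session.query(models.User).all()])
session.close()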
I can successfully access the GitLab project using Git Bash or via RStudio with my credentials.
But when I try to install the project using devtools, it returns an error.
Here is my code:
creds = git2r::cred_ssh_key(publickey = "C:\\Users\\user\\.ssh\\id_rsa.pub", privatekey = "C:\\Users\\user\\.ssh\\id_rsa")
devtools::install_git("git@gitlab.mycompany.com:my_project.git", credentials = creds)
Here is the log:
Downloading git repo git@gitlab.mycompany.com:my_projects/my_project.git
Installation failed: Error in 'git2r_clone': Failed to authenticate SSH session: Unable to send userauth-publickey request
I use RStudio on Windows 7.
The issue was fixed by recreating the SSH key.
I am trying to use the LinkedIn API to access some data for a project. I am following the instructions provided at https://github.com/mpiccirilli/Rlinkedin, but no matter what I do I run into a timeout error:
library(pacman)
p_load(rvest,ggplot2,Rlinkedin,httr,curl,devtools,dplyr,devtools)
in.auth <- inOAuth()
The console returns this message:
If you've created you're own application, be sure to copy and paste the following into 'OAuth 2.0 Redirect URLs' in the LinkedIn Application Details: http://localhost:1410/
When done, press any key to continue...
Use a local file ('.httr-oauth'), to cache OAuth access credentials between R sessions?
1: Yes
2: No
I click 1 or 2 and I get the same error:
Adding .httr-oauth to .gitignore
Error in curl::curl_fetch_memory(url, handle = handle) :
  Timeout was reached: Connection timed out after 10000 milliseconds
No matter what I try I get the same error; any help would be very much appreciated.