Out-of-memory problem submitting a TensorFlow 2 job on Google AI Platform

I'm trying to submit a TensorFlow 2 training job (fine-tuning an object detection model) with gcloud on Google AI Platform. My dataset is not big (the raccoon dataset, roughly 10 MB). I've tried many configurations, but each time I get the same error:
The replica master 0 ran out-of-memory and exited with a non-zero status of 9(SIGKILL)
My command:
gcloud ai-platform jobs submit training OD_ssd_fpn_large \
--job-dir=gs://${MODEL_DIR} \
--package-path ./object_detection \
--module-name object_detection.model_main_tf2 \
--region us-east1 \
--config cloud.yml \
-- \
--model_dir=gs://${MODEL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
My last attempt used large_model machine types in the cloud.yml file:
trainingInput:
  runtimeVersion: "2.2"
  pythonVersion: "3.7"
  scaleTier: CUSTOM
  masterType: large_model
  workerCount: 5
  workerType: large_model
  parameterServerCount: 3
  parameterServerType: large_model
but I always get the same error. Any hint or help would be greatly appreciated.

Reading all the data into memory is what consumes your RAM, hence you are running out of memory. You need a bigger machine type (large_model or complex_model_l; see the AI Platform machine types documentation for more details), for example:
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: n1-highcpu-16
  parameterServerType: n1-highmem-8
  evaluatorType: n1-highcpu-16
  workerCount: 9
  parameterServerCount: 3
  evaluatorCount: 1
Or you need to reduce your dataset.
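If a bigger machine is not an option, another common way to lower the memory footprint with the TF2 Object Detection API is to shrink the training batch size in your pipeline config (a rough sketch; the field lives under train_config in pipeline.config, and the value shown here is only illustrative):
# Excerpt from pipeline.config (illustrative value only)
train_config {
  batch_size: 8   # try lowering this until the master stops being killed
}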

Related

AWS Sagemaker pipeline definition error while running from aws-cli

I'm trying to integrate a SageMaker pipeline with Jenkins. I'm using the aws-cli (version 2.1.24).
Since this version doesn't support --pipeline-definition-s3-location, I'm trying to do something like the following:
aws s3 cp s3://some_bucket/folder1/pipeine_definition.json - | \
  jq -c . | \
  tee /dev/stderr | \
  xargs -0 -I{} aws sagemaker update-pipeline \
    --pipeline-name "pipelinename" \
    --role-arn "arn:aws:iam::<account_id>:role/sagemaker-role" \
    --pipeline-definition '{}'
And I get this error:
An error occurred (ValidationException) when calling the UpdatePipeline operation: Pipeline definition: At least 1 step must be provided
When I recheck the definition JSON, I can see the steps defined inside it.
Can someone help me?
I tried adding quotes around the --pipeline-definition value, which isn't working.
Since Jenkins has aws-cli version 2.1.24, I want to copy the contents of the JSON file in S3 and pass it to the --pipeline-definition argument of the aws sagemaker update-pipeline command.
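One workaround that may be worth trying (an untested sketch, relying on the AWS CLI's generic file:// parameter loading instead of xargs; the local file name is just a placeholder): download the definition from S3 first and pass it by reference.
# Copy the definition out of S3, then let the CLI read it from disk.
aws s3 cp s3://some_bucket/folder1/pipeine_definition.json ./pipeline_definition.json
aws sagemaker update-pipeline \
  --pipeline-name "pipelinename" \
  --role-arn "arn:aws:iam::<account_id>:role/sagemaker-role" \
  --pipeline-definition file://pipeline_definition.json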

AWS Code Deploy - Script at specified location: scripts/validate_service.sh failed with exit code 1

My deployments fail on the last step, Validate Service, with the error message:
The overall deployment failed because too many individual instances failed deployment, too few healthy instances are available for deployment, or some instances in your deployment group are experiencing problems.
The events log shows no further details.
My validate_service.sh contains:
#!/bin/bash
# verify we can access our webpage successfully
curl -v --silent localhost:80 2>&1 | grep Welcome
Can someone advise what I should change?
The script's return value is what matters, and yours looks good to me. I just added a couple of seconds of wait so the application has time to start up.
If you use bash -x together with a pipeline of commands, you should also add set -o pipefail so the whole pipeline fails when any one of its commands fails.
Check out my script:
#!/bin/bash
sleep 5
curl http://localhost:3009 | grep Welcome
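Putting both suggestions together, a more defensive version of the hook could look like this (a sketch; adjust the port and the expected string to your application):
#!/bin/bash
set -o pipefail            # fail the hook if any command in the pipeline fails
sleep 5                    # give the application a few seconds to start
curl --silent http://localhost:80 | grep Welcome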

404 error when using Google Cloud Scheduler to run Docker container on Cloud Run

I am posting a follow-up question to this one that I posted recently: Docker container failed to start when deploying to Google Cloud Run. I am new to GCP and am trying to teach myself by deploying a simple R script in a Docker container that connects to BigQuery and writes the system time. I've been able to deploy the Docker container successfully, but I cannot invoke it. I believe I'm misunderstanding something fundamental about APIs, and I'd greatly appreciate any input!
So far, I have:
1.- Used the plumber R package to expose the R code as a service by "decorating" it with special annotations
# script called big-query-tutorial.R
library(bigrquery)
library(tidyverse)
project = "xxxx-project"
dataset = "xxxx-dataset"
table = "xxxx-table"
bq_auth("/home/rstudio/xxxx-xxxx.json", email="xxxx@xxxx.com")
#* @get /time
systime <- function(){
# upload Sys.time() to Big Query
insert_upload_job(project=project, data=dataset, table=table, write_disposition="WRITE_APPEND", values=Sys.time() %>% as_tibble(), billing=project)
}
2.- Translated the R code from (1) to a plumber API with this R script
# script called main.R
library(plumber)
r <- plumb("/home/rstudio/big-query-tutorial.R")
r$run(host="0.0.0.0", port=8080)
3.- Made the Dockerfile
FROM rocker/tidyverse:latest
# BEGIN rstudio/plumber layers
RUN apt-get update -qq && apt-get install -y --no-install-recommends \
git-core \
libssl-dev \
libcurl4-gnutls-dev \
curl \
libsodium-dev \
libxml2-dev
RUN R -e "install.packages('plumber', repos='http://cran.us.r-project.org')"
RUN R -e "install.packages('bigrquery', repos='http://cran.us.r-project.org')"
# add json file for authentication with BigQuery and necessary R scripts
ADD xxxx-xxxx.json /home/rstudio
ADD big-query-tutorial.R /home/rstudio
ADD main.R /home/rstudio
# open port 8080 to traffic
EXPOSE 8080
# when the container starts, start the main.R script
ENTRYPOINT ["Rscript", "/home/rstudio/main.R", "--host", "0.0.0.0"]
4.- Successfully ran the container locally on my machine, with the system time being written to BigQuery when I visit http://0.0.0.0:8080/time and then refresh the browser.
5.- Pushed the container to my container registry in Google Cloud
6.- Successfully deployed the container to Cloud Run.
7.- Created a service account (i.e., xxxx@xxxx.iam.gserviceaccount.com) that has roles "Cloud Run Invoker" and "Cloud Scheduler Service Agent".
8.- Set up a Cloud Scheduler job by filling out the fields in the console as follows
Frequency: * * * * * (i.e., once per minute)
Timezone: Pacific Standard Time (PST)
Target: HTTP
URL: xxxx-xxxx.run.app
HTTP method: GET
Auth header: Add OIDC token
Service account: xxxx@xxxx.iam.gserviceaccount.com (i.e., account from (7))
Audience: xxxx-xxxx.run.app (I leave this field blank, it is automatically added)
When I click on "RUN NOW" in Cloud Scheduler, I get the error
httpRequest: {
status: 404
}
When I check the log for Cloud Run, every minute there is the 404 error. The request count under the "METRICS" tab averages out to 0.02/s.
Thank you!
-H.
A couple of recommendations:
Make sure your service account has roles/iam.serviceAccountTokenCreator and roles/cloudscheduler.serviceAgent (these enable impersonation), as well as roles/run.invoker so it can call Cloud Run.
Also, check the OIDC audience you have chosen.
A bit about the audience field in OIDC tokens:
You must set this field for the invoking service and specify the fully qualified URL of the receiving service. For example, if you are invoking Cloud Run or Cloud Functions, the id_token must include the URL/path of the service.
Example declaration:
gcloud beta scheduler jobs create http oidctest --schedule "5 * * * *" --http-method=GET \
--uri=https://hello-6w42z6vi3q-uc.a.run.app \
--oidc-service-account-email=schedulerunner@$PROJECT_ID.iam.gserviceaccount.com \
--oidc-token-audience=https://hello-6w42z6vi3q-uc.a.run.app
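It can also help to call the service by hand with an identity token, to rule out a wrong path as the source of the 404 (a sketch; it assumes the plumber route is /time, as in the script above, and that your own account is allowed to invoke the service):
# Use the full Cloud Run URL from the console, including the /time route.
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  https://xxxx-xxxx.run.app/time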

apigee command line tool unable to create product for proxy

The following command does not work, even though there is a proxy named income_prediction:
apigeetool createProduct \
--approvalType "auto" \
--environments ${APIGEE_ENVIRONMENT} \
--proxies income_prediction \
--productName "income_prediction_trial" \
--productDesc "Free trial API for income prediction." \
--quota 10 \
--quotaInterval 1 \
--quotaTimeUnit "day"
I get the following error:
Error: Create Product failed with status code 403
You're missing the mandatory --organization parameter (which specifies the organization you're creating the product for). For some reason, apigeetool does not ask for it (it should).
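In other words, something like this (a sketch; ${APIGEE_ORGANIZATION} is just a placeholder for your organization name):
apigeetool createProduct \
  --organization ${APIGEE_ORGANIZATION} \
  --approvalType "auto" \
  --environments ${APIGEE_ENVIRONMENT} \
  --proxies income_prediction \
  --productName "income_prediction_trial" \
  --productDesc "Free trial API for income prediction." \
  --quota 10 \
  --quotaInterval 1 \
  --quotaTimeUnit "day"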

Multithreaded program only runs on a single processor after compiling, how do I troubleshoot?

I am trying to run a compiled program that is supposed to run on multiple processors. With the same data, sometimes this program runs in parallel and sometimes it doesn't (with an identical PBS script file!). I suspect that something is wrong with some of the compute nodes that prevents the program from running in parallel (I don't get to choose which compute node I get). How can I work out whether this is a bug in the program or a problem with the compute node?
As per the sysadmin's advice, I am using ulimit -s 100000, but this doesn't change anything. Also, this is not an MPI program (it runs on a single node, with multiple processors).
The code that I run is as follows:
quorum_error_correct_reads -q 68 \
--contaminant=/data004/software/GIF/packages/masurca/2.3.0rc1/bin/../share/adapter.jf \
-m 1 -s 1 -g 1 -a 3 --thread=32 -w 10 -e 3 \
quorum_mer_db.jf aa.renamed.fastq ab.renamed.fastq ac.renamed.fastq ad.renamed.fastq ae.renamed.fastq af.renamed.fastq ag.renamed.fastq \
--no-discard -o pe.cor --verbose
Thanks for any advice you can offer. I will greatly appreciate your help!
PS: I don't have sudo access.
EDIT: I know it is supposed to be using multiple processors because, when I SSH into the node and run top -c, I can see the above command sometimes running at around 3200% CPU the whole time and sometimes at only 100% CPU the whole time. This is the only step involved, and there are no other sub-processes within this program. Also, I am using an HPC cluster, where I submit the job to a compute node, each with 32 processors and 512 GB of RAM.
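One way to narrow down whether it is the node or the job setup (a generic Linux/PBS sketch, not specific to quorum_error_correct_reads): print, from inside the job script, how many CPUs the job is actually allowed to use, and compare the output between a fast run and a slow run.
# Add near the top of the PBS job script, before the quorum command:
echo "node: $(hostname)"
nproc                                        # CPUs visible to this job
grep Cpus_allowed_list /proc/self/status     # CPU affinity applied to this shell
sort $PBS_NODEFILE | uniq -c                 # cores PBS allocated (if the file is set)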
