GCP Storage egress charges: where are they coming from?

We are seeing approximately linear growth in our bill due to "GCP Storage egress between NA and EU" costs. As far as I can tell, we have neither storage buckets nor instances in NA. Judging by the storage.googleapis.com/network/sent_bytes_count metric, the egress appears to coincide with deployments of our App Engine app (a static site that is redeployed every 5-10 minutes).
How can I find out what data is being transferred from NA, and how can I stop it to avoid the charges?

You can enable the Cloud Storage data access logs. They are disabled by default because the volume of logs can be huge.
In your case, you can enable them for the duration of the investigation and then disable them again.
Also check the region of your App Engine deployment; it may be the root cause.

I'm also noticing some unexpected GCP Storage egress between NA and EU costs. I'm running an App Engine app in the EU region. My theory is that this is due to container images being downloaded from gcr.io (NOT eu.gcr.io) as part of the process of deploying an App Engine version. (It says here that gcr.io is currently in the US.) I find some evidence of this in the Cloud Build history: there, I see e.g. Pulling image: gcr.io/gae-runtimes/crane:current. If I browse to gcr.io/gae-runtimes/crane, I see that its "Virtual size" is 7.66MB, so, since I've done 37 deploys by now and my bill mentions 1.58GB of egress, by itself it does not explain the figure completely, but presumably other, bigger images are being downloaded as well. (I see in the Build History things like Already have image (with digest): gcr.io/cloud-builders/gcs-fetcher, but perhaps these are charged anyway?)
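As a rough check of the arithmetic in the post above, here is a minimal sketch using only the figures quoted there. It assumes decimal units (1 GB = 1000 MB) and that the crane image is pulled in full on every deploy; neither assumption is confirmed by the post.

```python
# Figures quoted in the post above.
DEPLOYS = 37
CRANE_IMAGE_MB = 7.66
BILLED_EGRESS_GB = 1.58

# Assumption: decimal units, 1 GB = 1000 MB.
MB_PER_GB = 1000

# Egress explained by the crane image if it is pulled once per deploy.
crane_total_mb = DEPLOYS * CRANE_IMAGE_MB
billed_mb = BILLED_EGRESS_GB * MB_PER_GB

# What is left over, and what the bill implies per deploy on average.
unexplained_mb = billed_mb - crane_total_mb
per_deploy_mb = billed_mb / DEPLOYS

print(f"crane image, all deploys: {crane_total_mb:.1f} MB")
print(f"not explained by crane:   {unexplained_mb:.1f} MB")
print(f"billed egress per deploy: {per_deploy_mb:.1f} MB")
```

Under these assumptions the crane image accounts for well under a quarter of the billed egress, which is consistent with the guess that other, larger images are pulled as well.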


I got Target group not attached to the Auto Scaling group error while doing blue green deployment through code deploy

I have been trying blue green deployment with code deploy, but it throws an error: The following validation error occurred: Target group not attached to the Auto Scaling group (Service: AmazonAutoScaling; Status Code: 400; Error Code: ValidationError; Request ID: cd58091b-fe83-4dcf-b090-18c3b3d2dbbc; Proxy: null)
This happens even though the policy grants the following permissions for working with target groups:
codedeploy:GetDeployment
elasticloadbalancing:DescribeTargetGroups
autoscaling:AttachLoadBalancers
autoscaling:AttachLoadBalancerTargetGroups
Could anyone help me sort out the issue? What am I missing?
In our case, we fixed it the hard way, by contacting the AWS support team. Briefly, about our app: we run a Magento application behind an Application Load Balancer with auto scaling, and deployments are managed with AWS CodeDeploy using blue/green deployment.
We spent several days figuring out what was going on. Others suggested that there might be an issue with IAM permissions, but we hadn't touched those for months and deployments had never had any issues.
The AWS rep replied that, in our case, we were hitting a known issue/limitation in AWS CodeDeploy: it currently doesn't support blue/green deployments based on ASGs that use Target Tracking scaling policies, because it doesn't attach the green ASG to the original target group, and that attachment is required when a Target Tracking scaling policy is enabled on the Auto Scaling group.
We then realized that we had made a minor change to our Auto Scaling groups' dynamic scaling policies: we had switched from "CPU utilization"-based metrics to "Request count". Reverting to CPU utilization-based metrics solved the issue, and we can run the deployment successfully again.
Hope this helps, as this error does not seem to be documented in the AWS docs.
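To spot the kind of policy described in this answer, a minimal sketch follows. It assumes the policy dictionaries use the field names of the Auto Scaling DescribePolicies response (PolicyType, TargetTrackingConfiguration, PredefinedMetricSpecification, PredefinedMetricType); the sample data and the helper name are hypothetical, and no AWS call is made.

```python
# Predefined metric for request-count-based target tracking.
REQUEST_COUNT_METRIC = "ALBRequestCountPerTarget"


def request_count_policies(scaling_policies):
    """Return the names of target tracking policies based on ALB request count."""
    flagged = []
    for policy in scaling_policies:
        if policy.get("PolicyType") != "TargetTrackingScaling":
            continue
        spec = policy.get("TargetTrackingConfiguration", {}).get(
            "PredefinedMetricSpecification", {}
        )
        if spec.get("PredefinedMetricType") == REQUEST_COUNT_METRIC:
            flagged.append(policy.get("PolicyName"))
    return flagged


# Hypothetical sample, shaped like the "ScalingPolicies" list of a response.
sample = [
    {
        "PolicyName": "cpu-50",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            }
        },
    },
    {
        "PolicyName": "requests-1000",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": REQUEST_COUNT_METRIC
            }
        },
    },
]

print(request_count_policies(sample))
```

In practice the list would come from the Auto Scaling API for the group in question; any policy this flags is a candidate for the revert described above.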

How to improve Cloud composer health?

I recently built 120 DAGs using Cloud Composer. They all functioned for a while.
They were all approximately the same. Each used a PythonOperator. Each made API calls to Google Search Console. Each collected 7-9k rows of GSC data into a pandas DataFrame, then uploaded this to GCS buckets and BigQuery (partitioned and clustered).
Occasionally they would all fail on the same day because the GSC auth token had been revoked, but that was no problem: create new credentials, upload, and continue. That situation lasted a couple of months. Now nothing runs.
From the start, the Cloud Composer health had occasional red spots, but now the health is solid red every day.
I have found documentation about how to check the health, but not how to find why the health is so poor and fix it.
Can anyone point me in the right direction?
The environment health metric depends on a Composer-managed DAG named airflow_monitoring, which is triggered periodically by the airflow-monitoring pod. If this DAG hasn't been deleted, you can check the airflow-monitoring logs to see if there are any problems related to reading the DAG's run statuses. You can also try troubleshooting the error in Cloud Logging using the filter:
resource.type="cloud_composer_environment"
severity=ERROR
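If you want to restrict that filter to a recent window, one way is to generate it with a timestamp clause. This is a minimal sketch: the two fixed lines are the filter from this answer, while the timestamp clause, the helper name, and the look-back parameter are additions assumed here for illustration.

```python
from datetime import datetime, timedelta, timezone


def composer_error_filter(hours_back, now=None):
    """Build a Cloud Logging filter for Composer errors in the last N hours."""
    now = now or datetime.now(timezone.utc)
    # Normalise to UTC so the trailing "Z" in the timestamp is correct.
    since = now.astimezone(timezone.utc) - timedelta(hours=hours_back)
    stamp = since.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Clauses on separate lines are combined with an implicit AND.
    return "\n".join(
        [
            'resource.type="cloud_composer_environment"',
            "severity=ERROR",
            f'timestamp>="{stamp}"',
        ]
    )


print(composer_error_filter(24))
```

The resulting text can be pasted into the Logs Explorer query box.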
The liveness check failure could be due to the following reasons:
Any resource constraint (memory and CPU).
A known issue with the Composer version. Please check the Composer release notes for any known issues.
The Airflow configuration option core:default_timezone. (If you’ve configured core:default_timezone in the Airflow configuration, the Composer environment health will be shown as unhealthy. It is a known issue and the Composer product team is working on the resolution.)
Refer to this documentation for information on Cloud Composer’s environment health metric.
I was lucky enough to talk to someone from Google yesterday, who said that I need to recreate my Cloud Composer environment because it has insufficient CPU. He suggested the flexible option when recreating it.

Cosmos DB Emulator hangs when pumping continuation token, segmented query

I have just added a new feature to an app I'm building. It uses the same working Cosmos/Table storage code that other features use to query and pump results segments from the Cosmos DB Emulator via the Tables API.
The emulator is running with:
/EnableTableEndpoint /PartitionCount=50
This is because I read that the emulator defaults to 5 unlimited containers and/or 25 limited and since this is a Tables API app, the table containers are created as unlimited.
The table being queried is the 6th to be created and contains just 1 document.
It either takes around 30 seconds to run a simple query, "tripping" my Too Many Requests error handling/retry in the process, or it hangs seemingly forever and returns no results, and the emulator has to be shut down.
My understanding is that with 50 partitions I can create 10 unlimited tables/collections, since each is "worth" 5. See documentation.
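The partition budget described above can be written out as a small sketch. The cost of 5 partitions per unlimited container is taken from the post, not verified against the emulator, and the function name is hypothetical.

```python
# As stated in the post: each unlimited container is "worth" 5 partitions.
PARTITIONS_PER_UNLIMITED_CONTAINER = 5


def max_unlimited_containers(partition_count):
    """Number of unlimited containers that fit into the partition budget."""
    return partition_count // PARTITIONS_PER_UNLIMITED_CONTAINER


print(max_unlimited_containers(25))  # the default of 5 mentioned above
print(max_unlimited_containers(50))  # the /PartitionCount=50 setting
```

By this reckoning a 6th table is well within the budget when the emulator runs with /PartitionCount=50.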
I have tried with rate limiting on and off, and jacked the RU/s to 10,000 on the table. It always fails to query this one table. The data, including the files on disk, has been cleared many times.
It seems like a bug in the emulator. Note that the "Sorry..." error that I would expect to see upon creation of the 6th unlimited table, as per the docs, is never encountered.
After switching to a real Cosmos DB instance on Azure, this is looking like a problem with my dodgy code.
Confirmed: my dodgy code.
Stand down everyone. As you were.

Firebase Functions returns error of Bandwidth Exhausted

We are using Firebase Functions with a few different HTTP functions.
One of the functions runs via a manual trigger from our website. It then pulls in a lot of data from an external resource and saves it into our Firestore database. Our function resources are Node.js 10, 1 GB of Memory and 540s before it times out.
However, when we have large datasets to pull in, e.g. 5,000-10,000 records to write to the database, we start running into issues. On large data sets we receive this error:
8 RESOURCE_EXHAUSTED: Bandwidth exhausted
The full error on Firebase Functions Health Dashboard logs looks like this:
Error: 8 RESOURCE_EXHAUSTED: Bandwidth exhausted
at Object.callErrorFromStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:176:52)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:342:141)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:305:181)
at Http2CallStream.outputStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:117:74)
at Http2CallStream.maybeOutputStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:156:22)
at Http2CallStream.endCall (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:142:18)
at ClientHttp2Stream.stream.on (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:420:22)
at ClientHttp2Stream.emit (events.js:198:13)
at ClientHttp2Stream.EventEmitter.emit (domain.js:466:23)
Our Firebase project is on the Blaze plan and is connected to an active billing account on GCP.
Upon inspection on GCP, it seems we are NOT exceeding our writes-per-minute quota, as previously thought; however, we are exceeding our Cloud Build limit. We also use batched writes when saving data to Firestore from within the function, which also seems to reduce the number of database writes.
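A minimal sketch of splitting a large import into batches is shown below. It only does the chunking and does not call Firestore; the limit of 500 writes per batch is an assumption based on the limit documented for batched writes at the time, and the helper name is hypothetical. The original function runs on Node.js, so this Python version is illustrative only.

```python
from itertools import islice

# Assumption: at most 500 writes per Firestore batch.
MAX_WRITES_PER_BATCH = 500


def chunked(records, size=MAX_WRITES_PER_BATCH):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Stand-in for 10,000 imported records.
records = range(10_000)
batches = list(chunked(records))
print(len(batches), "batches,", len(batches[0]), "records in the first")
```

Each chunk would then be committed as one batched write, ideally with a pause or retry between commits so a large import does not hit the service all at once.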
We don't use Cloud Build, so I assume that Firebase Functions uses Cloud Build in the back end to run the functions or something, but I can't find any documentation on the matter. We also have a few firestore database functions that run when documents are created. Not sure if that uses Cloud build in the back end or not.
Any idea why this would happen? Whenever it does, our function is terminated with that error, so only half of our data gets imported. The data import works flawlessly with smaller amounts of data.
Cloud Build is used during the deployment of Cloud Functions. If you check this documentation you can see that:
Deployments work by uploading an archive containing your function's source code to a Google Cloud Storage bucket. Once the source code has been uploaded, Cloud Build automatically builds your code into a container image and pushes that image to Container Registry. Cloud Functions uses that image to create the container that executes your function.
This by itself is not enough to justify the charges you are seeing, but if you check the container image documentation it says:
Because the entire build process takes place within the context of your project, the project is subject to the pricing of the included resources:
For Cloud Build pricing, see the Pricing page. This process uses the default instance size of Cloud Build, as these instances are pre-warmed and are available more quickly. Cloud Build does provide a free tier: please review the pricing document for further details.
So with that information in mind, my educated guess is that your website triggers the HTTP function often enough to make Cloud Functions scale it up with new instances, each of which triggers a build of the container that hosts the function and is billed as Cloud Build usage. To keep doing what you are doing, you will have to increase your Cloud Build quota to meet the demand from your website.
It turned out there was a Firestore trigger firing on new records of the same type I was importing.
In short, I was creating thousands of records in a collection, and the Firestore trigger (function) fired for every one of them. What I did not know at the time is that each run of the trigger created a new build process in the background, which is not documented anywhere.

google cloud compute load balancer 504 error

I have the following setup.
2 x gce n1.standard instances running nginx/php5-fpm
1 x cloud SQL d8 instance with both GCE instances connected to it.
The front-end servers are running a stripped-down version of osCommerce-2.3.3.4 with NO admin section on the front end. It's just the catalog portion of osC.
I ran a load test with Load Impact, and the site is unusable at around 50-100 users. I only need to support an absolute maximum of 500 users at any given time. We average around 130-170 on our current server.
I am not looking for a complete explanation but any helpful places to check, things to try, stuff to read, I just need a direction to go in to get this cloud platform working like we wanted.
Thanks in advance.
