Losing connection to Redis Service - symfony

We have a Symfony application deployed on a Swisscom-provided Cloud Foundry instance. Additionally, we're using a Redis service provided by Swisscom for caching.
It has now happened twice that we get a timeout for the Redis connection, which causes our application to fail:
Redis connection failed (connect() failed: Connection timed out): redis://password@domain.service.consul:47133
Some technical information:
symfony/symfony (v3.3.9)
predis/predis (v1.1.1)
cf version 6.32.0+0191c33d9.2017-09-26
Our config.yml looks like this for caching:
framework:
    cache:
        system: cache.adapter.apcu
        default_redis_provider: redis://%redis_password%@%redis_host%:%redis_port%
        pools:
            redis_pool:
                adapter: cache.adapter.redis
                public: true
                default_lifetime: 0
                provider: cache.default_redis_provider
and it is used by a service defined like this:
tag_aware_cache:
    class: Symfony\Component\Cache\Adapter\TagAwareAdapter
    arguments: [ '@redis_pool' ]
To my understanding we aren't using a persistent connection to Redis, and it usually works fine.
The only solution I have found so far to get the application back into a stable, running state is to redeploy the whole application, which isn't really a good solution.
In particular, I don't understand what the root cause could be.
How can I check this myself, and can Swisscom confirm that the Redis service itself runs fully stable?

You can access the service directly by using cf ssh to your app. The complete process for accessing your service via cf ssh is described in Swisscom's documentation: https://docs.developer.swisscom.com/devguide/deploy-apps/ssh-apps.html
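For example, a hedged sketch (the app name is a placeholder; host, port and password follow the shape of the error message above, so substitute the values from your own service binding):
cf ssh my-app -L 6379:domain.service.consul:47133   # tunnel the Redis port to localhost
redis-cli -h 127.0.0.1 -p 6379 -a password ping     # should answer PONG if the service is reachable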
Your app should be able to handle connectivity issues by itself. Usually a simple retry keeps the app from crashing.
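For illustration, a minimal retry sketch in PHP, assuming the redis_pool service from the question (a PSR-6 cache pool); the broad catch, attempt count and back-off are placeholder choices, not something prescribed by Symfony or Swisscom:
// Naive retry around a cache read so a transient Redis timeout does not
// immediately take the request down. Tune attempts and back-off to your needs.
function getCacheItemWithRetry(\Psr\Cache\CacheItemPoolInterface $pool, $key, $maxAttempts = 3)
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return $pool->getItem($key);
        } catch (\Exception $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // still failing after the last attempt, give up
            }
            usleep(200000 * $attempt); // back off briefly before retrying
        }
    }
}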

Related

Use GCP debugger but got permission error

I added @google-cloud/debug-agent to my Node.js project, which is deployed on GKE.
But I get this error:
restify listening to http://[::]:80
@google-cloud/debug-agent Failed to re-register debuggee nodejs-bot: Error: The caller does not have permission
@google-cloud/debug-agent Failed to re-register debuggee nodejs-bot: Error: The caller does not have permission
@google-cloud/debug-agent Failed to re-register debuggee nodejs-bot: Error: The caller does not have permission
@google-cloud/debug-agent Failed to re-register debuggee nodejs-bot: Error: The caller does not have permission
I have checked that my GKE cluster has the debug permission; I don't know why the service doesn't have permission.
Here is the code I define in my index.ts:
import * as tracer from '@google-cloud/trace-agent';
tracer.start();
import * as debug from '@google-cloud/debug-agent';
debug.start();
This issue can be resolved by doing the following:
1 - Create a new cluster with these permissions enabled (i.e. Cloud Debugger / Cloud Platform set to 'Enabled') and with the required scopes. [1]
Example:
$ gcloud container clusters create example-cluster-name --scopes https://www.googleapis.com/auth/cloud_debugger --zone
2 - You can use the same YAML config files you used to deploy your original workloads in the new cluster. Make sure you have the required scopes for this to work.
You can review how authentication and scopes work using [2] and [3].
I found the issue was caused by Workload Identity, so I simply disabled this feature to fix the problem.
Because I had chosen to enable the Workload Identity feature, every pod that needs to connect to a GCP service needs a service account bound to it; otherwise the permission is blocked.
https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity
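If you would rather keep Workload Identity enabled, the usual alternative is to bind the pods' Kubernetes service account to a Google service account that holds the Cloud Debugger role (roles/clouddebugger.agent). A hedged sketch, where project, namespace and account names are placeholders:
# Let the Kubernetes service account impersonate the Google service account
gcloud iam service-accounts add-iam-policy-binding debug-sa@MY_PROJECT.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:MY_PROJECT.svc.id.goog[default/my-ksa]"
# Point the Kubernetes service account used by the pods at that Google service account
kubectl annotate serviceaccount my-ksa --namespace default \
    iam.gke.io/gcp-service-account=debug-sa@MY_PROJECT.iam.gserviceaccount.com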

Google Cloud Composer The server encountered a temporary error and could not complete your request

After running for a couple of days, the Google Cloud Composer web UI returns a 502 Server Error indefinitely:
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
The only way to fix it is to recreate the Composer environment, though after running for a couple of days the new environment crashes with the same error.
Image version: composer-1.4.0-airflow-1.10.0
Python version: 3
Does anyone know what the root cause is?
I don't run Cloud Composer, but I suspect there is a case where all of the webserver's worker threads have exited. This can happen when Airflow hits an extended timeout reading from or writing to the database, either due to a held lock or to network connection issues. The webserver is probably configured to restart if it fully exits, but in some cases the airflow webserver command hangs on without exiting even though all of its web workers have exited.
Alternatively, the 502 may be coming from the identity provider used for GCP. If that's the case, you might find you need to sign out of your Google login and use the sign-in flow provided by Airflow (check whether the UI responds in a private browser session or a signed-out session).
I was facing the same 502 error and it turned out to be an issue with the DAG itself. As mentioned in
https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags
"The web server parses the DAG definition files, and a 502 gateway timeout can occur if there are errors in the DAG."
This is visible in Composer / Monitoring.
Our web server was affected by an issue with the DAG itself. We solved it by deleting the recently added DAGs; after a couple of minutes the Airflow UI was up again.
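One way to surface such DAG errors without the web UI is to parse the DAGs through the Airflow CLI wrapper; a hedged sketch, with environment name and location as placeholders (list_dags is the Airflow 1.10 subcommand):
gcloud composer environments run MY_ENVIRONMENT --location us-central1 list_dags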

Amazon CodeDeploy. Deployment failed

I have been configuring AWS CodeDeploy for a few days and my first deployment is failing. The error message I get reads "The overall deployment failed because too many individual instances failed deployment, too few healthy instances are available for deployment, or some instances in your deployment group are experiencing problems."
To get more detailed info I have installed the AWS CodeDeploy agent on the Windows instance, but it appears not to be working. All I can find in the code-deploy-agent-log.txt file are these repetitive lines:
2016-05-31 16:05:24 DEBUG [codedeploy-agent(4872)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Sleeping 90 seconds.
2016-05-31 16:06:55 DEBUG [codedeploy-agent(4872)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Calling PollHostCommand:
2016-05-31 16:06:55 INFO [codedeploy-agent(4872)]: Version file found in C:/ProgramData/Amazon/CodeDeploy/.version.
2016-05-31 16:06:55 ERROR [codedeploy-agent(4872)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Missing credentials - please check if this instance was started with an IAM instance profile
My question is: how can I get more information about the error message I am getting on deployments? Which credentials am I missing (or specifying incorrectly) that cause the error message in the log file?
I think you are missing the service-role-arn while creating your deployment group. The service role ARN allows AWS CodeDeploy to act on your behalf when interacting with AWS services; it is the ARN of the CodeDeploy role you may have created earlier.
In addition, please make sure that your deployment configuration is set to CodeDeployDefault.OneAtATime. This avoids taking all instances down if you push an incorrect or failing build.
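For reference, a hedged AWS CLI sketch that sets both of those options when creating the deployment group (application name, group name, tag filter and role ARN are placeholders):
aws deploy create-deployment-group \
    --application-name MyApp \
    --deployment-group-name MyDeploymentGroup \
    --service-role-arn arn:aws:iam::123456789012:role/CodeDeployServiceRole \
    --deployment-config-name CodeDeployDefault.OneAtATime \
    --ec2-tag-filters Key=Name,Value=MyInstance,Type=KEY_AND_VALUE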
I tried Suken Shah's steps, but they didn't solve it for me. What did solve it:
1) Creating an IAM instance profile, say Webserver.
2) Adding AWSCodeDeployRole to the instance profile Webserver.
3) Adding the following services to AWSCodeDeployRole's trust relationship: "codedeploy.amazonaws.com", "ec2.amazonaws.com", "codedeploy.MY_REGION.amazonaws.com" (a sample trust policy is sketched below).
4) Rebooting the EC2 instance.
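A hedged sketch of what that trust relationship document could look like (the region token MY_REGION is a placeholder, as in the step above):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "codedeploy.amazonaws.com",
          "ec2.amazonaws.com",
          "codedeploy.MY_REGION.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}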
Make sure the role you use for EC2 has the 'AWSCodeDeployRole' policy and that its trust relationship includes the 'ec2.amazonaws.com' service. If you need to change the role, restart the EC2 instance.

Running Apache spark job from Spring Web application using Yarn client or any alternate way

I have recently started using Spark and I want to run a Spark job from a Spring web application.
I am running a web application in a Tomcat server using Spring Boot. My web application receives a REST web service request, and based on that it needs to trigger a Spark calculation job in a YARN cluster. Since my job can take long to run and accesses data from HDFS, I want to run the Spark job in yarn-cluster mode, and I don't want to keep a Spark context alive in my web layer. Another reason is that my application is multi-tenant, so each tenant can run its own job; in yarn-cluster mode each tenant's job can start its own driver and run in its own Spark cluster. In the web app JVM, I assume I can't run multiple Spark contexts in one JVM.
I want to trigger Spark jobs in yarn-cluster mode from Java code in my web application. What is the best way to achieve this? I am exploring various options and would appreciate your guidance on which one is best:
1) I can use the spark-submit command line shell to submit my jobs. But to trigger it from my web application I would need to use either the Java ProcessBuilder API or some package built on top of it. This has two issues. First, it doesn't feel like a clean way of doing it; I should have a programmatic way of triggering my Spark applications. Second, I would lose the ability to monitor the submitted application and get its status; the only crude way would be to read the output stream of the spark-submit shell, which again doesn't seem like a good approach.
2) I tried using the YARN Client to submit the job from the Spring application. The following is the code I use to submit a Spark job using the YARN Client:
// The Hadoop/YARN configuration is picked up from the classpath (core-site.xml, yarn-site.xml)
Configuration config = new Configuration();
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf conf = new SparkConf();
// sparkArgs holds the usual spark-submit style arguments (--jar, --class, --arg, ...)
ClientArguments cArgs = new ClientArguments(sparkArgs, conf);
Client client = new Client(cArgs, config, conf);
client.run();
But when I run the above code, it tries to connect to localhost only. I get this error:
15/08/05 14:06:10 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
15/08/05 14:06:12 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
So I don't think it can connect to the remote machine.
Please suggest the best way of doing this with the latest version of Spark. Later I plan to deploy this entire application on Amazon EMR, so the approach should work there as well.
Thanks in advance
Spark JobServer might help: https://github.com/spark-jobserver/spark-jobserver. This project receives RESTful web requests and starts a Spark job; results are returned as a JSON response.
I also had similar issues trying to run a Spark app that connects to a YARN cluster: with no cluster config available, it was trying to connect to the local machine as if it were the main node of the cluster, which obviously failed.
It worked for me once I placed core-site.xml and yarn-site.xml on the classpath (src/main/resources in a typical sbt or Maven project structure); the application then connected to the cluster correctly.
When using spark-submit, the location of those files is typically specified by the HADOOP_CONF_DIR environment variable, but for a stand-alone application that had no effect.
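As an illustration, the relevant entry in yarn-site.xml points the client at the real ResourceManager instead of the 0.0.0.0:8032 default seen in the log; the hostname below is a placeholder:
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>my-resourcemanager.example.com:8032</value>
  </property>
</configuration>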

What is the best practice to use profiler data from production system?

Assume I have a running Symfony application and it encounters an exception with the following configuration:
framework:
    profiler:
        lifetime: 604800
        only_exceptions: true
Then there should be a dump with profiling information.
But what happens next?
Do I just copy the file to my own local profiler data folder and start the profiler?
What are the best practices for handling and debugging exceptions occurring on the production system?
I think enabling the profiler even with only_exceptions: true should have a performance impact, because to display something on an exception the data has to be collected first in any case.
If you want to see the profiler data from another host, you can export it and import it locally.
For me the more correct way is to simply log events, or email the exception with its stack trace to an admin, via a kernel exception listener. Within the listener you can access any info you need to send or log, e.g. the request stack, logged-in user info, etc.
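A minimal sketch of such a listener, assuming Symfony 3.3; the class name, the use of the PSR logger (instead of a mailer) and the service registration style are illustrative choices, not part of the original answer:
<?php
namespace AppBundle\EventListener;

use Psr\Log\LoggerInterface;
use Symfony\Component\HttpKernel\Event\GetResponseForExceptionEvent;

class ExceptionLogListener
{
    private $logger;

    public function __construct(LoggerInterface $logger)
    {
        $this->logger = $logger;
    }

    public function onKernelException(GetResponseForExceptionEvent $event)
    {
        $exception = $event->getException();
        $request = $event->getRequest();

        // Log (or mail) whatever you need: message, trace, URL, user, ...
        $this->logger->critical('Unhandled exception: '.$exception->getMessage(), [
            'trace' => $exception->getTraceAsString(),
            'url'   => $request->getUri(),
        ]);
    }
}
Registered as a tagged service (services.yml):
services:
    AppBundle\EventListener\ExceptionLogListener:
        arguments: ['@logger']
        tags:
            - { name: kernel.event_listener, event: kernel.exception }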
