Airflow UI loading extremely slow after upgrading version - airflow

I upgraded my airflow cluster from 1.7.1.3 to 1.10.1. After the upgrade, the main UI page loads very slowly. I tried reducing the page size and increasing the number of webserver workers, but that didn't help. I am using the following configs -
page_size = 15
workers = 4
web_server_master_timeout = 120
web_server_worker_timeout = 120
log_fetch_timeout_sec = 5
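(For context, all of these keys live in the [webserver] section of airflow.cfg, assuming an otherwise default config:)

```ini
[webserver]
page_size = 15
workers = 4
web_server_master_timeout = 120
web_server_worker_timeout = 120
log_fetch_timeout_sec = 5
```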
Can someone help me with what is happening here?

Related

Unable to disable GPU in CefSharp settings

CefSharp v99.2.120
I am getting the following error in my CefSharp application:
ERROR:gpu_process_host.cc(967) GPU process exited unexpectedly: exit_code=532462766 WARNING:gpu_process_host.cc(1273) The GPU process has crashed 1 time(s)
This repeats until it has crashed three times, ending with:
FATAL:gpu_data_manager_impl_private.cc(417) GPU process isn't usable. Goodbye.
I am using the following settings:
CefSettings settings = new CefSettings();
settings.CefCommandLineArgs.Add("disable-gpu");
settings.CefCommandLineArgs.Add("disable-gpu-compositing");
settings.CefCommandLineArgs.Add("disable-gpu-vsync");
settings.CefCommandLineArgs.Add("disable-software-rasterizer");
The app still crashes with the same error.
I then added:
settings.DisableGpuAcceleration();
Still the same.
I expected the GPU to be disabled, but the settings above don't change anything.

makeClusterPSOCK ERROR workers failed to connect

I encounter this error when running Seurat in R:
Error in makeClusterPSOCK(workers, ...) :
Cluster setup failed. 4 of 4 workers failed to connect.
This never happened before I installed R 4.1.
I have tried the following, to no avail:
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
cl <- parallel::makeCluster(2, setup_strategy = "sequential")
Any suggestions (and maybe a little explanation, because I am still relatively new to R)? My computer overheats, and I believe the commands below are not working:
options(future.globals.maxSize = 8000 * 1024^2)
plan("multiprocess", workers = 4)
R 4.1 / RStudio has all sorts of issues with parallel right now. I experienced similar issues with the CB2 package on R 4.1, which also uses parallel for multicore support. This is probably related to an as-yet-unpatched bug in R 4.1 (mentioned here and here), though there is now a specific fix in R-devel 80472. If your issues are unresolved with the advice from those threads, I suggest rolling back to a previous R version that doesn't present the issue.

unable to specify master_type in MLEngineTrainingOperator

I am using Airflow to schedule a pipeline that will result in training a scikit-learn model on AI Platform. I use this DAG to train it:
with models.DAG(JOB_NAME,
                schedule_interval=None,
                default_args=default_args) as dag:

    # Tasks definition
    training_op = MLEngineTrainingOperator(
        task_id='submit_job_for_training',
        project_id=PROJECT,
        job_id=job_id,
        package_uris=[os.path.join(TRAINER_BIN)],
        training_python_module=TRAINER_MODULE,
        runtime_version=RUNTIME_VERSION,
        region='europe-west1',
        training_args=[
            '--base-dir={}'.format(BASE_DIR),
            '--event-date=20200212',
        ],
        python_version='3.5')

    training_op
The training package loads the desired CSV files and trains a RandomForestClassifier on them.
This works fine until the number and size of the files increase. Then I get this error:
ERROR - The replica master 0 ran out-of-memory and exited with a non-zero status of 9(SIGKILL). To find out more about why your job exited please check the logs:
The total size of the files is around 4 GB. I don't know which machine type is used by default, but it seems insufficient. Hoping it would solve the memory consumption issue, I tried changing the classifier's n_jobs parameter from -1 to 1, with no more luck.
Looking at the code of MLEngineTrainingOperator and the documentation, I added a custom scale_tier and a master_type of n1-highmem-8 (8 vCPUs, 52 GB of RAM), like this:
with models.DAG(JOB_NAME,
                schedule_interval=None,
                default_args=default_args) as dag:

    # Tasks definition
    training_op = MLEngineTrainingOperator(
        task_id='submit_job_for_training',
        project_id=PROJECT,
        job_id=job_id,
        package_uris=[os.path.join(TRAINER_BIN)],
        training_python_module=TRAINER_MODULE,
        runtime_version=RUNTIME_VERSION,
        region='europe-west1',
        master_type="n1-highmem-8",
        scale_tier="custom",
        training_args=[
            '--base-dir={}'.format(BASE_DIR),
            '--event-date=20200116',
        ],
        python_version='3.5')

    training_op
This resulted in another error:
ERROR - <HttpError 400 when requesting https://ml.googleapis.com/v1/projects/MY_PROJECT/jobs?alt=json returned "Field: master_type Error: Master type must be specified for the CUSTOM scale tier.">
I don't know what is wrong, but it appears that this is not the way to do it.
EDIT: Using the command line, I managed to launch the job:
gcloud ai-platform jobs submit training training_job_name \
  --packages=gs://path/to/package/package.tar.gz \
  --python-version=3.5 \
  --region=europe-west1 \
  --runtime-version=1.14 \
  --module-name=trainer.train \
  --scale-tier=CUSTOM \
  --master-machine-type=n1-highmem-16
However, I would like to do this in Airflow.
Any help would be much appreciated.
EDIT: My environment used an old version of Apache Airflow, 1.10.3, in which the master_type argument was not present. Updating to 1.10.6 solved this issue.
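If you are unsure whether your installed operator version accepts a given argument, you can check its constructor signature with inspect before building the DAG. Below is a generic sketch using a hypothetical stand-in class (with a real install you would inspect the actual MLEngineTrainingOperator instead):

```python
import inspect

# StandInOperator is a hypothetical stand-in for MLEngineTrainingOperator;
# with a real airflow 1.10.x install you would instead do:
#   from airflow.contrib.operators.mlengine_operator import MLEngineTrainingOperator
class StandInOperator:
    def __init__(self, task_id, project_id, job_id,
                 master_type=None, scale_tier=None, **kwargs):
        pass

# Versions that support the kwarg expose it in the constructor signature;
# older versions (like 1.10.3 here) would not.
params = inspect.signature(StandInOperator.__init__).parameters
print("master_type" in params)  # prints: True
```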

CF_STAGING_TIMEOUT for deploying Shiny app on Bluemix with Cloud Foundry

I have a Shiny app that runs as expected locally, and I am working to deploy it to Bluemix using Cloud Foundry. I am using this buildpack.
The default staging time for apps to build is 15 minutes, but that is not long enough to install R and the packages needed for my app. If I try to push my app with the defaults, I get an error about running out of time:
Error restarting application: sherlock-topics failed to stage within 15.000000 minutes
I changed my manifest.yml to increase the staging time:
applications:
- name: sherlock-topics
  memory: 728M
  instances: 1
  buildpack: git://github.com/beibeiyang/cf-buildpack-r.git
  env:
    CRAN_MIRROR: https://cran.rstudio.com
    CF_STAGING_TIMEOUT: 45
    CF_STARTUP_TIMEOUT: 9999
Then I also changed the staging time for the CLI before pushing:
cf set-env sherlock-topics CF_STAGING_TIMEOUT 45
cf push sherlock-topics
What happens then is that the app tries to deploy. It installs R in the container and installs packages, but only for about 15 minutes (a little longer). When it reaches the first new task (package) after the 15-minute mark, it errors out, but with a different and sadly uninformative message:
Staging failed
Destroying container
Successfully destroyed container
FAILED
Error restarting application: StagingError
There is nothing in the logs but info about the libraries being installed and then Staging failed.
Any ideas about why it isn't continuing to stage past the 15-minute mark, even after I increase CF_STAGING_TIMEOUT?
The platform operator controls the hard limits for the staging timeout and the application startup timeout. CF_STAGING_TIMEOUT and CF_STARTUP_TIMEOUT are cf CLI configuration options that tell the CLI how long to wait for staging and app startup.
See docs here for reference:
https://docs.cloudfoundry.org/devguide/deploy-apps/large-app-deploy.html#consid_limits
As an end user, it's not possible to exceed the hard limits put in place by your platform operator.
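Note also that these CLI options are read from the local shell environment where you run the push, not from the app's environment (cf set-env only changes variables inside the app container). A minimal sketch, using the app name from the question:

```shell
# The cf CLI reads CF_STAGING_TIMEOUT (in minutes) from the local shell
# environment, not from the app container, so export it before pushing:
export CF_STAGING_TIMEOUT=45
# cf push sherlock-topics   <- the actual push, shown here as a comment
```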
I ran into the exact same issue. I was also trying to deploy a Shiny app with quite a number of dependencies.
The trick is to add the num_threads property to the r.yml file. This significantly speeds up the build.
Here's how one might look:
packages:
- packages:
  - name: bupaR
  - name: edeaR
num_threads: 8
See https://docs.cloudfoundry.org/buildpacks/r/index.html for more info

"GC overhead limit exceeded" on cache of large dataset into spark memory (via sparklyr & RStudio)

I am very new to the big data technologies I am attempting to work with, but have so far managed to set up sparklyr in RStudio to connect to a standalone Spark cluster. Data is stored in Cassandra, and I can successfully bring large datasets into Spark memory (cache) to run further analysis on them.
However, recently I have been having a lot of trouble bringing one particularly large dataset into Spark memory, even though the cluster should have more than enough resources (60 cores, 200 GB RAM) to handle a dataset of its size.
I thought that by limiting the data being cached to just a few select columns of interest I could overcome the issue (using the answer code from my previous query here), but it does not. What happens is that the jar process on my local machine ramps up to take over all the local RAM and CPU resources, the whole process freezes, and on the cluster executors keep getting dropped and re-added. Weirdly, this happens even when I select only one row for caching (which should make this dataset much smaller than other datasets I have had no problem caching into Spark memory).
I've had a look through the logs, and these seem to be the only informative errors/warnings early on in the process:
17/03/06 11:40:27 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 33813 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
17/03/06 11:40:27 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 8167), so marking it as still running
...
17/03/06 11:46:59 WARN TaskSetManager: Lost task 3927.3 in stage 0.0 (TID 54882, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 3863), so marking it as still running
17/03/06 11:46:59 WARN TaskSetManager: Lost task 4300.3 in stage 0.0 (TID 54667, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 14069), so marking it as still running
And then, after 20 minutes or so, the whole job crashes with:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I've changed my connection config to increase the heartbeat interval (spark.executor.heartbeatInterval: '180s'), and I have seen how to increase memoryOverhead on a YARN cluster (using spark.yarn.executor.memoryOverhead), but not on a standalone cluster.
In my config file, I have experimented by adding each of the following settings one at a time (none of which have worked):
spark.memory.fraction: 0.3
spark.executor.extraJavaOptions: '-Xmx24g'
spark.driver.memory: "64G"
spark.driver.extraJavaOptions: '-XX:MaxHeapSize=1024m'
spark.driver.extraJavaOptions: '-XX:+UseG1GC'
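One note on the attempts above: Spark does not allow maximum heap size (-Xmx) to be set through extraJavaOptions; executor and driver heap are controlled by the dedicated memory properties instead. A config fragment in this document's yml style (values illustrative):

```yaml
# Heap size must go through the dedicated memory properties;
# Spark rejects -Xmx settings passed via extraJavaOptions:
spark.executor.memory: "32G"
spark.driver.memory: "8G"
```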
UPDATE: My full current yml config file is as follows:
default:
  # local settings
  sparklyr.sanitize.column.names: TRUE
  sparklyr.cores.local: 3
  sparklyr.shell.driver-memory: "8G"
  # remote core/memory settings
  spark.executor.memory: "32G"
  spark.executor.cores: 5
  spark.executor.heartbeatInterval: '180s'
  spark.ext.h2o.nthreads: 10
  spark.cores.max: 30
  spark.memory.storageFraction: 0.6
  spark.memory.fraction: 0.3
  spark.network.timeout: 300
  spark.driver.extraJavaOptions: '-XX:+UseG1GC'
  # other configs for spark
  spark.serializer: org.apache.spark.serializer.KryoSerializer
  spark.executor.extraClassPath: /var/lib/cassandra/jar/guava-18.0.jar
  # cassandra settings
  spark.cassandra.connection.host: <cassandra_ip>
  spark.cassandra.auth.username: <cassandra_login>
  spark.cassandra.auth.password: <cassandra_pass>
  spark.cassandra.connection.keep_alive_ms: 60000
  # spark packages to load
  sparklyr.defaultPackages:
  - "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M1"
  - "com.databricks:spark-csv_2.11:1.3.0"
  - "com.datastax.cassandra:cassandra-driver-core:3.0.2"
  - "com.amazonaws:aws-java-sdk-pom:1.10.34"
So my questions are:
Does anyone have any ideas about what to do in this instance?
Are there config settings I can change to help with this issue?
Alternatively, is there a way to import the Cassandra data in batches with RStudio/sparklyr as the driver?
Or, alternatively again, is there a way to munge/filter/edit the data as it is brought into the cache so that the resulting table is smaller (similar to SQL querying, but with more complex dplyr syntax)?
OK, I've finally managed to make this work!
I'd initially tried the suggestion of user6910411 to decrease the Cassandra input split size, but this failed in the same way. After playing around with lots of other things, today I tried changing that setting in the opposite direction:
spark.cassandra.input.split.size_in_mb: 254
By INCREASING the split size, there were fewer Spark tasks, and thus less overhead and fewer calls to the GC. It worked!
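For reference, in a sparklyr yml config this setting sits alongside the other Cassandra options, under the same default: block as the config above (the value is the one that worked here; your data sizes may call for a different one):

```yaml
default:
  # larger input splits -> fewer tasks, less scheduling overhead, fewer GC calls
  spark.cassandra.input.split.size_in_mb: 254
```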
