How to restart the SHA256MigrationJob without restarting Artifactory? - artifactory

I keep encountering the below error message in the sha256_migration.log. It doesn't restart after failure, however if I restart the artifactory service it begins the SHA256 migration from where it left off until it fails again.
2018-11-13 10:24:35,060 [art-exec-3] [ERROR] (o.a.s.j.m.s.Sha256MigrationJob:78) - Caught unexpected exception during SHA256 Migration job, operation will break.
org.springframework.core.task.TaskRejectedException: Executor [org.artifactory.schedule.ArtifactoryConcurrentExecutor#70a2137a] did not accept task: org.artifactory.schedule.aop.AsyncAdvice$$Lambda$654/1640835804#7dbf55d3
at org.springframework.core.task.support.TaskExecutorAdapter.submit(TaskExecutorAdapter.java:93)
at org.springframework.scheduling.concurrent.ConcurrentTaskExecutor.submit(ConcurrentTaskExecutor.java:143)
at org.artifactory.schedule.aop.AsyncAdvice.submitWorkQueueTask(AsyncAdvice.java:235)
at org.artifactory.schedule.aop.AsyncAdvice.submit(AsyncAdvice.java:217)
at org.artifactory.schedule.aop.AsyncAdvice.executeInvocation(AsyncAdvice.java:146)
at org.artifactory.schedule.aop.AsyncAdvice.invoke(AsyncAdvice.java:124)
at org.artifactory.schedule.aop.AsyncAdvice.invoke(AsyncAdvice.java:62)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:207)
at com.sun.proxy.$Proxy144.updateSha2(Unknown Source)
at org.artifactory.storage.jobs.migration.sha256.Sha256MigrationJob.migrationLogic(Sha256MigrationJob.java:134)
at org.artifactory.storage.jobs.migration.MigrationJobBase.migrationLoop(MigrationJobBase.java:106)
at org.artifactory.storage.jobs.migration.MigrationJobBase.runMigration(MigrationJobBase.java:83)
at org.artifactory.storage.jobs.migration.MigrationJobBase.onExecute(MigrationJobBase.java:73)
at org.artifactory.schedule.quartz.QuartzCommand.execute(QuartzCommand.java:48)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.artifactory.concurrent.ArtifactoryRunnable.run(ArtifactoryRunnable.java:30)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.RejectedExecutionException: Task org.artifactory.concurrent.ArtifactoryRunnable#4afb003 rejected from java.util.concurrent.ThreadPoolExecutor#33daf5aa[Running, pool size = 64, active threads = 64, queued tasks = 10000, completed tasks = 120723]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.artifactory.schedule.ArtifactoryConcurrentExecutor.execute(ArtifactoryConcurrentExecutor.java:69)
at org.springframework.core.task.support.TaskExecutorAdapter.submit(TaskExecutorAdapter.java:88)
... 19 common frames omitted
My artifactory.system.properties regarding sha256migration
##SHA2 Migration block
artifactory.sha2.migration.job.enabled=true
artifactory.sha2.migration.job.queue.workers=100
My setup:
Cloned production instances of Artifactory infrastructure into test instance (save for IP address and DNS records).
Ensure that DB and filestore configurations (db.properties and binarystore.xml) were updated accordingly on the cloned instances.
Things I've tried without luck:
I ran the Artifactory GC couple of times.
Increase the CPU count to 16
Increase the RAM to 16G
Ensure that I am running the latest Oracle Java 8u192
What I know:
It keep running fine for a while until it crashes.
When I restart artifactory service, the migration resumes and the total artifacts to migrate is lower.
I cannot keep restarting Artifactory in production to finish the sha256migrationjob, I have over 500k artifacts.
My question:
Any way method to restart the SHA256MigrationJob without restarting Artifactory?
Is there a way to find the artifact that it has trouble migrating to SHA256?
In the stack trace above, I feel the issue is at com.sun.proxy.$Proxy144.updateSha2(Unknown Source).
-- Workaround --
I ended up creating a new VM and installed a clean copy Artifactory 6.5.3 (with latest Oracle Java 8 Server-JRE). In the above issue, I was doing an in-place upgrade, just in a new folder.
I moved the necessary files in the etc to the new VM; such as master.key, binarystore.xml, db.properties and etc. I then executed the bin/installService.sh [user] [group], this creates the /etc/opt/jfrog/artifactory configuration symlink/folder move. My filestore and artifactory database are running on different VMs, thus only Artifactory and it's file configurations needed to be ported.
The new Artifactory 6.5.3 version started up without issues.
The sha256migrationjob is actually running without any problems. Last upgrade run I did, it worked fine without the job dying.
Notes: I did also sanely adjust the configuration values on the queue workers. https://www.jfrog.com/confluence/display/RTF/Checksum-Based+Storage#Checksum-BasedStorage-ConfiguringtheMigrationProcess

tl;dr decrease artifactory.sha2.migration.job.queue.workers to somewhere around 2 - 2 * number of cores.
What you are experiencing is an exhaustion of the ThreadPoolExecutor.
The size of the thread pool by default is 4 * [number of cores].
In your case 64 (16 cores * 4)
However the limit on the number of threads you have configured for the sha256 migration job is 100.
On an artifactory instance without any load, this will not cause any failures, because there is a queue backing up the thread pull (in your case the size is 10000).
In your case the thread pull and the queue are full.
If I understand correctly you have overcame the issue. But for other people bumping into this thread, I would recommend decreasing the value of artifactory.sha2.migration.job.queue.workers to no more than half of the thread pull number of threads.
So in your case 32: (16 * 4)/2

Related

OutOfMemory Error after updating corda version to 4.1

I had a corda 3.3 test installation and recently updated it to version 4.1 and after that when I run my nodes with deployNodes script and runnodes - I always receive the following exception in node's console as soon as it starts. What can this mean? I don't have a clue what this can be caused by.
I tried to build and run the nodes without cordapps and they work, so somehow my cordapps cause this error happen. What other information should I provide to help you figure out this issue?
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
at kotlin.io.ByteStreamsKt.readBytes(IOStreams.kt:123)
at kotlin.io.ByteStreamsKt.readBytes$default(IOStreams.kt:120)
at net.corda.core.internal.InternalUtils.readFully(InternalUtils.kt:123)
at net.corda.node.internal.cordapp.JarScanningCordappLoader.getJarHash(JarScanningCordappLoader.kt:228)
at net.corda.node.internal.cordapp.JarScanningCordappLoader.toCordapp(JarScanningCordappLoader.kt:153)
at net.corda.node.internal.cordapp.JarScanningCordappLoader.loadCordapps(JarScanningCordappLoader.kt:106)
at net.corda.node.internal.cordapp.JarScanningCordappLoader.access$loadCordapps(JarScanningCordappLoader.kt:44)
at net.corda.node.internal.cordapp.JarScanningCordappLoader$cordapps$2.invoke(JarScanningCordappLoader.kt:56)
at net.corda.node.internal.cordapp.JarScanningCordappLoader$cordapps$2.invoke(JarScanningCordappLoader.kt:44)
at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
at net.corda.node.internal.cordapp.JarScanningCordappLoader.getCordapps(JarScanningCordappLoader.kt)
at net.corda.node.internal.cordapp.CordappLoaderTemplate$cordappSchemas$2.invoke(JarScanningCordappLoader.kt:422)
at net.corda.node.internal.cordapp.CordappLoaderTemplate$cordappSchemas$2.invoke(JarScanningCordappLoader.kt:389)
at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
at net.corda.node.internal.cordapp.CordappLoaderTemplate.getCordappSchemas(JarScanningCordappLoader.kt)
at net.corda.node.internal.AbstractNode.<init>(AbstractNode.kt:153)
at net.corda.node.internal.AbstractNode.<init>(AbstractNode.kt:126)
at net.corda.node.internal.Node.<init>(Node.kt:98)
at net.corda.node.internal.Node.<init>(Node.kt:97)
at net.corda.node.internal.NodeStartup.createNode(NodeStartup.kt:194)
at net.corda.node.internal.NodeStartup$initialiseAndRun$5.invoke(NodeStartup.kt:186)
at net.corda.node.internal.NodeStartup$initialiseAndRun$5.invoke(NodeStartup.kt:137)
at net.corda.node.internal.NodeStartupLogging$DefaultImpls.attempt(NodeStartup.kt:509)
at net.corda.node.internal.NodeStartup.attempt(NodeStartup.kt:137)
at net.corda.node.internal.NodeStartup.initialiseAndRun(NodeStartup.kt:185)
at net.corda.node.internal.NodeStartupCli.runProgram(NodeStartup.kt:128)
at net.corda.cliutils.CordaCliWrapper.call(CordaCliWrapper.kt:190)
at net.corda.node.internal.NodeStartupCli.call(NodeStartup.kt:83)
at net.corda.node.internal.NodeStartupCli.call(NodeStartup.kt:64)
at picocli.CommandLine.execute(CommandLine.java:1056)
Corda's usage of memory has been slowly creeping upwards. It is possible that your machine does not have enough memory to run 3/4+ nodes at the same time after upgrading to 4.
I recommend trying to run a single node with CorDapps installed and see what happens. If it is still happening then, then something else could be going wrong.
Looking at the stacktrace, it is also possible that your CorDapp itself is really, really, really big and it has gone out of memory reading and loading the CorDapp.

CF_STAGING_TIMEOUT for deploying Shiny app on Bluemix with Cloud Foundry

I have a Shiny app that runs as expected locally, and I am working to deploy it to Bluemix using Cloud Foundry. I am using this buildpack.
The default staging time for apps to build is 15 minutes, but that is not long enough to install R and the packages needed for my app. If I try to push my app with the defaults, I get an error about running out of time:
Error restarting application: sherlock-topics failed to stage within 15.000000 minutes
I changed my manifest.yml to increase the staging time:
applications:
- name: sherlock-topics
memory: 728M
instances: 1
buildpack: git://github.com/beibeiyang/cf-buildpack-r.git
env:
CRAN_MIRROR: https://cran.rstudio.com
CF_STAGING_TIMEOUT: 45
CF_STARTUP_TIMEOUT: 9999
Then I also changed the staging time for the CLI before pushing:
cf set-env sherlock-topics CF_STAGING_TIMEOUT 45
cf push sherlock-topics
What happens then is that the app tries to deploy. It installs R in the container and installs packages, but only for about 15 minutes (a little longer). When it gets to the first new task (package) after the 15 minute mark, it errors but with a different, sadly uninformative error message.
Staging failed
Destroying container
Successfully destroyed container
FAILED
Error restarting application: StagingError
There is nothing in the logs but info about the libraries being installed and then Staging failed.
Any ideas about why it isn't continuing to stage past the 15-minute mark, even after I increase CF_STAGING_TIMEOUT?
The platform operator controls the hard limit for staging timeout and application startup timeout. CF_STAGING_TIMEOUT and CF_STARTUP_TIMEOUT are cf cli configuration options that tell the cli how long to wait for staging and app startup.
See docs here for reference:
https://docs.cloudfoundry.org/devguide/deploy-apps/large-app-deploy.html#consid_limits
As an end user, it's not possible to exceed the hard limits put in place by your platform operator.
I ran into the same exact issue. I was also trying to deploy a shinyR app with a quite a number of dependencies.
The trick is to add the num_threads property to r.yml file. This will definitely speed up the build time.
Here's how one might look:
packages:
- packages:
- name: bupaR
- name: edeaR
num_threads: 8
See https://docs.cloudfoundry.org/buildpacks/r/index.html for more info

"GC overhead limit exceeded" on cache of large dataset into spark memory (via sparklyr & RStudio)

I am very new to the Big Data technologies I am attempting to work with, but have so far managed to set up sparklyr in RStudio to connect to a standalone Spark cluster. Data is stored in Cassandra, and I can successfully bring large datsets into Spark memory (cache) to run further analysis on it.
However, recently I have been having a lot of trouble bringing in one particularly large dataset into Spark memory, even though the cluster should have more than enough resources (60 cores, 200GB RAM) to handle a dataset of its size.
I thought that by limiting the data being cached to just a few select columns of interest I could overcome the issue (using the answer code from my previous query here), but it does not. What happens is the jar process on my local machine ramps up to take over up all the local RAM and CPU resources and the whole process freezes, and on the cluster executers keep getting dropped and re-added. Weirdly, this happens even when I select only 1 row for cacheing (which should make this dataset much smaller than other datasets which I have had no problem cacheing into Spark memory).
I've had a look through the logs, and these seem to be the only informative errors/warnings early on in the process:
17/03/06 11:40:27 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 33813 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
17/03/06 11:40:27 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 8167), so marking it as still running
...
17/03/06 11:46:59 WARN TaskSetManager: Lost task 3927.3 in stage 0.0 (TID 54882, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 3863), so marking it as still running
17/03/06 11:46:59 WARN TaskSetManager: Lost task 4300.3 in stage 0.0 (TID 54667, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 14069), so marking it as still running
And then after 20min or so the whole job crashes with:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I've changed my connect config to increase the heartbeat interval ( spark.executor.heartbeatInterval: '180s' ), and have seen how to increase memoryOverhead by changing settings on a yarn cluster ( using spark.yarn.executor.memoryOverhead ), but not on a standalone cluster.
In my config file, I have experimented by adding each of the following settings one at a time (none of which have worked):
spark.memory.fraction: 0.3
spark.executor.extraJavaOptions: '-Xmx24g'
spark.driver.memory: "64G"
spark.driver.extraJavaOptions: '-XX:MaxHeapSize=1024m'
spark.driver.extraJavaOptions: '-XX:+UseG1GC'
UPDATE: and my full current yml config file is as follows:
default:
# local settings
sparklyr.sanitize.column.names: TRUE
sparklyr.cores.local: 3
sparklyr.shell.driver-memory: "8G"
# remote core/memory settings
spark.executor.memory: "32G"
spark.executor.cores: 5
spark.executor.heartbeatInterval: '180s'
spark.ext.h2o.nthreads: 10
spark.cores.max: 30
spark.memory.storageFraction: 0.6
spark.memory.fraction: 0.3
spark.network.timeout: 300
spark.driver.extraJavaOptions: '-XX:+UseG1GC'
# other configs for spark
spark.serializer: org.apache.spark.serializer.KryoSerializer
spark.executor.extraClassPath: /var/lib/cassandra/jar/guava-18.0.jar
# cassandra settings
spark.cassandra.connection.host: <cassandra_ip>
spark.cassandra.auth.username: <cassandra_login>
spark.cassandra.auth.password: <cassandra_pass>
spark.cassandra.connection.keep_alive_ms: 60000
# spark packages to load
sparklyr.defaultPackages:
- "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M1"
- "com.databricks:spark-csv_2.11:1.3.0"
- "com.datastax.cassandra:cassandra-driver-core:3.0.2"
- "com.amazonaws:aws-java-sdk-pom:1.10.34"
So my question are:
Does anyone have any ideas about what to do in this instance?
Are
Are there config settings I can change to help with this issue?
Alternatively, is there a way to import the cassandra data in
batches with RStudio/sparklyr as the driver?
Or alternatively again, is there a way to munge/filter/edit data as it is brought into cache so that the resulting table is smaller (similar to using SQL querying, but with more complex dplyr syntax)?
OK, I've finally managed to make this work!
I'd initially tried the suggestion of #user6910411 to decrease the cassandra input split size, but this failed in the same way. After playing around with LOTS of other things, today I tried changing that setting in the opposite direction:
spark.cassandra.input.split.size_in_mb: 254
By INCREASING the split size, there were fewer spark tasks, and thus less overhead and fewer calls to the GC. It worked!

Lucene index gets broken segments after every restart of liferay-tomcat

I have a corrupted Lucene index. If I run "CheckIndex -fix" the problem is resolved, but as soon as I restart tomcat it becomes corrupted again.
The index directory is shared between two application servers running Liferay-Tomcat. I am fixing the index on 1 server and restarting that whilst the other is running. This is a production environment so I cannot bring them both down.
Any suggestions please?
Before fix, CheckIndex says:
Opening index # /usr/local/tomcat/liferay/lucene/0
Segments file=segments_5yk numSegments=1 version=FORMAT_SINGLE_NORM_FILE [Lucene 2.2]
1 of 1: name=_2vg docCount=31
compound=false
hasProx=true
numFiles=8
size (MB)=0.016
no deletions
test: open reader.........FAILED
WARNING: fixIndex() would remove reference to this segment; full exception:
java.io.IOException: read past EOF
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:78)
at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:335)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:119)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:652)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:605)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:491)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
WARNING: 1 broken segments (containing 31 documents) detected
WARNING: would write new segments file, and 31 documents would be lost, if -fix were specified
If you access your search index with more than one application server, I would suggest integrating a Solr Server. So you don't have the problem that 2 app servers are trying to write on the same file. This could be error prone as you already found out.
To get Solr up and running you have to follow those steps:
Install a Solr Server on any machine you like. A machine running only Solr would be quite preferable.
Install the Solr search portlet in Liferay
Adjust the config files according to the setup document of Sol Search portlet.
Here are some additional links:
http://www.liferay.com/de/marketplace/-/mp/application/15193648
http://www.liferay.com/de/community/wiki/-/wiki/Main/Pluggable+Enterprise+Search+with+Solr

Zcatalog clear and rebuild quits plone client without any error

I have a strange problem with one of my plone: when I clear and rebuild the zcatalog the zope client quits silently after some time. No errors.
I did the process using the ZMI (zeo + zeoclient, standalone) and using 'zinstance debug'. Same result the client quits silently.
I'm using a standard Plone 4.3 with some addon Products on a Ubuntu Server 12.04 box.
Tasks I have done in order to find out the problem, without success:
I've checked the permissions on the filesystem.
I've reinstalled Plone 4.3
Packing the database works ok, but the problem persists.
Checked the free inodes on the filesystem.
Executing the process on other computer works successfully.
Executing the client with fg parameter, no messages when quitting.
Backup the db and restoring. Same result after restoring, but if restoring on other computer the process rebuilds the catalog (with same Plone version and addons).
Reindexing a index of the catalog causes the same fail: quitting with no messages.
ZODB/scripts/fstest.py shows no errors.
ZODB/scripts/fsrefs.py shows no errors.
Any clues?
David, you are right, I just found the problem yesterday, but it was too late (I was tired) to report it here.
This plone instance is installed on an VPS (OpenVZ) with 512 MB and the kernel killed the python process silently when there was no memory free.
One of my last tests was to rebuild the catalog with "log progress" enabled, there I show that the process quitted at different points, but all around 30%. Then by chance I executed dmesg, and "voila", the mistery was resolved, look:
[2233907.698115] Out of memory in UB: OOM killed process 17819 (python) score 0 vm:799612kB, rss:497324kB, swap:45480kB
[2235168.564053] Out of memory in UB: OOM killed process 445 (python) score 0 vm:790380kB, rss:498036kB, swap:46924kB
[2236752.744927] Out of memory in UB: OOM killed process 17964 (python) score 0 vm:790392kB, rss:494232kB, swap:45584kB
[2237461.280724] Out of memory in UB: OOM killed process 26584 (python) score 0 vm:790328kB, rss:497932kB, swap:45940kB
[2238443.104334] Out of memory in UB: OOM killed process 1216 (python) score 0 vm:799512kB, rss:494132kB, swap:44632kB
[2239457.938721] Out of memory in UB: OOM killed process 12821 (python) score 0 vm:794896kB, rss:502000kB, swap:42656kB}

Resources