We do explicit soft-locking with
serviceHub.vaultService.softLockReserve(txBuilder.lockId, NonEmptySet.of(balanceStateS2R.ref))
Things were working until today, when we started getting this exception repeatedly, and now we cannot run the flow at all. There was no double-spend going on. What is causing it, and how can we get out of it?
E 17:01:57-0500 [Node thread] vault.NodeVaultService.softLockReserve - soft lock update error attempting to reserve states for acace05e-b0fb-4d4e-9b96-a7d0d4728f68 and [392D84F9CF931F17438399D36607CAFDB549C02A5E7B63E8F8D2B2FADE1AFF57(1)]
Soft locking error: Attempted to reserve [392D84F9CF931F17438399D36607CAFDB549C02A5E7B63E8F8D2B2FADE1AFF57(1)] for acace05e-b0fb-4d4e-9b96-a7d0d4728f68 but only 0 rows available.
Without more context (e.g. what actions the flow is executing), I cannot provide an accurate answer, but here are some points to consider:
The error message indicates that the previously locked state has already been consumed in some way (perhaps by the flow that originally locked it) and was never explicitly released after consumption.
Use the softLockRelease() API call to explicitly release states that were previously locked.
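For example, a minimal sketch using Corda's Java API (assuming it runs inside the same flow as the reserve call above, so txBuilder and balanceStateS2R refer to the same objects):

// Release the specific state that was reserved under this lock id.
UUID lockId = txBuilder.getLockId();
getServiceHub().getVaultService().softLockRelease(lockId, NonEmptySet.of(balanceStateS2R.getRef()));

// As far as I recall, passing null for the state refs releases every state
// still held under that lock id, which is useful in a cleanup/error handler:
getServiceHub().getVaultService().softLockRelease(lockId, null);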
In Apache Airflow (2.x), each Operator Instance has a state as defined here (airflow source repo).
I have two use cases that don't seem to clearly fall into the pre-defined states:
Warn, but don't fail - This seems like it should be a very standard use case and I am surprised to not see it in the out-of-the-box airflow source code. Basically, I'd like to color-code a node with something eye-catching - say orange - corresponding to a non-fatal warning, but continue execution as normal otherwise. Obviously you can print warnings to the log, but finding them takes more work than just looking at the colorful circles on the DAGs page.
"Sensor N/A" or "Data not ready" - This would be a status that gets assigned when a sensor notices that data in the source system is not yet ready, and that downstream operators can be skipped until the next execution of the DAG, but that nothing in the data pipeline is broken. Basically an expected end-of-branch.
Is there a good way of achieving either of these use cases with the out-of-the-box Airflow node states? If not, is there a way to define custom operator states? Since I am running Airflow on a managed service (MWAA), I don't think changing the source code of our deployment is an option.
Thanks,
For #1, the task states are tightly integrated with Airflow; there is no way to configure which logging levels lead to which state. I'd say the easiest approach is to grep the log files for "WARNING", or to set up a log aggregation service (e.g. Elasticsearch) to make the logs searchable.
For #2, sensors have no knowledge of why they timed out. Once timeout or execution_timeout is reached, they simply raise an exception. You can deal with exceptions using trigger_rules, but these still don't take the reason for the exception into account.
If you want more control over this, you could implement your own sensor that takes an argument such as data_not_ready_timeout (smaller than timeout and execution_timeout). In the poke() method, check whether data_not_ready_timeout has been reached and raise an AirflowSkipException if so; this will skip the downstream tasks. Once timeout or execution_timeout is reached, the task fails as usual. Look at BaseSensorOperator.execute() for inspiration on how to obtain the initial start date of a sensor.
I have written a very simple Flink streaming job which takes data from Kafka using a FlinkKafkaConsumer082.
protected DataStream<String> getKafkaStream(StreamExecutionEnvironment env, String topic) {
    Properties result = new Properties();
    result.put("bootstrap.servers", getBrokerUrl());
    result.put("zookeeper.connect", getZookeeperUrl());
    result.put("group.id", getGroup());
    return env.addSource(
            new FlinkKafkaConsumer082<>(
                    topic,
                    new SimpleStringSchema(), result));
}
This works very well and whenever I put something into the topic on Kafka, it is received by my Flink job and processed. Now I tried to see what happens if my Flink Job isn't online for some reason. So I shut down the flink job and kept sending messages to Kafka. Then I started my Flink job again and was expecting that it would process the messages that were sent meanwhile.
However, I got this message:
No prior offsets found for some partitions in topic collector.Customer. Fetched the following start offsets [FetchPartition {partition=0, offset=25}]
So it basically ignored all messages that came in since the last shutdown of the Flink job and just started reading at the end of the queue. From the documentation of FlinkKafkaConsumer082 I gathered that it automatically takes care of synchronizing the processed offsets with the Kafka broker. However, that doesn't seem to be the case.
I am using a single-node Kafka installation (the one that comes with the Kafka distribution) with a single-node Zookeeper installation (also the one that is bundled with the Kafka distribution).
I suspect it is some kind of misconfiguration or something along those lines, but I really don't know where to start looking. Has anyone else had this issue and maybe solved it?
I found the reason. You need to explicitly enable checkpointing in the StreamExecutionEnvironment to make the Kafka connector write the processed offsets to Zookeeper. If you don't enable it, the Kafka connector will not write the last read offset and it will therefore not be able to resume from there when the collecting Job is restarted. So be sure to write:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(); // <-- this is the important part
Anatoly's suggestion for changing the initial offset is probably still a good idea, in case checkpointing fails for some reason.
https://kafka.apache.org/08/configuration.html
Set auto.offset.reset to smallest (by default it is largest).
auto.offset.reset:
What to do when there is no initial offset in Zookeeper or if an offset is out of range:
smallest: automatically reset the offset to the smallest offset
largest: automatically reset the offset to the largest offset
anything else: throw an exception to the consumer.
If this is set to largest, the consumer may lose some messages when the number of partitions, for the topics it subscribes to, changes on the broker. To prevent data loss during partition addition, set auto.offset.reset to smallest.
Also make sure getGroup() returns the same group id after the restart.
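Putting the two answers together, this is roughly what the setup looks like (a sketch reusing the question's helpers such as getBrokerUrl(); the 5-second checkpoint interval is just an example value, and the topic name is taken from the error message):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // commit processed offsets to Zookeeper with each checkpoint

Properties props = new Properties();
props.put("bootstrap.servers", getBrokerUrl());
props.put("zookeeper.connect", getZookeeperUrl());
props.put("group.id", getGroup());          // must stay the same across restarts
props.put("auto.offset.reset", "smallest"); // used only when no offset is committed for the group

DataStream<String> stream = env.addSource(
        new FlinkKafkaConsumer082<>("collector.Customer", new SimpleStringSchema(), props));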
Does anyone understand what this RocksDB error refers to?
/column_family.cc:275: rocksdb::ColumnFamilyData::~ColumnFamilyData():
Assertion `refs_ == 0' failed. Aborted (core dumped)
This is an assertion failure raised by RocksDB, and it intentionally terminates the execution of the program.
In general, assertions are used by programmers to ensure certain invariants in the program. Assertions have some runtime overhead, and therefore can be completely disabled. Often they are compiled into development or debug builds, but are omitted for production builds.
When an assertion fails, program execution is intentionally aborted immediately by calling std::abort. This may lead to your OS writing a core dump (as it evidently did here, judging by the message above), but whether and where core dumps are written depends on the OS configuration.
In case of this specific assertion, the destructor of rocksdb::ColumnFamilyData raised the assertion because it requires its refs_ member to have a value of 0. refs_ is a reference counter and it makes sense to assert that no references are actually held when the object's destructor is called.
From just looking at the destructor code, it is unclear whether this is a bug in the RocksDB library itself, or an error caused by using it the wrong way, e.g. destroying column family objects when they are still in use by other objects.
For reference, here's the code part that raised the assertion (currently on line 365 in file rocksdb/db/column_family.cc):
ColumnFamilyData::~ColumnFamilyData() {
  assert(refs_.load(std::memory_order_relaxed) == 0);
If the error persists, it would be useful if you could post the code that uses RocksDB here; otherwise it may be impossible to track down the source of the error.
The core dump may also provide useful information, because it contains the stack trace of the code that actually invoked the object's destructor.
I noticed that these column_family.cc errors (core dumped, memory_order_relaxed, etc.) occurred after an incorrect RocksDB installation. In my Vagrant script I found an approach that works.
Instead of following
https://github.com/facebook/rocksdb/blob/master/INSTALL.md
I use this script:
cd /opt
git clone https://github.com/facebook/rocksdb.git
cd rocksdb
git checkout tags/v4.1
PORTABLE=1 make shared_lib
export LD_LIBRARY_PATH=/opt/rocksdb
It is better to add LD_LIBRARY_PATH permanently to your environment (.bashrc or /etc/environment).
The assertion refs_ == 0 failing in ~ColumnFamilyData() means that the reference count of a column family is not zero when the column family is deleted. Most likely you have some column family handles that were not deleted before closing the DB. Note that all column family handles must be deleted before the DB is closed; otherwise the assertion will fail.
// Before delete DB, you have to close All column families by calling
// DestroyColumnFamilyHandle() with all the handles.
static Status Open(const DBOptions& db_options, const std::string& name,
                   const std::vector<ColumnFamilyDescriptor>& column_families,
                   std::vector<ColumnFamilyHandle*>* handles, DB** dbptr);
To fix this assertion failure, make sure you delete all column family handles before closing the DB.
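Not the same language as the question, but to illustrate that teardown order end to end, here is a minimal sketch using the Java binding (recent RocksJava versions), which wraps the same C++ code and has the same requirement; the path and the extra column family name are made up:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.rocksdb.*;

public class ColumnFamilyTeardown {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        List<ColumnFamilyDescriptor> descriptors = Arrays.asList(
                new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY),
                new ColumnFamilyDescriptor("my_cf".getBytes()));
        List<ColumnFamilyHandle> handles = new ArrayList<>();

        try (DBOptions options = new DBOptions()
                .setCreateIfMissing(true)
                .setCreateMissingColumnFamilies(true)) {
            RocksDB db = RocksDB.open(options, "/tmp/cf-demo", descriptors, handles);
            // ... read and write through the handles ...

            for (ColumnFamilyHandle handle : handles) {
                handle.close();   // release every column family handle first
            }
            db.close();           // only then close the DB
        }
    }
}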
We have a long-running request that does a catalog search, calculates some information, and then stores it in another document. A call is made to index the document after the store.
In the logs we're getting errors such as INFO ZPublisher.Conflict ConflictError at /pooldb/csvExport/runAgent: database conflict error (oid 0x017f1eae, class BTrees.IIBTree.IISet, serial this txn started with 0x03a30beb4659b266 2013-11-27 23:07:16.488370, serial currently committed 0x03a30c0d2ca5b9aa 2013-11-27 23:41:10.464226) (19 conflicts (0 unresolved) since startup at Mon Nov 25 15:59:08 2013)
and the transaction is getting aborted and restarted.
All the documentation I've read says that, due to MVCC in ZODB, read conflicts no longer occur. Since this is written in RestrictedPython, putting in a savepoint is not a simple option (but likely my only choice, I'm guessing).
Is there another way to avoid this conflict? If I do need to use savepoints, can anyone think of a security reason why I shouldn't whitelist the savepoint transaction method for use in PythonScripts?
The answer is that there really is no way to do a large transaction involving writes while other small transactions are hitting the same objects at the same time. BTrees are supposed to have special conflict-resolution code, but it doesn't seem to work in the cases I'm dealing with.
The only way I can think of to manage this is to break the reindex into smaller transactions, to reduce the chance of a conflict. There really should be something like this built into Zope/Plone.
I am trying to make a multi-threaded Qt application that uses QGLWidgets, and I keep getting this error. (I am trying to paint from another thread using QPainter.) It also looks like I have a huge memory leak because of it.
The error is "QGLContext::makeCurrent() : wglMakeCurrent failed: The operation completed successfully"
I believe this is related to a rather old issue from the Qt mailing list as described here. In short, if the thread calling makeCurrent() does not equal the thread where the device context was retrieved, GetDC() is called. As outlined in the linked thread, the problem is that ReleaseDC() is not called accordingly, resulting in a handle leak, and triggering Windows to return NULL in the call to GetDC() at some point, which makes wglMakeCurrent() fail. I don't know, however, why GetLastError() claims "The operation completed successfully" in this case.