Why do I get PartitionOwnedError and ConsumerStoppedException when starting several consumers? - pykafka

I use pykafka to fetch messages from a Kafka topic, process them, and write the results to MongoDB. Because pymongo updates only one item at a time, I start 100 processes. But on startup, some of the processes fail with PartitionOwnedError and ConsumerStoppedException. I don't know why.
Thank you.
kafka_cfg = conf['kafka']
kafka_client = KafkaClient(kafka_cfg['broker_list'])
topic = kafka_client.topics[topic_name]
balanced_consumer = topic.get_balanced_consumer(
    consumer_group=group,
    auto_commit_enable=kafka_cfg['auto_commit_enable'],
    zookeeper_connect=kafka_cfg['zookeeper_list'],
    zookeeper_connection_timeout_ms=kafka_cfg['zookeeper_conn_timeout_ms'],
    consumer_timeout_ms=kafka_cfg['consumer_timeout_ms'],
)
while True:
    for msg in balanced_consumer:
        if msg is not None:
            try:
                value = eval(msg.value)
                id = long(value.pop("id"))
                value["when_update"] = datetime.datetime.now()
                query = {"_id": id}
                result = collection.update_one(query, {"$set": value}, True)
            except Exception as e:
                log.error("Fail to update: %s, msg: %s", e, msg.value)
Traceback (most recent call last):
File "dump_daily_summary.py", line 182, in <module>
dump_daily_summary.run()
File "dump_daily_summary.py", line 133, in run
for msg in self.balanced_consumer:
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 745, in __iter__
message = self.consume(block=True)
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 734, in consume
raise ConsumerStoppedException
pykafka.exceptions.ConsumerStoppedException
Traceback (most recent call last):
File "dump_daily_summary.py", line 182, in <module>
dump_daily_summary.run()
File "dump_daily_summary.py", line 133, in run
for msg in self.balanced_consumer:
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 745, in __iter__
message = self.consume(block=True)
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 726, in consume
self._raise_worker_exceptions()
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 271, in _raise_worker_exceptions
raise ex
pykafka.exceptions.PartitionOwnedError

PartitionOwnedError: check whether some background process is already consuming in the same consumer_group; there may not be enough free partitions to start another consumer.
ConsumerStoppedException: try upgrading your pykafka version (https://github.com/Parsely/pykafka/issues/574).
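To see why extra consumers end up with nothing to own, here is a minimal sketch of splitting partitions across the members of one consumer group. It is an illustration only, not pykafka's actual rebalancing code (which coordinates through ZooKeeper); the helper `assign_partitions`, the consumer names, and the partition count of 8 are all assumptions.

```python
def assign_partitions(partitions, consumers):
    """Split partitions across consumers as evenly as possible.

    Hypothetical helper for illustration; it is not part of pykafka.
    """
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))
    assignment = {}
    start = 0
    for index, consumer in enumerate(consumers):
        # The first `extra` consumers each take one additional partition.
        count = base + (1 if index < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


# 100 consumer processes competing for an assumed 8 partitions.
consumers = ["consumer-%03d" % i for i in range(100)]
assignment = assign_partitions(list(range(8)), consumers)
idle = [name for name, owned in assignment.items() if not owned]
print("consumers with partitions:", len(consumers) - len(idle))
print("idle consumers:", len(idle))
```

A partition can have only one owner per consumer group, so any process beyond the partition count has nothing left to claim.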

I ran into the same problem. However, I was confused by the solutions others suggested, such as adding enough partitions for the consumers or upgrading pykafka.
In fact, my setup already satisfied both of those conditions.
Here are the versions of my tools:
python 2.7.10
kafka 2.11-0.10.0.0
zookeeper 3.4.8
pykafka 2.5.0
Here is my code:
class KafkaService(object):
    def __init__(self, topic):
        self.client_hosts = get_conf("kafka_conf", "client_host", "string")
        self.topic = topic
        self.con_group = topic
        self.zk_connect = get_conf("kafka_conf", "zk_connect", "string")
    def kafka_consumer(self):
        """kafka-consumer client, using pykafka
        :return: {"id": 1, "url": "www.baidu.com", "sitename": "baidu"}
        """
        from pykafka import KafkaClient
        consumer = ""
        try:
            kafka = KafkaClient(hosts=str(self.client_hosts))
            topic = kafka.topics[self.topic]
            consumer = topic.get_balanced_consumer(
                consumer_group=self.con_group,
                auto_commit_enable=True,
                zookeeper_connect=self.zk_connect,
            )
        except Exception as e:
            logger.error(str(e))
        while True:
            message = consumer.consume(block=False)
            if message:
                print "message:", message.value
                yield message.value
The two exceptions (ConsumerStoppedException and PartitionOwnedError) are raised by the function consume(block=True) in pykafka.balancedconsumer.
I recommend reading the source code of that function.
It takes an argument block=True; after I changed it to False, the program no longer ran into these exceptions.
Since then the Kafka consumers have worked fine.

This behavior is affected by a longstanding bug that was recently discovered and is currently being fixed. The workaround we've used in production at Parse.ly is to run our consumers in an environment that handles automatically restarting them when they crash with these errors until all partitions are owned.
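The workaround above relies on the surrounding environment (for example, a process supervisor) to restart crashed consumers. As a rough in-process analogue, the sketch below retries a consumer function when it raises a given exception type. `run_with_restarts`, `ConsumerCrashed`, and `flaky_consumer` are hypothetical names for illustration; in real code `retry_on` would presumably be `(PartitionOwnedError, ConsumerStoppedException)` from `pykafka.exceptions`, and `run_consumer` would rebuild the balanced consumer before iterating.

```python
import time


class ConsumerCrashed(Exception):
    """Hypothetical stand-in for PartitionOwnedError / ConsumerStoppedException."""


def run_with_restarts(run_consumer, retry_on, max_restarts=5, delay_seconds=0.0):
    """Call run_consumer(), restarting it when it raises one of retry_on.

    Returns the number of restarts that were needed and re-raises the
    last error once max_restarts is exhausted.
    """
    restarts = 0
    while True:
        try:
            run_consumer()
            return restarts
        except retry_on:
            if restarts >= max_restarts:
                raise
            restarts += 1
            time.sleep(delay_seconds)  # back off before trying again


# Usage with a fake consumer that fails twice before succeeding.
attempts = {"count": 0}


def flaky_consumer():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConsumerCrashed("partition still owned by another consumer")


restarts_needed = run_with_restarts(flaky_consumer, retry_on=(ConsumerCrashed,))
print("restarts needed:", restarts_needed)
```

A non-zero `delay_seconds` gives the other consumers time to release their partitions between attempts.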

Related

Got Future <Future pending> attached to a different loop when using eventhub aio in Fastapi python

I am using Python 3.9, azure-eventhub 5.10.1, and azure-eventhub-checkpointstoreblob-aio. The following code regularly throws an exception: many calls send their message successfully with no error, but we also see runtime errors in the logs. I am wondering what I did wrong here. Thanks.
async def send_to_eventhub(self, producer, event_list, timestamp_event_received):
    try:
        async with producer:
            event_data_batch = await producer.create_batch()
            for (occupancy_status, hardware_id) in event_list:
                # set message properties for space report
                message_body = {
                    ...
                }
                message = EventData(json.dumps(message_body))
                message.properties = {
                    ...
                }
                # Send message to the eventhub
                logger.info("Sending message %s, %s", message, message.properties)
                event_data_batch.add(message)
            await producer.send_batch(event_data_batch)
            logger.info(
                "Message successfully sent %s, %s", message, message.properties
            )
    except (
        EventDataError,
        EventDataSendError,
        OperationTimeoutError,
        OwnershipLostError,
        RuntimeError,
    ) as event_ex:
        logger.error(
            "eventhub Sending Error: Error ocurred\
            sending message for hardware id %s %s %s",
            hardware_id,
            event_ex,
            traceback.format_exc(),
        )
This function is called from the following FastAPI handler:
@app.post(...)
async def handle_report(
    ...
):
    ...
    try:
        if len(incoming_data) > 0:
            event_list = []
            for sensor_data in incoming_data:
                data = sensor_data["data"]
                occupancy_status = json.loads(data)["value"]
                hardware_id = sensor_data["properties"]["propertyList"][0]["value"]
                event_list.append((occupancy_status, hardware_id))
            await eventhub_helper.send_to_eventhub(
                producer, event_list, received_timestamp
            )
        ...
And the exception says:
`eventhub Sending Error: Error ocurred sending message for hardware id TSPR04ESH11000268 Task <Task pending name='Task-544711411' coro=<RequestResponseCycle.run_asgi() running at /opt/pysetup/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/httptools_impl.py:375> cb=[set.discard()]> got Future <Future pending> attached to a different loop Traceback (most recent call last):
File "/app/eventhub_helper.py", line 94, in send_to_eventhub
logger.info(
File "/opt/pysetup/.venv/lib/python3.9/site-packages/azure/eventhub/aio/_producer_client_async.py", line 218, in __aexit__
await self.close()
File "/opt/pysetup/.venv/lib/python3.9/site-packages/azure/eventhub/aio/_producer_client_async.py", line 811, in close
async with self._lock:
File "/usr/local/lib/python3.9/asyncio/locks.py", line 14, in __aenter__
await self.acquire()
File "/usr/local/lib/python3.9/asyncio/locks.py", line 120, in acquire
await fut
RuntimeError: Task <Task pending name='Task-544711411' coro=<RequestResponseCycle.run_asgi() running at /opt/pysetup/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/httptools_impl.py:375> cb=[set.discard()]> got Future <Future pending> attached to a different loop`
I tried to reproduce this error, but it was hard because most calls go through without any error. I wonder whether I have not considered concurrency enough. I did notice that "event_data_batch.add(message)" can raise an error if the batch is full, but I don't think that could cause a RuntimeError, and I know the messages we send are small.
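The following does not diagnose the handler above. It only illustrates, under simplified assumptions, the general condition behind the message "attached to a different loop": an awaitable created on one event loop is awaited by a task running on another. The names `await_future`, `loop_a`, and `loop_b` are made up for this sketch.

```python
import asyncio


async def await_future(future):
    # Awaiting a Future makes the running Task check that both
    # belong to the same event loop.
    return await future


loop_a = asyncio.new_event_loop()
loop_b = asyncio.new_event_loop()

# The Future is bound to loop_a when it is created...
future_on_a = loop_a.create_future()

error_message = None
try:
    # ...but the Task that awaits it runs on loop_b.
    loop_b.run_until_complete(await_future(future_on_a))
except RuntimeError as exc:
    error_message = str(exc)
finally:
    future_on_a.cancel()
    loop_a.close()
    loop_b.close()

print(error_message)
```

If something similar is happening here, a place to look would be any object that holds asyncio primitives (such as the lock in the traceback) and is created under one loop but used under another.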

Slow data reading from Google BigTable

I'm using Airflow 1.10.14 and Composer 1.15.2 with Google Bigtable on GCP.
I'm getting this error:
{taskinstance.py:1152} ERROR - <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.ABORTED
details = "Error while reading table 'projects/pyproject/instances/bigtable-02/tables/mytable' : Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)"
debug_error_string = "{"created":"@1649401290.125577144","description":"Error received from peer ipv4:142.251.6.95:443","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Error while reading table 'projects/pyproject/instances/bigtable-02/tables/mytable' : Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)","grpc_status":10}"
>
Traceback (most recent call last):
File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 980, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/airflow/gcs/plugins/minutebardata.py", line 55, in execute
tk.get_data(dt=self.dt)
File "/home/airflow/gcs/plugins/scraper/ticker.py", line 46, in get_data
df0 = self.dbhook._load_all_minubebar_tickers(ticker)
File "/home/airflow/gcs/plugins/scraper/db_wrapper.py", line 583, in _load_all_minubebar_tickers
dfs = [self.process(row, ticker) for row in rows if row is not None]
File "/home/airflow/gcs/plugins/scraper/db_wrapper.py", line 583, in <listcomp>
dfs = [self.process(row, ticker) for row in rows if row is not None]
File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/bigtable/row_data.py", line 485, in __iter__
response = self._read_next_response()
File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/bigtable/row_data.py", line 474, in _read_next_response
return self.retry(self._read_next, on_error=self._on_error)()
File "/opt/python3.6/lib/python3.6/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
on_error=on_error,
File "/opt/python3.6/lib/python3.6/site-packages/google/api_core/retry.py", line 184, in retry_target
return target()
File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/bigtable/row_data.py", line 470, in _read_next
return six.next(self.response_iterator)
File "/opt/python3.6/lib/python3.6/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/opt/python3.6/lib/python3.6/site-packages/grpc/_channel.py", line 803, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.ABORTED
details = "Error while reading table 'projects/pyproject/instances/bigtable-02/tables/mytable' : Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)"
debug_error_string = "{"created":"@1649401290.125577144","description":"Error received from peer ipv4:142.251.6.95:443","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Error while reading table 'projects/pyproject/instances/bigtable-02/tables/mytable' : Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)","grpc_status":10}"
I use the following approach:
from google.cloud.bigtable.row_set import RowSet
row_set = RowSet()
row_set.add_row_range_from_keys(
    start_key=ticker + b"#" + startdt,
    end_key=ticker + b"#" + enddt)
rows = table.read_rows(
    row_set=row_set)
This code works properly when I run it locally. However, when I run it in GCP, I get the error above.
Could you give me any hints toward a solution?
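The error text itself lists a slow client read as a possible cause. If the per-row `self.process(...)` call in the list comprehension is the slow part (an assumption; the network could equally be at fault), one thing to try is draining the stream into memory first and processing afterwards. The sketch below shows only the pattern; `fake_row_stream` and `slow_process` are stand-ins, not Bigtable APIs.

```python
import time


def slow_process(row):
    # Stand-in for expensive per-row work (parsing, building a DataFrame, ...).
    time.sleep(0.001)
    return row * 2


def fake_row_stream(count):
    # Stand-in for a streaming read; a real stream may be cut off by the
    # server if the client pauses too long between reads.
    for value in range(count):
        yield value


# Interleaved: the stream stays open for the whole processing time.
interleaved = [slow_process(row) for row in fake_row_stream(50)]

# Drained first: the stream is consumed as fast as possible, and only
# afterwards does the slow processing start.
buffered_rows = list(fake_row_stream(50))
drained = [slow_process(row) for row in buffered_rows]

print(interleaved == drained)
```

The trade-off is memory: all rows in the requested key range are held at once, so for large ranges it may be better to split the range into smaller reads.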

Unit test for exception raised in custom GNU radio python block

I have created a custom python sync block for use in a gnuradio flowgraph. The block tests for invalid input and, if found, raises a ValueError exception. I would like to create a unit test to verify that the exception is raised when the block indeed receives invalid input data.
As part of the python-based qa test for this block, I created a flowgraph such that the block receives invalid data. When I run the test, the block does appear to raise the exception but then hangs.
What is the appropriate way to test for this? Here is a minimal working example:
#!/usr/bin/env python
import numpy as np
from gnuradio import gr, gr_unittest, blocks
class validate_input(gr.sync_block):
    def __init__(self):
        gr.sync_block.__init__(self,
            name="validate_input",
            in_sig=[np.float32],
            out_sig=[np.float32])
        self.max_input = 100
    def work(self, input_items, output_items):
        in0 = input_items[0]
        if (np.max(in0) > self.max_input):
            raise ValueError('input exceeds max.')
        validated_in = output_items[0]
        validated_in[:] = in0
        return len(output_items[0])
class qa_validate_input (gr_unittest.TestCase):
    def setUp (self):
        self.tb = gr.top_block ()
    def tearDown (self):
        self.tb = None
    def test_check_valid_data(self):
        src_data = (0, 201, 92)
        src = blocks.vector_source_f(src_data)
        validate = validate_input()
        snk = blocks.vector_sink_f()
        self.tb.connect (src, validate)
        self.tb.connect (validate, snk)
        self.assertRaises(ValueError, self.tb.run)
if __name__ == '__main__':
    gr_unittest.run(qa_validate_input, "qa_validate_input.xml")
which produces:
DEPRECATED: Using filename with gr_unittest does no longer have any effect.
handler caught exception: input exceeds max.
Traceback (most recent call last):
File "/home/xxx/devel/gnuradio3_8/lib/python3.6/dist-packages/gnuradio/gr/gateway.py", line 60, in eval
try: self._callback()
File "/home/xxx/devel/gnuradio3_8/lib/python3.6/dist-packages/gnuradio/gr/gateway.py", line 230, in __gr_block_handle
) for i in range(noutputs)],
File "qa_validate_input.py", line 21, in work
raise ValueError('input exceeds max.')
ValueError: input exceeds max.
thread[thread-per-block[1]: <block validate_input(2)>]: SWIG director method error. Error detected when calling 'feval_ll.eval'
^CF
======================================================================
FAIL: test_check_valid_data (__main__.qa_validate_input)
----------------------------------------------------------------------
Traceback (most recent call last):
File "qa_validate_input.py", line 47, in test_check_valid_data
self.assertRaises(ValueError, self.tb.run)
AssertionError: ValueError not raised by run
----------------------------------------------------------------------
Ran 1 test in 1.634s
FAILED (failures=1)
The top_block's run() function does not call the block's work() function directly. It starts the internal task scheduler and its threads and waits for them to finish, so an exception raised inside work() never propagates back to the caller of run().
One way to unit test the error handling in your block is to call the work() function directly:
def test_check_valid_data(self):
    src_data = [[0, 201, 92]]
    output_items = [[]]
    validate = validate_input()
    self.assertRaises(ValueError, lambda: validate.work(src_data, output_items))
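The same idea can be tried without GNU Radio installed. The sketch below uses a stand-in class that copies the work() logic but does not inherit from gr.sync_block, and uses the built-in max() in place of np.max; `ValidateInputStandIn` and the test names are hypothetical.

```python
import unittest


class ValidateInputStandIn:
    """Stand-in with the same work() logic, minus the gr.sync_block base
    class, so the sketch runs without GNU Radio installed."""

    def __init__(self):
        self.max_input = 100

    def work(self, input_items, output_items):
        in0 = input_items[0]
        if max(in0) > self.max_input:
            raise ValueError('input exceeds max.')
        output_items[0][:] = in0
        return len(output_items[0])


class ValidateInputWorkTest(unittest.TestCase):
    def test_invalid_input_raises(self):
        block = ValidateInputStandIn()
        with self.assertRaises(ValueError):
            block.work([[0, 201, 92]], [[0, 0, 0]])

    def test_valid_input_is_copied(self):
        block = ValidateInputStandIn()
        out = [[0, 0, 0]]
        produced = block.work([[1, 50, 99]], out)
        self.assertEqual(produced, 3)
        self.assertEqual(out[0], [1, 50, 99])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ValidateInputWorkTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all passed:", result.wasSuccessful())
```

Calling work() directly keeps the scheduler and its threads out of the test, which is what lets assertRaises see the exception.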

Airflow HttpSensor won't work

I'm trying to create a HttpSensor in Airflow using the following code:
wait_to_launch = HttpSensor(
    task_id="wait-to-launch",
    endpoint='http://' + socket.gethostname() + ":8500/v1/kv/launch-cluster?raw",
    response_check=lambda response: True if 'oui'==response.content else False,
    dag=dag
)
But I keep getting this error:
Traceback (most recent call last):
File "http_sensor_test.py", line 30, in <module>
dag=dag
File "/home/me/.local/lib/python2.7/site-packages/airflow/utils/decorators.py", line 86, in wrapper
result = func(*args, **kwargs)
File "/home/me/.local/lib/python2.7/site-packages/airflow/operators/sensors.py", line 663, in __init__
self.hook = hooks.http_hook.HttpHook(method='GET', http_conn_id=http_conn_id)
File "/home/me/.local/lib/python2.7/site-packages/airflow/utils/helpers.py", line 436, in __getattr__
raise AttributeError
AttributeError
What am I missing?
You are running into a known issue; see AIRFLOW-1030. A fix has been merged (#2180) but is not yet in a released version of Airflow. The fix is marked for the next release (1.9.0), but it could be weeks or months until that is out. In the meantime, you can run a fork of Airflow with this change, or add the updated version of the HttpSensor as a custom operator (plugin).

pexpect python throw error

Although this is my first attempt at using pexpect, the python3 script using pexpect is pretty simple; yet it fails.
#!/usr/bin/env python3
import sys
import pexpect
SSH_NEWKEY = r'Are you sure you want to continue connecting \(yes/no\)\?'
child = pexpect.spawn("ssh -i /user/aws/key.pem ec2-user@xxx.xxx.xxx.xxx date")
i = child.expect( [ pexpect.TIMEOUT, SSH_NEWKEY ] )
if i == 1:
    child.sendline('yes')
print(child.before)
The SSH_NEWKEY is the only response I'm expecting, but the example showed a list containing pexpect.TIMEOUT in it so I used it.
$ ./test.py
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 144, in read_nonblocking
s = os.read(self.child_fd, size)
OSError: [Errno 5] Input/output error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/pexpect/expect.py", line 97, in expect_loop
incoming = spawn.read_nonblocking(spawn.maxread, timeout)
File "/usr/local/lib/python3.4/site-packages/pexpect/pty_spawn.py", line 455, in read_nonblocking
return super(spawn, self).read_nonblocking(size)
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 149, in read_nonblocking
raise EOF('End Of File (EOF). Exception style platform.')
pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./min.py", line 15, in <module>
i = child.expect( [ pexpect.TIMEOUT, SSH_NEWKEY ] )
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 315, in expect
timeout, searchwindowsize, async)
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 339, in expect_list
return exp.expect_loop(timeout)
File "/usr/local/lib/python3.4/site-packages/pexpect/expect.py", line 102, in expect_loop
return self.eof(e)
File "/usr/local/lib/python3.4/site-packages/pexpect/expect.py", line 49, in eof
raise EOF(msg)
pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
<pexpect.pty_spawn.spawn object at 0x7f70ea4fbcf8>
command: /usr/bin/ssh
args: ['/usr/bin/ssh', '-i', '/user/aws/key.pem', 'ec2-user@xxx.xxx.xxx.xxx', 'date']
searcher: None
buffer (last 100 chars): b''
before (last 100 chars): b'Fri May 6 13:50:18 EDT 2016\r\n'
after: <class 'pexpect.exceptions.EOF'>
match: None
match_index: None
exitstatus: 0
flag_eof: True
pid: 31293
child_fd: 5
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
What am I missing?
CentOS 6.4
python 3.4.3
An EOF error is being raised during your expect call. This means that the output received does not match SSH_NEWKEY and the child reaches end of file within the timeout period. To handle this case, change your expect line to read:
i = child.expect( [ pexpect.TIMEOUT, SSH_NEWKEY, pexpect.EOF ] )
You can then make your if more robust:
if i == 1:
    child.sendline('yes')
elif i == 0:
    print("Timeout")
elif i == 2:
    print("EOF")
print(child.before)
This doesn't address why you are not receiving a response with the expected string. It's hard to know without looking at more code, but it's likely because the expected string is slightly wrong. If you run the SSH command manually, you should be able to see the actual response and put that into your code.
You can also print child.before after your expect call, or print child.read() instead of your expect call, to see what is being sent back as a response.
