Could not create internal topics - Stream-thread exception - internal

I am trying to execute a simple WordCount stream application, but I am facing the error "Could not create internal topics - Stream-thread exception".
I have seen a similar thread but that seems to be more of a network issue.
There is no security enabled on the Kafka broker.
Only one broker is configured and the issue still occurs.
Can someone let me know how to fix this?

Clean up your temporary Kafka topics.
Run the --list command on Kafka to see all the topics starting with your application name and ending with -changelog or -repartition, and manually delete them.
This one worked for me.
Also, check your delete.topic.enable setting to make sure deletion actually happens; it was not enabled by default until 1.0.0 - see https://issues.apache.org/jira/browse/KAFKA-5384
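For reference, here is a minimal sketch of that cleanup using the kafka-python client; the bootstrap address and the application id prefix ("wordcount") are assumptions you would replace with your own:

from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient

BOOTSTRAP = "localhost:9092"  # assumption: single, unsecured local broker
APP_ID = "wordcount"          # assumption: your Streams application.id prefix

# List every topic and keep the Streams-internal ones for this application
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
internal = [t for t in consumer.topics()
            if t.startswith(APP_ID) and (t.endswith("-changelog") or t.endswith("-repartition"))]

# Delete them (requires delete.topic.enable=true on the broker)
admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
if internal:
    admin.delete_topics(internal)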

I connected to Kafka using Kafka Tool and deleted them manually.

Related

Airflow dag cannot find connection-id

I am managing a Google Cloud Composer environment which runs Airflow for a data engineering team. I have recently been asked to troubleshoot one of the DAGs they run, which is failing with this error: [12:41:18,119] {credentials_utils.py:23} WARNING - [redacted-name] connection ID not available, falling back to Google default credentials
The job is basically a data pipeline which reads from various sources and stores data into GBQ. The odd part is that they have a strictly identical DAG running for a different project and it works perfectly.
I have recreated the .json credentials for the service account behind the connection, as well as the connection itself in Airflow. I have sanitized the code to check whether there were any hidden spaces or similar.
My knowledge of Airflow is limited and I have not been able to find any similar issue in my research; has anyone encountered this before?
So the DE team came back to me saying it was actually a deployment issue: an internal module involved in service account authentication was being used inside another DAG running in the stage environment, which made it impossible to fetch credentials from the connection ID.
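As a side note, in stock Airflow the Google operators and hooks resolve the connection via their gcp_conn_id parameter, and an unresolvable id is a common reason for falling back to default credentials. A minimal, hedged sketch to confirm a connection id resolves (the id is a placeholder; this assumes the Google provider package is installed):

from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook

# assumption: replace with the connection id the failing DAG actually uses
hook = BigQueryHook(gcp_conn_id="my_gcp_connection")
creds = hook.get_credentials()  # resolves the connection, otherwise falls back or errors
print(type(creds), getattr(creds, "service_account_email", None))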

Google Cloud Composer (Apache Airflow) cannot access log files

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database, and it actually succeeds 50% of the time. However, it always returns this error message (and no other print statements or logs). Any help on why this might be happening would be much appreciated.
We also faced the same issue, then raised a support ticket with GCP and got the following reply.
The message is related to the latency of syncing logs from Airflow workers to the webserver; it takes at least a few minutes (depending on the number of objects and their size).
The total log size is not large, but it is enough to noticeably slow down synchronization, hence we recommend cleaning up/archiving the logs.
Basically we recommend relying on Stackdriver logs instead, because of the latency due to the design of this sync.
I hope this will help you solve the problem.
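If you do go the Stackdriver route, here is a minimal sketch of pulling a task's log lines with the google-cloud-logging client; the filter values (resource type, log name, DAG id) are assumptions to adapt to your environment:

from google.cloud import logging as gcp_logging

client = gcp_logging.Client()
# assumption: Composer task logs live under the airflow-worker log for the environment
log_filter = (
    'resource.type="cloud_composer_environment" '
    'AND logName:"airflow-worker" '
    'AND textPayload:"matts_custom_dag"'
)
for entry in client.list_entries(filter_=log_filter, page_size=50):
    print(entry.timestamp, entry.payload)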
I had the same problem after upgrading Google Composer from 1.10.3 to 1.10.6.
I can see in my logs that Airflow is trying to get the logs from a bucket with a name ending with -tenant, while the bucket in my account ends with -bucket.
In the configuration, I can see something weird too.
## airflow.cfg
[core]
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## the running configuration also shows
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to google support and they said the team is working on a fix.
EDIT:
I've been accessing my logs with gsutil, replacing the bucket name suffix with -bucket:
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs
I faced the same situation on multiple occasions.
As soon as the job finished, when I took a look at the log in the Airflow Web UI, it gave me the same error. However, when I checked the same logs in the UI after a minute or two, I could see them properly.
As per the above answers, it's a sync issue between the webserver and the worker node.
In general, the issue described here tends to be sporadic.
In certain situations, what can help is setting default_task_retries to a value that allows a task to be retried at least once (a DAG-level sketch is shown below).
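For illustration, a minimal DAG-level sketch of the same idea using default_args (the dag id and task mirror the question; the rest are placeholders):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "retries": 1,                        # retry a failed task at least once
    "retry_delay": timedelta(minutes=2),
}

with DAG(
    dag_id="matts_custom_dag",
    start_date=datetime(2020, 4, 20),
    schedule_interval=None,
    default_args=default_args,
) as dag:
    BashOperator(task_id="main_test", bash_command="echo 'Hello World'")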
This issue is resolved at least since Airflow version: 1.10.10+composer.

how can i monitor iccube server and data via an external tool

I'd like to put icCube under solid monitoring so that we know when a) a cube load fails or b) the cube's last update time exceeds the expected value.
Is there an API I can use to integrate with standard monitoring tools? REST, command line, etc.
Thanks in advance, assaf
Regarding the schema load failure, you can check the notification service (www); you can, for example, receive an email on failure. Note that you can implement (in Java) your own transport service to receive notifications. There is no "notification" for the last update time being exceeded, but you could use an external LOAD command (www) for loading your schema; in that case you will know the last update time and can perform whatever logic is required.
Edit: XMLA commands can be sent via any tools (e.g., Bash).
Hope that helps.

how to use the example of scrapy-redis

I have read the example of scrapy-redis but still don't quite understand how to use it.
I have run the spider named dmoz and it works well. But when I start another spider named mycrawler_redis, it just gets nothing.
Besides, I'm quite confused about how the request queue is set. I didn't find any piece of code in the example project which illustrates the request queue setting.
And if the spiders on different machines want to share the same request queue, how can I get that done? It seems that I should first make the slave machine connect to the master machine's Redis, but I'm not sure where to put the relevant code: in spider.py, or do I just type it on the command line?
I'm quite new to scrapy-redis and any help would be appreciated!
If the example spider is working and your custom one isn't, there must be something that you have done wrong. Update your question with the code, including all relevant parts, so we can see what went wrong.
Besides I'm quite confused about how the request queue is set. I didn't find any piece of code in the example-project which illustrate the request queue setting.
As far as your spider is concerned, this is done by appropriate project settings, for example if you want FIFO:
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Don't cleanup redis queues, allows to pause/resume crawls.
SCHEDULER_PERSIST = True
# Schedule requests using a queue (FIFO).
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'
As far as the implementation goes, queuing is done via RedisSpider, which your spider must inherit from. You can find the code for enqueuing requests here: https://github.com/darkrho/scrapy-redis/blob/a295b1854e3c3d1fddcd02ffd89ff30a6bea776f/scrapy_redis/scheduler.py#L73
As for the connection, you don't need to manually connect to the redis machine, you just specify the host and port information in the settings:
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
And the connection is configured in connection.py: https://github.com/darkrho/scrapy-redis/blob/a295b1854e3c3d1fddcd02ffd89ff30a6bea776f/scrapy_redis/connection.py
The example of usage can be found in several places: https://github.com/darkrho/scrapy-redis/blob/a295b1854e3c3d1fddcd02ffd89ff30a6bea776f/scrapy_redis/pipelines.py#L17
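For completeness, a minimal sketch of such a spider: it inherits from RedisSpider and reads its start URLs from a shared Redis list. The spider name and redis_key are placeholders following the example project's conventions:

from scrapy_redis.spiders import RedisSpider

class MyCrawlerRedis(RedisSpider):
    name = 'mycrawler_redis'
    redis_key = 'mycrawler:start_urls'  # shared list all machines consume from

    def parse(self, response):
        # assumption: just yield the page URL and title
        yield {'url': response.url,
               'title': response.css('title::text').extract_first()}

Each machine runs scrapy crawl mycrawler_redis with REDIS_HOST/REDIS_PORT pointing at the master's Redis; pushing a URL with redis-cli lpush mycrawler:start_urls http://example.com/ then feeds every connected spider from the same queue.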

No eventlogs from BizTalk

I've got a new production computer and installed my BizTalk app on it. The problem is that I don't see any messages in the event log, neither from my BizTalk app nor from BizTalk Server itself. The only message that appears in the event log is the following:
The following BizTalk host instance has initialized successfully.
BizTalk host name: BizTalkServerApplication
Windows service name: BTSSvc$BizTalkServerApplication
The source of that message is BizTalk Server. And there are no other messages at all, not even logs about errors which I suppose have already taken place.
Just a quick thought in case you are still having problems.
I tried to write to the event log with a source type that didn't already exist, and my BizTalk host user account didn't have permissions to create a new source type. This meant I never saw the entry in the event log.
E.g. (from @Bill Osuch's example)
System.Diagnostics.EventLog.WriteEntry("MyBiztalkApp", "oh i did something");
Make sure either the MyBiztalkApp source exists or that your user has permissions on the event log to create it.
Also, if you have a lot of messages going through BizTalk you will probably want to implement your own logging so your event log doesn’t fill up. We used Log4Net for our implementation and a database to store messages.
If you're not getting any errors (suspended messages) as the messages process, you're not going to see anything in the app log. You could try adding an Expression shape to your orchestration and manually writing out some debug info:
System.Diagnostics.EventLog.WriteEntry("event type", "whatever...");
Does your application actually use the BizTalkServerApplication host? Check in the BizTalk Administration Console whether all the host instances are indeed running. Is your application fully started? Messages are "put on hold" if your receive location is disabled, for example.
To check this functionality, write to the event log after every operation or shape in the BizTalk orchestration.
Scenario:
Suppose you have to assign a value to the XPath of a node in a map after transformation; then, in the Message Assignment shape, after you assign the value, you can also write to the event log.
Example: suppose we have already initialized "orderType" as "PO" in our Expression shape and now we have to assign the value of "orderType" to the XPath of a node in our map. Then:
Shape: Message Assignment (under the Construct Message shape, after the map transformation)
xpath(msgGetOrderReq, "/*[local-name()='CustomerOrders' and namespace-uri()='http://example.com/EAI/IEmployee/v1.0']/*[local-name()='ordertype' and namespace-uri()='http://example.com/EAI/IEmployee/v1.0']") = orderType;
Next, we want to write this information to the event log, so we add:
System.Diagnostics.EventLog.WriteEntry("msgGetOrderReq", orderType);
Build the project, deploy it, and GAC it. Restart the host instance. Run the orchestration, process something, and you will now be able to see the entries in the event log.
Regards
Mayank
