What is the reason for an ejabberd crash when paired with Redis?

I have been able to scale to more than a million users using the approach from "How to scale ejabberd Server machine on CentOS to handle 200 K connections?".
I am now using Redis as a backend for ejabberd.
After a million users, I get the following error:
2016-04-15 12:50:24 =ERROR REPORT====
** State machine <0.24986.34> terminating
** Last event in was {xmlstreamelement,{xmlel,<<"iq">>,[{<<"type">>,<<"set">>},{<<"id">>,<<"820919">>}],[{xmlel,<<"bind">>,[{<<"xmlns">>,<<"urn:ietf:params:xml:ns:xmpp-bind">>}],[{xmlel,<<"resource">>,[],[{xmlcdata,<<"tsung">>}]}]}]}}
** When State == wait_for_bind
** Data == {state,{socket_state,gen_tcp,#Port<0.418817>,<0.24984.34>},ejabberd_socket,#Ref<0.0.36.113046>,false,<<"2087913259">>,undefined,c2s,c2s_shaper,false,true,false,false,[verify_none,compression_none,{protocol_options,<<"no_sslv3">>},{certfile,<<"/opt/ejabberd-15.11/conf/ejabberd.pem">>}],true,undefined,<<"mac52944bec562c9c82eae8e818abdea7e4b">>,<<"ejabberd-benchmark">>,<<>>,{{1460,704802,37025},<0.24986.34>},{pres_t,0},{pres_f,0},{pres_a,0},undefined,undefined,{userlist,none,[],false},unknown,ejabberd_auth_external,{{10,245,32,24},29307},[],active,[],inactive,undefined,undefined,1000,undefined,300,300,true,0,0,<<>>}
** Reason for termination =
** {timeout,{gen_server,call,[ejabberd_redis_client,{request,[[<<"*">>,"2",<<"\r\n">>],[[<<"$">>,"7",<<"\r\n">>,<<"HGETALL">>,<<"\r\n">>],[<<"$">>,"67",<<"\r\n">>,<<"ejabberd:sm:mac52944bec562c9c82eae8e818abdea7e4b#ejabberd-benchmark">>,<<"\r\n">>]]]},20000]}}

This error simply means that your system is saturated; it does not point to a specific bug by itself. The report shows a gen_server call to ejabberd_redis_client timing out after 20 seconds while reading the session from Redis (the HGETALL on the ejabberd:sm:... key), so the Redis backend, or the single client process in front of it, is not keeping up.
You have to analyse your use case to find the bottlenecks, tune the platform, and possibly optimize the code.
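If Redis itself still has headroom, one place to look is ejabberd's Redis connection settings in ejabberd.yml. A minimal sketch of the kind of tuning involved (option names follow current ejabberd documentation and may not all be available in 15.11; values are only illustrative):
redis_server: "127.0.0.1"
redis_port: 6379
redis_pool_size: 50        # more parallel connections to Redis, so one slow request does not stall the rest
redis_connect_timeout: 3   # seconds
Measuring Redis latency from the ejabberd host (for example with redis-cli --latency) also helps tell a saturated Redis apart from a saturated ejabberd node.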

Related

jupyterhub fails to spawn server with systemdspawner

I am trying to run jupyterhub on an Ubuntu 20.04 LTS server. My idea is to run python/jupyterhub in a conda virtual environment as a system service. As I want to be able to limit the resources available to individual users I installed the systemdspawner.
After installing everything and starting the jupyterhub service I can log in through my web browser. However, when trying to start the server, the spawner gets stuck and after a while I get an error message saying "Spawn failed: Timeout".
In journalctl I can see the following messages:
User logged in: me 302 POST /hub/login?next= -> /hub/spawn (me@::ffff:[my IP address]) 59.42ms
Adding role server to token: <APIToken('93c8...', user='me', client_id='jupyterhub')
Creating oauth client jupyterhub-user-me
pam_loginuid(login:session): Error writing /proc/self/loginuid: Operation not permitted
pam_loginuid(login:session): set_loginuid failed
pam_unix(login:session): session opened for user me by (uid=0)
Failed to open PAM session for me: [PAM Error 14] Cannot make/remove an entry for the specified session
Disabling PAM sessions from now on. user:me
Unit jupyter-me-singleuser in a failed state. Resetting state.
Disclaimer: My Jupyter/Python installation replaces a former installation that was set up by someone else and got messed up a bit over time. I tried to remove everything related and start with a clean installation from scratch. However, as I had very little documentation about the old setup, there is a certain risk that some leftovers of the previous installation may cause trouble.
Any ideas?
Solved it myself. In the end the PAM-related messages seem to be non-critical and were not related to the timeout at all. Instead I found a mistake in /etc/systemd/system/jupyterhub.service, where the PATH variable did not include the bin directory of my miniconda installation.
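For anyone hitting the same thing, a minimal sketch of the relevant part of /etc/systemd/system/jupyterhub.service (the miniconda paths are assumptions; adjust them to your installation):
[Service]
# assumption: jupyterhub is installed in a conda env at /opt/miniconda3/envs/jupyterhub
Environment="PATH=/opt/miniconda3/envs/jupyterhub/bin:/usr/local/bin:/usr/bin:/bin"
ExecStart=/opt/miniconda3/envs/jupyterhub/bin/jupyterhub -f /etc/jupyterhub/jupyterhub_config.py
After editing the unit, run systemctl daemon-reload and restart the service for the change to take effect.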

How to download data into Rstudio via "Rblpapi"?

Good afternoon. Recently I've experienced a problem with downloading data from Bloomberg information system into RStudio via "Rblpapi" package. In order to get this package on my PC I executed the following commands:
install.packages("Rblpapi")
library(Rblpapi)
This code ran successfully. Further, in order to establish the connection between my PC and Bloomberg, I did the following:
blpConnect()
Then I received the error message:
25MAR2021_12:27:10.598 4484:7384 ERROR blpapi_platformtransporttcp.cpp:671 blpapi.session.transporttcp.{4}.localhost:8194 Connection failed
25MAR2021_12:27:10.598 4484:7384 WARN blpapi_platformcontroller.cpp:371 blpapi.session.platformcontroller.{4} Platform: 0 failed 1 consecutive connect attempts, stopped trying to reconnect.
Error in blpConnect_Impl(host, port, appName) : Failed to start session.
Therefore, I couldn't establish the connection between my PC and Bloomberg.
Could you please tell me how this problem can be solved?
Thank you for your effort.

"GC overhead limit exceeded" on cache of large dataset into spark memory (via sparklyr & RStudio)

I am very new to the Big Data technologies I am attempting to work with, but have so far managed to set up sparklyr in RStudio to connect to a standalone Spark cluster. Data is stored in Cassandra, and I can successfully bring large datasets into Spark memory (cache) to run further analysis on them.
However, recently I have been having a lot of trouble bringing in one particularly large dataset into Spark memory, even though the cluster should have more than enough resources (60 cores, 200GB RAM) to handle a dataset of its size.
I thought that by limiting the data being cached to just a few select columns of interest I could overcome the issue (using the answer code from my previous query here), but it does not. What happens is that the jar process on my local machine ramps up to take over all the local RAM and CPU resources, the whole process freezes, and on the cluster executors keep getting dropped and re-added. Weirdly, this happens even when I select only 1 row for caching (which should make this dataset much smaller than other datasets which I have had no problem caching into Spark memory).
I've had a look through the logs, and these seem to be the only informative errors/warnings early on in the process:
17/03/06 11:40:27 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 33813 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
17/03/06 11:40:27 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 8167), so marking it as still running
...
17/03/06 11:46:59 WARN TaskSetManager: Lost task 3927.3 in stage 0.0 (TID 54882, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 3863), so marking it as still running
17/03/06 11:46:59 WARN TaskSetManager: Lost task 4300.3 in stage 0.0 (TID 54667, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 14069), so marking it as still running
And then after 20min or so the whole job crashes with:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I've changed my connect config to increase the heartbeat interval (spark.executor.heartbeatInterval: '180s'), and have seen how to increase memoryOverhead by changing settings on a yarn cluster (using spark.yarn.executor.memoryOverhead), but not on a standalone cluster.
In my config file, I have experimented by adding each of the following settings one at a time (none of which have worked):
spark.memory.fraction: 0.3
spark.executor.extraJavaOptions: '-Xmx24g'
spark.driver.memory: "64G"
spark.driver.extraJavaOptions: '-XX:MaxHeapSize=1024m'
spark.driver.extraJavaOptions: '-XX:+UseG1GC'
UPDATE: my full current yml config file is as follows:
default:
  # local settings
  sparklyr.sanitize.column.names: TRUE
  sparklyr.cores.local: 3
  sparklyr.shell.driver-memory: "8G"
  # remote core/memory settings
  spark.executor.memory: "32G"
  spark.executor.cores: 5
  spark.executor.heartbeatInterval: '180s'
  spark.ext.h2o.nthreads: 10
  spark.cores.max: 30
  spark.memory.storageFraction: 0.6
  spark.memory.fraction: 0.3
  spark.network.timeout: 300
  spark.driver.extraJavaOptions: '-XX:+UseG1GC'
  # other configs for spark
  spark.serializer: org.apache.spark.serializer.KryoSerializer
  spark.executor.extraClassPath: /var/lib/cassandra/jar/guava-18.0.jar
  # cassandra settings
  spark.cassandra.connection.host: <cassandra_ip>
  spark.cassandra.auth.username: <cassandra_login>
  spark.cassandra.auth.password: <cassandra_pass>
  spark.cassandra.connection.keep_alive_ms: 60000
  # spark packages to load
  sparklyr.defaultPackages:
    - "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M1"
    - "com.databricks:spark-csv_2.11:1.3.0"
    - "com.datastax.cassandra:cassandra-driver-core:3.0.2"
    - "com.amazonaws:aws-java-sdk-pom:1.10.34"
So my questions are:
Does anyone have any ideas about what to do in this instance?
Are there config settings I can change to help with this issue?
Alternatively, is there a way to import the cassandra data in batches with RStudio/sparklyr as the driver?
Or alternatively again, is there a way to munge/filter/edit data as it is brought into cache so that the resulting table is smaller (similar to using SQL querying, but with more complex dplyr syntax)?
OK, I've finally managed to make this work!
I'd initially tried the suggestion of @user6910411 to decrease the cassandra input split size, but this failed in the same way. After playing around with LOTS of other things, today I tried changing that setting in the opposite direction:
spark.cassandra.input.split.size_in_mb: 254
INCREASING the split size meant fewer Spark tasks, and thus less overhead and fewer calls to the GC. It worked!
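For reference, the same setting can also be passed from R when connecting with sparklyr, instead of (or as well as) putting it in the yml file. A minimal sketch, with the master URL as a placeholder:
library(sparklyr)

conf <- spark_config()   # picks up config.yml defaults if present
conf[["spark.cassandra.input.split.size_in_mb"]] <- 254   # larger splits -> fewer tasks, less GC overhead

sc <- spark_connect(master = "spark://<master_ip>:7077", config = conf)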

Rmpi, OpenCPU, and Apparmor: DENIED request for "/"

I have an R package that sends out a job to the OpenMPI cluster I have running by means of the Rmpi package. All works as expected within an R session run from the console. However, when I try to execute the relevant function from my OpenCPU server like this (details changed to protect the innocent):
curl -XPOST http://99.999.999.99/ocpu/library/MyPackage/R/my_cluster_function
I get this error:
R call failed: process died.
(Other, non-cluster calling functions within the package work as expected via OpenCPU). I noticed in /var/log/kern.log a variety of requests being DENIED by apparmor, and I have been able to resolve most of them by adding entries into /etc/apparmor.d/opencpu.d/custom to allow OpenMPI to access the files it needs. However, I cannot resolve these two issues (again, IP address changed) related to "open" requests for location "/":
Oct 26 03:49:58 99.999.999.99 kernel: [142952.551234] type=1400 audit(1414295398.849:957): apparmor="DENIED" operation="open" profile="opencpu-main" name="/" pid=22486 comm="orted" requested_mask="r" denied_mask="r" fsuid=33 ouid=0
Oct 26 03:49:58 99.999.999.99 kernel: [142952.556422] type=1400 audit(1414295398.857:958): apparmor="DENIED" operation="open" profile="opencpu-main" name="/" pid=22485 comm="apache2" requested_mask="r" denied_mask="r" fsuid=33 ouid=0
Adding this to my apparmor rules did not help:
/* r,
Two questions:
Why is opencpu trying to read from my root level directory (or does this mean something else)?
More urgently, how can I resolve this apparmor issue?
Thanks.
You might need to add both apparmor rules:
/ r,
/* r,
The first rule allows directory listing of / and the second rule allows read access to the files directly under /.
I don't understand why Rmpi wants to read /, or why you were getting a "process died" error instead of "access denied". Are you sure the problem is completely resolved?
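Concretely, a sketch of what the /etc/apparmor.d/opencpu.d/custom file mentioned in the question might contain once both rules are added:
# let orted/apache2 list / and read the entries directly under it
/ r,
/* r,
Reload apparmor afterwards (e.g. sudo service apparmor reload) so the updated profile is applied.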

Couchbase 2.2 - Cannot connect to console after installation

I have installed Couchbase 2.2.0 for Windows 7 64-bit. I installed it using the default options. The service gets installed and I can stop/start or restart it without errors. However, I cannot connect to the management console on port 8091.
I've found some posts on how to deal with this but they all relate to older versions of Couchbase and reference files/options I can't find in version 2.2 (for example this post: Unable to connect to http://localhost:8091/index.html).
When I try netstat -an -p tcp I can't see any service listening on port 8091 so I suspect something goes wrong during startup.
Looking in the couchbase log in /var/lib/couchbase/logs I can see some errors but they don't make sense to me.
[error_logger:error,2014-02-11T17:13:16.536,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.233.0> terminating
** Last message in was {die,{abnormal,3}}
** When Server state == {state,ns_server,5000,
{1392,135189,391066},
undefined,infinity}
** Reason for termination ==
** {abnormal,3}
[ns_server:debug,2014-02-11T17:13:16.536,babysitter_of_ns_1@127.0.0.1:<0.235.0>:supervisor_cushion:init:39]starting ns_port_server with delay of 5000
[error_logger:error,2014-02-11T17:13:16.536,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: supervisor_cushion:init/1
pid: <0.233.0>
registered_name: []
exception exit: {abnormal,3}
in function gen_server:terminate/6
ancestors: [child_ns_server_sup,ns_babysitter_sup,<0.58.0>]
messages: []
links: [<0.73.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 2584
stack_size: 24
reductions: 2365
neighbours:
The Windows firewall is running, but since I installed the same version of Couchbase on another machine with the same Windows version without any trouble, I can't imagine that the firewall is causing this problem.
I'm out of options though, I have no idea why I can't get this to work.
In the meantime, I have uninstalled 2.2 and tried falling back to 2.1 (with the same result) and moving forward to 2.5 (with the same result). In all cases I turned off antivirus software and stopped the Windows firewall to eliminate this cause.
Thanks to the nice people at Couchbase.com, the problem has been solved.
Please refer to :
https://www.couchbase.com/issues/browse/MB-10245
https://www.couchbase.com/issues/browse/MB-8760
The solution is to reinstall couchbase in a location without spaces in the path. Due to a bug, spaces in the installation path lead to trouble.
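For example (an illustration, not taken from the tickets), reinstalling into a directory such as C:\Couchbase\Server rather than a path containing spaces avoids the bug.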
