ClickHouse distributed table node stops accepting TCP connections
ClickHouse version: 20.3.9.70 (official build)
(I know this version is no longer supported; we have plans to upgrade, but it takes time.)
The Setup
We are running three query nodes (nodes with Distributed tables only); we spun up the third one yesterday. All nodes point to the same storage nodes and tables.
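For reference, the cluster definition the Distributed tables point at can be checked on any of the nodes over HTTP; a rough sketch (assuming the default HTTP port 8123, columns are from system.clusters):
curl -s 'http://localhost:8123/' --data-binary "SELECT cluster, shard_num, replica_num, host_name, port FROM system.clusters"
All three query nodes should return the same set of storage hosts.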
The Problem
The new node serves requests just fine over TCP and HTTP for up to 11 hours. After that, the ClickHouse server starts to close TCP connections. HTTP still works just fine when this happens.
Extra Information/Evidence
The TCPConnection value in system.metrics steadily drops over time for the new node.
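Since HTTP keeps working, the metric can be polled over the HTTP interface while the TCP side degrades; a rough sketch (assuming the default HTTP port 8123 and the standard metric names TCPConnection/HTTPConnection in system.metrics):
# poll connection metrics once a minute over HTTP
while sleep 60; do
  echo "--- $(date -Is)"
  curl -s 'http://localhost:8123/' --data-binary "SELECT metric, value FROM system.metrics WHERE metric IN ('TCPConnection', 'HTTPConnection')"
done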
netstat shows a lot of TIME_WAIT connections:
netstat -ntp | tail -n+3 | awk '{print $6}' | sort | uniq -c | sort -n
2 LAST_ACK
380 CLOSE_WAIT
386 ESTABLISHED
29279 TIME_WAIT
Normal node for comparison:
1199 CLOSE_WAIT
1292 ESTABLISHED
186 TIME_WAIT
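The huge TIME_WAIT pile on the new node suggests very high connection churn, so it is worth checking where those sockets point (the server ports vs. outgoing connections to the storage nodes) and what the kernel limits look like; a rough sketch (assuming net-tools netstat column layout and the default ports 9000/8123):
# top peers the TIME_WAIT sockets belong to
netstat -ant | awk '$6 == "TIME_WAIT" {print $5}' | sort | uniq -c | sort -rn | head
# local ports they sit on (9000 = native TCP, 8123 = HTTP, high ports = outgoing connections)
netstat -ant | awk '$6 == "TIME_WAIT" {print $4}' | sed 's/.*://' | sort | uniq -c | sort -rn | head
# kernel settings relevant to TIME_WAIT / ephemeral-port pressure
sysctl net.ipv4.ip_local_port_range net.ipv4.tcp_fin_timeout net.ipv4.tcp_tw_reuse net.ipv4.tcp_max_tw_buckets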
Opening clickhouse-client is not possible:
user@server:~$ clickhouse-client
ClickHouse client version 20.3.9.70 (official build).
Connecting to localhost:9000 as user default.
Code: 32. DB::Exception: Attempt to read after eof
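Before going to the logs, it is also worth looking at the listener itself: if the accept queue on port 9000 overflows or the kernel drops SYNs, the symptoms would look exactly like this. A rough sketch (assuming the default ports and that ss/netstat are available):
# accept-queue state of the native and HTTP listeners (for LISTEN sockets, Recv-Q = queued, Send-Q = backlog limit)
ss -ltn '( sport = :9000 or sport = :8123 )'
# kernel counters for overflowed/dropped listen queues
netstat -s | grep -iE 'listen|overflow'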
The following shows up in logs:
2021.12.15 19:00:29.215048 [ 25146 ] {e2f742e013b7d83f5d1d6e524afc5d2b} <Warning> ConnectionPoolWithFailover: Connection failed at try №1, reason: Code: 32, e.displayText() = DB::Exception: Attempt to read after eof (version 20.3.9.70 (official build))
2021.12.15 19:03:32.098881 [ 25536 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, e.displayText() = Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
0. /build/obj-x86_64-linux-gnu/../contrib/poco/Foundation/src/Exception.cpp:27: Poco::IOException::IOException(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x1053e380 in /usr/lib/debug/usr/bin/clickhouse
1. /build/obj-x86_64-linux-gnu/../contrib/poco/Net/src/NetException.cpp:26: Poco::Net::NetException::NetException(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0xe38f6ed in /usr/lib/debug/usr/bin/clickhouse
2. /build/obj-x86_64-linux-gnu/../contrib/libcxx/include/string:2134: Poco::Net::SocketImpl::error(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (.cold) @ 0xe3a5093 in /usr/lib/debug/usr/bin/clickhouse
3. /build/obj-x86_64-linux-gnu/../contrib/libcxx/include/string:2134: Poco::Net::SocketImpl::peerAddress() @ 0xe3a0633 in /usr/lib/debug/usr/bin/clickhouse
4. /build/obj-x86_64-linux-gnu/../src/IO/ReadBufferFromPocoSocket.cpp:66: DB::ReadBufferFromPocoSocket::ReadBufferFromPocoSocket(Poco::Net::Socket&, unsigned long) @ 0x902ffd7 in /usr/lib/debug/usr/bin/clickhouse
5. /build/obj-x86_64-linux-gnu/../contrib/libcxx/include/type_traits:3696: DB::TCPHandler::runImpl() @ 0x9023905 in /usr/lib/debug/usr/bin/clickhouse
6. /build/obj-x86_64-linux-gnu/../programs/server/TCPHandler.cpp:1235: DB::TCPHandler::run() @ 0x9025470 in /usr/lib/debug/usr/bin/clickhouse
7. /build/obj-x86_64-linux-gnu/../contrib/poco/Net/src/TCPServerConnection.cpp:57: Poco::Net::TCPServerConnection::start() @ 0xe3ac69b in /usr/lib/debug/usr/bin/clickhouse
8. /build/obj-x86_64-linux-gnu/../contrib/libcxx/include/atomic:856: Poco::Net::TCPServerDispatcher::run() @ 0xe3acb1d in /usr/lib/debug/usr/bin/clickhouse
9. /build/obj-x86_64-linux-gnu/../contrib/poco/Foundation/include/Poco/Mutex_STD.h:132: Poco::PooledThread::run() @ 0x105c3317 in /usr/lib/debug/usr/bin/clickhouse
10. /build/obj-x86_64-linux-gnu/../contrib/poco/Foundation/include/Poco/AutoPtr.h:205: Poco::ThreadImpl::runnableEntry(void*) @ 0x105bf11c in /usr/lib/debug/usr/bin/clickhouse
11. /build/obj-x86_64-linux-gnu/../contrib/libcxx/include/memory:2615: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void* (*)(void*), Poco::ThreadImpl*> >(void*) @ 0x105c0abd in /usr/lib/debug/usr/bin/clickhouse
12. start_thread @ 0x8184 in /lib/x86_64-linux-gnu/libpthread-2.19.so
13. __clone @ 0xfe03d in /lib/x86_64-linux-gnu/libc-2.19.so
(version 20.3.9.70 (official build))
Attempted Remedies/Debugging
Restarting ClickHouse on the host temporarily fixes the problem; we have tried this once. The node ends up in the same state again after 10-11 hours of operation.
There are no helpful logs at INFO level before the TCP connection count starts dwindling.
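Since a restart clears it, some per-process resource may be getting exhausted. It may be worth checking the file-descriptor usage of the server process next time before restarting; a rough sketch (the pgrep pattern and paths are assumptions):
# file-descriptor usage of the running server vs. its limit
CH_PID=$(pgrep -f clickhouse-server | head -n1)
grep 'Max open files' /proc/$CH_PID/limits
ls /proc/$CH_PID/fd | wc -l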
# this returns nothing
cat clickhouse-server.log.18-43-to-18-52 | grep -vE 'Done processing|Client has not sent any data|executeQuery|Processed in'
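Beyond filtering out the routine lines, grepping the error log directly for socket/accept-level messages around the time the connection count starts dropping might show something; a rough sketch (default log path; the patterns are just examples):
grep -E 'Too many open files|Socket is not connected|Attempt to read after eof|Timeout exceeded' /var/log/clickhouse-server/clickhouse-server.err.log | tail -n 50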
HTTP still works just fine when this happens