Understanding Quorum/Vnode (R vs PR) - riak

I was doing some experiments to understand Riak. Here's something interesting I found.
I have a cluster of 2 nodes and a bucket type that has an n_val of 2:
[root@co-riak002 ~]# riak-admin ring-status
================================== Claimant ===================================
Claimant: 'riak@10.172.48.68'
Status: up
Ring Ready: true
============================== Ownership Handoff ==============================
No pending changes.
============================== Unreachable Nodes ==============================
All nodes are up and reachable
[root@co-riak002 ~]# riak-admin bucket-type create testBucket '{"props":{"n_val":2}}'
testBucket created
[root@co-riak002 ~]# riak-admin bucket-type activate testBucket
testBucket has been activated
And then I wrote something into it:
[root@co-riak002 ~]# curl -XPUT -d '{"bar":"foo"}' -H "Content-Type: application/json" http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?w=2&returnbody=true
[1] 10890
[root@co-riak002 ~]#
[1]+ Done curl -XPUT -d '{"bar":"foo"}' -H "Content-Type: application/json" http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?w=2
Now I can read it fine with both r=2 and pr=2:
[root@co-riak002 ~]# curl http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?r=2
{"bar":"foo"}
[root@co-riak002 ~]# curl http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?pr=2
{"bar":"foo"}
After I killed one of the nodes, r=2 still reads fine, but pr=2 does not:
[root@co-riak002 ~]# riak-admin ring-status
================================== Claimant ===================================
Claimant: 'riak@10.172.48.68'
Status: up
Ring Ready: true
============================== Ownership Handoff ==============================
No pending changes.
============================== Unreachable Nodes ==============================
The following nodes are unreachable: ['riak@10.172.48.66']
With r=2:
[root@co-riak002 ~]# curl http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?r=2
{"bar":"foo"}
With pr=2:
[root@co-riak002 ~]# curl http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?pr=2
PR-value unsatisfied: 1/2
I am confused - shouldn't the quorum number r used in a read operation mean the number of replicas/physical nodes that need to agree before returning data? Why does r=2 still succeed here with a node down? And why does pr=2 fail, when pr is supposed to count vnodes rather than physical nodes?
I'm pretty new to this space. Any pointers are much appreciated.

You should distinguish between a "sloppy quorum" and a "strict quorum".
As you probably know, a hash function is applied to each key to calculate where that key must be located in the Riak cluster. The entire space of hash values is called the "ring", and it is divided equally between vnodes (virtual nodes), which in turn are assigned to physical nodes. The assignment is done in such a way as to ensure that adjacent vnodes belong to distinct physical nodes for reliability, although that's not always possible.
If replication is turned on (i.e. n_val > 1), a key is written not only to its destination vnode, but also to the few vnodes that follow it on the ring (on different physical nodes in most cases - see above). Those are the primary vnodes for that key.
However, under a sloppy quorum (for instance, W = 2), if a primary node is not available, replicas of the key will be written to whichever vnode can take them, potentially even on the same physical node. That's OK, because they will be handed off to the "right" vnodes as soon as the problem is fixed and the primary nodes become available again.
If you don't want to risk replicas being written to the same physical node even temporarily, or want to make sure the client receives the most up-to-date values, you can explicitly require that all or at least some of the writes go only to primary vnodes (PW = 2, where "P" stands for "primary"). This comes at the expense of high availability, though. The same logic works for reads (R vs. PR).
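To make the difference concrete, here is a minimal sketch using the bucket and key from the question (the quorum values are only illustrative; quoting the URL also keeps the shell from treating & as a background operator):
# Sloppy-quorum read: fallback vnodes may count toward r
curl "http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?r=2"
# Primary-only read: only primary vnodes count toward pr
curl "http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?pr=2"
# Primary-only write: fails unless 2 primary vnodes accept the write
curl -XPUT -d '{"bar":"foo"}' -H "Content-Type: application/json" "http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?pw=2"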
Hope this helps.
I strongly recommend reading "A Little Riak Book".
Also, the online documentation is excellent.

shouldn't the Quorum number r used in reading operation mean the number of replicas/physical nodes that need to agree before returning data?
Not exactly. The read quorum (r) is the number of vnodes that must provide an acceptable response. When you read with one node down, the rest of the cluster (in this case the remaining node) will start up fallback vnodes for any missing vnodes as needed.
When your read request with r=2 arrives, since one vnode in the preflist is unavailable, a fallback is started up. Naturally, that fallback is empty when first started, so the read process receives notfound from the fallback and the stored object from the other vnode.
The trick here is the notfound_ok setting in the bucket properties or request options. If it is left at the default of notfound_ok=true, the notfound is considered a valid response, so the operation meets the quorum, the response with data trumps the notfound, and the client gets back an object. This also triggers read repair, which populates the fallback with that object, so the next get request will get 2 objects and no notfound responses.
If notfound_ok is false, the first read request will see only 1 valid response and fail, but read repair still happens, so the next r=2 request succeeds because the fallback also has the data.
It is a valid tactic to use r=1 with notfound_ok=false for reads to get high availability and the fastest possible response, while keeping reasonable assurance that you won't get false notfound responses when a node fails.
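For instance, a read along those lines might look like this (a sketch; if your Riak version does not accept notfound_ok as a query parameter, set it in the bucket properties instead):
curl "http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?r=1&notfound_ok=false"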

Related

.in-addr.arpa. not found: 3(NXDOMAIN)

I have been struggling with this for about 3 days now. I will continue to work on it as I wait for anyone to help, but I'm having the following problem. I will use examples in this post to mask the domains and IPs somewhat. This is not to make more work for you; I just don't want it easily cached in search results on Google etc. Thank you in advance for any help.
I have installed WHM on a CloudLinux system hosted on a VM using VMware. The domain (in this case let's call it domain.co.za) was used as the hostname of the system, and if you go to that domain it actually loads. That domain name points to Cloudflare, which in turn has A records pointing back to the WHM server, which I would like to use as the nameservers. This system is currently using PowerDNS as well.
What I have also encountered is that ns1.domain.co.za is working fine (this is also the machine's hostname) but ns2.domain.co.za is not.
If I try to set nameservers for any other domains, it does not allow me to change them, and they give the following error:
Authoritative Nameserver failure for domain
I am assuming this is because of the following error shown when I use intoDNS to check what the problem is (this is not for domain.co.za; it is for a domain I own called orginc.co.za, for which only ns1.domain.co.za is accepted and not ns2.domain.co.za):
The following nameservers are listed at your nameservers as nameservers for your domain, but are not listed at the parent nameservers
When I use a dig command I get the following results for ns2 (please note the actual IPs have been changed):
Host 20.20.20.164.in-addr.arpa. not found: 3(NXDOMAIN)
[root@ns1 ~]# dig ns2.domain.co.za
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.3 <<>> ns2.domain.co.za
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 61082
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;ns2.domain.co.za. IN A
;; Query time: 0 msec
;; SERVER: 164.20.20.20#53(164.20.20.20)
;; WHEN: Sat Feb 13 12:11:12 SAST 2021
;; MSG SIZE rcvd: 51
I have been reading around and it seems like the general consensus is that it is a reverse DNS issue, but I'm not sure how to proceed. I get answers like the following, which I found on a cPanel forum:
This functionality only works if your data center has delegated permission to your server to control the entry
But at the end of the day, we own the physical hardware that we put in at the data center.
I do not know how to proceed at the moment but will keep trying in the meantime
I am assuming domain.co.za is a dummy domain name and not the actual one.
From what I've read so far, it seems that you may have some trouble with the domain's NS records.
The first thing to check is who is configured as NS for domain.co.za:
$ dig NS domain.co.za
Then make sure that, whatever NS entries are returned, those servers have the zone entries for the domain. What I assume is the case is that you have ns1.domain.co.za as well as ns2.domain.co.za, but for one reason or another only one of these is aware of your entries.
Typically you'd host your zone on ns1.domain.co.za and then authorise ns2.domain.co.za to fetch your zone entries (known as AXFR), so that both name servers are in sync and have all the zone entries. This might be where the problem is if ns2 is unable to fetch the zone. This is a long shot in the dark here, but you can try this:
ns2$ dig @ns1.domain.co.za AXFR domain.co.za
NS2 should be able to obtain zone entries from NS1.
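Beyond the zone transfer, a few more dig checks along the same lines may help narrow things down (a sketch using the names from above):
$ dig NS domain.co.za +short                      # which NS set is actually published
$ dig @ns1.domain.co.za domain.co.za SOA +short   # does ns1 answer authoritatively?
$ dig @ns2.domain.co.za domain.co.za SOA +short   # does ns2 answer at all?
$ dig @ns1.domain.co.za ns2.domain.co.za A +short # is there an A record for ns2 in the zone?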
Again, all the above is just a wild guess ;-)
OK everyone, so the sequence of events went as follows:
There was an A record mismatch on WHM itself, as ns2.iclixhosting.co.za was not set in the iclixhosting.co.za zone
Reverse DNS had to then propagate
We then had a firewall issue that needed a bypass for port 53 on that IP
In other words, for future people reading this answer: check the above-mentioned items if you have problems similar to this.
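A few quick commands to verify each of those items (a sketch; <ns2-IP> is a placeholder for the nameserver's public IP):
$ dig ns2.iclixhosting.co.za A +short        # is the A record present in the zone?
$ dig -x <ns2-IP> +short                     # has the reverse (PTR) record propagated?
$ dig @<ns2-IP> iclixhosting.co.za SOA       # is port 53/UDP reachable through the firewall?
$ dig @<ns2-IP> iclixhosting.co.za SOA +tcp  # and port 53/TCP?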

What does `TCPBacklogDrop` mean when using `netstat -s`?

Hi all,
Recently I have been debugging a problem on a Unix system using the command
netstat -s
and I get an output with
$ netstat -s
// other fields
// other fields
TCPBacklogDrop: 368504
// other fields
// other fields
I have searched for a while to understand what this field means, and got mainly two different answers:
This means that your TCP data receive buffer is full, and some packets have overflowed
This means your TCP accept buffer is full, and there are some disconnections
Which is the correct one? Any official document to support it?
Interpretation #2 is referring to the queue of sockets waiting to be accepted, possibly because its size is set (more or less) by the value of the parameter named backlog to listen. This interpretation, however, is not correct.
To understand why interpretation #1 is correct (although incomplete), we will need to consult the source. First note that the string "TCPBacklogDrop" is associated with the Linux identifier LINUX_MIB_TCPBACKLOGDROP (see, e.g., this). This is incremented here in tcp_add_backlog.
Roughly speaking, there are 3 queues associated with the receive side of an established TCP socket. If the application is blocked on a read when a packet arrives, it will generally be sent to the prequeue for processing in user space in the application process. If it can't be put on the prequeue, and the socket is not locked, it will be placed in the receive queue. However, if the socket is locked, it will be placed in the backlog queue for subsequent processing.
If you follow through the code you will see that the call to sk_add_backlog called from tcp_add_backlog will return -ENOBUFS if the receive queue is full (including that which is in the backlog queue) and the packet will be dropped and the counter incremented. I say this interpretation is incomplete because this is not the only place where a packet could be dropped when the "receive queue" is full (which we now understand to be not as straightforward as a single queue).
I wouldn't expect such drops to be frequent and/or problematic under normal operating conditions, as the sender's TCP stack should honor the advertised window of the receiver and not send data exceeding the capacity of the receive queue (with the exception of zero window probes and older kernel versions whose calculations could cause drops when the receive window was not actually full). If it is somehow indicative of a problem, I would start worrying about malicious clients (some form of DDoS maybe) or some failure causing a socket's lock to be held for an extended period of time.
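If you want to confirm whether the counter is still growing on a live system, something like this should work (assuming the iproute2 nstat tool is installed):
$ netstat -s | grep -i backlog               # one-off snapshot of the counter
$ nstat -az TcpExtTCPBacklogDrop             # absolute value of the same counter
$ watch -d 'nstat -az TcpExtTCPBacklogDrop'  # highlight changes over time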

Is it necessary to use a queue to save messages received from clients and pending to be forwarded to the backend server?

I want to write a proxy server for SMB2 based on Asio and am considering using a cumulative buffer to receive a full message so as to do the business logic, and introducing a queue for multiple messages, which will force me to synchronize the following resource accesses:
the read and write operations on the queue, because the two upstream/downstream queues are shared between the frontend client and the backend server,
the backend connection state, because reads on the frontend won't wait for the completion of connects or writes on the backend server before the next read, and
the resource release when an error occurs or a connection is closed normally, because the read and write handlers on the same socket registered with the EventLoop may not yet have completed, and an asynchronous connect operation can be initiated in worker threads while its partner socket has been closed, and those may run concurrently.
If I don't use the two queues, only one (read, write or connect) handler is registered with the EventLoop on the proxy flow for a request, so there is no need to synchronize.
From the Application level,
I think a cumulative buffer is generally a must in order to process a full message packet (e.g. a message in the format | length (4 bytes) | body (variable) |) after multiple related API calls (system APIs: recv or read, or library APIs: asio::read).
And then, is it necessary to use a queue to save messages received from clients and pending to be forwarded to the backend server?
I use the following diagram from http://www.partow.net/programming/tcpproxy/index.html; it turned out to express thoughts similar to mine (the upstream concept is as in NGINX upstream servers).
                                   ---> upstream --->            +---------------+
                                                      +---->----->               |
                              +------------+          |          | Remote Server |
                     +-------->          [x]--->------+   +--<--[x]              |
                     |        | TCP Proxy  |              |      +---------------+
 +-----------+       | +--<--[x]  Server  <------<--------+
 |          [x]--->---+ |     +------------+
 |  Client   |          |
 |           <----<----+
 +-----------+
               <--- downstream <---
    Frontend                                                        Backend
For a request-response protocol without a message ID field (useful for matching each reply message to the corresponding request message), such as HTTP, I can use one single buffer per connection for the two downstream and upstream flows, and then continue processing the next request (note that for the first request, a connection to the server is attempted, so it's slower than the subsequent ones), because clients always wait (they may block or get notified by an asynchronous callback function) for the response after sending a request.
However, for a protocol in which clients don't wait for the response before sending the next request, a message ID field can be used to uniquely identify or distinguish request-reply pairs, for example JSON-RPC 2.0, SMB2, etc. If I strictly complete the two flows above before issuing the next read (i.e. without calling read, letting TCP data accumulate in the kernel), the subsequent requests from the same connection cannot be processed in a timely manner. After reading What happens if one doesn't call POSIX's recv "fast enough"? I think it can be done.
I also did an SMB2 proxy test using one single buffer for the two downstream and upstream flows on Windows and Linux using the ASIO networking library (also included in Boost.Asio). I used smbclient as a client on Linux to create 251 connections (see the following command):
ft=$(date '+%Y%m%d_%H%M%S.%N%z'); for ((i = 2000; i <= 2250; ++i)); do smbclient //10.23.57.158/fromw19 user_password -d 5 -U user$i -t 100 -c "get 1.96M.docx 1.96M-$i.docx" >>smbclient_${i}_${ft}_out.txt 2>>smbclient_${i}_${ft}_err.txt & done
Occasionally, it printed several errors, "Connection to 10.23.57.158 failed (Error NT_STATUS_IO_TIMEOUT)". If I increased the number of connections, the number of errors would increase, so is there a threshold? In fact, those connections were completed within 30 seconds, and I also set the timeout for smbclient to 100. What's wrong?
Now, I know those problems need to be resolved. But here, I just want to know "Is it necessary to use a queue to save messages received from clients and pending to be forwarded to the backend server?" so I can determine my goal because it causes a great deal of difference.
Maybe they don't need to care about the application message format; the following examples request the next read after completing the write operation to the peer:
HexDumpProxyFrontendHandler.java or tcpproxy based on c++ Asio.
Other References
[Computer Networks: A Systems Approach] 5.3 Remote Procedure Call - Overcoming Network Limitations
[Computer Networks: A Systems Approach] 5.3 Remote Procedure Call - Overcoming Network Limitations at github
JSON RPC at wikipedia

DICOM fail to use c-move : Move Request Failed: 0006:0317 Peer aborted Association (or never connected)

I am running the following command with the movescu tool from the OFFIS DICOM toolkit (DCMTK):
movescu -k 0010,0020="PAT004" ip_adress 104 -aec serverAET -aet myAET --study -ll debug -od data
And I keep getting the error.
The association seems to have worked well, but the actual C-MOVE seems to fail at the moment of the transfer.
(screenshot of the error message)
The screenshot shows that you can successfully establish the connection, but the server aborts after receiving the request.
You did not specify the mandatory key QueryRetrieveLevel (0008,0052).
Add
-k 0008,0052="PATIENT"
to your command, and it should work.
However, moving means that the server (the MOVE SCP) is prompted to transfer the images matched by the request to a destination application entity. This must be specified by providing the AET of that system:
-aem <AET of the destination>
This frequently fails for one of these reasons:
The move destination AE title must be resolved to an IP address and port. This is done through the C-MOVE SCP's configuration, so the destination AET has to be known there.
A Storage SCP has to listen for the images transferred in the scope of the C-MOVE; its IP, AET and port have to match the MOVE SCP's configuration for the move destination AE title.
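Putting both pieces together, the command might end up looking roughly like this (a sketch only: destAET is a placeholder for the move destination AE title configured on the server, and the retrieve level has to match the information model you query with):
movescu -k 0008,0052="PATIENT" -k 0010,0020="PAT004" --patient -aet myAET -aec serverAET -aem destAET -ll debug -od data ip_address 104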

Calculating Server Processing Time With Curl

I am getting request timing info with curl using the --write-out option as described in this article.
Here is example output from one of my requests:
time_namelookup: 0.031
time_connect: 0.328
time_appconnect: 1.560
time_pretransfer: 1.560
time_redirect: 0.000
time_starttransfer: 1.903
----------
time_total: 2.075
My question is: How can I determine how much time the server took processing the request? Is the answer:
time_starttransfer - time_connect
That is, the time from when the connection was established to when the server starting sending its response? That seems correct but I want to be sure.
Details about curl timing variables may be found here: http://curl.haxx.se/libcurl/c/curl_easy_getinfo.html
Yes, (time_starttransfer - time_connect) is the time from when the connect was noticed by curl until the first byte arrived. Do note that it also includes the transfer time, so for a remote site it will be longer simply because of that.
I'd say that you're right: (time_starttransfer - time_connect) definitely is an amount of time the server took to process the request.
However, I also wondered what the difference is between time_connect and time_pretransfer (intrigued by @Schwartzie's and @Cheeso's comments).
By looking at several curl queries on the web I observed that sometimes they're equal and sometimes they're not.
Then I figured out (at least I believe so) that they differ only for HTTPS requests, because the server needs some time for the SSL layer, which is not exactly the time spent by the target app but the time spent by the server hosting the app/service.
The time for the SSL handling (together with the connect time given in time_connect) is time_appconnect, and only when it's 0 (as for non-HTTPS requests) are time_connect and time_pretransfer EQUAL; otherwise, for HTTPS requests they differ, and time_pretransfer will be equal to time_appconnect (and not to time_connect).
Check the following two examples:
curl -kso /dev/null -w "time_connect=%{time_connect}, time_appconnect:%{time_appconnect}, time_pretransfer=%{time_pretransfer}\n" http://www.csh.rit.edu
time_connect=0.128, time_appconnect:0.000, time_pretransfer=0.128
curl -kso /dev/null -w "time_connect=%{time_connect}, time_appconnect:%{time_appconnect}, time_pretransfer=%{time_pretransfer}\n" https://www.csh.rit.edu
time_connect=0.133, time_appconnect:0.577, time_pretransfer=0.577
So I'd say that time_pretransfer is more accurate to use than time_connect, since it accounts for SSL connections too, and maybe some other things I'm not aware of.
With all the previous being said, I'd say that more precise answer for the question:
"How can I determine how much time the server took processing the request?"
would probably be:
time_starttransfer - time_pretransfer
as @Schwartzie already mentioned, I just wanted to understand why.
If you don't want to count the SSL handshake part then it's time_starttransfer - time_pretransfer
Here is a nice diagram from this cloudflare blog
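For example, a one-liner along those lines (a sketch; https://example.com is a placeholder URL):
curl -so /dev/null -w "%{time_pretransfer} %{time_starttransfer}\n" https://example.com | awk '{printf "server processing: %.3f s\n", $2 - $1}'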
