What exactly is the timekey_wait parameter in Fluentd?

I'm new to EFK. I have a problem with showing logs in Kibana. I already resolved it, but I'm not sure about my approach.
Problem: Kibana only shows logs after 10 minutes when Elasticsearch is restarted.
I studied https://docs.fluentd.org/configuration/buffer-section
and found out that if I set the timekey_wait parameter in the buffer section to 0s, Kibana shows logs without delay.
The problem is resolved, but I'm still concerned about the timekey_wait parameter.
Is anything else impacted by the change?
Why is timekey_wait needed? Please give me an example of why it is necessary.
Thank you for your time!

1 & 2) According to the documentation, with the timekey_wait parameter Fluentd waits the specified amount of time before writing chunks. This way, delayed log lines that need to go into the same chunk won't be missed.
If your timekey is 60m and timekey_wait is 10m, chunks will be written after 70m, not 60m.
If you don't have delayed log lines coming in, it becomes less important. In one of my implementations I use the flush_interval parameter instead; that way timekey is not needed (buffer chunks are flushed after that interval).
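For reference, a minimal buffer section along those lines might look like the sketch below; the match pattern, output plugin and values are placeholders, not taken from your setup:
<match app.**>
  @type elasticsearch
  # connection settings omitted
  <buffer time>
    timekey 60m        # chunks are keyed per 60-minute window
    timekey_wait 10m   # wait 10 more minutes for late events before flushing
    # alternative mentioned above: drop timekey and flush on an interval instead
    # flush_interval 5s
  </buffer>
</match>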

Related

Why is Gremlin Server / JanusGraph ignoring some of my requests?

I'm using the Gremlin Python library to perform traversals on a JanusGraph deployment of Gremlin Server (the same also happens using just TinkerGraph). Some long traversals (with thousands of instructions) don't get a response: no errors, no timeouts, no log entries or errors on the server or client. Nothing.
The conditions for this silent treatment aren't clear. The described behaviour doesn't depend linearly on bytes or number of instructions. For instance, this code will hang forever for me:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin', 't'))
g = g.inject("")
for i in range(0, 8000):
    g = g.constant("test")  # build a traversal with thousands of chained steps
print(f"submitting traversal with length={len(g.bytecode.step_instructions)}")
result = g.next()
print(f"done, got: {result}")  # this is never reached
It doesn't depend on just the number of bytes in the request message, since the number of instructions beyond which I don't get a response doesn't change even with very large constant values in place of just "test". For instance, injecting 7000 values with many paragraphs of Lorem Ipsum works as expected and returns in a few milliseconds.
While it shouldn't matter (since I should be getting a proper error instead of nothing), I've already increased server-side maxHeaderSize, maxChunkSize, maxContentLength etc. to ridiculously high numbers. Changing the serialization format (e.g. from GraphSONMessageSerializerV3d0 to GraphBinaryMessageSerializerV1) doesn't help either.
Note: I know that very long traversals are an anti-pattern in Gremlin, but sometimes it's not possible or very inefficient to structure traversals such that they can use injected values instead.
I've answered this question on gremlin-users not realizing it was also asked here on StackOverflow. For completeness, I'll duplicate my response here.
The issue is less related to bytes and string lengths and more to the length of the traversal chain (i.e. the number of steps in your traversal). You end up hitting a JVM limit on the stack size on the server. You can increase the stack size on the JVM by changing the -Xss value, which should allow you a longer traversal length. That will likely come with the need to re-examine other JVM settings like -Xmx and perhaps garbage collection options.
I do find it interesting that you don't get any error messages though - you should see a StackOverflowError somewhere, unless the server is just wholly bogged down by your request. I'd consider throwing more -Xmx at it to see if you can get it to respond with an error at least, or keep an eye on the server logs to at least see it surface there.
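As a rough sketch of where to change this, assuming the stock gremlin-server.sh start script (which, as far as I recall, picks up the JAVA_OPTIONS environment variable); the values are placeholders to tune for your workload:
export JAVA_OPTIONS="-Xms512m -Xmx4096m -Xss4m"
bin/gremlin-server.sh conf/gremlin-server.yaml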

How to add and check for constraints when sending logs using the CloudWatch Logs SDK

I successfully sent my application logs, which are in JSON format, to CloudWatch Logs using the CloudWatch Logs SDK, but I could not understand how to handle the constraints imposed by the endpoint.
Question 1: Documentation says
If you call PutLogEvents twice within a narrow time period using the
same value for sequenceToken, both calls may be successful, or one may
be rejected.
Now what does "may" mean here? Is there no certain outcome?
Question 2:
The restriction is that 10,000 InputLogEvents are allowed in one batch; this is not too hard to handle in code, but there is a size constraint too: only 1 MB can be sent in one batch. Does that mean that every time I append an InputLogEvent to the batch I need to calculate the size of the batch? Do I need to check both the number of InputLogEvents and the overall batch size when sending logs? Isn't that too cumbersome?
Question 3
What happens if the batch hits 1 MB at the 100th character of one of my InputLogEvents? I cannot simply send an incomplete last log with just 100 characters; would I have to take that InputLogEvent out of the batch entirely and send it as part of another batch?
Question 4
With multiple Docker containers writing logs, the sequence token will change constantly, and a lot of calls will fail because the sequence token keeps changing.
Question 5:
In the official POC they have not checked any constraints at all. Why is that?
PutBatchEvent POC
Am I thinking in the right direction?
Here's my understanding of how to use CloudWatch Logs. Hope this helps.
Question 1: I believe there is no guarantee due to the nature of distributed systems: your request can land on the same cluster and be rejected, or land on different clusters and both calls may be accepted.
Question 2 & Question 3: For me, log events should always be small and put fairly rapidly. Most logging frameworks help you configure this (batch size for AWS, number of lines for file logging, etc.), so take a look at those frameworks. If you batch yourself, you do need to track both limits; see the sketch below.
Question 4: Each of your containers (or any parallel application unit) should use and maintain its own sequence token, and each of them will get a separate log stream.
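To illustrate questions 2-4, here is a rough, unofficial sketch of batching with boto3; the helper names and structure are mine, and the per-event overhead constant is what I recall from the PutLogEvents documentation, so verify it before relying on it:
import boto3

MAX_EVENTS = 10_000          # events per batch
MAX_BATCH_BYTES = 1_048_576  # ~1 MB per batch
PER_EVENT_OVERHEAD = 26      # bytes counted per event when computing batch size

logs = boto3.client("logs")

def put_batch(group, stream, batch, token):
    kwargs = dict(logGroupName=group, logStreamName=stream, logEvents=batch)
    if token:
        kwargs["sequenceToken"] = token
    return logs.put_log_events(**kwargs)["nextSequenceToken"]

def send_in_batches(group, stream, events, token=None):
    """events: list of {'timestamp': ms_int, 'message': str}, sorted by timestamp."""
    batch, batch_bytes = [], 0
    for event in events:
        size = len(event["message"].encode("utf-8")) + PER_EVENT_OVERHEAD
        # flush before either the count or the size limit would be exceeded
        if batch and (len(batch) >= MAX_EVENTS or batch_bytes + size > MAX_BATCH_BYTES):
            token = put_batch(group, stream, batch, token)
            batch, batch_bytes = [], 0
        batch.append(event)
        batch_bytes += size
    if batch:
        token = put_batch(group, stream, batch, token)
    return token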

Status of accessing the current offset of a consumer?

I see that there was some discussion on this subject before[*], but I cannot see any way to access this. Is it possible yet?
Reasoning: we have, for example, ContainerStoppingErrorHandler. It would be vital, if the container was automatically stopped, to know where it stopped. I guess ContainerStoppingErrorHandler could be improved to log the partition/offset, as it has access to the Consumer (I hope I'm not mistaken). But if I'm stopping/pausing it manually via MessageListenerContainer, I cannot see a way to get info about the last processed offset. Is there a way?
[*] https://github.com/spring-projects/spring-kafka/issues/431

The operation could not be performed because the filter is in the wrong state GetCurrentBuffer

The operation could not be performed because the filter is in the wrong state
I am getting this error when attempting to run hr = m_pGrabber->GetCurrentBuffer(&cbBuffer, NULL);.
The strange part is that it initially worked when I stopped the graph; now it fails whether the graph is running or stopped.
So what state should it be in?
The sample grabber code in MSDN that I copied does not say if the graph should be stopped or running to get the buffer size, but the way it is presented the graph is running. I assume the graph should be running to fill the buffer, but I am not getting past sizing the buffer.
The graph is OK; all filters are connected and it renders as required, in my app and in GraphEdit.
I am trying to save the captured still frame into a bitmap file, so I need the captured data in the buffer.
Buffering and GetCurrentBuffer expose a copy of the last known media sample. Hence, you might hit the conditions "no media sample available yet to copy from" and "last known media sample was released due to the transition to stopped state". In both cases the request in question might fail. Copy data from SampleCB instead of using buffered mode and this is going to be one hundred percent reliable.
See also: ISampleGrabber::GetCurrentBuffer() returning VFW_E_WRONG_STATE
Using GetCurrentBuffer is a bad idea in most cases. The proper way to use the sample grabber is to set your callback and receive data in SampleCB.

Can you sacrifice performance to get concurrency in Sqlite on a NFS?

I need to write a client/server app stored on a network file system. I am quite aware that this is a no-no, but was wondering if I could sacrifice performance (Hermes: "And this time I mean really slash.") to prevent data corruption.
I'm thinking something along the lines of the following (a rough sketch follows the list):
Create a separate file in the system every time a write is called (I'm willing to do it for every connection if necessary)
Store the file name as the current millisecond timestamp
Check to see if a file with that time or earlier exists
If one exists, wait a random time between 0 and 10 ms, and try again.
While my file has the earliest timestamp, do the work, then delete the file lock; otherwise wait 10 ms and try again.
If a file persists for more than a minute, log it as an error and stop until a person determines that the data is not corrupted.
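Here is the rough sketch mentioned above, in Python; the paths and timings are illustrative only, and none of the NFS caveats (clock skew, caching, O_EXCL atomicity on older NFS) go away:
import glob
import os
import random
import time

LOCK_DIR = "/mnt/share/locks"  # hypothetical shared directory next to the SQLite file

def acquire_lock(timeout_s=60):
    deadline = time.time() + timeout_s
    while True:
        my_lock = os.path.join(LOCK_DIR, f"{int(time.time() * 1000)}.lock")
        try:
            # O_CREAT | O_EXCL fails if a lock with the same timestamp already exists
            os.close(os.open(my_lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
            break
        except FileExistsError:
            time.sleep(random.uniform(0, 0.01))  # same-millisecond collision: back off 0-10 ms
    while time.time() < deadline:
        # whoever holds the earliest-named lock file gets to do its work
        if min(glob.glob(os.path.join(LOCK_DIR, "*.lock"))) == my_lock:
            return my_lock
        time.sleep(0.01)  # not the earliest yet: wait 10 ms and re-check
    os.unlink(my_lock)
    raise TimeoutError("lock outstanding for over a minute; check the data manually")

def release_lock(my_lock):
    os.unlink(my_lock)

# usage: lock = acquire_lock(); do the write; release_lock(lock)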
The problem I see is maintaining the previous state if something locks up, or choosing to ignore it if the state change was actually successful.
Is there a better way of doing this, that doesn't involve not doing it this way? Or has anyone written one of these with far fewer problems than the SQLite FAQ warns about? Will these mitigations even factor into preventing data corruption?
A couple of notes:
This must exist on an NFS; the why is not important because it is not my decision to make (it doesn't look like I was clear enough on that point).
The number of readers/writers on the system will be between 5 and 10, all reading and writing at the same time, but rarely to the same record.
There will only be clients and a shared storage space; there is no way to put a server on there, or to use a server-based RDBMS. If there were, obviously I would do it in a New York minute.
The amount of data will initially start at about 70 MB (plain text, uncompressed), and it will grow continuously from there at a reasonable, but not tremendous, rate.
I will accept an answer of "No, you can't gain reasonably guaranteed concurrency on an NFS by sacrificing performance" if it contains a detailed and reasonable explanation of why.
Yes, there is a better way. Don't use NFS to do this.
If you are willing to create a new file every time something changes, I expect that you have a small amount of data and/or very infrequent changes. If the data is small, why use SQLite at all? Why not just have files with node names and timestamps?
I think it would help if you described the real problem you are trying to solve a bit more. For example if you have many readers and one writer, there are other approaches.
What do you mean by "concurrency"? Do you actually mean "multiple readers/multiple writers", or can you get by with "multiple readers/one writer with limited latency"?

Resources