How to build large/busy RSS feed

I've been playing with RSS feeds this week, and for my next trick I want to build one for our internal application log. We have a centralized database table that our myriad batch and intranet apps use for posting log messages. I want to create an RSS feed off of this table, but I'm not sure how to handle the volume: there could be hundreds of entries on a normal day. An exceptional, make-you-want-to-quit kind of day might see a few thousand. Any thoughts?

I would make the feed a static file (static files can easily be served by the thousands), regenerated periodically. That gives you a much broader choice of implementation, because generation doesn't have to finish in under a second; it can even take minutes. Users still get perfect download speed and a reasonable update delay.
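For illustration, a minimal sketch of that approach, meant to run from cron every few minutes; the table name, columns, and output path are assumptions:

import sqlite3
import xml.etree.ElementTree as ET

def regenerate_feed(db_path='logs.db', out_path='/var/www/feed.xml'):
    # Any DB-API driver works the same way; sqlite3 keeps the sketch small.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        'SELECT logged_at, message FROM log_messages '
        'ORDER BY logged_at DESC LIMIT 100'
    ).fetchall()

    rss = ET.Element('rss', version='2.0')
    channel = ET.SubElement(rss, 'channel')
    ET.SubElement(channel, 'title').text = 'Application log'
    ET.SubElement(channel, 'link').text = 'http://feedserver/rss/'
    ET.SubElement(channel, 'description').text = 'Centralized application log'
    for logged_at, message in rows:
        item = ET.SubElement(channel, 'item')
        ET.SubElement(item, 'title').text = message[:80]
        ET.SubElement(item, 'description').text = message
        # Real feeds want RFC 822 dates here; str() is a placeholder.
        ET.SubElement(item, 'pubDate').text = str(logged_at)

    ET.ElementTree(rss).write(out_path, encoding='utf-8', xml_declaration=True)

Serving the resulting feed.xml is then plain static hosting, and regenerating every minute or five is plenty for a dashboard-style feed.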

If you are building a system with notifications that must not be missed, then a pub-sub mechanism (using XMPP, one of the other protocols supported by ActiveMQ, or something similar) will be more suitable than a syndication mechanism. You need some measure of coupling between the system that is generating the notifications and the ones that are consuming them, to ensure that consumers don't miss notifications.
(You can do this using RSS or Atom as a transport format, but it's probably not a common use case; you'd need to vary the notifications shown based on the consumer and which notifications it has previously seen.)
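If missed notifications really matter, here is a hedged sketch of the pub-sub side using the stomp.py client against a broker's STOMP port (ActiveMQ exposes 61613 by default); the host, credentials, topic name, and payload shape are all assumptions:

import json
import stomp

conn = stomp.Connection([('localhost', 61613)])
conn.connect('user', 'password', wait=True)

# Publish to a topic: every connected consumer receives the message, and
# durable subscriptions let consumers catch up on anything they missed.
conn.send(
    destination='/topic/app.log.notifications',
    body=json.dumps({'severity': 'FATAL', 'message': 'Batch job X failed'}),
)
conn.disconnect()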

I'd split up the feeds as much as possible and let users recombine them as desired. If I were doing it I'd probably think about using Django and the syndication framework.
Django's models could probably handle representing the data structure of the tables you care about.
You could have a URL pattern that catches everything and hands the whole path to the view, like: r'^rss/((?:[\w-]+/)+)$' (I think that should work, but I can't test it right now, so it might not be perfect).
That way you could use URLs like:
http://feedserver/rss/batch-file-output/
http://feedserver/rss/support-tickets/
http://feedserver/rss/batch-file-output/support-tickets/ (the first two combined into one)
Then in the view:
def get_batch_file_messages():
    # Grab all the recent batch file messages here.
    # Maybe cache the result and only regenerate every so often.
    return []

# Other feed functions here.

feed_mapping = {'batch-file-output': get_batch_file_messages}

def rss(request, feed_path):
    items_to_display = []
    for feed in feed_path.strip('/').split('/'):
        items_to_display += feed_mapping[feed]()
    # Process the items and return the feed response here.
Having individual, chainable feeds means that users can subscribe to one feed at a time, or merge the ones they care about into one larger feed. Whatever's easier for them to read, they can do.

Without knowing your application, I can't offer specific advice.
That said, it's common in these sorts of systems to have a level of severity. You could have a query string parameter tacked onto the end of the URL that specifies the severity. If set to "DEBUG" you would see every event, no matter how trivial. If you set it to "FATAL" you'd only see the events that were "System Failure" in magnitude.
If there are still too many events, you may want to sub-divide your events into some sort of category system. Again, I would have this as a query string parameter.
You can then have multiple RSS feeds for the various categories and severities. This should let you tune the level of alerts you get to an acceptable level.
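For illustration, a sketch of that query-string filter on top of the Django view above; the LOG_LEVELS ordering and the per-item 'severity' key are assumptions about the log table:

LOG_LEVELS = ['DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

def filter_by_severity(request, items):
    # ?severity=ERROR shows only errors and system failures.
    minimum = request.GET.get('severity', 'DEBUG').upper()
    if minimum not in LOG_LEVELS:
        minimum = 'DEBUG'
    threshold = LOG_LEVELS.index(minimum)
    return [i for i in items if LOG_LEVELS.index(i['severity']) >= threshold]

A category parameter would work the same way, keyed on whatever category column the log table carries.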

In this case, it's more of a manager's dashboard: how much work was put into support today, whether anything pressing is in the log right now, and, first thing in the morning, what went wrong with the batch jobs overnight.

Okay, I decided how I'm going to handle this. I'm using the timestamp field on each row and grouping by day. It takes a little SQL-fu since there's a full timestamp there and I need to be semi-intelligent about picking which log message to show from within each group, but it's not too bad. I'm also building it to let you select which application to monitor, and then showing every message (max 50) from a specific day.
That gets me down to something reasonable.
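For the curious, a minimal sketch of that per-application, per-day query, with assumed table and column names (app_log, app_name, logged_at, message):

import sqlite3

def messages_for_day(conn, app_name, day, limit=50):
    # day is an ISO date string like '2009-04-01'; date() is SQLite's
    # truncation, other databases spell it differently.
    sql = (
        'SELECT logged_at, message FROM app_log '
        'WHERE app_name = ? AND date(logged_at) = ? '
        'ORDER BY logged_at DESC LIMIT ?'
    )
    return conn.execute(sql, (app_name, day, limit)).fetchall()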
I'm still hoping for a good answer to the more generic question: "How do you syndicate many important messages, where missing a message could be a problem?"

Related

Cronofy available_periods does not do anything

I am trying to get a user's availability, but in one use case I want to ignore their actual availability rules and even their current schedule and calendar. Basically, in this use case I am using Cronofy just to provide me a list of times.
According to the docs https://docs.cronofy.com/developers/api/scheduling/availability/
I should be able to specify participants.members.available_periods.start and participants.members.available_periods.end. I've read and re-read these params over and over and am sure I am sending them as specified, but Cronofy still returns only times when the user is not "busy".
Am I still not understanding this param? Is there another way to ignore a user's calendar, i.e. ignore their "busy" time slots?
The intent of the participants.members.available_periods parameter is to define a specific participant's availability hours for the query. It's useful in multi-person queries when one or more participants have ad-hoc shift patterns or other complicated working hours. You can specify these inline, or use our Available Periods endpoint along with Managed Availability to have the availability query consider the latest set of Available Periods when it runs.
The Availability query isn't designed to ignore all calendar events for an individual participant but there is another way you can achieve what you're looking for.
Application Calendars are designed as drop-in replacements for synced-calendar Accounts in Cronofy. So you can create one or more of these via the API and use them in the Availability query as a stand-in for the participant.
They still support Managed Availability and can have events created in them. So if you wanted to ensure that your application doesn't double book over the events it already knows about, you can just create the events as your application books them.
I hope this helps. Our support team (support@cronofy.com) is always happy to talk through the specifics of your use case if that would be helpful.
-- UPDATE --
We've decided to support this as a first class concept in our API.
You can now pass an empty array to the participants.members.calendar_ids attribute to indicate that you don't want any calendars included within the availability query; only the Availability Rules or query periods will then be considered. Thanks for the question.
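For illustration, a hedged sketch of such a query using Python's requests; the token, sub, duration, and period values are placeholders, and the exact request shape should be confirmed against the docs linked above:

import requests

resp = requests.post(
    'https://api.cronofy.com/v1/availability',
    headers={'Authorization': 'Bearer ACCESS_TOKEN'},
    json={
        'participants': [{
            'members': [{
                'sub': 'acc_1234567890',
                'calendar_ids': [],  # ignore all of this member's calendars
            }],
            'required': 'all',
        }],
        'required_duration': {'minutes': 30},
        'available_periods': [{
            'start': '2021-06-01T09:00:00Z',
            'end': '2021-06-01T17:00:00Z',
        }],
    },
)
print(resp.json())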

How to find which kinds are not being used in Google Datastore

Is there any way to list the kinds that are not being used by our App Engine app in Google's Datastore, without having to look into our code and/or logic? : )
I'm not talking about indexes, which I can list by issuing a
gcloud datastore indexes list
and then comparing with datastore-indexes.xml or index.yaml.
I tried to check datastore kinds statistics and other metadata but I could not find anything useful to help me on this matter.
Should I give up on finding a way for Datastore to provide useful stats, and instead code something to keep collecting statistics (like data size) over a long period, to get at least a clue about which kinds are not being used, and only then look into our app code to see if the kind's model was removed?
Example:
select bytes from __Stat_Kind__
Store it somewhere and keep updating it for a period. If the kind's bytes size does not change, then the kind is probably not being used anymore.
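A minimal sketch of that polling idea using the google-cloud-datastore client; appending snapshots to a JSON-lines file is just an assumption to keep the example small:

import json
from google.cloud import datastore

def snapshot_kind_sizes(path='kind_sizes.jsonl'):
    client = datastore.Client()
    stats = {e['kind_name']: e['bytes']
             for e in client.query(kind='__Stat_Kind__').fetch()}
    with open(path, 'a') as f:
        f.write(json.dumps(stats) + '\n')
    # Run on a schedule; kinds whose bytes never change across
    # snapshots are candidates for the unused list.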
The idea is to do some cleaning in datastore.
I would like to find which kinds are not being used anymore, maybe for a long time, or were created manually to be used once... You know, like a table in Oracle that no one knows what it is used for, where the statistics would show that it was only used once, five years ago. I'm trying to achieve the same in Datastore: I want to know which kinds are not being used anymore, or were used a while ago, then ask around and back up/delete them if no owner is found.
It's an interesting question.
I think you would be best placed to audit your code and instill an organizational practice that requires this documentation going forward, as a business/technical pre-production requirement.
IIRC, Datastore doesn't automatically timestamp entities, and keys (rightly) aren't incremental. So there appears to be no intrinsic mechanism to track changes short of taking a snapshot (expensive) and comparing your in-flight and backup copies for changes (also expensive and inconclusive).
One challenge with identifying a Kind that appears to be non-changing is that it could be referenced (rarely) by another Kind and so, while it does not change, it is required.
Auditing your code and documenting it for posterity should not only provide a definitive answer (and identify owners) but also pay off significant technical debt, heading off this problem and similar future ones (e.g. GDPR-like requirements).
Assuming you are referring to records being created/updated, I can think of the following options:
Via the Cloud Console (Datastore > Dashboard): this lists all your kinds and the number of records in each. Theoretically, you can take a screenshot and compare the counts over time to see which kinds have grown and which have not.
Use of Created/LastModified date columns: I usually add these two columns to most of my Datastore tables. If you have them, you can write a function that queries each kind in descending order of creation (or last-modified) date and pulls only the first record. That tells you the last time a record was created or modified in that kind.
I would write such a function as part of my app, put it behind a page that requires admin privilege (only the app creator can run it), and then just clicking a link in my app would give me the information.
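A sketch of that second option, assuming the kinds carry a last_modified property:

from google.cloud import datastore

def last_activity(kinds):
    client = datastore.Client()
    report = {}
    for kind in kinds:
        # Newest record first; requires last_modified to be indexed.
        query = client.query(kind=kind, order=['-last_modified'])
        latest = list(query.fetch(limit=1))
        report[kind] = latest[0]['last_modified'] if latest else None
    return report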

Checking how many times a document has been read in Firestore?

I am working on a video based app that keeps track of how many views that video has received. I originally planned on having a field for view_count in my document that I would write to after someone watches a video.
However, knowing how many writes that could end up leading to, I started to wonder if it's possible to see a breakdown of how many reads have been made for each document in a collection and use that number instead. Since the videos are short, I figured this would be an accurate number for the view count.
Is it possible to access this kind of data?
Firestore does not expose any per-document access metrics. The available monitoring options are described in its documentation on monitoring usage.
If you want something beyond that you'll have to build it yourself, as you originally intended.
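For completeness, a sketch of the write-side approach the question started with, using Firestore's atomic increment so that recording a view needs no prior read; the collection and field names are assumptions:

from google.cloud import firestore

client = firestore.Client()

def record_view(video_id):
    client.collection('videos').document(video_id).update(
        {'view_count': firestore.Increment(1)}
    )

If a single video document gets very hot (sustained writes above roughly one per second), the usual pattern is to shard the counter across several documents and sum the shards on read.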

How to yield multiple requests but only accept one request in Simpy

Within SimPy, I have multiple resources that can do the same job, but they are different, so I can't just increase the capacity. Picture a single queue in a shopping centre that leads to all the tellers: some are manned and some are self-serve. I put in a request for both (two separate requests) and then yield rq_manned | rq_selfserve, satisfied if at least one of the requests is granted.
The problem is: what if they both become available at the same time? I don't want to actually request them both. What to do?
Something like this might work:
with rq_manned.request() as manned_req, rq_selfserve.request() as sserve_req:
    result = yield manned_req | sserve_req
    if manned_req in result:
        do_manned_register_stuff()
    else:
        do_selfserve_register_stuff()
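One caveat with the with-block version: if both requests are granted in the same instant, the losing slot stays held until the block exits. Here is a hedged sketch that withdraws the loser explicitly, assuming manned and selfserve are simpy.Resource objects:

def customer(env, manned, selfserve):
    req_m = manned.request()
    req_s = selfserve.request()
    result = yield req_m | req_s
    use_manned = req_m in result
    winner, loser = (req_m, req_s) if use_manned else (req_s, req_m)
    won_res, lost_res = (manned, selfserve) if use_manned else (selfserve, manned)
    # Withdraw the losing request: release it if it was granted in the
    # same instant, cancel it if it is still queued.
    if loser in result:
        lost_res.release(loser)
    else:
        loser.cancel()
    yield env.timeout(5)  # service time, an arbitrary placeholder
    won_res.release(winner)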
I guess that the central issue is that SimPy doesn't see a User asking for any of several resources. It sees independent Requests on each one of the resources, without worrying about who made them.
Therefore, as you pointed out, SimPy's yield env.any_of() is not useful here, given that every single Request will go through the process of queueing, using and releasing its respective resource.
I was personally struggling with similar issues and ended up deciding to create a higher layer of abstraction on top of SimPy, which is now the Chronon project.
In Chronon, your request would be expressed as something like:
yield user.waits(
    [self_teller01, self_teller02, ..., manned_teller01, manned_teller02, ...],
    which='any'
)
by which the user waits for any of the resources in the list, withdrawing all the other requests when access is obtained to one of them.
This example of a bike sharing system demonstrates all the functionality you probably need.

How to hack-proof a data submission program

I am writing a score submission system for games where I need to ensure that reports back to the server are not falsified (aka, hacked).
I know that I can store a password or private passkey in the program to authenticate or encrypt the request but if the program is decompiled, a crafty hacker can extract the password/passkey and use it to falsify reports.
Does a perfect solution exist?
Thanks in advance.
No. All you can do is make it difficult for cheaters.
You don't say what environment you're running on, but it sounds like you're trying to solve a code authentication problem*: knowing that the code that is executing is actually what you think it is. This is a problem that has plagued online games forever and does not have a good solution.
Common ways in which such systems are broken:
Capture, modification and replay of submissions to the server
Modifying the binary to allow cheating
Using a debugger to modify the submission in-memory before the program applies signatures/encryption/whatever
Punkbuster is an example of a system which attempts to solve some of these problems: http://en.wikipedia.org/wiki/PunkBuster
Also consider http://en.wikipedia.org/wiki/Cheating_in_online_games
Chances are, this is probably too hard for your game. Embedding a signing key in your binary and signing everything that leaves the client will probably put you well ahead of the pack, security-wise.
* Apologies, I don't actually remember what the formal name for this is. I keep thinking "running code authentication", but Google comes up with nothing for the term.
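As a concrete but deliberately modest illustration of that idea, a sketch that signs submissions with an HMAC over a key embedded in the client; the key is extractable by anyone who decompiles the binary, so this raises the bar rather than closing the hole:

import hashlib
import hmac
import json

EMBEDDED_KEY = b'not-actually-secret-once-decompiled'

def sign_submission(payload):
    # Client side: canonicalize, then sign.
    body = json.dumps(payload, sort_keys=True).encode()
    signature = hmac.new(EMBEDDED_KEY, body, hashlib.sha256).hexdigest()
    return {'payload': payload, 'signature': signature}

def verify_submission(message):
    # Server side: recompute and compare in constant time.
    body = json.dumps(message['payload'], sort_keys=True).encode()
    expected = hmac.new(EMBEDDED_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message['signature'])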
There is one thing you can do - record all of the user inputs and send those to the server as part of the submission. The server can then replay the inputs through a local copy of the game engine to determine the score. Obviously this isn't appropriate for every type of game, though. Depending on the game, you may need to include replay protection.
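A sketch of what that server-side check might look like; GameEngine is hypothetical and stands in for a deterministic, headless build of the game logic:

def verify_score(seed, recorded_inputs, claimed_score):
    engine = GameEngine(seed=seed)  # hypothetical; must be fully deterministic
    for frame, action in recorded_inputs:
        engine.advance_to(frame)
        engine.apply(action)
    engine.finish()
    return engine.score == claimed_score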
Another method that may be appropriate for some types of games is to include a video recording of the high-scoring play within the submission. Provide links to the videos from the high score table, along with a link to report suspicious entries. This will let you "crowd-source" cheat detection - if a cheater's score hits the table at number 1, then the players behind scores 2 through 10 have a pretty big incentive to validate the video for you. If a score is reported enough times, you can check the video yourself and decide if it should be removed (and the user banned).
