EMR - create user log from log - hadoop-streaming

EMR Newbie Alert:
We have large logs containing the usage data of our web site. Customers are authenticated and identified by their customer id. Whenever we try to troubleshoot a customer issue we grep through all the logs (using the customer_id as search criteria) and pipe the results into a file. Then we use the results file to troubleshoot the issue. We were thinking about using EMR to create per-customer log files so we don't have to create a per-customer log file on demand. EMR would do it for us every hour for every customer.
We were looking at EMR streaming and produced a little Ruby script for the map step. Now we have a large list of key/value pairs (userid, logdata).
We're stuck on the reduce step, however. Ideally I'd want to generate a file with all the logdata of a particular customer and put it into an S3 bucket. Can anybody point us to how we'd do this? Is EMR even the technology we want to use?
Thanks,
Benno

One possibility would be to use the identity reducer, stipulating the number of reduce tasks via a job property beforehand. You would arrive at a fixed number of files, in which all the records for a set of users would live. To find the records for a particular user, hash the user id to determine which file they landed in and search there.
If you really want one file per user, your reducer should generate a new file every time it is called. I'm pretty sure there are plenty of S3 client libraries available for Ruby.
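If you go the one-file-per-user route, a reducer along these lines might work (a minimal sketch in Python rather than Ruby; the bucket name and the boto3-based S3 upload are assumptions, not part of the original setup):
import sys
import boto3

BUCKET = "my-per-user-logs"  # hypothetical bucket name
s3 = boto3.client("s3")

def close_and_upload(user, handle):
    # Finish the local file for this user and push it to S3 under a per-user key.
    handle.close()
    s3.upload_file("%s.log" % user, BUCKET, "per-user/%s.log" % user)

current_user = None
out = None
for line in sys.stdin:
    # Streaming hands the reducer lines sorted by key: "userid<TAB>logdata"
    user, _, logdata = line.rstrip("\n").partition("\t")
    if user != current_user:
        if out is not None:
            close_and_upload(current_user, out)
        current_user = user
        out = open("%s.log" % user, "w")
    out.write(logdata + "\n")
if out is not None:
    close_and_upload(current_user, out)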

Without looking at your code, yes, this is typically pretty easy to do in MapReduce; the best case scenario here is if you have many, many users (who doesn't want that?), and a somewhat limited number of interactions per user.
Abstractly, your input data will probably look something like this:
File 1:
1, 200, "/resource", "{metadata: [1,2,3]}"
File 2:
2, 200, "/resource", "{metadata: [4,5,6]}"
1, 200, "/resource", "{metadata: [7,8,9]}"
Where this is just a log of user, HTTP status, path/resource, and some metadata. Your best bet here is to really only focus your mapper on cleaning the data, transforming it into a format you can consume, and emitting the user id and everything else (quite possibly including the user id again) as a key/value pair.
I'm not extremely familiar with Hadoop Streaming, but according to the documentation, by default the prefix of a line up to the first tab character is the key, so this might look something like:
1\t1, 200, "/resource", "{metadata: [7,8,9]}"
Note that the 1 is repeated, as you may want to use it in the output, and not just as part of the shuffle. That's where the processing shifts from single mappers handling File 1 and File 2 to something more like:
1:
1, 200, "/resource", "{metadata: [1,2,3]}"
1, 200, "/resource", "{metadata: [7,8,9]}"
2:
2, 200, "/resource", "{metadata: [4,5,6]}"
As you can see, we've already basically done our per-user grep! It's just a matter of doing our final transformations, which may include a sort (since this is essentially time-series data). That's why I said earlier that this is going to work out much better for you if you've got many users and limited user interaction. Sorting (or sending across the network!) tons of MBs per user is not going to be especially fast (though potentially still faster than alternatives).
To sum, it depends both on the scale and the use case, but typically, yes, this is a problem well-suited to map/reduce in general.
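For completeness, the mapper side described above might look roughly like this (a sketch in Python; the comma-separated field layout is assumed from the example lines above):
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # Assumed layout: user, status, path, metadata (comma-separated).
    user = line.split(",", 1)[0].strip()
    # Emit "user<TAB>original line" so the user id drives the shuffle
    # but is still available in the value for the reducer/output.
    print("%s\t%s" % (user, line))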

Take a look at Splunk. This is an enterprise-grade tool designed for discovering patterns and relationships in large quantities of text data. We use it for monitoring the web and application logs for a large web site. Just let Splunk index everything and use the search engine to drill into the data -- no pre-processing is necessary.
Just ran across this: Getting Started with Splunk as an Engineer

Related

How are requests and tests organized in Postman?

I am investigating Postman but I can't find an optimal way to organize the requests. Let me give an example:
Let's imagine I'm working on testing two microservices and my goal is to automate the tests and create a report that informs us of the outcome. Each microservice has 10 possible requests to perform (GET, POST, etc.).
The ideal solution would be to have a collection which contains two folders (one per microservice) and that each folder has 10 requests; and to be able to run that collection from newman with a CSV/JSON data file, generating a final report (with tools like htmlextra).
The problem comes with data and iterations. Regarding iterations, some requests (for example some GETs) need only one execution, but others (for example some POSTs) need several executions to check the different statuses (whether they return a 200, a 400, etc.). Regarding the data, with 10 requests in the same collection and some of them occupying quite a few columns of the CSV, the data file becomes unreadable and hard to maintain.
So what I get when I run the collection is an unreadable data file and many iterations on requests that don't need them. As an alternative, I could create one collection for each request, and a data file for each collection, but in that case I could not get a report as a whole when executing from newman, since to execute several collections I must run several newman commands, each generating its own report file. So I would get 20 HTML files as reports, one for each request (also not readable).
Sorry if I am pointing in the wrong direction as I have no experience with the tool.
On the web I only see examples of reports and basic collections, but I'm left wanting to see something more 'real'.
Thanks a lot!!!
Summary of my notes:
Only 1 data file can be assigned to the collection runner.
Many requests have a lot of data (about 10 columns each), so if one data file serves several requests it becomes unreadable. Ideally there would be one file per request.
This forces us to create a collection for each request, since only 1 file per collection is feasible.
Newman can only run one collection per command; if you want to launch more, you have to duplicate the command, which forces you to create a report per collection. With one collection per request, we would get one report per request instead of a single report with the status of all the requests, which defeats the purpose.

How to find which kinds are not being used in Google Datastore

Is there any way to list the kinds that are not being used by our App Engine app in Google's Datastore, without having to look into our code and/or logic? :)
I'm not talking about indexes, which I can list by running
gcloud datastore indexes list
and then compare with the datastore-indexes.xml or index.yaml.
I tried to check datastore kinds statistics and other metadata but I could not find anything useful to help me on this matter.
Should I give up on finding ways for Datastore to provide useful stats, and instead code something that keeps collecting datastore statistics (like data size) over a long period, to have at least a clue of which kinds are not being used, and only after that research look into our app code to see if the kind's model was removed?
Example:
select bytes from __Stat_Kind__
Store it somewhere and keep updating it for a period. If the kind's bytes size does not change, then probably the kind is not being used anymore.
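A minimal sketch of that polling idea, assuming the google-cloud-datastore Python client (the output file name is just a placeholder); run it on a schedule and diff the bytes per kind over time:
from datetime import datetime, timezone
from google.cloud import datastore

client = datastore.Client()  # uses the default project and credentials

# __Stat_Kind__ holds per-kind statistics, refreshed periodically by Datastore.
query = client.query(kind="__Stat_Kind__")
snapshot = datetime.now(timezone.utc).isoformat()

with open("kind_stats.csv", "a") as out:
    for stat in query.fetch():
        out.write("%s,%s,%s\n" % (snapshot, stat["kind_name"], stat["bytes"]))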
The idea is to do some cleaning in datastore.
I would like to find which kinds are not being used anymore, maybe for a long time, or were created manually to be used once... You know, like a table in Oracle that no one knows what it is used for, and when we look at its statistics we see it was only used once, 5 years ago. I'm trying to achieve the same in Datastore: I want to know which kinds are not being used anymore, or were used a while ago, then ask around and back up/delete them if no owner is found.
It's an interesting question.
I think you would be best placed to audit your code and instill an organizational practice that requires this documentation to be maintained in future as a business/technical pre-production requirement.
IIRC, Datastore doesn't automatically timestamp Entities and keys (rightly) aren't incremental. So there appears no intrinsic mechanism to track changes short of taking a snapshot (expensive) and comparing your in-flight and backup copies for changes (also expensive and inconclusive).
One challenge with identifying a Kind that appears to be non-changing is that it could be referenced (rarely) by another Kind and so, while it does not change, it is required.
Auditing your code and documenting it for posterity should not only provide you with a definitive answer (and identify owners), it also pays off a significant technical debt and avoids this problem recurring with future requirements (e.g. GDPR-like ones) that will arise.
Assuming you are referring to records being created/updated, then I can think of the following options
Via the Cloud Console (Datastore > Dashboard) - This lists all your 'Kinds' and the number of records in each Kind. Theoretically, you can take a screen shot and compare the counts so that you know which one has experienced an increase or not.
Use of Created/LastModified Date columns - I usually add these 2 columns to most of my datastore tables. If you have them, then you can have a stored function that queries them. For example, you run a query to sort all of your Kinds in descending order of creation (or last modified date) and you only pull the first record from each one. This tells you the last time a record was created or modified.
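A quick sketch of that Created/LastModified check with the Python Datastore client (the kind names and the LastModified property are assumptions, and a descending index on that property is needed):
from google.cloud import datastore

client = datastore.Client()
kinds_to_check = ["Customer", "Order"]  # hypothetical kind names

for kind in kinds_to_check:
    query = client.query(kind=kind)
    query.order = ["-LastModified"]  # assumed audit property on each entity
    latest = list(query.fetch(limit=1))
    if latest:
        print(kind, "last modified", latest[0]["LastModified"])
    else:
        print(kind, "has no entities")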
I would write a function as part of my App, put it behind a page which requires admin privilege (only app creator can run it) and then just clicking a link on my App would give me the information.

Can you get a log file of 'reads' on specific RECID(Tablename) in Progress-4GL/Openedge at RunTime without access to Source Code?

I want to know which tables are being read by a query.
for each Customer where CustomerID = 12345.
Eventually this customer will be found by the query above, but Progress must 'read' many tables before getting to customer 12345.
How do I know exactly which tables are read (By CustomerID), prior to getting to customer 12345?
*NOTE: I do not have access to modify the code being run for this selection. Ideally I would run a separate set of code that is executed at the same time as the customer query above to track the reads.
EDIT: More clearly - Can you track reads from a given program (.p) OR ProcessID and output either a RECID or the PrimaryKey to a file?
I understand the information is being read off the Disk and probably stored in a database buffer. So how would I get at the information in the database buffer?
You seem to be mixing up a few different things.
In a situation like your example where you FIND a specific record in one, and only one table then there is just a single record read. Progress will find that record by first scanning a relevant index. That might be 2 or 3 "logical reads" of the b-tree to get to the proper node. The record block and index blocks may, or may not be read from disk - that depends on what has happened previously.
There are "Virtual System Tables" available that can tell you how many READ operations take place against a particular table or index. But they do not trace the specific ROWID or other identifying data. _TableStat and _IndexStat are aggregates for all users on the system, _UserTableStat and _UserIndexStat are specific to a particular user's activity. You do need to set the -tablerangesize and -indexrangesize parameters adequately to take advantage of these.
If you have enabled the table and index statistics then you can use a tool like ProTop - http://protop.wss.com to get insight into this activity. Or you can write your own code.
OpenEdge Auditing does not track reads. That would be prohibitively expensive.
It's probably not really a good idea but, in theory, you could write FIND triggers for the tables you are interested in. That doesn't require access to the application source but you would need a development license. It will probably kill performance to do this though - so unless this is a non-production test environment that you just want to fiddle with I wouldn't really do that.
You mention wanting to know how you got to that point. That sounds more like you might need to have a "4gl trace". One easy way to get the stack trace of a running process is to execute:
$DLC/bin/proGetStack PID (UNIX)
or
%DLC%\bin\proGetStack PID (Windows)
This command will generate a "protrace.pid" file containing a 4gl stack trace and other interesting information.
There are also more complicated ways to get that info like using PROMON and the "client statement cache" or setting various log entry types at session startup. But proGetStack is pretty convenient and requires no code or scripting changes.
Some great options from Tom above, and all of them may be relevant to you. The option he only skirts around is logging, and I feel obliged to expand on this because I'm giving a talk on it in a couple of weeks!
Assuming you are running a modern version of Progress, or even 10.2B08, then you have client logging available to you. Start your session with these additional options:
-clientlog "\somefolder\somefile.txt"
-logentrytypes "QryInfo:3"
This will log all the info of all the queries in your session to the file you specified above. If you navigate to the point in the system where you want to analyse your query and empty the logfile and save it, you can then run the offending query and see all the detail you need.
The output tells you all sorts of useful info, including the number of reads on each table, compared with the number returned to the user. You also get the index selected.
Using Tom's advice and/or this will get you what you need.

Explain Query Bands in Teradata

Can anyone explain Query Bands in Teradata?
I've searched a lot regarding this, but wasn't able to find information that I can understand.
Please be a bit detailed.
Thanks!!!
Query banding in Teradata:
Query banding provides contextual workflow information about a request.
Concept:
Scientists will often band the legs of birds with devices to track their flight paths. Monitoring and analyzing the data retrieved via the bands provides critical information about the species.
DBAs follow the same approach when they need more information about a query than is available by default.
Metadata such as the name of the requesting user, the work unit and the application name is important for workload management, for tracking overall use of the data warehouse, and for query troubleshooting.
The query banding feature links these metadata details to the query in the database.
A query band can contain any number of name/value pairs, such as the initiating user's corporate ID, department and location, as well as the time execution started.
Prashanth provided a good analogy with birds and bands. Adam is asking for specific situations. I can come up with several examples where query banding may be very useful:
Your system is used by hundreds of users via an application server with a custom application or a reporting tool like Business Objects, Tableau or Qlikview. The application server connects to Teradata using one user ID, but the administrator would still like to know which users, departments and groups of users generate each query, to be able to analyze it later in DBQL or simply to allocate proper system resources using TASM. For this, the application can be configured so that each query is "banded" with information like "AppUser:User1;Appgroup:DataScientists;QueryType:strategic02". Despite the fact that the application server uses one Teradata user and a limited number of connections to route all the queries from hundreds of users, each individual query is marked with exactly which user initiated it. You can then perform all kinds of analysis based on this information.
Suppose you have a complex ETL application and you want to track and analyze your execution of loads - what went wrong and when. Usually you would need to log all the steps of your ETL process, and in the logs you must specify a unique Load ID, Process ID, Step ID, etc. You do this because you want to be able to understand which specific process caused a halt or a performance degradation, and without such logging it would not be possible to distinguish runs of the same steps across different executions of your ETL application. A good alternative is to switch on DBQL and embellish your queries with query band information like Load ID, Process ID, Step ID, etc. That way you have all the necessary information in DBQL without having to create additional elaborate log tables.
SET QUERY_BAND = 'name=value; name2=value;' FOR SESSION|TRANSACTION;
This will tag your query with some name/value pairs. These can be used for workload management: for example, in TDWM you have throttles and priority management hooks that can prioritize everything banded with name2 set to "value". It means you can attach very rich detail to the session or transaction.
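For instance, an ETL step like the ones described above could set its band from a Python client roughly like this (a sketch assuming the teradatasql driver; host, credentials and band contents are placeholders):
import teradatasql

with teradatasql.connect(host="tdhost", user="etl_user", password="secret") as con:
    cur = con.cursor()
    # Tag everything this session runs with workload/troubleshooting metadata.
    cur.execute("SET QUERY_BAND = 'LoadId=42;ProcessId=7;StepId=3;' FOR SESSION")
    # Subsequent statements in the session carry the band into DBQL.
    cur.execute("SELECT COUNT(*) FROM dbc.TablesV")
    print(cur.fetchone())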
Yes, what you described can easily be done with query banding; think of it as a "wagon of key/value attributes in transit". You can access them via SQL or programmatically with session attributes in BTEQ or JDBC, for example.
Necromancing... The existing answers do a good job of explaining how query bands work, but since I could not find a complete working example, I thought I'd add one here.
Setting query bands in Teradata is already covered, so I will provide an example of how to set them from a .NET client:
private void SetQueryBands()
{
    // Start from the connection's current query band and add our own pairs.
    TdQueryBand qb = Connection.QueryBand;
    qb["CustomApplicationName"] = "MyAppName";
    foreach (string key in CustomQueryBands.Keys)
    {
        qb[key] = CustomQueryBands[key];
    }
    // Apply the updated band to the open connection.
    Connection.ChangeQueryBand(qb);
}

// Usage:
Connection = new TdConnection(GetConnectionString());
Connection.Open();
SetQueryBands();
More details can be found here.
To retrieve stored query band data, the GetQueryBandValue function can be used:
SELECT CollectTimestamp, QueryBand,
GetQuerybandValue(queryband, 0, 'Key1') AS Value1,
GetQuerybandValue(queryband, 0, 'Key2') AS Value2,
GetQuerybandValue(queryband, 0, 'Key3') AS Value3
FROM dbql_data.dbqlogtbl
WHERE dateofday = DATE - 1
AND queryband LIKE '%somekeyorvalue%'

How to build large/busy RSS feed

I've been playing with RSS feeds this week, and for my next trick I want to build one for our internal application log. We have a centralized database table that our myriad batch and intranet apps use for posting log messages. I want to create an RSS feed off of this table, but I'm not sure how to handle the volume- there could be hundreds of entries per day even on a normal day. An exceptional make-you-want-to-quit kind of day might see a few thousand. Any thoughts?
I would make the feed a static file (you can easily serve thousands of these), regenerated periodically. Then you have a much broader choice, because the regeneration doesn't have to run in under a second; it can even take minutes. And users still get perfect download speed and a reasonable update rate.
If you are building a system with notifications that must not be missed, then a pub-sub mechanism (using XMPP, one of the other protocols supported by ApacheMQ, or something similar) will be more suitable than a syndication mechanism. You need some measure of coupling between the system that is generating the notifications and the ones that are consuming them, to ensure that consumers don't miss notifications.
(You can do this using RSS or Atom as a transport format, but it's probably not a common use case; you'd need to vary the notifications shown based on the consumer and which notifications it has previously seen.)
I'd split up the feeds as much as possible and let users recombine them as desired. If I were doing it I'd probably think about using Django and the syndication framework.
Django's models could probably handle representing the data structure of the tables you care about.
You could have a URL pattern that captures everything after /rss/, for example r'^rss/(?P<feeds>[\w-]+(/[\w-]+)*)/$', and then split the captured string on '/' in the view (I can't test it right now, so it might not be perfect).
That way you could use URLs like:
http://feedserver/rss/batch-file-output/
http://feedserver/rss/support-tickets/
http://feedserver/rss/batch-file-output/support-tickets/ (both of the first two combined into one)
Then in the view:
def get_batch_file_messages():
    # Grab all the recent batch file messages here.
    # Maybe cache the result and only regenerate every so often.
    return []

# Other feed functions here.

feed_mapping = {'batch-file-output': get_batch_file_messages}

def rss(request, feeds):
    items_to_display = []
    for feed in feeds.split('/'):
        items_to_display += feed_mapping[feed]()
    # Process items_to_display and return the feed here.
Having individual, chainable feeds means that users can subscribe to one feed at a time, or merge the ones they care about into one larger feed. Whatever's easier for them to read, they can do.
Without knowing your application, I can't offer specific advice.
That said, it's common in these sorts of systems to have a level of severity. You could have a query string parameter that you tack onto the end of the URL to specify the severity. If set to "DEBUG" you would see every event, no matter how trivial. If you set it to "FATAL" you'd only see the events that were "System Failure" in magnitude.
If there are still too many events, you may want to sub-divide your events into some sort of category system. Again, I would make this a query string parameter.
You can then have multiple RSS feeds for the various categories and severities. This should allow you to tune the level of alerts you get to an acceptable level.
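As a sketch of that idea, borrowing the Django setup from the earlier answer (the LogEntry model, its fields, the template name and the severity-to-level mapping are all hypothetical):
from django.http import HttpResponse
from django.template.loader import render_to_string
from myapp.models import LogEntry  # hypothetical model with level, category, timestamp fields

# Map the severity names from the query string onto a numeric level column.
SEVERITY_LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "FATAL": 50}

def log_feed(request):
    min_level = SEVERITY_LEVELS.get(request.GET.get("severity", "DEBUG"), 10)
    category = request.GET.get("category")
    entries = LogEntry.objects.filter(level__gte=min_level)
    if category:
        entries = entries.filter(category=category)
    entries = entries.order_by("-timestamp")[:50]
    xml = render_to_string("log_feed.xml", {"entries": entries})
    return HttpResponse(xml, content_type="application/rss+xml")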
In this case, it's more of a manager's dashboard: how much work was put into support today, is there anything pressing in the log right now, and for when we first arrive in the morning as a measure of what went wrong with batch jobs overnight.
Okay, I decided how I'm gonna handle this. I'm using the timestamp column on each entry and grouping by day. It takes a little bit of SQL-fu to make it happen, since of course there's a full timestamp there and I need to be semi-intelligent about how I pick the log message to show from within the group, but it's not too bad. Further, I'm building it to let you select which application to monitor, and then showing every message (max 50) from a specific day.
That gets me down to something reasonable.
I'm still hoping for a good answer to the more generic question: "How do you syndicate many important messages, where missing a message could be a problem?"
