Graphite data loss after first retention policy

I am inserting data into the Graphite DB with the retention policy below in storage-schemas.conf
[default_1min_for_1day]
pattern = .*
retentions = 10s:2m,20s:4m
I have inserted data for the metrics key, but the data is lost after 2 minutes. I am not able to get the data with the render API below once 2 minutes have passed, and I cannot fetch it for the past 3 minutes, 1 hour, or with the current date.
GET : http://localhost:50000//render?target=metrics.*.api.proxy.north.*.*.danna.*.success.*&format=json&noNullPoints=true&from=20200110

You can follow the GitHub link
https://github.com/graphite-project/whisper/issues/289
The aggregationMethod is applied to this retention policy when crossing archive boundaries.
The first retention, 10s:5m, means Graphite will store 30 datapoints (one every 10 seconds for the last 5 minutes) in archive 0.
Please note that it will always store these datapoints, even if no data arrived. In that case Graphite will put NULLs there.
The next retention, 1m:1d, means that every minute Whisper will take 6 of these 10s datapoints from archive 0, apply the average() function and store the result in archive 1.
But please note that Whisper will only do so if at least 3 of those points (the 6 datapoints multiplied by xFilesFactor = 0.5) have values (i.e. are not NULLs). Otherwise Whisper decides it does not have enough data to propagate and stores a NULL instead.
Similarly, the third retention, 1h:30d, means that 60 datapoints from archive 1 will be aggregated using the average function and propagated to archive 2, but only if at least 30 of them have values, and so on.
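To make that arithmetic concrete, here is a small illustrative Python sketch (plain Python, not Graphite code) that prints, for the 10s:5m, 1m:1d, 1h:30d example above, how many points each archive holds and how many non-NULL points are needed for propagation with xFilesFactor = 0.5. Note also that with the retentions from the question (10s:2m,20s:4m) the longest archive only covers 4 minutes, so anything older than that is discarded by design.
import math

# (seconds per point, seconds retained) for the 10s:5m, 1m:1d, 1h:30d example
RETENTIONS = [(10, 5 * 60), (60, 24 * 3600), (3600, 30 * 24 * 3600)]
X_FILES_FACTOR = 0.5

for i, (step, span) in enumerate(RETENTIONS):
    print(f"archive {i}: one point every {step}s, {span // step} points kept")
    if i > 0:
        per_point = step // RETENTIONS[i - 1][0]        # points aggregated into one
        needed = math.ceil(per_point * X_FILES_FACTOR)  # non-NULL points required
        print(f"  propagation: {per_point} points from archive {i - 1}, "
              f"at least {needed} must be non-NULL")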

Related

How to design a recommendation system in DynamoDb based on likes

Considering performance as the main priority, what would be the best design approach for building a recommendation system in DynamoDB?
The system would need to store a URL and the number of times that topic was 'liked', but my requirements include the need to search by day, week, month and year, e.g.:
Give me the top 10 of the week
Give me the top 10 of the month
I was thinking about including the date and time information, so that the query could filter on this field, but I am not sure whether that is good in terms of performance.
If the only data structure you had was a hash map, how would you solve this problem?
What if on top of that constraint, you could only update any key up to 1000 times per second, and read a key up to 3000 per second?
How often do you expect your items to get liked? Presumably there will be some that will be hot and liked a lot, while others would almost never get any likes.
How real-time does your system need to be? Can the system be eventually consistent (meaning, would it be OK if you only reported likes as of several minutes ago)?
Let's give this a shot
Disclaimer: this is very much a didactic exercise; in practice you may want to explore an analytics product, or technologies other than DynamoDB, to accomplish this task.
Part 1 - Representing an Item and Updating Like Counts
First, let's talk about your aggregation/analytics goals: you mentioned that you want to query for "top 10 of the week" or "top 10 of the month" but you didn't specify if that is supposed to mean "calendar week"/"calendar month", or "last 7 days"/"last 30 days".
I'm going to take it literally and assume that "top 10 of the week" means top 10 items from this week that started on the most recent Monday (or Sunday if you roll that way). Same for month: "top 10 of the month" means "top 10 items since the beginning of this month".
In this case, you will probably want to store, for each item:
a count of total all-time likes
a count of likes since the beginning of current month
a count of likes since the beginning of current week
current month number - needed to determine if we need to reset
current week number - needed to determine if we need to reset
Then, each week, reset the count for the current week, and each month, reset the count for the current month.
In DynamoDB, this might be represented like so:
{
  id: "<item-id>",
  likes_all: <numeric>, // total likes of all time
  likes_wk: <numeric>,  // total likes for the current week
  likes_mo: <numeric>,  // total likes for the current month
  curr_wk: <numeric>,   // number of the current week of year, eg. 27
  curr_mo: <numeric>,   // number of the current month of year, eg. 6
}
Now, you can update the number of likes with an UpdateItem operation, with an UpdateExpression, like so:
aws dynamodb update-item \
  --table-name <your-table-name> \
  --key '{"id":{"S":"<item-id>"}}' \
  --update-expression "SET likes_all = likes_all + :lc, likes_wk = likes_wk + :lc, likes_mo = likes_mo + :lc" \
  --expression-attribute-values '{":lc": {"N":"1"}}' \
  --return-values ALL_NEW
This gives you a simple atomic way to increment the counts and get back the updated values. Notice the :lc value can be any number (not just 1). This will come in handy below.
But there's a catch. You also need to be able to reset the counts if the week or month rolled over, so to do that, you can break the update into two operations:
update the total count (and get the most recent values back)
conditionally update the week and month counts
So, our update sequence becomes:
Step 1. update total count and read back the updated item:
aws dynamodb update-item \
  --table-name <your-table-name> \
  --key '{"id":{"S":"<item-id>"}}' \
  --update-expression "SET likes_all = likes_all + :lc" \
  --expression-attribute-values '{":lc": {"N":"1"}}' \
  --return-values ALL_NEW
This updates the total count and gives us back the state of the item. Based on the values of the curr_wk and curr_mo, you will have to decide what the update looks like. You may be either incrementing, or setting an absolute value. Let's say we're in the case when the update is being performed after the week rolled over, but not the month. And let's say that the result of the update above looks like this:
{
  id: "<item-id>",
  likes_all: 1000, // total likes of all time
  likes_wk: 70,    // total likes for the current week
  likes_mo: 150,   // total likes for the current month
  curr_wk: 26,     // number of the week of last update
  curr_mo: 6,      // number of the month of year of last update
}
curr_wk is 26, but at the time of update, the actual current week should be 27.
Then your update query would look like this:
aws dynamodb update-item \
  --table-name <your-table-name> \
  --key '{"id":{"S":"<item-id>"}}' \
  --update-expression "SET curr_wk = :new_wk, likes_wk = :lc, likes_mo = likes_mo + :lc" \
  --condition-expression "curr_wk = :wk AND curr_mo = :mo" \
  --expression-attribute-values '{":lc": {"N":"1"}, ":new_wk": {"N":"27"}, ":wk": {"N":"26"}, ":mo": {"N":"6"}}' \
  --return-values ALL_NEW
The ConditionExpression ensures that we don't reset the likes twice, if two conflicting updates happen at the same time. In that case, one of the updates would fail and you'd have to switch the update back to an increment.
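For reference, here is a minimal boto3 sketch of that two-step sequence, including the fallback to a plain increment when the conditional check fails. It is a sketch under assumptions, not a definitive implementation: the table name ("items") is hypothetical, the ISO week number is used as the week-of-year, and the attribute names follow the item layout above.
import datetime

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("items")  # hypothetical table name

def record_like(item_id, count=1):
    today = datetime.date.today()
    wk, mo = today.isocalendar()[1], today.month

    # Step 1: atomically bump the all-time count and read the item back.
    item = table.update_item(
        Key={"id": item_id},
        UpdateExpression="SET likes_all = likes_all + :lc",
        ExpressionAttributeValues={":lc": count},
        ReturnValues="ALL_NEW",
    )["Attributes"]

    # Step 2: increment the weekly/monthly counters, resetting any counter
    # whose week/month rolled over, guarded by the values we just read back.
    wk_expr = ("likes_wk = likes_wk + :lc" if item["curr_wk"] == wk
               else "curr_wk = :new_wk, likes_wk = :lc")
    mo_expr = ("likes_mo = likes_mo + :lc" if item["curr_mo"] == mo
               else "curr_mo = :new_mo, likes_mo = :lc")
    values = {":lc": count, ":wk": item["curr_wk"], ":mo": item["curr_mo"]}
    if item["curr_wk"] != wk:
        values[":new_wk"] = wk
    if item["curr_mo"] != mo:
        values[":new_mo"] = mo
    try:
        table.update_item(
            Key={"id": item_id},
            UpdateExpression="SET " + wk_expr + ", " + mo_expr,
            ConditionExpression="curr_wk = :wk AND curr_mo = :mo",
            ExpressionAttributeValues=values,
        )
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        # A concurrent writer already reset the counters: switch back to a
        # plain increment, as described above.
        table.update_item(
            Key={"id": item_id},
            UpdateExpression="SET likes_wk = likes_wk + :lc, likes_mo = likes_mo + :lc",
            ExpressionAttributeValues={":lc": count},
        )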
Part 2 - Keeping Track of Statistics
To take care of your statistics, you need to keep track of the most-liked items per week and per month.
You can keep a sorted list of the hottest items per week and per month. You can also store these lists in DynamoDB.
For example, let's say you want to keep track of top 3. You might store something like:
{
  id: "item-stats",
  week_top: ["item3:4000", "item2:2000", "item9:700"],
  month_top: ["item2:100000", "item4:50000", "item3:12000"],
  curr_wk: 26,
  curr_mo: 6,
  sequence: <optimistic-lock-token>
}
Whenever you perform an update for items, you would also update the statistics.
The algorithm for updating statistics will be similar to updating an item, except you can't just use update expressions. Instead you have to implement your own read-modify-write sequence using GetItem, PutItem and ConditionExpression.
First, you read the current values for the item-stats special item, including the value of the current sequence (this is important to detect clobbering)
Then, you figure out whether the item(s) whose counts you've just updated would make it into the Top-N weekly or monthly list. If so, you would update the week_top and/or month_top attributes and prepare a conditional PutItem request.
The PutItem request must include a conditional check that verifies the sequence of the item-stats is the same as what you read earlier. If not, you need to read the item again, re-compute the top-N lists, then attempt to put again.
Also, similar to the way the counts get reset for items, when an update happens you need to check and see if the weekly or monthly top needs to be reset as part of the update.
When you make the PutItem request, make sure to generate a new sequence value.
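A minimal boto3 sketch of that read-modify-write loop for the weekly list follows. The table name, the top-3 size, and the "item:count" string encoding follow the example above; treat them all as assumptions rather than a prescribed implementation.
import uuid

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("items")  # hypothetical table name
TOP_N = 3  # size of the top list, as in the example above

def update_week_top(item_id, weekly_likes):
    # Read-modify-write of the item-stats record, guarded by an optimistic lock.
    while True:
        stats = table.get_item(Key={"id": "item-stats"})["Item"]

        # Recompute the weekly top-N from the stored "item:count" strings plus
        # the item whose counter we just updated.
        counts = {e.split(":")[0]: int(e.split(":")[1]) for e in stats["week_top"]}
        counts[item_id] = weekly_likes
        top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:TOP_N]
        stats["week_top"] = ["%s:%d" % (i, c) for i, c in top]

        # Write back only if nobody else touched item-stats since we read it.
        expected, stats["sequence"] = stats["sequence"], str(uuid.uuid4())
        try:
            table.put_item(
                Item=stats,
                ConditionExpression="#seq = :expected",
                ExpressionAttributeNames={"#seq": "sequence"},
                ExpressionAttributeValues={":expected": expected},
            )
            return
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
            # Someone clobbered item-stats in the meantime: loop and retry.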
Part 3 - Putting It All Together
In Part 1 and Part 2 we figured out how to keep track of likes and keep track of statistics, but there are big problems with our approach: performance would be pretty bad at any kind of real-life scale, hot items would create problems for us, and updating the Top-N stats would be a significant bottleneck.
To improve performance and achieve some scalability we'd want to get away from updating each item and the item-stats for every single "like".
We can achieve a good balance of performance and scalability using a combination of queues + dynamodb + compute resource.
create a queue to store pending likes
let "likes API" would enqueue a message tagging a post with a like, instead of applying them as they come
implement a queue consumer (could be a Lambda, or some other periodically running process) to pull messages off the queue and aggregate likes per item, then update items and the item-stats
By batching updates, we can get control over concurrency (and cost) at the expense of latency/eventual consistency.
We may end up with a limited number of queue consumers, each processing items in batches. In each batch, multiple item likes would be aggregated and a single update per item would be applied. Similarly, a single item-stats update would be applied per batch processor.
Depending on volume of incoming likes, you may need to spin up more processors.
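For example, a queue consumer along these lines could aggregate a batch before touching DynamoDB. This is a rough Python sketch of an SQS-triggered Lambda handler; the message shape and the record_like helper (the two-step update sketched in Part 1) are assumptions.
import json
from collections import Counter

def handle_batch(event, context):
    # Aggregate likes per item across the whole batch of messages.
    likes_per_item = Counter()
    for record in event["Records"]:
        message = json.loads(record["body"])  # assumed shape: {"item_id": "...", "count": 1}
        likes_per_item[message["item_id"]] += message.get("count", 1)

    # One update per item for the whole batch instead of one per like;
    # a single item-stats update covering the whole batch would follow.
    for item_id, count in likes_per_item.items():
        record_like(item_id, count)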

Google Analytics reporting - wider data range filters out the result

I am trying to get GA's client ID, which is stored in a custom dimension, by filtering on another custom dimension's value.
The problem is that when I change start-date=2019-01-01 to start-date=2016-01-01 or start-date=2006-01-01, the result I get with start-date=2019-01-01 is gone. Why does this happen? I would like to search across all users.
Is there another method to find a user based only on a dimension? I don't need any metrics.
https://ga-dev-tools.appspot.com/query-explorer/?start-date=2019-01-01&end-date=2019-01-28&metrics=ga%3Ausers&dimensions=ga%3Adimension16%2Cga%3Adimension65&filters=ga%3Adimension16%3D%3DUMM8SBTCS0U7HIZL&include-empty-rows=true
Java:
DateRange dateRange = new DateRange();
dateRange.setStartDate("2018-01-01");
dateRange.setEndDate("2019-01-28");

final Dimension euciDimension = new Dimension().setName("ga:dimension65");
final Dimension gaDimension = new Dimension().setName("ga:dimension16");

// Assumed definition: the original snippet references sessionsMetrics without defining it.
final Metric sessionsMetrics = new Metric().setExpression("ga:sessions").setAlias("sessions");

ReportRequest request = new ReportRequest()
    .setViewId(VIEW_ID)
    .setDimensions(Arrays.asList(euciDimension, gaDimension))
    .setDateRanges(Arrays.asList(dateRange))
    .setMetrics(Arrays.asList(sessionsMetrics))
    .setPageSize(1000)
    .setIncludeEmptyRows(true)
    .setSamplingLevel("LARGE")
    .setFiltersExpression("ga:dimension16==XYZ");

ArrayList<ReportRequest> requests = new ArrayList<ReportRequest>();
requests.add(request);

// Create the GetReportsRequest object.
GetReportsRequest getReport = new GetReportsRequest()
    .setReportRequests(requests);

// Call the batchGet method.
GetReportsResponse response = service.reports().batchGet(getReport).execute();
Sampling
About data sampling
In data analysis, sampling is the practice of analyzing a subset of all data in order to uncover the meaningful information in the larger data set. For example, if you wanted to estimate the number of trees in a 100-acre area where the distribution of trees was fairly uniform, you could count the number of trees in 1 acre and multiply by 100, or count the trees in a half acre and multiply by 200 to get an accurate representation of the entire 100 acres.
There is no way to disable data sampling in the Google Analytics API or website. The only way to get around it is to use smaller date ranges. Requesting the last 12 years of data will almost always result in sampling, unless your site started less than a year ago.
You can check the response to see whether your data is sampled, and then reduce the date range until the results are no longer sampled.
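As a rough illustration, here is a small Python sketch of that check against the raw JSON returned by reports().batchGet(); the samplesReadCounts and samplingSpaceSizes fields are part of the v4 ReportData resource and only appear when the results are sampled.
def is_sampled(response):
    # True if any report in the batchGet response was built from sampled data.
    for report in response.get("reports", []):
        data = report.get("data", {})
        if data.get("samplesReadCounts") or data.get("samplingSpaceSizes"):
            return True
    return False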
Note on BigQuery: if you have access to a BigQuery account, you can export the data there, which removes the sampling.
Missing data
If you only started sending a custom dimension yesterday, then the data for last week does not contain any values for this custom dimension, so no data will be returned. There is no way to run analytics against data that did not exist at that time.

Removing first few entries in dynamo db list conditional on field values or on number of elements

I have a dataset in dynamoDB which looks like this:
{
  "userID" : 2323423, // Primary Key
  "lt" : [
    {
      "timestamp" : epoch1,
      "coordinates" : "coordinate1"
    },
    {
      "timestamp" : epoch2,
      "coordinates" : "coordinate2"
    },
    ...
  ]
}
The "lt" is location-tracking list, which is intended to store the coordinate values for a userID at different times.
Q1 The requirement is:
Store a maximum of 1 day's worth of location-tracking data per user, with auto-deletion happening only when a new LT coordinate entry is received.
What this means is that at any given time there could be stale LT data, up to 24 hours' worth. However, as soon as new LT coordinate data comes in, stale entries should be deleted so that entries older than 24 hours are removed.
I'm clear on how to append entries to a list, or even remove entries at a particular index from a list in dynamoDB.
UpdateExpression : "REMOVE lt[0]" - Remove one element
UpdateExpression : "REMOVE lt[0] lt[1]" - Remove elements 0 and 1
However, now the requirement is to remove entries from the beginning of the list, such that the entries older than 24 hours are removed from it. I've banged my head over this for very long, and there does not seem to be any conditional expression which helps us do that. Am I missing something?
Q2 As a workaround, I changed requirements to:
Store last 100 entries into this "lt" list.
This is going to keep potentially stale LT data for users in case their LT data is not received
If I receive N new LT points for a user, I want to remove the first "n" entries, if the total entries have become 100 + "n". If total entries are less than or equal to 100, no need to delete the entries.
I can obviously append new N entries into the User item's "lt" list, get that User Item back, find out total number of entries, and then remove the first "n" entries, but that would be inefficient, since I'll have to make two queries, one where I'll have to return the entire "lt" list.
It would help if the size of the "lt" list could be retrieved via some sort of Count construct; is there one?
I want to understand how this should actually be done.
You do not need to model locations as a list. You could model them as a map, with HH:MM as the key of the map. In your update expression, just SET lt.#hhmm = :coord with ExpressionAttributeNames={#hhmm:"16:05"} and ExpressionAttributeValues={:coord:"0,0"}. If you record location once per minute, that means 24 * 60 = 1440 entries in the lt map. If each coordinate pair is 19 characters long, you have approximately 30 bytes per entry or around 43 KB per person if you record once per minute.
Using the scheme above, it would cost around 43 WCU per minute, or less than 1 WCU per second per user to maintain current location with minute granularity. This is kind of high for one customer. Instead, you could split up the user location item into 30 minute buckets, using 48 such items to cover a 24 hour span per person. Thus, the UpdateItem write cost would be 1 WCU per minute or 60 WCU per hour. The hash key would be of a form like <user_id>_HH:<00 or 30>.
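A minimal boto3 sketch of that map-based update with the 30-minute bucket key is below. The table name is hypothetical, the bucketed hash key is a string, and it assumes the bucket item (with an empty lt map) is created separately when the bucket starts.
import datetime

import boto3

table = boto3.resource("dynamodb").Table("user_locations")  # hypothetical table name

def record_location(user_id, coordinates):
    now = datetime.datetime.utcnow()
    # One item per user per 30-minute bucket, e.g. "2323423_16:00" or "2323423_16:30".
    bucket_key = "%s_%s:%s" % (user_id, now.strftime("%H"), "00" if now.minute < 30 else "30")

    # Set one entry in the lt map per minute. This assumes the bucket item
    # already exists with an lt map; otherwise the nested SET fails and the
    # item would be created by a separate put_item.
    table.update_item(
        Key={"userID": bucket_key},
        UpdateExpression="SET lt.#hhmm = :coord",
        ExpressionAttributeNames={"#hhmm": now.strftime("%H:%M")},
        ExpressionAttributeValues={":coord": coordinates},
    )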

Use MapReduce or other distributed computation method for an analytics calculation?

Let's say I have three basic models: a User, a Company, and a Visit. Every time a User goes to a Company, a Visit is recorded in this format (user_id, company_id, visit_date).
I'd like to be able to calculate the average time between visits for a company. Not visits overall, but specifically how long on average one of their customers waits before returning to the store.
For example, if one user visited on Tuesday, Wednesday, and Friday that gives one "gap" of one day, and one "gap" of two days => (1, 2). If another user visited on Monday and Friday, that gives one gap of 4 days => (4). If a third user visited only once, he should not be considered. The average time between user visits for the company is (1 + 2 + 4) / 3 = 2.333 days.
If I have thousands of users, taps, and companies and I want to calculate a single figure for each company, how should I go about this? I've only done basic MapReduce applications before and I can't figure out what my Map and Reduce steps would be to get this done. Can anyone help me figure out a MapReduce in pseudocode? Or is there some other method of distributed calculation I can reasonably perform? For the record, I'd like to perform this operation on my database every night.
The overly simplistic approach would be to have two job steps.
The first job's mapper writes key-value pairs where the key is of the form "user:company" and the value is the "visit_date". In the example above, the mapper would write something like:
"user1:companyA" -> "2012/07/16"
"user1:comapnyA" -> "2012/07/17"
"user1:comapnyA" -> "2012/07/19"
"user2:comapnyA" -> "2012/07/15"
"user2:comapnyA" -> "2012/07/19"
...
This means that each call to the reducer will pass all of the visits by a single user to a single company. That means that one call to the reducer will pass in:
"user1:companyA" -> {2012/07/16, 2012/07/17, 2012/07/19}
and another call will pass in:
"user2:companyA" -> {2012/07/15, 2012/07/19}
I'm assuming the set of dates (passed in as an Iterable value) is small enough to manage easily: you sort it, figure out the gaps, and write a record for each gap as a key-value pair in the form "company" and "gap". For example, when passed:
"user1:companyA" -> {2012/07/16, 2012/07/17, 2012/07/19}
The first job's reducer will write to the context:
"companyA" -> 1
"compnayA" -> 2
The second job has a pass-through mapper that just passes the company/gap info on to the reducer. Each call to the reducer gives an Iterable value of gaps for a specific company. Iterate through the data to produce an average and write the key value pair in the form "company" and "average_gap".
If the original set of visits gets too big, we can talk about getting Hadoop to do the sorting for you with some custom comparators.
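A rough Python sketch of the two jobs described above (framework-agnostic map/reduce functions in the usual Hadoop style, not tied to any particular library; the visit tuple layout follows the question):
from datetime import date

# Job 1 mapper: emit one ("user:company", visit_date) pair per visit record.
def job1_map(visit):
    user_id, company_id, visit_date = visit  # e.g. (1, "companyA", date(2012, 7, 16))
    yield "%s:%s" % (user_id, company_id), visit_date

# Job 1 reducer: all visit dates for one (user, company) pair arrive together;
# sort them and emit one (company, gap_in_days) pair per consecutive pair.
def job1_reduce(key, visit_dates):
    company_id = key.split(":")[1]
    dates = sorted(visit_dates)
    for earlier, later in zip(dates, dates[1:]):
        yield company_id, (later - earlier).days

# Job 2 mapper is the identity; Job 2 reducer averages the gaps per company.
def job2_reduce(company_id, gaps):
    gaps = list(gaps)
    yield company_id, sum(gaps) / len(gaps)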

oracle year change trigger

I'm stuck on a problem that I can't figure out. I'm building an application in C++ Builder 2009 and Oracle 11g. I have some calculated data that depend on the user's age. What I want to do is re-calculate these data every new year. I thought I could use a trigger to do this, but I don't know which event I should catch, and I didn't find anything on the internet.
My table is :
ATHLETE (name, ......, birthdate, Max_heart_frequency)
Max_heart_frequency is the field that depends on age. On insertion I calculate the athlete's age, but what about next year?
Can anyone help?
How is the max_heart_frequency calculated?
If this is a simple formula, I would create a view that returns that information. There is no need to store values that can easily be calculated:
CREATE VIEW v_athlete
AS
select name,
       case
         -- younger than 20 years
         when (MONTHS_BETWEEN(sysdate, birthdate) / 12) < 20 then 180
         -- younger than 40 years
         when (MONTHS_BETWEEN(sysdate, birthdate) / 12) < 40 then 160
         -- younger than 60 years
         when (MONTHS_BETWEEN(sysdate, birthdate) / 12) < 60 then 140
         -- everyone else
         else 120
       end as max_heart_frequency
from athlete;
Then you only need to select from the view and it will always be accurate.
You can use Oracle Scheduler to run a procedure at specific intervals (minutes, hours, daily, yearly, etc., any time span).
Check this link: http://download.oracle.com/docs/cd/B19306_01/server.102/b14231/schedover.htm
You have two options:
Have a stored procedure that calculates and updates the Max_Heart_Frequency of all the athletes every 1st of January (using the yearly scheduling of a procedure)
Have a stored procedure that runs daily and calculates and updates the Max_Heart_Frequency of all the athletes every day (using the daily scheduling of a procedure)
If Max_Heart_Frequency changes over time because the user is getting older, why are you storing it in the table in the first place? Why not just call the function that computes the maximum heart rate at runtime when you need the value? Potentially, it may make sense to have a view on top of the Athlete table that adds the computed Max_Heart_Frequency column to hide from the callers that this is a computed column.

Resources