Can we set the retention period for a table in years in Kusto?

Currently, I can only find examples that specify the retention period in days. Can we specify the retention period of a table in years in Kusto? I mean, will the command below set the retention period to 10 years?
.alter-merge table Table1 policy retention softdelete = 10y recoverability = disabled
Also, soft delete keeps the data in the database and just marks the records as deleted, right? Is there a way to do a hard delete, and are there any issues with using it? My records do not refer to old data, so I want to completely delete the records after the retention period.

The retention policy command takes a timespan literal, and years are not one of the supported formats, so the command in your example will not work. You need to specify the period in days.
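For example, a 10-year retention period expressed in days (10 × 365 = 3650, ignoring leap days) would look like this:
.alter-merge table Table1 policy retention softdelete = 3650d recoverability = disabled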
If recoverability is set to Enabled, the storage artifacts are deleted from storage 14 days after the retention period expires. If it is set to Disabled, deletion happens one day after the retention period.

Related

How To Change Firebase BigQuery Integration 'Dataset Time To Live' From 60 Days To Does Not Expire?

We've recently upgraded our BigQuery integration in Firebase from Sandbox to Blaze, but the 'Dataset Time To Live' is still 60 days. We've updated the dataset's 'Default table expiration' in BigQuery from 60 days to 'Never', but it's still only retaining the last 60 days of data in the historic event table and didn't change the 'Dataset Time To Live' field in Firebase. We've also updated our Data Retention settings in GA4 to retain data for the last 14 months, but it didn't have an effect on the BQ integration table expiration either.
Any help on how to get the 'Dataset Time To Live' to be set to 'Does not expire' would be greatly appreciated.
Thanks!
Once you've updated the dataset's Default table expiration, all new tables within the dataset will have Table expiration = Never, but all already-existing tables will still have the old value. You need to update it manually for all existing tables. Also check whether any partition expiration is configured.
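One way to clear the expiration on an existing table is a DDL statement run in BigQuery; the project, dataset, and table names below are placeholders, and for the sharded Firebase export you would repeat this for each existing daily events_YYYYMMDD table:
-- remove the table-level expiration so this table never expires
ALTER TABLE `myproject.analytics_123456789.events_20230101`
SET OPTIONS (expiration_timestamp = NULL)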

Firebase BigQuery server offset time

Background:
I have Firebase Analytics data exported to BigQuery, and I'm using cron jobs to crunch the data in BigQuery to get insights.
Problem:
To crunch only delta data, i.e. the data that has arrived since the last time I ran my cron job, I need a way to figure out when the data arrived at the server, since event_timestamp is generated on the client and can be cached on the client before being sent.
Insights:
I have experimented with event_server_timestamp_offset (offset), which I thought I could use together with event_timestamp. I was expecting the offset to only be positive, but it can also be negative. And when I look at the MAX and MIN of the offset across the entire exported Firebase Analytics dataset and convert it from microseconds, I get an offset of more than 18 years.
Query:
SELECT
  MAX(event_server_timestamp_offset)/(1000000*60*60*24) AS max_days,
  MIN(event_server_timestamp_offset)/(1000000*60*60*24) AS min_days
FROM
  `analytics_<project_id>.events_*`
Result: max_days=6784.485790436655,
min_days=-106.95833052104166
Question:
How can I figure out the server arrival time for my Firebase exported BigQuery data so I can run cron jobs crunching only delta data?
Can I use event_server_timestamp_offset together with event_timestamp? If so, how?
Best regards,
Daniel
Surprisingly enough, this question has gone without a clear answer for almost 2 years, so I am leaving here the answers I got from the Firebase support team. The format is the question asked, followed by the answer from the support staff.
Q1. event_date - The date on which the event was logged (YYYYMMDD format in the registered timezone of your app). Does it mean that the event occurred on that date, or that it was actually collected on that date?
A1. Per documentation, event_date refers to the date on which the event is logged/occurred. Note that event_date is based on the Analytics timezone setting of your Firebase Project.
Q2. event_timestamp - The time (in microseconds, UTC) at which the event was logged on the client. Is it safe to assume that this is the exact timestamp the event occurred on client side (in the app timezone of course)?
A2. Yes, this is based on the device timezone setting. However, event_timestamp may be skewed if the device time is incorrect.
Q3. event_server_timestamp_offset - Timestamp offset between collection time and upload time in micros. This is the main field that causes all the misunderstanding: in our BigQuery table for the year 2020 this field takes values in a range between 5 days and -2 days. I mean, how can the collection time be 2 days ahead?
A3. The event_server_timestamp_offset field in the export schema is the time difference between when the event took place and the app uploaded it to our server. In other words, this is the estimated difference between the client's local time and the actual time, according to our servers. The values of this field are usually positive, but can be negative as well if the device time setting is incorrect.
Q4. One last question is very important: can we ignore the event_server_timestamp_offset field and just rely on event_timestamp as the exact date and time the event occurred on the client side (not collected, not uploaded, etc.)? If not, please explain how we can get the exact datetime of the event occurring on the client side. If yes, please let me know why we need the event_server_timestamp_offset field at all.
A4. Yes, you may actually ignore it and use event_timestamp alone. However, as mentioned earlier, event_timestamp could be off if the device time setting is incorrect, but it shouldn't really affect the bigger picture of your analytics data, as cases like this are usually one-offs.
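If you do want an estimate of when an event reached the server, one option, based on A3's description of the offset as the difference between client time and server time, is to add the offset to the client timestamp. This is only a sketch against the standard Firebase export columns, not an officially documented formula:
SELECT
  event_name,
  TIMESTAMP_MICROS(event_timestamp) AS client_time,
  -- assumes offset ≈ server time minus client time, per A3 above
  TIMESTAMP_MICROS(event_timestamp + event_server_timestamp_offset) AS approx_server_time
FROM
  `analytics_<project_id>.events_*`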
We use the event_date as the indicator and load the data once a day.
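For example, a once-a-day delta load can target just the previous day's export table via the wildcard suffix; the project ID below is a placeholder:
SELECT *
FROM
  `analytics_<project_id>.events_*`
WHERE
  _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))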

Hits Processed Per Month?

If you refer to http://www.google.com/intl/en_uk/analytics/premium/features.html, you will notice that Standard allows for 10 million hits processed per month and Premium allows for 1 billion.
I have a website on an account, with multiple "folders" for different sub-domains, and also different "Views" or dashboards for some of these sub-domains.
The website I am on recently lost tracking for conversion rates, and everything has plummeted to near 0%, which is an incorrect statistic. I am curious how I can figure out whether this account is reaching the 10 million hit limit on the standard version. Or at least, how can I figure out the actual hits processed per day, week, or month?
Any ideas?
Thanks!
I don't know how Google enforces hit limits in 2015. However in 2013 a Google representative sent one of our bigger clients a document (answering a question about data limits) that contained the following paragraph:
How do data limits impact sampling? Google Analytics does not sample your clients' data at the point of collection or processing, regardless of how far they exceed our stated limits. So no hits are discarded. The only way to sample data at the point of collection is for clients to use _setSampleRate in their tracking code.
[...]
[...] we reserve the right to shutdown their account [sc. if limits are exceeded], but it won't happen before we have attempted to contact the account Admins multiple times and we have exhausted all other options.
Unless Google has changed its policy in the last 1.5 years I would say no, unprocessed hits are not your problem; it seems Google would have contacted you with a request to limit your hits or upgrade to Analytics Premium before problems occur.
Plus, since you mentioned that you have several views: views do not count towards your quota (they display the same data in different ways). However, properties (I think that is what you mean by "folders") do.
Updated 2017: It seems that Google intends to enforce limits more strictly. One of my clients now has the following warning in his GA interface:
Your data volume (XXX hits) exceeds the limit of 10M hits per month as outlined in our terms of service. If you continue to exceed the limit you will lose access to future data.
You can create a database table, like this:
create table visits (
  id bigint primary key auto_increment,
  ip text,
  visit_date timestamp default current_timestamp
)
Upon each page visit, you can insert a record into the table. Later you can view statistics. For instance, listing the visits on a given day would look like:
select id, ip, visit_date
from visits
where visit_date >= '2015-07-21 00:00:00' and visit_date < '2015-07-22 00:00:00'
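And the monthly total, which is what the 10-million-hit limit is measured against, can be computed with a simple aggregate over the same table:
select count(*) as monthly_hits
from visits
where visit_date >= '2015-07-01 00:00:00' and visit_date < '2015-08-01 00:00:00'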

Best retention practice using Graphite

I have been a happy user of Graphite+Grafana for a few months now and I have been advocating it around my firm.
My approach has been to measure data of interest and collect them into 1-minute or 5-minute buckets and send that information to Graphite. I was recently contacted by a group that processes quotes (billions a day!) and their approach has been to create a log line each time their applications process 1 million quotes. The problem is that the interval between 2 log lines can be highly erratic from 1 second to a few hours.
The dilemma is this: should I set my retention policy to a 1-second bucket so that I can see all measurements associated with spikes, or should I use, say, a 1-minute bucket so that the number of data points to be saved and later queried is much more manageable? FYI, when I set it to 1-second, showing the data for 8 or 10 charts over a few days was bringing the system (or at least my browser) to a crawl because of the number of data points (mostly NULL) being pushed around from Graphite to Grafana.
Here's my retention policy: 1s:10d,1m:36d,5m:180d
Alternatively, is there a way to configure Grafana+Graphite to only retrieve non-NULL data points?
What do you recommend?
You can always specify a shorter retention period for the 1s archive, so that when you display a longer range Graphite only sends you the coarser level.
For example, you can specify: 1s:2d,1m:7d,5m:180d
This way, if you show a range that goes more than 2 days into the past you will get 1m resolution (and so on), which won't make your browser crawl, while you will still be able to inspect spikes from the last 2 days.
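In Graphite this is configured per metric pattern in storage-schemas.conf; the section name and pattern below are placeholders for however your quote metrics are named:
[quote_metrics]
pattern = ^quotes\.
retentions = 1s:2d,1m:7d,5m:180d
Note that storage-schemas.conf only applies to newly created metrics; existing Whisper files generally have to be resized (for example with the whisper-resize tool) before they pick up the new retentions.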

Last “end date” with data in Analytics

I'm using "Reporting google Analitics API" and I can’t find information about what the last “end date” with data in Analytics is.
For example, let's suppose you want to retrive the last month’s data.
When do you have to perform the query?
The first day of the current month?
...or the second one?
...or maybe the third one?
And just one more question: is the returned data for each day in Pacific time?
Google Analytics API is supposed to have access to the same data you have in the interface.
Google says that data can take up to 24 hours to process. The time it takes to actually update the data depends on the type and size of the account. Small accounts are updated multiple times a day and can have data available in just a few hours. Once you reach 1M hits a month you are moved to a different mode where the data in your account is updated only once a day. Google Analytics Premium customers get more frequent updates even for large amounts of traffic.
There's no way to tell through the API exactly what the time of the last processed hit is. You can query the data for today by the hour and see for yourself, though.
Usually you don't care and just want to make sure that the data you're querying has been fully processed for that day.
So if you query data for yesterday there's a chance it has not been completely updated; for example, if it's just past midnight, yesterday ended only a couple of minutes ago and its data probably hasn't been completely processed yet. The safest bet in this case is to query data for 2 days ago.
So if today is 2012-06-15 and you want to get 1 month of data, a safe approach is to query data with start-date=2012-05-13 and end-date=2012-06-13. This will most of the time give you data for days that have been fully processed, but it's not 100% safe either. Google Analytics has had outages in the past where data took longer than that to process, though these are not usual. When you get the data out it's really hard to tell just from the API whether the data for those days has been fully processed or not; by using the 2-days-ago idea you just make it more likely that it has.
The days are aggregated following the timezone settings configured on the Google Analytics profile.
