As I understand it, the Chrome browser uses the WebKit time format for timestamps within the browser history database. WebKit time is expressed as microseconds since January 1, 1601.
I've found numerous articles that seemingly have the answer to my question, but none have worked so far. The common answer is to use the formula below to convert from WebKit time to a human-readable local time:
SELECT datetime((time/1000000)-11644473600, 'unixepoch', 'localtime') AS time FROM table;
Sources:
https://linuxsleuthing.blogspot.com/2011/06/decoding-google-chrome-timestamps-in.html
What is the format of Chrome's timestamps?
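For what it's worth, the two constants in that formula do the following: dividing by 1,000,000 converts microseconds to seconds, and 11644473600 is the number of seconds between 1601-01-01 and the Unix epoch (1970-01-01). A quick worked example in SQLite (the WebKit value is just an illustrative sample):
SELECT datetime(13231352154237916/1000000 - 11644473600, 'unixepoch');
-- 13231352154237916 / 1000000 = 13231352154 (seconds since 1601-01-01)
-- 13231352154 - 11644473600   = 1586878554  (seconds since 1970-01-01)
-- result: 2020-04-14 15:35:54 (UTC)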
I'm trying to convert the timestamps while gathering the data through Osquery, using the configuration below.
"chrome_browser_history" : {
"query" : "SELECT urls.id id, urls.url url, urls.title title, urls.visit_count visit_count, urls.typed_count typed_count, urls.last_visit_time last_visit_time, urls.hidden hidden, visits.visit_time visit_time, visits.from_visit from_visit, visits.visit_duration visit_duration, visits.transition transition, visit_source.source source FROM urls JOIN visits ON urls.id = visits.url LEFT JOIN visit_source ON visits.id = visit_source.id",
"path" : "/Users/%/Library/Application Support/Google/Chrome/%/History",
"columns" : ["path", "id", "url", "title", "visit_count", "typed_count", "last_visit_time", "hidden", "visit_time", "visit_duration", "source"],
"platform" : "darwin"
}
"schedule": {
"chrome_history": {
"query": "select distinct url,datetime((last_visit_time/1000000)-11644473600, 'unixepoch', 'localtime') AS time from chrome_browser_history where url like '%nhl.com%';",
"interval": 10
}
}
The resulting events have timestamps from the year 1600:
"time":"1600-12-31 18:46:16"
If I change the config to pull the raw timestamp with no conversion, I get stamps such as the following:
"last_visit_time":"1793021894"
From what I've read about WebKit time, it is expressed in 17-digit numbers, which clearly is not what I'm seeing. So I'm not sure if this is an Osquery, Chrome, or query issue at this point. All help and insight appreciated!
Solved. The datetime conversion needs to take place within the table definition query, i.e. the query defined under "chrome_browser_history".
"chrome_browser_history" : {
"query" : "SELECT urls.id id, urls.url url, urls.title title, urls.visit_count visit_count, urls.typed_count typed_count, datetime(urls.last_visit_time/1000000-11644473600, 'unixepoch') last_visit_time, urls.hidden hidden, visits.visit_time visit_time, visits.from_visit from_visit, visits.visit_duration visit_duration, visits.transition transition, visit_source.source source FROM urls JOIN visits ON urls.id = visits.url LEFT JOIN visit_source ON visits.id = visit_source.id",
"path" : "/Users/%/Library/Application Support/Google/Chrome/%/History",
"columns" : ["path", "id", "url", "title", "visit_count", "typed_count", "last_visit_time", "hidden", "visit_time", "visit_duration", "source"],
"platform" : "darwin"
}
"schedule": {
"chrome_history": {
"query": "select distinct url,last_visit_time from chrome_browser_history where url like '%nhl.com%';",
"interval": 10
}
}
Trying to make the conversion within the osquery scheduled query (as I was trying before) will not work, i.e.:
"schedule": {
"chrome_history": {
"query": "select distinct url,datetime((last_visit_time/1000000)-11644473600, 'unixepoch', 'localtime') AS time from chrome_browser_history where url like '%nhl.com%';",
"interval": 10
}
}
Try:
SELECT datetime(last_visit_time/1000000-11644473600, "unixepoch") as last_visited, url, title, visit_count FROM urls;
This is from something I wrote up a while ago: a one-liner that runs osqueryi with an ATC configuration to read in the Chrome history file, export it as JSON, and curl the JSON to an API endpoint:
https://gist.github.com/defensivedepth/6b79581a9739fa316b6f6d9f97baab1f
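Roughly, the approach looks like the sketch below; this is not the exact gist, and the config path and endpoint URL are placeholders:
osqueryi --config_path=./chrome_history_atc.conf --json \
  "SELECT * FROM chrome_browser_history;" \
  | curl -s -X POST -H "Content-Type: application/json" -d @- https://example.com/ingest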
The things you're working with are pretty straight SQLite, so I would start by debugging inside sqlite3.
First, you should verify the data is what you expect. On my machine, I see:
$ cp Library/Application\ Support/Google/Chrome/Profile\ 1/History /tmp/
$ sqlite3 /tmp/History "select last_visit_time from urls limit 2"
13231352154237916
13231352154237916
Second, I would verify the underlying math:
sqlite> select datetime(last_visit_time/1000000-11644473600, "unixepoch") from urls limit 2;
2020-04-14 15:35:54
2020-04-14 15:35:54
It would be easier to test your config snippet if you included it as text we can copy/paste.
I have a table with PK (String) and SK (Integer), e.g.:
PK_id                   SK_version   Data
-----------------------------------------------------
c3d4cfc8-8985-4e5...    1            First version
c3d4cfc8-8985-4e5...    2            Second version
I can do a conditional insert to ensure we don't overwrite the PK/SK pair using a ConditionExpression (in the Go SDK):
putWriteItem := dynamodb.Put{
    TableName:           aws.String("example_table"),
    Item:                itemMap,
    ConditionExpression: aws.String("attribute_not_exists(PK_id) AND attribute_not_exists(SK_version)"),
}
However, I would also like to ensure that the SK_version is always consecutive, but I don't know how to write the expression. In pseudo-code this is:
putWriteItem := dynamodb.Put{
TableName: "example_table",
Item: itemMap,
ConditionExpression: aws.String("attribute_not_exists(PK_id) AND attribute_not_exists(SK_version) **AND attribute_exists(SK_version = :SK_prev_version)**"),
}
Can someone advise how I can write this?
In SQL I'd do something like:
INSERT INTO example_table (PK_id, SK_version, Data)
SELECT {pk}, {sk}, {data}
WHERE NOT EXISTS (
SELECT 1
FROM example_table
WHERE PK_id = {pk}
AND SK_version = {sk}
)
AND EXISTS (
SELECT 1
FROM example_table
WHERE PK_id = {pk}
AND SK_version = {sk} - 1
)
Thanks
A conditional check is applied to a single item; it cannot span multiple items. In other words, you simply need multiple conditional checks. DynamoDB has a transactWriteItems API which performs multiple conditional checks, along with writes/deletes. The code below is in Node.js.
const previousVersionCheck = {
TableName: 'example_table',
Key: {
PK_id: 'prev_pk_id',
SK_version: 'prev_sk_version'
},
ConditionExpression: 'attribute_exists(PK_id)'
}
const newVersionPut = {
TableName: 'example_table',
Item: {
// your item data
},
ConditionExpression: 'attribute_not_exists(PK_id)'
}
await documentClient.transactWrite({
TransactItems: [
{ ConditionCheck: previousVersionCheck },
{ Put: newVersionPut }
]
}).promise()
The transaction has two operations: one is a validation against the previous version, and the other is a conditional write. If any of the conditional checks fails, the whole transaction fails.
You are hitting your head on some of the differences between a SQL and a NoSQL database. DynamoDB is, of course, a NoSQL database. It does not, out of the box, support optimistic locking. I see two straightforward options:
Use a software layer to give you locking on your DynamoDB table. This may or may not be feasible depending on how often updates are made to your table. How fast 'versions' are generated and the maximum time your application can be gated on the lock will likely tell you if this can work for you. I am not familiar with Go, but the Java API supports this. Again, this isn't a built-in feature of DynamoDB. If there is no such Go API equivalent, you could use the technique described in the link to 'lock' the table for updates. Generally speaking, locking a NoSQL DB isn't a typical pattern, as it isn't exactly what it was created to do (part of which is achieving large scale on unstructured documents to allow fast access to many consumers at once).
Stop using an incrementor to guarantee uniqueness. Typically, incrementors are frowned upon in DynamoDB, in part due to the lack of intrinsic support for them and in part because of how DynamoDB shards data: you don't want a lot of similarity between records. Using a UUID will solve the uniqueness problem, but if you are porting an existing application, that means more changes to the elements that create that ID and updates to reading the ID (perhaps to include a creation-time field so you can tell which is the newest, or the prepending or appending of an epoch time to the UUID to do the same). Here is a pertinent link to a SO question explaining why to use UUIDs instead of incrementing integers.
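As a rough illustration of option 2 (the package, names, and key layout here are my own assumptions, not from the original answer), a sort key could combine an epoch timestamp with a UUID, so it stays unique but still sorts by creation time:
package main

import (
    "fmt"
    "time"

    "github.com/google/uuid"
)

func main() {
    // Example sort key: "1613551380000#0f8fad5b-d9cb-469f-a165-70867728950e".
    // The epoch-millis prefix keeps items sortable by creation time, while the
    // UUID suffix guarantees uniqueness without an incrementing counter.
    sortKey := fmt.Sprintf("%d#%s", time.Now().UnixMilli(), uuid.NewString())
    fmt.Println(sortKey)
}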
Based on Hung Tran's answer, here is a Go example:
checkItem := dynamodb.TransactWriteItem{
    ConditionCheck: &dynamodb.ConditionCheck{
        TableName:           aws.String("example_table"),
        ConditionExpression: aws.String("attribute_exists(pk_id) AND attribute_exists(version)"),
        Key:                 map[string]*dynamodb.AttributeValue{"pk_id": {S: id}, "version": {N: prevVer}},
    },
}
putItem := dynamodb.TransactWriteItem{
    Put: &dynamodb.Put{
        TableName:           aws.String("example_table"),
        ConditionExpression: aws.String("attribute_not_exists(pk_id) AND attribute_not_exists(version)"),
        Item:                data,
    },
}
writeItems := []*dynamodb.TransactWriteItem{&checkItem, &putItem}
// If either condition check fails, the whole transaction fails.
if _, err := db.TransactWriteItems(&dynamodb.TransactWriteItemsInput{TransactItems: writeItems}); err != nil {
    // handle the failed transaction (e.g. the previous version is missing or the new version already exists)
}
I'm having trouble with date parsing in elasticsearch 7.10.1.
Here's (a relevant part of) the mapping for the index:
"utcTime": {
"type": "date",
"format": "strict_date_optional_time_nanos"
}
Date format reference.
Some of the documents are accepted, for example documents with:
"utcTime": "2021-02-17T09:50:13.173Z"
"utcTime": "2021-02-17T09:51:44.158Z"
Note that in both cases, there are exactly 3 decimals to the seconds.
This, on the other hand, is rejected:
"utcTime": "2021-02-17T09:51:45.07Z"
illegal_argument_exception: failed to parse date field [2021-02-17T09:51:45.07Z] with format [yyyy-MM-dd''T''HH:mm:ss.SSSXX]
In this case, there are only two decimals. I'm using Newtonsoft's JSON.net to do the serialization, with a format that should always include 3 decimals, but it doesn't seem to do so anyway. It'll include at most 3 decimals, though.
How can I tell elasticsearch to accept date formats with anywhere between 0 and 3 decimals for the seconds?
EDIT
I finally found the issue, which had nothing to do with the mapping, but rather with a date_index_name pipeline processor.
PUT _ingest/pipeline/test_reroute_pipeline
{
"description" : "Route documents to another index",
"processors" : [
{
"date_index_name": {
"field": "utcTime",
"date_rounding": "d",
"index_name_prefix": "rerouted-"
}
}
]
}
Because the date_formats parameter wasn't defined, it would remember the format of the first date received. If it was 2 decimals, it would require 2 every time. If it was 3, it would require 3.
Specifying the date format solved the issue for good:
PUT _ingest/pipeline/test_reroute_pipeline
{
"description" : "Route documents to another index",
"processors" : [
{
"date_index_name": {
"field": "utcTime",
"date_rounding": "d",
"index_name_prefix": "rerouted-",
"date_formats": ["ISO8601"]
}
}
]
}
I just tried on a fresh 7.10.1 cluster and it also accepted 1, 2, or 3 decimals for the seconds part.
Looking at the error message you get:
illegal_argument_exception: failed to parse date field [2021-02-17T09:51:45.07Z] with format [yyyy-MM-dd''T''HH:mm:ss.SSSXX]
The format that seems to be set is yyyy-MM-dd''T''HH:mm:ss.SSSXX, and it is different from strict_date_optional_time_nanos, which is yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ.
If you check the real mapping from your index, I'm pretty sure the utcTime field doesn't have strict_date_optional_time_nanos as the format.
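You can verify that with the get field mapping API (index name assumed):
GET my-index/_mapping/field/utcTime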
I have a @daily_schedule triggered daily at 3 minutes past 12am.
When triggered by the scheduled tick at '2021-02-16 00:03:00'
The date input shows '2021-02-15 00:00:00', partition tagged as '2021-02-15'
While if triggered via backfill for partition '2021-02-16'
The date input shows '2021-02-16 00:00:00', partition tagged as '2021-02-16'
Why does the scheduled tick fill the partition a day before? Is there an option to use the datetime of execution instead (without using a cron @schedule)? This discrepancy is confusing when I perform queries using the timestamp for exact dates.
P.S. I have tested that both the scheduled run and the backfill run have the same timezone.
@solid()
def test_solid(_, date):
    _.log.info(f"Input date: {date}")

@pipeline()
def test_pipeline():
    test_solid()

@daily_schedule(
    pipeline_name="test_pipeline",
    execution_timezone="Asia/Singapore",
    start_date=START_DATE,
    end_date=END_DATE,
    execution_time=time(0, 3),
    # should_execute=four_hourly_fitler
)
def test_schedule_daily(date):
    timestamp = date.strftime("%Y-%m-%d %X")
    return {
        "solids": {
            "test_solid": {
                "inputs": {
                    "date": {
                        "value": timestamp
                    }
                }
            }
        }
    }
Sorry for the trouble here - the underlying assumption that the system is making here is that for schedules on pipelines that are partitioned by date, you don't fill in the partition for a day until that day has finished (i.e. the job filling in the data for 2/15 wouldn't run until the next day on 2/16). This is a common pattern in scheduled ETL jobs, but you're completely right that it's not a given that all schedules will want this behavior, and this is good feedback that we should make this use case easier.
It is possible to make a schedule for a partition in the way that you want, but it's more cumbersome. It would look something like this:
from dagster import PartitionSetDefinition, date_partition_range, create_offset_partition_selector
def partition_run_config(date):
    timestamp = date.strftime("%Y-%m-%d %X")
    return {
        "solids": {
            "test_solid": {
                "inputs": {
                    "date": {
                        "value": timestamp
                    }
                }
            }
        }
    }

test_partition_set = PartitionSetDefinition(
    name="test_partition_set",
    pipeline_name="test_pipeline",
    partition_fn=date_partition_range(start=START_DATE, end=END_DATE, inclusive=True, timezone="Asia/Singapore"),
    run_config_fn_for_partition=partition_run_config,
)

test_schedule_daily = (
    test_partition_set.create_schedule_definition(
        "test_schedule_daily",
        "3 0 * * *",
        execution_timezone="Asia/Singapore",
        partition_selector=create_offset_partition_selector(lambda d: d.subtract(minutes=3)),
    )
)
This is pretty similar to @daily_schedule's implementation; it just uses a different function for mapping the schedule execution time to a partition (subtracting 3 minutes instead of 3 minutes and 1 day - that's the create_offset_partition_selector part).
I'll file an issue for an option to customize the mapping for the partitioned schedule decorators, but something like that may unblock you in the meantime. Thanks for the feedback!
Just an update on this: We added a 'partition_days_offset' parameter to the 'daily_schedule' decorator (and a similar parameter to the other schedule decorators) that lets you customize this behavior. The default is still to go back 1 day, but setting partition_days_offset=0 will give you the behavior you were hoping for where the execution day is the same as the partition day. This should be live in our next weekly release on 2/18.
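With that release, the schedule from the question could then look roughly like the sketch below (this assumes partition_days_offset is available in your installed Dagster version; START_DATE and END_DATE are the same constants as in the question):
from datetime import time

from dagster import daily_schedule

@daily_schedule(
    pipeline_name="test_pipeline",
    execution_timezone="Asia/Singapore",
    start_date=START_DATE,
    end_date=END_DATE,
    execution_time=time(0, 3),
    partition_days_offset=0,  # partition day == execution day, instead of the previous day
)
def test_schedule_daily(date):
    # Same run config body as in the original question.
    timestamp = date.strftime("%Y-%m-%d %X")
    return {"solids": {"test_solid": {"inputs": {"date": {"value": timestamp}}}}}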
I am trying to save a set of events from a MySQL database into Elasticsearch using the jdbc input plugin with Logstash. The event records in the database contain date fields which are in microsecond format. In practice, there are records in the database whose timestamps differ only at the microsecond level.
While importing the data, Elasticsearch is truncating the microsecond date format into millisecond format. How can I save the data in microsecond format? The Elasticsearch documentation says they follow the JODA time API to store date formats, which does not support microseconds and truncates them by adding a Z at the end of the timestamp.
Sample timestamp after truncation : 2018-05-02T08:13:29.268Z
Original timestamp in database : 2018-05-02T08:13:29.268482
The Z is not a result of the truncation; it indicates the GMT timezone.
ES supports microseconds, too, provided you've specified the correct date format in your mapping.
If the date field in your mapping is specified like this:
"date": {
"type": "date",
"format": "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"
}
Then you can index your dates with the exact microsecond precision you have in your database.
UPDATE
Here is a full re-creation that shows you that it works:
PUT myindex
{
"mappings": {
"doc": {
"properties": {
"date": {
"type": "date",
"format": "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"
}
}
}
}
}
PUT myindex/doc/1
{
"date": "2018-05-02T08:13:29.268482"
}
Side note: the "date" datatype stores data with millisecond precision in Elasticsearch, so if nanosecond-level precision is wanted in date range queries, the appropriate datatype is date_nanos.
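A minimal sketch of such a mapping (index name assumed; date_nanos is available from Elasticsearch 7.0, where mapping types are also removed):
PUT myindex_nanos
{
  "mappings": {
    "properties": {
      "date": {
        "type": "date_nanos"
      }
    }
  }
}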
I'm using Firebase Database to store the scores of a game. All was working fine until I decided to implement a "weekly score".
In order to be able to filter by score and then order by week, I'm storing the data in the following structure:
game_scores-weekly
  2018-01-29
    user_id: { score, date, bla, bla bla}
    user_id: { score, date, bla, bla bla}
    user_id: { score, date, bla, bla bla}
  2018-02-05
    user_id: { score, date, bla, bla bla}
    user_id: { score, date, bla, bla bla}
So, this works just fine, but I get that annoying warning every new week about performance issues due to not having an indexOn "score" index on "game_scores-weekly/new_week". Manually adding the index works... until the next week, so that's not an option.
"game_scores-weekly": {
"2018-02-19": {
".indexOn": ["score", "uid"]
},
"2018-02-12": {
".indexOn": ["score", "uid"]
}
}
Is there any way to somehow specify a wildcard in the date, so it works for any new date? Or perhaps can I programmatically create the new index every week, or is there any other solution I might not have thought about?
Also, I thought of manually creating a list of all the weeks of the year and adding them in one go, but there would likely be a limit?
Last, but not least, I'm only interested in the current week's and last week's scores. Anything older I'd like to keep to have some historical data, but I don't query it in the game, so I could potentially get rid of the indexes for older weeks.
Cheers!
Thanks to @Tristan for pointing me in the right direction.
I used the following code to define the index and now warning is gone:
"game_scores-weekly": {
"$date": {
".indexOn": ["score", "uid"]
}
}
Seems super obvious now but couldn't find anything clear in the documentation.
Note that $date could really be any name; it seems like you can specify a variable key using any $identifier.