How RocksDB key deletion works - rocksdb

I'm trying to learn how RocksDB works under the hood. I understand that each SST file has a bloom filter to indicate whether a key belongs to the file or not. But what happens when a key is deleted from the file? Bloom filters don't support deletions, so is a new bloom filter created instead?

If a key is deleted, RocksDB creates a deletion marker (tombstone) for it, which is later persisted in SST files. Tombstones in an SST file will be added to the file's bloom filter.

"bloom filter, to indicate whether a key belongs to the file or not."
The bloom filter speeds up the check of whether a key is contained in the file. That is why the API is called keyMayExist: it is probabilistic.
"Bloom filter doesn't support deletions so a new bloom filter is created instead?"
No, this has nothing to do with bloom filters.
"what happens when a key is deleted from the file"
First a deletion marker is written to the memtable, and when the data is flushed to disk that marker is persisted, recording that key xyz has been deleted.
"I'm trying to learn how rocksdb works under the hood."
Watch the video: https://youtu.be/7QHI7JQEc5c?t=1281
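A minimal sketch of this behaviour from the client side, assuming the python-rocksdb bindings (the database path is a placeholder); it illustrates the API, not RocksDB's internals:
import rocksdb

db = rocksdb.DB("example.db", rocksdb.Options(create_if_missing=True))
db.put(b"xyz", b"value")
# The delete does not rewrite existing SST files or their bloom filters;
# it writes a tombstone that shadows the older value until compaction.
db.delete(b"xyz")
print(db.get(b"xyz"))           # None: the tombstone hides the old value
# key_may_exist is probabilistic: False means "definitely not present",
# True only means "maybe present", so a get() is still needed to confirm.
print(db.key_may_exist(b"xyz"))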

Related

Tracking structure changes in a Progress database

I am asked to automate the tracking of changes to the structure of the database: any modification, addition, or removal of tables, fields, indexes, etc.
I have looked into auditing, but only found that it can track changes to the "Database schema", which is something else.
Do you know if it is possible to do that?
We use 11.6.3.
One wonders how those magical changes in the schema (I think you clarified that it was actually schema changes you wanted to automate) occur. Optionally, it could be up to those making the changes to also keep track of them. Usually (hopefully) the database is updated using "delta df-files"; those df-files, if kept, are a changelog of the database.
Another option is to dump the data definitions daily/hourly/weekly:
/* Point the DICTDB alias at the database whose definitions you want to dump */
CREATE ALIAS DICTDB FOR DATABASE sports.
DISPLAY LDBNAME("DICTDB").
/* Dump all data definitions to a .df file */
RUN prodict/dump_df.p ("ALL",
                       "c:/temp/sports.df",
                       "").
DELETE ALIAS DICTDB. /* Optional */
Taken from this entry in the knowledge base: https://community.progress.com/s/article/15884
Then you can diff that df-file using your favorite tool, or keep it as it is.
If you actually mean structure (that is, how the data is stored in different files on disk), you can use the prostrct command to save a new st-file to disk:
prostrct list sports
This will save a file called sports.st. Handle it as above and you will have a changelog of the database structure.
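Either dump (the .df data definitions or the .st structure file) can then be compared against the copy kept from the previous run. A minimal sketch in Python using difflib; the file paths and retention scheme are assumptions:
import difflib
from pathlib import Path

# Hypothetical paths: the previous dump and the one just produced.
old = Path("c:/temp/sports_previous.df").read_text().splitlines()
new = Path("c:/temp/sports.df").read_text().splitlines()

# Any output at all means the data definitions changed since the last dump.
for line in difflib.unified_diff(old, new, fromfile="previous", tofile="current", lineterm=""):
    print(line)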

How to introduce a new column in DynamoDB running in production?

I have a use case where DynamoDB is running in production and I need to add a new column IDUpdatedAt, which will also serve as the sort key for one of the GSIs.
I tried this in a test environment: my application adds new rows with IDUpdatedAt and it works fine, but what about the existing rows? How do I add the values for those?
Also, new rows will never be added without IDUpdatedAt, but how will searches be impacted for the older rows?
PS: IDUpdatedAt is used as a filter in the application, i.e., a user can search for a specific ID and get results sorted by date. That's why IDUpdatedAt is also part of the GSI (as its sort key).
Please help.
You've got the right idea by adding the field to new items. After all, DynamoDB does not enforce a particular schema outside of the primary key.
This also happens to be a very useful feature, especially when defining a GSI on that attribute; if the attribute exists on the item, it ends up in the index! For example, imagine modeling an email inbox in DDB where each item represents an email. You could include an attribute 'is_read' and define a GSI using that attribute.
If the 'is_read' attribute exists on the item, it's in the index. Otherwise, it's not. A cool way to use GSIs to implement filtering.
Pretty neat stuff!
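To make the sparse-index idea concrete, here is a minimal sketch with boto3; the table name ('Inbox'), index name ('is_read-index'), and storing is_read as a string are all assumptions:
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table modeling an inbox; 'is_read-index' is a GSI whose
# partition key is the optional 'is_read' attribute.
table = boto3.resource("dynamodb").Table("Inbox")

# Only items that actually carry 'is_read' appear in the sparse index,
# so this query filters without needing a FilterExpression.
response = table.query(
    IndexName="is_read-index",
    KeyConditionExpression=Key("is_read").eq("false"),
)
for item in response["Items"]:
    print(item)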
However, there is no way to retroactively update all items with a new attribute other than manually updating each item (or in batches). The equivalent in SQL databases is defining a new column. Unfortunately, an analogous operation in DDB does not exist.
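A common way to backfill is a one-off script that scans the table and writes the new attribute onto each existing item. A minimal sketch with boto3; the table name, key schema, and the rule for deriving IDUpdatedAt from existing attributes are assumptions:
import boto3

# Hypothetical table whose primary key is 'id'; IDUpdatedAt is assumed to be
# a composite of the ID and an existing UpdatedAt attribute.
table = boto3.resource("dynamodb").Table("MyTable")

scan_kwargs = {}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        if "IDUpdatedAt" in item:
            continue  # already backfilled
        table.update_item(
            Key={"id": item["id"]},
            UpdateExpression="SET IDUpdatedAt = :v",
            ExpressionAttributeValues={":v": f"{item['id']}#{item['UpdatedAt']}"},
        )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
Such a backfill consumes read and write capacity for every item, so it is usually throttled or run during a low-traffic window.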

How can I query for all new and updated documents since last query?

I need to query a collection and return all documents that are new or updated since the last query. The collection is partitioned by userId. I am looking for a value that I can use (or create and use) that would help facilitate this query. I considered using _ts:
SELECT * FROM collection WHERE userId=[some-user-id] AND _ts > [some-value]
The problem with _ts is that it is not granular enough and the query could miss updates made in the same second by another client.
In SQL Server I could accomplish this using an IDENTITY column in another table; let's call the table version. In a transaction I would create a new row in the version table and do the updates to the other table (including updating its version column with the new value). To query for new and updated rows I would use a query like this:
SELECT * FROM table WHERE userId=[some-user-id] and version > [some-value]
How could I do something like this in Cosmos DB? The Change Feed seems like the right option, but without the ability to query the Change Feed, I'm not sure how I would go about this.
In case it matters, the (web/mobile) clients connect to data in Cosmos DB via a web api. I have control of the entire stack - from client to back-end.
As stated in this link:
Today, you see all operations in the change feed. The functionality where you can control the change feed for specific operations, such as updates only and not inserts, is not yet available. You can add a "soft marker" on the item for updates and filter based on that when processing items in the change feed. Currently the change feed doesn't log deletes. Similar to the previous example, you can add a soft marker on the items that are being deleted: for example, you can add an attribute in the item called "deleted", set it to "true", and set a TTL on the item so that it can be automatically deleted. You can read the change feed for historic items, for example, items that were added five years ago. If the item is not deleted, you can read the change feed as far back as the origin of your container.
So the change feed alone does not cover your requirements.
My idea:
Use an Azure Function with the Cosmos DB trigger to collect all the operations on your specific Cosmos collection. Follow this document to configure the function's input as Cosmos DB, then follow this document to configure the output as Azure Queue Storage.
Get the ids of the changed items and send them into queue storage as messages. When you want to query the changed items, just read the messages from the queue to consume them, and after that clear the entire queue. No items will be missed.
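A minimal sketch of that function, assuming the Azure Functions Python programming model with a Cosmos DB trigger binding named "documents" and a queue output binding named "outqueue" configured in function.json:
import json
import azure.functions as func

def main(documents: func.DocumentList, outqueue: func.Out[str]) -> None:
    # Each inserted or updated document arrives here via the change feed.
    # Forward just the ids to queue storage so a consumer can fetch them later.
    if documents:
        ids = [doc["id"] for doc in documents]
        outqueue.set(json.dumps(ids))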
With your approach, you can get the added/updated documents and save a reference value (the _ts and id fields) somewhere (such as a blob):
SELECT * FROM collection WHERE userId=[some-user-id] AND _ts > [some-value] and id !='guid' order by _ts desc
This is similar to the approach we use to read data from Event Hubs and store checkpointing information (epoch number, sequence number, and offset value) in a blob, where at any time only one function can take a lease on that blob.
If you go with the Change Feed, you can create a listener (a Function or a job) to listen for all added/updated data from the collection and store those values in another collection, adding an identity/version field to every document as you save it. This approach may increase your Cosmos DB bill.
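For the checkpoint-based query itself, here is a minimal sketch with the azure-cosmos Python SDK; the account URL, key, container name, and how the checkpoint (the last _ts plus the ids already seen in that second) is stored are all assumptions:
from azure.cosmos import CosmosClient

# Hypothetical connection details and container.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycollection")

def changed_since(user_id, last_ts, seen_ids):
    # _ts has one-second granularity, so re-read that whole second and drop
    # the ids that were already processed in the previous query.
    query = "SELECT * FROM c WHERE c.userId = @uid AND c._ts >= @ts ORDER BY c._ts"
    items = container.query_items(
        query=query,
        parameters=[{"name": "@uid", "value": user_id},
                    {"name": "@ts", "value": last_ts}],
        partition_key=user_id,
    )
    return [doc for doc in items if doc["id"] not in seen_ids]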
This is what consistency levels are for: https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels
Choose strong consistency and your queries will always return the latest write.
Strong: Strong consistency offers a linearizability guarantee. The reads are guaranteed to return the most recent committed version of an item. A client never sees an uncommitted or partial write. Users are always guaranteed to read the latest committed write.

BizTalk - Delete without a schema

I am importing a file with 200+ records into a master table.
The BizTalk package only services one source; other packages service other sources.
I am using strongly typed stored procedures for all SQL CRUD.
All records inside the file come from the same source.
The file does not contain a source name or source ID.
I want to determine the source from a value hard-coded in the package.
The Master table contains records from several sources.
Before the import, the existing records from that source must be deleted from the Master table.
Unlike the file import, the delete statement happens only once:
DELETE FROM Master WHERE SourceID = #SourceID
The file import works, but how can I hard-code the SourceID for the delete?
In your delete transform (just above the send shape) you can set up a SourceID property for the outgoing message. You can then populate the message context with this SourceID, and it can then be used in your delete statement.
If I understand correctly, you want to delete all existing records for the SourceID before inserting new ones?
If so, you need to have access to the SourceID value on the inbound message into the orchestration.
To do this, use property promotion.
You can either do this:
inside a pipeline component configured on the receive port, so that the property is available when the message arrives at the orchestration, or,
inside the orchestration, which will require moving the construct shape for the InsertCSV message above the delete construct shape and promoting the property within the construct shape.
Of these options, the first is probably the best, as assigning properties should ideally be done during message disassembly.
Alternatively, you can use an xpath() call within an Expression shape to interrogate the message and retrieve the value that way. This lets you avoid thinking about property promotion.
However, while quicker to implement, this approach is not best practice because it makes your orchestration very sensitive to changes in the message schema.

Is there a way to use the DBC views to find the last date and time that a database schema was altered?

I would like to find the date and time that any schema modification has taken place on a particular database. Modifications are things like tables or columns that have been created, altered, or dropped. It does not include any data that has been inserted, updated, or deleted.
The reason why I need this is because I am writing a .NET utility that depends heavily on the data returned from dbc.tables, dbc.columns, and dbc.indices. Since querying these views can be a very expensive operation, I want to read it all into custom business objects and then serialize the objects to an XML file stored on disk. This way, I can just deserialize the data when I need it unless the database's current_timestamp is greater than or equal to the datetime of the last schema change, at which point I'll refresh the local XML file with the updated schema.
LastAlterTimestamp - if it is equal to CreateTimestamp, then the object has not been modified since being created or replaced. It is updated when an attribute specific to that data dictionary object is updated.
For example, DBC.Databases.LastAlterTimestamp is not updated when a child object (table, view, macro, stored procedure, function, etc.) is added, removed, or altered. It is updated in situations such as when the password, default role, profile, or account is changed.
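To decide whether the cached XML is stale, one option is to compare the newest LastAlterTimestamp in DBC.TablesV for the database (plus the object count, since drops remove rows) against the time the XML file was written. A minimal sketch with the teradatasql Python driver; the connection details and database name are placeholders, and the original utility is .NET, so treat this purely as an illustration of the query:
import teradatasql

# Placeholder connection details.
with teradatasql.connect(host="tdhost", user="dbc", password="dbc") as con:
    with con.cursor() as cur:
        # Creates and alters leave a newer LastAlterTimestamp on the affected row;
        # a drop removes the row, which the count comparison catches.
        cur.execute(
            "SELECT MAX(LastAlterTimestamp), COUNT(*) "
            "FROM DBC.TablesV WHERE DatabaseName = ?",
            ["MyDatabase"],
        )
        last_change, object_count = cur.fetchone()
        print(last_change, object_count)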
