I am looking at using Weaviate for vector searching, but I would also like to have an age-off feature for rolling off old records. Does Weaviate have any type of bulk delete operation to accomplish this? I would imagine that single deletes by ID would have an impact on performance.
It doesn't have a way to batch delete, but you can remove a class.
So, if you have a class called Foobar with, e.g., 1M data objects, you can simply do a DELETE on v1/schema/Foobar.
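For illustration, here is a minimal sketch of that call using Python's requests library (assuming a local Weaviate instance on the default port, with Foobar as the class name):

import requests

# Deleting the class removes its schema definition and all of its data objects
# in one call (assumes Weaviate is listening on localhost:8080).
response = requests.delete("http://localhost:8080/v1/schema/Foobar")
response.raise_for_status()

If you still need the class for new data, you would recreate it afterwards.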
If there is another use case where this might add value, you can add it here too.
I have a collection with thousands of documents, all of which have a synthetic partition key property like:
partitionKey: 'some-document-related-value'
Now I need to change the values for partitionKey. Of course, this requires recreating the documents, but I am wondering what the most efficient/straightforward way to do it is.
Should I use an Azure Function with a CosmosDBTrigger (set to start the feed from the beginning)?
A change feed processor?
Some other way?
I'm looking for the quickest solution that's still reliable.
Yes, the change feed is a common way to migrate data from one container to another. Another simple option may be to use the Data Migration Tool, where you build your new partition key in the SELECT statement.
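As a rough illustration of the copy-and-rewrite idea (not the Data Migration Tool itself), here is a sketch using the Python azure-cosmos SDK; the account endpoint, key, container names, and the compute_new_partition_key helper are all placeholders:

from azure.cosmos import CosmosClient

def compute_new_partition_key(doc):
    # Placeholder: derive the new synthetic partition key value from the document.
    return "some-new-document-related-value"

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("my-database")
source = database.get_container_client("documents")        # old container
target = database.get_container_client("documents-v2")     # created with the new partition key path

# Read every document from the old container and upsert it into the new one
# with the recomputed partitionKey.
for doc in source.query_items("SELECT * FROM c", enable_cross_partition_query=True):
    doc["partitionKey"] = compute_new_partition_key(doc)
    target.upsert_item(doc)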
Hopefully this is helpful.
We are new to DynamoDB and struggling with what seems like it would be a simple task.
It is not actually related to stocks (it's about recording machine results over time), but the stock example is the simplest I can think of that illustrates the goal and the problems we're facing.
The two query scenarios are:
All historical values of a given stock symbol <= We think we have this figured out
The latest value of all stock symbols <= We do not have a good solution here!
Assume that updates are not synchronized; e.g., the moment of the last update record for TSLA may be different from that for AMZN.
The 3 attributes are just { Symbol, Moment, Value }. We could make the hash key Symbol and the range key Moment, and we believe we could achieve the first query easily/efficiently.
We also assume we could get the latest value for a single, specified Symbol following https://stackoverflow.com/a/12008398
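For reference, a hedged boto3 sketch of that single-symbol lookup (the table name Stocks is hypothetical; it follows the ScanIndexForward trick from the linked answer):

import boto3

dynamodb = boto3.client("dynamodb")

# Latest value for one specified Symbol: query in descending Moment order
# and take only the first item.
resp = dynamodb.query(
    TableName="Stocks",                        # hypothetical table name
    KeyConditionExpression="#sym = :s",
    ExpressionAttributeNames={"#sym": "Symbol"},
    ExpressionAttributeValues={":s": {"S": "TSLA"}},
    ScanIndexForward=False,                    # newest Moment first
    Limit=1,
)
latest = resp["Items"][0] if resp["Items"] else None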
The SQL solution for getting the latest value for each Symbol would look a lot like https://stackoverflow.com/a/6841644
But... we can't come up with anything efficient for DynamoDB.
Is it possible to do this without either retrieving everything or making multiple round trips?
The best idea we have so far is to somehow use update triggers or streams to track the latest record per Symbol and essentially keep that cached. That could be in a separate table, or in the same table with extra info like an IsLatestForMachineKey attribute (effectively a bool). With every insert, you'd grab the one where IsLatestForMachineKey=1, compare the Moments, and if the inserted record is newer, set the new one to 1 and the older one to 0.
This is starting to feel complicated enough that I question whether we're taking the right approach at all, or maybe DynamoDB itself is a bad fit for this, even though the use case seems so simple and common.
There is a way that is fairly straightforward, in my opinion.
Rather than using a GSI, just use two tables with (almost) the exact same schema. The hash key of both should be symbol. They should both have moment and value. Pick one of the tables to be stocks-current and the other to be stocks-historical. stocks-current has no range key. stocks-historical uses moment as a range key.
Whenever you write an item, write it to both tables. If you need strong consistency between the two tables, use the TransactWriteItems api.
If your data might arrive out of order, you can add a ConditionExpression to prevent newer data in stocks-current from being overwritten by out of order data.
The read operations are pretty straightforward, but I’ll state them anyway. To get the latest value for everything, scan the stocks-current table. To get historical data for a stock, query the stocks-historical table with no range key condition.
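A sketch of how this might look with boto3, using the table and attribute names from above (this is just one way to wire it up, not a definitive implementation):

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def record_value(symbol, moment, value):
    item = {
        "symbol": {"S": symbol},
        "moment": {"S": moment},   # e.g. an ISO-8601 timestamp, so it sorts correctly
        "value": {"N": str(value)},
    }

    # Full history: every reading is kept.
    dynamodb.put_item(TableName="stocks-historical", Item=item)

    # Latest value: only overwrite if this reading is newer than what is stored
    # (or the symbol has never been seen), so out-of-order data cannot clobber it.
    try:
        dynamodb.put_item(
            TableName="stocks-current",
            Item=item,
            ConditionExpression="attribute_not_exists(#m) OR #m < :m",
            ExpressionAttributeNames={"#m": "moment"},
            ExpressionAttributeValues={":m": {"S": moment}},
        )
    except ClientError as e:
        # A failed condition just means an older reading arrived late and
        # stocks-current already holds something newer; anything else is a real error.
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise

def latest_for_all_symbols():
    # One item per symbol, so the table stays small; paginate with
    # LastEvaluatedKey if it ever grows large.
    return dynamodb.scan(TableName="stocks-current")["Items"]

def history_for_symbol(symbol):
    # Hash key only, no range key condition: all moments for the symbol, in order.
    return dynamodb.query(
        TableName="stocks-historical",
        KeyConditionExpression="#s = :s",
        ExpressionAttributeNames={"#s": "symbol"},
        ExpressionAttributeValues={":s": {"S": symbol}},
    )["Items"]

If the two writes must succeed or fail together, the same two puts can be wrapped in a single transact_write_items call, with the caveat that a failed condition then cancels the historical write as well.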
First, I'm using Azure Cosmos Graph DB.
I see this sort of pattern quite a bit:
out('an-edge').fold().coalesce(unfold(),addV('incoming-schedule'))
I want to add an edge immediately after I do an addV in the coalesce. I've been trying to do it in a simple example:
g.V('any-vertex-id').as('a').out('an-edge').coalesce(unfold(),addV('new-vertex').addE('to-v').from('a'))
"a" seems to no longer exist after a fold() since it's a barrier step. I tried store and aggregate but I must not understand those properly. Is it possible to get a reference after a fold()? I need it because it may reference a previous addV in the query to which I wouldn't have the id yet.
What is your requirement here? Do you want to create a new vertex and an edge only when out('an-edge') is not present?
If that's the case, I will try this:
g.V('any-vertex-id').as('a').coalesce(out('an-edge'), addV('new-vertex').addE('to-v').from(select('a')))
Fold() is typically used when one needs to aggregate over all the output from the preceding step. I don't think that is necessary in this case.
http://tinkerpop.apache.org/docs/current/reference/#fold-step
It looks like I can store() the vertex and then select() from it when adding the edge.
g.V('any-vertex-id').store('a').out('an-edge').fold()
.coalesce(unfold(),addV('new-vertex')
.addE('to-v').from(select('a').unfold()))
Not sure if someone has a better alternative or a better suggestion than store, but this seems to work, at least in my scenario.
I am trying to form a new table that contains the unique user_ids from an existing one. Is it possible to add an auto-increment primary key in U-SQL like we can in MySQL?
To elaborate on David's answer: unlike MySQL, ADLA/U-SQL executes in a scale-out, shared-nothing architecture. Thus there is no easy way to manage auto-incrementing numbers.
However, there are some tricks that you can use:
You can use the ROW_NUMBER() function to generate a number per row. You could add that to the MAX value you have so far.
Or you could use DateTime.Now.Ticks to get an initial seed (plus some additional offset if you want to make sure you do not have overlapping ranges between different inserts) and then use ROW_NUMBER().
Less recommended is the use of NewGUID(), since it generates a different GUID each time and is not repeatable. Thus, if a vertex is retried, the job may fail due to non-determinism.
I hope this helps.
This is not currently possible.
I am inserting data into the Firebase Realtime Database in a table with the above structure. The key of the data is auto-generated based on push. After several such entries are created, sometimes, due to certain conditions, I may need to delete one of the entries. At the point of deleting the entry, I may know some of the values of the node that I want to delete, like createdAt and createdForPostID. But I will not know the key, as it was auto-generated using the push feature of the Firebase database. The combination of createdAt and createdForPostID is unique, and only one such entry should exist in the database.
What would be the most efficient way to identify the entry without having to retrieve the entire node at OUTBOUND?
The reason I am using push is that Firebase claims it is efficient and not subject to write conflicts. I also rely on the auto-sorting by date/time offered by push.
If no efficient way can be found, then I will generate my own key using a date/time stamp. But I am hoping that this is a problem that someone has solved before and hence can guide me.
Any suggestions are welcome.
You'll need to run a query to find the items that match your conditions.
Since you seem to have multiple properties in your conditions, and the Firebase Database can only query on a single property, you'll need to combine the values into a single property as shown here.
Then you can run a query on that combined property and delete the items it returns:
var query = ref.orderByChild("createdForPostID-createdAt").equalTo("20171229_124904-20171230_200343");
query.once("value", function(snapshot) {
  snapshot.forEach(function(child) {
    child.ref.remove();
  });
});
Given Frank's answer, I realised I needed to create a unique property as per his suggestion, because I will need it for the future query. But then it seemed that I might be better off using that unique property as the key instead of using push.
So it seems, from an overall perspective, that it might be more efficient to create your own key instead of using push if the app needs both create and delete functions. Relying on push makes sense only if data is being created and deletion is not a big part of your app's functionality.
So, in conclusion, for Firebase data, the most efficient way to support both creating and deleting data is to create a unique key of your own.
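For what it's worth, a minimal sketch of that approach with the Firebase Admin SDK for Python (the OUTBOUND path is from the question; the key format, service account file, and database URL are assumptions):

import firebase_admin
from firebase_admin import credentials, db

# Assumes a service-account file and the database URL of your project.
firebase_admin.initialize_app(
    credentials.Certificate("service-account.json"),
    {"databaseURL": "https://<project-id>.firebaseio.com"},
)

outbound = db.reference("OUTBOUND")

def make_key(created_at, created_for_post_id):
    # A date/time-prefixed key keeps the chronological ordering that
    # push() keys would otherwise provide.
    return "{}-{}".format(created_at, created_for_post_id)

def create_entry(created_at, created_for_post_id, data):
    outbound.child(make_key(created_at, created_for_post_id)).set(data)

def delete_entry(created_at, created_for_post_id):
    # The key is derived from values known at delete time, so no query is
    # needed to locate the entry.
    outbound.child(make_key(created_at, created_for_post_id)).delete()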