I am hoping this is an appropriate use case for Azure Data Factory.
I have a Cosmos DB that has ~200k records, and I would like to iterate over the entire database, passing each record into a Logic App. Is there an easy way to foreach over every record? I thought that Azure Data Factory would have this capability, but the "Lookup + Foreach" combo doesn't like the number of records I have. My attempts at creating a while loop with the "Lookup + Foreach" pipeline also feel slightly clunky.
I don't feel that 200k records is a large dataset. Am I missing something? Is there a better way?
I believe the ideal way is to use the change feed mechanism. This would be a perfect use case for it.
https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed-functions
Related
I have a collection with thousands of documents, all of which have a synthetic partition key property like:
partitionKey: 'some-document-related-value'
Now I need to change the values of partitionKey. Of course, that requires recreating the documents, but I am wondering what the most efficient and straightforward way to do it is.
Should I use an Azure Function with a CosmosDBTrigger (set to start the feed from the beginning)?
A change feed processor?
Some other way?
I'm looking for the quickest solution that's still reliable.
Yes, the change feed is a common way to migrate data from one container to another. Another simple option may be to use the Data Migration Tool, where you build your new partition key in the select statement.
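As a rough illustration of the per-document step that either approach has to perform, the sketch below copies a document and recomputes its partition key before the copy is written to the new container. The `rekey` helper, the `make_key` callback and the sample fields are hypothetical. The underscore-prefixed names are the system properties Cosmos DB generates itself, and dropping them from the copy is an assumption about what you would want, not a requirement stated above.

```python
from typing import Any, Callable, Dict

# System properties that Cosmos DB generates itself; the copy gets fresh ones.
SYSTEM_PROPERTIES = ("_rid", "_self", "_etag", "_attachments", "_ts")


def rekey(doc: Dict[str, Any],
          make_key: Callable[[Dict[str, Any]], str]) -> Dict[str, Any]:
    """Return a copy of doc with a recomputed partitionKey (hypothetical helper)."""
    # Copy everything except the system properties; the original is not mutated.
    new_doc = {k: v for k, v in doc.items() if k not in SYSTEM_PROPERTIES}
    # make_key decides the new value; it is supplied by the caller.
    new_doc["partitionKey"] = make_key(doc)
    return new_doc
```

The function would be called once per document delivered by the change feed (or per row read by a migration tool), with the result written to the target container.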
Hopefully this is helpful.
I have a database that's been populated with data on a server. I've made some changes to the database models and I wish to merge the existing data with the new data without losing everything.
Do you know of an appropriate and no-nonsense method to achieve this?
Well, one method would be to set up replication from each of them to a single database. But that seems pretty time intensive. I think you might have to write a program to do it (either .NET or SQL), because how would it know which record to take if you have two records with the same primary key?
If you know that there will never be any duplicates, you could just write SQL statements to do it. But that's a pretty big if.
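Whether that "big if" holds can be checked before merging. A minimal sketch, using Python's built-in sqlite3 module as a stand-in for the real server, with hypothetical table names `old_data` and `new_data`, lists the primary keys that exist in both tables:

```python
import sqlite3


def overlapping_keys(conn: sqlite3.Connection) -> list:
    """Return the ids present in both old_data and new_data (hypothetical tables)."""
    rows = conn.execute(
        "SELECT o.id FROM old_data AS o "
        "JOIN new_data AS n ON n.id = o.id "
        "ORDER BY o.id"
    ).fetchall()
    return [row[0] for row in rows]


def make_merge_demo_db() -> sqlite3.Connection:
    """Build a small in-memory database with one conflicting key (id 3)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE old_data (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE new_data (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO old_data VALUES (?, ?)",
                     [(1, "a"), (2, "b"), (3, "c")])
    conn.executemany("INSERT INTO new_data VALUES (?, ?)",
                     [(3, "c2"), (4, "d")])
    conn.commit()
    return conn
```

An empty result suggests plain SQL statements are enough; any keys returned are the rows for which you still have to decide which copy wins.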
You might try something like http://www.red-gate.com/products/sql-development/sql-data-compare/ which will synchronize data and I believe it will also figure out what to do when there are two records. I've had very good luck with Red-Gate products.
Here's my problem.
I want to ingest lots and lots of data .... right now millions and later billions of rows.
I have been using MySQL and I am playing around with PostgreSQL for now.
Inserting is easy, but before I insert I want to check whether that particular record already exists; if it does, I don't want to insert it. As the DB grows, this operation (obviously) takes longer and longer.
If my data were in a hash map, the lookup would be O(1), so I thought I'd create a hash index to help with lookups. But then I realised that if I have to compute the hash again every time, I will slow the process down massively (and if I don't compute the index, I don't have O(1) lookup).
So I am in a quandary. Is there a simple solution? Or a complex one? I am happy to try other datastores; however, I need to be able to do reasonably complex queries, e.g. something similar to SELECT statements with WHERE clauses, so I am not sure if NoSQL solutions are applicable.
I am very much a novice, so I wouldn't be surprised if there is a trivial solution.
NoSQL stores are good at handling huge volumes of inserts and updates.
MongoDB has a really good feature for update-or-insert (called an upsert), which updates the document if it already exists and inserts it otherwise.
Check out this page from the MongoDB docs:
http://www.mongodb.org/display/DOCS/Updating#Updating-UpsertswithModifiers
You can also check out the safe mode setting on the Mongo connection. Setting it to false makes inserts more efficient, because the driver then does not wait for the server to acknowledge each write.
http://www.mongodb.org/display/DOCS/Connections
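The same "insert only if it is not already there" behaviour is also available in relational stores through a unique index, which does the existence check as part of the insert. A minimal sketch, assuming Python's built-in sqlite3 module as a stand-in and a hypothetical `events` table (SQLite spells the statement `INSERT OR IGNORE`; PostgreSQL offers `INSERT ... ON CONFLICT DO NOTHING` and MySQL offers `INSERT IGNORE` for the same purpose):

```python
import sqlite3


def create_events_table(conn: sqlite3.Connection) -> None:
    # The PRIMARY KEY gives the table a unique index on event_key.
    conn.execute("CREATE TABLE events (event_key TEXT PRIMARY KEY)")


def insert_new_only(conn: sqlite3.Connection, keys) -> int:
    """Insert each key at most once; return how many rows were actually added."""
    count_sql = "SELECT COUNT(*) FROM events"
    before = conn.execute(count_sql).fetchone()[0]
    # Rows whose key already exists are skipped instead of raising an error.
    conn.executemany(
        "INSERT OR IGNORE INTO events (event_key) VALUES (?)",
        [(k,) for k in keys],
    )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before
```

There is no separate "does it exist?" query here; the index lookup happens inside the insert itself.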
You could use CouchDB. It's NoSQL, so you can't run SQL queries per se, but you can create design documents that allow you to run map/reduce functions on your data.
In my web application, I have a dynamic query that returns a huge amount of data to a DataTable, and this query is often re-run with different parameters, so the database is exhausted.
I want to load all records, with no parameters, into an object and perform queries (maybe with LINQ) on that object, so the database will not be exhausted.
Which objects can be used instead of a DataTable?
This is one of my pet peeves - people who return all the data from the database.
There is absolutely no need for this unless you are doing reporting.
If you are doing reporting, then you need to increase your hardware capability so that the database can cope. This may also include tuning your database, rearranging tables, reindexing, regular rebuilding of indexes, updating statistics, archiving out old data, etc.
If you are NOT doing reporting, then start limiting how much data can be queried at any one time. Users DO NOT need to see massive quantities of data all at once. They need to see discrete amounts of data presented in a manageable and coherent way.
Another rule of thumb I like to observe is: let your database server do the work. It is made to manipulate lots of data, that is what it is good at, and it should have the power to do it. Pulling back loads of data to the client, and then trying to manipulate that data on the client, is a foolish thing to do. If your client machines are more powerful than the database server, then you have issues.
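Limiting how much is queried at once, while leaving the filtering and ordering to the server, can be sketched with keyset paging. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for the real database, and the `items` table, its columns and the page size are hypothetical.

```python
import sqlite3


def fetch_page(conn: sqlite3.Connection, after_id: int = 0, page_size: int = 50):
    """Return up to page_size rows whose id is greater than after_id."""
    # The server applies the filter, the ordering and the limit;
    # the client only ever receives one page.
    return conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()


def make_items_db(n: int = 120) -> sqlite3.Connection:
    """Build an in-memory table with ids 1..n for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO items VALUES (?, ?)",
                     [(i, "item %d" % i) for i in range(1, n + 1)])
    conn.commit()
    return conn
```

To get the next page, pass the last id of the previous page as `after_id`.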
Never ever do this (except for caching)!
You are trying to reimplement DB mechanisms, like
persistent storage
index search and query strategy
replication
and so on
Spend your time on DB optimization (optimal schema, indexes, queries, partitioning).
Let's say I have a dataset in an ASP.NET website (.NET 3.5) with 5 tables, each has roughly 30,000 rows and an average of 12 columns. I want to insert all of the data from the dataset into 5 very-similar-but-not-quite-identical tables in SQL Server 2008. I also want to use LINQ (personal preference - trying to learn something new).
Is it as simple as iterating through the dataset and, for each row, creating a new instance of the associated class, initializing its data with the dataset's row, adding it to the data model, and then doing one giant SubmitChanges at the end?
Are there better ways of doing this with LINQ? Or is this the de-facto standard?
Creating objects and inserting them is fine. But to avoid a gigantic commit at the end, you might want to perform a SubmitChanges() every 100 rows or so.
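The "commit every 100 rows or so" idea can be sketched independently of LINQ to SQL. The example below is an illustration only: it uses Python's built-in sqlite3 module rather than a DataContext, and the `target` table and its columns are hypothetical.

```python
import sqlite3


def insert_in_batches(conn: sqlite3.Connection, rows, batch_size: int = 100) -> int:
    """Insert rows, committing after every batch_size rows; return the commit count."""
    commits = 0
    pending = 0
    for row in rows:
        conn.execute("INSERT INTO target (id, name) VALUES (?, ?)", row)
        pending += 1
        if pending == batch_size:
            conn.commit()  # plays the role of a periodic SubmitChanges()
            commits += 1
            pending = 0
    if pending:
        conn.commit()  # flush the final, partial batch
        commits += 1
    return commits
```

The trade-off is that a failure part-way through leaves the earlier batches committed, so the load needs to be restartable or wrapped in an explicit transaction.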
Alternately you could get a copy of Red Gate's "SQL Data Compare" utility if you have the cash. Then you never have to write one of these things again. :-)
Edit 2010-04-19: If you want to use a transaction, I think you should still use my approach instead of a single SubmitChanges(). In this case you'll want to explicitly manage your own transaction in L2S (see http://msdn.microsoft.com/en-us/library/bb386995.aspx). Run your queries in a try/catch and roll back the transaction if you get any failures.
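The shape of that approach (batched inserts inside one explicitly managed transaction, rolled back on any failure) can be sketched as follows. This is not LINQ to SQL code: it uses Python's built-in sqlite3 module as a stand-in, and the `target` table is hypothetical.

```python
import sqlite3


def insert_all_or_nothing(conn: sqlite3.Connection, rows, batch_size: int = 100) -> bool:
    """Insert rows batch by batch in one transaction; roll back everything on failure."""
    try:
        for start in range(0, len(rows), batch_size):
            # Each batch is sent separately, but nothing is committed yet.
            conn.executemany(
                "INSERT INTO target (id, name) VALUES (?, ?)",
                rows[start:start + batch_size],
            )
        conn.commit()  # single commit once every batch has succeeded
        return True
    except sqlite3.Error:
        conn.rollback()  # undo every batch sent so far
        return False
```

A duplicate key in the last batch, for example, leaves the table exactly as it was before the call.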
Two last bits of advice:
Make sure your ASP.NET timeout is set high enough.
Consider printing out some kind of progress indicator. It makes running these kinds of long-running things much more palatable.
Linq To Sql doesn't natively have anything like the SqlBulkCopy class. I did a quick search and it looks like there's an implementation for Linq To Sql. No clue if it is any good but it can't hurt to check it out.
DataContext.ExecuteCommand can be used with an arbitrary SQL statement. You could do an "INSERT ... SELECT".
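A set-based copy of this kind moves all rows in one statement instead of one object per row. The sketch below shows the SQL shape only, using Python's built-in sqlite3 module as a stand-in for SQL Server. The `staging` and `target` tables and their columns are hypothetical, and `||` is SQLite's string concatenation operator (SQL Server uses `+`).

```python
import sqlite3


def copy_rows(conn: sqlite3.Connection) -> int:
    """Copy all staging rows into target in one statement; return target's row count."""
    # The SELECT reshapes the source columns to fit the slightly different target.
    conn.execute(
        "INSERT INTO target (id, full_name) "
        "SELECT id, first_name || ' ' || last_name FROM staging"
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

This assumes the source data is already in a table on the server; data held only in an in-memory dataset would have to be loaded into such a table first.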