Azure data explorer update record - bigdata

I am new to Azure data explorer and I am wondering how you can do update on a record in Azure data explorer using microsoft .NET SDK in C# ?
The Microsoft documentation is really poor
Can we update or you can replace a row only or you?

You can use soft-delete to delete the original record, and then append/ingest the updated record.
Please note that this won't be atomic, meaning if someone queries the table between the soft-delete and the append operations, they won't see neither the old record, nor the updated record.

there is no record "update" mechanism in Azure Data Explorer, even the 'soft delete' removes and replaces the row. this is useful for one-off scenarios, and may not be worth implementing in another language since it should not be used frequently. as the soft delete documentation says, if you plan to update data often, materialize may be a better option.
materialize is a bit more work and abstract, generally being worth the effort if you have a very large table that relies on metadata information like ingestion_time to make sense of records.
in smaller tables (say, less than a gig) i recommend the simple approach of replacing the table with an updated version of itself (just make sure that if you do rely on fields like ingestion_time, you update the schema and extend that data as a field for later use).
You will need to query for the entire table, implement logic for isolating only the row(s) of interest (while retaining all others, and perform an extend function to modify that value. Then, replace (do not append) the entire table.
For example:
.set-or-replace MyTable1 <|
MyTable1
| extend IncorrectColumn = iif(IncorrectColumn == "incorrectValue", "CorrectValue", IncorrectColumn)
alternatively, you can have the unchanging relevant data and updated data in two tabular results, and perform a union on them to form the final table.
.set-or-replace MyTable1 <|
let updatedRows =
MyTable1
| where Column1 = "IncorrectValue"
| extend Column1 = "CorrectValue";
let nonUpdatedRows =
MyTable1
| where Column1 = "CorrectValue";
updatedRows
| union nonUpdatedRows
I prefer to write to a temp table, double check the data quality, then replace the final table. This is particularly useful if you're working in batches and want to minimize the risk of data loss if there's a failure halfway through your batches

Related

Rails - Create and operate on a temporary table?

My background is in data science with R, but in my current position I'm pulling data through Rails and ActiveRecord. I want to perform transformations to my data and create new columns and save it in a temporary way that allows me to continue querying it like a regular table, but without actually making changes to the database.
In R, this might look something like:
new_table <- old_table[old_table$date >= '2020-01-01']
new_table$average <- mean(new_table$value)
I would take this new_table and perform any number of queries I could have done to the old_table, and once I close my app I expect this temporary table to be removed as well.
This particular transformation is simple and wouldn't require a new table, but for example, there are a number of tables I'd like to join with my new_table. It would be easier if I could perform my transformations once and then join it, rather than joining the old_table and performing the transformation each time.
Since your question is vague I'll give a general answer that might not fit your use but it's a best guess at this point. There are numerous ways to use the DB connection in Rails to query directly, as referenced in the link in my comments above. But as an experiment I wanted to see if this would work and it does, at least with a project that is using Postgres. I wanted it to be DB agnostic so I'm avoiding calling the DB connection directly...
First create a temporary class in the Rails console:
rails c
Loading development environment (Rails...
class MyTempTable < ActiveRecord::Base
end
=> nil
EDIT:
In addition to the method below, you can also do this to create the table:
MyTempTable.find_by_sql('create temp table temp_tables AS select...')
This will create the temp table directly from a query. You could then use a join statement if you wanted data from more than one table in the new temp table, and you can add any additional columns you want
End Edit
Now you have a class that will act like a table with the usual ActiveRecord methods. Rails now assumes there is a table in the DB called my_temp_tables (must be plural). You can then create a temp table (if your DBMS supports temp tables) like this:
MyTempTable.find_by_sql('create temp table my_temp_tables(col1, col2... ')
Now you have a temp table with the columns you want. You can then do SQL operations using
MyTempTable.find_by_sql('INSERT INTO my_temp_tables SELECT * FROM ....')
You can then treat MyTempTable like any other model in Rails. If you wanted all the columns from one table joined with some columns from another table you can create the temp table as above, you just have to create all the columns first (at least in Postgres, in MSSQL you can probably create the temp table inserting directly from a select => join statement). If you are new to Rails you can grab column names by doing this on existing tables:
some_columns = SomeTable.column_names
=> ["id", "name", "serial", "purchased", ...]
Now you have an array of the column names so you don't have to type all of them. You can list out the columns you want from the various tables, cut and past them into the create temp table... statement, then INSERT the joined data into MyTempTable
If you do much of this regularly you'll probably want to keep a listing of all your column names in an text file. You can also create Rake tasks that do all of this and save the data to some format, or send it off to where ever it is supposed to go. That way you can have it all in a file that you can just run and it will create the temp tables, do the work, and then when it closes out the temporary classes and tables go away.
You might want to investigate some Ruby Gems, there are probably existing gems that do some of what you want. But as a proof of concept this works. You could also spin up a local Rails app and use scripting to import the data you want into tables, then just flush and recreate it at will.
Any Rails gurus that know of a better way, please add an answer or edit this one. This is mostly a thought experiment for me since I wanted to see if it was possible.
If you want to create views that you can access later on you could use a gem like https://github.com/scenic-views/scenic
Or something like this might be of interest: https://github.com/igorkasyanchuk/rails_db
Sounds like you're keen on the benefits of having some structure and tools available to work on the data, but don't want the data persisted in a db table.
Maybe use a model without a table like this.

Dynamodb query expression

Team,
I have a dynamodb with a given hashkey (userid) and sort key (ages). Lets say if we want to retrieve the elements as "per each hashkey(userid), smallest age" output, what would be the query and filter expression for the dynamo query.
Thanks!
I don't think you can do it in a query. You would need to do full table scan. If you have a list of hash keys somewhere, then you can do N queries (in parallel) instead.
[Update] Here is another possible approach:
Maintain a second table, where you have just a hash key (userID). This table will contain record with the smallest age for given user. To achieve that, make sure that every time you update main table you also update second one if new age is less than current age in the second table. You can use conditional update for that. Update can either be done by application itself, or you can have AWS lambda listening to dynamoDB stream. Now if you need smallest age for each use, you still do full table scan of the second table, but this scan will only read relevant records, to it will be optimal.
There are two ways to achieve that:
If you don't need to get this data in realtime you can export your data into a other AWS systems, like EMR or Redshift and perform complex analytics queries there. With this you can write SQL expressions using joins and group by operators.
You can even perform EMR Hive queries on DynamoDB data, but they perform scans, so it's not very cost efficient.
Another option is use DynamoDB streams. You can maintain a separate table that stores:
Table: MinAges
UserId - primary key
MinAge - regular numeric attribute
On every update/delete/insert of an original query you can query minimum age for an updated user and store into the MinAges table
Another option is to write something like this:
storeNewAge(userId, newAge)
def smallestAge = getSmallestAgeFor(userId)
storeSmallestAge(userId, smallestAge)
But since DynamoDB does not has native transactions support it's dangerous to run code like that, since you may end up with inconsistent data. You can use DynamoDB transactions library, but these transactions are expensive. While if you are using streams you will have consistent data, at a very low price.
You can do it using ScanIndexForward
YourEntity requestEntity = new YourEntity();
requestEntity.setHashKey(hashkey);
DynamoDBQueryExpression<YourEntity> queryExpression = new DynamoDBQueryExpression<YourEntity>()
.withHashKeyValues(requestEntity)
.withConsistentRead(false);
equeryExpression.setIndexName(IndexName); // if you are using any index
queryExpression.setScanIndexForward(false);
queryExpression.setLimit(1);

Dynamic query and caching

I have two problem sets. What I am preferably looking for is a solution which combines both.
Problem 1: I have a table of lets say 20 rows. I am reading 150,000 rows from other table (say table 2). For each row read from table 2, I have to match it with a specific row of table 1 (not matching whole row, few columns. like if table2.col1 = table1.col && table2.col2 = table1.col2) etc. Is there a way that i can cache table 1 so that i don't have to query it again and again ?
Problem 2: I want to generate query string dynamically i.e., if parameter 2 is null then don't put it in where clause. Now the only option left is to use immidiate execute which will be very slow.
Now what i am asking that how can i have dynamic query to compare it with table 1 ? any ideas ?
For problem 1, as mentioned in the comments, let the database handle it. That's what it does really well. If it is something being hit often, then the blocks for the table should remain in the database buffer cache if the buffer cache is sized appropriately. Part of DBA tuning would be to identify appropriate sizing, pinning tables into the "keep" pool, etc. But probably not something that needs worrying over.
If the desire is just to simplify writing the queries rather than performance, then views or stored procs can simplify the repetitive use of the join.
For problem 2, a query in a format like this might work for you:
SELECT id, val
FROM myTable
WHERE filter = COALESCE(v_filter, filter)
If the input parameter v_filter is null, then just automatically match the existing column. This assumes the existing filter column itself is never null (since you can't use = for null comparisons). Also, it assumes that there are other indexed portions in the WHERE clause since a function like COALESCE isn't going to be able to take advantage of an index.
For problem 1 you just join the tables. If there is an equijoin and one table is quite small and the other large then you're likely to get a hash join. This is effectively a caching mechanism, and the total cost of reading the tables and performing the join is only very slightly higher than that of reading the tables (as long as the hash table fits in memory).
It does not make a difference if the query is constructed and run through execute immediate -- the RDBMS hash join will still act as an effective cache.

Determine flyway variables from earlier SQL step

I'd like to use flyway for a DB update with the situation that an DB already exists with productive data in it. The problem I'm looking at now (and I did not find a nice solution yet), is the following:
There is an existing DB table with numeric IDs, e.g.
create table objects ( obj_id number, ...)
There is a sequence "obj_seq" to allocate new obj_ids
During my DB migration I need to introduce a few new objects, hence I need new
object IDs. However I do not know at development time, what ID
numbers these will be
There is a DB trigger which later references these IDs. To improve performance I'd like to avoid determine the actual IDs every time the trigger runs but rather put the IDs directly into the trigger
Example (very simplified) of what I have in mind:
insert into objects (obj_id, ...) values (obj_seq.nextval, ...)
select obj_seq.currval from dual
-> store this in variable "newID"
create trigger on some_other_table
when new.id = newID
...
Now, is it possible to dynamically determine/use such variables? I have seen the flyway placeholders but my understanding is that I cannot set them dynamically as in the example above.
I could use a Java-based migration script and do whatever string magic I like - so, that would be a way of doing it, but maybe there is a more elegant way using SQL?
Many thx!!
tge
If the table you are updating contains only reference data, get rid of the sequence and assign the IDs manually.
If it contains a mix of reference and user data, you need to select the id based on values in other columns.

SQLite Modify Column

I need to modify a column in a SQLite database but I have to do it programatically due to the database already being in production. From my research I have found that in order to do this I must do the following.
Create a new table with new schema
Copy data from old table to new table
Drop old table
Rename new table to old tables name
That seems like a ridiculous amount of work for something that should be relatively easy. Is there not an easier way? All I need to do is change a constraint on a existing column and give it a default value.
That's one of the better-known drawbacks of SQLite (no MODIFY COLUMN support on ALTER TABLE), but it's on the list of SQL features that SQLite does not implement.
edit: Removed bit that mentioned it may being supported in a future release as the page was updated to indicate that is no longer the case
If the modification is not too big (e.g. change the length of a varchar), you can dump the db, manually edit the database definition and import it back again:
echo '.dump' | sqlite3 test.db > test.dump
then open the file with a text editor, search for the definition you want to modify and then:
cat test.dump | sqlite3 new-test.db
As said here, these kind of features are not implemented by SQLite.
As a side note, you could make your two first steps with a create table with select:
CREATE TABLE tmp_table AS SELECT id, name FROM src_table
When I ran "CREATE TABLE tmp_table AS SELECT id, name FROM src_table", I lost all the column type formatting (e.g., time field turned into a integer field
As initially stated seems like it should be easier, but here is what I did to fix. I had this problem b/c I wanted to change the Not Null field in a column and Sqlite doesnt really help there.
Using the 'SQLite Manager' Firefox addon browser (use what you like). I created the new table by copying the old create statement, made my modification, and executed it. Then to get the data copied over, I just highlighted the rows, R-click 'Copy Row(s) as SQL', replaced "someTable" with my table name, and executed the SQL.
Various good answers already given to this question, but I also suggest taking a look at the sqlite.org page on ALTER TABLE which covers this issue in some detail: What (few) changes are possible to columns (RENAME|ADD|DROP) but also detailed workarounds for other operations in the section Making Other Kinds Of Table Schema Changes and background info in Why ALTER TABLE is such a problem for SQLite. In particular the workarounds point out some pitfalls when working with more complex tables and explain how to make changes safely.

Resources