I want to write the same query for multiple files. Is it possible to write a dynamic query in U-SQL, or is there any way to eliminate re-writing the same piece of code, so that
Select count(*) as cnt from #table1;
Select count(*) as cnt from #table2;
can be replaced with
Select count(*) as cnt from #dynamic
where #dynamic = table1, table2
(Azure Data Lake team here)
Your question mentions reading from files, but your example shows tables. If you really do want to read from files, the EXTRACT statement supports "File Sets", which allow a single EXTRACT statement to read multiple files specified by a pattern:
@data =
    EXTRACT name string,
            age int
    FROM "/input/{*}.csv"
    USING Extractors.Csv();
Sometimes the data needs to include the filename it came from, so you can specify it like this:
@data =
    EXTRACT name string,
            age int,
            basefilename string
    FROM "/input/{basefilename}.csv"
    USING Extractors.Csv();
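Continuing from the @data rowset above, here is a hedged sketch of how the original pair of COUNT queries could collapse into a single script; the file names and output path are hypothetical:
// Count rows per source file, restricted to the files of interest
// (filtering on the virtual column typically limits which files are read).
@counts =
    SELECT basefilename,
           COUNT(*) AS cnt
    FROM @data
    WHERE basefilename == "table1" OR basefilename == "table2"
    GROUP BY basefilename;

OUTPUT @counts
TO "/output/counts.csv"
USING Outputters.Csv();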
I use a custom CSV extractor that matches columns to values using the first row in the CSV file.
Here is the Gist, to be added either in code-behind or as a custom assembly: https://gist.github.com/serri588/ff9e3047d8341398df4aea7557f0a82c
I made it because I have a list of files that have a similar structure, but slightly different columns. The standard CSV extractor is not well suited to this task. Write your EXTRACT with all the possible column names you want to pull and it will fill those values and ignore the rest.
For example:
Table_1 has columns A, B, and C.
Table_2 has columns A, C, and D.
I want A, B, and C so my extract would be
@data =
    EXTRACT A string,
            B string,
            C string
    FROM "Table_{*}.csv"
    USING new yourNamespace.CSVExtractor();
Table 1 will populate all three columns, while Table 2 will populate A and C, ignoring D.
U-SQL does not provide a dynamic execution mode per se, but it is adding some features that can help with some of the dynamic scenarios.
Today you have to provide the exact schema for table-type parameters for TVFs/SPs. However, we are working on a feature that will give you flexible schema parameters, making it possible to write a TVF/SP that can be applied to any table shape (as long as your queries do not have a dependency on the shape).
Until this capability becomes available, the suggestions are:
If you know what the possible schemas are: generate a TVF/SP for each possible schema and call it accordingly (see the sketch after this list).
Use any of the SDKs (C#, PowerShell, Java, Python, node.js) to code-gen the script based on the schema information (assuming you are applying it to an object from which you can get schema information and not just a rowset expression).
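To illustrate the first suggestion, here is a minimal, hedged sketch of a per-schema TVF in U-SQL; the database, table, and column names are hypothetical:
// Hypothetical: one TVF per known schema, created once...
DROP FUNCTION IF EXISTS dbo.GetTable1;
CREATE FUNCTION dbo.GetTable1()
RETURNS @result TABLE(name string, age int)
AS
BEGIN
    @result = SELECT name, age FROM dbo.table1;
END;

// ...and then called explicitly from a query script:
// @t1  = dbo.GetTable1();
// @cnt = SELECT COUNT(*) AS cnt FROM @t1;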
Related
I am looking to see if there is a way to format conditional values in batch instead of typing them manually. For example, I am filtering on 5-digit codes in SQL, and my source for the codes is a list in Excel. There can be hundreds of codes to add to a SQL WHERE clause to filter on. Is there a tool or formatting method that will take a list of values and format them with single quotes and comma separation?
From this:
30239
30240
30241
30242
To this:
'30239',
'30240',
'30241',
'30242',
...
Then, these formatted values can be pasted into the WHERE clause instead of manually typing all of this out. Again, this is for hundreds of values...
I used to use BrioQuery, which had functionality to import text files to be used in filtering, but my current query tool, TOAD Data Point, does not seem to have this.
Thank you
Look into SQL*Loader. Create a staging table to contain the imported values, use the loader to populate the stage table, and then modify your query to reference the stage table; it becomes something like:
Select ...
where target_column_name in (select column_name from stage_table);
The structure "where ... in (select ...)" may not be the best for performance, but once the data is loaded you will have all the facilities SQL offers at your disposal.
It has been a few years since I've used TOAD, but as I remember it has import functionality. There are other tools for loading data from Excel into Oracle; SQL*Loader just happens to be the one Oracle supplies with the RDBMS.
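For reference, a minimal sketch of that setup; the staging table, file name, and column width are all hypothetical:
-- Hypothetical staging table for the imported codes
CREATE TABLE stage_codes (code VARCHAR2(5));

-- codes.ctl: minimal SQL*Loader control file
-- (TRUNCATE clears the stage table before each load)
LOAD DATA
INFILE 'codes.txt'
TRUNCATE
INTO TABLE stage_codes
FIELDS TERMINATED BY ','
(code)
Loaded with something like sqlldr userid=... control=codes.ctl, after which the "where ... in (select ...)" query above can run against stage_codes.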
I want to detect column data types of any SELECT query in SQLite.
In the C API, there is const char *sqlite3_column_decltype(sqlite3_stmt*,int) for this purpose. But that only works for columns in a real table. Expressions, such as LOWER('ABC'), or columns from queries like PRAGMA foreign_key_list("mytable"), always return null here.
I know there is also typeof(col), but I don't have control over the fired SQL, so I need a way to extract the data type out of the prepared statement.
You're looking for sqlite3_column_type():
The sqlite3_column_type() routine returns the datatype code for the initial data type of the result column. The returned value is one of SQLITE_INTEGER, SQLITE_FLOAT, SQLITE_TEXT, SQLITE_BLOB, or SQLITE_NULL. The return value of sqlite3_column_type() can be used to decide which of the first six interfaces should be used to extract the column value.
And remember that in SQLite, type is for the most part associated with a value, not a column: different rows can have different types stored in the same column.
My R workflow now involves dealing with a lot of queries (RPostgreSQL library). I really want to make code easy to maintain and manage in the future.
I started loading large queries from separate .SQL files (this helped) and it worked great.
Then I started using interpolated values (that helped) which means that I can write
SELECT * FROM table WHERE value = ?my_value;
and (after loading it into R) interpolate it using sqlInterpolate(ANSI(), query, my_value = "stackoverflow").
What happens now is I want to use something like this
SELECT count(*) FROM ?my_table;
but how can I make it work? sqlInterpolate() only interpolates safely by default. Is there a workaround?
Thanks
In ?DBI::SQL, you can read:
By default, any user supplied input to a query should be escaped using
either dbQuoteIdentifier() or dbQuoteString() depending on whether it
refers to a table or variable name, or is a literal string.
Also, on this page:
You may also need dbQuoteIdentifier() if you are creating tables or
relying on user input to choose which column to filter on.
So you can use:
sqlInterpolate(ANSI(),
"SELECT count(*) FROM ?my_table",
my_table = dbQuoteIdentifier(ANSI(), "table_name"))
# <SQL> SELECT count(*) FROM "table_name"
sqlInterpolate() is for substituting values only, not other components like table names. You could use other templating frameworks such as brew or whisker.
I'd like to use Flyway for a DB update in a situation where a DB already exists with production data in it. The problem I'm looking at now (and have not found a nice solution for yet) is the following:
There is an existing DB table with numeric IDs, e.g.
create table objects ( obj_id number, ...)
There is a sequence "obj_seq" to allocate new obj_ids
During my DB migration I need to introduce a few new objects, hence I need new object IDs. However, I do not know at development time what ID numbers these will be.
There is a DB trigger which later references these IDs. To improve performance I'd like to avoid determining the actual IDs every time the trigger runs, and rather put the IDs directly into the trigger.
Example (very simplified) of what I have in mind:
insert into objects (obj_id, ...) values (obj_seq.nextval, ...)
select obj_seq.currval from dual
-> store this in variable "newID"
create trigger on some_other_table
when new.id = newID
...
Now, is it possible to dynamically determine/use such variables? I have seen the flyway placeholders but my understanding is that I cannot set them dynamically as in the example above.
I could use a Java-based migration script and do whatever string magic I like - so, that would be a way of doing it, but maybe there is a more elegant way using SQL?
Many thx!!
tge
If the table you are updating contains only reference data, get rid of the sequence and assign the IDs manually.
If it contains a mix of reference and user data, you need to select the id based on values in other columns.
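To make the second case concrete, a hedged sketch in Oracle syntax; the natural-key column ("obj_name"), the object name, and the trigger name are all hypothetical:
-- The migration inserts the new reference row...
INSERT INTO objects (obj_id, obj_name) VALUES (obj_seq.NEXTVAL, 'my_new_object');

-- ...and the trigger resolves its id by the natural key instead of a hard-coded number.
CREATE OR REPLACE TRIGGER trg_some_other_table
BEFORE INSERT ON some_other_table
FOR EACH ROW
DECLARE
    l_obj_id objects.obj_id%TYPE;
BEGIN
    SELECT obj_id INTO l_obj_id FROM objects WHERE obj_name = 'my_new_object';
    IF :NEW.id = l_obj_id THEN
        NULL; -- handle rows that reference the newly introduced object
    END IF;
END;
/
This keeps the migration free of hard-coded IDs, at the cost of the per-row lookup the question was hoping to avoid.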
I am creating an activity table with many types of activities. Let's
say activities of type "jogging" will have elements a, b, and c while
activities of "football" will have elements a, d, and e. Can I create a
table in which the row elements for each column depend on that column's
type? I have considered creating one table for each activity type or
creating a single table with rows for every activity's options, but
there will be many activity types so it seems like a waste to use so
many tables or leave so many rows blank.
You cannot create such a table; it is not in the nature of databases to allow for "varargs". That is the reason we have relations in databases to model this type of stuff.
For an evil quickhack you could store the variable number of arguments in one column in a specific format and parse this again. Something like "a:foo|e:bar|f:qux". Don't do this, it will get out of hand in about 1 day.
I second James' proposal: redesign your tables. It should then look something like this.
Table: Activities
id|activity
0|jogging
1|football
2|...
Table: ElementsOfActivities
id|activity_id|element
0|0|a
1|0|b
2|0|c
3|1|a
4|1|d
5|1|e
Look up "normalization" (for example http://en.wikipedia.org/wiki/Database_normalization)
I assume in the subject you mean column instead of row, because the whole concept of a table is built around the fact that it has a variable number of rows. The same goes for your statement "leave so many rows blank": again, I assume you are talking about columns.
What you are describing is essentially an (anti) pattern called "entity attribute value". Search for this and you'll find a lot of hits describing how to do it and why not to do it.
In Postgres things are somewhat easier. It has a contrib module called "hstore" which is essentially what you are looking for. "Multiple columns inside a single column".
The biggest drawback with the hstore module is that you lose type safety. You can only put character data into an hstore column, so you cannot say that the attribute "price" is numeric and the attribute "name" is a character value.
If you can live with that restriction, hstore is probably what you are looking for in Postgres.
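A hedged sketch of the hstore approach (PostgreSQL; the table and attribute names are illustrative):
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE activity (
    id         serial PRIMARY KEY,
    type       text NOT NULL,
    attributes hstore
);

INSERT INTO activity (type, attributes)
VALUES ('jogging',  'a => foo, b => bar, c => baz'),
       ('football', 'a => foo, d => qux, e => quux');

-- Every value comes back as text; cast it yourself if it is really numeric.
SELECT id, attributes -> 'a' AS a
FROM activity
WHERE type = 'jogging';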
It's complicated. The short answer is, "No."
You should ask yourself what you're trying to report on, and try to figure out a different schema for tracking your data.
If you really want to implement a variable-column-count table, you can do something close.
Define the activity types, the elements you'll track on each one, and a junction table to resolve the many-to-many relationship between them; these tables will be mostly static. Then add an Activity table and an ActivityAttribute table, so you end up with Activity, Activity Type, Activity Element, Activity Type-Elements, and Activity Attribute tables.
Types would be "jogging", "football".
Elements would be "a", "b", "c", "d"...
Type-Elements would have rows that look like "jogging:a", "jogging:b", "jogging:c", "football:a", "football:d"
Attributes would have the actual data: "18236:a:'0:10:24'", "18237:d:'356 yards'"
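Pulling that together, a hedged DDL sketch of those tables (names and types are illustrative):
CREATE TABLE ActivityType    (type_id INTEGER PRIMARY KEY, name VARCHAR(50));     -- 'jogging', 'football'
CREATE TABLE ActivityElement (element_id INTEGER PRIMARY KEY, name VARCHAR(50));  -- 'a', 'b', 'c', 'd'

-- Which elements apply to which type: 'jogging:a', 'football:d', ...
CREATE TABLE ActivityTypeElement (
    type_id    INTEGER REFERENCES ActivityType(type_id),
    element_id INTEGER REFERENCES ActivityElement(element_id),
    PRIMARY KEY (type_id, element_id)
);

CREATE TABLE Activity (
    activity_id INTEGER PRIMARY KEY,
    type_id     INTEGER REFERENCES ActivityType(type_id)
);

-- The actual data: '18236:a:0:10:24', '18237:d:356 yards', ...
CREATE TABLE ActivityAttribute (
    activity_id INTEGER REFERENCES Activity(activity_id),
    element_id  INTEGER REFERENCES ActivityElement(element_id),
    value       VARCHAR(100),
    PRIMARY KEY (activity_id, element_id)
);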
Tables aren't a limited resource (in reasonable practice), so don't obsess about "wasting" them by creating lots of them. Similarly, in most modern databases null columns don't take up space (in PostgreSQL, beyond a minimal "null bitmask" overhead), so they aren't a particularly precious resource either.
It probably makes sense to have a table to represent distinct sets of attributes that can be defined together (this is essentially one of the general rules of database normalisation). If you want to deal with "activities" in a generic way, you may want to have common attributes in a shared table, rather like a base class in OOP... or you may not.
For example you could have:
jogging(activity_id int, a type, b type, c type)
football(activity_id int, a type, d type, e type)
and then create a view to combine these together when desired:
create view activity as
    select 'jogging', activity_id, a, b, c, null as d, null as e from jogging
    union all
    select 'football', activity_id, a, null, null, d, e from football
Alternatively you could have:
activity(activity_id int, a type)
jogging(activity_id int, b type, c type)
football(activity_id int, d type, e type)
and then:
create view activity_combined as
    select case when jogging.activity_id is not null then 'jogging'
                when football.activity_id is not null then 'football'
           end as activity_type,
           activity_id, a, b, c, d, e
    from activity
    left join jogging using (activity_id)
    left join football using (activity_id)
(The view gets a different name in this variant because "activity" is already the base table.)
These models are mostly equivalent, the main difference being that the second one provides a clear path to a distinct activity_id identifier, which is one reason many people would prefer it, especially when using an ORM to persist the data (although you can do it the first way too by sharing a sequence).
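For completeness, a hedged sketch of the shared-sequence variant mentioned at the end (PostgreSQL syntax; column types are illustrative):
-- One sequence shared by both tables, so activity_id values never collide
CREATE SEQUENCE activity_id_seq;

CREATE TABLE jogging (
    activity_id int PRIMARY KEY DEFAULT nextval('activity_id_seq'),
    a text, b text, c text
);

CREATE TABLE football (
    activity_id int PRIMARY KEY DEFAULT nextval('activity_id_seq'),
    a text, d text, e text
);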