We are dealing with log data from machines which lives in Presto. A view is created in Teradata QueryGrid which queries Presto and makes the result set available in Databricks. For another internal use case we are trying to create a table in Teradata, but we are facing difficulty in doing so.
When we try to create a table in Teradata for a particular date which has 2610117,459,037,913088 records, only around 14K records get inserted into the target table. Below is the query. xyz.view is the view created in TD QueryGrid which eventually fetches the data from Presto.
CREATE TABLE abc.test_table AS
( SELECT * FROM xyz.view WHERE event_date = '2020-01-29' )
WITH DATA PRIMARY INDEX (MATERIAL_ID, SERIAL_ID);
But when we create the table with sample data (say SAMPLE 10000000), we get exactly that number of records in the created table, like below:
CREATE TABLE abc.test_table AS
( SELECT * FROM xyz.view WHERE event_date = '2020-01-29' sample 10000000)
WITH DATA PRIMARY INDEX (MATERIAL_ID, SERIAL_ID);
But creating with a sample of 1 billion records again gets us only around 208 million records in the target table.
Can anyone please help here as to why this is happening, and whether it is possible to create the table with all 2610117,459,037,913088 records?
We are using TD 16.
Related
Recently I moved data from one Teradata test table to BigQuery, and I see a row count difference between TD and BQ. Checking further, I see that one of the row values is in "DATE" format instead of a string, even though that column is the PI column and its data type is VARCHAR. This row is returned in BQ when I run a SELECT, but not in TD, whereas I do see the row when I export the data to Excel. I am really not sure what the reason could be for it not showing up when I run a SELECT statement. Please help me understand the reason, and also let me know how I can search for such problematic data when the table is too big. Thanks.
e.g.: CREATE MULTISET TABLE Test (a INT, b VARCHAR(100), c VARCHAR(100), d TIMESTAMP(6)) PRIMARY INDEX (b);
The data in that table looks like the rows in the screenshot (screenshot of sample rows, not reproduced here).
I am trying to automate some performance checks on queries in Teradata.
As part of that, I want to check whether the columns used in the join conditions are the primary index of their respective tables, and similarly whether the columns used in the WHERE condition are partition columns of their respective tables. Is there any Teradata query which can give this directly, without parsing the whole query?
Yes, there are two DBC views you can query:
dbc.columnsv
dbc.indicesv
Primary index information is stored in the second view; just search with your table name and database name.
Partitioning information is stored in columnsv; there is a column with the flag value 'Y' for partitioning columns.
Example:
SELECT DatabaseName, TableName, ColumnName FROM DBC.ColumnsV WHERE PartitioningColumn = 'Y' AND TableName = <> AND DatabaseName = <>;
SELECT * FROM DBC.IndicesV WHERE TableName = <> AND DatabaseName = <>;
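For the primary index check specifically, something along these lines may work; this is only a sketch, and the IndexType codes are my assumption from the DBC documentation (database and table names are placeholders):
-- Hedged sketch: list the columns that make up a table's primary index.
-- Assumed codes: 'P' = non-partitioned primary index, 'Q' = partitioned primary index.
SELECT DatabaseName,
       TableName,
       ColumnName,
       ColumnPosition,
       IndexType
FROM   DBC.IndicesV
WHERE  DatabaseName = 'your_db'     -- placeholder
AND    TableName    = 'your_table'  -- placeholder
AND    IndexType IN ('P', 'Q')
ORDER BY ColumnPosition;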
I would like to cluster the raw table with the raw Firebase event data in BQ, but without reprocessing/creating other tables (keeping costs at a minimum).
The main idea is to find a way to cluster the tables at the moment they are created from the intraday table.
I tried to create empty tables with a pre-defined schema (the same as the previous events tables), but partitioned by the _PARTITIONTIME column (NULL partition) and clustered by the event_name column, along the lines of the sketch below.
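A minimal sketch of that DDL, where the project and dataset names are placeholders and the Firebase export schema is heavily abbreviated:
-- Hypothetical sketch only: project, dataset and most columns are placeholders;
-- the real Firebase export schema has many more (nested) fields.
CREATE TABLE `myproject.analytics_123456.events_20181222`
(
  event_date      STRING,
  event_timestamp INT64,
  event_name      STRING
  -- ... remaining Firebase export columns ...
)
PARTITION BY _PARTITIONDATE   -- ingestion-time (NULL) partitioning
CLUSTER BY event_name;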
After Firebase inserts all the data from the intraday table, the event_name column still shows up in the Details tab of the table as the clustering field, but no cost reduction happens when querying.
What could be another solution, or a way to make this work?
Thanks in advance.
Edit:
The Details tab of our table shows event_name as the clustering field (screenshot not reproduced here).
After running this query:
SELECT * FROM `ooooooo.ooooooo_ooooo.events_20181222`
WHERE event_name = 'screen_view'
the result (shown in a screenshot, not reproduced here) is that the query processed the whole table, so there is no cost reduction.
But if I create the same table, clustered by event_name, manually with:
CREATE TABLE `aaaa.aaaa.events_20181222`
PARTITION BY DATE(event_timestamp)
CLUSTER BY event_name
AS
SELECT * FROM `ooooooo.ooooooo_ooooo.events_20181222`
then the same query as in the first image, applied to the created table, processes only ~5 MB, so clustering really does work there.
Suppose we have a file with just one table named TableA, and this table has just one column named Text.
Let's say we populate TableA with 3,000,000 strings like these (each line a record):
Many of our patients are incontinent.
Many of our patients are severely disturbed.
Many of our patients need help with dressing.
If I save the file at this point it is ~326 MB.
Now let's say we want to increase the speed of our queries, and therefore we set the Text column as the primary key (or create an index on it).
If I save the file at this point it is ~700 MB.
our query:
SELECT Text FROM "TableA" where Text like '% home %'
for the table without index: ~5.545s
for the indexed table: ~2.231s
As far as I know, when we create an index on a column or make a column the primary key, the SQLite engine doesn't need to refer to the table itself (if no other column is requested in the query); it uses the index to answer the query, and hence query execution gets faster.
My question is: in the scenario above, where we have just one column and make that column the primary key too, why does SQLite hold what looks like unnecessary data (in this case ~326 MB)? Why not keep just the index/primary-key data?
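To check whether SQLite is really answering the query from the index alone, a look at the query plan can help; this is only a sketch (the index name is mine, and the exact plan wording varies by SQLite version):
-- Hedged sketch: confirm a covering-index scan for the LIKE query.
CREATE INDEX IF NOT EXISTS idx_text ON TableA(Text);
EXPLAIN QUERY PLAN
SELECT Text FROM TableA WHERE Text LIKE '% home %';
-- A covering-index plan typically reads something like:
--   SCAN TableA USING COVERING INDEX idx_text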
In SQLite, table rows are stored in the order of the internal rowid column.
Therefore, indexes must be stored separately.
In SQLite 3.8.2 or later, you can create a WITHOUT ROWID table which is stored in order of its primary key values.
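For the one-column table in the question, that would look something like the sketch below; the table is then stored as the primary-key B-tree itself, so there is no separate rowid table duplicating the strings:
-- Sketch: store the table in primary-key order; no separate rowid table is kept.
CREATE TABLE TableA (
  Text TEXT PRIMARY KEY
) WITHOUT ROWID;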
I have Oracle 11g.
I have an employee table and an employee time data table.
My table has id, employee_no, employee_in_time, ...
In one day I get multiple records for multiple employee ids, e.g. 1, 100, 17-JUN-14 04.57.19 PM, ... and 2, 100, 17-JUN-14 05.57 PM, ...
How do I get the most recently recorded entry using employee_no?
I already tried a join of both tables to get the employee name and employee_in_time.
Please save my day.
It would really help to know the table structures and the relevant columns in each. In addition, it helps to know sample data, expected results, and what you've tried to date.
Based on the statement "recently get recorded with using employee_no", I take this to mean: get the most recent employee_in_time for each employee.
SELECT MAX(employee_in_time), employee_no, TRUNC(employee_in_time) AS CalDay
FROM   employee_time
GROUP  BY employee_no, TRUNC(employee_in_time);
This returns the most recent time entry per employee per day; if you need other data from the employee table, a join should suffice. But without knowing the expected results, I'm not sure exactly what you're after.
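As a rough sketch of that join (the employee table and its employee_name column are assumptions based on the question, not confirmed by it):
-- Hedged sketch: latest clock-in per employee per day, joined back for the name.
-- The employee table and the employee_name column are assumed.
SELECT e.employee_no,
       e.employee_name,
       t.CalDay,
       t.last_in_time
FROM   employee e
JOIN  (SELECT employee_no,
              TRUNC(employee_in_time) AS CalDay,
              MAX(employee_in_time)   AS last_in_time
       FROM   employee_time
       GROUP  BY employee_no, TRUNC(employee_in_time)) t
  ON   t.employee_no = e.employee_no;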