I have about 100 thousand records in Cosmos DB. I want to get the distinct records by some property. I am using a stored procedure to achieve this and set the page size to -1 to get the maximum number of records. When I fire a query without DISTINCT, I get about 19 thousand records. If I fire the DISTINCT query, it gives me distinct records, but the DISTINCT is applied within those 19 thousand records instead of the entire 100 thousand records.
Below are the queries I have used:
SELECT r.[[FieldName]] FROM r -> returns 19,000 records with duplicates
SELECT DISTINCT r.[[FieldName]] FROM r -> returns distinct records (about 5,000), deduplicated from the above 19,000 records instead of from all 100 thousand records
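For reference, the true distinct count can be checked server-side in one query by wrapping the DISTINCT in a subquery; this is a minimal sketch, assuming recent SQL API subquery support and keeping the placeholder field name from above:

SELECT COUNT(1) FROM (SELECT DISTINCT r.[[FieldName]] FROM r)

If that count comes back much higher than 5,000, the DISTINCT itself is fine and the shortfall is a paging issue: a query inside a stored procedure returns partial results with a continuation token once it hits its execution limits, so each page has to be drained in a loop regardless of the -1 page size.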
I need to run a query to find all documents with duplicated e-mails.
SELECT * FROM (SELECT c.Email, COUNT(1) as cnt FROM c GROUP BY c.Email) a WHERE a.cnt > 1
When I run it in Data Explorer in the Azure Portal it finds 4 results, but that is not a complete list of duplicated emails: I already know one email that is duplicated, and when the query is narrowed (WHERE a.Email = 'x') that email is returned. There are about 70 duplicated emails in the collection.
Currently, throughput is set to autoscale with 6000 Max RU/s, and the collection has about 4 million documents. When running the query I observe an increased count of 429 responses on this collection.
Query Statistics shows that all documents are retrieved from the collection, but the output is only 4 rows (it should be around 70).
The query used 277,324 RUs and took 71 seconds, which gives 3,905 RU/s on average, so it shouldn't be throttled.
Why does Cosmos return only limited results for this query?
What can I do to get all the duplicates?
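As a sanity check on how many duplicates actually exist, the total row count can be compared with the distinct email count; this is a minimal sketch reusing c.Email from the query above:

SELECT COUNT(1) FROM c

SELECT COUNT(1) FROM (SELECT DISTINCT c.Email FROM c)

The difference between the two counts is the number of surplus duplicate documents. If that difference corresponds to roughly 70 duplicated emails while the GROUP BY query shows only 4, the GROUP BY results are being cut off by paging, and the remaining pages have to be fetched with the continuation token instead of from a single Data Explorer page.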
We are dealing with machine log data that resides in Presto. A view was created in Teradata QueryGrid which queries Presto, with the result set landing in Databricks. For another internal use case we are trying to create a table in Teradata but are facing difficulty doing so.
When we try to create a table in Teradata for a particular date, which has 2610117,459,037,913088 records, only about 14K records get inserted into the target table. Below is the query for the same; xyz.view is the view created in TD QueryGrid which eventually fetches the data from Presto.
CREATE TABLE abc.test_table AS
( SELECT * FROM xyz.view WHERE event_date = '2020-01-29' )
WITH DATA PRIMARY INDEX (MATERIAL_ID, SERIAL_ID);
But when we create the table with sample data (say SAMPLE 10000000), we get the exact number of sampled records in the created table, like below:
CREATE TABLE abc.test_table AS
( SELECT * FROM xyz.view WHERE event_date = '2020-01-29' SAMPLE 10000000 )
WITH DATA PRIMARY INDEX (MATERIAL_ID, SERIAL_ID);
But again, creating with a sample of 1 billion records gets us only about 208 million records in the target table.
Can anyone please help with why this is happening, and whether it is possible to create the table with all 2610117,459,037,913088 records?
We are using TD 16.
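One way to narrow down where rows are being lost is to count through the QueryGrid view directly, before any CREATE TABLE is involved; a minimal sketch using the view and date from the question:

SELECT COUNT(*) FROM xyz.view WHERE event_date = '2020-01-29';

If this count already comes back far below the expected total, the rows are being dropped on the QueryGrid/Presto side of the transfer rather than by the CTAS itself; if it matches the expected total, the CREATE TABLE step is the one to investigate.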
I have a SPARQL query which returns results with a LIMIT of 20.
In this query I also want to know the total number of results without running the query two times (once with LIMIT and once without).
For example: on running a query the total possible results are 500; with the LIMIT it displays only 20 at a time, but in my response I want a field which displays the total result count, i.e., 500.
Updated question
Suppose my data contains five entries whose sequence is abc_11.
Now if I run a query where sequence = abc_11 with LIMIT=2, I will get two results back, which is fine. In addition to this output, what I want is a totalMatchedResult field, where totalMatchedResult is 5 because the query actually matched 5 results but returned only 2 because of our LIMIT=2.
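In SPARQL 1.1 this can be done in a single query by joining a COUNT subquery onto the limited selection; a minimal sketch where the prefix, the :sequence predicate, and the ?item variable are hypothetical stand-ins for the actual schema:

PREFIX : <http://example.org/>

SELECT ?item ?totalMatchedResult
WHERE {
  { SELECT (COUNT(*) AS ?totalMatchedResult) WHERE { ?s :sequence "abc_11" } }
  ?item :sequence "abc_11" .
}
LIMIT 2

The inner SELECT is evaluated first and yields a single ?totalMatchedResult binding (5 here), which is then joined onto every outer match, so each of the 2 returned rows carries the full count even though only 2 rows come back.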
If I extract the full data from all existing ga_sessions_ or Firebase tables, the Bytes Processed are 4.5 GB.
If I save the previous query into a destination table and then extract the full data from this table, the Bytes Processed are 217 GB.
Both tables have the same table size. Why this discrepancy?
UPDATE:
My standardSQL query:
SELECT _TABLE_SUFFIX AS Date,
user_dim.app_info.app_instance_id,
user_dim.app_info.app_version,
user_dim.geo_info.city,
user_properties.key,
event.name
FROM `project.dataset.app_events_*`,
UNNEST(user_dim.user_properties) AS user_properties,
UNNEST(event_dim) AS event
returns 4.5 GB. If I save this result into a table (called historical_data) and compose this query:
SELECT *
FROM `project.dataset.historical_data`
then it returns 217 GB.
I think this happens because of the double cross join - for each cross-joined row you now have a redundant copy of the fields below:
_TABLE_SUFFIX AS Date,
user_dim.app_info.app_instance_id,
user_dim.app_info.app_version,
user_dim.geo_info.city
so even though the original table was 4.5 GB in size, the result came out at 217 GB.
That makes sense to me - this is something that happens with big data: a result can explode to an enormous size if you are not careful enough.
And, by the way, check the number of rows in the original table vs. the output table.
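A quick way to confirm that multiplication is to compare the row counts directly; a minimal sketch using the table names from the question (the wildcard count assumes the app_events_* tables still exist):

SELECT COUNT(*) FROM `project.dataset.app_events_*`;

SELECT COUNT(*) FROM `project.dataset.historical_data`;

If the second count is dramatically larger, each source row is being multiplied by (number of user properties) x (number of events), which is exactly the redundancy described above.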
I need to fetch the first 200 rows from my database without a full table scan. If I scan the full table it takes too much time, because my table contains 160 million records. I am using Oracle 11g.
Do you really need to avoid a FTS in this case? I expect that
SELECT * FROM table WHERE ROWNUM <= 200;
runs pretty fast and starts returning results immediately, despite the FTS, even with a table containing millions of rows.
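One caveat: if the 200 rows have to come back in a specific order, ROWNUM must be applied after the sort by nesting the ORDER BY in a subquery; a minimal sketch where my_table and some_column are hypothetical placeholders:

SELECT *
FROM (SELECT * FROM my_table ORDER BY some_column)
WHERE ROWNUM <= 200;

Without the subquery, ROWNUM is assigned before the ORDER BY, so the query would sort an arbitrary 200 rows instead of returning the top 200 (and note the inner ORDER BY does force a sort, which on 160 million rows is no longer immediate).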