I run a simulation with a varying number of iterations, and each iteration creates an output table like table1, table2, table3, ... They all have the same structure:
ID | value
but a varying number of rows.
For each table, I want to compute the average of the 'value' column and show the averages in a new table like:
tableNumber | averageValue
1 | 516
2 | 512
3 | 521
... | ...
Is this possible in SQLite when the number of tables is quite high? And if not, how can I achieve this in a different way?
Thanks a lot in advance :-)
Instead of creating different tables, put the results in the same table, with a column that indicates which batch or set each row belongs to. When you query the table, filter on that column so that you're working only with the desired batch/set, and put an index on the column to make those queries faster. There will be no need to save the average results to separate tables either: your query can produce the result without your having to persist it as data in another table.
select batch, avg(value) as AvgValue
from simulation
where batch = 100
group by batch
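If it helps, here is a minimal sketch of that layout (the table and column names are assumptions; batch plays the role of your table number):

CREATE TABLE simulation (
    batch INTEGER NOT NULL,  -- which simulation run the row came from
    id    INTEGER,
    value REAL
);
CREATE INDEX simulation_batch_idx ON simulation(batch);

-- averages for all batches at once, matching your desired output:
SELECT batch AS tableNumber, avg(value) AS averageValue
FROM simulation
GROUP BY batch
ORDER BY batch;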
I have an SQLite database of about 1.4 million rows and 16 columns.
I have to run an operation on 80,000 ids:
Get all rows associated with that id
Convert to an R Date object and sort by date
Calculate the difference between the 2 most recent dates
For each id I have been querying SQLite from R using dbSendQuery and dbFetch for step 1, while steps 2 and 3 are done in R. Is there a faster way? Would it be faster or slower to load the entire SQLite table into a data.table?
It heavily depends on how you are working on that problem.
Normally, loading the whole query result into memory and then doing the operations on it will be faster; I can't show you a benchmark right now, but it makes sense logically, because you otherwise repeat the same per-query overhead many times on many data.frames. Fetching 80k rows in one go is pretty fast, and faster than, say, three separate fetches of ~26k rows each.
However, you could have a look at the parallel package and use multiple cores on your machine to load subsets of your data and process them in parallel, each subset on its own core.
Here you can find information on how to do this:
http://jaehyeon-kim.github.io/2015/03/Parallel-Processing-on-Single-Machine-Part-I
If you're doing all that in R, fetching rows from the database 80,000 times in a loop, you'll probably have better results doing it all in one go in SQLite instead.
Given a skeleton table like:
CREATE TABLE data(id INTEGER, timestamp TEXT);
INSERT INTO data VALUES (1, '2019-07-01'), (1, '2019-06-25'), (1, '2019-06-24'),
(2, '2019-04-15'), (2, '2019-04-14');
CREATE INDEX data_idx_id_time ON data(id, timestamp DESC);
a query like:
SELECT id
, julianday(first_ts)
- julianday((SELECT max(d2.timestamp)
FROM data AS d2
WHERE d.id = d2.id AND d2.timestamp < d.first_ts)) AS days_difference
FROM (SELECT id, max(timestamp) as first_ts FROM data GROUP BY id) AS d
ORDER BY id;
will give you
id days_difference
---------- ---------------
1 6.0
2 1.0
An alternative for modern versions of SQLite (3.25 or newer) (EDIT: on a test database with 16 million rows and 80,000 distinct ids, it runs considerably slower than the one above, so you don't want to actually use it):
WITH cte AS
(SELECT id, timestamp
, lead(timestamp, 1) OVER id_by_ts AS next_ts
, row_number() OVER id_by_ts AS rn
FROM data
WINDOW id_by_ts AS (PARTITION BY id ORDER BY timestamp DESC))
SELECT id, julianday(timestamp) - julianday(next_ts) AS days_difference
FROM cte
WHERE rn = 1
ORDER BY id;
(The index is essential for performance in both versions. You'll probably want to run ANALYZE on the table at some point after it's populated and your index(es) are created, too.)
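If you want to check that the index is actually being used, EXPLAIN QUERY PLAN is handy (the exact output format varies between SQLite versions, so the comment below only shows the kind of line to look for):

EXPLAIN QUERY PLAN
SELECT id, max(timestamp) AS first_ts FROM data GROUP BY id;
-- expect something like:
--   SCAN data USING COVERING INDEX data_idx_id_time
ANALYZE;  -- refresh the planner's statistics after bulk loads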
I have two cross tables on a single page.
The first cross table is a summary that has Components on the horizontal axis and Facilities on the vertical axis; the cell values show the colors "RED", "YELLOW", or "NA". The second cross table is a drilldown of the marked row on the summary table, with Components on the horizontal axis and Type on the vertical axis; the cell values are a count function.
What I need is to have the color of what I marked show below each component in the drilldown.
Summary
+----------+--------+-------+--------+
| Facility | COMP1 | COMP2 | COMP3 |
+----------+--------+-------+--------+
| FAC1 | NA | RED | RED |
| FAC2 | YELLOW | NA | RED |
| FAC3 | RED | RED | YELLOW |
+----------+--------+-------+--------+
Drilldown (If I mark the FAC2 row)
+-------+--------+-------+
| Type  | COMP1  | COMP3 |
|       | YELLOW | RED   |
+-------+--------+-------+
| TYPE1 | 12     |       |
| TYPE2 | 11     | 4     |
+-------+--------+-------+
Does anyone know if this is possible with cross tables? Any tips on how to do it? I appreciate the help.
Thanks,
John
Edit: I'm doing this to work around not being able to color the column headers of a cross table, so if anyone has an alternative, I would appreciate it.
Currently using Spotfire 7.11
Okay, bear with me here, as I have hacked together a solution. I made some assumptions about your data structure, so depending on the structure of your data, the answer may need to be slightly modified.
Here is the structure of my data: a data table named "SOTest" whose color columns are Comp1 and Comp2 (these names are referenced in the script below).
Step 1: Create two document properties to hold the values of the titles. I created two document properties named "tableTitle1" and "tableTitle2" (one for each column in the details cross table). Also create one document property to hold a DateTime value that an R script will pass to us (discussed later); I named mine "time".
Step 2: Create the cross tables as you have them. Ensure the first cross table uses the marking "Marking" and the second is limited by the marking "Marking". In the second cross table, make the column titles look something like this: Count([Comp1]) as [Comp1 ${tableTitle1}], Count([Comp2]) as [Comp2 ${tableTitle2}]. These use the document properties created in Step 1.
Step 3: Create the IronPython script. The code is as follows:
from System.Collections.Generic import List
from Spotfire.Dxp.Data import *
# Create a cursor for the table column to get the values from.
# Add a reference to the data table in the script.
dataTable = Document.Data.Tables["SOTest"]
cursor = DataValueCursor.CreateFormatted(dataTable.Columns["Comp1"])
# Retrieve the marking selection
markings = Document.Data.Markings["Marking"].GetSelection(dataTable).AsIndexSet()
# Create a List object to store the retrieved data marking selection
markedata = List[str]()
# Iterate through the data table rows to retrieve the marked rows
for row in dataTable.GetRows(markings, cursor):
    value = cursor.CurrentValue
    if value != str.Empty:
        markedata.Add(value)
# Get only unique values
valData = List[str](set(markedata))
# Store in a document property
Document.Properties["tableTitle1"] = ', '.join(valData)
#### DO IT AGAIN FOR THE SECOND COLUMN ####
# Create a cursor for the second table column to get the values from
cursor = DataValueCursor.CreateFormatted(dataTable.Columns["Comp2"])
# Create a List object to store the retrieved data marking selection
markedata = List[str]()
# Iterate through the data table rows to retrieve the marked rows
for row in dataTable.GetRows(markings, cursor):
    value = cursor.CurrentValue
    if value != str.Empty:
        markedata.Add(value)
# Get only unique values
valData = List[str](set(markedata))
# Store in a document property
Document.Properties["tableTitle2"] = ', '.join(valData)
Step 4: Create an R script to kick off the IronPython script when data is marked. This is going to be a very simple R script. The code is as follows:
markedTable <- inputTable
time <- Sys.time()
The "allow caching" check box should be unchecked. The output parameter time should go to the document property time; the input parameter inputTable should be your data table, all columns, limited by the marking "Marking". Ensure that the "refresh function automatically" checkbox is checked.
Step 5: Map the IronPython script to the time document property. In the Edit > Document Properties dialog, under Properties, attach the script we created to the time document property. The R script will change the time value each time the marking on the table changes, thus running our IronPython script for us.
Step 6: Watch the magic happen.
I have a similar question to the one here: distinct values as new columns & count
But instead of having only 3 values (in the case above: drivers), I have about 1 million, so I cannot list all of them in my code. How can I do that in SQLite?
So I kind of want something like the code below to be repeated for i = 1 to length(DISTINCT(driver)):
SELECT model
, COUNT(model) as drives
, SUM(distance) as distance
, SUM(CASE WHEN driver=DISTINCT(driver)[i] THEN 1 ELSE 0 END) AS DISTINCT(driver)[i]
FROM new_table
GROUP BY model;
SQLite has no mechanism for dynamic SQL. You have to read the list of all possible drivers from the database, and then, in your program, construct a query with a separate SUM(CASE...) column for each value.
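For illustration, if the distinct drivers were 'alice' and 'bob' (hypothetical names), the generated query would look something like:

SELECT model
     , COUNT(model) as drives
     , SUM(distance) as distance
     , SUM(CASE WHEN driver = 'alice' THEN 1 ELSE 0 END) AS alice
     , SUM(CASE WHEN driver = 'bob' THEN 1 ELSE 0 END) AS bob
FROM new_table
GROUP BY model;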
But a large number of columns is inefficient, and once you exceed SQLite's column limit (2000 by default), the query will not work at all.
It might be a better idea to return each matrix entry individually:
SELECT model,
driver,
COUNT(*) AS drives_for_this_model_and_driver
FROM new_table
GROUP BY model, driver
ORDER BY model, driver;
Is there anything equivalent to ROWNUM in Teradata? I have to implement the query below in Teradata; it runs fine with Oracle. Any ideas or suggestions?
INSERT INTO ADDRES(CITY,STATEPROVINCEID)
SELECT 'sample', AA.ID
FROM AA
WHERE ROWNUM <= 1000
As there's no ORDER BY, you can simply use:
INSERT INTO ADDRES(CITY,STATEPROVINCEID)
SELECT TOP 1000 'sample',AA.ID
FROM AA
But this is not random, it's just the first 1000 rows found on an AMP.
To get sampled rows:
INSERT INTO ADDRES(CITY,STATEPROVINCEID)
SELECT 'sample',AA.ID
FROM AA
SAMPLE 1000
If you are a statistician and need a true random sample, switch to:
SAMPLE RANDOMIZED ALLOCATION 1000
You can also get multiple samples, up to 16, e.g.
SAMPLE 1000,2000 --use column SAMPLEID to know which row belongs to which sample
or a fractional sample:
SAMPLE 0.1 -- 10% of the rows
or a stratified sample, i.e. samples from different groups:
SAMPLE WHEN col < 0 THEN 10
       WHEN col < 100 THEN 20
       ELSE 50
       END
I'm not sure it will help in your situation, but for future reference, Teradata has a ROW_NUMBER() function. It works pretty much like everyone else's:
ROW_NUMBER() OVER ([PARTITION BY <column>] ORDER BY <column1>[, <column2>, ...])
Teradata has the added advantage of being able to constrain on it using QUALIFY, instead of having to use a derived table.
Select
...
from
...
QUALIFY ROW_NUMBER() OVER (order by ...) <= 1000
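Applied to the query from the question, that would look something like this (ordering by AA.ID is an arbitrary choice here, since the original query specified no order):

INSERT INTO ADDRES(CITY, STATEPROVINCEID)
SELECT 'sample', AA.ID
FROM AA
QUALIFY ROW_NUMBER() OVER (ORDER BY AA.ID) <= 1000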
I have an SQLite table which contains a numeric field field_name. I need to group by ranges of this column, something like this: SELECT CAST(field_name/100 AS INT), COUNT(*) FROM table GROUP BY CAST(field_name/100 AS INT), but including ranges which have no values (the COUNT for them should be 0). I can't figure out how to write such a query.
You can do this by using a join and (though kludgy) an extra table.
The extra table contains one row for each group you want in the result. This not only fills in missing CAST(field_name/100 AS INT) values between your returned values, but also lets you extend the range: if your current groups were 5, 6, 7, you could still report 0 through 10.
In other flavors of SQL you'd be able to right join from the data to the groups table, or full outer join, and you'd be on your way. SQLite doesn't offer these, but a plain left join does the job if you put the groups table on the left: every group row is kept, and count(data.fieldname) only counts non-NULL values, so groups with no matching data report 0. (For large tables, note that the join condition involves a computed expression; storing the group number in its own indexed column will speed this up.)
Example:
TABLE: groups_for_report
desired_group
-------------
0
1
2
3
4
5
6
Table: data
fieldname other_field
--------- -----------
250 somestuff
230 someotherstuff
600 stuff
you would use a query like
select groups_for_report.desired_group, count(data.fieldname)
from groups_for_report
left join data
  on CAST(data.fieldname/100 AS INT) = groups_for_report.desired_group
group by groups_for_report.desired_group
order by groups_for_report.desired_group;
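As an aside, if you'd rather not maintain the extra table at all, SQLite 3.8.3 and newer can generate the group numbers on the fly with a recursive CTE; a sketch for groups 0 through 6:

WITH RECURSIVE groups(desired_group) AS (
    SELECT 0
    UNION ALL
    SELECT desired_group + 1 FROM groups WHERE desired_group < 6
)
SELECT groups.desired_group, count(data.fieldname)
FROM groups
LEFT JOIN data ON CAST(data.fieldname/100 AS INT) = groups.desired_group
GROUP BY groups.desired_group;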