Find the x minimal values in a distance matrix in R

I have computed a distance matrix between patches of ancient forests and recent forests in PostgreSQL thanks to the following code:
CREATE TABLE MatDist as (
SELECT
a.id a,
b.id b ,
st_distance(a.geom, b.geom) dist
FROM public.bvi_foret a, public.bvi_foret b
WHERE a.id != b.id AND a.ANC_FOR != b.ANC_FOR
)
and it works perfectly.
I now want to select the 5 ancient forest (a)/recent forest (b) pairs with the smallest distance between them.
So I started working with R, and I can find the single pair with the minimum distance, thanks to the following code:
DT <- data.table(df)
DT[ , .SD[which.min(dist)], by = a]
But how can I get the first 5 pairs? It's probably easy with a for loop or an apply function in R, but I can't find how...
Thanks in advance for your answers.

Using pure SQL:
SELECT *
FROM MatDist
ORDER BY dist
LIMIT 5;
Thanks for your answer, but I need the first 5 FA/FR pairs for each patch of ancient forest.
SELECT *
FROM (SELECT *, ROW_NUMBER() OVER(PARTITION BY a ORDER BY dist ASC) as rn
FROM MatDist) sub
WHERE sub.rn <= 5;
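For readers who would rather do the per-group selection outside the database, the same idea (group by a, then keep the n smallest distances in each group) can be sketched as below. This is only an illustration in Python, not the R/data.table code from the question; the function name nearest_pairs and the sample rows are hypothetical.

```python
import heapq
from collections import defaultdict

# Hypothetical sample of the distance table: (ancient patch a, recent patch b, dist).
rows = [
    (1, 10, 5.0), (1, 11, 2.0), (1, 12, 9.0),
    (2, 10, 1.0), (2, 11, 7.0), (2, 12, 3.0),
]

def nearest_pairs(rows, n):
    """For each ancient patch a, keep the n rows with the smallest dist."""
    by_a = defaultdict(list)
    for a, b, dist in rows:
        by_a[a].append((a, b, dist))
    result = []
    for a in sorted(by_a):
        # nsmallest returns the rows ordered by increasing distance.
        result.extend(heapq.nsmallest(n, by_a[a], key=lambda r: r[2]))
    return result

top2 = nearest_pairs(rows, 2)
print(top2)
```

Calling nearest_pairs(rows, 5) on the full table would mirror the rn <= 5 filter in the SQL above.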

Related

SQLite order results by smallest difference

In many ways this question follows on from my previous one. I have a table that is pretty much identical
CREATE TABLE IF NOT EXISTS test
(
id INTEGER PRIMARY KEY,
a INTEGER NOT NULL,
b INTEGER NOT NULL,
c INTEGER NOT NULL,
d INTEGER NOT NULL,
weather INTEGER NOT NULL);
in which I would typically have entries such as
INSERT INTO test (a,b,c,d,weather) VALUES(1,2,3,4,30100306);
INSERT INTO test (a,b,c,d,weather) VALUES(1,2,3,4,30140306);
INSERT INTO test (a,b,c,d,weather) VALUES(1,2,5,5,10100306);
INSERT INTO test (a,b,c,d,weather) VALUES(1,5,5,5,11100306);
INSERT INTO test (a,b,c,d,weather) VALUES(5,5,5,5,21101306);
Typically this table would have multiple rows with some or all of the b, c and d values being identical but with different a and weather values. As per the answer to my other question I can certainly issue
WITH cte AS (SELECT *, DENSE_RANK() OVER (ORDER BY (b=2) + (c=3) + (d=4) DESC) rn FROM test where a = 1) SELECT * FROM cte WHERE rn < 3;
No issues thus far. However, I have one further requirement which arises as a result of the weather column. Although this value is an integer it is in fact a composite where each digit represents a "banded" weather condition. Take for example weather = 20100306. Here 2 represents the wind direction divided up into 45 degree bands on the compass, 0 represents a wind speed range, 1 indicates precipitation as snow etc. What I need to do now while obtaining my ordered results is to allow for weather differences. Take for example the first two rows
INSERT INTO test (a,b,c,d,weather) VALUES(1,2,3,4,30100306);
INSERT INTO test (a,b,c,d,weather) VALUES(1,2,3,4,30140306);
Though otherwise similar, they represent rather different weather conditions: the fourth digit is 4 as opposed to 0, indicating a higher precipitation intensity band. The WITH cte... above would rank the first two rows at the top, which is fine. But what if I would rather have the row that differs the least from an incoming "weather condition" of 30130306? I would clearly like to have the second row appearing at the top. Once again, I can live with the "raw" result returned by WITH cte... and then drill down to the right row based on my current "weather condition" in Java. However, once again I find myself thinking that there is perhaps a rather neat way of doing this in SQL that is outwith my skill set. I'd be most obliged to anyone who might be able to tell me how/whether this can be done using just SQL.
You can sort the results 1st by DENSE_RANK() and 2nd by the absolute difference of weather and the incoming "weather condition":
WITH cte AS (
SELECT *,
DENSE_RANK() OVER (ORDER BY (b=2) + (c=3) + (d=4) DESC) rn
FROM test
WHERE a = 1
)
SELECT a,b,c,d,weather
FROM cte
WHERE rn < 3
ORDER BY rn, ABS(weather - ?);
Replace ? with the value of that incoming "weather condition".
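A minimal sketch of running that query with the placeholder bound, using Python's sqlite3 module purely for illustration (the question mentions Java; the in-memory database and variable names here are assumptions). It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE test (
        id INTEGER PRIMARY KEY,
        a INTEGER NOT NULL, b INTEGER NOT NULL,
        c INTEGER NOT NULL, d INTEGER NOT NULL,
        weather INTEGER NOT NULL)
""")
conn.executemany(
    "INSERT INTO test (a, b, c, d, weather) VALUES (?, ?, ?, ?, ?)",
    [
        (1, 2, 3, 4, 30100306),
        (1, 2, 3, 4, 30140306),
        (1, 2, 5, 5, 10100306),
        (1, 5, 5, 5, 11100306),
        (5, 5, 5, 5, 21101306),
    ],
)

incoming = 30130306  # the incoming "weather condition" from the question

# Same query as above; the ? is bound to the incoming value.
ranked = conn.execute("""
    WITH cte AS (
        SELECT *,
               DENSE_RANK() OVER (ORDER BY (b=2) + (c=3) + (d=4) DESC) AS rn
        FROM test
        WHERE a = 1
    )
    SELECT a, b, c, d, weather
    FROM cte
    WHERE rn < 3
    ORDER BY rn, ABS(weather - ?)
""", (incoming,)).fetchall()

for row in ranked:
    print(row)
```

One caveat on the design: ABS(weather - ?) compares the composite as a single number, so a difference in a leading digit outweighs any difference in the digits to its right. If every band should count equally, the digits would have to be compared individually.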

Alternative for recursive aggregate queries not supported in sqlite3

I would like to perform a SQL computation of a system evolving in time as
v <- v + a (*) v
where v is a vector of N components (N >> 10), a is an N-by-N matrix, fairly sparse, (*) denotes matrix multiplication, and the evolution is recursively computed as a sequence of timesteps, with each step using the previous value of v. a changes with time as an external factor, but it is sufficient for this question to assume a is constant.
I could do this recursion loop in an imperative language, but the underlying data was kind of messy and SQL was brilliant for normalising. It would be kind of neat to just finish the job in one language.
I found that matrix multiplication is fine. Recursion is fine too, as of sqlite 3.8. But matrix multiplication inside a recursion loop does not appear to be possible. Here is my progress so far (also at http://sqlfiddle.com/#!5/ed521/1 ):
-- Example vector v
DROP TABLE IF EXISTS coords;
CREATE TABLE coords( row INTEGER PRIMARY KEY, val FLOAT );
INSERT INTO coords
VALUES
(1, 0.0 ),
(2, 1.0 );
-- Example matrix a
DROP TABLE IF EXISTS matrix;
CREATE TABLE matrix( row INTEGER, col INTEGER, val FLOAT, PRIMARY KEY( row, col ) );
INSERT INTO matrix
VALUES
( 1, 1, 0.0 ),
( 1, 2, 0.03 ),
( 2, 1, -0.03 ),
( 2, 2, 0.0 );
-- The timestep equation can also be expressed: v <- ( I + a ) (*) v, where the
-- identity matrix I is first added to a.
UPDATE matrix
SET val = val + 1.0
WHERE row == col;
-- Matrix multiply to evaluate the first step.
SELECT a.row AS row, SUM( a.val*v.val ) AS val
FROM coords v
JOIN matrix a
ON a.col == v.row
GROUP BY a.row;
Here is where the problem arises. I can't see how to do a matrix multiply without a GROUP BY (aggregation) operation, but SQLite3 specifically does not permit aggregation inside of a recursion loop:
-- Recursive matrix multiply to evaluate a sequence of steps.
WITH RECURSIVE trajectory( row, val ) AS
(
SELECT row, val
FROM coords
UNION ALL
SELECT a.row AS row, SUM( a.val*v.val ) AS val
FROM trajectory v -- recursive sequence of steps
--FROM coords v -- non-recursive first step only
JOIN matrix a
ON a.col == v.row
GROUP BY a.row
LIMIT 50
)
SELECT *
FROM trajectory;
Returns
Error: recursive aggregate queries not supported
No doubt the designers had some clear reason for excluding it! I am surprised that JOINs are allowed but GROUP BYs are not. I am not sure what my alternatives are, though.
I've found a few other recursion examples but they all seem to have carefully selected problems for which aggregation or self-joins inside the loop are not required. In the docs ( https://www.sqlite.org/lang_with.html ) an example query walks a tree recursively and performs an avg() on the output. This is subtly different: the aggregation happens outside the loop, and tree-walking uses JOINs but no aggregation inside the recursion loop. That problem proceeds only because the recursion does not depend on the aggregations, as it does in this problem.
Another example, the Fibonacci generator, is an N = 2 linear dynamical system, but with N = 2 the implementations can just hard-code the two values and the matrix multiply directly into the query, so no aggregating SUM() is needed. More generally, with N >> 10 it is not feasible to go down this path.
Any help would be much appreciated. Thanks!
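One possible workaround, sketched here under assumptions rather than taken from the thread: keep the aggregating matrix multiply in SQL, but drive the timestep loop from a host language, writing each step into a table keyed by step number. The sketch uses Python's sqlite3 module; the trajectory table layout, the column names i and j (instead of row and col) and the helper advance are hypothetical. The matrix already has the identity added, as in the UPDATE above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per (timestep, component); step 0 holds the initial vector v.
    CREATE TABLE trajectory (step INTEGER, i INTEGER, val REAL,
                             PRIMARY KEY (step, i));
    -- Matrix (I + a) from the question, stored sparsely as (i, j, val).
    CREATE TABLE matrix (i INTEGER, j INTEGER, val REAL, PRIMARY KEY (i, j));
    INSERT INTO trajectory VALUES (0, 1, 0.0), (0, 2, 1.0);
    INSERT INTO matrix VALUES (1, 1, 1.0), (1, 2, 0.03),
                              (2, 1, -0.03), (2, 2, 1.0);
""")

def advance(conn, step):
    """Compute step + 1 from step with an ordinary (non-recursive) aggregate."""
    conn.execute("""
        INSERT INTO trajectory (step, i, val)
        SELECT ?, m.i, SUM(m.val * v.val)
        FROM trajectory v
        JOIN matrix m ON m.j = v.i
        WHERE v.step = ?
        GROUP BY m.i
    """, (step + 1, step))

for step in range(50):
    advance(conn, step)

step2 = conn.execute(
    "SELECT i, val FROM trajectory WHERE step = 2 ORDER BY i").fetchall()
total_rows = conn.execute("SELECT COUNT(*) FROM trajectory").fetchone()[0]
print(step2, total_rows)
```

This gives up the single-statement goal, but every step is still a plain GROUP BY query, and a time-varying matrix could be handled by updating the matrix table between calls.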

Oracle complex update statement

I have a table where data is as given below
My requirement is to update this table so that, within a group (grouping is based on column A), if there is a value in column B, the same value is copied to the other rows in that group whose column B is null. If column B is null for all the records within a group, then a new sequence value should be generated. Also, I can't use a PL/SQL block for this; I need to write a SQL query to do it.
My expected output is given below
You won't be able to use the sequence_name.nextval directly in your update statement, as the value will increase with every row, meaning that you would end up with different values in your b column for each a value.
The best way round this that I can think of is to first ensure that every set of all-null b values has a single value in it, which you can do as follows:
merge into t1 tgt
using (select a,
b,
rid,
row_number() over (partition by a order by b) rn
from (select a,
b,
rowid rid,
max(b) over (partition by a) max_b
from t1)
where max_b is null) src
on (tgt.rowid = src.rid and src.rn = 1)
when matched then
update set tgt.b = t1_seq.nextval;
This finds the rows which have all the b values as null for a given a, and then updates one of them to have the next sequence value.
Once you've done that, you can then go ahead and populate the null values based on the max b value for that group, like so:
update t1
set b = (select max(b) from t1 t2 where t1.a = t2.a)
where b is null;
See this LiveSQL script for evidence that this works.
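To make the two steps easier to follow, here is a small emulation of their logic in plain Python. It does not run against Oracle; the sample rows, the function name fill_group_values and the counter standing in for t1_seq are all hypothetical.

```python
from itertools import count

def fill_group_values(rows, seq):
    """rows: list of dicts with keys 'a' and 'b' (b may be None). Edits rows in place."""
    groups = {}
    for row in rows:
        groups.setdefault(row["a"], []).append(row)

    # Step 1 (the MERGE): a group whose b values are all null gets one
    # fresh sequence value, written to a single row of that group.
    for members in groups.values():
        if all(r["b"] is None for r in members):
            members[0]["b"] = next(seq)

    # Step 2 (the UPDATE): every remaining null takes the max b of its group.
    for members in groups.values():
        group_max = max(r["b"] for r in members if r["b"] is not None)
        for r in members:
            if r["b"] is None:
                r["b"] = group_max
    return rows

rows = [
    {"a": 1, "b": None}, {"a": 1, "b": 7},
    {"a": 2, "b": None}, {"a": 2, "b": None},
    {"a": 3, "b": 5},
]
filled = fill_group_values(rows, count(100))  # count(100) stands in for the sequence
print([r["b"] for r in filled])
```

Because step 1 touches only one row per all-null group, the sequence is advanced once per group rather than once per row.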
Something like this:
update table t1
set B = (select nvl(max(b),sequence_name.nextval) from table where a=t1.a)
Ps: I couldn't test this.
Indeed we can't use sequences in correlated subqueries... :(
One workaround is the use of merge :
merge into teste t1
using (select max(b) as m,a from teste group by a) t2
on (t1.a=t2.a)
when matched then update set b= nvl(t2.m,seq_teste.nextval);
One thing: that nextval will ALWAYS be consumed, even when its value is not used. If you don't want that, you might need some PL/SQL code.

Need to apply Primary Indexes and secondary indexes in teradata tables

Can someone please help me solve my problem?
I have three tables to be joined using indexes in Teradata to improve performance. The query is specified below:
Select b.Id, b.First_name, b.Last_name, c. Id,
c.First_name, c.Last_name, c.Result
from
(
select a.Id, a.First_name, a. Last_name, a.Approver1, a.Approver2
From table1 a
Inner join table2 d
On a.Id =D.Id
and a.Approver1 = d.Approver1
And a.Approver2 = d.Approver2
) b
Left join
(
select * from table3
where result is not null
and application like 'application1'
) c
On c. Id=b.Id
Group by b.Id, b.First_name, b.Last_name, c.Id,
c.First_name, c.Last_name, c.Result
The above query takes a long time because the PI is not defined correctly.
The first two tables (table1 and table2) have the same set of columns, so the PI can be defined on Id, Approver1, Approver2.
However, when joining with table3 I am confused and need to understand how to define the PI. Does a PI only work when the tables have the same set of columns?
The structure of table3 is
Id, First name, Last name, Result
and of table1 and table2
Id, First name, Last name, Approver1, Approver2, Results
Can you please help me define primary indexes so that the query can be optimised?
Teradata will usually not use Secondary Indexes for joins. The best PI would be id for all three tables; of course you need to check that there are not too many rows per value and that it's not too skewed.
The GROUP BY can be simplified to a DISTINCT. Why do you need it? Can you show the Primary Keys of those tables?
Edit based on comment:
PI-based joins are by far the fastest way. But you should be able to get rid of the DISTINCT, too; it's always a huge overhead.
Try replacing the 1st join with an EXISTS:
Select b.Id, b.First_name, b.Last_name, c. Id,
c.First_name, c.Last_name, c.Result
from
(
select a.Id, a.First_name, a. Last_name, a.Approver1, a.Approver2
From table1 a
WHERE EXISTS
(
SELECT *
FROM table2 d
WHERE a.Id =D.Id
and a.Approver1 = d.Approver1
And a.Approver2 = d.Approver2
)
) b
Left join
(
select * from table3
where result is not null
and application like 'application1'
) c
On c. Id=b.Id

SQL Hive: select (*) LIMIT 1 based on a combination of 3 columns, union in R, RODBC

I'm an intern working with big data and this is my first question. If I'm not asking it well, please let me know how to improve.
I have a very large table that I'm querying through Hive via R's RODBC package.
Let's say that table has columns named A:ZZZ.
I'd like to pull one row, with all columns, for every unique combination of 3 columns, let's say B, F, and G.
I ran the below query to get all unique combinations of B, F and G and came up with a little over 7000:
select B, F, G, count(*)
from DB.tableName
group by B, F, G;
I did a lot of research and found this:
SELECT * FROM T WHERE (A,B) IN (('1', '1'),('2', '2'));
I currently have all my combinations of B, F and G stored as a data frame in R. I thought that if I could convert the data frame of combinations into a vector that I named TestVector, that I could try this:
SELECT * FROM DB.Table WHERE (B,F,G) IN TestVector LIMIT 1;
but I get these errors, and don't know how to fix the syntax:
[1] "HY000 110 [Cloudera][ImpalaODBC] (110) Error while executing a query in Impala: [HY000] : AnalysisException: Syntax error in line 5:\n (B, F, G)\n ^\nEncountered: COMMA\nExpected: AND, BETWEEN, DIV, IN, IS, LIKE, NOT, OR, REGEXP, RLIKE\n\nCAUSED BY: Exception: Syntax error\n"
[2] "[RODBC] ERROR: Could not SQLExecDirect 'select *\n from \n DB.table \n WHERE \n (B, F, G)\n IN (vectorTest)\n LIMIT 1;'"
Please help!
Thanks for your time and patience.
I'd like to pull one row, with all columns, for every unique
combination of 3 columns, let's say B, F, and G.
Queries like this are typically solved using row_number to enumerate each row in a group and select rows with a certain row number.
select * from (
select * ,
row_number() over (partition by B, F, G order by id) rn
from DB.tableName
) t where rn = 1
The query above will pick the row with the lowest id for each B,F,G group.
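A small runnable check of that pattern, assuming an in-memory SQLite database in place of the Hive/Impala table (so the table is named tableName without the DB prefix, and the id and other columns are hypothetical). SQLite 3.25 or later is needed for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tableName (id INTEGER, B TEXT, F TEXT, G TEXT, other TEXT)")
conn.executemany("INSERT INTO tableName VALUES (?, ?, ?, ?, ?)", [
    (1, "x", "p", "m", "first"),
    (2, "x", "p", "m", "second"),   # same (B, F, G) as id 1
    (3, "y", "p", "m", "third"),
    (4, "y", "q", "m", "fourth"),
    (5, "y", "q", "m", "fifth"),    # same (B, F, G) as id 4
])

# Number the rows inside each (B, F, G) group and keep only the first one.
picked = conn.execute("""
    SELECT id, B, F, G, other
    FROM (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY B, F, G ORDER BY id) AS rn
          FROM tableName) t
    WHERE rn = 1
    ORDER BY id
""").fetchall()

for row in picked:
    print(row)
```

Three distinct (B, F, G) combinations go in and three rows come out, each carrying all of its columns, which is what the question asked for.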
