I have a table of the form
select rowid,* from t;
rowid f
----- -----
1 aaa
2 bbb
3 ccc
4 ddd
5 eee
6 fff
7 ggg
8 aaa
9 bbb
10 ccc
11 ddd
12 eee
13 fff
14 ggg
I'd like to select n rows before and m rows after a given row match. For instance, for rows that match f='ccc' with n=m=1, I'd like to get
2 bbb
3 ccc
4 ddd
9 bbb
10 ccc
11 ddd
The rowid is sequential in my setup, so I guess we can play with it. I tried something along the lines of
select rowid,f from t where rowid between
(select rowid-1 from t where f='ccc') and
(select rowid+1 from t where f='ccc');
rowid f
----- -----
2 bbb
3 ccc
4 ddd
But the result is obviously wrong: I got only the first occurrence of the 'ccc' match. I guess I need a join, or maybe a recursive CTE, but I am afraid that is beyond my knowledge so far :) Thanks in advance.
A scalar subquery can return only a single value.
You could do two self joins, but it would be simpler to use set operations:
SELECT * FROM t
WHERE rowid IN (SELECT rowid - 1 FROM t WHERE f = 'ccc'
                UNION ALL
                SELECT rowid FROM t WHERE f = 'ccc'
                UNION ALL
                SELECT rowid + 1 FROM t WHERE f = 'ccc');
Larger values of n and m require more subqueries.
If there are too many, you can use a join:
SELECT *
FROM t
WHERE rowid IN (SELECT t.rowid
                FROM t
                JOIN (SELECT rowid - ? AS n,
                             rowid + ? AS m
                      FROM t
                      WHERE f = 'ccc'
                     ) AS ranges
                ON t.rowid BETWEEN ranges.n AND ranges.m);
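For illustration, here is a minimal Python sqlite3 sketch of the join-based variant above. The in-memory table is a hypothetical reconstruction of the question's data, and the range columns are renamed lo/hi (instead of n/m) purely to avoid confusion with the bound parameters:

```python
import sqlite3

# Rebuild the example table; rowid is assigned 1..14 automatically.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (f TEXT)")
con.executemany("INSERT INTO t (f) VALUES (?)",
                [(v,) for v in ["aaa", "bbb", "ccc", "ddd",
                                "eee", "fff", "ggg"] * 2])

n, m = 1, 1  # rows before / after each match
rows = con.execute("""
    SELECT t.rowid, t.f
    FROM t
    WHERE t.rowid IN (SELECT t.rowid
                      FROM t
                      JOIN (SELECT rowid - ? AS lo, rowid + ? AS hi
                            FROM t
                            WHERE f = 'ccc') AS ranges
                        ON t.rowid BETWEEN ranges.lo AND ranges.hi)
""", (n, m)).fetchall()
print(rows)
# [(2, 'bbb'), (3, 'ccc'), (4, 'ddd'), (9, 'bbb'), (10, 'ccc'), (11, 'ddd')]
```

Changing n and m widens the window without adding more subqueries, which is the point of the join form.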
I came up with a solution that I think is not optimal, but I am not able to simplify away the temp (intermediate) table.
select rowid,f from t;
rowid f
----- -----
1 aaa
2 bbb
3 ccc
4 ddd
5 eee
6 fff
7 ggg
8 aaa
9 bbb
10 ccc
11 ddd
12 eee
13 fff
14 ggg
create table u as
select t2.rowid x, t1.rowid+2 y from t t1  -- +2 ==> 2 rows after 'ccc'
join t t2 on t1.rowid=t2.rowid+1           -- +1 ==> 1 row before 'ccc'
where t1.f='ccc';
select * from u;
x y
----- -----
2 5
9 12
select t.rowid,t.f from t
inner join u on t.rowid>=u.x and t.rowid<=u.y;
rowid f
----- -----
2 bbb 1 before
3 ccc <== match
4 ddd 2 after
5 eee
9 bbb 1 before
10 ccc <== match
11 ddd 2 after
12 eee
I think I am set with what I need, but optimisations are welcome :)
I might be overlooking something, but the provided solutions that suggest adding/subtracting from values of the rowid column could be improved upon. They will face issues should rowid ever be missing a value (which I'm aware was stated to never be the case in the top post, but in general that is an assumption that's often not true).
By using SQLite's row_number() you can have a solution that circumvents that problem and can also be used to fetch the entries "around" your row matches based on any arbitrary order you want, not just rowid.
Together with Common Table Expressions this can even be made somewhat readable, though should you have a larger number of row matches this will still be a slow query.
What you'll conceptually be doing is:
1. Do a pre-select on your table (here cte_t) to get all possible values that could be a valid hit and attach a row number to each entry.
2. Do a select on that pre-select to fetch the specific rows that you actually want and keep only their row numbers (here targetRows).
3. "Join" the two by pretty much just multiplying the tables generated in steps 1 and 2.
Now you can easily select all entries whose row number is in a specific range around a target's row number using ABS.
WITH
cte_t AS (
    SELECT *, row_number() OVER (ORDER BY t.rowid) AS rownum
    FROM t
    -- If you can make this CTE smaller by removing all entries that can't
    -- possibly be part of the solution with an appropriate WHERE clause,
    -- you can make the entire query substantially faster
),
targetRows AS (
    SELECT rownum AS targetRowNum
    FROM cte_t
    WHERE f = 'ccc' -- The WHERE condition that defines the entries that match
                    -- your query exactly, and around which you want entries
)
SELECT cte_t.rowid, cte_t.f
FROM cte_t, targetRows -- Basically multiplies both tables with one another;
                       -- this part will be horribly slow if targetRows gets larger
WHERE ABS(cte_t.rownum - targetRows.targetRowNum) <= 1; -- All target rows plus those
                       -- whose rownum is 1 larger or 1 lower than a targetRowNum
This will return
rowid f
2 bbb
3 ccc
4 ddd
9 bbb
10 ccc
11 ddd
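To show the claimed advantage over rowid arithmetic, here is a Python sqlite3 sketch (requires SQLite 3.25+ for window functions; the table reconstruction is hypothetical) that deliberately punches a hole in rowid and still gets the right neighbours:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (f TEXT)")
con.executemany("INSERT INTO t (f) VALUES (?)",
                [(v,) for v in ["aaa", "bbb", "ccc", "ddd",
                                "eee", "fff", "ggg"] * 2])
con.execute("DELETE FROM t WHERE rowid = 5")  # create a gap in rowid on purpose

rows = con.execute("""
    WITH cte_t AS (
        SELECT rowid AS rid, f,
               row_number() OVER (ORDER BY rowid) AS rownum
        FROM t
    ),
    targetRows AS (
        SELECT rownum AS targetRowNum FROM cte_t WHERE f = 'ccc'
    )
    SELECT cte_t.rid, cte_t.f
    FROM cte_t, targetRows
    WHERE ABS(cte_t.rownum - targetRows.targetRowNum) <= 1
    ORDER BY cte_t.rid
""").fetchall()
print(rows)
# [(2, 'bbb'), (3, 'ccc'), (4, 'ddd'), (9, 'bbb'), (10, 'ccc'), (11, 'ddd')]
```

A rowid - 1 / rowid + 1 approach would have missed or mis-selected rows around the deleted rowid 5, while row_number() renumbers densely and is unaffected.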
I have a table, say Table A:
uid pid code
1 1 aaa
2 1 ccc
3 4 ddd
4 2 eee
I have another table, Table B:
pid msg
1 good
2 inspiring
3 thing to wtch
4 terrible
Now, I want to replace pid in Table A with msg in Table B.
I used merge(tableA, tableb, by =c("pid"))
I got the result as
uid pid code msg
1 1 aaa good
2 1 ccc good
3 4 ddd terrible
4 2 eee inspiring
whereas I want the result to be:
uid msg code
1 good aaa
2 good ccc
3 terrible ddd
4 inspiring eee
Your approach seems absolutely correct, just needs further steps:
selection of required columns
reordering them
With tidyverse functions, you can do something like:
TableA %>%
left_join(TableB) %>%
select(uid, msg, code)
which gives:
uid msg code
1 1 good aaa
2 2 good ccc
3 3 terrible ddd
4 4 inspiring eee
Base R solution:
newtable = merge(tableA,tableB,by = "pid")
newtable$pid = NULL
newtable = newtable[order(newtable$uid,decreasing=FALSE),c(1,3,2)]
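For comparison, a pandas sketch of the same merge-then-reorder idea (frame contents reconstructed from the question; the `how="left"` choice mirrors left_join and preserves Table A's rows):

```python
import pandas as pd

tableA = pd.DataFrame({"uid": [1, 2, 3, 4],
                       "pid": [1, 1, 4, 2],
                       "code": ["aaa", "ccc", "ddd", "eee"]})
tableB = pd.DataFrame({"pid": [1, 2, 3, 4],
                       "msg": ["good", "inspiring", "thing to wtch", "terrible"]})

out = (tableA.merge(tableB, on="pid", how="left")  # like merge(tableA, tableB, by="pid")
             .loc[:, ["uid", "msg", "code"]]       # drop pid, reorder columns
             .sort_values("uid"))
print(out)
```

As in the R answers, the join itself is the easy part; selecting and reordering columns afterwards is what produces the requested shape.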
I may be missing some elegant ways in Stata to get to this example, which has to do with electrical parts and observed monthly failures etc.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
I would like to group by (bysort) each PartID and record the highest frequency for FailType within each PartID type. Ties can be broken arbitrarily, and preferably, the lower one can be picked.
I looked at groups etc., but do not know how to peel off certain elements from the result set. That is a major question for me: if you execute a query, how do you select only the elements you want for the next computation? Something like n(0) is the count, n(1) is the mean, etc. I was able to use contract, bysort etc. to create a separate data set, which I then merged back into the main set with an extra column. There must be something simple using gen or egen so that there is no need to create an extra data set.
The expected results here will be:
PartID Freq
ABD 4 #(4 occurs twice)
ABC 2 #(tie broken with minimum)
BBB 0 #(0 occurs 3 times)
Please let me know how I can pick off specific elements that I need from a result set (can be from duplicate reports, tab etc.)
Part II - Clarification: Perhaps I should have clarified and split the question into two parts. For example, if I issue this follow-up command after running your code: tabdisp Type, c(Freq), it may print out a nice table. Can I then use that (derived) table to perform more computations programmatically?
For example, get the first row of the table.
Table:
----------------------
     Type |       Freq
----------+-----------
        A |         -1
        B |         -1
        C |         -1
        D |         -3
        S |         -3
----------------------
I found this difficult to follow (see comment on question), but some technique is demonstrated here. The numbers of observations in subsets of observations defined by by: are given by _N. The rest is sorting tricks. Negating the frequency is a way to select the highest frequency and the lowest Type, which I think is what you are after when breaking ties. Negating back gets you the positive frequencies.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
bysort PartID FailType: gen Freq = -_N
bysort PartID (Freq Type) : gen ToShow = _n == 1
replace Freq = -Freq
list PartID Type FailType Freq if ToShow
+---------------------------------+
| PartID Type FailType Freq |
|---------------------------------|
1. | ABC A 2 1 |
3. | ABD A 4 2 |
7. | BBB A 0 3 |
+---------------------------------+
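The same logic can be sketched in pandas, purely for illustration (not Stata): count each (PartID, FailType) pair, then sort so the highest count, with ties broken by the lower FailType, comes first in each PartID group:

```python
import pandas as pd

# Data transcribed from the question's input block.
df = pd.DataFrame({
    "PartID":   ["ABD", "BBB", "ABD", "ABD", "ABC", "BBB", "ABD", "ABC", "BBB", "BBB"],
    "Type":     ["A", "S", "A", "A", "A", "A", "B", "B", "C", "D"],
    "FailType": [4, 0, 3, 4, 2, 0, 1, 7, 1, 0],
})

# Frequency of each FailType within each PartID.
counts = (df.groupby(["PartID", "FailType"]).size()
            .rename("Freq").reset_index())

# Best candidate first: highest Freq, then lowest FailType (the tie-break rule).
best = (counts.sort_values(["PartID", "Freq", "FailType"],
                           ascending=[True, False, True])
              .groupby("PartID").head(1))
print(best)
```

The sort-then-take-first step plays the same role as the bysort trick with negated Freq in the Stata answer.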
I have the data table below and want to replace any usage other than Web or Mobile with another category, say Others. (OR)
Is there any way to group the usage as Web, Mobile and everything else as Others without replacing the values, e.g. Web used 1, Mobile 1 and Others 4? (OR)
Do we need to write a function to do so?
Id Name Usage
1 AAA Web
2 BBB Mobile
3 CCC Manual
4 DDD M1
5 EEE M2
6 FFF M3
Assuming that the 'Usage' is character class, we can use %chin% to create a logical index, negate it (!) and assign (:=) values in 'Usage' to 'Others'. This would be more efficient as we are assigning in place without any copying.
library(data.table)
setDT(df1)[!Usage %chin% c("Web", "Mobile"), Usage := "Others"]
df1
# Id. Name Usage
#1: 1 AAA Web
#2: 2 BBB Mobile
#3: 3 CCC Others
#4: 4 DDD Others
#5: 5 EEE Others
#6: 6 FFF Others
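A pandas sketch of the same in-place recode, for readers outside the data.table ecosystem (frame reconstructed from the question; `~isin(...)` plays the role of `!Usage %chin% ...`):

```python
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 2, 3, 4, 5, 6],
                    "Name": ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"],
                    "Usage": ["Web", "Mobile", "Manual", "M1", "M2", "M3"]})

# Recode everything that is not Web or Mobile to "Others", in place.
df1.loc[~df1["Usage"].isin(["Web", "Mobile"]), "Usage"] = "Others"
print(df1)
```

If you only want the counts without touching the column, `df1["Usage"].isin(["Web", "Mobile"])` can be used to split the tallies instead of assigning.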
I have a record of A, B, C and D.
My SQL1
SELECT * FROM main_table order by main_table.date desc limit 2 returns A and B.
My SQL2
SELECT * FROM main_table left join sub_table using (key) where sub_table._id is not null returns B and C.
I want to have a single SQL statement that returns A, B and C. Basically, I want to combine SQL1 and SQL2.
How could I combine these two queries (in SQLite) optimally?
My data is as below
main_table
_id date key
1 2016-08-04 D
2 2016-10-06 A
3 2016-09-04 B
4 2016-07-05 C
sub_table
_id age key
1 8 B
2 9 C
Desired Output
m._id m.date m.key s._id s.age s.key
2 2016-10-06 A
3 2016-09-04 B 1 8 B
4 2016-07-05 C 2 9 C
My logic of selection: I want to pick the two latest rows, and also any row that has an age. I don't care about the others (i.e. D is neither in the top two nor has an age).
If I read correctly, then a UNION might be what you have in mind:
SELECT * FROM
(SELECT * FROM main_table LEFT JOIN sub_table USING (key)
ORDER BY date DESC LIMIT 2)
UNION
SELECT * FROM main_table LEFT JOIN sub_table USING (key)
WHERE sub_table._id IS NOT NULL
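A quick Python sqlite3 sketch of this UNION against the question's data (tables rebuilt in memory; a final ORDER BY date DESC is added only to make the output order deterministic):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE main_table (_id INTEGER, date TEXT, key TEXT);
    CREATE TABLE sub_table  (_id INTEGER, age INTEGER, key TEXT);
    INSERT INTO main_table VALUES (1,'2016-08-04','D'), (2,'2016-10-06','A'),
                                  (3,'2016-09-04','B'), (4,'2016-07-05','C');
    INSERT INTO sub_table  VALUES (1, 8, 'B'), (2, 9, 'C');
""")

rows = con.execute("""
    SELECT * FROM
      (SELECT * FROM main_table LEFT JOIN sub_table USING (key)
       ORDER BY date DESC LIMIT 2)
    UNION
    SELECT * FROM main_table LEFT JOIN sub_table USING (key)
    WHERE sub_table._id IS NOT NULL
    ORDER BY date DESC
""").fetchall()
for r in rows:
    print(r)
```

UNION (as opposed to UNION ALL) deduplicates row B, which appears in both branches, so A, B and C each come back exactly once.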
I got a table like this
a b c
-- -- --
1 1 10
2 1 0
3 1 0
4 4 20
5 4 0
6 4 0
The b column 'points' to a, a bit as if a were the parent.
c was computed. Now I need to propagate each parent's c value to its children.
The result would be
a b c
-- -- --
1 1 10
2 1 10
3 1 10
4 4 20
5 4 20
6 4 20
I can't make an UPDATE/SELECT combination that works.
So far I have a SELECT that produces the c column I'd like to get:
select t1.c from t t1 join t t2 on t1.a=t2.b;
c
----------
10
10
10
20
20
20
But I don't know how to store that back into c.
Thanks in advance.
Cheers, phi
You have to look up the value with a correlated subquery:
UPDATE t
SET c = (SELECT c
FROM t AS parent
WHERE parent.a = t.b)
WHERE c = 0;
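Here is a Python sqlite3 sketch of this correlated-subquery UPDATE on the question's data (table rebuilt in memory for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER);
    INSERT INTO t VALUES (1,1,10), (2,1,0), (3,1,0),
                         (4,4,20), (5,4,0), (6,4,0);
""")

# For every child row (c = 0), look up the parent row (parent.a = t.b)
# and copy its c value down.
con.execute("""
    UPDATE t
    SET c = (SELECT c FROM t AS parent WHERE parent.a = t.b)
    WHERE c = 0
""")
result = con.execute("SELECT a, b, c FROM t ORDER BY a").fetchall()
print(result)
# [(1, 1, 10), (2, 1, 10), (3, 1, 10), (4, 4, 20), (5, 4, 20), (6, 4, 20)]
```

The `WHERE c = 0` guard is what keeps the parents untouched, so the subquery always reads the original parent values rather than rows modified mid-update.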
I finally found a way to copy my initial 'temp' SELECT/JOIN back into table 't'. Something like this:
create temp table u as select t1.c from t t1 join t t2 on t1.a=t2.b;
update t set c=(select * from u where rowid=t.rowid);
I'd like to know how the two solutions compare performance-wise: yours, a single UPDATE with a correlated SELECT, versus mine, two queries each with a correlated lookup of its own. Mine seems heavier and less aesthetic, but performance-wise I wonder.
On the algorithm side, yours takes care not to copy the parent data, only the child data; mine copies the parent onto itself, which is a no-op, yet consumes some cycles :)
Cheers, Phi