I have a table like this:
I want to group by these 3 columns (col1, col2, col3) with a specific sequence, to get something like this:
Is there any kind of function to do this?
The analytic function DENSE_RANK() would work well here.
SELECT col1, col2, col3,
       DENSE_RANK() OVER (ORDER BY col1, col2, col3) AS mygroup
FROM reftable
RANK() vs DENSE_RANK(): DENSE_RANK keeps the numbers sequential with no gaps, whereas RANK introduces gaps. In your example, (3, B, Y) appears twice and both rows may be assigned 3; the next value would then be assigned 5 with RANK but 4 with DENSE_RANK.
Link to Docs
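A minimal sketch of the difference, assuming the same REFTABLE as in the query above, so the two numberings can be compared side by side:
-- Hedged sketch: RANK() leaves gaps after ties, DENSE_RANK() does not.
SELECT col1, col2, col3,
       RANK()       OVER (ORDER BY col1, col2, col3) AS rnk,
       DENSE_RANK() OVER (ORDER BY col1, col2, col3) AS dense_rnk
FROM reftable;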
Note that the DENSE_RANK values are assigned each time the query runs, so the numbers can change as the data changes. If you want the numbers to stay unique and stable, you would need to store them and then force DENSE_RANK to start from a seed equal to the highest number in the stored results.
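A hedged sketch of that seeding idea; the table stored_groups and its mygroup column are hypothetical stand-ins for wherever the previously assigned numbers are persisted:
-- Offset the freshly computed ranks by the highest number already stored,
-- so the numbering continues instead of restarting at 1.
SELECT col1, col2, col3,
       (SELECT NVL(MAX(mygroup), 0) FROM stored_groups)
         + DENSE_RANK() OVER (ORDER BY col1, col2, col3) AS mygroup
FROM reftable;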
Adding ROWNUM will add a sequence number per row:
select col1, col2, col3, rownum as mygroup from reftable a;
I use a formula like this to get stock data:
=IMPORTDATA("https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY_EXTENDED&symbol=IBM&interval=15min&slice=year1month1&apikey=demo")
Basically my goal is to filter the data by date and time with a query formula, so the table with the raw data does not show up.
I want the data from Column 1, Column 3 and Column 6, filtered by date (in this case 7/12/2022 from cell J1) and time (between 4:15:00 and 9:30:00).
I tried this formula
=QUERY(IMPORTDATA("https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY_EXTENDED&symbol=IBM&interval=15min&slice=year1month1&apikey=demo"),"select Col1, Col3, Col6 WHERE Col1 > datetime '"&TEXT(J1+time(4,15,0),"yyyy-mm-dd HH:mm:ss")&"' and Col1 <= datetime '"&TEXT(J1+time(9,30,0),"yyyy-mm-dd HH:mm:ss")&"'")
but the only result I can get is the headers.
Here is a link to the Sheet
Answer
The following formula should produce the result you desire:
=QUERY(IMPORTDATA("https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY_EXTENDED&symbol=IBM&interval=15min&slice=year1month1&apikey=demo"),"SELECT Col1, Col3, Col6 WHERE Col1 > "&J1+TIME(4,15,0)&" AND Col1 <= "&J1+TIME(9,30,0))
For easier reading, here is the second argument of =QUERY isolated.
"SELECT Col1, Col3, Col6 WHERE Col1 > "&J1+TIME(4,15,0)&" AND Col1 <= "&J1+TIME(9,30,0)
Explanation
Behind the scenes, Google Sheets stores every date and time value as a plain serial number: whole days since December 30, 1899, with the time of day as the fractional part, so 7/12/2022 4:15:00 is roughly 44754.18. A simple numeric comparison can therefore be used in the query to decide whether one datetime is greater than another, skipping the detour through the =TEXT function and the datetime literal. The =QUERY above simply compares each value in Col1 to the sum of J1 and your provided time value.
Functions used:
=QUERY
=IMPORTDATA
=TIME
I have two columns
tab1
col1 col2
a 1
a 2
b 3
b 6
c 4
d 5
I need distinct values from both columns at once like:
result
col1 col2
a 1
b 3
c 4
d 5
select distinct col1, col2 from tab1 -- is not giving such results
Is that possible in sqlite?
From the sample data you posted I see that the easiest way is to group by col1 and get the minimum of col2:
select col1, min(col2)
from tab1
group by col1
This gives your expected result (which is not really made of distinct rows, but something else).
What you want is not formally clear; the query below selects the minimum value of col2 for each value of col1.
select col1, min(col2) from tab1 group by col1
If you want to select the first value of each group, be aware that the order of rows is not defined in SQL unless you, for example, add a numeric id column and sort by it.
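For example, here is a hedged sketch of selecting the first row per col1 once such an id column exists (the id column is hypothetical, and window functions require SQLite 3.25 or newer):
-- Pick, for each col1, the row with the smallest id.
SELECT col1, col2
FROM (
    SELECT col1, col2,
           ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY id) AS rn
    FROM tab1
)
WHERE rn = 1;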
NOTE: forpas answered before me, so please mark his answer as accepted in case you consider our answers the same.
I have a dataset that contains the financial information of various companies, such as revenue and profit for the year. The companies are given a unique ID. I would like to subset the dataset to include all the duplicated company IDs and their related financial information.
I have tried the function duplicated(), but it either hangs or throws an error, as the dataframe has over 200 million records. Could anyone help me with this? Thank you; I am still unfamiliar with R.
I have tried the following:
Duplicate <- DF[duplicated(DF$ID),]
where DF is the name of the dataframe and ID is the company ID. But the code could not run and I am stuck.
ID Revenue Profit
1 50 10
2 20 10
2 20 10
2 20 10
3 10 0
4 20 5
4 20 5
I want a dataframe that includes all the rows for companies 2 and 4, since their IDs are duplicated.
The function duplicated(DF$ID) returns a logical vector of the same length as the number of rows in DF, indicating whether the value at each position has been seen before. So for the following vector of IDs, duplicated will return
1 2 2 2 2 3 4 4
F F T T T F F T
and hence your code line returns the subset of rows whose ID is a duplicate, excluding the first occurrence of each ID.
To me it is unclear whether you need just a reduced list of which IDs appear multiple times, the entire rows of the duplicated records (including or excluding the first occurrence), or whether you want to look at duplicate records rather than just duplicate IDs.
To get which IDs appear multiple times, count them:
count <- table(DF$ID)
names(count[count > 1])
Note: names() returns a character vector.
To get the records where the IDs appear multiple times, we can:
# Wrapped in as.integer, as I suspect your ID column is an integer vector.
Duplicate <- DF[DF$ID %in% as.integer(names(count[count > 1])), ]
or, try with dplyr:
library(dplyr)
Duplicate <- DF %>%
  group_by(ID) %>%
  add_count() %>%
  filter(n > 1)
This might be quicker if you have 200 million rows.
Update:
To get only the first(*) occurrence of each ID, simply
DF[!duplicated(DF$ID),]
*first occurrence depends entirely on the ordering of the data.frame.
But do note that you should be entirely sure the records really are duplicates, so you will definitely want to check whether rows sharing an ID differ in any other column.
If there is a subset of columns you regard as sufficient to define a duplicate, try dplyr::distinct:
library(dplyr)
DF %>% distinct(ID, Revenue, .keep_all = TRUE)
If you do not include .keep_all = TRUE, only the named columns will be returned.
You have 200 million rows
At this point I would say you've fallen outside the scope of R - at least using R naively.
If you have access to a database server (MySQL, MS SQL, PostgreSQL, etc.), you should definitely leverage it! I assume you are just starting on the data cleaning, and these database engines have real strengths for working with data of this size.
If you do not, you can offload the data into an SQLite database file with the RSQLite package. It lets you run many of the same operations in SQL. For your case, the set of rows with duplicated IDs can be retrieved with:
WITH cte AS (
SELECT ID
FROM DF
GROUP BY ID
HAVING count(*) > 1
)
SELECT *
FROM DF
INNER JOIN cte USING (ID);
This assumes your data frame has been loaded into a table DF and that an appropriate index exists.
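As a hedged sketch (the index name is arbitrary), such an index could be created with:
-- An index on ID lets the GROUP BY ... HAVING and the join avoid full table scans.
CREATE INDEX IF NOT EXISTS idx_df_id ON DF (ID);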
To get DISTINCT rows you do not have the same luxury as dplyr::distinct, but you can do:
SELECT DISTINCT *
FROM DF;
If you however want to retrieve whole rows that are distinct on a subset of columns, a window function could be the way (unless you have a distinct row ID, which SQLite implicitly always has):
SELECT *,
ROW_NUMBER() OVER (PARTITION BY ID, column2, ...) AS rownumber
FROM DF
WHERE rownumber = 1;
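SQLite will not accept that WHERE clause, because the window alias cannot be referenced at the same query level. A hedged sketch of the wrapped-subquery form (column2 standing in for whichever extra columns define the duplicate) would be:
-- Keep one row from each (ID, column2) partition; with no ORDER BY in the
-- window, which row survives is arbitrary.
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ID, column2) AS rownumber
    FROM DF
)
WHERE rownumber = 1;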
Alternatively, using SQLite's implicit rowid, you could do:
WITH cte AS (
SELECT min(_rowid_) AS rowid
FROM DF
GROUP BY ID, column2, ...
)
SELECT DF.*
FROM DF
INNER JOIN cte USING (rowid);
I have written an SQL query which amalgamates data from two separate tables with the following query:
SELECT * FROM table1
UNION ALL
SELECT * FROM table2
ORDER BY column1
What I'd like to be able to do is add a column, a kind of 'stamp', recording which table each row originally came from. So my output would have a column showing the source table of every row.
Essentially, the tables I have contain large quantities of numeric data and are hard to tell apart after the UNION.
Thanks for any help.
Regards,
CJW.
You can select a literal value in each of your selects, but then you need to specify the columns instead of *:
SELECT col1, col2, 'TABLE1' AS source_table FROM table1
UNION ALL
SELECT col1, col2, 'TABLE2' FROM table2
ORDER BY column1
You can simply add any expression(s) anywhere in the SELECT clause:
SELECT *, 1 AS SourceTable FROM Table1
UNION ALL
SELECT *, 2 AS SourceTable FROM Table2
ORDER BY Column1;
I am trying to do multiple inserts into an Oracle table with data from another table, and I also use a sequence. Something like this: http://www.dbforums.com/oracle/1626242-insert-into-table-sequence-value-other-table.html
Now, in the destination table there is a primary key on the column being populated by the sequence, and it is giving me a primary key violation. My guess is that sequence.nextval is not working for some reason. Where is the error? This is my actual query:
insert into xxxx (col1, col2, col3, col4, col5)
select SEQ_CNT.nextval, inner_view.*
from (select col1, 26, 0, 'N'
FROM yyyy WHERE col_ID = 30 AND DELETED = 'N' ) inner_view;
It seems unlikely to me that the problem is that calling nextval on the sequence is not working. It is much more likely that some other process has inserted data into the table with primary key values greater than the values currently being returned by the sequence. If you run
SELECT seq_cnt.nextval
FROM dual
and compare that to the largest value of the primary key in the table
SELECT max(col1)
FROM xxxx
my wager is that the maximum value is greater than the nextval from the sequence. If that's the case, you would generally want to reset the sequence to the current maximum value, as well as figure out how the problematic data got inserted so the problem doesn't happen again in the future.
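A hedged sketch of one common way to reset it (assuming nothing else draws numbers from SEQ_CNT in the meantime; the 4900 is purely illustrative and should be the difference between MAX(col1) and the sequence's current value):
ALTER SEQUENCE seq_cnt INCREMENT BY 4900;  -- jump the sequence forward by the gap
SELECT seq_cnt.nextval FROM dual;          -- consume one value to apply the jump
ALTER SEQUENCE seq_cnt INCREMENT BY 1;     -- restore the normal increment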
The outer query will not loop, hence the sequence will not increment. Try the solution below:
insert into xxxx (col1, col2, col3, col4, col5)
select inner_view.*
from (select SEQ_CNT.nextval, col1, 26, 0, 'N'
FROM yyyy WHERE col_ID = 30 AND DELETED = 'N' ) inner_view;