Select distinct values from two columns in SQLITE - sqlite

I have two columns
tab1
col1 col2
a 1
a 2
b 3
b 6
c 4
d 5
I need distinct values from both columns at once like:
result
col1 col2
a 1
b 3
c 4
d 5
select distinct col1, col2 from tab1 does not give that result.
Is that possible in sqlite?

From the sample data you posted I see that the easiest way is to group by col1 and get the minimum of col2:
select col1, min(col2)
from tab1
group by col1
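A quick way to verify this query (a minimal sketch using Python's built-in sqlite3 module; the table and sample data mirror the question):

```python
import sqlite3

# In-memory database with the sample data from the question
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab1 (col1 TEXT, col2 INTEGER)")
con.executemany("INSERT INTO tab1 VALUES (?, ?)",
                [("a", 1), ("a", 2), ("b", 3), ("b", 6), ("c", 4), ("d", 5)])

# One row per col1 value, with the minimum col2 for that value
rows = con.execute(
    "SELECT col1, MIN(col2) FROM tab1 GROUP BY col1 ORDER BY col1"
).fetchall()
print(rows)  # [('a', 1), ('b', 3), ('c', 4), ('d', 5)]
```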

Your expected result is not a set of distinct rows but something else. What you want is not formally clear; the query below selects the minimum value of col2 for each value of col1:
select col1, min(col2) from tab1 group by col1
If you instead want the first value of each group, note that row order is not defined in SQL unless, e.g., you add a numeric id and sort by it.
NOTE: forpas answered before me, so please mark his answer as accepted if you consider our answers the same.

Related

Distinct combination of two columns with ordering from a third column? [duplicate]

Let's say I have the following table:
col1 col2 diff
ABC XYZ 1.2
FOO BAR 5.0
FOO BAR 0.0
ABC XYZ 1.3
Now I want to get unique combinations of col1 and col2, but for each combination I only want the one that has the lowest diff. So the result I'm looking for is this:
col1 col2 diff
ABC XYZ 1.2
FOO BAR 0.0
I could do either DISTINCT col1, col2 or GROUP BY col1, col2, but how do I make sure I get the combination that has the lowest diff?
Try aggregating by col1 and col2 and select the minimum value of diff:
SELECT col1, col2, MIN(diff) AS diff
FROM yourTable
GROUP BY col1, col2;
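The aggregation can be checked end to end (a minimal sketch via Python's built-in sqlite3 module, using the sample rows from the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE yourTable (col1 TEXT, col2 TEXT, diff REAL)")
con.executemany("INSERT INTO yourTable VALUES (?, ?, ?)",
                [("ABC", "XYZ", 1.2), ("FOO", "BAR", 5.0),
                 ("FOO", "BAR", 0.0), ("ABC", "XYZ", 1.3)])

# One row per (col1, col2) combination, keeping the lowest diff
rows = con.execute(
    "SELECT col1, col2, MIN(diff) AS diff "
    "FROM yourTable GROUP BY col1, col2 ORDER BY col1"
).fetchall()
print(rows)  # [('ABC', 'XYZ', 1.2), ('FOO', 'BAR', 0.0)]
```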

Remove duplicated row in column and keep last row in R [duplicate]

Hello I have a df such as
COL1 COL2 COL3 COL4
NA NA Sp_canis_lupus 10
3 8 Sp_canis_lupus 10
3 8 Sp_canis_lupus 10
How can I remove the duplicate rows in COL3 and keep the last row?
Here I should get :
COL1 COL2 COL3 COL4
3 8 Sp_canis_lupus 10
Thank you very much for your help
You could also solve this with aggregate, like below:
aggregate(. ~ COL3, data = df, FUN = tail, 1)
Or another way in dplyr:
library(dplyr)
df %>%
group_by(COL3) %>%
slice(n())
This of course assumes that you're only after duplicates in COL3 - otherwise you'll need to rephrase the problem (as the example doesn't seem to be particularly complex).
Using dplyr:
df %>%
group_by(COL3) %>%
filter(row_number() == n())
Use duplicated to find duplicates, and then select those that are not duplicated, i.e. x[!duplicated(x), ]. To keep the last occurrence instead of the first, use duplicated(x, fromLast = TRUE). You may need to make the statement a bit more elaborate given that you have NAs in there.
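The keep-the-last-row-per-group idea used by all the answers above can be sketched language-neutrally; here it is in plain Python (illustrative only, not R; the column names mirror the example df, with None standing in for NA):

```python
# Sample rows mirroring the df in the question
rows = [
    {"COL1": None, "COL2": None, "COL3": "Sp_canis_lupus", "COL4": 10},
    {"COL1": 3, "COL2": 8, "COL3": "Sp_canis_lupus", "COL4": 10},
    {"COL1": 3, "COL2": 8, "COL3": "Sp_canis_lupus", "COL4": 10},
]

# Keep the last row seen for each COL3 value: a dict keyed on COL3
# retains only the most recent assignment per key.
last_per_group = {}
for row in rows:
    last_per_group[row["COL3"]] = row

result = list(last_per_group.values())
print(result)  # [{'COL1': 3, 'COL2': 8, 'COL3': 'Sp_canis_lupus', 'COL4': 10}]
```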

Subsetting the duplicate company identifiers in the company level data

I have a dataset that contains the financial information of various companies such as revenue, and profits in the year. The companies are given a unique ID. I would like to subset the dataset to include all the duplicated company ID and their related financial information.
I have tried the function duplicated(), but it either hangs or gives an error, as the dataframe has over 200 million records. Could anyone help me with this? Thank you, as I am still unfamiliar with R.
I have tried the following:
Duplicate <- DF[duplicated(DF$ID),]
where DF is the name of the dataframe and ID is the company ID. But the code could not run and I am stuck.
ID Revenue Profit
1 50 10
2 20 10
2 20 10
2 20 10
3 10 0
4 20 5
4 20 5
I want a dataframe that includes all the rows for companies 2 and 4, since those IDs are duplicates.
The function duplicated(DF$ID) returns a logical vector of the same length as the number of rows in DF, indicating whether the value at each position has been seen before. For the following sequence of IDs, duplicated will return:
1 2 2 2 2 3 4 4
F F T T T F F T
and hence your code line returns the subset of rows where the ID is a duplicate, excluding its first occurrence.
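The first-seen semantics of duplicated() can be mimicked in plain Python (illustrative only, not R), which makes the F/T pattern above easy to reproduce:

```python
ids = [1, 2, 2, 2, 2, 3, 4, 4]

# Mimic R's duplicated(): flag every value that was already seen earlier
seen = set()
dup_flags = []
for i in ids:
    dup_flags.append(i in seen)
    seen.add(i)

print(dup_flags)  # [False, False, True, True, True, False, False, True]
```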
For me it is unclear whether you need just a reduced list of which IDs appear multiple times, or the entire rows of duplicate records (including/excluding the first record), or whether you are looking at duplicate records or just duplicate IDs.
To get which IDs appear multiple times, count them:
count <- table(DF$ID)
names(count[count > 1])
Note: names() returns a character vector.
To get the records where the IDs appear multiple times, we can:
Duplicate <- DF[DF$ID %in% as.integer(names(count[count > 1])), ] # wrapped in as.integer, as I suspect your ID column is an integer vector.
or, try with dplyr:
library(dplyr)
Duplicate <- DF %>% group_by(ID) %>%
add_count() %>%
filter(n > 1)
This might be quicker given your 200 million rows.
Update:
To get only the first(*) occurrence of each ID, simply
DF[!duplicated(DF$ID),]
*first occurrence depends entirely on the ordering of the data.frame.
But do note that you must be entirely sure the entire records are actually duplicates, so you should definitely check whether duplicate records differ in any way.
If you have multiple columns that you regard sufficient to be duplicated, try with dplyr::distinct:
library(dplyr)
DF %>% distinct(ID, Revenue, .keep_all = TRUE)
If you do not include .keep_all = TRUE, only the named columns will be returned.
You have 200 million rows
At this point I would say you've fallen outside the scope of R, at least when using R naively.
If you have access to a database server (MySQL, MS SQL Server, PostgreSQL, etc.), you should definitely leverage it! I assume you are just starting on the data cleaning, and these database engines are built for working with data of this size.
If you do not, you can offload the data into an SQLite database file with the RSQLite package. It allows you to run many of the same operations with SQL. For your case, we can retrieve the set of rows with duplicated IDs with:
WITH cte AS (
SELECT ID
FROM DF
GROUP BY ID
HAVING count(*) > 1
)
SELECT *
FROM DF
INNER JOIN cte USING (ID);
This works once you have loaded your data frame into the table DF and created an appropriate index.
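The CTE above can be exercised directly (a minimal sketch via Python's built-in sqlite3 module; the table DF and the sample rows come from the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DF (ID INTEGER, Revenue INTEGER, Profit INTEGER)")
con.executemany("INSERT INTO DF VALUES (?, ?, ?)",
                [(1, 50, 10), (2, 20, 10), (2, 20, 10), (2, 20, 10),
                 (3, 10, 0), (4, 20, 5), (4, 20, 5)])
con.execute("CREATE INDEX idx_df_id ON DF (ID)")  # index on the grouping column

# All rows whose ID appears more than once
rows = con.execute("""
    WITH cte AS (
        SELECT ID
        FROM DF
        GROUP BY ID
        HAVING count(*) > 1
    )
    SELECT *
    FROM DF
    INNER JOIN cte USING (ID)
""").fetchall()
print(sorted(rows))  # only the rows for IDs 2 and 4
```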
To get DISTINCT rows you do not have the same luxury as dplyr::distinct, but you can do:
SELECT DISTINCT *
FROM DF;
If you however want to retrieve all columns while being distinct only on a subset of columns, a window function could be the way (unless you have a distinct row ID, which SQLite implicitly always has). SQLite does not allow filtering on a window function's alias directly, so it has to be wrapped in a subquery or view:
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY ID, column2, ...) AS rownumber
FROM DF
)
WHERE rownumber = 1;
Alternatively, with SQLite's implicit rowid you can do:
WITH cte AS (
SELECT min(rowid) AS rowid
FROM DF
GROUP BY ID, column2, ...
)
SELECT DF.*
FROM DF
INNER JOIN cte ON DF.rowid = cte.rowid;
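The rowid trick can be demonstrated too (a sketch via Python's built-in sqlite3 module; here I group by the concrete columns ID and Revenue instead of the elided "column2, ..."):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DF (ID INTEGER, Revenue INTEGER, Profit INTEGER)")
con.executemany("INSERT INTO DF VALUES (?, ?, ?)",
                [(1, 50, 10), (2, 20, 10), (2, 20, 10),
                 (3, 10, 0), (4, 20, 5), (4, 20, 5)])

# Keep only the first physical row (lowest implicit rowid) per (ID, Revenue)
rows = con.execute("""
    WITH cte AS (
        SELECT min(rowid) AS rowid
        FROM DF
        GROUP BY ID, Revenue
    )
    SELECT DF.*
    FROM DF
    INNER JOIN cte ON DF.rowid = cte.rowid
""").fetchall()
print(sorted(rows))  # [(1, 50, 10), (2, 20, 10), (3, 10, 0), (4, 20, 5)]
```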

PL/SQL Group By Sequence

I have a table like this:
I want to group by these 3 columns (col1, col2, col3) with a specific sequence, to get something like this:
Is there any kind of function to do this?
Analytical function Dense_Rank() would work well here.
SELECT col1, col2, col3,
DENSE_RANK() OVER (ORDER BY col1, col2, col3) AS mygroup
FROM REFTABLE
Rank() vs Dense_Rank(): dense_rank keeps the numbers sequential without gaps, whereas rank introduces gaps. In your example, 3, B, Y appearing twice may be assigned rank 3 both times; the next value would then be assigned 5 if rank() is used but 4 if dense_rank() is used.
Link to Docs
Note that the dense_rank values are assigned each time the query runs, so the numbers can change as the data changes. If you want the numbers to be stable, you would need to store them, and then force Dense_Rank to start at a seed equal to the highest number in the stored results.
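Although the question is about Oracle, the same DENSE_RANK window function exists in SQLite (3.25+), so the behavior can be sketched with Python's built-in sqlite3 module (the sample rows are hypothetical, chosen so one combination repeats):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE REFTABLE (col1 INTEGER, col2 TEXT, col3 TEXT)")
con.executemany("INSERT INTO REFTABLE VALUES (?, ?, ?)",
                [(1, "A", "X"), (1, "A", "X"), (2, "B", "Y"), (3, "B", "Y")])

# Identical (col1, col2, col3) combinations get the same group number,
# and the numbers stay sequential with no gaps
rows = con.execute("""
    SELECT col1, col2, col3,
           DENSE_RANK() OVER (ORDER BY col1, col2, col3) AS mygroup
    FROM REFTABLE
    ORDER BY col1, col2, col3
""").fetchall()
print(rows)  # [(1, 'A', 'X', 1), (1, 'A', 'X', 1), (2, 'B', 'Y', 2), (3, 'B', 'Y', 3)]
```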
Adding rownum will add a sequence number per row (note that this numbers every row, rather than each distinct group):
select col1, col2, col3, rownum as mygroup from table a;

Filtering Dataset based on similar column values

I have data in a Dataset with 5 columns (ID, col1, col2, col3, col4). I have to filter rows and separate them from the main Dataset based on similar column values. The ID column is unique. I have to check the data for the column values in col1, col2, col3 and col4. For example, if I have 10 records where 5 of them have the same column values, 3 of them share column values different from the previous 5, and 2 rows have completely different values, I should end up with 4 different datatables with 5, 3, 1 and 1 rows respectively. These datatables can be dynamic depending on the data.
Please suggest me the best possible solution.
Dim myDataView As New DataView(yourDataTable) ' a DataView wraps a DataTable, e.g. yourDataSet.Tables(0)
myDataView.RowFilter = "your filter expression"
Try this.
