I have created a subset of the pg_table_def table with table_name, col_name and data_type. I have also added a column active with 'Y' as the value for some of the rows. Let us call this table config. Table config looks like below:
table_name column_name
interaction_summary name_id
tag_transaction name_id
interaction_summary direct_preference
bulk_sent email_image_click
crm_dm web_le_click
Now I want to be able to map the table names from this config table to the actual tables and fetch values for the corresponding columns. name_id will be the key here, which is available in all tables. My output should look like below:
name_id direct_preference email_image_click web_le_click
1 Y 1 2
2 N 1 2
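For a fixed set of tables, the static equivalent would just be a join on name_id, roughly like this sketch over the sample tables (the column picks are illustrative):
select i.name_id, i.direct_preference, b.email_image_click, c.web_le_click
from interaction_summary i
left join bulk_sent b on b.name_id = i.name_id
left join crm_dm c on c.name_id = i.name_id;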
The solution needs to be dynamic, so that if the table list grows tomorrow, the new tables are accommodated automatically. Since I am new to Redshift, any help is appreciated. I am also considering doing the same via R using the dplyr package.
I understood that dynamic queries don't work with Redshift.
My objective was to pull any new table that comes in and use their columns for regression analysis in R.
I made this work by using the LISTAGG function and a concatenation, and then wrote the output to a dataframe in R. This dataframe holds 'n' select queries, one per row.
Below is the format:
df <- as.data.frame(tbl(conn, sql("
  select 'select ' || col_names || ' from ' || table_name as q1
  from (
    select distinct table_name,
           listagg(col_name, ',') within group (order by col_name)
             over (partition by table_name) as col_names
    from attribute_config
    where active = 'Y'
    order by table_name
  )
  group by 1")))
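For illustration, if every row of the sample config above had active = 'Y', df would hold one generated query per table, along these lines:
select email_image_click,name_id from bulk_sent
select name_id,web_le_click from crm_dm
select direct_preference,name_id from interaction_summary
select name_id from tag_transaction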
Once done, I assigned every row of this dataframe to a new dataframe and fetched the output using below:
df1 <- tbl(conn,sql(df[1,]))
I know this is a roundabout solution, but it works!! It fetches about 17M records in under 1 second.
I have a dataset that contains the financial information of various companies, such as revenue and profit in the year. The companies are given a unique ID. I would like to subset the dataset to include all the duplicated company IDs and their related financial information.
I have tried the function duplicated(), but it either hangs or throws an error, as the dataframe has over 200 million records. Could anyone help me with this? Thank you, as I am still unfamiliar with R.
I have tried the following:
Duplicate <- DF[duplicated(DF$ID),]
where DF is the name of the dataframe and ID is the company ID. But the code could not run and I am stuck.
ID Revenue Profit
1 50 10
2 20 10
2 20 10
2 20 10
3 10 0
4 20 5
4 20 5
I want a dataframe that includes all the rows for company IDs 2 and 4, since they are duplicates.
The function duplicated(DF$ID) returns a logical vector of the same length as the number of rows in DF, indicating whether the value at each position has been seen before. So for the following vector of IDs, duplicated will return:
1 2 2 2 2 3 4 4
F F T T T F F T
and hence, your code line returns the subset of rows where the ID is a duplicate, but not the first instance of each ID.
It is unclear to me whether you need just a reduced list of which IDs appear multiple times, or the entire rows of the duplicated records (including or excluding the first occurrence), and whether you are looking at duplicated records or just duplicated IDs.
To get which IDs appear multiple times, count them:
count <- table(DF$ID)
names(count[count > 1])
Note: names() returns a character vector.
To get the records where the IDs appear multiple times, we can:
Duplicate <- DF[DF$ID %in% as.integer(names(count[count > 1])), ]
# wrapped in as.integer, as I suspect your ID column is an integer vector
or, try with dplyr:
library(dplyr)
Duplicate <- DF %>%
  group_by(ID) %>%
  add_count() %>%
  filter(n > 1)
This might be quicker given your 200 million rows.
Update:
To get only the first(*) occurrence of each ID, simply
DF[!duplicated(DF$ID),]
*first occurrence depends entirely on the ordering of the data.frame.
But do note that you must be entirely sure the entire records are actually duplicates, so you will definitely want to check whether records sharing an ID differ in any other column.
If there is a set of columns that you regard as sufficient to define a duplicate, try dplyr::distinct:
library(dplyr)
DF %>% distinct(ID, Revenue, .keep_all = TRUE)
If you do not include .keep_all = TRUE, only the named columns will be returned.
You have 200 million rows
At this point I would say you've fallen outside the scope of R - at least using R naively.
If you have access to a database server (MySQL, MS SQL Server, PostgreSQL, etc.), you should definitely leverage that! I assume you are just starting on the data cleaning, and these database engines have real strengths for working with data of this size.
If you do not, you can offload the data into an SQLite database file with the RSQLite package. It allows you to run many of the same operations with SQL. For your case, we can retrieve the set of rows with duplicated IDs with:
WITH cte AS (
SELECT ID
FROM DF
GROUP BY ID
HAVING count(*) > 1
)
SELECT *
FROM DF
INNER JOIN cte USING (ID);
This assumes you have loaded your data frame into the table DF and created an appropriate index.
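For example, a simple index on the join/grouping column (the name idx_df_id is purely illustrative):
CREATE INDEX idx_df_id ON DF (ID);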
To get DISTINCT rows you do not have the same luxury as dplyr::distinct, but you can do:
SELECT DISTINCT *
FROM DF;
If you however want to retrieve all columns of the rows that are distinct on a subset of columns, a window function could be the way (unless you have a distinct row ID -- which SQLite implicitly always has). A window function's alias cannot be referenced in a WHERE clause, so it has to be wrapped in a subquery or a view:
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ID, column2, ...) AS rownumber
    FROM DF
)
WHERE rownumber = 1;
Note that window functions require SQLite 3.25 or later. Alternatively, with SQLite's implicit rowid you could do:
WITH cte AS (
SELECT min(rowid) AS rowid
FROM DF
GROUP BY ID, column2, ...
)
SELECT DF.*
FROM DF
INNER JOIN cte ON DF.rowid = cte.rowid;
So I've been looking at this for the past week and learning. I'm used to SQL Server, not SQLite. I understand RowId now, and that if I have an "id" column of my own (for convenience), it will actually be an alias for RowId. I've done running totals in SQL Server using ROW_NUMBER, but that doesn't seem to be an option with SQLite. The most useful post was...
How do I calculate a running SUM on a SQLite query?
My issue is that it works as long as I keep adding data at the "bottom" of the table. I put "bottom" in quotes because my display of the data is always sorted on some other column, such as a month. In other words, if I insert a new record for a missing month, it gets inserted with a higher "id" (aka RowId), and my running total below that month now needs to reflect the new data for all subsequent months. This means I cannot order by "id".
With SQL Server, ROW_NUMBER took care of my sequencing: in the select where I use a.id > running.id, I would have used a.rownum > running.rownum.
Here's my table
CREATE TABLE `Test` (
`id` INTEGER,
`month` INTEGER,
`year` INTEGER,
`value` INTEGER,
PRIMARY KEY(`id`)
);
Here's my query
WITH RECURSIVE running (id, month, year, value, rt) AS
(
SELECT id, month, year, value, value
FROM Test AS row1
WHERE row1.id = (SELECT a.id FROM Test AS a ORDER BY a.id LIMIT 1)
UNION ALL
SELECT rowN.id, rowN.month, rowN.year, rowN.value, (rowN.value + running.rt)
FROM Test AS rowN
INNER JOIN running ON rowN.id = (
SELECT a.id FROM Test AS a WHERE a.id > running.id ORDER BY a.id LIMIT 1
)
)
SELECT * FROM running
I can order my CTE by year, month, id, similar to what is suggested in the original example I linked above. However, unless I am mistaken, that example solution relies on the records in the table already being ordered by year, month, id. If I am right, then inserting an earlier "month" will break it, because that row's "id" will have the largest value of all the RowIds.
Appreciate if someone can set me straight.
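One possible direction, sketched under two assumptions: that (year, month, id) defines the intended order, and that your SQLite supports row values (3.15+). Make the recursion walk that ordering instead of the id ordering:
WITH RECURSIVE running (id, month, year, value, rt) AS
(
SELECT id, month, year, value, value
FROM Test AS row1
WHERE row1.id = (SELECT a.id FROM Test AS a
                 ORDER BY a.year, a.month, a.id LIMIT 1)
UNION ALL
SELECT rowN.id, rowN.month, rowN.year, rowN.value, (rowN.value + running.rt)
FROM Test AS rowN
INNER JOIN running ON rowN.id = (
    SELECT a.id FROM Test AS a
    WHERE (a.year, a.month, a.id) > (running.year, running.month, running.id)
    ORDER BY a.year, a.month, a.id LIMIT 1
)
)
SELECT * FROM running
And on SQLite 3.25 or later, window functions are available after all, so the recursion can be dropped entirely:
SELECT id, month, year, value,
       SUM(value) OVER (ORDER BY year, month, id) AS rt
FROM Test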
I have written an SQL query which amalgamates data from two separate tables with the following query:
SELECT * FROM table1
UNION ALL
SELECT * FROM table2
ORDER BY column1
What I'd like to be able to do is add a 'stamp' in a newly created column which details the table each row originally came from. Essentially, the tables I have are made up of large quantities of numeric data and are hard to distinguish after completing the UNION.
Thanks for any help.
Regards,
CJW.
You can add a literal value to each of your selects; one way is to list the columns explicitly and stamp each branch:
SELECT col1, col2, 'TABLE1' AS source FROM table1
UNION ALL
SELECT col1, col2, 'TABLE2' AS source FROM table2
ORDER BY col1
You can simply add any expression(s) anywhere in the SELECT clause:
SELECT *, 1 AS SourceTable FROM Table1
UNION ALL
SELECT *, 2 AS SourceTable FROM Table2
ORDER BY Column1;
Super new to SQLite but I thought it can't hurt to ask.
I have something like the following table (Not allowed to post images yet) pulling data from multiple tables to calculate the TotalScore:
Name TotalScore
Course1 15
Course1 12
Course2 9
Course2 10
How the heck do I SELECT only the max value for each course? I've managed to use
ORDER BY TotalScore LIMIT 2
But I may end up with multiple Courses in my final product, so LIMIT 2 etc won't really help me.
Thoughts? Happy to put up the rest of my query if it helps?
You can GROUP the resultset by Name and then use the aggregate function MAX():
SELECT Name, max(TotalScore)
FROM my_table
GROUP BY Name
You will get one row for each distinct course, with the name in column 1 and the maximum TotalScore for this course in column 2.
Further hints
You can only SELECT columns that are either grouped by (Name) or wrapped in aggregate functions (max(TotalScore)). If you need another column (e.g. Description) in the resultset, you can group by more than one column:
...
GROUP BY Name, Description
To filter the resulting rows further, you need to use HAVING instead of WHERE:
SELECT Name, max(TotalScore)
FROM my_table
-- WHERE clause would be here
GROUP BY Name
HAVING max(TotalScore) > 5
WHERE filters the raw table rows, HAVING filters the resulting grouped rows.
Functions like max and sum are "aggregate functions", meaning they aggregate multiple rows together. Normally they aggregate them into one value, like max(totalscore), but with group by you can aggregate them into one value per group. group by says how to group the rows together into aggregates.
select name, max(totalscore)
from scores
group by name;
This groups together all the rows with the same name and then computes max(totalscore) for each name.
sqlite> select name, max(totalscore) from scores group by name;
Course1|15
Course2|10
I am trying to run some analysis on sales data using SQLite.
At the moment, my table has several columns including a unique transaction ID, product name, quantity of that product and value of that product. For each transaction, there can be several records, because each distinct type of product in the basket has its own entry.
I would like to add two new columns to the table. The first one would be a total for each transaction ID which summed up the total quantity of all products in that basket.
I realize that there would be duplication in the table, as the repeated transaction IDs would all have the total. The second one would be similar but in value terms.
I unfortunately cannot do this by creating a new table with the values I want calculated in Excel, and then joining it to the original table, because there are too many records for Excel.
Is there a way to get SQL to do the equivalent of a sumif in Excel?
I was thinking something along the lines of:
select sum(qty) where uniqID = ...
But I am stumped by how to express that it needs to sum all quantities where the uniqID is the same as the one in that record.
You wouldn't create a column like that in SQL. You would simply query for the total on the fly. If you really wanted a table-like object, you could create a view that holds 2 columns: uniqID and the sum for that ID.
Let's set up some dummy data in a table; column a is your uniqID, b is the values you're summing.
create table tab1 (a int, b int);
insert into tab1 values (1,1);
insert into tab1 values (1,2);
insert into tab1 values (2,10);
insert into tab1 values (2,20);
Now you can do simple queries for individual uniqIDs like this:
select sum(b) from tab1 where a = 2;
30
Or sum for all uniqIDs (the GROUP BY clause might be all you're looking for):
select a, sum(b) from tab1 group by a;
1|3
2|30
Which could be wrapped as a view:
create view totals as select a, sum(b) as total from tab1 group by a;
select * from totals;
1|3
2|30
The view will update on the fly:
insert into tab1 values (2,30);
select * from totals;
1|3
2|60
In further queries, for analysis, you can use 'totals' just like you would a table.
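And if you do want the per-ID total repeated on every row, like Excel's SUMIF, one sketch is to join the aggregate back onto the base table (total_b is just an illustrative alias):
select t.a, t.b, tot.total_b
from tab1 as t
inner join (select a, sum(b) as total_b from tab1 group by a) as tot
    on tot.a = t.a;
Each row of tab1 comes back with its group's sum alongside it.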