aggregating and concatenating by 2 identifiers - sqlite

There's a particular SQLite query I want to run to perform an aggregate-and-concatenate operation.
I need to group by the "ID" column, then for each ID concatenate the unique 'Attribute' values and also concatenate the average of 'Value' for each unique corresponding 'Attribute'.
I can concatenate unique Attributes and group by ID, but I haven't got the average of Value working.

Try using a subquery to compute the AVG for each id+attribute combination, then apply group_concat:
select t.id,
       group_concat(t.attribute) as concat_att,
       group_concat(t.avg) as concat_avg
from (
    select test.id, test.attribute, AVG(test.value) as avg
    from test
    group by test.id, test.attribute
) as t
group by t.id;
See this example here: http://sqlfiddle.com/#!7/03fe4b/17
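For a self-contained illustration, here's a minimal sketch with hypothetical sample data (note that group_concat does not guarantee concatenation order):

-- hypothetical sample data
CREATE TABLE test (id INTEGER, attribute TEXT, value REAL);
INSERT INTO test VALUES
  (1, 'a', 10), (1, 'a', 20), (1, 'b', 5),
  (2, 'c', 7),  (2, 'c', 9);

-- the query above then returns one row per id:
-- 1 | a,b | 15.0,5.0
-- 2 | c   | 8.0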

How to Average the most recent X entries with GROUP BY

I've looked at many answers on SO concerning situations related to this, but I must not be understanding them well, as I didn't manage to get anything to work.
I have a table with the following columns:
timestamp (PK), type (STRING), val (INT)
I need to get the most recent 20 entries from each type and average the val column. I also need the COUNT() as there may be fewer than 20 rows for some of the types.
I can do the following if I want to get the average of ALL rows for each type:
SELECT type, COUNT(val), AVG(val)
FROM user_data
GROUP BY type
But I want to limit each group's COUNT() to 20.
From here I tried the following:
SELECT type, (
    SELECT AVG(val) AS ave
    FROM (
        SELECT val
        FROM user_data AS ud2
        WHERE ud2.timestamp = ud.timestamp
        ORDER BY ud2.timestamp DESC
        LIMIT 20
    )
) AS ave
FROM user_data AS ud
GROUP BY type
But the returned average is not correct. The values it returns are as if the statement is only returning the average of a single row for each group (it doesn't change regardless of the LIMIT).
With SQLite you can use the ROW_NUMBER() window function in a subquery to filter to the most recent entries per type before computing the average and count.
SELECT
    type,
    AVG(val),
    COUNT(1)
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY type
            ORDER BY timestamp DESC
        ) AS rn
    FROM user_data
) t
WHERE rn <= 20
GROUP BY type
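Note that ROW_NUMBER() and other window functions require SQLite 3.25 or later. If your build is older, one possible fallback (a sketch against the same schema, not part of the original answer) is a correlated subquery that picks each type's 20 most recent timestamps:

SELECT type, AVG(val), COUNT(*)
FROM user_data AS ud
WHERE ud.timestamp IN (
    SELECT timestamp
    FROM user_data AS ud2
    WHERE ud2.type = ud.type
    ORDER BY timestamp DESC
    LIMIT 20
)
GROUP BY type;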

Redshift join with metadata table and select columns

I have created a subset of the pg_table_def table with table_name, col_name and data_type. I have also added a column active with 'Y' as the value for some of the rows. Let us call this table config. The config table looks like below:
table_name column_name
interaction_summary name_id
tag_transaction name_id
interaction_summary direct_preference
bulk_sent email_image_click
crm_dm web_le_click
Now I want to be able to map the table names from this table to the actual tables and fetch values for the corresponding columns. name_id will be the key here, which is available in all tables. My output should look like below:
name_id direct_preference email_image_click web_le_click
1 Y 1 2
2 N 1 2
The solution needs to be dynamic, so that even if the table list grows tomorrow, the new tables can be accommodated. Since I am new to Redshift, any help is appreciated. I am also considering doing the same via R using the dplyr package.
I understood that dynamic queries don't work with Redshift.
My objective was to pull any new table that comes in and use their columns for regression analysis in R.
I got this working by using the listagg function and string concatenation, and then wrote the output to a dataframe in R. This dataframe has 'n' select queries as different rows.
Below is the format:
df <- as.data.frame(tbl(conn, sql("
  select 'select ' || col_names || ' from ' || table_name as q1
  from (
    select distinct table_name,
           listagg(col_name, ',') within group (order by col_name)
             over (partition by table_name) as col_names
    from attribute_config
    where active = 'Y'
    order by table_name
  )
  group by 1")))
Once done, I assigned every row of this dataframe to a new dataframe and fetched the output using the code below:
df1 <- tbl(conn,sql(df[1,]))
I know this is a roundabout solution. But it works!! It fetches about 17M records in under a second.

Selecting multiple maximum values? In Sqlite?

Super new to SQLite but I thought it can't hurt to ask.
I have something like the following table (not allowed to post images yet) pulling data from multiple tables to calculate the TotalScore:
Name TotalScore
Course1 15
Course1 12
Course2 9
Course2 10
How the heck do I SELECT only the max value for each course? I've managed to use
ORDER BY TotalScore LIMIT 2
But I may end up with multiple courses in my final product, so LIMIT 2 etc. won't really help me.
Thoughts? Happy to put up the rest of my query if it helps.
You can GROUP the resultset by Name and then use the aggregate function MAX():
SELECT Name, max(TotalScore)
FROM my_table
GROUP BY Name
You will get one row for each distinct course, with the name in column 1 and the maximum TotalScore for this course in column 2.
Further hints
You can only SELECT columns that are either grouped by (Name) or wrapped in aggregate functions (max(TotalScore)). If you need another column (e.g. Description) in the resultset, you can group by more than one column:
...
GROUP BY Name, Description
To filter the resulting rows further, you need to use HAVING instead of WHERE:
SELECT Name, max(TotalScore)
FROM my_table
-- WHERE clause would be here
GROUP BY Name
HAVING max(TotalScore) > 5
WHERE filters the raw table rows, HAVING filters the resulting grouped rows.
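Applied to the sample data above, a quick sketch of the difference:

-- WHERE drops raw rows before grouping:
SELECT Name, max(TotalScore)
FROM my_table
WHERE TotalScore > 9          -- the row (Course2, 9) is excluded first
GROUP BY Name;

-- HAVING drops whole groups after aggregation:
SELECT Name, max(TotalScore)
FROM my_table
GROUP BY Name
HAVING max(TotalScore) > 10;  -- keeps Course1 (max 15), drops Course2 (max 10)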
Functions like max and sum are "aggregate functions", meaning they aggregate multiple rows together. Normally they aggregate them into one value, like max(totalscore), but you can aggregate them into multiple values with group by, which says how to group the rows together into aggregates.
select name, max(totalscore)
from scores
group by name;
This groups all the rows with the same name together and then computes max(totalscore) for each name.
sqlite> select name, max(totalscore) from scores group by name;
Course1|15
Course2|10

How to Count number of occurrences of values in all columns

I am able to count the number of occurrences of values in a single column.
By using
select column_name, count(column_name)
from table_name
group by column_name
But I want a query for the number of occurrences of values across multiple columns.
The count function, when used directly on a column, just returns a count of the rows; the sum of the counts over multiple columns is simply the number of rows times the number of columns. One thing we could do instead is return the sum of DECODEs of the condition over all columns, e.g.:
select mytable.*,
       DECODE(mytable.column1, 'target value', 1, 0)
     + DECODE(mytable.column2, 'target value', 1, 0) as hits
from mytable
Basically what that does is, for each row, count how many of the columns meet the condition. In this case, that value ('hits') can be 0, 1 or 2, because we are checking the condition over 2 columns.
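Note that DECODE is Oracle-specific; on most other engines (including SQLite) the portable equivalent is a CASE expression. A minimal sketch, assuming the same hypothetical mytable:

SELECT mytable.*,
       (CASE WHEN column1 = 'target value' THEN 1 ELSE 0 END)
     + (CASE WHEN column2 = 'target value' THEN 1 ELSE 0 END) AS hits
FROM mytable;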

Count of columns with filters

I have a dataframe with multiple columns and I want to apply different functions on each column.
An example of my dataset -
I want to calculate the count of the column pq110a for each country mentioned in the qcountry2 column (me-Mexico, br-Brazil, ar-Argentina). The problem I face here is that I have to filter on these columns; for example, for sample patients I want-
Count of pq110 when the values are 1 and 2 (for some patients)
Count of pq110 when the value is 3 (for other patients)
Similarly when the value is 6.
For total patients I want the total count of pq110.
The output I am expecting is shown in the attached Output image.
Similarly for each country I want this output.
Please suggest how I can do this for the other columns as well, countrywise.
Thanks !!
I guess what you want to do is count the occurrences of each value of 'pq110' within each different 'qcountry2'.
So I'll try to use tapply to divide the data into several subsets and then use table to count the occurrences of each value:
# split pq110 by country, then tabulate how often each value occurs per country
tapply(my_data[, "pq110"], INDEX = as.factor(my_data[, "qcountry2"]), FUN = table)
