I am trying to create a query where, if the student ID count is 2, the output should be the average score of those two students.
If the student ID count is more than 2, the output should be the median score of those students. I'm not getting the desired output.
Here is the case-when statement that I am using:
case when (count(Student.STUDENT_ID) = 2) then AVG(Scores.TEST_GROWTH_PERCENTILE)
     else PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY TEST_GROWTH_PERCENTILE)
          OVER (PARTITION BY dates.local_school_year, grade.domain_decode)
end
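The usual problem with this pattern is that COUNT() and AVG() here are plain aggregates, while PERCENTILE_DISC(...) WITHIN GROUP ... OVER is an analytic (window) function, so the two branches operate at different granularities. One way to make them compatible is to write all three as window functions over the same partition. A rough sketch, assuming Oracle-style analytic syntax and that the partition from the ELSE branch applies throughout (joins omitted):

SELECT DISTINCT
       dates.local_school_year,
       grade.domain_decode,
       CASE WHEN COUNT(Student.STUDENT_ID)
                   OVER (PARTITION BY dates.local_school_year, grade.domain_decode) = 2
            THEN AVG(Scores.TEST_GROWTH_PERCENTILE)
                   OVER (PARTITION BY dates.local_school_year, grade.domain_decode)
            ELSE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Scores.TEST_GROWTH_PERCENTILE)
                   OVER (PARTITION BY dates.local_school_year, grade.domain_decode)
       END AS result_score
FROM ...  -- joins among Student, Scores, dates, grade as in the original query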
I've looked at many answers on SO about situations like this, but I must not be understanding them well, as I didn't manage to get anything to work.
I have a table with the following columns:
timestamp (PK), type (STRING), val (INT)
I need to get the most recent 20 entries from each type and average the val column. I also need the COUNT() as there may be fewer than 20 rows for some of the types.
I can do the following if I want to get the average of ALL rows for each type:
SELECT type, COUNT(val), AVG(val)
FROM user_data
GROUP BY type
But I want to limit each group COUNT() to 20.
Starting from a related answer, I tried the following:
SELECT type, (
    SELECT AVG(val) AS ave
    FROM (
        SELECT val
        FROM user_data AS ud2
        WHERE ud2.timestamp = ud.timestamp
        ORDER BY ud2.timestamp DESC
        LIMIT 20
    )
) AS ave
FROM user_data AS ud
GROUP BY type
But the returned average is not correct. The values it returns are as if the statement is only returning the average of a single row for each group (it doesn't change regardless of the LIMIT).
Your correlated subquery filters on ud2.timestamp = ud.timestamp, so the inner query only ever sees the single row with that exact timestamp, which is why every average comes out as one row's value. In SQLite (3.25 or later), you can instead use the row_number window function in a subquery to rank each type's entries by recency, keep the latest 20, and then compute the average and count; types with fewer than 20 rows simply report their actual count.
SELECT
    type,
    AVG(val),
    COUNT(1)
FROM (
    SELECT
        *,
        -- number each type's rows from newest to oldest
        ROW_NUMBER() OVER (
            PARTITION BY type
            ORDER BY timestamp DESC
        ) rn
    FROM
        user_data
) t
WHERE rn <= 20  -- keep only the 20 most recent rows per type
GROUP BY type
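If you're stuck on a SQLite build older than 3.25 (no window functions), a correlated subquery can achieve the same result, though typically less efficiently. A sketch, assuming timestamp is unique as the primary key:

SELECT type,
       COUNT(*) AS cnt,
       AVG(val) AS ave
FROM user_data AS ud
WHERE ud.timestamp IN (
    -- the 20 most recent timestamps for this row's type
    SELECT ud2.timestamp
    FROM user_data AS ud2
    WHERE ud2.type = ud.type
    ORDER BY ud2.timestamp DESC
    LIMIT 20
)
GROUP BY type;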
I'm trying to calculate the percentage each customer has spent of the total sales value.
I have calculated the total sales value per customer using sum() and group by, but once I group, I can no longer compute both the overall total and each customer's individual total in the same query.
Is there any way I could get around this?
I got this far and don't know what to do next:
select c.firstname || ' ' || c.lastname as 'Full name',
       sum(total) as 'Sales value',
       /*something to calculate percentage*/
from invoice i inner join customer c on i.customerid = c.customerid
group by i.customerid order by sum(total) desc limit 5;
To calculate the overall sum over the entire table, move it into an independent (uncorrelated) subquery:
SELECT ...,
sum(total) / (SELECT sum(total) FROM invoice)
FROM ...;
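Applied to the query in the question, that might look like the following sketch (the 100.0 factor forces floating-point division and converts the ratio to a percentage; round() is an optional nicety):

select c.firstname || ' ' || c.lastname as 'Full name',
       sum(total) as 'Sales value',
       round(100.0 * sum(total) / (select sum(total) from invoice), 2) as 'Percentage of total'
from invoice i
inner join customer c on i.customerid = c.customerid
group by i.customerid
order by sum(total) desc
limit 5;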
I have been teaching myself R from scratch, so please bear with me. I have found multiple ways to count observations; however, I am trying to figure out how to count frequencies using (logical?) expressions. I have a massive data set of approx. 1 million observations. The df is set up like so:
Latitude Longitude ID Year Month Day Value
66.16667 -10.16667 CPUELE25399 1979 1 7 0
66.16667 -10.16667 CPUELE25399 1979 1 8 0
66.16667 -10.16667 CPUELE25399 1979 1 9 0
There are 154 unique IDs and, similarly, 154 unique lat/long pairs. I am focusing on the top 1% of all values for each unique ID. For each unique ID I have calculated the 99th percentile using its associated values. I went further and calculated each ID's 99th percentile for individual years and months, e.g. for CPUELE25399, for 1979, for month = 1, the 99th percentile value is 3 (3 being the floor of the top 1%).
Using these threshold values: for each ID, for each year, for each month, I need to count the number of times (per month, per year) that the value >= that ID's 99th percentile.
I have tried at least 100 different approaches to this, but I think I am fundamentally misunderstanding something, maybe in the syntax. This is the snippet of code that has gotten me the farthest:
ddply(Total,
c('Latitude','Longitude','ID','Year','Month'),
function(x) c(Threshold=quantile(x$Value,probs=.99,na.rm=TRUE),
Frequency=nrow(x$Value>=quantile(x$Value,probs=.99,na.rm=TRUE))))
R throws a warning message saying that >= is not meaningful for factors.
If anyone out there understands this convoluted message, I would be supremely grateful for your help.
Using these threshold values: for each ID, for each year, for each month, I need to count the number of times (per month, per year) that the value >= that ID's 99th percentile.
Does this mean you want to
1. calculate the 99th percentile for each ID (i.e. disregarding month, year, etc.), and THEN
2. work out the number of times you exceed this value, but now split up by month and year as well as ID?
(note: your example code groups by lat/lon but this is not mentioned in your question, so I am ignoring it. If you wish to add it in, just add it as a grouping variable in the appropriate places).
In that case, you can use ddply to calculate the per-ID percentile first:
# calculate percentile for each ID
Total <- ddply(Total, .(ID), transform, Threshold=quantile(Value, probs=.99, na.rm=TRUE))
And now you can group by (ID, Month, Year) to count how many times you exceed it:
Total <- ddply(Total, .(ID, Month, Year), summarize, Freq=sum(Value >= Threshold))
Note that summarize will return a data frame with only as many rows as there are unique combinations of .(ID, Month, Year), i.e. it will drop the Latitude/Longitude columns. If you want to keep them, use transform instead of summarize, and then Freq will be repeated across all the (Lat, Lon) rows for each (ID, Month, Year) combo.
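For example, the transform variant of the same count looks like this (TotalKeep is just an illustrative name):

# keeps every original row (including Latitude/Longitude);
# Freq is repeated within each (ID, Month, Year) group
TotalKeep <- ddply(Total, .(ID, Month, Year), transform, Freq=sum(Value >= Threshold))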
Notes on ddply:
- you can use .(ID, Month, Year) rather than c('ID', 'Month', 'Year') as you have done
- if you just want to add extra columns, something like summarize, mutate, or transform lets you do it slickly without needing to put Total$ in front of all the column names.
I have a query that computes the moving average over the last 7 days in a table. The table has two columns: date_of_data, which is of date type and forms a series with a one-day interval, and val, which is a float.
with B as (
    select date_of_data, val
    from mytable
    group by date_of_data
    order by date_of_data
)
select date_of_data,
       val,
       avg(val) over (order by date_of_data rows 7 preceding) as mean7
from B
order by date_of_data;
I want to compute a moving filter for 7 days, meaning that for every row the moving window would contain the 3 preceding rows, the row itself, and the 3 following rows. I cannot find a command that takes the following rows into account. Can anybody help me with this?
Try this:
select date_of_data,
       val,
       avg(val) over (order by date_of_data rows between 3 preceding and 3 following) as mean7
from mytable
order by date_of_data;
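Note that near the ends of the series the frame is truncated, so the first and last rows average fewer than 7 values. If you would rather return NULL unless a full 7-row window exists, one option is to check the frame's row count, sketched here with a named window (supported in e.g. PostgreSQL; otherwise repeat the OVER clause):

select date_of_data,
       val,
       -- only emit the average when the frame actually holds 7 rows
       case when count(*) over w = 7 then avg(val) over w end as mean7
from mytable
window w as (order by date_of_data rows between 3 preceding and 3 following)
order by date_of_data;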