Count observations arranged in multiple columns - r

I have a database with species IDs in the rows (very many) and the sites where they occur in the columns (several sites). I need a summary of how many species there are per site. My observations are categorical in some cases (present) or numerical (number of individuals), because they come from different database sources. There are also several NAs throughout the database.
In R, I have been using functions to count observations only one site at a time.
I would appreciate any help on how to count the observations from the different columns at the same time.

You could do just:
SELECT COUNT(*)
FROM tables
WHERE conditions
and in the conditions, specify conditions on the different columns:
WHERE t.COLUMN1="THIS" AND t.COLUMN2="THAT"
or with a SUM CASE (probably the best idea in general):
SELECT grfield,
SUM(CASE when a=1 then 1 else 0 end) as tcount1,
SUM(CASE when a=2 then 1 else 0 end) as tcount2
FROM T1
GROUP by grfield;
Or in a more complex way you could do a subquery inside the count:
SELECT COUNT(*) FROM
(
SELECT DISTINCT D
FROM T1
INNER JOIN T2
ON T1.A = T2.B
) AS subquery;
You could also do several counts in subqueries... the possibilities are endless.
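To make the SUM(CASE ...) idea concrete for the per-site counting problem, here is a minimal sketch using Python's built-in sqlite3 with a made-up `species` table (the table, column names, and data are invented for illustration); one conditional sum per site column counts the non-NULL observations in a single pass:

```python
import sqlite3

# Hypothetical layout: one row per species, one column per site,
# NULL (None) marking a missing observation.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE species (id TEXT, site_a, site_b)")
con.executemany("INSERT INTO species VALUES (?, ?, ?)", [
    ("sp1", "present", 3),     # categorical in one site, count in another
    ("sp2", None,      1),
    ("sp3", "present", None),
])

# One SUM(CASE ...) per site counts non-NULL entries for all sites at once.
row = con.execute("""
    SELECT SUM(CASE WHEN site_a IS NOT NULL THEN 1 ELSE 0 END),
           SUM(CASE WHEN site_b IS NOT NULL THEN 1 ELSE 0 END)
    FROM species
""").fetchone()
print(row)  # (2, 2)
```

Because the test is `IS NOT NULL`, it doesn't matter whether a column holds "present" flags or individual counts — both count as an observation.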

Related

How to get the proportion of values in a column >X?

Using sqlite3, I have a column "grades" in table "students" and I want to get the proportion of students who scored over 80 on a test. How do I get that? I can select count(*) from students and then select count(*) from students where grades > 80, but how do I get the proportion in one statement?
Here is a simple way to do this:
SELECT
AVG(CASE WHEN grades > 80 THEN 1 ELSE 0 END)
FROM students;
This just takes a conditional average over the entire table, counting the number of students with a grade over 80, then normalizing that count by the total number of students.
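As a quick check of the AVG-over-flags trick, here is a runnable sketch with Python's sqlite3 and invented sample grades (two of four students above 80):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (name TEXT, grades INTEGER)")
con.executemany("INSERT INTO students VALUES (?, ?)", [
    ("a", 95), ("b", 70), ("c", 85), ("d", 60),
])

# Averaging 1/0 flags is the sum of flags divided by the row count,
# i.e. the proportion, in one statement.
prop = con.execute(
    "SELECT AVG(CASE WHEN grades > 80 THEN 1 ELSE 0 END) FROM students"
).fetchone()[0]
print(prop)  # 0.5
```

Note that AVG returns a floating-point value, so no explicit cast is needed to avoid integer division.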

PL/SQL Case with Group By and Pivot

I have data that I'm presenting in an APEX interactive report, using a pivot statement to display monthly data for a period of 15 years. I am color coding some of the values based on whether they contain a decimal, using a case statement.
My problem is that the case statement is creating multiple rows from one row of data. My report shows 2 rows for each item: one with the values without decimals, and one with the values containing decimals.
(screenshot: multiple rows)
How can I combine the rows into one? Use a Group By? or is there a better way?
select buscat, prod_parent, year_month, volume, load_source, tstamp,
case when instr(VOLUME, '.') > 0 then 'color:#FF7755;' else 'color:#000000;' end flag
from HISTORY where id > 0
Here is raw data from SQL query...
(screenshot: SQL query return)
According to the SQL return image, the data is not repeating. It looks like you are not filtering out rows where volume is 0. Try changing id to volume in the WHERE clause:
select yearMonth, volume, load_source, tstamp,
case when instr(volume, '.') > 0 then 'color:#FF7755;' else 'color:#000000;' end flag
from HISTORY
where volume > 0
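The instr-based color flag from the answer can be tried outside APEX too; a minimal sketch with Python's sqlite3 and invented volume values (sqlite also provides instr()), showing that the decimal check and the volume filter work in one pass:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Typeless column so integer and decimal volumes keep their stored types.
con.execute("CREATE TABLE HISTORY (volume)")
con.executemany("INSERT INTO HISTORY VALUES (?)", [(120,), (3.5,), (0,)])

# instr() finds the position of '.', so a decimal value gets the
# highlight color; WHERE volume > 0 drops the zero row.
rows = con.execute("""
    SELECT volume,
           CASE WHEN instr(volume, '.') > 0
                THEN 'color:#FF7755;' ELSE 'color:#000000;' END AS flag
    FROM HISTORY
    WHERE volume > 0
""").fetchall()
print(rows)  # [(120, 'color:#000000;'), (3.5, 'color:#FF7755;')]
```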

will converting from dateFrom/dateTo to period data type improve performance?

I have a really slow query and I'm trying to speed it up.
I have a target date range (dateFrom/dateTo) defined in a table with only one row I need to use as a limit against a table with millions of rows. Is there a best practice for this?
I started with one table with one row with dateFrom and dateTo fields. I can limit the rows in the large table by CROSS JOINing it with the small table and using the WHERE clause, like:
select
count(*)
from
tblOneRow o, tblBig b
where
o.dateFrom < b.dateTo and
o.dateTo >= b.dateFrom
or I can inner join the tables on the date range, like:
select
count(*)
from
tblOneRow o inner join
tblBig b on
o.dateFrom < b.dateTo and
o.dateTo >= b.dateFrom
but I thought that if I changed my single-row table to use one field with a PERIOD data type instead of two DATE fields, it might improve performance. Is this a reasonable assumption? The Explain isn't showing a time difference if I change it to:
select
count(*)
from
tblOneRow o inner join
tblBig b on
begin(o.date) < b.dateTo and
end(o.date) >= b.dateFrom
or if I convert the small table's date range to a PERIOD data type and join ON P_INTERSECT, like:
select
count(*)
from
tblOneRow o inner join
tblBig b on
o.date p_intersect period(b.dateFrom, b.dateTo + 1) is not null
To help the parsing engine with this join, would I need to define the fields on the large table with a PERIOD data type instead of two dates? I can't do that, as I don't own that table; if that's the case, I'll give up on improving performance with this method.
Thanks for your help.
I don't expect any difference between the first three SELECTs; the Explain should be the same: a product join (the optimizer should expect exactly one row, but as it's duplicated, the estimated size should be the number of AMPs in your system). The last SELECT should be worse, because you apply a calculation (OVERLAPS would be more appropriate, but probably not better).
One way to improve this single-row cross join would be a view (SELECT DATE '...' AS dateFrom, DATE '...' AS dateTo) instead of the single-row table. This should resolve the dates and result in hard-coded dateFrom/dateTo values instead of a product join.
Similarly when you switch to scalar subqueries:
select
count(*)
from
tblBig b
where
(select min(o.dateFrom) from tblOneRow o) < b.dateTo
and
(select min(o.dateTo) from tblOneRow o) >= b.dateFrom
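Teradata's PERIOD type and Explain output can't be reproduced here, but the scalar-subquery rewrite itself is portable; a sketch with Python's sqlite3 and made-up date ranges (table and column names follow the question, the data is invented) shows the overlap count:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tblOneRow (dateFrom TEXT, dateTo TEXT)")
con.execute("INSERT INTO tblOneRow VALUES ('2020-01-01', '2020-06-30')")
con.execute("CREATE TABLE tblBig (dateFrom TEXT, dateTo TEXT)")
con.executemany("INSERT INTO tblBig VALUES (?, ?)", [
    ("2019-01-01", "2019-12-31"),   # ends before the target range
    ("2020-03-01", "2020-03-31"),   # inside the target range
    ("2020-05-01", "2021-01-31"),   # overlaps the end of the range
])

# Scalar subqueries resolve the single-row limits once, instead of
# joining the one-row table against every big-table row.
n = con.execute("""
    SELECT COUNT(*) FROM tblBig b
    WHERE (SELECT MIN(dateFrom) FROM tblOneRow) < b.dateTo
      AND (SELECT MIN(dateTo) FROM tblOneRow) >= b.dateFrom
""").fetchone()[0]
print(n)  # 2
```

ISO-formatted date strings compare correctly as text, which keeps the sketch dependency-free.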

Stata Count Distinct Values

In SQL I would do:
SELECT COUNT(DISTINCT column_name,column_name2) AS some_alias FROM table_name
In Stata I would like to do the same ...
I have not found an easy way to do this ...
For example, I import new panel data for 20 Countries - if available, for a timespan over 20 years - a max of 20*20 values. But some country-year combinations might be missing.
I would like to know, then, how many of the possible 400 values I have!
ssc inst distinct
will install a user-written command that is very close to what the mentioned SQL statement does. In the case of a dichotomous variable and 20 countries, this command will give the distinct number of value combinations of country and the dichotomous variable:
distinct Countries dichVar if dichVar == 1
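For the SQL side of the comparison, note that the multi-argument COUNT(DISTINCT a, b) in the question is MySQL syntax; in sqlite (shown here via Python's sqlite3, with an invented country/year panel) the same count of distinct combinations needs a DISTINCT subquery:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE panel (country TEXT, year INTEGER)")
con.executemany("INSERT INTO panel VALUES (?, ?)", [
    ("DE", 2000), ("DE", 2000),   # duplicate country-year pair
    ("DE", 2001), ("FR", 2000),
])

# SQLite's COUNT(DISTINCT ...) takes a single argument, so count
# distinct (country, year) pairs with a DISTINCT derived table.
n = con.execute("""
    SELECT COUNT(*) FROM (SELECT DISTINCT country, year FROM panel)
""").fetchone()[0]
print(n)  # 3
```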

SQLite Ranking Time Stamps

I am new to SQL and am having trouble with a (fairly simple) query to rank time stamps.
I have one table with survey data from 2014. I am trying to determine the 'learning curve' for good customer satisfaction performance. I want to order and rank each survey at an agent level based on the time stamp of the survey. This would let me see what the average performance is when an agent has 5 total surveys, 10, 20 etc.
I imagine it should be something like (table name is tablerank):
select T1.*,
(select count(*)
from tablerank as T2
where T2.call_date > T1.call_date
) as SurveyRank
from tablerank as T1
where p1.Agent_ID = T2.Agent_ID;
For each agent, it would list each survey in order and tag a 1 for the earliest survey, a 2 for the second earliest, etc. Then I could Pivot the data in Excel and see the learning curve based on survey count rather than tenure or time (since surveys are more rare, sometimes you only get 1 or 2 in a month).
A correlated subquery must have the correlation in the subquery itself; any table names/aliases from the subquery (such as T2) are not visible in the outer query.
For ranking, you want to count earlier surveys, and you want to include the current survey so that the first one gets the rank number 1, so you need to use <= instead of >:
SELECT *,
(SELECT COUNT(*)
FROM tablerank AS T2
WHERE T2.Agent_ID = T1.Agent_ID
AND T2.call_date <= T1.call_date
) AS SurveyRank
FROM tablerank AS T1
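The corrected query can be checked end to end; a minimal sketch with Python's sqlite3 and a few invented surveys for two agents (column names follow the question) shows each agent's surveys ranked 1, 2, ... by date:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tablerank (Agent_ID TEXT, call_date TEXT)")
con.executemany("INSERT INTO tablerank VALUES (?, ?)", [
    ("a1", "2014-01-05"),
    ("a1", "2014-02-10"),
    ("a2", "2014-01-20"),
])

# Correlated subquery: for each row, count that agent's surveys up to
# and including the current date; the earliest survey gets rank 1.
rows = con.execute("""
    SELECT Agent_ID, call_date,
           (SELECT COUNT(*) FROM tablerank AS T2
            WHERE T2.Agent_ID = T1.Agent_ID
              AND T2.call_date <= T1.call_date) AS SurveyRank
    FROM tablerank AS T1
    ORDER BY Agent_ID, call_date
""").fetchall()
print(rows)
# [('a1', '2014-01-05', 1), ('a1', '2014-02-10', 2), ('a2', '2014-01-20', 1)]
```

On SQLite versions with window functions (3.25+), `ROW_NUMBER() OVER (PARTITION BY Agent_ID ORDER BY call_date)` would express the same ranking more directly.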
