Stata Count Distinct Values - count

In SQL I would do:
SELECT COUNT(DISTINCT column_name,column_name2) AS some_alias FROM table_name
In Stata I would like to do the same ...
I have not found an easy way to do this ...
For example, I import new panel data for 20 countries, covering a timespan of up to 20 years where available, so at most 20*20 values. Some country-year combinations might be missing.
I would like to know how many of the possible 400 values I actually have.

ssc install distinct
will install a command that is very close to what the SQL statement above does. By default it reports the number of distinct values of each variable separately; with the joint option it reports the number of distinct combinations. In the case of a dichotomous variable and 20 countries, this statement gives the number of distinct value combinations of countries and the dichotomous variable:
distinct Countries dichVar if dichVar == 1, joint
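To check the same logic outside Stata, here is a minimal Python sketch of counting the distinct country-year combinations present out of the 400 possible ones. The country codes, years, and the two missing pairs are made up for illustration.

```python
from itertools import product

# Hypothetical panel: 20 made-up country codes and 20 years.
countries = [f"C{i:02d}" for i in range(1, 21)]
years = range(2000, 2020)

# All 20 * 20 combinations that could exist.
possible = set(product(countries, years))

# Assume two country-years are missing from the imported data.
missing = {("C03", 2005), ("C17", 2019)}
observed_rows = [pair for pair in possible if pair not in missing]
# Add a duplicate row to show that repeats do not inflate the count.
observed_rows.append(("C01", 2000))

# A set keeps each (country, year) pair only once.
n_distinct = len(set(observed_rows))
n_possible = len(possible)
print(f"{n_distinct} of {n_possible} country-year combinations present")
```
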

Related

Get all possible combinations on a large dataset in R

I have a large dataset with more than 10 million records and 20 variables. I need to get every possible combination of 11 of these 20 variables, and the frequency of each combination should also be displayed.
I have tried count() in the plyr package and the table() function, but neither can produce all possible combinations, since the number of combinations is very high (greater than 2^32) and the data is huge.
Assume the following dataset with 5 variables and 6 observations -
I want all combinations of the first three variables whose frequency is greater than 0.
Is there any other function to achieve this? I am just interested in combinations whose frequency is non-zero.
Thanks!
OK. I think I have an idea of what you require. If you want the row count for each combination of N categorical columns in your table, you can get it with the data.table package. It returns the count only for combinations that actually exist in the table. Simply list the required categories in the by argument:
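library(data.table)
DT <- data.table(val = rnorm(1e7), cat1 = sample.int(10, 1e7, replace = TRUE), cat2 = sample.int(10, 1e7, replace = TRUE), cat3 = sample.int(10, 1e7, replace = TRUE))
# .N is the number of rows in each existing (cat1, cat2, cat3) group
DT_count <- DT[, .N, by = .(cat1, cat2, cat3)]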
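The same idea, counting only the combinations that actually occur, can be sketched in Python with collections.Counter. The six rows below are invented stand-ins for the example dataset, which is not shown in the question.

```python
from collections import Counter

# Hypothetical data: 6 observations, 5 variables each.
rows = [
    ("a", 1, "x", 0.5, 10),
    ("a", 1, "x", 0.7, 11),
    ("a", 2, "y", 0.1, 12),
    ("b", 1, "x", 0.9, 13),
    ("b", 1, "x", 0.3, 14),
    ("b", 1, "x", 0.4, 15),
]

# Count each combination of the first three variables.
# Combinations that never occur are simply absent, so every count is > 0.
combo_counts = Counter(row[:3] for row in rows)

for combo, n in sorted(combo_counts.items()):
    print(combo, n)
```

Memory use grows with the number of distinct combinations observed, not with the number of theoretically possible ones.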

Count observations arranged in multiple columns

I have a (very large) database with species IDs in the rows and the places where they occur in the columns (several sites). I need a summary of how many species there are per site. Because the observations come from different database sources, they are categorical in some cases (present) and numerical in others (number of individuals). There are also several NAs throughout the database.
In R, I have been using functions that count observations for only one site at a time.
I would appreciate any help on how to count the observations in the different columns at the same time.
You could do just:
SELECT COUNT(*)
FROM tables
WHERE conditions
In the WHERE clause, specify the conditions on the different columns:
WHERE t.COLUMN1 = 'THIS' AND t.COLUMN2 = 'THAT'
or with a SUM CASE (probably the best idea in general):
SELECT grfield,
SUM(CASE when a=1 then 1 else 0 end) as tcount1,
SUM(CASE when a=2 then 1 else 0 end) as tcount2
FROM T1
GROUP by grfield;
Or in a more complex way you could do a subquery inside the count:
SELECT COUNT(*) FROM
(
SELECT DISTINCT D
FROM T1
INNER JOIN T2
ON T1.A = T2.B
) AS subquery;
You could do also several counts in subqueries... the possibilities are endless.
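To try the SUM CASE pattern end to end, here is a minimal sketch using Python's built-in sqlite3 module. The table name T1 and the columns grfield and a come from the query above; the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (grfield TEXT, a INTEGER)")
# Hypothetical rows: group g1 has a = 1, 1, 2; group g2 has a = 2, 3.
conn.executemany(
    "INSERT INTO T1 VALUES (?, ?)",
    [("g1", 1), ("g1", 1), ("g1", 2), ("g2", 2), ("g2", 3)],
)

# One pass over the table yields several conditional counts per group.
rows = conn.execute(
    """
    SELECT grfield,
           SUM(CASE WHEN a = 1 THEN 1 ELSE 0 END) AS tcount1,
           SUM(CASE WHEN a = 2 THEN 1 ELSE 0 END) AS tcount2
    FROM T1
    GROUP BY grfield
    ORDER BY grfield
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```
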

Selecting multiple maximum values? In Sqlite?

Super new to SQLite but I thought it can't hurt to ask.
I have something like the following table (Not allowed to post images yet) pulling data from multiple tables to calculate the TotalScore:
Name TotalScore
Course1 15
Course1 12
Course2 9
Course2 10
How the heck do I SELECT only the max value for each course? I've managed to use
ORDER BY TotalScore LIMIT 2
But I may end up with multiple Courses in my final product, so LIMIT 2 etc won't really help me.
Thoughts? Happy to put up the rest of my query if it helps?
You can GROUP the resultset by Name and then use the aggregate function MAX():
SELECT Name, max(TotalScore)
FROM my_table
GROUP BY Name
You will get one row for each distinct course, with the name in column 1 and the maximum TotalScore for this course in column 2.
Further hints
In standard SQL you can only SELECT columns that are either grouped by (Name) or wrapped in aggregate functions (max(TotalScore)); SQLite itself tolerates other "bare" columns, but relying on that is not portable. If you need another column (e.g. Description) in the resultset, you can group by more than one column:
...
GROUP BY Name, Description
To filter the resulting rows further, you need to use HAVING instead of WHERE:
SELECT Name, max(TotalScore)
FROM my_table
-- WHERE clause would be here
GROUP BY Name
HAVING max(TotalScore) > 5
WHERE filters the raw table rows, HAVING filters the resulting grouped rows.
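The difference between WHERE and HAVING can be checked with a short sketch using Python's built-in sqlite3 module and the four rows from the question. The table name my_table follows the answer above; the thresholds 9 and 10 are chosen only to make the effect visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (Name TEXT, TotalScore INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("Course1", 15), ("Course1", 12), ("Course2", 9), ("Course2", 10)],
)

# Plain GROUP BY: one row per course with its maximum score.
max_per_course = conn.execute(
    "SELECT Name, max(TotalScore) FROM my_table GROUP BY Name ORDER BY Name"
).fetchall()

# WHERE drops raw rows before grouping; HAVING drops whole groups afterwards.
filtered = conn.execute(
    """
    SELECT Name, max(TotalScore)
    FROM my_table
    WHERE TotalScore > 9
    GROUP BY Name
    HAVING max(TotalScore) > 10
    ORDER BY Name
    """
).fetchall()
conn.close()

print(max_per_course)
print(filtered)
```
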
Functions like max and sum are "aggregate functions" meaning they aggregate multiple rows together. Normally they aggregate them into one value, like max(totalscore) but you can aggregate them into multiple values with group by. group by says how to group the rows together into aggregates.
select name, max(totalscore)
from scores
group by name;
This groups all the rows with the same name together and then computes max(totalscore) for each name.
sqlite> select name, max(totalscore) from scores group by name;
Course1|15
Course2|10

Count the occurrences of just one value in R

I have 34 subsets with a bunch of variables and I am making a new dataframe with summarizing information about each variable for the subsets.
- Example: A10, T2 and V2 are all subsets with ~10 variables and 14 observations where one variable is population.
I want my new dataframe to have a column which says how many times per subset variable 2 hit zero.
I've looked at a bunch of different count functions but they all seem to make separate tables and count the occurrences of all variables. I'm not interested in how many times each unique value shows up because most of the values are unique, I just want to know how many times population hit zero for each subset of 14 observations.
I realize this is probably a simple thing to do but I'm not very good at creating my own solutions from other R code yet. Thanks for the help.
I've done something similar with a different dataset where I counted how many times 'NA' occurred in a vector where all the other values were numerical. For that I used:
na.tmin<- c(sum(is.na(s1997$TMIN)), sum(is.na(s1998$TMIN)), sum(is.na(s1999$TMIN))...
Which created a column (na.tmin) that had the number of times each subset recorded NA instead of a number. I'd like to just count the number of times the value 0 occurred but is.0 is of course not a function because 0 is numerical. Is there a function that will just count the number of times a specific value shows up? If there's not should I use the count occurrences for unique values function?
Perhaps:
sum( abs( s1997$TMIN ) < 0.00000001 )
It's safer to use a tolerance value unless you are sure that your values are integers. See R FAQ 7.31.
sum( abs( pi - (355/113+seq(-0.001, 0.001, length=1000 ) ) )< 0.00001 )
[1] 10
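The same tolerance idea can be sketched in Python. The values below are invented; the third one is a floating-point rounding residue that is mathematically zero but not exactly 0.

```python
# Hypothetical measurements; 0.1 + 0.2 - 0.3 is a tiny non-zero float.
values = [0.0, 1e-12, 0.1 + 0.2 - 0.3, 0.5, -2.0, 0]

tol = 1e-8  # same tolerance as in the R answer above

# Count values that are zero up to the tolerance.
n_zero = sum(1 for v in values if abs(v) < tol)

# Exact comparison misses the values that are only approximately zero.
n_exact = sum(1 for v in values if v == 0)

print(n_zero, n_exact)
```
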

Select from a matrix based on the values of 2 distinct variables

Suppose I have a matrix with values of a response variable as one column and 2 characteristics such as Gender and location as the other two columns.
How do I select the particular values of the response based on specific values of both gender and location?
For example, I know
dataset$response[dataset$gender=="Male"]
will select all the Males. But say I want to select the response values from males that are from location=='SE' as well. I don't know how to do this.
Thanks a lot!
p.s. (I tried looking for this on the internet, but it is difficult finding help for the [] operator)
Logical 'and':
dataset$response[dataset$gender=="Male" & dataset$location=="SE"]
More information on logical operators in R can be found by using help("&").
If dataset is a data-frame, simply use subset:
subset( dataset, gender == 'Male' & location == 'SE' )$response
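For comparison, the same two-condition selection can be sketched in plain Python. The rows are invented, and the field names mirror the columns in the question.

```python
# Hypothetical records with a response and two characteristics.
dataset = [
    {"response": 3.1, "gender": "Male", "location": "SE"},
    {"response": 2.4, "gender": "Female", "location": "SE"},
    {"response": 5.0, "gender": "Male", "location": "NW"},
    {"response": 4.2, "gender": "Male", "location": "SE"},
]

# Keep the response only where both conditions hold (logical 'and').
selected = [
    row["response"]
    for row in dataset
    if row["gender"] == "Male" and row["location"] == "SE"
]

print(selected)
```
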
