Counting variables per observation per month in SAS

A quick question: I have data of the following sort:
Ticker  Date        Fem Analyst (dummy, 1 if true)  Variables of that month (e.g. beta)
AA      01/04/2001  1                               0.61
AA      05/04/2001  1                               0.62
AA      08/04/2001  1                               0.63
AA      01/05/2002  1                               0.7
AA      04/05/2002  1                               0.71
AA      08/07/2002  0                               0.8
AA      07/04/2003  1                               0.4
and so on. What I want to receive is the following:
Ticker  Date     Number of Fem Analysts  Number of Male Analysts  Total  Variables
AA      04/2001  3                       0                        3      0.63
AA      05/2002  2                       0                        2      0.71
AA      07/2002  0                       1                        1      0.8
AA      04/2003  1                       0                        1      0.4
So I need a counting algorithm that counts the number of female and male analysts for a given company per month (using the gender dummy, 0 or 1) and deletes all observations for that month except the most recent one. For instance, 08/04/2001 becomes 04/2001 with 0.63, which is the most recent beta observation in 04/2001 for company AA. I hope the example explains it.
Any ideas?

You may want something like this:
/* Create the month variable into a string YYYY/MM */
data analyst0;
set <your data>;
format month $7.;
month=cats(year(date),'/',put(month(date),z2.));
run;
/* Sort so you can do the by processing required for counting */
proc sort data=analyst0 out=analyst1;
/* You need to include the date in the sort so the most recent is last */
by ticker month date;
run;
/* Count */
data count;
retain n_fem n_male 0;
set analyst1;
by ticker month;
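/* Reset the counters at the start of each ticker/month group */
if first.ticker or first.month then do;
n_fem=0;
n_male=0;
end;
/* Count every observation, including the first one of the group */
if gender=1 then n_fem+1;
else if gender=0 then n_male+1;
else put 'Huh?';
/* this outputs only the most recent observation of each month */
if last.ticker or last.month then output;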
run;
This should give you the general idea - I don't have access to SAS right now so I can't check the code. See the documentation for retain and by processing in the data step for more details.
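For readers who want to check the logic outside SAS, here is a minimal Python sketch of the same idea: sort by ticker and date, group by ticker and month, count the dummy values, and keep the last beta of each group. The function name monthly_counts and the tuple layout are illustrative assumptions, and the dates are read as dd/mm/yyyy, as the example in the question implies.

```python
from datetime import datetime
from itertools import groupby

# Rows copied from the question: (ticker, date as dd/mm/yyyy, female dummy, beta)
rows = [
    ("AA", "01/04/2001", 1, 0.61),
    ("AA", "05/04/2001", 1, 0.62),
    ("AA", "08/04/2001", 1, 0.63),
    ("AA", "01/05/2002", 1, 0.70),
    ("AA", "04/05/2002", 1, 0.71),
    ("AA", "08/07/2002", 0, 0.80),
    ("AA", "07/04/2003", 1, 0.40),
]


def monthly_counts(rows):
    """Return one row per ticker/month: counts by gender plus the latest beta."""
    parsed = [(t, datetime.strptime(d, "%d/%m/%Y"), g, b) for t, d, g, b in rows]
    # Sorting by ticker and full date puts the most recent observation last
    parsed.sort(key=lambda r: (r[0], r[1]))

    def month_key(r):
        return (r[0], r[1].year, r[1].month)

    result = []
    for (ticker, year, month), group in groupby(parsed, key=month_key):
        group = list(group)
        n_fem = sum(1 for r in group if r[2] == 1)
        n_male = sum(1 for r in group if r[2] == 0)
        latest_beta = group[-1][3]  # last row of the group is the most recent
        result.append(
            (ticker, f"{month:02d}/{year}", n_fem, n_male, n_fem + n_male, latest_beta)
        )
    return result


for line in monthly_counts(rows):
    print(line)
```

Every observation in a group is counted, including the first one, which is the point where a reset-then-count data step is easy to get wrong.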

SQLite: Group data within certain time interval

I have a single table which stores data of orders:
Orders Table:
id | order_time | quantity | ...
1 | 1592821854318 | 2
2 | 1592901538199 | 4
3 | 1592966454547 | 1
4 | 1593081282406 | 9
5 | 1593141826330 | 6
The order_time column is a UNIX timestamp in milliseconds.
Using below query I am able to get available data grouped by days (86400000 = 24 hours):
SELECT order_time+ (86400000 - (order_time % 86400000)) as gap, SUM(quantity) as
totalOrdersBetweenInterval
FROM USAGE_DETAILS ud
WHERE order_time >= 1590969600000 AND order_time <= 1593388799000
GROUP BY gap
ORDER BY gap ASC
Suppose that in June I receive orders on the 1st, 4th, 6th and 7th. The query above then returns the following:
gap | totalOrdersBetweenInterval
1 | 5
4 | 6
6 | 4
7 | 10
The gap column actually contains UNIX timestamps; I have used readable day numbers for the sake of the example.
The query above only returns the days that received orders, but I want every day in the range, including the days with no orders:
gap | totalOrdersBetweenInterval
1 | 5
2 | 0
3 | 0
4 | 6
5 | 0
6 | 4
7 | 10
8 | 0
9 | 0
. | .
. | .
How do I go about that?
You need a query that returns 30 rows (1, 2, ..., 30), one for each day of June.
You could do it with a recursive CTE:
with days as (
select 1 day
union all
select day + 1
from days
where day < 30
)
but I'm not sure if Android uses a version of SQLite that supports CTEs.
If it does support them, all you need to do is join the CTE with a LEFT join to your query:
with
days as (
select 1 day
union all
select day + 1
from days
where day < 30
),
yourquery as (
<your query here>
)
select d.day, coalesce(t.totalOrdersBetweenInterval, 0) totalOrdersBetweenInterval
from days d left join yourquery t
on t.gap = d.day
If Android does not support CTEs you will have to build the query that returns the days with UNION ALL:
select d.day, coalesce(t.totalOrdersBetweenInterval, 0) totalOrdersBetweenInterval
from (
select 1 day union all select 2 union all
select 3 union all select 4 union all
......................................
select 29 union all select 30
) d left join (
<your query here>
) t
on t.gap = d.day
Thanks to @forpas for helping me out.
Just posting in case someone is searching for how to slice data by UNIX time intervals.
with
days as (
select 1590969600000 day --Start of June 1, 2020 (UTC), in milliseconds
union all
select day + 86400000 --Add 1 day (in milliseconds)
from days
where day < 1593388799000 --Stop after the end of June 28, 2020 (UTC)
),
subquery as (
SELECT order_time+ (86400000 - (order_time % 86400000)) as gap, SUM(quantity) as
totalOrdersBetweenInterval
FROM USAGE_DETAILS ud
WHERE order_time >= 1590969600000 AND order_time <= 1593388799000
GROUP BY gap
)
select d.day, coalesce(t.totalOrdersBetweenInterval, 0) totalOrdersBetweenInterval
from days d left join subquery t
on t.gap = d.day
order by d.day
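The same gap-filling idea can be sketched in Python: build the totals per day boundary first, then walk over every day boundary in the range and fall back to 0 where nothing was ordered. The function name daily_totals is an illustrative assumption; the bucket formula and the sample orders are taken from the question, and the list of boundaries runs one step past the end of the range, as the recursive CTE above does.

```python
DAY_MS = 86_400_000  # one day in milliseconds

# (order_time in ms, quantity), copied from the question's Orders table
orders = [
    (1592821854318, 2),
    (1592901538199, 4),
    (1592966454547, 1),
    (1593081282406, 9),
    (1593141826330, 6),
]


def daily_totals(orders, start_ms, end_ms):
    """Return (day boundary, total quantity) for every day, 0 for empty days."""
    totals = {}
    for order_time, quantity in orders:
        if start_ms <= order_time <= end_ms:
            # Same bucketing as the SQL: round up to the next day boundary
            gap = order_time + (DAY_MS - order_time % DAY_MS)
            totals[gap] = totals.get(gap, 0) + quantity
    # Equivalent of the LEFT JOIN plus COALESCE: every boundary gets a row
    return [
        (day, totals.get(day, 0))
        for day in range(start_ms, end_ms + DAY_MS, DAY_MS)
    ]


rows = daily_totals(orders, 1590969600000, 1593388799000)
for day, total in rows:
    print(day, total)
```

Because the bucket label is the end of the day, start_ms has to fall on a day boundary for the labels to line up, just as the join on t.gap = d.day requires in the query.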

How to fill in observations using other observations R or Stata

I have a dataset like this:
ID dum1 dum2 dum3 var1
1 0 1 . hi
1 0 . 0 hi
2 1 . . bye
2 0 0 1 .
I want to fill in missing values using other observations with the same ID. My end product would be something like:
ID dum1 dum2 dum3 var1
1 0 1 0 hi
1 0 1 0 hi
2 1 0 1 bye
2 0 0 1 bye
Is there any way I can do this in R or Stata?
This continues discussion of Stata solutions. The solution by @Pearly Spencer looks backward and forward from observations with missing values and so is fine for the example with just two observations per group, and possibly fine for some other situations.
An alternative approach makes use, as appropriate, of the community-contributed commands mipolate and stripolate from SSC as explained also at https://www.statalist.org/forums/forum/general-stata-discussion/general/1308786-mipolate-now-available-from-ssc-new-program-for-interpolation
Examples first, then commentary:
clear
input ID dum1a dum2a dum3a str3 var1a
1 0 1 . "hi"
1 0 . 0 "hi"
2 1 . . "bye"
2 0 0 1 ""
2 0 1 . ""
end
gen long obsno = _n
foreach v of var dum*a {
quietly count if missing(`v')
if r(N) > 0 capture noisily mipolate `v' obsno, groupwise by(ID) generate(`v'_2)
}
foreach v of var var*a {
quietly count if missing(`v')
if r(N) > 0 capture noisily stripolate `v' obsno, groupwise by(ID) generate(`v'_2)
}
list
+----------------------------------------------------------------+
| ID dum1a dum2a dum3a var1a obsno dum3a_2 var1a_2 |
|----------------------------------------------------------------|
1. | 1 0 1 . hi 1 0 hi |
2. | 1 0 . 0 hi 2 0 hi |
3. | 2 1 . . bye 3 1 bye |
4. | 2 0 0 1 4 1 bye |
5. | 2 0 1 . 5 1 bye |
+----------------------------------------------------------------+
Notes:
The groupwise option of mipolate and stripolate uses the rule: replace missing values within groups with the non-missing value in that group if and only if there is only one distinct non-missing value in that group. Thus if the non-missing values in a group are all 1, or all 42, or whatever it is, then interpolation uses 1 or 42 or whatever it is. If the non-missing values in a group are 0 and 1, then no go.
The variable obsno created here plays no role in that interpolation and is needed solely to match the general syntax of mipolate.
There is no assumption here that groups consist of just two observations or have the same number of observations. A common playground for these problems is data on families whenever some variables were recorded only for certain family members but it is desired to spread the values recorded to other family members. Naturally, in real data families often have more than two members and the number of family members will vary.
This question exposed a small bug in mipolate, groupwise and stripolate, groupwise: it doesn't exit as appropriate if there is nothing to do, as in dum1a where there are no missing values. In the code above, this is trapped by asking for interpolation if and only if missing values are counted. At some future date, the bug will be fixed and the code in this answer simplified accordingly, or so I intend as program author.
mipolate, groupwise and stripolate, groupwise both exit with an error message if any group is found with two or more distinct non-missing values; no interpolation is then done for any groups, even if some groups are fine. That is the point of the code capture noisily: the error message for dum2a is not echoed above. As program author I am thinking of adding an option whereby such groups will be ignored but that interpolation will take place for groups with just one distinct non-missing value.
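The groupwise rule described in these notes can be illustrated with a short Python sketch. This is not how mipolate or stripolate are implemented; the function name fill_groupwise is hypothetical, missing values are represented by None, and, unlike the Stata commands as described above, the sketch simply leaves a group untouched when it has two or more distinct non-missing values instead of stopping with an error.

```python
def fill_groupwise(ids, values):
    """Fill None within each ID group if the group has exactly one distinct value."""
    distinct = {}
    for group_id, value in zip(ids, values):
        if value is not None:
            distinct.setdefault(group_id, set()).add(value)

    filled = []
    for group_id, value in zip(ids, values):
        candidates = distinct.get(group_id, set())
        if value is None and len(candidates) == 1:
            filled.append(next(iter(candidates)))  # the single known value
        else:
            filled.append(value)  # already known, or the group is ambiguous
    return filled


# Data from the example above
ids = [1, 1, 2, 2, 2]
dum2a = [1, None, None, 0, 1]
dum3a = [None, 0, None, 1, None]
var1a = ["hi", "hi", "bye", None, None]

print(fill_groupwise(ids, dum2a))  # group 2 holds both 0 and 1, so it stays as is
print(fill_groupwise(ids, dum3a))
print(fill_groupwise(ids, var1a))
```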
Assuming your data is in df and missing values are stored as the string "." (as in the printed example):
library(dplyr)
df %>%
group_by(ID) %>%
mutate(dum1=dum1[dum1!="."][1],
dum2=dum2[dum2!="."][1],
dum3=dum3[dum3!="."][1],
var1=var1[var1!="."][1])
Using your toy example:
clear
input ID dum1a dum2a dum3a str3 var1a
1 0 1 . "hi"
1 0 . 0 "hi"
2 1 . . "bye"
2 0 0 1 "."
end
replace var1a = "" if var1a == "."
sort ID (dum2a)
list
+------------------------------------+
| ID dum1a dum2a dum3a var1a |
|------------------------------------|
1. | 1 0 1 . hi |
2. | 1 0 . 0 hi |
3. | 2 0 0 1 |
4. | 2 1 . . bye |
+------------------------------------+
In Stata you can do the following:
ds ID, not
local varlist `r(varlist)'
foreach var of local varlist {
generate `var'b = `var'
bysort ID (`var'): replace `var'b = cond(!missing(`var'[_n-1]), `var'[_n-1], ///
`var'[_n+1]) if missing(`var')
}
list ID dum?ab var?ab
+----------------------------------------+
| ID dum1ab dum2ab dum3ab var1ab |
|----------------------------------------|
1. | 1 0 1 0 hi |
2. | 1 0 1 0 hi |
3. | 2 0 0 1 bye |
4. | 2 1 0 1 bye |
+----------------------------------------+

A set of positive integers that do not begin with 0, except for 0

While working through exercises for a programming languages course, I wrote the grammar below. I know it cannot generate the string 201, but I can't see how to fix it.
Problem: L(G) is the set of non-negative decimal integers that do not start with 0, except for zero itself. Design the grammar G.
My answer:
G is:
S -> Digit
NonZeroDigit -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Digit -> 0 | NonZeroDigit | NonZeroDigit 0 | NonZeroDigit Digit
Check correctness:
Digit => 0
Digit => NonZeroDigit => 1
Digit => NonZeroDigit Digit => 2 Digit => 20
If I add Digit -> Digit Digit, I can derive Digit => Digit Digit => Digit Digit Digit => 201, but I can also derive Digit => Digit Digit => Digit Digit Digit => 000, which should not be in the language.
How do I change my grammar so that it meets the condition?
Why not just split the cases n = 0 and n > 0?
S -> 0 | posDig digit
posDig -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
digit -> digit digit | 0 | posDig | <epsilon>
Instead of (posDig digit) in S, you could also use a separate nonterminal, e.g. number (though 1 to 9 would now be a number as well).
From there on, you just need to make sure the first digit is not
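One way to sanity-check a candidate grammar is to compare it against a regular expression for the same language. The sketch below assumes the intended language is "0, or a nonzero digit followed by any digits"; the helper name in_language is illustrative.

```python
import re

# Either the single string "0", or a nonzero digit followed by any digits
NUMBER = re.compile(r"0|[1-9][0-9]*")


def in_language(s):
    """Return True if s is 0 or a decimal integer without a leading zero."""
    return NUMBER.fullmatch(s) is not None


for candidate in ["0", "7", "20", "201", "000", "01", ""]:
    print(repr(candidate), in_language(candidate))
```

The two alternatives of the pattern mirror the split into the cases n = 0 and n > 0.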

Get frequency counts for a subset of elements in a column

I may be missing some elegant ways in Stata to get to this example, which has to do with electrical parts and observed monthly failures etc.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
I would like to group by (bysort) PartID and record the most frequent FailType within each PartID. Ties can be broken arbitrarily, though preferably the lower value is picked.
I looked at groups etc., but do not know how to peel off certain elements from the result set, and that is my main question: if you execute a query, how do you select only the elements you want for the next computation? Something like n(0) is the count, n(1) is the mean, etc. I was able to use contract, bysort etc. to create a separate data set, which I then merged back into the main set as an extra column. There must be something simpler using gen or egen, so that there is no need to create an extra data set.
The expected results here will be:
PartID Freq
ABD 4 #(4 occurs twice)
ABC 2 #(tie broken with minimum)
BBB 0 #(0 occurs 3 times)
Please let me know how I can pick off the specific elements I need from a result set (which could come from duplicates report, tab, etc.).
Part II - Clarification: Perhaps I should have split the question into two parts. For example, suppose I issue this follow-up command after running your code: tabdisp Type, c(Freq). It may print out a nice table. Can I then use that (derived) table to perform more computations programmatically?
For example, get the first row of the table:
----------------------
     Type |      Freq
----------+-----------
        A |        -1
        B |        -1
        C |        -1
        D |        -3
        S |        -3
----------------------
I found this difficult to follow (see comment on question), but some technique is demonstrated here. The numbers of observations in subsets of observations defined by by: are given by _N. The rest is sorting tricks. Negating the frequency is a way to select the highest frequency and the lowest Type which I think is what you are after when splitting ties. Negating back gets you the positive frequencies.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
bysort PartID FailType: gen Freq = -_N
bysort PartID (Freq Type) : gen ToShow = _n == 1
replace Freq = -Freq
list PartID Type FailType Freq if ToShow
+---------------------------------+
| PartID Type FailType Freq |
|---------------------------------|
1. | ABC A 2 1 |
3. | ABD A 4 2 |
7. | BBB A 0 3 |
+---------------------------------+
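The same selection can be cross-checked with a small Python sketch: count each FailType per PartID, then pick the highest count. Note one assumption that differs from the Stata code above: ties are broken by the lower FailType, as the question prefers, rather than by Type. The function name most_frequent_failtype is illustrative and FailType is stored as an integer here.

```python
from collections import Counter

# (PartID, Type, FailType), copied from the example data
records = [
    ("ABD", "A", 4), ("BBB", "S", 0), ("ABD", "A", 3), ("ABD", "A", 4),
    ("ABC", "A", 2), ("BBB", "A", 0), ("ABD", "B", 1), ("ABC", "B", 7),
    ("BBB", "C", 1), ("BBB", "D", 0),
]


def most_frequent_failtype(records):
    """Map each PartID to (most frequent FailType, its frequency)."""
    counts = {}
    for part_id, _type, fail_type in records:
        counts.setdefault(part_id, Counter())[fail_type] += 1

    result = {}
    for part_id, counter in counts.items():
        # Highest frequency first; among ties, the lower FailType wins
        fail_type, freq = min(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        result[part_id] = (fail_type, freq)
    return result


print(most_frequent_failtype(records))
```

Sorting on the negated frequency is the same trick as gen Freq = -_N in the Stata code: the largest count sorts first.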

Build a SQL with sum

Here is my table - PK is (Con_num, version, operation) :
Con_num version operation amount
15 1 A 1
15 1 B 2
15 1 C 3
15 2 A 4
15 3 A 5
15 3 B 6
15 4 C 7
Con_num is the contract number.
version is the version number.
operation is just an ID for an operation.
amount is the amount of the operation.
I would like to have the total amount per version. The tricky part is this: for version 1, I just have to sum the amounts. For version 2, I need to take the version 2 line (operation = A) plus the two lines from version 1 with operation != A. Likewise, for version 3, I take the two lines of version 3 and only the line with operation = C from version 1. Any new operation invalidates the same operation from previous versions.
The result will be:
Con_num version amount
15 1 6 (1 + 2 + 3)
15 2 9 (4 + 2 + 3)
15 3 14 (5 + 6 + 3)
15 4 18 (5 + 6 + 7)
How can I do that ?
For each con_num and version add up all records
for the same con_num
with no version greater than the version in question
having the highest version per operation
Getting the amount of the record with the highest version can be done with Oracle's KEEP FIRST/LAST:
select
base.con_num,
base.version,
(
select sum(max(mytable.amount) keep (dense_rank last order by mytable.version))
from mytable
where mytable.con_num = base.con_num
and mytable.version <= base.version
group by mytable.con_num, mytable.operation
) as total
from (select distinct con_num, version from mytable) base;
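The rule behind this query (for each version, take the latest amount per operation among all versions up to that one, then sum) can be checked against the expected result with a minimal Python sketch. The function name totals_per_version and the tuple layout are illustrative assumptions; the rows are the ones from the question.

```python
# (con_num, version, operation, amount), copied from the question
rows = [
    (15, 1, "A", 1), (15, 1, "B", 2), (15, 1, "C", 3),
    (15, 2, "A", 4),
    (15, 3, "A", 5), (15, 3, "B", 6),
    (15, 4, "C", 7),
]


def totals_per_version(rows):
    """Map (con_num, version) to the sum of the latest amount per operation."""
    result = {}
    for con_num, version in sorted({(c, v) for c, v, _, _ in rows}):
        latest = {}  # operation -> (version, amount) of the newest record so far
        for c, v, operation, amount in rows:
            if c == con_num and v <= version:
                if operation not in latest or v > latest[operation][0]:
                    latest[operation] = (v, amount)
        result[(con_num, version)] = sum(a for _, a in latest.values())
    return result


print(totals_per_version(rows))
```

The inner loop rescans all rows for every version, which is fine for a sketch but quadratic in the number of rows.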
select
Con_num, version, orderno, a0+a1+a2 as amount
from (
select
Con_num, version, orderno
, lag(amount,2) over(partition by Con_num order by version, orderno) a2
, lag(amount,1) over(partition by Con_num order by version, orderno) a1
, amount a0
, row_number() over(partition by Con_num, version order by orderno desc) as rn
from table1
) d
where rn = 1
You seem to want only the "most recent" combinations of (Con_num, version, orderno) which can be identified using row_number() and the values required established using lag(,1) and lag(,2) but I don't reach the stated result.
result:
| con_num | version | orderno | amount |
|---------|---------|---------|--------|
| 15 | 1 | 3 | 37 |
| 15 | 2 | 1 | 42 |
| 15 | 3 | 2 | 35 |
sqlfiddle example
Using LAST_VALUE analytic function.
select con_num, version, q1+q2+q3
from (
select x.*,
last_value(case when operation = 1 then amount end) ignore nulls over (order by version) q1,
last_value(case when operation = 2 then amount end) ignore nulls over (order by version) q2,
last_value(case when operation = 3 then amount end) ignore nulls over (order by version) q3
from x
)
group by con_num,version, q1, q2, q3
order by con_num,version;
