Get frequency counts for a subset of elements in a column - count

I may be missing some elegant ways in Stata to get to this example, which has to do with electrical parts and observed monthly failures etc.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
I would like to group by (bysort) each PartID and record the highest frequency for FailType within each PartID type. Ties can be broken arbitrarily, and preferably, the lower one can be picked.
I looked at groups etc., but do not know how to peel off certain elements from the result set. So that is a major question for me. If you execute a query, how do you select only the elements you want for the next computation? Something like n(0) is the count, n(1) is the mean etc. I was able to use contract, bysort etc. and create a separate data set which I then merged back into the main set with an extra column There must be something simple using gen or egen so that there is no need to create an extra data set.
The expected results here will be:
PartID Freq
ABD 4 #(4 occurs twice)
ABC 2 #(tie broken with minimum)
BBB 0 #(0 occurs 3 times)
Please let me know how I can pick off specific elements that I need from a result set (can be from duplicate reports, tab etc.)
Part II - Clarification: Perhaps I should have clarified and split the question into two parts. For example, if I issue this followup command after running your code: tabdisp Type, c(Freq). It may print out a nice table. Can I then use that (derived) table to perform more computations programatically?
For example get the first row of the table.
Table. ----------------------
Type| Freq ----------+-----------
A | -1
B | -1
C | -1
D | -3
S | -3
---------------------- –

I found this difficult to follow (see comment on question), but some technique is demonstrated here. The numbers of observations in subsets of observations defined by by: are given by _N. The rest is sorting tricks. Negating the frequency is a way to select the highest frequency and the lowest Type which I think is what you are after when splitting ties. Negating back gets you the positive frequencies.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
bysort PartID FailType: gen Freq = -_N
bysort PartID (Freq Type) : gen ToShow = _n == 1
replace Freq = -Freq
list PartID Type FailType Freq if ToShow
+---------------------------------+
| PartID Type FailType Freq |
|---------------------------------|
1. | ABC A 2 1 |
3. | ABD A 4 2 |
7. | BBB A 0 3 |
+---------------------------------+

Related

Returning count for each id in table

Im using sqlite3 with my node.js API.
I have a DB talbe structured below:
id | colour
___|_______
1 | blue
1 | red
1 | green
2 | yellow
2 | green
5 | red
I want to return a count of the IDs in my table such that
1 - 3 occurences
2 - 2 occurences
5 - 1 occurence
Is there a sql qualifier I can use count like this, or will this need to be done within the js iteself?
Any help here would be awesome!
You can use COUNT with GROUP BY
select id, COUNT(id) from tbl GROUP BY id

Searching a vector/data table backwards in R

Basically, I have a very large data frame/data table and I would like to search a column for the first, and closest, NA value which is less than my current index position.
For example, let's say I have a data frame DF as follows:
INDEX | KEY | ITEM
----------------------
1 | 10 | AAA
2 | 12 | AAA
3 | NA | AAA
4 | 18 | AAA
5 | NA | AAA
6 | 24 | AAA
7 | 29 | AAA
8 | 31 | AAA
9 | 34 | AAA
From this data frame we have an NA value at index 3 and at index 5. Now, let's say we start at index 8 (which has KEY of 31). I would like to search the column KEY backwards such that the moment it finds the first instance of NA the search stops, and the index of the NA value is returned.
I know there are ways to find all NA values in a vector/column (for example, I can use which(is.na(x)) to return the index values which have NA) but due to the sheer size of the data frame I am working and due to the large number of iterations that need to be performed this is a very inefficient way of doing it. One method I thought of doing is creating a kind of "do while" loop and it does seem to work, but this again seems quite inefficient since it needs to perform calculations each time (and given that I need to do over 100,000 iterations this does not look like a good idea).
Is there a fast way of searching a column backwards from a particular index such that I can find the index of the closest NA value?
Why not do a forward-fill of the NA indexes once, so that you can then look up the most recent NA for any row in future:
library(dplyr)
library(tidyr)
df = df %>%
mutate(last_missing = if_else(is.na(KEY), INDEX, as.integer(NA))) %>%
fill(last_missing)
Output:
> df
INDEX KEY ITEM last_missing
1 1 10 AAA NA
2 2 12 AAA NA
3 3 NA AAA 3
4 4 18 AAA 3
5 5 NA AAA 5
6 6 24 AAA 5
7 7 29 AAA 5
8 8 31 AAA 5
9 9 34 AAA 5
Now there's no need to recalculate every time you need the answer for a given row. There may be more efficient ways to do the forward fill, but I think exploring those is easier than figuring out how to optimise the backward search.

Create a calculated field in tableau

I have a set of data in the following format:
Items Shipped | Month
A 1
B 1
C 1
D 2
E 2
F 3
G 3
H 3
I would like to show the count of items shipped each month using a calculated field in Tableau.
Item_Count | Month
3 1
2 2
3 3
Any Suggestions?
You should probably have a look on the Tableau page for their basic tutorials:
https://www.tableau.com/learn/training
Drag the [month] pill to row (if it's an actual date, change it to discrete month, otherwise leave it like it is)
Drag the [item_count] to columns, click on it and change it to COUNT or COUNTD depending whether you want the total count or only the distinct elements.

dplyr filter function gives wrong data

I have the following dataset: (sample)
Team Job Question Answer
1 1 1 2 1
2 1 1 3a 2
3 1 1 3b 2
4 1 1 4a 1
5 1 1 4b 1
and I have 21 teams so there are many rows. I am trying filter the rows of the teams which did good in the experiment (with the dplyr package):
q10best <- filter(quest,Team==c(2,4,6,10,13,17,21))
But it gives me messed up data and with many missing rows.
On the other hand, when I use:
q10best <- filter(quest,Team==2 | Team==4 | Team==6 | Team==10 | Team==13 | Team==17 | Team==21)
It gives me the right dataset that I want. What is the difference? what am I doing wrong in the first command?
Thanks
== checks if two objects are exactly the same. You are trying to check if one object (each element of quest$Team) belongs to a list of value. The proper way to do that is to use %in%
q10best <- filter(quest,Team %in% c(2,4,6,10,13,17,21))

Tabulating association frequency counts

I have data which is in this format:
User Item
1 A
1 B
1 C
1 D
2 A
2 C
2 E
What I want to get is a frequency count for each pair. Order is not important so I don't want to count the inverse. I want to end up with a result similar to this, where the frequency counts are partitioned by user.
Pair Frequency
AB 1
AC 2
AD 1
AE 1
BC 1
BD 1
BE 0
CD 1
CE 1
What tool can I use to formulate this kind of table? I'd prefer some open source solution if possible.
Edit- Added example for my comment below
I'm reading in data from a CSV file using the following two lines and removing the factors with these two steps in code.
xa<-read.csv("C:/Direcotry/MyData.csv")
xa<-data.frame(lapply(xa, as.character), stringsAsFactors=FALSE)
User Item
1 394324 Item A
2 124209 Item B
3 212457 Item C
4 427052 Item A
5 118281 Item D
6 156831 Item A
7 212442 Item E
8 156831 Item B
9 212442 Item A
10 177734 Item C
When I try running suggested answer, I get an error with this result:
Error in combn(x, 2) : n < m
Well R is open source.
Here's an example based on your tiny sample of data:
Here I just read your data in by copypasting it straight from your post:
> xa=read.table(stdin(),header=TRUE,as.is=TRUE)
0: User Item
1: 1 A
2: 1 B
3: 1 C
4: 1 D
5: 2 A
6: 2 C
7: 2 E
8:
So that's the data in. Then with a couple of lines of code:
> f=function(x) apply(combn(x,2),2,paste0,collapse="")
> table(unlist(tapply(xa$Item,xa$User,f)))
AB AC AD AE BC BD CD CE
1 2 1 1 1 1 1 1
If you need all the empty combinations explicitly as zeroes it takes another line or two (you need to generate all the possible combinations as a factor, rather than just the observed ones and tell table to include the empty ones).
After some research and suggestions by Glen, I came up with the following code which gets me a 3 column CSV file with the pair combination plus frequency count. If anyone sees a better way, let me know! But this seems to work.
The errors I was referring to in my follow up comments were caused by users having purchased only at one location.
library(reshape2)
xa<-read.csv("C:/Input.csv",as.is=TRUE)
xa=xa[!duplicated(xa),]
xa<-data.table(xa)
setkey(xa,ContactId,PurchaseLocation)
tab=table(xa$ContactId)
xa=xa[xa$ContactId %in% names(tab[tab>1]),]
f=function(x) apply(combn(x,2),2,paste0,collapse="--")
xb<-as.data.frame(table(unlist(tapply(xa$PurchaseLocation,xa$ContactId,f))))
xc=with(xb, cbind(Freq, colsplit(xb$Var1, pattern = "--", names = c('a', 'b'))))
xc=subset(xc,a!=b & a!="" & b!="" & Freq>1)
write.csv(xc,file="C:/Output.csv")
Edit- I made a small change to make it order independent by sorting the data table on a key.

Resources