Create function to count occurrences within groups in R - r

I have a dataset with a unique ID for groups of patients called match_no and i want to count how many patients got sick in two different years by running a loop function to count the occurrences in a large dataset
for (i in db$match_no){(with(db, sum(db$TBHist16 == 1 & db$match_no == i))}
This is my attempt. I need i to cycle through each of the match numbers and count how many TB occurrences there was.
Can anyone correct my formula please.
Example here
df1 <- data.frame(Match_no = c(1, 1,1,1,1,2,2,2,2,2, 3,3,3,3,3, 4,4,4,4,4, 5,5,5,5,5),
var1 = c(1,1,1,0,0,1,1,1,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,0,1))
I want to count how many 1 values there are in each match number.
Thank you

Some ideas:
Simple summary of all Match_no values:
xtabs(~var1 + Match_no, data = df1)
# Match_no
# var1 1 2 3 4 5
# 0 2 2 1 3 1
# 1 3 3 4 2 4
Same as 1, but with a subset:
xtabs(~ Match_no, data = subset(df1, var1 == 1))
# Match_no
# 1 2 3 4 5
# 3 3 4 2 4
Results in a frame:
aggregate(var1 ~ Match_no, data = subset(df1, var1 == 1), FUN = length)
# Match_no var1
# 1 1 3
# 2 2 3
# 3 3 4
# 4 4 2
# 5 5 4

In base R you can use aggregate and sum:
aggregate(var1 ~ Match_no, data = df1, FUN = sum)
Match_no var1
1 1 3
2 2 3
3 3 4
4 4 2
5 5 4

Related

Select Random Consecutive Rows Per Group

I have data which is grouped by 'student_id':
my_data = data.frame(student_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
exam_no = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
result = rnorm(15,60,10))
my_data
student_id exam_no result
1 1 1 56.60374
2 1 2 55.76655
3 1 3 53.81728
4 1 4 74.82202
5 1 5 34.91834
6 2 1 58.32422
7 2 2 60.38213
8 2 3 49.40390
9 2 4 63.85426
10 2 5 40.32912
11 3 1 69.54969
12 3 2 43.36639
13 3 3 37.97265
14 3 4 52.36436
15 3 5 61.62080
My Question:
For each student, I want to select a set of consecutive rows, with random start and end rows.
For example, keep exams 2-4 for student 1, keep exams 2-5 for student 2, etc.
I thought of the following way to do this:
Create a data frame that contains the max number of exams each student takes (in my problem, each student takes the same number of exams, but in the future this could be different)
library(dplyr)
counts = my_data %>% group_by(student_id) %>% summarise(counts = n())
# create variables that indicate where to start ("min") and where to end ("max") for each student
counts$min = sample(1:counts$counts, 1)
counts$max = sample(counts$min:counts$counts,1)
From here, I was then going to write a loop that would select rows between "min" and "max" index for each student (e.g. my_data[min:max]), but the results from the previous code are giving me warnings and illogical results:
Warning message:
In 1:counts$counts :
numerical expression has 3 elements: only the first used
Warning messages:
1: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
2: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
# A tibble: 3 x 4
student_id counts min max
<dbl> <int> <int> <int>
1 1 5 4 5
2 2 5 4 5
3 3 5 4 5
I am not sure how to continue this - can someone please show me how to continue?
Thanks!
A base R option using cumsum to label the in-between consecutive rows
subset(
my_data,
ave(
exam_no,
student_id,
FUN = function(x) cumsum(seq_along(x) %in% sample.int(length(x), 2))
) == 1
)
which gives, for example
student_id exam_no result
2 1 2 61.83643
3 1 3 51.64371
4 1 4 75.95281
6 2 1 51.79532
7 2 2 64.87429
8 2 3 67.38325
11 3 1 75.11781
12 3 2 63.89843
13 3 3 53.78759
A more compact version by data.table with a similar idea as above is
library(data.table)
setDT(my_data)[, .SD[cumsum((1:.N) %in% sample.int(.N, 2)) == 1], student_id]
Using data.table, within each group, sample two values from .I (without replacement), and create a sequence of indices.
library(data.table)
setDT(my_data)
set.seed(3)
my_data[my_data[ , {ix = sample(.I, 2); ix[1]:ix[2]}, by = student_id]$V1]
# student_id exam_no result
# <num> <num> <num>
# 1: 1 5 74.05672
# 2: 1 4 49.37525
# 3: 1 3 67.41662
# 4: 1 2 67.64935
# 5: 2 4 55.15337
# 6: 2 3 58.95694
# 7: 3 4 50.79859
# 8: 3 3 53.66886
# 9: 3 2 47.01089

Subset data frame that include a variable

I have a list of events and sequences. I would like to print the sequences in a separate table if event = x is included somewhere in the sequence. See table below:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4
In this case I would like a new table that includes only the sequences where Event=x was included:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
Base R solution:
d[d$Sequence %in% d$Sequence[d$Event == "x"], ]
Event Sequence
1: a 1
2: a 1
3: x 1
4: a 3
5: a 3
6: x 3
data.table solution:
library(data.table)
setDT(d)[Sequence %in% Sequence[Event == "x"]]
As you can see syntax/logic is quite similar between these two solutions:
Find event's that are equal to x
Extract their Sequence
Subset table according to specified Sequence
We can use dplyr to group the data and filter the sequence with any "x" in it.
library(dplyr)
df2 <- df %>%
group_by(Sequence) %>%
filter(any(Event %in% "x")) %>%
ungroup()
df2
# A tibble: 6 x 2
Event Sequence
<chr> <int>
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
DATA
df <- read.table(text = " Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4",
header = TRUE, stringsAsFactors = FALSE)

R grouping with Select the rows with Limited

The sample data frame
grp = c(1,1,1, 1,1,2,2,2,2,2, 2,2)
val = c(2,1,5,NA,3,NA,1)
dta = data.frame(grp=grp, val=val)
The results should look like this:
# The max number of count is 3
grp count
1 3
1 2
2 3
2 3
2 1
Here's a way with base R. We first count the repeated measures with rle. Then use a custom function that combines 3 with the remainder of the division. Finally we combine to form a new data frame:
grp = c(1,1,1, 1,1,2,2,2,2,2,2,2)
fun3 <- function(x) c(rep(3, floor(x/3)), x %% 3)
len <- rle(grp)$lengths
ans <- lapply(len, fun3)
cbind.data.frame(grp=rep(unique(grp), lengths(ans)), count=unlist(ans))
# grp count
# 1 1 3
# 2 1 2
# 3 2 3
# 4 2 3
# 5 2 1

Remover observations for which there is not a duplicate

I would like to break a dataset into two frames - one for which the original dataset has duplicate observations based on a condition and one for which the original dataset does not have duplicate observations based on a condition. In the following example, I would like to break the frame into one for which there is only one coder for an observation and one for which there are two coders::
frame <- data.frame(id = c(1,1,1,2,2,3), coder = c("A", "A", "B", "A", "B", "A"), y = c(4,5,4,1,1,2))
frame
For this, I would like to produce, such that:
frame1:
id coder y
1 1 A 4
2 1 A 5
3 1 B 4
4 2 A 1
5 2 B 1
frame2:
6 3 A 2
You can use aggregate to determine the ids you want in each data frame:
cts <- aggregate(coder~id, frame, function(x) length(unique(x)))
cts
# id coder
# 1 1 2
# 2 2 2
# 3 3 1
Then you can subset as appropriate based on this:
subset(frame, id %in% cts$id[cts$coder >= 2])
# id coder y
# 1 1 A 4
# 2 1 A 5
# 3 1 B 4
# 4 2 A 1
# 5 2 B 1
subset(frame, id %in% cts$id[cts$coder < 2])
# id coder y
# 6 3 A 2
You may also try:
indx <- !colSums(!table(frame$coder, frame$id))
frame[frame$id %in% names(indx)[indx],]
# id coder y
#1 1 A 4
#2 1 A 5
#3 1 B 4
#4 2 A 1
#5 2 B 1
frame[frame$id %in% names(indx)[!indx],]
# id coder y
#6 3 A 2
Explanation
table(frame$coder, frame$id)
# 1 2 3
# A 2 1 1
# B 1 1 0 #Here for id 3, B==0
If we Negate that, the result would be a logical index
!table(frame$coder, frame$id).
Do the colSums of the above, which results
# 1 2 3
# 0 0 1
Negate again and get the index for ids and subset those ids which are TRUE
From this you can subset by matching with the names of the ids

aggregating data frames by sum

Say I have two data frames as follows.
d1 = data.frame(table(c(1,2,2,2,4,5)))
d2 = data.frame(table(c(2,2,2,6)))
Combing the frames with rbind gives the following:
> rbind(d1, d2)
Var1 Freq
1 1 1
2 2 3
3 4 1
4 5 1
5 2 3
6 5 1
But what I would like is to calculate the sum of the Freq values with the same Var1, i.e. get
Var1 Freq
1 1 1
2 2 6
3 4 1
4 5 1
5 6 1
How can I accomplish this?
In addition to aggregate, there is also xtabs which is designed specifically for summing up tables.
xtabs(Freq ~ Var1, data=rbind(d1, d2))

Resources