R: grouping rows with a limited count per row

The sample data frame
grp = c(1,1,1, 1,1,2,2,2,2,2, 2,2)
val = c(2,1,5,NA,3,NA,1)
dta = data.frame(grp=grp, val=val)
The results should look like this:
# The maximum count per row is 3
grp count
1 3
1 2
2 3
2 3
2 1

Here's a way with base R. First, count the length of each run of grp with rle. Then apply a custom function that splits each run length into as many 3s as fit, plus the remainder of the division by 3. Finally, combine the pieces into a new data frame:
grp = c(1,1,1, 1,1,2,2,2,2,2,2,2)
fun3 <- function(x) c(rep(3, x %/% 3), if (x %% 3 > 0) x %% 3) # drop a zero remainder
len <- rle(grp)$lengths
ans <- lapply(len, fun3)
cbind.data.frame(grp=rep(unique(grp), lengths(ans)), count=unlist(ans))
#   grp count
# 1   1     3
# 2   1     2
# 3   2     3
# 4   2     3
# 5   2     1
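The same idea can be written with the chunk size as a parameter. This is only a minimal sketch: the helper name chunk_counts and its argument k are hypothetical (they are not part of the answer above), and the group labels are taken from rle()$values so that a group appearing in more than one run is still labelled correctly.

```r
# Hypothetical helper (not from the original answer): split a run length x
# into chunks of at most k, omitting the remainder when it is zero.
chunk_counts <- function(x, k = 3) {
  r <- x %% k
  c(rep(k, x %/% k), if (r > 0) r)
}

grp <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)
runs <- rle(grp)                          # run lengths 5 and 7
parts <- lapply(runs$lengths, chunk_counts)
res <- data.frame(grp = rep(runs$values, lengths(parts)),
                  count = unlist(parts))
res$count
# 3 2 3 3 1
```

With k = 3 this reproduces the table above; a run length that is an exact multiple of k, such as 6, yields c(3, 3) with no trailing zero.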

Related

R: Why is expand.grid() producing many more rows than I expect?

My understanding is that base::expand.grid() and tidyr::expand_grid() will return an object with one row for each unique combination of the unique values across one or more vectors. For example, here is what I expect:
# Preliminaries
library(tidyr)
set.seed(123)
# Simulate data
df <- data.frame(x = as.factor(rep(c(1,2), 50)), y= as.factor(sample(1:3, 100, replace = T)))
# Expected result
data.frame(x = rep(1:2, 3), y = rep(1:3, 2)) # 6 rows!
However, when I actually use the functions, I get many more (duplicated) rows than I expect:
# Tidyverse result
tidyr::expand_grid(df) # produces 100 rows!
tidyr::expand_grid(df$x, df$y) # produces 10k rows!
# Base R version
base::expand.grid(df) # produces 10k rows!
base::expand.grid(df$x, df$y) # produces 10k rows!
# Solution...but why do I have to do this?!
unique(base::expand.grid(df))
Can someone explain what I am missing about what it is supposed to do?
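The input to expand_grid is variadic (...), so we can use do.call to pass each column as a separate argument:
do.call(expand_grid, df)
Or with invoke from purrr (deprecated since purrr 1.0.0 in favour of exec):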
library(purrr)
invoke(expand_grid, df)
# A tibble: 10,000 × 2
   x     y
   <fct> <fct>
 1 1     3
 2 1     3
 3 1     3
 4 1     2
 5 1     3
 6 1     2
 7 1     2
 8 1     2
 9 1     3
10 1     1
# … with 9,990 more rows
Or with !!!
expand_grid(!!! df)
# A tibble: 10,000 × 2
   x     y
   <fct> <fct>
 1 1     3
 2 1     3
 3 1     3
 4 1     2
 5 1     3
 6 1     2
 7 1     2
 8 1     2
 9 1     3
10 1     1
# … with 9,990 more rows
As @Mossa commented, the functions that return unique combinations are expand and crossing, because expand calls expand_grid on the unique values:
> expand(df, df)
# A tibble: 6 × 2
  x     y
  <fct> <fct>
1 1     1
2 1     2
3 1     3
4 2     1
5 2     2
6 2     3
Based on the source code
getAnywhere("expand.data.frame")
function (data, ..., .name_repair = "check_unique")
{
out <- grid_dots(..., `_data` = data)
out <- map(out, sorted_unique)
out <- expand_grid(!!!out, .name_repair = .name_repair)
reconstruct_tibble(data, out)
}
expand.grid makes no attempt to return only unique values of the input vectors. It always outputs a data frame whose number of rows equals the product of the lengths of its input vectors:
nrow(expand.grid(1:10, 1:10, 1:10))
#> [1] 1000
nrow(expand.grid(1, 1, 1, 1, 1, 1, 1, 1, 1))
#> [1] 1
If you look at the source code for expand.grid, it takes the variadic dots and turns them into a list called args. It then includes the line:
d <- lengths(args)
which returns a vector with one entry for each vector that we feed into expand.grid. In the case of expand.grid(df$x, df$y), d would be equivalent to c(100, 100).
There then follows the line
orep <- prod(d)
which gives us the product of d, which is 100x100, or 10,000.
The variable orep is used later in the function to repeat each vector so that its length is equal to the value orep.
If you only want unique combinations of the two input vectors, then you must make them unique at the input to expand.grid.
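A minimal sketch of that last point, using a deterministic stand-in for the question's data (the y column here is an assumption made so the row counts are reproducible, rather than the random sample from the question):

```r
# Deterministic stand-in for the question's df: 100 rows, 2 x-levels, 3 y-levels
df <- data.frame(x = as.factor(rep(c(1, 2), 50)),
                 y = as.factor(rep(1:3, length.out = 100)))

# Passing the full columns: 100 * 100 rows, duplicates included
grid_all <- expand.grid(x = df$x, y = df$y)

# Passing the unique values: 2 * 3 rows, one per combination
grid_unique <- expand.grid(x = unique(df$x), y = unique(df$y))

nrow(grid_all)     # 10000
nrow(grid_unique)  # 6
```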

Create function to count occurrences within groups in R

I have a dataset with a unique ID for groups of patients called match_no, and I want to count how many patients got sick in two different years. I am trying to do this with a loop that counts the occurrences in a large dataset:
for (i in db$match_no){(with(db, sum(db$TBHist16 == 1 & db$match_no == i))}
This is my attempt. I need i to cycle through each of the match numbers and count how many TB occurrences there were.
Can anyone correct my formula, please?
Example here:
df1 <- data.frame(Match_no = c(1, 1,1,1,1,2,2,2,2,2, 3,3,3,3,3, 4,4,4,4,4, 5,5,5,5,5),
var1 = c(1,1,1,0,0,1,1,1,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,0,1))
I want to count how many 1 values there are in each match number.
Thank you
Some ideas:
1. Simple summary of all Match_no values:
xtabs(~var1 + Match_no, data = df1)
#     Match_no
# var1 1 2 3 4 5
#    0 2 2 1 3 1
#    1 3 3 4 2 4
2. Same as 1, but with a subset:
xtabs(~ Match_no, data = subset(df1, var1 == 1))
# Match_no
# 1 2 3 4 5
# 3 3 4 2 4
3. Results in a frame:
aggregate(var1 ~ Match_no, data = subset(df1, var1 == 1), FUN = length)
#   Match_no var1
# 1        1    3
# 2        2    3
# 3        3    4
# 4        4    2
# 5        5    4
In base R you can use aggregate and sum:
aggregate(var1 ~ Match_no, data = df1, FUN = sum)
  Match_no var1
1        1    3
2        2    3
3        3    4
4        4    2
5        5    4
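If you want to stay close to the loop in the question, one option is to iterate over each distinct match number once rather than over every row. This is only a sketch on the example data; split() plus sapply() is a swapped-in technique, not something taken from the answers above.

```r
df1 <- data.frame(
  Match_no = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,
               4, 4, 4, 4, 4, 5, 5, 5, 5, 5),
  var1 = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1,
           1, 0, 0, 0, 1, 1, 1, 1, 0, 1)
)

# split() gives one vector of var1 values per Match_no;
# sapply() then counts the 1s in each of them.
counts <- sapply(split(df1$var1, df1$Match_no), function(v) sum(v == 1))
counts
# counts for Match_no 1..5: 3 3 4 2 4
```

The result is a named vector, with the match numbers as names, so counts[["3"]] returns the count for match number 3.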

Avoid Loop in Slicing Operation

I have the following code that I execute using a for loop. Is there a way to accomplish the same without a for loop?
first_list <- c(1,2,3, rep(1,5), rep(2,5), rep(3,5), rep(4,5))
print(first_list)
 [1] 1 2 3 1 1 1 1 1 2 2 2 2 2
[14] 3 3 3 3 3 4 4 4 4 4
breaks <- c(rep(1,3), rep(5,4))
values <- vector()
i <- 1
prev <- 1
for (n in breaks){
values[i] <- sum(first_list[prev:sum(breaks[1:i])])
i <- i + 1
prev <- prev + n
}
print(values)
[1] 1 2 3 5 10 15 20
The purpose of the loop is to keep the first three elements of the vector as they are, then append the sums of the next four blocks of five elements.
You can use tapply for a grouped operation:
tapply(first_list, rep(seq_along(breaks), breaks), sum)
or, preferably, using data.table:
library(data.table)
data.table(first_list, id=rep(1:length(breaks), breaks))[, sum(first_list), id]$V1
If you have to perform it on your data as in your original post:
setDT(mydata)
mydata[, id := rep(seq_along(breaks), breaks)][, sum(Freq), by = id]
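Another loop-free formulation, not from the answers above and offered only as a sketch: take the running total of the vector, read it off at the end of each block, and difference the result. It is checked here against the tapply version on the question's data.

```r
first_list <- c(1, 2, 3, rep(1, 5), rep(2, 5), rep(3, 5), rep(4, 5))
breaks <- c(rep(1, 3), rep(5, 4))

ends <- cumsum(breaks)                  # last index of each block: 1 2 3 8 13 18 23
running <- cumsum(first_list)           # running total of the whole vector
values <- diff(c(0, running[ends]))     # block sums as differences of block-end totals
values
# 1 2 3 5 10 15 20

# Same result as the grouped tapply approach
via_tapply <- as.vector(tapply(first_list, rep(seq_along(breaks), breaks), sum))
```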

How to assign a counter to a specific subset of a data.frame which is defined by a factor combination?

My question is: I have a data frame with some factor variables. I now want to assign a new vector to this data frame, which creates an index for each subset of those factor variables.
data <-data.frame(fac1=factor(rep(1:2,5)), fac2=sample(letters[1:3],10,rep=T))
Gives me something like:
   fac1 fac2
1     1    a
2     2    c
3     1    b
4     2    a
5     1    c
6     2    b
7     1    a
8     2    a
9     1    b
10    2    c
And what I want is a combination counter which counts the occurrence of each factor combination. Like this:
   fac1 fac2 counter
1     1    a       1
2     2    c       1
3     1    b       1
4     2    a       1
5     1    c       1
6     2    b       1
7     1    a       2
8     2    a       2
9     1    b       2
10    2    c       2
So far I thought about using tapply to get the counter over all factor-combinations, which works fine
counter <-tapply(data$fac1, list(data$fac1,data$fac2), function(x) 1:length(x))
But I do not know how I can assign the counter list (e.g. unlisted) to the combinations in the data-frame without using inefficient looping :)
This is a job for the ave() function:
# Use set.seed for reproducible examples
# when random number generation is involved
set.seed(1)
myDF <- data.frame(fac1 = factor(rep(1:2, 7)),
fac2 = sample(letters[1:3], 14, replace = TRUE),
stringsAsFactors=FALSE)
myDF$counter <- ave(myDF$fac2, myDF$fac1, myDF$fac2, FUN = seq_along)
myDF
#    fac1 fac2 counter
# 1     1    a       1
# 2     2    b       1
# 3     1    b       1
# 4     2    c       1
# 5     1    a       2
# 6     2    c       2
# 7     1    c       1
# 8     2    b       2
# 9     1    b       2
# 10    2    a       1
# 11    1    a       3
# 12    2    a       2
# 13    1    c       2
# 14    2    b       3
Note the use of stringsAsFactors=FALSE in the data.frame() step (this is the default since R 4.0.0). Without it, fac2 would be a factor, and you can still get the output with: myDF$counter <- ave(as.character(myDF$fac2), myDF$fac1, myDF$fac2, FUN = seq_along).
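Because ave() returns a vector of the same type as its first argument, the counter above ends up stored as character when fac2 is character. A small variation, sketched here on the ten rows shown in the question (the object name dat is just for illustration), passes a row index as the first argument instead, so the counter is an integer whether fac2 is a character vector or a factor:

```r
# The ten rows shown in the question, typed in by hand
dat <- data.frame(
  fac1 = factor(c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2)),
  fac2 = c("a", "c", "b", "a", "c", "b", "a", "a", "b", "c")
)

# The first argument only fixes the length and type of the result;
# the grouping comes from fac1 and fac2.
dat$counter <- ave(seq_len(nrow(dat)), dat$fac1, dat$fac2, FUN = seq_along)
dat$counter
# 1 1 1 1 1 1 2 2 2 2
```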
A data.table solution
library(data.table)
DT <- data.table(data)
DT[, counter := seq_len(.N), by = list(fac1, fac2)]
This is a base R way that avoids (explicit) looping.
data$counter <- with(data, {
inter <- as.character(interaction(fac1, fac2))
names(inter) <- seq_along(inter)
inter.ordered <- inter[order(inter)]
counter <- with(rle(inter.ordered), unlist(sapply(lengths, sequence)))
counter[match(names(inter), names(inter.ordered))]
})
Here is a variant with a little looping (I have renamed your variable to "x" since "data" is already the name of a base R function):
x <- data.frame(fac1 = rep(1:2, 5), fac2 = sample(letters[1:3], 10, replace = TRUE))
x$fac3 <- paste( x$fac1, x$fac2, sep="" )
x$ctr <- 1
y <- table( x$fac3 )
for( i in 1 : length( rownames( y ) ) )
x$ctr[x$fac3 == rownames(y)[i]] <- 1:length( x$ctr[x$fac3 == rownames(y)[i]] )
x <- x[-3]
No idea whether this is efficient over a large data.frame but it works!

Cumulative count of each value [duplicate]

I want to create a cumulative counter of the number of times each value appears.
e.g. say I have the column:
id
1
2
3
2
2
1
2
3
This would become:
id count
1 1
2 1
3 1
2 2
2 3
1 2
2 4
3 2
etc...
The ave function computes a function by group.
> id <- c(1,2,3,2,2,1,2,3)
> data.frame(id,count=ave(id==id, id, FUN=cumsum))
  id count
1  1     1
2  2     1
3  3     1
4  2     2
5  2     3
6  1     2
7  2     4
8  3     2
I use id==id to create a vector of all TRUE values, which get converted to numeric when passed to cumsum. You could replace id==id with rep(1,length(id)).
Here is a way to get the counts:
id <- c(1,2,3,2,2,1,2,3)
sapply(1:length(id),function(i)sum(id[i]==id[1:i]))
Which gives you:
[1] 1 1 1 2 3 2 4 2
The dplyr way:
library(dplyr)
foo <- data.frame(id=c(1, 2, 3, 2, 2, 1, 2, 3))
foo <- foo %>% group_by(id) %>% mutate(count=row_number())
foo
# A tibble: 8 x 2
# Groups:   id [3]
     id count
  <dbl> <int>
1     1     1
2     2     1
3     3     1
4     2     2
5     2     3
6     1     2
7     2     4
8     3     2
That ends up grouped by id. If you want it not grouped, add %>% ungroup().
For completeness, adding a data.table way:
library(data.table)
DT <- data.table(id = c(1, 2, 3, 2, 2, 1, 2, 3))
DT[, count := seq(.N), by = id][]
Output:
   id count
1:  1     1
2:  2     1
3:  3     1
4:  2     2
5:  2     3
6:  1     2
7:  2     4
8:  3     2
The dataframe I had was too large and the accepted answer kept crashing. This worked for me:
library(plyr)
df$ones <- 1
df <- ddply(df, .(id), transform, cumulative_count = cumsum(ones))
df$ones <- NULL
Function to get the cumulative count of any array, including a non-numeric array:
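cumcount <- function(x){
  counts <- numeric(length(x))
  names(counts) <- x
  # seq_along() avoids the 1:0 trap when x is empty
  for(i in seq_along(x)){
    # how many times x[i] has appeared up to and including position i
    counts[i] <- sum(x[1:i] == x[i])
  }
  return(counts)
}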
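The loop compares each element with everything before it, so its cost grows quadratically with the length of the input. For long vectors, a grouped version along the lines of the ave() answer should scale better. A minimal sketch, assuming no NA values in x (the name cumcount_vec is hypothetical):

```r
# Hypothetical vectorised counterpart to the loop-based function
cumcount_vec <- function(x) {
  # Number the positions within each group of equal values
  ave(seq_along(x), x, FUN = seq_along)
}

x <- c("a", "b", "a", "c", "b", "a")
cumcount_vec(x)
# 1 1 2 1 2 3
```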
