Repeat sequence by group - r

I have the following dataframe:
a <- data.frame(
group1=factor(rep(c("a","b"),each=6,times=1)),
time=rep(1:6,each=1,times=2),
newcolumn = c(1,1,2,2,3,3,1,1,2,2,3,3)
)
I'm looking to replicate the output of newcolumn with a rep-by-group operation (the time variable is there for ordering purposes). In other words, for each group, ordered by time, how can I assign a sequence 1,1,2,2,...,n,n? I also need a general solution for cases where groups have differing numbers of rows, or where I want to repeat each value 3, 10, or n times.
For instance, I can generate that sequence with this:
newcolumn=rep(1:3,each=2,times=2)
But that wouldn't work in a group-by statement where the groups of group1 have differing numbers of rows.

We specify length.out in the rep call after grouping by 'group1':
library(dplyr)
a %>%
group_by(group1) %>%
mutate(new = rep(seq_len(n()/2), each = 2, length.out = n()))
NOTE: each and length.out are used here rather than times; once length.out fixes the output length, times is not needed.
EDIT: Based on comments from @r2evans

A data.table alternative:
library(data.table)
DT <- as.data.table(a[1:2])
DT[order(time),newcolumn := rep(seq_len(.N/2), each=2, length.out=.N),by=c("group1")]
DT
# group1 time newcolumn
# 1: a 1 1
# 2: a 2 1
# 3: a 3 2
# 4: a 4 2
# 5: a 5 3
# 6: a 6 3
# 7: b 1 1
# 8: b 2 1
# 9: b 3 2
# 10: b 4 2
# 11: b 5 3
# 12: b 6 3
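If the repeat count needs to be something other than 2 (say 3 or 10), the same length.out idea generalizes. A minimal sketch, where rep_by_group and its m argument are hypothetical names of my own, and rows are assumed to already be ordered by time within each group (arrange first if they are not):
library(dplyr)
rep_by_group <- function(data, group, m = 2) {
  # m = how many times each value of the sequence repeats within a group
  data %>%
    group_by({{ group }}) %>%
    mutate(new = rep(seq_len(ceiling(n() / m)), each = m, length.out = n())) %>%
    ungroup()
}
rep_by_group(a, group1, m = 2)  # reproduces newcolumn: 1,1,2,2,3,3 per group
rep_by_group(a, group1, m = 3)  # 1,1,1,2,2,2 per group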

Related

Group observations into specified number of groups according to id with data.table solution

I have the following data.table:
dt <- data.table(id = rep(1:5, 5), obs = rnorm(1, n = 25))[order(id)]
dt
id obs
1: 1 0.1470735
2: 1 1.6954685
3: 1 2.3947260
4: 1 2.1782338
5: 1 0.5168873
6: 2 -0.8879545
7: 2 1.9320034
8: 2 2.6269272
9: 2 1.5212627
10: 2 -0.1581711
Which has a total of 5 distinct ids (numbers 1 through 5) and 5 observations (obs) for each id. I want to group the ids together randomly in groups of X ids according to id and create a new column with the grouping. For this example, let's say I want to end up with a data.table like this:
id obs group
1: 1 0.1470735 A
2: 1 1.6954685 A
3: 1 2.3947260 A
4: 1 2.1782338 A
5: 1 0.5168873 A
6: 2 -0.8879545 A
7: 2 1.9320034 A
8: 2 2.6269272 A
9: 2 1.5212627 A
10: 2 -0.1581711 A
Where ids 1 and 2 are assigned to group A, ids 3 and 4 are assigned to group B, and id 5 is assigned to group C.
My actual dataset is much larger and will not necessarily group evenly, but I do not need the groups to contain the same number of ids. I do need to control the general size of the group (for example I want to be able to say 5 ids per group and if the last group has only 3 ids that's fine).
Could someone please help me with an elegant data.table way to accomplish this?
This is the same as @Shree's answer, just using length.out in rep and no dplyr.
From the question: "I do need to control the general size of the group (for example I want to be able to say 5 ids per group and if the last group has only 3 ids that's fine)."
You can make an id table; assign groups there; and if necessary merge back:
# bigger, reproducible example
library(data.table)
max_per_group = 5
n_ids = 1e5+1
DT = data.table(id = rep(1:n_ids, each = max_per_group), obs = 1)
# make an id table
idDT = unique(DT[, "id"])
# randomly assign groups
idDT[, g := sample(rep(.I, each = max_per_group, length.out = .N))]
# merge back if needed
DT[idDT, on=.(id), g := i.g]
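A quick sanity check (my addition, not part of the original answer): no group label should cover more than max_per_group ids.
stopifnot(idDT[, .N, by = g][, all(N <= max_per_group)])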
You refer to "my actual dataset" -- but R allows you to juggle multiple tables. Trying to do everything in one is almost always counterproductive.
EDIT: I didn't notice that you needed this with data.table. I'll leave this here as an alternative.
I am creating a data frame with id and a randomly assigned group. This is then joined with your data to get the group for each record by id:
library(dplyr)
library(data.table)
dt <- data.table(id = rep(1:5, 5), obs = rnorm(1, n = 25))[order(id)]
max_per_group <- 5
n_ids <- length(unique(dt$id))
data.frame(id = unique(dt$id), grp = sample(rep(LETTERS, max_per_group), n_ids)) %>%
left_join(dt, ., by = "id")
id obs grp
1 1 1.28879713 S
2 1 1.04471197 S
3 1 0.36470847 S
4 1 0.46741567 S
5 1 1.07749891 S
6 2 1.73640785 K
7 2 1.61144042 K
8 2 2.85196859 K
9 2 1.84848117 K
10 2 2.11395863 K
11 3 0.88623462 S
12 3 2.11706351 S
13 3 1.29225433 S
14 3 0.30458037 S
15 3 -1.72070005 S
16 4 2.24593162 U
17 4 2.10346287 U
18 4 2.28724412 U
19 4 0.02978044 U
20 4 0.56234660 U
21 5 2.92050008 F
22 5 1.08048974 F
23 5 0.58885261 F
24 5 1.53299092 F
25 5 1.47271123 F

`data.table` how to get `keyby` to include all combinations of factors? [duplicate]

This question already has answers here:
Keeping zero count combinations when aggregating with data.table
(2 answers)
Use a factor column in "by" and do not drop empty factors
(2 answers)
Closed 4 years ago.
I have a data.table and I would like to count the occurrence of each combination of a and b:
library(data.table)
library(magrittr)  # for %>%
dt1 <- data.table(
a = c(1,1,1,1,2,2,2,2,3,3,3,3),
b = c(1,1,2,2,1,1,1,1,1,2,2,2) %>% letters[.]
)
# a b
# 1: 1 a
# 2: 1 a
# 3: 1 b
# 4: 1 b
# 5: 2 a
# 6: 2 a
# 7: 2 a
# 8: 2 a
# 9: 3 a
# 10: 3 b
# 11: 3 b
# 12: 3 b
dt1[, .N, keyby = .(a, b)]
# a b N
# 1: 1 a 2
# 2: 1 b 2
# 3: 2 a 4
# 4: 3 a 1
# 5: 3 b 3
It misses out the case of a==2 & b=="b", which has a zero count in dt1, but I want it to be included so the result would look like:
# a b c
# 1: 1 a 2
# 2: 1 b 2
# 3: 2 a 4
# 4: 2 b 0
# 5: 3 a 1
# 6: 3 b 3
The most intuitive way would be to use a loop or the apply family, but that is inefficient for my large datasets. Any ideas?
Here's a tidyr/dplyr approach:
library(dplyr)
library(tidyr)
dt1 %>%
group_by(a, b) %>%
summarise(c = n()) %>%
ungroup %>%
complete(a, b, fill = list(c = 0))
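Since the question asks for data.table, here is a sketch of one way to keep the zero rows without dplyr (my own sketch along the lines of the linked duplicates, not an answer from this thread): count as usual, build the full grid of combinations with CJ, join the counts onto it, and fill the misses with 0.
library(data.table)
counts <- dt1[, .N, keyby = .(a, b)]
grid <- CJ(a = unique(dt1$a), b = unique(dt1$b))
res <- counts[grid, on = .(a, b)]  # right join; unmatched combinations get NA for N
res[is.na(N), N := 0L]
res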

R data.table filtering on group size

I am trying to find all the records in my data.table for which there is more than one row with value v in field f.
For instance, we can use this data:
dt <- data.table(f1=c(1,2,3,4,5), f2=c(1,1,2,3,3))
If looking for that property in field f2, we'd get (note the absence of the (3,2) tuple)
f1 f2
1: 1 1
2: 2 1
3: 4 3
4: 5 3
My first guess was dt[.N>2,list(.N),by=f2], but that actually keeps entries with .N==1.
dt[.N>2,list(.N),by=f2]
f2 N
1: 1 2
2: 2 1
3: 3 2
The other easy guess, dt[duplicated(dt$f2)], doesn't do the trick, as it keeps one of the 'duplicates' out of the results.
dt[duplicated(dt$f2)]
f1 f2
1: 2 1
2: 5 3
So how can I get this done?
Edited to add example
The question is not clear. Based on the title, it looks like we want to extract all groups with number of rows (.N) greater than 1.
DT[, if(.N>1) .SD, by=f]
But the mention of a value v in field f makes it confusing.
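For instance, applied to the example data in the question (grouping on f2), that idiom should give exactly the desired rows, just with the grouping column listed first:
library(data.table)
dt <- data.table(f1 = c(1,2,3,4,5), f2 = c(1,1,2,3,3))
dt[, if (.N > 1) .SD, by = f2]
#    f2 f1
# 1:  1  1
# 2:  1  2
# 3:  3  4
# 4:  3  5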
If I understand what you're after correctly, you'll need to do some compound queries:
library(data.table)
DT <- data.table(v1 = 1:10, f = c(rep(1:3, 3), 4))
DT[, N := .N, f][N > 2][, N := NULL][]
# v1 f
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 1
# 5: 5 2
# 6: 6 3
# 7: 7 1
# 8: 8 2
# 9: 9 3

Get top k records per group, where k differs by group, in R data.table

I have two data.tables:
Values to extract the top k from, per group.
A mapping from group to the k values to select for that group.
"How to find the top N values by group or within category (groupwise) in an R data.frame" addresses this question when k does not vary by group. How can I do this? Here's sample data and the desired result:
Values:
(dt <- data.table(id=1:10,
group=c(rep(1, 5), rep(2, 5))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 2
# 7: 7 2
# 8: 8 2
# 9: 9 2
# 10: 10 2
Mapping from group to k:
(group.k <- data.table(group=1:2,
k=2:3))
# group k
# 1: 1 2
# 2: 2 3
Desired result, which should include the first two records from group 1 and the first three records from group 2:
(result <- data.table(id=c(1:2, 6:8),
group=c(rep(1, 2), rep(2, 3))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 6 2
# 4: 7 2
# 5: 8 2
Applying the solution to the above-linked question after merging returns this error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, k), by=group])
# Error: length(n) == 1L is not TRUE
I'd rather do it as:
dt[group.k, head(.SD, k), by=.EACHI, on="group"]
because it makes the intended operation quite clear. j can be .SD[1:k] of course. Both these expressions will very likely be (further) optimised for speed in the next release.
See this post for a detailed explanation of by=.EACHI until we wrap up those vignettes.
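As a quick check against the sample data in the question, that expression should return the desired rows, with group listed first (the by=.EACHI columns come before the .SD columns):
dt[group.k, head(.SD, k), by = .EACHI, on = "group"]
#    group id
# 1:     1  1
# 2:     1  2
# 3:     2  6
# 4:     2  7
# 5:     2  8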
After merging in the k by group, a similar approach to the solution at https://stackoverflow.com/a/14800271/1840471 can be applied; you just need a unique() to avoid the length(n) error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, unique(k)), by=group])
# group id k
# 1: 1 1 2
# 2: 1 2 2
# 3: 2 6 3
# 4: 2 7 3
# 5: 2 8 3

Cumulative sequence of occurrences of values [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 5 years ago.
I have a dataset that looks something like this, with a column that can have four different values:
dataset <- data.frame(out = c("a","b","c","a","d","b","c","a","d","b","c","a"))
In R, I'd like to create a second column that tallies, in sequence, the cumulative number of rows containing a particular value. Thus the output column would look like this:
out
1
1
1
2
1
2
2
3
2
3
3
4
Try this:
dataset <- data.frame(out = c("a","b","c","a","d","b","c","a","d","b","c","a"))
with(dataset, ave(as.character(out), out, FUN = seq_along))
# [1] "1" "1" "1" "2" "1" "2" "2" "3" "2" "3" "3" "4"
Of course, you can assign the output to a column in your data.frame using something like dataset$asNumbers <- with(dataset, ave(as.character(out), out, FUN = seq_along))
Update
The "dplyr" approach is also quite nice. The logic is very similar to the "data.table" approach. An advantage is that you don't need to wrap the output with as.numeric which would be required with the ave approach mentioned above.
library(dplyr)
dataset %>% group_by(out) %>% mutate(count = sequence(n()))
# Source: local data frame [12 x 2]
# Groups: out
#
# out count
# 1 a 1
# 2 b 1
# 3 c 1
# 4 a 2
# 5 d 1
# 6 b 2
# 7 c 2
# 8 a 3
# 9 d 2
# 10 b 3
# 11 c 3
# 12 a 4
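In current dplyr, row_number() gives the same per-group counter and reads a little more directly (my suggestion, not part of the original answer):
library(dplyr)
dataset %>% group_by(out) %>% mutate(count = row_number()) %>% ungroup()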
A third option is to use getanID from my "splitstackshape" package. For this particular example you just need to supply the data.frame name (since it has a single column). More generally, you would name the column(s) that presently serve as "ids", and the function checks whether they are unique or whether a cumulative sequence is required to make them unique.
library(splitstackshape)
# getanID(dataset, "out") ## Example of being specific about column to use
getanID(dataset)
# out .id
# 1: a 1
# 2: b 1
# 3: c 1
# 4: a 2
# 5: d 1
# 6: b 2
# 7: c 2
# 8: a 3
# 9: d 2
# 10: b 3
# 11: c 3
# 12: a 4
Update:
As Ananda pointed out, you can use the simpler:
DT[, counts := sequence(.N), by = "V1"]
(where DT is as below)
You can create a "counts" column initialized to 1 and then take its cumulative sum within each group. Below is a quick implementation with data.table:
# Called the column V1
dataset<-data.frame(V1=c("a","b","c","a","d","b","c","a","d","b","c","a"))
library(data.table)
DT <- data.table(dataset)
DT[, counts := 1L]
DT[, counts := cumsum(counts), by=V1]; DT
# V1 counts
# 1: a 1
# 2: b 1
# 3: c 1
# 4: a 2
# 5: d 1
# 6: b 2
# 7: c 2
# 8: a 3
# 9: d 2
# 10: b 3
# 11: c 3
# 12: a 4
