Selecting top N rows for each group based on value in column - r

I have a data frame like the one below:
y <- c("a","a","a", "b","b","c","c","c","c","c","c")
df <- data.frame(x = c(3,2,1,8,7,11,10,9,7,5,4), y = y, z = c(2,2,2,1,1,3,3,3,3,3,3))
x y z
1 3 a 2
2 2 a 2
3 1 a 2
4 8 b 1
5 7 b 1
6 11 c 3
7 10 c 3
8 9 c 3
9 7 c 3
10 5 c 3
11 4 c 3
I want to select the top n rows for each group in column y, where n is given in column z.
The output should look like this:
x y z
1 3 a 2
2 2 a 2
3 8 b 1
4 11 c 3
5 10 c 3
6 9 c 3

A solution with base R:
# df is split according to y, then we keep only the top "z" rows (after ordering by x)
# and rbind everything back together:
do.call(rbind,
        lapply(split(df, df$y),
               function(df1) df1[order(df1$x, decreasing=TRUE), ][1:unique(df1$z), ]))
# x y z
#a.1 3 a 2
#a.2 2 a 2
#b 8 b 1
#c.6 11 c 3
#c.7 10 c 3
#c.8 9 c 3
A much more direct way (still in base R), provided in a comment by @mt1022; it assumes the rows are already sorted by x in decreasing order within each y:
df[ave(1:nrow(df), df$y, FUN = seq_along) <= df$z, ]
# x y z
#1 3 a 2
#2 2 a 2
#4 8 b 1
#6 11 c 3
#7 10 c 3
#8 9 c 3
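The ave() one-liner relies on the rows already being in decreasing order of x within each y. As a minimal sketch for input that is not pre-sorted, assuming the column names from the question (top_n_by_z is a hypothetical helper name, not part of any package), one could sort first and then apply the same filter:

```r
# Example data from the question
df <- data.frame(
  x = c(3, 2, 1, 8, 7, 11, 10, 9, 7, 5, 4),
  y = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "c"),
  z = c(2, 2, 2, 1, 1, 3, 3, 3, 3, 3, 3)
)

# Hypothetical helper: keep the z largest x values within each y
top_n_by_z <- function(d) {
  d <- d[order(d$y, -d$x), ]                                   # sort by group, then x descending
  rank_in_group <- ave(seq_len(nrow(d)), d$y, FUN = seq_along) # 1, 2, 3, ... within each group
  d[rank_in_group <= d$z, ]
}

# Rows deliberately shuffled to show that the input order no longer matters
shuffled <- df[c(11, 3, 7, 1, 9, 5, 2, 10, 4, 8, 6), ]
res <- top_n_by_z(shuffled)
res
```

The original row names are kept, so they can still be used to trace each selected row back to the input.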

One approach with data.table:
library(data.table)
setDT(df)[, .(inc = seq_len(.N) <= z, x, z), by = .(y)][inc == TRUE, -2]
# y x z
#1: a 3 2
#2: a 2 2
#3: b 8 1
#4: c 11 3
#5: c 10 3
#6: c 9 3

A solution with dplyr that uses do:
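df %>%
  group_by(y) %>%
  do(head(., unique(.$z)))  # keep the first z rows of each group (this last step is an assumed completion)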

I'm posting the solution I was looking for using dplyr. It is based on @HNSKD's answer:
y<-c("a","a","a", "b","b","c","c","c","c","c","c")
df %>% group_by(y) %>% slice(1:2)
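This returns the first two rows for each y (a fixed n of 2 rather than the per-group value in z; a grouped filter(row_number() <= z) would respect z):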
# A tibble: 6 x 3
# Groups: y [3]
x y z
<dbl> <fct> <dbl>
1 3 a 2
2 2 a 2
3 8 b 1
4 7 b 1
5 11 c 3
6 10 c 3


Assign unique non-repeated ID to nested groups with the same values in R

I have run across similar questions, but have not been able to find an answer for my specific needs.
I have a data set with a nested group design and I need to include a unique non-repeating ID to nested groups that can have identical values. While I regularly conduct this type of data wrangling, both the structure of this data set as well as the required outcome are beyond my skillset at this time.
Below I have provided an example data set (df) and what the results should look like.
I used the below code in my actual data set, but realized that it fails under certain circumstances...which are exaggerated in the example data set provided here. I prefer the ID to be sequentially numbered.
df$ID = cumsum(c(TRUE, diff(df$LENGTH) != 0))
I am open to all options (e.g., library(data.table), library(boot), etc) as it would be great if others find this post useful. However, I prefer solutions that do not require the installation and loading of additional packages.
Thanks in advance for your help.
Take care.
df <- read.table(text = "GROUP REGION TIME LENGTH
a x 1 3
a x 2 3
a x 3 3
a y 4 3
a y 5 3
a y 6 3
a z 7 2
a z 8 2
b z 1 2
b z 2 2
b x 3 2
b x 4 2
c x 1 2
c x 2 2
c y 3 2
c y 4 2
c x 5 2
c x 6 2
c z 7 1", header = TRUE)
result <- read.table(text = "GROUP REGION TIME LENGTH ID
a x 1 3 1
a x 2 3 1
a x 3 3 1
a y 4 3 2
a y 5 3 2
a y 6 3 2
a z 7 2 3
a z 8 2 3
b z 1 2 4
b z 2 2 4
b x 3 2 5
b x 4 2 5
c x 1 2 6
c x 2 2 6
c y 3 2 7
c y 4 2 7
c x 5 2 8
c x 6 2 8
c z 7 1 9", header = TRUE)
Paste GROUP and REGION columns and use rle to create a sequential ID column.
transform(df,ID = with(rle(paste(GROUP, REGION)),rep(seq_along(values),lengths)))
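To see why the diff(df$LENGTH) attempt from the question falls short while the rle approach gives the requested IDs, here is a small sketch. It rebuilds the example data with rep() purely to keep the block short; the names naive and by_rle are illustrative only:

```r
# Same rows as the example df, built compactly (TIME is not needed for the ID)
run_lengths <- c(3, 3, 2, 2, 2, 2, 2, 2, 1)
df <- data.frame(
  GROUP  = rep(c("a", "b", "c"), c(8, 4, 7)),
  REGION = rep(c("x", "y", "z", "z", "x", "x", "y", "x", "z"), run_lengths),
  LENGTH = rep(c(3, 3, 2, 2, 2, 2, 2, 2, 1), run_lengths)
)

# Attempt from the question: a new ID only when LENGTH changes,
# so neighbouring GROUP/REGION blocks with the same LENGTH are merged
naive <- cumsum(c(TRUE, diff(df$LENGTH) != 0))

# rle on the pasted key: a new ID whenever GROUP or REGION changes
r <- rle(paste(df$GROUP, df$REGION))
by_rle <- rep(seq_along(r$values), r$lengths)

cbind(df, naive, by_rle)
```

Using paste() with its default space separator, rather than paste0(), keeps keys such as ("ab", "c") and ("a", "bc") distinct.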
In data.table we can use rleid.
setDT(df)[, ID := rleid(GROUP, REGION)]
# 1: a x 1 3 1
# 2: a x 2 3 1
# 3: a x 3 3 1
# 4: a y 4 3 2
# 5: a y 5 3 2
# 6: a y 6 3 2
# 7: a z 7 2 3
# 8: a z 8 2 3
# 9: b z 1 2 4
#10: b z 2 2 4
#11: b x 3 2 5
#12: b x 4 2 5
#13: c x 1 2 6
#14: c x 2 2 6
#15: c y 3 2 7
#16: c y 4 2 7
#17: c x 5 2 8
#18: c x 6 2 8
#19: c z 7 1 9
Another base R option, but without rle:
transform(df, ID = cumsum(c(1, (s <- paste0(GROUP, REGION))[-1] != head(s, -1))))
1 a x 1 3 1
2 a x 2 3 1
3 a x 3 3 1
4 a y 4 3 2
5 a y 5 3 2
6 a y 6 3 2
7 a z 7 2 3
8 a z 8 2 3
9 b z 1 2 4
10 b z 2 2 4
11 b x 3 2 5
12 b x 4 2 5
13 c x 1 2 6
14 c x 2 2 6
15 c y 3 2 7
16 c y 4 2 7
17 c x 5 2 8
18 c x 6 2 8
19 c z 7 1 9
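With dplyr (rleid() itself comes from data.table; dplyr 1.1.0 and later provide consecutive_id() as an equivalent):
df %>%
  mutate(ID = data.table::rleid(GROUP, REGION))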

How to add value into new column based on corresponding value in another column?

This is the sample data with 'y' being the new variable created.
If the value of column x ="A", I would like the value of col.A to be displayed in column y. And similarly for the "B" & "C" values in column x.
Final result should be something like this.
A proposition :
df <- read.table(header=TRUE, text="
x A B C
A 1 4 7
B 5 6 7
C 3 5 3")
df$y <- paste0("df$",df$x,"[df$x=='",df$x,"']")
#> x A B C y
#> 1 A 1 4 7 df$A[df$x=='A']
#> 2 B 5 6 7 df$B[df$x=='B']
#> 3 C 3 5 3 df$C[df$x=='C']
df$y <- eval(ivmte:::unstring(df$y))
#> x A B C y
#> 1 A 1 4 7 1
#> 2 B 5 6 7 6
#> 3 C 3 5 3 3
# Created on 2021-01-30 by the reprex package (v0.3.0.9001)
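Try this:
y <- numeric(nrow(your_dataframe))
for (i in 1:nrow(your_dataframe)){
  # pick the column whose name matches the value of x in row i
  y[i] <- your_dataframe[i, which(names(your_dataframe) == your_dataframe$x[i])]
}
cbind(your_dataframe, y)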
x A B C y
1 A 1 4 7 1
2 B 5 6 7 6
3 C 3 5 3 3
Another option with apply:
cbind(your_dataframe, y=apply(your_dataframe, 1, function(x){
  as.numeric(x[[x[["x"]]]])}))  # look up the column named by this row's x value
> your_dataframe
x A B C y
1 A 1 4 7 1
2 B 5 6 7 6
3 C 3 5 3 3
Try this (matrix indexing with one row/column pair per row):
df$y <- df[-1][cbind(seq(nrow(df)),match(df$x,names(df)[-1]))]
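The one-liner above packs several steps together. As a sketch of what it does, using the sample data from the first answer (the intermediate names vals, rows and cols are illustrative only): indexing a matrix with a two-column matrix picks exactly one cell per (row, column) pair.

```r
df <- data.frame(
  x = c("A", "B", "C"),
  A = c(1, 5, 3),
  B = c(4, 6, 5),
  C = c(7, 7, 3)
)

vals <- as.matrix(df[-1])                # numeric matrix of the value columns A, B, C
rows <- seq_len(nrow(df))                # 1, 2, 3: one lookup per row
cols <- match(df$x, colnames(vals))      # column position named by x in each row

# cbind(rows, cols) is a 3 x 2 index matrix: (1,1), (2,2), (3,3) for this data
df$y <- vals[cbind(rows, cols)]
df
```

Because the lookup is a single vectorised index, it avoids both the explicit loop and the string evaluation used in the other answers.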

numbering duplicated rows in dplyr

I ran into an issue with numbering duplicated rows in a data.frame and could not find a similar post.
Let's say we have data like this:
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
> df
gr x
1 1 a
2 1 a
3 2 b
4 2 b
5 3 c
6 3 c
7 4 a
8 4 a
9 5 c
10 5 c
11 6 d
12 6 d
13 7 a
14 7 a
and I want to add a new column called x_dupl that numbers each block of x values by occurrence: 1 for the first time a value appears, 2 for the second time, 3 for the third, and so on.
Thanks in advance!
The expected output
> df
gr x x_dupl
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3
Your example data (plus rows where gr = 7 as in your output), and named df1, not df:
df1 <- data.frame(gr = gl(7,2),
x = c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
df1 %>%
  group_by(x) %>%
  mutate(x_dupl = dense_rank(gr)) %>%
  ungroup()
# A tibble: 14 x 3
gr x x_dupl
<fctr> <fctr> <int>
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3
A base R solution:
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
x <- rle(as.integer(factor(df$x)))  # factor() keeps this working when x is character (the default since R 4.0)
x$values <- ave(x$values, x$values, FUN = seq_along)
df$x_dupl <- inverse.rle(x)
# gr x x_dupl
# 1 1 a 1
# 2 1 a 1
# 3 2 b 1
# 4 2 b 1
# 5 3 c 1
# 6 3 c 1
# 7 4 a 2
# 8 4 a 2
# 9 5 c 2
# 10 5 c 2
# 11 6 d 1
# 12 6 d 1
# 13 7 a 3
# 14 7 a 3
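The base R answer works in three steps: collapse x into runs, number the runs of each value by order of appearance, then expand back to row length. A minimal sketch of those steps on the x column alone (the names runs and occurrence are illustrative only):

```r
x <- c("a", "a", "b", "b", "c", "c", "a", "a", "c", "c", "d", "d", "a", "a")

# Step 1: collapse consecutive repeats into runs
runs <- rle(x)
runs$values    # "a" "b" "c" "a" "c" "d" "a"

# Step 2: for each run, count how many runs of the same value have occurred so far
occurrence <- ave(seq_along(runs$values), runs$values, FUN = seq_along)

# Step 3: expand the per-run numbers back to one entry per original row
x_dupl <- rep(occurrence, runs$lengths)
x_dupl
```

Working on the run values directly means no numeric conversion of x is needed, so the sketch behaves the same for character and factor input as long as rle() receives an atomic vector (use as.character(x) for a factor).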

Create a new variable which count length of duplicate in R

I have a data frame and want to create a variable z that counts the length of each run of duplicates in y: for example, if y contains 1,1 then z = 2,2, and if y contains 3,3,3 then z = 3,3,3.
x = c("a","b","c","d","e","a","b","c","d","e","a","b","c")
y = c(1,1,2,2,2,3,3,4,4,4,5,5,5)
data <- data.frame(x,y)
x y z
1 a 1 2
2 b 1 2
3 c 2 3
4 d 2 3
5 e 2 3
6 a 3 2
7 b 3 2
8 c 4 3
9 d 4 3
10 e 4 3
11 a 5 3
12 b 5 3
13 c 5 3
Thanks for your help.
You can try rle:
data$z <- with(data, unlist(mapply(rep, rle(y)$lengths, rle(y)$lengths)))
x y z
1 a 1 2
2 b 1 2
3 c 2 3
4 d 2 3
5 e 2 3
6 a 3 2
7 b 3 2
8 c 4 3
9 d 4 3
10 e 4 3
11 a 5 3
12 b 5 3
13 c 5 3
If your variable y is sorted in increasing order, as you say, then the following solution will work:
# calculate counts of each level
counts <- table(data$y)
# fill in z
data$z <- counts[match(data$y, names(counts))]
This method calculates the total number of occurrences of each level and then feeds these totals to the proper rows using match.
Note, however, that it will fail if y is not ordered, since you want to restart the count whenever a different level occurs. For that purpose, @psidom's solution is more robust to mis-ordered data, as rle resets the count at each new run.
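A small sketch of the difference between the two ideas, using a made-up y (not the data from the question) in which the value 1 appears in two separate runs; run_len and total are illustrative names:

```r
# Hypothetical input: 1 occurs in two separate runs
y <- c(1, 1, 2, 2, 2, 1, 1, 1)

# Run-based count: length of the consecutive block each element belongs to
r <- rle(y)
run_len <- rep(r$lengths, r$lengths)

# Total count: how often each value occurs anywhere in y
total <- as.vector(table(y)[as.character(y)])

rbind(y, run_len, total)
```

The two vectors agree only when every value of y forms a single consecutive block, which is the case for the sorted data in the question.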
Here is a quick method using dplyr, and its rather intuitive syntax:
left_join(data, data %>%
group_by(y) %>%
summarize(z = n()),
by = "y")
x y z
1 a 1 2
2 b 1 2
3 c 2 3
4 d 2 3
5 e 2 3
6 a 3 2
7 b 3 2
8 c 4 3
9 d 4 3
10 e 4 3
11 a 5 3
12 b 5 3
13 c 5 3
We can do this easily with data.table
setDT(data)[, z := .N , rleid(y)]
# x y z
# 1: a 1 2
# 2: b 1 2
# 3: c 2 3
# 4: d 2 3
# 5: e 2 3
# 6: a 3 2
# 7: b 3 2
# 8: c 4 3
# 9: d 4 3
#10: e 4 3
#11: a 5 3
#12: b 5 3
#13: c 5 3
Or using rle from base R without any loops
inverse.rle(within.list(rle(data$y), values <- lengths))
#[1] 2 2 3 3 3 2 2 3 3 3 3 3 3
Or another base R method with ave
with(data, ave(y, cumsum(c(TRUE, y[-1]!= y[-length(y)])), FUN=length))
#[1] 2 2 3 3 3 2 2 3 3 3 3 3 3

how to mutate a column with ID in group

I have a data.frame like this:
a b c
1 a 1 1
2 a 1 2
3 a 2 3
4 b 1 4
5 b 2 5
6 b 3 6
Group by a. Within each group, flag starts at 1; if b equals the previous b, flag stays the same, otherwise flag increases by 1:
a b c flag
1 a 1 1 1 <- group a start with 1
2 a 1 2 1 <-- in group a, 1(in row 2)=1(in row 1)
3 a 2 3 2 <- in group a, 2(in row 3)!=1(in row 2)
4 b 1 4 1 <- group b start with 1
5 b 2 5 2 <- in group b, 2(in row 5)!=1(in row 4)
6 b 3 6 3 <- in group b, 3(in row 6)!=2(in row 5)
I am currently using this:
x[1, 'flag'] = 1
for(i in 2:nrow(x)){
  x[i, 'flag'] = ifelse(x[i, 'a']!=x[i-1,'a'], 1, ifelse(x[i, 'b']==x[i-1, 'b'], x[i-1, 'flag'], x[i-1,'flag']+1))
}
but it is inefficient on large datasets.
dense_rank in dplyr gives me the answer:
> x %>% group_by(a) %>% mutate(dense_rank(b))
Source: local data frame [10 x 4]
Groups: a
a b c dense_rank(b)
1 a x 1 1
2 a x 2 1
3 a y 3 2
4 b x 4 1
5 b y 5 2
6 b z 6 3
7 c x 7 1
8 c y 8 2
9 c z 9 3
10 c z 10 3
I am not entirely sure what you are trying to do. But it seems to me that you are trying to assign index numbers to values in b for each group (a or b).
#I modified your example here.
a <- rep(c("a","b"), each =3)
b <- c(4,4,5,11,12,13)
c <- 1:6
foo <- data.frame(a,b,c, stringsAsFactors = F)
a b c
1 a 4 1
2 a 4 2
3 a 5 3
4 b 11 4
5 b 12 5
6 b 13 6
#Since you referred to dplyr, I will use it.
cats <- list()
for(i in unique(foo$a)){
  ana <- foo %>%
    filter(a == i) %>%
    arrange(b) %>%
    mutate(indexInb = as.integer(as.factor(b)))
  cats[[i]] <- ana
}
bob <- rbindlist(cats)  # rbindlist() is from data.table
a b c indexInb
1: a 4 1 1
2: a 4 2 1
3: a 5 3 2
4: b 11 4 1
5: b 12 5 2
6: b 13 6 3
Here's a quick vectorized way to solve this without using any for loops.
Base R solution using ave and transform
transform(x, flag = ave(b, a, FUN = function(x) cumsum(c(1, diff(x)))))
# a b c flag
# 1 a 1 1 1
# 2 a 1 2 1
# 3 a 2 3 2
# 4 b 1 4 1
# 5 b 2 5 2
# 6 b 3 6 3
Or a data.table solution (more efficient)
setDT(x)[, flag := cumsum(c(1, diff(b))), by = a]
# a b c flag
# 1: a 1 1 1
# 2: a 1 2 1
# 3: a 2 3 2
# 4: b 1 4 1
# 5: b 2 5 2
# 6: b 3 6 3
Or a dplyr solution (because you tagged it)
x %>%
group_by(a) %>%
mutate(flag = cumsum(c(1, diff(b))))
# Source: local data frame [6 x 4]
# Groups: a
# a b c flag
# 1 a 1 1 1
# 2 a 1 2 1
# 3 a 2 3 2
# 4 b 1 4 1
# 5 b 2 5 2
# 6 b 3 6 3
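The cumsum(c(1, diff(b))) idiom gives the expected flags for the sample data because b only ever stays the same or goes up by exactly 1 within a group. As a hedged sketch for more general input (the data below is made up for illustration, and run_id is a hypothetical helper name), comparing each value with its predecessor avoids that assumption and also works for character columns:

```r
# Hypothetical data: b jumps by more than 1 in group "a" and decreases in group "b"
x <- data.frame(
  a = c("a", "a", "a", "b", "b", "b"),
  b = c(1, 1, 4, 7, 2, 2),
  c = 1:6
)

# Hypothetical helper: 1 for the first element, +1 each time the value changes
run_id <- function(v) cumsum(c(TRUE, v[-1] != v[-length(v)]))

# Apply per group via row indices, so b can be of any comparable type
x$flag <- ave(seq_len(nrow(x)), x$a, FUN = function(i) run_id(x$b[i]))

# For comparison: the diff()-based version on the same data
x$naive <- ave(x$b, x$a, FUN = function(v) cumsum(c(1, diff(v))))
x
```

The same run_id expression could be dropped into the data.table or dplyr variants in place of cumsum(c(1, diff(b))).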
