numbering duplicated rows in dplyr [duplicate] - r

I ran into an issue numbering the duplicated rows in a data.frame and could not find a similar post.
Let's say we have data like this:
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
> df
gr x
1 1 a
2 1 a
3 2 b
4 2 b
5 3 c
6 3 c
7 4 a
8 4 a
9 5 c
10 5 c
11 6 d
12 6 d
13 7 a
14 7 a
I want to add a new column called x_dupl that numbers each block of an x value by occurrence: the first time a value appears it gets 1, the second time 2, the third time 3, and so on.
thanks in advance!
The expected output
> df
gr x x_dupl
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3

Your example data (including the rows where gr = 7, as in your expected output), named df1 rather than df:
df1 <- data.frame(gr = gl(7,2),
x = c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
library(dplyr)
df1 %>%
group_by(x) %>%
mutate(x_dupl = dense_rank(gr)) %>%
ungroup()
# A tibble: 14 x 3
gr x x_dupl
<fctr> <fctr> <int>
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3

A base R solution:
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
x <- rle(as.numeric(factor(df$x)))  # factor() first: in R >= 4.0 data.frame() keeps x as character
x$values <- ave(x$values, x$values, FUN = seq_along)
df$x_dupl <- inverse.rle(x)
# gr x x_dupl
# 1 1 a 1
# 2 1 a 1
# 3 2 b 1
# 4 2 b 1
# 5 3 c 1
# 6 3 c 1
# 7 4 a 2
# 8 4 a 2
# 9 5 c 2
# 10 5 c 2
# 11 6 d 1
# 12 6 d 1
# 13 7 a 3
# 14 7 a 3
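The same numbering can also be sketched in base R without run-length encoding. This is a minimal sketch, not taken from the answers above: within each x group it numbers the distinct gr values in order of first appearance, using match() against unique(). It assumes, as in the example, that gr identifies each block of rows.

```r
# Example data from the question
df <- data.frame(gr = gl(7, 2),
                 x = c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))

# Within each x group, number the distinct gr values in order of first appearance
df$x_dupl <- ave(as.integer(df$gr), df$x,
                 FUN = function(g) match(g, unique(g)))
df$x_dupl
```

Unlike the rle approach, this does not require the rows of one block to be adjacent, but it does rely on gr being present.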

Related

Summing up consecutive values in groups [duplicate]

I'd like to compute a running sum of the values in one column within each group. I have a df like this:
set.seed(1)
gr <- c(rep('A',3),rep('B',2),rep('C',5),rep('D',3))
vals <- floor(runif(length(gr), min=0, max=10))
idx <- c(seq(1:3),seq(1:2),seq(1:5),seq(1:3))
df <- data.frame(gr,vals,idx)
gr vals idx
1 A 2 1
2 A 3 2
3 A 5 3
4 B 9 1
5 B 2 2
6 C 8 1
7 C 9 2
8 C 6 3
9 C 6 4
10 C 0 5
11 D 2 1
12 D 1 2
13 D 6 3
And I'm looking for this one:
gr vals idx
1 A 2 1
2 A 5 2
3 A 10 3
4 B 9 1
5 B 11 2
6 C 8 1
7 C 17 2
8 C 23 3
9 C 29 4
10 C 29 5
11 D 2 1
12 D 3 2
13 D 9 3
For example, in group C we have 8+9=17 (the first and second elements of the group), and the second value is replaced by that sum. Then 17+6=23 (the running sum plus the third element) replaces the third element, and so on.
I looked for a solution here, but what I found wasn't what I needed.
Ok, I think I got it
library(dplyr)
df %>%
group_by(gr) %>%
mutate(nvals = cumsum(vals))
gr vals idx nvals
1 A 2 1 2
2 A 3 2 5
3 A 5 3 10
4 B 9 1 9
5 B 2 2 11
6 C 8 1 8
7 C 9 2 17
8 C 6 3 23
9 C 6 4 29
10 C 0 5 29
11 D 2 1 2
12 D 1 2 3
13 D 6 3 9
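For comparison, a grouped running sum can also be sketched in base R with ave(), which applies a function within each level of a grouping variable. This is only an illustrative alternative to the dplyr pipeline above; the vals vector is typed in as printed in the question rather than regenerated from the seed.

```r
# Data as printed in the question
gr   <- c(rep("A", 3), rep("B", 2), rep("C", 5), rep("D", 3))
vals <- c(2, 3, 5, 9, 2, 8, 9, 6, 6, 0, 2, 1, 6)
df   <- data.frame(gr, vals)

# Running sum of vals, restarted at the beginning of every gr group
df$nvals <- ave(df$vals, df$gr, FUN = cumsum)
df
```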

Creating two columns of cumulative sum based on the categories of one column

I'd like to create two columns with the cumulative frequency of "A" and "B" in the assignment column.
df = data.frame(id = 1:10, assignment= c("B","A","B","B","B","A","B","B","A","B"))
id assignment
1 1 B
2 2 A
3 3 B
4 4 B
5 5 B
6 6 A
7 7 B
8 8 B
9 9 A
10 10 B
The resulting table would have this format
id assignment A B
1 1 B 0 1
2 2 A 1 1
3 3 B 1 2
4 4 B 1 3
5 5 B 1 4
6 6 A 2 4
7 7 B 2 5
8 8 B 2 6
9 9 A 3 6
10 10 B 3 7
How can the code be generalized to more than two categories (say "A", "B", "C")?
Thanks
Use lapply over unique values in assignment to create new columns.
vals <- sort(unique(df$assignment))
df[vals] <- lapply(vals, function(x) cumsum(df$assignment == x))
df
# id assignment A B
#1 1 B 0 1
#2 2 A 1 1
#3 3 B 1 2
#4 4 B 1 3
#5 5 B 1 4
#6 6 A 2 4
#7 7 B 2 5
#8 8 B 2 6
#9 9 A 3 6
#10 10 B 3 7
We can use model.matrix with colCumsums
library(matrixStats)
cbind(df, colCumsums(model.matrix(~ assignment - 1, df[-1])))
A base R option
transform(
df,
A = cumsum(assignment == "A"),
B = cumsum(assignment == "B")
)
gives
id assignment A B
1 1 B 0 1
2 2 A 1 1
3 3 B 1 2
4 4 B 1 3
5 5 B 1 4
6 6 A 2 4
7 7 B 2 5
8 8 B 2 6
9 9 A 3 6
10 10 B 3 7
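To illustrate the generalization the question asks about, here is a small sketch of the per-category cumulative count with three categories. The data frame df3 and its values are hypothetical, made up for illustration; the approach is the same cumsum over a logical comparison used in the answers above.

```r
# Hypothetical data with three categories (not from the question)
df3 <- data.frame(id = 1:6, assignment = c("B", "A", "C", "B", "A", "C"))

# One cumulative-count column per distinct category, named after the category
vals   <- sort(unique(as.character(df3$assignment)))
counts <- sapply(vals, function(v) cumsum(df3$assignment == v))
res    <- cbind(df3, counts)
res
```

Because the columns are derived from the distinct values, nothing needs to change when a fourth category appears.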

convert lists of vectors into a single tibble data frame

I have two lists, each containing many vectors (around 500) of different lengths, and I would like to get a tibble data frame with three columns.
My reproducible example is the following:
> a
[[1]]
[1] 1 3 6
[[2]]
[1] 5 4
> b
[[1]]
[1] 3 4
[[2]]
[1] 5 6 7
I would like to get the following tibble data frame:
name index value
a 1 1
a 1 3
a 1 6
a 2 5
a 2 4
b 1 3
b 1 4
b 2 5
b 2 6
b 2 7
I would be grateful if someone could help me with this issue
using Base R:
transform(stack(c(a=a,b=b)),name=substr(ind,1,1),ind=substr(ind,2,2))
values ind name
1 1 1 a
2 3 1 a
3 6 1 a
4 5 2 a
5 4 2 a
6 3 1 b
7 4 1 b
8 5 2 b
9 6 2 b
10 7 2 b
using tidyverse:
library(tidyverse)
list(a=a,b=b)%>%map(~stack(setNames(.x,1:length(.x))))%>%bind_rows(.id = "name")
name values ind
1 a 1 1
2 a 3 1
3 a 6 1
4 a 5 2
5 a 4 2
6 b 3 1
7 b 4 1
8 b 5 2
9 b 6 2
10 b 7 2
Here is one option with tidyverse
library(tidyverse)
list(a= a, b = b) %>%
map_df(enframe, name = "index", .id = 'name') %>%
unnest(value)
# A tibble: 10 x 3
# name index value
# <chr> <int> <dbl>
# 1 a 1 1
# 2 a 1 3
# 3 a 1 6
# 4 a 2 5
# 5 a 2 4
# 6 b 1 3
# 7 b 1 4
# 8 b 2 5
# 9 b 2 6
#10 b 2 7
data
a <- list(c(1, 3, 6), c(5, 4))
b <- list(c(3, 4), c(5, 6, 7))
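Since the question mentions around 500 vectors per list, note that the substr()-based base R answer above reads only one character of the index. A hedged base R sketch that avoids parsing names is shown here: it builds the index with rep() and lengths(). The names lists and long are illustrative, not from the original answers.

```r
a <- list(c(1, 3, 6), c(5, 4))
b <- list(c(3, 4), c(5, 6, 7))
lists <- list(a = a, b = b)

# For each named list, repeat the position of every vector by that vector's length
long <- do.call(rbind, lapply(names(lists), function(nm) {
  l <- lists[[nm]]
  data.frame(name  = nm,
             index = rep(seq_along(l), lengths(l)),
             value = unlist(l))
}))
long
```

The result is a plain data.frame; wrapping it in tibble::as_tibble() would give the tibble the question asks for.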

Using mutate to create a new column with the first value of each group in R

I'm currently working on a Sabermetric research project and I've been stuck all day trying to create a new column in a data frame that displays the starting pitcher for a given game. Essentially, if I use the sample below, I have data for 'a' and 'b', but I can't figure out how to create 'c' to be the first value of 'b' for each unique value of 'a'. This should be easy, but I just started learning R.
a b c
1 1 1 1
2 1 2 1
3 1 3 1
4 1 4 1
5 1 5 1
6 1 6 1
7 2 7 7
8 2 8 7
9 2 1 7
10 2 2 7
11 2 3 7
12 2 4 7
13 3 5 5
14 3 6 5
15 3 7 5
So far I've used mutate and group_by to come up with
sample <- sample %>% group_by(a) %>% mutate(c = first(b))
But this just makes every value of 'c' the first value of the first 'b'. So in the sample above, my current code makes every value of 'c' equal to 1.
I'm missing something, any suggestions?
We can use base R
df1$c <- with(df1, ave(b, a, FUN= function(x) head(x,1)))
Or with data.table
library(data.table)
setDT(df1)[, c:= head(b, 1), by = a]
Using library dplyr, you can do something like this:
library(dplyr)
df %>% group_by(a) %>% mutate(c = b[1])
Output is as follows:
Source: local data frame [15 x 3]
Groups: a [3]
a b c
(int) (int) (int)
1 1 1 1
2 1 2 1
3 1 3 1
4 1 4 1
5 1 5 1
6 1 6 1
7 2 7 7
8 2 8 7
9 2 1 7
10 2 2 7
11 2 3 7
12 2 4 7
13 3 5 5
14 3 6 5
15 3 7 5
Changing columns to the types mentioned below in comments and running code produces desired output:
df$b <- as.factor(df$b)
df$a <- as.character(df$a)
str(df)
'data.frame': 15 obs. of 3 variables:
$ a: chr "1" "1" "1" "1" ...
$ b: Factor w/ 8 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 1 2 ...
$ c: int 1 1 1 1 1 1 7 7 7 7 ...
df %>% group_by(a) %>% mutate(c = b[1])
Source: local data frame [15 x 3]
Groups: a [3]
a b c
(chr) (fctr) (fctr)
1 1 1 1
2 1 2 1
3 1 3 1
4 1 4 1
5 1 5 1
6 1 6 1
7 2 7 7
8 2 8 7
9 2 1 7
10 2 2 7
11 2 3 7
12 2 4 7
13 3 5 5
14 3 6 5
15 3 7 5
Not so elegant but it works, I hope it works for you too:
df1 %>% group_by(a) %>% mutate(c = rep(first(b), length(a)))
Source: local data frame [15 x 3]
Groups: a [3]
a b c
(int) (int) (int)
1 1 1 1
2 1 2 1
3 1 3 1
4 1 4 1
5 1 5 1
6 1 6 1
7 2 7 7
8 2 8 7
9 2 1 7
10 2 2 7
11 2 3 7
12 2 4 7
13 3 5 5
14 3 6 5
15 3 7 5
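The symptom described in the question (every row receiving the first value overall) is what happens when the grouping is ignored. One possible cause, which the thread does not confirm, is that another attached package such as plyr masks dplyr's mutate(). A minimal sketch that sidesteps any masking by calling the dplyr functions with an explicit namespace follows; df1 is rebuilt here from the sample table.

```r
library(dplyr)

# Sample data rebuilt from the table in the question
df1 <- data.frame(a = rep(1:3, c(6, 6, 3)),
                  b = c(1:8, 1:7))

# Explicit dplyr:: prefixes make sure the grouped versions are the ones called
res <- df1 %>%
  dplyr::group_by(a) %>%
  dplyr::mutate(c = dplyr::first(b)) %>%
  dplyr::ungroup()
res
```

If this gives the per-group result while the unprefixed call does not, checking the order in which packages were attached (for example with search()) is a reasonable next step.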

remove i+1th term if reoccurring

Say we have the following data
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
How would one write a function so that, for A, if the same value appears again in the i+1th position, the reoccurring row is removed?
Therefore the output should look like
data.frame(c(1,2,3,4,8,6,1,2,3,4), c(1,2,5,1,2,3,5,1,2,3))
My best guess would be to use a for loop, but I have no experience with them.
You can try
data[c(TRUE, data[-1,1]!= data[-nrow(data), 1]),]
Another option, dplyr-esque:
library(dplyr)
dat1 <- data.frame(A=c(1,2,2,2,3,4,8,6,6,1,2,3,4),
B=c(1,2,3,4,5,1,2,3,4,5,1,2,3))
# keep the first row, then every row whose A differs from the previous row's A
dat1 %>% filter(row_number() == 1 | A != lag(A))
## A B
## 1 1 1
## 2 2 2
## 3 3 5
## 4 4 1
## 5 8 2
## 6 6 3
## 7 1 5
## 8 2 1
## 9 3 2
## 10 4 3
using diff, which calculates the pairwise differences with a lag of 1:
data[c( TRUE, diff(data[,1]) != 0), ]
output:
A B
1 1 1
2 2 2
5 3 5
6 4 1
7 8 2
8 6 3
10 1 5
11 2 1
12 3 2
13 4 3
Using rle
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
X <- rle(data$A)
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
View(data[Y, ])
row.names A B
1 1 1 1
2 2 2 2
3 5 3 5
4 6 4 1
5 7 8 2
6 8 6 3
7 10 1 5
8 11 2 1
9 12 3 2
10 13 4 3
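Since the question asks for a function, the vectorized comparison from the answers above can be wrapped as follows. This is a minimal sketch; the function name drop_consecutive_repeats and its arguments are hypothetical, and it assumes the column contains no NA values.

```r
# Drop rows whose value in `col` equals the value in the row directly above
drop_consecutive_repeats <- function(d, col) {
  if (nrow(d) == 0) return(d)
  v <- d[[col]]
  # The first row is always kept; later rows are kept when they differ from the previous one
  keep <- c(TRUE, v[-1] != v[-length(v)])
  d[keep, , drop = FALSE]
}

A <- c(1, 2, 2, 2, 3, 4, 8, 6, 6, 1, 2, 3, 4)
B <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3)
data <- data.frame(A, B)

out <- drop_consecutive_repeats(data, "A")
out
```

No for loop is needed: comparing the column with a copy of itself shifted by one position checks every i and i+1 pair at once.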

Resources