Create a new id by matching two column values - r

I have the following data and I want to create a new id called newid from the columns id and class. Some id values repeat later with a different class value, so the new id has to be built by matching on both id and class.
data <- data.frame(id=c(1,1,1,1,1, 2,2,2,2,2,3,3,3,3,1,1,1,1,2,2,2,4,4,4),
class=c('x','x','x','x','x', 'y','y','y','y','y', 'z','z','z','z', 'w','w','w','w', 'v','v','v','n','n','n'))
Expected output
id class newid
1 1 x 1
2 1 x 1
3 1 x 1
4 1 x 1
5 1 x 1
6 2 y 2
7 2 y 2
8 2 y 2
9 2 y 2
10 2 y 2
11 3 z 3
12 3 z 3
13 3 z 3
14 3 z 3
15 1 w 4
16 1 w 4
17 1 w 4
18 1 w 4
19 2 v 5
20 2 v 5
21 2 v 5
22 4 n 6
23 4 n 6
24 4 n 6

You could use match():
library(dplyr)
data %>%
mutate(grp = paste(id, class),
newid = match(grp, unique(grp))) %>%
select(-grp)
id class newid
1 1 x 1
2 1 x 1
3 1 x 1
4 1 x 1
5 1 x 1
6 2 y 2
7 2 y 2
8 2 y 2
9 2 y 2
10 2 y 2
11 3 z 3
12 3 z 3
13 3 z 3
14 3 z 3
15 1 w 4
16 1 w 4
17 1 w 4
18 1 w 4
19 2 v 5
20 2 v 5
21 2 v 5
22 4 n 6
23 4 n 6
24 4 n 6
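The same idea can be sketched in base R, without dplyr. This is only a minimal sketch: it assumes the pasted key is unambiguous (a space separator is safe here because neither id nor class contains spaces), and the name key is purely illustrative.

```r
# Rebuild the example data from the question
data <- data.frame(
  id    = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2, 4,4,4),
  class = c('x','x','x','x','x', 'y','y','y','y','y', 'z','z','z','z',
            'w','w','w','w', 'v','v','v', 'n','n','n')
)

# One key per row that combines both columns
key <- paste(data$id, data$class)

# Position of each key among the distinct keys, in order of first appearance
data$newid <- match(key, unique(key))

# Group sizes per new id
table(data$newid)
```

Because match() looks each key up in unique(key), a pair that reappears further down would get its original number back rather than a new one.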

One option is to use cur_group_id() but see the note at the end.
data %>%
group_by(id, class) %>%
mutate(newid = cur_group_id()) %>%
ungroup()
## A tibble: 24 × 3
# id class newid
# <dbl> <chr> <int>
# 1 1 x 2
# 2 1 x 2
# 3 1 x 2
# 4 1 x 2
# 5 1 x 2
# 6 2 y 4
# 7 2 y 4
# 8 2 y 4
# 9 2 y 4
#10 2 y 4
## … with 14 more rows
## ℹ Use `print(n = ...)` to see more rows
Note: This creates a unique newid per (id, class) combination; the order is different from your expected output in that it uses numerical/lexicographical ordering: (1, w) comes before (1, x) which comes before (2, v) and so on.
So as long as you don't care about the actual value, cur_group_id() will always create a unique id per value combination of the grouping variables.
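If the order of first appearance does matter, sorted group ids can be renumbered afterwards. The sketch below is base R only and rests on an assumption: interaction(..., lex.order = TRUE) is used here as a stand-in for sorted group ids, and gid is an illustrative name.

```r
# Rebuild the example data from the question
data <- data.frame(
  id    = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2, 4,4,4),
  class = c('x','x','x','x','x', 'y','y','y','y','y', 'z','z','z','z',
            'w','w','w','w', 'v','v','v', 'n','n','n')
)

# Group ids in sorted order: (1, w) = 1, (1, x) = 2, (2, v) = 3, ...
gid <- as.integer(interaction(data$id, data$class,
                              drop = TRUE, lex.order = TRUE))

# Renumber so that ids follow the order in which groups first appear
data$newid <- match(gid, unique(gid))

head(data[c("id", "class", "newid")])
```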

Related

How to create another column in a data frame based on repeated observations in another column?

So basically I have a data frame that looks like this:
BX BY
1 12
1 12
1 12
2 14
2 14
3 5
I want to create another column, ID, which has the same number for rows with the same values in BX and BY. The table would then look like this:
BX BY ID
1 12 1
1 12 1
1 12 1
2 14 2
2 14 2
3 5 3
Here is a base R way.
Subset the data.frame by the grouping columns, find the duplicated rows and use a standard cumsum trick.
df1<-'BX BY
1 12
1 12
1 12
2 14
2 14
3 5'
df1 <- read.table(textConnection(df1), header = TRUE)
cumsum(!duplicated(df1[c("BX", "BY")]))
#> [1] 1 1 1 2 2 3
df1$ID <- cumsum(!duplicated(df1[c("BX", "BY")]))
df1
#> BX BY ID
#> 1 1 12 1
#> 2 1 12 1
#> 3 1 12 1
#> 4 2 14 2
#> 5 2 14 2
#> 6 3 5 3
Created on 2022-10-12 with reprex v2.0.2
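One caveat about the cumsum trick: it counts how many distinct rows have been seen so far, so it relies on identical rows being adjacent. The sketch below uses a small made-up data frame (df2 is hypothetical, not from the question) in which a pair reappears later, and compares the result with a match()-based id.

```r
# Hypothetical data: the pair (1, 12) comes back in the last row
df2 <- data.frame(BX = c(1, 1, 2, 1), BY = c(12, 12, 14, 12))

# cumsum trick: the last row gets 2 because no new pair was seen there
id_cumsum <- cumsum(!duplicated(df2[c("BX", "BY")]))

# match on a pasted key: the last row gets its original id back
key <- paste(df2$BX, df2$BY)
id_match <- match(key, unique(key))

id_cumsum
id_match
```

For the data in the question the two approaches agree, since equal rows are always next to each other.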
You can do:
transform(dat, ID = as.numeric(interaction(dat, drop = TRUE, lex.order = TRUE)))
BX BY ID
1 1 12 1
2 1 12 1
3 1 12 1
4 2 14 2
5 2 14 2
6 3 5 3
Or if you prefer dplyr:
library(dplyr)
dat %>%
group_by(across()) %>%
mutate(ID = cur_group_id()) %>%
ungroup()
# A tibble: 6 × 3
BX BY ID
<dbl> <dbl> <int>
1 1 12 1
2 1 12 1
3 1 12 1
4 2 14 2
5 2 14 2
6 3 5 3

How can I create a new column with mutate function in R that is a sequence of values of other columns in R?

I have a data frame that looks like this :
a b c
1 2 10
2 2 10
3 2 10
4 2 10
5 2 10
I want to create a column, with mutate or something else from the dplyr family of functions (or base R), that is a sequence from b to c (i.e. from 2 to 10, with length equal to the number of rows of this tibble or data frame).
Ideally, I want my new data frame to look like this:
a b c c
1 2 10 2
2 2 10 4
3 2 10 6
4 2 10 8
5 2 10 10
How can I do this with R using dplyr ?
library(tidyverse)
n=5
a = seq(1,n,length.out=n)
b = rep(2,n)
c = rep(10,n)
data = tibble(a,b,c)
We may do
library(dplyr)
data %>%
rowwise %>%
mutate(new = seq(b, c, length.out = n)[a]) %>%
ungroup
-output
# A tibble: 5 × 4
a b c new
<dbl> <dbl> <dbl> <dbl>
1 1 2 10 2
2 2 2 10 4
3 3 2 10 6
4 4 2 10 8
5 5 2 10 10
If you want this done "by group" for each a value (creating many new rows), we can create the sequence as a list column and then unnest it:
data %>%
mutate(result = map2(b, c, seq, length.out = n)) %>%
unnest(result)
# # A tibble: 25 × 4
# a b c result
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 10 2
# 2 1 2 10 4
# 3 1 2 10 6
# 4 1 2 10 8
# 5 1 2 10 10
# 6 2 2 10 2
# 7 2 2 10 4
# 8 2 2 10 6
# 9 2 2 10 8
# 10 2 2 10 10
# # … with 15 more rows
# # ℹ Use `print(n = ...)` to see more rows
If you want to keep the same number of rows and go from the first b value to the last c value, we can use seq directly in mutate:
data %>%
mutate(result = seq(from = first(b), to = last(c), length.out = n()))
# # A tibble: 5 × 4
# a b c result
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 10 2
# 2 2 2 10 4
# 3 3 2 10 6
# 4 4 2 10 8
# 5 5 2 10 10
This one?
library(dplyr)
df %>%
mutate(c1 = a*b)
a b c c1
1 1 2 10 2
2 2 2 10 4
3 3 2 10 6
4 4 2 10 8
5 5 2 10 10
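Note that a * b reproduces the requested column only because of the particular numbers in the example. A small base R check, using made-up end points (lo and hi are illustrative names, not from the question), shows where the product and the sequence part ways:

```r
# Example from the question: the sequence 2..10 of length 5 equals a * b
a <- 1:5
isTRUE(all.equal(seq(2, 10, length.out = 5), a * 2))

# Hypothetical variation: same start, different end point
n  <- 4
lo <- 2
hi <- 11
by_seq     <- seq(lo, hi, length.out = n)  # evenly spaced from lo to hi
by_product <- seq_len(n) * lo              # multiples of lo

by_seq
by_product
```

So the seq()-based answers are the ones that follow the stated requirement in general.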

R CUMSUM When Value is Met

DATA = data.frame(STUDENT=c(1,1,1,1,2,2,2,3,3,3,3,3),
T = c(1,2,3,4,1,2,3,1,2,3,4,5),
SCORE=c(NA,1,5,2,3,4,4,1,4,5,2,2),
WANT=c('N','N','P','P','N','N','N','N','N','P','P','P'))
I have 'DATA' and wish to create the 'WANT' variable. It is 'N' by default, but within each 'STUDENT', once there is a SCORE of 5 or higher, the 'WANT' value becomes 'P' and stays that way. I am looking for a dplyr solution.
You can use cumany():
library(dplyr)
DATA %>%
group_by(STUDENT) %>%
# treat NA as 0, then flag every row from the first SCORE >= 5 onward
mutate(WANT2 = ifelse(cumany(ifelse(is.na(SCORE), 0, SCORE) >= 5),
"P", "N"))
# A tibble: 12 × 5
# Groups: STUDENT [3]
STUDENT T SCORE WANT WANT2
<dbl> <dbl> <dbl> <chr> <chr>
1 1 1 NA N N
2 1 2 1 N N
3 1 3 5 P P
4 1 4 2 P P
5 2 1 3 N N
6 2 2 4 N N
7 2 3 4 N N
8 3 1 1 N N
9 3 2 4 N N
10 3 3 5 P P
11 3 4 2 P P
12 3 5 2 P P
You can use cummax():
library(dplyr)
DATA %>%
group_by(STUDENT) %>%
mutate(WANT = c("N", "P")[cummax(SCORE >= 5 & !is.na(SCORE))+1])
# A tibble: 12 × 4
# Groups: STUDENT [3]
STUDENT T SCORE WANT
<dbl> <dbl> <dbl> <chr>
1 1 1 NA N
2 1 2 1 N
3 1 3 5 P
4 1 4 2 P
5 2 1 3 N
6 2 2 4 N
7 2 3 4 N
8 3 1 1 N
9 3 2 4 N
10 3 3 5 P
11 3 4 2 P
12 3 5 2 P
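For comparison, the same "switch on and stay on" logic can be sketched in base R with ave() and cummax(). This is a minimal sketch; hit and WANT2 are illustrative names.

```r
DATA <- data.frame(
  STUDENT = c(1,1,1,1,2,2,2,3,3,3,3,3),
  SCORE   = c(NA,1,5,2,3,4,4,1,4,5,2,2)
)

# 1 where the score is known and at least 5, otherwise 0
hit <- as.integer(!is.na(DATA$SCORE) & DATA$SCORE >= 5)

# Running maximum within each student: stays 1 after the first hit
reached <- ave(hit, DATA$STUDENT, FUN = cummax)

DATA$WANT2 <- ifelse(reached == 1, "P", "N")
DATA
```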

Extract Index of repeat value

How do I extract specific rows when a column has repeating values? My data looks like this. I want to extract the row at the end of each run of X (A 3 10, A 2 3, etc.), or the index of that last value.
Name X M
A 1 1
A 2 9
A 3 10
A 1 1
A 2 3
A 1 5
A 2 6
A 3 4
A 4 5
A 5 3
B 1 1
B 2 9
B 3 10
B 1 1
B 2 3
Expected output
Index Name X M
3 A 3 10
5 A 2 3
10 A 5 3
13 B 3 10
15 B 2 3
Using base R duplicated and cumsum:
dups <- !duplicated(cumsum(dat$X == 1), fromLast=TRUE)
cbind(dat[dups,], Index=which(dups))
# Name X M Index
#3 A 3 10 3
#5 A 2 3 5
#10 A 5 3 10
#13 B 3 10 13
#15 B 2 3 15
A solution using dplyr.
library(dplyr)
df2 <- df %>%
mutate(Flag = ifelse(lead(X) < X, 1, 0)) %>%
mutate(Index = 1:n()) %>%
filter(Flag == 1 | is.na(Flag)) %>%
select(Index, X, M)
df2
# Index X M
# 1 3 3 10
# 2 5 2 3
# 3 10 5 3
# 4 13 3 10
# 5 15 2 3
Flag is a column showing whether the next value of X is smaller than the current one: 1 if so, 0 otherwise. For the last row lead(X) is NA, so Flag is NA there. We can then filter for Flag == 1 or is.na(Flag), which keeps the end of each run plus the last row. df2 is the final filtered data frame.
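The "next value is smaller" test can also be sketched in base R with diff(). This assumes, as in the example data, that every new run starts at a lower value than the one before it; run_end is an illustrative name.

```r
# X column from the example data
X <- c(1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2)

# A run ends where the following value drops; the last row always ends a run
run_end <- which(c(diff(X) < 0, TRUE))

run_end
```

These indices can then be used to subset the data frame, for example df[run_end, ].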
DATA
df <- read.table(text = "Name X M
A 1 1
A 2 9
A 3 10
A 1 1
A 2 3
A 1 5
A 2 6
A 3 4
A 4 5
A 5 3
B 1 1
B 2 9
B 3 10
B 1 1
B 2 3",
header = TRUE, stringsAsFactors = FALSE)

numbering duplicated rows in dplyr [duplicate]

I ran into an issue numbering the duplicated rows in a data.frame and could not find a similar post.
Let's say we have data like this
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
> df
gr x
1 1 a
2 1 a
3 2 b
4 2 b
5 3 c
6 3 c
7 4 a
8 4 a
9 5 c
10 5 c
11 6 d
12 6 d
13 7 a
14 7 a
and I want to add a new column called x_dupl that numbers each block of an x value by occurrence: 1 for the first time the value appears, 2 for the second time, 3 for the third, and so on.
thanks in advance!
The expected output
> df
gr x x_dupl
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3
Your example data (plus rows where gr = 7 as in your output), and named df1, not df:
df1 <- data.frame(gr = gl(7,2),
x = c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
library(dplyr)
df1 %>%
group_by(x) %>%
mutate(x_dupl = dense_rank(gr)) %>%
ungroup()
# A tibble: 14 x 3
gr x x_dupl
<fctr> <fctr> <int>
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3
A base R solution:
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
x <- rle(as.numeric(factor(df$x)))  # factor() first, so this also works when x is character (the default since R 4.0)
x$values <- ave(x$values, x$values, FUN = seq_along)
df$x_dupl <- inverse.rle(x)
# gr x x_dupl
# 1 1 a 1
# 2 1 a 1
# 3 2 b 1
# 4 2 b 1
# 5 3 c 1
# 6 3 c 1
# 7 4 a 2
# 8 4 a 2
# 9 5 c 2
# 10 5 c 2
# 11 6 d 1
# 12 6 d 1
# 13 7 a 3
# 14 7 a 3
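The rle() approach can also be walked through step by step on the character values directly, which avoids any numeric conversion. A minimal sketch, with r and occ as illustrative names:

```r
x <- c("a","a","b","b","c","c","a","a","c","c","d","d","a","a")

# Collapse consecutive repeats into runs: values a b c a c d a, lengths all 2
r <- rle(x)

# Number each run by how many runs of the same value have been seen so far
occ <- ave(seq_along(r$values), r$values, FUN = seq_along)

# Expand the per-run numbers back to one entry per row
x_dupl <- rep(occ, r$lengths)

x_dupl
```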

Resources