R CUMSUM When Value is Met

DATA = data.frame(STUDENT = c(1,1,1,1,2,2,2,3,3,3,3,3),
                  T = c(1,2,3,4,1,2,3,1,2,3,4,5),
                  SCORE = c(NA,1,5,2,3,4,4,1,4,5,2,2),
                  WANT = c('N','N','P','P','N','N','N','N','N','P','P','P'))
I have 'DATA' and wish to create the 'WANT' variable. It should be 'N', but within each 'STUDENT', once there is a SCORE of 5 or higher, 'WANT' becomes 'P' and stays that way. I am looking for a dplyr solution.

You can use cumany():
library(dplyr)
DATA %>%
  group_by(STUDENT) %>%
  mutate(WANT2 = ifelse(cumany(!is.na(SCORE) & SCORE >= 5), "P", "N"))
# A tibble: 12 × 5
# Groups: STUDENT [3]
STUDENT T SCORE WANT WANT2
<dbl> <dbl> <dbl> <chr> <chr>
1 1 1 NA N N
2 1 2 1 N N
3 1 3 5 P P
4 1 4 2 P P
5 2 1 3 N N
6 2 2 4 N N
7 2 3 4 N N
8 3 1 1 N N
9 3 2 4 N N
10 3 3 5 P P
11 3 4 2 P P
12 3 5 2 P P

You can use cummax():
library(dplyr)
DATA %>%
  group_by(STUDENT) %>%
  mutate(WANT = c("N", "P")[cummax(SCORE >= 5 & !is.na(SCORE)) + 1])
# A tibble: 12 × 4
# Groups: STUDENT [3]
STUDENT T SCORE WANT
<dbl> <dbl> <dbl> <chr>
1 1 1 NA N
2 1 2 1 N
3 1 3 5 P
4 1 4 2 P
5 2 1 3 N
6 2 2 4 N
7 2 3 4 N
8 3 1 1 N
9 3 2 4 N
10 3 3 5 P
11 3 4 2 P
12 3 5 2 P
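Both answers rely on the same idea: once the condition SCORE >= 5 has been TRUE within a STUDENT, the flag stays on for every later row. A minimal illustration (assuming dplyr is loaded for cumany()) of how cumany() and cummax() propagate that flag along a vector:
x <- c(FALSE, FALSE, TRUE, FALSE)
cumany(x)  # FALSE FALSE  TRUE  TRUE
cummax(x)  # 0 0 1 1  (integer, hence the `+ 1` indexing trick above)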

Related

How can I create a new column with the mutate function in R that is a sequence of values from other columns?

I have a data frame that looks like this:
a b c
1 2 10
2 2 10
3 2 10
4 2 10
5 2 10
I want to create a new column, with mutate or something else in the dplyr framework (or base R), that is a sequence from b to c (i.e. from 2 to 10) with length equal to the number of rows of this tibble or data frame.
Ideally my new data frame should look like this:
a b c new
1 2 10 2
2 2 10 4
3 2 10 6
4 2 10 8
5 2 10 10
How can I do this in R using dplyr?
library(tidyverse)
n=5
a = seq(1,n,length.out=n)
b = rep(2,n)
c = rep(10,n)
data = tibble(a,b,c)
We may do
library(dplyr)
data %>%
rowwise %>%
mutate(new = seq(b, c, length.out = n)[a]) %>%
ungroup
-output
# A tibble: 5 × 4
a b c new
<dbl> <dbl> <dbl> <dbl>
1 1 2 10 2
2 2 2 10 4
3 3 2 10 6
4 4 2 10 8
5 5 2 10 10
If you want this done "by group" for each a value (creating many new rows), we can create the sequence as a list column and then unnest it:
data %>%
mutate(result = map2(b, c, seq, length.out = n)) %>%
unnest(result)
# # A tibble: 25 × 4
# a b c result
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 10 2
# 2 1 2 10 4
# 3 1 2 10 6
# 4 1 2 10 8
# 5 1 2 10 10
# 6 2 2 10 2
# 7 2 2 10 4
# 8 2 2 10 6
# 9 2 2 10 8
# 10 2 2 10 10
# # … with 15 more rows
# # ℹ Use `print(n = ...)` to see more rows
If you want to keep the same number of rows and go from the first b value to the last c value, we can use seq directly in mutate:
data %>%
mutate(result = seq(from = first(b), to = last(c), length.out = n()))
# # A tibble: 5 × 4
# a b c result
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 10 2
# 2 2 2 10 4
# 3 3 2 10 6
# 4 4 2 10 8
# 5 5 2 10 10
This one?
library(dplyr)
data %>%
  mutate(c1 = a*b)
a b c c1
1 1 2 10 2
2 2 2 10 4
3 3 2 10 6
4 4 2 10 8
5 5 2 10 10

Create a new id by matching two column values

I have the following data and I want to create a new id called newid using the columns id and class. The same id value can repeat with a different class value, so I need to create a new id by matching both id and class.
data <- data.frame(id=c(1,1,1,1,1, 2,2,2,2,2,3,3,3,3,1,1,1,1,2,2,2,4,4,4),
class=c('x','x','x','x','x', 'y','y','y','y','y', 'z','z','z','z', 'w','w','w','w', 'v','v','v','n','n','n'))
Expected output
id class newid
1 1 x 1
2 1 x 1
3 1 x 1
4 1 x 1
5 1 x 1
6 2 y 2
7 2 y 2
8 2 y 2
9 2 y 2
10 2 y 2
11 3 z 3
12 3 z 3
13 3 z 3
14 3 z 3
15 1 w 4
16 1 w 4
17 1 w 4
18 1 w 4
19 2 v 5
20 2 v 5
21 2 v 5
22 4 n 6
23 4 n 6
24 4 n 6
You could use match():
library(dplyr)
data %>%
mutate(grp = paste(id, class),
newid = match(grp, unique(grp))) %>%
select(-grp)
id class newid
1 1 x 1
2 1 x 1
3 1 x 1
4 1 x 1
5 1 x 1
6 2 y 2
7 2 y 2
8 2 y 2
9 2 y 2
10 2 y 2
11 3 z 3
12 3 z 3
13 3 z 3
14 3 z 3
15 1 w 4
16 1 w 4
17 1 w 4
18 1 w 4
19 2 v 5
20 2 v 5
21 2 v 5
22 4 n 6
23 4 n 6
24 4 n 6
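The same appearance-order id can also be built in base R (a sketch, assuming the same `data` as above; `key` is just a temporary helper name):
key <- interaction(data$id, data$class, drop = TRUE)   # one level per (id, class) combination
data$newid <- match(key, unique(key))                  # number combinations in order of first appearance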
One option is to use cur_group_id() but see the note at the end.
data %>%
group_by(id, class) %>%
mutate(newid = cur_group_id()) %>%
ungroup()
## A tibble: 24 × 3
# id class newid
# <dbl> <chr> <int>
# 1 1 x 2
# 2 1 x 2
# 3 1 x 2
# 4 1 x 2
# 5 1 x 2
# 6 2 y 4
# 7 2 y 4
# 8 2 y 4
# 9 2 y 4
#10 2 y 4
## … with 14 more rows
## ℹ Use `print(n = ...)` to see more rows
Note: This creates a unique newid per (id, class) combination; the order is different from your expected output in that it uses numerical/lexicographical ordering: (1, w) comes before (1, x) which comes before (2, v) and so on.
So as long as you don't care about the actual value, cur_group_id() will always create a unique id per value combination of the grouping variables.
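If you do care about the actual values and want them to match the expected output, one option (a sketch, not part of the original answer) is to remap the cur_group_id() values to the order in which each combination first appears:
data %>%
  group_by(id, class) %>%
  mutate(newid = cur_group_id()) %>%
  ungroup() %>%
  mutate(newid = match(newid, unique(newid)))  # renumber in order of first appearance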

Use dynamically generated column names in dplyr

I have a data frame with multiple columns. The user provides a vector of column names, and I want to count the maximum number of times a value appears across those columns within each row.
set.seed(42)
df <- tibble(
var1 = sample(c(1:3),10,replace=T),
var2 = sample(c(1:3),10,replace=T),
var3 = sample(c(1:3),10,replace=T)
)
select_vars <- c("var1", "var3")
df %>%
rowwise() %>%
mutate(consensus=max(table(unlist(c(var1,var3)))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
This does exactly what I want, but when I try to use the vector of variable names I can't get it to work:
df %>%
  rowwise() %>%
  mutate(consensus = max(unlist(table(select_vars))))
You can wrap it in c(!!!syms()) to get it working, and you don't need the unlist apparently. But honestly, I'm not sure what you are trying to do, or why table is needed here. Do you just want to check whether var2 and var3 have the same value, giving 2 if they do and 1 if they don't?
library(dplyr)
df <- tibble(
var1 = sample(c(1:3),10,replace=T),
var2 = sample(c(1:3),10,replace=T),
var3 = sample(c(1:3),10,replace=T)
)
select_vars <- c("var2", "var3")
df %>%
rowwise() %>%
mutate(consensus=max(table(c(!!!syms(select_vars)))))
#> # A tibble: 10 x 4
#> # Rowwise:
#> var1 var2 var3 consensus
#> <int> <int> <int> <int>
#> 1 2 3 2 1
#> 2 3 1 3 1
#> 3 3 1 1 2
#> 4 3 3 3 2
#> 5 1 1 2 1
#> 6 2 1 3 1
#> 7 3 2 3 1
#> 8 1 2 3 1
#> 9 2 1 2 1
#> 10 2 1 1 2
Created on 2021-07-22 by the reprex package (v0.3.0)
In the OP's code, we need select():
library(dplyr)
df %>%
rowwise() %>%
mutate(consensus=max(table(unlist(select(cur_data(), select_vars))) ))
-output
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Or just subset cur_data(), which returns only the data for the current group while keeping the group attributes:
df %>%
rowwise %>%
mutate(consensus = max(table(unlist(cur_data()[select_vars]))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Or using pmap
library(purrr)
df %>%
mutate(consensus = pmap_dbl(cur_data()[select_vars], ~ max(table(c(...)))))
# A tibble: 10 x 4
var1 var2 var3 consensus
<int> <int> <int> <dbl>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
As these are rowwise operations, we can get some efficiency by using collapse functions:
library(collapse)
tfm(df, consensus = dapply(slt(df, select_vars), MARGIN = 1,
FUN = function(x) fmax(tabulate(x))))
# A tibble: 10 x 4
var1 var2 var3 consensus
* <int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Benchmarks
As noted above, collapse is faster (run on a slightly bigger dataset)
df1 <- df[rep(seq_len(nrow(df)), 1e5), ]
system.time({
tfm(df1, consensus = dapply(slt(df1, select_vars), MARGIN = 1,
FUN = function(x) fmax(tabulate(x))))
})
#user system elapsed
# 5.257 0.123 5.323
system.time({
df1 %>%
mutate(consensus = pmap_dbl(cur_data()[select_vars], ~ max(table(c(...)))))
})
#user system elapsed
# 54.813 0.517 55.246
The rowwise operation was taking too long, so the execution was stopped:
system.time({
  df1 %>%
    rowwise() %>%
    mutate(consensus = max(table(unlist(select(cur_data(), select_vars)))))
})
#Timing stopped at: 575.5 3.342 581.3
Another suggestion was to use all_of(). Note, however, that outside of a selecting context all_of(select_vars) just returns the character vector of names, so table() here counts the column names rather than their values, which is why consensus is 1 in every row of the output below:
df %>%
rowwise() %>%
mutate(consensus=max(table(unlist(all_of(select_vars)))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 2 3 3 1
2 2 2 2 1
3 1 2 2 1
4 2 3 3 1
5 1 2 1 1
6 2 1 2 1
7 2 2 2 1
8 3 1 2 1
9 2 1 3 1
10 3 2 1 1
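If you want a tidyselect-style version that does evaluate the column values, here is a sketch assuming dplyr >= 1.1.0, where pick() supersedes cur_data():
df %>%
  rowwise() %>%
  mutate(consensus = max(table(unlist(pick(all_of(select_vars)))))) %>%
  ungroup()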

Create new rows based on numbers in a column [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 1 year ago.
I currently have a table with a quantity in it.
ID Code Quantity
1  A    1
2  B    3
3  C    2
4  D    1
Is there any way to get this table?
ID Code Quantity
1  A    1
2  B    1
2  B    1
2  B    1
3  C    1
3  C    1
4  D    1
I need to break out the quantity and have that many rows.
Thanks!!!!
We can use tidyr::uncount(), keeping the original Quantity column and adding a new quantity-of-one column:
library(dplyr)
library(tidyr)
df %>%
group_by(ID) %>%
uncount(Quantity, .remove = FALSE) %>%
mutate(NewQ = 1)
# A tibble: 7 x 4
# Groups: ID [4]
ID Code Quantity NewQ
<int> <chr> <int> <dbl>
1 1 A 1 1
2 2 B 3 1
3 2 B 3 1
4 2 B 3 1
5 3 C 2 1
6 3 C 2 1
7 4 D 1 1
Alternatively, without replacing the existing Quantity column, we can build a collapsed string of 1s and split it into rows with separate_rows():
df %>%
group_by(ID) %>%
mutate(NewQ = ifelse(Quantity != 1, paste(rep(1, Quantity), collapse = ", "),
as.character(Quantity))) %>%
separate_rows(NewQ) %>%
mutate(NewQ = as.numeric(NewQ))
# A tibble: 7 x 4
# Groups: ID [4]
ID Code Quantity NewQ
<int> <chr> <int> <dbl>
1 1 A 1 1
2 2 B 3 1
3 2 B 3 1
4 2 B 3 1
5 3 C 2 1
6 3 C 2 1
7 4 D 1 1
We could use slice
library(dplyr)
df %>%
group_by(ID) %>%
slice(rep(1:n(), each = Quantity)) %>%
mutate(Quantity= rep(1))
Output:
ID Code Quantity
<dbl> <chr> <dbl>
1 1 A 1
2 2 B 1
3 2 B 1
4 2 B 1
5 3 C 1
6 3 C 1
7 4 D 1
A base R option using rep
transform(
`row.names<-`(df[rep(1:nrow(df), df$Quantity), ], NULL),
Quantity = 1
)
gives
ID Code Quantity
1 1 A 1
2 2 B 1
3 2 B 1
4 2 B 1
5 3 C 1
6 3 C 1
7 4 D 1

How to balance a dataset in `dplyr` using `sample_n` automatically to the size of the smallest class?

I have a dataset like:
df <- tibble(
id = 1:18,
class = rep(c(rep(1,3),rep(2,2),3),3),
var_a = rep(c("a","b"),9)
)
# A tibble: 18 x 3
id class var_a
<int> <dbl> <chr>
1 1 1 a
2 2 1 b
3 3 1 a
4 4 2 b
5 5 2 a
6 6 3 b
7 7 1 a
8 8 1 b
9 9 1 a
10 10 2 b
11 11 2 a
12 12 3 b
13 13 1 a
14 14 1 b
15 15 1 a
16 16 2 b
17 17 2 a
18 18 3 b
That dataset contains a number of observations in several classes. The classes are not balanced. In the sample above we can see that only 3 observations are of class 3, while there are 6 observations of class 2 and 9 observations of class 1.
Now I want to automatically balance that dataset so that all classes are of the same size. So I want a dataset of 9 rows, 3 rows in each class. I can use the sample_n function from dplyr to do such a sampling.
I achieved this by first calculating the smallest class size...
min_length <- as.numeric(df %>%
group_by(class) %>%
summarise(n = n()) %>%
ungroup() %>%
summarise(min = min(n)))
...and then applying the sample_n function:
set.seed(1)
df %>% group_by(class) %>% sample_n(min_length)
# A tibble: 9 x 3
# Groups: class [3]
id class var_a
<int> <dbl> <chr>
1 15 1 a
2 7 1 a
3 13 1 a
4 4 2 b
5 5 2 a
6 17 2 a
7 18 3 b
8 6 3 b
9 12 3 b
I wondered if it's possible to do that (calculating the smallest class size and then sampling) in one go?
You can do it in one step, but it is cheating a little:
set.seed(42)
df %>%
group_by(class) %>%
sample_n(min(table(df$class))) %>%
ungroup()
# # A tibble: 9 x 3
# id class var_a
# <int> <dbl> <chr>
# 1 1 1 a
# 2 8 1 b
# 3 15 1 a
# 4 4 2 b
# 5 5 2 a
# 6 11 2 a
# 7 12 3 b
# 8 18 3 b
# 9 6 3 b
I say "cheating" because normally you would not want to reference df$ from within the pipe. However, because the property we're looking for belongs to the whole frame while table() only sees one group at a time, we need to side-step that a little.
One could do
df %>%
mutate(mn = min(table(class))) %>%
group_by(class) %>%
sample_n(mn[1]) %>%
ungroup()
# # A tibble: 9 x 4
# id class var_a mn
# <int> <dbl> <chr> <int>
# 1 14 1 b 3
# 2 13 1 a 3
# 3 7 1 a 3
# 4 4 2 b 3
# 5 16 2 b 3
# 6 5 2 a 3
# 7 12 3 b 3
# 8 18 3 b 3
# 9 6 3 b 3
Though I don't think that is any more elegant or readable.
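Another way to keep everything in one pipeline without reaching back to df$ is to attach the class counts as a column first; a sketch using add_count() (the helper column names n_class and min_n are arbitrary):
set.seed(42)
df %>%
  add_count(class, name = "n_class") %>%  # per-class sizes as a column
  mutate(min_n = min(n_class)) %>%        # global minimum, computed before grouping
  group_by(class) %>%
  sample_n(first(min_n)) %>%
  ungroup() %>%
  select(-n_class, -min_n)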
