How to add a column with a progressive number based on a condition in R

I am trying to add a column to my existing data set.
The data set has three columns:
Student (which is the column with the participant ID),
Week (the number of the week of the year during which the data were collected),
and
Day (the number of the weekday during which the data were
collected).
Now, the new column Obs that I am trying to create should contain a progressive number (from 1 to n) referring to the week of testing for each student.
I have tried to use group_by in combination with rep but it does not seem to produce the result I want:
Week <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4)
Day <- c(1, 2, 3, 2, 3, 5, 1, 3, 2, 3, 4, 5)
Student <- c("A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C")
fake.db <- data.frame(Student, Week, Day)
library(dplyr)
fake.db %>%
  group_by(Student) %>%
  mutate(Obs = rep(1:length(Student), each = Week))
# Student Week Day Obs
# <fct> <dbl> <dbl> <int>
# 1 A 1 1 1
# 2 A 1 2 2
# 3 A 1 3 3
# 4 B 2 2 1
# 5 B 2 3 2
# 6 B 2 5 3
# 7 B 3 1 4
# 8 B 3 3 5
# 9 C 4 2 1
#10 C 4 3 2
#11 C 4 4 3
#12 C 4 5 4
What I would like to obtain is different. For the first week of data collection, 1 should be reported, and for the students for whom data were collected during a second week, 2 should be reported, etc.:
# Student Week Day Obs
#1 A 1 1 1
#2 A 1 2 1
#3 A 1 3 1
#4 B 2 2 1
#5 B 2 3 1
#6 B 2 5 1
#7 B 3 1 2
#8 B 3 3 2
#9 C 4 2 1
#10 C 4 3 1
#11 C 4 4 1
#12 C 4 5 1

One dplyr possibility could be:
fake.db %>%
  group_by(Student) %>%
  mutate(Obs = cumsum(!duplicated(Week)))
Student Week Day Obs
<fct> <dbl> <dbl> <int>
1 A 1 1 1
2 A 1 2 1
3 A 1 3 1
4 B 2 2 1
5 B 2 3 1
6 B 2 5 1
7 B 3 1 2
8 B 3 3 2
9 C 4 2 1
10 C 4 3 1
11 C 4 4 1
12 C 4 5 1
It groups by the "Student" column and calculates the cumulative sum of non-duplicated "Week" values.
Or:
fake.db %>%
  group_by(Student) %>%
  mutate(Obs = with(rle(Week), rep(seq_along(lengths), lengths)))
It groups by the "Student" column and creates a run-length-type group ID along the "Week" column.
Or:
fake.db %>%
  group_by(Student) %>%
  mutate(Obs = dense_rank(Week))
It groups by the "Student" column and dense-ranks the values in the "Week" column.
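For completeness, a base R sketch of the same per-student week index (an addition, not part of the original answer) could use ave() with match():
# assumes fake.db as defined above; match() indexes each Week within its Student group
fake.db$Obs <- ave(fake.db$Week, fake.db$Student,
                   FUN = function(w) match(w, unique(w)))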

As I understand it, you want to count the weeks since the first test week for each student, i.e. Week 2 is student B's first week of testing, so it gets Obs = 1. That means you can do a grouped mutate:
library(dplyr)
fake.db <- structure(list(Student = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Week = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4), Day = c(1, 2, 3, 2, 3, 5, 1, 3, 2, 3, 4, 5)), class = "data.frame", row.names = c(NA, -12L))
fake.db %>%
  group_by(Student) %>%
  mutate(Obs = Week - min(Week) + 1)
#> # A tibble: 12 x 4
#> # Groups: Student [3]
#> Student Week Day Obs
#> <fct> <dbl> <dbl> <dbl>
#> 1 A 1 1 1
#> 2 A 1 2 1
#> 3 A 1 3 1
#> 4 B 2 2 1
#> 5 B 2 3 1
#> 6 B 2 5 1
#> 7 B 3 1 2
#> 8 B 3 3 2
#> 9 C 4 2 1
#> 10 C 4 3 1
#> 11 C 4 4 1
#> 12 C 4 5 1
Created on 2019-05-10 by the reprex package (v0.2.1)
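One caveat worth noting (an addition, not part of the original answer): Week - min(Week) + 1 only gives the desired numbering when each student's test weeks are consecutive. If a student skipped a week, a rank-based variant along the lines of the sketch below avoids gaps:
fake.db %>%
  group_by(Student) %>%
  mutate(Obs = match(Week, unique(Week)))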

A brief method with by
unlist(by(fake.db, fake.db[, 1], function(x) as.numeric(factor(x[, 2]))))
# A1 A2 A3 B1 B2 B3 B4 B5 C1 C2 C3 C4
# 1 1 1 1 1 1 2 2 1 1 1 1
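Since the rows of fake.db are already ordered by Student, the unlisted vector lines up with the rows and can be attached directly (a sketch, not part of the original answer):
fake.db$Obs <- unlist(by(fake.db, fake.db[, 1],
                         function(x) as.numeric(factor(x[, 2]))))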
Data
fake.db <- structure(list(Student = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
Week = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4), Day = c(1,
2, 3, 2, 3, 5, 1, 3, 2, 3, 4, 5)), class = "data.frame", row.names = c(NA,
-12L))

You can check whether there is a non-zero difference between consecutive weeks:
fake.db %>%
  group_by(Student) %>%
  arrange(Week) %>%
  mutate(Obs = cumsum(c(1, diff(Week) != 0)))
or, if the values aren't numeric, you can compare to the lagged value:
fake.db %>%
  group_by(Student) %>%
  arrange(Week) %>%
  mutate(Obs = cumsum(Week != lag(Week, default = first(Week))) + 1)
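A data.table sketch of the same run-based idea (an addition, assuming the rows are already ordered by Student and Week, as in fake.db) would use rleid():
library(data.table)
setDT(fake.db)[, Obs := rleid(Week), by = Student]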

Related

R incrementing a variable in dplyr

I have the following grouped data frame:
library(dplyr)
# Create a sample dataframe
df <- data.frame(
student = c("A", "A", "A","B","B", "B", "C", "C","C"),
grade = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
age= c(NA, 6, 6, 7, 7, 7, NA, NA, 9)
)
I want to update the age of each student so that it is one plus the age in the previous year, with their age in the first year they appear in the dataset remaining unchanged. For example, student A's age should be NA, 6, 7, student B's age should be 7,8,9, and student C's age should be NA, NA, 9.
How about this:
library(dplyr)
df <- data.frame(
student = c("A", "A", "A","B","B", "B", "C", "C","C"),
grade = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
age= c(NA, 6, 6, 7, 7, 7, NA, NA, 9)
)
df %>%
  group_by(student) %>%
  mutate(age = age + cumsum(!is.na(age)) - 1)
#> # A tibble: 9 × 3
#> # Groups: student [3]
#> student grade age
#> <chr> <dbl> <dbl>
#> 1 A 1 NA
#> 2 A 2 6
#> 3 A 3 7
#> 4 B 1 7
#> 5 B 2 8
#> 6 B 3 9
#> 7 C 1 NA
#> 8 C 2 NA
#> 9 C 3 9
Created on 2022-12-30 by the reprex package (v2.0.1)
In data.table, assuming the rows are already in the 'correct' order:
library(data.table)
setDT(df)[, new_age := age + rowid(age) - 1, by = .(student)]
# student grade age new_age
# 1: A 1 NA NA
# 2: A 2 6 6
# 3: A 3 6 7
# 4: B 1 7 7
# 5: B 2 7 8
# 6: B 3 7 9
# 7: C 1 NA NA
# 8: C 2 NA NA
# 9: C 3 9 9
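For reference, a base R sketch of the same cumulative-count idea (an addition, not part of either answer), again assuming the rows are ordered by grade within each student:
df$age <- ave(df$age, df$student,
              FUN = function(a) a + cumsum(!is.na(a)) - 1)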

Calculate difference between rows in R based on a specifc row for each group

Hi everyone,
I have a dataframe where each ID has multiple visits, from 1 to 5. I am trying to calculate the difference in score between each visit and visit 1, e.g. Score(Visit 5) - Score(Visit 1), and so on. How do I achieve that in R? Below is a sample dataset and the expected result.
structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B"),
Visit = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L), Score = c(16,
15, 13, 12, 12, 20, 19, 18)), class = "data.frame", row.names = c(NA,
-8L))
#> ID Visit Score
#> 1 A 1 16
#> 2 A 2 15
#> 3 A 3 13
#> 4 A 4 12
#> 5 A 5 12
#> 6 B 1 20
#> 7 B 2 19
#> 8 B 3 18
Created on 2021-05-20 by the reprex package (v2.0.0)
Here is the expected output (it was posted as an image in the original question).
Here's a solution using dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Difference = ifelse(Visit == 1, NA, Score[Visit == 1] - Score))
# A tibble: 8 x 4
# Groups: ID [2]
ID Visit Score Difference
<chr> <int> <dbl> <dbl>
1 A 1 16 NA
2 A 2 15 1
3 A 3 13 3
4 A 4 12 4
5 A 5 12 4
6 B 1 20 NA
7 B 2 19 1
8 B 3 18 2
Sample data
df <- data.frame(
ID = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'),
Visit = c(1:5, 1:3),
Score = c(16,15,13,12,12,20,19,18)
)
Side note: next time, I suggest posting sample data using dput() on your data frame rather than an image.
Solution with dplyr using first
data <- data.frame(
ID = c(rep("A", 5), rep("B", 3)),
Visit = c(1:5, 1:3),
Score = c(16, 15, 13, 12, 12, 20, 19, 18))
library(dplyr)
data %>%
  group_by(ID) %>%
  arrange(Visit) %>%
  mutate(Difference = first(Score) - Score)
#> # A tibble: 8 x 4
#> # Groups: ID [2]
#> ID Visit Score Difference
#> <chr> <int> <dbl> <dbl>
#> 1 A 1 16 0
#> 2 A 2 15 1
#> 3 A 3 13 3
#> 4 A 4 12 4
#> 5 A 5 12 4
#> 6 B 1 20 0
#> 7 B 2 19 1
#> 8 B 3 18 2
Created on 2021-05-20 by the reprex package (v2.0.0)
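A base R sketch of the same "difference from the first visit" idea (an addition, not from the original answers), assuming the rows are ordered by Visit within each ID:
data$Difference <- ave(data$Score, data$ID, FUN = function(s) s[1] - s)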

Count non-`NA` of several columns by group using summarize and across from dplyr

I want to use summarize and across from dplyr to count the number of non-NA values by my grouping variable. For example, using these data:
library(tidyverse)
d <- tibble(ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
Col1 = c(5, 8, 2, NA, 2, 2, NA, NA, 1),
Col2 = c(NA, 2, 1, NA, NA, NA, 1, NA, NA),
Col3 = c(1, 5, 2, 4, 1, NA, NA, NA, NA))
# A tibble: 9 x 4
ID Col1 Col2 Col3
<dbl> <dbl> <dbl> <dbl>
1 1 5 NA 1
2 1 8 2 5
3 1 2 1 2
4 2 NA NA 4
5 2 2 NA 1
6 2 2 NA NA
7 3 NA 1 NA
8 3 NA NA NA
9 3 1 NA NA
With a solution resembling:
d %>%
  group_by(ID) %>%
  summarize(across(matches("^Col[1-3]$"),
                   # function to count non-NA per column per ID
                   ))
With the following result:
# A tibble: 3 x 4
ID Col1 Col2 Col3
<dbl> <dbl> <dbl> <dbl>
1 1 3 2 3
2 2 2 0 2
3 3 1 1 0
I hope this is what you are looking for:
library(dplyr)
d %>%
  group_by(ID) %>%
  summarise(across(Col1:Col3, ~ sum(!is.na(.x)), .names = "non-{.col}"))
# A tibble: 3 x 4
ID `non-Col1` `non-Col2` `non-Col3`
<dbl> <int> <int> <int>
1 1 3 2 3
2 2 2 0 2
3 3 1 1 0
Or if you would like to select columns by their shared string you can use this:
d %>%
  group_by(ID) %>%
  summarise(across(contains("Col"), ~ sum(!is.na(.x)), .names = "non-{.col}"))
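For comparison, a base R sketch of the same per-ID non-NA count (an addition, not part of the original answer); na.action = na.pass keeps rows containing NAs so they can be counted:
aggregate(cbind(Col1, Col2, Col3) ~ ID, data = d,
          FUN = function(x) sum(!is.na(x)), na.action = na.pass)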

Count number of times an account_ID is shared between groups in a dataframe in R

I have a dataframe in R that has a large number of bank_account_IDs and Vendor_Codes. Bank_account_IDs should not be shared between Vendor_Codes, but sometimes a fraudulent vendor exists that shares another vendor's bank_account_ID.
I want to add a new field to the dataframe that provides a count for the number of times an account_ID exists with more than 1 Vendor_Code.
My sample dataframe is as follows:
bank_account_ID = c("a", "b", "c", "a", "a", "d", "e", "f", "b", "c")
Vendor_Code = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
df <- data.frame(bank_account_ID, Vendor_Code)
My ideal new dataframe should look something like this:
bank_account_ID Vendor_Code duplicate_count
a 1 2
b 2 1
c 3 1
a 4 2
a 5 2
d 6 0
e 7 0
f 8 0
b 9 1
c 10 1
Thanks in advance!
We can get the number of distinct elements with n_distinct, grouped by 'bank_account_ID', and subtract 1:
library(dplyr)
df %>%
  group_by(bank_account_ID) %>%
  mutate(dupe_count = n_distinct(Vendor_Code) - 1) %>%
  ungroup()
Output:
# A tibble: 10 x 4
# bank_account_ID Vendor_Code duplicate_count dupe_count
# <chr> <int> <int> <dbl>
# 1 a 1 2 2
# 2 b 2 1 1
# 3 c 3 1 1
# 4 a 4 2 2
# 5 a 5 2 2
# 6 d 6 0 0
# 7 e 7 0 0
# 8 f 8 0 0
# 9 b 9 1 1
#10 c 10 1 1
data
df <- structure(list(bank_account_ID = c("a", "b", "c", "a", "a", "d",
"e", "f", "b", "c"), Vendor_Code = 1:10, duplicate_count = c(2L,
1L, 1L, 2L, 2L, 0L, 0L, 0L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
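A base R equivalent sketch (an addition, not part of the original answer) counts the distinct Vendor_Code values per bank_account_ID with ave() and subtracts 1:
df$dupe_count <- ave(df$Vendor_Code, df$bank_account_ID,
                     FUN = function(x) length(unique(x)) - 1)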

R - After grouping, how do I get the maximum times a value is repeated?

Say I have a dataset like this:
id <- c(1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
foo <- c('a', 'b', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'a', 'a')
dat <- data.frame(id, foo)
I.e.,
id foo
1 1 a
2 1 b
3 2 a
4 2 a
5 2 b
6 2 b
7 2 b
8 3 c
9 3 c
10 3 a
11 3 a
For each id, how would I get the max repetition of the values of foo?
I.e.,
id max_repeat
1 1 1
2 2 3
3 3 2
For example, id 2 has a max_repeat of 3 because one of its values of foo (b) is repeated 3 times.
Using tidyverse:
dat %>%
  group_by(id, foo) %>%       # grouping by id and foo
  tally() %>%                 # calculating the count per combination
  group_by(id) %>%
  summarise(res = max(n))     # keeping the max count per id
id res
<dbl> <dbl>
1 1. 1.
2 2. 3.
3 3. 2.
dplyr
library(tidyverse)
dat %>%
  group_by(id) %>%
  summarise(max_repeat = max(tabulate(foo)))
# # A tibble: 3 x 2
# id max_repeat
# <dbl> <int>
# 1 1 1
# 2 2 3
# 3 3 2
data.table
library(data.table)
setDT(dat)
dat[, .(max_repeat = max(tabulate(foo))), by = id]
# id max_repeat
# 1: 1 1
# 2: 2 3
# 3: 3 2
base (can use setNames to change the name if needed)
aggregate(foo ~ id, dat, function(x) max(tabulate(x)))
# id foo
# 1 1 1
# 2 2 3
# 3 3 2
Without packages you could combine two aggregate()s, one with length and one with maximum.
x1 <- with(dat, aggregate(list(count=id), list(id=id, foo=foo), FUN=length))
x2 <- with(x1, aggregate(list(max_repeat=count), list(id=id), FUN=max))
Yields:
> x2
id max_repeat
1 1 1
2 2 3
3 3 2
Data:
dat <- structure(list(id = c(1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3), foo = structure(c(1L,
2L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 1L, 1L), .Label = c("a", "b",
"c"), class = "factor")), class = "data.frame", row.names = c(NA,
-11L))
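One further note (an addition, not from the answers above): tabulate() needs a factor or integer input, so if foo were stored as character you could use table() instead, for example:
dat %>%
  group_by(id) %>%
  summarise(max_repeat = max(table(foo)))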
