I am doing some coding in R. I am trying to use the doBy package to get a sum total score for a variable (x) by both date (date) and by id (id). The doBy command works fine and I get this output.
data
id date x
1 01/01/2021 1
1 01/02/2021 2
1 01/03/2021 3
2 02/01/2021 2
2 02/02/2021 3
2 02/02/2021 4
3 03/11/2021 3
3 03/12/2021 3
3 03/13/2021 2
I want to recode the date so that each person's chronologically first date is 1, the second date is 2, the third date is 3, and so on. I want my data to look something like this.
data2
id daycount x
1 1 1
1 2 2
1 3 3
2 1 2
2 2 3
2 3 4
3 1 3
3 2 3
3 3 2
I was able to order the days using order(), but I am not sure how to get the dates to match up. I think I need some kind of sequence or autonumber. Also, participants may have different numbers of days: some may have 1 day and others may have 10 days.
1) doBy. Assuming that the dates are already sorted within id:
library(doBy)
transform_by(data, ~ id, countdays = seq_along(id))
giving:
id date x countdays
1 1 01/01/2021 1 1
2 1 01/02/2021 2 2
3 1 01/03/2021 3 3
4 2 02/01/2021 2 1
5 2 02/02/2021 3 2
6 2 02/02/2021 4 3
7 3 03/11/2021 3 1
8 3 03/12/2021 3 2
9 3 03/13/2021 2 3
2) Base R. It could also be done using transform and ave in base R.
transform(data, daycount = ave(id, id, FUN = seq_along))
giving:
id date x daycount
1 1 01/01/2021 1 1
2 1 01/02/2021 2 2
3 1 01/03/2021 3 3
4 2 02/01/2021 2 1
5 2 02/02/2021 3 2
6 2 02/02/2021 4 3
7 3 03/11/2021 3 1
8 3 03/12/2021 3 2
9 3 03/13/2021 2 3
Note
data in reproducible form:
Lines <- "id date x
1 01/01/2021 1
1 01/02/2021 2
1 01/03/2021 3
2 02/01/2021 2
2 02/02/2021 3
2 02/02/2021 4
3 03/11/2021 3
3 03/12/2021 3
3 03/13/2021 2"
data <- read.table(text = Lines, header = TRUE)
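Both approaches above assume the rows are already in date order within each id. If they are not, one option is to parse the dates, sort, and only then number the rows. The following is a minimal base R sketch of that idea; the shuffled example data and the `date_parsed` column name are made up for illustration.

```r
# Hypothetical example data, deliberately out of order
data <- data.frame(
  id   = c(2, 1, 1, 2, 1),
  date = c("02/02/2021", "01/03/2021", "01/01/2021", "02/01/2021", "01/02/2021"),
  x    = c(3, 3, 1, 2, 2)
)

# Parse the month/day/year strings so that sorting is chronological
data$date_parsed <- as.Date(data$date, format = "%m/%d/%Y")

# Sort by id, then by date within id
data <- data[order(data$id, data$date_parsed), ]

# Number the rows 1, 2, 3, ... within each id
data$daycount <- ave(data$id, data$id, FUN = seq_along)
data
```

Sorting on the parsed date rather than the original string matters once the data span more than one year, because "mm/dd/yyyy" strings do not sort chronologically across years.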
You may want to use group_by on id and then create a new column using rank or dense_rank (note that the two handle duplicate dates differently).
To recreate your data, I used:
# recreate data frame
id_vec <- rep(c(1,2,3), each = 3)
date_vec <- c(
'01/01/2021',
'01/02/2021',
'01/03/2021',
'02/01/2021',
'02/02/2021',
'02/02/2021',
'03/11/2021',
'03/12/2021',
'03/13/2021'
)
x_vec <- rep(c(1,2,3), times = 3)
data <- data.frame(id = id_vec, date = date_vec, x = x_vec)
I also converted the date column to an actual date format for your convenience:
# convert string to date object
library(lubridate)
library(dplyr)
data <- data %>% mutate(date_formatted = mdy(date))
Creating a column with rank:
data %>%
group_by(id) %>%
mutate(day_count = rank(date_formatted, ties.method = "first")) %>%
ungroup()
# # A tibble: 9 x 5
# id date x date_formatted day_count
# <dbl> <chr> <dbl> <date> <int>
# 1 1 01/01/2021 1 2021-01-01 1
# 2 1 01/02/2021 2 2021-01-02 2
# 3 1 01/03/2021 3 2021-01-03 3
# 4 2 02/01/2021 1 2021-02-01 1
# 5 2 02/02/2021 2 2021-02-02 2
# 6 2 02/02/2021 3 2021-02-02 3
# 7 3 03/11/2021 1 2021-03-11 1
# 8 3 03/12/2021 2 2021-03-12 2
# 9 3 03/13/2021 3 2021-03-13 3
Creating new column with dense_rank:
data %>%
group_by(id) %>%
mutate(day_count = dense_rank(date_formatted)) %>%
ungroup()
# # A tibble: 9 x 5
# id date x date_formatted day_count
# <dbl> <chr> <dbl> <date> <int>
# 1 1 01/01/2021 1 2021-01-01 1
# 2 1 01/02/2021 2 2021-01-02 2
# 3 1 01/03/2021 3 2021-01-03 3
# 4 2 02/01/2021 1 2021-02-01 1
# 5 2 02/02/2021 2 2021-02-02 2
# 6 2 02/02/2021 3 2021-02-02 2
# 7 3 03/11/2021 1 2021-03-11 1
# 8 3 03/12/2021 2 2021-03-12 2
# 9 3 03/13/2021 3 2021-03-13 3
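To make the difference in tie handling easier to see, here is a small base R sketch on a single group of dates with one duplicate. The vector names are illustrative, and `dense` is only intended to mirror what dense_rank does, using match against the sorted unique values.

```r
# One group's dates, with 02/02/2021 appearing twice
dates <- as.Date(c("02/01/2021", "02/02/2021", "02/02/2021", "02/05/2021"),
                 format = "%m/%d/%Y")
d <- as.numeric(dates)  # days since 1970-01-01

# Ties broken by order of appearance: every row gets its own number
rank_first <- rank(d, ties.method = "first")

# Ties share the lowest rank, and the next rank is skipped
rank_min <- rank(d, ties.method = "min")

# Ties share a rank and no rank is skipped (dense ranking)
dense <- match(d, sort(unique(d)))

rbind(rank_first, rank_min, dense)
```

If repeated dates should count as the same day, the dense version is the one that matches that intent; if every row should get its own number, use the "first" tie method.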
Related
So basically I have a data frame that looks like this:
BX BY
1 12
1 12
1 12
2 14
2 14
3 5
I want to create another column, ID, which will have the same number for rows with the same values in BX and BY. The table would then look like this:
BX BY ID
1 12 1
1 12 1
1 12 1
2 14 2
2 14 2
3 5 3
Here is a base R way.
Subset the data.frame by the grouping columns, find the duplicated rows and use a standard cumsum trick.
df1<-'BX BY
1 12
1 12
1 12
2 14
2 14
3 5'
df1 <- read.table(textConnection(df1), header = TRUE)
cumsum(!duplicated(df1[c("BX", "BY")]))
#> [1] 1 1 1 2 2 3
df1$ID <- cumsum(!duplicated(df1[c("BX", "BY")]))
df1
#> BX BY ID
#> 1 1 12 1
#> 2 1 12 1
#> 3 1 12 1
#> 4 2 14 2
#> 5 2 14 2
#> 6 3 5 3
Created on 2022-10-12 with reprex v2.0.2
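One caveat with the cumsum trick is that it relies on identical rows being adjacent. The sketch below uses a made-up data frame (`df2`) in which a BX/BY pair reappears later, and compares the cumsum result with an alternative that matches a pasted key against its unique values.

```r
# Hypothetical data where the pair (1, 12) appears again after (2, 14)
df2 <- data.frame(BX = c(1, 2, 1), BY = c(12, 14, 12))

# cumsum over "not a duplicate": the third row inherits the running count
id_cumsum <- cumsum(!duplicated(df2[c("BX", "BY")]))

# Build one key per row and number keys by order of first appearance
key <- paste(df2$BX, df2$BY, sep = "_")
id_match <- match(key, unique(key))

data.frame(df2, id_cumsum, id_match)
```

With sorted or grouped data, as in the question, both give the same IDs; they only diverge when equal rows are separated.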
You can do (here dat is the original two-column data frame):
transform(dat, ID = as.numeric(interaction(dat, drop = TRUE, lex.order = TRUE)))
BX BY ID
1 1 12 1
2 1 12 1
3 1 12 1
4 2 14 2
5 2 14 2
6 3 5 3
Or if you prefer dplyr:
library(dplyr)
dat %>%
group_by(across(everything())) %>%
mutate(ID = cur_group_id()) %>%
ungroup()
# A tibble: 6 × 3
BX BY ID
<dbl> <dbl> <int>
1 1 12 1
2 1 12 1
3 1 12 1
4 2 14 2
5 2 14 2
6 3 5 3
I have some cluster labels from kmeans run on different ids (reprex below). The problem is that the kmeans cluster codes are not ordered consistently across ids, although all ids have 3 clusters.
reprex = data.frame(id = rep(1:2, each = 4),
v1 = rep(seq(1:4), 2),
cluster = c(2,2,1,3,3,1,2,2))
reprex
id v1 cluster
1 1 1 2
2 1 2 2
3 1 3 1
4 1 4 3
5 2 1 3
6 2 2 1
7 2 3 2
8 2 4 2
What I want is for the variable cluster to always start with 1 within each id. Note that I don't want to reorder the data frame by cluster; the row order needs to remain the same. The desired result would be:
reprex_desired<- data.frame(id = rep(1:2, each = 4),
v1 = rep(seq(1:4), 2),
cluster = c(2,2,1,3,3,1,2,2),
what_iWant = c(1,1,2,3,1,2,3,3))
reprex_desired
id v1 cluster what_iWant
1 1 1 2 1
2 1 2 2 1
3 1 3 1 2
4 1 4 3 3
5 2 1 3 1
6 2 2 1 2
7 2 3 2 3
8 2 4 2 3
We can use match after grouping by 'id'
library(dplyr)
reprex <- reprex %>%
group_by(id) %>%
mutate(what_IWant = match(cluster, unique(cluster))) %>%
ungroup
-output
reprex
# A tibble: 8 × 4
id v1 cluster what_IWant
<int> <int> <dbl> <int>
1 1 1 2 1
2 1 2 2 1
3 1 3 1 2
4 1 4 3 3
5 2 1 3 1
6 2 2 1 2
7 2 3 2 3
8 2 4 2 3
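For comparison, the same match-against-unique idea can be sketched in base R with ave, which applies a function within each id and returns the results in the original row order. The column name `relabelled` is just an illustrative choice.

```r
# Data from the question
reprex <- data.frame(
  id      = rep(1:2, each = 4),
  v1      = rep(1:4, 2),
  cluster = c(2, 2, 1, 3, 3, 1, 2, 2)
)

# Within each id, number cluster codes by order of first appearance
reprex$relabelled <- ave(
  reprex$cluster, reprex$id,
  FUN = function(cl) match(cl, unique(cl))
)
reprex
```

Because unique() keeps values in order of first appearance, the first cluster seen in each id always becomes 1, and the rows are never reordered.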
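Here is a version with cumsum combined with lag. Note that it numbers consecutive runs of equal values, so it gives the desired labels only when each cluster's rows are contiguous within an id, as they are here:
library(dplyr)
df %>%
group_by(id) %>%
mutate(what_i_want = cumsum(cluster != lag(cluster, default = first(cluster))) + 1)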
id v1 cluster what_i_want
<int> <int> <dbl> <dbl>
1 1 1 2 1
2 1 2 2 1
3 1 3 1 2
4 1 4 3 3
5 2 1 3 1
6 2 2 1 2
7 2 3 2 3
8 2 4 2 3
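The cumsum approach and the match approach agree on the question's data, but they answer slightly different questions. A small base R sketch on a made-up vector in which a cluster code comes back after an interruption shows where they differ.

```r
# Hypothetical cluster codes for one id: code 2 reappears in position 3
cluster <- c(2, 1, 2, 3)

# Run numbering: increment every time the value changes from the previous row
run_id <- cumsum(c(TRUE, cluster[-1] != cluster[-length(cluster)]))

# First-appearance numbering: the same code always gets the same label
first_seen <- match(cluster, unique(cluster))

rbind(cluster, run_id, first_seen)
```

If a cluster code can reappear within an id, first-appearance numbering is probably what is wanted for relabelling clusters.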
This is my df:
df = tibble(week = c(1,1,2,2,3,3,3,4,4,4,4),
session = c(1,2,1,2,1,2,3,1,2,3,4),
work =rep("done",11))
df
# A tibble: 11 x 3
week session work
<dbl> <dbl> <chr>
1 1 1 done
2 1 2 done
3 2 1 done
4 2 2 done
5 3 1 done
6 3 2 done
7 3 3 done
8 4 1 done
9 4 2 done
10 4 3 done
11 4 4 done
For each week there should be 4 rows with session 1 to 4.
How can I add the "missing" session rows (the rest of the variables are NA) so the df is:
df1= tibble(week = c(rep(1,4), rep(2,4), rep(3,4), rep(4,4)),
session = rep(1:4,4),
work = c("done", "done" ,NA, NA, "done", "done" ,NA, NA,"done", "done" ,"done", NA, rep("done",4)))
df1
week session work
<dbl> <int> <chr>
1 1 1 done
2 1 2 done
3 1 3 NA
4 1 4 NA
5 2 1 done
6 2 2 done
7 2 3 NA
8 2 4 NA
9 3 1 done
10 3 2 done
11 3 3 done
12 3 4 NA
13 4 1 done
14 4 2 done
15 4 3 done
16 4 4 done
tidyr::complete(df, week, session)
# A tibble: 16 x 3
week session work
<dbl> <dbl> <chr>
1 1 1 done
2 1 2 done
3 1 3 NA
4 1 4 NA
5 2 1 done
6 2 2 done
7 2 3 NA
8 2 4 NA
9 3 1 done
10 3 2 done
11 3 3 done
12 3 4 NA
13 4 1 done
14 4 2 done
15 4 3 done
16 4 4 done
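As a rough base R counterpart, the same result can be sketched by building every week/session combination with expand.grid and left-joining the observed rows onto it. The names `grid` and `full` are illustrative, and the sessions are listed explicitly here on the assumption that every week should have sessions 1 to 4.

```r
# Data from the question, as a plain data frame
df <- data.frame(
  week    = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  session = c(1, 2, 1, 2, 1, 2, 3, 1, 2, 3, 4),
  work    = "done"
)

# Every combination of observed week and required session
grid <- expand.grid(week = sort(unique(df$week)), session = c(1, 2, 3, 4))

# Keep all grid rows; combinations missing from df get NA in work
full <- merge(grid, df, by = c("week", "session"), all.x = TRUE)
full <- full[order(full$week, full$session), ]
full
```

Listing the sessions explicitly means a session that never occurs in any week would still be added, which a grid built only from observed values would miss.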
Here's a data.table solution in case speed is important
# load package
library(data.table)
# set as data table
setDT(df)
# cross join to get complete combination
week <- 1:4
session <- 1:4
z <- CJ(week,session)
# join
df_1 <- df[z, on=.(week, session)]
data
data=data.frame("person"=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
"score"=c(1,2,1,2,3,1,3,NA,4,2,1,NA,2,NA,3,1,2,4),
"want"=c(1,2,1,2,3,3,3,3,4,2,1,1,2,2,3,3,3,4))
attempt
library(dplyr)
data = data %>%
group_by(person) %>%
mutate(wantTEST = ifelse(score >= 3 | (row_number() >= which.max(score == 3)),
cummax(score), score),
wantTEST = replace(wantTEST, duplicated(wantTEST == 4) & wantTEST == 4, NA))
I am basically trying to use the cummax function, but only under specific circumstances. I want to keep any values as they are (1-2-1-1), except that once there is a 3 or 4 the running maximum should be held: (1-2-1-3-2-1-4) should become (1-2-1-3-3-3-4). If there is an NA value, I want to carry forward the previous value. Thank you.
Here's one way with tidyverse. You may want to call fill() after group_by() so that values are not carried forward from one person to the next, but the question leaves that unclear.
library(tidyverse)
data %>%
fill(score) %>%
group_by(person) %>%
mutate(
w = ifelse(cummax(score) > 2, cummax(score), score)
) %>%
ungroup()
# A tibble: 18 x 4
person score want w
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 1 2 2 2
3 1 1 1 1
4 1 2 2 2
5 1 3 3 3
6 1 1 3 3
7 1 3 3 3
8 1 3 3 3
9 1 4 4 4
10 2 2 2 2
11 2 1 1 1
12 2 1 1 1
13 2 2 2 2
14 2 2 2 2
15 2 3 3 3
16 2 1 3 3
17 2 2 3 3
18 2 4 4 4
One way to do this is to first fill the NA values and then, for each row, check whether the group's first score of 3 has been reached by that row. If it has, we take the maximum score up to that row; otherwise we return the score unchanged.
library(tidyverse)
data %>%
fill(score) %>%
group_by(person) %>%
mutate(want1 = map_dbl(seq_len(n()), ~if(. >= which.max(score == 3))
max(score[seq_len(.)]) else score[.]))
# person score want want1
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1
# 2 1 2 2 2
# 3 1 1 1 1
# 4 1 2 2 2
# 5 1 3 3 3
# 6 1 1 3 3
# 7 1 3 3 3
# 8 1 3 3 3
# 9 1 4 4 4
#10 2 2 2 2
#11 2 1 1 1
#12 2 1 1 1
#13 2 2 2 2
#14 2 2 2 2
#15 2 3 3 3
#16 2 1 3 3
#17 2 2 3 3
#18 2 4 4 4
Another way is to use accumulate from purrr. I use if_else_ from hablar for type stability:
library(tidyverse)
library(hablar)
data %>%
fill(score) %>%
group_by(person) %>%
mutate(wt = accumulate(score, ~if_else_(.x > 2, max(.x, .y), .y)))
I have data in which subjects completed multiple ratings per day over 6-7 days. The number of ratings per day varies. The data set includes subject ID, date, and the ratings. I would like to create a new variable that recodes the dates for each subject into "study day" --- so 1 for first day of ratings, 2 for second day of ratings, etc.
For example, I would like to take this:
id Date Rating
1 10/20/2018 2
1 10/20/2018 3
1 10/20/2018 5
1 10/21/2018 1
1 10/21/2018 7
1 10/21/2018 9
1 10/22/2018 4
1 10/22/2018 5
1 10/22/2018 9
2 11/15/2018 1
2 11/15/2018 3
2 11/15/2018 4
2 11/16/2018 3
2 11/16/2018 1
2 11/17/2018 0
2 11/17/2018 2
2 11/17/2018 9
And end up with this:
id Day Date Rating
1 1 10/20/2018 2
1 1 10/20/2018 3
1 1 10/20/2018 5
1 2 10/21/2018 1
1 2 10/21/2018 7
1 2 10/21/2018 9
1 3 10/22/2018 4
1 3 10/22/2018 5
1 3 10/22/2018 9
2 1 11/15/2018 1
2 1 11/15/2018 3
2 1 11/15/2018 4
2 2 11/16/2018 3
2 2 11/16/2018 1
2 3 11/17/2018 0
2 3 11/17/2018 2
2 3 11/17/2018 9
I was going to look into setting up some kind of loop, but I thought it would be worth asking if there is a more efficient way to pull this off. Are there any functions that would allow me to automate this sort of thing? Thanks very much for any suggestions.
Since you want to reset the count for every id, this question is a bit different.
Using only base R, we can split the Date based on id and then create a count of each distinct group.
df$Day <- unlist(lapply(split(df$Date, df$id), function(x) match(x, unique(x))))
df
# id Date Rating Day
#1 1 10/20/2018 2 1
#2 1 10/20/2018 3 1
#3 1 10/20/2018 5 1
#4 1 10/21/2018 1 2
#5 1 10/21/2018 7 2
#6 1 10/21/2018 9 2
#7 1 10/22/2018 4 3
#8 1 10/22/2018 5 3
#9 1 10/22/2018 9 3
#10 2 11/15/2018 1 1
#11 2 11/15/2018 3 1
#12 2 11/15/2018 4 1
#13 2 11/16/2018 3 2
#14 2 11/16/2018 1 2
#15 2 11/17/2018 0 3
#16 2 11/17/2018 2 3
#17 2 11/17/2018 9 3
I don't know how I missed this, but thanks to @thelatemail, who reminded me that this is basically the same as
library(dplyr)
df %>%
group_by(id) %>%
mutate(Day = match(Date, unique(Date)))
and
df$Day <- as.numeric(with(df, ave(Date, id, FUN = function(x) match(x, unique(x)))))
If you want a slightly hacky dplyr version, you can convert the date column to a numeric date and then shift it so that each id's first date becomes 1. Note that this counts calendar days, so a gap between rating days leaves a gap in the numbering.
library(tidyverse)
library(lubridate)
df <- tibble(id=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
Date= c('10/20/2018', '10/20/2018', '10/20/2018', '10/21/2018', '10/21/2018', '10/21/2018',
'10/22/2018', '10/22/2018', '10/22/2018','11/15/2018', '11/15/2018', '11/15/2018',
'11/16/2018', '11/16/2018', '11/17/2018', '11/17/2018', '11/17/2018'),
Rating=c(2,3,5,1,7,9,4,5,9,1,3,4,3,1,0,2,9))
df %>%
group_by(id) %>%
mutate(
Date = mdy(Date),
Day = as.numeric(Date),
Day = Day-min(Day)+1)
# A tibble: 17 x 4
# Groups: id [2]
id Date Rating Day
<dbl> <date> <dbl> <dbl>
1 1 2018-10-20 2 1
2 1 2018-10-20 3 1
3 1 2018-10-20 5 1
4 1 2018-10-21 1 2
5 1 2018-10-21 7 2
6 1 2018-10-21 9 2
7 1 2018-10-22 4 3
8 1 2018-10-22 5 3
9 1 2018-10-22 9 3
10 2 2018-11-15 1 1
11 2 2018-11-15 3 1
12 2 2018-11-15 4 1
13 2 2018-11-16 3 2
14 2 2018-11-16 1 2
15 2 2018-11-17 0 3
16 2 2018-11-17 2 3
17 2 2018-11-17 9 3
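The date-offset approach and the match approach give the same numbers on the question's data because the rating days are consecutive. The base R sketch below uses made-up dates with gaps to show how the two differ; the variable names are illustrative.

```r
# Hypothetical dates for one subject, with gaps between rating days
dates <- as.Date(c("10/20/2018", "10/20/2018", "10/22/2018", "10/25/2018"),
                 format = "%m/%d/%Y")
d <- as.numeric(dates)  # days since 1970-01-01

# Calendar offset from the first date: gaps in dates stay as gaps
calendar_day <- d - min(d) + 1

# Ordinal study day: the n-th distinct date gets the number n
study_day <- match(d, sort(unique(d)))

rbind(calendar_day, study_day)
```

Which one is appropriate depends on whether "study day" means elapsed calendar days or the count of days on which ratings were made.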