How to properly merge a variable by group from another data frame? - r

I have two datasets: one is a baseline and the other is a follow-up dataset.
df1 is the baseline (cross-sectional) data with id, date, score1, score2, baselevel, and basegrade.
df2 has id, date, score1, and score2, in long format with multiple rows per id.
df1 <- as.data.frame(cbind(id = c(1, 2, 3),
                           date = c("2020-06-03", "2020-07-02", "2020-06-11"),
                           score1 = c(6, 8, 5),
                           score2 = c(1, 1, 6),
                           baselevel = c(2, 2, 2),
                           basegrade = c("B", "B", "A")))
df2 <- as.data.frame(cbind(id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                           date = c("2020-06-10", "2020-06-17", "2020-06-24",
                                    "2020-07-01", "2020-07-03", "2020-07-10",
                                    "2020-07-17", "2020-06-14", "2020-06-22",
                                    "2020-06-29"),
                           score1 = c(3, 1, 7, 8, 8, 6, 5, 5, 3, 5),
                           score2 = c(1, 4, 5, 4, 1, 1, 2, 6, 7, 1)))
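(Note: as.data.frame(cbind(...)) goes through a character matrix, so every column, including id and the scores, ends up as character; that is why the answers below print <chr>/<fct> columns. A minimal sketch of df1 built with data.frame() instead, if you want the numeric columns to stay numeric; df2 can be built the same way:)
df1 <- data.frame(id = c(1, 2, 3),
                  date = c("2020-06-03", "2020-07-02", "2020-06-11"),
                  score1 = c(6, 8, 5),
                  score2 = c(1, 1, 6),
                  baselevel = c(2, 2, 2),
                  basegrade = c("B", "B", "A"),
                  stringsAsFactors = FALSE)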
This is what I want as a result of merging the two dfs.
id date score1 score2 baselevel basegrade
1 2020-06-03 6 1 2 "B"
1 2020-06-10 3 1 2 "B"
1 2020-06-17 1 4 2 "B"
1 2020-06-24 7 5 2 "B"
1 2020-07-01 8 4 2 "B"
2 2020-07-02 8 1 2 "B"
2 2020-07-03 8 1 2 "B"
2 2020-07-10 6 1 2 "B"
2 2020-07-17 5 2 2 "B"
3 2020-06-11 5 6 1 "A"
3 2020-06-14 5 6 1 "A"
3 2020-06-22 3 7 1 "A"
3 2020-06-29 5 1 1 "A"
I tried the two different merge() calls below, but I still get NAs... What am I missing here?
Any help would be appreciated!
dfcombined1 <- merge(df1, df2, by=c("id","date"), all= TRUE)
dfcombined2 <- merge(df1, df2, by=intersect(names(df1), names(df2)), all= TRUE)

You can use bind_rows() from dplyr together with fill() from tidyr.
library(dplyr)
library(tidyr)
bind_rows(df1, df2) %>%
  group_by(id) %>%
  fill(starts_with("base"), .direction = "updown") %>%
  arrange(date, .by_group = TRUE)
# # A tibble: 13 x 6
# # Groups: id [3]
# id date score1 score2 baselevel basegrade
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2020-06-03 6 1 2 B
# 2 1 2020-06-10 3 1 2 B
# 3 1 2020-06-17 1 4 2 B
# 4 1 2020-06-24 7 5 2 B
# 5 1 2020-07-01 8 4 2 B
# 6 2 2020-07-02 8 1 2 B
# 7 2 2020-07-03 8 1 2 B
# 8 2 2020-07-10 6 1 2 B
# 9 2 2020-07-17 5 2 2 B
# 10 3 2020-06-11 5 6 2 A
# 11 3 2020-06-14 5 6 2 A
# 12 3 2020-06-22 3 7 2 A
# 13 3 2020-06-29 5 1 2 A
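Since the baseline date happens to be the earliest date within each id in this example, you could also arrange first and then only fill downwards; a sketch (it relies on the baseline row sorting to the top of each group):
bind_rows(df1, df2) %>%
  group_by(id) %>%
  arrange(date, .by_group = TRUE) %>%
  fill(starts_with("base"), .direction = "down") %>%
  ungroup()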

I think you are looking for this:
library(tidyverse)
#Code
df1 %>%
  bind_rows(df2) %>%
  arrange(id) %>%
  group_by(id) %>%
  fill(baselevel) %>%
  fill(basegrade)
Output:
# A tibble: 13 x 6
# Groups: id [3]
id date score1 score2 baselevel basegrade
<fct> <fct> <fct> <fct> <fct> <fct>
1 1 2020-06-03 6 1 2 B
2 1 2020-06-10 3 1 2 B
3 1 2020-06-17 1 4 2 B
4 1 2020-06-24 7 5 2 B
5 1 2020-07-01 8 4 2 B
6 2 2020-07-02 8 1 2 B
7 2 2020-07-03 8 1 2 B
8 2 2020-07-10 6 1 2 B
9 2 2020-07-17 5 2 2 B
10 3 2020-06-11 5 6 2 A
11 3 2020-06-14 5 6 2 A
12 3 2020-06-22 3 7 2 A
13 3 2020-06-29 5 1 2 A

A double merge maybe.
merge(merge(df1[-(3:4)], df2, all.y=TRUE)[-(3:4)], df1[-(2:4)], all.x=TRUE)
# id date score1 score2 baselevel basegrade
# 1 1 2020-06-10 3 1 2 B
# 2 1 2020-06-17 1 4 2 B
# 3 1 2020-06-24 7 5 2 B
# 4 1 2020-07-01 8 4 2 B
# 5 2 2020-07-03 8 1 2 B
# 6 2 2020-07-10 6 1 2 B
# 7 2 2020-07-17 5 2 2 B
# 8 3 2020-06-14 5 6 2 A
# 9 3 2020-06-22 3 7 2 A
# 10 3 2020-06-29 5 1 2 A
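For what it's worth, the merges in the question return NAs because the baseline dates in df1 never occur in df2, so merging by c("id", "date") has nothing to match the base columns onto. Joining the base columns by id alone and then stacking the baseline rows avoids that; a dplyr sketch of the same idea, assuming the df1/df2 defined in the question:
library(dplyr)
df2 %>%
  left_join(df1[, c("id", "baselevel", "basegrade")], by = "id") %>%  # carry base info to every follow-up row
  bind_rows(df1) %>%                                                  # add the baseline rows themselves
  arrange(id, date)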

Related

Adding new unique grouping ID based on group_by variable

I have a dataset with drugs being administered over time. I want to create a group for each block of the drug being administered. I figured out a simple method to do this with a for loop that I can apply to each patient's set of drugs.
But I am curious whether there is a simple way to do this within the realm of the tidyverse?
Not that it matters much, but I am curious whether a simple method already exists for this problem.
Set-up
library(tibble)
library(lubridate)  # for today()

have <- tibble(
  patinet = c(1),
  date = seq(today(), today() + 11, 1),
  drug = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("a", 3))
)
## Want
want <- tibble(
  patinet = c(1),
  date = seq(today(), today() + 11, 1),
  drug = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("a", 3)),
  grp = sort(rep(1:4, 3))
)
> have
# A tibble: 12 × 3
patinet date drug
<dbl> <date> <chr>
1 1 2022-03-16 a
2 1 2022-03-17 a
3 1 2022-03-18 a
4 1 2022-03-19 b
5 1 2022-03-20 b
6 1 2022-03-21 b
7 1 2022-03-22 c
8 1 2022-03-23 c
9 1 2022-03-24 c
10 1 2022-03-25 a
11 1 2022-03-26 a
12 1 2022-03-27 a
> want
# A tibble: 12 × 4
patinet date drug grp
<dbl> <date> <chr> <int>
1 1 2022-03-16 a 1
2 1 2022-03-17 a 1
3 1 2022-03-18 a 1
4 1 2022-03-19 b 2
5 1 2022-03-20 b 2
6 1 2022-03-21 b 2
7 1 2022-03-22 c 3
8 1 2022-03-23 c 3
9 1 2022-03-24 c 3
10 1 2022-03-25 a 4
11 1 2022-03-26 a 4
12 1 2022-03-27 a 4
You can use data.table::rleid():
have %>% mutate(group = data.table::rleid(drug))
# A tibble: 12 x 4
patinet date drug group
<dbl> <date> <chr> <int>
1 1 2022-03-16 a 1
2 1 2022-03-17 a 1
3 1 2022-03-18 a 1
4 1 2022-03-19 b 2
5 1 2022-03-20 b 2
6 1 2022-03-21 b 2
7 1 2022-03-22 c 3
8 1 2022-03-23 c 3
9 1 2022-03-24 c 3
10 1 2022-03-25 a 4
11 1 2022-03-26 a 4
12 1 2022-03-27 a 4
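If you want to stay entirely within the tidyverse, a sketch of two equivalent ways (consecutive_id() needs dplyr >= 1.1.0; the cumsum() version only needs lag()):
library(dplyr)
have %>%
  mutate(
    grp  = consecutive_id(drug),                                  # run-length id, increments whenever drug changes
    grp2 = cumsum(drug != lag(drug, default = first(drug))) + 1   # same idea written by hand
  )
With more than one patient you would group_by(patinet) first if the counter should restart per patient.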

R: Using doBy with Dates

I am doing some coding in R. I am trying to use the doBy package to get a summed total score for a variable (x) by both date (date) and id (id). The doBy command works fine and I get this output.
data
id date x
1 01/01/2021 1
1 01/02/2021 2
1 01/03/2021 3
2 02/01/2021 2
2 02/02/2021 3
2 02/02/2021 4
3 03/11/2021 3
3 03/12/2021 3
3 03/13/2021 2
I want to recode the date so that everyone's first chronological date is 1, the second is 2, the third is 3, etc. I want my data to look something like this.
data2
id daycount x
1 1 1
1 2 2
1 3 3
2 1 2
2 2 3
2 3 4
3 1 3
3 2 3
3 3 2
I was able to order the days using order(), but I am not sure how to get the dates to match up. I think I need some kind of sequence or autonumber. Also, participants may have different numbers of days: some may have 1 day and others may have 10.
1) doBy. Assuming that the dates are already sorted within id:
library(doBy)
transform_by(data, ~ id, countdays = seq_along(id))
giving:
id date x countdays
1 1 01/01/2021 1 1
2 1 01/02/2021 2 2
3 1 01/03/2021 3 3
4 2 02/01/2021 2 1
5 2 02/02/2021 3 2
6 2 02/02/2021 4 3
7 3 03/11/2021 3 1
8 3 03/12/2021 3 2
9 3 03/13/2021 2 3
2) Base R. It could also be done using transform and ave in base R.
transform(data, daycount = ave(id, id, FUN = seq_along))
giving:
id date x daycount
1 1 01/01/2021 1 1
2 1 01/02/2021 2 2
3 1 01/03/2021 3 3
4 2 02/01/2021 2 1
5 2 02/02/2021 3 2
6 2 02/02/2021 4 3
7 3 03/11/2021 3 1
8 3 03/12/2021 3 2
9 3 03/13/2021 2 3
Note
data in reproducible form:
Lines <- "id date x
1 01/01/2021 1
1 01/02/2021 2
1 01/03/2021 3
2 02/01/2021 2
2 02/02/2021 3
2 02/02/2021 4
3 03/11/2021 3
3 03/12/2021 3
3 03/13/2021 2"
data <- read.table(text = Lines, header = TRUE)
You may want to group_by id and then create a new column using rank or dense_rank (note the difference between them in how they handle duplicates).
To recreate your data, I used:
# recreate data frame
id_vec <- rep(c(1,2,3), each = 3)
date_vec <- c(
'01/01/2021',
'01/02/2021',
'01/03/2021',
'02/01/2021',
'02/02/2021',
'02/02/2021',
'03/11/2021',
'03/12/2021',
'03/13/2021'
)
x_vec <- rep(c(1,2,3), times = 3)
data <- data.frame(id = id_vec, date = date_vec, x = x_vec)
I also converted the date column to an actual date format for your convenience:
# convert string to date object
library(lubridate)
library(dplyr)
data <- data %>% mutate(date_formatted = mdy(date))
Creating a column with rank:
data %>%
group_by(id) %>%
mutate(day_count = rank(date_formatted, ties.method = "first")) %>%
ungroup()
# # A tibble: 9 x 5
# id date x date_formatted day_count
# <dbl> <chr> <dbl> <date> <int>
# 1 1 01/01/2021 1 2021-01-01 1
# 2 1 01/02/2021 2 2021-01-02 2
# 3 1 01/03/2021 3 2021-01-03 3
# 4 2 02/01/2021 1 2021-02-01 1
# 5 2 02/02/2021 2 2021-02-02 2
# 6 2 02/02/2021 3 2021-02-02 3
# 7 3 03/11/2021 1 2021-03-11 1
# 8 3 03/12/2021 2 2021-03-12 2
# 9 3 03/13/2021 3 2021-03-13 3
Creating new column with dense_rank:
data %>%
group_by(id) %>%
mutate(day_count = dense_rank(date_formatted)) %>%
ungroup()
# # A tibble: 9 x 5
# id date x date_formatted day_count
# <dbl> <chr> <dbl> <date> <int>
# 1 1 01/01/2021 1 2021-01-01 1
# 2 1 01/02/2021 2 2021-01-02 2
# 3 1 01/03/2021 3 2021-01-03 3
# 4 2 02/01/2021 1 2021-02-01 1
# 5 2 02/02/2021 2 2021-02-02 2
# 6 2 02/02/2021 3 2021-02-02 2
# 7 3 03/11/2021 1 2021-03-11 1
# 8 3 03/12/2021 2 2021-03-12 2
# 9 3 03/13/2021 3 2021-03-13 3
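As a side note, rank(x, ties.method = "first") is what dplyr::row_number(x) computes, so the first variant can be written a little more compactly; a sketch on the same data:
data %>%
  group_by(id) %>%
  mutate(day_count = row_number(date_formatted)) %>%
  ungroup()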

Split information from two columns, R, tidyverse

I've got some data in two columns:
# A tibble: 16 x 2
code niveau
<chr> <dbl>
1 A 1
2 1 2
3 2 2
4 3 2
5 4 2
6 5 2
7 B 1
8 6 2
9 7 2
My desired output is:
A tibble: 16 x 3
code niveau cat
<chr> <dbl> <chr>
1 A 1 A
2 1 2 A
3 2 2 A
4 3 2 A
5 4 2 A
6 5 2 A
7 B 1 B
8 6 2 B
Is there a tidy way to convert these data without looping through them?
Here is some dummy data:
data <- tibble(code = c('A', 1, 2, 3, 4, 5, 'B', 6, 7, 8, 9, 'C', 10, 11, 12, 13),
               niveau = c(1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2))
desired_output <- tibble(code = c('A', 1, 2, 3, 4, 5, 'B', 6, 7, 8, 9, 'C', 10, 11, 12, 13),
                         niveau = c(1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2),
                         cat = c(rep('A', 6), rep('B', 5), rep('C', 5)))
You can create a new column cat and replace code values with NA wherever code contains a number. We can then use fill to replace each missing value with the previous non-NA value.
library(dplyr)
data %>% mutate(cat = replace(code, grepl('\\d', code), NA)) %>% tidyr::fill(cat)
# A tibble: 16 x 3
# code niveau cat
# <chr> <dbl> <chr>
# 1 A 1 A
# 2 1 2 A
# 3 2 2 A
# 4 3 2 A
# 5 4 2 A
# 6 5 2 A
# 7 B 1 B
# 8 6 2 B
# 9 7 2 B
#10 8 2 B
#11 9 2 B
#12 C 1 C
#13 10 2 C
#14 11 2 C
#15 12 2 C
#16 13 2 C
We can use str_detect from stringr
library(dplyr)
library(stringr)
library(tidyr)
data %>%
  mutate(cat = replace(code, str_detect(code, '\\d'), NA)) %>%
  fill(cat)
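If the letter rows are always the ones with niveau == 1, you can also avoid the regex and index into those header codes directly; a sketch on the dummy data above (it assumes every group starts with a niveau == 1 row):
library(dplyr)
data %>%
  mutate(cat = code[which(niveau == 1)][cumsum(niveau == 1)])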

Count number of new and lost friends between two data frames in R

I have two data frames of the same respondents, one from Time 1 and the other from Time 2. In each wave they nominated their friends, and I want to know:
1) how many friends are nominated in Time 2 but not in Time 1 (new friends)
2) how many friends are nominated in Time 1 but not in Time 2 (lost friends)
Sample data:
Time 1 DF
ID friend_1 friend_2 friend_3
1 4 12 7
2 8 6 7
3 9 NA NA
4 15 7 2
5 2 20 7
6 19 13 9
7 12 20 8
8 3 17 10
9 1 15 19
10 2 16 11
Time 2 DF
ID friend_1 friend_2 friend_3
1 4 12 3
2 8 6 14
3 9 NA NA
4 15 7 2
5 1 17 9
6 9 19 NA
7 NA NA NA
8 7 1 16
9 NA 10 12
10 7 11 9
So the desired DF would include these columns (EDIT filled in columns):
ID num_newfriends num_lostfriends
1 1 1
2 1 1
3 0 0
4 0 0
5 3 3
6 0 1
7 0 3
8 3 3
9 2 3
10 2 1
EDIT2:
I've tried doing an anti join
df3 <- anti_join(df1, df2)
But this method doesn't take into account friend ID numbers that might appear in a different column in Time 2 (for example, respondent #6 nominates friends 9 and 19 in both T1 and T2, but in different columns each time).
Another option:
library(tidyverse)
left_join(
gather(df1, key, x, -ID),
gather(df2, key, y, -ID),
by = c("ID", "key")
) %>%
group_by(ID) %>%
summarise(
num_newfriends = sum(!y[!is.na(y)] %in% x[!is.na(x)]),
num_lostfriends = sum(!x[!is.na(x)] %in% y[!is.na(y)])
)
Output:
# A tibble: 10 x 3
ID num_newfriends num_lostfriends
<int> <int> <int>
1 1 1 1
2 2 1 1
3 3 0 0
4 4 0 0
5 5 3 3
6 6 0 1
7 7 0 3
8 8 3 3
9 9 2 3
10 10 2 2
Simple comparisons would be an option
library(tidyverse)
na_sums_old <- rowSums(is.na(time1))
na_sums_new <- rowSums(is.na(time2))
kept_friends <- map_dbl(seq(nrow(time1)), ~ sum(time1[.x, -1] %in% time2[.x, -1]))
kept_friends <- kept_friends - na_sums_old * (na_sums_new >= 1)
new_friends <- 3 - na_sums_new - kept_friends
lost_friends <- 3 - na_sums_old - kept_friends
tibble(ID = time1$ID, new_friends = new_friends, lost_friends = lost_friends)
# A tibble: 10 x 3
ID new_friends lost_friends
<int> <dbl> <dbl>
1 1 1 1
2 2 1 1
3 3 0 0
4 4 0 0
5 5 3 3
6 6 0 1
7 7 0 3
8 8 3 3
9 9 2 3
10 10 2 2
You can make anti_join work by first pivoting to a "long" data frame.
df1 <- df1 %>%
pivot_longer(starts_with("friend_"), values_to = "friend") %>%
drop_na()
df2 <- df2 %>%
pivot_longer(starts_with("friend_"), values_to = "friend") %>%
drop_na()
head(df1)
#> # A tibble: 6 x 3
#> ID name friend
#> <int> <chr> <int>
#> 1 1 friend_1 4
#> 2 1 friend_2 12
#> 3 1 friend_3 7
#> 4 2 friend_1 8
#> 5 2 friend_2 6
#> 6 2 friend_3 7
lost_friends <- anti_join(df1, df2, by = c("ID", "friend"))
new_friends <- anti_join(df2, df1, by = c("ID", "friend"))
respondents <- distinct(df1, ID)
respondents %>%
full_join(
count(lost_friends, ID, name = "num_lost_friends")
) %>%
full_join(
count(new_friends, ID, name = "num_new_friends")
) %>%
mutate_at(vars(starts_with("num_")), replace_na, 0)
#> Joining, by = "ID"
#> Joining, by = "ID"
#> # A tibble: 10 x 3
#> ID num_lost_friends num_new_friends
#> <int> <dbl> <dbl>
#> 1 1 1 1
#> 2 2 1 1
#> 3 3 0 0
#> 4 4 0 0
#> 5 5 3 3
#> 6 6 1 0
#> 7 7 3 0
#> 8 8 3 3
#> 9 9 3 2
#> 10 10 2 2
Created on 2019-11-01 by the reprex package (v0.3.0)
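If you prefer to keep the data in wide form, the same set logic can be written row by row with setdiff(); a sketch, assuming df1 and df2 are the original wide Time 1 / Time 2 data frames with one row per respondent in the same order:
new_lost <- t(sapply(seq_len(nrow(df1)), function(i) {
  t1 <- na.omit(unlist(df1[i, -1]))   # friend IDs at Time 1, NAs dropped
  t2 <- na.omit(unlist(df2[i, -1]))   # friend IDs at Time 2, NAs dropped
  c(num_newfriends  = length(setdiff(t2, t1)),
    num_lostfriends = length(setdiff(t1, t2)))
}))
data.frame(ID = df1$ID, new_lost)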

How to arrange/sort by unique sequences?

A) Here is my data frame arranged by plate:
df <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
plate phase score
A 1 1
A 1 1
A 1 1
A 2 1
A 2 1
A 2 1
A 3 2
A 3 2
A 3 2
B 1 1
B 1 1
B 1 2
B 2 1
B 2 1
B 2 3")
B) Goal: I want to order it by plate first and then by phase, but sequentially (see below how the rows are ordered alphabetically by plate but cycle sequentially through phase):
plate phase score
<chr> <int> <int>
1 A 1 1
2 A 2 1
3 A 3 2
4 A 1 1
5 A 2 1
6 A 3 2
7 A 1 1
8 A 2 1
9 A 3 2
10 B 1 1
11 B 2 1
12 B 1 1
13 B 2 1
14 B 1 2
15 B 2 3
One option is to create a sequence variable grouped by 'plate' and 'phase', and arrange on it along with 'plate' and 'score':
library(dplyr)
df %>%
  group_by(plate, phase) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  arrange(plate, rn, score) %>%
  select(-rn)
# A tibble: 15 x 3
# plate phase score
# <chr> <int> <int>
# 1 A 1 1
# 2 A 2 1
# 3 A 3 2
# 4 A 1 1
# 5 A 2 1
# 6 A 3 2
# 7 A 1 1
# 8 A 2 1
# 9 A 3 2
#10 B 1 1
#11 B 2 1
#12 B 1 1
#13 B 2 1
#14 B 1 2
#15 B 2 3
Or using data.table
library(data.table)
setDT(df)[order(plate, rowid(phase), score)]
Or in base R:
df[with(df, order(plate, ave(phase, phase, FUN = seq_along), phase)),]
#> plate phase score
#> 1 A 1 1
#> 4 A 2 1
#> 7 A 3 2
#> 2 A 1 1
#> 5 A 2 1
#> 8 A 3 2
#> 3 A 1 1
#> 6 A 2 1
#> 9 A 3 2
#> 10 B 1 1
#> 13 B 2 1
#> 11 B 1 1
#> 14 B 2 1
#> 12 B 1 2
#> 15 B 2 3
