How to merge datasets with repeated measures - r

I have three datasets that I want to merge/join.
These examples only include the first few participants; I have a total of 25.
df1
ID Grup pretest
1 1 A 2
2 1 A 1
3 1 A 3
4 2 B NA
5 2 B 1
6 2 B 3
7 3 A 2
8 3 A 1
9 3 A NA
10 4 B 2
11 4 B 1
12 4 B 3
df2 (this is missing one ID (5))
ID Grup posttest
1 1 A NA
2 1 A 5
3 1 A 4
4 2 B 2
5 2 B 4
6 2 B 3
7 3 A 5
8 3 A 6
9 3 A 3
10 6 B 4
11 6 B 2
12 6 B NA
Updated
df3 (this has 5 measurements per ID)
ID Grup traning
1 1 A 2
2 1 A 6
3 1 A 3
4 1 A NA
5 1 A 1
6 2 B 3
7 2 B 4
8 2 B 1
9 2 B NA
10 2 B 2
11 3 A 1
12 3 A 3
I’ve been trying merge() and full_join(), but both end up creating duplicates that I don’t want.
It won’t recognize the ID as an independent value; it creates 9 rows for every ID value.
New <- merge(df1, df2, by= 'ID')
New <- full_join(df1, df2, by = "ID")
Setting all = TRUE doesn’t help.
I need the dataset to look like this
ID Grup pretest posttest traning
1 1 A 2 NA 3
2 1 A 1 5 4
3 1 A 3 4 4
4 1 A NA NA 4
5 1 A NA NA 3
6 2 B 3 3 NA
7 2 B 2 5 3
8 2 B NA 6 2
9 2 B NA NA 5
10 2 B NA NA 4
11 3 A 1 2 3
12 3 A 3 3 4

Since you are relying on the order of the rows in both frames, you can simply use cbind():
cbind(df1, df2[, 3, drop = FALSE])
Output:
ID Grup pretest posttest
1 1 A 2 NA
2 1 A 1 5
3 1 A 3 4
4 2 B NA 2
5 2 B 1 4
6 2 B 3 3
7 3 A 2 5
8 3 A 1 6
9 3 A NA 3
10 4 B 2 4
11 4 B 1 2
12 4 B 3 NA
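Because cbind() pairs rows purely by position, it may be worth checking that the key columns line up before binding. A minimal sketch, assuming both frames should have the same rows in the same order (the small frames below are made up for illustration):

```r
# Toy frames with the same layout as df1 and df2 (values are illustrative)
df1 <- data.frame(ID = rep(1:2, each = 3),
                  Grup = rep(c("A", "B"), each = 3),
                  pretest = c(2, 1, 3, NA, 1, 3))
df2 <- data.frame(ID = rep(1:2, each = 3),
                  Grup = rep(c("A", "B"), each = 3),
                  posttest = c(NA, 5, 4, 2, 4, 3))

# cbind() ignores the keys, so confirm they match row for row first
keys_match <- nrow(df1) == nrow(df2) &&
  identical(df1$ID, df2$ID) &&
  identical(df1$Grup, df2$Grup)

if (keys_match) {
  combined <- cbind(df1, df2[, "posttest", drop = FALSE])
} else {
  stop("ID/Grup differ between the frames; cbind() would misalign rows")
}
combined
```

In df2 as posted in the question, rows 10 to 12 carry ID 6 while df1 has ID 4 there, so a check like this would flag the mismatch.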

You can add a helper column iid to separate the entries.
df1 <- cbind(iid = 1:nrow(df1), df1)
df2 <- cbind(iid = 1:nrow(df2), df2)
With dplyr
library(dplyr)
left_join(df1, df2, c("iid", "ID", "Grup"))[,-1]
ID Grup pretest posttest
1 1 A 2 NA
2 1 A 1 5
3 1 A 3 4
4 2 B NA 2
5 2 B 1 4
6 2 B 3 3
7 3 A 2 5
8 3 A 1 6
9 3 A NA 3
10 4 B 2 4
11 4 B 1 2
12 4 B 3 NA
With base R merge
merge(df1, df2, c("iid", "ID", "Grup"))[,-1]
ID Grup pretest posttest
1 1 A 2 NA
2 4 B 2 4
3 4 B 1 2
4 4 B 3 NA
5 1 A 1 5
6 1 A 3 4
7 2 B NA 2
8 2 B 1 4
9 2 B 3 3
10 3 A 2 5
11 3 A 1 6
12 3 A NA 3
Data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), Grup = c("A", "A", "A", "B", "B", "B", "A", "A", "A",
"B", "B", "B"), pretest = c(2L, 1L, 3L, NA, 1L, 3L, 2L, 1L, NA,
2L, 1L, 3L)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12"))
df2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), Grup = c("A", "A", "A", "B", "B", "B", "A", "A", "A",
"B", "B", "B"), posttest = c(NA, 5L, 4L, 2L, 4L, 3L, 5L, 6L,
3L, 4L, 2L, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

Another option is to join by row names, i.e. by row number:
library(tibble)
library(dplyr)
left_join(rownames_to_column(df1), df2 %>% rownames_to_column() , by="rowname") %>%
select(ID = ID.x, Grup = Grup.x, pretest, posttest)
ID Grup pretest posttest
1 1 A 2 NA
2 1 A 1 5
3 1 A 3 4
4 2 B NA 2
5 2 B 1 4
6 2 B 3 3
7 3 A 2 5
8 3 A 1 6
9 3 A NA 3
10 4 B 2 4
11 4 B 1 2
12 4 B 3 NA
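All of the approaches above pair rows by position, so they assume both frames hold the same rows in the same order. If the number of measurements per ID differs between frames (as with df3 in the question), one option is to number the measurements within each ID and join on that index as well. A rough sketch in base R; the frames a and b and the column name occasion are made up for illustration:

```r
# Toy data: hypothetical frames with unequal numbers of rows per ID
a <- data.frame(ID = c(1, 1, 1, 2, 2), pretest = c(2, 1, 3, NA, 1))
b <- data.frame(ID = c(1, 1, 2, 2, 2), posttest = c(5, 4, 2, 4, 3))

# Number the measurements within each ID (1, 2, 3, ... restarting per ID)
a$occasion <- ave(seq_along(a$ID), a$ID, FUN = seq_along)
b$occasion <- ave(seq_along(b$ID), b$ID, FUN = seq_along)

# Full outer join on ID plus occasion: each measurement matches at most
# one row in the other frame, so no duplicated combinations appear
merged <- merge(a, b, by = c("ID", "occasion"), all = TRUE)
merged
```

Joining on ID alone would instead pair every row of an ID in one frame with every row of that ID in the other, which is where the 9 rows per ID in the question come from (3 x 3).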

Related

R Create multiple rows from 1 row based on presence of values in certain columns

I have a data frame that looks like the following:
ID Date Participant_1 Participant_2 Participant_3 Covariate 1 Covariate 2 Covariate 3
1 9/1 A B 16 2 1
2 5/4 B 4 2 2
3 6/3 C A B 8 3 6
4 2/8 A 7 8 4
5 9/3 C A 7 1 3
I need to expand this data frame so that a row is present for all of the participants present at each event "ID", with the date and all other variables in all the created rows. The multiple participant columns would now only be one column for participant. The output would therefore be:
ID Date Participant Covariate 1 Covariate 2 Covariate 3
1 9/1 A 16 2 1
1 9/1 B 16 2 1
2 5/4 B 4 2 2
3 6/3 C 8 3 6
3 6/3 A 8 3 6
3 6/3 B 8 3 6
4 2/8 A 7 8 4
5 9/3 C 7 1 3
5 9/3 A 7 1 3
Is there a way to do this efficiently? Perhaps with a pivot function?
We can use pivot_longer and then some formatting
library(tidyr)
library(dplyr) # for %>%, select() and relocate()
df %>%
pivot_longer(starts_with("Participant"), values_to = "Participant") %>%
select(-name) %>%
relocate(Participant, .before = Covariate_1) %>%
drop_na()
# A tibble: 9 × 6
ID Date Participant Covariate_1 Covariate_2 Covariate_3
<int> <chr> <chr> <int> <int> <int>
1 1 9/1 A 16 2 1
2 1 9/1 B 16 2 1
3 2 5/4 B 4 2 2
4 3 6/3 C 8 3 6
5 3 6/3 A 8 3 6
6 3 6/3 B 8 3 6
7 4 2/8 A 7 8 4
8 5 9/3 C 7 1 3
9 5 9/3 A 7 1 3
Here's the example data used:
df <- structure(list(ID = 1:5, Date = c("9/1", "5/4", "6/3", "2/8",
"9/3"), Participant_1 = c("A", "B", "C", "A", "C"), Participant_2 = c("B",
NA, "A", NA, "A"), Participant_3 = c(NA, NA, "B", NA, NA), Covariate_1 = c(16L,
4L, 8L, 7L, 7L), Covariate_2 = c(2L, 2L, 3L, 8L, 1L), Covariate_3 = c(1L,
2L, 6L, 4L, 3L)), class = "data.frame", row.names = c(NA, -5L
))
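Note that drop_na() with no arguments removes rows with an NA in any column, including the covariates. If only the empty participant slots should be dropped, pivot_longer() has a values_drop_na argument for that. A sketch on the same example data (the column name slot is arbitrary and only used temporarily):

```r
library(tidyr)

df <- data.frame(
  ID = 1:5,
  Date = c("9/1", "5/4", "6/3", "2/8", "9/3"),
  Participant_1 = c("A", "B", "C", "A", "C"),
  Participant_2 = c("B", NA, "A", NA, "A"),
  Participant_3 = c(NA, NA, "B", NA, NA),
  Covariate_1 = c(16L, 4L, 8L, 7L, 7L),
  Covariate_2 = c(2L, 2L, 3L, 8L, 1L),
  Covariate_3 = c(1L, 2L, 6L, 4L, 3L)
)

long <- pivot_longer(
  df,
  cols = starts_with("Participant"),
  names_to = "slot",           # which Participant_* column the value came from
  values_to = "Participant",
  values_drop_na = TRUE        # drop only the empty participant slots
)
long$slot <- NULL              # the slot label is not needed in the result
long
```

With this variant, a row with a missing covariate would be kept as long as it has a participant.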

Expand a dataframe n times and add a column numbering replicates 1 to n

Probably a simple question, but I couldn't figure it out. I want to join two tables by replicating the first one. I tried the dplyr join functions, but they don't seem to add the Category column as in the example below. Any help is appreciated.
> # I have two tables
>
> table1
Place Round Value
1 A 1 12397
2 A 2 18413
3 A 3 7351
4 A 4 5820
5 B 1 3874
6 B 2 10140
7 B 3 10073
8 B 4 7379
>
> table2
Place Category
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B 3
>
> # I want to add the category column from table2 and expand table1 as follows
>
> final_table
Place Round Value Category
1 A 1 12397 1
2 A 2 18413 1
3 A 3 7351 1
4 A 4 5820 1
5 B 1 3874 1
6 B 2 10140 1
7 B 3 10073 1
8 B 4 7379 1
9 A 1 12397 2
10 A 2 18413 2
11 A 3 7351 2
12 A 4 5820 2
13 B 1 3874 2
14 B 2 10140 2
15 B 3 10073 2
16 B 4 7379 2
17 A 1 12397 3
18 A 2 18413 3
19 A 3 7351 3
20 A 4 5820 3
21 B 1 3874 3
22 B 2 10140 3
23 B 3 10073 3
24 B 4 7379 3
We could use crossing
library(tidyr)
library(dplyr)
crossing(table1, table2[2]) %>%
arrange(Category)
# A tibble: 24 x 4
# Place Round Value Category
# <chr> <int> <int> <int>
# 1 A 1 12397 1
# 2 A 2 18413 1
# 3 A 3 7351 1
# 4 A 4 5820 1
# 5 B 1 3874 1
# 6 B 2 10140 1
# 7 B 3 10073 1
# 8 B 4 7379 1
# 9 A 1 12397 2
#10 A 2 18413 2
# … with 14 more rows
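For comparison, a base R sketch of the same cross join: merge() with by = NULL returns the Cartesian product of its inputs. Like the crossing() call, it ignores Place in table2 and simply repeats table1 once per distinct Category (the helper frame categories is introduced here for illustration):

```r
table1 <- data.frame(
  Place = rep(c("A", "B"), each = 4),
  Round = rep(1:4, times = 2),
  Value = c(12397L, 18413L, 7351L, 5820L, 3874L, 10140L, 10073L, 7379L)
)
table2 <- data.frame(
  Place = rep(c("A", "B"), each = 3),
  Category = rep(1:3, times = 2)
)

# One row per distinct category
categories <- data.frame(Category = unique(table2$Category))

# by = NULL means no key columns, i.e. every row of table1 is paired
# with every row of categories
final_table <- merge(table1, categories, by = NULL)
rownames(final_table) <- NULL
final_table
```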
data
table1 <- structure(list(Place = c("A", "A", "A", "A", "B", "B", "B", "B"
), Round = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), Value = c(12397L,
18413L, 7351L, 5820L, 3874L, 10140L, 10073L, 7379L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
table2 <- structure(list(Place = c("A", "A", "A", "B", "B", "B"),
Category = c(1L,
2L, 3L, 1L, 2L, 3L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Calculation in Dataframe keeping first row as reference

My first row holds the reference values, which should be added to each of the rows below it, for any number of columns N.
Data
A B C D
3 5 1 2
1 4 5 3
2 2 2 4
3 1 3 1
4 3 1 2
Calculation as follows:
3 is the reference value for column A, so 3 should be added to 1, 2, 3 and 4. Similarly, 5 is the reference value added to 4, 2, 1, 3; then 1 is the reference value added to 5, 2, 3, 1; and so on, up to n columns.
1 + 3 4 + 5 5 + 1 3 + 2
2 + 3 2 + 5 2 + 1 4 + 2
3 + 3 1 + 5 3 + 1 1 + 2
4 + 3 3 + 5 1 + 1 2 + 2
Expected output:
A B C D
4 9 6 5
5 7 3 6
6 6 4 3
7 8 2 4
Please help. Thanks.
Maybe just this:
c(mydf[1, ]) + mydf[-1, ]
## A B C D
## 2 4 9 6 5
## 3 5 7 3 6
## 4 6 6 4 3
## 5 7 8 2 4
Starting data.frame:
mydf <- structure(list(A = c(3L, 1L, 2L, 3L, 4L), B = c(5L, 4L, 2L, 1L,
3L), C = c(1L, 5L, 2L, 3L, 1L), D = c(2L, 3L, 4L, 1L, 2L)), .Names = c("A",
"B", "C", "D"), row.names = c(NA, 5L), class = "data.frame")
We can do (here the data frame is called df1)
(df1[1,][col(df1)] + df1)[-1,]
# A B C D
#2 4 9 6 5
#3 5 7 3 6
#4 6 6 4 3
#5 7 8 2 4
If you are trying to replace the values in your initial dataframe with the new values, you could do the following:
df <- data.frame(c(3,1,2,3,4), c(5,4,2,1,3), c(1,5,2,3,1), c(2,3,4,1,2))
names(df) <- c("A", "B", "C", "D")
for (i in 2:nrow(df)) {
for (j in 1:ncol(df)) {
df[i,j] <- df[1,j] + df[i,j]
}
}
This could probably be vectorized and run faster which would be helpful if you have a very large dataframe, but it will work if you need a quick and dirty solution.
Output:
A B C D
1 3 5 1 2
2 4 9 6 5
3 5 7 3 6
4 6 6 4 3
5 7 8 2 4
Hope this is helpful!
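One vectorized alternative to the loop is sweep(), which applies a function between each column and a matching vector of statistics. A small sketch using the same values as above (the names ref and result are just for illustration):

```r
mydf <- data.frame(A = c(3, 1, 2, 3, 4),
                   B = c(5, 4, 2, 1, 3),
                   C = c(1, 5, 2, 3, 1),
                   D = c(2, 3, 4, 1, 2))

# Reference values from the first row, one per column
ref <- unlist(mydf[1, ])

# MARGIN = 2 sweeps over columns: ref[j] is added to every value in column j
result <- sweep(mydf[-1, ], 2, ref, "+")
result
```

This leaves mydf itself unchanged and returns only the rows below the reference row.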

R - Subset dataframe based on a repeated sequence

I am trying to subset a data frame based on a specific sequence occurring in column v3.
A sample of the data frame:
v1 <- c(1:20)
v2 <- c(1,1,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,0,0)
v3 <- c(4,4,2,3,2,3,2,4,4,2,3,2,3,3,3,4,4,2,3,3)
my_df <- data.frame(v1,v2,v3) # creating a dataframe
sample output for my_df
v1 v2 v3
1 1 1 4
2 2 1 4
3 3 0 2
4 4 0 3
5 5 1 2
6 6 0 3
7 7 1 2
8 8 1 4
9 9 1 4
10 10 0 2
11 11 1 3
12 12 1 2
13 13 0 3
14 14 0 3
15 15 0 3
16 16 1 4
17 17 1 4
18 18 0 2
19 19 0 3
20 20 0 3
The output I am trying to achieve should look like this
1 1 1 4
2 2 1 4
3 3 0 2
8 8 1 4
9 9 1 4
10 10 0 2
16 16 1 4
17 17 1 4
18 18 0 2
So I want to subset my df according to sequence of 4 4 2 in column v3. What I tried so far is:
my_df[which(c(diff(v3))==-2),]
but this only extracts the middle 4 of each 4 4 2 sequence:
v1 v2 v3
2 2 1 4
9 9 1 4
17 17 1 4
Another option I tried:
m = match(v3, c(4,4,2))
> m
[1] 1 1 3 NA 3 NA 3 1 1 3 NA 3 NA NA NA 1 1 3 NA NA
> my_df[!is.na(m),]
v1 v2 v3
1 1 1 4
2 2 1 4
3 3 0 2
5 5 1 2
7 7 1 2
8 8 1 4
9 9 1 4
10 10 0 2
12 12 1 2
16 16 1 4
17 17 1 4
18 18 0 2
This output gives me all 4 and 2 but not the sequence 4 4 2 that I want. Any help would be appreciated.
I already achieved this in matlab with for and if loop but I am just wondering how I can solve this in R in a loopless way.
We can do this with data.table. Convert the 'data.frame' to a 'data.table' with setDT(my_df). shift() from data.table with type = "lead" returns the following elements; because shift accepts a vector for n, n = 0:2 gives three vectors: n = 0 is the original 'v3' column, and the other two are the 1st and 2nd next values. Paste these together element-wise (do.call(paste0, ...)), check which results equal 442, and take the indices of the TRUE values (which). Then use rep to repeat each index three times and add 0:2, which gives the indices of all three rows in each match. These indices are used to subset the rows of the original dataset.
library(data.table)
setDT(my_df)[my_df[, rep(which(do.call(paste0, shift(v3, 0:2,
type= "lead")) == 442), each = 3) + 0:2]]
# v1 v2 v3
#1: 1 1 4
#2: 2 1 4
#3: 3 0 2
#4: 8 1 4
#5: 9 1 4
#6: 10 0 2
#7: 16 1 4
#8: 17 1 4
#9: 18 0 2
data
my_df <- structure(list(v1 = 1:20, v2 = c(1L, 1L, 0L, 0L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), v3 = c(4L,
4L, 2L, 3L, 2L, 3L, 2L, 4L, 4L, 2L, 3L, 2L, 3L, 3L, 3L, 4L, 4L,
2L, 3L, 3L)), .Names = c("v1", "v2", "v3"), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20"))
As long as v3 does not have any missing values and the values of v3 are single characters, you can also use gregexpr to accomplish this as follows
# get the row indices where the pattern 442 starts c(1 , 8, 16)
rowstarts <- unlist(gregexpr("442", paste(my_df$v3, collapse="")))
# extract rows from the data frame
dfNew <- my_df[sort(c(outer(rowstarts, (0:2), "+"))), ]
which returns
dfNew
v1 v2 v3
1 1 1 4
2 2 1 4
3 3 0 2
8 8 1 4
9 9 1 4
10 10 0 2
16 16 1 4
17 17 1 4
18 18 0 2
paste with the collapse argument turns the vector v3 into a single character string. gregexpr then finds the starting position in this string of every "442" substring.
The final step subsets the data.frame using the outer function, as suggested by @alexis-laz in the comments.
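If the values in v3 may be longer than one character, a string search becomes ambiguous. One possible alternative, sketched here under the assumption that v3 has no missing values, compares sliding windows of the vector against the pattern with embed(). The helper find_pattern is hypothetical, not part of any package:

```r
my_df <- data.frame(
  v1 = 1:20,
  v2 = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0),
  v3 = c(4, 4, 2, 3, 2, 3, 2, 4, 4, 2, 3, 2, 3, 3, 3, 4, 4, 2, 3, 3)
)

# Hypothetical helper: start positions where `pattern` occurs in `x`
find_pattern <- function(x, pattern) {
  k <- length(pattern)
  if (length(x) < k) return(integer(0))
  # embed() puts each run of k consecutive values in one row, latest first,
  # so reverse the columns to restore the original order
  windows <- embed(x, k)[, k:1, drop = FALSE]
  which(apply(windows, 1, function(w) all(w == pattern)))
}

starts <- find_pattern(my_df$v3, c(4, 4, 2))

# Expand each start position to the three rows of the match
rows <- sort(unique(c(outer(starts, 0:2, "+"))))
my_df[rows, ]
```

The same helper works for patterns of any length if 0:2 is replaced by seq_along(pattern) - 1.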

How to find the last occurrence of a certain observation in grouped data in R?

I have data that is grouped using dplyr in R. I would like to find the last occurrence of observations ('B') equal to or greater than 1 (1, 2, 3 or 4) in each group ('A'), in terms of the 'day' they occurred. I would like the value of 'day' for each group to be given in a new column.
For example, given the following sample of data, grouped by A (this has been simplified, my data is actually grouped by 3 variables):
A B day
a 2 1
a 2 2
a 1 5
a 0 8
b 3 1
b 3 4
b 3 6
b 0 7
b 0 9
c 1 2
c 1 3
c 1 4
I would like to achieve the following:
A B day last
a 2 1 5
a 2 2 5
a 1 5 5
a 0 8 5
b 3 1 6
b 3 4 6
b 3 6 6
b 0 7 6
b 0 9 6
c 1 2 4
c 1 3 4
c 1 4 4
I hope this makes sense, thank you all very much for your help! I have thoroughly searched for my answer online but couldn't find anything. However, if I have accidentally duplicated a question then I apologise.
We can try
library(data.table)
setDT(df1)[, last := day[tail(which(B>=1),1)] , A]
df1
# A B day last
# 1: a 2 1 5
# 2: a 2 2 5
# 3: a 1 5 5
# 4: a 0 8 5
# 5: b 3 1 6
# 6: b 3 4 6
# 7: b 3 6 6
# 8: b 0 7 6
# 9: b 0 9 6
#10: c 1 2 4
#11: c 1 3 4
#12: c 1 4 4
Or using dplyr
library(dplyr)
df1 %>%
group_by(A) %>%
mutate(last = day[max(which(B>=1))])
Or use the last function from dplyr (as @docendo discimus suggested)
df1 %>%
group_by(A) %>%
mutate(last= last(day[B>=1]))
For the second question (the day of the row right after the last non-zero B in each group),
setDT(df1)[, dayafter:= if(all(!!B)) NA_integer_ else
day[max(which(B!=0))+1L] , A]
# A B day dayafter
# 1: a 2 1 8
# 2: a 2 2 8
# 3: a 1 5 8
# 4: a 0 8 8
# 5: b 3 1 7
# 6: b 3 4 7
# 7: b 3 6 7
# 8: b 0 7 7
# 9: b 0 9 7
#10: c 1 2 NA
#11: c 1 3 NA
#12: c 1 4 NA
Here is a solution that does not require loading external packages:
df <- structure(list(A = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
B = c(2L, 2L, 1L, 0L, 3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L), day = c(1L,
2L, 5L, 8L, 1L, 4L, 6L, 7L, 9L, 2L, 3L, 4L)), .Names = c("A",
"B", "day"), class = "data.frame", row.names = c(NA, -12L))
x <- split(df, df$A, drop = TRUE)
tp <- lapply(x, function(k) {
tmp <- k[k$B >0,]
k$last <- tmp$day[length(tmp$day)]
k
})
do.call(rbind, tp)
#     A B day last
#a.1 a 2 1 5
#a.2 a 2 2 5
#a.3 a 1 5 5
#a.4 a 0 8 5
#b.5 b 3 1 6
#b.6 b 3 4 6
#b.7 b 3 6 6
#b.8 b 0 7 6
#b.9 b 0 9 6
#c.10 c 1 2 4
#c.11 c 1 3 4
#c.12 c 1 4 4
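A shorter base R variant is possible with ave(), which fills each group with the value returned for that group. This is only a sketch; group d, which has no B >= 1, is a hypothetical addition to show that such a group gets NA:

```r
df <- data.frame(
  A = c("a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "d", "d"),
  B = c(2L, 2L, 1L, 0L, 3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L, 0L, 0L),
  day = c(1L, 2L, 5L, 8L, 1L, 4L, 6L, 7L, 9L, 2L, 3L, 4L, 1L, 2L)
)

# ave() passes the row numbers of each group to FUN; the single value
# returned is recycled across all rows of that group
df$last <- ave(seq_len(nrow(df)), df$A, FUN = function(idx) {
  hits <- idx[df$B[idx] >= 1]            # rows in this group with B >= 1
  if (length(hits) == 0) NA_integer_ else df$day[max(hits)]
})
df
```

This assumes the rows are already ordered by day within each group, as in the example data.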
