Repeated measures in messy format, need help to tidy - r

I have a very large data set containing weekly weights that have been coded with week of study and the weight at that visit. There are some missing visits and the data is not currently aligned.
df <- data.frame(ID=1:3, Week_A=c(6,6,7), Weight_A=c(23,24,23), Week_B=c(7,7,8),
Weight_B=c(25,26,27), Week_C=c(8,9,9), Weight_C=c(27,26,28))
df
ID Week_A Weight_A Week_B Weight_B Week_C Weight_C
1 1 6 23 7 25 8 27
2 2 6 24 7 26 9 26
3 3 7 23 8 27 9 28
I would like to align the data by week number (ideal output below).
df_ideal <- data.frame(ID=1:3, Week_6=c(23,24,NA), Week_7=c(25,26,23),
                       Week_8=c(27,NA,27), Week_9=c(NA,26,28))
df_ideal
ID Week_6 Week_7 Week_8 Week_9
1 1 23 25 27 NA
2 2 24 26 NA 26
3 3 NA 23 27 28
I would appreciate some help with this, even just a starting point for getting this data into an easier-to-manage format.

A tidyverse solution:
df <- data.frame(ID=1:3,
Week_A=c(6,6,7),
Weight_A=c(23,24,23),
Week_B=c(7,7,8),
Weight_B=c(25,26,27),
Week_C=c(8,9,9),
Weight_C=c(27,26,28))
library(tidyverse)
df_long <- df %>%
  gather(key="v", value="value", -ID) %>%
  separate(v, into=c("v1", "v2")) %>%
  spread(v1, value) %>%
  complete(ID, Week) %>%
  arrange(Week, ID)
df_long
# A tibble: 12 x 4
# ID Week v2 Weight
# <int> <dbl> <chr> <dbl>
# 1 1 6 A 23
# 2 2 6 A 24
# 3 3 6 <NA> NA
# 4 1 7 B 25
# 5 2 7 B 26
# 6 3 7 A 23
# 7 1 8 C 27
# 8 2 8 <NA> NA
# 9 3 8 B 27
#10 1 9 <NA> NA
#11 2 9 C 26
#12 3 9 C 28
df_wide <- df_long %>% select(-v2) %>%
spread(Week, Weight, sep="_")
df_wide
# A tibble: 3 x 5
# ID Week_6 Week_7 Week_8 Week_9
# <int> <dbl> <dbl> <dbl> <dbl>
#1 1 23 25 27 NA
#2 2 24 26 NA 26
#3 3 NA 23 27 28
Personally, I'd keep using df_long instead of df_wide, as it is a tidy data frame, while df_wide is not.
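For example (a small sketch of why the long form is handy), per-week summaries come straight out of df_long without ever touching the Week_6/Week_7/... column names:
# Sketch: summarise by week directly from the long form
df_long %>%
  group_by(Week) %>%
  summarise(mean_weight = mean(Weight, na.rm = TRUE),
            n_measured  = sum(!is.na(Weight)))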

Here is a possible approach using the data.table package
library(data.table)
#convert into a data.table
setDT(df)
#convert into a long format
mdat <- melt(df, id.vars="ID", measure.vars=patterns("^Week", "^Weight", cols=names(df)))
#pivot into desired output
ans <- dcast(mdat, ID ~ value1, value.var="value2")
ans
output:
ID 6 7 8 9
1: 1 23 25 27 NA
2: 2 24 26 NA 26
3: 3 NA 23 27 28
And if you really need the "Week_" in your column names, you can use
setnames(ans, names(ans)[-1L], paste0("Week_", names(ans)[-1L]))

Another tidyverse solution using a double-gather with a final spread
df %>%
gather(k, v, -ID, -starts_with("Weight")) %>%
separate(k, into = c("k1", "k2")) %>%
unite(k1, k1, v) %>%
gather(k, v, starts_with("Weight")) %>%
separate(k, into = c("k3", "k4")) %>%
filter(k2 == k4) %>%
select(-k2, -k3, -k4) %>%
spread(k1, v)
# ID Week_6 Week_7 Week_8 Week_9
#1 1 23 25 27 NA
#2 2 24 26 NA 26
#3 3 NA 23 27 28

In base R, it's a double reshape, firstly to long and then back to wide on a different variable:
tmp <- reshape(df, idvar="ID", varying=lapply(c("Week_","Weight_"), grep, names(df)),
v.names=c("time","Week"), direction="long")
reshape(tmp, idvar="ID", direction="wide", sep="_")
# ID Week_6 Week_7 Week_8 Week_9
#1.1 1 23 25 27 NA
#2.1 2 24 26 NA 26
#3.1 3 NA 23 27 28
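For completeness, here is a hedged sketch of the same result with more recent tidyr (>= 1.0.0), assuming df and the tidyverse from above: one pivot_longer that pairs the Week_*/Weight_* columns, then a pivot_wider on the week number.
# Sketch with tidyr >= 1.0.0 (tidyverse loaded as above)
df %>%
  pivot_longer(-ID, names_to = c(".value", "visit"), names_sep = "_") %>%
  # now one row per ID/visit, with columns ID, visit, Week, Weight
  pivot_wider(id_cols = ID, names_from = Week, values_from = Weight,
              names_prefix = "Week_")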

Merge/combine rows with same ID and Date in R

I have an Excel database like the one below. The Excel form only had the option to enter 3 drug details; wherever there are more than 3 drugs, the extra ones have been entered into another row with the same PID and Date.
Is there a way I can merge the rows in R so that each patient's records end up in a single row? In the example below, I need to merge rows 1 & 2 and rows 4 & 6.
Thanks.
Row  PID  Date        Drug1  Dose1  Drug2  Dose2  Drug3  Dose3  Age  Place
1    11A  25/10/2021  RPG    12     NAT    34     QRT    5      45   PMk
2    11A  25/10/2021  BET    10     SET    43     BLT    45
3    12B  20/10/2021  ATY    13     LTP    3      CRT    3      56   GTL
4    13A  22/10/2021  GGS    7      GSF    12     ERE    45     45   RKS
5    13A  26/10/2021  BRT    9      ARR    4      GSF    34     46   GLO
6    13A  22/10/2021  DFS    5
7    14B  04/08/2021  GDS    2      TRE    55     HHS    34     25   MTK
Up front, the two methods below are completely different, not equivalents in "base R vs dplyr". I'm sure either can be translated to the other.
dplyr
The premise here is to first reshape/pivot the data longer so that each Drug/Dose is on its own line, renumber them appropriately, and then bring it back to a wide state.
NOTE: frankly, I usually prefer to deal with data in a long format, so consider keeping it in its state immediately before pivot_wider. This means you'd need to bring Age and Place back into it somehow.
Why? A long format deals very well with many types of aggregation; ggplot2 really really prefers data in the long format; I dislike seeing and having to deal with all of the NA/empty values that will invariably happen with this wide format, since many PIDs don't have (e.g.) Drug6 or later. This seems subjective, but it can really be an objective change/improvement to data-mangling, depending on your workflow.
library(dplyr)
# library(tidyr) # pivot_longer, pivot_wider
dat0 <- select(dat, PID, Date, Age, Place) %>%
  group_by(PID, Date) %>%
  summarize(across(everything(), ~ .[!is.na(.) & nzchar(trimws(.))][1]))
dat %>%
  select(-Age, -Place) %>%
  tidyr::pivot_longer(
    -c(Row, PID, Date),
    names_to = c(".value", "iter"),
    names_pattern = "^([^0-9]+)([123]?)$") %>%
  arrange(Row, iter) %>%
  group_by(PID, Date) %>%
  mutate(iter = row_number()) %>%
  select(-Row) %>%
  tidyr::pivot_wider(
    c("PID", "Date"), names_sep = "",
    names_from = "iter", values_from = c("Drug", "Dose")) %>%
  left_join(dat0, by = c("PID", "Date"))
# # A tibble: 5 x 16
# # Groups: PID, Date [5]
# PID Date Drug1 Drug2 Drug3 Drug4 Drug5 Drug6 Dose1 Dose2 Dose3 Dose4 Dose5 Dose6 Age Place
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <chr>
# 1 11A 25/10/2021 RPG NAT QRT BET "SET" "BLT" 12 34 5 10 43 45 45 PMk
# 2 12B 20/10/2021 ATY LTP CRT <NA> <NA> <NA> 13 3 3 NA NA NA 56 GTL
# 3 13A 22/10/2021 GGS GSF ERE DFS "" "" 7 12 45 5 NA NA 45 RKS
# 4 13A 26/10/2021 BRT ARR GSF <NA> <NA> <NA> 9 4 34 NA NA NA 46 GLO
# 5 14B 04/08/2021 GDS TRE HHS <NA> <NA> <NA> 2 55 34 NA NA NA 25 MTK
Notes:
I broke out dat0 early, since Age and Place don't really fit into the pivot/renumber/pivot mindset.
base R
Here's a base R method that splits (according to your grouping criteria: PID and Date), finds the Drug/Dose columns that need to be renumbered, renames them, and the merges all of the frames back together.
spl <- split(dat, ave(rep(1L, nrow(dat)), dat[,c("PID", "Date")], FUN = seq_along))
spl
# $`1`
# Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 1 1 11A 25/10/2021 RPG 12 NAT 34 QRT 5 45 PMk
# 3 3 12B 20/10/2021 ATY 13 LTP 3 CRT 3 56 GTL
# 4 4 13A 22/10/2021 GGS 7 GSF 12 ERE 45 45 RKS
# 5 5 13A 26/10/2021 BRT 9 ARR 4 GSF 34 46 GLO
# 7 7 14B 04/08/2021 GDS 2 TRE 55 HHS 34 25 MTK
# $`2`
# Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 2 2 11A 25/10/2021 BET 10 SET 43 BLT 45 NA
# 6 6 13A 22/10/2021 DFS 5 NA NA NA
nms <- lapply(spl, function(x) grep("^(Drug|Dose)", colnames(x), value = TRUE))
nms <- data.frame(i = rep(names(nms), lengths(nms)), oldnm = unlist(nms))
nms$grp <- gsub("[0-9]+$", "", nms$oldnm)
nms$newnm <- paste0(nms$grp, ave(nms$grp, nms$grp, FUN = seq_along))
nms <- split(nms, nms$i)
newspl <- Map(function(x, nm) {
colnames(x)[ match(nm$oldnm, colnames(x)) ] <- nm$newnm
x
}, spl, nms)
newspl[-1] <- lapply(newspl[-1], function(x) x[, c("PID", "Date", grep("^(Drug|Dose)", colnames(x), value = TRUE)), drop = FALSE ])
newspl
# $`1`
# Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 1 1 11A 25/10/2021 RPG 12 NAT 34 QRT 5 45 PMk
# 3 3 12B 20/10/2021 ATY 13 LTP 3 CRT 3 56 GTL
# 4 4 13A 22/10/2021 GGS 7 GSF 12 ERE 45 45 RKS
# 5 5 13A 26/10/2021 BRT 9 ARR 4 GSF 34 46 GLO
# 7 7 14B 04/08/2021 GDS 2 TRE 55 HHS 34 25 MTK
# $`2`
# PID Date Drug4 Dose4 Drug5 Dose5 Drug6 Dose6
# 2 11A 25/10/2021 BET 10 SET 43 BLT 45
# 6 13A 22/10/2021 DFS 5 NA NA
Reduce(function(a, b) merge(a, b, by = c("PID", "Date"), all = TRUE), newspl)
# PID Date Row Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place Drug4 Dose4 Drug5 Dose5 Drug6 Dose6
# 1 11A 25/10/2021 1 RPG 12 NAT 34 QRT 5 45 PMk BET 10 SET 43 BLT 45
# 2 12B 20/10/2021 3 ATY 13 LTP 3 CRT 3 56 GTL <NA> NA <NA> NA <NA> NA
# 3 13A 22/10/2021 4 GGS 7 GSF 12 ERE 45 45 RKS DFS 5 NA NA
# 4 13A 26/10/2021 5 BRT 9 ARR 4 GSF 34 46 GLO <NA> NA <NA> NA <NA> NA
# 5 14B 04/08/2021 7 GDS 2 TRE 55 HHS 34 25 MTK <NA> NA <NA> NA <NA> NA
Notes:
The underlying premise of this is that you want to merge the rows onto previous rows. This means (to me) using base::merge or dplyr::full_join; two good links for understanding these concepts, in case you are not aware: How to join (merge) data frames (inner, outer, left, right), What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?
To do that, we need to determine which rows are duplicates of previous; further, we need to know how many previous same-key rows there are. There are a few ways to do this, but I think the easiest is with base::split. In this case, no PID/Date combination has more than two rows, but if you had one combination that mandated a third row, spl would be length-3, and the resulting names would go out to Drug9/Dose9.
The second portion (nms <- ...) is where we work on the names. The first few steps create a nms dataframe that we'll use to map from old to new names. Since we're concerned about contiguous numbering through all multi-row groups, we aggregate on the base (number removed) of the Drug/Dose names, so that we number all Drug columns from Drug1 through how many there are.
Note: this assumes that there are always perfect pairs of Drug#/Dose#; if there is ever a mismatch, then the numbering will be suspect.
We end with nms being a split dataframe, just like spl of the data. This is useful and important, since we'll Map (zip-like lapply) them together.
The third block updates spl with the new names. The result in newspl is just renaming of the columns so that when we merge them together, no column-duplication will occur.
One additional step here is removing unrelated columns from the 2nd and subsequent frame in the list. That is, we keep Age and Place in the first such frame but remove it from the rest. My assumption (based on the NA/empty nature of those fields in duplicate rows) is that we only want to keep the first row's values.
The last step is to iteratively merge them together. The Reduce function is nice for this.
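To make the two base R building blocks concrete, here is a tiny self-contained sketch (toy data, not the drug table) of the occurrence numbering that ave(..., FUN = seq_along) produces and of Reduce() folding merge() over a list of frames:
# Toy illustration (not the drug data) of the two building blocks used above
toy <- data.frame(PID = c("A", "A", "B"), x = 1:3)

# 1) occurrence number within each key: 1 for a PID's first row, 2 for its second, ...
ave(rep(1L, nrow(toy)), toy$PID, FUN = seq_along)
# [1] 1 2 1

# 2) Reduce() folds merge() over a list of frames, merging them pairwise left to right
frames <- list(data.frame(PID = c("A", "B"), v1 = 1:2),
               data.frame(PID = "A", v2 = 9))
Reduce(function(a, b) merge(a, b, by = "PID", all = TRUE), frames)
#   PID v1 v2
# 1   A  1  9
# 2   B  2 NA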
Update:
With the help of akrun (see: Use separate after mutate and across), we could:
library(dplyr)
library(stringr)
library(tidyr)
df %>%
  group_by(PID) %>%
  summarise(across(everything(), ~toString(.))) %>%
  mutate(across(everything(), ~ list(tibble(col1 = .) %>%
    separate(col1, into = str_c(cur_column(), 1:3), sep = ",\\s+",
             fill = "left", extra = "drop")))) %>%
  unnest(c(PID, Row, Date, Drug1, Dose1, Drug2, Dose2, Drug3, Dose3, Age, Place)) %>%
  distinct() %>%
  select(-1, -2)
PID3 Row1 Row2 Row3 Date1 Date2 Date3 Drug11 Drug12 Drug13 Dose11 Dose12 Dose13 Drug21 Drug22 Drug23 Dose21 Dose22 Dose23 Drug31 Drug32 Drug33 Dose31 Dose32 Dose33 Age1 Age2 Age3 Place1 Place2 Place3
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 11A NA 1 2 NA 25/10/2021 25/10/2021 NA RPG BET NA 12 10 NA NAT SET NA 34 43 NA QRT BLT NA 5 45 NA 45 NA NA PMk NA
2 12B NA NA 3 NA NA 20/10/2021 NA NA ATY NA NA 13 NA NA LTP NA NA 3 NA NA CRT NA NA 3 NA NA 56 NA NA GTL
3 13A 4 5 6 22/10/2021 26/10/2021 22/10/2021 GGS BRT DFS 7 9 5 GSF ARR NA 12 4 NA ERE GSF NA 45 34 NA 45 46 NA RKS GLO NA
4 14B NA NA 7 NA NA 04/08/2021 NA NA GDS NA NA 2 NA NA TRE NA NA 55 NA NA HHS NA NA 34 NA NA 25 NA NA MTK
First answer:
Keeping the excellent explanation of @r2evans in mind, we could do it this way if really desired:
library(dplyr)
df %>%
group_by(PID) %>%
summarise(across(everything(), ~toString(.)))
output:
PID Row Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 11A 1, 2 25/10/2021, 25/10/2021 RPG, BET 12, 10 NAT, SET 34, 43 QRT, BLT 5, 45 45, NA PMk, NA
2 12B 3 20/10/2021 ATY 13 LTP 3 CRT 3 56 GTL
3 13A 4, 5, 6 22/10/2021, 26/10/2021, 22/10/2021 GGS, BRT, DFS 7, 9, 5 GSF, ARR, NA 12, 4, NA ERE, GSF, NA 45, 34, NA 45, 46, NA RKS, GLO, NA
4 14B 7 04/08/2021 GDS 2 TRE 55 HHS 34 25 MTK
Another tidyverse-based solution, with a pivot_longer followed by a pivot_wider:
library(tidyverse)
# Note that my dataframe does not contain column Row
df %>%
  mutate(across(starts_with("Dose"), as.character)) %>%
  pivot_longer(!c(PID, Date, Age, Place), names_to = "trm") %>%
  group_by(PID, Date) %>%
  fill(Age, Place) %>%
  mutate(trm = paste(trm, 1:n(), sep = "_")) %>%
  ungroup %>%
  pivot_wider(c(PID, Date, Age, Place), names_from = trm) %>%
  rename_with(~ paste0("Drug", 1:length(.x)), starts_with("Drug")) %>%
  rename_with(~ paste0("Dose", 1:length(.x)), starts_with("Dose")) %>%
  mutate(across(starts_with("Dose"), as.numeric))
#> # A tibble: 5 × 16
#> PID Date Age Place Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Drug4 Dose4 Drug5
#> <chr> <chr> <int> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
#> 1 11A 25/10… 45 PMk RPG 12 NAT 34 QRT 5 BET 10 SET
#> 2 12B 20/10… 56 GTL ATY 13 LTP 3 CRT 3 <NA> NA <NA>
#> 3 13A 22/10… 45 RKS GGS 7 GSF 12 ERE 45 DFS 5 <NA>
#> 4 13A 26/10… 46 GLO BRT 9 ARR 4 GSF 34 <NA> NA <NA>
#> 5 14B 04/08… 25 MTK GDS 2 TRE 55 HHS 34 <NA> NA <NA>
#> # … with 3 more variables: Dose5 <dbl>, Drug6 <chr>, Dose6 <dbl>
A data.table approach
library(data.table)
DT <- fread("Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
1 11A 25/10/2021 RPG 12 NAT 34 QRT 5 45 PMk
2 11A 25/10/2021 BET 10 SET 43 BLT 45
3 12B 20/10/2021 ATY 13 LTP 3 CRT 3 56 GTL
4 13A 22/10/2021 GGS 7 GSF 12 ERE 45 45 RKS
5 13A 26/10/2021 BRT 9 ARR 4 GSF 34 46 GLO
6 13A 22/10/2021 DFS 5
7 14B 04/08/2021 GDS 2 TRE 55 HHS 34 25 MTK", fill = TRUE)
DT
# Melt to long format
ans <- melt(DT, id.vars = c("PID", "Date"),
measure.vars = patterns(drug = "^Drug", dose = "^Dose"),
na.rm = TRUE)
# Paste and Collapse, use ; as separator
ans <- ans[, lapply(.SD, paste0, collapse = ";"), by = .(PID, Date)]
# Split string on ;
ans[, paste0("Drug", 1:length(tstrsplit(ans$drug, ";"))) := tstrsplit(drug, ";")]
ans[, paste0("Dose", 1:length(tstrsplit(ans$dose, ";"))) := tstrsplit(dose, ";")]
#join Age + Place data
ans[DT[!is.na(Age), ], `:=`(Age = i.Age, Place = i.Place), on = .(PID, Date)]
ans[, -c("variable", "drug", "dose")]
# PID Date Drug1 Drug2 Drug3 Drug4 Drug5 Drug6 Dose1 Dose2 Dose3 Dose4 Dose5 Dose6 Age Place
# 1: 11A 25/10/2021 RPG BET NAT SET QRT BLT 12 10 34 43 5 45 45 PMk
# 2: 12B 20/10/2021 ATY LTP CRT <NA> <NA> <NA> 13 3 3 <NA> <NA> <NA> 56 GTL
# 3: 13A 22/10/2021 GGS DFS GSF ERE <NA> <NA> 7 5 12 45 <NA> <NA> 45 RKS
# 4: 13A 26/10/2021 BRT ARR GSF <NA> <NA> <NA> 9 4 34 <NA> <NA> <NA> 46 GLO
# 5: 14B 04/08/2021 GDS TRE HHS <NA> <NA> <NA> 2 55 34 <NA> <NA> <NA> 25 MTK
Another answer to add to the collection, reading the data directly from this page:
require(rvest)
require(tidyverse)
d = read_html("https://stackoverflow.com/q/69787018/694915") %>%
html_nodes("table") %>%
html_table(fill = TRUE)
List of doses per PID and Date:
# first table
d[[1]] -> df
df %>%
pivot_longer(
cols = starts_with("Drug"),
values_to = "Drug"
) %>%
select( !name ) %>%
pivot_longer(
cols = starts_with("Dose"),
values_to = "Dose"
) %>%
select( !name ) %>%
drop_na() %>%
pivot_wider(
names_from = Drug,
values_from = Dose ,
values_fill = list(0)
) -> dose
The variable dose contains this data (screenshot): https://i.stack.imgur.com/lc3iN.png
Not as elegant as the previous ones, but it gives an idea of the whole treatment per PID.

Sampling the same number of imbalanced classes from two data frames

I am trying to sample the same number of classes from two data frames of differing sizes. I can do this manually, but the number of classes in some of my data frames is quite large.
I have been able to use the dplyr::count function to get a list of the classes of interest from the smaller data frame, as well as their counts. I then extract these classes and their counts as vectors. I have then attempted to create a function using these vectors and call it with mapply, so I can create filtered slices for each class and re-join the lists using do.call, but I am getting errors when I attempt to run mapply.
Below are example datasets. df is the smaller data frame, which has 6 rows with ControlVarA == "Group_1" and 10 rows with ControlVarA == "Group_2", and I want to extract the same number of rows per class from the larger data frame df2 (which has 6 rows with ControlVarA == "Group_1" and 20 rows with ControlVarA == "Group_2").
df <- data.frame("ID" = 1:16)
df$VarA <- c(1,1,1,1,1,1,1,1,1,1,1,14,NA_real_,NA_real_,NA_real_,16)
df$VarB <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16,16)
df$VarC <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$ControlVarA <- factor(c("Group_1","Group_1","Group_1","Group_1","Group_1", "Group_1",
"Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
"Group_2","Group_2","Group_2","Group_2"))
df
df2 <- data.frame("ID" = 1:26)
df2$VarA <- c(1,1,1,1,1,1,1,1,1,1,1,14,NA_real_,NA_real_,NA_real_,16,16,16,16,16,16,16,16,16,16,16)
df2$VarB <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16,16,16,16,16,16,16,16,16,16,16,16)
df2$VarC <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16,16,16,16,16,16,16,16,16,16,16)
df2$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16,16,16,16,16,16,16,16,16,16,16)
df2$ControlVarA <- factor(c("Group_1","Group_1","Group_1","Group_1","Group_1", "Group_1",
"Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
"Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
"Group_2","Group_2","Group_2","Group_2","Group_2","Group_2","Group_2","Group_2"))
df2
To extract the class names and class counts I use the code below.
slice_vars <- df %>%
  count(ControlVarA) %>%
  filter(!is.na(.)) %>%
  t() %>%
  janitor::row_to_names(1) %>%
  colnames()
slice_nums <- df %>%
  count(ControlVarA) %>%
  filter(!is.na(.)) %>%
  t() %>%
  janitor::row_to_names(2) %>%
  as.data.frame() %>%
  rename_with(~ gsub(" ", "", .x)) %>%
  colnames() %>%
  as.numeric()
The function I created and mapply statement are below
func_group <- function(dataset, x, y) {
dataset %>%
group_by(ControlVarA) %>%
slice_sample(n = all_of(x)) %>%
ungroup() %>%
filter(ControlVarA == data[[y]])
}
combine_lists <- mapply(func_group, slice_nums, slice_vars, MoreArgs = list(dataset = df2))
do.call(rbind, combine_lists)
Use count to get the number of rows for each value in ControlVarA, join with df2, and select n random rows from each group using sample_n. (Unfortunately, slice_sample(n = first(n)) returns an error.)
library(dplyr)
df %>%
count(ControlVarA) %>%
left_join(df2, by = 'ControlVarA') %>%
group_by(ControlVarA) %>%
sample_n(first(n)) %>%
ungroup %>%
select(-n)
# ControlVarA ID VarA VarB VarC VarD
# <fct> <int> <dbl> <dbl> <dbl> <dbl>
# 1 Group_1 1 1 10 10 10
# 2 Group_1 4 1 0 16 16
# 3 Group_1 3 1 0 14 14
# 4 Group_1 2 1 0 12 12
# 5 Group_1 5 1 12 10 10
# 6 Group_1 6 1 12 12 12
# 7 Group_2 12 14 14 16 16
# 8 Group_2 25 16 16 16 16
# 9 Group_2 15 NA 16 14 14
#10 Group_2 22 16 16 16 16
#11 Group_2 9 1 0 10 10
#12 Group_2 8 1 12 16 16
#13 Group_2 24 16 16 16 16
#14 Group_2 21 16 16 16 16
#15 Group_2 7 1 12 14 14
#16 Group_2 14 NA 16 12 12
library(tidyverse)
df <- data.frame("ID" = 1:16)
df$VarA <- c(1,1,1,1,1,1,1,1,1,1,1,14,NA_real_,NA_real_,NA_real_,16)
df$VarB <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16,16)
df$VarC <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$ControlVarA <- factor(c("Group_1","Group_1","Group_1","Group_1","Group_1", "Group_1",
"Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
"Group_2","Group_2","Group_2","Group_2"))
df2 <- data.frame("ID" = 1:26)
df2$VarA <- c(1,1,1,1,1,1,1,1,1,1,1,14,NA_real_,NA_real_,NA_real_,16,16,16,16,16,16,16,16,16,16,16)
df2$VarB <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16,16,16,16,16,16,16,16,16,16,16,16)
df2$VarC <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16,16,16,16,16,16,16,16,16,16,16)
df2$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16,16,16,16,16,16,16,16,16,16,16)
df2$ControlVarA <- factor(c("Group_1","Group_1","Group_1","Group_1","Group_1", "Group_1",
"Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
"Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
"Group_2","Group_2","Group_2","Group_2","Group_2","Group_2","Group_2","Group_2"))
df <- as_tibble(df) %>%
mutate(table = "df")
df2 <- as_tibble(df2) %>%
mutate(table = "df2")
final_df <- df %>%
bind_rows(df2)
set.seed(2021)
final_df %>%
filter(!if_any(.cols = VarA:VarD, is.na)) %>%
group_by(table, ControlVarA) %>%
slice_sample(n = 5)
#> # A tibble: 20 x 7
#> # Groups: table, ControlVarA [4]
#> ID VarA VarB VarC VarD ControlVarA table
#> <int> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
#> 1 6 1 12 12 12 Group_1 df
#> 2 2 1 0 12 12 Group_1 df
#> 3 3 1 0 14 14 Group_1 df
#> 4 5 1 12 10 10 Group_1 df
#> 5 4 1 0 16 16 Group_1 df
#> 6 16 16 16 16 16 Group_2 df
#> 7 9 1 0 10 10 Group_2 df
#> 8 8 1 12 16 16 Group_2 df
#> 9 10 1 14 12 12 Group_2 df
#> 10 7 1 12 14 14 Group_2 df
#> 11 1 1 10 10 10 Group_1 df2
#> 12 4 1 0 16 16 Group_1 df2
#> 13 3 1 0 14 14 Group_1 df2
#> 14 2 1 0 12 12 Group_1 df2
#> 15 6 1 12 12 12 Group_1 df2
#> 16 22 16 16 16 16 Group_2 df2
#> 17 23 16 16 16 16 Group_2 df2
#> 18 9 1 0 10 10 Group_2 df2
#> 19 18 16 16 16 16 Group_2 df2
#> 20 20 16 16 16 16 Group_2 df2
Created on 2021-07-13 by the reprex package (v2.0.0)

How to bring information from one dataframe to another without losing information for duplicates?

Consider df1:
id=c(1,2,3,4,5,6); n_df2=c(1,1,2,2,3,3);
df1=as.data.frame(cbind(id,n_df2)); df1
where n_df2 is the number of appearances for that id in df2.
id2=c(1,2,3,4,5,6,3,4,5,5,6,6);
value=c(25,35,46,78,12,34,12,33,87,56,11,8);
df2=as.data.frame(cbind(id2,value)); df2
(id2 is equivalent to id)
I want df1 to look like this:
df1$value.1=c(25,35,46,78,12,34)
df1$value.2=c(NA,NA,12,33,87,11);
df1$value.3=c(NA,NA,NA,NA,56,8); df1;
Any help will be very much appreciated!
Thanks.
In base R you could do:
merge(df1, reshape(transform(df2[c('id2','value')],
time = ave(id2, id2, FUN = seq_along)),dir = 'wide', idvar = 'id2'),
by.x = 'id', by.y = 'id2', all.x = TRUE)
id n_df2 value.1 value.2 value.3
1 1 1 25 NA NA
2 2 1 35 NA NA
3 3 2 46 12 NA
4 4 2 78 33 NA
5 5 3 12 87 56
6 6 3 34 11 8
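For readability, here is the same idea unrolled into steps (just a sketch of what the one-liner above does, working on a copy so df2 is left untouched):
tmp <- df2
tmp$time <- ave(tmp$id2, tmp$id2, FUN = seq_along)         # occurrence number within each id2
wide <- reshape(tmp, direction = "wide", idvar = "id2",    # one column per occurrence:
                timevar = "time")                          # value.1, value.2, value.3
merge(df1, wide, by.x = "id", by.y = "id2", all.x = TRUE)  # left-join back onto df1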
In tidyverse
library(tidyverse)
df1 %>%
  left_join(df2 %>%
              select(id2, value) %>%
              group_by(id2) %>%
              mutate(time = row_number()) %>%
              pivot_wider(id2, names_from = time, names_prefix = 'value.', values_from = 'value'),
            c('id' = 'id2'))
id n_df2 value.1 value.2 value.3
1 1 1 25 NA NA
2 2 1 35 NA NA
3 3 2 46 12 NA
4 4 2 78 33 NA
5 5 3 12 87 56
6 6 3 34 11 8

In search of a more efficient solution converting Wide data to long data

I want to convert the data from wide to long. I have solved the problem with base R's reshape(), but I then had to manually define which columns belonged to the "gather columns"; if there are hundreds of columns (which is the case in my data), that would be time-consuming with a high risk of writing errors.
Does anyone know a more efficient way to reach this result?
id <- 1001:1003
qA2 <- c(10,5,1)
qB2 <- c(11,6,3)
qC2 <- c(10,7,5)
qA3 <- c(15,12,8)
qB3 <- c(18,15,7)
qC3 <- c(19,11,10)
df <- data.frame(id,qA2,qB2,qC2, qA3, qB3, qC3)
df
id qA2 qB2 qC2 qA3 qB3 qC3
1 1001 10 11 10 15 18 19
2 1002 5 6 7 12 15 11
3 1003 1 3 5 8 7 10
Solution with base R's reshape() (note: this is stats::reshape, not the reshape2 package):
df_test <- reshape(df, idvar="id", direction="long", varying=list(c(2,5), c(3,6), c(4,7)),
                   v.names=c("qA", "qB", "qC"), times=2:3)
df_test
df_test <- df_test[order(df_test$id, df_test$time),]
id time qA qB qC
1001.2 1001 2 10 11 10
1001.3 1001 3 15 18 19
1002.2 1002 2 5 6 7
1002.3 1002 3 12 15 11
1003.2 1003 2 1 3 5
1003.3 1003 3 8 7 10
Using dplyr and tidyr, here is one way (not sure about the efficiency, though):
library(dplyr)
library(tidyr)
df %>%
  gather(key, value, -id) %>%
  mutate(key = sub("\\d+", "", key)) %>%
  group_by(key) %>%
  mutate(row = row_number()) %>%
  spread(key, value) %>%
  select(-row)
# A tibble: 6 x 4
# id qA qB qC
# <int> <dbl> <dbl> <dbl>
#1 1001 10 11 10
#2 1001 15 18 19
#3 1002 5 6 7
#4 1002 12 15 11
#5 1003 1 3 5
#6 1003 8 7 10
With the new version of tidyr (1.0.0) (already on CRAN, just update it):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with("q"),
               names_to = c(".value", "time"),
               names_pattern = "(q[A-Z])(\\d+)")
Here is a base R one-liner:
df1 <- cbind(id = df$id, (do.call(cbind, lapply(split.default(df[-1],
gsub('\\d+', '', names(df)[-1])), stack))[c(TRUE, FALSE)]))
df1[with(df1, order(id)),]
# id qA.values qB.values qC.values
#1 1001 10 11 10
#4 1001 15 18 19
#2 1002 5 6 7
#5 1002 12 15 11
#3 1003 1 3 5
#6 1003 8 7 10
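A step-by-step reading of that one-liner (same result, just unrolled; the intermediate names are mine), in case the nested calls are hard to follow:
groups  <- split.default(df[-1], gsub('\\d+', '', names(df)[-1]))   # list of column sets: qA, qB, qC
stacked <- lapply(groups, stack)                    # each set stacked into a 2-column frame (values, ind)
wide    <- do.call(cbind, stacked)[c(TRUE, FALSE)]  # cbind them, keep every other column (the values)
out     <- cbind(id = df$id, wide)                  # id (length 3) recycles across the 6 stacked rows
out[order(out$id), ]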
We can use names_pattern with pivot_longer
library(tidyr)
pivot_longer(df, -id, names_to = c(".value", "time"), names_pattern= "(\\D+)(\\d+)")
# A tibble: 6 x 5
# id time qA qB qC
# <int> <chr> <dbl> <dbl> <dbl>
#1 1001 2 10 11 10
#2 1001 3 15 18 19
#3 1002 2 5 6 7
#4 1002 3 12 15 11
#5 1003 2 1 3 5
#6 1003 3 8 7 10

data frame selecting top by grouping

I have a data frame such as:
set.seed(1)
df <- data.frame(
sample = 1:50,
value = runif(50),
group = c(rep(NA, 20), gl(3, 10)))
I want to select the top 10 samples based on value. However, if there is a group corresponding to the sample, I only want to include one sample from that group. If group == NA, I want to include all of them. Arranging df by value looks like:
df_top <- df %>%
arrange(-value) %>%
top_n(10, value)
sample value group
1 46 0.7973088 3
2 49 0.8108702 3
3 22 0.8394404 1
4 2 0.8612095 NA
5 27 0.8643395 1
6 20 0.8753213 NA
7 44 0.8762692 3
8 26 0.8921983 1
9 11 0.9128759 NA
10 30 0.9606180 1
I would want to include samples 36, 22, 2, 20, 11, and the next five highest values in my data frame that continue to fit the pattern. How do I accomplish this?
I think I figured this out. Would this be the best way:
df_top <- df %>%
arrange(-value) %>%
group_by(group) %>%
filter(ifelse(!is.na(group), value == max(value), value == value)) %>%
ungroup() %>%
top_n(10, value)
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 18 0.992 NA
2 7 0.945 NA
3 21 0.935 1
4 4 0.908 NA
5 6 0.898 NA
6 35 0.827 2
7 41 0.821 3
8 20 0.777 NA
9 15 0.770 NA
10 17 0.718 NA
Similar method that uses slice instead of filter:
library(dplyr)
df_top <- df %>%
arrange(-value) %>%
group_by(group) %>%
slice(if(any(!is.na(group))) 1 else 1:n()) %>%
ungroup() %>%
top_n(10, value)
Result:
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 21 0.9347052 1
2 35 0.8273733 2
3 41 0.8209463 3
4 18 0.9919061 NA
5 7 0.9446753 NA
6 4 0.9082078 NA
7 6 0.8983897 NA
8 20 0.7774452 NA
9 15 0.7698414 NA
10 17 0.7176185 NA
