Matching values in different datasets by groups in R

I have the following two datasets:
df1 <- data.frame(
  "group"   = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  "numbers" = c(55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60))
df2 <- data.frame(
  "group" = c(1, 1, 2, 2, 2, 3, 3, 4, 5),
  "P1"    = c(55, NA, 60, 55, 75, 75, 55, 55, 60),
  "P2"    = c(55, 75, 55, 60, NA, 75, 55, NA, 60),
  "P3"    = c(75, 55, 60, 75, NA, 75, 60, 55, 60))
In df1 each group has the same three numbers (in reality there are around 500 numbers).
I want to check whether the values in the numbers column of df1 are contained in the columns P1, P2, and P3 of df2. There are two problems I am stuck on: (1) the values in the numbers column of df1 can occur in several groups of df2 (groups are defined by the group column in df1 and df2), and (2) the datasets have different lengths. Is there a way to merge both datasets and obtain the following dataset?
df3 <- data.frame(
  "group"   = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  "numbers" = c(55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60),
  "P1new"   = c(1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1),
  "P2new"   = c(1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1),
  "P3new"   = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1))
where P1new (and P2new, P3new respectively) contains the value 1 if df2$P1 contains the value in df1$numbers within the correct group (as I said, numbers can reoccur in different groups). For example, P3 has the value 75 in group 1 but not in group 5, so in group 1 P3new would have a 1 and in group 5 P3new would have a 0.
This question is similar to Find matching values in different datasets by groups in R, but I could not adapt the code to my objectives, so I would really appreciate any help.

Interesting question. Here's a way with dplyr functions:
library(dplyr)

df2 %>%
  group_by(group) %>%
  # collapse each P column into a list of its unique non-NA values per group
  summarise(across(P1:P3, ~ list(unique(na.omit(.x))))) %>%
  inner_join(df1, .) %>%
  rowwise() %>%
  # 1 if the row's number occurs in that group's values, 0 otherwise
  mutate(across(P1:P3, ~ +(numbers %in% .x)))
   group numbers    P1    P2    P3
   <dbl>   <dbl> <int> <int> <int>
 1     1      55     1     1     1
 2     1      75     0     1     1
 3     1      60     0     0     0
 4     2      55     1     1     0
 5     2      75     1     0     1
 6     2      60     1     1     1
 7     3      55     1     1     0
 8     3      75     1     1     1
 9     3      60     0     0     1
10     4      55     1     0     1
11     4      75     0     0     0
12     4      60     0     0     0
13     5      55     0     0     0
14     5      75     0     0     0
15     5      60     1     1     1

Another possible solution:
library(tidyverse)
map_dfc(names(df2[-1]),
        ~ df1 %>%
            group_by(group) %>%
            mutate(!!.x := +(numbers %in% df2[df2$group == cur_group_id(), .x])) %>%
            ungroup %>%
            select(all_of(.x))) %>%
  bind_cols(df1, .)
#>    group numbers P1 P2 P3
#> 1      1      55  1  1  1
#> 2      1      75  0  1  1
#> 3      1      60  0  0  0
#> 4      2      55  1  1  0
#> 5      2      75  1  0  1
#> 6      2      60  1  1  1
#> 7      3      55  1  1  0
#> 8      3      75  1  1  1
#> 9      3      60  0  0  1
#> 10     4      55  1  0  1
#> 11     4      75  0  0  0
#> 12     4      60  0  0  0
#> 13     5      55  0  0  0
#> 14     5      75  0  0  0
#> 15     5      60  1  1  1
Or, without purrr, another possibility:
library(dplyr)
df1 %>%
  inner_join(df2) %>%
  group_by(group) %>%
  mutate(across(starts_with("P"), ~ +(numbers %in% .x))) %>%
  ungroup %>%
  distinct
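For completeness, here is a minimal base R sketch of the same idea (the Pnew column names mirror the expected df3; NA entries in df2 simply never match, so they behave as absent values):
# for each row of df1, test membership against the matching group's values
for (p in c("P1", "P2", "P3")) {
  df1[[paste0(p, "new")]] <- mapply(
    function(g, n) +(n %in% df2[[p]][df2$group == g]),
    df1$group, df1$numbers)
}
df1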

Related

R: calculating time interval on condition

I would like to calculate Day.Before_nextCLS from the 3 columns below:
df1 <- tibble::tribble(
  ~Day, ~CLS,   ~BAL.D,
     0,    0,       NA,
     3,    0,    15000,
     6,    0,    10000,
    20,    0,     2000,
    25,    0, -4771299,
    26,    0, -1615637,
    27,    0,  -920917,
    31,    1,  -923089,
    32,    1,   -81863,
    33,    1,    19865,
    34,    1,     9865,
    37,    1,   609865
)
The desired output is the tribble below. For Day 27, Day.Before_nextCLS is 4, because the next CLS starts at Day 31, and the interval between 27 and 31 is 4.
tibble::tribble(
  ~Day, ~CLS,   ~BAL.D, ~Day.Before_nextCLS,
     0,    0,       NA,                  31,
     3,    0,    15000,                  28,
     6,    0,    10000,                  25,
    20,    0,     2000,                  11,
    25,    0, -4771299,                   6,
    26,    0, -1615637,                   5,
    27,    0,  -920917,                   4,
    31,    1,  -923089,                  NA,  # NA because there is no Day where CLS == 2
    32,    1,   -81863,                  NA,
    33,    1,    19865,                  NA,
    34,    1,     9865,                  NA,
    37,    1,   609865,                  NA
)
How can I achieve this?
Thank you very much!!
We create a lead column and then, grouped by CLS, subtract Day from the last value of that lead column:
library(dplyr)
df1 %>%
  mutate(DayLead = lead(Day)) %>%
  group_by(CLS) %>%
  mutate(Day.Before_nextCLS = last(DayLead) - Day, DayLead = NULL) %>%
  ungroup
Output:
# A tibble: 12 × 4
     Day   CLS    BAL.D Day.Before_nextCLS
   <dbl> <dbl>    <dbl>              <dbl>
 1     0     0       NA                 31
 2     3     0    15000                 28
 3     6     0    10000                 25
 4    20     0     2000                 11
 5    25     0 -4771299                  6
 6    26     0 -1615637                  5
 7    27     0  -920917                  4
 8    31     1  -923089                 NA
 9    32     1   -81863                 NA
10    33     1    19865                 NA
11    34     1     9865                 NA
12    37     1   609865                 NA
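An alternative sketch that avoids grouping altogether: match() can find the first Day of the next CLS value directly (this assumes CLS increases in steps of 1, as in the example):
library(dplyr)

df1 %>%
  mutate(Day.Before_nextCLS = Day[match(CLS + 1, CLS)] - Day)
# rows whose CLS + 1 never occurs get NA, matching the desired output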

R: automate table for results of several multivariable logistic regressions

structure(list(Number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15), age = c(25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39), sex = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0), bmi = c(35, 32, 29, 26, 23, 20, 17, 35, 32, 29,
26, 23, 20, 17, 21), Phenotype1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1), `Phenotype 2` = c(0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 1, 1, 1), `Phenotype 3` = c(1, 0, 1, 0, 1, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0), `Phenotype 4` = c(0, 0, 0, 0, 1, 1,
0, 1, 0, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
# A tibble: 15 x 8
   Number   age   sex   bmi Phenotype1 `Phenotype 2` `Phenotype 3` `Phenotype 4`
    <dbl> <dbl> <dbl> <dbl>      <dbl>         <dbl>         <dbl>         <dbl>
 1      1    25     0    35          0             0             1             0
 2      2    26     1    32          0             1             0             0
 3      3    27     0    29          0             0             1             0
 4      4    28     1    26          0             1             0             0
 5      5    29     0    23          0             0             1             1
 6      6    30     1    20          0             1             1             1
 7      7    31     0    17          0             0             1             0
 8      8    32     1    35          0             1             1             1
 9      9    33     0    32          0             0             1             0
10     10    34     1    29          0             1             1             1
11     11    35     0    26          0             0             1             1
12     12    36     1    23          0             1             0             1
13     13    37     0    20          1             1             0             1
14     14    38     1    17          1             1             0             1
15     15    39     0    21          1             1             0             1
Hi all,
I have a dataset of 100 patients (15 are shown here), 3 covariates and 50 phenotypes (4 are shown here).
I want to perform a multivariable logistic regression for each phenotype using age, sex and BMI as covariates.
I would like to get a table with the p-value, OR and confidence interval (CI) for each of the covariates.
I just don't know how to start.
Thank you very much for your help!
Best,
Caro
I wrote a function that should accomplish what you need. There are likely more elegant and more R-like ways of doing this, but this approach worked in my testing:
## Load libraries
library(broom)
library(tidyr)
library(dplyr)

## Define a function to create your summary table
summary_table <- function(x) {
  # Capture number of columns passed to the function
  num_vars <- ncol(x)
  # Pre-define lists that will be populated and then collapsed by the rest of the function
  models <- vector("list", length = num_vars)
  first_tables <- vector("list", length = num_vars)
  second_tables <- vector("list", length = num_vars)
  # Loop to create each row for the final table
  # (note: glm() looks up the covariates in the data.frame `df` in the calling environment)
  for (i in 1:num_vars) {
    models[[i]] <- glm(x[[i]] ~ age + sex + bmi, family = "binomial", data = df)
    first_tables[[i]] <- broom::tidy(models[[i]])
    # Wald-style OR and 95% CI from the coefficient estimates
    first_tables[[i]]$OR <- exp(first_tables[[i]]$estimate)
    first_tables[[i]]$CI1 <- exp(first_tables[[i]]$estimate - (1.96 * first_tables[[i]]$std.error))
    first_tables[[i]]$CI2 <- exp(first_tables[[i]]$estimate + (1.96 * first_tables[[i]]$std.error))
    first_tables[[i]] <- as.data.frame(first_tables[[i]][first_tables[[i]]$term != "(Intercept)",
                                                         c("term", "p.value", "OR", "CI1", "CI2")])[1:3, ]
    # Reshape each model's three covariate rows into one wide row per phenotype
    second_tables[[i]] <- first_tables[[i]] %>%
      pivot_wider(names_from = term, values_from = c("p.value", "OR", "CI1", "CI2")) %>%
      select("p.value_age", "OR_age", "CI1_age", "CI2_age",
             "p.value_bmi", "OR_bmi", "CI1_bmi", "CI2_bmi",
             "p.value_sex", "OR_sex", "CI1_sex", "CI2_sex")
  }
  # Combine the rows together into a final table
  final_table <- do.call("rbind", second_tables)
  final_table <- round(final_table, 3)
  row.names(final_table) <- paste0("Phenotype", 1:num_vars)
  return(final_table)
}
## Let "df" be your data.frame with 100 rows and 54 columns
## Use the summary_table() function, passing in the 50 columns containing your Phenotype outcome vars (I assumed they're in columns 5:54)
final_table <- summary_table(df[5:54])
## Write the final table to your working directory as a CSV
write.csv(final_table, "final_table.csv")
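For reference, the same table can be built more compactly with purrr and broom; here is a sketch under the same assumption that the phenotypes sit in columns 5:54 (note that tidy(conf.int = TRUE) uses profile-likelihood intervals, so the CIs will differ slightly from the 1.96 * SE Wald intervals above):
library(dplyr)
library(purrr)
library(broom)

results <- map_dfr(names(df)[5:54], function(ph) {
  # backticks handle column names with spaces, e.g. `Phenotype 2`
  f <- as.formula(paste0("`", ph, "` ~ age + sex + bmi"))
  glm(f, family = binomial, data = df) %>%
    tidy(exponentiate = TRUE, conf.int = TRUE) %>%
    filter(term != "(Intercept)") %>%
    mutate(phenotype = ph, .before = 1)
})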

Remove duplicate lines while keeping the bottom lines

I would like to remove duplicate rows in R while keeping the information from the lower rows, that is, from this data:
example <- structure(list(var1 = c(1, 1, 2, 2, 3, 4, 5, 6, 6), var2 = c(0,
0, 0, 0, 0, 0, 0, 0, 0), var3 = c(1, 0, 0, 0, 0, 1, 0, 0, 0),
var4 = c(1, 1, 1, 1, 0, 1, 1, 0, 0), var5 = c(1, 1, 1, 0,
0, 1, 1, 0, 0), Year = 2001:2009), row.names = c(NA, -9L), class = "data.frame")
I would like to remove the duplicates keeping the lines at the bottom, so that I get:
example1 <- structure(list(var1 = c(1, 2, 3, 4, 5, 6), var2 = c(0, 0, 0,
0, 0, 0), var3 = c(0, 0, 0, 1, 0, 0), var4 = c(1, 1, 0, 1, 1,
0), var5 = c(1, 0, 0, 1, 1, 0), Year = c(2002, 2004, 2005, 2006,
2007, 2009)), row.names = c(NA, -6L), class = "data.frame")
Is it possible to apply the duplicated function or the distinct function of the dplyr package?
I appreciate any help. Thanks.
Is this what you want?
library(dplyr)

example %>%
  group_by(var1) %>%
  slice_tail()
Output:
# A tibble: 6 x 6
# Groups:   var1 [6]
   var1  var2  var3  var4  var5  Year
  <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1     1     0     0     1     1  2002
2     2     0     0     1     0  2004
3     3     0     0     0     0  2005
4     4     0     1     1     1  2006
5     5     0     0     1     1  2007
6     6     0     0     0     0  2009
@ThomasIsCoding's answer, using dplyr tools, worked well. I found another possibility, which seems faster:
example1 <- example[!duplicated(example$var1, fromLast = TRUE), ]
distinct keeps the first row in each group; if you want to keep the last row, you can reverse the rows and then apply distinct:
library(dplyr)

example %>%
  slice(n():1) %>%
  distinct(var1, .keep_all = TRUE) %>%
  arrange(var1)
#   var1 var2 var3 var4 var5 Year
# 1    1    0    0    1    1 2002
# 2    2    0    0    1    0 2004
# 3    3    0    0    0    0 2005
# 4    4    0    1    1    1 2006
# 5    5    0    0    1    1 2007
# 6    6    0    0    0    0 2009
Alternatively, you can also use slice:
example %>% group_by(var1) %>% slice(n())
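For completeness, a data.table sketch of the same keep-last-per-group logic (assuming var1 defines the groups):
library(data.table)

setDT(example)[, .SD[.N], by = var1]  # .N indexes the last row within each group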

Find maximum in a group, subset by a subset from a different dataframe, to select other values

I have two data.frames: df1 with the raw data, and df2 with information on where to look in df1.
df1 has groups, defined by "id". In those groups, a subset is defined by df2$value_a1 and df2$value_a2, which give the range of value_a rows to look at within the group. In that subset I want to find the row with the maximum value_b, in order to select that row's other values.
Code for df1 and df2:
df1 <- data.frame(
  "id" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
           3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
  "value_a" = c(0, 10, 21, 30, 43, 53, 69, 81, 93, 5, 16, 27, 33, 45,
                61, 75, 90, 2, 11, 16, 24, 31, 40, 47, 60, 75, 88),
  "value_b" = c(100, 101, 100, 95, 90, 104, 88, 84, 75, 110, 105, 106,
                104, 95, 109, 96, 89, 104, 104, 104, 103, 106, 103,
                101, 99, 98, 97),
  "value_c" = c(0, -1, -2, -2, -2, -2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2,
                1, -1, 0, 0, 1, 1, 2, 2, 1, 1, 0),
  "value_d" = c(1:27))
df2 <- data.frame("id" = c(1, 2, 3),
                  "value_a1" = c(21, 33, 16),
                  "value_a2" = c(69, 75, 60))
This is df1:
   id value_a value_b value_c value_d
1   1       0     100       0       1
2   1      10     101      -1       2
3   1      21     100      -2       3
4   1      30      95      -2       4
5   1      43      90      -2       5
6   1      53     104      -2       6
7   1      69      88      -1       7
8   1      81      84      -1       8
9   1      93      75       0       9
10  2       5     110       0      10
11  2      16     105       0      11
12  2      27     106       0      12
13  2      33     104       1      13
14  2      45      95       1      14
15  2      61     109       2      15
16  2      75      96       2      16
17  2      90      89       1      17
18  3       2     104      -1      18
19  3      11     104       0      19
20  3      16     104       0      20
21  3      24     103       1      21
22  3      31     106       1      22
23  3      40     103       2      23
24  3      47     101       2      24
25  3      60      99       1      25
26  3      75      98       1      26
27  3      88      97       0      27
This is df2:
  id value_a1 value_a2
1  1       21       69
2  2       33       75
3  3       16       60
My desired result is df3, which would look like this:
  id value_a value_c
1  1      53      -2
2  2      61       2
3  3      31       1
I wrote this code to show my line of thinking.
df3 <- df1 %>%
  group_by(id) %>%
  filter(value_a >= df2$value_a1 & value_a <= df2$value_a2) %>%
  filter(value_a == max(value_a)) %>%
  pull(value_b)
This, however, generates a vector with three entries:
[1] 88 95 99
These are not the maximum value_b's...
Perhaps by() would work, but I get stuck applying a function across two different data frames.
It feels like I'm almost there, but still far away...
You can try this. I hope this helps.
library(dplyr)

df1 %>%
  left_join(df2) %>%
  mutate(val = ifelse(value_a > value_a1 & value_a < value_a2, value_b, NA)) %>%
  group_by(id) %>%
  summarise(val = max(val, na.rm = TRUE))
# A tibble: 3 x 2
     id   val
  <dbl> <dbl>
1     1   104
2     2   109
3     3   106
Why don't you try a merge?
Then with data.table syntax:
library(data.table)

df3 <- merge(as.data.table(df1), as.data.table(df2), by = "id", all.x = TRUE)
max_values <- df3[value_a > value_a1 & value_a < value_a2, max(value_b), by = "id"]
max_values
#    id  V1
# 1:  1 104
# 2:  2 109
# 3:  3 106
I would do this using the data.table package, since it's just what I'm used to:
library(data.table)

dt.1 <- as.data.table(df1)
dt.2 <- as.data.table(df2)
dt.3 <- dt.1[id %in% dt.2[, id], max(value_b), by = "id"]
setnames(dt.3, "V1", "max_value_b")
dt.3
To get the corresponding row where value_b is the maximum, there are several ways; here's one where I only modified a line of the previous code:
dt.1[id %in% dt.2[, id], .SD[which.max(value_b), .(value_a, value_b, value_c, value_d)], by = "id"]
.SD is the sub-table you already selected with by; for each id, which.max() selects the row of the local maximum value_b, and .() is an alias for list(), listing the columns you want from that sub-table.
Perhaps a more readable approach is to first select the desired rows (using .I so that global row numbers, not within-group positions, are returned):
max.b.rows <- dt.1[id %in% dt.2[, id], .I[which.max(value_b)], by = "id"][, V1]
dt.3 <- dt.1[max.b.rows, ]
BTW, the id %in% dt.2[, id] part is just there to make sure you only select maxima for the ids present in table 2.
Best
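For what it's worth, the snippets above either return only the maximum value_b or skip the range filter; a dplyr sketch that reproduces the OP's df3 exactly (assuming inclusive bounds, as in the OP's own attempt) could be:
library(dplyr)

df1 %>%
  inner_join(df2, by = "id") %>%
  filter(value_a >= value_a1, value_a <= value_a2) %>%  # keep rows inside each group's range
  group_by(id) %>%
  slice_max(value_b, n = 1) %>%                         # row with the maximum value_b per id
  ungroup() %>%
  select(id, value_a, value_c)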

Expand Data Frame

I want to expand a data frame given some conditions. It is a bit similar to the question "expand data frames inside data frame", but not quite the same.
I have a data frame:
df = data.frame(
  ID = c(3, 3, 3, 3, 17, 17, 17, 74, 74, 210, 210, 210, 210),
  amount = c(101, 135, 101, 68, 196, 65, 135, 76, 136, 15, 15, 15, 15),
  week.number = c(4, 6, 8, 10, 2, 5, 7, 2, 6, 2, 3, 5, 6))
I want to expand the data frame for each ID, given a min and max week.number, filling the amount column with 0 for the added rows. Min week.number is 1 and max week.number is 10. The expected result would be:
df1 <- data.frame(
  ID = c(rep(3, 10), rep(17, 10), rep(74, 10), rep(210, 10)),
  amount = c(0, 0, 0, 101, 0, 135, 0, 101, 0, 68,
             0, 196, 0, 0, 65, 0, 135, 0, 0, 0,
             0, 76, 0, 0, 0, 136, 0, 0, 0, 0,
             0, 15, 15, 0, 15, 15, 0, 0, 0, 0))
(In reality, I have thousands of ID and week number goes from 1 to 160).
Is there a simple, fast way to do this?
Thank you!
With data.table (thanks to Frank for correcting the length of the result):
require(data.table)

dt <- as.data.table(df)
f <- function(x, y, len = max(y)) { res <- numeric(len); res[y] <- x; res }
dt[, list(amount = f(amount, week.number, 10)), by = ID]
#      ID amount
#  1:   3      0
#  2:   3      0
#  3:   3      0
#  4:   3    101
#  5:   3      0
#  6:   3    135
#  7:   3      0
#  8:   3    101
#  9:   3      0
# 10:   3     68
# ......
Edit
I just noticed that your amount and week.number columns actually define a sparseVector, i.e. a vector made mainly of zeroes where just the indices of the non-zero elements are kept. So, you can try the Matrix package:
require(Matrix)

dt[, list(amount = as.vector(sparseVector(amount, week.number, 10))), by = ID]
to get the same result as above.
Here's how you could do it using tidyr:
library(tidyr)

complete(df, ID, week.number = 1:10, fill = list(amount = 0))
#Source: local data frame [40 x 3]
#
#      ID week.number amount
#   (dbl)       (dbl)  (dbl)
#1      3           1      0
#2      3           2      0
#3      3           3      0
#4      3           4    101
#5      3           5      0
#6      3           6    135
#7      3           7      0
#8      3           8    101
#9      3           9      0
#10     3          10     68
#..   ...         ...    ...
An approach in base R would be to use expand.grid and merge:
newdf <- merge(expand.grid(ID = unique(df$ID), week.number = 1:10), df, all.x = TRUE)
newdf$amount[is.na(newdf$amount)] <- 0  # replace NA with 0
