R: calculating time interval on condition

I would like to calculate Day.Before_nextCLS from the 3 columns below:
tibble::tribble(
  ~Day, ~CLS, ~BAL.D,
     0,    0, NA,
     3,    0, 15000,
     6,    0, 10000,
    20,    0, 2000,
    25,    0, -4771299,
    26,    0, -1615637,
    27,    0, -920917,
    31,    1, -923089,
    32,    1, -81863,
    33,    1, 19865,
    34,    1, 9865,
    37,    1, 609865
)
The desired output is the tribble below.
For Day 27, Day.Before_nextCLS is 4,
because the next CLS group (CLS == 1) starts at Day 31, and the interval between 27 and 31 is 4.
tibble::tribble(
  ~Day, ~CLS, ~BAL.D,   ~Day.Before_nextCLS,
     0,    0, NA,       31,
     3,    0, 15000,    28,
     6,    0, 10000,    25,
    20,    0, 2000,     11,
    25,    0, -4771299, 6,
    26,    0, -1615637, 5,
    27,    0, -920917,  4,
    31,    1, -923089,  NA,  # NA from here on, because there is no Day at which CLS == 2
    32,    1, -81863,   NA,
    33,    1, 19865,    NA,
    34,    1, 9865,     NA,
    37,    1, 609865    NA
)
How can I achieve this?
Thank you very much!!

We create a lead column of Day, then group by CLS and subtract Day from the last value of the lead column: within each group, the lead of the group's last row is the first Day of the next CLS group.
library(dplyr)
df1 %>%
  mutate(DayLead = lead(Day)) %>%    # next row's Day
  group_by(CLS) %>%
  # last(DayLead) is the first Day of the next CLS group (NA for the last group)
  mutate(Day.Before_nextCLS = last(DayLead) - Day, DayLead = NULL) %>%
  ungroup()
Output:
# A tibble: 12 × 4
Day CLS BAL.D Day.Before_nextCLS
<dbl> <dbl> <dbl> <dbl>
1 0 0 NA 31
2 3 0 15000 28
3 6 0 10000 25
4 20 0 2000 11
5 25 0 -4771299 6
6 26 0 -1615637 5
7 27 0 -920917 4
8 31 1 -923089 NA
9 32 1 -81863 NA
10 33 1 19865 NA
11 34 1 9865 NA
12 37 1 609865 NA
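A base R one-liner gives the same result. This is my own sketch, not part of the answer above, and it assumes CLS always increases in steps of 1, so that match(CLS + 1, CLS) finds the first row of the next group:
# For each row, take the Day at which CLS + 1 first occurs
# (NA when that group does not exist) and subtract the current Day.
df1$Day.Before_nextCLS <- df1$Day[match(df1$CLS + 1, df1$CLS)] - df1$Day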

Related

Data imputation by linear interpolation according to date in R

I have a large dataset of mineral nitrogen values from different plots, which includes some missing data where on some dates we could not take samples. It is known that mineral N values in soil change linearly between samplings.
For the sake of simplicity, I have created a data frame with 10 plots and 4 dates (with different gaps between them), with missing data on one of the dates:
df <- data.frame(plot = c(1,2,3,4,5,6,7,8,9,10),
                 date = c("2020-10-01", "2020-10-01","2020-10-01","2020-10-01","2020-10-01","2020-10-01","2020-10-01","2020-10-01","2020-10-01","2020-10-01",
                          "2020-10-08", "2020-10-08","2020-10-08","2020-10-08","2020-10-08","2020-10-08","2020-10-08","2020-10-08","2020-10-08","2020-10-08",
                          "2020-10-29","2020-10-29","2020-10-29","2020-10-29","2020-10-29","2020-10-29","2020-10-29","2020-10-29","2020-10-29","2020-10-29",
                          "2020-11-05","2020-11-05","2020-11-05","2020-11-05","2020-11-05","2020-11-05","2020-11-05","2020-11-05","2020-11-05","2020-11-05"),
                 Nmin = c(100, 120, 50, 60, 70, 80, 100, 70, 30, 50, 90, 130, 60, 60, 60, 90, 105, 60, 25, 40, NA, NA, NA, NA, NA, NA,
                          NA, NA, NA, NA, 50, 170, 100, 60, 20, 130, 125, 20, 5, 0))
df$date <- as.Date(df$date, format = "%Y-%m-%d")  # the strings are YYYY-MM-DD; format = "%d.%m.%Y" would turn them all into NA
df$Nmin <- as.numeric(df$Nmin)
Is there a function that can calculate the missing Nmin values plot-wise and take into consideration the time between samplings (date)?
Using approx.
df <- transform(df, flag = ifelse(is.na(Nmin), 1, 0))  ## flag the NA rows so the imputed values can be identified later
res <- by(df, df$plot, transform, Nmin = approx(date, Nmin, date)$y) |> unsplit(df$plot)
res
# plot date Nmin flag
# 1 1 2020-10-01 100 0
# 2 2 2020-10-01 120 0
# 3 3 2020-10-01 50 0
# 4 4 2020-10-01 60 0
# 5 5 2020-10-01 70 0
# 6 6 2020-10-01 80 0
# 7 7 2020-10-01 100 0
# 8 8 2020-10-01 70 0
# 9 9 2020-10-01 30 0
# 10 10 2020-10-01 50 0
# 11 1 2020-10-08 90 0
# 12 2 2020-10-08 130 0
# 13 3 2020-10-08 60 0
# 14 4 2020-10-08 60 0
# 15 5 2020-10-08 60 0
# 16 6 2020-10-08 90 0
# 17 7 2020-10-08 105 0
# 18 8 2020-10-08 60 0
# 19 9 2020-10-08 25 0
# 20 10 2020-10-08 40 0
# 21 1 2020-10-29 60 1
# 22 2 2020-10-29 160 1
# 23 3 2020-10-29 90 1
# 24 4 2020-10-29 60 1
# 25 5 2020-10-29 30 1
# 26 6 2020-10-29 120 1
# 27 7 2020-10-29 120 1
# 28 8 2020-10-29 30 1
# 29 9 2020-10-29 10 1
# 30 10 2020-10-29 10 1
# 31 1 2020-11-05 50 0
# 32 2 2020-11-05 170 0
# 33 3 2020-11-05 100 0
# 34 4 2020-11-05 60 0
# 35 5 2020-11-05 20 0
# 36 6 2020-11-05 130 0
# 37 7 2020-11-05 125 0
# 38 8 2020-11-05 20 0
# 39 9 2020-11-05 5 0
# 40 10 2020-11-05 0 0
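If you prefer a tidyverse pipeline, zoo::na.approx wraps the same linear interpolation. A minimal sketch, assuming the zoo package is installed:
library(dplyr)
library(zoo)

res2 <- df %>%
  group_by(plot) %>%
  mutate(Nmin = na.approx(Nmin, x = date, na.rm = FALSE)) %>%  # interpolate Nmin linearly against date
  ungroup()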
Let's take a look at the plot.
clr <- rainbow(10)
with(res, plot(Nmin ~ date, type='n'))
by(res, res$plot, with, points(jitter(Nmin) ~ date, type='b', pch=ifelse(flag == 1, 21, 16), col=clr[plot], bg='white'))
legend('topleft', legend=paste('plot', 1:10), lty=1, col=clr, ncol=4, bty='n', cex=.7)
Note: For non-linear inter/extrapolation, see this answer.
Data:
df <- structure(list(plot = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10), date = structure(c(18536, 18536, 18536,
18536, 18536, 18536, 18536, 18536, 18536, 18536, 18543, 18543,
18543, 18543, 18543, 18543, 18543, 18543, 18543, 18543, 18564,
18564, 18564, 18564, 18564, 18564, 18564, 18564, 18564, 18564,
18571, 18571, 18571, 18571, 18571, 18571, 18571, 18571, 18571,
18571), class = "Date"), Nmin = c(100, 120, 50, 60, 70, 80, 100,
70, 30, 50, 90, 130, 60, 60, 60, 90, 105, 60, 25, 40, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 50, 170, 100, 60, 20, 130, 125,
20, 5, 0), flag = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, -40L
))

Creating a column based on filtering two data frames of different lengths using R

I have two datasets of different lengths. I want to create a new column in the dataset with more rows, based on filtering a specific column from the shorter df. I am getting a warning, "longer object length is not a multiple of shorter object length", and the result is also not correct. I created smaller example datasets, tried the same code, and there it works with correct results. I am not sure why on my original data the results are not correct and I am getting the warning.
The example datasets are
structure(list(id = 1:10, activity = c(0, 0, 0, 0, 1, 0, 0, 1,
0, 0), code = c(2, 5, 11, 15, 3, 18, 21, 3, 27, 55)), class = "data.frame", row.names = c(NA,
-10L))
The second df:
structure(list(id2 = 1:20, code2 = c(2, 5, 11, 15, 9, 18, 21,
3, 27, 55, 2, 5, 11, 15, 3, 18, 21, 3, 27, 55), d_Activity = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA,
-20L))
I tried this on both my original datasets, where I get the warning, and on these dummy dfs, where there is no warning and the results are correct:
data2 <- data2 %>%
  mutate(d_Activity = ifelse(code2 %in% data1$code & activity == 1, 1, 0))
Actually, you are doing it the wrong way. Let me explain.
In the sample data it works because the larger df has 20 rows, which is a multiple of the 10 rows in the smaller df.
So what your syntax does is check one complete vector against another complete vector (a column of another df), recycling the shorter one, because R normally operates on whole vectors at once.
A reliable way of matching one to many is purrr::map, where each individual value of the first argument (code2 here) is compared with the other vector, df1$code, which is not itself an argument of map:
df1 <- structure(list(id = 1:10, activity = c(0, 0, 0, 0, 1, 0, 0, 1,
0, 0), code = c(2, 5, 11, 15, 3, 18, 21, 3, 27, 55)), class = "data.frame", row.names = c(NA,
-10L))
df2 <- structure(list(id2 = 1:20, code2 = c(2, 5, 11, 15, 9, 18, 21,
3, 27, 55, 2, 5, 11, 15, 3, 18, 21, 3, 27, 55), d_Activity = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA,
-20L))
library(tidyverse)
df2 %>%
  mutate(d_Activity = map(code2, ~ +(.x %in% df1$code[df1$activity == 1])))
#> id2 code2 d_Activity
#> 1 1 2 0
#> 2 2 5 0
#> 3 3 11 0
#> 4 4 15 0
#> 5 5 9 0
#> 6 6 18 0
#> 7 7 21 0
#> 8 8 3 1
#> 9 9 27 0
#> 10 10 55 0
#> 11 11 2 0
#> 12 12 5 0
#> 13 13 11 0
#> 14 14 15 0
#> 15 15 3 1
#> 16 16 18 0
#> 17 17 21 0
#> 18 18 3 1
#> 19 19 27 0
#> 20 20 55 0
Created on 2021-06-17 by the reprex package (v2.0.0)
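One note I'll add (not part of the original answer): %in% is itself vectorized over its left-hand side, so the same result is available without map; the OP's actual bug was referencing activity, a column of the other data frame, inside mutate(). A sketch:
df2 %>%
  mutate(d_Activity = +(code2 %in% df1$code[df1$activity == 1]))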

How to create two columns that count the total number of two conditions

I have a diabetes dataset that has a column called Outcome with only two values, 1 = Diabetes, 0 = Non-Diabetes. I want to count the total number of 1's and 0's for each age and then get the ratio of 1's to 0's for each age.
I have this code below:
by_age1 <-
  diabetes.df %>%
  select(Age, Outcome) %>%
  group_by(Age, Outcome) %>%
  summarize(Diabetes_Count = n()) %>%
  filter(Outcome == "1" | Outcome == "0")
This code generates this table:
Age | Outcome | Count
 21 |       0 |    58
 21 |       1 |     5
and so on.
I want the table to look like this, though:
Age | Count_Outcome=1 | Count_Outcome=0
 21 |               5 |              58
 22 |              11 |              61
So I can eventually get to this:
Age | Count_Outcome=1 | Count_Outcome=0 | Count_Outcome=1/Count_Outcome=0
 21 |               5 |              58 |                           0.086
 22 |              11 |              61 |                           0.180
Here is the dataset
Rows: 768
Columns: 23
$ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1, 3, 8, 7, 9, 11, 10, 7, 1, 13, 5, 5, 3, ...
$ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139, 189, 166, 100, 118, 107, 103, 115, 126, ...
$ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0, 84, 74, 30, 70, 88, 84, 90, 80, 94, 70, ...
$ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0, 38, 30, 41, 0, 0, 35, 33, 26, 0, 15, 19...
$ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230, 0, 83, 96, 235, 0, 0, 0, 146, 115, 0, 1...
$ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37.6, 38.0, 27.1, 30.1, 25.8, 30.0, 45.8, 2...
$ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158, 0.232, 0.191, 0.537, 1.441, 0.398, 0.58...
$ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 32, 31, 31, 33, 32, 27, 50, 41, 29, 51, 41...
$ Outcome <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, ...
$ Skin.log <dbl> 3.555634, 3.367641, -4.605170, 3.135929, 3.555634, -4.605170, 3.466048, -4.605170, 3.806885, -4.605170...
$ Insulin.log <dbl> -2.302585, -2.302585, -2.302585, 4.544358, 5.124559, -2.302585, 4.478473, -2.302585, 6.297293, -2.3025...
$ DPF.log <dbl> -0.46680874, -1.04696906, -0.39749694, -1.78976147, 0.82767807, -1.60445037, -1.39432653, -2.00991548,...
$ Preg.log <dbl> 1.793424749, 0.009950331, 2.080690761, 0.009950331, -4.605170186, 1.611435915, 1.101940079, 2.30358459...
$ Age.log <dbl> 3.912023, 3.433987, 3.465736, 3.044522, 3.496508, 3.401197, 3.258097, 3.367296, 3.970292, 3.988984, 3....
$ G <dbl> 0.84777132, -1.12266474, 1.94245802, -0.99755769, 0.50372693, -0.15308509, -1.34160209, -0.18436186, 2...
$ BP <dbl> 0.14954330, -0.16044119, -0.26376935, -0.16044119, -1.50370731, 0.25287146, -0.98706650, -3.57027057, ...
$ S <dbl> 0.7143403, 0.6624894, -1.5365134, 0.5985804, 0.7143403, -1.5365134, 0.6896315, -1.5365134, 0.7836385, ...
$ I <dbl> -1.0157459, -1.0157459, -1.0157459, 0.8904827, 1.0520140, -1.0157459, 0.8721398, -1.0157459, 1.3785101...
$ D <dbl> 0.76534970, -0.13507072, 0.87292300, -1.28789940, 2.77441913, -1.00029287, -0.67417647, -1.62958283, -...
$ BM <dbl> 0.20387991, -0.68397621, -1.10253696, -0.49372133, 1.40882750, -0.81081280, -0.12589522, 0.41950211, -...
$ P <dbl> 0.6504082, -0.1684863, 0.7823084, -0.1684863, -2.2875506, 0.5668468, 0.3329083, 0.8846516, 0.1474983, ...
$ A <dbl> 1.43544387, -0.04590939, 0.05247453, -1.25279578, 0.14783077, -0.14751959, -0.59096525, -0.25257485, 1...
$ Segment <int> 4, 3, 2, 3, 5, 2, 3, 1, 4, 2, 2, 2, 2, 4, 4, 1, 5, 2, 3, 3, 4, 2, 2, 3, 4, 4, 2, 3, 4, 2, 4, 4, 3, 2, ...
Random data:
r <- function(x) {rnorm(x, 50, 2)}
set.seed(123)
diabetes.df <- data.frame(Age = round(r(10)), Outcome = as.character((r(10) < 50)*1))
> diabetes.df
Age Outcome
1 49 0
2 50 0
3 53 0
4 50 0
5 50 1
6 53 0
7 51 0
8 47 1
9 49 0
10 49 1
Then pivot_wider() will do what you want:
df <- diabetes.df %>%
  select(Age, Outcome) %>%
  group_by(Age, Outcome) %>%
  dplyr::summarize(Diabetes_Count = n()) %>%
  filter(Outcome == "1" | Outcome == "0")
df <- pivot_wider(df, names_from = "Outcome", values_from = "Diabetes_Count",
                  names_prefix = "Outcome_", values_fill = 0)
> df
# A tibble: 5 x 3
# Groups: Age [5]
Age Outcome_1 Outcome_0
<dbl> <int> <int>
1 47 1 0
2 49 1 2
3 50 1 2
4 51 0 1
5 53 0 2
> df %>% mutate(`Outcome_1/Outcome_0` = Outcome_1 / Outcome_0)
# A tibble: 5 x 4
# Groups: Age [5]
Age Outcome_1 Outcome_0 `Outcome_1/Outcome_0`
<dbl> <int> <int> <dbl>
1 47 1 0 Inf
2 49 1 2 0.5
3 50 1 2 0.5
4 51 0 1 0
5 53 0 2 0
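The count() helper collapses the group_by()/summarize() pair, so the whole thing can be written more compactly. A sketch, assuming the same diabetes.df (values_fill = 0 again avoids NA counts):
library(dplyr)
library(tidyr)

diabetes.df %>%
  count(Age, Outcome) %>%                # one row per Age/Outcome pair, with count column n
  pivot_wider(names_from = Outcome, values_from = n,
              names_prefix = "Outcome_", values_fill = 0) %>%
  mutate(`Outcome_1/Outcome_0` = Outcome_1 / Outcome_0)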

R: automate table for results of several multivariable logistic regressions

structure(list(Number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15), age = c(25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39), sex = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0), bmi = c(35, 32, 29, 26, 23, 20, 17, 35, 32, 29,
26, 23, 20, 17, 21), Phenotype1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1), `Phenotype 2` = c(0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 1, 1, 1), `Phenotype 3` = c(1, 0, 1, 0, 1, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0), `Phenotype 4` = c(0, 0, 0, 0, 1, 1,
0, 1, 0, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
# A tibble: 15 x 8
Number age sex bmi Phenotype1 `Phenotype 2` `Phenotype 3` `Phenotype 4`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 25 0 35 0 0 1 0
2 2 26 1 32 0 1 0 0
3 3 27 0 29 0 0 1 0
4 4 28 1 26 0 1 0 0
5 5 29 0 23 0 0 1 1
6 6 30 1 20 0 1 1 1
7 7 31 0 17 0 0 1 0
8 8 32 1 35 0 1 1 1
9 9 33 0 32 0 0 1 0
10 10 34 1 29 0 1 1 1
11 11 35 0 26 0 0 1 1
12 12 36 1 23 0 1 0 1
13 13 37 0 20 1 1 0 1
14 14 38 1 17 1 1 0 1
15 15 39 0 21 1 1 0 1
Hi all,
I have a dataset of 100 patients (15 are shown here), 3 covariates, and 50 phenotypes (4 are shown here).
I want to perform a multivariable logistic regression for each phenotype using age, sex, and BMI as covariates.
I would like to get a table with the p-value, OR, and confidence interval (CI) for each of the covariates.
I just don't know how to start.
Thank you very much for your help!
Best,
Caro
I wrote a function that should accomplish what you need. There are likely more elegant and more R-like ways of doing this, but this approach worked in my testing:
## Load libraries
library(broom)
library(tidyr)
library(dplyr)
## Define a function to create your summary table
summary_table <- function(x) {
  # Capture number of columns passed to the function
  num_vars <- ncol(x)
  # Pre-define lists that will be populated and then collapsed by the rest of the function
  models <- vector("list", length = num_vars)
  first_tables <- vector("list", length = num_vars)
  second_tables <- vector("list", length = num_vars)
  # Loop to create each row of the final table
  for (i in 1:num_vars) {
    models[[i]] <- glm(x[[i]] ~ age + sex + bmi, family = "binomial", data = df)
    first_tables[[i]] <- broom::tidy(models[[i]])
    first_tables[[i]]$OR <- exp(first_tables[[i]]$estimate)
    first_tables[[i]]$CI1 <- exp(first_tables[[i]]$estimate - (1.96 * first_tables[[i]]$std.error))
    first_tables[[i]]$CI2 <- exp(first_tables[[i]]$estimate + (1.96 * first_tables[[i]]$std.error))
    first_tables[[i]] <- as.data.frame(first_tables[[i]][first_tables[[i]]$term != "(Intercept)",
                                                         c("term", "p.value", "OR", "CI1", "CI2")])[1:3, ]
    second_tables[[i]] <- first_tables[[i]] %>%
      pivot_wider(names_from = term, values_from = c("p.value", "OR", "CI1", "CI2")) %>%
      select("p.value_age", "OR_age", "CI1_age", "CI2_age",
             "p.value_bmi", "OR_bmi", "CI1_bmi", "CI2_bmi",
             "p.value_sex", "OR_sex", "CI1_sex", "CI2_sex")
  }
  # Combine the rows into the final table
  final_table <- do.call("rbind", second_tables)
  final_table <- round(final_table, 3)
  row.names(final_table) <- paste0("Phenotype", 1:num_vars)
  return(final_table)
}
## Let "df" be your data.frame with 100 rows and 54 columns
## Use the summary_table() function, passing in the 50 columns containing your Phenotype outcome vars (I assumed they're in columns 5:54)
final_table <- summary_table(df[5:54])
## Write the final table to your working directory as a CSV
write.csv(final_table, "final_table.csv")
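For reference, broom can compute the CIs directly. A shorter sketch of the same idea, assuming as above that the phenotype columns of df are 5:54; note that tidy(conf.int = TRUE) uses profile-likelihood CIs rather than the Wald CIs built above, so the numbers will differ slightly:
library(broom)
library(purrr)

pheno_cols <- names(df)[5:54]  # assumed positions of the phenotype columns
map_dfr(pheno_cols, function(ph) {
  # Backticks let the formula handle column names containing spaces, e.g. `Phenotype 2`
  fit <- glm(as.formula(paste0("`", ph, "` ~ age + sex + bmi")),
             family = binomial, data = df)
  res <- tidy(fit, exponentiate = TRUE, conf.int = TRUE)  # estimate column is then the OR
  res$phenotype <- ph
  res[res$term != "(Intercept)", c("phenotype", "term", "p.value", "estimate", "conf.low", "conf.high")]
})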

Find maximum in a group, subset by a subset from a different dataframe, to select other values

I have two data.frames: df1 with raw data, and df2 with information on where to look in df1.
df1 has groups, defined by "id". Within each group, a subset is defined by df2$value_a1 and df2$value_a2, which give the range of value_a to look at in that group. In that subset I want to find the row with the maximum value_b, in order to select its value_a and value_c.
Code for df1 and df2:
df1 <- data.frame("id" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
                  "value_a" = c(0, 10, 21, 30, 43, 53, 69, 81, 93, 5, 16, 27, 33, 45, 61, 75, 90, 2, 11, 16, 24, 31, 40, 47, 60, 75, 88),
                  "value_b" = c(100, 101, 100, 95, 90, 104, 88, 84, 75, 110, 105, 106, 104, 95, 109, 96, 89, 104, 104, 104, 103, 106, 103, 101, 99, 98, 97),
                  "value_c" = c(0, -1, -2, -2, -2, -2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2, 1, -1, 0, 0, 1, 1, 2, 2, 1, 1, 0),
                  "value_d" = 1:27)
df2 <- data.frame("id" = c(1, 2, 3), "value_a1" = c(21, 33, 16), "value_a2" = c(69, 75, 60))
This is df1
id value_a value_b value_c value_d
1 1 0 100 0 1
2 1 10 101 -1 2
3 1 21 100 -2 3
4 1 30 95 -2 4
5 1 43 90 -2 5
6 1 53 104 -2 6
7 1 69 88 -1 7
8 1 81 84 -1 8
9 1 93 75 0 9
10 2 5 110 0 10
11 2 16 105 0 11
12 2 27 106 0 12
13 2 33 104 1 13
14 2 45 95 1 14
15 2 61 109 2 15
16 2 75 96 2 16
17 2 90 89 1 17
18 3 2 104 -1 18
19 3 11 104 0 19
20 3 16 104 0 20
21 3 24 103 1 21
22 3 31 106 1 22
23 3 40 103 2 23
24 3 47 101 2 24
25 3 60 99 1 25
26 3 75 98 1 26
27 3 88 97 0 27
This is df2
id value_a1 value_a2
1 1 21 69
2 2 33 75
3 3 16 60
My desired result is df3, which would look like this:
id value_a value_c
1 1 53 -2
2 2 61 2
3 3 31 1
I wrote this code to show my line of thinking.
df3 <- df1 %>%
  group_by(id) %>%
  filter(value_a >= df2$value_a1 & value_a <= df2$value_a2) %>%
  filter(value_a == max(value_a)) %>%
  pull(value_b)
This, however, generates a vector with three entries:
[1] 88 95 99
These are not the maximum value_b's...
Perhaps by() would work, but this gets stuck on using a function on two different df's.
It feels like I'm almost there, but still far away...
You can try this. I hope this helps.
df1 %>%
  left_join(df2) %>%
  mutate(val = ifelse(value_a > value_a1 & value_a < value_a2, value_b, NA)) %>%
  group_by(id) %>%
  summarise(val = max(val, na.rm = TRUE))
# A tibble: 3 x 2
id val
<dbl> <dbl>
1 1 104
2 2 109
3 3 106
Why don't you try a merge?
Then with data.table syntax:
library(data.table)
df3 <- merge(df1, df2, by = "id", all.x = TRUE)
setDT(df3)  # merging two data.frames returns a data.frame; convert so the data.table syntax below works
max_values <- df3[value_a > value_a1 & value_a < value_a2, max(value_b), by = "id"]
max_values
# id V1
# 1: 1 104
# 2: 2 109
# 3: 3 106
I would do this using the data.table package, since it's just what I'm used to.
library(data.table)
dt.1 <- data.table("id" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
                   "value_a" = c(0, 10, 21, 30, 43, 53, 69, 81, 93, 5, 16, 27, 33, 45, 61, 75, 90, 2, 11, 16, 24, 31, 40, 47, 60, 75, 88),
                   "value_b" = c(100, 101, 100, 95, 90, 104, 88, 84, 75, 110, 105, 106, 104, 95, 109, 96, 89, 104, 104, 104, 103, 106, 103, 101, 99, 98, 97),
                   "value_c" = c(0, -1, -2, -2, -2, -2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2, 1, -1, 0, 0, 1, 1, 2, 2, 1, 1, 0),
                   "value_d" = 1:27)
dt.2 <- data.table("id" = c(1, 2, 3), "value_a1" = c(21, 33, 16), "value_a2" = c(69, 75, 60))
dt.3 <- dt.1[id %in% dt.2[, id], max(value_b), by = "id"]
setnames(dt.3, "V1", "max_value_b")
dt.3
To get the corresponding row where value_b is the maximum, there are several ways; here's one where I only modified a line of the previous code:
dt.1[id %in% dt.2[, id], .SD[which.max(value_b), .(value_a, value_b, value_c, value_d)], by = "id"]
.SD is the sub-table you already selected with by, so for each id it takes the row with the local maximum value_b: which.max() selects the row, and .() (an alias for list) lists the columns you want from that sub-table.
Perhaps a more readable approach is to first select the desired rows
# .I gives the row numbers in dt.1 itself; which.max() alone returns the
# position within each group, which would be wrong as a global row index
max.b.rows <- dt.1[id %in% dt.2[, id], .I[which.max(value_b)], by = "id"][, V1]
dt.3 <- dt.1[max.b.rows, ]
BTW, the id %in% dt.2[,id] part is just there to make sure you only select maxima for the ids present in table 2.
Best
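For completeness, here is a sketch of a pipeline that reproduces the desired df3 exactly (id, value_a, and value_c of the row with the maximum value_b inside each id's range). This is my addition rather than any answerer's code; it assumes the bounds are inclusive and uses dplyr's slice_max():
library(dplyr)

df3 <- df1 %>%
  left_join(df2, by = "id") %>%
  filter(value_a >= value_a1, value_a <= value_a2) %>%  # keep rows inside each id's range
  group_by(id) %>%
  slice_max(value_b, n = 1, with_ties = FALSE) %>%      # row with the largest value_b per id
  ungroup() %>%
  select(id, value_a, value_c)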
