I have the following two datasets:
df1 <- data.frame(
"group" = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
"numbers" = c(55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60))
df2 <- data.frame(
"group" = c(1, 1, 2, 2, 2, 3, 3, 4, 5),
"P1" = c(55, NA, 60, 55, 75, 75, 55, 55, 60),
"P2" = c(55, 75, 55, 60, NA, 75, 55, NA, 60),
"P3" = c(75, 55, 60, 75, NA, 75, 60, 55, 60))
In df1 each group has the same three numbers (in reality there are around 500 numbers).
I want to check whether the values in the column "numbers" in df1 are contained in the columns P1, P2, and P3 of df2. There are two problems I am stuck on. First, the values in the numbers column of df1 can occur in different groups in df2 (defined by the group column in df1 and df2). Second, the datasets have different lengths. Is there a way to merge both datasets to obtain the following dataset:
df3 <- data.frame(
"group" = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
"numbers" = c(55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60),
"P1new" = c(1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1),
"P2new" = c(1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1),
"P3new" = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1))
where P1new (P2new and P3new respectively) contain the value 1 if df2$P1 contains the value in df1$numbers within the correct group (as I said numbers can reoccur in different groups). For example, P3 has the value 75 in group 1 but not in group 5. So in group 1 P3new would have a 1 and in group 5 P3new would have a 0.
This question is similar to Find matching values in different datasets by groups in R
but I could not adapt the code according to my objectives. So, I would really appreciate any help.
Interesting question. Here's a way with dplyr functions:
library(dplyr)
df2 %>%
group_by(group) %>%
summarise(across(P1:P3, ~ list(unique(na.omit(.x))))) %>%
inner_join(df1, .) %>%
rowwise() %>%
mutate(across(P1:P3, ~ +(numbers %in% .x)))
group numbers P1 P2 P3
<dbl> <dbl> <int> <int> <int>
1 1 55 1 1 1
2 1 75 0 1 1
3 1 60 0 0 0
4 2 55 1 1 0
5 2 75 1 0 1
6 2 60 1 1 1
7 3 55 1 1 0
8 3 75 1 1 1
9 3 60 0 0 1
10 4 55 1 0 1
11 4 75 0 0 0
12 4 60 0 0 0
13 5 55 0 0 0
14 5 75 0 0 0
15 5 60 1 1 1
Another possible solution:
library(tidyverse)
map_dfc(names(df2[-1]),
~ df1 %>%
group_by(group) %>%
mutate(!!.x := +(numbers %in% df2[df2$group == cur_group()$group, .x])) %>%
ungroup %>%
select(all_of(.x))) %>%
bind_cols(df1, .)
#> group numbers P1 P2 P3
#> 1 1 55 1 1 1
#> 2 1 75 0 1 1
#> 3 1 60 0 0 0
#> 4 2 55 1 1 0
#> 5 2 75 1 0 1
#> 6 2 60 1 1 1
#> 7 3 55 1 1 0
#> 8 3 75 1 1 1
#> 9 3 60 0 0 1
#> 10 4 55 1 0 1
#> 11 4 75 0 0 0
#> 12 4 60 0 0 0
#> 13 5 55 0 0 0
#> 14 5 75 0 0 0
#> 15 5 60 1 1 1
Or, without purrr, another possibility:
library(dplyr)
df1 %>%
inner_join(df2) %>%
group_by(group) %>%
mutate(across(starts_with("P"), ~ +(numbers %in% .x))) %>%
ungroup %>%
distinct
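For readers without the tidyverse, the same per-group membership test can be sketched in base R. This is only an illustration on the question's data: the helper name in_group is made up here, and df1 is rebuilt with rep() for brevity.

```r
df1 <- data.frame(
  group   = rep(1:5, each = 3),
  numbers = rep(c(55, 75, 60), times = 5))
df2 <- data.frame(
  group = c(1, 1, 2, 2, 2, 3, 3, 4, 5),
  P1 = c(55, NA, 60, 55, 75, 75, 55, 55, 60),
  P2 = c(55, 75, 55, 60, NA, 75, 55, NA, 60),
  P3 = c(75, 55, 60, 75, NA, 75, 60, 55, 60))

# Hypothetical helper: for one column of df2, flag each row of df1 whose
# number occurs in that column within the same group (NA never matches).
in_group <- function(col) {
  mapply(function(g, n) as.integer(n %in% df2[df2$group == g, col]),
         df1$group, df1$numbers)
}

# One 0/1 indicator column per P column, bound onto df1
res <- cbind(df1, sapply(c("P1", "P2", "P3"), in_group))
res
```

Because the lookup uses the group value itself, this does not depend on the groups being numbered 1 to n.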
I'm struggling to find a way to check whether the value of one variable for each case matches the value of a different variable within groups, without using for() loops. I'd be thankful if anyone could help me. This is a simplified form of my data:
library(data.table)
df <- data.table(personid = c(101, 102, 103, 104, 105, 201, 202, 203, 301, 302, 401),
hh_id = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
fatherid = c(NA, NA, 101, 101, 101, NA, NA, 201, NA, NA, NA))
(the real one has 185000 rows, and fatherid is not the only variable that I need to check)
What I'm trying to do is to create a variable that checks whether the value of variable personid of a given row is the same as the value of fatherid within the members of the same group (the grouping variable is hh_id). For the given data, the outcome should be:
df$result <- c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
but I need something to do it automatically. Any ideas?
In case you want to understand the purpose of my calculations: I'm trying to build household grids, that is, whether each observation is the father of at least one other observation within the same household (hh_id).
Thank you very much in advance!
It seems you are looking for this:
df[, result2 := personid %in% fatherid, by = hh_id]
# personid hh_id fatherid result result2
# 1: 101 1 NA 1 TRUE
# 2: 102 1 NA 0 FALSE
# 3: 103 1 101 0 FALSE
# 4: 104 1 101 0 FALSE
# 5: 105 1 101 0 FALSE
# 6: 201 2 NA 1 TRUE
# 7: 202 2 NA 0 FALSE
# 8: 203 2 201 0 FALSE
# 9: 301 3 NA 0 FALSE
# 10: 302 3 NA 0 FALSE
# 11: 401 4 NA 0 FALSE
Where
df <- data.table(
personid = c(101, 102, 103, 104, 105, 201, 202, 203, 301, 302, 401),
hh_id = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
fatherid = c(NA, NA, 101, 101, 101, NA, NA, 201, NA, NA, NA),
result = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
)
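Since the question mentions that fatherid is not the only column to check, here is a hedged base R sketch of the same grouped membership test wrapped in a reusable helper. The function name is_parent_in_household and the output column is_father are invented for illustration; the trick is to compare combined household-plus-id keys instead of grouping explicitly.

```r
df <- data.frame(
  personid = c(101, 102, 103, 104, 105, 201, 202, 203, 301, 302, 401),
  hh_id    = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
  fatherid = c(NA, NA, 101, 101, 101, NA, NA, 201, NA, NA, NA))

# Hypothetical helper: TRUE when a person's id appears in `parent_col`
# for some member of the same household.
is_parent_in_household <- function(data, parent_col) {
  own_key    <- paste(data$hh_id, data$personid)       # e.g. "1 101"
  parent_key <- paste(data$hh_id, data[[parent_col]])  # e.g. "1 101" or "1 NA"
  own_key %in% parent_key
}

df$is_father <- as.integer(is_parent_in_household(df, "fatherid"))
df
```

The same call can be repeated with any other id column by changing the second argument.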
structure(list(Number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15), age = c(25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39), sex = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0), bmi = c(35, 32, 29, 26, 23, 20, 17, 35, 32, 29,
26, 23, 20, 17, 21), Phenotype1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1), `Phenotype 2` = c(0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 1, 1, 1), `Phenotype 3` = c(1, 0, 1, 0, 1, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0), `Phenotype 4` = c(0, 0, 0, 0, 1, 1,
0, 1, 0, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
# A tibble: 15 x 8
Number age sex bmi Phenotype1 `Phenotype 2` `Phenotype 3` `Phenotype 4`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 25 0 35 0 0 1 0
2 2 26 1 32 0 1 0 0
3 3 27 0 29 0 0 1 0
4 4 28 1 26 0 1 0 0
5 5 29 0 23 0 0 1 1
6 6 30 1 20 0 1 1 1
7 7 31 0 17 0 0 1 0
8 8 32 1 35 0 1 1 1
9 9 33 0 32 0 0 1 0
10 10 34 1 29 0 1 1 1
11 11 35 0 26 0 0 1 1
12 12 36 1 23 0 1 0 1
13 13 37 0 20 1 1 0 1
14 14 38 1 17 1 1 0 1
15 15 39 0 21 1 1 0 1
Hi all,
I have a dataset of 100 patients (15 are shown here), 3 covariates and 50 phenotypes (4 are shown here).
I want to perform a multivariable logistic regression for each phenotype using age, sex and BMI as covariates.
I would like to get a table like this, where I have the p-value, OR and confidence interval (CI) for each of the covariates.
I just don't know how to start.
Thank you very much for your help!
Best,
Caro
I wrote a function that should accomplish what you need. There are likely more elegant and more R-like ways of doing this, but this approach worked in my testing:
## Load libraries
library(broom)
library(tidyr)
library(dplyr)
## Define a function to create your summary table
summary_table <- function(x) {
# Capture number of columns passed to the function
num_vars <- ncol(x)
# Pre-define lists that will be populated and then collapsed by rest of function
models <- vector("list", length = num_vars)
first_tables <- vector("list", length = num_vars)
second_tables <- vector("list", length = num_vars)
# Loop to create each row for the final table
for (i in 1:num_vars) {
models[[i]] <- glm(x[[i]] ~ age + sex + bmi, family = "binomial", data = df)
first_tables[[i]] <- broom::tidy(models[[i]])
first_tables[[i]]$OR <- exp(first_tables[[i]]$estimate)
first_tables[[i]]$CI1 <- exp(first_tables[[i]]$estimate - (1.96 * first_tables[[i]]$std.error))
first_tables[[i]]$CI2 <- exp(first_tables[[i]]$estimate + (1.96 * first_tables[[i]]$std.error))
first_tables[[i]] <- as.data.frame(first_tables[[i]][first_tables[[i]]$term != "(Intercept)", c("term", "p.value", "OR", "CI1", "CI2")])[1:3,]
second_tables[[i]] <- first_tables[[i]] %>%
pivot_wider(names_from = term, values_from = c("p.value", "OR", "CI1", "CI2")) %>%
select("p.value_age", "OR_age", "CI1_age", "CI2_age", "p.value_bmi", "OR_bmi", "CI1_bmi", "CI2_bmi",
"p.value_sex", "OR_sex", "CI1_sex", "CI2_sex")
}
# Combine the rows together into a final table
final_table <- do.call("rbind", second_tables)
# Convert to a plain data.frame so row names can be set without a tibble warning
final_table <- as.data.frame(round(final_table, 3))
row.names(final_table) <- paste0("Phenotype", 1:num_vars)
return(final_table)
}
## Let "df" be your data.frame with 100 rows and 54 columns
## Use the summary_table() function, passing in the 50 columns containing your Phenotype outcome vars (I assumed they're in columns 5:54)
final_table <- summary_table(df[5:54])
## Write the final table to your working directory as a CSV
write.csv(final_table, "final_table.csv")
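A more compact variant of the same idea, sketched in base R only. The data below are simulated purely for illustration (they are not the poster's data), the names fit_one, dat and results are made up, and the intervals are Wald intervals (estimate ± 1.96 × standard error), as in the function above. It assumes the phenotype column names are syntactic R names.

```r
set.seed(1)
n <- 100
# Simulated stand-in data: three covariates and two binary phenotypes
dat <- data.frame(
  age = round(runif(n, 25, 60)),
  sex = rbinom(n, 1, 0.5),
  bmi = round(rnorm(n, 26, 4), 1),
  Phenotype1 = rbinom(n, 1, 0.4),
  Phenotype2 = rbinom(n, 1, 0.6))

# Fit one logistic regression and return one row per covariate
fit_one <- function(outcome, data) {
  fit <- glm(reformulate(c("age", "sex", "bmi"), response = outcome),
             family = binomial, data = data)
  est <- coef(summary(fit))[c("age", "sex", "bmi"), , drop = FALSE]
  data.frame(
    phenotype = outcome,
    term      = rownames(est),
    OR        = exp(est[, "Estimate"]),
    CI_low    = exp(est[, "Estimate"] - 1.96 * est[, "Std. Error"]),
    CI_high   = exp(est[, "Estimate"] + 1.96 * est[, "Std. Error"]),
    p_value   = est[, "Pr(>|z|)"],
    row.names = NULL)
}

# Long table: one row per phenotype and covariate
results <- do.call(rbind, lapply(c("Phenotype1", "Phenotype2"), fit_one, data = dat))
results
```

The long layout (one row per phenotype and covariate) can be reshaped to one row per phenotype afterwards if a wide table is preferred.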
I have two data.frames: df1 contains the raw data, and df2 has information on where to look in df1.
df1 has groups, defined by "id". Within each group, a subset is defined by df2$value_a1 and df2$value_a2, which give the range of value_a to look at. In that subset I want to find the maximum value_b and select the corresponding value_a and value_c.
code for df1 and df2
df1 <- data.frame("id" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), "value_a" = c(0, 10, 21, 30, 43, 53, 69, 81, 93, 5, 16, 27, 33, 45, 61, 75, 90, 2, 11, 16, 24, 31, 40, 47, 60, 75, 88), "value_b" = c(100, 101, 100, 95, 90, 104, 88, 84, 75, 110, 105, 106, 104, 95, 109, 96, 89, 104, 104, 104, 103, 106, 103, 101, 99, 98, 97), "value_c" = c(0, -1, -2, -2, -2, -2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2, 1, -1, 0, 0, 1, 1, 2, 2, 1, 1, 0), "value_d" = c(1:27))
df2 <- data.frame("id" = c(1, 2, 3), "value_a1" = c(21, 33, 16), "value_a2" = c(69, 75, 60))
This is df1
id value_a value_b value_c value_d
1 1 0 100 0 1
2 1 10 101 -1 2
3 1 21 100 -2 3
4 1 30 95 -2 4
5 1 43 90 -2 5
6 1 53 104 -2 6
7 1 69 88 -1 7
8 1 81 84 -1 8
9 1 93 75 0 9
10 2 5 110 0 10
11 2 16 105 0 11
12 2 27 106 0 12
13 2 33 104 1 13
14 2 45 95 1 14
15 2 61 109 2 15
16 2 75 96 2 16
17 2 90 89 1 17
18 3 2 104 -1 18
19 3 11 104 0 19
20 3 16 104 0 20
21 3 24 103 1 21
22 3 31 106 1 22
23 3 40 103 2 23
24 3 47 101 2 24
25 3 60 99 1 25
26 3 75 98 1 26
27 3 88 97 0 27
This is df2
id value_a1 value_a2
1 1 21 69
2 2 33 75
3 3 16 60
My result would be df3, which would look like this
id value_a value_c
1 1 53 -2
2 2 61 2
3 3 31 1
I wrote this code to show my line of thinking.
df3 <- df1 %>%
group_by(id) %>%
filter(value_a >= df2$value_a1 & value_a <= df2$value_a2) %>%
filter(value_a == max(value_a)) %>%
pull(value_b)
This, however, generates a vector with three entries:
[1] 88 95 99
These are not the maximum value_b's...
Perhaps by() would work, but this gets stuck on using a function on two different df's.
It feels like I'm almost there, but still far away...
You can try this. I hope this helps.
df1 %>% left_join(df2) %>% mutate(val=ifelse(value_a>value_a1 & value_a<value_a2,value_b,NA)) %>%
group_by(id) %>% summarise(val=max(val,na.rm=T))
# A tibble: 3 x 2
id val
<dbl> <dbl>
1 1 104
2 2 109
3 3 106
Why don't you try a merge?
Then with data.table syntax:
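library(data.table)
# merge() on data.frames returns a data.frame, so convert before using data.table syntax
df3 <- as.data.table(merge(df1, df2, by = "id", all.x = TRUE))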
max_values <- df3[value_a > value_a1 & value_a < value_a2, max(value_b), by = "id"]
max_values
# id V1
# 1: 1 104
# 2: 2 109
# 3: 3 106
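The expected output in the question lists value_a and value_c for the row that holds the largest value_b inside each range, not just the maximum itself. A minimal base R sketch of that, assuming inclusive bounds as in the question's own filter (the intermediate names m and in_range are made up here):

```r
df1 <- data.frame(
  id = rep(1:3, times = c(9, 8, 10)),
  value_a = c(0, 10, 21, 30, 43, 53, 69, 81, 93, 5, 16, 27, 33, 45, 61, 75, 90,
              2, 11, 16, 24, 31, 40, 47, 60, 75, 88),
  value_b = c(100, 101, 100, 95, 90, 104, 88, 84, 75, 110, 105, 106, 104, 95,
              109, 96, 89, 104, 104, 104, 103, 106, 103, 101, 99, 98, 97),
  value_c = c(0, -1, -2, -2, -2, -2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2, 1,
              -1, 0, 0, 1, 1, 2, 2, 1, 1, 0))
df2 <- data.frame(id = c(1, 2, 3), value_a1 = c(21, 33, 16), value_a2 = c(69, 75, 60))

# Attach each id's bounds to its rows, then keep only rows inside the range
m <- merge(df1, df2, by = "id")
in_range <- m[m$value_a >= m$value_a1 & m$value_a <= m$value_a2, ]

# Per id, pick the row with the largest value_b
df3 <- do.call(rbind, lapply(split(in_range, in_range$id), function(g) {
  g[which.max(g$value_b), c("id", "value_a", "value_b", "value_c")]
}))
rownames(df3) <- NULL
df3
```

Note that which.max() returns the first row when several rows tie for the maximum.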
I would do this using the data.table package, since it is what I'm used to:
library(data.table)
dt.1 <- data.table("id" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), "value_a" = c(0, 10, 21, 30, 43, 53, 69, 81, 93, 5, 16, 27, 33, 45, 61, 75, 90, 2, 11, 16, 24, 31, 40, 47, 60, 75, 88), "value_b" = c(100, 101, 100, 95, 90, 104, 88, 84, 75, 110, 105, 106, 104, 95, 109, 96, 89, 104, 104, 104, 103, 106, 103, 101, 99, 98, 97), "value_c" = c(0, -1, -2, -2, -2, -2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2, 1, -1, 0, 0, 1, 1, 2, 2, 1, 1, 0), "value_d" = c(1:27))
dt.2 <- data.table("id" = c(1, 2, 3), "value_a1" = c(21, 33, 16), "value_a2" = c(69, 75, 60))
dt.3 <- dt.1[id %in% dt.2[,id],max(value_b), by="id"]
setnames(dt.3, "V1", "max_value_b")
dt.3
To get the corresponding row where value_b is at its maximum there are several ways. Here's one where I only modified a line from the previous code:
dt.1[id %in% dt.2[,id],.SD[which.max(value_b), .(value_a, value_b, value_c, value_d)], by="id"]
.SD is the sub-table selected by by, so the expression runs once per id. which.max() picks the row holding the local maximum of value_b, and .() is an alias for list(), so it lists the columns you want from that row.
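Perhaps a more readable approach is to first select the desired rows:
# .I holds row numbers in dt.1; a bare which.max() would only give positions within each group
max.b.rows <- dt.1[id %in% dt.2[,id], .I[which.max(value_b)], by="id"][,V1]
dt.3 <- dt.1[max.b.rows,]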
BTW, the id %in% dt.2[,id] part is just there to make sure you only select maxima for those ids in table 2
Best
I know that there are more questions like this. However, I can not get it done.
I have a very large data.table with about 800.000 rows. Every row contains some specific data, then some demand data for 2 years (from column 12 onward), and in the last column is a number. This number determines how many demand columns should be replaced.
Example with one row:
ITEM COUNTRY Q1 Q2 Q3 Q4 ---- Q24 NUMBER
1 1 0 0 0 0 ---- 2 2
If this is a vector called x for example, I would do this:
x[2:(2+x$NUMBER-1)] <- NA
Now I want to do this for all rows. A for loop is way too slow. So I was thinking of apply.
fun_NA <- function(x){ #x is then a row of the data.table
#still use that with x[,] as x is still a data.table and not a vector
if(x[,60]>0){
x[,12:(12+as.numeric(x[,60])-1)] <- NA
}
}
dt = apply(dt, 1, fun_NA)
where dt is my data.table, which has 60 columns. The first demand value starts in column 12. It does not work. It currently gives the error "Error in x[,60] : incorrect number of dimensions", but I tried other ways too and received other errors.
Reason why I want to do this:
The number represent the month (after first month of my data set) when the item was able to sell, so in the months before, the demand was not 0, but just did not exist. I need it to be NA, as 0 will cause wrong calculations later on.
EDIT:
Removed the commas, so this is the new code:
fun_NA <- function(x){ #x is then a row of the data.table
if(x[60]>0){
x[12:(12+as.numeric(x[60])-1)] <- NA
}
}
dt = apply(dt, 1, fun_NA)
However, this returns a large list with all NULL and NA elements.
Edit: the head of the data-table is as follows: (dput)
structure(list(ITEM = c(1, 1, 2, 2, 2, 2), COUNTRY = c(1, 2,
3, 4, 5, 2), DATE = c("2015-02-02", "2015-02-02", "2014-09-27",
"2014-09-27", "2014-09-27", "2014-09-27"), q_1 = c(0, 0, 2, 0,
0, 133), q_2 = c(0, 0, 24, 0, 9, 119), q_3 = c(0, 0, 15, 0, 13,
121), q_4 = c(0, 0, 7, 0, 2, 51), q_5 = c(0, 0, 12, 0, 6, 59),
q_6 = c(0, 0, 3, 0, 0, 36), q_7 = c(0, 0, 6, 0, 6, 41), q_8 = c(0,
0, 19, 0, 4, 42), q_9 = c(0, 0, 3, 0, 5, 48), q_10 = c(0,
0, 5, 0, 11, 49), q_11 = c(0, 0, 6, 0, 1, 42), q_12 = c(0,
0, 0, 0, 8, 70), q_13 = c(0, 0, 1, 0, 19, 81), q_14 = c(0,
0, 5, 0, 98, 86), q_15 = c(0, 0, 12, 0, 10, 152), q_16 = c(0,
0, 7, 0, 8, 95), q_17 = c(0, 0, 30, 0, 5, 62), q_18 = c(0,
0, 6, 0, 10, 47), q_19 = c(0, 0, 7, 0, 1, 35), q_20 = c(2,
0, 7, 0, 0, 47), q_21 = c(0, 2, 16, 5, 4, 70), q_22 = c(0,
0, 7, 0, 7, 46), q_23 = c(0, 0, 8, 0, 79, 20), q_24 = c(0,
0, 5, 0, 26, 45), NUMBER = c(13, 13, 8, 8, 8, 8)), .Names = c("ITEM",
"COUNTRY", "DATE", "q_1", "q_2", "q_3", "q_4", "q_5", "q_6",
"q_7", "q_8", "q_9", "q_10", "q_11", "q_12", "q_13", "q_14",
"q_15", "q_16", "q_17", "q_18", "q_19", "q_20", "q_21", "q_22",
"q_23", "q_24", "NUMBER"), class = c("data.table", "data.frame"
), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000004490788>)
Usually, it is more efficient to work column-wise.
lapply(1:24, function(i) dt[i <= NUMBER, (paste0("q_", i)) := NA])
ITEM COUNTRY DATE q_1 q_2 q_3 q_4 q_5 q_6 q_7 q_8 q_9 q_10 q_11 q_12 q_13 q_14 q_15 q_16 q_17 q_18 q_19 q_20
1: 1 1 2015-02-02 NA NA NA NA NA NA NA NA NA NA NA NA NA 0 0 0 0 0 0 2
2: 1 2 2015-02-02 NA NA NA NA NA NA NA NA NA NA NA NA NA 0 0 0 0 0 0 0
3: 2 3 2014-09-27 NA NA NA NA NA NA NA NA 3 5 6 0 1 5 12 7 30 6 7 7
4: 2 4 2014-09-27 NA NA NA NA NA NA NA NA 0 0 0 0 0 0 0 0 0 0 0 0
5: 2 5 2014-09-27 NA NA NA NA NA NA NA NA 5 11 1 8 19 98 10 8 5 10 1 0
6: 2 2 2014-09-27 NA NA NA NA NA NA NA NA 48 49 42 70 81 86 152 95 62 47 35 47
q_21 q_22 q_23 q_24 NUMBER
1: 0 0 0 0 13
2: 2 0 0 0 13
3: 16 7 8 5 8
4: 5 0 0 0 8
5: 4 7 79 26 8
6: 70 46 20 45 8
Explanation
We loop over all 24 columns. For each row it is tested whether the value in the current column needs to be replaced by NA according to its column number. The values are updated in place which saves time and memory.
I've tested the solution with a sample data set of 1 million rows which took less than 0.2 seconds.
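The same column-wise idea can also be sketched without data.table by comparing a matrix of column positions against NUMBER. The toy table below (four q columns, three rows) is invented for illustration and is not the poster's data.

```r
dt <- data.frame(
  ITEM   = c(1, 2, 3),
  q_1    = c(0, 2, 5),
  q_2    = c(0, 24, 9),
  q_3    = c(2, 15, 13),
  q_4    = c(1, 7, 2),
  NUMBER = c(2, 0, 3))

q_cols <- paste0("q_", 1:4)
q <- as.matrix(dt[q_cols])

# col(q) holds each cell's column position; NUMBER is recycled down the rows,
# so cell [i, j] is blanked when j <= NUMBER[i]
q[col(q) <= dt$NUMBER] <- NA
dt[q_cols] <- q
dt
```

Here the first row loses q_1 and q_2, the second row is untouched, and the third row loses q_1 to q_3. Unlike the := version, this copies the demand columns rather than updating them in place.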
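Since the input to the function is a vector (apply() passes each row as a vector), you should get rid of all the commas, because it only has one dimension. The function also needs to return the modified row; otherwise apply() collects only NULL and NA.
fun_NA <- function(x){ # x is one row of the data.table, passed as a vector
n <- as.numeric(x[60])
if(n > 0){
x[12:(12+n-1)] <- NA
}
x # return the modified row
}
# apply() returns one column per input row, so transpose; the result is a matrix
dt = t(apply(dt, 1, fun_NA))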