I'm trying to perform calculations on blocks of a matrix in R. My matrix is 18x18 and I would like to get, e.g., the mean of each 6x6 sub-array (9 sub-arrays in total). My desired arrays would be:
A1 <- df[1:6,1:6]
A2 <- df[1:6,7:12]
A3 <- df[1:6,13:18]
B1 <- df[7:12,1:6]
B2 <- df[7:12,7:12]
B3 <- df[7:12,13:18]
C1 <- df[13:18,1:6]
C2 <- df[13:18,7:12]
C3 <- df[13:18,13:18]
The matrix looks like this:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90
5 14 17 9 10 8 4 10 12 18 9 13 14 NA NA 19 15 10 10
10 30 32 23 27 17 28 25 12 28 29 28 26 19 25 34 24 11 17
15 16 16 16 9 17 27 17 16 30 13 18 13 15 13 19 8 7 9
20 15 12 18 18 18 6 4 6 9 11 10 10 13 11 8 10 15 15
25 7 13 21 7 3 5 2 5 5 4 3 2 3 5 2 1 5 6
30 5 9 1 7 7 4 4 12 8 9 2 0 5 2 1 0 2 6
35 3 0 2 0 0 4 4 7 4 4 5 2 0 0 1 0 0 0
40 0 4 0 0 0 1 3 9 10 10 1 0 0 0 1 0 1 0
45 0 0 0 0 0 3 10 9 17 9 1 0 0 0 0 0 0 0
50 0 0 2 0 0 0 2 8 20 0 0 0 0 0 1 0 0 0
55 0 0 0 0 0 0 7 3 21 0 0 0 0 0 0 0 0 0
60 0 0 0 0 3 4 10 2 2 0 0 1 0 0 0 0 0 0
65 0 0 0 0 0 4 8 4 8 11 0 0 0 0 0 0 0 0
70 0 0 0 0 0 6 2 5 14 0 0 0 0 0 0 0 0 0
75 0 0 0 0 0 4 0 5 9 0 0 0 0 0 0 0 0 0
80 0 0 0 0 0 4 4 0 4 2 0 0 0 0 0 0 0 0
85 0 0 0 0 0 0 0 4 1 1 0 0 0 0 0 0 0 0
90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Is there a clean way to solve this issue with a loop?
Thanks a lot in advance,
Paul
Given your matrix, e.g.
x <- matrix(1:(18*18), ncol=18)
Try this, for example with 6x6 sub-matrices:
step <- 6
nx <- nrow(x)
if((nx %% step) != 0) stop("nx %% step should be 0")
indI <- seq(1, nx, by=step)
nbStep <- length(indI)
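for(Row in 1:nbStep){
  for(Col in 1:nbStep){
    # letter = block row, number = block column, as in the question
    name <- paste0(LETTERS[Row], Col)
    theRow <- indI[Row]:(indI[Row]+step-1)
    theCol <- indI[Col]:(indI[Col]+step-1)
    # swap sum() for mean(..., na.rm = TRUE) to get block means
    assign(name, sum(x[theRow, theCol]))
  }
}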
You'll get your results in A1, A2, A3...
That is the idea. Adapt the code for non-square matrices, different sub-matrix sizes, ...
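As a rough sketch of that adaptation, the helper below collects the block statistics in one small matrix instead of creating separate variables with assign(). The name block_apply and its arguments are made up for illustration, and it assumes the matrix dimensions are exact multiples of the block size and that there are at most 26 block rows (because of LETTERS).

```r
# Hypothetical helper: apply `fun` to every block of a (possibly non-square) matrix
block_apply <- function(x, block_rows, block_cols, fun = mean, ...) {
  stopifnot(nrow(x) %% block_rows == 0, ncol(x) %% block_cols == 0)
  # Block index of every row and column, e.g. 1 1 1 1 1 1 2 2 2 ...
  row_id <- (seq_len(nrow(x)) - 1) %/% block_rows + 1
  col_id <- (seq_len(ncol(x)) - 1) %/% block_cols + 1
  n_r <- max(row_id)
  n_c <- max(col_id)
  out <- matrix(NA_real_, n_r, n_c,
                dimnames = list(LETTERS[seq_len(n_r)], as.character(seq_len(n_c))))
  for (i in seq_len(n_r)) {
    for (j in seq_len(n_c)) {
      out[i, j] <- fun(x[row_id == i, col_id == j], ...)
    }
  }
  out
}

x <- matrix(1:(18 * 18), ncol = 18)
res <- block_apply(x, 6, 6, mean, na.rm = TRUE)
res["A", "2"]  # mean of x[1:6, 7:12]
```

Keeping the results in a matrix makes it easy to look up one block (res["B", "3"]) or to pass all of them on to further calculations.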
Here's one way:
# generate fake data
set.seed(47)
n = 18
m = matrix(rpois(n * n, lambda = 5), nrow = n)
# generate starting indices
n_array = 6
start_i = seq(1, n, by = n_array)
arr_starts = expand.grid(row = start_i, col = start_i)
# calculate sums
with(arr_starts, mapply(function(x, y) sum(m[(x + 1:n_array) - 1, (y + 1:n_array) - 1]), row, col))
# [1] 158 188 176 201 188 201 197 206 204
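One thing to keep in mind with this approach: expand.grid() varies its first column fastest, so the results come out in the order A1, B1, C1, A2, ... rather than A1, A2, A3, .... The sketch below reshapes them into a labelled 3x3 matrix and uses mean() with na.rm = TRUE, since the question asks for means and the real data contains NAs. The names block_means and mean_mat are only for illustration.

```r
set.seed(47)
n <- 18
m <- matrix(rpois(n * n, lambda = 5), nrow = n)

n_array <- 6
start_i <- seq(1, n, by = n_array)
arr_starts <- expand.grid(row = start_i, col = start_i)

# One mean per block; r0 and c0 are the top-left indices of a block
block_means <- with(arr_starts, mapply(
  function(r0, c0) {
    mean(m[r0 + seq_len(n_array) - 1, c0 + seq_len(n_array) - 1], na.rm = TRUE)
  },
  row, col
))

# matrix() fills column by column, which matches the order from expand.grid()
mean_mat <- matrix(block_means, nrow = length(start_i),
                   dimnames = list(LETTERS[1:3], as.character(1:3)))
mean_mat["B", "3"]  # mean of m[7:12, 13:18]
```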
Hi Stack Overflow community,
I'm relatively new to R (9 months) and this is my first Stack Overflow question with a reprex, so I would really appreciate any help. I mainly use tidyverse, although I am open to base R solutions.
Problem:
I have ~21,000 rows of symptom data with >10 variables per day. I would like to be able to classify "exacerbations" of a disease (in this case chest infections in lung disease) by using rules to define the start and end of the episode, so that I can later calculate the duration of episodes, the type of episode (this depends on the combination of symptoms) and the treatment received. As with any data set involving patients, there are missing values. I have imputed from the most recent day where fewer than 2 days of data are missing.
The below code is a simplified, made up example involving only 1 symptom.
Exacerbation Rule:
Start of exacerbation = 2 days of worse symptoms (>= 3)
Resolution of exacerbation = 5 days with normal breathing (<=2)
I would ideally want to be able to identify all days when an exacerbation is happening too.
Here is the data:
#load packages
library(tidyverse)
#load data
id <- "A"
day <- c(1:50)
symptom <- c(2,2,2,2,2,2,2,2,2,2,2,3,2,2,2,2,NA,NA,NA,2,2,2,3,3,3,4,4,3,3,2,3,2,2,3,3,2,2,2,2,2,2,3,2,2,2,2,2,3,2,2)
df <- data.frame(id,day,symptom)
#Data Dictionary
#Symptom: 1 = Better than usual, 2 = Normal/usual, 3 = Worse than usual, 4 = Much worse than usual
What I have tried:
I have tried to approach this by using a combination of lag() and lead() with conditional statements case_when() and ifelse().
df %>%
  mutate_at(vars("symptom"), #used for more variables within vars() argument
            .funs = list(lead1 = ~ lead(., n = 1),
                         lead2 = ~ lead(., n = 2),
                         lead3 = ~ lead(., n = 3),
                         lead4 = ~ lead(., n = 4),
                         lead5 = ~ lead(., n = 5),
                         lag1 = ~ lag(., n = 1),
                         lag2 = ~ lag(., n = 2),
                         lag3 = ~ lag(., n = 3))) %>%
  mutate(start = case_when(symptom <= 2 ~ 0,
                           symptom >= 3 ~
                             ifelse(symptom >= lag2 & symptom <= lag1,1,0)),
         end = case_when(symptom >=3 ~
                           ifelse(lead1 <=2 &
                                    lead2 <=2 &
                                    lead3 <=2 &
                                    lead4 <=2 &
                                    lead5 <=2,1,0)))
My main issue is that of complexity. As I build in more symptoms and rules I have to refer to different variables that have ifelse()/case_when() statements within it. I am sure there is a more elegant solution to my problem.
The other issue is that during an "exacerbation" the exacerbation_start variable should only be used at the start and not during the episode. Similarly for exacerbation_end it would only be applicable when an exacerbation is already happening. I have tried using ifelse() statements to refer to when an exacerbation is happening but not been able to get this to work and obey the rule I desire.
The output I would like is:
id day symptom start end exacerbation
1 A 1 2 0 0 0
2 A 2 2 0 0 0
3 A 3 2 0 0 0
4 A 4 2 0 0 0
5 A 5 2 0 0 0
6 A 6 2 0 0 0
7 A 7 2 0 0 0
8 A 8 2 0 0 0
9 A 9 2 0 0 0
10 A 10 2 0 0 0
11 A 11 2 0 0 0
12 A 12 3 0 0 0
13 A 13 2 0 0 0
14 A 14 2 0 0 0
15 A 15 2 0 0 0
16 A 16 2 0 0 0
17 A 17 NA 0 0 0
18 A 18 NA 0 0 0
19 A 19 NA 0 0 0
20 A 20 2 0 0 0
21 A 21 2 0 0 0
22 A 22 2 0 0 0
23 A 23 3 0 0 0
24 A 24 3 1 0 1
25 A 25 3 0 0 1
26 A 26 4 0 0 1
27 A 27 4 0 0 1
28 A 28 3 0 0 1
29 A 29 3 0 0 1
30 A 30 2 0 0 1
31 A 31 3 0 0 1
32 A 32 2 0 0 1
33 A 33 2 0 0 1
34 A 34 3 0 0 1
35 A 35 3 0 1 1
36 A 36 2 0 0 0
37 A 37 2 0 0 0
38 A 38 2 0 0 0
39 A 39 2 0 0 0
40 A 40 2 0 0 0
41 A 41 2 0 0 0
42 A 42 3 0 0 0
43 A 43 2 0 0 0
44 A 44 2 0 0 0
45 A 45 2 0 0 0
46 A 46 2 0 0 0
47 A 47 2 0 0 0
48 A 48 3 0 0 0
49 A 49 2 0 0 0
50 A 50 2 0 0 0
I look forward to your replies!
EDIT
I have added 50 more rows of data to simulate multiple exacerbations and the issue with right censoring and NAs. I have also included a second participant "B" to see if this is a reason for issues.
id <- c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B")
day <- c(1:50,1:50)
symptom <- c(2,3,3,3,3,2,2,2,2,2,2,3,2,2,2,2,NA,NA,NA,2,2,2,3,3,3,4,4,3,3,2,3,2,2,3,3,2,2,2,2,2,2,3,2,2,2,2,2,3,2,2, 2,2,2,2,2,2,3,2,3,3,2,3,2,3,2,2,2,2,2,2,3,3,3,3,NA,NA,NA,2,2,2,3,2,2,2,2,2,3,2,2,3,NA,NA,NA,3,3,3,3,3,3,2)
df <- data.frame(id,day,symptom)
id day symptom start end exacerbation censor
1 A 1 2 0 0 0 0
2 A 2 3 1 0 1 0
3 A 3 3 0 0 1 0
4 A 4 3 0 0 1 0
5 A 5 3 0 1 1 0
6 A 6 2 0 0 0 0
7 A 7 2 0 0 0 0
8 A 8 2 0 0 0 0
9 A 9 2 0 0 0 0
10 A 10 2 0 0 0 0
11 A 11 2 0 0 0 0
12 A 12 3 0 0 0 0
13 A 13 2 0 0 0 0
14 A 14 2 0 0 0 0
15 A 15 2 0 0 0 0
16 A 16 2 0 0 0 0
17 A 17 NA 0 0 0 0
18 A 18 NA 0 0 0 0
19 A 19 NA 0 0 0 0
20 A 20 2 0 0 0 0
21 A 21 2 0 0 0 0
22 A 22 2 0 0 0 0
23 A 23 3 1 0 1 0
24 A 24 3 0 0 1 0
25 A 25 3 0 0 1 0
26 A 26 4 0 0 1 0
27 A 27 4 0 0 1 0
28 A 28 3 0 0 1 0
29 A 29 3 0 0 1 0
30 A 30 2 0 0 1 0
31 A 31 3 0 0 1 0
32 A 32 2 0 0 1 0
33 A 33 2 0 0 1 0
34 A 34 3 0 0 1 0
35 A 35 3 0 0 1 0
36 A 36 2 0 0 1 0
37 A 37 2 0 0 1 0
38 A 38 2 0 0 1 0
39 A 39 2 0 0 1 0
40 A 40 2 0 0 1 0
41 A 41 2 0 1 1 0
42 A 42 3 0 0 0 0
43 A 43 2 0 0 0 0
44 A 44 2 0 0 0 0
45 A 45 2 0 0 0 0
46 A 46 2 0 0 0 0
47 A 47 2 0 0 0 0
48 A 48 3 0 0 0 0
49 A 49 2 0 0 0 0
50 A 50 2 0 0 0 0
51 B 1 2 0 0 0 0
52 B 2 2 0 0 0 0
53 B 3 2 0 0 0 0
54 B 4 2 0 0 0 0
55 B 5 2 0 0 0 0
56 B 6 2 0 0 0 0
57 B 7 3 0 0 0 0
58 B 8 2 0 0 0 0
59 B 9 3 0 0 0 0
60 B 10 3 1 0 1 0
61 B 11 2 0 0 1 0
62 B 12 3 0 0 1 0
63 B 13 2 0 0 1 0
64 B 14 3 0 0 1 0
65 B 15 2 0 0 1 0
66 B 16 2 0 0 1 0
67 B 17 2 0 0 1 0
68 B 18 2 0 0 1 0
69 B 19 2 0 1 1 0
70 B 20 2 0 0 0 0
71 B 21 3 1 0 1 0
72 B 22 3 0 0 1 0
73 B 23 3 0 0 1 0
74 B 24 3 0 0 1 0
75 B 25 NA 0 0 0 1
76 B 26 NA 0 0 0 1
77 B 27 NA 0 0 0 1
78 B 28 2 0 0 0 1
79 B 29 2 0 0 0 1
80 B 30 2 0 0 0 1
81 B 31 3 0 0 0 1
82 B 32 2 0 0 0 1
83 B 33 2 0 0 0 1
84 B 34 2 0 0 0 1
85 B 35 2 0 0 0 1
86 B 36 2 0 0 0 1
87 B 37 3 0 0 0 0
88 B 38 2 0 0 0 0
89 B 39 2 0 0 0 0
90 B 40 3 0 0 0 0
91 B 41 NA 0 0 0 0
92 B 42 NA 0 0 0 0
93 B 43 NA 0 0 0 0
94 B 44 3 1 0 1 0
95 B 45 3 0 0 1 0
96 B 46 3 0 0 1 0
97 B 47 3 0 0 1 0
98 B 48 3 0 0 1 0
99 B 49 3 0 0 1 0
100 B 50 2 0 0 1 0
Here is an attempt at a more elegant and scalable way to write your algorithm:
First, you do not have to compute the lead and lag columns before you can use case_when; you can call lag() and lead() directly inside it. Of note, I find it good practice to explicitly write the TRUE option of case_when. Here is some code.
df2=df %>%
  mutate(
    exacerbation_start = case_when(
      is.na(symptom) ~ NA_real_,
      symptom <= 2 ~ 0,
      symptom >= 3 & symptom >= lag(symptom, n=2) & symptom <= lag(symptom, n=1) ~ 1,
      TRUE ~ 0
    ),
    exacerbation_end = case_when(
      symptom >=3 ~ ifelse(lead(symptom, n=1) <=2 &
                             lead(symptom, n=2) <=2 & lead(symptom, n=3) <=2 &
                             lead(symptom, n=4) <=2 & lead(symptom, n=5) <=2,
                           1,0),
      TRUE ~ NA_real_
    )
  )
all.equal(df1,df2) #TRUE (df1 is not defined here; presumably it holds the result of your original code)
Alternatively, if your algorithm is the same for all symptoms, you might want to use custom functions:
get_exacerbation_start = function(x){
  case_when(
    is.na(x) ~ NA_real_,
    x <= 2 ~ 0,
    x >= 3 & x >= lag(x, n=2) & x <= lag(x, n=1) ~ 1,
    TRUE ~ 0
  )
}
get_exacerbation_end = function(x){
  case_when(
    x >=3 ~ ifelse(x >=3 & lead(x, n=1) <=2 &
                     lead(x, n=2) <=2 & lead(x, n=3) <=2 &
                     lead(x, n=4) <=2 & lead(x, n=5) <=2,
                   1,0),
    TRUE ~ NA_real_
  )
}
df3=df %>%
  mutate(
    exacerbation_start = get_exacerbation_start(symptom),
    exacerbation_end = get_exacerbation_end(symptom)
  )
all.equal(df1,df3) #also TRUE
This latter way might be even more powerful with some mutate_at calls.
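As a rough sketch of that idea, assuming dplyr 1.0 or later, across() (the successor of mutate_at) can apply the same helper to several symptom columns at once. The columns cough and wheeze and the data frame toy are made up for illustration; the helper is repeated only so the example runs on its own.

```r
library(dplyr)

# Same helper as above, repeated so the example is self-contained
get_exacerbation_start <- function(x) {
  case_when(
    is.na(x) ~ NA_real_,
    x <= 2 ~ 0,
    x >= 3 & x >= lag(x, n = 2) & x <= lag(x, n = 1) ~ 1,
    TRUE ~ 0
  )
}

# Hypothetical data with two symptom columns
toy <- data.frame(
  id = "A",
  day = 1:6,
  cough = c(2, 3, 3, 2, 2, 2),
  wheeze = c(2, 2, 2, 3, 3, NA)
)

out <- toy %>%
  group_by(id) %>%  # keeps lag() from reaching across participants
  mutate(across(c(cough, wheeze), list(start = get_exacerbation_start))) %>%
  ungroup()

out$cough_start   # 0 0 1 0 0 0
out$wheeze_start  # 0 0 0 0 1 NA
```

With a named list of functions, across() names the new columns as column_function, so adding get_exacerbation_end to the list would also give cough_end and wheeze_end.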
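EDIT: after seeing your edit, here is an attempt at getting the exacerbation period. The code is quite ugly in my opinion, and I'm not sure that row_number() was meant to be used this way. Note that which(...)[1] only picks the first start and the first end, so this flags only the first episode, and without a group_by(id) it does not separate participants.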
df_final=df %>%
  transmute(
    id,day,symptom,
    start = get_exacerbation_start(symptom),
    end = get_exacerbation_end(symptom),
    exacerbation = row_number()>=which(start==1)[1] & row_number()<=which(end==1)[1]
  )
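If several episodes per participant need to be flagged, one option is a small state machine that walks through the days in order. This is only a sketch under stated assumptions, not the method used above: flag_exacerbations is a made-up helper, the start is flagged on the second consecutive worse day, the end on the fifth consecutive normal day, and a missing day resets both counters without changing the current state.

```r
# Hypothetical helper: one pass over the days of a single participant
flag_exacerbations <- function(symptom, n_start = 2, n_end = 5) {
  n <- length(symptom)
  start <- end <- exacerbation <- integer(n)
  in_episode <- FALSE
  worse_run <- 0L   # consecutive days with symptom >= 3
  normal_run <- 0L  # consecutive days with symptom <= 2
  for (i in seq_len(n)) {
    s <- symptom[i]
    if (is.na(s)) {
      # Assumption: a missing day breaks both runs but keeps the state
      worse_run <- 0L
      normal_run <- 0L
    } else if (s >= 3) {
      worse_run <- worse_run + 1L
      normal_run <- 0L
    } else {
      normal_run <- normal_run + 1L
      worse_run <- 0L
    }
    if (!in_episode && worse_run >= n_start) {
      in_episode <- TRUE
      start[i] <- 1L
    }
    exacerbation[i] <- as.integer(in_episode)
    if (in_episode && normal_run >= n_end) {
      in_episode <- FALSE  # the episode closes after this day
      end[i] <- 1L
    }
  }
  data.frame(start, end, exacerbation)
}

# The 50-day example for participant "A" from the question
symptom <- c(2,2,2,2,2,2,2,2,2,2,2,3,2,2,2,2,NA,NA,NA,2,2,2,3,3,3,4,4,3,3,2,
             3,2,2,3,3,2,2,2,2,2,2,3,2,2,2,2,2,3,2,2)
res <- flag_exacerbations(symptom)
which(res$start == 1)  # 24
which(res$end == 1)    # 40
```

With dplyr 1.0 or later, something like df %>% group_by(id) %>% mutate(flag_exacerbations(symptom)) should add the three columns per participant, since an unnamed data frame returned inside mutate() is unpacked into columns. Other conventions (flagging the first worse day, or treating NAs as censoring) would need changes to the rules inside the loop.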
I may come back with a less convoluted approach, but try this:
library(dplyr)
library(tidyr)
df %>%
  group_by(id,
           idx = with(
             rle(
               case_when(symptom <= 2 ~ 'normal',
                         symptom >= 3 ~ 'worse',
                         TRUE ~ symptom %>% as.character)),
             rep(seq_along(lengths), lengths)
           )
  ) %>%
  mutate(
    trajectory = case_when(cumsum(symptom <= 2) == 5 ~ 2, cumsum(symptom >= 3) == 2 ~ 1)
  ) %>%
  group_by(id) %>% fill(trajectory) %>%
  mutate(
    trajectory = replace_na(trajectory, 0),
    start = +(trajectory == 1 & lag(trajectory) == 2),
    end = +(trajectory == 2 & lag(trajectory) == 1),
    exacerbation = +(trajectory == 1 | start == 1 | end == 1)
  ) %>%
  select(-idx, -trajectory) %>% as.data.frame
Output:
id day symptom start end exacerbation
1 A 1 2 0 0 0
2 A 2 2 0 0 0
3 A 3 2 0 0 0
4 A 4 2 0 0 0
5 A 5 2 0 0 0
6 A 6 2 0 0 0
7 A 7 2 0 0 0
8 A 8 2 0 0 0
9 A 9 2 0 0 0
10 A 10 2 0 0 0
11 A 11 2 0 0 0
12 A 12 3 0 0 0
13 A 13 2 0 0 0
14 A 14 2 0 0 0
15 A 15 2 0 0 0
16 A 16 2 0 0 0
17 A 17 NA 0 0 0
18 A 18 NA 0 0 0
19 A 19 NA 0 0 0
20 A 20 2 0 0 0
21 A 21 2 0 0 0
22 A 22 2 0 0 0
23 A 23 3 0 0 0
24 A 24 3 1 0 1
25 A 25 3 0 0 1
26 A 26 4 0 0 1
27 A 27 4 0 0 1
28 A 28 3 0 0 1
29 A 29 3 0 0 1
30 A 30 2 0 0 1
31 A 31 3 0 0 1
32 A 32 2 0 0 1
33 A 33 2 0 0 1
34 A 34 3 0 0 1
35 A 35 3 0 0 1
36 A 36 2 0 0 1
37 A 37 2 0 0 1
38 A 38 2 0 0 1
39 A 39 2 0 0 1
40 A 40 2 0 1 1
41 A 41 2 0 0 0
42 A 42 3 0 0 0
43 A 43 2 0 0 0
44 A 44 2 0 0 0
45 A 45 2 0 0 0
46 A 46 2 0 0 0
47 A 47 2 0 0 0
48 A 48 3 0 0 0
49 A 49 2 0 0 0
50 A 50 2 0 0 0
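The run-index idiom inside group_by() in the pipeline above may be easier to follow in isolation. The small base R sketch below uses a made-up vector state: rle() splits it into runs of identical values, and rep() expands the run numbers back to one id per element. Each NA counts as a run of its own, because rle() treats missing values as unequal to their neighbours.

```r
# Hypothetical daily states; NA stands for a missing day
state <- c("normal", "normal", "worse", "worse", "worse", NA, "normal")

runs <- rle(state)
run_id <- with(runs, rep(seq_along(lengths), lengths))
run_id
# 1 1 2 2 2 3 4

# Position of each day within its run, i.e. a count of consecutive days
day_in_run <- ave(seq_along(run_id), run_id, FUN = seq_along)
day_in_run
# 1 2 1 2 3 1 1
```

Grouping by run_id is what lets a cumsum() within each group count consecutive normal or worse days, which is how the pipeline finds the second worse day and the fifth normal day.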