Is it possible to round the numbers in a data frame to the nearest multiple of 10?
df <- data.frame(a = c(17,1,15,5,1,1,11,0,24,0),
b = c(0,10,19,1,1,32,0,5,7,8),
c = c(1,1,12,18,7,3,12,1,1,20))
round_df <- function(x, digits) {
numeric_columns <- sapply(x, mode) == 'numeric'
x[numeric_columns] <- round_any(x[numeric_columns], digits)
x
}
df_10 <- round_df(df, 10)
I tried it this way, but it doesn't work on a data.frame.
Use round with digits = -1:
round(df, digits = -1)
a b c
1 20 0 0
2 0 10 0
3 20 20 10
4 0 0 20
5 0 0 10
6 0 30 0
7 10 0 10
8 0 0 0
9 20 10 0
10 0 10 20
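If the data frame also contains non-numeric columns, the original round_df idea can be kept with base R only. The following is a minimal sketch under that assumption; df_mixed and its label column are made up for illustration. Ties follow R's usual "round half to even" behaviour, which is why 5 becomes 0 and 15 becomes 20 in the output above.

```r
# Hypothetical mixed data frame: two numeric columns and one character column
df_mixed <- data.frame(a = c(17, 1, 15, 5),
                       b = c(0, 10, 19, 1),
                       label = c("w", "x", "y", "z"))

# Round only the numeric columns; leave everything else untouched
round_df <- function(x, digits) {
  numeric_columns <- vapply(x, is.numeric, logical(1))
  x[numeric_columns] <- round(x[numeric_columns], digits = digits)
  x
}

# digits = -1 rounds to the nearest 10
df_10 <- round_df(df_mixed, -1)
df_10
```

Using is.numeric rather than mode avoids treating factors as numbers, since a factor has mode "numeric" but is not numeric data.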
If you want to round to 20, maybe something like this?
round(df / 20) * 20
a b c
1 20 0 0
2 0 0 0
3 20 20 20
4 0 0 20
5 0 0 0
6 0 40 0
7 20 0 20
8 0 0 0
9 20 0 0
10 0 0 20
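The same divide, round, multiply pattern works for any multiple, so it can be wrapped in a small helper. This is only a sketch; round_to is a hypothetical name, not a base R function. The example uses the first four rows of columns a and b from the question.

```r
# Hypothetical helper: round x to the nearest multiple of `multiple`
round_to <- function(x, multiple) {
  round(x / multiple) * multiple
}

df <- data.frame(a = c(17, 1, 15, 5),
                 b = c(0, 10, 19, 1))

# Arithmetic and round() are applied element-wise to an all-numeric data frame
df_10 <- round_to(df, 10)
df_20 <- round_to(df, 20)
df_20
```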
You can use DescTools::RoundTo:
DescTools::RoundTo(df, multiple = 10, FUN = round)
a b c
1 20 0 0
2 0 10 0
3 20 20 10
4 0 0 20
5 0 0 10
6 0 30 0
7 10 0 10
8 0 0 0
9 20 10 0
10 0 10 20
You can also change to round to different multiples (e.g., to 20).
DescTools::RoundTo(df, multiple = 20, FUN = round)
a b c
1 20 0 0
2 0 0 0
3 20 20 20
4 0 0 20
5 0 0 0
6 0 40 0
7 20 0 20
8 0 0 0
9 20 0 0
10 0 0 20
df[] <- lapply(df, plyr::round_any, accuracy = 10)
-output
> df
a b c
1 20 0 0
2 0 10 0
3 20 20 10
4 0 0 20
5 0 0 10
6 0 30 0
7 10 0 10
8 0 0 0
9 20 10 0
10 0 10 20
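The empty brackets in df[] <- lapply(...) are what keep the result a data frame. A small base R sketch of the difference, using a plain rounding function in place of plyr::round_any so that it runs without extra packages:

```r
df <- data.frame(a = c(17, 1, 15), b = c(0, 10, 19))

# Without [] the result of lapply() is a plain list, not a data frame
as_list <- lapply(df, function(col) round(col / 10) * 10)

# With [] the columns are replaced in place and df stays a data frame
df[] <- lapply(df, function(col) round(col / 10) * 10)

class(as_list)
class(df)
```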
I need to compute letter frequencies for a large list of words. For each position in the word (first, second, ...), I need to count how many times each letter (a-z) appears in the list and then tabulate the counts by word position.
For example, if my word list is: words <- c("swims", "seems", "gills", "draws", "which", "water")
then the result table should look like this:
letter                 first position  second position  third position  fourth position  fifth position
a                      0               1                1               0                0
b                      0               0                0               0                0
c                      0               0                0               1                0
d                      1               0                0               0                0
e                      0               1                1               1                0
f                      0               0                0               0                0
...continued until z   ...             ...              ...             ...              ...
All words are of same length (5).
What I have so far is:
alphabet <- letters[1:26]
words.df <- data.frame("Words" = words)
words.df <- words.df %>% mutate("First_place" = substr(Words, 1, 1))
words.df <- words.df %>% mutate("Second_place" = substr(Words, 2, 2))
words.df <- words.df %>% mutate("Third_place" = substr(Words, 3, 3))
words.df <- words.df %>% mutate("Fourth_place" = substr(Words, 4, 4))
words.df <- words.df %>% mutate("Fifth_place" = substr(Words, 5, 5))
x1 <- words.df$First_place
x1 <- table(factor(x1,alphabet))
x2 <- words.df$Second_place
x2 <- table(factor(x2,alphabet))
x3 <- words.df$Third_place
x3 <- table(factor(x3,alphabet))
x4 <- words.df$Fourth_place
x4 <- table(factor(x4,alphabet))
x5 <- words.df$Fifth_place
x5 <- table(factor(x5,alphabet))
My code is inefficient and produces a separate table for each letter position. Any help would be appreciated.
In base R, use table:
table(let = unlist(strsplit(words,'')),pos = sequence(nchar(words)))
pos
let 1 2 3 4 5
a 0 1 1 0 0
c 0 0 0 1 0
d 1 0 0 0 0
e 0 1 1 1 0
g 1 0 0 0 0
h 0 1 0 0 1
i 0 1 2 0 0
l 0 0 1 1 0
m 0 0 0 2 0
r 0 1 0 0 1
s 2 0 0 0 4
t 0 0 1 0 0
w 2 1 0 1 0
Note that if you need rows for all the letters from a to z, use
table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
Also, to get a data frame you could do:
d <- table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
cbind(letters = rownames(d), as.data.frame.matrix(d))
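To see why the one-liner works, it can help to look at the two vectors that table() pairs up. The sketch below uses the six example words from the question; the intermediate names let, pos and counts are only for illustration.

```r
words <- c("swims", "seems", "gills", "draws", "which", "water")

# One letter per element, in word order: "s" "w" "i" "m" "s" "s" "e" ...
let <- unlist(strsplit(words, ""))

# Position of each letter within its word: 1..5, restarted for every word
pos <- sequence(nchar(words))

head(let, 7)
head(pos, 7)

# Cross-tabulate letter by position, with a row for every letter a-z
counts <- table(factor(let, levels = letters), pos)
counts["s", ]
```

Because sequence() restarts at 1 for each word, this also works when the words have different lengths.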
Here is a tidyverse solution using dplyr, purrr, and tidyr:
strsplit(words.df$Words, "") %>%
map_dfr(~setNames(.x, seq_along(.x))) %>%
pivot_longer(everything(),
values_drop_na = TRUE,
names_to = "pos",
values_to = "letter") %>%
count(pos, letter) %>%
pivot_wider(names_from = pos,
names_glue = "pos{pos}",
id_cols = letter,
values_from = n,
values_fill = 0L)
Output
letter pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 pos11
1 a 65 127 88 38 28 17 14 5 3 0 0
2 b 58 4 7 9 2 4 2 0 1 0 0
3 c 83 14 45 37 20 19 8 3 2 0 0
4 C 2 0 0 0 0 0 0 0 0 0 0
5 d 43 8 33 47 21 22 9 3 1 1 0
6 e 45 156 81 132 114 69 48 23 14 2 2
7 f 54 11 18 10 5 2 1 0 0 0 0
8 g 23 7 27 21 15 8 7 1 0 0 0
9 h 38 56 6 28 21 10 3 3 1 1 0
10 i 25 106 51 58 38 28 8 4 1 0 0
11 j 6 0 2 2 0 0 0 0 0 0 0
12 k 9 1 6 22 12 0 0 0 0 0 0
13 l 45 41 54 54 36 9 7 6 0 2 0
14 m 45 8 31 19 8 8 4 2 0 0 0
15 n 23 42 75 53 34 41 16 16 4 2 0
16 o 28 167 76 41 38 9 11 2 1 0 0
17 p 72 20 34 30 8 3 1 1 1 0 0
18 q 7 2 1 0 0 0 0 0 0 0 0
19 r 46 74 92 59 56 45 12 9 1 1 0
20 s 119 8 67 35 31 22 18 4 1 0 0
21 t 65 30 73 83 57 42 31 9 6 3 1
22 u 12 66 39 36 20 7 7 2 0 0 0
23 v 8 7 20 12 5 5 1 0 0 0 0
24 w 53 8 13 10 2 3 0 1 0 0 0
25 y 6 4 16 15 17 15 10 5 6 1 1
26 x 0 12 5 0 0 0 0 0 0 0 0
27 z 0 0 1 0 0 0 1 1 0 0 0
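The output above lists an uppercase C separately from c, which suggests the word list it was run on contained some capitalised words. If case should not matter, one option is to lowercase the words before splitting. A base R sketch, with a made-up three-word list:

```r
# Hypothetical word list with mixed case
words_mixed <- c("Cat", "cot", "act")

# Lowercase first so that "C" and "c" are counted together
letters_lower <- unlist(strsplit(tolower(words_mixed), ""))
positions <- sequence(nchar(words_mixed))

counts <- table(factor(letters_lower, levels = letters), positions)
counts["c", ]
```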
Hi Stack Overflow community,
I'm relatively new to R (9 months), this is my first Stack Overflow question with a reprex, and I would really appreciate any help. I mainly use the tidyverse, although I am open to base R solutions.
Problem:
I have ~21,000 rows of symptom data with >10 variables per day. I would like to classify "exacerbations" of a disease (in this case chest infections in lung disease) by using rules to define the start and end of each episode, so that I can later calculate the duration of episodes, the type of episode (this depends on the combination of symptoms) and the treatment received. As with any data set involving patients, there are missing values. I have imputed from the most recent day if fewer than 2 days of data are missing.
The code below is a simplified, made-up example involving only 1 symptom.
Exacerbation Rule:
Start of exacerbation = 2 days of worse symptoms (>= 3)
Resolution of exacerbation = 5 days with normal breathing (<=2)
I would ideally want to be able to identify all days when an exacerbation is happening too.
Here is the data:
#load packages
library(tidyverse)
#load data
id <- "A"
day <- c(1:50)
symptom <- c(2,2,2,2,2,2,2,2,2,2,2,3,2,2,2,2,NA,NA,NA,2,2,2,3,3,3,4,4,3,3,2,3,2,2,3,3,2,2,2,2,2,2,3,2,2,2,2,2,3,2,2)
df <- data.frame(id,day,symptom)
#Data Dictionary
#Symptom: 1 = Better than usual, 2 = Normal/usual, 3 = Worse than usual, 4 = Much worse than usual
What I have tried:
I have tried to approach this by using a combination of lag() and lead() with conditional statements case_when() and ifelse().
df %>%
mutate_at(vars("symptom"), #used for more variables within vars() argument
.funs = list(lead1 = ~ lead(., n = 1),
lead2 = ~ lead(., n = 2),
lead3 = ~ lead(., n = 3),
lead4 = ~ lead(., n = 4),
lead5 = ~ lead(., n = 5),
lag1 = ~ lag(., n = 1),
lag2 = ~ lag(., n = 2),
lag3 = ~ lag(., n = 3))) %>%
mutate(start = case_when(symptom <= 2 ~ 0,
symptom >= 3 ~
ifelse(symptom >= lag2 & symptom <= lag1,1,0)),
end = case_when(symptom >=3 ~
ifelse(lead1 <=2 &
lead2 <=2 &
lead3 <=2 &
lead4 <=2 &
lead5 <=2,1,0)))
My main issue is that of complexity. As I build in more symptoms and rules I have to refer to different variables that have ifelse()/case_when() statements within it. I am sure there is a more elegant solution to my problem.
The other issue is that, during an "exacerbation", the exacerbation_start variable should only be flagged at the start and not during the episode. Similarly, exacerbation_end should only apply when an exacerbation is already happening. I have tried using ifelse() statements to refer to when an exacerbation is happening, but I have not been able to get this to work and obey the rule I want.
The output I would like is:
id day symptom start end exacerbation
1 A 1 2 0 0 0
2 A 2 2 0 0 0
3 A 3 2 0 0 0
4 A 4 2 0 0 0
5 A 5 2 0 0 0
6 A 6 2 0 0 0
7 A 7 2 0 0 0
8 A 8 2 0 0 0
9 A 9 2 0 0 0
10 A 10 2 0 0 0
11 A 11 2 0 0 0
12 A 12 3 0 0 0
13 A 13 2 0 0 0
14 A 14 2 0 0 0
15 A 15 2 0 0 0
16 A 16 2 0 0 0
17 A 17 NA 0 0 0
18 A 18 NA 0 0 0
19 A 19 NA 0 0 0
20 A 20 2 0 0 0
21 A 21 2 0 0 0
22 A 22 2 0 0 0
23 A 23 3 0 0 0
24 A 24 3 1 0 1
25 A 25 3 0 0 1
26 A 26 4 0 0 1
27 A 27 4 0 0 1
28 A 28 3 0 0 1
29 A 29 3 0 0 1
30 A 30 2 0 0 1
31 A 31 3 0 0 1
32 A 32 2 0 0 1
33 A 33 2 0 0 1
34 A 34 3 0 0 1
35 A 35 3 0 1 1
36 A 36 2 0 0 0
37 A 37 2 0 0 0
38 A 38 2 0 0 0
39 A 39 2 0 0 0
40 A 40 2 0 0 0
41 A 41 2 0 0 0
42 A 42 3 0 0 0
43 A 43 2 0 0 0
44 A 44 2 0 0 0
45 A 45 2 0 0 0
46 A 46 2 0 0 0
47 A 47 2 0 0 0
48 A 48 3 0 0 0
49 A 49 2 0 0 0
50 A 50 2 0 0 0
I look forward to your replies!
EDIT
I have added 50 more rows of data to simulate multiple exacerbations and the issue with right censoring and NAs. I have also included a second participant "B" to see if this is a reason for issues.
id <- c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B")
day <- c(1:50,1:50)
symptom <- c(2,3,3,3,3,2,2,2,2,2,2,3,2,2,2,2,NA,NA,NA,2,2,2,3,3,3,4,4,3,3,2,3,2,2,3,3,2,2,2,2,2,2,3,2,2,2,2,2,3,2,2, 2,2,2,2,2,2,3,2,3,3,2,3,2,3,2,2,2,2,2,2,3,3,3,3,NA,NA,NA,2,2,2,3,2,2,2,2,2,3,2,2,3,NA,NA,NA,3,3,3,3,3,3,2)
df <- data.frame(id,day,symptom)
id day symptom start end exacerbation censor
1 A 1 2 0 0 0 0
2 A 2 3 1 0 1 0
3 A 3 3 0 0 1 0
4 A 4 3 0 0 1 0
5 A 5 3 0 1 1 0
6 A 6 2 0 0 0 0
7 A 7 2 0 0 0 0
8 A 8 2 0 0 0 0
9 A 9 2 0 0 0 0
10 A 10 2 0 0 0 0
11 A 11 2 0 0 0 0
12 A 12 3 0 0 0 0
13 A 13 2 0 0 0 0
14 A 14 2 0 0 0 0
15 A 15 2 0 0 0 0
16 A 16 2 0 0 0 0
17 A 17 NA 0 0 0 0
18 A 18 NA 0 0 0 0
19 A 19 NA 0 0 0 0
20 A 20 2 0 0 0 0
21 A 21 2 0 0 0 0
22 A 22 2 0 0 0 0
23 A 23 3 1 0 1 0
24 A 24 3 0 0 1 0
25 A 25 3 0 0 1 0
26 A 26 4 0 0 1 0
27 A 27 4 0 0 1 0
28 A 28 3 0 0 1 0
29 A 29 3 0 0 1 0
30 A 30 2 0 0 1 0
31 A 31 3 0 0 1 0
32 A 32 2 0 0 1 0
33 A 33 2 0 0 1 0
34 A 34 3 0 0 1 0
35 A 35 3 0 0 1 0
36 A 36 2 0 0 1 0
37 A 37 2 0 0 1 0
38 A 38 2 0 0 1 0
39 A 39 2 0 0 1 0
40 A 40 2 0 0 1 0
41 A 41 2 0 1 1 0
42 A 42 3 0 0 0 0
43 A 43 2 0 0 0 0
44 A 44 2 0 0 0 0
45 A 45 2 0 0 0 0
46 A 46 2 0 0 0 0
47 A 47 2 0 0 0 0
48 A 48 3 0 0 0 0
49 A 49 2 0 0 0 0
50 A 50 2 0 0 0 0
51 B 1 2 0 0 0 0
52 B 2 2 0 0 0 0
53 B 3 2 0 0 0 0
54 B 4 2 0 0 0 0
55 B 5 2 0 0 0 0
56 B 6 2 0 0 0 0
57 B 7 3 0 0 0 0
58 B 8 2 0 0 0 0
59 B 9 3 0 0 0 0
60 B 10 3 1 0 1 0
61 B 11 2 0 0 1 0
62 B 12 3 0 0 1 0
63 B 13 2 0 0 1 0
64 B 14 3 0 0 1 0
65 B 15 2 0 0 1 0
66 B 16 2 0 0 1 0
67 B 17 2 0 0 1 0
68 B 18 2 0 0 1 0
69 B 19 2 0 1 1 0
70 B 20 2 0 0 0 0
71 B 21 3 1 0 1 0
72 B 22 3 0 0 1 0
73 B 23 3 0 0 1 0
74 B 24 3 0 0 1 0
75 B 25 NA 0 0 0 1
76 B 26 NA 0 0 0 1
77 B 27 NA 0 0 0 1
78 B 28 2 0 0 0 1
79 B 29 2 0 0 0 1
80 B 30 2 0 0 0 1
81 B 31 3 0 0 0 1
82 B 32 2 0 0 0 1
83 B 33 2 0 0 0 1
84 B 34 2 0 0 0 1
85 B 35 2 0 0 0 1
86 B 36 2 0 0 0 1
87 B 37 3 0 0 0 0
88 B 38 2 0 0 0 0
89 B 39 2 0 0 0 0
90 B 40 3 0 0 0 0
91 B 41 NA 0 0 0 0
92 B 42 NA 0 0 0 0
93 B 43 NA 0 0 0 0
94 B 44 3 1 0 1 0
95 B 45 3 0 0 1 0
96 B 46 3 0 0 1 0
97 B 47 3 0 0 1 0
98 B 48 3 0 0 1 0
99 B 49 3 0 0 1 0
100 B 50 2 0 0 1 0
Here is a try for a more elegant and scalable way to write your algorithm:
First, you do not have to compute the lead and lag calls before you can use case_when. Of note, I find it good practice to explicitly write the TRUE option of case_when. Here is some code.
df2=df %>%
mutate(
exacerbation_start = case_when(
is.na(symptom) ~ NA_real_,
symptom <= 2 ~ 0,
symptom >= 3 & symptom >= lag(symptom, n=2) & symptom <= lag(symptom, n=1) ~ 1,
TRUE ~ 0
),
exacerbation_end = case_when(
symptom >=3 ~ ifelse(lead(symptom, n=1) <=2 &
lead(symptom, n=2) <=2 & lead(symptom, n=3) <=2 &
lead(symptom, n=4) <=2 & lead(symptom, n=5) <=2,
1,0),
TRUE ~ NA_real_
)
)
all.equal(df1,df2) #TRUE
Alternatively, if your algorithm is the same for all symptoms, you might want to use custom functions:
get_exacerbation_start = function(x){
case_when(
is.na(x) ~ NA_real_,
x <= 2 ~ 0,
x >= 3 & x >= lag(x, n=2) & x <= lag(x, n=1) ~ 1,
TRUE ~ 0
)
}
get_exacerbation_end = function(x){
case_when(
x >=3 ~ ifelse(x >=3 & lead(x, n=1) <=2 &
lead(x, n=2) <=2 & lead(x, n=3) <=2 &
lead(x, n=4) <=2 & lead(x, n=5) <=2,
1,0),
TRUE ~ NA_real_
)
}
df3=df %>%
mutate(
exacerbation_start = get_exacerbation_start(symptom),
exacerbation_end = get_exacerbation_end(symptom)
)
all.equal(df1,df3) #also TRUE
This latter way might be even more powerful with some mutate_at calls.
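As a rough sketch of that idea, the helper can be applied to several symptom columns at once. The example uses across(), which superseded mutate_at() in dplyr 1.0.0; the data frame df_multi and its columns cough and breath are made up for illustration, and the helper is repeated so the block runs on its own.

```r
library(dplyr)

# Same rule as above, repeated here so the example is self-contained
get_exacerbation_start <- function(x) {
  case_when(
    is.na(x) ~ NA_real_,
    x <= 2 ~ 0,
    x >= 3 & x >= dplyr::lag(x, n = 2) & x <= dplyr::lag(x, n = 1) ~ 1,
    TRUE ~ 0
  )
}

# Hypothetical data with two symptom columns
df_multi <- data.frame(
  day = 1:4,
  cough = c(2, 3, 3, 2),
  breath = c(2, 3, 3, 3)
)

# Adds cough_start and breath_start next to the original columns
df_flags <- df_multi %>%
  mutate(across(c(cough, breath),
                list(start = get_exacerbation_start),
                .names = "{.col}_{.fn}"))
df_flags
```

With several participants, adding group_by(id) before the mutate() keeps lag() from reaching across the boundary between one participant and the next.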
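EDIT: after seeing your edit, here is an attempt to get the exacerbation period. The code is quite ugly in my opinion, and I'm not sure that row_number was supposed to be used this way. Note that, as written, it only covers the first episode in the data, because the [1] picks the first start and the first end.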
df_final=df %>%
transmute(
id,day,symptom,
start = get_exacerbation_start(symptom),
end = get_exacerbation_end(symptom),
exacerbation = row_number()>=which(start==1)[1] & row_number()<=which(end==1)[1]
)
I may come back with a less convoluted approach, but try this:
library(dplyr)
library(tidyr)
df %>%
group_by(id,
idx = with(
rle(
case_when(symptom <= 2 ~ 'normal',
symptom >= 3 ~ 'worse',
TRUE ~ symptom %>% as.character)),
rep(seq_along(lengths), lengths)
)
) %>%
mutate(
trajectory = case_when(cumsum(symptom <= 2) == 5 ~ 2, cumsum(symptom >= 3) == 2 ~ 1)
) %>%
group_by(id) %>% fill(trajectory) %>%
mutate(
trajectory = replace_na(trajectory, 0),
start = +(trajectory == 1 & lag(trajectory) == 2),
end = +(trajectory == 2 & lag(trajectory) == 1),
exacerbation = +(trajectory == 1 | start == 1 | end == 1)
) %>%
select(-idx, -trajectory) %>% as.data.frame
Output:
id day symptom start end exacerbation
1 A 1 2 0 0 0
2 A 2 2 0 0 0
3 A 3 2 0 0 0
4 A 4 2 0 0 0
5 A 5 2 0 0 0
6 A 6 2 0 0 0
7 A 7 2 0 0 0
8 A 8 2 0 0 0
9 A 9 2 0 0 0
10 A 10 2 0 0 0
11 A 11 2 0 0 0
12 A 12 3 0 0 0
13 A 13 2 0 0 0
14 A 14 2 0 0 0
15 A 15 2 0 0 0
16 A 16 2 0 0 0
17 A 17 NA 0 0 0
18 A 18 NA 0 0 0
19 A 19 NA 0 0 0
20 A 20 2 0 0 0
21 A 21 2 0 0 0
22 A 22 2 0 0 0
23 A 23 3 0 0 0
24 A 24 3 1 0 1
25 A 25 3 0 0 1
26 A 26 4 0 0 1
27 A 27 4 0 0 1
28 A 28 3 0 0 1
29 A 29 3 0 0 1
30 A 30 2 0 0 1
31 A 31 3 0 0 1
32 A 32 2 0 0 1
33 A 33 2 0 0 1
34 A 34 3 0 0 1
35 A 35 3 0 0 1
36 A 36 2 0 0 1
37 A 37 2 0 0 1
38 A 38 2 0 0 1
39 A 39 2 0 0 1
40 A 40 2 0 1 1
41 A 41 2 0 0 0
42 A 42 3 0 0 0
43 A 43 2 0 0 0
44 A 44 2 0 0 0
45 A 45 2 0 0 0
46 A 46 2 0 0 0
47 A 47 2 0 0 0
48 A 48 3 0 0 0
49 A 49 2 0 0 0
50 A 50 2 0 0 0
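For comparison, the rule can also be written as an explicit loop that tracks whether an episode is in progress. This is only a sketch under stated assumptions, not the method used above: flag_exacerbations is a hypothetical name, a missing day is assumed to break any run of consecutive days, the start is flagged on the second consecutive worse day, and the end on the fifth consecutive normal day. On the first example data set it gives the same start (day 24) and end (day 40) as the output above.

```r
# Hypothetical helper implementing the start/end rule as a small state machine
flag_exacerbations <- function(symptom, n_start = 2, n_end = 5) {
  n <- length(symptom)
  start <- integer(n)
  end <- integer(n)
  exacerbation <- integer(n)
  in_episode <- FALSE
  worse_run <- 0L   # consecutive days with symptom >= 3
  normal_run <- 0L  # consecutive days with symptom <= 2
  for (i in seq_len(n)) {
    s <- symptom[i]
    if (is.na(s)) {
      # Assumption: a missing day breaks both runs
      worse_run <- 0L
      normal_run <- 0L
    } else if (s >= 3) {
      worse_run <- worse_run + 1L
      normal_run <- 0L
    } else {
      normal_run <- normal_run + 1L
      worse_run <- 0L
    }
    # An episode can only start when none is in progress
    if (!in_episode && worse_run >= n_start) {
      in_episode <- TRUE
      start[i] <- 1L
    }
    # An episode can only end while one is in progress
    if (in_episode) {
      exacerbation[i] <- 1L
      if (normal_run >= n_end) {
        end[i] <- 1L
        in_episode <- FALSE
      }
    }
  }
  data.frame(start, end, exacerbation)
}

symptom <- c(2,2,2,2,2,2,2,2,2,2,2,3,2,2,2,2,NA,NA,NA,2,2,2,3,3,3,4,4,3,3,2,
             3,2,2,3,3,2,2,2,2,2,2,3,2,2,2,2,2,3,2,2)
res <- cbind(day = 1:50, symptom, flag_exacerbations(symptom))
res[res$start == 1 | res$end == 1, ]
```

For several participants, the function would need to be applied to each id separately, for example by splitting the data on id first.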