I am trying to duplicate "manually" the example in this Wikipedia article using R.
Here is the data:
after = c(125, 115, 130, 140, 140, 115, 140, 125, 140, 135)
before = c(110, 122, 125, 120, 140, 124, 123, 137, 135, 145)
sgn = sign(after-before)
abs = abs(after - before)
d = data.frame(after,before,sgn,abs)
after before sgn abs
1 125 110 1 15
2 115 122 -1 7
3 130 125 1 5
4 140 120 1 20
5 140 140 0 0
6 115 124 -1 9
7 140 123 1 17
8 125 137 -1 12
9 140 135 1 5
10 135 145 -1 10
If I try to rank the rows based on the abs column, the 0 entry is naturally ranked as 1:
rank = rank(abs)
(d = data.frame(after,before,sgn,abs,rank))
after before sgn abs rank
1 125 110 1 15 8.0
2 115 122 -1 7 4.0
3 130 125 1 5 2.5
4 140 120 1 20 10.0
5 140 140 0 0 1.0
6 115 124 -1 9 5.0
7 140 123 1 17 9.0
8 125 137 -1 12 7.0
9 140 135 1 5 2.5
10 135 145 -1 10 6.0
However, zeros are ignored in the Wilcoxon signed-rank test.
How can I get R to ignore that row, so as to end up with:
after before sgn abs rank
1 125 110 1 15 7.0
2 115 122 -1 7 3.0
3 130 125 1 5 1.5
4 140 120 1 20 9.0
5 140 140 0 0 0
6 115 124 -1 9 4.0
7 140 123 1 17 8.0
8 125 137 -1 12 6.0
9 140 135 1 5 1.5
10 135 145 -1 10 5.0
SOLUTION (accepted answer below):
after = c(125, 115, 130, 140, 140, 115, 140, 125, 140, 135)
before = c(110, 122, 125, 120, 140, 124, 123, 137, 135, 145)
sgn = sign(after-before)
abs = abs(after - before)
d = data.frame(after,before,sgn,abs)
d$rank = rank(replace(abs, abs==0, NA), na.last='keep')
d$multi = d$sgn * d$rank
(W=abs(sum(d$multi, na.rm = T)))
9
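As a sanity check, R's built-in wilcox.test() reports V, the sum of the positive ranks (it drops the zero pair and warns about ties); the Wikipedia W above can be recovered from it:
wilcox.test(after, before, paired = TRUE)  # V = 27
Nr <- 9                                    # non-zero pairs
abs(2 * 27 - Nr * (Nr + 1) / 2)            # |54 - 45| = 9, the same W as above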
From the Wikipedia article:
Exclude pairs with |x2,i − x1,i| = 0. Let Nr be the reduced sample size.
We need to exclude zeroes. By my thinking, you should replace zeroes with NA, and then tell rank() to exclude NAs from consideration when ranking. Since you need a vector of the same length as the input back, pass 'keep' as the na.last argument:
d$rank <- rank(replace(abs,abs==0,NA),na.last='keep');
d;
## after before sgn abs rank
## 1 125 110 1 15 7.0
## 2 115 122 -1 7 3.0
## 3 130 125 1 5 1.5
## 4 140 120 1 20 9.0
## 5 140 140 0 0 NA
## 6 115 124 -1 9 4.0
## 7 140 123 1 17 8.0
## 8 125 137 -1 12 6.0
## 9 140 135 1 5 1.5
## 10 135 145 -1 10 5.0
The subtraction-based solutions will not work if the input vector contains no zeroes, or more than one zero.
You could create the new column and then just update the rank where the abs value isn't 0
d$rank <- 0 # default value for rows with abs=0
d$rank[d$abs!=0] <- rank(d$abs[d$abs!=0])
If you wanted to drop the row completely, you could just do
transform(subset(d, abs!=0), rank=rank(abs))
A quick way to do it would be to rank as normal and then do:
d$rank <- ifelse(d$rank == 1, 0, d$rank - 1)
This switches all ranks of 1 to 0, and reduces any other ranks by 1.
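Note that this quick fix assumes exactly one zero difference; a hypothetical vector with two zeroes shows where it breaks:
a <- c(0, 0, 5, 9)
rank(a)                               # 1.5 1.5 3.0 4.0 -- the zeroes tie
ifelse(rank(a) == 1, 0, rank(a) - 1)  # 0.5 0.5 2.0 3.0 -- zeroes are not zeroed out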
Related
I am trying to determine, for each value of a variable, the nearest multiples above and below it, preferably in a dplyr::mutate.
In this simplified example, I want the multiple of 50 immediately above and below my values of x. I am under the impression cut is what I should use, but I haven't gotten it to work.
df <- data.frame(
x = c(265, 617, 88, 99, 143, 378)
)
x
1 265
2 617
3 88
4 99
5 143
6 378
desired_result <- data.frame(
x = c(265, 617, 88, 99, 143, 378),
above = c(300, 650, 100, 100, 150, 400),
below = c(250, 600, 50, 50, 100, 350)
)
x above below
1 265 300 250
2 617 650 600
3 88 100 50
4 99 100 50
5 143 150 100
6 378 400 350
df$above = ceiling(df$x/50)*50
df$below = floor(df$x/50)*50
x above below
1 265 300 250
2 617 650 600
3 88 100 50
4 99 100 50
5 143 150 100
6 378 400 350
You could use the modulus operator %%:
df %>%
mutate(above = x - (x %% 50) + 50,
below = x - (x %% 50))
Output:
x above below
1 265 300 250
2 617 650 600
3 88 100 50
4 99 100 50
5 143 150 100
6 378 400 350
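One boundary case worth knowing (it does not occur in the sample data): when x is already an exact multiple of 50, the two approaches disagree, so pick whichever convention you need:
x <- 100
ceiling(x/50) * 50    # 100 -- ceiling leaves an exact multiple in place
x - (x %% 50) + 50    # 150 -- the modulus version always moves up to the next multiple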
I have a data frame df:
library(tidyverse)
t <- c(103,104,108,120,127,129,140,142,150,151,160,177,178,183,186,187,191,194,198,199)
w <- c(1,1,1,-1,-1,-1,-1,-1,1,1,-1,-1,1,1,1,-1,1,1,-1,-1)
df <- data_frame(t, w)
> dput(df)
structure(list(t = c(103, 104, 108, 120, 127, 129, 140, 142,
150, 151, 160, 177, 178, 183, 186, 187, 191, 194, 198, 199),
w = c(1, 1, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, 1,
-1, 1, 1, -1, -1)), .Names = c("t", "w"), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
> df
# A tibble: 20 x 2
t w
<dbl> <dbl>
1 103 1.00
2 104 1.00
3 108 1.00
4 120 -1.00
5 127 -1.00
6 129 -1.00
7 140 -1.00
8 142 -1.00
9 150 1.00
10 151 1.00
11 160 -1.00
12 177 -1.00
13 178 1.00
14 183 1.00
15 186 1.00
16 187 -1.00
17 191 1.00
18 194 1.00
19 198 -1.00
20 199 -1.00
Now, if the value in w is larger than zero, find the nearest previous negative w, and assign the difference between the corresponding t values to a new column d. Otherwise, d is equal to zero. I.e. the desired output should look like this:
t w d
103 1.00 NA (there is no previous w < 0)
104 1.00 NA (there is no previous w < 0)
108 1.00 NA (there is no previous w < 0)
120 -1.00 0
127 -1.00 0
129 -1.00 0
140 -1.00 0
142 -1.00 0
150 1.00 8 = 150 - 142
151 1.00 9 = 151 - 142
160 -1.00 0
177 -1.00 0
178 1.00 1 = 178 - 177
183 1.00 6 = 183 - 177
186 1.00 9 = 186 - 177
187 -1.00 0
191 1.00 4 = 191 - 187
194 1.00 7 = 194 - 187
198 -1.00 0
199 -1.00 0
(The NAs above might be zero as well.)
Since yesterday I've been trying to attack this problem using findInterval(), which(), etc., but without success. Another idea was to somehow introduce a variable shift into the lag() function...
Ideally, I would like to have a tidyverse-like solution.
Any help would be very much appreciated.
Thank you in advance!
Using data.table (since tidyverse currently has no non-equi joins):
library(data.table)
DT = data.table(df)
DT[, v := 0]
DT[w > 0, v :=
DT[w < 0][.SD, on=.(t < t), mult="last", i.t - x.t]
]
t w v
1: 103 1 NA
2: 104 1 NA
3: 108 1 NA
4: 120 -1 0
5: 127 -1 0
6: 129 -1 0
7: 140 -1 0
8: 142 -1 0
9: 150 1 8
10: 151 1 9
11: 160 -1 0
12: 177 -1 0
13: 178 1 1
14: 183 1 6
15: 186 1 9
16: 187 -1 0
17: 191 1 4
18: 194 1 7
19: 198 -1 0
20: 199 -1 0
It initializes the new column to 0, then replaces it on the subset of rows where w > 0. The replacement uses a join of the subset of data, .SD, where w > 0 to the part of the table where w < 0, DT[w < 0]. The join syntax is x[i, on=, j] where in this case...
x = DT[w < 0]
i = .SD = DT[w > 0]
The join uses each row of i to look up rows in x based on the rules in on=. When multiple matches are found, we take only the last (mult = "last").
j is what we use the join to do, here calculate the difference between two columns. To disambiguate columns from each table, we use prefixes x.* and i.*.
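If the x[i, on=, j] syntax is unfamiliar, here is a minimal toy illustration of the same non-equi join pattern (hypothetical two-row tables):
library(data.table)
x <- data.table(t = c(5, 10))
i <- data.table(t = c(7, 12))
# for each row of i, find rows of x with x.t < i.t; keep only the last match
x[i, on = .(t < t), mult = "last", .(i.t, x.t, diff = i.t - x.t)]
#    i.t x.t diff
# 1:   7   5    2
# 2:  12  10    2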
Using cummax. I'm not sure if this generalizes, but it works for the example:
DT[, v := t - cummax(t*(w < 0))]
DT[cumsum(w < 0) == 0, v := NA]
I guess this requires that the t column is sorted in increasing order.
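The same cummax idea translates to dplyr if you prefer a tidyverse pipeline; a sketch, under the same assumption that t is sorted:
library(dplyr)
df %>%
  mutate(d = t - cummax(t * (w < 0)),           # t minus the last negative-w t
         d = ifelse(cumsum(w < 0) == 0, NA, d)) # NA before any negative w appears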
A tidyverse way:
First, make an intermediate column (t2) that is NA where w is positive and t where w is negative:
df <- mutate(df, t2 = case_when(w > 0 ~ NA_real_, TRUE ~ t))
#fill NA in t2 so that for each row, t2 is value of t when w was last neg
df <- fill(df, t2)
#> df
# A tibble: 20 x 3
# t w t2
# <dbl> <dbl> <dbl>
# 1 103 1 NA
# 2 104 1 NA
# 3 108 1 NA
# 4 120 -1 120
# 5 127 -1 127
# 6 129 -1 129
# 7 140 -1 140
# 8 142 -1 142
# 9 150 1 142
#10 151 1 142
#11 160 -1 160
#12 177 -1 177
#13 178 1 177
#14 183 1 177
#15 186 1 177
#16 187 -1 187
#17 191 1 187
#18 194 1 187
#19 198 -1 198
#20 199 -1 199
Then subtract t2 from t
df$d <- with(df, t - t2)
#> df
# A tibble: 20 x 4
# t w t2 d
# <dbl> <dbl> <dbl> <dbl>
# 1 103 1 NA NA
# 2 104 1 NA NA
# 3 108 1 NA NA
# 4 120 -1 120 0
# 5 127 -1 127 0
# 6 129 -1 129 0
# 7 140 -1 140 0
# 8 142 -1 142 0
# 9 150 1 142 8
#10 151 1 142 9
#11 160 -1 160 0
#12 177 -1 177 0
#13 178 1 177 1
#14 183 1 177 6
#15 186 1 177 9
#16 187 -1 187 0
#17 191 1 187 4
#18 194 1 187 7
#19 198 -1 198 0
#20 199 -1 199 0
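For reference, the same steps collapse into a single pipe; if_else() with NA_real_ keeps the column type stable:
library(dplyr)
library(tidyr)
df %>%
  mutate(t2 = if_else(w < 0, t, NA_real_)) %>%  # keep t only where w is negative
  fill(t2) %>%                                  # carry the last such t forward
  mutate(d = t - t2) %>%
  select(-t2)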
PROBLEM
I need to impute the NA's in my data frame, which comes from a repeated-measures study. For this particular outcome, each NA should take the last observed non-NA value, plus 1 for every full 52-week interval elapsed since the last observed value.
EXAMPLE
An example data frame with the target imputation goal included.
df <- data.frame(
subject = rep(1:3, each = 12),
week = rep(c(8, 10, 12, 16, 20, 26, 32, 44, 52, 64, 78, 104),3),
value = c(112, 97, 130, 104, NA, NA, NA, NA, NA, NA, NA, NA,
89, 86, 94, 96, 88,107, 110, 102, 107, NA, NA, NA,
107, 110, 102, 130, 104, 88, 82, 79, 92, 106, NA, NA),
goal = c(112, 97, 130, 104, 104, 104, 104, 104, 104, 104, 105, 105,
89, 86, 94, 96, 88,107, 110, 102, 107, 107,107, 108,
107, 110, 102, 130, 104, 88, 82, 79, 92, 106, 106, 106)
)
I left the intermediate columns in to make what's happening more obvious, but you can remove them with a simple select.
library(dplyr)
df = df %>%
group_by(subject) %>%
mutate(last_obs_week = max(week[!is.na(value)]),
since_last_week = pmax(0, week - last_obs_week),
inc_52 = since_last_week %/% 52,
result = zoo::na.locf(value) + inc_52
)
all(df$goal == df$result)
# [1] TRUE
print.data.frame(df)
# subject week value goal last_obs_week since_last_week inc_52 result
# 1 1 8 112 112 16 0 0 112
# 2 1 10 97 97 16 0 0 97
# 3 1 12 130 130 16 0 0 130
# 4 1 16 104 104 16 0 0 104
# 5 1 20 NA 104 16 4 0 104
# 6 1 26 NA 104 16 10 0 104
# 7 1 32 NA 104 16 16 0 104
# 8 1 44 NA 104 16 28 0 104
# 9 1 52 NA 104 16 36 0 104
# 10 1 64 NA 104 16 48 0 104
# 11 1 78 NA 105 16 62 1 105
# 12 1 104 NA 105 16 88 1 105
# 13 2 8 89 89 52 0 0 89
# ...
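One caveat with this approach: zoo::na.locf() drops leading NAs by default, which would break the mutate() if a subject's first value were missing (not the case in this data). Passing na.rm = FALSE guards against it:
zoo::na.locf(c(NA, 1, NA, 2))                  # 1 1 2 -- shorter than the input!
zoo::na.locf(c(NA, 1, NA, 2), na.rm = FALSE)   # NA 1 1 2 -- safe inside mutate()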
One can use dplyr and tidyr::fill to get the desired result. The logic: add a column tracking the week of the last non-NA value, use tidyr::fill to carry that week (and the value) forward, then increase the value by 1 for every full 52 weeks between the current week and that last non-NA week.
library(dplyr)
library(tidyr)
df %>% group_by(subject) %>%
mutate(weekWithLastNonNaValue = ifelse(is.na(value), NA, week)) %>%
fill(value, weekWithLastNonNaValue) %>%
mutate(value = value + (week-weekWithLastNonNaValue) %/% 52) %>%
select(-weekWithLastNonNaValue) %>%
as.data.frame()
# subject week value goal
# 1 1 8 112 112
# 2 1 10 97 97
# 3 1 12 130 130
# 4 1 16 104 104
# 5 1 20 104 104
# 6 1 26 104 104
# 7 1 32 104 104
# 8 1 44 104 104
# 9 1 52 104 104
# 10 1 64 104 104
# 11 1 78 105 105
# 12 1 104 105 105
# 13 2 8 89 89
# 14 2 10 86 86
# 15 2 12 94 94
# 16 2 16 96 96
# 17 2 20 88 88
# 18 2 26 107 107
# 19 2 32 110 110
# 20 2 44 102 102
#
# so on
#
I thought this would be a trivial thing to do but I still have some trouble adjusting to writing code instead of pointing and clicking on a spreadsheet.
month = as.integer(c(1,2,3,4,5,6,7,8,9,10,11,12))
remaining = c(1000,925,852,790,711,658,601,567,530,501,485,466)
left = c(75, 73, 62, 79, 53, 57, 34, 37, 29, 16, 19, 0)
KPdata = data.frame(month, remaining, left)
> KPdata
month remaining left
1 1 1000 75
2 2 925 73
3 3 852 62
4 4 790 79
5 5 711 53
6 6 658 57
7 7 601 34
8 8 567 37
9 9 530 29
10 10 501 16
11 11 485 19
12    12       466    0
How do I calculate the Kaplan-Meier survival function at each month? Note that I want to do this manually; I am aware that there are packages which will do it for me.
I think this is what you're trying to do. We use dplyr::lag() (base R's stats::lag() won't shift a plain vector) and cumprod to get a manual KM estimator:
library(dplyr)  # for lag()
KPdata$KM_init <- lag((KPdata$remaining - KPdata$left) / KPdata$remaining)
KPdata[1,ncol(KPdata)] <- 1
KPdata$KM_final <- cumprod(KPdata$KM_init)
KPdata
month remaining left KM_init KM_final
1 1 1000 75 1.0000000 1.000
2 2 925 73 0.9250000 0.925
3 3 852 62 0.9210811 0.852
4 4 790 79 0.9272300 0.790
5 5 711 53 0.9000000 0.711
6 6 658 57 0.9254571 0.658
7 7 601 34 0.9133739 0.601
8 8 567 37 0.9434276 0.567
9 9 530 29 0.9347443 0.530
10 10 501 16 0.9452830 0.501
11 11 485 19 0.9680639 0.485
12 12 466 0 0.9608247 0.466
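A quick sanity check: with no censoring, the KM estimate at each month reduces to the fraction still remaining, which is exactly what the KM_final column shows:
all.equal(KPdata$KM_final, KPdata$remaining / KPdata$remaining[1])
# [1] TRUE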
Alternatively, I think there's a different form of a KM estimator that looks like this (note that I've added a row corresponding to month = 0):
month = as.integer(c(0,1,2,3,4,5,6,7,8,9,10,11,12))
remaining = c(1000,1000,925,852,790,711,658,601,567,530,501,485,466)
left = c(0,75, 73, 62, 79, 53, 57, 34, 37, 29, 16, 19, 0)
KPdata2 = data.frame(month, remaining, left)
KPdata2$KM_init <- (KPdata2$remaining - KPdata2$left) / KPdata2$remaining
KPdata2$KM_final <- cumprod(KPdata2$KM_init)
KPdata2
month remaining left KM_init KM_final
1 0 1000 0 1.0000000 1.000
2 1 1000 75 0.9250000 0.925
3 2 925 73 0.9210811 0.852
4 3 852 62 0.9272300 0.790
5 4 790 79 0.9000000 0.711
6 5 711 53 0.9254571 0.658
7 6 658 57 0.9133739 0.601
8 7 601 34 0.9434276 0.567
9 8 567 37 0.9347443 0.530
10 9 530 29 0.9452830 0.501
11 10 501 16 0.9680639 0.485
12 11 485 19 0.9608247 0.466
13 12 466 0 1.0000000 0.466
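If you want to verify the hand calculation against the survival package (assuming it is installed), expand the table into one record per subject; survfit()'s estimates at the event months reproduce the KM_final column for months 1-11:
library(survival)
# deaths occur at their month; the 466 still alive are censored at month 12
time   <- c(rep(KPdata2$month, KPdata2$left), rep(12, 466))
status <- c(rep(1, sum(KPdata2$left)), rep(0, 466))  # 1 = died, 0 = censored
fit <- survfit(Surv(time, status) ~ 1)
summary(fit)$surv  # 0.925 0.852 0.790 ... 0.485 0.466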
I was so taken with the question and #bouncyball's inspiring answer, I thought I'd add my ha'penny worth with an attempt at handling censoring. This is intended to be in the spirit of the original question - doing things 'handraulically' to develop key insights.
## rename remaining -> survived; left -> died
month = as.integer(c(1,2,3,4,5,6,7,8,9,10,11,12))
survived = c(1000,925,852,790,711,658,601,567,530,501,485,466)
died = c(75, 73, 62, 79, 53, 57, 34, 37, 29, 16, 19, 0)
## arbitrary censoring: 10 per time period
censored <- c(0, rep(10,11))
## define those at risk <= those who survived
## awful bit of R fiddling for (something simple like) offsetting the index in base R
len <- length(month)
at.risk <- c(survived[1],
             survived[-len] - died[-len] - cumsum(censored[-len]) )
## note use of cumsum()
## at.risk must exist before we build the data frame
KPdata3 = data.frame(month, at.risk, censored, died, survived)
## censoring uses at risk, rather than survived/remained
KPdata3$KM_increment <- (KPdata3$at.risk - KPdata3$died)/ KPdata3$at.risk
## code credit to #bouncyball
KPdata3$KM_cumulative <- cumprod(KPdata3$KM_increment)
KPdata3
Gives this.....
month at.risk censored died survived KM_increment KM_cumulative
1 1 1000 0 75 1000 0.9250000 0.9250000
2 2 925 10 73 925 0.9210811 0.8520000
3 3 842 10 62 852 0.9263658 0.7892637
4 4 770 10 79 790 0.8974026 0.7082873
5 5 681 10 53 711 0.9221733 0.6531636
6 6 618 10 57 658 0.9077670 0.5929203
7 7 551 10 34 601 0.9382940 0.5563336
8 8 507 10 37 567 0.9270217 0.5157333
9 9 460 10 29 530 0.9369565 0.4832197
10 10 421 10 16 501 0.9619952 0.4648551
11 11 395 10 19 485 0.9518987 0.4424949
12 12 366 10 0 466 1.0000000 0.4424949
Setting the censoring to zero (censored <- c(0, rep(0,11))) gives the same answer as #bouncyball's.
I have the following dataset.
dat2 <- read.table(header=TRUE, text="
ID De Ep Ti ID1
1123 113 121 100 11231
1123 105 107 110 11232
1134 122 111 107 11241
1134 117 120 111 11242
1154 122 116 109 11243
1165 108 111 118 11251
1175 106 115 113 11252
1185 113 104 108 11253
1226 109 119 116 11261
")
dat2
    ID  De  Ep  Ti   ID1
1 1123 113 121 100 11231
2 1123 105 107 110 11232
3 1134 122 111 107 11241
4 1134 117 120 111 11242
5 1154 122 116 109 11243
6 1165 108 111 118 11251
7 1175 106 115 113 11252
8 1185 113 104 108 11253
9 1226 109 119 116 11261
I want to recode the first two columns with the following numeric labels, but cut() turns them into factors.
dat2$ID <- cut(dat2$ID, breaks=c(0,1124,1154,1184,Inf),
labels=c(5, 25, 55, 75))
table(dat2$ID)
5 25 55 75
2 3 2 2
dat2$De <- cut(dat2$De, breaks=c(0,110,118,125,Inf),
labels=c(10, 20, 30, 40))
table(dat2$De)
10 20 30 40
4 3 2 0
str(dat2)
'data.frame': 9 obs. of 5 variables:
$ ID : Factor w/ 4 levels "5","25","55",..: 1 1 2 2 2 3 3 4 4
$ De : Factor w/ 4 levels "10","20","30",..: 2 1 3 2 3 1 1 2 1
$ Ep : int 121 107 111 120 116 111 115 104 119
$ Ti : int 100 110 107 111 109 118 113 108 116
$ ID1: int 11231 11232 11241 11242 11243 11251 11252 11253 11261
I used as.numeric to convert them back to numeric, but that creates new labels (1, 2, 3, ...), which is not what I want. I need a simple line of code to do the transformation.
dat2$ID <- as.numeric(dat2$ID)
table(dat2$ID)
1 2 3 4
2 3 2 2
dat2$De <- as.numeric(dat2$De)
table(dat2$De)
1 2 3
4 3 2
In your case it will probably be more efficient to use findInterval directly, instead of converting numeric to factor and then back to numeric, as shown here:
c(5, 25, 55, 75)[findInterval(dat2$ID, c(0, 1124, 1154, 1184, Inf))]
## [1] 5 5 25 25 55 55 55 75 75
Or (as per the second column)
c(10, 20, 30, 40)[findInterval(dat2$De, c(0, 110, 118, 125, Inf))]
## [1] 20 10 30 20 30 10 10 20 10
This is almost equivalent to using cut, but it returns the numeric values directly. Note one difference in the outputs above: findInterval uses left-closed intervals by default while cut uses right-closed ones, which is why the boundary value 1154 lands in different bins (55 vs 25).
cut(dat2$ID, breaks=c(0, 1124, 1154, 1184, Inf), labels=c(5, 25, 55, 75))
# [1] 5 5 25 25 25 55 55 75 75
# Levels: 5 25 55 75
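As an aside, if you already have the factor that cut() produced, the usual idiom to recover the numeric labels (rather than the 1, 2, 3 level codes that plain as.numeric() returns) goes through levels(); the benchmark below uses the same trick:
f <- cut(dat2$ID, breaks=c(0, 1124, 1154, 1184, Inf), labels=c(5, 25, 55, 75))
as.numeric(levels(f))[f]   # 5 5 25 25 25 55 55 75 75
# or equivalently: as.numeric(as.character(f))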
Here's a quick benchmark showing a ~18x speed improvement:
set.seed(123)
x <- sample(1e8, 1e7, replace = TRUE)
system.time({
res1 <- cut(x, breaks = c(0, 1e4, 1e5, 1e6, Inf), labels = c(5, 25, 55, 75))
res1 <- as.numeric(levels(res1))[res1]
})
# user system elapsed
# 3.40 0.09 3.51
system.time(res2 <- c(5, 25, 55, 75)[findInterval(x, c(0, 1e4, 1e5, 1e6, Inf))])
# user system elapsed
# 0.18 0.03 0.20
identical(res1, res2)
## [1] TRUE