R Impute NA's by Linear Increase Depending on Time Interval

PROBLEM
I need to impute the NA's in my data frame, which comes from a repeated-measures study. For this particular outcome, each NA should be filled with the last observed non-NA value, increased by 1 for every full 52-week interval elapsed since the week of that last observation.
EXAMPLE
An example data frame with the target imputation goal included.
df <- data.frame(
  subject = rep(1:3, each = 12),
  week = rep(c(8, 10, 12, 16, 20, 26, 32, 44, 52, 64, 78, 104), 3),
  value = c(112, 97, 130, 104, NA, NA, NA, NA, NA, NA, NA, NA,
            89, 86, 94, 96, 88, 107, 110, 102, 107, NA, NA, NA,
            107, 110, 102, 130, 104, 88, 82, 79, 92, 106, NA, NA),
  goal = c(112, 97, 130, 104, 104, 104, 104, 104, 104, 104, 105, 105,
           89, 86, 94, 96, 88, 107, 110, 102, 107, 107, 107, 108,
           107, 110, 102, 130, 104, 88, 82, 79, 92, 106, 106, 106)
)

I left the intermediate columns in to make what's happening more obvious, but you can remove them with a simple select.
library(dplyr)  # zoo must also be installed for zoo::na.locf

df = df %>%
  group_by(subject) %>%
  mutate(last_obs_week = max(week[!is.na(value)]),        # week of the last non-NA value
         since_last_week = pmax(0, week - last_obs_week), # weeks elapsed since then
         inc_52 = since_last_week %/% 52,                 # full 52-week intervals elapsed
         result = zoo::na.locf(value) + inc_52            # carry last value forward, then increment
  )
all(df$goal == df$result)
# [1] TRUE
print.data.frame(df)
#    subject week value goal last_obs_week since_last_week inc_52 result
# 1        1    8   112  112            16               0      0    112
# 2        1   10    97   97            16               0      0     97
# 3        1   12   130  130            16               0      0    130
# 4        1   16   104  104            16               0      0    104
# 5        1   20    NA  104            16               4      0    104
# 6        1   26    NA  104            16              10      0    104
# 7        1   32    NA  104            16              16      0    104
# 8        1   44    NA  104            16              28      0    104
# 9        1   52    NA  104            16              36      0    104
# 10       1   64    NA  104            16              48      0    104
# 11       1   78    NA  105            16              62      1    105
# 12       1  104    NA  105            16              88      1    105
# 13       2    8    89   89            52               0      0     89
# ...
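As noted, the helper columns can then be dropped with a simple select (a minimal follow-up, assuming dplyr is still loaded):
df = df %>%
  select(-last_obs_week, -since_last_week, -inc_52)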

One can use dplyr and tidyr::fill to get the desired result. The logic: add a column that records the week of each non-NA value, use tidyr::fill to carry both the value and that week forward, and then increase the carried value by 1 for every full 52 weeks between the current week and the last non-NA week.
library(dplyr)
library(tidyr)
df %>%
  group_by(subject) %>%
  mutate(weekWithLastNonNaValue = ifelse(is.na(value), NA, week)) %>%
  fill(value, weekWithLastNonNaValue) %>%
  mutate(value = value + (week - weekWithLastNonNaValue) %/% 52) %>%
  select(-weekWithLastNonNaValue) %>%
  as.data.frame()
# subject week value goal
# 1 1 8 112 112
# 2 1 10 97 97
# 3 1 12 130 130
# 4 1 16 104 104
# 5 1 20 104 104
# 6 1 26 104 104
# 7 1 32 104 104
# 8 1 44 104 104
# 9 1 52 104 104
# 10 1 64 104 104
# 11 1 78 105 105
# 12 1 104 105 105
# 13 2 8 89 89
# 14 2 10 86 86
# 15 2 12 94 94
# 16 2 16 96 96
# 17 2 20 88 88
# 18 2 26 107 107
# 19 2 32 110 110
# 20 2 44 102 102
# ... and so on for subjects 2 and 3
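As with the first approach, the result can be checked against the goal column (a quick check, assuming the pipeline output above is assigned to a variable, say out):
all(out$value == out$goal)
# [1] TRUE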

Sample part of a dataset while keeping subgroups intact

I have a dataframe that I would like to split into a 75% part and a 25% part.
I thought a good first step would be to create the 25% dataset from the original by randomly sampling a quarter of the data.
However, the sampling shouldn't be entirely random: I want to keep the groups of a certain variable intact.
So with the example below, I want to randomly sample 1/4 of the data frame, but the data needs to remain grouped by the 'team' variable. I have 8 teams, so I want to randomly sample 2 teams.
Data example (dput below)
team points assists
1 1 99 33
2 1 90 28
3 1 86 31
4 1 88 39
5 2 95 34
6 2 92 30
7 2 91 32
8 2 79 35
9 3 85 36
10 3 90 29
11 3 91 24
12 3 97 26
13 4 96 28
14 4 94 18
15 4 95 19
16 4 98 25
17 5 78 36
18 5 80 34
19 5 85 39
20 5 89 33
21 6 94 34
22 6 85 39
23 6 99 28
24 6 79 31
25 7 78 35
26 7 99 29
27 7 98 36
28 7 75 39
29 8 97 33
30 8 68 26
31 8 86 38
32 8 76 31
I've tried slice_sample from dplyr, but it does the exact opposite of what I want (it samples within every team):
testdata <- df %>% group_by(team) %>% slice_sample(n = 2)
My code results in
team points assists
<dbl> <dbl> <dbl>
1 1 90 28
2 1 99 33
3 2 95 34
4 2 92 30
5 3 91 24
6 3 85 36
7 4 95 19
8 4 98 25
9 5 80 34
10 5 78 36
11 6 85 39
12 6 94 34
13 7 78 35
14 7 98 36
15 8 76 31
16 8 86 38
Example of the dataframe:
structure(list(team = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8), points = c(99,
90, 86, 88, 95, 92, 91, 79, 85, 90, 91, 97, 96, 94, 95, 98, 78,
80, 85, 89, 94, 85, 99, 79, 78, 99, 98, 75, 97, 68, 86, 76),
assists = c(33, 28, 31, 39, 34, 30, 32, 35, 36, 29, 24, 26,
28, 18, 19, 25, 36, 34, 39, 33, 34, 39, 28, 31, 35, 29, 36,
39, 33, 26, 38, 31)), class = "data.frame", row.names = c(NA,
-32L))
With dplyr, if you group_by(team) and then sample, you are sampling within each team, which is the opposite of what you want. A direct approach is to sample the team IDs first and then filter:
test_teams = sample(unique(dataset$team), size = 2)
test = dataset %>% filter(team %in% test_teams)
train = dataset %>% filter(!team %in% test_teams)
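A quick sanity check that the split keeps teams whole (run after the code above; call set.seed first if you need a reproducible draw):
intersect(unique(test$team), unique(train$team))
# numeric(0): no team appears in both sets
nrow(test) / nrow(dataset)
# [1] 0.25 (2 of 8 teams, 4 rows each)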
An alternative uses caTools::sample.split. Note, however, that sample.split stratifies on its first argument (it preserves the relative ratio of each team label in both sets), so every team ends up split across both sets rather than staying intact:
library(caTools)
split <- sample.split(dataset$team, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

Calculating percentiles across certain rows in a data frame in r

My data has a temperature measurement for each day of the year, plus other variables needed for the analysis, by villageID. I would like to create a new variable holding the 95th-percentile threshold of the 365 temperature measurements for each village.
My data is in wide format and looks like this:
villageID temp1 temp2 temp3.... temp365 otherVars
1 1 70 86 98 79 x
2 2 73 89 99 86 x
3 3 71 82 96 75 x
4 4 78 79 94 81 x
5 5 90 91 89 85 x
I would like to create this 95th-percentile variable, i.e. the temperature at which the 95th percentile starts, computed across all the temperature columns [2:366], while keeping all other variables the same.
Like this:
villageID temp1 temp2 temp3 .....temp365 otherVars 95per
1 1 70 86 98 79 x 81
2 2 73 89 99 86 x 90
3 3 71 82 96 75 x 86
4 4 78 79 94 81 x 82
5 5 90 91 89 85 x 99
Although I think you should keep your data in long format, here is some code that will compute it and put it back into the wide format you have. Just know that this is often not the best way to go about things, especially if you want to plot your data later:
library(tidyverse)
dat <- tribble(~"villageID", ~"temp1", ~"temp2", ~"temp3", ~"temp365",
1, 70, 86, 98, 79,
2, 73, 89, 99, 86,
3, 71, 82, 96, 75,
4, 78, 79, 94, 81,
5, 90, 91, 89, 85)
dat %>%
  gather(key = "day", value = "temp", -villageID) %>%
  group_by(villageID) %>%
  mutate(perc_95 = quantile(temp, probs = .95)) %>%
  spread(day, temp)
#> # A tibble: 5 x 6
#> # Groups: villageID [5]
#> villageID perc_95 temp1 temp2 temp3 temp365
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 96.2 70 86 98 79
#> 2 2 97.5 73 89 99 86
#> 3 3 93.9 71 82 96 75
#> 4 4 92.0 78 79 94 81
#> 5 5 90.8 90 91 89 85
Created on 2019-02-27 by the reprex package (v0.2.1)
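gather() and spread() still work but have been superseded in tidyr; here is a sketch of the same computation with the newer verbs (assuming tidyr >= 1.0):
dat %>%
  pivot_longer(-villageID, names_to = "day", values_to = "temp") %>%
  group_by(villageID) %>%
  mutate(perc_95 = quantile(temp, probs = .95)) %>%
  pivot_wider(names_from = day, values_from = temp)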
In base R it would just be (assuming that only the temperature columns have the string "temp" in their names):
dfrm$temp95perc <-
  apply(dfrm[, grep("temp", names(dfrm))],  # select just the tempNNN columns
        1,                                  # row-wise calculations
        quantile, probs = 0.95)             # pass probs along to quantile
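Applied to the five-row dat tribble from the answer above, this reproduces the tidyverse result (a quick check; the grep pattern assumes only temperature columns contain "temp"):
dat$temp95perc <- apply(dat[, grep("temp", names(dat))], 1, quantile, probs = 0.95)
dat$temp95perc
# [1] 96.20 97.50 93.90 92.05 90.85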

Create vector by matching vector to a dataframe [R]

I have the following dataframe:
> zCode <- sample(50:150, size = 10, replace = TRUE)
> x <- sample(50:150, size = 10, replace = TRUE)
> test <- data.frame(zCode, x)
> test
zCode x
1 110 114
2 108 150
3 57 100
4 53 98
5 114 67
6 143 126
7 110 95
8 106 101
9 103 70
10 149 73
I also have this vector:
> z <- c(53, 57, 110)
> z
[1] 53 57 110
I want to create a new data frame, based on vector z, that pulls the maximum x value associated with each zCode, like so:
Z x
53 98
57 100
110 114
Here are some possibilities. They do not use any packages.
1) For each element of z, take the subset of rows in test with that zCode and then the maximum of its x values:
data.frame(z, x = sapply(z, function(zi) max(subset(test, zCode == zi)$x)))
giving:
z x
1 53 98
2 57 100
3 110 114
2) Another approach is to use aggregate to find all the maxima, and then merge with z to get just those:
merge(data.frame(z), aggregate(x ~ zCode, test, max), by = 1, all.x = TRUE)
giving:
z x
1 53 98
2 57 100
3 110 114
Note: the input used, in reproducible form, is:
Lines <- "
zCode x
1 110 114
2 108 150
3 57 100
4 53 98
5 114 67
6 143 126
7 110 95
8 106 101
9 103 70
10 149 73"
test <- read.table(text = Lines)
z <- c(53, 57, 110)
Here is a data.table solution:
library(data.table)

# Original data
dt <- data.table(zCode = c(110, 108, 57, 53, 114, 143, 110, 106, 103, 149),
                 x = c(114, 150, 100, 98, 67, 126, 95, 101, 70, 73))
z <- c(53, 57, 110)
# a new dataframe based on vector z
dt[zCode %in% z, max(x), by = zCode]
zCode V1
1: 110 114
2: 57 100
3: 53 98
EDIT:
# Keeps the columns names unchanged
dt[zCode %in% z, .(x = max(x)), by = zCode]
zCode x
1: 110 114
2: 57 100
3: 53 98
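For completeness, the same lookup with dplyr (a sketch using the reproducible test and z defined above):
library(dplyr)
test %>%
  filter(zCode %in% z) %>%
  group_by(zCode) %>%
  summarise(x = max(x))
#   zCode     x
# 1    53    98
# 2    57   100
# 3   110   114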

Rank() in R excluding zeros

I am trying to duplicate "manually" the example in this Wikipedia post (on the Wilcoxon signed-rank test) using R.
Here is the data:
after = c(125, 115, 130, 140, 140, 115, 140, 125, 140, 135)
before = c(110, 122, 125, 120, 140, 124, 123, 137, 135, 145)
sgn = sign(after-before)
abs = abs(after - before)
d = data.frame(after,before,sgn,abs)
after before sgn abs
1 125 110 1 15
2 115 122 -1 7
3 130 125 1 5
4 140 120 1 20
5 140 140 0 0
6 115 124 -1 9
7 140 123 1 17
8 125 137 -1 12
9 140 135 1 5
10 135 145 -1 10
If I try to rank the rows based on the abs column, the 0 entry is naturally ranked as 1:
rank = rank(abs)
(d = data.frame(after,before,sgn,abs,rank))
after before sgn abs rank
1 125 110 1 15 8.0
2 115 122 -1 7 4.0
3 130 125 1 5 2.5
4 140 120 1 20 10.0
5 140 140 0 0 1.0
6 115 124 -1 9 5.0
7 140 123 1 17 9.0
8 125 137 -1 12 7.0
9 140 135 1 5 2.5
10 135 145 -1 10 6.0
However, zeros are ignored in the Wilcoxon signed-rank test.
How can I get R to ignore that row, so as to end up with:
after before sgn abs rank
1 125 110 1 15 7.0
2 115 122 -1 7 3.0
3 130 125 1 5 1.5
4 140 120 1 20 9.0
5 140 140 0 0 0
6 115 124 -1 9 4.0
7 140 123 1 17 8.0
8 125 137 -1 12 6.0
9 140 135 1 5 1.5
10 135 145 -1 10 5.0
SOLUTION (accepted answer below):
after = c(125, 115, 130, 140, 140, 115, 140, 125, 140, 135)
before = c(110, 122, 125, 120, 140, 124, 123, 137, 135, 145)
sgn = sign(after-before)
abs = abs(after - before)
d = data.frame(after,before,sgn,abs)
d$rank = rank(replace(abs, abs == 0, NA), na.last = 'keep')
d$multi = d$sgn * d$rank
(W = abs(sum(d$multi, na.rm = TRUE)))
# [1] 9
From the Wikipedia article:
Exclude pairs with |x2,i − x1,i| = 0. Let Nr be the reduced sample size.
We need to exclude zeroes. By my thinking, you should replace the zeroes with NA and then tell rank() to exclude NAs from the ranking. Since you need to return a vector of the same length as the input, pass na.last = 'keep' so the NAs stay in place:
d$rank <- rank(replace(abs, abs == 0, NA), na.last = 'keep')
d
## after before sgn abs rank
## 1 125 110 1 15 7.0
## 2 115 122 -1 7 3.0
## 3 130 125 1 5 1.5
## 4 140 120 1 20 9.0
## 5 140 140 0 0 NA
## 6 115 124 -1 9 4.0
## 7 140 123 1 17 8.0
## 8 125 137 -1 12 6.0
## 9 140 135 1 5 1.5
## 10 135 145 -1 10 5.0
Note that subtraction-based solutions, like the last one below, will not work if the input vector contains no zeroes or more than one zero.
You could create the new column and then just update the rank where the abs value isn't 0:
d$rank <- 0 # default value for rows with abs=0
d$rank[d$abs!=0] <- rank(d$abs[d$abs!=0])
If you wanted to drop the row completely, you could just do
transform(subset(d, abs!=0), rank=rank(abs))
A quick way to do it would be to rank as normal and then do:
d$rank <- ifelse(d$rank == 1, 0, d$rank - 1)
This switches the rank of 1 (the zero difference) to 0 and reduces every other rank by 1. As noted above, it only works when there is exactly one zero.
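A minimal check that the NA-based approach copes with any number of zeroes, unlike the subtraction trick (a sketch on made-up values containing two zeroes):
v <- c(0, 3, 0, 5)
rank(replace(v, v == 0, NA), na.last = 'keep')
# [1] NA  1 NA  2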

Change labels from numeric to numeric

I have the following dataset.
dat2 <- read.table(header=TRUE, text="
ID De Ep Ti ID1
1123 113 121 100 11231
1123 105 107 110 11232
1134 122 111 107 11241
1134 117 120 111 11242
1154 122 116 109 11243
1165 108 111 118 11251
1175 106 115 113 11252
1185 113 104 108 11253
1226 109 119 116 11261
")
dat2
    ID  De  Ep  Ti   ID1
1 1123 113 121 100 11231
2 1123 105 107 110 11232
3 1134 122 111 107 11241
4 1134 117 120 111 11242
5 1154 122 116 109 11243
6 1165 108 111 118 11251
7 1175 106 115 113 11252
8 1185 113 104 108 11253
9 1226 109 119 116 11261
I want to recode the first two columns with the numeric labels shown below, but cut() turns them into factors.
dat2$ID <- cut(dat2$ID, breaks=c(0,1124,1154,1184,Inf),
labels=c(5, 25, 55, 75))
table(dat2$ID)
5 25 55 75
2 3 2 2
dat2$De <- cut(dat2$De, breaks=c(0,110,118,125,Inf),
labels=c(10, 20, 30, 40))
table(dat2$De)
10 20 30 40
4 3 2 0
str(dat2)
'data.frame': 9 obs. of 5 variables:
$ ID : Factor w/ 4 levels "5","25","55",..: 1 1 2 2 2 3 3 4 4
$ De : Factor w/ 4 levels "10","20","30",..: 2 1 3 2 3 1 1 2 1
$ Ep : int 121 107 111 120 116 111 115 104 119
$ Ti : int 100 110 107 111 109 118 113 108 116
$ ID1: int 11231 11232 11241 11242 11243 11251 11252 11253 11261
I used as.numeric to convert them back to numeric, but that creates new labels (1, 2, 3, ...), which is not what I want. I need a simple line of code to do this transformation.
dat2$ID <- as.numeric(dat2$ID)
table(dat2$ID)
1 2 3 4
2 3 2 2
dat2$De <- as.numeric(dat2$De)
table(dat2$De)
1 2 3
4 3 2
In your case it will probably be more efficient to use findInterval directly, instead of converting numeric to factors and then back to numeric values, as shown here:
c(5, 25, 55, 75)[findInterval(dat2$ID, c(0, 1124, 1154, 1184, Inf))]
## [1] 5 5 25 25 55 55 55 75 75
Or (as per the second column)
c(10, 20, 30, 40)[findInterval(dat2$De, c(0, 110, 118, 125, Inf))]
## [1] 20 10 30 20 30 10 10 20 10
This is essentially what cut does, except that it returns the numeric values directly. Note one boundary difference: cut() uses right-closed intervals (0,1124], (1124,1154], ... by default, so a value lying exactly on a break (1154 here) falls into the lower bin, whereas findInterval() uses left-closed intervals and assigns it to the upper bin; pass right = FALSE to cut() if you need the two to agree:
cut(dat2$ID, breaks=c(0, 1124, 1154, 1184, Inf), labels=c(5, 25, 55, 75))
# [1] 5 5 25 25 25 55 55 75 75
# Levels: 5 25 55 75
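To turn that factor into the numbers it displays, the usual idiom is as.numeric(as.character(f)), or the slightly faster as.numeric(levels(f))[f] used in the benchmark below (a sketch, assuming dat2$ID still holds the original four-digit codes):
f <- cut(dat2$ID, breaks = c(0, 1124, 1154, 1184, Inf), labels = c(5, 25, 55, 75))
as.numeric(as.character(f))
# [1]  5  5 25 25 25 55 55 75 75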
Here's a quick benchmark showing an ~18x speed improvement:
set.seed(123)
x <- sample(1e8, 1e7, replace = TRUE)
system.time({
  res1 <- cut(x, breaks = c(0, 1e4, 1e5, 1e6, Inf), labels = c(5, 25, 55, 75))
  res1 <- as.numeric(levels(res1))[res1]
})
# user system elapsed
# 3.40 0.09 3.51
system.time(res2 <- c(5, 25, 55, 75)[findInterval(x, c(0, 1e4, 1e5, 1e6, Inf))])
# user system elapsed
# 0.18 0.03 0.20
identical(res1, res2)
## [1] TRUE
