R - Impute missing values by group (linear / moving average)

I have a large dataset with a lot of missing values, and I want to impute them by group ("name"), either linearly or with a moving average.
d <- data.frame(
name = c('a', 'a','a','a','b','b','b','b','c','c','c','c'),
year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)
> d
name year V W X Y Z
1 a 1 NA 11 11 11 NA
2 a 2 21 NA 21 21 NA
3 a 3 31 31 NA 31 31
4 a 4 41 41 41 NA 41
5 b 1 11 11 NA NA 11
6 b 2 NA 21 21 21 NA
7 b 3 NA NA NA 31 31
8 b 4 41 NA 41 NA NA
9 c 1 NA NA 11 NA NA
10 c 2 NA NA NA 21 NA
11 c 3 NA 31 NA NA NA
12 c 4 41 NA NA NA NA
Hopefully the results can be as close as possible to the following:
name year V W X Y Z
1 a 1 11 11 11 11 11
2 a 2 21 21 21 21 21
3 a 3 31 31 31 31 31
4 a 4 41 41 41 41 41
5 b 1 11 11 11 11 11
6 b 2 21 21 21 21 21
7 b 3 31 31 31 31 31
8 b 4 41 41 41 41 41
9 c 1 11 11 11 11 NA
10 c 2 21 21 21 21 NA
11 c 3 31 31 31 31 NA
12 c 4 41 41 41 41 NA
I found this and this. I tried the following without grouping, but neither attempt worked:
data.frame(lapply(d, function(X) approxfun(seq_along(X), X)(seq_along(X))))
imputeTS::na_ma(d, k = 2, weighting = "simple")
The first one gave an error as below:
Error in approxfun(seq_along(X), X) :
need at least two non-NA values to interpolate
In addition: Warning message:
In xy.coords(x, y, setLab = FALSE) :
Error in approxfun(seq_along(X), X) :
need at least two non-NA values to interpolate
So I tried the second one, but it kept running for a long time and nothing happened. According to the reply from the first link,
the package requires time series/vector input (that's why each column
has to be called separately).
Any help is greatly appreciated!

You can use zoo::na.spline -
library(dplyr)
d %>%
group_by(name) %>%
mutate(across(V:Z, ~ zoo::na.spline(.x, na.rm = FALSE))) %>%
ungroup()
# name year V W X Y Z
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 11 11 11 11 11
# 2 a 2 21 21 21 21 21
# 3 a 3 31 31 31 31 31
# 4 a 4 41 41 41 41 41
# 5 b 1 11 11 11 11 11
# 6 b 2 21 21 21 21 21
# 7 b 3 31 31 31 31 31
# 8 b 4 41 41 41 41 41
# 9 c 1 41 31 11 21 NA
#10 c 2 41 31 11 21 NA
#11 c 3 41 31 11 21 NA
#12 c 4 41 31 11 21 NA
For name "c", I think it would be difficult to impute the missing values from only one number.
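If linear interpolation is preferred over a spline, the "need at least two non-NA values" error from the question can be avoided by guarding the call. The following is a minimal base-R sketch, not part of the answer above: `interp_linear` is a hypothetical helper name, and leaving series with fewer than two observed values untouched (and repeating the nearest observed value at the ends via `rule = 2`) is an assumption you may want to change.

```r
d <- data.frame(
  name = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'),
  year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
  W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
  X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
  Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
  Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)

# Hypothetical helper: linear interpolation that tolerates sparse series
interp_linear <- function(x) {
  ok <- !is.na(x)
  # approx() needs at least two points; leave the series as it is otherwise
  if (sum(ok) < 2) return(x)
  # rule = 2 repeats the nearest observed value beyond the first/last point
  approx(which(ok), x[ok], xout = seq_along(x), rule = 2)$y
}

value_cols <- c("V", "W", "X", "Y", "Z")
filled <- d
# ave() applies the helper separately within each level of `name`
filled[value_cols] <- lapply(d[value_cols], function(col) {
  ave(col, d$name, FUN = interp_linear)
})
filled
```

For group "b", V becomes 11, 21, 31, 41. For group "a", the leading NA in V is filled with 21 rather than 11, because `rule = 2` repeats the nearest value instead of extrapolating the trend; na.spline behaves differently in that respect. Group "c" is left unchanged, since each of its series has at most one observed value.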

One issue I see is that some of the series you want to impute have only one non-NA value, so na_ma or na_interpolation from imputeTS (and similar functions from other packages) cannot be applied, since they require at least two non-NA values.
That is why I created an impute_select function for this solution. It lets you choose what to do when more than one value is present, when exactly one value is present, and when there are only NAs.
When more than one value is present, it uses na_ma, but you could also use na_interpolation or any other imputation function from imputeTS here.
When only one value is present, it uses na_locf, since this method also works with a single value in the series.
When the series has no non-NA values, it uses na_replace, which replaces all the NAs with a default value (I set it to 11).
By adjusting this function you should be able to tailor the imputation to the number of non-NA values in the series.
library("imputeTS")
d <- data.frame(
name = c('a', 'a','a','a','b','b','b','b','c','c','c','c'),
year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)
impute_select <- function(x) {
  n_obs <- sum(!is.na(x))
  if (n_obs > 1) {
    # More than one value available: moving average
    result <- na_ma(x)
  } else if (n_obs == 1) {
    # Exactly one value in the series: carry it forward/backward
    result <- na_locf(x)
  } else {
    # No non-NA value present: fall back to a default value
    result <- na_replace(x, 11)
  }
  result
}
# Apply the function row-wise to the data frame
# (imputation would usually be done column-wise instead)
d[,3:7] <- t(apply(d[,3:7], MARGIN = 1, FUN = impute_select))
d
These are the results (hopefully exactly what you wanted):
name year V W X Y Z
1 a 1 11 11 11 11 11
2 a 2 21 21 21 21 21
3 a 3 31 31 31 31 31
4 a 4 41 41 41 41 41
5 b 1 11 11 11 11 11
6 b 2 21 21 21 21 21
7 b 3 31 31 31 31 31
8 b 4 41 41 41 41 41
9 c 1 11 11 11 11 11
10 c 2 21 21 21 21 21
11 c 3 31 31 31 31 31
12 c 4 41 41 41 41 41
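To see up front which group/column series have too few observed values for interpolation or a moving average, the non-NA values can be counted per group. This is a small base-R sketch added for illustration; the names `value_cols` and `n_obs` are made up here and are not part of imputeTS.

```r
d <- data.frame(
  name = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'),
  year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
  W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
  X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
  Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
  Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)

value_cols <- c("V", "W", "X", "Y", "Z")

# One column per group, one row per variable: number of observed values
n_obs <- sapply(split(d[value_cols], d$name), function(g) colSums(!is.na(g)))
n_obs
#   a b c
# V 3 2 1
# W 3 2 1
# X 3 2 1
# Y 3 2 1
# Z 2 2 0

# Series with fewer than two observed values need special handling
which(n_obs < 2, arr.ind = TRUE)
```

Every series in group "c" has at most one observed value, which is why it needs a fallback such as na_locf or a default value.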

Related

R how to obtain the mean of previous values depending on a second column with data in long format

I prepared a reproducible example with an example of dataset here:
library(tibble)
patient <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
month <- rep(1:10, 2)
fev1 <- c(58, NA, NA, NA, 57, NA, NA, NA, NA, 60, NA, NA, NA, NA, 32, NA, NA, NA, NA, 40)
adherence <- c(30, 32, 34, 36, 34, 32, 30, 34, 32, 36, 70, 65, 75, 70, 70, 55, 50, 65, 70, 70)
data <- tibble(patient, month, fev1, adherence)
data
I would like to obtain a fifth column, called mean adherence. For each fev1 value that is not NA, it should give the mean of the adherence values recorded since the previous fev1 value, including the adherence value in the row of that previous fev1 value and excluding the adherence value in the row of the current fev1 value. This should be computed separately for each patient.
For example, for patient 1, at month 5, for fev1 = 57, it would compute the mean of (30, 32, 34, 36); then, for fev1 = 60, it would compute the mean of (34, 32, 30, 34, 32)
Thank you very much for your help
We could create a grouping variable based on the NA values in fev1, and then get the mean by group
library(dplyr)
data %>%
group_by(patient) %>%
mutate(lagadher = lag(adherence),
grp = lag(cumsum(!is.na(fev1)))) %>%
group_by(grp, .add = TRUE) %>%
mutate(Mean_adhere = mean(lagadher) * NA^(is.na(fev1))) %>%
ungroup %>%
select(-grp, -lagadher)
Output:
# A tibble: 20 × 5
patient month fev1 adherence Mean_adhere
<dbl> <int> <dbl> <dbl> <dbl>
1 1 1 58 30 NA
2 1 2 NA 32 NA
3 1 3 NA 34 NA
4 1 4 NA 36 NA
5 1 5 57 34 33
6 1 6 NA 32 NA
7 1 7 NA 30 NA
8 1 8 NA 34 NA
9 1 9 NA 32 NA
10 1 10 60 36 32.4
11 2 1 NA 70 NA
12 2 2 NA 65 NA
13 2 3 NA 75 NA
14 2 4 NA 70 NA
15 2 5 32 70 70
16 2 6 NA 55 NA
17 2 7 NA 50 NA
18 2 8 NA 65 NA
19 2 9 NA 70 NA
20 2 10 40 70 62
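The `NA^(is.na(fev1))` factor in the answer works as a mask: in R, `NA^0` evaluates to 1 while `NA^1` stays NA, so multiplying by it keeps the group mean only on rows where fev1 is observed. A small illustration with made-up vectors (`is_missing` and `group_mean` are hypothetical names, not columns from the data):

```r
is_missing <- c(TRUE, FALSE, TRUE)
group_mean <- c(33, 33, 33)

# TRUE is treated as 1 and FALSE as 0, so this gives NA, 1, NA
mask <- NA^is_missing

# The mean survives only where the flag is FALSE
masked <- group_mean * mask

# Equivalent, more explicit spelling
explicit <- ifelse(is_missing, NA, group_mean)
```

Both `masked` and `explicit` hold NA, 33, NA; the `ifelse()` form may be easier to read if the trick looks obscure.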

How to replace missing values only for people who have positive values on the first observation in panel data?

I have four people who are followed for four years. I would like to replace the NAs by 0, but only for people who have a positive value in work_hours in the first wave in which they were interviewed. For example, in my data, this means that the persons with ID 3 and 4 will have their NAs replaced by 0, but the person with ID 2 will keep his/her NAs.
id wave year work_hours
1 1 2007 40
1 2 2008 39
1 3 2009 39
1 4 2010 38
2 1 2005 NA
2 2 2006 35
2 3 2007 35
2 4 2008 NA
3 1 2007 40
3 2 2008 NA
3 3 2009 40
3 4 2010 40
4 1 2009 32
4 2 2010 NA
4 3 2011 32
4 4 2012 NA
I tried the following code, but it replaces the first wave with 0 instead of the NAs in the waves that follow:
df= df %>% group_by(id) %>%
mutate(workhours_imputed= ifelse(work_hours>0 & wave==1, replace_na(0), work_hours))
Here is the data:
structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4), wave = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2,
3, 4), year = c(2007, 2008, 2009, 2010, 2005, 2006, 2007, 2008,
2007, 2008, 2009, 2010, 2009, 2010, 2011, 2012), work_hours = c(40,
39, 39, 38, NA, 35, 35, NA, 40, NA, 40, 40, 32, NA, 32, NA),
workhours_imputed = c(0, 39, 39, 38, NA, 35, 35, NA, 0, NA,
40, 40, 0, NA, 32, NA)), row.names = c(NA, -16L), groups = structure(list(
id = c(1, 2, 3, 4), .rows = structure(list(1:4, 5:8, 9:12,
13:16), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Alternative dplyr solution:
df %>%
mutate(workhours_imputed = if_else(
is.na(work_hours) & any(wave == 1 & !is.na(work_hours)),
0, work_hours)
)
# # A tibble: 16 x 5
# # Groups: id [4]
# id wave year work_hours workhours_imputed
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2007 40 40
# 2 1 2 2008 39 39
# 3 1 3 2009 39 39
# 4 1 4 2010 38 38
# 5 2 1 2005 NA NA
# 6 2 2 2006 35 35
# 7 2 3 2007 35 35
# 8 2 4 2008 NA NA
# 9 3 1 2007 40 40
# 10 3 2 2008 NA 0
# 11 3 3 2009 40 40
# 12 3 4 2010 40 40
# 13 4 1 2009 32 32
# 14 4 2 2010 NA 0
# 15 4 3 2011 32 32
# 16 4 4 2012 NA 0
If wave does not always start at 1 but you always want to check the first value of wave, then you can use this instead:
df %>%
mutate(workhours_imputed = if_else(
is.na(work_hours) & !is.na(work_hours[which.min(wave)]),
0, work_hours)
)
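Both snippets above contain no group_by() call, so they depend on df still being grouped by id (the dput in the question is a grouped_df). They also test only whether the first-wave value is non-missing. As a hedged alternative, here is a base-R sketch that does not rely on grouping state and checks explicitly for a positive first-wave value; the data frame is rebuilt as a plain data.frame, and `first_rows`, `start_hours` and `fill` are names introduced only for this illustration.

```r
df <- data.frame(
  id = rep(1:4, each = 4),
  wave = rep(1:4, times = 4),
  work_hours = c(40, 39, 39, 38, NA, 35, 35, NA,
                 40, NA, 40, 40, 32, NA, 32, NA)
)

# Row index of each person's earliest wave
ord <- order(df$id, df$wave)
first_rows <- ord[!duplicated(df$id[ord])]

# First-wave work hours, repeated on every row of the same person
start_hours <- df$work_hours[first_rows][match(df$id, df$id[first_rows])]

# Fill only NAs of people whose first-wave value is observed and positive
fill <- is.na(df$work_hours) & !is.na(start_hours) & start_hours > 0

df$workhours_imputed <- df$work_hours
df$workhours_imputed[fill] <- 0
df
```

With this data, the NAs of IDs 3 and 4 become 0 and the NAs of ID 2 are kept, matching the outputs shown in the answers.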
One way to do this using match -
library(dplyr)
df %>%
group_by(id) %>%
mutate(workhours_imputed = {
tmp <- work_hours[match(1, wave)]
#If the 1st wave has a positive value
#replace NA with 0
if(!is.na(tmp) && tmp > 0) replace(work_hours, is.na(work_hours), 0) else work_hours
})
# id wave year work_hours workhours_imputed
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2007 40 40
# 2 1 2 2008 39 39
# 3 1 3 2009 39 39
# 4 1 4 2010 38 38
# 5 2 1 2005 NA NA
# 6 2 2 2006 35 35
# 7 2 3 2007 35 35
# 8 2 4 2008 NA NA
# 9 3 1 2007 40 40
#10 3 2 2008 NA 0
#11 3 3 2009 40 40
#12 3 4 2010 40 40
#13 4 1 2009 32 32
#14 4 2 2010 NA 0
#15 4 3 2011 32 32
#16 4 4 2012 NA 0

Running delta - Lag difference over row of data frame

I have a data frame such as
df1 <- data.frame(Company = c('A','B','C','D','E'),
`X1980` = c(1, 5, 3, 8, 13),
`X1981` = c(20, 13, 23, 11, 29),
`X1982` = c(33, 32, 31, 41, 42),
`X1983` = c(45, 47, 53, 58, 55))
I would like to create a new data frame (df2) keeping the company column as is. The values for the years 1981 to 1983 should be calculated as the current value minus the previous year's value, and 1980 should be NA because it has no previous year. So basically I would like a data frame containing the rolling deltas.
Company 1980 1981 1982 1983
A NA 19 13 12
B NA 8 19 15
C NA 20 8 22
D NA 3 30 17
E NA 16 13 13
Thanks for the help! If there's any way for me to improve the question, then just let me know.
You can take the differences along each row and prepend NA for the first year.
df1[-1] <- t(apply(df1[-1], 1, function(x) c(NA, diff(x))))
df1
# Company X1980 X1981 X1982 X1983
#1 A NA 19 13 12
#2 B NA 8 19 15
#3 C NA 20 8 22
#4 D NA 3 30 17
#5 E NA 16 13 13
You can also use tidyverse functions.
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -Company) %>%
group_by(Company) %>%
mutate(value = value - lag(value)) %>%
pivot_wider()
We can use rowDiffs from matrixStats
library(matrixStats)
df1[-1] <- cbind(NA, rowDiffs(as.matrix(df1[-1])))
Output:
df1
# Company X1980 X1981 X1982 X1983
#1 A NA 19 13 12
#2 B NA 8 19 15
#3 C NA 20 8 22
#4 D NA 3 30 17
#5 E NA 16 13 13
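The same deltas can also be obtained without apply() or an extra package by subtracting shifted column blocks of the matrix. This is a sketch in base R; it writes the result to a new data frame df2, as the question asked, and `m` and `deltas` are names chosen for illustration.

```r
df1 <- data.frame(Company = c('A', 'B', 'C', 'D', 'E'),
                  X1980 = c(1, 5, 3, 8, 13),
                  X1981 = c(20, 13, 23, 11, 29),
                  X1982 = c(33, 32, 31, 41, 42),
                  X1983 = c(45, 47, 53, 58, 55))

m <- as.matrix(df1[-1])

# Columns 2..n minus columns 1..(n-1): each year minus the previous year
deltas <- m[, -1] - m[, -ncol(m)]

df2 <- df1
# The first year has no predecessor, so it becomes NA
df2[-1] <- cbind(NA, deltas)
df2
```

Because df1 is left untouched, the original values remain available for comparison.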

How to handle or ignore NAs when using ifelse to mutate a new column with multiple conditions (solved)

I am a newcomer to dplyr and tried to create a new composite variable from three different age variables using dplyr and ifelse. I made a data frame to explain the situation as follows:
library(dplyr)
z <- data.frame("j6" = c(6, 19, NA, NA, NA, NA, NA, 8, 20, 20, NA),
"j7" = c(27, 20, NA, 7, 19, NA, NA, 20, 30, 9, NA),
"j8" = c(8, 22, NA, 20, NA, 8, 30, NA, NA, NA, 3))
z <- z %>%
mutate(., age_event = NA) %>%
mutate(., age_event = ifelse(j6 < 18 | j7 < 18 | j8 < 18, 1, 0))
My expectations:
The three columns (j6, j7, and j8) indicate ages. If at least one of them is less than 18, the new column (age_event) should be 1, otherwise 0.
If two of the three columns are 18 or older and the other is NA, age_event should be 0.
Likewise, if one of the three columns is 18 or older and the others are NA, age_event should be 0.
If all three columns are NA, age_event should be NA.
However, the result and problems are shown as follows:
> z
j6 j7 j8 age_event
1 6 27 8 1
2 19 20 22 0
3 NA NA NA NA
4 NA 7 20 1
5 NA 19 NA NA <-- should be 0, but NA
6 NA NA 8 1
7 NA NA 30 NA <-- should be 0, but NA
8 8 20 NA 1
9 20 30 NA NA <-- should be 0, but NA
10 20 9 NA 1
11 NA NA 3 1
I'd like to know if there is a way to turn the 5th, 7th, and 9th observations above into 0s using mutate and ifelse. Any suggestions would be greatly appreciated!
Update (2/27/2020): I found a solution with pmin when using mutate and ifelse:
z <- z %>%
mutate(., age_event = ifelse(is.na(j6) & is.na(j7) & is.na(j8), NA,
ifelse(pmin(j6, j7, j8, na.rm = TRUE) < 18, 1, 0)))
> z
j6 j7 j8 age_event
1 6 27 8 1
2 19 20 22 0
3 NA NA NA NA
4 NA 7 20 1
5 NA 19 NA 0
6 NA NA 8 1
7 NA NA 30 0
8 8 20 NA 1
9 20 30 NA 0
10 20 9 NA 1
11 NA NA 3 1
You can use rowMeans() in place of ifelse(); it also handles rows that are all NA.
z %>%
mutate(age_event = +(rowMeans(. < 18, na.rm = TRUE) > 0))
j6 j7 j8 age_event
1 6 27 8 1
2 19 20 22 0
3 NA NA NA NA
4 NA 7 20 1
5 NA 19 NA 0
6 NA NA 8 1
7 NA NA 30 0
8 8 20 NA 1
9 20 30 NA 0
10 20 9 NA 1
11 NA NA 3 1
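Note that `. < 18` compares every column of z, so an age_event column left over from an earlier attempt would be included in the row means. A minimal base-R sketch that restricts the check to the three age columns follows; `ages`, `any_under_18` and `all_missing` are names introduced here for illustration.

```r
z <- data.frame(j6 = c(6, 19, NA, NA, NA, NA, NA, 8, 20, 20, NA),
                j7 = c(27, 20, NA, 7, 19, NA, NA, 20, 30, 9, NA),
                j8 = c(8, 22, NA, 20, NA, 8, 30, NA, NA, NA, 3))

# Only the age columns take part in the comparison
ages <- as.matrix(z[c("j6", "j7", "j8")])

any_under_18 <- rowSums(ages < 18, na.rm = TRUE) > 0
all_missing <- rowSums(!is.na(ages)) == 0

z$age_event <- as.integer(any_under_18)
# Rows without any observed age stay NA instead of becoming 0
z$age_event[all_missing] <- NA
z
```

Running this block a second time on the resulting z gives the same age_event values, because the new column never enters the comparison.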
We can use rowSums to count the non-NA values in each row and the values that are less than 18. We can then use case_when to assign numbers based on the different conditions.
library(dplyr)
z %>%
mutate(calc = rowSums(!is.na(.), na.rm = TRUE),
ls18 = rowSums(. < 18, na.rm = TRUE),
age_event = case_when(calc == 0 & ls18 == 0 ~ NA_integer_,
ls18 > 0 ~ 1L,
TRUE ~ 0L)) %>%
select(-calc, -ls18)
# j6 j7 j8 age_event
#1 6 27 8 1
#2 19 20 22 0
#3 NA NA NA NA
#4 NA 7 20 1
#5 NA 19 NA 0
#6 NA NA 8 1
#7 NA NA 30 0
#8 8 20 NA 1
#9 20 30 NA 0
#10 20 9 NA 1
#11 NA NA 3 1

Replace missing values with the values of related ID

I have a dataframe with missing values.
df1 <- data.frame(ID = c(1, 2, 3, 4, 5, 6), value1 = c(23, 14, NA, 45, NA, NA),
value2 = c(25, 15, NA, 34, NA, NA), value3 = c(33, 29, NA, 29, NA, NA))
ID value1 value2 value3
1 23 25 33
2 14 15 29
3 NA NA NA
4 45 34 29
5 NA NA NA
6 NA NA NA
And a dataframe with id relations.
df2 <- data.frame(ID1 = c(1, 2, 4), ID2 = c(3, 5, 6))
ID1 ID2
1 3
2 5
4 6
I want to replace the missing values, with the values of the related ID.
So the dataframe will look like this.
ID value1 value2 value3
1 23 25 33
2 14 15 29
3 23 25 33
4 45 34 29
5 14 15 29
6 45 34 29
Any help would be appreciated.
You will need a for loop like this:
for (i in seq_along(df2[, "ID2"])) {
  df1[df2[i, "ID2"], c("value1", "value2", "value3")] <-
    df1[df2[i, "ID1"], c("value1", "value2", "value3")]
}
As @FannieY already suggested, you can use a for loop. In addition, I test with is.na to avoid overwriting existing values.
for(i in seq_len(nrow(df2))) {
idx <- is.na(df1[df2[i,2],-1])
df1[df2[i,2],-1][idx] <- df1[df2[i,1],-1][idx]
}
df1
# ID value1 value2 value3
#1 1 23 25 33
#2 2 14 15 29
#3 3 23 25 33
#4 4 45 34 29
#5 5 14 15 29
#6 6 45 34 29
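Both loops above index df1 by row position, which works here because each ID equals its row number. If that cannot be guaranteed, the rows can be looked up with match() first. This is a sketch under that assumption; `src`, `dst` and `value_cols` are illustrative names.

```r
df1 <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                  value1 = c(23, 14, NA, 45, NA, NA),
                  value2 = c(25, 15, NA, 34, NA, NA),
                  value3 = c(33, 29, NA, 29, NA, NA))
df2 <- data.frame(ID1 = c(1, 2, 4), ID2 = c(3, 5, 6))

# Translate IDs into row positions of df1
src <- match(df2$ID1, df1$ID)  # rows to copy from
dst <- match(df2$ID2, df1$ID)  # rows to fill

value_cols <- setdiff(names(df1), "ID")
for (col in value_cols) {
  # Only overwrite cells that are actually missing
  missing <- is.na(df1[[col]][dst])
  df1[[col]][dst[missing]] <- df1[[col]][src[missing]]
}
df1
```

The loop now runs over the value columns rather than over the ID pairs, so all related rows of a column are filled in one assignment.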
