Fill missing values (NA) before the first non-NA value by group - r

I have a data frame grouped by 'id' and a variable 'age' which contains missing values, NA.
Within each 'id', I want to replace missing values of 'age', but only "fill up" before the first non-NA value.
data <- data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
                   age = c(NA,6,NA,8,NA,NA,NA,NA,3,8,NA,NA,NA,7,NA,9))
id age
1 1 NA
2 1 6 # first non-NA in id = 1. Fill up from here
3 1 NA
4 1 8
5 1 NA
6 1 NA
7 2 NA
8 2 NA
9 2 3 # first non-NA in id = 2. Fill up from here
10 2 8
11 2 NA
12 3 NA
13 3 NA
14 3 7 # first non-NA in id = 3. Fill up from here
15 3 NA
16 3 9
Expected output:
1 1 6
2 1 6
3 1 NA
4 1 8
5 1 NA
6 1 NA
7 2 3
8 2 3
9 2 3
10 2 8
11 2 NA
12 3 7
13 3 7
14 3 7
15 3 NA
16 3 9
I tried using fill with .direction = "up" like this:
library(dplyr)
library(tidyr)
data1 <- data %>% group_by(id) %>%
fill(!is.na(age[1]), .direction = "up")

You could use cumall(is.na(age)) to find the positions before the first non-NA value.
library(dplyr)
data %>%
group_by(id) %>%
mutate(age2 = replace(age, cumall(is.na(age)), age[!is.na(age)][1])) %>%
ungroup()
# A tibble: 16 × 3
id age age2
<dbl> <dbl> <dbl>
1 1 NA 6
2 1 6 6
3 1 NA NA
4 1 8 8
5 1 NA NA
6 1 NA NA
7 2 NA 3
8 2 NA 3
9 2 3 3
10 2 8 8
11 2 NA NA
12 3 NA 7
13 3 NA 7
14 3 7 7
15 3 NA NA
16 3 9 9
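The same mask can be built without dplyr: cumsum(!is.na(age)) == 0 is TRUE exactly where cumall(is.na(age)) is TRUE, i.e. up to the first non-NA value. A minimal base R sketch, assuming a helper named fill_leading_na (a hypothetical name, not part of any package) applied per group with ave():

```r
data <- data.frame(
  id  = c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
  age = c(NA,6,NA,8,NA,NA,NA,NA,3,8,NA,NA,NA,7,NA,9)
)

# Hypothetical helper: fill only the NAs that precede the first non-NA value.
fill_leading_na <- function(x) {
  leading <- cumsum(!is.na(x)) == 0   # TRUE until the first non-NA is seen
  x[leading] <- x[!is.na(x)][1]       # stays NA if the whole vector is NA
  x
}

# Apply the helper within each id.
data$age2 <- ave(data$age, data$id, FUN = fill_leading_na)
data
```

NAs after the first non-NA value are left untouched, which is the difference from fill(age, .direction = "up").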

Another option (agnostic about where the missing and non-missing values start) could be:
library(dplyr)
data %>%
  group_by(id) %>%
  mutate(rleid = with(rle(is.na(age)), rep(seq_along(lengths), lengths)),
         age2 = ifelse(rleid == min(rleid[is.na(age)]),
                       age[rleid == (min(rleid[is.na(age)]) + 1)][1],
                       age))
id age rleid age2
<dbl> <dbl> <int> <dbl>
1 1 NA 1 6
2 1 6 2 6
3 1 NA 3 NA
4 1 8 4 8
5 1 NA 5 NA
6 1 NA 5 NA
7 2 NA 1 3
8 2 NA 1 3
9 2 3 2 3
10 2 8 2 8
11 2 NA 3 NA
12 3 NA 1 7
13 3 NA 1 7
14 3 7 2 7
15 3 NA 3 NA
16 3 9 4 9
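To make the rleid column less opaque, here is a small sketch of what the rle() expression computes for a single group (the age values of id = 1). The variable names runs, run_id, first_na_run and fill_value are only illustrative:

```r
age <- c(NA, 6, NA, 8, NA, NA)      # the 'age' values for id = 1

runs   <- rle(is.na(age))           # runs of NA / non-NA
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id                              # 1 2 3 4 5 5

first_na_run <- min(run_id[is.na(age)])          # the first run made of NAs
fill_value   <- age[run_id == first_na_run + 1][1]  # first value of the run after it
fill_value                          # 6
```

Only rows whose run ID equals first_na_run are replaced, so later NA runs keep their NA.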

Related

Match and re-order rows in multiple columns in R (tidyverse)

I have a dataset like this (in the actual dataset, I have more columns like subj01):
# A tibble: 10 x 4
item subj01 subj02 subj03
<int> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 2 6
3 3 5 5 9
4 4 9 6 NA
5 5 10 8 NA
6 6 NA 9 NA
7 7 NA 10 NA
8 8 NA NA NA
9 9 NA NA NA
10 10 NA NA NA
I created the dataset using the code below.
data = tibble(item = 1:10, subj01 = c(1,2,5,9,10,NA,NA,NA,NA,NA), subj02 = c(1,2,5,6,8,9,10,NA,NA,NA), subj03 = c(1,6,9,NA,NA,NA,NA,NA,NA,NA))
I would like to reorder all the columns beginning with "subj" so that the position of each value matches its position in the item column.
That is, for this example dataset, I would like to end up with this:
# A tibble: 10 x 4
item subj01 subj02 subj03
<int> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 2 NA
3 3 NA NA NA
4 4 NA NA NA
5 5 5 5 NA
6 6 NA 6 6
7 7 NA NA NA
8 8 NA 8 NA
9 9 9 9 9
10 10 10 10 NA
I've figured that I can match and re-order one column by running this:
data$subj01[match(data$item,data$subj01)]
[1] 1 2 NA NA 5 NA NA NA 9 10
But I am struggling to apply this across multiple columns (ideally I'd like to embed the command in a dplyr pipe).
I tried the command below, but this gave me an error "Error in mutate(x. = x.[match(item, x.)]) : object 'x.' not found".
data = data %>% across(mutate(x.=x.[match(item,x.)]))
I'd appreciate any suggestions! Thank you.
library(tidyverse)
data %>%
pivot_longer(-item) %>%
filter(!is.na(value)) %>%
mutate(item = value) %>%
complete(item = 1:10, name) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 10 × 4
item subj01 subj02 subj03
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 2 NA
3 3 NA NA NA
4 4 NA NA NA
5 5 5 5 NA
6 6 NA 6 6
7 7 NA NA NA
8 8 NA 8 NA
9 9 9 9 9
10 10 10 10 NA
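The single-column match() idiom from the question can also be applied to every subj column directly. A minimal base R sketch, assuming the columns to reorder all start with "subj" and using a plain data.frame instead of a tibble (subj_cols is just an illustrative name):

```r
data <- data.frame(
  item   = 1:10,
  subj01 = c(1, 2, 5, 9, 10, NA, NA, NA, NA, NA),
  subj02 = c(1, 2, 5, 6, 8, 9, 10, NA, NA, NA),
  subj03 = c(1, 6, 9, NA, NA, NA, NA, NA, NA, NA)
)

# Names of the columns to reorder.
subj_cols <- grep("^subj", names(data), value = TRUE)

# For each column, put every value in the row whose item equals it.
data[subj_cols] <- lapply(data[subj_cols], function(x) x[match(data$item, x)])
data
```

Inside a dplyr pipe the same expression should fit in mutate(across(starts_with("subj"), ~ .x[match(item, .x)])).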

Create run-length ID for subset of values

In this type of dataframe:
df <- data.frame(
x = c(3,3,1,12,2,2,10,10,10,1,5,5,2,2,17,17)
)
how can I create a new column recording the run-length ID of only a subset of x values, say, 3-20?
My own attempt only succeeds at inserting NA where the run-length count should be interrupted; but internally it seems the count is uninterrupted:
library(dplyr)
library(data.table)
df %>%
  mutate(rle = ifelse(x %in% 3:20, rleid(x), NA))
x rle
1 3 1
2 3 1
3 1 NA
4 12 3
5 2 NA
6 2 NA
7 10 5
8 10 5
9 10 5
10 1 NA
11 5 7
12 5 7
13 2 NA
14 2 NA
15 17 9
16 17 9
The expected result:
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
In base R (with rleid borrowed from data.table):
in_range <- df$x %in% 3:20
df[in_range, "rle"] <- data.table::rleid(df$x[in_range])
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
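With left_join (this works here because each in-range value occurs in a single run; distinct() would give a value that reappears later the same ID):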
left_join(df, df %>%
filter(x %in% 3:20) %>%
distinct() %>%
mutate(rle = row_number()))
Joining, by = "x"
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
With data.table:
library(data.table)
setDT(df)
df[x %between% c(3,20),rle:=rleid(x)][]
x rle
<num> <int>
1: 3 1
2: 3 1
3: 1 NA
4: 12 2
5: 2 NA
6: 2 NA
7: 10 3
8: 10 3
9: 10 3
10: 1 NA
11: 5 4
12: 5 4
13: 2 NA
14: 2 NA
15: 17 5
16: 17 5
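If you would rather avoid the data.table dependency altogether, the run IDs of the in-range values can be computed with diff() and cumsum(). This is a minimal sketch, assuming x is numeric with no NAs; keep and vals are illustrative names:

```r
df <- data.frame(
  x = c(3,3,1,12,2,2,10,10,10,1,5,5,2,2,17,17)
)

keep <- df$x %in% 3:20          # rows that take part in the run count
vals <- df$x[keep]

df$rle <- NA_integer_
# A new run starts at the first kept value and wherever the value changes.
df$rle[keep] <- cumsum(c(TRUE, diff(vals) != 0))
df
```

Like the answers above, this counts runs in the filtered sequence, so two equal in-range values separated only by out-of-range values end up in the same run.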

Is there a way to group values in a column between data gaps in R?

I want to group my data in different chunks when the data is continuous. Trying to get the group column from dummy data like this:
a b group
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
I tried using
test %>% mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test == TRUE)) %>%
select(group, everything())
But it doesn't work as expected:
group a b test
<int> <dbl> <dbl> <lgl>
1 1 1 1 TRUE
2 2 2 2 TRUE
3 3 3 3 TRUE
4 3 4 NA FALSE
5 3 5 NA FALSE
6 3 6 NA FALSE
7 4 7 12 TRUE
8 5 8 15 TRUE
9 5 9 NA FALSE
10 6 10 25 TRUE
Any advice?
Using rle in base R:
transform(df, group1 = with(rle(!is.na(b)), rep(cumsum(values), lengths))) |>
transform(group1 = replace(group1, is.na(b), NA))
# a b group group1
#1 1 1 1 1
#2 2 2 1 1
#3 3 3 1 1
#4 4 NA NA NA
#5 5 NA NA NA
#6 6 NA NA NA
#7 7 12 2 2
#8 8 15 2 2
#9 9 NA NA NA
#10 10 25 3 3
A couple of approaches to consider if you wish to use dplyr for this.
First, you could look at transition from non-complete cases (using lag) to complete cases.
library(dplyr)
test %>%
  # build the id inside mutate(): dplyr refuses to modify a grouping column
  mutate(test = complete.cases(.),
         group = cumsum(test & !lag(test, default = FALSE)),
         group = replace(group, !test, NA))
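To see why testing for the start of a run works where cumsum(test == TRUE) does not, here is a minimal base R sketch on the complete-case flags from the example. The vector ok and the head()-based shift are illustrative stand-ins for the test column and lag(test, default = FALSE):

```r
# Complete-case flags for the ten example rows.
ok <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)

prev_ok <- c(FALSE, head(ok, -1))   # previous row's flag, FALSE for row 1
starts  <- ok & !prev_ok            # TRUE only on the first row of each run

group <- cumsum(starts)             # counter increases once per run
group[!ok] <- NA                    # incomplete rows get no group
group
```

cumsum(ok) would instead increase on every complete row, which is what produced the 1, 2, 3, 3, ... sequence in the question.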
Alternatively, you could add row numbers to your data.frame. Then, you could filter to include only complete cases, and group_by enumerating with cumsum based on gaps in row numbers. Then, join back to original data.
test$rn <- seq.int(nrow(test))
test %>%
filter(complete.cases(.)) %>%
group_by(group = c(0, cumsum(diff(rn) > 1)) + 1) %>%
right_join(test) %>%
arrange(rn) %>%
dplyr::select(-rn)
Output
a b group
<int> <int> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
Using data.table, get rleid then remove group IDs for NAs, then fix the sequence with factor to integer conversion:
library(data.table)
setDT(test)[, group1 := {
  x <- complete.cases(test)
  grp <- rleid(x)
  grp[ !x ] <- NA
  as.integer(factor(grp))
}]
# a b group group1
# 1: 1 1 1 1
# 2: 2 2 1 1
# 3: 3 3 1 1
# 4: 4 NA NA NA
# 5: 5 NA NA NA
# 6: 6 NA NA NA
# 7: 7 12 2 2
# 8: 8 15 2 2
# 9: 9 NA NA NA
# 10: 10 25 3 3

Get a value based on the value of another column in R - dplyr

I have this df:
df <- data.frame(month = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
day = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
flow = c(2,5,7,8,5,4,6,7,9,2,NA,1,6,10,2,NA,NA,NA,NA,NA))
and I want to reach this result:
month day flow dayofminflow
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
11 3 1 NA 2
12 3 2 1 2
13 3 3 6 2
14 3 4 10 2
15 3 5 2 2
16 4 1 NA NA
17 4 2 NA NA
18 4 3 NA NA
19 4 4 NA NA
20 4 5 NA NA
I was using this solution, but it returns NA when the first value is NA:
newdf <- df %>% group_by(month) %>% mutate(Val=day[flow==min(flow)][1])
And this solution returns an error when all data is NA:
library(dplyr)
df <- df %>%
group_by(month) %>%
mutate(dayminflowofthemonth = day[which.min(flow)]) %>%
ungroup
You could set na.rm = TRUE in min() (the default is FALSE) in the first solution so that NAs are ignored:
df %>%
group_by(month) %>%
mutate(dayofminflow = day[which(min(flow, na.rm = TRUE) == flow)][1])
# A tibble: 20 x 4
# Groups: month [4]
month day flow dayofminflow
<dbl> <dbl> <dbl> <dbl>
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
11 3 1 NA 2
12 3 2 1 2
13 3 3 6 2
14 3 4 10 2
15 3 5 2 2
16 4 1 NA NA
17 4 2 NA NA
18 4 3 NA NA
19 4 4 NA NA
20 4 5 NA NA
This does raise the warning "no non-missing arguments to min; returning Inf" for month 4, since all of its flow values are NA.
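One way to sidestep the min() warning: which.min() skips NAs and returns a zero-length result when every value is missing, so indexing with [1] afterwards yields NA. A minimal base R sketch of that idea, using ave() over row indices so that both day and flow are available per month (this is one possible arrangement, not the only one):

```r
df <- data.frame(month = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
                 day   = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
                 flow  = c(2,5,7,8,5,4,6,7,9,2,NA,1,6,10,2,NA,NA,NA,NA,NA))

# For each month, look up the day at the position of the smallest flow.
# which.min() ignores NAs; [1] turns an empty result into NA.
df$dayofminflow <- ave(seq_len(nrow(df)), df$month, FUN = function(i) {
  df$day[i][which.min(df$flow[i])][1]
})
df
```

In a dplyr pipeline the same idea would be mutate(dayofminflow = day[which.min(flow)][1]) after group_by(month).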

R apply function to groups within data frame adding result as additional column

Here is the code for my example dataset.
df = data.frame("group" =c(rep(1,5),rep(1,6),rep(2,4),rep(2,3)), "time" = c(rep(NA,5),seq(1,6),rep(NA,4),seq(1,3)), "p" = seq(1,18) )
group time p
1 1 NA 1
2 1 NA 2
3 1 NA 3
4 1 NA 4
5 1 NA 5
6 1 1 6
7 1 2 7
8 1 3 8
9 1 4 9
10 1 5 10
11 1 6 11
12 2 NA 12
13 2 NA 13
14 2 NA 14
15 2 NA 15
16 2 1 16
17 2 2 17
18 2 3 18
I would like to figure out how to apply a function by group to only the values that have time then append the result as a new column in the data frame. Here is my example function I would like to apply.
pfunc <- function(p){
p+5
}
The output I am hoping to obtain would look as follows.
group time p new_p
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 1 NA 4 NA
5 1 NA 5 NA
6 1 1 6 11
7 1 2 7 12
8 1 3 8 13
9 1 4 9 14
10 1 5 10 15
11 1 6 11 16
12 2 NA 12 NA
13 2 NA 13 NA
14 2 NA 14 NA
15 2 NA 15 NA
16 2 1 16 21
17 2 2 17 22
18 2 3 18 23
You can try this:
library(dplyr)
df %>% group_by(group) %>%
mutate(pnew=ifelse(is.na(time),time,time+5))
# A tibble: 18 x 4
# Groups: group [2]
group time p pnew
<dbl> <int> <int> <dbl>
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 1 NA 4 NA
5 1 NA 5 NA
6 1 1 6 6
7 1 2 7 7
8 1 3 8 8
9 1 4 9 9
10 1 5 10 10
11 1 6 11 11
12 2 NA 12 NA
13 2 NA 13 NA
14 2 NA 14 NA
15 2 NA 15 NA
16 2 1 16 6
17 2 2 17 7
18 2 3 18 8
Update
You can use this function:
increase <- function(data, n) {
  data %>%
    group_by(group) %>%
    mutate(pnew = ifelse(is.na(time), time, time + n))
}
increase(df,n = 10)
# A tibble: 18 x 4
# Groups: group [2]
group time p pnew
<dbl> <int> <int> <dbl>
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 1 NA 4 NA
5 1 NA 5 NA
6 1 1 6 11
7 1 2 7 12
8 1 3 8 13
9 1 4 9 14
10 1 5 10 15
11 1 6 11 16
12 2 NA 12 NA
13 2 NA 13 NA
14 2 NA 14 NA
15 2 NA 15 NA
16 2 1 16 11
17 2 2 17 12
18 2 3 18 13
Update 2
I hope this helps:
df %>% group_by(group) %>% rowwise() %>% mutate(pnew=ifelse(is.na(time),NA,pfunc(time)))
# A tibble: 18 x 4
# Rowwise: group
group time p pnew
<dbl> <int> <int> <dbl>
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 1 NA 4 NA
5 1 NA 5 NA
6 1 1 6 6
7 1 2 7 7
8 1 3 8 8
9 1 4 9 9
10 1 5 10 10
11 1 6 11 11
12 2 NA 12 NA
13 2 NA 13 NA
14 2 NA 14 NA
15 2 NA 15 NA
16 2 1 16 6
17 2 2 17 7
18 2 3 18 8
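Note that the snippets above transform time, whereas the expected output in the question adds 5 to p. A minimal base R sketch that applies pfunc to p by group, only on rows that have a time value (has_time is an illustrative name):

```r
df <- data.frame(
  group = c(rep(1, 5), rep(1, 6), rep(2, 4), rep(2, 3)),
  time  = c(rep(NA, 5), seq(1, 6), rep(NA, 4), seq(1, 3)),
  p     = seq(1, 18)
)

pfunc <- function(p) {
  p + 5
}

has_time <- !is.na(df$time)     # rows the function should be applied to

df$new_p <- NA_real_
# ave() calls pfunc once per group, on that group's p values only.
df$new_p[has_time] <- ave(df$p[has_time], df$group[has_time], FUN = pfunc)
df
```

For an element-wise function like pfunc the grouping makes no difference to the result, but the same pattern carries over to functions that depend on the whole group, such as scaling or cumulative sums.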
