Is there a way to group values in a column between data gaps in R? - r

I want to group my data into different chunks wherever the values are continuous (i.e. between the NA gaps). I'm trying to get the group column from dummy data like this:
a b group
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
I tried using
test %>% mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test == TRUE)) %>%
select(group, everything())
But it doesn't work as expected:
group a b test
<int> <dbl> <dbl> <lgl>
1 1 1 1 TRUE
2 2 2 2 TRUE
3 3 3 3 TRUE
4 3 4 NA FALSE
5 3 5 NA FALSE
6 3 6 NA FALSE
7 4 7 12 TRUE
8 5 8 15 TRUE
9 5 9 NA FALSE
10 6 10 25 TRUE
Any advice?
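
For reference, here is one way to rebuild the dummy data from the printed tibble (a sketch; a and b are taken from the output above, and group is the desired result, so it is not part of the input):
library(dplyr)
test <- tibble(a = 1:10,
               b = c(1, 2, 3, NA, NA, NA, 12, 15, NA, 25))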

Using rle in base R:
transform(df, group1 = with(rle(!is.na(b)), rep(cumsum(values), lengths))) |>
transform(group1 = replace(group1, is.na(b), NA))
# a b group group1
#1 1 1 1 1
#2 2 2 1 1
#3 3 3 1 1
#4 4 NA NA NA
#5 5 NA NA NA
#6 6 NA NA NA
#7 7 12 2 2
#8 8 15 2 2
#9 9 NA NA NA
#10 10 25 3 3
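
To see why this works, here is what rle() returns for b (a small illustration, assuming df holds the dummy data from the question):
r <- rle(!is.na(df$b))
r$values                           # TRUE FALSE TRUE FALSE TRUE   (runs of complete / missing b)
r$lengths                          # 3 3 2 1 1                    (lengths of those runs)
cumsum(r$values)                   # 1 1 2 2 3: only the TRUE runs advance the counter
rep(cumsum(r$values), r$lengths)   # 1 1 1 1 1 1 2 2 2 3: expanded back to one value per row
The NA rows inherit the number of the preceding complete run (rows 4-6 get 1, row 9 gets 2), which is why the second transform() call blanks them out afterwards.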

Here are a couple of approaches to consider if you wish to use dplyr for this.
First, you could look at the transition from non-complete to complete cases (using lag).
library(dplyr)
test %>%
mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test & !lag(test, default = F))) %>%
mutate(group = replace(group, !test, NA))
Alternatively, you could add row numbers to your data.frame, filter to keep only the complete cases, and build the groups with group_by and cumsum based on gaps in the row numbers. Then join back to the original data.
test$rn <- seq.int(nrow(test))
test %>%
filter(complete.cases(.)) %>%
group_by(group = c(0, cumsum(diff(rn) > 1)) + 1) %>%
right_join(test) %>%
arrange(rn) %>%
dplyr::select(-rn)
Output
a b group
<int> <int> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
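
If you are on dplyr 1.1.0 or newer, consecutive_id() gives you a run id directly. A compact sketch along the same lines, assuming the same test data with the gaps confined to b:
test %>%
  mutate(ok = !is.na(b),                                   # or complete.cases() across all columns
         group = if_else(ok, consecutive_id(ok), NA_integer_),
         group = as.integer(factor(group))) %>%            # renumber the surviving runs 1, 2, 3, ...
  select(-ok)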

Using data.table: get run IDs with rleid, blank out the IDs on the NA rows, then renumber the remaining groups with a factor-to-integer conversion:
library(data.table)
setDT(test)[, group1 := {
x <- complete.cases(test)
grp <- rleid(x)
grp[ !x ] <- NA
as.integer(factor(grp))
}]
# a b group group1
# 1: 1 1 1 1
# 2: 2 2 1 1
# 3: 3 3 1 1
# 4: 4 NA NA NA
# 5: 5 NA NA NA
# 6: 6 NA NA NA
# 7: 7 12 2 2
# 8: 8 15 2 2
# 9: 9 NA NA NA
# 10: 10 25 3 3
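
A more compact data.table variant assigns group numbers only on the complete rows, using .I (the original row positions of the subsetted rows) to detect gaps; group2 is just a name chosen here to keep it apart from group1:
library(data.table)
setDT(test)[!is.na(b), group2 := cumsum(c(TRUE, diff(.I) > 1L))]
# rows outside the subset are left as NA in group2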

Related

Fill missing values (NA) before the first non-NA value by group

I have a data frame grouped by 'id' and a variable 'age' which contains missing values, NA.
Within each 'id', I want to replace missing values of 'age', but only "fill up" before the first non-NA value.
data <- data.frame(id=c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
age=c(NA,6,NA,8,NA,NA,NA,NA,3,8,NA,NA,NA,7,NA,9))
id age
1 1 NA
2 1 6 # first non-NA in id = 1. Fill up from here
3 1 NA
4 1 8
5 1 NA
6 1 NA
7 2 NA
8 2 NA
9 2 3 # first non-NA in id = 2. Fill up from here
10 2 8
11 2 NA
12 3 NA
13 3 NA
14 3 7 # first non-NA in id = 3. Fill up from here
15 3 NA
16 3 9
Expected output:
1 1 6
2 1 6
3 1 NA
4 1 8
5 1 NA
6 1 NA
7 2 3
8 2 3
9 2 3
10 2 8
11 2 NA
12 3 7
13 3 7
14 3 7
15 3 NA
16 3 9
I tried using fill with .direction = "up" like this:
library(dplyr)
library(tidyr)
data1 <- data %>% group_by(id) %>%
fill(!is.na(age[1]), .direction = "up")

You could use cumall(is.na(age)) to find the positions before the first non-NA value.
library(dplyr)
data %>%
group_by(id) %>%
mutate(age2 = replace(age, cumall(is.na(age)), age[!is.na(age)][1])) %>%
ungroup()
# A tibble: 16 × 3
id age age2
<dbl> <dbl> <dbl>
1 1 NA 6
2 1 6 6
3 1 NA NA
4 1 8 8
5 1 NA NA
6 1 NA NA
7 2 NA 3
8 2 NA 3
9 2 3 3
10 2 8 8
11 2 NA NA
12 3 NA 7
13 3 NA 7
14 3 7 7
15 3 NA NA
16 3 9 9
Another option (agnostic about where the missing and non-missing values start) could be:
data %>%
group_by(id) %>%
mutate(rleid = with(rle(is.na(age)), rep(seq_along(lengths), lengths)),
age2 = ifelse(rleid == min(rleid[is.na(age)]),
age[rleid == (min(rleid[is.na(age)]) + 1)][1],
age))
id age rleid age2
<dbl> <dbl> <int> <dbl>
1 1 NA 1 6
2 1 6 2 6
3 1 NA 3 NA
4 1 8 4 8
5 1 NA 5 NA
6 1 NA 5 NA
7 2 NA 1 3
8 2 NA 1 3
9 2 3 2 3
10 2 8 2 8
11 2 NA 3 NA
12 3 NA 1 7
13 3 NA 1 7
14 3 7 2 7
15 3 NA 3 NA
16 3 9 4 9
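A third way to express the same idea, assuming every id has at least one non-NA age: locate the first non-NA with which.max() and overwrite only the rows before it.
data %>%
  group_by(id) %>%
  mutate(first_obs = which.max(!is.na(age)),   # position of the first non-NA age within the group
         age2 = if_else(row_number() < first_obs, age[first_obs], age)) %>%
  ungroup() %>%
  select(-first_obs)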

Coalesce multiple columns at once

My question is similar to existing questions about coalesce, but I want to coalesce several columns by row such that NAs are pushed to the last column.
Here's an example:
If I have
a <- data.frame(A=c(2,NA,4,3,2), B=c(NA,3,4,NA,5), C= c(1,3,6,7,NA), D=c(5,6,NA,4,3), E=c(2,NA,1,3,NA))
A B C D E
1 2 NA 1 5 2
2 NA 3 3 6 NA
3 4 4 6 NA 1
4 3 NA 7 4 3
5 2 5 NA 3 NA
I would like to get
b <- data.frame(A=c(2,3,4,3,2), B=c(1,3,4,7,5), C=c(5,6,6,4,3), D=c(2,NA,1,3,NA))
A B C D
1 2 1 5 2
2 3 3 6 NA
3 4 4 6 1
4 3 7 4 3
5 2 5 3 NA
Does anyone have any ideas for how I could do this? I would be so grateful for any tips, as my searches have come up dry.

You can use unite and separate:
library(tidyverse)
a %>%
unite(newcol, everything(), na.rm = TRUE) %>%
separate(newcol, into = LETTERS[1:4])
A B C D
1 2 1 5 2
2 3 3 6 <NA>
3 4 4 6 1
4 3 7 4 3
5 2 5 3 <NA>
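One caveat not shown above: unite() and separate() go through character, so A-D come back as character columns (hence the <NA> in the printout). Adding convert = TRUE to separate() turns them back into numbers:
a %>%
  unite(newcol, everything(), na.rm = TRUE) %>%
  separate(newcol, into = LETTERS[1:4], convert = TRUE)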
If the number of new columns for separate is unknown in advance, you can use splitstackshape's cSplit function instead:
library(splitstackshape)
a %>%
unite(newcol, na.rm = TRUE) %>%
cSplit("newcol", "_", type.convert = F) %>%
rename_with(~ LETTERS[seq_along(.x)])

This could be another solution. From what I understood, you basically want to shift the values after the first NA in each row to the left, replacing that NA, and I don't think coalesce can help you here.
library(dplyr)
library(purrr)
a %>%
pmap_dfr(~ {x <- c(...)[-which(is.na(c(...)))[1]]
setNames(x, LETTERS[seq_along(x)])})
# A tibble: 5 x 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 2 1 5 2
2 3 3 6 NA
3 4 4 6 1
4 3 7 4 3
5 2 5 3 NA

We may use base R: loop over the rows, order each row so that the NA elements come last, and then remove the columns that are entirely NA.
a[] <- t(apply(a, 1, \(x) x[order(is.na(x))]))
a[colSums(!is.na(a)) > 0]
A B C D
1 2 1 5 2
2 3 3 6 NA
3 4 4 6 1
4 3 7 4 3
5 2 5 3 NA
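If you prefer base R but want the "shift left" idea spelled out step by step, here is a row-by-row sketch starting from the original a from the question (keep the non-NA values in order, right-pad with NA to a common width, then rebind; a is all numeric, so apply()'s matrix coercion is harmless):
keep <- max(rowSums(!is.na(a)))        # widest row after dropping NAs (4 here)
shifted <- apply(a, 1, function(x) {
  x <- x[!is.na(x)]                    # non-missing values, original order preserved
  length(x) <- keep                    # right-pad with NA
  x
})
b <- as.data.frame(t(shifted))
names(b) <- LETTERS[seq_len(keep)]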

Get a value based on the value of another column in R - dplyr

I have this df:
df <- data.frame(month = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
day = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
flow = c(2,5,7,8,5,4,6,7,9,2,NA,1,6,10,2,NA,NA,NA,NA,NA))
and I want to reach this result:
month day flow dayofminflow
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
11 3 1 NA 2
12 3 2 1 2
13 3 3 6 2
14 3 4 10 2
15 3 5 2 2
16 4 1 NA NA
17 4 2 NA NA
18 4 3 NA NA
19 4 4 NA NA
20 4 5 NA NA
I was using this solution, but it returns NA for a month whenever that month contains any NA flow values:
newdf <- df %>% group_by(month) %>% mutate(Val=day[flow==min(flow)][1])
And this solution returns an error when all data is NA:
library(dplyr)
df <- df %>%
group_by(month) %>%
mutate(dayminflowofthemonth = day[which.min(flow)]) %>%
ungroup

You just need to add na.rm = TRUE to min() in your first solution so that it ignores NAs:
df %>%
group_by(month) %>%
mutate(dayofminflow = day[which(min(flow, na.rm = TRUE) == flow)][1])
# A tibble: 20 x 4
# Groups: month [4]
month day flow dayofminflow
<dbl> <dbl> <dbl> <dbl>
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
11 3 1 NA 2
12 3 2 1 2
13 3 3 6 2
14 3 4 10 2
15 3 5 2 2
16 4 1 NA NA
17 4 2 NA NA
18 4 3 NA NA
19 4 4 NA NA
20 4 5 NA NA
Though you do get a warning ("no non-missing arguments to min; returning Inf") for month 4, since all of its flow values are NA.
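If you want to avoid the warning altogether, you can guard the all-NA month explicitly and let which.min() skip the NAs; a sketch:
df %>%
  group_by(month) %>%
  mutate(dayofminflow = if (all(is.na(flow))) NA_real_ else day[which.min(flow)]) %>%
  ungroup()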

Removing groups with all NA in Data.Table or DPLYR in R

dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
dataWANT=data.frame("student"=c(1,1,1,3,3,3,5,5,5),
"time"=c(1,2,3,1,2,3,NA,2,3),
"score"=c(7,9,5,NA,3,9,7,NA,5))
I have a tall data frame, and I want to remove student IDs whose 'score' values are all NA or whose 'time' values are all NA. Only if every value is NA should a student be dropped; if there are just some NAs, I want to keep all of their records...

Is this what you want?
library(dplyr)
dataHAVE %>%
group_by(student) %>%
filter(!all(is.na(score)))
student time score
<dbl> <dbl> <dbl>
1 1 1 7
2 1 2 9
3 1 3 5
4 3 1 NA
5 3 2 3
6 3 3 9
7 5 NA 7
8 5 2 NA
9 5 3 5
A student is only kept if not (!) all of their score values are NA.
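If you also need to guard against a student whose time values are all NA but whose scores are not (none exists in dataHAVE, but the question allows for it), both conditions fit in the same filter; a sketch:
dataHAVE %>%
  group_by(student) %>%
  filter(!all(is.na(score)), !all(is.na(time))) %>%   # drop a student if either column is entirely NA
  ungroup()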
Since nobody suggested one, here is a solution using data.table:
library(data.table)
dataHAVE = data.table("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
Edit:
Previous but wrong code:
dataHAVE[, .SD[!(all(is.na(time)) & all(is.na(score)))], by = student]
New and correct code:
dataHAVE[, .SD[!(all(is.na(time)) | all(is.na(score)))], by = student]
Returns:
student time score
1: 1 1 7
2: 1 2 9
3: 1 3 5
4: 3 1 NA
5: 3 2 3
6: 3 3 9
7: 5 NA 7
8: 5 2 NA
9: 5 3 5
Edit:
Updated the data.table solution with @Cole's suggestion...
Here is a base R solution using subset + ave
dataWANT <- subset(dataHAVE,!(ave(time,student,FUN = function(v) all(is.na(v))) | ave(score,student,FUN = function(v) all(is.na(v)))))
or
dataWANT <- subset(dataHAVE,
!Reduce(`|`,Map(function(x) ave(get(x),student,FUN = function(v) all(is.na(v))), c("time","score"))))
Another option:
library(data.table)
setDT(dataHAVE, key="student")
dataHAVE[!student %in% dataHAVE[, if(any(colSums(is.na(.SD))==.N)) student, student]$V1]
Create a dummy variable and filter based on that. Note that this drops only the rows where both time and score are NA, so unlike dataWANT it keeps student 2, whose scores are all NA:
library("dplyr")
dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
dataHAVE %>%
mutate(check=is.na(time)&is.na(score)) %>%
filter(check == FALSE) %>%
select(-check)
#> student time score
#> 1 1 1 7
#> 2 1 2 9
#> 3 1 3 5
#> 4 2 1 NA
#> 5 2 2 NA
#> 6 2 3 NA
#> 7 3 1 NA
#> 8 3 2 3
#> 9 3 3 9
#> 10 5 NA 7
#> 11 5 2 NA
#> 12 5 3 5
Created on 2020-02-21 by the reprex package (v0.3.0)
data.table solution generalising to any number of columns:
dataHAVE[,
.SD[do.call("+", lapply(.SD, function(x) any(!is.na(x)))) == ncol(.SD)],
by = student]
# student time score
# 1: 1 1 7
# 2: 1 2 9
# 3: 1 3 5
# 4: 3 1 NA
# 5: 3 2 3
# 6: 3 3 9
# 7: 5 NA 7
# 8: 5 2 NA
# 9: 5 3 5
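For completeness, a tidyverse counterpart that also scales to any number of measured columns (a sketch; if_all() needs dplyr 1.0.4 or later, and the grouping column is excluded from everything() automatically):
library(dplyr)
dataHAVE %>%
  group_by(student) %>%
  filter(if_all(everything(), ~ !all(is.na(.x)))) %>%   # keep a student only if every column has some non-NA value
  ungroup()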

R: creating multiple new variables based on conditions of selection of other variables with similar names

I have a data frame where each condition (in the example: hope, dream, joy) has 5 variables (in the example, coded with the suffixes x, y, z, a, b, which are the same for each condition).
df <- data.frame(matrix(1:16,5,16))
names(df) <- c('ID','hopex','hopey','hopez','hopea','hopeb','dreamx','dreamy','dreamz','dreama','dreamb','joyx','joyy','joyz','joya','joyb')
df[1,2:6] <- NA
df[3:5,c(7,10,14)] <- NA
This is how the data looks like:
ID hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb
1 1 NA NA NA NA NA 15 4 9 14 3 8 13 2 7 12
2 2 7 12 1 6 11 16 5 10 15 4 9 14 3 8 13
3 3 8 13 2 7 12 NA 6 11 NA 5 10 15 NA 9 14
4 4 9 14 3 8 13 NA 7 12 NA 6 11 16 NA 10 15
5 5 10 15 4 9 14 NA 8 13 NA 7 12 1 NA 11 16
I want to create a new variable for each condition (hope, dream, joy) that codes whether all of the variables x...b for that condition are NA (0 if all are NA, 1 if any is non-NA). And I want the new variables to be stored in the data frame. Thus, the output should be this:
ID hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb hope joy dream
1 1 NA NA NA NA NA 15 4 9 14 3 8 13 2 7 12 0 1 1
2 2 7 12 1 6 11 16 5 10 15 4 9 14 3 8 13 1 1 1
3 3 8 13 2 7 12 NA 6 11 NA 5 10 15 NA 9 14 1 1 1
4 4 9 14 3 8 13 NA 7 12 NA 6 11 16 NA 10 15 1 1 1
5 5 10 15 4 9 14 NA 8 13 NA 7 12 1 NA 11 16 1 1 1
The code below does it, but I'm looking for a more elegant solution (e.g., for a case where I have even more conditions). I've tried with various combinations of all(), select(), mutate(), but while they all seem useful, I cannot figure out how to combine them to get what I want. I'm stuck and would be interested in learning to code more efficiently. Thanks in advance!
df$hope <- 0
df[is.na(df$hopex) == FALSE | is.na(df$hopey) == FALSE | is.na(df$hopez) == FALSE | is.na(df$hopea) == FALSE | is.na(df$hopeb) == FALSE, "hope"] <- 1
df$dream <- 0
df[is.na(df$dreamx) == FALSE | is.na(df$dreamy) == FALSE | is.na(df$dreamz) == FALSE | is.na(df$dreama) == FALSE | is.na(df$dreamb) == FALSE, "dream"] <- 1
df$joy<- 0
df[is.na(df$joyx) == FALSE | is.na(df$joyy) == FALSE | is.na(df$joyz) == FALSE | is.na(df$joya) == FALSE | is.na(df$joyb) == FALSE, "joy"] <- 1

Here is an option with tidyverse:
library(dplyr)
library(purrr)
library(magrittr)
df %>%
mutate(hope = select(., starts_with('hope')) %>%
is.na %>%
`!` %>%
rowSums %>%
is_greater_than(0) %>%
as.integer)
# hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb hope
#1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0
#2 1 1 4 3 2 3 5 4 5 2 5 NA 4 3 1 1
#3 2 NA 4 4 4 3 5 NA 5 5 4 NA 4 5 1 1
#4 4 3 NA 1 1 1 5 2 NA 5 1 2 1 1 1 1
#5 1 NA 4 NA NA 2 1 5 1 2 NA 3 1 2 5 1
Or with rowSums
df %>%
mutate(hope = +(rowSums(!is.na(select(., starts_with('hope'))))!= 0))
For multiple columns, we can create a function
f1 <- function(dat, colSubstr) {
dplyr::select(dat, starts_with(colSubstr)) %>%
is.na %>%
`!` %>%
rowSums %>%
is_greater_than(0) %>%
as.integer
}
df %>%
mutate(hope = f1(., 'hope'),
dream = f1(., 'dream'),
joy = f1(., 'joy'))
Or using base R
cbind(df, sapply(split.default(df, sub(".$", "", names(df))),
function(x) +(rowSums(!is.na(x)) != 0)))
If we want to leave out the ID column, subset the columns first:
nm1 <- setdiff(names(df), "ID")
cbind(df, sapply(split.default(df[nm1], sub(".$", "", names(df[nm1]))),
function(x) +(rowSums(!is.na(x)) != 0)))
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA, 1:5), 5 * 15, replace = TRUE),
ncol = 15, dimnames = list(NULL, paste0(rep(c("hope", "dream", "joy"),
each = 5), c('x', 'y', 'z', 'a', 'b')))))
df[1,] <- NA
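Applied to the OP's original df (the one with the ID column), a compact purrr sketch that loops over the condition prefixes by name; prefixes is just a helper vector assumed here:
library(dplyr)
library(purrr)
prefixes <- c("hope", "dream", "joy")
flags <- map_dfc(set_names(prefixes), function(p) {
  block <- df[grep(paste0("^", p), names(df))]   # the five columns for this condition
  +(rowSums(!is.na(block)) > 0)                  # 1 if any value is non-NA, else 0
})
df2 <- bind_cols(df, flags)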
