Removing groups with all NA in Data.Table or DPLYR in R - r

dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
dataWANT=data.frame("student"=c(1,1,1,3,3,3,5,5,5),
"time"=c(1,2,3,1,2,3,NA,2,3),
"score"=c(7,9,5,NA,3,9,7,NA,5))
I have a tall dataframe and in that data frame I want to remove student IDS that contain NA for all 'score' or for all 'time'. This is just if it is all NA, if there are some NA then I want to keep all their records...

Is this what you want?
library(dplyr)
dataHAVE %>%
group_by(student) %>%
filter(!all(is.na(score)))
student time score
<dbl> <dbl> <dbl>
1 1 1 7
2 1 2 9
3 1 3 5
4 3 1 NA
5 3 2 3
6 3 3 9
7 5 NA 7
8 5 2 NA
9 5 3 5
Each student is only kept if not (!) all score values are NA

Since nobody suggested one, here is a solution using data.table:
library(data.table)
dataHAVE = data.table("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
Edit:
Previous but wrong code:
dataHAVE[, .SD[!(all(is.na(time)) & all(is.na(score)))], by = student]
New and correct code:
dataHAVE[, .SD[!(all(is.na(time)) | all(is.na(score)))], by = student]
Returns:
student time score
1: 1 1 7
2: 1 2 9
3: 1 3 5
4: 3 1 NA
5: 3 2 3
6: 3 3 9
7: 5 NA 7
8: 5 2 NA
9: 5 3 5
Edit:
Updatet data.table solution with #Cole s suggestion...

Here is a base R solution using subset + ave
dataWANT <- subset(dataHAVE,!(ave(time,student,FUN = function(v) all(is.na(v))) | ave(score,student,FUN = function(v) all(is.na(v)))))
or
dataWANT <- subset(dataHAVE,
!Reduce(`|`,Map(function(x) ave(get(x),student,FUN = function(v) all(is.na(v))), c("time","score"))))

Another option:
library(data.table)
setDT(dataHAVE, key="student")
dataHAVE[!student %in% dataHAVE[, if(any(colSums(is.na(.SD))==.N)) student, student]$V1]

Create a dummy variable, and filter based on that
library("dplyr")
dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
dataHAVE %>%
mutate(check=is.na(time)&is.na(score)) %>%
filter(check == FALSE) %>%
select(-check)
#> student time score
#> 1 1 1 7
#> 2 1 2 9
#> 3 1 3 5
#> 4 2 1 NA
#> 5 2 2 NA
#> 6 2 3 NA
#> 7 3 1 NA
#> 8 3 2 3
#> 9 3 3 9
#> 10 5 NA 7
#> 11 5 2 NA
#> 12 5 3 5
Created on 2020-02-21 by the reprex package (v0.3.0)

data.table solution generalising to any number of columns:
dataHAVE[,
.SD[do.call("+", lapply(.SD, function(x) any(!is.na(x)))) == ncol(.SD)],
by = student]
# student time score
# 1: 1 1 7
# 2: 1 2 9
# 3: 1 3 5
# 4: 3 1 NA
# 5: 3 2 3
# 6: 3 3 9
# 7: 5 NA 7
# 8: 5 2 NA
# 9: 5 3 5

Related

Coalesce multiple columns at once

My question is similar to existing questions about coalesce, but I want to coalesce several columns by row such that NAs are pushed to the last column.
Here's an example:
If I have
a <- data.frame(A=c(2,NA,4,3,2), B=c(NA,3,4,NA,5), C= c(1,3,6,7,NA), D=c(5,6,NA,4,3), E=c(2,NA,1,3,NA))
A B C D E
1 2 NA 1 5 2
2 NA 3 3 6 NA
3 4 4 6 NA 1
4 3 NA 7 4 3
5 2 5 NA 3 NA
I would like to get
b <- data.frame(A=c(2,3,4,3,2), B=c(1,3,4,7,5), C=c(5,6,6,4,3), D=c(2,NA,1,3,NA))
A B C D
1 2 1 5 2
2 3 3 6 NA
3 4 4 6 1
4 3 7 4 3
5 2 5 3 NA
Does anyone have any ideas for how I could do this? I would be so grateful for any tips, as my searches have come up dry.
You can use unite and separate:
library(tidyverse)
a %>%
unite(newcol, everything(), na.rm = TRUE) %>%
separate(newcol, into = LETTERS[1:4])
A B C D
1 2 1 5 2
2 3 3 6 <NA>
3 4 4 6 1
4 3 7 4 3
5 2 5 3 <NA>
Since you have an unknown number of new columns in separate, one can use splitstackshape's function cSplit:
library(splitstackshape)
a %>%
unite(newcol, na.rm = TRUE) %>%
cSplit("newcol", "_", type.convert = F) %>%
rename_with(~ LETTERS)
This could be another solution. From what I understood you basically just want to shift the values in each row after the first NA to the left replacing the NA and I don't think coalesce can help you here.
library(dplyr)
library(purrr)
a %>%
pmap_dfr(~ {x <- c(...)[-which(is.na(c(...)))[1]]
setNames(x, LETTERS[seq_along(x)])})
# A tibble: 5 x 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 2 1 5 2
2 3 3 6 NA
3 4 4 6 1
4 3 7 4 3
5 2 5 3 NA
We may use base R - loop over the rows, order based on the NA elements and remove the columns that have all NAs
a[] <- t(apply(a, 1, \(x) x[order(is.na(x))]))
a[colSums(!is.na(a)) > 0]
A B C D
1 2 1 5 2
2 3 3 6 NA
3 4 4 6 1
4 3 7 4 3
5 2 5 3 NA

Is there a way to group values in a column between data gaps in R?

I want to group my data in different chunks when the data is continuous. Trying to get the group column from dummy data like this:
a b group
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
I tried using
test %>% mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test == TRUE)) %>%
select(group, everything())
But it doesn't work as expected:
group a b test
<int> <dbl> <dbl> <lgl>
1 1 1 1 TRUE
2 2 2 2 TRUE
3 3 3 3 TRUE
4 3 4 NA FALSE
5 3 5 NA FALSE
6 3 6 NA FALSE
7 4 7 12 TRUE
8 5 8 15 TRUE
9 5 9 NA FALSE
10 6 10 25 TRUE
Any advice?
Using rle in base R -
transform(df, group1 = with(rle(!is.na(b)), rep(cumsum(values), lengths))) |>
transform(group1 = replace(group1, is.na(b), NA))
# a b group group1
#1 1 1 1 1
#2 2 2 1 1
#3 3 3 1 1
#4 4 NA NA NA
#5 5 NA NA NA
#6 6 NA NA NA
#7 7 12 2 2
#8 8 15 2 2
#9 9 NA NA NA
#10 10 25 3 3
A couple of approaches to consider if you wish to use dplyr for this.
First, you could look at transition from non-complete cases (using lag) to complete cases.
library(dplyr)
test %>%
mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test & !lag(test, default = F))) %>%
mutate(group = replace(group, !test, NA))
Alternatively, you could add row numbers to your data.frame. Then, you could filter to include only complete cases, and group_by enumerating with cumsum based on gaps in row numbers. Then, join back to original data.
test$rn <- seq.int(nrow(test))
test %>%
filter(complete.cases(.)) %>%
group_by(group = c(0, cumsum(diff(rn) > 1)) + 1) %>%
right_join(test) %>%
arrange(rn) %>%
dplyr::select(-rn)
Output
a b group
<int> <int> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
Using data.table, get rleid then remove group IDs for NAs, then fix the sequence with factor to integer conversion:
library(data.table)
setDT(test)[, group1 := {
x <- complete.cases(test)
grp <- rleid(x)
grp[ !x ] <- NA
as.integer(factor(grp))
}]
# a b group group1
# 1: 1 1 1 1
# 2: 2 2 1 1
# 3: 3 3 1 1
# 4: 4 NA NA NA
# 5: 5 NA NA NA
# 6: 6 NA NA NA
# 7: 7 12 2 2
# 8: 8 15 2 2
# 9: 9 NA NA NA
# 10: 10 25 3 3

Get a value based on the value of another column in R - dplyr

i got this df:
df <- data.frame(month = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
day = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
flow = c(2,5,7,8,5,4,6,7,9,2,NA,1,6,10,2,NA,NA,NA,NA,NA))
and i want to reach this result:
month day flow dayofminflow
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
11 3 1 NA 2
12 3 2 1 2
13 3 3 6 2
14 3 4 10 2
15 3 5 2 2
16 4 1 NA NA
17 4 2 NA NA
18 4 3 NA NA
19 4 4 NA NA
20 4 5 NA NA
I was using this solution, but it returns NA when the first value is NA:
newdf <- df %>% group_by(month) %>% mutate(Val=day[flow==min(flow)][1])
And this solution returns an error when all data is NA:
library(dplyr)
df <- df %>%
group_by(month) %>%
mutate(dayminflowofthemonth = day[which.min(flow)]) %>%
ungroup
You would just change the default na.rm = TRUE in min() from the first solution to ignore NAs?
df %>%
group_by(month) %>%
mutate(dayofminflow = day[which(min(flow, na.rm = TRUE) == flow)][1])
# A tibble: 20 x 4
# Groups: month [4]
month day flow dayofminflow
<dbl> <dbl> <dbl> <dbl>
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
11 3 1 NA 2
12 3 2 1 2
13 3 3 6 2
14 3 4 10 2
15 3 5 2 2
16 4 1 NA NA
17 4 2 NA NA
18 4 3 NA NA
19 4 4 NA NA
20 4 5 NA NA
Though you get a warning no non-missing arguments to min; returning Inf from month 4 since all flow values are NA.

Fill Missing Values

data=data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,3))
library(dplyr);library(tidyverse)
data$timeWANTattempt=data$timeHAVE
data <- data %>%
group_by(student) %>%
fill(timeWANTattempt)+3
I have 'timeHAVE' and I want to replace missing times with the previous time +3. I show my dplyr attempt but it does not work. I seek a data.table solution. Thank you.
you can try.
data %>%
group_by(student) %>%
mutate(n_na = cumsum(is.na(timeHAVE))) %>%
mutate(timeHAVE = ifelse(is.na(timeHAVE), timeHAVE[n_na == 0 & lead(n_na) == 1] + 3*n_na, timeHAVE))
student timeHAVE timeWANT n_na
<dbl> <dbl> <dbl> <int>
1 1 1 1 0
2 1 4 4 0
3 1 7 7 0
4 1 10 10 0
5 2 2 2 0
6 2 5 5 0
7 2 8 8 1
8 2 11 11 1
9 3 6 6 0
10 3 9 9 1
11 3 12 12 2
12 3 15 15 3
13 4 3 3 0
I included the little helper n_na which counts NA's in a row. Then the second mutate muliplies the number of NAs with three and adds this to the first non-NA element before NA's
Here's an approach using 'locf' filling
setDT(data)
data[ , by = student, timeWANT := {
# carry previous observations forward whenever missing
locf_fill = nafill(timeHAVE, 'locf')
# every next NA, the amount shifted goes up by another 3
na_shift = cumsum(idx <- is.na(timeHAVE))
# add the shift, but only where the original data was missing
locf_fill[idx] = locf_fill[idx] + 3*na_shift[idx]
# return the full vector
locf_fill
}]
Warning that this won't work if a given student can have more than one non-consecutive set of NA values in timeHAVE
Another data.table option without grouping:
setDT(data)[, w := fifelse(is.na(timeHAVE) & student==shift(student),
nafill(timeHAVE, "locf") + 3L * rowid(rleid(timeHAVE)),
timeHAVE)]
output:
student timeHAVE timeWANT w
1: 1 1 1 1
2: 1 4 4 4
3: 1 7 7 7
4: 1 10 10 10
5: 2 2 2 2
6: 2 5 5 5
7: 2 NA 8 8
8: 2 11 11 11
9: 3 6 6 6
10: 3 NA 9 9
11: 3 NA 12 12
12: 3 NA 15 15
13: 4 NA NA NA
14: 4 3 3 3
data with student=4 having NA for the first timeHAVE:
data = data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,NA,3))

Shifting a variable up in R data frame with respect to another variable

I have a data.frame:
ID <-c(2,2,2,2,3,3,5,5)
Pur<-c(0,1,2,3,1,2,4,5)
df<-data.frame(ID,Pur)
I would like to push the Pur up for each ID to get the up.Pur as follows:
ID Pur up.Pur
2 0 1
2 1 2
2 2 3
2 3 NA
3 1 2
3 2 NA
5 4 5
5 5 NA
Would appreciate your help with this.
Here is a dplyr approach
library(dplyr)
ID <-c(2,2,2,2,3,3,5,5)
Pur<-c(0,1,2,3,1,2,4,5)
df<-data.frame(ID,Pur)
df %>%
group_by(ID) %>%
mutate(up.Pur = lead(Pur))
# Source: local data frame [8 x 3]
# Groups: ID [3]
#
# ID Pur up.Pur
# <dbl> <dbl> <dbl>
# 1 2 0 1
# 2 2 1 2
# 3 2 2 3
# 4 2 3 NA
# 5 3 1 2
# 6 3 2 NA
# 7 5 4 5
# 8 5 5 NA
For completeness, I've added a base R approach, just in case you don't feel like installing any packages.
dfList = split(df, ID)
dfList = lapply(dfList, function(x){
x$up.Pur = c(x$Pur[-1], NA)
return(x)
})
unsplit(dfList, ID)
# ID Pur up.Pur
# 1 2 0 1
# 2 2 1 2
# 3 2 2 3
# 4 2 3 NA
# 5 3 1 2
# 6 3 2 NA
# 7 5 4 5
# 8 5 5 NA
We can use shift from data.table
library(data.table)
setDT(df)[, up.Pur := shift(Pur, type = "lead"), by = ID]
df
# ID Pur up.Pur
#1: 2 0 1
#2: 2 1 2
#3: 2 2 3
#4: 2 3 NA
#5: 3 1 2
#6: 3 2 NA
#7: 5 4 5
#8: 5 5 NA

Resources