I have a large dataset that contains many NAs. For each column, I want to find the rows where the leading run of NAs ends and the trailing run of NAs begins. For example, for column A, the output should be the second row (the last NA before a number) and the fifth row (the first NA after a number). My code, shown below, does not work.
nonnaindex <- which(!is.na(df))
firstnonna <- apply(nonnaindex, 2, min)
Data:
ID A B C
1 NA NA 3
2 NA 2 2
3 3 3 1
4 4 5 NA
5 NA 6 NA
I believe this function might be what you are looking for:
first_and_last_na_row <- function(DT, col) {
library(data.table)
data.table(DT)[, grp := rleid(is.na(get(col)))][
, rbind(last(.SD[is.na(get(col)) & grp == min(grp)]),
first(.SD[is.na(get(col)) & grp == max(grp)]))][
!is.na(ID)][, grp := NULL][]
}
which returns
first_and_last_na_row(DT, "A")
ID A B C
1: 2 NA 2 2
2: 5 NA 6 NA
first_and_last_na_row(DT, "B")
ID A B C
1: 1 NA NA 3
first_and_last_na_row(DT, "C")
ID A B C
1: 4 4 5 NA
first_and_last_na_row(DT, "D")
Empty data.table (0 rows) of 4 cols: ID,A,B,C
in case of
DT
ID A B C
1: 1 NA NA 3
2: 2 NA 2 2
3: 3 3 3 1
4: 4 4 5 NA
5: 5 NA 6 NA
or
first_and_last_na_row(DT2, "D")
ID A B C D
1: 1 NA NA 3 NA
in case of Akrun's (simplified) example
DT2
ID A B C D
1: 1 NA NA 3 NA
2: 2 NA 2 2 2
3: 3 3 3 1 NA
4: 4 4 5 NA NA
5: 5 NA 6 NA 4
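For comparison, the same per-column logic can be sketched in base R with rle(). This is only an illustration, not part of the function above: the helper name na_run_edges is made up here, and it assumes that a column consisting of a single run (all NA or no NA at all) should yield no rows.

```r
# Hypothetical helper: returns the row index of the last NA in a leading NA run
# and of the first NA in a trailing NA run (either may be absent).
na_run_edges <- function(x) {
  r <- rle(is.na(x))          # runs of NA / non-NA values
  k <- length(r$values)       # number of runs
  n <- length(x)
  out <- integer(0)
  if (k > 1L) {               # assumption: a single run yields nothing
    # leading run is NA: its last element sits at position lengths[1]
    if (r$values[1L]) out <- c(out, r$lengths[1L])
    # trailing run is NA: its first element sits at n - lengths[k] + 1
    if (r$values[k]) out <- c(out, n - r$lengths[k] + 1L)
  }
  out
}

A <- c(NA, NA, 3, 4, NA)
B <- c(NA, 2, 3, 5, 6)
C <- c(3, 2, 1, NA, NA)
D <- c(NA, 2, NA, NA, 4)

res_A <- na_run_edges(A)
res_B <- na_run_edges(B)
res_C <- na_run_edges(C)
res_D <- na_run_edges(D)
```

Applied over the columns of a data frame, for example with lapply(df[-1], na_run_edges), this gives one index vector per column.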
Edit: Faster version using melt()
The OP has commented that his production data set consists of 4000 columns and 192 rows and that he needs the indices to clean another data set. He tried a for loop across all columns, which is very slow.
Therefore, I suggest reshaping the data set from wide to long format and using data.table's efficient grouping mechanism:
# reshape from wide to long format
long <- setDT(DT2)[, melt(.SD, id = "ID")][
# add grouping variable to distinguish continuous streaks of NA/non-NA values
# for each variable
, grp := rleid(variable, is.na(value))][
# set sort order just for convenience, not essential
, setorder(.SD, variable, ID)]
long
ID variable value grp
1: 1 A NA 1
2: 2 A NA 1
3: 3 A 3 2
4: 4 A 4 2
5: 5 A NA 3
6: 1 B NA 4
7: 2 B 2 5
8: 3 B 3 5
9: 4 B 5 5
10: 5 B 6 5
11: 1 C 3 6
12: 2 C 2 6
13: 3 C 1 6
14: 4 C NA 7
15: 5 C NA 7
16: 1 D NA 8
17: 2 D 2 9
18: 3 D NA 10
19: 4 D NA 10
20: 5 D 4 11
Now we can get the indices of the leading and the trailing NA sequence, respectively, for each variable (if any) by
# starting NA sequence
long[, .(ID = which(is.na(value) & grp == min(grp))), by = variable]
variable ID
1: A 1
2: A 2
3: B 1
4: D 1
# ending NA sequence
long[, .(ID = which(is.na(value) & grp == max(grp))), by = variable]
variable ID
1: A 5
2: C 4
3: C 5
Note that this returns all indices of the starting or ending NA sequences which might be more convenient for subsequent cleaning of another data set. If only the last and first indices are required this can be achieved by
long[long[, is.na(value) & grp == min(grp), by =variable]$V1, .(ID = max(ID)), by = variable]
variable ID
1: A 2
2: B 1
3: D 1
long[long[, is.na(value) & grp == max(grp), by =variable]$V1, .(ID = min(ID)), by = variable]
variable ID
1: A 5
2: C 4
I have tested this approach using a dummy data set of 192 rows times 4000 columns. The whole operation needed less than one second.
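A test of this kind could be set up roughly as follows. This is only a sketch: the seed, the 30% share of NAs and the object names (big, long_big, lead_na, trail_na) are assumptions for illustration, not the data actually used above, and no particular timing is implied.

```r
library(data.table)

set.seed(42)                     # arbitrary seed, only for reproducibility
n_row <- 192L
n_col <- 4000L

# dummy wide table: random numbers with about 30% NAs scattered at random
m <- matrix(rnorm(n_row * n_col), nrow = n_row)
m[sample(length(m), as.integer(round(0.3 * length(m))))] <- NA
big <- as.data.table(m)          # columns are named V1, V2, ...
big[, ID := seq_len(.N)]

timing <- system.time({
  # wide to long, then label runs of NA / non-NA values per variable
  long_big <- melt(big, id.vars = "ID")
  long_big[, grp := rleid(variable, is.na(value))]
  # all IDs belonging to the leading and to the trailing NA run of each variable
  lead_na  <- long_big[, .(ID = ID[is.na(value) & grp == min(grp)]), by = variable]
  trail_na <- long_big[, .(ID = ID[is.na(value) & grp == max(grp)]), by = variable]
})
timing
```

Here ID is subset directly instead of using which(), which gives the same result as long as ID equals the row position within each variable.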
I want to filter my data using conditions, but the presence of NA affects the results.
For example:
dt <- data.table(a=c(1:4,NA), b=c(NA,2,1,4,5), d=c(1,2,NA,4,NA))
dt
a b d
1: 1 NA 1
2: 2 2 2
3: 3 1 NA
4: 4 4 4
5: NA 5 NA
when I do
subset(dt, !(b < a))
a b d
1: 2 2 2
2: 4 4 4
i.e., if either a or b is NA, that row is excluded.
but the result I want is
a b d
1: 1 NA 1
2: 2 2 2
3: 4 4 4
4: NA 5 NA
That is, I want a row to be excluded if and only if the condition holds.
If I add more conditions, such as subset(dt, is.na(a) | is.na(b) | !(b < a)), it works as expected, but I was looking for a way to express 'if and only if' using only operators like & and |.
Is this possible?
Thank you!
This works:
dt[!which(dt$b < dt$a), ]
a b d
1: 1 NA 1
2: 2 2 2
3: 4 4 4
4: NA 5 NA
This workaround drops only the rows for which the condition dt$b < dt$a is TRUE; rows where it evaluates to FALSE or NA are kept. It relies on data.table treating a leading ! in i as "all rows except these".
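The underlying issue is R's three-valued logic: a comparison involving NA yields NA, and negating NA is still NA, so those rows are never selected. A small sketch of an alternative that stays with logical vectors (the names cond, keep and res are only for illustration) uses %in% TRUE, which maps NA to FALSE:

```r
library(data.table)

dt <- data.table(a = c(1:4, NA), b = c(NA, 2, 1, 4, 5), d = c(1, 2, NA, 4, NA))

cond <- dt$b < dt$a          # NA where either a or b is missing
keep <- !(cond %in% TRUE)    # TRUE unless the condition is definitely TRUE
res  <- dt[keep]
res
```

Unlike the which() variant, this does not depend on how an empty index vector is negated.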
We may use if_any
library(dplyr)
dt %>%
filter(if_any(c(b, a), is.na)|b >=a)
a b d
1: 1 NA 1
2: 2 2 2
3: 4 4 4
4: NA 5 NA
I have a data table that looks like this:
DT<-data.table(day=c(1,2,3,4,5,6,7,8),Consumption=c(5,9,10,2,NA,NA,NA,NA),id=c(1,2,3,1,1,2,2,1))
day Consumption id
1: 1 5 1
2: 2 9 2
3: 3 10 3
4: 4 2 1
5: 5 NA 1
6: 6 NA 2
7: 7 NA 2
8: 8 NA 1
I want to create two columns, computed within the id groups, that show the last non-NA consumption value before each observation and the day difference between the two observations. So far, I tried this:
DT[, j := day-shift(day, fill = NA,n=1), by = id]
DT[, yj := shift(Consumption, fill = NA,n=1), by = id]
day Consumption id j yj
1: 1 5 1 NA NA
2: 2 9 2 NA NA
3: 3 10 3 NA NA
4: 4 2 1 3 5
5: 5 NA 1 1 2
6: 6 NA 2 4 9
7: 7 NA 2 1 NA
8: 8 NA 1 3 NA
However, I want the lagged consumption values with n=1 to come from rows that have non-NA consumption values. For example, in the 7th row of column "yj", the value is NA because it comes from the 6th row, which has NA consumption. I want it to come from the 2nd row. Therefore, I would like to end up with this data table:
day Consumption id j yj
1: 1 5 1 NA NA
2: 2 9 2 NA NA
3: 3 10 3 NA NA
4: 4 2 1 3 5
5: 5 NA 1 1 2
6: 6 NA 2 4 9
7: 7 NA 2 5 9
8: 8 NA 1 4 2
Note: The reason for specifically using the parameter n of shift function is that I will also need the 2nd last non-Na consumption values in the next step.
Thank You
Here's a data.table solution with an assist from zoo:
library(data.table)
library(zoo)
DT[, `:=`(day_shift = shift(day),
yj = shift(Consumption)),
by = id]
#make the NA yj records NA for the days
DT[is.na(yj), day_shift := NA_integer_]
#fill the DT with the last non-NA value
DT[,
`:=`(day_shift = zoo::na.locf(day_shift, na.rm = FALSE),
yj = zoo::na.locf(yj, na.rm = FALSE)),
by = id]
# finally calculate j
DT[, j:= day - day_shift]
# you can clean up the ordering or remove columns later
DT
day Consumption id day_shift yj j
1: 1 5 1 NA NA NA
2: 2 9 2 NA NA NA
3: 3 10 3 NA NA NA
4: 4 2 1 1 5 3
5: 5 NA 1 4 2 1
6: 6 NA 2 2 9 4
7: 7 NA 2 2 9 5
8: 8 NA 1 4 2 4
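Since the question also asks for the 2nd last non-NA value, here is a sketch of one way to generalise the lag without zoo. It is not the approach above: the helper nth_prev_non_na and the columns day_prev and yj2 are hypothetical names, and the idea is simply to count the non-NA values strictly before each row and index into the vector of non-NA values.

```r
library(data.table)

# Hypothetical helper: for each position, return the element of x that belongs
# to the n-th most recent row (strictly before it) where ref is not NA.
nth_prev_non_na <- function(x, ref = x, n = 1L) {
  ok  <- !is.na(ref)
  idx <- cumsum(ok) - ok - n + 1L   # position within x[ok]; < 1 means "none"
  idx[idx < 1L] <- NA_integer_
  x[ok][idx]
}

DT <- data.table(day = c(1, 2, 3, 4, 5, 6, 7, 8),
                 Consumption = c(5, 9, 10, 2, NA, NA, NA, NA),
                 id = c(1, 2, 3, 1, 1, 2, 2, 1))

DT[, `:=`(yj       = nth_prev_non_na(Consumption, Consumption, 1L),
          day_prev = nth_prev_non_na(day, Consumption, 1L),
          yj2      = nth_prev_non_na(Consumption, Consumption, 2L)),
   by = id]
DT[, j := day - day_prev]
DT
```

With n = 1L this reproduces the yj and j columns shown above; yj2 holds the second last non-NA value where one exists.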
I'm trying to fill missing values in a data.table column with the value from the row above it using shift, but I can only get it to work if I first create a temporary variable. Is this the expected behavior? MWE:
library(data.table)
dt <- data.table(x=c(1, NA))
dt[is.na(x), x:=shift(x)]
# Fails
dt <- data.table(x=c(1, NA))
dt <- dt[, x.lag:=shift(x)]
dt[is.na(x), x:=x.lag]
# Works
I'm a little new to data.table, but I think the rolling join might be what you're after here. Presumably you want to be able to impute a data point when there are multiple missing values in sequence, in which case your shift method will just fill NA.
Your example is a little too minimal to really see what you're doing, but if I expand it a little to include a record column, where various x values are missing;
library(data.table)
dt <- data.table(record=1:10, x=c(1, NA, NA, 4, 5, 6, NA, NA, NA, 10))
> dt
record x
1: 1 1
2: 2 NA
3: 3 NA
4: 4 4
5: 5 5
6: 6 6
7: 7 NA
8: 8 NA
9: 9 NA
10: 10 10
Then create a copy with only the non-missing rows, and set the key to the record column
dtNA <- dt[!is.na(x)]
setkey(dtNA, record)
> dtNA
record x
1: 1 1
2: 4 4
3: 5 5
4: 6 6
5: 10 10
Then do a rolling join (whereby if a value is missing, the previous record is rolled forwards) on the full list of records
dtNA[data.table(record=dt$record, key="record"), roll=TRUE]
record x
1: 1 1
2: 2 1
3: 3 1
4: 4 4
5: 5 5
6: 6 6
7: 7 6
8: 8 6
9: 9 6
10: 10 10
Compared to your method which produces the following (still has NA values in x);
dt[, x.lag:=shift(x)]
dt[is.na(x), x:=x.lag]
> dt
record x x.lag
1: 1 1 NA
2: 2 1 1
3: 3 NA NA
4: 4 4 NA
5: 5 5 4
6: 6 6 5
7: 7 6 6
8: 8 NA NA
9: 9 NA NA
10: 10 10 NA
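As for why the one-step version in the question fails: inside dt[is.na(x), ...], j only sees the rows selected by i, so shift(x) is applied to a vector that consists of NAs only. The sketch below illustrates this and, as an alternative to the rolling join, uses nafill(), assuming a data.table version that provides it. The names sub_shift and x_filled are only for illustration.

```r
library(data.table)

dt <- data.table(record = 1:10, x = c(1, NA, NA, 4, 5, 6, NA, NA, NA, 10))

# j is evaluated on the subset chosen by i, i.e. on the NA rows only,
# so shifting within that subset can only produce NA
sub_shift <- dt[is.na(x), shift(x)]
sub_shift

# last observation carried forward over the whole column
dt[, x_filled := nafill(x, type = "locf")]
dt
```

nafill() works on numeric columns; for other types the rolling join above remains an option.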
I have a very large data set (millions of rows) where I need to turn into NA certain rows when a var1 equals "Z". However, I also need to turn into NA the preceding row to a row with var1="Z".
E.g.:
id var1
1 A
1 B
1 Z
1 S
1 A
1 B
2 A
2 B
3 A
3 B
3 A
3 B
4 A
4 B
4 A
4 B
In this case, the second row and the third row for id==1 should be NA.
I have tried a loop but it doesn't work as the data set is very large.
for (i in 1:length(df$var1)){
if(df$var1[i] =="Z"){
df[i,] <- NA
df[(i-1),] <- NA
}
}
I have also tried to use data.table package unsuccessfully. Do you have any idea of how I could do it or what is the right term to look for info on what I am trying to do?
Maybe do it like this using data.table:
df <- as.data.table(read.table(header=T, file='clipboard'))
df$var1 <- as.character(df$var1)
#find where var1 == Z
index <- df[, which(var1 == 'Z')]
#add the previous lines too
index <- c(index, index-1)
#convert to NA
df[index, var1 := NA ]
Or in one call:
df[c(which(var1 == 'Z'), which(var1 == 'Z') - 1), var1 := NA ]
Output:
> df
id var1
1: 1 A
2: 1 NA
3: 1 NA
4: 1 S
5: 1 A
6: 1 B
7: 2 A
8: 2 B
9: 3 A
10: 3 B
11: 3 A
12: 3 B
13: 4 A
14: 4 B
15: 4 A
16: 4 B
If you want to take the preceding indices into account only if they belong to the same id, I would suggest using the .I and by combination, which makes sure that you are not taking indices from the previous id
setDT(df)[, var1 := as.character(var1)]
indx <- df[, {indx <- which(var1 == "Z") ; .I[c(indx - 1L, indx)]}, by = id]$V1
df[indx, var1 := NA_character_]
df
# id var1
# 1: 1 A
# 2: 1 NA
# 3: 1 NA
# 4: 1 S
# 5: 1 A
# 6: 1 B
# 7: 2 A
# 8: 2 B
# 9: 3 A
# 10: 3 B
# 11: 3 A
# 12: 3 B
# 13: 4 A
# 14: 4 B
# 15: 4 A
# 16: 4 B
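The difference to the ungrouped version only shows when a "Z" is the first row of an id, which does not happen in the example data. A small sketch with made-up data (the objects naive and grouped are hypothetical) makes it visible:

```r
library(data.table)

# made-up data: the second "Z" is the first row of id 2
df <- data.table(id = c(1, 1, 1, 2, 2), var1 = c("A", "Z", "B", "Z", "S"))

# ungrouped: the row before each "Z" is blanked even if it belongs to another id
naive <- copy(df)
naive[c(which(var1 == "Z"), which(var1 == "Z") - 1L), var1 := NA_character_]

# grouped: an index of 0 within the group selects nothing from .I
grouped <- copy(df)
indx <- grouped[, {i <- which(var1 == "Z"); .I[c(i - 1L, i)]}, by = id]$V1
grouped[indx, var1 := NA_character_]

naive
grouped
```

In naive, row 3 (the last row of id 1) is set to NA as well, whereas grouped keeps it.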
You can have a base R approach:
x <- df$var1 == 'Z'
df[x | c(x[-1], FALSE), 'var1'] <- NA
# id var1
#1 1 A
#2 1 <NA>
#3 1 <NA>
#4 1 S
#5 1 A
#6 1 B
#7 2 A
#8 2 B
#9 3 A
#10 3 B
#11 3 A
#12 3 B
#13 4 A
#14 4 B
#15 4 A
#16 4 B
My question is essentially the same as this question: data.table join then add columns to existing data.frame without re-copy.
Basically I have a template with keys and I want to assign columns from other data.tables to the template by the same keys.
> template
id1 id2
1: a 1
2: a 2
3: a 3
4: a 4
5: a 5
6: b 1
7: b 2
8: b 3
9: b 4
10: b 5
> x
id1 id2 value
1: a 2 0.01649728
2: a 3 -0.27918482
3: b 3 0.86933718
> y
id1 id2 value
1: a 4 -1.163439
2: b 4 2.267872
3: b 5 1.083258
> template[x, value := i.value]
> template[y, value := i.value]
> template
id1 id2 value
1: a 1 NA
2: a 2 0.01649728
3: a 3 -0.27918482
4: a 4 -1.16343917
5: a 5 NA
6: b 1 NA
7: b 2 NA
8: b 3 0.86933718
9: b 4 2.26787248
10: b 5 1.08325793
>
But if x and y have say 100 columns, then it is not possible to write out the value := i.value syntax for all columns. Is there a way to do the same thing but for all the columns in x and y?
EDIT:
If I do y[x[template]], then it creates separate value columns, which is not intended:
> y[x[template]]
id1 id2 value value.1
1: a 1 NA NA
2: a 2 NA 0.01649728
3: a 3 NA -0.27918482
4: a 4 -1.163439 NA
5: a 5 NA NA
6: b 1 NA NA
7: b 2 NA NA
8: b 3 NA 0.86933718
9: b 4 2.267872 NA
10: b 5 1.083258 NA
>
Just create a function that takes names as arguments and constructs the expression for you. And then eval it each time by passing the names of each data.table you require. Here's an illustration:
get_expr <- function(x) {
# 'x' is the names vector
expr = paste0("i.", x)
expr = lapply(expr, as.name)
setattr(expr, 'names', x)
as.call(c(quote(`:=`), expr))
}
> get_expr('value') ## generates the required expression
# `:=`(value = i.value)
template[x, eval(get_expr("value"))]
template[y, eval(get_expr("value"))]
# id1 id2 value
# 1: a 1 NA
# 2: a 2 0.01649728
# 3: a 3 -0.27918482
# 4: a 4 -1.16343900
# 5: a 5 NA
# 6: b 1 NA
# 7: b 2 NA
# 8: b 3 0.86933718
# 9: b 4 2.26787200
# 10: b 5 1.08325800
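To cover the case of many columns, the names vector can be derived from the tables themselves. The following is a sketch with made-up data: the second column flag, the values, and the use of on = instead of keys are assumptions for illustration.

```r
library(data.table)

get_expr <- function(x) {
  # 'x' is the names vector
  expr <- lapply(paste0("i.", x), as.name)
  setattr(expr, "names", x)
  as.call(c(quote(`:=`), expr))
}

template <- CJ(id1 = c("a", "b"), id2 = 1:5)
x <- data.table(id1 = c("a", "a", "b"), id2 = c(2L, 3L, 3L),
                value = c(0.1, 0.2, 0.3), flag = c("p", "q", "r"))
y <- data.table(id1 = c("a", "b", "b"), id2 = c(4L, 4L, 5L),
                value = c(1.1, 1.2, 1.3), flag = c("s", "t", "u"))

# every non-key column of x (assumed to be the same set in y)
cols <- setdiff(names(x), c("id1", "id2"))

template[x, eval(get_expr(cols)), on = c("id1", "id2")]
template[y, eval(get_expr(cols)), on = c("id1", "id2")]
template
```

Each join only updates the rows it matches, so the second call fills further rows without overwriting those filled from x.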