I have a data table that looks like this:
DT<-data.table(day=c(1,2,3,4,5,6,7,8),Consumption=c(5,9,10,2,NA,NA,NA,NA),id=c(1,2,3,1,1,2,2,1))
day Consumption id
1: 1 5 1
2: 2 9 2
3: 3 10 3
4: 4 2 1
5: 5 NA 1
6: 6 NA 2
7: 7 NA 2
8: 8 NA 1
I want to create two columns: one showing the last non-NA consumption value before each observation, and one showing the day difference between those observations, both computed within the id groups. So far, I have tried this:
DT[, j := day-shift(day, fill = NA,n=1), by = id]
DT[, yj := shift(Consumption, fill = NA,n=1), by = id]
day Consumption id j yj
1: 1 5 1 NA NA
2: 2 9 2 NA NA
3: 3 10 3 NA NA
4: 4 2 1 3 5
5: 5 NA 1 1 2
6: 6 NA 2 4 9
7: 7 NA 2 1 NA
8: 8 NA 1 3 NA
However, I want the lagged consumption values (n=1) to come only from rows that have non-NA consumption values. For example, in the 7th row the yj value is NA because it comes from the 6th row, which has NA consumption; I want it to come from the 2nd row instead. Therefore, I would like to end up with this data table:
day Consumption id j yj
1: 1 5 1 NA NA
2: 2 9 2 NA NA
3: 3 10 3 NA NA
4: 4 2 1 3 5
5: 5 NA 1 1 2
6: 6 NA 2 4 9
7: 7 NA 2 5 9
8: 8 NA 1 4 2
Note: I specifically want to use the n parameter of the shift function because I will also need the 2nd last non-NA consumption values in the next step.
Thank You
Here's a data.table solution with an assist from zoo:
library(data.table)
library(zoo)
DT[, `:=`(day_shift = shift(day),
yj = shift(Consumption)),
by = id]
# where there is no lagged consumption value, blank out the lagged day as well
DT[is.na(yj), day_shift := NA_integer_]
# carry the last non-NA value forward within each id
DT[,
`:=`(day_shift = zoo::na.locf(day_shift, na.rm = FALSE),
yj = zoo::na.locf(yj, na.rm = FALSE)),
by = id]
# finally calculate j
DT[, j:= day - day_shift]
# you can clean up the ordering or remove columns later
DT
day Consumption id day_shift yj j
1: 1 5 1 NA NA NA
2: 2 9 2 NA NA NA
3: 3 10 3 NA NA NA
4: 4 2 1 1 5 3
5: 5 NA 1 4 2 1
6: 6 NA 2 2 9 4
7: 7 NA 2 2 9 5
8: 8 NA 1 4 2 4
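Since the question mentions needing the 2nd last non-NA value later on, here is a rough sketch of a variant that takes n as a parameter. The helper name nth_prev_obs is made up for this example (it is not part of data.table or zoo), and the sketch assumes the rows are already sorted by day within each id.

```r
library(data.table)

DT <- data.table(day = c(1, 2, 3, 4, 5, 6, 7, 8),
                 Consumption = c(5, 9, 10, 2, NA, NA, NA, NA),
                 id = c(1, 2, 3, 1, 1, 2, 2, 1))

# Hypothetical helper: for each element of x, the position (within x) of the
# n-th most recent non-NA element strictly before it, or NA if there is none
nth_prev_obs <- function(x, n = 1L) {
  obs <- !is.na(x)
  idx <- which(obs)                 # positions of the observed values
  k <- cumsum(obs) - obs - n + 1L   # rank of the wanted value among the observed ones
  k[k < 1L] <- NA                   # not enough earlier observations
  idx[k]
}

# positions are computed per id, then used to look up value and day
DT[, pos := nth_prev_obs(Consumption, 1L), by = id]
DT[, `:=`(yj = Consumption[pos], j = day - day[pos]), by = id]
DT[, pos := NULL]
```

Calling nth_prev_obs(Consumption, 2L) in the first step should pick the second-last observed row instead, which is roughly what the n argument of shift was meant to do here.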
Related
I can't seem to figure this out. I want to make a new column in my data frame containing the sum of several columns divided by the number of columns that contribute to the sum.
So, from this:
ID 2003 2004 2005 2006
1 1 4 1 NA
2 2 2 NA 3
3 1 3 NA NA
4 4 1 1 NA
5 3 1 4 2
to this:
ID 2003 2004 2005 2006 SUM/col
1 1 4 1 NA 2
2 2 2 NA 3 2.33
3 1 3 NA NA 2
4 4 1 1 NA 2
5 3 1 4 2 2.5
We can use the rowMeans function and set na.rm = TRUE. dt[, -1] is a way to exclude the first column for the analysis.
dt$`SUM/col` <- rowMeans(dt[, -1], na.rm = TRUE)
dt
ID X2003 X2004 X2005 X2006 SUM/col
1 1 1 4 1 NA 2.000000
2 2 2 2 NA 3 2.333333
3 3 1 3 NA NA 2.000000
4 4 4 1 1 NA 2.000000
5 5 3 1 4 2 2.500000
DATA
dt <- read.table(text = "ID 2003 2004 2005 2006
1 1 4 1 NA
2 2 2 NA 3
3 1 3 NA NA
4 4 1 1 NA
5 3 1 4 2",
header = TRUE)
If your data.frame is called df, then try:
df$"SUM/col" <- apply(df[, -1], 1, function(x) mean(x, na.rm = TRUE))
The apply function calculates, for each row, the sum (excluding NAs) divided by the number of non-NA elements. The ID column is dropped with df[, -1] so that it does not enter the mean. The resulting vector is then added as a column to df.
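To make the "sum divided by the number of contributing columns" arithmetic explicit, here is a small sketch that computes it by hand and compares it with rowMeans. The data frame is rebuilt from the question's example; the names vals, manual and builtin are just for illustration.

```r
df <- data.frame(ID    = 1:5,
                 X2003 = c(1, 2, 1, 4, 3),
                 X2004 = c(4, 2, 3, 1, 1),
                 X2005 = c(1, NA, NA, 1, 4),
                 X2006 = c(NA, 3, NA, NA, 2))

vals <- df[, -1]  # drop the ID column so it does not enter the calculation

# row sum over the non-NA cells, divided by how many cells were non-NA
manual  <- rowSums(vals, na.rm = TRUE) / rowSums(!is.na(vals))
builtin <- rowMeans(vals, na.rm = TRUE)

df$`SUM/col` <- manual
```

Both vectors should agree, so rowMeans with na.rm = TRUE is just the shorter way of writing the same thing.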
I have a large dataset that contains many NAs. For each column, I want to find the row where the leading run of NAs ends and the row where the trailing run of NAs begins. For example, for column A, I want the output to be the second row (the last NA before a number) and the fifth row (the first NA after a number). My code, shown below, does not work.
nonnaindex <- which(!is.na(df))
firstnonna <- apply(nonnaindex, 2, min)
Data:
ID A B C
1 NA NA 3
2 NA 2 2
3 3 3 1
4 4 5 NA
5 NA 6 NA
I believe this function might be what you are looking for:
first_and_last_na_row <- function(DT, col) {
library(data.table)
data.table(DT)[, grp := rleid(is.na(get(col)))][
, rbind(last(.SD[is.na(get(col)) & grp == min(grp)]),
first(.SD[is.na(get(col)) & grp == max(grp)]))][
!is.na(ID)][, grp := NULL][]
}
which returns
first_and_last_na_row(DT, "A")
ID A B C
1: 2 NA 2 2
2: 5 NA 6 NA
first_and_last_na_row(DT, "B")
ID A B C
1: 1 NA NA 3
first_and_last_na_row(DT, "C")
ID A B C
1: 4 4 5 NA
first_and_last_na_row(DT, "D")
Empty data.table (0 rows) of 4 cols: ID,A,B,C
in case of
DT
ID A B C
1: 1 NA NA 3
2: 2 NA 2 2
3: 3 3 3 1
4: 4 4 5 NA
5: 5 NA 6 NA
or
first_and_last_na_row(DT2, "D")
ID A B C D
1: 1 NA NA 3 NA
in case of Akrun's (simplified) example
DT2
ID A B C D
1: 1 NA NA 3 NA
2: 2 NA 2 2 2
3: 3 3 3 1 NA
4: 4 4 5 NA NA
5: 5 NA 6 NA 4
Edit: Faster version using melt()
The OP has commented that his production data set consists of 4000 columns and 192 rows and that he needs the indices to clean another data set. He tried a for loop across all columns which is very slow.
Therefore, I suggest reshaping the data set from wide to long format and using data.table's efficient grouping mechanism:
# reshape from wide to long format
long <- setDT(DT2)[, melt(.SD, id = "ID")][
# add grouping variable to distinguish continuous streaks of NA/non-NA values
# for each variable
, grp := rleid(variable, is.na(value))][
# set sort order just for convenience, not essential
, setorder(.SD, variable, ID)]
long
ID variable value grp
1: 1 A NA 1
2: 2 A NA 1
3: 3 A 3 2
4: 4 A 4 2
5: 5 A NA 3
6: 1 B NA 4
7: 2 B 2 5
8: 3 B 3 5
9: 4 B 5 5
10: 5 B 6 5
11: 1 C 3 6
12: 2 C 2 6
13: 3 C 1 6
14: 4 C NA 7
15: 5 C NA 7
16: 1 D NA 8
17: 2 D 2 9
18: 3 D NA 10
19: 4 D NA 10
20: 5 D 4 11
Now we can get the indices of the leading or trailing NA sequence, respectively, for each variable (if any) with
# starting NA sequence
long[, .(ID = which(is.na(value) & grp == min(grp))), by = variable]
variable ID
1: A 1
2: A 2
3: B 1
4: D 1
# ending NA sequence
long[, .(ID = which(is.na(value) & grp == max(grp))), by = variable]
variable ID
1: A 5
2: C 4
3: C 5
Note that this returns all indices of the starting or ending NA sequences which might be more convenient for subsequent cleaning of another data set. If only the last and first indices are required this can be achieved by
long[long[, is.na(value) & grp == min(grp), by =variable]$V1, .(ID = max(ID)), by = variable]
variable ID
1: A 2
2: B 1
3: D 1
long[long[, is.na(value) & grp == max(grp), by =variable]$V1, .(ID = min(ID)), by = variable]
variable ID
1: A 5
2: C 4
I have tested this approach using a dummy data set of 192 rows times 4000 columns. The whole operation needed less than one second.
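For anyone who wants to try the long-format idea end to end, here is a self-contained sketch of a slight variant. It numbers the streaks separately within each variable (so a leading streak always has grp 1) rather than across the whole table as above. The column names leading and trailing are made up for this example.

```r
library(data.table)

DT2 <- data.table(ID = 1:5,
                  A = c(NA, NA, 3, 4, NA),
                  B = c(NA, 2, 3, 5, 6),
                  C = c(3, 2, 1, NA, NA),
                  D = c(NA, 2, NA, NA, 4))

# wide to long: one row per (ID, variable) pair
long <- melt(DT2, id.vars = "ID")

# number the NA / non-NA streaks separately within each variable
long[, grp := rleid(is.na(value)), by = variable]

# flag rows in a leading NA streak (first streak) or trailing NA streak (last streak)
long[, leading := is.na(value) & grp == 1L]
long[, trailing := is.na(value) & grp == max(grp), by = variable]

# last row of the leading streak and first row of the trailing streak
last_leading   <- long[(leading),  .(ID = max(ID)), by = variable]
first_trailing <- long[(trailing), .(ID = min(ID)), by = variable]
```

Variables without a leading (or trailing) NA streak simply do not show up in the respective result.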
I'm trying to fill missing values in a data.table column with the previous row's value using shift, but I can only get it to work if I first create a temporary variable. Is this the expected behavior? MWE:
library(data.table)
dt <- data.table(x=c(1, NA))
dt[is.na(x), x:=shift(x)]
# Fails
dt <- data.table(x=c(1, NA))
dt <- dt[, x.lag:=shift(x)]
dt[is.na(x), x:=x.lag]
# Works
I'm a little new to data.table, but I think the rolling join might be what you're after here. Presumably you want to be able to impute a data point when there are multiple missing values in sequence, in which case your shift method will just fill NA.
Your example is a little too minimal to really see what you're doing, so I have expanded it to include a record column, with various x values missing:
library(data.table)
dt <- data.table(record=1:10, x=c(1, NA, NA, 4, 5, 6, NA, NA, NA, 10))
> dt
record x
1: 1 1
2: 2 NA
3: 3 NA
4: 4 4
5: 5 5
6: 6 6
7: 7 NA
8: 8 NA
9: 9 NA
10: 10 10
Then create a copy with only the non-missing rows, and set its key to the record column:
dtNA <- dt[!is.na(x)]
setkey(dtNA, record)
> dtNA
record x
1: 1 1
2: 4 4
3: 5 5
4: 6 6
5: 10 10
Then do a rolling join (whereby if a value is missing, the previous record is rolled forwards) on the full list of records
dtNA[data.table(record=dt$record, key="record"), roll=TRUE]
record x
1: 1 1
2: 2 1
3: 3 1
4: 4 4
5: 5 5
6: 6 6
7: 7 6
8: 8 6
9: 9 6
10: 10 10
Compare this with your method, which produces the following (x still has NA values):
dt[, x.lag:=shift(x)]
dt[is.na(x), x:=x.lag]
> dt
record x x.lag
1: 1 1 NA
2: 2 1 1
3: 3 NA NA
4: 4 4 NA
5: 5 5 4
6: 6 6 5
7: 7 6 6
8: 8 NA NA
9: 9 NA NA
10: 10 10 NA
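As for why the one-step version in the question does nothing: as far as I understand it, when i is supplied, j is evaluated only on the rows that i selects, so shift only ever sees the NA values. A small sketch of that, plus one way to avoid the temporary column (fifelse is in reasonably recent data.table versions; seen_by_j is just a name for this example):

```r
library(data.table)

dt <- data.table(x = c(1, NA, 3, NA))

# With i present, j works on the selected rows only: here x is just the two
# NA values, so shifting them can only give NA again
seen_by_j <- dt[is.na(x), shift(x)]

# Computing the lag over the full column and choosing per row avoids that
dt[, x := fifelse(is.na(x), shift(x), x)]
```

Like the temporary-variable approach, this only fills gaps that are one row long; for longer runs of NA the rolling join above is the better fit.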
I am trying to impute some longitudinal data as shown below. For each individual (id), if the first values are NA, I would like to impute them using the first observed value for that individual, regardless of when that occurs. Then, I would like to impute forward based on the last value observed for each individual (see the imputed column below).
var values might not necessarily increase monotonically. Those values might be a character vector.
I have tried several ways to do this, but still I cannot get a satisfactory solution.
Any ideas?
id <- c(1,1,1,1,1,1,1,2,2,2,2)
time <- c(1,2,3,4,5,6,7,3,5,7,9)
var <- c(NA,NA,1,NA,2,3,NA,NA,2,3,NA)
imputed <- c(1,1,1,1,2,3,3,2,2,3,3)
dat <- data.table(id, time, var, imputed)
id time var imputed
1: 1 1 NA 1
2: 1 2 NA 1
3: 1 3 1 1
4: 1 4 NA 1
5: 1 5 2 2
6: 1 6 3 3
7: 1 7 NA 3
8: 2 3 NA 2
9: 2 5 2 2
10: 2 7 3 3
11: 2 9 NA 3
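library(data.table)
library(zoo)
# carry the last observation forward (keeping leading NAs), then fill the leading NAs backward
dat[, newimp := na.locf(na.locf(var, na.rm = FALSE), fromLast = TRUE), by = id]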
dat
# id time var imputed newimp
# 1: 1 1 NA 1 1
# 2: 1 2 NA 1 1
# 3: 1 3 1 1 1
# 4: 1 4 NA 1 1
# 5: 1 5 2 2 2
# 6: 1 6 3 3 3
# 7: 1 7 NA 3 3
# 8: 2 3 NA 2 2
# 9: 2 5 2 2 2
#10: 2 7 3 3 3
#11: 2 9 NA 3 3
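If var is numeric, the same two-pass fill can probably be done without zoo by using data.table's own nafill. This is only a sketch: as far as I know, nafill was limited to numeric columns in the versions I have used, so for a character var the zoo approach above is still the one to use. The column name newimp2 is made up here.

```r
library(data.table)

dat <- data.table(id   = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
                  time = c(1, 2, 3, 4, 5, 6, 7, 3, 5, 7, 9),
                  var  = c(NA, NA, 1, NA, 2, 3, NA, NA, 2, 3, NA))

# "locf" carries the last observation forward, "nocb" then fills the
# remaining leading NAs with the next observation, all within each id
dat[, newimp2 := nafill(nafill(var, type = "locf"), type = "nocb"), by = id]
```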
I am new to R, so I am still getting my head around the way it works. My problem is as follows: I have a data frame and a prioritised list of columns (pl), and I need:
1. To find the maximum value from the columns in pl for each row and create a new column with this value (df$max)
2. Using the priority list, to subtract this maximum value from the priority value (the first non-NA column in pl), returning the absolute difference
Probably better with an example:
My priority list is
pl <- c("E","D","A","B")
and the data frame is:
A B C D E F G
1 15 5 20 9 NA 6 1
2 3 2 NA 5 1 3 2
3 NA NA 3 NA NA NA NA
4 0 1 0 7 8 NA 6
5 1 2 3 NA NA 1 6
So for the first line, the maximum is from column A (15) and the priority value is from column D (9), since E is NA. The answer I want should look like this:
A B C D E F G MAX MAX-PR
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA NA NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
How about this?
df$MAX <- apply(df[,pl], 1, max, na.rm = T)
df$MAX_PR <- df$MAX - apply(df[,pl], 1, function(x) x[!is.na(x)][1])
df$MAX[is.infinite(df$MAX)] <- NA
> df
# A B C D E F G MAX MAX_PR
# 1 15 5 20 9 NA 6 1 15 6
# 2 3 2 NA 5 1 3 2 5 4
# 3 NA NA 3 NA NA NA NA NA NA
# 4 0 1 0 7 8 NA 6 8 0
# 5 1 2 3 NA NA 1 6 2 1
Example:
df <- data.frame(A=c(1,NA,2,5,3,1),B=c(3,5,NA,6,NA,10),C=c(NA,3,4,5,1,4))
pl <- c("B","A","C")
#now we find the maximum per row, ignoring NAs
max.per.row <- apply(df,1,max,na.rm=T)
#and the first element according to the priority list, ignoring NAs
#(there may be a more efficient way to do this)
first.per.row <- apply(df[,pl],1, function(x) as.vector(na.omit(x))[1])
#and finally compute the difference
max.less.first.per.row <- max.per.row - first.per.row
Note that this code will break for any row that is all NA. There is no check against that.
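One possible way to guard against rows that are all NA is to check for an empty vector before calling max. This is just a sketch on the question's data; the helper name row_stats is made up for illustration.

```r
df <- data.frame(A = c(15, 3, NA, 0, 1),
                 B = c(5, 2, NA, 1, 2),
                 C = c(20, NA, 3, 0, 3),
                 D = c(9, 5, NA, 7, NA),
                 E = c(NA, 1, NA, 8, NA),
                 F = c(6, 3, NA, NA, 1),
                 G = c(1, 2, NA, 6, 6))
pl <- c("E", "D", "A", "B")

# Hypothetical helper: x holds one row's values in priority order
row_stats <- function(x) {
  x <- unname(x[!is.na(x)])              # drop NAs, keep priority order
  if (length(x) == 0L) {
    return(c(MAX = NA_real_, MAX_PR = NA_real_))   # nothing observed in this row
  }
  c(MAX = max(x), MAX_PR = abs(max(x) - x[1]))     # x[1] is the priority value
}

df <- cbind(df, t(apply(df[, pl], 1, row_stats)))
```

The all-NA row then gets NA in both new columns instead of -Inf and a warning.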
Here is a simple version. First, I take only the pl columns; then, for each row, I remove the NAs and compute the max and its difference from the first (highest-priority) value.
df <- dat[, pl]
cbind(dat, t(apply(df, 1, function(x) {
  x <- na.omit(x)
  c(max(x), max(x) - x[1])
})))
A B C D E F G 1 2
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA -Inf NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1