Data Table Solution in R to Find Group Min/Max

Data
data <- data.frame(student = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   score   = c(NA, 7, 6, 6, 1, 4, 8, NA, 3),
                   min     = c(6, 6, 6, 1, 1, 1, 3, 3, 3),
                   max     = c(7, 7, 7, 6, 6, 6, 8, 8, 8))
I have columns 'student' and 'score' and wish to use data.table to create 'min' and 'max', which are simply the minimum and maximum score for each student, ignoring NA values. If all of a student's scores are NA, then the min/max should simply be NA.

Using data.table
library(data.table)
setDT(data)
data[, c("min", "max"):= list(min(score, na.rm = TRUE),
max(score, na.rm = TRUE)), student]
data
#    student score min max
# 1:       1    NA   6   7
# 2:       1     7   6   7
# 3:       1     6   6   7
# 4:       2     6   1   6
# 5:       2     1   1   6
# 6:       2     4   1   6
# 7:       3     8   3   8
# 8:       3    NA   3   8
# 9:       3     3   3   8
Or with dplyr
library(dplyr)
data %>%
  group_by(student) %>%
  mutate(min = min(score, na.rm = TRUE),
         max = max(score, na.rm = TRUE))

But the OP wanted NA returned if all scores for a student were NA. This solution fixes the Inf problem.
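For reference, the Inf problem: with na.rm = TRUE, an all-NA group leaves nothing to take a minimum or maximum of, and base R warns and returns infinities:
min(NA, na.rm = TRUE)   # Inf, with a warning
max(NA, na.rm = TRUE)   # -Inf, with a warning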
data <- data.frame(student = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   score   = c(NA, NA, NA, 6, 1, 4, 8, NA, 3))
> dt <- data.table(data); dt
   student score
1:       1    NA
2:       1    NA
3:       1    NA
4:       2     6
5:       2     1
6:       2     4
7:       3     8
8:       3    NA
9:       3     3
Create functions that handle the case where all values are NA, returning NA:
# x[NA_integer_] yields an NA of the same type as x, rather than a plain logical NA
min.na = function(x) if (all(is.na(x))) x[NA_integer_] else min(x, na.rm = TRUE)
max.na = function(x) if (all(is.na(x))) x[NA_integer_] else max(x, na.rm = TRUE)
dt[, c("min", "max") := list(min.na(score), max.na(score)), by=student]
dt
   student score min max
1:       1    NA  NA  NA
2:       1    NA  NA  NA
3:       1    NA  NA  NA
4:       2     6   1   6
5:       2     1   1   6
6:       2     4   1   6
7:       3     8   3   8
8:       3    NA   3   8
9:       3     3   3   8
Edit: And I'm not sure why you'd want to do this anyway. Joining summary statistics back onto the original data is bad practice; it results in redundancy/duplication. Surely you just want a separate result for each student:
dt[, .(min=min.na(score), max=max.na(score)), by=student]
   student min max
1:       1  NA  NA
2:       2   1   6
3:       3   3   8
I know this last part is not what was asked, but I always check that what they ask for is what they really wanted. ;)

Another data.table option:
setDT(data)[, c("min","max") := as.list(range(score, na.rm=TRUE)), student]

You can do this using the function ave:
data <- data.frame(student = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   score   = c(NA, 7, 6, 6, 1, 4, 8, NA, 3))
data$min <- ave(data$score, data$student, FUN = function(x) min(x, na.rm = TRUE))
data$max <- ave(data$score, data$student, FUN = function(x) max(x, na.rm = TRUE))
Result:
> data
  student score min max
1       1    NA   6   7
2       1     7   6   7
3       1     6   6   7
4       2     6   1   6
5       2     1   1   6
6       2     4   1   6
7       3     8   3   8
8       3    NA   3   8
9       3     3   3   8
The function ave takes a numeric vector as its first argument, and all following vectors are treated as grouping variables. The FUN parameter is the function you wish to apply to each group.
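Note that if some student had only NA scores, ave would hit the same Inf problem; it composes naturally with the NA-safe helpers defined in the data.table answer above:
# reuse min.na/max.na from above so all-NA groups yield NA
data$min <- ave(data$score, data$student, FUN = min.na)
data$max <- ave(data$score, data$student, FUN = max.na)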

Related

How can I identify the two smallest values (and indexes) rolling over a column in R?

I have been looking for an answer to this question for days. I think I am very close, but I cannot figure how to do the rolling piece of it.
I have a data.table with two columns like this:
df = data.table(a = c(8,3,6,12,15,21,4,5,1,32,13), b = c(12,3,1,66,4,7,32,6,76,2,11))
I would like to obtain the smallest two values and their indexes on a rolling four day window. Rfast::nth seems to give me everything I need except I cannot vectorize it. Obviously I am doing something wrong.
I need the output to look as follows:
     a  b low lowIdx low2 low2Idx
 1:  8 12  NA     NA   NA      NA
 2:  3  3  NA     NA   NA      NA
 3:  6  1  NA     NA   NA      NA
 4: 12 66   3      2    6       3
 5: 15  4   3      1    6       2
 6: 21  7   6      1   12       2
 7:  4 32   4      4   12       1
 8:  5  6   4      3    5       4
 9:  1 76   1      4    4       2
10: 32  2   1      3    4       1
11: 13 11   1      2    5       1
I have attempted this with different forms of the following:
n <- nrow(df)
df$low[4:n] <- Rfast::nth(df[(n-3):n]$a, 1)
df$lowIdx[4:n] <- Rfast::nth(df$a, 1, index.return = TRUE)
df$low2[4:n] <- Rfast::nth(df[(n-3):n]$a, 2)
df$low2Idx[4:n] <- Rfast::nth(df$a, 2, index.return = TRUE)
I have also been trying to work with frollapply, but to no avail.
Thank you, Pete
The runner library also helps:
library(dplyr)
library(runner)
df %>% mutate(low     = runner(x = a, k = 4, f = function(x) ifelse(length(x) == 4, min(x), NA)),
              lowIdx  = runner(x = a, k = 4, f = function(x) ifelse(length(x) == 4, which.min(x), NA)),
              Low2    = runner(x = a, k = 4, f = function(x) ifelse(length(x) == 4, sort(x)[2], NA)),
              Low2Idx = runner(x = a, k = 4, f = function(x) ifelse(length(x) == 4, order(x)[2], NA)))
     a  b low lowIdx Low2 Low2Idx
 1:  8 12  NA     NA   NA      NA
 2:  3  3  NA     NA   NA      NA
 3:  6  1  NA     NA   NA      NA
 4: 12 66   3      2    6       3
 5: 15  4   3      1    6       2
 6: 21  7   6      1   12       2
 7:  4 32   4      4   12       1
 8:  5  6   4      3    5       4
 9:  1 76   1      4    4       2
10: 32  2   1      3    4       1
11: 13 11   1      2    5       1
With frollapply in data.table you can do:
library(data.table)
cols <- c('low', 'lowIdx', 'low2', 'low2Idx')
n <- 4
df[, (cols) := .(frollapply(a, n, min),
                 frollapply(a, n, which.min),
                 frollapply(a, n, function(x) sort(x)[2]),
                 frollapply(a, n, function(x) order(x)[2]))]
df
#      a  b low lowIdx low2 low2Idx
#  1:  8 12  NA     NA   NA      NA
#  2:  3  3  NA     NA   NA      NA
#  3:  6  1  NA     NA   NA      NA
#  4: 12 66   3      2    6       3
#  5: 15  4   3      1    6       2
#  6: 21  7   6      1   12       2
#  7:  4 32   4      4   12       1
#  8:  5  6   4      3    5       4
#  9:  1 76   1      4    4       2
# 10: 32  2   1      3    4       1
# 11: 13 11   1      2    5       1
For a simple example like yours I suggest frollapply. For better control of the window, especially if you have gaps in a dates column, I recommend runner; see its documentation for more:
library(runner)
df[, c('low', 'lowIdx', 'low2', 'low2Idx') := .(
  min_run(a, k = 4, na_pad = TRUE),
  runner(a, k = 4, function(x) ifelse(length(x) > 0, which.min(x), NA), na_pad = TRUE),
  runner(a, k = 4, function(x) sort(x)[2], na_pad = TRUE),
  runner(a, k = 4, function(x) order(x)[2], na_pad = TRUE)
)]
df
#      a  b low lowIdx low2 low2Idx
#  1:  8 12  NA     NA   NA      NA
#  2:  3  3  NA     NA   NA      NA
#  3:  6  1  NA     NA   NA      NA
#  4: 12 66   3      2    6       3
#  5: 15  4   3      1    6       2
#  6: 21  7   6      1   12       2
#  7:  4 32   4      4   12       1
#  8:  5  6   4      3    5       4
#  9:  1 76   1      4    4       2
# 10: 32  2   1      3    4       1
# 11: 13 11   1      2    5       1
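If the rows carry dates and some days can be missing, runner can window by elapsed time rather than by row count. A minimal sketch, assuming your runner version accepts a duration k together with an idx index column (the date column below is invented for illustration):
library(runner)
df[, date := as.Date("2020-01-01") + c(0:4, 6:11)]   # hypothetical dates; Jan 6 is absent
df[, low := runner(a, k = "4 days", idx = date, f = min, na_pad = TRUE)]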

Trying to calculate the row sum in a data frame with NA values

I am trying to sum each row whenever at least one column has a value, but the following is not working for me:
df <- data.frame(
  x3 = c(2, NA, 3, 5, 4, 6, NA, NA, 3, 3),
  x4 = c(0, NA, NA, 6, 5, 6, NA, 0, 4, 2))
df$summ <- ifelse(is.na(c(df[,"x3"] & df[,"x4"])),NA,rowSums(df[,c("x3","x4")], na.rm=TRUE))
The output should have a summ column containing the row sum, with NA whenever every value in the row is NA.
An alternative solution:
library(data.table)
setDT(df)[!(is.na(x3) & is.na(x4)), summ := rowSums(.SD, na.rm = TRUE)]
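A hedged generalisation of this to any number of columns (the cols vector here is introduced for illustration):
library(data.table)
cols <- c("x3", "x4")
setDT(df)
df[rowSums(is.na(df[, ..cols])) < length(cols),
   summ := rowSums(.SD, na.rm = TRUE), .SDcols = cols]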
You can do:
df <- transform(df, summ = ifelse(is.na(x3) & is.na(x4), NA,
                                  rowSums(df, na.rm = TRUE)))
df
#    x3 x4 summ
# 1   2  0    2
# 2  NA NA   NA
# 3   3 NA    3
# 4   5  6   11
# 5   4  5    9
# 6   6  6   12
# 7  NA NA   NA
# 8  NA  0    0
# 9   3  4    7
# 10  3  2    5
In general, for any number of columns:
cols <- c('x3', 'x4')
df <- transform(df, summ = ifelse(rowSums(is.na(df[cols])) == length(cols),
                                  NA, rowSums(df, na.rm = TRUE)))
Try the code below with rowSums + replace
df$summ <- replace(rowSums(df, na.rm = TRUE), rowSums(is.na(df)) == 2, NA)  # 2 = number of columns summed
which gives
> df
   x3 x4 summ
1   2  0    2
2  NA NA   NA
3   3 NA    3
4   5  6   11
5   4  5    9
6   6  6   12
7  NA NA   NA
8  NA  0    0
9   3  4    7
10  3  2    5
This is not much different from the answers already posted; however, it showcases some useful functions:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(Count = ifelse(all(is.na(cur_data())), NA,
                        sum(c_across(everything()), na.rm = TRUE)))
# A tibble: 10 x 3
# Rowwise:
      x3    x4 Count
   <dbl> <dbl> <dbl>
 1     2     0     2
 2    NA    NA    NA
 3     3    NA     3
 4     5     6    11
 5     4     5     9
 6     6     6    12
 7    NA    NA    NA
 8    NA     0     0
 9     3     4     7
10     3     2     5
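As a side note, cur_data() is deprecated in favour of pick() in dplyr 1.1.0 and later; a hedged equivalent, assuming a recent dplyr:
df %>%
  rowwise() %>%
  mutate(Count = ifelse(all(is.na(pick(everything()))), NA,
                        sum(c_across(everything()), na.rm = TRUE)))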

Impute missing values but not at the beginning and the end?

Consider the following working example:
library(data.table)
library(imputeTS)
DT <- data.table(
  time = c(1:10),
  var1 = c(1:5, NA, NA, 8:10),
  var2 = c(NA, NA, 1:4, NA, 6, 7, 8),
  var3 = c(1:6, rep(NA, 4))
)
    time var1 var2 var3
 1:    1    1   NA    1
 2:    2    2   NA    2
 3:    3    3    1    3
 4:    4    4    2    4
 5:    5    5    3    5
 6:    6   NA    4    6
 7:    7   NA   NA   NA
 8:    8    8    6   NA
 9:    9    9    7   NA
10:   10   10    8   NA
I want to impute the missing values at different points within the time series using na_interpolation from the imputeTS package. However, I do not want to impute missing values at the beginning or the end of the series, which can be of varying length (in my application, replacing those values would not make sense).
When I run the following code to impute the series, however, all the NAs get replaced:
cols_to_impute_example <- c("var1", "var2", "var3")
DT[, (cols_to_impute_example) := lapply(.SD, na_interpolation), .SDcols = cols_to_impute_example]
> DT
    time var1 var2 var3
 1:    1    1    1    1
 2:    2    2    1    2
 3:    3    3    1    3
 4:    4    4    2    4
 5:    5    5    3    5
 6:    6    6    4    6
 7:    7    7    5    6
 8:    8    8    6    6
 9:    9    9    7    6
10:   10   10    8    6
What I want to achieve is:
    time var1 var2 var3
 1:    1    1   NA    1
 2:    2    2   NA    2
 3:    3    3    1    3
 4:    4    4    2    4
 5:    5    5    3    5
 6:    6    6    4    6
 7:    7    7    5   NA
 8:    8    8    6   NA
 9:    9    9    7   NA
10:   10   10    8   NA
A dplyr implementation: we select the middle part of the df, do the NA interpolation there, and then bind the pieces back together.
library(imputeTS)
library(dplyr)
DT <- tibble(
  time = c(1:10),
  var1 = c(1:5, NA, NA, 8:10),
  var2 = c(NA, NA, 1:4, NA, 6, 7, 8),
  var3 = c(1:6, rep(NA, 4))
)
na_inter_middle <- function(row_start, row_end) {
  # first part of the df, where no NAs need to be replaced
  start  <- DT[1:row_start, ]
  # middle part, where NA values are interpolated
  middle <- DT[(row_start + 1):(nrow(DT) - row_end), ]
  # end part, left untouched
  end    <- DT[(nrow(DT) - (row_end - 1)):nrow(DT), ]
  start %>%
    bind_rows(middle %>% mutate_all(na_interpolation)) %>%
    bind_rows(end)
}
na_inter_middle(2, 3)
# A tibble: 10 x 4
    time  var1  var2  var3
   <int> <dbl> <dbl> <dbl>
 1     1     1    NA     1
 2     2     2    NA     2
 3     3     3     1     3
 4     4     4     2     4
 5     5     5     3     5
 6     6     5     4     6
 7     7     5     4     6
 8     8     8     6    NA
 9     9     9     7    NA
10    10    10     8    NA
Maybe not so well known: you can also pass additional parameters of approx to the na_interpolation function of imputeTS.
This one could be solved with:
library(imputeTS)
DT[, (2:4) := lapply(.SD, na_interpolation, yleft = NA, yright = NA), .SDcols = 2:4]
Here, with yleft and yright, you specify what to do with the leading / trailing NAs, respectively.
Which leads to the desired output:
    time var1 var2 var3
 1:    1    1   NA    1
 2:    2    2   NA    2
 3:    3    3    1    3
 4:    4    4    2    4
 5:    5    5    3    5
 6:    6    6    4    6
 7:    7    7    5   NA
 8:    8    8    6   NA
 9:    9    9    7   NA
10:   10   10    8   NA
Basically, nearly all parameters that you find in the approx function documentation can also be given to na_interpolation as additional parameters for fine-tuning.
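For instance, approx's method argument can be forwarded to get a step-function fill instead of a linear one. A small sketch, assuming your imputeTS version passes extra arguments through to approx as described:
x <- c(1, NA, NA, 4)
na_interpolation(x, method = "constant")   # 1 1 1 4: last observation carried forward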
The zoo library offers an interpolation function that allows more customization:
library(zoo)
DT[, (2:4) := lapply(.SD, na.approx, x = time, na.rm = FALSE), .SDcols = 2:4]
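na.approx also accepts a maxgap argument, so long runs of NAs can be left unfilled; a short sketch of that option:
# fill gaps of at most 2 consecutive NAs; longer runs (and the ends) stay NA
DT[, (2:4) := lapply(.SD, na.approx, x = time, na.rm = FALSE, maxgap = 2), .SDcols = 2:4]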

Shifting the last non-NA value by id

I have a data table that looks like this:
DT <- data.table(day = c(1, 2, 3, 4, 5, 6, 7, 8),
                 Consumption = c(5, 9, 10, 2, NA, NA, NA, NA),
                 id = c(1, 2, 3, 1, 1, 2, 2, 1))
   day Consumption id
1:   1           5  1
2:   2           9  2
3:   3          10  3
4:   4           2  1
5:   5          NA  1
6:   6          NA  2
7:   7          NA  2
8:   8          NA  1
I want to create two columns that show the last non-NA Consumption value before each observation, and the day difference between those observations, within each id group. So far, I tried this:
DT[, j := day - shift(day, fill = NA, n = 1), by = id]
DT[, yj := shift(Consumption, fill = NA, n = 1), by = id]
   day Consumption id  j yj
1:   1           5  1 NA NA
2:   2           9  2 NA NA
3:   3          10  3 NA NA
4:   4           2  1  3  5
5:   5          NA  1  1  2
6:   6          NA  2  4  9
7:   7          NA  2  1 NA
8:   8          NA  1  3 NA
However, I want the lagged Consumption values with n = 1 to come from rows that have non-NA Consumption values. For example, in the 7th row, the "yj" value is NA because it comes from the 6th row, which has NA Consumption; I want it to come from the 2nd row instead. Therefore, I would like to end up with this data table:
   day Consumption id  j yj
1:   1           5  1 NA NA
2:   2           9  2 NA NA
3:   3          10  3 NA NA
4:   4           2  1  3  5
5:   5          NA  1  1  2
6:   6          NA  2  4  9
7:   7          NA  2  5  9
8:   8          NA  1  4  2
Note: The reason for specifically using the n parameter of the shift function is that I will also need the 2nd last non-NA Consumption values in the next step.
Thank You
Here's a data.table solution with an assist from zoo:
library(data.table)
library(zoo)
DT[, `:=`(day_shift = shift(day),
          yj = shift(Consumption)),
   by = id]
# blank out day_shift wherever yj is NA, so both columns get filled together
DT[is.na(yj), day_shift := NA_integer_]
# fill down with the last non-NA value within each id
DT[, `:=`(day_shift = na.locf(day_shift, na.rm = FALSE),
          yj = zoo::na.locf(yj, na.rm = FALSE)),
   by = id]
# finally calculate j
DT[, j := day - day_shift]
# you can clean up the ordering or remove columns later
DT
   day Consumption id day_shift yj  j
1:   1           5  1        NA NA NA
2:   2           9  2        NA NA NA
3:   3          10  3        NA NA NA
4:   4           2  1         1  5  3
5:   5          NA  1         4  2  1
6:   6          NA  2         2  9  4
7:   7          NA  2         2  9  5
8:   8          NA  1         4  2  4
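The question also mentions needing the 2nd last non-NA value in a later step. One hedged way to generalise is a small helper (invented here, not part of the answer) that looks back n non-NA observations within each id:
# hypothetical helper: value of the n-th most recent non-NA entry strictly before each row
lag_non_na <- function(x, n = 1) {
  obs <- which(!is.na(x))
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    prev <- obs[obs < i]
    if (length(prev) >= n) out[i] <- x[prev[length(prev) - n + 1]]
  }
  out
}
DT[, yj2 := lag_non_na(Consumption, 2), by = id]   # 2nd last non-NA before each row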

Merge dataframes on matching A, B and *closest* C?

I have two dataframes like so:
set.seed(1)
df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
to.merge <- data.frame(x = c(2, 2, 2, 3, 2),
                       y = c(1, 1, 1, 5, 4),
                       time = c(17, 12, 11.6, 22.5, 2),
                       val = letters[1:5],
                       stringsAsFactors = FALSE)
I want to merge to.merge into df (with all.x=T) such that:
df$x == to.merge$x AND
df$y == to.merge$y AND
abs(df$time - to.merge$time) <= 1; if multiple rows of to.merge satisfy this, we pick the one that minimises this distance.
How can I do this?
So my desired result is (this is just df with the corresponding value column of to.merge added for matching rows):
   x y time val
1  1 1    8  NA
2  2 1   11   c
3  3 1   17  NA
4  1 2   27  NA
5  2 2    6  NA
6  3 2   27  NA
7  1 3   28  NA
8  2 3   20  NA
9  3 3   19  NA
10 1 4    2  NA
11 2 4    6  NA
12 3 4    5  NA
13 1 5   21  NA
14 2 5   12  NA
15 3 5   23   d
where to.merge was:
  x y time val
1 2 1 17.0   a
2 2 1 12.0   b
3 2 1 11.6   c
4 3 5 22.5   d
5 2 4  2.0   e
Note: (2, 1, 17, a) didn't match into df because the time 17 was more than 1 away from df$time 11 for (x, y) = (2, 1).
Also, there were two rows in to.merge that satisfied the condition for matching to df's (2, 1, 11) row, but the 'c' row was picked instead of the 'b' row because its time was the closest to 11.
Finally, there may be rows in to.merge that do not match anything in df.
One way that works is a for-loop, but it takes far too long for my data (df has ~12k rows and to.merge has ~250k rows):
df$val <- NA
for (i in 1:nrow(df)) {
  row <- df[i, ]
  idx <- which(row$x == to.merge$x &
               row$y == to.merge$y &
               abs(row$time - to.merge$time) <= 1)
  if (length(idx)) {
    # abs() is needed so which.min picks the closest match, not the most negative difference
    j <- idx[which.min(abs(row$time - to.merge$time[idx]))]
    df$val[i] <- to.merge$val[j]
  }
}
I feel that I can somehow do a merge, like:
to.merge$closest_time_in_df <- sapply(to.merge$time, function(tm) {
  dts <- abs(tm - df$time)
  # difference must be at most 1
  if (min(dts) <= 1) df$time[which.min(dts)] else NA
})
merge(df, to.merge,
      by.x = c('x', 'y', 'time'),
      by.y = c('x', 'y', 'closest_time_in_df'),
      all.x = TRUE)
But this doesn't merge the (2, 1, 11) row, because to.merge$closest_time_in_df for (2, 1, 11.6, c) is 12, and a time of 12 in df corresponds to (x, y) = (2, 5), not (2, 1), hence the merge fails.
Use data.table and roll = 'nearest'; or, to limit the distance to 1, a finite roll such as roll = -1 together with rollends = c(TRUE, TRUE). E.g.
library(data.table)
# create data.tables with the same key columns (x, y, time)
DT <- data.table(df, key = names(df))
tm <- data.table(to.merge, key = key(DT))
# use join syntax with roll = 'nearest'
tm[DT, roll='nearest']
#     x y time val
#  1: 1 1    8  NA
#  2: 1 2   27  NA
#  3: 1 3   28  NA
#  4: 1 4    2  NA
#  5: 1 5   21  NA
#  6: 2 1   11   c
#  7: 2 2    6  NA
#  8: 2 3   20  NA
#  9: 2 4    6   e
# 10: 2 5   12  NA
# 11: 3 1   17  NA
# 12: 3 2   27  NA
# 13: 3 3   19  NA
# 14: 3 4    5  NA
# 15: 3 5   23   d
You can limit yourself to looking forward and back by one by setting roll = -1 and rollends = c(TRUE, TRUE):
new <- tm[DT, roll = -1, rollends = c(TRUE, TRUE)]
new
    x y time val
 1: 1 1    8  NA
 2: 1 2   27  NA
 3: 1 3   28  NA
 4: 1 4    2  NA
 5: 1 5   21  NA
 6: 2 1   11   c
 7: 2 2    6  NA
 8: 2 3   20  NA
 9: 2 4    6  NA
10: 2 5   12  NA
11: 3 1   17  NA
12: 3 2   27  NA
13: 3 3   19  NA
14: 3 4    5  NA
15: 3 5   23   d
Or you can roll = 1 first, then roll = -1, and combine the results (tidying up the val.1 column from the second rolling join):
new <- tm[DT, roll = 1][tm[DT, roll = -1]][
  is.na(val), val := ifelse(is.na(val.1), val, val.1)][, val.1 := NULL]
new
    x y time val
 1: 1 1    8  NA
 2: 1 2   27  NA
 3: 1 3   28  NA
 4: 1 4    2  NA
 5: 1 5   21  NA
 6: 2 1   11   c
 7: 2 2    6  NA
 8: 2 3   20  NA
 9: 2 4    6  NA
10: 2 5   12  NA
11: 3 1   17  NA
12: 3 2   27  NA
13: 3 3   19  NA
14: 3 4    5  NA
15: 3 5   23   d
Using merge a couple of times and aggregate once, here is how to do it.
set.seed(1)
df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
to.merge <- data.frame(x = c(2, 2, 2, 3, 2),
                       y = c(1, 1, 1, 5, 4),
                       time = c(17, 12, 11.6, 22.5, 2),
                       val = letters[1:5],
                       stringsAsFactors = FALSE)
#Find rows that match by x and y
res <- merge(to.merge, df, by = c("x", "y"), all.x = TRUE)
res$dif <- abs(res$time.x - res$time.y)
res
##   x y time.x val time.y dif
## 1 2 1   17.0   a     11 6.0
## 2 2 1   12.0   b     11 1.0
## 3 2 1   11.6   c     11 0.6
## 4 2 4    2.0   e      6 4.0
## 5 3 5   22.5   d     23 0.5
#Find rows that need to be merged
res1 <- merge(aggregate(dif ~ x + y, data = res, FUN = min), res)
res1
##   x y dif time.x val time.y
## 1 2 1 0.6   11.6   c     11
## 2 2 4 4.0    2.0   e      6
## 3 3 5 0.5   22.5   d     23
#Finally merge the result back into df
final <- merge(df, res1[res1$dif <= 1, c("x", "y", "val")], all.x = TRUE)
final
##     x y time  val
## 1   1 1    8 <NA>
## 2   1 2   27 <NA>
## 3   1 3   28 <NA>
## 4   1 4    2 <NA>
## 5   1 5   21 <NA>
## 6   2 1   11    c
## 7   2 2    6 <NA>
## 8   2 3   20 <NA>
## 9   2 4    6 <NA>
## 10  2 5   12 <NA>
## 11  3 1   17 <NA>
## 12  3 2   27 <NA>
## 13  3 3   19 <NA>
## 14  3 4    5 <NA>
## 15  3 5   23    d
mnel's answer uses roll = "nearest" in a data.table join but does not limit the match to +/- 1 as requested by the OP. In addition, MichaelChirico has suggested using the on parameter.
This approach uses
roll = "nearest",
an update by reference, i.e., without copying,
setDT() to coerce a data.frame to data.table without copying (introduced 2014-02-27 with v.1.9.2 of data.table),
the on parameter, which spares setting a key explicitly (introduced 2015-09-19 with v1.9.6).
So, the code below
library(data.table) # version 1.11.4 used
setDT(df)[setDT(to.merge), on = .(x, y, time), roll = "nearest",
          val := replace(val, abs(x.time - i.time) > 1, NA)]
df
has updated df:
    x y time  val
 1: 1 1    8 <NA>
 2: 2 1   11    c
 3: 3 1   17 <NA>
 4: 1 2   27 <NA>
 5: 2 2    6 <NA>
 6: 3 2   27 <NA>
 7: 1 3   28 <NA>
 8: 2 3   20 <NA>
 9: 3 3   19 <NA>
10: 1 4    2 <NA>
11: 2 4    6 <NA>
12: 3 4    5 <NA>
13: 1 5   21 <NA>
14: 2 5   12 <NA>
15: 3 5   23    d
Note that the order of rows has not been changed (in contrast to Chinmay Patil's answer).
In case df must not be changed, a new data.table can be created by
result <- setDT(to.merge)[setDT(df), on = .(x, y, time), roll = "nearest",
                          .(x, y, time, val = replace(val, abs(x.time - i.time) > 1, NA))]
result
which returns the same result as above.
