data.table and pmin with na.rm=TRUE argument in R

I am trying to calculate the minimum across rows using the pmin function and data.table (similar to the post row-by-row operations and updates in data.table) but with a character list of columns using something like the with=FALSE syntax, and with the na.rm=TRUE argument.
DT <- data.table(x = c(1,1,2,3,4,1,9),
                 y = c(2,4,1,2,5,6,6),
                 z = c(3,5,1,7,4,5,3),
                 a = c(1,3,NA,3,5,NA,2))
> DT
x y z a
1: 1 2 3 1
2: 1 4 5 3
3: 2 1 1 NA
4: 3 2 7 3
5: 4 5 4 5
6: 1 6 5 NA
7: 9 6 3 2
I can calculate the minimum across rows using columns directly:
DT[,min_val := pmin(x,y,z,a,na.rm=TRUE)]
giving
> DT
x y z a min_val
1: 1 2 3 1 1
2: 1 4 5 3 1
3: 2 1 1 NA 1
4: 3 2 7 3 2
5: 4 5 4 5 4
6: 1 6 5 NA 1
7: 9 6 3 2 2
However, I am trying to do this over an automatically generated large set of columns, so I want to be able to do this across an arbitrary list of columns stored in a col_names variable, col_names <- c("a","y","z")
I can do this:
DT[, col_min := do.call(pmin,DT[,col_names,with=FALSE])]
But it gives me NA values. I can't figure out how to pass the na.rm=TRUE argument into the do.call. I've tried defining the function as
DT[, col_min := do.call(function(x) pmin(x,na.rm=TRUE),DT[,col_names,with=FALSE])]
but this gives me an error. I also tried passing in the argument as an additional element in a list, but I think pmin (or do.call) gets confused between the DT non-standard evaluation of column names and the argument.
Any ideas?

If we need to get the minimum value of each row over the whole dataset, use pmin on .SD, concatenating na.rm=TRUE as a list with .SD for the do.call(pmin, ...) call:
DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE)))]
DT
# x y z a col_min
#1: 1 2 3 1 1
#2: 1 4 5 3 1
#3: 2 1 1 NA 1
#4: 3 2 7 3 2
#5: 4 5 4 5 4
#6: 1 6 5 NA 1
#7: 9 6 3 2 2
If we want to do this only for a subset of column names stored in 'col_names', use .SDcols:
DT[, col_min := do.call(pmin, c(.SD, list(na.rm=TRUE))),
   .SDcols = col_names]
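If you prefer not to go through .SD/.SDcols, an equivalent sketch is to look the columns up with mget() on the character vector of names and concatenate na.rm=TRUE the same way:
DT[, col_min := do.call(pmin, c(mget(col_names), list(na.rm = TRUE)))]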

Related

Remove rows which are different with the first changing in R

I have this data set:
ID <- c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4)
x <- c(1,2,3,1,2,3,4,1,2,1,2,3,4,5,6)
y <- c(2,2,3,6,6,4,5, 1,1,5,5,5,2,2,2)
df <- data.frame(ID, x, y)
df
ID x y
1 1 1 2
2 1 2 2
3 1 3 3
4 2 1 6
5 2 2 6
6 2 3 4
7 2 4 5
8 3 1 1
9 3 2 1
10 4 1 5
11 4 2 5
12 4 3 5
13 4 4 2
14 4 5 2
15 4 6 2
Looking at ID 1, it has 3 rows and y changes to 3 in the third row, so I want to set that y = 2 (the same value as the previous row). For ID 2, y changes to 4; I want to set that value to 6 and delete the following row. In general, whenever y changes within an ID, only the first changed row should be kept (set to the same value as the previous row) and the remaining rows should be removed.
The table will be
ID x y
1 1 2
1 2 2
1 3 2
2 1 6
2 2 6
2 3 6
3 1 1
3 2 1
4 1 5
4 2 5
4 3 5
4 4 5
I couldn't figure this out. Do you have any ideas? Please help me, thanks.
Or we can do
library(data.table)
df1 <- setDT(df)[, .SD[shift(rleid(y), fill = 1) == 1], .(ID)]
df1[, y := y[1], .(ID)]
df1
ID x y
1: 1 1 2
2: 1 2 2
3: 1 3 2
4: 2 1 6
5: 2 2 6
6: 2 3 6
7: 3 1 1
8: 3 2 1
9: 4 1 5
10: 4 2 5
11: 4 3 5
12: 4 4 5
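To see what the inner condition does, take the 'y' values of ID 2 as an illustration (not part of the original answer):
y2 <- c(6, 6, 4, 5)                # 'y' for ID 2
rleid(y2)                          # 1 1 2 3  : the run id increments whenever y changes
shift(rleid(y2), fill = 1)         # 1 1 1 2  : lagged run id
shift(rleid(y2), fill = 1) == 1    # TRUE TRUE TRUE FALSE : keep rows up to and including the first change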
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)) and group by 'ID'. If 'y' has only one unique value, take all the row indices (.I[1:.N]); otherwise take the difference of 'y' (diff), check whether it is not equal to 0, use which to get the position of the first TRUE ([1]), build the sequence up to that position (so we keep rows up to and including the first change), and wrap it with .I to return the row index.
library(data.table)
i1 <- setDT(df)[, if (uniqueN(y) > 1) .I[seq(which(c(FALSE, diff(y) != 0))[1])]
                  else .I[1:.N], ID]$V1
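With the example data, i1 evaluates to the row indices we want to keep (row 7 and rows 14-15 are dropped):
i1
# [1]  1  2  3  4  5  6  8  9 10 11 12 13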
Based on 'i1', we subset the rows of 'df'; then, grouped by 'ID', we assign (:=) the 1st element of 'y' to the 'y' column.
df[i1][, y:= y[1], ID][]
# ID x y
#1: 1 1 2
#2: 1 2 2
#3: 1 3 2
#4: 2 1 6
#5: 2 2 6
#6: 2 3 6
#7: 3 1 1
#8: 3 2 1
#9: 4 1 5
#10: 4 2 5
#11: 4 3 5
#12: 4 4 5
Or we can use slightly simpler code with dplyr (disclaimer: the idea is somewhat similar to @Psidom's code). After grouping by 'ID', we take the lag of 'y', build a logical index by comparing it with the first observation, filter the rows based on that, and change the 'y' values to the first value.
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(first(y) == lag(y, default = first(y))) %>%
  mutate(y = first(y))
# ID x y
# <dbl> <dbl> <dbl>
#1 1 1 2
#2 1 2 2
#3 1 3 2
#4 2 1 6
#5 2 2 6
#6 2 3 6
#7 3 1 1
#8 3 2 1
#9 4 1 5
#10 4 2 5
#11 4 3 5
#12 4 4 5
Or another option is ave, mostly from base R (note that lag() here is dplyr::lag(), so dplyr still needs to be loaded):
df1 <- df[with(df, as.logical(ave(y, ID, FUN = function(x)
          lag(x, default = x[1]) == x[1]))), ]
df1$y <- with(df1, ave(y, ID, FUN= function(x) x[1]))
You could use a for loop, matching to the first instance of a given ID:
for( i in 1:nrow(df) ){
  df$new[i] <- df$y[ match( df$ID[i], df$ID ) ]
}
This works because you're effectively asking for all subsequent values of y to be replaced with the first value for a given ID. match() returns the index of the first match, which works well for what you're after.
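For instance, matching the ID column against itself returns, for every row, the index of the first row with that ID (a quick illustration on the example data):
match(df$ID, df$ID)
# [1]  1  1  1  4  4  4  4  8  8 10 10 10 10 10 10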
Or you could eliminate the for loop by first extracting ID as a variable:
ID <- df$ID
df$new <- df$y[ match( ID, df$ID ) ]
EDIT TO ADD: Sorry, here's a step to add to delete rows as requested. Note that shift() comes from data.table; the condition keeps the unchanged rows plus only the first changed row within each ID (i.e. the previous row must still hold the original value):
df <- subset( df, y == new |
                ( shift( y, 1, type = "lag" ) == new &
                  shift( ID, 1, type = "lag" ) == ID ) )

Why is class(.SD) on a data.table showing "data.frame"?

colnames() seems to be enumerating all columns per group as expected, but class() shows exactly two rows per group, and one of them is data.frame:
> dt <- data.table("x"=1:3, "y"=1:3, "z"=1:3, "a"=1:3, "b"=1:3, "c"=1:3, "d"=1:3, "e"=1:3)
> dt[, class(.SD), by=.(x, y, z)]
x y z V1
1: 1 1 1 data.table
2: 1 1 1 data.frame
3: 2 2 2 data.table
4: 2 2 2 data.frame
5: 3 3 3 data.table
6: 3 3 3 data.frame
> dt[, colnames(.SD), by=.(x, y, z)]
x y z V1
1: 1 1 1 a
2: 1 1 1 b
3: 1 1 1 c
4: 1 1 1 d
5: 1 1 1 e
6: 2 2 2 a
7: 2 2 2 b
8: 2 2 2 c
9: 2 2 2 d
10: 2 2 2 e
11: 3 3 3 a
12: 3 3 3 b
13: 3 3 3 c
14: 3 3 3 d
15: 3 3 3 e
.SD stands for Subset of Data.table, so it is itself a data.table object. And because every data.table is also a data.frame, class(.SD) returns a length-2 character vector for each group, which is a little confusing if you expect a single row per group.
To avoid that confusion you can wrap the result in a list, enforcing a single row per group.
library(data.table)
dt <- data.table(x=1:3, y=1:3)
dt[, .(class = list(class(.SD))), by = x]
# x class
#1: 1 data.table,data.frame
#2: 2 data.table,data.frame
#3: 3 data.table,data.frame
Every data.table is a data.frame, and shows both applicable classes when asked:
> class(dt)
[1] "data.table" "data.frame"
This applies to .SD, too, because .SD is a data.table by definition (.SD is a data.table containing the Subset of x's Data for each group).
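A quick way to confirm this per group (a small check, not part of the original answer):
library(data.table)
dt <- data.table(x = 1:3, y = 1:3)
dt[, .(is_dt = is.data.table(.SD)), by = x]   # .SD is a data.table in every group
#    x is_dt
# 1: 1  TRUE
# 2: 2  TRUE
# 3: 3  TRUE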

Data.table summary statistics from n first observations per group

I'd like to use data.table to make summary statistics based on only the first n observations found for each group. I have a solution that works (below), but I have a nagging feeling that this could be written as a one-liner in data.table, and I cannot figure out how.
library(data.table)
DT <- data.table(y=1:10, grp=rep(1:2,5))
This produces
y grp
1: 1 1
2: 2 2
3: 3 1
4: 4 2
5: 5 1
6: 6 2
7: 7 1
8: 8 2
9: 9 1
10: 10 2
and I basically want to make summary statistics of y based on, say, the first two observations for each group. The following command gives me the index (by group)
DT2 <- DT[, .(idx = 1:.N, y), by=grp]
which yields
grp idx y
1: 1 1 1
2: 1 2 3
3: 1 3 5
4: 1 4 7
5: 1 5 9
6: 2 1 2
7: 2 2 4
8: 2 3 6
9: 2 4 8
10: 2 5 10
and then I can use data.table again to create the summary based on the relevant selection.
DT2[idx<3, .(my = mean(y)), by=grp]
to get
grp my
1: 1 2
2: 2 3
Is it possible to write this as a single call to data.table?
The one-call solution is:
DT[, .(my = mean(y[1:2])), by = grp]
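Note that y[1:2] produces an NA if a group happens to contain fewer than two rows; if that can occur, head() is a safer way to take "the first up to n" observations (a small variation on the answer above):
DT[, .(my = mean(head(y, 2))), by = grp]
#    grp my
# 1:   1  2
# 2:   2  3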

row numbers for explicit rows in r

I need to get row numbers for specific rows, grouped over id. Let's say the data frame (df) looks like this:
id a b
3 2 NA
3 3 2
3 10 NA
3 21 0
3 2 NA
4 1 5
4 1 0
4 5 NA
I need to create one more column that gives a row-number sequence, excluding the cases where b == 0.
desired output:
id a b row
3 2 NA 1
3 3 2 2
3 10 NA 3
3 21 0 -
3 2 NA 4
4 1 5 1
4 1 0 -
4 5 NA 2
I used dplyr but was not able to achieve this.
My code:
df <- df %>%
  group_by(id) %>%
  mutate(row = row_number(id[b != 0]))
Please suggest some better way to do this.
I would propose using the data.table package for its nice capability of operating on subsets, which avoids inefficient operations such as ifelse or evaluating the whole data set. Also, it is better to keep your vector in a numeric class (for future operations), so NA will probably be preferable to "-" (a character). Here's a possible solution:
library(data.table)
setDT(df)[is.na(b) | b != 0, row := seq_len(.N), by = id]
# id a b row
# 1: 3 2 NA 1
# 2: 3 3 2 2
# 3: 3 10 NA 3
# 4: 3 21 0 NA
# 5: 3 2 NA 4
# 6: 4 1 5 1
# 7: 4 1 0 NA
# 8: 4 5 NA 2
The idea here is to operate only on the rows where is.na(b) | b != 0 and generate a sequence along each group (seq_len(.N)) while updating row in place (using :=). All the remaining rows are left as NA by default.
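For comparison, the same idea can be sketched in dplyr, which the question started from; the helper column keep is purely illustrative, and this needs the elementwise branching that the data.table subset-assignment avoids:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(keep = is.na(b) | b != 0,                             # rows that should get a number
         row  = ifelse(keep, cumsum(keep), NA_integer_)) %>%   # running count within the kept rows
  select(-keep) %>%
  ungroup()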

Summary of a data.table using internal order

I need a way to calculate the third column of this data.table.
DT=data.table(group=c(1,1,0,0,1,1),x=c(1,1,1,2,2,2),ResultNeeded=c(2,2,3,3,4,4))
I am guessing the following can be modified to get the result I need.
DT[,sum:=sum(x),by=group]. I just don't know how to do it.
In the development version of data.table, v1.9.5, there's a function rleid() that helps accomplish this in a slightly cleaner way:
require(data.table) ## v1.9.5+
DT[, ans := sum(x), by=rleid(group)]
# group x ResultNeeded ans
# 1: 1 1 2 2
# 2: 1 1 2 2
# 3: 0 1 3 3
# 4: 0 2 3 3
# 5: 1 2 4 4
# 6: 1 2 4 4
rleid() groups consecutive runs of identical values together (named after the base function rle()).
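For intuition, here is what rleid() returns on the group column of this example:
rleid(DT$group)
# [1] 1 1 2 2 3 3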
You can install the development version by following the instructions here.
You're close, you just need the correct grouping:
DT[, sum := sum(x), by = cumsum(c(F, diff(group) != 0))]
# group x ResultNeeded sum
#1: 1 1 2 2
#2: 1 1 2 2
#3: 0 1 3 3
#4: 0 2 3 3
#5: 1 2 4 4
#6: 1 2 4 4
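To see why this grouping is equivalent, here is what the cumsum expression evaluates to for this group column (an added check, not part of the original answer):
group <- c(1, 1, 0, 0, 1, 1)
cumsum(c(FALSE, diff(group) != 0))
# [1] 0 0 1 1 2 2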
