r data.table - shift/lead - accessing values of multiple rows

Is it possible to access the values of multiple previous rows? I would like to look up values from all rows before the current one (in a cumulative or relative way), e.g. get a column's values from every preceding row as a list.
For example, the reference code below calculates an unbiased mean by excluding the current group. I am looking for a way to exclude all previous rows relative to the current row (i.e. relative processing). My understanding is that shift will let us access the previous or next row, but not the values from all previous rows or all following rows.
http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/#method-1-in-line
dt <- data.table(mtcars)[,.(cyl, gear, mpg)]
dt[, dt[!gear %in% unique(dt$gear)[.GRP], mean(mpg), by=cyl], by=gear] #unbiased mean
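For the narrower goal of reaching more than one previous row: shift() does accept a vector of lags, and running aggregates such as cumsum() can stand in for "all previous rows" without materializing them as a list. A minimal sketch with a toy column x (not from the question's data):

```r
library(data.table)
dt <- data.table(x = 1:5)

# shift() accepts a vector n, producing one lagged column per lag:
dt[, paste0("lag", 1:3) := shift(x, 1:3)]

# A mean over all *previous* rows can be built from cumulative sums,
# avoiding any explicit per-row lookback:
dt[, prev_mean := (cumsum(x) - x) / (seq_len(.N) - 1)]
```

prev_mean is NaN for the first row, where no previous rows exist.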

Last up-to-x rows per group from DT

I have a data.table from which I want to pull the last 10,000 rows on a per-group basis. Unfortunately, I have been getting inconsistent results depending on the method employed, so I'm obviously not understanding the full picture. I have concerns about each of my methods.
The data is structured such that there is a set of columns I wish to group by, and within each group I want the entries corresponding to the last 10,000 POSIXct timestamps (if that many exist; otherwise return all of them). Entries are uniquely defined by the combination of the grouping columns and the timestamp, even though there are several other data columns. In the example below, my timestamp column is ts, and keycol1 and keycol2 are the fields I'm grouping on. DT has 2,809,108 entries.
setkey(DT,keycol1,keycol2,ts)
DT[DT[,.I[.N-10000:.N], by=c("keycol1","keycol2")]$V1]
returns 1,181,256 entries. No warning or error is issued. My concern is what happens when .N-10000 is less than 1 for a group. When I do DT[-10:10] I receive the following error:
Error in [.data.table(DT, -10:10) : Item 1 of i is -10 and item
12 is 1. Cannot mix positives and negatives.
leading me to believe that the .I[.N-10000:.N] may not be working as intended.
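The inconsistency is operator precedence: `:` binds tighter than `-`, so .N-10000:.N parses as .N - (10000:.N), not (.N-10000):.N. A sketch with a toy DT and n = 3 (invented columns g and ts) showing a parenthesized, bounds-guarded version:

```r
library(data.table)
DT <- data.table(g = rep(c("a", "b"), c(2, 5)), ts = 1:7)
n <- 3

# Parenthesize the range and clamp the lower bound at 1, so groups
# smaller than n return all their rows instead of bad indices:
idx <- DT[, .I[max(1L, .N - n + 1L):.N], by = g]$V1
res <- DT[idx]
```

Group "a" has only 2 rows, so all of them survive; group "b" contributes its last 3.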
If I instead sort the timestamps backward and then use the strategy described by Jaap in his answer to this question,
DT[DT[order(-ts), .I[1:10000], by=c("keycol1","keycol2")]$V1, nomatch=NULL]
returns 3,810,000 entries, some of which are all NA, suggesting that the nomatch parameter isn't being honored (nomatch=0 returns the same). Chaining a [!is.na(ts)] tells me that it returns 1,972,166 valid entries, which is more than the previous "solution". However, do the values of .I correspond to the row numbers of the original DT, or of the reverse-sorted (within group) DT? That is, does the outer selection return the true matches, or will it actually return the first 10,000 entries per group rather than the last 10,000?
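On the .I question: .I holds row numbers of the original DT, arranged in whatever order i imposes, so the outer subset does select the last timestamps. The all-NA rows come from groups smaller than 10,000: indexing .I with 1:10000 pads with NA, and nomatch= is a join argument with no effect on row-number subsetting. A toy sketch (n = 2, invented g/ts columns):

```r
library(data.table)
DT <- data.table(g = c("a", "a", "a", "b"), ts = c(2, 3, 1, 9))
n <- 2

# .I returns *original* row numbers, ordered per i; groups smaller
# than n pad with NA, so drop the NAs explicitly:
idx <- DT[order(-ts), .I[1:n], by = g]$V1
res <- DT[na.omit(idx)]
```

res contains the rows with the largest timestamps per group, referenced by their original row numbers.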
Okay, to resolve this question: can I have the key itself work backward?
setkey(DT,keycol1,keycol2,ts)
setorder(DT,keycol1,keycol2,-ts)
key(DT)
NULL
setkey(DT,keycol1,keycol2,-ts)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) :
some columns are not in the data.table: -ts
That'd be a no then.
Can this be resolved by using .SD rather than .I?
DT[
DT[order(-ts), .SD[1:10000], by=c("keycol1","keycol2")],
nomatch=0, on=c("keycol1","keycol2","ts")
]
returns 1,972,166 entries. Although I'm fairly confident these entries are the ones I want, this approach also duplicates the columns that are not part of the key or timestamp (as i.A, i.B, etc.). I think these are the same entries as in the .I[1:10000] example with order(-ts): if I store each result, delete the extra columns from the .SD version, and run setkey(resultA, keycol1, keycol2, ts) on each, then identical(resultA, resultB) returns
TRUE
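One way to avoid the duplicated i.* columns is to keep only the join columns in the inner query, via .SDcols. A sketch under assumed column names (keycol1, ts, and one payload column A; single grouping column for brevity):

```r
library(data.table)
set.seed(42)
DT <- data.table(keycol1 = rep(c("x", "y"), each = 4),
                 ts = as.POSIXct("2020-01-01") + 1:8,
                 A = rnorm(8))
n <- 2

# The inner query returns only the grouping column and ts, so the
# join contributes no extra i.* columns to the result:
res <- DT[DT[order(-ts), head(.SD, n), by = keycol1, .SDcols = "ts"],
          on = c("keycol1", "ts"), nomatch = NULL]
```

Here nomatch= does apply, because this is a join rather than row-number subsetting.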
Related threads:
Throw away first and last n rows
data.table - select first n rows within group
How to extract the first n rows per group?

Row aggregations using data.table

So I want to aggregate the values of the rows of my data.table using custom functions. For instance, I know that I can sum over the rows doing something like
iris[,ROWSUM := rowSums(.SD),.SDcols=names(iris)[names(iris)!="Species"]]
and I could get the mean using rowMeans. I can even control which columns I include using .SDcols. However, let's say I want to compute the 20th percentile of the values in each row (using, for example, quantile()). Is there any way to do this that does not entail looping over the rows and setting a value per row?
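One option that avoids a hand-written row loop is apply() with MARGIN = 1 (it still iterates internally; matrixStats::rowQuantiles is a vectorized alternative for large data). A sketch, assuming iris has been converted with as.data.table():

```r
library(data.table)
it <- as.data.table(iris)
num_cols <- setdiff(names(it), "Species")

# 20th percentile of each row, computed across the numeric columns:
it[, P20 := apply(.SD, 1, quantile, probs = 0.20), .SDcols = num_cols]
```

apply() coerces .SD to a matrix, which is safe here because all selected columns are numeric.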

difference between last row and row meeting condition dplyr

This is probably easy, but in a grouped data frame, I'm trying to find the difference in diff.col between the last row and the row where var.col is 'B'. The condition appears only once within each group. I'd like to make that difference a new variable using summarize() from dplyr.
my.data<-data.frame(diff.col=seq(1:10),var.col=c(rep('A',5),'B',rep('A',4)))
I'd like to keep this in dplyr, and I know how to code all of it except selecting diff.col where var.col == 'B'.
my.data%>%summarize(new.var=last(diff.col)-????)
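Since the condition occurs exactly once per group, logical subsetting inside summarize() can fill the blank; one sketch on the ungrouped example data:

```r
library(dplyr)
my.data <- data.frame(diff.col = 1:10,
                      var.col = c(rep("A", 5), "B", rep("A", 4)))

# diff.col[var.col == "B"] has length 1 because "B" appears exactly once:
out <- my.data %>%
  summarize(new.var = last(diff.col) - diff.col[var.col == "B"])
```

With a grouped data frame, the same expression works per group as long as "B" appears once in each.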

Divide every number in every column by 1000 in R

I would like to divide every number in every column by 1000, omitting the header row and the first column.
I have tried this code:
TEST2=(TEST[2:503,]/(1000))
But it is not what I am looking for. My dataframe has 503 columns.
Is TEST a data frame? In that case, the header row won't be divided by 1000. To choose all columns except the first, use an index in j, e.g.
TEST[, 2:ncol(TEST)]/1000 # every row, columns 2 through the last
# same thing
TEST[, -1]/1000 # every row, every column but the first
Or you can select columns by name, etc. (you select columns just like how you are selecting rows at the moment).
Take a look at ?'[' to learn how to select particular rows and columns.
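Note that TEST[, -1]/1000 returns the divided columns without the first one; to keep the first column intact alongside the result, assign back in place. A sketch with a made-up three-column TEST:

```r
# Hypothetical data: the first column is an identifier to leave untouched.
TEST <- data.frame(id = c("a", "b", "c"),
                   x = c(1000, 2000, 3000),
                   y = c(500, 1500, 2500))

# Divide everything except column 1, preserving TEST's shape:
TEST[, -1] <- TEST[, -1] / 1000
```

After this, id is unchanged and every numeric column is scaled.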

R - How to get value from a column based on value from another column of same row

I have a data frame (df) with 8 columns and 1200 rows. I want to find the minimum value of column 7, and then the corresponding value of column 2 in the row where that minimum was found. Column 2 holds characters, so I want a character vector giving me its value.
I found the minimum of column 7 using
min_val <- min(as.numeric(df[, 7]), na.rm = TRUE)
Now how do I get the value from column 2 (variable name of column being 'column.2') corresponding to the row in which column 7 contains value of 'min_val' as calculated above?
This might be a trivial question but I am new to R so any help will be much appreciated.
Use which.min to get the minimum value index. Something like :
df[which.min(df[,7]),2]
Note that which.min only returns the first index of the minimum, so if you've got several rows with the same minimal value, you will only get the first one.
If you want to get all the minimum rows, you can use :
df[which(df[,7]==min(df[,7])), 2]
The same answer as juba's, but using the data.table package (his answer uses only base R, without the need to load any libraries).
# Load data.table
library(data.table)
# Get 2nd column's value correspondent to the first minimum value in 7th column
df[which.min(V7), V2]
# Get all values in 2nd column corresponding to rows holding the minimum value in 7th column
df[V7 == min(V7), V2]
For handling data.frame-like objects, data.table is quite handy and helpful, just like the dplyr package. Both are worth a look.
Here I've assumed your columns were named V1..V8. Otherwise, just replace V7/V2 with the names of the 7th and 2nd columns of your data, respectively.
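Both idioms side by side on a toy table using the assumed V1..V8-style names (only V2 and V7 shown), with a tie at the minimum to illustrate the difference:

```r
library(data.table)
df <- data.table(V2 = c("a", "b", "c", "d"), V7 = c(3, 1, 4, 1))

first_min <- df[which.min(V7), V2]   # first row at the minimum only
all_mins  <- df[V7 == min(V7), V2]   # every row tied at the minimum
```

which.min() silently picks the earliest of the tied rows, which is easy to miss with real data.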
