Row aggregations using data.table - r

So I want to aggregate the values of the rows of my data.table using custom functions. For instance, I know that I can sum over the rows doing something like
iris[,ROWSUM := rowSums(.SD),.SDcols=names(iris)[names(iris)!="Species"]]
and I could get the mean using rowMeans. I can even control which columns I include using .SDcols. However, let's say I want to compute the 20th percentile of the values of each row (using, for example, quantile()). Is there any way to do this that does not entail looping over the rows and setting a value per row?
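One common approach (a sketch, assuming the stock iris data) is to apply() quantile() across the numeric columns of each row via .SD:

```r
library(data.table)

dt <- as.data.table(iris)
num.cols <- setdiff(names(dt), "Species")

# Row-wise 20th percentile: apply() with MARGIN = 1 walks the rows of .SD,
# passing each row's numeric values to quantile()
dt[, P20 := apply(.SD, 1, quantile, probs = 0.2), .SDcols = num.cols]
```

Note that apply() converts .SD to a matrix internally, so for very large tables a matrix-based or by-row-index approach may be faster.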

Related

Using group_by to determine median of second column after sorting/filtering for a specific value in first column?

I have a huge dataset which has been difficult to work with.
I want to find the median of a second column but only based on one value in the first column. I have used this formula to find general medians without specifying/sorting by the specific values in the first column:
df %>% group_by(column1) %>% summarise(Median = median(column2))
However, there is a specific value in column1 I am hoping to sort by and I only want the medians of the second column based on this first value. Would I do something similar to the below?
df %>% group_by(column1, specificvalue) %>% summarise(Median = median(column2))
Is there an easier way to do this? Would it be easier to make a new dataframe based on the specific value in the first column? How would that be done so that I could have column 1 only include the specific value I want but the rest of the rows included so I can easily determine the median of column2?
Thanks!!
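A sketch of the filter-then-summarise approach; the data frame contents and the specific value "b" below are invented for illustration:

```r
library(dplyr)

# Hypothetical data standing in for the real df
df <- data.frame(column1 = c("a", "a", "b", "b", "b"),
                 column2 = c(1, 3, 2, 4, 6))

# Keep only rows with the specific value in column1, then take the median
result <- df %>%
  filter(column1 == "b") %>%
  summarise(Median = median(column2))
```

This avoids creating a separate data frame: filter() restricts the rows before summarise() runs, so the median is computed only over the rows matching the specific value.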

r data.table - shift/lead - accessing value of multiple row

Is it possible to access the values of multiple previous rows? I would like to look up values from the previous rows (in a cumulative or relative way), e.g. get the values of a column from all previous rows as a list.
e.g. see the reference code below, which calculates an unbiased mean by excluding the current group. I am looking for an option to exclude all previous rows from the current row (i.e. relative processing). My understanding is that shift allows us to access the previous or next row, but not the values from all previous rows or all next rows.
http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/#method-1-in-line
dt <- data.table(mtcars)[,.(cyl, gear, mpg)]
dt[, dt[!gear %in% unique(dt$gear)[.GRP], mean(mpg), by=cyl], by=gear] #unbiased mean
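That is right: shift() only reaches a fixed number of rows back. One way to reach all previous rows is to index the column by the current row number (a sketch with made-up data, not the mtcars example above):

```r
library(data.table)

dt2 <- data.table(x = c(2, 4, 6, 8))

# Mean of all rows strictly before each row.
# seq_len(i - 1) selects every earlier row; the first row gets NaN,
# since there is nothing before it.
dt2[, prev_mean := vapply(seq_len(.N),
                          function(i) mean(x[seq_len(i - 1)]),
                          numeric(1))]
```

The same pattern works per group by putting the expression inside `by=`, where `.N` and the row index are then relative to each group.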

Filtering iteratively through a data table in R

I have a data table with 3 variables and 1 frequency column, and I wish to add a proportion column.
Variable 1 has 4 unique values,
variable 2 has 5,
and variable 3 has 2.
The frequency column captures how many times each combination occurs.
But if I apply prop.table to it, it calculates the proportions over the whole data.table, when I really want the proportions within the subsets of variable 2.
I thought of iterating, but that seems complicated with tables.
You could use the aggregate function (or tapply) to sum all the counts within the categories of variable 2, then use prop.table or similar on the result.
If you want to use the tidyverse instead of base R, this would be a group_by followed by summarise to sum within each group, then a division by the group totals to calculate the proportions.
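A base-R sketch of the aggregate-then-divide idea; the column names v1 and v2 and the counts are invented for illustration:

```r
# Hypothetical frequency table: proportions wanted within each level of v2
df <- expand.grid(v1 = c("a", "b"), v2 = c("A", "B"))
df$Freq <- c(10, 30, 20, 40)

# Sum the counts within each level of v2 ...
totals <- aggregate(Freq ~ v2, data = df, FUN = sum)

# ... then divide each row's count by its group total
df$prop <- df$Freq / totals$Freq[match(df$v2, totals$v2)]
```

Within each level of v2 the prop column now sums to 1, which is the "proportion within the subsets of variable 2" behaviour the question asks for.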

difference between last row and row meeting condition dplyr

This is probably easy, but in a grouped data frame I'm trying to find the difference in diff.col between the last row and the row where var.col is 'B'. The condition appears only once within each group. I'd like to make that difference a new variable using summarize from dplyr.
my.data<-data.frame(diff.col=seq(1:10),var.col=c(rep('A',5),'B',rep('A',4)))
I'd like to keep this in dplyr, and I know how to code it except for selecting diff.col where var.col == 'B'.
my.data%>%summarize(new.var=last(diff.col)-????)
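One way to fill the placeholder is logical subsetting inside summarize (a sketch; since the condition appears once per group, the subset yields a single value, and the same expression works after a group_by()):

```r
library(dplyr)

my.data <- data.frame(diff.col = seq(1, 10),
                      var.col = c(rep('A', 5), 'B', rep('A', 4)))

# Subset diff.col at the row where var.col is 'B'
result <- my.data %>%
  summarize(new.var = last(diff.col) - diff.col[var.col == 'B'])
```

Here last(diff.col) is 10 and diff.col at the 'B' row is 6, so new.var is 4.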

Subset dataframe based on statistical range of each column

I would like to subset a dataframe by selecting only columns that exceed a specific range. I.e., I would like to evaluate max - min for each column individually and select only columns whose range is greater than a given value. For example, given the following simple dataframe, I would like to create a subset dataframe that only contains columns with a range > 99 (columns b and c).
d <- data.frame(a=seq(0,10,1),b=seq(0,100,10),c=seq(0,200,20))
I have tried modifying the example here: Subset a dataframe based on a single condition applied to multiple columns, but have had no luck. I'm sure I'm missing something simple.
You can use sapply() to apply a function to each column of d; inside it, diff(range(x)) gives max - min for a column, which you compare to 99. The result is a logical vector of TRUE/FALSE values that you then use to subset the columns.
d[,sapply(d,function(x) diff(range(x))>99)]
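Applied to the example data, with drop = FALSE added so the result stays a data frame even if only one column survives the filter:

```r
d <- data.frame(a = seq(0, 10, 1), b = seq(0, 100, 10), c = seq(0, 200, 20))

# Logical vector: TRUE for columns whose range (max - min) exceeds 99
keep <- sapply(d, function(x) diff(range(x)) > 99)

d2 <- d[, keep, drop = FALSE]  # drop = FALSE preserves the data.frame class
names(d2)                      # "b" "c"
```

Column a has range 10 and is dropped; b (range 100) and c (range 200) are kept.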
