difference between last row and row meeting condition dplyr - r

This is probably easy, but in a grouped data frame, I'm trying to find the difference in diff.col between the last row and the row where var.col is 'B'. The condition only appears once within each group. I'd like to make that difference a new variable using summarize from dplyr.
my.data<-data.frame(diff.col=1:10,var.col=c(rep('A',5),'B',rep('A',4)))
I'd like to keep this in dplyr and I know how to code it except for selecting diff.col where var.col=='B'.
my.data%>%summarize(new.var=last(diff.col)-????)
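One possible way to fill in the missing piece, as a minimal sketch: subset diff.col with a logical condition on var.col inside summarize. This assumes the condition matches exactly one row per group, as stated in the question; the sample data is ungrouped, so no group_by() is shown.

```r
library(dplyr)

# Sample data from the question
my.data <- data.frame(diff.col = 1:10,
                      var.col = c(rep("A", 5), "B", rep("A", 4)))

# Subtract the diff.col value on the "B" row from the last diff.col value.
# With grouped data, a group_by() call would go before summarize().
result <- my.data %>%
  summarize(new.var = last(diff.col) - diff.col[var.col == "B"])

result$new.var  # 4 for this sample (10 - 6)
```

If a group could contain zero or several 'B' rows, the subset would no longer return a single value, so this sketch only fits the one-match case.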

Related

Using group_by to determine median of second column after sorting/filtering for a specific value in first column?

I have a huge dataset which has been difficult to work with.
I want to find the median of a second column but only based on one value in the first column. I have used this formula to find general medians without specifying/sorting by the specific values in the first column:
df %>% group_by(column1) %>% summarise(Median=median(column2))
However, there is a specific value in column1 I am hoping to sort by and I only want the medians of the second column based on this first value. Would I do something similar to the below?
df %>% group_by(column1, specificvalue) %>% summarise(Median=median(column2))
Is there an easier way to do this? Would it be easier to make a new dataframe based on the specific value in the first column? How would that be done so that I could have column 1 only include the specific value I want but the rest of the rows included so I can easily determine the median of column2?
Thanks!!
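A minimal sketch of one common approach: filter the rows down to the value of interest first, then summarise. The data frame and the value "x" below are made up for illustration; only the column names column1 and column2 come from the question.

```r
library(dplyr)

# Hypothetical data standing in for the real dataset
df <- data.frame(column1 = c("x", "x", "y", "x", "y"),
                 column2 = c(1, 3, 10, 8, 20))

# Keep only rows where column1 equals the value of interest,
# then take the median of column2 over those rows.
result <- df %>%
  filter(column1 == "x") %>%
  summarise(Median = median(column2))

result$Median  # 3 for this made-up data (median of 1, 3, 8)
```

Filtering this way avoids building a separate data frame by hand; the other rows are simply left out of the calculation.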

r data.table - shift/lead - accessing values of multiple rows

Is it possible to access the values of multiple previous rows? I would like to look up values from the previous rows in a cumulative or relative way, e.g. get the values of a column from all previous rows as a list.
For example, the reference code below calculates an unbiased mean by excluding the existing row. I am looking for an option to exclude all previous rows relative to the current row (i.e. relative processing). My assumption is that shift gives access to the previous or next row, but not to the values from all previous rows or all following rows.
http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/#method-1-in-line
library(data.table)
dt <- data.table(mtcars)[,.(cyl, gear, mpg)]
dt[, dt[!gear %in% unique(dt$gear)[.GRP], mean(mpg), by=cyl], by=gear] #unbiased mean
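One reading of the question is "for each row, aggregate over all rows before it". Under that assumption, a cumulative function combined with shift can cover it without storing a list of previous values. A minimal sketch with made-up data (the column name x and the table are hypothetical):

```r
library(data.table)

# Hypothetical data for illustration
dt <- data.table(x = c(2, 4, 6, 8))

# Mean of all previous rows, excluding the current row:
# shift(cumsum(x)) is the running total up to the previous row,
# and seq_len(.N) - 1 is the number of previous rows.
dt[, prev_mean := shift(cumsum(x)) / (seq_len(.N) - 1)]

dt$prev_mean  # NA, 2, 3, 4 (the first row has no previous rows)
```

This only fits aggregates that can be built from running totals; other summaries over previous rows would need a different approach.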

How to use dplyr::Distinct Based on the Values of Another Variable

library(tidyverse)
Using the sample data below, I want to use dplyr::distinct() based on a condition. I want to eliminate duplicates in the ID column, keeping only the row with the lowest value of "Rate" for each ID. For example, for "A1A1", the row with the rate of 2 should be removed, while for "CC33", the rows with "Rate" equal to 2 and 3 should be removed. I also want to end up with all columns by using dplyr::distinct with ".keep_all=TRUE".
I tried the code below, but this removes the Subject column.
DF2%>%group_by(ID)%>%summarise(Min_rate=min(Rate))
I also played around with a group_by, mutate, and if_else, but couldn't get it to work...
DF2%>%group_by(ID)%>%mutate(if_else(Rate=min(Rate),Rate,distinct(ID)
Help would be appreciated...
Sample Data:
ID<-c("A1A1","A22B","CC33","D33D","A1A1","4DD8","4DD8","CC33","CC33","56DK","F4G5","8Y0R")
Subject<-c("Subject1","Subject2","Subject3","Subject4","Subject5","Subject6","Subject7","Subject8","Subject9","Subject10","Subject11","Subject12")
Rate<-c(1,2,3,2,2,3,2,1,2,2,2,3)
DF2<-tibble(ID,Subject,Rate)
I found a way to accomplish what I want by first using dplyr's "group_by" and "mutate" functions together with "if_else" to recode the smallest value of the rate variable within each ID group with a 1, and all other values with a 0.
DF2<-DF2%>%group_by(ID)%>%mutate(Rate_Min=if_else(Rate==min(Rate),1,0))
I then use dplyr's "filter" to remove the 0's.
DF2<-DF2%>%filter(Rate_Min==1)
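An alternative sketch that skips the helper column, assuming a dplyr version that provides slice_min (1.0.0 or later). Note one difference from the filter approach above: with_ties = FALSE keeps exactly one row per ID even when several rows share the minimum rate.

```r
library(dplyr)

# Sample data from the question
ID <- c("A1A1","A22B","CC33","D33D","A1A1","4DD8","4DD8","CC33","CC33","56DK","F4G5","8Y0R")
Subject <- c("Subject1","Subject2","Subject3","Subject4","Subject5","Subject6",
             "Subject7","Subject8","Subject9","Subject10","Subject11","Subject12")
Rate <- c(1,2,3,2,2,3,2,1,2,2,2,3)
DF2 <- data.frame(ID, Subject, Rate)

# For each ID, keep the single row with the lowest Rate; all columns are retained.
deduped <- DF2 %>%
  group_by(ID) %>%
  slice_min(Rate, n = 1, with_ties = FALSE) %>%
  ungroup()
```

For this sample, deduped has one row per ID (8 rows) and still contains the Subject column.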

Row aggregations using data.table

So I want to aggregate the values of the rows of my data.table using custom functions. For instance, I know that I can sum over the rows doing something like
iris[,ROWSUM := rowSums(.SD),.SDcols=names(iris)[names(iris)!="Species"]]
and I could get the mean using rowMeans. I can even control which columns I include using .SDcols. However, let's say I want to compute the 20th percentile of the values of each row (using, for example, quantile()). Is there any way to do this that does not entail looping over the rows and setting a value per row?
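A minimal sketch of one option: apply() over the rows of .SD. It still iterates over rows internally, so it is not vectorised the way rowSums is, but it avoids an explicit loop with per-row assignment. The column name Q20 is made up for illustration.

```r
library(data.table)

# Work on a data.table copy of the built-in iris data
dt <- as.data.table(iris)
num_cols <- names(dt)[names(dt) != "Species"]

# Row-wise 20th percentile across the numeric columns.
# apply(.SD, 1, ...) converts .SD to a matrix and calls quantile() once per row.
dt[, Q20 := apply(.SD, 1, quantile, probs = 0.2), .SDcols = num_cols]

head(dt$Q20)
```

For very wide or very large tables the matrix conversion and per-row calls may be slow, so this is a starting point rather than a tuned solution.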

Conditionally create new column in R

I would like to create a new column in my dataframe that assigns a categorical value based on a condition to the other observations.
In detail, I have a column that contains timestamps for all observations. The rows are ordered ascending according to the timestamp.
Now, I'd like to calculate the difference between each pair of consecutive timestamps, and if it exceeds a certain threshold, the factor should be increased by 1 (see Desired Output).
Desired Output
I tried to solve it with a for loop, but that takes a lot of time because the dataset is huge.
After searching for a bit I found this approach and tried to adapt it: R - How can I check if a value in a row is different from the value in the previous row?
ind <- with(df, c(TRUE, timestamp[-1L] > (timestamp[-length(timestamp)]-7200)))
However, I can not make it work for my dataset.
Thanks for your help!
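A minimal vectorised sketch of the idea described above, using diff() and cumsum() from base R. The data and the column name group are made up; the threshold of 7200 is taken from the snippet in the question and is assumed to be in the same units as the timestamps.

```r
# Hypothetical data: numeric timestamps in ascending order
df <- data.frame(timestamp = c(0, 3600, 7000, 20000, 21000, 40000))
threshold <- 7200

# diff() gives the gap to the previous row; each gap above the threshold
# starts a new group. The leading TRUE makes the first row start group 1.
df$group <- cumsum(c(TRUE, diff(df$timestamp) > threshold))

df$group  # 1 1 1 2 2 3
```

If the timestamps are stored as POSIXct, the gaps may need an explicit conversion to seconds (for example via as.numeric) before comparing against the threshold.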
