Return subset of rows based on aggregate of keyed rows [duplicate]

This question already has an answer here:
Subset rows corresponding to max value by group using data.table
(1 answer)
Closed 6 years ago.
I would like to subset a data.table in R within each group, based on an aggregate function computed over that group's rows. For example, for each key, return all rows whose value is greater than the mean of a field calculated only over the rows in that group. Example:
library(data.table)
t=data.table(Group=rep(c(1:5),each=5),Detail=c(1:25))
setkey(t,'Group')
library(foreach)
library(dplyr)
ret = foreach(grp = t[, unique(Group)], .combine = bind_rows, .multicombine = TRUE) %do%
  t[Group == grp & Detail > t[Group == grp, mean(Detail)], ]
# Group Detail
# 1: 1 4
# 2: 1 5
# 3: 2 9
# 4: 2 10
# 5: 3 14
# 6: 3 15
# 7: 4 19
# 8: 4 20
# 9: 5 24
#10: 5 25
The question is: is it possible to code the last two lines succinctly using data.table features? Sorry if this is a repeat; I am also struggling to phrase the exact goal so that Google/Stack Overflow can find it.

Using the .SD special symbol works. Was not aware of it, thanks:
t[, .SD[Detail > mean(Detail)], by = Group]
This also works, with some performance gains:
indx <- t[, .I[Detail > mean(Detail)], by = Group]$V1
t[indx]
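To see the gain for yourself, a quick comparison sketch, assuming the microbenchmark package is installed (illustrative only, not from the original answer; timings vary with data size):
library(microbenchmark)
microbenchmark(
  SD = t[, .SD[Detail > mean(Detail)], by = Group],       # subset .SD within each group
  I  = t[t[, .I[Detail > mean(Detail)], by = Group]$V1]   # collect row numbers, subset once
)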

Adding group column to data frame [duplicate]

This question already has an answer here:
Compute the minimum of a pair of vectors
(1 answer)
Closed 7 years ago.
Say I have the following data frame:
dx=data.frame(id=letters[1:4], count=1:4)
# id count
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d 4
And I would like to (programmatically) add a column that gets the count whenever count < 3, otherwise 3, so I'll get the following:
# id count group
# 1 a 1 1
# 2 b 2 2
# 3 c 3 3
# 4 d 4 3
I thought to use
dx$group=if(dx$count<3){dx$count}else{3}
but it doesn't work on vectors. How can I do it?
In this particular case you can just use pmin (as I stated in the comments above):
dx$group <- pmin(dx$count, 3)
In general your if/else construction does not work on vectors, but you can use the ifelse function instead. It takes three arguments: first the condition, then the result if the condition is met, and finally the result if it is not. For your example you would write the following:
dx$group <- ifelse(dx$count < 3, dx$count, 3)
Note that in your example the pmin solution is better; the ifelse solution is mentioned only for completeness.
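To see why the original if/else attempt fails, a minimal sketch (in R >= 4.2 an if() condition of length greater than one is an error, so it cannot silently handle a whole column):
x <- c(1, 2, 3, 4)
# if (x < 3) ... fails: if() expects a single TRUE or FALSE
ifelse(x < 3, x, 3)  # vectorized: tests each element separately
# [1] 1 2 3 3
pmin(x, 3)           # parallel minimum gives the same result here
# [1] 1 2 3 3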

rolling cumulative sums conditional on missing data

I want to calculate rolling cumulative sums by item in a data.table. Sometimes, data is missing for a given time period.
set.seed(8)
item <- c(rep("A",4), rep("B",3))
time <- c(1,2,3,4,1,3,4)
sales <- rpois(7,5)
DT <- data.table(item, time,sales)
For a rolling window of 2 time periods I want the following output:
item time sales sales_rolling2
1: A 1 5 5
2: A 2 3 8
3: A 3 7 10
4: A 4 6 13
5: B 1 4 4
6: B 3 6 6
7: B 4 4 10
Note, that item B has no data at time 2. Thus the result for row 6 just includes the latest observation.
We can use rollsum from library(zoo) to do the rolling sum. Before applying rollsum, we need to create another grouping variable ('indx') based on the 'time' variable, because for item 'B' the time is not continuous, i.e. 2 is missing. We can use diff to build a logical index from the differences of adjacent elements: where a difference is not 1 it gives TRUE, otherwise FALSE. As the diff output is one element shorter than the column, we pad with TRUE and then take the cumsum to create the 'indx' variable.
library(zoo)
DT[, indx:=cumsum(c(TRUE, diff(time)!=1))]
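As a quick illustrative check, 'indx' increments whenever 'time' does not advance by exactly 1 from the previous row:
DT[, .(item, time, indx)]
#    item time indx
# 1:    A    1    1
# 2:    A    2    1
# 3:    A    3    1
# 4:    A    4    1
# 5:    B    1    2
# 6:    B    3    3
# 7:    B    4    3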
In the second step, we use both 'indx' and 'item' as grouping variables and take the rollsum of 'sales' with k = 2, on the condition that the group has more than one element (if (.N > 1)); otherwise we just return 'sales'. This creates 'sales_rolling2', and we then assign (:=) 'indx' to NULL as it is not needed in the expected output.
DT[, sales_rolling2 := if (.N > 1) c(sales[1], rollsum(sales, 2)) else sales,
   by = .(indx, item)][, indx := NULL]
# item time sales sales_rolling2
#1: A 1 5 5
#2: A 2 3 8
#3: A 3 7 10
#4: A 4 6 13
#5: B 1 4 4
#6: B 3 6 6
#7: B 4 4 10
Update
As per #Khashaa's suggestion, roll_sum from library(RcppRoll) can be used more effectively, as it works even when a group has fewer rows than 'k'. This way, we can remove the if/else condition from my previous solution. (Full credit to #Khashaa)
library(RcppRoll)
DT[, sales_rolling2 := c(sales[1L], roll_sum(sales, 2)), by = .(indx, item)]
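Why the if/else can be dropped, as far as I understand RcppRoll: roll_sum returns a zero-length vector when the input is shorter than the window, so the concatenation degrades gracefully for one-row groups:
roll_sum(4, 2)        # numeric(0): input shorter than the window
c(4, roll_sum(4, 2))  # a one-row group therefore keeps its single sale
# [1] 4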

Specific removing all duplicates with R [duplicate]

This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Remove all duplicate rows including the "reference" row [duplicate]
(3 answers)
Closed 7 years ago.
For example I have two columns:
Var1 Var2
1 12
1 65
2 68
2 98
3 49
3 24
4 8
5 67
6 12
And I need to display only values which are unique for column Var1:
Var1 Var2
4 8
5 67
6 12
I can do it like this:
mydata=mydata[!unique(mydata$Var1),]
But when I use the same formula on my large data set with about 1 million observations, nothing happens: the sample size stays the same. Could you please explain why?
Thank you!
With data.table (as the question seems to be tagged with it) I would do
indx <- setDT(DT)[, .I[.N == 1], by = Var1]$V1
DT[indx]
# Var1 Var2
# 1: 4 8
# 2: 5 67
# 3: 6 12
Or... as #eddi reminded me, you can simply do
DT[, if(.N == 1) .SD, by = Var1]
Or (per the mentioned duplicates) with data.table v1.9.5 or later you could also do something like
setDT(DT, key = "Var1")[!(duplicated(DT) | duplicated(DT, fromLast = TRUE))]
You can use this:
df <- data.frame(Var1=c(1,1,2,2,3,3,4,5,6), Var2=c(12,65,68,98,49,24,8,67,12) );
df[ave(1:nrow(df),df$Var1,FUN=length)==1,];
## Var1 Var2
## 7 4 8
## 8 5 67
## 9 6 12
This will work even if the Var1 column is not ordered, because ave() collects groups of equal elements (even when they are non-consecutive in the grouping vector) and maps the result of the function call (length() in this case) back to each element that was a member of the group.
Regarding your code, it doesn't work because of what unique() and its negation return:
unique(df$Var1);
## [1] 1 2 3 4 5 6
!unique(df$Var1);
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
As you can see, unique() returns the actual unique values from the argument vector. Negating it returns TRUE for zero and FALSE for everything else.
Thus, you end up row-indexing with a short logical vector (short if unique() removed any duplicates) that is TRUE wherever there was a zero and FALSE otherwise.
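For completeness, the same idea works in base R (a sketch mirroring the data.table line above, using duplicated() from both directions):
keep <- !(duplicated(df$Var1) | duplicated(df$Var1, fromLast=TRUE));
df[keep,];
## Var1 Var2
## 7 4 8
## 8 5 67
## 9 6 12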

Return df with a column's values that occur more than once [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I have a data frame df, and I am trying to subset all rows that have a value in column B occur more than once in the dataset.
I tried using table to do it, but am having trouble subsetting from the table:
t<-table(df$B)
Then I try subsetting it using:
subset(df, table(df$B)>1)
And I get the error
"Error in x[subset & !is.na(subset)] :
object of type 'closure' is not subsettable"
How can I subset my data frame using table counts?
Here is a dplyr solution (using mrFlick's data.frame)
library(dplyr)
newd <- dd %>% group_by(b) %>% filter(n() > 1)
newd
# a b
# 1 1 1
# 2 2 1
# 3 5 4
# 4 6 4
# 5 7 4
# 6 9 6
# 7 10 6
Or, using data.table
setDT(dd)[, if (.N > 1) .SD, by = b]
Or using base R
dd[dd$b %in% unique(dd$b[duplicated(dd$b)]),]
May I suggest an alternative, faster way to do this with data.table?
require(data.table) ## 1.9.2
setDT(df)[, .N, by=B][N > 1L]$B
(or) you can couple .I (another special variable; see ?data.table), which gives the corresponding row numbers in df, with .N as follows:
setDT(df)[df[, .I[.N > 1L], by=B]$V1]
(or) have a look at #mnel's answer for another variation (using yet another special variable, .SD).
Using table() isn't the best because then you have to rejoin it to the original rows of the data.frame. The ave function makes it easier to calculate row-level values for different groups. For example
dd <- data.frame(
  a = 1:10,
  b = c(1, 1, 2, 3, 4, 4, 4, 5, 6, 6)
)
dd[with(dd, ave(b,b,FUN=length))>1, ]
#subset(dd, ave(b,b,FUN=length)>1) #same thing
#     a b
# 1   1 1
# 2   2 1
# 5   5 4
# 6   6 4
# 7   7 4
# 9   9 6
# 10 10 6
Here, for each level of b, ave() computes length(b) within that level, which is just the number of rows in the group, and returns that value back to each row of the group. Then we use it to subset.
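And since the question asked specifically about table() counts, a sketch of making that approach work by mapping the counts back to rows by value:
tab <- table(dd$b)                   # counts per value of b
dd[dd$b %in% names(tab)[tab > 1], ]  # keep rows whose b value occurs more than once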

Transposition with aggregation of data.table [duplicate]

This question already has answers here:
Proper/fastest way to reshape a data.table
(4 answers)
Closed 9 years ago.
Suppose that we have data.table like that:
TYPE KEY VALUE
1: 1 A 10
2: 1 B 10
3: 1 A 40
4: 2 B 20
5: 2 B 40
I need to generate the following aggregated data.table (numbers are sums of values for given TYPE and KEY):
TYPE A B
1: 1 50 10
2: 2 0 60
In a real-life problem there are a lot of different values for KEY, so it's impossible to hardcode them.
How can I achieve that?
One way I could think of is:
# to ensure all levels are present when using `tapply`
DT[, KEY := factor(KEY, levels=unique(KEY))]
DT[, as.list(tapply(VALUE, KEY, sum)), by = TYPE]
# TYPE A B
# 1: 1 50 10
# 2: 2 NA 60
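For reference, the reshape route from the duplicate linked above also yields the 0 (rather than NA) asked for in the expected output; a sketch using data.table's dcast (assuming version >= 1.9.6, where dcast dispatches on data.tables):
dcast(DT, TYPE ~ KEY, value.var = "VALUE", fun.aggregate = sum, fill = 0)
#    TYPE  A  B
# 1:    1 50 10
# 2:    2  0 60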
