Data.table column generated with function taking multiple, cumulative and lagged arguments - r

I'm trying to add a column to a data.table. Each value in that column should come from passing the cumulative (lag 1) vector of values for the group, together with a group-level attribute, to a function that returns the appropriate value.
I have 8M rows, one per agent-day. My function is more complicated than myFun, but the key thing is that it takes two arguments from the agent-day table: a vector (Vector) of values for a particular agent from all days prior to the particular day, and a vector of agent-level attributes (PerAgent) that are all the same per agent.
library(data.table)
library(dplyr)
library(zoo)
DT <- data.table(Agent=LETTERS[1:3],day=c(1,1,1,2,2,2,3,3,3), PerAgent=c(.2,.4,.6),Vector=1:9,Answer=c(NA,NA,NA,.2,.8,1.8,1,2.8,5.4))
myFun = function(Vector,PerAgent){
PerAgent=PerAgent[1]
Answer=PerAgent*sum(Vector)
return(Answer)
}
The "Answer" column is what I'm trying to generate (obviously not manually as I've done here).
What I have right now that doesn't work because I'm trying to pass the second argument is:
DT[,Answer:=lag(rollapplyr(Vector,PerAgent,seq_along(Vector),myFun),1),by=.(Agent)]
If I didn't need to pass the second argument to the (simplified) function, this works:
myFun = function(Vector){
Answer=.1*sum(Vector)
return(Answer)
}
DT[,Answer:=lag(rollapplyr(Vector,seq_along(Vector),myFun),1),by=.(Agent)]
Your help is VERY appreciated.
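One possible way to approach this, sketched in base R rather than with rollapplyr: in the failing call, PerAgent sits in the second position, which rollapplyr treats as the window width. The helper below (lagged_cumulative is a hypothetical name, not part of any package) builds the "all prior days" vector explicitly and passes the group attribute alongside it. It assumes rows are already ordered by day within each agent.

```r
# Simplified function from the question: a group attribute times the sum of prior values.
myFun <- function(Vector, PerAgent) {
  PerAgent[1] * sum(Vector)
}

# Hypothetical helper: for position i, call fun() on values[1:(i - 1)];
# the first position has no history, so it gets NA.
lagged_cumulative <- function(values, attr, fun) {
  n <- length(values)
  prior <- vapply(seq_len(max(n - 1L, 0L)),
                  function(i) fun(values[seq_len(i)], attr),
                  numeric(1))
  c(NA_real_, prior)[seq_len(n)]
}

# Same toy data as in the question, as a plain data.frame.
df <- data.frame(Agent    = rep(LETTERS[1:3], 3),
                 day      = rep(1:3, each = 3),
                 PerAgent = rep(c(.2, .4, .6), 3),
                 Vector   = 1:9)

# Apply the helper per agent by splitting the row indices.
df$Answer <- NA_real_
for (idx in split(seq_len(nrow(df)), df$Agent)) {
  df$Answer[idx] <- lagged_cumulative(df$Vector[idx], df$PerAgent[idx], myFun)
}
```

With data.table the same helper could be called as DT[, Answer := lagged_cumulative(Vector, PerAgent, myFun), by = Agent]. Note that this re-evaluates the function on a growing prefix for every row, so on 8M rows it may be slow if the per-agent histories are long.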

Related

in R, functions applied to vectors vs scalars in data.table

I want to check my understanding of data.table behavior, because I find myself trying to guess how R will behave when using functions on the RHS of the := operator.
Any nuance to my observations is welcome (as well as comments on how this relates to data.table vs. data.frame).
Mostly, I want to figure out how to know when := used with functions will take references to a column in data.table (on the RHS) as a vector of every value in the column versus simply applying the function by row to each value.
library(data.table)
dt <- data.table(x = c(-100,-50,0,50,100))
dt[,y := max(0,x)]
dt[,z := abs(x)]
dt[,sd := sd(x)]
Here, max seems to turn 0 and the vector of all x values into a single vector, then runs on that new vector. More to the point, it seems to take every value in column x, not just the value of x on a given row.
On the other hand, abs returns just the absolute value of the x for that row.
And then sd takes the entire column x as the argument (rather than applying by row).
Do I just have to dig into the documentation for a function to see if it takes an atomic argument vs a vector argument, and assume that if the function takes a vector, it will apply to the entire column (and not apply to each row)? Is that the best way I can do this?
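As a rough illustration of the distinction described above, the same functions can be tried on a plain vector outside of data.table. In j, a column name refers to the whole column (or the whole group when by is used), so what matters is whether the function returns one value or one value per element; a length-1 result is recycled by := to fill the column. pmax is shown here as the element-wise counterpart of max; it does not appear in the question.

```r
x <- c(-100, -50, 0, 50, 100)

# Summary functions collapse their input to a single value.
col_max <- max(0, x)    # one number for the whole vector
col_sd  <- sd(x)        # one number for the whole vector

# Vectorised functions return one value per element.
row_abs <- abs(x)       # same length as x
row_max <- pmax(0, x)   # element-wise maximum of 0 and each x

lengths_seen <- c(max = length(col_max), sd = length(col_sd),
                  abs = length(row_abs), pmax = length(row_max))
```

So dt[, y := pmax(0, x)] would be one way to get a per-row result where max(0, x) gives a single recycled value.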

Order based on multiple columns passed in as an input

I would like to write a function that sorts a given data.frame (which I'll refer to as dataSet) by any number of its columns, whose names are also passed into the function (in a vector which I will refer to as orderList). I know that to order by a single passed in string you can just use
sortDataset <- function(dataSet, sortCol) {
return(dataSet[order(dataSet[[sortCol]]),])
}
and that you can order by multiple passed in strings using
sortDataset <- function(dataSet, sortCol1, sortCol2) {
return(dataSet[order(dataSet[[sortCol1]], dataSet[[sortCol2]]),])
}
with however many sortCol# inputs as I would want. I would, however, like to be able to pass in a list of any number of strings. I tried the following:
dataSet[order(dataSet[[orderList]]),]
dataSet[order(dataSet$orderList),]
dataSet[order(dataSet[,orderList])]
The first two fail because they are simply not valid ways to select multiple columns (I still tried, though), and in the third, order doesn't seem to accept the data frame returned by dataSet[,orderList] as a parameter.
I would like a function as follows:
sortDataset <- function(dataSet, sortCols)
where the first element of sortCols is the column which takes highest priority, then the second column is the first tiebreaker, the third column is the second tiebreaker, etc. and the function returns dataSet sorted appropriately. It would also be nice if I could specify whether each should be ascending in an optional input, so the first column could be sorting ascending, the second sorted descending, etc.
So far, the only method I can really think of is to assume each list only contains numeric values, and then do some multiplying of the various sorting columns by 10^n so that all the columns can be consolidated into one column that maintains the priorities, and then sort by that column. I feel like there should be a better way to do this, though, since this seems like a pretty basic function.
Use do.call:
data[do.call("order", data[sortCols]), ]
where data is a data frame and sortCols is a character vector of column names.
Also have a look at orderBy in the doBy package.
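The do.call idea can be extended to cover the optional ascending/descending request. The following is a minimal sketch, not a definitive implementation: the decreasing argument is an assumption of mine, and xtfrm is used to turn each column into a numeric sort key so that negating it reverses the order for non-numeric columns as well.

```r
# Sort a data frame by the columns named in sortCols (highest priority first).
# decreasing is a hypothetical optional argument: one logical per column,
# recycled if shorter.
sortDataset <- function(dataSet, sortCols, decreasing = FALSE) {
  decreasing <- rep_len(decreasing, length(sortCols))
  keys <- Map(function(col, dec) {
    k <- xtfrm(dataSet[[col]])   # numeric key that sorts like the column
    if (dec) -k else k           # negate the key for descending order
  }, sortCols, decreasing)
  # unname() so that column names cannot clash with order()'s own arguments
  dataSet[do.call(order, unname(keys)), , drop = FALSE]
}

df <- data.frame(g = c("b", "a", "b", "a"), v = c(1, 2, 3, 4))
sorted <- sortDataset(df, c("g", "v"), decreasing = c(FALSE, TRUE))
```

Here sorted is ordered by g ascending, with ties broken by v descending.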
We can do this with tidyverse
library(dplyr)
data %>%
arrange_at(vars(sortCols))
which can be made into a function using either quos/!!!
sortDataset <- function(dataSet, ...) {
stopifnot(rlang::is_quosures(...))
a1 <- c(...)
dataSet %>%
arrange(!!! a1)
}
sortDataset(mtcars, quos(mpg, cyl))
or with arrange_at if we are passing variable as string
sortDataset <- function(dataSet, ...) {
a1 <- c(...)
dataSet %>%
arrange_at(vars(a1))
}
sortDataset(mtcars, "mpg", "cyl")
As @Nettle mentioned in the comments, using arrange_at with group_by can cause some bugs (based on here).

Difference between cbind with dataframe subset or indicating each column separately?

What is the difference between these two lines of codes?
varname1 <- cbind(df.name$var1, df.name$var2, df.name$var3)
varname2 <- cbind(df.name[1:3])
If I then try to use the next function, I get an error: "invalid type (list) for variable 'varname2'".
This is the next function I try to use:
manova(varname ~ indepvar.snack+judge+rep,data = df.name)
So why does varname1 work and varname2 not?
I'm withdrawing my previous answer, as I originally thought you were column-binding a series of columns into a single-column data frame.
Check str(varname1): it is a matrix, while str(varname2) shows a data frame.
manova accepts a matrix-type variable as the response.
Do:
varname2 <- as.matrix(varname2)
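The difference can be checked on a small made-up data frame (the values below are invented for illustration; the column names follow the question). cbind on separate vectors builds a matrix, while cbind on a single data frame returns a data frame, which is a list of columns.

```r
# Toy data standing in for df.name
df.name <- data.frame(var1 = c(1, 2, 3, 4),
                      var2 = c(5, 6, 7, 8),
                      var3 = c(9, 10, 11, 12))

varname1 <- cbind(df.name$var1, df.name$var2, df.name$var3)  # vectors -> matrix
varname2 <- cbind(df.name[1:3])                              # data frame stays a data frame

classes <- c(varname1 = class(varname1)[1],
             varname2 = class(varname2)[1])

# Converting gives a matrix with the same values (plus column names)
varname2_mat <- as.matrix(varname2)
```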

Perform t-test using aggregate function in R

I'm having difficulty using the unpaired t-test and the aggregate function.
Example
dd<-data.frame(names=c("1st","1st","1st","1st","2nd","2nd","2nd","2nd"),a=c(11,12,13,14,2.1,2.2,2.3,2.4),b=c(3.1,3.2,3.3,3.4,3.1,3.2,3.3,3.4))
dd
# Compare all the values in the "a" column that match with "1st" against the values in the "b" column that match "1st".
# Then, do the same thing with those matching "2nd"
t.test(c(11,12,13,14),c(3.1,3.2,3.3,3.4))$p.value
t.test(c(3.1,3.2,3.3,3.4),c(3.1,3.2,3.3,3.4))$p.value
# Also need to replace any errors from t.test that have too low variance with NA
# An example of the type of error I might run into would be if the "b" column was replaced with c(3,3,3,3,3,3,3,3).
For paired data, I found a work around.
# Create Paired data.
data_paired<-dd[,3]-dd[,2]
# Create new t-test so that it doesn't crash upon the first instance of an error.
my_t.test<-function(x){
A<-try(t.test(x), silent=TRUE)
if (is(A, "try-error")) return(NA) else return(A$p.value)
}
# Use aggregate with new t-test.
aggregate(data_paired, by=list(dd$name),FUN=my_t.test)
This aggregate works with a single column of input. However, I can't get it to function when I must have several columns go into the function.
Example:
my_t.test2<-function(x,y){
A<-try(t.test(x,y,paired=FALSE), silent=TRUE)
if (is(A, "try-error")) return(NA) else return(A$p.value)
}
aggregate(dd[,c(2,3)],by=list(dd$name),function(x,y) my_t.test2(dd[,3],dd[,2]))
I had thought that the aggregate function would only send the rows matching the value in the list to the function my_t.test2 and then move onto the next list element. However, the results produced indicate that it is performing a t-test on all values in the column like below. And then placing each of those values in the results.
t.test(dd[,3],dd[,2])$p.value
What am I missing? Is this an issue with the original my_t.test2, an issue with how to structure the aggregate function, or something else? The way I applied it doesn't seem to aggregate.
These are the results I want.
t.test(c(11,12,13,14),c(3.1,3.2,3.3,3.4))$p.value
t.test(c(3.1,3.2,3.3,3.4),c(3.1,3.2,3.3,3.4))$p.value
Note that this is a toy example; the actual data set will have well over 100,000 entries that need to be grouped by the value in the names column, which is why I need the aggregate function.
Thanks for the help.
aggregate isn't the right function to use here because the summary function only works on one column at a time. It's not possible to get both the a and b values simultaneously with this method.
Another way you could approach the problem is to split the data, then apply the t-test to each of the subsets. Here's one implementation:
sapply(
split(dd[-1], dd$names),
function(x) t.test(x[["a"]], x[["b"]])$p.value
)
Here I split dd into a list of subsets, one for each value of names. I use dd[-1] to drop the "names" column from the subsets so I just have a data.frame with two columns: one for a and one for b.
Then, for each subset in the list, I perform a t.test using the a and b columns and extract the p-value. The sapply wrapper calculates this p-value for each subset and returns a named vector of p-values, where the names of the entries correspond to the levels of dd$names:
1st 2nd
6.727462e-04 3.436403e-05
If you wanted to do paired t-test this way, you could do
sapply(
split(dd[-1], dd$names),
function(x) t.test(x[["a"]] - x[["b"]])$p.value
)
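The question also asks for NA whenever t.test fails. A sketch of how the split/sapply approach might be combined with error handling, using tryCatch in place of the try/is pattern from the question (safe_p is a hypothetical helper name):

```r
# Return the unpaired t-test p-value, or NA if t.test() throws an error
# (for example when both samples are essentially constant).
safe_p <- function(x, y) {
  tryCatch(t.test(x, y)$p.value, error = function(e) NA_real_)
}

dd <- data.frame(names = rep(c("1st", "2nd"), each = 4),
                 a = c(11, 12, 13, 14, 2.1, 2.2, 2.3, 2.4),
                 b = c(3.1, 3.2, 3.3, 3.4, 3.1, 3.2, 3.3, 3.4))

# One p-value per level of names
p_values <- sapply(split(dd[-1], dd$names),
                   function(g) safe_p(g[["a"]], g[["b"]]))

# Constant inputs make t.test() stop; the wrapper turns that into NA
p_constant <- safe_p(c(3, 3, 3, 3), c(3, 3, 3, 3))
```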
As @MrFlick said, aggregate is not the right function to do this. Here are some alternatives to the sapply approach, using the dplyr or data.table packages.
require(dplyr)
summarize(group_by(dd, names), t.test(a,b)$p.value)
require(data.table)
data.table(dd)[, t.test(a,b)$p.value, by=names]

first row for non-aggregate functions

I use ddply to avoid redundant calculations.
I am often dealing with values that are conserved within the split subsets, and doing non-aggregate analysis. So to avoid this (a toy example):
ddply(baseball,.(id,year),function(x){paste(x$id,x$year,sep="_")})
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
I have to take the first row of each mini data frame.
ddply(baseball,.(id,year),function(x){paste(x$id[1],x$year[1],sep="_")})
Is there a different approach or a helper I should be using? This syntax seems awkward.
--
Note: paste in my example is just for show - don't take it too literally. Imagine this is actual function:
ddply(baseball,.(id,year),function(x){the_slowest_function_ever(x$id[1],x$year[1])})
You might find data.table a little easier and faster in this case. The equivalent of .() variables is by= :
DT[, { paste(id,year,sep="_") }, by=list(id,year) ]
or
DT[, { do.call("paste",.BY) }, by=list(id,year) ]
I've shown the {} to illustrate you can put any (multi-line) anonymous body in j (rather than a function), but in these simple examples you don't need the {}.
The grouping variables are length 1 inside the scope of each group (which seems to be what you're asking), for speed and convenience. .BY contains the grouping variables in one list object as well, for generic access when the by criteria are decided programmatically on the fly; i.e., when you don't know the by variables in advance.
You could use:
ddply(baseball, .(id, year), function(x){data.frame(paste(x$id,x$year,sep="_"))})
When you return a vector, putting it back together as a data.frame makes each entry a column. But there are different lengths, so they don't all have the same number of columns. By wrapping it in data.frame(), you make sure that your function returns a data.frame that has the column you want rather than relying on the implicit (and in this case, wrong) transformation. Also, you can name the new column easily within this construct.
UPDATE:
Given you only want to evaluate the function once (which is reasonable), then you can just pull the first row out by itself and operate on that.
ddply(baseball, .(id, year), function(x) {
x <- x[1,]
paste(x$id, x$year, sep="_")
})
This will (by itself) have only a single row for each id/year combo. If you want it to have the same number of rows as the original, then you can combine this with the previous idea.
ddply(baseball, .(id, year), function(x) {
firstrow <- x[1,]
data.frame(label=rep(paste(firstrow$id, firstrow$year, sep="_"), nrow(x)))
})
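If the goal is only to call an expensive function once per id/year combination, a different technique that avoids ddply altogether is to compute on the unique key combinations and merge the result back. This is a base R sketch under that assumption; slow_label is a stand-in for the real function and the data are made up.

```r
# Stand-in for an expensive function of the grouping values
slow_label <- function(id, year) paste(id, year, sep = "_")

df <- data.frame(id   = c("a", "a", "b", "b", "b"),
                 year = c(2000, 2000, 2000, 2001, 2001),
                 hits = 1:5)

# One row per distinct id/year pair, so the function runs once per group
keys <- unique(df[c("id", "year")])
keys$label <- mapply(slow_label, keys$id, keys$year, USE.NAMES = FALSE)

# Broadcast the per-group result back onto every original row
out <- merge(df, keys, by = c("id", "year"), sort = FALSE)
```

keys has one row per group, while out has the same number of rows as the original data.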
