In R, functions applied to vectors vs. scalars in data.table

I want to check my understanding of data.table behavior, because I find myself trying to guess how R will behave when using functions on the RHS of the := operator.
Any nuance to my observations is welcome (as well as comments on how this relates to data.table vs. data.frame).
Mostly, I want to figure out how to know when := used with functions will take references to a column in data.table (on the RHS) as a vector of every value in the column versus simply applying the function by row to each value.
library(data.table)
dt <- data.table(x = c(-100,-50,0,50,100))
dt[,y := max(0,x)]
dt[,z := abs(x)]
dt[,sd := sd(x)]
Here, max seems to combine 0 and the vector of all x values into a single vector, then runs on that new vector, so every row of y gets the same value. More to the point, it seems to take every value in column x, not just the value of x on a given row.
On the other hand, abs returns just the absolute value of the x for that row.
And then sd takes the entire column x as the argument (rather than applying by row).
Do I just have to dig into the documentation for a function to see if it takes an atomic argument vs a vector argument, and assume that if the function takes a vector, it will apply to the entire column (and not apply to each row)? Is that the best way I can do this?
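A minimal sketch of the distinction, using the same dt as above. The column names y_max, y_pmax and y_byrow are made up for illustration. In j, x is always the whole column (or the whole group when by= is used); what differs is whether the function returns one value per input element or a single summary value, which := then recycles.

```r
library(data.table)

dt <- data.table(x = c(-100, -50, 0, 50, 100))

# max() is a summary function: it collapses c(0, x) to one number,
# and := recycles that single number down the whole column.
dt[, y_max := max(0, x)]

# pmax() is the element-wise ("parallel") counterpart: 0 is recycled
# against x and one maximum is returned per row.
dt[, y_pmax := pmax(0, x)]

# A quick check outside the table: compare output length with input length.
len_max  <- length(max(0, dt$x))   # length 1, so := recycles it
len_pmax <- length(pmax(0, dt$x))  # same length as x, so it lines up by row

# To force per-row evaluation of a summary function, group by row number.
dt[, y_byrow := max(0, x), by = seq_len(nrow(dt))]
```

So checking whether a function is vectorized (returns something the same length as its input) or a summary (returns length 1) is the relevant test; abs behaves like pmax here, sd behaves like max.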

Related

how to access rownames in a chained data.table of R

We need rownames sometimes to create a new column that is a function of previous columns but aggregated just for one row (each row). In other words the function is operating across the row.
Consider this:
library(data.table)
library(geosphere)
dt <- data.table(lon=77+rnorm(100),lat=13 + rnorm(100),i.lon=77+rnorm(100),i.lat=13 + rnorm(100))
dt[,dist:=distGeo(p1=c(lon,lat),p2=c(i.lon,i.lat)),by=rownames(dt)] # correct
The second line of code works fine as the data.table name dt is available inside the square brackets (which in itself does not look quite elegant to me), but not always.
What if there is a chain of data.tables? Consider this extension of previous example:
dt[lon>77 & lat<12.5][,dist:=distGeo(p1=c(lon,lat),p2=c(i.lon,i.lat)),by=rownames(dt)] # incorrect
Clearly this is an incorrect use as rownames(dt) is a different length than the subsetted data.table passed inside to the next in chain.
I guess my larger question is: Is rownames() the only way to achieve summarisation on each row? If not then the specific question remains: how do we access the data.table inside the by= construct if it is a chained data.table?
Try cbind:
dt <- data.table(lon=77+rnorm(100),lat=13 + rnorm(100),i.lon=77+rnorm(100),i.lat=13 + rnorm(100))
dt[,dist:=distGeo(p1=cbind(lon,lat),p2=cbind(i.lon,i.lat))]
# correct : 100 lines
dt[lon>77 & lat<12.5][,dist:=distGeo(p1=cbind(lon,lat),p2=cbind(i.lon,i.lat))]
# also correct : 16 lines
distGeo is vectorized over the rows of its inputs, so := fills the column row by row without any need for grouping or summarization.
cbind supplies the expected n*2 lon-lat matrix to the function.
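When the function is not vectorized and a per-row group really is needed, one option is to create the row id inside the chain rather than reaching back to rownames(dt). The sketch below uses made-up columns a, b, c and a made-up row-wise summary (the spread across the three columns) instead of geosphere, so that it stays self-contained; rid and spread are hypothetical names.

```r
library(data.table)

set.seed(1)
dt <- data.table(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# .I is evaluated on the table it is called on, so after the subset
# it numbers the rows of the subset, not of the original dt.
sub <- dt[a > 0][, rid := .I][
  , spread := diff(range(c(a, b, c))), by = rid]
```

Because rid is built after the filter, its length always matches the table passed along the chain.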

r: data.table, most efficient row wise normalization

This code normalizes each value in each row (all values end up between -1 and 1).
dt <- setDT(knime.in)
df <-as.data.frame(t(apply(dt[,-1], 1, function(x) x / sum(x) )))
df1<-cbind(knime.in$Majors_Final,df)
BUT
It is not dynamic. The code "knows" that the String categorical variable is in column one and removes it before running the calculations
It seems old school and I suspect it does not make full use of data.table's referencing memory allocations.
QUESTIONS
How do I use the most memory efficient data.table code to achieve the row wise normalization?
How do I exclude all is.character() columns (or include only is.numeric), if I do not know the position or name of these columns?
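One possible approach, sketched on a small made-up table (the columns major, x and y are assumptions standing in for knime.in): pick the numeric columns by type, compute the row totals once, and overwrite those columns by reference with :=.

```r
library(data.table)

dt <- data.table(major = c("bio", "chem"), x = c(1, 3), y = c(3, 1))

# Select columns by type, so neither position nor name has to be known.
num_cols <- names(dt)[vapply(dt, is.numeric, logical(1))]

# Row totals over the numeric columns only.
row_tot <- dt[, rowSums(.SD), .SDcols = num_cols]

# Divide each numeric column by the row totals, updating dt in place.
dt[, (num_cols) := lapply(.SD, function(col) col / row_tot),
   .SDcols = num_cols]
```

This works column by column rather than with apply over rows, which avoids the transpose and the conversion back to a data.frame; character columns are left untouched.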

Data.table column generated with function taking multiple, cumulative and lagged arguments

I'm trying to add a column to a data.table, where that column of the data.table is populated by passing the cumulative (lag 1) vector of values per group as well as a group-level attribute to a function, and then returning the appropriate value.
I have 8M rows, one per agent-day. My function is more complicated than myFun, but the key thing is that it takes two arguments from the agent-day table: a vector (Vector) of values for a particular agent from all days prior to the particular day, and a vector of agent-level attributes (PerAgent) that are all the same per agent.
library(data.table)
library(dplyr)
library(zoo)
DT <- data.table(Agent=LETTERS[1:3],day=c(1,1,1,2,2,2,3,3,3), PerAgent=c(.2,.4,.6),Vector=1:9,Answer=c(NA,NA,NA,.2,.8,1.8,1,2.8,5.4))
myFun = function(Vector,PerAgent){
PerAgent=PerAgent[1]
Answer=PerAgent*sum(Vector)
return(Answer)
}
The "Answer" column is what I'm trying to generate (obviously not manually as I've done here).
What I have right now that doesn't work because I'm trying to pass the second argument is:
DT[,Answer:=lag(rollapplyr(Vector,PerAgent,seq_along(Vector),myFun),1),by=.(Agent)]
If I didn't need to pass the second argument to the (simplified) function, this works:
myFun = function(Vector){
Answer=.1*sum(Vector)
return(Answer)
}
DT[,Answer:=lag(rollapplyr(Vector,seq_along(Vector),myFun),1),by=.(Agent)]
Your help is VERY appreciated.
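One possible approach, sketched under the assumption that rows are already sorted by day within each agent: loop over the row index inside each group and hand myFun the values from all earlier rows, together with the group's PerAgent. It uses only data.table (no rollapplyr or dplyr::lag); the column name Fast is made up for the comparison at the end.

```r
library(data.table)

DT <- data.table(
  Agent    = rep(LETTERS[1:3], times = 3),
  day      = rep(1:3, each = 3),
  PerAgent = rep(c(0.2, 0.4, 0.6), times = 3),
  Vector   = 1:9
)

myFun <- function(Vector, PerAgent) PerAgent[1] * sum(Vector)

# For row i of each agent, call myFun on the values from rows 1..(i - 1);
# the first row has no history, so it gets NA.
DT[, Answer := vapply(
  seq_len(.N),
  function(i) {
    if (i == 1L) NA_real_ else myFun(Vector[seq_len(i - 1L)], PerAgent)
  },
  numeric(1)
), by = Agent]

# For this simplified myFun only, a lagged cumulative sum gives the same
# result without re-summing the history for every row.
DT[, Fast := PerAgent[1] * shift(cumsum(Vector)), by = Agent]
```

The general version re-evaluates the function on a growing prefix for every row, so on 8M rows it may be slow if individual agents have long histories; where the real function can be expressed with cumulative operations, the second form is likely to scale better.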

R: data.table : searching on multiple columns AND setting data type

Q1:
Is it possible for me to search on two different columns in a data.table? I have a data set of roughly 2 million rows and I want to have the option to search on either of the two columns. One has names and the other has integers.
Example:
x <- data.table(foo=letters,bar=1:length(letters))
x
want to do
x['c'] : searching on foo column
as well as
x[2] : searching on bar column
Q2:
Is it possible to change the default data types in a data.table? I am reading in a matrix with both character and integer columns; however, everything is being read in as character.
Thanks!
-Abhi
To answer your Q2 first, a data.table is a data.frame, both of which are internally a list. Each column of the data.table (or data.frame) can therefore be of a different class. But you can't do that with a matrix. You can use := to change the class (by reference - no unnecessary copy being made), for example, of "bar" here:
x[, bar := as.integer(as.character(bar))]
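A small sketch of the matrix case from Q2, with a made-up three-row matrix m: the matrix can hold only one type, so both columns arrive as character, and := then converts bar in place.

```r
library(data.table)

# A character matrix: the numbers are stored as strings.
m <- matrix(c("a", "b", "c", "1", "2", "3"), ncol = 2,
            dimnames = list(NULL, c("foo", "bar")))

x <- as.data.table(m)
classes_before <- vapply(x, class, character(1))  # both columns character

# Convert bar to integer by reference; foo is left as it is.
x[, bar := as.integer(bar)]
classes_after <- vapply(x, class, character(1))
```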
For Q1, if you want to use the fast subset (binary search) feature of data.table, then you have to set a key, using the function setkey.
setkey(x, foo)
allows you to fast-subset on 'x' alone like: x['a'] (or x[J('a')]). Similarly setting a key on 'bar' allows you to fast-subset on that column.
If you set the key on both 'foo' and 'bar' then you can provide values for both like so:
setkey(x) # or alternatively setkey(x, foo, bar)
x[J('c', 3)]
However, this will subset the rows where foo == 'c' and bar == 3. Currently, I don't think there is a way to do a | operation with fast-subset directly. You'll have to resort to a vector-scan approach in that case.
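As an addition to the answer above, and assuming data.table 1.9.6 or later where the on= argument is available, either column can be searched without switching the key back and forth. The result names by_foo, by_bar and either are made up for illustration.

```r
library(data.table)

x <- data.table(foo = letters, bar = seq_along(letters))

# Look up by foo, naming the join column explicitly.
by_foo <- x[.("c"), on = "foo"]

# Look up by bar on the same table, no setkey() call needed.
by_bar <- x[.(2L), on = "bar"]

# The "either column" case as a plain logical (vector-scan) subset.
either <- x[foo == "c" | bar == 2L]
```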
Hope this is what your question was about. Not sure.
Your matrix is already character. Matrices hold only one data type. You can try x['c'] and x[J(2)]. You can change data types with x[,col := as.character(col)]

first row for non-aggregate functions

I use ddply to avoid redundant calculations.
I am often dealing with values that are conserved within the split subsets, and doing non-aggregate analysis. So to avoid this (a toy example):
ddply(baseball,.(id,year),function(x){paste(x$id,x$year,sep="_")})
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
I have to take the first row of each mini data frame.
ddply(baseball,.(id,year),function(x){paste(x$id[1],x$year[1],sep="_")})
Is there a different approach or a helper I should be using? This syntax seems awkward.
--
Note: paste in my example is just for show - don't take it too literally. Imagine this is actual function:
ddply(baseball,.(id,year),function(x){the_slowest_function_ever(x$id[1],x$year[1])})
You might find data.table a little easier and faster in this case. The equivalent of .() variables is by= :
DT[, { paste(id,year,sep="_") }, by=list(id,year) ]
or
DT[, { do.call("paste",.BY) }, by=list(id,year) ]
I've shown the {} to illustrate you can put any (multi-line) anonymous body in j (rather than a function), but in these simple examples you don't need the {}.
The grouping variables are length 1 inside the scope of each group (which seems to be what you're asking), for speed and convenience. .BY contains the grouping variables in one list object as well, for generic access when the by criteria are decided programmatically on the fly; i.e., when you don't know the by variables in advance.
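A small self-contained sketch of that behaviour, using a made-up three-row table in place of baseball (the columns id, year, v and the output column label are assumptions for illustration):

```r
library(data.table)

DT <- data.table(id   = c("a", "a", "b"),
                 year = c(2001, 2001, 2002),
                 v    = 1:3)

# id and year are length 1 inside each group, so paste() runs once per
# group and returns one row per id/year combination.
res1 <- DT[, .(label = paste(id, year, sep = "_")), by = .(id, year)]

# The same result through .BY, without naming the grouping columns in j.
res2 <- DT[, .(label = do.call(paste, c(.BY, sep = "_"))), by = .(id, year)]
```

Both results have two rows, one for each group, so the function in j is evaluated once per group rather than once per original row.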
You could use:
ddply(baseball, .(id, year), function(x){data.frame(paste(x$id,x$year,sep="_"))})
When you return a vector, putting it back together as a data.frame makes each entry a column. But there are different lengths, so they don't all have the same number of columns. By wrapping it in data.frame(), you make sure that your function returns a data.frame that has the column you want rather than relying on the implicit (and in this case, wrong) transformation. Also, you can name the new column easily within this construct.
UPDATE:
Given you only want to evaluate the function once (which is reasonable), then you can just pull the first row out by itself and operate on that.
ddply(baseball, .(id, year), function(x) {
x <- x[1,]
paste(x$id, x$year, sep="_")
})
This will (by itself) have only a single row for each id/year combo. If you want it to have the same number of rows as the original, then you can combine this with the previous idea.
ddply(baseball, .(id, year), function(x) {
firstrow <- x[1,]
data.frame(label=rep(paste(firstrow$id, firstrow$year, sep="_"), nrow(x)))
})
