first row for non-aggregate functions - r

I use ddply to avoid redundant calculations.
I am often dealing with values that are conserved within the split subsets, and doing non-aggregate analysis. So to avoid this (a toy example):
ddply(baseball,.(id,year),function(x){paste(x$id,x$year,sep="_")})
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
I have to take the first row of each mini data frame.
ddply(baseball,.(id,year),function(x){paste(x$id[1],x$year[1],sep="_")})
Is there a different approach or a helper I should be using? This syntax seems awkward.
--
Note: paste in my example is just for show - don't take it too literally. Imagine this is the actual function:
ddply(baseball,.(id,year),function(x){the_slowest_function_ever(x$id[1],x$year[1])})

You might find data.table a little easier and faster in this case. The equivalent of .() variables is by= :
DT[, { paste(id,year,sep="_") }, by=list(id,year) ]
or
DT[, { do.call("paste",.BY) }, by=list(id,year) ]
I've shown the {} to illustrate you can put any (multi-line) anonymous body in j (rather than a function), but in these simple examples you don't need the {}.
The grouping variables are length 1 inside the scope of each group (which seems to be what you're asking), for speed and convenience. .BY contains the grouping variables in one list object as well, for generic access when the by criteria are decided programmatically on the fly; i.e., when you don't know the by variables in advance.
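To make the "length 1 inside each group" point concrete, here is a minimal sketch on a made-up table (DT and its columns are hypothetical stand-ins for the baseball data, not taken from it). The j expression is evaluated once per id/year group, and .BY gives the same values without naming the grouping columns:

```r
library(data.table)

# Hypothetical toy table standing in for the baseball data
DT <- data.table(id   = c("a", "a", "b"),
                 year = c(2000L, 2000L, 2001L),
                 hits = c(1, 2, 3))

# Inside j, id and year are length 1 for each group, so paste() runs
# once per group; .N is the number of rows in that group.
labels <- DT[, .(label = paste(id, year, sep = "_"), n_rows = .N),
             by = list(id, year)]

# Same labels via .BY, which holds the grouping values as a list
labels_by <- DT[, .(label = do.call(paste, c(.BY, sep = "_"))),
                by = list(id, year)]
```

Both results have one row per group rather than one row per input row.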

You could use:
ddply(baseball, .(id, year), function(x){data.frame(paste(x$id,x$year,sep="_"))})
When you return a vector, putting it back together as a data.frame makes each entry a column. But there are different lengths, so they don't all have the same number of columns. By wrapping it in data.frame(), you make sure that your function returns a data.frame that has the column you want rather than relying on the implicit (and in this case, wrong) transformation. Also, you can name the new column easily within this construct.
UPDATE:
Given you only want to evaluate the function once (which is reasonable), then you can just pull the first row out by itself and operate on that.
ddply(baseball, .(id, year), function(x) {
  x <- x[1,]
  paste(x$id, x$year, sep="_")
})
This will (by itself) have only a single row for each id/year combo. If you want it to have the same number of rows as the original, then you can combine this with the previous idea.
ddply(baseball, .(id, year), function(x) {
  firstrow <- x[1,]
  data.frame(label=rep(paste(firstrow$id, firstrow$year, sep="_"), nrow(x)))
})
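A different technique from ddply, in case it helps: evaluate the expensive function once per distinct id/year combination and then merge the result back onto every row. This is only a base R sketch; slow_label is a hypothetical stand-in for the real function, and df is made-up data:

```r
# Hypothetical stand-in for an expensive function of the grouping values
slow_label <- function(id, year) paste(id, year, sep = "_")

# Hypothetical toy data in place of baseball
df <- data.frame(id   = c("a", "a", "b"),
                 year = c(2000, 2000, 2001),
                 hits = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Evaluate the function once per distinct id/year combination ...
keys <- unique(df[c("id", "year")])
keys$label <- mapply(slow_label, keys$id, keys$year, USE.NAMES = FALSE)

# ... then broadcast the result back to every original row
result <- merge(df, keys, by = c("id", "year"))
```

The function is called as many times as there are groups, and the merged result has the same number of rows as the input.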

Related

how to access rownames in a chained data.table of R

We need rownames sometimes to create a new column that is a function of previous columns but aggregated just for one row (each row). In other words the function is operating across the row.
Consider this:
library(data.table)
library(geosphere)
dt <- data.table(lon=77+rnorm(100),lat=13 + rnorm(100),i.lon=77+rnorm(100),i.lat=13 + rnorm(100))
dt[,dist:=distGeo(p1=c(lon,lat),p2=c(i.lon,i.lat)),by=rownames(dt)] # correct
The last line of code works because the data.table's name, dt, is available inside the square brackets (which in itself does not look quite elegant to me), but that is not always the case.
What if there is a chain of data.tables? Consider this extension of previous example:
dt[lon>77 & lat<12.5][,dist:=distGeo(p1=c(lon,lat),p2=c(i.lon,i.lat)),by=rownames(dt)] # incorrect
Clearly this is an incorrect use as rownames(dt) is a different length than the subsetted data.table passed inside to the next in chain.
I guess my larger question is: Is rownames() the only way to achieve summarisation on each row? If not then the specific question remains: how do we access the data.table inside the by= construct if it is a chained data.table?
Try cbind:
dt <- data.table(lon=77+rnorm(100),lat=13 + rnorm(100),i.lon=77+rnorm(100),i.lat=13 + rnorm(100))
dt[,dist:=distGeo(p1=cbind(lon,lat),p2=cbind(i.lon,i.lat))]
# correct : 100 lines
dt[lon>77 & lat<12.5][,dist:=distGeo(p1=cbind(lon,lat),p2=cbind(i.lon,i.lat))]
# also correct : 16 lines
distGeo is vectorised, so := computes the distance for every row in one call, with no need for grouping or summarisation.
cbind lets you supply the expected n-by-2 lon-lat matrix to the function.
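If the function really is not vectorised, one alternative to rownames(dt) that survives chaining is to create an explicit row id inside the chain and group by it. A minimal sketch under stated assumptions: scalar_dist is a hypothetical function that handles one pair of points at a time (it is not distGeo), and the coordinates are made up:

```r
library(data.table)

# Hypothetical non-vectorised function: handles one pair of points at a time
scalar_dist <- function(p1, p2) sqrt(sum((p1 - p2)^2))

dt <- data.table(lon   = c(77.5, 76.9, 77.2),
                 lat   = c(12.1, 13.0, 12.3),
                 i.lon = c(77.0, 77.0, 77.0),
                 i.lat = c(13.0, 13.0, 13.0))

# Subset, number the surviving rows with .I, then group by that row id
# so the function sees exactly one row per call.
res <- dt[lon > 77 & lat < 12.5][, rowid := .I][
  , dist := scalar_dist(c(lon, lat), c(i.lon, i.lat)), by = rowid]
```

Because rowid is built from the subset itself, its length always matches the table it is used on.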

How do I apply a function to specific columns in a dataframe and replace the original columns?

I have got a large dataframe containing medical data (my.medical.data).
A number of columns contain dates (e.g. hospital admission date), the names of each of these columns end in "_date".
I would like to apply the lubridate::dmy() function to the columns that contain dates and overwrite my original dataframe with the output of this function.
It would be great to have a general solution that can be applied using any function, not just my dmy() example.
Essentially, I want to apply the following to all of my date columns:
my.medical.data$admission_date <- lubridate::dmy(my.medical.data$admission_date)
my.medical.data$operation_date <- lubridate::dmy(my.medical.data$operation_date)
etc.
I've tried this:
date.columns <- select(my.medical.data, ends_with("_date"))
date.names <- names(date.columns)
date.columns <- transmute_at(my.medical.data, date.names, lubridate::dmy)
Now date.columns contains my date columns, in the "Date" format, rather than the original factors. Now I want to replace the date columns in my.medical.data with the new columns in the correct format.
my.medical.data.new <- full_join(x = my.medical.data, y = date.columns)
Now I get:
Error: cannot join a Date object with an object that is not a Date object
I'm a bit of an R novice, but I suspect that there is an easier way to do this (e.g. process the original dataframe directly), or maybe a correct way to join / merge the two dataframes.
As usual it's difficult to answer without an example dataset, but this should do the work:
library(dplyr)
my.medical.data <- my.medical.data %>%
mutate_at(vars(ends_with('_date')), lubridate::dmy)
This will mutate in place each variable whose name ends with '_date', applying the function. It can also apply multiple functions. See ?mutate_at (which is also the help for mutate_if).
Several ways to do that.
If you work with voluminous data, I think data.table is the best approach (will bring you flexibility, speed and memory efficiency)
data.table
You can use := (the update-by-reference operator) together with lapply to apply lubridate::dmy to all columns listed in .SDcols:
library(data.table)
setDT(my.medical.data)
# names of the columns ending in "_date"
cols_to_change <- grep("_date$", colnames(my.medical.data), value = TRUE)
my.medical.data[, c(cols_to_change) := lapply(.SD, lubridate::dmy), .SDcols = cols_to_change]
base R
A standard lapply can also help. You could try something like this (I did not test it), assuming my.medical.data is still a plain data.frame and cols_to_change selects the date columns:
my.medical.data[cols_to_change] <- lapply(my.medical.data[cols_to_change], lubridate::dmy)
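Since the question asks for something that works with any function, the base R idea can be wrapped in a small helper. This is just a sketch: apply_to_suffix is a hypothetical name, the data are made up, and as.Date with a day/month/year format stands in for lubridate::dmy so the example needs no extra packages:

```r
# Hypothetical helper: apply `fun` to every column whose name ends in `suffix`
# and write the results back over the original columns.
apply_to_suffix <- function(df, suffix, fun, ...) {
  cols <- names(df)[endsWith(names(df), suffix)]
  df[cols] <- lapply(df[cols], fun, ...)
  df
}

# Hypothetical toy data
med <- data.frame(patient        = 1:2,
                  admission_date = c("01/02/2020", "15/03/2020"),
                  operation_date = c("03/02/2020", "16/03/2020"),
                  stringsAsFactors = FALSE)

# as.Date with an explicit format stands in for lubridate::dmy here
med <- apply_to_suffix(med, "_date", as.Date, format = "%d/%m/%Y")
```

Columns that do not match the suffix (patient here) are left untouched.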

in R, functions applied to vectors vs scalars in data.table

I want to check my understanding of data.table behavior, because I find myself trying to guess how R will behave when using functions on the RHS of the := operator.
Any nuance to my observations is welcome (as well as comments on how this relates to data.table vs. data.frame).
Mostly, I want to figure out how to know when := used with functions will take references to a column in data.table (on the RHS) as a vector of every value in the column versus simply applying the function by row to each value.
library(data.table)
dt <- data.table(x = c(-100,-50,0,50,100))
dt[,y := max(0,x)]
dt[,z := abs(x)]
dt[,sd := sd(x)]
Here, max seems to combine 0 and the vector of all x values into a single vector, then runs on that new vector. More to the point, it seems to take every value in column x, not just the value of x on a given row.
On the other hand, abs returns just the absolute value of the x for that row.
And then sd takes the entire column x as the argument (rather than applying by row).
Do I just have to dig into the documentation for a function to see if it takes an atomic argument vs a vector argument, and assume that if the function takes a vector, it will apply to the entire column (and not apply to each row)? Is that the best way I can do this?
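For illustration, here is a small sketch of the behaviour described above. Without by, the right-hand side is evaluated once with x bound to the whole column, so the outcome depends on whether the function returns one value per element or a single summary value. The column names y_max, y_pmax and sd_all are made up for this example:

```r
library(data.table)

dt <- data.table(x = c(-100, -50, 0, 50, 100))

# max() collapses c(0, x) to a single value, which := recycles down the column
dt[, y_max := max(0, x)]

# pmax() is the element-wise counterpart: one result per row
dt[, y_pmax := pmax(0, x)]

# sd() is a summary too: one value for the whole column, recycled
dt[, sd_all := sd(x)]
```

Here y_max is 100 on every row, while y_pmax is 0, 0, 0, 50, 100.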

Order based on multiple columns passed in as an input

I would like to write a function that sorts a given data.frame (which I'll refer to as dataSet) by any number of its columns, whose names are also passed into the function (in a vector which I will refer to as orderList). I know that to order by a single passed in string you can just use
sortDataset <- function(dataSet, sortCol) {
  return(dataSet[order(dataSet[[sortCol]]),])
}
and that you can order by multiple passed in strings using
sortDataset <- function(dataSet, sortCol1, sortCol2) {
  return(dataSet[order(dataSet[[sortCol1]], dataSet[[sortCol2]]),])
}
with however many sortCol# inputs as I would want. I would, however, like to be able to pass in a list of any number of strings. I tried the following:
dataSet[order(dataSet[[orderList]]),]
dataSet[order(dataSet$orderList),]
dataSet[order(dataSet[,orderList])]
The first two fail because they are simply not valid ways to select multiple columns (I still tried, though ): ), and the third fails because order doesn't seem to accept the multi-column result of dataSet[,orderList] as a parameter.
I would like a function as follows:
sortDataset <- function(dataSet, sortCols)
where the first element of sortCols is the column which takes highest priority, then the second column is the first tiebreaker, the third column is the second tiebreaker, etc. and the function returns dataSet sorted appropriately. It would also be nice if I could specify whether each should be ascending in an optional input, so the first column could be sorting ascending, the second sorted descending, etc.
So far, the only method I can really think of is to assume each list only contains numeric values, and then do some multiplying of the various sorting columns by 10^n so that all the columns can be consolidated into one column that maintains the priorities, and then sort by that column. I feel like there should be a better way to do this, though, since this seems like a pretty basic function.
Use do.call:
data[do.call("order", data[sortCols]), ]
where data is a data frame and sortCols is a character vector of column names.
Also have a look at orderBy in the doBy package.
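The do.call idea can be extended to cover the optional ascending/descending request. The following is only a sketch: the decreasing argument is an assumption of this example, and negating xtfrm() keys is one possible way to reverse a column's order regardless of its type:

```r
# Sketch: sort by any number of columns, optionally descending per column.
# `decreasing` is a logical vector parallel to `sortCols`.
sortDataset <- function(dataSet, sortCols,
                        decreasing = rep(FALSE, length(sortCols))) {
  stopifnot(length(decreasing) == length(sortCols))
  keys <- Map(function(col, dec) {
    k <- xtfrm(dataSet[[col]])  # numeric key that sorts like the column
    if (dec) -k else k          # negate the key to reverse the order
  }, sortCols, decreasing)
  # unname() so column names cannot clash with order()'s own arguments
  dataSet[do.call(order, unname(keys)), , drop = FALSE]
}

# Hypothetical toy data: g ascending, then v descending within g
df <- data.frame(g = c("b", "a", "a"), v = c(1, 2, 3),
                 stringsAsFactors = FALSE)
sorted <- sortDataset(df, c("g", "v"), decreasing = c(FALSE, TRUE))
```

The first element of sortCols has the highest priority, and later elements act as tiebreakers, as in the plain do.call version.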
We can do this with tidyverse
library(dplyr)
data %>%
arrange_at(vars(sortCols))
which can be made into a function using either quos/!!!
sortDataset <- function(dataSet, ...) {
  stopifnot(rlang::is_quosures(...))
  a1 <- c(...)
  dataSet %>%
    arrange(!!! a1)
}
sortDataset(mtcars, quos(mpg, cyl))
or with arrange_at if we are passing variable as string
sortDataset <- function(dataSet, ...) {
  a1 <- c(...)
  dataSet %>%
    arrange_at(vars(a1))
}
sortDataset(mtcars, "mpg", "cyl")
As @Nettle mentioned in the comments, using arrange_at with group_by can cause some bugs.

R: data.table : searching on multiple columns AND setting data type

Q1:
Is it possible for me to search on two different columns in a data table. I have a 2 million odd row data and I want to have the option to search on either of the two columns. One has names and other has integers.
Example:
x <- data.table(foo=letters,bar=1:length(letters))
x
want to do
x['c'] : searching on foo column
as well as
x[2] : searching on bar column
Q2:
Is it possible to change the default data types in a data table. I am reading in a matrix with both character and integer columns however everything is being read in as a character.
Thanks!
-Abhi
To answer your Q2 first, a data.table is a data.frame, both of which are internally a list. Each column of the data.table (or data.frame) can therefore be of a different class. But you can't do that with a matrix. You can use := to change the class (by reference - no unnecessary copy being made), for example, of "bar" here:
x[, bar := as.integer(as.character(bar))]
For Q1, if you want to use the fast subset (binary search) feature of data.table, you have to set a key using the function setkey.
setkey(x, foo)
allows you to fast-subset on 'foo' alone like: x['a'] (or x[J('a')]). Similarly, setting a key on 'bar' allows you to fast-subset on that column.
If you set the key on both 'foo' and 'bar' then you can provide values for both like so:
setkey(x) # or alternatively setkey(x, foo, bar)
x[J('c', 3)]
However, this subsets only the rows where foo == 'c' and bar == 3. Currently, I don't think there is a way to do an | (OR) operation with fast-subset directly. You'll have to resort to a vector-scan approach in that case.
Hope this is what your question was about. Not sure.
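Putting the pieces together on the example table from the question, as a minimal sketch (the result names by_foo, by_bar and either are made up for illustration):

```r
library(data.table)

x <- data.table(foo = letters, bar = seq_along(letters))

# Key on foo: a character i is looked up in the key by binary search
setkey(x, foo)
by_foo <- x["c"]

# Key on bar: wrap the value in J(), since a bare x[2] selects row 2 by position
setkey(x, bar)
by_bar <- x[J(2L)]

# Either column at once: an ordinary logical condition (vector scan)
either <- x[foo == "c" | bar == 2L]
```

The keyed lookups need the key to be set on the column being searched, while the last form works regardless of the key.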
Your matrix is already of type character. Matrices hold only one data type. Once a key is set on the relevant column, you can try X['c'] and X[J(2)]. You can change data types with X[,col := as.character(col)]
