apply function to all values in data.table subset - r

I have a pairwise table of values, and I'm trying to find the fastest way to apply some function to various subsets of this table. I'm experimenting with data.table to see if it will suit my needs.
For example, I start with this vector of data points, which I convert to a pairwise distance matrix.
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
It looks like this:
> pdist
spA spB spC spD spE spF
spA NA NA NA NA NA NA
spB 6 NA NA NA NA NA
spC 4 2 NA NA NA NA
spD 3 9 7 NA NA NA
spE 1 5 3 4 NA NA
spF 5 1 1 8 4 NA
Converting this table to a data.table
library(data.table)
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
> pdist
rn spA spB spC spD spE spF
1: spA NA NA NA NA NA NA
2: spB 6 NA NA NA NA NA
3: spC 4 2 NA NA NA NA
4: spD 3 9 7 NA NA NA
5: spE 1 5 3 4 NA NA
6: spF 5 1 1 8 4 NA
If I have some subset that I want to extract the values for,
sub <- c('spB', 'spF', 'spD')
I can do the following, which yields the submatrix that I am interested in:
> pdist[.(sub), sub, with=FALSE]
spB spF spD
1: NA NA NA
2: 1 NA 8
3: 9 NA NA
Now, how can I apply a function, for example the mean (but potentially a custom function), to all values in this subset? I can do it this way, but I wonder if there are better ways more in line with data.table manipulation.
> mean(unlist(pdist[.(sub), sub, with=FALSE]), na.rm=TRUE)
[1] 6
UPDATE
Following up on this, I decided to see how different in performance a matrix approach vs a data.table approach would be:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=FALSE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
  return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
  return(mean(unlist(pdistDT[.(sub), sub, with=FALSE]), na.rm=TRUE))
}
> system.time(q1 <- lapply(spSub, function(x) matMethod(pdist, x)))
user system elapsed
18.116 0.154 18.317
> system.time(q2 <- lapply(spSub, function(x) dtMethod(pdistDT, x)))
user system elapsed
795.456 13.357 806.820
It appears that going through the data.table step here is leading to a big performance cost.
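One untested workaround (just a sketch, not benchmarked above) is to keep the data.table for bookkeeping but pull the numeric columns out into a named matrix once, since the cost appears to come from the per-call overhead of data.table's [ rather than from the arithmetic itself:
# Sketch: do the repeated subsetting on a plain named matrix extracted once
pdistMat <- as.matrix(pdistDT[, !"sp"])
rownames(pdistMat) <- pdistDT$sp
hybridMethod <- function(mat, sub) mean(mat[sub, sub], na.rm = TRUE)
qHybrid <- lapply(spSub, function(x) hybridMethod(pdistMat, x))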

Please see the solution posted here for an even more general solution. It may also help:
data.table: transforming subset of columns with a function, row by row
To apply the function, you can do the following:
Part 1. A Step-by-Step Solution
(1.a) Get the data into Data.Table format:
library(data.table)
library(magrittr) #for access to pipe operator
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
(1.b) Then, Get the list of Column Names:
# Get the list of names
sub <- c('spB', 'spF', 'spD')
(1.c) Define the function you want to apply
#Define the function you wish to apply
# Here, normalize is just an example of a custom function:
normalize <- function(X, X.mean = mean(X, na.rm = TRUE), X.sd = sd(X, na.rm = TRUE)) {
  X <- (X - X.mean) / X.sd
  return(X)
}
(1.d) Apply the function:
# Voila:
pdist[, unlist(.SD, use.names = FALSE), .SDcols = sub] %>% normalize()
#Or, you can apply the function inside the [], as below:
pdist[, unlist(.SD, use.names = FALSE) %>% normalize(), .SDcols = sub]
# Or, if you prefer to do it without the pipe operator:
pdist[, normalize(unlist(.SD, use.names = FALSE)), .SDcols = sub]
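If the goal is specifically the mean over both the row subset and the column subset, as in the question, one possibility (a rough sketch along the same lines) is to combine the keyed row subset with .SDcols:
# Sketch: keyed row subset via .(sub), column subset via .SDcols,
# and any summary function in j (here mean, as in the question)
pdist[.(sub), mean(unlist(.SD, use.names = FALSE), na.rm = TRUE), .SDcols = sub]
# [1] 6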
Part 2. Some Advantages for Data.Table approach
Since you seem familiar with the matrix approach, I just wanted to point out some advantages of keeping the data.table approach.
(2.a) Apply functions within group by using the "by ="
One advantage over a matrix is that you can still apply functions within groups by using the "by =" argument.
In the example here, I assume you have a grouping variable called "Grp."
With the by = Grp part, the normalization is now done within each group.
pdist[, unlist(.SD) %>% normalize(), .SDcols = sub, by=Grp]
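For concreteness, here is a small toy sketch (the Grp column and the values are invented for illustration) of the within-group normalization:
# Hypothetical toy data with an assumed grouping column Grp
toy <- data.table(Grp = rep(c("g1", "g2"), each = 3),
                  spB = rnorm(6), spD = rnorm(6), spF = rnorm(6))
# Normalize the pooled values of the selected columns within each group
toy[, normalize(unlist(.SD, use.names = FALSE)), .SDcols = c("spB", "spD", "spF"), by = Grp]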
(2.b) Another advantage is that you can keep other identifying information, for example, if each row has a "participant identifier" P.Id that you wish to keep and repeat:
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
In the first step, done in this portion of the code: pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id]
First, we create a new column called Combined.Data containing the data from all three columns identified in "sub"
Next to each row of the combined data, the appropriate participant ID is repeated in the P.Id column
In the second step, done in this portion of the code:
[,.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
We can create a new column called Transformed to store the normalized values that result from applying the function normalize()
In addition, we can also include the Combined.Data column as well
So, with this single line:
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
we subset columns,
collapse data across the subset,
keep track of the identifier for each datum (P.Id) even when collapsed,
apply a transformation on the entire collapsed data, and
end up with a neat output in the form of a data.table with 3 columns: (1) P.Id, (2) Transformed, & (3) Combined.Data (the original values).
The order(P.Id) call allows the output to appear meaningfully ordered.
The same would be possible with matrix approach, but would be much more cumbersome and take more lines of code.
Data table allows for powerful manipulation and management of data, especially when you start chaining operations together.
(2.c) Finally, if you just wish to keep row information as simple row numbers, you can use the .I feature of the data.table package:
pdist[, .(.I, normalize(unlist(.SD))), .SDcols = sub]
This feature can be quite helpful, especially if you don't have a participant or row identifier that is inherently meaningful.
Part 3. Disadvantage: Time Cost
I re-ran a corrected version of the timing comparison shown above, and the data.table solutions do take significantly longer:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
# pdistDT$sp %<>% as.factor()
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
  return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
  return(pdistDT[sub, sub, with = FALSE] %>%
           unlist(., recursive = FALSE, use.names = FALSE) %>%
           mean(., na.rm = TRUE))
}
dtMethod1 <- function(pdistDT, sub) {
  return(pdistDT[sub, sub, with = FALSE] %>%
           melt.data.table(., measure.vars = sub, na.rm = TRUE) %$%
           mean(value))
}
system.time(q1 <- apply(spSub, MARGIN = 2, function(x) matMethod(pdist, x)))
# user system elapsed
# 2.86 0.00 3.27
system.time(q2 <- apply(spSub, MARGIN = 2, function(x) dtMethod(pdistDT, x)))
# user system elapsed
# 57.20 0.02 57.23
system.time(q3 <- apply(spSub, MARGIN = 2, function(x) dtMethod1(pdistDT, x)))
# user system elapsed
# 62.78 0.06 62.91

Related

Equivalent of row_number for columns dplyr

I am trying to apply a function to the columns of a tibble (or data.frame), where the function depends on the column index. This has come up for me several times, so here is a minimal working example (MWE):
library(tidyverse)
test <- data.frame(a = c(1,2,3), b = c(7,8,9), c = c(3,5,6))
test <- test %>% as_tibble() %>% mutate_all( ~lead(., 2))
This will lead every column by 2 (just an example). But what I want is to lead the first column by 1, the second by 2, and so on; something like mutate_all(~lead(., col_number())).
For this little example, I know one way to do it, like:
test <- as.matrix(test)
for (i in 1:ncol(test)){ test[,i] <- lead(test[,i], i) }
There might be other ways to do it too; I haven't thought about it much (one needs to convert to a matrix first, otherwise it doesn't produce the right result, and I don't really know why).
But I'd like to do it with mutate or apply, in a way that gives access to the column index in general, so that it also works for more complex examples.
Any idea?
One option is using purrr::map2_df to sequentially lead every column based on column number.
purrr::map2_df(test, seq_along(test), dplyr::lead)
# A tibble: 3 x 3
# a b c
# <dbl> <dbl> <dbl>
#1 2 9 NA
#2 3 NA NA
#3 NA NA NA
We can also use base R Map
test[] <- Map(function(x, y) c(tail(x, -y), rep(NA, y)), test, seq_along(test))
We can use data.table shift
library(data.table)
setDT(test)[, Map(shift, .SD, n = 1:3, type = 'lead')]
# a b c
#1: 2 9 NA
#2: 3 NA NA
#3: NA NA NA
Or using purrr
library(purrr)
map2_dfr(test, 1:3, ~shift(.x, n = .y, type = 'lead'))

How can I insert values into a data frame dynamically using R

After scraping some review data from a website, I am having difficulty organizing the data into a useful structure for analysis. The problem is that the data is dynamic, in that each reviewer gave ratings on anywhere between 0 and 3 subcategories (denoted as subcategories "a", "b" and "c"). I would like to organize the reviews so that each row is a different reviewer, and each column is a subcategory that was rated. Where reviewers chose not to rate a subcategory, I would like that missing data to be 'NA'. Here is a simplified sample of the data:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
The vec contains the information on which subcategories were scored, and "stop" marks the end of each reviewer's ratings. As such, I would like to organize the result into a data frame with this structure (expected output):
    a  b  c
1   2  5  1
2   1  3 NA
3  NA NA NA
4  NA NA  2
I would greatly appreciate any help on this, because I've been working on this issue for far longer than it should take me.
@alexis_laz provided what I believe is the best answer:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
stops <- vec == "stop"
i = cumsum(stops)[!stops] + 1L
j = vec[!stops]
tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity) # although mean/sum work
# a b c
#[1,] 2 5 1
#[2,] 1 3 NA
#[3,] NA NA NA
#[4,] NA NA 2
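If a data.frame rather than a matrix is wanted as the final object, the tapply() result can be wrapped directly (a small sketch):
# Sketch: convert the matrix returned by tapply() into a data.frame
res <- tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity)
as.data.frame(res)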
base R, but I'm using a for loop...
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
categories <- unique(vec)[unique(vec)!="stop"]
row = 1
df = data.frame(lapply(categories, function(x){NA_integer_}))
colnames(df) <- categories
rating = 1
for (i in vec) {
  if (i == 'stop') {
    row <- row + 1
  } else {
    df[row, i] <- ratings[[rating]]
    rating <- rating + 1
  }
}
Here is one option
library(data.table)
library(reshape2)
d1 <- as.data.table(melt(split(vec, c(1, head(cumsum(vec == "stop") + 1, -1)))))[
  value != 'stop', ratings := ratings][
  value != 'stop'][, value := as.character(value)][, L1 := as.integer(L1)]
dcast(d1[CJ(value = value, L1 = seq_len(max(L1)), unique = TRUE), on = .(value, L1)],
      L1 ~ value, value.var = 'ratings')[, L1 := NULL][]
# a b c
#1: 2 5 1
#2: 1 3 NA
#3: NA NA NA
#4: NA NA 2
Using base R functions and rbind.fill from plyr or rbindlist from data.table to produce the final object, we can do
# convert vec into a list, split by "stop", dropping final element
temp <- head(strsplit(readLines(textConnection(
  paste(gsub("stop", "\n", vec, fixed = TRUE), collapse = " "))), split = " "), -1)
# remove empty strings, but maintain empty list elements
temp <- lapply(temp, function(x) x[nchar(x) > 0])
# match up appropriate names to the individual elements in the list with setNames
# convert vectors to single row data.frames
temp <- Map(function(x, y) setNames(as.data.frame.list(x), y),
            relist(ratings, skeleton = temp), temp)
# add silly data.frame (single row, single column) for any empty data.frames in list
temp <- lapply(temp, function(x) if(nrow(x) > 0) x else setNames(data.frame(NA), vec[1]))
Now, you can produce the single data.frame (data.table) with either plyr or data.table
# with plyr, returns data.frame
library(plyr)
do.call(rbind.fill, temp)
a b c
1 2 5 1
2 1 3 NA
3 NA NA NA
4 NA NA 2
# with data.table, returns data.table
rbindlist(temp, fill=TRUE)
a b c
1: 2 5 1
2: 1 3 NA
3: NA NA NA
4: NA NA 2
Note that the line prior to the rbinding can be replaced with
temp[lengths(temp) == 0] <- replicate(sum(lengths(temp) == 0),
setNames(data.frame(NA), vec[1]), simplify=FALSE)
where the list items that are empty data frames are replaced using subsetting instead of an lapply over the entire list.

Transform longitudinal table to wide format efficiently in data.table

I am working in R with a long table, stored as a data.table, containing values recorded whenever a variable changes, for variables of numeric and character type. When I want to perform some functions like correlations, regressions, etc., I have to convert the table into wide format and homogenise the timestamp frequency.
I found a way to convert the long table to wide, but I think it is not really efficient, and I would like to know if there is a better, more data.table-native approach.
In the reproducible example below, I include the two options I found to perform the long-to-wide transformation, and in the comments I indicate which parts I believe are not optimal.
library(zoo)
library(data.table)
dt <- data.table(time = 1:6, variable = factor(letters[1:6]),
                 numeric = c(1:3, rep(NA, 3)),
                 character = c(rep(NA, 3), letters[1:3]), key = "time")
print(dt)
print(dt[,lapply(.SD,typeof)])
#option 1
casted<-dcast(dt,time~variable,value.var=c("numeric","character"))
# types are correct, but I got NA filled columns,
# is there an option like drop
# available for columns instead of rows?
print(casted)
print(casted[,lapply(.SD,typeof)])
# This drop looks ugly but I did not figure out a better way to perform it
casted[,names(casted)[unlist(casted[,lapply(lapply(.SD,is.na),all)])]:=NULL]
# I perform a LOCF; I do not know if I could somehow benefit from
# data.table's roll option and avoid the temporary in-memory copy
# of my dataset (this would be the second and minor issue)
casted<-na.locf(casted)
#option2
# taken from http://stackoverflow.com/questions/19253820/how-to-implement-coalesce-efficiently-in-r
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  }, list(...))
}
casted2 <- dcast(dt[, coalesce2(numeric, character), by = c("time", "variable")],
                 time ~ variable, value.var = "V1")
# There are no NA columns, but the types are incorrect and
# it takes more space in a real table (more observations, fewer variables)
print(casted2)
print(casted2[,lapply(.SD,typeof)])
# Again, I am pretty sure there is a prettier way to do this
numericvars <- names(casted2)[!unlist(casted2[, lapply(
  lapply(lapply(.SD, as.numeric), is.na), all)])]
casted2[, (numericvars) := lapply(.SD, as.numeric), .SDcols = numericvars]
# same as option 1, is there a data.table native way to do it?
casted2<-na.locf(casted2)
Any advice/improvement in the process is welcome.
I'd maybe do the char and num tables separately and then rbind:
k = "time"
typecols = c("numeric", "character")
res = rbindlist(fill = TRUE,
  lapply(typecols, function(tc) {
    cols = c(k, tc, "variable")
    dt[!is.na(get(tc)), ..cols][, dcast(.SD, ... ~ variable, value.var = tc)]
  })
)
setorderv(res, k)
res[, setdiff(names(res), k) := lapply(.SD, zoo::na.locf, na.rm = FALSE), .SDcols=!k]
which gives
time a b c d e f
1: 1 1 NA NA NA NA NA
2: 2 1 2 NA NA NA NA
3: 3 1 2 3 NA NA NA
4: 4 1 2 3 a NA NA
5: 5 1 2 3 a b NA
6: 6 1 2 3 a b c
Note that the OP's final result, casted2, differs in that it has all columns as character.
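As a side note on the column drop in option 1 of the question, a possibly cleaner idiom (untested sketch) is:
# Sketch: find columns of `casted` that are entirely NA via a logical matrix
# from is.na(), then remove them by reference
all_na_cols <- names(casted)[colSums(!is.na(casted)) == 0]
if (length(all_na_cols)) casted[, (all_na_cols) := NULL]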

R function to calculate mean/median of top highest values

I have a data frame with 2 columns: one with numeric values and one with a name. The name repeats itself but has different values each time.
Data <- data.frame(
Value = c(1:10),
Name = rep(LETTERS, each=4)[1:10])
I would like to write a function that takes the 3 highest numbers for each name and calculates their mean and median (and, in case there aren't 3 values present, returns NA), and then takes all of the values for each name and calculates mean and median.
My initial attempt looks something like this:
my.mean <- function(x, y) {
  top3.x <- ifelse(x > 3, NA, x)
  return(mean(top3.x), median(top3.x))
}
Any hints on how to improve this will be appreciated.
I would probably recommend by for this.
Something put together really quickly might look like this (if I understood your question correctly):
myFun <- function(indf) {
  do.call(rbind, with(indf, by(Value, Name, FUN=function(x) {
    Vals <- head(sort(x, decreasing=TRUE), 3)
    if (length(Vals) < 3) {
      c(Mean = NA, Median = NA)
    } else {
      c(Mean = mean(Vals), Median = median(Vals))
    }
  })))
}
myFun(Data)
# Mean Median
# A 3 3
# B 7 7
# C NA NA
Note that it is not a very useful function in this form because of how many parameters are hard-coded into the function. It's really only useful if your data is in the form you shared.
Here's a data.table solution, assuming that you don't have any other NAs in your data:
require(data.table) ## 1.9.2+
setDT(Data) ## convert to data.table
Data[order(Name, -Value)][, list(m1=mean(Value[1:3]), m2=median(Value[1:3])), by=Name]
# Name m1 m2
# 1: A 3 3
# 2: B 7 7
# 3: C NA NA
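If the data may itself contain NAs (which the solution above assumes it does not), one possible variant (a sketch) guards on the group size explicitly:
# Sketch: compute the top-3 mean/median only for groups with at least 3 values
Data[order(Name, -Value)][, .(
  m1 = if (.N >= 3) mean(Value[1:3]) else NA_real_,
  m2 = if (.N >= 3) as.numeric(median(Value[1:3])) else NA_real_
), by = Name]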
Using dplyr
library(dplyr)
myFun1 <- function(dat){
  dat %>%
    group_by(Name) %>%
    arrange(desc(Value)) %>%
    mutate(n = n(), Value = ifelse(n < 3, NA_integer_, Value)) %>%
    summarize(Mean = mean(head(Value, 3)), Median = median(head(Value, 3)))
}
myFun1(Data)
#Source: local data frame [3 x 3]
# Name Mean Median
#1 A 3 3
#2 B 7 7
#3 C NA NA

Replace NA with mean matching the same ID

I have a data frame:
id <- c(rep(1, 4), rep(2, 3), rep(3, 2), 4)
rate <- c(rep(1, 3), NA, 0.5, 0.6, NA, 0.7, NA, NA)
df <- data.frame(id, rate)
and I need to replace the NAs using the logic in the following for loop:
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1)
    if (is.nan(mrate)) {
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}
Apparently the for loop is simply too slow on a big data frame with >200K rows. How can I do this much faster without using a for loop?
Thanks!
This is a solution using data.tables:
library(data.table)
dt <- data.table( df, key = "id" )
dt[ , rate := ifelse( is.na(rate), round( mean(rate, na.rm=TRUE), 1), rate ), by = id ]
dt[ is.na(rate), rate := 1 ]
dt
id rate
1: 1 1.0
2: 1 1.0
3: 1 1.0
4: 1 1.0
5: 2 0.5
6: 2 0.6
7: 2 0.6
8: 3 0.7
9: 3 0.7
10: 4 1.0
I am not sure, though, whether the ifelse could or should be avoided.
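One way to sidestep the ifelse (just a sketch, starting from a fresh copy of df) is to update only the NA entries within each group using base R's replace():
# Sketch: fill NAs with the rounded group mean in a single grouped update;
# ids whose rates are all NA become NaN here and are set to 1 afterwards
dt2 <- data.table(df, key = "id")
dt2[, rate := replace(rate, is.na(rate), round(mean(rate, na.rm = TRUE), 1)), by = id]
dt2[is.na(rate), rate := 1]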
As mentioned in my comment, for loops in R are not specifically slow. However, often a for loop indicates other inefficiencies in code. In this case, the subset operation that is repeated for each row to determine the mean is most likely the slowest bit of code.
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1)  ## This line!
    if (is.nan(mrate)) {
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}
If instead, these group averages are determined before hand, the loop can do a rapid lookup.
foo <- aggregate(df$rate, list(df$id), mean, na.rm=TRUE)
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- foo$x[foo$Group.1 == df$id[i]]
    ...
However, I am still doing a subset at df$id[i] on the large data.frame for every row. Instead, using one of the tools that implements a split-apply-combine strategy is a good idea. Also, let's write a function that takes a per-group chunk of the data and a pre-computed group average and does the right thing:
myfun <- function(DF) {
  avg <- avgs$rate[avgs$id == unique(DF$id)]
  if (is.nan(avg)) {
    avg <- 1
  }
  DF$rate[is.na(DF$rate)] <- avg
  return(DF)
}
The plyr version:
library(plyr)
avgs <- ddply(df, .(id), summarise, rate=mean(rate, na.rm=TRUE))
result <- ddply(df, .(id), myfun)
And the likely much faster data.table version:
library(data.table)
DT <- data.table(df)
setkey(DT, id)
DT[, avg := mean(rate, na.rm=TRUE), by=id]
DT[is.nan(avg), avg := 1]
DT[, rate := ifelse(is.na(rate), avg, rate)]
This way, we've avoided all of the lookup subsetting by instead adding a pre-calculated column, and we can now do row-wise operations, which are fast and efficient. The extra column can be dropped inexpensively using:
DT[, avg := NULL]
The whole shebang can be written into a function or a data.table expression. But, IMO, that often comes at the expense of clarity!
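As a further aside, assuming a reasonably recent data.table (roughly 1.12.4 or later), fcoalesce() can stand in for the ifelse in the last step (sketch):
# Sketch: fcoalesce() returns the first non-NA value element-wise,
# so NA rates fall back to the pre-computed group average
DT[, rate := fcoalesce(rate, avg)]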
I'm not sure this exactly answers the OP's question, but for others who read this later, there is a different and often much faster way to perform calculations on a subset of data without actually subsetting it: vector math. The engineers in the crowd will know what I'm talking about.
Instead of subsetting, use a fast comparison to build an indicator ("identity") vector and multiply the data by that indicator.
Now, this isn't faster for all cases. There are cases where vectorized functions are actually slower than item-explicit functions, and it all depends on your specific application. [Insert O-notation rant of your choice here.]
Here is how we would do a vector math implementation for this case:
# Create the NA identity vector.
na_identity <- is.na(df$rate)
# Initialize the final data frame.
# This is for non-destructive purposes.
df_revised <- df
# Replace all NA occurrences in final
# data frame with zero values.
df_revised$rate[na_identity] <- 0
# Loop through each unique [id]
# value in the data.
# Create an identity vector for the
# current ID, calculate the mean
# rate for that ID (replacing NaN with 1),
# and insert the mean for any NA values
# associated with that ID.
for (i in unique(df$id)) {
  id_identity <- df$id == i
  id_mean <- sum(df_revised$rate * id_identity * !na_identity) / sum(id_identity * !na_identity)
  if (is.nan(id_mean)) {id_mean <- 1}
  df_revised$rate <- df_revised$rate + id_mean * id_identity * na_identity
}
# id rate
# 1 1 1.00
# 2 1 1.00
# 3 1 1.00
# 4 1 1.00
# 5 2 0.50
# 6 2 0.60
# 7 2 0.55
# 8 3 0.70
# 9 3 0.70
# 10 4 1.00
From a vector math perspective, this code is easily readable. In this small example, the code is very fast, but the cycle time increases directly with the number of unique ID values. I'm not sure if this is the correct approach for the OP's larger application, but the solution is workable and theoretically sound and removes the need for complex and difficult to read logic blocks.
