I have a data.table of factor columns, and I want to pull out the label of the last non-missing value in each row. It's kind of a typical max.col situation, but I don't want to coerce needlessly, as I am trying to optimize this code using data.table. The real data has other types of columns as well.
Here is the example:
## Some sample data
set.seed(0)
dat <- sapply(split(letters[1:25], rep.int(1:5, 5)), sample, size=8, replace=TRUE)
dat[upper.tri(dat)] <- NA
dat[4:5, 4:5] <- NA # the real data isn't nice and upper.triangular
dat <- data.frame(dat, stringsAsFactors = TRUE) # factor columns
## So, it looks like this
setDT(dat)[]
# X1 X2 X3 X4 X5
# 1: u NA NA NA NA
# 2: f q NA NA NA
# 3: f b w NA NA
# 4: k g h NA NA
# 5: u b r NA NA
# 6: f q w x t
# 7: u g h i e
# 8: u q r n t
## I just want to get the labels of the factors
## that are 'rightmost' in each row. I tried a number of things
## that probably don't make sense here.
## This just about gets the column index
dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)]
This is the goal though, to extract these labels, here using regular base functions.
## Using max.col and a data.frame
df1 <- as.data.frame(dat)
inds <- max.col(is.na(as.matrix(df1)), ties="first")-1
inds[inds==0] <- ncol(df1)
df1[cbind(1:nrow(df1), inds)]
# [1] "u" "q" "w" "h" "r" "t" "e" "t"
Here's another way:
dat[, res := NA_character_]
for (v in rev(names(dat))[-1]) dat[is.na(res), res := get(v)]
X1 X2 X3 X4 X5 res
1: u NA NA NA NA u
2: f q NA NA NA q
3: f b w NA NA w
4: k g h NA NA h
5: u b r NA NA r
6: f q w x t t
7: u g h i e e
8: u q r n t t
Benchmarks
Using the same data as @alexis_laz and making (apparently) superficial changes to the functions, I see different results. I'm showing them here in case anyone is curious. Alexis' answer (with small modifications) still comes out ahead.
Functions:
alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]]))){
if(!length(wh)) return(ans)
ans[wh] = as.character(x[[length(x)]])[wh]
Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}
alex2 = function(x){
x[, res := NA_character_]
wh = x[, .I]
for (v in (length(x)-1):1){
if (!length(wh)) break
set(x, j="res", i=wh, v = x[[v]][wh])
wh = wh[is.na(x$res[wh])]
}
x$res
}
frank = function(x){
x[, res := NA_character_]
for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
return(x$res)
}
frank2 = function(x){
x[, res := NA_character_]
for(v in rev(names(x))[-1]) x[is.na(res), res := .SD, .SDcols=v]
x$res
}
Example data and benchmark:
DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)),
function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
DAT3 = as.list(copy(DAT1))
DAT4 = copy(DAT1)
library(microbenchmark)
microbenchmark(frank(DAT1), frank2(DAT2), alex(DAT3), alex2(DAT4), times = 30)
Unit: milliseconds
expr min lq mean median uq max neval
frank(DAT1) 850.05980 909.28314 985.71700 979.84230 1023.57049 1183.37898 30
frank2(DAT2) 88.68229 93.40476 118.27959 107.69190 121.60257 346.48264 30
alex(DAT3) 98.56861 109.36653 131.21195 131.20760 149.99347 183.43918 30
alex2(DAT4) 26.14104 26.45840 30.79294 26.67951 31.24136 50.66723 30
Another idea, similar to Frank's, that tries (1) to avoid subsetting 'data.table' rows (which I assume must have some cost) and (2) to avoid checking a vector of length nrow(dat) for NAs in every iteration:
alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]])))
{
if(!length(wh)) return(ans)
ans[wh] = as.character(x[[length(x)]])[wh]
Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}
alex(as.list(dat)) #had some trouble with 'data.table' subsetting
# [1] "u" "q" "w" "h" "r" "t" "e" "t"
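The two ideas above (index the columns directly instead of subsetting the table, and only re-check the rows that are still unresolved) can also be written as a plain loop. This is a minimal sketch rather than the exact code from the answer; the function name last_non_na and the three-row example list are made up for illustration.

```r
# Hypothetical helper: walk the columns from right to left and keep an
# index vector `wh` of the rows that have not yet found a non-NA value.
last_non_na <- function(cols) {
  n   <- length(cols[[1L]])
  ans <- rep(NA_character_, n)
  wh  <- seq_len(n)                      # rows still unresolved
  for (j in rev(seq_along(cols))) {
    if (!length(wh)) break               # every row is resolved
    ans[wh] <- as.character(cols[[j]])[wh]
    wh <- wh[is.na(ans[wh])]             # shrink to rows that are still NA
  }
  ans
}

# Made-up example data: factor columns stored in a plain list
cols <- list(X1 = factor(c("u", "f", "f")),
             X2 = factor(c(NA, "q", "b")),
             X3 = factor(c(NA, NA, "w")))
last_non_na(cols)
# [1] "u" "q" "w"
```

Because wh only ever shrinks, each pass touches fewer rows than the previous one, which is the point of idea (2).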
And to compare with Frank's:
frank = function(x)
{
x[, res := NA_character_]
for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
return(x$res)
}
DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)),
function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
microbenchmark::microbenchmark(alex(as.list(DAT1)),
{ frank(DAT2); DAT2[, res := NULL] },
times = 30)
#Unit: milliseconds
# expr min lq median uq max neval
# alex(as.list(DAT1)) 102.9767 108.5134 117.6595 133.1849 166.9594 30
# { frank(DAT2) DAT2[, `:=`(res, NULL)] } 1413.3296 1455.1553 1497.3517 1540.8705 1685.0589 30
identical(alex(as.list(DAT1)), frank(DAT2))
#[1] TRUE
Here is a one liner base R approach:
sapply(split(dat, seq(nrow(dat))), function(x) tail(x[!is.na(x)],1))
# 1 2 3 4 5 6 7 8
#"u" "q" "w" "h" "r" "t" "e" "t"
We convert the 'data.frame' to 'data.table' and create a row id column (setDT(df1, keep.rownames=TRUE)). We reshape the 'wide' to 'long' format with melt. Grouped by 'rn', if there is no NA element in 'value' column, we get the last element of 'value' (value[.N]) or else, we get the element before the first NA in the 'value' to get the 'V1' column, which we extract ($V1).
melt(setDT(df1, keep.rownames=TRUE), id.var='rn')[,
if(!any(is.na(value))) value[.N]
else value[which(is.na(value))[1]-1], by = rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"
In case the data is already a data.table:
dat[, rn := 1:.N]#create the 'rn' column
melt(dat, id.var='rn')[, #melt from wide to long format
if(!any(is.na(value))) value[.N]
else value[which(is.na(value))[1]-1], by = rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"
Here is another option
dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)][
, as.character(.SD[[.BY[[1]]]]), by=colInd]
Or, as @Frank mentioned in the comments, we can use na.rm=TRUE from melt to make it more compact:
melt(dat[, r := .I], id="r", na.rm=TRUE)[, value[.N], by=r]
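For readers who want to see what the melt step does, here is a small sketch on made-up data. It assumes the data.table package is installed; the column names X1 to X3, the id column rn, and the result column last are illustrative, and character columns are used to keep the example simple.

```r
library(data.table)

# Made-up three-row table with trailing NAs
dt <- data.table(X1 = c("u", "f", "f"),
                 X2 = c(NA, "q", "b"),
                 X3 = c(NA, NA, "w"))
dt[, rn := .I]                                   # row id to group by later

# Long format: one row per (rn, variable) pair, NA values dropped
long <- melt(dt, id.vars = "rn", na.rm = TRUE)

# Within each rn the rows keep the original column order,
# so the last row of each group is the rightmost non-missing value
res <- long[, .(last = value[.N]), by = rn]
res$last
# [1] "u" "q" "w"
```

Dropping the NAs during the melt is what removes the need for the if/else branch used in the longer version.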
I'm not sure how to improve upon @alexis's answer beyond what @Frank has already done, but your original approach with base R wasn't too far off from something that is reasonably performant.
Here's a variant of your approach that I liked because (1) it's reasonably quick and (2) it doesn't require too much thought to figure out what's going on:
as.matrix(dat)[cbind(1:nrow(dat), max.col(!is.na(dat), "last"))]
The most expensive part of this seems to be the as.matrix(dat) call, but otherwise it seems to be faster than the melt approach that @akrun shared.
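To make the max.col step easier to follow, here is a small sketch on a made-up character matrix m. It also shows two cases the sample data does not contain: an NA in the middle of a row and a row that is entirely NA.

```r
# Made-up matrix; rows 3 and 4 are edge cases
m <- rbind(c("u", NA,  NA),
           c("f", "q", NA),
           c("f", NA,  "w"),   # NA in the middle of the row
           c(NA,  NA,  NA))    # nothing to find

# !is.na(m) is TRUE where a value exists; with ties.method = "last",
# max.col returns the rightmost TRUE in each row
last_col <- max.col(!is.na(m), ties.method = "last")
last_col
# [1] 1 2 3 3

# Matrix indexing with (row, column) pairs pulls out one cell per row
m[cbind(seq_len(nrow(m)), last_col)]
# [1] "u" "q" "w" NA
```

For an all-NA row every column ties at FALSE, so the last column is picked and the extracted value is NA, which is usually the desired result.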
I have a vector with words, e.g., like this:
w <- LETTERS[1:5]
and a dataframe with tokens of these words but also tokens of other words in different columns, e.g., like this:
set.seed(21)
df <- data.frame(
w1 = c(sample(LETTERS, 10)),
w2 = c(sample(LETTERS, 10)),
w3 = c(sample(LETTERS, 10)),
w4 = c(sample(LETTERS, 10))
)
df
w1 w2 w3 w4
1 U R A Y
2 G X P M
3 Q B S R
4 E O V T
5 V D G W
6 T A Q E
7 C K L U
8 D F O Z
9 R I M G
10 O T T I
# convert factor to character:
df[] <- lapply(df[], as.character)
I'd like to extract from df all the tokens of those words that are contained in the vector w. I can do it like this, but it doesn't look nice and is highly repetitive and error-prone if the data frame is larger:
extract <- c(df$w1[df$w1 %in% w],
df$w2[df$w2 %in% w],
df$w3[df$w3 %in% w],
df$w4[df$w4 %in% w])
I tried this, using paste0 to avoid addressing each column separately, but it doesn't work:
extract <- df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w]
extract
data frame with 0 columns and 10 rows
What's wrong with this code? Or which other code would work?
To answer your question, "What's wrong with this code?": the code df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w] is equivalent to df[df %in% w], because df[paste0("w", 1:4)], which you use twice, simply returns the whole of df. %in% treats a data frame as a list of columns, so df %in% w tests each column as a whole and returns FALSE FALSE FALSE FALSE: none of the columns of df is an element of w (w contains single strings, not vectors of strings). df[c(F, F, F, F)] then returns an empty data frame.
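The difference between testing whole columns and testing individual cells can be seen on a tiny example. This is only a sketch; the two-column data frame small and the shortened vector w are made up for illustration.

```r
w <- c("A", "B")
small <- data.frame(w1 = c("A", "X"),
                    w2 = c("Y", "B"),
                    stringsAsFactors = FALSE)

# On a data frame, %in% yields one result per column, not per cell
small %in% w
# [1] FALSE FALSE

# On a matrix, %in% works cell by cell (in column-major order)
mat <- as.matrix(small)
mat[mat %in% w]
# [1] "A" "B"

# Column by column on the data frame, flattened to a plain vector
unlist(lapply(small, function(x) x[x %in% w]), use.names = FALSE)
# [1] "A" "B"
```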
If you're dealing with a single data type (strings), and the output can be a character vector, then use a matrix instead of a data frame, which is faster and is, in this case, a little easier to subset:
mat <- as.matrix(df)
mat[mat %in% w]
#[1] "B" "D" "E" "E" "A" "B" "E" "B"
This produces the same output as your attempt above with extract <- ….
If you want to keep some semblance of the original data frame structure then you can try the following, which outputs a list (necessary as the returned vectors for each variable might have different lengths):
lapply(df, function(x) x[x %in% w])
#### OUTPUT ####
$w1
[1] "B" "D" "E"
$w2
[1] "E" "A"
$w3
[1] "B"
$w4
[1] "E" "B"
Just call unlist on the returned list if you want a vector.
This works for the first list element:
values[[1]][values[[1]]==-10000] <-NA
As I do not want to loop over my thousand list elements I'm looking for a command which does the same for the whole list like:
values[values==-10000] <-NA
But this does not work for the type list:
Error in values[values == -10000] <- NA :
(list) object cannot be coerced to type 'double'
Try using lapply:
lst <- list(v1=-10000, v2=500, v3=c(1,2))
lapply(lst, function(x) ifelse(x==-10000, NA, x))
$v1
[1] NA
$v2
[1] 500
$v3
[1] 1 2
This approach also works when some of the list elements are vectors rather than single numbers. Because ifelse is vectorized, only the entries equal to your target value are replaced, and a vector that does not contain the target value is left unchanged.
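As a quick check of how the lapply/ifelse call treats a list element that is itself a vector containing the target value, here is a sketch with a slightly modified, made-up list:

```r
# Made-up list: v3 contains the sentinel value in the middle
lst2 <- list(v1 = -10000, v2 = 500, v3 = c(1, -10000, 3))

out <- lapply(lst2, function(x) ifelse(x == -10000, NA, x))

out$v2   # unchanged
# [1] 500
out$v3   # only the matching entry becomes NA
# [1]  1 NA  3
```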
Since you mention a "big list", here is a second option that uses replace instead, which is a bit quicker than ifelse.
Thanks to @TimBiegeleisen for the data.
lst <- list(v1=-10000, v2=500, v3=c(1,2))
lapply(lst, function(x) replace(x, x == -10000, NA))
#$v1
#[1] NA
#
#$v2
#[1] 500
#
#$v3
#[1] 1 2
benchmark
l <- rep(lst, 100000)
library(microbenchmark)
library(ggplot2) # provides the autoplot() generic used below
benchmark <- microbenchmark(
tim = lapply(l, function(x) ifelse(x==-10000, NA, x)),
markus = lapply(l, function(x) replace(x, x==-10000, NA))
)
autoplot(benchmark)
benchmark
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# tim 931.5551 1003.0364 1054.7647 1018.7956 1082.3210 2536.373 100 b
# markus 432.3821 473.9881 500.4833 482.5838 515.9907 1023.392 100 a
Using na_if with map
library(tidyverse)
map(lst, na_if, y = -10000)
#$v1
#[1] NA
#$v2
#[1] 500
#$v3
#[1] 1 2
data
lst <- list(v1=-10000, v2=500, v3=c(1,2))
I have a data set with the structure shown below.
# example data set
a <- "a"
b <- "b"
d <- "d"
id1 <- c(a,a,a,a,b,b,d,d,a,a,d)
id2 <- c(b,d,d,d,a,a,a,a,b,b,d)
id3 <- c(b,d,d,a,a,a,a,d,b,d,d)
dat <- rbind(id1,id2,id3)
dat <- data.frame(dat)
For each row, I need to find the first run of repeated "a" elements and identify the element that immediately follows that run.
# desired results
dat$s3 <- c("b","b","d")
dat
I was able to break the problem into three steps and solve the first one, but as my programming skills are quite limited, I would appreciate any advice on how to approach steps 2 and 3. If you have an idea that solves the problem in another way, that would be extremely helpful as well.
Here is what I have so far:
# Step 1: find the first occurrence of "a" in the first sequence
dat$s1 <- apply(dat, 1, function(x) match(a,x))
# Step 2: find the last occurrence in the first sequence
# Step 3: find the element following the last occurrence in the first sequence
Thanks in advance!
I'd use filter:
fun <- function(x) {
x <- as.character(x)
isa <- (x == "a") #find "a" values
#find sequences with two TRUE values and the last value FALSE
ids <- stats::filter(isa, c(1,1,1), sides = 1) == 2L & !isa
na.omit(x[ids])[1] #subset
}
apply(dat, 1, fun)
#id1 id2 id3
#"b" "b" "d"
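The three steps from the question can also be expressed with run-length encoding. This is a different technique from the filter approach above, shown as a sketch only: the helper name after_first_run is made up, and it assumes that a "sequence" means at least two consecutive "a" values.

```r
# Hypothetical helper: element right after the first run of >= 2 targets
after_first_run <- function(x, target = "a") {
  x    <- as.character(x)
  runs <- rle(x)                                   # runs of equal values
  i    <- which(runs$values == target & runs$lengths >= 2)[1]
  if (is.na(i)) return(NA_character_)              # no such run in this row
  end  <- cumsum(runs$lengths)[i]                  # position of the run's last element
  x[end + 1]                                       # NA if the run ends the row
}

# Same rows as in the question, built as a character matrix
m <- rbind(id1 = c("a","a","a","a","b","b","d","d","a","a","d"),
           id2 = c("b","d","d","d","a","a","a","a","b","b","d"),
           id3 = c("b","d","d","a","a","a","a","d","b","d","d"))
apply(m, 1, after_first_run)
# id1 id2 id3
# "b" "b" "d"
```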
Try this (assuming that each row contains a repeated "a"):
library(stringr)
dat$s3 <-apply(dat, 1, function(x) str_match(paste(x, collapse=''),'aa([^a])')[,2])
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 s3
id1 a a a a b b d d a a d b
id2 b d d d a a a a b b d b
id3 b d d a a a a d b d d d
Well, here is one attempt which is a bit messy,
l1 <- lapply(apply(dat, 1, function(i) as.integer(which(i == a))),
function(j) j[cumsum(c(1, diff(j) != 1)) == 1])
ind <- unname(sapply(l1, function(i) tail(i, 1) + 1))
dat$s3 <- diag(as.matrix(dat[ind]))
dat$s3
#[1] "b" "b" "d"
or wrap it in a function,
fun1 <- function(df){
l1 <- lapply(apply(df, 1, function(i) as.integer(which(i == a))),
function(j) j[cumsum(c(1, diff(j) != 1)) == 1])
ind <- unname(sapply(l1, function(i) tail(i, 1) + 1))
return(diag(as.matrix(df[ind])))
}
fun1(dat)
#[1] "b" "b" "d"
I have this table (data1) with four columns
SNP rs6576700 rs17054099 rs7730126
sample1 G-G T-T G-G
I need to separate columns 2-4 into two columns each, so that the new output has 7 columns, like this:
SNP rs6576700 rs6576700 rs17054099 rs17054099 rs7730126 rs7730126
sample1 G G T T C C
With the following function I could split all columns at once, but the output is not what I need.
split <- function(x){
x <- as.character(x)
strsplit(as.character(x), split="-")
}
data2=apply(data1[,-1], 2, split)
data2
$rs17054099
$rs17054099[[1]]
[1] "T" "T"
$rs7730126
$rs7730126[[1]]
[1] "G" "G"
$rs6576700
$rs6576700[[1]]
[1] "C" "C"
In Stack Overflow I found a method to convert the output of strsplit to a dataframe but the rs numbers are in rows not in columns (I got a similar output with other methods in this thread strsplit by row and distribute results by column in data.frame)
> n <- max(sapply(data2, length))
> l <- lapply(data2, function(X) c(X, rep(NA, n - length(X))))
> data.frame(t(do.call(cbind, l)))
t.do.call.cbind..l..
rs17054099 T, T
rs7730126 G, G
rs2061700 C, C
If I do not use the function transpose (...(t(do.call...), the output is a list that I cannot write to a file.
I would like to have the solution in R to make it part of a pipeline.
I forgot to say that I need to apply this to a million columns.
This is straightforward using the splitstackshape::cSplit function. Just specify the column indices in the splitCols parameter and the separator in the sep parameter, and you're done. It will even number your new column names so that you can distinguish between them. I've specified type.convert = FALSE so that T values won't become TRUE. The default direction is wide, so you don't need to specify it.
library(splitstackshape)
cSplit(data1, 2:4, sep = "-", type.convert = FALSE)
# SNP rs6576700_1 rs6576700_2 rs17054099_1 rs17054099_2 rs7730126_1 rs7730126_2
# 1: sample1 G G T T G G
Here's a solution, as per the provided link, using the tstrsplit function from the devel version of data.table on GH. Here, we define the index by subsetting the column names first, and then we number the new columns using paste0. This is a slightly more cumbersome approach, but its advantage is that it updates your original data in place instead of creating a copy of the whole data.
library(data.table) ## V1.9.5+
indx <- names(data1)[2:4]
setDT(data1)[, paste0(rep(indx, each = 2), 1:2) := sapply(.SD, tstrsplit, "-"), .SDcols = indx]
data1
# SNP rs6576700 rs17054099 rs7730126 rs65767001 rs65767002 rs170540991 rs170540992 rs77301261 rs77301262
# 1: sample1 G-G T-T G-G G G T T G G
Here you want to use apply over the rows instead of columns:
df <- rbind(c("SNP", "rs6576700", "rs17054099", "rs7730126"),
c("sample1", "G-G", "T-T", "G-G"),
c("sample2", "C-C", "T-T", "G-C"))
t(apply(df[-1,], 1, function(col) unlist(strsplit(col, "-"))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#[1,] "sample1" "G" "G" "T" "T" "G" "G"
#[2,] "sample2" "C" "C" "T" "T" "G" "C"
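If the SNP names should survive as column names, a base R variant is to split column by column and bind the halves back together. This is a sketch under the assumption that every cell contains exactly two alleles separated by "-"; the second sample row and the _1/_2 suffixes are made up for illustration.

```r
data1 <- data.frame(
  SNP        = c("sample1", "sample2"),
  rs6576700  = c("G-G", "C-C"),
  rs17054099 = c("T-T", "T-T"),
  rs7730126  = c("G-G", "G-C"),
  stringsAsFactors = FALSE
)

snps <- names(data1)[-1]

# For each SNP column, build a two-column matrix (one row per sample)
halves <- lapply(data1[snps], function(col) {
  do.call(rbind, strsplit(col, "-", fixed = TRUE))
})

# Bind all matrices side by side and label the columns
alleles <- do.call(cbind, halves)
colnames(alleles) <- paste(rep(snps, each = 2), 1:2, sep = "_")

data2 <- data.frame(SNP = data1$SNP, alleles, stringsAsFactors = FALSE)
names(data2)
```

The result is an ordinary data frame with seven columns, so it can be written to a file with write.table. For a very large number of columns, the data.table approach above is likely to be the faster choice.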