Automatically creating derived variables in a dataframe [duplicate]

Possible Duplicate:
Processing the list of data.frames with “apply” family of functions
I have a dataframe with six numeric variables: V1, V2, V3 and V1.lag, V2.lag, V3.lag.
NOTE: My real dataset has many more variables, but I use 3 for illustration only!
I would like to be able to automatically (without hardcoding anything) run through all V variables (not the lag variables) and create V1.over.V1.lag variables by dividing each V variable by its corresponding lag variable.
df<-data.frame(matrix(rnorm(216),72,6));
colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag");
Thanks in advance
**EDIT: I figured out how to identify "raw" columns and "lag" columns**
raws <- sapply( names(df), function(x){ unlist(strsplit(x, "[.]"))[2] == "raw" } ); ## which are raw factors
lags <- sapply( names(df), function(x){ unlist(strsplit(x, "[.]"))[2] == "lag" } ); ## which are lagged factors
but I still can't figure out how to divide all raw factors by their lag counterparts.
which(raws);
will give me indices, but how do I combine them with the lags into a new variable?
df[which(raws)] / df[which(lags)]
doesn't work

Assuming you have only v.raw and v.lag columns in your data.frame, this should work:
mm <- colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag")
df[, gregexpr('.raw', mm) > 0] / df[, gregexpr('.lag', mm) > 0]
Edit: some explanations of the solution:
gregexpr('.raw',mm) > 0
[1] TRUE TRUE TRUE FALSE FALSE FALSE
head(df[,gregexpr('.raw',mm) > 0],1)
v1.raw v2.raw v3.raw
1 0.7719037 -0.2078197 -1.223753
gregexpr('.lag',mm) > 0
[1] FALSE FALSE FALSE TRUE TRUE TRUE
head(df[,gregexpr('.lag',mm) > 0],1)
v1.lag v2.lag v3.lag
1 0.7719037 -0.2078197 -1.223753
Then we use the vectorized / to do the division in one operation.
Here is an example:
df <- matrix(rep(c(1,2,3,4,5,6),each = 5),ncol=6)
colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag")
v1.raw v2.raw v3.raw v1.lag v2.lag v3.lag
[1,] 1 2 3 4 5 6
[2,] 1 2 3 4 5 6
[3,] 1 2 3 4 5 6
[4,] 1 2 3 4 5 6
[5,] 1 2 3 4 5 6
mm <- colnames(df)
df[,which(gregexpr('.raw',mm) > 0)] /df[,which(gregexpr('.lag',mm) > 0)]
v1.raw v2.raw v3.raw #as expected 1/4 2/5 3/6
[1,] 0.25 0.4 0.5
[2,] 0.25 0.4 0.5
[3,] 0.25 0.4 0.5
[4,] 0.25 0.4 0.5
[5,] 0.25 0.4 0.5
Edit2: preventing NaN when dividing by zero
df <- matrix(rep(c(1,2,3,4,5,6),each = 5),ncol=6)
colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag")
df[1,4] <- 0 ## I introduce a 0 here
mm <- colnames(df)
## I use ifelse because it is also vectorized!
## If you find a 0, don't compute, and return the original value
## You can do other things here
ifelse(df[,which(gregexpr('.lag',mm) > 0)] != 0 ,
df[,which(gregexpr('.raw',mm) > 0)] /df[,which(gregexpr('.lag',mm) > 0)],
df[,which(gregexpr('.raw',mm) > 0)])
v1.lag v2.lag v3.lag ## for some reason ifelse picks up the other column names! (lag, not raw)
[1,] 1.00 0.4 0.5
[2,] 0.25 0.4 0.5
[3,] 0.25 0.4 0.5
[4,] 0.25 0.4 0.5
[5,] 0.25 0.4 0.5
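For completeness, a name-based variant (a sketch of mine, not part of the answer above): pair each raw column with its lag counterpart by name rather than by position, which stays correct even if the columns are not stored in matching order. It assumes the v<N>.raw / v<N>.lag naming scheme from the question.
set.seed(1)
df <- data.frame(matrix(rnorm(216), 72, 6))
colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag")
raw.names <- grep("\\.raw$", names(df), value = TRUE)  # all raw columns
lag.names <- sub("\\.raw$", ".lag", raw.names)         # derive each paired lag name
ratios <- df[raw.names] / df[lag.names]                # element-wise, column by column
names(ratios) <- sub("\\.raw$", ".over.lag", raw.names)
df <- cbind(df, ratios)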

Related

Identify groups of identical rows in a matrix

tl;dr What is the idiomatic way to identify groups of identical rows in a matrix in R?
Given an n-by-2 matrix where some rows occur more than once,
> mat <- matrix(c(2,5,5,3,4,6,2,5,4,6,4,6), ncol=2, byrow=T)
> mat
[,1] [,2]
[1,] 2 5
[2,] 5 3
[3,] 4 6
[4,] 2 5
[5,] 4 6
[6,] 4 6
I am looking to get the groups of row indices of identical rows. In the example above, rows (1,4) are identical, and so are rows (3,5,6). Finally, there is row (2). I am looking to get these groups, represented in whatever way is idiomatic in R.
The output could be something like this,
> groups <- matrix(c(1,1, 2,2, 3,3, 4,1, 5,3, 6,3), ncol=2, byrow=T)
> groups
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 1
[5,] 5 3
[6,] 6 3
where the first column contains the row indices of mat and the second the group index for each row index. Or it could be like this:
> split(groups[,1], groups[,2])
$`1`
[1] 1 4
$`2`
[1] 2
$`3`
[1] 3 5 6
Either will do. I am not sure what is the best way to represent groups in R, and advice on this is also welcome.
For benchmarking purposes, here's a larger dataset:
set.seed(123)
n <- 10000000
mat <- matrix(sample.int(10, 2*n, replace = T), ncol=2)
Use cbind with the sequence of row indices and the match between the rows and the unique rows:
v1 <- paste(mat[,1], mat[,2])
# or if there are more columns
#v1 <- do.call(paste, as.data.frame(mat))
out <- cbind(seq_len(nrow(mat)), match(v1, unique(v1)))
Output:
> out
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 1
[5,] 5 3
[6,] 6 3
If we want a list output
split(out[,1], out[,2])
Output:
$`1`
[1] 1 4
$`2`
[1] 2
$`3`
[1] 3 5 6
Benchmarks
With the OP's big data
> system.time({
+   v1 <- paste(mat[,1], mat[,2])
+   out <- cbind(seq_len(nrow(mat)), match(v1, unique(v1)))
+ })
user system elapsed
2.603 0.130 2.706
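An alternative sketch of mine (not from the answer above): the same first-appearance grouping can be built with interaction() instead of paste(). It generalizes to any number of columns without worrying about separator collisions, though it is typically slower than paste() on a matrix this large.
key <- interaction(as.data.frame(mat), drop = TRUE)  # one factor level per distinct row
out2 <- cbind(seq_len(nrow(mat)), match(key, unique(key)))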

Optimization: conditional test on several matrices & extraction in R

I have a matrix
mat_a <- matrix(data = c( c(rep(1,3), rep(2,3), rep(3,3))
, rep(seq(1,300,100), 3)
, runif(15, 0, 1))
, ncol=3)
[,1] [,2] [,3]
[1,] 1 1 0.8393401
[2,] 1 101 0.5486805
[3,] 1 201 0.4449259
[4,] 2 1 0.3949137
[5,] 2 101 0.4002575
[6,] 2 201 0.3288861
[7,] 3 1 0.7865035
[8,] 3 101 0.2581155
[9,] 3 201 0.8987769
that I compare to another matrix with higher dimensions
mat_b <- matrix(data = c(
c(rep(1,3), rep(2,3), rep(3,3), rep(4,3))
, rep(seq(1,300,100), 4)
, rep(3:5, 4))
, ncol = 3)
[,1] [,2] [,3]
[1,] 1 1 3
[2,] 1 101 4
[3,] 1 201 5
[4,] 2 1 3
[5,] 2 101 4
[6,] 2 201 5
[7,] 3 1 3
[8,] 3 101 4
[9,] 3 201 5
[10,] 4 1 3
[11,] 4 101 4
[12,] 4 201 5
I need to extract the rows of mat_a where column 2 of both matrices match. For those matches, both columns 1 also have to match. Also, column 3 of mat_b must be greater than or equal to 4.
I cannot find any solution based on vectorization. I only came up with a loop-based solution.
output <- NULL
for (i in 1:nrow(mat_a)) {
  if (mat_a[i,2] %in% mat_b[,2][mat_b[,3] >= 4]) {
    rows <- which(mat_b[,2] %in% mat_a[i,2])
    row <- which(mat_b[,1][rows] == mat_a[i,1])
    if (mat_b[,3][rows[row]] >= 4) {
      output <- rbind(output, mat_a[i,])
    }
  }
}
This works but is extremely slow: it took just under an hour to run. mat_a has 9 columns and 40,000 rows (could go higher); mat_b has 5 columns and around 1.2 million rows.
Any idea?
It is better to work with data frames when comparing tables as you are; that plays to the strengths of R's structures instead of working against them. We use a simple merge to match the correct values, then subset b with the necessary condition, b$V3 >= 4. At the end, [-4] makes the result more closely match your desired output:
a <- as.data.frame(mat_a)
b <- as.data.frame(mat_b)
merge(a,b[b$V3 >= 4,], by=c("V1","V2"))[-4]
# V1 V2 V3.x
# 1 1 101 0.1118960
# 2 1 201 0.1543351
# 3 2 101 0.3950491
# 4 2 201 0.5688684
# 5 3 201 0.4749941
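At the sizes mentioned in the question (around 1.2 million rows in mat_b), a keyed data.table join is a common alternative. Here is a minimal sketch of mine, assuming the same V1/V2/V3 column names that the conversion above produces.
library(data.table)
a <- as.data.table(mat_a)
b <- as.data.table(mat_b)
# inner join on (V1, V2) after filtering b on V3 >= 4;
# i.V3 is b's V3 carried along by the join, dropped to mirror the merge() output
a[b[V3 >= 4], on = .(V1, V2), nomatch = 0L][, !"i.V3"]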

Need to vectorize a function that uses a loop (replace NA rows with values from a vector)

How can I rewrite this function as a vectorized variant? As far as I know, using loops is not good practice in R:
# replaces rows that contain all NAs with the value from the previous row's k-th column
na.replace <- function(x, k) {
  for (i in 2:nrow(x)) {
    if (!all(is.na(x[i - 1, ])) && all(is.na(x[i, ]))) {
      x[i, ] <- x[i - 1, k]
    }
  }
  x
}
This is the input data and the returned data for the function:
m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] NA NA
[6,] NA NA
[7,] NA NA
[8,] 6 7
[9,] 7 8
[10,] 8 9
na.replace(m, 2)
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
Here is a solution using na.locf in the zoo package. row.na is a vector with one component per row of m such that a component is TRUE if the corresponding row of m is all NA and FALSE otherwise. We then set all elements of such rows to the result of applying na.locf to column 2.
At the expense of a bit of speed, the lines ending with ## could be replaced with row.na <- apply(is.na(m), 1, all), which is a bit more readable.
If we knew that if any row has an NA in column 2 then all columns of that row are NA, as in the question, then the lines ending in ## could be reduced to just row.na <- is.na(m[, 2])
library(zoo)
nr <- nrow(m) ##
nc <- ncol(m) ##
row.na <- .rowSums(is.na(m), nr, nc) == nc ##
m[row.na, ] <- na.locf(m[, 2], na.rm = FALSE)[row.na]
The result is:
> m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
REVISED: some revisions to improve speed, as in the comments below. Also added alternatives in the discussion.
Notice that, unless you have a pathological condition where the first row is all NA (in which case you're screwed anyway), you don't need to check whether all(is.na(x[i - 1, ])) is TRUE or FALSE, because the previous time through the loop you "fixed" row i - 1.
Further, all you care about is that the designated k-th value is not NA. The rest of the row doesn't matter.
BUT: The k-th value always "falls through" from the top, so perhaps you should:
1) treat the k-th column as a vector, e.g. c(NA,1,NA,NA,3,NA,4,NA,NA), and "fill down" all numeric values. That's been done many times in SO questions.
2) Every row which is entirely NA except for column k gets filled with that same value.
I think that's still best done using either a loop or apply.
You probably need to clarify whether some rows have both numeric and NA values, which your example fails to include. If that's the case, then things get trickier.
The most important part in this answer is getting the grouping you want, which is:
groups = cumsum(rowSums(is.na(m)) != ncol(m))
groups
#[1] 0 0 1 2 2 2 2 3 4 5
Once you have that, the rest is just doing your desired operation by group, e.g.:
library(data.table)
dt = as.data.table(m)
k = 2
cond = rowSums(is.na(m)) != ncol(m)
dt[, (k) := .SD[[k]][1], by = cumsum(cond)]
dt[!cond, names(dt) := .SD[[k]]]
dt
# V1 V2
# 1: NA NA
# 2: NA NA
# 3: 1 2
# 4: 2 3
# 5: 3 3
# 6: 3 3
# 7: 3 3
# 8: 6 7
# 9: 7 8
#10: 8 9
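The same grouping also works in base R; a small sketch of mine using ave() to take the first value of column k within each group:
cond <- rowSums(is.na(m)) != ncol(m)
g <- cumsum(cond)
filled <- ave(m[, 2], g, FUN = function(v) v[1])  # first value of column 2 per group
m2 <- m
m2[!cond, ] <- filled[!cond]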
Here is another base only vectorized approach:
na.replace <- function(x, k) {
  is.all.na <- rowSums(is.na(x)) == ncol(x)
  ref.idx <- cummax((!is.all.na) * seq_len(nrow(x)))
  ref.idx[ref.idx == 0] <- NA
  x[is.all.na, ] <- x[ref.idx[is.all.na], k]
  x
}
And for fair comparison with #Eldar's solution, replace is.all.na with is.all.na <- is.na(x[, k]).
Finally, I worked out my own vectorized solution, and it works as expected. Any comments and suggestions are welcome :)
# Last Observation Moved Forward
# works like na.locf but is much faster; accepts only 1D structures
na.lomf <- function(object, na.rm = F) {
  idx <- which(!is.na(object))
  if (!na.rm && is.na(object[1])) idx <- c(1, idx)
  rep.int(object[idx], diff(c(idx, length(object) + 1)))
}
na.replace <- function(x, k) {
  v <- x[, k]
  i <- which(is.na(v))
  r <- na.lomf(v)
  x[i, ] <- r[i]
  x
}
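A quick check on the question's data reproduces the expected output shown above:
m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
na.replace(m, 2)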
Here's a workaround with the na.locf function from zoo (note that it repeats the entire last non-NA row rather than the k-th column's value, so rows 5-7 come out as 2 3 instead of 3 3):
m[na.locf(ifelse(apply(m, 1, function(x) all(is.na(x))), NA, 1:nrow(m)), na.rm=F),]
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 2 3
[6,] 2 3
[7,] 2 3
[8,] 6 7
[9,] 7 8
[10,] 8 9

R apply and get values from previous row

I'd like to use the previous row value for a calculation involving the current row. The matrix looks something like:
A B
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 5
[5,] 5 6
I want to do the following operation: (cell[i]/cell[i-1])-1, essentially calculating the % change (-1 to 1) from the previous row to the current one (excluding the first row).
The output should look like:
C D
[1,] NA NA
[2,] 1.0 0.5
[3,] 0.5 0.33
[4,] 0.33 0.25
[5,] 0.25 0.20
This can be accomplished easily using for-loops, but I am working with large data sets so I would like to use apply (or other inbuilt functions) for performance and cleaner code.
So far I've come up with:
test.perc <- sapply(test, function(x,y) x-x[y])
But it's not working.
Any ideas?
Thanks.
df/rbind(c(NA,NA), df[-nrow(df),]) - 1
will work.
1) division
ans1 <- DF[-1,] / DF[-nrow(DF),] - 1
or rbind(NA, ans1) if it's important to have the NAs in the first row
2) diff
ans2 <- exp(sapply(log(DF), diff)) - 1
or rbind(NA, ans2) if it's important to have the NAs in the first row
3) diff.zoo
library(zoo)
coredata(diff(as.zoo(DF), arithmetic = FALSE)) - 1
If it's important to have the NA at the beginning, then add the na.pad=TRUE argument like this:
coredata(diff(as.zoo(DF), arithmetic = FALSE, na.pad = TRUE)) - 1
Alternatively, sticking with your original sapply method:
sapply(dat, function(x) x/c(NA,head(x,-1)) - 1 )
Or a variation on #user3114046's answer:
dat/rbind(NA,head(dat,-1))-1
# A B
#[1,] NA NA
#[2,] 1.0000000 0.5000000
#[3,] 0.5000000 0.3333333
#[4,] 0.3333333 0.2500000
#[5,] 0.2500000 0.2000000
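If the data is a plain matrix, as in the question, rather than a data.frame, the same idea applies directly; a small sketch using the question's example values:
x <- cbind(A = 1:5, B = 2:6)            # the question's example data
rbind(NA, x[-1, ] / x[-nrow(x), ] - 1)  # row-over-previous-row change, NA-padded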

How to vectorize this operation on every row of a matrix

I have a matrix filled with TRUE/FALSE values and I am trying to find the index position of the first TRUE value on each row (or return NA if there is no TRUE value in the row). The following code gets the job done, but it uses an apply() call, which I believe is just a wrapper around a for loop. I'm working with some large datasets and performance is suffering. Is there a faster way?
> x <- matrix(rep(c(F,T,T),10), nrow=10)
> x
[,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] TRUE TRUE FALSE
[3,] TRUE FALSE TRUE
[4,] FALSE TRUE TRUE
[5,] TRUE TRUE FALSE
[6,] TRUE FALSE TRUE
[7,] FALSE TRUE TRUE
[8,] TRUE TRUE FALSE
[9,] TRUE FALSE TRUE
[10,] FALSE TRUE TRUE
> apply(x,1,function(y) which(y)[1])
[1] 2 1 1 2 1 1 2 1 1 2
Not sure this is any better, but this is one solution:
> library(matrixStats) # rowMins() is not in base R; assuming the matrixStats package here
> x2 <- t(t(matrix(as.numeric(x), nrow=10)) * 1:3)
> x2[x2 == 0] <- Inf
> rowMins(x2)
[1] 2 1 1 2 1 1 2 1 1 2
Edit: Here's a better solution using base R:
> x2 <- (x2 <- which(x, arr.ind=TRUE))[order(x2[,1]),]
> x2[as.logical(c(1,diff(x2[,1]) != 0)),2]
[1] 2 1 1 2 1 1 2 1 1 2
A couple of years later, I want to add two alternative approaches.
1) With max.col:
> max.col(x, "first")
[1] 2 1 1 2 1 1 2 1 1 2
2) With aggregate:
> aggregate(col ~ row, data = which(x, arr.ind = TRUE), FUN = min)$col
[1] 2 1 1 2 1 1 2 1 1 2
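One caveat worth noting (my observation, not part of the original answer): on a row with no TRUE at all, the apply() version returns NA, but max.col() returns 1 (the all-zero tie is broken by taking the first column), and the aggregate() version drops the row entirely. A small guard restores the NA behaviour:
first.true <- max.col(x, "first")
no.true <- !x[cbind(seq_len(nrow(x)), first.true)]  # reported position isn't actually TRUE
first.true[no.true] <- NA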
As performance is an issue, let's test the different methods on a larger dataset. First create a function for each method:
abiel <- function(n){apply(n, 1, function(y) which(y)[1])}
maxcol <- function(n){max.col(n, "first")}
aggr.min <- function(n){aggregate(col ~ row, data = which(n, arr.ind = TRUE), FUN = min)$col}
shane.bR <- function(n){x2 <- (x2 <- which(n, arr=TRUE))[order(x2[,1]),]; x2[as.logical(c(1,diff(x2[,1]) != 0)),2]}
joris <- function(n){z <- which(t(n))-1;((z%%ncol(n))+1)[match(1:nrow(n), (z%/%ncol(n))+1)]}
Second, create a larger dataset:
xl <- matrix(sample(c(F,T),9e5,replace=TRUE), nrow=1e5)
Third, run the benchmark:
library(microbenchmark)
microbenchmark(abiel(xl), maxcol(xl), aggr.min(xl), shane.bR(xl), joris(xl),
unit = 'relative')
which results in:
Unit: relative
expr min lq mean median uq max neval cld
abiel(xl) 55.102815 33.458994 15.781460 33.243576 33.196486 2.911675 100 d
maxcol(xl) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a
aggr.min(xl) 439.863935 262.595535 118.436328 263.387427 256.815607 16.709754 100 e
shane.bR(xl) 12.477856 8.522470 7.389083 13.549351 24.626431 1.748501 100 c
joris(xl) 7.922274 5.449662 4.418423 5.964554 9.855588 1.491417 100 b
You can gain a lot of speed by using %% and %/%:
x <- matrix(rep(c(F,T,T),10), nrow=10)
z <- which(t(x))-1
((z%%ncol(x))+1)[match(1:nrow(x), (z%/%ncol(x))+1)]
This can be adapted as needed: if you want to do this for columns, you don't have to transpose the matrix.
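For example, a quick sketch of the column variant (first TRUE in each column, NA where a column has none):
z <- which(x) - 1  # no transpose needed: which() already walks column-major
(z %% nrow(x) + 1)[match(1:ncol(x), z %/% nrow(x) + 1)]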
Tried it out on a 1,000,000 x 5 matrix:
x <- matrix(sample(c(F,T),5000000,replace=T), ncol=5)
system.time(apply(x,1,function(y) which(y)[1]))
#> user system elapsed
#> 12.61 0.07 12.70
system.time({
  z <- which(t(x)) - 1
  (z %% ncol(x) + 1)[match(1:nrow(x), (z %/% ncol(x)) + 1)]
})
#> user system elapsed
#> 1.11 0.00 1.11
You could gain quite a lot this way.
