Create a new variable from the minimum in R - r

The data contains four fields: id, x1, x2, and x3.
id <- c(1,2,3,4,5,6,7,8,9,10)
x1 <- c(2,4,5,3,6,4,3,6,7,7)
x2 <- c(0,1,2,6,7,6,0,8,2,2)
x3 <- c(5,3,4,5,8,3,4,2,5,6)
DF <- data.frame(id, x1,x2,x3)
Before I ask the question, let me create a new field (minX) which is the min of (x1,x2,x3)
DF$minX <- pmin(DF$x1, DF$x2, DF$x3)
I need to create a new field, y, that is defined as follows
if min(x1,x2,x3) = x1, then y = "x1"
if min(x1,x2,x3) = x2, then y = "x2"
if min(x1,x2,x3) = x3, then y = "x3"
Note: we assume no ties.

As a simply solution, do:
VARS <- colnames(DF)[-1]
y <- VARS[apply(DF[, -1], MARGIN = 1, FUN = which.min)]
DF$y <- y
The function which.min returns the index of the minimum. If the minimum is not unique it returns the first one. Since you guarantee that there is no tie, this is not an issue here.
Finally, you should be familiar with apply, right? MARGIN = 1 means applying function FUN row-wise, while MARGIN = 2 means applying FUN column-wise. This is an useful function to avoid the need for a for loop when dealing with matrix. Since your data frame only contains numerical/integer values, it is like a matrix hence we can use apply.

Here is another option using pmin and max.col
library(data.table)
setDT(DF)[, c("minx", "y") := list(do.call(pmin, .SD),
names(.SD)[max.col(-1*.SD)]), .SDcols= x1:x3]
DF
# id x1 x2 x3 minx y
# 1: 1 2 0 5 0 x2
# 2: 2 4 1 3 1 x2
# 3: 3 5 2 4 2 x2
# 4: 4 3 6 5 3 x1
3 5: 5 6 7 8 6 x1
# 6: 6 4 6 3 3 x3
# 7: 7 3 0 4 0 x2
# 8: 8 6 8 2 2 x3
# 9: 9 7 2 5 2 x2
#10: 10 7 2 6 2 x2

a data.table solution:
# create variables
id <- c(1,2,3,4,5,6,7,8,9,10)
x1 <- c(2,4,5,3,6,4,3,6,7,7)
x2 <- c(0,1,2,6,7,6,0,8,2,2)
x3 <- c(5,3,4,5,8,3,4,2,5,6)
DF <- data.frame(id, x1,x2,x3)
# load package and set data table, calculating min
library(data.table)
setDT(DF)[, minx := apply(.SD, 1, min), .SDcols=c("x1", "x2", "x3")]
# Create variable with name of minimum
DF[, y := apply(.SD, 1, function(x) names(x)[which.min(x)]), .SDcols = c("x1", "x2", "x3")]
# call result
DF
## id x1 x2 x3 minx y
1: 1 2 0 5 0 x2
2: 2 4 1 3 1 x2
3: 3 5 2 4 2 x2
4: 4 3 6 5 3 x1
5: 5 6 7 8 6 x1
6: 6 4 6 3 3 x3
7: 7 3 0 4 0 x2
8: 8 6 8 2 2 x3
9: 9 7 2 5 2 x2
10: 10 7 2 6 2 x2
The last step can be called directly, without the need to calculate minx.
Please notice that data.table is particularily fast in large data sets.
######## EDIT TO ADD: DPLYR METHOD #########
For completeness, this would be a dplyr method to produce the same (final) result. This solution is credited to #eipi10 in a question I started out of this problem (see here):
DF %>% mutate(y = apply(.[,2:4], 1, function(x) names(x)[which.min(x)]))
This solution takes about the same time as the data.table one provided in the original answer, when applyed to a 1e6 rows data frame (about 17 secs in my sony laptop).

Related

Creating a list with column-wise partitions of a data.frame

I have a data.frame with a single "identifier" column and many additional columns. I am interested in turning this data.frame into a list of length K, whose elements are sets of columns partitioning the data.frame.
For example, given the below data.frame:
# Example data.frame
df <- data.frame(id = 1:10,
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10),
x4 = rnorm(10))
I'd like to have some function that converts it into this:
# Partitioning function
foo(df, partitions = 3)
# Expected output
list(data.frame(id = df$id, x1 = df[ ,2]),
data.frame(id = df$id, x2 = df[ ,3]),
data.frame(id = df$id, x3 = df[ ,4], x4 = df[ ,5]),
Bonus points if you can extend this so that you can specify how many non-id columns each element of the list should contain by passing a numeric vector. Imagine the same output with an input that looks like this or equivalent.
columns_per_element <- c(1,1,2)
foo(df, columns_per_element)
It is actually easier to define a function with the splitting sequence. The key functions here are repand split.default i.e.
f2 <- function(df, n, split){
i1 <- rep(seq(n), split)
res_list <- split.default(df[-1], i1)
return(lapply(res_list, function(i)cbind.data.frame(ID = df$id, i)))
}
f2(df, 3, c(1, 1, 2))
$`1`
ID x1
1 1 1.54960977
2 2 -1.59144017
3 3 0.02853548
4 4 -0.14231391
5 5 1.26989801
6 6 0.87495876
7 7 0.27373774
8 8 -0.75600983
9 9 0.32216493
10 10 -1.05113771
$`2`
ID x2
1 1 0.8529416
2 2 0.4555094
3 3 -0.3620756
4 4 1.4779813
5 5 -1.6484066
6 6 -0.5697431
7 7 -0.2139384
8 8 0.1619074
9 9 -0.5390306
10 10 -0.2228809
$`3`
ID x3 x4
1 1 -0.2579865 1.185526074
2 2 -0.0519554 -0.388179976
3 3 2.5350092 -0.675504829
4 4 -1.7051955 0.073448252
5 5 0.6207733 -0.637220508
6 6 0.3015831 -1.324024114
7 7 -0.5647717 0.969025962
8 8 0.1404714 -1.575383604
9 9 1.3049560 -1.846413101
10 10 -0.6716643 0.008675125
f2(df, 3, c(1, 2, 1))
$`1`
ID x1
1 1 1.54960977
2 2 -1.59144017
3 3 0.02853548
4 4 -0.14231391
5 5 1.26989801
6 6 0.87495876
7 7 0.27373774
8 8 -0.75600983
9 9 0.32216493
10 10 -1.05113771
$`2`
ID x2 x3
1 1 0.8529416 -0.2579865
2 2 0.4555094 -0.0519554
3 3 -0.3620756 2.5350092
4 4 1.4779813 -1.7051955
5 5 -1.6484066 0.6207733
6 6 -0.5697431 0.3015831
7 7 -0.2139384 -0.5647717
8 8 0.1619074 0.1404714
9 9 -0.5390306 1.3049560
10 10 -0.2228809 -0.6716643
$`3`
ID x4
1 1 1.185526074
2 2 -0.388179976
3 3 -0.675504829
4 4 0.073448252
5 5 -0.637220508
6 6 -1.324024114
7 7 0.969025962
8 8 -1.575383604
9 9 -1.846413101
10 10 0.008675125
Here is solution with two parameters in the function with a vectorized column select. note this assumes the first column is id and is called id. second if the sum of the vector is greater than ncol(df)-1 (this will be your input df) it will throw an error.
f2 <- function(x,y){
#keep id
id <- x[,"id" , drop = FALSE]
#keep all other variables
df2 <- x[,-1]
#get sequence for columns
y2 <- lapply(cumsum(y), function(x){sequence(x)})
#grab correct columns
y3 <- c(y2[1],mapply(dplyr::setdiff ,y2[2:length(y2)],y2[1:2]))
#recreate df
lapply(y3,
function(x){
cbind.data.frame(id, df2[,x, drop = FALSE])
})
}
f2(df, c(1,1,2))

R: Sample n elements in certain columns in a dataframe/matrix and replace their values

I am struggling to solve the captioned problem.
My dataframe is like:
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 6 7 8 9 10
3 11 12 13 14 15
What I am trying to do is randomly selecting 3 elements from the third and fourth column and replace their values by 0. So the manipulated dataframe could be like
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 6 7 0 0 10
3 11 12 13 0 15
I saw from here Random number selection from a data-frame that it could be easier if I convert the data frame into matrix, so I tried
mat <- data.frame(rbind(rep(1:5, 1), rep(6:10, 1), rep(11:15, 1)))
mat_matrix <- as.matrix(mat)
mat_matrix[sample(mat_matrix[, 3:4], 3)] <- 0
But it just randomly picked 3 elements across all columns and rows in the matrix and turned them into 0.
Can anyone help me out?
You can use slice.index and sample from that.
mat_matrix[sample(slice.index(mat_matrix, 1:2)[,3:4], 3)] <- 0
Nothing wrong with a for loop in this case. Perhaps like this:
mat <- data.frame(rbind(rep(1:5, 1), rep(6:10, 1), rep(11:15, 1)))
cols <- c(3,4)
n <- nrow(mat)*length(cols)
v <- sample( x=1:n, size=3 )
m <- matrix(FALSE, ncol=length(cols), nrow=nrow(mat))
m[v] <- TRUE
for( i in seq_along(cols) ) {
mat[ m[,i], cols[i] ] <- 0
}
Just create a two column "index matrix" that you sample on and use to replace back into your data.
Here is one way using replace
cols <- c("X3", "X4")
N <- 3
df[cols] <- replace(as.matrix(df[cols]), sample(length(unlist(df[cols])), N), 0)
such that
> df
X1 X2 X3 X4 X5
1 1 2 3 0 5
2 6 7 8 0 10
3 11 12 0 14 15

Convert 3D array into list of dataframes

Basically, I want to group a 3D array by its columns, transform it into a data frame, and bind to it a new column whose value equals to the sum of all existing columns.
For example, consider the following 3D array
> (src <- array(1:8, c(2, 2, 2), dimnames=list(c('X1', 'X2'), c('Y1', 'Y2'), 1:2)))
, , 1
Y1 Y2
X1 1 3
X2 2 4
, , 2
Y1 Y2
X1 5 7
X2 6 8
I would like to convert it to
> (dest <- list(Y1=data.frame(X1=c(1, 5), X2=c(2, 6), Y1=c(1, 5)+c(2, 6)),
Y2=data.frame(X1=c(3, 7), X2=c(4, 8), Y2=c(3, 7)+c(4, 8))))
$Y1
X1 X2 Y1
1 1 2 3
2 5 6 11
$Y2
X1 X2 Y2
1 3 4 7
2 7 8 15
I know how to do the transformation for each individual column in the original array, but have no idea how to handle multiple columns simultaneously.
> library(dplyr)
> as.data.frame(t(src[, 'Y1', ])) %>% mutate(Y1=X1+X2)
X1 X2 Y1
1 1 2 3
2 5 6 11
Feel free to use base R, dplyr, data.table, or whatever package you prefer, as long as it's fast enough. In the real-world application, dim(src) tend to be something like c(hundreds, tens, tens of thousands).
We could first apply data.frame-transformation on margin 2 of the transposed array, where we transpose arrays with aperm(). Then we proceed similarly with the colSums. In order to get the right names "Y1", "Y2" we make an interim step listing the columns as data frames. Finally Map evaluates both lists (the X* and colsums of Y*) element by element.
dest <- Map(cbind, apply(aperm(src, c(3, 2, 1)), 2, data.frame),
{tmp <- data.frame(apply(src, 2, colSums));list(tmp[1], tmp[2])})
dest
# $Y1
# X1 X2 Y1
# 1 1 2 3
# 2 5 6 11
#
# $Y2
# X1 X2 Y2
# 1 3 4 7
# 2 7 8 15

Filtering observations by using grep the reverse way in R

Shown as below:
df <- data.frame(X1 = rep(letters[1:3],3),
X2 = 1:9,
X3 = sample(1:50,9))
df
ind<- grep("a|c", df$X1)
library(data.table)
df_ac <- df[ind,]
df_b <- df[!ind,]
df_ac is created using the regular grep command. If I want to use the grep the reverse way: to select all observations with X1 == 'b'.
I know I can do this by:
ind2<- grep("a|c", df$X1, invert = T)
df_b <-df[ind2,]
But, in my original script, why does the command df_b <-df[!ind,] return a data frame with zero observation?
Anyone can explain to me why my logic here is wrong? Is there any other way to select observations in a data.frame by using the grep reversely without specifying invert = T? Thank you!
You may be more interested in grepl instead of grep:
ind<- grepl("a|c", df$X1)
df[ind,]
# X1 X2 X3
# 1 a 1 16
# 3 c 3 38
# 4 a 4 10
# 6 c 6 18
# 7 a 7 33
# 9 c 9 49
df[!ind,]
# X1 X2 X3
# 2 b 2 5
# 5 b 5 14
# 8 b 8 50
Alternatively, go ahead an make use of "data.table" and try out %in% to see what else might work for you. Notice the difference in the syntax.
ind2 <- c("a", "c")
library(data.table)
setDT(df)
df[X1 %in% ind2]
# X1 X2 X3
# 1: a 1 16
# 2: c 3 38
# 3: a 4 10
# 4: c 6 18
# 5: a 7 33
# 6: c 9 49
df[!X1 %in% ind2]
# X1 X2 X3
# 1: b 2 5
# 2: b 5 14
# 3: b 8 50

Order data frame by columns with different calling schemes

Say I have the following data frame:
df <- data.frame(x1 = c(2, 2, 2, 1),
x2 = c(3, 3, 2, 1),
let = c("B", "A", "A", "A"))
df
x1 x2 let
1 2 3 B
2 2 3 A
3 2 2 A
4 1 1 A
If I want to order df by x1, then x2 then let, I do this:
df2 <- df[with(df, order(x1, x2, let)), ]
df2
x1 x2 let
4 1 1 A
3 2 2 A
2 2 3 A
1 2 3 B
However, x1 and x2 have actually been saved as an id <- c("x1", "x2") vector earlier in the code, which I use for other purposes.
So my problem is that I want to reference id instead of x1 and x2 in my order function, but unfortunately anything like df[order(df[id], df$let), ] will result in a argument lengths differ error.
From what I can tell (and this has been addressed at another SO thread), the problem is that length(df[id]) == 2 and length(df$let) == 4.
I have been able to make it through this workaround:
df3 <- df[order(df[, id[1]], df[, id[2]], df[, "let"]), ]
df3
x1 x2 let
4 1 1 A
3 2 2 A
2 2 3 A
1 2 3 B
But it looks ugly and depends on knowing the size of id.
Is there a more elegant solution to sorting my data frame by id then let?
I would suggest using do.call(order, ...) and combining id and "let" with c():
id <- c("x1", "x2")
df[do.call(order, df[c(id, "let")]), ]
# x1 x2 let
# 4 1 1 A
# 3 2 2 A
# 2 2 3 A
# 1 2 3 B

Resources