R Matrix, get the index of the minimum column

I am very new to R and still learning.
I have calculated the squared difference with omega, column-wise, like this:
final_wights <- apply(wjs, 2, function(x) (omega - x))^2
Now I want to get the column number of the minimum column. I can get the minimum value of each column using
col <- apply(final_wights, 2, min)
but how do I get just the index (the column number) of that minimum in the matrix?

You may not need apply here:
final_weights <- (wjs - omega)^2
To get the index of the column with the minimum value, you can use which with arr.ind=TRUE to get the row/column index (a modification of @Bhas's comment):
which(final_weights == min(final_weights), arr.ind=TRUE)[,2]
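If you only want the single column number of the overall minimum, another option (a small sketch; col_mins is just an illustrative name) is to reduce to the column minima first and then use which.min:
col_mins <- apply(final_weights, 2, min)  # minimum of each column
which.min(col_mins)                       # position (and name) of the smallest column minimum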
data
set.seed(24)
wjs <- as.data.frame(matrix(sample(0:20, 5*10, replace=TRUE), ncol=5))
set.seed(42)
omega <- as.data.frame(matrix(sample(0:20, 5*10, replace=TRUE), ncol=5))

Related

Apply between function over a matrix by using lower bound and upper bound vectors

I have a data frame composed of numeric values. I calculated the standard deviation and mean for each column and created Upper_Bound and Lower_Bound vectors as follows:
std_devs = apply(exp_vars[,sapply(exp_vars,is.numeric)], 2, sd)
means = apply(exp_vars[,sapply(exp_vars,is.numeric)], 2, mean)
Upper_Bound = means + 3*std_devs
Lower_Bound = means - 3*std_devs
Now I want to detect the rows that have at least one value that does not fall between the relevant upper and lower bounds. For example, a value in column j must be greater than or equal to Lower_Bound[j] and less than or equal to Upper_Bound[j]; if at least one value in row i violates this condition, I want to save the index of that row (I also have row names, so saving row names would be fine too). What I want to obtain is a vector of indices (or row names) identifying all rows which violate the rule. I tried the following:
outliers = apply(my_data ,1, between(x,Lower_Bound, Upper_Bound,incbounds = TRUE))
But I guess it was too much to expect between to automatically go over every value in a row and compare it with the relevant bounds. This was my second hopeless attempt, which did not work either:
outliers = apply(exp_vars_numeric,1, apply(x,2,between(x,Lower_Bound, Upper_Bound, incbounds = TRUE)))
I know that I can do it with a for loop, but I am hoping for a more efficient solution. Any suggestion is highly appreciated.
Thanks in advance.
Consider keeping everything in one data frame by adding lower- and upper-bound columns, with the help of ave() for inline aggregation of sd and mean. Then run a conditional ifelse() to flag the violating rows.
num_cols <- sapply(exp_vars,is.numeric)
num_names <- colnames(exp_vars)[num_cols]
means <- sapply(exp_vars[,num_cols], function(x) ave(x, FUN=mean))
std_devs <- sapply(exp_vars[,num_cols], function(x) ave(x, FUN=sd))
exp_vars[,paste0(num_names, "_lower")] <- means - 3*std_devs
exp_vars[,paste0(num_names, "_upper")] <- means + 3*std_devs
# CONDITIONALLY ASSIGN FLAG COLS (1 = VALUE OUTSIDE ITS BOUNDS)
exp_vars[,paste0(num_names, "_flag")] <- ifelse(exp_vars[,num_names] < exp_vars[,paste0(num_names, "_lower")] |
                                                exp_vars[,num_names] > exp_vars[,paste0(num_names, "_upper")], 1, 0)
# A ROW VIOLATES THE RULE IF ANY OF ITS FLAG COLS IS SET
exp_vars$index <- ifelse(rowSums(exp_vars[,paste0(num_names, "_flag")]) > 0, row.names(exp_vars), NA)
exp_vars[!is.na(exp_vars$index), ]
It is recommended to include a small example of what your data looks like so that it is easier for us to respond to your question :) I generated data frames based on your description, and it seems that the following solves your problem:
library(data.table)  # for between(); dplyr::between() works similarly
df <- data.frame(a=c(1:10), b=c(5:14))
ncols <- ncol(df)
bounds <- data.frame(lower=seq(.5,5,.5), upper=seq(6.5,11,.5))
one_plus_fall_outside <- sapply(1:nrow(df),
  function(i)
    sum(between(unlist(df[i,]), bounds$lower[i], bounds$upper[i]))/ncols < 1
)
which(one_plus_fall_outside)
You can check that this works by looking at all the columns together:
cbind(df,bounds,one_plus_fall_outside)
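For a fully vectorized base-R alternative (a sketch; assumes the numeric columns are compared against the per-column Lower_Bound and Upper_Bound computed above):
m <- as.matrix(exp_vars[sapply(exp_vars, is.numeric)])
too_low  <- sweep(m, 2, Lower_Bound, `<`)             # TRUE where a value is below its column's lower bound
too_high <- sweep(m, 2, Upper_Bound, `>`)             # TRUE where a value is above its column's upper bound
row.names(exp_vars)[rowSums(too_low | too_high) > 0]  # rows with at least one violation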

R: Performance issue when finding maximum of a split list

When trying to find the maximum values of a split list, I run into serious performance issues.
Is there a way I can optimize the following code?
# Generate data for this MWE
x <- matrix(runif(900 * 9000), nrow = 900, ncol = 9000)
y <- rep(1:100, each = 9)
my_data <- cbind(y, x)
my_data <- data.frame(my_data)
# This is the critical part I would like to optimize
my_data_split <- split(my_data, y)
max_values <- lapply(my_data_split, function(x) x[which.max(x[ , 50]), ])
I want to get the rows where a given column hits its maximum for a given group (it should be easier to understand from the code).
I know that splitting into a list is probably the reason for the slow performance, but I don't know how to circumvent it.
This may not be immediately obvious.
There is a base R function, max.col, that does something similar, except that it finds the position index of the maximum along a matrix row (not column). So if you transpose your original matrix x, you will be able to use this function.
Complexity steps in when you want to do max.col by group, where the split-lapply convention is usually needed. But if, after the transpose, we convert the matrix to a data frame, we can use split.default. (Note it is not split or split.data.frame. Here the data frame is treated as a list (vector), so the split happens among the data frame columns.) Finally, we use sapply to apply max.col by group and cbind the results into a matrix.
tx <- data.frame(t(x))
tx.group <- split.default(tx, y) ## note the `split.default`, not `split`
pos <- sapply(tx.group, max.col)
The resulting pos is something like a look-up table. It has 9000 rows and 100 columns (groups). pos[i, j] gives the index you want for the i-th column (of your original, non-transposed matrix) and the j-th group. So your final extraction for the 50th column and all groups is
max_values <- Map("[[", tx.group, pos[50, ])
You generate the look-up table just once and can then make arbitrary extractions at any time.
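For example, the same table can be reused for any other column (column 100 here is just an arbitrary illustration):
max_values_100 <- Map("[[", tx.group, pos[100, ])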
Disadvantage of this method:
After the split, the data in each group are stored in a data frame rather than a matrix. For example, tx.group[[1]] is a 9000 x 9 data frame. But max.col expects a matrix, so it will convert this data frame into a matrix internally.
Thus, the major performance / memory overhead includes:
initial matrix transposition;
matrix to data frame conversion;
data frame to matrix conversion (per group).
I am not sure whether we can eliminate all of the above with some functions from the matrixStats package. I look forward to seeing a solution along those lines.
But anyway, this answer is already much faster than what the OP originally does.
A solution using {dplyr}:
# Generate data for this MWE
x <- matrix(runif(900 * 9000), nrow = 900, ncol = 9000)
y <- rep(1:100, each = 9)
my_data <- cbind.data.frame(y, x)
# This is the critical part I would like to optimize
system.time({
  my_data_split <- split(my_data, y)
  max_values <- lapply(my_data_split, function(x) x[which.max(x[ , 50]), ])
})
# Using {dplyr} is 9 times faster, but you get results in a slightly different format
library(dplyr)
system.time({
  max_values2 <- my_data %>%
    group_by(y) %>%
    do(max_values = .[which.max(.[[50]]), ])
})
all.equal(max_values[[1]], max_values2$max_values[[1]], check.attributes = FALSE)
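For comparison, a base-R sketch that splits only the row indices rather than the whole data frame, which avoids copying all 9000 columns per group (max_values3 and idx are illustrative names):
system.time({
  # for each group, keep the index of the row maximizing column 50
  idx <- tapply(seq_len(nrow(my_data)), y, function(i) i[which.max(my_data[i, 50])])
  max_values3 <- my_data[unlist(idx), ]
})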

Substituting missing values based on both row and column averages

As far as I know, missing data (NAs) in a data frame can be substituted by either row- or column-based averages. But what I'm trying to do in R (I'm not sure if it's possible) is to calculate, for each missing cell, an average based on both the row and the column where that cell is located. I was wondering if you had any suggestions.
Here is the sample data with NAs:
nr <- 50
mm <- t(matrix(sample(0:4, nr * 15, replace = TRUE), nr))
mm[,c(4,7,12,13)]<-NA
mm[c(3,5,8,9,10,13),]<-NA
Assuming that the OP wanted to replace each NA element based on the row and column averages at that index: we get the row/column indices of the NA elements using which with arr.ind=TRUE ('ind'), take the colMeans and rowMeans of the dataset ('df') subsetted by the columns and rows of 'ind', and replace the NA elements by the average of the corresponding elements of 'c1' and 'r1'.
ind <- which(is.na(df), arr.ind=TRUE)
c1 <- colMeans(df[,ind[,2]], na.rm=TRUE)
r1 <- rowMeans(df[ind[,1],], na.rm=TRUE)
df[ind] <- colMeans(rbind(c1, r1))
Or, as @thelatemail suggested, we can use outer to get all combinations of rowMeans and colMeans and then replace the NA values based on that:
ind <- is.na(df)
df[ind] <- (outer(rowMeans(df,na.rm=TRUE), colMeans(df,na.rm=TRUE), `+`)/2)[ind]
data
set.seed(24)
df <- as.data.frame(matrix( sample(c(NA, 0:5), 10*10, replace=TRUE), ncol=10))
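As a quick sanity check on the sample data (a sketch of the outer() version applied end to end):
ind <- is.na(df)
df[ind] <- (outer(rowMeans(df, na.rm=TRUE), colMeans(df, na.rm=TRUE), `+`)/2)[ind]
anyNA(df)  # should now be FALSE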

Replacing or imputing NA values in R without For Loop

Is there a better way to go through the observations in a data frame and impute NA values? I've put together a for loop that seems to do the job, swapping NAs with the row's mean value, but I'm wondering if there is a better approach that does not use a for loop -- perhaps a built-in R function?
# 1. Create data frame with some NA values.
rdata <- rbinom(30,5,prob=0.5)
rdata[rdata == 0] <- NA
mtx <- matrix(rdata, 3, 10)
df <- as.data.frame(mtx)
df2 <- df
# 2. Run for loop to replace NAs with that row's mean.
for(i in 1:3){                # for every row
  x <- as.numeric(df[i,])     # subset/extract that row into a numeric vector
  y <- is.na(x)               # logical vector marking the NAs
  z <- !is.na(x)              # logical vector marking the non-NAs
  result <- mean(x[z])        # mean of the row's non-NA values
  df2[i,y] <- result          # replace the NAs in that row
}
# 3. Show output with imputed row mean values.
print(df) # before
print(df2) # after
Here's a possible vectorized approach (without any loop)
indx <- which(is.na(df), arr.ind = TRUE)
df[indx] <- rowMeans(df, na.rm = TRUE)[indx[,"row"]]
Some explanation:
We can identify the locations of the NAs using the arr.ind argument of which. Then we can simply index df (by the row and column indices) and the row means (by the row indices only) and replace the values accordingly.
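To see why df[indx] <- ... works, here is a minimal illustration of matrix indexing (a toy example, independent of the data above):
m <- matrix(1:9, nrow = 3)
idx <- cbind(row = c(1, 3), col = c(2, 2))  # one (row, col) pair per line
m[idx]                                      # extracts m[1,2] and m[3,2]
m[idx] <- 0                                 # replaces exactly those two cells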
Data:
set.seed(102)
rdata <- matrix(rbinom(30,5,prob=0.5),nrow=3)
rdata[cbind(1:3,2:4)] <- NA
df <- as.data.frame(rdata)
This is a little trickier than I'd like -- it relies on the column-major ordering of matrices in R as well as the recycling of the row-means vector to the full length of the matrix. I tried to come up with a sweep() solution but haven't managed so far.
rmeans <- rowMeans(df,na.rm=TRUE)
df[] <- ifelse(is.na(df),rmeans,as.matrix(df))
One possibility is impute from Hmisc, which allows choosing any function to do the imputation:
library(Hmisc)
t(sapply(split(df2, row(df2)), impute, fun=mean))
Also, you can hide the loop inside an apply:
t(apply(df2, 1, function(x) {
  mu <- mean(x, na.rm=TRUE)
  x[is.na(x)] <- mu
  x
}))

Sort specific range of data frame by a column in R

I have a data.frame of size 8326x13. I would like to order parts of it by a specific column, e.g. order the range 1:1375 only by column A, and then put this ordered part back into the same place, 1:1375, of the same data.frame. Is that possible?
Thanks in advance.
Raúl.
Or, using the dataset from @useR:
indx <- rep(c(TRUE,FALSE), each=10) #create a logical index.
In this case, the first 10 rows are ordered:
data[indx,] <- data[order(data$A[indx]),]
Update
Or, instead of creating a logical index, extract the rows that need to be ordered and replace them with the ordered set:
data[1:10,] <- data[order(data$A[1:10]),]
In your dataset, you would create the index as:
indx <- rep(c(TRUE,FALSE), c(1375, 8326-1375))
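Note that these replacements line up only because the range to be ordered starts at row 1; for a range that starts elsewhere, subset before ordering (a sketch, with an arbitrary range for illustration):
rng <- 100:200
data[rng, ] <- data[rng, ][order(data$A[rng]), ]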
As suggested by @JeremyS:
A <- sample(1:100, 20)
B <- sample(letters[1:26],20)
data <- data.frame(A, B)
n <- 10 # you want range 1:n
lower <- data[(n+1):dim(data)[1], ]  # split into two data.frames: upper and lower parts
upper <- data[1:n,]
upper <- upper[order(upper$A),]      # or order(upper[,m]), where m is the column index
data.new <- rbind.data.frame(upper, lower)
