Substituting missing values based on both row and column averages - r

As far as I know, missing data (NAs) in a data frame can be substituted by either row-based or column-based averages. What I'm trying to do in R (though I'm not sure it's possible) is to fill each missing cell with an average based on both the row and the column in which that cell is located. I was wondering if you had any suggestions.
Here is the sample data with NA's:
nr <- 50
mm <- t(matrix(sample(0:4, nr * 15, replace = TRUE), nr))
mm[,c(4,7,12,13)]<-NA
mm[c(3,5,8,9,10,13),]<-NA

Assuming the goal is to replace each NA with the average of its row mean and its column mean, we first get the row/column positions of the NAs using which with arr.ind=TRUE ('ind'). We then take the colMeans of the dataset ('df') subsetted by the column indices in 'ind' ('c1') and the rowMeans subsetted by the row indices in 'ind' ('r1'), and replace each NA with the average of the corresponding elements of 'c1' and 'r1'.
ind <- which(is.na(df), arr.ind=TRUE)
c1 <- colMeans(df[,ind[,2]], na.rm=TRUE)
r1 <- rowMeans(df[ind[,1],], na.rm=TRUE)
df[ind] <- colMeans(rbind(c1, r1))
Or, as @thelatemail suggested, we can use outer to get all combinations of rowMeans and colMeans and then replace the NA values from that grid.
ind <- is.na(df)
df[ind] <- (outer(rowMeans(df,na.rm=TRUE), colMeans(df,na.rm=TRUE), `+`)/2)[ind]
data
set.seed(24)
df <- as.data.frame(matrix( sample(c(NA, 0:5), 10*10, replace=TRUE), ncol=10))
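As a quick sanity check, here is a minimal sketch on a made-up 3x3 data frame (the values and the names fill_lookup and fill_outer are illustrative assumptions, not part of the answer above). It computes the fill value by looking up the row and column means at each NA position and by the outer grid, and the two should agree.

```r
# Hypothetical 3x3 data frame with one missing cell at row 2, column 2
df <- as.data.frame(matrix(c(1,  2,  3,
                             4, NA, 10,
                             7, 10,  9), nrow = 3, byrow = TRUE))

rmeans <- rowMeans(df, na.rm = TRUE)  # row 2 mean: (4 + 10) / 2 = 7
cmeans <- colMeans(df, na.rm = TRUE)  # column 2 mean: (2 + 10) / 2 = 6

# Look up the row mean and column mean for each NA position
ind <- which(is.na(df), arr.ind = TRUE)
fill_lookup <- unname((rmeans[ind[, "row"]] + cmeans[ind[, "col"]]) / 2)

# Build the full grid of (row mean + column mean) / 2 and pick the NA cells
fill_outer <- (outer(rmeans, cmeans, `+`) / 2)[is.na(df)]

df[ind] <- fill_lookup
df[2, 2]  # 6.5
```

Rows or columns that are entirely NA have a mean of NaN, so cells in them would still need separate handling.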

Related

Calculating the row-means with certain conditions

Let's say I have a matrix like so:
df <- matrix(data = c(1,2,9,3,7,NA,4,NA,NA,NA,NA,NA), nrow=4, ncol=3, byrow=T)
What I want to calculate are the row means of the matrix, where a row isn't allowed to have more than one NA. In this case the end result would be a vector of four components, specifically c(4,5,NA,NA).
I can make separate vectors that meet the requirements like so:
df1 <- df[c(which(rowSums(is.na(df))<=1)),]
df2 <- df[c(which(rowSums(is.na(df))>1)),]
rowMeans(df1, na.rm=T)
rowMeans(df2, na.rm=F)
But I can't seem to figure out a good way to have just one vector.
We can set the rows that have more than one NA to NA, and then compute the rowMeans with na.rm=TRUE. Note that a row that is entirely NA comes back as NaN rather than NA.
df[rowSums(is.na(df))>1,] <- NA
rowMeans(df, na.rm=TRUE)
Or we can do this in one step
rowMeans(df, na.rm=TRUE)*NA^(rowSums(is.na(df))>1)
Or another option would be to create an index for getting the rowMeans
i1 <- !(rowSums(is.na(df)) > 1)
ifelse(i1, rowMeans(df, na.rm=TRUE), NA_real_)
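The same idea can be wrapped in a small helper. This is only a sketch: the function name row_means_max_na and its max_na argument are hypothetical, introduced here for illustration. It returns NA (not NaN) for rows that exceed the allowed number of missing values.

```r
# Hypothetical helper: row means, but NA for rows with more than max_na NAs
row_means_max_na <- function(m, max_na = 1) {
  n_na <- rowSums(is.na(m))          # number of NAs per row
  out <- rowMeans(m, na.rm = TRUE)   # means over the non-missing values
  out[n_na > max_na] <- NA_real_     # blank out rows with too many NAs
  out
}

# The matrix from the question
m <- matrix(c(1, 2, 9, 3, 7, NA, 4, NA, NA, NA, NA, NA),
            nrow = 4, ncol = 3, byrow = TRUE)
res <- row_means_max_na(m)
res  # 4 5 NA NA
```

Unlike the in-place approach, this leaves the original matrix untouched.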

Replacing or imputing NA values in R without For Loop

Is there a better way to go through observations in a data frame and impute NA values? I've put together a for loop that seems to do the job, swapping NAs with the row's mean value, but I'm wondering if there is a better approach that does not use a for loop, perhaps a built-in R function?
# 1. Create data frame with some NA values.
rdata <- rbinom(30,5,prob=0.5)
rdata[rdata == 0] <- NA
mtx <- matrix(rdata, 3, 10)
df <- as.data.frame(mtx)
df2 <- df
# 2. Run for loop to replace NAs with that row's mean.
for(i in 1:3){ # for every row
  x <- as.numeric(df[i,]) # subset/extract that row into a numeric vector
  y <- is.na(x) # create logical vector of NAs
  z <- !is.na(x) # create logical vector of non-NAs
  result <- mean(x[z]) # get the mean value of the row
  df2[i,y] <- result # replace NAs in that row
}
# 3. Show output with imputed row mean values.
print(df) # before
print(df2) # after
Here's a possible vectorized approach (without any loop)
indx <- which(is.na(df), arr.ind = TRUE)
df[indx] <- rowMeans(df, na.rm = TRUE)[indx[,"row"]]
Some explanation
We can identify the locations of the NAs using the arr.ind parameter of which. Then we can index df by the row and column positions, index the vector of row means by the row positions only, and replace the values accordingly.
Data:
set.seed(102)
rdata <- matrix(rbinom(30,5,prob=0.5),nrow=3)
rdata[cbind(1:3,2:4)] <- NA
df <- as.data.frame(rdata)
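To make the indexing concrete, here is a minimal sketch on a made-up data frame (the column names a, b, c and the values are assumptions for illustration). It shows which row each NA belongs to and how that row number selects the matching row mean.

```r
# Hypothetical data frame with NAs at (row 2, column a) and (row 3, column b)
df <- data.frame(a = c(1, NA, 3),
                 b = c(4, 5, NA),
                 c = c(7, 8, 9))

# One row of indx per NA, with columns named "row" and "col"
indx <- which(is.na(df), arr.ind = TRUE)

rmeans <- rowMeans(df, na.rm = TRUE)  # row 2: (5 + 8) / 2, row 3: (3 + 9) / 2

# Each NA cell receives the mean of the row it sits in
df[indx] <- rmeans[indx[, "row"]]
df
```

After the assignment, df$a[2] holds 6.5 and df$b[3] holds 6.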
This is a little trickier than I'd like -- it relies on the column-major ordering of matrices in R as well as the recycling of the row-means vector to the full length of the matrix. I tried to come up with a sweep() solution but didn't manage so far.
rmeans <- rowMeans(df,na.rm=TRUE)
df[] <- ifelse(is.na(df),rmeans,as.matrix(df))
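The recycling that this relies on can be seen on a small matrix. The following is a sketch with made-up values: because R stores matrices column by column, a vector with one entry per row lines up with the row index when it is recycled down each column.

```r
# Hypothetical 2x3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]   NA    4   NA
m <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 2)

rmeans <- rowMeans(m, na.rm = TRUE)  # 3 for row 1, 4 for row 2

# ifelse() recycles rmeans as 3, 4, 3, 4, 3, 4, which matches the
# column-major layout, so element k always meets the mean of its own row
filled <- ifelse(is.na(m), rmeans, m)
filled
```

The same trick would not work directly for column means; those would have to be expanded first, for example with rep(cmeans, each = nrow(m)).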
Another possibility is impute from Hmisc, which lets you choose any function to do the imputation:
library(Hmisc)
t(sapply(split(df2, row(df2)), impute, fun=mean))
Also, you can hide the loop in an apply
t(apply(df2, 1, function(x) {
  mu <- mean(x, na.rm=TRUE)
  x[is.na(x)] <- mu
  x
}))

R Matrix, get the index of minimum column

I am very new to R and still learning.
I have calculated the squared difference from omega column-wise, like this:
final_wights <- apply(wjs,2, function(x) (omega - x))^2
Now I want to get the column number of the column that contains the minimum. I can get the minimum value of each column using
col <- apply(final_wights, 2, min)
But I want the index. How do I get just the column number in the matrix?
You may not need apply here
final_weights <- (wjs-omega)^2
To get the index of the column(s) containing the minimum value, you can use which with arr.ind=TRUE to get the row/column index (a modification of @Bhas's comment)
which(final_weights == min(final_weights), arr.ind=TRUE)[,2]
data
set.seed(24)
wjs <- as.data.frame(matrix(sample(0:20, 5*10, replace=TRUE), ncol=5))
set.seed(42)
omega <- as.data.frame(matrix(sample(0:20, 5*10, replace=TRUE), ncol=5))
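For a small illustration, here is a sketch on a made-up matrix (the name fw and its values are assumptions). It shows the which approach next to an alternative based on which.min over the column minima; both should point at the same column when the minimum is unique.

```r
# Hypothetical matrix of squared differences; the smallest value (1) is at [2, 2]
fw <- matrix(c(9, 4, 7,
               6, 1, 8), nrow = 2, byrow = TRUE)

# Row/column position of every cell equal to the overall minimum
pos <- which(fw == min(fw), arr.ind = TRUE)
min_col <- unname(pos[, "col"])

# Alternative: take each column's minimum, then the position of the smallest one
min_col_alt <- which.min(apply(fw, 2, min))

min_col      # 2
min_col_alt  # 2
```

If the minimum occurs in several cells, which returns all of them, whereas which.min returns only the first.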

Calculate pairwise-difference between each pair of columns in dataframe

I cannot get around the problem of computing the difference between every pair of variables (columns) in "adat" and saving the results to a matrix "dfmtx".
I need to automate the following sequence so that it runs for each column in "adat", then name each resulting vector after the two columns that were subtracted and place it in a column of "dfmtx".
In "adat" I have 14 columns and 26 rows not including the header.
dfmtx[,1]=(adat[,1]-adat[,1])
dfmtx[,2]=(adat[,1]-adat[,2])
dfmtx[,3]=(adat[,1]-adat[,3])
dfmtx[,4]=(adat[,1]-adat[,4])
dfmtx[,5]=(adat[,1]-adat[,5])
dfmtx[,6]=(adat[,1]-adat[,6])
.....
dfmtx[,98]=(adat[,14]-adat[,14])
Any help would be appreciated thank you!
If adat is a data.frame, you can use outer to get the combinations of columns and then do the difference between pairwise subset of columns based on the index from outer. It is not clear how you got "98" columns. By removing the diagonal and lower triangular elements, the number of columns will be "91".
nm1 <- outer(colnames(adat), colnames(adat), paste, sep="_")
indx1 <- which(lower.tri(nm1, diag=TRUE))
res <- outer(1:ncol(adat), 1:ncol(adat),
             function(x,y) adat[,x]-adat[,y])
colnames(res) <- nm1
res1 <- res[-indx1]
dim(res1)
#[1] 26 91
data
set.seed(24)
adat <- as.data.frame(matrix(sample(1:20, 26*14,
replace=TRUE), ncol=14))
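As an alternative sketch, not the method used in the answer above, combn can enumerate each unordered pair of columns once. The small data frame and its column names here are made up for illustration; with 14 columns this would yield choose(14, 2) = 91 difference columns.

```r
# Hypothetical data frame with three columns
adat <- data.frame(a = c(1, 2, 3),
                   b = c(10, 20, 30),
                   c = c(5, 5, 5))

# Each column of 'pairs' holds the names of one pair: (a,b), (a,c), (b,c)
pairs <- combn(names(adat), 2)

# One difference column per pair, first name minus second name
dfmtx <- apply(pairs, 2, function(p) adat[[p[1]]] - adat[[p[2]]])

# Name each column after the two columns that were subtracted
colnames(dfmtx) <- apply(pairs, 2, paste, collapse = "_")
dfmtx
```

The result is a numeric matrix with one row per observation and columns a_b, a_c and b_c.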

Exclude values within a range for rowsum in R

I'm trying to get the row sums for each row of my dataframe but with a condition. I'd like to exclude all the values that are between -1 and 1 after applying log2. I know how to exclude NAs but I'm confused with excluding actual numbers. My dataframe is just numbers, except for the row and column names.
Try
dat1 <- dat
dat1[dat1 >-1 & dat1<1] <- NA
rowSums(dat1, na.rm=TRUE)
If there are no NAs in the dataset, you could set those values to 0 and just use rowSums
dat1[dat1 >-1 & dat1<1] <- 0
rowSums(dat1)
data
set.seed(42)
dat <- as.data.frame(matrix(sample(seq(-5,5,by=0.25), 20*5,
replace=TRUE), ncol=5))
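The question mentions applying log2 first, which the code above leaves out. A minimal sketch of that step, assuming all values are positive so that log2 is defined (the data frame and the name lg are made up for illustration):

```r
# Hypothetical positive-valued data
dat <- data.frame(x = c(1, 4, 0.25),
                  y = c(8, 1.5, 2))

# log2-transform, then zero out everything strictly between -1 and 1
lg <- log2(as.matrix(dat))   # x: 0, 2, -2   y: 3, ~0.585, 1
lg[lg > -1 & lg < 1] <- 0

row_totals <- rowSums(lg)
row_totals  # 3 2 -1
```

Values exactly equal to -1 or 1 are kept here because the comparison is strict; use >= and <= if the boundaries should be excluded as well.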
