Calculate pairwise-difference between each pair of columns in dataframe - r

I cannot get around the problem of computing the differences between every pair of variables (columns) in "adat" and saving them to a matrix "dfmtx".
I just need to automate the following sequence so that it runs for each column in "adat", then name each resulting vector after the two columns that were subtracted from each other, and place it into a column of "dfmtx".
In "adat" I have 14 columns and 26 rows not including the header.
dfmtx[,1]=(adat[,1]-adat[,1])
dfmtx[,2]=(adat[,1]-adat[,2])
dfmtx[,3]=(adat[,1]-adat[,3])
dfmtx[,4]=(adat[,1]-adat[,4])
dfmtx[,5]=(adat[,1]-adat[,5])
dfmtx[,6]=(adat[,1]-adat[,6])
.....
dfmtx[,98]=(adat[,14]-adat[,14])
Any help would be appreciated thank you!

If adat is a data.frame, you can use outer to get the combinations of columns and then take the difference between each pair of columns based on the index from outer. It is not clear how you got "98" columns: with 14 columns there are 14 x 14 = 196 ordered pairs, and after removing the diagonal and lower triangular elements the number of columns will be 14 x 13 / 2 = 91.
nm1 <- outer(colnames(adat), colnames(adat), paste, sep="_")
indx1 <- which(lower.tri(nm1, diag=TRUE))
res <- outer(1:ncol(adat), 1:ncol(adat),
             function(x,y) adat[,x]-adat[,y])
colnames(res) <- nm1
res1 <- res[-indx1]
dim(res1)
#[1] 26 91
data
set.seed(24)
adat <- as.data.frame(matrix(sample(1:20, 26*14,
                                    replace=TRUE), ncol=14))
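As an alternative sketch of the same idea (not the answer's own method), the unordered column pairs can be enumerated with combn and the differences taken on a plain matrix. The names m and idx are introduced here for illustration only, and the example data is the same as in the answer above.

```r
# Same example data as above
set.seed(24)
adat <- as.data.frame(matrix(sample(1:20, 26 * 14, replace = TRUE), ncol = 14))

m <- as.matrix(adat)          # work on a numeric matrix
idx <- combn(ncol(m), 2)      # 2 x 91 matrix of column index pairs (i < j)

# Column i minus column j for every pair, all at once
dfmtx <- m[, idx[1, ]] - m[, idx[2, ]]

# Name each result column after the two columns that were subtracted
colnames(dfmtx) <- paste(colnames(m)[idx[1, ]], colnames(m)[idx[2, ]], sep = "_")

dim(dfmtx)
#[1] 26 91
```

Note that the column order here follows combn (1-2, 1-3, ..., 13-14), which may differ from the order produced by the outer approach.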

Related

Substituting missing values based on both row and column averages

As far as I know, missing data (NAs) in a data frame can be substituted by either row- or column-based averages. What I'm trying to do in R (but I'm not sure if it's possible) is calculate averages for missing cells that are based on both the row and the column where the cell with the missing value is located. I was wondering if you had any suggestions.
Here is the sample data with NA's:
nr <- 50
mm <- t(matrix(sample(0:4, nr * 15, replace = TRUE), nr))
mm[,c(4,7,12,13)]<-NA
mm[c(3,5,8,9,10,13),]<-NA
Assuming that the OP wanted to replace each NA with the average of its row mean and its column mean, we get the row/column index of every NA using which with arr.ind=TRUE ('ind'). Then we take the colMeans of the dataset ('df') subsetted by the column indices in 'ind' ('c1') and the rowMeans subsetted by the row indices in 'ind' ('r1'), and replace each NA with the average of the corresponding elements of 'c1' and 'r1'.
ind <- which(is.na(df), arr.ind=TRUE)
c1 <- colMeans(df[,ind[,2]], na.rm=TRUE)
r1 <- rowMeans(df[ind[,1],], na.rm=TRUE)
df[ind] <- colMeans(rbind(c1, r1))
Or, as @thelatemail suggested, we can use outer to get the combinations of rowMeans and colMeans and then replace the NA values based on that.
ind <- is.na(df)
df[ind] <- (outer(rowMeans(df,na.rm=TRUE), colMeans(df,na.rm=TRUE), `+`)/2)[ind]
data
set.seed(24)
df <- as.data.frame(matrix( sample(c(NA, 0:5), 10*10, replace=TRUE), ncol=10))
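To make the arithmetic concrete, here is a minimal sketch on a hypothetical 3x3 data frame (toy, rmean, cmean and filled are illustrative names, not part of the original answer). Each NA is replaced by the average of its row mean and its column mean, both computed while ignoring NAs.

```r
# Hypothetical 3x3 example with two missing cells
toy <- data.frame(V1 = c(1, 2, NA), V2 = c(4, 5, 6), V3 = c(7, NA, 9))

rmean <- rowMeans(toy, na.rm = TRUE)      # 4.0, 3.5, 7.5
cmean <- colMeans(toy, na.rm = TRUE)      # 1.5, 5.0, 8.0

# Row/column positions of the NA cells
ind <- which(is.na(toy), arr.ind = TRUE)

# Average of the matching row mean and column mean for each NA cell
filled <- toy
filled[ind] <- (rmean[ind[, "row"]] + cmean[ind[, "col"]]) / 2

filled$V1[3]
#[1] 4.5
filled$V3[2]
#[1] 5.75
```

If an entire row or column is NA (as in the question's sample data), its mean is NaN and the affected cells would need separate handling.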

Replacing or imputing NA values in R without For Loop

Is there a better way to go through observations in a data frame and impute NA values? I've put together a 'for loop' that seems to do the job, swapping NAs with the row's mean value, but I'm wondering if there is a better approach that does not use a for loop to solve this problem -- perhaps a built in R function?
# 1. Create data frame with some NA values.
rdata <- rbinom(30,5,prob=0.5)
rdata[rdata == 0] <- NA
mtx <- matrix(rdata, 3, 10)
df <- as.data.frame(mtx)
df2 <- df
# 2. Run for loop to replace NAs with that row's mean.
for(i in 1:3){ # for every row
  x <- as.numeric(df[i,]) # subset/extract that row into a numeric vector
  y <- is.na(x) # create logical vector of NAs
  z <- !is.na(x) # create logical vector of non-NAs
  result <- mean(x[z]) # get the mean value of the row
  df2[i,y] <- result # replace NAs in that row
}
# 3. Show output with imputed row mean values.
print(df) # before
print(df2) # after
Here's a possible vectorized approach (without any loop)
indx <- which(is.na(df), arr.ind = TRUE)
df[indx] <- rowMeans(df, na.rm = TRUE)[indx[,"row"]]
Some explanation
We can identify the locations of the NAs using the arr.ind parameter in which. Then we can simply index df (by the row and column indexes) and the row means (only by the row indexes) and replace the values accordingly.
Data:
set.seed(102)
rdata <- matrix(rbinom(30,5,prob=0.5),nrow=3)
rdata[cbind(1:3,2:4)] <- NA
df <- as.data.frame(rdata)
This is a little trickier than I'd like -- it relies on the column-major ordering of matrices in R as well as the recycling of the row-means vector to the full length of the matrix. I tried to come up with a sweep() solution but didn't manage so far.
rmeans <- rowMeans(df,na.rm=TRUE)
df[] <- ifelse(is.na(df),rmeans,as.matrix(df))
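The reliance on column-major ordering and recycling can be seen on a small hypothetical example (toy, tmeans and filled are illustrative names). Because a matrix is stored column by column, a vector of length nrow that is recycled along it lines up with the row index of every cell.

```r
# Hypothetical 3x2 example: col 1 = 1, NA, 3; col 2 = 4, 5, NA
toy <- as.data.frame(matrix(c(1, NA, 3, 4, 5, NA), nrow = 3))

tmeans <- rowMeans(toy, na.rm = TRUE)   # 2.5, 5, 3

# tmeans is recycled to length 6 as 2.5, 5, 3, 2.5, 5, 3,
# which matches cells (1,1), (2,1), (3,1), (1,2), (2,2), (3,2)
filled <- ifelse(is.na(toy), tmeans, as.matrix(toy))

filled[, 1]   # 1, 5, 3
filled[, 2]   # 4, 5, 3
```

The same trick would not work for column means, since a vector of length ncol recycled down the columns would not line up with the column index.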
One possibility, using impute from Hmisc, which allows for choosing any function to do imputation,
library(Hmisc)
t(sapply(split(df2, row(df2)), impute, fun=mean))
Also, you can hide the loop in an apply
t(apply(df2, 1, function(x) {
  mu <- mean(x, na.rm=TRUE)
  x[is.na(x)] <- mu
  x
}))

R: Add columns to a data frame on the fly

New to R and programming in general over here. I have several binary matrices of presence/absence data for species (columns) and plots (rows). I'm trying to use them in several dissimilarity indices, which require that they all have the same dimensions. Although there are always 10 plots, there is a variable number of columns based on which species were observed at that particular time. My attempt to add the 'missing' columns to each matrix so I can perform the analyses went as follows:
df1 <- read.csv('file1.csv', header=TRUE)
df2 <- read.csv('file2.csv', header=TRUE)
newCol <- unique(append(colnames(df1),colnames(df2)))
diff1 <- setdiff(newCol,colnames(df1))
diff2 <- setdiff(newCol,colnames(df2))
for (i in 1:length(diff1)) {
df1[paste(diff1[i])]
}
for (i in 1:length(diff2)) {
df2[paste(diff2[i])]
}
No errors are thrown, but df1 and df2 both remain unchanged. I suspect my issue is with my use of paste, but I couldn't find any other way to add columns to a data frame on the fly like that. When added, the new columns should have 0s in the matrix as well, but I think that's the default, so I didn't add anything to specify it.
Thanks all.
Using your code, you can generate the columns without the for loop by:
df1[, diff1] <- 0 #I guess you want `0` to fill those columns
df2[, diff2] <- 0
identical(sort(colnames(df1)), sort(colnames(df2)))
#[1] TRUE
Or if you want to combine the datasets into one, you could use rbindlist from data.table with fill=TRUE (columns missing from a dataset are then filled with NA rather than 0)
library(data.table)
rbindlist(list(df1, df2), fill=TRUE)
data
set.seed(22)
df1 <- as.data.frame(matrix(sample(0:1, 10*6, replace=TRUE), ncol=6,
dimnames=list(NULL, sample(paste0("Species", 1:10), 6, replace=FALSE))))
set.seed(35)
df2 <- as.data.frame(matrix(sample(0:1, 10*8, replace=TRUE), ncol=8,
dimnames=list(NULL, sample(paste0("Species", 1:10),8 , replace=FALSE))))
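As a minimal sketch of the first approach on hypothetical data (the species names, p1, p2 and allCols are made up for illustration), the missing columns are added as 0s and both data frames are then put into the same column order so they have matching dimensions.

```r
# Hypothetical presence/absence tables with partly different species
p1 <- data.frame(SpeciesA = c(1, 0), SpeciesB = c(0, 1))
p2 <- data.frame(SpeciesB = c(1, 1), SpeciesC = c(0, 1))

allCols <- union(colnames(p1), colnames(p2))

# Add the columns each table is missing, filled with 0
p1[, setdiff(allCols, colnames(p1))] <- 0
p2[, setdiff(allCols, colnames(p2))] <- 0

# Reorder so both tables share the same column order
p1 <- p1[allCols]
p2 <- p2[allCols]

identical(colnames(p1), colnames(p2))
#[1] TRUE
```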

Sort specific range of data frame by a column in R

I have a data.frame of size 8326x13. I would like to order it in parts by a specific column, e.g. order the range 1:1375 only by column A, and then put this ordered part back into the same data.frame in the correct place, 1:1375. Is it possible?
Thanks in advance.
Raúl.
Or, (using the dataset of useR)
indx <- rep(c(TRUE,FALSE), each=10) #create a logical index.
In this case the first 10 rows are ordered
data[indx,] <- data[order(data$A[indx]),]
Update
Or instead of creating a logical index, extract the rows that needs to be ordered and replace it with the ordered set
data[1:10,] <- data[order(data$A[1:10]),]
For your dataset, the index would be created as
indx <- rep(c(TRUE,FALSE), c(1375, 8326-1375))
As suggested by @JeremyS
A <- sample(1:100, 20)
B <- sample(letters[1:26],20)
data <- data.frame(A, B)
n <- 10 # you want range 1:n
lower <- data[(n+1):dim(data)[1], ] # split to two data.frame with lower and upper part
upper <- data[1:n,]
upper <- upper[order(upper$A),] # or order(upper[,m]), m is the column index
data.new <- rbind.data.frame(upper, lower)
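The snippets above order a range that starts at row 1. For a range that starts elsewhere, a sketch along the following lines may help; it assumes a hypothetical data frame dat and range rng, and subsets the rows first so that the positions returned by order refer to the range rather than to the whole data frame.

```r
# Hypothetical data; rows 3 to 6 are to be ordered by column A
dat <- data.frame(A = c(5, 3, 9, 8, 2, 7, 1),
                  B = letters[1:7],
                  stringsAsFactors = FALSE)
rng <- 3:6

# order() returns positions within the range, so index the subset, not dat
dat[rng, ] <- dat[rng, ][order(dat$A[rng]), ]

dat$A
#[1] 5 3 2 7 8 9 1
dat$B
#[1] "a" "b" "e" "f" "d" "c" "g"
```

Rows outside the range are left untouched, and the row names of the data frame are not rearranged.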

sum different columns in a data.frame

I have a very big data.frame and want to sum the values in every column.
So I used the following code:
sum(production[,4],na.rm=TRUE)
or
sum(production$X1961,na.rm=TRUE)
The problem is that the data.frame is very big. And I only want to sum 40 certain columns with different names of my data.frame. And I don't want to list every single column. Is there a smarter solution?
At the end I also want to store the sum of every column in a new data.frame.
Thanks in advance!
Try this:
colSums(df[sapply(df, is.numeric)], na.rm = TRUE)
where sapply(df, is.numeric) is used to detect all the columns that are numeric.
If you just want to sum a few columns, then do:
colSums(df[c("X1961", "X1962", "X1999")], na.rm = TRUE)
res <- unlist(lapply(production, function(x) if(is.numeric(x)) sum(x, na.rm=T)))
will return the sum of each numeric column.
You could create a new data frame based on the result with
data.frame(t(res))
If you don't want to include every single column, you somehow have to indicate which ones to include (or, alternatively, which to exclude)
colsInclude <- c("X1961", "X1962", "X1963") # by name
# or #
colsInclude <- paste0("X", 1961:2003) # by name
# or #
colsInclude <- c(10:19, 23, 55, 147) # by column number
To put those columns in a new data frame, simply use [ ] as you've done:
newDF <- oldDF[, colsInclude]
To sum up each column, simply use colSums
sums <- colSums(newDF, na.rm=T)
# or #
sums <- colSums(oldDF[, colsInclude], na.rm=T)
Note that sums will be a named vector, not a data frame.
You can make it into a data frame using as.data.frame
sums <- as.data.frame(sums)
# or, to include the data frame from which it came #
sums <- rbind(newDF, "totals"=sums)
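Putting the pieces together on hypothetical data (prod, colsInclude, sums and sumsDF below are illustrative and not the poster's actual data), a sketch of summing selected columns and storing the result as a one-row data frame might look like this:

```r
# Hypothetical stand-in for the poster's data frame
prod <- data.frame(Country = c("A", "B", "C"),
                   X1961 = c(1, NA, 3),
                   X1962 = c(10, 20, 30),
                   X1963 = c(NA, 5, 5))

# Pick the columns by building their names
colsInclude <- paste0("X", 1961:1962)

# Named vector of column sums, ignoring NAs
sums <- colSums(prod[, colsInclude], na.rm = TRUE)
sums
#X1961 X1962
#    4    60

# One-row data frame with one column per summed variable
sumsDF <- data.frame(t(sums))
```

Transposing before calling data.frame keeps one column per variable; as.data.frame(sums) would instead give one row per variable.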
