Let's say I have a matrix like so:
df <- matrix(data = c(1,2,9,3,7,NA,4,NA,NA,NA,NA,NA), nrow=4, ncol=3, byrow=T)
What I want to calculate are the row means of the matrix, where a row isn't allowed to have more than one NA. In this case the end result would be a vector of four components, specifically c(4,5,NA,NA).
I can make separate vectors that meet the requirements like so:
df1 <- df[c(which(rowSums(is.na(df))<=1)),]
df2 <- df[c(which(rowSums(is.na(df))>1)),]
rowMeans(df1, na.rm=T)
rowMeans(df2, na.rm=F)
But I can't seem to figure out a good way to have just one vector.
We can set the rows that have more than one NA to NA, and then call rowMeans with na.rm=TRUE
df[rowSums(is.na(df))>1,] <- NA
rowMeans(df, na.rm=TRUE)
Or we can do this in one step
rowMeans(df, na.rm=TRUE)*NA^(rowSums(is.na(df))>1)
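The one-step version leans on how R evaluates powers of NA: anything raised to the power 0 is 1, so NA^FALSE is 1, while NA^TRUE stays NA. A small sketch of the trick on its own (the name `too_many` and the values are only illustrative):

```r
# Logical flags: TRUE marks a row with too many NAs
too_many <- c(FALSE, FALSE, TRUE, TRUE)

# FALSE is coerced to 0 and TRUE to 1, so the mask is 1 where the
# row mean should be kept and NA where it should be blanked out.
mask <- NA^too_many
mask
# 1 1 NA NA

# Multiplying by the mask keeps or blanks each value
c(4, 5, 4, 10) * mask
# 4 5 NA NA
```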
Or another option would be to create an index for getting the rowMeans
i1 <- !rowSums(is.na(df))>1
ifelse(i1, rowMeans(df, na.rm=TRUE), NA_real_)
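If this is needed in more than one place, the options above could be wrapped in a small helper. This is only a sketch: the function name `row_means_max_na` and its `max_na` argument are made up here, not part of any package.

```r
# Hypothetical helper: row means that ignore NAs, but return NA for
# any row containing more than `max_na` missing values.
row_means_max_na <- function(m, max_na = 1) {
  n_na <- rowSums(is.na(m))          # number of NAs per row
  out <- rowMeans(m, na.rm = TRUE)   # means over the non-missing values
  out[n_na > max_na] <- NA_real_     # blank out rows with too many NAs
  out
}

m <- matrix(c(1, 2, 9, 3, 7, NA, 4, NA, NA, NA, NA, NA),
            nrow = 4, ncol = 3, byrow = TRUE)
row_means_max_na(m)
# 4 5 NA NA
```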
Hey, I have a bit of a misunderstanding and need some guidance. I want to compute the correlation between a vector (or a data frame with one column) and each row of a data frame.
I made a graphic for a better understanding:
Graphic: https://ibb.co/51Fk5KB
Every row has a date that matches a unique date (as.Date) in the other data frame. Because I want to compute the correlation in a rolling window of 12 months, I run:
df1 <- read.zoo(df1)
df2 <- read.zoo(df2)
new_df <- rollapplyr(??????????, 12, function(x) cor(x[, 1], x[, 2]), by.column = TRUE, fill = NA)
new_df <- fortify.zoo(new_df)
Now I ask you: what do I have to insert in the ?????????? spot? Or do I even have to change/add something else?
You can calculate the correlation between a vector and the columns of a data frame with cor(vector, dataframe)
Example
Create a vector and dataframe :
set.seed(1234)
vec <- (runif(150, 0, 10))
iris2 <- iris[,c(1:4)] # 150 x 4 dataframe
Now calculate correlations
cor(vec, iris2)
# Correlations
# -0.0187099581910839078691 -0.0233219261874525844724 -0.0063229780212239634907 0.0138003706052788940178
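The question also asks for a rolling window of 12 rows. One way to reuse cor(vector, dataframe) per window is sketched below in base R rather than with rollapplyr. The data and the names `dat`, `width` and `roll_cor` are invented for illustration, and the rows are assumed to be already sorted by date.

```r
set.seed(1)
n <- 24
vec <- rnorm(n)                               # illustrative vector
dat <- data.frame(a = rnorm(n), b = rnorm(n)) # illustrative data frame
width <- 12                                   # window length in rows

# For each window end, correlate the window of `vec` with the same
# window of every column in `dat`; incomplete windows give NA.
roll_cor <- t(sapply(seq_len(n), function(end) {
  if (end < width) return(rep(NA_real_, ncol(dat)))
  idx <- (end - width + 1):end
  drop(cor(vec[idx], dat[idx, , drop = FALSE]))
}))
colnames(roll_cor) <- names(dat)

dim(roll_cor)   # one row per date, one column per column of dat
```

The first 11 rows are NA because a right-aligned window of 12 is not yet full there.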
As far as I know, missing data (NAs) in a data frame can be substituted by either row-based or column-based averages. What I'm trying to do in R (though I'm not sure it's possible) is calculate averages for missing cells based on both the row and the column where the missing cell is located. I was wondering if you had any suggestions.
Here is the sample data with NA's:
nr <- 50
mm <- t(matrix(sample(0:4, nr * 15, replace = TRUE), nr))
mm[,c(4,7,12,13)]<-NA
mm[c(3,5,8,9,10,13),]<-NA
Assuming that the OP wanted to replace the NA element based on the row/column averages of that index, we get the row/column index using which with arr.ind=TRUE ('ind'). Get the colMeans and rowMeans of the dataset ('df') subsetted by the columns of 'ind', and replace the NA elements by the average of the corresponding elements of 'c1' and 'r1'.
ind <- which(is.na(df), arr.ind=TRUE)
c1 <- colMeans(df[,ind[,2]], na.rm=TRUE)
r1 <- rowMeans(df[ind[,1],], na.rm=TRUE)
df[ind] <- colMeans(rbind(c1, r1))
Or, as @thelatemail suggested, we can use outer to get the combinations of colMeans and rowMeans and then replace the NA values based on that.
ind <- is.na(df)
df[ind] <- (outer(rowMeans(df,na.rm=TRUE), colMeans(df,na.rm=TRUE), `+`)/2)[ind]
data
set.seed(24)
df <- as.data.frame(matrix( sample(c(NA, 0:5), 10*10, replace=TRUE), ncol=10))
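To make the arithmetic of the outer approach concrete, here is a tiny worked example on a made-up 3 x 3 matrix `m` (not the data above). The missing cell gets the average of its row mean and its column mean.

```r
m <- matrix(c(1,  2, 3,
              4, NA, 6,
              7,  8, 9), nrow = 3, byrow = TRUE)

rmeans <- rowMeans(m, na.rm = TRUE)   # 2 5 8
cmeans <- colMeans(m, na.rm = TRUE)   # 4 5 6

# fill[i, j] is the average of row mean i and column mean j
fill <- outer(rmeans, cmeans, `+`) / 2

ind <- is.na(m)
m[ind] <- fill[ind]
m[2, 2]
# 5, i.e. (5 + 5) / 2
```

One caveat: if a whole row or a whole column is NA, as in the question's sample data, its mean with na.rm=TRUE is NaN, so cells in that row or column are not filled with a usable value by this formula.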
Is there a better way to go through observations in a data frame and impute NA values? I've put together a 'for loop' that seems to do the job, swapping NAs with the row's mean value, but I'm wondering if there is a better approach that does not use a for loop to solve this problem -- perhaps a built in R function?
# 1. Create data frame with some NA values.
rdata <- rbinom(30,5,prob=0.5)
rdata[rdata == 0] <- NA
mtx <- matrix(rdata, 3, 10)
df <- as.data.frame(mtx)
df2 <- df
# 2. Run for loop to replace NAs with that row's mean.
for(i in 1:3){                  # for every row
  x <- as.numeric(df[i,])       # subset/extract that row into a numeric vector
  y <- is.na(x)                 # create logical vector of NAs
  z <- !is.na(x)                # create logical vector of non-NAs
  result <- mean(x[z])          # get the mean value of the row
  df2[i,y] <- result            # replace NAs in that row
}
# 3. Show output with imputed row mean values.
print(df) # before
print(df2) # after
Here's a possible vectorized approach (without any loop)
indx <- which(is.na(df), arr.ind = TRUE)
df[indx] <- rowMeans(df, na.rm = TRUE)[indx[,"row"]]
Some explanation
We can identify the locations of the NAs using the arr.ind parameter in which. Then we can simply index df (by the row and column indexes) and the row means (only by the row indexes), and replace the values accordingly.
Data:
set.seed(102)
rdata <- matrix(rbinom(30,5,prob=0.5),nrow=3)
rdata[cbind(1:3,2:4)] <- NA
df <- as.data.frame(rdata)
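A small worked example may help to see what `indx` holds and why indexing the row means by its "row" column lines things up. The data frame `df_demo` is made up for illustration.

```r
df_demo <- data.frame(a = c(1, NA), b = c(NA, 4), c = c(3, 8))

# One line per NA cell, with its row and column position
indx <- which(is.na(df_demo), arr.ind = TRUE)
indx[, "row"]   # rows of the NA cells: 2, then 1
indx[, "col"]   # columns of the NA cells: 1, then 2

# Row means ignoring NAs: row 1 is (1 + 3) / 2 = 2, row 2 is (4 + 8) / 2 = 6
rmeans <- rowMeans(df_demo, na.rm = TRUE)

# Each NA cell receives the mean of the row it sits in
df_demo[indx] <- rmeans[indx[, "row"]]
df_demo
```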
This is a little trickier than I'd like -- it relies on the column-major ordering of matrices in R as well as the recycling of the row-means vector to the full length of the matrix. I tried to come up with a sweep() solution but didn't manage so far.
rmeans <- rowMeans(df,na.rm=TRUE)
df[] <- ifelse(is.na(df),rmeans,as.matrix(df))
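The recycling argument can be checked on a small made-up matrix `m`: because matrices are stored column by column, a vector with one value per row, recycled to the full length, repeats the row means down every column.

```r
m <- matrix(c(1, NA,  3,
              4,  5, NA), nrow = 2, byrow = TRUE)

rmeans <- rowMeans(m, na.rm = TRUE)   # 2 and 4.5

# What recycling does implicitly: the row means repeated down each column
recycled <- matrix(rmeans, nrow = nrow(m), ncol = ncol(m))
recycled[, 1]   # 2 4.5, and the same in every other column

# ifelse picks the recycled row mean wherever the cell is NA
filled <- ifelse(is.na(m), rmeans, m)
filled
```

Here `filled` has rows 1, 2, 3 and 4, 5, 4.5. The trick would not carry over to column means, since those do not line up with the column-major order.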
One possibility is impute from Hmisc, which lets you choose any function for the imputation:
library(Hmisc)
t(sapply(split(df2, row(df2)), impute, fun=mean))
Also, you can hide the loop in an apply
t(apply(df2, 1, function(x) {
  mu <- mean(x, na.rm=TRUE)  # mean of the row's non-missing values
  x[is.na(x)] <- mu          # fill that row's NAs with it
  x
}))
I have a dataframe consisting of a series of paired columns. Here is a small example.
df1 <- as.data.frame(matrix(sample(0:1000, 36*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(rep(1:12, each=30))
df3 <- as.data.frame(matrix(sample(0:500, 36*10, replace=TRUE), ncol=1))
df4 <- as.data.frame(c(rep(5:12, each=30),rep(1:4, each=30)))
df5 <- as.data.frame(matrix(sample(0:200, 36*10, replace=TRUE), ncol=1))
df6 <- as.data.frame(c(rep(8:12, each=30),rep(1:7, each=30)))
Example <- cbind(df1,df2,df3,df4,df5,df6)
What I would like to do is find an average value for the odd-numbered columns (df1, df3, df5) based on the values in the adjacent column, so in the example I would have three sets of averages, one for each value between 1 and 12. I have managed to apply a function for a specific pair of columns...
Example_two <- cbind(df1,df2)
colnames (Example_two) <- c("x","y")
tapply(Example_two$x, Example_two$y, mean)
However, the data frame I will be looking at will be considerably larger, so some form of apply function would be ideal to do this iteratively across each paired set. I have found a similar problem, "Is there a R function that applies a function to each pair of columns?", but I can't seem to apply it to my own dataset.
Any help would be much appreciated, thank you in advance.
Try
mapply(function(x,y) tapply(x,y, FUN=mean) ,
       Example[seq(1, ncol(Example), 2)], Example[seq(2, ncol(Example), 2)])
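Or, instead of seq(1, ncol(Example), 2) and seq(2, ncol(Example), 2), just use the logical indexes c(TRUE, FALSE) for the first set of columns and c(FALSE, TRUE) for the second; they are recycled across all the columns.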
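As a sketch of the logical-index variant, here it is on a tiny made-up data frame (`demo` with two value/group pairs), small enough to check by hand:

```r
demo <- data.frame(v1 = c(1, 3, 5, 7),     g1 = c(1, 1, 2, 2),
                   v2 = c(10, 20, 30, 40), g2 = c(2, 1, 2, 1))

# c(TRUE, FALSE) is recycled over the columns and keeps columns 1, 3, ...
# c(FALSE, TRUE) keeps columns 2, 4, ...
res <- mapply(function(x, y) tapply(x, y, FUN = mean),
              demo[c(TRUE, FALSE)], demo[c(FALSE, TRUE)])
res
# one row per group value, one column per value column:
# v1: group 1 -> 2,  group 2 -> 6
# v2: group 1 -> 30, group 2 -> 20
```

mapply can only simplify the result to a matrix when every grouping column contains the same set of values, as it does in the question (1 to 12 in each); otherwise it returns a list.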
I cannot get around the problem of creating the differentials of every variable (column) in "adat" and saving it to a matrix "dfmtx".
I just need to automate the following sequence so that it runs for each column in "adat", then name each resulting vector after the two columns that were subtracted from each other and place it into a column of "dfmtx".
In "adat" I have 14 columns and 26 rows not including the header.
dfmtx[,1]=(adat[,1]-adat[,1])
dfmtx[,2]=(adat[,1]-adat[,2])
dfmtx[,3]=(adat[,1]-adat[,3])
dfmtx[,4]=(adat[,1]-adat[,4])
dfmtx[,5]=(adat[,1]-adat[,5])
dfmtx[,6]=(adat[,1]-adat[,6])
.....
dfmtx[,98]=(adat[,14]-adat[,14])
Any help would be appreciated thank you!
If adat is a data.frame, you can use outer to get the combinations of columns and then take the difference between each pairwise subset of columns based on the index from outer. It is not clear how you got "98" columns. After removing the diagonal and lower triangular elements, 14 columns leave 14 * 13 / 2 = 91 unique pairs, so the number of columns will be "91".
nm1 <- outer(colnames(adat), colnames(adat), paste, sep="_")
indx1 <- which(lower.tri(nm1, diag=TRUE))
res <- outer(1:ncol(adat), 1:ncol(adat),
             function(x,y) adat[,x]-adat[,y])
colnames(res) <- nm1
res1 <- res[-indx1]
dim(res1)
#[1] 26 91
data
set.seed(24)
adat <- as.data.frame(matrix(sample(1:20, 26*14,
                      replace=TRUE), ncol=14))
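As an alternative sketch, not the approach above, the 91 pairs can also be enumerated directly with combn and the differences taken on a matrix copy of the data. The names `pairs` and `m` are introduced here only for illustration.

```r
set.seed(24)
adat <- as.data.frame(matrix(sample(1:20, 26*14,
                      replace=TRUE), ncol=14))

m <- as.matrix(adat)
pairs <- combn(ncol(m), 2)   # 2 x 91 matrix: each column is one pair (i, j), i < j

# Column k of the result is column i minus column j for the k-th pair
dfmtx <- m[, pairs[1, ]] - m[, pairs[2, ]]
colnames(dfmtx) <- paste(colnames(m)[pairs[1, ]],
                         colnames(m)[pairs[2, ]], sep = "_")

dim(dfmtx)
#[1] 26 91
```

The column names follow the same "first_second" pattern as nm1, so dfmtx[, "V1_V2"] holds adat[,1] - adat[,2].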