Averaging values between paired columns across a large data frame - r

I have a dataframe consisting of a series of paired columns. Here is a small example.
df1 <- as.data.frame(matrix(sample(0:1000, 36*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(rep(1:12, each=30))
df3 <- as.data.frame(matrix(sample(0:500, 36*10, replace=TRUE), ncol=1))
df4 <- as.data.frame(c(rep(5:12, each=30),rep(1:4, each=30)))
df5 <- as.data.frame(matrix(sample(0:200, 36*10, replace=TRUE), ncol=1))
df6 <- as.data.frame(c(rep(8:12, each=30),rep(1:7, each=30)))
Example <- cbind(df1,df2,df3,df4,df5,df6)
What I would like to do is find an average value for the odd-numbered columns (df1, df3, df5) grouped by the values in the adjacent column, so in the example I would have three sets of averages, one for each value between 1 and 12. I have managed to apply a function to a specific pair of columns...
Example_two <- cbind(df1,df2)
colnames (Example_two) <- c("x","y")
tapply(Example_two$x, Example_two$y, mean)
However, the dataframe I will be looking at will be considerably larger, so some form of apply function would be ideal to perform this iteratively across each paired set. I have found a similar problem ("Is there a R function that applies a function to each pair of columns?"), but I can't seem to apply it to my own dataset.
Any help would be much appreciated, thank you in advance.

Try
mapply(function(x,y) tapply(x,y, FUN=mean) ,
Example[seq(1, ncol(Example), 2)], Example[seq(2, ncol(Example), 2)])
Or instead of seq(1, ncol(Example), 2) just use c(TRUE, FALSE) and c(FALSE, TRUE) for the second case
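To make the c(TRUE, FALSE) variant concrete, here is a minimal sketch on a smaller stand-in for Example. The column names v1, g1, v2, g2 and the seed are made up for illustration; the only assumption is that value and grouping columns alternate.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Small stand-in for Example: value columns alternate with grouping columns
Example <- data.frame(
  v1 = sample(0:1000, 12, replace = TRUE), g1 = rep(1:3, each = 4),
  v2 = sample(0:500, 12, replace = TRUE),  g2 = rep(c(2, 3, 1), each = 4)
)

# c(TRUE, FALSE) is recycled across the columns and keeps columns 1, 3, ...;
# c(FALSE, TRUE) keeps columns 2, 4, ...
res <- mapply(function(x, y) tapply(x, y, FUN = mean),
              Example[c(TRUE, FALSE)], Example[c(FALSE, TRUE)])

res  # one row per group (1 to 3), one column per value column
```

Because tapply sorts the groups, the rows line up across pairs even when the grouping columns are ordered differently, as g1 and g2 are here.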

Related

How to compute the correlation between a vector and each column of a data.frame

Hey, I am having a bit of a misunderstanding and need a little guidance. I want to compute the correlation between a vector (or df with 1 column) and each line of a dataframe.
I made a graphic for a better understanding:
[Image: https://ibb.co/51Fk5KB]
Every row has a date that matches a unique date (as.Date) in the other dataframe. Because I want to compute the correlation in a rolling window of 12 months, I run:
df1 <- read.zoo(df1)
df2 <- read.zoo(df2)
new_df <- rollapplyr(??????????, 12, function(x) cor(x[, 1], x[, 2]), by.column = TRUE, fill = NA)
new_df <- fortify.zoo(new_df)
Now I ask you: what do I have to insert in the ?????????? spot? Or do I even have to change/add something else?
You can calculate the correlation between a vector and the columns of a dataframe like so: cor(vector, dataframe)
Example
Create a vector and dataframe :
set.seed(1234)
vec <- (runif(150, 0, 10))
iris2 <- iris[,c(1:4)] # 150 x 4 dataframe
Now calculate correlations
cor(vec, iris2)
# Correlations
# -0.0187099581910839078691 -0.0233219261874525844724 -0.0063229780212239634907 0.0138003706052788940178
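The answer above covers the whole series but not the 12-month rolling window from the question. A rough base R sketch of the rolling case follows; it deliberately uses a plain sapply loop instead of zoo's rollapplyr, and it assumes one row per month, so a window of 12 rows stands for 12 months. The names width, ends and roll are made up for illustration.

```r
set.seed(1234)
vec   <- runif(150, 0, 10)
iris2 <- iris[, 1:4]   # 150 x 4 dataframe
width <- 12            # window length in rows (assumed: one row per month)

# For each window ending at row i, correlate vec with every column of iris2
ends <- width:nrow(iris2)
roll <- t(sapply(ends, function(i) {
  idx <- (i - width + 1):i
  cor(vec[idx], iris2[idx, ])[1, ]   # 1 x 4 matrix reduced to a named vector
}))
rownames(roll) <- ends   # label each row by the last row of its window

head(roll, 3)
```

Each row of roll holds the four correlations for one right-aligned window, so the first 11 rows of the original data have no result.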

Calculating the row-means with certain conditions

Let's say I have a matrix like so:
df <- matrix(data = c(1,2,9,3,7,NA,4,NA,NA,NA,NA,NA), nrow=4, ncol=3, byrow=T)
What I want to calculate are the row means of the matrix, where a row is not allowed to have more than one NA. In this case the end result would be a vector of four components, specifically c(4,5,NA,NA).
I can make separate vectors that meet the requirements like so:
df1 <- df[c(which(rowSums(is.na(df))<=1)),]
df2 <- df[c(which(rowSums(is.na(df))>1)),]
rowMeans(df1, na.rm=T)
rowMeans(df2, na.rm=F)
But I can't seem to figure out a good way to have just one vector.
We can set the rows that have more than one NA to NA, and then do the rowMeans with na.rm=TRUE
df[rowSums(is.na(df))>1,] <- NA
rowMeans(df, na.rm=TRUE)
Or we can do this in one step
rowMeans(df, na.rm=TRUE)*NA^(rowSums(is.na(df))>1)
Or another option would be to create an index for getting the rowMeans
i1 <- !(rowSums(is.na(df)) > 1)
ifelse(i1, rowMeans(df, na.rm=TRUE), NA_real_)
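As a quick check, the sketch below runs the first and last approaches on the matrix from the question. One detail worth knowing: rowMeans with na.rm=TRUE returns NaN rather than NA for a row that is entirely NA, so the first approach needs one extra step if a literal NA is wanted. The names too_many, a, m1 and m3 are made up for illustration.

```r
df <- matrix(c(1, 2, 9, 3, 7, NA, 4, NA, NA, NA, NA, NA),
             nrow = 4, ncol = 3, byrow = TRUE)
too_many <- rowSums(is.na(df)) > 1   # TRUE for rows 3 and 4

# First approach, applied to a copy so that df itself is left untouched
a <- df
a[too_many, ] <- NA
m1 <- rowMeans(a, na.rm = TRUE)      # all-NA rows come back as NaN
m1[is.nan(m1)] <- NA                 # optional: turn NaN into NA

# Index-based approach with ifelse()
m3 <- ifelse(!too_many, rowMeans(df, na.rm = TRUE), NA_real_)

m1  # 4 5 NA NA
m3  # 4 5 NA NA
```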

Substituting missing values based on both row and column averages

As far as I know, missing data (NAs) in a data frame can be substituted by either row- or column-based averages. What I'm trying to do in R (though I'm not sure it's possible) is calculate averages for missing cells that are based on both the row and the column in which the missing cell is located. I was wondering if you had any suggestions.
Here is the sample data with NA's:
nr <- 50
mm <- t(matrix(sample(0:4, nr * 15, replace = TRUE), nr))
mm[,c(4,7,12,13)]<-NA
mm[c(3,5,8,9,10,13),]<-NA
Assuming that the OP wanted to replace the NA element based on the row/column averages of that index, we get the row/column index using which with arr.ind=TRUE ('ind'). Get the colMeans and rowMeans of the dataset ('df') subsetted by the columns of 'ind', and replace the NA elements by the average of the corresponding elements of 'c1' and 'r1'.
ind <- which(is.na(df), arr.ind=TRUE)
c1 <- colMeans(df[,ind[,2]], na.rm=TRUE)
r1 <- rowMeans(df[ind[,1],], na.rm=TRUE)
df[ind] <- colMeans(rbind(c1, r1))
Or, as @thelatemail suggested, we can use outer to get the combinations of colMeans and rowMeans and then replace the NA values based on that.
ind <- is.na(df)
df[ind] <- (outer(rowMeans(df,na.rm=TRUE), colMeans(df,na.rm=TRUE), `+`)/2)[ind]
data
set.seed(24)
df <- as.data.frame(matrix( sample(c(NA, 0:5), 10*10, replace=TRUE), ncol=10))
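To see what the outer approach computes, here is a small worked sketch on a made-up 3 x 3 matrix with a single NA (the matrix m and the names rmeans, cmeans, est and filled are illustrative only). The missing cell gets the average of its row mean and its column mean.

```r
m <- matrix(c(1,  2, 3,
              4, NA, 8,
              7, 12, 9), nrow = 3, byrow = TRUE)

rmeans <- rowMeans(m, na.rm = TRUE)   # row 2: (4 + 8) / 2 = 6
cmeans <- colMeans(m, na.rm = TRUE)   # column 2: (2 + 12) / 2 = 7

# est[i, j] is the average of row mean i and column mean j
est <- outer(rmeans, cmeans, `+`) / 2

filled <- m
filled[is.na(m)] <- est[is.na(m)]     # only the NA cells are replaced
filled[2, 2]                          # 6.5
```

Note that if an entire row or column is NA, as in the question's sample data, its mean is NaN and the filled value stays missing, so those cases would need separate handling.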

Formatting dataframes for statistical analyses

What I would like to do is test the statistical relationship between one response and one explanatory variable. To do this, I assumed a one-way ANOVA was an effective procedure. However, my dataframe is not set up to do this. I have one column for a response variable (df1) but several columns that would be categorised into the explanatory variable I want (df2 and df3 below). As a crude example, df2 and df3 represent a season (summer) in 2 separate locations. How would I test the influence of summer on the response variable in this instance?
df1 <- as.data.frame(matrix(sample(0:1000, 36*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(matrix(sample(0:500, 36*10, replace=TRUE), ncol=1))
df3 <- as.data.frame(matrix(sample(0:200, 36*10, replace=TRUE), ncol=1))
Example <- cbind(df1,df2,df3)
Would this involve restructuring the dataframe so that df2 and df3 merge to become one long column and double the length of df1?
Thank you in advance for any help!
As suggested by Jaap and Andrew Taylor, the problem was formatting a linear regression. This was achieved through the 'stack' and 'cbind' functions.
df1 <- as.data.frame(matrix(sample(0:1000, 36*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(matrix(sample(0:500, 36*10, replace=TRUE), ncol=1))
df3 <- as.data.frame(matrix(sample(0:200, 36*10, replace=TRUE), ncol=1))
Example <- cbind(df2,df3)
Stacked <- stack(Example)
Combined <- cbind(df1,Stacked)
colnames(Combined) <- c("Response","Explanatory","Variable")
Linear <- lm(Response ~ Explanatory, data = Combined)
summary(Linear)
stack puts all the explanatory variables (df2 and df3) into one column, whilst cbind combines this new column with the values from the response (df1), which are recycled so that both have an equal number of rows, as per SabDeM's comment.
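One thing to watch: df2 and df3 above both end up with the column name V1, so the indicator column that stack creates cannot tell them apart. A minimal sketch of the same reshaping with distinct column names and the replication of the response written out explicitly is below. The names site_a, site_b, response and sites, and the seed, are made up for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
n <- 360
response <- sample(0:1000, n, replace = TRUE)
sites <- data.frame(site_a = sample(0:500, n, replace = TRUE),
                    site_b = sample(0:200, n, replace = TRUE))

# stack() returns two columns: values, and ind (a factor of source columns)
Stacked <- stack(sites)

# Repeat the response once per stacked column instead of relying on recycling
Combined <- data.frame(Response    = rep(response, times = ncol(sites)),
                       Explanatory = Stacked$values,
                       Variable    = Stacked$ind)

table(Combined$Variable)  # n rows for each of site_a and site_b
```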

R: Add columns to a data frame on the fly

New to R and programming in general over here. I have several binary matrices of presence/absence data for species (columns) and plots (rows). I'm trying to use them in several dissimilarity indices, which require that they all have the same dimensions. Although there are always 10 plots, there is a variable number of columns based on which species were observed at that particular time. My attempt to add the 'missing' columns to each matrix so I can perform the analyses went as follows:
df1 <- read.csv('file1.csv', header=TRUE)
df2 <- read.csv('file2.csv', header=TRUE)
newCol <- unique(append(colnames(df1),colnames(df2)))
diff1 <- setdiff(newCol,colnames(df1))
diff2 <- setdiff(newCol,colnames(df2))
for (i in 1:length(diff1)) {
df1[paste(diff1[i])]
}
for (i in 1:length(diff2)) {
df2[paste(diff2[i])]
}
No errors are thrown, but df1 and df2 both remain unchanged. I suspect my issue is with my use of paste, but I couldn't find any other way to add columns to a data frame on the fly like that. When added, the new columns should have 0s in the matrix as well, but I think that's the default, so I didn't add anything to specify it.
Thanks all.
Using your code, you can generate the columns without the for loop by:
df1[, diff1] <- 0 #I guess you want `0` to fill those columns
df2[, diff2] <- 0
identical(sort(colnames(df1)), sort(colnames(df2)))
#[1] TRUE
Or if you want to combine the datasets into one, you could use rbindlist from data.table with fill=TRUE
library(data.table)
rbindlist(list(df1, df2), fill=TRUE)
data
set.seed(22)
df1 <- as.data.frame(matrix(sample(0:1, 10*6, replace=TRUE), ncol=6,
dimnames=list(NULL, sample(paste0("Species", 1:10), 6, replace=FALSE))))
set.seed(35)
df2 <- as.data.frame(matrix(sample(0:1, 10*8, replace=TRUE), ncol=8,
dimnames=list(NULL, sample(paste0("Species", 1:10),8 , replace=FALSE))))
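For a version that is easy to check by eye, here is a minimal sketch of the zero-filling step on two tiny made-up data frames (the species columns and values are illustrative only). It also reorders the columns at the end, which is an extra step beyond the answer above, on the assumption that the dissimilarity functions expect the columns in the same order.

```r
df1 <- data.frame(Species1 = c(1, 0), Species3 = c(0, 1))
df2 <- data.frame(Species2 = c(1, 1), Species3 = c(0, 0), Species4 = c(0, 1))

all_cols <- union(colnames(df1), colnames(df2))

# Assigning to column names that do not exist yet creates them, filled with 0
df1[, setdiff(all_cols, colnames(df1))] <- 0
df2[, setdiff(all_cols, colnames(df2))] <- 0

# Put both data frames in the same column order
df1 <- df1[, sort(all_cols)]
df2 <- df2[, sort(all_cols)]

identical(colnames(df1), colnames(df2))
```

The original for loop only selected columns (df1[...]) without assigning anything, which is why the data frames never changed.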
