Normalize data with its row mean in R

I am running into some difficulty after trying to divide each element in a given row by that row's mean. A dummy set of data:
set.seed(1)
x <- cbind(Plant = letters[1:5],
as.data.frame(matrix(rnorm(60), ncol = 12)))
x
Therefore, for Plant a, I would like V1, V2...V12 to be divided by the mean of that row.
I thought it could be done using:
x/rowMeans(x)
But I get the error:
Error in rowMeans(x) : 'x' must be numeric
I assume that this error is due to the format of the data, because it's a data.frame and not a vector. I however managed to calculate the mean per row, by changing the data's format:
library(data.table)
x.T <- as.data.table(x)
x.T[,list(Mean=rowMeans(.SD)), by=Plant]
From there, I am not sure where to go. I am thinking that a loop would work, but doing some searches, I see where it is not advised. I would therefore like to have the normalized data for each sample Plant. Any suggestions please?

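The first error comes from including the Plant column, which is non-numeric, when taking the row means. Exclude it from the calculation and bind it back afterwards (naming the column keeps the header as Plant rather than x$Plant):
cbind(Plant = x$Plant, x[,-1]/rowMeans(x[,-1]))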
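A quick way to check the result, sketched with the question's dummy data: after dividing by the row means, every row of the numeric part should average to 1. The names num and normalized are only for illustration and are not from the original answer.

```r
set.seed(1)
x <- cbind(Plant = letters[1:5],
           as.data.frame(matrix(rnorm(60), ncol = 12)))

num <- x[, -1]                                   # numeric columns only
normalized <- cbind(Plant = x$Plant, num / rowMeans(num))

# Each row of the numeric part should now have mean 1
rowMeans(normalized[, -1])
```

Dividing a data frame by a vector of length nrow recycles the vector down each column, which is why every element ends up divided by its own row's mean.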

Related

Removing outliers in time series rasters per pixel in R

Basically, I have a time-series of rasters in a stack. Here is my workflow:
Convert the stack to a data frame so each row represents a pixel, and each column represents a date. This process is fairly straightforward, so no issues here.
For each row (pixel), identify outliers and set them to NA. So in this case, I want to define what counts as an outlier. For example, let's say I want to set all the values larger than the 75th percentile to NA. The goal is that when I calculate the mean, the outliers don't affect the calculation. The outliers in this case are several orders of magnitude higher, so they influence the mean significantly.
I got some help online and came up with this code:
my_data %>%
rowwise() %>%
mutate(across(is.numeric, ~ if (. > as.numeric(quantile(across(), .75, na.rm=TRUE))) NA else .))
The problem is that since it is a raster, there are a lot of NA values in some rows that I need the quantile function to ignore while evaluating the cells.
Using na.rm=TRUE seemed to be the solution, but now I am encountering a new error
Error: Problem with mutate() input ..1.
i ..1 = across(...).
x missing value where TRUE/FALSE needed
i The error occurred in row 1.
I understand that to get around this, I need to tell the if function to ignore the value if it is NA, but the dplyr syntax is very complicated for me, so I need some help on how to do this.
Looking forward to learning more and if there is a better way to do what I'm trying to do. I don't think I did a good job explaining it but, hopefully the code helps.
When asking an R question, you should always include some example data. Either create data with code (see below) or use a file that ships with R (do not use dput if it can be avoided). See the help files that ship with R, or other questions on this site for examples and inspiration.
Example data:
library(terra)
r <- rast(ncols=10, nrows=10, nlyr=10)
set.seed(1)
v <- runif(size(r))
v[sample(size(r), 100)] <- NA
values(r) <- v
Solution:
First write a function that does what you want, and works with a vector
f <- function(x) {
q <- quantile(x, .75, na.rm=TRUE)
x[x>q] <- NA
x
}
Now apply it to the raster data
x <- app(r, f)
With the raster package it would go like
library(raster)
rr <- brick(r)
xx <- calc(rr, f)
Note that you should not create a data.frame, but if you did you could do something like dd <- t(apply(d, 1, f))
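To see what the row-wise variant does without needing a raster package, here is a minimal base R sketch. The matrix d is made-up data standing in for a pixel-by-date table; it is an assumption for illustration, not the asker's data.

```r
# Same function as in the answer: values above the 75th percentile become NA
f <- function(x) {
  q <- quantile(x, 0.75, na.rm = TRUE)
  x[x > q] <- NA
  x
}

# Hypothetical pixel-by-date matrix: 3 pixels (rows), 5 dates (columns)
d <- rbind(c(1, 2, 3, 4, 1000),
           c(NA, 5, 6, 7, 900),
           c(2, NA, NA, 3, 800))

dd <- t(apply(d, 1, f))        # apply f to every row, then restore orientation
rowMeans(dd, na.rm = TRUE)     # row means without the very large values
```

Because f works on a whole vector at once, there is no if() on a single value and therefore no "missing value where TRUE/FALSE needed" error when a cell is NA.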

Finding Mean of a column in an R Data Set, by using FOR Loops to remove Missing Values

I have a data set with Air Quality Data. The Data Frame is a matrix of 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally I want to do that using Control Structures (for loops and if-else loops)
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
if(is.na(data.frame[i,1]) == FALSE){
New.Vec <- c(x[i,1])
}
}
print(mean(New.Vec))
I expected the output to be the mean. Though the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that you want to fill the NA elements with the mean of the corresponding column. By default, na.aggregate uses FUN = mean.
can't see your data, but probably like this? the vector needed to be initialized. better to avoid loops in R when you can...
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in seq_len(nrow(myDataFrame))){
if(!is.na(myDataFrame[i,1])){
New.Vec <- c(New.Vec, myDataFrame[i,1])
}
}
print(mean(New.Vec))
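If the exercise really requires control structures, a running sum and count avoids growing a vector inside the loop. This is only a sketch using the question's example vector y; the names total, n and loop_mean are made up for illustration.

```r
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)

total <- 0   # running sum of the non-missing values
n <- 0       # count of the non-missing values
for (value in y) {
  if (!is.na(value)) {
    total <- total + value
    n <- n + 1
  }
}

loop_mean <- total / n
loop_mean                      # should agree with mean(y, na.rm = TRUE)
```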

R: Sampling observations by factors using tapply

I have a large data set I am attempting to sample rows from. Each row has a family ID, and there may be one or multiple rows for each family ID. I want to parse the data set by randomly sampling one row for each family ID. I have attempted to accomplish this by using both tapply() and split() + lapply() functions, but to no avail. Below is code that reproduces my issue - the size and scope of the factor levels and data entries mirror the data set I am working with.
set.seed(63)
f1 <- factor(c(rep(30000:32000, times=1),
rep(30500:31700, times = 2),
rep(30900:31900, times = 3)))
f2 <- factor(rep(sample(1:7, replace = TRUE), times = length(f1)/7))
x1 <- round(matrix(rnorm(length(f1)*300), nrow = length(f1), ncol = 300),3)
df <- data.frame(f1, f2, x1)
Next, I used tapply to sample one row per factor from f1, and then check for repeats. (f2 is a secondary factor that indexes another aspect of the observations, but is [hopefully] irrelevant here; I only include it for full disclosure of the structure of my data set).
s1 <- tapply(1:nrow(df), df$f1, sample, size=1)
any(duplicated(s1))
The output for the second line of code using duplicated is TRUE, which means there are repeats. Stumped, I tried split to see if that was the problem.
df.split <- split(1:nrow(df), df$f1)
any(duplicated(df.split))
The output here for duplicated is FALSE, so the problem is not split. I then used the output df.split with lapply and sample to see if the problem was with tapply.
df.unique <- unlist(lapply(df.split, sample, size = 1, replace = FALSE,
prob = NULL))
any(duplicated(df.unique))
In the first line, I sampled one value from each element of df.split which outputs a list, then I used unlist to convert into a vector. The output for duplicated here is also TRUE.
Somewhere within sample and lapply there is funky stuff going on (since tapply merely calls lapply). I'm not sure how to fix the issue (I searched SO and Google and found nothing related to my issue), so any help would be greatly appreciated!
EDIT: I'm hoping someone could tell me why the above code using tapply and lapply is not working as intended. Arthur has provided a nice answer, and I have coded a loop for sample as well. I'm wondering why the above code is misbehaving.
I would do this:
library(data.table)
data.table(df)[,.SD[sample(.N,1)],by='f1']
... but your original approach with tapply is actually faster if you just want an index and not the actual subset table. Note, however, that sample(x) samples from 1:x when x is a single number (length(x) == 1), so a family with only one row can return any row index up to that number, which is where the duplicates come from. See ?sample. This version is error-proof:
s1 <- tapply(1:nrow(df), list(df$f1), function(v) v[sample(1:length(v), 1)])
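The pitfall is easier to see on a tiny example. The sketch below uses a hypothetical helper, sample_one, and a made-up factor g; neither comes from the original post.

```r
# Hypothetical helper: pick one element of v, even when length(v) == 1
sample_one <- function(v) v[sample.int(length(v), 1)]

set.seed(63)
g <- factor(c("a", "b", "b", "c", "c", "c"))

# sample(10, 1) may return anything in 1:10, but sample_one(10) is always 10
sample_one(10)

# One row index per group; group "a" has a single row, so its index must be 1
idx <- tapply(seq_along(g), g, sample_one)
idx
any(duplicated(idx))
```

Indexing into v with a sampled position means the value itself is never interpreted as an upper bound, so single-row groups always return their own row.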

Covariance matrices by group, lots of NA

This is a follow up question to my earlier post (covariance matrix by group) regarding a large data set. I have 6 variables (HML, RML, FML, TML, HFD, and BIB) and I am trying to create group specific covariance matrices for them (based on variable Group). However, I have a lot of missing data in these 6 variables (not in Group) and I need to be able to use that data in the analysis - removing or omitting by row is not a good option for this research.
I narrowed the data set down into a matrix of the actual variables of interest with:
MMatrix = MMatrix2[1:2187,4:10]
This worked fine for calculating an overall covariance matrix with:
cov(MMatrix, use="pairwise.complete.obs",method="pearson")
So to get this to list the covariance matrices by group, I turned the original data matrix into a data frame (so I could use the $ indicator) with:
CovDataM <- as.data.frame(MMatrix)
I then used the following suggested code to get covariances by group, but it keeps returning NULL:
cov.list <- lapply(unique(CovDataM$group),function(x)cov(CovDataM[CovDataM$group==x,-1]))
I figured this was because of my NAs, so I tried adding use = "pairwise.complete.obs" as well as use = "na.or.complete" (when desperate) to the end of the code, and it only returned NULLs. I read somewhere that "pairwise.complete.obs" could only be used if method = "pearson", but adding that at the end didn't make a difference either. I need to get covariance matrices of these variables by group, with all the available data included if possible, and I am way stuck.
Here is an example that should get you going:
# Create some fake data
m <- matrix(runif(6000), ncol=6,
dimnames=list(NULL, c('HML', 'RML', 'FML', 'TML', 'HFD', 'BIB')))
# Insert random NAs
m[sample(6000, 500)] <- NA
# Create a factor indicating group levels
grp <- gl(4, 250, labels=paste('group', 1:4))
# Covariance matrices by group
covmats <- by(m, grp, cov, use='pairwise')
The resulting object, covmats, is a list with four elements (in this case), which correspond to the covariance matrices for each of the four groups.
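An equivalent sketch with split() and lapply(), in case a plain named list is easier to work with than a by object. The data here are a smaller made-up version of the fake data above (three of the six variables, 50 rows per group), and cov_list is an illustrative name.

```r
set.seed(1)
# Smaller fake data: 200 rows, three of the variables
m <- matrix(runif(600), ncol = 3,
            dimnames = list(NULL, c("HML", "RML", "FML")))
m[sample(600, 50)] <- NA                       # insert random NAs
grp <- gl(4, 50, labels = paste("group", 1:4)) # group indicator

# Split the rows by group, then compute a pairwise-complete covariance per group
cov_list <- lapply(split(as.data.frame(m), grp),
                   cov, use = "pairwise.complete.obs")

names(cov_list)
dim(cov_list[["group 1"]])
```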
Your problem is that lapply is treating your list oddly. If you run this code (which I hope is pretty much analogous to yours):
CovData <- matrix(1:75, 15)
CovData[3,4] <- NA
CovData[1,3] <- NA
CovData[4,2] <- NA
CovDataM <- data.frame(CovData, "group" = c(rep("a",5),rep("b",5),rep("c",5)))
colnames(CovDataM) <- c("a","b","c","d","e", "group")
lapply(unique(as.character(CovDataM$group)), function(x) print(x))
You can see that lapply is evaluating the list in a different manner than you intend. The NAs don't appear to be the problem. When I run:
by(CovDataM[ ,1:5], CovDataM$group, cov, use = "pairwise.complete.obs", method = "pearson")
It seems to work fine. Hopefully that generalizes to your problem.

How can I get each numeric column's mean in one data?

I have data named cluster_1. Its first three columns are nominal variables.
# select the columns based on the clustering results
cluster_1 <- mat[which(groups==1),]
m_cluster_1 <- mean(cluster_1[c(-(1:3))])
With the last statement, I get a single mean over all of the columns. However, what I want is to attach the mean of each variable (column) to the bottom of that column.
How can I do that? Please let me know.
colMeans() will give you the mean of each column in a data frame or matrix. And rbind() can be used to append the result.
rbind(cluster_1[, -(1:3)], colMeans(cluster_1[, -(1:3)]))
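A generalization of what you are doing can be found in the function addmargins. Note that addmargins is documented for tables and arrays, so a data frame may need as.matrix() first. Try, for example: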
cluster_1Means <- addmargins(cluster_1[, -(1:3)], margin = 1, FUN = mean)
cluster_1Means
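A small worked sketch of the colMeans() plus rbind() approach. The data frame below is a made-up stand-in for cluster_1 (three nominal columns followed by two numeric ones), and num and with_means are illustrative names.

```r
# Hypothetical stand-in for cluster_1: three nominal columns, two numeric ones
cluster_1 <- data.frame(id     = c("a", "b", "c"),
                        site   = c("x", "y", "z"),
                        type   = c("p", "q", "r"),
                        height = c(1, 2, 3),
                        weight = c(10, 20, 60))

num <- cluster_1[, -(1:3)]                  # drop the nominal columns
with_means <- rbind(num, colMeans(num))     # append the column means as a row
rownames(with_means)[nrow(with_means)] <- "mean"
with_means
```

Labelling the appended row makes it harder to mistake the means for an observation later on.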
