Trouble with NA's in large dataframe - r

I'm having trouble trying to standardize my data.
So, first things first, I create the data frame object with my data, with my desired row names (and I remove the 1st column, as it is not needed).
EXPGli <-read.delim("C:/Users/i5/Dropbox/Guilherme Vergara/Doutorado/Data/Datasets/MergedEXP3.txt", row.names=2)
EXPGli <- EXPGli[,-1]
EXPGli <- as.data.frame(EXPGli)
Then I need to convert all the columns to Z-scores (each column = gene expression values; each row = sample). The idea here is to convert every gene expression value into its Z-score for each cell.
Z_score <- function(x) {(x-mean(x))/ sd(x)}
apply(EXPGli, 2, Z_score)
Which returns me [ reached 'max' / getOption("max.print") -- omitted 1143 rows ]
And now my whole data frame consists of NA cells.
Indeed, there are several NAs in the dataset, some full rows and even some columns.
I tried several approaches to remove the NAs:
EXPGli <- na.omit(EXPGli)
EXPGli %>% drop_na()
print(EXPGli[rowSums(is.na(EXPGli)) == 0, ])
na.exclude(EXPGli)
Yet apparently none of these work. Additionally, is.na(EXPGli)
returns FALSE for all fields.
I would like to understand what I am doing wrong here. It seems that the issue might be that the NAs are not being recognized by R as NA, but I couldn't find a solution for this. Any input is very much appreciated, thanks in advance!

You may want to set the argument na.rm = TRUE in your calls to mean(x) and sd(x) inside the Z_score function; otherwise these calls return NA for any vector that contains NAs, which in turn makes every standardized value in that column NA.
Z_score <- function(x) {(x-mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)}
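For example, a minimal sketch of re-running the standardization with the corrected function (EXPGli_z is just an illustrative name for the result):
EXPGli_z <- as.data.frame(apply(EXPGli, 2, Z_score))
# Note: columns that are entirely NA or constant will still come out as NaN,
# since mean() of an all-NA vector is NaN and sd() of a constant vector is 0.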

Related

Why am I getting a "number of items to replace is not a multiple of replacement length" error?

I'm currently working with a dataframe "dat." I'm trying to calculate a score using columns 69-88 (if there are values in any of those columns, then add them together and put the result in a new column called "score").
This is the code I have now:
dat$score <- 0
for (num in 69:88){
  dat$score[!is.na(dat[, num])] <- dat$score + dat[, num]
}
This gives me a column where some rows show the correct score, but other rows are returning "NA". I also have 20 warnings messages that look like so:
1: In dat$score[!is.na(dat[, num])] <- dat$score + ... :
number of items to replace is not a multiple of replacement length
Why is my code working for some rows and not for others, and why am I getting this error?
Are you looking for the rowSums() function? You just have to add the argument na.rm = TRUE. The warning appears because the left-hand side dat$score[!is.na(dat[, num])] selects only the rows where column num is not NA, while the right-hand side dat$score + dat[, num] has one value for every row, so the two lengths do not match and the replacement values get misaligned (which is also why some rows end up NA).
A solution with dplyr:
library(dplyr)
dat %>% mutate(score=rowSums(across(69:88), na.rm=TRUE))
Or with base R
dat$score<-rowSums(dat[, 69:88], na.rm=TRUE)
Use apply(); it's usually quicker than a for loop:
dat$score <- apply(dat[, 69:88], 1, sum, na.rm = TRUE)
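Either way, the NAs simply contribute nothing to each row's sum. A quick toy check (dat here is a small made-up data frame standing in for columns 69:88):
dat <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
rowSums(dat, na.rm = TRUE)   # 5 5 3: the NAs are dropped from each row's sum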

To find the mean of entire row

I need to know the meaning of each command here. This is used to find the mean of each entire row, but the data also contains character columns, so this code is used:
rowMeans(sapply(iris, function(x) as.numeric(as.character(x))), na.rm = T)
We can create an index that checks the column type and then apply rowMeans to the numeric columns:
i1 <- sapply(iris, is.numeric)
rowMeans(iris[i1], na.rm = TRUE)

Finding Mean of a column in an R Data Set, by using FOR Loops to remove Missing Values

I have a data set with air quality data. The data frame has 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally, I want to do that using control structures (for loops and if-else statements).
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
  if(is.na(data.frame[i,1]) == FALSE){
    New.Vec <- c(x[i,1])
  }
}
print(mean(New.Vec))
I expected the output to be the mean. However, the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for a for loop:
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
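For instance, with the reproducible vector y from the question:
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)
mean(y)                 # NA: a single NA makes the whole result NA
mean(y, na.rm = TRUE)   # 7.5: the NAs are dropped before averaging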
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that you want to fill the NA elements with the corresponding mean of each column; na.aggregate uses mean as its default fun.aggregate.
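A small illustration (df1 here is a made-up two-column data frame, not the asker's data):
library(zoo)
df1 <- data.frame(a = c(1, NA, 3), b = c(NA, 4, 6))
df1[] <- na.aggregate(df1)   # each NA is replaced by its column mean
df1
#   a b
# 1 1 5
# 2 2 4
# 3 3 6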
Can't see your data, but probably something like this? The vector needed to be initialized; it's better to avoid loops in R when you can...
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in 1:153){
  if(!is.na(myDataFrame[i,1])){
    New.Vec <- c(New.Vec, myDataFrame[i,1])
  }
}
print(mean(New.Vec))

How to filter columns of a matrix whose IQR is below a specific value?

filter <- apply(expressionMatrix, 2, function (x) (colIQRs(x, na.rm = TRUE) < 1.6))
"Argument x is of class numeric, should be a matrix" error was thrown. How to cope with that? I think logically this code is correct: I remove all columns, whose IQR values is less than 1.6.
How to code this technically?
colIQRs from package matrixStats requires a matrix as an input. But by wrapping it inside an apply statement, you are giving it only a single column vector at a time. The solution is to send the whole matrix to colIQRs, then subset on the result:
filter <- expressionMatrix[, colIQRs(expressionMatrix, na.rm = TRUE) < 1.6]
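As an illustration only (expressionMatrix here is simulated, not the real data), the IQRs are computed once per column on the full matrix and then used to subset it:
library(matrixStats)
set.seed(1)
expressionMatrix <- matrix(rnorm(30), nrow = 6,
                           dimnames = list(NULL, paste0("gene", 1:5)))
keep <- colIQRs(expressionMatrix, na.rm = TRUE) < 1.6   # one logical value per column
filter <- expressionMatrix[, keep, drop = FALSE]        # columns passing the cutoff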

Add Min Row in R

I have read a lot of similar questions, but cannot get any of the code to work. I simply want to add a row to the bottom of my data frame (df) that has the min value for each column for columns 2:8. Below is my code (that works) to add a total line, but I want a min line.
df[(nrow(df)+ 1),(2:8)] <- colSums(df[,2:8], na.rm=TRUE)
I have tried to get colMins in package matrixStats to work, but can't for some reason. Any help would be appreciated!
In base R, you can use sapply() to create an atomic vector of column minimums and then rbind() to attach it to the original data. I added an NA for the first value, since we need something there to add it to the original data.
rbind(df, c(NA, sapply(df[2:8], min, na.rm = TRUE)))
Obviously this assumes the data has only 8 columns, in which case df[-1] can be used in place of df[2:8].
And for a speed increase we can use vapply() over sapply() because we know the result will be a single numeric value.
rbind(df, c(NA, vapply(df[-1], min, 1, na.rm = TRUE)))
Update: In response to your comment on the other answer - to get "MIN" in the first column and the minimum values in all the rest, we can adjust the call to a named list and do it all in one go. This way we don't mix the column classes (character and numeric) and end up with unexpected classes in the columns of the resulting data.
rbind(
df,
c(setNames(list("MIN"), names(df)[1]), lapply(df[-1], min, na.rm = TRUE))
)
We can try colMins from library(matrixStats)
library(matrixStats)
rbind(df, c(NA,colMins(as.matrix(df[2:8]))))
Update
To replace NA with 'MIN',
rbind(df, c('MIN',as.list(colMins(as.matrix(df[2:8])))))
Or another approach would be to convert to matrix and use addmargins
addmargins(`row.names<-`(as.matrix(df[-1]), df$ID), 1, FUN=min)
data
set.seed(24)
df <- cbind(ID = 1:10,
            as.data.frame(matrix(sample(1:9, 7*10, replace = TRUE), ncol = 7)))
