I am trying to impute missing values by group. I get an error saying that median() requires numeric data, but all of my data is numeric, so I can't see the issue. Here is a minimal reproducible example.
set.seed(123)
cluster = sample(seq(1,10),1000,replace=TRUE)
V1 = sample(c(runif(100),NA),1000,replace=TRUE)
V2 = sample(c(runif(100),NA),1000,replace=TRUE)
df = as.data.frame(cbind(cluster,V1,V2))
df_fixed = by(df,df$cluster,function(x){replace(x,is.na(x),median(x, na.rm=TRUE))})
Error returned:
Error in median.default(x, na.rm = TRUE) : need numeric data
This code will work though, so the issue is with the median function.
df_fixed = by(df,df$cluster,function(x){replace(x,is.na(x),1)})
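df_fixed <- apply(df[,2:3], 2, function(x) {
  # per-cluster medians, ordered by cluster id
  md <- sapply(sort(unique(df$cluster)), function(k) median(x[df$cluster==k], na.rm=TRUE))
  # md[df$cluster] looks up each row's cluster median by position,
  # which works here because the cluster ids are exactly 1..10
  x[is.na(x)] <- md[df$cluster][is.na(x)]
  return(x)
})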
any(is.na(df_fixed))
# [1] FALSE
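The error in the question arises because by() hands each group to the function as a whole data frame, and median() rejects data frames. A shorter alternative, sketched here on a small made-up data frame (the helper name impute_median is hypothetical), is to let ave() apply the imputation within each cluster, one column at a time:

```r
# Toy data, made up for illustration
df <- data.frame(
  cluster = c(1, 1, 1, 2, 2, 2),
  V1 = c(1, NA, 3, 10, 20, NA),
  V2 = c(NA, 5, 7, NA, NA, 4)
)

# Hypothetical helper: replace NAs in a vector with that vector's median
impute_median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))

# ave() splits each column by cluster, applies the helper, and returns
# the values in the original row order
df_fixed <- df
df_fixed[c("V1", "V2")] <- lapply(
  df[c("V1", "V2")],
  function(col) ave(col, df$cluster, FUN = impute_median)
)

any(is.na(df_fixed))
# [1] FALSE
```

If a cluster has no observed values in a column, its median is NA and those entries stay missing.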
Let's say I have a dataframe (df) in R:
df <- data.frame(x = rnorm(5, mean = 5), u = rnorm(5, mean = 5), y = rnorm(5, mean = 5), z = rnorm(5, mean = 5))
print(df)
I want to get the mean absolute difference (MAD) between the first column (x) and the other columns.
With this function, I can find the MAD between the first column and another (the second for example):
mad <- function(dat){
  abs(mean(dat[,1] - dat[,2], na.rm = TRUE))
}
mad(dat = df)
But I want to generalize the function to apply across all of the columns. Changing the function to something like this:
mad <- function(dat) {
  abs(mean(dat[,1] - dat[,2:4], na.rm = TRUE))
}
mad(dat = df)
does not work: it returns NA with the warning "argument is not numeric or logical: returning NA"
I was thinking of using apply() across the dataframe, as that seems to be the general advice that I've found on here. But I don't understand how to keep the first column constant and subtract the other columns from the first.
We can create the function with two arguments
mad <- function(x, y) abs(mean(x - y, na.rm = TRUE))
and use sapply/lapply to loop over every column except the first, passing the first column and the current column to mad:
sapply(df[-1], function(x) mad(df[,1], x))
# u y z
#0.003399429 0.991685267 0.710553411
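Here is another option without defining a mad function. Note that it takes the mean of the absolute differences, whereas mad above takes the absolute value of the mean difference, so the two can give different numbers: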
sapply(abs(df[-1] - df[["x"]]), mean, na.rm = TRUE)
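A quick check on made-up numbers (the toy data frame below is an assumption for illustration) shows how abs(mean(...)) and mean(abs(...)) behave when differences of opposite sign cancel:

```r
# Toy data, made up for illustration
toy <- data.frame(x = c(1, 2, 3), u = c(2, 1, 3), y = c(2, 3, 4))

# Absolute value of the mean difference (what mad() in the question computes)
abs_of_mean <- sapply(toy[-1], function(col) abs(mean(toy[["x"]] - col, na.rm = TRUE)))

# Mean of the absolute differences
mean_of_abs <- sapply(abs(toy[-1] - toy[["x"]]), mean, na.rm = TRUE)

abs_of_mean  # u = 0,   y = 1
mean_of_abs  # u = 2/3, y = 1
```

For column u the differences are -1, 1 and 0, so they cancel in the first version but not in the second. Which one is wanted depends on the intended definition of MAD.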
I'm new to R and I'm coming up against a problem that makes me very puzzled about how the language works. There is a package, "validate", that can create objects you can use to check that your data is as expected.
Testing it out on some toy data, I found that while the following code worked as expected:
library(validate)
I <- indicator(
  cnt_misng = number_missing(x)
  , sum = sum(x, na.rm = TRUE)
  , min = min(x, na.rm = TRUE)
  , mean = mean(x, na.rm = TRUE)
  , max = max(x, na.rm = TRUE)
)
dat <- data.frame(x=1:4, y=c(NA,11,7,8), z=c(NA,2,0,NA))
C <- confront(dat, I)
values(C)
However, I found that I could not create a function that would return an indicator object for any arbitrary column of the data frame. This was my failed attempt:
check_values <- function(data, x){
  print(x)
  I <- indicator(
    cnt_misng = number_missing(eval(x))
    , max = max(eval(x), na.rm = TRUE)
  )
  C <- confront(df, I)
  return(C)
}
df <- data.frame(A=1:4, B=c(NA,11,7,8), C=c(NA,2,0,NA))
C <- check_values(df,'B')
values(C)
If I have a large dataset, I'd like to be able to loop through a list of columns and have an identically formatted report for each one in the list. At this point, I'll probably give up on this package and find another way to more directly do that. However, I am still curious how this could be made to work. It seems like there should be a way to functionalize the creation of this indicator object so I can reuse the code to check the same stats for any arbitrary column of a data frame.
Any ideas?
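Two things stand out in the attempt above: x holds the string 'B' rather than the column itself, and confront() is called on the global df instead of the data argument. How to build validate indicators programmatically is left open here. As a fallback, the same per-column report can be sketched in base R without the package (column_report is a hypothetical helper name, and only two of the statistics are shown):

```r
# Hypothetical helper: the same statistics for any column, selected by name
column_report <- function(data, col) {
  x <- data[[col]]
  c(cnt_misng = sum(is.na(x)), max = max(x, na.rm = TRUE))
}

df <- data.frame(A = 1:4, B = c(NA, 11, 7, 8), C = c(NA, 2, 0, NA))

# One identically formatted row per column
report <- t(sapply(names(df), function(col) column_report(df, col)))
report
```

Each row of report corresponds to one column of df, so looping over a longer list of column names only means changing names(df) to that list.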
I have a small, simple data set, and I want to apply a set of functions to each column (variable) of the data frame using sapply. Below is the code from R Blogs:
multi.sapply <- function(...) {
  arglist <- match.call(expand.dots = FALSE)$...
  var.names <- sapply(arglist, deparse)
  has.name <- (names(arglist) != "")
  var.names[has.name] <- names(arglist)[has.name]
  arglist <- lapply(arglist, eval.parent, n = 2)
  x <- arglist[[1]]
  arglist[[1]] <- NULL
  result <- sapply(arglist, function (FUN, x) sapply(x, FUN), x)
  colnames(result) <- var.names[-1]
  return(result)
}
Since I am a novice R user, I would like to know how to modify the above code when the data has missing (NA) values. For example:
multi.sapply(mydata,mean, median, min, max)
works fine but yields NA for variables that have missing values.
The following call, however, gives me this error message:
multi.sapply(mydata,mean, median, valid.n, min, max, na.rm = TRUE)
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'FUN' of mode 'function' was not found
Your help would be much appreciated!
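multi.sapply treats every argument after the first as a function, so na.rm = TRUE is itself handed to sapply as if it were a function to apply. A minimal sketch of an alternative, assuming the functions are passed as a named list so that extra arguments can be forwarded separately (multi_stats and the toy mydata are hypothetical):

```r
# Hypothetical helper: apply each function in `funs` to every column of
# `data`, forwarding extra arguments such as na.rm = TRUE to each function
multi_stats <- function(data, funs, ...) {
  sapply(funs, function(f) sapply(data, f, ...))
}

# Toy data, made up for illustration
mydata <- data.frame(a = c(1, 2, NA, 4), b = c(10, NA, 30, 40))

res <- multi_stats(
  mydata,
  list(mean = mean, median = median, min = min, max = max),
  na.rm = TRUE
)
res
```

The result has one row per variable and one column per function. Any function added to the list must accept an na.rm argument when it is passed this way.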
I have a data frame df with two columns, col1 and col2, both of which contain NA values. I need to calculate the mean and sd of each. I have calculated them separately with the code below.
# Random generation
set.seed(12)
df <- data.frame(col1 = sample(1:100, 10, replace=FALSE),
col2 = sample(1:100, 10, replace=FALSE))
# Introduce NA values
df$col1[c(3,5,9)] <- NA
df$col2[c(3,6)] <- NA
# sapply where the function returns a single value
stat <- data.frame(Mean=numeric(length = length(df)), row.names = colnames(df))
stat[,'Mean'] <- as.data.frame(sapply(df, mean, na.rm=TRUE))
stat[,'Sd'] <- as.data.frame(sapply(df, sd, na.rm=TRUE))
I have tried to do both operations at a single time using the below code.
#sapply with return more than one value
stat[,c('Mean','Sd')] <- as.data.frame(t(sapply(c(1:length(df)),function(x)
return(c(mean(df[,x]), sd(df[,x]))))))
Because I did not remove the NA values in this last call, I get NA for both mean and sd.
How can I remove the NA values for each function (mean, sd)? Suggestions for other smart ways to do this are also welcome.
Here is an option:
funs <- list(sd=sd, mean=mean)
sapply(funs, function(x) sapply(df, x, na.rm=T))
Produces:
sd mean
col1.value 39.34826 39.42857
col2.value 28.33946 51.625
If you want to get cute with the functional package (which provides Curry):
library(functional)
sapply(funs, Curry(sapply, X=df), na.rm=TRUE)
Does the same thing.
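The single-pass attempt from the question can also be repaired directly, by passing na.rm = TRUE inside the anonymous function and iterating over the columns themselves rather than their indices. A sketch, reusing the question's data generation:

```r
set.seed(12)
df <- data.frame(col1 = sample(1:100, 10, replace = FALSE),
                 col2 = sample(1:100, 10, replace = FALSE))
df$col1[c(3, 5, 9)] <- NA
df$col2[c(3, 6)] <- NA

# One pass over the columns; the function returns a named vector, so
# sapply() builds a matrix that t() turns into one row per column
stat <- as.data.frame(t(sapply(df, function(x) {
  c(Mean = mean(x, na.rm = TRUE), Sd = sd(x, na.rm = TRUE))
})))
stat
```

Naming the elements of the returned vector gives the columns Mean and Sd for free, and the row names come from the column names of df.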
I have the following function:
Fisher.test <- function(p) {
  Xsq <- -2*sum(log(p), na.rm=TRUE)
  p.val <- 1-pchisq(Xsq, df = 2*length(p))
  return(p.val)
}
I assumed that na.rm=TRUE would deal with the NAs in my data. However, when I test the function with simple values, the behaviour is not what I expected. For example:
Fisher.test(c(0.1,0.4,0.1,NA))
[1] 0.199279
Fisher.test(c(0.1,0.4,0.1))
[1] 0.08705891
Why don't I get the same result in the first case as in the second? Shouldn't na.rm=TRUE remove the NA?
Many thanks
Because length(p) still counts the NA: na.rm=TRUE only affects sum(), so the first call uses df = 2*4 = 8 while the second uses df = 2*3 = 6. If you just want to exclude NAs, use sum(!is.na(p)) instead of length(p). Since log() returns NaN for negative values, which sum(..., na.rm=TRUE) also drops, I'd use sum(p >= 0, na.rm = TRUE) instead (or just sum(!is.na(log(p))) to let R figure out the details itself):
Fisher.test <- function(p) {
  Xsq <- -2*sum(log(p), na.rm=TRUE)
  p.val <- 1-pchisq(Xsq, df = 2*sum(p >= 0, na.rm = TRUE))
  return(p.val)
}
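Another way to keep the sum and the degrees of freedom in step is to drop the missing p-values once, before anything else is computed. A minimal sketch (fisher_combined is a hypothetical name; unlike the version above it does not filter negative inputs):

```r
# Hypothetical variant: remove missing p-values up front so that the sum
# and the degrees of freedom are based on the same set of values
fisher_combined <- function(p) {
  p <- p[!is.na(p)]                      # drops NA (and NaN)
  Xsq <- -2 * sum(log(p))
  # upper tail directly, equivalent to 1 - pchisq(Xsq, df)
  pchisq(Xsq, df = 2 * length(p), lower.tail = FALSE)
}

with_na    <- fisher_combined(c(0.1, 0.4, 0.1, NA))
without_na <- fisher_combined(c(0.1, 0.4, 0.1))
with_na
without_na
```

Both calls now use 6 degrees of freedom and return the same value, matching the 0.08705891 reported in the question for the vector without NA.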