I am trying to calculate the mean, IQR, and CV. The data set is "flights_DTW" and the column of interest is "DEP_DELAY_NEW". NA values have not been removed.
Hi, is it possible to calculate the CV using the following code:
mean(flights_DTW$DEP_DELAY_NEW, na.rm = TRUE)
mean(flights_DTW$ARR_DELAY_NEW, na.rm = TRUE)
IQR(flights_DTW$DEP_DELAY_NEW, na.rm = TRUE)
IQR(flights_DTW$ARR_DELAY_NEW, na.rm = TRUE)
CV(flights_DTW$DEP_DELAY_NEW, na.rm = TRUE)
CV(flights_DTW$ARR_DELAY_NEW, na.rm = TRUE)
cat(sprintf("16.76 = %.2f", flights_DTW$DEP_DELAY_NEW))
I came up with the following result:
[1] 16.75676
[1] 16.43083
[1] 8
[1] 9
Error in CV(flights_DTW$DEP_DELAY_NEW, na.rm = TRUE) :
could not find function "CV"
What I want is to put everything in a single command.
If you don't want to install a package just for one function, you can define your own cv function as:
CV <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
> CV(mtcars$mpg)
[1] 0.2999881
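To get everything from one command, as the question asks, one option is to wrap the three statistics in a small helper and apply it to both columns with sapply. This is only a sketch: summary_stats is a hypothetical name, and the small data frame below is a made-up stand-in for flights_DTW, which is not shown in the question.

```r
# Coefficient of variation: standard deviation divided by the mean
CV <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# Hypothetical helper: mean, IQR and CV of one numeric vector
summary_stats <- function(x, na.rm = TRUE) {
  c(mean = mean(x, na.rm = na.rm),
    IQR  = IQR(x, na.rm = na.rm),
    CV   = CV(x, na.rm = na.rm))
}

# Made-up stand-in for flights_DTW (values are illustrative only)
flights_demo <- data.frame(
  DEP_DELAY_NEW = c(0, 5, NA, 12, 40),
  ARR_DELAY_NEW = c(0, 3, 8, NA, 35)
)

# One call returns a matrix: one row per statistic, one column per variable
stats <- sapply(flights_demo[c("DEP_DELAY_NEW", "ARR_DELAY_NEW")], summary_stats)
print(round(stats, 2))
```

With the real data, flights_demo would simply be replaced by flights_DTW.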
Related
I'm trying to set up details for which function to run and which arguments to include at the start of my script, to then later call the function. I'm having trouble specifying arguments to be input into the function.
I have a fixed object
v <- c(1,2,3,5,6,7,8,9,NA)
I want to specify which measurement function I will use as well as any relevant arguments.
Example 1:
chosenFunction <- mean
chosenArguments <- "trim = 0.1, na.rm = T"
Example 2:
chosenFunction <- median
chosenArguments <- "na.rm = F"
Then I want to be able to run this specified function
chosenFunction(v, chosenArguments)
Unfortunately, I can't just put in the string chosenArguments and expect the function to run. Is there any alternative way to specify the arguments to my function?
Updated answer based on OP's clarifications
chosenFunction <- mean
get_summary <- function(x, fun, ...) fun(x, ...)
v <- 1:100
get_summary(v, chosenFunction, na.rm = TRUE)
# [1] 50.5
Later on if you want to change the function
chosenFunction <- median
get_summary(v, chosenFunction, na.rm = TRUE)
# [1] 50.5
Original answer
get_summary <- function(x, chosenFunction, ...) chosenFunction(x, ...)
v <- 1:100
get_summary(v, mean, na.rm = TRUE, trim = 0.1)
# [1] 50.5
get_summary(v, median, na.rm = TRUE)
# [1] 50.5
By using ..., you don't have to specify all of the arguments
get_summary(v, mean, na.rm = TRUE)
# [1] 50.5
If we want to calculate mean, we do it by
mean(v, na.rm = TRUE, trim = 0.1)
#[1] 5.125
Another way is by using do.call
do.call(mean, list(v, na.rm = TRUE, trim = 0.1))
#[1] 5.125
We can leverage this fact and create a named list for chosenArguments and use it in do.call
chosenFunction <- mean
chosenArguments <- list(na.rm = TRUE, trim = 0.1)
do.call(chosenFunction, c(list(v), chosenArguments))
#[1] 5.125
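The same idea can be moved to the top of a script by keeping each function together with its argument list. A minimal sketch, where settings and run_measure are hypothetical names that do not appear in the question:

```r
v <- c(1, 2, 3, 5, 6, 7, 8, 9, NA)

# Hypothetical configuration: each entry pairs a function with its arguments
settings <- list(
  trimmed_mean = list(fun = mean,   args = list(trim = 0.1, na.rm = TRUE)),
  plain_median = list(fun = median, args = list(na.rm = FALSE))
)

# Hypothetical helper: prepend the data to the stored arguments and call
run_measure <- function(x, setting) {
  do.call(setting$fun, c(list(x), setting$args))
}

res_mean   <- run_measure(v, settings$trimmed_mean)
res_median <- run_measure(v, settings$plain_median)  # NA, because na.rm = FALSE
print(res_mean)
print(res_median)
```

Storing the arguments as a named list rather than a string is what lets do.call match them by name.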
This is my first time using any custom functions, so bear with me. I made a function for standard error that I'd like to use with aggregate. It worked until I tried to exclude NAs.
Dummy data frame to work with:
se <- function(x) sd(x)/sqrt(length(x))
df <- data.frame(site = c('N','N','N','S','S','S'),
                 birds = c(NA,4,2,9,3,1),
                 worms = c(2,1,2,4,0,5))
means <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = mean)
error <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = se)
So aggregate worked before I excluded NAs (e.g. error <- aggregate(df[,2:3], list(site = df$site), FUN = se)), and it works when finding the mean (using the rest of the values to take the mean and ignoring the missing value). How can I exclude NAs in that same manner when using my custom se function?
The problem is that you do not have an explicit argument for na.rm in your se function. If you add that to your function, it should work:
se <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / sqrt(sum(!is.na(x)))
}
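With the NA-aware version in place, the aggregate call no longer needs na.rm at all, because the default is TRUE. A short check using the dummy data frame from the question:

```r
# Standard error that ignores NAs both in sd() and in the sample size
se <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / sqrt(sum(!is.na(x)))
}

df <- data.frame(site  = c('N', 'N', 'N', 'S', 'S', 'S'),
                 birds = c(NA, 4, 2, 9, 3, 1),
                 worms = c(2, 1, 2, 4, 0, 5))

# One row per site, one standard error per numeric column
error <- aggregate(df[, 2:3], by = list(site = df$site), FUN = se)
print(error)
```

Dividing by sqrt(sum(!is.na(x))) rather than sqrt(length(x)) matters here: the sample size should count only the values that sd() actually used.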
Trying to get my head around Non-Standard Evaluation as used by dplyr but without success. I'd like a short function that returns summary statistics (N, mean, sd, median, IQR, min, max) for a specified set of variables.
Simplified version of my function...
my_summarise <- function(df = temp,
to.sum = 'eg1',
...){
## Summarise
results <- summarise_(df,
n = ~n(),
mean = mean(~to.sum, na.rm = TRUE))
return(results)
}
And running it with some dummy data...
set.seed(43290)
temp <- cbind(rnorm(n = 100, mean = 2, sd = 4),
rnorm(n = 100, mean = 3, sd = 6)) %>% as.data.frame()
names(temp) <- c('eg1', 'eg2')
mean(temp$eg1)
[1] 1.881721
mean(temp$eg2)
[1] 3.575819
my_summarise(df = temp, to.sum = 'eg1')
n mean
1 100 NA
N is calculated, but the mean is not, can't figure out why.
Ultimately I'd like my function to be more general, along the lines of...
my_summarise <- function(df = temp,
group.by = 'group',
to.sum = c('eg1', 'eg2'),
...){
results <- list()
## Select columns
df <- dplyr::select_(df, .dots = c(group.by, to.sum))
## Summarise overall
results$all <- summarise_each(df,
funs(n = ~n(),
mean = mean(~to.sum, na.rm = TRUE)))
## Summarise by specified group
results$by.group <- group_by_(df, ~to.group) %>%
summarise_each(df,
funs(n = ~n(),
mean = mean(~to.sum, na.rm = TRUE)))
return(results)
}
...but before I move on to this more complex version (for which I was using this example for guidance), I need to get the evaluation working in the simple version first, as that's the stumbling block. The call to dplyr::select() works OK.
Appreciate any advice as to where I'm going wrong.
Thanks in advance
The basic idea is that you have to actually build the appropriate call yourself, most easily done with the lazyeval package.
In this case you want to programmatically create a call that looks like ~mean(eg1, na.rm = TRUE). This is how:
my_summarise <- function(df = temp,
to.sum = 'eg1',
...){
## Summarise
results <- summarise_(df,
n = ~n(),
mean = lazyeval::interp(~mean(x, na.rm = TRUE),
x = as.name(to.sum)))
return(results)
}
Here is what I do when I struggle to get things working:
Remember that, just like the ~n() you already have, the call will have to start with a ~.
Write the correct call with the actual variable and see if it works (~mean(eg1, na.rm = TRUE)).
Use lazyeval::interp to recreate that call, and check this by running only the interp to visually see what it is doing.
In this case I would probably first write interp(~mean(x, na.rm = TRUE), x = to.sum). But running that gives ~mean("eg1", na.rm = TRUE), which treats eg1 as a character string instead of a variable name. So we use as.name, as taught in vignette("nse").
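The difference between substituting the string and substituting a name can be seen without dplyr. The sketch below uses base R's bquote() rather than lazyeval::interp(), so it is an analogy for what interp does here, not the same function; the small temp data frame is made up for the illustration.

```r
to.sum <- "eg1"
temp <- data.frame(eg1 = c(1, 2, 3, NA), eg2 = c(4, 5, 6, 7))

# Substituting the string: the call takes the mean of the text "eg1"
call_chr <- bquote(mean(.(to.sum), na.rm = TRUE))
print(call_chr)

# Substituting a name: the call now refers to the column eg1
call_sym <- bquote(mean(.(as.name(to.sum)), na.rm = TRUE))
print(call_sym)

# Evaluate the second call with the data frame as the environment
result <- eval(call_sym, temp)
print(result)
```

Printing the two calls side by side is the same "run only the interp and look at it" check described above.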
I have a problem with removing NAs.
This is my code:
ret12 <- read.csv2("my_file.csv", header = T)
year_order <- ret12[order(ret12$year, na.rm = T, decreasing = T), ]
# Returns:
# *Error in order(ret12$year, na.rm = T, decreasing = T) :
# argument lengths differ*
Why?
If you want to order and also remove the rows with missing values in the ret12$year column (which would normally come last in the order()-ing), then you need to order, then omit:
year.order <- ret12[order(ret12$year, decreasing = TRUE), ][1:sum(!is.na(ret12$year)), ]
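As for why the call in the question fails: order() has no na.rm argument, so na.rm = T is absorbed by ... and treated as a second sort key of length 1, which is what triggers "argument lengths differ". order() does have an na.last argument, and na.last = NA drops the missing values in the same step. A small sketch with made-up data standing in for ret12:

```r
# Made-up stand-in for the data read from the CSV file
ret12 <- data.frame(year  = c(2001, NA, 1999, 2010),
                    value = c(0.5, 0.1, 0.3, 0.9))

# na.last = NA removes rows whose year is NA while ordering
year_order <- ret12[order(ret12$year, decreasing = TRUE, na.last = NA), ]
print(year_order)
```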
I have about 30 lines of code that do just this (getting Z scores):
data$z_col1 <- (data$col1 - mean(data$col1, na.rm = TRUE)) / sd(data$col1, na.rm = TRUE)
data$z_col2 <- (data$col2 - mean(data$col2, na.rm = TRUE)) / sd(data$col2, na.rm = TRUE)
data$z_col3 <- (data$col3 - mean(data$col3, na.rm = TRUE)) / sd(data$col3, na.rm = TRUE)
data$z_col4 <- (data$col4 - mean(data$col4, na.rm = TRUE)) / sd(data$col4, na.rm = TRUE)
data$z_col5 <- (data$col5 - mean(data$col5, na.rm = TRUE)) / sd(data$col5, na.rm = TRUE)
Is there some way, maybe using apply() or something, that I can just essentially do (python):
for col in ['col1', 'col2', 'col3']:
data{col} = ... z score code here
Thanks R friends.
A data.frame is a list, thus you can use lapply. Don't use apply on a data.frame as this will coerce to a matrix.
lapply(data, function(x) (x - mean(x,na.rm = TRUE))/sd(x, na.rm = TRUE))
Or you could use scale which performs this calculation on a vector.
lapply(data, scale)
You can translate the Python-style approach directly
for(col in names(data)){
data[[col]] <- scale(data[[col]])
}
Note that this approach is not memory efficient in R, as [[<-.data.frame copies the entire data.frame each time.
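One way to cut down on the repeated assignments is to compute all z-scores with a single lapply and attach them in one step. A sketch with a made-up data frame; z_score is a hypothetical helper name, and the z_ prefix follows the question's column naming:

```r
data <- data.frame(col1 = c(1, 2, 3, NA), col2 = c(10, 20, 30, 40))

# z-score of one vector, ignoring NAs when estimating mean and sd
z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Compute every column once, then add z_col1, z_col2, ... in one assignment
z <- lapply(data, z_score)
data[paste0("z_", names(z))] <- z
print(data)
```

Unlike scale(), which returns a one-column matrix for each vector, the helper returns plain numeric vectors, so the new columns are ordinary numeric columns.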
I think you're right, apply() may be the way to go here.
For example:
data <- array(1:20, dim=c(4, 5))
data.zscores <- apply(data, 2, function(x)
(x-mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE))
The function apply() takes a matrix or array as its first argument. The "2" refers to the dimension the function is iterated over, which in our case is columns. If we wanted to do it by row, we'd go with "1". Lastly, we have the function we want to apply to each column. See ?apply for more details.
Check this out:
I iterate over the columns of the data frame and print the number of rows with an NA in each:
for(i in names(houseDF)){
print(i)
print(nrow(houseDF[is.na(houseDF[i]),]))
print("---------------------")
}
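The same counts can be obtained without a loop, since is.na() on a data frame returns a logical matrix. A sketch with a made-up houseDF, as the real one is not shown:

```r
# Made-up stand-in for houseDF
houseDF <- data.frame(price = c(100, NA, 300),
                      rooms = c(3, 2, NA),
                      city  = c("A", NA, NA))

# Number of NAs in each column, as a named vector
na_counts <- colSums(is.na(houseDF))
print(na_counts)
```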