I am trying to build a function that takes a numeric vector of homework scores (of length n), and an optional logical argument drop, to compute a single homework value. If drop = TRUE, the lowest HW score must be dropped.
Step 1: a function to get the average.
get_average <- function(x, na.rm = TRUE) {
  if (na.rm == TRUE) {
    x = remove_missing(x)
  }
  total <- 0
  for (n in 1:length(x)) {
    total = total + x[n]
  }
  return(total / length(x))
}
Step 2: put it all together.
score_homework <- function(x, drop = TRUE) {
  if (drop == TRUE) x = drop_lowest(x)
  get_average(x)
}
However, I keep getting this error: Error in score_homework() : argument "x" is missing, with no default
I'm not sure this is what you want, but here goes.
First generate some dummy data:
# Set seed
set.seed(1234)
# Generate dummy homework data with <NA> values
homework <- c(rep(NA, 20), rnorm(n = 100, mean = 50, sd = 10))
# Have a quick look
hist(homework)
Then we write the function:
# Make function
homework_func <- function(data, drop = TRUE) {
  # Remove NA
  data <- data[!is.na(data)]
  # Calculate the average depending on whether 'drop' is TRUE or FALSE
  if (drop == TRUE) {
    # note: this drops every score tied for the minimum, not just one
    data <- data[data > min(data)]
    mean(data)
  } else {
    mean(data)
  }
}
# Use function with 'drop = TRUE'
homework_func(data = homework, drop = TRUE)
#> [1] 48.65349
# Use function with 'drop = FALSE'
homework_func(data = homework, drop = FALSE)
#> [1] 48.43238
Here is a function to eliminate the lowest score that's less complicated than the version in the original post. I sort the scores in descending order in case there is a tie for the lowest score; in that case, we should only remove one instance of the lowest score. Also, you're really better off using R's mean() function than writing your own.
scores <- c(78,93,61,NA,61,83,92,95,NA,100)
removeMinScore <- function(x) {
  x <- x[order(-x)]  # order descending (NAs are placed last)
  x <- x[!is.na(x)]  # remove NAs
  x[-length(x)]      # return all but the lowest score; removes only 1 tied value
}
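For the scores vector above, the function keeps one instance of the tied lowest score:
removeMinScore(scores)
#> [1] 100  95  93  92  83  78  61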
That said, if you must write your own version of mean(), here is a simpler approach that takes advantage of existing R functions.
TIP: Since is.na() returns a vector of TRUE and FALSE values, you can sum its negation, !is.na(x), to count the number of non-missing values in a vector.
mymean <- function(x) {sum(x, na.rm=TRUE) / sum(!is.na(x))}
The results look like this:
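mymean(scores)
#> [1] 82.875
mean(scores, na.rm = TRUE)  # the built-in mean() agrees
#> [1] 82.875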
The modified version of score_homework() would be:
score_homework <- function(x, drop = TRUE) {
  if (drop == TRUE) return(mean(removeMinScore(x), na.rm = TRUE))
  else mean(x, na.rm = TRUE)
}
The results from testing the function are as follows:
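score_homework(scores)
#> [1] 86
score_homework(scores, drop = FALSE)
#> [1] 82.875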
I've been trying to randomly subsample my Seurat object.
I'm interested in subsampling based on 2 columns: condition and cell type. I have 5 conditions and 5 cell types. My main goal is to have 1000 cells for each cell type in each condition.
I've tried this so far:
First thing is subsetting my seurat object:
my.list <- list(hipo.c1.neurons = hipo %>%
                  subset(., condition %in% "c1" & group %in% "Neurons"),
                hipo.c1.oligo = hipo %>%
                  subset(., condition %in% "c1" & group %in% "Oligod"),
                ...etc...)
And then subsample it using sample function:
set.seed(0)
my.list.sampled <- lapply(X = my.list, FUN = function(x) {
  x <- x[, sample(ncol(x), 1000, replace = FALSE)]
})
And I get this error, since some objects have fewer than 1000 cells: error in evaluating the argument 'j' in selecting a method for function '[': cannot take a sample larger than the population when 'replace = FALSE'
Then I've tried with this function:
lapply_with_error <- function(X, FUN, ...) {
  lapply(X, function(x, ...) tryCatch(FUN(x, ...),
                                      error = function(e) NULL))
}
But then it gives me 0 in those objects that have fewer than 1000 cells. What would be the way to skip the objects that have fewer than 1000 cells and leave them as they are (i.e., not sample those)?
Is there a simpler way to do this, so I don't have to subset all of my objects separately?
I can't say for certain without seeing your data, but could you just add an if statement in the function? It looks like you're sampling column-wise, so check the number of columns. Just return x if the number of columns is less than the number you'd like to sample.
set.seed(0)
my.list.sampled <- lapply(X = my.list, FUN = function(x) {
  if (ncol(x) > 1000) {
    x <- x[, sample(ncol(x), 1000, replace = FALSE)]
  } else {
    x
  }
})
You could make it more flexible if you want to sample something other than 1000.
set.seed(0)
my.list.sampled <- lapply(X = my.list, B = 1000, FUN = function(x, B) {
  if (ncol(x) > B) {
    x <- x[, sample(ncol(x), B, replace = FALSE)]
  } else {
    x
  }
})
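As for avoiding the manual subsetting: a possible shortcut (just a sketch, assuming condition and group are metadata columns on the object; the split.group column name is made up here) is to combine the two columns and let Seurat's SplitObject() build the list in one call:
# combine condition and cell type into one metadata column, then split on it
hipo$split.group <- paste(hipo$condition, hipo$group, sep = "_")
my.list <- SplitObject(hipo, split.by = "split.group")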
Thanks for any help in advance. I have a dataset with correlation values in a column called 'exit' and corresponding sample sizes (n) in a column called 'samplesize' in a data frame called 'dataset'.
My task is to create an R script to populate two full columns (CIleft and CIright) with the confidence interval outputs using the CIr function within the "psychometric" package for each row of data. This CIr function operates as follows, outputting the left and right confidence interval values:
CIr(r = .9, n = 100, level = .95)
[1] 0.8546667 0.9317133
Below is my unsuccessful script.
CI <- function(x)
{
  require(psychometric)
  library(psychometric)
  r <- x["dataset$exit"]
  n <- x["dataset$samplesize"]
  results <- CIr(r, n, level = .95)
  x["dataset$CIleft"] <- results[1]
  x["dataset$CIright"] <- results[2]
}
One complication (which I believe may be relevant) is that test runs of "CI(x)" in the console produce the following errors:
Error in CIz(z, n, level) : (list) object cannot be coerced to type 'double'
Then entering dataset2 <- as.matrix(dataset) and trying CI(x) again yields:
Error in dataset2$exit : $ operator is invalid for atomic vectors
And for
dataset3 <- lapply(dataset$exit, as.numeric)
dataset4 <- lapply(dataset$samplesize, as.numeric)
trying CI(x) again yields:
Error in 1 + x : non-numeric argument to binary operator
Can anyone assist in helping me populate each row of my data frame with the appropriate output for CIleft and CIright, given that r = 'exit', and n = 'samplesize'?
I don't think you need a function.
library("psychometric")
dataset$lwr = NULL
dataset$upr = NULL
for (row in 1:nrow(dataset)){
dataset[["lwr"]][row] <- CIr(r = dataset[["exit"]][row], n = dataset[["samplesize"]][row], level = .95)[1]
dataset[["upr"]][row] <- CIr(r = dataset[["exit"]][row], n = dataset[["samplesize"]][row], level = .95)[2]
}
I will note, though, that it's generally advisable to avoid for loops in R because of its architecture (i.e., they're slow); an apply-family alternative is sketched below. However, if you only have a small data frame, the speed cost of using a for loop is unlikely to be noticeable.
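Something along these lines should also work (a sketch, relying on CIr() returning a length-2 vector of the lower and upper bounds, so that mapply() simplifies the results to a 2-row matrix):
# one CIr() call per row, collected into a 2 x nrow(dataset) matrix
ci <- mapply(function(r, n) CIr(r = r, n = n, level = .95),
             dataset$exit, dataset$samplesize)
dataset$lwr <- ci[1, ]
dataset$upr <- ci[2, ]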
Test Data:
set.seed(55)
m = rnorm(26, 20, 40)
dataset = data.frame(exit = seq(0, 1, 0.04), samplesize = abs(round(m)))
dataset$samplesize[dataset$samplesize == 0] = 5
dataset$exit[dataset$exit == 1] = 0.99
I have the following function taken from R: iterative outliers detection (this is an updated version):
dropout <- function(x) {
  outliers <- NULL
  res <- NULL
  if (length(x) < 2) return(1)
  vals <- rep.int(1, length(x))
  r <- chisq.out.test(x)
  while (r$p.value < .05 & sum(vals == 1) > 2) {
    if (grepl("highest", r$alternative)) {
      d <- which.max(ifelse(vals == 1, x, NA))
      res <- rbind(list(as.numeric(strsplit(r$alternative, " ")[[1]][3]), as.numeric(r$p.value)), fill = TRUE)
    }
    else {
      d <- which.min(ifelse(vals == 1, x, NA))
    }
    vals[d] <- r$p.value
    r <- chisq.out.test(x[vals == 1])
  }
  return(res)
}
The problem is that in each round, the row it produces to fill the data.frame can contain missing values.
I want to fill res, but in some iterations it contains missing values.
I have tried everything I could think of, e.g. rbindlist, rbind.fill, and rbind (with fill=TRUE), but nothing works.
When I do something like:
res <- c(res, as.numeric(strsplit(r$alternative, " ")[[1]][3]), as.numeric(r$p.value))
it works, but it creates 2 rows for each set of (V1, V2): one with r$alternative in the last column, and a second row with the same first 2 columns but with the p-value in the last column instead.
This is how I'm calling the function, on data similar to the data in the question linked above:
outliers <- d[, dropout(V3), list(V1, V2)]
and I always get this error: j doesn't evaluate to the same number of columns for each group
I have a large dataset and have defined outliers to be those values which fall either above the 99th or below the 1st percentile.
I'd like to take the mean of those outliers with their previous and following datapoints, then replace all 3 values with that average in a new dataset.
If there's anyone who knows how to do this I'd be very grateful for a response.
If you have a list of indices specifying the outliers' locations in the vector, e.g. obtained using:
out_idx = which(df$value > quan0.99)
You can do something like:
for (idx in out_idx) {
  vec[(idx - 1):(idx + 1)] = mean(vec[(idx - 1):(idx + 1)])
}
You can wrap this in a function, making the bandwidth and the summary function optional parameters:
average_outliers = function(vec, outlier_idx, bandwidth = 1, func = "mean") {
  # iterate over outliers
  for (idx in outlier_idx) {
    # slicing of arrays can be used for extracting information or, as here,
    # for assigning values to that slice; do.call invokes e.g. the mean
    # function with the slice of the vector as input
    vec[(idx - bandwidth):(idx + bandwidth)] =
      do.call(func, list(vec[(idx - bandwidth):(idx + bandwidth)]))
  }
  return(vec)
}
allowing you to also use the median with a bandwidth of 2. Using this function:
# Apply average_outliers twice,
# first for the 0.99 quantile, then for the 0.01 quantile.
vec = average_outliers(vec, which(vec > quan0.99))
vec = average_outliers(vec, which(vec < quan0.01))
or:
vec = average_outliers(vec, which(vec > quan0.99), bandwidth = 2, func = "median")
vec = average_outliers(vec, which(vec < quan0.01), bandwidth = 2, func = "median")
to use a bandwidth of 2 and replace with the median value.
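A minimal end-to-end sketch (the quan0.99/quan0.01 cutoffs are my assumption here, computed with quantile(); note the function expects no outlier at the very first or last position of the vector):
set.seed(1)
vec <- rnorm(1000)
quan0.99 <- quantile(vec, 0.99)  # 99th percentile cutoff
quan0.01 <- quantile(vec, 0.01)  # 1st percentile cutoff
vec <- average_outliers(vec, which(vec > quan0.99))
vec <- average_outliers(vec, which(vec < quan0.01))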
I am trying to implement a Chebyshev filter to smooth a time series but, unfortunately, there are NAs in the data series.
For example,
t <- seq(0, 1, len = 100)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
I am using a Chebyshev filter: cf1 = cheby1(5, 3, 1/44, type = "low")
I am trying to filter the time series excluding the NAs, without messing up the order/positions. I have already tried na.rm=TRUE, but it seems there is no such argument.
Then
z <- filter(cf1, x) # apply filter
Thank you guys.
Try using x <- x[!is.na(x)] to remove the NAs, then run the filter.
You can remove the NAs beforehand using the complete.cases function. You might also consider imputing the missing data; check out the mtsdi or Amelia II packages.
EDIT:
Here's a solution with Rcpp. This might be helpful if speed is important:
require(inline)
require(Rcpp)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
NAs <- x
x2 <- x[!is.na(x)]
#do something to x2
src <- '
Rcpp::NumericVector vecX(vx);
Rcpp::NumericVector vecNA(vNA);
int j = 0; // counter for vx
for (int i = 0; i < vecNA.size(); i++) {
  if (!(R_IsNA(vecNA[i]))) {
    // replace and update j
    vecNA[i] = vecX[j];
    j++;
  }
}
return Rcpp::wrap(vecNA);
'
fun <- cxxfunction(signature(vx = "numeric", vNA = "numeric"),
                   src, plugin = "Rcpp")
if (identical(x, fun(x2, NAs)))
  print("worked")
# [1] "worked"
I don't know if ts objects can have missing values, but if you just want to re-insert the NA values, you can use insert() from R.utils (see ?insert). There might be a better way to do this.
install.packages(c('R.utils', 'signal'))
require(R.utils)
require(signal)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA, NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA)
cf1 = cheby1(5, 3, 1/44, type = "low")
xex <- na.omit(x)
z <- filter(cf1, xex) # apply
z <- as.numeric(z)
for (m in attributes(xex)$na.action) {
  z <- insert(z, ats = m, values = NA)
}
all.equal(is.na(z), is.na(x))
Here is a function you can use to filter a signal with NAs in it.
The NAs are ignored rather than replaced by zero.
You can then specify a maximum percentage of weight which the NAs may take at any point of the filtered signal. If there are too many NAs (and too few actual data) at a specific point, the filtered signal itself will be set to NA.
# This function applies a filter to a time series with potentially missing data
filter_with_NA <- function(x,
                           window_length = 12, # will be applied centrally
                           myfilter = rep(1/window_length, window_length), # a boxcar filter by default
                           max_percentage_NA = 25) # percentage of weight created by NAs that should not be exceeded
{
  # make the signal longer on both sides
  signal <- c(rep(NA, window_length), x, rep(NA, window_length))
  # see where data are present and not NA
  present <- is.finite(signal)
  # replace the NA values by zero
  signal[!is.finite(signal)] <- 0
  # apply the filter (stats::filter, explicitly, so the signal package
  # cannot mask it)
  filtered_signal <- as.numeric(stats::filter(signal, myfilter, sides = 2))
  # find out which percentage of the filtered signal was created by non-NA values
  # (this is easy because the filter is linear)
  original_weight <- as.numeric(stats::filter(present, myfilter, sides = 2))
  # where this is lower than one, the signal is now artificially smaller
  # because we added zeros - compensate for that
  filtered_signal <- filtered_signal / original_weight
  # but where too few values are present, discard the signal
  filtered_signal[100*(1 - original_weight) > max_percentage_NA] <- NA
  # cut away the padding we previously inserted on the left and right
  filtered_signal <- filtered_signal[(window_length + 1):(window_length + length(x))]
  return(filtered_signal)
}
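A quick usage sketch with the example series from the question (note that the stats::filter call above performs a convolution, so myfilter must be a plain vector of weights, not a cheby1() filter object):
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
smoothed <- filter_with_NA(x, window_length = 12)
plot(x, type = "l")           # original series, line broken at the NA
lines(smoothed, col = "red")  # smoothed series, NA handled internally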