retain NA in custom winsorization function with sapply - r

I am attempting to winsorize a data frame with a lot of NA entries, and I need to retain the NA entries after winsorization. The columns within the data frame that require winsorization are fine on the lower end (i.e. small values), but require some winsorization on the upper end (i.e. large values). I have created a function that almost does the job, but I can't seem to get the function to return NA entries where they occur. Here is an example.
# vector with an NA entry & an upper-end value that needs winsorization
a <- c(1:3, NA, 90)
# my function
winsor <- function(x)
{ y <- quantile(x, probs = .95, na.rm = T)
  sapply(x, function(x) {
    if (x >= y) {
      x <- y
    }
    else { x <- x }
  })
# returned vector after applying winsor to my object, a
z <- as.data.frame(winsor(a))
When I run this, I get the following error:
Error in if (x >= y) { : missing value where TRUE/FALSE needed
But it nevertheless returns the following vector (which is what I want, except that it does not return the NA entry as needed).
a
1.000
2.000
3.000
85.824
Any help will be greatly appreciated, as well as an extension to operationalize this function for an entire data frame. Hopefully this all makes sense. This is my very first post (and hopefully my example is reproducible!).

The winsor function needs a closing } at the end. I took the liberty of tweaking the code:
winsor <- function(x) {
  y <- quantile(x, probs = .95, na.rm = TRUE)
  sapply(x, function(x) {
    if (is.na(x)) {
      NA                 # keep NA entries as NA
    } else if (x >= y) {
      y                  # cap values at the 95th percentile
    } else {
      x
    }
  })
}
z <- data.frame(winsor = winsor(a))
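For an entire data frame, a vectorized version may be simpler still: pmin() propagates NA automatically, so no explicit NA check is needed. A sketch (winsor2 and df are illustrative names, not from the original post):
winsor2 <- function(x, p = 0.95) {
  # cap the upper tail at the p-th percentile; NA entries stay NA
  pmin(x, quantile(x, probs = p, na.rm = TRUE))
}

# apply column-wise to a data frame of numeric columns
df <- data.frame(u = c(1:3, NA, 90), v = c(NA, 2:4, 100))
df[] <- lapply(df, winsor2)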


Continuous Error in argument "x" is missing, with no default

I am trying to build a function that takes a numeric vector of homework scores (of length n), and an optional logical argument drop, to compute a single homework value. If drop = TRUE, the lowest HW score must be dropped.
Step 1: a function to get the average:
get_average <- function(x, na.rm = TRUE) {
  if (na.rm == TRUE) {
    x = remove_missing(x)
  }
  total <- 0
  for (n in 1:length(x)) {
    total = total + x[n]
  }
  return(total/length(x))
}
Put it all together:
score_homework <- function(x, drop = TRUE)
{
  if (drop == TRUE)
    x = drop_lowest(x)
  {get_average(x)}
}
However, I keep getting the error: Error in score_homework() : argument "x" is missing, with no default
I'm not sure this is what you want, but here goes.
First generate some dummy data:
# Set seed
set.seed(1234)
# Generate dummy homework data with <NA> values
homework <- c(rep(NA, 20), rnorm(n = 100, mean = 50, sd = 10))
# Have a quick look
hist(homework)
Then we write the function:
# Make function
homework_func <- function(data, drop = TRUE) {
  # Remove NA
  data <- data[!is.na(data)]
  # Calculate the average depending on whether 'drop' is TRUE or FALSE
  if (drop == TRUE) {
    data <- data[data > min(data)]
    mean(data)
  } else {
    mean(data)
  }
}
# Use function with 'drop = TRUE'
homework_func(data = homework, drop = TRUE)
#> [1] 48.65349
# Use function with 'drop = FALSE'
homework_func(data = homework, drop = FALSE)
#> [1] 48.43238
Here is a function to eliminate the lowest score that's less complicated than the version in the original post. I sort the scores in descending order in case there is a tie for the lowest score; in that case, we should only remove one instance of the lowest score. Also, you're really better off using R's mean() function than writing your own.
scores <- c(78, 93, 61, NA, 61, 83, 92, 95, NA, 100)
removeMinScore <- function(x) {
  x <- x[order(-x)]  # order descending (NAs are placed last)
  x <- x[!is.na(x)]  # remove NAs
  x[-length(x)]      # return all but the lowest score; removes only 1 tied value
}
That said, if you must write your own version of mean(), here is a simpler approach that takes advantage of existing R functions.
TIP: Since is.na() returns a vector of TRUE and FALSE values, you can sum its negation, sum(!is.na(x)), to count the number of non-missing values in a vector.
mymean <- function(x) {sum(x, na.rm=TRUE) / sum(!is.na(x))}
The modified version of score_homework() would be:
score_homework <- function(x, drop = TRUE) {
  if (drop == TRUE) return(mean(removeMinScore(x), na.rm = TRUE))
  else mean(x, na.rm = TRUE)
}
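Testing the function (a quick check; expected values worked out by hand from the scores vector above):
score_homework(scores)                # drops one 61, ignores NAs: 86
score_homework(scores, drop = FALSE)  # plain mean ignoring NAs: 82.875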

Write a function that returns a vector or list of three statistics

This is a question for school, but I have been working on it for hours and just need a point in the right direction. I am not asking for the full answer.
I was given a data frame with student grades for various assessments. I have to write a function that will result in a vector or list that will give the min, max, and average of one particular assessment.
I was provided with the following framework:
checkAssessment <- function(df, assessmentName)
{
}
I need to be able to write the code to get the exact results below when the following line of code is executed:
checkAssessment(df,"hw1")
# $min
# [1] 0
#
# $max
# [1] 14
#
# $avg
# [1] 12.58824
So, I have tried many ways to go about this, none of which have worked. The two that came closest were
checkAssessment <- function(df, assessmentName)
{
  my_min <- df$assessmentName == min(assessmentName)
  my_max <- df$assessmentName == max(assessmentName)
  my_avg <- df$assessmentName == mean(assessmentName)
  return(df[my_min, ])
  return(df[my_max, ])
  return(df[my_avg, ])
}
and
checkAssessment <- function(df, assessmentName)
{
  my_min <- sapply(df$assessmentName, min)
  my_max <- sapply(df$assessmentName, max)
  my_avg <- sapply(mean.default(df$assessmentName, trim = 0, na.rm = FALSE,
                                ...))
  funs = c(min, max, mean)
  return(df[my_min, ])
  return(df[my_max, ])
  return(df[my_avg, ])
}
I'm not even sure if I'm close with either of these. I'm in an introductory R course so the code should be fairly simple, but I've developed a mental block with this question.
Any help would be very much appreciated. Thank you.
Because you were given the function framework, we have to use it.
checkAssessment <- function(df, assessmentName)
{
  x <- df[[assessmentName]]  ## extract column vector
  return(list(min = min(x), max = max(x), avg = mean(x)))  ## use a list for multiple return values
}
Note:
To extract a column from a data frame by matching the column name exactly, use [[ ]]. It is OK to use $, but it does partial matching; maybe this answer can help you understand the concept.
Be aware of the base R functions min, max and mean, so that you don't need to struggle with x[x == min(x)], etc. Even if you want this logic, you can try x[which.min(x)]. Read ?which.min for more.
If you want multiple returned values, use a "list" to collect all of them. The basic way to set up a list is list(1, 2), but a list can have names; compare with list(a = 1, b = 2).
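A tiny illustration of why [[ ]] matters here (hypothetical data, just for this point):
df <- data.frame(hw1 = c(0, 14, 12))
nm <- "hw1"
df$nm     # NULL -- $ looks for a column literally named "nm"
df[[nm]]  # 0 14 12 -- [[ ]] uses the string stored in nm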
Test
We use R's built-in dataset trees for a test.
checkAssessment(trees, "Height")
#$min
#[1] 63
#$max
#[1] 87
#$avg
#[1] 76
It might also be worth pointing out where your code is problematic:
checkAssessment <- function(df, assessmentName)
{
  my_min <- df$assessmentName == min(assessmentName)
  my_max <- df$assessmentName == max(assessmentName)
  my_avg <- df$assessmentName == mean(assessmentName)
  return(df[my_min, ])
  return(df[my_max, ])
  return(df[my_avg, ])
}
First, min(assessmentName) does not make sense: assessmentName is a character string, not the column of values. Note also that df$assessmentName looks for a column literally named "assessmentName" rather than using the string stored in the variable. So perhaps you want
df[[assessmentName]] == min(df[[assessmentName]])
Then, return(df[my_min, ]) is returning a data frame, a single row but multiple columns. Maybe you want:
return(df[my_min, assessmentName])
Finally, after the above return, the following won't have any effect:
return(df[my_max, assessmentName])
return(df[my_avg, assessmentName])
because the function terminates after seeing the first return. This is why you should use a "list" to get multiple returned values.
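A two-line demonstration of that last point:
f <- function() {
  return(1)
  return(2)  # never reached; the function exits at the first return()
}
f()
# [1] 1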

Fill data.frame with missing columns

I have the following function taken from R: iterative outliers detection (this is an updated version):
library(outliers)  # for chisq.out.test()

dropout <- function(x) {
  outliers <- NULL
  res <- NULL
  if (length(x) < 2) return(1)
  vals <- rep.int(1, length(x))
  r <- chisq.out.test(x)
  while (r$p.value < .05 & sum(vals == 1) > 2) {
    if (grepl("highest", r$alternative)) {
      d <- which.max(ifelse(vals == 1, x, NA))
      res <- rbind(list(as.numeric(strsplit(r$alternative, " ")[[1]][3]), as.numeric(r$p.value)), fill = TRUE)
    }
    else {
      d <- which.min(ifelse(vals == 1, x, NA))
    }
    vals[d] <- r$p.value
    r <- chisq.out.test(x[vals == 1])
  }
  return(res)
}
The problem is that in each round some rows come back missing, and I need to fill res into a data.frame anyway; in some iterations it contains missing values.
I have tried every approach I could find, e.g. rbindlist, rbind.fill, and rbind (with fill=TRUE), but nothing works.
When I do something like:
res <- c(res, as.numeric(strsplit(r$alternative, " ")[[1]][3]), as.numeric(r$p.value))
it works, but it creates 2 rows for each set of (V1, V2): one with r$alternative in the last column, and a second row with the same first 2 columns but with the p-value in the last column instead.
This is how I'm calling the function, on data similar to the one in the question mentioned above:
outliers <- d[, dropout(V3), list(V1, V2)]
and I always get this error: j doesn't evaluate to the same number of columns for each group
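One way to avoid that data.table error is to make dropout() always return the same named columns, emitting an all-NA row when nothing is flagged. A rough sketch, untested against the original data (the column names value and p.value are illustrative):
dropout <- function(x) {
  res <- data.frame(value = numeric(0), p.value = numeric(0))
  if (length(x) < 2)
    return(data.frame(value = NA_real_, p.value = NA_real_))
  vals <- rep.int(1, length(x))
  r <- chisq.out.test(x)
  while (r$p.value < .05 && sum(vals == 1) > 2) {
    if (grepl("highest", r$alternative)) {
      d <- which.max(ifelse(vals == 1, x, NA))
      # accumulate one well-formed row per detected outlier
      res <- rbind(res, data.frame(
        value   = as.numeric(strsplit(r$alternative, " ")[[1]][3]),
        p.value = as.numeric(r$p.value)))
    } else {
      d <- which.min(ifelse(vals == 1, x, NA))
    }
    vals[d] <- r$p.value
    r <- chisq.out.test(x[vals == 1])
  }
  if (nrow(res) == 0)
    res <- data.frame(value = NA_real_, p.value = NA_real_)
  res
}
With a fixed two-column result per group, j evaluates to the same number of columns for every (V1, V2) combination.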

Replacing NAs in R with nearest value

I'm looking for something similar to na.locf() in the zoo package, but instead of always using the previous non-NA value I'd like to use the nearest non-NA value. Some example data:
dat <- c(1, 3, NA, NA, 5, 7)
Replacing NA with na.locf (3 is carried forward):
library(zoo)
na.locf(dat)
# 1 3 3 3 5 7
and na.locf with fromLast set to TRUE (5 is carried backwards):
na.locf(dat, fromLast = TRUE)
# 1 3 5 5 5 7
But I wish the nearest non-NA value to be used. In my example this means that the 3 should be carried forward to the first NA, and the 5 should be carried backwards to the second NA:
1 3 3 5 5 7
I have a solution coded up, but wanted to make sure that I wasn't reinventing the wheel. Is there something already floating around?
FYI, my current code is as follows. Perhaps if nothing else, someone can suggest how to make it more efficient. I feel like I'm missing an obvious way to improve this:
na.pos <- which(is.na(dat))
if (length(na.pos) == length(dat)) {
  return(dat)
}
non.na.pos <- setdiff(seq_along(dat), na.pos)
nearest.non.na.pos <- sapply(na.pos, function(x) {
  return(which.min(abs(non.na.pos - x)))
})
dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
To answer smci's questions below:
No, any entry can be NA
If all are NA, leave them as is
No. My current solution defaults to the lefthand nearest value, but it doesn't matter
These rows are a few hundred thousand elements typically, so in theory the upper bound would be a few hundred thousand. In reality it'd be no more than a few here & there, typically a single one.
Update: it turns out that we're going in a different direction altogether, but this was still an interesting discussion. Thanks all!
Here is a very fast one. It uses findInterval to find which two positions should be considered for each NA in your original data:
f1 <- function(dat) {
  N <- length(dat)
  na.pos <- which(is.na(dat))
  if (length(na.pos) %in% c(0, N)) {
    return(dat)
  }
  non.na.pos <- which(!is.na(dat))
  intervals <- findInterval(na.pos, non.na.pos,
                            all.inside = TRUE)
  left.pos <- non.na.pos[pmax(1, intervals)]
  right.pos <- non.na.pos[pmin(N, intervals + 1)]
  left.dist <- na.pos - left.pos
  right.dist <- right.pos - na.pos
  dat[na.pos] <- ifelse(left.dist <= right.dist,
                        dat[left.pos], dat[right.pos])
  return(dat)
}
And here I test it:
# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA
# computation times
system.time(r0 <- f0(dat)) # your function
# user system elapsed
# 5.52 0.00 5.52
system.time(r1 <- f1(dat)) # this function
# user system elapsed
# 0.01 0.00 0.03
identical(r0, r1)
# [1] TRUE
Code below. The initial question was not totally well-defined, I had asked for these clarifications:
Is it guaranteed that at least the first and/or last entries are non-NA? [No]
What to do if all entries in a row are NA? [Leave as-is]
Do you care how ties are split i.e. how to treat the middle NA in 1 3 NA NA NA 5 7? [Don't-care/ left]
Do you have an upper-bound (S) on the longest contiguous span of NAs in a row? (I'm thinking a recursive solution if S is small. Or a dataframe solution with ifelse if S is large and number of rows and cols is large.) [worst-case S could be pathologically large, hence recursion should not be used]
geoffjentry, re your solution: your bottlenecks will be the serial calculation of nearest.non.na.pos and the serial assignment dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]].
For a large gap of length G all we really need to compute is that the first (G/2, round up) items fill-from-left, the rest from right. (I could post an answer using ifelse but it would look similar.)
Are your criteria runtime, big-O efficiency, temp memory usage, or code legibility?
Coupla possible tweaks:
only need to compute N <- length(dat) once
common-case speed enhance: if (length(na.pos) == 0) skip row, since it has no NAs
if (length(na.pos) == length(dat)-1) the (rare) case where there is only one non-NA entry hence we fill entire row with it
Outline solution:
Sadly, na.locf does not work on an entire data frame; you must use sapply, row-wise:
na.fill_from_nn <- function(x) {
  row.na <- is.na(x)
  fillFromLeft <- na.locf(x, na.rm = FALSE)
  fillFromRight <- na.locf(x, fromLast = TRUE, na.rm = FALSE)
  disagree <- rle(fillFromLeft != fillFromRight)
  for (loc in (disagree)) { ... resolve conflicts, row-wise }
}
sapply(dat, na.fill_from_nn)
Alternatively, since as you say contiguous NAs are rare, do a fast-and-dumb ifelse to fill isolated NAs from left. This will operate data-frame wise => makes the common-case fast. Then handle all the other cases with a row-wise for-loop. (This will affect the tiebreak on middle elements in a long span of NAs, but you say you don't care.)
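A sketch of that fast-and-dumb single-pass fill (my interpretation; it handles isolated NAs only and fills from the left):
fill_isolated <- function(x) {
  left <- c(NA, x[-length(x)])  # each position's left-hand neighbour
  ifelse(is.na(x), left, x)     # isolated NAs take the left value; a leading NA stays NA
}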
I can't think of an obvious simple solution, but, having looked at the suggestions (particularly smci's suggestion of using rle) I came up with a complicated function that appears to be more efficient.
This is the code, I'll explain below:
# Your function
your.func = function(dat) {
  na.pos <- which(is.na(dat))
  if (length(na.pos) == length(dat)) {
    return(dat)
  }
  non.na.pos <- setdiff(seq_along(dat), na.pos)
  nearest.non.na.pos <- sapply(na.pos, function(x) which.min(abs(non.na.pos - x)))
  dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
  dat
}
# My function
my.func = function(dat) {
  nas = is.na(dat)
  if (!any(!nas)) return(dat)
  t = rle(nas)
  f = sapply(t$lengths[t$values], seq)
  a = unlist(f)
  b = unlist(lapply(f, rev))
  x = which(nas)
  l = length(dat)
  dat[nas] = ifelse(a > b, dat[ifelse((x + b) > l, x - a, x + b)], dat[ifelse((x - a) < 1, x + b, x - a)])
  dat
}
# Test
n = 100000
test.vec = 1:n
set.seed(1)
test.vec[sample(test.vec, n/4)] = NA
system.time(t1 <- my.func(test.vec))
system.time(t2 <- your.func(test.vec))  # 10 times speed improvement on my machine
# Verify
any(t1 != t2)
My function relies on rle. I read the comments above, but it looks to me like rle works just fine with NAs. It is easiest to explain with a small example.
If I start with a vector:
dat=c(1,2,3,4,NA,NA,NA,8,NA,10,11,12,NA,NA,NA,NA,NA,18)
I then get the positions of all the NAs:
x=c(5,6,7,8,13,14,15,16,17)
Then, for every "run" of NAs I create a sequence from 1 to the length of the run:
a=c(1,2,3,1,1,2,3,4,5)
Then I do it again, but I reverse the sequence:
b=c(3,2,1,1,5,4,3,2,1)
Now, I can just compare vectors a and b: If a<=b then look back and grab the value at x-a. If a>b then look ahead and grab the value at x+b. The rest is just handling the corner cases when you have all NAs or NA runs at the end or the start of the vector.
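Putting that together on the small example above (result worked out by hand; ties fall to the left):
my.func(c(1,2,3,4,NA,NA,NA,8,NA,10,11,12,NA,NA,NA,NA,NA,18))
# [1]  1  2  3  4  4  4  8  8  8 10 11 12 12 12 12 18 18 18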
There is probably a better, simpler, solution, but I hope this gets you started.
I like all the rigorous solutions. Though not directly what was asked, I found this post looking for a way to fill NA values with an interpolation. After reviewing this post I discovered na.fill on a zoo object (vector, factor, or matrix):
z <- c(1,2,3,4,5,6,NA,NA,NA,2,3,4,5,6,NA,NA,4,6,7,NA)
z1 <- zoo::na.fill(z, "extend")
Note the smooth transition across the NA values
round(z1, 0)
#> [1] 1 2 3 4 5 6 5 4 3 2 3 4 5 6 5 5 4 6 7 7
Perhaps this could help.
Here's my stab at it. I never like to see a for loop in R, but in the case of a sparsely-NA vector, it looks like it will actually be more efficient (performance metrics below). The gist of the code is below.
# get the index of all NA values
nas <- which(is.na(dat))
# get the Boolean map of which are NAs, used later to determine which values
# can be used as a replacement, and which are just filled-in NA values
namask <- is.na(dat)
# calculate the maximum size of a run of NAs
length <- getLengthNAs(dat)
# the furthest away the nearest non-NA value could be is half the length of the maximum NA run
windowSize <- ceiling(length/2)
# loop through all NAs
for (thisIndex in nas) {
  # extract the neighborhood of this NA
  neighborhood <- dat[(thisIndex - windowSize):(thisIndex + windowSize)]
  # any already-filled-in values which were NA can be replaced with NAs
  neighborhood[namask[(thisIndex - windowSize):(thisIndex + windowSize)]] <- NA
  # the center of this neighborhood
  center <- windowSize + 1
  # compute the offsets within this neighborhood to find the nearest non-NA value
  delta <- center - which(!is.na(neighborhood))
  # find the closest replacement
  replacement <- delta[abs(delta) == min(abs(delta))]
  # in case length > 1, just pick the first
  replacement <- replacement[1]
  # replace with the nearest non-NA value
  dat[thisIndex] <- dat[(thisIndex - replacement)]
}
I liked the code you proposed, but I noticed that we were calculating the delta between every NA value and every other non-NA index in the vector. I think this was the biggest performance hog. Instead, I just extract the minimum-sized neighborhood or window around each NA and find the nearest non-NA value within that window.
So the performance scales linearly with the number of NAs and the window size, where the window size is (the ceiling of) half the length of the maximum run of NAs. To calculate the length of the maximum run of NAs, you can use the following function:
getLengthNAs <- function(dat) {
  nas <- which(is.na(dat))
  spacing <- diff(nas)
  length <- 1
  while (any(spacing == 1)) {
    length <- length + 1
    spacing <- diff(which(spacing == 1))
  }
  length
}
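As an aside, a simpler equivalent using rle() (a sketch; maxNARun is an illustrative name, not from the original answer):
maxNARun <- function(dat) {
  r <- rle(is.na(dat))
  if (!any(r$values)) return(0)  # no NAs at all
  max(r$lengths[r$values])       # length of the longest run of NAs
}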
Performance Comparison
# create a test vector with 10% NAs and length 50,000
dat <- as.integer(runif(50000, min = 0, max = 10))
dat[dat == 0] <- NA
# the a() function is the code posted in the question
a <- function(dat) {
  na.pos <- which(is.na(dat))
  if (length(na.pos) == length(dat)) {
    return(dat)
  }
  non.na.pos <- setdiff(seq_along(dat), na.pos)
  nearest.non.na.pos <- sapply(na.pos, function(x) {
    return(which.min(abs(non.na.pos - x)))
  })
  dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
  dat
}
# my code
b <- function(dat) {
  # the same code posted above, but with some additional helper code to sanitize the input
  if (is.null(dat)) {
    return(NULL)
  }
  if (all(is.na(dat))) {
    stop("Can't impute NAs if there are no non-NA values.")
  }
  if (!any(is.na(dat))) {
    return(dat)
  }
  # starts with an NA (or multiple), handle these
  if (is.na(dat[1])) {
    firstNonNA <- which(!is.na(dat))[1]
    dat[1:(firstNonNA - 1)] <- dat[firstNonNA]
  }
  # ends with an NA (or multiple), handle these
  if (is.na(dat[length(dat)])) {
    lastNonNA <- which(!is.na(dat))
    lastNonNA <- lastNonNA[length(lastNonNA)]
    dat[(lastNonNA + 1):length(dat)] <- dat[lastNonNA]
  }
  # get the index of all NA values
  nas <- which(is.na(dat))
  # get the Boolean map of which are NAs, used later to determine which values
  # can be used as a replacement, and which are just filled-in NA values
  namask <- is.na(dat)
  # calculate the maximum size of a run of NAs
  length <- getLengthNAs(dat)
  # the furthest away the nearest non-NA value could be is half the length of the
  # maximum NA run; if there's a run at the beginning or end, the nearest non-NA value
  # could possibly be `length` away, so we need to keep the window large for that case
  windowSize <- ceiling(length/2)
  # loop through all NAs
  for (thisIndex in nas) {
    # extract the neighborhood of this NA
    neighborhood <- dat[(thisIndex - windowSize):(thisIndex + windowSize)]
    # any already-filled-in values which were NA can be replaced with NAs
    neighborhood[namask[(thisIndex - windowSize):(thisIndex + windowSize)]] <- NA
    # the center of this neighborhood
    center <- windowSize + 1
    # compute the offsets within this neighborhood to find the nearest non-NA value
    delta <- center - which(!is.na(neighborhood))
    # find the closest replacement
    replacement <- delta[abs(delta) == min(abs(delta))]
    # in case length > 1, just pick the first
    replacement <- replacement[1]
    # replace with the nearest non-NA value
    dat[thisIndex] <- dat[(thisIndex - replacement)]
  }
  dat
}
# nograpes' answer on this question (note: this masks base::c() in the session)
c <- function(dat) {
  nas = is.na(dat)
  if (!any(!nas)) return(dat)
  t = rle(nas)
  f = sapply(t$lengths[t$values], seq)
  a = unlist(f)
  b = unlist(lapply(f, rev))
  x = which(nas)
  l = length(dat)
  dat[nas] = ifelse(a > b, dat[ifelse((x + b) > l, x - a, x + b)], dat[ifelse((x - a) < 1, x + b, x - a)])
  dat
}
# run 10 times each to get average performance
sum <- 0; for (i in 1:10) { sum <- sum + system.time(a(dat))["elapsed"] }; cat("A: ", sum/10)
# A: 5.059
sum <- 0; for (i in 1:10) { sum <- sum + system.time(b(dat))["elapsed"] }; cat("B: ", sum/10)
# B: 0.126
sum <- 0; for (i in 1:10) { sum <- sum + system.time(c(dat))["elapsed"] }; cat("C: ", sum/10)
# C: 0.287
So it looks like this code (at least under these conditions) offers about a 40X speedup over the original code posted in the question, and a 2.2X speedup over nograpes' answer above (though I imagine an rle solution would certainly be faster in some situations, including a more NA-rich vector).
Speed is about 3-4x slower than that of the chosen answer, but mine is pretty simple. It's a rare while loop, too.
f2 <- function(x) {
  # check if all are NA to skip loop
  if (!all(is.na(x))) {
    # replace NAs until they are gone
    while (anyNA(x)) {
      # replace from the left
      x[is.na(x)] <- c(NA, x[1:(length(x) - 1)])[is.na(x)]
      # replace from the right
      x[is.na(x)] <- c(x[-1], NA)[is.na(x)]
    }
  }
  # return original or fixed x
  x
}
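On the question's example vector this gives the requested output:
f2(c(1, 3, NA, NA, 5, 7))
# [1] 1 3 3 5 5 7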

Function within function not activating as expected

I have a function that I use to get a "quick look" at a data.frame... I deal with a lot of survey data and this acts as a quick tool to see what's what.
library(reshape)  # for melt() and cast()

f.table <- function(x) {
  if (is.factor(x[[1]])) {
    frequency <- function(x) {
      x <- round(length(x)/n, digits = 2)
    }
    x <- na.omit(melt(x, c()))
    x <- cast(x, variable ~ value, frequency)
    x <- cbind(x, top2 = x[, ncol(x)] + x[, ncol(x) - 1], bottom = x[, 2])
  }
  if (is.numeric(x[[1]])) {
    frequency <- function(x) {
      x[x > 1] <- 1
      x[is.na(x)] <- 0
      x <- round(sum(x)/n, digits = 2)
    }
    x <- na.omit(melt(x))
    x <- cast(x, variable ~ ., c(frequency, mean, sd, min, max))
    x <- transform(x, variable = reorder(variable, frequency))
  }
  return(x)
}
What I find happens is that if I don't define "frequency" outside of the function, it returns wonky results for data frames with continuous variables. It doesn't seem to matter which definition I use outside of the function, so long as I do.
try:
n <- 100
x <- data.frame(a = c(1:25), b = rnorm(100), c = rnorm(100))
x[x > 20] <- NA
Now, select either one of the frequency functions, paste it in, and try again:
frequency <- function(x) {
  x <- round(length(x)/n, digits = 2)
}
f.table(x)
Why is that?
Crucially, I think this is where your problem is. cast() is evaluating those functions without reference to the function it was called from. Inside cast() it evaluates fun.aggregate via funstofun and, although I don't really follow what it is doing, it is getting stats:::frequency and not your local one.
Hence my comment to your Q. What do you want the function to do? At the moment it would seem necessary to define a "frequency" function in the global environment so that cast() or funstofun() finds it. Give it a unique name so it is unlikely to clash with anything and will be the only thing found, say .Frequency(). Without knowing what you want the function to do (rather than what you thought f.table should do) it is a bit difficult to provide further guidance, but why not have .FrequencyNum() and .FrequencyFac() defined in the global workspace and rewrite your f.table() wrapper's calls to cast() to use the relevant one?
.FrequencyFac <- function(X, N) {
  round(length(X)/N, digits = 2)
}
.FrequencyNum <- function(X, N) {
  X[X > 1] <- 1
  X[is.na(X)] <- 0
  round(sum(X)/N, digits = 2)
}
f.table <- function(x, N) {
  if (is.factor(x[[1]])) {
    x <- na.omit(melt(x, c()))
    x <- dcast(x, variable ~ value, .FrequencyFac, N = N)
    x <- cbind(x, top2 = x[, ncol(x)] + x[, ncol(x) - 1], bottom = x[, 2])
  }
  if (is.numeric(x[[1]])) {
    x <- na.omit(melt(x))
    x <- cast(x, variable ~ ., c(.FrequencyNum, mean, sd, min, max), N = N)
    ## x <- transform(x, variable = reorder(variable, frequency))
    ## left this out as I wanted to see what cast returned
  }
  return(x)
}
Which I thought would work, but it is not finding N, and it should be. So perhaps I am missing something here?
By the way, it is probably not a good idea to rely on a function finding n (in your version) from outside itself. Always pass in the variables you need as arguments.
I don't have the package that contains melt, but there are a couple of potential issues I can see:
Your frequency functions don't explicitly return anything; the last expression is an assignment, so its value is only returned invisibly.
It's generally bad practice to alter function inputs (x is both the input and the output).
There is already a generic frequency function in the stats package in base R, which may cause issues with method dispatch (I'm not sure).
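A small sketch of an aggregation function that sidesteps all three points (the name prop_seen is illustrative, and n is passed in rather than looked up in the calling environment):
prop_seen <- function(x, n) {
  # explicit return value, input not overwritten, no clash with stats::frequency
  round(length(x) / n, digits = 2)
}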
