I’m trying to write a median function that handles vectors with both even and odd numbers of observations (the calculation method differs between the two). I also need an option for missing values. How can I do this without using the built-in median(x) in R? Is it possible? Below is what I have tried:
x <- c(1, 2, 3)
mediana <- function(x, na.rm = FALSE) {
  x <- x[!is.na(x)]
  return(median(x))
}
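One way to do this without median() is to sort the values and then take the middle one (odd length) or average the two middle ones (even length). Here is a minimal sketch; my_median is a made-up name, and I am assuming that with na.rm = FALSE you want NA back when missing values are present, as the built-in does:

```r
# my_median is a hypothetical name, not a base R function
my_median <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]            # drop missing values on request
  } else if (anyNA(x)) {
    return(NA_real_)             # otherwise missing values propagate
  }
  n <- length(x)
  if (n == 0L) return(NA_real_)  # nothing left to summarise
  x <- sort(x)
  mid <- (n + 1L) %/% 2L         # position of the (lower) middle value
  if (n %% 2L == 1L) {
    x[mid]                       # odd n: the single middle value
  } else {
    (x[mid] + x[mid + 1L]) / 2   # even n: mean of the two middle values
  }
}

my_median(c(1, 2, 3))
my_median(c(4, 1, NA, 3), na.rm = TRUE)
```

Sorting first keeps the even/odd logic down to a single index calculation.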
I'm using the boot() function from the boot package to bootstrap means from a population. The function I use is:
boot_mean <- function(data, i){
  ds_m <- data[i]
  return(mean(ds_m))
}
This works like a charm, but now I want to adapt the boot_mean function so that I also get the samples that produced each mean. I tried:
library('boot')
boot_mean <- function(data, i){
  ds_m <- data[i]
  ds_m_mean <- mean(ds_m)
  rlist <- list("means" = ds_m_mean, "data" = ds_m)
  return(rlist)
}
dummy_data <- rnorm(500)
dummy_boot <- boot(dummy_data, boot_mean, R = 1000)
Which results in an error:
Error in t.star[r, ] <- res[[r]] : incorrect number of subscripts
on matrix
What's wrong here? How can I get the corresponding dataset to the bootstrapped mean?
From the documentation ?boot, describing the statistic argument.
A function which when applied to data returns a vector containing the statistic(s) of interest. ...
The boot() function only wants to deal with functions that output a single vector. Modifying your code to return a list of two elements means it won't work anymore. There's actually a little interesting oddity in R and the boot() function which means the code almost works if you set R=1 in the boot() call, but it's still wrong.
Fortunately for your purpose, the authors have already programmed the useful boot.array() function. It outputs a matrix with R rows and one column per observation in the data, indicating either how many times the jth individual was sampled in the ith bootstrap replicate, or (with indices = TRUE) the indices of the sampled individuals. The bootstrapped datasets can then be recovered by selecting those individuals from the data. This can take a little while.
dats <- lapply(1:nrow(boot.array(dummy_boot)),
               FUN = function(x) dummy_data[boot.array(dummy_boot, indices = TRUE)[x, ]])
If your data has multiple columns, add , , drop = FALSE to the subsetting so that each resample keeps its columns:
dats <- lapply(1:nrow(boot.array(dummy_boot)),
               FUN = function(x) dummy_data[boot.array(dummy_boot, indices = TRUE)[x, ], , drop = FALSE])
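If the repeated boot.array() calls are what makes this slow, a possible variation is to compute the index matrix once and reuse it. The following is only a sketch with smaller toy sizes than in the question; the set.seed() call and the name idx are my additions:

```r
library(boot)

set.seed(1)                                   # only to make this sketch reproducible
boot_mean <- function(data, i) mean(data[i])  # the statistic returns a plain number

dummy_data <- rnorm(50)
dummy_boot <- boot(dummy_data, boot_mean, R = 20)

# Build the R x n matrix of resampled indices once, not once per replicate
idx <- boot.array(dummy_boot, indices = TRUE)
dats <- lapply(seq_len(nrow(idx)), function(r) dummy_data[idx[r, ]])

# Sanity check: each recovered resample should reproduce its bootstrapped mean
all.equal(sapply(dats, mean), as.vector(dummy_boot$t))
```

The final line is just a check that the recovered datasets really correspond to the values stored in dummy_boot$t.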
I am writing a simple function in R to calculate percentage differences between two input numbers.
pdiff <- function(a,b)
{
  if(length(a>=1)) a <- median(a)
  if(length(b>=1)) b <- median(b)
  (abs(a-b)/((a+b)/2))*100
}
pdiffa <- function(a,b)
{
  if(length(a>=1)) a <- median(a)
  if(length(b>=1)) b <- median(b)
  (abs(a-b)/mean(a,b))*100
}
When you run it with a random value of a and b, the functions give different results
x <- 5
y <- 10
pdiff(x,y) #gives 66%
pdiffa(x,y) #gives 100%
When I go into the code, apparently the values of (x+y)/2 = 7.5 and mean(x,y) = 5 differ. Am I missing something really obvious and stupid here?
This is due to a nasty "gotcha" in the mean() function (not listed in the list of R traps, but probably should be): you want mean(c(a,b)), not mean(a,b). From ?mean:
mean(x, ...)
[snip snip snip]
... further arguments passed to or from other methods.
So what happens if you call mean(5,10)? mean calls the mean.default method, which has trim as its second argument:
trim the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.
The last phrase "values of trim outside that range are taken as the nearest endpoint" means that values of trim larger than 0.5 are set to 0.5, which means that we're asking mean to throw away 50% of the data on either end of the data set, which means that all that's left is the median. Debugging our way through mean.default, we see that we indeed end up at this code ...
if (trim >= 0.5)
    return(stats::median(x, na.rm = FALSE))
So mean(x, <value_greater_than_0.5>) returns the median of c(5), which is just 5 ...
Try mean(5, 10) by itself.
mean(5, 10)
[1] 5
Now try mean(c(5, 10)).
mean(c(5, 10))
[1] 7.5
mean takes a vector as its first argument.
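Putting that together, a corrected version of the function might look like the sketch below. I am assuming the length check was meant to be length(a) > 1 (the original length(a>=1) is true for any non-empty input); adjust if you intended something else:

```r
# Percentage difference relative to the mean of the two values
pdiff <- function(a, b) {
  if (length(a) > 1) a <- median(a)   # collapse a vector to a single value
  if (length(b) > 1) b <- median(b)
  abs(a - b) / mean(c(a, b)) * 100    # c(a, b) hands mean() one vector
}

pdiff(5, 10)   # about 66.67
```

With c(a, b) the second value is part of the data rather than being matched to the trim argument.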
I am trying to sample a population drawn from a normal distribution with a mean of 10.016 and a standard deviation of 0.8862719 (n=20), a thousand times. I want to create a loop to do this. I tried creating a function (stamendist) to draw random variables from a normal distribution with the abovementioned mean and standard deviation, but when I run the loop, I get an error message:
Error: could not find function "stamendist" (even though I ran the function before running the loop).
I tried running the loop without the object "stamendist" by just inputting rnorm(n=20,mean=10.016,sd=0.8862719), but the same error message persists.
Here is my code:
stamendist <- rnorm(n=20,mean=10.016,sd=0.8862719)
sampled.means <- NA
for(i in 1:1000){
  y=stamendist(100)
  sampled.means[i] <- mean(y)
}
Am I misunderstanding how a function works? I'm pretty new to R, so any help or advice would be appreciated.
You don't need a loop to obtain the vector of sample means:
n <- 1000
sampled.means <- colMeans(matrix(rnorm(n = 20 * n, 10.016, 0.8862719), ncol = n))
If you want stamendist to be a function, you need to define it as one. The general notation for a function is:
foo <- function(args, ...){
  expressions
}
You must then decide which parameters you want the user to specify. In your specific example, I assume you want the user to specify how many observations. Here is how the function would look with that in mind:
stamendist <- function(n) {
  rnorm(n = n, mean = 10.016, sd = 0.8862719)
}
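With stamendist defined as a function, the original loop could then be written roughly as follows. This is a sketch that assumes samples of size 20, as stated in the question text (the posted code passed 100):

```r
stamendist <- function(n) {
  rnorm(n = n, mean = 10.016, sd = 0.8862719)
}

sampled.means <- numeric(1000)          # preallocate instead of growing from NA
for (i in seq_along(sampled.means)) {
  y <- stamendist(20)                   # a fresh sample of 20 on every pass
  sampled.means[i] <- mean(y)
}

summary(sampled.means)
```

Preallocating with numeric(1000) avoids growing the vector one element at a time.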
In this line:
stamendist <- rnorm(n=20,mean=10.016,sd=0.8862719)
You assign 20 values to the vector named stamendist
In this line:
y=stamendist(100)
You try to call a function stamendist, which doesn't exist.
Move this line inside the loop:
stamendist <- rnorm(n=20,mean=10.016,sd=0.8862719)
So you create a new set of random numbers for each iteration.
Then pass stamendist to the mean function. You don't need y at all.
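Applied to the code from the question, that change would look something like this sketch (same variable names as the question, with the draw moved inside the loop):

```r
sampled.means <- numeric(1000)
for (i in 1:1000) {
  # redraw the 20 values on every iteration, keeping stamendist a plain vector
  stamendist <- rnorm(n = 20, mean = 10.016, sd = 0.8862719)
  sampled.means[i] <- mean(stamendist)
}

# The same idea without an explicit loop, using base R's replicate()
sampled.means2 <- replicate(1000, mean(rnorm(n = 20, mean = 10.016, sd = 0.8862719)))
```

Both versions give a vector of 1000 sample means; which one you use is mostly a matter of taste.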
I was just playing around and experimenting with R, and I had trouble combining my sapply() commands into one expression.
For example, my data table was called height_weight.
I want to calculate the usual summary statistics (mean, median, maximum, minimum and sample size) for columns 2 through 7.
As sample code:
I used this for the mean:
sapply(height_weight[2:7],mean,na.rm=TRUE)
and this for the max:
sapply(height_weight[2:7],max,na.rm=TRUE)
I'm just wondering, how would I combine the two into one expression? I have tried simply placing them next to each other, however that shows an error message.
Many ways to do so.
E.g. use summary and subset for the appropriate rows
sapply(height_weight[2:7], summary)[c("Mean", "Max."), ]
Or use an unnamed custom function that combines the two measures as a result
sapply(height_weight[2:7], function(x) c(Mean=mean(x, na.rm=TRUE), Max=max(x, na.rm=TRUE)))
Placing the two functions beside each other won't work because sapply accepts only one function. Everything that follows is passed on to that function (unless it is an argument of sapply itself).
If you want to calculate the usual summary statistics, you can just use summary:
summary(height_weight[2:7])
sapply(height_weight[2:7],summary) # just to use sapply
Otherwise, if you want to define your own summary statistics (in this case mean and max), then you can write a function mysummary and use sapply just as before:
mysummary <- function(x, ...) {
  c(mean = mean(x, ...),
    max = max(x, ...))
}
sapply(height_weight[2:7], mysummary , na.rm=TRUE)
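To cover all five statistics from the question, the same pattern can be extended. Below is a sketch on a small made-up data frame (the column names and values are invented, and it only has columns 2:3 rather than 2:7):

```r
# Toy stand-in for height_weight; names and values are made up for illustration
height_weight <- data.frame(
  id     = 1:5,
  height = c(150, 160, NA, 170, 180),
  weight = c(50, 60, 70, NA, 90)
)

mysummary <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    n      = sum(!is.na(x)))          # sample size, not counting missing values
}

stats <- sapply(height_weight[2:3], mysummary)  # 2:3 here, 2:7 in the question
stats
```

The result is a matrix with one row per statistic and one column per variable.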
I have a data.frame whose columns I'd like to calculate quantiles for:
tert <- c(0:3)/3
data <- dbGetQuery(dbCon, "SELECT * FROM tablename")
quans <- mapply(quantile, data, probs=tert, name=FALSE)
But the result only contains the last element of quantiles return list and not the whole result. I also get a warning longer argument not a multiple of length of shorter. How can I modify my code to make it work?
PS: The function alone works like a charm, so I could use a for loop:
quans <- quantile(a$fileName, probs=tert, name=FALSE)
PPS: What also works is not specifying probs
quans <- mapply(quantile, data, name=FALSE)
The problem is that mapply is trying to apply the given function to each of the elements of all of the specified arguments in sequence. Since you only want to do this for one argument, you should use lapply, not mapply:
lapply(data, quantile, probs=tert, name=FALSE)
Alternatively, you can still use mapply but specify the arguments that are not to be looped over in the MoreArgs argument.
mapply(quantile, data, MoreArgs=list(probs=tert, name=FALSE))
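To check that both suggestions agree, here is a small sketch on a toy data frame standing in for the query result (the columns a and b are invented). It uses sapply rather than lapply only so that the result is a matrix and easy to compare:

```r
tert <- (0:3) / 3
# Toy data frame in place of the dbGetQuery() result
data <- data.frame(a = 1:10, b = seq(10, 100, by = 10))

# probs and names are passed unchanged to every call of quantile()
q_sapply <- sapply(data, quantile, probs = tert, names = FALSE)
q_mapply <- mapply(quantile, data,
                   MoreArgs = list(probs = tert, names = FALSE))

q_sapply
all.equal(q_sapply, q_mapply)
```

Each column of the result holds the four quantiles of the corresponding data column.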
I finally found a workaround which I don't like but which kinda works. Perhaps someone can tell me the right way to do it:
q <- function(x) { quantile(x, probs=c(0:3)/3, names=FALSE) }
mapply(q, data)
This works, but I have no idea what the difference is.
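As far as I can tell, the difference is that anything passed through mapply's ... is iterated over in parallel with the columns, so probs gets split up element by element, whereas the wrapper keeps probs fixed inside the function. A sketch on invented toy data with four identical columns makes the pairing visible:

```r
tert <- (0:3) / 3
# Toy data with four columns, so columns and probabilities pair one-to-one
data <- data.frame(a = 1:10, b = 1:10, c = 1:10, d = 1:10)

# probs in `...` is vectorised over: column k only ever sees tert[k]
paired <- mapply(quantile, data, probs = tert, names = FALSE)

# the wrapper fixes probs inside the function, so every column gets all four
q <- function(x) quantile(x, probs = (0:3) / 3, names = FALSE)
full <- mapply(q, data)

paired      # a single number per column
dim(full)   # quantiles in rows, columns of data in columns
```

When the number of columns is not a multiple of length(probs), the recycling is what triggers the "longer argument not a multiple of length of shorter" warning.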