I want to calculate the error rate by interval, where 0 is good and 1 is bad. Suppose I have a sample of observations, taken as levels from 1 to 100 and divided into intervals as follows:
X <- 10
q <- sample(c(0, 1), replace = TRUE, size = X)   # 0 = good, 1 = bad
l <- sample(1:100, replace = TRUE, size = 10)    # observation levels
bornes <- seq(min(l), max(l), 5)                 # interval breakpoints
v <- cut(l, breaks = bornes, include.lowest = TRUE)
table(v)
How can I get a table or function that calculates the default rate for each interval, i.e. the number of bad observations divided by the total number of observations in that interval?
tx_erreur <- function(x) {
  t <- table(x, q)
  return(sum(t[, 2]) / sum(t))
}
I already tried the code above, as well as tapply.
Thank you!
I think you want this:
tapply(q,  # the variable to be summarized
       v,  # the variable that defines the bins
       function(x) sum(x) / length(x))  # the summary statistic within each bin
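Since q is coded 0/1, the summary function is just the mean of each bin, so an equivalent and shorter call is:
tapply(q, v, mean)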
I have a dataset with daily bond returns for some unique RIC codes (in total approx. 200,000 observations).
Now I want to calculate the standard deviation of those returns for the combined period t-30 to t-6 and t+6 to t+30. This means for every observation i,t, I need the 24 returns before t in the window t-30 to t-6 and 24 returns in the window t+6 to t+30 and calculate the standard deviation based on those 48 observations.
Here is a small snippet of my dataset:
#My data:
date <- c("2022-05-11", "2022-05-12","2022-05-13","2022-05-16","2022-05-17","2022-05-11", "2022-05-12","2022-05-13","2022-05-16","2022-05-17")
ric <- c("AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "SE247827293=", "SE247827293=", "SE247827293=", "SE247827293=", "SE247827293=")
return <- c(0.001009681, 0.003925873, 0.000354606, -0.000472641, -0.002935700, 0.003750854, 0.012317347, -0.001314047, 0.001014453, -0.007234452)
df <- data.frame(ric, date, return)
I have tried to use the slider package to generate two lists with the returns for the specific time frames. However, I feel there is a more efficient way to solve this problem. I hope to find some help here.
This is what I tried before:
library(slider)
x <- slide(df$return, ~.x, .before = 30, .after = -6)  # returns in rows t-30 .. t-6
y <- slide(df$return, ~.x, .before = -6, .after = 30)  # returns in rows t+6 .. t+30
z <- mapply(c, x, y, SIMPLIFY = FALSE)                 # combine both windows per observation
df$sd <- sapply(z, sd)                                 # sd over the combined window
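One possible refinement (a sketch, not a tested solution): the windows above slide over the whole data frame, so they can bleed across RIC codes, and .before/.after count rows rather than calendar days. Grouping by ric keeps each window inside a single bond:
library(dplyr)
library(slider)
df <- df %>%
  group_by(ric) %>%
  mutate(
    pre  = slide(return, ~.x, .before = 30, .after = -6),  # rows t-30 .. t-6
    post = slide(return, ~.x, .before = -6, .after = 30),  # rows t+6 .. t+30
    sd_window = mapply(function(a, b) sd(c(a, b)), pre, post)
  ) %>%
  ungroup() %>%
  select(-pre, -post)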
I am trying to generate data from a multinomial distribution in R using the function rmultinom, but I am having some problems.
I want a data frame with 50 rows and 20 columns, with the total sum of the outcomes equal to 3 times n*p.
I am using this code:
p <- 20
n <- 50
N <- 3*(n*p)
prob_true <- rep(1/p, p)
a <- rmultinom(50, N, prob_true)
But I get some very strange results, and the output is a matrix with 20 rows and 50 columns rather than a data frame.
How can I solve this problem?
Thanks in advance!
The help available at ?rmultinom says that n in rmultinom(n, size, prob) is:
"number of random vectors to draw"
And size is:
"specifying the total number of objects that are put into K boxes in the typical multinomial experiment"
And the help says that the output is:
"For rmultinom(), an integer K x n matrix where each column is a random vector generated according to the desired multinomial law, and hence summing to size"
So you're asking for 50 vectors/variables with a total number of "objects" equal to 3000, so each column is drawn as a vector that sums to 3000.
colSums(a) confirms this: every column sums to 3000.
Do you want your vectors/variables as rows? Then this would work just by transposing a:
t(a)
but if you want 20 columns, each that is its own variable, you would need to switch your n and p (I also subbed in n in the rmultinom call):
n <- 20                     # number of random vectors (columns)
p <- 50                     # number of boxes/categories (rows)
N <- 3 * (n * p)
prob_true <- rep(1 / p, p)
a <- rmultinom(n, N, prob_true)
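A quick sanity check on the result (the comments show what these calls should return):
dim(a)      # 50 20: p = 50 categories as rows, n = 20 draws as columns
colSums(a)  # each of the 20 columns sums to N = 3000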
Please help with the following question.
The experiment involved feeding mice one of two diets: a high-fat diet and a normal diet (control group). The data below contain the weights of all female mice (the population) that received the normal diet. The data can be downloaded from GitHub by running the following commands in R:
library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile = filename)
x <- unlist(read.csv(filename))
Here x represents the weights of the entire population.
So, the question is:
Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 (one thousand) times. Save the averages.
What proportion of these 1,000 averages are more than 1 gram away from the average of x?
Below is what I have tried, using the sum() and mean() functions:
set.seed(1)
n <- 1000
sample1 <- vector("numeric", n)
for (i in 1:n) {
  sample1[i] <- mean(sample(x, 5))
}
sum(sample1 > mean(x) / n)
mean(sample1 > mean(x)+1)
This step is where I need help, because I am not sure how to handle the "1 gram away from the average of x" part of the question.
Thank you in advance for your help.
Looks like homework, so I'll give some hints:
In your second code block, the last two statements seem off.
n <- 1000
sample1 <- vector("numeric", n)
for (i in 1:n) {
  sample1[i] <- mean(sample(x, 5))
}
sum(sample1 > mean(x) / n) #<- why dividing by n here?
mean(sample1 > mean(x)+1) #<- what are you trying to do here?
Why are you dividing the mean of the overall sample by n? The call to mean() itself makes sense, but I don't think you need the second statement, mean(sample1 > mean(x) + 1), to get your answer.
You need an inequality inside the sum() statement that is TRUE for every value outside the range mean(x) - 1 to mean(x) + 1. Equivalently, count the values less than mean(x) - 1 plus the values greater than mean(x) + 1.
Does that help?
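Spelling that hint out as code, one possible form of the inequality is:
sum(sample1 < mean(x) - 1 | sample1 > mean(x) + 1) / n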
The loop part you are doing correctly. For the "What proportion of these 1,000 averages are more than 1 gram away from the average of x?" part, take the absolute deviation of each average from the population mean (using the question's variable names):
sum(abs(sample1 - mean(x)) > 1) / n
My data frame contains the sampling means of 500 samples of size 100 each. Below is a snapshot. I need to calculate the confidence interval for the mean at the 90%, 95%, and 99% levels.
head(Means_df)
Means
1 14997
2 11655
3 12471
4 12527
5 13810
6 13099
I am using the code below, but I am only getting the confidence interval for one row. Can anyone help me with the code?
tint <- matrix(NA, nrow = dim(Means_df)[2], ncol = 2)
for (i in 1:dim(Means_df)[2]) {
  temp <- t.test(Means_df[, i], conf.level = 0.9)
  tint[i, ] <- temp$conf.int
}
colnames(tint) <- c("lcl", "ucl")
For any single mean, e.g. 14997, you cannot compute a 95% CI without knowing the variance or standard deviation of the data the mean was computed from. If you have access to the standard deviation of each sample, you can then compute the standard error of the mean and, with that, the 95% CI. Apparently, you lack the information needed for that.
Means_df is a data frame with 500 rows and 1 column. Therefore
dim(Means_df)[2]
will give the value 1, which is why you only get one value.
Solve the problem by using dim(Means_df)[1] or even better nrow(Means_df) instead of dim(Means_df)[2].
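If the goal is instead a single confidence interval for the mean of the 500 sample means at each level, a minimal sketch (assuming the column is named Means, as in the snapshot above):
sapply(c(0.90, 0.95, 0.99), function(cl) t.test(Means_df$Means, conf.level = cl)$conf.int)
Each column of the resulting 2 x 3 matrix holds the lower and upper limits for one confidence level.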
I have two datasets:
sims = c(2,5,3,5,5,3)
obs = c(1,4,NA,NA,7,4)
Using the hydroGOF R package I can calculate the percentage bias as
pbias(sims,obs,na.rm=T)
However, is there a way to output the sum of the sims values actually used in the pbias calculation (i.e. 2 + 5 + 5 + 3, since the hydroGOF manual states that "When an 'NA' value is found at the i-th position in obs OR sim, the i-th value of obs AND sim are removed before the computation"), rather than the plain sum(sims)?
You can do this with any two vectors like this:
sum(sims[!is.na(obs)])
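For this example, sum(sims[!is.na(obs)]) returns 2 + 5 + 5 + 3 = 15. If sims could itself contain NA values (per the manual quote above, a position is dropped when either vector is NA there), a symmetric version would be:
ok <- !is.na(obs) & !is.na(sims)
sum(sims[ok])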