It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 11 years ago.
I created this program to estimate the Mean Squared Error (MSE) and Mean Absolute Percent Error (MAPE):
Is everything all right with this?
pune is a .csv file with 22 data points.
pune <- read.csv("C:/Users/ervis/Desktop/Te dhenat e konsum energji/pune.csv", header = TRUE, dec = ",", sep = ";")
pune <- data.matrix(pune, rownames.force = NA)
m1 <- seq(from = 14274.19, to = 14458.17, length.out = 10000)
MSE1 <- numeric(length = 10000)
for (i in seq_along(MSE1)) {
  MSE1[i] <- 1 / length(pune) * sum((pune - m1[i])^2)
}
MAPE1 <- numeric(length = 10000)
for (i in seq_along(MAPE1)) {
  MAPE1[i] <- 1 / length(pune) * sum(abs((pune - m1[i]) / pune))
}
Am I right?
Mean squared error seems to have different meanings in different contexts.
For a random sample taken from a population, the MSE of the sample mean is just the variance divided by the number of samples, i.e.,
mse <- function(sample_mean) var(sample_mean) / length(sample_mean)
mse(pune)
For regressions, MSE means the sum of squares of residuals divided by the degrees of freedom of those residuals.
mse.lm <- function(lm_model) sum(residuals(lm_model) ^ 2) / lm_model$df.residual
#or
mse.lm <- function(lm_model) summary(lm_model)$sigma ^ 2
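For instance, using a built-in dataset (a minimal sketch; mtcars is just a stand-in for your own model's data):
# fit a simple linear model and compute its MSE with either helper above
fit <- lm(mpg ~ wt, data = mtcars)
mse.lm(fit)  # identical to summary(fit)$sigma^2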
Seems like a lot of code for a simple calculation. Here is how I would do it for a data vector a:
a = c(1:10)
mse_a = sum((a - mean(a)) ^ 2) / length(a)
From what I can see, your formula for MSE is correct, but there should only be one value for the whole dataset, not multiple values.
If your data only contain 22 points, I can't see why you need to create a 10,000-item vector, regardless of whether you are using loops or not.
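If the goal is just the MSE and MAPE of the data around their own mean, one value of each is enough (a sketch, assuming pune has been read in as above):
# single-number MSE and MAPE of the data around the sample mean
m <- mean(pune)
mse  <- mean((pune - m)^2)
mape <- mean(abs((pune - m) / pune))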
This question already has answers here:
Standard Deviation in R Seems to be Returning the Wrong Answer - Am I Doing Something Wrong?
(4 answers)
Closed 12 months ago.
I have a large dataset for which I want to determine the mean, sd and se depending on two variables (sample and protein), here is a subset of my data:
sample value protein
1 Stage 1 84796453 Tdrd6
2 Stage 1 85665703 Tdrd6
When I use
ddply(df, .(sample, protein), summarise, Mean = mean(value), SE = sd(value) / sqrt((length(value))), SD = sd(value))
I get
sample protein Mean SE SD
1 Stage 1 Tdrd6 85231078 434624.5 614651.9
The mean is correct. However, considering that I have only two values, the SD should be 434625 (the difference between the mean and either of the values, which is given in the output as the SE), and (as calculated with Excel) the SE should be 307326 (which is half of the SD value given in the output). Does anyone know what is going on?
Thanks!
R's var and sd functions use a denominator of n - 1. From the var docs:
The denominator n - 1 is used which gives an unbiased estimator of the (co)variance for i.i.d. observations.
This is also why R's implementation of these functions will return NA for vectors of length 1. Your Excel calculations seem to be using an uncorrected denominator of n, hence the difference.
The bias correction is considered standard, especially for small samples. We can see the difference if we write a variance function that uses the biased denominator:
var_uncorrected <- function(x, na.rm = FALSE) {
  return(mean((x - mean(x, na.rm = na.rm))^2, na.rm = na.rm))
}
vals <- c(84796453, 85665703)
sd(vals)
[1] 614652.6
sqrt(var_uncorrected(vals))
[1] 434625
Lastly, the plyr package was retired several years ago and has been superseded by dplyr.
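A dplyr equivalent of the ddply call above would be something like this (a sketch, assuming df has the sample, protein, and value columns shown in the question):
library(dplyr)
df %>%
  group_by(sample, protein) %>%
  summarise(Mean = mean(value),
            SE = sd(value) / sqrt(length(value)),
            SD = sd(value),
            .groups = "drop")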
sd() calculates the standard deviation of a sample, which gives the correct answer. It seems you want the standard deviation of the population (assuming n is not just your sample size but your whole population), which can be derived from it.
x = c(84796453, 85665703)
n = length(x)
sd(x) # standard deviation of a sample
# [1] 614653
sqrt((n-1)/n) * sd(x) # standard deviation of a population
# [1] 434625
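If what is actually wanted is the standard error of the mean, that is the sample standard deviation divided by sqrt(n) (continuing from the code above):
sd(x) / sqrt(n) # standard error of the mean
# [1] 434625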
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 3 years ago.
I have the following problem. I have a time series of 2659 observations. I need to perform a statistical test over a sliding window of length 256, and each time I want to extract the p-value from the test and gather them into a time series vector. To perform this test (a runs test) I want as the threshold a moving average that moves along with the data and the rolling window. Here is my attempt (in R):
x<- ts(rnorm(2659, mean = 0.0001, sd = 0.0001))
library(randtests)
for(i in 1:2404){
runs <- runs.test(x[i:i+255], threshold = mean(x[i:i+255]))
ret[i] <- runs$p.value
}
The index starts at 1 but stops at 2404 because each window spans 256 observations: the first window goes from 1 to 256, the second from 2 to 257, and so on, until the last ends at 2404 + 255 = 2659. I hope I have made my problem clear; I do not understand why it does not work. Of course I also need to plot all the p-values over time in a single plot. I hope you can help me.
PS: Please, set a seed if you propose an example so that I can reproduce your results.
Use rollapplyr with the indicated function.
library(zoo)
pv <- function(xx) runs.test(xx, threshold = mean(xx))$p.value
out <- rollapplyr(x, 256, pv, fill = NA)
Note
library(randtests)
set.seed(123)
x <- ts(rnorm(2659, mean = 0.0001, sd = 0.0001))
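Since the question also asks for a plot of the p-values over time, a possible follow-up (a minimal sketch):
# plot the rolling runs-test p-values; the leading entries are NA because of fill = NA
plot(out, xlab = "time", ylab = "runs test p-value")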
Two changes to your existing code should make it work:
set.seed(0)
x <- ts(rnorm(2659, mean = 0.0001, sd = 0.0001))
library(randtests)
ret <- rep(NA, length(x))
for(i in 1:2404){
runs <- runs.test(x[i:(i+255)], threshold = mean(x[i:(i+255)]))
ret[i] <- runs$p.value
}
The first change is to initialize the ret variable before the loop: ret <- rep(NA, length(x)).
The second change is to add the parentheses, i.e. x[i:(i+255)]. Because : has higher precedence than +, x[i:i+255] evaluates as x[(i:i) + 255], so you get a single value, x[i+255].
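A quick way to see the precedence issue at the console (a small sketch):
i <- 1
i:i + 255    # (i:i) + 255, a single value: 256
i:(i + 255)  # the intended indices 1, 2, ..., 256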
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Is there an efficient R package for dealing with the following problem:
I have a set of numerical observations (N on the order of thousands) ranging from minus one million to plus one million. Given a target value and a round-off accuracy, is there a linear combination with weights -1 (subtract), 0 (leave out), or 1 (add up) such that the sum equals the target value within rounding error, and can the weights also be reported?
Here is the genetic algorithm I referenced, modified to your case; for an explanation of the algorithm, see my answer there. There may be (are certainly) ways to solve your issue with less code, but I had this solution on the shelf already and adapting it was simple. The input required is a data.frame with a column value and a column weights, which can be all zero:
value weights
1 45 0
2 33 0
3 47 0
4 65 0
5 12 0
6 43 0
7 5 0
... ... ...
The algorithm will then find a set of weights from the set c(-1,0,1) such that the value of
abs(target_value - sum(final_solution$value*final_solution$weights))
is minimized.
There is definitely still room for improvement; for example, the weights are currently set completely randomly, so the expected weighted sum of an initial solution is always 0. If the target_value is very high, it would be best to assign 1 a higher probability than -1, to converge to an optimal solution faster.
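For example, the initial weight draw inside randomize_solution could be biased toward 1 (a sketch; the probabilities here are arbitrary):
# favour +1 over -1 so initial weighted sums start closer to a large target
solution$weights = sample(c(-1, 0, 1), nrow(solution), replace = TRUE,
                          prob = c(0.15, 0.35, 0.50))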
It seems to work very well for this case: with 100,000 objects and a target value of 12000, it finds an optimal solution within a fraction of a second:
Code:
### PARAMETERS -------------------------------------------
n_population = 100 # the number of solutions in a population
n_iterations = 100 # The number of iterations
n_offspring_per_iter = 80 # number of offspring to create per iteration
frac_perm_init = 0.25 # fraction of columns to change from default solution while creating initial solutions
early_stopping_rounds = 100 # Stop if score not improved for this amount of iterations
### SAMPLE DATA -------------------------------------------------
n_objects = 100000
datain = data.frame(value = round(runif(n_objects, 0, 100)), weights = 0)
target_value=12000
### ALL OUR PREDEFINED FUNCTIONS ----------------------------------
# Score a solution
# We calculate the score by taking the sum of the squares of our overcapacity (so we punish very large overcapacity on a day)
score_solution <- function(solution,target_value)
{
abs(target_value-sum(solution$value*solution$weights))
}
# Merge solutions
# Get approx. 50% of tasks from solution1, and the remaining tasks from solution 2.
merge_solutions <- function(solution1,solution2)
{
solution1$weights = ifelse(runif(nrow(solution1),0,1)>0.5,solution1$weights,solution2$weights)
return(solution1)
}
# Randomize solution
# Create an initial solution
randomize_solution <- function(solution)
{
solution$weights = sample(c(-1,0,1),nrow(solution),replace=T)
return(solution)
}
# sort population based on scores
sort_pop <- function(population)
{
return(population[order(sapply(population,function(x) {x[['score']]}),decreasing = F)])
}
# return the scores of a population
pop_scores <- function(population)
{
sapply(population,function(x) {x[['score']]})
}
### RUN SCRIPT -------------------------------
# starting score
print(paste0('Starting score: ',score_solution(datain,target_value)))
# Create initial population
population = vector('list',n_population)
for(i in 1:n_population)
{
# create initial solutions by making changes to the initial solution
solution = randomize_solution(datain)
score = score_solution(solution,target_value)
population[[i]] = list('solution' = solution,'score'= score)
}
population = sort_pop(population)
score_per_iteration <- score_solution(datain,target_value)
# Run the algorithm
for(i in 1:n_iterations)
{
print(paste0('\n---- Iteration',i,' -----\n'))
# create some random perturbations in the population
for(j in 1:10)
{
sol_to_change = sample(2:n_population,1)
new_solution <- randomize_solution(population[[sol_to_change]][['solution']])
new_score <- score_solution(new_solution,target_value)
population[[sol_to_change]] <- list('solution' = new_solution,'score'= new_score)
}
# Create offspring, first determine which solutions to combine
# determine the probability that a solution will be selected to create offspring (some smoothing)
probs = sapply(population,function(x) {x[['score']]})
if(max(probs)==min(probs)){stop('No diversity in population left')}
probs = 1-(probs-min(probs))/(max(probs)-min(probs))+0.2
# create combinations
solutions_to_combine = lapply(1:n_offspring_per_iter, function(y){
sample(seq(length(population)),2,prob = probs)})
for(j in 1:n_offspring_per_iter)
{
new_solution <- merge_solutions(population[[solutions_to_combine[[j]][1]]][['solution']],
population[[solutions_to_combine[[j]][2]]][['solution']])
new_score <- score_solution(new_solution,target_value)
population[[length(population)+1]] <- list('solution' = new_solution,'score'= new_score)
}
population = sort_pop(population)
population= population[1:n_population]
print(paste0('Best score: ', population[[1]][['score']]))
score_per_iteration = c(score_per_iteration, population[[1]][['score']])
if(i>early_stopping_rounds+1)
{
if(score_per_iteration[[i]] == score_per_iteration[[i - early_stopping_rounds]])
{
stop(paste0("Score not improved in the past ",early_stopping_rounds," rounds. Halting algorithm."))
}
}
}
plot(x=seq(0,length(score_per_iteration)-1),y=score_per_iteration,xlab = 'iteration',ylab='score')
final_solution = population[[1]][['solution']]
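To check how close the best solution gets to the target (a short sketch, continuing from the script above):
# weighted sum of the best solution and its remaining gap to the target
sum(final_solution$value * final_solution$weights)
abs(target_value - sum(final_solution$value * final_solution$weights))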
It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 9 years ago.
I have the formula y = x / (a+b*x) that I want to fit to the points (6,72) (211,183) (808,360) (200,440). I put them in R using
x <- c(6,211,808,200)
y <- c(72,183,360,440)
Now I want to fit the function defined above through these points and find a and b.
How do I get a and b (using R)? And how do I write the formula in R?
Construct data:
x <- c(6,211,808,200)
y <- c(72,183,360,440)
d <- data.frame(x,y)
Plot the data: although sparse, they're not insane (they do show some evidence of an increasing/saturating pattern)
plot(y~x,data=d)
Fit the model:
## y = x/(a+b*x)
## 1/y = a/x + b
m1 <- glm(y~I(1/x),family=gaussian(link="inverse"),data=d)
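Since the inverse link fits 1/y = (Intercept) + c * (1/x), which is the same as y = x / (c + (Intercept) * x), the original parameters can be read off the coefficients (a minimal sketch):
cf <- coef(m1)
a <- unname(cf["I(1/x)"])      # a in y = x/(a + b*x)
b <- unname(cf["(Intercept)"]) # b in y = x/(a + b*x)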
You can plot the results in ggplot
library("ggplot2")
qplot(x,y,data=d)+theme_bw()+
geom_smooth(method="glm",
            method.args=list(family=gaussian(link="inverse")), # current ggplot2 takes model arguments via method.args
            formula=y~I(1/x),se=FALSE)
The confidence intervals for this model are somewhat crazy (because the confidence intervals for 1/y include zero, at which point the confidence intervals on y blow up), so be careful ...
Get the data and plot it:
x <- c(6,211,808,200)
y <- c(72,183,360,440)
plot(x,y,pch=19)
Define the function, get your coefficients
f <- function(x,a,b) {x/(a+b*x)}
fit <- nls(y ~ f(x,a,b), start=c(a=1,b=1))
co <- coef(fit)
# co will contain your coefficients for a and b
# a b
#0.070221853 0.002796513
And plot away:
curve(f(x, a=co["a"], b=co["b"]), add = TRUE, col="green", lwd=2)
Result: a plot of the data points with the fitted curve overlaid in green.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
In R, how do I find the optimal variable to maximize or minimize correlation between several datasets
This can be done in Excel, but my dataset has gotten too large. In Excel, I would use Solver.
I have 5 variables and I want to recreate a weighted average of these 5 variables so that they have the lowest correlation to a 6th variable.
Column A, B, C, D, E = random numbers
Column F = random number (which I want to minimise the correlation to)
Column G = A*wi1 + B*wi2 + C*wi3 + D*wi4 + E*wi5
where wi1 to wi5 are coefficients produced by Solver. In a separate cell, I would have CORREL(F, G).
This is all achieved with the following constraints in mind:
1. The weights wi1 to wi5 have to be between 0 and 1
2. wi1 + wi2 + wi3 + wi4 + wi5 = 1
I'd like to print the results of this so that I can have an efficient frontier type chart.
How can I do this in R? Thanks for the help.
I looked at the other thread mentioned by Vincent and I think I have a better solution. I hope it is correct. As Vincent points out, your biggest problem is that the optimization tools for such non-linear problems do not offer a lot of flexibility for dealing with your constraints. Here, you have two types of constraints: 1) all your weights must be >= 0, and 2) they must sum to 1.
The optim function has a lower option that can take care of your first constraint. For the second constraint, you have to be a bit creative: you can force your weights to sum to one by scaling them inside the function to be minimized, i.e. rewrite your correlation function as function(w) cor(X %*% w / sum(w), Y).
# create random data
n.obs <- 100
n.var <- 6
X <- matrix(runif(n.obs * n.var), nrow = n.obs, ncol = n.var)
Y <- matrix(runif(n.obs), nrow = n.obs, ncol = 1)
# function to minimize
correl <- function(w)cor(X %*% w / sum(w), Y)
# initial guess
w0 <- rep(1 / n.var, n.var)
# optimize
opt <- optim(par = w0, fn = correl, method = "L-BFGS-B", lower = 0)
optim.w <- opt$par / sum(opt$par)
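To inspect the result (a short sketch, continuing from the code above):
optim.w                 # normalized weights: non-negative and summing to 1
sum(optim.w)            # should be exactly 1 by construction
cor(X %*% optim.w, Y)   # correlation achieved with the optimized weights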