Explanations of the Goals:
Could someone please help me on this:
I am trying to run a Monte Carlo study of the linear regression estimators beta0hat, beta1hat, R2, adjusted R2, and the p-value, varying the sample size (30, 60, 100) and the variance (0.5, 0.75, 1), with a normal random error.
First, I've created 3 samples of each relevant length, which I don't want to be random across replications (they are drawn once and then held fixed):
X1 <- sample(0:20, 30, replace = TRUE)
X2 <- sample(0:20, 60, replace = TRUE)
X3 <- sample(0:20, 100, replace = TRUE)
For the main purpose, I've created this Monte Carlo function, in which I'm trying to keep the results of each estimator in appropriate vectors, in order to generate histograms and a plot of the p-value (Y axis) against R2 (X axis), and so verify the behaviour of the estimators when I change the sample size and variance with normally distributed errors.
Arguments of the function:
n = sample size, sig = standard deviation of the error (note that rnorm takes a standard deviation, not a variance), b0 = true beta0, b1 = true beta1, X = the fixed sample of X values
Monte.Carlo <- function(n, sig, b0, b1, X) {
  Y <- b0 + b1 * X + rnorm(n, 0, sig)  # simulate the response
  summary(lm(Y ~ X))                   # return the regression summary
}
To generate the data for this study and analyse the behaviour of the estimators, I've used the function replicate like this:
object.1 = replicate(1000,Monte.Carlo(30,0.5,1.4,0.8,1,X1))
beta0_s0.5_n30 <-list(c(object.1[,1:1000][[4]] [1]))
beta1_s0.5_n30<- object.1[[4]] [2]
R2_s0.5_n30 <- object.1[[8]]
R2A_s0.5_n30 <- object.1[[9]]
valorP_s0.5_n30 <- object.1[[4]] [8]
But there is something wrong in the extractions above that I can't figure out.
object.1 seems to have stored 1000 summaries of the regression.
How can I access the 1000 outputs of each estimator from the regression summaries and store them in the appropriate vectors (or lists of lists), as I intended in the command lines above?
The purpose is to apply this to several objects, as in the example below, where I've changed the variance to 0.75 and the sample size to 60:
beta0_s0.75_n60 <- replicate(1000,Monte.Carlo(60,0.75,1.4,0.8,X2))
beta1_s0.75_n60<- replicate(1000,Monte.Carlo(60,0.75,1.4,0.8,X2))
R2_s0.75_n60 <- replicate(1000,Monte.Carlo(60,0.75,1.4,0.8,X2))
R2A_s0.75_n60 <- replicate(1000,Monte.Carlo(60,0.75,1.4,0.8,X2))
valorP_s0.75_n60 <- replicate(1000,Monte.Carlo(60,0.75,1.4,0.8,X2))
The final goal is to generate 120 graphs like the ones in this example, to compare the results:
hist(R2A_s0.5_n30,breaks=11)
hist(R2A_s0.75_n30,breaks=11)
hist(R2A_s1_n30,breaks=11)
hist(R2A_s0.5_n60,breaks=11)
hist(R2A_s0.75_n60,breaks=11)
hist(R2A_s1_n60,breaks=11)
hist(R2A_s0.5_n100,breaks=11)
hist(R2A_s0.75_n100,breaks=11)
hist(R2A_s1_n100,breaks=11)
I would really appreciate it if someone could help with this; I've tried a lot of solutions and looked in some forums, and nothing has made any difference.
Sorry about my English grammar errors.
Thanks a lot!
So I just assumed that your original object.1 call was supposed to have only five arguments like the Monte.Carlo function itself, and shortened this to:
object.1 = replicate(1000,Monte.Carlo(30,0.5,1.4,0.8,X1))
Then I made a dummy data frame (a data frame is a list of lists) with the column names being the statistics you specified:
o1 <- data.frame(b0 = 0, b1 = 0, R2 = 0, R2A = 0, vP = 0)
Then created a for-loop...
for (r in 1:ncol(object.1)) {        # one column of object.1 per replication
  coefs <- object.1[[4, r]]          # coefficients table of the r-th summary
  o1[r, "b0"]  <- coefs[1, 1]        # intercept estimate
  o1[r, "b1"]  <- coefs[2, 1]        # slope estimate
  o1[r, "R2"]  <- object.1[[8, r]]   # r.squared is component 8
  o1[r, "R2A"] <- object.1[[9, r]]   # adj.r.squared is component 9
  o1[r, "vP"]  <- coefs[2, 4]        # p-value of the slope
}
...where a new row is added to o1 for every replication of your Monte.Carlo function, and the relevant statistics are extracted with the subsetting operators [[ ]] and [ , ] from each of the one thousand summary(lm(Y~X)) objects stored in object.1. o1 is a dataframe, with each column being a vector of the 1000 values of one statistic. Apply the same principle for object.2, object.3, etc.
p.s. if you loop over 1:length(object.1) instead, you get Error in object.1[4, r] : subscript out of bounds, yet o1 still looks correct. The reason is that object.1 is an 11-by-1000 matrix of mode list, so length(object.1) is 11000: the loop fills all 1000 rows of o1 before running past the last column, which is why every individual statistic still matches. Looping over 1:ncol(object.1), as above, avoids the error.
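For completeness, the same table can be built without an explicit loop. This is a sketch, assuming object.1 comes from the corrected five-argument call above; replicate stores the summaries in a list-matrix whose row names are the summary.lm component names:
coefs <- object.1["coefficients", ]          # list of 1000 coefficient tables
o1_alt <- data.frame(
  b0  = sapply(coefs, function(m) m[1, 1]),  # intercept estimates
  b1  = sapply(coefs, function(m) m[2, 1]),  # slope estimates
  R2  = unlist(object.1["r.squared", ]),
  R2A = unlist(object.1["adj.r.squared", ]),
  vP  = sapply(coefs, function(m) m[2, 4])   # p-values of the slope
)
hist(o1_alt$R2A, breaks = 11)                # feeds straight into the histograms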
I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda <- plsda(X, Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I would like to do it a little faster. Therefore I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER functions. But I don't know how to do that. I have searched the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]  # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1],
                                        predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of components, so results[[1]] will be the results for 10 components. You will not get values for prediction or confusion.mat unless they are included in what the function returns. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction for 10 components, while results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
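A sketch of that variant (the name confmat_full is mine, not from mixOmics, and data_validation is the validation set from the question):
confmat_full <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]  # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1],
                                        predicted = prediction)
  list(prediction = prediction,
       confusion.mat = confusion.mat,
       BER = get.BER(confusion.mat))
}
results <- lapply(10:2, confmat_full)
results[[1]]$BER            # balanced error rate for 10 components
results[[1]]$confusion.mat  # the corresponding confusion matrix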
I just started using R for statistical purposes and I appreciate any kind of help.
My task is to make calculations on one index and 20 stocks from the index. The data contains 22 columns (DATE, INDEX, S1 .... S20) and about 4000 rows (one row per day).
Firstly, I imported the .csv file, called it "dataset", and calculated log returns this way, repeating it for all stocks S1-S20 plus the INDEX:
n <- nrow(dataset)
S1 <- dataset$S1
S1_logret <- log(S1[2:n])-log(S1[1:(n-1)])
Secondly, I stored the data in a data.frame:
logret_data <- data.frame(INDEX_logret, S1_logret, S2_logret, S3_logret, S4_logret, S5_logret, S6_logret, S7_logret, S8_logret, S9_logret, S10_logret, S11_logret, S12_logret, S13_logret, S14_logret, S15_logret, S16_logret, S17_logret, S18_logret, S19_logret, S20_logret)
Then I ran the regression (S1 to S20) using the log returns:
S1_Reg1 <- lm(S1_logret~INDEX_logret)
I couldn't figure out how to write the code in a more efficient way and use some function for repetition.
In a further step I have to run a cross-sectional regression for each day in a selected interval. That is impossible to do manually, and R should provide a quick solution. I am quite unsure how to do this part, and I would also like to use some kind of loop for the previous calculations.
I still lack the necessary R coding knowledge, so any help to the point, or advice on literature or tutorials, is highly appreciated. Thank you!
You could provide all the separate dependent variables in a matrix to run your regressions. Something like this:
#example data
Y1 <- rnorm(100)
Y2 <- rnorm(100)
X <- rnorm(100)
df <- data.frame(Y1, Y2, X)
#run all models at once
lm(as.matrix(df[c('Y1', 'Y2')]) ~ df$X)
Out:
Call:
lm(formula = as.matrix(df[c("Y1", "Y2")]) ~ df$X)
Coefficients:
Y1 Y2
(Intercept) -0.15490 -0.08384
df$X -0.15026 -0.02471
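For the original setup, the repetition can also be handled with a loop. This is a sketch that assumes the column names DATE, INDEX, S1 ... S20 from the question; diff(log(p)) is the same as log(p[2:n]) - log(p[1:(n-1)]):
# log returns for INDEX and S1 ... S20 in one step (drops the DATE column)
logret <- as.data.frame(lapply(dataset[-1], function(p) diff(log(p))))
# one regression per stock against the index log returns
stocks <- paste0("S", 1:20)
models <- lapply(stocks, function(s) lm(reformulate("INDEX", response = s), data = logret))
names(models) <- stocks
coef(models$S1)  # same fit as S1_Reg1 above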
I am currently working on generating some random data for a school project.
I have created a variable in R, using a binomial distribution, that indicates whether an observation had a loss (yes = 1) or not (= 0).
Afterwards I am trying to generate the loss amount, using a random distribution, for all observations which had a loss (= 1).
As my loss amount is a percentage, it can be anywhere between 0 and 1, which is why I am using a beta distribution (see 'What Is The Intuition Behind Beta Distribution' on stats.stackexchange).
In a third step I am looking for an if statement which combines my two variables.
Please find my code below (only the Loss_Y_N variable works so far):
Loss_Y_N = rbinom(1000000,1,0.01)
Loss_Amount = dbeta(x, 10, 990, ncp = 0, log = FALSE)
Ideally I can combine the two into something like:
if(Loss_Y_N=1 then Loss_Amount=dbeta(...) #... is meant to be a random variable with mean=0.15 and should be 0<x=<1
else Loss_Amount=0)
Any input highly appreciated!
Create a vector for your loss proportion. Fill up the elements corresponding to losses with draws from the beta. Tweak the parameters for the beta until you get the desired result.
N <- 100000
loss_indicator <- rbinom(N, 1, 0.1)
loss_prop <- numeric(N)
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 10, 990)
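Since a Beta(a, b) variable lives strictly in (0, 1) with mean a / (a + b), hitting the mean of 0.15 mentioned in the question is just a matter of picking the ratio. a = 3, b = 17 is one possible choice (my assumption, not the only pair that works; larger a and b with the same ratio give less spread):
a <- 3; b <- 17
a / (a + b)                          # 0.15, the target mean
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), a, b)
mean(loss_prop[loss_indicator > 0])  # should come out close to 0.15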
Good day,
I have tried to figure this out, but I really can't!! I'll supply an example of my data in R:
x <- c(36,71,106,142,175,210,246,288,357)
y <- c(19.6,20.9,19.8,21.2,17.6,23.6,20.4,18.9,17.2)
table <- data.frame(x,y)
library(nlmrt)
curve <- "y~ a + b*exp(-0.01*x) + (c*x)"
ones <- list(a=1, b=1, c=1)
Then I use wrapnls to fit the curve and to find a solution:
solve <- wrapnls(curve, data=table, start=ones, trace=FALSE)
This is all fine and works for me. Then, using the following, I obtain a prediction of y for each of the x values:
predict(solve)
But how do I find the prediction of y for new x values? For instance:
new_x <- c(10, 30, 50, 70)
I have tried:
predict(solve, new_x)
predict(solve, 10)
It just gives the same output as:
predict(solve)
I really hope someone can help! I know that if I took the values of 'solve' for parameters a, b, and c and substituted them into the curve formula with the desired x value, I would be able to do this, but I'm wondering if there is a simpler option, without plotting the data first.
predict requires the new data to be a data.frame with column names that match the variable names used in your model (whether your model has one or many variables). All you need to do is use:
predict(solve, data.frame(x=new_x))
# [1] 18.30066 19.21600 19.88409 20.34973
And that will give you predictions for just those 4 values. It's somewhat unfortunate that any mistake in specifying the new data results in the fitted values for the original model being returned; an error message probably would have been more useful, but oh well.
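For comparison, the manual substitution mentioned in the question is also short, assuming coef() works on the object returned by wrapnls (it should, since wrapnls hands back an nls-style fit):
cf <- as.list(coef(solve))                        # fitted a, b and c
with(cf, a + b * exp(-0.01 * new_x) + c * new_x)  # same four predictions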
I need your help to explain how I can obtain the same result as this function does:
gini(x, weights=rep(1,length=length(x)))
See http://cran.r-project.org/web/packages/reldist/reldist.pdf, page 2, gini.
Let's say we need to measure the income of a population N. To do that, we can divide the population into K subgroups, and in each subgroup k we take n_k individuals and ask for their income. As a result, we get each individual's income, and each individual has a particular sample weight representing their contribution to the population N. Here is an example that I took from the previous link; the dataset is from the NLS:
rm(list = ls())
cat("\014")
library(reldist)
data(nls)
help(nls)
# Convert the wage growth from log dollars to dollars
y <- exp(recent$chpermwage)
# Compute the unweighted estimate
gini_y <- gini(y)
# Compute the weighted estimate
gini_yw <- gini(y, w = recent$wgt)
# Results:
# gini_y  = 0.3418394
# gini_yw = 0.3483615
I know how to compute the Gini without weights with my own code, so I have no doubts about the plain gini(y) call. The only thing I am concerned about is how gini(y, w) operates to obtain the result 0.3483615. I tried the following calculation to see whether I could reproduce gini_yw; it is based on the CDF in Section 9.5 of the book 'Relative Distribution Methods in the Social Sciences' by Mark S. Handcock and Martina Morris:
#-------------------------
# test how gini computes with the sample weights
z <- exp(recent$chpermwage) * recent$wgt
gini_z <- gini(z)
# Result gini_z = 0.3924161
As you can see, my calculation gini_z differs from gini(y, weights). If any of you know how to build the correct computation to obtain exactly gini_yw = 0.3483615, please give me your advice.
Thanks a lot, friends.
function (x, weights = rep(1, length = length(x)))
{
    ox <- order(x)
    x <- x[ox]
    weights <- weights[ox]/sum(weights)  # sort by income, normalise the weights
    p <- cumsum(weights)                 # weighted CDF of the population
    nu <- cumsum(weights * x)            # weighted partial sums of income
    n <- length(nu)
    nu <- nu/nu[n]                       # normalised Lorenz ordinates
    sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}
This is the source code for the function gini, which you can see by entering gini at the console (no parentheses or anything else).
EDIT:
This can be done for any function or object really.
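The source also shows where the attempt with z <- exp(recent$chpermwage) * recent$wgt goes wrong: the weights only enter through the cumulative sums p and nu, i.e. they act as relative frequencies of the observations, whereas multiplying the incomes by the weights rescales the incomes themselves and changes the distribution. A small sketch of the frequency reading, using integer weights:
library(reldist)
# an integer weight of 2 behaves like observing that income twice
gini(c(1, 2, 3), weights = c(2, 1, 1))  # 0.25
gini(c(1, 1, 2, 3))                     # 0.25, the same value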
This is a bit late, but one may be interested in the concentration/diversity measures contained in the SciencesPo package.