Say I know the probability of some data:
A: 2%
B: 55%
C: 43%
In a sample of 30 randomly selected items containing A, B and C, I want to know the probability of, say, B occurring fewer than 5 times.
Currently I have:
dmultinom(x=c(0,5,0), prob = c(0.02, 0.55, 0.43))
How would I go about doing this in R? I can solve it on paper no problem, but I'm not quite sure how to do it programmatically, or whether I'm using the right function. Appreciate the help!
Since the multinomial distribution is discrete, dmultinom is its probability mass function, so you can use it to calculate the probability of specific configurations. Some examples:
> dmultinom(x = c(10, 5, 15), size = 30, prob = c(0.02, 0.55, 0.43))
[1] 7.627047e-13
> dmultinom(x = c(20, 5, 5), size = 30, prob = c(0.02, 0.55, 0.43))
[1] 5.873928e-28
> dmultinom(x=c(2,25,3), size = 30, prob = c(0.02, 0.55, 0.43))
[1] 1.463409e-05
> dmultinom(x=c(3,17,10), size = 30, prob = c(0.02, 0.55, 0.43))
[1] 0.002283587
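If you specifically want the probability that B occurs fewer than 5 times out of 30, note that the marginal distribution of a single category of a multinomial is binomial, so one option is pbinom; you can also verify this by summing dmultinom over every configuration with fewer than 5 B's:
pbinom(4, size = 30, prob = 0.55)  # P(B <= 4), i.e. B occurring fewer than 5 times in 30 draws

# Equivalent check: sum dmultinom over all configurations with fewer than 5 B's
p <- 0
for (b in 0:4) {
  for (a in 0:(30 - b)) {
    p <- p + dmultinom(x = c(a, b, 30 - a - b),
                       size = 30, prob = c(0.02, 0.55, 0.43))
  }
}
p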
I would like to calculate a weighted correlation between two variables, where each variable has its own weights.
Some example data:
DF = data.frame(
  x = c(-0.3, 0.3, -0.18, 0.02, 0.07, 0.11, 0.20, 0.8, 0.3, -0.4),
  x_weight = c(50, 40, 70, 5, 15, 30, 32, 13, 9, 19),
  y = c(-0.6, 0.25, 0.1, 0.3, 0.3, -0.05, -0.5, 1, 0.05, -0.6),
  y_weight = c(70, 8, 10, 39, 9, 49, 90, 77, 23, 75)
)
DF
I read about cov.wt in the stats package, but it only allows input for one vector of weights. Essentially I'm looking for similar inputs as wtd.t.test, but to calculate a correlation instead.
Thank you for your help!
You can calculate the weighted correlation between two variables using the following formulas, which are based on the definitions of weighted covariance and correlation.
First, calculate the weighted means for the two variables using the weights:
mu_x = sum(DF$x * DF$x_weight) / sum(DF$x_weight)
mu_y = sum(DF$y * DF$y_weight) / sum(DF$y_weight)
Next, calculate the weighted covariance between the two variables:
cov_xy = sum((DF$x - mu_x) * (DF$y - mu_y) * DF$x_weight * DF$y_weight) / sum(DF$x_weight * DF$y_weight)
Finally, calculate the weighted correlation between the two variables:
cor_xy = cov_xy / (sqrt(sum((DF$x - mu_x)^2 * DF$x_weight) / sum(DF$x_weight)) * sqrt(sum((DF$y - mu_y)^2 * DF$y_weight) / sum(DF$y_weight)))
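Putting these steps together, a small helper makes this reusable (the function name weighted_cor is just my choice):
weighted_cor <- function(x, y, wx, wy) {
  # weighted means, each variable with its own weights
  mu_x <- sum(x * wx) / sum(wx)
  mu_y <- sum(y * wy) / sum(wy)
  # weighted covariance, weighting each pair by the product of its weights
  w <- wx * wy
  cov_xy <- sum((x - mu_x) * (y - mu_y) * w) / sum(w)
  # weighted variances
  var_x <- sum((x - mu_x)^2 * wx) / sum(wx)
  var_y <- sum((y - mu_y)^2 * wy) / sum(wy)
  cov_xy / sqrt(var_x * var_y)
}

weighted_cor(DF$x, DF$y, DF$x_weight, DF$y_weight)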
I'm using nsga2R to get the fitness value of the best solution, which has 10 variables. However, I would like to keep 4 of them fixed through all generations while the other 6 are generated randomly by the algorithm. How do I do that with the nsga2R optimization algorithm?
Sample of the code that I'm using now:
NSGA <- nsga2R(fn = function(x) myfitnessFun(x, m, 10), varNo = 10, objDim = 2, generations = 1,
               mprob = 0.2, popSize = 50, cprob = 0.8,
               lowerBounds = rep(1, 10), upperBounds = rep(N, 10))
I'm looking to find the best 10 sensor locations out of N sensors that satisfy our fitness function. The question is how to fix, for example, 4 of these 10, while the remaining 6 are selected randomly.
Our data contains the coordinates of these sensors; here is a sample of this data:
(structure(c(47.4, 47.6105263157895, 47.8210526315789, 48.0315789473684,
5.71, 5.71, 5.71, 5.71, 0, 0, 0, 0), .Dim = c(4L, 3L)))
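One possible approach (just a sketch of a common workaround, not a documented nsga2R feature) is to let nsga2R search over only the 6 free positions and re-attach the 4 fixed sensor indices inside a wrapper around the fitness function:
library(nsga2R)

# Sketch only: the 4 fixed sensor indices below are hypothetical placeholders
fixed_sensors <- c(1, 2, 3, 4)

# Optimize only the 6 free positions; re-attach the fixed ones before evaluating.
# (Putting the fixed positions first in x_full is an assumption about how
#  myfitnessFun interprets its input vector.)
wrapped_fitness <- function(x_free) {
  x_full <- c(fixed_sensors, x_free)
  myfitnessFun(x_full, m, 10)
}

NSGA <- nsga2R(fn = wrapped_fitness, varNo = 6, objDim = 2, generations = 1,
               mprob = 0.2, popSize = 50, cprob = 0.8,
               lowerBounds = rep(1, 6), upperBounds = rep(N, 6))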
I'm currently trying to do a stratified split in R to create train and test datasets.
A problem posed to me is the following
split the data into a train and test sample such that 70% of the data
is in the train sample. To ensure a similar distribution of price
across the train and test samples, use createDataPartition from the
caret package. Set groups to 100 and use a seed of 1031. What is the
average house price in the train sample?
The dataset is a set of houses with prices (along with other data points)
For some reason, when I run the following code, the output I get is labeled as incorrect in the practice problem simulator. Can anyone spot an issue with my code? Any help is much appreciated since I'm trying to avoid learning this language incorrectly.
dput(head(houses))
library(ISLR); library(caret); library(caTools)
options(scipen=999)
set.seed(1031)
#STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
split = createDataPartition(y = houses$price, p = 0.7, list = F, groups = 100)
train = houses[split,]
test = houses[-split,]
nrow(train)
nrow(test)
nrow(houses)
mean(train$price)
mean(test$price)
Output
> dput(head(houses))
structure(list(id = c(7129300520, 6414100192, 5631500400, 2487200875,
1954400510, 7237550310), price = c(221900, 538000, 180000, 604000,
510000, 1225000), bedrooms = c(3, 3, 2, 4, 3, 4), bathrooms = c(1,
2.25, 1, 3, 2, 4.5), sqft_living = c(1180, 2570, 770, 1960, 1680,
5420), sqft_lot = c(5650, 7242, 10000, 5000, 8080, 101930), floors = c(1,
2, 1, 1, 1, 1), waterfront = c(0, 0, 0, 0, 0, 0), view = c(0,
0, 0, 0, 0, 0), condition = c(3, 3, 3, 5, 3, 3), grade = c(7,
7, 6, 7, 8, 11), sqft_above = c(1180, 2170, 770, 1050, 1680,
3890), sqft_basement = c(0, 400, 0, 910, 0, 1530), yr_built = c(1955,
1951, 1933, 1965, 1987, 2001), yr_renovated = c(0, 1991, 0, 0,
0, 0), age = c(59, 63, 82, 49, 28, 13)), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
>
> library(ISLR); library(caret); library(caTools)
> options(scipen=999)
>
> set.seed(1031)
> #STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
> split = createDataPartition(y = houses$price, p = 0.7, list = F, groups = 100)
>
> train = houses[split,]
> test = houses[-split,]
>
> nrow(train)
[1] 15172
> nrow(test)
[1] 6441
> nrow(houses)
[1] 21613
>
> mean(train$price)
[1] 540674.2
> mean(test$price)
[1] 538707.6
I tried to reproduce it manually using sample_frac from the dplyr package and the cut2 function from the Hmisc package. The results are almost the same, but still not identical.
It looks like there may be a difference in the pseudo-random number generation or in some rounding.
In my opinion your code is correct.
Is it possible that in a previous step you were supposed to remove some outliers or pre-process the dataset in some way?
library(caret)
options(scipen=999)
library(dplyr)
library(ggplot2) # to use diamonds dataset
library(Hmisc)
diamonds$index = 1:nrow(diamonds)
set.seed(1031)
# I use diamonds dataset from ggplot2 package
# g parameter (in cut2) - number of quantile groups
split = diamonds %>%
  group_by(cut2(price, g = 100)) %>%
  sample_frac(0.7) %>%
  pull(index)
train = diamonds[split,]
test = diamonds[-split,]
> mean(train$price)
[1] 3932.75
> mean(test$price)
[1] 3932.917
set.seed(1031)
#STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
split = createDataPartition(y = diamonds$price, p = 0.7, list = T, groups = 100)
train = diamonds[split$Resample1,]
test = diamonds[-split$Resample1,]
> mean(train$price)
[1] 3932.897
> mean(test$price)
[1] 3932.572
This sampling procedure should produce sample means that approximate the population mean.
Is there a way to generate a random sample from a higher-order Markov chain? I used the clickstream package to estimate a 2nd-order Markov chain and I'm now trying to generate a sample from it. I understand how to do this from a transition matrix with the randomClickstreams function, but that only works for a 1st-order Markov chain.
Here's a reproducible example where we generate a sample from a transition matrix and then fit a 2nd-order Markov chain on the sample:
library(clickstream)

trans_mat <- matrix(c(0, 0.2, 0.7, 0, 0.1,
0.2, 0, 0.5, 0, 0.3,
0.1, 0.1, 0.1, 0.7, 0,
0, 0.4, 0.2, 0.1, 0.3,
0, 0 , 0 , 0, 1), nrow = 5)
cls <- randomClickstreams(states = c("P1", "P2", "P3", "P4", "end"),
startProbabilities = c(0.5, 0.5, 0, 0, 0),
transitionMatrix = trans_mat,
meanLength = 20, n = 1000)
# fit a 2nd-order Markov chain:
mc <- fitMarkovChain(clickstreamList = cls, order = 2,
control = list(optimizer = "quadratic"))
The fitted model is made of 2 transition matrices and 2 lambda parameters:
How can I then use these elements to create a random sample of, say, 10000 journeys?
I have real data and predicted data and I want to calculate overall MAPE and MSE. The data are time series, with each column representing data for a different week. I predict a value for each of the 52 weeks for each of the items, as shown below. What would be the best way to calculate the overall error in R?
real = matrix(
c("item1", "item2", "item3", "item4", .5, .7, 0.40, 0.6, 0.3, 0.29, 0.7, 0.09, 0.42, 0.032, 0.3, 0.37),
nrow=4,
ncol=4)
colnames(real) <- c("item", "week1", "week2", "week3")
predicted = matrix(
c("item1", "item2", "item3", "item4", .55, .67, 0.40, 0.69, 0.13, 0.9, 0.47, 0.19, 0.22, 0.033, 0.4, 0.37),
nrow=4,
ncol=4)
colnames(predicted) <- c("item", "week1", "week2", "week3")
How do you get the predicted values in the first place? The model you use to get the predicted values is probably based on minimising some function of prediction errors (usually MSE). Therefore, if you have already calculated your predicted values, the residuals and some MSE/MAPE-like metrics have been computed somewhere along the line in fitting the model, and you can probably retrieve them directly.
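For example, if the predictions came from a fitted model object such as one from lm (this is an assumption about your workflow; the built-in cars dataset is used purely for illustration), the residuals are already stored on the object:
fit <- lm(dist ~ speed, data = cars)  # illustrative model, not your actual one
res <- residuals(fit)                 # prediction errors are already available
mse <- mean(res^2)
mape <- mean(abs(res / cars$dist)) * 100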
If the predicted values happened to be thrown into your lap and you have nothing to do with fitting the model, then you calculate MSE and MAPE as per below:
You have only one record per week for every item. So for every item, you can only calculate one prediction error per week. Depending on your application, you can choose to calculate the MSE and MAPE per item or per week.
This is what your data looks like:
real <- matrix(
c(.5, .7, 0.40, 0.6, 0.3, 0.29, 0.7, 0.09, 0.42, 0.032, 0.3, 0.37),
nrow = 4, ncol = 3)
colnames(real) <- c("week1", "week2", "week3")
predicted <- matrix(
c(.55, .67, 0.40, 0.69, 0.13, 0.9, 0.47, 0.19, 0.22, 0.033, 0.4, 0.37),
nrow = 4, ncol = 3)
colnames(predicted) <- c("week1", "week2", "week3")
Calculate the (percentage/squared) errors for every entry:
pred_error <- real - predicted
pct_error <- pred_error/real
squared_error <- pred_error^2
Calculate MSE, MAPE:
# For per-item prediction errors
apply(squared_error, MARGIN = 1, mean) # MSE
apply(abs(pct_error), MARGIN = 1, mean) # MAPE
# For per-week prediction errors
apply(squared_error, MARGIN = 2, mean) # MSE
apply(abs(pct_error), MARGIN = 2, mean) # MAPE
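If by "overall" you mean a single number across all items and weeks (my reading of the question), you can simply average over the whole error matrix:
# Overall metrics across all items and weeks
mean(squared_error)   # overall MSE
mean(abs(pct_error))  # overall MAPE (multiply by 100 for a percentage)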