How to make vector pick 3 possible values depending on last value in R

In R I am trying to create a vector of 12 numbers. The first number is 40. From each value, the next one can either go down by 1 with probability 0.5, stay the same with probability 0.2, or go up by 1 with probability 0.3, so each value depends on the previous one. For example, a possible vector could be:
40 39 38 37 37 38 37 36 37 38 39 39
I have tried several different methods but I am unable to get any to work. My latest attempt was:
xx=c()
num <- 60
for (i in 1:12){
xx[i] <- sample(x=c(num,num+5,num+10),size=1,prob = probs)}
Thanks for your help.

Set your initial value and add the difference based on the probabilities you specified:
x <- 40
for (i in 2:12){
x[i] <- x[i-1]+sample(c(-1, 0, 1), size=1, prob=c(.5, .2, .3))
}
x
[1] 40 41 41 40 41 40 39 39 40 39 38 38
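If you would rather avoid the loop, the same random walk can be sketched in vectorized form: draw all 11 steps in one call and take a cumulative sum. This is only an illustrative alternative to the loop above, and the seed is an arbitrary assumption added so the run is reproducible.

```r
set.seed(123)  # arbitrary seed (assumption), only for reproducibility
# draw the 11 increments at once; replace = TRUE because steps can repeat
steps <- sample(c(-1, 0, 1), size = 11, replace = TRUE, prob = c(.5, .2, .3))
# start at 40 and accumulate the increments
x <- cumsum(c(40, steps))
x
```

Each element of x differs from the previous one by -1, 0 or 1, exactly as in the loop version.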

Related

fit a normal distribution to grouped data, giving expected frequencies

I have a frequency distribution of observations, grouped into counts within class intervals.
I want to fit a normal (or other continuous) distribution, and find the expected frequencies in each interval according to that distribution.
For example, suppose the following data, where I want to calculate another column, expected, giving the expected number of soldiers with chest circumferences in the interval given by chest. The intervals are assumed to be centered on the nominal value, e.g., 35 means 34.5 <= y < 35.5. One analysis I've seen gives the expected frequency in this cell as 72.5 vs. the observed 81.
> data(ChestSizes, package="HistData")
>
> ChestSizes
   chest count
1     33     3
2     34    18
3     35    81
4     36   185
5     37   420
6     38   749
7     39  1073
8     40  1079
9     41   934
10    42   658
11    43   370
12    44    92
13    45    50
14    46    21
15    47     4
16    48     1
>
> # ungroup to a vector of values
> chests <- vcdExtra::expand.dft(ChestSizes, freq="count")
There are quite a number of variations of this question, most of which relate to plotting the normal density on top of a histogram, scaled to represent counts not density. But none explicitly show the calculation of the expected frequencies. One close question is R: add normal fits to grouped histograms in ggplot2
I can perfectly well do the standard plot (below), but for other things, like a Chi-square test or a vcd::rootogram plot, I need the expected frequencies in the same class intervals.
library(ggplot2)
bw <- 1
n_obs <- nrow(chests)
xbar <- mean(chests$chest)
std <- sd(chests$chest)
plt <-
  ggplot(chests, aes(chest)) +
  geom_histogram(color = "black", fill = "lightblue", binwidth = bw) +
  # normal density scaled by bin width and sample size so it matches counts
  stat_function(fun = function(x)
    dnorm(x, mean = xbar, sd = std) * bw * n_obs,
    color = "darkred", size = 1)
plt
Here is how you could calculate the expected frequencies for each group, assuming normality.
xbar <- with(ChestSizes, weighted.mean(chest, count))
sdx <- with(ChestSizes, sd(rep(chest, count)))
transform(ChestSizes, Expected = diff(pnorm(c(32, chest) + .5, xbar, sdx)) * sum(count))
   chest count     Expected
1     33     3    4.7600583
2     34    18   20.8822328
3     35    81   72.5129162
4     36   185  199.3338028
5     37   420  433.8292832
6     38   749  747.5926687
7     39  1073 1020.1058521
8     40  1079 1102.2356155
9     41   934  943.0970605
10    42   658  638.9745241
11    43   370  342.7971793
12    44    92  145.6089948
13    45    50   48.9662992
14    46    21   13.0351612
15    47     4    2.7465640
16    48     1    0.4579888
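Since the question mentions a chi-square test, here is a rough sketch of how the expected frequencies could feed into a Pearson goodness-of-fit statistic. The counts are typed in from the table in the question so the snippet runs without HistData. The names expected, chisq, dof and p_value are mine, and subtracting three degrees of freedom (one for the total, two for the estimated mean and standard deviation) is the usual textbook convention, assumed here rather than taken from the answer. Several expected counts in the tails are below 5, so the p-value should be read as approximate.

```r
# counts copied from the ChestSizes table in the question
chest <- 33:48
count <- c(3, 18, 81, 185, 420, 749, 1073, 1079, 934, 658, 370, 92, 50, 21, 4, 1)

# same estimates as in the answer: weighted mean and sd of the ungrouped values
xbar <- weighted.mean(chest, count)
sdx  <- sd(rep(chest, count))

# expected counts in [chest - 0.5, chest + 0.5) under the fitted normal
expected <- diff(pnorm(c(32, chest) + 0.5, xbar, sdx)) * sum(count)

# Pearson chi-square statistic and an approximate p-value
chisq   <- sum((count - expected)^2 / expected)
dof     <- length(count) - 1 - 2   # assumption: two parameters were estimated
p_value <- pchisq(chisq, df = dof, lower.tail = FALSE)
c(chisq = chisq, df = dof, p_value = p_value)
```

Note that the expected counts do not quite sum to the total, because the normal tails below 32.5 and above 48.5 are left out.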

How to calculate 95% confidence interval for a proportion in R?

Assume I own a factory that produces 150 screws a day and there is a 22% error rate. Now I am going to estimate how many screws are faulty each day for a year (365 days) with
rbinom(n = 365, size = 150, prob = 0.22)
which generates 365 values in this way
45 31 35 31 34 37 33 41 37 37 26 32 37 38 39 35 44 36 25 27 32 25 30 33 25 37 36 31 32 32 43 42 32 33 33 38 26 24 ...................
Now, for each of the values generated, I am supposed to calculate a 95% confidence interval for the proportion of faulty screws on that day.
I am not sure how I can do this. Are there any built-in functions for this (I am not supposed to use any packages), or should I write a new function?
If the number of trials per day is large enough and the probability of failure is not too extreme, you can use the normal approximation (see https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval).
# number of failures, for each of the 365 days
f <- rbinom(365, size = 150, prob = 0.22)
# failure rates
p <- f/150
# 95% confidence interval for the failure rate, for each day
# upper bound
p + 1.96*sqrt((p*(1-p)/150))
# lower bound
p - 1.96*sqrt((p*(1-p)/150))
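As a complement to the normal approximation, base R also ships binom.test, which reports an exact (Clopper-Pearson) interval. It lives in the stats package that is attached by default, so I am assuming it does not count as "using a package". The names n_screws and ci and the seed are my own choices for this sketch.

```r
set.seed(42)     # arbitrary seed (assumption), only for reproducibility
n_screws <- 150  # screws produced per day, as in the question
f <- rbinom(365, size = n_screws, prob = 0.22)

# one exact 95% interval per day; sapply returns a 2 x 365 matrix, so transpose
ci <- t(sapply(f, function(k) binom.test(k, n_screws, conf.level = 0.95)$conf.int))
colnames(ci) <- c("lower", "upper")
head(ci)
```

Each row of ci holds the lower and upper bound for one day, and the observed proportion f / n_screws lies inside its interval.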

Use vector to make probability table

In the form of a probability table, I'd like to illustrate a vector of quantiles divisible by 7 and 5, for marginal probability distributions, and 5 given 7, for conditional probability.
Let's assume this is my data:
> prop.table(table(x)) # discrete number and its probability
      20       22       23       24       25       26       27       28       29       30       31
0.000152 0.000625 0.000796 0.001224 0.003138 0.003043 0.004549 0.006444 0.005938 0.009301 0.009456
      32       33       34       35       36       37       38       39       40       41       42
0.013448 0.019839 0.018596 0.026613 0.028902 0.027377 0.035156 0.041379 0.041092 0.047733 0.055827
      43       44       45       46       47       48       49       50       51       52       53
0.046099 0.051624 0.055131 0.049779 0.056992 0.049801 0.052912 0.031924 0.049114 0.022880 0.042279
      54       55       56       57       58       59       61       63       65
0.013946 0.032340 0.003466 0.021240 0.001227 0.011734 0.005115 0.001491 0.000278
How can I turn this into a two-way probability table that shows which numbers are divisible by 7 and/or 5 for marginal and conditional probability?
This is what I'd hope the table to look like
      Yes      No       # Probability of numbers divisible by 7
Yes   0.02754  0.02886
No    0.02656  0.02831
# Probability of numbers divisible by 5
x <- sample(1:100, 100, replace = TRUE)
# %% is the mod operator, which gives the remainder after the division of the left-hand side by the right-hand side. x %% y == 0 therefore returns TRUE if x is divisible by y
db5 <- x %% 5 == 0
db7 <- x %% 7 == 0
table(db5, db7) / length(x)
#        db7
# db5     FALSE TRUE
#   FALSE  0.62 0.13
#   TRUE   0.24 0.01
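To get from the joint table to the marginal and conditional probabilities the question asks about, something along these lines may help. It is a sketch on deterministic toy data (x <- 1:70 is my assumption, chosen so the numbers can be checked by hand), not on the asker's data. The names joint, with_margins and cond_5_given_7 are illustrative.

```r
x <- 1:70                      # toy data (assumption): each value occurs once
db5 <- x %% 5 == 0             # divisible by 5
db7 <- x %% 7 == 0             # divisible by 7

tab   <- table(db5, db7)
joint <- prop.table(tab)       # joint probabilities, cells sum to 1

# marginal probabilities appear in the added "Sum" row and column
with_margins <- addmargins(joint)

# conditional on db7: every column is rescaled to sum to 1,
# so the TRUE column gives P(divisible by 5 | divisible by 7)
cond_5_given_7 <- prop.table(tab, margin = 2)

with_margins
cond_5_given_7
```

With 1:70 there are 10 multiples of 7, of which 35 and 70 are also multiples of 5, so the conditional probability of "divisible by 5 given divisible by 7" comes out as 2/10.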

R error type "Subscript out of bounds"

I am simulating a correlation matrix, where the 60 variables correlate in the following way:
more highly (0.6) for every two variables (1-2, 3-4... 59-60)
moderate (0.3) for every group of 12 variables (1-12,13-24...)
mc <- matrix(0,60,60)
diag(mc) <- 1
for (c in seq(1,59,2)){ # every pair of variables in order are given 0.6 correlation
mc[c,c+1] <- 0.6
mc[c+1,c] <- 0.6
}
for (n in seq(1,51,10)){ # every group of 12 are given correlation of 0.3
for (w in seq(12,60,12)){ # these are variables 11-12, 21-22 and such.
mc[n:n+1,c(n+2,w)] <- 0.2
mc[c(n+2,w),n:n+1] <- 0.2
}
}
for (m in seq(3,9,2)){ # every group of 12 are given correlation of 0.3
for (w in seq(12,60,12)){ # these variables are the rest.
mc[m:m+1,c(1:m-1,m+2:w)] <- 0.2
mc[c(1:m-1,m+2:w),m:m+1] <- 0.2
}
}
The first loop works well, but not the second and third ones. I get this error message:
Error in `[<-`(`*tmp*`, m:m + 1, c(1:m - 1, m + 2:w), value = 0.2) :
subscript out of bounds
Error in `[<-`(`*tmp*`, m:m + 1, c(1:m - 1, m + 2:w), value = 0.2) :
subscript out of bounds
I would really appreciate any hints, since I don't see the loop commands get to exceed the matrix dimensions. Thanks a lot in advance!
Note that : takes precedence over +. E.g., n:n+1 is the same as n+1. I guess you want n:(n+1).
The maximal value of w is 60:
w <- 60
m <- 1
m+2:w
#[1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#[49] 51 52 53 54 55 56 57 58 59 60 61
And 61 is out of bounds. You need to add a lot of parentheses.
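To make the precedence point concrete, here is a small sketch comparing each unparenthesized expression with the version that was probably intended. The values n <- 3 and w <- 12 are arbitrary choices for illustration, and the variable names are mine.

```r
n <- 3
w <- 12

a1 <- n:n + 1     # parsed as (n:n) + 1, a single value: 4
a2 <- n:(n + 1)   # the intended pair: 3 4

b1 <- 1:n - 1     # parsed as (1:n) - 1: 0 1 2 (note the 0 index)
b2 <- 1:(n - 1)   # the intended range: 1 2

c1 <- n + 2:w     # parsed as n + (2:w): 5 6 ... 15, which overshoots w
c2 <- (n + 2):w   # the intended range: 5 6 ... 12
```

In the matrix code, the overshooting form is what pushes the column index past 60.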

How to return back the imputed values in R

Is there any function in R that can help to return imputed values, for example:
x <- c(23,23,25,43,34,22,78,NA,98,23,30,NA,21,78,22,76,NA,77,33,98,22,NA,52,87,NA,23,
23)
Using single linear imputation with na.approx (from the zoo package),
na.approx(x)
I get the imputed data:
[1] 23 23 25 43 34 22 78 35 98 23 30 24 21 78 22 76 22 77 33 98 22 14 52 87 59
[26] 23 23
How can I get the imputed value from the program back without looking at the completed dataset one by one? For example, if the data I imputed contain $n=200$ observations, can I get 20 estimates of the missing value?
I am not 100 percent sure I understood you correctly, but does this help?
First, save the positions of the original NA values (e.g., the first NA value is at the 8th position) in the dummy variable:
dummy<-NA
for (i in 1:length(x)){
if(is.na(x[i])) dummy[i]<-i
}
Now get the corresponding values in the imputed data
imputeddata<-na.approx(x)
for (i in 1:length(imputeddata)){
if(!is.na(imputeddata[dummy[i]])) print(imputeddata[dummy[i]])
}
You could use is.na to select only those values that were previously NA.
> x <- c(23,23,25,43,34,22,78,NA,98,23,30,NA,21,78,22,76,NA,77,33,98,22,NA,52,87,NA,23,23)
> na.approx(x)[is.na(x)]
[1] 88.0 25.5 76.5 37.0 55.0
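If you also want to see where each imputed value sits, a small table of positions and values can be built. The sketch below uses base R's approx instead of zoo::na.approx so that it runs without extra packages; I am assuming the two agree for NAs that lie between observed values, which holds for plain linear interpolation. The names idx, ok and res are illustrative.

```r
x <- c(23,23,25,43,34,22,78,NA,98,23,30,NA,21,78,22,76,NA,77,33,98,22,NA,52,87,NA,23,23)

idx <- which(is.na(x))   # positions of the missing values
ok  <- !is.na(x)         # observed values used as interpolation knots

# linear interpolation at the missing positions only
imputed <- approx(x = which(ok), y = x[ok], xout = idx)$y

res <- data.frame(position = idx, imputed = imputed)
res
```

This gives one row per missing value, so with 200 observations and 20 NAs you would get 20 rows.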
Hope that helps.
