Generating chi-square data - r

I would like to generate data following a chi-square distribution for a population of N = 30. However, I don't know whether this is correct:
dchisq(1, df = 1:30)
# [1] 2.419707e-01 3.032653e-01 2.419707e-01 1.516327e-01 8.065691e-02
# [6] 3.790817e-02 1.613138e-02 6.318028e-03 2.304483e-03 7.897535e-04
# [11] 2.560537e-04 7.897535e-05 2.327761e-05 6.581279e-06 1.790585e-06
# [16] 4.700913e-07 1.193723e-07 2.938071e-08 7.021903e-09 1.632262e-09
# [21] 3.695738e-10 8.161308e-11 1.759875e-11 3.709686e-12 7.651632e-13
# [26] 1.545702e-13 3.060653e-14 5.945009e-15 1.133575e-15 2.123217e-16

dchisq() returns the density of the distribution at a given point, not random values. To generate 30 random chi-squared variates, use the rchisq() function:
rchisq(n, df, ncp = 0)
Replace n with 30 and df with the number of degrees of freedom you require. See ?rchisq for details.
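As a minimal sketch of the difference between the two functions: the degrees of freedom (k = 5) and the seed below are assumptions chosen only for illustration, not values from the question.

```r
set.seed(42)             # assumption: any seed, only for reproducibility
n <- 30                  # sample size from the question
k <- 5                   # hypothetical degrees of freedom

x <- rchisq(n, df = k)   # 30 random draws from a chi-square distribution
length(x)                # number of draws
mean(x)                  # should be roughly k, the theoretical mean

# dchisq() evaluates the density at a point instead of drawing values
dchisq(1, df = k)
```

The sample mean will only be close to k, since 30 draws is a small sample.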

Create matrix of features using regex?

Suppose I have a data frame of 101 variables. I select one so-called Y as a dependent variable, and the remaining 100 so-called x_1, X_2,...,X_{100} as independent ones.
Now I would like to create a matrix containing the 100 independent variables. Is there a direct way to do it, similar to how "." stands for all remaining variables in a linear regression formula, i.e. lm(Y ~ ., _____)?
You can use the grep() function to extract the column names of the independent variables from the data frame, then convert the selected columns to a matrix. Please see the code below:
# simulation of the data frame with 100 measurements and 101 variables
n <- 100
df <- data.frame(matrix(1:(101 * n), ncol = 101)) # n rows, 101 columns
names(df) <- c(paste0("X_", 1:100), "Y")
# extract matrix of Xs
m_x <- as.matrix(df[, grep("^X", names(df))])
dimnames(m_x)
Output:
[[1]]
NULL
[[2]]
[1] "X_1" "X_2" "X_3" "X_4" "X_5" "X_6" "X_7" "X_8" "X_9" "X_10" "X_11" "X_12" "X_13" "X_14" "X_15"
[16] "X_16" "X_17" "X_18" "X_19" "X_20" "X_21" "X_22" "X_23" "X_24" "X_25" "X_26" "X_27" "X_28" "X_29" "X_30"
[31] "X_31" "X_32" "X_33" "X_34" "X_35" "X_36" "X_37" "X_38" "X_39" "X_40" "X_41" "X_42" "X_43" "X_44" "X_45"
[46] "X_46" "X_47" "X_48" "X_49" "X_50" "X_51" "X_52" "X_53" "X_54" "X_55" "X_56" "X_57" "X_58" "X_59" "X_60"
[61] "X_61" "X_62" "X_63" "X_64" "X_65" "X_66" "X_67" "X_68" "X_69" "X_70" "X_71" "X_72" "X_73" "X_74" "X_75"
[76] "X_76" "X_77" "X_78" "X_79" "X_80" "X_81" "X_82" "X_83" "X_84" "X_85" "X_86" "X_87" "X_88" "X_89" "X_90"
[91] "X_91" "X_92" "X_93" "X_94" "X_95" "X_96" "X_97" "X_98" "X_99" "X_100"
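Two other ways to get the same matrix without a regular expression, sketched on a small made-up data frame (3 predictors instead of 100, random values, all numeric). The names m_setdiff and m_formula are hypothetical. The second variant uses the formula interface the question asks about and drops the intercept column that model.matrix() adds.

```r
set.seed(1)
# small stand-in for the 101-column data frame (assumption: all columns numeric)
df <- data.frame(matrix(rnorm(20 * 4), ncol = 4))
names(df) <- c("X_1", "X_2", "X_3", "Y")

# variant 1: select every column except the dependent variable
m_setdiff <- as.matrix(df[, setdiff(names(df), "Y")])

# variant 2: formula interface; "." means "all columns except Y"
m_formula <- model.matrix(Y ~ ., data = df)[, -1]  # drop "(Intercept)"

colnames(m_setdiff)
colnames(m_formula)
```

With factor predictors, model.matrix() would expand them into dummy columns, so the two variants only agree for numeric data.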

most occurring value in a vector

I have a vector of 100 values, all generated uniformly at random between 0 and 1.
x <- runif(100,min=0,max=1)
x
[1] 0.84620011 0.82525410 0.31622827 0.08040362 0.12894525 0.23997187 0.57177296 0.91691368 0.65751720
[10] 0.39810175 0.60632205 0.26339035 0.93543618 0.09662383 0.35147739 0.51731042 0.29151612 0.54411769
[19] 0.73688309 0.26086586 0.37808273 0.19163366 0.62776847 0.70973345 0.31802726 0.69101574 0.50042561
[28] 0.20768256 0.23555818 0.21015820 0.18221151 0.85593725 0.12916935 0.52222127 0.62269135 0.51267707
[37] 0.60164023 0.30723904 0.81990231 0.61771762 0.02502631 0.47427724 0.21250040 0.88611710 0.88648546
[46] 0.92586513 0.57015942 0.33454379 0.03572245 0.68120369 0.48692522 0.76587764 0.55214917 0.31137200
[55] 0.47170307 0.48639510 0.68922858 0.73506033 0.23541740 0.81793240 0.17184666 0.06670039 0.55664270
[64] 0.10030533 0.94620061 0.58572228 0.53333567 0.80887841 0.55015406 0.82491114 0.81251132 0.06038019
[73] 0.10918904 0.84011824 0.33169617 0.03568364 0.07703029 0.15601158 0.31623253 0.25021777 0.77024833
[82] 0.88588620 0.49044305 0.10165930 0.55494697 0.17455070 0.94458467 0.43135868 0.99313733 0.04482747
[91] 0.53453604 0.52500493 0.35496966 0.06994880 0.11377845 0.71307042 0.35086237 0.04032254 0.23744845
[100] 0.81131033
Out of all the values in the vector, I need to find the most frequently occurring value (or something close to it). I'm new to R and have no idea how to do this. Please help.
One approach I have in mind: divide the values into ranges and compute the frequency distribution. Would that be helpful?
One way to analyze the distribution of the numbers is to plot a histogram and overlay an approximate probability density curve.
This can be done with the ggplot2 library:
set.seed(123) # used here for reproducibility
x <- runif(100) # pseudo-random numbers between 0 and 1
library(ggplot2)
p <- ggplot(as.data.frame(x), aes(x = x, y = after_stat(density))) +
  geom_histogram(fill = "lightblue", colour = "grey60", bins = 50) +
  geom_density()
p # print the plot
The bins value passed to geom_histogram() is the number of bars in the histogram. You may want to change this value to obtain a different representation of the distribution.
OR
You could use base R and plot a simple histogram:
hist(x)
There you can also change the bin width (see the breaks argument), but the default might be sufficient to show the concept.
You can identify which bin in this histogram has the most entries with
> h <- hist(x)
> h$mids[which.max(h$counts)]
#[1] 0.45
In this case, most values occur near 0.45 (the midpoint of the bin covering the range from 0.4 to 0.5).
Hope this helps.
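A small sketch of how the answer depends on the binning: the helper fullest_bin() below is hypothetical (not part of base R) and uses plot = FALSE so that no plot is drawn while the counts are computed.

```r
set.seed(123)
x <- runif(100)

# hypothetical helper: limits and midpoint of the bin with the most entries
fullest_bin <- function(x, breaks = "Sturges") {
  h <- hist(x, breaks = breaks, plot = FALSE)  # compute counts only
  i <- which.max(h$counts)                     # first bin with the top count
  c(lower = h$breaks[i], mid = h$mids[i], upper = h$breaks[i + 1])
}

fullest_bin(x)               # default binning
fullest_bin(x, breaks = 20)  # finer binning may pick a different region
```

For uniform data the fullest bin is largely an accident of the sample, so the result can move noticeably when breaks changes.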
You can do this:
set.seed(12)
x <- runif(100,min=0,max=1)
n <- length(x)
x_cut<-cut(x, breaks = n/4)
which(table(x_cut)==max(table(x_cut)))
The result depends on the breaks value you set. This is an alternative to using a histogram if you don't need one.
To get exactly the most frequent value, for example when the input is discrete, you could simply create a table, sort it, and return the name of the first entry:
values <- c("a", "a", "c", "c", "c")
names(sort(table(values), decreasing = TRUE)[1])
#> [1] "c"
Breaking it down:
# create a table of the values
table(values)
#> a c
#> 2 3
# sort the table descending on number of occurrences
sort(table(values), decreasing = TRUE)
#> c a
#> 3 2
# now only keep the first value
sort(table(values), decreasing = TRUE)[1]
#> c
#> 3
# so the final line:
names(sort(table(values), decreasing = TRUE)[1])
#> [1] "c"
If you want to reuse this, wrap it in a function:
get_mode <- function(x) {
  names(sort(table(x), decreasing = TRUE)[1])
}
get_mode(values)
#> [1] "c"
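For continuous data such as the runif() values in the question, every value is usually unique, so a table-based mode is not informative. One alternative, sketched here as an assumption rather than as part of the answers above, is to take the location of the peak of a kernel density estimate from density(). The name mode_est is hypothetical.

```r
set.seed(123)
x <- runif(100)

d <- density(x)                  # kernel density estimate, default bandwidth
mode_est <- d$x[which.max(d$y)]  # grid point where the estimate peaks
mode_est
```

The estimate depends on the bandwidth (the bw argument of density()), in the same way the histogram result depends on the number of bins.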

Generating 100 uniform different random variables in R code

I am trying to generate 100 samples of Z, where Z is the sum of 8 independent random variables uniformly distributed on the interval [0, 1].
I have the following code so far, but I'm not sure whether it, and in particular my loop, is correct:
eight <- runif(8, 0, 1) # generate 8 uniform(0, 1) random variables
Z_1 <- sum(eight) # calculate the sum and store it in Z_1
sample <- NA
for (i in 1:100) { # repeat for 100 samples
  eight <- runif(8, 0, 1) # draw 8 fresh independent uniform(0, 1) values
  Z_1 <- sum(eight) # store their sum in Z_1
  sample[i] <- Z_1 # save the i-th sum
}
Thanks
I would vectorize the whole thing. There is no real reason to run 100 iterations when you can generate all 800 observations in one call. Then just use matrix and colSums and you are done:
set.seed(123)
n <- 100
Z <- colSums(matrix(runif(8 * n), 8, n))
Z
# [1] 4.774425 4.671171 4.787691 4.804041 3.513257 2.330163 3.300135 3.568657 5.503481 2.861764 4.533283 3.353658
# [13] 4.230073 4.690739 4.364708 3.094156 4.933616 3.942834 3.712522 2.587036 3.731474 4.388749 4.484030 4.315968
# [25] 4.800758 4.252631 2.716972 5.159044 4.146881 3.244546 4.418618 4.350035 5.344182 3.176801 3.903337 2.261935
# [37] 3.646572 4.286075 3.074900 4.210506 3.172513 4.535665 4.245856 4.184848 4.532286 2.899883 4.473629 4.751224
# [49] 3.498038 3.337437 4.238989 3.349812 3.888696 4.748254 3.029479 4.246619 3.330434 3.879168 3.786216 3.839956
# [61] 3.878997 4.546531 2.863010 3.417072 4.266108 3.141875 4.960758 3.989613 4.373042 4.295742 4.459014 5.561066
# [73] 4.401990 4.121301 3.830575 3.412446 3.812347 5.591238 3.801587 4.454336 4.213343 5.222007 4.300991 2.765003
# [85] 3.293251 5.362586 2.954080 3.036312 3.655677 3.373603 5.575184 4.167740 3.904038 3.884440 2.901452 3.079311
# [97] 4.927770 3.930943 4.169907 2.922618
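A quick sanity check, sketched under the assumption that both versions start from the same seed: because matrix() fills column by column, each column holds 8 consecutive draws, so the vectorized sums should match a draw-8-and-sum approach written with replicate(). The names Z_vec and Z_rep are hypothetical.

```r
n <- 100

set.seed(123)
Z_vec <- colSums(matrix(runif(8 * n), nrow = 8, ncol = n))  # vectorized

set.seed(123)
Z_rep <- replicate(n, sum(runif(8)))  # one sum of 8 draws per replication

all.equal(Z_vec, Z_rep)  # same sums up to floating-point tolerance

# theoretical values for a sum of 8 uniform(0, 1) variables: mean 4, variance 8/12
mean(Z_vec)
var(Z_vec)
```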

covariance table for more variables

I've got three variables a, b and c. Each one is a factor with three categories. I wanted to fit a multinomial regression with multinom(), which comes from the nnet package.
library(nnet) # provides multinom()
a <- sample(3, 100, TRUE)
b <- sample(3, 100, TRUE)
c <- sample(3, 100, TRUE)
a <- as.factor(a)
b <- as.factor(b)
c <- as.factor(c)
testus <- multinom(c ~ a + b)
predictors <- expand.grid(b = c("1", "2", "3", "4", "5"), a = c("1", "2", "3", "4", "5"))
p.fit <- predict(testus, predictors, type = 'probs')
probabilities <- data.frame(predictors, p.fit)
Now I have the predicted probabilities for c given a and b.
> head(probabilities)
  b a         X1         X2         X3         X4        X5
1 1 1 0.10609054 0.22599152 0.20107167 0.21953158 0.2473147
2 2 1 0.20886614 0.27207108 0.08613633 0.18276394 0.2501625
3 3 1 0.17041268 0.24995975 0.16234240 0.13111518 0.2861700
4 4 1 0.23704078 0.21179521 0.08493274 0.03135092 0.4348804
5 5 1 0.09494071 0.09659144 0.24162612 0.21812449 0.3487172
6 1 2 0.14059489 0.17793438 0.29272452 0.26104833 0.1276979
The first two columns show the categories of the independent variables a and b. The next five columns show the conditional probabilities, e.g. P(c=1 | b==1 && a==1) = 0.10609.
I need the variance-covariance matrix, so I ran:
vcov(testus)
2:(Intercept) 2:b2 2:b3 2:c2 2:c3 ....
2:(Intercept) .......................................
2:b2 ................................
2:b3 .................
2:c2 ..............
2:c3 .............
3:(Intercept) .............
....
Sorry for pasting only part of the matrix, but otherwise it would be too long. What I would like to have is a variance-covariance matrix for the simultaneous observation of two variables (vcov(a, b&c)). That is, I would like to get the (co)variance between my variable a and the simultaneous observation of b and c, as I created with "probabilities". I would like to get this output:
2:(Intercept) 2:b2&c2 2:b2&c3 ....
2:(Intercept) .......................................
2:b2&c2 ................................
2:b3&c3 .................
3:(Intercept) .............
....
Is this possible?
Perhaps:
testus <- multinom(c ~ a : b)
vcov(testus)
I say 'perhaps' because there is also the possibility of using the c ~ a*b model, and it's not clear exactly what you want. (The statistical question has not been defined, and I would not think this to be a sufficient number of observations to give a stable estimate.) At any rate:
colnames( vcov(testus))
#-----------
[1] "2:(Intercept)" "2:a1:b1" "2:a2:b1"
[4] "2:a3:b1" "2:a1:b2" "2:a2:b2"
[7] "2:a3:b2" "2:a1:b3" "2:a2:b3"
[10] "2:a3:b3" "3:(Intercept)" "3:a1:b1"
[13] "3:a2:b1" "3:a3:b1" "3:a1:b2"
[16] "3:a2:b2" "3:a3:b2" "3:a1:b3"
[19] "3:a2:b3" "3:a3:b3"
rownames( vcov(testus))
#--------
[1] "2:(Intercept)" "2:a1:b1" "2:a2:b1"
[4] "2:a3:b1" "2:a1:b2" "2:a2:b2"
[7] "2:a3:b2" "2:a1:b3" "2:a2:b3"
[10] "2:a3:b3" "3:(Intercept)" "3:a1:b1"
[13] "3:a2:b1" "3:a3:b1" "3:a1:b2"
[16] "3:a2:b2" "3:a3:b2" "3:a1:b3"
[19] "3:a2:b3" "3:a3:b3"
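For comparison, a sketch of the c ~ a*b alternative mentioned above, assuming the nnet package is installed. The data are simulated as in the question, the seed is arbitrary, and the object names fit_star and V are hypothetical. The a*b formula expands to main effects plus interaction, so its covariance matrix has main-effect rows in addition to the a:b rows.

```r
library(nnet)   # provides multinom()

set.seed(1)     # assumption: arbitrary seed for reproducibility
a <- factor(sample(3, 100, TRUE))
b <- factor(sample(3, 100, TRUE))
c <- factor(sample(3, 100, TRUE))

# main effects and interaction; trace = FALSE silences the fitting log
fit_star <- multinom(c ~ a * b, trace = FALSE)

V <- vcov(fit_star)
dim(V)          # one row and column per estimated coefficient
colnames(V)     # names combine the response level with the model term
```

With only 100 random observations the estimates are noisy, so this only illustrates the structure of the matrix.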

interpolate a series over another series in R

Suppose I have a random time series that I want to interpolate over another time series. How would I do this in R?
# generate time interval series from exponential distribution
s = sort(rexp(10))
# scale between 0 and 1
scale01 = function(x){(x-min(x))/(max(x)-min(x))}
s = scale01(s)
> s
[1] 0.00000000 0.02804113 0.05715588 0.10630185 0.15778932 0.20391987 0.26066608 0.27265697 0.39100373
[10] 1.00000000
# generate random normal series
x = rnorm(20)
> x
[1] -0.82530658 0.92289557 0.39827984 -0.62416117 -1.69055539 -0.28164232 -1.32717654 -1.36992509
[9] -1.54352202 -1.09826247 -0.68260576 1.07307043 2.35298180 -0.41472811 0.38919315 -0.27325343
[17] -1.52592682 0.05400849 -0.43498544 0.73841106
# interpolate 'x' over 's' ?
> approx(x,xout=s)
$x
[1] 0.00000000 0.02804113 0.05715588 0.10630185 0.15778932 0.20391987 0.26066608 0.27265697 0.39100373
[10] 1.00000000
$y
[1] NA NA NA NA NA NA NA NA NA
[10] -0.8253066
>
I want to interpolate the series 'x' over the series 's'. Let's assume the time index of the 'x' series has 20 elements distributed uniformly over the interval [0,1]. Now I want to interpolate the 10 values of 'x' that occur at the times described by 's'.
EDIT:
I think this does the job.
> approx(seq(0,1,length.out=20), x, xout=s)
$x
[1] 0.00000000 0.02804113 0.05715588 0.10630185 0.15778932 0.20391987 0.26066608 0.27265697 0.39100373
[10] 1.00000000
$y
[1] -0.8253066 0.1061033 0.8777987 0.3781018 -0.6221134 -1.5566990 -0.3483466 -0.4703429 -1.4444105
[10] 0.7384111
Thanks for your help, guys. I think I now understand how to use the interpolation functions in R. I should really use a time series data structure here.
This isn't meant as a direct answer to the OP's question but rather to illustrate how approx() works so the OP can formulate a better question.
Your question makes next to no sense. approx() works by taking a reference set of x and y coordinates and then interpolating to find y at n locations over the range of x, or at the xout locations supplied by the user.
In your call, you don't provide y, and x doesn't contain a y component, so I don't see how this can work.
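A tiny worked example of that behaviour, with made-up reference points (t_ref and y_ref are hypothetical names): two known points define a straight line, and approx() reads values off that line at the requested xout locations.

```r
t_ref <- c(0, 1)    # reference x coordinates
y_ref <- c(0, 10)   # reference y coordinates

# linear interpolation at three locations inside the range of t_ref
res <- approx(t_ref, y_ref, xout = c(0.25, 0.5, 0.75))
res$y               # 2.5 5.0 7.5

# locations outside the range of t_ref give NA under the default rule = 1
outside <- approx(t_ref, y_ref, xout = 1.5)$y
outside             # NA
```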
If you want to interpolate s, so that you can find the time intervals for any value over the range of s, then:
> approx(s, seq_along(s), n = 20)
$x
[1] 0.00000000 0.05263158 0.10526316 0.15789474 0.21052632 0.26315789
[7] 0.31578947 0.36842105 0.42105263 0.47368421 0.52631579 0.57894737
[13] 0.63157895 0.68421053 0.73684211 0.78947368 0.84210526 0.89473684
[19] 0.94736842 1.00000000
$y
[1] 1.00000 26.25815 42.66323 54.79831 64.96162 76.99433 79.67388
[8] 83.78458 86.14656 89.86223 91.98513 93.36233 93.77353 94.19731
[15] 94.63652 95.26239 97.67724 98.74056 99.40548 100.00000
Here $y contains the interpolated values for s at n = 20 equally spaced locations on the range of s (0,1).
Edit: If x represents the series at unstated time intervals, uniform on [0, 1], and you want the interpolated values of x at the time intervals s, then you need something like this:
> set.seed(1)
> x <- rnorm(20)
> s <- sort(rexp(10))
> scale01 <- function(x) {
+ (x - min(x)) / (max(x) - min(x))
+ }
> s <- scale01(s)
>
> ## interpolate x at points s
> approx(seq(0, 1, length = length(x)), x, xout = s)
$x
[1] 0.00000000 0.04439851 0.11870795 0.14379236 0.20767388 0.21218632
[7] 0.25498856 0.29079300 0.40426335 1.00000000
$y
[1] -0.62645381 0.05692127 -0.21465011 0.94393053 0.39810806 0.29323742
[7] -0.64197207 -0.13373472 0.62763207 0.59390132
Is that closer to what you want?
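If the same series has to be interpolated at several sets of time points, approxfun() may be more convenient: it returns a function instead of a list. This is a sketch; the time points in s below are made up, and t_x and f are hypothetical names.

```r
set.seed(1)
x   <- rnorm(20)
t_x <- seq(0, 1, length.out = length(x))  # assumed uniform time index for x

f <- approxfun(t_x, x)   # f(t) linearly interpolates x at time t

s <- c(0, 0.1, 0.5, 1)   # hypothetical query times
f(s)

# same values as the one-off approx() call
all.equal(f(s), approx(t_x, x, xout = s)$y)
```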
