Create matrix of features using regex? - r

Suppose I have a data frame of 101 variables. I select one, called Y, as the dependent variable, and the remaining 100, called X_1, X_2, ..., X_{100}, as the independent ones.
Now I would like to create a matrix containing the 100 independent variables. What are the ways to do it directly? For example, when I fit a linear regression model I can just use "." as a kind of wildcard, i.e. lm(Y ~ ., _____)

You can use the grep() function to extract the column names of the data frame that belong to the independent variables, and then convert that selection to a matrix. Please see the code below:
# simulate a data frame with 100 measurements and 101 variables
n <- 100
df <- data.frame(matrix(seq_len(101 * n), ncol = 101)) # 1:101 * n would give only 101 values, i.e. a single row
names(df) <- c(paste0("X_", 1:100), "Y")
# extract matrix of Xs
m_x <- as.matrix(df[, grep("^X", names(df))])
dimnames(m_x)
Output:
[[1]]
NULL
[[2]]
[1] "X_1" "X_2" "X_3" "X_4" "X_5" "X_6" "X_7" "X_8" "X_9" "X_10" "X_11" "X_12" "X_13" "X_14" "X_15"
[16] "X_16" "X_17" "X_18" "X_19" "X_20" "X_21" "X_22" "X_23" "X_24" "X_25" "X_26" "X_27" "X_28" "X_29" "X_30"
[31] "X_31" "X_32" "X_33" "X_34" "X_35" "X_36" "X_37" "X_38" "X_39" "X_40" "X_41" "X_42" "X_43" "X_44" "X_45"
[46] "X_46" "X_47" "X_48" "X_49" "X_50" "X_51" "X_52" "X_53" "X_54" "X_55" "X_56" "X_57" "X_58" "X_59" "X_60"
[61] "X_61" "X_62" "X_63" "X_64" "X_65" "X_66" "X_67" "X_68" "X_69" "X_70" "X_71" "X_72" "X_73" "X_74" "X_75"
[76] "X_76" "X_77" "X_78" "X_79" "X_80" "X_81" "X_82" "X_83" "X_84" "X_85" "X_86" "X_87" "X_88" "X_89" "X_90"
[91] "X_91" "X_92" "X_93" "X_94" "X_95" "X_96" "X_97" "X_98" "X_99" "X_100"
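
If the goal is simply "every column except Y", no pattern matching is needed. The sketch below shows two alternatives under the assumption that the response column is literally named Y, as in the question: dropping it by name with setdiff(), and using the formula interface, where "." expands to all remaining columns and "- 1" removes the intercept column. The simulated data and the names m_drop and m_formula are illustrative only.

```r
# Simulated data, same shape as in the answer above (values are arbitrary)
set.seed(1)
n <- 100
df <- data.frame(matrix(rnorm(101 * n), ncol = 101))
names(df) <- c(paste0("X_", 1:100), "Y")

# Option 1: drop the response by name, keep everything else
m_drop <- as.matrix(df[, setdiff(names(df), "Y"), drop = FALSE])

# Option 2: formula interface; "." means "all columns other than Y",
# and "- 1" suppresses the intercept column model.matrix() adds by default
m_formula <- model.matrix(Y ~ . - 1, data = df)

dim(m_drop)
dim(m_formula)
```

With purely numeric predictors both matrices hold the same values. If some predictors are factors, model.matrix() expands them into dummy columns, whereas as.matrix() would coerce the whole matrix to character.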

Related

Generating chi-square data

I would like to generate data following a chi-square distribution with N = 30 observations. However, I don't know if this is correct:
dchisq(1, df = 1:30)
# [1] 2.419707e-01 3.032653e-01 2.419707e-01 1.516327e-01 8.065691e-02
# [6] 3.790817e-02 1.613138e-02 6.318028e-03 2.304483e-03 7.897535e-04
# [11] 2.560537e-04 7.897535e-05 2.327761e-05 6.581279e-06 1.790585e-06
# [16] 4.700913e-07 1.193723e-07 2.938071e-08 7.021903e-09 1.632262e-09
# [21] 3.695738e-10 8.161308e-11 1.759875e-11 3.709686e-12 7.651632e-13
# [26] 1.545702e-13 3.060653e-14 5.945009e-15 1.133575e-15 2.123217e-16
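dchisq() returns the density of the distribution at a given point (here at 1, for 1 to 30 degrees of freedom); it does not generate data. If you would like to generate 30 random chi-squared variates, you need to use the rchisq() function.
rchisq(n, df, ncp = 0)
So you would replace n with 30 and df with the number of degrees of freedom you require. You can read more in the help page, ?rchisq.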
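
As a minimal sketch of the difference, the following draws 30 random values and, for comparison, evaluates the density at a single point. The choice of 3 degrees of freedom and the seed are assumptions made only for illustration.

```r
set.seed(42)                 # only for reproducibility

# 30 random draws from a chi-square distribution (df = 3 is an arbitrary choice)
draws <- rchisq(30, df = 3)
length(draws)                # 30 values, all positive

# In contrast, dchisq() evaluates the density curve at a point; nothing is sampled
dens <- dchisq(1, df = 3)
dens                         # the third entry of the dchisq(1, df = 1:30) output in the question
```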

most occurring value in a vector

I have a vector with 100 values. All the values were generated with a random function, between 0 and 1.
x <- runif(100,min=0,max=1)
x
[1] 0.84620011 0.82525410 0.31622827 0.08040362 0.12894525 0.23997187 0.57177296 0.91691368 0.65751720
[10] 0.39810175 0.60632205 0.26339035 0.93543618 0.09662383 0.35147739 0.51731042 0.29151612 0.54411769
[19] 0.73688309 0.26086586 0.37808273 0.19163366 0.62776847 0.70973345 0.31802726 0.69101574 0.50042561
[28] 0.20768256 0.23555818 0.21015820 0.18221151 0.85593725 0.12916935 0.52222127 0.62269135 0.51267707
[37] 0.60164023 0.30723904 0.81990231 0.61771762 0.02502631 0.47427724 0.21250040 0.88611710 0.88648546
[46] 0.92586513 0.57015942 0.33454379 0.03572245 0.68120369 0.48692522 0.76587764 0.55214917 0.31137200
[55] 0.47170307 0.48639510 0.68922858 0.73506033 0.23541740 0.81793240 0.17184666 0.06670039 0.55664270
[64] 0.10030533 0.94620061 0.58572228 0.53333567 0.80887841 0.55015406 0.82491114 0.81251132 0.06038019
[73] 0.10918904 0.84011824 0.33169617 0.03568364 0.07703029 0.15601158 0.31623253 0.25021777 0.77024833
[82] 0.88588620 0.49044305 0.10165930 0.55494697 0.17455070 0.94458467 0.43135868 0.99313733 0.04482747
[91] 0.53453604 0.52500493 0.35496966 0.06994880 0.11377845 0.71307042 0.35086237 0.04032254 0.23744845
[100] 0.81131033
Out of all these values in the vector, I need to find the most frequently occurring value (or something close to that). I'm new to R and have no idea how to do this. Please help?
One approach I have: divide the values into ranges and find the frequency distribution. But will that be helpful?
One way to analyze the distribution of the numbers is to plot a histogram and overlay an approximate probability density curve.
This can be done with the ggplot2 library:
set.seed(123) # used here for reproducibility
x <- runif(100) # pseudo-random numbers between 0 and 1
library(ggplot2)
# after_stat(density) replaces the older ..density.. notation (ggplot2 >= 3.3.0)
p <- ggplot(data.frame(x = x), aes(x = x, y = after_stat(density))) +
  geom_histogram(fill = "lightblue", colour = "grey60", bins = 50) +
  geom_density()
p # print the plot
The value of bins specified in geom_histogram() is the number of bars in the histogram. You may want to change this value to obtain a different representation of the distribution.
OR
You could use base R and plot a simple histogram:
hist(x)
There you can also change the bin width (see the breaks argument), but the default might be sufficient to show the concept.
You can identify which bin in this histogram has the most entries with
h <- hist(x) # store the histogram object so it is computed only once
h$mids[which.max(h$counts)]
#[1] 0.45
Which in this case means that most values occur near 0.45 (the midpoint of the bin covering the range between 0.4 and 0.5).
Hope this helps.
You can do this:
set.seed(12)
x <- runif(100,min=0,max=1)
n <- length(x)
x_cut <- cut(x, breaks = n / 4)
which(table(x_cut) == max(table(x_cut)))
The result depends on the breaks value you set. This is an alternative to using a histogram if you don't need the plot.
To really get just the most frequent value, or when using discrete data as input, you could simply create a table, sort the results and return the name of the top entry:
values <- c("a", "a", "c", "c", "c")
names(sort(table(values), decreasing = TRUE)[1])
#> [1] "c"
Breaking it down:
# create a table of the values
table(values)
#> a c
#> 2 3
# sort the table descending on number of occurrences
sort(table(values), decreasing = TRUE)
#> c a
#> 3 2
# now only keep the first value
sort(table(values), decreasing = TRUE)[1]
#> c
#> 3
# so the final line:
names(sort(table(values), decreasing = TRUE)[1])
#> [1] "c"
If you feel like doing something fancier, create a function that does this for you:
get_mode <- function(x) {
  # tabulate the argument x (not the global vector), then take the most frequent entry
  names(sort(table(x), decreasing = TRUE)[1])
}
get_mode(values)
#> [1] "c"
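
For continuous data such as the runif() sample in the question, exact values practically never repeat, so a table-based mode is not informative. One option, sketched below, is to take the location where a kernel density estimate peaks. The helper name kde_mode is hypothetical, and the result depends on the bandwidth that density() picks by default, so treat it as an approximation rather than a definitive answer.

```r
# Hypothetical helper: location of the highest point of a kernel density estimate
kde_mode <- function(x, ...) {
  d <- density(x, ...)      # extra arguments (e.g. bw) are passed on to density()
  d$x[which.max(d$y)]       # grid point where the estimated density is largest
}

set.seed(123)
x <- runif(100)             # pseudo-random numbers between 0 and 1
kde_mode(x)
```

For truly uniform data the density is flat in theory, so the reported peak mostly reflects sampling noise; on data with a clear peak the estimate is more meaningful.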

Generating 100 uniform different random variables in R code

I am trying to generate 100 samples of Z, where Z is the sum of 8 independent uniformly distributed random variables on the interval [0, 1].
I have the following code so far, but I'm not sure whether it is correct, the loop in particular.
eight <- runif(8, 0, 1) # uses the runif function to generate 8 uniform 0-1 random variables
Z_1 <- sum(eight) # calculates the sum and stores it in the variable Z_1
sample <- NA
for (i in 1:100) { # repeats the loop for 100 different values
  eight <- runif(8, 0, 1) # draws 8 independent uniform 0-1 random variables
  Z_1 <- sum(eight) # stores their sum in the variable Z_1
  sample[i] <- Z_1
}
Thanks
I would vectorize the whole thing. There is no real reason to run 100 iterations when you can just generate 800 observations in one go. Then just use matrix and colSums and you're done:
set.seed(123)
n <- 100
Z <- colSums(matrix(runif(8 * n), 8, n))
Z
# [1] 4.774425 4.671171 4.787691 4.804041 3.513257 2.330163 3.300135 3.568657 5.503481 2.861764 4.533283 3.353658
# [13] 4.230073 4.690739 4.364708 3.094156 4.933616 3.942834 3.712522 2.587036 3.731474 4.388749 4.484030 4.315968
# [25] 4.800758 4.252631 2.716972 5.159044 4.146881 3.244546 4.418618 4.350035 5.344182 3.176801 3.903337 2.261935
# [37] 3.646572 4.286075 3.074900 4.210506 3.172513 4.535665 4.245856 4.184848 4.532286 2.899883 4.473629 4.751224
# [49] 3.498038 3.337437 4.238989 3.349812 3.888696 4.748254 3.029479 4.246619 3.330434 3.879168 3.786216 3.839956
# [61] 3.878997 4.546531 2.863010 3.417072 4.266108 3.141875 4.960758 3.989613 4.373042 4.295742 4.459014 5.561066
# [73] 4.401990 4.121301 3.830575 3.412446 3.812347 5.591238 3.801587 4.454336 4.213343 5.222007 4.300991 2.765003
# [85] 3.293251 5.362586 2.954080 3.036312 3.655677 3.373603 5.575184 4.167740 3.904038 3.884440 2.901452 3.079311
# [97] 4.927770 3.930943 4.169907 2.922618
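
If the loop structure from the question is easier to read, replicate() expresses the same idea in one line. The sketch below also compares the sample mean and variance with the theoretical values for a sum of 8 independent uniform(0, 1) variables, which are 8 * 1/2 = 4 and 8 * 1/12 (about 0.667). The seed is an arbitrary choice.

```r
set.seed(123)

# Each replicate draws 8 uniform(0, 1) values and sums them
Z <- replicate(100, sum(runif(8)))

length(Z)   # 100 sums
range(Z)    # every sum lies between 0 and 8

# Sanity check against theory: E[Z] = 4, Var[Z] = 8/12
mean(Z)
var(Z)
```

With only 100 samples the empirical mean and variance will be near, but not exactly at, the theoretical values.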

covariance table for more variables

I've got three variables a, b and c. Each is a factor with three categories. I wanted to fit a multinomial regression with multinom() from the nnet package.
library(nnet) # provides multinom()
a <- sample(3, 100, TRUE)
b <- sample(3, 100, TRUE)
c <- sample(3, 100, TRUE)
a <- as.factor(a)
b <- as.factor(b)
c <- as.factor(c)
testus <- multinom(c ~ a + b)
predictors <-
expand.grid(b=c("1","2","3","4","5"),a=c("1","2","3","4","5"))
p.fit <- predict(testus, predictors, type='probs')
probabilities<-data.frame(predictors,p.fit)
Now I got the predicted probabilities for a under b and c.
head(probabilities)
  b a         X1         X2         X3         X4        X5
1 1 1 0.10609054 0.22599152 0.20107167 0.21953158 0.2473147
2 2 1 0.20886614 0.27207108 0.08613633 0.18276394 0.2501625
3 3 1 0.17041268 0.24995975 0.16234240 0.13111518 0.2861700
4 4 1 0.23704078 0.21179521 0.08493274 0.03135092 0.4348804
5 5 1 0.09494071 0.09659144 0.24162612 0.21812449 0.3487172
6 1 2 0.14059489 0.17793438 0.29272452 0.26104833 0.1276979
The first two columns show the categories of the independent variables a and b. The next five columns show the conditional probabilities, e.g. P(c = 1 | b = 1, a = 1) = 0.10609.
I need the variance-covariance matrix, so I did:
vcov(testus)
2:(Intercept) 2:b2 2:b3 2:c2 2:c3 ....
2:(Intercept) .......................................
2:b2 ................................
2:b3 .................
2:c2 ..............
2:c3 .............
3:(Intercept) .............
....
Sorry for pasting only part of the matrix, but otherwise it would be too long. What I would like to have is a variance-covariance matrix for the simultaneous observation of two variables (vcov(a, b&c)). That is, I would like to get the (co)variance between my variable a and the simultaneous observation of b and c, as I created with "probabilities". I would like to get output like this:
2:(Intercept) 2:b2&c2 2:b2&c3 ....
2:(Intercept) .......................................
2:b2&c2 ................................
2:b3&c3 .................
3:(Intercept) .............
....
Is this possible?
Perhaps:
testus <- multinom(c ~ a : b)
vcov(testus)
I say 'perhaps' because there is also the possibility of using the c ~ a*b model, and it's not clear exactly what you want. (The statistical question has not been defined, and I would not consider this a sufficient number of observations for a stable estimate.) At any rate:
colnames( vcov(testus))
#-----------
[1] "2:(Intercept)" "2:a1:b1" "2:a2:b1"
[4] "2:a3:b1" "2:a1:b2" "2:a2:b2"
[7] "2:a3:b2" "2:a1:b3" "2:a2:b3"
[10] "2:a3:b3" "3:(Intercept)" "3:a1:b1"
[13] "3:a2:b1" "3:a3:b1" "3:a1:b2"
[16] "3:a2:b2" "3:a3:b2" "3:a1:b3"
[19] "3:a2:b3" "3:a3:b3"
rownames( vcov(testus))
#--------
[1] "2:(Intercept)" "2:a1:b1" "2:a2:b1"
[4] "2:a3:b1" "2:a1:b2" "2:a2:b2"
[7] "2:a3:b2" "2:a1:b3" "2:a2:b3"
[10] "2:a3:b3" "3:(Intercept)" "3:a1:b1"
[13] "3:a2:b1" "3:a3:b1" "3:a1:b2"
[16] "3:a2:b2" "3:a3:b2" "3:a1:b3"
[19] "3:a2:b3" "3:a3:b3"

R: How to generate a sequence seq() given a condition?

I just discovered R and I am trying to work with it.
Here is what I am trying to achieve:
I have a vector of numbers, x, between 50 and 100 and with a size of 250 observations.
x <- sample(seq(50, 100), 250, replace = TRUE)
Now, I want to generate another vector of numbers, y, between 0 and 100, which is the same size as vector x such that each element in y is less than or equal to its equivalent in x.
That is to say that if x[1] is 76, for example, the highest value y[1] could attain when generated is 76. But it could definitely be any other value below 76. In other words and more generally, I want vector y to be generated in such a way that y[i] <= x[i].
I hope I have made my request clearer.
Thank you very much!
y <- x - 1 # deterministic option: every y[i] is strictly below x[i]
y <- sapply(x, function(x) runif(n = 1, max = x)) # random option: uniform draw between 0 and x[i]
y
[1] 7.2713788 30.0008063 42.5205775 0.9271717 10.7114456 39.5199145 7.4109775
[8] 28.3464373 28.5840101 34.0654033 15.0675028 50.2836294 45.9031794 13.5931005
[15] 43.2751738 17.0560824 3.1507491 25.7619129 12.3391448 22.6203684 51.3334810
[22] 37.0481703 33.4733277 37.1304850 26.7984406 66.3844126 40.2775918 47.6379024
[29] 16.2480595 66.8358384 33.3513161 60.2673874 65.6204462 45.6951960 1.5729434
[36] 20.4850357 0.1345737 84.5334203 19.7997451 53.8025623 48.5528486 8.8992123
[43] 90.9651742 28.3584167 41.7728159 46.4790641 17.8129578 83.1906415 37.5114353
[50] 89.5685501 85.2499600
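
Since runif() is vectorized over its max argument, the sapply() call is not strictly necessary. The sketch below shows a vectorized real-valued version and, assuming whole numbers are wanted like the values in x, an integer variant. The names y_real and y_int and the seed are illustrative only.

```r
set.seed(1)
x <- sample(50:100, 250, replace = TRUE)

# Real-valued: one uniform draw per element, each between 0 and its own x[i]
y_real <- runif(length(x), min = 0, max = x)

# Integer-valued: draw from [0, x[i] + 1) and round down, giving 0, 1, ..., x[i]
y_int <- floor(runif(length(x), min = 0, max = x + 1))

# Both satisfy the requested condition y[i] <= x[i]
all(y_real <= x)
all(y_int <= x)
```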
