R: Statistics of distribution

I have the number of samples per unit and need to calculate statistics with R.
The table is like this (all rows and columns are actually filled with values, I only write a few here for easier visibility, and there are many more columns):
Hour       1    2    3    4
H1        72   11   98   65
H2        19   27
H3
H4
H5
:
H200000
I.e. in the first hour (H1) there were 72 samples of value 1, 11 samples of value 2, etc. In the second hour (H2) there were 19 samples of value 1, 27 samples of value 2, etc.
I need to calculate the mean and standard deviation per hour (i.e. per row). As there are many thousands of rows I need a fast method.
Example: The manual mean-calculation for hour 1 (H1) would be:
(72x1 + 11x2 + 98x3 + 65x4)/(72+11+98+65) = 2.6
I suppose there are R methods or packages that can do this, but I have not been able to find them. Your support is highly appreciated.
Thanks,
Chris

You want to calculate a weighted mean, so you need weighted.mean. For the first row:
values <- c(1, 2, 3, 4)
weights <- c(72, 11, 98, 65)
weighted.mean(values, weights)
The weighted standard deviation is not well-defined. You could use a hand-rolled weighted RMS as an estimator (but this assumes that your input sample is really from a single Gaussian, i.e. there are no outliers -- not sure if that's the case for your example).
# same values and weights as above: weighted root-mean-square
sqrt(sum(weights * values^2) / sum(weights))
You should read your data into a table and iterate over every row. Also, "many thousands of rows" is not necessarily a large number for such a simple calculation. This is very basic stuff; checking out a tutorial might also be beneficial.
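A minimal sketch of that workflow, assuming the table sits in a whitespace-separated text file (the file name samples.txt is made up) with the hour labels in the first column and the values 1, 2, 3, ... as column headers:
# Hypothetical file name; adjust to your actual data source
dat <- read.table("samples.txt", header = TRUE, row.names = 1, check.names = FALSE)
values <- as.numeric(colnames(dat))      # the values 1, 2, 3, ...
# weighted mean per hour: each row's counts act as the weights
means <- apply(dat, 1, function(counts) weighted.mean(values, counts))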

You are much better off (i.e. faster calculations) using matrix operations instead of applying something by row. For example, assuming X is the matrix containing your counts, you can get the weighted means the following way:
v <- 1:ncol(X)                             # the values represented by the columns
wmeans <- as.vector(X %*% v) / rowSums(X)  # sum(count * value) / sum(count) per row
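If the weighted standard deviations are needed as well, the same matrix approach works. A minimal sketch, using the two rows shown in the question (missing entries padded with zeros) and a population-style weighted variance:
X <- rbind(H1 = c(72, 11, 98, 65),
           H2 = c(19, 27,  0,  0))
v <- 1:ncol(X)                            # the value each column represents
totals <- rowSums(X)                      # number of samples per hour
wmeans <- as.vector(X %*% v) / totals     # weighted means (H1 gives ~2.63)
msq    <- as.vector(X %*% v^2) / totals   # weighted mean of squared values
wsds   <- sqrt(msq - wmeans^2)            # sqrt(E[v^2] - E[v]^2) per row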

Assuming your table is a matrix called dataset with one row per hour, and that the values represented by the columns (1, 2, 3, ...) are stored in a vector called values, you just need to do:
# The 1 as 2nd parameter indicates to apply the function on the rows;
# each row holds the counts, which act as the weights
w.means <- apply(dataset, 1, function(counts) weighted.mean(values, counts))

Related

Randomly generated numbers are linear combinations of each other even though no such structure was specified

I am simulating some draws using random numbers. Unfortunately, the generated numbers are not as random as I would like. In fact, I find that there are some linear combinations.
In details, I have the following starting data:
start_vector = c(1,10,30,40,50,100) # length equal to 6
residual_of_model = 5
n = 1000 # Number of simulations
For each element of start_vector, I try to simulate n observations from a normal distribution, treating them as "random noise" to add to the original value (the one in start_vector):
out_vec <- matrix(NA, nrow = n, ncol = length(start_vector))
for (h_aux in 1:length(start_vector))
{
random_noise <- rnorm(n, 0, residual_of_model)
out_vec[,h_aux] <- as.numeric(start_vector[h_aux]) + random_noise
}
At this point, I obtain a matrix of size 1000 x 6. In theory, I assume all the columns and all the rows of the matrix are linearly independent.
If I check this using the findLinearCombos() function from the caret package, I find that all the columns are independent:
caret::findLinearCombos(out_vec)
If I try to evaluate the independence among the rows, using the following code:
caret::findLinearCombos(t(out_vec))
I find that all the rows from 7 to 1000 are linear combinations of the first 6 (the length of start_vector).
This seems really strange to me; I would expect to observe no dependencies at all, since the rows are generated by adding random noise from rnorm.
What am I missing? Is there some bug? Thanks in advance!

How does SMOTE create new data from categorical data?

I have used SMOTE in R to create new data and this worked fine. When I did further research on how exactly SMOTE works, I couldn't find an answer to how SMOTE handles categorical data.
In the paper, an example is shown (page 10) with just numeric values. But I still do not know how SMOTE creates new data from categorical example data.
This is the link to the paper:
https://arxiv.org/pdf/1106.1813.pdf
That indeed is an important thing to be aware of. In terms of the paper that you are referring to, Sections 6.1 and 6.2 describe possible procedures for the cases of nominal-continuous and just nominal variables. However, DMwR does not use something like that.
If you look at the source code of SMOTE, you can see that the main work is done by DMwR:::smote.exs. I'll now briefly explain the procedure.
The summary is that the order of factor levels matters and that there currently seems to be a bug in the handling of factor variables which makes the distance computation work the opposite way to what is intended. That is, if we want to find an observation close to one with a factor level "A", then anything other than "A" is treated as "close" and observations with level "A" are treated as "distant". Hence, the more factor variables there are, the fewer levels they have, and the fewer continuous variables there are, the more drastic the effect of this bug should be.
So, unless I'm wrong, the function should not be used with factors.
As an example, let's consider the case of perc.over = 600 with one continuous and one factor variable. We then arrive at smote.exs with the sub-data frame corresponding to the undersampled class (say, 50 rows) and proceed as follows.
Matrix T contains all but the class variable. Columns corresponding to the continuous variables remain unchanged, while factors or characters are coerced into integers. This means that the order of factor levels is essential.
Next we generate 50 * 6 = 300 new observations. We do so by creating 6 new observations (n = 1, ..., 6) for each of the 50 present ones (i = 1, ..., 50).
We scale the data with xd <- scale(T, T[i, ], ranges) so that xd shows deviations from the i-th observation. E.g., for i = 1 we may have
# [,1] [,2]
# [1,] 0.00000000 0.00
# [2,] -0.13333333 0.25
# [3,] -0.26666667 0.25
meaning that the continuous variable for i = 2, 3 is smaller than for i = 1, but that the factor levels of i = 2, 3 are "higher".
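For concreteness, here is a small made-up T (one continuous column, one factor column already coerced to integer codes) with assumed ranges chosen so that it reproduces the xd shown above:
# Hypothetical toy data; column 2 holds integer-coded factor levels
T <- cbind(c(1.0, 0.9, 0.8), c(1, 2, 2))
ranges <- c(0.75, 4)                  # assumed column ranges of the full data
i <- 1
xd <- scale(T, T[i, ], ranges)        # deviations from the i-th observation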
Then, by running for (a in nomatr) xd[, a] <- xd[, a] == 0, we ignore most of the information in the second column related to factor-level deviations: we set the deviation to 1 for those cases that have the same factor level as the i-th observation, and to 0 otherwise. (I believe it should be the opposite, meaning that it's a bug; I'm going to report it.)
Then we set dd <- drop(xd^2 %*% rep(1, ncol(xd))), which can be seen as a vector of squared distances of each observation from the i-th one, and kNNs <- order(dd)[2:(k + 1)] gives the indices of the k nearest neighbours. It is purposefully 2:(k + 1), as the first element should be i itself (its distance should be zero). However, in this case the first element is not always i, due to the factor-handling step described above, which confirms the bug.
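Continuing the toy example above (with nomatr = 2 marking the factor column and, say, k = 2, both assumed for illustration), these two steps might look like:
nomatr <- 2                                # index of the coerced factor column
for (a in nomatr) xd[, a] <- xd[, a] == 0  # 1 = same level as obs. i, 0 = different
dd <- drop(xd^2 %*% rep(1, ncol(xd)))      # squared distance of each row from obs. i
dd
# [1] 1.00000000 0.01777778 0.07111111     # obs. i itself now looks the farthest away
k <- 2
kNNs <- order(dd)[2:(k + 1)]               # here c(3, 1): includes i itself -- the bug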
Now we create the n-th new observation, similar to the i-th one. First we pick one of the nearest neighbours, neig <- sample(1:k, 1). Then difs <- T[kNNs[neig], ] - T[i, ] is the component-wise difference between this neighbour and the i-th observation, e.g.,
difs
# [1] -0.1 -3.0
meaning that the neighbour has lower values for both variables.
The new case is constructed by running T[i, ] + runif(1) * difs, which is indeed a convex combination of the i-th observation and the neighbour. This line is for the continuous variable(s) only. For the factors we have c(T[kNNs[neig], a], T[i, a])[1 + round(runif(1), 0)], which means that the new observation gets the same factor level as the i-th observation with 50% probability, and the same as the chosen neighbour with the other 50%. So, this is a kind of discrete interpolation.
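As a tiny illustration of this last step with made-up numbers (one continuous value and one integer-coded factor level per observation; all values here are hypothetical):
T_i    <- c(1.0, 1)            # hypothetical i-th observation: continuous value, factor code
T_neig <- c(0.9, 3)            # hypothetical chosen neighbour
difs   <- T_neig - T_i         # component-wise difference, as in the walkthrough
new_cont <- T_i[1] + runif(1) * difs[1]                   # convex combination (continuous part)
new_fac  <- c(T_neig[2], T_i[2])[1 + round(runif(1), 0)]  # 50/50: neighbour's level or i's level
c(new_cont, new_fac)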

Weighted correlation in R

I am trying to output a correlation matrix for various locations. The row names 'PC1', 'PC2', etc. represent principal components. Since the percentage of variance explained (and thus the weights) decreases from PC1 to PC4, I need to run a Pearson correlation that takes the weights of the PCs into account.
In other words, row 1 is more important in determining the correlation among locations than row 2, and row 2 is more important than row 3, and so on...
A simple weight vector for the 4 rows can be as follows:
w <- c(1.00, 0.75, 0.50, 0.25)
I did go through a similar question, but I am not fully clear on the solution, and unlike that question, I need to find the correlation within the columns of a SINGLE matrix, while weighting its rows.
OK, this is very easy to do in R using cov.wt (available in the stats package):
weighted_corr <- cov.wt(DF, wt = w, cor = TRUE)
corr_matrix <- weighted_corr$cor
That's it!
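For completeness, here is a self-contained sketch with made-up numbers; DF has the four principal components as rows and three hypothetical locations as columns:
# Made-up example data: 4 PCs (rows) x 3 locations (columns)
set.seed(42)
DF <- as.data.frame(matrix(rnorm(12), nrow = 4,
                           dimnames = list(paste0("PC", 1:4),
                                           c("locA", "locB", "locC"))))
w <- c(1.00, 0.75, 0.50, 0.25)       # row weights: PC1 matters most
weighted_corr <- cov.wt(DF, wt = w, cor = TRUE)
weighted_corr$cor                    # weighted correlation matrix of the locations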

Simulating the t-distribution -- random samples

I am new to simulation exercises in R. I want to create 1000 samples of size 25 from a t distribution with degrees of freedom 10.
Do I need to create a single vector of data from the rt generator, and then sample repeatedly from that? So, for example, I could create the vector:
singlevector <- rt(5000, 10), which generates a sample of size 5000 from a t-distribution with df = 10. I would treat this as my population and then sample from it. I chose the population size of 5000 arbitrarily here.
OR, should I create my 1000 samples by calling this random t generator every time?
In other words, create a matrix with 25 rows and 1000 columns, each column containing a vector from a new call to rt(25, 10).
Since you are sampling independent, identically distributed values, all three of these approaches are statistically equivalent:
1. call the random number generator once to get as many (or more) values than you need, then sample that vector without replacement;
2. call the random number generator 1000 times, picking 25 values each time;
3. call the random number generator once, picking 25000 values, then subdivide the vector into individual samples in order (rather than randomly).
The latter two are not just statistically but computationally equivalent. In the first approach, the order of samples gets scrambled, but that makes no difference to the statistical properties.
Approach #1:
set.seed(101)
x1 <- rt(25000,10)
r1 <- do.call(cbind,split(x1,sample(0:24999) %/% 25))
Illustrating the equivalence of #2 and #3:
set.seed(101)
r2 <- replicate(1000, rt(25, 10))
set.seed(101)
r3 <- matrix(rt(25000,10),nrow=25)
identical(r2,r3) ## TRUE
In general solution #3 is fastest (but all of these approaches are very fast for problems of this order of magnitude, i.e. approx 5 milliseconds (#3) vs 10 milliseconds (#2) for 25 x 1000 samples on my laptop); I would pick whichever approach is easiest for you to understand when you read the code.
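If you want to check the timings on your own machine, a rough sketch (assuming the microbenchmark package is installed; absolute numbers will vary):
library(microbenchmark)
microbenchmark(
  approach2 = replicate(1000, rt(25, 10)),      # 1000 separate calls
  approach3 = matrix(rt(25000, 10), nrow = 25), # one call, reshaped
  times = 100
)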

sample integer values with specified mean

I want to generate a sample of integer numbers in R with a specified mean.
I used mu + sd*scale(rnorm(n)) to generate a sample of n values that has exactly mean = mu,
but this generates floating-point values; I would like to generate integer values instead. For example, I would like to generate a sample with mean = 4. With sample size n = 5, an example of generated values would be {2, 6, 4, 3, 5}.
Any ideas on how to do this in R while satisfying the constraint of a specific value of the mean?
Picking n values with a mean of m is equivalent to picking n values that sum to m*n. (I'm assuming you're going to stick to positive integers -- otherwise things get much harder!) Here is a solution based on sampling partitions (sets of values that add up to the desired total) uniformly, but I'm not sure it's what you want, since it doesn't sample uniformly over values, but over partitions ... perhaps someone else can do better, or figure out how to reweight the samples.
This brute-force solution will also probably fail for cases much larger than your example (there are 627 partitions for a total of 20, 5604 for a total of 30, 37338 for a total of 40 ...)
m <- 4
n <- 5
library("partitions")
pp <- parts(m*n) ## all sets of integers that sum to m*n (=20 here)
## restrict to partitions with exactly n (=5) non-zero values.
pp5 <- pp[1:5,colSums(pp>0)==n]
set.seed(101) ## for reproducibility
## sample uniformly from this set
pp5[,sample(ncol(pp5),size=1)] ## 9, 5, 4, 1, 1
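A quick sanity check that any sampled column really has the requested mean:
samp <- pp5[, sample(ncol(pp5), size = 1)]
sum(samp)   # 20 (= m*n by construction)
mean(samp)  # 4  (= m)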
