R: Putting variables in order by a different variable

Once again I have been set a programming task, most of which I have done, so a quick run-through: I had to take n samples from a multivariate normal distribution of dimension p and put them into a matrix (Matx). For each row, the first two values were summed along with a value drawn from the standard normal distribution, giving a vector Y. Then I had to order Y numerically and split it into H groups, compute the mean of each row of the matrix, and now I have to order those means according to which Y group they are associated with. I've struggled a fair bit and have now hit a brick wall. Quite confusing, I understand; if anyone could help it'd be greatly appreciated!
Task: Return the p x H matrix which has in the first column the mean of the observations in the first group and in the Hth column the mean of the observations in the Hth group.
Code:
library(MASS)
# 36 draws from a standard normal (mvrnorm with mu = 0, Sigma = 1 is
# univariate here), reshaped row by row into a 6 x 6 matrix
x <- mvrnorm(36, 0, 1)
Matx <- matrix(c(x), ncol = 6, byrow = TRUE)
v <- rnorm(6)
# y_i = sum of the first two entries of row i plus a standard-normal draw;
# with byrow = TRUE, row i starts at element 6 * (i - 1) + 1 of x
y1 <- sum(x[1:2], v[1])
y2 <- sum(x[7:8], v[2])
y3 <- sum(x[13:14], v[3])   # was x[12:13], an off-by-one reaching into row 2
y4 <- sum(x[19:20], v[4])
y5 <- sum(x[25:26], v[5])
y6 <- sum(x[31:32], v[6])
y <- c(y1, y2, y3, y4, y5, y6)
out <- order(y)   # row indices sorted by increasing y
# H = 3 groups of two rows each, by rank of y
h1 <- out[1:2]
h2 <- out[3:4]
h3 <- out[5:6]
# mean of each row of the matrix
x1 <- x[1:6]
x2 <- x[7:12]
x3 <- x[13:18]
x4 <- x[19:24]
x5 <- x[25:30]
x6 <- x[31:36]
mx1 <- mean(x1)
mx2 <- mean(x2)
mx3 <- mean(x3)
mx4 <- mean(x4)
mx5 <- mean(x5)
mx6 <- mean(x6)
# row means in increasing-y order: index with out, not order(out)
d <- c(mx1, mx2, mx3, mx4, mx5, mx6)[out]
d
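For the task itself, one possible vectorized sketch (assuming H = 3 equal-sized groups and reusing Matx and y from the code above); column h holds the mean of the observations in group h:
H <- 3
n <- nrow(Matx)
# p x H matrix of group means: column h averages the rows of Matx whose
# y-values fall in the h-th group (equal group sizes assumed)
grp <- split(order(y), rep(1:H, each = n / H))   # row indices per y-group
res <- sapply(grp, function(idx) colMeans(Matx[idx, , drop = FALSE]))
res   # p x H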

Related

Getting a weighted proportion in R

I have created a transition probability matrix using 3 states (A,B,C) as follows:
transition <- prop.table(with(data, table(old, new)), 2)
For example, if you wanted to get the probability for A --> B, you would count the number of times you see B follow A and divide it by the number of times you see any state follow A. Now suppose there is a certain weight/importance associated with each row of data. How would I modify the above to get a weighted transition probability matrix?
You can do this...
transition <- prop.table(tapply(data$weight, list(data$old, data$new), sum), 2)
where data$weight is a column of weights for each row of data.
Using tapply with length would reproduce what you have; changing it to sum adds up the weights for each old/new combination rather than just counting rows.
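As a quick sanity check, a toy example (the column names old, new and weight follow the question's usage; note that tapply leaves combinations that never occur as NA, where table would give 0, so it helps to zero them before normalizing):
# Toy data using the question's column names
data <- data.frame(old    = c("A", "A", "A", "B", "C"),
                   new    = c("B", "B", "C", "A", "A"),
                   weight = c(1, 2, 1, 5, 1))
w <- tapply(data$weight, list(data$old, data$new), sum)
w[is.na(w)] <- 0                 # absent pairs: NA from tapply, 0 from table
transition <- prop.table(w, 2)   # columns sum to 1, as in the question
transition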

Label or score outliers in R

I'm looking for some easy-to-use algorithms in R to label (outlier or not) or score (say, 7.5) outliers row-wise. That is, I have a matrix m that contains several rows and I want to identify rows that represent outliers compared to the other rows.
m <- matrix( data = c(1,1,1,0,0,0,1,0,1), ncol = 3 )
To illustrate some more, I want to compare all the (complete) rows in the matrix with each other to spot outliers.
Here's some really simple outlier detection (using either the boxplot statistics or quantiles of the data) that I wrote a few years ago.
Outliers
But, as noted, it would be helpful if you'd describe your problem with greater precision.
Edit:
Also, you say you want row-wise outliers. Do you mean you're interested in identifying whole rows as outliers, rather than individual observations within a variable (as is typically done)? If so, you'll want to use some sort of distance metric, though which metric you choose will depend on your data.
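If whole rows are the target, here is a minimal sketch of a distance-based score (assuming numeric data, Euclidean distance, and an arbitrary 90th-percentile cutoff, all of which would need tuning for real data):
# Score each row by its mean distance to all the other rows
m <- matrix(data = c(1, 1, 1, 0, 0, 0, 1, 0, 1), ncol = 3)
score   <- rowMeans(as.matrix(dist(m)))    # dist() is Euclidean by default
outlier <- score > quantile(score, 0.9)    # flag the most distant rows
cbind(m, score = score, outlier = outlier)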

tapply, plotting, length doesn't match

I am trying to generate a plot from a dataset of 2 columns - the first column contains distances and the second contains correlations of something measured at those distances.
Now there are multiple entries with the same distance but different correlation values. I want to take the average of these entries and generate a plot of distance versus correlation. So this is what I did (the dataset is called correlationtable):
bins <- sort(unique(correlationtable[,1]))
corr <- tapply(correlationtable[,2],correlationtable[,1],mean)
plot(bins,corr,type = 'l')
However, this gives me an error saying the lengths of bins and corr don't match. I cannot figure out what I am doing wrong.
I tried it with some random data and it worked every time for me; to track down the error you would need to supply a concrete example that fails.
However, to answer the question, here is an alternative way to do the same thing:
corr <- tapply(correlationtable[,2],correlationtable[,1],mean)
bins <- as.numeric(names(corr))
plot(bins,corr,type = 'l')
This uses the fact that tapply returns a names attribute, which is then converted to numeric and used as the distance; it is therefore guaranteed to have the same length as corr.
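One way the mismatch can arise (an assumption, since the failing data isn't shown): if the distance column is a factor with unused levels, tapply returns one entry per level while sort(unique(...)) only returns the values actually present.
# Hypothetical reproduction: a factor distance column with an unused level
correlationtable <- data.frame(
  dist = factor(c(1, 1, 2), levels = c(1, 2, 3)),   # level 3 never occurs
  corr = c(0.9, 0.8, 0.5))
bins <- sort(unique(correlationtable[, 1]))                         # length 2
corr <- tapply(correlationtable[, 2], correlationtable[, 1], mean)  # length 3
length(bins); length(corr)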

Design Covariance Matrix in a simulation study in R in an efficient way

In my simulation study I need to come up with a covariance matrix for multivariate data.
My data:
dataset <- data.frame(observation    = rep(1:8, 2),
                      plot           = rep(1:4, each = 2),
                      time           = rep(1:2, 8),
                      treatment      = rep(c("A", "B", "A", "B"), each = 4),
                      OutputVariable = rep(c("P", "Q"), each = 8))
This dataset is multivariate: for every observation (1:8) there is more than one result; in this case, we observe a value for OutputVariable P and for OutputVariable Q at the same time. Note that the actual outputs are not in this dataset, as I will generate them at a later stage.
The desired covariance matrix would be 16 x 16, where CovarMat[2,9] indicates the covariance between the second line (observation 2 of variable P) and the 9th line (observation 1 of variable Q) of the dataset.
The value of, for instance, CovarMat[2,9] is based on rules like these:
CovarMat[2,9]=0
If dataset$plot[2]==dataset$plot[9] then CovarMat[2,9]=CovarMat[2,9]+1.5
If dataset$time[2]==dataset$time[9] then CovarMat[2,9]=CovarMat[2,9]+1.5
If (dataset$plot[2]==dataset$plot[9])&(dataset$time[2]==dataset$time[9]) then CovarMat[2,9]=CovarMat[2,9]+3
If abs(dataset$time[2]-dataset$time[9])==1 then CovarMat[2,9]=CovarMat[2,9]+2
Using for-loops, that's easy enough (and that's what I did up to now). But my current dataset has 13,200 lines, and thus CovarMat consists of 174,240,000 cells. Therefore, I am in desperate need of a more efficient way.
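A loop-free sketch under the rules above; it still allocates dense n x n matrices, so memory remains a concern at 13,200 rows, but it removes the element-by-element loop:
# Build the covariance matrix with outer(); the increments (1.5, 1.5, 3, 2)
# follow the rules stated above
samePlot <- outer(dataset$plot, dataset$plot, "==")
sameTime <- outer(dataset$time, dataset$time, "==")
adjTime  <- abs(outer(dataset$time, dataset$time, "-")) == 1
CovarMat <- 1.5 * samePlot + 1.5 * sameTime +
            3 * (samePlot & sameTime) + 2 * adjTime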

Looking for an efficient way to compute the variances of a multinomial distribution in R

I have an R matrix whose dimensions are ~20,000,000 rows by 1,000 columns. The first column represents counts and the rest of the columns represent the probabilities of a multinomial distribution of these counts. In other words, in each row the first column is n and the remaining k columns are the probabilities of the k categories. Another point is that the matrix is sparse: each row has many columns with a value of 0.
Here's a toy matrix I created:
mat <- rbind(c( 5, 0.1, 0.1,  0.1,  0.1,  0.1,  0.1, 0.1, 0.1, 0.1, 0.1),
             c( 2, 0.2, 0.2,  0.2,  0.2,  0.2,  0,   0,   0,   0,   0),
             c(22, 0.4, 0.6,  0,    0,    0,    0,   0,   0,   0,   0),
             c( 5, 0.5, 0.2,  0,    0.1,  0.2,  0,   0,   0,   0,   0),
             c( 4, 0.4, 0.15, 0.15, 0.15, 0.15, 0,   0,   0,   0,   0),
             c(10, 0.6, 0.1,  0.1,  0.1,  0.1,  0,   0,   0,   0,   0))
What I'd like to do is obtain an empirical measure of the variance of the counts for each category. The natural thing that comes to mind is to obtain random draws and then compute the variances over them. Something like:
draws = apply(mat,1,function(x) rmultinom(samples,x[1],x[2:ncol(mat)]))
where, say, samples = 100000.
Then I can run an apply over draws to compute the variances.
However, for my real data dimensions this will become prohibitive, at least in terms of RAM. Is there a more efficient solution in R to this problem?
If all you need is the variance of the counts, just compute it immediately instead of returning the intermediate simulated draws.
# rmultinom returns a k x samples matrix, so take the variance across draws
# for each category (margin 1) rather than var() of the whole matrix
vars <- apply(mat, 1, function(x) apply(rmultinom(samples, x[1], x[-1]), 1, var))
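Alternatively, since only the variances of the counts are needed, the closed form for multinomial counts, Var(X_i) = n * p_i * (1 - p_i), avoids simulation entirely:
# Exact per-category variances, row-wise: n is column 1, probabilities the rest
exact_vars <- mat[, 1] * mat[, -1] * (1 - mat[, -1])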
