tapply, plotting: lengths don't match - R

I am trying to generate a plot from a dataset of 2 columns: the first column contains distances and the second contains correlations of something measured at those distances.
There are multiple entries with the same distance but different correlation values. I want to take the average of these entries and generate a plot of distance versus correlation. So, this is what I did (the dataset is called correlationtable):
bins <- sort(unique(correlationtable[,1]))
corr <- tapply(correlationtable[,2],correlationtable[,1],mean)
plot(bins,corr,type = 'l')
However, this gives me the error that the lengths of bins and corr don't match.
I cannot figure out what I am doing wrong.

I tried it with some random data and it worked every time for me. To track down the error you would need to supply a concrete example that fails for you.
However, to answer the question, here is an alternative way to do the same thing:
corr <- tapply(correlationtable[,2],correlationtable[,1],mean)
bins <- as.numeric(names(corr))
plot(bins,corr,type = 'l')
This uses the fact that tapply returns a names attribute, which is then converted to numeric and used as the distances. By construction it has the same length as corr.
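One plausible cause of the original mismatch (an assumption, since the failing data were not shown): if the distance column is a factor with unused levels, tapply returns one entry per level while unique returns only the observed values, so the two lengths differ. A minimal sketch of the names-based approach with made-up data:
# hypothetical stand-in for correlationtable
correlationtable <- data.frame(distance    = c(1, 1, 2, 2, 2, 5),
                               correlation = c(0.9, 0.8, 0.6, 0.5, 0.7, 0.2))
corr <- tapply(correlationtable[, 2], correlationtable[, 1], mean)
bins <- as.numeric(names(corr))  # guaranteed to have the same length as corr
plot(bins, corr, type = 'l')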

Related

Working with spatial data: How to find the nearest neighbour of points without replacement?

I am currently working with some forest inventory data.
The data were collected on sample plots whose positions are available as point data (spatial data).
I have two datasets:
dataset dat.1 with n sample plots of species A
dataset dat.2 with k sample plots of species B
with n < k
What I want to do is to match every point of dat.1 with a point of dat.2. The result should be n pairs of points. So n of k plots from dat.2 should be selected.
The criteria for matching are:
spatial distance between a pair of points is as close as possible
one point of dat.2 can only be matched with one point in dat.1 and vice versa. So if there is a pair of points, these points should not be used in any other pair, even if it would be useful in terms of shortest distance. The "occupied" points should not be replaced and should not be used in the further matching process.
I have been looking for a long time for ways to perform this analysis. There are functions like st_nn from 'nngeo' or nn2 from 'RANN' which return the k nearest neighbours of a point. However, these functions cannot prevent a point from being used in more than one match.
The package 'MatchIt' offers nearest neighbour matching without replacement, yet its functions are designed to find the closest distance between control variables, not between spatial locations.
Could anyone come up with an idea for a possibility to match my requirements?
I would really appreciate any hints or suggestions for packages and / or functions that could help me with this issue.
The first thing you should do is create your own distance matrix. The rows should correspond to the plots in dat.1 and the columns to the plots in dat.2, and each entry in the matrix is the distance between the plot in the row and the plot in the column. You can compute this manually by looping through your datasets and taking the Euclidean (or other) distance between the points (a vectorized sketch of this follows the code below), or you can use the match_on function in the optmatch package with the following code:
d <- rbind(dat.1, dat.2)
d$dat <- c(rep(1, nrow(dat.1)), rep(0, nrow(dat.2)))
dist <- optmatch::match_on(dat ~ x.coord + y.coord, data = d,
                           method = "euclidean")
Once you have a distance matrix in this form, you can supply it to pairmatch in the optmatch package. pairmatch performs K:1 optimal matching without replacement. The matching is optimal in that the sum of the absolute distances between matched pairs in the matched sample is as low as possible. It doesn't guarantee that any one unit will get its nearest neighbour, but it does yield matched samples in which no unit is matched to another unit too far away from it. You can use the controls argument to choose how many dat.2 units should be matched to each dat.1 unit (e.g. controls = 2 would match 2 plots from dat.2 to each unit in dat.1). For the 1:1 matching you describe, the default suffices:
d$pairs <- optmatch::pairmatch(dist)
The output is a factor containing pair membership for each unit. Unmatched units will have a value of NA.
You can also do this in a single step with
d$pairs <- optmatch::pairmatch(dat ~ x.coord + y.coord, data = d,
                               method = "euclidean")
Then you can subset your dataset so only matched plots remain:
matched <- d[!is.na(d$pairs),]
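To inspect the matched pairs side by side, you can sort by pair membership (a small usage note, not from the original answer):
matched[order(matched$pairs), ]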

How is scaling done in multi classification SVM?

I am working with R to solve a multi-classification problem and want to use e1071. How is scaling done for multiclass classification? On this page, the documentation says:
“A logical vector indicating the variables to be scaled. If scale is of length 1, the value is recycled as many times as needed. Per default, data are scaled internally (both x and y variables) to zero mean and unit variance. The center and scale values are returned and used for later predictions.”
I am wondering how y is scaled. When we have m classes we have m columns for y, which have different means and variances. So after scaling y, we get different numbers in each column for the same class, and that doesn't make sense to me.
Could you please explain what is going on in scaling? I am curious to know.
I am also wondering what this means:
"If scale is of length 1, the value is recycled as many times as needed."
Let's have a look at the documentation for the argument scale:
A logical vector indicating the variables to be scaled. If scale is of length 1, the value is recycled as many times as needed. Per default, data are scaled internally (both x and y variables) to zero mean and unit variance.
The value expected here is a logical vector (a vector of TRUE and FALSE values). If this vector has as many values as your matrix has columns, then each column is scaled or not according to the corresponding entry (e.g. with svm(..., scale = c(TRUE, FALSE, TRUE), ...) the first and third columns are scaled while the second one is not). If instead you supply a single value, it is recycled to cover every column; this is what "the value is recycled as many times as needed" means, so scale = TRUE scales all columns and scale = FALSE scales none.
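As a minimal illustration of such a per-column scale vector (a sketch using the built-in iris data purely as a stand-in for your own; not from the original answer):
library(e1071)
# scale the first two predictors, leave the other two untouched
fit <- svm(Species ~ ., data = iris, scale = c(TRUE, TRUE, FALSE, FALSE))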
What happens during scaling is explained in the third sentence quoted above: "data are scaled [...] to zero mean and unit variance". To do this:
you subtract from each value of a column the mean of that column (this is called centering), and
then you divide each value of the column by the column's standard deviation (this is the actual scaling).
You can reproduce the scaling with the following example:
# create a data.frame with four variables
# as you can see, the difference between consecutive terms of aa and of bb
# is one, the difference between consecutive terms of cc is 21.63, and dd is random
(df <- data.frame(aa = 11:15,
bb = 1:5,
cc = 1:5*21.63,
dd = rnorm(5,12,4.2)))
# then we subtract the mean of each column from that column and
# put everything back together into a data.frame
(df1 <- as.data.frame(sapply(df, function(x) {x-mean(x)})))
# you can observe that now the mean value of each column is 0 and
# that aa==bb because the difference between each term was the same
# now we divide each column by its standard deviation
(df1 <- as.data.frame(sapply(df1, function(x) {x/sd(x)})))
# as you can see, the first three columns are now equal because the
# only difference between them was that cc == 21.63*bb
# the values in df1 now match what you would obtain from the base
# function `scale` (which returns a matrix rather than a data.frame)
(df2 <- scale(df))
Scaling is necessary when your columns represent data on different scales. For example, if you wanted to distinguish obese individuals from lean ones you could collect their weight, height and waist-to-hip ratio. Weight would probably range from 50 to 95 kg, height would be around 175 cm (± 20 cm), and waist-to-hip ratio from 0.60 to 0.95. All these measurements are on different scales, which makes them difficult to compare. Scaling the variables solves this problem. Moreover, a variable that reaches high numerical values while the others do not will likely be given more importance by multivariate algorithms, so scaling is advisable in most cases for such methods.
Scaling does change the mean and the variance of each variable, but since it is applied column-wise to every row alike (regardless of which class a row belongs to), this is not a problem.

How to extract saved envelope values in Spatstat?

I am new to both R & spatstat and am working with the inhomogeneous pair correlation function. My dataset consists of point values spread across several time intervals.
sp77.ppp = ppp(sp77.dat$Plot_X, sp77.dat$Plot_Y, window = window77, marks = sp77.dat$STATUS)
Dvall77 = envelope((Y = dv77.ppp[dv77.ppp$marks == '2']), fun = pcfinhom,
                   r = seq(0, 20, 0.25), nsim = 999, divisor = 'd',
                   simulate = expression((rlabel(dv77.ppp)[rlabel(dv77.ppp)$marks == '1']),
                                         (rlabel(dv77.ppp)[rlabel(dv77.ppp)$marks == '2'])),
                   savepatterns = TRUE, savefuns = TRUE)
I am comparing multiple pairwise comparisons (from different time periods) and need a function that, for every calculated envelope value at each r value, finds the min and max differences between the envelopes.
My question is: how do I access the saved envelope values? I know that savefuns = TRUE stores all the simulated function values, but I can't find how to extract them. The summary (below) says that the values are stored. How do I call and extract them?
> print(Dvall77)
Pointwise critical envelopes for g[inhom](r)
and observed value for ‘(Y = dv77.ppp[dv77.ppp$marks == "2"])’
Edge correction: “iso”
Obtained from 999 evaluations of user-supplied expression
(All simulated function values are stored)
(All simulated point patterns are stored)
Alternative: two.sided
Significance level of pointwise Monte Carlo test: 2/1000 = 0.002
.......................................................................................
Math.label Description
r r distance argument r
obs {hat(g)[inhom]^{obs}}(r) observed value of g[inhom](r) for data pattern
mmean {bar(g)[inhom]}(r) sample mean of g[inhom](r) from simulations
lo {hat(g)[inhom]^{lo}}(r) lower pointwise envelope of g[inhom](r) from simulations
hi {hat(g)[inhom]^{hi}}(r) upper pointwise envelope of g[inhom](r) from simulations
.......................................................................................
Default plot formula: .~r
where “.” stands for ‘obs’, ‘mmean’, ‘hi’, ‘lo’
Columns ‘lo’ and ‘hi’ will be plotted as shading (by default)
Recommended range of argument r: [0, 20]
Available range of argument r: [0, 20]
Thanks in advance for any suggestions!
If you are looking to access the values of the summary statistic (ginhom) for each of the randomly labelled patterns, this is in principle documented in help(envelope.ppp). Admittedly it is long, and if you are new to both R and spatstat it is easy to get lost. The clue is in the Value section of the help file. The result is a data.frame with some additional classes (envelope and fv), and as the help file says:
Additionally, if ‘savepatterns=TRUE’, the return value has an
attribute ‘"simpatterns"’ which is a list containing the ‘nsim’
simulated patterns. If ‘savefuns=TRUE’, the return value has an
attribute ‘"simfuns"’ which is an object of class ‘"fv"’
containing the summary functions computed for each of the ‘nsim’
simulated patterns.
Then of course you need to know how to access an attribute in R, which is done using attr:
funs <- attr(Dvall77, "simfuns")
Then funs is a data.frame (and fv-object) with all the function values for each randomly labelled pattern.
I can't really tell from your question whether you just need the values of the upper and lower curves defining the envelope. In that case you simply access them like columns of an ordinary data.frame (and there is no need to save all the individual function values in the envelope):
lo <- Dvall77$lo
hi <- Dvall77$hi
d <- hi - lo
More elegantly you can do:
d <- with(Dvall77, hi - lo)
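From there, the min and max envelope differences over r that the question asks about can be read off directly (a small usage sketch):
range(d)                 # smallest and largest envelope width over all r values
Dvall77$r[which.max(d)]  # the r at which the envelope is widest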

R: Putting Variables in order by a different variable

Once again I have been set another programming task, most of which I have done, so here is a quick run-through. I had to take n samples from a multivariate normal distribution with dimension p (called X), then put them into a matrix (Matx). For each row, the first two values were summed along with a value randomly drawn from the standard normal distribution (call this vector Y). Then Y had to be ordered numerically and split into H groups, the mean of each row of the matrix computed, and these means ordered according to which Y group they are associated with. I've struggled a fair bit and have now hit a brick wall. I know this is quite confusing; if anyone could help, it'd be greatly appreciated!
Task: Return the p×H matrix which has in the first column the mean of the observations in the first group, and in the Hth column the mean of the observations in the Hth group.
Code:
library('MASS')
x<-mvrnorm(36,0,1)
Matx<-matrix(c(x), ncol=6, byrow=TRUE)
v<-rnorm(6)
y1<-sum(x[1:2],v[1])
y2<-sum(x[7:8],v[2])
y3<-sum(x[13:14],v[3]) # row 3 of Matx starts at x[13], so x[12:13] was off by one
y4<-sum(x[19:20],v[4])
y5<-sum(x[25:26],v[5])
y6<-sum(x[31:32],v[6])
y<-c(y1,y2,y3,y4,y5,y6)
out<-order(y)
h1<-c(out[1:2])
h2<-c(out[3:4])
h3<-c(out[5:6])
x1<-c(x[1:6])
x2<-c(x[7:12])
x3<-c(x[13:18])
x4<-c(x[19:24])
x5<-c(x[25:30])
x6<-c(x[31:36])
mx1<-mean(x1)
mx2<-mean(x2)
mx3<-mean(x3)
mx4<-mean(x4)
mx5<-mean(x5)
mx6<-mean(x6)
d<-c(mx1,mx2,mx3,mx4,mx5,mx6)[order(out)]
d
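For what it's worth, the steps above can be written more compactly. The following is only a sketch under one reading of the task statement and the stated assumptions (n = 6 observations, p = 6 dimensions, H = 3 equal-sized groups), reusing Matx from the code above:
# Y: sum of the first two entries of each row plus standard-normal noise
y <- rowSums(Matx[, 1:2]) + rnorm(nrow(Matx))
# split the row indices, ordered by y, into H = 3 equal groups
grp <- split(order(y), rep(1:3, each = 2))
# p x H matrix: column h holds the per-variable mean over the rows of group h
res <- sapply(grp, function(idx) colMeans(Matx[idx, , drop = FALSE]))
res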

Label or score outliers in R

I'm looking for easy-to-use algorithms in R to label rows as outliers (or not) or to give them an outlier score (say, 7.5). That is, I have a matrix m that contains several rows and I want to identify rows that are outliers compared to the other rows.
m <- matrix( data = c(1,1,1,0,0,0,1,0,1), ncol = 3 )
To illustrate some more, I want to compare all the (complete) rows in the matrix with each other to spot outliers.
Here's some really simple outlier detection (using either the boxplot statistics or quantiles of the data) that I wrote a few years ago.
Outliers
But, as noted, it would be helpful if you'd describe your problem with greater precision.
Edit:
You also say you want row-wise outliers. Do you mean that you're interested in identifying whole rows as outliers, as opposed to outlying observations within a variable (as is typically done)? If so, you'll want to use some sort of distance metric, though which metric you choose will depend on your data.
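One common choice of such a distance metric is the Mahalanobis distance of each row from the multivariate centre of the data. A minimal sketch (an illustration of the general idea, not code from the original answer; it needs more rows than columns so that the covariance matrix is invertible):
set.seed(1)
m <- matrix(rnorm(60), ncol = 3)  # 20 rows, 3 variables
m[1, ] <- c(8, 8, 8)              # plant one clearly outlying row
scores <- mahalanobis(m, colMeans(m), cov(m))  # one outlyingness score per row
order(scores, decreasing = TRUE)[1:3]          # rows ranked most outlying first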
