Averaging different length vectors with same domain range in R

I have a dataset that looks like the one shown in the code below.
I am guaranteed that the "(var)x" values (the domain: ax, bx, qx) always lie between 0 and 1. The "(var)y" values (the co-domain) can vary and are also bounded, but within a larger range.
I am trying to average the "(var)y" values across the different variables at common "(var)x" positions.
I would like some kind of selective averaging, but I am not sure how to do this in R.
ax=c(0.11,0.22,0.33,0.44,0.55,0.68,0.89)
ay=c(0.2,0.4,0.5,0.42,0.5,0.43,0.6)
bx=c(0.14,0.23,0.46,0.51,0.78,0.91)
by=c(0.1,0.2,0.52,0.46,0.4,0.41)
qx=c(0.12,0.27,0.36,0.48,0.51,0.76,0.79,0.97)
qy=c(0.03,0.2,0.52,0.4,0.45,0.48,0.61,0.9)
a<-list(ax,ay)
b<-list(bx,by)
q<-list(qx,qy)
What I would like to have is something like
avgd_x = c(0.12,0.27,0.36,0.48,0.51,0.76,0.79,0.97)
and
avgd_y would be built by finding the values of ay and by at 0.12 and taking the mean of those two values and the value of qy at that point,
and similarly for every value in the vector with the largest number of elements.
How can I do this in R?
P.S.: This is a toy dataset. My real dataset is spread over several files that I read with a custom function, but the raw data has the form shown in the code above.
Edit:
Some clarification:
avgd_y would have the length of the largest vector. For example, in the case above, avgd_y would be (ay' + by' + qy)/3, where ay' and by' are the vectors c(ay(qx(i))) and c(by(qx(i))) for i from 1 to the length of qx. In other words, ay' and by' hold the values of ay and by interpolated at the data points of qx.
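One possible way to get this, as a minimal sketch on the toy data rather than a definitive answer: linearly interpolate ay and by at the points of qx with base R's approx(), then average elementwise with qy. The names ay_at_qx and by_at_qx are made up for illustration. Using rule = 2 is an assumption: it holds the nearest end value where qx falls outside the range of ax or bx (as 0.97 does); with the default rule = 1 those positions would be NA.

```r
ax <- c(0.11, 0.22, 0.33, 0.44, 0.55, 0.68, 0.89)
ay <- c(0.2, 0.4, 0.5, 0.42, 0.5, 0.43, 0.6)
bx <- c(0.14, 0.23, 0.46, 0.51, 0.78, 0.91)
by <- c(0.1, 0.2, 0.52, 0.46, 0.4, 0.41)
qx <- c(0.12, 0.27, 0.36, 0.48, 0.51, 0.76, 0.79, 0.97)
qy <- c(0.03, 0.2, 0.52, 0.4, 0.45, 0.48, 0.61, 0.9)

# Linear interpolation of each shorter series at the x values of the longest one.
# rule = 2 (assumption): outside the observed range, reuse the closest end value.
ay_at_qx <- approx(ax, ay, xout = qx, rule = 2)$y
by_at_qx <- approx(bx, by, xout = qx, rule = 2)$y

avgd_x <- qx
avgd_y <- (ay_at_qx + by_at_qx + qy) / 3
```

With more than three variables, the same idea could be applied by interpolating each one onto a common grid in a loop and taking rowMeans() of the resulting columns.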

Related

R: Rank cells in a list of matrices based on cell position

I have a list of matrices containing association measurements between GPS tracked animals. One matrix in the list is observed association rates, the others are association rates for randomized versions of the GPS tracking trajectories. For example, I currently have 99 permutations of randomized tracking trajectories resulting in a list of 99 animal association matrices, plus the observed association matrix. I am expecting that for the animals that belong to the same pack, the observed association rates will be higher than the randomized association rates. Accordingly, I would like to determine the rank of the observed rates compared to the randomized rates for each dyad (cell). Essentially, I am doing a rank-permutation test. However, since I am only really concerned with determining if the observed association data is greater than the randomized trajectory association data, any result just giving the rank of the observed cells is sufficient.
ls <- list(matrix(10:18,3,3), matrix(18:10,3,3))
I've seen that sapply can be used to get the ranks of particular cells. Could I do the following for all cells and take the final number in the resulting vector to get the rank of the cell in that position in the list (knowing the position of the observed data in the list of matrices, e.g. last)?
rank(sapply(ls, '[',1,1))
The ideal result would be a matrix of the same form as those in the list giving the rank of the observed data, although any similar solutions are welcome. Thanks in advance.
You can proceed that way, but there are cleaner and quicker methods to get what you want.
Here's some code that takes your ls and produces a 3x3 matrix with the following properties:
if the entry in ls[[1]] is greater than the corresponding entry of ls[[2]], record a 1
if the entry in ls[[1]] is less than the corresponding entry of ls[[2]], record a 2
if the entries are equal, record a 1.5
result <- 1 * (ls[[1]] > ls[[2]]) + 2 * (ls[[1]] < ls[[2]]) + 1.5 * (ls[[1]] == ls[[2]])
How it works: when we do something like ls[[1]] > ls[[2]], we extract the matrices of interest and compare them element by element. The result of this bit of code is a TRUE/FALSE matrix, which R treats as a 0/1 matrix in arithmetic. We can then multiply it by whatever coefficient we want to represent that situation.
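For a list with more than two matrices (e.g. 99 permutations plus the observed one), the same per-cell ranking can be sketched by stacking the matrices into a 3-D array and ranking along the third dimension. This is an illustrative sketch, not part of the answer above; it assumes the observed matrix is the last list element, and the names obs, arr and rank_obs are made up.

```r
ls <- list(matrix(10:18, 3, 3), matrix(18:10, 3, 3))

# Assumption: the observed matrix is the last element of the list.
obs <- length(ls)

# Stack the matrices into a rows x cols x n array.
arr <- array(unlist(ls), dim = c(dim(ls[[1]]), length(ls)))

# For every cell, rank the n values and keep the rank of the observed one.
# rank() uses average ranks for ties by default, so a tie of two gives 1.5.
rank_obs <- apply(arr, c(1, 2), function(v) rank(v)[obs])
```

A rank equal to the number of matrices means the observed value is the largest in that cell.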

Plot along different dimensions

I have the following basic code. The first line sums p along dimension 1 to create a 1 x column array. The next line plots A. Unfortunately, it seems that Julia assumes it must plot many lines (in this case just points) along dimension 2.
A = sum(p,dims = 1)
plot(A)
So, my question is, how can I plot a simple line when the data is in a 1 x column array?
I assume you use Plots.jl. The following is from Plots.jl's documentation.
If the argument [to plot] is a "matrix-type", then each column will map to a series, cycling through columns if there are fewer columns than series. In this sense, a vector is treated just like an "nx1 matrix".
The number of series plot(a) tries to plot is the number of columns in a.
To get a single series, you can do one of the following:
plot(vec(a)) # `vec` will give you a vector view of `a` without an allocation
plot(a') # or `plot(transpose(a))`. `transpose` does not allocate a new array
plot(a[:]) # this allocates a new array so you should probably avoid it

How to use pointDistance with a very large vector

I've got a big problem.
I've got a large raster (rows=180, columns=480, number of cells=86400)
First I binarized it (so that there are only 1's and 0's) and then I labelled the clusters. (Cells that are 1 and connected to each other got the same label.)
Now I need to calculate all the distances between the cells that are NOT 0.
There are quite a lot of them, and that's my big problem.
I did this to get the coordinates of the cells I'm interested in (get the positions (i.e. cell numbers) of the cells, that are not 0):
V=getValues(label)
Vu=c(1:max(V))
pos=which(V %in% Vu)
XY=xyFromCell(label,pos)
This works very well. So XY is a matrix, which contains all the coordinates (of cells that are not 0). But now I'm struggling. I need to calculate the distances between ALL of these coordinates. Then I have to put each one of them in one of 43 bins of distances. It's kind of like this (just an example):
0<x<0.2 bin 1
0.2<x<0.4 bin2
When I use this:
pD=pointDistance(XY,lonlat=FALSE)
R says it's not possible to allocate a vector of this size. It's getting too large.
Then I thought I could do this (create an empty data frame df or something like that and let the function pointDistance run over every single value of XY):
for (i in 1:nrow(XY)) {
  pD = pointDistance(XY, XY[i,], lonlat=FALSE)
  pDbin = as.matrix(table(cut(pD, breaks=seq(0,8.6,by=0.2), labels=1:43)))
  df = cbind(df, pDbin)
  df = apply(df, 1, FUN=function(x) sum(x))
}
It works when I try this with e.g. the first 50 values of XY.
But when I use it for the whole XY matrix, it takes too much time. (Sometimes this XY matrix contains 10000 xy-coordinates.)
Does anyone have an idea how to do it faster?
I don't know whether this will be fast enough, but I recommend you try the following:
Let's say you have a data frame with a value of 0 or 1 in each cell. To find the coordinates, all you have to do is write the code below:
cord_matrix <- which(dataframe == 1, arr.ind = TRUE)
Now you have the coordinate matrix with row and column indices.
To find the Euclidean distances, use the dist() function. It looks like this:
dist_vector <- dist(cord_matrix)
It returns the lower triangle of the distance matrix, which can be converted to a vector or a symmetric matrix. Then all you have to do is compute the bins according to your requirements.
Let me know if this fits within the available memory.
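If the full set of pairwise distances does not fit in memory, one alternative is to compute the distances in blocks of rows and only keep the running bin counts. The following is a rough base R sketch of that idea, not the raster package's own method: it assumes planar Euclidean distances (as with lonlat=FALSE), a two-column coordinate matrix, and a made-up function name bin_pairwise_distances with an arbitrary default chunk size.

```r
# Count pairwise Euclidean distances per bin without storing them all at once.
# XY: two-column matrix of coordinates; breaks: bin edges as used by cut().
bin_pairwise_distances <- function(XY, breaks, chunk = 500) {
  n <- nrow(XY)
  counts <- numeric(length(breaks) - 1)
  for (start in seq(1, n, by = chunk)) {
    idx <- start:min(start + chunk - 1, n)
    # Distances from the rows in this chunk to all rows.
    dx <- outer(XY[idx, 1], XY[, 1], "-")
    dy <- outer(XY[idx, 2], XY[, 2], "-")
    d <- sqrt(dx^2 + dy^2)
    # Count each unordered pair once: keep entries whose row index < column index.
    keep <- outer(idx, seq_len(n), "<")
    counts <- counts + as.vector(table(cut(d[keep], breaks = breaks)))
  }
  counts
}

# Small usage example with three points and four bins of width 0.5.
pts <- cbind(c(0, 1, 0), c(0, 0, 1))
small_counts <- bin_pairwise_distances(pts, seq(0, 2, by = 0.5), chunk = 2)
```

For the question's setup, the call would presumably look like bin_pairwise_distances(XY, seq(0, 8.6, by = 0.2)); distances outside the breaks are dropped by cut(), as in the original loop.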

R Compare each data value of a column to rest of the values in the column?

I would like to create a function that looks at a column of values. From those values, it should look at each value individually and assess which of the other data points is closest to that data point.
I'm guessing it could be done by checking the length of the data frame and making a list of that length in steps of 1, then using that list to reference which cell is being analysed against the rest of the column, though I don't know how to implement that.
eg.
data:
20
17
29
33
1) is closest to 2)
2) is closest to 1)
3) is closest to 4)
4) is closest to 3)
I found this example, which tests for similarity, but I'd like to know which element it assigns the match to.
x=c(1:100)
your.number=5.43
which(abs(x-your.number)==min(abs(x-your.number)))
Also, if you know how I could do this, could you explain the parts of the code and what they mean?
I wrote a quick function that does the same thing as the code you provided.
The code you provided takes the absolute value of the difference between your number and each value in the vector, and compares that to the minimum value of that vector of differences. This is the same as the which.min function that I use below. I go through my steps below. Hope this helps.
Make up some data
a = 1:100
yourNumber = 6
Where Num is your number, and x is a vector
getClosest=function(x, Num){
return(which.min(abs(x-Num)))
}
Then if you run this command, it should return the index for the value of the vector that corresponds to the closest value to your specified number.
getClosest(x=a, Num=yourNumber)
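To compare every value in a column against the rest of that column, as in the 20/17/29/33 example from the question, the same which.min idea can be applied to a matrix of all pairwise differences. This is a small sketch with a made-up function name, closest_other; it assumes ties are resolved by taking the first match, which is what which.min does.

```r
# For each element of x, return the index of the closest *other* element.
closest_other <- function(x) {
  d <- abs(outer(x, x, "-"))   # d[i, j] = |x[i] - x[j]|
  diag(d) <- Inf               # a value must not be matched with itself
  apply(d, 1, which.min)       # per row, index of the smallest difference
}

closest <- closest_other(c(20, 17, 29, 33))
# closest is 2 1 4 3: value 1 is closest to value 2, value 3 to value 4, and so on.
```

For a data frame column, the call would be closest_other(df$column), where df and column stand in for your own names.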

Clustering big data

I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable names and the third column contains the score between the two. The total number of variables is 250,000 (A, B, C, ...), and the score is a float in [0,1]. The file is approximately 50 GB. The pairs A, B whose score is 1 have been removed, as more than half the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work with a data set of this size. Plus, implementations usually need more than one copy of the matrix. You may then need about 1 TB of RAM: 2 * 8 * 250000 * 250000 bytes is a lot.
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.
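The single-pass idea on sorted data amounts to processing pairs in order of increasing distance and merging two clusters whenever a pair connects different ones. Below is a toy sketch of that idea in R, purely for illustration on a handful of pairs and not something to run on a 50 GB file. It assumes the score is a distance (smaller means closer; for similarities the order would be reversed), that variable names have been mapped to integer ids 1..n, and the function name single_link_merges is made up.

```r
# Toy single-link clustering: union-find over pairs sorted by distance.
# from, to: integer ids of the two variables in each pair; dist: their distance.
single_link_merges <- function(from, to, dist, n) {
  parent <- seq_len(n)                 # each item starts as its own cluster
  find <- function(i) {                # follow parent links to the cluster root
    while (parent[i] != i) i <- parent[i]
    i
  }
  merges <- list()
  for (k in order(dist)) {             # closest pairs first
    ra <- find(from[k])
    rb <- find(to[k])
    if (ra != rb) {                    # pair joins two different clusters
      parent[rb] <- ra
      merges[[length(merges) + 1]] <- c(root_a = ra, root_b = rb, height = dist[k])
    }
  }
  do.call(rbind, merges)               # one row per merge, in merge order
}

# Three items (A = 1, B = 2, C = 3) with made-up distances.
merges <- single_link_merges(from = c(1, 2, 1), to = c(2, 3, 3),
                             dist = c(0.1, 0.5, 0.9), n = 3)
```

If the file is already sorted by score, the order() step disappears and the pairs can be streamed one at a time, which is what makes the single pass possible.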
