How to merge extreme points of observations and select dominating units only? - r

I need to build an algorithm which will:
For 116 existing observations of 2 variables x1 and x2 (plotted individually: one single point)
Create new observations by merging extreme points of 2 existing observations (ex: observation 117 will have 2 extreme points, (x1_115, x2_115) and (x1_30, x2_30)). Do this for all combinations.
If, for one combination, one pair dominates the other: x1_a < x1_b AND x2_a < x2_b, only select a.
For the new set of 116+n newly created variables, remove the dominated pairs, in the same logic as above.
Continue until we cannot create new non-dominated pairs.
I'm trying to solve this problem by creating independent functions for each operation. So far I have created the ConvexUnion function which merges extreme points (simply the union of 2 observations), but it does not take into account dominance yet.
ConvexUnion <- function(a,b){
output = NULL
for (i in 1:ncol(a)) {
u = unique(rbind(a[,i],b[,i]), incomparables = FALSE)
output = cbind(output, u)
}
output #the extreme points of the newly created pair
}
a = matrix(c(50,70), ncol = 2)
b = matrix(c(60,85), ncol = 2)
v = ConvexUnion(a,b)
TRAFO LABOR DELLV CLIENTS
1 49 15023 180119 11828
2 54 3118 212988 13465
3 31 6016 81597 4787
4 39 8909 127263 10291
5 9 1789 30095 2205
6 59 8327 190405 12045
7 95 11985 288146 16379
8 54 11309 208009 12252
9 13 3844 53631 4426
10 148 26348 459371 39831
11 17 3968 48798 3210
12 157 20131 366409 27050
13 18 4614 60366 4673
14 17 5941 49042 3950
15 77 6449 226815 12584
Here, the result for the new pair, which is the so-called convex union of a and b, would be (50,70) because a dominates b (both x1 and x2 are smaller).
How do I solve the problem?

Related

Autocorrelation Functions - time series analysis in R

I have the following df:
TS A_f1 A_p B_f1 B_p C_f1 C_p
1 10 100 15 150 17 170
2 20 200 25 250 27 270
3 30 300 35 350 37 370
This is, however, only a simplification of my real df with 40k+ observations and 100+ features.
TS are timestamps - in every row there are stores listed ("A","B","C", n ...) with features (f1, p, f_n ...)
Before I want to train a LSTM on my df, I want to use the acf function (or pacf) to find some patterns on my data to do some feature selection beforehand.
Any idea, how I can do this with my data?

What can do to find and remove semi-duplicate rows in a matrix?

Assume I have this matrix
set.seed(123)
x <- matrix(rnorm(410),205,2)
x[8,] <- c(0.13152348, -0.05235148) #similar to x[5,]
x[16,] <- c(1.21846582, 1.695452178) #similar to x[11,]
The values are very similar to the rows specified above, and in the context of the whole data, they are semi-duplicates. What could I do to find and remove them? My original data is an array that contains many such matrices, but the position of the semi duplicates is the same across all matrices.
I know of agrep but the function operates on vectors as far as I understand.
You will need to set a threshold, but you can just compute the distance between each row using dist and find the points that are sufficiently close together. Of course, Each point is near itself, so you need to ignore the diagonal of the distance matrix.
DM = as.matrix(dist(x))
diag(DM) = 1 ## ignore diagonal
which(DM < 0.025, arr.ind=TRUE)
row col
8 8 5
5 5 8
16 16 11
11 11 16
48 48 20
20 20 48
168 168 71
91 91 73
73 73 91
71 71 168
This finds the "close" points that you created and a few others that got generated at random.

R: applying a function on whole dataset to find points within a circle

I have a difficulty with application of the data frame on my function in R. I have a data.frame with three columns ID of a point, its location on x axis and its location on y axis. All I need to do is to find for a given point IDs of points that lies in its neighborhood. I've made the function that shows whether the point lies within a circle where the center is a location of observed point and returns it's ID if true.
Here is my code:
point_id <- locationdata$point_id
x_loc <- locationdata$x_loc
y_loc <- locationdata$y_loc
locdata <- data.frame(point_id, x_loc, y_loc)
#radius set to1km
incircle3 <- function(x_loc, y_loc, center_x, center_y, pointid, r = 1000000){
dx = (x_loc-center_x)
dy = (y_loc-center_y)
if (b <- dx^2 + dy^2 <= r^2){
print(shopid)} ##else {print('')}
}
Unfortunately I don't know how to apply this function on the whole data frame. So once I enter the locations of the observed point it would return me IDs of all points that lies in the neighborhood. Ideally I would need to find this relation for all the points automatically. So it would return me the points that lies in the neighborhood of each point from the dataset. Previously I have been inserting the center_x and center_y manually.
Thank you very much for your advices in advance!
You can tackle this with R's dist function:
# set the random seed and create some dummy data
set.seed(101)
dummy <- data.frame(id=1:100, x=runif(100), y=runif(100))
> head(dummy)
id x y
1 1 0.37219838 0.12501937
2 2 0.04382482 0.02332669
3 3 0.70968402 0.39186128
4 4 0.65769040 0.85959857
5 5 0.24985572 0.71833452
6 6 0.30005483 0.33939503
Call the dist function which returns a dist object. The default distance metric is Euclidean which is what you have coded in your question.
dists <- dist(dummy[,2:3])
Loop over the distance matrix and return the indices for each id that are within some constant distance:
neighbors <- apply(as.matrix(dists), 1, function(x) which(x < 0.33))
> neighbors[[1]]
1 6 7 8 19 23 30 32 33 34 42 44 46 51 55 87 88 91 94 99
Here's a modification to handle volatile ids:
set.seed(101)
dummy <- data.frame(id=sample(1:100, 100), x=runif(100), y=runif(100))
> head(dummy)
id x y
1 38 0.12501937 0.60567568
2 5 0.02332669 0.56259740
3 70 0.39186128 0.27685556
4 64 0.85959857 0.22614243
5 24 0.71833452 0.98355758
6 29 0.33939503 0.09838715
dists <- dist(dummy[,2:3])
neighbors <- apply(as.matrix(dists), 1, function(x) {
dummy$id[which(x < 0.33)]
})
names(neighbors) <- dummy$id
> neighbors[['38']]
[1] 38 5 55 80 63 76 17 71 47 11 88 13 41 21 36 31 73 61 99 59 39 89 94 12 18 3

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating and calculating the same set of indices for multiple sites. The following sample dataset describes my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents certain field-measured parameters (i.e. depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to simultaneously calculate simple functions on multiple rows:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and appending the results back to the proper station in a dataframe.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to loop or apply the same function to multiple columns on a single row and complete the same calculations over the next row without mixing variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)){
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to approach looping through multiple lines of data when I need unique variables saved for each specific row that I am currently calculating? Thank you for any suggestions.
your exact problem is still a mistery to me...
but it looks like you want a double for loop
for(i in 1:nrow(thalweg)){
residual=thalweg[i,"Xdep_Vdep"]
for(j in 2:11){
residual=min(residual,thalweg[i,j])
}
}

Retrieving adjaceny values in a nng igraph object in R

edited to improve the quality of the question as a result of the (wholly appropriate) spanking received by Spacedman!
I have a k-nearest neighbors object (an igraph) which I created as such, by using the file I have uploaded here:
I performed the following operations on the data, in order to create an adjacency matrix of distances between observations:
W <- read.csv("/path/sim_matrix.csv")
W <- W[, -c(1,3)]
W <- scale(W)
sim_matrix <- dist(W, method = "euclidean", upper=TRUE)
sim_matrix <- as.matrix(sim_matrix)
mygraph <- nng(sim_matrix, k=10)
This give me a nice list of vertices and their ten closest neighbors, a small sample follows:
1 -> 25 26 28 30 32 144 146 151 177 183 2 -> 4 8 32 33 145 146 154 156 186 199
3 -> 1 25 28 51 54 106 144 151 177 234 4 -> 7 8 89 95 97 158 160 170 186 204
5 -> 9 11 17 19 21 112 119 138 145 158 6 -> 10 12 14 18 20 22 147 148 157 194
7 -> 4 13 123 132 135 142 160 170 173 174 8 -> 4 7 89 90 95 97 158 160 186 204
So far so good.
What I'm struggling with, however, is how to to get access to the values for the weights between the vertices that I can do meaningful calculations on. Shouldn't be so hard, this is a common thing to want from graphs, no?
Looking at the documentation, I tried:
degree(mygraph)
which gives me the sum of the weights for each node. But I don't want the sum, I want the raw data, so I can do my own calculations.
I tried
get.data.frame(mygraph,"E")[1:10,]
but this has none of the distances between nodes:
from to
1 1 25
2 1 26
3 1 28
4 1 30
5 1 32
6 1 144
7 1 146
8 1 151
9 1 177
10 1 183
I have attempted to get values for the weights between vertices out of the graph object, that I can work with, but no luck.
If anyone has any ideas on how to go about approaching this, I'd be grateful. Thanks.
It's not clear from your question whether you are starting with a dataset, or with a distance matrix, e.g. nng(x=mydata,...) or nng(dx=mydistancematrix,...), so here are solutions with both.
library(cccd)
df <- mtcars[,c("mpg","hp")] # extract from mtcars dataset
# knn using dataset only
g <- nng(x=as.matrix(df),k=5) # for each car, 5 other most similar mpg and hp
V(g)$name <- rownames(df) # meaningful names for the vertices
dm <- as.matrix(dist(df)) # full distance matrix
E(g)$weight <- apply(get.edges(g,1:ecount(g)),1,function(x)dm[x[1],x[2]])
# knn using distance matrix (assumes you have dm already)
h <- nng(dx=dm,k=5)
V(h)$name <- rownames(df)
E(h)$weight <- apply(get.edges(h,1:ecount(h)),1,function(x)dm[x[1],x[2]])
# same result either way
identical(get.data.frame(g),get.data.frame(h))
# [1] TRUE
So these approaches identify the distances from each vertex to it's five nearest neighbors, and set the edge weight attribute to those values. Interestingly, plot(g) works fine, but plot(h) fails. I think this might be a bug in the plot method for cccd.
If all you want to know is the distances from each vertex to the nearest neighbors, the code below does not require package cccd.
knn <- t(apply(dm,1,function(x)sort(x)[2:6]))
rownames(knn) <- rownames(df)
Here, the matrix knn has a row for each vertex and columns specifying the distance from that vertex to it's 5 nearest neighbors. It does not tell you which neighbors those are, though.
Okay, I've found a nng function in cccd package. Is that it? If so.. then mygraph is just an igraph object and you can just do E(mygraph)$whatever to get the names of the edge attributes.
Following one of the cccd examples to create G1 here, you can get a data frame of all the edges and attributes thus:
get.data.frame(G1,"E")[1:10,]
You can get/set individual edge attributes with E(g)$whatever:
> E(G1)$weight=1:250
> E(G1)$whatever=runif(250)
> get.data.frame(G1,"E")[1:10,]
from to weight whatever
1 1 3 1 0.11861240
2 1 7 2 0.06935047
3 1 22 3 0.32040316
4 1 29 4 0.86991432
5 1 31 5 0.47728632
Is that what you are after? Any igraph package tutorial will tell you more!

Resources