What can I do to find and remove semi-duplicate rows in a matrix? - r

Assume I have this matrix
set.seed(123)
x <- matrix(rnorm(410),205,2)
x[8,] <- c(0.13152348, -0.05235148) #similar to x[5,]
x[16,] <- c(1.21846582, 1.695452178) #similar to x[11,]
The values are very similar to the rows specified above, and in the context of the whole data, they are semi-duplicates. What could I do to find and remove them? My original data is an array that contains many such matrices, but the position of the semi duplicates is the same across all matrices.
I know of agrep but the function operates on vectors as far as I understand.

You will need to set a threshold, but you can just compute the distance between each pair of rows using dist and find the points that are sufficiently close together. Of course, each point is near itself, so you need to ignore the diagonal of the distance matrix.
DM = as.matrix(dist(x))
diag(DM) = 1 ## ignore diagonal
which(DM < 0.025, arr.ind=TRUE)
    row col
8     8   5
5     5   8
16   16  11
11   11  16
48   48  20
20   20  48
168 168  71
91   91  73
73   73  91
71   71 168
This finds the "close" points that you created and a few others that got generated at random.
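To actually remove the near-duplicates, one option is to keep the earlier row of each close pair and drop the later one. The following is only a minimal sketch that reuses the 0.025 threshold from above; the names close_pairs, drop_rows and x_clean are illustrative and not part of the original answer.

```r
set.seed(123)
x <- matrix(rnorm(410), 205, 2)
x[8, ]  <- c(0.13152348, -0.05235148)   # similar to x[5, ]
x[16, ] <- c(1.21846582, 1.695452178)   # similar to x[11, ]

DM <- as.matrix(dist(x))

# Look only above the diagonal, so each close pair appears once with row < col
close_pairs <- which(DM < 0.025 & upper.tri(DM), arr.ind = TRUE)

# Drop the later row of each pair and keep the earlier one
drop_rows <- unique(close_pairs[, "col"])

# setdiff() is used instead of x[-drop_rows, ] so that an empty drop_rows keeps all rows
x_clean <- x[setdiff(seq_len(nrow(x)), drop_rows), , drop = FALSE]
```

Since the positions of the semi-duplicates are said to be the same across all matrices, the same drop_rows vector could presumably be reused on each matrix of the array.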

Related

How can I extract a part of a vector to another vector (including positions)

I have a vector with different values (positive and negative), and I want to select only the 10 lowest odd values and the 10 lowest even values. Help me, please!
This is a way to do it using base R.
# vector with odd and even numbers
x <- sample(-100:100, 30)
The modulus operator in R helps to get the job done. You can use it this way:
c(
  # Extract the 10 lowest even numbers
  head(sort(x[x %% 2 == 0]), 10),
  # Extract the 10 lowest odd numbers
  head(sort(x[x %% 2 == 1]), 10)
)
Given vector v as your input vector, you can obtain the desired output (including positions) via the following code:
names(v) <- seq_along(v)
# lowest 10 odd numbers
low_odd <- sort(v[v%%2==1])[1:10]
# positions of those odd numbers in v
low_odd_pos <- as.numeric(names(low_odd))
# lowest 10 even numbers
low_even <- sort(v[v%%2==0])[1:10]
# positions of those even numbers in v
low_even_pos <- as.numeric(names(low_even))
Example
set.seed(1)
v <- sample(-50:50)
then
> low_odd
 43 101  39  95  85  72   7  73  45  29
-49 -47 -45 -43 -41 -39 -37 -35 -33 -31
> low_odd_pos
[1] 43 101 39 95 85 72 7 73 45 29
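If you would rather not attach names to the vector, the positions can also be recovered with which() and order(). This is a small sketch on a made-up vector (the values and the cut-offs of 3 and 2 are only for illustration). Note that in R the result of %% takes the sign of the divisor, so -9 %% 2 is 1 and negative odd numbers are matched by v %% 2 == 1.

```r
v <- c(7, -4, 3, -9, 10, -2, 5, -1)

# Positions of all odd and all even values
odd_pos  <- which(v %% 2 == 1)
even_pos <- which(v %% 2 == 0)

# Sort those positions by the values they point at, then keep the lowest few
low_odd_pos  <- head(odd_pos[order(v[odd_pos])], 3)
low_even_pos <- head(even_pos[order(v[even_pos])], 2)

v[low_odd_pos]   # -9 -1  3
v[low_even_pos]  # -4 -2
```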

How to merge extreme points of observations and select dominating units only?

I need to build an algorithm which will:
For 116 existing observations of 2 variables x1 and x2 (plotted individually: one single point)
Create new observations by merging extreme points of 2 existing observations (ex: observation 117 will have 2 extreme points, (x1_115, x2_115) and (x1_30, x2_30)). Do this for all combinations.
If, for one combination, one pair dominates the other: x1_a < x1_b AND x2_a < x2_b, only select a.
For the new set of 116+n newly created variables, remove the dominated pairs, in the same logic as above.
Continue until we cannot create new non-dominated pairs.
I'm trying to solve this problem by creating independent functions for each operation. So far I have created the ConvexUnion function which merges extreme points (simply the union of 2 observations), but it does not take into account dominance yet.
ConvexUnion <- function(a,b){
  output = NULL
  for (i in 1:ncol(a)) {
    u = unique(rbind(a[,i],b[,i]), incomparables = FALSE)
    output = cbind(output, u)
  }
  output #the extreme points of the newly created pair
}
a = matrix(c(50,70), ncol = 2)
b = matrix(c(60,85), ncol = 2)
v = ConvexUnion(a,b)
   TRAFO LABOR  DELLV CLIENTS
1     49 15023 180119   11828
2     54  3118 212988   13465
3     31  6016  81597    4787
4     39  8909 127263   10291
5      9  1789  30095    2205
6     59  8327 190405   12045
7     95 11985 288146   16379
8     54 11309 208009   12252
9     13  3844  53631    4426
10   148 26348 459371   39831
11    17  3968  48798    3210
12   157 20131 366409   27050
13    18  4614  60366    4673
14    17  5941  49042    3950
15    77  6449 226815   12584
Here, the result for the new pair, which is the so-called convex union of a and b, would be (50,70) because a dominates b (both x1 and x2 are smaller).
How do I solve the problem?
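As a possible starting point for the dominance step only, here is a hedged sketch of a filter that drops every point dominated by another one, using the rule stated in the question (a dominates b when x1_a < x1_b AND x2_a < x2_b). The function name non_dominated and the third test point are made up for illustration, and the iterative merging of observations is not covered.

```r
# pts: numeric matrix with one row per point and columns (x1, x2)
non_dominated <- function(pts) {
  keep <- vapply(seq_len(nrow(pts)), function(i) {
    # TRUE for every row that is strictly smaller than row i in both coordinates
    dominators <- pts[, 1] < pts[i, 1] & pts[, 2] < pts[i, 2]
    !any(dominators)
  }, logical(1))
  pts[keep, , drop = FALSE]
}

pts <- rbind(c(50, 70), c(60, 85), c(40, 90))
res <- non_dominated(pts)
res
# (60, 85) is dropped because (50, 70) dominates it;
# (40, 90) stays because no point is smaller in both coordinates
```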

r - How to create vector with for loops and ifelse

I'm having a problem with nested for loops and ifelse statements. This is my dataframe abund:
   Species Total C1 C2 C3 C4
1     Blue   223 73 30 70 50
2    Black   221 17 50 56 98
3   Yellow   227 29 99 74 25
4    Green   236 41 97 68 30
5      Red   224 82 55 21 66
6   Orange   284 69 48 73 94
7    Black   154  9 63 20 62
8      Red   171 70 58 13 30
9     Blue   177 57 27  8 85
10  Orange   197 88 61 18 30
11  Orange   112 60  8 31 13
I would like to add together some of abund’s columns but only if they match the correct species I’ve specified in the vector colors.
colors <- c("Black", "Red", "Blue")
So, if the Species in abund matches a species in colors, then add columns C2 through C4 together in a new vector minus. If the species in abund does not match any species in colors, then add a 0 to the new vector minus.
I'm having trouble with my code and hope it's just a small matter of defining a range, but I'm not sure. This is my code so far:
# Use for loop to create vector of sums for select species or 0 for species not selected
for( i in abund$Species)
{
  for( j in colors)
  {
    minus <- ifelse(i == j, sum(abund[abund$Species == i,
      "C2"]:abund[abund$Species == i, "C4"]), 0)
  }
}
Which returns this: There were 12 warnings (use warnings() to see them)
and this "vector": minus [1] 0
This is my target:
minus
[1] 150 204 0 0 142 0 145 101 120 0 0
Thank you for your time and help with this.
This is probably better done without any loops.
# Create the vector
minus <- rep(0, nrow(abund))
# Identify the "colors" cases
inColors <- abund[["Species"]] %in% colors
# Set the values
minus[inColors] <- rowSums(abund[inColors, c("C2","C3","C4")])
Also, for what it is worth, there are quite a few problems with your original code. First, your first for loop isn't doing what you think. In each round, the value of i is set to the next value in abund$Species, so first it is Blue, then Black, then Yellow, etc. As a result, when you index using abund[abund$Species == i, ], you may get multiple rows back (e.g. Blue will give you rows 1 and 9, since Species == "Blue" in both of them).
Second, when you write abund[abund$Species == i, "C2"]:abund[abund$Species == i, "C4"], you are not indexing the columns C2, C3 and C4; you are making a sequence starting at the value in C2 and ending at the value in C4. For example, when i == "Yellow" it returns 99:25, i.e. 99, 98, 97, ... , 26, 25. The warnings came from a combination of this problem and the previous one. For example, when i == "Blue", you were trying to make a sequence starting at both 30 and 27 and ending at both 50 and 85. The warning said that only the first number of each was used, giving you 30:50.
Finally, you were constantly overwriting your value of minus rather than adding to it. You need to first create minus as above and index into it for the assignment, like this: minus[i] <- newValue (with i being a row index).
Note that ifelse is vectorized so you usually don't need any for loops when using it.
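If you do want to keep a loop, the three points above can be sketched as follows. This is just one way to write it, looping over row indices instead of species names; the data frame is rebuilt from the table in the question so the snippet runs on its own.

```r
abund <- data.frame(
  Species = c("Blue", "Black", "Yellow", "Green", "Red", "Orange",
              "Black", "Red", "Blue", "Orange", "Orange"),
  Total = c(223, 221, 227, 236, 224, 284, 154, 171, 177, 197, 112),
  C1 = c(73, 17, 29, 41, 82, 69, 9, 70, 57, 88, 60),
  C2 = c(30, 50, 99, 97, 55, 48, 63, 58, 27, 61, 8),
  C3 = c(70, 56, 74, 68, 21, 73, 20, 13, 8, 18, 31),
  C4 = c(50, 98, 25, 30, 66, 94, 62, 30, 85, 30, 13)
)
colors <- c("Black", "Red", "Blue")

# Create the result once, then fill it one row at a time
minus <- numeric(nrow(abund))
for (i in seq_len(nrow(abund))) {
  if (abund$Species[i] %in% colors) {
    # Select the three columns by name rather than building a sequence with ":"
    minus[i] <- sum(unlist(abund[i, c("C2", "C3", "C4")]))
  }
}
minus
# [1] 150 204   0   0 142   0 145 101 120   0   0
```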
I like Barker's answer best, but if you wanted to do this with ifelse this is the way:
abund$minus = with(abund, ifelse(
  Species %in% colors, # if the species matches
  C2 + C3 + C4,        # add the columns
  0                    # otherwise 0
))
Even though this is just one line and Barker's is 3, on large data it will be slightly more efficient to avoid ifelse.
However, ifelse statements can be nested and are often easier to work with when conditions get complicated - so there are definitely good times to use them. On small to medium sized data the speed difference will be negligible so just use whichever you think of first.
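To illustrate the nesting, here is a small sketch with a made-up rule that is not part of the question: Black and Blue get the sum of C2 to C4, Red gets only C2, and everything else gets 0. The vectors are the first few values from abund, typed in directly so the snippet is self-contained.

```r
species <- c("Blue", "Black", "Yellow", "Red")
c2 <- c(30, 50, 99, 55)
c3 <- c(70, 56, 74, 21)
c4 <- c(50, 98, 25, 66)

# Hypothetical rule, for illustration only
minus <- ifelse(species %in% c("Black", "Blue"), c2 + c3 + c4,
         ifelse(species == "Red",                c2,
                                                 0))
minus
# [1] 150 204   0  55
```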
# Create a column called minus with the length of the number of existing rows.
# The default value is zero.
abund$minus <- integer(nrow(abund))
# Perform sum of C2 to C4 only in those rows where Species is in the colors vector
abund$minus[abund$Species %in% colors] <- rowSums(abund[abund$Species %in% colors, c("C2","C3","C4")])

Summing values after every third position in data frame in R

I am new to R. I have a data frame like following
>df=data.frame(Id=c("Entry_1","Entry_1","Entry_1","Entry_2","Entry_2","Entry_2","Entry_3","Entry_4","Entry_4","Entry_4","Entry_4"),Start=c(20,20,20,37,37,37,68,10,10,10,10),End=c(50,50,50,78,78,78,200,94,94,94,94),Pos=c(14,34,21,50,18,70,101,35,2,56,67),Hits=c(12,34,17,89,45,87,1,5,6,3,26))
Id Start End Pos Hits
Entry_1 20 50 14 12
Entry_1 20 50 34 34
Entry_1 20 50 21 17
Entry_2 37 78 50 89
Entry_2 37 78 18 45
Entry_2 37 78 70 87
Entry_3 68 200 101 1
Entry_4 10 94 35 5
Entry_4 10 94 2 6
Entry_4 10 94 56 3
Entry_4 10 94 67 26
For each entry I would like to iterate over the data.frame in 3 different modes. For example, for Entry_1, mode_1 = seq(20,50,3), mode_2 = seq(21,50,3) and mode_3 = seq(22,50,3). I would like to sum all the values in column "Hits" whose corresponding values in column "Pos" fall in mode_1, mode_2 or mode_3, and generate a data.frame like the following:
Id Mode_1 Mode_2 Mode_3
Entry_1 0 17 34
Entry_2 87 89 0
Entry_3 1 0 0
Entry_4 26 8 0
I tried the following code:
mode_1=0
mode_2=0
mode_3=0
mode_1_sum=0
mode_2_sum=0
mode_3_sum=0
for(i in dim(df)[1])
{
  if(df$Pos[i] %in% seq(df$Start[i],df$End[i],3))
  {
    mode_1_sum=mode_1_sum+df$Hits[i]
    print(mode_1_sum)
  }
  mode_1=mode_1_sum+counts
  print(mode_1)
  ifelse(df$Pos[i] %in% seq(df$Start[i]+1,df$End[i],3))
  {
    mode_2_sum=mode_2_sum+df$Hits[i]
    print(mode_2_sum)
  }
  mode_2_sum=mode_2_sum+counts
  print(mode_2)
  ifelse(df$Pos[i] %in% seq(df$Start[i]+2,df$End[i],3))
  {
    mode_3_sum=mode_3_sum+df$Hits[i]
    print(mode_3_sum)
  }
  mode_3_sum=mode_3_sum+counts
  print(mode_3_sum)
}
But the above code only prints 26. Can any one guide me how to generate my desired output, please. I can provide much more details if needed. Thanks in advance.
It's not an elegant solution, but it works.
m <- 3 # Number of modes you want
foo <- ((df$Pos - df$Start)%%m + 1) * (df$Start <= df$Pos) * (df$End >= df$Pos)
tab <- matrix(0,nrow(df),m)
for(i in 1:m) tab[foo==i,i] <- df$Hits[foo==i]
aggregate(tab,list(df$Id),FUN=sum)
# Group.1 V1 V2 V3
# 1 Entry_1 0 17 34
# 2 Entry_2 87 89 0
# 3 Entry_3 1 0 0
# 4 Entry_4 26 8 0
-- EXPLANATION --
First, we find the entries of df$Pos that lie between df$Start and df$End. Both bounds are included, because seq(Start, End, 3) contains Start itself and the three modes together cover every integer up to End. These comparisons give 1 if TRUE and 0 if FALSE. Next, we take the difference between df$Pos and df$Start, we take mod 3 (which will give a vector of 0s, 1s and 2s), and then we add 1 to get the right mode. We multiply these things together, so that the values that fall within the interval retain the right mode, and the values that fall outside the interval become 0.
Next, we create an empty matrix that will contain the values. Then, we use a for-loop to fill in the matrix. Finally, we aggregate the matrix.
I tried looking for a quicker solution, but the main problem I cannot work around is the varying intervals for each row.
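For comparison, a loop-free variant can be sketched with xtabs(). It uses the same mode arithmetic as above; the helper names sub and res are only for illustration, and I have not compared its speed with the aggregate() version.

```r
df <- data.frame(
  Id    = c("Entry_1", "Entry_1", "Entry_1", "Entry_2", "Entry_2", "Entry_2",
            "Entry_3", "Entry_4", "Entry_4", "Entry_4", "Entry_4"),
  Start = c(20, 20, 20, 37, 37, 37, 68, 10, 10, 10, 10),
  End   = c(50, 50, 50, 78, 78, 78, 200, 94, 94, 94, 94),
  Pos   = c(14, 34, 21, 50, 18, 70, 101, 35, 2, 56, 67),
  Hits  = c(12, 34, 17, 89, 45, 87, 1, 5, 6, 3, 26)
)

# Keep only the rows whose Pos lies inside [Start, End]
sub <- df[df$Pos >= df$Start & df$Pos <= df$End, ]

# Fix the factor levels so that entries and modes without hits still show up as 0
sub$Id   <- factor(sub$Id, levels = unique(as.character(df$Id)))
sub$Mode <- factor((sub$Pos - sub$Start) %% 3 + 1,
                   levels = 1:3, labels = paste0("Mode_", 1:3))

# Sum Hits for every Id / Mode combination
res <- xtabs(Hits ~ Id + Mode, data = sub)
res
```

If a plain data.frame is preferred over a table, as.data.frame.matrix(res) converts it.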

Retrieving adjacency values in an nng igraph object in R

edited to improve the quality of the question as a result of the (wholly appropriate) spanking received by Spacedman!
I have a k-nearest neighbors object (an igraph) which I created as such, by using the file I have uploaded here:
I performed the following operations on the data, in order to create an adjacency matrix of distances between observations:
W <- read.csv("/path/sim_matrix.csv")
W <- W[, -c(1,3)]
W <- scale(W)
sim_matrix <- dist(W, method = "euclidean", upper=TRUE)
sim_matrix <- as.matrix(sim_matrix)
mygraph <- nng(sim_matrix, k=10)
This gives me a nice list of vertices and their ten closest neighbors, a small sample follows:
1 -> 25 26 28 30 32 144 146 151 177 183 2 -> 4 8 32 33 145 146 154 156 186 199
3 -> 1 25 28 51 54 106 144 151 177 234 4 -> 7 8 89 95 97 158 160 170 186 204
5 -> 9 11 17 19 21 112 119 138 145 158 6 -> 10 12 14 18 20 22 147 148 157 194
7 -> 4 13 123 132 135 142 160 170 173 174 8 -> 4 7 89 90 95 97 158 160 186 204
So far so good.
What I'm struggling with, however, is how to get access to the values of the weights between the vertices, so that I can do meaningful calculations on them. Shouldn't be so hard, this is a common thing to want from graphs, no?
Looking at the documentation, I tried:
degree(mygraph)
which gives me the sum of the weights for each node. But I don't want the sum, I want the raw data, so I can do my own calculations.
I tried
get.data.frame(mygraph,"E")[1:10,]
but this has none of the distances between nodes:
from to
1 1 25
2 1 26
3 1 28
4 1 30
5 1 32
6 1 144
7 1 146
8 1 151
9 1 177
10 1 183
I have attempted to get values for the weights between vertices out of the graph object, that I can work with, but no luck.
If anyone has any ideas on how to go about approaching this, I'd be grateful. Thanks.
It's not clear from your question whether you are starting with a dataset, or with a distance matrix, e.g. nng(x=mydata,...) or nng(dx=mydistancematrix,...), so here are solutions with both.
library(cccd)
df <- mtcars[,c("mpg","hp")] # extract from mtcars dataset
# knn using dataset only
g <- nng(x=as.matrix(df),k=5) # for each car, 5 other most similar mpg and hp
V(g)$name <- rownames(df) # meaningful names for the vertices
dm <- as.matrix(dist(df)) # full distance matrix
E(g)$weight <- apply(get.edges(g,1:ecount(g)),1,function(x)dm[x[1],x[2]])
# knn using distance matrix (assumes you have dm already)
h <- nng(dx=dm,k=5)
V(h)$name <- rownames(df)
E(h)$weight <- apply(get.edges(h,1:ecount(h)),1,function(x)dm[x[1],x[2]])
# same result either way
identical(get.data.frame(g),get.data.frame(h))
# [1] TRUE
So these approaches identify the distances from each vertex to its five nearest neighbors, and set the edge weight attribute to those values. Interestingly, plot(g) works fine, but plot(h) fails. I think this might be a bug in the plot method for cccd.
If all you want to know is the distances from each vertex to the nearest neighbors, the code below does not require package cccd.
knn <- t(apply(dm,1,function(x)sort(x)[2:6]))
rownames(knn) <- rownames(df)
Here, the matrix knn has a row for each vertex and columns specifying the distance from that vertex to its 5 nearest neighbors. It does not tell you which neighbors those are, though.
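If the identities of the neighbors are needed as well, order() can be used in place of sort(). The sketch below is one possible way to do it and is not part of the original answer; the names nbr_idx, nbr_names and nbr_dist are made up. The diagonal is set to Inf first, so that a vertex is never returned as its own neighbor even when another row is at distance 0 (in mtcars, Mazda RX4 and Mazda RX4 Wag have identical mpg and hp).

```r
df <- mtcars[, c("mpg", "hp")]
dm <- as.matrix(dist(df))   # full distance matrix
k  <- 5

# Exclude self-matches explicitly instead of relying on position 1 after sorting
dm_noself <- dm
diag(dm_noself) <- Inf

# Row i holds the row indices of the k nearest neighbors of vertex i
nbr_idx <- t(apply(dm_noself, 1, function(d) order(d)[1:k]))

# The same information as vertex names
nbr_names <- matrix(rownames(dm)[nbr_idx], nrow = nrow(dm),
                    dimnames = list(rownames(dm), NULL))

# Matching distances, looked up in the original distance matrix
nbr_dist <- matrix(dm[cbind(as.vector(row(nbr_idx)), as.vector(nbr_idx))],
                   nrow = nrow(dm), dimnames = list(rownames(dm), NULL))
```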
Okay, I've found an nng function in the cccd package. Is that it? If so, then mygraph is just an igraph object and you can use E(mygraph)$whatever to get at the edge attributes.
Following one of the cccd examples to create G1 here, you can get a data frame of all the edges and attributes thus:
get.data.frame(G1,"E")[1:10,]
You can get/set individual edge attributes with E(g)$whatever:
> E(G1)$weight=1:250
> E(G1)$whatever=runif(250)
> get.data.frame(G1,"E")[1:10,]
from to weight whatever
1 1 3 1 0.11861240
2 1 7 2 0.06935047
3 1 22 3 0.32040316
4 1 29 4 0.86991432
5 1 31 5 0.47728632
Is that what you are after? Any igraph package tutorial will tell you more!
