In R: Avoid duplicates in selection from many rows

Summary: I have an array of 10 rows and 4 columns filled with numbers. I select one number from each row and want to avoid duplicates in the selection.
In more detail:
I have a grid of 100*100 cells. In that grid are 10 cells that contain a "person". In an iterative process I want to make the persons "walk around" in the grid, but I do not want two persons to end up in the same cell at the same time.
I have a vector that describes the positions of the 10 persons. It contains the cell numbers of the cells holding a person. These positions count across all rows and columns (i.e. they range over 1:10000). For example, position 234 would be in the 3rd row, 34th column.
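If it helps to convert between the two representations, the mapping is simple integer arithmetic (a small sketch; the helper names are just illustrative):
pos_to_row <- function(pos) (pos - 1) %/% 100 + 1
pos_to_col <- function(pos) (pos - 1) %% 100 + 1
pos_to_row(234) # 3
pos_to_col(234) # 34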
Positions<-sample(1:10000,10) #Initial positions
What I did first is make an array of the cells surrounding each person (up, right, down, left), giving 4 candidate positions for each person:
Surroundings<-array(c(Positions+100,Positions+1,Positions-100,Positions-1),dim=c(10,4))
I then pick one random direction from each row of Surroundings, collecting the results in the vector PosNew. It is this last vector in which I want to avoid duplicates.
I could simply redraw PosNew until it contains no duplicates, but that could take very long. There are probably more efficient ways to do this.
For simplicity's sake, let's assume that persons do not walk off the grid and no other errors occur.
My script:
Positions <- sample(1:10000, 10) # Initial positions
for (i in 1:50) {
  Surroundings <- array(c(Positions + 100, Positions + 1, Positions - 100, Positions - 1),
                        dim = c(10, 4))
  PosNew <- Surroundings[cbind(1:10, sample(1:4, 10, replace = TRUE))]
  Dups <- length(which(duplicated(PosNew) == TRUE))
  Positions <- PosNew
}
I am looking for a way to check for duplicates in the selected new positions and make sure that Dups is never above zero. Any suggestions are welcome, including suggestions to make the code faster/more efficient.
Added: What can I do when, at some point, one or more of the persons really cannot move to an empty cell because all 4 neighbouring cells are occupied? In that case I want the person to stay in its original cell. How would I code that?
Thank you so much for your time!

As this is an iterative process where every person's move depends on the locations of the others, I don't think you can do much better than moving one person at a time and sampling the position for the next from the set difference of that person's four directions and all positions occupied so far (note that this adds a bit of unfairness, as the first person has the most freedom to move, so to speak).
So the code would be something like this:
Positions <- sample(1:10000, 10) # Initial positions
for (i in 1:50) {
  Surroundings <- array(c(Positions + 100, Positions + 1, Positions - 100, Positions - 1),
                        dim = c(10, 4))
  # BEGIN NEW CODE
  PosNew <- numeric(10)
  for (j in 1:10) { # use a different index than the outer loop's i
    # PosNew[seq_len(j - 1)] is the set of positions already claimed
    available <- setdiff(Surroundings[j, ], PosNew[seq_len(j - 1)])
    if (length(available) != 0)
      # subset by a sampled index: sample(available, 1) misbehaves when
      # available has length 1, because it then samples from 1:available
      PosNew[j] <- available[sample.int(length(available), 1)]
    else
      PosNew[j] <- Positions[j] # stay where you are
  }
  # END NEW CODE
  Dups <- sum(duplicated(PosNew)) # shorter version - sum logical values to get a count
  Positions <- PosNew
}
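One edge case worth noting: if a blocked person stays put, an earlier mover may already have claimed that very cell, because the setdiff only excludes cells taken so far, not the current cells of persons that have not yet moved. A minimal guard, assuming you want strictly collision-free moves, is to exclude those cells as well:
available <- setdiff(Surroundings[j, ],
                     c(PosNew[seq_len(j - 1)], Positions[j:10])) # also exclude cells of persons still waiting to move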
Hope this helps!

Related

R programming- adding column in dataset error

cv.uk.df$new.d[2:nrow(cv.uk.df)] <- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1) # this line of code works
I wanted to know why we use -1 in tail() and -1 in head() to create this new column.
I tried to understand by removing the -1, and R (the code is in RStudio) throws an error.
Could anyone shed some light on this? I can't explain how much I would appreciate it.
Look at what is being done. On the left-hand side of the assignment operator, we have:
cv.uk.df$new.d[2:nrow(cv.uk.df)] <-
Let's pick this apart.
cv.uk.df # This is the data.frame
$new.d # a new column to assign or a column to reassign
[2:nrow(cv.uk.df)] # the rows which we are going to assign
Specifically, this line of code will assign a new value to all rows of this column except the first. Why would we want to do that? We don't have your data, but from your example, it looks like you want to calculate the change from one line to the next. That calculation is invalid for the first row (no previous row).
Now let's look at the right-hand side.
<- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1)
The cv.uk.df$deaths column has the same number of rows as the data.frame. R gets grouchy when the numbers of elements don't line up. For data.frames, the right-hand side needs to have the same number of elements as the target, or a number that can be recycled a whole number of times. For example, if you are assigning to 10 rows, you need to supply 10 values, or 5 values that R will recycle twice.
If your data.frame has 100 rows, only 99 are being replaced in this operation. You cannot feed 100 values into an operation that expects 99. We need to trim the data. Let's look at what is happening. The tail() function has the usage tail(x, n), where it returns the last n values of x. If n is a negative integer, tail() returns all values but the first n. The head() function works similarly.
tail(cv.uk.df$deaths, -1) # This returns all values but the first
head(cv.uk.df$deaths, -1) # This returns all values but the last
This makes sense for your calculation. You cannot subtract the number of deaths in the row before the first row from the number in the first row, nor can you subtract the number of deaths in the last row from the number in the row after the last row; neither of those rows exists. There are more intuitive ways to do this using functions from other packages, but this gets the job done.
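A tiny worked example (with made-up numbers) may make the trimming clearer; note that base R's diff() computes the same successive differences directly:
deaths <- c(5, 8, 15, 27)
tail(deaths, -1) # 8 15 27 (all but the first)
head(deaths, -1) # 5 8 15 (all but the last)
tail(deaths, -1) - head(deaths, -1) # 3 7 12
diff(deaths) # 3 7 12 -- the same result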

Find n closest non-NA values to position t in vector

This is probably a simple question for those experienced in R, but it is something that I (a novice) am struggling with...
I have two example vectors, A and B, typical of the problem I am trying to solve:
A <- c(1,3,NA,3,NA,4,NA,1,7,NA,2,NA,9,9,10)
B <- c(1,3,NA,NA,NA,NA,NA,NA,NA,NA,2,NA,9)
#and three scalars
R <- 4
t <- 5
N <- 3
There is a fourth scalar, n, where 0<=n<=N. In general, N <= R.
I want to find the n closest non-NA values to t such that they fall within a radius R centered on t (i.e. the search radius R comprises R+1 values). For example, in A the search-radius sequence is (3,NA,3,NA,4,NA,1), where t falls on the NA in the middle of the sequence.
The expected answer can be one of two results for A:
answerA1 <- c(3,4,1)
OR
answerA2 <- c(3,4,3)
The expected answer for B:
answerB <- c(1,3)
How would I accomplish this task in the most time- and space-efficient manner? One liners, loops, etc. are welcome. If I have to choose a preference, it is for speed!
Thanks in advance!
Note:
For this case, I understand that the third closest non-NA value may involve choosing a preference for whether the third value falls to the right or left of t (as shown by the two possible answers above). I have no preference for which side it falls on, but if there is a way to leave it to random chance, that would be ideal (though it is not a requirement).
A relatively short solution is:
orderedA <- A[order(abs(seq_len(length(A)) - t))][seq_len(R*2)]
# take at most N values, and never more than the number of non-NA values available
n_obj <- min(N, length(na.omit(orderedA)))
res <- na.omit(orderedA)[seq_len(n_obj)]
res
#[1] 3 4 3
Breaking this down a little more, the steps are (a worked trace follows the list):
Order A, by the absolute distance from the position of interest, t.
Code is: A[order(abs(seq_len(length(A)) - t))]
Subset to the first R*2 elements (so this will get the elements on either side of t within R).
Code is: [seq_len(R*2)]
Work out how many elements to return: at most N, and never more than the number of non-NA elements available
Code is: min(N, length(na.omit(orderedA)))
Drop NA
Code is: na.omit()
Take the first n_obj elements, as determined in step 3
Code is: [seq_len(n_obj)]
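For the example vector A (with t = 5, R = 4, N = 3), the intermediate values work out as follows:
A[order(abs(seq_len(length(A)) - t))] # NA 3 4 NA NA 3 1 1 7 NA 2 NA 9 9 10
# the first R*2 = 8 of those form orderedA: NA 3 4 NA NA 3 1 1
na.omit(orderedA) # 3 4 3 1 1, of which the first n_obj = 3 are returned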
Something like this?
thingfinder <- function(A, R, t, n) {
  left <- A[t:(t - R - 1)] # t and the cells to its left, nearest first
  right <- A[t:(t + R + 1)] # t and the cells to its right, nearest first
  leftrightmat <- cbind(left, right)
  raw_ans <- as.vector(t(leftrightmat)) # interleave left and right, nearest first
  ans <- raw_ans[!is.na(raw_ans)]
  return(ans[1:n])
}
thingfinder(A=c(1,3,NA,3,NA,4,NA,1,7,NA,2,NA,9,9,10), R=3, t=5, n=3)
## [1] 3 4 3
This would give priority to the left side, of course.
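Since the question mentions leaving the left/right preference to chance, a small variant (a sketch; thingfinder2 is just an illustrative name) flips the column order with a coin toss before interleaving:
thingfinder2 <- function(A, R, t, n) {
  left <- A[t:(t - R - 1)]
  right <- A[t:(t + R + 1)]
  sides <- cbind(left, right)
  if (runif(1) < 0.5) sides <- sides[, 2:1] # randomly give the right side priority
  raw_ans <- as.vector(t(sides))
  ans <- raw_ans[!is.na(raw_ans)]
  return(ans[1:n])
}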
In case it is helpful to others, @Mike H. also provided me with a solution to return the index positions associated with the desired vector elements res:
A <- setNames(A, seq_len(length(A)))
orderedA <- A[order(abs(seq_len(length(A)) - t))][seq_len(R*2)]
n_obj <- min(N, length(na.omit(orderedA)))
res <- na.omit(orderedA)[seq_len(n_obj)]
positions <- as.numeric(names(res))

Using head() to print n ordered rows in dataframe from random starting position

I know I can use
head(sample(x),m)
to print a random selection of m rows from my dataset, but in this case each new draw is randomized. What if, instead of randomizing every draw, I wanted to randomize only the starting position for the first draw, while preserving the order of subsequent rows?
To illustrate, imagine we have a dataset of n rows and I want to print m of them in order, starting from a random position. If the randomly drawn starting position is 5, my desired function would print rows 5, 6, 7, ... until m rows have been shown (m < n).
This is more of a theoretical question, not a diagnostic one, so I don't believe an MWE is needed; please let me know if you think it is and I will be happy to provide one.
We create a numeric index by taking the sampled starting row and adding the sequence of rows that should follow it. If the sampled index falls near the last row, a condition drops the indices that run past the data:
i1 <- sample(nrow(df1), 1) + 0:3 # random start row plus the next 3 rows (m = 4 here)
df1[i1[i1 <= nrow(df1)], ] # keep only indices that exist
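To generalize, the same idea can be wrapped in a small helper that takes the run length m (a sketch; the function name is illustrative):
sample_run <- function(df, m) {
  i1 <- sample(nrow(df), 1) + 0:(m - 1) # random start row plus the next m - 1 rows
  df[i1[i1 <= nrow(df)], ] # truncate at the end of the data
}
sample_run(mtcars, 5) # five consecutive rows of mtcars from a random start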

Vectorizing raster brick objects with r-raster so that I can count them

I have an image of columns of red- and blue-bordered circles, where the columns alternate red and blue (in this example the first column is red).
I have been able to create a raster brick and plot the image in RGB layers, but I want to count the circles in these columns into a vector like the one below (from the example above), with values 1 (red) and 2 (blue):
1,1,1,1,2,2,2,1,1,2,1,1,1 ...
Is it possible to clear out the areas of the brick I don't need for counting and collapse the brick down into values I can then convert into the numbers or labels I want? Or is there a much simpler way that I have not found? Also, long term, I want to be able to point the program at several images without opening them myself.
Edit: To clear some things up, I want to count the circles top to bottom, left to right. For example, once the first column is counted, I want to start over at the top of the next column to the right. Also, I'm not sure if I'm headed in the right direction, but I was able to remove all background cells from the image, leaving me with a plot of only values where the circles are.
Edit 2:
The current code I have for the image above:
color.image <- brick("image")
color.image <- dropLayer(color.image, 4) # gets rid of a channel
plot(color.image)
e <- extent(-10, 240, 45, 84.8) # xmin, xmax, ymin, ymax
ccolor.image <- crop(color.image, e)
plot(ccolor.image)
# thresholding to simplify what I was dealing with
ccolor.image[ccolor.image > 97] <- NA
ccolor.image[ccolor.image < 15] <- NA
ccolor.image[ccolor.image > 20] <- 80
plot(ccolor.image)
mcolor <- as.matrix(ccolor.image)
colSums(ccolor.image)
rowSums(ccolor.image)
Edit 3:
I figured it out! Or at least found a roundabout way to do it; I will post the code once I clean it up some. I would still like input on creating a vector based on the matrix of values I have for my simplified raster brick. Code coming soon!
The fastest way to count values in a raster is freq(x, merge = TRUE). This gives you the values in one column and their frequencies in as many columns as the brick has layers. We then need to pick the value of interest and sum across all the count columns. Hope that helps!
freq_vals <- freq(rasterbrick, merge = TRUE)
sum(freq_vals[which(freq_vals$value == 1), 2:ncol(freq_vals)])
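As a rough illustration of what that returns, here is a toy two-layer brick (made-up values; assumes the raster package is installed):
library(raster)
r1 <- raster(matrix(c(1, 1, 2, 2), nrow = 2))
r2 <- raster(matrix(c(1, 2, 2, 2), nrow = 2))
b <- brick(r1, r2)
freq_vals <- freq(b, merge = TRUE) # one 'value' column plus one count column per layer
sum(freq_vals[which(freq_vals$value == 1), 2:ncol(freq_vals)]) # 3 ones across both layers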

In R: sort the maximum dissimilarity between rows in a matrix

I have a matrix with 100 rows and 10 columns. I want to compare the dissimilarity between rows, sort the rows by it, and then select the 10 most dissimilar rows. Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial approach was to calculate the similarity between two rows (e.g. the Tanimoto coefficient or others: http://en.wikipedia.org/wiki/Jaccard_index ), take dissimilarity = 1 - similarity, and then compare the dissimilarity values. Finally I would sort all dissimilarity values and select the 10 largest. But the result is a 100 * 100 matrix, so a more efficient method may be needed for such a calculation when there are a large number of rows. This is just my thought, and it may not be right, so I need help.
[update]
After looking through the literature, I found a definition of the maximum dissimilarity method:
Maximum dissimilarity method: it begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that. The process repeats itself until there is a sufficient number of cluster centers.
Here, in my question, the sufficient number would be 10.
Thanks.
First of all, the Jaccard index is not right for you. From the Wikipedia page:
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the Index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the columns and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares. That will work fine for us though, since the order
# will still be the same
dists <- apply(mat, 1, function(row) sum((row - center) ^ 2))
# this gives us the row indices in order of least to greatest dissimilarity
dist.order <- order(dists)
# Now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids,]
If I were actually writing this, I probably would have compressed the last three lines to most.dissimilar <- mat[order(dists)[91:100], ], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.
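If you do want the farthest-point selection described in the question's update, a minimal sketch (assuming Euclidean distance between rows, and reading "maximally distant from the current points" as maximizing the minimum distance to the chosen set) would be:
d <- as.matrix(dist(mat)) # 100 x 100 pairwise Euclidean distances
chosen <- sample(nrow(mat), 1) # random first cluster center
while (length(chosen) < 10) {
  min_d <- apply(d[, chosen, drop = FALSE], 1, min) # distance to nearest chosen row
  min_d[chosen] <- -Inf # never re-pick an already chosen row
  chosen <- c(chosen, which.max(min_d))
}
most.dissimilar <- mat[chosen, ]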
