Finding the closest value in another column - r

Suppose we have a data frame of two columns
X Y
10 14
12 16
14 17
15 19
21 19
The first element of Y is 14, and the nearest (or equal) value in X is 14, which is the 3rd element of X. Similarly, the next element of Y, 16, is closest to 15, which is the 4th element of X.
So, the output I would like should be
3
4
4
5
5
As my data is large, can you give me some advice on a systematic/proper way to code this?

You can try this piece of code:
apply(abs(outer(d$X,d$Y,FUN = '-')),2,which.min)
# [1] 3 4 4 5 5
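Here, abs(outer(d$X,d$Y,FUN = '-')) returns a matrix of absolute differences with one row per element of d$X and one column per element of d$Y, and apply(...,2,which.min) returns, for each column (that is, for each element of d$Y), the row position of the minimum, which is the index of the closest element of d$X.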
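Because outer() builds a full length(X) by length(Y) matrix, the one-liner above can get memory-hungry on large data. One possible alternative, offered only as a sketch, is to sort X once and use findInterval() so that only the two neighbouring candidates are compared for each Y. The helper name nearest_index is made up for illustration, and ties or duplicated X values may be resolved differently from which.min.

```r
d <- data.frame(X = c(10, 12, 14, 15, 21), Y = c(14, 16, 17, 19, 19))

# Hypothetical helper: for each y, index (in the original x) of the nearest x.
# Assumes no NA values; on a tie the smaller x is chosen.
nearest_index <- function(x, y) {
  ord <- order(x)
  xs <- x[ord]
  # position of the largest sorted x that is <= y (0 if there is none)
  lo <- findInterval(y, xs)
  lo_c <- pmax(lo, 1L)                 # clamp to the first element
  hi_c <- pmin(lo + 1L, length(xs))    # clamp to the last element
  pick_hi <- abs(xs[hi_c] - y) < abs(xs[lo_c] - y)
  ord[ifelse(pick_hi, hi_c, lo_c)]
}

nearest_index(d$X, d$Y)
# [1] 3 4 4 5 5
```

The sort costs O(n log n) and each lookup is a binary search, so no n-by-m matrix is ever allocated.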

Related

R: Sample unique X, Y pairs without duplication of Xs and Ys

I am having a hard time figuring how to program this in R: Given a number of X and Y pairs, such as
X Y
9 1
1 2
12 3
8 4
9 4
4 5
16 6
18 7
5 8
11 9
4 10
6 11
6 12
14 13
18 13
20 13
13 14
20 15
20 16
I need to sample randomly n pairs that fulfil the condition that Xs and Ys are unique. For instance, if n=3 and using the data above, the following combinations (9,1) (4,5) (4,10) or (1,2) (14,13) (20,13) will be invalid because X=4 or Y=13 are duplicated in each of the solutions. However, (9,1) (1,2) and (8,4) will be a valid solution because the Xs and Ys are unique. Any help will be moooooost welcome.
If you start by shuffling (randomly reordering) the rows of your original data, then keep only those rows where neither X nor Y is duplicated, and then select the first, last or any n (=3) of the remaining rows (you could use sample again), you should be fine, I think.
set.seed(1) # for reproducibility
head(subset(df[sample(nrow(df)),], !duplicated(X) & !duplicated(Y)), 3)
# X Y
#6 4 5
#7 16 6
#10 11 9
In response to the comment by @Richo64, saying that this approach will not randomly select the pairs:
It does sample the pairs randomly because the first (inner most) thing I do is
df[sample(nrow(df)),]
which shuffles the rows of the data randomly. Once that is done, it is down to chance which row with, say, X = 4 comes first and is therefore kept; the other row with X = 4 is removed because it is a duplicated entry in X.
The same applies to values in Y.
It follows that after the sampling and subsetting you are free to choose any 3 rows of the remaining data. Even if you always selected the first 3 rows, you would still get a random selection that differs between runs (except when it coincidentally samples the same rows again).
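The shuffle-then-filter idea can be wrapped in a small function, sketched here with the data from the question. The name sample_unique_pairs is hypothetical, and the check on the number of surviving rows is an added assumption: the filter can discard more rows than strictly necessary, so the function may stop even when a valid set of n pairs exists.

```r
df <- data.frame(
  X = c(9, 1, 12, 8, 9, 4, 16, 18, 5, 11, 4, 6, 6, 14, 18, 20, 13, 20, 20),
  Y = c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 13, 14, 15, 16)
)

# Hypothetical helper: draw n rows whose X values and Y values are all unique
sample_unique_pairs <- function(df, n) {
  shuffled <- df[sample(nrow(df)), ]                  # random row order
  keep <- !duplicated(shuffled$X) & !duplicated(shuffled$Y)
  candidates <- shuffled[keep, ]                      # unique in X and in Y
  if (nrow(candidates) < n) {
    stop("fewer than n rows survived the filter; try again or lower n")
  }
  candidates[seq_len(n), ]
}

set.seed(1)  # for reproducibility
res <- sample_unique_pairs(df, 3)
anyDuplicated(res$X) == 0 && anyDuplicated(res$Y) == 0
# [1] TRUE
```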

How do I repeat only a part of a vector?

I have a vector of: 0,24,12,12,12,96,12,12,12,12,12,12.
I want to repeat only a part of it from 96 to the last element (12). The first part (0, 24, 12, 12, 12) I want to keep constant.
Could you please help ?
The answer depends on whether number 96 is always located at the 6th position inside your vector. If so, please refer to the first comment underneath your question. If the position is variable, however, you could implement a simple query that identifies the position of 96 inside your vector, and then repeat the part of the vector starting from there as often as you wish (2 times in the below-mentioned code).
x <- c(0,24,12,12,12,96,12,12,12,12,12,12)
# Identify index of 96
id <- which(x == 96)
# Repeat part of vector starting from `id` 2 times
# seq_len(id - 1) is empty when id is 1, whereas 1:(id-1) would give c(1, 0)
c(x[seq_len(id - 1)], rep(x[id:length(x)], 2))
# Which results in
# [1] 0 24 12 12 12 96 12 12 12 12 12 12 96 12 12 12 12 12 12

approx() without duplicates?

I am using approx() to interpolate values.
x <- 1:20
y <- c(3,8,2,6,8,2,4,7,9,9,1,3,1,9,6,2,8,7,6,2)
df <- cbind.data.frame(x,y)
> df
x y
1 1 3
2 2 8
3 3 2
4 4 6
5 5 8
6 6 2
7 7 4
8 8 7
9 9 9
10 10 9
11 11 1
12 12 3
13 13 1
14 14 9
15 15 6
16 16 2
17 17 8
18 18 7
19 19 6
20 20 2
interpolated <- approx(x=df$x, y=df$y, method="linear", n=5)
gets me this:
interpolated
$x
[1] 1.00 5.75 10.50 15.25 20.00
$y
[1] 3.0 3.5 5.0 5.0 2.0
Now, the first and last values are duplicates of my real data. Is there any way to prevent this, or is there something I don't understand properly about approx()?
You may want to specify xout to avoid this. For instance, if you want to always exclude the first and the last points, here's how you can do that:
specify_xout <- function(x, n) {
  seq(from=min(x), to=max(x), length.out=n+2)[-c(1, n+2)]
}
plot(df$x, df$y)
points(approx(df$x, df$y, xout=specify_xout(df$x, 5)), pch = "*", col = "red")
Note that this does not prevent an interior xout value from coinciding with an existing data point (exactly what happens in the picture below).
approx will fit through all your original datapoints if you give it a chance (change n=5 to xout=df$x to see this). Interpolation is the process of generating values for y given unobserved values of x, but should agree if the values of x have been previously observed.
The method="linear" setup is going to 'draw' linear segments joining up your original coordinates exactly (and so will give the y values you input to it for integer x). You only observe 'new' y values because your n=5 means that for points other than the beginning and end the x is not an integer (and therefore not one of your input values), and so gets interpolated.
If you want observed values not to be exactly reproduced, then maybe add some noise via rnorm?
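Both points can be checked quickly with the data from the question. This is only a sketch: the noise standard deviation of 0.1 is an arbitrary value chosen for illustration, not a recommendation.

```r
x <- 1:20
y <- c(3, 8, 2, 6, 8, 2, 4, 7, 9, 9, 1, 3, 1, 9, 6, 2, 8, 7, 6, 2)

# Interpolating at the observed x values gives back the observed y values
at_observed <- approx(x, y, xout = x)
isTRUE(all.equal(at_observed$y, y))
# [1] TRUE

# If exact reproduction is unwanted, noise can be added afterwards
set.seed(1)  # for reproducibility
noisy_y <- at_observed$y + rnorm(length(x), sd = 0.1)  # sd is an assumption
```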

Creating a numerical variable order

I have a set of data with 3 columns: index column (with no name), colour, colour of seed, and germination time.
How do I create a numerical variable called 'order' with values 1 to 22 (the number of data sets)?
I don't know if I get you right, but simplest way would be:
> order <- c(1:22)
> order
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Now, if you run:
class(order)
you will get:
[1] "integer"
You can access each element of order by its index, for example in a loop:
for (i in seq_along(order)) {
  print(order[i])
}
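If the goal is to store the variable alongside the existing columns, a minimal sketch could look like this. The data frame seeds and its contents are made-up placeholders, since the real data is not shown in the question.

```r
# Placeholder data with 22 rows; the column names and values are assumptions
seeds <- data.frame(
  colour = rep(c("light", "dark"), times = 11),
  germination_time = rep(c(5, 7), each = 11)
)

# seq_len(nrow(...)) adapts automatically if the number of rows changes
seeds$order <- seq_len(nrow(seeds))
head(seeds$order)
# [1] 1 2 3 4 5 6
```

Naming the column order inside the data frame also avoids creating a global object that shares its name with the base function order().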

How to reorder a column in a data frame to be the last column

I have a data frame where columns are constantly being added to it. I also have a total column that I would like to stay at the end. I think I must have skipped over some really basic command somewhere but cannot seem to find the answer anywhere. Anyway, here is some sample data:
x=1:10
y=21:30
z=data.frame(x,y)
z$total=z$x+z$y
z$w=11:20
z$total=z$x+z$y+z$w
When I type z I get this:
x y total w
1 1 21 33 11
2 2 22 36 12
3 3 23 39 13
4 4 24 42 14
5 5 25 45 15
6 6 26 48 16
7 7 27 51 17
8 8 28 54 18
9 9 29 57 19
10 10 30 60 20
Note how the total column comes before the w, and obviously any subsequent columns. Is there a way I can force it to be the last column? I am guessing that I would have to use ncol(z) somehow. Or maybe not.
You can reorder your columns as follows:
z <- z[,c('x','y','w','total')]
To do this programmatically, after you're done adding your columns, you can retrieve their names like so:
nms <- colnames(z)
Then you can grab the ones that aren't 'total' like so:
nms[nms!='total']
Combined with the above:
z <- z[, c(nms[nms!='total'],'total')]
You have a logic issue here. Whenever you add to a data.frame, it grows to the right.
Easiest fix: keep total a vector until you are done, and only then append it. It will then be the rightmost column.
(For critical applications, you would of course determine your width k beforehand, allocate k+1 columns and just index the last one for totals.)
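The "append the total last" suggestion might look like the following sketch, using the sample data from the question. The helper move_last is a hypothetical name, included as a reusable variant of the column reordering for cases where total is already in the data frame.

```r
z <- data.frame(x = 1:10, y = 21:30)
total <- z$x + z$y        # keep the running total as a plain vector
z$w <- 11:20              # add further columns as needed
total <- total + z$w
z$total <- total          # append only once all columns are in place
names(z)
# [1] "x"     "y"     "w"     "total"

# Hypothetical helper: move one named column to the end of a data frame
move_last <- function(df, col) {
  df[, c(setdiff(names(df), col), col), drop = FALSE]
}
```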
