Matching elements of two vectors based on proximity - r

I have two vectors:
a<-c(268, 1295, 1788, 2019, 2422)
b<-c(266, 952, 1295, 1791, 2018)
I want to match the elements of b to the elements of a, based on the smallest difference. So a[1] would be matched to b[1].
However, each element can only be matched to a single other element, and it is possible that some elements cannot be matched at all. If two elements of b have their smallest difference to the same element of a, then the element with the smaller difference is matched.
For example, 952 and 1295 are both closest to a[2]; since 1295 is closer (in this case even equal) to a[2], it is the one that gets matched.
The final solution for this particular example should look like this.
268 NA 1295 1788 2019 2422
266 952 1295 1791 2018 NA
Some of the items are not matched, and although it would be possible to match 952 and 2422, the code I need would not consider them a match because matches were found in between them. The vectors are also strictly increasing.
With my coding capabilities I would use tons of if statements to solve this. But I was wondering whether this is a known problem whose terminology I am simply not aware of, or whether someone has an idea for an elegant solution.

A base R approach, although probably not the most elegant one:
aux1 <- apply(abs(outer(a, b, `-`)), 2, function(r) c(min(r), which.min(r)))
colnames(aux1) <- 1:length(b)
aux2 <- tapply(aux1[1, ], factor(aux1[2, ], levels = 1:length(a)),
function(x) as.numeric(names(which.min(x))))
rbind(cbind(a, b = b[aux2]), cbind(a = NA, b = b[-aux2[!is.na(aux2)]]))
# a b
# [1,] 268 266
# [2,] 1295 1295
# [3,] 1788 1791
# [4,] 2019 2018
# [5,] 2422 NA
# [6,] NA 952
Here aux1 contains, for each element of b (columns), the index of the closest element of a (2nd row) and the corresponding distance (1st row).
aux1
# [,1] [,2] [,3] [,4] [,5]
# [1,] 2 343 0 3 1
# [2,] 1 2 2 3 4
Then aux2 may already be enough for your purposes.
aux2
# 1 2 3 4 5
# 1 3 4 5 NA
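aux1 showed a conflict (both b[2] and b[3] are closest to a[2]), but aux2 now gives, for each element of a (names), the index of the element of b that should be assigned to it (values); NA means that no element of b was matched. Then in the last line we bind the remaining, unmatched elements of b.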
In a more complex case we have
a <- c(932, 1196, 1503, 2819, 3317, 3845, 4118, 4544)
b <- c(1190, 1498, 2037, 2826, 3323, 4128, 4618, 1190, 1498, 2037, 2826, 3323, 4128, 4618)
# ....
rbind(cbind(a, b = b[aux2]), cbind(a = NA, b = b[-aux2[!is.na(aux2)]]))
# a b
# [1,] 932 NA
# [2,] 1196 1190
# [3,] 1503 1498
# [4,] 2819 2826
# [5,] 3317 3323
# [6,] 3845 NA
# [7,] 4118 4128
# [8,] 4544 4618
# [9,] NA 2037
# [10,] NA 1190
# [11,] NA 1498
# [12,] NA 2037
# [13,] NA 2826
# [14,] NA 3323
# [15,] NA 4128
# [16,] NA 4618
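The steps above can also be wrapped into a small helper. This is only a sketch of the same idea, not a different algorithm: the name match_nearest is made up for illustration, and the unmatched elements of b are collected with setdiff() instead of negative indexing.

```r
# Hypothetical helper (the name is an assumption, not part of the answer above)
match_nearest <- function(a, b) {
  d <- abs(outer(a, b, `-`))            # length(a) x length(b) distance matrix
  nearest_a <- apply(d, 2, which.min)   # for each b: index of the closest a
  min_dist  <- apply(d, 2, min)         # for each b: distance to that a
  # for each a: among the b's that chose it, keep the one with the smallest distance
  idx <- vapply(seq_along(a), function(i) {
    cand <- which(nearest_a == i)
    if (length(cand) == 0L) NA_integer_ else cand[which.min(min_dist[cand])]
  }, integer(1))
  left <- setdiff(seq_along(b), idx)    # indices of b that were never matched
  rbind(cbind(a = a, b = b[idx]),
        cbind(a = rep(NA, length(left)), b = b[left]))
}

a <- c(268, 1295, 1788, 2019, 2422)
b <- c(266, 952, 1295, 1791, 2018)
res <- match_nearest(a, b)

# Optional: order rows by whichever value is present, to interleave unmatched items
res_sorted <- res[order(ifelse(is.na(res[, "a"]), res[, "b"], res[, "a"])), ]
```

With the example vectors, res_sorted should have the layout asked for in the question, with 952 sitting between the first and second matched pairs.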

Related

How do I automatically populate a matrix with intervals given the size and number of the intervals?

bucket_size <- 30
bucket_amount <- 24
matrix(???, bucket_amount, 2)
I'm trying to populate a (bucket_amount x 2) matrix using the interval size given by bucket_size. Here is what it would look like with the current given values of bucket_size and bucket_amount.
[1 30]
[31 60]
[61 90]
[91 120]
.
.
.
[691 720]
I can obviously hard code this specific example out, but I'm wondering how I can do this for different values of bucket_size and bucket_amount and have the matrix populate automatically.
We can use seq, specifying both from and by as 'bucket_size' and length.out as 'bucket_amount', to create the sequence of upper bounds ('v1'). For the lower bounds ('v2'), drop the last element of 'v1', add 1 to the rest, and put 1 at the beginning. Then cbind the two vectors to create the matrix:
v1 <- seq(bucket_size, length.out = bucket_amount , by = bucket_size)
v2 <- c(1, v1[-length(v1)] + 1)
m1 <- cbind(v2, v1)
-output
> head(m1)
v2 v1
[1,] 1 30
[2,] 31 60
[3,] 61 90
[4,] 91 120
[5,] 121 150
[6,] 151 180
> tail(m1)
v2 v1
[19,] 541 570
[20,] 571 600
[21,] 601 630
[22,] 631 660
[23,] 661 690
[24,] 691 720
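To reuse this for other sizes, the same idea can be put into a function. A minimal sketch, assuming a made-up name make_buckets and column names lower/upper; here the lower bound is computed directly from the upper bound rather than by shifting 'v1'.

```r
# Hypothetical wrapper; make_buckets, lower and upper are illustrative names
make_buckets <- function(bucket_size, bucket_amount) {
  upper <- seq(bucket_size, by = bucket_size, length.out = bucket_amount)
  cbind(lower = upper - bucket_size + 1, upper = upper)
}

m <- make_buckets(30, 24)   # 24 x 2 matrix: 1-30, 31-60, ..., 691-720
m2 <- make_buckets(10, 3)   # 3 x 2 matrix: 1-10, 11-20, 21-30
```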

Why is outer recycling a vector that should go unused and not throwing a warning?

I recently used the following line of code, expecting to get an error. To my surprise, I was given an output:
> outer(1:5,5:10,c=1:3,function(a,b,c) 10*a + 100*b + 1000*c)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1510 3610 2710 1810 3910 3010
[2,] 2520 1620 3720 2820 1920 4020
[3,] 3530 2630 1730 3830 2930 2030
[4,] 1540 3640 2740 1840 3940 3040
[5,] 2550 1650 3750 2850 1950 4050
It appears that the code is being evaluated to outer(1:5,5:10,function(a,b) 10*a + 100*b)+1000*(1:3). Why is this? And as a follow-up, is there any clear reason why this doesn't give a warning? To my mind, a user who entered code like this was probably expecting an output covering all a, b, and c values.
This is expected behaviour based on R's recycling rules. It has nothing to do with outer as such, though it might be a surprise if you think outer is somehow applying a function across margins.
Instead, outer takes two vectors X and Y as its first two arguments. It takes X and replicates it length(Y) times. Similarly, it takes Y and replicates each of its elements length(X) times. Then it just runs your function FUN once on these two long vectors, passing the long X as the first argument and the long Y as the second argument. Any other arguments to FUN have to be passed directly as arguments to outer via ... (as you have done with c = 1:3).
The result is a single long vector which is turned into a matrix by writing its dim attribute as the original values of length(X) by length(Y).
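The description above can be reproduced by hand. This is a sketch of what outer does conceptually for this call, not its actual source code; the names long_X, long_Y and manual are made up here.

```r
X <- 1:5
Y <- 5:10
f <- function(a, b, c) 10*a + 100*b + 1000*c

long_X <- rep(X, times = length(Y))   # 1 2 3 4 5 1 2 3 4 5 ...  (length 30)
long_Y <- rep(Y, each  = length(X))   # 5 5 5 5 5 6 6 6 6 6 ...  (length 30)

manual <- f(long_X, long_Y, c = 1:3)  # c is silently recycled 10 times
dim(manual) <- c(length(X), length(Y))

via_outer <- outer(X, Y, f, c = 1:3)
isTRUE(all.equal(manual, via_outer))  # expected to be TRUE
```

Because FUN is called only once on the long vectors, c never gets "crossed" with X and Y; it is simply recycled along the length-30 vectors.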
Now, in the specific example you gave, X has 5 elements (1:5) and Y has 6 (5:10). Therefore your anonymous function is called on two length-30 vectors and a single length-3 vector. R's recycling rules dictate that if the recycled vector fits neatly into the longer vector without partial recycling, no warning is emitted.
To see this, take your anonymous function and try it outside outer with two length-30 vectors and one length-3 vector:
f <- function(a, b, c) 10*a + 100*b + 1000*c
f(1:30, 1:30, 1:3)
#> [1] 1110 2220 3330 1440 2550 3660 1770 2880 3990 2100 3210 4320 2430
#> [14] 3540 4650 2760 3870 4980 3090 4200 5310 3420 4530 5640 3750 4860
#> [27] 5970 4080 5190 6300
3 recycles nicely into 30, so there is no warning.
Conversely, if the product of the lengths of the two vectors you pass to outer is not a multiple of 3, you will get a warning:
outer(1:5,6:10,c=1:3,function(a,b,c) 10*a + 100*b + 1000*c)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1610 3710 2810 1910 4010
#> [2,] 2620 1720 3820 2920 2020
#> [3,] 3630 2730 1830 3930 3030
#> [4,] 1640 3740 2840 1940 4040
#> [5,] 2650 1750 3850 2950 2050
#> Warning message:
#> In 10 * a + 100 * b + 1000 * c :
#> longer object length is not a multiple of shorter object length

Subset dataframe based on the condition in a column of another dataframe

I have two data frames where each line represents data from one individual. Lines in the first data frame (which enter a specific analysis of geometric morphometry) correspond to the lines in the second data frame (additional descriptions of the animals, such as sampling site or sex). I would like to subset the first data frame based on a condition from the second data frame (e.g. select all lines of the first data frame that are females, where the sex of the animal is defined in the second data frame). It is possible to do this by adding a new column to the first data frame, subsetting on this new column and then removing the column. Is there a more elegant way to do it?
df1
[,1] [,2] [,3] [,4] [,5] [,6]
IMGP6995.JPG -0.07612235 0.08189661 0.020690012 0.07532420 0.05373111 0.07139840
IMGP6997.JPG -0.06759482 0.09449720 0.022907275 0.08807724 0.05953926 0.08256468
IMGP6998.JPG -0.06902234 0.08418980 0.013522385 0.08186618 0.05375763 0.07769076
IMGP6999.JPG -0.07201136 0.08475765 0.009462017 0.08080315 0.06148776 0.07059229
IMGP7001.JPG -0.08112908 0.08485488 0.037193459 0.07971364 0.05834018 0.07917079
IMGP7012.JPG -0.07059829 0.07905529 0.021803102 0.07480276 0.04849282 0.07270644
IMGP7013.JPG -0.07176010 0.08561111 0.009568661 0.08297752 0.06374573 0.08272648
IMGP7014.JPG -0.06751993 0.08895038 0.016800152 0.08799522 0.04776876 0.08100145
IMGP7015.JPG -0.07945826 0.07844136 0.008176800 0.07431915 0.06471417 0.07348312
IMGP7017.JPG -0.06587874 0.09280032 0.010204330 0.09085868 0.05290771 0.08739235
df2
number site m m..evis. m..gonads. sex SL TL AP RP
37 10 KB 1.263 1.003 0.136 F 39.38949 47.72564 NA NA
38 11 KB 4.215 3.510 0.093 F 53.48064 65.29663 NA NA
39 12 KB 3.508 2.997 0.079 F 51.59589 64.76600 NA NA
40 13 KB 3.250 2.752 0.085 F 49.55853 61.74319 NA NA
41 14 KB 3.596 3.149 0.101 F 51.42303 64.79511 NA NA
42 10 KKB 3.257 2.451 0.270 M 55.07909 67.52057 1468.017 598.9462
43 11 KKB 3.493 2.275 0.666 M 54.24882 65.61726 1722.414 757.1050
44 12 KKB 3.066 2.210 0.300 M 53.56323 64.09848 1410.891 638.4123
45 13 KKB 3.294 2.193 0.652 M 51.66717 63.49136 1428.063 651.1915
46 14 KKB 2.803 1.871 0.582 M 50.91185 60.90951 1236.438 660.8433
df1 after subset
[,1] [,2] [,3] [,4] [,5] [,6]
IMGP6995.JPG -0.07612235 0.08189661 0.020690012 0.07532420 0.05373111 0.07139840
IMGP6997.JPG -0.06759482 0.09449720 0.022907275 0.08807724 0.05953926 0.08256468
IMGP6998.JPG -0.06902234 0.08418980 0.013522385 0.08186618 0.05375763 0.07769076
IMGP6999.JPG -0.07201136 0.08475765 0.009462017 0.08080315 0.06148776 0.07059229
IMGP7001.JPG -0.08112908 0.08485488 0.037193459 0.07971364 0.05834018 0.07917079
df1[df2$sex %in% "F", ]
# [,1] [,2] [,3] [,4] [,5] [,6]
# IMGP6995.JPG -0.07612235 0.08189661 0.020690012 0.07532420 0.05373111 0.07139840
# IMGP6997.JPG -0.06759482 0.09449720 0.022907275 0.08807724 0.05953926 0.08256468
# IMGP6998.JPG -0.06902234 0.08418980 0.013522385 0.08186618 0.05375763 0.07769076
# IMGP6999.JPG -0.07201136 0.08475765 0.009462017 0.08080315 0.06148776 0.07059229
# IMGP7001.JPG -0.08112908 0.08485488 0.037193459 0.07971364 0.05834018 0.07917079
Explanation
Your df1 looks like a matrix, not a data.frame. But the solution I provided will also work if df1 is a data frame.
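df2$sex %in% "F" checks whether each value of sex matches "F" and returns a logical vector of TRUE and FALSE. You can then use that vector to subset the rows of df1. Note that this relies on the rows of df1 and df2 being in the same order.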
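A small illustration of why %in% is a convenient choice for the row index. The vector sex and the matrix m below are made-up toy data, not the data from the question.

```r
sex <- c("F", "M", NA, "F")     # toy data with one missing value

in_f <- sex %in% "F"            # TRUE FALSE FALSE TRUE  (never NA)
eq_f <- sex == "F"              # TRUE FALSE NA    TRUE  (NA is propagated)

m <- matrix(1:8, nrow = 4)      # 4 rows, matching length(sex)
m[in_f, ]                       # keeps rows 1 and 4
```

Indexing with eq_f instead would return an extra row of NAs for the missing value, which is usually not what you want when subsetting.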
Data
df1 <- matrix(c(-0.07612235, 0.08189661, 0.020690012, 0.07532420, 0.05373111, 0.07139840,
-0.06759482, 0.09449720, 0.022907275, 0.08807724, 0.05953926, 0.08256468,
-0.06902234, 0.08418980, 0.013522385, 0.08186618, 0.05375763, 0.07769076,
-0.07201136, 0.08475765, 0.009462017, 0.08080315, 0.06148776, 0.07059229,
-0.08112908, 0.08485488, 0.037193459, 0.07971364, 0.05834018, 0.07917079,
-0.07059829, 0.07905529, 0.021803102, 0.07480276, 0.04849282, 0.07270644,
-0.07176010, 0.08561111, 0.009568661, 0.08297752, 0.06374573, 0.08272648,
-0.06751993, 0.08895038, 0.016800152, 0.08799522, 0.04776876, 0.08100145,
-0.07945826, 0.07844136, 0.008176800, 0.07431915, 0.06471417, 0.07348312,
-0.06587874, 0.09280032, 0.010204330, 0.09085868, 0.05290771, 0.08739235),
ncol = 6, byrow = TRUE)
rownames(df1) <- c("IMGP6995.JPG", "IMGP6997.JPG", "IMGP6998.JPG", "IMGP6999.JPG",
"IMGP7001.JPG", "IMGP7012.JPG", "IMGP7013.JPG", "IMGP7014.JPG",
"IMGP7015.JPG", "IMGP7017.JPG")
df2 <- read.table(text = " number site m m..evis. m..gonads. sex SL TL AP RP
37 10 KB 1.263 1.003 0.136 F 39.38949 47.72564 NA NA
38 11 KB 4.215 3.510 0.093 F 53.48064 65.29663 NA NA
39 12 KB 3.508 2.997 0.079 F 51.59589 64.76600 NA NA
40 13 KB 3.250 2.752 0.085 F 49.55853 61.74319 NA NA
41 14 KB 3.596 3.149 0.101 F 51.42303 64.79511 NA NA
42 10 KKB 3.257 2.451 0.270 M 55.07909 67.52057 1468.017 598.9462
43 11 KKB 3.493 2.275 0.666 M 54.24882 65.61726 1722.414 757.1050
44 12 KKB 3.066 2.210 0.300 M 53.56323 64.09848 1410.891 638.4123
45 13 KKB 3.294 2.193 0.652 M 51.66717 63.49136 1428.063 651.1915
46 14 KKB 2.803 1.871 0.582 M 50.91185 60.90951 1236.438 660.8433",
header = TRUE, stringsAsFactors = FALSE)

Loop over matrix using n consecutive rows in R

I have a matrix that consists of two columns and a number (n) of rows, while each row represents a point with the coordinates x and y (the two columns).
This is what it looks like:
V1 V2
146 17
151 19
153 24
156 30
158 36
163 39
168 42
173 44
...
Now, I would like to use a subset of three consecutive points, starting from row 1, to do some fitting, save the values from this fit in another list, and then go on to the next three points, and the next three, and so on until the list is finished. Something like this:
Data_Fit_Kasa_1 <- CircleFitByKasa(Data[1:3,])
Data_Fit_Kasa_2 <- CircleFitByKasa(Data[4:6,])
....
Data_Fit_Kasa_n <- CircleFitByKasa(Data[i:(i+2),])
I have tried to construct a loop, but I can't make it work. R either tells me that there's an "unexpected '}' in "}"" or that the "subscript is out of bounds". This is what I've tried:
minimal runnable code
install.packages("conicfit")
library(conicfit)
CFKasa <- NULL
Data.Fit <- NULL
for (i in 1:length(Data)) {
row <- Data[i:(i+2),]
CFKasa <- CircleFitByKasa(row)
Data.Fit[i] <- CFKasa[3]
}
RStudio Version 0.99.902 – © 2009-2016 RStudio, Inc.; Win10 Edu.
The third element of the fitted circle (CFKasa[3]) represents the radius, which is what I am really interested in. I am really stuck here, please help.
Many thanks in advance!
Best, David
Turn your data into a 3D array and use apply:
DF <- read.table(text = "V1 V2
146 17
151 19
153 24
156 30
158 36
163 39", header = TRUE)
a <- t(DF)
dim(a) <-c(nrow(a), 3, ncol(a) / 3)
a <- aperm(a, c(2, 1, 3))
# , , 1
#
# [,1] [,2]
# [1,] 146 17
# [2,] 151 19
# [3,] 153 24
#
# , , 2
#
# [,1] [,2]
# [1,] 156 30
# [2,] 158 36
# [3,] 163 39
center <- function(m) c(mean(m[,1]), mean(m[,2]))
t(apply(a, 3, center))
# [,1] [,2]
#[1,] 150 20
#[2,] 159 35
center(DF[1:3,])
#[1] 150 20
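If you prefer to stay close to your original loop, here is a sketch that steps through the rows three at a time. It uses the same stand-in function center from above in place of CircleFitByKasa, and the names starts and fits are made up. Assuming Data is a matrix, your loop most likely fails because length(Data) counts all cells rather than rows, so i + 2 eventually runs past the last row.

```r
DF <- data.frame(V1 = c(146, 151, 153, 156, 158, 163),
                 V2 = c(17, 19, 24, 30, 36, 39))

# Stand-in for the real fitting function (same as in the answer above)
center <- function(m) c(mean(m[, 1]), mean(m[, 2]))

# First row of every complete block of three: 1, 4, 7, ...
starts <- seq(1, nrow(DF) - 2, by = 3)

fits <- lapply(starts, function(i) center(DF[i:(i + 2), ]))
res <- do.call(rbind, fits)   # one row per block of three points
```

To keep only one value per block (for example the radius, which the question says is the third element of the fit), the function inside lapply would return that single element instead, and sapply could replace lapply plus rbind.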

Cumulative sums, moving averages, and SQL "group by" equivalents in R

What's the most efficient way to create a moving average or rolling sum in R? How do you do the rolling function along with a "group by"?
While zoo is great, sometimes there are simpler ways. If your data behave nicely and are evenly spaced, the embed() function effectively lets you create multiple lagged versions of a time series. If you look inside the vars package for vector autoregression, you will see that the package author chose this route.
For example, to calculate the 3-period rolling average of x, where x = (1 -> 20)^2:
> x <- (1:20)^2
> embed (x, 3)
[,1] [,2] [,3]
[1,] 9 4 1
[2,] 16 9 4
[3,] 25 16 9
[4,] 36 25 16
[5,] 49 36 25
[6,] 64 49 36
[7,] 81 64 49
[8,] 100 81 64
[9,] 121 100 81
[10,] 144 121 100
[11,] 169 144 121
[12,] 196 169 144
[13,] 225 196 169
[14,] 256 225 196
[15,] 289 256 225
[16,] 324 289 256
[17,] 361 324 289
[18,] 400 361 324
> apply (embed (x, 3), 1, mean)
[1] 4.666667 9.666667 16.666667 25.666667 36.666667 49.666667
[7] 64.666667 81.666667 100.666667 121.666667 144.666667 169.666667
[13] 196.666667 225.666667 256.666667 289.666667 324.666667 361.666667
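For the "group by" part of the question, one base R option is ave(), which applies a function within each group and returns the result in the original row order. The sketch below is only one possible approach; the data frame df and its columns are made up for illustration, and stats::filter is used for the trailing window instead of embed() so that the result keeps the same length as the input.

```r
x <- (1:20)^2

# Same rolling mean as above, without apply()
roll_embed <- rowMeans(embed(x, 3))

# Trailing 3-period mean via a linear filter; the first two values are NA
roll_filter <- as.numeric(stats::filter(x, rep(1/3, 3), sides = 1))

# Toy grouped data (hypothetical)
df <- data.frame(g = rep(c("a", "b"), each = 5), v = as.numeric(1:10))

# Cumulative sum within each group
df$cum <- ave(df$v, df$g, FUN = cumsum)

# Rolling 3-period mean within each group
roll3 <- function(v) as.numeric(stats::filter(v, rep(1/3, 3), sides = 1))
df$roll <- ave(df$v, df$g, FUN = roll3)
```

Each group restarts its own window, so the first two rows of every group have NA in df$roll.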
I scratched up a good answer from Achim Zeileis over on the r list. Here's what he said:
library(zoo)
## create data
x <- rnorm(365)
## transform to regular zoo series with "Date" index
x <- zooreg(x, start = as.Date("2004-01-01"))
plot(x)
## add rolling/running/moving average with window size 7
lines(rollmean(x, 7), col = 2, lwd = 2)
## if you don't want the rolling mean but rather a weekly
## time series of means you can do
nextfri <- function(x) 7 * ceiling(as.numeric(x - 1)/7) + as.Date(1)
xw <- aggregate(x, nextfri, mean)
## nextfri is a function which computes for a certain "Date"
## the next friday. xw is then the weekly series.
lines(xw, col = 4)
Achim went on to say:
Note that the difference between the rolling mean and the aggregated series is due to different alignments. This can be changed by changing the 'align' argument in rollmean() or the nextfri() function in the aggregate call.
All this came from Achim, not from me:
http://tolstoy.newcastle.edu.au/R/help/05/06/6785.html
