Just starting to use R and am feeling a bit confused. Suppose I have three columns
data = data.frame(id=c(101, 102, 103),column1=c(2, 4, 9),
column2=c(3, 4, 2), column3=c(5, 15, 7))
How can I create a new column (e.g., colmean) that is the mean of the two columns closest in value? I thought about doing a bunch of ifelse statements, but that seemed unnecessarily messy.
In this case, for instance, colmean=c(2.5, 4, 8).
Borrowing the function findClosest() created here by @Cole, we can do the following:
findClosest <- function(x, n) {
  x <- sort(x)
  # take the run of n consecutive sorted values with the smallest spread
  x[seq.int(which.min(diff(x, lag = n - 1L)), length.out = n)]
}

# apply() returns one column per row of data; colMeans() then gives one mean per row
colMeans(apply(data[-1], 1, function(i) findClosest(i, 2)))
#[1] 2.5 4.0 8.0
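To attach this as the requested colmean column, a small usage sketch:

# assign the row-wise means back to the data frame
data$colmean <- colMeans(apply(data[-1], 1, function(i) findClosest(i, 2)))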
A vectorized function using the Rfast package:
library(Rfast)
fClosest <- function(m, n) {
  # transpose so each column holds one original row, then sort each column
  m <- colSort(t(m))
  matrix(
    m[
      sequence(
        # take n consecutive sorted values per column...
        rep(n, ncol(m)),
        # ...starting at the tightest window (smallest spread over n values)
        seq(0, nrow(m) * (ncol(m) - 1), nrow(m)) + colMins(diff(m, lag = n - 1))
      )
    ],
    # one row per original row, n closest values per row
    ncol(m), n, TRUE
  )
}
m <- matrix(sample(10, 24, 1), 4)
m
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 4 2 6 2 5 3
#> [2,] 3 4 7 3 4 7
#> [3,] 4 2 7 6 10 2
#> [4,] 8 1 10 8 2 9
fClosest(m, 3L)
#> [,1] [,2] [,3]
#> [1,] 2 2 3
#> [2,] 3 3 4
#> [3,] 2 2 4
#> [4,] 8 8 9
rowMeans(fClosest(m, 3L))
#> [1] 2.333333 3.333333 2.666667 8.333333
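Applied to the question's data frame (a quick sketch; it assumes Rfast is installed and R >= 4.0, since sequence()'s from argument is used), this should reproduce the expected means:

rowMeans(fClosest(as.matrix(data[-1]), 2L))
#> [1] 2.5 4.0 8.0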
Here is a version with a loop:
data = data.frame(id=c(101, 102, 103),column1=c(2, 4, 9),
column2=c(3, 4, 2), column3=c(5, 15, 7))
data$colmean <- NaN  # set up empty column for results

for(i in seq(nrow(data))){
  data.i <- data[i, -1]                          # get ith row
  d <- as.matrix(dist(c(data.i)))                # get distances between values
  diag(d) <- NaN                                 # replace diagonal of distance matrix with NaN
  hit <- which.min(d)                            # identify position of the lowest distance
  pos <- c(row(d)[hit], col(d)[hit])             # get the positions (i.e. the values that are closest)
  data$colmean[i] <- mean(unlist(data.i[pos]))   # calculate mean
}
data
# id column1 column2 column3 colmean
# 1 101 2 3 5 2.5
# 2 102 4 4 15 4.0
# 3 103 9 2 7 8.0
Here's a self-contained solution, based on the tidyverse, that is independent of the number of columns to be compared.
library(tidyverse)
data %>%
  # Add the means of smallest pairwise differences to the input data
  bind_cols(
    data %>%
      # Make the data tidy (and hence independent of the number of "column"s)
      pivot_longer(starts_with("column")) %>%
      # For each id/row (replace with rowwise() if appropriate)
      group_by(id) %>%
      group_map(
        function(.x, .y) {
          # Form a tibble of all pairwise combinations of values
          as_tibble(t(combn(.x$value, 2))) %>%
            # Calculate pairwise differences
            mutate(difference = abs(V1 - V2)) %>%
            # Find the smallest pairwise difference
            arrange(difference) %>%
            head(1) %>%
            # Calculate the mean of this pair
            pivot_longer(starts_with("V")) %>%
            summarise(colmean = mean(value))
        }
      ) %>%
      # Convert list of values to column
      bind_rows()
  )
id column1 column2 column3 colmean
1 101 2 3 5 2.5
2 102 4 4 15 4.0
3 103 9 2 7 8.0
I was wondering if there might be a way in R to distribute n among k units without repetition (e.g., 3 5 2 is the same as 5 3 2 or 2 3 5) and without allowing 0s (i.e., no 9 1 0), and to see the make-up of this distribution.
For example if n = 9 and k = 3 then we expect the make-up to be:
(Note: k will always be the # of columns)
3 3 3
4 3 2
4 1 4
5 2 2
5 1 3
6 2 1
7 1 1
makeup <- function(n, k){
  # your suggested solution #
}
These are called integer partitions (more specifically restricted integer partitions) and can efficiently be generated with the packages partitions or arrangements like so:
partitions::restrictedparts(9, 3, include.zero = FALSE)
[1,] 7 6 5 4 5 4 3
[2,] 1 2 3 4 2 3 3
[3,] 1 1 1 1 2 2 3
arrangements::partitions(9, 3)
[,1] [,2] [,3]
[1,] 1 1 7
[2,] 1 2 6
[3,] 1 3 5
[4,] 1 4 4
[5,] 2 2 5
[6,] 2 3 4
[7,] 3 3 3
They are much faster than the other solutions provided so far:
library(microbenchmark)
microbenchmark(arrangePack = arrangements::partitions(20, 5),
               partsPack = partitions::restrictedparts(20, 5, include.zero = FALSE),
               myfun2(20, 5, 20),
               myfun1(20, 5, 20),
               makeup(20, 5),
               mycomb(20, 5), times = 3, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
arrangePack 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
partsPack 3.070203 2.755573 2.084231 2.553477 1.854912 1.458389 3
myfun2(20, 5, 20) 10005.679667 8528.784033 6636.284386 7580.133387 5852.625112 4872.050067 3
myfun1(20, 5, 20) 12770.400243 10574.957696 8005.844282 9164.764625 6897.696334 5610.854109 3
makeup(20, 5) 15422.745155 12560.083171 9248.916738 10721.316721 7812.997976 6162.166646 3
mycomb(20, 5) 1854.125325 1507.150003 1120.616461 1284.278219 950.015812 760.280469 3
In fact, for the example below, the other functions will error out because of memory:
system.time(arrangements::partitions(100, 10))
user system elapsed
0.068 0.031 0.099
arrangements::npartitions(100, 10)
[1] 2977866
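If you want the makeup(n, k) interface sketched in the question, a thin wrapper (my own suggestion, not part of either package) is enough:

# hypothetical wrapper around arrangements; one partition per row, as above
makeup <- function(n, k) arrangements::partitions(n, k)
makeup(9, 3)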
You may try gtools::combinations for this, as below, with the repeats.allowed = TRUE option:
m <- gtools::combinations(9, 3, repeats.allowed = TRUE)
m[rowSums(m) == 9,]
A possible function could be the following; with options(expressions = 500000), it can go up to n = 500 (it ran successfully on my machine for n = 500, r = 3):
mycomb <- function(n, r, sumval){
  m <- gtools::combinations(n, r, repeats.allowed = TRUE)
  m[rowSums(m) == sumval, ]
}
mycomb(9,3,9)
Output:
# [,1] [,2] [,3]
#[1,] 1 1 7
#[2,] 1 2 6
#[3,] 1 3 5
#[4,] 1 4 4
#[5,] 2 2 5
#[6,] 2 3 4
#[7,] 3 3 3
Here's a base solution using expand.grid. I'm not going to recommend it for large n, but it works:
makeup <- function(n, k) {
  x <- expand.grid(rep(list(1:n), k))       # generate all combinations of k values
  x <- x[rowSums(x) == n, ]                 # filter out stuff that doesn't sum to n
  x <- as.data.frame(t(apply(x, 1, sort)))  # order everything
  unique(x)                                 # keep non-duplicates
}
A little rethinking simplifies this greatly. If we have a vector of n objects, we can break it apart at n-1 different spots. Starting from this, we can reduce the work substantially:
makeup <- function(n, k) {
  splits <- combn(n - 1, k - 1)                     # locations where to split up the data
  bins <- rbind(rep(0, ncol(splits)), splits)       # add an extra "split" before the 1st element
  x <- apply(bins, 2, function(x) c(x[-1], n) - x)  # count how many items fall in each bin
  x <- as.data.frame(t(apply(x, 2, sort)))          # order everything
  unique(x)                                         # keep non-duplicates
}
Using a matrix in base R:
myfun1 <- function(n, k){
  x <- as.matrix(expand.grid(rep(list(seq_len(n)), k)))
  x <- x[rowSums(x) == n, ]
  x[!duplicated(t(apply(x, 1, sort))), ]
}

myfun1(n = 9, k = 3)
Maybe this, using data.table:
myfun2 <- function(n, k){
  require('data.table')
  dt <- do.call(CJ, rep(list(seq_len(n)), k))
  dt <- dt[rowSums(dt) == n, ]
  dt[which(!duplicated(dt[, transpose(lapply(transpose(.SD), sort))])), ]
}

myfun2(n = 9, k = 3)
# V1 V2 V3
# 1: 7 1 1
# 2: 6 2 1
# 3: 5 3 1
# 4: 4 4 1
# 5: 5 2 2
# 6: 4 3 2
# 7: 3 3 3
I'm looking to create a hybrid of cumsum() and TTR::runSum(), where cumsum() runs up until a pre-specified number of data points, at which point it acts more like runSum().
For example:
library(TTR)
data <- rep(1:3,2)
cumsum <- cumsum(data)
runSum <- runSum(data, n = 3)
DesiredResult <- ifelse(is.na(runSum),cumsum,runSum)
Is there a way to get to DesiredResult that doesn't require finagling with NAs?
That is what the partial=TRUE argument to rollapplyr does. Here we show this with sum and also with sd and IQR. (Note that the sd of one value is NA and we chose IQR since it is a measure of spread that can be calculated for scalars although it is always 0 in that case.)
library(zoo)
rollapplyr(data, 3, sum, partial = TRUE)
## [1] 1 3 6 6 6 6
rollapplyr(data, 3, sd, partial = TRUE)
## [1] NA 0.7071068 1.0000000 1.0000000 1.0000000 1.0000000
rollapplyr(data, 3, IQR, partial = TRUE)
## [1] 0.0 0.5 1.0 1.0 1.0 1.0
Here are three alternatives.
n <- 3
rowSums(embed(c(rep(0, n - 1), data), n)) # base R
# [1] 1 3 6 6 6 6
library(TTR)
runSum(c(rep(0, n - 1), data), n = n)
# [1] NA NA 1 3 6 6 6 6 # na.omit fixes the beginning
library(zoo)
rollsum(c(rep(0, n - 1), data), k = 3, align = "right")
# [1] 1 3 6 6 6 6
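For completeness, here is a dependency-free sketch of the same grow-then-slide window built directly with sapply (my own variant, not one of the three alternatives above):

n <- 3
sapply(seq_along(data), function(i) sum(data[max(1, i - n + 1):i]))
# [1] 1 3 6 6 6 6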
I know this is a stupid question, but I'm kind of frustrated with my code because it takes so much time. Here is one part of my code.
Basically I have a matrix called "distance"...
a b c
1 2 5 7
2 6 8 4
3 9 2 3
And then let's say I have a column in a data frame containing values from {a, b, c} (c2 and c3 are just other columns):
  c1   c2   c3
  c    ...  ...
  a    ...  ...
  a    ...  ...
  b    ...  ...
  c    ...  ...
So I want to do a match: I want to make another matrix with ncol = nrow(distance) and nrow = nrow(c1), where each factor value is replaced by its distance value. Here's an example of the first column of the matrix that I'm going to make:
a will be replaced by 2
b will be replaced by 5
c will be replaced by 7
And for the second column, I will take row number 2 from the distance matrix, and so on... so the result will be like this:
m1 m2 m3
7 4 3
2 6 9
2 6 9
5 8 2
7 4 3
That is just a simple example. I'm running the code below, but when it has to handle many iterations it gets painfully slow.
for(l in 1:ncol(d.cat)){
  get.unique = sort(unique(d.cat[, l]))
  for(j in 1:nrow(d.cat)){
    value = as.character(d.cat[j, l])
    index = which(get.unique == value)
    d2[j, l] = (d[[l]][i, index])
  }
}
d.cat is categorical data. And d[[...]] is the list of matrix distance for every column in d.cat.
Try to store the indices and do the updating in one go. Let's say your distance matrix is dmat, your data frame is df, and you want to create a matrix named newmat:
a.ind = which(df$c1 == "a")
b.ind = which(df$c1 == "b")
c.ind = which(df$c1 == "c")

newmat = matrix(0, nrow = length(df$c1), ncol = 3)
# each matching row gets the corresponding distance column, laid out as a row
newmat[a.ind, ] = matrix(dmat[, 1], length(a.ind), 3, byrow = TRUE)
newmat[b.ind, ] = matrix(dmat[, 2], length(b.ind), 3, byrow = TRUE)
newmat[c.ind, ] = matrix(dmat[, 3], length(c.ind), 3, byrow = TRUE)
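As a quick check (my own sketch, not part of the original answer), plugging in the question's example and re-running the assignments above reproduces the expected matrix:

dmat <- matrix(c(2, 6, 9, 5, 8, 2, 7, 4, 3), nrow = 3,
               dimnames = list(1:3, c("a", "b", "c")))
df <- data.frame(c1 = c("c", "a", "a", "b", "c"), stringsAsFactors = FALSE)
# after running the index/assignment code above:
# newmat
#      [,1] [,2] [,3]
# [1,]    7    4    3
# [2,]    2    6    9
# [3,]    2    6    9
# [4,]    5    8    2
# [5,]    7    4    3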
Here's some data
set.seed(123)
d = matrix(1:9, 3, dimnames=list(NULL, letters[1:3]))
df = data.frame(c1 = sample(letters[1:3], 10, TRUE), stringsAsFactors=FALSE)
and a solution
t(d[, match(df$c1, colnames(d))])
For example
> d
a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> df$c1
[1] "a" "c" "b" "c" "c" "a" "b" "c" "b" "b"
> t(d[,match(df$c1, colnames(d))])
[,1] [,2] [,3]
a 1 2 3
c 7 8 9
b 4 5 6
c 7 8 9
c 7 8 9
a 1 2 3
b 4 5 6
c 7 8 9
b 4 5 6
b 4 5 6
Your data
mat <- matrix(c(2,6,9,5,8,2,7,4,3), nrow=3)
rownames(mat) <- 1:3
colnames(mat) <- letters[1:3]
library(dplyr)
set.seed(1)
df <- as.data.frame(matrix(sample(letters[1:3], 12, replace=TRUE), nrow=4)) %>%
  setNames(paste0("c", 1:3))
# c1 c2 c3
# 1 a a b
# 2 b c a
# 3 b c a
# 4 c b a
Using purrr::map2_df, iterate through the columns of df and the columns of tmat:
library(purrr)
tmat <- t(mat)
map2_df(df, seq_len(ncol(tmat)), ~tmat[,.y][.x])
# # A tibble: 4 x 3
# c1 c2 c3
# <dbl> <dbl> <dbl>
# 1 2. 6. 2.
# 2 5. 4. 9.
# 3 5. 4. 9.
# 4 7. 8. 9.
Here is my attempt using the tidyverse:
library(tidyverse)
# Let's create some example data
distance <- data_frame(a = sample(1:10, 1000, T), b = sample(1:10, 1000, T), c = sample(1:10, 1000, T))
c1 <- data_frame(c1 = sample(letters[1:3], 1000, T), c2 = sample(letters[1:3], 1000, T))
# First rearrange a little bit your data to make it more tidy
distance2 <- distance %>%
  mutate(i = seq_len(n())) %>%
  gather(col, value, -i)

c2 <- c1 %>%
  mutate(i = seq_len(n())) %>%
  gather(col, value, -i)

# Now just join the data and spread it again
c2 %>%
  left_join(distance2, by = c("i", "value" = "col")) %>%
  select(i, col, value.y) %>%
  spread(col, value.y)
I'm trying to create a vector whose elements add up to a specific number. For example, let's say I want to create a vector with 4 elements, and they must add up to 20, so its elements could be 6, 6, 4, 4 or 2, 5, 7, 6, whatever. I tried to run some lines using sample() and seq() but I cannot do it.
Any help appreciated.
To divide into 4 parts, you need three breakpoints from the 19 possible breaks between 20 numbers. Then your parts are just the sizes of the intervals between 0, your breakpoints, and 20:
> sort(sample(19,3))
[1] 5 7 12
> diff(c(0, 5,7,12,20))
[1] 5 2 5 8
Test: let's create a big matrix of them. Each column is an instance:
> trials = sapply(1:1000, function(X){diff(c(0,sort(sample(19,3)),20))})
> trials[,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 3 1 8 13 3 2
[2,] 4 7 10 2 9 5
[3,] 2 11 1 4 3 7
[4,] 11 1 1 1 5 6
Do they all add to 20?
> all(apply(trials,2,sum)==20)
[1] TRUE
Are there any weird cases?
> range(trials)
[1] 1 17
No, there are no zeroes and nothing bigger than 17, which would be a (1,1,1,17) case. You can't have an 18 without a zero.
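Wrapped into a reusable function (a sketch of the same breakpoint idea; the name rand_parts is mine):

# one random composition of n into k positive integer parts
rand_parts <- function(n, k) diff(c(0, sort(sample(n - 1, k - 1)), n))

rand_parts(20, 4)        # e.g. 5 2 5 8
sum(rand_parts(20, 4))   # always 20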
Another approach: start from equal shares of sum1 and randomly perturb them without changing the total:
foo = function(n, sum1){
  # Divide sum1 into 'n' parts
  x = rep(sum1/n, n)
  # For each x, sample a value from 1 to that value minus one
  f = sapply(x, function(a) sample(1:(a - 1), 1))
  # Add and subtract (permutations of) f from 'x' so that sum(x) does not change
  x = x + sample(f)
  x = x - sample(f)
  x = floor(x)
  x[n] = x[n] - (sum(x) - sum1)
  return(x)
}
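A quick usage sketch (output not shown since it is random):

set.seed(42)
foo(4, 20)        # four positive integers
sum(foo(4, 20))   # 20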
I have a data frame with list of X/Y locations (>2000 rows). What I want is to select or find all the rows/locations based on a max distance. For example, from the data frame select all the locations that are between 1-100 km from each other. Any suggestions on how to do this?
You need to somehow determine the distance between each pair of rows.
The simplest way is with a corresponding distance matrix
library(data.table)

# assuming thresh is your threshold
thresh <- 10

# create some sample data
set.seed(123)
DT <- data.table(X = sample(-10:10, 5, TRUE), Y = sample(-10:10, 5, TRUE))

# create the distance matrix
distTable <- matrix(apply(createTable(DT), 1, distance), nrow = nrow(DT))

# remove the lower triangle since we have symmetry (we don't want duplicates)
distTable[lower.tri(distTable)] <- NA

# show which row pairs are at or above the threshold
pairedRows <- which(distTable >= thresh, arr.ind = TRUE)
colnames(pairedRows) <- c("RowA", "RowB")  # clean up the names
Starting with:
> DT
X Y
1: -4 -10
2: 6 1
3: -2 8
4: 8 1
5: 9 -1
We get:
> pairedRows
RowA RowB
[1,] 1 2
[2,] 1 3
[3,] 2 3
[4,] 1 4
[5,] 3 4
[6,] 1 5
[7,] 3 5
These are the two functions used for creating the distance matrix:
# pair up all of the rows
createTable <- function(DT)
  expand.grid(apply(DT, 1, list), apply(DT, 1, list))

# simple cartesian/pythagorean distance
distance <- function(CoordPair)
  sqrt(sum((CoordPair[[2]][[1]] - CoordPair[[1]][[1]])^2, na.rm=FALSE))
I'm not entirely clear from your question, but assuming you mean you want to take each row of coordinates and find all the other rows whose coordinates fall within a certain distance:
# Create data set for example
set.seed(42)
x <- sample(-100:100, 10)
set.seed(456)
y <- sample(-100:100, 10)
coords <- data.frame(
  "x" = x,
  "y" = y)

# Loop through all rows
lapply(1:nrow(coords), function(i) {
  dis <- sqrt(
    (coords[i, "x"] - coords[, "x"])^2 +  # insert your preferred
    (coords[i, "y"] - coords[, "y"])^2    # distance calculation here
  )
  names(dis) <- 1:nrow(coords)  # replace this part with an index or row names if you have them
  dis[dis > 0 & dis <= 100]     # change numbers to preferred threshold
})
[[1]]
2 6 7 9 10
25.31798 95.01579 40.01250 30.87070 73.75636
[[2]]
1 6 7 9 10
25.317978 89.022469 51.107729 9.486833 60.539243
[[3]]
5 6 8
70.71068 91.78780 94.86833
[[4]]
5 10
40.16217 99.32774
[[5]]
3 4 6 10
70.71068 40.16217 93.40771 82.49242
[[6]]
1 2 3 5 7 8 9 10
95.01579 89.02247 91.78780 93.40771 64.53681 75.66373 97.08244 34.92850
[[7]]
1 2 6 9 10
40.01250 51.10773 64.53681 60.41523 57.55867
[[8]]
3 6
94.86833 75.66373
[[9]]
1 2 6 7 10
30.870698 9.486833 97.082439 60.415230 67.119297
[[10]]
1 2 4 5 6 7 9
73.75636 60.53924 99.32774 82.49242 34.92850 57.55867 67.11930
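A compact alternative sketch (mine, using base dist() on the same coords data frame) that computes all pairwise Euclidean distances at once and then thresholds them:

d <- as.matrix(dist(coords))          # all pairwise Euclidean distances
d[lower.tri(d, diag = TRUE)] <- NA    # drop self-distances and mirror-image duplicates
which(d <= 100, arr.ind = TRUE)       # index pairs of rows within the threshold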