How to find all pairwise comparisons in a large data set - r

I have a large data set from an experiment, with about 25000 spectra. I want to see whether there are any features common to all spectra. There is a built-in function, compareSpectra, that compares two specific spectra, but I want to develop a loop that gives me results for all possible comparisons. Finally, I want to build a data.frame or list that also records the numbers of the spectra being compared.
I wrote a simple loop that gives me a comparison of spectra 1 and 2, 2 and 3, 3 and 4, and 4 and 5.
for (i in 1:4){
comparison <- compareSpectra(raw_25kda[[i]], raw_25kda[[i+1]], fun = "common")
print(as.list(comparison))
}
From the loop, I get four numbers, 2, 5, 6 and 2, for the comparisons of 1 and 2, 2 and 3, 3 and 4, and 4 and 5 respectively.
The first comparison is between 1 and 2, and there are 2 common features. Is there any way I can explicitly print that 1 and 2 were compared and that they have 2 common features?
I also want the comparisons of 1 and 3, 1 and 4, 2 and 4, 3 and 4 as well.
When I recall comparison later in a different R chunk, it gives me only one value, the last one (2). How can I save the results inside the loop for future analysis? Any help will be appreciated.

I don't have the data or packages you are using, so this might be a little off, but should hopefully point in the right direction.
Here are all the combinations of 5 data sets:
my_data_sets <- 1:5
combos <- combn(my_data_sets, m = 2, simplify = TRUE)
combos
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    1    1    1    1    2    2    2    3    3     4
# [2,]    2    3    4    5    3    4    5    4    5     5
There are 10. Now we can initialize a list with ten elements to store our results in.
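combo_len <- ncol(combos)
output <- vector("list", combo_len)
for (i in seq_len(combo_len)) { # iterate over every column, not just the last one
  set1 <- combos[1, i]
  set2 <- combos[2, i]
  output[[i]] <- compareSpectra(raw_25kda[[set1]], raw_25kda[[set2]], fun = "common")
}
# label each result with the pair it came from, e.g. "1_vs_2"
names(output) <- paste(combos[1, ], combos[2, ], sep = "_vs_")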
The output object should now have ten elements which each represent their respective combination.
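If a data.frame that records which spectra were compared is more convenient than a list, something along these lines might work. This is only a sketch: count_common and the spectra list are hypothetical stand-ins (plain numeric vectors compared with intersect), since I can't run compareSpectra here. With the real data, the call to count_common would be swapped for compareSpectra(raw_25kda[[i]], raw_25kda[[j]], fun = "common").

```r
# Hypothetical stand-in for compareSpectra(..., fun = "common"):
# counts the values two numeric vectors share.
count_common <- function(a, b) length(intersect(a, b))

# Hypothetical toy "spectra"; the real ones would be raw_25kda[[i]]
spectra <- list(c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(1, 5), c(2, 5))

# One row per unordered pair: (1,2), (1,3), ..., (4,5)
pairs <- t(combn(seq_along(spectra), 2))

result <- data.frame(
  spectrum_a = pairs[, 1],
  spectrum_b = pairs[, 2],
  common = mapply(
    function(i, j) count_common(spectra[[i]], spectra[[j]]),
    pairs[, 1], pairs[, 2]
  )
)

print(result)
```

Keep in mind that 25000 spectra give choose(25000, 2) = 312,487,500 pairs, so running every comparison may take a long time and it could be worth trying a subset first.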

Related

Incorrect number of subscripts on matrix in R using read_table

I am trying to read several .RLS (spreadsheet) files into one matrix using R. These files have very formulaic names, which makes them easy to work with. My plan (see code below) is to create the file names in a vector called names, then use a for loop to read the second column of each of those files and add it to a matrix. However, I have run into an issue. I have tested the names part of my code, and the files can be read into R individually. But when I try to put them all together into the matrix collected using the second for loop, I get an error that says "incorrect number of subscripts on matrix". I am not sure what this means. Any advice would be welcome.
library(tidyverse)
collector <- function(min, max){
collected <- matrix(nrow = 601, ncol = max - min + 2)
names = c()
for (i in 1:(max-min+1)){
names[i] = paste0("D:/CHY 498/UV-Vis/22822/BH4_3/12321A",(i+min-1),".RLS")
}
for (j in 1:(max-min+1)){
e <- read_table(names[j], col_names=FALSE)
collected[,j+1] = e[,2]
}
}
test <- collector(15, 23)
test
The issue is probably that read_table returns a tibble, and a tibble doesn't drop dimensions with [, so e[,2] is still a one-column tibble rather than a vector. Instead, we need [[ to extract the column as a vector.
collector <- function(min, max){
collected <- matrix(nrow = 601, ncol = max - min + 2)
names = c()
for (i in 1:(max-min+1)){
names[i] = paste0("D:/CHY 498/UV-Vis/22822/BH4_3/12321A",(i+min-1),".RLS")
}
for (j in 1:(max-min+1)){
e <- read_table(names[j], col_names=FALSE)
collected[,j+1] = e[[2]] # changed: [[ extracts the column as a vector
}
collected # return the filled matrix
}
Instead of initializing names as an empty (NULL) vector, we can create a character vector of the required length and then assign with [i]. Other than that, the code works with dummy data:
collector <- function(min, max){
i1 <- max - min + 1
collected <- matrix(nrow = 601, ncol = max - min + 2)
names = character(i1)
for (i in 1:i1){
names[i] = paste0("D:/CHY 498/UV-Vis/22822/BH4_3/12321A",(i+min-1),".RLS")
}
for (j in 1:i1){
e <- cbind(NA, 1:601) # created dummy data
collected[,j+1] = e[,2]
}
collected
}
-testing
test <- collector(15, 23)
> head(test)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]   NA    1    1    1    1    1    1    1    1     1
[2,]   NA    2    2    2    2    2    2    2    2     2
[3,]   NA    3    3    3    3    3    3    3    3     3
[4,]   NA    4    4    4    4    4    4    4    4     4
[5,]   NA    5    5    5    5    5    5    5    5     5
[6,]   NA    6    6    6    6    6    6    6    6     6
NOTE: The last part, reading the data, couldn't be tested. It may be that some of the files don't contain data and thus couldn't be read. Also, paste is vectorized, so the first loop is not really needed.
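To illustrate the point about paste being vectorized, the whole vector of file names can be built in one call. A small sketch, where first_id, last_id and file_names are names I made up (they also avoid masking the base functions min, max and names):

```r
first_id <- 15
last_id  <- 23

# paste0 recycles the fixed prefix and suffix over the sequence 15:23,
# so no loop is needed to build the file names
file_names <- paste0("D:/CHY 498/UV-Vis/22822/BH4_3/12321A",
                     first_id:last_id, ".RLS")

print(file_names)
```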

Using distGeo with two sets of coordinates

I have two sets of coordinates (loc and stat) both in the following format
x y
1 49.68375 8.978462
2 49.99174 8.238287
3 51.30842 12.411870
4 50.70487 6.627252
5 50.70487 6.627252
6 50.37381 8.040766
For each location in the first data set (location of observation) I want to know the location in the second data set (weather stations), that is closest to it. Basically matching the locations of observations to the closest weather station for later analysis of weather effects.
I tried using the distGeo function simply by putting in
distGeo(loc, stat, a=6378137, f=1/298.257223563)
But that didn't work, because loc and stat are not in the right format.
Thanks for your help!
Try this:
outer(seq_len(nrow(loc)), seq_len(nrow(stat)),
function(a,b) geosphere::distGeo(loc[a,], stat[b,]))
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0.00 88604.79 419299.1 283370.9 283370.9 128560.08
# [2,] 88604.79 0.00 483632.9 194784.6 194784.6 47435.65
# [3,] 419299.12 483632.85 0.0 643230.3 643230.3 494205.86
# [4,] 283370.91 194784.62 643230.3 0.0 0.0 160540.63
# [5,] 283370.91 194784.62 643230.3 0.0 0.0 160540.63
# [6,] 128560.08 47435.65 494205.9 160540.6 160540.6 0.00
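To get from this distance matrix to the closest station for each location, one option is to take the column with the smallest value in each row. The sketch below uses a small made-up matrix d (rows are locations, columns are stations, values are hypothetical) so that it runs without geosphere. With the real data, d would be the result of the outer call above.

```r
# Hypothetical distances: rows = locations, columns = stations
d <- matrix(c(120, 45, 300,
               80, 95,  10,
              500, 20,  60),
            nrow = 3, byrow = TRUE)

# Index of the nearest station for each location (row-wise minimum)
nearest <- apply(d, 1, which.min)

# Distance to that nearest station, picked out with matrix indexing
nearest_dist <- d[cbind(seq_len(nrow(d)), nearest)]

print(data.frame(location = seq_len(nrow(d)),
                 station = nearest,
                 distance = nearest_dist))
```

One more thing to check: geosphere expects points as longitude first, then latitude. In the sample data x looks like it might be the latitude, in which case the columns would need to be swapped before calling distGeo.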
Brief explanation:
outer(1:3, 1:4, ...) produces two vectors that are a cartesian product, very similar to
expand.grid(1:3, 1:4)
# Var1 Var2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 1 3
# 8 2 3
# 9 3 3
# 10 1 4
# 11 2 4
# 12 3 4
(using expand.grid only for demonstration of the expansion)
The anonymous function I defined (function(a,b)...) is called once, where a is assigned the integer vector c(1,2,3,1,2,3,1,2,3,1,2,3) (using my 1:3 and 1:4 example), and b is assigned the integer vector c(1,1,1,2,2,2,3,3,3,4,4,4).
Within the anonymous function, loc[a,] results in a much longer frame: if loc has m rows and stat has n rows, then loc[a,] has m*n rows, and stat[b,] has m*n rows as well. This works well, because distGeo (and the other dist* functions in geosphere) operates in one of two ways:
If either of the arguments has 1 row, then its distance is calculated against all rows of the other argument. Unfortunately, unless you know that loc or stat will always have just one row, this method doesn't work.
Otherwise, both arguments must have the same number of rows, and the distance is calculated pairwise (1st row of 1st arg with 1st row of 2nd arg; 2nd row of 1st arg with 2nd row of 2nd arg; etc). This is the method we're prepared for.
In general, the anonymous function given to outer must deal with vectorized arguments on its own. For instance, if you needed distGeo to be called once for each pair (so it would be called m*n times), then you have to handle that yourself; outer will not do it for you. There are constructs in R that support this (e.g., mapply, Map) or replace outer (Map, expand.grid, and do.call), but that's for another question.
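The "called once" behaviour is easy to check with a counter. In this sketch f, g and the two counters are made-up names. f is already vectorized, while g is wrapped in Vectorize to show one way of getting a call per pair when a function can only handle scalars.

```r
# Count how often outer calls a vectorized function
calls <- 0
f <- function(a, b) {
  calls <<- calls + 1   # record each call
  a * 10 + b            # works element-wise on whole vectors
}
m <- outer(1:3, 1:4, f)

# Wrapping a function in Vectorize makes outer reach it once per pair
scalar_calls <- 0
g <- function(a, b) {
  scalar_calls <<- scalar_calls + 1
  a * 10 + b
}
m2 <- outer(1:3, 1:4, Vectorize(g))

print(calls)         # f was called once with vectors of length 12
print(scalar_calls)  # g was called once for each of the 12 pairs
```

The per-pair version gives the same matrix but is typically slower, so it is probably only worth it when the function really cannot take vectors.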

Get row indices of data frame A according to multiple matching criteria in that data frame and another data frame, B

Let's say we have two data frames in R, df.A and df.B, defined thus:
bin_name <- c('bin_1','bin_2','bin_3','bin_4','bin_5')
bin_min <- c(0,2,4,6,8)
bin_max <- c(2,4,6,8,10)
df.A <- data.frame(bin_name, bin_min, bin_max, stringsAsFactors = FALSE)
obs_ID <- c('obs_1','obs_2','obs_3','obs_4','obs_5','obs_6','obs_7','obs_8','obs_9','obs_10')
obs_min <- c(6.5,0,8,2,1,7,5,6,8,3)
obs_max <- c(7,3,10,3,9,8,5.5,8,10,4)
df.B <- data.frame(obs_ID, obs_min, obs_max, stringsAsFactors = FALSE)
df.A defines the ranges of bins, while df.B consists of rows of observations with min and max values that may or may not fall entirely within a bin defined in df.A.
We want to generate a new vector of length nrow(df.B) containing the row indices of df.A corresponding to the bin within which each observation falls entirely. If an observation straddles bins or falls partially outside a bin, then it can't be assigned to a bin and should return NA (or something similar).
In the above example, the correct output vector would be this:
bin_rows <- c(4, NA, 5, 2, NA, 4, 3, 4, 5, 2)
I came up with a long-winded solution using sapply:
bin_assignments <- sapply(1:nrow(df.B), function(i) which(df.A$bin_max >= df.B$obs_max[i] & df.A$bin_min <= df.B$obs_min[i])) #get bin assignments for every observation
bin_assignments[bin_assignments == "integer(0)"] <- NA #replace "integer(0)" entries with NA
bin_assignments <- do.call("c", bin_assignments) #concatenate the output of the sapply call
Several months ago I discovered a simple, single-line solution to this problem that didn't use an apply function. However, I forgot how I did this and I have not been able to rediscover it! The solution might involve match() or which(). Any ideas?
1) Using SQL it can readily be done in one statement:
library(sqldf)
sqldf('select a.rowid
from "df.B" b
left join "df.A" a on obs_min >= bin_min and obs_max <= bin_max')
rowid
1 4
2 NA
3 5
4 2
5 NA
6 4
7 3
8 4
9 5
10 2
2) merge/by We can do it in two statements using merge and by. No packages are used.
This does have the downside that it materializes the large join which the SQL solution would not need to do.
Note that in df.B, as defined in the question, obs_10 is the second level rather than the 10th level. If obs_10 were the 10th level, then the second argument to by could have been just m$obs_ID, so fixing up the input first could simplify it.
m <- merge(df.B, df.A)
stack(by(m, as.numeric(sub(".*_", "", m$obs_ID)),
with, c(which(obs_min >= bin_min & obs_max <= bin_max), NA)[1]))
giving:
values ind
1 4 1
2 NA 2
3 5 3
4 2 4
5 NA 5
6 4 6
7 3 7
8 4 8
9 5 9
10 2 10
3) sapply Note that using the c(..., NA)[1] trick from (2) we can simplify the sapply solution in the question to one statement:
sapply(1:nrow(df.B), function(i)
c(which(df.A$bin_max >= df.B$obs_max[i] & df.A$bin_min <= df.B$obs_min[i]), NA)[1])
giving:
[1] 4 NA 5 2 NA 4 3 4 5 2
3a) mapply A nicer variation of (3) using mapply is given by @Ronak Shah in the comments:
mapply(function(x, y) c(which(x >= df.A$bin_min & y <= df.A$bin_max), NA)[1],
df.B$obs_min,
df.B$obs_max)
4) outer Here is another one statement solution that uses no packages.
seq_len(nrow(df.A)) %*%
(outer(df.A$bin_max, df.B$obs_max, ">=") & outer(df.A$bin_min, df.B$obs_min, "<="))
giving:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 4 0 5 2 0 4 3 4 5 2
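Note that (4) returns a 1 x 10 matrix with 0 rather than NA for observations that fit no bin. If the plain vector with NA from the question is wanted, a small post-processing step along these lines should do it. The vectors are copied from the question and inside and bin_rows are just names for the intermediate results. It assumes each observation fits at most one bin, because otherwise the matrix product would add up the matching row indices.

```r
# Data from the question
bin_min <- c(0, 2, 4, 6, 8)
bin_max <- c(2, 4, 6, 8, 10)
obs_min <- c(6.5, 0, 8, 2, 1, 7, 5, 6, 8, 3)
obs_max <- c(7, 3, 10, 3, 9, 8, 5.5, 8, 10, 4)

# inside[i, j] is TRUE when observation j lies entirely within bin i
inside <- outer(bin_max, obs_max, ">=") & outer(bin_min, obs_min, "<=")

# The matrix product picks out the matching row index (0 if there is none);
# drop() turns the 1 x n matrix into a plain vector
bin_rows <- drop(seq_along(bin_min) %*% inside)
bin_rows[bin_rows == 0] <- NA

print(bin_rows)
```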

Nearest Neighbors from KKNN package in R giving garbage indices values when the entire dataset is used

I am using the "kknn" package in R to find all of the nearest neighbors for every row in the data set. For some odd reason, the last row in the test dataset is always ignored. Below are the R code and the output obtained.
X1 <- c(0.6439659, 0.1923593, 0.3905551, 0.7728847, 0.7602632)
X2 <- c(0.9147394, 0.6181713, 0.8515923, 0.8459367, 0.9296278)
Class <- c(1, 1, 0, 0, 0)
Data <- data.frame(X1,X2,Class)
Data$Class <- as.factor(Data$Class)
library("kknn")
### Here, both training and testing data sets is the object Data
Neighbors.KNN <- kknn(Data$Class~., Data,Data,k = 5, distance =2, kernel = "gaussian")
## Output
## The Column 5 in the below output is filled with garbage values and the value of the first value in the last row is 4, when it has to be 5.
Neighbors.KNN$C
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 3 2 3245945
[2,] 2 3 4 1 3245945
[3,] 3 1 4 2 3245945
[4,] 4 1 3 2 3245945
[5,] 1 4 3 2 3245945
Could someone let me know if I am doing something wrong or if that is a bug in the package?
The current implementation (silently) assumes that k is smaller than n, the number of rows. In general k << n, and in that case there is no problem. The (k+1)th neighbor is used to scale distances. I should have mentioned this in the documentation.
Regards,
Klaus

R - sample() within for loop generates identical permutations?

When I run a simple for loop to compute X number of permutations of a vector, the sample() function returns the same permutation for each iteration.
Below is my code:
options <- commandArgs(trailingOnly=T)
labels <- read.table(options[2], header=F)
holder <- c()
for (i in 1:options[1]){
perm <- sample(labels[,2:ncol(labels)], replace=F)
perm <- cbind(as.character(labels[1]), perm)
holder <- rbind(holder, perm)
}
write.table(holder, file=options[3], row.names=F, col.names=F, quote=F, sep='\t')
Is there a reason why this is so? Is there another simple way to generate say 1000 permutations of a vector?
*Added after comment - a replicable example*
vec <- 1:10
holder <-c()
for (i in 1:5){
perm <- sample(vec, replace=F)
holder <- rbind(holder, perm)
}
> holder
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
perm 3 2 1 10 9 6 7 4 5 8
perm 5 8 2 3 4 10 9 1 6 7
perm 10 7 3 1 4 2 5 8 9 6
perm 9 5 2 8 3 1 6 10 7 4
perm 3 7 5 6 8 2 1 9 10 4
And this works fine! I guess I have a bug somewhere! My input is perhaps in a mess.
Thanks,
D.
For a reproducible example, just replace options[1] with a constant and set labels to a built-in or self-specified data frame. (By the way, neither is a great variable name, since both are base functions.) Just looking at the inner part of your for loop, you shuffle all but the first column of a data.frame. This works as you expect. Put print(names(perm)) in right after creating perm and you will see. You then rbind this data frame to the previous results. rbind, recognizing that it is working with data frames, helpfully reorders the columns of the different data frames so that the column names line up (which, generally, is what you would want it to do: the name of a column defines which one it is, and you would want to extend each column appropriately).
The problem is that you are doing permutations on columns of a data frame, not "of a vector" as you seem to think.
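Here is a small sketch of that effect, plus one way around it. labels_df is a made-up one-row data frame standing in for your labels, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only so that reruns give the same shuffles

# Hypothetical one-row data frame standing in for `labels`
labels_df <- data.frame(id = "s1", a = 1, b = 2, c = 3, d = 4)

holder <- NULL
for (i in 1:5) {
  perm <- sample(labels_df[, 2:ncol(labels_df)])  # shuffles the columns
  holder <- rbind(holder, perm)  # rbind lines the columns up by name again
}
print(holder)  # all five rows come out the same

# Permuting the values as a plain vector sidesteps the name matching
vec <- unlist(labels_df[1, -1], use.names = FALSE)
perms <- t(replicate(5, sample(vec)))  # one permutation per row
print(perms)
```

For real data you would probably pull the values out of the data frame first (as with unlist above) and attach the label column afterwards.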
