R - sample() within for loop generates identical permutations?

When I run a simple for loop to compute X number of permutations of a vector, the sample() function returns the same permutation for each iteration.
Below is my code:
options <- commandArgs(trailingOnly=T)
labels <- read.table(options[2], header=F)
holder <- c()
for (i in 1:options[1]){
  perm <- sample(labels[,2:ncol(labels)], replace=F)
  perm <- cbind(as.character(labels[1]), perm)
  holder <- rbind(holder, perm)
}
write.table(holder, file=options[3], row.names=F, col.names=F, quote=F, sep='\t')
Is there a reason why this is so? Is there another simple way to generate say 1000 permutations of a vector?
*Added after comment - a replicable example*
vec <- 1:10
holder <-c()
for (i in 1:5){
  perm <- sample(vec, replace=F)
  holder <- rbind(holder, perm)
}
> holder
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
perm 3 2 1 10 9 6 7 4 5 8
perm 5 8 2 3 4 10 9 1 6 7
perm 10 7 3 1 4 2 5 8 9 6
perm 9 5 2 8 3 1 6 10 7 4
perm 3 7 5 6 8 2 1 9 10 4
And this works fine! I guess I have a bug somewhere; perhaps my input is a mess.
Thanks,
D.

For a reproducible example, just replace options[1] with a constant and set labels to a built-in or self-specified data frame. (By the way, neither options nor labels is a great variable name, since both are names of base functions.)
Just looking at the inner part of your for loop: you shuffle all but the first column of a data.frame, and that part works as you expect. Put print(names(perm)) in after you finish making perm and you will see. You then rbind this data frame to the previous results. rbind, recognizing it is working with data frames, helpfully reshuffles the column order of the different data frames so that the column names line up (which, generally, is what you want it to do: the name of a column defines which one it is, and you want each column extended appropriately).
The problem is that you are doing permutations on columns of a data frame, not "of a vector" as you seem to think.
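If what you actually want is many permutations of one vector of labels, a minimal sketch (assuming the labels sit in row 1, columns 2 onward, of labels; labs below is a hypothetical stand-in for that row) is to unlist the row first so that sample() shuffles a plain vector:
# labs stands in for unlist(labels[1, 2:ncol(labels)]) from the question
labs <- letters[1:10]
n_perm <- 1000

# replicate() re-runs sample(labs) n_perm times; t() puts one permutation per row
perms <- t(replicate(n_perm, sample(labs)))
dim(perms)      # 1000 x 10
head(perms, 2)  # first two permutations
Since perms is an ordinary character matrix, it can then be written out directly with write.table().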


Incorrect number of subscripts on matrix in R using read_table

I am trying to read several .RLS (spreadsheet) files into one matrix using R. These files have very formulaic names, which makes them easy to work with. My plan (see code below) is to create the file names in a vector called names, then use a for loop to read the second column of each of those files and add it to a matrix. However, I have run into an issue. I have tested the names part of my code, and it can read tables into R individually. But when I try to put them all together into the matrix collected using the second for loop, I get an error that says, "incorrect number of subscripts on matrix". I am not sure what this means. Any advice would be welcome.
library(tidyverse)
collector <- function(min, max){
  collected <- matrix(nrow = 601, ncol = max - min + 2)
  names = c()
  for (i in 1:(max-min+1)){
    names[i] = paste0("D:/CHY 498/UV-Vis/22822/BH4_3/12321A",(i+min-1),".RLS")
  }
  for (j in 1:(max-min+1)){
    e <- read_table(names[j], col_names=FALSE)
    collected[,j+1] = e[,2]
  }
}
test <- collector(15, 23)
test
Regarding the issue, it may be because read_table returns a tibble, and a tibble doesn't drop dimensions with [. Instead, we need [[.
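A quick illustration of the difference, as a sketch with a small made-up tibble rather than the question's files:
library(tibble)
tb <- tibble(x = 1:3, y = 4:6)
tb[, 2]    # still a 3 x 1 tibble, not a vector
tb[[2]]    # a plain integer vector: 4 5 6
Applying that change inside the loop: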
collector <- function(min, max){
  collected <- matrix(nrow = 601, ncol = max - min + 2)
  names = c()
  for (i in 1:(max-min+1)){
    names[i] = paste0("D:/CHY 498/UV-Vis/22822/BH4_3/12321A",(i+min-1),".RLS")
  }
  for (j in 1:(max-min+1)){
    e <- read_table(names[j], col_names=FALSE)
    collected[,j+1] = e[[2]]  ## change: [[ extracts a plain vector from the tibble
  }
  collected  # return the filled matrix
}
Instead of initializing with a NULL vector, we can create a vector of a certain length and then assign with [i]. Other than that, the code works with dummy data:
collector <- function(min, max){
  i1 <- max - min + 1
  collected <- matrix(nrow = 601, ncol = max - min + 2)
  names = character(i1)
  for (i in 1:i1){
    names[i] = paste0("D:/CHY 498/UV-Vis/22822/BH4_3/12321A",(i+min-1),".RLS")
  }
  for (j in 1:i1){
    e <- cbind(NA, 1:601)  # created dummy data in place of read_table()
    collected[,j+1] = e[,2]
  }
  collected
}
Testing:
test <- collector(15, 23)
> head(test)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] NA 1 1 1 1 1 1 1 1 1
[2,] NA 2 2 2 2 2 2 2 2 2
[3,] NA 3 3 3 3 3 3 3 3 3
[4,] NA 4 4 4 4 4 4 4 4 4
[5,] NA 5 5 5 5 5 5 5 5 5
[6,] NA 6 6 6 6 6 6 6 6 6
NOTE: The last part, actually reading the data, couldn't be tested; it may be that some of the files don't have data and thus can't be read. Also, paste0 is vectorized, so the first loop is not really needed.
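As a sketch of that vectorized alternative (same path pattern as in the question):
min <- 15; max <- 23
# one call builds every file name; no loop needed
names <- paste0("D:/CHY 498/UV-Vis/22822/BH4_3/12321A", min:max, ".RLS")
head(names, 2)
# [1] "D:/CHY 498/UV-Vis/22822/BH4_3/12321A15.RLS"
# [2] "D:/CHY 498/UV-Vis/22822/BH4_3/12321A16.RLS"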

Is there an R function to retrieve values from a matrix of column names?

I have a matrix M consisting of column names from a data frame with one row, such that each column name has just one corresponding value in the data frame. Is there a function to create a new matrix with the corresponding values from the column names in M?
M <- t(data.frame(A=c("label_1","label_2","label_3"),
B=c("label_4","label_5","label_6"),
C=c("label_7","label_8","label_9")))
M
> [,1] [,2] [,3]
A "label_1" "label_2" "label_3"
B "label_4" "label_5" "label_6"
C "label_7" "label_8" "label_9"
df <- data.frame(label_2=5, label_1=0, label_4=7,
label_6=15, label_3=12, label_5=11,
label_9=9, label_8=15, label_7=35)
df
> label_2 label_1 label_4 label_6 label_3 label_5 label_9 label_8 label_7
1 5 0 7 15 12 11 9 15 35
## I want to create a new data.frame with the values from these labels
> [,1] [,2] [,3]
A 0 5 12
B 7 11 15
C 35 15 9
One possible way I'm aware of is to convert the data frame df to a key-value pair, with k = column names and v = values. I could then retrieve the values using:
apply(M,2,function(x){df[df$k==x,"v"]})
But this seems too overcomplicated for what should be a simple operation...
Additionally, I would prefer not to use any libraries outside of dplyr or tidyr to minimize the dependencies needed in my code.
Update: a simpler version using Onyambu's suggestion:
M <- t(data.frame(A=c("label_1","label_2","label_3"),
B=c("label_4","label_5","label_6"),
C=c("label_7","label_8","label_9")))
df <- data.frame(label_2=5, label_1=0, label_4=7,
label_6=15, label_3=12, label_5=11,
label_9=9, label_8=15, label_7=35)
P <- matrix(df[c(M)], nrow(M))
P
[,1] [,2] [,3]
[1,] 0 5 12
[2,] 7 11 15
[3,] 35 15 9
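If a plain numeric matrix (rather than the list-matrix that matrix() makes from a data frame) is preferred, one possible variant, keeping M's row names, is:
P2 <- matrix(unlist(df[c(M)]), nrow = nrow(M),
             dimnames = list(rownames(M), NULL))
P2
#   [,1] [,2] [,3]
# A    0    5   12
# B    7   11   15
# C   35   15    9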

How to find all comparisons from a large data set

I have a large data set from an experiment containing about 25000 spectra. I want to see if there is any common feature across all spectra. There is a built-in function, compareSpectra, for comparing two specific spectra, but I want to write a loop that gives me results for all possible comparisons. Finally, I want to make a data.frame or list of the results along with the identity of the compared spectrum numbers.
I wrote a simple loop that gives me comparisons of spectra 1 and 2, 2 and 3, 3 and 4, and 4 and 5.
for (i in 1:4){
  comparison <- compareSpectra(raw_25kda[[i]], raw_25kda[[i+1]], fun = "common")
  print(as.list(comparison))
}
From the loop, I get four numbers, 2, 5, 6 and 2, for the four comparisons (1 vs 2, 2 vs 3, 3 vs 4 and 4 vs 5).
The first comparison is between 1 and 2, and there are 2 common features. Is there any way I can explicitly print that 1 and 2 were compared and that there are 2 common features between them?
I also want comparisons of 1 and 3, 1 and 4, 2 and 4, and 3 and 4 as well.
Also, when I recall comparison later in a different R chunk, it gives me only one value, the last one (2). How can I save the list inside the loop for future analysis? Any help will be appreciated.
I don't have the data or packages you are using, so this might be a little off, but should hopefully point in the right direction.
Here are all the combinations of 5 data sets:
my_data_sets <- 1:5
combos <- combn(my_data_sets, m = 2, simplify = T)
combos
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,] 1 1 1 1 2 2 2 3 3 4
#[2,] 2 3 4 5 3 4 5 4 5 5
There are 10. Now we can initialize a list with ten elements to store our results in.
combo_len = dim(combos)[2]
output <- vector("list", combo_len)
for (i in seq_len(combo_len)) {  # iterate over every combination, not just the single value combo_len
  set1 = combos[1, i]
  set2 = combos[2, i]
  output[[i]] <- compareSpectra(raw_25kda[[set1]], raw_25kda[[set2]], fun = "common")
}
The output object should now have ten elements which each represent their respective combination.
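To keep track of which spectra each element compares (part of what was asked), one option, assuming compareSpectra() returns a single number per pair as in the question's output, is to name the results after the pairs and collect them in a data frame:
# label each result with the pair of spectra it came from, e.g. "1_vs_2"
names(output) <- paste0(combos[1, ], "_vs_", combos[2, ])
output["1_vs_2"]   # look up the comparison of spectra 1 and 2 by name

# or gather everything for later analysis
results <- data.frame(spectrum_a = combos[1, ],
                      spectrum_b = combos[2, ],
                      common     = unlist(output))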

Cartesian Product on column headers for Time Series Data

Say I had a dataframe like
d <- c("03-12-2018","03-11-2018")
g <- c(10,5)
p <- c(8,9)
a <- c(7,2)
df <- data.frame(d,g,p,a)
colnames(df) <- c("date","grapes","pears","apples")
df
date grapes pears apples
1 03-12-2018 10 8 7
2 03-11-2018 5 9 2
I essentially want output looking like:
date grapes_pears grapes_apples pears_apples
3-12-2018 2 3 1
3-11-2018 -4 3 7
So the values in the output table are just the difference between the first fruit and the second fruit in the column name. A basic Cartesian product on the headers (excluding the date column) is fine... I know I will receive pairs in both orders (grapes_pears, pears_grapes), which is just a sign change in the value, and also cases like grapes_grapes, but for now that is okay. I will refine later.
Thanks for your help.
You can try combn(), i.e.
combn(names(df[-1]), 2, FUN = function(i) Reduce(`-`, df[i]))
which gives,
[,1] [,2] [,3]
[1,] 2 3 1
[2,] -4 3 7
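If you also want the grapes_pears style column names from the desired output, one possible extension (a sketch, not part of the original answer) builds the names from the same combinations and attaches the date column:
pairs <- combn(names(df)[-1], 2)                                     # fruit pairs
res   <- combn(names(df)[-1], 2, FUN = function(i) Reduce(`-`, df[i]))
colnames(res) <- paste(pairs[1, ], pairs[2, ], sep = "_")
cbind(df["date"], res)
#         date grapes_pears grapes_apples pears_apples
# 1 03-12-2018            2             3            1
# 2 03-11-2018           -4             3            7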

Get row indices of data frame A according to multiple matching criteria in that data frame and another data frame, B

Let's say we have two data frames in R, df.A and df.B, defined thus:
bin_name <- c('bin_1','bin_2','bin_3','bin_4','bin_5')
bin_min <- c(0,2,4,6,8)
bin_max <- c(2,4,6,8,10)
df.A <- data.frame(bin_name, bin_min, bin_max, stringsAsFactors = FALSE)
obs_ID <- c('obs_1','obs_2','obs_3','obs_4','obs_5','obs_6','obs_7','obs_8','obs_9','obs_10')
obs_min <- c(6.5,0,8,2,1,7,5,6,8,3)
obs_max <- c(7,3,10,3,9,8,5.5,8,10,4)
df.B <- data.frame(obs_ID, obs_min, obs_max, stringsAsFactors = FALSE)
df.A defines the ranges of bins, while df.B consists of rows of observations with min and max values that may or may not fall entirely within a bin defined in df.A.
We want to generate a new vector of length nrow(df.B) containing the row indices of df.A corresponding to the bin in which each observation falls entirely. If an observation straddles a bin boundary or falls partially outside a bin, it can't be assigned to one and should return NA (or something similar).
In the above example, the correct output vector would be this:
bin_rows <- c(4, NA, 5, 2, NA, 4, 3, 4, 5, 2)
I came up with a long-winded solution using sapply:
bin_assignments <- sapply(1:nrow(df.B), function(i) which(df.A$bin_max >= df.B$obs_max[i] & df.A$bin_min <= df.B$obs_min[i])) #get bin assignments for every observation
bin_assignments[bin_assignments == "integer(0)"] <- NA #replace "integer(0)" entries with NA
bin_assignments <- do.call("c", bin_assignments) #concatenate the output of the sapply call
Several months ago I discovered a simple, single-line solution to this problem that didn't use an apply function. However, I forgot how I did this and I have not been able to rediscover it! The solution might involve match() or which(). Any ideas?
1) Using SQL it can readily be done in one statement:
library(sqldf)
sqldf('select a.rowid
from "df.B" b
left join "df.A" a on obs_min >= bin_min and obs_max <= bin_max')
rowid
1 4
2 NA
3 5
4 2
5 NA
6 4
7 3
8 4
9 5
10 2
2) merge/by We can do it in two statements using merge and by. No packages are used.
This does have the downside that it materializes the large join, which the SQL solution would not need to do.
Note that df.B, as defined in the question, sorts obs_10 as the second level rather than the 10th. If obs_10 sorted as the 10th level, the second argument to by could have been just m$obs_ID, so fixing up the input first could simplify this.
m <- merge(df.B, df.A)
stack(by(m, as.numeric(sub(".*_", "", m$obs_ID)),
with, c(which(obs_min >= bin_min & obs_max <= bin_max), NA)[1]))
giving:
values ind
1 4 1
2 NA 2
3 5 3
4 2 4
5 NA 5
6 4 6
7 3 7
8 4 8
9 5 9
10 2 10
3) sapply Note that using the c(..., NA)[1] trick from (2), we can simplify the sapply solution in the question to one statement:
sapply(1:nrow(df.B), function(i)
c(which(df.A$bin_max >= df.B$obs_max[i] & df.A$bin_min <= df.B$obs_min[i]), NA)[1])
giving:
[1] 4 NA 5 2 NA 4 3 4 5 2
3a) mapply A nicer variation of (3) using mapply is given by @Ronak Shah in the comments:
mapply(function(x, y) c(which(x >= df.A$bin_min & y <= df.A$bin_max), NA)[1],
df.B$obs_min,
df.B$obs_max)
4) outer Here is another one-statement solution that uses no packages.
seq_len(nrow(df.A)) %*%
(outer(df.A$bin_max, df.B$obs_max, ">=") & outer(df.A$bin_min, df.B$obs_min, "<="))
giving:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 4 0 5 2 0 4 3 4 5 2
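Note that with this matrix-multiplication approach, observations that fall in no bin come out as 0 rather than NA; if NA is required, a small post-processing step (not part of the original answer) works:
res <- c(seq_len(nrow(df.A)) %*%
           (outer(df.A$bin_max, df.B$obs_max, ">=") &
            outer(df.A$bin_min, df.B$obs_min, "<=")))
res[res == 0] <- NA
res
# [1]  4 NA  5  2 NA  4  3  4  5  2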
