I have 20 data frames of different lengths, but all with the same number of columns. I would like to run some analyses, in this case a distance matrix using vegan, for each of these data frames. I have searched around and figure I am missing a step somewhere.
The dummy data below uses 5 data frames, and I have been trying to use lapply.
df1<- matrix(data = c(1:100), nrow = 10, ncol = 10)
df2<- matrix(data = c(1:150), nrow = 15, ncol = 10)
df3<- matrix(data = c(1:50), nrow = 5, ncol = 10)
df4<- matrix(data = c(1:200), nrow = 20, ncol = 10)
df5<- matrix(data = c(1:100), nrow = 10, ncol = 10)
Y<- list(df1, df2, df3, df4, df5)
Y.dc <- lapply(Y, dist.ldc(Y, "chord"))
I have also tried just running it on the list directly, and I keep getting errors there too.
Y.dc<- dist.ldc(Y, "chord")
Ideally, I would like to not run 20 lines/chunks of code for each frame.
Eventually, I would also like to be able to generate nMDS plots, and run PERMANOVAs on each of the data frames all at once as well. Would I need to write/run a function in order to accomplish that?
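A valid syntax (pass the function itself to lapply and supply its extra arguments afterwards, instead of calling it on the whole list):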
Y.dc <- lapply(Y, dist.ldc, method = "chord")
(I assumed the function dist.ldc comes from the package adespatial, which I am not familiar with.)
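For the follow-up about running several analyses per data frame, one option is to wrap the per-frame steps in a small function and hand that to lapply. The sketch below is only an illustration: it uses random dummy matrices, and it swaps in stats::dist and stats::cmdscale (classical MDS) as stand-ins for dist.ldc and nMDS so that it runs without extra packages. The helper name analyse_one is hypothetical.

```r
# Random dummy data (an assumption; replace with the real data frames)
set.seed(1)
Y <- list(
  df1 = matrix(rnorm(100), nrow = 10, ncol = 10),
  df2 = matrix(rnorm(150), nrow = 15, ncol = 10),
  df3 = matrix(rnorm(50),  nrow = 5,  ncol = 10)
)

# Hypothetical helper: every per-frame step goes in here
analyse_one <- function(m) {
  d <- dist(m, method = "euclidean")  # stand-in for dist.ldc(m, "chord")
  ord <- cmdscale(d, k = 2)           # classical MDS as a stand-in for nMDS
  list(dist = d, ordination = ord)
}

# One call processes every element and keeps the list names
results <- lapply(Y, analyse_one)
names(results)
```

Each element of results then holds the outputs for one data frame, for example results$df2$dist.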
Related
I have an array Q of size nquantiles by nfeatures by nfeatures. The slice Q[1,,] gives the first quantile of my data across all nfeatures by nfeatures entries.
I would like to use another matrix M (also of size nfeatures by nfeatures), which represents some other data, and ask which quantile in Q each element of M lies closest to.
What would be the quickest way to do this?
I reckon I could do a double for loop over all rows and columns of the matrix M and come up with a solution similar to this: Finding the closest index to a value in R
But doing this over all nfeatures x nfeatures values will be very inefficient. I am hoping that there might exist a vectorized way of approaching this problem, but I am at a loss as to how to approach this.
Here is a reproducible example of the slow approach, with O(N^2) complexity.
#Generate some data
set.seed(235)
data = rnorm(n = 100, mean = 0, sd = 1)
list_of_matrices = list(matrix(data = data[1:25], ncol = 5, nrow = 5),
matrix(data = data[26:50], ncol = 5, nrow = 5),
matrix(data = data[51:75], ncol = 5, nrow = 5),
matrix(data = data[76:100], ncol = 5, nrow = 5))
#Get the quantiles (5 quantiles here)
Q <- apply(simplify2array(list_of_matrices), 1:2, quantile, prob = c(seq(0,1,length = 5)))
#dim(Q)
#Q should have dims nquantiles by nfeatures by nfeatures
#Generate some other matrix M (true-data)
M = matrix(data = rnorm(n = 25, mean = 0, sd = 1), nrow = 5, ncol = 5)
#Loop through rows and columns in M to find which index of the array matches up closest with element M[i,j]
results = matrix(data = NA, nrow = 5, ncol = 5)
for (i in 1:nrow(M)) {
for (j in 1:ncol(M)) {
true_value = M[i,j]
#Subset Q to the ith and jth element (vector of nquantiles)
quantiles = Q[,i,j]
results[i,j] = (which.min(abs(quantiles-true_value)))
}
}
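One possible way to avoid the explicit double loop, shown here only as a sketch: subtract M from every quantile slice of Q in a single sweep call, then take which.min along the quantile dimension. The names abs_diff and results_fast are introduced for illustration. Note that apply still iterates internally, so any speed-up should be measured rather than assumed.

```r
# Rebuild the example data from the question
set.seed(235)
data <- rnorm(n = 100, mean = 0, sd = 1)
list_of_matrices <- list(matrix(data[1:25],   nrow = 5, ncol = 5),
                         matrix(data[26:50],  nrow = 5, ncol = 5),
                         matrix(data[51:75],  nrow = 5, ncol = 5),
                         matrix(data[76:100], nrow = 5, ncol = 5))
Q <- apply(simplify2array(list_of_matrices), 1:2, quantile,
           probs = seq(0, 1, length.out = 5))
M <- matrix(rnorm(n = 25, mean = 0, sd = 1), nrow = 5, ncol = 5)

# Subtract M[i, j] from Q[, i, j] for all i, j at once
abs_diff <- abs(sweep(Q, 2:3, M))

# Index of the closest quantile for every (i, j) cell
results_fast <- apply(abs_diff, 2:3, which.min)
dim(results_fast)
```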
Currently I am working on an R project where I iteratively have to make a lot of small changes to the final output, which is stored in a self-made class. The calculation time becomes very large as the number of iterations increases. Unfortunately, a more vectorized version of the code is not possible, because the future values to be changed depend on the current changes.
A small example of the problem that I encounter is given below. The calculations using Example_Class take significantly longer than those where just a matrix is used. Are there methods available to speed up the calculations in R? Or should I look at an extension in, for example, C++?
Example_Class <- setClass(
"Example",
slots = c(
slot_1 = "matrix",
slot_2 = "matrix"
),
prototype=list(
slot_1 = matrix(1, ncol = 1, nrow = 7),
slot_2 = matrix(1, ncol = 4, nrow = 7)
)
)
Example <- Example_Class()
example_matrix_1 <- matrix(1, ncol = 1, nrow = 7)
example_matrix_2 <- matrix(1, ncol = 4, nrow = 7)
example_list <- list(example_matrix_1, example_matrix_2)
profile <- microbenchmark::microbenchmark(
example_matrix_2[3,3] <- (example_matrix_2[3,3] + 1)/2,
example_list[[2]][3,3] <- (example_list[[2]][3,3] + 1)/2,
Example@slot_2[3,3] <- (Example@slot_2[3,3] + 1)/2,
times = 1000
)
profile
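One thing that may be worth trying before moving to C++, offered as a sketch rather than a guaranteed fix: copy the slot into a plain local matrix, run the iterative updates on that matrix, and write it back to the object once at the end. This assumes nothing else needs to read the slot from the object while the loop is running. The names obj_direct, obj_local, m and n_iter are hypothetical.

```r
library(methods)

Example_Class <- setClass(
  "Example",
  slots = c(slot_1 = "matrix", slot_2 = "matrix"),
  prototype = list(
    slot_1 = matrix(1, ncol = 1, nrow = 7),
    slot_2 = matrix(1, ncol = 4, nrow = 7)
  )
)

n_iter <- 1000

# Variant 1: modify the slot directly in every iteration
obj_direct <- Example_Class()
for (k in seq_len(n_iter)) {
  obj_direct@slot_2[3, 3] <- (obj_direct@slot_2[3, 3] + 1) / 2
}

# Variant 2: work on a local matrix and assign the slot once
obj_local <- Example_Class()
m <- obj_local@slot_2
for (k in seq_len(n_iter)) {
  m[3, 3] <- (m[3, 3] + 1) / 2
}
obj_local@slot_2 <- m

# Both variants end with the same slot contents
identical(obj_direct@slot_2, obj_local@slot_2)
```

Timing both variants on the real workload would show whether the slot assignment is actually the bottleneck.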
I have a simple 12 x 2 matrix called m that contains my dataset (see below).
Question
I was wondering why when I use dimnames(m) to create two names for the two columns of my data, I run into an Error? Is there a better way to create column names for this data in R?
Here is my R code:
Group1 = rnorm(6, 7) ; Group2 = rnorm(6, 9)
Level = gl(n = 2, k = 6)
m = matrix(c(Group1 , Group2, Level), nrow = 12, ncol = 2)
dimnames(m) <- list( DV = Group1, Level = Level)
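dimnames(m) expects a list holding one vector of row names (length 12) and one vector of column names (length 2), so passing Group1 (length 6) and Level (length 12) fails. To name only the columns, replace the dimnames(m) call with: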
colnames(m) <- c("DV","Level")
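If dimnames is preferred, the same result can be obtained by passing NULL for the row names. A minimal sketch using the data from the question (set.seed is added only to make it reproducible):

```r
set.seed(1)
Group1 <- rnorm(6, 7)
Group2 <- rnorm(6, 9)
Level  <- gl(n = 2, k = 6)
m <- matrix(c(Group1, Group2, Level), nrow = 12, ncol = 2)

# First list element: row names (none); second: column names
dimnames(m) <- list(NULL, c("DV", "Level"))
head(m)
```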
I often need to write something like
sample_size = 10^4
my_data <- data.frame(x1 = runif(sample_size, 0,3), x2 = runif(sample_size, 0,3), x3 = runif(sample_size, 0,3), x4 = runif(sample_size, 0,3))
in order to test some statistical models. For example,
error <- rnorm(sample_size, 0, 0.1)
y <- with( my_data, 2*x1+0.1*(x2 + x3 + x4)) + error
my_model <- lm(y ~ ., data = my_data)
Since my_data is used as input to lm, it has to be a data frame (or a list).
I wonder if invoking runif 4 times is the right way to do this, or if there are better solutions. I tried
my_data <- matrix(runif(4*sample_size, 0,3), sample_size, 4, dimnames = list(NULL, paste0("x", 1:4)))
my_data <- as.data.frame(my_data)
But it doesn't seem so readable to me.
There are a few ways to do this. Let's say you want ncol columns, here are some good ways:
ncol = 4
sample_size = 10
replicate(ncol, runif(sample_size, 0, 3))
matrix(runif(sample_size * ncol, 0, 3), ncol = ncol)
sapply(1:ncol, function(x) runif(sample_size, 0, 3))
These create matrices which you can, of course, convert to data frames as needed. The differences are minor. replicate is essentially a nice wrapper for sapply. The direct matrix method may be slightly faster, but probably the difference is a few milliseconds.
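To get from one of these matrices to the named data frame that lm needs, a short sketch could look like the following. The variable n_col is used instead of ncol only to avoid shadowing the base function name.

```r
set.seed(1)
n_col <- 4
sample_size <- 100

# Build the matrix, convert it, then name the columns x1..x4
my_data <- as.data.frame(replicate(n_col, runif(sample_size, 0, 3)))
names(my_data) <- paste0("x", seq_len(n_col))

# Same model as in the question
error <- rnorm(sample_size, 0, 0.1)
y <- with(my_data, 2 * x1 + 0.1 * (x2 + x3 + x4)) + error
my_model <- lm(y ~ ., data = my_data)
names(coef(my_model))
```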
I wrote the following code to cluster data:
clusrer.data <- function(data,n) {
miRNA.exp.cluster <- scale(t(miRNA.exp))
k.means.fit <- kmeans(miRNA.exp.cluster,n)
#I tried to save the results of the k-means clustering with this code:
k.means.fit <- as.data.frame(k.means.fit)
write.csv(k.means.fit, file="k-meanReslut.csv")
#x<-k.means.fit$clusters
#write.csv(x, file="k-meanReslut.csv")
}
But I cannot save the clusters (8, 6, 7, 20, 18) this way. I want to save each cluster separately (with its rows and columns) in a TXT or CSV file.
Here is one approach of splitting the original dataset according to cluster and saving that chunk to a file. I added cluster assignment to the original dataset for easier visual check. Example is partly taken from ?kmeans. Feel free to adapt the way files are written, as well as the way file name is created.
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
x <- cbind(x, cluster = cl$cluster)
by(x, INDICES = cl$cluster, FUN = function(sp) {
write.table(sp, file = paste0("file", unique(sp$cluster), ".txt"),
row.names = TRUE, col.names = TRUE)
})
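An alternative sketch of the same idea uses split to break the data into one data frame per cluster and write.csv to save each piece. The file names (cluster_1.csv and so on) and the use of tempdir() as the output folder are assumptions made for illustration; point out_dir at any folder you like.

```r
set.seed(42)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, 2)

# Attach the cluster assignment, then split into one data frame per cluster
dat <- data.frame(x, cluster = cl$cluster)
chunks <- split(dat, dat$cluster)

# Write one CSV file per cluster
out_dir <- tempdir()
paths <- file.path(out_dir, paste0("cluster_", names(chunks), ".csv"))
for (i in seq_along(chunks)) {
  write.csv(chunks[[i]], file = paths[i], row.names = TRUE)
}
basename(paths)
```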