I have a large matrix (28960×45807 Array{Float64,2}), where rows represent individuals and columns represent SNP IDs. Now I want to get a subset matrix (28960×4580) whose SNPs are selected randomly from the large matrix.
How can I do it in Julia?
Assuming that your matrix is x:
using StatsBase
@view x[:, sample(1:size(x,2), 4580, replace=false, ordered=true)]
Explanation:
Using @view avoids copying the data. It could be omitted from this command, but that would worsen performance.
The colon : selects all rows; the second index of the array slice selects the columns.
size(x,2) returns the number of columns.
We use sample from StatsBase to sample the column numbers without replacement (replace=false). Additionally, I assumed that you do not want to change the order of the columns, hence ordered=true.
Related
I am running some big simulations where in each iteration a matrix of unknown row number gets created. I want to unite all these matrices into one big matrix. Intuitively, the easiest way to do this is setting up an empty matrix before the loop and then appending each result of an iteration with rbind. However, this is very inefficient because of memory allocation and I need to have a fast script!
Therefore, I decided to create a sparse results matrix filled with 0s prior to the loop and then replace its rows iteratively with the new matrices. The size of this results matrix is very exaggerated to make sure that I have "enough" rows. The "unused" rows get removed after the loop. An example of my current solution is as follows (this is just dummy code, the real code is more complex):
library(Matrix)
#Create empty results matrix
matrix_to_fill<-sparseMatrix(i=integer(0),j=integer(0),x=0,
                             dims=c(100,10))
#Run 10 iterations where a new matrix of unknown size is created
for(i in 1:10){
  #Find first row that is a 0
  first0<-(1:100)[matrix_to_fill[,1]==0][1]
  #Create a new matrix with random number of rows
  new_matrix<-matrix(rep(1,10*(rpois(1,2)+1)),ncol=10)
  #Replace section of matrix_to_fill starting at first 0-line with new matrix
  matrix_to_fill[first0:(first0+nrow(new_matrix)-1),]<-new_matrix
}
#Remove the unused rows
matrix_to_fill<-matrix_to_fill[matrix_to_fill[,1]!=0,]
However, again, I am running into a similar problem with memory allocation. Since I do not know what is the number of rows of each of my matrices, I have to store them first, calculate their row number and then replace the respective rows in my results matrix.
Is there a way I can replace an unknown number of rows in my results table with a new matrix (I do know the starting row and I do know that my table is big enough to "fit" the new matrix)? Or do I have to solve this by creating a "results list" of known size prior to the loop instead of a "results matrix" and then fill each new matrix into the list? This would be of course possible, but I'm afraid this would be less efficient...
Thanks!
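The "results list" idea mentioned in the question can be sketched as follows. This is only a minimal illustration, assuming every per-iteration matrix has the same number of columns; the names results, n_iter and final_matrix are made up for the example, and whether it beats the preallocated sparse matrix would have to be timed on the real simulation.

```r
set.seed(1)                        # only to make the dummy data reproducible
n_iter <- 10                       # hypothetical number of iterations
results <- vector("list", n_iter)  # the list length is known in advance

for (i in seq_len(n_iter)) {
  # Dummy matrix with a random, unknown number of rows and 10 columns
  new_matrix <- matrix(1, nrow = rpois(1, 2) + 1, ncol = 10)
  # Store it as list element i; no start-row bookkeeping is needed
  results[[i]] <- new_matrix
}

# Bind all pieces together once, after the loop
final_matrix <- do.call(rbind, results)
```

The row counts never have to be known inside the loop here, because rbind works them out in the single call at the end.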
If my data frame has mixed types in it (continuous and categorical), and I want to compute pair correlations using, for instance, pairs in R, is there a way to quickly only select the numerical type columns from the frame?
Emulate this solution as applied to a data frame X:
pairs(subset(X, select=sapply(X, is.numeric)))
Note that pairs will fail when X has fewer than two numeric columns, so for general-purpose use, consider encapsulating this in a function that checks the result of sapply before doing the subsetting.
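A possible shape for such a wrapper is sketched below. The function name select_numeric and its min_cols argument are hypothetical, and vapply is used here in place of sapply only because it always returns a logical vector, even for a data frame with no columns.

```r
# Hypothetical helper: return only the numeric columns of data frame X,
# stopping early if there are fewer than min_cols of them.
select_numeric <- function(X, min_cols = 2) {
  is_num <- vapply(X, is.numeric, logical(1))
  if (sum(is_num) < min_cols) {
    stop("X needs at least ", min_cols, " numeric columns")
  }
  X[, is_num, drop = FALSE]
}

# Made-up example data with one categorical column
X <- data.frame(height = c(1.5, 2.0, 3.5),
                group  = c("u", "v", "w"),
                count  = 1:3)
num_X <- select_numeric(X)
```

With this helper, pairs(select_numeric(X)) would plot only the height and count columns, and a frame with a single numeric column would raise an explicit error before pairs is called.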
I'm trying to get the correlation coefficient for corresponding columns of two csv files. I simply use the following but get errors. Consider that each csv file has 50 columns.
first_values <- read.csv("")
second_values <- read.csv("")
correlation.csv <- cor(x = first_values, y = second_values, method = "spearman")
But I get the error: 'x' must be numeric!
subset of one csv file
Thanks for your help
The read.table function and all of its derivatives return a data.frame, which is an R list object. The mapply function processes lists in "parallel". If the matching columns are in the same order in the two datasets, have the same number of rows, and do not have spaces in their names, it would be as simple as:
mapply(cor, first_values, second_values)
If it's more complicated than that, then you need to fill in the missing details with example data by editing the question (not by responding in comments).
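As a small illustration of the mapply approach, here is a sketch on made-up data. The two data frames are invented for the example, and MoreArgs is used to request the Spearman method the question asked for, since cor defaults to Pearson.

```r
# Made-up data: two data frames with the same column names and row count
first_values  <- data.frame(a = 1:5, b = c(2, 4, 6, 8, 10))
second_values <- data.frame(a = 5:1, b = c(1, 2, 3, 4, 5))

# Correlate column 1 with column 1, column 2 with column 2, and so on.
# MoreArgs passes method = "spearman" to every cor() call.
col_cor <- mapply(cor, first_values, second_values,
                  MoreArgs = list(method = "spearman"))
```

The result is a named numeric vector with one coefficient per column pair, rather than the full cross-correlation matrix that cor(first_values, second_values) would return.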
There must be some categorical variable in X. So you can first separate that categorical variable from X and then use X in the cor() function.
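One way to find and set aside the offending columns might look like this. The data frame and its column names are invented for illustration; the real files may differ.

```r
# Made-up data frame with one categorical column
X <- data.frame(x1  = c(1, 2, 3, 4),
                sex = c("f", "m", "f", "m"),
                x2  = c(2, 1, 4, 3))

is_num <- vapply(X, is.numeric, logical(1))
non_numeric <- names(X)[!is_num]   # columns that trigger "'x' must be numeric"

# Correlation matrix of the remaining numeric columns only
cor_mat <- cor(X[is_num], method = "spearman")
```

Inspecting non_numeric first shows which columns were read as text or factors, which is often a sign of a stray header or a non-numeric entry in the csv file.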
I want to know if a vector is 1xN or Nx1 in R. What function should I use? Length returns only one value regardless of the vector type.
As AnandaMahto's comment notes, NROW and NCOL return the number of rows and columns.
In R, vectors don't have dimensions. The dimension of a vector is NULL, whereas arrays, matrices, data frames, and tables have dimensions.
If you want to know the value of N (that is, the number of elements in a vector), you can use the length function.
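The difference between a plain vector and an explicit 1×N or N×1 matrix can be checked with a short sketch like the one below; the variable names are only for illustration.

```r
v <- c(10, 20, 30)               # plain vector
dim(v)                           # NULL: a vector has no dimensions
length(v)                        # 3
NROW(v)                          # 3: a vector is treated like one column
NCOL(v)                          # 1

row_mat <- matrix(v, nrow = 1)   # explicit 1 x 3 matrix
col_mat <- matrix(v, ncol = 1)   # explicit 3 x 1 matrix
dim(row_mat)                     # 1 3
dim(col_mat)                     # 3 1
```

So the 1×N versus N×1 distinction only exists once the data is stored as a matrix; for a bare vector, NROW and NCOL always report N and 1.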
Hopefully this has an easy answer I just haven't been able to find:
I am trying to write a simulation that will compare a number of statistical procedures on different subsets of rows (subjects) and columns (variables) of a large matrix.
Subsets of rows was fairly easy using a sample() of the subject ID numbers, but I am running into a little more trouble with columns.
Essentially, what I'd like to be able to do is create a random sample of column index numbers which will then be used to create a new matrix. What's got me the closest so far is:
testmat <- matrix(rnorm(100000),nrow=1000,ncol=100)
column.ind <- sample(3:100,20)
teststr <- paste("testmat[,",column.ind,"]",sep="",collapse=",")
which gives me a string that has a testmat[,column.ind] for every sampled index number. Is there any way to easily plug that into a cbind() function to make a new matrix? Is there any other obvious way I'm missing?
I've been able to do it using a loop (i.e. cbind(matrix,newcolumn) over and over), but that's fairly slow as the matrix I'm using is quite large and I will be doing this many times. I'm hoping there's a couple-line solution that's more elegant and quicker.
Have you tried testmat[, column.ind]?
Rows and columns can be indexed in the same way with logical vectors, a set of names, or numbers for indexes.
See here for an example: http://ideone.com/EtuUN.
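For completeness, the three indexing styles can be sketched on the test matrix from the question. The column names V1 to V100 are an assumption added here so that indexing by name can be shown.

```r
set.seed(42)
testmat <- matrix(rnorm(1000 * 100), nrow = 1000, ncol = 100)
colnames(testmat) <- paste0("V", 1:100)    # hypothetical column names

column.ind <- sample(3:100, 20)            # 20 distinct column numbers

by_number <- testmat[, column.ind]                    # numeric index
by_name   <- testmat[, colnames(testmat)[column.ind]] # character index

# Logical index: TRUE for every sampled column
keep <- seq_len(ncol(testmat)) %in% column.ind
by_logical <- testmat[, keep]
```

Note that the logical index returns the selected columns in their original left-to-right order, while the numeric and character indexes keep the order in which the columns were sampled.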