I would like to know if there is a 'proper' way to subset big.matrix objects in R. It is simple to subset a matrix, but the class always reverts to 'matrix'. This isn't a problem when working with small datasets like this, but with extremely large datasets the subset could still benefit from the 'big.matrix' class.
require(bigmemory)
data(iris)
# I realize the warning about factors but not important for this example
big <- as.big.matrix(iris)
class(big)
[1] "big.matrix"
attr(,"package")
[1] "bigmemory"
class(big[,c("Sepal.Length", "Sepal.Width")])
[1] "matrix"
class(big[,1:2])
[1] "matrix"
I have since learned that the 'proper' way to subset a big.matrix is to use sub.big.matrix although this is only for contiguous columns and/or rows. Non-contiguous subsetting is not currently implemented.
sm <- sub.big.matrix(big, firstCol=1, lastCol=2)
It doesn't seem to be possible without calling as.big.matrix on the subset.
From the big.matrix documentation,
If x is a big.matrix, then x[1:5,] is returned as an R matrix containing the first five rows of x.
I presume this applies to columns as well. So it seems you would need to call
a <- as.big.matrix(big[,1:2])
in order for the subset to also be a big.matrix object.
class(a)
# [1] "big.matrix"
# attr(,"package")
# [1] "bigmemory"
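If a non-contiguous column subset is needed without round-tripping through an ordinary R matrix, bigmemory's deepcopy() accepts a cols argument and returns a new big.matrix. A minimal sketch, assuming a current bigmemory version where deepcopy() supports cols:

```r
library(bigmemory)

# build a small big.matrix from the numeric iris columns
big <- as.big.matrix(as.matrix(iris[, 1:4]))

# copy a non-contiguous column subset into a new big.matrix
sub <- deepcopy(big, cols = c(1, 3))

class(sub)  # still a big.matrix
dim(sub)    # 150 rows, 2 columns
```

Note that deepcopy makes a physical copy of the data, unlike sub.big.matrix, which shares the original's memory.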
Related
I would like to know if there is any way (I'm sure it is) to get the elements of the
additive relationship matrix A in R.
I already have the pedigree and I was successful in getting the A matrix in two different ways:
by using the function makeA from the pedigree package:
library(pedigree)
makeA(pedigree_renum, which = pedigree_renum$ID=="1-2372") #for all the animals
#> [1] TRUE
but I cannot get the elements from the matrix.
by using the function getA from the pedigreemm package. In this case I get the 2372*2372 A matrix:
class(pedigree_general)
#> [1] "pedigree"
attr(,"package")
#> [1] "pedigreemm"
matrizA<-getA(pedigree_general)
class(matrizA)
#> [1] "dsCMatrix"
attr(,"package")
#> [1] "Matrix"
But I can't find out how to save certain elements from the matrix, such as the upper-triangular elements.
Hope some of you can help me figure this out!
Different approaches to obtain the same result are welcome :)
Greetings from Buenos Aires.
From Pedigree's documentation on makeA:
Makes the A matrix for a part of a pedigree and stores it in a file called A.txt
What you have missed, if I'm not mistaken, is that the matrix you are searching for should be loaded from the file A.txt, which is the output file of the command. Example:
id <- 1:6
dam <- c(0,0,1,1,4,4)
sire <- c(0,0,2,2,3,5)
ped <- data.frame(id,dam,sire)
makeA(ped,which = c(rep(FALSE,4),rep(TRUE,2)))
A <- read.table("A.txt")
After printing A, here is the matrix:
Let me know if I'm missing something.
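As for extracting the upper-triangular elements the question asks about, base R's upper.tri() produces a logical mask that works for subsetting (for the dsCMatrix from getA, coerce first with as.matrix(matrizA)). A minimal sketch on a small symmetric matrix standing in for A:

```r
# a small symmetric matrix standing in for the relationship matrix A
A <- matrix(c(1.00, 0.50, 0.25,
              0.50, 1.00, 0.50,
              0.25, 0.50, 1.00), nrow = 3, byrow = TRUE)

# strictly upper-triangular elements, taken column by column
upper <- A[upper.tri(A)]
upper  # 0.50 0.25 0.50

# include the diagonal as well
upper_with_diag <- A[upper.tri(A, diag = TRUE)]
```

The elements come out in column-major order; if you need to know which (row, column) pair each value belongs to, which(upper.tri(A), arr.ind = TRUE) returns the indices in the same order.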
I'm not sure if this is possible, or even how to get a good solution for the following R problem.
Data / Background / Structure:
I've collected a big dataset of project-based cooperation data, which maps specific projects to the participating companies (this can be understood as a bipartite edgelist for social network analysis). For analytical reasons it is advisable to subset the whole dataset into different subsets for different locations and time periods. Therefore, I've created the following data structure:
sna.location.list
[[1]] (location1)
[[1]] (is a dataframe containing the bip. edge-list for time-period1)
[[2]] (is a dataframe containing the bip. edge-list for time-period2)
...
[[20]] (is a dataframe containing the bip. edge-list for time-period20)
[[2]] (location2)
... (same as 1)
...
[[32]] (location32)
...
Every dataframe contains a project id and the corresponding company ids.
My goal is now to transform the bipartite edgelists to one-mode networks and then do some further sna-related-calculations (degree, centralization, status, community detection etc.) and save them.
I know how to do these calculation steps with one(!) specific network, but it gives me a really hard time to automate this process for all of the networks in the described list structure at once, and to save the various outputs (node-level and network-level variables) in a similar structure.
I have already tried several for-loop and apply approaches, but it still gives me sleepless nights and right now I feel quite helpless. Any help or suggestions would be highly appreciated. If you need more information or examples to give me a brief demo or code example of how to tackle such a nested structure and do such SNA-related calculations/modifications for all of the aforementioned subsets in an efficient, automatic way, please feel free to contact me.
Let's say you have a function foo that you want to apply to each data frame. Those data frames are in lists, so lapply(that_list, foo) is what we want. But you've got a bunch of lists, so we actually want to lapply that first lapply across the outer list, hence lapply(that_list, lapply, foo). (The foo will be passed along to the inner lapply via .... If you wish to be more explicit, you can use an anonymous function instead: lapply(that_list, function(x) lapply(x, foo)).)
You haven't given a reproducible example, so I'll demonstrate by applying the nrow function to a list of built-in data frames:
d = list(
list(mtcars, iris),
list(airquality, faithful)
)
result = lapply(d, lapply, nrow)
result
# [[1]]
# [[1]][[1]]
# [1] 32
#
# [[1]][[2]]
# [1] 150
#
#
# [[2]]
# [[2]][[1]]
# [1] 153
#
# [[2]][[2]]
# [1] 272
As you can see, the output is a list with the same structure. If you need the names, you can switch to sapply with simplify = FALSE.
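For instance, with named inner and outer lists (the names loc1/loc2 and t1/t2 here are made up for the demo), the names carry through to the result:

```r
d <- list(loc1 = list(t1 = mtcars, t2 = iris),
          loc2 = list(t1 = airquality, t2 = faithful))

# simplify = FALSE keeps each level as a list instead of collapsing to a vector
result <- sapply(d, function(x) sapply(x, nrow, simplify = FALSE),
                 simplify = FALSE)

result$loc1$t2  # 150 (rows of iris)
```

This makes it easy to pull a particular location/period result out by name later on.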
This covers applying functions to a nested list and saving the returns in a similar data structure. If you need help with calculation efficiency, parallelization, etc., I'd suggest asking a separate question focused on that, with a reproducible example.
In order to run Random Forest models over very large datasets, I have divided my data into chunks and have run randomForest::randomForest() on each chunk. The resulting randomForest objects are contained in a list. I now need to use randomForest::combine() to combine the trees from each chunk of data.
My question is, how do I use a function such as combine() over all objects in a list? In my understanding, sapply(), etc. apply a function to each object in a list--not what I want to do. I need to use combine() over all randomForest objects in the list; or if that is not directly possible, I need to pull out each object separately and send it to combine(). Another issue is that I have different datasets with a varying number of data chunks; I want the code to be flexible in regards to the number of chunks.
My list (rf.final) contains objects "1" through "5" which are each randomForest objects:
> class(rf.final)
[1] "list"
> names(rf.final)
[1] "1" "2" "3" "4" "5"
> class(rf.final[[1]])
[1] "randomForest.formula" "randomForest"
There are 5 objects just because I had 5 chunks of data for this particular dataset.
I haven't included str(rf.final) because the output is huge [even just for str(rf.final[[1]])] but I can if desired.
I finally found the solution! Use the do.call() function in the base package.
I.e.
rf.final2 <- do.call("combine", rf.final)
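do.call(f, args) calls f with the elements of the list args as its individual arguments, so the line above is equivalent to combine(rf.final[[1]], rf.final[[2]], ..., rf.final[[5]]) and automatically adapts to however many chunks the list holds. The same pattern can be illustrated with base R's rbind standing in for randomForest::combine (which needs fitted forests to run):

```r
# three data chunks of varying size
chunks <- list(data.frame(x = 1:2),
               data.frame(x = 3:4),
               data.frame(x = 5:7))

# equivalent to rbind(chunks[[1]], chunks[[2]], chunks[[3]])
stacked <- do.call(rbind, chunks)

nrow(stacked)  # 7
```
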
When I save an object from R using save(), what determines the size of the saved file? Clearly it is not the same (or close to) the size of the object determined by object.size().
Example:
I read a data frame and saved it using
snpmat=read.table("Heart.txt.gz",header=T)
save(snpmat,file="datamat.RData")
The size of the file datamat.RData is 360MB.
> object.size(snpmat)
4998850664 bytes #Much larger
Then I performed some regression analysis and obtained another data frame adj.snpmat of the same dimensions (6820000 rows and 80 columns).
> object.size(adj.snpmat)
4971567760 bytes
I save it using
> save(adj.snpmat,file="adj.datamat.RData")
Now the size of the file adj.datamat.RData is 3.3GB. I'm confused why the two files are so different in size while the object.size() gives similar sizes. Any idea about what determines the size of the saved object is welcome.
Some more information:
> typeof(snpmat)
[1] "list"
> class(snpmat)
[1] "data.frame"
> typeof(snpmat[,1])
[1] "integer"
> typeof(snpmat[,2])
[1] "double" #This is true for all columns except column 1
> typeof(adj.snpmat)
[1] "list"
> class(adj.snpmat)
[1] "data.frame"
> typeof(adj.snpmat[,1])
[1] "character"
> typeof(adj.snpmat[,2])
[1] "double" #This is true for all columns except column 1
Your matrices are very different and therefore compress very differently.
SNP data contains only a few distinct values (e.g., 1 or 0) and is also very sparse, which makes it very easy to compress. For example, a matrix of all zeros could be stored as just a single value (0) plus the dimensions.
Your regression matrix contains many different values, and they are real numbers (I'm assuming p-values, coefficients, etc.). This makes it much less compressible.
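The effect is easy to reproduce: save() compresses by default (gzip), so a repetitive matrix produces a far smaller file than an equally sized matrix of random doubles, even though object.size() reports the same in-memory footprint for both. A minimal sketch using temp files:

```r
set.seed(1)
repetitive <- matrix(0, 1000, 100)           # all zeros: compresses extremely well
random     <- matrix(rnorm(1e5), 1000, 100)  # random doubles: barely compressible

f1 <- tempfile(fileext = ".RData")
f2 <- tempfile(fileext = ".RData")
save(repetitive, file = f1)  # save() gzip-compresses by default
save(random,     file = f2)

object.size(repetitive) == object.size(random)  # same size in memory
file.size(f1) < file.size(f2)                   # very different size on disk
```
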
I have bunches of data frames in my R workspace, and I have exactly the same processing to apply to each of them. Since I am too "lazy" to run the commands for each data frame one by one, I would like to treat them as a group and process them in a loop, which saves time.
Let's say, simply, that I want to apply as.data.frame to those matrices, as a stand-in for my real serial data processing.
# dummy data
set.seed(1026)
a<-matrix(rnorm(100),50,2)
b<-matrix(rnorm(100),50,2)
c<-matrix(rnorm(100),50,2)
# process data one-by-one which is not good
a<-as.data.frame(a)
b<-as.data.frame(b)
c<-as.data.frame(c)
I could do this, but it is time-consuming. So I turned to a lazy but quick way, similar to using the *apply family on rows or columns inside a data.frame:
sapply(c(a,b,c), as.data.frame) or sapply(list(a,b,c), as.data.frame), or even:
> for (dt in c(a,b,c)){
+ dt<-as.data.frame(dt)
+ }
But none of these makes any change to the original three matrices:
> class(a)
[1] "matrix"
> class(b)
[1] "matrix"
> class(c)
[1] "matrix"
I wish to see all of them converted to data frames.
Your problem is that you're using sapply, which simplifies results to vectors or matrices.
You want lapply instead:
lapply(list(a,b,c), as.data.frame)
Edit for the (generally frowned upon) practice of changing the objects systematically but keeping the object names the same:
for (i in c("a", "b", "c"))
  assign(i, as.data.frame(get(i)))
This should get you a list of 3 data.frames:
set.seed(1026)
lapply(1:3,function(x){as.data.frame(matrix(rnorm(100),50,2))})
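If the goal really is to replace a, b, and c in the workspace (rather than keep working with a list), one base-R pattern is to hold the matrices in a named list and push the converted results back with list2env(). A minimal sketch:

```r
set.seed(1026)
mats <- list(a = matrix(rnorm(100), 50, 2),
             b = matrix(rnorm(100), 50, 2),
             c = matrix(rnorm(100), 50, 2))

# convert every matrix, keeping the names
dfs <- lapply(mats, as.data.frame)

# write a, b, c back into the current environment as data frames
list2env(dfs, envir = environment())

class(a)  # "data.frame"
```

Working with the named list dfs directly is usually cleaner, but list2env() avoids the assign/get loop when you do want the original names rebound.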