Social graph analysis. 60GB and 100 million nodes - r

Good evening,
I am trying to analyse the forementioned data(edgelist or pajek format). First thought was R-project with igraph package. But memory limitations(6GB) wont do the trick. Will a 128GB PC be able to handle the data? Are there any alternatives that don't require whole graph in RAM?
Thanks in advance.
P.S: I have found several programs but I would like to hear some pro(yeah, that's you) opinions on the matter.

If you only want degree distributions, you likely don't need a graph package at all. I recommend the bigtablulate package so that
your R objects are file backed so that you aren't limited by RAM
you can parallelize the degree computation using foreach
Check out their website for more details. To give a quick example of this approach, let's first create an example with an edgelist involving 1 million edges among 1 million nodes.
set.seed(1)
N <- 1e6
M <- 1e6
edgelist <- cbind(sample(1:N,M,replace=TRUE),
sample(1:N,M,replace=TRUE))
colnames(edgelist) <- c("sender","receiver")
write.table(edgelist,file="edgelist-small.csv",sep=",",
row.names=FALSE,col.names=FALSE)
I next concatenate this file 10 times to make the example a bit bigger.
system("
for i in $(seq 1 10)
do
cat edgelist-small.csv >> edgelist.csv
done")
Next we load the bigtabulate package and read in the text file with our edgelist. The command read.big.matrix() creates a file-backed object in R.
library(bigtabulate)
x <- read.big.matrix("edgelist.csv", header = FALSE,
type = "integer",sep = ",",
backingfile = "edgelist.bin",
descriptor = "edgelist.desc")
nrow(x) # 1e7 as expected
We can compute the outdegrees by using bigtable() on the first column.
outdegree <- bigtable(x,1)
head(outdegree)
Quick sanity check to make sure table is working as expected:
# Check table worked as expected for first "node"
j <- as.numeric(names(outdegree[1])) # get name of first node
all.equal(as.numeric(outdegree[1]), # outdegree's answer
sum(x[,1]==j)) # manual outdegree count
To get indegree, just do bigtable(x,2).

Related

use ape to phase a fasta file and create a DNAbin file as output, then test tajima's D using pegas

I'm trying to complete the very simple task of reading in an unphased fasta file and phasing it using ape, and then calculating Tajima's D using pegas, but #my data doesn't seem to be reading in correctly. Input and output is as #follows:
library("ape")
library("adegenet")
library("ade4")
library("pegas")
DNAbin8c18 <- read.dna(file="fasta8c18.fa", format="f")
I shouldn't need to attach any data since I've just generated the file, but since the data() command was in the manual, I executeed
data(DNAbin8c18)
and got
Warning message: In data(DNAbin8c18) : data set ‘DNAbin8c18’ not found
I know that data() only works in certain contexts, so maybe this isn't a big deal. I looked at what had been loaded
DNAbin8c18
817452 DNA sequences in binary format stored in a matrix.
All sequences of same length: 96
Labels:
CLocus_12706_Sample_1_Locus_34105_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_2_Locus_31118_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_3_Locus_30313_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_5_Locus_33345_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_7_Locus_37388_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_8_Locus_29451_Allele_0 [BayOfIslands_s09... ...
More than 10 million nucleotides: not printing base composition
so it looks like the data should be fine. Because of this, I tried what I want to do
tajima.test(DNAbin8c18)
and got
Error: cannot allocate vector of size 2489.3 Gb
Many people have completed this same test using as many or more SNPs that I have, and also using FASTA files, but is it possible that mine is too big, or can you see another issue?
The data file can be downloaded at the following link
https://drive.google.com/open?id=0B6qb8IlaQGFZLVRYeXMwRnpMTUU
I have also sent and earlier version of this question, with the data, to the r-sig-genetics mailing list, but I have not heard back.
Any thoughts would be much appreciated.
Ella
Thank you for the comment. Indeed, you are correct. The developer just emailed me with the following very helpful comments.
The problem is that your data are too big (too many sequences) and tajima.test() needs to compute the matrix of all pairwise distances. You could this check by trying:
dist.dna(DNAbin8c18, "N")
One possibility for you is to sample randomly some observations, and repeat this many times, eg:
tajima.test(DNAbin8c18[sample(n, size = 1000), ])
This could be:
N <- 1000 # number of repeats
RES <- matrix(N, 3)
for (i in 1:N)
RES[, i] <- unlist(tajima.test(DNAbin8c18[sample(n, size = 10000), ]))
You may adjust N and 'size =' to have something not too long to run. Then you may look at the distribution of the columns of RES.

How to work with a large multi type data frame in Snow R?

I have a large data.frame of 20M lines. This data frame is not only numeric, there is characters as well. Using a split and conquer concept, I want to split this data frame to be executed in a parallel way using snow package (parLapply function, specifically). The problem is that the nodes run out of memory because the data frame parts are worked in RAM. I looked for a package to help me with this problem and I found just one (considering the multi type data.frame): ff package. Another problem comes from the use of this package. The split result of a ffdf is not equal to a split of a commom data.frame. Thus, it is not possible to run the parLapply function.
Do you know other packages for this goal? Bigmemory only supports matrix.
I've benchmarked some ways of splitting the data frame and parallelizing to see how effective they are with large data frames. This may help you deal with the 20M line data frame and not require another package.
The results are here. The description is below.
This suggests that for large data frames the best option is (not quite the fastest, but has a progress bar):
library(doSNOW)
library(itertools)
# if size on cores exceeds available memory, increase the chunk factor
chunk.factor <- 1
chunk.num <- kNoCores * cut.factor
tic()
# init the cluster
cl <- makePSOCKcluster(kNoCores)
registerDoSNOW(cl)
# init the progress bar
pb <- txtProgressBar(max = 100, style = 3)
progress <- function(n) setTxtProgressBar(pb, n)
opts <- list(progress = progress)
# conduct the parallelisation
travel.queries <- foreach(m=isplitRows(coord.table, chunks=chunk.num),
.combine='cbind',
.packages=c('httr','data.table'),
.export=c("QueryOSRM_dopar", "GetSingleTravelInfo"),
.options.snow = opts) %dopar% {
QueryOSRM_dopar(m,osrm.url,int.results.file)
}
# close progress bar
close(pb)
# stop cluster
stopCluster(cl)
toc()
Note that
coord.table is the data frame/table
kNoCores ( = 25 in this case) is the number of cores
Distributed memory. Sends coord.table to all nodes
Shared memory. Shares coord.table with nodes
Shared memory with cuts. Shares subset of coord.table with nodes.
Do par with cuts. Sends subset of coord.table to nodes.
SNOW with cuts and progress bar. Sends subset of coord.table to nodes
Option 5 without progress bar
More information about the other options I compared can be found here.
Some of these answers might suit you, although they doesn't relate to distributed parlapply and I've included some of them in my benchmarking options.

Reading in chunks at a time using fread in package data.table

I'm trying to input a large tab-delimited file (around 2GB) using the fread function in package data.table. However, because it's so large, it doesn't fit completely in memory. I tried to input it in chunks by using the skip and nrow arguments such as:
chunk.size = 1e6
done = FALSE
chunk = 1
while(!done)
{
temp = fread("myfile.txt",skip=(chunk-1)*chunk.size,nrow=chunk.size-1)
#do something to temp
chunk = chunk + 1
if(nrow(temp)<2) done = TRUE
}
In the case above, I'm reading in 1 million rows at a time, performing a calculation on them, and then getting the next million, etc. The problem with this code is that after every chunk is retrieved, fread needs to start scanning the file from the very beginning since after every loop iteration, skip increases by a million. As a result, after every chunk, fread takes longer and longer to actually get to the next chunk making this very inefficient.
Is there a way to tell fread to pause every say 1 million lines, and then continue reading from that point on without having to restart at the beginning? Any solutions, or should this be a new feature request?
You should use the LaF package. This introduces a sort of pointer on your data, thus avoiding the - for very large data - annoying behaviour of reading the whole file. As far as I get it fread() in data.table pckg need to know total number of rows, which takes time for GB data.
Using pointer in LaF you can go to every line(s) you want; and read in chunks of data that you can apply your function on, then move on to next chunk of data. On my small PC I ran trough a 25 GB csv-file in steps of 10e6 lines and extracted the totally ~5e6 observations needed - each 10e6 chunk took 30 seconds.
UPDATE:
library('LaF')
huge_file <- 'C:/datasets/protein.links.v9.1.txt'
#First detect a data model for your file:
model <- detect_dm_csv(huge_file, sep=" ", header=TRUE)
Then create a connection to your file using the model:
df.laf <- laf_open(model)
Once done you can do all sort of things without needing to know the size of the file as in data.table pckgs. For instance place the pointer to line no 100e6 and read 1e6 lines of data from here:
goto(df.laf, 100e6)
data <- next_block(df.laf,nrows=1e6)
Now data contains 1e6 lines of your CSV file (starting from line 100e6).
You can read in chunks of data (size depending on your memory) and only keep what you need. e.g. the huge_file in my example points to a file with all known protein sequences and has a size of >27 GB - way to big for my PC. To get only human sequence I filtered using organism id which is 9606 for human, and this should appear in start of the variable protein1. A dirty way is to put it into a simple for-loop and just go read one data chunk at a time:
library('dplyr')
library('stringr')
res <- df.laf[1,][0,]
for(i in 1:10){
raw <-
next_block(df.laf,nrows=100e6) %>%
filter(str_detect(protein1,"^9606\\."))
res <- rbind(res, raw)
}
Now res contains the filtered human data. But better - and for more complex operations, e.g. calculation on data on-the-fly - the function process_blocks() takes as argument a function. Hence in the function you do what ever you want at each piece of data. Read the documentation.
You can use readr's read_*_chunked to read in data and e.g. filter it chunkwise. See here and here for an example:
# Cars with 3 gears
f <- function(x, pos) subset(x, gear == 3)
read_csv_chunked(readr_example("mtcars.csv"), DataFrameCallback$new(f), chunk_size = 5)
A related option is the chunked package. Here is an example with a 3.5 GB text file:
library(chunked)
library(tidyverse)
# I want to look at the daily page views of Wikipedia articles
# before 2015... I can get zipped log files
# from here: hhttps://dumps.wikimedia.org/other/pagecounts-ez/merged/2012/2012-12/
# I get bz file, unzip to get this:
my_file <- 'pagecounts-2012-12-14/pagecounts-2012-12-14'
# How big is my file?
print(paste(round(file.info(my_file)$size / 2^30,3), 'gigabytes'))
# [1] "3.493 gigabytes" too big to open in Notepad++ !
# But can read with 010 Editor
# look at the top of the file
readLines(my_file, n = 100)
# to find where the content starts, vary the skip value,
read.table(my_file, nrows = 10, skip = 25)
This is where we start working in chunks of the file, we can use most dplyr verbs in the usual way:
# Let the chunked pkg work its magic! We only want the lines containing
# "Gun_control". The main challenge here was identifying the column
# header
df <-
read_chunkwise(my_file,
chunk_size=5000,
skip = 30,
format = "table",
header = TRUE) %>%
filter(stringr::str_detect(De.mw.De.5.J3M1O1, "Gun_control"))
# this line does the evaluation,
# and takes a few moments...
system.time(out <- collect(df))
And here we can work on the output as usual, since it's much smaller than the input file:
# clean up the output to separate into cols,
# and get the number of page views as a numeric
out_df <-
out %>%
separate(De.mw.De.5.J3M1O1,
into = str_glue("V{1:4}"),
sep = " ") %>%
mutate(V3 = as.numeric(V3))
head(out_df)
V1 V2 V3
1 en.z Gun_control 7961
2 en.z Category:Gun_control_advocacy_groups_in_the_United_States 1396
3 en.z Gun_control_policy_of_the_Clinton_Administration 223
4 en.z Category:Gun_control_advocates 80
5 en.z Gun_control_in_the_United_Kingdom 68
6 en.z Gun_control_in_america 59
V4
1 A34B55C32D38E32F32G32H20I22J9K12L10M9N15O34P38Q37R83S197T1207U1643V1523W1528X1319
2 B1C5D2E1F3H3J1O1P3Q9R9S23T197U327V245W271X295
3 A3B2C4D2E3F3G1J3K1L1O3P2Q2R4S2T24U39V41W43X40
4 D2H1M1S4T8U22V10W18X14
5 B1C1S1T11U12V13W16X13
6 B1H1M1N2P1S1T6U5V17W12X12
#--------------------
fread() can definitely help you read the data by chunks
What mistake you have made in your code is that you should keep your nrow a constant while you change the size of your skip parameter in the function during the loop.
Something like this is what I wrote for my data:
data=NULL
for (i in 0:20){
data[[i+1]]=fread("my_data.csv",nrow=10000,select=c(1,2:100),skip =10000*i)
}
And you may insert the follow code in your loop:
start_time <- Sys.time()
#####something!!!!
end_time <- Sys.time()
end_time - start_time
to check the time -- that each loop on average takes similar time.
Then you could use another loop to combine your data by rows with function default rbind function in R.
The sample code could be something like this:
new_data = data[[1]]
for (i in 1:20){
new_data=rbind(new_data,data[[i+1]],use.names=FALSE)
}
to unify into a large dataset.
Hope my answer may help with your question.
I loaded a 18Gb data with 2k+ columns, 200k rows in about 8 minutes using this method.

R bigmemory attach.big.matrix is very slow for very wide matrices

I am using the package bigmemory to interact with large matrices in R. This works well for large matrices except that the attach.big.matrix() function to reload a binary file created with read.big.matrix() is MUCH slower than the original call to read.big.matrix(). Here is an example:
library(bigmemory)
# Create large matrix with 1,000,000 columns
X = matrix(rnorm(1e8), ncol=1000000)
colnames(X) = paste("col", 1:ncol(X))
rownames(X) = paste("row", 1:nrow(X))
# Write to file
write.big.matrix(as.big.matrix(X), "X.txt", row.names=TRUE, col.names=TRUE)
# read into big.matrix and create backing-file for faster loading the second time
A = read.big.matrix("X.txt", header=TRUE, has.row.names=TRUE, type="double", backingfile="X.bin", descriptorfile="X.desc")
# Attach the data based on the backing-file
G = attach.big.matrix("X.desc")
When the number of columns is small (i.e. 1000), the code works as expected and attach.big.matrix() is faster than read.big.matrix(). But with 1,000,000 columns, attach.big.matrix() is 10x slower!
Also, note that this performance issue completely goes away when there are no column names (i.e. comment-out the colnames(X) line) and I can attach in zero time. This suggestions that the bottle neck is in parsing X.desc and there should be a better way to attach.big.matrix().
This matrix is small in comparison to my real data.
Or can I do something different?
Thanks
System info:
Intel Xeon E5-2687W # 3.10GHz with 64 Gb RAM
Ubuntu 12.04.2 LTS
R 3.0.1
bigmemory_4.4.3
From the package user manual
If x has a huge number of rows and/or columns, then the use of rownames and/or colnames will
be extremely memory-intensive and should be avoided.
As you suggest yourself the performance issue disappears when you avoid using column names. In my case I solve this by keeping column names in a separate vector that I use to extract the indices of the columns I need.

Transform a matrix txt file in spectra data for ChemoSpec package

I want to use ChemoSpec with a mass spectra of about 60'000 datapoint.
I have them already in one txt file as a matrix (X + 90 samples = 91 columns; 60'000 rows).
How may I adapt this file as spectra data without exporting again each single file in csv format (which is quite long in R given the size of my data)?
The typical (and only?) way to import data into ChemoSpec is by way of the getManyCsv() function, which as the question indicates requires one CSV file for each sample.
Creating 90 CSV files from the 91 columns - 60,000 rows file described, may be somewhat slow and tedious in R, but could be done with a standalone application, whether existing utility or some ad-hoc script.
An R-only solution would be to create a new method, say getOneBigCsv(), adapted from getManyCsv(). After all, the logic of getManyCsv() is relatively straight forward.
Don't expect such a solution to be sizzling fast, but it should, in any case, compare with the time it takes to run getManyCsv() and avoid having to create and manage the many files, hence overall be faster and certainly less messy.
Sorry I missed your question 2 days ago. I'm the author of ChemoSpec - always feel free to write directly to me in addition to posting somewhere.
The solution is straightforward. You already have your data in a matrix (after you read it in with >read.csv("file.txt"). So you can use it to manually create a Spectra object. In the R console type ?Spectra to see the structure of a Spectra object, which is a list with specific entries. You will need to put your X column (which I assume is mass) into the freq slot. Then the rest of the data matrix will go into the data slot. Then manually create the other needed entries (making sure the data types are correct). Finally, assign the Spectra class to your completed list by doing something like >class(my.spectra) <- "Spectra" and you should be good to go. I can give you more details on or off list if you describe your data a bit more fully. Perhaps you have already solved the problem?
By the way, ChemoSpec is totally untested with MS data, but I'd love to find out how it works for you. There may be some changes that would be helpful so I hope you'll send me feedback.
Good Luck, and let me know how else I can help.
many years passed and I am not sure if anybody is still interested in this topic. But I had the same problem and did a little workaround to convert my data to class 'Spectra' by extracting the information from the data itself:
#Assumption:
# Data is stored as a numeric data.frame with column names presenting samples
# and row names including domain axis
dataframe2Spectra <- function(Spectrum_df,
freq = as.numeric(rownames(Spectrum_df)),
data = as.matrix(t(Spectrum_df)),
names = paste("YourFileDescription", 1:dim(Spectrum_df)[2]),
groups = rep(factor("Factor"), dim(Spectrum_df)[2]),
colors = rainbow(dim(Spectrum_df)[2]),
sym = 1:dim(Spectrum_df)[2],
alt.sym = letters[1:dim(Spectrum_df)[2]],
unit = c("a.u.", "Domain"),
desc = "Some signal. Describe it with 'desc'"){
features <- c("freq", "data", "names", "groups", "colors", "sym", "alt.sym", "unit", "desc")
Spectrum_chem <- vector("list", length(features))
names(Spectrum_chem) <- features
Spectrum_chem$freq <- freq
Spectrum_chem$data <- data
Spectrum_chem$names <- names
Spectrum_chem$groups <- groups
Spectrum_chem$colors <- colors
Spectrum_chem$sym <- sym
Spectrum_chem$alt.sym <- alt.sym
Spectrum_chem$unit <- unit
Spectrum_chem$desc <- desc
# important step
class(Spectrum_chem) <- "Spectra"
# some warnings
if (length(freq)!=dim(data)[2]) print("Dimension of data is NOT #samples X length of freq")
if (length(names)>dim(data)[1]) print("Too many names")
if (length(names)<dim(data)[1]) print("Too less names")
if (length(groups)>dim(data)[1]) print("Too many groups")
if (length(groups)<dim(data)[1]) print("Too less groups")
if (length(colors)>dim(data)[1]) print("Too many colors")
if (length(colors)<dim(data)[1]) print("Too less colors")
if (is.matrix(data)==F) print("'data' is not a matrix or it's not numeric")
return(Spectrum_chem)
}
Spectrum_chem <- dataframe2Spectra(Spectrum)
chkSpectra(Spectrum_chem)

Resources