R: Read in random rows from file using fread or equivalent? - r

I have a very large multi-gigabyte file which is too costly to load into memory. The ordering of the rows in the file, however, are not random. Is there a way to read in a random subset of the rows using something like fread?
Something like this, for example?
data <- fread("data_file", nrows_sample = 90000)
This github post suggests one possibility is to do something like this:
fread("shuf -n 5 data_file")
This does not work for me, however. Any ideas?

Using the tidyverse (as opposed to data.table), you could do:
library(readr)
library(purrr)
library(dplyr)
# generate some random numbers between 1 and how many rows your files has,
# assuming you can ballpark the number of rows in your file
#
# Generating 900 integers because we'll grab 10 rows for each start,
# giving us a total of 9000 rows in the final
start_at <- floor(runif(900, min = 1, max = (n_rows_in_your_file - 10) ))
# sort the index sequentially
start_at <- start_at[order(start_at)]
# Read in 10 rows at a time, starting at your random numbers,
# binding results rowwise into a single data frame
sample_of_rows <- map_dfr(start_at, ~read_csv("data_file", n_max = 10, skip = .x) )

If your data file happens to be a text file this solution using the package LaF could be useful:
library(LaF)
# Prepare dummy data
mat <- matrix(sample(letters,10*1000000,T), nrow = 1000000)
dim(mat)
#[1] 1000000 10
write.table(mat, "tmp.csv",
row.names = F,
sep = ",",
quote = F)
# Read 90'000 random lines
start <- Sys.time()
random_mat <- sample_lines(filename = "tmp.csv",
n = 90000,
nlines = 1000000)
random_mat <- do.call("rbind",strsplit(random_mat,","))
Sys.time() - start
#Time difference of 1.135546 secs
dim(random_mat)
#[1] 90000 10

Related

equivalent of melt+reshape that splits on column names

Point: if you are going to vote to close, it is poor form not to give a reason why. If it can be improved without requiring a close, take the 10 seconds it takes to write a brief comment.
Question:
How do I do the following "partial melt" in a way that memory can support?
Details:
I have a few million rows and around 1000 columns. The names of the columns have 2 pieces of information in them.
Normally I would melt to a data frame (or table) comprised of a pair of columns, then I would split on the variable name to create two new columns, then I would cast using one of the new splits for new column names, and one for row names.
This isn't working. My billion or so rows of data are making the additional columns overwhelm my memory.
Outside the "iterative force" (as opposed to brute force) of a for-loop, is there a clean and effective way to do this?
Thoughts:
this is a little like melt-colsplit-cast
libraries common for this seem to be "dplyr", "tidyr", "reshape2", and "data.table".
tidyr's gather+separate+spread looks good, but doesn't like not having a unique row identifier
reshape2's dcast (I'm looking for 2d output) wants to aggregate
brute force loses the labels. By brute force I mean df <- rbind(df[,block1],...) where block is the first 200 column indices, block2 is the second, etcetera.
Update (dummy code):
#libraries
library(stringr)
#reproducibility
set.seed(56873504)
#geometry
Ncol <- 2e3
Nrow <- 1e6
#column names
namelist <- numeric(length=Ncol)
for(i in 1:(Ncol/200)){
col_idx <- 1:200+200*(i-1)
if(i<26){
namelist[col_idx] <- paste0(intToUtf8(64+i),str_pad(string=1:200,width=3,pad="0"))
} else {
namelist[col_idx] <- paste0(intToUtf8(96+i),str_pad(string=1:200,width=3,pad="0"))
}
}
#random data
df <- as.data.frame(matrix(runif(n=Nrow*Ncol,min=0, max=16384),nrow=Nrow,ncol=Ncol))
names(df) <- namelist
The output that I would be looking for would have a column with the first character of the current name (single alphabet character) and colnames would be 1 to 200. It would be much less wide than "df" but not fully melted. It would also not kill my cpu or memory.
(Ugly/Manual) Brute force version:
(working on it... )
Here are two options both using data.table.
If you know that each column string always has 200 (or n) fields associated with it (i.e., A001 - A200), you can use melt() and make a list of measurement variables.
melt(dt
, measure.vars = lapply(seq_len(Ncol_p_grp), seq.int, to = Ncol_p_grp * n_grp, by = Ncol_p_grp)
, value.name = as.character(seq_len(Ncol_p_grp))
)[, variable := rep(namelist_letters, each = Nrow)][]
#this data set used Ncol_p_grp <- 5 to help condense the data.
variable 1 2 3 4 5
1: A 0.2655087 0.06471249 0.2106027 0.41530902 0.59303088
2: A 0.3721239 0.67661240 0.1147864 0.14097138 0.55288322
3: A 0.5728534 0.73537169 0.1453641 0.45750426 0.59670404
4: A 0.9082078 0.11129967 0.3099322 0.80301300 0.39263068
5: A 0.2016819 0.04665462 0.1502421 0.32111280 0.26037592
---
259996: Z 0.5215874 0.78318812 0.7857528 0.61409610 0.67813484
259997: Z 0.6841282 0.99271480 0.7106837 0.82174887 0.92676493
259998: Z 0.1698301 0.70759513 0.5345685 0.09007727 0.77255570
259999: Z 0.2190295 0.14661878 0.1041779 0.96782695 0.99447460
260000: Z 0.4364768 0.06679642 0.6148842 0.91976255 0.08949571
Alternatively, we can use rbindlist(lapply(...)) to go through the data set and subset it based on the letter within the columns.
rbindlist(
lapply(namelist_letters,
function(x) setnames(
dt[, grep(x, names(dt), value = T), with = F]
, as.character(seq_len(Ncol_p_grp)))
)
, idcol = 'ID'
, use.names = F)[, ID := rep(namelist_letters, each = Nrow)][]
With 78 million elements in this dataset, it takes around a quarter of a second. I tried to up it to 780 million, but I just don't really have the RAM to generate the data that quickly in the first place.
#78 million elements - 10,000 rows * 26 grps * 200 cols_per_group
Unit: milliseconds
expr min lq mean median uq max neval
melt_option 134.0395 135.5959 137.3480 137.1523 139.0022 140.8521 3
rbindlist_option 290.2455 323.4414 350.1658 356.6373 380.1260 403.6147 3
Data: Run this before everything above:
#packages ----
library(data.table)
library(stringr)
#data info
Nrow <- 10000
Ncol_p_grp <- 200
n_grp <- 26
#generate data
set.seed(1)
dt <- data.table(replicate(Ncol_p_grp * n_grp, runif(n = Nrow)))
names(dt) <- paste0(rep(LETTERS[1:n_grp], each = Ncol_p_grp)
, str_pad(rep(seq_len(Ncol_p_grp), n_grp), width = 3, pad = '0'))
#first letter
namelist_letters <- unique(substr(names(dt), 1, 1))

R: Run for loop in parallel

I am trying to read in two csv files (dataset1 and dataset2) one of them has about 400 million lines. Both the files have the same number of columns i.e. 7.
In the code below, I am reading both files in chunks of fixed size, rbind them, apply a function and then write out the returned output to a file in append mode.
The following is my code:
# set x to 0 - number of lines to skip in dataset1
# set y to 7924 - number of lines to read in dataset1
# dataset1 has 60498*7924
x = 0
y = 7924
# set a to 0 - number of lines to skip in dataset2
# set b to 734 - number of lines to read in dataset2
# dataset2 has 60498*734 lines
a = 0
b = 734
# run the loop from 1 to 60498
# each time skip lines already read in
# each time read fixed number of rows
for(i in 1:60498)
{
# read both datasets and combine in one
dat <- read.csv('dataset1.csv', skip = x, nrows = y, header = F)
dat2 <- read.csv('dataset2.csv', skip = a, nrows = b, header = F)
dat3 <- rbind(dat, dat2)
# apply function to this dataset and return the output
# the function is too long and not in the scope so I will skip it
# it returns a dataframe of 1 row
res <- limma.test(dat3)
# write out the output in append mode
# so at the end of the loop, out.txt should have 60498 lines
write.table(res, file = 'out.txt', append = TRUE, quote = F, col.names = F)
# set x and y so that it skips the lines that are already read in
x = x + 7924
a = a + 734
}
The function itself is pretty fast, there is no bottleneck there. However, running a for loop for 60498 times, it is going to take really long. I have a computer with 8 cores. How can I modify my code to run the for loop in parallel to minimize the time?
Thanks!

Count occurrences around a genomic region given in a data frame

I need to count mutations in the genome that occur at certain spots or rather ranges. The mutations have a genomic position (chromosome and basepair, e.g. Chr1, 10658324). The range or spot, respectively, is defined as 10000 basepairs up- and downstream (+-) of a given position in the genome. Both, positions of mutations and position of "spots" are stored in data frames.
Example:
set.seed(1)
Chr <- 1
Pos <- as.integer(runif(5000 , 0, 1e8))
mutations <- data.frame(Pos, Chr)
Chr <- 1
Pos <- as.integer(runif(50 , 0, 1e8))
spots <- data.frame(Pos, Chr)
So the question I am asking is: How many mutations are present +-10k basepairs around the positions given in "spots". (e.g. if the spot is 100k, the range would be 90k-110k)
The real data would of course contain all 24 chromosomes, but for the sake of simplicity we can focus on one chromosome for now.
The final data should contain the "spot" and the number of mutations in it's vicinity, ideally in a data frame or matrix.
Many thanks in advance for any suggestions or help!
Here's a first attempt, but I am pretty shure there is a way more elegant way of doing it.
w <- 10000 #setting range to 10k basepairs
loop <- spots$Pos #creating vector of positions to loop through
out <- data.frame(0,0)
colnames(out) <- c("Pos", "Count")
for (l in loop) {
temp <- nrow(filter(mutations, Pos>=l-w, Pos<=l+w))
temp2 <- cbind(l,temp)
colnames(temp2) <- c("Pos", "Count")
out <- rbind(out, temp2)
}
out <- out[-1,]
Using data.table foverlaps, then aggregate:
library(data.table)
#set the flank
myFlank <- 100000
#convert to ranges with flank
spotsRange <- data.table(
chr = spots$Chr,
start = spots$Pos - myFlank,
end = spots$Pos + myFlank,
posSpot = spots$Pos,
key = c("chr", "start", "end"))
#convert to ranges start end same as pos
mutationsRange <- data.table(
chr = mutations$Chr,
start = mutations$Pos,
end = mutations$Pos,
key = c("chr", "start", "end"))
#merge by overlap
res <- foverlaps(mutationsRange, spotsRange, nomatch = 0)
#count mutations
resCnt <- data.frame(table(res$posSpot))
colnames(resCnt) <- c("Pos", "MutationCount")
merge(spots, resCnt, by = "Pos")
# Pos Chr MutationCount
# 1 3439618 1 10
# 2 3549952 1 15
# 3 4375314 1 11
# 4 7337370 1 13
# ...
I'm not familiar with bed manipulations in R, so I'm going propose an answer with bedtools and someone here can try to convert to GRanges or other R bioinformatics library.
Essentially, you have two bed files, one with your spots and other with your mutations (I'm assuming a 1bp coordinate for each in the latter). In this case, you'd use closestBed to get the closest spot and the distance in bp of each mutation and then filter those that are 10KB from the spots. The code in a UNIX environment would look something like this:
# Assuming 4-column file structure (chr start end name)
closestBed -d -a mutations.bed -b spots.bed | awk '$9 <= 10000 {print}'
Where column 9 ($9) will be the distance in bp from the closest spot. Depending on how more specific you want to be, you can check the manual page at http://bedtools.readthedocs.io/en/latest/content/tools/closest.html. I'm pretty sure there's at least one bedtools-like package in R. If the functionality is similar, you can apply this exact same solution.
Hope that helps!

How to read specific rows of CSV file with fread function

I have a big CSV file of doubles (10 million by 500) and I only want to read in a few thousand rows of this file (at various locations between 1 and 10 million), defined by a binary vector V of length 10 million, which assumes value 0 if I don't want to read the row and 1 if I do want to read the row.
How do I get the io function fread from the data.table package to do this? I ask because fread is so so fast compared to all other io approaches.
The best solution this question, Reading specific rows of large matrix data file, gives the following solution:
read.csv( pipe( paste0("sed -n '" , paste0( c( 1 , which( V == 1 ) + 1 ) , collapse = "p; " ) , "p' C:/Data/target.csv" , collapse = "" ) ) , head=TRUE)
where C:/Data/target.csv is the large CSV file and V is the vector of 0 or 1.
However I have noticed that this is orders of magnitude slower than simply using fread on the entire matrix, even if the V will only be equal to 1 for a small subset of the total number of rows.
Thus, since fread on the whole matrix will dominate the above solution, how do I combine fread (and specifically fread) with row sampling?
This is not a duplicate because it is only about the function fread.
Here's my problem setup:
#create csv
csv <- do.call(rbind,lapply(1:50,function(i) { rnorm(5) }))
#my csv has a header:
colnames(csv) <- LETTERS[1:5]
#save csv
write.csv(csv,"/home/user/test_csv.csv",quote=FALSE,row.names=FALSE)
#create vector of 0s and 1s that I want to read the CSV from
read_vec <- rep(0,50)
read_vec[c(1,5,29)] <- 1 #I only want to read in 1st,5th,29th rows
#the following is the effect that I want, but I want an efficient approach to it:
csv <- read.csv("/home/user/test_csv.csv") #inefficient!
csv <- csv[which(read_vec==1),] #inefficient!
#the alternative approach, too slow when scaled up!
csv <- fread( pipe( paste0("sed -n '" , paste0( c( 1 , which( read_vec == 1 ) + 1 ) , collapse = "p; " ) , "p' /home/user/test_csv.csv" , collapse = "" ) ) , head=TRUE)
#the fastest approach yet still not optimal because it needs to read all rows
require(data.table)
csv <- data.matrix(fread('/home/user/test_csv.csv'))
csv <- csv[which(read_vec==1),]
This approach takes a vector v (corresponding to your read_vec), identifies sequences of rows to read, feeds those to sequential calls to fread(...), and rbinds the result together.
If the rows you want are randomly distributed throughout the file, this may not be faster. However, if the rows are in blocks (e.g., c(1:50, 55, 70, 100:500, 700:1500)) then there will be few calls to fread(...) and you may see a significant improvement.
# create sample dataset
set.seed(1)
m <- matrix(rnorm(1e5),ncol=10)
csv <- data.frame(x=1:1e4,m)
write.csv(csv,"test.csv")
# s: rows we want to read
s <- c(1:50,53, 65,77,90,100:200,350:500, 5000:6000)
# v: logical, T means read this row (equivalent to your read_vec)
v <- (1:1e4 %in% s)
seq <- rle(v)
idx <- c(0, cumsum(seq$lengths))[which(seq$values)] + 1
# indx: start = starting row of sequence, length = length of sequence (compare to s)
indx <- data.frame(start=idx, length=seq$length[which(seq$values)])
library(data.table)
result <- do.call(rbind,apply(indx,1, function(x) return(fread("test.csv",nrows=x[2],skip=x[1]))))

merging multiple csv's in R

Hi i was merging csv downloaded from NSE Bhavcopy. different dates have different no of cols. Say in 26-12-2006 it had 998 rows & 27-12-2006 it has 1003 rows. It has 8 cols. I do the cbind to create a & b with just 2 cols, Symbol, close price. I name the col using colnames so that for merging i can merge by SYMBOL.
Questions:
1) When i use merge function with by = "SYMBOL", all = F; i was surprised to see resulting c having 1011 rows. where ever i read, merging with all = F it should become 998 rows or max 1003 rows. I also analyzed the data and found there were 5 different symbols in 27-12-2006 & 3 different symbols in 26-12-2006. So when we merge by "SYMBOL" will new symbols from both rows will be added? or it will merge only with earlier existing a row?
2) NSEmerg is a function using a for loop to read new file every time & merge with existing c file. I have about 1535 files having data from 2006 Dec till 2013 Apr. However i was not able to merge more than 12 files as it throws error vector size of 12 MB cannot be allowed. It also shows warning messages saying memory allocation of 1535 MB used up. Also at 12th file i found nrow of c to be 1508095 implying loop running infinitely. Of all the 1535 files, highest row was at 1435. Even if we add all stocks delisted, no traded on specific date, i believe it might not cross 2200 stocks. Why this shows nrow of 1.5 Million??
3) Is there any better way of merging csv? I am in stack overflow for first time else i would have attached say 10 files.
Code:
a <- read.csv("C://Users/home/desktop/061226.csv", stringsAsFactors = F, header = T)
b <- read.csv("C://Users/home/desktop/061227.csv", stringsAsFactors = F, header = T)
a_date <- a[2,1]
b_date <- b[2,1]
a <- cbind(a[,2],a[,6])
b <- cbind(b[,2], b[,6])
colnames(a) <- c("SYMBOL", a_date)
colnames(b) <- c("SYMBOL", b_date)
c <- merge(a,b,by = "SYMBOL", all = F)
NSEmerg <- function(x,y) {
y_date <- y[2,1]
y <- cbind(y[,2], y[,6])
colnames(y) <- c("SYMBOL", y_date)
c <- merge(c, y, by = "SYMBOL", all = F)
}
filenames = list.files(path = "C:/Users/home/Documents/Rest data", pattern = "*csv")
for (i in 1:length(filenames)){
y <- read.csv(filenames[i], header = T, stringsAsFactors = F)
c <- NSEmerg(c,y)
}
write.csv(c, file = "NSE.csv")
Are you sure you want to cbind and not rbind? To answer your last question. First you list all the .csv files in your map:
listfiles <- list.files(path="C:/Users/home/desktop", pattern='\\.csv$', full.names=TRUE)
Next use do.call to read in the different csv files and combine them with rbind.
df <- do.call(rbind, lapply(listfiles , read.csv))
You'd probably be better off just using a perl one-liner:
perl -pe1 file1 file2 file3 ... > newfile
and then you can cut the columns you need out
cut -f1,2 -d"," newfile > result

Resources