How to read specific rows of CSV file with fread function - r

I have a big CSV file of doubles (10 million by 500) and I only want to read in a few thousand rows of this file (at various locations between 1 and 10 million), defined by a binary vector V of length 10 million, which assumes value 0 if I don't want to read the row and 1 if I do want to read the row.
How do I get the I/O function fread from the data.table package to do this? I ask because fread is so fast compared to all other I/O approaches.
The best solution to this question, Reading specific rows of large matrix data file, gives the following solution:
read.csv(pipe(paste0("sed -n '", paste0(c(1, which(V == 1) + 1), collapse = "p; "), "p' C:/Data/target.csv")), header = TRUE)
where C:/Data/target.csv is the large CSV file and V is the vector of 0 or 1.
However, I have noticed that this is orders of magnitude slower than simply using fread on the entire matrix, even when V equals 1 for only a small subset of the total number of rows.
Thus, since fread on the whole matrix dominates the above solution, how do I combine fread (and specifically fread) with row sampling?
This is not a duplicate because it is only about the function fread.
Here's my problem setup:
#create csv
csv <- do.call(rbind,lapply(1:50,function(i) { rnorm(5) }))
#my csv has a header:
colnames(csv) <- LETTERS[1:5]
#save csv
write.csv(csv,"/home/user/test_csv.csv",quote=FALSE,row.names=FALSE)
#create vector of 0s and 1s that I want to read the CSV from
read_vec <- rep(0,50)
read_vec[c(1,5,29)] <- 1 #I only want to read in 1st,5th,29th rows
#the following is the effect that I want, but I want an efficient approach to it:
csv <- read.csv("/home/user/test_csv.csv") #inefficient!
csv <- csv[which(read_vec==1),] #inefficient!
#the alternative approach, too slow when scaled up!
csv <- fread(cmd = paste0("sed -n '", paste0(c(1, which(read_vec == 1) + 1), collapse = "p; "), "p' /home/user/test_csv.csv"), header = TRUE)
#the fastest approach so far, yet still not optimal because it needs to read all rows
require(data.table)
csv <- data.matrix(fread('/home/user/test_csv.csv'))
csv <- csv[which(read_vec==1),]

This approach takes a vector v (corresponding to your read_vec), identifies sequences of rows to read, feeds those to sequential calls to fread(...), and rbinds the result together.
If the rows you want are randomly distributed throughout the file, this may not be faster. However, if the rows are in blocks (e.g., c(1:50, 55, 70, 100:500, 700:1500)) then there will be few calls to fread(...) and you may see a significant improvement.
# create sample dataset
set.seed(1)
m <- matrix(rnorm(1e5),ncol=10)
csv <- data.frame(x=1:1e4,m)
write.csv(csv,"test.csv")
# s: rows we want to read
s <- c(1:50,53, 65,77,90,100:200,350:500, 5000:6000)
# v: logical, T means read this row (equivalent to your read_vec)
v <- (1:1e4 %in% s)
seq <- rle(v)
idx <- c(0, cumsum(seq$lengths))[which(seq$values)] + 1
# indx: start = starting row of sequence, length = length of sequence (compare to s)
indx <- data.frame(start=idx, length=seq$lengths[which(seq$values)])
library(data.table)
result <- do.call(rbind, apply(indx, 1, function(x) fread("test.csv", nrows = x[2], skip = x[1])))
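The same idea can be wrapped into a small convenience function. This is only a sketch (the function name and the rbindlist() stacking are mine, not part of the original answer); note that column names are not carried over, because each chunk is read past the header.
library(data.table)
read_selected_rows <- function(file, v) {
  runs  <- rle(v)
  start <- c(0, cumsum(runs$lengths))[which(runs$values)] + 1  # first data row of each block
  len   <- runs$lengths[which(runs$values)]                    # length of each block
  # skip = s skips the header line plus the s-1 preceding data rows
  blocks <- Map(function(s, n) fread(file, skip = s, nrows = n, header = FALSE),
                start, len)
  rbindlist(blocks)
}
result2 <- read_selected_rows("test.csv", v)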

Related

R: Read in random rows from file using fread or equivalent?

I have a very large multi-gigabyte file which is too costly to load into memory. The ordering of the rows in the file, however, is not random. Is there a way to read in a random subset of the rows using something like fread?
Something like this, for example?
data <- fread("data_file", nrows_sample = 90000)
This github post suggests one possibility is to do something like this:
fread("shuf -n 5 data_file")
This does not work for me, however. Any ideas?
Using the tidyverse (as opposed to data.table), you could do:
library(readr)
library(purrr)
library(dplyr)
# generate some random numbers between 1 and however many rows your file has,
# assuming you can ballpark the number of rows in your file
#
# Generating 900 integers because we'll grab 10 rows for each start,
# giving us a total of 9000 rows in the final data frame
start_at <- floor(runif(900, min = 1, max = (n_rows_in_your_file - 10) ))
# sort the index sequentially
start_at <- start_at[order(start_at)]
# Read in 10 rows at a time, starting at your random numbers,
# binding results rowwise into a single data frame
sample_of_rows <- map_dfr(start_at, ~read_csv("data_file", n_max = 10, skip = .x) )
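If you would rather stay with data.table, the same random-blocks idea can be sketched with fread. This is only an illustration (the file name and the row-count ballpark are assumptions), and column names are lost because each chunk is read past the header.
library(data.table)
set.seed(42)
n_total  <- 1e6                             # assumed ballpark of rows in the file
start_at <- sort(sample(n_total - 10, 900)) # 900 random block starts
chunks <- lapply(start_at, function(s)
  fread("data_file", skip = s, nrows = 10, header = FALSE))
sample_of_rows <- rbindlist(chunks)         # roughly 9000 rows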
If your data file happens to be a text file, this solution using the package LaF could be useful:
library(LaF)
# Prepare dummy data
mat <- matrix(sample(letters,10*1000000,T), nrow = 1000000)
dim(mat)
#[1] 1000000 10
write.table(mat, "tmp.csv",
            row.names = FALSE,
            sep = ",",
            quote = FALSE)
# Read 90'000 random lines
start <- Sys.time()
random_mat <- sample_lines(filename = "tmp.csv",
                           n = 90000,
                           nlines = 1000000)
random_mat <- do.call("rbind",strsplit(random_mat,","))
Sys.time() - start
#Time difference of 1.135546 secs
dim(random_mat)
#[1] 90000 10

R: How to change data in a column across multiple files. Help understanding lapply

I have a folder with about 160 files that are formatted with three columns: onset time, variable 1 'x', and variable 2 'y'. Onset is listed in R as a string, but it is a time variable formatted as Hour:Minute:Second:FractionalSecond. I need to remove the fractional second. If I could round, that would be great, but it would be okay to just remove the fractional second using something like substr(file$onset,1,8).
My files are named in a format similar to File001 File002 File054 File1001
onset X Y
00:55:17:95 3 3
00:55:29:66 3 4
00:55:31:43 3 3
01:00:49:24 3 3
01:02:00:03
I am trying to use lapply. lapply seems simple, but I'm having a hard time figuring it out. The code written below returns an error that the final line doesn't have 3 elements. For my final output it is important that my last line only have the value for onset.
lapply(files, function(x) {
  t <- read.table(x, header=T) # load file
  t$onset <- substr(t$onset, 1, 8)
  # write to file
  write.table(t, "filepath", sep="\t", quote=F, row.names=F, col.names=T)
})
First create a data frame from all the text files; then you can apply the strptime and format functions to the onset column to remove the fractional second.
filelist <- list.files(pattern = "\\.txt")
alltxt.files <- list() # create a list to populate with table data (if you want to bind all the rows together)
count <- 1
for (file in filelist) {
  dat <- read.table(file, header = TRUE, fill = TRUE) # fill = TRUE tolerates the short final line
  alltxt.files[[count]] <- dat # create a list of tables from the txt files
  count <- count + 1
}
allfiles <- do.call(rbind.data.frame, alltxt.files)
allfiles$onset <- strptime(allfiles$onset, "%H:%M:%S")
allfiles$onset <- format(allfiles$onset, "%H:%M:%S")
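If the goal is to rewrite each file individually (closer to the original lapply attempt) rather than binding them all together, a sketch along these lines may help; the "fixed_" output prefix is just illustrative.
lapply(filelist, function(f) {
  t <- read.table(f, header = TRUE, fill = TRUE, stringsAsFactors = FALSE)
  t$onset <- substr(t$onset, 1, 8)  # drop the fractional second
  write.table(t, file = paste0("fixed_", f), sep = "\t",
              quote = FALSE, row.names = FALSE, col.names = TRUE)
})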

Crafty ways to make super efficient R vector processing?

I have a very simple assignment for a project that requires processing a large amount of information; my professor's first words were "this will take a while to run", so I figured it'd be a good opportunity to spend the time I would otherwise be waiting on my program making it super efficient :P
Basically, I have an input file where each line is either a node or details. It might look something like:
#NODE1_length_17_2309482.2394832.2
val1 5 18
val2 6 21
val3 100 23
val4 9 6
#NODE2_length_1298_23948349.23984.2
val1 2 293
...
and so on. Basically, I want to know how I can efficiently use R to output, line by line, something like:
NODE1_length_17 val1 18
NODE1_length_17 val2 21
...
So, as you can see, I want the node name, the value name, and the third column of the value line. I have implemented this using an ultra-slow for loop that calls strsplit a whole bunch of times, and obviously that is not ideal. My current implementation looks like:
nodevals <- which(substring(data, 1, 1) == "#") # find lines with nodes
vallines <- which(substring(data, 1, 3) == "val")
out <- vector(mode="character", length=length(vallines))
for (i in vallines) {
  line_ra <- strsplit(data[i], "\\s+")[[1]]
  # ... and so on, using a bunch of strsplits and pastes to reformat
  out[i] <- paste(node, val, value, sep="\t")
}
Does anybody know how I can optimize this using data frames or crafty vector manipulations?
EDIT: I'm implementing vector-wise splitting for everything, and so far I've found that the main thing I can't split correctly is the names of each node. I'm trying to do something like
names <- data[max(nodes[nodelines < vallines])]
where nodes are the names of the lines containing a node, nodelines are their line numbers, and vallines are the line numbers of the lines containing a val. The returned vector should have the same number of elements as vallines: for each element of vallines, the goal is to find the largest nodeline that is smaller than that line number. Any thoughts?
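For the EDIT above, base R's findInterval() performs exactly this "largest preceding index" lookup; a minimal sketch, reusing the nodevals and vallines indices computed earlier:
node_for_val <- nodevals[findInterval(vallines, nodevals)] # closest preceding node line for each value line
node_names   <- data[node_for_val]                         # one node name per value line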
I suggest using the data.table package - it has a very fast string-splitting function, tstrsplit.
library(data.table)
#read from file
data <- scan('data.txt', 'character', sep = '\n')
#create separate objects for nodes and values
dt <- data.table(data)
dt[, c('IsNode', 'NodeId') := list(IsNode <- substr(data, 1, 1) == '#', cumsum(IsNode))]
nodes <- dt[IsNode == TRUE, list(NodeId, data)]
values <- dt[IsNode == FALSE, list(data, NodeId)]
#split string and join back values and nodes
tmp <- values[, tstrsplit(data, '\\s+')]
values <- data.table(values[, list(NodeId)], tmp[, list(val = V1, value = V3)], key = 'NodeId')
res <- values[nodes]
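To get from res to the exact "NODE..._length_N val value" lines asked for, a rough follow-up sketch (it assumes node names always look like #NAME_length_N_<coverage>, as in the example above):
# strip the leading '#' and the trailing coverage numbers from the node line
res[, node := sub("_[0-9.]+$", "", sub("^#", "", data))]
writeLines(res[, paste(node, val, value, sep = "\t")])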

data.frame filtering function too slow

I am trying to filter a data frame of mine, with about 200 thousand rows, using R.
The dataframe is structured as follows:
testdf <- data.frame("CHROM"='CHR8', "POS"=c(500,510), "ID"='Some_value',
                     "REF"=c('A','C'), "ALT"=c('C','T,G'), "Some_more_stuff"='More_info')
I am trying to filter the rows based on how many letters are in the 'ALT' column, keeping only those equal to or less than a custom threshold. In the example above, if my threshold is 1, only the first row would be retained (the second row's ALT column has 2 letters > 1).
I have written a couple of functions which do the job. The only problem is that they take several seconds on a test dataframe with just 14 rows; on the real dataframe (200,000 rows) it takes forever. I am looking for advice on how to write better syntax and get faster results.
Here are my functions:
# Function no. 1:
allele_number_filtering <- function(snp_table, max_alleles=1, ALT_column=5) {
  # here I calculate how many letters are in the ALT column
  alt_allele_list_length <- function(ALT_field) {
    alt_length <- length(strsplit(as.character(ALT_field), split = ',')[[1]])
    return(alt_length)
  }
  # Create an empty dataframe with the same columns as the input df
  final_table <- snp_table[0,]
  # Now only retain the rows that are <= max_alleles
  for (i in 1:nrow(snp_table)) {
    if (alt_allele_list_length(snp_table[i, ALT_column]) <= max_alleles) {
      final_table <- rbind(final_table, snp_table[i,])
    }
  }
  return(final_table)
}
# Function no. 2:
allele_number_filtering <- function(snp_table, max_alleles=1, ALT_column=5) {
  final_table <- snp_table[0,]
  for (i in 1:nrow(snp_table)) {
    if (length(strsplit(as.character(snp_table[i, ALT_column]),
                        split = ',')[[1]]) <= max_alleles) {
      final_table <- rbind(final_table, snp_table[i,])
    }
  }
  return(final_table)
}
I would be thankful for any advice :)
Max
EDIT: I realized I also had values such as 'ALT' = 'at' (still to be counted as 1) or 'ALT' = 'aa,at' (to be counted as 2).
You can use lengths() for this:
testdf[lengths(strsplit(as.character(testdf$ALT), ',', fixed = TRUE)) <= 1, ]
Thanks to @docendodiscimus for the strsplit(fixed = TRUE) option for the speed-up, and to @joran for his perspicacity.
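Wrapped into a drop-in replacement for the original function (same signature as in the question; just a sketch), this becomes:
allele_number_filtering <- function(snp_table, max_alleles = 1, ALT_column = 5) {
  # count comma-separated entries per row, then keep rows at or below the threshold
  n_alt <- lengths(strsplit(as.character(snp_table[[ALT_column]]), ",", fixed = TRUE))
  snp_table[n_alt <= max_alleles, , drop = FALSE]
}
allele_number_filtering(testdf) # keeps only the first row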
I would use nchar for this (after removing the , via gsub):
nchar(gsub(",", "", as.character(testdf$ALT)))
# [1] 1 2
threshold <- 1
testdf[nchar(gsub(",", "", as.character(testdf$ALT))) <= threshold, ]
#   CHROM POS         ID REF ALT Some_more_stuff
# 1  CHR8 500 Some_value   A   C       More_info

How can I maintain the same column format in an R data frame when I export it as CSV?

I have a variable in my R data frame that has 18 characters. When I use write.csv(out2, file="ddd.csv", row.names=FALSE), I get this variable's values in scientific format. I tried exporting it as txt and it maintained the exact structure I wanted, but I need it in CSV format. What can I do to maintain the exact format of my R data frame when I export it to CSV?
Thank you,
Ron
R will write that column as a number if it thinks that it is a number, rather than a categorical variable. Compare, for example, the columns in this dataset
n <- 5
ids <- replicate(
  n,
  paste0(
    sample(0:9, 18, replace = TRUE),
    collapse = ""
  )
)
out2 <- data.frame(
  CategoricalId = factor(ids),
  NumericId = as.numeric(ids)
)
out2
## CategoricalId NumericId
## 1 097572748411056439 9.757275e+16
## 2 455782786931417422 4.557828e+17
## 3 046986020739330140 4.698602e+16
## 4 384292451744509872 3.842925e+17
## 5 787170367185951077 7.871704e+17
[Screenshot in the original answer: the Excel number-formatting dialog options and the resulting output.]
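So, on the R side, the practical fix is to make sure the 18-character values are stored as character (or factor) rather than numeric before writing; a minimal sketch (the input file and column name here are illustrative, not from the question):
# read the id column as character so no precision is lost and no scientific
# notation is introduced, then write it back out verbatim
out2 <- read.csv("input.csv", colClasses = c(id = "character"))
write.csv(out2, file = "ddd.csv", row.names = FALSE)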
