R: Run for loop in parallel

I am trying to read in two CSV files (dataset1 and dataset2); one of them has about 400 million lines. Both files have the same number of columns, i.e. 7.
In the code below, I read both files in chunks of a fixed size, rbind the two chunks, apply a function, and then write the returned output to a file in append mode.
The following is my code:
# set x to 0 - number of lines to skip in dataset1
# set y to 7924 - number of lines to read in dataset1
# dataset1 has 60498*7924 lines
x = 0
y = 7924
# set a to 0 - number of lines to skip in dataset2
# set b to 734 - number of lines to read in dataset2
# dataset2 has 60498*734 lines
a = 0
b = 734
# run the loop from 1 to 60498
# each time skip lines already read in
# each time read fixed number of rows
for(i in 1:60498)
{
# read both datasets and combine in one
dat <- read.csv('dataset1.csv', skip = x, nrows = y, header = F)
dat2 <- read.csv('dataset2.csv', skip = a, nrows = b, header = F)
dat3 <- rbind(dat, dat2)
# apply function to this dataset and return the output
# the function is too long and not in the scope so I will skip it
# it returns a dataframe of 1 row
res <- limma.test(dat3)
# write out the output in append mode
# so at the end of the loop, out.txt should have 60498 lines
write.table(res, file = 'out.txt', append = TRUE, quote = F, col.names = F)
# advance x and a so that the next iteration skips the lines already read in
x = x + 7924
a = a + 734
}
The function itself is fast; there is no bottleneck there. However, running the for loop 60498 times sequentially is going to take a very long time. I have a computer with 8 cores. How can I modify my code to run the for loop in parallel and minimize the time?
Thanks!
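One common way to parallelize this kind of loop is the foreach package with a doParallel backend. The following is only a sketch under the question's assumptions (the two CSV files and the limma.test() function exist as described); the .export argument may or may not be needed depending on where limma.test is defined:
library(foreach)
library(doParallel)
cl <- makeCluster(8)           # one worker per core
registerDoParallel(cl)
# each iteration computes its own offsets, so no shared counters are needed
res_all <- foreach(i = 1:60498, .combine = rbind, .export = c("limma.test")) %dopar% {
  x <- (i - 1) * 7924          # lines to skip in dataset1
  a <- (i - 1) * 734           # lines to skip in dataset2
  dat  <- read.csv('dataset1.csv', skip = x, nrows = 7924, header = FALSE)
  dat2 <- read.csv('dataset2.csv', skip = a, nrows = 734,  header = FALSE)
  limma.test(rbind(dat, dat2)) # returns a 1-row data frame
}
stopCluster(cl)
# write once at the end instead of appending inside the loop
write.table(res_all, file = 'out.txt', quote = FALSE, col.names = FALSE)
Note that read.csv with a large skip still scans the file from the beginning, so each worker re-reads earlier lines; parallelizing removes the sequential bottleneck but not that repeated scanning.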

Related

bind files into one and generate a tag

I am trying to bind several files into a single one and also generate a list containing the number of rows of each individual file. I can bind the files into one, but I have failed to generate this list. Could anyone help me? Thanks in advance.
aa = c(1:3) # numeric suffixes of the files
bb = paste0(datasource,"file.",aa) # build the full file names
merge.data = read.table(bb[1]) # read the first file as a data.frame
n = nrow(merge.data) # number of rows in the first file
for (i in 2:length(bb)){
new.data = read.table(bb[i]) # read each following file
merge.data = rbind(merge.data, new.data) # bind the files into a single data frame
}
merge.data # the data.frame containing all the original files' data
yy = paste0(datasource,"file.",aa)
merge.data = data.frame()
##### I would like to add another variable np, which is a list representing the number of rows for each individual file before merge
np = NULL
for (i in 1:n){
merge.data[i] = read.table(bb[i])
np[i] = nrow(merge.data[i])
merge.data = rbind(merge.data[i])
}
np
merge.data
Reading the files into a list of data frames gives you both the combined data and the per-file row counts:
aa = c(1:3) # numeric suffixes of the files
bb = paste0(datasource,"file.",aa) # build the full file names
# read all files into list of data frames
bb_list = sapply(bb, read.table, simplify = F)
# get vector of number of rows
np = sapply(bb_list, nrow)
# combine list of data frames into one
result = do.call(rbind, bb_list)
For more details/discussion, have a look at How to make a list of data frames.

R: Read in random rows from file using fread or equivalent?

I have a very large multi-gigabyte file which is too costly to load into memory. The ordering of the rows in the file, however, is not random. Is there a way to read in a random subset of the rows using something like fread?
Something like this, for example?
data <- fread("data_file", nrows_sample = 90000)
This github post suggests one possibility is to do something like this:
fread("shuf -n 5 data_file")
This does not work for me, however. Any ideas?
Using the tidyverse (as opposed to data.table), you could do:
library(readr)
library(purrr)
library(dplyr)
# generate some random numbers between 1 and the number of rows your file has,
# assuming you can ballpark the number of rows in your file
#
# Generating 900 integers because we'll grab 10 rows for each start,
# giving us a total of 9000 rows in the final data frame
start_at <- floor(runif(900, min = 1, max = (n_rows_in_your_file - 10) ))
# sort the index sequentially
start_at <- start_at[order(start_at)]
# Read in 10 rows at a time, starting at your random numbers,
# binding results rowwise into a single data frame
sample_of_rows <- map_dfr(start_at, ~read_csv("data_file", n_max = 10, skip = .x) )
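Since the question asks about fread specifically: fread also takes skip and nrows arguments, so the same start-at-random-offsets idea should work with data.table as well (a sketch, reusing start_at from above):
library(data.table)
# read 10 rows at each random offset and stack the chunks
sample_of_rows <- rbindlist(
  lapply(start_at, function(i) fread("data_file", skip = i, nrows = 10, header = FALSE))
)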
If your data file happens to be a text file, this solution using the LaF package could be useful:
library(LaF)
# Prepare dummy data
mat <- matrix(sample(letters, 10 * 1000000, replace = TRUE), nrow = 1000000)
dim(mat)
#[1] 1000000 10
write.table(mat, "tmp.csv",
row.names = F,
sep = ",",
quote = F)
# Read 90'000 random lines
start <- Sys.time()
random_mat <- sample_lines(filename = "tmp.csv",
n = 90000,
nlines = 1000000)
random_mat <- do.call("rbind",strsplit(random_mat,","))
Sys.time() - start
#Time difference of 1.135546 secs
dim(random_mat)
#[1] 90000 10

R: How to change data in a column across multiple files. Help understanding lapply

I have a folder with about 160 files that are formatted with three columns: onset time, variable 1 'x', and variable 2 'y'. Onset is read into R as a string, but it is a time variable of the form Hour:Minute:Second:FractionalSecond. I need to remove the fractional second. Rounding would be great, but it would also be okay to just remove the fractional second using something like substr(file$onset, 1, 8).
My files are named in a format similar to File001 File002 File054 File1001
onset X Y
00:55:17:95 3 3
00:55:29:66 3 4
00:55:31:43 3 3
01:00:49:24 3 3
01:02:00:03
I am trying to use lapply. lapply seems simple, but I'm having a hard time figuring it out. The code below returns an error that the final line doesn't have 3 elements. For my final output it is important that the last line only has the value for onset.
lapply(files, function(x) {
t <- read.table(x, header=T) # load file
t$onset<-substr(t$onset,1,8)
out <- function(t)
# write to file
write.table(out, "filepath", sep="\t", quote=F, row.names=F, col.names=T)
})
First create a data frame from all the text files; then you can apply the strptime and format functions to the onset vector to remove the fractional second.
filelist <- list.files(pattern = "\\.txt")
alltxt.files <- list() # create a list to populate with table data (if you want to bind all the rows together)
count <- 1
for (file in filelist) {
dat <- read.table(file,header = T)
alltxt.files[[count]] <- dat # add this file's table to the list
count <- count + 1
}
allfiles <- do.call(rbind.data.frame, alltxt.files)
allfiles$onset <- strptime(allfiles$onset,"%H:%M:%S")
allfiles$onset <- format(allfiles$onset,"%H:%M:%S")
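If the goal is instead to keep one cleaned output file per input file, a corrected version of the lapply call from the question could look like the sketch below; fill = TRUE tolerates the final line that only has an onset value, and the "clean_" output names are just an illustration:
files <- list.files(pattern = "^File")   # the File001, File002, ... inputs
lapply(files, function(x) {
  t <- read.table(x, header = TRUE, fill = TRUE,   # fill pads the short last line with NA
                  stringsAsFactors = FALSE)
  t$onset <- substr(t$onset, 1, 8)                 # drop the fractional second
  # write each cleaned table back out (hypothetical naming scheme)
  write.table(t, paste0("clean_", x), sep = "\t",
              quote = FALSE, row.names = FALSE, col.names = TRUE)
})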

merging multiple csv's in R

Hi, I was merging CSVs downloaded from NSE Bhavcopy. Different dates have different numbers of rows: say 26-12-2006 had 998 rows and 27-12-2006 has 1003 rows. Each file has 8 columns. I use cbind to create a and b with just 2 columns, SYMBOL and close price, and I name the columns using colnames so that I can merge by SYMBOL.
Questions:
1) When I use the merge function with by = "SYMBOL", all = F, I was surprised to see the resulting c having 1011 rows. Everywhere I read, merging with all = F should give 998 rows, or at most 1003. I also analyzed the data and found 5 symbols in 27-12-2006 and 3 symbols in 26-12-2006 that differ between the two files. So when we merge by "SYMBOL", will new symbols from both files be added, or will it merge only with the rows already existing in a?
2) NSEmerg is a function using a for loop to read a new file each time and merge it with the existing c. I have about 1535 files with data from Dec 2006 till Apr 2013. However, I was not able to merge more than 12 files, as it throws an error that a vector of size 12 MB cannot be allocated, and it shows warning messages saying the 1535 MB memory allocation is used up. Also, at the 12th file I found nrow(c) to be 1508095, as if the loop were running away. Of all the 1535 files, the highest row count was 1435. Even if we add all stocks delisted or not traded on a specific date, I believe it would not cross 2200 stocks. Why does this show an nrow of 1.5 million?
3) Is there a better way of merging CSVs? This is my first time on Stack Overflow, otherwise I would have attached, say, 10 files.
Code:
a <- read.csv("C://Users/home/desktop/061226.csv", stringsAsFactors = F, header = T)
b <- read.csv("C://Users/home/desktop/061227.csv", stringsAsFactors = F, header = T)
a_date <- a[2,1]
b_date <- b[2,1]
a <- cbind(a[,2],a[,6])
b <- cbind(b[,2], b[,6])
colnames(a) <- c("SYMBOL", a_date)
colnames(b) <- c("SYMBOL", b_date)
c <- merge(a,b,by = "SYMBOL", all = F)
NSEmerg <- function(x,y) {
y_date <- y[2,1]
y <- cbind(y[,2], y[,6])
colnames(y) <- c("SYMBOL", y_date)
c <- merge(c, y, by = "SYMBOL", all = F)
}
filenames = list.files(path = "C:/Users/home/Documents/Rest data", pattern = "*csv")
for (i in 1:length(filenames)){
y <- read.csv(filenames[i], header = T, stringsAsFactors = F)
c <- NSEmerg(c,y)
}
write.csv(c, file = "NSE.csv")
Are you sure you want to cbind and not rbind? To answer your last question: first, list all the .csv files in your folder:
listfiles <- list.files(path="C:/Users/home/desktop", pattern='\\.csv$', full.names=TRUE)
Next, read in the different csv files with lapply and combine them with do.call and rbind.
df <- do.call(rbind, lapply(listfiles , read.csv))
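If you also need to know which date each row came from (so you can later put each date's close price in its own column, as in the question), one possibility is to tag every file with its date before rbinding. This is only a sketch, assuming the date sits in row 2 of column 1 as in the question's a_date <- a[2,1]:
read_one <- function(f) {
  d <- read.csv(f, stringsAsFactors = FALSE, header = TRUE)
  data.frame(SYMBOL = d[, 2], CLOSE = d[, 6], DATE = d[2, 1], stringsAsFactors = FALSE)
}
long <- do.call(rbind, lapply(listfiles, read_one))
# reshape to one close-price column per date if a wide table is needed
wide <- reshape(long, idvar = "SYMBOL", timevar = "DATE", direction = "wide")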
You'd probably be better off just using a perl one-liner:
perl -pe1 file1 file2 file3 ... > newfile
and then you can cut the columns you need out
cut -f1,2 -d"," newfile > result

how can i read a csv file containing some additional text data

I need to read a csv file in R, but the file contains some text information in some rows instead of comma-separated values, so I cannot read that file using read.csv(fileName).
The content of the file is as follows:
name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz
I need to store only the values under each name/date pair as a data frame. How can I read the file to do that?
My required output is:
>dataFrame1
abc,2,saa
anan,3,ds
ama,ds,az
>dataFrame2
snans,32,asa
asa,2,saz
You can read the data with scan and use grep and sub functions to extract the important values.
The text:
text <- "name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"
These commands generate a data frame with name and date values.
# read the text
lines <- scan(text = text, what = character())
# find strings starting with 'name' or 'date'
nameDate <- grep("^name|^date", lines, value = TRUE)
# extract the values
values <- sub("^name:|^date:", "", nameDate)
# create a data frame
dat <- as.data.frame(matrix(values, ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("name", "date"))))
The result:
> dat
name date
1 russel 21-2-1991
2 rus 23-3-1998
Update
To extract the values from the strings that do not contain name and date information, the following commands can be used:
# read data
lines <- readLines(textConnection(text))
# split lines
splitted <- strsplit(lines, ",")
# find positions of 'name' lines
idx <- grep("^name", lines)[-1]
# create grouping variable
grp <- cut(seq_along(lines), c(0, idx, length(lines)))
# extract values
values <- tapply(splitted, grp, FUN = function(x)
  lapply(x, function(y)
    if (length(y) == 3) y))
# create a list of data frames
dat <- lapply(values, function(x) as.data.frame(matrix(unlist(x),
  ncol = 3, byrow = TRUE)))
The result:
> dat
$`(0,7]`
V1 V2 V3
1 abc 2 saa
2 anan 3 ds
3 ama ds az
$`(7,9]`
V1 V2 V3
1 snans 32 asa
2 asa 2 saz
I would read the entire file first as a vector of character strings, i.e. one string for each line in the file; this can be done using readLines. Next you have to find the places where the data for a new name/date block starts, i.e. look for ',,' (see grep for that). Then take the first entry of each data block, e.g. using str_extract from the stringr package. Finally, you need to split all the remaining data strings; see strsplit for that.
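A rough sketch of that outline, reusing the text object defined in the answer above (it assumes every block except possibly the last is terminated by a ',,' line):
library(stringr)
lines <- readLines(textConnection(text))
# header lines and the ",," separators that end each block
hdr_idx <- grep("^name", lines)
sep_idx <- c(grep("^,,$", lines), length(lines) + 1)
# name and date pulled from the header lines with str_extract
names_dates <- data.frame(
  name = str_extract(lines[hdr_idx], "(?<=name:)\\S+"),
  date = str_extract(lines[hdr_idx], "(?<=date:)\\S+")
)
# one data frame per block, splitting the remaining rows on commas
blocks <- lapply(seq_along(hdr_idx), function(i) {
  block <- lines[(hdr_idx[i] + 1):(sep_idx[i] - 1)]
  as.data.frame(do.call(rbind, strsplit(block, ",")), stringsAsFactors = FALSE)
})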
