Read.table and dbWriteTable result in different output?

I'm working with 12 large data files, each hovering between 3 and 5 GB, so I turned to RSQLite for import and initial selection. Giving a reproducible example in this case is difficult, so any suggestions you can come up with would be great.
If I take a small set of the data, read it in, and write it to a table, I get exactly what I want:
con <- dbConnect("SQLite", dbname = "R2")
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow=100, header=TRUE)
dbWriteTable(con, name = "Chr1test", value = data)
> dbListFields(con, "Chr1test")
[1] "row_names" "CHR_A" "BP_A" "SNP_A" "CHR_B" "BP_B" "SNP_B" "R2"
> dbGetQuery(con, "SELECT * FROM Chr1test LIMIT 2")
row_names CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
1 1 1 1579 SNP-1.578. 1 2097 SNP-1.1096. 0.07223050
2 2 1 1579 SNP-1.578. 1 2553 SNP-1.1552. 0.00763724
If I read all of my data directly into a table, though, my columns aren't separated correctly. I've tried both sep = " " and sep = "\t", but both give the same column separation:
dbWriteTable(con, name = "Chr1", value ="chr1.ld", header = TRUE)
> dbListFields(con, "Chr1")
[1] "CHR_A_________BP_A______________SNP_A__CHR_B_________BP_B______________SNP_B___________R
I can tell that it's clearly some sort of delimiter issue, but I've exhausted my ideas on how to fix it. Has anyone run into this before?
Edit, update:
It seems as though this works:
n <- 1000000
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow = n, header = TRUE)
cols <- names(data)
con_data <- dbConnect("SQLite", dbname = "R2")
while (nrow(data) == n) {
  dbWriteTable(con_data, name = "ch1", value = data, append = TRUE)
  # later chunks have no header line, so reuse the column names from the first chunk
  data <- read.table(f, nrow = n, header = FALSE, col.names = cols)
}
close(f)
if (nrow(data) != 0) {
  dbWriteTable(con_data, name = "ch1", value = data, append = TRUE)
}
Though I can't quite figure out why writing the table directly through RSQLite is a problem. Possibly a memory issue.

I am guessing that your big file is causing a free memory issue (see the Memory Usage section in the docs for read.table). It would have been helpful to show us the first few lines of chr1.ld (on *nix systems you can just run "head -n 5 chr1.ld" to get the first five lines).
If it is a memory issue, then you might try sipping the file as a work-around rather than gulping it whole.
Determine or estimate the number of lines in chr1.ld (on *nix systems, say "wc -l chr1.ld").
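If you would rather count the lines from R than shell out to wc, here is a minimal base-R sketch that reads the file in chunks so memory stays small (the chunk size is arbitrary):
n.lines <- 0
con.count <- file("chr1.ld", open = "r")
repeat {
  chunk <- readLines(con.count, n = 100000, warn = FALSE)
  if (length(chunk) == 0) break
  n.lines <- n.lines + length(chunk)
}
close(con.count)
n.lines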
Let's say your file has 100,000 lines.
sip.size <- 100
cols <- names(read.table("chr1.ld", nrow = 1, header = TRUE))
for (i in seq(0, 100000, sip.size)) {
  # skip the header line plus the i rows already loaded; each later chunk has no header
  data <- read.table("chr1.ld", nrow = sip.size, skip = i + 1,
                     header = FALSE, col.names = cols)
  dbWriteTable(con, name = "SippyCup", value = data, append = TRUE)
}
You'll probably see warnings at the end but the data should make it through. If you have character data that read.table is trying to factor, this kludge will be unsatisfactory unless there are only a few factors, all of which are guaranteed to occur in every chunk. You may need to tell read.table not to factor those columns or use some other method to look at all possible factors so you can list them for read.table. (On *nix, split out one column and pipe it to uniq.)
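If the SNP columns are what read.table keeps trying to factor, one way to switch that off is to fix the column classes up front. A minimal sketch of the read call inside the loop above; the column names come from the question's output, but the types here are my assumptions:
ld.classes <- c(CHR_A = "integer", BP_A = "integer", SNP_A = "character",
                CHR_B = "integer", BP_B = "integer", SNP_B = "character",
                R2 = "numeric")
data <- read.table("chr1.ld", nrow = sip.size, skip = i + 1, header = FALSE,
                   col.names = names(ld.classes), colClasses = ld.classes)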

Related

read only the first and last row of several files at once with R

I have many .csv files of different sizes. I select some of them that match a condition (those matching my id in the example). They are ordered by date and can be huge. I need to know the minimum and maximum dates across these files.
I can read all of the selected files, keeping only the date.hour column, and then easily find the minimum and maximum of all the date values.
But since I repeat this for thousands of ids, it would be a lot faster if I could read only the first and last rows of each file.
Does anyone have an idea of how to solve this?
This code works well, but I would like to improve it.
A function to read several files at once:
read.tables.simple <- function(file.names, ...) {
  require(plyr)
  ldply(file.names, function(fn) data.frame(read.table(fn, ...)))
}
Reading the files and selecting the minimum and maximum dates across all of these:
diri <- dir()
dat <- read.tables.simple(diri[1], header = TRUE, sep = ";", colClasses = "character")
colclass <- rep("NULL", ncol(dat))
x <- which(colnames(dat) == "date.hour")
colclass[x] <- "character"
x <- grep("id", diri)
dat <- read.tables.simple(diri[x], header = TRUE, sep = ";", colClasses = colclass)
datmin <- min(dat$date.hour)
datmax <- max(dat$date.hour)
In general, read.table is very slow. If you use read_tsv, read_csv or read_delim from the readr library, it will already be much, much faster.
If you are on Linux/Mac OS, you can also read only the first or last parts by setting up a pipe, which will be more or less instant, no matter how large your file is. Let's assume you have no column headers:
library(readr)
read_last <- function(file) {
  read_tsv(pipe(paste('tail -n 1', file)), col_names = FALSE)
}
# Readr can already read only a select number of lines, use `n_max`
first <- read_tsv(file, n_max=1, col_names=FALSE)
If you want to go all in on parallelism, you can even read the files in parallel; see library(parallel) and ?mclapply.
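A rough sketch of that, reusing read_last from above and the diri and x objects from the question (the core count here is just an assumption, tune it to your machine):
library(parallel)
last_rows  <- mclapply(diri[x], read_last, mc.cores = 4)
first_rows <- mclapply(diri[x], function(f) read_tsv(f, n_max = 1, col_names = FALSE),
                       mc.cores = 4)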
The following function will read the first two lines of your csv (the header row and the first data row), then seek to the end of the file and read the last line. It will then stick these three lines together to read them as a two-row csv in memory, from which it returns the column date.hour. This will contain your minimum and maximum values, since the times are arranged in order.
You need to tell the function the maximum line length. It's OK if you over-estimate this, but make sure the number is less than a third of your file size.
read_head_tail <- function(file_path, line_length = 100) {
  con <- file(file_path)
  open(con)
  seek(con, where = 0)
  first <- suppressWarnings(readChar(con, nchars = 2 * line_length))
  first <- strsplit(first, "\n")[[1]][1:2]
  seek(con, where = file.info(file_path)$size - line_length)
  last <- suppressWarnings(readChar(con, nchars = line_length))
  last <- strsplit(last, "\n")[[1]]
  last <- last[length(last)]
  close(con)
  csv <- paste(paste0(first, collapse = "\n"), last, sep = "\n")
  df <- read.csv(text = csv, stringsAsFactors = FALSE)[-1]
  return(df$date.hour)
}
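Possible usage over the matching files, following the same min/max logic as the question (diri and the id-based index x are assumed to exist already, and the files are assumed to parse with read.csv's defaults):
hours <- lapply(diri[x], read_head_tail)
datmin <- min(unlist(hours))
datmax <- max(unlist(hours))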

searching multiple txt files for data and reporting result to new table

I have thousands of txt files containing Mass, %Base data. I need to search each file for a row within a specific mass range, then report that row into a new table with the filename as an additional column. The goal is a table of (Mass, %Base, Filename) for all of the text files based on the condition of the search.
Existing File example for file1name.txt:
Mass %Base
100 .1
101 26.2
...
900 0
Goal:
Mass %Base File
375.004 98 file1name
375.003 96 file2name
My current code is:
library(tidyverse)
library(readr)
#setwd to where data is located
setwd("Z:/Dnigra")
#set path where data is located
path <- "Z:/Dnigra"
mc <- 375.3 #mc is the calculated target mass
limit<- 0.1 # the width of the search window
#finds the files with the correct extensions
fs <-list.files(path, pattern=glob2rx("*.txt$"))
for (f in fs){
fname <- file.path(path, f)
df <- read_tsv(fname,col_names=FALSE, skip =1)
#filters the data that includes the target mass
df <- between(mc,limit,limit)
#create new data based on contents
allSpectra <- data.frame(df,f)
#write new data to sep file
write.table(allSpectra ,"allwobble.csv",
append= T,
sep=",",
row = F
)
}
The end result is a table with:
df f
FALSE filename
Also errors:
Parsed with column specification: cols( X1 = col_character(), X2 = col_character() ) Warning: 2536 parsing failures.
I think there may be a few things here to address:
First, with read_tsv you might want to specify the column types as double if appropriate, so values are not read in as character strings. This would affect your ability to filter and subset based on Mass.
Next, the between statement has the syntax of:
between(x, left, right)
where x <= right and x >= left. If you want to keep rows whose Mass is between 375.2 and 375.4, you want between(X1, mc - limit, mc + limit) instead. Note that since no header was read in, the Mass variable comes in as the first column, X1.
When you use write.table and append, you might want to set col.names to FALSE (or include header on first write).
Hope this is helpful to you.
for (f in fs) {
  fname <- file.path(path, f)
  df <- read_tsv(fname, col_names = FALSE, skip = 1, col_types = "dd")
  # filter the data to rows that include the target mass
  df <- filter(df, between(X1, mc - limit, mc + limit))
  # create new data based on contents
  allSpectra <- data.frame(df, f)
  # write new data to a separate file
  write.table(allSpectra, "allwobble.csv",
              append = TRUE,
              sep = ",",
              row.names = FALSE,
              col.names = FALSE)
}
Thanks @Ben. I had gotten to that point last night and had added a tolerance calculation. The "dd" definitely helped, but it required col_names to get past another error. The final code is below. A parsing error comes up, but it does what it needs to do!
tol <- .02  # the width of the search window
mmneg <- mc - tol
mmpos <- mc + tol
# finds the files with the correct extensions
fs <- list.files(path, pattern = glob2rx("*.txt$"))
for (f in fs) {
  fname <- file.path(path, f)
  df <- read_tsv(fname, skip = 1, skip_empty_rows = TRUE, col_types = "dd",
                 col_names = c("X1", "X2"))
  # filters the data around the target peak
  df <- filter(df, between(X1, mmneg, mmpos))
  # create new data based on contents
  allSpectra <- data.frame(df, f)
  # write new data to a separate file
  write.table(allSpectra, "Caviunin_20_.csv",
              append = TRUE,
              sep = ",",
              row.names = FALSE,
              col.names = FALSE)
}
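Incidentally, those parsing failures don't have to stay a mystery: readr attaches them to the returned data frame, and problems() retrieves them. A quick check, reusing fname and the read_tsv call from the loop above:
df <- read_tsv(fname, skip = 1, skip_empty_rows = TRUE,
               col_types = "dd", col_names = c("X1", "X2"))
problems(df)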

selecting text from the middle of a text file with known line numbers

I wrote some R code to run analysis on my research project. I coded it in such a way that there was an output text file with the status of the program. Now the header of the output file looks like this:
start time: 2014-10-23 19:15:04
starting analysis on state model: 16
current correlation state: 1
>>>em_prod_combs
em_prod_combs
H3K18Ac_H3K4me1 1.040493e-50
H3K18Ac_H3K4me2 3.208806e-77
H3K18Ac_H3K4me3 0.0001307375
H3K18Ac_H3K9Ac 0.001904384
The ">>>em_prod_combs" is on line 4; line 5 repeats it again (R code). I'd like the data starting from line 6. This data goes on for 36 more rows, so it ends at line 42. Then there is some other text in the file all the way to line 742, which looks like this:
(742) >>>em_prod_combs
(743) em_actual_perc
(744) H3K18Ac_H3K4me1 0
H3K18Ac_H3K4me2 0
H3K18Ac_H3K4me3 0.0001976819
H3K18Ac_H3K9Ac 0.001690382
And again I'd like to select the data starting from line 744 (actual data, not headers), going for another 36 rows and ending at line 780. Here is the relevant part of my code:
filepath <- paste(folder_directory, corr_folders[fi], filename, sep="" )
con <- file(filepath)
open(con);
results.list <- list();
current.line <- 0
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  if (line == ">>>em_prod_combs") {
    storethenext <- TRUE
  }
}
close(con)
Here, I was trying to see if the line read had the ">>>" mark. If so, set a variable to TRUE and store the next 36 lines (using another counter variable) in a data frame or list, then set the storethenext variable back to FALSE. I was kind of hoping that there is a better way of doing this....
So I realized that readLines has a parameter you can use to skip past lines. Based on that, I got this:
df <- data.frame(name = character(40),
                 params = numeric(40),
                 stringsAsFactors = FALSE)
con <- file(filepath)
open(con)
results.list <- list()
current.line <- 0
firstblock <- readLines(con, n = 5, warn = FALSE)
firstblock <- NULL  # throwaway
firstblock <- readLines(con, n = 36, warn = FALSE)
firstblock <- as.list(firstblock)  # convert to list
for (k in 1:36) {
  splitstring <- strsplit(firstblock[[k]], " ", fixed = TRUE)
  ## put the data in the df
}
But it turns out from Ben's answer that read.table can do the same thing in one line, so I've reduced it down to the following one-liner:
firstblock2 <- read.table(filepath, header = FALSE, sep = " ", skip = 5, nrows = 36)
This also makes it a data frame implicitly and does all the dirty work for me.
The documentation for read.table is here:
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
In tidyverse (readr)
If you don't wish to convert the data into a data frame, you can just read the slice of text with read_lines().
(Note that the n_max argument asks for how many lines you want to read in, not the number of the row you want to stop at. Ultimately I found this preferable, as I usually need to manage the file length more than I need to isolate sections on the read-in.)
firstblock <- read_lines(filepath, skip = 5, n_max = 31)
If you don't want to think in terms of file size, you could modify your code thus:
start_line = 5
end_line = 36
line_count = end_line - start_line
firstblock <- read_lines(filepath, skip = start_line, n_max = line_count)
In any case, here are additional things I found helpful for working with these files once I got to know readr a little better.
If you want to convert files immediately into lists as you did above, just use:
read_lines_raw(filepath, skip = 5, n_max = 31)
and you will get a 31-element list as your firstblock object, in lieu of the character vector that you get with read_lines().
Additional super-cool features I stumbled upon (and was moved to share because I thought they rock):
It automatically decompresses .gz, .bz2, .xz, or .zip files.
It will make the connection and complete the download automatically if the filepath starts with http://, https://, ftp://, or ftps://.
If the file has both a compression extension and a URL prefix as above, then it does both.
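For example (both paths here are hypothetical):
block_local  <- read_lines("results/status_log.txt.gz", skip = 5, n_max = 31)
block_remote <- read_lines("https://example.com/status_log.txt.gz", skip = 5, n_max = 31)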
If things need to go back where they came from, the write_lines function turned out to be much more enjoyable to use than the base version. Specifically, there are no file connections to open and close: just specify the object and the filename you wish to write it into.
So, for example, this:
fileConn <- file("directory/new.sql")
writeLines(new_sql, fileConn)
close(fileConn)
becomes just:
write_lines(new_sql, "directory/new.sql")
Enjoy, and hope this helps!

How can I read selected rows from a large file using the R "readLines" command and write them to a data frame?

I am engaged in data cleaning. I have a function that identifies bad rows in a large input file (too big to read at one go, given my ram size) and returns the row numbers of the bad rows as a vector badRows. This function seems to work.
I am now trying to read just the bad rows into a data frame, so far unsuccessfully.
My current approach is to use read.table on an open connection to my file, using a vector of the number of rows to skip between each row that is read. This number is zero for consecutive bad rows.
I calculate skipVec as:
(badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers) - 1)])) - 1
But for the moment I am just handing my function a skipVec vector of all zeros.
If my logic is correct, this should return all the rows. It does not. Instead I get an error:
"Error in read.table(con, skip = pass, nrow = 1, header = TRUE, sep =
"") : no lines available in input"
My current function is loosely based on a function by Miron Kursa ("mbq"), which I found here.
My question is somewhat duplicative of that one, but I assume his function works, so I have broken it somehow. I am still trying to understand the difference between opening a file and opening a connection to a file, and I suspect that the problem is there somewhere, or in my use of lapply.
I am running R 3.0.1 under RStudio 0.97.551 on a cranky old Windows XP SP3 machine with 3gig of ram. Stone Age, I know.
Here is the code that produces the error message above:
# Make a small small test data frame, write it to a file, and read it back in
# a row at a time.
testThis.DF <- data.frame(nnn=c(2,3,5), fff=c("aa", "bb", "cc"))
testThis.DF
# This function will work only if the number of bad rows is not too big for memory
write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
skipVec <- c(0, 0, 0)
badRows.DF <- lapply(skipVec, FUN = function(pass) {
  read.table(con, skip = pass, nrow = 1, header = TRUE, sep = "")
})
close(con)
The error occurs before the close command. If I yank the read.table command out of the lapply and the function and just run it by itself, I still get the same error.
If instead of running read.table through lapply you just run the first few iterations manually, you will see what is going on:
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
nnn fff
1 2 aa
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
X2 X3 bb
1 3 5 cc
Because header = TRUE, it is not one line that is read at each iteration but two, so you eventually run out of lines faster than you think, here on the third iteration:
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
Error in read.table(con, skip = 0, nrow = 1, header = TRUE, sep = "") :
no lines available in input
Now this might still not be a very efficient way of solving your problem, but this is how you can fix your current code:
write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
skipVec <- c(0,1,0)
badRows <- lapply(skipVec, function(pass) {
  line <- read.table(con, nrow = 1, header = FALSE, sep = "",
                     row.names = 1)
  if (pass) NULL else line
})
badRows.DF <- setNames(do.call(rbind, badRows), header)
close(con)
Some clues towards higher speeds:
use scan instead of read.table. Read data as character and only at the end, after you have put your data into a character matrix or data.frame, apply type.convert to each column.
Instead of looping over skipVec, loop over its rle if it is much shorter; that way you can read or skip whole chunks of lines at a time, as in the sketch below.
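A sketch of the rle idea, assuming the header line has already been consumed (as in the fixed code above) and that totalRows, the total number of data rows in the file, is known or estimated:
keep <- seq_len(totalRows) %in% badRowNumbers
runs <- rle(keep)
pieces <- list()
for (j in seq_along(runs$lengths)) {
  if (runs$values[j]) {
    # a run of bad rows: read them all in one call
    pieces[[length(pieces) + 1]] <-
      read.table(con, nrow = runs$lengths[j], header = FALSE, sep = "")
  } else {
    # a run of good rows: skip them without parsing
    readLines(con, n = runs$lengths[j])
  }
}
badRows.DF <- do.call(rbind, pieces)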

read.table returns extra rows

I am working with text files of many long rows with varying numbers of elements. The elements in each row are separated by \t and, of course, the rows are terminated by \n. I'm using read.table to read the text files. An example sample file is this: https://www.dropbox.com/s/6utslbnwerwhi58/samplefile.txt
The sample file has 60 rows.
Code to read the file:
sampleData <- read.table("samplefile.txt", as.is=TRUE, fill = TRUE);
dim(sampleData);
The dim returns 70 rows when in fact it should be 60. When I try nrows=60 like
sampleData <- read.table("samplefile.txt", as.is=TRUE, fill = TRUE, nrows = 60);
dim(sampleData);
it does work; however, I don't know whether doing so deletes some of the information. My suspicion is that the last portions of some of the rows are added as new rows. I don't know why that would be the case, however, as I have fill = TRUE.
I have also tried
na.strings = "NA", fill=TRUE, strip.white=TRUE, blank.lines.skip =
TRUE, stringsAsFactors=FALSE, quote = "", comment.char = ""
but to no avail.
Does anyone have any idea what might be going on?
In the absence of a reproducible example, try something like this:
# Make some fake data
R <- c("1 2 3 4","2 3 4","4 5 6 7 8")
writeLines(R, "samplefile.txt")
# read line by line
r <- readLines("samplefile.txt")
# split by sep
sp <- strsplit(r, " ")
# Make each into a list of dataframes (for rbind.fill)
sp <- lapply(sp, function(x)as.data.frame(t(x)))
# now bind
library(plyr)
rbind.fill(sp)
If this is similar to your actual problem, anyway.
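It can also help to confirm the diagnosis first: read.table determines the number of columns from the first five lines of the file (see ?read.table), so with fill = TRUE a later line with more fields can get split across extra rows. count.fields shows the spread of field counts per line:
fields <- count.fields("samplefile.txt")
table(fields)  # distribution of field counts per line
max(fields)    # lines wider than the first five lines are the ones producing extra rows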
