I wrote some R code to run analysis on my research project. I coded it in such a way that there was an output text file with the status of the program. Now the header of the output file looks like this:
start time: 2014-10-23 19:15:04
starting analysis on state model: 16
current correlation state: 1
>>>em_prod_combs
em_prod_combs
H3K18Ac_H3K4me1 1.040493e-50
H3K18Ac_H3K4me2 3.208806e-77
H3K18Ac_H3K4me3 0.0001307375
H3K18Ac_H3K9Ac 0.001904384
The `>>>em_prod_combs` marker is on line 4, and on line 5 the name is repeated (R code). I'd like the data starting from line 6. This data goes on for 36 more rows, so it ends at line 42. Then there is other text in the file all the way to line 742, which looks like this:
(742) >>>em_prod_combs
(743) em_actual_perc
(744) H3K18Ac_H3K4me1 0
H3K18Ac_H3K4me2 0
H3K18Ac_H3K4me3 0.0001976819
H3K18Ac_H3K9Ac 0.001690382
And again I'd like to select data from line 744 (actual data, not headers) and go for another 36 rows and end at line 780. Here is my part of the code:
filepath <- paste(folder_directory, corr_folders[fi], filename, sep="" )
con <- file(filepath)
open(con);
results.list <- list();
current.line <- 0
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
if(line==">>>em_prod_combs"){
storethenext <- TRUE
}
}
close(con)
Here, I was trying to see if the line read had the ">>>" mark. If so, set a variable to TRUE and store the next 36 lines (using another counter variable) in a data frame or list and set the storethenext variable back to F. I was kind of hoping that there is a better way of doing this....
So I realized that readLines has an n parameter, so on an open connection I can read and throw away the lines I want to skip. Based on that, I got this:
df <- data.frame(name = character(40),
                 params = numeric(40),
                 stringsAsFactors = FALSE)
con <- file(filepath)
open(con);
results.list <- list();
current.line <- 0
firstblock <- readLines(con, n = 5, warn = FALSE)
firstblock <- NULL #throwaway
firstblock <- readLines(con, n = 36, warn = FALSE)
firstblock <- as.list(firstblock) #convert to list
for(k in 1:36){
splitstring = strsplit(firstblock[[k]], " ", fixed=TRUE)
## put the data in the df
}
But it turns out from Ben's answer that read.table can do the same thing in one line, so I've reduced it down to the following one-liner:
firstblock2 <- read.table(filepath, header = FALSE, sep = " ", skip = 5, nrows = 36)
This also makes it a data frame implicitly and does all the dirty work for me.
The documentation for read.table is here:
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
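Since the marked blocks recur further down the file (line 742 and so on), hard-coding skip for each block gets tedious. Here is a minimal sketch of an alternative, assuming the marker text and the two-column layout shown in the question. read_marked_blocks is a hypothetical helper name, and the demo vector stands in for readLines(filepath):

```r
# Hypothetical helper: return one data frame per block that follows a marker line.
# Assumes each marker line is followed by one name line and then n_rows data rows.
read_marked_blocks <- function(lines, marker = ">>>em_prod_combs", n_rows = 36) {
  starts <- which(lines == marker)
  lapply(starts, function(s) {
    # s is the marker line, s + 1 repeats the object name, data starts at s + 2
    block <- lines[(s + 2):(s + 1 + n_rows)]
    read.table(text = block, header = FALSE,
               col.names = c("name", "value"), stringsAsFactors = FALSE)
  })
}

# Tiny stand-in for the real file, with 2 data rows per block instead of 36
demo <- c("start time: 2014-10-23 19:15:04",
          ">>>em_prod_combs", "em_prod_combs",
          "H3K18Ac_H3K4me1 1.040493e-50", "H3K18Ac_H3K4me2 3.208806e-77",
          "some other text",
          ">>>em_prod_combs", "em_actual_perc",
          "H3K18Ac_H3K4me1 0", "H3K18Ac_H3K4me2 0")
blocks <- read_marked_blocks(demo, n_rows = 2)
```

This reads the whole file into memory first, which is fine for a status file of a few hundred lines but would not suit a very large one.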
In tidyverse (readr)
If you don't wish to convert the data into a data frame, you can just read the slice of text with read_lines().
(Note that the n_max argument is the number of lines you want to read in, not the number of the row you want to stop at. Ultimately I found this preferable too, as I usually need to manage how much of the file is read more than I need to isolate sections on the way in.)
firstblock <- read_lines(filepath, skip = 5, n_max = 31)
If you would rather think in terms of start and end line numbers than line counts, you could modify your code thus:
start_line = 5
end_line = 36
line_count = end_line - start_line
firstblock <- read_lines(filepath, skip = start_line, n_max = line_count)
In any case, here are some additional things I found helpful for working with these files once I got to know them a little better:
If you want to read the lines straight into a list, as you did above, just use:
read_lines_raw(filepath, skip = 5, n_max = 31)
and you will get a 31-element list (one raw vector per line) as your firstblock, in lieu of the character vector that you get with read_lines().
Additional super-cool features I stumbled upon there (and was moved to share - because I thought they rock):
automatically decompresses .gz, .bz2, .xz, or .zip files.
will make the connection and complete the download automatically if the filepath starts with http://, https://, ftp://, or ftps://.
If the file has both an extension and prefix as above, then it does both.
If things need to go back where they came from, the write_lines function turned out to be much more enjoyable to use than the base version. Specifically, there are no file connections to open and close: just specify the object and the filename you wish to write it into.
So, for example, the below:
fileConn <- file("directory/new.sql")
writeLines(new_sql, fileConn)
close(fileConn)
gets to just be:
write_lines(new_sql, "directory/new.sql")
Enjoy, and hope this helps!
I have many .csv files of different sizes. I select the ones that meet a condition (those matching my id in the example). They are ordered by date and can be huge. I need to know the minimum and maximum dates of these files.
I can read all of the wanted files, keeping only the date.hour column, and then easily find the minimum and maximum of all the date values.
But it would be a lot faster, as I repeat this for thousands of ids, if I could read only the first and last rows of my files.
Does anyone have an idea of how to solve this?
This code works well, but I wish to improve it.
Function to read several files at once:
read.tables.simple <- function(file.names, ...) {
  require(plyr)
  ldply(file.names, function(fn) data.frame(read.table(fn, ...)))
}
Reading the files and selecting the minimum and maximum dates across all of them:
diri <- dir()
dat <- read.tables.simple(diri[1], header = TRUE, sep = ";", colClasses = "character")
colclass <- rep("NULL", ncol(dat))
x <- which(colnames(dat) == "date.hour")
colclass[x] <- "character"
x <- grep("id", diri)
dat <- read.tables.simple(diri[x], header = TRUE, sep = ";", colClasses = colclass)
datmin <- min(dat$date.hour)
datmax <- max(dat$date.hour)
In general, read.table is slow on large files. If you use read_tsv, read_csv or read_delim from the readr package, it will already be much faster.
If you are on Linux/Mac OS, you can also read only the first or last parts by setting up a pipe, which will be more or less instant, no matter how large your file is. Let's assume you have no column headers:
library(readr)
read_last <- function(file) {
read_tsv(pipe(paste('tail -n 1', file)), col_names=FALSE)
}
# Readr can already read only a select number of lines, use `n_max`
first <- read_tsv(file, n_max=1, col_names=FALSE)
If you want to go in on parallelism, you can even read files in parallel, see e.g., library(parallel) and ?mclapply
The following function will read the first two lines of your csv (the header row and the first data row), then seek to the end of the file and read the last line. It will then stick these three lines together to read them as a two-row csv in memory, from which it returns the column date.hour. This will have your minimum and maximum values, since the times are arranged in order.
You need to tell the function the maximum line length. It's OK if you over-estimate this, but make sure the number is less than a third of your file size.
read_head_tail <- function(file_path, line_length = 100)
{
con <- file(file_path)
open(con)
seek(con, where = 0)
first <- suppressWarnings(readChar(con, nchars = 2 * line_length))
first <- strsplit(first, "\n")[[1]][1:2]
seek(con, where = file.info(file_path)$size - line_length)
last <- suppressWarnings(readChar(con, nchars = line_length))
last <- strsplit(last, "\n")[[1]]
last <- last[length(last)]
close(con)
csv <- paste(paste0(first, collapse = "\n"), last, sep = "\n")
df <- read.csv(text = csv, stringsAsFactors = FALSE)[-1]
return(df$date.hour)
}
Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file is embedded in the code below.
How can I read this file so that I end up with a data frame (or tibble) with the appropriate column names and data?
All the text-to-data functions I know get stymied by those header lines. Here's what I just tried, riffing off of this SO Q&A. My thought was to read the file into a list of lines, then drop the lines that start with # from the list, then do.call(rbind, ...) the rest. The downloading part at the top works fine, but when I run the function, I'm getting back an empty list.
temp <- paste0(tempfile(), ".txt")
download.file("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt",
destfile = temp, mode = "wb")
processFile = function(filepath) {
dat_list <- list()
con = file(filepath, "r")
while ( TRUE ) {
line = readLines(con, n = 1)
if ( length(line) == 0 ) {
break
}
append(dat_list, line)
}
close(con)
return(dat_list)
}
dat_list <- processFile(temp)
Here's a possible alternative
processFile = function(filepath, header=TRUE, ...) {
lines <- readLines(filepath)
comments <- which(grepl("^#", lines))
header_row <- gsub("^#","",lines[tail(comments,1)])
data <- read.table(text=c(header_row, lines[-comments]), header=header, ...)
return(data)
}
processFile(temp)
The idea is that we read in all the lines, find the ones that start with "#" and ignore them except for the last one which will be used as the header. We remove the "#" from the header (otherwise it's usually treated as a comment) and then pass it off to read.table to parse the data.
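As a side note on why the loop in the question returns an empty list: append() returns a modified copy and does not change dat_list in place, so its result has to be assigned back. A minimal sketch of the loop with that one change follows; collect_lines is a hypothetical name and the temp file is made up for the demo:

```r
# Same line-by-line loop as in the question, but keeping append()'s return value.
collect_lines <- function(filepath) {
  dat_list <- list()
  con <- file(filepath, "r")
  on.exit(close(con))
  while (length(line <- readLines(con, n = 1)) > 0) {
    dat_list <- append(dat_list, line)  # assign the returned copy back
  }
  dat_list
}

# Small made-up file standing in for the downloaded one
tmp <- tempfile(fileext = ".txt")
writeLines(c("# comment", "1 2", "3 4"), tmp)
res <- collect_lines(tmp)
```

Growing a list one element at a time is slow for long files, so the readLines(filepath) approach above is still the better choice.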
Here are a few options that bypass your function and that you can mix & match.
In the easiest (albeit unlikely) scenario where you know the column names already, you can use read.table and enter the column names manually. The default option of comment.char = "#" means those comment lines will be omitted.
read.table(temp, col.names = c("year", "month", "day", "cycle", "trend"))
More likely is that you don't know those column names, but can get them by figuring out how many comment lines there are, then reading just the last of those lines. That saves you having to read more of the file than you need; this is a small enough file that it shouldn't make a huge difference, but in a larger file it might. I'm doing the counting by accessing the command line, only because that's the way I know how. Note also that I saved the file at an easier path; you could instead paste the command together with the temp variable.
Again, the comments are omitted by default.
n_comments <- as.numeric(system("grep '^# ' co2.txt | wc -l", intern = TRUE))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)
Or with dplyr and stringr, read all the lines, separate out the comments to extract column names, then filter to remove the comment lines and separate into fields, assigning the column names you've just pulled out. Again, with a bigger file, this could become burdensome.
library(dplyr)
lines <- data.frame(text = readLines(temp), stringsAsFactors = FALSE)
comments <- lines %>%
filter(stringr::str_detect(text, "^#"))
hdrs <- strsplit(comments[nrow(comments), 1], "\\s+")[[1]][-1]
lines %>%
filter(!stringr::str_detect(text, "^#")) %>%
mutate(text = trimws(text)) %>%
tidyr::separate(text, into = hdrs, sep = "\\s+") %>%
mutate_all(as.numeric)
My script reads in a list of text files from a folder. A calculation for all values in a few columns in each text file is made.
At the end I want to write the resulting data.frame into a new text file in a different location.
The problem is, that the script keeps overwriting the file it created before. So I end up with only one file (the last one that was read in).
But I don't get what I am doing wrong here. The output file name is different each time, so in my head it should produce separate files.
The script looks as follows:
RAW <- "C:/path/tofiles"
files <- list.files(RAW, full.names = TRUE)
for(j in length(files)) {
if(file.exists(files[[j]])){
data <- read.csv(files[[j]], skip = 0, header=FALSE)
data[9] <- do.call(cbind,lapply(data[9], function(x){(data[9]*0.01701)/0.00848}))
data[11] <- do.call(cbind,lapply(data[11], function(x){(data[11]*0.01834)/0.00848}))
data[13] <- do.call(cbind,lapply(data[13], function(x){(data[13]*0.00982)/0.00848}))
data[15] <- do.call(cbind,lapply(data[15], function(x){(data[15]*0.01011)/0.00848}))
OUT <- paste("C:/path/to/destination_folder",basename(files[[j]]),sep="")
write.table(data, OUT, sep=",", row.names = FALSE, col.names = FALSE, append = FALSE)
}
}
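The problem is in your for loop. length(files) returns a single value, the length of your files vector, so the loop body runs only once (for the last file), while I think you want a sequence of that length.
Try for(j in seq_along(files)), or loop over the files directly with for(f in files) and use f in place of files[[j]].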
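A minimal sketch of the corrected loop, using throwaway temp folders and made-up two-column files in place of the real paths, and a single column calculation to keep it short. All names here are for illustration only:

```r
# Throwaway input and output folders standing in for the real paths
raw_dir <- tempfile("raw_")
out_dir <- tempfile("out_")
dir.create(raw_dir)
dir.create(out_dir)
for (nm in c("one.csv", "two.csv")) {
  write.table(data.frame(a = 1:2, b = 3:4), file.path(raw_dir, nm),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

files <- list.files(raw_dir, full.names = TRUE)
for (j in seq_along(files)) {          # 1, 2, ..., not just length(files)
  data <- read.csv(files[[j]], header = FALSE)
  data[2] <- data[2] * 0.01701 / 0.00848
  out <- file.path(out_dir, basename(files[[j]]))  # file.path adds the separator
  write.table(data, out, sep = ",", row.names = FALSE, col.names = FALSE)
}
written <- list.files(out_dir)
```

Using file.path() for the output name also avoids gluing the folder name and file name together without a slash, which paste(..., sep = "") does.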
I'm working with 12 large data files, all of which hover between 3 and 5 GB, so I was turning to RSQLite for import and initial selection. Giving a reproducible example in this case is difficult, so if you can come up with anything, that would be great.
If I take a small set of the data, read it in, and write it to a table, I get exactly what I want:
con <- dbConnect("SQLite", dbname = "R2")
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow=100, header=TRUE)
dbWriteTable(con, name = "Chr1test", value = data)
> dbListFields(con, "Chr1test")
[1] "row_names" "CHR_A" "BP_A" "SNP_A" "CHR_B" "BP_B" "SNP_B" "R2"
> dbGetQuery(con, "SELECT * FROM Chr1test LIMIT 2")
row_names CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
1 1 1 1579 SNP-1.578. 1 2097 SNP-1.1096. 0.07223050
2 2 1 1579 SNP-1.578. 1 2553 SNP-1.1552. 0.00763724
If I read in all of my data directly to a table, though, my columns aren't separated correctly. I've tried both sep = " " and sep = "\t", but both give the same column separation
dbWriteTable(con, name = "Chr1", value ="chr1.ld", header = TRUE)
> dbListFields(con, "Chr1")
[1] "CHR_A_________BP_A______________SNP_A__CHR_B_________BP_B______________SNP_B___________R
I can tell that it's clearly some sort of delimiter issue, but I've exhausted my ideas on how to fix it. Has anyone run into this before?
Edit, update:
It seems as though this works:
n <- 1000000
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow = n, header = TRUE)
col_names <- names(data)  # later chunks have no header line of their own
con_data <- dbConnect("SQLite", dbname = "R2")
while (nrow(data) == n){
  dbWriteTable(con_data, data, name = "ch1", append = TRUE)
  # header = FALSE, otherwise the first row of every chunk is consumed as a header
  data <- read.table(f, nrow = n, header = FALSE, col.names = col_names)
}
close(f)
if (nrow(data) != 0){
  dbWriteTable(con_data, data, name = "ch1", append = TRUE)
}
Though I can't quite figure out why just writing the table through SQLite is a problem. Possibly a memory issue.
I am guessing that your big file is causing a free memory issue (see Memory Usage under docs for read.table). It would have been helpful to show us the first few lines of chr1.ld (on *nix systems you just say "head -n 5 chr1.ld" to get the first five lines).
If it is a memory issue, then you might try sipping the file as a work-around rather than gulping it whole.
Determine or estimate the number of lines in chr1.ld (on *nix systems, say "wc -l chr1.ld").
Let's say your file has 100,000 lines.
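f <- "chr1.ld"
sip.size <- 100
n.lines <- 100000  # estimated number of lines
# read the header once so that every chunk gets the same column names
hdr <- names(read.table(f, nrows = 1, header = TRUE))
for (i in seq(0, n.lines - 1, by = sip.size)) {
  # skip the header line plus the i data lines already written
  data <- read.table(f, nrows = sip.size, skip = i + 1, header = FALSE, col.names = hdr)
  dbWriteTable(con, name = "SippyCup", value = data, append = TRUE)
}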
You'll probably see warnings at the end but the data should make it through. If you have character data that read.table is trying to factor, this kludge will be unsatisfactory unless there are only a few factors, all of which are guaranteed to occur in every chunk. You may need to tell read.table not to factor those columns or use some other method to look at all possible factors so you can list them for read.table. (On *nix, split out one column and pipe it to uniq.)
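If the line count is not known in advance, another way to sip is to keep one connection open and stop when it runs dry. This is only a sketch under assumptions: read_in_chunks and handle_chunk are hypothetical names, the file is assumed to be whitespace-delimited with a single header line, and the demo data is made up.

```r
# Read a whitespace-delimited file in chunks from one open connection.
# handle_chunk is called once per chunk with a data frame.
read_in_chunks <- function(path, chunk_size, handle_chunk) {
  con <- file(path, "r")
  on.exit(close(con))
  # parse the header line once; chunks are then read without a header
  hdr <- strsplit(trimws(readLines(con, n = 1)), "\\s+")[[1]]
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break  # end of file reached
    handle_chunk(read.table(text = lines, header = FALSE, col.names = hdr))
  }
  invisible(NULL)
}

# Demo on a small made-up file: count rows and chunks instead of writing to a database
tmp <- tempfile(fileext = ".ld")
write.table(data.frame(CHR_A = 1, BP_A = 1:5, R2 = c(0.1, 0.2, 0.3, 0.4, 0.5)),
            tmp, row.names = FALSE, quote = FALSE)
rows_seen <- 0
chunks_seen <- 0
read_in_chunks(tmp, chunk_size = 2, handle_chunk = function(d) {
  rows_seen <<- rows_seen + nrow(d)
  chunks_seen <<- chunks_seen + 1
})
```

In the setting of this question, handle_chunk could be something like function(d) dbWriteTable(con, name = "ch1", value = d, append = TRUE), which avoids both the repeated skip from the top of the file and the need to know the line count.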
I am engaged in data cleaning. I have a function that identifies bad rows in a large input file (too big to read at one go, given my ram size) and returns the row numbers of the bad rows as a vector badRows. This function seems to work.
I am now trying to read just the bad rows into a data frame, so far unsuccessfully.
My current approach is to use read.table on an open connection to my file, using a vector of the number of rows to skip between each row that is read. This number is zero for consecutive bad rows.
I calculate skipVec as:
(badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers)-1)]))-1
But for the moment I am just handing my function a skipVec vector of all zeros.
If my logic is correct, this should return all the rows. It does not. Instead I get an error:
"Error in read.table(con, skip = pass, nrow = 1, header = TRUE, sep =
"") : no lines available in input"
My current function is loosely based on a function by Miron Kursa ("mbq"), which I found here.
My question is somewhat duplicative of that one, but I assume his function works, so I have broken it somehow. I am still trying to understand the difference between opening a file and opening a connection to a file, and I suspect that the problem is there somewhere, or in my use of lapply.
I am running R 3.0.1 under RStudio 0.97.551 on a cranky old Windows XP SP3 machine with 3gig of ram. Stone Age, I know.
Here is the code that produces the error message above:
# Make a small test data frame, write it to a file, and read it back in
# a row at a time.
testThis.DF <- data.frame(nnn=c(2,3,5), fff=c("aa", "bb", "cc"))
testThis.DF
# This function will work only if the number of bad rows is not too big for memory
write.table(testThis.DF, "testThis.DF")
con<-file("testThis.DF")
open(con)
skipVec <- c(0,0,0)
badRows.DF <- lapply(skipVec, FUN=function(pass){
read.table(con, skip=pass, nrow=1, header=TRUE, sep="") })
close(con)
The error occurs before the close command. If I yank the read.table command out of the lapply and the function and just stick it in by itself, I still get the same error.
If instead of running read.table through lapply you just run the first few iterations manually, you will see what is going on:
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
nnn fff
1 2 aa
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
X2 X3 bb
1 3 5 cc
Because header = TRUE, it is not one line that is read at each iteration but two (one as the header, one as data), so you run out of lines faster than you think, here on the third iteration:
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
Error in read.table(con, skip = 0, nrow = 1, header = TRUE, sep = "") :
no lines available in input
Now this might still not be a very efficient way of solving your problem, but this is how you can fix your current code:
write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
skipVec <- c(0,1,0)
badRows <- lapply(skipVec, function(pass){
line <- read.table(con, nrow = 1, header = FALSE, sep = "",
row.names = 1)
if (pass) NULL else line
})
badRows.DF <- setNames(do.call(rbind, badRows), header)
close(con)
Some clues towards higher speeds:
Use scan instead of read.table. Read data as character and only at the end, after you have put your data into a character matrix or data.frame, apply type.convert to each column.
Instead of looping over skipVec, loop over its rle if it is much shorter, so that you can read or skip chunks of lines at a time.
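One way to act on these hints, sketched under assumptions: skip each gap between bad rows with a single readLines call, collect the wanted lines as text, and parse them once at the end. read_rows is a hypothetical helper; it assumes the row numbers are sorted, unique, and within the file, and that the file was written by write.table with its defaults (row names in the first column).

```r
# Read only the given data rows (1-based, sorted) from a file written by write.table.
read_rows <- function(path, rows) {
  con <- file(path, "r")
  on.exit(close(con))
  header <- scan(text = readLines(con, n = 1), what = character(), quiet = TRUE)
  gaps <- diff(c(0, rows)) - 1          # lines to skip before each wanted row
  picked <- vapply(gaps, function(g) {
    if (g > 0) readLines(con, n = g)    # skip the whole gap in one call
    readLines(con, n = 1)               # keep the wanted line as text
  }, character(1))
  # parse all kept lines in one go; the first column holds the row names
  read.table(text = picked, header = FALSE, row.names = 1,
             col.names = c("row", header), stringsAsFactors = FALSE)
}

tmp <- tempfile()
write.table(data.frame(nnn = c(2, 3, 5), fff = c("aa", "bb", "cc")), tmp)
bad <- read_rows(tmp, c(1, 3))
```

Here gaps plays the role of skipVec from the question, and the single read.table call at the end replaces one call per row.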