I have a file that is laid out in the following way:
# Query ID 1
# note
# note
tab delimited data across 12 columns
# Query ID 2
# note
# note
tab delimited data across 12 columns
I'd like to import this data into R so that each query is its own data frame, ideally as a list of data frames with the query ID as the name of each item in the list. I've been searching for a while, but I haven't seen a good way to do this. Is this possible?
Thanks
We have used commas instead of tabs to make the example easier to read, and we have put the body of the file in a string, but aside from making the obvious changes (noted in the comments) the approach is the same. First we use readLines to read in the file, then determine where the headers are and create a grp vector with one element per line of the file, whose value is the header for that line. Finally we split the lines by group and apply Read to each group.
# test data
Lines <- "# Query ID 1
# note
# note
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
# Query ID 2
# note
# note
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12"
L <- readLines(textConnection(Lines))   # L <- readLines("myfile")
isHdr <- grepl("Query", L)              # TRUE on header lines
grp <- L[isHdr][cumsum(isHdr)]          # header associated with each line
# Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, comment.char = "#")
Read <- function(x) read.table(text = x, sep = ",", fill = TRUE, comment.char = "#")
Map(Read, split(L, grp))
giving:
$`# Query ID 1`
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 1 2 3 4 5 6 7 8 9 10 11 12
2 1 2 3 4 5 6 7 8 9 10 11 12
$`# Query ID 2`
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 1 2 3 4 5 6 7 8 9 10 11 12
2 1 2 3 4 5 6 7 8 9 10 11 12
No packages needed.
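Optionally, if you would rather have just the query IDs as list names, without the leading comment marker, you can clean them up afterwards. A small follow-up sketch, assuming the headers all start with "# ":

result <- Map(Read, split(L, grp))
# drop the leading "# " so "# Query ID 1" becomes "Query ID 1"
names(result) <- sub("^#\\s*", "", names(result))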
I have a .csv data set which is separated by "," and has about 5,000 rows and 5 columns.
However, some rows also contain extra trailing commas, for example:
2660,11-01-2016,70.75,05-06-2013,I,,,
4080,26-02-2016,59.36,,D
Thus, when I try to read it with read_delim(), it throws warnings, though the result itself seems fine, for example:
Warning: 7 parsing failures.
 row  expected    actual               file
 309 5 columns 8 columns 'data/my_data.csv'
 523 5 columns 7 columns 'data/my_data.csv'
 588 5 columns 8 columns 'data/my_data.csv'
1661 5 columns 9 columns 'data/my_data.csv'
1877 5 columns 7 columns 'data/my_data.csv'
Is there any way for me to tackle this problem?
I guess I could use read_lines() and process the lines one by one, then turn them into a data frame.
Do you have any other way to deal with such a situation?
1) read.table with fill=TRUE. Using fill = TRUE with read.table results in no warnings:
Lines <- "2660,11-01-2016,70.75,05-06-2013,I,,,
4080,26-02-2016,59.36,,D"
# replace text = Lines with your filename
read.table(text = Lines, sep = ",", fill = TRUE)
giving:
    V1         V2    V3         V4 V5 V6 V7 V8
1 2660 11-01-2016 70.75 05-06-2013  I NA NA NA
2 4080 26-02-2016 59.36             D NA NA NA
2) replace the first 4 commas with semicolons. Another approach would be:
# replace textConnection(Lines) with your filename
L <- readLines(textConnection(Lines))
for(i in 1:4) L <- sub(",", ";", L)   # each pass replaces the first remaining comma on every line
read.table(text = L, sep = ";")
giving:
V1 V2 V3 V4 V5
1 2660 11-01-2016 70.75 05-06-2013 I,,,
2 4080 26-02-2016 59.36 D
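Equivalently, the four-pass loop can be collapsed into a single substitution. A sketch, assuming each line has at least five fields:

# rewrite the first four commas of each line as semicolons in one call
L <- readLines(textConnection(Lines))   # or your filename
L2 <- sub("^([^,]*),([^,]*),([^,]*),([^,]*),", "\\1;\\2;\\3;\\4;", L)
read.table(text = L2, sep = ";")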
3) remove commas at the end of lines. Another possibility is to remove the trailing commas before reading; here readtest.csv stands for your input file. (If you are on Windows, sed is included in the Rtools distribution.)
read.table(pipe("sed -e s/,*$// readtest.csv"), sep = ",")
giving:
V1 V2 V3 V4 V5
1 2660 11-01-2016 70.75 05-06-2013 I
2 4080 26-02-2016 59.36 D
3a) similar to (3) but without sed
# replace textConnection(Lines) with your filename
L <- readLines(textConnection(Lines))
read.table(text = sub(",*$", "", L), sep = ",")
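4) pad short lines. In the spirit of the read_lines idea in the question, you could also pad every line with commas up to the maximum field count and then read the result. A sketch (as before, substitute your filename for textConnection(Lines)):

L <- readLines(textConnection(Lines))
k <- count.fields(textConnection(Lines), sep = ",")  # fields per line
L2 <- paste0(L, strrep(",", max(k) - k))             # pad short lines with commas
read.table(text = L2, sep = ",")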
I am trying to read in and combine multiple text files in R. The issue is that the field separators differ between files (e.g. tab for one and comma for another). How can I combine these efficiently? An example of the layout:
Data1 (tab):
v1 v2 v3 v4 v5
1 2 3 4 urban
4 5 3 2 city
Data2 (comma):
v1,v2,v3,v4,v5
5,6,7,8,rural
6,4,3,1,city
This example is obviously simplified; the real data has nearly half a million points, so I cannot reshape the original files. The code I have used so far is:
library(plyr)
filelist <- list.files(path = "~/Documents/", pattern = "\\.dat$", full.names = TRUE)
data1 <- ldply(filelist, function(x) read.csv(x, sep = "\t"))
data2 <- ldply(filelist, function(x) read.csv(x, sep = ","))
This gives me the data both ways, which I then need to clean manually and combine. Is there a way of using sep that avoids this? Column names are the same among files. I know that stringr or other string functions may be useful, but I also need to load the data in at the same time, and am unsure how to set this up within the read commands.
I would suggest using fread from the "data.table" package. It's fast, and does a pretty good job of automatically detecting a delimiter in a file.
Here's an example:
## Create some example files
cat('v1\tv2\tv3\tv4\tv5\n1\t2\t3\t4\turban\n4\t5\t3\t2\tcity\n', file = "file1.dat")
cat('v1,v2,v3,v4,v5\n5,6,7,8,rural\n6,4,3,1,city\n', file = "file2.dat")
## Get a character vector of the file names
files <- list.files(pattern = "\\.dat$") ## Use what you're already doing
library(data.table)
lapply(files, fread)
# [[1]]
# v1 v2 v3 v4 v5
# 1: 1 2 3 4 urban
# 2: 4 5 3 2 city
#
# [[2]]
# v1 v2 v3 v4 v5
# 1: 5 6 7 8 rural
# 2: 6 4 3 1 city
## Fancy work: Bind it all to one data.table...
## with a column indicating where the file came from....
rbindlist(setNames(lapply(files, fread), files), idcol = TRUE)
# .id v1 v2 v3 v4 v5
# 1: file1.dat 1 2 3 4 urban
# 2: file1.dat 4 5 3 2 city
# 3: file2.dat 5 6 7 8 rural
# 4: file2.dat 6 4 3 1 city
You can also add an if clause into your function:
data <- ldply(filelist, function(x) {
  if (grepl(",", readLines(x, n = 1))) {
    read.csv(x, sep = ",")
  } else {
    read.csv(x, sep = "\t")
  }
})
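If more than two separators are possible, the same idea generalizes to a small detection helper. A sketch, with detect_sep as a hypothetical name, which guesses the separator from whichever candidate appears in a file's first line:

library(plyr)

detect_sep <- function(file, candidates = c(",", "\t", ";")) {
  first <- readLines(file, n = 1)
  hit <- vapply(candidates, grepl, logical(1), x = first)
  candidates[hit][1]   # first candidate that occurs in the header line
}

data <- ldply(filelist, function(x) read.csv(x, sep = detect_sep(x)))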
I have this line in one of my functions, result[result > 0.05] <- "", which replaces all values in my data frame greater than 0.05, including the row names in the first column. How can I avoid this?
This is a fast way too:
df <- as.data.frame(matrix(runif(100), nrow = 10))
df[-1][df[-1] > 0.05] <- ''   # drop the first column from both the test and the assignment
Output:
> df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0.60105471
2 0.63340567
3 0.11625581
4 0.96227379 0.0173133104108274
5 0.07333583
6 0.05474430 0.0228175506927073
7 0.62610309
8 0.76867090
9 0.76684615 0.0459537433926016
10 0.83312158
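Applied to the line in the question, the same indexing keeps the first column (and the row names it holds) out of both the comparison and the assignment:

result[-1][result[-1] > 0.05] <- ""   # first column is untouched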
This is the first time I have encountered this problem using read.table: for rows with a very large number of columns, read.table wraps the column entries onto the next rows.
I have a .txt file with rows of variable and unequal length. For reference this is the .txt file I am reading: http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/4.0/c5.bp.v4.0.symbols.gmt
Here is my code:
tabsep <- gsub("\\\\t", "\t", "\\t")
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = tabsep)
Partial output: first columns
V1 V2 V3 V4 V5 V6
1 TRNA_PROCESSING http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1 FARS2
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2 SLC9A7
3 DNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4 RAD51C
4 AMINO_SUGAR_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA GNPDA1
5 BIOPOLYMER_CATABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD USE1
6 RNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP
7 INTS6 LSM5 LSM4 LSM3 LSM1
8 CRK
9 GLUCAN_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS GCK PYGM GSK3B
10 PROTEIN_POLYUBIQUITINATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION ERCC8 HUWE1 DZIP3
...
Partial output: last columns
V403 V404 V405 V406 V407 V408 V409 V410 V411 V412 V413 V414 V415 V416 V417 V418 V419 V420 V421
1
2 CALCA CALCB FAM107A CDK11A RASGRP4 CDK11B SYN3 GP1BA TNN ENO1 PTPRC MTL5 ISOC2 RHAG VWF GPI HPX SLC5A7 F2R
3
4
5
6 IRF2 IRF3 SLC2A4RG LSM6 XRCC6 INTS1 HOXD13 RP9 INTS2 ZNF638 INTS3 ZNF254 CITED1 CITED2 INTS9 INTS8 INTS5 INTS4 INTS7
7 POU1F1 TCF7L2 TNFRSF1A NPAS2 HAND1 HAND2 NUDT21 APEX1 ENO1 ERF DTX1 SOX30 CBY1 DIS3 SP1 SP2 SP3 SP4 NFIC
8
9
10
For instance, the column entries for row 6 get wrapped to fill rows 7 and 8. I only seem to have this problem for rows with a very large number of columns. This occurs for other .txt files as well, but they break at different column numbers. I inspected the row entries where the break happens and there are no unusual characters in them (they are all standard upper-case gene symbols).
I have tried both read.table and read.delim with the same result. If I convert the .txt file to .csv first and use the same code, I do not have this problem (see below for the equivalent output). But I don't want to convert each file to .csv first, and really I just want to understand what is going on.
Correct output if I convert to .csv file:
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = ",")
V1 V2 V3 V4 V5 V6
1 TRNA_PROCESSING http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1 FARS2 METTL1
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2 SLC9A7 PTGS2
3 DNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4 RAD51C XRCC3
4 AMINO_SUGAR_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA GNPDA1 GNE
5 BIOPOLYMER_CATABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD USE1 RNASEH1
6 RNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP MED24
7 GLUCAN_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS GCK PYGM GSK3B EPM2A
8 PROTEIN_POLYUBIQUITINATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION ERCC8 HUWE1 DZIP3 DDB2
9 PROTEIN_OLIGOMERIZATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_OLIGOMERIZATION SYT1 AASS TP63 HPRT1
To elaborate on my comment...
From the help page to read.table:
The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).
To work around this with unknown datasets, use count.fields to determine the maximum number of fields in the file, and use that to create col.names for read.table:
x <- max(count.fields("~/Downloads/c5.bp.v4.0.symbols.gmt", "\t"))
Names <- paste("V", sequence(x), sep = "")
y <- read.table("~/Downloads/c5.bp.v4.0.symbols.gmt", col.names=Names, sep = "\t", fill = TRUE)
Inspect the first few lines. I'll leave the actual full inspection to you.
y[1:6, 1:10]
# V1
# 1 TRNA_PROCESSING
# 2 REGULATION_OF_BIOLOGICAL_QUALITY
# 3 DNA_METABOLIC_PROCESS
# 4 AMINO_SUGAR_METABOLIC_PROCESS
# 5 BIOPOLYMER_CATABOLIC_PROCESS
# 6 RNA_METABOLIC_PROCESS
# V2 V3 V4
# 1 http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1
# 2 http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2
# 3 http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4
# 4 http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA
# 5 http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD
# 6 http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD
# V5 V6 V7 V8 V9 V10
# 1 FARS2 METTL1 SARS AARS THG1L SSB
# 2 SLC9A7 PTGS2 PTGS1 MPV17 SGMS1 AGTR1
# 3 RAD51C XRCC3 XRCC2 XRCC6 ISG20 PRIM1
# 4 GNPDA1 GNE CSGALNACT1 CHST2 CHST4 CHST5
# 5 USE1 RNASEH1 RNF217 ISG20 CDKN2A CPA2
# 6 SYNCRIP MED24 RORB MED23 REST MED21
nrow(y)
# [1] 825
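If you want to see how ragged a file is before reading it, you can also tabulate the per-line field counts with count.fields directly; a quick check:

fields <- count.fields("~/Downloads/c5.bp.v4.0.symbols.gmt", sep = "\t")
range(fields)                                  # shortest and longest lines
head(sort(table(fields), decreasing = TRUE))   # most common field counts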
Here's a minimal example for those who don't want to download the other file to try it out.
Create a 6-line CSV file where the last line has more fields than the first 5 lines and try to use read.table on it:
cat("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4",
"1,2,3,4", "1,2,3,4,5", file = "test1.txt",
sep = "\n")
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE)
# V1 V2 V3 V4
# 1 1 2 3 4
# 2 1 2 3 4
# 3 1 2 3 4
# 4 1 2 3 4
# 5 1 2 3 4
# 6 1 2 3 4
# 7 5 NA NA NA
Note the difference if the longest line falls within the first five lines of the file:
cat("1,2,3,4", "1,2,3,4,5", "1,2,3,4", "1,2,3,4",
"1,2,3,4", "1,2,3,4", file = "test2.txt",
sep = "\n")
read.table("test2.txt", header = FALSE, sep = ",", fill = TRUE)
# V1 V2 V3 V4 V5
# 1 1 2 3 4 NA
# 2 1 2 3 4 5
# 3 1 2 3 4 NA
# 4 1 2 3 4 NA
# 5 1 2 3 4 NA
# 6 1 2 3 4 NA
To fix the problem, we use count.fields, which returns a vector with the number of fields detected in each line. We take the max of that and use it to build the col.names argument for read.table.
x <- count.fields("test1.txt", sep=",")
x
# [1] 4 4 4 4 4 5
read.table("test.txt", header = FALSE, sep = ",", fill = TRUE,
col.names = paste("V", sequence(max(x)), sep = ""))
# V1 V2 V3 V4 V5
# 1 1 2 3 4 NA
# 2 1 2 3 4 NA
# 3 1 2 3 4 NA
# 4 1 2 3 4 NA
# 5 1 2 3 4 NA
# 6 1 2 3 4 5
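If you run into this often, the whole workaround wraps up into a small helper; a sketch, with read_ragged as a hypothetical name:

read_ragged <- function(file, sep = ",", ...) {
  n <- max(count.fields(file, sep = sep))   # widest line anywhere in the file
  read.table(file, header = FALSE, sep = sep, fill = TRUE,
             col.names = paste0("V", seq_len(n)), ...)
}
read_ragged("test1.txt")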
Let's say I have this txt file:
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
"ZZ",2
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
With read.csv I can:
> read.csv("linktofile.txt", fill=T, header=F)
V1 V2 V3 V4 V5 V6 V7
1 AA 3 3 3 3 NA NA
2 CC ad 2 2 2 2 2
3 ZZ 2 NA NA NA NA NA
4 AA 3 3 3 3 NA NA
5 CC ad 2 2 2 2 2
However fread gives
> library(data.table)
> fread("linktofile.txt")
V1 V2 V3 V4 V5 V6 V7
1: CC ad 2 2 2 2 2
Can I get the same result with fread?
Major update
It looks like the development plans for fread changed, and fread has now gained a fill argument.
Using the same sample data from the end of this answer, here's what I get:
library(data.table)
packageVersion("data.table")
# [1] ‘1.9.7’
fread(x, fill = TRUE)
# V1 V2 V3 V4 V5 V6 V7
# 1: AA 3 3 3 3 NA NA
# 2: CC ad 2 2 2 2 2
# 3: ZZ 2 NA NA NA NA NA
# 4: AA 3 3 3 3 NA NA
# 5: CC ad 2 2 2 2 2
Install the development version of "data.table" with:
install.packages("data.table",
repos = "https://Rdatatable.github.io/data.table",
type = "source")
Original answer
This doesn't answer your question about fread; that question has already been addressed by @Matt.
It does, however, give you an alternative to consider that should give you good speed improvements over base R's read.csv.
Unlike fread, you will have to help these functions out a little by providing them with some information about the data you are trying to read.
You can use the input.file function from "iotools". By specifying the column types, you can tell the formatter function how many columns to expect.
library(iotools)
input.file(x, formatter = dstrsplit, sep = ",",
col_types = rep("character", max(count.fields(x, ","))))
Sample data
x <- tempfile()
myvec <- c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2', '"AA",3,3,3,3', '"CC","ad",2,2,2,2,2')
cat(myvec, file = x, sep = "\n")
## Uncomment for bigger sample data
## cat(rep(myvec, 200000), file = x, sep = "\n")
Not currently; I wasn't aware of read.csv's fill feature. The plan was to add the ability to read dual-delimited files (sep2 as well as sep, as mentioned in ?fread). Then variable-length vectors could be read into a list column where each cell is itself a vector, but not padded with NA.
Could you add it to the list please? That way you'll get notified when its status changes.
Are there many irregular data formats like this out there? I only recall ever seeing regular files, where the incomplete lines would be considered an error.
UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) once sep2 is implemented; they would not be filled across separate columns the way read.csv does it.