read.table ignores some \t and \n in a table - r

I've encountered a strange problem when reading in an annotation file from an RNAseq experiment.
I am trying to read in the tab-separated file (http://we.tl/qCjv4N3LF2) and then search the annotations (in the fourth column) for the pattern "bahd", to find entries like "bahd acyltransferase dcr" and then display all the IDs (first column) that belong to these entries. The code is:
ShootAnnot<-read.table("annotation1.txt",sep="\t")
matches<-grep("bahd",ShootAnnot[,4],ignore.case=TRUE)
ShootAnnot[matches,1]
Weirdly, I noticed this did not find all the gene annotations that I know to be there - only 9 matches out of 12 in the file. When I scanned the table for the missing entries I found one line in the file where it seems R failed to interpret the separation patterns "\t" and "\n" for a bit.
look at line 4825 in the dataset:
ShootAnnot[4825,]
for some reason, the sixth cell in that line contains a big chunk of data, with many complete lines and the appropriate "\t" and "\n" cell and line separation patterns all in one cell. Then it suddenly jumps back into separating cells and lines correctly.
I have got a bunch of these files, so I would like to make sure I can resolve any issues like that automatically. Any ideas what might be causing this?
Thanks!

I'm not sure why it goes haywire (maybe a DOS CR/LF thing), but the file is pretty big and if you plug it into data.table you will get a pretty decent speedup just from reading the data.
library(data.table)
ShootAnnot <- fread("~/Downloads/annotation1.txt")
ShootAnnot[like(Blast2GO_GO_Description,"bahd"), "#ID", with=FALSE]
which will give you
#ID
1: c112902_g1_i1_m.105401
2: c11459_g1_i1_m.4290
3: c11459_g2_i1_m.4292
4: c186946_g1_i1_m.110882
5: c24956_g1_i1_m.8768
6: c265515_g1_i1_m.117383
7: c28096_g1_i1_m.10253
8: c37936_g1_i1_m.14867
9: c40683_g1_i1_m.17292
10: c54651_g1_i1_m.34709
11: c54651_g2_i1_m.34711
12: c921_g1_i1_m.351
(you don't have any non-lower-case "bahd"'s in your file)

Related

How to make a line of code span multiple lines in R where a new-line isn't created? Picture of desired outcome included

I'm trying to make a comment spanning multiple lines, but want them all to be the same line.
As in, Line 1 starts a comment that goes on to cover more lines, but each new line doesn't need a new '#' and Line 2 doesn't start until the comment is done.
Picture shows what I want in lines 3 and 5
You can use strings!
"
Module 4 Lecture 3: 911 call times
You have 250 datapoints of the time it takes to answer and resolve a
911 call saved in a csv file called 911TaskTimes.csv
You want...
"
Here's more information

fread() error and strange behaviour when reading csv

I used fread() from data.table library to try read a 540MB csv file. It returned an error message saying:
' ends field 36 on line 4 when detecting types: 20.00,8/25/2006 0:00:00,"07:05:00 PM","CST",143.00,"OTTAWA","KS","HAIL",1.00,"S","MINNEAPOLIS",8/25/2006 0:00:00,"07:05:00 PM",0.00,,1.00,"S","MINNEAPOLIS",0.00,0.00,,88.00,0.00,0.00,0.00,,0.00,,"TOP","KANSAS, East",,3907.00,9743.00,3907.00,9743.00,"Dime to nickel sized hail.
I have no idea what caused the error and want to track down if it's a bug or just some data formating issue that I can tweak fread() to process.
I managed to read the csv using read.csv(), and decided to track down the row that triggered the error above (line 617174, not line 4 as the error message above). I then re-output the row and one row each immediately preceding and following the offending row, written out using write.csv() as testout.csv
I was able to read back testout.csv using read.csv() creating a data frame with 3 observations, as expected. Using fread() on testout.csv, however, resulted in a data table with only 1 observation, which is the last row.
The four lines in testout.csv are below (I start a new line for each entry below for readability).
"STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM"
20,"8/25/2006 0:00:00","07:01:00 PM","CST",139,"OSAGE","KS","TSTM WIND",5,"WNW","OSAGE CITY","8/25/2006 0:00:00","07:01:00 PM",0,NA,5,"WNW","OSAGE CITY",0,0,NA,52,0,0,0,"",0,"","TOP","KANSAS, East","",3840,9554,3840,9554,".",617129
20,"8/25/2006 0:00:00","07:05:00 PM","CST",143,"OTTAWA","KS","HAIL",1,"S","MINNEAPOLIS","8/25/2006 0:00:00","07:05:00 PM",0,NA,1,"S","MINNEAPOLIS",0,0,NA,88,0,0,0,"",0,"","TOP","KANSAS, East","",3907,9743,3907,9743,"Dime to nickel sized hail.
.",617130
20,"8/25/2006 0:00:00","07:07:00 PM","CST",125,"MONTGOMERY","KS","TSTM WIND",3,"N","COFFEYVILLE","8/25/2006 0:00:00","07:07:00 PM",0,NA,3,"N","COFFEYVILLE",0,0,NA,61,0,0,0,"",0,"","ICT","KANSAS, Southeast","",3705,9538,3705,9538,"",617131
When I ran fread("testout.csv", sep=",", verbose=TRUE), the output was
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 1.05E-06B
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 5 (the last non blank line in the first 'autostart') ... found ok
Found 37 columns
First row with 37 fields occurs on line 5 (either column names or first row of data)
Some fields on line 5 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 2
Subtracted 1 for last eol and any trailing empty lines, leaving 1 data rows
Type codes: 1444144414444111441111111414444111141 (first 5 rows)
Type codes: 1444144414444111441111111414444111141 (after applying colClasses and integer64)
Type codes: 1444144414444111441111111414444111141 (after applying drop or select (if supplied)
Any idea what may have caused the unexpected results, and the error in the first place? And any way around it? Just to be clear, my aim is to be able to use fread() to read the main file, even though read.csv() works so far.
UPDATE: Now fixed in v1.9.3 on GitHub :
fread() now accepts line breaks inside quoted fields. Thanks to Clayton Stanley for highlighting.See:
fread and a quoted multi-line column value
Windows users are reporting success with the latest version from GitHub.

un-quote an R string?

TL;DR
I have a snippet of text
str <- '"foo\\dar embedded \\\"quote\\\""'
# cat(str, '\n') # gives
# "foo\dar embedded \"quote\""
# i.e. as if the above had been written to a CSV with quoting turned on.
I want to end up with the string:
str <- 'foo\\dar embedded "quote"'
# cat(str, '\n') # gives
# foo\dar embedded "quote"
essentially removing one "layer" of quoting. How may I do this?
(Initial attempt -- eval(parse(text=str)), which works unless you have something like \\dar, where you get the error "\d is an unrecognized escape in character string ...").
Gory details (optional)
The reason my strings are quoted once-too-many times is I kludged some data processing -- I wrote str (well, a dataframe in my case) to a table with quoting enabled, but forgot that many of the columns in my dataframe had embedded newlines with embedded quotes (i.e. forgot to escape/remove them).
It turns out that when I read.table a file with multiple columns in the same row that have embedded newlines and embedded quotes (or something like that), the function fails (fair enough).
I had since closed my R session so my only access to my data was through my munged CSV. So I wrote some spaghetti code to simply readLines my CSV and split everything up to reconstruct my dataframe again. However, since all my character columns were quoted in the CSV, I have a few columns in my restored dataframe that are still quoted that I want to unquote.
Messy, I know. I'll remember to save an original version of the data next time (save, saveRDS).
For those interested, the header row and three rows of my CSV are shown below (all the characters are ASCII)
"quote";"id";"date";"author";"context"
"< mwk> I tried to fix the bug I mentioned, but I accidentally ascended the character I started for testing... hoped she'd die soon and I could get to coding, but alas I was wrong";"< mwk> I tried to fix the bug I mentioned, but I accidentally ascended the character I started for testing... hoped she'd die soon and I could get to coding, but alas I was wrong";"February 28, 2013";"nhqdb";"nhqdb"
"< intx14> \"A gush of water hits the air elemental on the central core!\"
< intx14> What is this, a weather forecast?";"< intx14> \"A gush of water hits the air elemental on the central core!\"
< intx14> What is this, a weather forecast?";"February 28, 2013";"nhqdb";"nhqdb"
"< bcode> n - a spherical amulet. You are lucky! Full moon tonight.
< bcode> That must be a sign - I'll put it on! What could possibly go wrong...
< oracle\devnull> DIED : bcode2 (Wiz-Elf-Mal-Cha) 0 points, killed by strangulation on pcs1.nethack.devnull.net";"< bcode> n - a spherical amulet. You are lucky! Full moon tonight.
< bcode> That must be a sign - I'll put it on! What could possibly go wrong...
< oracle\devnull> DIED : bcode2 (Wiz-Elf-Mal-Cha) 0 points, killed by strangulation on pcs1.nethack.devnull.net";"February 28, 2013";"nhqdb";"nhqdb"
The first two columns of each row are the same, being the quote (the first row has no embedded newlines in the quote; the second and third do). Separator is ';'.
> read.table('test.csv', sep=';', header=T)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 5 elements
# same for with ,allowEscape=T
Use regular expressions:
str <- gsub('^"|"$', '', gsub('\\\"', '"', str, fixed = TRUE))
[EDIT 3: the OP has posted three separate versions of this - two of them irreproducible, interspersed with complaining. Due to this timewasting behavior and several people downvoting, I'm leaving the original answer to version 2 of the question.]
EDIT 1: My solution to the second version of the OP's question was this:
txt <- read.csv('escaped.csv', header=T, allowEscapes=T, sep=';')
EDIT 2: We now get a third version. Finally some reproducible code after 36 minutes asking and waiting. Due to the behavior of the OP and other posters I'm not inclined to waste more time on this. I'm going to complain about both of your behavior on MSO. Downvote yourselves silly.
ORIGINAL:
gsub is the ugly way.
Use read.csv(..., allowEscapes=TRUE, quote=..., encoding=...) arguments. See the manpage, section on Encoding
If you want actual code, you need to give us a full line or two of your CSV file.
See also SO: "How to detect the right encoding for read.csv?"
Quoting the relevant part of your question:
The reason my strings are quoted once-too-many times is I kludged some
data processing -- I wrote str (well, a dataframe in my case) to a
table with quoting enabled, but forgot that many of the columns in my
dataframe had embedded newlines within quotes (i.e. forgot to
escape/remove them).
It turns out that when I read.table a file with multiple columns in
the same row that have embedded newlines within quotes, the function
fails (fair enough).

skip and autostart in fread

I am using the following code to read a file with the data.table library:
fread(myfile, header=FALSE, sep=",", skip=100, colClasses=c("character","numeric","NULL","numeric"))
but I get the following error:
The supplied 'sep' was not found on line 80. To read the file as a single character column set sep='\n'.
It says it did not find sep on line 80, however I set skip=100 so it should not pay attention to the first 100 lines.
UPDATE:
I tried with skip=101 and it worked but it skips the first line where the data starts
I am using version 1.9.2 of the data.table package and R version 3.02 64 bit on windows 7
We don't know the version number you're using, but I can make a guess in this case.
Try setting autostart=101.
Note the first paragraph of Details in ?fread :
Once the separator is found on line autostart, the number of columns is determined. Then the file is searched backwards from autostart until a row is found that doesn't have that number of columns. Thus, the first data row is found and any human readable banners are automatically skipped. This feature can be particularly useful for loading a set of files which may not all have consistently sized banners. Setting skip>0 overrides this feature by setting autostart=skip+1 and turning off the search upwards step.
the skip argument has :
If -1 (default) use the procedure described below starting on line autostart to find the first data row. skip>=0 means ignore autostart and take line skip+1 as the first data row (or column names according to header="auto"|TRUE|FALSE as usual). skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).
and the autostart argument has :
Any line number within the region of machine readable delimited text, by default 30. If the file is shorter or this line is empty (e.g. short files with trailing blank lines) then the last non empty line (with a non empty line above that) is used. This line and the lines above it are used to auto detect sep, sep2 and the number of fields. It's extremely unlikely that autostart should ever need to be changed, we hope.
In your case perhaps the human readable header is much larger than 30 rows, which is why I guess setting autostart=101 might work. No need to use skip.
One motivation is for convenience when a file contains multiple tables. By setting autostart to any row inside the table that you want to pluck out of the file, it'll find the first data row and header row for you automatically, and then read just that table. You don't have to worry about getting the exact line number at the start of data like you do with skip. fread can only read one table currently. It could feasibly return a list of tables from a single file, but that's getting a bit complicated and nobody has asked for that.

read.csv - unknown character and embedded quotes

I have a .csv that causes different problems with read.table() and fread().
There is an unknown character that causes read.table() to stop (reminiscent of read.csv stops reading at row 523924 even thouhg the file has 799992 rows). Excel, Notepad, and SAS System Viewer render it like a rightwards arrow (although if I use Excel's insert symbol to insert u2192 it appears different); emacs renders it ^Z.
fread() gets past the unknown character (bringing it in as \032) but there is another issue that prevents this from being the solution to my problem: the data set uses quotation marks as an abbreviation for inches, thus embedded (even mismatched) quotes.
Does anyone have any suggestions short of modifying the original .csv file, e.g., by globally replacing the strange arrow?
Thanks in advance!
In case of Paul's file, I was able to read the file (after some experimentation) using fread() with the cmd "unzip -cq" and quote = "" parameters without error or warnings. I suppose that this might work as well with Kristian's file.
On Windows, it might be necessary to install the Rtools beforehand.
library(data.table) # development version 1.14.1 used
download.file("https://www.irs.gov/pub/irs-utl/extrfoia-master-dt2021-07-02.zip",
"extrfoia-master-dt2021-07-02.zip")
txt1 <- fread(cmd = "unzip -cq extrfoia-master-dt2021-07-02.zip", quote = "")
Caveat: This will download a file of 38 MBytes
According to the unzip man page, the -c option automatically performs ASCII-EBCDIC conversion.
The quote = "" was required because in at least one case a data field contained double quotes within the text.
I have also tried the -p option of unzip which extracts the data without conversion. Then, we can see that there is \032 embedded in the string.
txt2 <- fread(cmd = "unzip -p extrfoia-master-dt2021-07-02.zip", quote = "")
txt2[47096, 1:2]
CUST-ID LEGAL-NAME
1: 1253096 JOHN A. GIANNAKOP\032OULOS
The \032 does not appear in the converted version
txt1[47096, 1:2]
CUST-ID LEGAL-NAME
1: 1253096 JOHN A. GIANNAKOPOULOS
We can search for all occurrences of \032 in all character fields by
melt(txt2, id.vars = "CUST-ID", measure.vars = txt[, names(.SD), .SDcols = is.character])[
value %flike% "\032"][order(`CUST-ID`)]
CUST-ID variable value
1: 1253096 LEGAL-NAME JOHN A. GIANNAKOP\032OULOS
2: 2050751 DBA-NAME colbert ball tax tele\032hone rd
3: 2082166 LEGAL-NAME JUAN DE J. MORALES C\032TALA
4: 2273606 LEGAL-NAME INTRINSIC DM\032 INC.
5: 2300016 MAIL-ADDR1 PO BOX \03209
6: 2346154 LEGAL-NAME JOEL I GONZ\032LEZ-APONTE CPA
7: 2384445 LEGAL-NAME NUMBERS CAF\032 PLLC
8: 2518214 MAIL-ADDR1 556 W 800 N N\03211
9: 2518214 BUSN-ADDR1 556 W 800 N N\03211
10: 13718109 DBA-NAME World Harvest Financial Grou\032
11: 13775763 LEGAL-NAME Fiscally Responsible Consulting LLC\032
12: 13775763 DBA-NAME Fiscally Responsible Consulting LLC\032
This may help to identify the records of the file to fix manually.
I hit this problem today, so its still there in R 4.0.5
The data I'm using is public, from the Internal Revenue service. Somehow the unrecognized characters become "^Z" in the database. So far as I can tell, "^Z" gets inadvertently created when people enter characters that are not recognized by original program that receives. The IRS distributes a CSV file from the database.
In the example file I'm dealing with, there are 13 rows (out of 360,000) that have the ^Z in various spots. Manually deleting them one-at-a-time lets R read.table get a little further. I found no encoding setting in R that made a difference on this problem.
I found 2 solutions.
Get rid of the "^Z" symbol with text tools before using read.csv.
Switch to Python. The pandas package function read_csv, with encoding as "utf-8" correctly obtains all rows. However, in the pandas.DataFrame that results, the unrecognized character is in the data, it looks like an empty square.
If you want an example to explore, here's the address: https://www.irs.gov/pub/irs-utl/extrfoia-master-dt2021-07-02.zip. The first "^Z" you find is line 47096.

Resources