Read Table in R with "=" starting cell

I'm trying to read a table in R. I have always used read.table(...), but today I got an error when trying to read a table with a cell starting with the "=" symbol.
It is something like:
chr7 79435791 a 7 ,,...... \ak.f9ka
chr7 79435792 a 7 C$...... =.;a#Kk
chr7 79435793 T 7 ........ GF-FGGB
I use:
read.table(files, sep="\t", comment.char="", header=FALSE, colClasses=c("character"), stringsAsFactors=FALSE)
but I get the error:
The line 2 does not have 6 elements, and I think it is because of the = symbol.
How can I read this?
Thank you

I am not able to reproduce your error. I copied your example exactly into a .txt file and did not get the error you describe.
By the way, with a plain read.table() call the last value in row 2 is not read correctly because of the comment character (#), which you can disable (see ?read.table).

I was able to partly reproduce the error, which gave me some ideas. It could of course go horribly wrong as well :)
This should hopefully do the trick:
files.data.frame <- read.table(textConnection(readLines(files)), sep="\t")
And of course, add the extra options to read.table as deemed necessary.

List index out of range (searched for answer, none work)

When I run this code
f = open("test.txt", "r")
xp_levelup_save=f.readlines(3)
xp_levelup_save=[int(i.replace("\n", "")) for i in xp_levelup_save][2]
f.close()
print (xp_levelup_save)
but an error comes up: "List index out of range". If I change readlines(3) to readlines(2) and [2] to [1], it works fine. I'm not sure why this is happening. Can anyone help me find a fix? I've tried looking at multiple other discussions, but none seem to work with this code.
My text document looks like this
1
2
3
4
5
6
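f.readlines() doesn't need an argument to retrieve the lines of a document. The optional argument in f.readlines(n) is a size hint, not a line count: Python keeps reading whole lines and stops once the total size of the lines read so far exceeds n. Depending on your text file, this could be one, two or even more lines. In your file each line is two characters (a digit plus the newline), so readlines(3) returns only two lines and index [2] does not exist.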
To read the text file, just use
xp_levelup_save = f.readlines()
#resume with your code
where each line is stored as a string element of the list in your variable xp_levelup_save.
Or simply
xp_levelup_save = [int(i.replace("\n", "")) for i in f.readlines()][2]
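To see the effect of the size hint in isolation, here is a minimal sketch. It assumes the file contains exactly the six lines shown in the question and uses io.StringIO as a stand-in for test.txt so that it runs without a file on disk; the variable names are made up for illustration.

```python
import io

# Stand-in for the asker's test.txt (six lines, "1" through "6").
sample = "1\n2\n3\n4\n5\n6\n"

# With a size hint, readlines() stops once the lines read so far
# add up to more than the hint, so fewer lines come back than expected.
hinted = io.StringIO(sample).readlines(3)

# Without a hint, every line is returned and index 2 exists.
all_lines = io.StringIO(sample).readlines()
third_value = int(all_lines[2].strip())

print(hinted)       # only the first two lines
print(third_value)  # the value on the third line
```

With a real file, the same calls work on the object returned by open("test.txt"), ideally inside a with block so the file is closed automatically.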

Trying to do a multiple grep in R with regular expressions

I have to search a massive log file for certain records using a regular expression in R.
The data is about 1 GB in total, and I have to look for these records and add the matches to a data frame.
What I am doing is the following:
grep(paste("(Sending+.+message+.+1234567890+.+5000)", sep=""), dfLogs)
This works correctly when I do it for only one record.
When I try to run the grep for all the searches at once:
dfTrx$RcvMessage <- paste("(Sending+.+message+.+", dfTrx$NUMBER, "+.+", dfTrx$AMOUNT,")", sep="")
dfReceived <- unique(grep(paste(dfTrx$RcvMessage, collapse="|"), dfLogs), value=TRUE)
I get the following error:
Error in grep(paste(dfTrx$RcvMessage, collapse = "|"), dfLogs) :
invalid regular expression '(Sending+.+message+.+1234567890+.+20)|(Sending+.+message+.+9876543210+.+20)|...
How can I do this regular expression for all the values? What am I doing wrong?
An example for data:
2015-12-09 19:01:44,717 - [DEBUG] [pool-1-thread-4450 ] [OutputHandler ] Sending 8 message(s) : 01XX765903091220151901440XXXX0000129D3A00003996101901442015120903857655184776733438000000200001XX765904091220151901440XXXX0000118BC100001839671901442015120903857655194251212137000000300001XX765905091220151901440XXXX000010E52A00003311451901442015120903857655203331836622000000200001XX765906091220151901440XXXX000011DCD300001972561901442015120903857655215522476419000000300001XX765907091220151901440XXXX000012980900003923951901442015120903857655225531194531000000500001XX765908091220151901440XXXX000010ED2200003882461901442015120903857655237351043626000000200001XX765909091220151901440XXXX000011BDBE00001656451901442015120903857655243312669477000000200001XX765910091220151901440XXXX00001211F3000024385819014420151209038576552598211945310000002000
I need to find the sending of the message, and the number and amount in the content of the message.
Thanks, @Marek.
I found that the data frame contained some bad values that were giving me NA, plus some spaces that I hadn't noticed before. It seems I didn't clean all the data properly. Thanks, and sorry for the silly mistake.

read.table ignores some \t and \n in a table

I've encountered a strange problem when reading in an annotation file from an RNAseq experiment.
I am trying to read in the tab-separated file (http://we.tl/qCjv4N3LF2) and then search the annotations (in the fourth column) for the pattern "bahd", to find entries like "bahd acyltransferase dcr" and then display all the IDs (first column) that belong to these entries. The code is:
ShootAnnot<-read.table("annotation1.txt",sep="\t")
matches<-grep("bahd",ShootAnnot[,4],ignore.case=TRUE)
ShootAnnot[matches,1]
Weirdly, I noticed this did not find all the gene annotations that I know to be there - only 9 matches out of 12 in the file. When I scanned the table for the missing entries I found one line in the file where it seems R failed to interpret the separation patterns "\t" and "\n" for a bit.
Look at line 4825 in the dataset:
ShootAnnot[4825,]
For some reason, the sixth cell in that line contains a big chunk of data, with many complete lines and the corresponding "\t" and "\n" cell and line separators all in one cell. Then it suddenly goes back to separating cells and lines correctly.
I have got a bunch of these files, so I would like to make sure I can resolve any issues like that automatically. Any ideas what might be causing this?
Thanks!
I'm not sure why it goes haywire (maybe a DOS CR/LF thing), but the file is pretty big and if you plug it into data.table you will get a pretty decent speedup just from reading the data.
library(data.table)
ShootAnnot <- fread("~/Downloads/annotation1.txt")
ShootAnnot[like(Blast2GO_GO_Description,"bahd"), "#ID", with=FALSE]
which will give you
#ID
1: c112902_g1_i1_m.105401
2: c11459_g1_i1_m.4290
3: c11459_g2_i1_m.4292
4: c186946_g1_i1_m.110882
5: c24956_g1_i1_m.8768
6: c265515_g1_i1_m.117383
7: c28096_g1_i1_m.10253
8: c37936_g1_i1_m.14867
9: c40683_g1_i1_m.17292
10: c54651_g1_i1_m.34709
11: c54651_g2_i1_m.34711
12: c921_g1_i1_m.351
(you don't have any non-lower-case "bahd"'s in your file)

Error: C stack usage too close to the limit when viewing data frame tail

I have a data frame appt which is 91.2MB in size, comprising 29255 observations of 51 variables.
When I try to check its end with tail(appt), I get the error
Error: C stack usage 20212630 is too close to the limit
I have no clue how to troubleshoot this. Any suggestions on what I can do?
As additional info, I have in memory at the same time a few other variables of almost comparable size, including a 90.2MB character vector and a 42.3MB data frame of 77405 obs. x 60 variables. Calling tail on these two other variables did not trigger any error.
Edit:
I have narrowed down that the error occurs only when the last row is accessed. i.e. appt[29254, ] is fine, appt[29255, ] throws the error.
I had exactly the same error but managed to solve it by disabling quoting when reading in the dataframe. For me the error showed up when trying tail(df) as well.
Turns out one of the lines in my text file had the \ character. You could run into the same problem if your file has a " or ' somewhere, as read.table considers this a quote character in the default case.
Add the option quote="" to disable quoting altogether. Example: read.table(filepath, quote="")
If this doesn't solve your issue, check out some of the other options in ?read.table (such as allowEscapes, sep, ...) as they might be causing your error.

fread and a quoted multi-line column value

> fread('col1,col2\n')
Empty data.table (0 rows) of 2 cols: col1,col2
> fread('col1,col2\n5,4')
col1 col2
1: 5 4
> fread('col1,col2\n5,"4\n3"')
Error in fread("col1,col2\n5,\"4\n3\"") :
Unbalanced quote (") observed on this line: 3"
>
read.csv can import this csv as long as the value that spans multiple lines is wrapped in quotes.
Should fread be able to import it as well? Using read.csv is actually fine for my use case. I can just convert the resulting data frame into a data table. But I just wanted to make sure that not having this functionality was a design decision, and not something that just wasn't yet tested.
UPDATE: Now fixed in v1.9.3 on GitHub :
fread() now accepts line breaks inside quoted fields. Thanks to Clayton Stanley for highlighting.
This error has been reported before and it's on the list to do. But what's new here is the \n inside the quotes. I hadn't realised that was a use case giving rise to the error.
Many thanks for reporting. It'll be fixed.
Similar question but not exactly the same here :
data.table::fread and Unbalanced "
and the bug report is here :
https://r-forge.r-project.org/tracker/?group_id=240&atid=975&func=detail&aid=2694
