using DelimitedFiles
dat = readdlm("tmp.dat", ' ', Float64, '\n')
fails with
ERROR: LoadError: at row 1, column 1 : ErrorException("file entry \"\" cannot be converted to Float64")
when the data file starts with a whitespace character (blank):
 1 2
3 4
When I remove the first space, the Julia program works.
Is there an option to readdlm to ignore leading white space or is there an equally easy alternative?
Edit: Reading an answer below (thanks!), I realize that I haven't phrased my question precisely. Instead of editing the above and moving the goalposts as a result, I will explain my situation. I have ASCII data files, each line of which may or may not start with an unknown number of whitespace characters. (In addition, there may be trailing whitespace.)
Basically what I have in mind is the "usual" ASCII-file reading in popular languages like C, Ruby, Fortran, gnuplot, etc. In all these languages, leading and trailing whitespace in ASCII data files is ignored, and extra spaces between fields are ignored as well.
I was surprised that readdlm doesn't have such a "mode" of reading.
readdlm supports skipping only entire lines, via the skipstart keyword argument.
Hence all you can do is use seek to skip the initial bytes in the stream.
Setup:
open("f.txt","w") do f;println(f," 1 2\n3 4");end
Actual code:
julia> open("f.txt") do f
           seek(f, 1)
           readdlm(f, ' ', Float64, '\n')
       end
2×2 Matrix{Float64}:
1.0 2.0
3.0 4.0
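If, as in the edited question, every line may begin or end with an unknown amount of whitespace, seeking past a fixed number of bytes is not enough. One possible workaround (a sketch only, assuming plain space-separated numeric data in "f.txt") is to normalize the text first and feed it to readdlm through an IOBuffer:

using DelimitedFiles

# split(line) splits on any whitespace and drops empty fields, so this
# strips leading/trailing blanks and collapses runs of spaces per line.
lines = [join(split(line), ' ') for line in eachline("f.txt") if !isempty(strip(line))]
dat = readdlm(IOBuffer(join(lines, '\n')), ' ', Float64, '\n')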
Related
I have a .csv file that contains all text fields. However, some of the text fields contain an unescaped double quote character, e.g.:
"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
Rows 1 and 2 are fine, but row 3 doesn't read in correctly. At the moment I am manually going through the file in Notepad++ trying to remove such quotes. Ideally I'd like R to be able to handle this, but I think the unescaped nature of the unmatched double quote makes such an expectation unreasonable.
In Notepad++ I am trying to build a regular expression to identify double quotes that are not preceded or succeeded by a comma. The logic is that a valid double quote will be at the start or end of a field, which is signified by an adjacent comma. This should identify the majority of my cases, which I can then deal with.
Just to say that I have about 3.4 million records, of which about 0.1% appear to be problematic.
EDIT:
fread from data.table has been suggested as an alternative, but using fread is even less successful:
1: In fread(paste(infilename, "1", ".csv", sep = "")) :
Stopped early on line 21. Expected 18 fields but found 9. Consider fill=TRUE and comment.char=. First discarded non-empty line
Neither of the suggested options works. I think this is because the "Text" field can also contain CRLF characters. read.csv appears to just ignore these (good), whilst fread takes exception. Sorry that I cannot make the actual text available, but here is some more comprehensive test data, which has both the unmatched double quote (which read.csv has issues with) and CRLF (which fread has issues with).
"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","An issue with this line is that it contains a CRLF here
which is not usual.","Again an unusual CRLF
is present in these data","2013-02-02"
"4","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
Help with the regex in Notepad++ would be great.
Perhaps one option could be to use a conditional replacement in Notepad++.
You could match all the fields that start with a double quote preceded by a comma or by the start of the line, then match anything that is not a double quote up to the next double quote followed by a comma or by the end of the line. Those fields are fine, so in the other half of the alternation you capture, and replace, a double quote that is not adjacent to a comma.
Find what:
(?:^|,)"[^"\n]*"(?=$|,)|(?<!,)(")(?!,)
Replace with:
A conditional replacement: if group 1 matched, replace with the empty string, else replace with the whole match.
(?{1}:$0)
Explanation
(?:^|,) Match either a comma or assert the start of the string
"[^"\n]*" Match the double quotes when there is no double quote in between
(?=$|,) Assert what is on the right is either the end of the string or a comma
| OR
(?<!,)(")(?!,)Capture a double quote in group1 while asserting what is on the left and on the right is not a comma
Seems to work rather well with data.table::fread:
fread("E:/temp/test.txt")
# ID Text Optional text "Date"
#1: 1 Today is going to be a good day 2013-02-03
#2: 2 And I am inspired by the quote "every dog must have it's day" Hi 2013-01-01
#3: 3 Did not the bard say "All the World's a stage" this quote is so true Terrible 2013-05-05
#Warning message:
#In fread("E:/temp/test.txt") :
# Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
Currently I am working on a comparison between SICStus 3 and SICStus 4, but I have hit one issue: SICStus 4 will not consult any cases where the comment string contains carriage controls or tab characters, etc., as shown below.
The example case below has three arguments with a comma delimiter.
case('pr_ua_sfochi',"
Response:
answer(amount(2370.09,usd),[[01AUG06SFO UA CHI Q9.30 1085.58FUA2SFS UA SFO Q9.30 1085.58FUA2SFS NUC2189.76END ROE1.0 XT USD 180.33 ZPSFOCHI 164.23US6.60ZP5.00AY XF4.50SFO4.5]],amount(2189.76,usd),amount(2189.76,usd),amount(180.33,usd),[[fua2sfs,fua2sfs]],amount(6.6,usd),amount(4.5,usd),amount(0.0,usd),amount(18.6,usd),lasttktdate([20061002]),lastdateafterres(200712282]),[[fic_ticketinfo(fare(fua2sfs),fic([]),nvb([]),nva([]),tktiss([]),penalty([]),tktendorsement([]),tourinfo([]),infomsgs([])),fic_ticketinfo(fare(fua2sfs),fic([]),nvb([]),nva([]),tktiss([]),penalty([]),tktendorsement([]),tourinfo([]),infomsgs([]))]],<>,<>,cat35(cat35info([])))
.
02/20/2006 17:05:10 Transaction 35 served by static.static.server1 (usclsefat002:7551) running E*Fare version $Name: build-2006-02-19-1900 $
",price(pnr(
user('atl','1y',<>,<>,dept(<>,'0005300'),<>,<>,<>),
[
passenger(adt,1,[ptconly(n)])
],
[
segment(1,sfo,chi,'ua','<>','100',20140901,0800,f,20140901,2100,'737',res(20140628,1316),hk,pf2(n,[],[],n),<>,flags(no,no,no,no,no,no,no,no,no)),
segment(2,chi,sfo,'ua','<>','101',20140906,1000,f,20140906,1400,'737',res(20140628,1316),hk,pf2(n,[],[],n),<>,flags(no,no,no,no,no,no,no,no,no))
]),[
rebook(n),
ticket(20140301,131659),
dbaccess(20140301,131659),
platingcarrier('ua'),
tax_exempt([]),
trapparm("trap:ffil"),
city(y)
])).
The predicate below removes the comment section from the above case.
flatten-cases :-
    getmessage(M1),
    write_flattened_case(M1),
    flatten-cases.
flatten-cases.

write_flattened_case(M1) :-
    M1 = case(Case,_Comment,Entry), !,
    M2 = case(Case,Entry),
    writeq(M2), write('.'), nl.

getmessage(M) :-
    read(M),
    !,
    M \== end_of_file.

:- flatten-cases.
Now my requirement is to convert the comment string to an ASCII character list.
Layout characters other than a regular space cannot occur literally in a quoted atom or a double-quoted list. This is a requirement of the ISO standard, and it has been fully implemented in SICStus since 3.9.0 when invoking SICStus 3 with the option --iso. As of SICStus 4, only ISO syntax is supported.
You need to insert \n and \t accordingly. So instead of
log('Response:
yes'). % BAD!
Now write
log('Response:\n\tyes').
Or, to make it more readable, use a continuation escape sequence:
log('Response:\n\
\tyes').
Note that using literal tabs and literal newlines is highly problematic: on a printout you do not see them! Think of 'A \nB', which shows neither trailing spaces nor trailing tabs.
There are also many other situations where they get lost: making a screenshot of program text, taking a photo of program text, using a 3270 terminal emulator and copying the output; in the past, punched cards, and the text mode used when reading files (which was originally motivated by punched cards). Similar arguments hold for the tabulator, which comes from typewriters with their manually settable tab stops.
And then, on SO it is quite difficult to type in a TAB: the browser refuses to type it (very wisely), and if you copy one in, it gets rendered as spaces.
While I am at it, there is also another problem: the name flatten-cases should rather be written flatten_cases.
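As for the requirement of converting the comment string to an ASCII character list: once the text is written with proper escapes, atom_codes/2 (ISO) does the conversion if the text is an atom; a double-quoted "..." is, with SICStus default flags, already read as a code list. A sketch:

?- atom_codes('Response:\n\tyes', Codes).
Codes = [82,101,115,112,111,110,115,101,58,10,9,121,101,115]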
TL;DR
I have a snippet of text
str <- '"foo\\dar embedded \\\"quote\\\""'
# cat(str, '\n') # gives
# "foo\dar embedded \"quote\""
# i.e. as if the above had been written to a CSV with quoting turned on.
I want to end up with the string:
str <- 'foo\\dar embedded "quote"'
# cat(str, '\n') # gives
# foo\dar embedded "quote"
essentially removing one "layer" of quoting. How may I do this?
(Initial attempt: eval(parse(text=str)), which works unless you have something like \\dar, where you get the error "\d is an unrecognized escape in character string ...".)
Gory details (optional)
The reason my strings are quoted once-too-many times is I kludged some data processing -- I wrote str (well, a dataframe in my case) to a table with quoting enabled, but forgot that many of the columns in my dataframe had embedded newlines with embedded quotes (i.e. forgot to escape/remove them).
It turns out that when I read.table a file with multiple columns in the same row that have embedded newlines and embedded quotes (or something like that), the function fails (fair enough).
I had since closed my R session so my only access to my data was through my munged CSV. So I wrote some spaghetti code to simply readLines my CSV and split everything up to reconstruct my dataframe again. However, since all my character columns were quoted in the CSV, I have a few columns in my restored dataframe that are still quoted that I want to unquote.
Messy, I know. I'll remember to save an original version of the data next time (save, saveRDS).
For those interested, the header row and three rows of my CSV are shown below (all the characters are ASCII)
"quote";"id";"date";"author";"context"
"< mwk> I tried to fix the bug I mentioned, but I accidentally ascended the character I started for testing... hoped she'd die soon and I could get to coding, but alas I was wrong";"< mwk> I tried to fix the bug I mentioned, but I accidentally ascended the character I started for testing... hoped she'd die soon and I could get to coding, but alas I was wrong";"February 28, 2013";"nhqdb";"nhqdb"
"< intx14> \"A gush of water hits the air elemental on the central core!\"
< intx14> What is this, a weather forecast?";"< intx14> \"A gush of water hits the air elemental on the central core!\"
< intx14> What is this, a weather forecast?";"February 28, 2013";"nhqdb";"nhqdb"
"< bcode> n - a spherical amulet. You are lucky! Full moon tonight.
< bcode> That must be a sign - I'll put it on! What could possibly go wrong...
< oracle\devnull> DIED : bcode2 (Wiz-Elf-Mal-Cha) 0 points, killed by strangulation on pcs1.nethack.devnull.net";"< bcode> n - a spherical amulet. You are lucky! Full moon tonight.
< bcode> That must be a sign - I'll put it on! What could possibly go wrong...
< oracle\devnull> DIED : bcode2 (Wiz-Elf-Mal-Cha) 0 points, killed by strangulation on pcs1.nethack.devnull.net";"February 28, 2013";"nhqdb";"nhqdb"
The first two columns of each row are the same, being the quote (the first row has no embedded newlines in the quote; the second and third do). Separator is ';'.
> read.table('test.csv', sep=';', header=T)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 5 elements
# same error with allowEscapes=T
Use regular expressions:
str <- gsub('^"|"$', '', gsub('\\\"', '"', str, fixed = TRUE))
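Applied to the string from the question, this removes exactly one layer of quoting:

str <- '"foo\\dar embedded \\\"quote\\\""'
str <- gsub('^"|"$', '', gsub('\\\"', '"', str, fixed = TRUE))
cat(str, '\n')
# foo\dar embedded "quote"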
[EDIT 3: the OP has posted three separate versions of this - two of them irreproducible, interspersed with complaining. Due to this timewasting behavior and several people downvoting, I'm leaving the original answer to version 2 of the question.]
EDIT 1: My solution to the second version of the OP's question was this:
txt <- read.csv('escaped.csv', header=T, allowEscapes=T, sep=';')
EDIT 2: We now get a third version. Finally some reproducible code after 36 minutes asking and waiting. Due to the behavior of the OP and other posters I'm not inclined to waste more time on this. I'm going to complain about both of your behavior on MSO. Downvote yourselves silly.
ORIGINAL:
gsub is the ugly way.
Use read.csv(..., allowEscapes=TRUE, quote=..., encoding=...) arguments. See the manpage, section on Encoding
If you want actual code, you need to give us a full line or two of your CSV file.
See also SO: "How to detect the right encoding for read.csv?"
Quoting the relevant part of your question:
The reason my strings are quoted once-too-many times is I kludged some
data processing -- I wrote str (well, a dataframe in my case) to a
table with quoting enabled, but forgot that many of the columns in my
dataframe had embedded newlines within quotes (i.e. forgot to
escape/remove them).
It turns out that when I read.table a file with multiple columns in
the same row that have embedded newlines within quotes, the function
fails (fair enough).
I am studying mathematical computation and I am completely stuck on this task! I don't even know how to go about starting it!
Write a program in Fortran that can parse a single line of well-formed HTML or XML markup, so that it takes input on a single line (guaranteed not to exceed 80 characters in total) like
<tag>lots of lovely text</tag>
where
tag might be anything from 1 to 37 ASCII characters and will not contain spaces
text could contain spaces and be anything from 1 to 73 characters in length
so that the program outputs one of two lines:
tag : text if the two occurrences of tag inside <...> and </...> match, and
syntax error if anything else is input.
Any help is hugely appreciated!
There are a number of intrinsic functions for working with strings that may be helpful; a small sketch using one of them follows at the end of this answer.
result = index(string, substring) - returns the position of the start of the first occurrence of substring within string, counting from one. (Fortran 77)
result = scan(string, set) - scans a string for any of the characters in a set of characters. (Fortran 95)
result = verify(string, set) - verifies that all the characters in a string are present in a set. (Fortran 95)
There are a few user-contributed string tokenization functions on the Fortran Wiki that might be helpful:
delim, strtok, and find_field. Also, FLIBS includes some string manipulation and tokenization routines that might be useful as examples.
Finally, there are a number of existing open-source XML parsers written in Fortran: xmlf90 and xml-fortran. Looking at the source code for these libraries should be helpful.
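To give a concrete starting point, here is a minimal sketch built on index, len_trim and trim alone. It assumes the first > closes the opening tag and the first </ starts the closing tag; names and error handling are illustrative only:

program parse_line
  implicit none
  character(len=80) :: line, tag, ctag, text
  integer :: p1, p3, n

  read (*, '(A)') line
  n  = len_trim(line)
  p1 = index(line, '>')    ! end of the opening tag
  p3 = index(line, '</')   ! start of the closing tag
  if (line(1:1) == '<' .and. p1 > 2 .and. p3 > p1 .and. line(n:n) == '>') then
     tag  = line(2:p1-1)
     text = line(p1+1:p3-1)
     ctag = line(p3+2:n-1)
     if (tag == ctag .and. index(trim(tag), ' ') == 0) then
        print '(A, " : ", A)', trim(tag), trim(text)
     else
        print '(A)', 'syntax error'
     end if
  else
     print '(A)', 'syntax error'
  end if
end program parse_line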
I need to be able to delimit a stream of binary data. I was thinking of using something like the ASCII EOT (End of Transmission) character to do this.
However I'm a bit concerned -- how can I know for sure that the particular binary sequence used for this (0b00000100) won't appear in my own binary sequences, thus giving a false positive on delimitation?
In other words, how is binary delimiting best handled?
EDIT: ...Without using a length header. Sorry guys, should have mentioned this before.
You've got five options:
Use a delimiter character that is unlikely to occur. This runs the risk of you guessing incorrectly. I don't recommend this approach.
Use a delimiter character and an escape sequence to include the delimiter. You may need to double the escape character, depending upon what makes for easier parsing. (Think of the C \0 to include an ASCII NUL in some content.) A sketch of this approach follows the list.
Use a delimiter phrase that you can determine does not occur. (Think of the mime message boundaries.)
Prepend a length field of some sort, so you know to read the following N bytes as data. This has the downside of requiring you to know this length before writing the data, which is sometimes difficult or impossible.
Use something far more complicated, like ASN.1, to completely describe all your content for you. (I don't know if I'd actually recommend this unless you can make good use of it -- ASN.1 is awkward to use in the best of circumstances, but it does allow completely unambiguous binary data interpretation.)
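For illustration, a sketch of option 2 in Python; the escape and delimiter byte values here are arbitrary choices, not from any particular protocol:

ESC, DELIM = b'\x10', b'\x04'   # arbitrary escape and delimiter bytes

def encode(chunk):
    # Escape the escape byte first, then the delimiter, then terminate.
    return chunk.replace(ESC, ESC + ESC).replace(DELIM, ESC + DELIM) + DELIM

def decode_stream(data):
    # Split on unescaped delimiters, honoring the escape byte.
    chunks, current, escaped = [], bytearray(), False
    for b in data:
        if escaped:
            current.append(b)        # an escaped byte is taken literally
            escaped = False
        elif b == ESC[0]:
            escaped = True
        elif b == DELIM[0]:
            chunks.append(bytes(current))
            current = bytearray()
        else:
            current.append(b)
    return chunks

assert decode_stream(encode(b'\x04\x10ab') + encode(b'xy')) == [b'\x04\x10ab', b'xy']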
Usually, you wrap your binary data in a well-known format, for example with a fixed header that describes the subsequent data. If you are trying to find delimiters in an unknown stream of data, you usually need an escape sequence. For example, something like HDLC, where 0x7E is the frame delimiter: data must be encoded such that if 0x7E occurs inside the data, it is replaced with 0x7D followed by the original byte XORed with 0x20; a 0x7D in the data stream is escaped the same way.
If the binary records can really contain any data, try adding a length before the data instead of a marker after the data. This is sometimes called a prefix length because the length comes before the data.
Otherwise, you'd have to escape the delimiter in the byte stream (and escape the escape sequence).
You can prepend the size of the binary data before it. If you are dealing with streamed data and don't know its size beforehand, you can divide it into chunks and have each chunk begin with size field.
If you set a maximum size for a chunk, you will end up with all but the last chunk the same length which will simplify random access should you require it.
As a space-efficient and fixed-overhead alternative to prepending your data with size fields or escaping the delimiter character, an escapeless encoding can be used to remove the delimiter character, and possibly other characters that should have special meaning, from the data's alphabet.
@sarnold's answer is excellent, and here I want to share some code to illustrate it.
First, here is a wrong way to do it: using a \n delimiter. Don't do it! The binary data could contain \n, and it would be mixed up with the delimiters:
import os, random

with open('test', 'wb') as f:
    for i in range(100):                     # create 100 binary sequences of random
        length = random.randint(2, 100)      # length (between 2 and 100)
        f.write(os.urandom(length) + b'\n')  # separated with the character b"\n"

with open('test', 'rb') as f:
    for i, l in enumerate(f):
        print(i, l)  # oops, we get 123 sequences! wrong!
...
121 b"L\xb1\xa6\xf3\x05b\xc9\x1f\x17\x94'\n"
122 b'\xa4\xf6\x9f\xa5\xbc\x91\xbf\x15\xdc}\xca\x90\x8a\xb3\x8c\xe2\x07\x96<\xeft\n'
Now the right way to do it (option #4 in sarnold's answer):
import os, random

with open('test', 'wb') as f:
    for i in range(100):
        length = random.randint(2, 100)
        f.write(length.to_bytes(2, byteorder='little'))  # prepend the data with the length of the next chunk, packed in 2 bytes
        f.write(os.urandom(length))

with open('test', 'rb') as f:
    i = 0
    while True:
        l = f.read(2)        # read the length of the next chunk
        if l == b'':         # end of file
            break
        length = int.from_bytes(l, byteorder='little')
        s = f.read(length)
        print(i, s)
        i += 1
...
98 b"\xfa6\x15CU\x99\xc4\x9f\xbe\x9b\xe6\x1e\x13\x88X\x9a\xb2\xe8\xb7(K'\xf9+X\xc4"
99 b'\xaf\xb4\x98\xe2*HInHp\xd3OxUv\xf7\xa7\x93Qf^\xe1C\x94J)'