Parse Flat file every 410 characters - flat-file

I have a flat file that I need to take and insert a carriage return every 410 characters. I know this sounds weird, but for whatever reason my work was given several huge flat files from a clearinghouse, and I need to parse it out.
There is nothing that seperates what is supposed to be each new line, but it is exactly 410 characters. So I can't even search for anything specific and then do it.
There are 21 files total, each about 12-13mb.
I have asked for a CSV file, and they are unable to provide that.
I am trying to see if Notepad++ will do a Character count and then I can just hit "enter" after every 410th.
Also I am trying to see if I can do this in Java.
Any help you all can provide would be appreciated.

In Notepad++ you can search for the regular expression (.{410}) and replace it with \1\r.
It has happened to me that Notepad++ swallowed some characters when doing regex-based search and replace operations in large files, so I would try this for one file, then remove all the carriage returns again and compare the result size to the original size, just to make sure that nothing got swallowed during the replace operation.

Related

Write lines of text at a given position in a file in R

I would like to write a line in a text file at a given position (i) by avoiding the sequential reading.
There is WriteLines base function but I don't know how to insert the text at position (i) given as parameter.
Thanks
Dave
This is — unrelated to R — fundamentally impossible. Most (all common) filesystems do not support inserting or removing content in the middle of a file. The only supported operations are appending (or truncation) at the end, and R only supports appending, not truncation.
The way virtually all software solves your problem is by reading the file, modifying it, and writing it back to disk. If you want to get fancy because the file is very large (at least in the order of hundreds of MiB), you can stream edit the file: Read a part, edit that part, write it back to a new file. Rinse and repeat.
Technical aside: There is one exception to the above with low-level file operations, since files are stored as as non-contiguous “blocks”. But even if R supported this it wouldn’t help you since it doesn’t permit byte-level or line-level granularity: Blocks are typically at least 4 kiB in size.

readcsv fails to read # character in Julia

I've been using asd=readcsv(filename) to read a csv file in Julia.
The first row of the csv file contains strings which describe the column contents; the rest of the data is a mix of integers and floats. readcsv reads the numbers just fine, but only reads the first 4+1/2 string entries.
After that, it renders "". If I ask the REPL to display asd[1,:], it tells me it is 1x65 Array{Any,2}.
The fifth column in the first row of the csv file (this seems to be the entry it chokes on) is APP #1 bias voltage [V]; but asd[1,5] is just APP . So it looks to me as though readcsv has choked on the "#" character.
I tried using "quotes=false" keyword in readcsv, but it didn't help.
I used to use xlsread in Matlab and it worked fine.
Has anybody out there seen this sort of thing before?
The comment character in Julia is #, and this applies when reading files from delimited text files.
But luckily, the readcsv() and readdlm() functions have an optional argument to help in these situations.
You should try readcsv(filename; comment_char = '/').
Of course, the example above assumes that you don't have any / characters in your first line. If you do, then you'll have to change that / above to something else.

New line constant

Is there a new line constant that's platform independent in R? I'm used to C# and there's Environment.NewLine which will return \r\n on windows and \n otherwise. Searching turned up nothing, but I assume there has to be something somewhere so that scripts can be platform independent.
Related question: Is there a way to detect the platform a script is running on? This could be useful to know for other reasons (which I haven't thought of yet).
EDIT: Here's why I'm asking. I'm downloading files from an FTP server, but want to get a list of files and only download files that are on the server that don't exist locally. Here's how I'm getting the list of files:
filesonserver <- unlist(strsplit(getURL(basePath, ftp.use.epsv=F, dirlistonly=T), "\n"))
On windows, the files are separated by \r\n. On my mac (where I'm currently working), they're separated by \n. I was looking for a way to make this platform independent. I haven't tried just separating by \n on windows, which might work. There might also be a way to get the list of files as a vector without having to split them, which would avoid this entirely...
The package tryCatchLog has a function determine.platform.NewLine():
https://cran.r-project.org/package=tryCatchLog
https://github.com/aryoda/tryCatchLog/blob/master/R/platform_newline.R
If you consequently use this string instead of hard-coded "\n" your new lines will work platform-independently.
The answer to the initial question appears to be there isn't a new line constant like C# has. But it doesn't matter in my case, as the comments pointed out. It didn't occur to me until after I edited in the details that I probably didn't need to worry about it. Splitting by \n works fine on windows, even though the string containing the files names returned by getURL() is split by \r\n.

Improperly formatted CSV, how to repair?

I have a csv, and each line reads as follows:
"http://www.videourl.com/video,video title,video duration,thumbnail,<iframe src=""http://embed.videourl.com/video"" frameborder=0 width=510 height=400 scrolling=no> </iframe>,tag 1,tag 2",,,,,,,,,,,,,,,,,,,,,,,,,,
Is there a program I can use to clean this up? I'm trying to import it to wordpress and map it to current fields, but it isn't functioning properly. Any suggestions?
Just use search and replace in this case. remove the commas at the end and then replace the remaining commas with ",".
Should anyone else have the same issue. Know that this solution will only work with data much like the example giving. If data has a lot of text and there are commas within the text that need kept. Then search replacing comma will not work. Using regex would be the next option and that can be done in Notepad ++
However I think the regex pattern depends on the data so not much point creating an example.
PHP could be used to explode each line also. Remove values that match a regex out of many i.e. URL, money. Then what is left could be (depending on the data again) just a block of text. That approach may not work if there are two or more columns with a lot of text

repair data in csv file

I have a huge csv file, separated by comma's and I want to do a analysis with glm in R.
In one column there exists data with a comma implied, something like: bla,blabla
When reading the file in R with read.csv.sql there comes a error-message:
RS-DBI driver: (RS_sqlite_import: ./agp.csv line 47612 expected 37 columns of data but found 38)
This is due to the 'extra' comma in some of the data, not the whole column has an extra column.
How can I fix this? I want to remove this extra superfluous comma.
Thanks for the reaction,
André
The CSV format is very simple and can easily be hand edited. In order to include a comma in a value, you must surround the value with quotes quotes. Try this: "bla,blabla". If that data happens to contain any quotes, eg. blah,"thequotedblah",blah, those quotes need to be escaped with another quote, like this: "blah,""thequotedblah"",blah".
Although there is no official standard around it, there isn't much to the CSV format. Wikipedia has a great CSV reference that I have personally used to implement CSV support in applications. Spend 5-10 minutes reading it and you'll know everything you ever need to know to manually create/read/repair CSV data.
Is it just this one line that contains a non-quoted comma - or are there several such lines? Editing the .csv with an editor that can handle large files (e.g. Ultraedit) to sanitize that one record would certainly help. Asaph's suggestion of quoting is also a good 'un.

Resources