I have a comma-separated value file that looks like this when I open it in vim:
12,31,50,,12^M34,23,45,2,12^M12,31,50,,12^M34,23,45,2,12^M
and so forth. I believe this means my CSV uses CR-only (classic mac) line endings. R's read.table() function ostensibly requires LF line endings, or some variant thereof.
I know I can preprocess the file, and that's probably what I'll do.
That solution aside: is there a way to import CR files directly into R? For instance, write.table() has an "eol" parameter one can use to specify the line ending of outputs -- but I don't see a similar parameter for read.table() (cf. http://stat.ethz.ch/R-manual/R-patched/library/utils/html/read.table.html).
R will not recognize "^M" as anything useful. (It's possible that vim is simply displaying a literal Ctrl-M as that two-character sequence.) If "^M" appeared in a text connection stream, R would not treat it as a valid escaped character, since "^" is not used for that purpose. You might need to do the pre-processing, unless you want to pass the text through scan() and substitute with gsub():
subbed <- gsub("\\^M", "\n", scan(textConnection("12,31,50,,12^M34,23,45,2,12^M12,31,50,,12^M34,23,45,2,12^M"), what="character"))
Read 1 item
> read.table(text=subbed, sep=",")
V1 V2 V3 V4 V5
1 12 31 50 NA 12
2 34 23 45 2 12
3 12 31 50 NA 12
4 34 23 45 2 12
It's possible that you may instead need to use "\\m" as the pattern argument to gsub.
A further note: the help page for scan says: "Whatever mode the connection is opened in, any of LF, CRLF or CR will be accepted as the EOL marker for a line and so will match sep = "\n"." So carriage returns (if that's what these characters really are) should have been recognized as line endings, since read.table is based on scan. You should look at ?Quotes for information on escape characters.
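In fact a quick round trip seems to bear that out; here is a minimal sketch (the file name is made up) that writes a CR-only file and reads it straight back:
# cat() does not translate "\r", so this produces genuine CR-only line endings
cat("12,31,50,,12\r34,23,45,2,12\r", file = "cr_test.csv")
read.table("cr_test.csv", sep = ",")
#   V1 V2 V3 V4 V5
# 1 12 31 50 NA 12
# 2 34 23 45  2 12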
If this vim tutorial is to be believed, those may be DOS-related characters, since it offers this advice:
Strip DOS ctrl-M's:
:1,$ s/{ctrl-V}{ctrl-M}//
There is an R-native solution that requires no preprocessing or external hacks. You should use the encoding argument of the read.table function and set it equal to "latin1" for Mac character encoding.
For example, if your file in Mac format (^M for the line ending) is saved as test.csv, load it as follows:
test <- read.table("./test.csv", sep=",", encoding="latin1")
To see what options you can pass to the encoding argument, type ?Encoding into the R interpreter; you will see that "latin1", "UTF-8", "bytes" and "unknown" are the supported encodings.
This is the best & cleanest way to do this.
Related
I am trying to export a table to CSV format, but one of my columns is special - it's like a number string except that the length of the string needs to be the same every time, so I add trailing spaces to shorter numbers to get it to a certain length (in this case I make it length 5).
library(dplyr)
library(readr)
library(stringr)  # needed for str_pad() below
df <- read.table(text="ID Something
22 Red
55555 Red
123 Blue
",header=T)
df <- mutate(df,ID=str_pad(ID,5,"right"," "))
df
ID Something
1 22 Red
2 55555 Red
3 123 Blue
Unfortunately, when I write the file out with write_csv, the trailing spaces disappear, which defeats the purpose. I think it's because I am downloading the csv from the R server and then opening it in Excel, which messes around with the data. Any tips?
str_pad() appears to be a function from the stringr package, which is not currently available for R 3.5.0, the version I am using - this may be the cause of your issues as well. If the function actually works for you, please ignore the next step and skip straight to my Excel comments below.
Adding spaces. Here is how I have accomplished this task with base R:
# a custom function to add an arbitrary number of trailing spaces
SpaceAdd <- function(x, desiredLength = 5) {
  additionalSpaces <- ifelse(nchar(x) < desiredLength,
                             paste(rep(" ", desiredLength - nchar(x)), collapse = ""),
                             "")
  paste(x, additionalSpaces, sep = "")
}
# use the function on your df
df$ID <- mapply(df$ID, FUN = SpaceAdd)
# write csv normally
write.csv(df, "df.csv")
NOTE When you import into Excel, you should use the 'import from text' wizard rather than just opening the .csv. This is because you need to mark your 'ID' column as text in order to keep the spaces.
NOTE 2 I have learned today that having your first column named 'ID' might actually cause further problems with Excel, since it may misinterpret the nature of the file and treat it as a SYLK file instead. So it may be best to avoid this column name if possible; a workaround is sketched after the quote below.
Here is a wiki tl;dr:
A commonly encountered (and spurious) 'occurrence' of the SYLK file happens when a comma-separated value (CSV) format is saved with an unquoted first field name of 'ID', that is the first two characters match the first two characters of the SYLK file format. Microsoft Excel (at least to Office 2016) will then emit misleading error messages relating to the format of the file, such as "The file you are trying to open, 'x.csv', is in a different format than specified by the file extension..."
details: https://en.wikipedia.org/wiki/SYmbolic_LinK_(SYLK)
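If the column must keep its name, here is a small sketch of two easy outs, reusing df from above (both untested against every Excel version, so treat them as suggestions):
# Option 1: keep the header quoted; write.csv's default quote = TRUE writes
# "ID" with surrounding quotes, which Excel does not mistake for a SYLK signature
write.csv(df, "df.csv", row.names = FALSE)
# Option 2: rename the column so the file no longer starts with the bytes ID
names(df)[1] <- "Id"
write.csv(df, "df2.csv", row.names = FALSE)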
I've got a strange text file with a bunch of NUL characters in it (actually about 10 such files), and I'd like to programmatically replace them from within R. Here is a link to one of the files.
With the aid of this question I've finally figured out a better-than-ad-hoc way of going into each file and find-and-replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL] -> [space]) to maintain the intended line width of the file (which is crucial for reading these as fixed-width further down the road).
However, for robustness' sake, I prefer a more automatable approach to the solution, ideally (for organization's sake) something I could add at the beginning of an R script I'm writing to clean up the files. This question looked promising, but the accepted answer is insufficient - readLines throws an error whenever I try to use it on these files (unless I activate skipNul).
Is there any way to get the lines of this file into R so I could use gsub or whatever else to fix this issue without resorting to external programs?
You want to read the file as binary; then you can substitute the NULs, e.g. to replace them with spaces:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "00staff.txt")
str(readLines("00staff.txt"))
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__ ...
You could also substitute the NULs with a really rare character (such as "\01") and work on the string in place, e.g. if you want to replace two NULs ("\00\00") with one space:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(1)
a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)
s = strsplit(a, "\n", TRUE)[[1]]
str(s)
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__
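From there the cleaned file should go straight into a fixed-width read; a sketch, with made-up column widths since I don't know the real record layout:
# widths below are hypothetical -- substitute the actual field widths
dat <- read.fwf("00staff.txt", widths = c(9, 20, 2, 4), colClasses = "character")
str(dat)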
I have 2,000+ tables, some with hundreds of lines, that I'm downloading from a web service (of botanical names) and saving to disk for further inspection.
Since some text fields have carriage returns, I decided to quote everything. But some fields have " characters, others have ' characters, so these characters can't be used for quoting (I could try to escape them, but some are already escaped, and this would easily become a mess. I thought it would be easier to use a different quote character). I tried %, only to find that some fields also use this character. So I need something different. I tried ¨ ☺ π and 人, but nothing seems to work! All of them appear correctly on screen (RKWard in Ubuntu 14.04), all are saved correctly with write.table, but NONE can be read with read.table or read.csv. I'm using UTF-8 as fileEncoding. I get the message "invalid multibyte string", even for ☺ (which is ASCII 1st character).
Sys.getlocale(category="LC_ALL")
gives
"LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=pt_BR.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=pt_BR.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=pt_BR.UTF-8;LC_NAME=pt_BR.UTF-8;LC_ADDRESS=pt_BR.UTF-8;LC_TELEPHONE=pt_BR.UTF-8;LC_MEASUREMENT=pt_BR.UTF-8;LC_IDENTIFICATION=pt_BR.UTF-8"
I have tried changing the locale to Chinese, to use the 人 (which shouldn't be needed, I guess, since it displays and saves correctly), but that also didn't work. I get
OS reports request to set locale to "chinese" cannot be honored
OS reports request to set locale to "Chinese" cannot be honored
OS reports request to set locale to "zh_CN.utf-8" cannot be honored
Now the strangest part: if the Chinese characters are in the body of the data, they're read without problem. It seems they just can't be used as quotes!
Any ideas? Thanks in advance.
I'm not sure this is the solution you're looking for, but if I understood correctly you have CR/LF characters in your text which make it a problem to read the data as a table. If so, you can use readLines, which automatically escapes \r, \n and \r\n, and then read the result as a table. For example, consider the file crlf.txt:
col1 col2 col3 col4 col5
1 \n 3 \r 5
a \r\n 3 2 2
You can use
> readLines("crlf.txt")
[1] "col1 col2 col3 col4 col5" "1 \\n 3 \\r 5 "
[3] "a \\r\\n 3 2 2"
And then:
> read.table(text=readLines("crlf.txt"), header = T)
col1 col2 col3 col4 col5
1 1 \\n 3 \\r 5
2 a \\r\\n 3 2 2
Obviously the line breaks are now escaped when printed, otherwise they would actually break the lines.
See ?scan (scan is used by read.table):
quote: the set of quoting characters as a single character string or ‘NULL’. In a multibyte locale the quoting characters must be ASCII (single-byte).
The easiest option would be to replace all your embedded new lines with another string prior to importing the file, and then reintroduce the new lines later using gsub.
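A minimal sketch of that idea (the placeholder token, file name, and column name are all made up for illustration):
# assume embedded line breaks were replaced with "<BR>" before the file reached R
dat <- read.table("cleaned.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
# reintroduce the real newlines in the affected column
dat$description <- gsub("<BR>", "\n", dat$description, fixed = TRUE)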
I need to read a text file (tab-separated) that has some carriage returns inside some fields.
If I use read.table, it gives me an error:
line 6257 did not have 20 elements
If I use read.csv, it doesn't give an error, but creates a new line in that place, putting the next fields in the first fields of the new line.
How can I avoid this? I can't alter the file itself (the script is to be run elsewhere). Also the broken strings don't have quotation marks (no strings in the file have). One option would be to read the carriage return as a single space, or as \n, but how?
Use read.table instead of read.csv and set allowEscapes to TRUE.
read.table("your/path",sep=",",allowEscapes=TRUE)
I tested with the following: I wrote a csv file in Excel.
Contents of the csv file:
1,df,3,"4
"
df,"df
",3,a
result:
V1 V2 V3 V4
1 1 df 3 4 \n
2 df df\n 3 a
I have a .csv that causes different problems with read.table() and fread().
There is an unknown character that causes read.table() to stop (reminiscent of "read.csv stops reading at row 523924 even though the file has 799992 rows"). Excel, Notepad, and SAS System Viewer render it as a rightwards arrow (although if I use Excel's Insert Symbol to insert U+2192 it looks different); emacs renders it as ^Z.
fread() gets past the unknown character (bringing it in as \032), but there is another issue that prevents this from being the solution to my problem: the data set uses quotation marks as an abbreviation for inches, so there are embedded (even mismatched) quotes.
Does anyone have any suggestions short of modifying the original .csv file, e.g., by globally replacing the strange arrow?
Thanks in advance!
In the case of Paul's file, I was able to read it (after some experimentation) using fread() with the cmd = "unzip -cq" and quote = "" parameters, without errors or warnings. I suppose that this might work as well with Kristian's file.
On Windows, it might be necessary to install Rtools beforehand.
library(data.table) # development version 1.14.1 used
download.file("https://www.irs.gov/pub/irs-utl/extrfoia-master-dt2021-07-02.zip",
"extrfoia-master-dt2021-07-02.zip")
txt1 <- fread(cmd = "unzip -cq extrfoia-master-dt2021-07-02.zip", quote = "")
Caveat: This will download a file of 38 MBytes
According to the unzip man page, the -c option automatically performs ASCII-EBCDIC conversion.
The quote = "" was required because in at least one case a data field contained double quotes within the text.
I have also tried the -p option of unzip, which extracts the data without conversion. Then we can see that there is a \032 embedded in the string.
txt2 <- fread(cmd = "unzip -p extrfoia-master-dt2021-07-02.zip", quote = "")
txt2[47096, 1:2]
CUST-ID LEGAL-NAME
1: 1253096 JOHN A. GIANNAKOP\032OULOS
The \032 does not appear in the converted version
txt1[47096, 1:2]
CUST-ID LEGAL-NAME
1: 1253096 JOHN A. GIANNAKOPOULOS
We can search for all occurrences of \032 in all character fields by
melt(txt2, id.vars = "CUST-ID",
     measure.vars = txt2[, names(.SD), .SDcols = is.character])[
  value %flike% "\032"][order(`CUST-ID`)]
CUST-ID variable value
1: 1253096 LEGAL-NAME JOHN A. GIANNAKOP\032OULOS
2: 2050751 DBA-NAME colbert ball tax tele\032hone rd
3: 2082166 LEGAL-NAME JUAN DE J. MORALES C\032TALA
4: 2273606 LEGAL-NAME INTRINSIC DM\032 INC.
5: 2300016 MAIL-ADDR1 PO BOX \03209
6: 2346154 LEGAL-NAME JOEL I GONZ\032LEZ-APONTE CPA
7: 2384445 LEGAL-NAME NUMBERS CAF\032 PLLC
8: 2518214 MAIL-ADDR1 556 W 800 N N\03211
9: 2518214 BUSN-ADDR1 556 W 800 N N\03211
10: 13718109 DBA-NAME World Harvest Financial Grou\032
11: 13775763 LEGAL-NAME Fiscally Responsible Consulting LLC\032
12: 13775763 DBA-NAME Fiscally Responsible Consulting LLC\032
This may help to identify the records of the file to fix manually.
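Alternatively, the stray bytes can be removed in R itself before re-reading, in the spirit of the readBin approach used for the NUL problem above (the .csv and output file names here are assumptions - use whatever the zip actually extracts to):
# read raw bytes, drop every SUB byte (0x1A = \032 = Ctrl-Z), write a clean copy
r <- readBin("extrfoia-master-dt2021-07-02.csv", raw(),
             file.info("extrfoia-master-dt2021-07-02.csv")$size)
writeBin(r[r != as.raw(0x1A)], "extrfoia-clean.csv")
txt3 <- fread("extrfoia-clean.csv", quote = "")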
I hit this problem today, so it's still there in R 4.0.5.
The data I'm using is public, from the Internal Revenue Service. Somehow the unrecognized characters became "^Z" in the database. So far as I can tell, "^Z" gets inadvertently created when people enter characters that are not recognized by the original program that receives them. The IRS distributes a CSV file from the database.
In the example file I'm dealing with, there are 13 rows (out of 360,000) that have the ^Z in various spots. Manually deleting them one at a time lets read.table get a little further each time. I found no encoding setting in R that made a difference on this problem.
I found 2 solutions.
1. Get rid of the "^Z" symbol with text tools (or the readBin approach shown above) before using read.csv.
2. Switch to Python. The pandas function read_csv, with encoding set to "utf-8", correctly obtains all rows. However, in the resulting pandas.DataFrame, the unrecognized character is still in the data; it looks like an empty square.
If you want an example to explore, here's the address: https://www.irs.gov/pub/irs-utl/extrfoia-master-dt2021-07-02.zip. The first "^Z" you'll find is on line 47096.