The woes of endless columns in .csv data in R

So I have a bunch of .csv files that were output by a simulation. I'm writing an R script to run through them and make a histogram of a column in each .csv file. However, the .csv files are written in a way that R does not like. When I was testing, I had originally been opening the files in Excel, and apparently this changed the format to one R liked. Then when I went back to run the script on the entire folder, I discovered that R doesn't like the original format.
I was reading the data in as:
x <- read.csv("synch-imit-characteristics-2-tags-2-size-200-cost-0.1run-2-.csv", strip.white=TRUE)
Error in read.table(test, strip.white = TRUE, header = TRUE) :
more columns than column names
Investigating, I found that the original .csv file, which R does not like, looks different from the test one I had opened with Excel. I copied and pasted the first bit below after opening it in Notepad:
cost,0.1
mean-loyalty, mean-hospitality
0.9885449527316088, 0.33240076252915735
weight,1 of p1, 2 of p1,
However, in Notepad there is no apparent formatting. In fact, between rows there is no space at all, i.e. it is cost,0.1mean-loyalty,mean-hospitality0.988544, etc. So it is weird to me that when I copy and paste it from Notepad it gets the desired formatting as above. Anyway, moving on, after I had opened it in Excel it got changed to this:
cost,0.1,,,,,,,,
mean-loyalty, mean-hospitality,,,,,,,,
0.989771257,0.335847092,,,,,,,,
weight,1 of p1, etc...
So it seems like the data originally has no separation between rows (though I don't know how Excel, or copying and pasting, figures it out), but R doesn't pick up on this. Instead, it views it all as one row (and since I have 40,000+ rows, that one row has far more columns than there are column names). I don't want to have to open and save every file in Excel. Is there a way to get R to read the data as desired?
Since when I copy and paste it from Notepad it has new lines for the rows, it seems like I just need R to read it knowing that commas separate columns within a row and a return separates rows. I tried messing around with every sep="" option I could find, but I can't figure it out.

To first solve the Notepad issue:
You must have CR (carriage return, \r) characters between the lines (and no LF, \n characters, causing Notepad to see it as one line).
Some programs accept this as well as a new line character, some don't.
You can, for example, use Notepad++ to replace all '\r' with '\n' or '\r\n', using Replace with the "Extended" option. First select View > Show Symbol > Show all characters, to see what you are doing.
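Alternatively, the same replacement can be scripted from R itself; a minimal sketch, with placeholder file names:
# Read the whole file as raw text, then rewrite CR or CRLF line endings as LF
raw <- readChar("input.csv", file.info("input.csv")$size, useBytes = TRUE)
writeLines(gsub("\r\n?", "\n", raw), "input-fixed.csv")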
Finally, to get back to R:
(As it was pointed out, R can actually handle CR as a newline)
read.csv assumes that you have non-empty header names in the first row, but instead you have:
cost,0.1
while later in the data you have a row with more than just two columns:
weight,1 of p1, 2 of p1,
This means that not all columns have a header name (and I wonder if 0.1 was supposed to be a header name anyway).
The two solutions can be:
add a header including all columns, or
as was pointed out in a comment, use header=F (a sketch follows below).
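A minimal sketch of the second option, using the file name from the question; count.fields() finds the widest row so the ragged lines can't trip up read.csv, and fill = TRUE pads the shorter rows with NA:
n <- max(count.fields("synch-imit-characteristics-2-tags-2-size-200-cost-0.1run-2-.csv", sep = ","))
x <- read.csv("synch-imit-characteristics-2-tags-2-size-200-cost-0.1run-2-.csv",
              header = FALSE, fill = TRUE, strip.white = TRUE,
              col.names = paste0("V", seq_len(n)))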

Related

How do I get EXCEL to interpret character variable without scientific notation in R using fwrite?

I have a relatively simple issue: when writing out in R with fwrite from the data.table package, I am getting a character vector interpreted as scientific notation by Excel. You can run the following code to create the data issue:
#create example
library(data.table)  # for data.table() and fwrite()
samp = data.table(id = c("7E39", "7G32", "5D99999"))
fwrite(samp, "test.csv", row.names = F)
When you read this back into R you get the values back no problem (with scientific notation display disabled). My less code-capable colleagues work with the csv directly in Excel, though, and there they see entries like 7E39 rendered as scientific notation (7.00E+39). They can attempt to change the variable to text, but Excel then keeps the converted number with all its zeros. I want them to see the original "7E39" from the data table created. Any ideas how to avoid this issue?
PS: I'm working with millions of rows so write.csv is not really an option
EDIT:
One workaround I've found is to just create a mock variable with quotes:
samp = data.table(id = c("7E39", "7G32","5D99999"))[,id2:=shQuote(id)]
I prefer a tidyr solution (pun intended), as I hate unnecessary columns
EDIT2:
Following R2Evan's solution, I adapted it to data.table with the following (adding another numerical column, to see if any changes occurred):
#create example
samp = data.table(id = c("7E39", "7G32","5D99999"))[,second_var:=c(1,2,3)]
fwrite(samp[,id:=sprintf("=%s", shQuote(id))],
"foo.csv", row.names=FALSE)
It's a kludge, and dang-it for Excel to force this (I've dealt with it before).
write.csv(data.frame(id=sprintf("=%s", shQuote(c("7E39", "7G32","5D99999")))),
"foo.csv", row.names=FALSE)
This is forcing Excel to consider that column a formula, and interpret it as such. You'll see that in Excel, it is a literal formula that assigns a static string.
This is obviously not portable and prone to all sorts of problems, but that is Excel's way in this regard.
(BTW: I used write.csv here, but frankly it doesn't matter which function you use, as long as it passes the string through.)
Another option, but one that your consumers will need to do, not you.
If you export the file "as is", meaning the cell content is just "7E39", then an auto-import within Excel will always try to be smart about that cell's content. However, you can manually import the data.
Using Excel 2016 (32bit, on win10_64bit, if it matters):
Open Excel (first), have an (optionally empty) worksheet already open
On the ribbon: Data > Get External Data > From Text
Navigate to the appropriate file (CSV)
Select "Delimited" (file type), click Next, select "Comma" (and optionally deselect any others that may default to selected), Next
Click on the specific column(s) and set the "Default data format" to "Text" (this will need to be done for any/all columns where this is a problem). Multiple columns can be Shift-selected (for a range of columns), but not Ctrl-selected. Finish.
Choose the top-left cell to import/paste the data (or a new worksheet)
Select Properties..., and deselect "Save query definition". Without this step, the data is considered a query into an external data source, which may not be a problem but makes some things a little annoying. (For example, try to highlight all data and delete it ... Excel really wants to make sure you know what you're doing there.)
This method provides a portable solution. It "punishes" the Excel users, but anybody/anything else will still be able to consume the files directly without change. The biggest disadvantage with this method is that you won't know if somebody loads it incorrectly unless/until they get odd results when they try to use the data and some fields have been silently converted.

importing txt into R

I am trying to read an ftp file from the internet ("ftp://ftp.cmegroup.com/pub/settle/stlint") into R, using the following command:
aaa <- read.table("ftp://ftp.cmegroup.com/pub/settle/stlint", sep="\t", skip=2, header=FALSE)
the result shows the 8th, 51st, 65th, 71st, 72nd, 73rd, 74th, etc. rows of the resulting dataset with the following rows appended onto their ends. Basically, instead of returning
{row8}
{row9}
etc
{row162}
{row163}
It returns (adding in the quotes around the \n)
{row8'\n'row9'\n'etc...etc...'\n'row162}
{row163}
If it seems like I'm picking arbitrary numbers, run the code above and take a look at the actual ftp file on the internet (as of mid-day Feb 18) and you'll see I'm not: it really is adding 155 rows onto the end of the 8th row. So I'm simply looking for a way to read the file in without the random appending of rows. Thanks, and apologies in advance; I'm new to R and was not able to find this fix after a while of searching.
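A hedged guess, since no accepted fix appears in this thread: unmatched quote characters (' or ") inside a field make read.table swallow newlines until the next quote, which glues rows together exactly like this. Disabling quote and comment processing often cures it:
aaa <- read.table("ftp://ftp.cmegroup.com/pub/settle/stlint",
                  sep = "\t", skip = 2, header = FALSE,
                  quote = "", comment.char = "", fill = TRUE)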

Copy an R data.frame to an Excel spreadsheet

I have a question which is essentially the same as this question.
As part of my work, I have to copy output from the R Studio Console to an excel worksheet in order to make excel graphs. However, the R Studio Console uses formatted text, which excel doesn't read so well. To compensate, I'm always copying from the R Studio Console, pasting into notepad, then copying into Excel. That way, when I paste a table, I can tell excel that it's actually fixed width delimited data, and not just a clump of text.
How can I copy output from the R Studio console so that it goes into the clipboard as unformatted text so that I can paste it directly into Excel and thus organize the numbers into different cells? This would be very helpful as I dislike having to copy/paste tables into notepad then excel to make graphs.
It works with an easy trick.
First, you have to visualize your data in the Viewer pane of RStudio (you can use the function View()), then you should start selecting from the last value to the first, that is, from bottom to top. Note that the first cell should be selected completely. Finally, right-click on the selection, copy, and then paste it into Excel as you want, with or without formatting.
Good luck!
UPDATE:
Based on this post, another alternative is making a new function to copy your data.frame to Excel through the clipboard:
# Write a data.frame to the (Windows) clipboard as tab-separated text
write.excel <- function(x, row.names = FALSE, col.names = TRUE, ...) {
  write.table(x, "clipboard", sep = "\t", row.names = row.names, col.names = col.names, ...)
}
write.excel(my.df)
and finally Ctrl+V in Excel :)
This is by far the easiest way I have found so far:
clipr::write_clip(my_df)
I usually source the following function:
cb <- function(df, sep = "\t", dec = ",", max.size = (200 * 1000)) {
  # Copy a data.frame to the clipboard
  write.table(df, paste0("clipboard-", formatC(max.size, format = "f", digits = 0)),
              sep = sep, row.names = FALSE, dec = dec)
}
A few notes:
max.size allows you to specify how big the clipboard can become (in kilobytes) before the copy is cancelled; it's set to ~200 MB right now.
It works perfectly for copying an R dataframe from an RStudio session to Excel (with my EU locale). You might have to adjust the separator / decimal symbols to make it work with US versions.
How to use:
df <- mtcars
cb(df)
# Paste in excel as 'values'
From my experience there is no convenient way; I use two methods:
For small data frames, use RStudio's View(data.frame) function. If you copy only the data, without headers, it works fine, but if you want to copy with headers then you have to paste it into Notepad first to add at least one character to the top-left empty cell.
For large data frames, use write.csv or write.xls (from package WriteXLS); see the sketch below.
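A quick sketch of the large-data route (file names are mine; note that the function the WriteXLS package actually exports is WriteXLS(), and it needs Perl installed):
write.csv(my_df, "my_df.csv", row.names = FALSE)
# or: WriteXLS::WriteXLS(my_df, ExcelFileName = "my_df.xls")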

Excel data organized in multiple nested rows, can R read it?

Please see the picture. I've started using R, and know how/that it can read files from Excel, but can it read something formatted like this?
http://www.flickr.com/photos/68814612#N05/8632809494/
(my apologies, upload was not working for me)
Elaborating on some of what's in the comments:
If you load the file into Excel, you can save it as a fixed-width or comma-delimited text file. Either should be easy to read into R.
The following may be obvious to you already.
(First, a question: Are you sure that you can't get the data in a format that has one set of data per line? Is it possible that the file you're getting was generated from a different file format that is more conducive to loading the data into R?)
Whether you should start rearranging the data in R or instead manipulate the raw text depends on what comes naturally to you (or to people you have around who can help). For me, personally, I would rearrange the text file outside of R before loading it into R. That's what's easiest for me. Perl is a great language for this purpose, but you could also do it with Unix shell scripts if that's accessible to you, or using a powerful editor such as Vim or Emacs. If you have no preference, I'd suggest Perl. If you have any significant programming experience, you'll be able to learn what you need. On the other hand, you're already loading it into R, so maybe it would be better to process the data there.
For example, you could execute a loop that goes through the text file line by line and does something like this:
while (still have lines to read) {
  read first header line into a vector if this is the first time through the loop
    otherwise, read it and throw it away
  read data line 1 into a vector
  read second header line into a vector if this is the first time
    otherwise, read it and throw it away
  read data line 2 into a vector
  read third header line into a vector if this is the first time
    otherwise, read it and throw it away
  read data line 3 into a vector
  if this is the first time through, concatenate the header vectors; store as next row
    in something (a file, a matrix, a dataframe, etc.)
  concatenate the data vectors you've been saving, and store as the next row in the same thing
}
write out the whole 2D data structure
Or if the headers will never change, then you could just embed them literally into the script before the loop, and throw them out no matter what. That will make the code cleaner. Or read the first few lines of the file separately to get the headers, and then have a separate script to read the data and add it to the file with the headers in it. (The headers will probably be useful in R, so I would suggest preserving them at the top of the text file.)
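A rough R sketch of that loop, assuming the file repeats groups of six comma-separated lines (header 1, data 1, header 2, data 2, header 3, data 3); the file names, delimiter, and group size are assumptions, since the original image isn't available:
lines  <- readLines("nested.txt")
groups <- split(lines, ceiling(seq_along(lines) / 6))
fields <- function(x) strsplit(x, ",")[[1]]
# Keep the header lines from the first group only
header <- unlist(lapply(groups[[1]][c(1, 3, 5)], fields))
# Concatenate the three data lines of each group into one wide row
rows <- t(vapply(groups,
                 function(g) unlist(lapply(g[c(2, 4, 6)], fields)),
                 character(length(header))))
out <- as.data.frame(rows, stringsAsFactors = FALSE)
names(out) <- header
write.csv(out, "nested-flat.csv", row.names = FALSE)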

'Incomplete final line' warning when trying to read a .csv file into R

I'm trying to read a .csv file into R and upon using this formula:
pheasant<-read.table(file.choose(),header=TRUE,sep=",")
I get this warning message:
"incomplete final line found by readTableHeader on 'C:\Documents and Settings..."
There are a couple of things I thought may have caused this warning, but unfortunately I don't know enough about R to diagnose the problem myself so I thought I'd post here in the hope someone else can diagnose it for me!
the .csv file was originally an Excel file, which I saved into .csv format
the file comprises three columns of data
each data column is of a differing length, i.e. there are a different number of values in each column
I want to compare the means (using t-test or equivalent depending on normal / not normal distribution) of two of the columns at a time, so for example, t-test between column 1 values and column 2 values, then a t-test of column 1 and column 3 values, etc.
Any help or suggestions would be seriously appreciated!
The message indicates that the last line of the file doesn't end with an End Of Line (EOL) character (linefeed (\n) or carriage return+linefeed (\r\n)). The original intention of this message was to warn you that the file may be incomplete; most datafiles have an EOL character as the very last character in the file.
The remedy is simple:
Open the file
Navigate to the very last line of the file
Place the cursor at the end of that line
Press return
Save the file
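If you'd rather check from R whether a file is missing its final newline, here's a small sketch (the helper name is mine, not from the thread):
ends_with_newline <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  # Jump to the last byte and compare it with LF (0x0a, i.e. "\n")
  seek(con, file.info(path)$size - 1)
  readBin(con, "raw", n = 1) == as.raw(0x0a)
}
ends_with_newline("pheasant.csv")  # FALSE means you'll get the warning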
The problem is easy to resolve;
it's because the last line MUST be empty.
Say, if your content is
line 1,
line2
change it to
line 1,
line2
(empty line here)
Today I met this kind of problem when I was trying to use R to read a JSON file, with the command below:
json_data <- fromJSON(paste(readLines("json01.json"), collapse=""))  # fromJSON() from e.g. the rjson or jsonlite package
and I resolved it with the method above.
Are you really sure that you selected the .csv file and not the .xls file? I can only reproduce the error if I try to read in an .xls file. If I try to read in a .csv file or any other text file, it's impossible to recreate the error you get.
> Data <- read.table("test.csv",header=T,sep=",")
> Data <- read.table("test.xlsx",header=T,sep=",")
Warning message:
In read.table("test.xlsx", header = T, sep = ",") :
incomplete final line found by readTableHeader on 'test.xlsx'
readTableHeader is the C-level function that gives the warning. It tries to read in the first n lines (by default the first 5) to determine the type of the data. The rest of the data is read in using scan(). So the problem is the format of the file.
One way of finding out is to set the working directory to the directory where the file is. That way you see the extension of the file you read in. I know that on Windows extensions aren't shown by default, so you might believe a file is csv while it isn't.
The next thing you should do is open the file in Notepad or WordPad (or another editor) and check that the format is equivalent to my file test.csv:
Test1,Test2,Test3
1,1,1
2,2,2
3,3,3
4,4,
5,5,
,6,
This file will give you the following data frame:
> read.table(testfile,header=T,sep=",")
Test1 Test2 Test3
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 NA
5 5 5 NA
6 NA 6 NA
The csv format saved by Excel separates all cells with a comma. Empty cells just don't have a value. read.table() can easily deal with this, and recognizes empty cells just fine.
Use readLines() (with warn = FALSE) to read the file into a character vector first.
After that use the text = option to read the vector into a data frame with read.table()
pheasant <- read.table(
text = readLines(file.choose(), warn = FALSE),
header = TRUE,
sep = ","
)
I realized that several answers have been provided, but no real fix yet.
The reason, as mentioned above, is a missing "end of line" at the end of the CSV file.
While the real fix should come from Microsoft, the workaround is to open the CSV file in a text editor and add a line at the end of the file (i.e. press the return key).
I use the Atom editor, but virtually any basic text editor would do.
In the meantime, please report the bug to Microsoft.
Question: it seems to me that it is an Office 2016 problem. Does anyone have the issue on a PC?
I solved this problem by changing the encoding in the read.table argument from fileEncoding = "UTF-16" to fileEncoding = "UTF-8".
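For instance, a sketch (the file name and encoding are whatever matches your file):
x <- read.table("data.csv", header = TRUE, sep = ",", fileEncoding = "UTF-8")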
I received the same message. My fix: I deleted all the additional sheets (tabs) from the workbook, eliminated non-numeric characters, resaved the file as comma delimited and loaded it in R v2.15.0 using the standard call:
filename<-read.csv("filename",header=TRUE)
As an additional safeguard, I closed the software and reopened before I loaded the csv.
In various European locales, where the comma character serves as the decimal point, the read.csv2 function should be used instead.
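For example (file name assumed):
pheasant <- read.csv2("pheasant.csv")  # sep = ";" and dec = "," are read.csv2's defaults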
I got this problem once when I had a single quote as part of the header. When I removed it (i.e. renamed the respective column header from Jimmy's data to Jimmys data), the function returned no warnings.
In my case, it was literally the final line. The issue was fixed by literally adding a blank row at the bottom of the CSV file.
FROM
cola,colb,colc
1,2,3
4,5,6
7,8,9
INTO
cola,colb,colc
1,2,3
4,5,6
7,8,9
(empty line here)
Take a closer look at that extra blank line at the very end. Just add that blank line and it will fix the issue.
NOTE
It seems that R's CSV parser wants that very last newline character at the end of the file. Newlines are better known to programmers as the \n, \r\n, or \r characters.
The problem that you're describing occurred for me when I renamed a .xlsx as .csv.
What fixed it for me was going "Save As" and then saving it as a .csv again.
To fix this issue through R itself, I just used read.xlsx(..) (e.g. from the openxlsx or xlsx package) instead of read.csv(). Works like a charm! You do not even have to rename the file; renaming an xlsx to csv is not a viable solution.
Open the file in TextWrangler or Notepad++ and show the formatting, e.g. in TextWrangler you do Show Invisibles. That way you can see the newline and tab characters.
Often Excel will add all sorts of tabs in the wrong places and not a last newline character, but you need to show the symbols to see this.
My workaround was to open the csv file in a text editor, remove the excess commas after the last value, then save the file. For example, for the following file:
Test1,Test2,Test3
1,1,1
2,2,2
3,3,3
4,4,
5,5,
,6,,
Remove the extra comma after the 6, then save the file.
I've experienced a similar problem; however, this appears to be a generic warning and may not in fact be related to the line-end character. In my case it was giving this error because the file I was using contained Cyrillic characters; once I replaced them with Latin characters the error disappeared.
I tried different solutions, such as using a text editor to insert a new line and get the End Of Line character, as recommended in the top answer above. None of these worked, unfortunately.
The solution that finally worked for me was very simple: I copy-pasted the content of the CSV file into a new blank CSV file, saved it, and the problem was gone.
There is a quite simple solution (if it is indeed the final line which is causing trouble) where you don't need to open the file before reading it:
cat("\n", file = "your/File/Dir", append = TRUE)
