How do I read a CSV file into R?

I'm having some technical issues with loading a CSV file into R. When I inspect the csv file in RStudio's Source pane, all the characters are surrounded by weird red circles or dots. When I inspect another self-made CSV file, the characters appear perfectly fine, without any of the red circles.
What is this issue/symptom, and what would be the best way to fix this for about 40 similar CSV files?
When I try to run readfile <- read.csv("filename.csv", sep="", collapse=NULL) I get the following error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
empty beginning of file

My guess would be that you ran into some encoding issue.
Especially on Windows you can run into all sorts of problems with that.
Try opening the csv file with a text editor that can save files with various encodings (e.g. Notepad++), change the encoding to e.g. UTF-8 (the preferred encoding of RStudio, most other editors, and R itself), save the file, and try the import again.
Just make sure that you don't lose characters - special characters in particular tend to get lost during encoding changes.
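If you have to repeat this for all ~40 files, scripting the conversion may be quicker than a text editor. Here is a minimal sketch; the folder name "my_csvs" and the source encoding "UTF-16LE" (a common cause of the red dots RStudio shows for embedded NUL bytes) are assumptions you would need to adjust to your files:
in_files <- list.files("my_csvs", pattern = "\\.csv$", full.names = TRUE)
for (f in in_files) {
  con_in <- file(f, encoding = "UTF-16LE")   # assumed source encoding
  lines <- readLines(con_in)
  close(con_in)
  con_out <- file(f, open = "w", encoding = "UTF-8")
  writeLines(lines, con_out)                 # rewrite the file as UTF-8
  close(con_out)
}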
Greetings ...

Related

Reading diacritics in R

I have imported several .txt files (texts written in Spanish) to RStudio using the following code:
content = readLines(paste("my_texts", "text1", sep = "/"))
However, when I read the texts in RStudio, they contain codes instead of diacritics. For example, I see the code <97> instead of an "ó" or the code <96> instead of an "ñ".
I have also realized that if the .txt file was originally written on a computer configured in Spanish, I don't see the codes but the actual diacritics. And if the texts were written on a computer configured in English, then I do get the codes (even though I see the diacritics when opening the .txt file in TextEdit).
I don't know why R displays those symbols and what I can do to retain the diacritics I see in the original .txt files.
I read I could possibly solve this by changing the encoding to UTF-8, so I tried this:
content = readLines(paste("my_texts", "text1", sep = "/"), encoding = "UTF-8")
But that didn't work. Any ideas what those codes are and how to keep my diacritics?
As you figured out, you need to set the correct encoding. Unfortunately the text file was written using a legacy encoding rather than UTF-8 — namely, MacRoman. Ideally the application producing the file would not use this encoding, and Apple products by default no longer produce it.
But since this is what you've got, we have to deal with it, and we can. Unfortunately we need to take a detour, because the encoding argument of readLines is a bit useless. Instead, we need to open a file connection manually:
con = file(file.path("my_texts", "text1"), encoding = "macintosh")
contents = readLines(con)
close(con) # always make sure to close connections! (inside a function, use on.exit(close(con)))
Do note that the encoding name “macintosh” is strictly speaking not portable, so this might not work on all platforms.
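As a quick portability check, base R's iconvlist() reports which encoding names your platform's iconv supports, so you can verify before reading:
grep("mac", iconvlist(), ignore.case = TRUE, value = TRUE) # lists the Mac-related encoding names available here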

Warning message: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'hola.csv'

Problem: in R I get Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'hola.csv'
To simplify, I created a basic table in Excel and saved it in all the .csv formats Excel offers (comma-separated values, CSV UTF-8, MS-DOS CSV, etc.), and the warning persists in all of them. I'm working on macOS 10.15 Catalina with Excel version 16.29.1 (2019).
I changed the language of my laptop from Spanish (Spain) to UK English, selecting "," for groups and "." for decimals, because some people here suggested the problem may be due to locales that use semicolons instead of commas in CSV files by default. After this, as expected, the CSVs are indeed comma-separated, but I still get the warning.
As suggested, if I open the file in TextEdit, press Enter at the end, and save it, R works perfectly and the warning disappears, but it does not seem practical or efficient to do that every single time I want to open a CSV. On the other hand, it remains a mystery to me why colleagues using the Mac UK configuration do not get this warning (nor do I when I open CSVs they have created on their laptops).
Could it be the Excel version? Should I ignore the warning? (The table looks fine when I open it.) Thanks!
aq2 <- read.csv("hola.csv")
That is a warning generated because R's read.table expects the final line of the file to end with an end-of-line character (either \n or \r\n). It's almost always an unnecessary warning; many programs, including Excel, create files like that.
You should carefully read the warning message. It says incomplete final line found by readTableHeader. This refers to the last row of your .csv file and suggests that this line is incomplete for R to read. So what could be the problem? In a CSV (= comma-separated values) file, each line should follow the same formatting; check whether that formatting is applied consistently throughout the file. This is an issue that often pops up in hand-collected data. If you post an excerpt of your data using tail(aq2) (base R), we could look at the last line and check the formatting to answer the question in more depth. In the end it is just a warning, not an error message, but warnings are still important to understand.
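If you would rather not touch each file by hand, a small helper can append the missing final newline before reading. This is a minimal sketch, assuming the file is small enough to read into memory (the function name fix_final_newline is hypothetical):
fix_final_newline <- function(path) {
  txt <- readChar(path, file.info(path)$size, useBytes = TRUE) # read the raw file contents
  if (!endsWith(txt, "\n")) cat("\n", file = path, append = TRUE) # append a newline if missing
}
fix_final_newline("hola.csv")
aq2 <- read.csv("hola.csv") # no warning now
Alternatively, since it is only a warning, suppressWarnings(read.csv("hola.csv")) silences it without modifying the file.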

How to save a dataframe as a .csv file with UTF-8 encoding and LF line endings in R using RStudio?

I came across this weird situation:
I need to save a dataframe to a .csv file with UTF-8 encoding and LF line endings. I'm using the latest versions of R and RStudio on a Windows 10 machine.
My first attempt was to do naively:
write.csv(df, fileEncoding="UTF-8", eol="\n")
Checking with Notepad++, it appears the encoding is UTF-8, but the line ending is CRLF rather than LF. OK, let's double-check with Notepad: surprise, surprise, according to Notepad the encoding is ANSI. At this point I'm confused.
After looking at the docs for the function write.csv I read that:
CSV files do not record an encoding
I'm not an expert on the topic, so I decide to revert back and save the file as a simple .txt using write.table as follows:
write.table(df, fileEncoding="UTF-8", eol="\n")
again, the same result as above. No changes whatsoever. I tried the combinations
write.csv(df)
write.table(df)
without specified encodings, but no change. Then I set the default encoding in RStudio's global options to UTF-8 with LF line endings
and ran the tests again. No change. What am I missing??
This is an odd one, at least for me. Nonetheless, by reading the docs of write.table I found the solution. Apparently, on Windows, to save files Unix-style you have to open a binary connection to the file and then write to it with the desired eol:
f <- file("filename.csv", "wb")     # "wb" = binary mode, so Windows does no CRLF translation
write.csv(df, file = f, eol = "\n") # the eol argument is now honored
close(f)
As far as the UTF-8 format is concerned, global settings should work fine.
Check that the eol is LF using Notepad++. UTF-8 is harder to verify: on Linux, isutf8 (from moreutils) says the files are indeed UTF-8, but Windows' Notepad disagrees when saving and calls them ANSI. (A file containing only ASCII characters is valid ANSI and valid UTF-8 at the same time, which explains the disagreement.)
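If adding a dependency is acceptable, the readr package sidesteps the text-mode connection issue entirely. This is an alternative sketch, not what the answer above used, relying on the fact that current readr::write_csv always writes UTF-8 and defaults to eol = "\n" even on Windows:
library(readr)
write_csv(df, "filename.csv") # UTF-8 encoding, LF line endings by default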

R: is the encoding readLines uses dependent on any meta-information of the .txt file?

I am trying to read a given set of files which contain characters like "à". So I use the following code to read the file:
readLines(con = file.path("C:\\myFolder", "testFile.txt"), encoding = "UTF-8")
Now, instead of getting an "à", I get "\xe1" as output. Even when I remove all content of the .txt file except this specific letter and a newline, readLines produces "\xe1" as output. Then I created a new file with the same content to make a reproducible example; I called it "testFile2.txt". However, when I try:
readLines(con = file.path("C:\\myFolder", "testFile2.txt"), encoding = "UTF-8")
I get the expected output of "à". I tried manually retyping the content of the file and resaving both files under different names in a different folder, but whatever I try, the two files with seemingly identical content produce different outputs in readLines. Is there any meta-information somehow attached to the files that could be causing this?
I use Notepad++ to manipulate the files.
Edit: readr::guess_encoding was a good suggestion. For the file that worked I got:
    encoding confidence
  1 UTF-8          0.80
  2 GB18030        0.10
  3 Big5           0.10
And for the file that gave problems:
    encoding confidence
  1 UTF-8          0.15
So there does indeed seem to be an encoding problem.
Edit 2: And I got my answer: the file was saved in ANSI instead of UTF-8. Notepad++ of course kept it in that format no matter how much I moved or renamed it. And when I made a new file for the reproducible example, Notepad++ saved that one in UTF-8 automatically, causing the difference in output when reading the two files, even though their content looked identical in Notepad++.
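For anyone hitting the same thing, both the diagnosis and the fix can be scripted. A minimal sketch, assuming "ANSI" here means the usual Western Windows code page (Windows-1252, which R's iconv knows as "latin1"):
readr::guess_encoding("C:\\myFolder\\testFile.txt") # low UTF-8 confidence hints at a legacy encoding
con <- file(file.path("C:\\myFolder", "testFile.txt"), encoding = "latin1") # assumed source encoding
content <- readLines(con) # text is converted to R's native encoding on read
close(con)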

Converting Word2007 into R

I'm just starting to use R and I can't figure out how to import files from other programs into R. I tried a basic example of going from Word to R, using this website as a supposed example of how to do it: http://www.mayin.org/ajayshah/KB/R/html/r1.html. So here is what I typed:
A <- read.table("C:\\Users\\anr28\\Desktop\\x.docx", sep=",", col.names=c("year", "my1", "my2"))
I had a document named "x" in Microsoft Word, which according to the Properties menu on my computer ends with .docx. I followed the example exactly and it didn't work. These are the error messages that were printed, but I don't know how to interpret them:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 3 elements
In addition: Warning message:
In read.table("C:\\Users\\anr28\\Desktop\\x.docx", sep = ",", col.names = c("year", :
incomplete final line found by readTableHeader on 'C:\Users\anr28\Desktop\x.docx'
Please help, I'm trying to learn this on my own and it's very frustrating not being able to bring files in so I can actually learn the crux of the program, which is what I'm really after. Thanks
The read.table function (and its relatives) expects a plain text file. Word uses its own file type (hence the .docx extension), which is not plain text: it includes your data (probably compressed) along with information about fonts, colors, sizes, and a bunch of other things, in a way that R does not understand.
The best approach is to open your file in Word, then save it again as a plain text file (click the circle in the upper left corner, choose "Save As", then "Other Formats", then in the dialog box choose "Plain text (.txt)" for "Save as type"). Then read the text file into R following the example.
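For concreteness, a sketch of that last step, assuming the plain-text copy was saved as x.txt on the Desktop (the filename is hypothetical):
A <- read.table("C:\\Users\\anr28\\Desktop\\x.txt", sep = ",",
                col.names = c("year", "my1", "my2"))
head(A) # the year, my1 and my2 columns from the linked example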
The link you posted is about a file that looks like this:
1997,3.1,4
1998,7.2,19
1999,1.7,2
2000,1.1,13
"Looks like" here means that this is what you get if you open the file in a plain text editor like Notepad. A Word file is not plain text. A plain text file (often with a .txt extension, though that is not required) contains only text; a Word file can be opened and read by Word and contains the text plus typesetting, fonts, et cetera, encoded in a binary format that is not human-readable. You can see the difference by opening the Word document in Notepad.
As said in other answers, you can save your Word file as a plain text file with "Save As". You can also save data from Excel as a plain text file, which can easily be read into R.
You might want to use a plain text editor (not a word processor) for typing in simple data files - try Notepad++, which is as easy to use as Notepad but has a lot more functionality.
Google it and download it, then enter some comma-separated numbers, save the file, and read it into R.
There is also a basic text editor built into R for Windows that you can use to type R functions and data files.
It makes no sense to read data into R from a proprietary Windows format when R will happily accept any plain text format. In your case, just save the document as plain text and read that in.
