Import csv file into R data frame with UTF-8 encoding

I'm trying to import a csv file into a data frame:
pc2020 <- read.table("pc2020.csv", sep = ";", header = TRUE)
This works OK, but the encoding is wrong, so all the accented characters come out garbled.
So I'm trying:
pc2020 <- read.table("pc2020.csv", sep = ";", header = TRUE, fileEncoding = "UTF-8")
That returns:
Error in read.table("pc2020.csv", sep = ";", header = TRUE, fileEncoding = "UTF-8") :
no lines available in input
In addition: Warning message:
In read.table("pc2020.csv", sep = ";", header = TRUE, fileEncoding = "UTF-8") :
invalid input found on input connection 'pc2020.csv'

You can use the read.csv() function with the same arguments you used with read.table(), except for fileEncoding: in read.csv() you should write just encoding = "UTF-8".
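For example, a minimal sketch reusing the file and separator from the question:
# read.csv() with encoding rather than fileEncoding, as suggested above
pc2020 <- read.csv("pc2020.csv", sep = ";", header = TRUE, encoding = "UTF-8")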
Duck's answer is suitable, too.

Related

R write.table function inserts unwanted empty line at the bottom of my csv

I have this code:
write.table(df, file = f, append = FALSE, quote = TRUE, sep = ";",
            eol = "\n", na = "", dec = ".", row.names = FALSE,
            col.names = TRUE, qmethod = c("escape", "double"))
where df is my data frame and f is a .csv file name.
The problem is that the resulting csv file has an empty line at the end.
When I try to read the file:
dd<-read.table(f,fileEncoding = "UTF-8",sep = ";",header = T,quote = "\"")
I get the following error:
incomplete final line found by readTableHeader
Is there something I am missing?
Thank you in advance.
UPDATE: I solved the problem by removing the UTF-8 file encoding from the read.table call:
dd <- read.table(f, sep = ";", header = TRUE, quote = "\"")
but I can't explain why this works, since the default encoding for write.table seems to be UTF-8 anyway (I checked this using an advanced text tool).
Any idea why this is happening?
Thank you,

Reading CSV containing emojis from Google Sheets fails

I've made a survey on Google Forms and sent the results to Google Sheets.
Then I tried to download the results into R:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRkkjx8AOgNdfDW9wtaHR8wtMQOrgTB1O1wwMcJLGre3E_MixhEaIGUI7gfHw5gBQX7-gcNkRUkMM3X/pub?output=csv'
google <- read.csv(url, sep = ',', header = T, fileEncoding = "UTF-8")
and faced the problem:
Warning:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
invalid input found on input connection 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRkkjx8AOgNdfDW9wtaHR8wtMQOrgTB1O1wwMcJLGre3E_MixhEaIGUI7gfHw5gBQX7-gcNkRUkMM3X/pub?output=csv'
Only 96 of the rows were imported into my R dataset.
I checked my Google Sheet and saw that the 96th row contains an emoji, which stops any further rows from downloading.
What should I do, or which encoding should I choose, to be able to read the emojis in R?
R version: 1.2.5033
Thanks to Allan, you helped me a lot!
I also found another solution.
url <- 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRkkjx8AOgNdfDW9wtaHR8wtMQOrgTB1O1wwMcJLGre3E_MixhEaIGUI7gfHw5gBQX7-gcNkRUkMM3X/pub?output=csv'
df <- readLines(url, encoding = "UTF-8")
df <- read.table(text = df,
                 sep = ",",
                 fileEncoding = "UTF-8",
                 stringsAsFactors = FALSE)
Then I used the View() function to check the rows with emojis, and they are displayed correctly.
You can try loading the contents of the file as a character vector and then removing the emojis manually before you parse the data.
Removing the very high Unicode code points is a crude but effective way of stripping out emojis.
remove_emojis <- function(strings)
{
  # Keep only code points <= 100000; a logical filter avoids the pitfall
  # of x[-which(...)] emptying strings that contain no emoji at all
  sapply(strings, function(x) intToUtf8(utf8ToInt(x)[utf8ToInt(x) <= 100000]))
}
google <- remove_emojis(readLines(url, encoding = "UTF-8"))
df <- read.table(text = google,
                 sep = ",",
                 fileEncoding = "UTF-8",
                 stringsAsFactors = FALSE)

read.csv warning 'EOF within quoted string' to read whole file

I have a .csv file that contains 285,000 observations. When I try to import the dataset, I get the warning below and only 166,000 observations are read.
Joint <- read.csv("joint.csv", header = TRUE, sep = ",")
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
When I added quote = "", as follows:
Joint2 <- read.csv("joint.csv", header = TRUE, sep = ",", quote="", fill= TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
And when I used read.table like this, it shows 483,000 observations:
Joint <- read.table("joint.csv", header = TRUE, sep = ",", quote="", fill= TRUE)
What should I do to read the file properly?
I think the problem has to do with the file encoding. There are a lot of special characters in the header.
If you know how your file is encoded, you can specify it using the fileEncoding argument to read.csv.
Otherwise you could try fread from data.table. It is able to read the file despite the encoding issues, and it will also be significantly faster for reading such a large data file.
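A minimal sketch of both options, assuming the file name from the question and guessing UTF-8 as the encoding (substitute whatever your file actually uses):
# Option 1: declare the file's encoding explicitly if you know it
joint <- read.csv("joint.csv", header = TRUE, sep = ",", fileEncoding = "UTF-8")
# Option 2: data.table::fread is more tolerant and much faster on large files
library(data.table)
joint <- fread("joint.csv")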

Blank spaces in header read csv

I want to read a file with the read.csv2 function. This file contains blank spaces in its column names. With the parameter header = FALSE I can read the file, but when I replace FALSE by TRUE, I get this error:
Error in make.names(col.names, unique = TRUE) : invalid multibyte string 7
How can I manage this error?
My code :
client <- read.csv2("./data/Clients.csv", header = TRUE, na.strings = "",
                    stringsAsFactors = FALSE, sep = ";", encoding = "UTF-8")
Thanks for your help.
The error points to a column name containing a multibyte character that is not compatible with UTF-8.
An option is to use encoding = "UCS-2LE":
client <- read.csv2("./data/Clients.csv", header = TRUE, na.strings = "",
                    stringsAsFactors = FALSE, sep = ";", encoding = "UCS-2LE")

R reading a tsv file using specific encoding

I am trying to read a .tsv (tab-separated values) file into R using a specific encoding. It's supposedly windows-1252, and it has a header.
Any suggestions for the code to put it into a data frame?
Something like this perhaps?
mydf <- read.table('thefile.txt', header=TRUE, sep="\t", fileEncoding="windows-1252")
str(mydf)
You can also use:
read.delim('thefile.txt', header = TRUE, fileEncoding = "windows-1252")
Simply entering the command into your R console:
> read.delim
function (file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
    dec = dec, fill = fill, comment.char = comment.char, ...)
reveals that read.delim is a thin wrapper around read.table that already specifies tab as your data's separator. read.delim might be more convenient if you're working with a lot of .tsv files.
The difference between the two commands is discussed in more detail in this Stack question.
df <- read.delim("~/file_directory/file_name.tsv", header = TRUE) will work fine for a single .tsv file, because read.delim already uses tab as the separator, so there is no need for sep = "\t". fileEncoding = "windows-1252" could be used but is not necessary.
