I am trying to read a CSV file containing text in many different character sets using the function read.csv.
This is a sample of the file content:
device,country_code,keyword,indexed_clicks,indexed_cost
Mobile,JP,お金 借りる,5.913037843442198,103.05985173478956
Desktop,US,email,82.450427682737157,81.871030974598241
Desktop,US,news,414.14755054432345,66.502397615344861
Mobile,JP,ヤフートラベル,450.9622861586314,55.733902871922957
If I use the following call to read the data:
texts <- read.csv("text.csv", sep = ",", header = TRUE)
The data frame is imported into R, but the characters are not read correctly:
device country_code keyword indexed_clicks indexed_cost
1 Mobile JP ã\u0081Šé‡‘ 借りる 5.913038 103.05985
2 Desktop US email 82.450428 81.87103
3 Desktop US news 414.147551 66.50240
4 Mobile JP ヤフートラベル 450.962286 55.73390
If I use the same call as before, but with fileEncoding = "UTF-8":
texts <- read.csv("text.csv", sep = ",", header = TRUE, fileEncoding = "utf-8")
I get the following warning messages:
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'text.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'text.csv'
Does anyone know how to read this file properly?
I tried to replicate your problem with both:
texts <- read.csv("text.csv", sep = ",", header = TRUE)
and
texts_ <- read.csv("text.csv", sep = ",", header = TRUE, encoding = "utf-8")
and both work perfectly fine (RStudio 1.4.1717, Ubuntu 20.04.3 LTS).
Some possibilities I can think of:
The csv file wasn't saved properly as UTF-8, or it is corrupted. Have you checked the file again?
If you are using Windows, try using encoding instead of fileEncoding; see the sketch below. These problems happen with non-standard characters (Windows Encoding Hell).
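A minimal sketch of that suggestion, assuming the file really is UTF-8 (encoding marks the strings after reading, while fileEncoding re-encodes the connection itself):
texts <- read.csv("text.csv", sep = ",", header = TRUE, encoding = "UTF-8")
# if the text is still garbled, check what the file actually contains;
# readr::guess_encoding is a quick way to do that (requires the readr package)
readr::guess_encoding("text.csv")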
I made a survey on Google Forms and sent the results to Google Sheets.
Then I tried to download the results into R:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRkkjx8AOgNdfDW9wtaHR8wtMQOrgTB1O1wwMcJLGre3E_MixhEaIGUI7gfHw5gBQX7-gcNkRUkMM3X/pub?output=csv'
google <- read.csv(url, sep = ',', header = T, fileEncoding = "UTF-8")
and faced the problem:
Warning:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
invalid input found on input connection 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRkkjx8AOgNdfDW9wtaHR8wtMQOrgTB1O1wwMcJLGre3E_MixhEaIGUI7gfHw5gBQX7-gcNkRUkMM3X/pub?output=csv'
Only 96 rows were imported into my R dataset.
I checked my Google Sheet and saw that the 96th row contains an emoji, which stops any further rows from downloading.
What should I do, or which encoding should I choose, to be able to read the emojis in R?
RStudio version: 1.2.5033
Thanks, Allan, you helped me a lot!
I found another solution:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRkkjx8AOgNdfDW9wtaHR8wtMQOrgTB1O1wwMcJLGre3E_MixhEaIGUI7gfHw5gBQX7-gcNkRUkMM3X/pub?output=csv'
lines <- readLines(url, encoding = "UTF-8")  # read the raw lines as UTF-8
df <- read.table(text = lines,
                 sep = ",",
                 fileEncoding = "UTF-8",
                 stringsAsFactors = FALSE)
Then I used the View function to check the rows with emojis, and they display correctly.
You can try loading the contents of the file as a character vector, then removing the emojis manually before you parse the data.
Removing very high Unicode code points is a crude but effective way of stripping out emojis.
remove_emojis <- function(strings)
{
  # logical indexing avoids x[-which(...)], which empties emoji-free lines
  sapply(strings, function(x) {
    codes <- utf8ToInt(x)
    intToUtf8(codes[codes <= 100000])
  })
}
google <- remove_emojis(readLines(url, encoding = "UTF-8"))
df <- read.table(text = google,
                 sep = ",",
                 fileEncoding = "UTF-8",
                 stringsAsFactors = FALSE)
I have a .csv file that contains 285,000 observations. When I try to import the dataset, I get the warning below, and only 166,000 observations are read.
Joint <- read.csv("joint.csv", header = TRUE, sep = ",")
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
When I add the quote argument, as follows:
Joint2 <- read.csv("joint.csv", header = TRUE, sep = ",", quote="", fill= TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
And when I use read.table like this, it shows 483,000 observations:
Joint <- read.table("joint.csv", header = TRUE, sep = ",", quote="", fill= TRUE)
What should I do to read the file properly?
I think the problem has to do with file encoding. There are a lot of special characters in the header.
If you know how your file is encoded you can specify using the fileEncoding argument to read.csv.
Otherwise, you could try fread from data.table. It is often able to read a file despite such encoding issues, and it will also be significantly faster for a data file this large; see the sketch below.
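A minimal sketch of both suggestions (the file name joint.csv comes from the question; UTF-8 is an assumed encoding, so adjust it to whatever the file really uses):
# if you know the encoding, pass it explicitly
joint <- read.csv("joint.csv", header = TRUE, sep = ",", fileEncoding = "UTF-8")
# or let data.table::fread handle the parsing
library(data.table)
joint <- fread("joint.csv", encoding = "UTF-8")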
When I run read.csv on a dataset
read.csv(file = msleep_ggplot2, header = TRUE, sep = ",")
I get an error message:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : 'file' must be a character string or connection
The csv file loads in RStudio and looks good. Any idea what the problem might be?
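The error suggests that msleep_ggplot2 is an R object (e.g. a data frame already loaded in RStudio) rather than a file path, since file must be a quoted string or a connection. A minimal sketch, assuming the data lives on disk as msleep_ggplot2.csv (the file name is a guess):
msleep <- read.csv(file = "msleep_ggplot2.csv", header = TRUE, sep = ",")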
I was having difficulties importing an Excel sheet into R (as csv). After reading this post, I was able to import it successfully. However, I noticed that some of the numbers in a particular column have been transformed into unwanted characters: "Ï52,386.43", "Ï6,887.61", "Ï32,923.45". Any ideas how I can change these to numbers?
Here's my code below:
df <- read.csv("data.csv", header = TRUE, strip.white = TRUE,
fileEncoding="latin1", stringsAsFactors=FALSE)
I've also tried fileEncoding = "UTF-8", but this doesn't work; I'm getting the following warning:
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'data.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote
I am using a Mac with "R version 3.2.4 (2016-03-10)" (if that makes any difference). Here are the first ten entries from the affected column:
[1] "Ï52,386.43" "Ï6,887.61" "Ï32,923.45" "" "Ï82,108.44"
[6] "Ï6,378.10" "" "Ï22,467.43" "Ï3,850.14" "Ï5,547.83"
It turns out the issue was a pound sign (£) that got changed into Ï in the process of saving an xls file to csv format on Windows and then opening it on a Mac. Thanks for your replies.
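For anyone who needs to salvage such a column without re-exporting the file, a minimal sketch (the column name amount is hypothetical):
# strip the stray Ï (the mangled £) and the thousands separators, then convert;
# empty strings become NA with a coercion warning
df$amount <- as.numeric(gsub("[Ï£,]", "", df$amount))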
I am trying to make R read my CSV file (which contains numerical and categorical data). I am able to open this file on a Windows computer (I tried different ones and it always worked) without any issues, but it is not working on my Mac at all. I am using the latest version of R. Originally, the data was in Excel, and then I converted it to csv.
I have exhausted all my options; I tried recommendations from similar topics but nothing works. One time I sort of succeeded, but the result looked like this: ;32,0;K;;B;50;;;; I tried the advice given in the topic Import data into R with an unknown number of columns? and the result was the same. I am a beginner in R and really know nothing about coding or programming, so I would tremendously appreciate any kind of advice on this issue. Below are my feckless attempts to fix the problem:
> file=read.csv("~/Desktop/file.csv", sep = ";")
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<ca>110'
> file=read.csv("~/Desktop/file.csv", sep = " ")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
> ?read.csv
> file=read.csv2("~/Desktop/file.csv", sep = ";")
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<ca>110'
> file=read.csv2("~/Desktop/file.csv", sep = ";", header=TRUE)
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<ca>110'
> file=read.csv("~/Desktop/file.csv", sep=" ",row.names=1)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
> file=read.csv("~/Desktop/file.csv", row.names=1)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
> file=read.csv("~/Desktop/file.csv", sep=";",row.names=1)
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<ca>110'
This is what the header of the data looks like. Using the advice below, I saved the document in the CSV format for Mac, and once I executed the View(file) function everything looked OK, except for some rows like row #1 (Cord Number 1) below, which was completely misplaced:
Cord.Number Ply Attch Knots Length Term Thkns Color Value
1,S,U,,37.0,K,,MB,,,"5.5 - 6.5:4, 8.0 - 8.5:2",,UR1031,unknown,
1s1 S U 1S(5.5/Z) 1E(11.5/S) 46.5 K NA W 11
1s2 S U 1S(5.5/Z) 5L(11.0/Z) 21.0 B NA W 15
This is what the spreadsheet looks like in RStudio on Windows (I don't have enough reputation to post an image):
http://imgur.com/zQdJBT2
As a workaround, you can open the csv file on a Windows machine and then save it to an .RData file, R's internal storage format. You can then put the file on a USB stick (or Dropbox, Google Drive, or whatever), copy it to your Mac, and work on it there.
# on the Windows PC
dat <- read.csv("<file>", ...)
save(dat, file="<file location>/dat.rdata")
# copy the dat.rdata file over, and then on your Mac:
load("<Mac location>/dat.rdata")
fileEncoding = "latin1" is a way to make R read the file, but in my case it came with a loss of data and special characters. For example, the symbol € disappeared.
As the workaround that worked best for me for this issue (I'm on a Mac too), I first opened the file in Sublime Text and saved it "with encoding" UTF-8.
When importing it again afterwards, R read it with no problem, and my special characters were still present.
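For reference, a rough in-R equivalent of that re-save (a sketch; latin1 as the source encoding and the file names are assumptions):
con <- file("data.csv", encoding = "latin1")  # open with the presumed source encoding
lines <- readLines(con)                       # converted to the native locale on read
close(con)
writeLines(lines, "data_utf8.csv")            # written back out as UTF-8 on a Mac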
I had a similar problem, but it worked when I included fileEncoding = "latin1" after the file name.
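A minimal sketch of that fix (the file name is a placeholder):
df <- read.csv("data.csv", header = TRUE, fileEncoding = "latin1")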