Reading csv file with Japanese characters into R

I am struggling to get R to read in a csv file that has some columns in standard English characters, some numeric columns, and some fields in Japanese characters. Here is what the data looks like:
category,desc,otherdesc,volume
UPC - 31401 Age Itameabura,かどや製油 純白ごま油,OIL_OTHERS_SML_ECO,83.0
UPC - 31401 Age Itameabura,オレインリッチ,OIL_OTHERS_MED,137.0
UPC - 31401 Age Itameabura,TVキャノーラ油,OIL_CANOLA_OTHERS_LRG,3026.0
With R's language setting left as English, the Japanese characters are converted into gibberish. When I change the language setting in R to Japanese with Sys.setlocale("LC_CTYPE", "japanese"), the file is not read in at all. R gives an error saying:
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at 'サcategory'
I have no clue what's wrong with my csv file or the header names. Can you guide me as to how I can go about reading this csv file into R so that everything is displayed just as it is in the csv file?
Thanks!
Vish

For Japanese, the following works for me:
df <- read.csv("your_file.csv", fileEncoding="cp932")
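To see why this works, here is a self-contained sketch that writes a small sample in cp932 (the Windows variant of Shift-JIS, a common encoding for Japanese CSVs exported from Excel) and reads it back; the file name is a placeholder:

```r
# Write a small sample file in cp932 to imitate the problem file
con <- file("sample.csv", open = "w", encoding = "cp932")
writeLines(c("category,desc,otherdesc,volume",
             "UPC - 31401 Age Itameabura,かどや製油 純白ごま油,OIL_OTHERS_SML_ECO,83.0"),
           con)
close(con)

# Without fileEncoding, the bytes are interpreted in the current locale and
# the Japanese text is garbled; naming the encoding explicitly fixes it
df <- read.csv("sample.csv", fileEncoding = "cp932", stringsAsFactors = FALSE)
df$desc
```

If the file turns out not to be cp932, trying "UTF-8" or "shift-jis" as the fileEncoding value is the usual next step.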

Related

Keeping Turkish characters with the text mining package for R

let me start this by saying that I'm still pretty much a beginner with R.
Currently I am trying out basic text mining techniques for Turkish texts, using the tm package.
I have, however, encountered a problem with the display of Turkish characters in R.
Here's what I did:
docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")
My thinking was that setting the language to Turkish and the encoding to UTF-8 (the original encoding of the text files) should make the display of the Turkish characters İ, ı, ğ, Ğ, ş and Ş possible. Instead, the output converts these characters to I, i, g, G, s and S respectively and saves the file in an ANSI encoding, which cannot display these characters.
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))
also saves the file without the characters in ANSI encoding.
This seems to not only be an issue with the output file.
writeLines(as.character(docs[[1]]))
for example yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi"
After reading this: UTF-8 file output in R
I also tried the following code:
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)
which didn't change the results.
All of this is on Windows 7 with both the most recent version of R and RStudio.
Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.
Here is how I keep the Turkish characters intact:
Open a new .Rmd file in RStudio. (RStudio -> File -> New File -> R Markdown)
Copy and Paste your text containing Turkish characters.
Save the .Rmd file with encoding. (RStudio -> File -> Save with Encoding.. -> UTF-8)
yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")
After this step you can create your corpus
e.g. start from VectorSource() in the tm package.
Turkish characters will appear as they should.
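The read-and-collapse steps above can be sketched end to end; this version also writes the sample file first so it runs on its own (the file name and sentence are placeholders):

```r
# Write a UTF-8 file containing Turkish characters, standing in for the .Rmd
# file saved via RStudio's "Save with Encoding... -> UTF-8"
con <- file("yourdocument.Rmd", open = "w", encoding = "UTF-8")
writeLines("Okul ve cami açılışları umutları artırdı", con)
close(con)

# Declaring the encoding on readLines() keeps ı, ş, ğ etc. intact on Windows
yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")
```

From here, `VCorpus(VectorSource(yourdocument))` builds the corpus from the in-memory string rather than re-reading files from disk, which is what sidesteps the encoding loss.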

Headers changing when reading data from csv or tsv in R

I'm trying to read a data file into R but every time I do R changes the headers. I can't see any way to control this in the documentation from the read function.
I have the same data saved as both a csv and a tsv but get the same problem with both.
The headers in the data file look like this when I open it in excel or in the console:
cod name_mun age_class 1985 1985M 1985F 1986 1986M 1986F
But when I read it into R using either read.csv('my_data.csv') or read.delim('my_data.tsv') R changes the headers to this:
> colnames(my_data)
[1] "ï..cod" "name_mun" "age_class" "X1985" "X1985M" "X1985F" "X1986"
[8] "X1986M" "X1986F"
Why does R do this and how can I prevent it from happening?
You are seeing two different things here.
The "ï.." on the first column comes from having a byte order mark at the beginning of your file. Depending on how you created the file, you may be able to save as just ASCII or even just UTF-8 without a BOM to get rid of that.
R does not like to have variable names that begin with a digit. If you look at the help page ?make.names you will see
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number. Names such as ".2way" are not valid, and neither are the
reserved words.
You can get around that when you read in your data by using the check.names argument to read.csv possibly like this.
my_data = read.csv(file.choose(), check.names = FALSE)
That will keep the column names as numbers. Note that with check.names = FALSE the byte order mark will remain attached to the first column name (as "ï»¿cod") unless you also strip it, e.g. by reading with fileEncoding = "UTF-8-BOM".
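Both fixes can be seen together in a small sketch that writes a CSV with a UTF-8 byte order mark and a digit-leading header, then reads it back cleanly (file name and column names are made up for the demo):

```r
# Create a CSV whose first bytes are a UTF-8 BOM and whose third header
# starts with a digit, reproducing both symptoms from the question
con <- file("bom.csv", open = "w", encoding = "UTF-8")
writeLines(c("\ufeffcod,name_mun,1985", "1,A,100"), con)
close(con)

# fileEncoding = "UTF-8-BOM" strips the byte order mark, and
# check.names = FALSE stops R from turning "1985" into "X1985"
df <- read.csv("bom.csv", fileEncoding = "UTF-8-BOM", check.names = FALSE)
colnames(df)
```

Bear in mind that column names like "1985" are not syntactically valid, so you will need backticks (`` df$`1985` ``) to refer to them later.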

Encoding for read.csv: data is truncated and special characters are not read properly

I want to read a csv file with 5,000 observations into RStudio. If I set the encoding to UTF-8, only 3,500 observations are imported and I get 2 warning messages:
# Example code
options(encoding = "UTF-8")
df <- read.csv("path/data.csv", dec = ".", sep = ",")
1: invalid input found on input connection
2: EOF within quoted string
According to this thread, I was able to find some encodings which can read the whole csv file, e.g. windows-1258. However, with this encoding, special characters such as ä, ü or ß are not read properly.
My guess is that UTF-8 would be the right encoding, but that something is wrong with the character/factor variables of the csv file. For that reason I need a way to read the whole csv file with UTF-8. Any help is highly appreciated!

CSV separated by ';' has semicolons in some of its attributes and can't be parsed correctly

I've been downloading Tweets in the form of a .csv file with the following schema:
username;date;retweets;favorites;text;geo;mentions;hashtags;permalink
The problem is that some tweets have semi-colons in their text attribute, for example, "I love you babe ;)"
When I try to import this csv into R, I get some records with the wrong schema.
I think this format error occurs because the csv parser finds ';' in the text section and splits the row there, if you understand what I mean.
I've already tried matching with the regex: (;".*)(;)(.*";)
and replacing it with ($1)($3) until no more matches are found, but the error persists when parsing the csv.
Any ideas on how to clean this csv file? Or why is the csv parser misbehaving?
Thanks for reading
Edit1:
I think there is no problem with the structure other than a badly chosen separator (';'); look at this example record:
Juan_Levas;2015-09-14 19:59;0;2;"Me sonrieron sus ojos; y me tembló hasta el alma.";Medellín,Colombia;;;https://twitter.com/Juan_Levas/status/643574711314710528
This is a well-formatted record, but I think the semicolon in the text section (marked between double quotes) forces the parser to split the text section into two columns, in this case "Me sonrieron sus ojos" and "y me tembló hasta el alma.".
Is this possible?
Also, I'm using read.csv("data.csv", sep=';') to parse the csv into a data frame.
Edit2:
How to reproduce the error:
Get the csv from here [~2 MB]: Download csv
Do df <- read.csv('twit_data.csv', sep=';')
Explore the resulting DataFrame (You can sort it by Date, Retweets or Favorites and you will see the inconsistencies in the parsing)
Your CSV file is not properly formatted: the problem is not the separator occurring in character fields, it's rather the fact that the " are not escaped.
The best thing to do would be to generate a new file with a proper format (typically: using RFC 4180).
If it's not possible, your best option is to use a "smart" tool like readr:
library(readr)
df <- read_csv2('twit_data.csv')
It does quite a good job on your file. (I can't see any obvious parsing error in the resulting data frame)

Import dataset with spanish special characters into R

I'm new to R and I've imported a dataset in CSV format, created in Excel, into my project using the "Import Dataset from Text File" function. However, the dataset displays the Spanish special characters (á, é, í, ó, ú, ñ) as the � symbol, as below:
Nombre Direccion Beneficiado
Mu�oz B�rbara H�medo
...
I then tried this code to make R display the Spanish special characters:
Encoding(dataset) <- "UTF-8"
And received the following error:
Error in `Encoding<-`(`*tmp*`, value = "UTF-8") :
a character vector argument expected
So far I haven't been able to find a solution to this.
I'm working in RStudio Version 0.98.1083, on Windows 7.
Thanks in advance for your help.
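The error here arises because Encoding()<- expects a character vector, not a whole data frame. A minimal sketch of what it does on a single string (the latin1 assumption is mine; CSVs exported from Excel on Western-locale Windows are typically Windows-1252/latin1):

```r
# latin1 bytes for "Muñoz" — the kind of value that shows up as "Mu�oz"
# when R guesses the wrong encoding
x <- "Mu\xf1oz"

# Encoding()<- declares (does not convert) the encoding of a character vector
Encoding(x) <- "latin1"
x

# Usually simpler: declare the encoding once, when reading the file
# df <- read.csv("dataset.csv", fileEncoding = "latin1")
```

Applying Encoding()<- column by column with lapply() over the character columns would be the data-frame equivalent, but declaring fileEncoding at read time avoids the problem entirely.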