I have the same problem as explained here; the only difference is that the CSV file contains non-English strings, and I couldn't find a solution for it.
When I read the CSV file without an encoding, I get no error, but the data is changed to:
network=read.csv("graph1.csv",header=TRUE)
اشپیل(60*4)
And if I run read.csv with fileEncoding, it gives me these warnings and the data frame ends up empty:
network=read.csv("graph1.csv",fileEncoding="UTF-8",header=TRUE)
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'graph1.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'graph1.csv'
network[1]
[1] X.
<0 rows> (or 0-length row.names)
System info:
Windows Server 2008
R 3.1.2
Sample file:
node1,node2,weight
ورق800*750*6,ورق 1350*1230*6mm,0.600000024
ورق900*1200*6,ورق 1350*1230*6mm,0.600000024
ورق76*173,ورق 1350*1230*6mm,0.600000024
ورق76*345,ورق 1350*1230*6mm,0.600000024
ورق800*200*4,ورق 1350*1230*6mm,0.600000024
I tried this with your input:
> read.csv("graph1.csv", encoding="UTF-8")
X.U.FEFF.node1 node2 weight
1 <U+0648><U+0631><U+0642>800*750*6 <U+0648><U+0631><U+0642> 1350*1230*6mm 0.6
2 <U+0648><U+0631><U+0642>900*1200*6 <U+0648><U+0631><U+0642> 1350*1230*6mm 0.6
3 <U+0648><U+0631><U+0642>76*173 <U+0648><U+0631><U+0642> 1350*1230*6mm 0.6
4 <U+0648><U+0631><U+0642>76*345 <U+0648><U+0631><U+0642> 1350*1230*6mm 0.6
5 <U+0648><U+0631><U+0642>800*200*4 <U+0648><U+0631><U+0642> 1350*1230*6mm 0.6
The following should work – mind you, I can’t test it since I don’t have Windows (and Windows, Unicode and R simply do not mix):
x = read.csv('graph1.csv', fileEncoding = '', stringsAsFactors = FALSE)
At this point, x is gibberish, since it was read as-is, without parsing the byte data into an encoding. We should be able to verify this:
Encoding(x[1, 1])
# [1] "unknown"
Now we tell R to treat it as UTF-8:
x = as.data.frame(lapply(x, iconv, from = 'UTF-8', to = 'UTF-8'),
                  stringsAsFactors = FALSE)
These two steps can be compressed into one by using encoding instead of fileEncoding as the argument to read.csv:
x = read.csv('graph1.csv', encoding = 'UTF-8', stringsAsFactors = FALSE)
In either case, roughly the same process takes place.
At this point, x still appears as gibberish, since your terminal on Windows presumably does not support a Unicode code page which R understands. In fact, when running the code with a non-UTF-8 code page on my Mac, I now get the following output:
x[1, 1]
# [1] "<U+0648><U+0631><U+0642>800*750*6"
However, at least the encoding is now correctly set, and the bytes are parsed:
Encoding(x[1, 1])
# [1] "UTF-8"
And if you pass the data to a device or program which speaks UTF-8, it should appear correctly. For instance, using the data as labels in a plot command should work:
plot.new()
text(0.5, seq(0, 1, along.with = x[, 1]), x[, 1])
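For completeness, a minimal sketch of writing the re-encoded data back out as a UTF-8 file (write.csv passes fileEncoding through to write.table; the output file name is just an example):
# Round-trip the UTF-8-tagged data to a new file
write.csv(x, 'graph1_utf8.csv', fileEncoding = 'UTF-8', row.names = FALSE)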
I'm trying to read a large file into R which is separated by the "not" sign (¬). What I normally do is change this symbol into semicolons using Text Edit and save it as a CSV file, but this file is too large, and my computer keeps crashing when I try to do so. I have tried the following options:
my_data <- read.delim("myfile.txt", header = TRUE, stringsAsFactors = FALSE, quote = "", sep = "\t")
which results in a dataframe with a single row. This makes sense, I know, since my file is not separated by tabs but by the not sign. However, when I try to change sep to ¬ or \¬, I get the following message:
Error in scan(file, what = "", sep = sep, quote = quote, nlines = 1, quiet = TRUE, :
invalid 'sep' value: must be one byte
I have also tried with
my_data <- read.csv2(file.choose("myfile.txt"))
and
my_data <- read.table("myfile.txt", sep="\¬", quote="", comment.char="")
getting similar results. I have searched for options similar to mine, but this kind of separator is not commonly used.
You can try to read in a piped translation of it. (In a UTF-8 locale, ¬ occupies two bytes, which is why read.csv rejects it as a separator: sep must be one byte.)
Setup:
writeLines("a¬b¬c\n1¬2¬3\n", "quux.csv")
The work:
read.csv(pipe("tr '¬' ',' < quux.csv"))
# a b c
# 1 1 2 3
If commas don't work for you, this works equally well with other replacement chars:
read.table(pipe("tr '¬' '\t' < quux.csv"), header = TRUE)
# a b c
# 1 1 2 3
The tr utility is available on all Linux distributions, should be available on macOS, and is included in Rtools for Windows (as well as Git Bash, if you have that).
If there is an issue using pipe, you can always use the tr tool to create another file (replacing your text-editor step):
system2("tr", c("¬", ","), stdin="quux.csv", stdout="quux2.csv")
read.csv("quux2.csv")
# a b c
# 1 1 2 3
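If neither pipe nor an external tr is an option, a pure-R fallback is a sketch along these lines, assuming the file is UTF-8 and the data contains no literal commas:
# Read raw lines, swap the multi-byte separator for a comma, then parse
raw_lines <- readLines("quux.csv", encoding = "UTF-8")
read.csv(text = gsub("\u00AC", ",", raw_lines))
#   a b c
# 1 1 2 3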
I have a large data set (~20000x1). Not all the fields are filled; in other words, the data has missing values. Each feature is a string.
I have run the following code:
Input:
data <- read.csv("data.csv", header=TRUE, quote = "")
datan <- read.table("data.csv", header = TRUE, fill = TRUE)
Output for the second call:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 80 elements
Input:
datar <- read.csv("data.csv", header = TRUE, na.strings = NA)
Output:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
I run into essentially four problems, as I see it. Two of them are the error messages stated above. The third is that, when no error is produced, the global environment window shows that not all my rows are accounted for (roughly 14,000 samples are missing) although the feature count is right. The fourth is that, again, not all the samples are accounted for and the feature count is not correct either.
How can I solve this?
Try the argument comment.char = "" as well as quote = "". The hash (#) is read by R as a comment character and will cut a line short.
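A sketch of the combined call, using the file name from the question:
# Disable quote and comment handling so stray quotes and '#'
# cannot cut a record short
data <- read.csv("data.csv", header = TRUE, quote = "", comment.char = "")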
Can you open the CSV using Notepad++? This will allow you to see 'invisible' characters and any other non-printable characters. That file may not contain what you think it contains! When you get the sourcing issue resolved, you can choose the CSV file with a selector tool.
filename <- file.choose()
data <- read.csv(filename, skip=1)
name <- basename(filename)
Or, hard-code the path, and read the data into R.
# Read CSV into R
MyData <- read.csv(file="c:/your_path_here/Data.csv", header=TRUE, sep=",")
In R, scan and readLines both read files, but they return different classes of output. To get a vector for further steps, I use scan to read files. However, one of my text files always produces an error, like below:
filt <- "E:/lexicon/wenku_baidu_com/stopwords_cn.txt"
specialfilter <- scan(file = filt, what=character(), nmax = -1, sep = "\n", blank.lines.skip = TRUE, skipNul = TRUE, fileEncoding = "UTF-8")
Read 1 item
Warning message:
In scan(file = filt, what = character(), nmax = -1, sep = "\n", :
invalid input found on input connection 'E:/lexicon/wenku_baidu_com/stopwords_cn.txt'
I have checked the environment several times: the directory is correct and the file encoding is UTF-8. The salient feature of this file is that it has thousands of lines. If I use readLines, there are no errors at all:
specialfilter<-readLines(filt, encoding = "UTF-8", skipNul = FALSE)
My questions are:
1. Does scan have a limit on the number of lines it can read from one file? If the answer is "yes", how many lines can it read?
2. If we can only use readLines in this case, how do I change the result (specialfilter) into a vector?
PS: the file is uploaded to network storage; it's only 12 KB: https://yunpan.cn/OcMTMXyFXNQzYu (access code: 3c9d)
I was having difficulty importing an Excel sheet into R (as CSV). After reading this post, I was able to import it successfully. However, I noticed that some of the numbers in a particular column have been transformed into unwanted characters: "Ï52,386.43", "Ï6,887.61", "Ï32,923.45". Any ideas how I can change these to numbers?
Here's my code below:
df <- read.csv("data.csv", header = TRUE, strip.white = TRUE,
fileEncoding="latin1", stringsAsFactors=FALSE)
I've also tried fileEncoding = "UTF-8", but this doesn't work; I get the following warnings:
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'data.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote
I am using a Mac with "R version 3.2.4 (2016-03-10)" (if that makes any difference). Here are the first ten entries from the affected column:
[1] "Ï52,386.43" "Ï6,887.61" "Ï32,923.45" "" "Ï82,108.44"
[6] "Ï6,378.10" "" "Ï22,467.43" "Ï3,850.14" "Ï5,547.83"
It turns out the issue was a pound sign (£) that got changed into Ï in the process of saving an xls file into CSV format (saved on Windows, opened on a Mac). Thanks for your replies.
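For reference, once the stray character is identified, values like these can be cleaned and coerced; a minimal sketch (the column name amount is hypothetical):
# Strip the stray 'Ï' and the thousands separators, then convert to numeric
df$amount <- as.numeric(gsub("[Ï,]", "", df$amount))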
I am trying to make R read my CSV file (which contains numerical and categorical data). I am able to open this file on a Windows computer (I tried different ones and it always worked) without any issues, but it is not working on my Mac at all. I am using the latest version of R. Originally, the data was in Excel and then I converted it to CSV.
I have exhausted all my options. I tried recommendations from similar topics, but nothing works. One time I sort of succeeded, but the result looked like this: ;32,0;K;;B;50;;;; I tried the advice given in this topic Import data into R with an unknown number of columns? and the result was the same. I am a beginner in R and I really know nothing about coding or programming, so I would tremendously appreciate any kind of advice on this issue. Below are my feckless attempts to fix this problem:
> file=read.csv("~/Desktop/file.csv", sep = ";")
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<ca>110'
> file=read.csv("~/Desktop/file.csv", sep = " ")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
> ?read.csv
> file=read.csv2("~/Desktop/file.csv", sep = ";")
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<ca>110'
> file=read.csv2("~/Desktop/file.csv", sep = ";", header=TRUE)
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<ca>110'
> file=read.csv("~/Desktop/file.csv", sep=" ",row.names=1)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
> file=read.csv("~/Desktop/file.csv", row.names=1)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
> file=read.csv("~/Desktop/file.csv", sep=";",row.names=1)
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<ca>110'
This is what the header of the data looks like. Using the advice below, I saved the document in the CSV format for Mac, and once I executed View(file), everything looked OK except for some rows, like row #1 (Cord Number 1) below, which were completely misplaced:
Cord.Number Ply Attch Knots Length Term Thkns Color Value
1,S,U,,37.0,K,,MB,,,"5.5 - 6.5:4, 8.0 - 8.5:2",,UR1031,unknown,
1s1 S U 1S(5.5/Z) 1E(11.5/S) 46.5 K NA W 11
1s2 S U 1S(5.5/Z) 5L(11.0/Z) 21.0 B NA W 15
This is what the spreadsheet looks like in RStudio on Windows (I don't have enough reputation to post an image):
http://imgur.com/zQdJBT2
As a workaround, you can open the CSV file on a Windows machine and then save it to a .rdata file. Rdata is R's internal storage format. You can then put the file on a USB stick (or Dropbox, Google Drive, or whatever), copy it to your Mac, and work on it there.
# on the Windows PC
dat <- read.csv("<file>", ...)
save(dat, file="<file location>/dat.rdata")
# copy the dat.rdata file over, and then on your Mac:
load("<Mac location>/dat.rdata")
fileEncoding="latin1" is a way to make R read the file, but in my case it came with loss of data and special characters. For example, the symbol € disappeared.
As the workaround that worked best for me for this issue (I'm on a Mac too), I first opened the file in Sublime Text and saved it "with encoding" UTF-8.
When importing it again afterwards, R read it with no problem, and my special characters were still present.
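An untested alternative that may avoid re-saving the file: € exists in Windows-1252 but not in ISO-8859-1 ("latin1"), so declaring that encoding when reading might keep the symbol intact ("CP1252" is a common alias, depending on platform; the path is from the question above):
# Treat the input as Windows-1252, which, unlike latin1, contains €
df <- read.csv("~/Desktop/file.csv", fileEncoding = "windows-1252",
               stringsAsFactors = FALSE)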
I had a similar problem, but including fileEncoding="latin1" after the file name made it work.