The file I'm reading contains one word per line.
I have issues with some of these words: some characters seem to be unusual. See the following example with the first word of my list:
stopwords <- read.csv("stopwords_fr.txt",stringsAsFactors = FALSE,header=FALSE,encoding="UTF-8")$V1
stopwords[1] # "a" , if you copy paste into R studio this character with the quotes around it, you'll see a little red dot preceding the a.
stopwords[1] == "a" # FALSE
How did this happen? How can I avoid it? And if I haven't avoided it, how do I convert this dotted "a" into a regular "a"?
EDIT:
You can reproduce the issue by just copy-pasting this into RStudio:
"a" == "a" # FALSE
Here's where I got the file from:
https://sites.google.com/site/kevinbouge/stopwords-lists/stopwords_fr.txt?attredirects=0&d=1
The encoding of the file, according to Notepad++, is UTF-8-BOM, but using "UTF-8-BOM" as the encoding doesn't help, though it seemed to work in this answer:
Read a UTF-8 text file with BOM
stopwords <- read.csv("stopwords_fr.txt",stringsAsFactors = FALSE,header=FALSE,encoding="UTF-8-BOM")$V1
stopwords[1] # "a"
I have R version 3.0.2.
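A minimal sketch of two possible fixes, assuming the stray character is a UTF-8 byte-order mark (U+FEFF): pass "UTF-8-BOM" as fileEncoding, which actually converts the input (the encoding argument only tags strings with an encoding), or strip the mark after reading.
stopwords <- read.csv("stopwords_fr.txt", stringsAsFactors = FALSE, header = FALSE, fileEncoding = "UTF-8-BOM")$V1
# or strip a leading BOM from each word after the fact
stopwords <- sub("^\ufeff", "", stopwords)
stopwords[1] == "a" # should now be TRUE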
I am trying to convert some strings in an input file from UTF-8 to ASCII. For most of the strings I give it, the conversion works perfectly fine with iconv(). However, on some of them it returns NA. While manually fixing the issue in the file seems like the simplest option, it is unfortunately not an option available to me at the moment.
I have made a reproducible example of my problem, but we have to assume that I need to figure out a way for iconv() to somehow convert the string in s1 without getting NA.
Here is the reproducible example:
s1 <- "Besançon" #as read from an input file I cannot modify
s2 <- "Paris"
s3 <- "Linköping"
s4 <- "Besançon" #Manual input for testing
s1 <- iconv(s1, to='ASCII//TRANSLIT')
s2 <- iconv(s2, to='ASCII//TRANSLIT')
s3 <- iconv(s3, to='ASCII//TRANSLIT')
s4 <- iconv(s4, to='ASCII//TRANSLIT')
I get the following output:
> s1
[1] NA
> s2
[1] "Paris"
> s3
[1] "Link\"oping"
> s4
[1] "Besancon"
After playing around with the code, I figured out that something was wrong in the entry "Besançon", which is copied here exactly from the input file. When I input it manually myself, the problem is solved. Since I can't modify the input file at all, what do you think the exact issue is, and would you have any idea how to solve it?
Thanks in advance,
Edit:
After closer inspection, there is something odd in the characters of the first line. It seems to be taken away by SO's formatting.
But to reproduce it, the best I could give was two screenshots describing it. The first image places my cursor just before the #.
The second image is after pressing delete, which should delete the whitespace... it turns out it deletes the ". So there is definitely something weird there.
It turns out that using sub='' actually solved the issue, although I am quite unsure why.
iconv(s1, to='ASCII//TRANSLIT', sub='')
From the documentation for sub:
character string. If not NA it is used to replace any non-convertible
bytes in the input. (This would normally be a single character, but
can be more.) If "byte", the indication is "<xx>" with the hex code of
the byte. If "Unicode" and converting from UTF-8, the Unicode point in
the form "<U+xxxx>".
So I eventually figured out that there was a character in the string that I couldn't convert (nor even see), and using sub was a way to eliminate it. I am still not sure what this character is, but the problem is solved.
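One way to see what the hidden character actually is, rather than silently dropping it, is iconv's sub = 'byte' option, or inspecting the raw bytes directly. A diagnostic sketch, using the s1 from above:
iconv(s1, to = 'ASCII//TRANSLIT', sub = 'byte')
# any non-convertible byte is shown as <xx> instead of being dropped
charToRaw(s1)
# prints every byte of the string, hidden characters included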
There is probably a latin1 (or other non-UTF-8) character in your supposedly UTF-8 file. For example:
> latin=iconv('Besançon','utf8','latin1')
> iconv(latin,to='ascii//translit')
[1] NA
> iconv(latin,'utf8','ascii//translit')
[1] NA
> iconv(latin,'latin1','ascii//translit')
[1] "Besancon"
> iconv(latin,'Windows-1250','ascii//translit')
[1] "Besancon"
You can, for example, make one new vector or data column with the result of each character-set conversion of your data, and if one is NA, fall back to the next one, e.g.
utf8 = iconv(x,'utf8','ascii//translit')
latin1 = iconv(x,'latin1','ascii//translit')
win1250 = iconv(x,'Windows-1250','ascii//translit')
result = ifelse(
is.na(utf8),
ifelse(
is.na(latin1),
win1250,
latin1
),
utf8
)
If these encodings don't work, make a file with just the problem word, then use the unix/linux file command to detect the encoding, or else try some likely encodings.
In the past I have simply listed all of iconv's supported encodings, tried them all with lapply, and then used whichever results worked on each string. But some "from" encodings will return a non-NA yet incorrect result, so it's best to try this on each unique character in your data in order to decide which subset of iconv's encodings to use, and in which order.
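A rough sketch of that brute-force approach; the surviving results still need to be checked by eye, since some encodings will "succeed" with wrong output:
# hypothetical problem string; substitute a word from your own data
x <- "Besançon"
encs <- iconvlist() # every encoding this build of R supports
results <- lapply(encs, function(enc)
  tryCatch(iconv(x, enc, 'ASCII//TRANSLIT'), error = function(e) NA))
names(results) <- encs
# keep only the encodings that produced a result, then inspect them by hand
Filter(Negate(is.na), results)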
I have a list of words, which I got from the code below.
tags_vector <- unlist(tags_used)
Some of the strings in this list have an ellipsis at the end, which I want to remove. Here I print the 5th element of this list, and its class:
tags_vector[5]
#[1] "#b…"
class(tags_vector[5])
#[1] "character"
I am trying to remove the ellipsis from this 5th element using gsub, with the following code:
gsub("[…]", "", tags_vector[5])
#[1] "#b…"
This code doesn't work and I get "#b…" as output. But when I put the value of the 5th element into the same code directly, it works fine, as below:
gsub("[…]", "", "#b…")
#[1] "#b"
I even tried putting the value of tags_vector[5] in a variable x1 and using that in the gsub() call, but it still didn't work.
It might be a Unicode issue. In R(Studio), not all characters are created equal.
I tried to create a reproducible example:
# create the ellipsis from the definition (similar to your tags_used)
> ell_def <- rawToChar(as.raw(c('0xE2','0x80','0xA6'))) # from the unicode definition here: http://www.fileformat.info/info/unicode/char/2026/index.htm
> Encoding(ell_def) <- 'UTF-8'
> ell_def
[1] "…"
> Encoding(ell_def)
[1] "UTF-8"
# create the ellipsis from text (similar to your string)
> ell_text <- '…'
> ell_text
[1] "…"
> Encoding(ell_text)
[1] "latin1"
# show that you can get strange results
> gsub(ell_text,'',ell_def)
[1] "…"
The reproducibility of this example might be dependent on your locale. In my case, I work in windows-1252 since you cannot set the locale to UTF-8 in Windows. According to this stringi source, "R lets strings in ASCII, UTF-8, and your platform's native encoding coexist peacefully". As the example above shows, this might sometimes give contradictory results.
Basically, the output you see looks the same, but isn't on a byte level.
If I run this example in the R terminal, I get similar results, but apparently, it shows the ellipsis as a dot: ".".
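You can confirm the byte-level difference with charToRaw(). In my session the two ellipses have entirely different byte representations (0x85 is the windows-1252 ellipsis byte; yours may differ under another locale):
> charToRaw(ell_def)
[1] e2 80 a6
> charToRaw(ell_text)
[1] 85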
A quick fix for your example would be to use the ellipsis definition in your gsub. E.g.:
gsub(ell_def,'',tags_vector[5])
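A perhaps more portable variant of the same fix is the Unicode escape "\u2026", which builds the UTF-8 ellipsis without depending on how your script file happens to be encoded:
gsub("\u2026", "", tags_vector[5], fixed = TRUE)
# expected: [1] "#b"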
I'm trying to save data extracted with RSelenium from https://www.magna.isa.gov.il/Details.aspx?l=he. Although R succeeds in printing Hebrew characters to the console, it fails when exporting TXT or CSV, and in other simple R functions, like data.frame(), readHTMLTable(), etc.
Here goes an example.
> head(lines)
[1] "גלובל פיננס ג'י.אר. 2 בע\"מ נתונים כספיים באלפי דולר ארה\"ב"
[2] "513435404"
[3] ""
[4] ""
[5] ""
[6] "4,481"
The first line changes to escaped Unicode code points (below) when using data.frame():
> head(as.data.frame(lines))
[1] <U+05D2><U+05DC><U+05D5><U+05D1><U+05DC> <U+05E4><U+05D9><U+05E0><U+05E0><U+05E1> <U+05D2>'<U+05D9>.<U+05D0><U+05E8>. 2 <U+05D1><U+05E2>"<U+05DE> <U+05E0><U+05EA><U+05D5><U+05E0><U+05D9><U+05DD> <U+05DB><U+05E1><U+05E4><U+05D9><U+05D9><U+05DD> <U+05D1><U+05D0><U+05DC><U+05E4><U+05D9> <U+05D3><U+05D5><U+05DC><U+05E8> <U+05D0><U+05E8><U+05D4>"<U+05D1>
The same happens when exporting .TXT or .CSV via write.table or write.csv:
write.csv(lines,"lines.csv",row.names=FALSE)
I tried to change the encoding to "UTF-8", as suggested in several similar questions; yet the issue remains, in a different form:
iconv(lines, to = "UTF-8")
1 ׳’׳׳•׳‘׳ ׳₪׳™׳ ׳ ׳¡ ׳’'׳™.׳׳¨. 2 ׳‘׳¢"׳ ׳ ׳×׳•׳ ׳™׳ ׳›׳¡׳₪׳™׳™׳ ׳‘׳׳׳₪׳™ ׳“׳•׳׳¨ ׳׳¨׳”"׳‘
Same for Hebrew ISO-8859-8:
iconv(lines, to = "ISO-8859-8")
1 ×'×o×.×'×o ×₪×T× × ×! ×''×T.×ר. 2 ×'×¢"×z × ×a×.× ×T× ×>×!×₪×T×T× ×'××o×₪×T ×"×.×oר ×ר×""×'
I don't understand why the console prints Hebrew characters well while write.table(), write.csv() and data.frame() present encoding issues.
Can anyone help me export it?
That was answered by Ken; exporting text with writeLines() worked well:
f = file("lines.txt", open = "wt", encoding = "UTF-8")
writeLines(lines, "lines.txt", useBytes = TRUE)
close(f)
Yet the main issue R has with Hebrew encoding is when dealing with tables, in the form of as.data.frame(), write.table() and write.csv(). Any thoughts?
Some machine info:
Sys.info()
sysname release version
"Windows" "7 x64" "build 7601, Service Pack 1"
nodename machine login
"TALIS-TP" "x86"
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Many, many people have similar problems working with UTF-8 text on platforms that have 8-bit system encodings (Windows). Encoding in R can be tricky, because different methods handle encoding and conversions differently, and what appears to work fine on one platform (OS X or Linux) can work poorly on another.
The problem has to do with your output connection and how Windows handles encodings and text connections. I've tried to replicate the problem using some Hebrew texts in both UTF-8 and an 8-bit encoding. We'll walk through the file reading issues as well, since there could be some snags there too.
For Tests
Created a short Hebrew language text file, encoded as UTF-8: hebrew-utf8.txt
Created a short Hebrew language text file, encoded as ISO-8859-8: hebrew-iso-8859-8.txt. (Note: You might need to tell your browser about the encoding in order to view this one properly - that's the case for Safari for instance.)
Ways to read the files
Now let's experiment. I am using Windows 7 for these tests (everything works fine in OS X, my usual OS).
lines <- readLines("http://kenbenoit.net/files/hebrew-utf8.txt")
lines
## [1] "העברי ×”×•× ×—×‘×¨ בקבוצה ×”×›× ×¢× ×™×ª של שפות שמיות."
## [2] "זו היתה ×©×¤×ª× ×©×œ ×”×™×”×•×“×™× ×ž×•×§×“×, ×בל מן 586 ×œ×¤× ×”\"ס ×–×” התחיל להיות מוחלף על ידי ב×רמית."
That failed because it assumed the encoding was your system encoding, Windows-1252. But because no conversion occurred when you read the files, you can fix this just by setting the Encoding bit to UTF-8:
# this sets the bit for UTF-8
Encoding(lines) <- "UTF-8"
lines
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
## [2] "זו היתה שפתם של היהודים מוקדם, אבל מן 586 לפנה\"ס זה התחיל להיות מוחלף על ידי בארמית."
But better to do this when you read the file:
# this does it in one pass
lines2 <- readLines("http://kenbenoit.net/files/hebrew-utf8.txt", encoding = "UTF-8")
lines2[1]
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
Encoding(lines2)
## [1] "UTF-8" "UTF-8"
Now look at what happens if we try to read the same text, but encoded as the 8-bit ISO Hebrew code page.
lines3 <- readLines("http://kenbenoit.net/files/hebrew-iso-8859-8.txt")
lines3[1]
## [1] "äòáøé äåà çáø á÷áåöä äëðòðéú ùì ùôåú ùîéåú."
Setting the Encoding bit is of no help here, because what was read does not map to the Unicode code points for Hebrew, and Encoding() does no actual encoding conversion, it merely sets an extra bit that can be used to tell R one of a few possible encoding values. We could have solved this by adding encoding = "ISO-8859-8" to the readLines() call. We can also convert the text after loading, using iconv():
# this will not fix things
Encoding(lines3) <- "UTF-8"
lines3[1]
## [1] "\xe4\xf2\xe1\xf8\xe9 \xe4\xe5\xe0 \xe7\xe1\xf8 \xe1\xf7\xe1\xe5\xf6\xe4 \xe4\xeb\xf0\xf2\xf0\xe9\xfa \xf9\xec \xf9\xf4\xe5\xfa \xf9\xee\xe9\xe5\xfa."
# but this will
iconv(lines3, "ISO-8859-8", "UTF-8")[1]
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
Overall I think the method used above for lines2 is the best approach.
How to output the files, preserving encoding
Now to your question about how to write this: the safest way is to control your connection at a low level, where you can specify the encoding. Otherwise, the default is for R on Windows to choose your system encoding, which will lose the UTF-8. I thought the following would work, and it does work absolutely fine in OS X (on OS X it also works to call writeLines() with just a file name, without the connection object).
## to write lines, use the encoding option of a connection object
f <- file("hebrew-output-UTF-8.txt", open = "wt", encoding = "UTF-8")
writeLines(lines2, f)
close(f)
But it does not work on Windows. You can see the Windows 7 results here: hebrew-output-UTF-8-file_encoding.txt.
So, here is how to do it in Windows: Once you are sure your text is encoded as UTF-8, just write it as raw bytes, without using any encoding, like this:
writeLines(lines2, "hebrew-output-UTF-8-useBytesTRUE.txt", useBytes = TRUE)
You can see the results at hebrew-output-UTF-8-useBytesTRUE.txt, which is now UTF-8 and looks correct.
Added for write.csv
Note that the only reason you would want to do this is to make the .csv file available for import into other software, such as Excel. (And good luck working with UTF-8 in Excel/Windows...) Otherwise, you should just save the data.frame in R's binary format using save(myDataFrame, file = "myDataFrame.RData"). But if you really need to output .csv, then:
How to write UTF-8 .csv files from a data.frame in Windows
The problem with writing UTF-8 files using write.table() and write.csv() is that these open text connections, and Windows has limitations about encodings and text connections with respect to UTF-8. (This post offers a helpful explanation.) Following from an SO answer posted here, we can override this to write our own function to output UTF-8 .csv files.
This assumes that you have already set the Encoding() for any character elements to "UTF-8" (which happens upon import above for lines2).
df <- data.frame(int = 1:2, text = lines2, stringsAsFactors = FALSE)
write_utf8_csv <- function(df, file) {
firstline <- paste('"', names(df), '"', sep = "", collapse = " , ")
data <- apply(df, 1, function(x) {paste('"', x, '"', sep = "", collapse = " , ")})
writeLines(c(firstline, data), file , useBytes = TRUE)
}
write_utf8_csv(df, "df_csv.txt")
When we now look at that file on a non-Unicode-challenged OS, it looks fine:
KBsMBP15-2:Desktop kbenoit$ cat df_csv.txt
"int" , "text"
"1" , "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
"2" , "זו היתה שפתם של היהודים מוקדם, אבל מן 586 לפנה"ס זה התחיל להיות מוחלף על ידי בארמית."
KBsMBP15-2:Desktop kbenoit$ file df_csv.txt
df_csv.txt: UTF-8 Unicode text, with CRLF line terminators
I've got a strange text file with a bunch of NUL characters in it (actually about 10 such files), and I'd like to programmatically replace them from within R. Here is a link to one of the files.
With the aid of this question I've finally figured out a better-than-ad-hoc way of going into each file and finding-and-replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL] -> " ") to maintain the intended line width of the file (which is crucial for reading these as fixed-width further down the road).
However, for robustness' sake, I'd prefer a more automatable approach, ideally (for organization's sake) something I could add at the beginning of the R script I'm writing to clean up the files. This question looked promising, but the accepted answer is insufficient: readLines throws an error whenever I try to use it on these files (unless I activate skipNul).
Is there any way to get the lines of this file into R so I could use gsub or whatever else to fix this issue without resorting to external programs?
You want to read the file as binary then you can substitute the NULs, e.g. to replace them by spaces:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "00staff.txt")
str(readLines("00staff.txt"))
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__ ...
You could also substitute the NULs with a really rare character (such as "\01") and work on the string in place, e.g., let's say if you want to replace two NULs ("\00\00") with one space:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(1)
a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)
s = strsplit(a, "\n", TRUE)[[1]]
str(s)
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__
I'm getting the following error when I try to read in a fixed-width text file using read.fwf.
Here is the output:
invalid multibyte string at 'ETE<52> O 19950207 19031103 537014290 7950 WILLOWS RD
Here are the most relevant lines of code
fieldWidths <- c(10,50,30,40,6,8,8,9,35,30,9,2)
colNames <- c("certNum", "lastN", "firstN", "middleN", "suffix", "daDeath", "daBirth", "namesSSN", "namesResStr", "namesResCity", "namesResZip", "namesStCode")
dmhpNameDF <- read.fwf(fileName, widths = fieldWidths, col.names=colNames, sep="", comment.char="", quote="", fileEncoding="WINDOWS-1258", encoding="WINDOWS-1258")
I'm running R 3.1.1 on Mac OSX 10.9.4
As you can see, I've experimented with specifying alternative encodings: I've tried latin1 and UTF-8, as well as WINDOWS-1250 through WINDOWS-1258.
When I read this file into Excel, Word, or TextEdit, everything looks good in general. Using the error message text, I can identify the offending line (row) of text as row number 5496, and upon inspection I can see that the offending character shows up as an italic-looking letter 'f'. Searching for that character reveals that there are about 4 instances of it in this file. I have many such files to process, so going through them one by one to delete the offending character is not a good solution.
So far, the offending character always shows up in a name field, which is good for me, as I don't actually want the name data from this file; it is of no interest. If a numeric field were corrupted, I'd have to toss out the row.
Since Word and Excel can read the file (apparently substituting the offending character with an italic 'f'), surely there must be a way to read it in with R, but I've not figured out a solution. I have searched through the many questions related to "invalid multibyte string", but have not found any info that resolved my problem.
My goal is to be able to read in the data, either ignoring this "character error" or substituting the offending character with something else.
Unfortunately the file in question contains sensitive information so I can not post a copy of it for people to play with.
Thanks
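In case it's useful, here is a sketch of one possible approach, under the assumption that the source is roughly WINDOWS-1252 with a few stray bytes: read the raw lines first, use iconv's sub argument to drop anything non-convertible, and only then hand the cleaned text to read.fwf (fieldWidths, colNames and fileName as defined above).
# read the lines untouched, then drop any bytes iconv cannot convert
rawLines <- readLines(fileName, warn = FALSE)
cleanLines <- iconv(rawLines, from = "WINDOWS-1252", to = "UTF-8", sub = "")
# feed the cleaned text to read.fwf through a text connection
dmhpNameDF <- read.fwf(textConnection(cleanLines), widths = fieldWidths, col.names = colNames, comment.char = "", quote = "")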