Some characters in stopwords_tr do not appear to be Turkish characters

stopwords_tr <- data.frame(word = stopwords::stopwords("tr",source="stopwords-iso"), stringsAsFactors = FALSE)
stopwords_tr
Some of the entries in stopwords_tr contain characters that are not Turkish. For example:
1 acaba
2 acep
3 adamakıllı
4 adeta
5 ait
6 altmýþ <- this should be: altmış
7 altmış
8 altý <- this should be: altı
I'm looking for a way to fix them.
stopwords_tr$word<-gsub("ý","ı",stopwords_tr$word)
The result did not change.
I also tried the following, but they didn't work either:
Encoding(stopwords_tr$word) <- "WINDOWS-1254"
Encoding(stopwords_tr$word) <- "LATIN-5"
Encoding(stopwords_tr$word) <- "UTF-8"
Another interesting thing: when you double-click stopwords_tr in RStudio to view it, the character appears as "ý". In the console, it looks like "y".
Is there a parameter to set the encoding?
Thanks to everyone.

If you're sure this is an error, I think the best way to fix this is to fix the original source: post an issue to https://github.com/stopwords-iso/stopwords-iso/issues or https://github.com/stopwords-iso/stopwords-tr/issues (not sure which is better; try one, and if you get it wrong, they'll tell you!)
But check that it really is wrong. I don't know Turkish, but when I do a Google search for "altmýþ", I find it on several pages that look like Turkish to me, e.g. https://greatsong.net/PAROLES-ISMAIL-YK,ISTEMIYORUM-SENI,101646494.html. Probably an encoding error, but if it is a common one, maybe you really do want it in the list.
Regarding the display issues: sounds like you're on Windows. R on Windows has issues displaying non-native characters. You probably don't have Icelandic installed, so it will have trouble displaying a word like altmýþ.
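If you need a local workaround until the upstream data is fixed, one option is to recover the original bytes and re-read them as Turkish. This is only a sketch, and it assumes the broken entries are Windows-1254 text that was mis-decoded as Latin-1:
fix_tr <- function(x) {
  # recover the original single-byte values, then re-interpret them as Windows-1254
  raw_latin1 <- iconv(x, from = "UTF-8", to = "latin1")
  fixed <- iconv(raw_latin1, from = "WINDOWS-1254", to = "UTF-8")
  ifelse(is.na(fixed), x, fixed)   # entries that were already fine come back unchanged
}
stopwords_tr$word <- fix_tr(stopwords_tr$word)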

I followed @user2554330's advice, although I reported the issue at a different address than the ones he suggested.
I contacted the creator of stopwords-tr (Kenneth Benoit). The problem stems from a mis-encoded data source. I also noticed duplicated words and reported them. Together we solved the character problem, and stopwords-tr was updated in the following pull request:
(Fix Turkish #16)
https://github.com/quanteda/stopwords/pull/16
devtools::install_github("quanteda/stopwords", ref = "fix-tr")
stopwords("tr", source = "stopwords-iso")
"Turkish Stopwords" now seems to be properly encoded.
Greetings..

Related

How to edit hidden character in String

The appearance of "textparcali" in RStudio Source Editor was as follows.
In textparcali (tbl_df), I ran the following code to delete single strings.
textparcali$word<-gsub("\\W*\\b\\w\\b\\W*",'', textparcali$word)
But the deletion was interesting. You can see the picture below; please note lines 67 and 50.
Everything was fine for line 50 and lines like it. However, this was not the case for line 67 (and I think there are others like it).
I focused on one line (67) to understand why it was deleted incorrectly. I had already seen what this line contains in the editor, but I also wanted to look at it in the console, so I wrote the following code:
textparcali$word[67]
The word on line 67 looks different in the console. There is a value that does not survive copy-paste but, surprisingly, does appear in the console:
The reason I show it as a picture is that this character disappears after copy-paste.
You can download the file containing this character from the link below. However, you should open it with Notepad++.
Character.txt
So gsub did its job correctly after all. But how is that possible? What is the name of this character? When I try to write code to remove this character, the " sign changes and the character is not deleted.
The command textparcali$word <- gsub('[[:punct:]]+', ' ', textparcali$word) does not work either.
How do I explain what I am seeing? I do not know. Is there a way to remove this character? What caused it? I know I have asked a lot of questions.
Thank you all.
(I apologize for the bad scribbles in the pictures.)
I found the surprise character.
It turned out to be COMBINING DOT ABOVE RIGHT (U+0358).
The following is the code required to eliminate this character.
c<-"surprise character"
c
[1] "\u0358"
textparcali$word<-gsub("\u0358","",textparcali$word,ignore.case = FALSE)
textparcali$word<-gsub("\u307","",textparcali$word,ignore.case = FALSE)
Code \u0307 did the job for me. However, you should determine what the actual code point is in your own data; otherwise the character code you use may be wrong.
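To find out exactly which code points are hiding in a word, something like this should work (a sketch; index 67 is just the problem row from above):
word <- textparcali$word[67]
# list every code point in the word; combining marks such as U+0307 or U+0358
# show up here even though the editor does not display them
sprintf("U+%04X", utf8ToInt(word))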
More detailed information can be found in the links below.
https://gist.github.com/ngs/2782436
https://www.charbase.com/0358-unicode-combining-dot-above-right
Thanks a lot!

Using Japanese letters in paste function

I want to download search query data from Google Trends for both Japanese and English search terms. It works perfectly fine when I use English search terms only, but it does not work as soon as I include Japanese letters.
My code is the following (I included the default keyword just for this example, to make it easier to use):
URL_GT = function(keyword = "Toyota Aygo %2B Toyota Yaris %2B Toyota Vitz %2B トヨタヴィッツ",
                  year = 2010, month = 1, length = 68) {
  start = "http://www.google.com/trends/trendsReport?hl=en-US&q="
  end   = "&cmpt=q&content=1&export=1"
  date  = ""
  queries = keyword[1]
  if (length(keyword) > 1) {
    for (i in 2:length(keyword)) {
      queries = paste(queries, "%2C ", keyword[i], sep = "")
    }
  }
  # Dates
  if (!is.na(year)) {
    date = "&date="
    date = paste(date, month, "%2F", year, " ", month + length - 1, "m", sep = "")
  }
  URL = paste(start, queries, date, end, sep = "")
  browseURL(URL)
}
When I look at the download URL that gets called in my browser, I can see that the Japanese letters get transformed into a mix of % signs, numbers and letters, but they are not supposed to change at all.
When I use
Sys.setlocale("LC_CTYPE","japanese_JAPAN")
I get the following paste result
paste("トヨタヴィッツ","Toyota Vitz", sep = "")
[1] "ƒgƒˆƒ^ƒ”ƒBƒbƒcToyota Vitz"
I think this shows pretty clearly that the paste() function does not seem to work as intended.
Using
Sys.setlocale("LC_CTYPE","german_GERMANY")
I get the following error message:
unexpected INCOMPLETE_STRING
1: URL_GT=function(keyword="Toyota Aygo %2B Toyota Yaris %2B Toyota Vitz %2B ?
indicating that R cannot interpret the Japanese letters.
I tried finding a solution, but could only find tips that led me to change my locale. As described above, this has not worked for me so far. I also found this tip, but I got the same error as the asker of that question, namely:
Warning message: In Sys.setlocale("LC_CTYPE", "UTF-8") : OS reports request
to set locale to "UTF-8" cannot be honored
I am very grateful for any help! Since this is my first post ever I hope that everything concerning structure and detail is alright.
I found a solution that works just fine for me. I had to change the language for non-Unicode programs in order for the Japanese locale to work properly.
On Windows 8.1 you have to go to Control Panel > Clock, Language, and Region > Region > Administrative, where you can change the system locale accordingly (in my case to Japanese), then restart your PC afterwards.
If you now set your locale to
Sys.setlocale("LC_CTYPE","japanese_JAPAN")
typing in paste should return what you asked for, e.g.
paste("It works", "トヨタヴィッツ", sep=" ")
[1] "It works トヨタヴィッツ"
The only thing that still confuses me is that when I open the Excel file after the download, the Japanese letters appear in a new cryptic way.
I tried downloading the data for the word manually and got the same result in the Excel file, so I guess the data itself should be correct. Unfortunately I did not download a CSV file of the Japanese data before I changed my language for non-Unicode programs, to see whether Excel messed it up there as well. But when I restored my settings to German again, the same cryptic letters appeared in the downloaded file.
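As a side note: the % sequences in the URL are simply percent-encoded UTF-8 bytes, which is what a query string is expected to contain. A minimal sketch of producing them explicitly with base R (assuming the keyword is stored as UTF-8):
keyword <- "トヨタヴィッツ"
# percent-encode the UTF-8 bytes so they survive inside a query string
URLencode(enc2utf8(keyword), reserved = TRUE)
# should give something like "%E3%83%88%E3%83%A8%E3%82%BF..."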

How to resolve read.fwf runtime error: invalid multibyte string in R

I'm getting the following error when I try to read in a fixed-width text file using read.fwf.
Here is the output:
invalid multibyte string at 'ETE<52> O 19950207 19031103 537014290 7950 WILLOWS RD
Here are the most relevant lines of code:
fieldWidths <- c(10,50,30,40,6,8,8,9,35,30,9,2)
colNames <- c("certNum", "lastN", "firstN", "middleN", "suffix", "daDeath", "daBirth", "namesSSN", "namesResStr", "namesResCity", "namesResZip", "namesStCode")
dmhpNameDF <- read.fwf(fileName, widths = fieldWidths, col.names=colNames, sep="", comment.char="", quote="", fileEncoding="WINDOWS-1258", encoding="WINDOWS-1258")
I'm running R 3.1.1 on Mac OSX 10.9.4
As you can see, I've experimented with specifying alternative encodings; I've tried latin1 and UTF-8 as well as WINDOWS-1250 through WINDOWS-1258.
When I read this file into Excel, Word, or TextEdit, everything looks good in general. Using the error message text I can identify the offending line (row) as row number 5496, and upon inspection I can see that the offending character shows up as an italic-looking letter 'f'. Searching for that character reveals about 4 instances of it in this file. I have many such files to process, so going through them one by one to delete the offending character is not a good solution.
So far, the offending character always shows up in a name field, which is good for me as I don't actually want the name data from this file; it is of no interest. If a numeric field were corrupted, then I'd have to toss out the row.
Since Word and Excel can read the file (apparently substituting an italic 'f' for the offending character), surely there must be a way to read it in with R, but I've not figured out a solution. I have searched through the many questions related to "invalid multibyte string", but have not found anything that resolved my problem.
My goal is to be able to read in the data either ignoring this "character error" or substituting the offending character with something else.
Unfortunately the file in question contains sensitive information so I can not post a copy of it for people to play with.
Thanks
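One approach that might work here (only a sketch, untested since the file cannot be shared; it assumes the stray bytes are Windows-1252 and that dropping non-ASCII characters from the name fields is acceptable): read the raw lines, strip anything that is not plain ASCII, then parse the cleaned text with read.fwf() via a text connection.
con <- file(fileName, encoding = "WINDOWS-1252")
rawLines <- readLines(con)
close(con)
cleanLines <- iconv(rawLines, to = "ASCII", sub = "")   # drop unconvertible characters
dmhpNameDF <- read.fwf(textConnection(cleanLines), widths = fieldWidths,
                       col.names = colNames, sep = "",
                       comment.char = "", quote = "")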

cat and "\b"funky character in external file

I am attempting to write homegrown LaTeX output from R using cat, but I have run into a snag that I suspect has to do with encoding, about which I know nothing, not even where to start.
Using cat like this:
cat(paste0("\b", paste0(1, 2, "r")))
Produces exactly what I expect in the console. But:
cat(paste0("\b", paste0(1, 2, "r")), file="foo.txt")
gives an odd square character where the "\b" was (as seen HERE). I doubt this is a new problem for R/LaTeX users creating homegrown output, but I am obviously not searching with the right keywords to find an answer.
What is happening?
How do I fix it?
EDIT: Per Dason's suggestion:
> readLines("foo.txt")
[1] "\b 1 2 r"
Nothing is wrong. Your editor is displaying the square character in place of \b. Try
readLines("foo.txt")
to see that "\b12r" is what is stored in the file.

Decode <e8> <e9> etc. characters in R

How can I decode characters that show up as byte escapes like <e9> (é) or <b0> (°) in a table that was saved with write.table without the UTF-8 option?
Apologies if the answer is obvious, but the R documentation pages mention nothing other than enc2utf8() (which does not work here).
Note: if there is no solution, I know I can either gsub() the whole thing (but that would be long and messy) or generate the data again (but that would take some real time, as it is crawler data).
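Those <e9> / <b0> escapes look like Latin-1 bytes, so two things that may work (a sketch; the file and column names below are placeholders): declare the encoding when reading the table back in, or convert an already loaded character column with iconv():
# re-read the file, telling R that the bytes are Latin-1
tab <- read.table("mytable.txt", header = TRUE, stringsAsFactors = FALSE,
                  fileEncoding = "latin1")
# or repair a column that is already in memory
tab$text <- iconv(tab$text, from = "latin1", to = "UTF-8")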

Resources