I am trying to write homegrown LaTeX output from R using cat, but I have run into a snag that I suspect has to do with encoding, which I know nothing about; I don't even know where to start.
Using cat like this:
cat(paste0("\b", paste0(1, 2, "r")))
Produces exactly what I expect in the console. But:
cat(paste0("\b", paste0(1, 2, "r")), file="foo.txt")
gives an odd square character where the "\b" was (as seen HERE). I doubt this is a new problem for R/LaTeX users creating homegrown output, but I am obviously not searching with the right keywords to find an answer.
What is happening?
How do I fix it?
EDIT: Per Dason's suggestion:
> readLines("foo.txt")
[1] "\b 1 2 r"
Nothing is wrong. Your editor is displaying the square character in place of \b. Try
readLines("foo.txt")
to see that "\b12r" is what is stored in the file.
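To see why nothing is wrong, it may help to confirm that "\b" inside an R string is a single backspace character (code 8), not a backslash followed by the letter b. The following is a minimal sketch; it writes to a temporary file rather than foo.txt, and the variable names are only illustrative. If the goal is a literal backslash in the LaTeX output, the backslash has to be doubled in the R string.

```r
# "\b" is R's escape for the backspace control character (ASCII 8).
s <- paste0("\b", paste0(1, 2, "r"))
n_chars <- nchar(s)                        # backspace, "1", "2", "r"
first_code <- utf8ToInt(substr(s, 1, 1))   # code point of the first character

# Write to a temporary file and read it back; the backspace survives unchanged.
tmp <- tempfile(fileext = ".txt")
cat(s, file = tmp)
stored <- readLines(tmp, warn = FALSE)     # warn = FALSE: no trailing newline was written

# For a literal backslash (e.g. a LaTeX command), escape it as "\\".
latex <- paste0("\\b", paste0(1, 2, "r"))
cat(latex, "\n")                           # prints \b12r with a real backslash
```

An editor that shows a square is simply rendering the unprintable backspace; the file content matches what R was asked to write.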
The appearance of "textparcali" in RStudio Source Editor was as follows.
In textparcali (a tbl_df), I ran the following code to delete single-character words.
textparcali$word<-gsub("\\W*\\b\\w\\b\\W*",'', textparcali$word)
But the result was surprising. You can see the picture below. Please note lines 67 and 50.
Everything was fine for line 50 and lines like it. However, this was not the case for line 67 (and I think there are others like it).
I focused on one line (67) to understand why it was handled incorrectly. I had already seen what this line contains in the editor, but I also wanted to look at it in the console, so I ran the following code there.
textparcali$word[67]
The word on line 67 looks different in the console. There is a value that disappears when you copy and paste it but, surprisingly, appears in the console:
The reason I posted it as a picture is that this character disappears after copy-paste.
You can download the file containing this character from the link below. However, you should open it with Notepad++.
Character.txt
gsub did its job correctly. How is that possible? What is the name of this character? When I try to write code that removes this character, the " sign changes and the character is not deleted.
The command textparcali$word<-gsub('[[:punct:]]+',' ',textparcali$word) does not work either.
What explains this? I do not know. Is there a way to remove this character? What caused it? I know I have asked a lot.
Thank you all.
(I apologize for the bad scribbles in the pictures.)
I found the surprise character.
Combining Dot Above Right (U+0358): ͘
The following is the code required to eliminate this character.
c<-"surprise character"
c
[1] "\u0358"
textparcali$word<-gsub("\u0358","",textparcali$word,ignore.case = FALSE)
textparcali$word<-gsub("\u307","",textparcali$word,ignore.case = FALSE)
U+0307 (Combining Dot Above) did the job for me. However, you should determine the actual code point in your own data; otherwise you may be removing the wrong character.
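One way to determine the actual code point, rather than guessing between U+0358 and U+0307, is to list the code points of the offending string. The sketch below uses a hypothetical sample word (an "i" followed by a combining dot above), not the real contents of textparcali$word.

```r
# Hypothetical sample: "i" + COMBINING DOT ABOVE (U+0307) + "yi".
word <- "i\u0307yi"

# List every code point in the string in the usual U+XXXX notation.
codes <- utf8ToInt(word)
labels <- sprintf("U+%04X", codes)
print(labels)

# Once the code point is known, remove exactly that character.
# fixed = TRUE treats the pattern as a literal string, not a regex.
cleaned <- gsub("\u0307", "", word, fixed = TRUE)
print(cleaned)
```

Applied to the real data, the same utf8ToInt call on textparcali$word[67] should reveal which combining mark is present.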
More detailed information can be found in the links below.
https://gist.github.com/ngs/2782436
https://www.charbase.com/0358-unicode-combining-dot-above-right
Thanks a lot!
stopwords_tr <- data.frame(word = stopwords::stopwords("tr",source="stopwords-iso"), stringsAsFactors = FALSE)
stopwords_tr
Some characters in stopwords_tr are not Turkish. For example:
1 acaba
2 acep
3 adamakıllı
4 adeta
5 ait
6 altmýþ <- this should be: altmış
7 altmış
8 altý <- this should be: altı
I'm looking for a way to fix them.
stopwords_tr$word<-gsub("ý","ı",stopwords_tr$word)
The result did not change.
I also tried these, but they did not help.
Encoding(stopwords_tr$word) <- "WINDOWS-1254"
Encoding(stopwords_tr$word) <- "LATIN-5"
Encoding(stopwords_tr$word) <- "UTF-8"
Another interesting thing:
When you double-click stopwords_tr in RStudio to display it, the character appears as "ý". In the console, it looks like "y".
Is there a parameter to set encoding?
Thanks to everyone.
If you're sure this is an error, I think the best way to fix this is to fix the original source: post an issue to https://github.com/stopwords-iso/stopwords-iso/issues or https://github.com/stopwords-iso/stopwords-tr/issues (not sure which is better; try one, and if you get it wrong, they'll tell you!)
But check that it really is wrong. I don't know Turkish, but when I do a Google search for "altmýþ", I find it on several pages that look like Turkish to me, e.g. https://greatsong.net/PAROLES-ISMAIL-YK,ISTEMIYORUM-SENI,101646494.html. Probably an encoding error, but if it is a common one, maybe you really do want it in the list.
Regarding the display issues: sounds like you're on Windows. R on Windows has issues displaying non-native characters. You probably don't have Icelandic installed, so it will have trouble displaying a word like altmýþ.
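If the odd entries really are an encoding error, one plausible explanation is that the words were stored as Windows-1254 (Turkish) bytes and later read as Latin-1: the bytes 0xFD and 0xFE are "ı" and "ş" in Windows-1254 but "ý" and "þ" in Latin-1. The sketch below reverses that under exactly this assumption; it is not confirmed as the cause here, the sample vector is hypothetical, and the encoding name accepted by iconv ("CP1254" here) can vary by platform.

```r
# Hypothetical mis-decoded words: "altmýþ" and "altý".
bad <- c("altm\u00fd\u00fe", "alt\u00fd")

# Step 1: go back to the single bytes that were misread as Latin-1.
bytes <- iconv(bad, from = "UTF-8", to = "latin1")

# Step 2: reinterpret those same bytes as Windows-1254 and convert to UTF-8.
fixed <- iconv(bytes, from = "CP1254", to = "UTF-8")
print(fixed)
```

Note that Encoding(x) <- value only changes the declared label of a string (and only accepts "latin1", "UTF-8", "bytes" or "unknown"); it does not convert any bytes, whereas iconv does.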
I followed @user2554330's advice. However, I reported the problem at a different address than the ones he suggested.
I contacted the creator of stopwords-tr (Kenneth Benoit). The problem stems from a mis-encoded data source. I also noticed repeated words and reported them. Together we solved the character problem, and stopwords-tr was updated at the following address:
(Fix Turkish #16)
https://github.com/quanteda/stopwords/pull/16
devtools::install_github("quanteda/stopwords", ref = "fix-tr")
stopwords("tr", source = "stopwords-iso")
"Turkish Stopwords" now seems to be properly encoded.
Greetings..
When I create a PDF from a document I am writing in LyX, spaces appear between the letters of some words in the program listings I use to insert pieces of code.
In the LyX program listing settings I added the option "showstringspaces = false", but it has no effect.
Can you tell me how to remove these annoying spaces so that all the letters of each word in the code listings appear together?
I get ---> fmt. P r i n t
I expect ---> fmt.Print
Answering my own question: setting the option columns = fullflexible or columns = flexible in the listing settings solved it.
I'm having trouble reading this table into R:
http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt
I tried all of the following:
read.table("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt")
read.table("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt",skip=7,header=FALSE)
read.table("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt",skip=8,header=FALSE)
read.table("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt",skip=10,header=FALSE)
If I tell it that the separator is a tab, I get the wrong table:
d = read.table(file="http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt",header=FALSE,skip=7,sep="\t")
The only thing that seems to work is readLines, but then I don't know how to get a data.frame out of the lines.
d =readLines("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt")
Any suggestions? Thanks.
I agree that read.fwf will work, once you've worked out the widths.
But yeah, I just hate people who allow whitespace inside elements (e.g. "South Dakota"). One other thing you can do is edit the source text file, replacing every run of two or more spaces with a tab. That will leave the state names as-is but give you a workable delimiter.
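The replace-spaces-with-a-tab idea can also be done inside R on the result of readLines, without editing the file by hand. This is a minimal sketch: the sample lines and column names below are hypothetical stand-ins for the census file, and it assumes columns are separated by at least two spaces while names contain only single spaces.

```r
# Hypothetical lines in the style of the census table (header lines already skipped).
lines <- c("1   0   00  Northeast Region",
           "1   1   00  New England Division",
           "1   1   09  Connecticut",
           "2   4   46  South Dakota")

# Turn each run of two or more spaces into a single tab.
tabbed <- gsub(" {2,}", "\t", trimws(lines))

# Parse the tab-delimited text; keep codes as character so "09" keeps its leading zero.
d <- read.table(text = tabbed, sep = "\t", header = FALSE, quote = "",
                colClasses = "character",
                col.names = c("region", "division", "state_fips", "name"))
print(d)
```

With the real file, lines would come from readLines on the URL, dropping the header lines first; read.fwf remains the alternative once the column widths are known.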
For example, in http://homepages.cwi.nl/~paulv/papers/algorithmicstatistics.pdf at the bottom of page 5 and top of page 6, he uses a plus/equal symbol and a similar plus/lessthan symbol. I can't figure out how to make that symbol, and I'd like to quote him.
Any help?
Try $\stackrel{top}{bottom}$
You'd want something like this:
$X \stackrel{+}{=} Y$
This positions the plus sign above the equals sign. For example, the following code:
$K(x,y|z) \stackrel{+}{=} K(x|z) \stackrel{+}{<} I(x:y|z)$
produces the following output:
The Comprehensive LaTeX Symbol List (from here) is a great resource and a good starting point for questions like this. You could also contact the author; it's possible he did some LaTeX voodoo (math accents and such) to get it to work.
Best of luck.
PS: isn't \pm plus-minus, not plus-equals?
Here's the list of LaTeX Math Symbols. I don't see the two from the PDF you linked to. Do you know what they mean? You might be able to find an equivalent in the LaTeX list.
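For reference, the \stackrel suggestion above can be dropped into a small compilable document. The second line uses \overset from the amsmath package as an alternative; that is a substitution of mine, not something the paper's author is known to have used.

```latex
\documentclass{article}
\usepackage{amsmath} % provides \overset
\begin{document}

% Plain LaTeX: \stackrel{top}{bottom} stacks "top" over a relation symbol.
$K(x,y|z) \stackrel{+}{=} K(x|z) \stackrel{+}{<} I(x:y|z)$

% amsmath alternative: \overset{top}{bottom}.
$K(x,y|z) \overset{+}{=} K(x|z) \overset{+}{<} I(x:y|z)$

\end{document}
```

Both forms should render a small plus sign centered above the equals or less-than sign.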