In R, data.frame() or write.csv2() functions change the encoding

I have some text in Persian:
tabs <- "سرگرمی"
and I need to have it in a data frame.
When I try:
final <- data.frame(tabs)
I get the text back as <U+XXXX> escape sequences instead of the Persian characters.
Exporting the text to .csv using write.csv2() gives me the same problem.
Any idea how to keep the text in its original encoding?

We can set the locale with Sys.setlocale
Sys.setlocale("LC_ALL","Persian")
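A minimal sketch of how this could look end to end ("Persian" is a Windows locale name, and the fileEncoding argument to write.csv2 is an addition here, not something from the question):
Sys.setlocale("LC_ALL", "Persian")
tabs <- "سرگرمی"
final <- data.frame(tabs, stringsAsFactors = FALSE)
# an explicit file encoding helps keep the characters intact on export
write.csv2(final, "tabs.csv", fileEncoding = "UTF-8")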

Related

Showing Unicode characters in R

Using the read_excel function, I read an Excel sheet that has a column containing data in both English and Arabic.
English is shown normally in R, but the Arabic text is shown like this: <U+0627><U+0644><U+0639><U+0645><U+0644>
dataset <- read_excel("Dataset_Draft v1.xlsx",skip = 1 )
dataset %>% select(description)
I tried Sys.setlocale("LC_ALL", "en_US.UTF-8") but with no success.
I want to show the Arabic text normally, and I want to filter on this column by Arabic values.
Thank you.
You could try the read.xlsx() function from the xlsx library.
Here you can specify an encoding.
data <- xlsx::read.xlsx("file.xlsx", sheetIndex = 1, encoding = "UTF-8")  # sheetIndex (or sheetName) is required
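If the text then displays correctly, filtering on the Arabic column, as the question asks, might look like this (a sketch that assumes the column is called description and reuses the word from the <U+...> output above):
library(dplyr)
data %>% filter(description == "العمل")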

Removing tags after reading PDF in R

I am reading PDF in Hebrew into R using textreadr::read_document, and getting tags which I can't remove, such as <U+202B>. Looking at the data in the console, the tags are absent; if I try to remove them using gsub or stringr::str_replace, nothing happens. However, they are clearly there (see image), and worse - if I export to Excel, they are exported as part of the data. What can I do?
Could you try something like this? This is code I used to replace non-ASCII characters.
library(textclean)
# CA_videos_df is the data frame; title is the column with the problem characters
Encoding(CA_videos_df$title) <- "latin1"
# replace entries containing non-ASCII characters with NA and drop anything that cannot be converted
name <- replace_non_ascii(CA_videos_df$title, replacement = NA, remove.nonconverted = TRUE)
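If the goal is only to strip the directional control marks such as <U+202B> while keeping the Hebrew text itself, a hedged alternative is a plain gsub on those specific code points (essentially the question's gsub attempt, but with the characters written as escapes):
# remove bidirectional control characters (U+202A..U+202E, U+200E, U+200F) only
clean_title <- gsub("[\u202A\u202B\u202C\u202D\u202E\u200E\u200F]", "", CA_videos_df$title)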

Entering and viewing Cyrillic strings in R

How to handle Cyrillic strings in R?
Sys.setlocale("LC_ALL","Polish")
dataset <- data.frame( ProductName = c('ąęćśżźół','тест') )
#Encoding(dataset) <- "UTF-8" #this line does not change anything
View(dataset)
The code above shows the Cyrillic entry as a <U+number> sequence in the viewer.
But I would like to see what I typed, тест, instead of the <U+number> sequence. Is there any way to do that?
This works for me, and I see the Cyrillic тест in my data frame.
I think you should check what your locale is (with sessionInfo()) and whether it supports UTF-8.
Also check this link and try changing the encoding of your column:
Encoding(dataset$ProductName) <- "UTF-8"  # ProductName is the column from the question
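Putting the suggestions together, a minimal sketch (assuming that marking the column as UTF-8 is enough on your system, without changing the locale):
dataset <- data.frame(ProductName = c('ąęćśżźół', 'тест'), stringsAsFactors = FALSE)
Encoding(dataset$ProductName) <- "UTF-8"  # mark the column as UTF-8 before viewing
View(dataset)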

Get R to keep UTF-8 Codepoint representation

This question is related to the utf8 package for R. I have a weird problem in which I want emojis in a data set I'm working with to stay in code point representation (i.e. as '\U0001f602'). I want to use the FindReplace function from the DataCombine package to turn UTF-8 encodings into prose descriptions of emojis in a dataset of YouTube comments (using a dictionary I made available here). The only issue is that when I save the output as an object in R, the nice code point representation generated by utf8_encode, which I can match against my dictionary, disappears...
First I have to adjust the dictionary a bit:
emojis$YouTube <- tolower(emojis$Codepoint)
emojis$YouTube <- gsub("u\\+","\\\\U000", emojis$YouTube)
Convert to character so as to be able to use utf8_encode:
emojimovie$test <- as.character(emojimovie$textOriginal)
utf8_encode(emojimovie$test)
This works great: it gives output like \U0001f595 that can be matched with dictionary entries when it prints in the console.
BUT, when I do this:
emojimovie$text2 <- utf8_encode(emojimovie$test)
and then:
emoemo <- FindReplace(data = emojimovie, Var = "text2", replaceData = emojis, from = "YouTube", to = "Name", exact = TRUE)
I get all NAs. When I look at the output in $text2 with View, I don't see the \U0001f595 codes, I see actual emojis. I think this is why the FindReplace function isn't working -- when it gets saved to an object it just gets represented as emojis again and the function can't find any matches. When I try gsub("\U0001f602", "lolface", emojimovie$text2), however, I can actually match and replace things, but I don't want to do this for all ~2,000 or so emojis... I've tried reading as much as I can about UTF-8, but I can't understand why this is happening. I'm stumped! :P
It looks like in the above, you are trying to convert the UTF-8 emoji to a text version. I would recommend going the other direction. Something like
emojis <- read.csv('Emoji Dictionary 2.1.csv', stringsAsFactors = FALSE)
# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
escapes <- gsub("[[:space:]]*\\U\\+", "\\\\U", emojis$Codepoint)
# convert to UTF-8 using the R parser
codes <- sapply(parse(text = paste0("'", escapes, "'"),
                      keep.source = FALSE), eval)
This will convert the text representations like U+1F469 to UTF-8 strings. Then, you can search for these strings in the original data.
Note: If you are using Windows, make sure you have the latest release of R; in older versions, the parser gives you the wrong result for strings like "\U1F469".
The utf8::utf8_encode function should really only be used if you already have UTF-8 text and are trying to print it to the screen.
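A hedged sketch of the replacement step the answer describes (it assumes emojis$Name holds the prose descriptions and emojimovie$test the raw comment text, as in the question):
# replace each emoji string with its prose name, one dictionary entry at a time
for (i in seq_along(codes)) {
  emojimovie$test <- gsub(codes[i], paste0(" ", emojis$Name[i], " "),
                          emojimovie$test, fixed = TRUE)
}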

How to split text file into multiple .txt files or data frames based on conditions in R?

I have an XML file in .txt format.
I want to split this file in such a way that I get only the text between <TEXT> and </TEXT>, and save each piece as a separate text file or data frame. Can anyone please help me with how to do this in R?
I have tried using the grep function to extract the text; however, I am not able to achieve my objective. I am very new to text mining, and it would be really great if anyone could help me with this.
test_2=grep("[^<TEXT>] [$</TEXT>]",test,ignore.case=T,value=T)
First I did
install.packages("XML")
library(XML)
Now this is a little tricky because your document (as shown above) doesn't have a root. If you wrap it in
<mydoc>
...
</mydoc>
or something like that, you could use this:
doc <- xmlRoot(xmlTreeParse("text.xml"))
df <- vector(length = length(doc))
for (i in 1:length(doc)) {
  # pull the text content out of each child node and store it
  text_node <- doc[[i]]$children$text
  text <- xmlToList(text_node)
  df[i] <- text
}
Now suppose you can't add the artificial root I did above. You can still parse it as HTML, which is more tolerant of invalid documents. I also use XPath in this example (which you could in the one above too):
doc <- htmlTreeParse("text_noroot.xml")
content <- doc$children$html
textnodes <- getNodeSet(content, "//text")
df <- vector(length = length(textnodes))
for (i in 1:length(textnodes)) {
  text_node <- textnodes[[i]]$children$text
  text <- xmlToList(text_node)
  df[i] <- text
}
Try XPath with the XML package:
library(XML)
doc <- xmlParse("test.txt")
sapply(xpathApply(doc, "//*/TEXT"), xmlValue)
Then you will get a character vector and can do what you want with it.
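To then save each extracted chunk as a separate .txt file, as the question asks, a hedged follow-up could be:
texts <- sapply(xpathApply(doc, "//*/TEXT"), xmlValue)
# write one file per <TEXT> element
for (i in seq_along(texts)) {
  writeLines(texts[i], paste0("text_", i, ".txt"))
}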
