R text mining: replace abbreviations, numbers and symbols in German

I would like to replace the abbreviations, numbers and symbols in my text.
As my text is in German and not in English, I am having problems converting it.
I tried:
review_text <- replace_abbreviation(review_text)
review_text <- replace_number(review_text)
review_text <- replace_symbol(review_text)
But this works only for English text, not for German.
What should I add so that the functions also work in German?

qdap and qdap-related packages are built solely for the English language. If you want to work with German text, umlauts and all, packages like quanteda and udpipe can handle it, but they do not handle abbreviations and symbols. The replace_symbol function is easy to adjust: inspect the function, copy the code to create your own function, and replace the English translations with German ones.
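For example, here is a minimal sketch of a German replace_symbol() built with plain gsub() calls; the symbol-to-word mapping below is only an assumption, so adapt it to your texts:
replace_symbol_de <- function(text) {
  # symbols and their German replacements (adjust as needed)
  map <- c("%" = " Prozent ", "&" = " und ", "@" = " at ",
           "#" = " Nummer ", "$" = " Dollar ")
  for (sym in names(map)) {
    text <- gsub(sym, map[[sym]], text, fixed = TRUE)
  }
  # collapse the extra whitespace introduced by the replacements
  gsub("\\s+", " ", trimws(text))
}
replace_symbol_de("50% & mehr")
#> [1] "50 Prozent und mehr"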
The replace_abbreviation function points to a replacement table in which each abbreviation is stored with its corresponding value. You need to create your own table for German.
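Such a table is just a two-column data frame. The sketch below mirrors the abv/rep layout of qdapDictionaries::abbreviations (check that dictionary's column names on your install), and the handful of German entries are merely examples to extend:
abbrev_de <- data.frame(
  abv = c("z.B.", "d.h.", "usw.", "bzw.", "ca."),
  rep = c("zum Beispiel", "das heißt", "und so weiter",
          "beziehungsweise", "circa"),
  stringsAsFactors = FALSE
)
# pass the custom table via the abbreviation argument
review_text <- replace_abbreviation(review_text, abbreviation = abbrev_de)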
The biggest issue is translating numbers to text. This is different for each language and is not really available online; searching for it tends to lead to converting numbers to text in Excel. But if you can read Python, you can translate a Python function to R (or use reticulate) to solve this. See this link to a Python library on GitHub which can do this for a few languages, including German, though I'm not sure whether it can be used in a text-mining context.
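If the library in question is something like num2words (an assumption on my part; it is one Python package on GitHub with German support), a reticulate sketch could look like this:
# requires the Python package num2words, e.g. installed via pip install num2words
library(reticulate)
num2words <- import("num2words")
replace_number_de <- function(text) {
  m <- gregexpr("[0-9]+", text)
  # spell out each run of digits in German
  regmatches(text, m) <- lapply(regmatches(text, m), function(nums) {
    vapply(nums, function(n) num2words$num2words(as.integer(n), lang = "de"),
           character(1))
  })
  text
}
replace_number_de("Ich habe 3 Katzen und 12 Hunde")
#> [1] "Ich habe drei Katzen und zwölf Hunde"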

Related

Standardise strings in cells of a dataframe in R that appear bold

When I load an Excel sheet into R, some strings in the cells of the dataframe appear to be bold and in a different format. For example, like so:
𝐇𝐚𝐢𝐝𝐚𝐫𝐚
And when I copy and paste this string into the R console, it appears like this:
Does anyone know how to fix this (revert these strings to the standard format) in R?
Want to avoid going back into Excel to fix it.
Thanks!
These are actually UTF-8 encoded letters from the Mathematical Alphanumeric Symbols block in Unicode, and they don't map nicely back onto 'standard' ASCII letters in R unless you have a pre-existing mapping function such as utf8_normalize from the utf8 package:
library(utf8)
utf8_normalize('𝐇𝐚𝐢𝐝𝐚𝐫𝐚', map_compat = TRUE)
#> [1] "Haidara"
However, I would strongly recommend that you fix your Excel file before importing to avoid having to do this; it works with the example you have given us here, but there may be unwelcome surprises in converting some of your other strings.

Is there a way to count specific words from a corpus of PDFs in R?

How can I count the number of specific words in a corpus of PDFs?
I tried using text_count, but I honestly didn't understand what it returned.
First you would want to OCR the PDFs if necessary, then convert them to raw text. pdftools can help with that and with converting to text, but I am not sure that it can handle multiple columns.
https://cran.r-project.org/web/packages/pdftools/pdftools.pdf
Here is another post:
Use R to convert PDF files to text files for text mining
As above, you could use xpdf (installed via homebrew) to convert the pdfs, as I believe it has some more functionality as far as multiple columns/text alignment goes.
After you have raw text, you can use a package like tm to obtain word counts in a corpus; a quick base-R sketch using pdftools is below. Let me know if this works or if you have further questions.
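Here is a minimal sketch, assuming the PDFs already contain a text layer (no OCR needed); the folder name and target words are placeholders:
library(pdftools)
files <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
words <- c("model", "data")  # the specific words to count
counts <- sapply(files, function(f) {
  txt  <- tolower(paste(pdf_text(f), collapse = " "))  # all pages as one string
  toks <- unlist(strsplit(txt, "[^[:alpha:]]+"))       # crude tokenisation
  vapply(words, function(w) sum(toks == w), integer(1))
})
counts  # one row per word, one column per PDF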

R displaying unicode/utf-8 encoding rather the special characters

I have a dataframe in R which has one column of UTF-8 encoded special characters and one integer column.
If I display both columns, or go into View(), I do not see the characters displayed correctly.
However, if I only select the column with the special characters, it works. Any ideas?
This is the output (if I paste it, the encoding disappears):
This looks like a bug in R. I've worked around a number of these in the corpus package. Try the following:
library(corpus)
print.corpus_frame(WW_mapping[1:3,])
Alternatively, do
library(corpus)
class(WW_mapping) <- c("corpus_frame", "data.frame")
WW_mapping[1:3,]
Adding the "corpus_frame" class to the data frame changes the print and format methods; otherwise, it does not change the behavior of the object.
If that doesn't work, please report your sessionInfo() along with dput(WW_mapping). (Actually, even if this fix does work, please report this information so that we can let the R core developers know about the problem.)

write.csv with encoding UTF8

I am using Windows 7 Home Premium and RStudio 0.99.896.
I have a csv file containing a column with text in several different languages, e.g. English, European, Korean, Simplified Chinese, Traditional Chinese, Greek, Japanese, etc.
I read it into R using
table <- read.csv("broker.csv", stringsAsFactors = FALSE, encoding = "UTF-8")
so that all the text is readable in its own language.
Most of the text is in a column named "content". Within the console, when I have a look with
toplines <- head(table$content, 10)
I can see all the languages as they are, but when I try to write to a csv file and open it in Excel, I can no longer see the languages. I typed in
write.csv(toplines, file = "toplines.csv", fileEncoding = "UTF-8")
Then I opened toplines.csv in Excel 2013 and it looked like this:
1 [<U+5916><U+5A92>:<U+4E2D><U+56FD><U+6C1.....
2 [<U+4E2D><U+56FD><U+6C11><U+822A><U+51C6.....
3 [<U+5916><U+5A92>:<U+4E2D><U+56FD><U+6C1.....
and so forth
Would anyone be able to tell me how I can write to a csv or Excel file so that the languages can be read as they are in Excel 2013? Thank you very much.
write_excel_csv() from the readr package, as suggested by @PhiSeu in the comments, has solved it for me.
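For reference, a minimal sketch of that fix; wrapping the vector in a data frame is my own addition so the column gets a header:
# write_excel_csv() writes UTF-8 with a byte order mark (BOM),
# which Excel uses to detect the encoding
library(readr)
write_excel_csv(data.frame(content = toplines), "toplines.csv")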

Using ggplot2 and special characters

I am reading in data from a web site, with text identifying each row. I simply copied and pasted the data into Excel, and the file is then read by R. One of these rows contains the name of a German city, "Würzburg", which includes a lower-case u with an umlaut. I have no problem seeing the special character on the web or in Excel. The problem is that when this word is passed to ggplot2, it is displayed in the plot as "WÃzburg", with a tilde over the capital A. RStudio shows both forms, depending on the area in which it is displayed. I would assume that ggplot2 uses a different encoding when interpreting the special characters.
Is there a way to tell ggplot how to read, interpret and display the special characters? I do not want to write specialized code just for this city, but to solve the problem in general. I am likely to encounter other characters as the data expands over time.
I encountered a similar error with ggplot2 when I used a hardcoded data.frame (i.e., I would write Großbritannien (Great Britain) and it would get encoded as gibberish).
My solution was to include
Sys.setlocale("LC_ALL", "German")
options(encoding = "UTF-8")
at the beginning of the script.
Then read the file in as follows:
library('data.table')
fread('path_to_file', ..., encoding = 'UTF-8')
My solution to this problem is switching to cairo for PDF plotting. All special characters are then shown properly by ggplot2. It is enough to put this line of code among the knitr settings:
knitr::opts_chunk$set(dev='cairo_pdf')
