I have just started working with Arabic text in R, as I plan to do text analysis and text mining on a Hadith corpus. I have been reading threads related to my question, but I still can't get the real basics working (sorry, absolute beginner).
So, I entered:
textarabic.v <- scan("data/arabic-text.txt", encoding="UTF-8", what= "character",sep="\n")
What comes out in textarabic.v is symbols rather than Arabic. Before this, I saved my text as UTF-8, as a thread suggested, but still nothing shows in Arabic.
I can type Arabic in R, but scan brings the text in as symbols.
I have also read other users' code for making Arabic text work and tried to apply it, but I don't know how or where to use it.
I installed the tm and NLP packages.
What do you suggest for me to do next?
Thanks in advance,
I had just posted an answer saying that you must be using R on Windows when I saw your comment that you're on OSX. On OSX the situation is not quite so dire. The problem is that your version of R is too old. If I remember correctly, anything prior to 3.2 does not handle Unicode correctly. Try installing 3.3.3 from https://cran.r-project.org/bin/macosx/ and, if necessary, re-install the packages you need. Then you should be fine. بالتوفيق!
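A minimal sketch for checking the setup after upgrading, assuming a UTF-8 text file. The temporary file and the sample strings are made up for illustration; in practice the path would be "data/arabic-text.txt". It uses readLines rather than scan, which is a substitution on my part, not something the answer above prescribes.

```r
# Which R version and character locale is this session using?
print(R.version.string)
print(Sys.getlocale("LC_CTYPE"))

# Hypothetical sample: two Arabic strings written to a temp file as UTF-8 bytes
sample_text <- c("\u0628\u0627\u0644\u062a\u0648\u0641\u064a\u0642",
                 "\u0645\u0631\u062d\u0628\u0627")
path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(sample_text), path, useBytes = TRUE)

# Read the lines back and mark them as UTF-8
textarabic.v <- readLines(path, encoding = "UTF-8")

# Inspect how R has marked the strings and how many characters it counts
print(Encoding(textarabic.v))
print(nchar(textarabic.v))
```

If Encoding() reports "UTF-8" but the console still shows symbols, the data is probably fine and the problem is in how the console or locale displays it.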
I've encountered a very bizarre problem with my R scripts. I had a bunch saved in a folder, and after reinstalling R (which was itself having some issues), my rather large scripts open without complaint in R but appear to contain no text (even though each file is clearly labeled as an R document in the folder and is 26 KB). Yet when I upload my scripts in a message on Slack, they appear perfectly fine.
Here is what my R script looks like presently:
And this is what it should look like:
I'm thinking it has something to do with the way R is reading the text in the script, but I couldn't find any helpful answers online. I would greatly appreciate any advice, as I don't want to have to recreate all of these using Slack of all things...
I figured it out with some tinkering and it was a rather simple fix that mirrored what I thought the issue was. Apparently my RStudio program was set to read the text in CP936 format. I set it to system default:
And voilà! My text is now back!
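For anyone who wants to check a script's encoding from the R console before changing editor settings, here is a rough sketch. It assumes the file was saved as UTF-8; the temporary file and its contents are invented for the example.

```r
# Hypothetical script file containing non-ASCII text, written as UTF-8 bytes
path <- tempfile(fileext = ".R")
writeLines(enc2utf8("x <- \"\u4f60\u597d\"  # sample line"), path, useBytes = TRUE)

# Read the raw lines and test whether every line is valid UTF-8
script_lines <- readLines(path, warn = FALSE)
is_utf8 <- validUTF8(script_lines)
print(all(is_utf8))

# If the bytes are valid UTF-8, mark them as such so R treats them as text
if (all(is_utf8)) {
  Encoding(script_lines) <- "UTF-8"
}
```

If every line is valid UTF-8 but the editor shows nothing, the editor is most likely decoding the file with a different encoding (CP936 in my case).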
I'm using Sublime Text (Build 4126) on a MacBook Pro (running Monterey 12.2.1) with the R-IDE package. I installed it as described at:
https://packagecontrol.io/packages/R-IDE
i.e., in R: install.packages("languageserver"); then in Sublime: install R-IDE, LSP, and LSP-R from Package Control.
Items under the R-IDE menu work as expected, syntax highlighting works, and the hover feature shows references to other occurrences of the item under the cursor. However, signature help doesn't work. This worked (most of the time) under R-Box (both written by randy3k). I've tried playing with settings files, but really don't know what I'm doing. I suspect this is probably a languageserver issue (https://github.com/REditorSupport/languageserver), but I'm just fumbling around.
Any suggestions on how to get signature help working for R in Sublime Text? Is anyone else not able to get it working?
I have a bunch of .JPGs with some text at the bottom, which consists mostly (but not exclusively) of numbers. I wish to use the tesseract package in R to 'read' the text in those .JPGs. Unfortunately, the base tesseract language proved too inaccurate to be worth using. I then tried using the magick package to adjust the pictures (crop, resize, convert, etc.), hoping to get a better reading from tesseract, but in my case this did not give satisfactory results.
I eventually managed to use the description on this link (https://towardsdatascience.com/simple-ocr-with-tesseract-a4341e4564b6) to create a new custom language in Tesseract 4.1.1 (as downloaded from https://github.com/tesseract-ocr/tesseract), which I named font_name.traineddata. The custom-made font_name.traineddata works perfectly on the Tesseract 4.1.1 console and shows significant improvement in results on the base language.
The question I have is: how do I get the font_name.traineddata file to be used by the ocr command in R? I have tried the simple solution of pasting the font_name.traineddata file into the appropriate tessdata folder of the tesseract package (the same folder that contains the standard English data file, eng.traineddata) and then running the following:
font_name <- tesseract ("font_name")
ocr("C:/1.jpg", engine = font_name)
This does not work and gives the error:
Error in tesseract_engine_internal(datapath, language, configs, opt_names, :
Unable to find training data for: font_name. Please consult manual for: ?tesseract_download
tesseract_download seems to be of no use, as it is a helper function to download training data from the official tessdata repository. I have also tried renaming the file to a three character name, with the same error.
Does anybody have any suggestions on how to make custom .traineddata files work with ocr in R?
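One thing that may be worth trying is pointing the engine at the folder explicitly with the datapath argument of tesseract(), instead of relying on the package's default tessdata location. The sketch below is untested against this particular file; the wrapper ocr_with_custom_language is a hypothetical helper, and whether a file trained with Tesseract 4.1.1 loads will depend on the engine version the R package was built against.

```r
# Hypothetical helper: run OCR with a custom .traineddata file.
# `datapath` is the folder that contains <language>.traineddata.
ocr_with_custom_language <- function(image, language, datapath) {
  traineddata <- file.path(datapath, paste0(language, ".traineddata"))
  if (!file.exists(traineddata)) {
    stop("No such file: ", traineddata)
  }
  # tesseract() accepts a `datapath` argument naming the training-data folder
  engine <- tesseract::tesseract(language = language, datapath = datapath)
  tesseract::ocr(image, engine = engine)
}

# Example call (folder name is an assumption; adjust to where the file lives):
# ocr_with_custom_language("C:/1.jpg", "font_name", "C:/tessdata")
```

Calling tesseract_info() shows which datapath and languages the package currently sees, which helps confirm whether the file was copied into the folder the engine actually reads.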
I want to remove those red lines in RStudio.
I upgraded to the latest version, according to someone's suggestion.
But it is not working.
The problem occurs when I write Korean words.
The default encoding is UTF-8.
I found a similar problem here, but it didn't work for me.
https://community.rstudio.com/t/why-and-where-is-a-an-unexpected-token-in-r-and-how-should-i-deal-with-it/26496/4
df$번호
df$이름
df$성별
This is a bug -- unfortunately, the RStudio diagnostics system does not correctly handle multibyte characters in R Markdown documents on Windows. This will hopefully be fixed in the next release (v1.3).
I have issues with some twitteR functions when the tweet language is Arabic. For example, if I use twListToDF() on Arabic tweets I get things like " ", and I get the same with getTrends(). But when I call searchTwitter() I get normal Arabic characters.
Kindly note that I use release 1.1.8.
Am I missing something in the configuration, or is this an issue with the package?
Try running the following command in R before using twListToDF():
Sys.setlocale("LC_CTYPE", "Arabic")
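Locale names are platform-specific, so "Arabic" may not be accepted everywhere (it is a Windows-style name). A small sketch that checks whether the call succeeded and restores the previous locale afterwards; the variable names old_ctype and new_ctype are just for illustration:

```r
# Remember the current character-type locale so it can be restored
old_ctype <- Sys.getlocale("LC_CTYPE")

# Sys.setlocale() returns "" (with a warning) if the locale is unavailable
new_ctype <- suppressWarnings(Sys.setlocale("LC_CTYPE", "Arabic"))
if (identical(new_ctype, "")) {
  message("Locale 'Arabic' is not available on this system")
}

# ... call twListToDF() here ...

# Restore the original locale
invisible(Sys.setlocale("LC_CTYPE", old_ctype))
```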
I found your question while searching the internet regarding this same issue, and I found this package "arabicStemR". This might help.