Reading Tamil corpus in R

I have built a basic word prediction product using R as part of an online course project. I want to extend it to predict the next word from Tamil phrases. I used a sample of the Tamil language corpora from the HC Corpora website, read it into R and created a tm corpus.
testData <- "திருவண்ணாமலை, கொல்லிமலை, சதுரகிரி என அவன் சித்தர்களை பல
இடங்களில், மலைகளில், குகைகளில், இன்னும் பல ரகசிய இடங்களில்
அவன் சித்தர்களை சந்தித்து பல நம்பமுடியாத சக்திகளைப்
பெற்றுவிட்டான் என்று சொல்லிக் கொள்கிறார்கள்"
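library(tm)     # VCorpus, VectorSource, DocumentTermMatrix
library(RWeka)  # NGramTokenizer, Weka_control
getUnigrams <- function(x) {NGramTokenizer(x,
Weka_control(min=1, max=1))}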
unigrams <- DocumentTermMatrix(VCorpus(VectorSource(testData)),
control=list(tokenize=getUnigrams))
unigramsList <- data.frame(slam::col_sums(unigrams))
head(unigramsList, 3)
> slam..col_sums.unigrams.
அவன் 2
இடங்களில் 2
இன்னும் 1
The actual Tamil words are the row names of this data frame and are displayed properly on the screen. However, when I add them as a column next to their counts, the resulting data frame does not display the Tamil words correctly in unigramsList$word1. It displays them as the Unicode escapes of the underlying Tamil words.
unigramsList$word1 <- rownames(unigramsList) ## Encoding issues arise from here!!!
head(unigramsList, 3)
slam..col_sums.unigrams.
அவன் 2
இடங்களில் 2
இன்னும் 1
word1
அவன் <U+0B85><U+0BB5><U+0BA9><U+0BCD>
இடங்களில் <U+0B87><U+0B9F><U+0B99><U+0BCD><U+0B95><U+0BB3><U+0BBF><U+0BB2><U+0BCD>
இன்னும் <U+0B87><U+0BA9><U+0BCD><U+0BA9><U+0BC1><U+0BAE><U+0BCD>
>
I tried to continue with these unicode characters and built n-grams for 2, 3 and 4-grams and used it for my prediction. But all subsequent operations on this column are displayed as raw unicode only. I want to be able to view and predict them in their native Tamil characters.
My session information is as below:
> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RWeka_0.4-29 tm_0.6-2 NLP_0.1-9 stringi_1.0-1 stringr_1.0.0
loaded via a namespace (and not attached):
[1] magrittr_1.5 parallel_3.2.5 tools_3.2.5 slam_0.1-37
[5] grid_3.2.5 rJava_0.9-8 RWekajars_3.9.0-1

I managed to hack a solution to the above and thought of posting it for anyone interested in this topic.
a) Instead of saving the n-grams as csv files on Windows, I saved them in R binary format (using the save() and load() functions). I had earlier stored the generated n-grams as csv and read them back with read.csv() with the fileEncoding option set to UTF-8, but that did not help, even after deploying on Shiny.
b) I deployed and tested on Shiny apps, which runs on a Linux platform and was therefore able to display the Tamil characters correctly. Testing locally on Windows was not effective, as the characters were displayed as raw Unicode escapes (the <U+0B85>-style codes shown in the question).
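For anyone who wants to try step a), here is a minimal sketch of the save()/load() round trip. The data frame and its column names are made up for illustration, and the Tamil word is written with \u escapes so the script itself stays plain ASCII. As far as I can tell, the point is that the binary format keeps the strings and their UTF-8 marking as they are, without going through the Windows code page the way a csv round trip can.

```r
# Hypothetical one-row n-gram table; "\u0b85\u0bb5\u0ba9\u0bcd" is the first
# word from the question, written as escapes so it is marked as UTF-8.
ngrams <- data.frame(
  word1 = "\u0b85\u0bb5\u0ba9\u0bcd",
  count = 2L,
  stringsAsFactors = FALSE
)

# Save in R's binary format instead of csv.
rdata_path <- tempfile(fileext = ".RData")
save(ngrams, file = rdata_path)

# Load into a separate environment so the original stays available to compare.
restored <- new.env()
load(rdata_path, envir = restored)

same_word <- identical(restored$ngrams$word1, ngrams$word1)
word_encoding <- Encoding(restored$ngrams$word1)
```

If same_word is TRUE and word_encoding is "UTF-8", the data survived intact, and any remaining <U+....> output is only a console display issue.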
Thanks to Marek Gagolewski, author of stringi, for suggestions regarding shinyio, which helped me deploy and test on shiny's Linux platform.
You can check out the product using the below link if you are interested: https://periasamyr.shinyapps.io/predictwordml/
Regards
Peri

Related

Displaying UTF-8 encoded characters in R

I am using the RODBC package to read data from SQL server. R is reading the Chinese characters as "?????"
I have passed the parameter DBMSencoding = "UTF-8" to the odbcConnect function.
Following is the sample code I am using:
Connection <- odbcConnect("abc", uid = "123", pwd = "123",
DBMSencoding = "UTF-8", readOnlyOptimize=T)
Var1 <- sqlQuery(Connection, query, errors = TRUE, stringsAsFactors=F)
Maybe I didn't pass the arguments the way I am supposed to?
sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RODBC_1.3-12
loaded via a namespace (and not attached):
[1] tools_3.2.3
odbcGetInfo(mainConnection)
DBMS_Name DBMS_Ver Driver_ODBC_Ver Data_Source_Name Driver_Name
"Microsoft SQL Server" "10.50.4000" "03.52" "SQLSRV32.DLL"
Driver_Ver ODBC_Ver Server_Name
"06.01.7601" "03.80.0000"
Check the database's character encoding:
select userenv('language') from dual;
SIMPLIFIED CHINESE_CHINA.AL32UTF8
Change your Environment Variable NLS_LANG before connecting to the database:
Sys.setenv(NLS_LANG="SIMPLIFIED CHINESE_CHINA.AL32UTF8")
Connection <- odbcConnect("abc", uid = "123", pwd = "123", DBMSencoding = "UTF-8", readOnlyOptimize=T)
R on Windows has a lot of problems displaying characters outside of ASCII, even though it is often faithfully representing them internally. There is a lot of information in this answer about why this is the case, and some simple diagnostics in this answer. First try plotting, like:
# first, make sure plotting Chinese works in general
# (i.e., you have an appropriate font)
hanzi <- "漢字"
plot(1, 1, type="n")
text(1, 1, hanzi)
If that works, replace the hanzi <- "漢字" line with your sql query line to get some Chinese text from your database into a string variable, and try plotting that. If it shows up on the plot, then the characters are being read fine and represented internally fine, and the problem is just displaying them in the console. If plotting worked for the "漢字" string variable but doesn't work for your SQL-extracted string, then at least you know that the problem is actually with the SQL part and not just with display in the console.
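If plotting is inconvenient, the internal representation can also be inspected directly, without relying on how the console draws the characters. This is only a sketch: describe_string is a made-up helper name, and the test string is written with \u escapes. For a string that came from the database, the encoding mark may be "unknown", and utf8ToInt() typically gives NA when the bytes are not valid UTF-8, which would point at the SQL side rather than the display.

```r
# Hypothetical helper: summarise how R holds a single string internally.
describe_string <- function(s) {
  cp <- utf8ToInt(s)  # Unicode code points, assuming the bytes are UTF-8
  list(
    encoding    = Encoding(s),               # "UTF-8", "latin1" or "unknown"
    n_chars     = nchar(s, type = "chars"),
    n_bytes     = nchar(s, type = "bytes"),
    code_points = cp,
    labels      = sprintf("U+%04X", cp),     # readable form of the code points
    bytes       = charToRaw(s)
  )
}

hanzi <- "\u6f22\u5b57"  # the same two characters as in the plotting test
info <- describe_string(hanzi)
```

Here info$labels should read U+6F22 and U+5B57 for a correctly stored string, regardless of what print(hanzi) shows in the console.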
I got the same problem and successfully solved it. It was quite simple. Go to Control Panel --> Region and Language --> Administrative --> Change system locale --> Chinese.

Using special characters in Rstudio

I am working with some special characters in RStudio. It converts them into plain letters.
print("Safarzyńska2013")
[1] "Safarzynska2013"
x <- "Māori"
x
[1] "Maori"
Is there any way to read in the exact original characters?
Following info might be helpful:
Rstudio default encoding is UTF-8
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.1.1
This is not an exclusively RStudio problem.
Typing print("Safarzyńska2013") on the console of RGui also converts them to plain letters. Running this code from an UTF-8 encoded Script in RGui returns [1] "Safarzy?ska2013".
I don't think that it is a good idea to type such special chars on the console. x <- "SomeString"; Encoding(x) returns "unknown" and that is probably the problem: R has no idea what encoding you are using on the console and probably has no chance to get your original encoding.
I put "Safarzyńska2013\nMāori\n" in a text file encoded with UTF-8. Then the following works fine:
tbl <- read.table('c:/test1.txt', encoding = 'UTF-8', stringsAsFactors = FALSE)
tbl[1,1]
tbl[2,1]
Encoding(tbl[1,1]) # returns "UTF-8"
If you really want to use the console, you probably will have to mask the special chars. In ?Encoding we find the following example to create a word with special chars:
x <- "fa\xE7ile"
Encoding(x)
Actually I don't know at the moment how to get these codes for your special chars and ?Encoding has also no hints...
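One possible way to get such codes, as a sketch: once a string has been read correctly (for example from the UTF-8 file above), utf8ToInt() gives the code points and sprintf() can format them as \u escapes. The function name to_unicode_escapes is made up here, and it only handles a single string; ASCII characters are passed through unchanged, so quotes and backslashes would still need escaping by hand.

```r
# Hypothetical helper: turn the non-ASCII characters of one string into
# \uXXXX (or \UXXXXXXXX) escapes that can be typed into any console.
to_unicode_escapes <- function(s) {
  cp <- utf8ToInt(enc2utf8(s))
  pieces <- ifelse(
    cp < 128L,
    intToUtf8(cp, multiple = TRUE),           # keep plain ASCII as it is
    ifelse(cp <= 0xFFFF,
           sprintf("\\u%04x", cp),            # 4-digit escape
           sprintf("\\U%08x", cp))            # 8-digit escape for higher code points
  )
  paste(pieces, collapse = "")
}

# The inputs are written with escapes here only to keep this snippet ASCII.
maori_code  <- to_unicode_escapes("M\u0101ori")
author_code <- to_unicode_escapes("Safarzy\u0144ska2013")
```

cat(maori_code) then shows the text to paste between quotes, e.g. x <- "M\u0101ori".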
Go to the File menu of RStudio, then click on Save with Encoding..., choose UTF-8, tick Set as default encoding for source files and save.
Hope this helps

how to display and input chinese (and other non-ASCII) character in r console?

My system: win7 ultimate 64 english version + r-3.1(64) .
Here is my sessionInfo.
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
LC_MONETARY=English_United States.1252 LC_NUMERIC=C
LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
1. Can't input Chinese characters into the R console
When I input a Chinese character in the R console, it turns into a garbled character.
2. Can't display Chinese characters in the R console
When I read data in the R console, the Chinese characters turn into garbled characters.
You can download the data, and test it with
read.table("r1.csv",sep=",")
Download Data
Please see the graph to download the data if you don't know how to get the data from my web.
How can I setup my pc to properly display and input chinese characters in r console?
I have updated the chinese language pack ,and enabled it,but problem remains still.
It is probably not very well documented, but you want to use Sys.setlocale in order to use Chinese. The method applies to many other languages as well. The solution is not obvious, as the official documentation of Sys.setlocale doesn't specifically mention it as a way to solve display issues.
> print('ÊÔÊÔ') #试试, meaning let's give it a shot in Chinese
[1] "ÊÔÊÔ" #won't show up correctly
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> Sys.setlocale(category = "LC_ALL", locale = "chs") #cht for traditional Chinese, etc.
[1] "LC_COLLATE=Chinese_People's Republic of China.936;LC_CTYPE=Chinese_People's Republic of China.936;LC_MONETARY=Chinese_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese_People's Republic of China.936"
> print('试试')
[1] "试试"
> read.table("c:/CHS.txt",sep=" ") #Chinese: the 1st record/observation
V1 V2 V3 V4 V5 V6
1 122 第一 122 条 122 记录
If you just want to change the display encoding, without changing other aspects of locales, use LC_CTYPE instead of LC_ALL:
> Sys.setlocale(category = "LC_CTYPE", locale = "chs")
[1] "Chinese_People's Republic of China.936"
> print('试试')
[1] "试试"
Now, of course this only applies to the official R console. If you use other IDE's, such as the very popular RStudio, you don't need to do this at all to be able to type and display Chinese, even if you didn't have the Chinese locale loaded.
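If the locale change is only needed for one operation, it may be nicer to switch LC_CTYPE temporarily and put it back afterwards. A rough sketch, with a made-up helper name with_ctype; the demo uses the "C" locale only because it exists on every platform, whereas on Windows you would pass "chs" as above.

```r
# Hypothetical helper: evaluate an expression under another LC_CTYPE and
# restore the previous setting afterwards, even if the expression fails.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  applied <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (!nzchar(applied)) stop("locale not available on this system: ", locale)
  force(expr)  # expr is lazy, so it is evaluated after the switch
}

before <- Sys.getlocale("LC_CTYPE")
inside <- with_ctype("C", Sys.getlocale("LC_CTYPE"))
after  <- Sys.getlocale("LC_CTYPE")
```

On Windows this could be used as with_ctype("chs", read.table("c:/CHS.txt", sep=" ")), assuming the file really is in the code page of that locale.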
Migrate some useful stuff from the following comments:
If the data still fails to show up correctly, then we should also look into the file encoding. If the file is UTF-8 encoded, either data <- read.table("your_file", sep=',', fileEncoding="UTF-8-BOM", header=TRUE) or fileEncoding="UTF-8" will do, depending on which encoding it really has.
But you may want to stay away from UTF-8 with BOM, as it is not recommended: What's different between UTF-8 and UTF-8 without BOM?
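To find out which of the two a file really has, one option is to look at its first three bytes: a UTF-8 byte order mark is EF BB BF. The sketch below uses made-up helper names (has_utf8_bom, guess_file_encoding) and assumes the file is UTF-8 in the first place; it says nothing about files in other encodings.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 byte order mark.
has_utf8_bom <- function(path) {
  identical(readBin(path, what = "raw", n = 3L), as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Pick the fileEncoding value for read.table(), assuming a UTF-8 file.
guess_file_encoding <- function(path) {
  if (has_utf8_bom(path)) "UTF-8-BOM" else "UTF-8"
}

# Two small test files, written byte by byte so the locale plays no role.
payload <- charToRaw("id,word\n1,abc\n")
with_bom <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), payload), with_bom)
without_bom <- tempfile(fileext = ".csv")
writeBin(payload, without_bom)

enc_with_bom    <- guess_file_encoding(with_bom)
enc_without_bom <- guess_file_encoding(without_bom)
```

The result can then be passed on, e.g. read.table(path, sep=',', header=TRUE, fileEncoding=guess_file_encoding(path)).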

R: How to read in Large Dataset (>35 MM rows) with R on Windows piece by piece?

How do you read in / manipulate datasets in R which exceed the allotted memory limit?
EDIT:
Great help so far, thanks. Let me add an additional constraint. The server is enterprise owned and I do not have administrative access. Is there a way to read partial files using read.table or something similar (e.g., by designating nrows to only read 100,000 rows at a time)? Need a workaround which can run with current environment so cannot use fread, bigmemory, etc.
My target dataset contains approx 32 million rows with 30 columns, divided into 12 approximately equal files (some readable, some not).
The files are "|" delimited and stored on a remote server in 12 individual files. About half of the files can be read using R, the other half exceed the allowable limit.
I'm using a simple read and rbind script:
path<-"filepath/mydata/contains 12 files.txt/"
fulldf<-data.frame()
files<-dir(path)  # list the files once instead of on every iteration
for(i in seq_along(files)){
file1<-read.table(file=paste0(path,files[i]), sep="|", fill=T, quote="\"")
fulldf<-rbind(fulldf,file1)
}
I'd primarily like to be able to subset the data and write it to a .csv (e.g., read the data piece by piece, subset by location then rbind), but some of the files are simply too big to even read in.
Is there a way to read in part of a large file piece by piece, i.e., split an unreadable file into readable pieces?
System:
Microsoft Windows Server 2003 R2
Enterprise Edition
Service Pack 2
Computer:
Intel(R)Xeon(TM) MP CPU
3.66GHz
3.67 GHz, 12.0 GB RAM
Physical Address Extension
> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.7.1
loaded via a namespace (and not attached):
[1] tools_2.12.1
I think iotools with the chunk.reader, read.chunk and chunk.apply may be what you are looking for.
If you cannot install packages then I gather read.table with the nrows and skip arguments should do? You can use the colClasses argument to ensure that they all have the same column class.
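A rough sketch of the nrows/skip idea in base R only. The function name read_filtered_chunks, its arguments and the demo columns (id, loc, val) are made up, and it assumes each file has a header line. Each chunk is subset before it is kept, so only the rows of interest accumulate in memory. Note that skip makes read.table re-scan the file from the top for every chunk, so this gets slower towards the end of a big file.

```r
# Hypothetical helper: read a delimited file chunk_rows lines at a time,
# apply keep() to every chunk and rbind only what keep() returns.
read_filtered_chunks <- function(path, chunk_rows, keep, sep = "|",
                                 col_classes = NA) {
  # Read one row with the header just to learn the column names.
  col_names <- names(read.table(path, sep = sep, header = TRUE, nrows = 1,
                                quote = "\"", stringsAsFactors = FALSE))
  pieces <- list()
  skip <- 1L  # skip the header line
  repeat {
    chunk <- tryCatch(
      read.table(path, sep = sep, header = FALSE, skip = skip,
                 nrows = chunk_rows, col.names = col_names,
                 colClasses = col_classes, quote = "\"",
                 stringsAsFactors = FALSE),
      error = function(e) NULL  # read.table errors when no lines are left
    )
    if (is.null(chunk) || nrow(chunk) == 0L) break
    pieces[[length(pieces) + 1L]] <- keep(chunk)
    if (nrow(chunk) < chunk_rows) break  # short chunk means end of file
    skip <- skip + nrow(chunk)
  }
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Small made-up file to try it on: 10 rows, read 3 rows at a time.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("id|loc|val",
             paste(1:10, rep(c("A", "B"), 5), (1:10) * 1.5, sep = "|")),
           demo_path)
subset_a <- read_filtered_chunks(demo_path, chunk_rows = 3,
                                 keep = function(d) d[d$loc == "A", ])
```

For the real files, chunk_rows = 100000 and an explicit col_classes vector would probably be the better choice, since letting every chunk guess its own column types can produce chunks that do not rbind cleanly. The result can then be written out with write.csv().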

how to display đ, ư, ơ, ă in R graphs

I am trying to put Vietnamese labeling in R graphs. I use RStudio and save my code using UTF-8 encoding. It handles the Vietnamese characters I put in the code well, I mean everything shows up in the code properly. However, in the graphs I make, while many characters display OK, several important ones do not show up properly, including
đ - which displays incorrectly as d
ư - which displays incorrectly as u
ơ - which displays incorrectly as o
ă - which displays incorrectly as a
Unfortunately this makes my graphs look unprofessional and untrustworthy.
I would really appreciate it if someone can help me figure this out.
Thanks much!
Trang
@DWin: I am on Windows 7, and here is my sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] MASS_7.3-17 MatchIt_2.4-20 tools_2.15.0
@krlmlr: Here's my code for a simple graph:
knorelative.count <- matrix(nrow=5,ncol=1)
knorelative.count[,1] <- c(1579,638,215,100,120)
par(mar=c(2,4,4,2))
barplot(prop.table(knorelative.count),beside=TRUE,
yaxt="n",ylim=c(0,.6),
legend=c("không ai biết",
"không biết nhiều hơn biết",
"nửa biết, nửa không biết",
"biết nhiều hơn không biết",
"tất cả đều biết"),
main="Người khác trong gia đình, họ hàng biết hay không")
axis(2,at=seq(0,.6,.1),labels=paste(100*seq(0,.6,.1),"%",sep=""),las=1)
When I run this, the đ in the main title and the two ơ's and the đ in the legend turn into d and o.
You are often safer specifying unicode characters by their hex codes:
plot(1:4,rep(1,4),pch=c("\u0111","\u01B0","\u01A1","\u0103"),cex=4)
For any Vietnamese folks out there who run into the same problem, here's an example for the fix using hex codes suggested by James:
print("trường")
[1] "truờng"
print("tr\u01B0ờng")
[1] "trường"
While I can type Vietnamese, in this example the word trường, into my R console ok, any kind of output (e.g. print, graph) fails to display the character ư. Replacing ư with the hex code fixes the output.
(Note: I used the function paste earlier, but then edited this based on James's suggestion to stick the hex code in the character string.)
I am so thankful to learn this way. I will do this for the report I am currently writing.
Trang
Save the plotting code in a separate file encoded as UTF-8, for example R/graph_display.R.
After that, run this code:
eval(parse("R/graph_display.R", encoding = "UTF-8"))

Resources