Mixed kana and kanji romanization to romaji in R

I have a large character vector of Japanese words (mixed kanji and kana) that needs to be romanized (converted to romaji).
However, with the available functions (zipangu::str_conv_romanhira() and audubon::strj_romanize()), I am not getting the desired results.
For example, for 北海道 (Hokkaido), zipangu::str_conv_romanhira() converts it to Chinese pinyin and audubon::strj_romanize() converts only the kana characters.
How can I convert such mixed kana and kanji text to romaji?
library(zipangu)
library(stringi)
library(audubon)
str_conv_romanhira("北海道", "roman")
#> [1] "běi hǎi dào"
stri_trans_general("北海道", "Any-Latin")
#> [1] "běi hǎi dào"
strj_romanize("北海道")
#> [1] ""

There aren't any R packages that I can see that provide transliteration of Japanese kanji to romaji (at least none currently on CRAN). It's easy enough, however, to use the Python module pykakasi via R to achieve this:
library(reticulate)
py_install("pykakasi") # Only need to install once
# Make module available in R
pykakasi <- import("pykakasi")
# Alias the convert function for convenience
convert <- pykakasi$kakasi()$convert
convert("北海道")
[[1]]
[[1]]$orig
[1] "北海道"
[[1]]$hira
[1] "ほっかいどう"
[[1]]$kana
[1] "ホッカイドウ"
[[1]]$hepburn
[1] "hokkaidou"
[[1]]$kunrei
[1] "hokkaidou"
[[1]]$passport
[1] "hokkaidou"
# Function to extract romaji and collapse
to_romaji <- function(txt) {
  paste(sapply(convert(txt), `[[`, "hepburn"), collapse = " ")
}
# Test on some longer text
lapply(c("北海道", "石の上にも三年", "豚に真珠"), to_romaji)
[[1]]
[1] "hokkaidou"
[[2]]
[1] "ishi no ueni mo sannen"
[[3]]
[1] "buta ni shinju"

Related

R Spanish Term Frequency Matrix with tm and quanteda: Spanish Characters

I am trying to learn how to do some text analysis with Twitter data. I am running into an issue when creating a Term Frequency Matrix.
I create the Corpus out of Spanish text (with special characters) with no issues.
However, when I create the Term Frequency Matrix (with either the quanteda or tm libraries), the Spanish characters do not display as expected (instead of seeing canción, I see canciÃ³n).
Any suggestions on how I can get the Term Frequency Matrix to store the text with the correct characters?
Thank you for any help.
As a note: I prefer using the quanteda library, since ultimately I will be creating a wordcloud, and I think I better understand this library's approach. I am also using a Windows machine.
I have tried Encoding(tw2) <- "UTF-8" with no luck.
library(dplyr)
library(tm)
library(quanteda)
#' Creating a character with special Spanish characters:
tw2 <- "RT #None: Enmascarados, si masduro chingán a tarek. Si quieres ahora, la aguantas canción . https://t."
# Cleaning the tweet: removing special punctuation, numbers, http links, and extra spaces
clean_tw2 <- tolower(tw2)
clean_tw2 = gsub("&amp", "", clean_tw2)
clean_tw2 = gsub("(rt|via)((?:\\b\\W*#\\w+)+)", "", clean_tw2)
clean_tw2 = gsub("#\\w+", "", clean_tw2)
clean_tw2 = gsub("[[:punct:]]", "", clean_tw2)
clean_tw2 = gsub("http\\w+", "", clean_tw2)
clean_tw2 = gsub("[ \t]{2,}", "", clean_tw2)
clean_tw2 = gsub("^\\s+|\\s+$", "", clean_tw2)
# creates a vector with common stopwords, and other words which I want removed.
myStopwords <- c(stopwords("spanish"),"tarek","vez","ser","ahora")
clean_tw2 <- (removeWords(clean_tw2,myStopwords))
# If we print clean_tw2 we see that all the characters are displayed as expected.
clean_tw2
#' Create Corpus using the quanteda library
corp_quan<-corpus(clean_tw2)
# The corpus created via quanteda, displays the characters as expected.
corp_quan$documents$texts
#' Create Corpus using the tm library
corp_td <- Corpus(VectorSource(clean_tw2))
#' Remove common Spanish stopwords from the Corpus.
#' If we inspect corp_td, we see that the characters and words are displayed correctly
inspect(corp_td)
# Create the DFM with quanteda library.
tdm_quan<-dfm(corp_quan)
# Here we see that the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
tdm_quan
# Create the TDM with the tm library
tdm_td<-TermDocumentMatrix(corp_td)
# Here we see that the Spanish characters are displayed incorrectly (e.g., canción = canciÃ³n), and "si" is missing.
tdm_td$dimnames$Terms
It looks like quanteda (and tm) is losing the encoding when creating the DFM on the Windows platform. In this tidytext question the same problem happened with unnesting tokens; that now works fine, and quanteda's tokens() works fine as well.
If you enforce UTF-8 or latin1 encoding on the @Dimnames$features of the dfm, you get the correct results.
....
previous code
.....
tdm_quan<-dfm(corp_quan)
# Here we see that the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
tdm_quan
Document-feature matrix of: 1 document, 8 features (0% sparse).
1 x 8 sparse Matrix of class "dfm"
       features
docs    enmascarados si masduro chingÃ¡n quieres aguantas canciÃ³n t
  text1            1  2       1        1       1        1        1 1
If you do the following:
Encoding(tdm_quan@Dimnames$features) <- "UTF-8"
tdm_quan
Document-feature matrix of: 1 document, 8 features (0% sparse).
1 x 8 sparse Matrix of class "dfm"
       features
docs    enmascarados si masduro chingán quieres aguantas canción t
  text1            1  2       1       1       1        1       1 1
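The same re-declaration should work for the tm object, whose terms are stored as a plain character vector (a sketch, using the tdm_td object from the question):
# Re-declare the encoding of the tm TDM's terms as well
Encoding(tdm_td$dimnames$Terms) <- "UTF-8"
tdm_td$dimnames$Terms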
Let me guess...are you using Windows? On macOS it works fine:
clean_tw2
## [1] "enmascarados si masduro chingán si quieres aguantas canción"
Encoding(clean_tw2)
## [1] "UTF-8"
dfm(clean_tw2)
## Document-feature matrix of: 1 document, 7 features (0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
## features
## docs enmascarados si masduro chingán quieres aguantas canción
## text1 1 2 1 1 1 1 1
My system information:
sessionInfo()
# R version 3.4.4 (2018-03-15)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS High Sierra 10.13.4
#
# Matrix products: default
# BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] tm_0.7-3 NLP_0.1-11 dplyr_0.7.4 quanteda_1.1.6

Why does sapply of an ordered list output my content twice?

I stored a list of files using this code:
filesList <- list.files(path="/Users/myPath/data/", pattern="*.csv")
I then wanted to output it without the indexes (that usually appear in the form [1] at the start of each line), so I tried this:
sapply(filesList[order(filesList)], print)
The result is given below, copied exactly from RStudio. Why does my list of files output twice? I can work with this, I am just curious.
[1] "IMDB_Bottom250movies.csv"
[1] "IMDB_Bottom250movies2_OMDB_Detailed.csv"
[1] "IMDB_Bottom250movies2.csv"
[1] "IMDB_ErrorLogIDs1_OMDB_Detailed.csv"
[1] "IMDB_ErrorLogIDs1.csv"
[1] "IMDB_ErrorLogIDs2_OMDB_Detailed.csv"
[1] "IMDB_ErrorLogIDs2.csv"
[1] "IMDB_OMDB_Kaggle_TestSet_OMDB_Detailed.csv"
[1] "IMDB_OMDB_Kaggle_TestSet.csv"
[1] "IMDB_Top250Engmovies.csv"
[1] "IMDB_Top250Engmovies2_OMDB_Detailed.csv"
[1] "IMDB_Top250Engmovies2.csv"
[1] "IMDB_Top250Indianmovies.csv"
[1] "IMDB_Top250Indianmovies2_OMDB_Detailed.csv"
[1] "IMDB_Top250Indianmovies2.csv"
[1] "IMDB_Top250movies.csv"
[1] "IMDB_Top250movies2_OMDB_Detailed.csv"
[1] "IMDB_Top250movies2.csv"
[1] "TestDoc2_KaggleData_OMDB_Detailed.csv"
[1] "TestDoc2_KaggleData.csv"
[1] "TestDoc2_KaggleData68_OMDB_Detailed.csv"
[1] "TestDoc2_KaggleData68.csv"
[1] "TestDoc2_KaggleDataHUGE_OMDB_Detailed.csv"
[1] "TestDoc2_KaggleDataHUGE.csv"
IMDB_Bottom250movies.csv IMDB_Bottom250movies2_OMDB_Detailed.csv
"IMDB_Bottom250movies.csv" "IMDB_Bottom250movies2_OMDB_Detailed.csv"
IMDB_Bottom250movies2.csv IMDB_ErrorLogIDs1_OMDB_Detailed.csv
"IMDB_Bottom250movies2.csv" "IMDB_ErrorLogIDs1_OMDB_Detailed.csv"
IMDB_ErrorLogIDs1.csv IMDB_ErrorLogIDs2_OMDB_Detailed.csv
"IMDB_ErrorLogIDs1.csv" "IMDB_ErrorLogIDs2_OMDB_Detailed.csv"
IMDB_ErrorLogIDs2.csv IMDB_OMDB_Kaggle_TestSet_OMDB_Detailed.csv
"IMDB_ErrorLogIDs2.csv" "IMDB_OMDB_Kaggle_TestSet_OMDB_Detailed.csv"
IMDB_OMDB_Kaggle_TestSet.csv IMDB_Top250Engmovies.csv
"IMDB_OMDB_Kaggle_TestSet.csv" "IMDB_Top250Engmovies.csv"
IMDB_Top250Engmovies2_OMDB_Detailed.csv IMDB_Top250Engmovies2.csv
"IMDB_Top250Engmovies2_OMDB_Detailed.csv" "IMDB_Top250Engmovies2.csv"
IMDB_Top250Indianmovies.csv IMDB_Top250Indianmovies2_OMDB_Detailed.csv
"IMDB_Top250Indianmovies.csv" "IMDB_Top250Indianmovies2_OMDB_Detailed.csv"
IMDB_Top250Indianmovies2.csv IMDB_Top250movies.csv
"IMDB_Top250Indianmovies2.csv" "IMDB_Top250movies.csv"
IMDB_Top250movies2_OMDB_Detailed.csv IMDB_Top250movies2.csv
"IMDB_Top250movies2_OMDB_Detailed.csv" "IMDB_Top250movies2.csv"
TestDoc2_KaggleData_OMDB_Detailed.csv TestDoc2_KaggleData.csv
"TestDoc2_KaggleData_OMDB_Detailed.csv" "TestDoc2_KaggleData.csv"
TestDoc2_KaggleData68_OMDB_Detailed.csv TestDoc2_KaggleData68.csv
"TestDoc2_KaggleData68_OMDB_Detailed.csv" "TestDoc2_KaggleData68.csv"
TestDoc2_KaggleDataHUGE_OMDB_Detailed.csv TestDoc2_KaggleDataHUGE.csv
"TestDoc2_KaggleDataHUGE_OMDB_Detailed.csv" "TestDoc2_KaggleDataHUGE.csv"
The second copy (without the indexes) is close enough to copy, paste, and use; just wondering why this happened?
What is happening here is that sapply is calling print on each element of filesList[order(filesList)], printing the contents to the screen. Then RStudio prints the result of the sapply call itself, which is a named character vector of the values returned by print. You can use cat to print values without the [1], or wrap the sapply in invisible to suppress its output. https://stackoverflow.com/a/12985020/6490232
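A minimal sketch of both options, using the filesList object from the question:
# cat() prints the values themselves, so no [1] indices appear
cat(filesList[order(filesList)], sep = "\n")
# invisible() suppresses sapply's return value, leaving only print()'s output
invisible(sapply(filesList[order(filesList)], print))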

Nested List Parsing with jsonlite

This is the second time I have faced this recently, so I wanted to reach out and see if there is a better way to parse dataframes returned from jsonlite when one of the elements is an array stored as a list-column in the dataframe.
I know that this is part of the power of jsonlite, but I am not sure how to work with this nested structure. In the end, I suppose I could write my own custom parsing, but given that I am almost there, I wanted to see how to work with this data.
For example:
## options
options(stringsAsFactors=F)
## packages
library(httr)
library(jsonlite)
## setup
gameid="2015020759"
SEASON = '20152016'
BASE = "http://live.nhl.com/GameData/"
URL = paste0(BASE, SEASON, "/", gameid, "/PlayByPlay.json")
## get the data
x <- GET(URL)
## parse
api_response <- content(x, as="text")
api_response <- jsonlite::fromJSON(api_response, flatten=TRUE)
## get the data of interest
pbp <- api_response$data$game$plays$play
colnames(pbp)
And exploring what comes back:
> class(pbp$aoi)
[1] "list"
> class(pbp$desc)
[1] "character"
> class(pbp$xcoord)
[1] "integer"
From above, the column pbp$aoi is a list. Here are a few entries:
> head(pbp$aoi)
[[1]]
[1] 8465009 8470638 8471695 8473419 8475792 8475902
[[2]]
[1] 8470626 8471276 8471695 8476525 8476792 8477956
[[3]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[4]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[5]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[6]]
[1] 8469619 8471695 8473492 8474625 8475727 8475902
I don't really care whether I parse these lists in the same dataframe, but what options do I have to parse out the data?
I would prefer to take the data out of the lists and parse them into a dataframe that can be "related" to the original record it came from.
Thanks in advance for your help.
Following @hrbmstr's suggestion above, I was able to get what I wanted using unnest.
library(dplyr)
library(tidyr)
select(pbp, eventid, aoi) %>% unnest() %>% head
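For reference, a base-R sketch of the same reshaping (my own, assuming the pbp dataframe above): repeat each eventid once per player id in its aoi list, then unlist the ids, so every row of the result relates back to the play it came from.
aoi_long <- data.frame(
  eventid = rep(pbp$eventid, lengths(pbp$aoi)),  # one eventid per player id
  aoi = unlist(pbp$aoi)
)
head(aoi_long)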

Count number of times a word-wildcard appears in text (in R)

I have a vector of either regular words ("activated") or wildcard words ("activat*"). I want to:
1) Count the number of times each word appears in a given text (i.e., if "activated" appears in the text, the "activated" frequency would be 1).
2) Count the number of times each wildcard word appears in the text (i.e., if "activated" and "activation" appear in the text, the "activat*" frequency would be 2).
I'm able to achieve (1), but not (2). Can anyone please help? Thanks.
library(tm)
library(qdap)
text <- "activation has begun. system activated"
text <- Corpus(VectorSource(text))
words <- c("activation", "activated", "activat*")
# Using termco to search for the words in the text
apply_as_df(text, termco, match.list=words)
# Result:
# docs word.count activation activated activat*
# 1 doc 1 5 1(20.00%) 1(20.00%) 0
Is it possible that this might have something to do with the versions? I ran the exact same code (see below) and got what you expected:
> text <- "activation has begun. system activated"
> text <- Corpus(VectorSource(text))
> words <- c("activation", "activated", "activat")
> apply_as_df(text, termco, match.list=words)
docs word.count activation activated activat
1 doc 1 5 1(20.00%) 1(20.00%) 2(40.00%)
Below is the output when I run R.Version(). I am running this in RStudio Version 0.99.491 on Windows 10.
> R.Version()
$platform
[1] "x86_64-w64-mingw32"
$arch
[1] "x86_64"
$os
[1] "mingw32"
$system
[1] "x86_64, mingw32"
$status
[1] ""
$major
[1] "3"
$minor
[1] "2.3"
$year
[1] "2015"
$month
[1] "12"
$day
[1] "10"
$`svn rev`
[1] "69752"
$language
[1] "R"
$version.string
[1] "R version 3.2.3 (2015-12-10)"
$nickname
[1] "Wooden Christmas-Tree"
Hope this helps
Maybe consider a different approach using the stringi library?
text <- "activation has begun. system activated"
words <- c("activation", "activated", "activat*")
library(stringi)
counts <- unlist(lapply(words, function(word) {
  newWord <- stri_replace_all_fixed(word, "*", "\\p{L}")
  stri_count_regex(text, newWord)
}))
ratios <- counts/stri_count_words(text)
names(ratios) <- words
ratios
Result is:
activation activated activat*
0.2 0.2 0.4
In the code I convert * into \p{L}, which matches any letter in a regex pattern. After that I count the regex occurrences that were found.
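Note that \p{L} matches exactly one letter, so the pattern built from "activat*" requires at least one trailing letter. A variant of my own (a sketch, not from the answer above) that treats * as "zero or more letters" and anchors on word boundaries:
library(stringi)
count_wildcard <- function(word, text) {
  # "*" -> "\p{L}*" (zero or more letters), wrapped in word boundaries
  pattern <- paste0("\\b", stri_replace_all_fixed(word, "*", "\\p{L}*"), "\\b")
  stri_count_regex(text, pattern)
}
sapply(words, count_wildcard, text = text)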

Extracting hashtags in several tweets using R

I desperately want a solution for extracting hashtags from a collection of tweets in R.
For example:
[[1]]
[1] "RddzAlejandra: RT #NiallOfficial: What a day for #johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle"
[[2]]
[1] "BPOInsight: RT #atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012"
[[3]]
[1] "BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech"
How can I parse it to extract the list of hashtag words in all the tweets?
Previous solutions display only the hashtags in the first tweet, with these error messages in the code:
> string <-"MonicaSarkar: RT #saultracey: Sun kissed #olmpicrings at #towerbridge #london2012 # Tower Bridge http://t.co/wgIutHUl"
>
> [[2]]
Error: unexpected '[[' in "[["
> [1] "ccrews467: RT #BBCNews: England manager Roy Hodgson calls #London2012 a \"wake-up call\": footballers and fans should emulate spirit of #Olympics http://t.co/wLD2VA1K"
Error: unexpected '[' in "["
> hashtag.regex <- perl("(?<=^|\\s)#\\S+")
> hashtags <- str_extract_all(string, hashtag.regex)
> print(hashtags)
[[1]]
[1] "#olmpicrings" "#towerbridge" "#london2012"
Using regmatches and gregexpr, this gives you a list with the hashtags per tweet, assuming a hashtag is of the format # followed by any number of letters or digits (I am not that familiar with Twitter):
foo <- c("RddzAlejandra: RT #NiallOfficial: What a day for #johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle","BPOInsight: RT #atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012","BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech")
regmatches(foo,gregexpr("#(\\d|\\w)+",foo))
Returns:
[[1]]
[1] "#London2012" "#MullingarShuffle"
[[2]]
[1] "#london2012"
[[3]]
[1] "#Olympics" "#NBC" "#london2012" "#tech"
How about a strsplit and grep version:
> lapply(strsplit(x, ' '), function(w) grep('#', w, value=TRUE))
[[1]]
[1] "#London2012" "#MullingarShuffle"
[[2]]
[1] "#london2012"
[[3]]
[1] "#Olympics" "#NBC," "#london2012" "#tech"
I couldn't figure out how to return multiple results from each string without first splitting, but I bet there is a way!
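For completeness, str_extract_all from the question is itself vectorised, so it handles the whole vector without splitting first (a sketch assuming a recent stringr, whose ICU regex engine supports the lookbehind directly):
library(stringr)
# One list element per tweet; like the strsplit/grep version above,
# trailing punctuation (e.g. "#NBC,") is kept by \S+
str_extract_all(foo, "(?<=^|\\s)#\\S+")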
