Can MeCab be configured / enhanced to give me the reading of English words too? - transliteration

If I begin with a wholly Japanese sentence and run it through MeCab, I get something like this:
$ echo "吾輩は猫である" | mecab
吾輩 名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
猫 名詞,一般,*,*,*,*,猫,ネコ,ネコ
で 助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
ある 助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル
EOS
If I smash together everything I get from the last column, I get "ワガハイワネコデアル", which I can then feed into a speech synthesis program and get output. Said program, however, doesn't handle English words.
If I throw English into MeCab, it manages to tokenise it (probably naively, at the spaces), but gives no readings:
$ echo "I am a cat" | mecab
I 名詞,固有名詞,組織,*,*,*,*
am 名詞,一般,*,*,*,*,*
a 名詞,一般,*,*,*,*,*
cat 名詞,固有名詞,組織,*,*,*,*
EOS
I want to get readings for these as well, even if they're not perfect, so that I can get something along the lines of "アイアムアキャット".
I have already scoured the web for solutions. While I did find a number of web sites whose transliteration appears adequate, I can't find any way to do the same in my own code. In a couple of cases I emailed the site authors and, after waiting a few weeks, still have no response. (Just how far behind on their inboxes are these people?)
There are a number of directions I could go, but so far I've hit dead ends on all of them, so this is my compound question:
MeCab takes custom dictionaries. Is there a custom dictionary which fills in the English knowledge somewhat? (A sketch of what such an entry might look like follows this list.)
Is there some other library or tool that can take English and spit out Katakana?
Is there some library or tool that can take IPA (International Phonetic Alphabet) and spit out Katakana? (I know how to get from English to IPA.)
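For the custom-dictionary route, a MeCab user dictionary is just a CSV file compiled with mecab-dict-index. As a rough sketch of what an IPADIC-style entry for an English word might look like (the context IDs 1285 and the cost 5000 below are placeholders, and the part-of-speech fields would need to match whichever system dictionary is in use):
cat,1285,1285,5000,名詞,一般,*,*,*,*,cat,キャット,キャット
Of course, building such a dictionary for general English would require generating the Katakana readings in the first place, which is exactly what questions 2 and 3 are about.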
As an aside, I find that the software "VOICEROID" can speak English text (poorly, but adequately for my purposes). This software uses MeCab too (or at least its DLL and dictionary files are included in the install). It also uses another library, CaboCha, which, as far as I can tell by running it, does the exact same thing as MeCab. It could be using custom dictionaries for either of these two libraries to do the job, or the code could live in the proprietary AITalk library they are using. More research is needed; I haven't figured out how to run either tool against their dictionaries to test this directly either.

Related

Obtaining the QAST of a Perl 6 file from another program

This is related to this question on accessing the POD, but it goes further than that. You can easily access the Abstract Syntax Tree of a Perl 6 program using:
perl6 --target=ast -e '"Þor is mighty!".say'
This will print the whole Q abstract syntax tree. It's not too clear how to do this from your own program, or at least I haven't found how. In fact, the CoreHackers::Q module runs that as an external script. But being able to access it from your own program, like
use QAST; # It does not exist
my $this-qast = QAST::Load("some-external-file.p6") # Would want something like this
would be great. I'm pretty sure it should be possible, at the NQP level and probably in a Rakudo-dependent way. Does someone know how it goes?
Since QAST is not a part of the Perl 6 language specification, but an internal implementation detail of Rakudo, there's no official way to do this. Eventually there will be an AST form that is part of the language specification, but that doesn't yet exist (the 007 project is exploring this area).
It is, however, possible to obtain the QAST tree by using:
use nqp;
my $ast = nqp::getcomp("perl6").eval("say 42", :target<ast>);
say $ast.dump();

First token could not be read or is not the keyword 'FoamFile' in OpenFOAM

I am a beginner to programming. I am trying to run a simulation of a combustion chamber using reactingFoam.
I have modified the counterflow2D tutorial.
For those who may not know OpenFOAM: it is a program built in C++, but it does not require C++ programming, just defining the variables correctly in the needed files.
In one of my first tries I made a very simple model, but since I wanted to check it very thoroughly I set it to 60 seconds with a 1e-6 timestep.
My computer is not very powerful, so it took approximately a day (by which I mean I'd rather find a fix than repeat the simulation).
I executed the solver reactingFoam using 4 processors in parallel with
mpirun -np 4 reactingFoam -parallel > log
The log does not show any evidence of error.
The problem is that reconstructPar works perfectly, but when I try to view the results with paraFoam this error is shown:
From function bool Foam::IOobject::readHeader(Foam::Istream&)
in file db/IOobject/IOobjectReadHeader.C at line 88
Reading "mypath/constant/reactions" at line 1
First token could not be read or is not the keyword 'FoamFile'
I have read that this can happen when files that are not supposed to be empty are empty, but I have not found that problem.
My 'reactions' file has not been modified from the tutorial and has always worked.
edit:
Sorry for the vague question. I have modified it a bit.
A typical OpenFOAM dictionary file always contains a header entry named FoamFile, read in as a Foam::Istream. An example from a typical system/controlDict file can be seen below:
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    location    "system";
    object      controlDict;
}
During construction of the dictionary, if this header is absent, OpenFOAM stops and raises the error message that you have experienced:
First token could not be read or is not the keyword 'FoamFile'
The benefit of the header is presumably that it contributes to OpenFOAM's abstraction mechanisms, which would be difficult otherwise.
As mentioned in the comments, adding the header entry almost always solves this problem.
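For the case at hand, that means prepending a FoamFile header to mypath/constant/reactions. As a sketch only (the class and object entries below are what I would expect for that file; check them against the unmodified tutorial):
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    location    "constant";
    object      reactions;
}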

Perl: Shebang (space?) "#! "?

I've seen both:
#!/path/...
#! /path/...
What's right? Does it matter? Is there a history?
I've heard that an ancient version of Unix required there not be a space. But then I heard that was just a rumor. Does anyone know for certain?
Edit: I couldn't think where better to ask this. It is programming related, since the space could make the program operate in a different way, for all I know. Thus I asked it here.
I also have a vague memory that whitespace was not allowed in some old Unix-like systems, but a bit of research doesn't support that.
According to this Wikipedia article, the #! syntax was introduced in Version 8 Unix in January, 1980. Dennis Ritchie's initial announcement of this feature says:
The system has been changed so that if a file being executed begins
with the magic characters #!, the rest of the line is understood to
be the name of an interpreter for the executed file. Previously (and
in fact still) the shell did much of this job; it automatically
executed itself on a text file with executable mode when the text
file's name was typed as a command. Putting the facility into the
system gives the following benefits.
[SNIP]
To take advantage of this wonderful opportunity, put
#! /bin/sh
at the left margin of the first line of your shell scripts. Blanks
after ! are OK. Use a complete pathname (no search is done). At the
moment the whole line is restricted to 16 characters but this limit
will be raised.
It's conceivable that some later Unix-like system supported the #! syntax but didn't allow blanks after the !, but given that the very first implementation explicitly allowed blanks, that seems unlikely.
leonbloy's answer provides some more context.
UPDATE:
The Perl interpreter itself recognizes a line starting with #!, even on systems where that's not recognized by the kernel. Run perldoc perlrun or see this web page for details.
The #! line is always examined for switches as the line is being
parsed. Thus, if you're on a machine that allows only one argument
with the #! line, or worse, doesn't even recognize the #! line, you
still can get consistent switch behaviour regardless of how Perl was
invoked, even if -x was used to find the beginning of the program.
Perl also permits whitespace after the #!.
(Personally, I prefer to write the #! line without whitespace, but it will work either way.)
And leonbloy's answer points to this web page by Sven Mascheck, which discusses the history of #! in depth. (I mention this now because of a recent discussion on comp.unix.shell.)
It seems to usually work both ways. See here. I'd say that the no-space version is much more common today, and, to me, much more appealing.
BTW, this is not specifically related to Perl (but it's definitely related to programming).

Unix write() function (libc)

I am making a C application in Unix that uses raw tty input.
I am calling write() to put characters on the display, but I want to manipulate the cursor:
ssize_t
write(int d, const void *buf, size_t nbytes);
I've noticed that if buf has the value 8 (I mean char tmp = 8, then passing &tmp), it will move the cursor/pointer backward on the screen.
I was wondering where I could find all the codes. For example, I wish to move the cursor forward, but I cannot seem to find that via Google.
Is there a page that lists all the codes for the write() function, please?
Thank you very much,
Jary
8 is just the ASCII code for backspace. You can type man ascii and look at all the values (the man page on my Ubuntu box has friendlier names for the values). If you want to do more complicated things you may want to look at a library like ncurses.
You have just discovered that character code 8 is backspace (control-H).
You would probably be best off using the curses library to manage the screen. However, you can find out what control sequences curses knows about by using infocmp to decompile the terminfo entry for your terminal. The format isn't particularly easy to understand, but it is relatively comprehensive. The alternative is to find a manual for the terminal, which tends to be rather hard.
For instance, I'm using a color Xterm window; infocmp says:
# Reconstructed via infocmp from file: /usr/share/terminfo/78/xterm-color
xterm-color|nxterm|generic color xterm,
am, km, mir, msgr, xenl,
colors#8, cols#80, it#8, lines#24, ncv#, pairs#64,
acsc=``aaffggiijjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~,
bel=^G, bold=\E[1m, clear=\E[H\E[2J, cr=^M,
csr=\E[%i%p1%d;%p2%dr, cub=\E[%p1%dD, cub1=^H,
cud=\E[%p1%dB, cud1=^J, cuf=\E[%p1%dC, cuf1=\E[C,
cup=\E[%i%p1%d;%p2%dH, cuu=\E[%p1%dA, cuu1=\E[A,
dch=\E[%p1%dP, dch1=\E[P, dl=\E[%p1%dM, dl1=\E[M, ed=\E[J,
el=\E[K, enacs=\E)0, home=\E[H, ht=^I, hts=\EH, il=\E[%p1%dL,
il1=\E[L, ind=^J,
is2=\E[m\E[?7h\E[4l\E>\E7\E[r\E[?1;3;4;6l\E8, kbs=^H,
kcub1=\EOD, kcud1=\EOB, kcuf1=\EOC, kcuu1=\EOA,
kdch1=\E[3~, kf1=\E[11~, kf10=\E[21~, kf11=\E[23~,
kf12=\E[24~, kf13=\E[25~, kf14=\E[26~, kf15=\E[28~,
kf16=\E[29~, kf17=\E[31~, kf18=\E[32~, kf19=\E[33~,
kf2=\E[12~, kf20=\E[34~, kf3=\E[13~, kf4=\E[14~,
kf5=\E[15~, kf6=\E[17~, kf7=\E[18~, kf8=\E[19~, kf9=\E[20~,
kfnd=\E[1~, kich1=\E[2~, kmous=\E[M, knp=\E[6~, kpp=\E[5~,
kslt=\E[4~, meml=\El, memu=\Em, op=\E[m, rc=\E8, rev=\E[7m,
ri=\EM, rmacs=^O, rmcup=\E[2J\E[?47l\E8, rmir=\E[4l,
rmkx=\E[?1l\E>, rmso=\E[m, rmul=\E[m,
rs2=\E[m\E[?7h\E[4l\E>\E7\E[r\E[?1;3;4;6l\E8, sc=\E7,
setab=\E[4%p1%dm, setaf=\E[3%p1%dm, sgr0=\E[m, smacs=^N,
smcup=\E7\E[?47h, smir=\E[4h, smkx=\E[?1h\E=, smso=\E[7m,
smul=\E[4m, tbc=\E[3g, u6=\E[%i%d;%dR, u7=\E[6n,
u8=\E[?1;2c, u9=\E[c,
That contains information about box drawing characters, code sequences generated by function keys, various cursor movement sequences, and so on.
You can find out more about X/Open Curses (v4.2) in HTML. However, that is officially obsolete, superseded by X/Open Curses v7, which you can download for free in PDF.
If you're using write just so you have low-level cursor control, I think you are using the wrong tool for the job. There are command codes for many types of terminal. VT100 codes, for example, are sequences of the form "\x1b[...", but rather than sending raw codes, you'd be much better off using a library like ncurses.
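For completeness, here is a minimal sketch of driving the cursor with raw write() calls, assuming a VT100/ANSI-compatible terminal (the escape sequences belong to the terminal, not to write() itself):
#include <unistd.h>

int main(void)
{
    /* Assumes a VT100/ANSI-compatible terminal:
       ESC [ n D moves the cursor n columns back,
       ESC [ n C moves it n columns forward. */
    write(STDOUT_FILENO, "hello", 5);
    write(STDOUT_FILENO, "\x1b[3D", 4);  /* back three columns, onto the first 'l' */
    write(STDOUT_FILENO, "L", 1);        /* overwrite it */
    write(STDOUT_FILENO, "\x1b[1C", 4);  /* forward one column, skipping the second 'l' */
    write(STDOUT_FILENO, "!\n", 2);      /* overwrite the 'o' */
    return 0;
}
The terminal ends up showing "heLl!" rather than "hello", because the escape sequences move the cursor instead of printing characters.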

console print w/o scrolling

I've seen console apps print colors, and I've seen apps such as ffmpeg print text over themselves instead of on a new line. How do I print over an existing line? I want to display fps in my console app, either at the very top or the very bottom, and have regular printfs still scroll normally.
I need this for Windows, but it is meant to be cross-platform, so I will eventually have Linux and Mac implementations as well.
There are two simple possibilities which work on Linux as well as Windows, but only for one line:
printf("\b"); will move back one character, so you might count how many characters you want to backspace and issue this in a loop, or, if you know you only write n characters, do it like printf("\b\b\b\b\b\b\b\b\b\b");
printf("text to be overwritten by next printf\r"); will return the cursor to the beginning of the line, so the next printf will overwrite it. Make sure to write a string of the same length or longer so you overwrite it entirely (see the sketch below).
If you want to rewrite several lines, there is nothing as portable as ncurses; there are libs for it on practically every operating system, and you don't have to take care of the ANSI differences.
Edit: added a link to the ncurses Wikipedia page, which gives a great overview and introduction, as well as a link list and maybe a translation into your preferred language.
Check out ncurses. It has bindings for most scripting languages.
You can use '\r' instead of '\n'.
The ASCII character number 8 (a.k.a. Ctrl-H, BS or Backspace) lets you back up one character. ASCII character number 13 (a.k.a. Ctrl-M, CR or Carriage Return) returns the cursor to the beginning of the line.
If you are working in C try putchar(8); and putchar(13);
The magic of colors, cursor positioning, blinking and so on lies in ANSI escape codes. Any text console capable of handling ANSI codes can use them just by printing them out to the console (e.g. by means of echo in a bash script or the printf() function in C).
Unix terminals support ANSI escape sequences, and the Windows world used to support them back in the old MS-DOS days, but multibyte console support put an end to this. There is more information here. However, there are other options available on Windows beyond just printing ANSI sequences. Moreover, if you have Cygwin installed on your Windows machine, ANSI codes work just as well as on any Unix terminal.
Many people mention the ncurses library, which is the de facto standard for GUI-like text-based applications. What this library does is hide all the terminal differences (Windows/Unix flavours) so that the same information is represented as identically as possible across all platforms, though from my own experience I can tell you this is not always true (e.g. typical text window frames change because the special characters are not available under all character encodings). The downside of using ncurses is that it is a complete API, and it is much harder to get started with than simply writing out some ANSI escape sequences for simple things such as changing the font color, clearing the screen, or moving the cursor to a given position.
For the sake of completeness, I paste an example of an ANSI sequence used under Linux that changes the prompt to blue and shows the date:
PS1="\[\033[34m\][\$(date +%H%M)][\u#\h:\w]$ "
You can use Ncurses -
ncurses package is a subroutine library for terminal-independent screen-painting and input-event handling which presents a high level screen model to the programmer, hiding differences between terminal types and doing automatic optimization of output to change one screenfull of text into another
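A minimal ncurses sketch (link with -lncurses; on Windows the PDCurses library provides the same API), with the fps value again a placeholder:
#include <curses.h>

int main(void)
{
    initscr();                        /* take over the terminal */
    mvprintw(0, 0, "fps: %d", 60);    /* status line at row 0, column 0 */
    mvprintw(2, 0, "press any key to quit");
    refresh();                        /* push the changes to the screen */
    getch();
    endwin();                         /* restore the terminal */
    return 0;
}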
Depending on the platform you are developing on, there's probably a more powerful API you could use, rather than old ASCII control codes.
e.g. if you are working on Win32, you can actually manipulate the console screen buffer directly.
A good place to start might be here
http://msdn.microsoft.com/en-us/library/ms683171(VS.85).aspx
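For instance, a minimal sketch using the Win32 console API to park the cursor at the top-left cell before printing a status value (no error checking, and the value is a placeholder):
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    COORD topLeft = { 0, 0 };

    /* Move the cursor to the top-left cell of the screen buffer,
       then write a status line there. */
    SetConsoleCursorPosition(out, topLeft);
    printf("fps: %d", 60);
    return 0;
}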
I have been looking for similar functions/API which would allow me to access the console as something other than a stream of text for other platforms. Haven't found anything yet, but then again, I haven't been looking that hard.
Hope it helps.
