SnowballC in R stems "many" and "only" - r

I am using SnowballC to process a text document, but realize it stems words such as "many" and "only" even though they are not supposed to be stemmed.
> library(SnowballC)
> library(tm)  # stemDocument() and stemCompletion() are provided by tm
>
> str <- c("many", "only", "things")
> str.stemmed <- stemDocument(str)
> str.stemmed
[1] "mani" "onli" "thing"
>
> dic <- c("many", "only", "online", "things")
> str.complete <- stemCompletion(str.stemmed, dic)
> str.complete
mani onli thing
"" "online" "things"
You can see that after stemming, "many" and "only" became "mani" and "onli". These cannot be completed back with stemCompletion later on, because "mani" is not a prefix of "many". Notice how "onli" gets completed to "online" instead of the original "only".
Why is that? Is there a way to fix this?

Stemming is often implemented as a set of rules for stripping all affixes--both derivational and inflectional--from a word, leaving its root. Lemmatization typically only removes inflectional affixes. Stemming is a much more aggressive version of lemmatization. Given what you want, it seems like you'd prefer lemmatization.
To compare the two, most lemmatizers are limited to a few rules for dealing with affixes to nouns and verbs in English (-ed, -s, -ing, for example). There are a few irregular cases they have to handle, but with some training data, many are probably covered.
Stemmers are expected to dig deeper. As a result, the space of possible transformations they can make is bigger, so you're a lot more likely to end up with errors.
To see what's happening in your data, let's look at the specifics.
online -> onli: why on earth would this happen? Not totally sure on this one; there's probably some rule that tries to cater to words like medic-ine and medic-al, sub-mari-ne and mari-ne, imagi-ne and imagi-na-tion.
only -> onli, many -> mani: These seem particularly strange, but are probably more reasonable than the previous rule--especially in the context of dealing with verbs that end in -ed. If you're stemming the words denied, studied, modified, specified, you'll want them to be equivalent to their uninflected forms deny, study, modify, specify.
You could have a rule to transform each verb into the uninflected form, but the authors here chose to make the roots the forms ending in -i. To ensure that these match, -y endings had to be transformed to -i as well.
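To make that design choice concrete, here is a toy sketch in Python. It is not the Snowball or Porter algorithm: toy_stem is a hypothetical helper that implements only two rules (-ied becomes -i, and consonant + -y becomes consonant + -i). That is just enough to show why deny and denied end up with the same stem, and why many and only get caught by the same rule.

```python
VOWELS = "aeiou"

def toy_stem(word):
    """Hypothetical two-rule stemmer; NOT the real Snowball/Porter algorithm."""
    # Rule 1: "-ied" -> "-i"  (denied -> deni)
    if word.endswith("ied"):
        return word[:-3] + "i"
    # Rule 2: consonant + "y" -> consonant + "i"  (deny -> deni, many -> mani)
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        return word[:-1] + "i"
    # Everything else is left untouched in this sketch.
    return word

for w in ["denied", "deny", "studied", "study", "many", "only"]:
    print(w, "->", toy_stem(w))
```

The verb pairs now share a stem, but the second rule cannot tell a verb like deny from a word like many, so both are rewritten.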
With a lemmatizer, you might get more predictable results. Since they only remove inflectional affixes, you'd get only, many, online, and thing, as you wanted. Both a good stemmer and lemmatizer can work well, but the stemmer does more stuff and therefore has more room for error.
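As for the completion step, the results in the question are consistent with a simple prefix match: a stem is completed to a dictionary word that starts with it. The Python sketch below models that behaviour under this assumption (complete_stem is a hypothetical helper, not the actual stemCompletion implementation) and reproduces the three results from the question.

```python
def complete_stem(stem, dictionary):
    """Return the first dictionary word that starts with `stem`, or "" if none does.

    Assumption: completion is a plain prefix match, as the question's output suggests.
    """
    for word in dictionary:
        if word.startswith(stem):
            return word
    return ""

dic = ["many", "only", "online", "things"]
for stem in ["mani", "onli", "thing"]:
    print(stem, "->", repr(complete_stem(stem, dic)))
```

"mani" is not a prefix of "many", so nothing matches; "onli" is a prefix of "online" but not of "only", so the wrong word comes back.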

That is how stemmers work. You've got a (smallish) set of rules that reduce most words to something resembling a canonical form (a stem), but not quite. There are many other corner cases you will find, so many in fact that I hesitate to call them corner cases, e.g.
many -> mani
other -> other
corner -> corner
cases -> case
in -> in
sentences -> sentenc
What you want is a lemmatiser. Have a look at this question for a more detailed explanation:
Stemmers vs Lemmatizers

Related

Writing help information for user defined functions in R

I frequently use user defined functions in my code.
RStudio supports the automatic completion of code using the Tab key. I find this amazing because I can always quickly read what is supposed to go in the (...) of functions/calls.
However, my user defined functions just show the parameters, no additional info and obviously, no help page.
This isn't much of a pain for me, but I would like to share code, and I think it would be useful to have some information at hand besides the # comments on every line.
Nowadays, when I share, my lines usually look like this:
myfun <- function(x1,x2,x3,...){
# This is a function for this and that
# x1 is a factor, x2 is an integer ...
# This line of code is useful for transformation of x2 by x1
some code here
# Now we do this other thing
more code
# This is where the magic happens
return (magic)
}
I think this line by line comment is great but I'd like to improve it and make some things handy just like every other function.
Not really an answer, but if you are interested in exploring this further, you should start at the rcompgen-help page (although that's not a function name) and also examine the code of:
rc.settings
Also, executing this allows you to see what the .CompletionEnv has in it for currently loaded packages:
names(rc.status())
#-----
[1] "attached_packages" "comps" "linebuffer" "start"
[5] "options" "help_topics" "isFirstArg" "fileName"
[9] "end" "token" "fguess" "settings"
And if you just look at:
rc.status()$help_topics
... you see the character items that the tab-completion mechanism uses for matching. On my machine at the moment there are 8881 items in that vector.

R: How can I disable truncation of listing of package functions?

How can I list all of the results that used to occur when typing packageName<tab>, i.e. the full list offered via auto-completion? In R 2.15.0, I get the following for Matrix::<tab>:
> library(Matrix)
> Matrix::
Matrix::.__C__abIndex Matrix::.__C__atomicVector Matrix::.__C__BunchKaufman Matrix::.__C__CHMfactor Matrix::.__C__CHMsimpl
Matrix::.__C__CHMsuper Matrix::.__C__Cholesky Matrix::.__C__CholeskyFactorization Matrix::.__C__compMatrix Matrix::.__C__corMatrix
Matrix::.__C__CsparseMatrix Matrix::.__C__dCHMsimpl Matrix::.__C__dCHMsuper Matrix::.__C__ddenseMatrix Matrix::.__C__ddiMatrix
Matrix::.__C__denseLU Matrix::.__C__denseMatrix Matrix::.__C__dgCMatrix Matrix::.__C__dgeMatrix Matrix::.__C__dgRMatrix
Matrix::.__C__dgTMatrix Matrix::.__C__diagonalMatrix Matrix::.__C__dMatrix Matrix::.__C__dpoMatrix Matrix::.__C__dppMatrix
Matrix::.__C__dsCMatrix Matrix::.__C__dsparseMatrix Matrix::.__C__dsparseVector Matrix::.__C__dspMatrix Matrix::.__C__dsRMatrix
Matrix::.__C__dsTMatrix Matrix::.__C__dsyMatrix Matrix::.__C__dtCMatrix Matrix::.__C__dtpMatrix Matrix::.__C__dtrMatrix
Matrix::.__C__dtRMatrix Matrix::.__C__dtTMatrix Matrix::.__C__generalMatrix Matrix::.__C__iMatrix Matrix::.__C__index
Matrix::.__C__isparseVector Matrix::.__C__ldenseMatrix Matrix::.__C__ldiMatrix Matrix::.__C__lgCMatrix Matrix::.__C__lgeMatrix
Matrix::.__C__lgRMatrix Matrix::.__C__lgTMatrix Matrix::.__C__lMatrix Matrix::.__C__lsCMatrix Matrix::.__C__lsparseMatrix
[...truncated]
That [...truncated] message is irritating and I want to produce the full listing. Which option/flag/knob/configuration/incantation do I need to invoke in order to avoid the truncation? I have this impression that I used to see the full list, but not anymore - perhaps that was on a different OS (e.g. Linux).
I know that ls("package:Matrix") is one useful approach, but it is not the same as setting an option, and the list is different.
Unfortunately, on Windows, it looks like this behavior is hard-wired into the C code used to construct the console. So the answer seems to be that "no, you can't disable it" (at least not without modifying the sources and then recompiling R from scratch).
Here are the relevant lines from $RHOME/src/gnuwin32/console.c:
static void performCompletion(control c)
{
    ConsoleData p = getdata(c);
    int i, alen, alen2, max_show = 10, cursor_position = p->c - prompt_wid;
...
...
    if (alen > max_show)
        consolewrites(c, "\n[...truncated]\n");
You are correct that on some other platforms, all of the results are printed out. (I often use Emacs, for instance, and it pops all results of tab completion up in a separate buffer).
As an interesting side note, rcompgen, the backend that actually performs the tab-completion (as opposed to printing results to the console) does always find all completions. It's just that Windows doesn't then print them out for us to see.
You can verify that this happens even on Windows by typing:
library(Matrix)
Matrix::
## Then type <TAB> <TAB>
## Then type <RET>
rc.status() ## Careful not to use tab-completion to complete rc.status !
matches <- rc.status()$comps
length(matches) # -> 288
matches # -> lots of symbols starting with 'Matrix::'
For more details about the backend, and the functions and options that control its behavior, see ?rcompgen.

Perl: Shebang (space?) "#! "?

I've seen both:
#!/path/...
#! /path/...
What's right? Does it matter? Is there a history?
I've heard that an ancient version of Unix required there not be a space. But then I heard that was just a rumor. Does anyone know for certain?
Edit: I couldn't think where better to ask this. It is programming related, since the space could make the program operate in a different way, for all I know. Thus I asked it here.
I also have a vague memory that whitespace was not allowed in some old Unix-like systems, but a bit of research doesn't support that.
According to this Wikipedia article, the #! syntax was introduced in Version 8 Unix in January, 1980. Dennis Ritchie's initial announcement of this feature says:
The system has been changed so that if a file being executed begins
with the magic characters #!, the rest of the line is understood to
be the name of an interpreter for the executed file. Previously (and
in fact still) the shell did much of this job; it automatically
executed itself on a text file with executable mode when the text
file's name was typed as a command. Putting the facility into the
system gives the following benefits.
[SNIP]
To take advantage of this wonderful opportunity, put
#! /bin/sh
at the left margin of the first line of your shell scripts. Blanks
after ! are OK. Use a complete pathname (no search is done). At the
moment the whole line is restricted to 16 characters but this limit
will be raised.
It's conceivable that some later Unix-like system supported the #! syntax but didn't allow blanks after the !, but given that the very first implementation explicitly allowed blanks, that seems unlikely.
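To illustrate the rule from the announcement (blanks after the ! are OK), here is a minimal Python sketch of how a #! line can be split into an interpreter and an optional argument. parse_shebang is a hypothetical helper, and treating the rest of the line as a single argument is an assumption modelled on common kernel behaviour, not a claim about any particular Unix.

```python
def parse_shebang(first_line):
    """Return (interpreter, argument) for a '#!' line, or None if it is not one.

    Assumption: everything after the interpreter path is passed as ONE argument.
    """
    if not first_line.startswith("#!"):
        return None
    # Blanks between "#!" and the path are skipped, so both spellings parse alike.
    rest = first_line[2:].strip()
    if not rest:
        return None
    parts = rest.split(None, 1)
    interpreter = parts[0]
    argument = parts[1] if len(parts) > 1 else None
    return interpreter, argument

print(parse_shebang("#!/bin/sh"))
print(parse_shebang("#! /bin/sh"))
print(parse_shebang("#! /usr/bin/perl -w\n"))
```

With this kind of parsing, the two spellings in the question are indistinguishable once the line has been read.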
leonbloy's answer provides some more context.
UPDATE:
The Perl interpreter itself recognizes a line starting with #!, even on systems where that's not recognized by the kernel. Run perldoc perlrun or see this web page for details.
The #! line is always examined for switches as the line is being
parsed. Thus, if you're on a machine that allows only one argument
with the #! line, or worse, doesn't even recognize the #! line, you
still can get consistent switch behaviour regardless of how Perl was
invoked, even if -x was used to find the beginning of the program.
Perl also permits whitespace after the #!.
(Personally, I prefer to write the #! line without whitespace, but it will work either way.)
And leonbloy's answer points to this web page by Sven Mascheck, which discusses the history of #! in depth. (I mention this now because of a recent discussion on comp.unix.shell.)
It seems to usually work both ways. See here. I'd say that the no-space version is much more common today, and, to me, much more appealing.
BTW, this is not specifically related to Perl (but it's definitely related to programming).

What is the general syntax of a Unix shell command?

In particular, why is it that the options to some commands are sometimes preceded by a + sign and sometimes by a - sign?
For example:
sort -f
sort -nr
sort +4n
sort +3nr
These days, the POSIX standard using getopt() (aka getopt(3)) is widely used as a standard notation, but in the early days, people were experimenting. On some machines, the sort command no longer supports the + notation. However, various commands (notably ar and tar) accept controls without any prefix character - and dd (alluded to by Alok in a comment) uses another convention altogether.
The GNU convention of using '--' for long options (supported by getopt_long(3)) was changed from using '+'. Of course, the X11 software uses a single dash before multi-character options. So, the whole thing is a collection of historic relics as people experimented with how best to handle it.
POSIX documents the Utility Conventions that it works to, except where historical precedent is stronger.
What styles of option handling are there?
[At one time, SO 367309 contained the following material as my answer. It was originally asked 2008-12-15 02:02 by FerranB, but was subsequently closed and deleted.]
How many different types of options do you recognize? I can think of
many, including:
Single-letter options preceded by single dash, groupable when there is
no argument, argument can be attached to option letter or in next
argument (many, many Unix commands; most POSIX commands).
Single-letter options preceded by single dash, grouping not allowed,
arguments must be attached (RCS).
Single-letter options preceded by single dash, grouping not allowed,
arguments must be separate (pre-POSIX SCCS, IIRC).
Multi-letter options preceded by single dash, arguments may be
attached or in next argument (X11 programs; also Java and many programs on Mac OS X with a NeXTSTEP heritage).
Multi-letter options preceded by single dash, may be abbreviated
(Atria Clearcase).
Multi-letter options preceded by single plus (obsolete).
Multi-letter options preceded by double dash; arguments may follow '='
or be separate (GNU utilities).
Options without prefix/suffix, some names have abbreviations or are
implied, arguments must be separate (AmigaOS Shell).
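As a small illustration of the first style in the list (grouped single-letter options with attached or separate arguments) next to the GNU double-dash style, here is a sketch using Python's standard getopt module, which follows the getopt() conventions. The option letters and the --output name are made up for the example.

```python
import getopt

# Hypothetical option spec: -a and -b are flags, -o takes an argument,
# and --output is a long option that takes an argument.
SHORT_OPTS = "abo:"
LONG_OPTS = ["output="]

def parse(argv):
    """Parse argv with the standard-library getopt module."""
    return getopt.getopt(argv, SHORT_OPTS, LONG_OPTS)

# Grouped flags, attached argument, separate argument, '=' and separate long forms.
opts, operands = parse(
    ["-ab", "-ofile1", "-o", "file2", "--output=file3", "--output", "file4", "operand"]
)
print(opts)
print(operands)
```

Each option comes back as an (option, value) pair, however it was spelled on the command line.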
For options taking an optional argument, sometimes the argument must be attached (co -p1.3 rcsfile.c),
sometimes it must follow an '=' sign. POSIX doesn't support optional
arguments meaningfully (the POSIX getopt() only allows them for the last
option on the command line).
All sensible option systems use an option consisting of double-dash
('--') alone to mean "end of options" — the following arguments are
"non-option arguments" (usually file names; POSIX calls them 'operands')
even if they start with a
dash. (I regard supporting this notation as an imperative. Be aware that if the -- is preceded by an option requiring an argument, the -- will be treated as the argument to the option, not as the 'end of options' marker.)
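Both halves of that point can be seen with Python's standard getopt module, which follows the getopt() conventions. This is only a sketch; the option letters -v and -o are made up for the example.

```python
import getopt

# Hypothetical spec: -v is a flag, -o requires an argument.
SPEC = "vo:"

# "--" ends option processing: "-o" and "file" are now plain operands.
opts, operands = getopt.getopt(["-v", "--", "-o", "file"], SPEC)
print(opts, operands)

# Caveat: when "--" directly follows an option that needs an argument,
# it is consumed as that argument instead of ending the options.
caveat_opts, caveat_operands = getopt.getopt(["-o", "--", "file"], SPEC)
print(caveat_opts, caveat_operands)
```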
Many but not all programs accept single dash as a file name to mean
standard input (usually) or standard output (occasionally). Sometimes,
as with GNU 'tar', both can be used in a single command line:
... | tar -cf - -T - | ...
The first solo dash means 'write to stdout'; the second means 'read file
names from stdin'.
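A program that wants to support the solo-dash convention has to implement it itself; nothing in the system does it automatically. A minimal Python sketch, with open_input and open_output as hypothetical helper names:

```python
import sys

def open_input(name):
    """Return a readable text stream; a lone "-" means standard input."""
    if name == "-":
        return sys.stdin
    return open(name, "r")

def open_output(name):
    """Return a writable text stream; a lone "-" means standard output."""
    if name == "-":
        return sys.stdout
    return open(name, "w")
```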
Some programs use other conventions — that is, options not preceded by a
dash. Many of these are from the oldest days of Unix. For example,
'tar' and 'ar' both accept options without a dash, so:
tar cvzf /tmp/somefile.tgz some/directory
The dd command uses opt=value exclusively:
dd if=/some/file of=/another/file bs=16k count=200
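Parsing this style is simple enough to sketch in a few lines of Python. parse_dd_style is a hypothetical helper, not how dd itself is implemented; it just splits each operand at the first '='.

```python
def parse_dd_style(argv):
    """Split each 'name=value' operand into a dict entry."""
    settings = {}
    for arg in argv:
        name, sep, value = arg.partition("=")
        if not sep or not name:
            raise ValueError(f"expected name=value, got {arg!r}")
        settings[name] = value
    return settings

print(parse_dd_style(["if=/some/file", "of=/another/file", "bs=16k", "count=200"]))
```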
Some programs allow you to interleave options and other arguments
completely; the C compiler, make and the GNU utilities run without
POSIXLY_CORRECT in the environment are examples. Many programs expect
the options to precede the other arguments.
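The difference between the two behaviours can be seen with Python's standard getopt module, which offers both: getopt.getopt stops at the first non-option argument, while getopt.gnu_getopt allows interleaving (and, like the GNU utilities, falls back to the strict order when POSIXLY_CORRECT is set). The option letters are made up for this sketch.

```python
import getopt
import os

# gnu_getopt honours POSIXLY_CORRECT, so clear it to see the interleaving behaviour.
os.environ.pop("POSIXLY_CORRECT", None)

argv = ["-a", "file1", "-b", "file2"]

# Strict order: parsing stops at "file1", so "-b" is left among the operands.
posix_opts, posix_rest = getopt.getopt(argv, "ab")
print(posix_opts, posix_rest)

# GNU style: options are recognised wherever they appear.
gnu_opts, gnu_rest = getopt.gnu_getopt(argv, "ab")
print(gnu_opts, gnu_rest)
```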
Note that git and other VCS commands often use a hybrid system:
git commit -m 'This is why it was committed'
There is a sub-command as one of the arguments. Often, there will be optional 'global' options that can be specified between the command and the sub-command. There are examples of this in POSIX; the sccs command is in this category; you can argue that some of the other commands that run other commands are also in this category: nice and xargs spring to mind from POSIX; sudo is a non-POSIX example, as are svn and cvs.
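A sketch of this hybrid layout using Python's standard argparse module: one global option that goes before the sub-command, and one option that belongs to the sub-command. The tool name vcs and the --verbose option are invented for the example.

```python
import argparse

# Hypothetical tool "vcs" with one global option and one sub-command.
parser = argparse.ArgumentParser(prog="vcs")
parser.add_argument("--verbose", action="store_true")  # global option
subparsers = parser.add_subparsers(dest="command")
commit = subparsers.add_parser("commit")
commit.add_argument("-m", "--message")                 # sub-command option

args = parser.parse_args(["--verbose", "commit", "-m", "This is why it was committed"])
print(args.verbose, args.command, args.message)
```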
I don't have strong preferences between the different systems. When
there are few enough options, then single letters with mnemonic value
are convenient. GNU supports this, but recommends backing it up with
multi-letter options preceded by a double-dash.
There are some things I do object to. One of the worst is the same
option letter being used with different meanings depending on what other
option letters have preceded it. In my book, that's a no-no, but I know
of software where it is done.
Another objectionable behaviour is inconsistency in style of handling
arguments (especially for a single program, but also within a suite of
programs). Either require attached arguments or require detached
arguments (or allow either), but do not have some options requiring an
attached argument and others requiring a detached argument. And be
consistent about whether '=' may be used to separate the option and
the argument.
As with many, many (software-related) things — consistency is more
important than the individual decisions. Using tools that automate
and standardize the argument processing helps with consistency.
Whatever you do, please, read the TAOUP's Command-Line Options and
consider Standards for Command Line Interfaces. (Added by J F
Sebastian — thanks; I agree.)
It's completely arbitrary; the command may implement all of the option handling in its own special way or it might call out to some other convenience functions. The getopt() family of functions is pretty popular, so most software written even remotely recently follows the conventions set by those routines. There are always exceptions, of course!
It's left to apps to parse options, hence the inconsistency. Expanding on your sort example, these are all equivalent for coreutils:
sort -k3
sort --k 3
sort --key 3
sort --key=3
_POSIX2_VERSION=199209 sort +2
A shell command is just a program, and it is free to interpret its command line any way it likes.
Unix never had anything like Apple's interface police to make sure that the command-line interface was consistent across applications. As a result, there is inconsistency, especially in older commands.
Peering into my crystal ball, I think command-line tools will slowly migrate toward GNU standards, double dashes and all. (I grew up with single dashes and still find the double dash very awkward, but it is consistent.)

Why do <C-PageUp> and <C-PageDown> not work in vim?

I have Vim 7.2 installed on Windows. In GVim, <C-PageUp> and <C-PageDown> work for navigation between tabs by default. However, they don't work in Vim.
I have even added the below lines in _vimrc, but it still does not work.
map <C-PageUp> :tabp<CR>
map <C-PageDown> :tabn<CR>
But mapping <C-left> and <C-right> works:
map <C-left> :tabp<CR>
map <C-right> :tabn<CR>
Does anybody have a clue why?
The problem you describe is generally caused by vim's terminal settings not knowing the correct character sequence for a given key (on a console, all keystrokes are turned into a sequence of characters). It can also be caused by your console not sending a distinct character sequence for the key you're trying to press.
If it's the former problem, doing something like this can work around it:
:map <CTRL-V><CTRL-PAGEUP> :tabp<CR>
Where <CTRL-V> and <CTRL-PAGEUP> are literally those keys, not "less than, C, T, R, ... etc.".
If it's the latter problem then you need to either adjust the settings of your terminal program or get a different terminal program. (I'm not sure which of these options actually exist on Windows.)
This may seem obvious to many, but konsole users should be aware that some versions bind ctrl-pageup / ctrl-pagedown as secondary bindings to its own tabbed window feature (which may not be obvious if you don't use that feature).
Simply clearing them from the 'Configure Shortcuts' menu got them working in vim correctly for me. I guess other terminals may have similar features enabled by default.
I'm adding this answer, taking details from vi & Vim, to complement the answers that have already been given/accepted with some more details that seem very important to me.
The already proposed answers
It is true what the other answer says:
map <C-PageUp> :echo "hello"<CR> won't work because Vim doesn't know what escape sequence corresponds to the keycode <C-PageUp>;
one solution is to type the escape sequence explicitly: map ^[[5^ :echo "hello"<CR>, where the escape sequence ^[[5^ (which in general differs from terminal to terminal) can be obtained by pressing Ctrl+V followed by Ctrl+PageUp.
One additional important detail
On the other hand the best solution for me is the following
set <F13>=^[[5^
map <F13> :echo "hello"<CR>
which makes use of one of the additional function key codes (you can use up to <F37>). Likewise, you could have a bunch of set keycode=escapesequence lines all together in a single place in your .vimrc (or in another dedicated file that you source from your .vimrc, why not?).
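The idea behind this two-step setup can be modelled as two lookup tables: one from the raw escape sequence to a key code, and one from the key code to the mapped command. The Python sketch below is only a conceptual model and says nothing about how Vim is implemented; the table names and decode_key are hypothetical, and the escape sequence is the one quoted above, which varies between terminals.

```python
ESC = "\x1b"  # the character Vim displays as ^[

# Step 1 (like `set <F13>=^[[5^`): raw terminal sequence -> key code.
KEY_CODES = {ESC + "[5^": "<F13>"}

# Step 2 (like `map <F13> ...`): key code -> command.
MAPPINGS = {"<F13>": ':echo "hello"<CR>'}

def decode_key(sequence, key_codes=KEY_CODES):
    """Return the key code registered for `sequence`, or None if unknown."""
    return key_codes.get(sequence)

key = decode_key(ESC + "[5^")
print(key, "->", MAPPINGS.get(key))
```

Keeping the terminal-specific sequences in the first table means the mappings themselves never have to change when you switch terminals.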

Resources