How to find all the related keywords for a root word? - r

I am trying to figure out a way to find all the keywords that come from the same root word (in some sense the opposite action of stemming). Currently, I am using R for coding, but I am open to switching to a different language if it helps.
For instance, I have the root word "rent" and I would like to be able to find "renting", "renter", "rental", "rents" and so on.

Try this code in Python:
from pattern.en import lexeme
print(lexeme("rent"))
The output is the list of inflected forms of "rent".
If pattern is not installed yet, install it together with nltk:
pip install pattern
pip install nltk
Now, open a terminal, start python, and run the code below to download the required NLTK data.
import nltk
nltk.download(["wordnet","wordnet_ic","sentiwordnet"])
After the installation is done, run the pattern code again.

You want to find the opposite of stemming, but stemming can be your way in.
Look at this example in Python:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["renting", "renter", "rental", "rents", "apple"]

all_rents = {}
for word in words:
    stem = stemmer.stem(word)
    if stem not in all_rents:
        all_rents[stem] = []
        all_rents[stem].append(word)
    else:
        all_rents[stem].append(word)
print(all_rents)
Result:
{'rent': ['renting', 'rents'], 'renter': ['renter'], 'rental': ['rental'], 'appl': ['apple']}
There are several other algorithms to use. However, keep in mind that stemmers are rule-based and are not "smart" to the point where they will select all related words (as seen above). You can even implement your own rules (extend the Stem API from NLTK).
Read more about all available stemmers in NLTK (the module that was used in the above example) here: https://www.nltk.org/api/nltk.stem.html
You can implement your own algorithm as well. For example, you can use Levenshtein distance (as proposed in #noski's comment) to measure how close each candidate word is to the root. However, you have to do your own research on this one, since it is a complex process.
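For instance, in R (the language the question mentions) base adist() computes Levenshtein distances; here is a minimal, crude sketch -- the cut-off of 3 is an arbitrary choice and still lets false positives such as "brent" through:
words <- c("renting", "renter", "rental", "rents", "apple", "brent")
root <- "rent"
# Levenshtein distance from the root to every candidate word
d <- drop(adist(root, words))
# keep words within an (arbitrary) edit distance of 3 from the root
words[d <= 3]
# [1] "renting" "renter"  "rental"  "rents"   "brent"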

For an R answer, you can try these functions as a starting point. d.b gives grepl as an example; here are a few more:
words = c("renting", "renter", "rental", "rents", "apple", "brent")
grepl("rent", words) # TRUE TRUE TRUE TRUE FALSE TRUE
startsWith(words, "rent") # TRUE TRUE TRUE TRUE FALSE FALSE
endsWith(words, "rent") # FALSE FALSE FALSE FALSE FALSE TRUE

Related

Automate Response at Prompt in R interactive

Please see my reference below to a previous question asked along these lines.
I am running the taxize library in R. taxize includes a function for getting a stable number associated with a scientific name, get_tsn().
I can run this in interactive mode or non-interactive mode so that I am either
prompted or not, respectively, to choose among multiple hits.
Interactive:
> tax.num <- get_tsn("Acer rubrum", ask=TRUE)
Retrieving data for taxon 'Acer rubrum'
tsn target commonNames nameUsage
1 28728 Acer rubrum red maple accepted
2 28730 Acer rubrum ssp. drummondii NA not accepted
3 526853 Acer rubrum var. drummondii Drummond's maple accepted
...
More than one TSN found for taxon 'Acer rubrum'!
Enter rownumber of taxon (other inputs will return 'NA'):
Non-interactive:
> tax.num <- get_tsn("Acer rubrum", ask=FALSE)
Retrieving data for taxon 'Acer rubrum'
Warning message:
> 1 result; no direct match found
I need to run this library in interactive mode so that I do not get an empty result when there is more than one match. However, babysitting this script is totally unrealistic for the size of my data, which are in the millions of scientific names. Thus, I want to automate a response to the prompt so that the answer is always 1. This will be the right answer for probably 99% of cases and will ultimately still lead to the right answer downstream in 100% of cases for reasons that are probably beyond the scope of this question.
Thus, how can I automate the response to always be 1?
I looked at this question and tried modifying my code accordingly.
options(httr_oauth_cache=T)
tax.num <- get_tsn("Acer rubrum",ask=T)
However, this gave the same result shown for interactive mode above.
Your help is appreciated.
UPDATE: Ignore below. Obviously Nathan Werth posted the best answer in a comment above.
tax.num <- get_tsn_(searchterm = "Acer rubrum", rows = 1)
works wonderfully!
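If it helps, the same rows = 1 call can be looped over a whole vector of names; here is a rough sketch (I'm assuming each get_tsn_() result contains a data frame with a tsn column, as in the printout above -- check it against the structure you actually get back):
library(taxize)
sci_names <- c("Acer rubrum", "Quercus alba")   # stand-in for the real vector of names
tsns <- sapply(sci_names, function(nm) {
  res <- get_tsn_(searchterm = nm, rows = 1)
  df <- if (is.data.frame(res)) res else res[[1]]   # return structure assumed, see note above
  if (is.null(df) || nrow(df) == 0) NA_character_ else as.character(df$tsn[1])
})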
...
I decided to modify the source code to handle this. I suspect that there is a more desirable solution, but this one meets my needs.
Thus, in the file get_tsn.R from the source, I replaced the following block of code
# prompt
message("\n\n")
print(tsn_df)
message("\nMore than one TSN found for taxon '", x, "'!\n
Enter rownumber of taxon (other inputs will return 'NA'):\n")
# prompt
take <- scan(n = 1, quiet = TRUE, what = 'raw')
with
take <- 1
I could also have deleted the other echo-to-screen bits, which are unnecessary and now no longer accurate.
The revised function, which I tested using trace("get_tsn",edit=TRUE), returns as follows:
> print(tax.num)
[1] "28728"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] TRUE
attr(,"pattern_match")
[1] FALSE
attr(,"uri")
[1] "http://www.itis.gov/servlet/SingleRpt/SingleRpt?
search_topic=TSN&search_value=28728"
attr(,"class")
[1] "tsn"
I will recompile and install it on Linux now with the edit for use with this particular project.
I still welcome other, better answers.

R test if a file exists, and is not a directory

I have an R script that takes a file as input, and I want a general way to know whether the input is a file that exists, and is not a directory.
In Python you would do it this way: How do I check whether a file exists using Python?, but I was struggling to find anything similar in R.
What I'd like is something like below, assuming that the file.txt actually exists:
input.good = "~/directory/file.txt"
input.bad = "~/directory/"
is.file(input.good) # should return TRUE
is.file(input.bad) #should return FALSE
R has something called file.exists(), but this doesn't distinguish files from directories.
There is a dir.exists function in all recent versions of R.
file.exists(f) && !dir.exists(f)
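If you want the is.file() wrapper the question sketches, it is a one-liner (the function name here is just the question's own):
is.file <- function(f) file.exists(f) && !dir.exists(f)
is.file("~/directory/file.txt")  # TRUE for an existing regular file
is.file("~/directory/")          # FALSE: it exists, but it is a directory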
The solution is to use file_test()
This gives shell-style file tests, and can distinguish files from folders.
E.g.
input.good = "~/directory/file.txt"
input.bad = "~/directory/"
file_test("-f", input.good) # returns TRUE
file_test("-f", input.bad) #returns FALSE
From the manual:
Usage
file_test(op, x, y)
Arguments
op a character string specifying the test to be performed. Unary
tests (only x is used) are "-f" (existence and not being a directory),
"-d" (existence and directory) and "-x" (executable as a file or
searchable as a directory). Binary tests are "-nt" (strictly newer
than, using the modification dates) and "-ot" (strictly older than):
in both cases the test is false unless both files exist.
x, y character vectors giving file paths.
You can also use is_file(path) from the fs package.
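For example (assuming the fs package is installed):
library(fs)
is_file("~/directory/file.txt")  # TRUE for an existing regular file
is_dir("~/directory/")           # TRUE for a directory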

Tensorflow: How to convert .meta, .data and .index model files into one graph.pb file

In TensorFlow, training from scratch produced the following six files:
events.out.tfevents.1503494436.06L7-BRM738
model.ckpt-22480.meta
checkpoint
model.ckpt-22480.data-00000-of-00001
model.ckpt-22480.index
graph.pbtxt
I would like to convert them (or only the ones needed) into a single graph.pb file so that I can transfer it to my Android application.
I tried the freeze_graph.py script, but it already requires an input .pb file, which I do not have (I only have the six files mentioned above). How do I proceed to get this single frozen_graph.pb file? I saw several threads, but none of them worked for me.
You can use this simple script to do that. But you must specify the names of the output nodes.
import tensorflow as tf

meta_path = 'model.ckpt-22480.meta'  # Your .meta file
output_node_names = ['output']       # Graph node names (not tensor names such as 'output:0')

with tf.Session() as sess:
    # Restore the graph
    saver = tf.train.import_meta_graph(meta_path)

    # Load weights from the latest checkpoint in that directory
    saver.restore(sess, tf.train.latest_checkpoint('path/to/checkpoint/dir'))

    # Freeze the graph
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess,
        sess.graph_def,
        output_node_names)

    # Save the frozen graph
    with open('output_graph.pb', 'wb') as f:
        f.write(frozen_graph_def.SerializeToString())
If you don't know the name of the output node or nodes, there are two ways:
You can explore the graph and find the name with Netron or with the console summarize_graph utility.
You can use all the nodes as output ones as shown below.
output_node_names = [n.name for n in tf.get_default_graph().as_graph_def().node]
(Note that you have to put this line just before the convert_variables_to_constants call.)
But I think that is an unusual situation, because if you don't know the output node, you cannot actually use the graph.
As it may be helpful for others, I am also answering here, following up on the answer on GitHub ;-).
I think you can try something like this (with the freeze_graph script in tensorflow/python/tools):
python freeze_graph.py --input_graph=/path/to/graph.pbtxt --input_checkpoint=/path/to/model.ckpt-22480 --input_binary=false --output_graph=/path/to/frozen_graph.pb --output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3 "
The important flag here is --input_binary=false, as the file graph.pbtxt is in text format. I think it corresponds to the required graph.pb, which is the equivalent in binary format.
Concerning the output_node_names, that is still quite confusing for me, as I still have some problems with this part, but you can use the summarize_graph script in TensorFlow, which can take either the pb or the pbtxt as input.
Regards,
Steph
I tried the freeze_graph.py script, but the output_node_names parameter was totally confusing, and the job failed.
So I tried the other one: export_inference_graph.py.
And it worked as expected!
python -u /tfPath/models/object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=/your/config/path/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix=/your/checkpoint/path/model.ckpt-50000 \
--output_directory=/output/path
The tensorflow installation package I used is from here:
https://github.com/tensorflow/models
First, use the following code to generate the graph.pb file.
import tensorflow as tf

# args.input is the path to your .meta file; args.output is the file name to write
with tf.Session() as sess:
    # Restore the graph
    _ = tf.train.import_meta_graph(args.input)

    # Save the graph definition (as_text=True, so this writes a text .pbtxt file)
    g = sess.graph
    gdef = g.as_graph_def()
    tf.train.write_graph(gdef, ".", args.output, True)
Then, use summarize_graph to get the output node name.
Finally, use
python freeze_graph.py --input_graph=/path/to/graph.pbtxt --input_checkpoint=/path/to/model.ckpt-22480 --input_binary=false --output_graph=/path/to/frozen_graph.pb --output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3 "
to generate the frozen graph.

R grep with 'AND' logic

I'm working with RJDBC on a server whose maintainers frequently update jar versions. Since RJDBC requires classpaths, this poses a problem when paths break. My situation is fortunate in that the most current jars will always be in the same directory, but the version numbers will have changed.
I'm trying to use a simple grep function in R to isolate which jar I need, based on a regex with some AND logic; however, R makes this surprisingly difficult...
This question demonstrates how grep in R can work with the | operator for OR logic, but I can't seem to find a similar AND operator.
Here's an example:
## Let's say I have three jars in a directory
jars <- list.files('/the/dir')
> jars
[1] "hive-jdbc-1.1.0-cdh5.4.3-standalone.jar" "hive-jdbc-1.1.0-cdh5.4.3.jar" "jython-standalone-2.5.3.jar"
The jar I want is "hive-jdbc-1.1.0-cdh5.4.3-standalone.jar"—how can I use AND logic in grep to extract it?
## I know that OR logic is supported:
j <- jars[grep('hive-jdbc|standalone', jars)]
> j
[1] "hive-jdbc-1.1.0-cdh5.4.3-standalone.jar" "hive-jdbc-1.1.0-cdh5.4.3.jar" "jython-standalone-2.5.3.jar"
## Would AND logic look like the same format?
> jars[grep('hive-jdbc&standalone', jars)]
character(0)
Not all-that-surprisingly, that last piece doesn't work... I found a useful, yet non-comprehensive, link for grep in R, but it doesn't show an AND operator. Any thoughts?
You could try
grep('hive-jdbc.*standalone', jars) # 'hive-jdbc' followed by 'standalone'
or
grepl('hive-jdbc', jars) & grepl('standalone', jars) # 'hive-jdbc' AND 'standalone'
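Either form can be used directly to subset the vector; for example, with the jars from the question:
j <- jars[grepl('hive-jdbc', jars) & grepl('standalone', jars)]
j
# [1] "hive-jdbc-1.1.0-cdh5.4.3-standalone.jar"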

SnowballC in R stems "many" and "only"

I am using SnowballC to process a text document, but I have realized that it stems words such as "many" and "only" even though they are not supposed to be stemmed.
> library(SnowballC)
>
> str <- c("many", "only", "things")
> str.stemmed <- stemDocument(str)
> str.stemmed
[1] "mani" "onli" "thing"
>
> dic <- c("many", "only", "online", "things")
> str.complete <- stemCompletion(str.stemmed, dic)
> str.complete
mani onli thing
"" "online" "things"
You can see that after stemming, "many" and "only" became "mani" and "onli", which cannot be completed back with stemCompletion later on, since "mani" is not a prefix of "many". Notice how "onli" gets completed to "online" instead of the original "only".
Why is that? Is there a way to fix this?
Stemming is often executed as a set of rules for stripping all affixes--both derivational and inflectional--from a word, leaving its root. Lemmatization typically only removes inflectional affixes. Stemming is a much more aggressive version of lemmatization. Given what you want, it seems like you'd prefer lemmatization.
To compare the two: most lemmatizers are limited to a few rules for dealing with affixes to nouns and verbs in English (-ed, -s, -ing, for example). There are a few irregular cases they have to handle, but with some training data, many are probably covered.
Stemmers are expected to dig deeper. As a result, the space of possible transformations they can make is bigger, so you're a lot more likely to end up with errors.
To see what's happening in your data, let's look at the specifics.
online -> onli: why on earth would this happen? Not totally sure on this one; there's probably some rule that tries to cater to words like medic-ine and medic-al, sub-mari-ne and mari-ne, imagi-ne and imagi-na-tion.
only -> onli, many -> mani: These seem particularly strange, but are probably more reasonable than the previous rule--especially in the context of dealing with verbs that end in -ed. If you're stemming the words denied, studied, modified, specified, you'll want them to be equivalent to their uninflected forms deny, study, modify, specify.
You could have a rule to transform each verb into the uninflected form, but the authors here chose to make the roots the forms ending in -i. To ensure that these match, -y endings had to be transformed to -i as well.
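You can see that -i behaviour directly with SnowballC's wordStem() in R (my own illustration, not from the original post); the -ed and -y forms collapse onto the same stem:
library(SnowballC)
wordStem(c("deny", "denied", "study", "studied", "many", "only"))
# [1] "deni"  "deni"  "studi" "studi" "mani"  "onli"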
With a lemmatizer, you might get more predictable results. Since they only remove inflectional affixes, you'd get only, many, online, and thing, as you wanted. Both a good stemmer and lemmatizer can work well, but the stemmer does more stuff and therefore has more room for error.
That is how stemmers work. You've got a (smallish) set of rules that reduce most words to something resembling a canonical form (a stem), but not quite. There are many other corner cases you will find, so many in fact that I hesitate to call them corner cases, e.g.
many -> mani
other -> other
corner -> corner
cases -> case
in -> in
sentences -> sentenc
What you want is a lemmatiser. Have a look at this question for a more detailed explanation:
Stemmers vs Lemmatizers
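If you want to stay in R, one package worth trying (a suggestion of mine, not something from the answers above) is textstem, which provides dictionary-based lemmatization:
# install.packages("textstem")
library(textstem)
lemmatize_words(c("many", "only", "things", "sentences"))
# unlike the stemmer, this should give back dictionary forms such as "thing" and "sentence"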
