Is there any way to obtain the index of a particular token in a doc rather than looping? (spaCy 3)

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("I love spacy. Spacy is so cool.")
for token in doc:
    print(token)
This prints every token on a new line. But if I want the "cool" token separately, I need to know the index of that token, right? What can I do about that?
Ex: If there are 1000 words in the doc and I want a particular word, say "simulate", as a separate token, I don't know the position of that token in the doc. Rather than looping over those 1000 words, is there any way to directly obtain the index of "simulate"?

token.i has the token index of a token in the document. token.idx has the character index.
import spacy
nlp = spacy.blank("en")
doc = nlp("I like cheese")
assert doc[2].text == "cheese"
assert doc[2].i == 2
assert doc[2].idx == 7
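As far as I know, the Doc keeps no reverse text-to-index lookup, so some iteration is unavoidable; you can at least hide it in a generator expression that stops at the first match. A minimal sketch:
import spacy

nlp = spacy.blank("en")
doc = nlp("I love spacy. Spacy is so cool.")

# Stop at the first token whose text matches; returns None if absent.
i = next((token.i for token in doc if token.text == "cool"), None)
print(i)  # 7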

Related

What should I learn to code a bot in Telegram?

I want to code and create a bot for Telegram that does these things:
1 - shows a message to the person that hits the start button
2 - then it gets a name as an input
3 - then again shows a message
4 - gets an input
5 - at the end, adds the inputs to a default text and shows it;
for example:
-start
+Hi What is your name?
-X
+How old are you?
-Y
+Your name is X and you are Y years old.
My second question is how can I connect two bots together? For example, imagine I want to pass some input from this bot to make a poll (voting message); in order to do that I should send the name to, let's say, #vote. How is that possible, and what should I learn to do such things with my bot?
First you're gonna have to explore the Telegram Bot API documentation at https://core.telegram.org/bots/api.
Then you should choose your programming language and the library you want to use.
There are different libraries for each language, I'm gonna name a few:
Go: https://github.com/aliforever/go-telegram-bot-api (DISCLAIMER: I wrote and maintain it)
Python: https://github.com/eternnoir/pyTelegramBotAPI
NodeJS: https://github.com/telegraf/telegraf
I'm gonna give you an example of what you want in Python using pyTelegramBotAPI:
First install the library using pip:
pip install git+https://github.com/eternnoir/pyTelegramBotAPI.git
Then run this script:
import telebot

API_TOKEN = 'PLACE_BOT_TOKEN_HERE'

bot = telebot.TeleBot(API_TOKEN)

user_info = {}

def set_user_state(user_id, state):
    if user_id not in user_info:
        user_info[user_id] = {}
    user_info[user_id]["state"] = state

def get_user_state(user_id):
    if user_id in user_info:
        if "state" in user_info[user_id]:
            return user_info[user_id]["state"]
    return "Welcome"

def set_user_info(user_id, name=None, age=None):
    if name is None and age is None:
        return
    if name is not None:
        user_info[user_id]["name"] = name
    if age is not None:
        user_info[user_id]["age"] = age

def get_user_info(user_id):
    return user_info[user_id]

@bot.message_handler()
def echo_all(message):
    user_id = message.from_user.id
    if message.text == "/start":
        bot.reply_to(message, "Hi What is your name?")
        set_user_state(user_id, "EnterName")
        return
    user_state = get_user_state(user_id)
    if user_state == "EnterName":
        set_user_info(user_id, name=message.text)
        bot.reply_to(message, "How old are you?")
        set_user_state(user_id, "EnterAge")
        return
    if user_state == "EnterAge":
        set_user_info(user_id, age=message.text)
        info = get_user_info(user_id)
        bot.reply_to(message, "Your name is %s and you are %s years old." % (info["name"], info["age"]))
        set_user_state(user_id, "Welcome")
        return
    bot.reply_to(message, "To restart please send /start")

bot.infinity_polling()
Here we use a dictionary to store user state and info; you can keep them anywhere you like, such as a database or a JSON file.
Then we update a user's state based on their interactions with the bot.
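As a hedged illustration of the JSON-file option mentioned above (the file name and helper functions are just examples, not part of pyTelegramBotAPI):
import json

STATE_FILE = 'user_info.json'  # example file name

def load_state():
    # Returns the saved state dict, or an empty one on first run.
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_state(state):
    # JSON object keys are strings, so convert user ids with str()
    # before using them as keys if you adopt this approach.
    with open(STATE_FILE, 'w') as f:
        json.dump(state, f)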
For your second question, bots cannot communicate with each other, so you should look for other solutions. In the case of your question, where you want to create a poll, you should check the sendPoll method as well as the PollAnswer object, which you receive when a user votes in a poll.
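A minimal sketch of the poll part with pyTelegramBotAPI (hedged: the exact send_poll signature varies between library versions, so check the version you install):
import telebot

bot = telebot.TeleBot('PLACE_BOT_TOKEN_HERE')

@bot.message_handler(commands=['vote'])
def start_poll(message):
    # Non-anonymous polls are required to receive poll_answer updates.
    bot.send_poll(message.chat.id, "Do you like this bot?",
                  ["Yes", "No"], is_anonymous=False)

@bot.poll_answer_handler()
def on_poll_answer(poll_answer):
    # poll_answer.option_ids holds the indexes the voter selected.
    print(poll_answer.user.id, poll_answer.option_ids)

bot.infinity_polling()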

Removing words from lemmatisation dictionary/updating lemma dictionary in textstem

I am using the textstem package to lemmatise words in some responses. However, there is one word (spotting) which I do not want to be included and reduced to "spot". I want it to remain as spotting. How might I be able to do this? Do I need to make a custom dictionary? Currently doing:
lemmatize_strings(df, dictionary = lexicon::hash_lemmas)
You can create your own dictionary where you remove the token spotting:
# hash_lemmas is a data.table, so you can use the column name token instead of hash_lemmas$token
my_lex <- lexicon::hash_lemmas[!token == "spotting", ]
df_lemmatized <- lemmatize_strings(df, dictionary = my_lex)
Or if you want to do it without creating your own lexicon:
df_lemmatized <- lemmatize_strings(df, dictionary = lexicon::hash_lemmas[!token == "spotting", ])

Iterate list of values from traversal A in traversal B (Gremlin)

This is my test data:
graph = TinkerGraph.open()
g= graph.traversal()
g.addV('Account').property('id',"0x0").as('a1').
addV('Account').property('id',"0x1").as('a2').
addV('Account').property('id',"0x2").as('a3').
addV('Token').property('address','1').as('tk1').
addV('Token').property('address','2').as('tk2').
addV('Token').property('address','3').as('tk3').
addV('Trx').property('address','1').as('Trx1').
addV('Trx').property('address','1').as('Trx2').
addV('Trx').property('address','3').as('Trx3').
addE('sent').from('a1').to('Trx1').
addE('sent').from('a2').to('Trx2').
addE('received_by').from('Trx1').to('a2').
addE('received_by').from('Trx2').to('a3').
addE('distributes').from('a1').to('tk1').
addE('distributes').from('a1').to('tk2').
addE('distributes').from('a1').to('tk3').
iterate()
I need to first get all the Token addresses using the distributes relationship and then, with those values, loop through a traversal. This is an example of what I need for one single token:
h = g.V().has('Account','id','0x0').next()
token = '1'
g.V(h).
out('sent').has('address',token).as('t1').
out('received_by').as('a2').
out('sent').has('address',token).as('t2').
out('received_by').as('a3').
select('a3','a2').
by('id').toList()
This is the output:
[a3:0x2,a2:0x1]
Instead of doing that has('address',token) on each hop I could omit it and just make sure the token address is the same by placing a where('t1',eq('t2')).by('address') at the end of the traversal, but this performs badly given my database design and indexes.
So what I do to iterate is:
tokens = g.V(h).out('distributes').values('address').toList()
finalList = []
for (token in tokens) {
    finalList.add(g.V(h).
        out('sent').has('address', token).
        out('received_by').as('a2').
        out('sent').has('address', token).
        out('received_by').as('a3').
        select('a3', 'a2').
        by('id').toList())
}
And this is what's stored in finalList at the end:
==>[[a3:0x2,a2:0x1]]
==>[]
==>[]
This works, but I was wondering how I can iterate over that token list this way without leaving Gremlin and without introducing that for loop. Also, my results contain empty lists, which is not optimal. The key here for me is to always be able to do that has('address',token) for each hop with the tokens that the Account node has ever sent. Thank you very much.
There is still uncertainty about what you are trying to achieve.
Nevertheless, I think this query does what you need:
g.V().has('Account', 'id', '0x0').as('a').
  out('distributes').values('address').as('t').
  select('a').
  repeat(out('sent').where(values('address').as('t')).
         out('received_by')).
  emit()
Example: https://gremlify.com/spwya4itlvd
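If you want to drive this from Python instead of the Groovy console, here is a hedged sketch of the same query using gremlinpython (assuming a Gremlin Server reachable at the example URL; untested against your schema):
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.graph_traversal import __

# Example connection URL; adjust for your server.
g = traversal().withRemote(
    DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

# Same pattern: collect each token address as 't', then filter every
# 'sent' hop against it inside the repeat.
results = (g.V().has('Account', 'id', '0x0').as_('a').
           out('distributes').values('address').as_('t').
           select('a').
           repeat(__.out('sent').where(__.values('address').as_('t')).
                  out('received_by')).
           emit().
           values('id').toList())
print(results)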

What's wrong with my filter query to figure out if a key is a member of a list (db.Key) property?

I'm having trouble retrieving a filtered list from google app engine datastore (using python for server side). My data entity is defined as the following
class Course_Table(db.Model):
    course_name = db.StringProperty(required=True, indexed=True)
    ....
    head_tags_1 = db.ListProperty(db.Key)
So the head_tags_1 property is a list of keys (which are the keys to a different entity called Headings_1).
In the handler below, I spin through my Course_Table entity to filter the courses that have a particular Headings_1 key as a member of the head_tags_1 property. However, it doesn't seem to retrieve anything, even though I know there is data there to fulfill the request, since it never prints the logs below when I iterate through the results of my query. Any ideas of what I'm doing wrong?
def get(self, level_num, h_key):
    path = []
    if level_num == "1":
        q = Course_Table.all().filter("head_tags_1 =", h_key)
        for each in q:
            logging.info('going through courses with this heading name')
            logging.info("course name filtered is %s ", each.course_name)
MANY MANY THANK YOUS
I assume h_key is the key of a Headings_1 entity. Since head_tags_1 is a list, I believed what you needed was the IN operator: https://developers.google.com/appengine/docs/python/datastore/queries
Note: your indentation inside the for loop does not seem correct.
My bad, apparently '=' on a list property already checks membership. Using = to check membership works for me; can you make sure h_key is really a datastore Key class?
Here is my example, the first get produces result, where the 2nd one is not
import webapp2
from google.appengine.ext import db

class Greeting(db.Model):
    author = db.StringProperty()
    x = db.ListProperty(db.Key)

class C(db.Model):
    name = db.StringProperty()

class MainPage(webapp2.RequestHandler):
    def get(self):
        ckey = db.Key.from_path('C', 'abc')
        dkey = db.Key.from_path('C', 'def')
        ekey = db.Key.from_path('C', 'ghi')
        Greeting(author='xxx', x=[ckey, dkey]).put()
        x = Greeting.all().filter('x =', ckey).get()
        self.response.write(x and x.author or 'None')
        x = Greeting.all().filter('x =', ekey).get()
        self.response.write(x and x.author or 'None')

app = webapp2.WSGIApplication([('/', MainPage)], debug=True)
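A hedged sketch of that check, reusing the names from the question: if h_key arrives from the URL route as a plain string, the old db API can reconstruct a Key from its url-safe string form with db.Key(...) before filtering.
from google.appengine.ext import db

# h_key as captured from the URL is a plain string; turn it back into a
# db.Key before filtering (str(key) produces this url-safe form).
key_obj = db.Key(h_key)
q = Course_Table.all().filter("head_tags_1 =", key_obj)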

English dictionary API that allows wild card look up

I would like to find a dictionary with an API that allows me to look up words that match a wildcard and a particular part of speech (noun/verb/adjective...); for example, give me a list of verbs that end with "ize".
I've been looking at WordNet, but it looks like it doesn't support wildcard lookup.
Thanks.
You can achieve this in two steps:
From a big list of words (English dictionary, such as Peter Norvig's word list) you can subset only those words that match your wildcards.
For those matching words, test their parts-of-speech to see if they match your target (Verbs, nouns etc.)
In my example, I use a very small list of words:
(Python)
import nltk
import re

# Replace with an English dictionary.
# Using a small list of words for illustration.
lst = ['swim', 'while', 'greet', 'prize', 'jeopardize', 'quartz', 'zebra']

def subset_words_by_wildcard(wordlist, pattern):
    matchingwords = []
    for w in wordlist:
        if re.search(pattern, w):
            matchingwords.append(w)
    return matchingwords

def subset_words_by_pos(words, pos):
    wpos = nltk.pos_tag(words)
    for w, p in wpos:
        if p == pos:
            print(w, p)

if __name__ == '__main__':
    pattern = r'ize$'
    #target_pos = "NN"
    target_pos = "VBP"
    mlist = subset_words_by_wildcard(lst, pattern)
    subset_words_by_pos(mlist, target_pos)
Running this produces:
>>> jeopardize VBP
Hope this helps.
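To swap in a real word list, a hedged sketch using NLTK's words corpus as the "big list" (assumes the corpus and tagger data can be downloaded; tagging words in isolation, without sentence context, is unreliable, so treat the POS filter as approximate):
import re
import nltk

nltk.download('words')                        # word list corpus
nltk.download('averaged_perceptron_tagger')   # POS tagger model

from nltk.corpus import words

word_list = words.words()  # roughly 236k English words
candidates = [w for w in word_list if re.search(r'ize$', w)]

tagged = nltk.pos_tag(candidates)
verbs = [w for w, p in tagged if p.startswith('VB')]
print(verbs[:20])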
