Regular expression to extract a pattern from a DataFrame column in Python

I have this column:
array(['BCR-ABL (translocation) [HSA:25] [KO:K06619]\n MLL-AF4 (translocation) [HSA:4297 4299] [KO:K09186 K15184]\n E2A-PBX1 (translocation) [HSA:6929 5087] [KO:K09063 K09355]\n TEL-AML1 (translocation) [HSA:861] [KO:K08367]\n c-MYC (rearrangement) [HSA:4609] [KO:K04377]\n CRLF2 (rearrangement) [HSA:64109] [KO:K05078]\n PAX5 (rearrangement) [HSA:5079] [KO:K09383]',
'NOTCH1 (mutation) [HSA:4851] [KO:K02599]\n TAL1 (overexpression) [HSA:6886] [KO:K09068]\n LYL1 (expression) [HSA:4066] [KO:K15604]\n MLL-ENL (translocation) [HSA:4297] [KO:K09186]\n HOX11 (translocation) [HSA:3195] [KO:K09340]\n MYC (translocation) [HSA:4609] [KO:K04377]\n LMO2 (translocation) [HSA:4005] [KO:K15612]\n HOX11L2 (translocation) [HSA:30012] [KO:K15607]\n PICALM-MLLT10 (translocation) [HSA:8028] [KO:K23588]',
'FLT3 (mutation) [HSA:2322] [KO:K05092]\n c-KIT (mutation) [HSA:3815] [KO:K05091]\n N-ras (mutation) [HSA:4893] [KO:K07828]\n K-ras (mutation) [HSA:3845] [KO:K07827]\n PML-RARalpha (translocation) [HSA:5371] [KO:K10054]\n AML1-ETO (translocation) [HSA:861] [KO:K08367]\n PLZF-RARalpha (translocation) [HSA:7704] [KO:K10055]\n AML1 (mutation) [HSA:861] [KO:K08367]\n C/EBPalpha (mutation) [HSA:1050] [KO:K09055]\n PU.1 (mutation) [HSA:6688] [KO:K09438]',
..., nan,
'(OPDM1) LRP12 [HSA:29967] [KO:K20050]\n (OPDM2) GIPC1 [HSA:10755] [KO:K20056]',
'IGSF3 [HSA:3321] [KO:K06522]'], dtype=object)
I need to extract the words or numbers inside the square brackets [] and the parentheses () and put each of them in a new column.
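A minimal sketch of one way to do this, using only the standard `re` module. It assumes every bracketed group has the form `[KEY:values]` with space-separated values, as in the sample above; the names `parse_cell`, `PAREN_RE` and `BRACKET_RE` are made up for illustration.

```python
import re

# Text inside parentheses, e.g. "(translocation)" or "(OPDM1)".
PAREN_RE = re.compile(r"\(([^)]*)\)")
# Bracketed "KEY:values" groups, e.g. "[HSA:4297 4299]" or "[KO:K09186 K15184]".
BRACKET_RE = re.compile(r"\[(\w+):([^\]]*)\]")


def parse_cell(cell):
    """Return one dict per line of a cell; non-string cells such as NaN give []."""
    if not isinstance(cell, str):
        return []
    rows = []
    for line in cell.split("\n"):
        line = line.strip()
        if not line:
            continue
        row = {"paren": PAREN_RE.findall(line)}
        for key, values in BRACKET_RE.findall(line):
            row[key] = values.split()
        rows.append(row)
    return rows


# Two lines taken from the first cell of the column shown above.
sample = (
    "BCR-ABL (translocation) [HSA:25] [KO:K06619]\n"
    " MLL-AF4 (translocation) [HSA:4297 4299] [KO:K09186 K15184]"
)
parsed = parse_cell(sample)
```

With pandas, the same function could be applied per cell with `Series.apply` and the resulting lists expanded with `Series.explode`; how the pieces map to new columns depends on the layout you want.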

Related

Company name extraction with bert-base-ner: easy way to know which words relate to which?

Hi, I'm trying to extract the full company name from a text description of the company with bert-base-ner. I am also open to other methods, but I couldn't find one. The model tags the organisations correctly, but it tags them per word/token, so I can't get the full company name without concatenating the pieces myself.
Is there an easier way, or a model that does this?
Here is my code:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
ner_results = nlp(text1)
print(ner_results)
Here is my output for one text string:
[{'entity': 'B-ORG', 'score': 0.99965024, 'index': 1, 'word': 'Orion', 'start': 0, 'end': 5}, {'entity': 'I-ORG', 'score': 0.99945647, 'index': 2, 'word': 'Metal', 'start': 6, 'end': 11}, {'entity': 'I-ORG', 'score': 0.99943095, 'index': 3, 'word': '##s', 'start': 11, 'end': 12}, {'entity': 'I-ORG', 'score': 0.99939036, 'index': 4, 'word': 'Limited', 'start': 13, 'end': 20}, {'entity': 'B-LOC', 'score': 0.9997398, 'index': 14, 'word': 'Australia', 'start': 78, 'end': 87}]
I faced a similar issue and solved it by using a model called "xlm-roberta-large-finetuned-conll03-english", which is much better than the one you're using right now and returns the complete organisation name rather than the broken pieces. Note that the aggregation_strategy="simple" argument in the pipeline call below is what groups word pieces into whole entities. Feel free to test the code below, which extracts the full list of organisations from a document. Please accept my answer by clicking the tick button if you find it useful.
from transformers import pipeline
from pdfminer.high_level import extract_text
import docx2txt
import spacy
from spacy.matcher import Matcher
import time

start = time.time()
# spaCy objects from the original answer; org_name below does not use them
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
model_checkpoint = "xlm-roberta-large-finetuned-conll03-english"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

def text_extraction(file):
    """
    To extract texts from both pdf and word
    """
    if file.endswith(".pdf"):
        return extract_text(file)
    else:
        resume_text = docx2txt.process(file)
        if resume_text:
            return resume_text.replace('\t', ' ')
        return None

# Organisation names extraction
def org_name(file):
    # Extract the complete text in the resume
    extracted_text = text_extraction(file)
    classifier = token_classifier(extracted_text)
    # Get the list of dictionaries with key value pair "entity_group": 'ORG'
    values = [item for item in classifier if item["entity_group"] == "ORG"]
    # Keep only the text of each ORG entity
    res = [sub['word'] for sub in values]
    final1 = list(set(res))  # Remove duplicates
    final = list(filter(None, final1))  # Remove empty strings
    print(final)

org_name("your file name")
end = time.time()
print("The time of execution of above program is :", round((end - start), 2))
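If switching models is not an option, the token-level output shown in the question can also be merged by hand. The sketch below is one possible approach, not part of the answer above: it joins each B- tag with the I- tags that follow it and slices the original text by character offsets, so word pieces such as `##s` need no special handling. The function name `merge_entities` and the sample text are hypothetical, and the code assumes the results are ordered by position.

```python
def merge_entities(ner_results, text):
    """Group B-/I- tagged tokens into whole entities using character offsets."""
    spans = []
    for tok in ner_results:
        prefix, _, label = tok["entity"].partition("-")
        if prefix == "I" and spans and spans[-1]["label"] == label:
            # Continuation of the previous entity: extend its end offset.
            spans[-1]["end"] = tok["end"]
        else:
            # A B- tag (or an I- tag with nothing to continue) starts a new entity.
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    return [(s["label"], text[s["start"]:s["end"]]) for s in spans]


# Hypothetical sample text; the tokens mirror the shape of the pipeline output.
text = "Orion Metals Limited is based in Australia"
tokens = [
    {"entity": "B-ORG", "word": "Orion", "start": 0, "end": 5},
    {"entity": "I-ORG", "word": "Metal", "start": 6, "end": 11},
    {"entity": "I-ORG", "word": "##s", "start": 11, "end": 12},
    {"entity": "I-ORG", "word": "Limited", "start": 13, "end": 20},
    {"entity": "B-LOC", "word": "Australia", "start": 33, "end": 42},
]
entities = merge_entities(tokens, text)
```

Here `entities` holds one `(label, text)` pair per entity, so the organisation comes back as a single string.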

How to get rid of the extra None item in a nested list from TextFSM parsing

Environment:
textfsm: 1.1.2
python: 3.9.6
os: windows 10
textFSM template:
Value Filldown policy_name (\S+)
Value Required name (\S+)
Value police_rate (\d+( ms|pps)?)
Value peak_rate (\d+( pps)?)
Value police_burst (\d+( ms|packets)?)
Value peak_burst (\d+( ms|packets)?)
Value police_burst_ex (\d+( ms)?)
Value List conform_action (set|transmit|drop)
Value List conform_action_set_value (.*)
Value List exceed_action (set|transmit|drop)
Value List exceed_action_set_value (.*)
Value List violate_action (set|transmit|drop)

Start
  ^policy.map\s(?:${policy_name})? -> PolicyClass

PolicyClass
  ^ class\s${name}
  ^\s+police cir (percent )?${police_rate}(( bc)? ${police_burst}(( be)? ${police_burst_ex})?)?( pir (percent )?${peak_rate}( be ${peak_burst})?)?(\s+conform-action ${conform_action}(\s+exceed-action ${exceed_action}(\s+violate-action ${violate_action})?)?)?
  ^\s+conform-action ${conform_action}(-${conform_action_set_value})?
  ^\s+exceed-action ${exceed_action}(-${exceed_action_set_value})?
  ^\s+violate-action ${violate_action}
  ^ ! -> Record
config for parsing:
policy-map INDEPENDENTFIBRENETWORKSLTD348569-G0/1/1:12-Ethernet-IngressQoS-Template1-Standard
!
class CIR_BPS_CONFIG_1
police cir 1000000000 bc 12500000 pir 1800000000 be 12500000
conform-action set-mpls-exp-imposition-transmit 3
conform-action set-qos-transmit 3
conform-action set-discard-class-transmit 1
exceed-action set-mpls-exp-imposition-transmit 1
exceed-action set-discard-class-transmit 0
exceed-action set-qos-transmit 1
violate-action drop
!
!
textFSM parser result:
[['INDEPENDENTFIBRENETWORKSLTD348569-G0/1/1:12-Ethernet-IngressQoS-Template1-Standard', 'CIR_BPS_CONFIG_1', '1000000000', '1800000000', '12500000', '12500000', '', [None, 'set', 'set', 'set'], ['mpls-exp-imposition-transmit 3', 'qos-transmit 3', 'discard-class-transmit 1'], [None, 'set', 'set', 'set'], ['mpls-exp-imposition-transmit 1', 'discard-class-transmit 0', 'qos-transmit 1'], [None, 'drop']]]
As you can see, the conform_action, exceed_action and violate_action columns each have an extra None item.
How can I get rid of it without a post-parsing step?
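One plausible source of the None entries, offered as an assumption rather than a confirmed diagnosis: the `police cir` rule wraps the three action values in optional groups, and the `police cir` line in the config carries no inline actions. In Python's `re`, a named group inside an optional group that does not participate in the match yields None, which a List value would then collect. The simplified stand-in below (not the real template) shows that behaviour with plain `re`.

```python
import re

# Simplified stand-in for the 'police cir' rule: the conform-action part is optional.
police_rule = re.compile(
    r"^\s+police cir (?P<police_rate>\d+)"
    r"(\s+conform-action (?P<conform_action>set|transmit|drop))?"
)

m = police_rule.match("  police cir 1000000000 bc 12500000")
police_rate = m.group("police_rate")
# The optional group did not take part in the match, so its named group is None.
conform_action = m.group("conform_action")
```

If that is the cause, keeping the List values out of optional groups, for example by using a separate rule for `police cir` lines that carry inline actions, may avoid the extra item. That suggestion is untested against this template.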

How to Add multiple nested Dictionaries

dicta={'name': 'C','children': {'name': 'testA','children': {'name': 'test_file'}}}
dictb={'name': 'C','children': {'name': 'testA','children': {'name': 'test_fileB','children': {'name': 'test_file'}}}}
dictc={'name': 'C','children':[{"name":"testA","children":[{"name":"test_file"},{'name': 'test_fileB','children': {'name': 'test_file'}}]}]}
I want to use dicta and dictb to get dictc, but I don't know how.
You can define a function to merge dictionaries and call it recursively:
def merge(dict1, dict2):
    result = {**dict1}
    if 'children' in dict1 and 'children' in dict2:
        if dict1['children']['name'] == dict2['children']['name']:
            # Same child name: merge the two subtrees into one child
            result['children'] = [merge(dict1['children'], dict2['children'])]
        else:
            # Different child names: keep both as siblings
            result['children'] = [dict1['children'], dict2['children']]
    elif 'children' in dict1:
        result['children'] = [dict1['children']]
    elif 'children' in dict2:
        result['children'] = [dict2['children']]
    # If neither has children, result already has no 'children' key
    return result
dictc = merge(dicta, dictb)
You didn't provide many details on how exactly the merge should work, but this example does produce the dictc you want. You may need to tweak it for your needs.

Filter Dictionary using key and Value in list in Python

I am new to Python and am trying to pick one dictionary out of a list by a key and its value. I am using Python 3.
[{'ctime': 1459426422, 'accesskey': 'xxxxxx', 'secretkey': 'xxxx', 'id': 4, 'fsname': '/mnt/cdrom1', 'name': '/mnt/cdrom1:test1'}, {'ctime': 1459326975, 'accesskey': 'xxxx', 'secretkey': 'xxxx', 'id': 1, 'fsname': '/mnt/cdrom2', 'name': '/mnt/cdrom2:test2'}]
From the above output, I need to select the dictionary with the key/value pair 'name': '/mnt/cdrom2:test2', so that I get the filtered dictionary:
{'ctime': 1459326975, 'accesskey': 'xxxx', 'secretkey': 'xxxx', 'id': 1, 'fsname': '/mnt/cdrom2', 'name': '/mnt/cdrom2:test2'}
I can then later extract keys and values as needed.
Thanks.
If your list above is in a variable named "my_list":
Python 2.7.x version
for d in my_list:
    if d['name'] == '/mnt/cdrom2:test2':
        print d['name']  # or do whatever you want here
Python 3 version
for d in my_list:
    if d['name'] == '/mnt/cdrom2:test2':
        print(d['name'])  # or do whatever you want here
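Since the question asks for the matching dictionary itself, here is a short sketch of an alternative that returns it instead of printing a field. It uses the built-in `next` with a generator expression and a default of None when nothing matches; the variable names `match` and `missing` are illustrative.

```python
my_list = [
    {'ctime': 1459426422, 'accesskey': 'xxxxxx', 'secretkey': 'xxxx', 'id': 4,
     'fsname': '/mnt/cdrom1', 'name': '/mnt/cdrom1:test1'},
    {'ctime': 1459326975, 'accesskey': 'xxxx', 'secretkey': 'xxxx', 'id': 1,
     'fsname': '/mnt/cdrom2', 'name': '/mnt/cdrom2:test2'},
]

# First dictionary whose 'name' matches, or None if there is no such entry.
match = next((d for d in my_list if d.get('name') == '/mnt/cdrom2:test2'), None)

# A name that is not in the list gives the default instead of raising.
missing = next((d for d in my_list if d.get('name') == '/mnt/none'), None)
```

Using `d.get('name')` rather than `d['name']` means entries without a 'name' key are skipped instead of raising KeyError.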

How to update a Python dictionary with a reference dictionary the Pythonic way?

I think this is pretty straightforward. All I am trying to do is replace the 'code' value in each original dictionary with the matching value from a reference dictionary. I get the feeling that the two for loops and the if statement can be shortened. In my actual problem, I have a few thousand dicts to update. Thanks guys!
Python:
referencedict = {'A': 'abc', 'B': 'xyz'}
mylistofdict = [{'name': 'John', 'code': 'A', 'age': 28}, {'name': 'Mary', 'code': 'B', 'age': 32}, {'name': 'Joe', 'code': 'A', 'age': 43}]
for eachdict in mylistofdict:
    for key, value in eachdict.items():
        if key == 'code':
            eachdict[key] = referencedict[value]
print mylistofdict
Output:
[{'age': 28, 'code': 'abc', 'name': 'John'}, {'age': 32, 'code': 'xyz', 'name': 'Mary'}, {'age': 43, 'code': 'abc', 'name': 'Joe'}]
There is no need to loop over all values of eachdict, just look up code directly:
for eachdict in mylistofdict:
    if 'code' not in eachdict:
        continue
    eachdict['code'] = referencedict[eachdict['code']]
You can probably omit the test for code being present, your example list always contains a code entry, but I thought it better to be safe. Looking up the code in the referencedict structure assumes that all possible codes are available.
I used if 'code' not in eachdict: continue here; the opposite is just as valid (if 'code' in eachdict), but this way you can more easily remove the line if you do not need it, and you save yourself an indent level.
referencedict = {'A': 'abc', 'B': 'xyz'}
mylistofdict = [{'name': 'John', 'code': 'A', 'age': 28}, {'name': 'Mary', 'code': 'B', 'age': 32}, {'name': 'Joe', 'code': 'A', 'age': 43}]
for x in mylistofdict:
    try:
        x['code'] = referencedict.get(x['code'])
    except KeyError:
        pass
print(mylistofdict)
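The two answers differ in how they treat a code that is absent from referencedict: the first raises KeyError, the second stores None. A third option, sketched here as one possible choice rather than the intended behaviour, keeps the original code when no replacement exists. The entries for 'Ann' and the code-less 'Joe' are hypothetical additions to exercise those cases.

```python
referencedict = {'A': 'abc', 'B': 'xyz'}
mylistofdict = [
    {'name': 'John', 'code': 'A', 'age': 28},
    {'name': 'Mary', 'code': 'B', 'age': 32},
    {'name': 'Ann', 'code': 'Z', 'age': 51},  # hypothetical entry with an unknown code
    {'name': 'Joe', 'age': 43},               # hypothetical entry with no 'code' key
]

for d in mylistofdict:
    if 'code' in d:
        # Fall back to the existing value when the code has no replacement.
        d['code'] = referencedict.get(d['code'], d['code'])
```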
