Using Map in PySpark to parse and assign column names


Here is what I am trying to do.
The input data looks like this (tab separated):
12/01/2018 user1 123.123.222.111 23.3s
12/01/2018 user2 123.123.222.116 21.1s
The data is coming in through Kafka and is being parsed with the following code.
kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
lines = kafkaStream.map(lambda x: x[1])
parsed_log = lines.flatMap(lambda line: line.split(" "))
.map(lambda item: ('key', {
'date': item['date'],
'user': item['user'],
'ip': item['ip'],
'duration': item['duration'],}))
The parsed logs should be in the following format:
('key', {'date': 12/01/2018, 'user': user1, 'ip': 123.123.222.111, 'duration': 23.3s})
('key', {'date': 12/01/2018, 'user': user2, 'ip': 123.123.222.116, 'duration': 21.1s})
In my code, the lines that define "lines" and "parsed_log" are not doing the job. Could you please let me know how to go about this?

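This is the solution. Use map rather than flatMap so that each line stays one record, split on the tab character because the input is tab separated, and index the resulting list by position:
lines = kafkaStream.map(lambda x: x[1])
variables_per_stream = lines.map(lambda line: line.split("\t"))
variable_to_key = variables_per_stream.map(lambda item: ('key', {'date': item[0], 'user': item[1], 'ip': item[2], 'duration': item[3]}))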
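The per-line parsing can be checked without a Spark cluster by moving it into a plain function. This is only a sketch: the name parse_line and the field order are assumptions taken from the sample input in the question, and the function would then be passed to map on the stream.

```python
# Hypothetical helper: field names follow the column order of the sample input.
FIELDS = ['date', 'user', 'ip', 'duration']


def parse_line(line):
    """Turn one tab-separated log line into a ('key', dict) pair."""
    values = line.strip().split('\t')
    return ('key', dict(zip(FIELDS, values)))


# With the DStream from the question this would be used as:
#     parsed_log = lines.map(parse_line)
sample = "12/01/2018\tuser1\t123.123.222.111\t23.3s"
print(parse_line(sample))
```

Keeping the parsing in a named function makes it easy to test on a single string before wiring it into the streaming job.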

Related

Company name extraction with bert-base-ner: easy way to know which words relate to which?

Hi I'm trying to extract the full company name from a string description about the company with bert-base-ner. I am also open to trying other methods but I couldn't really find one. The issue is that although it tags the orgs correctly, it tags it by word/token so I can't easily extract the full company name without having to concat and build it myself.
Is there an easier way or model to do this?
Here is my code:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
ner_results = nlp(text1)
print(ner_results)
Here is my output for one text string:
[{'entity': 'B-ORG', 'score': 0.99965024, 'index': 1, 'word': 'Orion', 'start': 0, 'end': 5}, {'entity': 'I-ORG', 'score': 0.99945647, 'index': 2, 'word': 'Metal', 'start': 6, 'end': 11}, {'entity': 'I-ORG', 'score': 0.99943095, 'index': 3, 'word': '##s', 'start': 11, 'end': 12}, {'entity': 'I-ORG', 'score': 0.99939036, 'index': 4, 'word': 'Limited', 'start': 13, 'end': 20}, {'entity': 'B-LOC', 'score': 0.9997398, 'index': 14, 'word': 'Australia', 'start': 78, 'end': 87}]

I faced a similar issue and solved it by switching to the "xlm-roberta-large-finetuned-conll03-english" model and passing aggregation_strategy="simple" to the pipeline. With that setting the pipeline groups the word pieces of each entity, so it returns the complete organization name rather than the broken pieces. The code below extracts the list of organizations from a PDF or Word document.
import time
import docx2txt
from pdfminer.high_level import extract_text
from transformers import pipeline
start = time.time()
model_checkpoint = "xlm-roberta-large-finetuned-conll03-english"
# aggregation_strategy="simple" merges sub-word tokens into whole entities
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
def text_extraction(file):
    """
    Extract text from either a PDF or a Word document.
    """
    if file.endswith(".pdf"):
        return extract_text(file)
    else:
        resume_text = docx2txt.process(file)
        if resume_text:
            return resume_text.replace('\t', ' ')
    return None
# Organisation names extraction
def org_name(file):
    # Extract the complete text of the document
    extracted_text = text_extraction(file)
    classifier = token_classifier(extracted_text)
    # Keep only the entities whose "entity_group" is "ORG"
    values = [item for item in classifier if item["entity_group"] == "ORG"]
    # Collect the text of each organisation entity
    res = [sub['word'] for sub in values]
    final1 = list(set(res))  # Remove duplicates
    final = list(filter(None, final1))  # Remove empty strings
    print(final)
org_name("your file name")
end = time.time()
print("The time of execution of above program is :", round((end - start), 2))
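If you would rather keep the token-level output from the question, the pieces can also be joined by hand using the start and end offsets. The following is a minimal sketch, not part of transformers: merge_entities is a hypothetical helper, and the sample text and the offsets of the last entity are made up for illustration because the original input string is not shown.

```python
def merge_entities(ner_results, text):
    """Merge token-level B-/I- tags into whole entities via character offsets."""
    spans = []
    for tok in ner_results:
        prefix, label = tok["entity"].split("-", 1)
        # An I- token continues the previous span when the label matches
        if prefix == "I" and spans and spans[-1]["label"] == label:
            spans[-1]["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    # Slice the original text so '##' word-piece markers never appear
    return [(s["label"], text[s["start"]:s["end"]]) for s in spans]


# Hypothetical input text; token dictionaries trimmed to the fields used here
text = "Orion Metals Limited is based in Australia"
ner_results = [
    {"entity": "B-ORG", "word": "Orion", "start": 0, "end": 5},
    {"entity": "I-ORG", "word": "Metal", "start": 6, "end": 11},
    {"entity": "I-ORG", "word": "##s", "start": 11, "end": 12},
    {"entity": "I-ORG", "word": "Limited", "start": 13, "end": 20},
    {"entity": "B-LOC", "word": "Australia", "start": 33, "end": 42},
]
merged = merge_entities(ner_results, text)
print(merged)
```

Slicing the source text by offsets avoids having to undo the word-piece splitting yourself.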

split a key-value pair in Python

I have a dictionary as follows:
{
"age": "76",
"Bank": "98310",
"Stage": "final",
"idnr": "4578",
"last number + Value": "[345:K]"}
I am trying to adjust the dictionary by splitting the last key-value pair and storing the result under a new key ('Total data'). It should look like this:
"Total data": [
    {
        "last number": "345",
        "Value": "K"
    }]
}
Does anyone know if there is a split function based on ':' and '+' or a for loop to accomplish this?
Thanks in advance.

One option is to take the last key of the dict, split the key on + and the value on : after removing the outer square brackets. This assumes the format of the data is always the same. Note that splitting on + alone leaves the surrounding spaces in the new keys; call strip() on each part if you want them removed.
If you want Total data to contain a list, you can wrap the resulting dict in [].
from pprint import pprint
d = {
    "age": "76",
    "Bank": "98310",
    "Stage": "final",
    "idnr": "4578",
    "last number + Value": "[345:K]"
}
last = list(d.keys())[-1]
d["Total data"] = dict(
    zip(
        last.strip().split('+'),
        d[last].strip("[]").split(':')
    )
)
pprint(d)
Output (tested with Python 3.9.4)
{'Bank': '98310',
'Stage': 'final',
'Total data': {' Value': 'K', 'last number ': '345'},
'age': '76',
'idnr': '4578',
'last number + Value': '[345:K]'}
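A variant that strips the whitespace from the new keys and wraps the result in a list, as the question's target format asks, could look like the sketch below. The function name split_last_pair is hypothetical, and it assumes the pair to split is always the last one in the dictionary and that the original pair should stay in place.

```python
def split_last_pair(data):
    """Return a copy of data with its last pair split into a 'Total data' list."""
    result = dict(data)
    last_key = list(result)[-1]
    # "last number + Value" -> ["last number", "Value"]
    names = [part.strip() for part in last_key.split('+')]
    # "[345:K]" -> ["345", "K"]
    values = [part.strip() for part in result[last_key].strip('[]').split(':')]
    result['Total data'] = [dict(zip(names, values))]
    return result


d = {
    "age": "76",
    "Bank": "98310",
    "Stage": "final",
    "idnr": "4578",
    "last number + Value": "[345:K]",
}
print(split_last_pair(d)["Total data"])
```

If the original pair should disappear, a `del result[last_key]` before the return would do it.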

How to find and print a dictionary key/value that matches user input?

I need to print a dictionary value that matches the input of the user. For example, if the user enters the course number CS101 the output will look like:
The details for CS101 are:
Room: 3004
Instructor: Haynes
Time: 8:00 a.m.
However, if the user enters an incorrect/invalid course number, I need to print out a message letting them know:
CS101 is an invalid course number.
I have tried if, for loops, and while loops. The problem is, every time I get the course info printed, the invalid course number message won't display because of KeyError. On the other hand, if I happen to "fix" the error message, then the course number info won't print out and instead will return a NameError / TypeError.
I will be honest, I have struggled for some time now with this, and I feel as though I am either assigning something incorrectly or printing incorrectly. But I am a beginner and I don't have a great grasp on Python yet, which is why I am asking for help.
Unfortunately, I am not allowed to create one entire dictionary to group everything in (which would have been easier for me), but instead, I have to create 3 dictionaries.
This is the code:
room = {}
room["CS101"] = "3004"
room["CS102"] = "4501"
room["CS103"] = "6755"
room["NT110"] = "1244"
room["CM241"] = "1411"
instructor = {}
instructor["CS101"] = "Haynes"
instructor["CS102"] = "Alvarado"
instructor["CS103"] = "Rich"
instructor["NT110"] = "Burkes"
instructor["CM241"] = "Lee"
time = {}
time["CS101"] = "8:00 a.m."
time["CS102"] = "9:00 a.m."
time["CS103"] = "10:00 a.m."
time["NT110"] = "11:00 a.m."
time["CM241"] = "1:00 p.m."
def info():
print(f'College Course Locater Program')
print(f'Enter a course number below to get information')
info()
get_course = input(f'Enter course number here: ')
print(f'----------------------------------------------')
course_num = get_course
number = course_num
name = course_num
meeting = course_num
if number in room:
if name in instructor:
if meeting in time:
print(f'The details for course {get_course} are: ')
print(f'Room: {number["room"]}')
print(f'Instructor: {name["instructor"]}')
print(f'Time: {meeting["time"]}')
else:
print(f'{course_num} is an invalid course number.')
I have also tried formatting dictionaries in this style:
time_dict = {
"CS101": {
"Time": "8:00 a.m."
},
"CS102": {
"Time": "9:00 a.m."
},
"CS103": {
"Time": "10:00 a.m."
},
"NT110": {
"Time": "11:00 a.m."
},
"CM241": {
"Time": "1:00 p.m."
},
}
Thanks in advance to everyone who has advice, an answer, or a suggestion for a solution.

This code here is unnecessary, because you are essentially setting 4 variables all to the same value get_course:
course_num = get_course
number = course_num
name = course_num
meeting = course_num
This code doesn't work because number, name and meeting are strings (the course number the user typed), not dictionaries, so indexing them with "room", "instructor" or "time" raises a TypeError:
print(f'Room: {number["room"]}')
print(f'Instructor: {name["instructor"]}')
print(f'Time: {meeting["time"]}')
I replaced the code above with this:
print(f'Room: {room[get_course]}')
print(f'Instructor: {instructor[get_course]}')
print(f'Time: {time[get_course]}')
This searches the dictionary variable room for the key get_course (ex. "CS101") and returns the value corresponding to that key. The same thing happens for the other lines, except with the dictionary instructor and the dictionary time.
Here is the final code:
room = {}
room["CS101"] = "3004"
room["CS102"] = "4501"
room["CS103"] = "6755"
room["NT110"] = "1244"
room["CM241"] = "1411"
instructor = {}
instructor["CS101"] = "Haynes"
instructor["CS102"] = "Alvarado"
instructor["CS103"] = "Rich"
instructor["NT110"] = "Burkes"
instructor["CM241"] = "Lee"
time = {}
time["CS101"] = "8:00 a.m."
time["CS102"] = "9:00 a.m."
time["CS103"] = "10:00 a.m."
time["NT110"] = "11:00 a.m."
time["CM241"] = "1:00 p.m."
def info():
    print(f'College Course Locater Program')
    print(f'Enter a course number below to get information')
info()
get_course = input(f'Enter course number here: ')
print(f'----------------------------------------------')
if get_course in room and get_course in instructor and get_course in time:
    print(f'The details for course {get_course} are: ')
    print(f'Room: {room[get_course]}')
    print(f'Instructor: {instructor[get_course]}')
    print(f'Time: {time[get_course]}')
else:
    print(f'{get_course} is an invalid course number.')
Here is a test with the input "CS101":
College Course Locater Program
Enter a course number below to get information
Enter course number here: CS101
----------------------------------------------
The details for course CS101 are:
Room: 3004
Instructor: Haynes
Time: 8:00 a.m.
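The same lookup can be wrapped in a function that returns the lines instead of printing them, which makes both the valid and the invalid case easy to check. This is a sketch only: course_details is a hypothetical name, the dictionaries are written as literals for brevity, and normalising the input with strip() and upper() is an assumption that goes beyond the original program.

```python
room = {"CS101": "3004", "CS102": "4501", "CS103": "6755",
        "NT110": "1244", "CM241": "1411"}
instructor = {"CS101": "Haynes", "CS102": "Alvarado", "CS103": "Rich",
              "NT110": "Burkes", "CM241": "Lee"}
time = {"CS101": "8:00 a.m.", "CS102": "9:00 a.m.", "CS103": "10:00 a.m.",
        "NT110": "11:00 a.m.", "CM241": "1:00 p.m."}


def course_details(course):
    """Return the lines to print for a course number."""
    # Assumption: stray spaces and lower-case input should still match
    course = course.strip().upper()
    if course in room and course in instructor and course in time:
        return [
            f'The details for course {course} are:',
            f'Room: {room[course]}',
            f'Instructor: {instructor[course]}',
            f'Time: {time[course]}',
        ]
    return [f'{course} is an invalid course number.']


print('\n'.join(course_details('cs101')))
print('\n'.join(course_details('CS999')))
```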

You could also do it like this, if a single nested dictionary is allowed. It takes less code: one lookup inside try/except replaces the three membership checks.
course_info = {
    'CS101': {
        'Room': '3004',
        'Instructor': 'Haynes',
        'Time': '8:00 a.m.'
    },
    'CS102': {
        'Room': '4501',
        'Instructor': 'Alvarado',
        'Time': '9:00 a.m.'
    },
    'CS103': {
        'Room': '6755',
        'Instructor': 'Rich',
        'Time': '10:00 a.m.'
    },
    'NT110': {
        'Room': '1244',
        'Instructor': 'Burkes',
        'Time': '11:00 a.m.'
    },
    'CM241': {
        'Room': '1411',
        'Instructor': 'Lee',
        'Time': '1:00 p.m.'
    },
}
get_course = input('Enter a course number: ')
try:
    courses = course_info[get_course]
    print(f'The details for course {get_course} are: ')
    print(f"Room: {courses['Room']}, Time: {courses['Time']}, "
          f"Instructor: {courses['Instructor']}")
except KeyError:
    print(f'Details not found for {get_course}')

Filter Dictionary using key and Value in list in Python

I am new to Python and am trying to filter a dictionary out of a list by key and value. I am using Python 3.
[{'ctime': 1459426422, 'accesskey': 'xxxxxx', 'secretkey': 'xxxx', 'id': 4, 'fsname': '/mnt/cdrom1', 'name': '/mnt/cdrom1:test1'}, {'ctime': 1459326975, 'accesskey': 'xxxx', 'secretkey': 'xxxx', 'id': 1, 'fsname': '/mnt/cdrom2', 'name': '/mnt/cdrom2:test2'}]
From the above output, I need to filter out the dictionary with the key-value pair 'name': '/mnt/cdrom2:test2', so that I get the filtered dictionary:
{'ctime': 1459326975, 'accesskey': 'xxxx', 'secretkey': 'xxxx', 'id': 1, 'fsname': '/mnt/cdrom2', 'name': '/mnt/cdrom2:test2'}
I can then later extract keys and values as needed.
Thanks.

If your list is in a variable named "my_list":
Python 2.7.x version
for d in my_list:
    if d['name'] == '/mnt/cdrom2:test2':
        print d['name']  # or do whatever you want here
Python 3 version
for d in my_list:
    if d['name'] == '/mnt/cdrom2:test2':
        print(d['name'])  # or do whatever you want here
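To get the whole matching dictionary back rather than printing inside the loop, a generator expression with next() is one option in Python 3. This is a minimal sketch; find_by is a hypothetical helper name, and it returns None when nothing matches.

```python
my_list = [
    {'ctime': 1459426422, 'accesskey': 'xxxxxx', 'secretkey': 'xxxx',
     'id': 4, 'fsname': '/mnt/cdrom1', 'name': '/mnt/cdrom1:test1'},
    {'ctime': 1459326975, 'accesskey': 'xxxx', 'secretkey': 'xxxx',
     'id': 1, 'fsname': '/mnt/cdrom2', 'name': '/mnt/cdrom2:test2'},
]


def find_by(items, key, value):
    """Return the first dictionary whose `key` equals `value`, or None."""
    # d.get() avoids a KeyError for dictionaries that lack the key
    return next((d for d in items if d.get(key) == value), None)


match = find_by(my_list, 'name', '/mnt/cdrom2:test2')
print(match)
```

If several dictionaries can match, a list comprehension with the same condition returns all of them.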

how to print recursively a Python dictionary and its subdictionaries with whitespace alignment into columns

I want to create a function that can take a dictionary of dictionaries such as the following
information = {
    "sample information": {
        "ID": 169888,
        "name": "ttH",
        "number of events": 124883,
        "cross section": 0.055519,
        "k factor": 1.0201,
        "generator": "pythia8",
        "variables": {
            "trk_n": 147,
            "zappo_n": 9001
        }
    }
}
and then print it in a neat way such as the following, with alignment of keys and values using whitespace:
sample information:
   ID:               169888
   name:             ttH
   number of events: 124883
   cross section:    0.055519
   k factor:         1.0201
   generator:        pythia8
   variables:
      trk_n:   147
      zappo_n: 9001
My attempt at the function is the following:
def printDictionary(
    dictionary = None,
    indentation = ''
    ):
    for key, value in dictionary.iteritems():
        if isinstance(value, dict):
            print("{indentation}{key}:".format(
                indentation = indentation,
                key = key
            ))
            printDictionary(
                dictionary = value,
                indentation = indentation + ' '
            )
        else:
            print(indentation + "{key}: {value}".format(
                key = key,
                value = value
            ))
It produces the output like the following:
sample information:
name: ttH
generator: pythia8
cross section: 0.055519
variables:
zappo_n: 9001
trk_n: 147
number of events: 124883
k factor: 1.0201
ID: 169888
As shown, it successfully prints the dictionary of dictionaries recursively; however, it does not align the values into a neat column. What would be a reasonable way of doing this for dictionaries of arbitrary depth?

Try using the pprint module. Instead of writing your own function, you can do this:
import pprint
pprint.pprint(my_dict)
Be aware that this will print characters such as { and } around your dictionary and [] around your lists, but if you can ignore them, pprint() will take care of all the nesting and indentation for you.
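If the braces are not acceptable, the column alignment can be done by padding each key to the longest key at the same nesting level. The sketch below is for Python 3 and is not a pprint feature: format_dictionary and its step parameter are hypothetical names, and it aligns each nesting level separately rather than the whole tree.

```python
def format_dictionary(dictionary, indentation="", step="   "):
    """Return lines in which the values of each level line up in a column."""
    lines = []
    # Longest "key:" label among the non-dictionary values at this level
    width = max(
        (len(str(k)) + 1 for k, v in dictionary.items() if not isinstance(v, dict)),
        default=0,
    )
    for key, value in dictionary.items():
        if isinstance(value, dict):
            lines.append(f"{indentation}{key}:")
            lines.extend(format_dictionary(value, indentation + step, step))
        else:
            label = f"{key}:"
            lines.append(f"{indentation}{label:<{width}} {value}")
    return lines


information = {
    "sample information": {
        "ID": 169888,
        "name": "ttH",
        "number of events": 124883,
        "cross section": 0.055519,
        "k factor": 1.0201,
        "generator": "pythia8",
        "variables": {"trk_n": 147, "zappo_n": 9001},
    }
}
lines = format_dictionary(information)
print("\n".join(lines))
```

Returning the lines instead of printing them directly keeps the function easy to test, and dictionaries keep insertion order in Python 3.7 and later, so the keys appear in the order they were defined.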