I have some data that is in almost-JSON format, but not quite. I'm trying to convert it from JSON using jsonlite in R. Here is a sample of data:
field
{'email': {'name': 'Bob Smith', 'address': 'bob_smith#blah.com'}}
{'email': {'name': "Sally O'Mally", 'address': 'sally_omally#blah.com'}}
{'email': {'name': 'Sam Daniels', 'address': '"some text"<sam_daniels#xyz.com>'}}
{'email': {'name': "Johnson', Alan", 'address': 'alan.johnson#abc.com'}}
What I want to do is strip out all of the quotation marks (both single and double) that are inside of the main quotations. The data would then look like this:
field
{'email': {'name': 'Bob Smith', 'address': 'bob_smith#blah.com'}}
{'email': {'name': "Sally OMally", 'address': 'sally_omally#blah.com'}}
{'email': {'name': 'Sam Daniels', 'address': 'some text<sam_daniels#xyz.com>'}}
{'email': {'name': "Johnson, Alan", 'address': 'alan.johnson#abc.com'}}
After that, I can handle converting the single quotes to double quotes using stringr and convert from JSON.
Any suggestions?
This is the error I currently get when trying to convert the original data from JSON:
> json_test2 <-
+ json_test %>%
+ dplyr::mutate(
+ field2 = map(field, ~ fromJSON(.) %>% as.data.frame())
+ )
Error: lexical error: invalid char in json text.
{'email': {'name': 'Bob S
(right here) ------^
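For comparison, here is a minimal sketch in Python rather than R. It assumes each row is a valid Python dict literal (which holds for the four sample rows), so ast.literal_eval can read it directly; strip_quotes is a hypothetical helper name, not part of any library. It removes the inner quotation marks from every string value and then emits real JSON:

```python
import ast
import json

# The four sample rows from the question, kept verbatim.
rows = [
    """{'email': {'name': 'Bob Smith', 'address': 'bob_smith#blah.com'}}""",
    """{'email': {'name': "Sally O'Mally", 'address': 'sally_omally#blah.com'}}""",
    """{'email': {'name': 'Sam Daniels', 'address': '"some text"<sam_daniels#xyz.com>'}}""",
    """{'email': {'name': "Johnson', Alan", 'address': 'alan.johnson#abc.com'}}""",
]


def strip_quotes(value):
    """Recursively remove single and double quotes from string values (hypothetical helper)."""
    if isinstance(value, dict):
        return {key: strip_quotes(val) for key, val in value.items()}
    if isinstance(value, str):
        return value.replace("'", "").replace('"', "")
    return value


# Parse each row as a Python literal, clean it, and serialize it as valid JSON.
cleaned = [json.dumps(strip_quotes(ast.literal_eval(row))) for row in rows]

for line in cleaned:
    print(line)
```

The resulting strings use double quotes throughout, so a JSON parser such as jsonlite should accept them. Rows that are not valid Python literals would raise an error in ast.literal_eval and need separate handling.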
I have a column with strings in dictionary format, like this:
{'district': 'Ilha do Retiro', 'city': 'RECIFE', 'state': 'PE', 'country': 'BR', 'latitude': -8.062004, 'longitude': -34.908081, 'timezone': 'Etc/GMT+3', 'zipCode': '50830000', 'streetName': 'Avenida Engenheiro Abdias de Carvalho', 'streetNumber': '365'}
I want to create one column for each key, filled with the value for each row.
I tried using separate with regex:
separate(data, address, into = c('district', 'city', 'state', 'country',
'latitude', 'longitude', 'timezone',
'zipCode', 'streetName', 'streetNumber'),
"/:.*?/,")
However this is not working, and perhaps using separate with regex is not even the best course of action for this. Any suggestions on how to do it?
Since it is effectively JSON (though with single-quotes instead of the requisite double-quotes), we can use jsonlite:
txt <- "{'district': 'Ilha do Retiro', 'city': 'RECIFE', 'state': 'PE', 'country': 'BR', 'latitude': -8.062004, 'longitude': -34.908081, 'timezone': 'Etc/GMT+3', 'zipCode': '50830000', 'streetName': 'Avenida Engenheiro Abdias de Carvalho', 'streetNumber': '365'}"
as.data.frame(jsonlite::parse_json(gsub("'", '"', txt)))
# district city state country latitude longitude timezone zipCode streetName streetNumber
# 1 Ilha do Retiro RECIFE PE BR -8.062004 -34.90808 Etc/GMT+3 50830000 Avenida Engenheiro Abdias de Carvalho 365
When using requests.get('https://browse.search.hereapi.com/v1/browse?apiKey=' + YOUR_API_KEY + '&at=53.544348,-113.500571&circle:46.827727",-114.000519,r=3000&limit=10&categories=800-8200-0174') I get a response that shows Canadian postal codes - but only the first 3 characters.
For example, I get this data:
{'title': 'Boyle Street Education Centre', 'id': 'here:pds:place:124c3x29-d6c9cbd3d53a4758b8c953132db92244', 'resultType': 'place', 'address': {'label': 'Boyle Street Education Centre, 10312 105 St NW, Edmonton, AB T5J, Canada', 'countryCode': 'CAN', 'countryName': 'Canada', 'stateCode': 'AB', 'state': 'Alberta', 'county': 'Alberta', 'city': 'Edmonton', 'district': 'Downtown', 'street': '105 St NW', 'postalCode': 'T5J', 'houseNumber': '10312'}, 'position': {'lat': 53.54498, 'lng': -113.5016}, 'access': [{'lat': 53.54498, 'lng': -113.50105}], 'distance': 98, 'categories': [{'id': '800-8200-0174', 'name': 'School', 'primary': True}, {'id': '800-8200-0295', 'name': 'Training & Development'}], 'references': [{'supplier': {'id': 'core'}, 'id': '36335982'}, {'supplier': {'id': 'yelp'}, 'id': 'r3BvVKqluzrZeae9FE4tAw'}], 'contacts': [{'phone': [{'value': '+17804281420'}], 'fax': [{'value': '(780) 429-1458', 'categories': [{'id': '800-8200-0174'}]}], 'www': [{'value': 'http://www.bsec.ab.ca', 'categories': [{'id': '800-8200-0174'}]}]}], 'openingHours': [{'categories': [{'id': '800-8200-0174'}], 'text': ['Mon-Sat: 09:00 - 17:00', 'Sun: 10:00 - 16:00'], 'isOpen': False, 'structured': [{'start': 'T090000', 'duration': 'PT08H00M', 'recurrence': 'FREQ:DAILY;BYDAY:MO,TU,WE,TH,FR,SA'}, {'start': 'T100000', 'duration': 'PT06H00M', 'recurrence': 'FREQ:DAILY;BYDAY:SU'}]}]}
Notice that the postal code listed is "T5J". This is incorrect. Canadian postal codes are 3 characters, a space, and then 3 more characters. I'm guessing this is a parsing error that occurred when the data was captured. The correct postal code is "T5J 1E6".
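To spot such entries in bulk, a small check along these lines might help. This is only a sketch: classify_postal_code and the two pattern names are made up for illustration, and it assumes the usual letter-digit-letter, space, digit-letter-digit layout of Canadian postal codes.

```python
import re

# Full code: letter-digit-letter, a space, then digit-letter-digit (e.g. "T5J 1E6").
FULL_POSTAL_CODE = re.compile(r"^[A-Z]\d[A-Z] \d[A-Z]\d$")
# Only the first three characters (e.g. "T5J").
FIRST_HALF_ONLY = re.compile(r"^[A-Z]\d[A-Z]$")


def classify_postal_code(code):
    """Return 'full', 'truncated', or 'unrecognized' for a postal code string."""
    if FULL_POSTAL_CODE.match(code):
        return "full"
    if FIRST_HALF_ONLY.match(code):
        return "truncated"
    return "unrecognized"


# A cut-down item shaped like the response shown above.
item = {"address": {"postalCode": "T5J"}}
print(classify_postal_code(item["address"]["postalCode"]))  # truncated
print(classify_postal_code("T5J 1E6"))                      # full
```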
Yes, HERE has a tool for modifying POI address information.
I have reported this POI's postal code so that it gets updated to "T5J 1E6".
Please see the web tool below:
https://mapcreator.here.com/place:124c3x29-d6c9cbd3d53a4758b8c953132db92244/?l=53.5450,-113.5016,18,normal
Thank you!
dicta={'name': 'C','children': {'name': 'testA','children': {'name': 'test_file'}}}
dictb={'name': 'C','children': {'name': 'testA','children': {'name': 'test_fileB','children': {'name': 'test_file'}}}}
dictc={'name': 'C','children':[{"name":"testA","children":[{"name":"test_file"},{'name': 'test_fileB','children': {'name': 'test_file'}}]}]}
I want to combine dicta and dictb to get dictc, but I don't know how.
You can define a function to merge dictionaries and call it recursively:
def merge(dict1, dict2):
    # Start from a shallow copy of dict1 so the inputs are not modified.
    result = {**dict1}
    if 'children' in dict1 and 'children' in dict2:
        if dict1['children']['name'] == dict2['children']['name']:
            # Same child name on both sides: merge the two subtrees.
            result['children'] = [merge(dict1['children'], dict2['children'])]
        else:
            # Different child names: keep both as siblings.
            result['children'] = [dict1['children'], dict2['children']]
    elif 'children' in dict1:
        result['children'] = [dict1['children']]
    elif 'children' in dict2:
        result['children'] = [dict2['children']]
    # If neither dict has children, result already has no 'children' key.
    return result

dictc = merge(dicta, dictb)
You didn't provide many details on how exactly the merge should work, but this example does produce the dictc you want. You may need to tweak it for your needs.
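If the inputs can already contain lists of children, or if more than two trees need to be combined, one possible tweak is to normalize every 'children' value to a list before merging. The sketch below is an assumption about how the merge should behave, not something stated in the question; as_list and merge_nodes are hypothetical names. Note that it always emits 'children' as a list, so the innermost level differs slightly from the dictc shown in the question.

```python
def as_list(children):
    """Normalize a 'children' value (dict, list, or None) to a list."""
    if children is None:
        return []
    return children if isinstance(children, list) else [children]


def merge_nodes(a, b):
    """Merge two nodes that share the same 'name', combining children by name."""
    merged = {'name': a['name']}
    by_name = {}  # child name -> merged child, in first-seen order
    for child in as_list(a.get('children')) + as_list(b.get('children')):
        name = child['name']
        if name in by_name:
            by_name[name] = merge_nodes(by_name[name], child)
        else:
            # Merging with an empty node normalizes nested children to lists.
            by_name[name] = merge_nodes(child, {'name': name})
    if by_name:
        merged['children'] = list(by_name.values())
    return merged


dicta = {'name': 'C', 'children': {'name': 'testA', 'children': {'name': 'test_file'}}}
dictb = {'name': 'C', 'children': {'name': 'testA', 'children': {
    'name': 'test_fileB', 'children': {'name': 'test_file'}}}}

merged = merge_nodes(dicta, dictb)
print(merged)
```

Because the result has the same shape as its inputs, it can be folded over any number of trees, for example with functools.reduce(merge_nodes, trees).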
Here is what I am trying to do.
The input data looks like this (tab-separated):
12/01/2018 user1 123.123.222.111 23.3s
12/01/2018 user2 123.123.222.116 21.1s
The data is coming in through Kafka and is being parsed with the following code.
kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
lines = kafkaStream.map(lambda x: x[1])
parsed_log = lines.flatMap(lambda line: line.split(" "))
.map(lambda item: ('key', {
'date': item['date'],
'user': item['user'],
'ip': item['ip'],
'duration': item['duration'],}))
The parsed logs should be in the following format:
('key', {'date': 12/01/2018, 'user': user1, 'ip': 123.123.222.111, 'duration': 23.3s})
('key', {'date': 12/01/2018, 'user': user2, 'ip': 123.123.222.116, 'duration': 21.1s})
In my code, the lines for "lines" and "parsed_log" are not doing the job. Could you please let me know how to go about this?
This is the solution:
lines = kafkaStream.map(lambda x: x[1])
# Split each line on tabs, since the input is tab-separated.
variables_per_stream = lines.map(lambda line: line.split("\t"))
variable_to_key = variables_per_stream.map(lambda item: ('key', {'id': item[0], 'name': item[1]}))
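To get exactly the four fields from the question, the per-line parsing can be written as a plain function and tested without Spark. This is a sketch under the assumption that every line has the four tab-separated columns in the order shown; FIELDS and parse_line are hypothetical names.

```python
# Column names in the order they appear in each tab-separated line (assumed).
FIELDS = ('date', 'user', 'ip', 'duration')


def parse_line(line, sep='\t'):
    """Turn one tab-separated log line into a ('key', {...}) tuple."""
    values = line.rstrip('\n').split(sep)
    return ('key', dict(zip(FIELDS, values)))


sample = [
    "12/01/2018\tuser1\t123.123.222.111\t23.3s",
    "12/01/2018\tuser2\t123.123.222.116\t21.1s",
]

for raw in sample:
    print(parse_line(raw))
```

In the streaming job, the same function could then be passed to map, as in lines.map(parse_line), instead of the inline lambdas.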
I think this is pretty straightforward. All I am trying to do is replace each dictionary's 'code' value with the matching value from a reference dictionary. I have a feeling that the two for loops and the if statement can be shortened further. In my actual problem, I have a few thousand dicts to update. Thanks guys!
Python:
referencedict = {'A': 'abc', 'B': 'xyz'}
mylistofdict = [{'name': 'John', 'code': 'A', 'age': 28}, {'name': 'Mary', 'code': 'B', 'age': 32}, {'name': 'Joe', 'code': 'A', 'age': 43}]
for eachdict in mylistofdict:
    for key, value in eachdict.items():
        if key == 'code':
            eachdict[key] = referencedict[value]
print(mylistofdict)
Output:
[{'age': 28, 'code': 'abc', 'name': 'John'}, {'age': 32, 'code': 'xyz', 'name': 'Mary'}, {'age': 43, 'code': 'abc', 'name': 'Joe'}]
There is no need to loop over all values of eachdict, just look up code directly:
for eachdict in mylistofdict:
    if 'code' not in eachdict:
        continue
    eachdict['code'] = referencedict[eachdict['code']]
You can probably omit the test for code being present, since your example list always contains a code entry, but I thought it better to be safe. Looking up the code in the referencedict structure assumes that all possible codes are available.
I used if 'code' not in eachdict: continue here; the opposite is just as valid (if 'code' in eachdict), but this way you can more easily remove the line if you do not need it, and you save yourself an indent level.
referencedict = {'A': 'abc', 'B': 'xyz'}
mylistofdict = [{'name': 'John', 'code': 'A', 'age': 28}, {'name': 'Mary', 'code': 'B', 'age': 32}, {'name': 'Joe', 'code': 'A', 'age': 43}]
for x in mylistofdict:
    try:
        # .get() returns None for a code that is missing from referencedict.
        x['code'] = referencedict.get(x['code'])
    except KeyError:
        # Raised by x['code'] when a dict has no 'code' key; skip that dict.
        pass
print(mylistofdict)
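If a shorter form is the goal, a list comprehension that builds new dicts instead of updating in place is another option. This is only a sketch; it assumes that a code missing from referencedict should be left unchanged, which the question does not specify, and updated is a made-up variable name.

```python
referencedict = {'A': 'abc', 'B': 'xyz'}
mylistofdict = [
    {'name': 'John', 'code': 'A', 'age': 28},
    {'name': 'Mary', 'code': 'B', 'age': 32},
    {'name': 'Joe', 'code': 'A', 'age': 43},
]

# Build a new list; dicts without a 'code' key are passed through untouched,
# and codes missing from referencedict keep their original value (assumption).
updated = [
    {**d, 'code': referencedict.get(d['code'], d['code'])} if 'code' in d else d
    for d in mylistofdict
]

print(updated)
```

The original list is not modified, which can be convenient when the raw codes are still needed afterwards.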