How to separate a JSON string into multiple columns in R

I have a column with strings in dictionary format, like this:
{'district': 'Ilha do Retiro', 'city': 'RECIFE', 'state': 'PE', 'country': 'BR', 'latitude': -8.062004, 'longitude': -34.908081, 'timezone': 'Etc/GMT+3', 'zipCode': '50830000', 'streetName': 'Avenida Engenheiro Abdias de Carvalho', 'streetNumber': '365'}
I want to create one column for each key, filled with the value for each row.
I tried using separate with regex:
separate(data, address,
         into = c('district', 'city', 'state', 'country', 'latitude', 'longitude',
                  'timezone', 'zipCode', 'streetName', 'streetNumber'),
         "/:.*?/,")
However, this is not working, and perhaps using separate with a regex is not even the best course of action here. Any suggestions on how to do it?

Since it is effectively JSON (though with single-quotes instead of the requisite double-quotes), we can use jsonlite:
txt <- "{'district': 'Ilha do Retiro', 'city': 'RECIFE', 'state': 'PE', 'country': 'BR', 'latitude': -8.062004, 'longitude': -34.908081, 'timezone': 'Etc/GMT+3', 'zipCode': '50830000', 'streetName': 'Avenida Engenheiro Abdias de Carvalho', 'streetNumber': '365'}"
as.data.frame(jsonlite::parse_json(gsub("'", '"', txt)))
# district city state country latitude longitude timezone zipCode streetName streetNumber
# 1 Ilha do Retiro RECIFE PE BR -8.062004 -34.90808 Etc/GMT+3 50830000 Avenida Engenheiro Abdias de Carvalho 365
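If the strings sit in a data-frame column rather than a single value, the same idea applies row by row. A minimal sketch, assuming the strings live in data$address and every row carries the same keys (note the gsub() trick breaks if a value itself contains an apostrophe):

rows <- lapply(data$address, function(s)
  as.data.frame(jsonlite::parse_json(gsub("'", '"', s))))
# keep the other columns and bind the parsed key/value columns alongside them
data <- cbind(data[setdiff(names(data), "address")], do.call(rbind, rows))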

Related

Is there a place to report API data errors for the HERE browse search API?

When using requests.get('https://browse.search.hereapi.com/v1/browse?apiKey=' + YOUR_API_KEY + '&at=53.544348,-113.500571&circle:46.827727",-114.000519,r=3000&limit=10&categories=800-8200-0174') I get a response that shows Canadian postal codes - but only the first 3 characters.
For example, I get this data:
{'title': 'Boyle Street Education Centre', 'id': 'here:pds:place:124c3x29-d6c9cbd3d53a4758b8c953132db92244', 'resultType': 'place', 'address': {'label': 'Boyle Street Education Centre, 10312 105 St NW, Edmonton, AB T5J, Canada', 'countryCode': 'CAN', 'countryName': 'Canada', 'stateCode': 'AB', 'state': 'Alberta', 'county': 'Alberta', 'city': 'Edmonton', 'district': 'Downtown', 'street': '105 St NW', 'postalCode': 'T5J', 'houseNumber': '10312'}, 'position': {'lat': 53.54498, 'lng': -113.5016}, 'access': [{'lat': 53.54498, 'lng': -113.50105}], 'distance': 98, 'categories': [{'id': '800-8200-0174', 'name': 'School', 'primary': True}, {'id': '800-8200-0295', 'name': 'Training & Development'}], 'references': [{'supplier': {'id': 'core'}, 'id': '36335982'}, {'supplier': {'id': 'yelp'}, 'id': 'r3BvVKqluzrZeae9FE4tAw'}], 'contacts': [{'phone': [{'value': '+17804281420'}], 'fax': [{'value': '(780) 429-1458', 'categories': [{'id': '800-8200-0174'}]}], 'www': [{'value': 'http://www.bsec.ab.ca', 'categories': [{'id': '800-8200-0174'}]}]}], 'openingHours': [{'categories': [{'id': '800-8200-0174'}], 'text': ['Mon-Sat: 09:00 - 17:00', 'Sun: 10:00 - 16:00'], 'isOpen': False, 'structured': [{'start': 'T090000', 'duration': 'PT08H00M', 'recurrence': 'FREQ:DAILY;BYDAY:MO,TU,WE,TH,FR,SA'}, {'start': 'T100000', 'duration': 'PT06H00M', 'recurrence': 'FREQ:DAILY;BYDAY:SU'}]}]}
Notice that the postal code listed is "T5J". This is incorrect. Canadian postal codes are 3 characters, a space, and then 3 more characters. I'm guessing this is a parsing error that occurred when the data was captured. The correct postal code is "T5J 1E6".
Yes, HERE has a tool for modifying POI address information.
I have reported this POI so that its postal code gets updated to "T5J 1E6".
Please use the web tool below:
https://mapcreator.here.com/place:124c3x29-d6c9cbd3d53a4758b8c953132db92244/?l=53.5450,-113.5016,18,normal
Thank you!

Strip out quotation marks that are found inside quotes using R

I have some data that is in almost-JSON format, but not quite. I'm trying to convert it from JSON using jsonlite in R. Here is a sample of data:
field
{'email': {'name': 'Bob Smith', 'address': 'bob_smith@blah.com'}}
{'email': {'name': "Sally O'Mally", 'address': 'sally_omally@blah.com'}}
{'email': {'name': 'Sam Daniels', 'address': '"some text"<sam_daniels@xyz.com>'}}
{'email': {'name': "Johnson', Alan", 'address': 'alan.johnson@abc.com'}}
What I want to do is strip out all of the quotation marks (both single and double) that are inside of the main quotations. The data would then look like this:
field
{'email': {'name': 'Bob Smith', 'address': 'bob_smith@blah.com'}}
{'email': {'name': "Sally OMally", 'address': 'sally_omally@blah.com'}}
{'email': {'name': 'Sam Daniels', 'address': 'some text<sam_daniels@xyz.com>'}}
{'email': {'name': "Johnson, Alan", 'address': 'alan.johnson@abc.com'}}
After that, I can handle converting the single quotes to double quotes using stringr and convert from JSON.
Any suggestions?
This is the error I currently get when trying to convert the original data from JSON:
> json_test2 <-
+ json_test %>%
+ dplyr::mutate(
+ field2 = map(field, ~ fromJSON(.) %>% as.data.frame())
+ )
Error: lexical error: invalid char in json text.
{'email': {'name': 'Bob S
(right here) ------^
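One possible way to do the stripping in R: in this data a structural quote is always opened right after {, a colon, or a comma, and closed right before a colon, comma, or closing brace, and the closing quote must match the character that opened the value. Walking each string with that rule and dropping every other quote handles all four sample rows. A minimal sketch (the helper name strip_inner_quotes is made up here, and it assumes a value never contains a same-type quote directly before a comma, colon, or brace):

strip_inner_quotes <- function(s) {
  chars <- strsplit(s, "")[[1]]
  out <- character(0)
  open <- NA_character_  # quote character that opened the current value, if any
  n <- length(chars)
  for (i in seq_len(n)) {
    ch <- chars[i]
    if (ch == "'" || ch == '"') {
      if (is.na(open)) {
        open <- ch                      # structural opening quote: keep it
        out <- c(out, ch)
      } else if (ch == open && (i == n || chars[i + 1] %in% c(",", ":", "}"))) {
        open <- NA_character_           # structural closing quote: keep it
        out <- c(out, ch)
      }
      # any other quote sits inside a value: drop it
    } else {
      out <- c(out, ch)
    }
  }
  paste(out, collapse = "")
}

json_test$field <- vapply(json_test$field, strip_inner_quotes, character(1), USE.NAMES = FALSE)

After that, the single-to-double-quote conversion and fromJSON() should go through.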

How to match a multiline pattern

I am trying to match a pattern in a text file. It works well as long as the pattern sits within one line, but it turns out that in some cases the pattern can span two lines.
I have the following code:
# indicate the name pattern to R
name_pattern = '<nameOfIssuer>([^<]*)</nameOfIssuer>'
# collect the lines that match the pattern we are looking for
datalines = grep(name_pattern, thepage, value = TRUE)
# we will use gregexpr and gsub to extract the information without the XML tags
# create a helper function first
getexpr = function(s, g) substring(s, g, g + attr(g, 'match.length') - 1)
gg = gregexpr(name_pattern, datalines)
matches = mapply(getexpr, datalines, gg)
result = gsub(name_pattern, '\\1', matches)
result <- gsub("&amp;", "&", result)  # undo the XML entity escaping
names(result) = NULL
It works well when the text looks like this:
<nameOfIssuer>Posco ADR</nameOfIssuer>
But when the text is split across lines as follows, it does not work:
<nameOfIssuer>Bank of
America Corp</nameOfIssuer>
Does anyone know how to handle both cases dynamically?
The full text is as follows:
<SEC-DOCUMENT>0001437749-18-018038.txt : 20181009
<SEC-HEADER>0001437749-18-018038.hdr.sgml : 20181009
<ACCEPTANCE-DATETIME>20181005183736
ACCESSION NUMBER: 0001437749-18-018038
CONFORMED SUBMISSION TYPE: 13F-HR
PUBLIC DOCUMENT COUNT: 2
CONFORMED PERIOD OF REPORT: 20180930
FILED AS OF DATE: 20181009
DATE AS OF CHANGE: 20181005
EFFECTIVENESS DATE: 20181009
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: DAILY JOURNAL CORP
CENTRAL INDEX KEY: 0000783412
STANDARD INDUSTRIAL CLASSIFICATION: NEWSPAPERS: PUBLISHING OR PUBLISHING & PRINTING [2711]
IRS NUMBER: 954133299
STATE OF INCORPORATION: SC
FISCAL YEAR END: 0930
FILING VALUES:
FORM TYPE: 13F-HR
SEC ACT: 1934 Act
SEC FILE NUMBER: 028-15782
FILM NUMBER: 181111587
BUSINESS ADDRESS:
STREET 1: 915 EAST FIRST STREET
CITY: LOS ANGELES
STATE: CA
ZIP: 90012
BUSINESS PHONE: 2132295300
MAIL ADDRESS:
STREET 1: 915 EAST FIRST STREET
CITY: LOS ANGELES
STATE: CA
ZIP: 90012
FORMER COMPANY:
FORMER CONFORMED NAME: DAILY JOURNAL CO
DATE OF NAME CHANGE: 19870427
</SEC-HEADER>
<DOCUMENT>
<TYPE>13F-HR
<SEQUENCE>1
<FILENAME>primary_doc.xml
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?>
<edgarSubmission xmlns="http://www.sec.gov/edgar/thirteenffiler" xmlns:com="http://www.sec.gov/edgar/common">
<headerData>
<submissionType>13F-HR</submissionType>
<filerInfo>
<liveTestFlag>LIVE</liveTestFlag>
<flags>
<confirmingCopyFlag>false</confirmingCopyFlag>
<returnCopyFlag>true</returnCopyFlag>
<overrideInternetFlag>false</overrideInternetFlag>
</flags>
<filer>
<credentials>
<cik>0000783412</cik>
<ccc>XXXXXXXX</ccc>
</credentials>
</filer>
<periodOfReport>09-30-2018</periodOfReport>
</filerInfo>
</headerData>
<formData>
<coverPage>
<reportCalendarOrQuarter>09-30-2018</reportCalendarOrQuarter>
<filingManager>
<name>DAILY JOURNAL CORP</name>
<address>
<com:street1>915 EAST FIRST STREET</com:street1>
<com:city>LOS ANGELES</com:city>
<com:stateOrCountry>CA</com:stateOrCountry>
<com:zipCode>90012</com:zipCode>
</address>
</filingManager>
<reportType>13F HOLDINGS REPORT</reportType>
<form13FFileNumber>028-15782</form13FFileNumber>
<provideInfoForInstruction5>N</provideInfoForInstruction5>
</coverPage>
<signatureBlock>
<name>Gerald L. Salzman</name>
<title>Chief Executive Officer, President, CFO, Treasurer</title>
<phone>213-229-5300</phone>
<signature>/s/ Gerald L. Salzman</signature>
<city>Los Angeles</city>
<stateOrCountry>CA</stateOrCountry>
<signatureDate>10-05-2018</signatureDate>
</signatureBlock>
<summaryPage>
<otherIncludedManagersCount>0</otherIncludedManagersCount>
<tableEntryTotal>4</tableEntryTotal>
<tableValueTotal>159459</tableValueTotal>
<isConfidentialOmitted>false</isConfidentialOmitted>
</summaryPage>
</formData>
</edgarSubmission>
</XML>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>INFORMATION TABLE
<SEQUENCE>2
<FILENAME>rdgit100518.xml
<TEXT>
<XML>
<?xml version="1.0" encoding="us-ascii"?>
<informationTable xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.sec.gov/edgar/document/thirteenf/informationtable">
<infoTable>
<nameOfIssuer>Bank of
America Corp</nameOfIssuer>
<titleOfClass>Common Stock</titleOfClass>
<cusip>060505104</cusip>
<value>67758</value>
<shrsOrPrnAmt>
<sshPrnamt>2300000</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>2300000</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>Posco ADR</nameOfIssuer>
<titleOfClass>Sponsored ADR</titleOfClass>
<cusip>693483109</cusip>
<value>643</value>
<shrsOrPrnAmt>
<sshPrnamt>9745</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>9745</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>US Bancorp</nameOfIssuer>
<titleOfClass>Common Stock</titleOfClass>
<cusip>902973304</cusip>
<value>7393</value>
<shrsOrPrnAmt>
<sshPrnamt>140000</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>140000</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>Wells Fargo &amp; Co</nameOfIssuer>
<titleOfClass>Common Stock</titleOfClass>
<cusip>949746101</cusip>
<value>83665</value>
<shrsOrPrnAmt>
<sshPrnamt>1591800</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>1591800</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
</informationTable>
</XML>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
Assuming there could be multiple matching <nameOfIssuer> tags in your document, and you want to match all of them, we can try using gregexpr with regmatches:
input <- "<nameOfIssuer>Bank of\n America Corp</nameOfIssuer>\n blah blah blah \n"
input <- paste0(input, "<nameOfIssuer>Citigroup</nameOfIssuer>")
m <- gregexpr("(?<=<nameOfIssuer>)([^<]*?)(?=</nameOfIssuer>)", input, perl=TRUE)
regmatches(input, m)[[1]]
[1] "Bank of\n America Corp" "Citigroup"
Using Tim's solution plus the collapse option of paste, the program works. The code is as follows:
thepage <- paste(thepage, collapse = "")
m <- gregexpr("(?<=<nameOfIssuer>)([^<]*?)(?=</nameOfIssuer>)", thepage, perl=TRUE)
result <- regmatches(thepage, m)[[1]]
names(result) = NULL
# put the result into a data frame
Positions = as.data.frame(matrix(result, ncol = 1, byrow = TRUE))
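Since the payload is XML anyway, another option is to let an XML parser deal with the line breaks instead of a regex. A sketch with xml2, assuming the <informationTable>...</informationTable> block has already been isolated into a single string xml_txt:

library(xml2)
doc <- read_xml(xml_txt)
xml_ns_strip(doc)  # drop the edgar namespace so a plain XPath works
result <- xml_text(xml_find_all(doc, "//nameOfIssuer"))
result <- gsub("\\s+", " ", result)  # collapse the embedded newline and indentation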

Using Map in PySpark to parse and assign column names

Here is what I am trying to do.
The input data looks like this (tab-separated):
12/01/2018 user1 123.123.222.111 23.3s
12/01/2018 user2 123.123.222.116 21.1s
The data is coming in through Kafka and is being parsed with the following code.
kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
lines = kafkaStream.map(lambda x: x[1])
parsed_log = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda item: ('key', {
        'date': item['date'],
        'user': item['user'],
        'ip': item['ip'],
        'duration': item['duration'],
    }))
The parsed logs should be in the following format:
('key', {'date': 12/01/2018, 'user': user1, 'ip': 123.123.222.111, 'duration': 23.3s})
('key', {'date': 12/01/2018, 'user': user2, 'ip': 123.123.222.116, 'duration': 21.1s})
In my code, the lines for "lines" and "parsed_log" are not doing the job. Could you please let me know how to go about this?
This is the solution:
lines = kafkaStream.map(lambda x: x[1])
# map (not flatMap) keeps one record per line; the fields are tab-separated
variables_per_stream = lines.map(lambda line: line.split("\t"))
variable_to_key = variables_per_stream.map(lambda item: ('key', {
    'date': item[0], 'user': item[1], 'ip': item[2], 'duration': item[3]}))
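The key change is using map instead of flatMap: flatMap flattens each split line into a stream of single tokens, leaving nothing to index into, while map keeps one list of fields per input line, so item[0] through item[3] address the columns.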

How to get the original word from trans() Symfony 2

The user enters his country name. The problem is that all country names are translated into different languages, and I must translate the name back to English to compare it with the name in my database.
I tried this, but it doesn't work:
$translated_country = $this->get('translator')->trans($q_country, array(), null, 'en_US');
$countries = array("A", "B", "C");
if( in_array($translated_country, $countries))
{}
For example, I have messages.de.yml:
Germany: Deutschland
I want my code to get back Germany when the user enters Deutschland.
You need to have a match in the EN locale for each country translated into the other languages you support.
# messages.en.yml
deutschland: germany
Германия: germany
russland: russia
Россия: russia
# messages.de.yml
germany: deutschland
russia: russland
# messages.ru.yml
russia: Россия
germany: Германия
$toTranslate = 'deutschland';
$translator = $this->get('translator');
$translation = $translator->trans($toTranslate, array(), null, 'en_US');
/** $translation should be 'germany' */
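Note that when the en catalogue has no entry for the submitted string, trans() simply returns the string unchanged, which makes a failed lookup easy to detect. Also keep in mind that message ids are matched as exact strings, so 'Deutschland' and 'deutschland' are different keys.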
