I'm using the following function to extract retweet IDs from a tweet JSON file in order to build retweet cascades. Here is part of the R code:
parse_raw_tweets_to_cascades <- function(path, batch = 1000, cores = 1, output_path = NULL, keep_user = F, keep_absolute_time = F, keep_text = F, keep_retweet_count = F, progress = T, return_as_list = T, save_temp = F, api_version=2) {
  check_required_packages(c('jsonlite', 'data.table', 'bit64'))
  library(data.table)
  # a helper function
  zero_if_null <- function(count) {
    ifelse(is.null(count), 0, count)
  }
  if (api_version == 2) {
    parse_tweet <- function(tweet, keep_text = F) {
      tryCatch({
        json_tweet <- jsonlite::fromJSON(tweet)
        if (is.null(json_tweet$includes) || is.null(json_tweet$includes$users)) {
          stop('The author information is required!')
        }
        id <- json_tweet$data$id
        magnitude <- zero_if_null(json_tweet$includes$users$public_metrics$followers_count)
        user_id <- json_tweet$data$author_id
        username <- json_tweet$user$username
        retweet_id <- NA
        #print(typeof(id)) #character
        #cat(sprintf('id is: %s \n', id))
        cat(sprintf('magnitude is: %d \n', magnitude))
        if (keep_text) text <- json_tweet$data$text
        if (!is.null(json_tweet$data$referenced_tweets) && json_tweet$data$referenced_tweets$type == 'retweeted') {
          #if this tweet is a retweet, get original tweet's information
          retweet_id <- json_tweet$data$referenced_tweets$id
          cat("retweet_id: ", retweet_id, "\n")
          if (keep_text) text <- NA
        }
        cat("Monaaaaa", "\n")
        res <- list(id = id, magnitude = magnitude, user_id = user_id,
                    username = username, retweet_id = retweet_id)
        if (keep_text) res[['text']] <- text
        res
      },
      .... # warning for error processing json
      )
    }
  }
This is the error I receive:
Error processing json: Error in if
(!is.null(json_tweet$data$referenced_tweets) &&
json_tweet$data$referenced_tweets$type == : missing value where
TRUE/FALSE needed
Question:
I don't understand why the retweet IDs come back null.
I checked my JSON file and searched for "retweeted", and the path (json_tweet$data$referenced_tweets$type) looks correct.
NOTE: The function above is part of the evently package. The package works fine with the sample data provided on GitHub, which is in Twitter API v1 format, but it does not work with my JSON file, which is in v2 format.
Here is a small part of my data (part of JSON for retweets of one user):
{"data": [{"referenced_tweets": [{"type": "retweeted", "id": "1253739069273710594"}], "entities": {"mentions": [{"start": 3, "end": 16, "username": "warriors_mom", "id": "75184478"}, {"start": 18, "end": 24, "username": "AC360", "id": "227837742"}], "annotations": [{"start": 25, "end": 39, "probability": 0.7096, "type": "Person", "normalized_text": "President Trump"}], "urls": [{"start": 98, "end": 121, "url": "", "expanded_url": "", "display_url": "", "images": [{"url": "", "width": 144, "height": 144}, {"url": "", "width": 144, "height": 144}], "status": 200, "title": "Ultraviolet Irradiation of Blood: \u201cThe Cure That Time Forgot\u201d?", "description": "Ultraviolet blood irradiation (UBI) was extensively used in the 1940s and 1950s to treat many diseases including septicemia, pneumonia, tuberculosis, arthritis, asthma and even poliomyelitis. The early studies were carried out by several physicians in ...", "unwound_url": ""}]}, "public_metrics": {"retweet_count": 3, "reply_count": 0, "like_count": 0, "quote_count": 0}, "possibly_sensitive": false, "reply_settings": "everyone", "lang": "en", "id": "1253834847258370048", "context_annotations": [{"domain": {"id": "3", "name": "TV Shows", "description": "Television shows from around the world"}, "entity": {"id": "10000271509", "name": "Anderson Cooper 360", "description": "Anderson Cooper goes beyond the headlines with in-depth reporting and investigations. Through nightly \"Keeping Them Honest\" reports, Anderson keeps his commitment to holding those in power accountable. And, of course, there's the RidicuList, a tongue-in-cheek commentary on the day's news that may leave viewers (and Anderson) laughing. 
Joining him are guests that frequently include political and legal analysts."}}, {"domain": {"id": "4", "name": "TV Episodes", "description": "Television show episodes"}, "entity": {"id": "1249271407508242432", "name": "Anderson Cooper 360", "description": "Anderson Cooper goes beyond the headlines with in-depth reporting and investigations. Through nightly \"Keeping Them Honest\" reports, Anderson keeps his commitment to holding those in power accountable. And, of course, there's the RidicuList, a tongue-in-cheek commentary on the day's news that may leave viewers (and Anderson) laughing. Joining him are guests that frequently include political and legal analysts."}}, {"domain": {"id": "4", "name": "TV Episodes", "description": "Television show episodes"}, "entity": {"id": "1249277031881138178", "name": "Anderson Cooper 360", "description": "Anderson Cooper goes beyond the headlines with in-depth reporting and investigations. Through nightly \"Keeping Them Honest\" reports, Anderson keeps his commitment to holding those in power accountable. And, of course, there's the RidicuList, a tongue-in-cheek commentary on the day's news that may leave viewers (and Anderson) laughing. Joining him are guests that frequently include political and legal analysts."}}, {"domain": {"id": "4", "name": "TV Episodes", "description": "Television show episodes"}, "entity": {"id": "1250891078401552385", "name": "Anderson Cooper 360", "description": "Anderson Cooper goes beyond the headlines with in-depth reporting and investigations. Through nightly \"Keeping Them Honest\" reports, Anderson keeps his commitment to holding those in power accountable. And, of course, there's the RidicuList, a tongue-in-cheek commentary on the day's news that may leave viewers (and Anderson) laughing. 
Joining him are guests that frequently include political and legal analysts."}}, {"domain": {"id": "10", "name": "Person", "description": "Named people in the world like Nelson Mandela"}, "entity": {"id": "799022225751871488", "name": "Donald Trump", "description": "45th US President, Donald Trump"}}, {"domain": {"id": "29", "name": "Events [Entity Service]", "description": "Entity Service related Events domain"}, "entity": {"id": "1249271407508242432", "name": "Anderson Cooper 360", "description": "Anderson Cooper goes beyond the headlines with in-depth reporting and investigations. Through nightly \"Keeping Them Honest\" reports, Anderson keeps his commitment to holding those in power accountable. And, of course, there's the RidicuList, a tongue-in-cheek commentary on the day's news that may leave viewers (and Anderson) laughing. Joining him are guests that frequently include political and legal analysts."}}, {"domain": {"id": "29", "name": "Events [Entity Service]", "description": "Entity Service related Events domain"}, "entity": {"id": "1249277031881138178", "name": "Anderson Cooper 360", "description": "Anderson Cooper goes beyond the headlines with in-depth reporting and investigations. Through nightly \"Keeping Them Honest\" reports, Anderson keeps his commitment to holding those in power accountable. And, of course, there's the RidicuList, a tongue-in-cheek commentary on the day's news that may leave viewers (and Anderson) laughing. Joining him are guests that frequently include political and legal analysts."}}, {"domain": {"id": "29", "name": "Events [Entity Service]", "description": "Entity Service related Events domain"}, "entity": {"id": "1250891078401552385", "name": "Anderson Cooper 360", "description": "Anderson Cooper goes beyond the headlines with in-depth reporting and investigations. Through nightly \"Keeping Them Honest\" reports, Anderson keeps his commitment to holding those in power accountable. 
And, of course, there's the RidicuList, a tongue-in-cheek commentary on the day's news that may leave viewers (and Anderson) laughing. Joining him are guests that frequently include political and legal analysts."}}, {"domain": {"id": "35", "name": "Politician", "description": "Politicians in the world, like Joe Biden"}, "entity": {"id": "799022225751871488", "name": "Donald Trump", "description": "45th US President, Donald Trump"}}], "created_at": "2020-04-24T23:54:57.000Z", "author_id": "1890848160", "text": "RT #warriors_mom: #AC360 President Trump was referring to this well-documented medical treatment: ", "source": "Twitter for iPhone", "conversation_id": "1253834847258370048"}, {"referenced_tweets": [{"type": "retweeted", "id": "1253452455540666371"}], "entities": {"mentions": [{"start": 3, "end": 16, "username": "warriors_mom", "id": "75184478"}], "annotations": [{"start": 24, "end": 27, "probability": 0.691, "type": "Place", "normalized_text": "U.S."}]}, "public_metrics": {"retweet_count": 5, "reply_count": 0, "like_count": 0, "quote_count": 0}, "possibly_sensitive": false, "reply_settings": "everyone", "lang": "en", "id": "1253828982413410307", "context_annotations": [{"domain": {"id": "123", "name": "Ongoing News Story", "description": "Ongoing News Stories like 'Brexit'"}, "entity": {"id": "1220701888179359745", "name": "COVID-19"}}], "created_at": "2020-04-24T23:31:39.000Z", "author_id": "863857568", "text": "RT #warriors_mom: Major U.S. 
credit-card issuers begin lowering customer spending limits as coronavirus pandemic shutdowns leave millions j\u2026", "source": "Twitter for iPhone", "conversation_id": "1253828982413410307"}, {"referenced_tweets": [{"type": "retweeted", "id": "1253815956662620163"}], "entities": {"mentions": [{"start": 3, "end": 16, "username": "warriors_mom", "id": "75184478"}, {"start": 18, "end": 32, "username": "RealMattCouch", "id": "601535938"}], "annotations": [{"start": 33, "end": 41, "probability": 0.8682, "type": "Person", "normalized_text": "Seth Rich"}]}, "public_metrics": {"retweet_count": 2, "reply_count": 0, "like_count": 0, "quote_count": 0}, "possibly_sensitive": false, "reply_settings": "everyone", "lang": "en", "id": "1253816055161651202", "created_at": "2020-04-24T22:40:16.000Z", "author_id": "1065308069645754368", "text": "RT #warriors_mom: #RealMattCouch Seth Rich", "source": "Twitter for Android", "conversation_id": "1253816055161651202"}, {"referenced_tweets": [{"type": "retweeted", "id": "1253811776103333890"}], "entities": {"mentions": [{"start": 3, "end": 16, "username": "warriors_mom", "id": "75184478"}], "annotations": [{"start": 63, "end": 67, "probability": 0.9967, "type": "Person", "normalized_text": "Trump"}, {"start": 69, "end": 74, "probability": 0.9523, "type": "Place", "normalized_text": "Russia"}, {"start": 87, "end": 95, "probability": 0.8678, "type": "Organization", "normalized_text": "Alfa Bank"}]}, "public_metrics": {"retweet_count": 1, "reply_count": 0, "like_count": 0, "quote_count": 0}, "possibly_sensitive": false, "reply_settings": "everyone", "lang": "en", "id": "1253812582806216704", "context_annotations": [{"domain": {"id": "10", "name": "Person", "description": "Named people in the world like Nelson Mandela"}, "entity": {"id": "799022225751871488", "name": "Donald Trump", "description": "45th US President, Donald Trump"}}, {"domain": {"id": "35", "name": "Politician", "description": "Politicians in the world, like Joe Biden"}, 
"entity": {"id": "799022225751871488", "name": "Donald Trump", "description": "45th US President, Donald Trump"}}, {"domain": {"id": "30", "name": "Entities [Entity Service]", "description": "Entity Service top level domain, every item that is in Entity Service should be in this domain"}, "entity": {"id": "848920371311001600", "name": "Technology", "description": "Technology and computing"}}, {"domain": {"id": "30", "name": "Entities [Entity Service]", "description": "Entity Service top level domain, every item that is in Entity Service should be in this domain"}, "entity": {"id": "898650876658634752", "name": "Cybersecurity", "description": "Cybersecurity"}}], "created_at": "2020-04-24T22:26:29.000Z", "author_id": "987931361963950080", "text": "RT #warriors_mom: Top cyber security team finds no evidence of Trump-Russia chatter on Alfa Bank server: A cyber security report debunks th\u2026", "source": "Twitter for Android", "conversation_id": "1253812582806216704"}, {"referenced_tweets": [{"type": "retweeted", "id": "1253461793168674821"}], "attachments": {"media_keys": ["3_1253461775980339201", "3_1253461780254392326", "3_1253461784981377024", "3_1253461788408102912"]}, "entities": {"mentions": [{"start": 3, "end": 16, "username": "warriors_mom", "id": "75184478"}], "hashtags": [{"start": 23, "end": 32, "tag": "FakeNews"}], "urls": [{"start": 56, "end": 79, "url": "", "expanded_url": "", "display_url": "pic.twitter.com/po6BRVf2pu", "media_key": "3_1253461775980339201"}, {"start": 56, "end": 79, "url": "", "expanded_url": "", "display_url": "pic.twitter.com/po6BRVf2pu", "media_key": "3_1253461780254392326"}, {"start": 56, "end": 79, "url": "", "expanded_url": "", "display_url": "pic.twitter.com/po6BRVf2pu", "media_key": "3_1253461784981377024"}, {"start": 56, "end": 79, "url": "", "expanded_url": "", "display_url": "pic.twitter.com/po6BRVf2pu", "media_key": "3_1253461788408102912"}]}, "public_metrics": {"retweet_count": 6, "reply_count": 0, "like_count": 0, 
"quote_count": 0}, "possibly_sensitive": false, "reply_settings": "everyone", "lang": "en", "id": "1253787731517476866", "created_at": "2020-04-24T20:47:44.000Z", "author_id": "461486301", "text": "RT #warriors_mom: Dear #FakeNews Media... seriously? \ud83d\ude44\ud83e\udd23 ", "source": "Twitter Web App", "conversation_id": "1253787731517476866"}, {"referenced_tweets": [{"type": "retweeted", "id": "1253348491805577216"}], "entities": {"mentions": [{"start": 3, "end": 16, "username": "warriors_mom", "id": "75184478"}], "annotations": [{"start": 18, "end": 23, "probability": 0.868, "type": "Organization", "normalized_text": "Amazon"}]}, "public_metrics": {"retweet_count": 2, "reply_count": 0, "like_count": 0, "quote_count": 0}, "possibly_sensitive": false, "reply_settings": "everyone", "lang": "en", "id": "1253787648180789253", "context_annotations": [{"domain": {"id": "45", "name": "Brand Vertical", "description": "Top level entities that describe a Brands industry"}, "entity": {"id": "781974596706635776", "name": "Retail"}}, {"domain": {"id": "46", "name": "Brand Category", "description": "Categories within Brand Verticals that narrow down the scope of Brands"}, "entity": {"id": "783335558466506752", "name": "Online"}}, {"domain": {"id": "47", "name": "Brand", "description": "Brands and Companies"}, "entity": {"id": "10026792024", "name": "Amazon"}}], "created_at": "2020-04-24T20:47:24.000Z", "author_id": "3003997593", "text": "RT #warriors_mom: Amazon Scooped Up Data From Its Own Sellers to Launch Competing Products: Contrary to assertions to Congress, employees o\u2026", "source": "Twitter for iPhone", "conversation_id": "1253787648180789253"}, {"referenced_tweets": [{"type": "retweeted", "id": "1253716749817729025"}], ....}
I've noticed a fairly severe inconsistency in the results returned by the HERE /geocode API endpoint. Some address parts keep their original special characters, as in the city name "Łódź", and some don't.
When making the following request:
https://geocoder.cit.api.here.com/6.2/geocode.json?lon=19.4734111&lat=51.73771300000001&language=sv-SE&searchtext=sienkiewicza lodz&result_types=address,place&cs=pds&additionaldata=Country2,true
we get an inconsistent result:
"Address": {
"Label": "ulica Henryka Sienkiewicza, 90-009 Lodz, Polen",
"Country": "POL",
"State": "Woj. Łódzkie",
"County": "Lodz",
"City": "Lodz",
"District": "Lodz",
"Subdistrict": "Śródmieście",
"Street": "ulica Henryka Sienkiewicza",
"PostalCode": "90-009",
"AdditionalData": [
{
"value": "PL",
"key": "Country2"
},
{
"value": "Polen",
"key": "CountryName"
},
{
"value": "Woj. Łódzkie",
"key": "StateName"
},
{
"value": "Lodz",
"key": "CountyName"
}
]
}
As we can see, the value for State contains Polish characters ("Woj. Łódzkie"), but City is "Lodz", which is not OK.
All results should contain the original letters, as in "Łódź". In other words, such results shouldn't be latinized.
Thank you
When you use a language code different from that of the original data, as in your case with sv-SE for data in Poland, you get exonyms "where available", which is why you may get a mix of alphabets.
If you remove the language parameter from the query, or set it to Polish explicitly with language=pl-PL, you get the following response for your example:
"Address": {
"Label": "ulica Henryka Sienkiewicza, 90-057 Łódź, Polska",
"Country": "POL",
"State": "Woj. Łódzkie",
"County": "Łódź",
"City": "Łódź",
"District": "Łódź",
"Subdistrict": "Śródmieście",
"Street": "ulica Henryka Sienkiewicza",
"PostalCode": "90-057",
"AdditionalData": [
{
"value": "PL",
"key": "Country2"
},
{
"value": "Polska",
"key": "CountryName"
},
{
"value": "Woj. Łódzkie",
"key": "StateName"
},
{
"value": "Łódź",
"key": "CountyName"
}
]
}
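As a rough illustration of the suggestion above, the request can be assembled with language=pl-PL (or with the language entry left out altogether). This is only a sketch in base R: build_geocode_url is a hypothetical helper, not part of any HERE client, and the app_id/app_code credentials the API expects are omitted here.

```r
# Hypothetical helper: build a geocode request URL from a named list of
# query parameters, percent-encoding each value.
build_geocode_url <- function(params,
                              base = "https://geocoder.cit.api.here.com/6.2/geocode.json") {
  encoded <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste0(names(params), "=", encoded, collapse = "&"))
}

# Same search as in the question, but asking for Polish names explicitly.
params <- list(
  searchtext = "sienkiewicza lodz",
  language = "pl-PL"
)
url <- build_geocode_url(params)
url
```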
I need to be able to give a city/state or postal code plus an integer radius and get back all cities/postal codes within that radius using HERE, but the documentation is unclear to me.
https://geocoder.api.here.com/6.2/reversegeocode.json?app_id=xxx&app_code=xxxx
The return is:
<ns2:Error xmlns:ns2="http://www.navteq.com/lbsp/Errors/1" type="PermissionError" subtype="InvalidCredentials">
<Details>invalid credentials for xxxxxxx</Details>
</ns2:Error>
But when I use
https://geocoder.api.here.com/6.2/geocode.json?app_id=xxx&app_code=xxxx
I get data back.
It is worth mentioning that I am on the Freemium package for the REST API.
So first, why can't I get data back from the Reverse Geocode API?
And what is the appropriate query string to accomplish the above?
Update:
Leaving the rest here in case someone else runs into this. To use reverse geocoding, the API route is actually:
https://reverse.geocoder.api.here.com/6.2/reversegeocode.json
I still need some help with getting the radius data, though.
Update 2:
https://reverse.geocoder.api.here.com/6.2/reversegeocode.json?app_id=x&app_code=x&level=city&mode=retrieveAreas&prox=52.5309,13.3847,80467.2
Returns:
{
"Response": {
"MetaInfo": {
"Timestamp": "2019-04-27T17:47:41.043+0000"
},
"View": [
{
"_type": "SearchResultsViewType",
"ViewId": 0,
"Result": [
{
"Relevance": 1,
"Distance": 0,
"Direction": 0,
"MatchLevel": "district",
"MatchQuality": {
"Country": 1,
"State": 1,
"County": 1,
"City": 1,
"District": 1,
"PostalCode": 1
},
"Location": {
"LocationId": "NT_0ES-GaH3lZzJCuLQBrdw7C",
"LocationType": "point",
"DisplayPosition": {
"Latitude": 52.5309,
"Longitude": 13.3847
},
"MapView": {
"TopLeft": {
"Latitude": 52.54063,
"Longitude": 13.36566
},
"BottomRight": {
"Latitude": 52.50407,
"Longitude": 13.42964
}
},
"Address": {
"Label": "Mitte, Berlin, Deutschland",
"Country": "DEU",
"State": "Berlin",
"County": "Berlin",
"City": "Berlin",
"District": "Mitte",
"PostalCode": "10178",
"AdditionalData": [
{
"value": "Deutschland",
"key": "CountryName"
},
{
"value": "Berlin",
"key": "StateName"
},
{
"value": "Berlin",
"key": "CountyName"
}
]
},
"MapReference": {
"ReferenceId": "53500282",
"SideOfStreet": "neither",
"CountryId": "20147700",
"StateId": "20187401",
"CountyId": "20187402",
"CityId": "20187403",
"DistrictId": "20187417"
}
}
}
]
}
]
}
}
The documentation says to pass coordinates and a radius (here 80467.2 m, i.e. 50 miles), so I'm not sure why I am only receiving one city, Berlin, in the response.
Update 3:
https://reverse.geocoder.api.here.com/6.2/multi-reversegeocode.json?app_id=x&app_code=x&level=city&mode=retrieveAreas&prox=60.5544,-151.2583,80000
I also tried the multi-reversegeocode endpoint.
The response is: {}
Get cities in a proximity radius
Use the Reverse Geocode endpoint with the following parameters:
https://reverse.geocoder.api.here.com/6.2/reversegeocode.json?
prox=52.5309,13.3847,80500 /* Note: 80.5 km around Berlin */
&app_id=YOUR_APP_ID
&app_code=YOUR_APP_CODE
&mode=retrieveAreas
&level=city
&gen=9
In short, your example in "Update 2" is correct, except that it is missing the gen query parameter. As per the API Reference, the query parameter level is valid only in combination with gen=2 or higher.
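The query above can be put together from a radius in miles as follows. This is a minimal sketch in base R: build_area_query and miles_to_meters are hypothetical helper names, the credentials are placeholders, and nothing here sends the request.

```r
# prox expects the radius in meters; 1 mile = 1609.344 m.
miles_to_meters <- function(miles) miles * 1609.344

# Hypothetical helper: reverse geocode URL for all areas of the given level
# within `radius_miles` of (lat, lon), including the gen parameter.
build_area_query <- function(lat, lon, radius_miles,
                             app_id = "YOUR_APP_ID",
                             app_code = "YOUR_APP_CODE") {
  # sprintf avoids scientific notation such as 1e+05 for round radii.
  prox <- sprintf("%.6f,%.6f,%.0f", lat, lon, miles_to_meters(radius_miles))
  paste0(
    "https://reverse.geocoder.api.here.com/6.2/reversegeocode.json",
    "?prox=", prox,
    "&mode=retrieveAreas",
    "&level=city",
    "&gen=9",
    "&app_id=", app_id,
    "&app_code=", app_code
  )
}

# 50 miles around the Berlin coordinates from "Update 2".
query_url <- build_area_query(52.5309, 13.3847, 50)
query_url
```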
I'm walking through a huge JSONL file (100 GB, 100M rows) line by line, extracting the values of two keys from each row. Ideally, I want these written to a file with two columns. I'm a real beginner here.
Here is an example of the JSON on each row of the file referenced on my C drive:
https://api.unpaywall.org/v2/10.6118/jmm.2017.23.2.135?email=YOUR_EMAIL
or:
{
"best_oa_location": {
"evidence": "open (via page says license)",
"host_type": "publisher",
"is_best": true,
"license": "cc-by-nc",
"pmh_id": null,
"updated": "2018-02-14T11:18:21.978814",
"url": "FAKEURL",
"url_for_landing_page": "URL2",
"url_for_pdf": "URL4",
"version": "publishedVersion"
},
"data_standard": 2,
"doi": "10.6118/jmm.2017.23.2.135",
"doi_url": "URL5",
"genre": "journal-article",
"is_oa": true,
"journal_is_in_doaj": false,
"journal_is_oa": false,
"journal_issns": "2288-6478,2288-6761",
"journal_name": "Journal of Menopausal Medicine",
"oa_locations": [
{
"evidence": "open (via page says license)",
"host_type": "publisher",
"is_best": true,
"license": "cc-by-nc",
"pmh_id": null,
"updated": "2018-02-14T11:18:21.978814",
"url": "URL6",
"url_for_landing_page": "hURL7": "hURL8",
"version": "publishedVersion"
},
{
"evidence": "oa repository (via OAI-PMH doi match)",
"host_type": "repository",
"is_best": false,
"license": "cc-by-nc",
"pmh_id": "oai:pubmedcentral.nih.gov:5606912",
"updated": "2017-10-21T18:12:39.724143",
"url": "URL9",
"url_for_landing_page": "URL11",
"url_for_pdf": "URL12",
"version": "publishedVersion"
},
{
"evidence": "oa repository (via pmcid lookup)",
"host_type": "repository",
"is_best": false,
"license": null,
"pmh_id": null,
"updated": "2018-10-11T01:49:34.280389",
"url": "URL13",
"url_for_landing_page": "URL14",
"url_for_pdf": null,
"version": "publishedVersion"
}
],
"published_date": "2017-01-01",
"publisher": "The Korean Society of Menopause (KAMJE)",
"title": "A Case of Granular Cell Tumor of the Clitoris in a Postmenopausal Woman",
"updated": "2018-06-20T20:31:37.509896",
"year": 2017,
"z_authors": [
{
"affiliation": [
{
"name": "Department of Obstetrics and Gynecology, Soonchunhyang University Cheonan Hospital, University of Soonchunhyang College of Medicine, Cheonan, Korea."
}
],
"family": "Min",
"given": "Ji-Won"
},
{
"affiliation": [
{
"name": "Department of Obstetrics and Gynecology, Soonchunhyang University Cheonan Hospital, University of Soonchunhyang College of Medicine, Cheonan, Korea."
}
],
"family": "Kim",
"given": "Yun-Sook"
}
]
}
Here's the code I wrote and am using:
library (magrittr)
library (jqr)
con = file("C:/users/ME/desktop/miniunpaywall.jsonl", "r");
while ( length(line <- readLines(con, n = -1)) > 0) {
write.table( line %>% jq ('.doi,.best_oa_location.license'), file='test.txt', quote=FALSE, row.names=FALSE);}
What results from this is a line of text for each row of JSON that looks like this:
"10.1016/j.ijcard.2018.10.014,CC-BY"
This is effectively:
"[DOI],[LICENSE]"
I want ideally to have the output be:
[DOI] tab [LICENSE]
I believe my problem is that I'm writing the values as a single string into a single column when I call:
write.table( line %>% jq ('.doi,.best_oa_location.license')
I haven't figured out a way to remove the quotes I'm getting around each line in my file, or how I could separate the two values with a tab. I feel I'm pretty close. Help!
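One way to get tab-separated output without the quotes is sketched below. It parses each line with jsonlite rather than jqr, reads the file in chunks so it never has to fit in memory, and writes plain lines with writeLines. The helper names (or_blank, doi_license_row, jsonl_to_tsv) and the chunk size are assumptions for illustration, and a missing DOI or license is written as an empty field.

```r
library(jsonlite)

# Hypothetical helper: first element as text, or "" when missing/null.
or_blank <- function(x) {
  if (is.null(x) || length(x) == 0 || is.na(x[1])) "" else as.character(x[1])
}

# Hypothetical helper: turn one JSON line into "doi<TAB>license".
doi_license_row <- function(json_line) {
  rec <- fromJSON(json_line)
  paste(or_blank(rec$doi), or_blank(rec$best_oa_location$license), sep = "\t")
}

# Hypothetical driver: stream the JSONL file chunk by chunk into a TSV file.
jsonl_to_tsv <- function(in_path, out_path, chunk_size = 10000) {
  con_in <- file(in_path, "r")
  con_out <- file(out_path, "w")
  on.exit({
    close(con_in)
    close(con_out)
  })
  while (length(chunk <- readLines(con_in, n = chunk_size)) > 0) {
    chunk <- chunk[nzchar(trimws(chunk))]  # skip blank lines
    rows <- vapply(chunk, doi_license_row, character(1), USE.NAMES = FALSE)
    writeLines(rows, con_out)
  }
  invisible(out_path)
}

# Tiny self-contained demo with two made-up records.
demo_in <- tempfile(fileext = ".jsonl")
demo_out <- tempfile(fileext = ".tsv")
writeLines(c(
  '{"doi": "10.1/a", "best_oa_location": {"license": "cc-by"}}',
  '{"doi": "10.1/b", "best_oa_location": null}'
), demo_in)
jsonl_to_tsv(demo_in, demo_out, chunk_size = 1)
readLines(demo_out)
```

Parsing every line with fromJSON will likely be slow on 100M rows, so this is meant to show the shape of the loop rather than a tuned solution.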