Odd text characters after parsing from JSON file? Encoding issue?

I have a JSON file with Tweet data, containing fields such as text, published date, author, ID, etc.
I used the parseTweets function from streamR, but when I view the completed data frame, the text has not been parsed/encoded correctly.
tweets <- parseTweets("C:/Users/...file.json", simplify = FALSE, verbose = TRUE, legacy = FALSE)
View(tweets)
This is what is shown in the "text" column of the parsed object
think youâ€™re continuing the conversation
It should say: think you're continuing the conversation
I did some searching and this seems to be an encoding issue, but I can't seem to figure it out.
Would I need to parseTweets first and then fix the text column afterwards, or is there a wrapper method so that I can parse it correctly the first time I read in the JSON?
Any help is appreciated, thank you!
Here is an example JSON snippet pulled from my larger file
{"created_at":"Sun Jun 10 00:01:12 +0000 2018","id":100565760896,"id_str":"1005600896","text":"think you’re continuing the conversation","source":"Twitter for iPhone","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":403340,"id_str":"40311840","name":"Dvo","screen_name":"ImBorau","location":"Florida, USA","url":"http://Instagram.com/ ","description":"ucf | I your sarcastic quips","translator_type":"none","protected":false,"verified":false,"followers_count":43,"friends_count":166,"listed_count":0,"favourites_count":839,"statuses_count":1460,"created_at":"Wed Nov 02 01:41:45 +0000 2011","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":null,"contributors_enabled":false,"is_translator":false,"profile_background_color":"9AE4E8","profile_background_image_url":"http://abs.twimg.com/images/themes/theme16/bg.gif","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme16/bg.gif","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"BDDCAD","profile_sidebar_fill_color":"DDFFCC","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http://pbs.twimg.com/profile_images/10014987138688/RYbZNdVR_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/100149871633688/RYbNdVR_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/40318340/107757914","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null,"updated":["description","name"]},"geo":null,"coordinates":null,"place":{"id":"4ec0163497","url":"https://api.twitter.com/1.1/geo/id/4ec1c9db497.json","place_type":"admin","name":"Florida","full_name":"Florida, USA","country_code":"US","country":"United States","bounding_box":{"type":"Polygon","coordinates":[[[-87.634643,24.396308],[-87.634643,31.001056],[-79.974307,31.001056],[-79.974307,24.396308]]]},"attributes":{}},"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1528588108","matching_rules":[{"tag":null,"id":484862573421,"id_str":"48486970421"}]}

Related

Importing, editing and saving JSON in a simple way?

I'm having problems with editing a JSON file and saving the results in a usable form.
My starting point is : modify-json-file-using-some-condition-and-create-new-json-file-in-r
In fact I want to do something even simpler and it still doesn't work! I'm using the jsonlite package.
An equivalent sample would look like this ...
$Apples
$Apples$Origin
$Apples$Origin$Id
[1] 2615
$Apples$Origin$season
[1] "Fall"
$Oranges
$Oranges$Origin
$Oranges$Origin$Id
[1] 2615
$Oranges$Origin$airportLabel
[1] "Orange airport"
$Oranges$Shipping
$Oranges$Shipping$ShipperId
[1] 123
$Oranges$Shipping$ShipperLabel
[1] "Brighter Orange"
I read the file, make some changes and save the resulting file back to HDD. Nothing simpler, right?
json_list = read_json(path = "../documents/dummy.json")
json_list$Apples$Origin$Id = 1234
json_list$Oranges$Origin$Id = 4567
json_list$Oranges$Shipping$ShipperLabel = "Suntan Blue"
json_modified <- toJSON(json_list, pretty = TRUE)
write_json(json_modified, path = "../documents/dummy_new.json")
json_list appears as character format under the RStudio file type column.
json_modified appears as json format under the RStudio file type column.
Why this difference?
Now, if I run the original file it works, but the modified file fails. The JSON format checks out and I can't see any errors.
The real file is bigger than the example above but the method I've used is the same.
Am I doing something wrong in the way I edit or save the file?
I'm really new to JSON and this is really frustrating!
Any ideas?
Thanks
In the absence of reproducible data, I can diagnose at least one potential problem.
Background
Within the jsonlite package, there exist functions that are mutual inverses:
jsonlite::fromJSON() converts from raw text (in JSON format) to R objects.
jsonlite::toJSON() converts from R objects to raw text (in JSON format).
Now this raw text (txt) might be
a JSON string, URL or file
As for jsonlite::read_json() and jsonlite::write_json(), they are also a pair of mutual inverses, which are like the former pair
except [that] they explicitly distinguish between path and literal input, and do not simplify by default.
That is, the latter are simply designed to handle file(path)s rather than strings of raw text.
So toJSON(fromJSON(txt = ...)) should return unchanged the text passed to txt, just as write_json(read_json(path = ...)) should write a file identical to that passed to path.
In short, toJSON() belongs with fromJSON(); while write_json() belongs with read_json().
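For instance, a minimal sketch of staying within one pair, using a made-up string rather than the asker's file:
library(jsonlite)
txt  <- '{"Apples":{"Origin":{"Id":2615,"season":"Fall"}}}'
x    <- fromJSON(txt)                  # JSON text -> R objects
back <- toJSON(x, auto_unbox = TRUE)   # R objects -> JSON text again
# read_json() / write_json() do the same round trip for files on disk instead of strings.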
The Problem
However, you have added a spurious step by mingling toJSON() with read_json() and write_json():
json_list = read_json(...)
# ...
json_modified <- toJSON(json_list, ...) # SPURIOUS STEP
# ...
write_json(json_modified, ...)
You see, write_json() already converts "to JSON", so toJSON() is wholly unnecessary. Indeed, toJSON() actually sabotages the process, since its textual return value is passed on (in json_modified) to write_json(), which expects (a structure of) R objects rather than text.
The Fix
Once you're done modifying json_list, just go straight to writing it:
json_list = read_json(path = "../documents/dummy.json")
json_list$Apples$Origin$Id = 1234
# Further modifications...
write_json(json_list, path = "../documents/dummy_new.json", pretty = TRUE)

Tabulator Cell datetime object - Need to Format

I am attempting to reformat a datetime value in a cell that contains TZ UTC data. An example value is: 2019-12-09T14:50:47.000Z-0500
I need it to display as:
MM/DD/YYYY HH:mm:ssXM - ex: 12/09/2019 02:50:47PM
Local time, of course.
I have tried reading the moment.js doc without success. Here is a snippet I have attempted. The table shows up with "blank rows." If I remove the formatting, the data shows correctly but not with the date and time format I would like.
{title:"Last Submitted", field:"createdOn", sorter:"date", formatter:"datetime", formatterParams:{inputFormat:"YYYY-MM-DD hh:mm:ss", outputFormat:"MM/DD/YYYY", invalidPlaceholder:"(Invalid date)"}},
Any assistance would be greatly appreciated!
Ben
UPDATE BASED ON ANSWER 12/26/2019
Thank you again for responding. However, this is perhaps an issue for the author of Tabulator since I copied the inputFormat and outputFormat verbatim into the column definition of a Tabulator component and it displays blank rows. If I remove the column cell formatter (which is a wrapper around moment.js code), the list displays with the full timestamp (including UTC / zulu time).
ex:
2019-12-09T12:50:47.000Z-0500
Expected result (either 24h or 12h format doesn't matter at this point - and I did try to remove the "A" for the AM/PM indicator)
Unfortunately, I cannot upload the code for this project since it makes internal WS calls for JSON results (which is another issue - Remote Pagination does not appear to be working.)
Here is the source code for the column:
{title:"Last Submitted", field:"createdOn", sorter:"date", formatter:"datetime", formatterParams:{inputFormat:"YYYY-MM-DD[T]HH:mm:ss.SSS[Z]Z", outputFormat:"MM/DD/YYYY HH:mm:ssA",invalidPlaceholder:"(Invalid date)"}},
As stated above, if I add the formatter, a blank table appears and nothing else. If I remove the formatter, all data is displayed including the unformatted date (well, it's formatted in a way that neither I nor my users will want).
Any ideas would be greatly appreciated!
Image of Result with datetime formatter
With moment.js you can parse a date if you know the format of the input string:
moment(inDate, inFormat);
For example:
moment('12-25-1995', 'MM-DD-YYYY');
In your case the format of the input string is YYYY-MM-DD[T]HH:mm:ss.SSS[Z]Z - square brackets work as escape characters.
You can get a formatted string from a moment object with the .format method:
moment().format(outFormat);
For example:
moment().format('MM-DD-YYYY');
In your case the format of the output string is MM/DD/YYYY HH:mm:ssA - you can read more in the docs.
You can see how both parsing and formatting work together in the snippet below:
let inDate = '2019-12-09T14:50:47.000Z-0500',
inFormat = 'YYYY-MM-DD[T]HH:mm:ss.SSS[Z]Z',
outFormat = 'MM/DD/YYYY HH:mm:ssA',
outDate = moment(inDate, inFormat).format(outFormat);
console.log(`In Date: ${inDate}`);
console.log(`Out Date: ${outDate}`);
<script src="https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.24.0/moment.min.js"></script>

Format POSIX in R (quantstrat)

I'm working on extracting a date from a variable: "curIndex."
Here's what the code looks like:
show(txntime1 <- timestamp(mktdata[curIndex+1L])[,1])
show(txntime <- strftime(txntime1, '%Y-%m-%d %H:%M:%OS6'))
And the output is this:
"##------ Tue Mar 08 14:31:58 2016 ------##"
"NULL"
I'm working within ruleOrderProc of the quantstrat package.
The order time needs to be POSIXlt for the order book. Does anyone know what to do with this funky date format that I'm getting?
If so, thanks!
When all else fails, read the documentation. ;-) ?timestamp says:
The timestamp function writes a timestamp (or other message) into the history and echos it to the console. On platforms that do not support a history mechanism only the console message is printed.
You probably meant to call time or index. Also, the time needs to be POSIXct for the order book, not POSIXlt.
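For example, a sketch of the likely intent (assuming mktdata is the xts object available inside quantstrat rule functions, as in the original code):
show(txntime1 <- index(mktdata[curIndex + 1L]))            # the bar's timestamp (POSIXct for intraday xts data), not a history entry
show(txntime <- strftime(txntime1, '%Y-%m-%d %H:%M:%OS6')) # formatted string version, if one is needed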

Copy Row if Sheet1 A contains part of Sheet2 C

So I'm trying to pull the data in a row from a separate sheet (Sheet2!), if part of Col A contains the date that is in Sheet1!C1.
Col A ex: "Build 251 at Fri Jun 12 03:03:49 2015"
Col C1 ex: "Fri Jun 12" (Changes date every couple days)
I've tried these formulas but they don't work. The errors I get back are "finished with no results"; "error filter has mismatched range sizes"; "there is no ColumnA"; "formula parse error"
=filter("'GitHub-Changelog'!A", ("'GitHub-Changelog'!A" = 'x64 RSS Data'!C2))
=QUERY('GitHub-Changelog'!A:F,"select * where A contains '(TRANSPOSE(" "&C1:C&" "))'")
=FILTER('GitHub Changelog'!A,MMULT(SEARCH(TRANSPOSE(" "&'x64 RSS Data'!C1:C&" ")," "&'GitHub-Changelog'!A1:A&" "),SIGN(ROW('GitHub-Changelog'!A1:A))))
I'm not sure why I'm not getting results; the date is in A. If I use =QUERY('GitHub-Changelog'!A:F,"select * where A contains 'Fri Jun 12'") it prints out the single row; it's just not reading C1 for some reason, and I need it to be dynamic to match whatever C1 changes to.
*The true future ideal goal would be to check Sheet1!C against Sheet2!A: if part of A contains C, then copy the whole row (Sheet2!A:F) into a single cell (Sheet1!E), along the lines of IF Sheet2!A contains Sheet1!C1 THEN Sheet1!E = Sheet2!D&C&B. I believe that needs a full script to accomplish, so I'm not sure how to do it yet, but I will learn; one thing at a time (just thought I'd share a better version of what I'm trying to accomplish).
Here is the sheet I'm working on: https://docs.google.com/spreadsheets/d/1lPOwiYGBK0kSJXXU9kaQjG7WNHjnNuxy25WCUudE5sk/edit?usp=sharing. It pulls multiple pages onto different sheets, then has cleanup pages of the data. The plan is to have an update sheet that searches the changelog info for the date of the current build and puts that data next to the build. So the final sheet will show the most recent build plus the commit changes for that nightly build. That's where this function is being used: to scrape the changelog for the same date.
See if this works:
=query('GitHub-Changelog'!A:F; "where A contains '"&C1&"' ")
where C1 (on the same sheet as the formula) is the cell that holds the date (ex: Fri Jun 12).
You don't need to surround the range with "".
Also, you can use Find() in your filter to check whether that date is present in the string.
Here is a working Filter formula:
=FILTER('GitHub-Changelog'!A:F, Find('x64 RSS Data'!C1,'GitHub-Changelog'!A:A))

What data notation is this?

I came across this chunk of data while going through a theme's metadata in WordPress. It looks like instead of using several metadata keys for different bits of data, they smooshed it all together in one chunk. This in particular is metadata for an event post type:
a:3:{s:8:"dateFrom";s:16:"Mon, 10 Feb 2014";s:6:"dateTo";s:16:"Mon, 10 Feb 2014";s:8:"location";s:87:"Convention Center";}"
I mostly just want to extract "dateFrom" so I can display it in a widget.
It looks like for other events the only things that change are the actual values (dates, location). The parts that look like [a-z]:[0-9]* (which seem to be keys, but they aren't valid JSON keys because of the colons) are constant.
That value is PHP serialized. If you unserialize it, it'll be converted to an array. So something like this (untested):
// Note: each s:N length must match the string's actual byte count, so the shortened
// "Convention Center" example needs s:17 (and no stray trailing quote) for unserialize() to work.
$orig = 'a:3:{s:8:"dateFrom";s:16:"Mon, 10 Feb 2014";s:6:"dateTo";s:16:"Mon, 10 Feb 2014";s:8:"location";s:17:"Convention Center";}';
$converted = unserialize($orig); // associative array keyed by dateFrom, dateTo, location
echo $converted['dateFrom'];     // prints "Mon, 10 Feb 2014"
should do the trick
