How to import multiple JSON objects from a JSON file into R as a dataframe, in such a way that all values are consecutive rows and names are columns

I need it as a dataframe in order to create predictive and classification models on it in R. The json file looks as follows:
{"reviewerID": "A2IBPI20UZIR0U", "asin": "1384719342", "reviewerName": "cassandra tu \"Yeah, well, that's just like, u...", "helpful": [0, 0], "reviewText": "Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,", "overall": 5.0, "summary": "good", "unixReviewTime": 1393545600, "reviewTime": "02 28, 2014"}
{"reviewerID": "A14VAT5EAX3D9S", "asin": "1384719342", "reviewerName": "Jake", "helpful": [13, 14], "reviewText": "The product does exactly as it should and is quite affordable.I did not realized it was double screened until it arrived, so it was even better than I had expected.As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording. :DIf you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!Buy this product! :]", "overall": 5.0, "summary": "Jake", "unixReviewTime": 1363392000, "reviewTime": "03 16, 2013"}
{"reviewerID": "A195EZSQDW3E21", "asin": "1384719342", "reviewerName": "Rick Bennette \"Rick Bennette\"", "helpful": [1, 1], "reviewText": "The primary job of this device is to block the breath that would otherwise produce a popping sound, while allowing your voice to pass through with no noticeable reduction of volume or high frequencies. The double cloth filter blocks the pops and lets the voice through with no coloration. The metal clamp mount attaches to the mike stand secure enough to keep it attached. The goose neck needs a little coaxing to stay where you put it.", "overall": 5.0, "summary": "It Does The Job Well", "unixReviewTime": 1377648000, "reviewTime": "08 28, 2013"}
{"reviewerID": "A2C00NNG1ZQQG2", "asin": "1384719342", "reviewerName": "RustyBill \"Sunday Rocker\"", "helpful": [0, 0], "reviewText": "Nice windscreen protects my MXL mic and prevents pops. Only thing is that the gooseneck is only marginally able to hold the screen in position and requires careful positioning of the clamp to avoid sagging.", "overall": 5.0, "summary": "GOOD WINDSCREEN FOR THE MONEY", "unixReviewTime": 1392336000, "reviewTime": "02 14, 2014"}
... and 100 more.
I tried the rjson package but that just imports the first json object from the file and not the others.
library("rjson")
json_file <- "reviews_Musical_Instruments_5.json"
json_data <- fromJSON(paste(readLines(json_file), collapse=""))
The expected result is a dataframe with "reviewerID", "asin", "reviewerName", etc. as columns and their values as consecutive rows.

Here I fix the JSON to be regular JSON and use jsonlite::fromJSON() to get a data.frame:
library(jsonlite)
json_file <- "reviews_Musical_Instruments_5.json"
json_file_contents <- readLines(json_file)
json_file_contents <- paste(json_file_contents, collapse = ",")
json_file_contents <- paste(c("[", json_file_contents, "]"), collapse = "")
fromJSON(json_file_contents)

(Based on the comments.)
The data is not regular JSON; it's "NDJSON", i.e., Newline-Delimited JSON. The (only?) difference is that each line is self-sufficient JSON.
If this were regular JSON, one would need to encapsulate all of these within a list (or similar) by prepending an open-bracket [, putting commas between each element (a dict here), and appending a close-bracket ]. JSON-structured streaming data is nice, but if a client connects after the leading [ then everything else is invalid. Ergo, NDJSON.
For you, just use jsonlite::stream_in(file(json_file)) and all should work.
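A minimal sketch of that call (using the question's file name):
library(jsonlite)
json_file <- "reviews_Musical_Instruments_5.json"
# stream_in() reads NDJSON line by line and binds the objects into a data.frame
reviews <- stream_in(file(json_file))
str(reviews)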

Related

Converting a dataframe which contains lists into a csv with R

I am new to R and I am facing difficulties converting my dataframe (named dffinal), which contains list columns, into a csv.
I tried the following code which gave a csv that is not usable:
dput(dffinal, file="out.txt")
new <- source("out.txt")
write.csv2(dffinal,"C:/Users\\final.csv", row.names = FALSE)
I tried all the options but found nothing! Here is a sample of my dataframe:
dput(head(dffinal[1:2]))
structure(list(V1 = list("I heard about your products and I would like to give it a try but I'm not sure which product is better for my dry skin, Almond products or Shea Butter products? Thank you",
"Hi,\n\nCan you please tell me the difference between the shea shower oil limited edition and the other shower gels? I got a sample of one in a kit that had a purple label on it. (Please see attached photo.) I love it!\nBut, what makes it limited edition, the smell or what? It is out of stock and I was wondering if it is going to be restocked or not?\n\nAlso, what makes it different from the almond one?\n\nThank you for your help.",
"Hello, Have you discontinued Eau de toilette", "I both an eGift card for my sister and she hasn't received anything via her email\n\nPlease advise \n\nThank you \n\n cann",
"I do not get Coco Pillow Mist. yet. When are you going to deliver it? I need it before January 3rd.",
"Hello,\nI wish to follow up on an email I just received from Lol, notifying\nme that I've \"successfully canceled my subscription of bun Complete.\"\nHowever, I didn't request a cancelation and was expecting my next scheduled\nfulfillment later this month. Could you please advise and help? I'd\nappreciate it if you could reinstate my subscription.\n"),
V2 = list("How long can I keep a product before opening it? shea butter original hand cream large size 5oz, i like to buy a lot during sales promotions, is this alright or should i only buy what i'll use immediately, are these natural organic products that will still have a long stable shelf life? thank you",
"Hi,\nI recently checked to see if my order had been delivered, and I only received my gift box and free sample. Can you please send the advent calendar? Does not seem to have been included in the shipping. Thank you",
"Is the gade fragrance still available?", "I previously contacted you because I purchased your raspberry lip scrub. When I opened the scrub, 25% of the product was missing. Your customer service department agreed to send me a replacement, but I never received the replacement rasberry lip scrub. Could you please tell me when I will receive the replacement product? Thanks, me",
"To whom it may concern:\n\nI have 3 items in my order: 1 Shea Butter Intensive Hand Balm and 2 S‚r‚nit‚ Relaxing Pillow Mist. I have just received the hand balm this morning. I was wondering when I would receive the two bottles of pillow mist.\n\nThanks and regards,\n\nMe",
"I have not received 2X Body Scalp Essence or any shipment information regarding these items. Please let me know if and when you will be shipping these items, otherwise please credit my card. Thanks")), row.names = c(NA,
6L), class = "data.frame")
We can do this in tidyverse:
library(dplyr)
library(readr)
dffinal %>%
  mutate(across(everything(), unlist)) %>%
  write_csv('result.csv')
If the lists are all of length 1 for every row, as in the shared example, using unlist will work -
dffinal[] <- lapply(dffinal, unlist)
If the lists have length greater than 1, use -
dffinal[] <- lapply(dffinal, sapply, toString)
Write the data with write.csv -
write.csv(dffinal, 'result.csv', row.names = FALSE)
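To illustrate the difference on a small self-contained example (hypothetical data, not dffinal):
# a data.frame with a list column whose entries have length 2
df <- data.frame(x = I(list(1:2, 3:4)))
# unlist would produce 4 values for 2 rows here; toString collapses each entry instead
df[] <- lapply(df, sapply, toString)
df$x
# [1] "1, 2" "3, 4"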

Add \\ to escape non-UTF 8 characters within a string using regex

I am working with a dataset that has free text containing special characters. I need to clean the text before using strsplit for a subsequent function, but would prefer to add escapes (\\) before the special characters rather than delete them altogether.
For example, the string that looks like this:
Do you love great hair? Wind it up! Your curls are your gift- set
them free and help preserve your natural curl with bounce and
definition. Cleanse hair without weighing it down while reducing
frizz. Infused with pineapple, argan oil and quinoa. Let your natural
beauty shine bright!
Should look like this:
Do you love great hair\\? Wind it up\\! Your curls are your gift\\- set
them free and help preserve your natural curl with bounce and
definition. Cleanse hair without weighing it down while reducing
frizz. Infused with pineapple, argan oil and quinoa. Let your natural
beauty shine bright\\!
I've figured out how to remove a list of several special characters (~!@#$%^&*(){}|<>/), but can't find the right tutorial for adding \\ before them.
Note: I am not looking to remove ALL punctuation because some characters are used for subsequent delimiting logic. Instead, I want to address a specific subset of special characters.
Sample data:
>dput(tar$clean.text[1:10])
list(c("Dove go fresh Cucumber and Green Tea Beauty Bar combines the refreshing scent of cucumber and green tea with Dove's gentle cleansers and _ moisturizing cream. Dove Beauty Bar is proven to be more gentle and mild on skin than ordinary soap. It can be used on your hands and as a mild facial cleanser, so if you're also after a fresh face and refreshed hands throughout the day, why not try adding Dove Beauty Bar go fresh Cucumber and Green Tea to your skin care routine? Light, hydrating feel and refreshing formula that effectively nourishes skin. A refreshing shower can be just what you need to start the day off right. Dove's go fresh range blends nourishing ingredients and light, fresh scents in a formula that's gentle on your skin. Dove go fresh beauty bars give you a feeling of hydrating freshness that leaves you and your skin feeling blissfully revived. For best results: Your hands are one of the driest parts of your body so give them a boost and lather your Dove beauty bar between wet hands. Once you've covered your body with the rich lather, making sure to avoid contact with your eyes, rinse away thoroughly. At Dove, our vision is of a world where beauty is a source of confidence, and not anxiety. So, we are on a mission to help the next generation of women develop a positive relationship with the way they look - helping them raise their self-esteem and realize their full potential.",
"Scent: Cucumber", "Health Facts: Sulfate-free", "Suggested Age: 5 Years and Up",
"Wellness Standard: Aluminum-free, paraben-free", "Recommended Skin Type: Normal",
"Beauty Purpose: Moisturizing, basic cleansing", "Package Quantity: 1",
"TCIN: 10819409", "UPC: 011111611023", "Item Number (DPCI): 049-00-0604"
), c("Me! Bath Bath Bomb Papaya Nectar 6 ct is a great idea to add to a spa gift basket. These bath bombs are like scoops for your bath to make mini bath ice cream that gives you super soft skin.",
"Scent: Fruit", "Health Facts: Vegan, paraben-free, aluminum-free",
"Product Form: Bath bomb", "Suggested Age: Adult Use Only", "Wellness Standard: Aluminum-free, cruelty-free, paraben-free, vegan",
"Recommended Skin Type: Normal", "Sustainability Claims: Cruelty-free",
"TCIN: 18828570", "UPC: 858858000358", "Item Number (DPCI): 037-08-1164"
), NA_character_, NA_character_, c("Aura Cacia pure essential oils in 4 fl oz Body Oil has a lavender and cocoa butter scent. This natural skin care oil shows skin tone improvement that you can feel.",
"Scent: Lavender, Cocoa Butter", "Health Facts: Contains lavender, butylparaben-free, phthalate-free, formaldehyde donor-free, formaldehyde-free, nonylphenol ethoxylate free, propylparaben-free, Sulfate-free, paraben-free, dye-free, aluminum-free",
"Product Form: Lotion", "Suggested Age: All Ages", "Recommended Skin Type: Normal",
"Beauty Purpose: Skin tone improvement", "Sustainability Claims: Not tested on animals, cruelty-free",
"TCIN: 50030689", "UPC: 051381911720", "Item Number (DPCI): 037-05-1378"
), c("Deep clean pores with the Facial Cleansing Brush from Eco",
"Tools. This compact brush features soft bristles for moderate exfoliation, leaving you with soft, supple skin. Your serums and moisturizers can more effectively penetrate your skin once all the dead skin cells are out of the way. The compact size is ideal for packing in your weekend tote or suitcase for cleansing on the go.",
"Material: Nylon", "Suggested Age: All Ages", "Beauty Purpose: Basic cleansing, exfoliating",
"TCIN: 52537254", "UPC: 079625074864", "Item Number (DPCI): 037-08-2254"
), c("Deep Steep Rosemary Mint Sugar Scrub gently exfoliates dead skin cells while moisturizing, leaving smooth, radiant, polished skin. This formula is made up of a smooth blend of shea butter, cocoa butter and carefully sourced sugar to give you light, blissful fragrance with just the right amount of exfoliation and no harsh scratching. Apply desired amount of Deep Steep Rosemary Mint Sugar Scrub to wet skin from shoulders to ankles. Massage in a circular motion. Rinse.",
"Scent: Rosemary", "Health Facts: Contains argan oil, contains coconut oil, contains shea butter, formaldehyde donor-free, gluten-free, dye-free, ethyl alcohol-free, paraben-free, phthalate-free, vegan",
"Product Form: Scrub", "Suggested Age: All Ages", "Recommended Skin Type: Dry, normal",
"Beauty Purpose: Exfoliating", "TCIN: 53242409", "UPC: 674749101153",
"Item Number (DPCI): 037-08-2123"), NA_character_, c("Want to feel gorgeously soft skin every day? Transform your daily shower into an irresistible treat with the exquisitely fragranced Caress Evenly Gorgeous body wash. Indulge your skin with a rich exfoliating lather delicately scented with burnt brown sugar and karite butter that makes this body wash smell good enough to eat. Subtle notes of soft crisp apple and berry open up to a bold floral heart, while rich scents of warm tonka bean, vanilla and balsam together round out the lush lather to leave you with perfectly buffed and glowing skin. Caress Evenly Gorgeous is a revitalizing body wash that blends rich, luxurious lather with expertly crafted fine fragrance It is a body wash that gently cleanses your skin to leave it delicately fragrant, beautifully soft.",
"Lather up and indulge in a deeply cleansing and reviving shower experience. With fine floral fragrance and gentle exfoliates, Caress Evenly Gorgeous will leave you feeling delicately perfumed and silky-smooth, making this the perfect body wash for every day? and every night. Caress body wash and beauty bar fragrances are crafted by the world's best perfumers to transform your daily shower into an indulging experience that will make you feel special every day?Scent: Fresh",
"Health Facts: Aluminum-free, paraben-free, fluoride-free", "Product Form: Liquid",
"Suggested Age: 5 Years and Up", "Wellness Standard: Aluminum-free, paraben-free",
"Recommended Skin Type: Normal", "Beauty Purpose: Basic cleansing",
"Package Quantity: 1", "TCIN: 13446229", "UPC: 011111014909",
"Item Number (DPCI): 049-00-0806"), c("Maintain a sanitary and healthy atmosphere with the MEDLINE n/a READYBATH, PREMIUM,FRAG FREE, 8/PK - 24pks. These sterile swab sticks are pre-treated with povidone-iodine for preparing skin for incision and other medical issues. Comes in disposable packages of 3.",
"Scent: Unscented", "Health Facts: No fragrance added", "Suggested Age: Adult Use Only",
"Recommended Skin Type: Normal", "Beauty Purpose: Basic cleansing",
"Package Quantity: 1", "TCIN: 14339945", "UPC: 080196731445",
"Item Number (DPCI): 037-13-0198"))`
Code that removes a list of symbols:
tar$clean.text <- str_replace_all(tar$clean.text, "~|!|@|#|$|%|^|&|\\*|\\(|\\)|\\{|\\}|_|\\\\|<|>|\\?|\\[|\\]|-", "") # Removes a ton of non-UTF characters
I'm sure there is a simple modification to my regexp, but can't seem to figure it out. All previous answers I've found are more specific to fixing a specific text pattern, rather than generally replacing across a lot of different variations.
You may use
str_replace_all(x, "[~!@#$%^&*(){}_\\\\<>?\\[\\]|-]", "\\\\\\0")
A base R approach:
gsub("([]\\~!##$%^&*(){}_<>?[|-])", "\\\\\\1", "~!##$%^&*(){}_\\<>?[]|-")
Details
[ - start of a character class matching any of the following chars:
~ - ~
! - !
@ - @
# - #
$ - $
% - %
^ - ^ (if you put it at the start, escape with \\)
& - &
* - * (no need to escape inside a character class)
( - (
) - )
{ - {
} - }
_ - _ (note it is a word char, and \W would not match it)
\\\\ - a \ char (a literal \ escaped with another literal \)
< - a <
> - >
? - ?
\\[ - a [ char (in ICU regex, it must be escaped inside a character class)
\\] - a ] char (ibid.)
| - a | char (it is not an OR operator inside a character class)
- - a - char
] - end of the character class.
The "\\\\\\0" string replacement pattern is parsed as two literal backslashes that defines a singular literal backslash and a \0 literal string that is a backreference to the whole match in the ICU regex in R.
Note that the TRE regex used by gsub is a bit trickier: ] must be the first char in the character class, [ should not be escaped, a literal \ should be single (no regex escape sequences are supported inside TRE bracket expressions), and - must be at the end. Also, there is no backreference to the whole match; hence, you need to wrap the whole pattern in a capturing group and replace with the \1 backreference.
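As a quick check, here is the base R version applied to a shortened piece of the question's example text (a sketch; output shown as R prints it, where each inserted backslash displays doubled):
x <- "Do you love great hair? Wind it up! Your curls are your gift- set them free."
gsub("([]\\~!@#$%^&*(){}_<>?[|-])", "\\\\\\1", x)
# [1] "Do you love great hair\\? Wind it up\\! Your curls are your gift\\- set them free."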
Let dat <- tar$clean.text[1:10]; then you can do:
Map(gsub, "([[:punct:]])", "\\\\\\1", dat)

automate input to user query in R

I apologize if this question has been asked with terminology I don't recognize but it doesn't appear to be.
I am using the function comm2sci in the library taxize to search for the scientific names for a database of over 120,000 rows of common names. Here is a subset of 10:
commnames <- c("WESTERN CAPERCAILLIE", "AARDVARK", "AARDWOLF", "ABACO ISLAND BOA",
"ABBOTT'S DAY GECKO", "ABDIM'S STORK", "ABRONIA GRAMINEA", "ABYSSINIAN BLUE
WINGED GOOSE",
"ABYSSINIAN CAT", "ABYSSINIAN GROUND HORNBILL")
When searching with the NCBI database in this function, it asks for user input if the common name is generic/general and not species specific; for example, the following call will ask for clarification for "AARDVARK", expecting '1', '2', or 'return' for 'NA'.
install.packages("taxize")
library(taxize)
ncbioutput <- comm2sci(commnames, db = "ncbi")###querying ncbi database
Because of this, I cannot rely on this function to find the names of the 120000 species without me sitting and entering 'return' every few minutes. I know this question sounds taxize specific - but I've had this situation in the past with other functions as well. My question is: is there a general way to place the comm2sci call in a conditional statement that will return a specific value when user input is prompted? Or otherwise write a function that will return some input when prompted?
All searches related to this tell me how to ask for user input but not how to override user queries. These are two of the question threads I've found, but I can't seem to apply them to my situation: Make R wait for console input?, Switch R script from non-interactive to interactive
I hope this was clear. Thank you very much for your time!
So the get_* functions, used internally, all by default ask for user input when there is > 1 option. But all of those functions have a sister function with an underscore, e.g., get_uid_, that does not prompt for input and returns all data. You can use that to get all the data, then process it however you like.
Made some changes to comm2sci, so update first: devtools::install_github("ropensci/taxize")
Here's an example.
library(taxize)
commnames <- c("WESTERN CAPERCAILLIE", "AARDVARK", "AARDWOLF", "ABACO ISLAND BOA",
"ABBOTT'S DAY GECKO", "ABDIM'S STORK", "ABRONIA GRAMINEA",
"ABYSSINIAN BLUE WINGED GOOSE",
"ABYSSINIAN CAT", "ABYSSINIAN GROUND HORNBILL")
Then use get_uid_ to get all data
ids <- get_uid_(commnames)
Process the results in ids as you like. Here, for brevity, we'll just grab the first row of each
ids <- lapply(ids, function(z) z[1,])
Then grab the uids out
ids <- as.uid(unname(vapply(ids, "[[", "", "uid")), check = FALSE)
And pass to comm2sci
comm2sci(ids)
$`100830`
[1] "Tetrao urogallus"
$`9818`
[1] "Orycteropus afer"
$`9680`
[1] "Proteles cristatus"
$`51745`
[1] "Chilabothrus exsul"
$`8565`
[1] "Gekko"
$`39789`
[1] "Ciconia abdimii"
$`278977`
[1] "Abronia graminea"
$`8865`
[1] "Cyanochen cyanopterus"
$`9685`
[1] "Felis catus"
$`153643`
[1] "Bucorvus abyssinicus"
Note that NCBI returns common names from get_uid/get_uid_, so you can just go ahead and pluck those out if you want
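For example (a sketch; "commonname" is a guessed column name here, so inspect str(ids_all[[1]]) to confirm what the NCBI results actually contain):
ids_all <- get_uid_(commnames)
# each element should be a data.frame of candidate matches for one common name
lapply(ids_all, function(z) z$commonname)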

SnowballC in R stems "many" and "only"

I am using SnowballC to process a text document, but realize it stems words such as "many" and "only" even though they are not supposed to be stemmed.
> library(SnowballC)
>
> str <- c("many", "only", "things")
> str.stemmed <- stemDocument(str)
> str.stemmed
[1] "mani" "onli" "thing"
>
> dic <- c("many", "only", "online", "things")
> str.complete <- stemCompletion(str.stemmed, dic)
> str.complete
mani onli thing
"" "online" "things"
You can see that after stemming, "many" and "only" became "mani" and "onli", which cannot be completed back with stemCompletion later on, since "mani" is not a prefix of "many". Notice how "onli" gets completed to "online" (which does start with "onli") instead of the original "only".
Why is that? Is there a way to fix this?
Stemming is often executed as a set of rules for stripping all affixes--both derivational and inflectional--from a word, leaving its root. Lemmatization typically only removes inflectional affixes. Stemming is a much more aggressive version of lemmatization. Given what you want, it seems like you'd prefer lemmatization.
To compare the two, most lemmatizers are limited to a few rules for dealing with affixes to nouns and verbs in English (-ed, -s, and -ing, for example). There are a few irregular cases they have to handle, but with some training data, many are probably covered.
Stemmers are expected to dig deeper. As a result, the space of possible transformations they can make is bigger, so you're a lot more likely to end up with errors.
To see what's happening in your data, let's look at the specifics.
online -> onli: why on earth would this happen? Not totally sure on this one; there's probably some rule that tries to cater to words like medic-ine and medic-al, sub-mari-ne and mari-ne, imagi-ne and imagi-na-tion.
only -> onli, many -> mani: These seem particularly strange, but are probably more reasonable than the previous rule--especially in the context of dealing with verbs that end in -ed. If you're stemming the words denied, studied, modified, specified, you'll want them to be equivalent to their uninflected forms deny, study, modify, specify.
You could have a rule to transform each verb into the uninflected form, but the authors here chose to make the roots the forms ending in -i. To ensure that these match, -y endings had to be transformed to -i as well.
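You can see the pairing directly with SnowballC's wordStem (a quick sketch; output is from the default Porter stemmer):
library(SnowballC)
wordStem(c("deny", "denied", "study", "studied"))
# [1] "deni"  "deni"  "studi" "studi"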
With a lemmatizer, you might get more predictable results. Since they only remove inflectional affixes, you'd get only, many, online, and thing, as you wanted. Both a good stemmer and lemmatizer can work well, but the stemmer does more stuff and therefore has more room for error.
That is how stemmers work. You've got a (smallish) set of rules that reduce most words to something resembling a canonical form (a stem), but not quite. There are many other corner cases you will find, so many in fact that I hesitate to call them corner cases, e.g.
many -> mani
other -> other
corner -> corner
cases -> case
in -> in
sentences -> sentenc
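These are straightforward to reproduce (a sketch using SnowballC's wordStem, which applies the Porter rules by default):
library(SnowballC)
wordStem(c("many", "other", "corner", "cases", "in", "sentences"))
# [1] "mani"    "other"   "corner"  "case"    "in"      "sentenc"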
What you want is a lemmatiser. Have a look at this question for a more detailed explanation:
Stemmers vs Lemmatizers

I need a 2D bone engine for JavaScript

I'm looking for a way to define a Bone animation system, I need a basic one, since my objective is to apply it for inverse kinematics, much like Flash supports.
The desirable feature is that: I can set bones (as position in 2D, defined by 2 dots) each having an ID. So I can make an animation based on frames, ie:
['l_leg', [10, 0],[ 13,30 ] ] ['r_leg', [30, 0 ], [13, 30] ] //Frame 1 (standing)
['l_leg', [10, 0],[ 13,30 ] ] ['r_leg', [35, 30], [13, 30] ] //Frame 2 (lifting right leg)
...
I'm confident that defining Joints ain't necessary.
The lib may be in Ruby, since I can port it to JS, but if it's already in JS, even better :)
UPDATE: deprecated for a long time now.
I am developing my own: http://github.com/flockonus/javascriptinmotion
See Wikipedia: Express Animator.
1st result for Google: javascript skeletal.
