The code is:
library(rjson)
url <- 'file.json'
j <- fromJSON(file=url, method='C')
There are more than 1000 lines in file.json; however, the returned result is a list of only 9 elements.
The file.json is:
{"reviewerID": "A30TL5EWN6DFXT", "asin": "120401325X", "reviewerName": "christina", "helpful": [0, 0], "reviewText": "They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again", "overall": 4.0, "summary": "Looks Good", "unixReviewTime": 1400630400, "reviewTime": "05 21, 2014"}
{"reviewerID": "ASY55RVNIL0UD", "asin": "120401325X", "reviewerName": "emily l.", "helpful": [0, 0], "reviewText": "These stickers work like the review says they do. They stick on great and they stay on the phone. They are super stylish and I can share them with my sister. :)", "overall": 5.0, "summary": "Really great product.", "unixReviewTime": 1389657600, "reviewTime": "01 14, 2014"}
{"reviewerID": "A2TMXE2AFO7ONB", "asin": "120401325X", "reviewerName": "Erica", "helpful": [0, 0], "reviewText": "These are awesome and make my phone look so stylish! I have only used one so far and have had it on for almost a year! CAN YOU BELIEVE THAT! ONE YEAR!! Great quality!", "overall": 5.0, "summary": "LOVE LOVE LOVE", "unixReviewTime": 1403740800, "reviewTime": "06 26, 2014"}
What is the problem? Thanks!
Your file does not contain valid JSON. You basically have three JSON hashes (objects) sitting right next to each other. The exact choice of whitespace that separates the values doesn't matter; it's equivalent to this:
{} {} {}
That's just as invalid as if it were three primitives sitting right next to each other:
3 'a' true
Speaking generally, when the input to a function is invalid, all bets are off. It is desirable to write functions that fail gracefully and emit clear error messages describing the nature of the invalidity, and very often that is the case, but it doesn't always happen. In this case, what rjson::fromJSON() seems to do when it encounters this kind of invalid JSON is parse and return the first value and silently ignore everything else. That's unfortunate, but what can we do?
You should probably investigate how the file was generated and seek to correct the problem at that end. But if you want to hack a solution, we can read the lines of JSON into a character vector, paste-collapse them on commas, paste bracket delimiters around the resulting string, and then parse that string to get an array of hashes. This will only work if each adjacent hash occupies exactly one line in the file.
fromJSON(paste0('[', paste(readLines(url), collapse=','), ']'))
## [[1]]
## [[1]]$reviewerID
## [1] "A30TL5EWN6DFXT"
##
## [[1]]$asin
## [1] "120401325X"
##
## [[1]]$reviewerName
## [1] "christina"
##
## [[1]]$helpful
## [1] 0 0
##
## [[1]]$reviewText
## [1] "They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"
##
## [[1]]$overall
## [1] 4
##
## [[1]]$summary
## [1] "Looks Good"
##
## [[1]]$unixReviewTime
## [1] 1400630400
##
## [[1]]$reviewTime
## [1] "05 21, 2014"
##
##
## [[2]]
## [[2]]$reviewerID
## [1] "ASY55RVNIL0UD"
##
## [[2]]$asin
## [1] "120401325X"
##
## [[2]]$reviewerName
## [1] "emily l."
##
## [[2]]$helpful
## [1] 0 0
##
## [[2]]$reviewText
## [1] "These stickers work like the review says they do. They stick on great and they stay on the phone. They are super stylish and I can share them with my sister. :)"
##
## [[2]]$overall
## [1] 5
##
## [[2]]$summary
## [1] "Really great product."
##
## [[2]]$unixReviewTime
## [1] 1389657600
##
## [[2]]$reviewTime
## [1] "01 14, 2014"
##
##
## [[3]]
## [[3]]$reviewerID
## [1] "A2TMXE2AFO7ONB"
##
## [[3]]$asin
## [1] "120401325X"
##
## [[3]]$reviewerName
## [1] "Erica"
##
## [[3]]$helpful
## [1] 0 0
##
## [[3]]$reviewText
## [1] "These are awesome and make my phone look so stylish! I have only used one so far and have had it on for almost a year! CAN YOU BELIEVE THAT! ONE YEAR!! Great quality!"
##
## [[3]]$overall
## [1] 5
##
## [[3]]$summary
## [1] "LOVE LOVE LOVE"
##
## [[3]]$unixReviewTime
## [1] 1403740800
##
## [[3]]$reviewTime
## [1] "06 26, 2014"
##
##
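Alternatively, note that since each hash occupies exactly one line, every line is by itself valid JSON (a layout often called "JSON Lines" or NDJSON). Under that same one-hash-per-line assumption, a minimal sketch that parses the lines one at a time:
lapply(readLines(url), fromJSON)
This returns the same result: one list element per line of the file.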
I would like to get the phone numbers from a file. I know the numbers come in different forms, and I don't know how to write a pattern for each form using grep and regexpr in R. The numbers are written in these forms:
xxx-xxx-xxxx,
(xxx)xxx-xxxx,
xxx xxx xxxx,
xxx.xxx.xxxx
Try this:
phones <- c("foo 111-111-1111 bar", "(111)111-1111 quux", "who knows 111 111 1111", "111.111.1111 I do", "111)111-1111 should not work", "1111111111 ditto", "a 111-111-1111 b (222)222-2222 c")
# Match either "(ddd)" or "ddd" plus a separator, then "ddd", a separator, and "dddd"
re <- gregexpr("(\\(\\d{3}\\)|\\d{3}[-. ])\\d{3}[-. ]\\d{4}", phones)
regmatches(phones, re)
# [[1]]
# [1] "111-111-1111"
# [[2]]
# [1] "(111)111-1111"
# [[3]]
# [1] "111 111 1111"
# [[4]]
# [1] "111.111.1111"
# [[5]]
# character(0)
# [[6]]
# character(0)
# [[7]]
# [1] "111-111-1111" "(222)222-2222"
In the data, I provide a few examples with other text on both, either, and neither side, as well as two examples that should not match. (That is, a starter "test set": you want to make sure you both match the good examples and reject the bad ones.) The last one is meant to match multiple numbers in one string/sentence.
gregexpr and regmatches are useful for finding, extracting, or replacing regex substrings within one or more strings. For a "replace" example, one could do:
regmatches(phones, re) <- "GONE!"
phones
# [1] "foo GONE! bar" "GONE! quux"
# [3] "who knows GONE!" "GONE! I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a GONE! b GONE! c"
Obviously a contrived replacement, but certainly usable. Note though that this replacement form of regmatches operates by side effect, meaning that it modifies the phones vector in place instead of returning the new value. It's possible to avoid the side effect, but it is a little less intuitive:
phones # I reset it to the original value
# [1] "foo 111-111-1111 bar" "(111)111-1111 quux"
# [3] "who knows 111 111 1111" "111.111.1111 I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a 111-111-1111 b (222)222-2222 c"
`regmatches<-`(phones, re, value = "GONE!")
# [1] "foo GONE! bar" "GONE! quux"
# [3] "who knows GONE!" "GONE! I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a GONE! b GONE! c"
phones
# [1] "foo 111-111-1111 bar" "(111)111-1111 quux"
# [3] "who knows 111 111 1111" "111.111.1111 I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a 111-111-1111 b (222)222-2222 c"
Edit: scope-creep, i.e., normalizing the extracted numbers to a single format.
out <- unlist(Filter(length, regmatches(phones, re)))
out
# [1] "111-111-1111" "(111)111-1111" "111 111 1111" "111.111.1111" "111-111-1111"
# [6] "(222)222-2222"
gsub("[^0-9]", "", out)
# [1] "1111111111" "1111111111" "1111111111" "1111111111" "1111111111" "2222222222"
out <- gsub("[^0-9]", "", out)
sprintf("(%s)%s-%s", substr(out, 1, 3), substr(out, 4, 6), substr(out, 7, 10))
# [1] "(111)111-1111" "(111)111-1111" "(111)111-1111" "(111)111-1111" "(111)111-1111"
# [6] "(222)222-2222"
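If you need this extract-and-normalize step often, you could wrap the pieces above into a small helper; a sketch (the name extract_phones is mine, not from any package):
extract_phones <- function(x) {
  # find candidate numbers in any of the four accepted forms
  re <- gregexpr("(\\(\\d{3}\\)|\\d{3}[-. ])\\d{3}[-. ]\\d{4}", x)
  out <- unlist(Filter(length, regmatches(x, re)))
  # strip to bare digits, then reformat uniformly
  out <- gsub("[^0-9]", "", out)
  sprintf("(%s)%s-%s", substr(out, 1, 3), substr(out, 4, 6), substr(out, 7, 10))
}
extract_phones(phones)
# [1] "(111)111-1111" "(111)111-1111" "(111)111-1111" "(111)111-1111" "(111)111-1111"
# [6] "(222)222-2222"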
I'm working on a Shiny app which plots data trees. I'm looking to incorporate shinyTree to permit quick comparison of plotted nodes. The issue is that shinyTree returns a redundant list of lists for the sub-node plot.
The actual list of lists is included below. I would like to keep only the longest branches. I would also like to remove the id node (the integer node); I'm struggling to see why it even shows up, given the list. I have tried many different methods to work with this list, but it's been a real struggle; the list concept is difficult to understand.
I create the data.tree and plot via:
dataTree.a <- FromListSimple(checkList)
plot(dataTree.a)
> checkList
[[1]]
[[1]]$Asia
[[1]]$Asia$China
[[1]]$Asia$China$Beijing
[[1]]$Asia$China$Beijing$Round
[[1]]$Asia$China$Beijing$Round$`20383994`
[1] 0
[[2]]
[[2]]$Asia
[[2]]$Asia$China
[[2]]$Asia$China$Beijing
[[2]]$Asia$China$Beijing$Round
[1] 0
[[3]]
[[3]]$Asia
[[3]]$Asia$China
[[3]]$Asia$China$Beijing
[1] 0
[[4]]
[[4]]$Asia
[[4]]$Asia$China
[[4]]$Asia$China$Shanghai
[[4]]$Asia$China$Shanghai$Round
[[4]]$Asia$China$Shanghai$Round$`23740778`
[1] 0
[[5]]
[[5]]$Asia
[[5]]$Asia$China
[[5]]$Asia$China$Shanghai
[[5]]$Asia$China$Shanghai$Round
[1] 0
[[6]]
[[6]]$Asia
[[6]]$Asia$China
[[6]]$Asia$China$Shanghai
[1] 0
[[7]]
[[7]]$Asia
[[7]]$Asia$China
[1] 0
[[8]]
[[8]]$Asia
[[8]]$Asia$India
[[8]]$Asia$India$Delhi
[[8]]$Asia$India$Delhi$Round
[[8]]$Asia$India$Delhi$Round$`25703168`
[1] 0
[[9]]
[[9]]$Asia
[[9]]$Asia$India
[[9]]$Asia$India$Delhi
[[9]]$Asia$India$Delhi$Round
[1] 0
[[10]]
[[10]]$Asia
[[10]]$Asia$India
[[10]]$Asia$India$Delhi
[1] 0
[[11]]
[[11]]$Asia
[[11]]$Asia$India
[1] 0
[[12]]
[[12]]$Asia
[[12]]$Asia$Japan
[[12]]$Asia$Japan$Tokyo
[[12]]$Asia$Japan$Tokyo$Round
[[12]]$Asia$Japan$Tokyo$Round$`38001000`
[1] 0
[[13]]
[[13]]$Asia
[[13]]$Asia$Japan
[[13]]$Asia$Japan$Tokyo
[[13]]$Asia$Japan$Tokyo$Round
[1] 0
[[14]]
[[14]]$Asia
[[14]]$Asia$Japan
[[14]]$Asia$Japan$Tokyo
[1] 0
[[15]]
[[15]]$Asia
[[15]]$Asia$Japan
[1] 0
[[16]]
[[16]]$Asia
[1] 0
Well, I did cobble together a poor hack to make this work. Here is what I did to the checkList list:
checkList <- get_selected(tree, format = "slices")
# Convert and collapse shinyTree slices to a data.tree.
# This is a bit of a kludge to make the graphic work with
# shinyTree; an alternate one-liner is in the works.
# The transform works by finding the longest branches
# and plotting only them, since the other branches are
# subsets due to the slices.
library(stringr) # for str_replace_all() below
# Extract the checkList names (as characters) from the checkList
tmp <- names(unlist(checkList))
# Determine the depth of each checkList name
lens <- lapply(tmp, function(x) length(strsplit(x, ".", fixed=TRUE)[[1]]))
# Find the indices of the elements with the greatest depth
lens.max <- which(lens == max(sapply(lens, max)))
# Replace all '.' with '/', prepping for the pathString conversion
tmp <- relist(str_replace_all(tmp, "\\.", "/"), skeleton=tmp)
# Add a root node to work with multiple branches
tmp <- unlist(lapply(tmp, function(x) paste0("Root/", x)))
# Create a list of only the longest branches
longBranches <- as.list(tmp[lens.max])
# Convert the list into a data.frame for the data.tree conversion
longBranches.df <- data.frame(pathString = do.call(rbind, longBranches))
# Publish the data.frame for use
vals$selDF <- longBranches.df
#save(checkList, file = "chkLists.RData") # Save for troubleshooting
print(vals$selDF)
The new checkList looks like this:
[1] "Root/Europe/France/Paris/Round/10843285" "Root/Europe/France/Paris/Round"
[3] "Root/Europe/France/Paris" "Root/Europe/France"
[5] "Root/Europe/Germany/Berlin/Diamond/3563194" "Root/Europe/Germany/Berlin/Diamond"
[7] "Root/Europe/Germany/Berlin/Round/3563194" "Root/Europe/Germany/Berlin/Round"
[9] "Root/Europe/Germany/Berlin" "Root/Europe/Germany"
[11] "Root/Europe/Italy/Rome/Round/3717956" "Root/Europe/Italy/Rome/Round"
[13] "Root/Europe/Italy/Rome" "Root/Europe/Italy"
[15] "Root/Europe/United Kingdom/London/Round/10313307" "Root/Europe/United Kingdom/London/Round"
[17] "Root/Europe/United Kingdom/London" "Root/Europe/United Kingdom"
[19] "Root/Europe"
It works :)... but I think this could be done with a two-liner... I'll work on it again in a week or so. Any other ideas would be appreciated.
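For what it's worth, here is one possible shortening; a sketch I have not tested against live shinyTree output. Instead of keeping only the globally deepest paths, it drops every path that is a strict prefix of another path, so each subtree keeps its longest branch (startsWith() needs R >= 3.3.0):
paths <- paste0("Root/", gsub(".", "/", names(unlist(checkList)), fixed = TRUE))
longBranches.df <- data.frame(pathString = paths[!sapply(paths, function(p) any(startsWith(paths, paste0(p, "/"))))])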
I have the following dataframe, which contains reviews that customers have left on a restaurant website:
id<-c(1,2,3,4,5,6)
review<- c("the food was very delicious and hearty - perfect to warm up during a freezing winters day", "Excellent service as usual","Love this place!", "Service and quality of food first class"," Customer services was exceptional by all staff","excellent services")
df<-data.frame(id, review)
Now I am looking for a way (preferably without using a for loop) to find the part-of-speech labels in each customer's review in R.
This is a pretty straightforward adaptation of the example on the Maxent_POS_Tag_Annotator help page.
df <- data.frame(id, review, stringsAsFactors=FALSE)
library(NLP)
library(openNLP)
review.pos <-
  sapply(df$review, function(ii) {
    # annotate the whole review as a single sentence
    a2 <- Annotation(1L, "sentence", 1L, nchar(ii))
    # tokenize into words, then tag each word's part of speech
    a2 <- annotate(ii, Maxent_Word_Token_Annotator(), a2)
    a3 <- annotate(ii, Maxent_POS_Tag_Annotator(), a2)
    a3w <- subset(a3, type == "word")
    tags <- sapply(a3w$features, `[[`, "POS")
    # pair each token with its tag as "word/TAG"
    sprintf("%s/%s", as.String(ii)[a3w], tags)
  })
Which results in this output:
#[[1]]
# [1] "the/DT" "food/NN" "was/VBD" "very/RB" "delicious/JJ"
# [6] "and/CC" "hearty/NN" "-/:" "perfect/JJ" "to/TO"
#[11] "warm/VB" "up/RP" "during/IN" "a/DT" "freezing/JJ"
#[16] "winters/NNS" "day/NN"
#
#[[2]]
#[1] "Excellent/JJ" "service/NN" "as/IN" "usual/JJ"
#
#[[3]]
#[1] "Love/VB" "this/DT" "place/NN" "!/."
#
#[[4]]
#[1] "Service/NNP" "and/CC" "quality/NN" "of/IN" "food/NN"
#[6] "first/JJ" "class/NN"
#
#[[5]]
#[1] "Customer/NN" "services/NNS" "was/VBD" "exceptional/JJ"
#[5] "by/IN" "all/DT" "staff/NN"
#
#[[6]]
#[1] "excellent/JJ" "services/NNS"
It should be relatively straightforward to adapt this to whatever format you want.
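For example, here is a minimal sketch (reusing the review.pos object from above) that reshapes the result into a long data.frame with one row per token:
pos.df <- do.call(rbind, lapply(seq_along(review.pos), function(i)
  data.frame(id = df$id[i], token_tag = review.pos[[i]], stringsAsFactors = FALSE)))
head(pos.df, 3)
#  id token_tag
#1  1    the/DT
#2  1   food/NN
#3  1   was/VBD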
Considering that in your example the id column is simply the row index, I believe you can obtain your desired output with the pos() function from the qdap package.
library(qdap)
pos(df$review)
If you do need grouping because of multiple reviews per customer, you can use
pos_by(df$review,df$id)
If you don't mind trying a GitHub package, I have the tagger package, which wraps NLP/openNLP to do a number of tasks quickly, in the way Python users manipulate POS tags. Note that the output prints in the traditional word/tag format, but the object is actually a list of named vectors. This makes working with the words and tags easier. Here I demo how to get the tags and a few manipulations that tagger makes easy:
# First load your data and get the tagger package for those playing along at home
id<-c(1,2,3,4,5,6)
review<- c("the food was very delicious and hearty - perfect to warm up during a freezing winters day", "Excellent service as usual","Love this place!", "Service and quality of food first class"," Customer services was exceptional by all staff","excellent services")
df<-data.frame(id, review)
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/tagger")
# Now tag and manipulate
(out <- tag_pos(as.character(df[["review"]])))
## [1] "the/DT food/NN was/VBD very/RB delicious/JJ and/CC hearty/NN -/: perfect/JJ to/TO warm/VB up/RP during/IN a/DT freezing/JJ winters/NNS day/NN"
## [2] "Excellent/JJ service/NN as/IN usual/JJ"
## [3] "Love/VB this/DT place/NN !/."
## [4] "Service/NNP and/CC quality/NN of/IN food/NN first/JJ class/NN"
## [5] "Customer/NN services/NNS was/VBD exceptional/JJ by/IN all/DT staff/NN"
## [6] "excellent/JJ services/NNS"
c(out) ## True structure: list of named vectors
as_word_tag(out) ## Match the print method (less mutable)
count_tags(out, df[["id"]]) ## Get counts by row
plot(out) ## tag distribution (plot at end)
as_basic(out) ## basic pos tags
## [1] "the/article food/noun was/verb very/adverb delicious/adjective and/conjunction hearty/noun -/. perfect/adjective to/preposition warm/verb up/preposition during/preposition a/article freezing/adjective winters/noun day/noun"
## [2] "Excellent/adjective service/noun as/preposition usual/adjective"
## [3] "Love/verb this/adjective place/noun !/."
## [4] "Service/noun and/conjunction quality/noun of/preposition food/noun first/adjective class/noun"
## [5] "Customer/noun services/noun was/verb exceptional/adjective by/preposition all/adjective staff/noun"
## [6] "excellent/adjective services/noun"
select_tags(out, c("NN", "NNP", "NNPS", "NNS"))
## [1] "food/NN hearty/NN winters/NNS day/NN"
## [2] "service/NN"
## [3] "place/NN"
## [4] "Service/NNP quality/NN food/NN class/NN"
## [5] "Customer/NN services/NNS staff/NN"
## [6] "services/NNS"
Everything works pretty nicely within a magrittr pipeline as well, which is my preference. The Examples Section of the README has a nice overview of the package's usage.
I have Twitter data. Using library(stringr) I have extracted all the weblinks. However, when I try to do the same for hashtags I am getting an error. The same code had worked some days ago. The following is the code:
library(stringr)
hash <- "#[a-zA-Z0-9]{1, }"
hashtag <- str_extract_all(travel$texts, hash)
The following is the error:
Error in stri_extract_all_regex(string, pattern, simplify = simplify, :
Error in {min,max} interval. (U_REGEX_BAD_INTERVAL)
I have re-installed the stringr package, but that doesn't help.
The code that I used for the weblinks is:
pat1 <- "http://t.co/[a-zA-Z0-9]{1,}"
twitlink <- str_extract_all(travel$texts, pat1)
A reproducible example is as follows:
rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36")))
Your problem comes from the whitespace in the hash pattern:
# Not working (note the whitespace after the comma)
str_extract_all(rtt$texts, "#[a-zA-Z0-9]{1, }")
# Working
str_extract_all(rtt$texts, "#[a-zA-Z0-9]{1,}")
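As an aside, {1,} means "one or more", which is exactly what + means, so an equivalent and arguably clearer pattern is:
str_extract_all(rtt$texts, "#[a-zA-Z0-9]+")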
You may want to consider using the qdapRegex package that I maintain for this task. It makes extracting URLs and hash tags easy. qdapRegex is a package that contains a bunch of canned regexes and uses the amazing stringi package as a backend to do the regex work.
rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36")))
library(qdapRegex)
## first combine the built in url + twitter regexes into a function
rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)
rm_twitter_n_url(rtt$texts)
rm_hash(rtt$texts, extract=TRUE)
Giving the following output:
## > rm_twitter_n_url(rtt$texts)
## [[1]]
## [1] "httptcoLPihj2sNEP"
##
## [[2]]
## [1] "httptconMHNlDqv69"
##
## [[3]]
## [1] "httptcolMvChSpaBT"
## > rm_hash(rtt$texts, extract=TRUE)
## [[1]]
## [1] "#stevenewman"
##
## [[2]]
## [1] "#Job" "#Canada" "#Marlin" "#St"
##
## [[3]]
## [1] "#Fiji" "#NewZealand"
So I've been trying to get a subset of a character vector for the last hour or so. In my (floundering) attempt to get this working, I ran into an interesting characteristic of R. I have data (after JSON parsing) in the form of:
[[1]]
[[1]]$business_id
[1] "rncjoVoEFUJGCUoC1JgnUA"
[[1]]$full_address
[1] "8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345"
[[1]]$open
[1] TRUE
[[1]]$categories
[1] "Accountants" "Professional Services" "Tax Services"
[4] "Financial Services"
[[1]]$city
[1] "Peoria"
[[1]]$review_count
[1] 3
[[1]]$name
[1] "Peoria Income Tax Service"
[[1]]$neighborhoods
list()
[[1]]$longitude
[1] -112.2416
[[1]]$state
[1] "AZ"
[[1]]$stars
[1] 5
[[1]]$latitude
[1] 33.58187
[[1]]$type
[1] "business"
Here's the code I'm using:
#!/usr/bin/Rscript
require(graphics)
require(RJSONIO)
parsed_data <- lapply(readLines("yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json"), fromJSON)
#parsed_data[,c("categories")]
print(parsed_data[1])
As I was trying to drop everything but the categories column, I ran into this interesting behaviour:
print(parsed_data[1])
print(parsed_data[1][1])
print(parsed_data[1][1][1][1][1][1])
All produce the same output (the one posted above). Why is that?
This is the difference between [ and [[. It is hard to search for these online, but ?'[' will bring up the help.
When indexing a list with [, a list is returned:
list(a=1:10, b=11:20)[1]
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
This is a list of one element, so repeating the operation again results in the same value:
list(a=1:10, b=11:20)[1][1]
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
[[ returns the element, not a list containing the element. It also only accepts a single index (whereas [ accepts a vector):
list(a=1:10, b=11:20)[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
And this operation is not idempotent on lists:
list(a=1:10, b=11:20)[[1]][[1]]
## [1] 1
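And because [ accepts a vector of indices, it can select several elements at once, which [[ cannot:
list(a=1:10, b=11:20)[c(1, 2)]
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $b
## [1] 11 12 13 14 15 16 17 18 19 20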
Your JSON data is currently stored in a list, rather than a vector, so the indexing is different.
As Matthew has pointed out, there is a difference between using [] to access an element and using [[]]. For a discussion on this, I will refer you to this Stack Overflow thread:
In R, what is the difference between the [] and [[]] notations for accessing the elements of a list?
Looking at the data printout, your data is stored as a nested list:
parsed_data[[1]]
Will give you a list containing each of the columns. To access the categories column you can use any of the following:
parsed_data[[1]][["categories"]]
parsed_data[[1]][[4]]
parsed_data[[1]]$categories
This will give you a vector of names, as you'd expect:
## [1] "Accountants" "Professional Services" "Tax Services"
## [4] "Financial Services"
Note that when accessing by index (either named or numeric) you still have to use the double bracket notation: [[]]. If you use [] instead, it will give you a list instead of a vector:
parsed_data[[1]]["categories"]
## [[1]]
## [1] "Accountants" "Professional Services" "Tax Services"
## [4] "Financial Services"
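Putting it together for your original goal of keeping only the categories across all businesses, a minimal sketch (assuming parsed_data as read in above):
categories <- lapply(parsed_data, `[[`, "categories")
categories[[1]]
## [1] "Accountants"           "Professional Services" "Tax Services"
## [4] "Financial Services"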