Parsing a text file of one-line JSON objects using RJSONIO

What I want: I would like to parse a text file of the form
{"business_id": "rncjoVoEFUJGCUoC1JgnUA", "full_address": "8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345", "open": true, "categories": ["Accountants", "Professional Services", "Tax Services", "Financial Services"], "city": "Peoria", "review_count": 3, "name": "Peoria Income Tax Service", "neighborhoods": [], "longitude": -112.241596, "state": "AZ", "stars": 5.0, "latitude": 33.581867000000003, "type": "business"}
{"business_id": "0FNFSzCFP_rGUoJx8W7tJg", "full_address": "2149 W Wood Dr\nPhoenix, AZ 85029", "open": true, "categories": ["Sporting Goods", "Bikes", "Shopping"], "city": "Phoenix", "review_count": 5, "name": "Bike Doctor", "neighborhoods": [], "longitude": -112.10593299999999, "state": "AZ", "stars": 5.0, "latitude": 33.604053999999998, "type": "business"}
where every line is an individual JSON object. I would like the parsed form to be of a type that rpart can take as an argument.
I can get this working if I loop through every line, but according to this SO answer it is more R-like to use an apply function than to loop through each line individually:
For each row in an R dataframe
Problem: when I run my code I get this error:
Error in apply(yelp_df, 1, fromJSON) : dim(X) must have a positive length
My code
#!/usr/bin/Rscript
require(graphics)
require(RJSONIO)
con <- file("yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json", "r")
yelp_df <- readLines(con) # rather than guessing the system's optimal buffer size, I'll just read everything into memory
apply(yelp_df, 1, fromJSON)

readLines returns a character vector, which has no dim attribute. apply expects an array or matrix, hence the "dim(X) must have a positive length" error. Use lapply or something similar.
out <- lapply(readLines("test.txt"), fromJSON)
> head(out[[1]])
$business_id
[1] "rncjoVoEFUJGCUoC1JgnUA"
$full_address
[1] "8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345"
$open
[1] TRUE
$categories
[1] "Accountants" "Professional Services" "Tax Services"
[4] "Financial Services"
$city
[1] "Peoria"
$review_count
[1] 3
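To get from the parsed list to something a modelling function can take, the scalar fields can be bound into a data frame. This is a minimal sketch, not the original poster's code: it uses jsonlite's fromJSON in place of RJSONIO, two shortened made-up records in the same one-object-per-line layout, and a hand-picked `scalar_fields` vector.

```r
library(jsonlite)

# Two shortened, made-up records in the one-JSON-object-per-line layout.
lines <- c(
  '{"business_id": "a1", "city": "Peoria", "review_count": 3, "stars": 5.0, "categories": ["Accountants", "Tax Services"]}',
  '{"business_id": "b2", "city": "Phoenix", "review_count": 5, "stars": 4.5, "categories": ["Bikes"]}'
)

# A character vector has no dim attribute, which is why apply() rejects it.
is.null(dim(lines))

# Parse each line on its own: one list element per line.
parsed <- lapply(lines, fromJSON)

# Keep only the scalar fields and bind them row-wise into a data frame.
scalar_fields <- c("business_id", "city", "review_count", "stars")
yelp_df <- do.call(rbind, lapply(parsed, function(x) {
  as.data.frame(x[scalar_fields], stringsAsFactors = FALSE)
}))
```

Array-valued fields such as categories are left out here on purpose; they would need to be collapsed or stored as a list column before binding.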


Read JSON file with nested lists in R

I have a large JSON dataset that I would like to convert to a data frame in R.
(Sorry if this is a duplicate question, but the other answers didn't help me.)
My JSON file looks like this:
[{"src": "http://www.europarl.eu", "peid": "PE529.899v01-00", "reference": "2014/2021(INI)", "date": "2014-03-05T00:00:00", "committee": ["AFET"], "seq": 1, "id": "PE529.899-1", "orig_lang": "en", "new": ["- having regard to its resolution of 13", "December 20071 on Justice for the", "'Comfort Women' (sex slaves in Asia", "before and during World War II) as well", "as the statements by Japanese Chief", "Cabinet Secretary Yohei Kono in 1993", "and by the then Prime Minister Tomiichi", "Murayama in 1995, the resolutions of the", "Japanese parliament (the Diet) of 1995", "and 2005 expressing apologies for", "wartime victims, including victims of the", "'comfort women' system,", "_______________________", "1", "OJ C 323E, 18.12.2008, p.531"], "authors": "Reinhard Bütikofer on behalf of the Verts/ALE Group", "meps": [96739], "location": [["Motion for a resolution", "Citation 6 a (new)"]], "meta": {"created": "2019-07-03T05:06:17"}, "changes": {}}
,{"src": "http://www.europarl.eu", "peid": "PE529.863v01-00", "reference": "2014/2016(INI)", "date": "2014-02-27T00:00:00", "committee": ["AFET"], "seq": 1, "id": "PE529.863-1", "orig_lang": "en", "new": ["- having regard to the Statement by the", "Vice-President of the Commission/ High", "Representative of the Union for Foreign", "affairs and Security Policy (VP/HR)", "Catherine Ashton of 20 March 2013 on", "the Magnitsky case in the Russian", "Federation,"], "authors": "Jacek Protasiewicz", "meps": [23782], "location": [["Motion for a resolution", "Citation 4 a (new)"]], "meta": {"created": "2019-07-03T05:06:17"}, "changes": {}}
,{"src": "http://www.europarl.eu", "peid": "PE529.713v01-00", "reference": "2013/2149(INI)", "date": "2014-02-12T00:00:00", "committee": ["AFET"], "seq": 238, "id": "PE529.713-238", "orig_lang": "en", "old": ["A. whereas the European Neighbourhood", "Policy (ENP), in particular the Eastern", "Partnership (EaP), aims to extend the", "values and ideas of the founders of the EU;"], "new": ["A. whereas the European Neighbourhood", "Policy (ENP) embraces the values and", "ideas of the founders of the EU, notably", "the principles of Peace, Solidarity and", "Prosperity;"], "authors": "Mário David", "meps": [96973], "location": [["Motion for a resolution", "Recital A"]], "meta": {"created": "2019-07-03T05:06:18"}, "changes": {}}
,{"src": "http://www.europarl.eu", "peid": "PE529.899v01-00", "reference": "2014/2021(INI)", "date": "2014-03-05T00:00:00", "committee": ["AFET"], "seq": 2, "id": "PE529.899-2", "orig_lang": "en", "new": ["- having regard to the catastrophic", "earthquake and subsequent tsunami", "which devastated important parts of", "Japan's coast on 11 March 2011 and led", "to the destruction of the Fukushima", "nuclear power plant, causing possibly the", "greatest radiation disaster in human", "history,"], "authors": "Reinhard Bütikofer on behalf of the Verts/ALE Group", "meps": [96739], "location": [["Motion for a resolution", "Citation 11 a (new)"]], "meta": {"created": "2019-07-03T05:06:18"}, "changes": {}}
I would like to have a dataframe as follows:
src peid reference date committee seq id orig_lang new ...
http://www.europarl.eu PE529.899v01-00 2014/2021(INI) 2014-03-05T00:00:00 AFET 1 PE529.899-1 en ["- having ... p.531"] ...
http://www.europarl.eu PE529.863v01-00 2014/2016(INI) 2014-02-27T00:00:00 AFET 128 PE529.899-1 en ["- having ..."Federation,"] ...
http://www.europarl.eu PE529.713v01-00 2013/2149(INI) 2014-02-12T00:00:00 AFET 238 PE529.899-1 en ["- having ..."Federation,"] ...
http://www.europarl.eu PE529.899v01-00 2014/2021(INI) 2014-03-05T00:00:00 AFET 1 PE529.899-1 en ["- having ..."Federation,"] ...
(I didn't write the complete table above)
I have already tried the following code:
library(rjson)
library(jsonlite)
Data <- fromJSON(file="data.json")
but each row is shown as below:
[[1]]
[[1]]$src
[1] "http://www.europarl.eu/sides/getDoc.do?pubRef=-//EP//NONSGML+COMPARL+PE-529.899+01+DOC+PDF+V0//EN&language=EN"
[[1]]$peid
[1] "PE529.899v01-00"
[[1]]$reference
[1] "2014/2021(INI)"
[[1]]$date
[1] "2014-03-05T00:00:00"
[[1]]$committee
[1] "AFET"
[[1]]$seq
[1] 1
[[1]]$id
[1] "PE529.899-1"
[[1]]$orig_lang
[1] "en"
[[1]]$new
[1] "- having regard to its resolution of 13" "December 20071 on Justice for the"
[3] "'Comfort Women' (sex slaves in Asia" "before and during World War II) as well"
[5] "as the statements by Japanese Chief" "Cabinet Secretary Yohei Kono in 1993"
[7] "and by the then Prime Minister Tomiichi" "Murayama in 1995, the resolutions of the"
[9] "Japanese parliament (the Diet) of 1995" "and 2005 expressing apologies for"
[11] "wartime victims, including victims of the" "'comfort women' system,"
[13] "_______________________" "1"
[15] "OJ C 323E, 18.12.2008, p.531"
[[1]]$authors
[1] "Reinhard Bütikofer on behalf of the Verts/ALE Group"
[[1]]$meps
[1] 96739
[[1]]$location
[[1]]$location[[1]]
[1] "Motion for a resolution" "Citation 6 a (new)"
[[1]]$meta
[[1]]$meta$created
[1] "2019-07-03T05:06:17"
[[1]]$changes
list()
dput version is below:
list(list(src = "http://www.europarl.eu",
peid = "PE529.899v01-00", reference = "2014/2021(INI)", date = "2014-03-05T00:00:00",
committee = "AFET", seq = 1, id = "PE529.899-1", orig_lang = "en",
new = c("- having regard to its resolution of 13", "December 20071 on Justice for the",
"'Comfort Women' (sex slaves in Asia", "before and during World War II) as well",
"as the statements by Japanese Chief", "Cabinet Secretary Yohei Kono in 1993",
"and by the then Prime Minister Tomiichi", "Murayama in 1995, the resolutions of the",
"Japanese parliament (the Diet) of 1995", "and 2005 expressing apologies for",
"wartime victims, including victims of the", "'comfort women' system,",
"_______________________", "1", "OJ C 323E, 18.12.2008, p.531"
), authors = "Reinhard Bütikofer on behalf of the Verts/ALE Group",
meps = 96739, location = list(c("Motion for a resolution",
"Citation 6 a (new)")), meta = list(created = "2019-07-03T05:06:17"),
changes = list()))
One of my problems is with column 9, shown below: I want to put all 15 components into a single cell of the data frame.
[[1]]$new
[1] "- having regard to its resolution of 13" "December 20071 on Justice for the"
[3] "'Comfort Women' (sex slaves in Asia" "before and during World War II) as well"
[5] "as the statements by Japanese Chief" "Cabinet Secretary Yohei Kono in 1993"
[7] "and by the then Prime Minister Tomiichi" "Murayama in 1995, the resolutions of the"
[9] "Japanese parliament (the Diet) of 1995" "and 2005 expressing apologies for"
[11] "wartime victims, including victims of the" "'comfort women' system,"
[13] "_______________________" "1"
[15] "OJ C 323E, 18.12.2008, p.531"
How can I get the table I mentioned above?
We can either collapse the nested list elements of length greater than 1 into a single string by pasting (str_c) and then bind the resulting named lists into a data frame with map_dfr
library(purrr)
library(dplyr)
library(stringr)
map_dfr(Data, ~ map(.x, unlist) %>%
map_dfr(~ if(length(.x) > 1) str_c(.x, collapse = ";") else .x))
Or use the recursive function rrapply to bind the elements, keeping those of length greater than 1 as list columns
library(rrapply)
map_dfr(Data, ~ rrapply(.x, how = "bind"))
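A third route, sketched here as an alternative rather than as part of the answer above, is to let jsonlite simplify the array of objects itself. The JSON string is a shortened, made-up excerpt in the same shape as the question's file, and `new_text` is a hypothetical column name.

```r
library(jsonlite)

# Shortened excerpt in the same shape as the question's file (values trimmed).
json <- '[
  {"peid": "PE529.899v01-00", "seq": 1, "new": ["line one", "line two"],
   "meta": {"created": "2019-07-03T05:06:17"}},
  {"peid": "PE529.863v01-00", "seq": 1, "new": ["only line"],
   "meta": {"created": "2019-07-03T05:06:17"}}
]'

# simplifyMatrix = FALSE keeps array fields as list columns even when all
# arrays happen to have the same length; flatten = TRUE turns the nested
# "meta" object into a plain "meta.created" column.
df <- fromJSON(json, simplifyMatrix = FALSE, flatten = TRUE)

# Collapse each element of the list column into one string per row.
df$new_text <- vapply(df$new, paste, character(1), collapse = ";")
```

Each row then holds all components of "new" in one cell, either as a list element (`df$new`) or as a single string (`df$new_text`).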

Store S4 objects in data.frame or data.table

I'm trying to store complex S4 objects (generated with the Seurat package) in a data.table, split by the value of one of their slots, using a function. (I read that it is not possible to use a list or a data.frame for this, but I didn't find anything about the compatibility of data.table with S4 objects.)
These objects all come from a bigger object, which I called dataset in the function I wrote:
subsets_by_cluster <- function(dataset){
  nclust=data.table(cluster_ID=c(rep(NA,length(unique(dataset@active.ident)))))
  for (i in length(nclust)){
    nclust[i]=dataset[,dataset@active.ident==unique(dataset@active.ident)[i]]
  }
  return(nclust)}
I was expecting to get a data.table full of S4 objects, with one column and as many rows as there are distinct @active.ident values (cluster IDs).
But when I run it on my original dataset, I get the error
Error in `[<-.data.frame`(`*tmp*`, i, 1, value = new("Seurat", assays = list( : replacement has 2965 rows, data has 1
I also tried to do it manually with this kind of line
nclust[1]=dataset[,dataset@active.ident==unique(dataset@active.ident)[1]]
but it didn't work either, giving the error:
type 'S4' cannot be coerced to 'logical'
Storing the subset in a variable works perfectly, but I would like my script to be able to handle different numbers of clusters.
I was thinking about writing the subsets to files so they can be read back later, but that seems far from an optimal solution.
Do you have any suggestions?
First, creating a simple S4 class (taken from Hadley Wickham's Advanced R)
setClass("Person",
slots = c(
name = "character",
age = "numeric"
)
)
As @John Paul mentions, you can create a few and store them in a list
john <- new("Person", name = "John Smith", age = NA_real_)
jane <- new("Person", name = "Jane Smith", age = NA_integer_)
myPeeps <- list(john, jane)
Printing the list
> myPeeps
[[1]]
An object of class "Person"
Slot "name":
[1] "John Smith"
Slot "age":
[1] NA
[[2]]
An object of class "Person"
Slot "name":
[1] "Jane Smith"
Slot "age":
[1] NA
Since a data.frame is a special type of list and as we see above a list element can be an S4 object, you can store them in a column as well. You just have to use the I() function
size <- 5
propsToMyPeeps <- data.frame(
propsFrom = I(sample(myPeeps, size, replace = TRUE)),
propsValue = sample.int(10, size, replace = TRUE),
propsTo = I(sample(myPeeps, size, replace = TRUE))
)
By default, the print method for data.frame doesn't know how to coerce our Person to a character string so printing the data.frame will cause an error. But if you subset the column, you can see all the objects are there.
> print(propsToMyPeeps$propsTo)
[[1]]
An object of class "Person"
Slot "name":
[1] "Jane Smith"
Slot "age":
[1] NA
[[2]]
An object of class "Person"
Slot "name":
[1] "John Smith"
Slot "age":
[1] NA
[[3]]
An object of class "Person"
Slot "name":
[1] "John Smith"
Slot "age":
[1] NA
[[4]]
An object of class "Person"
Slot "name":
[1] "Jane Smith"
Slot "age":
[1] NA
[[5]]
An object of class "Person"
Slot "name":
[1] "Jane Smith"
Slot "age":
[1] NA
You can do it like this:
library(Seurat)
library(data.table)
data(pbmc_small)
nclust = data.table(cluster_ID=levels(Idents(pbmc_small)))
nclust$data = lapply(nclust$cluster_ID,function(i){
pbmc_small[,Idents(pbmc_small)==i]
})
And they can be accessed:
library(gridExtra)
grid.arrange(grobs=lapply(nclust$data,DimPlot),ncol=3)
cluster_ID data
1: 0 <Seurat>
2: 1 <Seurat>
3: 2 <Seurat>
The error in your code comes from first defining the column as plain NAs and then replacing them one at a time. Also, the loop should be for (i in 1:nrow(nclust)) instead of for (i in length(nclust)), which runs only once.
If you start by defining it as a list of NAs, it works:
subsets_by_cluster <- function(dataset){
lvl = levels(Idents(dataset))
nclust=data.table(
cluster_ID = lvl,
data=replicate(length(lvl),NA,simplify=FALSE)
)
for (i in 1:nrow(nclust)){
nclust$data[[i]]=dataset[,Idents(dataset)==lvl[i]]
}
return(nclust)}
subsets_by_cluster(pbmc_small)
cluster_ID data
1: 0 <Seurat>
2: 1 <Seurat>
3: 2 <Seurat>
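The two points above (the loop bounds and the list column) can be checked without Seurat. This is a small sketch in base R that reuses the Person class from the earlier answer and a plain data.frame instead of a data.table; `people`, `df` and `visited` are names made up for the illustration.

```r
library(methods)

setClass("Person", slots = c(name = "character", age = "numeric"))
people <- list(
  new("Person", name = "John Smith", age = NA_real_),
  new("Person", name = "Jane Smith", age = NA_real_)
)

df <- data.frame(id = seq_along(people))

# length() of a data frame counts columns, so this loop runs once, with i = 1,
# no matter how many rows there are.
visited <- integer(0)
for (i in length(df)) visited <- c(visited, i)

# Pre-allocate a list column, then fill one element per row with [[<-.
df$obj <- vector("list", nrow(df))
for (i in seq_len(nrow(df))) {
  df$obj[[i]] <- people[[i]]
}
```

seq_len(nrow(df)) is used instead of 1:nrow(df) so that the loop does nothing on an empty table rather than iterating over c(1, 0).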

United Kingdom emoji with `emo::ji`

The emo package (https://github.com/hadley/emo) lets you insert emoji in R. I can't find the name of the emoji for the flag of the United Kingdom.
> emo::ji("Australia")
🇦🇺
> emo::ji("Laos")
🇱🇦
> emo::ji("United Kingdom")
Error in find_emoji(keyword) : Couldn't find emoji 'United Kingdom'
I have also tried with "UK", "GB", "Great Britain" but without success.
It's uk:
ji("uk")
🇬🇧
Using emo::ji_name (the full list) along with grep is somewhat helpful:
grep("uk", names(emo::ji_name), value = TRUE, ignore.case = TRUE)
# [1] "uk" "ukraine"
grep("britain", names(emo::ji_name), value = TRUE, ignore.case = TRUE)
# character(0)
grep("Laos", names(emo::ji_name), value = TRUE, ignore.case = TRUE)
# [1] "laos" "Laos"
The documented name for the United Kingdom flag is "uk".
"uk": {
"keywords": ["united", "kingdom", "great", "britain", "northern", "ireland", "flag", "nation", "country", "banner", "british", "UK", "english", "england", "union jack"],
"char": "🇬🇧",
"fitzpatrick_scale": false,
"category": "flags"
},
The emo package that you are using is built on the emojilib package, so you can use that project's emoji searcher (or just look through the source code) to find the emoji you are looking for.
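The keyword lists explain why "britain" finds nothing among the names but does describe the "uk" entry. As a rough sketch of searching by keyword rather than by name, the following uses a hand-copied, shortened version of the entry quoted above; `emoji_entries` and `find_by_keyword` are hypothetical and not part of emo or emojilib.

```r
# Shortened copy of the "uk" entry quoted above; not loaded from any package.
emoji_entries <- list(
  uk = list(
    keywords = c("united", "kingdom", "great", "britain", "flag",
                 "british", "UK", "union jack"),
    category = "flags"
  )
)

# Hypothetical helper: names of entries with at least one keyword matching `term`.
find_by_keyword <- function(entries, term) {
  hit <- vapply(entries, function(e) {
    any(grepl(term, e$keywords, ignore.case = TRUE))
  }, logical(1))
  names(entries)[hit]
}

find_by_keyword(emoji_entries, "britain")
```

The name returned by such a search is what would then be passed to emo::ji().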

read_csv issue in R Evaluation error: Column 1 must be named

I am new to R and am trying to read a csv file that contains data in one column separated by commas and within quotes similar to this:
"First Name", "Last Name", "City", "State", "Country", "Zip Code"
"Amy", "Smith", "San Fransisco", "California", "USA", "10000"
"John", "Parker", "New York", "New York", "USA", "10010"
"Homer", "Smith", "New Haven", "Connecticut", "USA", "21292"
How do I import the file so that the commas become columns and the quotes disappear?
First Name Last Name City State Country Zip Code
Amy Smith San Fransisco California USA 10000
John Parker New York New York USA 10010
Homer Smith New Haven Connecticut USA 21292
I tried
read_csv("path to my file.csv", col_names= TRUE, col_types = NULL, header = FALSE)
but I get :
Parsed with column specification:
cols(
col_character()
)
Error in read_tokens_(data, tokenizer, col_specs, col_names, locale_, :
Evaluation error: Column 1 must be named.
I had a similar issue reading in a tsv. I changed the file encoding to UTF-8 and that solved it for me.
Not elegant, but I did this:
1) Opened file in Sublime Text
2) Saved as "UTF-8"
Worked for me...
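If re-saving the file by hand is inconvenient, the encoding can also be declared at read time. This is a minimal sketch in base R, with read.csv standing in for readr's read_csv, and it assumes the file really is UTF-8; the temporary file and the `people` name are only for illustration. strip.white = TRUE is there because the sample has a space after each comma.

```r
# Sample rows from the question, written to a temporary file.
# The content is ASCII-only, so it is also valid UTF-8.
path <- tempfile(fileext = ".csv")
writeLines(c(
  '"First Name", "Last Name", "City", "State", "Country", "Zip Code"',
  '"Amy", "Smith", "San Fransisco", "California", "USA", "10000"',
  '"John", "Parker", "New York", "New York", "USA", "10010"',
  '"Homer", "Smith", "New Haven", "Connecticut", "USA", "21292"'
), path)

people <- read.csv(
  path,
  fileEncoding = "UTF-8",   # assumption: replace with the file's real encoding
  strip.white = TRUE,       # drop the blank that follows each comma
  check.names = FALSE,      # keep headers such as "First Name" unchanged
  stringsAsFactors = FALSE
)

dim(people)
```

For a file saved in another encoding, the fileEncoding value would have to name that encoding instead.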

Instructing R to find variable name in rows when reading csv file

Is there a way to have R read the column/variable name from each cell when reading a csv file?
My csv file is malformed. Not every row has every variable and not every row is of the same length. However, every row has a variable name within it, e.g. "id": "37189", "city": "Phoenix", "type": "business". When I tell R to read the csv can I instruct it to find the variable name within the data and sort accordingly?
Data sample for your convenience:
business_id: vcNAWiLM4dR7D2nwwJ7nCA, full_address: 4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018, close: 17:00, open: 08:00, open: true, categories: [Doctors, Health & Medical], city: Phoenix, review_count: 9, name: Eric Goldberg, MD, neighborhoods: [], longitude: -111.98375799999999, state: AZ, stars: 3.5, latitude: 33.499313000000001, attributes: By Appointment Only: true, type: business,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
business_id: UsFtqoBl7naz8AVUBZMjQQ,full_address: 202 McClure St\nDravosburg, PA 15034, open: true, categories: [Nightlife], city: Dravosburg, review_count: 4, name: Clancy's Pub, neighborhoods: [], longitude: -79.886930000000007, state: PA, stars: 3.5, latitude: 40.350518999999998, attributes: Happy Hour: true, Accepts Credit Cards: true, Good For Groups: true, Outdoor Seating: false, Price Range: 1, type: business,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
business_id: cE27W9VPgO88Qxe4ol6y_g,{ full_address: 1530 Hamilton Rd\nBethel Park, PA 15234}, open: false, categories: [Active Life, Mini Golf, Golf], city: Bethel Park, review_count: 5, name: Cool Springs Golf Center, neighborhoods: [], longitude: -80.015910000000005, state: PA, stars: 2.5, latitude: 40.356896200000001, attributes: Good for Kids: true, type: business,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
In bold are a few of the variables which do not appear in other entries.
This will get you started but you still have quite a bit of work to do. This works for one line (and it may work for the other two in the example) but it can be extrapolated to work with all of the lines (lapply FTW). Basically you need to rebuild the JSON structure from that single field (there may be alternative ways, especially if you do not need all the fields). It's easier than it might otherwise be since the Yelp schema is known.
You have to attack it in a pretty deterministic way, converting some fields before others, accounting for spaces in field names, dealing with arrays & nested structures, etc. As I said, you have quite a bit of work ahead of you. If your regex-fu is weak, this will provide ample practice to become a regex ninja.
library(stringi)
library(stringr)
library(jsonlite)
txt <- 'business_id: vcNAWiLM4dR7D2nwwJ7nCA, full_address: 4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018, close: 17:00, open: 08:00, open: true, categories: [Doctors, Health & Medical], city: Phoenix, review_count: 9, name: Eric Goldberg, MD, neighborhoods: [], longitude: -111.98375799999999, state: AZ, stars: 3.5, latitude: 33.499313000000001, attributes: By Appointment Only: true, type: business'
txt <- gsub("\n", "|", txt)
txt <- sub("business_id: ([[:alnum:]\\:]+)", '"business_id": "\\1"', txt)
txt <- sub('attributes: ', '"attributes": {', txt)
txt <- sub('By Appointment Only: ', '"By Appointment Only": ', txt)
txt <- sub('Accepts Credit Cards: ', '"Accepts Credit Cards": ', txt)
txt <- sub('Good For Groups: ', '"Good For Groups": ', txt)
txt <- sub('Outdoor Seating: ', '"Outdoor Seating": ', txt)
txt <- sub('Price Range: ', '"Price Range": ', txt)
txt <- sub("full_address: ([[:alnum:][:space:],\\|\\-\\.]+), close:", '"full_address": "\\1", close:', txt)
txt <- sub("full_address: ([[:alnum:][:space:],\\|\\-\\.]+), open:", '"full_address": "\\1", open:', txt)
txt <- sub("name: (.*), neighborhoods:", '"name": "\\1", "neighborhoods":', txt)
txt <- gsub("open: ([[:alnum:]\\:]+)", '"open": "\\1"', txt)
txt <- sub("close: ([[:alnum:]\\:]+)", '"close": "\\1"', txt)
txt <- sub("longitude: ([[:digit:]\\.-]+)", '"longitude": "\\1"', txt)
txt <- sub("latitude: ([[:digit:]\\.-]+)", '"latitude": "\\1"', txt)
txt <- sub("review_count: ([[:digit:]\\.]+)", '"review_count": "\\1"', txt)
txt <- sub("stars: ([[:digit:]\\.]+)", '"stars": "\\1"', txt)
txt <- sub("state: ([[:alpha:]]+)", '"state": "\\1"', txt)
txt <- sub("city: ([[:alpha:] \\.-]+)", '"city": "\\1"', txt)
txt <- sub("type: ([[:alpha:]]+)", '"type": "\\1"', txt)
cats <- paste0(sprintf('"%s"', str_trim(str_split(str_match_all(txt, "categories: \\[([[:alpha:] &-,]+)\\],")[[1]][,2], ",")[[1]])), collapse=", ")
txt <- sub("categories: \\[([[:alpha:] &-,]+)\\],", '"categories": [' %s+% cats %s+% '],', txt)
txt <- "{" %s+% txt %s+% "}}"
fromJSON(txt)
## $business_id
## [1] "vcNAWiLM4dR7D2nwwJ7nCA"
##
## $full_address
## [1] "4840 E Indian School Rd|Ste 101|Phoenix, AZ 85018"
##
## $close
## [1] "17:00"
##
## $open
## [1] "08:00"
##
## $open
## [1] "true"
##
## $categories
## [1] "Doctors" "Health & Medical"
##
## $city
## [1] "Phoenix"
##
## $review_count
## [1] "9"
##
## $name
## [1] "Eric Goldberg, MD"
##
## $neighborhoods
## list()
##
## $longitude
## [1] "-111.98375799999999"
##
## $state
## [1] "AZ"
##
## $stars
## [1] "3.5"
##
## $latitude
## [1] "33.499313000000001"
##
## $attributes
## $attributes$`By Appointment Only`
## [1] TRUE
##
## $attributes$type
## [1] "business"
And whoever gave you this file deserves whatever evil comes their way in their programmatic life. I'd give them back whatever they wanted from this in gnarly XML with EBCDIC encoding.
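If only a handful of scalar fields are needed, one alternative to rebuilding the full JSON is to pull each field out by its name, vectorised over all lines. This is a sketch under that assumption, in base R only; the two lines are shortened versions of the sample rows, and `extract_field` is a hypothetical helper.

```r
# Shortened versions of the sample rows, keeping a few scalar fields.
lines <- c(
  "business_id: vcNAWiLM4dR7D2nwwJ7nCA, city: Phoenix, review_count: 9, stars: 3.5, type: business,,,,",
  "business_id: UsFtqoBl7naz8AVUBZMjQQ, city: Dravosburg, review_count: 4, stars: 3.5, type: business,,,,"
)

# Hypothetical helper: capture the value after "key: " in every line;
# lines without the key yield NA.
extract_field <- function(x, key, pattern) {
  m <- regmatches(x, regexec(paste0(key, ": (", pattern, ")"), x))
  vapply(m, function(hit) {
    if (length(hit) == 2) hit[2] else NA_character_
  }, character(1))
}

businesses <- data.frame(
  business_id  = extract_field(lines, "business_id", "[[:alnum:]_-]+"),
  city         = extract_field(lines, "city", "[[:alpha:] .-]+"),
  review_count = as.integer(extract_field(lines, "review_count", "[[:digit:]]+")),
  stars        = as.numeric(extract_field(lines, "stars", "[[:digit:].]+")),
  stringsAsFactors = FALSE
)
```

Rows that lack a field simply get NA in that column, which sidesteps the differing row lengths. Nested fields such as attributes or categories would still need the fuller treatment shown in the answer.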
