I have a large JSON dataset that I would like to convert to a data frame in R.
(Sorry if this is a duplicate question, but the other answers didn't help me.)
My JSON file looks like this:
[{"src": "http://www.europarl.eu", "peid": "PE529.899v01-00", "reference": "2014/2021(INI)", "date": "2014-03-05T00:00:00", "committee": ["AFET"], "seq": 1, "id": "PE529.899-1", "orig_lang": "en", "new": ["- having regard to its resolution of 13", "December 20071 on Justice for the", "'Comfort Women' (sex slaves in Asia", "before and during World War II) as well", "as the statements by Japanese Chief", "Cabinet Secretary Yohei Kono in 1993", "and by the then Prime Minister Tomiichi", "Murayama in 1995, the resolutions of the", "Japanese parliament (the Diet) of 1995", "and 2005 expressing apologies for", "wartime victims, including victims of the", "'comfort women' system,", "_______________________", "1", "OJ C 323E, 18.12.2008, p.531"], "authors": "Reinhard Bütikofer on behalf of the Verts/ALE Group", "meps": [96739], "location": [["Motion for a resolution", "Citation 6 a (new)"]], "meta": {"created": "2019-07-03T05:06:17"}, "changes": {}}
,{"src": "http://www.europarl.eu", "peid": "PE529.863v01-00", "reference": "2014/2016(INI)", "date": "2014-02-27T00:00:00", "committee": ["AFET"], "seq": 1, "id": "PE529.863-1", "orig_lang": "en", "new": ["- having regard to the Statement by the", "Vice-President of the Commission/ High", "Representative of the Union for Foreign", "affairs and Security Policy (VP/HR)", "Catherine Ashton of 20 March 2013 on", "the Magnitsky case in the Russian", "Federation,"], "authors": "Jacek Protasiewicz", "meps": [23782], "location": [["Motion for a resolution", "Citation 4 a (new)"]], "meta": {"created": "2019-07-03T05:06:17"}, "changes": {}}
,{"src": "http://www.europarl.eu", "peid": "PE529.713v01-00", "reference": "2013/2149(INI)", "date": "2014-02-12T00:00:00", "committee": ["AFET"], "seq": 238, "id": "PE529.713-238", "orig_lang": "en", "old": ["A. whereas the European Neighbourhood", "Policy (ENP), in particular the Eastern", "Partnership (EaP), aims to extend the", "values and ideas of the founders of the EU;"], "new": ["A. whereas the European Neighbourhood", "Policy (ENP) embraces the values and", "ideas of the founders of the EU, notably", "the principles of Peace, Solidarity and", "Prosperity;"], "authors": "Mário David", "meps": [96973], "location": [["Motion for a resolution", "Recital A"]], "meta": {"created": "2019-07-03T05:06:18"}, "changes": {}}
,{"src": "http://www.europarl.eu", "peid": "PE529.899v01-00", "reference": "2014/2021(INI)", "date": "2014-03-05T00:00:00", "committee": ["AFET"], "seq": 2, "id": "PE529.899-2", "orig_lang": "en", "new": ["- having regard to the catastrophic", "earthquake and subsequent tsunami", "which devastated important parts of", "Japan's coast on 11 March 2011 and led", "to the destruction of the Fukushima", "nuclear power plant, causing possibly the", "greatest radiation disaster in human", "history,"], "authors": "Reinhard Bütikofer on behalf of the Verts/ALE Group", "meps": [96739], "location": [["Motion for a resolution", "Citation 11 a (new)"]], "meta": {"created": "2019-07-03T05:06:18"}, "changes": {}}
I would like to have a dataframe as follows:
src peid reference date committee seq id orig_lang new ...
http://www.europarl.eu PE529.899v01-00 2014/2021(INI) 2014-03-05T00:00:00 AFET 1 PE529.899-1 en ["- having ... p.531"] ...
http://www.europarl.eu PE529.863v01-00 2014/2016(INI) 2014-02-27T00:00:00 AFET 1 PE529.863-1 en ["- having ... Federation,"] ...
http://www.europarl.eu PE529.713v01-00 2013/2149(INI) 2014-02-12T00:00:00 AFET 238 PE529.713-238 en ["A. whereas ... Prosperity;"] ...
http://www.europarl.eu PE529.899v01-00 2014/2021(INI) 2014-03-05T00:00:00 AFET 2 PE529.899-2 en ["- having ... history,"] ...
(I didn't write the complete table above)
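I have already tried the following code:
library(rjson)
library(jsonlite)
# Call rjson's reader explicitly: jsonlite::fromJSON masks it and has no `file` argument
Data <- rjson::fromJSON(file = "data.json")
but each record is printed as a nested list, as shown below: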
[[1]]
[[1]]$src
[1] "http://www.europarl.eu/sides/getDoc.do?pubRef=-//EP//NONSGML+COMPARL+PE-529.899+01+DOC+PDF+V0//EN&language=EN"
[[1]]$peid
[1] "PE529.899v01-00"
[[1]]$reference
[1] "2014/2021(INI)"
[[1]]$date
[1] "2014-03-05T00:00:00"
[[1]]$committee
[1] "AFET"
[[1]]$seq
[1] 1
[[1]]$id
[1] "PE529.899-1"
[[1]]$orig_lang
[1] "en"
[[1]]$new
[1] "- having regard to its resolution of 13" "December 20071 on Justice for the"
[3] "'Comfort Women' (sex slaves in Asia" "before and during World War II) as well"
[5] "as the statements by Japanese Chief" "Cabinet Secretary Yohei Kono in 1993"
[7] "and by the then Prime Minister Tomiichi" "Murayama in 1995, the resolutions of the"
[9] "Japanese parliament (the Diet) of 1995" "and 2005 expressing apologies for"
[11] "wartime victims, including victims of the" "'comfort women' system,"
[13] "_______________________" "1"
[15] "OJ C 323E, 18.12.2008, p.531"
[[1]]$authors
[1] "Reinhard Bütikofer on behalf of the Verts/ALE Group"
[[1]]$meps
[1] 96739
[[1]]$location
[[1]]$location[[1]]
[1] "Motion for a resolution" "Citation 6 a (new)"
[[1]]$meta
[[1]]$meta$created
[1] "2019-07-03T05:06:17"
[[1]]$changes
list()
dput version is below:
list(list(src = "http://www.europarl.eu",
peid = "PE529.899v01-00", reference = "2014/2021(INI)", date = "2014-03-05T00:00:00",
committee = "AFET", seq = 1, id = "PE529.899-1", orig_lang = "en",
new = c("- having regard to its resolution of 13", "December 20071 on Justice for the",
"'Comfort Women' (sex slaves in Asia", "before and during World War II) as well",
"as the statements by Japanese Chief", "Cabinet Secretary Yohei Kono in 1993",
"and by the then Prime Minister Tomiichi", "Murayama in 1995, the resolutions of the",
"Japanese parliament (the Diet) of 1995", "and 2005 expressing apologies for",
"wartime victims, including victims of the", "'comfort women' system,",
"_______________________", "1", "OJ C 323E, 18.12.2008, p.531"
), authors = "Reinhard Bütikofer on behalf of the Verts/ALE Group",
meps = 96739, location = list(c("Motion for a resolution",
"Citation 6 a (new)")), meta = list(created = "2019-07-03T05:06:17"),
changes = list()))
One of my problems is with column 9 (new), shown below: I want to put all 15 components in one cell of the data frame.
[[1]]$new
[1] "- having regard to its resolution of 13" "December 20071 on Justice for the"
[3] "'Comfort Women' (sex slaves in Asia" "before and during World War II) as well"
[5] "as the statements by Japanese Chief" "Cabinet Secretary Yohei Kono in 1993"
[7] "and by the then Prime Minister Tomiichi" "Murayama in 1995, the resolutions of the"
[9] "Japanese parliament (the Diet) of 1995" "and 2005 expressing apologies for"
[11] "wartime victims, including victims of the" "'comfort women' system,"
[13] "_______________________" "1"
[15] "OJ C 323E, 18.12.2008, p.531"
How can I get the table I mentioned above?
We can either collapse the nested list elements whose length is greater than 1 into a single string by pasting them together (str_c) and then bind the resulting named lists into the rows of a data frame with map_dfr:
library(purrr)
library(dplyr)
library(stringr)
map_dfr(Data, ~ map(.x, unlist) %>%
map_dfr(~ if(length(.x) > 1) str_c(.x, collapse = ";") else .x))
Or use the recursive function rrapply to bind the records, keeping the elements with length greater than 1 as list columns:
library(rrapply)
map_dfr(Data, ~ rrapply(.x, how = "bind"))
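If installing extra packages is not an option, the same "collapse, then bind" idea can be sketched in base R. This is only a minimal illustration under assumptions: the two-record Data list below is a hypothetical, shortened stand-in for the real file, flatten_record is a made-up helper name, and every column comes out as character (so seq would need as.numeric afterwards).

```r
# Hypothetical, shortened stand-in for the list returned by rjson::fromJSON
Data <- list(
  list(src = "http://www.europarl.eu", seq = 1,
       new = c("- having regard to", "the Statement"),
       location = list(c("Motion for a resolution", "Citation 4 a (new)")),
       changes = list()),
  list(src = "http://www.europarl.eu", seq = 238,
       old = c("A. whereas", "the EU;"),
       new = c("A. whereas", "Prosperity;"),
       location = list(c("Motion for a resolution", "Recital A")),
       changes = list())
)

# Collapse every field of one record to a single string (NA when the field is empty)
flatten_record <- function(rec) {
  vapply(rec, function(x) {
    x <- unlist(x)
    if (length(x) == 0) NA_character_ else paste(x, collapse = ";")
  }, character(1))
}

rows <- lapply(Data, flatten_record)
cols <- unique(unlist(lapply(rows, names)))  # union of field names across records

# Align every record on the full set of columns; fields a record lacks become NA
df <- as.data.frame(
  do.call(rbind, lapply(rows, function(r) unname(r[cols]))),
  stringsAsFactors = FALSE
)
names(df) <- cols
df
```

Records that lack a field (here the first record has no old) simply get NA in that column, which is what makes the row binding possible.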
I'm trying to store complex S4 objects (generated with the Seurat package) in a data.table, using a function that splits them according to the value of one of their attributes. (I read that this is not possible with a list or a data.frame, but I didn't find anything about the compatibility of data.table with S4 objects.)
These objects all come from a bigger object that I called dataset in the function I wrote:
subsets_by_cluster <- function(dataset){
nclust=data.table(cluster_ID=c(rep(NA,length(unique(dataset@active.ident)))))
for (i in length(nclust)){
nclust[i]=dataset[,dataset@active.ident==unique(dataset@active.ident)[i]]
}
return(nclust)}
I was expecting to get a data.table full of S4 objects, with one column and as many rows as there are distinct @active.ident values (cluster IDs).
But when I run it on my original dataset, I get the error:
Error in [<-.data.frame(*tmp*, i, 1, value = new("Seurat", assays = list( : replacement has 2965 rows, data has 1
I also tried to do it manually with this kind of line:
nclust[1]=dataset[,dataset@active.ident==unique(dataset@active.ident)[1]]
but it didn't work either, giving the error:
type 'S4' cannot be coerced to 'logical'
Storing the subset in a variable works perfectly, but I would like my script to be able to handle different numbers of clusters.
I was thinking about writing the subsets to files so they can be read back later, but that seems far from an optimal solution.
Do you have any suggestions?
First, creating a simple S4 class (taken from Hadley Wickham's Advanced R)
setClass("Person",
slots = c(
name = "character",
age = "numeric"
)
)
As @John Paul mentions, you can create a few and store them in a list:
john <- new("Person", name = "John Smith", age = NA_real_)
jane <- new("Person", name = "Jane Smith", age = NA_integer_)
myPeeps <- list(john, jane)
Printing the list
> myPeeps
[[1]]
An object of class "Person"
Slot "name":
[1] "John Smith"
Slot "age":
[1] NA
[[2]]
An object of class "Person"
Slot "name":
[1] "Jane Smith"
Slot "age":
[1] NA
Since a data.frame is a special type of list and, as we see above, a list element can be an S4 object, you can store them in a column as well. You just have to wrap the list in the I() function:
size <- 5
propsToMyPeeps <- data.frame(
propsFrom = I(sample(myPeeps, size, replace = TRUE)),
propsValue = sample.int(10, size, replace = TRUE),
propsTo = I(sample(myPeeps, size, replace = TRUE))
)
By default, the print method for data.frame doesn't know how to coerce our Person to a character string so printing the data.frame will cause an error. But if you subset the column, you can see all the objects are there.
> print(propsToMyPeeps$propsTo)
[[1]]
An object of class "Person"
Slot "name":
[1] "Jane Smith"
Slot "age":
[1] NA
[[2]]
An object of class "Person"
Slot "name":
[1] "John Smith"
Slot "age":
[1] NA
[[3]]
An object of class "Person"
Slot "name":
[1] "John Smith"
Slot "age":
[1] NA
[[4]]
An object of class "Person"
Slot "name":
[1] "Jane Smith"
Slot "age":
[1] NA
[[5]]
An object of class "Person"
Slot "name":
[1] "Jane Smith"
Slot "age":
[1] NA
You can do it like this:
library(Seurat)
library(data.table)
data(pbmc_small)
nclust = data.table(cluster_ID=levels(Idents(pbmc_small)))
nclust$data = lapply(nclust$cluster_ID,function(i){
pbmc_small[,Idents(pbmc_small)==i]
})
And they can be accessed:
library(gridExtra)
grid.arrange(grobs=lapply(nclust$data,DimPlot),ncol=3)
cluster_ID data
1: 0 <Seurat>
2: 1 <Seurat>
3: 2 <Seurat>
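The error in your code comes from first defining the column as a vector of NAs and then replacing its elements one at a time: a column of NA is a logical vector, so it cannot hold S4 objects. Also, the loop should be for (i in 1:nrow(nclust)) instead of for (i in length(nclust)); length(nclust) is the number of columns, so the loop body runs only once.
If you start by defining the column as a list of NAs, it works: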
subsets_by_cluster <- function(dataset){
lvl = levels(Idents(dataset))
nclust=data.table(
cluster_ID = lvl,
data=replicate(length(lvl),NA,simplify=FALSE)
)
for (i in 1:nrow(nclust)){
nclust$data[[i]]=dataset[,Idents(dataset)==lvl[i]]
}
return(nclust)}
subsets_by_cluster(pbmc_small)
cluster_ID data
1: 0 <Seurat>
2: 1 <Seurat>
3: 2 <Seurat>
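The two pitfalls described above (length() versus nrow(), and a logical NA column versus a list column) can be reproduced without Seurat or data.table. The following is only a sketch under assumptions: the Toy class is a made-up stand-in for a Seurat object, and a plain data.frame is used so that the example needs no packages.

```r
# Toy S4 class standing in for a Seurat object (assumption, for illustration only)
setClass("Toy", slots = c(cluster = "character", n = "numeric"))

tab <- data.frame(cluster_ID = c("0", "1", "2"), stringsAsFactors = FALSE)

length(tab)  # number of columns, so for (i in length(tab)) runs only once
nrow(tab)    # number of rows, which is what the loop should iterate over

# rep(NA, 3) is a logical vector and cannot hold S4 objects;
# a list can hold any object, one per element
typeof(rep(NA, 3))
objs <- vector("list", nrow(tab))
for (i in seq_len(nrow(tab))) {
  objs[[i]] <- new("Toy", cluster = tab$cluster_ID[i], n = i)
}
tab$data <- I(objs)  # I() keeps the list as one list column

# Each cell of the list column is a full S4 object
tab$data[[2]]@cluster
```

seq_len(nrow(tab)) is used instead of 1:nrow(tab) so that the loop is skipped cleanly when the table has no rows.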
Is there a way to have R read the column/variable name in each cell when reading a CSV file?
My CSV file is malformed. Not every row has every variable, and the rows are not all the same length. However, every value has its variable name next to it, e.g. "id": "37189", "city": "Phoenix", "type": "business". When I tell R to read the CSV, can I instruct it to find the variable names within the data and sort the values into columns accordingly?
Data sample for your convenience:
business_id: vcNAWiLM4dR7D2nwwJ7nCA, full_address: 4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018, close: 17:00, open: 08:00, open: true, categories: [Doctors, Health & Medical], city: Phoenix, review_count: 9, name: Eric Goldberg, MD, neighborhoods: [], longitude: -111.98375799999999, state: AZ, stars: 3.5, latitude: 33.499313000000001, attributes: By Appointment Only: true, type: business,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
business_id: UsFtqoBl7naz8AVUBZMjQQ,full_address: 202 McClure St\nDravosburg, PA 15034, open: true, categories: [Nightlife], city: Dravosburg, review_count: 4, name: Clancy's Pub, neighborhoods: [], longitude: -79.886930000000007, state: PA, stars: 3.5, latitude: 40.350518999999998, attributes: Happy Hour: true, Accepts Credit Cards: true, Good For Groups: true, Outdoor Seating: false, Price Range: 1, type: business,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
business_id: cE27W9VPgO88Qxe4ol6y_g,{ full_address: 1530 Hamilton Rd\nBethel Park, PA 15234}, open: false, categories: [Active Life, Mini Golf, Golf], city: Bethel Park, review_count: 5, name: Cool Springs Golf Center, neighborhoods: [], longitude: -80.015910000000005, state: PA, stars: 2.5, latitude: 40.356896200000001, attributes: Good for Kids: true, type: business,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
A few of the variables (shown in bold in the original post), such as close, do not appear in the other entries.
This will get you started but you still have quite a bit of work to do. This works for one line (and it may work for the other two in the example) but it can be extrapolated to work with all of the lines (lapply FTW). Basically you need to rebuild the JSON structure from that single field (there may be alternative ways, especially if you do not need all the fields). It's easier than it might otherwise be since the Yelp schema is known.
You have to attack it in a pretty deterministic way, converting some fields before others, accounting for spaces in field names, dealing with arrays & nested structures, etc. As I said, you have quite a bit of work ahead of you. If your regex-fu is weak, this will provide ample practice to become a regex ninja.
library(stringi)
library(stringr)
library(jsonlite)
txt <- 'business_id: vcNAWiLM4dR7D2nwwJ7nCA, full_address: 4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018, close: 17:00, open: 08:00, open: true, categories: [Doctors, Health & Medical], city: Phoenix, review_count: 9, name: Eric Goldberg, MD, neighborhoods: [], longitude: -111.98375799999999, state: AZ, stars: 3.5, latitude: 33.499313000000001, attributes: By Appointment Only: true, type: business'
txt <- gsub("\n", "|", txt)
txt <- sub("business_id: ([[:alnum:]\\:]+)", '"business_id": "\\1"', txt)
txt <- sub('attributes: ', '"attributes": {', txt)
txt <- sub('By Appointment Only: ', '"By Appointment Only": ', txt)
txt <- sub('Accepts Credit Cards: ', '"Accepts Credit Cards": ', txt)
txt <- sub('Good For Groups: ', '"Good For Groups": ', txt)
txt <- sub('Outdoor Seating: ', '"Outdoor Seating": ', txt)
txt <- sub('Price Range: ', '"Price Range": ', txt)
txt <- sub("full_address: ([[:alnum:][:space:],\\|\\-\\.]+), close:", '"full_address": "\\1", close:', txt)
txt <- sub("full_address: ([[:alnum:][:space:],\\|\\-\\.]+), open:", '"full_address": "\\1", open:', txt)
txt <- sub("name: (.*), neighborhoods:", '"name": "\\1", "neighborhoods":', txt)
txt <- gsub("open: ([[:alnum:]\\:]+)", '"open": "\\1"', txt)
txt <- sub("close: ([[:alnum:]\\:]+)", '"close": "\\1"', txt)
txt <- sub("longitude: ([[:digit:]\\.-]+)", '"longitude": "\\1"', txt)
txt <- sub("latitude: ([[:digit:]\\.-]+)", '"latitude": "\\1"', txt)
txt <- sub("review_count: ([[:digit:]\\.]+)", '"review_count": "\\1"', txt)
txt <- sub("stars: ([[:digit:]\\.]+)", '"stars": "\\1"', txt)
txt <- sub("state: ([[:alpha:]]+)", '"state": "\\1"', txt)
txt <- sub("city: ([[:alpha:] \\.-]+)", '"city": "\\1"', txt)
txt <- sub("type: ([[:alpha:]]+)", '"type": "\\1"', txt)
cats <- paste0(sprintf('"%s"', str_trim(str_split(str_match_all(txt, "categories: \\[([[:alpha:] &-,]+)\\],")[[1]][,2], ",")[[1]])), collapse=", ")
txt <- sub("categories: \\[([[:alpha:] &-,]+)\\],", '"categories": [' %s+% cats %s+% '],', txt)
txt <- "{" %s+% txt %s+% "}}"
fromJSON(txt)
## $business_id
## [1] "vcNAWiLM4dR7D2nwwJ7nCA"
##
## $full_address
## [1] "4840 E Indian School Rd|Ste 101|Phoenix, AZ 85018"
##
## $close
## [1] "17:00"
##
## $open
## [1] "08:00"
##
## $open
## [1] "true"
##
## $categories
## [1] "Doctors" "Health & Medical"
##
## $city
## [1] "Phoenix"
##
## $review_count
## [1] "9"
##
## $name
## [1] "Eric Goldberg, MD"
##
## $neighborhoods
## list()
##
## $longitude
## [1] "-111.98375799999999"
##
## $state
## [1] "AZ"
##
## $stars
## [1] "3.5"
##
## $latitude
## [1] "33.499313000000001"
##
## $attributes
## $attributes$`By Appointment Only`
## [1] TRUE
##
## $attributes$type
## [1] "business"
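Because the top-level field names are known, a possible alternative to rewriting each field by hand is to locate the "key: " markers and cut the line between them. The sketch below is an assumption-laden illustration, not a full parser: parse_line is a hypothetical helper, the keys vector is taken from the sample rows, the input is a shortened version of the first sample line, and every value is returned as an unparsed string (nested attributes, arrays, and the stray braces in the third sample row would still need their own handling).

```r
# Assumed top-level field names, taken from the sample rows in the question
keys <- c("business_id", "full_address", "close", "open", "categories",
          "city", "review_count", "name", "neighborhoods", "longitude",
          "state", "stars", "latitude", "attributes", "type")

# Hypothetical helper: split one line into key/value strings
parse_line <- function(txt, keys) {
  txt <- sub("[,[:space:]]+$", "", txt)  # drop the trailing empty CSV cells
  # A key only counts at the start of the line or right after a comma
  pattern <- paste0("(^|,[[:space:]]*)(", paste(keys, collapse = "|"),
                    "):[[:space:]]*")
  m <- gregexpr(pattern, txt, perl = TRUE)
  starts <- as.integer(m[[1]])
  lens <- attr(m[[1]], "match.length")
  found <- regmatches(txt, m)[[1]]
  key_names <- sub(":[[:space:]]*$", "", sub("^,[[:space:]]*", "", found))
  # Each value runs from the end of its key to just before the next key
  values <- substring(txt, starts + lens, c(starts[-1] - 1L, nchar(txt)))
  # make.unique() keeps repeated keys apart (the sample has "open" twice)
  setNames(as.list(values), make.unique(key_names))
}

txt <- "business_id: vcNAWiLM4dR7D2nwwJ7nCA, close: 17:00, open: 08:00, open: true, name: Eric Goldberg, MD, neighborhoods: [], attributes: By Appointment Only: true, type: business,,,,"
rec <- parse_line(txt, keys)
rec$name
rec$attributes
```

Commas inside a value (as in the name field) are harmless here because a comma only ends a value when a known key follows it. Applying parse_line to every line with lapply and then aligning the results on the union of names would give one row per business.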
And whoever gave you this file deserves whatever evil comes their way in their programmatic life. I'd give them back whatever they wanted from this in gnarly XML with EBCDIC encoding.