I have nested JSON data, as shown below:
{
"School_Days" :[
{
"ts" : 1234,
"val": "ABC"
},
{
"ts" : 0987,
"val": "EFG"
}
]
}
When I convert it to a data frame, I get 4 columns and 1 row instead of 2 columns and 2 rows.
Below is my code for parsing the JSON data:
sc_data <- content(school_json, "parsed", "application/json", "Accept: application/json")
sc_df <- data.frame(sc_data, stringsAsFactors = FALSE)
Current data frame:
School_Days.ts School_Days.Val School_Days.ts1 School_Days.val1
1234 ABC 0987 EFG
Expected data frame:
School_Days.ts School_Days.Val
1234 ABC
0987 EFG
NOTE: I am fetching the JSON data with a REST API GET call and storing the response in school_json.
Also, typeof(school_json) returns "list", and the data has the following format:
$School_Days
$School_Days[[1]]
$School_Days[[1]]$ts
[1] 1234
$School_Days[[1]]$Val
[1] "ABC"
$School_Days[[2]]
$School_Days[[2]]$ts
[1] 0987
$School_Days[[2]]$Val
[1] "EFG"
I found the solution to my question.
I changed my content() call as follows:
library(httr)
library(jsonlite)
sc_data <- content(school_json, "text", "application/json")
sc_df <- fromJSON(sc_data, flatten = TRUE)
sc_df <- data.frame(sc_df, stringsAsFactors = FALSE)
Instead of retrieving the content as "parsed", I retrieved it as "text", which gives the JSON data as a string that fromJSON() can then convert.
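For anyone who wants to keep the "parsed" list, here is a minimal base R sketch of why data.frame() produced one wide row, and how the list could be stacked row by row instead. The list is typed in by hand to mimic the structure shown above (an assumption, not the real API response), and 0987 is written as 987 because a number cannot carry a leading zero.

```r
# Hand-built stand-in for the parsed response (assumed structure)
sc_data <- list(
  School_Days = list(
    list(ts = 1234, val = "ABC"),
    list(ts = 987,  val = "EFG")
  )
)

# data.frame() spreads every leaf of the nested list into its own column
wide <- data.frame(sc_data, stringsAsFactors = FALSE)
dim(wide)   # 1 row, 4 columns

# Turn each record into a one-row data frame, then stack the rows
long <- do.call(
  rbind,
  lapply(sc_data$School_Days, as.data.frame, stringsAsFactors = FALSE)
)
dim(long)   # 2 rows, 2 columns
```

This is only an alternative to the "text" plus fromJSON() route, which is shorter when the raw JSON string is available.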
I'm trying to store complex S4 objects (generated with the Seurat package) in a data.table, grouped by the value of one of their attributes, using a function. (I read that a list or a data.frame could not be used for this, but I found nothing about whether data.table is compatible with S4 objects.)
These objects are all subsets of a bigger object, which I called dataset in the function I wrote:
subsets_by_cluster <- function(dataset){
  nclust=data.table(cluster_ID=c(rep(NA,length(unique(dataset@active.ident)))))
  for (i in length(nclust)){
    nclust[i]=dataset[,dataset@active.ident==unique(dataset@active.ident)[i]]
  }
  return(nclust)}
I was expecting to get a data.table full of S4 objects: one column, with as many rows as there are distinct @active.ident values (cluster IDs).
But when I run it on my original dataset, I get the error:
Error in [<-.data.frame(*tmp*, i, 1, value = new("Seurat", assays = list( : replacement has 2965 rows, data has 1
I also tried to do it manually with a line like this:
nclust[1]=dataset[,dataset@active.ident==unique(dataset@active.ident)[1]]
but it didn't work either and raised the error:
type 'S4' cannot be coerced to 'logical'
Storing the subset in a variable works perfectly, but I would like my script to be able to handle different numbers of clusters.
I was thinking about writing the objects to files so they can be read back later, but that seems far from an optimal solution.
Do you have any suggestions?
First, create a simple S4 class (taken from Hadley Wickham's Advanced R):
setClass("Person",
slots = c(
name = "character",
age = "numeric"
)
)
As @John Paul mentions, you can create a few objects and store them in a list:
john <- new("Person", name = "John Smith", age = NA_real_)
jane <- new("Person", name = "Jane Smith", age = NA_integer_)
myPeeps <- list(john, jane)
Printing the list
> myPeeps
[[1]]
An object of class "Person"
Slot "name":
[1] "John Smith"
Slot "age":
[1] NA
[[2]]
An object of class "Person"
Slot "name":
[1] "Jane Smith"
Slot "age":
[1] NA
Since a data.frame is a special type of list and, as we see above, a list element can be an S4 object, you can store them in a column as well. You just have to wrap the list in I(), which tells data.frame() to keep it as is (as a list column):
size <- 5
propsToMyPeeps <- data.frame(
propsFrom = I(sample(myPeeps, size, replace = TRUE)),
propsValue = sample.int(10, size, replace = TRUE),
propsTo = I(sample(myPeeps, size, replace = TRUE))
)
By default, the print method for data.frame doesn't know how to coerce our Person to a character string so printing the data.frame will cause an error. But if you subset the column, you can see all the objects are there.
> print(propsToMyPeeps$propsTo)
[[1]]
An object of class "Person"
Slot "name":
[1] "Jane Smith"
Slot "age":
[1] NA
[[2]]
An object of class "Person"
Slot "name":
[1] "John Smith"
Slot "age":
[1] NA
[[3]]
An object of class "Person"
Slot "name":
[1] "John Smith"
Slot "age":
[1] NA
[[4]]
An object of class "Person"
Slot "name":
[1] "Jane Smith"
Slot "age":
[1] NA
[[5]]
An object of class "Person"
Slot "name":
[1] "Jane Smith"
Slot "age":
[1] NA
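Individual slots can still be pulled out of such a list column. A small sketch, reusing the Person class from above with a fixed (non-random) data.frame; the column names people and score and the age 41 are made up for the illustration:

```r
library(methods)  # setClass() and new(); attached by default in interactive R

setClass("Person", slots = c(name = "character", age = "numeric"))

john <- new("Person", name = "John Smith", age = NA_real_)
jane <- new("Person", name = "Jane Smith", age = 41)  # hypothetical age

# I() keeps the list as a single list column instead of expanding it
df <- data.frame(
  people = I(list(john, jane)),
  score  = c(3, 7)
)

# Extract one slot from every S4 object in the column
person_names <- vapply(df$people, function(p) p@name, character(1))
person_ages  <- vapply(df$people, function(p) p@age, numeric(1))
```

vapply() is used here instead of sapply() so that a slot of an unexpected type fails loudly rather than silently changing the result type.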
You can do it like this:
library(Seurat)
library(data.table)
data(pbmc_small)
nclust = data.table(cluster_ID=levels(Idents(pbmc_small)))
nclust$data = lapply(nclust$cluster_ID,function(i){
pbmc_small[,Idents(pbmc_small)==i]
})
And they can be accessed:
library(gridExtra)
grid.arrange(grobs=lapply(nclust$data,DimPlot),ncol=3)
cluster_ID data
1: 0 <Seurat>
2: 1 <Seurat>
3: 2 <Seurat>
The error in your code comes from first defining the column as plain NAs (an atomic vector, which cannot hold S4 objects) and then replacing them one at a time. Also, the loop should be for(i in 1:nrow(nclust)) instead of for(i in length(nclust)), which runs only once.
If you start by defining it as a list of NAs, it works:
subsets_by_cluster <- function(dataset){
lvl = levels(Idents(dataset))
nclust=data.table(
cluster_ID = lvl,
data=replicate(length(lvl),NA,simplify=FALSE)
)
for (i in 1:nrow(nclust)){
nclust$data[[i]]=dataset[,Idents(dataset)==lvl[i]]
}
return(nclust)}
subsets_by_cluster(pbmc_small)
cluster_ID data
1: 0 <Seurat>
2: 1 <Seurat>
3: 2 <Seurat>
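The two points above (the loop bound and the list column) can be checked without Seurat. A small base R sketch on a plain data.frame, with made-up payloads standing in for the Seurat subsets:

```r
df <- data.frame(cluster_ID = c("0", "1", "2"), stringsAsFactors = FALSE)

# length() of a data frame counts columns, so this loop body runs once
visited_wrong <- integer(0)
for (i in length(df)) visited_wrong <- c(visited_wrong, i)

# seq_len(nrow()) visits every row
visited_right <- integer(0)
for (i in seq_len(nrow(df))) visited_right <- c(visited_right, i)

# Pre-allocate a list column, then fill it one element at a time
df$data <- replicate(nrow(df), NA, simplify = FALSE)
for (i in seq_len(nrow(df))) {
  # placeholder payload; in the answer above this is a Seurat subset
  df$data[[i]] <- list(id = df$cluster_ID[i], position = i)
}
```

seq_len(nrow(df)) is used rather than 1:nrow(df) only because it also behaves sensibly when the table has zero rows.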
I have a list in which every element has the same field names (only the values differ), and I need to convert it into a data.frame whose column names match the field names. My list is as follows:
Data input (data input in json format.json)
library(rjson)
data <- fromJSON(file = "data input in json format.json")
head(data,3)
[[1]]
[[1]]$floors
[1] 5
[[1]]$elevation
[1] 15
[[1]]$bmi
[1] 23.7483
[[2]]
[[2]]$floors
[1] 4
[[2]]$elevation
[1] 12
[[2]]$bmi
[1] 23.764
[[3]]
[[3]]$floors
[1] 3
[[3]]$elevation
[1] 9
[[3]]$bmi
[1] 23.7797
And my expected data.frame is,
floors elevation bmi
5 15 23.7483
4 12 23.7640
3 9 23.7797
Can you help me figure this out?
Thanks in advance.
You can use jsonlite.
library(jsonlite)
Then use fromJSON() and specify the path to your file (or alternatively a URL or the raw text) in the argument txt:
fromJSON(txt = 'path/to/json/file.json')
The result is:
floors elevation bmi
1 5 15 23.7483
2 4 12 23.7640
3 3 9 23.7797
If you prefer rjson, you could first read it as previously:
data <- rjson::fromJSON(file = 'path/to/json/file.json')
Then use do.call() and rbind.data.frame() to convert the list to a dataframe:
do.call("rbind.data.frame", data)
As an alternative to do.call(), use data.table's rbindlist(), which is faster:
data.table::rbindlist(data)
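A self-contained sketch of the do.call() route, with the three records shown in the question typed in by hand rather than read from the file:

```r
# The first three records from the question, as a list of named lists
data <- list(
  list(floors = 5, elevation = 15, bmi = 23.7483),
  list(floors = 4, elevation = 12, bmi = 23.7640),
  list(floors = 3, elevation = 9,  bmi = 23.7797)
)

# Each list element becomes one row; the field names become the columns
df <- do.call("rbind.data.frame", data)
str(df)
```

This assumes every record has the same fields in the same order; records with missing fields would need to be padded first.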
This is the second time I have run into this recently, so I wanted to ask whether there is a better way to parse dataframes returned by jsonlite when one of the elements is an array, stored as a list column in the dataframe.
I know that this is part of the power of jsonlite, but I am not sure how to work with this nested structure. In the end, I suppose I could write my own custom parsing, but given that I am almost there, I wanted to see how to work with this data.
For example:
## options
options(stringsAsFactors=F)
## packages
library(httr)
library(jsonlite)
## setup
gameid="2015020759"
SEASON = '20152016'
BASE = "http://live.nhl.com/GameData/"
URL = paste0(BASE, SEASON, "/", gameid, "/PlayByPlay.json")
## get the data
x <- GET(URL)
## parse
api_response <- content(x, as="text")
api_response <- jsonlite::fromJSON(api_response, flatten=TRUE)
## get the data of interest
pbp <- api_response$data$game$plays$play
colnames(pbp)
And exploring what comes back:
> class(pbp$aoi)
[1] "list"
> class(pbp$desc)
[1] "character"
> class(pbp$xcoord)
[1] "integer"
From above, the column pbp$aoi is a list. Here are a few entries:
> head(pbp$aoi)
[[1]]
[1] 8465009 8470638 8471695 8473419 8475792 8475902
[[2]]
[1] 8470626 8471276 8471695 8476525 8476792 8477956
[[3]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[4]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[5]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[6]]
[1] 8469619 8471695 8473492 8474625 8475727 8475902
I don't really mind whether these lists are parsed within the same dataframe, but what are my options for extracting the data?
I would prefer to take the data out of the lists and put it into a dataframe that can be related back to the original record it came from.
Thanks in advance for your help.
Following @hrbmstr's suggestion above, I was able to get what I wanted using unnest() from tidyr (with select() and %>% from dplyr):
select(pbp, eventid, aoi) %>% unnest(cols = aoi) %>% head()
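For reference, roughly the same long table can be built in base R. This is a sketch on a hypothetical two-row stand-in for pbp (the eventid values are invented; the aoi values are borrowed from the output above), not on the real API response:

```r
# Hypothetical stand-in for pbp: one list column holding integer vectors
pbp <- data.frame(eventid = c(101L, 102L))
pbp$aoi <- list(
  c(8465009L, 8470638L, 8471695L),
  c(8470626L, 8471276L)
)

# Repeat each eventid once per element of its aoi vector, then flatten aoi
aoi_long <- data.frame(
  eventid = rep(pbp$eventid, lengths(pbp$aoi)),
  aoi     = unlist(pbp$aoi)
)
```

Each row of aoi_long can then be joined back to the original record through eventid.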
I have output from Elastic that takes a very long time to convert to an R data frame. I have tried multiple options and feel there may be some trick to speed up the process.
The structure of the list is as follows. The list has data aggregated over (say) 29 days. If the Elastic query output is in the list 'v_day', then v_day[[5]]$articles_over_time$buckets[1:29] represents each of the 29 days:
length(v_day[[5]]$articles_over_time$buckets)
[1] 29
page(v_day[[5]]$articles_over_time$buckets[[1]],method="print")
$key
[1] 1446336000000
$doc_count
[1] 35332
$group_by_state
$group_by_state$doc_count_error_upper_bound
[1] 0
$group_by_state$sum_other_doc_count
[1] 0
$group_by_state$buckets
$group_by_state$buckets[[1]]
$group_by_state$buckets[[1]]$key
[1] "detail"
$group_by_state$buckets[[1]]$doc_count
[1] 876
There is a "key" value right at the top (1446336000000) that I am interested in (let's call it the "time bucket key").
Within each day (take day i), "v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets" holds more data I am interested in. This is an aggregation over each property (a property is an entity in this scheme):
page(v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets,method="print")
[[1]]
[[1]]$key
[1] "detail"
[[1]]$doc_count
[1] 876
[[2]]
[[2]]$key
[1] "ff8081814fdf2a9f014fdf80b05302e0"
[[2]]$doc_count
[1] 157
[[3]]
[[3]]$key
[1] "ff80818150a7d5930150a82abbc50477"
[[3]]$doc_count
[1] 63
[[4]]
[[4]]$key
[1] "ff8081814ff5f428014ffb5de99f1da5"
[[4]]$doc_count
[1] 57
[[5]]
[[5]]$key
[1] "ff8081815038099101503823fe5d00d9"
[[5]]$doc_count
[1] 56
This shows data for 5 properties on day i; each property has a "key" (let's call it the "property bucket key") and a "doc_count" that I am interested in.
Eventually I want a data frame with the columns "time bucket key", "property bucket key", and "doc count".
Currently I am looping over the days using the code below:
v <- NULL
ndays <- length(v_day[[5]]$articles_over_time$buckets)
for (i in 1:ndays) {
v1 <- do.call("rbind", lapply(v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets, data.frame))
th_dt <- as.POSIXct(v_day[[5]]$articles_over_time$buckets[[i]]$key / 1000, origin="1970-01-01")
v1$view_date <- th_dt
v <- rbind(v, v1)
msg <- sprintf("Read views for %s. Found %d \n", th_dt, sum(v1$doc_count))
cat(msg)
}
v
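One direction that may be faster, sketched in base R on a hand-built list that only mimics the structure shown above (the second day and its counts are invented): build each day's columns with vapply() instead of creating one data.frame per property, and call rbind() once at the end instead of growing v inside the loop.

```r
# Mock of v_day[[5]]$articles_over_time$buckets (assumed structure)
buckets <- list(
  list(key = 1446336000000, doc_count = 35332,
       group_by_state = list(buckets = list(
         list(key = "detail", doc_count = 876),
         list(key = "ff8081814fdf2a9f014fdf80b05302e0", doc_count = 157)
       ))),
  list(key = 1446422400000, doc_count = 100,   # hypothetical second day
       group_by_state = list(buckets = list(
         list(key = "detail", doc_count = 60)
       )))
)

# One data frame per day, with columns built as whole vectors
per_day <- lapply(buckets, function(day) {
  props <- day$group_by_state$buckets
  day_time <- as.POSIXct(day$key / 1000, origin = "1970-01-01", tz = "UTC")
  data.frame(
    view_date = rep(day_time, length(props)),
    key       = vapply(props, function(p) p$key, character(1)),
    doc_count = vapply(props, function(p) as.numeric(p$doc_count), numeric(1)),
    stringsAsFactors = FALSE
  )
})

# Bind all days in a single call rather than inside the loop
v <- do.call(rbind, per_day)
```

Whether this is fast enough would need to be timed on the real output; for very large results, data.table::rbindlist(per_day) could replace the final do.call().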
How can I read the following vector "c" of strings into a list of tables? Which is the shortest way: read.table or strsplit? For example, I can't see how to read the table in c[4:6] in one command.
require(car)
m<-matrix(rnorm(16),4,4,byrow=T)
a<-Anova(lm(m~1),type=3,idata=data.frame(treatment=factor(1:4)),idesign=~treatment)
c<-capture.output(summary(a,multivariate=F))
c
This returns lines 4:6
c[4:6]
To parse this, I would do it in two steps: first read the column values from lines 5:6, then add back the names.
> vals <- read.table(text=c[5:6])
> txt <- " \t SS\t num Df\t Error SS\t den Df\t F\t Pr(>F)"
> names(vals) <- names(read.delim(text=txt))
> vals
X SS num.Df Error.SS den.Df F Pr..F.
1 (Intercept) 0.57613392 1 0.4219563 3 4.09616 0.13614
2 treatment 1.85936442 3 8.2899759 9 0.67287 0.58996
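The same two steps can be tried without car installed. In this sketch the two data lines are copied by hand from the output above (so their exact spacing is an assumption), and the column names are assigned directly as syntactic names instead of being parsed from a header string:

```r
# Stand-in for c[5:6]: the two data lines of the ANOVA table
lines <- c(
  "(Intercept) 0.57613392 1 0.4219563 3 4.09616 0.13614",
  "treatment   1.85936442 3 8.2899759 9 0.67287 0.58996"
)

# Step 1: read the whitespace-separated values
vals <- read.table(text = lines, stringsAsFactors = FALSE)

# Step 2: add the column names back
names(vals) <- c("term", "SS", "num_Df", "Error_SS", "den_Df", "F", "p_value")
```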
EDIT --
You could look at the source code of the summary function and calculate the required quantities yourself:
getAnywhere(summary.Anova.mlm)
The original idea does not seem to work.
c2 <- summary(a)
# find out what 'properties' the summary object has
# turns out, it is just the Anova object
class(c2) <- "list"
names(c2)
This returns
[1] "SSP" "SSPE" "P" "df" "error.df"
[6] "terms" "repeated" "type" "test" "idata"
[11] "idesign" "icontrasts" "imatrix" "singular"
and we can access them:
c2$SSP
c2$SSPE
It does not seem a good idea to use the name of R's built-in c function as a variable name.