Nested JSON with Different Attribute Names in R - r

I am playing with the Kaggle Star Trek Scripts dataset but I am struggling with converting from json to a dataframe in R. Ideally I would convert it in a long form dataset with index columns for episodes and characters with their lines on individual rows. I did find this answer, however it is not in R.
Currently the JSON looks like the photo below. Sorry it is not a full example, but I put a small mocked version below as well. If you want you can download the data yourselves from here.
Current JSON View
Mock Example
{
  "ENT": {
    "episode_0": {
      "KLAANG": {
        "\"Pung ghap! Pung ghap!\"": {},
        "\"DujDaj Hegh!\"": {}
      }
    },
    "episode_1": {
      "ARCHER": {
        "\"Warp me!\"": {},
        "\"To boldly go!\"": {}
      }
    }
  }
}
The issue I have is that the elements of the second level, the episodes, are individually numbered. So my regular bag of tricks for flattening by attribute name is not working. I am unsure how to loop through a level rather than an attribute name.
What I would ideally want is a long form data set that looks like this:
Series  Episode    Character  Lines
ENT     episode_0  KLAANG     Pung ghap! Pung ghap!
ENT     episode_0  KLAANG     DujDaj Hegh!
ENT     episode_1  ARCHER     Warp me!
ENT     episode_1  ARCHER     To boldly go!
My current code looks like the below, which is what I would normally start with, but it is obviously not working and does not go far enough.
your_df <- result[["ENT"]] %>%
  purrr::flatten() %>%
  map_if(is_list, as_tibble) %>%
  map_if(is_tibble, list) %>%
  bind_cols()
I have also tried using stack() and map_dfr() but with no success. So I yet again come humbly to you, dear reader, for expertise. JSON is the bane of my existence. I struggle with applying other answers to my circumstance, so any advice or examples I can reverse engineer and learn from are most appreciated.
Also happy to clarify or expand on anything if possible.
-Jake

So I was able to brute force it thanks to an answer from Michael on this thread called How to flatten a list of lists? so shout out to them.
The function allowed me to convert the JSON to a list of lists.
flattenlist <- function(x){
  morelists <- sapply(x, function(xprime) class(xprime)[1]=="list")
  out <- c(x[!morelists], unlist(x[morelists], recursive=FALSE))
  if(sum(morelists)){
    Recall(out)
  }else{
    return(out)
  }
}
So putting it all together, I ended up with the following solution. Annotated for your entertainment.
library(tidyverse) # attaches dplyr and tidyr (gather), among others
library(rjson)
# rjson::fromJSON() takes the path through its file argument
result <- rjson::fromJSON(file = "C:/Users/jacob/Downloads/all_series_lines.json")
# Mike's function to get to a list of lists
flattenlist <- function(x){
  morelists <- sapply(x, function(xprime) class(xprime)[1]=="list")
  out <- c(x[!morelists], unlist(x[morelists], recursive=FALSE))
  if(sum(morelists)){
    Recall(out)
  }else{
    return(out)
  }
}
# Mike's function applied
final<-as.data.frame(do.call("rbind", flattenlist(result)))
# Turn all the lists into a master data frame and ensure the index becomes a column I can separate later for context.
final <- cbind(Index_Name = rownames(final), final)
rownames(final) <- 1:nrow(final)
# So the output takes the final elements at the end of the JSON and makes those the variables in a dataframe so I need to force it back to a long form dataset.
final2<-gather(final,"key","value",-Index_Name)
# I separate each element of index name into my three mapping variables; Series,Episode and Character. I can also keep the original column names from above as script line id
final2$Episode<-gsub(".*\\.(.*)\\..*", "\\1", final2$Index_Name)
final2$Series<-substr(final2$Index_Name, start = 1, stop = 3)
# Character is whatever follows the last dot in Index_Name
final2$Character<-sub('.*\\.', "", final2$Index_Name)
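For what it is worth, the same long table can probably be built without the regex step by looping over the names at each level. Below is a minimal sketch in base R, assuming the parsed JSON is a nested named list (series > episode > character) where each character holds a character vector of lines. The mock result object and the to_long() helper are made-up names for illustration, not part of the dataset or any package. If the lines are stored as keys, as in the mock JSON in the question, names() would be needed in place of unlist().

```r
# Mock of the parsed JSON (assumption: lines are stored as values, not as keys).
result <- list(
  ENT = list(
    episode_0 = list(KLAANG = c("Pung ghap! Pung ghap!", "DujDaj Hegh!")),
    episode_1 = list(ARCHER = c("Warp me!", "To boldly go!"))
  )
)

# Hypothetical helper: walk every level by name and emit one row per line.
to_long <- function(result) {
  rows <- list()
  for (series in names(result)) {
    for (episode in names(result[[series]])) {
      speakers <- result[[series]][[episode]]
      for (speaker in names(speakers)) {
        lines <- unlist(speakers[[speaker]], use.names = FALSE)
        if (length(lines) == 0) next  # skip characters without any lines
        rows[[length(rows) + 1]] <- data.frame(
          Series = series,
          Episode = episode,
          Character = speaker,
          Lines = lines,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  # Stack the per-character data frames into one long data frame.
  do.call(rbind, rows)
}

long <- to_long(result)
print(long)
```

Because the loops iterate over names() at each level, it does not matter that the episode keys are all different.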

Related

Looping over rows of self-reported job titles to gather publicly available data

Suppose I have a data frame with one case of self-reported job titles (this is done in R):
x <- data.frame("job.title" = c("psychologist"))
I'd like to have this job title entered into a search engine on a website (this part I can do) in order to have data on these jobs pulled into a data frame (this part I can also do).
The following function does this for me:
onet.sum <- function(x) {
  obj1 <- as.list(ONETr::keySearch(x)) # enter self-reported job title into ONET's search engine
  job.title <- obj1[["title"]][1] # pull best-matching title
  soc.code <- obj1[["code"]][1] # pull best-matching title's SOC code
  obj4 <- as.data.frame(cbind(job.title,soc.code))
  return(obj4)
}
However, once I add a second job title in a second row...
x <- data.frame("job.title" = c("psychologist", "social worker"))
...I get this system error that I'm not sure how to diagnose.
Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Any advice?
UPDATE
So it turns out that there are two solutions that work if I pass job titles that do not contain spaces:
Using lapply(). Make sure that the job titles do not contain spaces.
So this works:
final_data <- lapply(c("psychologist","socialworker"), onet.sum) %>%
bind_rows
...but this doesn't work:
final_data <- lapply(c("psychologist","social worker"), onet.sum) %>%
bind_rows
Using purrr's map_df(), which is more flexible. Note that the backslash must be doubled inside an R string.
result <- purrr::map_df(gsub('\\s', '', x$job.title), onet.sum)
You can try with a lapply statement -
result <- do.call(rbind, lapply(x$job.title, onet.sum))
Using purrr::map_df might be shorter.
result <- purrr::map_df(x$job.title, onet.sum)
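The row-wise pattern can be tried offline with a stand-in for onet.sum. Below is a small sketch, assuming a hypothetical fake_lookup() that only echoes its input and does not call ONETr, with whitespace stripped first as in the update above.

```r
x <- data.frame(job.title = c("psychologist", "social worker"),
                stringsAsFactors = FALSE)

# Hypothetical stand-in for onet.sum(): returns a one-row data frame per title.
fake_lookup <- function(title) {
  data.frame(job.title = title, n.chars = nchar(title), stringsAsFactors = FALSE)
}

# Strip whitespace; the regex escape needs a doubled backslash in an R string.
clean_titles <- gsub("\\s", "", x$job.title)

# Apply the function to each title and stack the one-row results.
result <- do.call(rbind, lapply(clean_titles, fake_lookup))
print(result)
```

Swapping fake_lookup for onet.sum should give the same shape of result, one row per job title, provided the search call accepts the cleaned titles.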

Rename multiple colnames using a dictionary

I have multiple CSV files that contain the same information (such as "age") under differently spelled column names. To standardize them, I plan to read each file into a dataframe, standardize the column names, and then write the CSV back.
Therefore, I created a dictionary that looks like this:
I am struggling to find a way to do the following in R:
Asking it to look through each of the colnames of the dataframe and comparing each to every "old_name" in the dictionary dataframe.
If it finds a match, then replace the "old_name" with the "new_name".
Any help would be really useful!
Edit: the issue is not only with upper and lower case. For example, in some cases it could be: "years" instead of "age".
Here is a quick and dirty approach. I wrote a function so you could just change the arguments and quickly cycle through all your files. Using the stringi package is optional -- I'm using it to check the provided .csv file name, but you could remove that if you decide it's unnecessary.
library(stringi)
dict <- data.frame(path=c('../csv1','../csv1','../csv2','../csv3','../csv3'),
                   old_name=c('Age','agE','Name','years','NamE'),
                   new_name=c('age','age','name','age','name'))
example_csv <- data.frame(Age=c(43,34,42,24),NamE=c('Michael','Jim','Dwight','Kevin'))
standardizeColumnNames <- function(df,csvFileName,dictionary){
  colHeaders <- character(ncol(df))
  for(i in 1:ncol(df)){
    index <- which(dictionary$old_name == names(df)[i])
    if(length(index) > 0){
      colHeaders[i] <- as.character(dictionary$new_name[index[1]])
    } else {
      colHeaders[i] <- names(df)[i]
    }
  }
  names(df) <- colHeaders
  if(stri_sub(csvFileName,-4) != '.csv'){
    csvFileName <- paste0(csvFileName,'.csv')
  }
  write.csv(df,csvFileName)
}
standardizeColumnNames(example_csv,'test_file_name',dict)
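The lookup loop can likely be collapsed with match(), which returns the position of the first matching old_name for every column at once. Below is a sketch in base R. The rename_with_dict() helper is a made-up name, and it only renames, leaving the file writing to the caller.

```r
dict <- data.frame(old_name = c("Age", "agE", "Name", "years", "NamE"),
                   new_name = c("age", "age", "name", "age", "name"),
                   stringsAsFactors = FALSE)
example_csv <- data.frame(Age = c(43, 34, 42, 24),
                          NamE = c("Michael", "Jim", "Dwight", "Kevin"),
                          Other = 1:4)

# Hypothetical helper: rename columns found in the dictionary, keep the rest.
rename_with_dict <- function(df, dictionary) {
  idx <- match(names(df), as.character(dictionary$old_name))
  found <- !is.na(idx)  # columns that have an entry in the dictionary
  names(df)[found] <- as.character(dictionary$new_name[idx[found]])
  df
}

renamed <- rename_with_dict(example_csv, dict)
print(names(renamed))
```

Columns without a dictionary entry (Other in this mock) keep their original name, which mirrors the else branch of the loop version.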

extracting list-in-a-list-in-a-list to build dataframe in R

I am trying to build a data frame with book id, title, author, rating, collection, start and finish date from the LibraryThing api with my personal data. I am able to get a nested list fairly easily, and I have figured out how to build a data frame with everything but the dates (perhaps in not the best way but it works). My issue is with the dates.
The list I'm working with normally has 20 elements, but it adds the startfinishdates element only if I added dates to the book in my account. This is causing two issues:
If it was always there, I could extract it like everything else and it would have NA most of the time, and I could use cbind to get it lined up correctly with the other information
When I extract it using the name and get an object with fewer elements, I don't have a way to join it back to everything else (it doesn't have the book id)
Ultimately, I want to build this data frame and an answer that tells me how to pull out the book id and associate it with each startfinishdate so I can join on book id is acceptable. I would just add that to the code I have.
I'm also open to learning a better approach from the jump and re-designing the entire thing as I have not worked with lists much in R and what I put together was after much trial and error. I do want to use R though, as ultimately I am going to use this to create an R Markdown page for my web site (for instance, a plot that shows finish dates of books).
You can run the code below and get the data (no api key required).
library(jsonlite)
library(tidyverse)
library(assertr)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
books.lst<-data$books
#create df from json
create.df<-function(item){
  df<-map_df(.x=books.lst,~.x[[item]])
  df2 <- t(df)
  return(df2)
}
ids<-create.df(1)
titles<-create.df(2)
ratings<-create.df(12)
authors<-create.df(4)
#need to get the book id when i build the date df's
startdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(started_stamp,started_date)
finishdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(finished_stamp,finished_date)
collections.df<-map_df(.x=books.lst,~.x$collections)
#from assertr: will create a vector of same length as df with all values concatenated
collections.v<-col_concat(collections.df, sep = ", ")
#assemble df
books.df<-as.data.frame(cbind(ids,titles,authors,ratings,collections.v))
names(books.df)<-c("ID","Title","Author","Rating","Collections")
books.df<-books.df %>% mutate(ID=as.character(ID),Title=as.character(Title),Author=as.character(Author),
Rating=as.character(Rating),Collections=as.character(Collections))
This approach is outside the tidyverse meta-package. Using base-R you can make it work using the following code.
Map will apply the user defined function to each element of data$books which is provided in the argument and extract the required fields for your data.frame. Reduce will take all the individual dataframes and merge them (or reduce) to a single data.frame booksdf.
library(jsonlite)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
booksdf=Reduce(function(x,y){rbind(x,y)},
  Map(function(x){
    lenofelements = length(x)
    # books with dates carry an extra startfinishdates element
    if(lenofelements>20){
      if(!is.null(x$startfinishdates$started_date)){
        started_date = x$startfinishdates$started_date
      }else{
        started_date=NA
      }
      if(!is.null(x$startfinishdates$started_stamp)){
        started_stamp = x$startfinishdates$started_stamp
      }else{
        started_stamp=NA
      }
      if(!is.null(x$startfinishdates$finished_date)){
        finished_date = x$startfinishdates$finished_date
      }else{
        finished_date=NA
      }
      if(!is.null(x$startfinishdates$finished_stamp)){
        finished_stamp = x$startfinishdates$finished_stamp
      }else{
        finished_stamp=NA
      }
    }else{
      started_stamp = NA
      started_date = NA
      finished_stamp = NA
      finished_date = NA
    }
    book_id = x$book_id
    title = x$title
    author = x$author_fl
    rating = x$rating
    collections = paste(unlist(x$collections),collapse = ",")
    return(data.frame(ID=book_id,Title=title,Author=author,Rating=rating,
                      Collections=collections,Started_date=started_date,Started_stamp=started_stamp,
                      Finished_date=finished_date,Finished_stamp=finished_stamp))
  },data$books))
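The repeated is.null() checks can probably be folded into one small helper. Below is a sketch on mock data rather than the live API. The books list and the or_na() helper are made up for illustration, and only a few of the fields are shown. Because each row is built from a single book, the ID always stays attached to its dates.

```r
# Mock of data$books: the second book has no startfinishdates element.
books <- list(
  "101" = list(book_id = "101", title = "Book A",
               startfinishdates = list(started_date = "2019-01-01",
                                       finished_date = "2019-01-15")),
  "102" = list(book_id = "102", title = "Book B")
)

# Hypothetical helper: turn a missing (NULL) field into NA.
or_na <- function(x) if (is.null(x)) NA_character_ else x

# One data frame row per book; indexing into a missing element yields NULL.
rows <- lapply(books, function(b) {
  data.frame(ID = b$book_id,
             Title = b$title,
             Started_date = or_na(b$startfinishdates$started_date),
             Finished_date = or_na(b$startfinishdates$finished_date),
             stringsAsFactors = FALSE)
})

books_df <- do.call(rbind, rows)
print(books_df)
```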

Reading nodes from multiple html and storing result as a vector

I have a list of locally saved html files. I want to extract multiple nodes from each html and save the results in a vector. Afterwards, I would like to combine them in a dataframe. Now, I have a piece of code for 1 node, which works (see below), but it seems quite long and inefficient if I apply it for ~ 20 variables. Also, something really strange with the saving to vector (XXX_name) it starts with the last observation and then continues with the first, second, .... Do you have any suggestions for simplifying the code/ making it more efficient?
# Extracts name variable and stores in a vector
XXX_name <- c()
for (i in 1:216) {
  XXX_name <- c(XXX_name, name)
  mydata <- read_html(files[i], encoding = "latin-1")
  reads_name <- html_nodes(mydata, 'h1')
  name <- html_text(reads_name)
  #print(i)
  #print(name)
}
Many thanks!
You can put the workings inside a function, then apply that function to each of your variables with map.
First, create the function:
read_names <- function(var, node) {
  mydata <- read_html(files[var], encoding = "latin-1")
  reads_name <- html_nodes(mydata, node)
  html_text(reads_name) # return the extracted text
}
Then we create a df with all possible combinations of inputs and apply the function to that:
library(tidyverse)
library(rvest) # read_html(), html_nodes(), html_text()
inputs <- crossing(var = 1:216, node = vector_of_nodes)
output <- map2(inputs$var, inputs$node, read_names)
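To end up with one column per node and one row per file, in file order, the results can be collected per node rather than appended inside a loop. Below is a sketch that runs without rvest. The pages vector holds made-up HTML strings, and extract_node() is a hypothetical regex stand-in for the read_html / html_nodes / html_text calls, good enough only for this toy input.

```r
# Mock input: HTML strings standing in for the locally saved files.
pages <- c("<html><h1>Alice</h1><h2>Berlin</h2></html>",
           "<html><h1>Bob</h1><h2>Paris</h2></html>")

# Hypothetical stand-in for the rvest calls: pull the text of the first <node>.
extract_node <- function(html, node) {
  pattern <- sprintf(".*<%s>([^<]*)</%s>.*", node, node)
  sub(pattern, "\\1", html)
}

# Column name = variable name, value = node to extract.
nodes <- c(name = "h1", city = "h2")

# For each node, extract from every page; vapply() keeps the page order.
out <- as.data.frame(
  lapply(nodes, function(node) {
    vapply(pages, extract_node, character(1), node = node, USE.NAMES = FALSE)
  }),
  stringsAsFactors = FALSE
)
print(out)
```

This also sidesteps the ordering oddity in the question, where the vector is extended with name before name has been assigned for the current file.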

Displaying data from a list in R without dynamically changing variable names

I'm writing some code in R that builds a list of data frames. While it runs, it needs to display each of the data frames it creates in a separate tab. The data frames and the list are both created by several nested for loops, along the lines of:
df.list <- vector("list", length(e))
i <- 1
for (...){
  data <- as.data.frame(stuff)
  j <- 1
  for (...){
    for (...){
      [loop stuff]
      data[j,] <- [more stuff]
    }
  }
  df.list[[i]] <- data
  i <- i + 1
}
The question is where to put the "View" function. If I add a second loop at the end that runs through the list and displays the data frames, then they all get named "df.list". If I put View(data) right before df.list[[i]] <- data then they all get named "data". Having them all have the same name is not an acceptable situation for this context. Ideally, I would be able to name them whatever string I want, but I would settle for anything that is reasonably understandable and distinguishable from the other data frames.
I know I can solve this by dynamically changing the variable name to be datai where i is the list index, but that's almost always the wrong way to do things.
I thought I'd never post an answer using eval(parse()), but it's the only way I can think to make this work:
# sample data
df.list = list(mtcars, iris)
# name your list however you want the tabs to be named
names(df.list) = c("mtcars data", "this is iris")
for (i in seq_along(df.list)) eval(parse(text = sprintf("View(df.list[['%s']])", names(df.list)[i])))
This might be what you meant by "dynamically changing the variable name to be datai where i is the list index", and I agree that it's almost always wrong. In this case it may also be by far the most expedient way to do it as well.
Posting the solution from the comments so I can close:
The View() function takes an optional title as its second argument! View(data, name) will display data and use name as the title of the tab.
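Applied to a named list, that could look like the sketch below. It reuses the sample list from the earlier answer, and the interactive() guard is an addition so the loop does nothing when the script runs without a viewer.

```r
# Sample data: the list names double as the tab titles.
df.list <- list("mtcars data" = mtcars, "this is iris" = iris)

for (nm in names(df.list)) {
  # View() needs an interactive session, so skip it when run as a script.
  if (interactive()) View(df.list[[nm]], title = nm)
}
```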
