R JSON to tibble

I have the following data passed back from an API and I cannot change its structure. I would like to convert the following JSON into a tibble.
data <- '{ "ids":{
"00000012664":{
"state":"Indiana",
"version":"10",
"external_ids":[
{
"db":"POL",
"db_id":"18935"
},
{
"db":"CIT",
"db_id":"1100882"
}
],
"id":"00000012520",
"name":"Joe Smith",
"aliases":[
"John Smith",
"Bill Smith"
]
},
"00000103162":{
"state":"Kentucky",
"external_ids":[
{
"db":"POL",
"db_id":"69131"
},
{
"db":"CIT",
"db_id":"1098802"
}
],
"id":"00000003119",
"name":"Sue Smith",
"WIP":98203059
} ,
"0000019223":{
"state":"Ohio",
"external_ids":[
{
"db":"POL",
"db_id":"69134"
},
{
"db":"JT",
"db_id":"615234"
}
],
"id":"0000019223",
"name":"Larry Smith",
"WIP":76532172,
"aliases":[
"Test 1",
"Test 2",
"Test 3",
"Test 4"
],
"insured":1
} } }'
Please Note: This is a small subset of the data and could have thousands of "ids".
I've tried jsonlite and tidyjson with a combination of purrr.
The following gives me a tibble, but I cannot figure out how to get aliases back.
obj <- jsonlite::fromJSON(data, simplifyDataFrame = T, flatten = F)
obj$ids %>% {
data_frame(id=purrr::map_chr(., 'id'),
state=purrr::map_chr(., 'state', ''),
WIP=purrr::map_chr(., 'WIP', .default=''),
#aliases=purrr::map(purrr::map_chr(., 'aliases', .default=''), my_fun)
)
}
I cannot figure out with tidyjson either:
data %>% enter_object(ids) %>% gather_object %>% spread_all
What I would like back is a tibble with the following fields (regardless of whether they are in the JSON or not):
id
name
state
version
aliases -> as a string comma separated
WIP
BONUS: ;-)
Can I get external_ids as a string as well?

Instead of extracting each element with a separate map call, one option is to convert each record to a tibble with as_tibble and select the columns of interest. Then, grouped by 'id', collapse 'aliases' into a single string and keep the distinct rows per 'id'.
library(tibble)
library(dplyr)  # select, group_by, mutate, distinct
library(purrr)
library(stringr)
map_dfr(obj$ids, ~ as_tibble(.x) %>%
select(id, one_of("name", "state", "version", "aliases", "WIP"))) %>%
group_by(id) %>%
mutate(aliases = toString(unique(aliases))) %>%
distinct(id, .keep_all = TRUE)
# A tibble: 2 x 6
# Groups: id [2]
# id name state version aliases WIP
# <chr> <chr> <chr> <chr> <chr> <int>
#1 00000012520 Joe Smith Indiana 10 John Smith, Bill Smith NA
#2 00000003119 Sue Smith Kentucky <NA> NA 98203059
If we also need the 'external_ids' (which is a data.frame)
map_dfr(obj$ids, ~ as_tibble(.x) %>%
mutate(external_ids = reduce(external_ids, str_c, sep = " "))) %>%
group_by(id) %>%
mutate_at(vars(aliases, external_ids), ~ toString(unique(.))) %>%
ungroup %>%
distinct(id, .keep_all= TRUE)
# A tibble: 2 x 7
# state version external_ids id name aliases WIP
# <chr> <chr> <chr> <chr> <chr> <chr> <int>
#1 Indiana 10 POL 18935, CIT 1100882 00000012520 Joe Smith John Smith, Bill Smith NA
#2 Kentucky <NA> POL 69131, CIT 1098802 00000003119 Sue Smith NA 98203059
Update
For the new data, we can use
obj$ids %>%
map_dfr(~ map_df(.x, reduce, str_c, collapse = ", ", sep= " ") )
# A tibble: 3 x 8
# state version external_ids id name aliases WIP insured
# <chr> <chr> <chr> <chr> <chr> <chr> <int> <int>
#1 Indiana 10 POL 18935, CIT 1100882 00000012520 Joe Smith John Smith Bill Smith NA NA
#2 Kentucky <NA> POL 69131, CIT 1098802 00000003119 Sue Smith <NA> 98203059 NA
#3 Ohio <NA> POL 69134, JT 615234 0000019223 Larry Smith Test 1 Test 2 Test 3 Test 4 76532172 1
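As a complement to the answers above, here is a minimal base R sketch of the "fixed set of fields, whether present or not" requirement. It loops over a fixed vector of field names so that every column exists, with NA where a record lacks the field. The helper `collapse_field` and the hand-built `ids` list are illustrative assumptions; in practice `ids` would come from something like `jsonlite::fromJSON(data, simplifyVector = FALSE)$ids`.

```r
# Hypothetical helper: return one field of a record as a single string,
# NA if the field is absent, comma-separated if it holds several values.
collapse_field <- function(rec, field) {
  val <- rec[[field]]
  if (is.null(val)) return(NA_character_)
  paste(unlist(val), collapse = ", ")
}

# Hand-built stand-in for the parsed JSON (assumption, see lead-in).
ids <- list(
  "00000012664" = list(state = "Indiana", version = "10",
                       id = "00000012520", name = "Joe Smith",
                       aliases = list("John Smith", "Bill Smith")),
  "00000103162" = list(state = "Kentucky", id = "00000003119",
                       name = "Sue Smith", WIP = 98203059)
)

# The columns we always want, in the order we want them.
fields <- c("id", "name", "state", "version", "aliases", "WIP")

# One single-row data frame per record, then stack them.
rows <- lapply(ids, function(rec) {
  vals <- vapply(fields, function(f) collapse_field(rec, f), character(1))
  as.data.frame(as.list(vals), stringsAsFactors = FALSE)
})
result <- do.call(rbind, rows)
rownames(result) <- NULL
```

Every column comes back as character in this sketch, so WIP would need an explicit `as.numeric()` if numbers are required. Adding "external_ids" to `fields` would flatten that entry the same way, although db and db_id values would then be separated by commas only.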

Related

Can I control how tidyjson joins nested arrays?

It seems as though tidyjson uses an inner-join-like behaviour on nested arrays, thus dropping records with empty child arrays. Is there a way to get left-join-like behaviour instead, filling with NAs?
For example, these fake data have one record with a populated nested array (middles) and two records where middles is empty:
library(tidyjson)
people <- c('{"age": 32, "name": [{"first": "Bob", "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
'{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
'{"age": 18, "name": [{"first": "Ann", "last": "Jones", "middles":[]}]}')
From these data I wish to have a dataframe with all the parent records retained and missing child-array information filled with NAs (~ left join) as such:
# A tibble: 3 x 5
age first last middle1 middle2
<dbl> <chr> <chr> <chr> <chr>
1 32 Bob Smith John Rick
2 54 Susan Doe NA NA
3 18 Ann Jones NA NA
However, extracting a nested array with some empty child arrays causes loss of their parent's information (~ inner join):
people %>%
spread_all() %>%
enter_object("name") %>% gather_array() %>%
spread_all() %>% select(-document.id,-array.index) %>%
enter_object("middles") %>% gather_array %>%
spread_all() %>% select(-array.index) %>%
tbl_df()
# A tibble: 1 x 5
age first last middle1 middle2
<dbl> <chr> <chr> <chr> <chr>
1 32 Bob Smith John Rick
Is there a way to avoid this; i.e., to retain all the rows even when child arrays are empty?
A workaround but not a solution
A possible workaround is to literally do a left join, but this means duplicating the JSON read, which is not trivial given gigabytes of data.
wrap_dplyr_verb <- function(dplyr.verb) {
# Creates a tidyjson verb out of a dplyr verb
# https://github.com/colearendt/tidyjson/blob/master/R/tbl_json.R
function(.data, ...) {
# Check if reserved ..JSON name already in data.frame
if ("..JSON" %in% names(.data))
stop("'..JSON' in the column names of tbl_json object being filtered")
# Assign JSON to the data.frame so it is treated as any other column
.data$..JSON <- attr(.data, "JSON")
# Apply the transformation
y <- dplyr.verb(dplyr::as_tibble(.data), ...)
# Reconstruct tbl_json without ..JSON column
tbl_json(dplyr::select(y, -..JSON), y$..JSON)
}
}
left_join_json = wrap_dplyr_verb(left_join)
people %>%
spread_all() %>%
enter_object("name") %>% gather_array() %>%
spread_all() %>% select(-document.id,-array.index) %>%
left_join_json(
people %>%
spread_all() %>%
enter_object("name") %>% gather_array() %>%
spread_all() %>% select(-document.id,-array.index) %>%
enter_object("middles") %>% gather_array %>%
spread_all() %>% select(-array.index)
) %>%
tbl_df()
Joining, by = c("age", "first", "last")
# A tibble: 3 x 5
age first last middle1 middle2
<dbl> <chr> <chr> <chr> <chr>
1 32 Bob Smith John Rick
2 54 Susan Doe NA NA
3 18 Ann Jones NA NA
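The inner-join versus left-join behaviour described in this question can be illustrated on plain data frames, without tidyjson. The following is only a sketch with base R's `merge`; the `parents` and `children` data frames are hand-built stand-ins for the already-extracted name and middles tables, not output from tidyjson.

```r
# Parent records: one row per person (hand-built for illustration).
parents <- data.frame(age = c(32, 54, 18),
                      first = c("Bob", "Susan", "Ann"),
                      last = c("Smith", "Doe", "Jones"),
                      stringsAsFactors = FALSE)

# Child records: only Bob has a non-empty "middles" array.
children <- data.frame(first = "Bob", last = "Smith",
                       middle1 = "John", middle2 = "Rick",
                       stringsAsFactors = FALSE)

# Inner join: parents without children are dropped.
inner <- merge(parents, children, by = c("first", "last"))

# Left join: all parents are kept, missing child fields become NA.
left <- merge(parents, children, by = c("first", "last"), all.x = TRUE)
```

Here `inner` has one row and `left` has three, which mirrors the difference between the tidyjson pipeline and the desired result.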
Thanks so much for asking this question! It's definitely an interesting use case and one that I think could be improved upon within tidyjson.
Of course, I forgot that I had already worked with you on this question in GitHub, so I implemented with a different pattern here, in case it's any help to you or others. Basically, the idea is to go "tall" instead of "wide" first. I have no idea what it does to execution time, and whether or not there are left joins hidden in here that hurt the timing.
library(tidyjson)
library(dplyr)
library(tidyr)
people <- c('{"age": 32, "name": [{"first": "Bob", "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
'{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
'{"age": 18, "name": [{"first": "Ann", "last": "Jones", "middles":[]}]}')
as_tbl_json(people) %>% spread_all() %>% enter_object("name") %>% gather_array("nameid") %>%
gather_object("key") %>%
{bind_rows(
filter(., key != "middles") %>% append_values_string("value"),
filter(., key == "middles") %>% gather_array("middleid") %>%
select(-key) %>%
gather_object("key") %>%
append_values_string("value")
)} %>%
# drop tbl_json
as_tibble() %>%
select(-document.id, -nameid, -middleid) %>%
# could also use tidyr::pivot_wider
tidyr::spread(key, value)
#> # A tibble: 3 x 5
#> age first last middle1 middle2
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 18 Ann Jones <NA> <NA>
#> 2 32 Bob Smith John Rick
#> 3 54 Susan Doe <NA> <NA>
Created on 2020-06-28 by the reprex package (v0.3.0)

Creating a variable based off of matching conditions in two datasets

I'm attempting to create a variable in one long dataset (df1) where the value in each row needs to be based off of matching some conditions in another long dataset (df2). The conditions are:
- match on "name"
- the value for df1 should consider observations for that person that occurred before the observation in df1.
- Then I need the number of rows within that subset that meet a third condition (in the data below called "condition")
I've already tried running a for loop (I know, not preferred in R) to write it for each row in 1:nrow(df1), but I keep running into an issue that in my actual data, df1 and df2 are not the same length or a multiple.
I've also tried writing a function and applying it to df1. I tried applying it using apply, but I can't accept two dataframes in the apply syntax. I tried giving it a list of dataframes and using lapply, but it returns back null values.
Here is some generic data that fits the format of the data I'm working with.
df1 <- data.frame(
name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
date_b = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by="day"), 4))
df2 <- data.frame(
name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
date_a = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by="day"), 4),
condition = c("A", "B", "C", "A")
)
I know the way to get the number of rows could look something like this:
num_conditions <- nrow(df2[which(df1$name == df2$name & df2$date_a < df1$date_b & df2$condition == "A"), ])
What I would like to see in df1 would be a column called "num_conditions" that would show the number of observations in df2 for that person that occurred before date_b in df1 and met condition "A".
df1 should look like this:
name date_b num_conditions
John Smith 10/1/15 1
John Smith 11/15/16 0
John Smith 9/19/19 0
I'm sure there are better ways to approach including data.table, but here is one using dplyr:
library(dplyr)
set.seed(12)
df2 %>%
filter(condition == "A") %>%
right_join(df1, by = "name") %>%
group_by(name, date_b) %>%
filter(date_a < date_b) %>%
mutate(num_conditions = n()) %>%
right_join(df1, by = c("name", "date_b")) %>%
mutate(num_conditions = coalesce(num_conditions, 0L)) %>%
select(-c(date_a, condition)) %>%
distinct()
# A tibble: 4 x 3
# Groups: name, date_b [4]
name date_b num_conditions
<fct> <date> <int>
1 John Smith 2016-10-13 2
2 John Smith 2015-11-10 2
3 Jane Smith 2016-07-18 1
4 Jane Smith 2018-03-13 1
R> df1
name date_b
1 John Smith 2016-10-13
2 John Smith 2015-11-10
3 Jane Smith 2016-07-18
4 Jane Smith 2018-03-13
R> df2
name date_a condition
1 John Smith 2015-04-16 A
2 John Smith 2014-09-27 A
3 Jane Smith 2017-04-25 C
4 Jane Smith 2015-08-20 A
Maybe the following is what the question is asking for.
library(tidyverse)
df1 %>%
left_join(df2 %>% filter(condition == 'A'), by = 'name') %>%
filter(date_a < date_b) %>%
group_by(name) %>%
mutate(num_conditions = n()) %>%
select(-date_a, -condition) %>%
full_join(df1) %>%
mutate(num_conditions = ifelse(is.na(num_conditions), 0, num_conditions))
#Joining, by = c("name", "date_b")
## A tibble: 4 x 3
## Groups: name [2]
# name date_b num_conditions
# <fct> <date> <dbl>
#1 John Smith 2019-05-07 2
#2 John Smith 2019-02-05 2
#3 Jane Smith 2016-05-03 0
#4 Jane Smith 2018-06-23 0
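For comparison, the three conditions from the question can also be counted row by row in base R, which sidesteps the problem of df1 and df2 having different lengths. This is a sketch only; it uses the fixed dates printed in the first answer above instead of random samples so that the result is reproducible.

```r
# Fixed copies of the example data (dates taken from the printed df1/df2 above).
df1 <- data.frame(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_b = as.Date(c("2016-10-13", "2015-11-10", "2016-07-18", "2018-03-13")),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_a = as.Date(c("2015-04-16", "2014-09-27", "2017-04-25", "2015-08-20")),
  condition = c("A", "A", "C", "A"),
  stringsAsFactors = FALSE)

# For each row of df1, count the df2 rows with the same name,
# an earlier date, and condition "A".
df1$num_conditions <- vapply(seq_len(nrow(df1)), function(i) {
  sum(df2$name == df1$name[i] &
      df2$date_a < df1$date_b[i] &
      df2$condition == "A")
}, integer(1))
```

Each comparison is made against a single df1 row at a time, so the vectors being compared always have the length of df2. With this data the counts are 2, 2, 1, 1, matching the first answer's output.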

Combine rows based on multiple columns and keep all unique values

I have a dataset with user information. For a given user I often have multiple rows with more or less complete information. I want to combine all rows that belong to a customer on the basis of First_Name, Last_Name and Street while keeping the information in the other columns, and if there are two or more unique values in a column I want to collapse them with ",".
This is what the df looks like
First_Name Last_Name Street Column1 Colum2 Colum_n
Mike Smith X abc ab a
Mike Smith X abc ad b
John Smith Y xyz xy n
John Smith Y xyz xm NA
My desired output would be
First_Name Last_Name Street Column1 Colum2 Colum_n
Mike Smith X abc ab,ad a,b
John Smith Y xyz xy,xm n
I would like to use dplyr and tried something like
df %>%
group_by(First_Name,Last_Name, Street) %>%
summarise_all(funs())
The problem with that function is that I only had the option of using something like the mean or the first occurring value for a column, and this would mean the loss of values. What I would like are columns with all unique values, without NAs.
You can write your own summarization function like
concat_unique <- function(x){paste(unique(x), collapse=',')}
and then apply it using
summarize_all(concat_unique)
A solution using tidyverse.
library(tidyverse)
dat2 <- dat %>%
group_by(First_Name, Last_Name, Street) %>%
# Replace NA with ""
mutate_all(funs(replace(., is.na(.), ""))) %>%
# Combine all strings
summarize_all(funs(toString(unique(.)))) %>%
# Replace the strings ended with ", "
mutate_all(funs(str_replace(., ", $", ""))) %>%
ungroup()
dat2
# # A tibble: 2 x 6
# First_Name Last_Name Street Column1 Colum2 Colum_n
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 John Smith Y xyz xy, xm n
# 2 Mike Smith X abc ab, ad a, b
After seeing the other answers, I realized that we don't have to deal with NA and "," as strings. The following is more efficient.
dat2 <- dat %>%
group_by(First_Name, Last_Name, Street) %>%
# Combine all strings
summarize_all(funs(toString(unique(.[!is.na(.)])))) %>%
ungroup()
dat2
# # A tibble: 2 x 6
# First_Name Last_Name Street Column1 Colum2 Colum_n
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 John Smith Y xyz xy, xm n
# 2 Mike Smith X abc ab, ad a, b
DATA
dat <- read.table(text = 'First_Name Last_Name Street Column1 Colum2 Colum_n
Mike Smith X abc ab a
Mike Smith X abc ad b
John Smith Y xyz xy n
John Smith Y xyz xm NA',
header = TRUE, stringsAsFactors = FALSE)
Using tidyverse:
df %>%
group_by(First_Name, Last_Name, Street) %>%
summarise_all(funs(paste0(unique(.[!is.na(.)]), collapse= ",")))
First_Name Last_Name Street Column1 Colum2 Colum_n
<fct> <fct> <fct> <chr> <chr> <chr>
1 John Smith Y xyz xy,xm n
2 Mike Smith X abc ab,ad a,b
First, it is grouping by "First_Name", "Last_Name" and "Street". Then, it takes all the unique non-NA values and collapses them into one string.
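The same two steps (group, then collapse the unique non-NA values) can be sketched in base R with `aggregate`. The data frame below is rebuilt by hand from the question, and `concat_unique` is a variant of the helper suggested earlier with NA removal added; both are illustrative only.

```r
# Example data from the question, rebuilt by hand.
dat <- data.frame(
  First_Name = c("Mike", "Mike", "John", "John"),
  Last_Name  = "Smith",
  Street     = c("X", "X", "Y", "Y"),
  Column1    = c("abc", "abc", "xyz", "xyz"),
  Colum2     = c("ab", "ad", "xy", "xm"),
  Colum_n    = c("a", "b", "n", NA),
  stringsAsFactors = FALSE)

# Collapse the unique non-NA values of a vector into one string.
concat_unique <- function(x) paste(unique(x[!is.na(x)]), collapse = ",")

# Group by the three key columns and summarise every other column.
out <- aggregate(x = dat[c("Column1", "Colum2", "Colum_n")],
                 by = dat[c("First_Name", "Last_Name", "Street")],
                 FUN = concat_unique)
```

The data frame method of `aggregate` passes NA values in the summarised columns through to `FUN`, which is why the helper removes them itself.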
If you want to keep them as a vector, instead of converting them to a single character string, you can do
library(dplyr)
df %>%
group_by(First_Name,Last_Name, Street) %>%
summarise_all(~list(unique(.[!is.na(.)]))) %>%
print.data.frame
# First_Name Last_Name Street Column1 Colum2 Colum_n
# 1 John Smith Y xyz xy, xm n
# 2 Mike Smith X abc ab, ad a, b
or with data.table
library(data.table)
setDT(df)
df[, lapply(.SD, function(x) .(unique(x[!is.na(x)])))
, by = .(First_Name,Last_Name, Street)]
# First_Name Last_Name Street Column1 Colum2 Colum_n
# 1: Mike Smith X abc ab,ad a,b
# 2: John Smith Y xyz xy,xm n

Use merge with one data frame in R [duplicate]

I have one data frame in R with duplicate indexes stored in the first column.
df <- data.frame("Index" = c(1,2,1),
                 "Age" = c("Jane Doe","John Doe","Jane Doe"),
                 "Address" = c("123 Fake Street","780 York Street","456 Elm Street"),
                 "Telephone" = c("xxx-xxx-xxxx","zzz-zzz-zzzz","yyy-yyy-yyyy"))
Index Name Address Telephone
1 Jane Doe 123 Fake Street xxx-xxx-xxxx
2 John Doe 780 York Street zzz-zzz-zzzz
1 Jane Doe 456 Elm Street yyy-yyy-yyyy
I would like to combine the above data frame to look like:
Index Name Address Telephone Address 2 Telephone 2
1 Jane, Doe 123 Fake Street xxx-xxx-xxxx 456 Elm Street yyy-yyy-yyyy
2 John Doe 780 York Street zzz-zzz-zzzz NA NA
Can I use "merge" on the same data frame or is there another command in R that would accomplish this task? Thank you.
with tidyverse
df %>%
group_by(Age) %>%
summarize_at(vars(Telephone,Address),paste, collapse="|") %>%
separate(Address,into=c("Address1","Address2"),sep="\\|") %>%
separate(Telephone,into=c("Telephone1","Telephone2"),sep="\\|")
# # A tibble: 2 x 5
# Age Telephone1 Telephone2 Address1 Address2
# <fct> <chr> <chr> <chr> <chr>
# 1 Jane Doe xxx-xxx-xxxx yyy-yyy-yyyy 123 Fake Street 456 Elm Street
# 2 John Doe zzz-zzz-zzzz <NA> 780 York Street <NA>
To be more general, we can nest the values using summarize and list, and reformat the content to unnest it with the right format:
df %>%
group_by(Age) %>%
summarize_at(vars(Telephone,Address),
~lst(setNames(invoke(tibble,.),seq_along(.)))) %>%
unnest(.sep = "")
# # A tibble: 2 x 5
# Age Telephone1 Telephone2 Address1 Address2
# <fct> <fct> <fct> <fct> <fct>
# 1 Jane Doe xxx-xxx-xxxx yyy-yyy-yyyy 123 Fake Street 456 Elm Street
# 2 John Doe zzz-zzz-zzzz <NA> 780 York Street <NA>
The function inside of summarize is a bit scary but you can wrap it into a friendlier name if you want to use it again (I added a names parameter just in case):
nest2row <- function(x,names = seq_along(x))
lst(setNames(invoke(tibble,x),names[seq_along(x)]))
df %>%
group_by(Age) %>%
summarize_at(vars(Telephone,Address), nest2row) %>%
unnest(.sep = "")
And this would be the recommended tidy way I suppose :
df %>%
group_by(Age) %>%
mutate(id=row_number()) %>%
gather(key,value,Address,Telephone) %>%
unite(key,key,id,sep="") %>%
spread(key,value)
# # A tibble: 2 x 6
# # Groups: Age [2]
# Index Age Address1 Address2 Telephone1 Telephone2
# <dbl> <fct> <chr> <chr> <chr> <chr>
# 1 1 Jane Doe 123 Fake Street 456 Elm Street xxx-xxx-xxxx yyy-yyy-yyyy
# 2 2 John Doe 780 York Street <NA> zzz-zzz-zzzz <NA>
With my second solution you keep your factors and there's not this awkward forcing different types of variables in the same column that the idiomatic way has.
Try something like this:
df <- data.frame("Index" = c(1,2,1), "Age" = c("Jane Doe","John Doe","Jane Doe"),
"Address" = c("123 Fake Street","780 York Street","456 Elm Street"),
"Telephone" = c("xxx-xxx-xxxx","zzz-zzz-zzzz","yyy-yyy-yyyy"),
stringsAsFactors = F)
df$unindex=paste(df$Index,df$Age)
sapply(unique(df$unindex),function(li){ # li="1 Jane Doe"
dft=df[li==df$unindex,3:4]
if(nrow(dft)==1)dft else c(t(dft))
})
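Since the question asks about base commands such as merge, a long-to-wide reshape can also be sketched with base R's `reshape`. The within-group counter `n` is a name introduced here for illustration, and the column is kept as `Age` only to match the data frame in the question, although it holds names.

```r
df <- data.frame(Index = c(1, 2, 1),
                 Age = c("Jane Doe", "John Doe", "Jane Doe"),
                 Address = c("123 Fake Street", "780 York Street", "456 Elm Street"),
                 Telephone = c("xxx-xxx-xxxx", "zzz-zzz-zzzz", "yyy-yyy-yyyy"),
                 stringsAsFactors = FALSE)

# Number the rows within each Index (1 for the first address, 2 for the second, ...).
df$n <- ave(df$Index, df$Index, FUN = seq_along)

# Spread Address and Telephone into one set of columns per value of n.
wide <- reshape(df, idvar = c("Index", "Age"), timevar = "n",
                direction = "wide", sep = "")
```

The result has one row per Index with columns Address1, Telephone1, Address2 and Telephone2, and NA where a person has no second entry.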

Spread (tidyr) - Spreading repeated values

Given this data:
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4)
y <- c('Name', 'Street', 'Gender', 'Name', 'Street', 'Name', 'Street', 'Street', 'Dateofbirth', 'Gender','Name')
z <- c('Jasper', 'Broadway', 'Male', 'Alice', 'Narrowstreet', 'Peter', 'Neverland', 'Treasureisland', '1841', 'Male','Martin')
k <- data.frame(id = x, key = y, value = z)
I would like to create a clean 4-column table that has keys as headers (i.e. Name, Street, Gender and Date of birth). The problem here is that the key 'Street' is duplicated for Peter. I've tried to use spread (tidyr) but I haven't managed to make it work.
k <- k %>% group_by(id) %>%
mutate(index = row_number()) %>%
spread(key, value)
I also gave a shot to:
k <- k %>% group_by(id) %>%
mutate(index = row_number()) %>%
spread(id, value)
The result is not what I was expecting and both tables are quite difficult to work with. Any ideas?
Don't know if this is exactly what you are looking for, but if you just want to keep the first, you can group_by(id,key) and summarise value using first. Then, regroup by id and spread:
library(dplyr)
library(tidyr)
k <- k %>% group_by(id, key) %>% summarise(value=first(value)) %>% group_by(id) %>% spread(key,value)
##Source: local data frame [4 x 5]
##Groups: id [4]
##
## id Dateofbirth Gender Name Street
##* <dbl> <fctr> <fctr> <fctr> <fctr>
##1 1 NA Male Jasper Broadway
##2 2 NA NA Alice Narrowstreet
##3 3 1841 Male Peter Neverland
##4 4 NA NA Martin NA
To put the doubled values in separate columns, use make.names to create unique keys:
k <- k %>% group_by(id) %>% mutate(key=make.names(key,unique=TRUE)) %>% group_by(id) %>% spread(key,value)
##Source: local data frame [4 x 6]
##Groups: id [4]
##
## id Dateofbirth Gender Name Street Street.1
##* <dbl> <fctr> <fctr> <fctr> <fctr> <fctr>
##1 1 NA Male Jasper Broadway NA
##2 2 NA NA Alice Narrowstreet NA
##3 3 1841 Male Peter Neverland Treasureisland
##4 4 NA NA Martin NA NA
Alternatively, you can group_by(id,key) and summarise value using toString or paste with collapse to flatten the doubled values:
k <- k %>% group_by(id, key) %>% summarise(value=toString(value)) %>% group_by(id) %>% spread(key,value)
##Source: local data frame [4 x 5]
##Groups: id [4]
##
## id Dateofbirth Gender Name Street
##* <dbl> <chr> <chr> <chr> <chr>
##1 1 <NA> Male Jasper Broadway
##2 2 <NA> <NA> Alice Narrowstreet
##3 3 1841 Male Peter Neverland, Treasureisland
##4 4 <NA> <NA> Martin <NA>
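The make.names approach above can also be sketched without tidyr: make the keys unique within each id, then go wide with base R's `reshape`. This is an illustrative alternative, not part of the original answer; `wide` is a name introduced here.

```r
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4)
y <- c('Name', 'Street', 'Gender', 'Name', 'Street', 'Name', 'Street',
       'Street', 'Dateofbirth', 'Gender', 'Name')
z <- c('Jasper', 'Broadway', 'Male', 'Alice', 'Narrowstreet', 'Peter',
       'Neverland', 'Treasureisland', '1841', 'Male', 'Martin')
k <- data.frame(id = x, key = y, value = z, stringsAsFactors = FALSE)

# Make keys unique within each id: a second 'Street' becomes 'Street.1'.
k$key <- unsplit(lapply(split(k$key, k$id), make.names, unique = TRUE), k$id)

# One row per id, one column per (now unique) key.
wide <- reshape(k, idvar = "id", timevar = "key", direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix.
names(wide) <- sub("^value\\.", "", names(wide))
```

Only id 3 has a duplicated key, so Street.1 is NA for every other id.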
