Can I control how tidyjson joins nested arrays? - r

It seems as though tidyjson uses an inner-join-like behaviour on nested arrays, thus dropping records with empty child arrays. Is there a way to get left-join-like behaviour instead, filling with NAs?
For example, these fake data have one record with a populated nested array (middles) and two records where middles is empty:
library(tidyjson)
people <- c('{"age": 32, "name": [{"first": "Bob", "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
'{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
'{"age": 18, "name": [{"first": "Ann", "last": "Jones", "middles":[]}]}')
From these data I wish to have a dataframe with all the parent records retained and missing child-array information filled with NAs (~ left join) as such:
# A tibble: 3 x 5
age first last middle1 middle2
<dbl> <chr> <chr> <chr> <chr>
1 32 Bob Smith John Rick
2 54 Susan Doe NA NA
3 18 Ann Jones NA NA
However, extracting a nested array with some empty child arrays causes loss of their parent's information (~ inner join):
people %>%
spread_all() %>%
enter_object("name") %>% gather_array() %>%
spread_all() %>% select(-document.id,-array.index) %>%
enter_object("middles") %>% gather_array %>%
spread_all() %>% select(-array.index) %>%
tbl_df()
# A tibble: 1 x 5
age first last middle1 middle2
<dbl> <chr> <chr> <chr> <chr>
1 32 Bob Smith John Rick
Is there a way to avoid this; i.e., to retain all the rows even when child arrays are empty?
A workaround but not a solution
A possible workaround is to literally do a left join, but this means duplicating the JSON read, which is not trivial given gigabytes of data.
wrap_dplyr_verb <- function(dplyr.verb) {
# Creates a tidyjson verb out of a dplyr verb
# https://github.com/colearendt/tidyjson/blob/master/R/tbl_json.R
function(.data, ...) {
# Check if reserved ..JSON name already in data.frame
if ("..JSON" %in% names(.data))
stop("'..JSON' in the column names of tbl_json object being filtered")
# Assign JSON to the data.frame so it is treated as any other column
.data$..JSON <- attr(.data, "JSON")
# Apply the transformation
y <- dplyr.verb(dplyr::as_tibble(.data), ...)
# Reconstruct tbl_json without ..JSON column
tbl_json(dplyr::select(y, -..JSON), y$..JSON)
}
}
left_join_json = wrap_dplyr_verb(left_join)
people %>%
spread_all() %>%
enter_object("name") %>% gather_array() %>%
spread_all() %>% select(-document.id,-array.index) %>%
left_join_json(
people %>%
spread_all() %>%
enter_object("name") %>% gather_array() %>%
spread_all() %>% select(-document.id,-array.index) %>%
enter_object("middles") %>% gather_array %>%
spread_all() %>% select(-array.index)
) %>%
tbl_df()
Joining, by = c("age", "first", "last")
# A tibble: 3 x 5
age first last middle1 middle2
<dbl> <chr> <chr> <chr> <chr>
1 32 Bob Smith John Rick
2 54 Susan Doe NA NA
3 18 Ann Jones NA NA

Thanks so much for asking this question! It's definitely an interesting use case and one that I think could be improved upon within tidyjson.
Of course, I forgot that I had already worked with you on this question in GitHub, so I implemented with a different pattern here, in case it's any help to you or others. Basically, the idea is to go "tall" instead of "wide" first. I have no idea what it does to execution time, and whether or not there are left joins hidden in here that hurt the timing.
library(tidyjson)
library(dplyr)
library(tidyr)
people <- c('{"age": 32, "name": [{"first": "Bob", "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
'{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
'{"age": 18, "name": [{"first": "Ann", "last": "Jones", "middles":[]}]}')
as_tbl_json(people) %>% spread_all() %>% enter_object("name") %>% gather_array("nameid") %>%
gather_object("key") %>%
{bind_rows(
filter(., key != "middles") %>% append_values_string("value"),
filter(., key == "middles") %>% gather_array("middleid") %>%
select(-key) %>%
gather_object("key") %>%
append_values_string("value")
)} %>%
# drop tbl_json
as_tibble() %>%
select(-document.id, -nameid, -middleid) %>%
# could also use tidyr::pivot_wider
tidyr::spread(key, value)
#> # A tibble: 3 x 5
#> age first last middle1 middle2
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 18 Ann Jones <NA> <NA>
#> 2 32 Bob Smith John Rick
#> 3 54 Susan Doe <NA> <NA>
Created on 2020-06-28 by the reprex package (v0.3.0)

Related

R dataframe Removing duplicates / choosing which duplicate to remove

I have a dataframe that has duplicates based on their identifying ID, but some of the columns are different. I'd like to keep the rows (or the duplicates) that have the extra bit of info. The structure of the df is as such.
id <- c("3235453", "3235453", "21354315", "21354315", "2121421")
Plan_name<- c("angers", "strasbourg", "Benzema", "angers", "montpellier")
service_line<- c("", "AMRS", "", "Therapy", "")
treatment<-c("", "MH", "", "MH", "")
df <- data.frame (id, Plan_name, treatment, service_line)
As you can see, the ID row has duplicates, but I'd like to keep the second duplicate where there is more info in treatment and service_line.
I have tried using
df[duplicated(df[,c(1,3)]),]
but it doesn't work as an empty df is returned. Any suggestions?
Maybe you want something like this:
First we replace all blank with NA, then we arrange be Section.B and finally slice() first row from group:
library(dplyr)
df %>%
mutate(across(-c(id, Plan_name),~ifelse(.=="", NA, .))) %>%
group_by(id) %>%
arrange(Section.B, .by_group = TRUE) %>%
slice(1)
id Plan_name Section.B Section.C
<chr> <chr> <chr> <chr>
1 2121421 montpellier NA NA
2 21354315 angers MH Therapy
3 3235453 strasbourg MH AMRS
Try with
library(dplyr)
df %>%
filter(if_all(treatment:service_line, ~ .x != ""))
-output
id Plan_name Section.B Section.C
1 3235453 strasbourg MH AMRS
2 21354315 angers MH Therapy
If we need ids with blanks and not duplicated as well
df %>%
group_by(id) %>%
filter(n() == 1|if_all(treatment:service_line, ~ .x != "")) %>%
ungroup
-output
# A tibble: 3 × 4
id Plan_name treatment service_line
<chr> <chr> <chr> <chr>
1 3235453 strasbourg "MH" "AMRS"
2 21354315 angers "MH" "Therapy"
3 2121421 montpellier "" ""

str_detect on multiple columns in the same row

I have two datasets, one with full names and one with first and last names.
library(tidyverse)
(x = tibble(fullname = c("Michael Smith",
"Elisabeth Brown",
"John-Henry Albert")))
#> # A tibble: 3 x 1
#> fullname
#> <chr>
#> 1 Michael Smith
#> 2 Elisabeth Brown
#> 3 John-Henry Albert
(y = tribble(~first, ~last,
"Elisabeth", "Smith",
"John", "Albert",
"Roland", "Brown"))
#> # A tibble: 3 x 2
#> first last
#> <chr> <chr>
#> 1 Elisabeth Smith
#> 2 John Albert
#> 3 Roland Brown
I'd like to make a single boolean column that is true only if the first and last column is within the fullname column.
In essence, I'm looking for something like:
x %>%
mutate(fname_match = str_detect(fullname, paste0(y$first, collapse = "|")), ## correct
lname_match = str_detect(fullname, paste0(y$last, collapse = "|"))) ## correct
#> # A tibble: 3 x 3
#> fullname fname_match lname_match
#> <chr> <lgl> <lgl>
#> 1 Michael Smith FALSE TRUE
#> 2 Elisabeth Brown TRUE TRUE
#> 3 John-Henry Albert TRUE TRUE
But here if I took the columns with two TRUE's Elisabeth Brown would be a false positive because the matching first name and last name are not in the same row.
My best idea so far is to combine the first and last column and search for this, but this creates a false negative for John-Henry
y = tribble(~first, ~last,
"Elisabeth", "Smith",
"John", "Albert",
"Roland", "Brown") %>%
rowwise() %>%
mutate(longname = paste(first, last, sep = "&"))
x %>%
mutate(full_match = str_detect(fullname, paste0(y$longname, collapse = "|")))
#> # A tibble: 3 x 2
#> fullname full_match
#> <chr> <lgl>
#> 1 Michael Smith FALSE
#> 2 Elisabeth Brown FALSE
#> 3 John-Henry Albert FALSE
I think this does what you want, using purrr::map2 to iterate over the tuples of first and last.
library(dplyr)
library(purrr)
y %>%
mutate(
name_match = map2_lgl(
first, last,
.f = ~any(grepl(paste0(.x, '.*', .y), x$fullname, ignore.case = T))
)
)
Do mind, paste0(.x, '.*', .y) combines them into a regex that only lets rows pass in which the last name appears fully after the first. That seemed reasonable to do (otherwise, first name "Elisabeth", last name "Abe" would still be TRUE, which I here assume you would not want).
Also, the above is case insensitive.
// UPDATE:
I forgot; inversely, if you want to check the fullname values in x, then you can run this:
x %>%
rowwise() %>%
mutate(
name_match = any(map2_lgl(
y$first, y$last,
.f = ~grepl(paste0('\\b', .x, '\\b.*\\b', .y, '\\b'), fullname, ignore.case = T)
))
)
Depending on how important this check is for you and how many assumptions you want to make, it might make sense to tweak the above regex a little further:
ensure that the first name and last name stand as isolated words in the fullname
-> paste0('\\b', .x, '\\b.*\\b', .y, '\\b')
test that the first name comes right at the beginning
-> paste0('^', .x, '\\b.*\\b', .y, '\\b')
test that the fullname ends after the last name
-> paste0('\\b', .x, '\\b.*\\b', .y, '$')

R JSON to tibble

I have the following data passed back from an API and I cannot change it's structure. I would like to convert the following JSON into a tibble.
data <- '{ "ids":{
"00000012664":{
"state":"Indiana",
"version":"10",
"external_ids":[
{
"db":"POL",
"db_id":"18935"
},
{
"db":"CIT",
"db_id":"1100882"
}
],
"id":"00000012520",
"name":"Joe Smith",
"aliases":[
"John Smith",
"Bill Smith"
]
},
"00000103162":{
"state":"Kentucky",
"external_ids":[
{
"db":"POL",
"db_id":"69131"
},
{
"db":"CIT",
"db_id":"1098802"
}
],
"id":"00000003119",
"name":"Sue Smith",
"WIP":98203059
} ,
"0000019223":{
"state":"Ohio",
"external_ids":[
{
"db":"POL",
"db_id":"69134"
},
{
"db":"JT",
"db_id":"615234"
}
],
"id":"0000019223",
"name":"Larry Smith",
"WIP":76532172,
"aliases":[
"Test 1",
"Test 2",
"Test 3",
"Test 4"
],
"insured":1
} } }'
Please Note: This is a small subset of the data and could have thousands of "ids".
I've tried jsonlite and tidyjson with a combination of purrr.
The following gives me a tibble, but I cannot figure out how to get aliases back.
obj <- jsonlite::fromJSON(data, simplifyDataFrame = T, flatten = F)
obj$ids %>% {
data_frame(id=purrr::map_chr(., 'id'),
state=purrr::map_chr(., 'state', ''),
WIP=purrr::map_chr(., 'WIP', .default=''),
#aliases=purrr::map(purrr::map_chr(., 'aliases', .default=''), my_fun)
)
}
I cannot figure out with tidyjson either:
data %>% enter_object(ids) %>% gather_object %>% spread_all
What I would like back is a tibble with the following fields (regardless if they are in the JSON or not.
id
name
state
version
aliases -> as a string comma separated
WIP
BONUS: ;-)
Can I get external_ids as a string as well?
Instead of extracting each element with multiple calls with map, an option is to convert to tibble with (as_tibble) and select the columns of interest, grouped by 'id' collapse the 'aliases' into a single string and get the distinct rows by 'id'
library(tibble)
library(purrr)
library(stringr)
map_dfr(obj$ids, ~ as_tibble(.x) %>%
select(id, one_of("name", "state", "version", "aliases", "WIP"))) %>%
group_by(id) %>%
mutate(aliases = toString(unique(aliases))) %>%
distinct(id, .keep_all = TRUE)
# A tibble: 2 x 6
# Groups: id [2]
# id name state version aliases WIP
# <chr> <chr> <chr> <chr> <chr> <int>
#1 00000012520 Joe Smith Indiana 10 John Smith, Bill Smith NA
#2 00000003119 Sue Smith Kentucky <NA> NA 98203059
If we also need the 'external_ids' (which is a data.frame)
map_dfr(obj$ids, ~ as_tibble(.x) %>%
mutate(external_ids = reduce(external_ids, str_c, sep = " "))) %>%
group_by(id) %>%
mutate_at(vars(aliases, external_ids), ~ toString(unique(.))) %>%
ungroup %>%
distinct(id, .keep_all= TRUE)
# A tibble: 2 x 7
# state version external_ids id name aliases WIP
# <chr> <chr> <chr> <chr> <chr> <chr> <int>
#1 Indiana 10 POL 18935, CIT 1100882 00000012520 Joe Smith John Smith, Bill Smith NA
#2 Kentucky <NA> POL 69131, CIT 1098802 00000003119 Sue Smith NA 98203059
Update
For the new data, we can use
obj$ids %>%
map_dfr(~ map_df(.x, reduce, str_c, collapse = ", ", sep= " ") )
# A tibble: 3 x 8
# state version external_ids id name aliases WIP insured
# <chr> <chr> <chr> <chr> <chr> <chr> <int> <int>
#1 Indiana 10 POL 18935, CIT 1100882 00000012520 Joe Smith John Smith Bill Smith NA NA
#2 Kentucky <NA> POL 69131, CIT 1098802 00000003119 Sue Smith <NA> 98203059 NA
#3 Ohio <NA> POL 69134, JT 615234 0000019223 Larry Smith Test 1 Test 2 Test 3 Test 4 76532172 1

How to return values from group_by in R dplyr?

Good morning,
I've got a two-column dataset which I'd like to spread to more columns based on a group_by in Dplyr but I'm not sure how.
My data looks like:
Person Case
John A
John B
Bill C
David F
I'd like to be able to transform it to the following structure:
Person Case_1 Case_2 ... Case_n
John A B
Bill C NA
David F NA
My original thought was along the lines of:
data %>%
group_by(Person) %>%
spread()
Error: Please supply column name
What's the easiest, or most R-like way to achieve this?
You should first add a case id to the dataset, which can be done with a combination of group_by and mutate:
dat = data.frame(Person = c('John', 'John', 'Bill', 'David'), Case = c('A', 'B', 'C', 'F'))
dat = dat %>% group_by(Person) %>% mutate(id = sprintf('Case_%d', row_number()))
dat %>% head()
# A tibble: 4 × 3
Person Case id
<fctr> <fctr> <chr>
1 John A Case_1
2 John B Case_2
3 Bill C Case_1
4 David F Case_1
Now you can use spread to transform the data:
dat %>% spread(Person, Case)
# A tibble: 2 × 4
id Bill David John
* <chr> <fctr> <fctr> <fctr>
1 Case_1 C F A
2 Case_2 NA NA B
You can get the structure you list above using:
res = dat %>% spread(Person, Case) %>% select(-id) %>% t() %>% as.data.frame()
names(res) = unique(dat$id)
res
Case_1 Case_2
Bill C <NA>
David F <NA>
John A B

Spread (tidyr) - Spreading repeated values

Given this data:
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4)
y <- c('Name', 'Street', 'Gender', 'Name', 'Street', 'Name', 'Street', 'Street', 'Dateofbirth', 'Gender','Name')
z <- c('Jasper', 'Broadway', 'Male', 'Alice', 'Narrowstreet', 'Peter', 'Neverland', 'Treasureisland', '1841', 'Male','Martin')
k <- data.frame(id = x, key = y, value = z)
I would like to create a clean 4-column table that has has keys as headers (i.e. Name, Street, Gender and Date of birth). The problem here is that the key 'Street' is double for Peter. I've tried to use spread (tidyr) but I haven't managed to make it work.
k <- k %>% group_by(id) %>%
mutate(index = row_number()) %>%
spread(key, value)
I also gave a shot to:
k <- k %>% group_by(id) %>%
mutate(index = row_number()) %>%
spread(id, value)
The result is not what I was expecting and both tables are quite difficult to work with. Any ideas?
Don't know if this is exactly what you are looking for, but if you just want to keep the first, you can group_by(id,key) and summarise value using first. Then, regroup by id and spread:
library(dplyr)
library(tidyr)
k <- k %>% group_by(id, key) %>% summarise(value=first(value)) %>% group_by(id) %>% spread(key,value)
##Source: local data frame [4 x 5]
##Groups: id [4]
##
## id Dateofbirth Gender Name Street
##* <dbl> <fctr> <fctr> <fctr> <fctr>
##1 1 NA Male Jasper Broadway
##2 2 NA NA Alice Narrowstreet
##3 3 1841 Male Peter Neverland
##4 4 NA NA Martin NA
To put the doubled values in separate columns, use make.names to create unique keys:
k <- k %>% group_by(id) %>% mutate(key=make.names(key,unique=TRUE)) %>% group_by(id) %>% spread(key,value)
##Source: local data frame [4 x 6]
##Groups: id [4]
##
## id Dateofbirth Gender Name Street Street.1
##* <dbl> <fctr> <fctr> <fctr> <fctr> <fctr>
##1 1 NA Male Jasper Broadway NA
##2 2 NA NA Alice Narrowstreet NA
##3 3 1841 Male Peter Neverland Treasureisland
##4 4 NA NA Martin NA NA
Alternatively, you can group_by(id,key) and summarise value using toString or paste with collapse to flatten the doubled values:
k <- k %>% group_by(id, key) %>% summarise(value=toString(value)) %>% group_by(id) %>% spread(key,value)
##Source: local data frame [4 x 5]
##Groups: id [4]
##
## id Dateofbirth Gender Name Street
##* <dbl> <chr> <chr> <chr> <chr>
##1 1 <NA> Male Jasper Broadway
##2 2 <NA> <NA> Alice Narrowstreet
##3 3 1841 Male Peter Neverland, Treasureisland
##4 4 <NA> <NA> Martin <NA>

Resources