Parse multi-level JSON file in R

I have a pretty good understanding of R but am new to JSON file types and best practices for parsing. I'm having difficulties building a data frame from a raw JSON file. The JSON file (data below) is made up of repeated-measures data with multiple observations per user.
When the raw file is read into R with
jdata <- read_json("./raw.json")
it comes in as a "List of 1", with that list being user_ids. Within each user_id are further lists, like so:
jdata$user_id$`sjohnson`$date$`2020-09-25`$city
The very last position actually splits into two options - $city or $zip. At the highest level, there are about 89 users in the complete file.
My goal would be to end up with a rectangular data frame or multiple data frames that I can merge together like this - where I don't actually need the zip code.
[example table image]
I've tried jsonlite along with the tidyverse, and the farthest I seem to get is a data frame with one variable at the lowest level (cities and zip codes alternating rows), using this:
df <- as.data.frame(matrix(unlist(jdata), nrow=length(unlist(jdata["users"]))))
Any help/suggestions to get closer to the table above would be much appreciated. I have a feeling I'm failing at looping it back through the different levels.
Here is an example of the raw json file structure:
{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}

Another (straightforward) solution, doing the heavy lifting with rrapply() from the rrapply package:
library(rrapply)
library(dplyr)
rrapply(jdata, how = "melt") %>%
filter(L5 == "city") %>%
select(user_id = L2, date = L4, city = value)
#>    user_id       date         city
#> 1 sjohnson 2020-09-25       Denver
#> 2 sjohnson 2020-10-01      Atlanta
#> 3 sjohnson 2020-11-04 Jacksonville
#> 4   asmith 2020-10-16    Cleavland
#> 5   asmith 2020-11-10     Elmhurst
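As an aside, recent versions of rrapply (1.2+) also provide how = "bind", which unnests the repeated date entries straight into a wide data frame, so the melt-and-filter step can be skipped. A sketch under that assumption (namecols = TRUE should return the parent names as identifier columns L1, L2, ...):
# Sketch: bind the repeated leaf lists row-wise; namecols = TRUE keeps the
# parent list names (user, date) as identifier columns.
rrapply(jdata, how = "bind", options = list(namecols = TRUE)) %>%
  select(user_id = L2, date = L4, city)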
Data
jdata <- jsonlite::fromJSON('{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}')

We can build our desired structure step by step:
library(jsonlite)
library(tidyverse)
df <- fromJSON('{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}')
df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(city = unlist(value)) %>%
  filter(var == 'city') %>%
  select(-var, -value)
which gives:
# A tibble: 5 x 3
  user_id  date       city
  <chr>    <chr>      <chr>
1 sjohnson 2020-09-25 Denver
2 sjohnson 2020-10-01 Atlanta
3 sjohnson 2020-11-04 Jacksonville
4 asmith   2020-10-16 Cleavland
5 asmith   2020-11-10 Elmhurst
An alternative solution, inspired by @Greg, where we change the last steps of the pipe:
df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(value = unlist(value)) %>%
  pivot_wider(names_from = "var") %>%
  select(user_id, date, city)
This gives almost the same result, with the exception of one additional case where city is NA (appending filter(!is.na(city)) would drop it):
# A tibble: 6 x 3
  user_id  date                city
  <chr>    <chr>               <chr>
1 sjohnson 2020-09-25          Denver
2 sjohnson 2020-10-01          Atlanta
3 sjohnson 2020-11-04          Jacksonville
4 asmith   2020-10-16          Cleavland
5 asmith   2020-11-10          Elmhurst
6 asmith   2020-11-10 08:49:36 NA

Here's a solution in the tidyverse: a custom function unnestable() designed to recursively unnest into a table the contents of a list like you describe. See Details for particulars regarding the format of such a list and its table.
Solution
First ensure the necessary libraries are present:
library(jsonlite)
library(tidyverse)
Then define the unnestable() function as follows:
unnestable <- function(v) {
  # If we've reached the bottommost list, simply treat it as a table...
  if(all(sapply(
    X = v,
    # Check that each element is a single value (or NULL).
    FUN = function(x) {
      is.null(x) || purrr::is_scalar_atomic(x)
    },
    simplify = TRUE
  ))) {
    v %>%
      # Replace any NULLs with NAs to preserve blank fields...
      sapply(
        FUN = function(x) {
          if(is.null(x))
            NA
          else
            x
        },
        simplify = FALSE
      ) %>%
      # ...and convert this bottommost list into a table.
      tidyr::as_tibble()
  }
  # ...but if this list contains another nested list, then recursively unnest
  # its contents and combine their tabular results.
  else if(purrr::is_scalar_list(v)) {
    # Take the contents within the nested list...
    v[[1]] %>%
      # ...apply this 'unnestable()' function to them recursively...
      sapply(
        FUN = unnestable,
        simplify = FALSE,
        USE.NAMES = TRUE
      ) %>%
      # ...and stack their results.
      dplyr::bind_rows(.id = names(v)[1])
  }
  # Otherwise, the format is unrecognized and yields no results.
  else {
    NULL
  }
}
Finally, process the JSON data as follows:
# Read the JSON file into an R list.
jdata <- jsonlite::read_json("./raw.json")
# Flatten the R list into a table, via 'unnestable()'
flat_data <- unnestable(jdata)
# View the raw table.
flat_data
Naturally, you can reformat this table however you desire:
library(lubridate)
flat_data <- flat_data %>%
  dplyr::transmute(
    user_id = as.character(user_id),
    date = lubridate::as_datetime(date),
    city = as.character(city)
  ) %>%
  dplyr::distinct()
# View the reformatted table.
flat_data
Results
Given a raw.json file like that sampled here
{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}
then unnestable() will yield a tibble like this
# A tibble: 6 x 6
  user_id  date                city         zip   location     timestamp
  <chr>    <chr>               <chr>        <chr> <lgl>            <dbl>
1 sjohnson 2020-09-25          Denver       80014 NA                  NA
2 sjohnson 2020-10-01          Atlanta      30301 NA                  NA
3 sjohnson 2020-11-04          Jacksonville 14001 NA                  NA
4 asmith   2020-10-16          Cleavland    34321 NA                  NA
5 asmith   2020-11-10          Elmhurst     00013 NA                  NA
6 asmith   2020-11-10 08:49:36 NA           NA    NA       1605016176013
which dplyr will format into the result below:
# A tibble: 6 x 3
  user_id  date                city
  <chr>    <dttm>              <chr>
1 sjohnson 2020-09-25 00:00:00 Denver
2 sjohnson 2020-10-01 00:00:00 Atlanta
3 sjohnson 2020-11-04 00:00:00 Jacksonville
4 asmith   2020-10-16 00:00:00 Cleavland
5 asmith   2020-11-10 00:00:00 Elmhurst
6 asmith   2020-11-10 08:49:36 NA
Details
List Format
To be precise, the list represents nested groupings by the fields {group_1, group_2, ..., group_n}, and it must be of the form:
list(
  group_1 = list(
    "value_1" = list(
      group_2 = list(
        "value_1.1" = list(
          # .
          # .
          # .
          group_n = list(
            "value_1.1.….n.1" = list(
              field_a = 1,
              field_b = TRUE
            ),
            "value_1.1.….n.2" = list(
              field_a = 2,
              field_c = "2"
            )
            # ...
          )
        ),
        "value_1.2" = list(
          # .
          # .
          # .
        )
        # ...
      )
    ),
    "value_2" = list(
      group_2 = list(
        "value_2.1" = list(
          # .
          # .
          # .
          group_n = list(
            "value_2.1.….n.1" = list(
              field_a = 3,
              field_d = 3.0
            )
            # ...
          )
        ),
        "value_2.2" = list(
          # .
          # .
          # .
        )
        # ...
      )
    )
    # ...
  )
)
Table Format
Given a list of this form, unnestable() will flatten it into a table of the following form:
# A tibble: … x …
  group_1 group_2   ... group_n         field_a field_b field_c field_d
  <chr>   <chr>     ... <chr>           <dbl>   <lgl>   <chr>   <dbl>
1 value_1 value_1.1 ... value_1.1.….n.1 1       TRUE    NA      NA
2 value_1 value_1.1 ... value_1.1.….n.2 2       NA      2       NA
3 value_1 value_1.2 ... value_1.2.….n.1 ...     ...     ...     ...
⋮ ⋮       ⋮         ⋮   ⋮               ⋮       ⋮       ⋮       ⋮
j value_2 value_2.1 ... value_2.1.….n.1 3       NA      NA      3
⋮ ⋮       ⋮         ⋮   ⋮               ⋮       ⋮       ⋮       ⋮
k value_2 value_2.2 ... value_2.2.….n.1 ...     ...     ...     ...
⋮ ⋮       ⋮         ⋮   ⋮               ⋮       ⋮       ⋮       ⋮
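As a quick sanity check of these rules, a minimal two-level list (with hypothetical names) flattens as expected; the output below is a sketch of the resulting shape:
toy <- list(
  user = list(
    "alice" = list(
      date = list(
        "2020-01-01" = list(city = "Denver", zip = "80014")
      )
    )
  )
)
unnestable(toy)
#> # A tibble: 1 x 4
#>   user  date       city   zip
#>   <chr> <chr>      <chr>  <chr>
#> 1 alice 2020-01-01 Denver 80014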

Related

expanding a named list inside a dataframe

I have a number of JSON files in the following format:
{ "year": 2019,
"numberofhomes": 480,
"meetingdate": "2019-02-09",
"votes":
[
{ "date": "2019-01-23", "votes": 39 },
{ "date": "2019-02-01", "votes": 124 },
{ "date": "2019-02-09", "votes": 164 }
]
}
When reading this in with jsonlite::read_json, the resulting column for votes is a named list.
jsonlite::read_json("sources/votes-2019.json", simplifyDataFrame = FALSE) %>%
  as_tibble()
# A tibble: 3 x 4
year numberofhomes meetingdate votes
<int> <int> <chr> <list>
1 2019 480 2019-02-09 <named list [2]>
2 2019 480 2019-02-09 <named list [2]>
3 2019 480 2019-02-09 <named list [2]>
Or the alternative:
jsonlite::read_json("sources/votes-2019.json", simplifyDataFrame = TRUE) %>%
  as_tibble()
# A tibble: 3 x 4
year numberofhomes meetingdate votes$date $votes
<int> <int> <chr> <chr> <int>
1 2019 480 2019-02-09 2019-01-23 39
2 2019 480 2019-02-09 2019-02-01 124
3 2019 480 2019-02-09 2019-02-09 164
How can I transform the last column(s) into a normal dataframe column? Alternatively, is there a better way to read in JSON files with nested arrays?
You can use unnest_wider() (it comes from tidyr):
library(tibble)
library(tidyr)
jsonlite::read_json("sources/votes-2019.json", simplifyDataFrame = FALSE) %>%
  as_tibble() %>%
  unnest_wider(votes)
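If only specific fields are needed from each record, tidyr's hoist() plucks them out by name instead of spreading everything; a sketch using the same libraries (vote_date and n_votes are illustrative names):
jsonlite::read_json("sources/votes-2019.json", simplifyDataFrame = FALSE) %>%
  as_tibble() %>%
  # pull the "date" and "votes" fields out of each named-list element
  hoist(votes, vote_date = "date", n_votes = "votes")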

How to read files in two separate lists in a function based on a condition in R

Okay, I hope I manage to sum up what I need to achieve. I am running experiments in which I obtain data from two different sources, with date_time being the matching unifying variable. The data from the two separate sources have the same structure (in csv or txt). The distinction is in the filenames. I provide an example below:
list_of_files <- structure(
  list(
    solid_epoxy1_10 = data.frame(
      date_time = c("20/07/2022 13:46",
                    "20/07/2022 13:56",
                    "20/07/2022 14:06"),
      frequency = c("30000",
                    "31000",
                    "32000"),
      index = c("1", "2", "3")
    ),
    solid_otherpaint_20 = data.frame(
      date_time = c("20/07/2022 13:10",
                    "20/07/2022 13:20",
                    "20/07/2022 14:30"),
      frequency = c("20000",
                    "21000",
                    "22000"),
      index = c("1", "2", "3")
    ),
    water_epoxy1_10 = data.frame(
      date_time = c("20/07/2022 13:46",
                    "20/07/2022 13:56",
                    "20/07/2022 14:06"),
      temperature = c("22.3",
                      "22.6",
                      "22.5")
    ),
    water_otherpaint_20 = data.frame(
      date_time = c("20/07/2022 13:10",
                    "20/07/2022 13:20",
                    "20/07/2022 14:30"),
      temperature = c("24.5",
                      "24.6",
                      "24.8")
    )
  )
)
First I want to read the data files into two separate lists: one for files whose names contain the keyword "solid", and one for those containing "water".
Then I need to create new columns in each data frame from the filename, which is separated by "_" (e.g. paint = "epoxy1", thickness = "10"), so that I can do an inner join by the date_time column, paint, thickness, etc. Basically, what I struggle with so far is creating a function that loads the files into two separate lists. This is what I've tried so far:
load_files <-
  function(list_of_files) {
    all.files.board <- list()
    all.files.temp <- list()
    for (i in 1:length(list_of_files))
    {
      if (exists("board")) {
        all.files.board[[i]] = fread(list_of_files[i])
      }
      else {
        all.files.temp[[i]] = fread(list_of_files[i])
      }
      return(list(all.files.board, all.files.temp))
    }
  }
But it doesn't do what I need it to. I hope I made it as clear as possible. I'm pretty comfortable with the tidyverse packages, but still a newbie at writing custom functions. Any ideas welcomed.
Regarding the question in the title:
The first issue, calling return() too early and thus breaking the for-loop, was already mentioned in the comments, so that should be sorted.
The next one is the condition itself: if (exists("board")) checks whether an object called board exists anywhere in scope; in the provided sample it evaluates to TRUE only if something was assigned to a global board object before calling load_files(), and to FALSE only if there was no such assignment or board was explicitly removed. I.e. with
board <- "something"; dataframes <- load_files(file_list) that check will be TRUE, while with
rm(board); dataframes <- load_files(file_list) it will be FALSE. There's nothing in the function itself that would change the existence of board, so the result is actually determined before the function is called.
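A minimal illustration of that pitfall (toy function, not part of the original code):
f <- function() exists("board")
f()          # FALSE: no `board` in the function or its enclosing environments
board <- "x"
f()          # TRUE: exists() finds the global `board`
rm(board)
f()          # FALSE again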
If the core of the question is about joining two somewhat different datasets and splitting the result by groups, I'd just drop the loops, conditions and most of the involved lists, and would go with something like this in the tidyverse:
library(fs)
library(readr)
library(stringr)
library(dplyr)
library(tidyr)
# prepare input files for sample ------------------------------------------
sample_dfs <- structure(
  list(
    solid_epoxy1_10 = data.frame(
      date_time = c("20/07/2022 13:46", "20/07/2022 13:56", "20/07/2022 14:06"),
      frequency = c("30000", "31000", "32000"),
      index = c("1", "2", "3")
    ),
    solid_otherpaint_20 = data.frame(
      date_time = c("20/07/2022 13:10", "20/07/2022 13:20", "20/07/2022 14:30"),
      frequency = c("20000", "21000", "22000"),
      index = c("1", "2", "3")
    ),
    water_epoxy1_10 = data.frame(
      date_time = c("20/07/2022 13:46", "20/07/2022 13:56", "20/07/2022 14:06"),
      temperature = c("22.3", "22.6", "22.5")
    ),
    water_otherpaint_20 = data.frame(
      date_time = c("20/07/2022 13:10", "20/07/2022 13:20", "20/07/2022 14:30"),
      temperature = c("24.5", "24.6", "24.8")
    )
  )
)
tmp_path <- file_temp("reprex")
dir_create(tmp_path)
sample_filenames <- str_glue("{1:length(sample_dfs)}_{names(sample_dfs)}.csv")
for (i in seq_along(sample_dfs)) {
  write_csv(sample_dfs[[i]], path(tmp_path, sample_filenames[i]))
}
dir_ls(tmp_path, type = "file")
#> Temp/RtmpqUoct8/reprex5cc517f177b/1_solid_epoxy1_10.csv
#> Temp/RtmpqUoct8/reprex5cc517f177b/2_solid_otherpaint_20.csv
#> Temp/RtmpqUoct8/reprex5cc517f177b/3_water_epoxy1_10.csv
#> Temp/RtmpqUoct8/reprex5cc517f177b/4_water_otherpaint_20.csv
# read files --------------------------------------------------------------
t_solid <- dir_ls(tmp_path, glob = "*solid*.csv", type = "file") %>%
  read_csv(id = "filename") %>%
  extract(filename, c("paint", "thickness"), "_([^_]+)_(\\d+)\\.csv")
t_solid
#> # A tibble: 6 × 5
#> paint thickness date_time frequency index
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 epoxy1 10 20/07/2022 13:46 30000 1
#> 2 epoxy1 10 20/07/2022 13:56 31000 2
#> 3 epoxy1 10 20/07/2022 14:06 32000 3
#> 4 otherpaint 20 20/07/2022 13:10 20000 1
#> 5 otherpaint 20 20/07/2022 13:20 21000 2
#> 6 otherpaint 20 20/07/2022 14:30 22000 3
t_water <- dir_ls(tmp_path, glob = "*water*.csv", type = "file") %>%
  read_csv(id = "filename") %>%
  extract(filename, c("paint", "thickness"), "_([^_]+)_(\\d+)\\.csv")
t_water
#> # A tibble: 6 × 4
#> paint thickness date_time temperature
#> <chr> <chr> <chr> <dbl>
#> 1 epoxy1 10 20/07/2022 13:46 22.3
#> 2 epoxy1 10 20/07/2022 13:56 22.6
#> 3 epoxy1 10 20/07/2022 14:06 22.5
#> 4 otherpaint 20 20/07/2022 13:10 24.5
#> 5 otherpaint 20 20/07/2022 13:20 24.6
#> 6 otherpaint 20 20/07/2022 14:30 24.8
# or implement as a function ----------------------------------------------
load_files <- function(csv_path, glob = "*.csv") {
  return(
    dir_ls(csv_path, glob = glob, type = "file") %>%
      # store filenames in filename column
      read_csv(id = "filename", show_col_types = FALSE) %>%
      # extract each regex group to its own column
      extract(filename, c("paint", "thickness"), "_([^_]+)_(\\d+)\\.csv")
  )
}
# join / group / split ----------------------------------------------------
t_solid <- load_files(tmp_path, "*solid*.csv")
t_water <- load_files(tmp_path, "*water*.csv")
# either join by multiple columns or select only required cols
# to avoid x.* & y.* columns in result
inner_join(t_solid, t_water, by = c("date_time", "paint", "thickness")) %>%
  group_by(paint) %>%
  group_split()
Final result as a list of tibbles:
#> <list_of<
#> tbl_df<
#> paint : character
#> thickness : character
#> date_time : character
#> frequency : double
#> index : double
#> temperature: double
#> >
#> >[2]>
#> [[1]]
#> # A tibble: 3 × 6
#> paint thickness date_time frequency index temperature
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 epoxy1 10 20/07/2022 13:46 30000 1 22.3
#> 2 epoxy1 10 20/07/2022 13:56 31000 2 22.6
#> 3 epoxy1 10 20/07/2022 14:06 32000 3 22.5
#>
#> [[2]]
#> # A tibble: 3 × 6
#> paint thickness date_time frequency index temperature
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 otherpaint 20 20/07/2022 13:10 20000 1 24.5
#> 2 otherpaint 20 20/07/2022 13:20 21000 2 24.6
#> 3 otherpaint 20 20/07/2022 14:30 22000 3 24.8
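If two separate lists are genuinely needed, the same dir_ls() globs can feed purrr::map() directly, mirroring the fread() calls from the original attempt; a sketch under that assumption:
library(fs)
library(purrr)
library(data.table)
solid_files <- dir_ls(tmp_path, glob = "*solid*.csv", type = "file")
water_files <- dir_ls(tmp_path, glob = "*water*.csv", type = "file")
all.files.board <- map(solid_files, fread) # list of "solid" tables
all.files.temp <- map(water_files, fread)  # list of "water" tables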

Conditionally counting number of ocurrences in a dataframe - performance improvement

I need to detect (among other things) the first occurrence of a non-"F" code in a patient's list, after the first "F" code occurrence. The code below seems to succeed at this; however, it proves too inefficient on the server when run on a data set of one million observations.
The final data set should have a variable with the number of non-F codes (nhosp), and the first non-F code (ficd) found after the first F-code appearance in the DAIGNOSTICO variable. No duplicates of ID.
How can I improve it, both in terms of complexity and speed? A tidyverse pipe is preferred.
This is what the result should look like:
# A tibble: 7 × 6
# Groups:   ID [7]
      ID DAIGNOSTICO data_entrada data_saida nhosp ficd
   <dbl> <chr>       <date>       <date>     <dbl> <chr>
1   1555 F180        1930-04-05   2005-03-15     1 T124
2   1234 F100        1980-04-01   2005-03-02     2 O155
3  16666 F120        1990-06-05   2005-03-18     0 <NA>
4 123456 F145        2001-03-07   2005-03-11     2 T123
5 177778 F155        2001-04-13   2005-03-22     2 G123
6 166666 F125        2002-03-12   2005-03-19     2 W345
7  12345 F150        2002-06-03   2005-03-07     4 K709
This is what my code looks like currently:
library(readr)
library(dplyr)
library(tidyr)
simulation <- read_csv("SIMULADO.txt", col_types = cols(
data_entrada = col_date("%d/%m/%Y"),
data_saida = col_date("%d/%m/%Y")
)
)
simulation <- as.data.frame(simulation)
simulation[, "nhosp"] <- 0
oldpos <- 1
for (i in 1:nrow(simulation)) {
  if (grepl("F", simulation[i, "DAIGNOSTICO"])) { # Has F?
    oldpos <- i
    clin <- 0
    simulation[i, "hasF"] <- TRUE
  } else {
    simulation[i, "hasF"] <- FALSE
  }
  if (simulation[i, "ID"] == simulation[oldpos, "ID"]) { # same person?
    if (simulation[oldpos, "hasF"] == TRUE) { # did she/he have F?
      simulation[i, "hasF"] <- TRUE
      if (simulation[i, "data_entrada"] > simulation[oldpos, "data_entrada"]) { # is it subsequent?
        if (!grepl("F", simulation[i, "DAIGNOSTICO"])) { # not F?
          simulation[i, "hasC"] <- TRUE
          clin <- 1
          simulation[i, "ficd"] <- simulation[i, "DAIGNOSTICO"]
          simulation[i, "nhosp"] <- clin
          first_cc <- simulation[i, "DAIGNOSTICO"]
        }
      }
    }
  }
}
dt1 <- simulation %>%
  arrange(data_entrada) %>%
  group_by(ID) %>%
  select(ficd) %>%
  drop_na() %>%
  slice(1)
dt2 <- simulation %>%
  arrange(data_entrada) %>%
  group_by(ID) %>%
  filter(hasF == TRUE) %>%
  mutate(nhosp = cumsum(nhosp),
         nhosp = max(nhosp)) %>%
  select(-ficd, -hasF, -hasC) %>%
  distinct(ID, .keep_all = TRUE) %>%
  full_join(dt1, by = "ID")
dt2
And this is an example data set, with some errors included to check the robustness of the code:
ID, DAIGNOSTICO, data_entrada, data_saida
123490, O100, 01/04/1980, 02/03/2005
123490, O100, 01/04/1981, 02/03/2005
123491, O101, 01/04/1980, 02/03/2005
123491, O101, 01/04/1981, 02/03/2005
1234, F100, 01/04/1980, 02/03/2005
1234, O155, 02/04/1980, 03/03/2005
1234, G123, 05/05/1982, 04/03/2005
12345, T124, 01/06/2002, 05/03/2005
12345, Y124, 02/06/2002, 06/03/2005
12345, F150, 03/06/2002, 07/03/2005
12345, K709, 04/06/2002, 08/03/2005
12345, Y709, 05/06/2002, 09/03/2005
12345, F150, 03/06/2002, 07/03/2005
12345, K710, 06/06/2002, 08/03/2005
12345, K711, 07/06/2002, 10/03/2005
12345, F150, 08/06/2002, 07/03/2005
123456, F145, 07/03/2001, 11/03/2005
123456, T123, 08/03/2001, 12/03/2005
123456, P123, 09/03/2001, 13/03/2005
1555 ,R155, 04/04/1930, 14/03/2005
1555 ,F180, 05/04/1930, 15/03/2005
1555 ,T124, 06/04/1930, 16/03/2005
1555 ,F708, 07/04/1930, 17/03/2005
16666 ,F120, 05/06/1990, 18/03/2005
166666, F125, 12/03/2002, 19/03/2005
166666, W345, 13/03/2002, 20/03/2005
166666, L123, 14/03/2002, 21/03/2005
177778, F155, 13/04/2001, 22/03/2005
177778, G123, 14/04/2001, 23/03/2005
177778, F190, 15/04/2001, 24/03/2005
177778, E124, 16/04/2001, 25/03/2005
177779, G155, 13/04/2001, 22/03/2005
177779, G123, 14/04/2001, 23/03/2005
177779, G190, 15/04/2001, 24/03/2005
177779, E124, 16/04/2001, 25/03/2005
You could use
library(dplyr)
library(stringr)
df %>%
  group_by(ID) %>%
  filter(cumsum(str_detect(DAIGNOSTICO, "^F")) > 0) %>%
  mutate(nhosp = sum(str_detect(DAIGNOSTICO, "^[^F]")),
         ficd = lead(DAIGNOSTICO)) %>%
  filter(str_detect(DAIGNOSTICO, "^F")) %>%
  slice(1) %>%
  ungroup()
This returns
# A tibble: 7 x 6
      ID DAIGNOSTICO data_entrada data_saida nhosp ficd
   <dbl> <chr>       <chr>        <chr>      <int> <chr>
1   1234 F100        01/04/1980   02/03/2005     2 O155
2   1555 F180        05/04/1930   15/03/2005     1 T124
3  12345 F150        03/06/2002   07/03/2005     4 K709
4  16666 F120        05/06/1990   18/03/2005     0 NA
5 123456 F145        07/03/2001   11/03/2005     2 T123
6 166666 F125        12/03/2002   19/03/2005     2 W345
7 177778 F155        13/04/2001   22/03/2005     2 G123
Edit
I think there might be a flaw: if two F codes appear back to back, lead(DAIGNOSTICO) would pick up another F code as ficd. Perhaps
library(dplyr)
library(stringr)
df %>%
  group_by(ID) %>%
  filter(
    cumsum(str_detect(DAIGNOSTICO, "^F")) == 1 |
      !str_detect(DAIGNOSTICO, "^F") & cumsum(str_detect(DAIGNOSTICO, "^F")) > 0
  ) %>%
  mutate(nhosp = sum(str_detect(DAIGNOSTICO, "^[^F]")),
         ficd = lead(DAIGNOSTICO)) %>%
  filter(str_detect(DAIGNOSTICO, "^F")) %>%
  slice(1) %>%
  ungroup()
is a better solution.
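On a million rows the repeated regex scans are a large part of the cost; since the pattern is a fixed prefix, base startsWith() can replace str_detect(x, "^F") without changing the logic. A sketch of the first version under that assumption (the same swap applies to the edited version):
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(cumsum(startsWith(DAIGNOSTICO, "F")) > 0) %>%
  mutate(nhosp = sum(!startsWith(DAIGNOSTICO, "F")),
         ficd = lead(DAIGNOSTICO)) %>%
  filter(startsWith(DAIGNOSTICO, "F")) %>%
  slice(1) %>%
  ungroup()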

How to exact match two column values in entire Dataset using R

I have the two dataframes mentioned below in R, and I have tried various methods but couldn't achieve the required output yet.
DF:
ID   Date                city       code uid
I-1  2020-01-01 10:12:15 New York   123  K-1
I-1  2020-01-01 10:12:15 Utha       103  K-1
I-2  2020-01-02 10:12:15 Washington 122  K-1
I-3  2020-02-01 10:12:15 Tokyo      123  K-2
I-3  2020-02-01 10:12:15 Osaka      193  K-2
I-4  2020-02-02 10:12:15 London     144  K-3
I-5  2020-02-04 10:12:15 Dubai      101  K-4
I-6  2019-11-01 10:12:15 Dubai      101  K-4
I-7  2019-11-01 10:12:15 London     144  K-3
I-8  2018-12-13 10:12:15 Tokyo      143  K-5
I-9  2019-05-17 10:12:15 Dubai      101  K-4
I-19 2020-03-11 10:12:15 Dubai      150  K-7
Dput:
structure(list(ID = c("I-1", "I-1",
"I-2", "I-3", "I-3", "I-4",
"I-5", "I-6", "I-7", "I-8", "I-9","I-19"
), DATE = c("2020-01-01 11:49:40.842", "2020-01-01 09:35:33.607",
"2020-01-02 06:14:58.731", "2020-02-01 16:51:27.190", "2020-02-01 05:35:46.952",
"2020-02-02 05:48:49.443", "2020-02-04 10:00:41.616", "2019-11-01 09:10:46.536",
"2019-11-01 11:54:05.655", "2018-12-13 14:24:31.617", "2019-05-17 14:24:31.617", "2020-03-11 14:24:31.617"), CITY = c("New York",
"UTAH", "Washington", "Tokyo",
"Osaka", "London", "Dubai",
"Dubai", "London", "Tokyo", "Dubai",
"Dubai"), CODE = c("221010",
"411017", "638007", "583101", "560029", "643102", "363001", "452001",
"560024", "509208"), UID = c("K-1",
"K-1", "K-1", "K-2", "K-2",
"K-3", "K-4", "K-4", "K-3",
"K-5","K-4","K-7")), .Names = c("ID", "DATE",
"CITY", "CODE", "UID"), row.names = c(NA,
10L), class = "data.fram)
Using the two dataframes mentioned above, I want to fetch records between 1st Jan 2020 and 29th Feb 2020, compare those IDs against the entire database to check whether both city and code together match another ID, and categorize further to check how many have the same uid and how many have a different one.
Where,
Match - whether the combination of city and code matches another ID in the database
Same_uid - classification of matched IDs to identify how many have the same uid
different_uid - classification of matched IDs to identify how many don't have the same uid
uid_count - count of the same uid for that particular ID in the entire database
Note - I have more than 10M records in the dataframe.
Required Output
ID  Date                city       code uid Match Same_uid different_uid uid_count
I-1 2020-01-01 10:12:15 New York   123  K-1 No    0        0             2
I-2 2020-01-02 10:12:15 Washington 122  K-1 No    0        0             2
I-3 2020-02-01 10:12:15 Tokyo      123  K-2 No    0        0             1
I-4 2020-02-02 10:12:15 London     144  K-3 Yes   1        0             2
I-5 2020-02-04 10:12:15 Dubai      101  K-4 Yes   2        0             3
An approach:
Load in the dataset
library(tidyverse)
library(lubridate)
mydata <- tibble(
  ID = c("I-1", "I-1", "I-2", "I-3", "I-3", "I-4",
         "I-5", "I-6", "I-7", "I-8", "I-9", "I-19"),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-02-01",
           "2020-02-01", "2020-02-02", "2020-02-04", "2019-11-01",
           "2019-11-01", "2018-12-13", "2019-05-17", "2020-03-11"),
  city = c("New York", "Utha", "Washington", "Tokyo", "Osaka", "London",
           "Dubai", "Dubai", "London", "Tokyo", "Dubai", "Dubai"),
  code = c("123", "103", "122", "123", "193", "144",
           "101", "101", "144", "143", "101", "150"),
  uid = c("K-1", "K-1", "K-1", "K-2", "K-2", "K-3",
          "K-4", "K-4", "K-3", "K-5", "K-4", "K-7"))
mydata <- mydata %>%
  mutate(Date = ymd(str_remove(Date, " .*")),
         code = as.character(code))
Where clause number 1
I use count from dplyr to count the codes by cities. Then case_when to further identify with a "Yes" or "No" as requested.
# This counts city and code, and fulfills your "Match" column requirement
startdate <- "2017-01-01"
enddate <- "2020-03-29"
mydata %>%
  filter(Date >= startdate,
         Date <= enddate) %>%
  count(city, code, name = "count_samecode") %>%
  mutate(Match = case_when(
    count_samecode > 1 ~ "Yes",
    T ~ "No")) %>%
  head()
# # A tibble: 6 x 4
# city code count_samecode Match
# <chr> <chr> <int> <chr>
# 1 Dubai 101 3 Yes
# 2 Dubai 150 1 No
# 3 London 144 2 Yes
# 4 New York 123 1 No
# 5 Osaka 193 1 No
# 6 Tokyo 123 1 No
Where clause number 2
I will do the same with UID
mydata %>%
  filter(Date >= startdate,
         Date <= enddate) %>%
  count(city, uid, name = "UIDs_#_filtered") %>%
  head()
# # A tibble: 6 x 3
# city uid `UIDs_#_filtered`
# <chr> <chr> <int>
# 1 Dubai K-4 3
# 2 Dubai K-7 1
# 3 London K-3 2
# 4 New York K-1 1
# 5 Osaka K-2 1
# 6 Tokyo K-2 1
Where clause number 3
I can repeat the count of clause number 2 to find how many of these cities have a different UID, where > 1 signals a different UID.
mydata %>%
  filter(Date >= startdate,
         Date <= enddate) %>%
  count(city, uid, name = "UIDs_#_filtered") %>%
  count(city, name = "UIDs_#_different") %>%
  head()
# # A tibble: 6 x 2
# city `UIDs_#_different`
# <chr> <int>
# 1 Dubai 2
# 2 London 1
# 3 New York 1
# 4 Osaka 1
# 5 Tokyo 2
# 6 Utha 1
Where clause number 4
Taking the same code from #2, I can eliminate the filter to count over the entire dataset:
mydata %>%
  count(city, uid, name = "UIDs_#_all") %>%
  head()
Putting it all together
Using several left_join's we can get closer to your desired output.
EDIT: this now brings in the first instance of the ID from the first city/code combination.
check_duplicates_filterview.f <- function(df, startdate, enddate) {
  # df should be a tibble
  # startdate should be a string "yyyy-mm-dd"
  # enddate should be a string "yyyy-mm-dd"
  cityfilter <- df %>%
    filter(Date >= startdate,
           Date <= enddate) %>%
    distinct(city) %>%
    pull(1)
  df <- df %>%
    filter(city %in% cityfilter) %>%
    mutate(Date = ymd(str_remove(Date, " .*")),
           code = as.character(code))
  entire.db.countcodes <- df %>% # finds count of code in entire DB
    count(city, code)
  where.1 <- df %>%
    filter(Date >= startdate,
           Date <= enddate) %>%
    distinct(city, code, .keep_all = T) %>%
    left_join(entire.db.countcodes) %>%
    rename("count_samecode" = n) %>%
    mutate(Match = case_when(
      count_samecode > 1 ~ "Yes",
      T ~ "No"))
  where.2 <- df %>%
    filter(Date >= startdate,
           Date <= enddate) %>%
    count(city, uid, name = "UIDs_#_filtered")
  where.3 <- df %>%
    filter(Date >= startdate,
           Date <= enddate) %>%
    distinct(city, uid) %>%
    count(city, name = "UIDs_#_distinct")
  where.4 <- df %>%
    filter(city %in% cityfilter) %>%
    count(city, uid, name = "UIDs_#_all")
  first_half <- left_join(where.1, where.2)
  second_half <- left_join(where.4, where.3)
  full <- left_join(first_half, second_half)
  return(full)
}
# > check_duplicates_filterview.f(mydata, "2018-01-01", "2020-01-01")
# Joining, by = "city"
# Joining, by = "city"
# Joining, by = c("city", "uid")
# # A tibble: 5 x 8
# city code count_samecode Match uid `UIDs_#_filtered` `UIDs_#_all` `UIDs_#_distinct`
# <chr> <chr> <int> <chr> <chr> <int> <int> <int>
# 1 Dubai 101 2 Yes K-4 2 3 1
# 2 London 144 1 No K-3 1 2 1
# 3 New York 123 1 No K-1 1 1 1
# 4 Tokyo 143 1 No K-5 1 1 1
# 5 Utha 103 1 No K-1 1 1 1
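With more than 10M records, the grouped counts can also be computed with data.table if the dplyr version proves too slow; a sketch of the first "Match" step only, mirroring the logic of where.1:
library(data.table)
dt <- as.data.table(mydata)
# count each city/code pair across the whole table, then flag repeated pairs
dt[, count_samecode := .N, by = .(city, code)]
dt[, Match := fifelse(count_samecode > 1L, "Yes", "No")]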

Write JSON with multiple inputs in R

I have a tibble and a list which I would like to write to a JSON file.
# A tibble: 2 x 12
i n c x
<chr> <chr> <chr> <chr>
1 NYC New York City United States LON,271;BOS,201
2 LON London United Kingdom NYC,270
I would like to replace the 'x' column with a list.
When I try to merge by the 'i' column with the elements of the list, a lot of data is duplicated... :/
sample list:
$NYC
d p
1: LON 271
2: BOS 201
$LON
d p
1: NYC 270
I would like to end up with something that looks like this:
[
{
"i": "NYC",
"n": "New York City",
"c": "United States",
"C": "US",
"r": "Northern America",
"F": 66.256,
"L": -166.063,
"b": 94.42,
"s": 0.752,
"q": 4417,
"t": "0,0,0,0,0",
"x": [{
"d": "LON",
"p": 271
},
{
"d": "BOS",
"p": 201
}]
}
...
]
I'm thinking there should be a way to write the JSON file without merging the list and the tibble, or maybe there is a way to merge them in a ragged way?
Ah, I just had another idea: maybe I can convert my dataframe to a list, then use Reduce to combine the lists...
http://www.sharecsv.com/s/2e1dc764430c6fe746d2299f71879c2e/routes-before-split.csv
http://www.sharecsv.com/s/b114e2cc6236bd22b23298035fb7e042/tibble.csv
We may do the following:
tbl
# A tibble: 1 x 13
# X i n c C r F L b s q t x
# <int> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <int> <fct> <fct>
# 1 1 LON London United Kingd… GB Northern Eur… 51.5 -0.127 55.4 1.25 2088 0,0,1,3… AAL,15;AAR,15;A…
require(tidyverse)
tbl$x <- map(tbl$x, ~ strsplit(., ";|,")[[1]] %>%
  {data.frame(d = .[c(T, F)], p = as.numeric(.[c(F, T)]))})
The latter two lines are a shortened version of this base R equivalent:
tbl$x <- lapply(tbl$x, function(r) {
  tmp <- strsplit(r, ";|,")[[1]]
  data.frame(d = tmp[seq(1, length(tmp), 2)],
             p = as.numeric(tmp[seq(2, length(tmp), 2)]))
})
We go over the x column, split its elements by ; and , wherever possible, and then use the fact that the resulting odd elements correspond to the d column in the desired outcome, and the even elements to the p column.
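To see that alternating-index trick in isolation, here is what it does to a single x entry:
tmp <- strsplit("LON,271;BOS,201", ";|,")[[1]]
tmp
#> [1] "LON" "271" "BOS" "201"
tmp[c(TRUE, FALSE)]             # odd positions -> d
#> [1] "LON" "BOS"
as.numeric(tmp[c(FALSE, TRUE)]) # even positions -> p
#> [1] 271 201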
Output:
jsonlite::toJSON(tbl, pretty = TRUE)
[
{
"X": 1,
"i": "LON",
"n": "London",
"c": "United Kingdom",
"C": "GB",
"r": "Northern Europe",
"F": 51.508,
"L": -0.127,
"b": 55.43,
"s": 1.25,
"q": 2088,
"t": "0,0,1,3,1",
"x": [
{
"d": "AAL",
"p": 15
},
{
"d": "AAR",
"p": 15
},
{
"d": "ABZ",
"p": 48
}
]
}
]
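To actually write the result to disk rather than just print it, jsonlite's write_json() does it in one step (the filename here is hypothetical):
jsonlite::write_json(tbl, "routes.json", pretty = TRUE)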
