Convert a JSON list to a data frame in R

I have a json file as follows:
{
"1234":{"Messages":{"1":{"Content":["How are you","today"]},"2":{"Content":["I am great"]}}},
"2344":{"Messages":{"1":{"Content":["It's a plan"]}}}}
I am trying to convert this content to this data frame:
df <- data.frame(id=c(1234,2344), Content1=c("How are you today","It's a plan"), Content2=c("I am great", ""))
I have tried a few things with jsonlite and pluck, but I am stuck on the iterative part of the code.
Any advice appreciated.
Thank you.

We could read the .json into a list with fromJSON, extract the 'Content' elements with rrapply, and convert the result to a data.frame:
library(jsonlite)
library(rrapply)
library(dplyr)
library(tidyr)
lst1 <- fromJSON("file1.json")
rrapply(lst1, condition = function(x, .xname)
.xname == 'Content', how = "melt") %>%
select(-L2) %>%
unite(L4, L4, L3, sep = "") %>%
unnest(value) %>%
pivot_wider(names_from = L4, values_from = value, values_fn = toString)
-output
# A tibble: 2 × 3
L1 Content1 Content2
<chr> <chr> <chr>
1 1234 How are you, today I am great
2 2344 It's a plan <NA>

This is admittedly a hack (L is assumed here to be the list parsed from the JSON, e.g. with jsonlite::fromJSON):
data.table::rbindlist(
lapply(rapply(L, function(z) paste(z, collapse = " "), how = "replace"),
function(z) as.data.frame(z$Messages)),
fill = TRUE, idcol = "id")
# id Content Content.1
# <char> <char> <char>
# 1: 1234 How are you today I am great
# 2: 2344 It's a plan <NA>
It also works with dplyr if you prefer:
dplyr::bind_rows(
lapply(rapply(L, function(z) paste(z, collapse = " "), how = "replace"),
function(z) as.data.frame(z$Messages)),
.id = "id")
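For comparison, here is a minimal base R sketch of the same idea: collapse each Content vector to one string, then pad ids with fewer messages using "". The list L below is built by hand as a stand-in for the parsed file (an assumption about what jsonlite::fromJSON returns for the sample), and the names n_msg and rows are only illustrative.

```r
# Hand-built stand-in for the parsed JSON (assumption: this mirrors what
# jsonlite::fromJSON() returns for the sample file in the question).
L <- list(
  "1234" = list(Messages = list(
    "1" = list(Content = c("How are you", "today")),
    "2" = list(Content = "I am great"))),
  "2344" = list(Messages = list(
    "1" = list(Content = "It's a plan")))
)

# Largest number of messages under any id.
n_msg <- max(vapply(L, function(x) length(x$Messages), integer(1)))

rows <- lapply(L, function(x) {
  # Collapse each message's Content vector into a single string.
  txt <- vapply(x$Messages, function(m) paste(m$Content, collapse = " "),
                character(1))
  length(txt) <- n_msg      # pad with NA up to n_msg entries
  txt[is.na(txt)] <- ""     # the question asks for "" rather than NA
  setNames(unname(txt), paste0("Content", seq_len(n_msg)))
})

df <- data.frame(id = names(L), do.call(rbind, rows),
                 row.names = NULL, stringsAsFactors = FALSE)
df
```

Note that id stays a character column here; wrap it in as.numeric() if numeric ids are needed.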

Within tidyverse we could pivot_longer followed by unnest_wider:
library(tidyr)
library(tibble)
library(jsonlite)
jsontext |>
fromJSON() |>
as_tibble() |>
pivot_longer(everything()) |>
unnest_wider(value, transform = ~paste(unlist(.), collapse = " "), names_sep = "_")
Output
# A tibble: 2 × 3
name value_1 value_2
<chr> <chr> <chr>
1 1234 How are you today "I am great"
2 2344 Its a plan ""

Related

Search elements of a single character string in a dataframe column to subset it

I have two dataframes:
set.seed(1)
df1 <- data.frame(k1 = "AFD(1);Acf(2);Vgr7(2);"
,k2 = "ABC(7);BHG(46);TFG(675);")
df2 <- data.frame(site =c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
"XX(275);", "ABC(7);", "ABC(9);")
,p1 = rnorm(6, mean = 5, sd = 2)
,p2 = rnorm(6, mean = 6.5, sd = 2))
The first dataframe is in fact a list of often very long strings, made of "elements". Each "element" consists of a few letters/numbers, followed by a number in brackets, followed by a semicolon. In this example I only put 3 "elements" into each string, but in my real dataframe there are tens to hundreds of them.
> df1
k1 k2
1 AFD(1);Acf(2);Vgr7(2); ABC(7);BHG(46);TFG(675);
The second dataframe shares some of the "elements" with df1. Its first column, called site, contains some (not all) "elements" from the first dataframe. Sometimes the "element" forms the whole string, and sometimes it is part of a longer string:
> df2
site p1 p2
1 AFD(1);AFD(2); 4.043700 3.745881
2 Acf(2); 5.835883 5.670011
3 TFG(677); 7.717359 5.711420
4 XX(275); 4.794425 6.381373
5 ABC(7); 5.775343 8.700051
6 ABC(9); 4.892390 8.026351
I would like to filter the whole df2 using df2$site and each k column from df1 (there are many k columns, and not all of them contain k in their names).
The easiest way to explain this is to show what the desired output would look like.
> outcome
k site p1 p2
1 k1 AFD(1);AFD(2); 4.043700 3.745881
2 k1 Acf(2); 5.835883 5.670011
3 k2 ABC(7); 5.775343 8.700051
The first column of the outcome dataframe corresponds to the column names in df1. The second column corresponds to the site column of df2 and contains only sites from df1 columns that were found in df2$site. The other columns are from df2.
I appreciate that this question is made of two separate "problems", one grepping-related and one related to looping through df1 columns. I decided to show the task in its entirety in case there exists a solution that addresses both in one go.
FAILED SOLUTION 1
I can create a string to grep, but for each column separately:
# this replaces the semicolons with "|", but does not escape the brackets.
k1_pattern <- df1 %>%
select(k1) %>%
deframe() %>%
str_replace_all(";","|")
And then I am not sure how to use it. This (below) didn't work, maybe because I didn't escape brackets, but I am struggling with doing it:
k1_result <- df2 %>%
filter(grepl(pattern = k1_pattern, site))
But even if it did work, it would only deal with a single column from df1, and I have many, and would like to perform this operation on all df1 columns at the same time.
FAILED SOLUTION 2
I can create a list of sites to search in df2 from columns in df1:
k1_sites<- df1 %>%
select(k1) %>%
deframe() %>%
strsplit(., "[;]") %>%
unlist()
but the delimiter is lost here, and %in% cannot be used, as the match will sometimes be partial.
library(dplyr)
df2 %>%
mutate(site_list = strsplit(site, ";")) %>%
rowwise() %>%
filter(length(intersect(site_list,
unlist(strsplit(x = paste0(c(t(df1)), collapse=""),
split = ";")))) != 0) %>%
select(-site_list)
#> # A tibble: 3 x 3
#> # Rowwise:
#> site p1 p2
#> <chr> <dbl> <dbl>
#> 1 AFD(1);AFD(2); 3.75 7.47
#> 2 Acf(2); 5.37 7.98
#> 3 ABC(7); 5.66 9.52
Updated answer:
library(dplyr)
library(tidyr)
df1 %>%
tibble::rownames_to_column("id") %>%
pivot_longer(-id, names_to = "k", values_to = "site") %>%
separate_rows(site, sep = ";") %>%
filter(site != "") %>%
select(-id) -> df1_k
df2 %>%
tibble::rownames_to_column("id") %>%
separate_rows(site, sep = ";") %>%
full_join(., df1_k, by = c("site")) %>%
group_by(id) %>%
fill(k, .direction = "downup") %>%
filter(!is.na(id) & !is.na(k)) %>%
summarise(k = first(k),
site = paste0(site, collapse = ";"),
p1 = first(p1),
p2 = first(p2), .groups = "drop") %>%
select(-id)
#> # A tibble: 3 x 4
#> k site p1 p2
#> <chr> <chr> <dbl> <dbl>
#> 1 k1 AFD(1);AFD(2); 3.75 7.47
#> 2 k1 Acf(2); 5.37 7.98
#> 3 k2 ABC(7); 5.66 9.52
Here's a way going to a long format for exact matching (so no regex):
library(dplyr)
library(tidyr)
df1_long = df1 |> stack() |>
separate_rows(values, sep = ";") |>
filter(values != "")
df2 |>
mutate(id = row_number()) |>
separate_rows(site, sep = ";") |>
filter(site != "") |>
left_join(df1_long, by = c("site" = "values")) %>%
group_by(id) |>
filter(any(!is.na(ind))) %>%
summarize(
site = paste(site, collapse = ";"),
across(-site, \(x) first(na.omit(x)))
)
# # A tibble: 3 × 5
# id site p1 p2 ind
# <int> <chr> <dbl> <dbl> <fct>
# 1 1 AFD(1);AFD(2) 3.75 7.47 k1
# 2 2 Acf(2) 5.37 7.98 k1
# 3 5 ABC(7) 5.66 9.52 k2
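The exact-matching idea can also be sketched in base R, which sidesteps the bracket-escaping problem from the question because the strings are split on ";" and compared with %in% rather than used as regular expressions. The helper name split_elems is hypothetical; df1 and df2 are the example data from the question.

```r
set.seed(1)
df1 <- data.frame(k1 = "AFD(1);Acf(2);Vgr7(2);",
                  k2 = "ABC(7);BHG(46);TFG(675);",
                  stringsAsFactors = FALSE)
df2 <- data.frame(site = c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
                           "XX(275);", "ABC(7);", "ABC(9);"),
                  p1 = rnorm(6, mean = 5, sd = 2),
                  p2 = rnorm(6, mean = 6.5, sd = 2),
                  stringsAsFactors = FALSE)

# Split one string into its elements; fixed = TRUE treats ";" literally.
split_elems <- function(s) strsplit(s, ";", fixed = TRUE)[[1]]

site_elems <- lapply(df2$site, split_elems)

# For every column of df1, keep the df2 rows sharing at least one element.
hits <- lapply(names(df1), function(k) {
  wanted <- split_elems(df1[[k]])
  keep <- vapply(site_elems, function(e) any(e %in% wanted), logical(1))
  data.frame(k = rep(k, sum(keep)), df2[keep, , drop = FALSE],
             stringsAsFactors = FALSE)
})

outcome <- do.call(rbind, hits)
outcome
```

Because whole elements are compared, "TFG(677);" does not match "TFG(675)" and "ABC(9);" does not match "ABC(7)", which is the behaviour the desired output implies.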

How to sort a concatenated string in a column in R?

I have a data frame with a concatenated column that I want to order numerically by the number after the underscore:
df <- data.frame(Order = c("A23_2-A27_3-A40_4-A10_1", "A25_2-A21_3-A11_1", "A9_1", "A33_2-A8_1"))
and I want a result like this:
df <- data.frame(Order = c("A10A23A27A40", "A11A25A21", "A9", "A8A33"))
I tried a couple of things with tidyverse but couldn't get a clean result.
library(tidyverse)
df %>%
rowid_to_column() %>%
separate_rows(Order, sep='-') %>%
separate(Order, c('Order', 'v'), convert = TRUE) %>%
arrange(v)%>%
group_by(rowid) %>%
summarise(Order = str_c(Order, collapse = ''))
# A tibble: 4 x 2
rowid Order
<int> <chr>
1 1 A10A23A27A40
2 2 A11A25A21
3 3 A9
4 4 A8A33
Another base R approach:
df$Order <- sapply(strsplit(df$Order, '-'), function(x) {
spl <- strsplit(x, '_') # split by '_'
spl <- do.call(rbind, spl) # create a 2-column matrix
ord <- order(as.numeric(spl[, 2])) # order of numeric parts
paste(spl[ord, 1], collapse='') # concatenate in correct order
})
Here is a base R option:
df$Order <-
sapply(strsplit(df$Order, "-"), function(x)
paste0(gsub("\\_.*", "", x[order(as.numeric(sub("^[^_]*_", "", x)))]), collapse = ""))
Output
Order
1 A10A23A27A40
2 A11A25A21
3 A9
4 A8A33
Or a tidyverse option:
library(tidyverse)
df %>%
mutate(Order = map_chr(str_split(Order, "-"), ~
str_c(
str_replace_all(.x[order(as.numeric(str_replace_all(.x, "^[^_]*_", "")))], "\\_.*", ""), collapse = ""
)))
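To check an approach against the expected result from the question, the per-string logic can be pulled into a small helper. This is only a sketch: sort_by_suffix is a hypothetical name, and it assumes every piece has the form name_number.

```r
# Sort the "-"-separated pieces of one string by the number after "_",
# then drop the "_number" suffix and paste the names together.
sort_by_suffix <- function(s) {
  parts <- strsplit(s, "-", fixed = TRUE)[[1]]
  num <- as.numeric(sub("^.*_", "", parts))   # numeric part after "_"
  paste(sub("_.*$", "", parts[order(num)]), collapse = "")
}

orders <- c("A23_2-A27_3-A40_4-A10_1", "A25_2-A21_3-A11_1", "A9_1", "A33_2-A8_1")
result <- vapply(orders, sort_by_suffix, character(1), USE.NAMES = FALSE)

# Compare with the expected output given in the question.
expected <- c("A10A23A27A40", "A11A25A21", "A9", "A8A33")
identical(result, expected)
```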

Summarizing by multiple groups in R

I'm trying to summarize a dataset based on "station" and "depth bin" with total counts of family for each. This is how the dataset looks:
The end result should look like this:
...
Using dplyr,
Data
df <- read.table(text = "Family Station 'Total Count' 'Depth Bin'
Macrouridae 1504-04 1 2500-2550
Ophidiidae 1504-04 1 3500-3550
Synaphobranchidae 1504-05 2 3000-3050", header= TRUE)
Code
library(dplyr)
library(tidyr)
df %>%
group_by(Family,Station, Depth.Bin) %>%
summarise(n = sum(Total.Count)) %>%
mutate(newcol = paste(Station, Depth.Bin, sep = ":")) %>%
ungroup() %>%
select(Family, n, newcol) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = newcol, values_from = n) %>%
select(-row)
Family `1504-04:2500-2550` `1504-04:3500-3550` `1504-05:3000-3050`
<chr> <int> <int> <int>
1 Macrouridae 1 NA NA
2 Ophidiidae NA 1 NA
3 Synaphobranchidae NA NA 2
Base-R version, with tapply (I changed some of your variable names to avoid spaces):
dd <- read.table(header = TRUE, text = "
Family Station Total_Count Depth_Bin
Macrouridae 1504-04 1 2500-2550
Ophidiidae 1504-04 1 3500-3550
Synaphobranchidae 1504-05 2 3000-3050
")
with(dd, tapply(
Total_Count,
list(Family, interaction(Station, Depth_Bin, sep = ":")),
FUN = sum))
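If a long table of totals is enough (one row per Family, Station and Depth_Bin combination), a sketch with base R's aggregate() could look like this. It reuses the renamed columns from the tapply example; totals is just an illustrative name.

```r
dd <- read.table(header = TRUE, text = "
Family Station Total_Count Depth_Bin
Macrouridae 1504-04 1 2500-2550
Ophidiidae 1504-04 1 3500-3550
Synaphobranchidae 1504-05 2 3000-3050
")

# Sum Total_Count within every Family / Station / Depth_Bin combination.
totals <- aggregate(Total_Count ~ Family + Station + Depth_Bin,
                    data = dd, FUN = sum)
totals
```

Unlike the tapply version, this keeps the result in long form, so combinations that never occur are simply absent rather than shown as NA.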

R: How do I c() nested character vectors grouped by another column?

I have strings containing enumerations of words grouped under word type. The example below only has one type for simplicity's sake.
ka = tibble(
words = c('apple, orange', 'pear, apple, plum'),
type = 'fruit'
)
I want to find out the number of UNIQUE words per type.
I figured I would split the character vectors,
ka = ka %>%
mutate(
word_list = str_split(words, ', ')
)
and then bind the columns per group. The end result would be
c(
ka$word_list[[1]],
ka$word_list[[2]]
)
Then I can unique these vectors and get their length.
I don't know how to bind columns together, grouped by a separate column. I could do this with an ugly loop within a loop, but there must be a map/apply solution as well, following the logic of:
ka %>%
group_by(type) %>%
summarise(
biglist = map(word_list, ~ c(.)), # this doesn't work, obviously
biglist_unique = map(biglist, ~ unique(.)),
biglist_length = map(biglist_unique, ~ length(.))
)
Here is an option for you. First we collapse the vectors, then we map out what you're looking for. Note that we have to trim off the whitespace to get the proper unique words.
library(tidyverse)
ka %>%
group_by(type) %>%
summarise(all_words = paste(words, collapse = ",")) %>%
mutate(biglist = str_split(all_words, ",") %>% map(., ~str_trim(.x, "both")),
biglist_unique = map(biglist, unique),
biglist_length = map_dbl(biglist_unique, length))
#> # A tibble: 1 x 5
#> type all_words biglist biglist_unique biglist_length
#> <chr> <chr> <list> <list> <dbl>
#> 1 fruit apple, orange,pear, apple, plum <chr [5]> <chr [4]> 4
Another option would be to use tidy data principles and the tidyr package.
ka = ka %>%
mutate(
word_list = str_split(words, ', ')
)
ka %>%
# If you need to maintain information about each row you can create an index
# mutate(index = row_number()) %>%
# unnest the wordlist to get one word per row
unnest(word_list) %>%
# Only keep unique words per group
group_by(type) %>%
distinct(word_list, .keep_all = FALSE) %>% # if you need to maintain row info .keep_all = TRUE
summarise(n_unique = n())
# A tibble: 1 x 2
# type n_unique
# <chr> <int>
# 1 fruit 4
Here's a way you can do using separate_rows:
ka %>%
separate_rows(words, sep = ', ') %>%
group_by(type) %>%
summarise(word_c = n_distinct(words))
Something like this:
library(tidyverse)
ka %>%
mutate(words = strsplit(as.character(words), ",")) %>%
unnest(words) %>%
mutate(words = gsub(" ","",words)) %>%
group_by(type) %>%
summarise(number = n_distinct(words),
words = paste0(unique(words), collapse =' '))
# A tibble: 1 x 3
type number words
<chr> <int> <chr>
1 fruit 4 apple orange pear plum
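The same count can be sketched in base R without any packages: split the strings per type, trim the whitespace, and count the unique words. The data frame below recreates the example from the question; n_unique is an illustrative name.

```r
ka <- data.frame(words = c("apple, orange", "pear, apple, plum"),
                 type = "fruit",
                 stringsAsFactors = FALSE)

# For each type: split all strings on ",", trim spaces, count unique words.
n_unique <- vapply(split(ka$words, ka$type), function(w) {
  length(unique(trimws(unlist(strsplit(w, ",", fixed = TRUE)))))
}, integer(1))

n_unique
```

The result is a named integer vector with one entry per type.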

Iterate over column names and separate fields recursively with dplyr

I want to iterate over the column names of a data frame and then, using dplyr, split the fields on a delimiter (->) found in the row values. This is what the dataset looks like:
dput(df)
structure(list(v1 = c("Silva->Mark", "Brandon->Livo", "Mango->Apple"),
v2 = c("Austin", "NA ", "Orange"),
v3 = c("James -> Jacy","NA->Jane", "apple -> Orange")),
class = "data.frame", row.names = c(NA, -3L))
I wrote code that selects the columns whose rows contain the delimiter (->), which are columns v1 and v3. Here is the code:
rows_true <- apply(df,2,function(x) any(sapply(x,function(y)grepl("->",y))))
ss<-df[,rows_true]
Then I tried to loop through those column names and split them on the delimiter with the code below, but it isn't working:
cols<- names(df)
if (names %in% df){
splitcols <- ss %>%
tidyr::separate(cols, into = c(paste0(names,+ "old"), "paste0(names,+ "New")"), sep = "->")
}
The reason I am using paste0 is that I want each column split in two on the delimiter, with the new columns named after the original column plus the suffix Old for the first part and New for the second.
The end result after looping through the column names and separating them should look like this:
dput(df)
structure(list(v1_Old = c("Silva", "Brandon", "Mango"),
v1_New = c("Mark", "Livo", "Apple"),
v3_Old = c("James","NA", "apple"),
v3_New = c("Jacy","Jane", "Orange")),
class = "data.frame", row.names = c(NA, -3L))
For the sake of completeness, here is also a solution which uses data.table().
There are some differences to the other answers posted so far:
It is not required to identify the columns to be split beforehand. Instead, columns without "->" are dropped from the result on the fly.
The regular expression used for splitting, " *-> *", includes surrounding white space (if any). This avoids calling trimws() on the resulting pieces afterwards or removing white space beforehand.
library(data.table)
library(magrittr) # piping used to improve readability
setDT(df)
lapply(names(df), function(x) {
mDT <- df[, tstrsplit(get(x), " *-> *")]
if (ncol(mDT) == 2L) setnames(mDT, paste0(x, c("_Old", "_New")))
}) %>% as.data.table()
v1_Old v1_New v3_Old v3_New
1: Silva Mark James Jacy
2: Brandon Livo NA Jane
3: Mango Apple apple Orange
One possibility involving dplyr and tidyr could be:
df %>%
select(v1, v3) %>%
rowid_to_column() %>%
gather(var, val, -rowid) %>%
separate_rows(val, sep = "->", convert = TRUE) %>%
group_by(rowid) %>%
mutate(val = trimws(val),
var = make.unique(var)) %>%
ungroup() %>%
spread(var, val) %>%
select(-rowid)
v1 v1.1 v3 v3.1
<chr> <chr> <chr> <chr>
1 Silva Mark James Jacy
2 Brandon Livo <NA> Jane
3 Mango Apple apple Orange
Or to further match the expected output:
df %>%
select(v1, v3) %>%
rowid_to_column() %>%
gather(var, val, -rowid) %>%
separate_rows(val, sep = "->", convert = TRUE) %>%
group_by(rowid, var) %>%
mutate(val = trimws(val),
var2 = if_else(row_number() == 1, paste0(var, "_old"), paste0(var, "_new"))) %>%
ungroup() %>%
select(-var) %>%
spread(var2, val) %>%
select(-rowid)
v1_new v1_old v3_new v3_old
<chr> <chr> <chr> <chr>
1 Mark Silva Jacy James
2 Livo Brandon Jane <NA>
3 Apple Mango Orange apple
A different approach with dplyr, purrr, and stringr is the following.
library(dplyr)
library(purrr)
library(stringr)
# Detect the columns with at least one "->"
my_df_cols <- map_lgl(my_df, ~ any(str_detect(., "->")))
my_df %>%
# Select only the columns with at least one "->"
select(which(my_df_cols)) %>%
# Mutate these columns and only keep the mutated columns with new names
transmute_all(list(old = ~ str_split(., "->", simplify = TRUE)[, 1],
new = ~ str_split(., "->", simplify = TRUE)[, 2]))
# v1_old v3_old v1_new v3_new
# 1 Silva James Mark Jacy
# 2 Brandon NA Livo Jane
# 3 Mango apple Apple Orange
We can also use cSplit from splitstackshape
#Detect columns with "->"
cols <- names(df)[colSums(sapply(df, grepl, pattern = "->")) > 0]
#Remove unwanted whitespaces before and after "->"
df[cols] <- lapply(df[cols], function(x) gsub("\\s*->\\s*", "->", x))
#Split into new columns specifying sep as "->"
splitstackshape::cSplit(df[cols], cols, sep = "->")
# v1_1 v1_2 v3_1 v3_2
#1: Silva Mark James Jacy
#2: Brandon Livo <NA> Jane
#3: Mango Apple apple Orange
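For reference, the whole task (detect the columns, split them, name the pieces with _Old and _New) can also be sketched in base R. This assumes every value in a selected column contains exactly one "->"; the names has_sep, pieces and result are illustrative only.

```r
df <- data.frame(v1 = c("Silva->Mark", "Brandon->Livo", "Mango->Apple"),
                 v2 = c("Austin", "NA ", "Orange"),
                 v3 = c("James -> Jacy", "NA->Jane", "apple -> Orange"),
                 stringsAsFactors = FALSE)

# Columns in which at least one value contains the delimiter.
has_sep <- vapply(df, function(col) any(grepl("->", col, fixed = TRUE)),
                  logical(1))

pieces <- lapply(names(df)[has_sep], function(nm) {
  # Split on "->" together with any surrounding spaces.
  # Assumption: exactly one "->" per value, so every split has two parts.
  parts <- do.call(rbind, strsplit(df[[nm]], " *-> *"))
  setNames(as.data.frame(parts, stringsAsFactors = FALSE),
           paste0(nm, c("_Old", "_New")))
})

result <- do.call(cbind, pieces)
result
```

Here "NA" is kept as a literal string, as in the expected output from the question, and the columns come out in the order v1_Old, v1_New, v3_Old, v3_New.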
