Get Array out of JSON-structure with tidyjson - r

I'm trying to read an array out of a JSON structure with tidyjson as I'm trying to fasten up my code.
My input data is of the structure
json <- "{\"key1\":\"test\",\"key2\":[\"abc\",\"def\"]}"
I want my output to be a data frame where key1 is one column and key2 is the second column in which all elements of the array are pasted together and separated by ";".
I tried something like
result <- json %>% spread_values(a = jstring("key1"), b = paste0(jstring("key2"), collapse = ";"))
I really have no idea how to get the array out of the JSON in the spread_values function.
I got what I want with
key2 <- json %>% enter_object("key2")
attributes(key2)$JSON %>% unlist() %>% paste0(collapse = ";")
but as I don't have unique keys I can't join it to the rest of my data and I think there must be a better way.

I'm glad you got something working! In case anyone else happens upon this question, there are definitely many ways to accomplish this task!
One is to use tidyjson to gather the data into a tall structure, then summarize:
library(tidyjson)
library(dplyr)
json <- "{\"key1\":\"test\",\"key2\":[\"abc\",\"def\"]}"
myj <- tidyjson::as.tbl_json(json)
myj %>%
# make the data tall
spread_values(key1 = jstring(key1)) %>%
enter_object("key2") %>%
gather_array("idx") %>%
append_values_string("key2") %>%
# now summarize
group_by(key1) %>%
summarize(key2 = paste(key2, collapse = ";"))
#> # A tibble: 1 x 2
#> key1 key2
#> <chr> <chr>
#> 1 test abc;def
Created on 2021-10-29 by the reprex package (v0.3.0)
Another way is to grab the json data directly with json_get_column() and mutate that:
library(tidyjson)
library(dplyr)
json <- "{\"key1\":\"test\",\"key2\":[\"abc\",\"def\"]}"
myj <- tidyjson::as.tbl_json(json)
myj %>%
spread_values(key1 = jstring(key1)) %>%
enter_object("key2") %>%
json_get_column("array") %>%
mutate(key2 = purrr::map_chr(array, ~ paste(.x, collapse = ";"))) %>%
as_tibble() %>% # drop tbl_json structure
select(key1, key2)
#> # A tibble: 1 x 2
#> key1 key2
#> <chr> <chr>
#> 1 test abc;def
Created on 2021-10-29 by the reprex package (v0.3.0)

Related

How to merge duplicate rows in R

I am new to R and very stuck on a problem which I've tried to solve in various ways.
I have data I want to plot to a graph that shows twitter engagements per day.
To do this, I need to merge all the 'created at' rows, so there is only one data per row, and each date has the 'total engagements' assigned to it.
This is the data:
So far, I've tried to do this, but can't seem to get the grouping to work.
I mutated the data to get a new 'total engage' column:
lgbthm_data_2 <- lgbthm_data %>%
mutate(
total_engage = favorite_count + retweet_count
) %>%
Then I've tried to merge the dates:
only_one_date <- lgbthm_data_2 %>%
group_by(created_at) %>%
summarise_all(na.omit)
But no idea!
Any help would be great
Thanks
You are looking for:
library(dplyr)
only_one_date <- lgbthm_data_2 %>%
group_by(created_at) %>%
summarise(n = n())
And there is even a shorthand for this in dplyr:
only_one_date <- lgbthm_data_2 %>%
count(created_at)
group_by + summarise can be used for many things that involve summarising all values in a group to one value, for example the mean, max and min of a column. Here I think you simply want to know how many rows each group has, i.e., how many tweets were created in one day. The special function n() tells you exactly that.
From experience with Twitter, I also know that the column created_at is usually a time, not a date format. In this case, it makes sense to use count(day = as.Date(created_at)) to convert it to a date first.
library(tidyverse)
data <- tribble(
~created_at, ~favorite_count, ~retweet_count,
"2022-02-01", 0, 2,
"2022-02-01", 1, 3,
"2022-02-02", 2, NA
)
summary_data <-
data %>%
type_convert() %>%
group_by(created_at) %>%
summarise(total_engage = sum(favorite_count, retweet_count, na.rm = TRUE))
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> created_at = col_date(format = "")
#> )
summary_data
#> # A tibble: 2 × 2
#> created_at total_engage
#> <date> <dbl>
#> 1 2022-02-01 6
#> 2 2022-02-02 2
qplot(created_at, total_engage, geom = "col", data = summary_data)
Created on 2022-04-04 by the reprex package (v2.0.0)

Purrr add new columns to a data frame that are an output from a map function call

I am working with a data frame (call it full_df) that contains links which I want to use to scrape two further links. This is a sample for the data frame:
structure(list(CIK = c("1082339", "1276755", "1280511"), COMPANY_NAME = c("COLDSTREAM CAPITAL MANAGEMENT INC",
"CHELSEA COUNSEL CO", "QUANTUM CAPITAL MANAGEMENT"), FORM_TYPE = c("13F-HR",
"13F-HR", "13F-HR"), FILE_DATE = c("2020-05-27", "2020-06-12",
"2020-05-26"), FORM_LINK = c("edgar/data/1082339/0001082339-20-000002.txt",
"edgar/data/1276755/0001420506-20-000683.txt", "edgar/data/1280511/0001280511-20-000003.txt"
), QTR_YEAR = c("Q22020", "Q22020", "Q22020"), FULL_LINK = c("https://www.sec.gov/Archives/edgar/data/1082339/0001082339-20-000002-index.htm",
"https://www.sec.gov/Archives/edgar/data/1276755/0001420506-20-000683-index.htm",
"https://www.sec.gov/Archives/edgar/data/1280511/0001280511-20-000003-index.htm"
)), row.names = c(NA, 3L), class = "data.frame")
I would like to iterate over the FULL_LINK column and obtain two further links that I would then want to add to my original data frame as two new columns - xml_link and html_link.
I can get the links using a function that i have written like so (with a single link used an an example here):
library(polite)
library(rvest)
library(glue)
library(tidyverse)
test_link <- "https://www.sec.gov/Archives/edgar/data/1082339/0001082339-20-000002-index.htm"
ua = 'Kartik P (for personal use)'
session <- bow("https://www.sec.gov/",
user_agent = ua)
xml_scraper <- function(urll) {
print(glue("Scraping: {urll}"))
temp_link <- session %>%
nod(urll) %>%
scrape(verbose = FALSE) %>%
html_nodes("a") %>%
html_attr('href')
xml_link <- temp_link %>%
nth(12)
html_link <- temp_link %>%
nth(11)
return(data.frame(xml_link, html_link))
}
Great! this works as expected and returns a data frame with two columns that I want
xml_scraper(test_link)
Scraping: https://www.sec.gov/Archives/edgar/data/1082339/0001082339-20-000002-index.htm
xml_link
1 /Archives/edgar/data/1082339/000108233920000002/CCMI13F2020Q1.xml
html_link
1 /Archives/edgar/data/1082339/000108233920000002/xslForm13F_X01/CCMI13F2020Q1.xml
However, what I would like to do is to iterate over each element of the FULL_LINK column in the full_df and add the two new links as elements of newly created xml_link and html_link column in the original data frame. It feels like this should be doable with purr::map_dfr and a bind_cols call or mutating two names variables simultaneously, but I am unable to figure out the syntax.
Would appreciate any suggestions on how to get this to work with dplyr and purrr.
Thanks in advance.
Maybe:
df_new <- bind_cols(map_dfr(df$FULL_LINK, xml_scraper), df)
Result:
#> # A tibble: 3 × 9
#> xml_link html_link CIK COMPANY_NAME FORM_TYPE FILE_DATE FORM_LINK QTR_YEAR
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 /Archive… /Archives… 1082… COLDSTREAM … 13F-HR 2020-05-… edgar/da… Q22020
#> 2 /Archive… /Archives… 1276… CHELSEA COU… 13F-HR 2020-06-… edgar/da… Q22020
#> 3 /Archive… /Archives… 1280… QUANTUM CAP… 13F-HR 2020-05-… edgar/da… Q22020
#> # … with 1 more variable: FULL_LINK <chr>
Created on 2022-01-01 by the reprex package (v2.0.1)
You can just mutate the data set using your xml_scraper function. You need do the mutate "rowwise", since your function isn't vectorized.
data_full<-data %>%
rowwise() %>%
mutate(xml_link=xml_scraper(FULL_LINK) %>% pluck("xml_link"),
html_link=xml_scraper(FULL_LINK) %>% pluck("html_link"))
#If you want just the results of the scrape, you can use map
the_xml<-data %>%
split(1:nrow(.)) %>%
map(~pluck(.x$"FULL_LINK")) %>%
map(xml_scraper) %>%
bind_rows()
You can edit your function to output also FULL_LINK and use it to join the 2 new columns to your original data
xml_scraper <- function(urll) {
print(glue("Scraping: {urll}"))
temp_link <- session %>%
nod(urll) %>%
scrape(verbose = FALSE) %>%
html_nodes("a") %>%
html_attr('href')
xml_link <- temp_link %>%
nth(12)
html_link <- temp_link %>%
nth(11)
return(data.frame(FULL_LINK = urll, xml_link, html_link))
}
Then
data2 <- map_dfr(data$FULL_LINK, .f = xml_scrapper) %>%
left_join(data, ., by = "FULL_LINK")

String with values mapped from other data frame in R

I would like to make a string basing on ids from other columns where the real value sits in a dictionary.
Ideally, this would look like:
library(tidyverse)
region_dict <- tibble(
id = c("reg_id1", "reg_id2", "reg_id3"),
name = c("reg_1", "reg_2", "reg_3")
)
color_dict <- tibble(
id = c("col_id1", "col_id2", "col_id3"),
name = c("col_1", "col_2", "col_3")
)
tibble(
region = c("reg_id1", "reg_id2", "reg_id3"),
color = c("col_id1", "col_id2", "col_id3"),
my_string = str_c(
"xxx"_,
region_name,
"_",
color_name
))
#> # A tibble: 3 x 3
#> region color my_string
#> <chr> <chr> <chr>
#> 1 reg_id1 col_id1 xxx_reg_1_col_1
#> 2 reg_id2 col_id2 xxx_reg_2_col_2
#> 3 reg_id3 col_id3 xxx_reg_3_col_3
Created on 2021-03-01 by the reprex package (v0.3.0)
I know of dplyr's recode() function but I can't think of a way to use it the way I want.
I also thought about first using left_join() and then concatenating the string from the new columns. This is what would work but doesn't seem pretty to me as I would get columns that I'd need to remove later. In the real dataset I have 5 variables.
I'll be glad to read your ideas.
This may also be solved with a fuzzyjoin, but based on the similarity in substring, it would make sense to remove the prefix substring from the 'id' columns of each data and do a left_join, then create the 'my_string' by pasteing the columns together
library(stringr)
library(dplyr)
region_dict %>%
mutate(id1 = str_remove(id, '.*_')) %>%
left_join(color_dict %>%
mutate(id1 = str_remove(id, '.*_')), by = 'id1') %>%
transmute(region = id.x, color = id.y,
my_string = str_c('xxx_', name.x, '_', name.y))
-output
# A tibble: 3 x 3
# region color my_string
# <chr> <chr> <chr>
#1 reg_id1 col_id1 xxx_reg_1_col_1
#2 reg_id2 col_id2 xxx_reg_2_col_2
#3 reg_id3 col_id3 xxx_reg_3_col_3

fixing incompatible types error in R using dplyr/mutate

I'm trying to use the tidyverse/dplyr package in R to work with data including vectorized calls to an online API (from Altmetric) to add rows using mutate.
The smallest code I can create that reproduces the error is that below. I get the error "Error: incompatible types, expecting a numeric vector"
library(tidyverse)
library(jsonlite)
fromJSON_wrapper <- function(x,y) {
fromJSON(x)[[c(y)]]
}
toy <- tibble(
doi = c("10.1002/anie.201500251", "10.1080/19443994.2015.1005695", "10.1007/s13721-015-0095-0"),
url = c("https://api.altmetric.com/v1/doi/10.1002/anie.201500251", "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695", "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695")
)
extracted <- toy %>% rowwise() %>% mutate(score = fromJSON_wrapper(url,"score"))
The code for extracting a single score below works, whether just using the wrapper or on a one row tibble and I'm not sure why my code isn't working.
fromJSON_wrapper("https://api.altmetric.com/v1/doi/10.1007/s13721-015-0095-0")
extracted <- toy[1,] %>% rowwise() %>% mutate(score = fromJSON_wrapper(url, "score"))
Any suggestions would be appreciated.
It's simpler to just iterate over the vector of URLs and extract what you need. purrr::map_dbl makes this simple, though sapply would work fine, too.
library(tidyverse)
toy <- tibble(
doi = c("10.1002/anie.201500251", "10.1080/19443994.2015.1005695", "10.1007/s13721-015-0095-0"),
url = c("https://api.altmetric.com/v1/doi/10.1002/anie.201500251", "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695", "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695")
)
extracted <- toy %>% mutate(score = map_dbl(url, ~jsonlite::fromJSON(.x)$score))
extracted %>% select(doi, score)
#> # A tibble: 3 × 2
#> doi score
#> <chr> <dbl>
#> 1 10.1002/anie.201500251 0.25
#> 2 10.1080/19443994.2015.1005695 1.00
#> 3 10.1007/s13721-015-0095-0 1.00

Function to Generate Multiple htmlTables using Purrr::map

library(htmlTable)
library(tidyverse)
library(ggmosaic) for "happy" dataset
I want to create a function that creates frequency tables for all the categorical variables in a dataset and then generate htmlTables for each one. However, by using purrr::map, the tables are in a list. How do I generate the tables using htmlTable? Or any better package that generates similar tables for publication? I suppose I need to split the list or use additional purrr::map functions? Help would be appreciated...
Something like this...
FUN<-function(data){
TAB<-happy%>%select_if(is.factor)%>%
map(table)
TABLES<-htmlTable(TAB)
return(TABLES)
}
Here's a solution that uses a tibble to store the arguments to the function as well as the resulting HTML strings:
Edit: added new column (percent)
library(ggmosaic)
library(purrr)
library(tidyverse)
library(htmlTable)
library(magrittr)
library(scales)
data(happy)
# Use a subset of `happy` for the example
h <- happy %>% as_tibble %>% sample_n(100)
# create the function
make_html_table <- function(data, .name, .col_names) {
data %>%
table %>%
as.data.frame %>%
set_colnames(.col_names) %>%
as.data.frame %>%
mutate(percent = scales::percent(count/sum(count))) %>% # add the percent column
htmlTable(caption = .name)
}
# Apply the function and store the results in a tibble
tbl <-
h %>%
select_if(is.factor) %>%
{ tibble(NAME = names(.),
data = map(., ~.x)) } %>%
mutate(TABLE = map2(.x = data,
.y = NAME,
.f = make_html_table,
.col_names = c("levels", "count")))
# Check out the tables in the Viewer Pane (if you're using RStudio)
tbl %>% extract2("TABLE") %>% map(htmlTableWidget)
#> $happy
#>
#> $sex
#>
#> $marital
#>
#> $degree
#>
#> $finrela
#>
#> $health
Here's a screenshot of the one of the tables this creates:

Resources