load JSON data into a dataframe - r

I am a beginner working with R and especially JSON files, and this is probably a simple question but I have been unsuccessful for a while.
Here is a sample row of data from a provided text file (there are ~4000 rows):
{"040070005001":4,"040070005003":4,"040138101003":4,"040130718024":4}
Each row has a variable number of key/value pairs in its string.
I am trying to use a loop, but it only captures the last row of the data set rather than the data from each row.
for (row in 1:nrow(origins)) {
  json <- origins$home_cbgs[row] %>%  # overwritten on every iteration
    fromJSON() %>%
    unlist() %>%
    as.data.frame() %>%
    rownames_to_column() %>%
    rename(
      origin_census_block_group = "rowname",
      origin_visitors = "."
    )
}
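The loop never accumulates anything: json is overwritten on every pass, so only the last row survives. A minimal sketch of one way to collect every row instead, assuming origins$home_cbgs holds one JSON object per row as in the sample above (map_dfr stacks the per-row data frames; the row_id column is a hypothetical addition to track the source row):

library(jsonlite)
library(tidyverse)

parsed <- map_dfr(seq_len(nrow(origins)), function(row) {
  origins$home_cbgs[row] %>%
    fromJSON() %>%
    unlist() %>%
    enframe(
      name = "origin_census_block_group",
      value = "origin_visitors"
    ) %>%
    mutate(row_id = row)  # remember which input row each pair came from
})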

Related

tidyjson: Replace existing column values in dataframe with output of spread_values

I have a data frame that has a column with JSON values. I found the library "tidyjson", which helps to extract this JSON. However, it is always extracted into a new data frame.
I am looking for a way to replace the JSON in the original data frame with the result of tidyjson.
Code:
mydf <- df$response %>% as.tbl_json %>% gather_array %>%
spread_values(text=jstring('text'))
Is there a way to replace "df$response" with the extracted JSON "text" value?
Thanks in advance!
This solution worked for me:
df %>% as.tbl_json(json.column = 'response') %>% gather_array %>%
spread_values(response=jstring('text'))
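As a quick check, a toy example (hypothetical data, assuming each response value is a JSON array of objects with a text field):

library(tidyjson)
library(dplyr)

df <- tibble(
  id = 1:2,
  response = c('[{"text":"hello"}]', '[{"text":"world"}]')
)

df %>%
  as.tbl_json(json.column = "response") %>%
  gather_array() %>%
  spread_values(response = jstring("text"))
# the response column now holds "hello" and "world" instead of raw JSON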

How to save the summarised data into a dataframe

I'm currently using the group_by() and summarise() functions to get the sum of my columns. Is it possible to save that information to another data frame somehow? Maybe even create a csv file with its information. Thanks
workday %>%
  group_by(Date) %>%
  mutate_if(is.character, as.numeric) %>%
  summarise(across(Axis1:New_Sitting, sum))
Store the pipe result in a new object, say a:
a <- workday %>% group_by(Date) # ... then the other ops on workday
To save it to a file there are several options, including the base write.csv:
write.csv(a, "Path to file")
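Putting it together with the pipeline from the question (the column range Axis1:New_Sitting comes from the question; the output file name is just an example):

a <- workday %>%
  group_by(Date) %>%
  mutate_if(is.character, as.numeric) %>%
  summarise(across(Axis1:New_Sitting, sum))

# write the summarised data frame to disk
write.csv(a, "workday_sums.csv", row.names = FALSE)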

How to check in R if the name of a list element contains "this text" and pass to the next element in a for loop?

I'm new to R and have a large list of 30 elements, each of which is a dataframe with a few hundred rows and around 20 columns (this varies between dataframes). Each dataframe is named after the original .csv filename (for example "experiment data XYZ QWERTY 01"). How can I check through the whole list and only filter those dataframes that don't have specific text in their filename, AND add a unique id column to those filtered dataframes (the id value would be the first three characters of the filename)? For example, all the elements/dataframes/files in the list which include "XYZ QWERTY" as part of their name won't be filtered and don't need a unique id. I had this pseudo-style code:
for (i in 1:length(list_of_dataframes)) {
  if (list_of_dataframes[[i]] contains "this text") {
    # don't filter
  } else {
    list_of_dataframes[[i]] <- filter(list_of_dataframes[[i]], rule)
    # AND add unique id = first three characters of the name of list_of_dataframes[[i]]
  }
}
Sorry if the terminology used here is a bit awkward; I'm just starting out with programming and this is my first time posting here, so there's still a lot to learn. (As a bonus, if you have any good resources/websites for learning to automate and do similar things with R, I would be more than glad to get some recommendations! :-))
EDIT:
The code I tried for the filtering part was:
for (i in 1:length(tbl)) {
  if (!str_detect(tbl[[i]], "OLD")) {
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}
However, this produced the messages "argument is not an atomic vector; coercing" and "the condition has length > 1 and only the first element will be used". Is there any way to get this code working?
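Those messages arise because str_detect() is applied to the whole data frame tbl[[i]] rather than to its name. A minimal sketch of a fix, assuming the list elements are named after their files as described in the question:

library(stringr)
library(dplyr)

for (i in 1:length(tbl)) {
  # test the element's name, not the data frame itself
  if (!str_detect(names(tbl)[i], "OLD")) {
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}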
Let there be a directory called files containing these csv files:
'experiment 1.csv' 'experiment 2.csv' 'experiment 3.csv'
'OLDexperiment 1.csv' 'OLDexperiment 2.csv'
This will give you a list of data frames with a filter condition (here: do not contain the substring OLD in the filename). Just remove the ! to only include old experiments instead. A new column id is added containing the file path:
library(tidyverse)

# see what is in the directory
list.files("files")

# full paths, named by themselves so the file names survive the map
paths <- list.files("files", full.names = TRUE)
names(paths) <- paths

list_of_dataframes <- paths %>% map(read_csv)

list_of_dataframes %>%
  enframe() %>%
  # drop data frames whose file name contains "OLD"
  filter(! name %>% str_detect("OLD")) %>%
  # add the file path as an id column to each remaining data frame
  mutate(value = name %>% map2(value, ~ {
    .y %>% mutate(id = .x)
  })) %>%
  pull(value)
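Since the question asks for an id of just the first three characters of the filename rather than the full path, here is a small variant of the same pipeline (a sketch; str_sub() is from stringr, basename() from base R):

list_of_dataframes %>%
  enframe() %>%
  filter(! name %>% str_detect("OLD")) %>%
  # id = first three characters of the file name, e.g. "exp"
  mutate(value = map2(name, value, ~ mutate(.y, id = str_sub(basename(.x), 1, 3)))) %>%
  pull(value)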
A good resource to start with is the free book R for Data Science.
This is a much simpler approach without a list to get one big combined table of files matching the same condition:
list.files("files", full.names = TRUE) %>%
tibble(id = .) %>%
# discard old experiments
filter(! id %>% str_detect("OLD")) %>%
# read the csv table for every matching file
mutate(data = id %>% map(read_csv)) %>%
# combine the tables into one big one
unnest(data)

load bigquery JSON data dump into R tibble

I have downloaded a JSON extract from Big Query which has nested and repeated fields (similar to the package bigrquery) and am attempting to further manipulate the resulting tibble.
I have the following code to load the JSON and convert it to a tibble:
library(tidyverse)
ga.list <- lapply(readLines("temp.json"), jsonlite::fromJSON, flatten = TRUE)
ga.df <- tibble(dat = ga.list) %>%
  unnest_wider(dat) %>%
  mutate(id = row_number()) %>%
  unnest_wider(b_nested) %>%
  unnest_wider(b3) %>%
  unnest_wider(b33)
So there were two list columns:
b_nested: a nested list, which I unnested recursively (maybe there is a more automated way; if so, please advise!)
rr1 and rr2: these columns always have the same number of elements, so element 1 of rr1 and element 1 of rr2 should be read together.
I am still working out how to extract id, rr1 and rr2 and make them into a long table with repeated rows for each id.
Note: this question has been edited a few times as I progressed; originally I was stuck getting from JSON to a tibble until I found unnest_wider().
temp.json:
{"a":"4000","b_nested":{"b1":"(not set)","b2":"some -
text","b3":{"b31":"1591558980","b32":"60259425255","b33":{"b3311":"133997175"},"b4":false},"b5":true},"rr1":[],"rr2":[]}
{"a":"4000","b_nested":{"b1":"asdfasdfa","b2":"some - text
more","b3":{"b31":"11111","b32":"2222","b33":{"b3311":"3333333"},"b4":true},"b5":true},
"rr1":["v1","v2","v3"],"rr2":["x1","x2","x3"]}
{"a":"6000","b_nested":{"b1":"asdfasdfa","b2":"some - text
more","b3":{"b31":"11111","b32":"2222","b33":{"b3311":"3333333"},"b4":true},"b5":true},"rr1":["v1","v2","v3","v4","v5"],"rr2":["aja1","aja2","aja3","aja14","aja5"]}
The final piece of the puzzle: to get the repeated rows for each repeated record:
ga.df %>%
  select(id, rr1, rr2) %>%
  unnest(cols = c(rr1, rr2))
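For the sample temp.json above this should yield eight rows: id 1 contributes none (its arrays are empty, and unnest() drops empty elements unless keep_empty = TRUE), id 2 contributes three, and id 3 five.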
FYI: Link to Big Query Specifying nested and repeated columns
Another solution (my preference) would be to create a tibble from rr1 and rr2 and keep it as a column in ga.df so that purrr functions can be used:
ga.df %>%
mutate(rr = map2(rr1, rr2, function(x,y) {
tibble(rr1 = x, rr2 = y)
})) %>%
select(-rr1, -rr2) %>%
mutate(rr_length = map_int(rr, ~nrow(.x)))
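With rr stored as a list column of tibbles, expanding back out to the long form is then a single unnest (a sketch, assuming ga.df as built above):

ga.df %>%
  select(id, rr) %>%
  unnest(rr)
# one row per rr1/rr2 pair, repeated for each id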

Importing data using readxl

Issue description
I'm trying to iterate over multiple sheets in a spreadsheet to pick up the first row as column names and rows 11+ as data. I'm hoping to import them as a single dataframe. I'm having trouble because there are 10 header rows in each sheet and I don't seem to be able to aggregate the sheets without losing data.
Data
The file in question is found at Table 6 on this page of the ABS website.
My attempt
The first chunk does the heavy lifting of getting the data into R. The map function naturally results in a list of lists containing the data from the sheets whose names contain the text "Data" (done this way because every one of these spreadsheets has two sheets with irrelevant info).
BUT I want the output in a dataframe, so I tried the map_df function, but all data from sheets after the first is imported as NA values (incorrect).
library(tidyverse)
library(stringr)
library(readxl) # for excel_sheets() and read_excel()

df1 <- path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map(read_excel, path = path, skip = 9)
The second chunk picks up the column names in each of the sheets so that they can be applied to df1.
nms <- path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map_df(read_excel, path = path, n_max = 0, col_names = TRUE) %>%
  mutate(date = 1) %>%
  select(date, everything())
names(df1) <- names(nms)
If anyone could show me how to import the data without the NAs in a single dataframe, that would be great. Bonus points for showing me how to do it in a single step, without needing the second chunk to name the columns.
Not exactly sure what you're looking for, but if you want to read all the sheets in that workbook while skipping the first 9 rows: map_df row-binds sheets whose column names differ, which is where the NA values come from. Instead, stitch the sheets together through a reduce using left_join on the shared "Series ID" column.
df1 <- path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map(~ read_excel(path, .x, skip = 9)) %>%
  reduce(left_join, by = "Series ID")
If you want to keep the original header names:
path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map(~ read_excel(path, .x, col_names = FALSE) %>%
        set_names(., c("Series ID", .[1, 2:ncol(.)])) %>%
        slice(-1:-10)) %>%
  reduce(left_join, by = "Series ID") %>%
  mutate_at(vars(-`Series ID`), as.numeric)
