Load BigQuery JSON data dump into an R tibble

I have downloaded a JSON extract from BigQuery that has nested and repeated fields (similar to what the bigrquery package returns) and am attempting to further manipulate the resulting tibble.
I have the following code to load the JSON and convert it to a tibble:
library(tidyverse)

ga.list <- lapply(readLines("temp.json"), jsonlite::fromJSON, flatten = TRUE)

ga.df <- tibble(dat = ga.list) %>%
  unnest_wider(dat) %>%
  mutate(id = row_number()) %>%
  unnest_wider(b_nested) %>%
  unnest_wider(b3) %>%
  unnest_wider(b33)
So there were two kinds of list columns:
b_nested: a nested list, which I unnested level by level (maybe there is a more automated way; if so, please advise! A sketch of one possibility follows this list.)
rr1 and rr2: these columns will always have the same number of elements, so element 1 of rr1 and element 1 of rr2 should be read together.
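As a stab at the "more automated way": here is a minimal sketch of a recursive widener. It assumes (as holds for this data) that a column should be widened exactly when it is a list column whose elements are all named lists; the helper name is mine, and library(tidyverse) is loaded as above.
unnest_records_wider <- function(df) {
  repeat {
    # list columns whose elements are all named lists, i.e. nested records
    # (repeated fields like rr1/rr2 hold unnamed vectors and are skipped)
    rec <- vapply(df, function(col) {
      is.list(col) &&
        all(vapply(col, function(el) is.list(el) && !is.null(names(el)), logical(1)))
    }, logical(1))
    if (!any(rec)) return(df)
    for (col in names(df)[rec]) df <- unnest_wider(df, all_of(col))
  }
}

# drop-in replacement for the chain of unnest_wider() calls above
ga.df <- tibble(dat = ga.list) %>%
  unnest_wider(dat) %>%
  mutate(id = row_number()) %>%
  unnest_records_wider()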
I am still working out how to extract id, rr1 and rr2 and turn them into a long table with repeated rows for each id.
Note: this question has been edited a few times as I progressed; originally I was stuck getting from JSON to a tibble until I found unnest_wider().
temp.json:
{"a":"4000","b_nested":{"b1":"(not set)","b2":"some -
text","b3":{"b31":"1591558980","b32":"60259425255","b33":{"b3311":"133997175"},"b4":false},"b5":true},"rr1":[],"rr2":[]}
{"a":"4000","b_nested":{"b1":"asdfasdfa","b2":"some - text
more","b3":{"b31":"11111","b32":"2222","b33":{"b3311":"3333333"},"b4":true},"b5":true},
"rr1":["v1","v2","v3"],"rr2":["x1","x2","x3"]}
{"a":"6000","b_nested":{"b1":"asdfasdfa","b2":"some - text
more","b3":{"b31":"11111","b32":"2222","b33":{"b3311":"3333333"},"b4":true},"b5":true},"rr1":["v1","v2","v3","v4","v5"],"rr2":["aja1","aja2","aja3","aja14","aja5"]}

The final piece of the puzzle: to get repeated rows for the repeated records:
ga.df %>%
  select(id, rr1, rr2) %>%
  unnest(cols = c(rr1, rr2))
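One caveat: with the sample data above, the first record has empty rr1/rr2 arrays, and unnest() drops such rows by default. tidyr's keep_empty argument retains them as NA rows:
ga.df %>%
  select(id, rr1, rr2) %>%
  unnest(cols = c(rr1, rr2), keep_empty = TRUE)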
FYI: see the BigQuery documentation on specifying nested and repeated columns.
Another solution (my preference) would be to create a tibble from rr1 and rr2 and keep it as a column in ga.df so that purrr functions can be used:
ga.df %>%
  mutate(rr = map2(rr1, rr2, function(x, y) {
    tibble(rr1 = x, rr2 = y)
  })) %>%
  select(-rr1, -rr2) %>%
  mutate(rr_length = map_int(rr, ~ nrow(.x)))
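With rr stored as a tibble column, the long table from earlier is then one unnest() away (assuming the pipeline above was assigned back to ga.df):
ga.df %>%
  select(id, rr) %>%
  unnest(rr)  # add keep_empty = TRUE to retain the id whose rr1/rr2 were empty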

Related

How to check in R if the name of the list element contains "this text" in it and pass to the next element in a for loop?

I'm new to R and have a large list of 30 elements, each of which is a dataframe with a few hundred rows and around 20 columns (this varies between dataframes). Each dataframe is named after the original .csv filename (for example "experiment data XYZ QWERTY 01"). How can I go through the whole list and filter only those dataframes whose filename does not contain a specific text, and also add a unique id column to those filtered dataframes (the id value being the first three characters of the filename)? For example, all the elements/dataframes/files in the list whose name includes "XYZ QWERTY" should not be filtered and don't need a unique id. I had this pseudo-style code:
for (i in seq_along(list_of_dataframes)) {
  if (str_detect(names(list_of_dataframes)[i], "this text")) {
    next  # contains "this text": don't filter, no id needed
  }
  list_of_dataframes[[i]] <- list_of_dataframes[[i]] %>%
    filter(rule) %>%  # 'rule' stands in for the actual filter condition
    mutate(id = substr(names(list_of_dataframes)[i], 1, 3))
}
Sorry if the terminology used here is a bit awkward; I'm just starting out with programming and this is my first time posting, so there's still a lot to learn. (As a bonus, if you have any good resources/websites for learning to automate and do similar things with R, I'd be more than glad to get some recommendations!)
EDIT:
The code I tried for the filtering part was:
for (i in 1:length(tbl)) {
  if (!str_detect(tbl[[i]], "OLD")) {
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}
However, I got error messages stating "argument is not an atomic vector; coercing" and "the condition has length > 1 and only the first element will be used". Is there any way to get this code working?
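For what it's worth, both messages arise because str_detect() is applied to the whole data frame tbl[[i]] rather than to its name. Testing names(tbl)[i] instead is a minimal fix for the loop as written (assuming the list elements are named after the source files):
for (i in seq_along(tbl)) {
  # test the element's name, not the data frame itself
  if (!str_detect(names(tbl)[i], "OLD")) {
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}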
Let there be a directory called files containing these csv files:
'experiment 1.csv' 'experiment 2.csv' 'experiment 3.csv'
'OLDexperiment 1.csv' 'OLDexperiment 2.csv'
This will give you a list of data frames matching a filter condition (here: the filename does not contain the substring OLD). Just remove the ! to include only the old experiments instead. A new column id is added containing the file path:
library(tidyverse)
list.files("files")
paths <- list.files("files", full.names = TRUE)
names(paths) <- list.files("files", full.names = TRUE)
list_of_dataframes <- paths %>% map(read_csv)
list_of_dataframes %>%
  enframe() %>%
  filter(! name %>% str_detect("OLD")) %>%
  mutate(value = name %>% map2(value, ~ {
    .y %>% mutate(id = .x)
  })) %>%
  pull(value)
A good resource to start with is the free book R for Data Science.
This is a much simpler approach without a list, yielding one big combined table of the files matching the same condition:
list.files("files", full.names = TRUE) %>%
tibble(id = .) %>%
# discard old experiments
filter(! id %>% str_detect("OLD")) %>%
# read the csv table for every matching file
mutate(data = id %>% map(read_csv)) %>%
# combine the tables into one big one
unnest(data)
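If the id should be only the first three characters of the file name, as the original question asks, one hedged extension of this pipeline derives it from the path with basename() and substr():
combined <- list.files("files", full.names = TRUE) %>%
  tibble(id = .) %>%
  filter(! id %>% str_detect("OLD")) %>%
  mutate(data = id %>% map(read_csv)) %>%
  unnest(data) %>%
  # replace the full path with the first three characters of the bare file name
  mutate(id = substr(basename(id), 1, 3))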

Looping a pipe through columns of a tibble

I have a tibble with 20 variables. So far I've been using this pipe to find out which values appear more than once in a single column:
as_tibble(iris) %>%
  group_by(Petal.Length) %>%
  summarise(n = sum(n())) %>%
  filter(n > 1)
I was wondering if I could write a line that loops this through all the columns and returns 20 different tibbles (or however many I need in the future), in the same way the pipe above returns one tibble. I have tried writing my own loops but have had no success; I am quite new to this.
The iris example dataset has 5 columns so feel free to give an answer with 5 columns.
Thank you!
library(dplyr)
col_names <- colnames(iris)
lapply(
  col_names,
  function(col) {
    iris %>%
      group_by_at(col) %>%
      summarise(n = n()) %>%
      filter(n > 1)
  }
)
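Note that group_by_at() is superseded in recent dplyr. Here is a sketch of the same loop using count() with the .data pronoun (assuming a reasonably current dplyr and purrr):
library(dplyr)
library(purrr)

dup_values <- names(iris) %>%
  set_names() %>%  # name each result after its column
  map(~ iris %>% count(.data[[.x]]) %>% filter(n > 1))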
In base R 4.1+ we have this one-liner. For each column it applies table() and then keeps only the elements whose count exceeds 1. Finally it converts what remains of the table to a data frame. Omit stack if it is OK to return a list of table objects instead of a list of data frames.
lapply(iris, \(x) stack(Filter(function(x) x > 1, table(x))))
A variation of that is to tabulate only the duplicated items and then add 1 (for the first occurrence), giving slightly fewer keystrokes. Again, we can omit stack if returning a list of table objects is OK.
lapply(iris, \(x) stack(table(x[duplicated(x)]) + 1))

Transpose my R Dataset for association analysis

I am something of a newbie with R and data manipulation, and I am trying to transpose the UCI words dataset. In the default dataset, the first column is the document number, the second column is the word number referencing another text file, and the last column is the number of times the word occurs in the document. (For now, we can forget about the third column; I know how to drop it from the dataset.)
What I am trying to do is transpose the dataset so that each document's words end up in one row.
I tried using the t() function, but it transposes the entire dataset at once, which is not what I want. I looked into using the dplyr package to help with the data manipulation, but I am not getting any solid leads. If you have any sources or a particular direction you can nudge me towards, that would be helpful.
Thank you!
Here's a solution using the tidyverse package (which includes dplyr). The trick is to first add another column to differentiate entries with the same value in the first column (document number) and then just change the data to wide format using pivot_wider.
library(tidyverse)

# Your data
df <- read.csv(text = "num word
1 61
2 76
1 89
3 211
3 296", sep = " ")

df %>%
  # Group by num
  group_by(num) %>%
  # Add a row number to differentiate entries with the same first column value
  mutate(rownum = row_number()) %>%
  # Change data to wide format
  pivot_wider(id_cols = num,
              names_from = rownum,
              values_from = word)
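For reference, the wide result on the sample data should look roughly like this (reconstructed by hand, so worth double-checking):
#     num   `1`   `2`
# 1     1    61    89
# 2     2    76    NA
# 3     3   211   296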
So I was able to figure out how to accomplish this task. Hopefully it helps other data scientists in the future.
data <- read.table("docword.kos.txt", sep = " ")
data <- data %>% select(V1, V2)

trans <- data %>%
  group_by(V1) %>%
  summarise(words = paste(V2, collapse = ","))

trans <- trans %>% select(words)
What I ended up doing was using dplyr to group my dataset by the first column and collapse the words. Then I exported the dataset and re-uploaded it after making some slight adjustments in Notepad (removing the quote characters from the generated csv file).
write.csv(trans, "~/trend.csv", row.names = FALSE)
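A side note: the manual quote cleanup in Notepad can likely be avoided altogether, since write.csv accepts a quote argument that turns quoting off at write time:
write.csv(trans, "~/trend.csv", row.names = FALSE, quote = FALSE)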

load JSON data into a dataframe

I am a beginner working with R and especially JSON files, and this is probably a simple question but I have been unsuccessful for a while.
Here is a sample row of data from a provided text file (there are ~4000 rows):
{"040070005001":4,"040070005003":4,"040138101003":4,"040130718024":4}
Each row has a variable number of values in the string.
I am trying to use a loop, but it only keeps the last row of the data set rather than capturing the data from each row:
for (row in 1:nrow(origins)) {
  json <- origins$home_cbgs[row] %>%
    fromJSON() %>%
    unlist() %>%
    as.data.frame() %>%
    rownames_to_column() %>%
    rename(
      origin_census_block_group = "rowname",
      origin_visitors = "."
    )
}
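The likely culprit is that json is overwritten on every pass through the loop, so only the last row survives. A minimal sketch of one fix builds a small data frame per row and binds them all together (it assumes origins$home_cbgs holds one JSON string per row, and reuses the column names from the loop above):
library(jsonlite)
library(tibble)
library(purrr)

result <- map_dfr(origins$home_cbgs, function(s) {
  fromJSON(s) %>%
    unlist() %>%
    # named vector -> two-column tibble: names become the block group ids
    enframe(name = "origin_census_block_group", value = "origin_visitors")
})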

How to use purrr with dplyr to filter list elements and export lists into Excel

I'm fairly new to working with lists in R and have a quick question that also involves using purrr. Below are two small sample data frames as an example.
Client1 <- c("John","Chris","Yutaro","Dean","Andy")
Animals <- c("Cat","Cat","Dog","Rat","Bird")
Living <- c("House","Condo","Condo","Apartment","House")
Data1 <- data.frame(Client1,Animals,Living)
Client1 <- c("John","Chris","Yutaro","Dean","Andy")
Animals2 <- c("Cat","Dog","Dog","Rat","Cat")
Living2 <- c("House","Apartment","Apartment","Family","Apartment")
Data2 <- data.frame(Client1,Animals2,Living2)
Bonus points if you can include how to rename the columns of all list elements at once, instead of using the two lines below:
names(Data1)[1:3] <- c("Client","Animals","Living")
names(Data2)[1:3] <- c("Client","Animals","Living")
So next, if I want to filter each data frame by Animals and then export each one to its own file, I can use the two lines of code below:
Data1 %>% filter(Animals == "Cat") %>% write.csv(., file = "Data1.csv")
Data2 %>% filter(Animals == "Cat") %>% write.csv(., file = "Data2.csv")
However, to be more efficient I can join both data frames into a list and use purrr to filter each at the same time.
DataList <- list(Data1,Data2)
DataList %>% map(~filter(.,Animals=="Cat"))
For the above code, I would need a separate ~filter line for each animal, so I'm not sure if there's a more efficient way that avoids writing many lines of code while still using purrr and dplyr.
Also, how do I use write.csv with purrr? I could export the list into one spreadsheet, but I'm not sure how to break up the list so that it exports properly; alternatively, I could export each list element into a separate spreadsheet. It would be great to see a solution for both of these situations.
If I understand your question correctly, you want to write a separate file for each of the Animals of both the data frames:
DataList <- list(Data1, Data2)
library(purrr)
a <- DataList %>%
  map(function(x) {
    colnames(x) <- c("Client", "Animals", "Living")
    x
  }) %>%
  map(function(x) split(x, x$Animals)) %>%
  flatten()

names(a) <- paste0("Data", 1:length(a))

lapply(1:length(a), function(x) {
  write.csv(a[[x]],
            file = paste0(names(a[x]), ".csv"),
            row.names = FALSE)
})
We first dump both the data frames in DataList, then rename the columns for both the data frames with the first map, then split both the data frames by Animals, and finally flatten the nested list.
I wish I could do this without breaking the chain, but I couldn't find another way.
From here, we first rename the elements of the list, then use lapply to loop over all the elements in the list and apply write.csv on each of them.
You mentioned Excel: you can just as easily replace write.csv with any of the functions for writing Excel files from R.
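For example, the writexl package (one option among several; an assumption on my part that it fits your setup) can write the whole named list in one call, one sheet per list element:
# install.packages("writexl")
library(writexl)
write_xlsx(a, path = "animals_by_dataset.xlsx")  # one sheet per element of a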
Here is one option, involving binding the two datasets together before re-splitting.
library(purrr)
library(dplyr)
DataList %>%
  map(~ setNames(.x, c("Client", "Animals", "Living"))) %>%
  setNames(c("Data1", "Data2")) %>%
  bind_rows(.id = "id") %>%
  split(list(.$id, .$Animals), drop = TRUE) %>%
  map(~ select(.x, -id) %>%
        write.csv(file = paste0(unique(.x$id), unique(.x$Animals), ".csv"),
                  row.names = FALSE))
The first map line shows how to rename the columns of all the datasets in a list at once via setNames.
DataList %>%
  map(~ setNames(.x, c("Client", "Animals", "Living")))
I then set the names of the datasets in the list via setNames. While stacking the datasets together into a single data.frame via dplyr's bind_rows, these names are added as a new column, id.
setNames(c("Data1", "Data2")) %>%
bind_rows(.id = "id")
The last step is to split the combined data.frame by id and Animal before writing each split into a separate csv file. Information is pulled out of the dataset for naming the individual files by dataset and animal (this was the reason to name the elements of DataList). I removed the id variable via select prior to writing the files, as it may be extraneous to your needs.
split(list(.$id, .$Animals), drop = TRUE) %>%
  map(~ select(.x, -id) %>%
        write.csv(file = paste0(unique(.x$id), unique(.x$Animals), ".csv"),
                  row.names = FALSE))
This can all be done without putting the datasets into a single data.frame, but I had trouble naming the files at the end.
