Unpack json columns into a dataframe - r

I have json strings inside a dataframe column. I want to bring all these new json columns into the dataframe.
# Input
JsonID <- as.factor(c(1,2,3))
JsonString1 = "{\"device\":{\"site\":\"Location1\"},\"tags\":{\"Engine Pressure\":\"150\",\"timestamp\":\"2608411982\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"2608411982\"}"
JsonString2 = "{\"device\":{\"site\":\"Location2\"},\"tags\":{\"Engine Pressure\":\"160\",\"timestamp\":\"3608411983\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"3608411983\"}"
JsonString3 = "{\"device\":{\"site\":\"Location3\"},\"tags\":{\"Brake Fluid\":\"100\",\"timestamp\":\"4608411984\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"4608411984\"}"
JsonStrings = c(JsonString1, JsonString2, JsonString3)
Example <- data.frame(JsonID, JsonStrings)
Using the jsonlite library I can make each json string into a 1 row dataframe.
library(jsonlite)
# One row dataframes
DF1 <- data.frame(fromJSON(JsonString1))
DF2 <- data.frame(fromJSON(JsonString2))
DF3 <- data.frame(fromJSON(JsonString3))
Unfortunately the JsonID column is lost. All the json strings share some column names, such as "time", but there are also column names they don't share. By pivoting the data longer I could rbind all the dataframes together.
library(dplyr)
library(tidyr)
# Row bindable one row dataframes
DF1_RowBindable <- DF1 %>%
rename_all(~gsub("tags.", "", .x)) %>%
tidyr::pivot_longer(cols = c(colnames(.)[2]))
Is there a better way to do this?
I have never worked with json strings before. The solution must be computationally scalable.

We can store the output of fromJSON in a list column in the dataframe itself, so we don't lose any information that is already in the data. We can then use unnest_wider to create new columns from the named list.
library(dplyr)
library(tidyr)
library(jsonlite)
Example %>%
rowwise() %>%
mutate(data = list(fromJSON(JsonStrings))) %>%
unnest_wider(data) %>%
select(-JsonStrings) %>%
unnest_wider(tags) %>%
unnest_wider(device)
# JsonID site `Engine Pressure` timestamp historic adhoc `Brake Fluid` online time
# <fct> <chr> <chr> <chr> <lgl> <lgl> <chr> <lgl> <chr>
#1 1 Location1 150 2608411982 FALSE FALSE NA TRUE 2608411982
#2 2 Location2 160 3608411983 FALSE FALSE NA TRUE 3608411983
#3 3 Location3 NA 4608411984 FALSE FALSE 100 TRUE 4608411984
Since the columns (data, tags, device) are of different lengths, we need to use unnest_wider separately on each of them.
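If row-by-row parsing becomes a bottleneck, one possible alternative is to join the strings into a single JSON array and parse them in one call. This is only a sketch and not part of the answer above. It assumes every string is a valid JSON object, and it relies on the flatten argument of jsonlite::fromJSON, which names nested fields with a prefix such as device.site or tags.timestamp.

```r
library(jsonlite)

JsonID <- as.factor(c(1, 2, 3))
JsonStrings <- c(
  '{"device":{"site":"Location1"},"tags":{"Engine Pressure":"150","timestamp":"2608411982","historic":false,"adhoc":false},"online":true,"time":"2608411982"}',
  '{"device":{"site":"Location2"},"tags":{"Engine Pressure":"160","timestamp":"3608411983","historic":false,"adhoc":false},"online":true,"time":"3608411983"}',
  '{"device":{"site":"Location3"},"tags":{"Brake Fluid":"100","timestamp":"4608411984","historic":false,"adhoc":false},"online":true,"time":"4608411984"}'
)

# Wrap the individual objects in [ ... ] so they form one JSON array
json_array <- paste0("[", paste(JsonStrings, collapse = ","), "]")

# One parse for all rows; flatten = TRUE turns nested objects into prefixed columns
flat <- fromJSON(json_array, flatten = TRUE)

# Rows come back in input order, so the IDs can be bound back on;
# check.names = FALSE keeps names such as "tags.Engine Pressure" unchanged
result <- data.frame(JsonID, flat, check.names = FALSE)
```

Keys that are missing from a given string (here "Brake Fluid" or "Engine Pressure") should show up as NA in that row. The tags. and device. prefixes can be stripped afterwards if shorter names are preferred.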

Related

Extracting JSON data with asymmetric content from a dataframe column in R

I loaded a table from a database which contains a column that has JSON data in each row.
The table looks something like the example below. (I was not able to replicate the data.frame I have, due to the format of the column data)
dataframe_example <- data.frame(id = c(1,2,3),
name = c("name1","name2","name3"),
JSON_col = c({"_inv": [10,20,30,40]}, "_person": ["_personid": "green"],
{"_inv": [15,22]}, "_person": ["_personid": "blue"],
{"_inv": []}, "_person": ["_personid": "red"]))
I have the following two issues:
Some of the items (e.g. "_inv") sometimes have the full 4 numeric entries, sometimes less, and sometimes nothing. Some of the other items (e.g. "_person") usually contain another header, but only one character data point.
My goal is to preserve the existing dataframe's columns (such as id and name) and spread the data in the json column such that I have new columns containing each point of information. The target dataframe would look a little like this:
data.frame(id = c(1,2,3),
name = c("name1","name2","name3"),
`_inv_1` = c(10,15,NA),
`_inv_2` = c(20,22,NA),
`_inv_3` = c(30,NA,NA),
`_inv_4` = c(40,NA,NA),
`_person_id` = c("green","blue","red"),
check.names = FALSE) # keep the names starting with "_" as written
Please bear in mind that I have very little experience handling JSON data and no experience dealing with uneven JSON data.
Using purrr I got:
frame <- purrr::map(dataframe_example$JSON_col, jsonlite::fromJSON)
This gave me a large list with n elements, where n is the number of rows in the original dataframe. The "Name" item contains n lists [[1]], each one with its own type of object, ranging from double to data.frame. The double objects contain four numeric observations (such as _inv); some of the objects are lists themselves (such as _person), which further contain "_personid" and then a single entry. The data.frame contains the datetime stamps for each observation in the JSON data (each _inv item has a timestamp).
Is there a way to obtain the solution above, either by extracting the data from my "frame" object, or an altogether different solution?
library(tidyverse)
library(jsonlite)
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
dataframe_example <-
data.frame(
id = c(1, 2, 3),
name = c("name1", "name2", "name3"),
JSON_col = c(
"{\"_inv\": [10,20,30,40], \"_person\": {\"_personid\": \"green\"}}",
"{\"_inv\": [15,22], \"_person\": {\"_personid\": \"blue\"}}",
"{\"_inv\": [], \"_person\": {\"_personid\": \"red\"}}"
)
)
dataframe_example %>%
as_tibble() %>%
mutate(
JSON_col = JSON_col %>% map(parse_json)
) %>%
unnest_wider(JSON_col) %>%
unnest(`_inv`) %>%
unnest(`_inv`) %>%
unnest(`_person`) %>%
unnest(`_person`) %>%
group_by(id, name) %>%
mutate(inv_id = row_number()) %>%
pivot_wider(names_from = inv_id, values_from = `_inv`, names_prefix = "_inv_")
#> # A tibble: 2 x 7
#> # Groups: id, name [2]
#> id name `_person` `_inv_1` `_inv_2` `_inv_3` `_inv_4`
#> <dbl> <chr> <chr> <int> <int> <int> <int>
#> 1 1 name1 green 10 20 30 40
#> 2 2 name2 blue 15 22 NA NA
Created on 2021-11-25 by the reprex package (v2.0.1)
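Note that the output above has only two rows: the row with the empty "_inv" array (id 3) disappears when the list column is unnested. A possible way to keep all rows, sketched here in base R plus jsonlite rather than with the tidyr pipeline above, is to pad every "_inv" vector with NA to a common length. The names parsed, max_len, inv_mat and result are illustrative only.

```r
library(jsonlite)

dataframe_example <- data.frame(
  id = c(1, 2, 3),
  name = c("name1", "name2", "name3"),
  JSON_col = c(
    '{"_inv": [10,20,30,40], "_person": {"_personid": "green"}}',
    '{"_inv": [15,22], "_person": {"_personid": "blue"}}',
    '{"_inv": [], "_person": {"_personid": "red"}}'
  )
)

# Parse each JSON string into a named list
parsed <- lapply(as.character(dataframe_example$JSON_col), fromJSON)

# Longest "_inv" array decides how many _inv_* columns are needed
max_len <- max(lengths(lapply(parsed, `[[`, "_inv")))

# Pad every "_inv" vector with NA up to max_len and stack them as rows
inv_mat <- do.call(rbind, lapply(parsed, function(p) {
  v <- as.numeric(unlist(p[["_inv"]]))
  length(v) <- max_len
  v
}))
colnames(inv_mat) <- paste0("_inv_", seq_len(max_len))

# Bind the original columns, the padded values and the person id together
result <- cbind(
  dataframe_example[c("id", "name")],
  inv_mat,
  `_person_id` = vapply(parsed, function(p) p[["_person"]][["_personid"]], character(1))
)
```

This keeps one row per id, with NA in every _inv_* column for the row whose array is empty.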

Numbers as column name in tibbles: problem when using select()

I am trying to select some column by name, and the names are numbers. This is the code:
df2 <- df1 %>% select(`Year`, all_of(append(list1, list2))) %>%
I get this error:
Error: Can't subset columns that don't exist. x Locations 61927,
169014, 75671, 27059, 225963, etc. don't exist. i There are only 5312
columns.
I think the error is due to column names being numbers. How do I solve it? (I want to keep the column name as numbers)
We may use any_of with paste, so that the column names are passed as character strings rather than numbers. This works even when the column names are numeric, and any_of does not throw an error if some of the names are missing.
library(dplyr)
df1 %>%
select(Year, any_of(paste(c(list1, list2))))
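As a small self-contained illustration of the idea, assuming a toy tibble and a hypothetical numeric vector ids standing in for list1 and list2:

```r
library(dplyr)

# Toy data: two columns whose names are numbers
df <- tibble(Year = 1L, `2020` = 10, `2021` = 20, var = "a")

# Hypothetical IDs stored as numbers; 2022 has no matching column
ids <- c(2020, 2021, 2022)

# Converting to character makes select() match by name instead of position;
# any_of() silently skips names that do not exist
res <- df %>%
  select(Year, any_of(as.character(ids)))

names(res)
```

With all_of() instead of any_of(), the missing name 2022 would raise an error, which can be useful when every listed column is expected to exist.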
If you pass a number to select, it is used as a column position, but you can pass the number as a character string instead.
Example
library(dplyr)
df <- tibble(`2020` = NA,`2021` = NA, "var" = NA)
df
# A tibble: 1 x 3
`2020` `2021` var
<lgl> <lgl> <lgl>
1 NA NA NA
Using number inside select
This gives an error: there are just 3 variables, and select(2020) looks for the 2020th column.
df %>%
select(2020)
Error: Can't subset columns that don't exist. x Location 2020 doesn't
exist. i There are only 3 columns.
Using number as string inside select
df %>%
select("2020")
# A tibble: 1 x 1
`2020`
<lgl>
1 NA
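You can also clean the column names using the janitor package. Note that this does not keep the names as plain numbers: clean_names makes them syntactic, so a column named 2020 becomes x2020.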
df1 <- janitor::clean_names(df1)

Separate string into columns by extracting all groups that match regex

I have these strings in every row of one column.
example_df <- tibble(string = c("[{\"positieVergelekenMetSchooladvies\":\"boven niveau\",\"percentage\":9.090909090909092,\"percentageVergelijking\":19.843418733556412,\"volgorde\":10},{\"positieVergelekenMetSchooladvies\":\"op niveau\",\"percentage\":81.81818181818181,\"percentageVergelijking\":78.58821425834631,\"volgorde\":20},{\"positieVergelekenMetSchooladvies\":\"onder niveau\",\"percentage\":9.090909090909092,\"percentageVergelijking\":1.5683670080972694,\"volgorde\":30}]"))
I'm only interested in the numbers. This regex works:
example_df %>%
.$string %>%
str_extract_all(., "[0-9]+\\.[0-9]+")
Instead of the separate() function I want to use the extract() function. My understanding is that extract() differs from separate() in that its regex matches the content that should populate the new columns, whereas separate() matches the separator. But where separate() splits at every match of sep=, extract() seems to match only one group.
example_df %>%
extract(string,
into = c("boven_niveau_school",
"boven_niveau_verg",
"op_niveau_school",
"op_niveau_verg",
"onder_niveau_school",
"onder_niveau_verg"),
regex = "([0-9]+\\.[0-9]+)")
What am I doing wrong?
Instead of separate or extract I would extract all the numbers from the string and then use unnest_wider to create new columns.
library(tidyverse)
example_df %>%
mutate(temp = str_extract_all(string, "[0-9]+\\.[0-9]+")) %>%
unnest_wider(temp, names_sep = "_") # names_sep builds names (temp_1, temp_2, ...) for the unnamed values
You can rename the columns as you like.
We can use regmatches/gregexpr from base R
out <- regmatches(example_df$string, gregexpr("\\d+\\.\\d+", example_df$string))[[1]]
example_df[paste0("new", seq_along(out))] <- as.list(out)
example_df
# A tibble: 1 x 7
# string new1 new2 new3 new4 new5 new6
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 "[{\"positieVergelekenMetSchooladvies\":\"boven niveau\",\"percentage\":9… 9.09090909… 19.84341873… 81.8181818… 78.588214… 9.0909090… 1.56836700…
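Since the string is valid JSON, another option is to parse it instead of using a regex. The pattern "[0-9]+\\.[0-9]+" only matches numbers with a decimal point, so a percentage that happens to be a whole number would be skipped. The sketch below uses jsonlite::fromJSON on a shortened copy of the string (values rounded for brevity); the object names json, parsed, vals and wide are illustrative.

```r
library(jsonlite)

# Shortened copy of the example string, with rounded values
json <- '[{"positieVergelekenMetSchooladvies":"boven niveau","percentage":9.09,"percentageVergelijking":19.84,"volgorde":10},
          {"positieVergelekenMetSchooladvies":"op niveau","percentage":81.82,"percentageVergelijking":78.59,"volgorde":20},
          {"positieVergelekenMetSchooladvies":"onder niveau","percentage":9.09,"percentageVergelijking":1.57,"volgorde":30}]'

# An array of objects parses into a data frame with one row per object
parsed <- fromJSON(json)

# Read the two percentage columns row by row into a single vector
vals <- as.vector(t(as.matrix(parsed[c("percentage", "percentageVergelijking")])))
names(vals) <- c("boven_niveau_school", "boven_niveau_verg",
                 "op_niveau_school", "op_niveau_verg",
                 "onder_niveau_school", "onder_niveau_verg")

# One-row data frame with one column per value
wide <- as.data.frame(as.list(vals))
```

For a column holding many such strings, the same steps could be applied to each row and the one-row results bound together.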

convert named list with mixed content to data frame

Is there a better and nicer way to convert named list with mixed content to data frame?
The working example:
my_list <- list("a" = 1.0, "b" = "foo", "c" = TRUE)
my_df <- data.frame(
"key" = names(my_list),
stringsAsFactors = FALSE
)
my_df[["value"]] <- unname(my_list)
Is it possible to do this conversion in one step?
We can use stack from base R
stack(my_list)
According to ?stack
The stack function is used to transform data available as separate columns in a data frame or list into a single column that can be used in an analysis of variance model or other linear model. The unstack function reverses this operation.
Or with enframe
library(tidyverse)
enframe(my_list) %>% # creates the 'value' as a `list` column
mutate(value = map(value, as.character)) %>% # change to single type
unnest(cols = value) # name the list column to unnest explicitly
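If the original types should be kept rather than coerced to character, a one-step variant is to stop after enframe and leave value as a list column. A minimal sketch, assuming the column names key and value from the question:

```r
library(tibble)

my_list <- list("a" = 1.0, "b" = "foo", "c" = TRUE)

# name/value arguments set the column names; value stays a list column,
# so each element keeps its own type (double, character, logical)
my_df <- enframe(my_list, name = "key", value = "value")

my_df
```

This mirrors the two-step base R version in the question, where the value column is also a list.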
You can use dplyr::as_tibble to coerce the list into a data frame / tibble. This will automatically create a data frame where the list's names are column names, and the list items correspond to rows.
library(dplyr)
library(tidyr)
my_list <- list("a" = 1.0, "b" = "foo", "c" = TRUE)
as_tibble(my_list)
#> # A tibble: 1 x 3
#> a b c
#> <dbl> <chr> <lgl>
#> 1 1 foo TRUE
To reshape into the two-column format you have, pipe it into tidyr::gather, where the default column names are key and value. Because of the different data types in the column value, this will coerce all the values to character.
as_tibble(my_list) %>%
gather()
#> # A tibble: 3 x 2
#> key value
#> <chr> <chr>
#> 1 a 1
#> 2 b foo
#> 3 c TRUE
Created on 2018-11-09 by the reprex package (v0.2.1)

Long to Wide with Non-Unique Key Combinations in R

I am trying to convert a dataset from long to wide format. Need to do this to feed into another program for analysis purposes. My input data is below:
sdata <- data.frame(c(1,1,1,1,1,1,1,1,1,1,1,1,1),c(1,1,1,1,1,1,1,1,1,2,2,2,2),c("X1","A","B","C","D","X2","A","B","C","X1","A","B","C"),c(81,31,40,5,5,100,8,90,2,50,20,24,6))
col_headings <- c("Orig","Dest","Desc","Estimate")
names(sdata) <- col_headings
Depending on the unique combination of Orig-Dest-X1, Orig-Dest-X2 category above, the subcategories vary from only A,B,C to A,B,C,D to A,B, etc. I am trying to get the desired output (code to recreate in R below) along with image of desired output.
sdata_spread <- data.frame(c(1,1),c(1,2),c(81,50),c(31,20),c(40,24),c(5,6),c(5,NA),c(100,NA),c(8,NA),c(90,NA),c(2,NA))
col_headings <- c("Orig","Dest","X1", "X1_A", "X1_B", "X1_C", "X1_D","X2", "X2_A", "X2_B", "X2_C")
names(sdata_spread) <- col_headings
I tried the following:
sdata_spread <- sdata %>% spread(Desc,Estimate)
The error I got was:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 6 rows
I also tried the accepted answer given here: Long to wide with no unique key and here: Long to wide format with several duplicates. Circumvent with unique combo of columns but it did not get me the desired output.
Any insights would be much appreciated.
Thanks,
Krishnan
One option is to create a grouping variable based on the occurrence of 'X' as the first character of 'Desc'. Within each group, we modify 'Desc' by pasting the first element of 'Desc' in front of every other element, using a condition in case_when. We then reshape to wide format with pivot_wider (from tidyr 1.0.0 onwards, spread/gather are superseded by pivot_wider/pivot_longer).
library(dplyr)
library(tidyr)
library(stringr)
sdata %>%
group_by(grp = cumsum(str_detect(Desc, '^X'))) %>%
mutate(Desc = case_when(row_number() > 1 ~ str_c(first(Desc), Desc, sep="_"),
TRUE ~ as.character(Desc))) %>%
ungroup() %>%
select(-grp) %>%
pivot_wider(names_from = Desc, values_from = Estimate)
# A tibble: 2 x 11
# Orig Dest X1 X1_A X1_B X1_C X1_D X2 X2_A X2_B X2_C
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 81 31 40 5 5 100 8 90 2
#2 1 2 50 20 24 6 NA NA NA NA NA
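The grouping step is the key part of this answer. The base R sketch below shows, on a short illustrative vector, how cumsum over the "starts with X" test assigns a group id and how each sub-category then gets its header as a prefix. The names desc, is_header, grp, header and labels are hypothetical.

```r
# Short illustrative vector with two header rows (X1, X2)
desc <- c("X1", "A", "B", "X2", "A")

# TRUE where a new block starts
is_header <- grepl("^X", desc)

# Running count of headers gives one group id per block
grp <- cumsum(is_header)

# match(grp, grp) returns the position of the first row of each group,
# which is the header row
header <- desc[match(grp, grp)]

# Headers stay as they are; other rows get "<header>_<desc>"
labels <- ifelse(is_header, desc, paste(header, desc, sep = "_"))

labels
```

After this relabelling every Desc value is unique within an Orig-Dest pair, which is what allows pivot_wider to run without the duplicate-key problem.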

Resources