Extracting JSON data with asymmetric content from a dataframe column in R

I loaded a table from a database which contains a column that has JSON data in each row.
The table looks something like the example below. (I was not able to replicate the data.frame I have, due to the format of the column data)
dataframe_example <- data.frame(id = c(1,2,3),
                                name = c("name1","name2","name3"),
                                JSON_col = c({"_inv": [10,20,30,40], "_person": {"_personid": "green"}},
                                             {"_inv": [15,22], "_person": {"_personid": "blue"}},
                                             {"_inv": [], "_person": {"_personid": "red"}}))
I have the following two issues:
Some of the items (e.g. "_inv") sometimes have the full 4 numeric entries, sometimes fewer, and sometimes none at all. Some of the other items (e.g. "_person") usually contain another header, but only one character data point.
My goal is to preserve the existing dataframe's columns (such as id and name) and spread the data in the JSON column so that I have new columns containing each point of information. The target dataframe would look a little like this:
data.frame(id = c(1,2,3),
           name = c("name1","name2","name3"),
           `_inv_1` = c(10,15,NA),
           `_inv_2` = c(20,22,NA),
           `_inv_3` = c(30,NA,NA),
           `_inv_4` = c(40,NA,NA),
           `_person_id` = c("green","blue","red"))
Please bear in mind that I have very little experience handling JSON data and no experience dealing with uneven JSON data.
Using purrr I got:
frame <- purrr::map(dataframe_example$JSON_col, jsonlite::fromJSON)
This gave me a large list with n elements, where n is the number of rows of the original dataframe. Each element [[1]], [[2]], ... is itself a list, with its own types of objects, ranging from double to data.frame. The double objects contain up to four numeric observations (such as _inv); some of the objects are lists themselves (such as _person), which in turn contain "_personid" and then a single entry. The data.frame objects contain the datetime stamps for each observation in the JSON data (each _inv item has a timestamp).
Is there a way to obtain the target above, either by extracting the data from my "frame" object or with an altogether different approach?

library(tidyverse)
library(jsonlite)
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
dataframe_example <-
  data.frame(
    id = c(1, 2, 3),
    name = c("name1", "name2", "name3"),
    JSON_col = c(
      "{\"_inv\": [10,20,30,40], \"_person\": {\"_personid\": \"green\"}}",
      "{\"_inv\": [15,22], \"_person\": {\"_personid\": \"blue\"}}",
      "{\"_inv\": [], \"_person\": {\"_personid\": \"red\"}}"
    )
  )
dataframe_example %>%
  as_tibble() %>%
  mutate(
    JSON_col = JSON_col %>% map(parse_json)
  ) %>%
  unnest_wider(JSON_col) %>%
  unnest(`_inv`) %>%
  unnest(`_inv`) %>%
  unnest(`_person`) %>%
  unnest(`_person`) %>%
  group_by(id, name) %>%
  mutate(inv_id = row_number()) %>%
  pivot_wider(names_from = inv_id, values_from = `_inv`, names_prefix = "_inv_")
#> # A tibble: 2 x 7
#> # Groups:   id, name [2]
#>      id name  `_person` `_inv_1` `_inv_2` `_inv_3` `_inv_4`
#>   <dbl> <chr> <chr>        <int>    <int>    <int>    <int>
#> 1     1 name1 green           10       20       30       40
#> 2     2 name2 blue            15       22       NA       NA
Created on 2021-11-25 by the reprex package (v2.0.1)
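A hedged variant of the pipeline above (my own sketch, not part of the original answer): using jsonlite::fromJSON() instead of parse_json() simplifies "_inv" to a numeric vector, and unnest_wider() with names_sep spreads it straight into _inv_1 .. _inv_4 without dropping the row whose "_inv" is empty, so id 3 ("red") is kept. The person column comes out as `_personid`, named after the JSON key.
library(tidyverse)
library(jsonlite)

dataframe_example %>%
  as_tibble() %>%
  # fromJSON() simplifies "_inv" to a numeric vector and "_person" to a named list
  mutate(JSON_col = map(JSON_col, fromJSON)) %>%
  unnest_wider(JSON_col) %>%
  # spread the vector into _inv_1 .. _inv_4; empty vectors simply yield NAs
  unnest_wider(`_inv`, names_sep = "_") %>%
  # spread the named list into a _personid column
  unnest_wider(`_person`)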

Related

Creating new columns in R using parts of an existing column

I am trying to create new columns using the information in an existing column:
e.g. the column 'name' contains the following value: 0112200015-1_R2_001.fastq.gz. From this I would like to generate a column 'sample_id' containing 0112200015 (the first 10 digits), a column 'timepoint' containing 1 (from -1), and a column 'paired_end' containing 2 (from R2).
What would the correct code for this be?
tidyr::extract
You can use extract from tidyr package.
library(tidyr)
df %>%
  extract(name, c("sample_id", "timepoint", "paired_end"),
          regex = "^(\\d{10})-(\\d)_R(\\d)")
#>    sample_id timepoint paired_end
#> 1 0112200015         1          2
where df is:
df <- data.frame(name = "0112200015-1_R2_001.fastq.gz")
To make the solution more tailored to your needs, you should provide more examples, so that rare cases and exceptions can be handled.
A few regexes can work for you. This one, for example, extracts the first 3 numbers it finds between non-numeric separators:
df %>%
  extract(name, c("sample_id", "timepoint", "paired_end"),
          regex = "^(\\d+)\\D+(\\d+)\\D+(\\d+)")
#>    sample_id timepoint paired_end
#> 1 0112200015         1          2
I assume you want to create a new data frame with this information.
I created a vector with values similar to your column names, but you should be using the colnames output.
library(stringr)

vector <- c("1234-1_R2_001.fastq.gz", "5678-1_R2_001.fastq.gz", "1928-1_R2_001.fastq.gz")
df <- data.frame(sample_id = str_replace(vector, "-.*$", ""),
                 timepoint = str_extract(vector, "(?<=-)."),
                 paired_end = str_extract(vector, "(?<=R)."))
All the str_* functions are from the stringr package.
This should give you the correct answer using dplyr and stringr in a tidy way. It is based on the assumption that timepoint and paired_end always consist of one digit. If this is not the case, the small adjustment of replacing "\\d{1}" with "\\d+" returns one or more digits, depending on the actual value (a sketch of this follows the output below).
library(dplyr)
library(stringr)
df <- tibble(name = "0112200015-1_R2_001.fastq.gz")

df %>%
  # Extract the 10 digit sample id
  mutate(sample_id = str_extract(name, pattern = "\\d{10}"),
         # Extract the 1 digit timepoint which comes after "-" and before the first "_"
         timepoint = str_extract(name, pattern = "(?<=-)\\d{1}(?=_)"),
         # Extract the 1 digit paired_end which comes after "_R"
         paired_end = str_extract(name, pattern = "(?<=_R)\\d{1}"))
# A tibble: 1 x 4
  name                         sample_id  timepoint paired_end
  <chr>                        <chr>      <chr>     <chr>
1 0112200015-1_R2_001.fastq.gz 0112200015 1         2
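A minimal sketch of the adjustment mentioned above, assuming timepoint or paired_end can span more than one digit (the file name here is made up for illustration; dplyr and stringr are loaded as before):
tibble(name = "0112200015-12_R2_001.fastq.gz") %>%
  mutate(sample_id = str_extract(name, pattern = "\\d{10}"),
         # \\d+ instead of \\d{1} captures one or more digits
         timepoint = str_extract(name, pattern = "(?<=-)\\d+(?=_)"),
         paired_end = str_extract(name, pattern = "(?<=_R)\\d+"))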

String with values mapped from other data frame in R

I would like to make a string based on ids from other columns, where the real value sits in a dictionary.
Ideally, this would look like:
library(tidyverse)
region_dict <- tibble(
  id = c("reg_id1", "reg_id2", "reg_id3"),
  name = c("reg_1", "reg_2", "reg_3")
)
color_dict <- tibble(
  id = c("col_id1", "col_id2", "col_id3"),
  name = c("col_1", "col_2", "col_3")
)
tibble(
  region = c("reg_id1", "reg_id2", "reg_id3"),
  color = c("col_id1", "col_id2", "col_id3"),
  my_string = str_c(
    "xxx_",
    region_name,
    "_",
    color_name
  )
)
#> # A tibble: 3 x 3
#> region color my_string
#> <chr> <chr> <chr>
#> 1 reg_id1 col_id1 xxx_reg_1_col_1
#> 2 reg_id2 col_id2 xxx_reg_2_col_2
#> 3 reg_id3 col_id3 xxx_reg_3_col_3
Created on 2021-03-01 by the reprex package (v0.3.0)
I know of dplyr's recode() function but I can't think of a way to use it the way I want.
I also thought about first using left_join() and then concatenating the string from the new columns. This is what would work but doesn't seem pretty to me as I would get columns that I'd need to remove later. In the real dataset I have 5 variables.
I'll be glad to read your ideas.
This may also be solved with a fuzzyjoin, but based on the similarity in the substrings, it makes more sense to remove the prefix from the 'id' columns of each dataset and do a left_join, then create 'my_string' by pasting the columns together.
library(stringr)
library(dplyr)
region_dict %>%
  mutate(id1 = str_remove(id, '.*_')) %>%
  left_join(color_dict %>%
              mutate(id1 = str_remove(id, '.*_')), by = 'id1') %>%
  transmute(region = id.x, color = id.y,
            my_string = str_c('xxx_', name.x, '_', name.y))
Output:
# A tibble: 3 x 3
# region color my_string
# <chr> <chr> <chr>
#1 reg_id1 col_id1 xxx_reg_1_col_1
#2 reg_id2 col_id2 xxx_reg_2_col_2
#3 reg_id3 col_id3 xxx_reg_3_col_3
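For comparison, a hedged sketch (not part of the answer above) of the left_join() route the asker mentions, using the region/color tibble from the question and the two dictionaries defined there; the joined name columns are dropped at the end via select():
library(dplyr)
library(stringr)

tibble(
  region = c("reg_id1", "reg_id2", "reg_id3"),
  color = c("col_id1", "col_id2", "col_id3")
) %>%
  left_join(region_dict, by = c("region" = "id")) %>%
  left_join(color_dict, by = c("color" = "id"), suffix = c("_region", "_color")) %>%
  mutate(my_string = str_c("xxx_", name_region, "_", name_color)) %>%
  select(region, color, my_string)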

Unpack json columns into a dataframe

I have JSON strings inside a dataframe column. I want to bring all the fields from these JSON strings into the dataframe as new columns.
# Input
JsonID <- as.factor(c(1,2,3))
JsonString1 = "{\"device\":{\"site\":\"Location1\"},\"tags\":{\"Engine Pressure\":\"150\",\"timestamp\":\"2608411982\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"2608411982\"}"
JsonString2 = "{\"device\":{\"site\":\"Location2\"},\"tags\":{\"Engine Pressure\":\"160\",\"timestamp\":\"3608411983\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"3608411983\"}"
JsonString3 = "{\"device\":{\"site\":\"Location3\"},\"tags\":{\"Brake Fluid\":\"100\",\"timestamp\":\"4608411984\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"4608411984\"}"
JsonStrings = c(JsonString1, JsonString2, JsonString3)
Example <- data.frame(JsonID, JsonStrings)
Using the jsonlite library I can make each json string into a 1 row dataframe.
library(jsonlite)
# One row dataframes
DF1 <- data.frame(fromJSON(JsonString1))
DF2 <- data.frame(fromJSON(JsonString2))
DF3 <- data.frame(fromJSON(JsonString3))
Unfortunately the JsonID column is lost. All JSON strings share common column names such as "time", but there are column names they don't share. By pivoting the data longer I could rbind all the dataframes together.
library(dplyr)
library(tidyr)
# Row bindable one row dataframes
DF1_RowBindable <- DF1 %>%
  rename_all(~gsub("tags.", "", .x)) %>%
  tidyr::pivot_longer(cols = c(colnames(.)[2]))
Is there a better way to do this?
I have never worked with json strings before. The solution must be computationally scalable.
We can store the data from fromJSON in a list column in the dataframe itself, so we don't lose any information that we already have in the data. We can then use unnest_wider to create new columns from each named list.
library(dplyr)
library(tidyr)
library(jsonlite)
Example %>%
  rowwise() %>%
  mutate(data = list(fromJSON(JsonStrings))) %>%
  unnest_wider(data) %>%
  select(-JsonStrings) %>%
  unnest_wider(tags) %>%
  unnest_wider(device)
#   JsonID site      `Engine Pressure` timestamp  historic adhoc `Brake Fluid` online time
#   <fct>  <chr>     <chr>             <chr>      <lgl>    <lgl> <chr>         <lgl>  <chr>
# 1 1      Location1 150               2608411982 FALSE    FALSE NA            TRUE   2608411982
# 2 2      Location2 160               3608411983 FALSE    FALSE NA            TRUE   3608411983
# 3 3      Location3 NA                4608411984 FALSE    FALSE 100           TRUE   4608411984
Since the nested columns (data, tags, device) have different structures and lengths, we need to use unnest_wider separately on each one of them.
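Regarding scalability, a hedged alternative (my own assumption, not benchmarked): building the list column with purrr::map() avoids evaluating fromJSON() row by row through rowwise(), and as.character() guards against JsonStrings having been read in as a factor.
library(purrr)

Example %>%
  # parse every JSON string into a named list, stored in a list column
  mutate(data = map(as.character(JsonStrings), fromJSON)) %>%
  select(-JsonStrings) %>%
  unnest_wider(data) %>%
  unnest_wider(tags) %>%
  unnest_wider(device)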

Is there an R function that can convert an existing metric into a new logical metric?

I have a dataset derived from Pokemon statistics containing a lot of numerical and categorical data. My end goal is to create a model or recommendation system where a user can input a list of Pokemon and the model finds similar Pokemon they may like. Currently the dataset looks something like this:
ID   Name     Type1  Type2   HP
001  Bulba..  Grass  Poison  45
etc...
I understand the Type1/Type2 metric might be problematic. Is there a function that would let me create or modify columns so that, if a Pokemon has a particular type, the new column for that type holds a logical value (0 for false, 1 for true)?
I apologize for a lackluster explanation, but what I want is for my dataset to look like this:
ID   Name     Grass  Poison  Water  HP
001  Bulba..  1      1       0      45
etc...
tidyr is a package for data reshaping. Here, we'll use pivot_longer() to put the data into a long format, where the type names (Type1, Type2) will reside in column "name", while the values (Grass, Poison, etc.) will reside in column "value". We filter out rows with is.na(value), because an NA there means the pokemon did not have a second type. We create an indicator variable that gets a 1, so each pokemon will have indicator == 1 for the types it has. We drop the now extraneous "name" column and use pivot_wider() to transform each unique value in "value" into its own column, which receives indicator's value as the cell value for each row. Finally, we mutate on all numeric columns to replace missings with 0, since we know those pokemon aren't those types.
A better solution than mutate_if(is.numeric, ...) would be to compute the unique type values and use mutate_at(vars(...)) on just those columns, so other numeric columns are not affected unintentionally (a sketch of this variant follows the output below).
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
pokemon <- tibble(ID = c(1,2), Name = c("Bulbasaur", "Squirtle"),
                  Type1 = c("Grass", "Water"),
                  Type2 = c("Poison", NA),
                  HP = c(40, 50))
pokemon %>%
  pivot_longer(starts_with("Type")) %>%
  filter(!is.na(value)) %>%
  mutate(indicator = 1) %>%
  select(-name) %>%
  pivot_wider(names_from = value, values_from = indicator) %>%
  mutate_if(is.numeric, .funs = function(x) if_else(is.na(x), 0, x))
#> # A tibble: 2 x 6
#> ID Name HP Grass Poison Water
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Bulbasaur 40 1 1 0
#> 2 2 Squirtle 50 0 0 1
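A hedged sketch of the mutate_at() variant mentioned before the code (the type_cols helper name is my own, hypothetical choice): it restricts the NA-to-0 replacement to the type columns, so other numeric columns such as ID and HP are never touched.
# Unique, non-missing type values present in the data (hypothetical helper)
type_cols <- pokemon %>%
  select(starts_with("Type")) %>%
  unlist() %>%
  unique() %>%
  na.omit() %>%
  as.character()

pokemon %>%
  pivot_longer(starts_with("Type")) %>%
  filter(!is.na(value)) %>%
  mutate(indicator = 1) %>%
  select(-name) %>%
  pivot_wider(names_from = value, values_from = indicator) %>%
  # only the type columns get their NAs replaced by 0
  mutate_at(vars(all_of(type_cols)), ~ if_else(is.na(.), 0, .))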

convert named list with mixed content to data frame

Is there a better and nicer way to convert a named list with mixed content to a data frame?
The working example:
my_list <- list("a" = 1.0, "b" = "foo", "c" = TRUE)
my_df <- data.frame(
"key" = names(my_list),
stringsAsFactors = F
)
my_df[["value"]] <- unname(my_list)
Is it possible to do this conversion in one step?
We can use stack from base R
stack(my_list)
According to ?stack
The stack function is used to transform data available as separate columns in a data frame or list into a single column that can be used in an analysis of variance model or other linear model. The unstack function reverses this operation.
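As a hedged aside (not part of the original answer): stack() names its columns "values" and "ind", and unlist() coerces the mixed types to character, so a small rename gives the key/value layout from the question.
out <- stack(my_list)
names(out) <- c("value", "key")   # stack() originally returns "values" and "ind"
out[, c("key", "value")]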
Or with enframe
library(tidyverse)
enframe(my_list) %>%                             # creates 'value' as a list column
  mutate(value = map(value, as.character)) %>%   # change to a single type
  unnest(value)
You can use dplyr::as_tibble to coerce the list into a data frame / tibble. This will automatically create a data frame where the list's names are column names, and the list items correspond to rows.
library(dplyr)
library(tidyr)
my_list <- list("a" = 1.0, "b" = "foo", "c" = TRUE)
as_tibble(my_list)
#> # A tibble: 1 x 3
#> a b c
#> <dbl> <chr> <lgl>
#> 1 1 foo TRUE
To reshape into the two-column format you have, pipe it into tidyr::gather, where the default column names are key and value. Because of the different data types in the column value, this will coerce all the values to character.
as_tibble(my_list) %>%
gather()
#> # A tibble: 3 x 2
#> key value
#> <chr> <chr>
#> 1 a 1
#> 2 b foo
#> 3 c TRUE
Created on 2018-11-09 by the reprex package (v0.2.1)
