pull all elements with specific name from a nested list - r

I have some archived Slack data that I am trying to get some of key message properties. I'd done this by stupidly flattening the entire list, getting a data.frame or tibble with lists nested in some cells. As this dataset gets bigger, I want to pick elements out of this list more smartly so that when this cache becomes big it doesn't take forever to create the data.frame or tibble with the elements I want.
Example where I am trying to pull everything named "type" below into a vector or flat list that I can pull in as a dataframe variable. I named the folder and message level for convenience. Anyone have model code that can help?
library(tidyverse)
l <- list(folder_1 = list(
`msg_1-1` = list(type = "message",
subtype = "channel_join",
ts = "1585771048.000200",
user = "UFUNNF8MA",
text = "<#UFUNNF8MA> has joined the channel"),
`msg_1-2` = list(type = "message",
subtype = "channel_purpose",
ts = "1585771049.000300",
user = "UNFUNQ8MA",
text = "<#UNFUNQ8MA> set the channel purpose: Talk about xyz")),
folder_2 = list(
`msg_2-1` = list(type = "message",
subtype = "channel_join",
ts = "1585771120.000200",
user = "UQKUNF8MA",
text = "<#UQKUNF8MA> has joined the channel"))
)
# gets a specific element
print(l[[1]][[1]][["type"]])
# tried to get all elements named "type", but am not at the right list level to do so
print(purrr::map(l, "type"))

As OP mentioned, this can solve the issue:
#Code
unlist(l)[grepl('.type',names(unlist(l)),fixed=T)]
Output:
folder_1.msg_1-1.type folder_1.msg_1-2.type folder_2.msg_2-1.type
"message" "message" "message"
Another options are (Many thanks and credit to #Abdessabour Mtk)
#Code1
purrr::map(l, ~ purrr::map(.x, "type"))

Depending on the desired output, I would probably use a simple recursive function here.
get_elements <- function(x, element) {
if(is.list(x))
{
if(element %in% names(x)) x[[element]]
else lapply(x, get_elements, element = element)
}
}
This allows:
get_elements(l, "type")
#> $folder_1
#> $folder_1$`msg_1-1`
#> [1] "message"
#>
#> $folder_1$`msg_1-2`
#> [1] "message"
#>
#>
#> $folder_2
#> $folder_2$`msg_2-1`
#> [1] "message"
Or if you want to get all "users":
get_elements(l, "user")
#> $folder_1
#> $folder_1$`msg_1-1`
#> [1] "UFUNNF8MA"
#>
#> $folder_1$`msg_1-2`
#> [1] "UNFUNQ8MA"
#>
#>
#> $folder_2
#> $folder_2$`msg_2-1`
#> [1] "UQKUNF8MA"
You could obviously unlist the result if you prefer it flattened into a vector.
unlist(get_elements(l, "type"))
#> folder_1.msg_1-1 folder_1.msg_1-2 folder_2.msg_2-1
#> "message" "message" "message"

Another option is to use rrapply() in the rrapply-package:
library(rrapply)
## return unlisted vector
rrapply(l, condition = function(x, .xname) .xname == "type", how = "unlist")
#> folder_1.msg_1-1.type folder_1.msg_1-2.type folder_2.msg_2-1.type
#> "message" "message" "message"
## return melted data.frame
rrapply(l, condition = function(x, .xname) .xname == "type", how = "melt")
#> L1 L2 L3 value
#> 1 folder_1 msg_1-1 type message
#> 2 folder_1 msg_1-2 type message
#> 3 folder_2 msg_2-1 type message

Related to those provided by #Duck & #Abdessabour Mtk yesterday, purrr has a function map_depth() that will let you get a named attribute if you know its name and how deep it is in the hierarchy. REALLY useful when crawling this big nested lists, and is a simpler solution to the nested map() calls above.
purrr::map_depth(l, 2, "type")

Related

Getting different results from the same function inside and outside a mutate function call

Can someone explain to me why I get a different result when I run the convertToDisplayTime function inside mutate than when I run it on its own? The correct result is the one I obtain when I run it on its own. Also, why do I get these warnings? It feels like I might be passing the whole timeInSeconds column as an argument when I call convertToDisplayTime in the mutate function, but I'm not sure that I really understand the mechanics in play here.
library('tidyverse')
#> Warning: package 'tibble' was built under R version 4.1.2
convertToDisplayTime <- function(timeInSeconds){
## Takes a time in seconds and converts it
## to a xx:xx:xx string format
if(timeInSeconds>86400){ #Not handling time over a day
stop(simpleError("Enter a time below 86400 seconds (1 day)"))
} else if(timeInSeconds>3600){
numberOfMinutes = 0
numberOfHours = timeInSeconds%/%3600
remainingSeconds = timeInSeconds%%3600
if(remainingSeconds>60){
numberOfMinutes = remainingSeconds%/%60
remainingSeconds = remainingSeconds%%60
}
if(numberOfMinutes<10){displayMinutes = paste0("0",numberOfMinutes)}
else{displayMinutes = numberOfMinutes}
remainingSeconds = round(remainingSeconds)
if(remainingSeconds<10){displaySeconds = paste0("0",remainingSeconds)}
else{displaySeconds = remainingSeconds}
return(paste0(numberOfHours,":",displayMinutes,":", displaySeconds))
} else if(timeInSeconds>60){
numberOfMinutes = timeInSeconds%/%60
remainingSeconds = timeInSeconds%%60
remainingSeconds = round(remainingSeconds)
if(remainingSeconds<10){displaySeconds = paste0("0",remainingSeconds)}
else{displaySeconds = remainingSeconds}
return(paste0(numberOfMinutes,":", displaySeconds))
} else{
return(paste0("0:",timeInSeconds))
}
}
(df <- tibble(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10)) %>% mutate(displayTime = convertToDisplayTime(timeInSeconds)))
#> Warning in if (timeInSeconds > 86400) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 3600) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 60) {: the condition has length > 1 and only the
#> first element will be used
#> Warning in if (remainingSeconds < 10) {: the condition has length > 1 and only
#> the first element will be used
#> # A tibble: 4 x 2
#> timeInSeconds displayTime
#> <dbl> <chr>
#> 1 2710. 45:10
#> 2 2705. 45:5
#> 3 2692. 44:52
#> 4 2708. 45:8
convertToDisplayTime(2710.46)
#> [1] "45:10"
convertToDisplayTime(2705.04)
#> [1] "45:05"
convertToDisplayTime(2691.66)
#> [1] "44:52"
convertToDisplayTime(2708.10)
#> [1] "45:08"
Created on 2022-01-06 by the reprex package (v2.0.1)
Like mentioned in the comments, the problem here is that your function is not vectorized: it works with a single value for an input and outputs a single value. However, this does not work when the input is a vector of values, hence the condition has length 1 warning you get:
1: Problem with `mutate()` column `displayTime`.\
ℹ `displayTime = convertToDisplayTime(timeInSeconds)`.
ℹ the condition has length > 1 and only the first element will be used
Here, when you use dplyr::mutate, you're technically trying to feed a vector to your function, which is not formatted to process it.
Several options you may consider:
1. The "fast and ugly" way:
df <- data.frame(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10))
## This one does not work
df %>% mutate(displayTime = convertToDisplayTime(timeInSeconds))
## This one works
df %>%
rowwise() %>%
mutate(displayTime = convertToDisplayTime(timeInSeconds)) %>%
ungroup()
dplyr::rowwise() allows dplyr::mutate() to work on each row independently, rather than by columns. I assume this is the behavior you initially expected. dplyr::ungroup() sorta reverts rowwise, eg. go back to the default column-wise behavior.
I may be a little harsh on this one, but this is the kind of trick that I used back when I did not quite understand my way around dataframes and their manipulation...
2. Vectorize directly from your dplyr verbs:
df %>%
mutate(displayTime = base::mapply(convertToDisplayTime, timeInSeconds))
## or
df %>%
mutate(displayTime = purrr::map_chr(timeInSeconds, convertToDisplayTime))
Both options are similar.
3. Vectorize your function:
convertToDisplayTime_vec <- base::Vectorize(convertToDisplayTime)
# class(convertToDisplayTime_vec)
df %>% mutate(displayTime = convertToDisplayTime_vec(timeInSeconds))
## or
convertToDisplayTime_vec2 <- function(timeInSeconds_vec) {
mapply(FUN = convertToDisplayTime, timeInSeconds_vec)
}
# class(convertToDisplayTime_vec2)
df %>%
mutate(displayTime = convertToDisplayTime_vec2(timeInSeconds))
# Still works on single variables!
# convertToDisplayTime_vec2(6475)
This is my favourite option, as once it is implemented you can use it either on single variables, vectors or dataframes, without worring about it.
A little documentation to dig a little into the subject.
PS: As an aside, a little tip worth remembering: you may want to be careful when manipulating data.frame and tibble objects. Despite their similarity, they have slight differences, and some functions deal differently with one or the other, or actually convert one to the other without your noticing...

Plot the longest transcript in GenomicRanges with ggbio

I am trying to plot an specific region using ggbio. I am using the below code that produced my desire output, except that it contains several transcript. Is it possible to only plot the longest transcript? I've not been able to access the genomic ranges object within Homo.sapiens that I assume contains this information.
library(ggbio)
library(Homo.sapiens)
range <- GRanges("chr10" , IRanges(start = 78000000 , end = 79000000))
p.txdb <- autoplot(Homo.sapiens, which = range)
p.txdb
Here is a solution that involves filtering TxDb.Hsapiens.UCSC.hg19.knownGene on the longest transcript by gene_id (which does remove genes without gene_id):
suppressPackageStartupMessages({
invisible(lapply(c("ggbio", "biovizBase", "data.table",
"TxDb.Hsapiens.UCSC.hg19.knownGene",
"org.Hs.eg.db"),
require, character.only = TRUE))})
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# retrieve transcript lengths
txlen <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE)
setDT(txlen)
txlen$len <- rowSums(as.matrix(txlen[, .(tx_len, utr5_len, utr3_len)]))
setkey(txlen, gene_id, len, tx_id)
# filter longesttranscript by gene_id
ltx <- txlen[!is.na(gene_id)][, tail(.SD,1), by=gene_id]$tx_id
# filter txdb object
txb <- as.list(txdb)
txb$transcripts <- txb$transcripts[txb$transcripts$tx_id %in% ltx, ]
txb$splicings <- txb$splicings[txb$splicings$tx_id %in% ltx,]
txb$genes <- txb$genes[txb$genes$tx_id %in% ltx,]
txb <- do.call(makeTxDb, txb)
# plot according to vignette, chapter 2.2.5
range <- GRanges("chr10", IRanges(start = 78000000 , end = 79000000))
gr.txdb <- crunch(txb, which = range)
#> Parsing transcripts...
#> Parsing exons...
#> Parsing cds...
#> Parsing utrs...
#> ------exons...
#> ------cdss...
#> ------introns...
#> ------utr...
#> aggregating...
#> Done
colnames(values(gr.txdb))[4] <- "model"
grl <- split(gr.txdb, gr.txdb$gene_id)
symbols <- select(org.Hs.eg.db, keys=names(grl), columns="SYMBOL", keytype="ENTREZID")
#> 'select()' returned 1:1 mapping between keys and columns
names(grl) <- symbols[match(symbols$ENTREZID, names(grl), nomatch=0),"SYMBOL"]
autoplot(grl, aes(type = "model"), gap.geom="chevron")
#> Constructing graphics...
Created on 2020-05-29 by the reprex package (v0.3.0)
Edit:
To get gene symbols instead of gene (or transcript) ids, just replace the names of grl with the associated gene symbols, e.g. via org.Hs.eg.db, or any other resource that matches them up.

Using code_to_plan and target(..., format = "fst") in drake

I really like using the code_to_plan function when constructing drake plans. I also really using target(..., format = "fst") for big files. However I am struggling to combine these two workflows. For example if I have this _drake.R file:
# Data --------------------------------------------------------------------
data_plan = code_to_plan("code/01-data/data.R")
join_plan = code_to_plan("code/01-data/merging.R")
# Cleaning ----------------------------------------------------------------
cleaning_plan = code_to_plan("code/02-cleaning/remove_na.R")
# Model -------------------------------------------------------------------
model_plan = code_to_plan("code/03-model/model.R")
# Combine Plans
dplan = bind_plans(
data_plan,
join_plan,
cleaning_plan,
model_plan
)
config <- drake_config(dplan)
This works fine when called with r_make(r_args = list(show = TRUE))
As I understand it though target can only be used within a drake_plan. If I try something like this:
dplan2 <- drake_plan(full_plan = target(dplan, format = "fst"))
config <- drake_config(dplan2)
I get an r_make error like this:
target full_plan
Error in fst::write_fst(x = value$value, path = tmp) :
Unknown type found in column.
In addition: Warning message:
You selected fst format for target full_plan, so drake will convert it from class c("drake_plan", "tbl_df", "tbl", "data.frame") to a plain data frame.
Error:
-->
in process 18712
See .Last.error.trace for a stack trace.
So ultimately my question is where does one specify special data formats for targets when you are using code_to_plan?
Edit
Using #landau helpful suggestion, I defined this function:
add_target_format <- function(plan) {
# Get a list of named commands.
commands <- plan$command
names(commands) <- plan$target
# Turn it into a good plan.
do.call(drake_plan, commands)
}
So that this would work:
dplan = bind_plans(
data_plan,
join_plan,
cleaning_plan,
model_plan
) %>%
add_target_format()
It is possible, but not convenient. Here is a workaround.
writeLines(
c(
"x <- small_data()",
"y <- target(large_data(), format = \"fst\")"
),
"script.R"
)
cat(readLines("script.R"), sep = "\n")
#> x <- small_data()
#> y <- target(large_data(), format = "fst")
library(drake)
# Produces a plan, but does not process target().
bad_plan <- code_to_plan("script.R")
bad_plan
#> # A tibble: 2 x 2
#> target command
#> <chr> <expr>
#> 1 x small_data()
#> 2 y target(large_data(), format = "fst")
# Get a list of named commands.
commands <- bad_plan$command
names(commands) <- bad_plan$target
# Turn it into a good plan.
good_plan <- do.call(drake_plan, commands)
good_plan
#> # A tibble: 2 x 3
#> target command format
#> <chr> <expr> <chr>
#> 1 x small_data() <NA>
#> 2 y large_data() fst
Created on 2019-12-18 by the reprex package (v0.3.0)

Dynamically Change as.POSIXlt Value

In R, I am trying to read a file that has a timestamp, and update the timestamp based on the condition of another field. The below code works with no problem:
t <- data.frame(user = as.character(c("bshelton#email1.com", "lwong#email1.com")),
last_update = rep(as.POSIXlt(Sys.time(), tz = "America/Los_Angeles"), 2))
Sys.sleep(5)
t$last_update <- as.POSIXlt(ifelse(t$user == "bshelton#email1.com", Sys.time(), t$last_update), origin = "1970-01-01")
print(t)
The problem is when I read an existing file and try to dynamically change an as.POSIXlt value. The following code is producing the error that accompanies it in the code block afterwards:
t <- data.frame(user = as.character(c("bshelton#email1.com", "lwong2#email1.com")),
last_update = rep(as.POSIXlt(Sys.time(), tz = "America/Los_Angeles"), 2))
write.csv(t, "so_question.csv", row.names = FALSE)
t <- read.csv("so_question.csv")
t$last_update <- as.POSIXlt(t$last_update)
Sys.sleep(5)
t$last_update <- as.POSIXlt(ifelse(t$user == "bshelton#email1.com", Sys.time(), t$last_update), origin = "1970-01-01")
Error in as.POSIXlt.default(ifelse(t$user == "bshelton#email1.com", Sys.time(), :
do not know how to convert 'ifelse(t$user == "bshelton#email1.com", Sys.time(), t$last_update)' to class “POSIXlt”
In addition: Warning message:
In ans[!test & ok] <- rep(no, length.out = length(ans))[!test & :
number of items to replace is not a multiple of replacement length
The first case is curiously working only because you don't have what you think—those datetimes are in fact POSIXct, not POSIXlt:
last_update <- rep(as.POSIXlt(Sys.time(), tz = "America/Los_Angeles"), 2)
str(last_update)
#> POSIXlt[1:2], format: "2019-07-28 20:52:10" "2019-07-28 20:52:10"
t <- data.frame(user = as.character(c("bshelton#email1.com", "lwong#email1.com")),
last_update = last_update)
str(t)
#> 'data.frame': 2 obs. of 2 variables:
#> $ user : Factor w/ 2 levels "bshelton#email1.com",..: 1 2
#> $ last_update: POSIXct, format: "2019-07-28 20:52:10" "2019-07-28 20:52:10"
If you dig into ?data.frame, it says
data.frame converts each of its arguments to a data frame by calling as.data.frame(optional = TRUE). As that is a generic function, methods can be written to change the behaviour of arguments according to their classes: R comes with many such methods. Character variables passed to data.frame are converted to factor columns unless protected by I or argument stringsAsFactors is false. If a list or data frame or matrix is passed to data.frame it is as if each component or column had been passed as a separate argument (except for matrices protected by I).
This is what's happening: as.data.frame.POSIXlt in fact converts to POSIXct:
now <- Sys.time()
str(now)
#> POSIXct[1:1], format: "2019-07-28 22:50:12"
str(data.frame(time = now))
#> 'data.frame': 1 obs. of 1 variable:
#> $ time: POSIXct, format: "2019-07-28 22:50:12"
as.data.frame.POSIXlt
#> function (x, row.names = NULL, optional = FALSE, ...)
#> {
#> value <- as.data.frame.POSIXct(as.POSIXct(x), row.names,
#> optional, ...)
#> if (!optional)
#> names(value) <- deparse(substitute(x))[[1L]]
#> value
#> }
#> <bytecode: 0x7fc938a11060>
#> <environment: namespace:base>
More immediately, since Sys.time() returns a POSIXct object, ifelse(t$user == "bshelton#email1.com", Sys.time(), t$last_update) in the second case is getting a POSIXct object for one observation and POSIXlt for the other. The POSIXlt object's class attribute is dropped by ifelse revealing the list underneath, which ifelse then doesn't know how to turn into a vector together with the unclassed POSIXct object (which is just a number).
The solution here, then, is to follow the hint data.frame is giving you and use POSIXct instead of POSIXlt.
If you really want to make it work with POSIXlt, you can iterate over the conditions and POSIXlt vector with Map with if/else (which maintain attributes including class, but only handle scalar conditions) and coerce the resulting list back to a vector with do.call(c, ...):
t <- data.frame(user = as.character(c("bshelton#email1.com", "lwong#email1.com")),
last_update = rep(as.POSIXlt(Sys.time(), tz = "America/Los_Angeles"), 2))
t$last_update <- as.POSIXlt(t$last_update)
t$last_update <- do.call(c, Map(
function(condition, last_update){
if (condition) {
as.POSIXlt(Sys.time() + 5)
} else {
last_update
}
},
condition = t$user == "bshelton#email1.com",
last_update = t$last_update
))
t
#> user last_update
#> 1 bshelton#email1.com 2019-07-28 23:11:04
#> 2 lwong#email1.com 2019-07-28 23:10:59
...but frankly that's a little silly. Just use POSIXct instead, and your life will be better.

How to store a "complex" data structure in R (not "complex numbers")

I need to train, store, and use a list/array/whatever of several ksvm SVM models, which once I get a set of sensor readings, I can call predict() on each of the models in turn. I want to store these models and metadata about tham in some sort of data structure, but I'm not very familiar with R, and getting a handle on its data structures has been a challenge. My familiarity is with C++, C, and C#.
I envision some sort of array or list that contains both the ksvm models as well as the metadata about them. (The metadata is necessary, among other things, for knowing how to select & organize the input data presented to each model when I call predict() on it.)
The data I want to store in this data structure includes the following for each entry of the data structure:
The ksvm model itself
A character string saying who trained the model & when they trained it
An array of numbers indicating which sensors' data should be presented to this model
A single number between 1 and 100 that represents how much I, the trainer, trust this model
Some "other stuff"
So in tinkering with how to do this, I tried the following....
First I tried what I thought would be really simple & crude, hoping to build on it later if this worked: A (list of (list of different data types))...
>
> uname = Sys.getenv("USERNAME", unset="UNKNOWN_USER")
> cname = Sys.getenv("COMPUTERNAME", unset="UNKNOWN_COMPUTER")
> trainedAt = paste("Trained at", Sys.time(), "by", uname, "on", cname)
> trainedAt
[1] "Trained at 2015-04-22 20:54:54 by mminich on MMINICH1"
> sensorsToUse = c(12,14,15,16,24,26)
> sensorsToUse
[1] 12 14 15 16 24 26
> trustFactor = 88
>
> TestModels = list()
> TestModels[1] = list(trainedAt, sensorsToUse, trustFactor)
Warning message:
In TestModels[1] = list(trainedAt, sensorsToUse, trustFactor) :
number of items to replace is not a multiple of replacement length
>
> TestModels
[[1]]
[1] "Trained at 2015-04-22 20:54:54 by mminich on MMINICH1"
>
...wha? What did it think I was trying to replace? I was just trying to populate element 1 of TestModels. Later I would add an element [2], [3], etc... but this didn't work and I don't know why. Maybe I need to define TestModels as a list of lists right up front...
> TestModels = list(list())
> TestModels[1] = list(trainedAt, sensorsToUse, trustFactor)
Warning message:
In TestModels[1] = list(trainedAt, sensorsToUse, trustFactor) :
number of items to replace is not a multiple of replacement length
>
Hmm. That no workie either. Let's try something else...
> TestModels = list(list())
> TestModels[1][1] = list(trainedAt, sensorsToUse, trustFactor)
Warning message:
In TestModels[1][1] = list(trainedAt, sensorsToUse, trustFactor) :
number of items to replace is not a multiple of replacement length
>
Drat. Still no workie.
Please clue me in on how I can do this. And I'd really like to be able to access the fields of my data structure by name, perhaps something along the lines of...
> print(TestModels[1]["TrainedAt"])
Thank you very much!
You were very close. To avoid the warning, you shouldn't use
TestModels[1] = list(trainedAt, sensorsToUse, trustFactor)
but instead
TestModels[[1]] = list(trainedAt, sensorsToUse, trustFactor)
To access a list element you use [[ ]]. Using [ ] on a list will return a list containing the elements inside the single brackets. The warning is shown because you were replacing a list containing one element (because this is how you created it) with a list containing 3 elements. This wouldn't be a problem for other elements:
TestModels[2] = list(trainedAt, sensorsToUse, trustFactor) # This element did not exist, so no replacement warning
To understand list subsetting better, take a look at this:
item1 <- list("a", 1:10, c(T, F, T))
item2 <- list("b", 11:20, c(F, F, F))
mylist <- list(item1=item1, item2=item2)
mylist[1] #This returns a list containing the item 1.
#$item1 #Note the item name of the container list
#$item1[[1]]
#[1] "a"
#
#$item1[[2]]
# [1] 1 2 3 4 5 6 7 8 9 10
#
#$item1[[3]]
#[1] TRUE FALSE TRUE
#
mylist[[1]] #This returns item1
#[[1]] #Note this is the same as item1
#[1] "a"
#
#[[2]]
# [1] 1 2 3 4 5 6 7 8 9 10
#
#[[3]]
#[1] TRUE FALSE TRUE
To access the list items by name, just name them when creating the list:
mylist <- list(var1 = "a", var2 = 1:10, var3 = c(T, F, T))
mylist$var1 #Or mylist[["var1"]]
# [1] "a"
You can nest this operators like you suggested. So you coud use
containerlist <- list(mylist)
containerlist[[1]]$var1
#[1] "a"

Resources