Reading in a .geojson file with geojsonio, geojsonR - r

I am trying to read a GeoJSON file (https://www.svz-bw.de/fileadmin/verkehrszentrale/RadNETZ-BW_Daten_GeoJSON_2018-20.zip) into R.
I have tried different packages, but my knowledge is too limited to find and fix the errors. I'm new to spatial data in R, and especially to reading the GeoJSON file format.
Googling and searching Stack Overflow hasn't helped.
geojsonR::FROM_geojson("../Sonstiges/RadNETZ.geojson")
Error in unlink(x) : file name conversion problem -- name too long?
geojsonR::FROM_GeoJson("../Sonstiges/RadNETZ.geojson")
Error in export_From_geojson(url_file_string, Flatten_Coords,
Average_Coordinates, : invalid GeoJson geometry object -->
geom_OBJ() function

Your file does not comply with the current GeoJSON standard: it uses a projected coordinate reference system, which goes against RFC 7946 - https://www.rfc-editor.org/rfc/rfc7946#page-12
This may or may not be the reason why GeoJSON-specific packages have a hard time interpreting it.
To process your file I suggest using {sf}, which is able to digest the file via GDAL and PROJ.
library(dplyr)
library(sf)
asdf <- st_read("RadNETZ.geojson") %>%
st_transform(4326) # safety of unprojected CRS
plot(st_geometry(asdf))
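To see what {sf} makes of the file's coordinate reference system, it may help to inspect the CRS before and after the transform. The following is only a minimal sketch under stated assumptions: instead of the real download it writes a tiny stand-in GeoJSON to a temporary file (one LineString built from two coordinate pairs that appear in the file, declared as EPSG:31467, with empty properties), and all object names are illustrative.

```r
library(sf)

# Tiny stand-in for RadNETZ.geojson (assumption: same legacy "crs" member,
# a single feature, no attribute columns).
geojson_txt <- '{
  "type": "FeatureCollection",
  "crs": {"type": "name",
          "properties": {"name": "urn:ogc:def:crs:EPSG::31467"}},
  "features": [{
    "type": "Feature",
    "properties": {},
    "geometry": {"type": "LineString",
                 "coordinates": [[3563993, 5353055], [3564002, 5353070]]}
  }]
}'
path <- tempfile(fileext = ".geojson")
writeLines(geojson_txt, path)

projected <- st_read(path, quiet = TRUE)
st_crs(projected)                 # CRS picked up from the "crs" member
st_is_longlat(projected)          # is the data in longitude / latitude?

unprojected <- st_transform(projected, 4326)
st_is_longlat(unprojected)        # should now be longitude / latitude (WGS 84)
st_coordinates(unprojected)       # vertices after the transform
```

With the real data, path would simply be "RadNETZ.geojson".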

As @Jindra Lacko mentioned, your 'RadNETZ.geojson' file does not comply with RFC 7946, which is why you receive the error. If you don't have GDAL installed on your operating system alongside the 'sf' package, you can use either the geojsonR::shiny_from_JSON function (which does not follow the RFC and is meant to be used in shiny applications),
dat = geojsonR::shiny_from_JSON("../Sonstiges/RadNETZ.geojson")
str(dat)
List of 4
$ crs :List of 2
..$ properties:List of 1
.. ..$ name: chr "urn:ogc:def:crs:EPSG::31467"
..$ type : chr "name"
$ features:List of 70097
..$ :List of 3
.. ..$ geometry :List of 2
.. .. ..$ coordinates:List of 6
.. .. .. ..$ :List of 2
.. .. .. .. ..$ : num 3563993
.. .. .. .. ..$ : num 5353055
.. .. .. ..$ :List of 2
.. .. .. .. ..$ : num 3564002
.. .. .. .. ..$ : num 5353070
.. .. .. ..$ :List of 2
.. .. .. .. ..$ : num 3564009
.. .. .. .. ..$ : num 5353087
.. .. .. ..$ :List of 2
.. .. .. .. ..$ : num 3564013
.. .. .. .. ..$ : num 5353103
.. .. .. ..$ :List of 2
.. .. .. .. ..$ : num 3564016
.. .. .. .. ..$ : num 5353109
.. .. .. ..$ :List of 2
.. .. .. .. ..$ : num 3564030
.. .. .. .. ..$ : num 5353121
.. .. ..$ type : chr "LineString"
.. ..$ properties:List of 24
.....
or the jsonlite::fromJSON function,
dat = jsonlite::fromJSON("../Sonstiges/RadNETZ.geojson")
str(dat)
List of 4
$ type : chr "FeatureCollection"
$ name : chr "sql_statement"
$ crs :List of 2
..$ type : chr "name"
..$ properties:List of 1
.. ..$ name: chr "urn:ogc:def:crs:EPSG::31467"
$ features:'data.frame': 70097 obs. of 3 variables:
..$ type : chr [1:70097] "Feature" "Feature" "Feature" "Feature" ...
..$ properties:'data.frame': 70097 obs. of 24 variables:
.. ..$ gid : int [1:70097] 4 15 23 22 45 72 60 74 13072 75 ...
.. ..$ lrvn_kat: int [1:70097] 3 1 1 3 1 1 3 1 3 1 ...
.....
For the record, I'm the author / maintainer of the geojsonR package.
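Once the file is parsed as plain JSON, the CRS name and the coordinates can be pulled out of the nested list by hand. This is a rough sketch rather than a complete reader: it uses a small inline stand-in for the file (two coordinate pairs taken from the output above, empty properties), passes simplifyVector = FALSE so that the structure stays a plain nested list, and the variable names are made up for illustration.

```r
library(jsonlite)

# Inline stand-in for the real file; with the real data, pass the file path
# to fromJSON() instead of this string.
geojson_txt <- '{
  "type": "FeatureCollection",
  "crs": {"type": "name",
          "properties": {"name": "urn:ogc:def:crs:EPSG::31467"}},
  "features": [{
    "type": "Feature",
    "properties": {},
    "geometry": {"type": "LineString",
                 "coordinates": [[3563993, 5353055], [3564002, 5353070]]}
  }]
}'

dat <- fromJSON(geojson_txt, simplifyVector = FALSE)

# EPSG code is the part after the last colon of the CRS name.
crs_name <- dat$crs$properties$name
epsg <- as.integer(sub(".*:", "", crs_name))

# Coordinates of the first feature as a two-column matrix (one row per vertex).
coords <- dat$features[[1]]$geometry$coordinates
xy <- do.call(rbind, lapply(coords, unlist))
colnames(xy) <- c("x", "y")
xy
```

The x/y values are still in the projected CRS of the file; nothing here reprojects them.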

Related

Plotly in R: How to reference and extract figure values?

I want to know how I can access, extract, and reference values from a plotly figure in R.
Consider, for example, the Sankey diagram from plotly's own site of which there is an abbreviated version here:
library(plotly)
fig <- plot_ly(
type = "sankey",
node = list(
label = c("A1", "A2", "B1", "B2", "C1", "C2"),
color = c("blue", "blue", "blue", "blue", "blue", "blue"),
line = list()
),
link = list(
source = c(0,1,0,2,3,3),
target = c(2,3,3,4,4,5),
value = c(8,4,2,8,4,2)
)
)
fig
If I do View(fig) in RStudio, a new tab opens titled . (I don't know why this instead of 'fig'). In this tab I can go to x > visdat > 'string of letters and numbers that is a function?' > attrs > node > x.
Here all the x coordinates for the Sankey nodes appear.
I want to access these values so I can use them somewhere else. How do I do this? If I click on the right side of the RStudio tab to copy the code to the console, I get:
environment(.[["x"]][["visdat"]][["484c3ec36899"]])[["attrs"]][["node"]][["x"]]
which obviously doesn't work, as there is no object named "." in my session.
In this case I have tried fig$x$visdat$`484c3ec36899`(), but I can't do fig$x$visdat$`484c3ec36899`()$attr, and I don't know what else to try.
So, how can I access any value from a plotly object? Any documentation referencing this topic would also be helpful.
Thanks.
You can find the documentation of the data structure of plotly in R here: https://plotly.com/r/figure-structure/
To check the data structure you can use str(fig):
List of 8
$ x :List of 6
..$ visdat :List of 1
.. ..$ a3b8795a4:function ()
..$ cur_data: chr "a3b8795a4"
..$ attrs :List of 1
.. ..$ a3b8795a4:List of 6
.. .. ..$ node :List of 3
.. .. .. ..$ label: chr [1:6] "A1" "A2" "B1" "B2" ...
.. .. .. ..$ color: chr [1:6] "blue" "blue" "blue" "blue" ...
.. .. .. ..$ line : list()
.. .. ..$ link :List of 3
.. .. .. ..$ source: num [1:6] 0 1 0 2 3 3
.. .. .. ..$ target: num [1:6] 2 3 3 4 4 5
.. .. .. ..$ value : num [1:6] 8 4 2 8 4 2
.. .. ..$ alpha_stroke: num 1
.. .. ..$ sizes : num [1:2] 10 100
.. .. ..$ spans : num [1:2] 1 20
.. .. ..$ type : chr "sankey"
..$ layout :List of 3
.. ..$ width : NULL
.. ..$ height: NULL
.. ..$ margin:List of 4
.. .. ..$ b: num 40
.. .. ..$ l: num 60
.. .. ..$ t: num 25
.. .. ..$ r: num 10
..$ source : chr "A"
..$ config :List of 1
.. ..$ showSendToCloud: logi FALSE
..- attr(*, "TOJSON_FUNC")=function (x, ...)
$ width : NULL
$ height : NULL
$ sizingPolicy :List of 6
..$ defaultWidth : chr "100%"
..$ defaultHeight: num 400
..$ padding : NULL
..$ viewer :List of 6
.. ..$ defaultWidth : NULL
.. ..$ defaultHeight: NULL
.. ..$ padding : NULL
.. ..$ fill : logi TRUE
.. ..$ suppress : logi FALSE
.. ..$ paneHeight : NULL
..$ browser :List of 5
.. ..$ defaultWidth : NULL
.. ..$ defaultHeight: NULL
.. ..$ padding : NULL
.. ..$ fill : logi TRUE
.. ..$ external : logi FALSE
..$ knitr :List of 3
.. ..$ defaultWidth : NULL
.. ..$ defaultHeight: NULL
.. ..$ figure : logi TRUE
$ dependencies :List of 5
..$ :List of 10
.. ..$ name : chr "typedarray"
.. ..$ version : chr "0.1"
.. ..$ src :List of 1
.. .. ..$ file: chr "htmlwidgets/lib/typedarray"
.. ..$ meta : NULL
.. ..$ script : chr "typedarray.min.js"
.. ..$ stylesheet: NULL
.. ..$ head : NULL
.. ..$ attachment: NULL
.. ..$ package : chr "plotly"
.. ..$ all_files : logi FALSE
.. ..- attr(*, "class")= chr "html_dependency"
..$ :List of 10
.. ..$ name : chr "jquery"
.. ..$ version : chr "1.11.3"
.. ..$ src :List of 1
.. .. ..$ file: chr "lib/jquery"
.. ..$ meta : NULL
.. ..$ script : chr "jquery.min.js"
.. ..$ stylesheet: NULL
.. ..$ head : NULL
.. ..$ attachment: NULL
.. ..$ package : chr "crosstalk"
.. ..$ all_files : logi TRUE
.. ..- attr(*, "class")= chr "html_dependency"
..$ :List of 10
.. ..$ name : chr "crosstalk"
.. ..$ version : chr "1.1.0.1"
.. ..$ src :List of 1
.. .. ..$ file: chr "www"
.. ..$ meta : NULL
.. ..$ script : chr "js/crosstalk.min.js"
.. ..$ stylesheet: chr "css/crosstalk.css"
.. ..$ head : NULL
.. ..$ attachment: NULL
.. ..$ package : chr "crosstalk"
.. ..$ all_files : logi TRUE
.. ..- attr(*, "class")= chr "html_dependency"
..$ :List of 10
.. ..$ name : chr "plotly-htmlwidgets-css"
.. ..$ version : chr "1.52.2"
.. ..$ src :List of 1
.. .. ..$ file: chr "htmlwidgets/lib/plotlyjs"
.. ..$ meta : NULL
.. ..$ script : NULL
.. ..$ stylesheet: chr "plotly-htmlwidgets.css"
.. ..$ head : NULL
.. ..$ attachment: NULL
.. ..$ package : chr "plotly"
.. ..$ all_files : logi FALSE
.. ..- attr(*, "class")= chr "html_dependency"
..$ :List of 10
.. ..$ name : chr "plotly-main"
.. ..$ version : chr "1.52.2"
.. ..$ src :List of 1
.. .. ..$ file: chr "htmlwidgets/lib/plotlyjs"
.. ..$ meta : NULL
.. ..$ script : chr "plotly-latest.min.js"
.. ..$ stylesheet: NULL
.. ..$ head : NULL
.. ..$ attachment: NULL
.. ..$ package : chr "plotly"
.. ..$ all_files : logi FALSE
.. ..- attr(*, "class")= chr "html_dependency"
$ elementId : NULL
$ preRenderHook:function (p, registerFrames = TRUE)
$ jsHooks : list()
- attr(*, "class")= chr [1:2] "plotly" "htmlwidget"
- attr(*, "package")= chr "plotly"
You could extract the coordinates with:
unlist(fig$x$attrs)
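For individual values rather than one flattened vector, the same list can be indexed step by step. The sketch below assumes the abbreviated Sankey figure from the question and indexes the trace by position, since the id under attrs is generated anew for each figure; the variable names are only illustrative.

```r
library(plotly)

fig <- plot_ly(
  type = "sankey",
  node = list(label = c("A1", "A2", "B1", "B2", "C1", "C2")),
  link = list(source = c(0, 1, 0, 2, 3, 3),
              target = c(2, 3, 3, 4, 4, 5),
              value  = c(8, 4, 2, 8, 4, 2))
)

# attrs holds one entry per trace, named by a generated id;
# use the position instead of hard-coding that id.
trace_attrs <- fig$x$attrs[[1]]

node_labels <- trace_attrs$node$label
link_values <- trace_attrs$link$value

# Node coordinates are only present if they were supplied,
# e.g. node = list(x = ..., y = ...); here they were not.
node_x <- trace_attrs$node$x
```

Here node_x is NULL because no x was passed in; in the full example from plotly's site, where x is set explicitly, the same path should return those values.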

R: Loaded tweets structure is untidy when str()

Unlike on my colleague's computer, after I load the tweets in R and inspect the structure with str(), the data appears in a messy way with a lot of dots rather than being organized as a table, even though our code is the same. I can't understand what the problem is; we have the same packages installed and the same R version.
library(rtweet)
library(ggplot2)
library(dplyr)
library(tibble)
library(tidytext)
library(stringr)
library(stringi)
library(igraph)
library(ggraph)
library(readr)
library(lubridate)
library(zoo)
appname <- ""
key <- ""
secret <- ""
twitter_token <- create_token( app = "", consumer_key = "", consumer_secret = "", access_token = "", access_secret = "")
tweets <- search_tweets(q = "#water + #climatechange", n = 10000, lang = "en", include_rts = FALSE)
str(tweets)
.. ..$ media :'data.frame': 1 obs. of 11 variables:
.. .. ..$ id : num 1.57e+18
.. .. ..$ id_str : chr "1573815153484759040"
.. .. ..$ indices :List of 1
.. .. .. ..$ :'data.frame': 1 obs. of 2 variables:
.. .. .. .. ..$ start: int 241
.. .. .. .. ..$ end : int 264
.. .. .. ..- attr(*, "class")= chr "AsIs"
.. .. ..$ media_url : chr "http://pbs.twimg.com/media/FddQiy2WAAAl59Q.jpg"
.. .. ..$ media_url_https: chr "https://pbs.twimg.com/media/FddQiy2WAAAl59Q.jpg"
.. .. ..$ url : chr "https
.. .. ..$ display_url : chr "pic.twitter.com/iFJTkF1S9S"
.. .. ..$ expanded_url : chr "https://twitter.com/TreeBanker/status/1573815156768968706/photo/1"
.. .. ..$ type : chr "photo"
.. .. ..$ sizes :List of 1
.. .. .. ..$ :'data.frame': 4 obs. of 4 variables:
.. .. .. .. ..$ w : int [1:4] 1096 680 150 1096
.. .. .. .. ..$ h : int [1:4] 733 455 150 733
.. .. .. .. ..$ resize: chr [1:4] "fit" "fit" "crop" "fit"
.. .. .. .. ..$ type : chr [1:4] "large" "small" "thumb" "medium"
.. .. ..$ ext_alt_text : logi NA
..$ :List of 5
.. ..$ media :'data.frame': 1 obs. of 11 variables:
.. .. ..$ id : num 1.57e+18
.. .. ..$ id_str : chr "1573815153484759040"
.. .. ..$ indices :List of 1
.. .. .. ..$ :'data.frame': 1 obs. of 2 variables:
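Output like this is what str() prints when a data frame contains list-columns that themselves hold nested data frames. One possible way to get a more table-like overview, whatever the cause of the difference between the two machines, is to limit how deep str() descends. The sketch below uses a made-up two-row stand-in for the tweets object (the column names are hypothetical), so it runs without Twitter credentials.

```r
# Hypothetical stand-in for the tweets data frame: two ordinary columns
# plus one list-column holding nested data frames.
tweets_toy <- data.frame(id = 1:2, text = c("a", "b"), stringsAsFactors = FALSE)
tweets_toy$entities <- list(
  list(media = data.frame(id = 1, type = "photo")),
  list(media = data.frame(id = 2, type = "photo"))
)

# Full structure: every nested level is expanded.
full <- capture.output(str(tweets_toy))

# Compact structure: only the top-level columns are listed.
compact <- capture.output(str(tweets_toy, max.level = 1))

cat(compact, sep = "\n")
```

The same max.level argument can be passed as str(tweets, max.level = 1) on the real object.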

How do I count treated and untreated in R

I'm trying to learn R again and am trying to count the total number of genes that are "treated" and "untreated" with dex in the Bioconductor airway dataset (https://bioconductor.org/packages/release/data/experiment/html/airway.html).
I'm trying:
airway$dex=='trted'
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
and it's not working.
After installing that package I ran the following at my console (including all output):
> library(airway)
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: ‘matrixStats’
The following object is masked from ‘package:dplyr’:
count
Attaching package: ‘MatrixGenerics’
The following objects are masked from ‘package:matrixStats’:
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins,
colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads, colMaxs,
colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks, colSdDiffs,
colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, colWeightedMeans,
colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs,
rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, rowOrderStats,
rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs,
rowVars, rowWeightedMads, rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply,
parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:bit64’:
match, order, rank
The following objects are masked from ‘package:dplyr’:
combine, intersect, setdiff, union
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval,
evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order,
paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:Matrix’:
expand
The following objects are masked from ‘package:data.table’:
first, second
The following objects are masked from ‘package:tidygraph’:
active, rename
The following object is masked from ‘package:tidyr’:
expand
The following objects are masked from ‘package:dplyr’:
first, rename
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Attaching package: ‘IRanges’
The following object is masked from ‘package:data.table’:
shift
The following object is masked from ‘package:nlme’:
collapse
The following object is masked from ‘package:tidygraph’:
slice
The following object is masked from ‘package:purrr’:
reduce
The following objects are masked from ‘package:dplyr’:
collapse, desc, slice
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: ‘Biobase’
The following object is masked from ‘package:MatrixGenerics’:
rowMedians
The following objects are masked from ‘package:matrixStats’:
anyMissing, rowMedians
The following object is masked from ‘package:bit64’:
cache
Attaching package: ‘SummarizedExperiment’
The following object is masked from ‘package:SeuratObject’:
Assays
The following object is masked from ‘package:Seurat’:
Assays
I looked at the help page
> help(pac=airway)
So after reading that I thought the airway dataset might be accessible, but no:
> str(airway)
Error in str(airway) : object 'airway' not found
So I tried loading it with the data function (and no error was reported) so I looked at its structure:
> data(airway)
> str(airway)
Formal class 'RangedSummarizedExperiment' [package "SummarizedExperiment"] with 6 slots
..@ rowRanges :Formal class 'GRangesList' [package "GenomicRanges"] with 3 slots
.. .. ..@ elementMetadata:Formal class 'DataFrame' [package "IRanges"] with 6 slots
.. .. .. .. ..@ rownames : NULL
.. .. .. .. ..@ nrows : int 64102
.. .. .. .. ..@ listData : Named list()
.. .. .. .. ..@ elementType : chr "ANY"
.. .. .. .. ..@ elementMetadata: NULL
.. .. .. .. ..@ metadata : list()
.. .. ..@ elementType : chr "GRanges"
.. .. ..@ metadata :List of 1
.. .. .. ..$ genomeInfo:List of 20
.. .. .. .. ..$ Db type : chr "TranscriptDb"
.. .. .. .. ..$ Supporting package : chr "GenomicFeatures"
.. .. .. .. ..$ Data source : chr "BioMart"
.. .. .. .. ..$ Organism : chr "Homo sapiens"
.. .. .. .. ..$ Resource URL : chr "www.biomart.org:80"
.. .. .. .. ..$ BioMart database : chr "ensembl"
.. .. .. .. ..$ BioMart database version : chr "ENSEMBL GENES 75 (SANGER UK)"
.. .. .. .. ..$ BioMart dataset : chr "hsapiens_gene_ensembl"
.. .. .. .. ..$ BioMart dataset description : chr "Homo sapiens genes (GRCh37.p13)"
.. .. .. .. ..$ BioMart dataset version : chr "GRCh37.p13"
.. .. .. .. ..$ Full dataset : chr "yes"
.. .. .. .. ..$ miRBase build ID : chr NA
.. .. .. .. ..$ transcript_nrow : chr "215647"
.. .. .. .. ..$ exon_nrow : chr "745593"
.. .. .. .. ..$ cds_nrow : chr "537555"
.. .. .. .. ..$ Db created by : chr "GenomicFeatures package from Bioconductor"
.. .. .. .. ..$ Creation time : chr "2014-07-10 14:55:55 -0400 (Thu, 10 Jul 2014)"
.. .. .. .. ..$ GenomicFeatures version at creation time: chr "1.17.9"
.. .. .. .. ..$ RSQLite version at creation time : chr "0.11.4"
.. .. .. .. ..$ DBSCHEMAVERSION : chr "1.0"
..@ colData :Formal class 'DataFrame' [package "IRanges"] with 6 slots
.. .. ..@ rownames : chr [1:8] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" ...
.. .. ..@ nrows : int 8
.. .. ..@ listData :List of 9
.. .. .. ..$ SampleName: Factor w/ 8 levels "GSM1275862","GSM1275863",..: 1 2 3 4 5 6 7 8
.. .. .. ..$ cell : Factor w/ 4 levels "N052611","N061011",..: 4 4 1 1 3 3 2 2
.. .. .. ..$ dex : Factor w/ 2 levels "trt","untrt": 2 1 2 1 2 1 2 1
.. .. .. ..$ albut : Factor w/ 1 level "untrt": 1 1 1 1 1 1 1 1
.. .. .. ..$ Run : Factor w/ 8 levels "SRR1039508","SRR1039509",..: 1 2 3 4 5 6 7 8
.. .. .. ..$ avgLength : int [1:8] 126 126 126 87 120 126 101 98
.. .. .. ..$ Experiment: Factor w/ 8 levels "SRX384345","SRX384346",..: 1 2 3 4 5 6 7 8
.. .. .. ..$ Sample : Factor w/ 8 levels "SRS508567","SRS508568",..: 2 1 3 4 5 6 7 8
.. .. .. ..$ BioSample : Factor w/ 8 levels "SAMN02422669",..: 1 4 6 2 7 3 8 5
.. .. ..@ elementType : chr "ANY"
.. .. ..@ elementMetadata: NULL
.. .. ..@ metadata : list()
..@ assays :Reference class 'ShallowSimpleListAssays' [package "GenomicRanges"] with 1 field
.. ..$ data:Formal class 'SimpleList' [package "IRanges"] with 4 slots
.. .. .. ..@ listData :List of 1
.. .. .. .. ..$ counts: int [1:64102, 1:8] 679 0 467 260 60 0 3251 1433 519 394 ...
.. .. .. ..@ elementType : chr "ANY"
.. .. .. ..@ elementMetadata: NULL
.. .. .. ..@ metadata : list()
.. ..and 12 methods.
..@ NAMES : NULL
..@ elementMetadata:Formal class 'DataFrame' [package "S4Vectors"] with 6 slots
.. .. ..@ rownames : NULL
.. .. ..@ nrows : int 64102
.. .. ..@ listData : Named list()
.. .. ..@ elementType : chr "ANY"
.. .. ..@ elementMetadata: NULL
.. .. ..@ metadata : list()
..@ metadata :List of 1
.. ..$ :Formal class 'MIAME' [package "Biobase"] with 13 slots
.. .. .. ..@ name : chr "Himes BE"
.. .. .. ..@ lab : chr NA
.. .. .. ..@ contact : chr ""
.. .. .. ..@ title : chr "RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine"| __truncated__
.. .. .. ..@ abstract : chr "Asthma is a chronic inflammatory respiratory disease that affects over 300 million people worldwide. Glucocorti"| __truncated__
.. .. .. ..@ url : chr "http://www.ncbi.nlm.nih.gov/pubmed/24926665"
.. .. .. ..@ pubMedIds : chr "24926665"
.. .. .. ..@ samples : list()
.. .. .. ..@ hybridizations : list()
.. .. .. ..@ normControls : list()
.. .. .. ..@ preprocessing : list()
.. .. .. ..@ other : list()
.. .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. .. ..@ .Data:List of 2
.. .. .. .. .. .. ..$ : int [1:3] 1 0 0
.. .. .. .. .. .. ..$ : int [1:3] 1 1 0
Scanning through that list of S4 structured data I saw this line:
.. .. .. ..$ dex : Factor w/ 2 levels "trt","untrt": 2 1 2 1 2 1 2 1
So the dex items do have "trt" and "untrt" as values, but that "column" is located somewhat deeper in the entire RangedSummarizedExperiment structure. There might be a specific function, that I do not know the name of, to pull out values from such structures, but we now have enough information to answer (or hack together) the question. Follow the names and operators in that nested list backward to its origin and use the S4 extraction operator "@" where it is appropriate and $ when not:
sum(airway@colData@listData$dex == "trt")
#[1] 4
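For what it's worth, the accessor the answer is looking for is most likely colData() from SummarizedExperiment, and $ on the object itself is a shortcut to the same columns. A minimal sketch, assuming the airway data package is installed; the variable names are illustrative. Note that dex is recorded per sample (8 entries in the str() output above), so these are counts of samples, not of genes.

```r
library(SummarizedExperiment)
library(airway)                  # data package that ships the airway object
data(airway)

# Accessor-based equivalent of airway@colData@listData$dex
dex <- colData(airway)$dex

table(dex)                       # number of samples per treatment level
n_treated   <- sum(dex == "trt")
n_untreated <- sum(dex == "untrt")

# The shortcut airway$dex reaches the same column.
identical(dex, airway$dex)

dim(airway)                      # rows are genes, columns are samples
```

Going through accessors rather than @ tends to be more robust, since slot layouts can change between package versions.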
Use the sum() function to count TRUE values. Note that the factor level is 'trt', not 'trted':
sum(airway$dex == 'trt')

Get data to be usable

I have been trying to get the data from this link into a usable form:
url <- "https://www.sec.gov/Archives/edgar/data/1061165/0001567619-21-010580.txt"
It should be the same information as on this link:
https://www.sec.gov/Archives/edgar/data/1061165/000156761921010580/xslForm13F_X01/form13fInfoTable.xml
I have been able to download the file as a .txt, but cannot get at the data.
Thanks
The file appears to contain two embedded XML documents, one after the other. We can extract each of the components into lists with this code:
txt <- readLines("https://www.sec.gov/Archives/edgar/data/1061165/0001567619-21-010580.txt")
grep("</?XML>", txt)
# [1] 46 101 109 719
txt[grep("</?XML>", txt)]
# [1] "<XML>" "</XML>" "<XML>" "</XML>"
A brief inspection of the file suggested that grep: one XML document starts and stops, and then another starts and stops. If we stay within those line ranges, we can extract most of the data with
library(xml2)
first <- as_list(read_xml(paste(txt[47:100], collapse = "")))
str(first)
# List of 1
# $ edgarSubmission:List of 2
# ..$ headerData:List of 2
# .. ..$ submissionType:List of 1
# .. .. ..$ : chr "13F-HR"
# .. ..$ filerInfo :List of 4
# .. .. ..$ liveTestFlag :List of 1
# .. .. .. ..$ : chr "LIVE"
# .. .. ..$ flags :List of 3
# .. .. .. ..$ confirmingCopyFlag :List of 1
# .. .. .. .. ..$ : chr "false"
# .. .. .. ..$ returnCopyFlag :List of 1
# .. .. .. .. ..$ : chr "true"
# .. .. .. ..$ overrideInternetFlag:List of 1
# .. .. .. .. ..$ : chr "false"
# .. .. ..$ filer :List of 1
# .. .. .. ..$ credentials:List of 2
# .. .. .. .. ..$ cik:List of 1
# .. .. .. .. .. ..$ : chr "0001061165"
# .. .. .. .. ..$ ccc:List of 1
# .. .. .. .. .. ..$ : chr "XXXXXXXX"
# .. .. ..$ periodOfReport:List of 1
# .. .. .. ..$ : chr "03-31-2021"
# ..$ formData :List of 3
and the second batch:
second <- as_list(read_xml(paste(txt[110:718], collapse = "")))
str(second)
# List of 1
# $ informationTable:List of 38
# ..$ infoTable:List of 7
# .. ..$ nameOfIssuer :List of 1
# .. .. ..$ : chr "ADOBE SYSTEMS INCORPORATED"
# .. ..$ titleOfClass :List of 1
# .. .. ..$ : chr "COM"
# .. ..$ cusip :List of 1
# .. .. ..$ : chr "00724F101"
# .. ..$ value :List of 1
# .. .. ..$ : chr "1246613"
# .. ..$ shrsOrPrnAmt :List of 2
# .. .. ..$ sshPrnamt :List of 1
# .. .. .. ..$ : chr "2622406"
# .. .. ..$ sshPrnamtType:List of 1
# .. .. .. ..$ : chr "SH"
# .. ..$ investmentDiscretion:List of 1
# .. .. ..$ : chr "SOLE"
# .. ..$ votingAuthority :List of 3
# .. .. ..$ Sole :List of 1
# .. .. .. ..$ : chr "2622406"
# .. .. ..$ Shared:List of 1
# .. .. .. ..$ : chr "0"
# .. .. ..$ None :List of 1
# .. .. .. ..$ : chr "0"
# ..$ infoTable:List of 7
I'm not certain offhand how to extract the front-matter, I hope this is a good enough start.
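If a table is the goal, the holdings could be pulled out with XPath instead of as_list(). The following is a sketch under assumptions, not a tested parser for the real filing: it uses a small inline stand-in for the second XML document (the first entry copies values from the output above, the second entry and the namespace URI are made-up placeholders), and the helper field() and the data frame holdings are names introduced here.

```r
library(xml2)

# Inline stand-in for the second XML document. The second <infoTable> and
# the xmlns value are placeholders, not data from the filing.
xml_txt <- '<informationTable xmlns="http://example.com/hypothetical-namespace">
  <infoTable>
    <nameOfIssuer>ADOBE SYSTEMS INCORPORATED</nameOfIssuer>
    <titleOfClass>COM</titleOfClass>
    <cusip>00724F101</cusip>
    <value>1246613</value>
    <shrsOrPrnAmt>
      <sshPrnamt>2622406</sshPrnamt>
      <sshPrnamtType>SH</sshPrnamtType>
    </shrsOrPrnAmt>
  </infoTable>
  <infoTable>
    <nameOfIssuer>PLACEHOLDER ISSUER</nameOfIssuer>
    <titleOfClass>COM</titleOfClass>
    <cusip>000000000</cusip>
    <value>1</value>
    <shrsOrPrnAmt>
      <sshPrnamt>1</sshPrnamt>
      <sshPrnamtType>SH</sshPrnamtType>
    </shrsOrPrnAmt>
  </infoTable>
</informationTable>'

doc <- read_xml(xml_txt)
xml_ns_strip(doc)                        # drop the default namespace so plain XPath works

rows <- xml_find_all(doc, ".//infoTable")

# Text of one child element per <infoTable>; NA where the element is missing.
field <- function(name) xml_text(xml_find_first(rows, paste0("./", name)))

holdings <- data.frame(
  nameOfIssuer = field("nameOfIssuer"),
  titleOfClass = field("titleOfClass"),
  cusip        = field("cusip"),
  value        = as.numeric(field("value")),
  shares       = as.numeric(field("shrsOrPrnAmt/sshPrnamt")),
  stringsAsFactors = FALSE
)
holdings
```

With the real file, doc would come from read_xml(paste(txt[110:718], collapse = "")) as in the answer; whether the namespace step is needed there is an assumption worth checking.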

How to find out which index is out of bounds in object in R

Although I understand OOP, I've only just encountered objects in R.
I am using a package from Bioconductor to churn through some genomic data.
The object it creates is called readCounts, and typing this at the console gives the following.
QDNAseqReadCounts (storageMode: lockedEnvironment)
assayData: 206391 features, 1 samples
element names: counts
protocolData: none
phenoData
sampleNames: SLX-10457.FastSeqA.BloodDMets_11AF_-AHMMH.s_1.r_1.fq.gz
varLabels: name total.reads used.reads expected.variance
varMetadata: labelDescription
featureData
featureNames: 1:825001-840000 1:840001-855000 ... 22:51165001-51180000 (168063 total)
fvarLabels: chromosome start ... use (9 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
I am trying to plot readcounts on a simple xy graph as follows:
plot(readCounts, logTransform=TRUE, ylim=c(-1000, binSize * 15))
However when I do so I get the following error:
Error in sort.int(x, partial = unique(c(lo, hi))) :
index 180 outside bounds
with the traceback() showing:
6: sort.int(x, partial = unique(c(lo, hi)))
5: FUN(newX[, i], ...)
4: apply(copynumber, 2, sdFUN, na.rm = TRUE)
3: .local(x, y, ...)
2: plot(readCounts, logTransform = TRUE, ylim = c(-1000, binSize *
15))
1: plot(readCounts, logTransform = TRUE, ylim = c(-1000, binSize *
15))
So, having googled, I thought it might be a missing-values problem, so I tried na.omit(readCounts), but got the same error again, this time reporting the out-of-bounds index as 207.
I have tried to inspect the data, but I can't find anything wrong at row 207, although I'm not really sure which slot this refers to. I really don't know how to debug this. I'm happy to give more info about what I'm trying to do, but I don't know how to determine what the problem is from this error in an R object.
When I do str(readCounts) I get:
Formal class 'QDNAseqReadCounts' [package "QDNAseq"] with 7 slots
..@ assayData :<environment: 0x13a99ed90>
..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
.. .. ..@ varMetadata :'data.frame': 4 obs. of 1 variable:
.. .. .. ..$ labelDescription: chr [1:4] NA NA NA NA
.. .. ..@ data :'data.frame': 1 obs. of 4 variables:
.. .. .. ..$ name : chr "SLX-10457.FastSeqA.BloodDMets_11AF_-AHMMH.s_1.r_1.fq.gz"
.. .. .. ..$ total.reads : num 0
.. .. .. ..$ used.reads : num 0
.. .. .. ..$ expected.variance: num Inf
.. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 1
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ featureData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
.. .. ..@ varMetadata :'data.frame': 9 obs. of 1 variable:
.. .. .. ..$ labelDescription: chr [1:9] "Chromosome name" "Base pair start position" "Base pair end position" "Percentage of non-N nucleotides (of full bin size)" ...
.. .. ..@ data :'data.frame': 168063 obs. of 9 variables:
.. .. .. ..$ chromosome : chr [1:168063] "1" "1" "1" "1" ...
.. .. .. ..$ start : num [1:168063] 825001 840001 855001 870001 885001 ...
.. .. .. ..$ end : num [1:168063] 840000 855000 870000 885000 900000 915000 930000 945000 960000 975000 ...
.. .. .. ..$ bases : num [1:168063] 100 100 100 100 100 100 100 100 100 100 ...
.. .. .. ..$ gc : num [1:168063] 48 61.8 65.1 65.5 62.6 ...
.. .. .. ..$ mappability: num [1:168063] 58.6 91.5 94.1 93.2 93.9 ...
.. .. .. ..$ blacklist : num [1:168063] 0.727 0 0 0 0 ...
.. .. .. ..$ residual : num [1:168063] -0.0627 0.05036 0.09384 0.00541 -0.00588 ...
.. .. .. ..$ use : logi [1:168063] TRUE TRUE TRUE TRUE TRUE TRUE ...
.. .. .. ..- attr(*, "na.action")=Class 'omit' Named int [1:38328] 1 2 3 4 5 6 7 8 9 10 ...
.. .. .. .. .. ..- attr(*, "names")= chr [1:38328] "1:1-15000" "1:15001-30000" "1:30001-45000" "1:45001-60000" ...
.. .. ..@ dimLabels : chr [1:2] "featureNames" "featureColumns"
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 1
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ experimentData :Formal class 'MIAME' [package "Biobase"] with 13 slots
.. .. ..@ name : chr ""
.. .. ..@ lab : chr ""
.. .. ..@ contact : chr ""
.. .. ..@ title : chr ""
.. .. ..@ abstract : chr ""
.. .. ..@ url : chr ""
.. .. ..@ pubMedIds : chr ""
.. .. ..@ samples : list()
.. .. ..@ hybridizations : list()
.. .. ..@ normControls : list()
.. .. ..@ preprocessing : list()
.. .. ..@ other : list()
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 2
.. .. .. .. .. ..$ : int [1:3] 1 0 0
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ annotation : chr(0)
..@ protocolData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
.. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable:
.. .. .. ..$ labelDescription: chr(0)
.. .. ..@ data :'data.frame': 1 obs. of 0 variables
.. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 1
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. ..@ .Data:List of 4
.. .. .. ..$ : int [1:3] 3 1 2
.. .. .. ..$ : int [1:3] 2 26 0
.. .. .. ..$ : int [1:3] 1 3 0
.. .. .. ..$ : int [1:3] 1 2 4
