text/xml parsing to dataframe - r

Using the httr package, I pulled content from an API as follows. The content received is "text/xml"; the raw output is pasted at the end.
I converted it to XML as follows:
res_xml <- httr::content(res, as = "text", encoding = "UTF-8") %>%
xml2::read_xml()
When I checked the children it seems to have node_set
xml_children(res_xml)
{xml_nodeset (4)}
[1] <type value="searchset"/>
[2] <total value="1"/>
[3] <link>\n <relation value="self"/>\n <url value= ...
[4] <entry>\n <link>\n <relation value="self"/>\n ...
But when I pull the "entry" nodes, nothing is returned.
xml_find_all(res_xml, ".//entry")
{xml_nodeset (0)}
Instead of working with the XML format, I converted it to a list, but that is a deeply nested list of lists with unequal entries.
xml_list <- xml2::as_list(res_xml)
I need a data frame of specific entries, or better still the complete XML output as separate tables, so that I can more easily select the data I want to work with. Hence I tried the following, but the output is all NULL.
lst <- xml_list$Bundle$entry
lst %>% dplyr::bind_rows()
# A tibble: 17 x 4
relation url Patient mode
<list> <list> <list> <list>
1 <NULL> <NULL> <list [0]> <NULL>
2 <NULL> <NULL> <named list [1]> <NULL>
3 <NULL> <NULL> <named list [1]> <NULL>
4 <NULL> <NULL> <named list [1]> <NULL>
.....
When I inspected str(lst), the list contains NULL, list() and attr "value" entries; I am interested in the attribute values. If I convert the XML to JSON, everything is NULL.
Any help on how to flatten this list appropriately would be appreciated, or better still, how to parse the data directly with the xml2 package.
input data
<Bundle xmlns="http://hl7.org/fhir"><type value="searchset"/><total value="1"/>
<entry>
<resource>
<Patient><id value="TzfXm.YeCZh5GLGCXoCQqmjyn9vSjtJtIlcakCeyfbEcB"/>
<extension url="http://hl7.org/fhir/StructureDefinition/us-core-race">
<valueCodeableConcept>
<coding><system value="urn:oid:2.16.840.1.113883.5.104"/><code value="UNK"/><display value="Unknown"/></coding><text value="Unknown"/></valueCodeableConcept>
</extension>
<extension url="http://hl7.org/fhir/StructureDefinition/us-core-ethnicity">
<valueCodeableConcept>
<coding><system value="urn:oid:2.16.840.1.113883.5.50"/><code value="UNK"/><display value="Unknown"/></coding><text value="Unknown"/></valueCodeableConcept>
</extension>
<extension url="http://hl7.org/fhir/StructureDefinition/us-core-birth-sex">
<valueCodeableConcept>
<coding><system value="http://hl7.org/fhir/v3/AdministrativeGender"/><code value="F"/><display value="Female"/></coding><text value="Female"/></valueCodeableConcept>
</extension>
<identifier><use value="usual"/><system value="urn:oid:1.2.840.114350.1.13.172.3.7.5.737384.0"/><value value="E296"/></identifier>
<identifier><use value="usual"/><system value="urn:oid:1.2.840.114350.1.13.172.2.7.5.737384.100"/><value value="410000236"/></identifier><active value="true"/>
<name><use value="usual"/><text value="Mother Milltest"/><family value="Milltest"/><given value="Mother"/></name><gender value="female"/><birthDate value="1978-05-06"/><deceasedBoolean value="false"/></Patient>
</resource>
<search><mode value="match"/></search>
</entry>
</Bundle>
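The empty node set from xml_find_all() is most likely a default-namespace issue: the Bundle declares xmlns="http://hl7.org/fhir", so every element lives in that namespace and the plain XPath .//entry matches nothing. A minimal sketch of the two usual workarounds, using a stripped-down stand-in for the bundle above (not the full data):

```r
library(xml2)

# Stripped-down stand-in for the FHIR bundle above; only the default
# namespace matters for this illustration.
doc <- read_xml('<Bundle xmlns="http://hl7.org/fhir">
  <total value="1"/>
  <entry><resource><Patient/></resource></entry>
</Bundle>')

# Plain XPath fails: every element is in the FHIR namespace.
length(xml_find_all(doc, ".//entry"))
#> [1] 0

# Workaround 1: use the "d1" prefix that xml_ns() auto-assigns to the
# default namespace.
length(xml_find_all(doc, ".//d1:entry", xml_ns(doc)))
#> [1] 1

# Workaround 2: strip namespaces, after which plain XPath works.
xml_ns_strip(doc)
length(xml_find_all(doc, ".//entry"))
#> [1] 1
```

From there, attribute values can be read with xml_attr() on the matched nodes, which avoids the as_list() flattening problem entirely.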


googledrive R library 404 not found

I am using R 3.6.3 on Ubuntu Desktop 20.04 and googledrive version 1.0.1
I am downloading data from this open google drive folder with about 3k files: https://drive.google.com/drive/folders/0B-owdnU_9_lpei1wbTBpS3RyTW8?resourcekey=0-SxZAhXpvnVSBVJjG_HYZ_w
If you add it to a Google account of yours, you can see that:
library(googledrive)
drive_about()
s2012 <- drive_ls(path = "CSV Stazione_Parametro_AnnoMese", pattern = paste0("storico_2012"))
drive_download(file = as_id(s2012$id[1]))
Which returns:
Error: Client error: (404) Not Found
* domain: global
* reason: notFound
* message: File not found: 0B-owdnU_9_lpY0otY3FYaF9nOG8.
* locationType: parameter
* location: fileId
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/gargle_error_request_failed>
Client error: (404) Not Found
* message: File not found: 0B-owdnU_9_lpSGR2SGtCUnlRekk.
* domain: global
* reason: notFound
* location: fileId
* locationType: parameter
Backtrace:
1. base::mapply(function(x) drive_download(file = as_id(x)), bla$id)
3. googledrive::drive_download(file = as_id(x))
5. googledrive:::as_dribble.drive_id(file)
6. googledrive::drive_get(id = x)
8. purrr::map(as_id(id), get_one_file)
9. googledrive:::.f(.x[[i]], ...)
10. gargle::response_process(response)
11. gargle:::stop_request_failed(error_message(resp), resp)
Run `rlang::last_trace()` to see the full context.
Note that the tibble of drive_ls is:
s2012
# A tibble: 201 x 3
name id drive_resource
* <chr> <chr> <list>
1 storico_2012_07000027_005.csv 0B-owdnU_9_lpY0otY3FYaF9nOG8 <named list [37]>
2 storico_2012_10000001_005.csv 0B-owdnU_9_lpcmFkUDYwUzh4X0k <named list [37]>
3 storico_2012_05000020_010.csv 0B-owdnU_9_lpcTlEMTFpbjJLSVE <named list [37]>
4 storico_2012_03000006_005.csv 0B-owdnU_9_lpbDJiNFZWUy1CcEU <named list [37]>
5 storico_2012_09000018_111.csv 0B-owdnU_9_lpRHlwN0JnNVNseDg <named list [37]>
6 storico_2012_07000041_005.csv 0B-owdnU_9_lpV1hINnZtSFRYaTg <named list [37]>
7 storico_2012_04000155_009.csv 0B-owdnU_9_lpMzh6a29BQ3hJbHM <named list [37]>
8 storico_2012_09000014_020.csv 0B-owdnU_9_lpS0Y0ZFIzbV9mX1U <named list [37]>
9 storico_2012_03000006_038.csv 0B-owdnU_9_lpMlpGbkpFdVdURzQ <named list [37]>
10 storico_2012_06000036_009.csv 0B-owdnU_9_lpa0kxTTBuLU83U2s <named list [37]>
# … with 191 more rows
Note that it works perfectly for other files in the same folder.
Any hint?
A question aside: is there any alternative to this library? googledrive struggles to find all the files matching a specific pattern, i.e. it has to be run multiple times to find them all.
Thx,
A
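One way to keep a batch download running past ids that 404 (the file may simply have been removed from the folder after drive_ls() indexed it) is to wrap each drive_download() call in tryCatch(). This is only a sketch of the skip-on-error pattern, not a fix for the missing files themselves; safe_download is a made-up helper name:

```r
library(googledrive)

# Hypothetical helper: download one file by id, returning NULL (with a
# message) instead of aborting the whole batch when the API responds 404.
safe_download <- function(id) {
  tryCatch(
    drive_download(file = as_id(id), overwrite = TRUE),
    error = function(e) {
      message("Skipping ", id, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Assuming s2012 is the drive_ls() tibble from the question:
results <- lapply(s2012$id, safe_download)
```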

How to read XML files with missing initial tags in R

I have several XML files which are missing the initial tag. For example, this is a properly formatted file:
<?xml version="1.0"?>
<UDI>
<Test_Equipment_Number>3300061-01</Test_Equipment_Number>
<Test_SW_Number>3300062</Test_SW_Number>
<Test_SW_Version>2.1</Test_SW_Version>
<GTIN>(01)00884838088597</GTIN>
<LOT></LOT>
<Date_of_Mfg>(11)20190322</Date_of_Mfg>
<Device_SN>(21)1160001242</Device_SN>
<Material_Number>(96)300001287651</Material_Number>
<PCBA_WO_and_SN>00190311-0001242</PCBA_WO_and_SN>
<FW_Version>06</FW_Version>
<Model>324PHB</Model>
</UDI>
And this is the file with the initial tag missing:
<Test_Equipment_Number>3300011-01</Test_Equipment_Number>
<Test_SW_Number>3300012</Test_SW_Number>
<Test_SW_Version>5.1</Test_SW_Version>
<GTIN>(01)00884838085497</GTIN>
<LOT></LOT>
<Date_of_Mfg>(11)20190411</Date_of_Mfg>
<Device_SN>(21)1120104548</Device_SN>
<Material_Number>(96)300000267981</Material_Number>
<PCBA_WO_and_SN>000143-00000793</PCBA_WO_and_SN>
<FW_Version>V01.0001</FW_Version>
<Model>7000PHW</Model>
How can I read the file with the missing initial tag in R?
One option would be to parse the xml fragment by specifying a top node to be added:
# install.packages('XML')
library(XML)
fragment <-
'<Test_Equipment_Number>3300011-01</Test_Equipment_Number>
<Test_SW_Number>3300012</Test_SW_Number>
<Test_SW_Version>5.1</Test_SW_Version>
<GTIN>(01)00884838085497</GTIN>
<LOT></LOT>
<Date_of_Mfg>(11)20190411</Date_of_Mfg>
<Device_SN>(21)1120104548</Device_SN>
<Material_Number>(96)300000267981</Material_Number>
<PCBA_WO_and_SN>000143-00000793</PCBA_WO_and_SN>
<FW_Version>V01.0001</FW_Version>
<Model>7000PHW</Model>'
XML::parseXMLAndAdd(fragment, top = 'content')
#> <content>
#> <Test_Equipment_Number>3300011-01</Test_Equipment_Number>
#> <Test_SW_Number>3300012</Test_SW_Number>
#> <Test_SW_Version>5.1</Test_SW_Version>
#> <GTIN>(01)00884838085497</GTIN>
#> <LOT/>
#> <Date_of_Mfg>(11)20190411</Date_of_Mfg>
#> <Device_SN>(21)1120104548</Device_SN>
#> <Material_Number>(96)300000267981</Material_Number>
#> <PCBA_WO_and_SN>000143-00000793</PCBA_WO_and_SN>
#> <FW_Version>V01.0001</FW_Version>
#> <Model>7000PHW</Model>
#> </content>
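If you would rather stay with xml2, an equivalent sketch is to wrap the fragment in an artificial root element yourself before parsing (assuming the fragment is available as a single string):

```r
library(xml2)

fragment <- '<Test_SW_Number>3300012</Test_SW_Number>
<Model>7000PHW</Model>'

# Wrap in a synthetic root, then parse as a normal document.
doc <- read_xml(paste0("<content>", fragment, "</content>"))
xml_text(xml_find_first(doc, ".//Model"))
#> [1] "7000PHW"
```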

Error in UseMethod("http_error") in roadoi

I'm trying out roadoi to access Unpaywall from R, but no matter what I try to query for, I'm getting this response:
Error in UseMethod("http_error") : no applicable method for
'http_error' applied to an object of class "c('simpleError', 'error',
'condition')"
Running methods(http_error) gives me this:
[1] http_error.character* http_error.integer* http_error.response*
Could this be caused by me being behind an institutional firewall? (even so, it seems weird that this would be the response...)
Is there a way around it?
http_error() (actually from the httr package) is a very simple generic: it either loads a URL given as a character string (http_error.character), inspects a response object (http_error.response), or ultimately looks at a bare status code (http_error.integer). If the status code is >= 400 the function returns TRUE, otherwise FALSE.
What your error says is that you (or some function in your call chain) are calling http_error on a simpleError object. My guess is that your firewall settings block the request. Because the request is blocked, the underlying httr::RETRY (which is called from oadoi_fetch) returns an error instead of a proper response object; http_error sees just this error object and breaks.
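The status-code rule is easy to check directly, since http_error() also dispatches on bare integers:

```r
library(httr)

http_error(200L)  # status < 400, not an error
#> [1] FALSE
http_error(404L)  # status >= 400, an error
#> [1] TRUE
```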
If I locally switch off my proxy (through which I can make requests) I also get an error:
library(roadoi)
Sys.unsetenv(c("HTTP_PROXY", "HTTPS_PROXY"))
oadoi_fetch("10.1038/nature12373", email = "name@whatever.com")
# Error in UseMethod("http_error") :
# no applicable method for 'http_error' applied to an object of class
# "c('simpleError', 'error', 'condition')"
As soon as my proxy is set properly I get
Sys.setenv(HTTPS_PROXY = my_proxy, HTTP_PROXY = my_proxy)
oadoi_fetch("10.1038/nature12373", email = "name@whatever.com")
# # A tibble: 1 x 16
# doi best_oa_location oa_locations data_standard is_oa genre journal_is_oa journal_is_in_d~ journal_issns journal_name publisher title year updated non_compliant authors
# <chr> <list> <list> <int> <lgl> <chr> <lgl> <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <list> <list>
# 1 10.1038~ <tibble [1 x 10]> <tibble [4 x~ 2 TRUE journa~ FALSE FALSE 0028-0836,147~ Nature Springer ~ Nanometre-s~ 2013 2019-04-0~
If the problem lies indeed with the proxy, I would try the following, which helped me on my corporate Windows machine, but may be dependent on your local IT setting:
## get the proxy settings
system("netsh winhttp show proxy")
Sys.setenv(HTTP_PROXY = <the proxy from netsh>, HTTPS_PROXY = <the proxy from netsh>)
Actually, you can reproduce the error easily:
httr::http_error(simpleError("Cannot reach the page"))
# Error in UseMethod("http_error") :
# no applicable method for 'http_error' applied to an object of class
# "c('simpleError', 'error', 'condition')"

Write XML using pipe operator with xml2

The xml2 package allows users to create XML documents. I'm trying to create a document using the pipe operator %>% to add various combinations of child and sibling nodes. I cannot figure out how to create a child node within a child node that is then followed by the original child's sibling (see example below).
Is it possible to "rise" up a level to then create more nodes or must they be created outside of the chained commands?
What I want
library(xml2)
x1 <- read_xml("<parent><child>1</child><child><grandchild>2</grandchild></child><child>3</child><child>4</child></parent>")
message(x1)
#> <?xml version="1.0" encoding="UTF-8"?>
#> <parent>
#> <child>1</child>
#> <child>
#> <grandchild>2</grandchild>
#> </child>
#> <child>3</child>
#> <child>4</child>
#> </parent>
What I'm creating that's wrong
library(magrittr)
library(xml2)
x2 <- xml_new_document()
x2 %>%
xml_add_child("parent") %>%
xml_add_child("child", 1) %>%
xml_add_sibling("child", 4, .where="after") %>%
xml_add_sibling("child", 3) %>%
xml_add_sibling("child", .where="before") %>%
xml_add_child("grandchild", 2)
message(x2)
#> <?xml version="1.0" encoding="UTF-8"?>
#> <parent>
#> <child>1</child>
#> <child>4</child>
#> <child>
#> <grandchild>2</grandchild>
#> </child>
#> <child>3</child>
#> </parent>
Solution using XML package
This is actually fairly straightforward if done using the XML package.
library(XML)
x2 <- newXMLNode("parent")
invisible(newXMLNode("child", 1, parent=x2))
invisible(newXMLNode("child", newXMLNode("grandchild", 2), parent=x2))
invisible(newXMLNode("child", 3, parent=x2))
invisible(newXMLNode("child", 4, parent=x2))
x2
#> <?xml version="1.0" encoding="UTF-8"?>
#> <parent>
#> <child>1</child>
#> <child>
#> <grandchild>2</grandchild>
#> </child>
#> <child>3</child>
#> <child>4</child>
#> </parent>
I'm going to start by saying that I think this is generally a bad idea. xml2 works using pointers, which means that it has reference semantics ("pass by reference"), which is not the typical behavior in R. Functions in xml2 work by producing side effects on the XML tree, not by returning values like in functional programming ("pass by value").
This means that piping is basically the wrong principle. You just need a series of steps that modify the object in the correct order.
That said, you can do:
library("magrittr")
library("xml2")
x2 <- xml_new_document()
x2 %>%
xml_add_child(., "parent") %>%
{
xml_add_child(., "child", 1, .where = "after")
(xml_add_child(., "child") %>% xml_add_child("grandchild", 2))
xml_add_child(., "child", 3, .where = "after")
xml_add_child(., "child", 4, .where = "after")
}
message(x2)
## <?xml version="1.0" encoding="UTF-8"?>
## <parent>
## <child>1</child>
## <child>
## <grandchild>2</grandchild>
## </child>
## <child>3</child>
## <child>4</child>
## </parent>
The . tells the %>% where to place the "parent" node in subsequent calls to xml_add_child(). The ()-bracketed expression in the middle takes advantage of the fact that you want to pipe into the "child" node then pipe that child node into the grandchild node.
Another option, if you really want to use pipes throughout, is to use the %T>% pipe instead of the %>% pipe (or rather, a mix of the two). The difference between the two is the following:
> 1:3 %>% mean() %>% str()
num 2
> 1:3 %T>% mean() %>% str()
int [1:3] 1 2 3
The %T>% pipe passes the value of the left-hand side into the right-hand side expression (for its side effects), but then passes that same left-hand value, not the right-hand result, on to the subsequent expression. This means you can call functions in the middle of a pipeline for their side effects and continue to pass the earlier object reference forward in the pipeline.
This is what you're trying to do when you say "rise up a level" - that is, back up to a previous value in the pipeline and work from there. So you need to just %T>% pipe until you get to a point where you want to %>% pipe (e.g., to create the grandchild) and then return to %T>% piping to continue carrying the parent object reference forward. An example:
x3 <- xml_new_document()
x3 %>%
xml_add_child("parent") %T>%
xml_add_child("child", 1, .where = "after") %T>%
{xml_add_child(., "child") %>% xml_add_child("grandchild", 2)} %T>%
xml_add_child("child", 3, .where = "after") %>%
xml_add_child("child", 4, .where = "after")
message(x3)
## <?xml version="1.0" encoding="UTF-8"?>
## <parent>
## <child>1</child>
## <child>
## <grandchild>2</grandchild>
## </child>
## <child>3</child>
## <child>4</child>
## </parent>
Note the final %>% instead of %T>%. If you swapped %>% for %T>% the value of the whole pipeline would be the "parent" node tree only:
{xml_document}
<parent>
[1] <child>1</child>
[2] <child>\n <grandchild>2</grandchild>\n</child>
[3] <child>3</child>
[4] <child>4</child>
(Which - again - ultimately doesn't really matter because we're actually building x3 using side effects, but it will print the parent node tree to the console, which is probably confusing.)
Again, I'd suggest not using the pipe at all given the awkwardness, but it's up to you. A better way is just to preserve each object you want to attach a child to and then refer to it again each time. Like in the first example, save the parent node as p, skip all the pipes, and just refer to p everywhere that . is used in the example code.
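For completeness, a sketch of that pipe-free version (p and child2 are illustrative names): keep a handle on each node you want to attach children to, and the nesting problem disappears:

```r
library(xml2)

x4 <- xml_new_document()
p <- xml_add_child(x4, "parent")

xml_add_child(p, "child", 1)
child2 <- xml_add_child(p, "child")  # keep a handle for the grandchild
xml_add_child(child2, "grandchild", 2)
xml_add_child(p, "child", 3)
xml_add_child(p, "child", 4)

message(x4)
```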

Sparklyr how to view variables

Hi, I have a deeply nested JSON file. I used sparklyr to read it and called the resulting object "data".
Firstly I will show what the data structure looks like:
# Database: spark_connection
data
-a : string
-b : string
-c : (struct)
c1 : string
c2 : (struct)
c21: string
c22: string
Something like this. So if I extract "a" using:
data %>% sdf_select(a)
I can view what the data inside, like:
# Database: spark_connection
a
<chr>
1 Hello world
2 Stack overflow is epic
The problem comes when I use sdf_select() on a deeper structure, i.e.
data %>% sdf_select(c.c2.c22)
Viewing the data inside, I get this
# Database: spark_connection
c22
<list>
1 <list [1]>
2 <list [1]>
3 <list [1]>
4 <lgl [1]>
So if I collect the data, turning the Spark data frame into an R data frame, and view it using:
View(collect(data %>% sdf_select(c.c2.c22)))
The data shows
1 list("Good")
2 list("Bad")
3 NA
How do I turn every entry in each list above into a plain value, so that it shows Good, Bad, NA instead of list("")?
I was unable to reproduce this. I used
[{"a":"jkl","b":"mno","c":{"c1":"ghi","c2":{"c21":"abc","c22":"def"}}}]
written to a test.json, followed by
spk_df <- spark_read_json(sc, "tmp", "file:///path/to/test.json")
spk_df %>% sdf_schema_viewer()
This seems to match the schema that you provided. However when I use sparklyr.nested::sdf_select() I get a different result.
spk_df %>% sdf_select(c.c2.c22)
# # Source: table<sparklyr_tmp_7431373dca00> [?? x 1]
# # Database: spark_connection
# c22
# <chr>
# 1 def
where c22 is a character column.
My guess is that in your real data, one of the levels is actually an array of structs. If this is the case, then indexing into an array forces a list wrapping (or else data would need to be dropped). You can resolve this in spark land using sdf_explode or you can resolve it locally in a variety of ways. For example, using purrr you would do something like:
df <- collect(spk_df)
df %>% dplyr::mutate(c22 = purrr::map(c22, unlist))
It is possible that you will need to write a function wrapping unlist to deal with different data types in different rows (the NA values are logical).
unlist_and_cast <- function(x) {
  as.character(unlist(x))
}
df %>% dplyr::mutate(c22 = purrr::map(c22, unlist_and_cast))
would do the trick I think (untested).
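A self-contained local illustration of that flattening (no Spark needed; the list-column below mimics the collected c22, and map_chr is used so the result is an atomic character column rather than a list-column):

```r
library(dplyr)
library(purrr)

# Mimic the collected data: a list-column mixing character and logical rows.
df <- tibble::tibble(c22 = list(list("Good"), list("Bad"), NA))

unlist_and_cast <- function(x) {
  as.character(unlist(x))
}

out <- df %>% mutate(c22 = map_chr(c22, unlist_and_cast))
out$c22
#> [1] "Good" "Bad"  NA
```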
