rvest: select and scrape html table(s) after a specific (title) string - r

I am trying to scrape the following website with public healthcare data:
https://prog.nfz.gov.pl/app-jgp/Grupa.aspx?id=Qpc6nYOpOBQ%3d
I would like to scrape only a table or tables that are preceded with a title
"Tabela xx procedury ICD-9" where xx is not a fixed number.
There may be 1, but occasionally 2-3, such tables on a page, and they can occur in a different order on different pages, so it is not possible to simply ask for the nth table on the page.
For instance, in the layout below I am interested only in the tables preceded by the string "Icd-9 main" and want to skip the others. There are 2 such tables and I would like to scrape their content into a data.frame. They may not appear in the same order as below, so I must rely on the preceding string. Occasionally there will be no Icd-9 main table at all.
Page
----
Icd-10
====
Table
====
Icd-9 main
====
Table
====
Icd-9 main
====
Table
====
Icd-9 supplementary
====
Table
====
I am only aware of code that selects the nth table in order, as in this tutorial:
https://www.r-bloggers.com/using-rvest-to-scrape-an-html-table/
library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>%
  html_table()
population <- population[[1]]
Here all the tables are dumped into a list, and we then pick the table we want by its position. However, in my case I never know which table to scrape: the order may vary, and there may be 2 tables preceded by a string containing "Tabela xx: procedury ICD-9".
My question is: how do I select and scrape HTML tables based on the preceding title or description string, i.e. only the table or tables appearing immediately after a string containing e.g. "Tabela xx procedury ICD-9"?

library(rvest)
library(stringr)
doc <- read_html("https://prog.nfz.gov.pl/app-jgp/Grupa.aspx?id=Qpc6nYOpOBQ%3d")
# extract all the nodes that have the title (class = "tytul") or are a table;
# the CSS selector "," acts like a boolean OR
nodes <- doc %>% html_nodes(".tytul,table")
# loop through each node
signal <- FALSE
my_tables <- list()
j <- 0
for (i in seq_along(nodes)) {
  # if the title signal was previously set and this is a table tag
  if (signal & html_name(nodes[i]) == "table") {
    cat("Match..\n")
    # get the table (data frame)
    this_table <- html_table(nodes[i], fill = TRUE, header = TRUE)[[1]]
    # append it to the list
    j <- j + 1
    my_tables[[j]] <- this_table
    # and reset the signal so we search for the next one
    signal <- FALSE
  }
  # if the signal is clear, look for a matching title
  if (!signal) {
    signal <- nodes[i] %>% html_text() %>% str_detect("Tabela.+ICD 9")
  }
}
my_tables[[1]][1:5,]
my_tables[[2]][1:5,]
# > my_tables[[1]][1:5,]
#   ICD 9                                              Nazwa Lb. hospitalizacji Udział (%) Mediana czasu pobytu (dni)
# 1 2.051              Założenie płytki sztucznej do czaszki                168      32,31                          7
# 2 1.247  Kraniotomia z usunięciem krwiaka podtwardówkowego                 55      10,58                         20
# 3 2.022                       Odbarczenie złamania czaszki                 43       8,27                          6
# 4 2.040                 Przeszczep kostny do kości czaszki                 35       6,73                          8
# 5 1.093                  Inne aspiracje w zakresie czaszki                 33       6,35                          5
# > my_tables[[2]][1:5,]
#   ICD 9                                 Nazwa Lb. hospitalizacji Udział (%) Mediana czasu pobytu (dni)
# 1 O35                              Sód (Na)                239      45,96                          8
# 2 89.00  Porada lekarska, konsultacja, asysta                230      44,23                          9
# 3 N45                             Potas (K)                217      41,73                          8
# 4 87.030               TK głowy bez kontrastu                214      41,15                          9
# 5 89.04      Opieka pielęgniarki lub położnej                202      38,85                          8
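For completeness, the pairing can also be done in a single XPath pass with the following axis, which picks, for each matching title node, the first table after it in document order. A minimal sketch, assuming the titles carry the class "tytul" as above and contain "ICD 9" (both are assumptions about the live page):
library(rvest)
library(stringr)
doc <- read_html("https://prog.nfz.gov.pl/app-jgp/Grupa.aspx?id=Qpc6nYOpOBQ%3d")
# one node per matching title; html_node() then grabs the first table following each
titles <- html_nodes(doc, xpath = "//*[contains(@class, 'tytul') and contains(., 'ICD 9')]")
my_tables <- lapply(titles, function(t) {
  t %>% html_node(xpath = "following::table[1]") %>% html_table(fill = TRUE, header = TRUE)
})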

Related

Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match, when using Rinat package in R

I am using the Rinat API to extract the observations submitted by a list of users to the iNaturalist app. The API function I'm using is as below, which is taken from GitHub (https://github.com/ropensci/rinat/blob/master/R/get_inat_obs_user.R):
library(httr)  # GET() and content() below come from httr

get_inat_obs_user <- function(username, maxresults = 10000){
  # check Internet connection
  if (!curl::has_internet()) {
    message("No Internet connection.")
    return(invisible(NULL))
  }
  base_url <- "http://www.inaturalist.org/"
  # check that iNat can be reached
  if (httr::http_error(base_url)) { # TRUE: 400 or above
    message("iNaturalist API is unavailable.")
    return(invisible(NULL))
  }
  q_path <- paste0(username, ".csv")
  ping_path <- paste0(username, ".json")
  ping_query <- "&per_page=1&page=1"
  ### Make the first ping to the server to get the number of results
  ### easier to pull down if you make the query in json, but easier to arrange results
  ### that come down in CSV format
  ping <- GET(base_url, path = paste0("observations/", ping_path), query = ping_query)
  total_res <- as.numeric(ping$headers$`x-total-entries`)
  if(total_res == 0){
    stop("Your search returned zero results. Perhaps your user does not exist.")
  }
  page_query <- "&per_page=200&page=1"
  dat <- GET(base_url, path = paste0("observations/", q_path), query = page_query)
  data_out <- read.csv(textConnection(content(dat, as = "text")))
  if(maxresults > 200){
    for(i in 2:ceiling(total_res/200)){
      page_query <- paste0("&per_page=200&page=", i)
      dat <- GET(base_url, path = paste0("observations/", q_path), query = page_query)
      data_out <- rbind(data_out, read.csv(textConnection(content(dat, as = "text"))))
    }
  }
  if(maxresults < dim(data_out)[1]){
    data_out <- data_out[1:maxresults,]
  }
  return(data_out)
}
I'm also using a for loop to iterate the API function over a list of usernames (sample data below):
usernames<-
user_login
1 a_random_kaiser
2 a_wade
3 achaas33
4 adam247
5 adam294
6 ailsa2
7 anne278
8 anne370
9 apwiemer
10 arbw
11 aslan1758
12 asmalley
13 aspinwall
14 atypicalhermit
15 aurora143
16 autumn307
17 avenkal
18 awburck
19 badbotanist
20 baleinenature
21 barich
22 batesfire
23 beavers
24 becca57
25 beesis
26 beetlequeenrs
27 ben543
28 ben566
29 bethanyk63
30 betty108
31 boatbirder
32 bobbyboy
33 bobkat
34 brad131987
35 bradders100
36 cadds
37 caketea86
38 carluk
39 carolina711
40 cleo2211
41 cobydo
42 cocothecat
EDIT
sequence <- seq_along(usernames)
User_obs_global <- data.frame()
for (var in sequence){
  User_obs <- get_inat_obs_user(as.character(
    User_no_obs_Oxfordshire$user_login[var]))
  User_obs_global <- rbind(User_obs_global, User_obs)
}
sequence <- 1:nrow(usernames)
for (var in sequence) {
  User_obs <- get_inat_obs_user(as.character(username$user_login[var]))
}
However, I keep getting the below error:
Error in rbind(deparse.level, ...) : numbers of columns of arguments
do not match
I tried searching previous Stack Overflow posts related to this error, and I understand it could be due to a mismatch in the number of columns between successive username records. Nevertheless, the number of columns should be the same for all username records, since their source is the same. Hence, I'm wondering if there is something wrong with my code.
I can run the API function successfully for individual usernames by the way, but I have a few thousand usernames to loop through.
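In case some users' downloads really do come back with different columns (an error page or an empty CSV for one page would do it), a more forgiving way to stack the results is dplyr::bind_rows() (or plyr::rbind.fill()), which matches columns by name and fills gaps with NA instead of erroring. A minimal sketch of the outer loop, assuming get_inat_obs_user() as defined above:
library(dplyr)
User_obs_list <- lapply(usernames$user_login, function(u) {
  out <- get_inat_obs_user(as.character(u))
  # coerce all columns to character so type clashes between users don't bite either
  if (!is.null(out)) out[] <- lapply(out, as.character)
  out
})
# bind_rows() matches columns by name and skips NULL entries
User_obs_global <- bind_rows(User_obs_list)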

R Load XML to dataframe, and include attributes

I am having trouble loading XML file into R data frame.
This is my XML structure [the data is made up]:
<?xml version="1.0" encoding="UTF-8"?>
<CancerExtract>
  <CancerRegRec>
    <Demographic>
      <PatientName>
        <PatSurname>Jones</PatSurname>
        <PatFirstName>John</PatFirstName>
        <PatSecondName>Peter</PatSecondName>
      </PatientName>
      <PatientDetail Sex="1" IndigStatus="12">
        <DOB>01012000</DOB>
        <MedicareNo>xxxx776xxx66xx</MedicareNo>
        <COB>1101</COB>
        <Language>1201</Language>
      </PatientDetail>
      <PatientAddress>
        <StreetAddr>1 Address Rd</StreetAddr>
        <Suburb>AwesomeCity</Suburb>
        <Postcode>ZZ304</Postcode>
      </PatientAddress>
    </Demographic>
    <Tumour>
      <TreatingDoctor>
        <TDSurname>Doctor</TDSurname>
        <TDFirstName>The Good</TDFirstName>
        <TDAddress>FixemUp ct</TDAddress>
        <TDMediProvidNo>DR0001</TDMediProvidNo>
      </TreatingDoctor>
      <HospitalEpisode>
        <HospitalName>FixMeUp</HospitalName>
        <CampusCode>0000</CampusCode>
        <URN>123456</URN>
        <AdmissionDate>01012020</AdmissionDate>
        <DischargeDate>03012020</DischargeDate>
      </HospitalEpisode>
      <TumourDetail Grade="1" ECOG="9">
        <DiagnosisDate>01012050</DiagnosisDate>
        <PrimarySite>C61</PrimarySite>
        <Morph>81403</Morph>
        <Investigations>8 8 7 10 3</Investigations>
        <AdditInfo>Some free text can be available here</AdditInfo>
      </TumourDetail>
      <CStage Stage="9" StagingSystem="99"/>
      <GP>
        <GPSurname>MyGP</GPSurname>
        <GPFirstName>Peter</GPFirstName>
        <GPAddress>100 GP street</GPAddress>
      </GP>
      <RegDetail>
        <RegName>Some name</RegName>
        <RegDate>05122021</RegDate>
      </RegDetail>
    </Tumour>
  </CancerRegRec>
  <CancerRegRec>
    <Demographic>
      <PatientName>
        <PatSurname>Pt2</PatSurname>
        <PatFirstName>Frits</PatFirstName>
        <PatSecondName/>
      </PatientName>
      <PatientDetail Sex="4" IndigStatus="22" SomeOtherVariable="random value">
        <DOB>12121834</DOB>
        <MedicareNo>xxxxxxxx00001</MedicareNo>
        <COB>1201</COB>
        <Language>1201</Language>
      </PatientDetail>
      <PatientAddress>
        <StreetAddr>1 church street</StreetAddr>
        <Suburb>Cityname Here</Suburb>
        <Postcode>7777YY</Postcode>
      </PatientAddress>
    </Demographic>
    <Tumour>
      <TreatingDoctor>
        <!-- element was collapsed in the original snippet -->
      </TreatingDoctor>
      <HospitalEpisode>
        <HospitalName>HospitalName two </HospitalName>
        <CampusCode>2166192</CampusCode>
        <URN>10REWR8XX640</URN>
        <AdmissionDate>23122025</AdmissionDate>
        <DischargeDate>23122027</DischargeDate>
      </HospitalEpisode>
      <TumourDetail EstDateFlag="1" PriorDiagFlag="Y" Laterality="8">
        <DiagnosisDate>01121812</DiagnosisDate>
        <WhereDiagnosed>At home</WhereDiagnosed>
        <PrimarySite>C9000</PrimarySite>
        <Morph>81403</Morph>
        <Investigations>7 3 1</Investigations>
        <MetSite>C792 C788</MetSite>
        <AdditInfo>This is a second record. </AdditInfo>
      </TumourDetail>
      <CStage Stage="9" StagingSystem="99"/>
      <GP>
        <GPSurname>Jones</GPSurname>
        <GPFirstName>John</GPFirstName>
        <GPAddress>Test street 12 Unit 1</GPAddress>
      </GP>
      <RegDetail>
        <RegName>Me Myself and I</RegName>
        <RegDate>01011801</RegDate>
      </RegDetail>
    </Tumour>
  </CancerRegRec>
</CancerExtract>
I created this R function to load the file and extract all data:
load_XML_File <- function(file){
  load <- tryCatch(expr = { xml2::read_xml(file) },
    warning = function(warning_condition) {
      message(paste("\n\n\nWarning loading file: ", file))
      message("\nHere's the original warning message:\n")
      message(warning_condition)
      return(NA)
    },
    error = function(error_condition) {
      message(paste("\n\n\nError loading file: ", file))
      message("\nHere's the original error message:\n")
      message(error_condition)
      return(NA)
    },
    finally = {
      message(paste0("\nLoaded file ", file))
    }
  )
  # bail out if read_xml() failed and a handler returned NA
  if (!inherits(load, "xml_document")) return(NA)
  PerPt <- xml2::xml_find_all(load, ".//CancerRegRec")
  tmp <- xml2::as_list(PerPt)
  if(length(tmp) == 0){out <- NA}
  if(length(tmp) >= 1){
    for(i in 1:length(tmp)){
      tt <- data.frame(t(data.frame(unlist(tmp[i]))))
      rownames(tt) <- NULL
      if(i == 1){out <- tt}
      if(i > 1){out <- plyr::rbind.fill(out, tt)}
    }
  }
  return(out)
}
This works well and is fast enough for my purpose, but does NOT extract the attributes.
How would I change my function so that also the attributes are included?
> load_XML_File(file)
Loaded file H:/TMP/testFile.xml
Demographic.PatientName.PatSurname Demographic.PatientName.PatFirstName Demographic.PatientName.PatSecondName Demographic.PatientDetail.DOB
1 Jones John Peter 01012000
2 Pt2 Frits <NA> 12121834
Demographic.PatientDetail.MedicareNo Demographic.PatientDetail.COB Demographic.PatientDetail.Language Demographic.PatientAddress.StreetAddr
1 xxxx776xxx66xx 1101 1201 1 Address Rd
2 xxxxxxxx00001 1201 1201 1 church street
Demographic.PatientAddress.Suburb Demographic.PatientAddress.Postcode Tumour.TreatingDoctor.TDSurname Tumour.TreatingDoctor.TDFirstName
1 AwesomeCity ZZ304 Doctor The Good
2 Cityname Here 7777YY Jansen Jan
Tumour.TreatingDoctor.TDAddress Tumour.TreatingDoctor.TDMediProvidNo Tumour.HospitalEpisode.HospitalName Tumour.HospitalEpisode.CampusCode
1 FixemUp ct DR0001 FixMeUp 0000
2 Jansen rd DVR0001 HospitalName two 2166192
Tumour.HospitalEpisode.URN Tumour.HospitalEpisode.AdmissionDate Tumour.HospitalEpisode.DischargeDate Tumour.TumourDetail.DiagnosisDate
1 123456 01012020 03012020 01012050
2 10REWR8XX640 23122025 23122027 01121812
Tumour.TumourDetail.PrimarySite Tumour.TumourDetail.Morph Tumour.TumourDetail.Investigations Tumour.TumourDetail.AdditInfo Tumour.GP.GPSurname
1 C61 81403 8 8 7 10 3 Some free text can be available here MyGP
2 C9000 81403 7 3 1 This is a second record. Jones
Tumour.GP.GPFirstName Tumour.GP.GPAddress Tumour.RegDetail.RegName Tumour.RegDetail.RegDate Tumour.TumourDetail.WhereDiagnosed Tumour.TumourDetail.MetSite
1 Peter 100 GP street Some name 05122021 <NA> <NA>
2 John Test street 12 Unit 1 Me Myself and I 01011801 At home C792 C788
It seems like the attributes are present on tmp.
PerPt <- xml2::xml_find_all(load, ".//CancerRegRec")
tmp <- xml2::as_list(PerPt)
This function visits each element of a list, recursively, and makes the attributes of each element into members of that element.
move_attr_to_member <- function(x) {
  ## capture names, and attributes but not names
  names <- names(x)
  attributes <- attributes(unname(x))
  ## recursive application
  if (is.list(x))
    x <- lapply(x, move_attr_to_member)
  ## return x (with attributes but not names removed) and attributes
  attributes(x) <- NULL
  names(x) <- names
  c(x, attributes)
}
This could be used like
list_with_attrs_as_members <- move_attr_to_member(tmp)
A tibble is easily created with
dplyr::bind_rows(lapply(list_with_attrs_as_members, unlist))
I'd carefully check the output of move_attr_to_member() to make sure that it's doing the right thing!
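For instance, a quick spot check on the first record could look like the following (the path below is just an illustrative one; Sex and IndigStatus come from the sample XML above):
# PatientDetail should now list Sex and IndigStatus alongside DOB, MedicareNo, etc.
str(list_with_attrs_as_members[[1]]$Demographic$PatientDetail)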

How to access Youtube Data API v3 with R

I am trying to use R to retrieve data from the YouTube API v3 and there are few/no tutorials out there that show the basic process. I have figured out this much so far:
# Youtube API query
library(httr)     # for GET()
library(stringr)  # for str_c()

base_url <- "https://youtube.googleapis.com/youtube/v3/"
# my_api_key is assumed to be defined elsewhere
my_yt_search <- function(search_term, max_results = 20) {
  my_api_url <- str_c(base_url, "search?part=snippet&", "maxResults=", max_results, "&",
                      "q=", search_term, "&key=", my_api_key, sep = "")
  result <- GET(my_api_url)
  return(result)
}
my_yt_search(search_term = "salmon")
But I am just getting some general meta-data and not the search results. Help?
PS. I know there is a package 'tuber' out there but I found it very unstable and I just need to perform simple searches so I prefer to code the requests myself.
Sadly there is no way to get the durations directly from the search endpoint; you'll need to call the videos endpoint (with part set to part=contentDetails) after doing the search if you want that information. However, you can pass up to 50 IDs in a single call, so we can save some time by pasting all the IDs together.
library(httr)
library(jsonlite)
library(tidyverse)
my_yt_duration <- function(...){
  my_api_url <- paste0(base_url, "videos?part=contentDetails",
                       paste0("&id=", ..., collapse = ""),
                       "&key=", my_api_key)
  GET(my_api_url) -> resp
  fromJSON(content(resp, "text"))$items %>% as_tibble %>% select(id, contentDetails) -> tb
  tb$contentDetails$duration %>% tibble(id = tb$id, duration = .)
}
### getting the video IDs
my_yt_search(search_term = "salmon")->res
## Converting from JSON then selecting all the video ids
# fromJSON(content(res,as="text") )$items$id$videoId
my_yt_duration(fromJSON(content(res,as="text") )$items$id$videoId) -> tib.id.duration
# A tibble: 20 x 2
id duration
<chr> <chr>
1 -x2E7T3-r7k PT4M14S
2 b0ahREpQqsM PT3M35S
3 ROz8898B3dU PT14M17S
4 jD9VJ92xyzA PT5M42S
5 ACfeJuZuyxY PT3M1S
6 bSOd8r4wjec PT6M29S
7 522BBAsijU0 PT10M51S
8 1P55j9ub4es PT14M59S
9 da8JtU1YAyc PT3M4S
10 4MpYuaJsvRw PT8M27S
11 _NbbtnXkL-k PT2M53S
12 3q1JN_3s3gw PT6M17S
13 7A-4-S_k_rk PT9M37S
14 txKUTx5fNbg PT10M2S
15 TSSPDwAQLXs PT3M11S
16 NOHEZSVzpT8 PT7M51S
17 4rTMdQzsm6U PT17M24S
18 V9eeg8d9XEg PT10M35S
19 K4TWAvZPURg PT3M3S
20 rR9wq5uN_q8 PT4M53S
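One caveat worth noting: the videos endpoint accepts at most 50 IDs per call, so for searches returning more than 50 videos you would have to chunk the IDs. A minimal sketch, reusing my_yt_duration() from above (get_durations_chunked is a made-up helper name):
get_durations_chunked <- function(ids) {
  # split the ids into groups of at most 50 and row-bind the resulting tibbles
  chunks <- split(ids, ceiling(seq_along(ids) / 50))
  dplyr::bind_rows(lapply(chunks, my_yt_duration))
}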

Extract and match sets from list of filenames

I have a dataset of 4000+ images. For the purpose of figuring out the code, I moved a small subset of them to another folder.
The files look like this:
folder
[1] "r01c01f01p01-ch3.tiff" "r01c01f01p01-ch4.tiff" "r01c01f02p01-ch1.tiff"
[4] "r01c01f03p01-ch2.tiff" "r01c01f03p01-ch3.tiff" "r01c01f04p01-ch2.tiff"
[7] "r01c01f04p01-ch4.tiff" "r01c01f05p01-ch1.tiff" "r01c01f05p01-ch2.tiff"
[10] "r01c01f06p01-ch2.tiff" "r01c01f06p01-ch4.tiff" "r01c01f09p01-ch3.tiff"
[13] "r01c01f09p01-ch4.tiff" "r01c01f10p01-ch1.tiff" "r01c01f10p01-ch4.tiff"
[16] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
[19] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
[22] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
I cannot remove the part of the name before the -ch#, as that information is important. What I want to do, however, is filter this list of images and return only sets (i.e. r01c02f10p01) which have all four ch values (ch1-ch4).
I was originally thinking that we could approach the issue along the lines of this:
ch1 <- dir(path="/Desktop/cp/complete//", pattern="ch1")
ch2 <- dir(path="/Desktop/cp/complete//", pattern="ch2")
ch3 <- dir(path="/Desktop/cp/complete//", pattern="ch3")
ch4 <- dir(path="/Desktop/cp/complete//", pattern="ch4")
and then applying these lists with the file.remove function, similar to this:
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
file.remove(folder,final2)
However, creating a new variable for each ch value just fragments the file list, and I am unsure how to use these variables to determine whether an individual image set has all four ch values and filter my images accordingly. I'm kind of at a loss, as the other sources I've seen address issues that don't quite match this problem.
Earlier, I was able to remove all images with ch5 from my image set like this. I was thinking this may be helpful in trying to filter only images which have ch1-ch4, but I'm not sure how to proceed.
##Create folder variable which has all image files
folder <- list.files(getwd())
##Create final2 variable which has all image files ending in ch5
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
##Remove final2 from folder
file.remove(folder,final2)
To summarize: I expect to filter the files from a random assortment without complete ch values (i.e. maybe only ch1 and ch2, or ch3 and ch4) down to an assortment which contains only files that form a complete set (four files with ch1, ch2, ch3, and ch4).
Starting with a vector of filenames like you would get from list.files or something similar, you can create a data frame of filenames, use regex to extract the alphanumeric part at the beginning and the number that follows "-ch". Then check that all elements of an expected set (I put this in ch_set, but there might be another way you need to do this) occur in each group's set of CH values.
# assume this is the vector of file names that comes from list.files
# or something comparable
files <- c("r01c01f01p01-ch3.tiff", "r01c01f01p01-ch4.tiff", "r01c01f02p01-ch1.tiff", "r01c01f03p01-ch2.tiff", "r01c01f03p01-ch3.tiff", "r01c01f04p01-ch2.tiff", "r01c01f04p01-ch4.tiff", "r01c01f05p01-ch1.tiff", "r01c01f05p01-ch2.tiff", "r01c01f06p01-ch2.tiff", "r01c01f06p01-ch4.tiff", "r01c01f09p01-ch3.tiff", "r01c01f09p01-ch4.tiff", "r01c01f10p01-ch1.tiff", "r01c01f10p01-ch4.tiff", "r01c01f11p01-ch1.tiff", "r01c01f11p01-ch2.tiff", "r01c01f11p01-ch3.tiff", "r01c01f11p01-ch4.tiff", "r01c02f10p01-ch1.tiff", "r01c02f10p01-ch2.tiff", "r01c02f10p01-ch3.tiff", "r01c02f10p01-ch4.tiff")
library(dplyr)
ch_set <- 1:4
files_to_keep <- data.frame(filename = files, stringsAsFactors = FALSE) %>%
  tidyr::extract(filename, into = c("group", "ch"), regex = "(^[\\w\\d]+)\\-ch(\\d)", remove = FALSE) %>%
  mutate(ch = as.numeric(ch)) %>%
  group_by(group) %>%
  filter(all(ch_set %in% ch))
files_to_keep
#> # A tibble: 8 x 3
#> # Groups: group [2]
#> filename group ch
#> <chr> <chr> <dbl>
#> 1 r01c01f11p01-ch1.tiff r01c01f11p01 1
#> 2 r01c01f11p01-ch2.tiff r01c01f11p01 2
#> 3 r01c01f11p01-ch3.tiff r01c01f11p01 3
#> 4 r01c01f11p01-ch4.tiff r01c01f11p01 4
#> 5 r01c02f10p01-ch1.tiff r01c02f10p01 1
#> 6 r01c02f10p01-ch2.tiff r01c02f10p01 2
#> 7 r01c02f10p01-ch3.tiff r01c02f10p01 3
#> 8 r01c02f10p01-ch4.tiff r01c02f10p01 4
Now that you have a dataframe of the complete groups, just pull the matching filenames back out:
files_to_keep$filename
#> [1] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
#> [4] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
#> [7] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
One thing to note is that this worked even without the mutate line that converts ch to numeric (i.e. comparing character versions of those numbers against regular numeric ones), because under the hood %in% converts its arguments to matching types. That didn't seem totally safe if you need to scale this, so I converted them explicitly to matching types.
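If you'd rather avoid the dplyr/tidyr dependency, here is a base-R sketch of the same idea (group on the stem before "-ch", keep groups containing all four channels; the regex assumes the exact naming pattern shown above):
stem <- sub("-ch\\d\\.tiff$", "", files)
ch <- as.integer(sub("^.*-ch(\\d)\\.tiff$", "\\1", files))
# per stem, TRUE when the group contains every channel 1-4
keep <- as.logical(ave(ch, stem, FUN = function(x) all(1:4 %in% x)))
files[keep]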

R & xml2: Locate elements by specific text value, store all children values in data.frame

I work with regularly refreshed XML reports and I would like to automate the munging process using R & xml2.
Here's a link to an entire example file.
Here's a sample of the XML:
<?xml version="1.0" ?>
<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
<includedFileHeader>
<outboundFileIdentifier>f2e55625-e70e-4f9d-8278-fc5de7c04d47</outboundFileIdentifier>
<cmsBatchIdentifier>RIP-2015-00096</cmsBatchIdentifier>
<cmsJobIdentifier>16220</cmsJobIdentifier>
<snapShotFileName>25032.BACKUP.D03152016T032051.dat</snapShotFileName>
<snapShotFileHash>20d887c9a71fa920dbb91edc3d171eb64a784dd6</snapShotFileHash>
<outboundFileGenerationDateTime>2016-03-15T15:20:54</outboundFileGenerationDateTime>
<interfaceControlReleaseNumber>04.03.01</interfaceControlReleaseNumber>
<edgeServerVersion>EDGEServer_14.09_01_b0186</edgeServerVersion>
<edgeServerProcessIdentifier>8</edgeServerProcessIdentifier>
<outboundFileTypeCode>RIDE</outboundFileTypeCode>
<edgeServerIdentifier>2800273</edgeServerIdentifier>
<issuerIdentifier>25032</issuerIdentifier>
</includedFileHeader>
<calendarYear>2015</calendarYear>
<executionType>P</executionType>
<includedInsuredMemberIdentifier>
<insuredMemberIdentifier>ARS001</insuredMemberIdentifier>
<memberMonths>12.13</memberMonths>
<totalAllowedClaims>1000.00</totalAllowedClaims>
<totalPaidClaims>100.00</totalPaidClaims>
<moopAdjustedPaidClaims>100.00</moopAdjustedPaidClaims>
<cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
<estimatedRIPayment>0.00</estimatedRIPayment>
<coinsurancePercentPayments>0.00</coinsurancePercentPayments>
<includedPlanIdentifier>
<planIdentifier>25032VA013000101</planIdentifier>
<includedClaimIdentifier>
<claimIdentifier>CADULT4SM00101</claimIdentifier>
<claimPaidAmount>100.00</claimPaidAmount>
<crossYearClaimIndicator>N</crossYearClaimIndicator>
</includedClaimIdentifier>
</includedPlanIdentifier>
</includedInsuredMemberIdentifier>
<includedInsuredMemberIdentifier>
<insuredMemberIdentifier>ARS002</insuredMemberIdentifier>
<memberMonths>9.17</memberMonths>
<totalAllowedClaims>0.00</totalAllowedClaims>
<totalPaidClaims>0.00</totalPaidClaims>
<moopAdjustedPaidClaims>0.00</moopAdjustedPaidClaims>
<cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
<estimatedRIPayment>0.00</estimatedRIPayment>
<coinsurancePercentPayments>0.00</coinsurancePercentPayments>
<includedPlanIdentifier>
<planIdentifier>25032VA013000101</planIdentifier>
<includedClaimIdentifier>
<claimIdentifier></claimIdentifier>
<claimPaidAmount>0</claimPaidAmount>
<crossYearClaimIndicator>N</crossYearClaimIndicator>
</includedClaimIdentifier>
</includedPlanIdentifier>
</includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>
I would like to:
Read in the XML into R
Locate a specific insuredMemberIdentifier
Extract the planIdentifier and all claimIdentifier data associated with the member ID in (2)
Store all text and values for insuredMemberIdentifier, planIdentifier, claimIdentifier, and claimPaidAmount in a data.frame with a row for each unique claim ID (member ID to claim ID is a 1 to many)
So far, I have accomplished 1 and I'm in the ballpark on 2:
## Step 1 ##
ride <- read_xml("/Users/temp/Desktop/RIDetailEnrolleeReport.xml")
## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
memID <- xml_find_all(ride, "//d1:insuredMemberIdentifier[text()='ARS001']", xml_ns(ride))
[I know that I can then use xml_text() to extract the text of the element.]
After the code in Step 2 above, I've tried using xml_parent() to locate the parent node of the insuredMemberIdentifier, saving that as a variable, and then repeating Step 2 for claim info on that saved variable node.
node <- xml_parent(memID)
xml_find_all(node, "//d1:claimIdentifier", xml_ns(ride))
But this just results in pulling all claimIdentifiers in the global file.
Any help/information on how to get to step 4, above, would be greatly appreciated. Thank you in advance.
Apologies for the late response, but for posterity: import the data as above using xml2, then parse the XML file by ID, as hinted by har07.
# output object to collect all claims
res <- data.frame(
  insuredMemberIdentifier = rep(NA, 1),
  planIdentifier = NA,
  claimIdentifier = NA,
  claimPaidAmount = NA)
# vector of ids of interest
ids <- c('ARS001')
# indexing counter
starti <- 1
# loop through all ids
for (ii in seq_along(ids)) {
  # find ii-th id
  ## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
  memID <- xml_find_all(x = ride,
    xpath = paste0("//d1:insuredMemberIdentifier[text()='", ids[ii], "']"))
  # find the parent node of the matched member ID
  node <- xml_parent(memID)
  # per har07's comment, use a relative .// XPath so the search stays within this node
  cid <- xml_find_all(node, ".//d1:claimIdentifier", xml_ns(ride))
  pid <- xml_find_all(node, ".//d1:planIdentifier", xml_ns(ride))
  cpa <- xml_find_all(node, ".//d1:claimPaidAmount", xml_ns(ride))
  # add invalid data handling if necessary
  if (length(cid) != length(cpa)) {
    warning(paste("cid and cpa do not match for", ids[ii]))
    next
  }
  # collect outputs
  res[seq_along(cid) + starti - 1, ] <- list(
    ids[ii],
    xml_text(pid),
    xml_text(cid),
    xml_text(cpa))
  # adjust counter to add next id into correct row
  starti <- starti + length(cid)
}
res
res
# insuredMemberIdentifier planIdentifier claimIdentifier claimPaidAmount
# 1 ARS001 25032VA013000101 CADULT4SM00101 100.00
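For a single member ID the same extraction can also be written without the manual row counter, letting data.frame() recycle the length-one values; a sketch assuming the ride object from above:
memID <- xml_find_all(ride, "//d1:insuredMemberIdentifier[text()='ARS001']")
node <- xml_parent(memID)
# one row per claim; member and plan values are recycled across the claim rows
res1 <- data.frame(
  insuredMemberIdentifier = xml_text(memID),
  planIdentifier = xml_text(xml_find_all(node, ".//d1:planIdentifier")),
  claimIdentifier = xml_text(xml_find_all(node, ".//d1:claimIdentifier")),
  claimPaidAmount = xml_text(xml_find_all(node, ".//d1:claimPaidAmount")),
  stringsAsFactors = FALSE)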
