How to edit nested/child DICOM tags for sequences using pyDicom? - dicom

I'm trying to set all the TIME (VR = TM) tags contained within the DICOM dataset to a placeholder value, using pyDicom.
I can replace the value of all the TIME (VR = TM) tags that are at the root of the DICOM metadata:
import logging
import os

import pydicom

TIME_TAGS = [
    (0x10, 0x32),   # Patient's Birth Time
    (0x40, 0x245),  # Performed Procedure Step Start Time
    (0x8, 0x13),    # Instance Creation Time
    (0x8, 0x30),    # Study Time
    (0x8, 0x31),    # Series Time
    (0x8, 0x32),    # Acquisition Time
    (0x8, 0x33),    # Content Time
]
TIME_VAL_REPLACEMENT = '120000'
def _clear_times(dir_name: str) -> None:
    '''
    Set all DICOM standard (i.e. non-vendor) time tags
    to a non-unique value (defined in _presets.py)

    dir_name:
        full path of the directory to process
    '''
    for file_name in os.listdir(dir_name):
        dcm_path = os.path.join(dir_name, file_name)
        dcmfile = pydicom.dcmread(dcm_path)
        # _presets defines which time tags to change
        for time_tag in TIME_TAGS:
            element = dcmfile.get(time_tag, None)
            if element is not None and element.VR == 'TM':
                logging.debug("Replacing time (%s)", time_tag)
                dcmfile[time_tag] = pydicom.dataelem.DataElement(
                    time_tag, 'TM', TIME_VAL_REPLACEMENT)
            else:
                logging.debug("%s not set", time_tag)
        dcmfile.save_as(dcm_path)
However, this misses nested/child tags for sequences.
What's the best way, using pyDicom, to remove all relevant nested/child tags too?

Using the walk method with a callback is probably the easiest way to do this. You could see the source code for remove_private_tags() as an example, but your function would check for VR of 'TM' and act on those data elements.
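For illustration, a minimal sketch of that approach (the callback name and file path below are placeholders, not from the original post):

import pydicom

TIME_VAL_REPLACEMENT = '120000'

def _replace_tm(dataset, data_element):
    # walk() calls this for every data element and recurses into sequence
    # items, so nested/child TM tags are handled as well
    if data_element.VR == 'TM':
        data_element.value = TIME_VAL_REPLACEMENT

ds = pydicom.dcmread('some_file.dcm')
ds.walk(_replace_tm)
ds.save_as('some_file.dcm')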

Related

Is there a way to restart a for loop at its last index in R?

I am in the process of downloading and processing a substantial amount of air quality data from an API through the jsonlite library in R. Because of the amount of data, I thought it prudent to write a script that would automate the entire process. These data encompass all 50 States throughout the US, as well as four different air pollutants. The timeline for this is January 1st, 2015 through December 31st, 2019. The state, air pollutant code, begin date, and end date are the four parameters that are submitted at each query iteration to the API.
Since I know this will take quite a bit of time to download and process all the data, I am wondering if there is a way that I can run the script for a while, stop it when I need to do something else, and then restart the script at its last iteration, or rather, at its last index in the vectors containing the values passed into the API URL. I did look up other similar questions but could not find something that accurately fit my situation.
Because this script is querying a web API, I thought that the best approach in this case would be to automate the API calls and data processing through nested for loops. For the "for loop" portion, it would begin on the first state, first pollutant parameter, and then iterate through each beginning date and end date (the API only accepts timelines of up to one year). Then it would go to the next pollutant and iterate through each beginning date and ending date, so on and so forth. Is there a way I could keep track of the iterations and pass that back into the for loops once I restart the script? I am struggling to come up with the logic for this.
Also, I am still familiarizing myself with R after doing much of my data processing in Python and SQL, so please bear with me if my code is not very efficient or more complicated than what is necessary. I did not include the actual data processing portion of the code, only the vectors and the for loops for iterating through those vectors.
# set working directory
setwd('Path_to_directory_here')

# import jsonlite library
library(jsonlite)

# create state FIPS list
fips_list <- c("01", "02", "04", "05",
               "06", "08", "09", "10",
               "12", "13", "15", "16",
               "17", "18", "19", "20",
               "21", "22", "23", "24",
               "25", "26", "27", "28",
               "29", "30", "31", "32",
               "33", "34", "35", "36",
               "37", "38", "39", "40",
               "41", "42", "44", "45",
               "46", "47", "48", "49",
               "50", "51", "53", "54",
               "55", "56")

# create list of state names corresponding with state FIPS codes
# this was created to specify which state the data is from in the file name when writing to a .CSV file
fips_names <- c('AL', 'AK', 'AZ', 'AR',
                'CA', 'CO', 'CT', 'DE',
                'FL', 'GA', 'HI', 'ID',
                'IL', 'IN', 'IA', 'KS',
                'KY', 'LA', 'ME', 'MD',
                'MA', 'MI', 'MN', 'MS',
                'MO', 'MT', 'NE', 'NV',
                'NH', 'NJ', 'NM', 'NY',
                'NC', 'ND', 'OH', 'OK',
                'OR', 'PA', 'RI', 'SC',
                'SD', 'TN', 'TX', 'UT',
                'VT', 'VA', 'WA', 'WV',
                'WI', 'WY')

# create a key/value pair for FIPS codes and state names
fips_states <- setNames(as.list(fips_names), fips_list)
fips_key <- names(fips_states)
fips_val <- fips_states[fips_list]

# same procedure as FIPS codes and state names
param_list <- c('88101', '88501', '42602', '44201')
param_names <- c('PM25', 'PM25_LC', 'NO2', 'O3')  # specifies which pollutant was pulled in the file name
params <- setNames(as.list(param_names), param_list)
param_key <- names(params)
param_val <- params[param_list]

# same as above
begin_yr <- c('20150101', '20160101', '20170101', '20180101', '20190101')
end_yr <- c('20151231', '20161231', '20171231', '20181231', '20191231')
yr_list <- setNames(as.list(end_yr), begin_yr)
key <- names(yr_list)
val <- yr_list[begin_yr]

# keep track of files processed
file_tracker <- 0
for (x in 1:length(fips_states)) {
  for (y in 1:length(params)) {
    for (z in 1:length(yr_list)) {
      file_tracker <- file_tracker + 1
      tracker_msg <- sprintf('Reading in State: %s, parameter: %s, timeframe: %s',
                             fips_val[x], param_val[y], key[z])
      print(tracker_msg)
      # call to API
      url <- sprintf("https://aqs.epa.gov/data/api/sampleData/byState?email=MY_EMAIL&key=MY_KEY&param=%s&bdate=%s&edate=%s&state=%s",
                     param_key[y], key[z], val[z], fips_key[x])
      data <- fromJSON(txt = url)
      # rest of the data formatting and processing here
    }
  }
}
I can include more code if necessary. Thanks for any help provided.
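As a minimal sketch of one way to do this (the checkpoint file name and the use of saveRDS()/readRDS() are assumptions, not from the original post), the last completed value of file_tracker can be written to disk after each iteration, and any iteration at or below that value skipped when the script restarts:

# sketch: make the nested loops resumable by checkpointing the counter
checkpoint_file <- "progress.rds"
last_done <- if (file.exists(checkpoint_file)) readRDS(checkpoint_file) else 0

file_tracker <- 0
for (x in 1:length(fips_states)) {
  for (y in 1:length(params)) {
    for (z in 1:length(yr_list)) {
      file_tracker <- file_tracker + 1
      if (file_tracker <= last_done) next  # already processed in a previous run
      url <- sprintf("https://aqs.epa.gov/data/api/sampleData/byState?email=MY_EMAIL&key=MY_KEY&param=%s&bdate=%s&edate=%s&state=%s",
                     param_key[y], key[z], val[z], fips_key[x])
      data <- fromJSON(txt = url)
      # ...data formatting and processing here...
      saveRDS(file_tracker, checkpoint_file)  # record progress after each success
    }
  }
}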

How to input multiple departure dates and times in gmapsdistance R package?

I am using the R package, gmapsdistance, to extract Google distance matrix.
My codes are:
gdm <- gmapsdistance(origin = origin_vector,
                     destination = destination_vector,
                     combinations = "pairwise",
                     mode = "driving",
                     shape = "long",
                     dep_date = date_vector,
                     dep_time = time_vector)
The Google API key was also set. The code works until I add the dep_date and dep_time parameters; I believe this is because those parameters do not accept vectors. Two errors appear:
1: XML declaration allowed only at the start of the document.
2: Extra content at the end of the document.
In my dataset, each row is a data point with a unique origin, destination, departure date, and departure time. I require the Google distance/time for each row and there are thousands of them. There is no need for me to compare across rows. How may I do this?
Appreciate any help!
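One way to handle this (a sketch under the assumption that dep_date and dep_time only accept single values, so each row needs its own call) is to loop over the rows and query gmapsdistance once per origin/destination/date/time combination:

library(gmapsdistance)

# sketch: one API call per row of the dataset
results <- mapply(function(o, d, dt, tm) {
  res <- gmapsdistance(origin = o,
                       destination = d,
                       mode = "driving",
                       dep_date = dt,
                       dep_time = tm)
  c(time = res$Time, distance = res$Distance)
}, origin_vector, destination_vector, date_vector, time_vector)

# one row per original data point
results <- as.data.frame(t(results))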

Calling getSibling() in R to extract single nodes from XML file causes crash

I am attempting to extract one node at a time from a very large (~620 MB) XML file using an R script. Each main node that I want to access corresponds to a different drug, and all of the nodes are parallel to each other. My aim is to process the entire file, one node at a time, since trying to read the entire file into memory does not work with the XML parser in R.
I have significantly truncated my large XML file into a much smaller example file that contains only 4 nodes; the beginning of this XML file looks like:
<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" version="5.0" exported-on="2017-07-06">
  <drug type="biotech" created="2005-06-13" updated="2016-08-17">
    <drugbank-id primary="true">DB00001</drugbank-id>
    <drugbank-id>BTD00024</drugbank-id>
    <drugbank-id>BIOD00024</drugbank-id>
    <name>Lepirudin</name>
    <description>Lepirudin is identical to natural hirudin except for substitution of leucine for isoleucine at the N-terminal end of the molecule and the absence of a sulfate group on the tyrosine at position 63. It is produced via yeast cells. Bayer ceased the production of lepirudin (Refludan) effective May 31, 2012.</description>
    <cas-number>138068-37-8</cas-number>
    <unii>Y43GF64R34</unii>
    <state>liquid</state>
    <groups>
      <group>approved</group>
    </groups>
Having reviewed the available options, and wanting to use the R script that I have already written that extracts desired fields from the XML file (it works for small XML files, but fails for the large file), it seems that using the getSibling() function in the XML library in R is my best choice. The following example code (from http://svitsrv25.epfl.ch/R-doc/library/XML/html/addSibling.html ) works to extract the single node in this example file:
f = system.file("exampleData", "job.xml", package = "XML")
tt = as(xmlParse(f), "XMLHashTree")
x = xmlRoot(tt, skip = FALSE)
DesiredOutput <- getSibling(x)
# I'm still not sure how to "walk" to the next sibling after the above process
# completes, since this example file only contains one node, and there is no
# simple way to increment a counter using the above code
That example job.xml file begins as follows:
<!-- Initial Comment -->
<gjob:Helping xmlns:gjob="http://www.gnome.org/some-location">
  <gjob:Jobs>
    <gjob:Job>
      <gjob:Project ID="3"/>
      <gjob:Application>GBackup</gjob:Application>
      <gjob:Category>Development</gjob:Category>
      <gjob:Update>
        <gjob:Status>Open</gjob:Status>
        <gjob:Modified>Mon, 07 Jun 1999 20:27:45 -0400 MET DST</gjob:Modified>
        <gjob:Salary>USD 0.00</gjob:Salary>
      </gjob:Update>
      <gjob:Developers>
        <gjob:Developer>
        </gjob:Developer>
      </gjob:Developers>
However, if I substitute my own XML file (small version of the full file; I have checked that it is legitimate XML format, as my R script correctly runs to process it), the following code crashes R:
f = "MyTruncatedExampleFile.xml" -> this line causes R to crash
tt = as(xmlParse(f), "XMLHashTree")
x = xmlRoot(tt, skip = FALSE)
DesiredOutput <- getSibling(x)
Can anyone suggest why my own small XML file would cause a crash, but the example job.xml file runs correctly? Thanks in advance for your insights.
Apparently the crash is caused by the default (unprefixed) namespace in the truncated XML: xmlns="http://www.drugbank.ca". If you remove it, the method does not crash R. Do note that such namespaces are valid in XML documents, so this issue should be raised with the XML package authors. Also, since <drug> does not have a sibling in the truncated XML, xmlChildren() is used below in place of getSibling().
# CRASHES
f = "DrugBank.xml"
tt = as(xmlParse(f), "XMLHashTree")
x = xmlRoot(tt, skip = FALSE)
DesiredOutput <- xmlChildren(x)[[1]]
DesiredOutput
# NO CRASHES
f = "DrugBank_oth.xml" # REMOVED UNDEFINED NAMESPACE PREFIX
tt = as(xmlParse(f), "XMLHashTree")
x = xmlRoot(tt, skip = FALSE)
DesiredOutput <- xmlChildren(x)[[1]]
DesiredOutput
As a workaround that does not modify the original XML, consider getNodeSet, where you define a prefix for the document's namespace and use the XPath /*/* to reach the children of the root. The index [[1]] takes the first node instead of all of them. Here, web is used as the prefix, but it can be anything.
# NO CRASHES
f = "DrugBank.xml"
doc = xmlParse(f)
nmsp = c(web="http://www.drugbank.ca") # DEFINE NAMESPACE PREFIX
DesiredOutput <- getNodeSet(doc, "/*/*", nmsp)[[1]]
DesiredOutput
# <drug type="biotech" created="2005-06-13" updated="2016-08-17">
# <drugbank-id primary="true">DB00001</drugbank-id>
# <drugbank-id>BTD00024</drugbank-id>
# <drugbank-id>BIOD00024</drugbank-id>
# <name>Lepirudin</name>
# <description>Lepirudin is identical to natural hirudin except for
# substitution of leucine for isoleucine at the N-terminal
# end of the molecule and the absence of a sulfate group on
# the tyrosine at position 63. It is produced via yeast
# cells. Bayer ceased the production of lepirudin (Refludan)
# effective May 31, 2012.</description>
# <cas-number>138068-37-8</cas-number>
# <unii>Y43GF64R34</unii>
# <state>liquid</state>
# <groups>
# <group>approved</group>
# </groups>
# </drug>
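Building on that, a minimal sketch of how every <drug> node could then be processed one at a time (the extracted fields and the drug_info structure below are assumptions based on the sample shown above):

library(XML)

f <- "DrugBank.xml"
doc <- xmlParse(f)
nmsp <- c(web = "http://www.drugbank.ca")

# one list element per <drug> node
drugs <- getNodeSet(doc, "/*/*", nmsp)

drug_info <- lapply(drugs, function(node) {
  data.frame(id   = xpathSApply(node, "./web:drugbank-id[@primary='true']", xmlValue, namespaces = nmsp),
             name = xpathSApply(node, "./web:name", xmlValue, namespaces = nmsp),
             stringsAsFactors = FALSE)
})
drug_info <- do.call(rbind, drug_info)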

Read and parse an irregular and mixed ASCII file into R

I'm still very new to R and I apologize if I'm not using the proper terminology. I'm interested in pulling a large amount of Unemployment Insurance Trust Fund data from the Treasury Direct online report query system (http://www.treasurydirect.gov/govt/reports/tfmp/tfmp_utf.htm) and I've successfully pulled the information using readLines.
ESAA_OCT15 <- readLines('http://www.treasurydirect.gov/govt/reports/tfmp/utf/es/dfiw01015tses.txt')
which gives me the chart as a string vector.
Is there a way to then parse the lines and turn them into a data frame, so I can at least put it in Excel and easily pull the important information out of it? I'm sure there is another way to do this as well, but the reports will always vary in which accounting code sections are included and how many individual transactions are listed, so I'm not even sure where to begin with that.
The items I need are the date, the share/par (dollar transaction amount), the transaction code, and the transaction description. The totals would be useful but are by no means necessary.
This will help you parse the information:
ESAA_OCT15 <- readLines('http://www.treasurydirect.gov/govt/reports/tfmp/utf/es/dfiw01015tses.txt')
# Select lines with /
z = grepl(pattern = "/",x = ESAA_OCT15)
d = trimws(ESAA_OCT15[z])
dates = substr(d,0,10)
sharesPar = substr(d,11,41)
What this does is first select all lines that contain a / character. This will even return the column titles. These are stored in d.
If you examine d:
[1] "Effective Date Shares/Par Description Code Memo Number Code Account Number"
[2] "10/01/2015 2,313,000.0000 12-10 FUTA RECEIPTS 3305617 ESAA"
[3] "10/01/2015 3,663,000.0000 12-10 FUTA RECEIPTS 3305618 ESAA"
[4] "10/02/2015 4,314,000.0000 12-10 FUTA RECEIPTS 3305640 ESAA"
[5] "10/05/2015 3,512,000.0000 12-10 FUTA RECEIPTS 3305662 ESAA"
The information is aligned neatly. This means that the data of each column ends at a precise position. To parse this you can use substr with start and stop as shown in my script.
Of course, I did not complete all the parsing; I'll let you finish the rest. Once each column is parsed, create a data.frame(dates, sharesPar, ...).
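As a minimal sketch of that last step (the start/stop positions and column names below are estimates from the sample lines, not measured against the actual report), the remaining columns could be parsed the same way and combined:

# sketch: remaining columns; positions would need checking against the real layout
descrCode <- trimws(substr(d, 42, 72))
memoNum   <- trimws(substr(d, 73, 85))
acctCode  <- trimws(substr(d, 86, 100))

# drop the header line that the "/" filter also matched, then combine
utf <- data.frame(dates      = dates[-1],
                  sharesPar  = trimws(sharesPar[-1]),
                  descrCode  = descrCode[-1],
                  memoNum    = memoNum[-1],
                  acctCode   = acctCode[-1],
                  stringsAsFactors = FALSE)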
It's a fixed-width format, so it should be treated as such:
library(dplyr)
library(readr)

readLines("http://www.treasurydirect.gov/govt/reports/tfmp/utf/es/dfiw01015tses.txt") %>%
  grep("^ +[[:digit:]]+/[[:digit:]]+", ., value = TRUE) %>%   # grab only the lines with data
  textConnection() %>%
  read.fwf(widths = c(19, 26, 27, 15, 10, 27), skip = 7) %>%  # read the columns
  mutate_all(trimws) %>%       # clean them up
  type_convert() %>%           # you still need to convert the date even with this type conversion
  setNames(c("effective_date", "shares_per",                  # add decent colnames
             "trans_descr_code", "memo_num", "location_code", "acct_no"))

Creating SpatialLinesDataFrame from SpatialLines object and basic df

Using leaflet, I'm trying to plot some lines and set their color based on a 'speed' variable. My data start at an encoded polyline level (i.e. a series of lat/long points, encoded as an alphanumeric string) with a single speed value for each EPL.
I'm able to decode the polylines to get a lat/long series of points (thanks to Max, here), and I'm able to create segments from those series of points and format them as a SpatialLines object (thanks to Kyle Walker, here).
My problem: I can plot the lines properly using leaflet, but I can't join the SpatialLines object to the base data to create a SpatialLinesDataFrame, and so I can't code the line color based on the speed var. I suspect the issue is that the IDs I'm assigning SL segments aren't matching to those present in the base df.
The objects I've tried to join, with SpatialLinesDataFrame():
"sl_object", a SpatialLines object with ~140 observations, one for each segment; I'm using Kyle's code, linked above, with one key change - instead of creating an arbitrary iterative ID value for each segment, I'm pulling the associated ID from my base data. (Or at least I'm trying to.) So, I've replaced:
id <- paste0("line", as.character(p))
with
lguy <- data.frame(paths[[p]][1])
id <- unique(lguy[,1])
"speed_object", a df with ~140 observations of a single speed var and row.names set to the same id var that I thought I created in the SL object above. (The number of observations will never exceed but may be smaller than the number of segments in the SL object.)
My joining code:
splndf <- SpatialLinesDataFrame(sl = sl_object, data = speed_object)
And the result:
row.names of data and Lines IDs do not match
Thanks, all. I'm posting this in part because I've seen some similar questions - including some referring specifically to changing the ID output of Kyle's great tool - and haven't been able to find a good answer.
EDIT: Including data samples.
From sl_obj, a single segment:
print(sl_obj)
Slot "ID":
[1] "4763655"
[[151]]
An object of class "Lines"
Slot "Lines":
[[1]]
An object of class "Line"
Slot "coords":
lon lat
1955 -74.05228 40.60397
1956 -74.05021 40.60465
1957 -74.04182 40.60737
1958 -74.03997 40.60795
1959 -74.03919 40.60821
And the corresponding record from speed_obj:
row.names speed
... ...
4763657 44.74
4763655 34.8 # this one matches the ID above
4616250 57.79
... ...
To get rid of this error message, either make the row.names of data and Lines IDs match by preparing sl_object and/or speed_object, or, in case you are certain that they should be matched in the order they appear, use
splndf <- SpatialLinesDataFrame(sl = sl_object, data = speed_object, match.ID = FALSE)
This is documented in ?SpatialLinesDataFrame.
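For the first option, a minimal sketch of making the row.names follow the Lines IDs (assuming speed_object's row.names are the segment IDs, as in the sample data above):

# sketch: reorder/subset speed_object so its row.names match the Lines IDs
line_ids <- sapply(slot(sl_object, "lines"), function(l) slot(l, "ID"))
speed_matched <- speed_object[line_ids, , drop = FALSE]  # NA rows for IDs with no speed value
row.names(speed_matched) <- line_ids
splndf <- SpatialLinesDataFrame(sl = sl_object, data = speed_matched)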
All right, I figured it out. The error wasn't liking the fact that my speed_obj wasn't the same length as my sl_obj, as mentioned here ("data = object of class data.frame; the number of rows in data should equal the number of Lines elements in sl").
Resolution: used a quick loop to pull out all of the unique lines IDs, then performed a left join against that list of uniques to create an exhaustive speed_obj (with NAs, which seem to be OK).
library(plyr)  # assuming join() here comes from plyr

ids <- data.frame()
for (i in 1:length(sl_obj)) {
  id <- data.frame(sl_obj@lines[[i]]@ID)
  ids <- rbind(ids, id)
}
colnames(ids)[1] <- "linkId"
speed_full <- join(ids, speed_obj)
speed_full_short <- data.frame(speed_full[, c(-1)])
row.names(speed_full_short) <- speed_full$linkId
splndf <- SpatialLinesDataFrame(sl_obj, data = speed_full_short, match.ID = T)
Works fine now!
I may have deciphered the issue.
When I pull in my spatial lines data and check its class, it reads as "SpatialLinesDataFrame" even though I know it's a simple linear shapefile. I'm using readOGR to bring the data in, and I believe this is where the conversion occurs. With that in mind, the speed assignment is relatively easy.
sl_object$speed <- speed_object[ match( sl_object$ID , row.names( speed_object ) ) , "speed" ]
This should do the trick, as I'm willing to bet your class(sl_object) is "SpatialLinesDataFrame".
EDIT: I had received the same error as OP, driving me to check class()
I am under the impression that the error that was populated for you is because you were trying to coerce a data frame into a data frame and R wasn't a fan of that.
