Adding files that rely on pipeline outputs - dvc

In my workflow, I do the following:
1. Acquire raw data (e.g. a video containing people).
2. Transform it (e.g. automatically extract all crops with faces).
3. Manually label them (e.g. identify the person in each crop). The labels are stored in JSON files along with the crops.
4. Train a model on these data.
How should I track this pipeline with DVC?
My concerns:
- If stage 2 is changed (e.g. crops are extracted with a different size), the manual data should be invalidated (and so should the final model).
- Stage 3 is manual and therefore not precisely reproducible, but I do need its input to be reproducible.
- Stage 4 has an element of randomness, so it's not precisely reproducible either.

Stage 3 is manual, so you can't really codify or automate it, nor guarantee its reproducibility (due to possible human error). But there's a way to get as close as possible:
You could replace it with a helper script that just checks whether all the labels have been annotated. If so, it writes a text file with the content "green"; otherwise it writes "red" (for example) and errors out.
Stage 4 should depend on the outputs of both stages 2 and 3, so it will only run when the face crops have changed AND they are fully annotated.
Internally, it first checks the semaphore file (from stage 3) and dies on red. On green, it trains the model :)
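For illustration, such a check script could be as small as the sketch below. The file layout, the label schema and the status-file name are assumptions, not anything DVC prescribes:
import json
import sys
from pathlib import Path

CROPS_DIR = Path("data/crops")              # assumed location of the crops
STATUS_FILE = Path("data/labels_status.txt")

def main():
    missing = []
    for crop in CROPS_DIR.glob("*.jpg"):    # assumed .jpg crops
        label = crop.with_suffix(".json")
        if not label.exists():
            missing.append(crop.name)
            continue
        data = json.loads(label.read_text())
        if not data.get("person"):          # assumed label schema
            missing.append(crop.name)
    status = "green" if not missing else "red"
    STATUS_FILE.write_text(status + "\n")
    if missing:
        print(f"{len(missing)} crops are not labelled yet", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()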
The DAG looks like this:
          +-----------+
          | 1-acquire |
          +-----------+
                *
                *
                *
           +---------+
           | 2-xform |
           +---------+
            **      **
you -->    **        **
          *            **
    +---------+          *
    | 3-check |         **
    +---------+        **
            **        **
              **     **
                *   *
            +---------+
            | 4-train |
            +---------+
Re: randomness — while not ideal, non-determinism really only matters in intermediate stages of the pipeline, because it causes everything downstream to re-run every time. Since here it is in the last stage, it won't affect DVC's job.
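For reference, wired up as a dvc.yaml (assuming a recent DVC version that uses one) the pipeline could look roughly like the following. Stage names, scripts and paths are placeholders, and where exactly the manually written label JSONs live is up to you; the point is that 4-train depends on both the crops and the semaphore file written by 3-check:
stages:
  acquire:
    cmd: python acquire.py
    outs:
      - data/raw
  xform:
    cmd: python extract_crops.py
    deps:
      - data/raw
    outs:
      - data/crops
  check:
    cmd: python check_labels.py
    deps:
      - data/crops
    outs:
      - data/labels_status.txt
  train:
    cmd: python train.py
    deps:
      - data/crops
      - data/labels_status.txt
    outs:
      - models/model.pt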

Related

What is the structure of the Presentation Data Value in P-DATA-TF?

I found this for the Presentation Data Value:
0 | 0 | 0 | 0 | 0 | 0 | last-or-not | command-or-dataset | **Some Message**
But I couldn't find the **Some Message** part. I suppose this part includes C-FIND, C-GET, etc. How can I know this structure?
Where did you get this from? In fact it is a bit different. Your example should read:
0 | 0 | 0 | 0 | 0 | 0 | last-or-not | command-or-data*set* | **Some Message**
So the "command-or-dataset" flag indicates whether the following bytes are encoding a command (as defined in PS3.7) or a dataset as defined in PS3.3 or 3.4 respectively).
E.g. for DICOM queries, there is a C-FIND command defined in PS3.7, Chapter 9.1.2.1. In C-FIND, the query criteria are part of the command (the "Identifier" in table 9.1-2). How the Identifier is formed and all of its semantics are the subject of the Query/Retrieve Service Class as defined in PS3.4, C.4.1.
For transferring objects, there is a C-STORE command, also defined in PS3.7 (Chapter 9.1.1.1). The Data Set is also part of the C-STORE command, and its contents depend on the type of data (SOP Class). This is referred to as an Information Object Definition (IOD) and defined in PS3.3. The protocol for Storage is also defined in PS3.4 (Annex B).
However, the length limitation of the PDV will not always allow the whole object to be encoded in a single PDV, so it needs to be split. In the following PDVs, no command set will be present, only a fragment of the dataset. In this case, the "command-or-dataset" bit must be set to 0.
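To make the layout concrete, here is a small sketch (not taken from any particular toolkit) of assembling a PDV item with its message control header; per PS3.8 the item length field is big-endian and counts the presentation context ID, the control header and the fragment:
import struct

def make_pdv(pc_id, fragment, is_command, is_last):
    # message control header: bit 0 = command (1) or dataset (0),
    # bit 1 = last fragment (1) or more fragments to follow (0)
    mch = (1 if is_command else 0) | ((1 if is_last else 0) << 1)
    # PDV item length = presentation context ID + control header + fragment
    length = 2 + len(fragment)
    return struct.pack(">IBB", length, pc_id, mch) + fragment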
I hope this makes it a bit clearer. It is a bit difficult in the beginning of learning DICOM to know all the terms and their interrelationships.
Encoding
Logically, command sets and data sets are encoded in the same way. The data dictionary (PS3.6) is a complete list of all possible attributes; the major difference between command and data set attributes is that command attributes have group number 0000 while data set attributes have "any even number but 0".
For each attribute, the data dictionary gives you the Value Representation (VR) which needs to be considered for encoding the value, e.g. "PN" for Patient Name, "UI" for Unique Identifier, and so forth. The VRs are defined in PS3.5, Chapter 6.2.
The encoding of attributes is then
group | element | (VR) | length (always even) | value
How this is transformed to the binary level depends on the Transfer Syntax (TS) that was agreed for the service during association negotiation. For this reason "VR" is enclosed in brackets above: whether it is present or not depends on whether an implicit or explicit VR transfer syntax is used.
There are some more things to consider (endianness, sequence encoding) when encoding command sets or datasets in binary form. Basically everything about this is described in various chapters of PS3.5.
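As a worked example of that attribute layout, here is a minimal sketch of encoding a single short-VR attribute in Explicit VR Little Endian; long-form VRs such as OB/OW use a different length field and are ignored here, so this is illustrative only:
import struct

def encode_attribute(group, element, vr, value):
    # values must be padded to an even length; UI pads with NUL, text VRs with space
    if len(value) % 2:
        value += b"\x00" if vr == "UI" else b" "
    return (struct.pack("<HH", group, element)   # group and element, little endian
            + vr.encode("ascii")                 # explicit VR: 2-character VR code
            + struct.pack("<H", len(value))      # 2-byte length for short VRs
            + value)

# e.g. Patient's Name (0010,0010), VR "PN"
encoded = encode_attribute(0x0010, 0x0010, "PN", b"DOE^JOHN")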

AwesomeWM tag with static layout

Stack Overflow is listed as a place for AwesomeWM community support.
I would like to have a dedicated tag in my AwesomeWM config where only three particular applications will be running all the time. I managed to create a new tag using the sample config, and I managed to filter the applications using awful.rules.rules and place them into the tag.
I am having trouble understanding how the AwesomeWM layout engine really works. I would like to achieve the following: three static columns of fixed widths, each application located in its own column, no rearrangement when focus changes, and when an application is not running, its reserved space remains empty.
___________________
|     |     |     |
|     |     |     |
|  A  |  B  |  C  |
|     |     |     |
|     |     |     |
___________________
How do I specify layout in such case? Should I write my own one? Can I use flexible layout and specify position for client? What is the recommended correct way to achieve my goal?
I am experiencing troubles in understanding how AwesomeWM layout engine really works
A layout is a table with two entries:
name is a string containing, well, the name of the layout
arrange is a function that is called to arrange the visible clients
So you really only need to write an arrange function that arranges clients in the way you want. The argument to this function is the result of awful.layout.parameters, but you really only need to care about these fields:
.clients is a list of clients that should be arranged.
.workarea is the available space for the clients.
.geometries is where your layout writes back the assigned geometries of clients
I would recommend reading some of the existing layouts to see how they work. For example, the max layout is as simple as:
function(p)
    for _, c in pairs(p.clients) do
        p.geometries[c] = {
            x = p.workarea.x,
            y = p.workarea.y,
            width = p.workarea.width,
            height = p.workarea.height
        }
    end
end
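Building on that, a minimal sketch of a fixed three-column arrange function could look like the following. The 30/40/30 split and the order-based client-to-column mapping are assumptions; if you want a particular application pinned to a particular column, you would key on something like c.class instead. You can then select this layout for your tag, e.g. via awful.layout.set.
local threecol = {
    name = "threecol",
    arrange = function(p)
        local wa = p.workarea
        -- assumed column widths: 30% / 40% / 30% of the workarea
        local widths = { 0.3 * wa.width, 0.4 * wa.width, 0.3 * wa.width }
        local x = wa.x
        for i, c in ipairs(p.clients) do
            local w = widths[i] or widths[#widths]
            p.geometries[c] = {
                x = x,
                y = wa.y,
                width = w,
                height = wa.height,
            }
            x = x + w
        end
    end,
}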
Should I write my own one? Can I use flexible layout and specify position for client?
Well, the above is the write-own-layout approach. Alternatively, you could also make your clients floating and assign them a geometry via awful.rules. Just have properties = { floating = true, geometry = { x = 42, y = 42, width = 42, height = 42 } }. However, with this you could e.g. accidentally move one of your clients.
What is the recommended correct way to achieve my goal?
Pick one. There is no "just one correct answer".

Downloading the entire Bitcoin transaction chain with R

I'm pretty new here, so thank you in advance for the help. I'm trying to do some analysis of the entire Bitcoin transaction chain. In order to do that, I'm trying to create 2 tables:
1) A full list of all Bitcoin addresses and their balance, i.e.:
| ID | Address     | Balance |
-------------------------------
|  1 | 7d4kExk...  |      32 |
|  2 | 9Eckjes...  |       0 |
|  . | ...         |     ... |
2) A record of the number of transactions that have ever occurred between any two addresses in the Bitcoin network:
| ID | Sender      | Receiver    | Transactions |
--------------------------------------------------
|  1 | 7d4kExk...  | klDk39D...  |            2 |
|  2 | 9Eckjes...  | 7d4kExk...  |            3 |
|  . | ...         | ...         |          ... |
To do this I've written a (probably very inefficient) script in R that loops through every block and scrapes blockexplorer.com to compile the tables. I've tried running it a couple of times so far, but I'm running into two main issues:
1 - It's very slow... I can imagine it's going to take at least a week at the rate that it's going
2 - I haven't been able to run it for more than a day or two without it hanging. It seems to just freeze RStudio.
I'd really appreciate your help in two areas:
1 - Is there a better way to do this in R to make the code run significantly faster?
2 - Should I stop using R altogether for this and try a different approach?
Thanks in advance for the help! Please see below for the relevant chunks of code I'm using:
url_start <- "http://blockexplorer.com/b/"
url_end   <- ""

readUrl <- function(url) {
  table <- try(readHTMLTable(url)[[1]])
  if (inherits(table, "try-error")) {
    message(paste("URL does not seem to exist:", url))
    errors <- errors + 1
    return(NA)
  } else {
    processed <- processed + 1
    return(table)
  }
}
block_loop <- function(end, start = 0) {
  ...
  addr_row  <- 1  # starting row to fill out table
  links_row <- 1  # starting row to fill out table

  for (i in start:end) {
    print(paste0("Reading block: ", i))
    url <- paste(url_start, i, url_end, sep = "")
    table <- readUrl(url)

    if (is.na(table)) { next }
    ....
There are very close to 250,000 blocks on the site you mentioned (at least, block 260,000 gives a 404). Curling from my connection (1 MB/s down) gives an average time of about half a second per request. Try it yourself from the command line (just copy and paste) to see what you get:
curl -s -w "%{time_total}\n" -o /dev/null http://blockexplorer.com/b/220000
I'll assume your requests are about as fast as mine. Half a second times 250,000 is 125,000 seconds, or a day and a half. This is the absolute best you can get using any methods because you have to request the page.
Now, after doing an install.packages("XML"), I saw that running readHTMLTable("http://blockexplorer.com/b/220000") takes about five seconds on average. Five seconds times 250,000 is 1.25 million seconds, which is about two weeks. So your estimates were correct; this is really, really slow. For reference, I'm running a 2011 MacBook Pro with a 2.2 GHz Intel Core i7 and 8 GB of memory (1333 MHz).
Next, table merges in R are quite slow. Assuming 100 records per block table (seems about average), you'll have 25 million rows, and some of these rows have a kilobyte of data in them. Even assuming you can fit this table in memory, concatenating the tables will be a problem.
The solution to these problems that I'm most familiar with is to use Python instead of R, BeautifulSoup4 instead of readHTMLTable, and Pandas to replace R's dataframe. BeautifulSoup is fast (install lxml, a parser written in C) and easy to use, and Pandas is very quick too. Its dataframe class is modeled after R's, so you probably can work with it just fine. If you need something to request URLs and return the HTML for BeautifulSoup to parse, I'd suggest Requests. It's lean and simple, and the documentation is good. All of these are pip installable.
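A bare-bones version of that stack could look like the sketch below; the exact structure of blockexplorer.com's block pages is an assumption, so treat the parsing as a starting point only:
import requests
from bs4 import BeautifulSoup
import pandas as pd

rows = []
for block in range(0, 1000):                     # widen the range once it works
    html = requests.get(f"http://blockexplorer.com/b/{block}", timeout=30).text
    table = BeautifulSoup(html, "lxml").find("table")
    if table is None:
        continue
    for tr in table.find_all("tr")[1:]:          # skip the header row
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

df = pd.DataFrame(rows)
df.to_csv("blocks.csv", index=False)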
If you still run into problems the only thing I can think of is to get maybe 1% of the data in memory at a time, statistically reduce it, and move on to the next 1%. If you're on a machine similar to mine, you might not have another option.

How to set up futures instruments in FinancialInstrument to look up data from CSIdata

Background
I am trying to set up my trade analysis environment. I am running some rule-based strategies on futures with different brokers and trying to aggregate the trades from the different brokers in one place. I am using the blotter package as my main tool for analysis.
The idea is to use blotter and PerformanceAnalytics to analyze the live performance of the various strategies I am running.
Problem at hand
My source of futures EOD data is CSIData. All the EOD OHLC prices for these futures are stored in CSV format in the following directory structure. For each future there is a separate directory, and each contract of the future has one CSV file with its OHLC price series.
|
+---AD
| AD_201203.TXT
| AD_201206.TXT
| AD_201209.TXT
| AD_201212.TXT
| AD_201303.TXT
| AD_201306.TXT
| AD_201309.TXT
| AD_201312.TXT
| AD_201403.TXT
| AD_201406.TXT
| AD_54.TXT
...
+---BO2
| BO2195012.TXT
| BO2201201.TXT
| BO2201203.TXT
| BO2201205.TXT
| BO2201207.TXT
| BO2201208.TXT
| BO2201209.TXT
| BO2201210.TXT
| BO2201212.TXT
| BO2201301.TXT
...
I have managed to define root contracts in FinancialInstrument for all the futures I will be using (e.g. AD, BO2 above), with the CSIData symbols as primary identifiers.
I am now struggling with how to define all the actual individual futures contracts (e.g. AD_201203, AD_201206, etc.) and set up their lookup using setSymbolLookup.FI.
Any pointers on how to do that?
To set up the individual futures contracts, I looked into ?future_series and ?build_series_symbols; however, the suffixes they support seem to be only in the future month code format. So I have a feeling I am left with setting up each individual contract manually, e.g.:
build_series_symbols(data.frame(primary_id=c('ES','NQ'), month_cycle=c('H,M,U,Z'), yearlist = c(10,11)))
[1] "ESH0" "ESM0" "ESU0" "ESZ0" "NQH0" "NQM0" "NQU0" "NQZ0" "ESH1" "ESM1" "ESU1" "ESZ1" "NQH1" "NQM1" "NQU1" "NQZ1"
I have no clue where to start digging for the second part of my question, i.e. setting up the price lookup for these futures from CSI.
PS: If this is not the right forum for this kind of question, I am happy to have it moved to the right section or even to ask on a totally different forum altogether.
PPS: Can someone with higher reputation tag this question with FinancialInstrument and CSIdata? Thanks!
The first part just works.
R> currency("USD")
[1] "USD"
R> future("AD", "USD", 100000)
[1] "AD"
Warning message:
In future("AD", "USD", 1e+05) :
underlying_id should only be NULL for cash-settled futures
R> future_series("AD_201206", expires="2012-06-18")
[1] "AD_201206"
R> getInstrument("AD_201206")
primary_id :"AD_201206"
currency :"USD"
multiplier :1e+05
tick_size : NULL
identifiers: list()
type :"future_series" "future"
root_id :"AD"
suffix_id :"201206"
expires :"2012-06-18"
Regarding the second part, I've never used setSymbolLookup.FI. I'd either use setSymbolLookup directly, or set a src instrument attribute if I were going to go that route.
However, I'd probably make a getSymbols method, maybe getSymbols.mycsv, that knows how to find your data if you give it a dir argument. Then, I'd just setDefaults on your getSymbols method (assuming that's how most of your data are stored).
I save data with saveSymbols.days(), and use getSymbols.FI daily. I think it wouldn't be much effort to tweak getSymbols.FI to read csv files instead of RData files. So, I suggest looking at that code.
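If you go that route, a stripped-down method might look roughly like this sketch. The file layout and column format of the CSI files are assumptions, and the real getSymbols.FI handles auto.assign and many more details, so treat this only as a starting point:
getSymbols.mycsv <- function(Symbols, env, dir = "", return.class = "xts", ...) {
    importDefaults("getSymbols.mycsv")
    for (sym in Symbols) {
        root <- sub("_.*$", "", sym)                     # "AD_201206" -> "AD"
        fl   <- file.path(dir, root, paste0(sym, ".TXT")) # assumed file layout
        fr   <- read.csv(fl, stringsAsFactors = FALSE)
        # assumed: first column is the date, remaining columns are OHLC
        x <- xts::xts(fr[, -1], order.by = as.Date(fr[, 1]))
        assign(sym, x, envir = env)
    }
    Symbols
}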
Then, you can just
setDefaults("getSymbols", src="mycsv")
setDefaults("getSymbols.mycsv", dir="path/to/dir")
Or, if you prefer
setSymbolLookup(AD_201206=list(src="mycsv", dir="/path/to/dir"))
or (essentially the same thing)
instrument_attr("AD_201206", "src", list(src="mycsv", dir="/path/to/dir")

Sustain in Pure Data for libpd with Xcode

I'm working on a patch that plays samples from a piano, which works in Xcode to build a piano app for iPad. I'm trying to add an ADSR to create sustain, but I can't seem to get it working. Could someone point me in the right direction? Thanks!
Patch:
https://docs.google.com/file/d/0B4-qHDgzbDB3VUlwM09FSEowZWM/edit
The ADSR is just an envelope which you multiply the sound output with. However, it is meant to run on the same temporal axis as the trigger of the sound. When I look at your patch I notice another thing: why are you reloading the samples into the arrays every time you trigger them? The arrays should be filled on startup of the app, like this:
[loadbang]
|
[read -resize c1.wav c1Array(
|
[soundfiler]
Later, when you actually just want to play back, you do
[r c1]
|
[t b]
|
[tabplay~ c1Array]
|
[throw~]
and at one central point in your patch you can have
[catch~]
|
[dac~]
(add the main volume control there). Notice there are no connections between the three parts!
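For the sustain itself, one common approach is to put a [*~] between the sample player and [throw~] and drive its right inlet with a [line~] ramp; the ramp times below are just examples:
[r c1]
|
[t b b]
|    |
|    [1 5(
|    |
|    [line~]
|
[tabplay~ c1Array]
|
[*~]
|
[throw~]
Connect [tabplay~ c1Array] to the left inlet of [*~] and [line~] to its right inlet; when the note should be released, send something like [0 500( to the same [line~] to fade out over 500 ms.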
