Exporting embeddings per epoch in Keras

I am trying to get access to the output of the embedding layer (the n-dimensional vectors) in Keras on a per-epoch basis. There doesn't seem to be a specific callback for this. I've tried the TensorBoard callback, since it provides an option for logging the embeddings on each epoch, but when I find the log files, I can't read them. They are probably files that can be accessed only by TensorBoard for visualization purposes. I need the embedding vectors to be saved in a format I can use later on outside Keras, like a TSV file. Is there a way I could do this?
Thanks a lot!

OK, so I figured out how to do this, with much-needed help from Nazmul Hasan on how to format the file name so it updates with each epoch. Essentially, I created a custom callback:
import io
from tensorflow import keras

# `info` comes from tfds.load(..., with_info=True); `model` is the trained Keras model.
encoder = info.features['text'].encoder

# Pass an instance of this to model.fit(..., callbacks=[CustomCallback()]).
class CustomCallback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        out_v = io.open('vecs_{}.tsv'.format(epoch), 'w', encoding='utf-8')
        vec = self.model.layers[0].get_weights()[0]  # row 0 is the padding token, so skip it
        for num in range(len(encoder.subwords)):
            out_v.write('\t'.join([str(x) for x in vec[num + 1]]) + "\n")
        out_v.close()

Related

How to save gl_speech_op to an object in R

How do you save gl_speech_op output to an object within R?
I successfully ran googleLanguageR to convert an audio file to text within the Google Cloud Platform. I can see the output, but I don't know how to save it to an object within RStudio.
Sample code is below. I am using R Notebook.
library(googleLanguageR)
library(tidyverse)
###let's get Craig Watkins
gl_auth("D:/Admin/Documents/Google API JSON Authenticate/My Project two test-db5d6330925e.json")
watkins <- gl_speech("gs://testtwoibm/craig watkins 2018_05_07_14_08_08.flac",
                     encoding = c("FLAC"), sampleRateHertz = 44100, languageCode = "en-US",
                     maxAlternatives = 1L, asynch = TRUE)
## Send to gl_speech_op() for status or finished result
gl_speech_op(watkins)
[Screenshot: RStudio notebook output showing the converted speech-to-text.]
The easiest way to save the output of any operation to an object in R is to assign it via the assignment operator <-.
In your case, you would simply assign it to an object like this:
transcript <- gl_speech_op(watkins)
One small reminder: this will also work if the asynchronous API request hasn't finished transcribing yet; however, your object will not then contain any information. In that case it will be a list of 2 with two NULL elements. Once finished, the object will contain both the transcript and the timings.
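If you want your script to wait until the transcription is done, a simple polling loop works too. This is only a rough sketch: it assumes gl_speech_op() can be called repeatedly on the same operation object and that the finished result has the $transcript and $timings elements described above:
transcript <- gl_speech_op(watkins)
while (is.null(transcript$transcript)) {
  Sys.sleep(30)                        # wait a while before asking the API again
  transcript <- gl_speech_op(watkins)
}
transcript$transcript                  # tibble with the transcribed text
transcript$timings                     # word-level timings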
I understand you want the output as text. If that is the case, you can use capture.output():
new_obj = capture.output(gl_speech_op(watkins))
new_obj

Can I cache data loading in R?

I'm working on an R script which has to load data (obviously). The data loading takes a lot of effort (a 500 MB CSV) and I wonder if I can avoid having to go through the loading step every time I rerun the script, which I do a lot during development.
I appreciate that I could do the whole thing in the interactive R session, but developing multi-line functions is just so much less convenient on the R prompt.
Example:
#!/usr/bin/Rscript
d <- read.csv("large.csv", header=T) # 500 MB ~ 15 seconds
head(d)
How, if possible, can I modify the script, such that on subsequent executions, d is already available? Is there something like a cache=T statement as in R markdown code chunks?
Sort of. There are a few answers:
Use a faster CSV reader: fread() in the data.table package is beloved by many. Your time may come down to a second or two.
Similarly, read once as CSV and then write in compact binary form via saveRDS() so that next time you can do readRDS(), which will be faster because you do not have to load and parse the data again (see the sketch at the end of this answer).
Don't read the data but memory-map it via the mmap package. That is more involved but likely very fast. Databases use such a technique internally.
Load on demand; e.g., the SOAR package is useful here.
Direct caching, however, is not possible.
Edit: Actually, direct caching "sort of" works if you save your data set with your R session at the end. Many of us advise against that, as clearly reproducible scripts which make the loading explicit are preferable in our view -- but R can help via the load() / save() mechanism (which saves and loads several objects at once, whereas saveRDS() / readRDS() work on a single object).
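To make the saveRDS() / readRDS() point concrete, here is a minimal sketch of the read-once, cache-as-binary pattern (file names are just placeholders):
# First run: parse the CSV (slow) and cache it as a binary RDS file.
# Subsequent runs: load the cached copy (fast).
if (file.exists("large.rds")) {
  d <- readRDS("large.rds")
} else {
  d <- read.csv("large.csv", header = TRUE)   # or data.table::fread("large.csv")
  saveRDS(d, "large.rds")
}
head(d)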
The R.cache package can also do this; for example:
library(R.cache)
library(WDI)

start_year <- 2000
end_year <- 2013
brics_countries <- c("BR", "RU", "IN", "CN", "ZA")
indics <- c("NY.GDP.PCAP.CD", "TX.VAL.TECH.CD", "SP.POP.TOTL", "IP.JRN.ARTC.SC",
            "GB.XPD.RSDV.GD.ZS", "BX.GSR.CCIS.ZS", "BX.GSR.ROYL.CD", "BM.GSR.ROYL.CD")
key <- list(brics_countries, indics, start_year, end_year)
brics_data <- loadCache(key)
if (is.null(brics_data)) {
  brics_data <- WDI(country = brics_countries, indicator = indics,
                    start = start_year, end = end_year, extra = FALSE, cache = NULL)
  saveCache(brics_data, key = key, comment = "brics_data")
}
I use exists to check if the object is present and load conditionally, i.e.:
if (!exists("d"))   # exists() takes the object name as a string
{
  d <- read.csv("large.csv", header=T)
  # Any further processing on loading
}
# The rest of the script
If you want to load/process the file again, just use rm(d) before sourcing. Just be careful that you do not use object names that are already used elsewhere, otherwise it will pick that up and not load.
I wrote up some of the common ways of caching in R in "Caching in R" and published it to R-Bloggers. For your purpose, I would recommend just using saveRDS(), or qsave() / qread() from the 'qs' (quick serialization) package. My package, 'mustashe', uses 'qs' for reading and writing files, so you could just use mustashe::stash(), too.
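A rough sketch of the stash() pattern, assuming (per the mustashe README) that the code block is only re-evaluated when the cached value is missing or stale:
library(mustashe)

stash("d", {
  d <- read.csv("large.csv", header = TRUE)   # slow step, skipped when the cache is valid
})
head(d)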

R: Improving workflow and keeping track of output

I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of having a folder full of output (plots, RData files, csv, etc.) without, after some time, having a clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is.
So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function MetaInfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. In particular, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by @hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
  # message  - character string - Any message to be written into the information
  #            file (e.g., data used).
  # filename - character string - the name of the txt file (including relative
  #            path). Should be the same as the output file it describes (RData,
  #            csv, pdf).
  #
  if (is.null(filename))
  {
    stop('Provide an output filename - parameter filename.')
  }
  filename <- paste(filename, '.txt', sep='')
  # Try to get as close as possible to getting the file name from which the
  # function is called.
  source.file <- lapply(sys.frames(), function(x) x$ofile)
  source.file <- Filter(Negate(is.null), source.file)
  t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
              silent=TRUE)
  if (class(t.sf) == 'try-error')
  {
    source.file <- NULL
  }
  func <- deparse(sys.call(-1))
  # MetaInfo isn't always called from within another function, so func could
  # return as NULL or as general environment.
  if (any(grepl('eval', func, ignore.case=TRUE)))
  {
    func <- NULL
  }
  time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
  git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
  meta <- list(Message=message,
               Source=paste(source.file, ' on ', time, sep=''),
               Functions=func,
               System=Sys.info(),
               Session=sessionInfo(),
               Git.hash=git.h)
  sink(file=filename)
  print(meta)
  sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
  fn <- 'random_plot'
  pdf(file=paste(fn, '.pdf', sep=''))
  plot(x, y)
  MetaInfo(message=NULL, filename=fn)
  dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called ProjectTemplate that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There are also a number of helpful documents, like this one: Workflow of statistical data analysis by Oliver Kirchkamp.
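For reference, a rough sketch of how ProjectTemplate is typically used (the directory layout is whatever create.project() scaffolds for you):
library(ProjectTemplate)
create.project("my-analysis")   # scaffolds data/, munge/, src/, cache/, reports/, ...
setwd("my-analysis")
load.project()                  # loads data/ and runs the munge/ preprocessing scripts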
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question reveals a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file is produced. You should in fact be able to wipe out any output and reproduce it by rerunning the code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or ProjectTemplate, which I will try to pick up on. Thanks again for the suggestions @noah and @alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like its predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is also a package developed by RStudio called pins that might address this problem.
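A rough sketch with pins (function names follow the current board-based pins API; adjust if you have the older pin()/pin_get() interface installed, and note my_results is just a placeholder object):
library(pins)

board <- board_local()                        # a board backed by a local cache directory
pin_write(board, my_results, "my_results")    # save an object as a pin
cached <- pin_read(board, "my_results")       # read it back later, even in a new session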

Read ADCP data in R

I have ADCP measured data for a river and I am wondering if it is possible to read the ADCP file in R. I found a package called "oce" but I couldn't read the ADCP file.
The functions I found in the oce package are as follows:
read.oce
read.adp
I have uploaded the sample file here https://www.dropbox.com/sh/owian354auah6h3/379D5spA2X.
If anyone could help me read this kind of ADCP file, I would highly appreciate it.
Thanks.
Since it's just a text file, the simplest thing to do is to do something like
my_header <- readLines(myfile, n=7)
followed by
my_data <- read.table(myfile,skip=7,...)
(You'll need a few more parameters, probably, in those calls).
That way the metadata is separated from the array of data, which should simplify subsequent processing operations.
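For instance, a rough sketch of what those calls might end up looking like (the file name is hypothetical, and the skip count, separator, and header handling all depend on the actual layout of the exported file):
myfile <- "adcp_export.txt"                 # hypothetical file name
my_header <- readLines(myfile, n = 7)       # keep the metadata lines for later
my_data <- read.table(myfile, skip = 7, header = FALSE,
                      sep = "", stringsAsFactors = FALSE)
str(my_data)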
I'm very late to the party here, but hopefully, this will be useful for others.
data <- read.adp("my.prf", from = 1, to = 10, by = 1, tz = "UTC")
The arguments from = and to = correspond to record numbers in the .sen file.

How to Import SQLite data (gathered by an Android device) into either Octave or MATLAB?

I have some data gathered by an Android phone, stored as a file in SQLite format. I would like to play around with this data (analysing it) using either MATLAB or Octave.
I was wondering what commands you would use to import this data into MATLAB? Say, to put it into a vector or matrix. Do I need any special toolboxes or packages, like the Database Package, to access the SQL format?
There is the mksqlite tool.
I've used it personally and had some issues getting the correct version for my version of MATLAB, but after that, no problems. You can even query the database file directly to reduce the amount of data you import into MATLAB.
Although mksqlite looks nice, it is not available for Octave and may not be suitable as a long-term solution. Exporting the tables to CSV files is an option, but the importing (into Octave) can be quite slow for larger data sets because of the string parsing involved.
As an alternative, I ended up writing a small Python script to convert my SQLite table into a MAT file, which is fast to load into either Matlab or Octave. MAT files are platform-neutral binary files, and the method works both for columns with numbers and strings.
import sqlite3
import scipy.io

conn = sqlite3.connect('my_data.db')
csr = conn.cursor()
res = csr.execute('SELECT * FROM MY_TABLE')
db_parms = list(map(lambda x: x[0], res.description))  # column names
# Remove those variables in db_parms you do not want to export
X = {}
for prm in db_parms:
    csr.execute('SELECT "%s" FROM MY_TABLE' % (prm))
    v = csr.fetchall()
    # v is now a list of 1-tuples; flatten it into a plain list of values
    X[prm] = list(*zip(*v))
scipy.io.savemat('my_data.mat', X)

Resources