Read in an RDS file from memory using aws.s3 - r

I am using the cloudyr aws.s3 package to pull down an RDS file from S3. I have my own custom R runtime in a Lambda. aws.s3 has a handy method called s3readRDS("s3://pathtoyourfile"). This works well, but has a limitation: it saves the RDS file to disk and then reads it back in using readRDS. That is fine for smaller files, but for larger files it is a no-go, as we have limited disk storage.
Right now I'm stuck with these largish data files, and moving them out into a database is just not feasible at the moment because of the target group, so I'm trying to minimize our cost and maximize throughput, and this is the last nail in that coffin.
According to the documentation:
"Some users may find the raw vector response format of \code{get_object} unfamiliar. The object will also carry attributes, including \dQuote{content-type}, which may be useful for deciding how to subsequently process the vector. Two common strategies are as follows. For text content types, running \code{\link[base]{charToRaw}} may be the most useful first step to make the response human-readable. Alternatively, converting the raw vector into a connection using \code{\link[base]{rawConnection}} may also be useful, as that can often then be passed to parsing functions just like a file connection would be. "
Based on this (and an example using load()), the code below looks like it should work, but it does not.
foo <- readRDS(rawConnection(get_object("s3://whatever/foo.rds")))
Error in readRDS(rawConnection(get_object("s3://whatever/foo.rds", :
unknown input format
I can't seem to get the data stream into a form that readRDS or unserialize can make sense of. I know the file is correct, as the save-to-disk/read-from-disk route works fine. But I want to know how to turn "foo" into an unserialized object without the save/load round trip.

Related

How to speedup bulk importing into google cloud datastore with multiple workers?

I have an apache-beam based Dataflow job that uses the VCF source to read from a single text file (stored in Google Cloud Storage), transform the text lines into Datastore entities, and write them to the Datastore sink. The workflow works fine, but the downsides I noticed are:
The write speed into datastore is at most around 25-30 entities per second.
I tried to use --autoscalingAlgorithm=THROUGHPUT_BASED --numWorkers=10 --maxNumWorkers=100, but the execution seems to prefer one worker (the target worker count briefly increased to 2 but was reduced to 1 "based on the ability to parallelize the work in the currently running step").
I did not use ancestor path for the keys; all the entities are the same kind.
The pipeline code looks like below:
def write_to_datastore(project, user_options, pipeline_options):
  """Creates a pipeline that writes entities to Cloud Datastore."""
  with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'Read vcf files' >> vcfio.ReadFromVcf(user_options.input)
     | 'Create my entity' >> beam.ParDo(
         ToEntityFn(), user_options.kind)
     | 'Write to datastore' >> WriteToDatastore(project))
Because I have millions of rows to write into the datastore, it would take too long to write with a speed of 30 entities/sec.
Question: The input is just one huge gzipped file. Do I need to split it into multiple small files to trigger multiple workers? Is there any other way I can make the import faster? Am I missing something in the num_workers setup? Thanks!
I'm not familiar with Apache Beam; this answer is from a general flow perspective.
Assuming there are no dependencies between the entity data in the various input file sections, then yes, working with multiple input files should definitely help, as all those files could then be processed virtually in parallel (depending, of course, on the max number of available workers).
You might not need to split the huge gzipped file beforehand; it might be possible to simply hand off segments of the single input data stream to separate data segment workers for writing, if the overhead of that handoff is negligible compared to the actual data segment processing.
The overall performance limitation would then be the speed of reading the input data, splitting it into segments, and handing them off to the segment workers.
A data segment worker would further split the data segment it receives into smaller chunks of up to 500 entities, the maximum that can be written to the datastore in a single batch operation. Depending on the datastore client library used, it may be possible to perform this operation asynchronously, allowing the splitting into chunks and conversion to entities to continue without waiting for the previous datastore writes to complete.
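For illustration, a minimal sketch of the per-chunk batch write, outside Beam, assuming the google-cloud-datastore client library (write_segment and to_entity are hypothetical names):

from google.cloud import datastore

BATCH_SIZE = 500  # the per-batch limit mentioned above

def write_segment(project, kind, rows, to_entity):
    """Write one data segment to Datastore in batches of up to 500 entities."""
    client = datastore.Client(project=project)
    batch = []
    for row in rows:
        entity = datastore.Entity(key=client.key(kind))  # incomplete key, auto-ID
        entity.update(to_entity(row))                    # row -> dict of properties
        batch.append(entity)
        if len(batch) == BATCH_SIZE:
            client.put_multi(batch)                      # one batched write
            batch = []
    if batch:
        client.put_multi(batch)                          # flush the remainder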
The performance limitation at the data segment worker would then be the speed at which the data segment can be split into chunks and the chunks converted to entities.
If async ops aren't available, or for even higher throughput, yet another handoff of each chunk to a chunk worker could be performed, with the chunk worker performing the conversion to entities and the datastore batch write.
The performance limitation at the data segment worker level would then be just the speed at which the data segment can be split into chunks and handed over to the chunk workers.
With such an approach, the actual conversion to entities and batch-writing them to the datastore (async or not) would no longer sit in the critical path of splitting the input data stream, which is, I believe, the performance limitation of your current approach.
I looked into the design of vcfio. I suspect (if I understand correctly) that the reason I always get one worker when the input is a single file is the limitation of _VcfSource combined with the VCF format itself: the format has a header part that defines how to interpret the non-header lines, so each worker reading the source has to work on an entire file. When I split the single file into 5 separate files that share the same header, I successfully get up to 5 workers (but not more, probably for the same reason).
One thing I don't understand is that the number of workers that read can be limited to 5 (in this case), but why are we also limited to only 5 workers for writing? Anyway, I think I have found an alternative way to trigger multiple workers with the Beam Dataflow runner (use pre-split VCF files). There is also a related approach in the GCP variant-transforms project, in which vcfio has been significantly extended; it seems to support multiple workers with a single input VCF file. I wish the changes in that project could be merged into the Beam project too.
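A minimal sketch of such a pre-splitting step, assuming Python with the standard gzip module, that record order does not matter for the entity conversion, and hypothetical names (split_vcf, shard_prefix):

import gzip

def split_vcf(src_path, n_shards, shard_prefix):
    """Write n_shards gzipped VCF files, each starting with the full header."""
    outs = [gzip.open(f"{shard_prefix}-{i:02d}.vcf.gz", "wt") for i in range(n_shards)]
    try:
        with gzip.open(src_path, "rt") as src:
            header, body_started = [], False
            for i, line in enumerate(src):
                if not body_started and line.startswith("#"):
                    header.append(line)          # collect the shared header
                    continue
                if not body_started:
                    body_started = True
                    for out in outs:
                        out.writelines(header)   # every shard gets the header
                outs[i % n_shards].write(line)   # round-robin the data lines
    finally:
        for out in outs:
            out.close()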

Creating a NetIncomingMessage with Lidgren from saved package in a binary file

I am using the Lidgren networking library to create a real-time multiplayer game.
What I am trying to do is save all incoming packets (including all bytes) arriving at a peer into a binary file. Later, when I need to debug some weird networking behavior, I can load this file and have it replay (or rebuild) all the packets it saved, sequentially. This way, I can find out exactly how the weird behavior occurred.
My question is: how do I recreate these packets when I load them from the file?
It is a NetIncomingMessage that I need to recreate, I assume. So far I have thought of either creating it anew, or sending a NetOutgoingMessage to self in the hope that it has the same effect, if the first approach fails.
The way I solved this is by creating a wrapper object around the NetIncomingMessage which contains a data byte array among other members, and then having one thread fill a list of these objects based on the saved incoming times, while another thread requests and removes (dequeues) them.
See https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem
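A language-agnostic illustration of that producer/consumer replay loop, sketched here in Python (the original code is C# with Lidgren); records, handle, and the sample data are hypothetical stand-ins:

import queue
import threading
import time

def producer(records, q):
    """Re-enqueue recorded messages, pacing them by their saved arrival times."""
    start = time.monotonic()
    for arrival_offset, payload in records:   # (seconds since capture start, bytes)
        delay = arrival_offset - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        q.put(payload)
    q.put(None)                               # sentinel: no more messages

def consumer(q, handle):
    while True:
        payload = q.get()
        if payload is None:
            break
        handle(payload)                       # rebuild/process the raw message bytes

records = [(0.00, b"\x01\x02"), (0.05, b"\x03\x04")]  # stand-in for the saved file
q = queue.Queue()
threading.Thread(target=producer, args=(records, q)).start()
consumer(q, print)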

How to transfer (to connect) data between Digital Micrograph and R

I'm a new DM user and I need to transfer data (pixel brightness values) between Digital Micrograph and R, for processing and modelling an image.
Specifically, I would need to extract the brightness values from an original image, send them to R for processing, and return the result to DM to display the new image.
I would like to know if this is possible and how to do it from a script in DM.
Many thanks. Regards.
There is very little direct connection between DM (scripting) and the outside world, so the best solution is quite likely the following (DM-centric) route:
A script is started in DM, which:
handles all the UI needed
extracts the intensities etc.
saves all required data in a suitable format on disc at a specific path (raw data / text data / ...)
calls an external application (anything you can call from a command prompt, including .bat files) and waits until that command has finished
Have all your R code written in a way that it can be called from a command prompt, potentially with command-prompt parameters (e.g. a configuration file), so that it will:
read the data from the specific path
process it as required (without a UI, so it runs 'silently')
save the results on disc at a specific path
close the application
At this point, the script in DM continues, reading in the results (and potentially doing some clean-up of the files on disc).
So, in essence, the important thing is that your R-code can work as a "stand-alone" black-box executable fully controlled by command-line parameters.
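For concreteness, the same black-box shape sketched in Python (the answer is about R, but an Rscript-callable script would follow the same pattern); process() is a hypothetical stand-in:

import sys

def process(raw_bytes):
    return f"{len(raw_bytes)} bytes read\n"   # stand-in for the real processing/modelling

def main(in_path, out_path):
    with open(in_path, "rb") as f:            # read the data DM saved to disc
        raw = f.read()
    with open(out_path, "w") as f:            # write results where DM will read them back
        f.write(process(raw))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])            # paths passed as command-line parameters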
The command you will need to launch an external application can be found in the help documentation under "Utility Functions" and is LaunchExternalProcess. It was introduced with GMS 2.3.1.
You might also want to try using the commands ScrapCopy() and ScrapPasteNew() to copy an image (or image subarea) into the clipboard, but I am not sure how the data is handled there exactly.

Is there any size/row limit in .txt file?

The question may look like a duplicate, but I am not getting the answer I am looking for.
The problem is that, in UNIX, one of the 4GL binaries is fetching data from a table using a cursor and writing the data to a .txt file.
The table contains around 50 million records.
The binary takes a lot of time and does not complete; the .txt file also stays at 0 bytes.
I want to know the possible reasons why the records are not being written to the .txt file.
Note: There is enough disk space available.
Also, for 30 million records, I can get the data in the .txt file as expected.
The information you provide is insufficient to tell for sure why the file is not written.
In UNIX, a text file is just like any other file - a collection of bytes. No specific limit (or structure) is enforced on "row size" or "row count," although obviously some programs might have limits on the maximum supported line size and such (depending on their implementation).
Once a program starts writing data to a file (i.e. once the internal buffer is flushed for the first time), the file will no longer be zero size, so clearly your binary is doing something else all that time (unless it wipes out the file as part of its cleanup).
Try running your executable via strace to see the file I/O activity - that would give some clues as to what is going on.
Try closing the writer if you are using one to write to the file. It achieves the dual purpose of closing the resource along with flushing the remaining contents of the buffer.
Buffered output needs to be flushed if you are using any kind of buffered writer. I have encountered such situations a few times, and in almost all cases the issue was flushing the output.
In Java specifically, the usual practice for writing data involves buffers: when the buffer limit is reached, the contents get written to the file, but data in a buffer that has not yet filled up does not get written. This is what happens when a program closes without flushing the buffered writer.
So, in your case, if the processing time it takes is reasonable and the output still does not appear in the file, it may mean that the output has been computed and is sitting in RAM, but could not be written to the file (i.e. to disk) because it was never flushed.
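The same buffering effect, illustrated here in Python for brevity (the point above is about Java, but the behaviour is analogous); out.txt is a placeholder name:

import os

f = open("out.txt", "w", buffering=1024 * 1024)   # large user-space buffer
f.write("some computed output\n")                 # sits in the buffer, not on disk
print(os.path.getsize("out.txt"))                 # typically still 0 at this point
f.flush()                                         # push the buffer out to the file
print(os.path.getsize("out.txt"))                 # now non-zero
f.close()                                         # close() also flushes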
You can also consider the answers to this question.

Comparing bz2 files in unix

I manage a number of databases on UNIX servers and do daily backups of these databases using mysqldump. Since (some of) these databases are very large (20+ GB), I usually compress the backup .sql files using bzip2, to get compressed .bz2 files.
As part of the backup process, I check that the size of the new backup file is greater than or equal to the size of the previous backup file - we are adding data to these databases on a daily basis, but very rarely remove data from these databases.
The check on the backup file size is a check on the quality of the backup - given that our databases primarily only grow in size, if the new backup is smaller than the old backup, it means either a) something has been removed from the database (in which case, I should check out what...) or b) something went wrong with the backup (in which case, I should check out why...).
However, if I compare the sizes of the .bz2 files - for example, comparing the output of stat %s using test - the .bz2 files may have shrunk even though the database has increased in size, presumably because of more efficient compression.
So - how can I compare the size of the backup files?
One option is to decompress the previous backup file from .bz2 to .sql, and compare the sizes of these .sql files. However, given that these are quite large files (20+Gb), the compression/decompression can take a while...
Another option is to keep the previous backup file as .sql, and again do the comparison of the .sql files. This is my preferred option, but requires some care to make sure we don't end up with lots of .sql files lying around - because that would eat up our hard drive pretty quickly.
Alternatively, someone in the SO community might have a better or brighter idea...?
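A minimal sketch of a streaming variant of the first option above - counting decompressed bytes without ever writing the .sql back to disk - assuming Python is available on the server; the file names are placeholders:

import bz2

def uncompressed_size(path, chunk=16 * 1024 * 1024):
    """Stream-decompress a .bz2 file and count the bytes, without touching disk."""
    total = 0
    with bz2.open(path, "rb") as f:          # transparently decompresses (multi-stream OK)
        while True:
            block = f.read(chunk)
            if not block:
                break
            total += len(block)
    return total

old = uncompressed_size("backup-yesterday.sql.bz2")
new = uncompressed_size("backup-today.sql.bz2")
if new < old:
    print(f"warning: new dump is smaller uncompressed ({new} < {old} bytes)")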
It's possible to split the input files into parts (100 MB chunks, for example) and compare them separately. As the size might stay the same even with different input, you should generally not use size alone to look for differences - instead use something like cmp to see whether the files differ.
It's also possible to just cat the .bz2 files of the individual parts together and get a perfectly valid multi-stream .bz2 file, which can be decompressed again as a whole without any problems. You might want to look into pbzip2, a parallel implementation of bzip2 that uses exactly this mechanism - compressing in parallel into a multi-stream .bz2 file - to speed up the process on SMP/multi-core systems.
As to why I would suggest splitting the files into parts: Depending on your mysql setup, it might be possible that some of your parts never change, and data might actually mostly get appended at the end - if you can make sure of this, you would only have to compare small parts of the whole dump, which would speed up the process.
Still, be aware that the whole data set could change without anything being added or removed, as MySQL might re-sort data (the OPTIMIZE command, for example, could result in this).
Another way of splitting the data is possible if you use InnoDB - in that case you can just tell MySQL (via my.cnf) to use one file per table, so you could a) bzip2 those files individually and only compare the tables that might actually have changed (in case you have static data in some of the tables) and/or b) save the last-modified date of a table file and compare that beforehand (again, this is only really useful for tables with only static data).
