AWS Elastic MapReduce streaming: using data from nested folders as input (hadoop-streaming)

I have data laid out as s3n://bucket/{date}/{file}.gz, with more than 100 date folders. How do I set up a streaming job that uses all of them as input?
Specifying s3n://bucket/ as the input didn't help, since the entries directly under the bucket are folders rather than files.

Specify s3n://bucket/*/ as input and it should work fine.
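For example, a streaming step could pass that glob straight through as the input path. A rough sketch only: the mapper, reducer, output path, and jar location are placeholders and will vary with your EMR version and scripts.
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
    -input s3n://bucket/*/ \
    -output s3n://bucket/output/ \
    -mapper my_mapper.py \
    -reducer my_reducer.py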

Related

Load multiple files through SQL Loader

I have a requirement to load multiple files received from two source systems into one table using SQL*Loader.
Before doing this, I want to understand the following:
What are the pros and cons of combining multiple files like this? I need this to compare merging the files at the source against merging them via SQL*Loader.
Is there any way other than SQL*Loader to load the data from multiple .CSV files into Oracle? I don't think so, but I'd still like an expert's confirmation.
What do I need to be mindful of? For example, the file format and header sequence should be the same for all the files.
Thanks in Advance.
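For reference, SQL*Loader can read several files in one run by listing multiple INFILE clauses in a single control file. A rough sketch, with placeholder file, table, and column names:
-- load_multi.ctl: combine two source files into one table (all names are placeholders)
LOAD DATA
INFILE 'system_a.csv'
INFILE 'system_b.csv'
APPEND
INTO TABLE combined_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(col1, col2, col3)
This only works cleanly when, as noted above, every file shares the same record layout; the control file is then run with the usual sqlldr control= invocation.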

How to serve multiple apps simultaneously with Panel/PyViz?

I've been exploring how to write a suite of small utilities but serve them up together: multiple little .py files, each defining a Panel app, served from a single Docker container with a single entrypoint.
Voila can serve multiple notebooks via its Jupyter server extension; is something analogous possible with Panel? For example, could I run panel serve . [--options] to serve all the .py files in a directory?
h/t Philipp Rudiger, the lead developer of Panel, who pointed me to this answer:
Use panel serve src1.py nb1.ipynb ... to serve multiple apps simultaneously.
You might want to provide your own index page, since the default Bokeh one isn't too pretty.
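As a rough sketch (file names and widget contents are placeholders), each script marks its output as servable, and the apps are then listed on one command line:
# app1.py -- one small Panel app; app2.py etc. would follow the same pattern
import panel as pn
pn.extension()
slider = pn.widgets.IntSlider(name="n", start=1, end=10, value=3)
pn.Column(slider, pn.bind(lambda n: f"You selected {n}", slider)).servable()
Then serve them together with: panel serve app1.py app2.py nb1.ipynb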

Is there a way to make offline directory changes through an LDIF file?

I have an LDIF file containing multiple modify operations. I would like to apply them offline, in the same way that import-ldif can import data offline.
I've looked at import-ldif, but as I understand it, it can only apply add operations, not modify operations. I tried it anyway, but every operation was rejected.
How can I achieve this?
The import-ldif tool is meant for fast import of data into the OpenDJ directory server, not for offline processing.
OpenDJ has a tool called ldifmodify which lets you apply one or more operations (expressed in LDIF representation) against an existing LDIF file (representing the entries).
Please read the OpenDJ documentation for the details and examples.
Regards,
Ludovic.
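For illustration, the changes passed to ldifmodify are ordinary LDIF changetype records, along these lines (the DN and attributes are placeholders); the exact command-line options differ between OpenDJ releases, so check the ldifmodify reference for your version:
dn: uid=jdoe,ou=People,dc=example,dc=com
changetype: modify
replace: mail
mail: jdoe@example.com
-
add: description
description: updated offline with ldifmodify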

Perforce directory structure

In Perforce, I notice that my workspace is tied to a specific directory (location) on my local hard drive. Is it possible to change the location of this mapping per file? For example, suppose I have two scripts in two completely different local directories:
C:/File1.pl & D:/File2.pl
and I want to map these two scripts to the same folder in Perforce. Is this possible?
The root directory must be the same for all files in a single workspace.
However, you can define multiple workspaces, one which resides on your C:\ disk and one which resides on your D:\ disk.
Generally, a single workspace is used for a single project, and generally all files for a single project are located together in a single area of your workstation. I'm having trouble thinking of a scenario in which you'd want to have files be part of a single project, and yet stored in various places scattered around your workstation. Can you explain your scenario further?
There are techniques (the SUBST command, using Windows Junction Points, etc.) which can be used to create aliases for files on a different disk, but given what you've described, using multiple workspaces seems like the clearest approach to me.
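For reference, each workspace's root and file mapping live in its client spec (edited with p4 client). A rough sketch of the two-workspace approach, with placeholder depot paths:
# Workspace rooted on the C: disk
Client: scripts_c
Root:   C:\
View:
    //depot/scripts/File1.pl //scripts_c/File1.pl
# Workspace rooted on the D: disk
Client: scripts_d
Root:   D:\
View:
    //depot/scripts/File2.pl //scripts_d/File2.pl
Both workspaces map into the same //depot/scripts/... folder, so the files sit side by side in Perforce even though they live on different local disks.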

How do I connect to the data that's in my folder without writing the specific path, like "C:\\Users\\Dima\\Desktop\\NewData\\..."?

I am writing a script that requires data stored in a folder on my computer.
Eventually, though, this script will be used on another computer, by another person.
I can't ask them to change all the paths to the data in the script.
How do I connect to the data in my folder without writing the specific path,
like "C:\Users\Dima\Desktop\NewData\..."?
The best way of making your code shareable depends upon your use case.
As Carl Witthoft pointed out, most code should be encapsulated in functions. These functions can then be packaged into packages and easily redistributed on other people's machines. Writing packages is easier than you think.
For one-off analyses, scripts are appropriate. How you make them user-independent depends on who your users are. If you are sharing the script with colleagues, try to keep your data on a network drive; then the link to the data will be the same for everyone. If you are sharing your script with the world, then keep your data on the internet, and the link to the data will be a hyperlink, again the same for everyone.
If you are sharing your script with a few people who don't have access to a common drive, and you can't put your data on the internet, then some directory manipulation is acceptable.
Change your working directory to the root of where your project files are.
setwd("c:/Users/Dima/My Project")
Then you can reference the location of the data using relative paths.
data_file <- "Data/My data file.csv"
my_data <- read.csv(data_file)
Assuming that you keep the directory structure within your project the same, then you only need to change the call to setwd on each machine.
Also note that the special location "~" refers to your user home directory. Try
normalizePath("~")
That way, if you keep your project in that location, you can avoid any reference to "Dima" entirely.
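For instance, building on the code above (the folder names are placeholders), the whole script can be anchored on the home directory:
project_dir <- file.path("~", "My Project")  # "~" expands to the current user's home directory
setwd(project_dir)
my_data <- read.csv(file.path("Data", "My data file.csv"))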
