What is the best way to process multiple files in Storm (similarity)?

I am new to Apache Storm, and I want to use Storm to compute the similarity of files. I want to get the cosine similarity of every file in folder "A" with every file in folder "B". Can you show me the way to get this result?
Thanks so much.

I did not understand what you meant by 'cosine of all files', but in general,
you can think of each folder as a 'stream'. You can have a spoutA that reads, parses, formats, and emits the files in folderA, and a spoutB that does the same for folderB, producing two tuple streams (I am assuming there are some differences between the two folders, like encoding, formatting, etc.). Your processing bolt can then 'subscribe' to those streams. For example:
bolt.fieldsGrouping(spoutA, streamname, new Fields("field_in_stream"));
bolt.fieldsGrouping(spoutB, streamname, new Fields("field_in_stream"));
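To put that in context, here is a minimal wiring sketch. FolderSpout, CosineBolt, the stream id "docs", and the field "docId" are all illustrative names, and the imports assume a recent Storm release with packages under org.apache.storm:

import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

// Assumes both spouts declare a stream "docs" whose tuples carry a "docId" field.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spoutA", new FolderSpout("/data/A"));
builder.setSpout("spoutB", new FolderSpout("/data/B"));
builder.setBolt("cosine", new CosineBolt(), 4)
       .fieldsGrouping("spoutA", "docs", new Fields("docId"))
       .fieldsGrouping("spoutB", "docs", new Fields("docId"));
StormTopology topology = builder.createTopology();

Which field you group on determines which tuples from folderA and folderB end up at the same bolt task for comparison.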
If, on the other hand, you meant something like two different instances of the same spout reading from different folders: that is not a great idea, because the number of spout executors is then tied to the number of folders you have, which does not scale, and load distribution will probably be pretty bad. If you still want to do it, you can use the task index of a spout to give different spout executors slightly different behavior (here, "different" meaning reading from different folders). Like this, maybe:
public class MySpout extends BaseRichSpout {
    @Override
    public void open(Map conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        // getThisTaskIndex() is this task's 0-based index among all MySpout
        // tasks; use it to decide which folder this executor should read.
        System.out.println("Spout index = " + context.getThisTaskIndex());
    }
    // nextTuple() and declareOutputFields() omitted for brevity
}
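As for the cosine similarity itself, a common approach is for the bolt to turn each file into a term-frequency vector and then compute dot(a, b) / (|a| * |b|). A rough, Storm-independent sketch (the class and method names are mine, not from any library):

import java.util.Map;

public class CosineSimilarity {
    // Cosine similarity of two term-frequency vectors: dot(a, b) / (|a| * |b|).
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer match = b.get(e.getKey());
            if (match != null) dot += (double) e.getValue() * match;
        }
        for (int v : b.values()) normB += (double) v * v;
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}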

Related

Cleanly-maintained concat steps -- deleting temp files?

Using grunt-contrib-concat, I have the following entry:
application: {
    files: {
        'purgatory/js/appbase.js': [
            'src/js/**/*.js', '!src/js/**/tests/**', '!src/js/**/workers/**'
        ],
        'purgatory/js/appworkers.js': [
            'src/js/workers/shared.js', 'src/js/workers/*.js'
        ],
        'purgatory/js/application.js': [
            'purgatory/js/appbase.js', 'purgatory/js/appworkers.js'
        ]
    }
}
The explanation is this:
"purgatory" is what I call the staging area. In theory, temp files can live here and not necessarily make it to production. So, what happens in the "application" task is that I construct a temp file called "appbase" that contains most of my logic except my web workers and tests. I make the workers phase separate because the order is important. Then I assemble the two temp files into the final application file.
It works.
But my current process eventually just grabs ALL of purgatory/js, because until today none of my files were actually temp; they were all final targets. I would like to continue doing this instead of granularizing the copy phase or running a clean on the temp files first.
I can't help but feel that I'm missing an opportunity right within grunt-contrib-concat.
Something that goes,
"grab everything except workers and tests, but then grab a particular workers file, and then grab the rest of the workers files".
I feel like if I understood the destination map better, I could do it all in one shot instead of bothering with the two temp files. The end goal is to only ever send "application.js" to purgatory, meaning there are no other temp files to clean up or otherwise deal with. Any suggestions?
Well, the answer was staring me straight in the face: you just have to think of the list of paths as a progressively built filter. So, you can exclude all of "workers" earlier in the list, and then include the specific ordered files afterwards. All in the same entry:
application: {
    files: {
        'purgatory/js/application.js': [
            'src/js/**/*.js', '!src/js/**/tests/**', '!src/js/**/workers/**',
            'src/js/workers/shared.js', 'src/js/workers/*.js'
        ]
    }
}
Of course, having done that, you could then just use the src/dest properties if that's what you're more comfortable with.
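For reference, the same target in src/dest form (the other notation grunt-contrib-concat accepts, with the same globs) would look roughly like this:

application: {
    src: [
        'src/js/**/*.js', '!src/js/**/tests/**', '!src/js/**/workers/**',
        'src/js/workers/shared.js', 'src/js/workers/*.js'
    ],
    dest: 'purgatory/js/application.js'
}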

Can we use a small part of Enyo's ilib package?

I have a small requirement to internationalize strings. Honestly, the topic itself is very wide, but I only wish to use ilib's ResourceBundle functionality, where I include strings.json files for each language and use $L("some key") in my Enyo app. Is this possible with a minimum number of individual dependent JavaScript files?
This is what I am talking about. Thanks in advance for your efforts.
The enyo-ilib package has 3 "sizes" of ilib in it. By default, you get the "standard" size which has a reasonable set of things that people might need in it like date formatting, number formatting, etc. What you want is the "core" size which includes only the resource bundle class and the string formatter plus all the classes they depend on like the locale class. In order to use the core size of ilib in your enyo app, you would do the following in your package.js file:
enyo.depends(
    "$lib/enyo-ilib/core-package.js",
    <other libraries>,
    ...
);
Then, set up a resources directory right beside where your index.html is:
index.html
resources/
    de/
        strings.json   -- strings for German
    fr/
        strings.json   -- strings for French
    etc.
Then you should be able to use $L without the memory footprint of the standard size of ilib.
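For what it's worth, each strings.json is just a flat map from the source string (the key you pass to $L) to its translation. For example, resources/de/strings.json might contain something like this (the translations here are made up for illustration):

{
    "some key": "irgendein Schlüssel",
    "Save": "Speichern"
}

With the German locale active, $L("Save") would then return "Speichern".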

Is it an issue to move/rename a folder in ASP.NET?

I would like to rename a folder with ASP.NET:
string oldFolderTitlePath = ServerPhysicalPath + oldFolderTitle + "/";
string newFolderTitlePath = ServerPhysicalPath + newFolderTitle + "/";
DirectoryInfo diPath = new DirectoryInfo(oldFolderTitlePath);
if (diPath.Exists)
{
    // Now move (rename) the folder on the server
    Directory.Move(oldFolderTitlePath, newFolderTitlePath);
}
I wonder: if the old folder contains a number of files and its total size is more than 1 GB, will it take a lot of time to rename the folder in ASP.NET?
Thanks in advance.
Generally, no, it should not take a lot of time. You're basically changing the name of the directory, not actually moving its contents on the disk.
That said, I'd be very careful with doing what you're doing. I'm always wary of IO operations from ASP.NET -- the reason: many users could potentially be executing this code at the same time. That could lead to all sorts of problems. You need to make sure this operation is thread safe (perhaps by locking on a static variable).
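For illustration only, a minimal way to serialize the rename across requests is to lock on a static object. The class and field names below are made up, and whether a process-wide lock is the right granularity depends on your application:

using System.IO;

public static class FolderRenamer
{
    // Shared lock object so concurrent requests don't race on the same rename.
    private static readonly object _renameLock = new object();

    public static void Rename(string oldPath, string newPath)
    {
        lock (_renameLock)
        {
            if (Directory.Exists(oldPath))
            {
                Directory.Move(oldPath, newPath);
            }
        }
    }
}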
http://msdn.microsoft.com/en-us/library/c5kehkcz%28v=vs.71%29.aspx
http://msdn.microsoft.com/en-us/library/system.io.directory.move.aspx

Are there solutions for streamlining the update of legacy code in multiple places?

I'm working in some old code which was originally designed for handling two different kinds of files. I was recently tasked with adding a new kind of file to this code. Most of my problems were solved by filling out an extensive XML file with a new entry that handled everything from what lists were named to how the file is written in plural lower case. But this ended up being insufficient, as there were maybe 50 different places in 24 different code files where I had to update hardcoded switch-statements that only branched for the original two file types.
Unfortunately there is no consistency in this; there are methods which operate half from the XML file, and half off of hardcode. Some of the files which look like they would operate off of the XML file don't, and some that I would expect that I'd need to update the hardcode don't need it. So the only way to find the majority of these is to run through testing the whole system when only part of it is operational, finding that one step to fix (when I'm lucky that error logging actually tells me what is going on), and then running the whole thing again. This wastes time testing the parts of the code which are already confirmed to work, time better spent testing the new parts I have to add on top of it all.
It's a hassle and a half, and as luck would have it, I can expect that I will have to add yet another new kind of file in the near future.
Are there any solutions out there which can aid in this kind of endeavour? Something into which I can input some parameters describing the current features, which documents what points in the whole code project actually need to be updated, and which gives me something nice to run the next time I need to add a new feature to the code. It needn't even be fully automated; something that helps me navigate straight to the specific points in everything, and maybe even records what kind of parameters need to be loaded, would be enough.
Doubt it matters specifically, but the code is comprised of ASP.NET pages, some ASP.NET controls, hundreds of C# code files, and a handful of additional XML files. It's all currently in a couple big Visual Studio 2008 projects.
Not exactly what you are describing, but if you can introduce a seam into the code and lay down some interfaces you can break out and mock, a suite of unit/integration tests would go a long way toward helping you modify old code you may not fully understand.
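As a sketch of what such a seam might look like in C# (all names here are invented for illustration, not taken from your project):

// The hardcoded switch branches move behind an interface, so tests can
// substitute a fake implementation instead of exercising the real files.
public interface IFileTypeHandler
{
    string ListName { get; }
    void Write(string outputPath);
}

public class ReportFileHandler : IFileTypeHandler
{
    public string ListName { get { return "reports"; } }

    public void Write(string outputPath)
    {
        // existing hardcoded behavior for this file type goes here
    }
}

Code that used to switch on the file type asks for an IFileTypeHandler instead, and a unit test can hand it a trivial fake.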
I completely agree with the comment about using Michael Feathers' book to learn how to wedge new tests into legacy code. I'd also strongly recommend Refactoring, by Martin Fowler. What it sounds like you need to do for your code is to implement the "Replace conditionals with polymorphism" refactoring.
I imagine your code today looks somewhat like this:
if (filetype == 23)
{
    type23parser.parse(file);
}
else if (filetype == 69)
{
    filestore = type69reader.read(file);
    File newfile = convertFSto23(filestore);
    type23parser.parse(newfile);
}
What you want to do is to abstract away all the "if (type == foo)" kinds of logic into strategy patterns that are created in a factory.
class FileRules
{
private:
    FileReaderRules *pReader;
    FileParserRules *pParser;
public:
    FileRules() : pReader(NULL), pParser(NULL) {}
    void read(File* inFile) { pReader->read(inFile); }
    void parse(File* inFile) { pParser->parse(inFile); }
};
class FileRulesFactory
{
public:
    FileRules* GetRules(int inputFiletype, int parserType)
    {
        FileReaderRules *pReader = NULL;
        switch (inputFiletype)
        {
        case 23:
            pReader = new ASCIIReader;
            break;
        case 69:
            pReader = new EBCDICReader;
            break;
        }
        switch (parserType)
        ... etc...
Then your main line of code looks like this:
FileRules* rules = FileRulesFactory().GetRules(filetype, parsertype);
rules->read(file);
rules->parse(file);
Pull off this refactoring, and adding a new set of file types, parsers, readers, etc., becomes as simple as writing one exclusive to your new type.
Of course, go read the book. I vastly oversimplified it here, and probably got stuff wrong, but you should get the general idea of how to approach it from this. I can also recommend another book, "Head First Design Patterns", which has a great section on the Factory patterns (if you like those "Head First" kinds of books.)

System::IO::Directory::GetFiles in C++

I am having trouble storing the files from a directory in a string array in C++, using System::IO::Directory::GetFiles.
I would also like to know whether we can copy an entire folder to a new destination in C++, as shown for C# in http://www.codeproject.com/KB/files/xdirectorycopy.aspx
You can store the file names from a directory in a managed array like this:
System::String ^path = "c:\\";
cli::array<System::String ^>^ a = System::IO::Directory::GetFiles(path);
Console::WriteLine(a[0]);
Console::ReadKey();
As for how you would copy an entire folder... Simply recurse from a given root directory, creating each directory and copying the files to the new location. If you are asking for code for this, then please say so, but at least try to figure it out for yourself first (i.e. show me what you have so far).
Check out the file listing program in Boost::FileSystem: http://www.boost.org/doc/libs/1_41_0/libs/filesystem/example/simple_ls.cpp. They iterate over all files, printing the paths, but it's trivial to store them instead.
Assuming you're on Win32, you're looking for the FindFirstFile and FindNextFile APIs.
C/C++ does not define a standard way to do this, though Boost::Filesystem provides a method if you need cross platform support.
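To illustrate the FindFirstFile/FindNextFile route mentioned above (a plain Win32 sketch rather than C++/CLI; the directory path is made up):

#include <windows.h>
#include <iostream>

int main()
{
    WIN32_FIND_DATAA ffd;
    // List every entry directly under C:\data (path is illustrative).
    HANDLE hFind = FindFirstFileA("C:\\data\\*", &ffd);
    if (hFind == INVALID_HANDLE_VALUE)
        return 1;
    do
    {
        if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
            std::cout << ffd.cFileName << "\n";
    } while (FindNextFileA(hFind, &ffd));
    FindClose(hFind);
    return 0;
}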
