I'm using hadoop streaming to process a huge file.
Say I have a file where each line is a number, and I want to split this file into 2 files, one containing the odd
numbers and the other the even ones.
Using hadoop, I might specify 2 reducers for this job, because when the numbers go from mapper to
reducer, I thought which reducer a number goes to is determined by number % 2, right?
But I was told otherwise: it's not simply number % 2 but hash(number) % 2 that determines which
reducer a number goes to. Is that true?
If so, how could I make it work? Can I specify a Partitioner or something to make it right?
How about doing the split in your mapper?
For example, each mapper does
if int(number) % 2 == 0:
    print("EVEN\t%s" % number)   # emit key "EVEN" with the number as the value
else:
    print("ODD\t%s" % number)    # emit key "ODD" with the number as the value
Then reduce over two keys: EVEN and ODD, writing them to the appropriate file.
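For example, a complete Hadoop Streaming mapper in Python could look like the sketch below (the name mapper.py is just a placeholder, and this assumes one integer per input line):

#!/usr/bin/env python
# mapper.py -- sketch of a streaming mapper that tags each number as EVEN or ODD
import sys

for line in sys.stdin:
    number = line.strip()
    if not number:
        continue                      # skip blank lines
    key = "EVEN" if int(number) % 2 == 0 else "ODD"
    # Hadoop Streaming treats everything before the first tab as the key.
    print("%s\t%s" % (key, number))

The reducer can then simply echo the values it receives; because records are grouped and sorted by key before they reach a reducer, each key's numbers arrive together and stay grouped in the output.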
I just hope to learn how to make a simple statistical summary of the random numbers from rows 1 to 5 in R (as shown in the picture), and then assign these rows to a single variable.
Hope you can help!
When you type something like 3 on a single line and ask R to "run" it, it doesn't store that anywhere -- it just evaluates it, meaning that it tries to make sense out of whatever you've typed (such as 3, or 2+1, or sqrt(9), all of which would return the same value) and then it more or less evaporates. You can think of your lines 1 through 5 as behaving like you've used a handheld scientific calculator; once you type something like 300 / 100 into such a calculator, it just shows you a 3, and then after you have executed another computation, that 3 is more or less permanently gone.
To do something with your data, you need to do one of two things: either store it in your environment somehow, or "pipe" your data directly into a useful function.
In your question, you used this script:
1
3
2
7
6
summary()
I don't think it's possible to repair this strategy in the way that you're hoping -- and if it is possible, it's not quite the "right" approach. By typing the numbers on individual lines, you've structured them so that they'll evaluate individually and then evaporate. In order to run the summary() function on those numbers, you will need to bind them together inside a single vector somehow, then feed that vector into summary(). The "store it" approach would be
my_vector <- c(1, 3, 7, 2, 6)
summary(my_vector)
The important part isn't actually the parentheses; it's the function c(), which stands for concatenate and instructs R to treat those 5 numbers as a single collective object (i.e. a vector). We then pass that single object into summary().
To use the "piping" approach and avoid having to store something in the environment, you can do this instead (requires R 4.1.0+):
c(1, 3, 7, 2, 6) |> summary()
Note again that the use of c() is required, because we need to bind the five numbers together first. If you have an older version of R, you can get a slightly different pipe operator (%>%) from the magrittr library instead that will work the same way. The point is that this "binding" part of the process is essential and can't be skipped.
Now, the crux of your question: presumably, your data doesn't really look like the example you used. Most likely, it's in some separate .csv file or something like that; if not, hopefully it is easy to get it into that format. Assuming this is true, this means that R will actually be able to do the heavy lifting for you in terms of formatting your data.
As a very simple example, let's say I have a plain text file, my_example.txt, whose contents are
1
3
7
2
6
In this case, I can ask R to parse this file for me. Assuming you're using RStudio, the simplest way to do this is to use the File -> Import Dataset part of the GUI. There are various options dealing with things such as headers, separators, and so forth, but I can't say much meaningful about what you'd need to do there without seeing your actual dataset.
When I import that file, I notice that it does two things in my R console:
my_example <- read.table(...)
View(my_example)
The first line stores an object (called a "data frame" in this case) in my environment; the second shows a nice view of how it's rendered. To get the summary I want, I just need to extract the vector of numbers, which I can see from the view is called V1; I can do that with summary(my_example$V1).
This example is probably not helpful for your actual data set, because there are so many variations on the theme here, but the theme itself is important: point R at a file, ask it to turn that file into an object, then work with that object. That's the approach I'd recommend instead of typing data as lines within an R script, as it's much faster and less error-prone.
Hopefully this will get you pointed in the right direction in terms of getting your data into R and working with it.
I am so very, very new to R. Like, "had to look up how to open a file in R" new. Diving in the deep end. Anyway:
I have a bunch of .csv files with results that I need to analyse. Really, I would like to set up some kind of automation so I can just say "go" (a function?)
Basically I have results in one file that ends in -particle.csv and another that ends in -ROI.csv. They have the same names, so I know which ones match up (e.g. brain1 section1 -particle.csv and brain1 section1 -ROI.csv). I need to do some maths using these two datasets: divide column 2, rows 2 to x, in -particle.csv (the row number might change, but is there a way of saying rows "2 to no more content"?) by columns 1, 5, 10, etc., row 2, in -ROI.csv (the column numbers will always stay the same, but if it helps they are all called Area1, Area2, Area3, ...; the number of Area columns can vary, but surely there's a way I can say "every column that begins with Area"? Also the Area count and the row count will always match up).
Okay, I'm fine to do that manually for each set of results, but I have over 300 brains to analyse! Is there any way I can set it up as a process that I can apply to these and future results that will be in the same format?
Sorry if this is a huge ask!
I am concatenating 1000s of nc-files (outputs from simulations) to allow me to handle them more easily in Matlab. To do this I use ncrcat. The files have different sizes, and the time variable is not unique between files. The concatenation works well and allows me to read the data into Matlab much quicker than reading the files individually. However, I want to be able to identify the original nc-file from which each data point originates. Is it possible to, say, add the source filename as an extra variable so I can trace back the data?
Easiest way: Online indexing
Before we start, I would use an integer index rather than the filename to identify each run, as it is a lot easier to handle, both for writing and then for handling in the Matlab programme. Rather than a simple monotonically increasing index, the identifier can have relevance for your run, or you can even write several separate indices if necessary (e.g. you might have a number for the resolution, the date, the model version, etc.).
So, the obvious way to do this that I can think of would be that each simulation writes an index to the file to identify itself. i.e. the first model run would write a variable
myrun=1
the second
myrun=2
and so on... then when you cat the files the data can be uniquely identified very easily using this index.
Note that if, as you write, your spatial dimensions are not unique and the number of time steps also changes from run to run, your index will need to be a function of all the non-unique dimensions, e.g. myrun(x,y,t). If any of your dimensions are unique across all files then that dimension is redundant in the index and can be omitted.
Of course, the only issue with this solution is it means running the simulations again :-D and you might be talking about an expensive model to run or someone else's runs you can't repeat. If rerunning is out of the question you will need to try to add an index offline...
Offline indexing (easy if the grids are the same, more complex otherwise)
IF your space dimensions were the same across all files, then this is still an easy task as you can add an index offline very easily across all the time steps in each file using nco:
ncap2 -s 'myrun[$time]=array(X,0,$time)' infile.nc outfile.nc
or if you are happy to overwrite the original file (be careful!)
ncap2 -O -s 'myrun[$time]=array(X,0,$time)'
where X is the run number. This will add a new variable myrun, which is a function of time, with the value X at each time step. When you merge, you can see which data slice came from which specific run.
By the way, the second argument (the zero) is the increment; as this is set to zero, the number X is written for all timesteps in a given file. (If it were 1, the index would increase by one each timestep, which could be useful in some cases. For example, you might use two indices: one with an increment of zero to identify the run, and a second with an increment of one to easily tell you which step of the Xth run a data slice belongs to.)
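Following the same pattern as the command above, the two indices could be added in one go like this (a sketch; X is again the run number and the variable names are just examples):

ncap2 -s 'myrun[$time]=array(X,0,$time)' -s 'mystep[$time]=array(1,1,$time)' infile.nc outfile.nc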
If your files are for different domains too, then you might want to put them on a common grid before you do that... I think for that
cdo enlarge
might be of help; see this post: https://code.mpimet.mpg.de/boards/2/topics/1459
I agree that an index will be simpler than a filename. I would just add to the above answer that the command to add a unique index X with a time dimension to each input file can be simplified to
ncap2 -s 'myrun[$time]=X' in.nc out.nc
I need to count the number of lines in each block and the number of blocks in order to read the file properly afterwards. Can anybody suggest a sample piece of code in Fortran?
My input file goes like this:
# Section 1 at 50% (Name of the block and its number and value)
1 2 3 (Three numbers per line, with a random number of lines)
...
1 2 3
# Section 2 at 100% (And then again Name of the block)
1 2 3...
and so on.
The code is below. It works fine with one set of data, but when it meets "#" again it just stops, providing data only about one section. It cannot jump to another section:
integer IS, NOSEC, count
double precision SPAN
character(LEN=100):: NAME, NAME2, AT
real whatever
101 read (10,*,iostat=ios) NAME, NAME2, IS, AT, SPAN
if (ios/=0) go to 200
write(6,*) IS, SPAN
count = 0
102 read(10,*,iostat=ios) whatever
if (ios/=0) go to 101
count = count + 1
write(6,*) whatever
go to 102
200 write(6,*) 'Section span =', SPAN
So the first loop (101) is supposed to read the parameters of the block, and the second (102) counts the number of lines in the block, with 'count' as the only parameter that is needed. However, when after 102 it is supposed to jump back to 101 to start a new block, it just goes to 200 instead (printing the results of the operation), which means it couldn't read the data about the second block.
Let's say your file contains two valid types of lines:
Block headers, which begin with '#', and
Data lines, which begin with a digit 0 through 9
Let's add further conditions:
Leading whitespace is ignored,
Lines which don't match the first two patterns are considered comments and are ignored
Comment lines do not terminate a block; blocks are only terminated when a new block is found or the end of the file is reached,
Data lines must follow a block header (the first non-comment line in a file must be a block header),
Blocks may be empty, and
Files may contain no blocks
You want to know the number of blocks and how many data lines are in each block but you don't know how many blocks there might be. A simple dynamic data structure will help with record-keeping. The number of blocks may be counted with just an integer, but a singly-linked list with nodes containing a block ID, a data line count, and a pointer to the next node will gracefully handle an arbitrarily large blob of data. Create a head node with ID = 0, a data line count of 0, and the pointer nullify()'d.
The Fortran Wiki has a pile of references on singly-linked lists: http://fortranwiki.org/fortran/show/Linked+list
Since the parsing is simple (e.g. no backtracking), you can process each line as it is read. Iterate over the lines in the file, use adjustl() to dispose of leading whitespace, then check the first character: if it is '#', increment your block counter by one, add a new node to the list, set its ID to the value of the block counter (1 for the first block), and process the next line.
Aside: I have a simple character function called munch() which is just trim(adjustl()). Great for stripping whitespace off both ends of a string. It doesn't quite act like Perl's chop() or chomp() and Fortran's trim() is more of an rtrim() so munch() was the next best name.
If the line doesn't match a block header, check if the first character is a digit; index('0123456789', line(1:1)) is greater than zero if the first character of line is a digit, otherwise it returns 0. Increment the data line count in the head node of the linked list and go on to process the next line.
Note that if the block count is zero, this is an error condition; write out a friendly "Data line seen before block header" error message with the last line read and (ideally) the line number in the file. It takes a little more effort but it's worth it from the user's standpoint, especially if you're the main user.
Otherwise if the line isn't a block header or a data line, process the next line.
Eventually you'll hit the end of the file and you'll be left with the block counter and a linked list that has at least one node. Depending on how you want to use this data later, you can dynamically allocate an array of integers the length of the block counter, then transfer the data line count from the linked list to the array. Then you can deallocate the linked list and get direct access to the data line count for any block because the block index matches the array index.
I use a similar technique for reading arbitrarily long lists of data. The singly-linked list is extremely simple to code and it avoids the irritation of having to reallocate and expand a dynamic array. But once the amount of data is known, I carve out a dynamic array the exact size I need and copy the data from the linked list so I can have fast access to the data instead of needing to walk the list all the time.
Since Fortran doesn't have a standard library worth mentioning, I also use a variant of this technique with an insertion sort to simultaneously read and sort data.
So sorry, no code but enough to get you started. Defining your file format is key; once you do that, the parser almost writes itself. It also makes you think about exceptional conditions: data before block header, how you want to treat whitespace and unrecognized input, etc. Having this clearly written down is incredibly helpful if you're planning on sharing data; the Fortran world is littered with poorly-documented custom data file formats. Please don't add to the wreckage...
Finally, if you're really ambitious/insane, you could write this as a recursive routine and make your functional programming friends' heads explode. :)
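Purely as an illustration of that logic (in Python rather than Fortran, so treat it as pseudocode; the name count_blocks is made up), with a plain list standing in for the linked list:

def count_blocks(filename):
    # One entry per block, holding that block's data-line count.
    counts = []
    with open(filename) as f:
        for lineno, raw in enumerate(f, start=1):
            line = raw.strip()              # plays the role of adjustl()/trim()
            if line.startswith("#"):        # block header: start a new block
                counts.append(0)
            elif line[:1].isdigit():        # data line: count it in the current block
                if not counts:
                    raise ValueError("Data line before block header at line %d" % lineno)
                counts[-1] += 1
            # anything else is treated as a comment and ignored
    return counts                           # len(counts) is the number of blocks

The list here replaces the singly-linked list only because Python lists grow on their own; in Fortran you would keep the linked list while reading and copy the counts into an allocatable integer array at the end, as described above.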
I am designing a word filter that can filter out bad words (200 words in the list) in an article (about 2000 words). And there I have a problem: what data structure should I use to store this bad word list, so that the program takes as little time as possible to find the bad words in articles?
-- more details
If the size of the bad word list is 2000 and the article is 50000 words, and the program will process about 1000 articles at a time, which data structure should I choose? Something better than an O(n^2) solution for searching?
You can use a hash table, because its average complexity is O(1) for insert and search, and your data is just 2000 words.
http://en.wikipedia.org/wiki/Hash_table
A dictionary usually is a mapping from one thing (word in 1st language) to another thing (word in 2nd language). You don't seem to need this mapping here, but just a set of words.
Most languages provide a set data structure out of the box that has insert and membership testing methods.
A small example in Python, comparing a list and a set:
import random
import string
import time

def create_word(min_len, max_len):
    # Build a random lowercase "word" of min_len to max_len letters.
    return "".join([random.choice(string.ascii_lowercase)
                    for _ in range(random.randint(min_len, max_len))])

def create_article(length):
    return [create_word(3, 10) for _ in range(length)]

wordlist = create_article(50000)
article = " ".join(wordlist)
good_words = []
bad_words_list = [random.choice(wordlist) for _ in range(2000)]

print("using list")
print(time.time())
for word in article.split(" "):
    if word in bad_words_list:
        continue
    good_words.append(word)
print(time.time())

good_words = []
bad_words_set = set(bad_words_list)

print("using set")
print(time.time())
for word in article.split(" "):
    if word in bad_words_set:
        continue
    good_words.append(word)
print(time.time())
This creates an "article" of 50000 randomly created "words" with a length between 3 and 10 letters, then picks 2000 of those words as "bad words".
First, they are put in a list and the "article" is scanned word by word if a word is in this list of bad words. In Python, the in operator tests for membership. For an unordered list, there's no better way than scanning the whole list.
The second approach uses the set datatype, initialized with the list of bad words. A set has no ordering, but much faster lookup (again using the in operator) for checking whether an element is contained. That seems to be all you need here.
On my machine, the timings are:
using list
1421499228.707602
1421499232.764034
using set
1421499232.7644095
1421499232.785762
So it takes about 4 seconds with a list and 2 hundredths of a second with a set.
I think the best structure you can use there is a set. - http://en.wikipedia.org/wiki/Set_%28abstract_data_type%29
It takes log_2(n) time to add an element to the structure (a one-time operation) and the same for every query. So if you have 200 elements in the data structure, your program will need to do only about 8 operations to check whether a word exists in the set.
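That estimate assumes an ordered (tree-like) set. As a rough illustration of the same O(log n) idea in Python (the names here are made up), you can binary-search a sorted list:

import bisect

bad_words = sorted(["foo", "bar", "baz"])    # sort the bad-word list once

def is_bad(sorted_words, word):
    # Binary search: about log2(n) comparisons per lookup.
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

print(is_bad(bad_words, "foo"))     # True
print(is_bad(bad_words, "hello"))   # False

With 200 bad words that is roughly 8 comparisons per lookup, matching the estimate above.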
You need a Bag data structure for this problem. In a Bag data structure elements have no order, but the structure is designed for fast lookup of an element in the Bag. Its time complexity is O(1). So for N words in an article, the overall complexity turns out to be O(N), which is the best you can achieve in this case. Java's Set (for example HashSet) is an example of such a structure in Java.