Applying collections.Counter() to UTF-8 text

I have a list of non-English words read from a UTF-8 text file.
When I display a single word, it gives me this:
u'\u0648\u0627\u0644\u0623\u0631\u0646\u0628'
To see the original word, I have to loop over the list and print each element; then it is output correctly.
I want to find the 5 most frequent words.
When I store the words in a collections.Counter(), they are kept in their unicode form.
How do I access the words inside the Counter in order to print the top 5 most frequent words?
I have written the following code (txt being the contents of my text file):
words = [w for w in txt.split()]
will print out
[u'\ufeff\u0643\u0627\u0646', u'\u064a\u0627', u'\u0645\u0627',
...u'\u0643\u0627\u0646', u'\u0641\u064a', u'\u0642\u062f\u064a\u0645']
I therefore loop it to get the desired output (I don't know why)
>>> for w in words:
...     print w,
...
will print out
كان يا ما كان
I use collections.Counter() to find the most frequent words
>>> count = collections.Counter(words)
>>> print count.most_common(5)
will print out
[(u'\u0627\u0644\u0633\u0644\u062d\u0641\u0627\u0629', 5),
(u'\u0627\u0644\u0645\u063a\u0631\u0648\u0631', 3),
(u'\u0627\u0644\u0623\u0631\u0646\u0628', 2), (u'\u060c', 2),
(u'\u0648\u0627\u0644\u0623\u0631\u0646\u0628', 2)]
I want to loop over the words and print each one WITH its frequency.

With your first example, you can just print the word directly to get the original (I can't read Arabic, so this may be wrong):
>>> print u'\u0648\u0627\u0644\u0623\u0631\u0646\u0628'
والأرنب
If you are doing this through the interpreter and you do not explicitly use print, you will still see the unicode representation:
>>> u'\u0648\u0627\u0644\u0623\u0631\u0646\u0628'
u'\u0648\u0627\u0644\u0623\u0631\u0646\u0628'
Therefore, you can just call print to see the actual word:
>>> l
[(u'\u0627\u0644\u0633\u0644\u062d\u0641\u0627\u0629', 5),
(u'\u0627\u0644\u0645\u063a\u0631\u0648\u0631', 3),
(u'\u0627\u0644\u0623\u0631\u0646\u0628', 2), (u'\u060c', 2),
(u'\u0648\u0627\u0644\u0623\u0631\u0646\u0628', 2)]
>>> for el in l:
...     print el[0], el[1]
...
السلحفاة 5
المغرور 3
الأرنب 2
، 2
والأرنب 2
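Putting the pieces together, here is a minimal sketch of the whole task. It assumes Python 3 syntax (print as a function), and the sample txt is a hypothetical stand-in built from the first few words shown in the question, including the leading \ufeff byte order mark. Stripping that mark is my own addition: without it, the first word would be counted separately from its repeats.
```python
import collections

# Hypothetical sample text: the first four words from the question,
# written as escapes and starting with a byte order mark (U+FEFF).
txt = u"\ufeff\u0643\u0627\u0646 \u064a\u0627 \u0645\u0627 \u0643\u0627\u0646"

# Drop the BOM so the first word matches its later occurrences.
words = txt.lstrip(u"\ufeff").split()

counts = collections.Counter(words)

# most_common(n) returns (word, count) pairs, highest count first.
top = counts.most_common(5)
for word, freq in top:
    # Printing the string itself (not the list) shows the real characters.
    print(u"{0} {1}".format(word, freq))
```
Unpacking each pair into word and freq avoids the el[0], el[1] indexing and reads a little more clearly.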

Related

Formatting in R: print(sprintf("%03d / %03d"))

I'm a beginner in R and I'm confused by the formatting part of this code:
print(sprintf("%03d / %03d", j, n)).
I have read ?sprintf and the R documentation. I understand each symbol on its own, but I don't understand how to read it all together, especially with the '/'. What does this formatting mean?
The "/" is just a literal forward slash that should appear in the output string.
To break the code down, it means "Write the integer j with leading zeros if necessary, so that it is at least 3 characters long (%03d), then write the literal string " / ", then write the integer n with leading zeros if necessary so that it is at least 3 characters long (%03d)"
For example:
sprintf("%03d / %03d", 4, 2)
#> [1] "004 / 002"
In other words, the forward slash could be any text:
sprintf("%03d banana %03d", 4, 2)
#> [1] "004 banana 002"
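For readers more at home in Python, the same printf-style codes work with Python's % string operator. This is only a side-by-side sketch, not part of the R answer; the values of j and n are the same illustrative ones used above.
```python
j, n = 4, 2  # illustrative values, matching the R example

# %03d: an integer, zero-padded to a width of at least 3.
# " / " is literal text that is copied into the result unchanged.
progress = "%03d / %03d" % (j, n)
print(progress)  # 004 / 002

# The literal part can be any text.
banana = "%03d banana %03d" % (j, n)
print(banana)  # 004 banana 002
```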

Python input prompt using for loop

I have created a script that takes user input and greets the user with "Hello", but it prints the name one letter at a time instead of the complete name in one sentence. Why?
val = input()
for i in val:
    print("Hello", i)
but this prints
Hello p
Hello r
Hello i
Hello n
Hello c
Hello e
The function input() takes in an input from the command line. For example, if you input prince into the command line, now the variable val has a value of "prince".
With the for loop, you are using the for-each notation. Strings are iterable: a string is a sequence of characters. Think of it like a regular list, but instead of having a list like [1, 2, 3, 4], you have the list ['p', 'r', 'i', 'n', 'c', 'e']. So each iteration of the for loop prints only the character it is currently on.
You could simplify your code by avoiding the for loop and only using the code print("Hello", val).
However, if you only want to practice with for loops, you can use the code below. Try to understand how and why you can simplify it!
val = input()  # stores the user input in val
name = ""  # creates an empty string called name
for s in val:  # iterates through each character in val
    name += s  # adds that character to name
# when the for loop ends, the user input is stored in name
print("Hello", name)  # prints "Hello" and the name
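To see the difference without typing anything in, here is a small sketch with the input hard-coded. The value "prince" is a stand-in for whatever input() would return.
```python
val = "prince"  # stand-in for the value input() would return

# Iterating over a string visits one character at a time.
letters = [ch for ch in val]
print(letters)  # ['p', 'r', 'i', 'n', 'c', 'e']

# Using the whole string at once gives a single greeting.
greeting = "Hello " + val
print(greeting)  # Hello prince
```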

Python 3.4 help - using slicing to replace characters in a string

Say I have a string.
"poop"
I want to change "poop" to "peep".
In fact, I want all of the o's to change to e's in any word I put in.
Here's my attempt to do the above.
def getword():
    x = (input("Please enter a word."))
    return x
def main():
    y = getword()
    for i in range (len(y)):
        if y[i] == "o":
            y = y[:i] + "e"
    print (y)
main()
As you can see, when you run it, it doesn't produce what I want. Here is my expected output.
Enter a word.
>>> brother
brether
Something like this. I need to do it using slicing. I just don't know how.
Please keep your answer simple, since I'm somewhat new to Python. Thanks!
This uses slicing (but keep in mind that slicing is not the best way to do it):
def f(s):
    for x in range(len(s)):
        if s[x] == 'o':
            s = s[:x]+'e'+s[x+1:]
    return s
Strings in Python are immutable, which means that you can't swap out letters in place; you need to build a whole new string, concatenating the letters one by one:
def getword():
    x = (input("Please enter a word."))
    return x
def main():
    y = getword()
    output = ''
    for i in range(len(y)):
        if y[i] == "o":
            output = output + 'e'
        else:
            output = output + y[i]
    print(output)
main()
I'll help you this once, but you should know that Stack Overflow is not a homework help site. You should be figuring these things out on your own to get the full educational experience.
EDIT
Using slicing, I suppose you could do:
def getword():
    x = (input("Please enter a word."))
    return x
def main():
    y = getword()
    output = '' # String variable to hold the output string. Starts empty
    slice_start = 0 # Keeps track of what we have already added to the output. Starts at 0
    for i in range(len(y) - 1): # Scan through all but the last character
        if y[i] == "o": # If character is 'o'
            output = output + y[slice_start:i] + 'e' # then add all the previous characters to the output string, and an e character to replace the o
            slice_start = i + 1 # Move the start of the next slice to the letter immediately after the 'o'
    output = output + y[slice_start:-1] # Add the remaining characters after the last handled 'o', up to but not including the last character
    if y[-1] == 'o': # We still haven't checked the last character, so check if it's an 'o'
        output = output + 'e' # If it is, add an 'e' instead to output
    else:
        output = output + y[-1] # Otherwise just add the character as-is
    print(output)
main()
Comments should explain what is going on. I'm not sure if this is the most efficient or best way to do it (which really shouldn't matter, since slicing is a terribly inefficient way to do this anyways), just the first thing I hacked together that uses slicing.
EDIT Yeah... Ourous's solution is much more elegant
Can slicing even be used in this situation??
The only probable solution I think would work, as MirekE stated, is y.replace("o","e").
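For comparison, here is what the y.replace("o","e") suggestion looks like as a complete snippet. The function name swap_o_for_e is just an illustrative choice.
```python
def swap_o_for_e(word):
    # str.replace returns a new string; the original is left untouched.
    return word.replace("o", "e")

print(swap_o_for_e("poop"))     # peep
print(swap_o_for_e("brother"))  # brether
```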

Why doesn't Rebol new-line? treat the newline keyword and the newline character similarly?

I would think that the following Rebol 3 code:
x: [newline 1 2]
y: [
1 2]
print x
print new-line? x
print y
print new-line? y
should output:
<empty line>
1 2
true
<empty line>
1 2
true
but what is output is:
<empty line>
1 2
false
1 2
true
Both blocks, when reduced, should yield a newline character followed by '1' and '2' and so, IMO, should print identically. Less clear is whether new-line? on the two blocks should also give the same result since the newline keyword should be equivalent to the literal newline for this kind of test.
The flag which is checked by new-line? and set by new-line is used only by LOAD and MOLD. For all other semantic purposes in the program, it might as well not be there.
Therefore your x and y are completely different. Note that:
x: [newline 1 2]
y: [
1 2]
3 = length? x
2 = length? y
It's a quirk of Rebol that it singles out this one bit of whitespace information to stow in a hidden place. But arguably the choice to break a line represents something that is often significant in source, that if you reflect it back out into text you'd like to preserve more than the rest of the whitespace.
Let's start with NEWLINE: newline is a word bound to a char! value:
>> ? newline
NEWLINE is a char of value: #"^/"
That's Rebol's escape sequence for Unicode codepoint U+000A, commonly used as line feed ("LF") control code.
So your first example code [newline 1 2] has nothing to do with the NEW-LINE function. It simply describes a block containing three values: newline (a word!), 1 (an integer!), and 2 (another integer!). If you REDUCE the block from your first example, you'll get another block of three values: char!, integer!, and integer!:
>> reduce [newline 1 2]
== [#"^/" 1 2]
Now PRINT does not only REDUCE a block argument; it does a REFORM (first REDUCE, then FORM). FORM of a block converts the elements to a string representation and then joins those with spaces in between:
>> form [1 2 3]
== "1 2 3"
Putting those pieces together we finally know how to arrive at the output you are seeing for your first example:
>> basis: [newline 1 2 3]
== [newline 1 2 3]
>> step1: reduce basis
== [#"^/" 1 2 3]
>> step2: form step1
== "^/1 2 3"
>> print step2
<empty line>
1 2 3
So the question remains, why does the second example not print identically?
That's because FORM (used by PRINT, as described above) does not respect the NEW-LINE flag when converting from a block to a string.
This flag is "meta-data", not unlike e.g. the index position of an element within a block. So just as you don't have elements at index positions 8 and 6 just because you write a block like [8 6], you don't set the new-line flag for a position just because you happen to put an element there which is a character that on some systems represents a line break: [1 newline 2].
And this finally brings us to the last part of the puzzle: NEW-LINE? does not check if a given string represents a line break. It checks if a block (at its current position) has the new-line flag set.
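If it helps to see the distinction outside Rebol, here is a toy model in Python. It is only an analogy, not how Rebol is implemented: the Slot class, the new_line function and the values function are all invented for illustration. The point is that the flag lives beside each value, not among the values.
```python
class Slot:
    """One position in a block: a value plus a hidden line-break flag."""
    def __init__(self, value, newline_before=False):
        self.value = value
        self.newline_before = newline_before

def new_line(block):
    """Mimics NEW-LINE?: reads only the flag at the head of the block."""
    return bool(block) and block[0].newline_before

def values(block):
    """What evaluation sees: the values, with the flags ignored."""
    return [slot.value for slot in block]

# x: [newline 1 2] after REDUCE: three values, the first a line-feed character.
x = [Slot("\n"), Slot(1), Slot(2)]
# y: a block written with a line break before the 1: two values, flag set.
y = [Slot(1, newline_before=True), Slot(2)]

print(len(x), new_line(x))  # 3 False
print(len(y), new_line(y))  # 2 True
print(values(y))            # [1, 2]
```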

Pythonic way to iterate over a collections.Counter() instance in descending order?

In Python 2.7, I want to iterate over a collections.Counter instance in descending count order.
>>> import collections
>>> c = collections.Counter()
>>> c['a'] = 1
>>> c['b'] = 999
>>> c
Counter({'b': 999, 'a': 1})
>>> for x in c:
...     print x
...
a
b
In the example above, it appears that the elements are iterated in the order they were added to the Counter instance.
I'd like to iterate over the list from highest to lowest. I see that the string representation of Counter does this, just wondering if there's a recommended way to do it.
You can iterate over c.most_common() to get the items in the desired order. See also the documentation of Counter.most_common().
Example:
>>> c = collections.Counter(a=1, b=999)
>>> c.most_common()
[('b', 999), ('a', 1)]
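As a loop, that might look like the sketch below. It is written with Python 3's print function, which is an assumption on my part since the question uses 2.7; the most_common() call itself is the same in both.
```python
import collections

c = collections.Counter(a=1, b=999)

# most_common() with no argument returns every (element, count) pair,
# sorted from the highest count to the lowest.
ordered = c.most_common()
for element, count in ordered:
    print(element, count)
```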
Here is an example of iterating over a Counter:
>>> def counterIterator():
...     import collections
...     counter = collections.Counter()
...     counter.update(('u1','u1'))
...     counter.update(('u2','u2'))
...     counter.update(('u2','u1'))
...     for ele in counter:
...         print(ele,counter[ele])
...
>>> counterIterator()
u1 3
u2 3
Your problem of iterating in descending order is already solved, but here is how to do it generically, in case someone else comes here from Google. Iterating over a collections.Counter() yields the keys of the underlying dictionary. To get the values, pass each key back to the counter, like so:
for x in c:
    key = x
    value = c[key]
I had a more specific problem where I had word counts and wanted to filter out the low frequency ones. The trick here is to make a copy of the collections.Counter() or you will get "RuntimeError: dictionary changed size during iteration" when you try to remove them from the dictionary.
for word in words.copy():
    # remove words that occur 3 times or fewer
    if words[word] <= 3:
        del words[word]
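Here is a self-contained sketch of that filtering, with made-up word counts, plus an alternative that builds a new Counter instead of deleting entries. The names pruned and kept are illustrative.
```python
import collections

# Hypothetical word counts for illustration.
words = collections.Counter({"the": 12, "tortoise": 5, "hare": 2, "once": 1})

# Option 1: loop over a snapshot of the keys and delete from the counter.
pruned = words.copy()
for word in list(pruned):
    if pruned[word] <= 3:
        del pruned[word]

# Option 2: build a new Counter that holds only the entries to keep.
kept = collections.Counter({w: n for w, n in words.items() if n > 3})

print(pruned == kept)  # True
```
Option 2 never mutates anything while iterating, so the RuntimeError cannot occur, at the cost of building a second object.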

Resources