CPython: Why does lower (string method) create new memory locations?

What's going on here?:
>>> a, b, c = ("TEST", "test", "TEST".lower())
>>> map(id, [a,b,c])
[140341845003072, 140341845003216, 140341845003264]
>>> map(str, [a,b,c])
['TEST', 'test', 'test']
>>> map(type, [a,b,c])
[<type 'str'>, <type 'str'>, <type 'str'>]
Shouldn't "TEST" and "TEST".lower() or "test" and "test".lower() share the same memory location?
EDIT: I get that there's a new copy, but I thought when two strings are the same, they share the same memory space, i.e.:
>>> a = "test"
>>> b = "test"
>>> map(id, (a,b))
[140341845003216, 140341845003216]
>>> a is b
True
On Python 2.7.3, I get:
>>> a = "test"
>>> a is a.lower()
False

The docs are clear. For string.lower():
Return a copy of s, but with upper case letters converted to lower case.

If you want identical strings to be identical objects, intern them.
By default, None, True, and False behave like that, as do constants in the source, including strings, even across modules (an implementation detail of CPython, not a language guarantee).
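As a small illustration of the interning advice, here is a Python 3 sketch (the variable names are made up for this example). sys.intern returns one canonical object for equal strings; whether a and b start out as the same object is left to the implementation, so the sketch does not rely on it.

```python
import sys

a = "test"
b = "TEST".lower()  # computed at run time; the implementation may or may not reuse an existing object

# Equality compares contents and is always reliable.
same_value = (a == b)

# sys.intern maps equal strings to a single canonical object,
# so an identity comparison is meaningful after interning.
ia = sys.intern(a)
ib = sys.intern(b)
same_object = (ia is ib)

print(same_value, same_object)
```

For ordinary code, comparing with == is the safer habit; interning mainly matters when identity checks or memory use are a concern.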

It is not guaranteed that equal strings are always the same object. As others have pointed out, it depends on the Python implementation (e.g. for Dietrich with CPython 3.2.3, it is the same object).
The code of CPython 2.7 is quite simple: https://github.com/albertz/CPython/blob/master/Objects/stringobject.c#L1984

Related

Performance of looping over array of dicts in Julia

I am doing loops over arrays/vector of dicts. My Julia performance is slower than Python.
Run time is 25% longer than Python.
The toy program below represents the structure of what I am doing.
const poems = [
Dict(
:text => "Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore—",
:author => "Edgar Allen Poe"),
Dict(
:text => "Because I could not stop for Death – He kindly stopped for me – The Carriage held but just Ourselves – And Immortality.",
:author => "Emily Dickinson"),
# etc... 10,000 more
]
const remove_count = [
Dict(:words => "the", :count => 0),
Dict(:words => "and", :count => 0),
# etc... 100 more
]
function loop_poems(poems, remove_count)
    for p in poems
        for r in remove_count
            if occursin(r[:words], p[:text])
                r[:count] += 1
            end
        end
    end
end
How do I optimize? I have read through Performance Tips in Julia website:
First, I declare constants.
Second, I assume since I pass arguments remove count and poems into the function, I don't need to declare as global.
Third, the meat of the processing (the loops) are in a function.
Fourth... I don't know how to declare types in an array of dicts (specifically for performance). How to do it? Would it help performance?

The issue here seems to be what we call "type instability". Julia code is fast when Julia can infer the concrete types at compile time, but it is slower when the types are not known. To check for type instability, you can use the @code_warntype macro in your REPL:
julia> @code_warntype loop_poems(poems, remove_count)
StackOverflow does not show the colors of the output, but you should look out for red parts, which indicate that Julia cannot narrow down the types enough. Yellow parts also indicate places where the type is not known exactly, but these are often intentional, so we have to worry less about them.
In my case (Julia v1.8.5) the following lines have some red color
│ %17 = Base.getindex(r, :words)::Any
│ %18 = Base.getindex(p, :text)::String
│ %19 = Main.occursin(%17, %18)::Bool
└── goto #5 if not %19
4 ─ %21 = Base.getindex(r, :count)::Any
│ %22 = (%21 + 1)::Any
The suffix ::Any indicates that Julia could only infer the type Any here, which could be any type.
We also see that this happens in the cases of Base.getindex(r, :words) and Base.getindex(r, :count) - these are just the de-sugared expressions r[:words] and r[:count].
So why is that the case? If we look at the type of remove_count with
julia> typeof(remove_count)
Vector{Dict{Symbol, Any}}
We see that the key type of the dictionary can only be a Symbol, but the value type can be any type. We can get a very moderate speed-up by constructing remove_count so that we narrow down the value type to a union:
const remove_count = Dict{Symbol, Union{String, Int}}[
Dict(:words => "the", :count => 0),
Dict(:words => "and", :count => 0),
# etc... 100 more
]
Running @code_warntype again shows that we still have some red entries, but this time at least they are of type Union{String, Int}. The speed-up is still disappointing, though.
As others have pointed out, it might be better to find a different data structure so that your code is type stable.
There are multiple ways to do that. Probably the easiest is to use a vector of NamedTuple:
const remove_count = [
    (word="the", count=0),
    (word="and", count=0),
    # etc... 100 more
]
so that
typeof(remove_count)
Vector{NamedTuple{(:word, :count), Tuple{String, Int64}}}
and your function then becomes
function loop_poems(poems, remove_count)
    for p in poems
        for i in eachindex(remove_count)
            word = remove_count[i].word
            count = remove_count[i].count
            if occursin(word, p[:text])
                # NamedTuple is immutable, so we need to create a new one
                remove_count[i] = (word=word, count=count + 1)
            end
        end
    end
end
If we use @code_warntype again, the red parts have disappeared.
There are a few other easy improvements:
Use the @inbounds macro when looping over arrays: https://docs.julialang.org/en/v1/devdocs/boundscheck/#Eliding-bounds-checks
Move p[:text] into the outer loop
and your function then becomes:
function loop_poems(poems, remove_count)
    @inbounds for p in poems
        text = p[:text]
        for i in eachindex(remove_count)
            word = remove_count[i].word
            count = remove_count[i].count
            if occursin(word, text)
                # NamedTuple is immutable, so we need to create a new one
                remove_count[i] = (word=word, count=count + 1)
            end
        end
    end
end
It might also make sense to convert poems into a vector of NamedTuple.
Ultimately, if you still need more performance, you should look at your domain and at more sophisticated string algorithms:
Can your words contain any white spaces? If not, maybe split the poems into tokens.
If you have a lot of words, your words might share a lot of prefixes - in such a case a trie might be of help: https://en.wikipedia.org/wiki/Trie
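The tokenization idea can be sketched as follows. This is only an illustration of the approach, written in Python rather than Julia, and the function name count_poems_containing is made up for the example. Note that matching whole tokens is not equivalent to occursin: "the" no longer matches inside "other", and punctuation attached to a word would need to be stripped in real data.

```python
from collections import Counter

def count_poems_containing(poems, words):
    """Count, for each word, how many poems contain it as a whole token."""
    wanted = set(words)
    counts = Counter({w: 0 for w in words})
    for poem in poems:
        # One pass to tokenize, then a set intersection instead of
        # one substring search per word.
        tokens = set(poem.split())
        for w in wanted & tokens:
            counts[w] += 1
    return counts

poems = ["the cat and the dog", "and so on", "other"]
counts = count_poems_containing(poems, ["the", "and"])
print(dict(counts))
```

The inner work per poem then scales with the number of distinct tokens rather than with the number of search words.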

You want function loop_poems(poems, remove_count), not function loop_poems(poem, remove_count). Your code as written is accessing poems as a global variable.

How to pass FsCheck Test Correctly

let list p = if List.contains " " p || List.contains null p then false else true
I have such a function to check if the list is well formatted or not. The list shouldn't have an empty string and nulls. I don't get what I am missing since Check.Verbose list returns falsifiable output.
How should I approach the problem?

I think you don't quite understand FsCheck yet. When you do Check.Verbose someFunction, FsCheck generates a bunch of random input for your function, and fails if the function ever returns false. The idea is that the function you pass to Check.Verbose should be a property that will always be true no matter what the input is. For example, if you reverse a list twice then it should return the original list no matter what the original list was. This property is usually expressed as follows:
let revTwiceIsSameList (lst : int list) =
    List.rev (List.rev lst) = lst
Check.Verbose revTwiceIsSameList // This will pass
Your function, on the other hand, is a good, useful function that checks whether a list is well-formed in your data model... but it's not a property in the sense that FsCheck uses the term (that is, a function that should always return true no matter what the input is). To make an FsCheck-style property, you want to write a function that looks generally like:
let verifyMyFunc (input : string list) =
    if (input is well-formed) then // TODO: Figure out how to check that
        myFunc input = true
    else
        myFunc input = false
Check.Verbose verifyMyFunc
(Note that I've named your function myFunc instead of list, because as a general rule, you should never name a function list. The name list is a data type (e.g., string list or int list), and if you name a function list, you'll just confuse yourself later on when the same name has two different meanings.)
Now, the problem here is: how do you write the "input is well-formed" part of my verifyMyFunc example? You can't just use your function to check it, because that would be testing your function against itself, which is not a useful test. (The test would essentially become "myFunc input = myFunc input", which would always return true even if your function had a bug in it — unless your function returned random input, of course). So you'd have to write another function to check if the input is well-formed, and here the problem is that the function you've written is the best, most correct way to check for well-formed input. If you wrote another function to check, it would boil down to not (List.contains "" || List.contains null) in the end, and again, you'd be essentially checking your function against itself.
In this specific case, I don't think FsCheck is the right tool for the job, because your function is so simple. Is this a homework assignment, where your instructor is requiring you to use FsCheck? Or are you trying to learn FsCheck on your own, and using this exercise to teach yourself FsCheck? If it's the former, then I'd suggest pointing your instructor to this question and see what he says about my answer. If it's the latter, then I'd suggest finding some slightly more complicated function to use to learn FsCheck. A useful function here would be one where you can find some property that should always be true, like in the List.rev example (reversing a list twice should restore the original list, so that's a useful property to test with). Or if you're having trouble finding an always-true property, at least find a function that you can implement in at least two different ways, so that you can use FsCheck to check that both implementations return the same result for any given input.

Adding to @rmunn's excellent answer:
If you wanted to test myFunc (yes, I also renamed your list function), you could do it by creating some fixed cases that you already know the answer to, like:
let myFunc p = if List.contains " " p || List.contains null p then false else true
let tests =
    testList "myFunc" [
        testCase "empty list" <| fun()-> "empty" |> Expect.isTrue (myFunc [ ])
        testCase "nonempty list" <| fun()-> "hi" |> Expect.isTrue (myFunc [ "hi" ])
        testCase "null case" <| fun()-> "null" |> Expect.isFalse (myFunc [ null ])
        testCase "empty string" <| fun()-> "\"\"" |> Expect.isFalse (myFunc [ "" ])
    ]
Tests.runTests config tests
Here I am using a testing library called Expecto.
If you run this you would see one of the tests fails:
Failed! myFunc/empty string:
"". Actual value was true but had expected it to be false.
because your original function has a bug; it checks for space " " instead of empty string "".
After you fix it all tests pass:
4 tests run in 00:00:00.0105346 for myFunc – 4 passed, 0 ignored, 0
failed, 0 errored. Success!
At this point you checked only 4 simple and obvious cases with zero or one element each. Many times functions fail when fed more complex data. The problem is how many more test cases can you add? The possibilities are literally infinite!
FsCheck
This is where FsCheck can help you. With FsCheck you can check for properties (or rules) that should always be true. It takes a little bit of creativity to think of good ones to test for and granted, sometimes it is not easy.
In your case we can test for concatenation. The rule would be like this:
If two lists are concatenated, the result of myFunc applied to the concatenation should be true if both lists are well formed and false if either of them is malformed.
You can express that as a function this way:
let myFuncConcatenation l1 l2 = myFunc (l1 @ l2) = (myFunc l1 && myFunc l2)
l1 @ l2 is the concatenation of both lists.
Now if you call FsCheck:
Check.Verbose myFuncConcatenation
It tries 100 different combinations, trying to make it fail, but in the end it gives you the Ok:
0:
["X"]
["^"; ""]
1:
["C"; ""; "M"]
[]
2:
[""; ""; ""]
[""; null; ""; ""]
3:
...
Ok, passed 100 tests.
This does not necessarily mean your function is correct, there still could be a failing combination that FsCheck did not try or it could be wrong in a different way. But it is a pretty good indication that it is correct in terms of the concatenation property.
Testing for the concatenation property with FsCheck actually allowed us to call myFunc 300 times with different values and check that it did not crash or return an unexpected value.
FsCheck does not replace case by case testing, it complements it:
Notice that if you had run Check.Verbose myFuncConcatenation over the original function, which had a bug, it would still pass. The reason is that the bug was independent of the concatenation property. This means that you should always keep the case-by-case tests for the most important cases, and you can complement them with FsCheck to test other situations.
Here are other properties you can check, these test the two false conditions independently:
let myFuncHasNulls l = if List.contains null l then myFunc l = false else true
let myFuncHasEmpty l = if List.contains "" l then myFunc l = false else true
Check.Quick myFuncHasNulls
Check.Quick myFuncHasEmpty
// Ok, passed 100 tests.
// Ok, passed 100 tests.
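For readers who want to see the mechanics without FsCheck, the concatenation property can be imitated with a hand-rolled random generator. This is a rough Python sketch using only the standard library, not FsCheck's API; is_well_formed, random_list, and the value pool are assumptions made for the example, with None standing in for null.

```python
import random

def is_well_formed(items):
    """Python stand-in for myFunc: no empty strings and no None entries."""
    return "" not in items and None not in items

def concatenation_property(l1, l2):
    # The concatenation is well formed exactly when both parts are.
    return is_well_formed(l1 + l2) == (is_well_formed(l1) and is_well_formed(l2))

def random_list(rng):
    # Small pool that deliberately includes the "bad" values.
    pool = ["a", "hi", " ", "", None]
    return [rng.choice(pool) for _ in range(rng.randint(0, 5))]

rng = random.Random(0)  # fixed seed so the run is repeatable
results = [
    concatenation_property(random_list(rng), random_list(rng))
    for _ in range(100)
]
print("passed" if all(results) else "falsified")
```

A real property-testing library adds shrinking of failing inputs and much better generators, which is why the sketch is only meant to show the idea.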

Indentation Error in Python 3.6.1 def

I had been having issues with python 3.7 for quite some time about very pointless indentations, so I decided to get back to 3.6, specifically repl.it Python 3.6.1, and as I mentioned, the errors are for no good reason whatsoever as far as I can tell, the code is as written below:
from random import randint
import functools
printf = functools.partial(print, end=" ")
defNuc = ['C','A','T','G']
def opNuc():
def create():
nuc = [0]
nucop = [0]
length = randint(11,16)
print (length - 1)
for i in range(1,length):
part = randint(1,4)
for a in range(1,4)
if part == a:
nuc = defNuc[a]
nucOp = defNuc[-a]
if i != length - 1:
printf(nuc[i],i,"-")
else:
print(nuc[i],i)
for i in range (1,length):
if i != length - 1:
printf(nucOp[i],"-")
else:
print(nucop[i])
The error is at line 9, at
def create():
and as for the reason of error, it just says
expected an indented block
Edit:
This was completely my stupidity, don't take the post seriously, will be deleted in 10 minutes.

You never finished the definition of opNuc, so the parser is expecting an indented line to continue the body of that function. Either add a pass statement to provide a trivial body:
def opNuc():
    pass
or indent the definition of create if that is supposed to be local to the body of opNuc (unlikely, but possible):
def opNuc():
    def create():
        ...
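The difference can be checked without running the whole script by handing the source to the built-in compile function. This is a minimal sketch; the two source strings and the helper name compiles are made up for the example, and the exact wording of the error message varies between Python versions.

```python
# A function header followed by an unindented line: the body is missing.
broken = "def opNuc():\ndef create():\n    pass\n"

# The same header with a trivial body supplied by pass.
fixed = "def opNuc():\n    pass\n\ndef create():\n    pass\n"

def compiles(source):
    """Return True if the source parses, False on an indentation error."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError as exc:
        print("IndentationError:", exc.msg)
        return False

broken_ok = compiles(broken)
fixed_ok = compiles(fixed)
print(broken_ok, fixed_ok)
```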

The problem is that your first function, opNuc, was never finished. I have made this simple mistake many times myself, and it is very easy to miss. It's easy to fix, though: just type pass inside the opNuc function and it should be fine. Hope I helped!

Find a specific tuple by key in an Erlang list (eJabberd HTTP Header)

I am just getting started with eJabberd and am writing a custom module with HTTP access.
I have the request going through, but am now trying to retrieve a custom header and that's where I'm having problems.
I've used the Request record to get the request_headers list and can see that it contains all of the headers I need (although the one I'm after is a binary string on both the key and value for some reason...) as follows:
[
{ 'Content-Length', <<"100">> },
{ <<"X-Custom-Header">>, <<"CustomValue">> },
{ 'Host', <<"127.0.0.1:5280">> },
{ 'Content-Type', <<"application/json">> },
{ 'User-Agent', <<"Fiddler">> }
]
This is also my first foray into functional programming, so from procedural perspective, I would loop through the list and check if the key is the one that I'm looking for and return the value.
To this end, I've created a function as:
find_header(HeaderKey, Headers) ->
lists:foreach(
fun(H) ->
if
H = {HeaderKey, Value} -> H;
true -> false
end
end,
Headers).
With this I get the error:
illegal guard expression
I'm not even sure I'm going about this the right way so am looking for some advice as to how to handle this sort of scenario in Erlang (and possibly in functional languages in general).
Thanks in advance for any help and advice!
PhilHalf

The List that you have mentioned is called a "Property list", which is an ordinary list containing entries in the form of either tuples, whose first elements are keys used for lookup and insertion or atoms, which work as shorthand for tuples {Atom, true}.
To get a value of key, you may do the following:
proplists:get_value(Key,List).
For example, to get the Content-Type:
7> List=[{'Content-Length',<<"100">>},
         {<<"X-Custom-Header">>,<<"CustomValue">>},
         {'Host',<<"127.0.0.1:5280">>},
         {'Content-Type',<<"application/json">>},
         {'User-Agent',<<"Fiddler">>}].
8> proplists:get_value('Content-Type',List).
<<"application/json">>

You can use the function lists:keyfind/3:
> {_, Value} = lists:keyfind('Content-Length', 1, Headers).
{'Content-Length',<<"100">>}
> Value.
<<"100">>
The 1 in the second argument tells the function what tuple element to compare. If, for example, you wanted to know what key corresponds to a value you already know, you'd use 2 instead:
> {Key, _} = lists:keyfind(<<"100">>, 2, Headers).
{'Content-Length',<<"100">>}
> Key.
'Content-Length'
As for how to implement this in Erlang, you'd write a recursive function.
Imagine that you're looking at the first element of the list, trying to figure out if this is the entry you're looking for. There are three possibilities:
The list is empty, so there is nothing to compare.
The first entry matches. Return it and ignore the rest of the list.
The first entry doesn't match. Therefore, the result of looking for this key in this list is the same as the result of looking for it in the remaining elements: we recurse.
find_header(_HeaderKey, []) ->
    not_found;
find_header(HeaderKey, [{HeaderKey, Value} | _Rest]) ->
    {ok, Value};
find_header(HeaderKey, [{_Key, _Value} | Rest]) ->
    find_header(HeaderKey, Rest).
Hope this helps.

pyparsing for querying a database of chemical elements

I would like to parse a query for a database of chemical elements.
The database is stored in a xml file. Parsing that file produces a nested dictionary that is stored in a singleton object that inherit from collections.OrderedDict.
Asking for an element will give me an ordered dictionary of its corresponding properties
(i.e. ELEMENTS['C'] --> {'name':'carbon','neutron' : 0,'proton':6, ...}).
Conversely, asking for a property will give me an ordered dictionary of its values for all the elements (i.e. ELEMENTS['proton'] --> {'H' : 1, 'He' : 2} ...).
A typical query could be:
mass > 10 or (nucleon < 20 and atomic_radius < 5)
where each 'subquery' (i.e. mass > 10) will return the set of elements that matches it.
Then, the query will be converted and transformed internally to a string that will be evaluated further to produce a set of the indexes of the elements that matched it. In that context the operators and/or are not boolean operator but rather ensemble operator that acts upon python sets.
I recently sent a post for building such a query. Thanks to the useful answers I got, I think that I did more or less the job (I hope on a nice way !) but I still have some questions related to pyparsing.
Here is my code:
import numpy
from pyparsing import *
# This imports a singleton object storing the database dictionary as
# described earlier
from ElementsDatabase import ELEMENTS
and_operator = oneOf(['and','&'], caseless=True)
or_operator = oneOf(['or' ,'|'], caseless=True)
# ELEMENTS.properties is a property getter that returns the list of
# registered properties in the database
props = oneOf(ELEMENTS.properties, caseless=True)
# A property keyword can be quoted or not.
props = Suppress('"') + props + Suppress('"') | props
# When parsed, it must be replaced by the following expression that
# will be eval later.
props.setParseAction(lambda t : "numpy.array(ELEMENTS['%s'].values())" % t[0].lower())
quote = QuotedString('"')
integer = Regex(r'[+-]?\d+').setParseAction(lambda t:int(t[0]))
float_ = Regex(r'[+-]?(\d+(\.\d*)?)?([eE][+-]?\d+)?').setParseAction(lambda t:float(t[0]))
comparison_operator = oneOf(['==','!=','>','>=','<', '<='])
comparison_expr = props + comparison_operator + (quote | float_ | integer)
comparison_expr.setParseAction(lambda t : "set(numpy.where(%s)%s%s)" % tuple(t))
grammar = Combine(operatorPrecedence(comparison_expr, [(and_operator, 2, opAssoc.LEFT), (or_operator, 2, opAssoc.LEFT)]))
# A test query
res = grammar.parseString('"mass " > 30 or (nucleon == 1)',parseAll=True)
print eval(' '.join(res._asStringList()))
My questions are the following:
1 using 'transformString' instead of 'parseString' never triggers any
exception even when the string to be parsed does not match the grammar.
However, it is exactly the functionality I need. Is there a way to do so?
2 I would like to reintroduce white spaces between my tokens so
that my eval does not fail. The only way I found to do so is the one
implemented above. Would you see a better way using pyparsing?
Sorry for the long post, but I wanted to describe the context in detail. BTW, if you find this approach bad, do not hesitate to tell me!
Thank you very much for your help.
Eric

Do not worry about my concern, I found a workaround. I used the SimpleBool.py example shipped with pyparsing (thanks for the hint, Paul).
Basically, I used the following approach:
1 for each subquery (i.e. mass > 10), using the setParseAction method,
I attached a function that returns the set of elements that matched
the subquery
2 then, I attached the following functions for each logical operator (and,
or and not):
def not_operator(token):
    _, s = token[0]
    # ELEMENTS is the singleton described in my original post
    return set(ELEMENTS.keys()).difference(s)

def and_operator(token):
    s1, _, s2 = token[0]
    # Set intersection; the keyword "and" would only short-circuit on truthiness
    return s1 & s2

def or_operator(token):
    s1, _, s2 = token[0]
    # Set union; the keyword "or" would only short-circuit on truthiness
    return s1 | s2

# Thanks to Paul for the hint.
grammar = operatorPrecedence(comparison_expr,
                             [(not_token, 1, opAssoc.RIGHT, not_operator),
                              (and_token, 2, opAssoc.LEFT, and_operator),
                              (or_token, 2, opAssoc.LEFT, or_operator)])
Please note that these operators act upon Python sets rather than
on booleans.
And that does the job.
I hope that this approach will help anyone of you.
Eric
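The note about sets can be made concrete with a short sketch. On Python sets, the keywords and/or only test truthiness and return one of their operands, whereas & and | compute the intersection and the union. The element symbols below are arbitrary example data, not results from the ELEMENTS database.

```python
heavy = {"C", "N", "O"}   # e.g. elements matching one subquery
small = {"H", "C"}        # e.g. elements matching another subquery

both = heavy & small      # intersection: elements matching both subqueries
either = heavy | small    # union: elements matching at least one subquery

# The keyword form does not combine the sets: because heavy is non-empty
# (truthy), "heavy and small" simply evaluates to small.
short_circuit = heavy and small

print(both == {"C"}, short_circuit is small)
```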
