Why Do Python Lists Multiply Oddly? Exploring the CPython Source Code (codeconfessions.substack.com)
51 points by thunderbong 6 months ago | hide | past | favorite | 99 comments



I like the deep dive into the source, but big picture, I think if you understand the general order of evaluation which applies to most languages, this isn't all that unintuitive. The arguments going into the outer [ ] get evaluated before the outer [ ], meaning the inner [] gets evaluated first. Once that happens the inner list is created, and there's no code left that even visually could create more inner lists.
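
A minimal REPL sketch of that evaluation order: the inner [] runs once, producing a single list object, and the * 4 repeats the reference to it.

```python
# The inner [] is evaluated once; * 4 then repeats the reference,
# not the list itself.
rows = [[]] * 4
rows[0].append(1)
print(rows)                # [[1], [1], [1], [1]] -- all four entries change
print(rows[0] is rows[3])  # True -- same object behind every entry
```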

The more unintuitive example of this to me is in function defaults, because those feel more like part of the function signature--i.e. I'd expect them to be assigned at function call time. But apparently, function defaults get called once when the function is defined.

    >>> def grow(items=[0]):
    ...    items.append(items[-1] + 1)
    ...    return items
    ... 
    >>> grow()
    [0, 1]
    >>> grow()
    [0, 1, 2]
    >>> grow()
    [0, 1, 2, 3]


You can whitewash this away by discussing order of evaluation, but it still is a great example of how Python continually mixes up concepts of expressions and statements and immutable and mutable data. In real life, if I ask someone to go get me four empty buckets, I expect them to come back with four distinct empty buckets. The intuitive interpretation of [[]]*4 is asking just that. Instead Python creates a bucket and then gives you a piece of paper that lists its address four times. Why does one want that, by default?

In more sane languages, the expression [] means give me an empty list value, nothing more and nothing less. So then whatever the syntax for [[]]*4 is in such sane languages, it gives you four empty list values. If you want something more imperative, you can get it, but in most cases, you don't want or even need such things.

Python makes no distinction, at a level able to be visualized by the user, between immutable and mutable data. There are a lot of bizarre situations in Python that will bite you like this. And in most cases, it's strange that such behaviors were chosen as the default.


> In more sane languages, the expression [] means give me an empty list value, nothing more and nothing less. So then whatever the syntax for [[]]*4 is in such sane languages, it gives you four empty list values.

I don't buy this. Either the syntax to give you four empty list values is some kind of comprehension, which Python has and does exactly what you like, or it must have lazy evaluation somehow. Otherwise, how can the language know you don't want to copy the existing reference but create a new empty list? If [[]]*4 were to create new list on the fly, then `[print('hello')]*4` would have to print hello 4 times, i.e., the print statement is lazily executed, which does not fit into Python at all.
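
A small sketch of the eager-evaluation point: the element expression runs once, before the repetition, whereas a comprehension re-evaluates it per element.

```python
# Eager evaluation: the element expression runs exactly once.
result = [print('hello')] * 4   # prints 'hello' a single time
print(result)                   # [None, None, None, None]

# Re-evaluating per element requires a comprehension:
fresh = [[] for _ in range(4)]
fresh[0].append(1)
print(fresh)                    # [[1], [], [], []] -- four distinct lists
```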


My question is: what's a real-world example where this [...]*n syntax buys you something that can't be done simply otherwise and that is worth the complexity tradeoff of people often using it subtly wrong?


> that is worth the complexity tradeoff of people often using it subtly wrong?

I don't think that makes the complexity tradeoff worth it. This is the kind of tradeoff that makes the language implementation, reference and semantics more and more complex. Unfortunately Python has a lot of such complexities already. We don't need more of it. Such complexity hurts the creation of competing implementations.

IMHO programming language specifications and implementations should be simple and consistent so that competing implementations can be developed easily.

> people often using it subtly wrong?

I think that... you know... before writing serious software, these people need to roll up their sleeves and just learn the damn language. Just learn what references are and how they work in Python. It's not that hard. It is most definitely easier than adding a spaghetti monster to the compiler to make it bend to the will of programmers who can't be bothered to read the tutorial. I mean this gotcha about references is taught in the 4th chapter of the tutorial. Can't developers even bother to RTFM these days before developing serious software with a programming language?


> these people need to roll up their sleeves and just learn the damn language

You really have no choice but to do that. But the critique here is that some languages make this hard. And some languages, like Python, appear deceptively simple and consistent, when they are anything but. And as I pointed out, these decisions were not really required or designed to solve certain problems. They just kind of came about in Python's development, one historically and intentionally ignorant of pre-existing languages and their good ideas.

When everybody just says "learn the language" or "there's list comprehensions" or "there's for loops", then why does the [...]*n syntax exist? What problems is it solving that require the confusion that it generates?


> What problems is it solving that require the confusion that it generates?

I think I answered that already. It keeps the language spec consistent and simpler.

Imagine the complexity you have to add to the language spec to say that when we write [] we deal with the reference to this list except in the multiplication syntax a = [[]] * 5 where the inner [] is not a reference to the list but the list value! Such special case will make the language both inconsistent and harder to understand for experienced programmers.


I'm asking what makes introducing the multiplication syntax worth it in the first place.


Ah! Misunderstood your original comment. My apologies! Yes, I am with you when you question the usefulness of the multiplication syntax.

I prefer simplicity and consistency in a programming language grammar, syntax and semantics as I advocated above. So yes I'd be happy to lose that multiplication syntax. It is not worth it.


It's syntax sugar. I use it all the time, but of course it could always be replaced by, I don't know, a for loop. It also makes writing duck-typed code easier, as x*3 will work for strings, bytes and lists.

Overall it's a very useful feature and I don't think it's that surprising. It's pretty obvious [x]*4 will reuse the same reference. Just like manually adding "x" four times will.
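
That equivalence is easy to check: [x]*4 and writing x out four times by hand produce the same four references.

```python
x = []
a = [x] * 4
b = [x, x, x, x]
# Both spellings produce four references to the same object.
assert all(item is x for item in a)
assert all(item is x for item in b)
assert a == b
```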


This feels like a legitimate critique to me. I'm developing a programming language myself and I'm not going to include a syntax like this--multiplying an array by a number feels like a type error to me.


Totally agreed @ just not implementing this part.

I think the motivation was being able to do scalar multiplication with a vector, but it gets ugly when operating on anything but numbers imo


I wouldn't recommend just throwing the idea away, it just needs to be approached with care. You can do it in a type safe way.

Research Rank Polymorphism and Array Languages:

https://en.m.wikipedia.org/wiki/Array_programming

https://arxiv.org/abs/1907.00509


I mean, is there anything there that couldn't just be a library function? Not everything has to be an operator.


Philosophically, I don't know what "must" or "has" to be an operator.

I certainly don't want to tell you how to design your language.

I'm just trying to say, don't rule the pattern out because it's perceived as a type error; you may have other reasons that are good, but my citations provide a basis of how to make the typing work (if you were inclined).

If you have other reasons, even if it's just aesthetic, that's fine.


It's useful for math when doing scalar multiplication but imo it would make more sense to use a tuple instead of list


Languages with immutable data structures are relatively rare. With its high level focus python could have been one of them, but went for different trade offs.

> In more sane languages, the expression [] means give me an empty list value, nothing more and nothing less

What is a “list value”? It’s either a pointer to some other space where the list's values are stored, or it’s the values stored themselves. If you say it’s the second (I can’t think of any language that does), then `a = […]; b = a` is an expensive copy by value assignment. Is that worth that trade off to get less surprising results in the articles case? I’m not saying the answer is “obviously no”, but it definitely isn’t “obviously yes”.


> If you say it’s the second (I can’t think of any language that does), then `a = […]; b = a` is an expensive copy by value assignment

In Swift collections are value types, so b = a conceptually is a copy. It’s not expensive, though, because Swift does copy-on-write.

You pay for that by a combination of having more expensive collection updates (they, at least conceptually, have to check whether to make that copy) and a more complex compiler (it tries to avoid having to actually do those checks at runtime)

  var a: [Int]=[]
  var b = [a,a,a,a]
  b[0].append(1)
  print(b)
will produce

  [[1],[],[],[]]


Just teaching myself Swift this week. Thanks, this is helpful!


Matlab is one language that chooses the second. Every assignment a=b where b is a matrix creates a distinct copy of the matrix.

The interpreter is slightly more clever and instead implements this as a copy-on-write mechanism, but that's just an implementation detail.


I really wonder what "more sane" languages you mean. Because a single [] does give you an empty list value in Python: one list, and not four different empty lists. Why would it give you four? That'd need to somehow bring in the surrounding context, and in "sane" languages, (closed) expressions generally don't change meaning because the context around them changes.

Actually, if you really want to have re-evaluating (sub)expressions, then Python does give you an option: comprehensions.


F#:

    List.init 4 (fun _ -> [])

    [for _ in 1..4 -> []]
Or any sane language that uses immutable values by default like Elixir, Erlang, Clojure, Racket, Scheme, OCaml, etc.

> Actually, if you really want to have re-evaluting (sub)expressions, then Python does give you an option: comprehensions.

That's the complaint, that [[]]*4 is not shorthand for list comprehensions and that it's something entirely different. This is common in Python where two things you'd expect to be the same are indeed different in subtle ways.

List comprehensions in Python are closer to sane, but they still have unexpected behavior because of the way scoping works in Python.


> That's the complaint, that [[]]*4 is not shorthand for list comprehensions

There are lots of reasons you would want items in a list repeated rather than having, essentially, the result of separate executions of the same constructor -- [] is just shorthand for list() -- so it's good that [x()]*4 is not shorthand for [x() for _ in range(4)]. Especially since that would mean it would have to be special-case syntax for list construction, and not a list operator that you could use on existing lists. The mildly annoying edge case you need to learn about isn't worth wiping out the whole broadly useful language feature, or creating a glaring special-case inconsistency that would be more cognitive load to deal with.


Why are you comparing functions/closures with values? You can do the same in python:

  [list() for _ in range(4)]


> Why are you comparing functions/closures with values?

I'm not. The fun _ -> [] is needed because the argument is the element's index. That's just what List.init happens to expect, a function that takes an index and initializes the element, but the index is not needed here.


(I can’t respond to your other comment)

Ah I see, you’re saying that the construction of a list in F# mandates a lambda, thereby reducing one possible point of confusion. I could agree with that. But then it comes at a cost of comfort: [0]*4 has less boilerplate than the F# version. So it comes down to trade-offs again.


Are you saying f# special-cases something that looks like a closure, but isn’t in this one context? Because that, to me, seems like a much more grievous issue than the one at hand.


No. F# lambdas are closures. I was just saying that I'm not using the concept of closures to explain here. It's just a side effect of using the List.init function as a demonstration.


>That's the complaint, that [[]]*4 is not shorthand for list comprehensions and that it's something entirely different.

Well, of course. It's a shorthand for duplicating the list elements four times. In this case, the list contains a single reference to an empty list, so the result contains four references to an empty list.


And it's literally impossible for [[]]*4 to be a shorthand for a list comprehension, because nothing in Python can be a shorthand for a list comprehension. Instead, it is but a method call.


I mean, there is no reason Python couldn't special case something that looked like an operator call into shorthand for a comprehension instead of shorthand for a method call, other than the fact that it would make the parser more complex and make it harder to understand the language.

It would be bad for this case, though. As confusing as the particular case of:

  [[]] * 4
might be to people who haven't learned how it works, list multiplication doesn't just apply to single element lists, and is useful and usefully distinct from what people are suggesting the behavior should be for the operation based on this one case.
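
For illustration, a few ordinary uses of sequence repetition where the shared-reference behavior is harmless because the elements are immutable:

```python
zeros = [0] * 8          # preallocated numeric buffer
pattern = [1, 2] * 3     # repeats all elements: [1, 2, 1, 2, 1, 2]
ruler = '-' * 20         # the same operator works on strings
padding = b'\x00' * 4    # ...and on bytes
```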


Well, Python is not Raku, so while it could be possible to make "[expr1, expr2, ..., exprN] * exprM" to mean "[*(expr1), *(expr2), ..., *(exprN) for _ in range(exprM)]", it would never happen.

Besides, that still doesn't help with the case of

    x = function_that_returns_a_list()
    x = x * 4
where there is no literal list expression.


> I expect them to come back with four distinct empty buckets. The intuitive interpretation of [[]]*4 is asking just that.

I think the small example is misguiding the intuition here.

If I write `my_list = f(expensive operation); long_list = my_list * 4`, I don't expect python to recreate the whole thing four times. I reasonably assume `f` will be evaluated once, and then copied.


That's a different statement and expected, but in a language like Python, it isn't always clear whether my_list is a value or a reference. And because of that, it leads to unexpected behavior when an intuitively expected value is actually a reference.


I think you're confused about the difference between [x]*4 and [x for _ in range(4)]. It's not, at any point, a difference between something being a value or reference.

The first code is equivalent to:

list_multiply([x], 4)

(the function is actually called list.__mul__, but that's an unimportant detail). The latter is a syntax sugar for:

    out = []
    for _ in range(4):
        out.append(x)

(actually a bit more complex, but close enough).

When you understand it, the behavior is obvious in both cases.


> That's a different statement and expected, but in a language like Python, it isn't always clear whether my_list is a value or a reference

It is always clear, and it is 100% always a reference, with a set of cases where that reference is to a singleton or immutable object, and a subset of those cases where, in particular Python implementations, that singleton may effectively be implemented as a special case value.

If you treat it as a reference, you will never, ever be wrong.
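
A quick sketch of that mental model: names are always references, and immutability just means the object behind the reference can never change out from under you.

```python
a = [1, 2]
b = a            # b names the same list object
b.append(3)
print(a)         # [1, 2, 3] -- visible through either name
print(a is b)    # True

n = 5
m = n            # also a reference, but int is immutable,
m += 1           # so += rebinds m to a new object
print(n, m)      # 5 6
```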


And again, that decision is the more complex one. Inverting that such that everything is immutable by default and mutable or a reference by choice is the more regular, consistent choice that removes a fairly big swath of common programming problems.

The critique here is that it's a badly designed language in that its design decisions often get in the way of writing software that works as you expect. We're not discussing whether you can learn such a system. The discussion is a lament on why it had to be this way.


> And again, that decision is the more complex one. Inverting that such that everything is immutable by default and mutable or a reference by choice is the more regular, consistent choice that removes a fairly big swath of common programming problems.

Maybe. But that's largely irrelevant since we are discussing the impact of a syntax choice regarding an operator on an explicitly-mutable container type, and specifically the behavior of it used when the container also contains a member of the same type.

And the alleged surprising thing is not the mutability of anything, but that after the operation, all the members of the outer container are the same instance of the inner mutable container type.

Python being designed ground up with immutability by default and explicitly opt-in mutability would not directly have any bearing on this issue, as long as it still had the capacity to express immutable lists and the capacity to apply something like the list-reptition operation to them. Except it would probably make all of the involved examples more verbose.


Yup, same as Java if you only use boxed primitives


> in a language like Python, it isn't always clear whether my_list is a value or a reference.

I can't think of anything that's a value? At least since Java, the idea of "everything is an object" seems pretty dominant in most languages.

There are a few immutable types, like ints, strings and tuples, but you still refer to them by reference.


As a pure implementation detail, IIRC, I think small integers and maybe True/False/None are values in CPython but logically its all references.


Lists in Python are always references.

There’s a separate type called tuple specifically designed to represent lists-as-values.

If you do [()]*4 then you get exactly what you want.


> There’s a separate type called tuple specifically designed to represent lists-as-values.

tuples are also references, but to an immutable object.

> If you do [()]*4 then you get exactly what you want.

I think what is actually wanted here is four different mutable lists rather than the same mutable list four times, something like what you would get from:

  [[] for _ in range(4)]


Agreed. Python is aimed at developers, not language enthusiasts, and you should be able to intuit what something is doing without having to memorize dozens of obscure rules and edge cases. The fewer surprises a language has, the better the design.


> In real life, if I ask someone to go get me four empty buckets, I expect them to come back with four distinct empty buckets.

That's not what you're asking for, though. You're asking for one bucket, and then you're putting it in another bucket, and then you're duplicating the contents of the second bucket four times.

This isn't whitewashing, this is just what those expressions mean.

It's not like it's hard to get four buckets in Python, BTW:

    [[] for _ in range(4)]


> immutable and mutable data

I don't get how this is an example of python mixing up mutable/immutable data, and I'm unaware of a situation where python does. Conceptually I'm not even sure what you mean by a language 'mixing up mutable and immutable data' unless it's literally just a buggy/non hardened implementation.

    A = (1,)
    A[0] = 2  # this raises an exception


Languages with immutable data structures are relatively rare. With its high level focus python could have been one of them, but went for different trade offs.

> In more sane languages, the expression [] means give me an empty list value, nothing more and nothing less

What is an “empty list value”?


Python has immutable data structures, lists just aren't one of them. (Tuples are. So are, unlike in some languages, strings.)


In that case what would you expect the following to return?

    a = []
    b = [a]*4
    a.append(1)
    print(b)
I mean it is possible to change Python and make [...]*n a shorthand for "repeat this statement n times" but that would be odd on its own.

Lists pass by reference, tuples pass by value, if you want to understand Python you're just going to have to get used to that distinction. There are other options, but most are worse.


> Lists pass by reference, tuples pass by value, if you want to understand Python you're just going to have to get used to that distinction.

Is that true? I haven't looked at the implementation so I can't say it's impossible, but it would be extremely surprising to me if tuples passed by value, because tuples longer than the size operated upon by individual processor instructions would become expensive to pass around.

From a correctness perspective the difference between pass-by-reference and pass-by-value (pass-by-copy) aren't usually important, but for performance it can be a very large difference.


I can't tell what the underlying implementation is, but either way you can't modify them without creating a new tuple so they behave as if they use value-semantics not reference-semantics. (this is what makes tuples safe as defaults in a method, and lists not)

If you compare their ids it does look like python makes multiple instances and passes them around by reference, but without a way to modify them in place it's hard to tell the difference, also Python tries to avoid making unnecessary copies of the same thing if possible.

You can try to test it a bit with something like the following:

    t = (1,2)
    def f(x):
        return x is t

    f(t), f((1,2)), (1,2) is (1,2) # -> True, False, True
but the results might be implementation dependent.


> I can't tell what the underlying implementation is, but either way you can't modify them without creating a new tuple so they behave as if they use value-semantics not reference-semantics. (this is what makes tuples safe as defaults in a method, and lists not)

Ahh, you were talking about the semantics, not the implementation. Understood.

Another way to think of this might be in terms of mutability, i.e. tuples are immutable. Pass-by-value versus pass-by-reference for immutable objects doesn't make much difference in the behavior of objects in a language like Python which doesn't have explicit pointers, but it's a big difference for performance.

> If you compare their ids it does look like python makes multiple instances and passes them around by reference

I didn't think of answering my own question this way. Nice!

> but the results might be implementation dependent.

It's possible since Python isn't standardized, but I strongly doubt you'll find a mature implementation that passes potentially-large, variable-sized objects like tuples around by value, for two reasons:

1. Once you get above the size operated on by processor instructions (64 bits on most modern desktop/laptop processors) pass by value requires a memcopy, which is orders of magnitude slower. Tuples can grow past the L1 or L2 cache size, which corresponds to another two orders of magnitude in performance loss. Compare this to passing around a 64-bit pointer, which is about as fast as passing around any other value.

2. Virtual machines usually implement their own stack. This is easiest to do as an array of same-sized values. Implementing a performant stack with variable-sized elements is a nightmare I wouldn't wish on my worst enemy.


> , if you want to understand Python you're just going to have to get used to that distinction

My argument is that that distinction isn't needed, by default, in a high-level language, and that plenty of people get it wrong, understandably. It is simply complexity that is not needed.

These simple examples can be toyed with, but sooner or later one is going to want to solve real problems, and unintuitive, default behavior doesn't help with that. Python's mode of operation tends to favor implicit over explicit. And the syntax [...]*n is implicit.


The distinction is needed, I think, unless you want to go the purely-functional route. I love that route, but just as many people will complain about purely functional languages being unintuitive.

If you want mutable values in your language, you need to expect values to mutate. So what should list multiplication do?

- There is the current implementation, where a change to the first entry of [f(x)] * 4 will be reflected in all entries of the list (because they're all references to the same value).

- It could only work on list literals. So [f(x)] * 4 would call f four times and collect the values in a list. Then, a = [f(x)]; a * 4 would not work, which would be surprising.

- It could work differently on list literals and lists. So [f(x)] * 4 would call f four times collecting the results, but a = [f(x)]; a * 4 would just return a list of four references to the one result of f(x).

- It could copy the value in the list. How would you do that if the list contained values like file handles or network connections that can't be copied?

Within the context of Python (an imperative language with mutable values), the current implementation is the most sensible one. If you work in such a language, you must always exercise some care when working with mutable values. Learn your tool!

Python could have been designed with different decisions. But who knows if we would even be talking about it today in that case?


> If you want mutable values in your language, you need to expect values to mutate. So what should list multiplication do?

bitmc can correct me if I'm misinterpreting them, but I feel like what they're saying is that multiplying a list by a number doesn't make sense, so it should throw a type error. That feels reasonable to me (and it's the choice I'm making in the interpreter I'm writing).


It's really not that complex, but it does have to be learned.

Everything in Python is an object with an id. Assignment and parameter passing is by object id.

There are some mutable objects ("objects", lists, dicts) and some immutable objects (numbers, strings, tuples).

The []*n syntax is treated as assignment, hence the id is duplicated. This isn't an issue with immutable objects, since there's no way to change the underlying object -- you can only change the id held at each position in the list. Obviously there are some gotchas with mutable objects here, where the succinct syntax can be confusing. But I think the simplicity of everything being an id has deep intuitive value and is one reason Python has been so successful, even if many users never really think much about ids.
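
The id-duplication can be observed directly: repetition yields one id repeated, a comprehension yields distinct ids.

```python
shared = [[]] * 3
print(len({id(item) for item in shared}))    # 1: one object, three references

distinct = [[] for _ in range(3)]
print(len({id(item) for item in distinct}))  # 3: three separate objects
```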


Well, those things are called "default argument values", not "default argument expressions", you know. It's a value that's reused for every function call, not an expression that re-evaluated at every function call.


Sure there is a reason it is that way. But it doesn't mean it's not incredibly unintuitive and error-prone.


Yes, the reason is that it's a tradition to have it this way. Python IMHO has decent enough assignment semantics that using a "default argument expressions" model would've been fine, unlike e.g. in C++.


> Well, those things are called "default argument values", not "default argument expressions", you know.

No, I didn't know that, and I'd guess that the vast majority of Python developers don't, because people don't generally learn Python by reading the docs front to back and finding out the official names of everything.

> It's a value that's reused for every function call, not an expression that re-evaluated at every function call.

No one is confused about that. I'm aware that's how it is, but is that how it should be? I.e. if we could choose for this to be reused for every function call or re-evaluated every function call, which results in more intuitive behavior?


I think calling it a default argument object would be much more clear. "Value" can still refer to an abstract, immutable concept, even the python docs use it like that at some places.


if you compare

> a = [0]

and

> b = [0]

these two expressions produce different 'value' objects for `[0]`, and modifying `a.append(a[-1] + 1)` does not change b.

so it's unintuitive that using `[0]` as a default value in a function argument can cause that argument to retain the same value across multiple function calls, when programmers would assume the argument to a function is not the same across multiple invocations.

The issue really here is the usage of an array object, which is not a value object (immutable), but a reference object. It would not have been a problem if the default argument was a string, or a number.

This is a reason not to use arrays in default arguments.


Very technically speaking, in "def f(a=[0]): ..." or in "a = [[0]] * 4" there is only one textual occurrence of "[0]", and the function definition is literally evaluated to define a function (you can try "def f(a=sys.stdout.write('Hi!')): return" in the REPL to see when the default value gets constructed).

On the other hand, if we abstract from the technical minutiae, yes, it would have been entirely possible for Python to treat default arguments in the function definitions as expressions, by using machinery similar to what gets used for comprehensions: after all, as you point out, function's arguments (and local variables) do get re-assigned on every function call.


I've never understood people's confusion about function defaults.

They are part of function definition, they aren't in the function body. You should expect them to be "executed" when the function is defined.


Or they could have been part of the function call, and run when the function is called without that argument.


You want Python to be how much slower now?


This isn't about what I want, just noting that the other way around could sound just as intuitive. Anyway, it's awfully slow already, how much can it hurt ;-)


This is a terrible excuse. Quickly doing the wrong thing isn't optimization.


The language spec isn't wrong, you just don't like it.


The language spec isn't right, you just like it.

Sure, there are a lot of subjective aesthetics that go into the spec, but in this case, there are objective reasons for not liking this. It's a well-known footgun that causes bugs. And it's almost never what you want, so you end up doing something like this:

    def f(xs = None):
        # Are these two lines actually faster than the
        # interpreter creating defaults at call time?
        if xs is None:
            xs = []

        ...
Do you have any reasons at all for defending this decision?


>

   # Are these two lines actually faster than the
   # interpreter creating defaults at call time?
You're proposing that the interpreter add a check for every default parameter in every function signature; that it should optionally fire off arbitrary code for each and every one. And when you consider that high-performance Python involves writing C extensions, moving that check out of the compiled code and into the slow interpreted space is, yes, a major performance hit.


> You're proposing that the interpreter add a check for every default parameter in every function signature

No, that's not what I'm proposing. Why would it check anything? Just evaluate the given default expression at call time. If you don't want the overhead of an expression, don't put a default.

You can also do defaults like:

    LIST_OF_X = []

    def foo(xs = LIST_OF_X):
        ...
...if you want the other behavior. This does add a variable lookup (oh no!).


Your list there is mutable. Isn't that what you meant to solve with this?


Ugh. You're just ignoring my entire post except the one part where you (wrongly) think you can correct me?

The code I posted is showing how you can explicitly get the mutable behavior if you want it when the default expressions are evaluated at call time.

That is to say, if you evaluate default expressions at call time, you can get either behavior by being explicit with minimal loss in performance:

    def foo(xs = []):
        xs.append(1)
        print(xs) # always prints [1]

    DEFAULT = []
    def bar(xs = DEFAULT):
        xs.append(1)
        print(xs) # prints [1], [1,1], [1,1,1], etc.
To reiterate, the above code is what would happen if default expressions were evaluated at function call time.

You're clearly not as knowledgeable as you think you are on this and you're just cherry picking things you don't understand to feel smart.


Yes, default function argument values are another source of surprising behavior.

In this case, the default arguments get compiled as part of the function code and are part of the function's environment. As a result, every call to that function sees the same environment, i.e. the default argument values are not re-evaluated on every invocation. So mutations to the environment are visible across function calls.
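The "evaluated once, stored on the function" behavior can be observed directly: Python keeps the default values in the function's `__defaults__` attribute. A small illustrative sketch:

```python
def grow(items=[]):
    items.append(len(items))
    return items

# The default was evaluated once, at def time:
print(grow.__defaults__)   # ([],)
grow()
grow()
# Same list object, mutated across calls:
print(grow.__defaults__)   # ([0, 1],)
```

Since `__defaults__` holds the very list object that gets mutated, the "shared default" is visible without even calling the function a third time.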


> In this case, the default arguments get compiled as part of the function code and are part of the function's environment.

Yeah, I understand that; I'm saying that was a choice they made, and I think having the defaults be evaluated at each call time would have been a better choice.


The * operator could be defined to use a deepcopy operation, but isn’t, presumably for efficiency reasons. This has little to do with order of evaluation.

Alternatively, the issue is that Python does not distinguish between value and reference types.
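For comparison, the deep-copy behavior described above can be opted into explicitly with the stdlib `copy` module; this sketch contrasts it with the default reference-copy semantics:

```python
import copy

a = [[]] * 4                 # four references to a single inner list
a[0].append(1)
print(a)                     # [[1], [1], [1], [1]]

b = [copy.deepcopy(x) for x in [[]] * 4]   # force distinct copies
b[0].append(1)
print(b)                     # [[1], [], [], []]
```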


How is that counter-intuitive? Or at least what’s your general intuition then? If it was [x] with x being a complex object, would you expect it to be deep copied? Shallow copied? Copied via IRepeatable? What’s your intuition on a process of repeating a value that isn’t reference-copy? What should a[3] = a[2] = a[1] = a[0] or an equivalent chunk do, assuming trivial __setitem__()? Why that should be different from a * 4?

I’m not even a python guy, if you think I’m biased. These questions are natural for a programmer and a programmer has to ask themselves about what they expect, why, is it consistent, is it safe to assume.

Making [[]]*4 create three new lists implies adding a non-obvious, ephemeral re-evaluation point to your grammar. Should [[f()]]*4 call f() three more times?

These may be added for convenience. But you have to make sure that the syntactic context makes it dead clear, e.g. in a for-i iterator part, loop conditions, function argument initializers, etc.: locations which are known to be executed more than once. (Python specifically doesn't even do that for arguments. Every def f(x=[]) shares the same list as the default for x.)


Indeed at first glance it looks odd, but the alternative is that multiplying a sequence always does a deep copy. Actually what is "strange" is the behaviour in combination with the syntax of a nested list. The behaviour itself is fine. Replace it with

class X: pass

y = [X()] * 5

and suddenly I think the behaviour looks fairly reasonable.
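The shared-reference behavior in that example can be checked with identity tests, e.g.:

```python
class X:
    pass

y = [X()] * 5
# One X instance, five references to it:
print(len(set(id(item) for item in y)))   # 1
print(y[0] is y[4])                       # True
```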


I think anybody that ever did LeetCode/HackerRank in Python had that bug at least once when making an empty 2-D DP table or a return matrix.

My bigger question is how a professor of machine learning does not know about one of the oldest Python footguns in existence:

https://stackoverflow.com/questions/12791501/why-does-this-c...

Very nice article otherwise. I never actually bothered looking into why it behaved this way (at this point most people initialize using list comprehensions, so this bug is IMO very rare in production); very well explained.


> My bigger question is how a professor of machine learning does not know about one of the oldest Python footguns in existence

Because there's a lot of them. And as soon as you touch a sane language that fits your intuitive model of how things should work, going back to Python means that you'll start doing intuitive things that Python will happily and silently transform into unintuitive operations.


Yep, common enough that everyone learns it.

I started using loops and it motivated me to start writing list comprehensions. One of the foot guns that will really motivate you to understand Python as a tool more than anything.


The author has a number of other good articles going into the depths of CPython here: https://codeconfessions.substack.com/t/cpython-internals.

One of them goes into how many lines of C it takes to execute "a + b" in Python (great explanation, but you have to go to the comments for the answer!). I remember one of my professors opening up gdb and printing every C line to execute "print("hello, world")", and the author's writings give a better understanding of what some of those lines are.

I'd be very interested in a follow-up post going in-depth on the ins and outs of list comprehension.


Thank you (author here).

I have not looked into list comprehensions. I think that and how generators work internally might be interesting to write about. I will try to cover them.


It is a good explanation of what is happening. However, I think that many if not most Python linters will warn you about this. It is perfectly valid, and something you might actually want, so it should not throw an error.


TL;DR is that multiplying a list performs a shallow copy of the list's elements. Great article. The only thing it's missing is a snippet that shows what you should do instead:

  [[] for _ in range(N)]
Since the comprehension evaluates the inner expression (i.e. the []) on each iteration, the elements of the outer list are guaranteed to be distinct objects.
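A quick side-by-side of the comprehension fix versus the shallow-copy behavior:

```python
N = 3
rows = [[] for _ in range(N)]   # [] evaluated N times: N distinct lists
rows[0].append(1)
print(rows)      # [[1], [], []]

shared = [[]] * N               # [] evaluated once: one list, N references
shared[0].append(1)
print(shared)    # [[1], [1], [1]]
```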


I only use the [x] * n form of array initialization for primitive types (int, float, bool, str). So

  #good
  a = [True] * 8 # bool is primitive
  b = [[True] * 8 for i in range(8)] # range creates new lists each time

  #bad
  a = [{}] * 8 # dict is not a primitive
  b = [[True] * 8] * 8 # list is not a primitive


This is pretty subtle. I wish more languages explicitly distinguished references and values like C++ and Rust.


In Python, everything is a reference, so you don't need special syntax to mark them.

There just happen to be some types where you cannot change the referenced value in any way. These behave like value types in other languages.

(If you believe that they should be marked explicitly, do you also believe the same thing about Rust? I.e. should the Rust compiler throw an error if you never actually change a reference type, as you "should" be using a value type in that case?)
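The "behave like value types" point above can be seen by contrasting an immutable type (str) with a mutable one (list), since augmented assignment rebinds the former but mutates the latter in place:

```python
s = t = "hi"
t += "!"              # str is immutable: += rebinds t to a new object
print(s, t)           # hi hi!

xs = ys = [1]
ys += [2]             # list is mutable: += extends the shared list in place
print(xs, ys)         # [1, 2] [1, 2]
```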


Everything falls into place when you accept that in Python all names (including list indexes) are just references to objects, and objects aren't copied unless you explicitly copy them. Everything else follows from that.

If you had done [1] * 4, you'd have ended up with a list of four references to the same object (1) as well. It's just that 1 isn't mutable, so nobody cares.
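That "four references, but nobody cares" claim can be verified directly, since sequence repetition copies references for immutable and mutable elements alike:

```python
a = [1] * 4
# All four slots hold the very same int object -- harmless, as ints are immutable:
print(all(x is a[0] for x in a))   # True

b = [[]] * 4
b[0].append('x')                   # mutating through one reference...
print(b)                           # [['x'], ['x'], ['x'], ['x']]
```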


F# does this nicely with the mutable keyword and the <- assignment operator.


Using * on a sequence wouldn't get past code review with me; it's an obscure feature with surprising side effects, and it doesn't explain your thinking.

Only context I've ever found it useful in, is if you're unit testing something that accepts a string of max length M, then

    bad_str = "A" * (M + 1)
Is quite nice and also, easy to interpret.

But if you're calling multiply on a list[list[Any]] lol nah mate, what the hell are you playing at, make a for loop and add comments.


I feel like far too much attention is given to `[[]] * n`. In the grand scheme of things, no serious python programmer is using multiply on a sequence outside of the string construction convenience.

It's also remarkably easy to diagnose once you see the unexpected behaviour, so anyone asking for help is going to be instantly told not to do it and use a comprehension instead.

I'm sure there are plenty of other pitfalls in other languages, though I concede this one is especially unintuitive to a new person.


> Using * on a sequence wouldn't get past code review

On a mutable sequence. Or anything mutable. [] being a sequence is not the issue here.

If you work with immutable stuff, * is really not an issue.


It's not great though. I do use it to repeat strings, but it should have been a method rather than a weird overloading of *. This kind of misuse of operator overloading is why people from a C++ background think operator overloading is evil. In Python it's mostly OK, due to a more restrained culture, but this is a good example of when it's bad.


It's not weirdly overloaded.

'a'*3 == 'a'+'a'+'a'
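A quick check of that equivalence, and of the fact that it extends to lists (where concatenation, like repetition, copies references rather than values):

```python
print('a' * 3 == 'a' + 'a' + 'a')        # True
print([0] * 3 == [0] + [0] + [0])        # True

inner = []
print([inner] * 2 == [inner] + [inner])  # True -- same references either way
```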


There's always the `zip(*[iter(x)]*2)` trick to chunk an iterable into (even-index, odd-index) pairs.
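That idiom works precisely because of the shared-reference behavior under discussion: `[iter(x)] * 2` puts the *same* iterator object into both zip arguments, so zip alternates pulls between them. For example:

```python
x = [1, 2, 3, 4, 5, 6]
# Both zip arguments are the same iterator, so consecutive
# items get distributed across each output tuple:
pairs = list(zip(*[iter(x)] * 2))
print(pairs)   # [(1, 2), (3, 4), (5, 6)]
```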


In other realms, SSA and immutable data make things faster (cache friendly) and more predictable, without impure side effects cascading in unanticipated ways. In general, references are half pointers and a quarter gotos on the papercut-yourself scale.

I think Erlang got this one right.

Perhaps folks should replace their containers and fundamental types like strs, lists, and dicts with immutable ones from an appropriate library.


The root problem here is how references and mutability combine to cause spooky action at a distance, and the non-obviousness of how they combine.

The higher-level a programming language is, the more likely it is to hide the use of references, and also the more likely it is to hide when a new object is created, which would break shared mutable state.

In C or Go any reference is taken explicitly and objects are also created explicitly, so it's very easy to see at a glance what can be mutated from outside the immediate scope or when a defensive copy is made.

Java and C# use lots of implicit references but still require keyword new for nearly all object instantiations. It's easy to look at the expression Collections.nCopies(4, new ArrayList<>()) and understand that the same list will be referenced 4 times.

At the highest extreme you have languages like Haskell where immutability makes details about when objects are copied or referenced irrelevant. Surprising behaviour is outright impossible now.

Python appears to be in a tarpit where object instantiation and referencing is obscured just enough that it can take you by surprise when it silently results in shared mutable state.



