
I really don't like the way this was presented, even if I may agree with the underlying idea. It's not about classes at all; it's about better design, and even the examples are strange.

Just from the JSON example:

- Why do I need a class for streaming JSON - Python's got a perfectly good `yield` for returning tokens in such situations.

- Why would I ever design the JSON library to be extendable at the tokenizer level? If you need a custom serialiser / deserialiser, why not just provide a map of types to callbacks, or a callback as a parameter? Do you really want to extend the JSON format itself?

- The 2GB JSON example is just weird. If you care about such use cases, you a) most likely have a limit on data size at webserver level, b) use proper formats for handling that size of data (I really doubt there's no better data representation once you get to GB sizes).
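To be concrete about the first bullet: a token stream doesn't need a class in Python. Here's a minimal, hypothetical sketch of a generator-based tokenizer (the regex and names are illustrative, not a real library, and it handles only simplified JSON):

```python
import re

# Hypothetical sketch: a generator-based tokenizer for (simplified) JSON.
# Each next() yields one token; no class is needed for plain streaming.
TOKEN_RE = re.compile(r'[{}\[\],:]|"(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?|true|false|null')

def tokenize(text):
    for match in TOKEN_RE.finditer(text):
        yield match.group(0)

tokens = list(tokenize('{"a": [1, 2]}'))
# tokens == ['{', '"a"', ':', '[', '1', ',', '2', ']', '}']
```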

I see his point of view, but he's arguing for one single "hammer" solution rather than against monolithic design. His story seems to hide a strange back-story: "I needed to make my data easier to handle, so I started automatically serialising objects into JSON; then they became huge, so I had to start streaming them, because otherwise just parsing them takes far too long".



> - Why do I need a class for streaming JSON - Python's got a perfectly good `yield` for returning tokens in such situations.

See the msgpack-cli example at the bottom. Say you have a function that returns a generator for tokens in Python. You would then need another function that builds objects out of those tokens. How do you customise how objects are built? A class makes that simpler because each of its methods is an extension point you can override.

But yeah, a token stream would be much appreciated.
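A minimal sketch of that extension-point idea; the event tuples and class names below are made up for illustration, not any real library:

```python
# Hypothetical sketch: objects built from an event stream by a class whose
# methods are the extension points; subclass and override one method to
# customise how objects are built.
class Builder:
    def make_map(self):                 # override to use e.g. OrderedDict
        return {}

    def add_pair(self, obj, key, value):
        obj[key] = value

    def build(self, events):
        # events: ('map_start',), ('pair', key, value), ('map_end',)
        obj = None
        for ev in events:
            if ev[0] == 'map_start':
                obj = self.make_map()
            elif ev[0] == 'pair':
                self.add_pair(obj, ev[1], ev[2])
        return obj

class UpperKeys(Builder):               # one overridden extension point
    def add_pair(self, obj, key, value):
        obj[key.upper()] = value

events = [('map_start',), ('pair', 'a', 1), ('map_end',)]
# Builder().build(events) == {'a': 1}
# UpperKeys().build(events) == {'A': 1}
```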

> - Why would I ever design the JSON library to be extendable at the tokenizer level?

For instance if you want to skip past objects. That's what we do, for instance, for unwanted input data. That implicitly also prevents hash-collision DoS attacks, because we never build hash tables for things we don't want. It also avoids the pause when a garbage collector runs and cleans up a nested structure. I can give you tons of examples where the "build a tree, then reject the tree" approach destroys an evented Python server.
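Token-level skipping can be sketched as nothing more than a depth counter that consumes a balanced structure without ever allocating it (the token list and function name here are hypothetical):

```python
# Hypothetical sketch: skip one complete value at the token level.  The
# unwanted object is consumed but never built, so no hash table exists
# for an attacker's colliding keys and nothing is left for the GC.
def skip_value(tokens):
    depth = 0
    for tok in tokens:
        if tok in ('{', '['):
            depth += 1
        elif tok in ('}', ']'):
            depth -= 1
        if depth == 0:          # scalar, or the structure just closed
            return

it = iter(['{', '"a"', ':', '{', '"x"', ':', '1', '}', '}', '"rest"'])
skip_value(it)                  # consumes the whole nested object
# next(it) == '"rest"'
```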

> - The 2GB JSON example is just weird. If you care about such use cases, you a) most likely have a limit on data size at webserver level, b) use proper formats for handling that size of data (I really doubt there's no better data representation once you get to GB sizes).

Try using a streaming JSON API like Twitter's firehose. Most people just give up and use XML because it has SAX; if they go with JSON, they newline-delimit it, because most libraries are horrible or useless at parsing streamed JSON.


> You would need another function that builds objects out of them.

I.e. you don't bundle your serialiser with your JSON parser. I think that's a good idea. How do you customise? Most likely with a callback that builds an object, or returns the JSON fragment unmodified if it can't. It can be streamed too, in order to accumulate / rebuild fragments of the tree.

Whether an object (builder / factory style) or another generator is simpler here is mostly a matter of taste.
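For what it's worth, the stdlib already takes roughly this shape: `json.loads` accepts an `object_hook` callback that sees each decoded fragment and can replace it or pass it through. The `$set` convention below is made up for illustration:

```python
import json

# The callback approach described above exists in the stdlib: object_hook
# receives each decoded dict and may return any replacement object.
def hook(d):
    # turn made-up {"$set": [...]} fragments into Python sets,
    # pass everything else through unmodified
    if set(d) == {"$set"}:
        return set(d["$set"])
    return d

result = json.loads('{"a": {"$set": [1, 2]}}', object_hook=hook)
# result == {"a": {1, 2}}
```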

> For instance if you want to skip past objects.

You can't skip past objects at the tokenizer level unless you implement the logic of skipping a whole structure, but that's what the parser does. Why don't you just skip the objects based on the streaming API? You don't have to construct them first; it's up to your deserialiser implementation whether to ignore parts of the structure.

> Try using a streaming JSON API like Twitter's firehose.

From the documentation (unless I'm reading the wrong format description):

"""The body of a streaming API response consists of a series of newline-delimited messages, where "newline" is considered to be \r\n (in hex, 0x0D 0x0A) and "message" is a JSON encoded data structure or a blank line."""

That's not one huge JSON document. They use the newline delimiter, as you described later. I don't think that's because "most libraries are horrible or useless for parsing streamed JSON". Why would you ever want to stream JSON as one infinite, never-complete object? What's wrong with splitting by newline? By making newlines the message separators, you never have to worry that your stream becomes broken by a parsing error: bad message? Ignore it, skip to the next newline, and continue with the next one.
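That framing is trivial to consume with a plain JSON library. A sketch (function name is illustrative), including the skip-on-error behaviour:

```python
import json

# Sketch of newline-delimited framing: each line is an independent JSON
# document, so a bad message can be skipped without corrupting the stream.
def iter_messages(stream_text):
    for line in stream_text.split("\r\n"):
        if not line:                 # blank keep-alive line
            continue
        try:
            yield json.loads(line)
        except ValueError:           # bad message: skip to the next newline
            continue

stream = '{"id": 1}\r\n\r\n{broken}\r\n{"id": 2}\r\n'
msgs = list(iter_messages(stream))
# msgs == [{"id": 1}, {"id": 2}]
```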


I think

> A class makes that simpler because each of its methods is an extension point you can override.

is a strong argument in favour of classes. They're more extensible even if the author doesn't consider it. However - as soon as your class has an implementation like

  def to_json(str)
    JSONParser.parse(str) # JSONParser is not streamed
  end
then you're in trouble. Unless your language supports dynamic lookup of constants, class names feel very much like global variables that are a pain to change. In Ruby, as of 1.9.3, constant lookup is lexical, so you can't simply define a class-local value for the JSONParser constant.

I don't know the story in other languages; I assume Java has the same problem, given the great emphasis you see there on dependency injection. If dynamic lookup of constants were present, I think classes would be more unintentionally extensible however they were written. As it is, you have to be as careful writing classes for extension as you are with functional code, where you have to manually provide extension points.
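In Python the usual manual extension point is the same dependency-injection move: take the parser as a parameter with a default instead of naming the class inside the method. A minimal sketch (class and parameter names are illustrative):

```python
import json

# Sketch: instead of hard-coding a parser class (a de-facto global),
# accept it as an injectable dependency with a sensible default.
class Document:
    def __init__(self, parser=json.loads):
        self._parse = parser

    def from_json(self, text):
        return self._parse(text)

doc = Document()                                   # default parser
# doc.from_json('{"a": 1}') == {"a": 1}

stub = Document(parser=lambda s: "parsed:" + s)    # swapped in, e.g. in a test
# stub.from_json("x") == "parsed:x"
```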


In Perl, package symbol tables can be modified with dynamic scope, so you can safely localise any monkey-patching. For example:

  use 5.016;
  use warnings;
  
  package JSONParser {
      sub parse {
          my ($self, $str) = @_;
          "JSONParser::parse $str";
      }
  }
  
  package Foo {
      sub new { my $class = shift; bless {}, $class }
  
      sub to_json {
          my ($self, $str) = @_;
          JSONParser->parse($str);
      }
  }
  
  my $foo = Foo->new;
  
  say $foo->to_json("foo");
  
  {
      # OK... I want to amend that JSONParser->parse behaviour
      # but just in this scope!

      no warnings 'redefine';
      local *JSONParser::parse = sub {
          my ($self, $str) = @_;
          "No Longer JSONParser::parser!  $str";
      };
  
      say $foo->to_json("bar");
  }
  
  say $foo->to_json("baz");
This outputs...

  JSONParser::parse foo
  No Longer JSONParser::parser!  bar
  JSONParser::parse baz


In Java, you often see a class JsonClass with a toJson method like the one you presented, and then a StreamingJsonClass with an overridden toJson that streams. This can cause problems with types (the most obvious return type for toJson when it's first written is JsonObject, but you really want an Iterator<JsonObject>), but Ruby/Python have the same problem, only hidden.

It's quite idiomatic Java, though, to remove all static calls from methods, even if you have to use anonymous classes (i.e. extensible method dependencies) to do so, for exactly the reasons you outline.


> - Why do I need a class for streaming JSON - Python's got a perfectly good `yield` for returning tokens in such situations.

I use msgpack in Python (twisted, tornado) precisely because it can consume byte buffers which are not token-aligned.
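The pattern that makes this work is a feed-style API: you push whatever bytes arrived, however they were split, and complete messages come out only when available. `FeedParser` below is a hypothetical stand-in for that pattern (msgpack's Unpacker works similarly), using newline-framed JSON so the sketch stays stdlib-only:

```python
import json

# Hypothetical sketch of a feed-style parser: buffers may split anywhere,
# even mid-token; feed() accumulates bytes, and iteration yields only the
# messages that are complete so far.
class FeedParser:
    def __init__(self):
        self._buf = b""

    def feed(self, data):
        self._buf += data

    def __iter__(self):
        while b"\n" in self._buf:
            line, self._buf = self._buf.split(b"\n", 1)
            if line:
                yield json.loads(line)

p = FeedParser()
p.feed(b'{"a": 1}\n{"b"')     # second message arrives split mid-token
p.feed(b': 2}\n')
msgs = [m for m in p]
# msgs == [{"a": 1}, {"b": 2}]
```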



