
I find that to be the issue. You are treating it as just RAW text when it is actually formatted text that has been parsable for years with common Unix command-line tools. That it is not in the format you consider a structured object does not mean it's not an object, or that it isn't parsable. If you are using ad hoc regexes, I suspect you are not using all the tools available to you.

I feel like Kernighan and Pike do a much better job of explaining this than I ever could.

https://www.amazon.com/Unix-Programming-Environment-Prentice...




> You are considering it just RAW text when it is actually formatted text that has been parsable for years with common unix command line tools.

Parsing command output with sed/awk/etc. (i.e. "common unix command line tools") is absolutely writing an ad hoc parser.

Let me give you an example that I recently ran into.

I have a tool that parses the output of "readelf -sW", which dumps symbols and their sizes. The output normally looks like this:

     885: 000000000043f0a0   249 FUNC    WEAK   DEFAULT   13 _ZNSt6vectorIPN3re23DFA5StateESaIS3_EE19_M_emplace_back_auxIJRKS3_EEEvDpOT_
     886: 000000000041c380    64 FUNC    GLOBAL DEFAULT   13 _dwarf_get_return_address_reg
     887: 0000000000424e60   122 FUNC    GLOBAL DEFAULT   13 _dwarf_decode_s_leb128
     888: 000000000043dca0   157 FUNC    GLOBAL DEFAULT   13 _ZN3re23DFA10StateSaverC2EPS0_PNS0_5StateE
So I wrote a regex to parse this. Seems pretty straightforward, right?

But then I noticed a bug where some symbols were not showing up. And it turns out those symbols look like this:

     5898: 00000000001a4d80 0x801058 OBJECT  GLOBAL DEFAULT   33 _ZN8tcmalloc6Static9pageheap_E
Notice the difference? Because it's a large symbol, readelf decided to print it starting with "0x" and in hex instead of decimal. I had to update my regex to accommodate this.
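
A minimal sketch of what such an updated regex might look like, in Python. The regex actually used in the tool is not shown in this comment, so the pattern and the name `parse_symbol_line` are assumptions for illustration only; the two sample lines are the ones quoted above.

```python
import re

# Hypothetical pattern for one "readelf -sW" symbol row; not the original regex.
SYMBOL_RE = re.compile(
    r"^\s*(?P<num>\d+):\s+"
    r"(?P<value>[0-9a-fA-F]+)\s+"
    r"(?P<size>0x[0-9a-fA-F]+|\d+)\s+"   # decimal, or hex with a 0x prefix
    r"(?P<type>\S+)\s+(?P<bind>\S+)\s+(?P<vis>\S+)\s+(?P<ndx>\S+)"
    r"(?:\s+(?P<name>\S.*))?$"           # the name column may be empty
)


def parse_symbol_line(line):
    """Return (name, size) for a symbol row, or None if the line does not match."""
    m = SYMBOL_RE.match(line)
    if m is None:
        return None
    size_text = m.group("size")
    size = int(size_text, 16) if size_text.startswith("0x") else int(size_text)
    return m.group("name"), size


small = "   886: 000000000041c380    64 FUNC    GLOBAL DEFAULT   13 _dwarf_get_return_address_reg"
large = "  5898: 00000000001a4d80 0x801058 OBJECT  GLOBAL DEFAULT   33 _ZN8tcmalloc6Static9pageheap_E"
print(parse_symbol_line(small))
print(parse_symbol_line(large))
```

Even this version only covers the shapes seen so far, which is the point being made: nothing guarantees there isn't a third shape.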

That is what makes a parser "ad hoc". You write a parser based on the examples you have seen, but other examples might break your parser. Parsing text robustly is non-trivial.

Worse, it is an unnecessary cognitive burden. Readelf already had this data in a structured format; why does it have to go to text and back? Why do I have to spend mental cycles figuring out which "common unix command-line tools" (and which options) can parse it back into a data structure?


The Unix answer is why are you using a regex on tab-separated values? Wrong tool for the job.

Of course the problem with Unix is that there are a thousand different semi-structured text formats, edge cases, and tools that must be mastered before you can make any sense of it all. Any time you point out the pain a Unix fan can just respond by pointing out your ignorance.


> The Unix answer is why are you using a regex on tab-separated values?

They aren't tab-separated. There are no tabs in readelf output.

Also if you assume they are space-separated, note that the first few lines look like this:

    Symbol table '.dynsym' contains 164 entries:
       Num:    Value          Size Type    Bind   Vis      Ndx Name
         0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
> respond by pointing out your ignorance.

Indeed.
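
To make the space-separated objection concrete, here is a small Python sketch that splits the lines quoted above on runs of whitespace and counts the fields. The first three lines are the ones quoted in this comment; the fourth is a symbol row from earlier in the thread. Nothing here is readelf-specific, it is just `str.split()`.

```python
lines = [
    "Symbol table '.dynsym' contains 164 entries:",
    "   Num:    Value          Size Type    Bind   Vis      Ndx Name",
    "     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND",
    "   886: 000000000041c380    64 FUNC    GLOBAL DEFAULT   13 _dwarf_get_return_address_reg",
]

# Number of whitespace-separated fields on each line.
field_counts = [len(line.split()) for line in lines]
for count, line in zip(field_counts, lines):
    print(count, line.strip())
```

The column header has the same number of fields as a real symbol row, and the row with an empty name has one fewer, so a field count alone cannot tell data from decoration.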


I've actually become a pretty big fan of line-terminated JSON (all linefeeds inside strings, etc. are escaped). Each line is a separate JSON object. That makes it very easy to parse, handles your use case, and can even be piped onward as JSON.

In this case, you can have objects as text, and still get plain text streaming with the ability to use just about any language you like for processing.
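
A minimal sketch of the one-object-per-line idea in Python, using only the standard `json` module. The record contents are made up for illustration.

```python
import json

records = [
    {"name": "_dwarf_decode_s_leb128", "size": 122},
    {"name": "note", "text": "first line\nsecond line"},
]

# One JSON document per line; json.dumps escapes embedded newlines as \n,
# so a record never spans more than one line of output.
encoded = "\n".join(json.dumps(r) for r in records) + "\n"

# Reading back is a line split plus json.loads on each line.
decoded = [json.loads(line) for line in encoded.splitlines()]
```

Because every record is a complete line, line-oriented tools can still count, slice, or filter the stream before anything parses the JSON.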


First off, readelf shouldn't switch between hex and base 10. Secondly, that's DSV, so you shouldn't have written a regex for it. You should have used either cut or awk, both tools SPECIFICALLY DESIGNED to do what you want.


What's DSV?


DSV: Delimiter-Separated Values. readelf uses a delimiter matching the regex /\s+/ (a run of whitespace). In AWK, the default FS already splits on runs of blanks, so AWK will parse this by default. Or you can pipe the output through

  tr -s '[:blank:]'
and then into cut, which will give you the column you want.
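
For readers who don't have the two approaches in their fingers, here is a rough Python imitation of what each one sees for a single symbol row. The regex and the `split` calls are approximations of `tr -s`, `cut -d' '`, and awk's default field splitting, not the tools themselves.

```python
import re

line = "   886: 000000000041c380    64 FUNC    GLOBAL DEFAULT   13 _dwarf_get_return_address_reg"

# Roughly what `tr -s '[:blank:]'` does: squeeze each run of a repeated blank to one.
squeezed = re.sub(r"([ \t])\1+", r"\1", line)

# Roughly what `cut -d' '` sees: a split on every single space, so the
# leading space produces an empty first field.
cut_fields = squeezed.split(" ")

# Roughly what awk's default field splitting does: leading blanks are ignored.
awk_fields = line.split()
```

In this imitation the size ends up in the fourth field for the cut route and the third for the awk route, because of the leading blank.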


I would like to see how it will look in PowerShell. $(readelf).ShowMeTheSecondStringFromTheEnd or $(readelf).PrintTheLastColumnInHeX?


A comparable PowerShell cmdlet would give you one object per line with properties corresponding to the columns. And no, those properties usually have sensible names, instead of "LastColumn".

Of course, for wrapping the native command you'd still have to do text parsing if you want objects. This was meant as a comparison of the two worlds, not as "if I ran readelf in PowerShell it would magically get different output".


I happen to find your example of why to use an object a bit hilarious.

You are right, readelf has the object in a structure -- because that is what elf is...

           typedef struct {
               uint32_t      st_name;
               unsigned char st_info;
               unsigned char st_other;
               uint16_t      st_shndx;
               Elf64_Addr    st_value;
               uint64_t      st_size;
           } Elf64_Sym;

If you wanted the object, why did you need readelf in the first place? Why not just read the ELF format directly and bypass readelf altogether? That seems to be what you are advocating by having readelf pass an object instead of what it does today.
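
For a sense of what reading the struct directly involves, here is a minimal Python sketch that decodes one `Elf64_Sym` entry with the standard `struct` module. It assumes a little-endian file and works on a synthetic 24-byte entry built in the example itself; locating the symbol table section and resolving `st_name` through the string table are left out entirely.

```python
import struct

# Little-endian layout of Elf64_Sym as declared above:
# uint32 st_name, uchar st_info, uchar st_other, uint16 st_shndx,
# uint64 st_value, uint64 st_size -> 24 bytes, no padding.
ELF64_SYM = struct.Struct("<IBBHQQ")


def unpack_sym(raw):
    """Decode one 24-byte symbol table entry into a dict."""
    st_name, st_info, st_other, st_shndx, st_value, st_size = ELF64_SYM.unpack(raw)
    return {
        "st_name": st_name,      # offset into the string table, not the name itself
        "st_info": st_info,
        "st_other": st_other,
        "st_shndx": st_shndx,
        "st_value": st_value,
        "st_size": st_size,
    }


# Synthetic entry built for the example; the values mirror the large symbol
# quoted earlier in the thread, with a made-up st_name offset.
raw = ELF64_SYM.pack(1, 0x11, 0, 33, 0x1A4D80, 0x801058)
sym = unpack_sym(raw)
```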


You're asking me why I use a tool instead of parsing a binary format manually? Does that really need explanation?

If that is your attitude, why use any command-line tools ever? Why use "ls" when you can call readdir()? Why use "ps" when you can parse /proc?

You just pointed me to Kernighan and Pike a second ago. I didn't expect I would need to justify why piping standard tools together is better than programming everything manually.


I never said anything about not liking command-line tools. In fact I love them and think they do an awesome job!

In any case you just proved my point. You think it's insane to parse binary data while scripting, and I do too. That is why I think passing binary objects around in the shell is insane.

Now if you were talking about text-based objects (not binary ones) then that is an entirely different story, and I feel that is what we do today. In your example you have rows, which could be called objects, and members, which are separated out into columns. Arguing that one text-based format is better than another is not something I am interested in doing, mostly because there are a million different ways one could format the output. If you were to do "objects" I think they would have to be in binary to get any of the benefits one could perceive.

To be honest I feel the output you posted is a bug in readelf. I would expect all data in that column to be in the same base.

I will level with you: I can see some benefits of having binary passed between command-line programs, but I think the harm it would do would outweigh the benefit.

But if you really wanted to do that, you could. There is nothing stopping command-line utility makers from outputting binary or any other text format. You don't need the shell to make that happen.

What I think everybody is asking for is for command-line developers to standardize their output into something parsable, which I feel most command-line utilities already do. They give you many different ways to format the data as it is. Some do this better than others, and I think that would hold true even if somebody forced all programs to produce only binary or JSON text output when piped.


This isn't about binary vs text, it is about structured vs. unstructured.

The legacy of UNIX is flat text. Yes it may be expressing some underlying structure, but you can't access that structure unless you write a parser. Writing that parser is error-prone and unnecessary cognitive burden.

PowerShell makes it so the structure of objects is automatically propagated between processes. This is undeniably an improvement.

I'm not saying PowerShell is perfect. From what I understand the objects are some kind of COM or .NET thing, which seems unnecessary to me. JSON or some other structured format would suffice. What matters is that it has structure.

I still don't think you appreciate how fragile your ad hoc parsers are. When things go wrong, you say you "feel" readelf has a bug. What if they disagree with you and they "feel" it is correct? There is no document that says what readelf output promises to do. You're writing parsers based on your expectations, but no one ever promised to meet your expectations. But if the data was in JSON, then there would be a promise that the data follows the JSON spec.


> From what I understand the objects are some kind of COM or .NET thing, which seems unnecessary to me. JSON or some other structured format would suffice.

They are .NET objects, which in some cases wrap COM or WMI objects. The nice thing about them isn't just properties, though. You can also have methods. E.g. the service objects you get from Get-Service have a Start() and Stop() method; Process objects returned from Get-Process allow you to interact with that process. Basically, wherever a .NET class already existed to encapsulate the information, that class was used, which gets you a lot more functionality than just the data contained in properties.


If the data was in JSON it would promise that it followed the JSON spec -- but it's not, it follows its own defined spec, which in the case of readelf is apparently undefined.

Other programs that expect to be machine parsable define their output in great detail. In the post I replied to, you mentioned ps. ps has many options to help you get the data you want without using standard parsing tools. That is because its output was expected to be consumed by both humans and possibly other programs.

Now take readelf on the other hand. Its man page clearly talks about being more readable. Its author cares about how the output looks on a terminal and even went to the effort of implementing -W, which makes it nicer to view on larger terminals. It even shows in print_vma, where somebody went out of their way to print hex if the number is larger than 99999. If the author really cared about the output being parsable, they would have added an OUTPUT FORMAT CONTROL section that would provide the contract you are looking for. Just saying "if the data was in JSON" does not solve your problem. Why? Because the author of readelf did not spend time defining its output properly in the man page, it is not likely he/she would have implemented a JSON output type when piped, let alone taken the time to document the object structures in the man page.
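
The rule described for print_vma can be sketched in a few lines of Python. This is not the binutils source: the function name `format_size` and the exact comparison are assumptions based only on the description in this comment.

```python
def format_size(size):
    """Mimic the described rule: decimal up to 99999, 0x-prefixed hex above."""
    # Assumption: the threshold comes from the comment above, not from binutils.
    return str(size) if size <= 99999 else hex(size)


for value in (249, 99999, 100000, 0x801058):
    print(value, "->", format_size(value))
```

A rule like this keeps a five-character column tidy for a human reader, while a consumer has to know about the switch in advance.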

You say it's not about binary vs text but I don't think that can be said. There are lots of things to consider.

* Speed of encoding and decoding.
* Memory consumption issues with larger objects needing to be fully decoded before being able to be used or processed.
* Binary data would need to be encoded and would likely result in much more overhead.

It's not clear to me that a binary option would not be better than a text one. Pipes today are not just used for simple scripts and system management.

There are lots of things that concern me, maybe it is just the implementation details.

* Not all command-line programs are written with the expectation of being parsed. How do we handle that? Force the programmer to make all output parsable regardless of whether they ever intended the program to be used in some script?
* Would a program put the same thing to stdout even if it was flowing to a terminal? Are terminals not for humans?
* Would structure be enforced? One of the awesome things about stdin/stdout is that you can send ANY data you want.

That all said, I would love it if programs whose output is intended to be parsed offered a JSON output. I am not against structured output. I am against forcing programmers to shoehorn their output into some format that may not be the best for their program. I think a well designed and documented command-line tool that expects to be parsed by other programs will go out of its way to ensure the format is documented and adhered to.


It does follow a standard. It's DSV. Unix tools are really good at handling that. Awk and cut specifically.


[flagged]


The dollar sign is charming.

There are a few points here:

1) Not all data is text. In fact, very little of the data people see/work with day-to-day is raw text. It's silly to transform a PNG image into text to be able to pipe it around. (Or to pipe around its filename instead and have a dozen tools all having to open and re-parse it each time.)

2) There's nothing on PowerShell preventing you from serializing a piece of data to text if you want to. The key is: you don't have to.

3) Systems that depend on 50,000 CLI tools all having their own ad-hoc text parsers are cemented, mummified, cannot change. You can't change the output format of ps (to use an example in this thread) without breaking an unknown number of CLI tools. Even if you come up with a great way to improve the output, doesn't matter, you've still broken everything. This is less (but not none!) of an issue with PowerShell. I like computers to evolve to become better over time, and text-based CLIs are a huge anchor preventing that.


Unfortunately PowerShell relies on everything running on .NET (well, I think COM works, too). The idea of a shell that can expose live objects is useful, but PowerShell's platform limitations make it a far from ideal implementation of that concept. Something built on a platform-agnostic protocol would be better.


Live objects don’t usually expose any protocols at all. They only expose an ABI. Do you know any platform-agnostic OO ABI, besides what’s in .NET?

If you wrap your objects in some platform-agnostic protocol like JSON, you’re going to waste an enormous amount of CPU time parsing and formatting those streams at the objects’ boundaries.


You can run streams of many millions of JSON objects pretty much as fast as the IO can feed it... most of the time, in situations like this, you're constrained by IO speed, not CPU... assuming you are working with a stream that has flow control.

I tend to dump out data structures to line terminated JSON, and it works out really well for streams, can even gz almost transparently. Parse/stringify has never been the bottleneck... it's usually memory (to hold all the objects being processed, unless you block/pushback on the stream), or IO (the feed/source of said stream can't keep up).
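
A minimal Python sketch of that workflow, assuming gzip-compressed, newline-delimited JSON. The helper names `write_jsonl_gz` and `read_jsonl_gz` are made up for this example, and an in-memory buffer stands in for a real file or socket.

```python
import gzip
import io
import json


def write_jsonl_gz(stream, records):
    """Write records as gzip-compressed, newline-delimited JSON."""
    with gzip.open(stream, "wt", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")


def read_jsonl_gz(stream):
    """Yield one decoded record at a time, so only one record is held at once."""
    with gzip.open(stream, "rt", encoding="utf-8") as lines:
        for line in lines:
            yield json.loads(line)


buffer = io.BytesIO()           # stands in for a file or socket in this sketch
write_jsonl_gz(buffer, ({"n": i} for i in range(3)))
buffer.seek(0)
total = sum(record["n"] for record in read_jsonl_gz(buffer))
```

Both sides are generators, so neither the writer nor the reader needs the whole data set in memory.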


Even if printing and parsing is computationally cheap, memory allocation is less so.

If you expose JSON, each serialize/deserialize cycle produces another instance of each object, with the same data.

The architecture of PowerShell implies commands in the pipeline can process the same instances, without duplicating them.
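
The difference between sharing an instance and round-tripping it through JSON can be shown inside a single Python process. This only illustrates the general point about copies; it says nothing about how PowerShell itself manages objects, and the record is made up.

```python
import json

record = {"name": "pageheap_", "size": 0x801058}

# Handing the object to the next stage in the same process shares the instance.
same = record

# A serialize/parse round trip yields an equal but separate object.
copy = json.loads(json.dumps(record))

print(same is record)   # identity is preserved
print(copy == record)   # contents are equal
print(copy is record)   # but it is a new allocation
```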

Another good thing about passing raw objects instead of JSON: live objects can contain things that are expensive or impossible to serialize, like an OS handle to an open file. Sure, with JSON you can pass file names instead, but this means commands in your pipeline need to open and close those files. Not only is this slower (opening a file requires a kernel call, which in turn does various security checks for the user’s group membership and the file system’s inherited permissions), but it can even cause sharing violation errors when two commands in the pipeline try to access the same file.


And just think how much faster it would be if there were no serialization and I/O involved at all...


And my method can work over distributed systems with network streams... There are advantages to streams of text.


What advantages?

PowerShell works over networks just fine, thanks to the standardized ISO/IEC 17963:2013 protocol, a.k.a. WS-Management.


> Even if you come up with a great way to improve the output, doesn't matter, you've still broken everything.

You phrase this as if changing things for the sake of changing was a good thing. It is not.

Well, perhaps it is good for the software vendor, but from the customer's point of view, having to re-learn how to do the same stuff over and over every other year is a PITA.


This is why I have trouble getting along with Linux users.

Without change, there's no improvement.


It's often the customers who are complaining that your current output is not suitable.


First off, you can pipe binary data around. Most tools just expect text.

Secondly, if you used DSV to parse ps output, like you should, adding a new column to the end won't break anything. A fancier parser won't even break if you add to the middle, but that's usually not worth the effort to write.
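
The "fancier parser" can be sketched in Python by looking columns up by header name instead of by position. The function name `parse_table` and both sample tables are made up for illustration; they are in the general shape of ps output but were not captured from a real run. The sketch also assumes no column other than the last is ever empty, which is exactly the kind of assumption the readelf example breaks.

```python
def parse_table(text):
    """Parse whitespace-delimited output whose first line names the columns."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # Split at most len(header) - 1 times so the last column may contain spaces.
        values = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, values)))
    return rows


# Made-up sample in the general shape of ps output.
old = """
  PID TTY          TIME CMD
    1 ?        00:00:01 init
  742 pts/0    00:00:00 bash -l
"""
# Same data with a hypothetical extra column added in the middle.
new = """
  PID  PPID TTY          TIME CMD
    1     0 ?        00:00:01 init
  742   700 pts/0    00:00:00 bash -l
"""

old_cmds = [row["CMD"] for row in parse_table(old)]
new_cmds = [row["CMD"] for row in parse_table(new)]
```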


> What the M$ community fails to see is that text streams can be consumed by __everyone__.

Text can be poorly parsed by everyone, yes. I especially love it when the default tool settings mean two different computers will give different text results, because the installation defaults changed at some point. What's not to love about trying to properly escape and unescape strings via the command line, while simultaneously keeping in mind your own shell's escaping, and having scripts which are neither backwards nor forwards compatible? And this is to say nothing of different distros.

It's as if the text was built for humans instead of my tools half the time, or something. I usually try to centralize my parsing of text in one place so when it invariably breaks, I don't have to rewrite my entire shell script.

Some basic structuring - I hesitate to call it a brittle object model, when most of the time I'm dealing with something more akin to C structs than invariant and abstraction laden Smalltalk or somesuch, or Java's sprawl of factories and inheritance - makes things a bit easier. New fields can be added without breaking my regular expressions. I don't need to worry about escapes to handle special characters. I can trivially dump to text for display (by simply running the command as this is the default behavior of the shell), or to feed into text consuming tools.


It's even more than that: text formatting allows the use of generic filtering and text processing tools. If you are using objects, you will tend to do less on-the-fly command-line composition, and write or reuse more tools dedicated to one particular object type or another. In the end I'm not sure keeping structure as the default yields much of an improvement in real-world use, because if you work with a broad range of types you will need to know more specialised tools instead of generic ones, and you will have fewer higher-order composition tools at your disposal.

Now of course you can always format to text from your structured object, but this does not matter. What matters is what is convenient in the proposed UI, and what is mainly practiced in the real world because of that convenience and the amplification loop it creates between tool authors and users.

Also, some objects are originally handled in text form, and their structured decomposition is extremely complex compared to an informal description and the naturally refinable heuristic which comes with it. For example, you can grep through source code in any language (with more or less efficiency depending on the language, but at least if there is a unique symbol you will find it), whereas trying to get and handle it in a structured way basically means you would need half of a (modular) compiler, a huge manual to describe the structure, and possibly a non-trivial amount of code to actually do the lookup.

The PowerShell approach is not all bad, though, and obviously there are some uses where it is superior to text-based shells. But for day-to-day use as a general-purpose shell, and a programmer's shell, I'll stick to the Unix approach.


Structured data allows the use of generic filtering and structured data processing tools. The basic requirement is reflection, objects being able to tell you about their structural composition.
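
A small Python sketch of a generic, reflection-based filter. The `Symbol` and `Process` types and the helpers `where` and `columns` are hypothetical names invented for this example; the point is only that one filter works on any type that can describe its own fields.

```python
from dataclasses import dataclass, fields


@dataclass
class Symbol:
    name: str
    size: int


@dataclass
class Process:
    pid: int
    command: str


def where(items, **conditions):
    """Generic filter: keep items whose named attributes equal the given values."""
    return [
        item for item in items
        if all(getattr(item, key) == value for key, value in conditions.items())
    ]


def columns(item):
    """Ask the object which fields it has instead of hard-coding them."""
    return [f.name for f in fields(item)]


symbols = [Symbol("a", 1), Symbol("b", 2)]
processes = [Process(1, "init"), Process(742, "bash")]
print(where(symbols, size=2), columns(symbols[0]))
print(where(processes, command="init"), columns(processes[0]))
```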

If code was stored in a structured representation you could still search for a string object containing a symbol name in the structured representation. You can match a structure pattern just like you match a regex to text.
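
As one illustration of searching a structured representation of code, Python's standard `ast` module can find uses of an identifier without treating the source as text. The sample source and the helper name `find_name` are made up for this sketch.

```python
import ast

source = """
def area(width, height):
    total = width * height
    return total
"""


def find_name(tree, name):
    """Return the line numbers where `name` is used as an identifier."""
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and node.id == name
    )


tree = ast.parse(source)
hits = find_name(tree, "total")
print(hits)
```

Unlike a text search, this will not match the same word inside a string literal or a comment, though it does require a parser for the language in question.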

Typical shells can be thought of as a REPL for a programming language that makes the file system easily accessible and uses said file system or pipes to pass around data between functions/commands, sometimes as human-readable text. Most programming languages don't encourage passing data around as strings.


Nothing prevents me from changing those objects into a text stream. In fact, it's infinitely easier than turning a text stream into objects.


I feel you missed my point. It's not just a stream of raw data. It's a stream of formatted text... There is no magic or hand waving involved.


I guess that's why HTTP2 is now binary, because text is awesome.

https://http2.github.io/faq/#why-is-http2-binary



