The version I've heard is that small data fits on an average developer workstation, medium data fits on a commodity 2U server, and "big data" needs a bigger footprint than that single commodity server offers.
I like that better than bringing racks into it, because once you have multiple machines in a rack you've got distributed systems problems, and there's a significant overlap between "big data" and the problems that a distributed system introduces.
It's frustrated me for the better part of a decade that the misconception persists that "big data" begins after 2U. It's as if we're all still living during the dot-com boom and the only way to scale is buying more "pizza boxes".
Single-server setups larger than 2U but (usually) smaller than 1 rack can give tremendous bang for the buck, no matter if your "bang" is peak throughput or total storage. (And, no, I don't mean spending inordinate amounts on brand-name "SAN" gear).
There's even another category of server, arguably non-commodity, where paying roughly a 2x price premium (on the server itself, not the storage) can quadruple the CPU and RAM capacity, if not the I/O throughput, of the cheaper version.
I think the ignorance of what hardware capabilities are actually out there ended up driving well-intentioned (usually software) engineers to choose distributed systems solutions, with all their ensuing complexity.
Today, part of the driver is how few underlying hardware choices one has from "cloud" providers and how anemic the I/O performance is.
It's sad, really, since SSDs have so greatly reduced the penalty for data not fitting in RAM (while still being local). The penalty for being at the end of an ethernet, however, can be far greater than that of a spinning disk.
That's a good point, I suppose it'd be better to frame it as what you can run on a $1k workstation vs. a $10k rackmount server, or something along those lines.
As a software engineer who builds their own desktops (and has for the last 10 years) but mostly works with AWS instances at $dayjob, are there any resources you'd recommend for learning about what's available in the land of that higher-end rackmount equipment? Short of going full homelab, tripling my power bill, and heating my apartment up to 30C, I mean...
> I suppose it'd be better to frame it as what you can run on a $1k workstation vs. a $10k rackmount server, or something along those lines.
That's probably better, since it'll scale a bit better with technological improvements. The problem is, it doesn't have quite the clever sound to it, especially with the numbers and dollars.
Now, the other main problem is that, though the cost of a workstation is fairly well-bounded, the cost of that medium-data server can actually vary quite widely, depending on what you need to do with that data (or, I suppose, how long you might want to retain data you don't happen to be doing anything to right at that moment).
I suppose that's part of my point: there's a mis-perception that, because a single server (including its attached storage) can be so expensive, to the tune of many tens of thousands of (US) dollars, that somehow makes it "big" and undesirable, despite its potentially close-to-linear price-to-performance curve compared to those small 1U/2U servers. Never mind doing any reasoned analysis of whether going farther up the single-server capacity/performance axis, where the price curve gets steeper, is worth it compared to the cost and complexity of a distributed solution.
> are there any resources you'd recommend for learning about what's available in the land of that higher-end rackmount equipment?
Sadly, no great tutorials or blogs that I know of. However, I'd recommend taking a look at SuperMicro's complete-server products, primarily because, for most of them, you can find credible barebones pricing with a web search. I expect you already know how to account for other components (primarily of concern for the mobos that take only exotic CPUs).
As I alluded to in another comment, you might also look into SAS expanders (conveniently also well integrated into some, but far from all, SuperMicro chassis backplanes) and RAID/HBA cards for the direct-attached (but still external) storage.
Well, do notice I did say the penalty "can be", not "always is", far greater.
That's primarily because I'm aware of the variability that random access injects into spinning disk performance and that 10GE is now common enough that it takes more than just a single (sequentially accessed) spinning disk to saturate a server's NIC.
Plus, if you're talking about a (single) local spinning disk, I'd argue that's a trivial/degenerate case, especially if compared to a more expensive SSD. Does my assertion stand up better if it had "of comparable cost" tacked on? Otherwise, the choice doesn't make much sense, since a local SSD is the obvious choice.
My overall point is that, though one particular workload may make a certain technology/configuration appear superior to another [1], in the general case, or, perhaps most importantly, in the high-performance case, it pays to keep an eye on the bottlenecks, especially the ones that carry a high incremental cost of increasing their capacity.
It may be that people think the network, even 10GE now, is too cheap to be one of those bottlenecks, arguably a form of fallacy [2] number 7, but that ignores the question of aggregate (e.g. inter-switch) traffic. 40G and 100G ports can get pricey, and, at 4x and 10x of a single server port, they're far from solving fallacy number 3 at the network layer.
The other tendency I see is for people not to realize just how expensive a "server" is, by which I mean the minimum cost, before any CPUs or memory or storage. It's about $1k. The fancy, modern, distributed system designed on 40 "inexpensive" servers is already spending $40k just on chassis, motherboards, and PSUs. If the system didn't really need all 80 CPU sockets and all those DIMM sockets, it was money down the drain. What's worse, since the servers had to be "cheap", they were cargo-cult sized at 2U with low-end backplanes, severely limiting I/O performance. Then, to expand I/O performance, more of the same servers [3] are added, not because CPU or memory is needed, but because disk slots are needed, and another $4k is spent to add capacity for 2-4 disks.
[1] This has been done on purpose for "competitive" benchmarks since forever
[3] Consistency in hardware is generally something I like, for supportability, but it's essentially impossible anyway, given the speed of computer product changes/refreshes, which is why I think it's also foolish not to re-evaluate when it's capacity-adding time 6-9 months later.
Actually my example is far simpler and less interesting.
Having a console devkit read unordered file data from a local disk ends up being slower than reading the same data over a plain gigabit network connection from an SSD in a developer's machine.
It simply comes down to the random access patterns and seek latency of the spinning disk versus the excellent random access capabilities of an SSD.
Note this is quite unoptimised reading of the data.
At a theoretical level, as a sysadmin, I learned the capabilities by reading datasheets for CPUs, motherboards (historically, also chipsets, bridge boards, and the like, but those are much less relevant now), and storage products (HBAs/RAID cards, SAS expander chips, HDDs, SSDs). Make sure you're always aware of the actual payload bandwidth (net of overhead), the actual units (base 2 or base 10), and duplex considerations (e.g. SATA).
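As a rough illustration of what "net of overhead" means, here's a quick back-of-the-envelope sketch; the encoding-overhead figures are the commonly cited ones for SATA 3 and 10GbE, and protocol framing on top of them eats a bit more:

    # Payload bandwidth net of line-encoding overhead only (decimal MB/s).

    # SATA 3: 6 Gbit/s line rate with 8b/10b encoding (20% overhead).
    sata3_payload_MBps = 6.0e9 * (8 / 10) / 8 / 1e6
    print(f"SATA 3 payload: ~{sata3_payload_MBps:.0f} MB/s")     # ~600 MB/s

    # 10GbE (10GBASE-R): 10.3125 Gbaud line rate with 64b/66b encoding.
    tenge_payload_MBps = 10.3125e9 * (64 / 66) / 8 / 1e6
    print(f"10GbE payload: ~{tenge_payload_MBps:.0f} MB/s")      # ~1250 MB/s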
From a more practical level, I look at various vendors' actual products, since it doesn't matter (for example) if a CPU series can support 8 sockets if the only mobos out there are 2- and 4-socket.
I also look at whatever benchmarks are out there to determine if claimed performance numbers are credible. This is where even enthusiast-targeted benchmark sites can sometimes be helpful, since there's often a close-enough (if not identical) desktop version of a server CPU out there to extrapolate from. Even SAS/SATA RAID cards get some attention, not in a configuration worthy of even "medium" data, but enough for validating marketing specs.
attrs also has a feature that dataclasses currently lack [0]: an easy way to use __slots__ [1].
It cuts down on the per-instance memory overhead, for cases where you're creating a ton of these objects. It can be useful even when not memory-constrained, because it will throw AttributeError, rather than succeeding silently, if you make a typo when assigning to an object attribute.
PEP 412 makes __dict__s more memory efficient than they were before, but not more efficient than no __dict__, which is the point of __slots__. The following program demonstrates the difference. Note that it lowers the available address space to 1GB so that memory exhaustion occurs sooner, and thus only works on UNIX-like systems that provide the resource module.
    import resource
    import sys

    class WithoutSlots:
        def __init__(self, a, b):
            self.a = a
            self.b = b

    class WithSlots:
        __slots__ = ('a', 'b')

        def __init__(self, a, b):
            self.a = a
            self.b = b

    # Cap the address space at 1 GB so memory exhaustion happens quickly.
    resource.setrlimit(resource.RLIMIT_AS, (1024 ** 3, 1024 ** 3))

    cls = WithSlots if sys.argv[1:] == ['slots'] else WithoutSlots
    count, instances = 0, []

    # Create instances until allocation fails, then report how many fit.
    while True:
        try:
            instances.append(cls(1, 2))
        except MemoryError:
            break
    count = len(instances)
    del instances  # free the instances so the print below can allocate
    print(cls, count)
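And, since the typo-catching behavior was mentioned above, here's a quick follow-on check; it assumes the same two classes have already been defined as in the program above:

    p = WithSlots(1, 2)
    try:
        p.c = 3        # 'c' is not in __slots__, so this raises AttributeError
    except AttributeError as exc:
        print('caught:', exc)

    q = WithoutSlots(1, 2)
    q.c = 3            # silently creates a brand-new attribute
    print(q.c)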
That's a silly example. If you're making billions of integers, use NumPy. If it's just one pass, use a generator. If you're making lots of objects with more interesting attributes, the attribute storage will overwhelm the difference the instance dicts make.
My point was not that __slots__ does nothing, but that there are more important things to worry about.
Suppose I want to run algorithms on large arrays of 2D points while maximizing readability. I want to store the x and y coordinates using Python integers so I don't have to worry about overflow errors, but I expect that most of the time the numbers will be small and this is "just in case".
I claim that in this case, __slots__ is exactly the right thing to worry about.
It's hard for me to imagine that situation coming up, but yes, __slots__ does indeed have a purpose.
BTW, have you considered using the complex type to handle that for you? It's 2d and ints should be safe in float representation. If it overflows it'll crash nicely.
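For what it's worth, a tiny sketch of what that suggestion looks like:

    # Each 2D point becomes a single complex number: x is the real part,
    # y the imaginary part. Coordinates are stored as C doubles, so they
    # stay exact up to 2**53; an int too large for a float raises
    # OverflowError rather than silently wrapping.
    a = complex(3, 4)
    b = complex(1, -2)

    print(a + b)             # vector addition: (4+2j)
    print(abs(a))            # distance from the origin: 5.0
    print(a.real, a.imag)    # unpack back into x and y: 3.0 4.0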
Note that it's not either/or - you can dispatch work from an event loop to a thread pool (or a process pool) with loop.run_in_executor [0], while loop.call_soon_threadsafe [1] can be used by worker threads to add callbacks to the event loop.
This means that the "frontend" of a service can be asyncio, allowing it to support features like WebSockets that are non-trivial to support without aiohttp or a similar asyncio-native HTTP server [2], while the "backend" of the service can be multi-threaded or multi-process for CPU-bound work.
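A minimal sketch of that shape; the function names are made up, and the only real asyncio APIs used are loop.run_in_executor and loop.call_soon_threadsafe:

    import asyncio
    import concurrent.futures
    import threading

    def cpu_bound(n):
        # Placeholder for CPU-heavy work that would otherwise block the loop.
        return sum(i * i for i in range(n))

    def worker(loop):
        # A plain thread handing a callback back to the event loop.
        loop.call_soon_threadsafe(print, "hello from a worker thread")

    async def main():
        loop = asyncio.get_running_loop()

        # "Backend": push CPU-bound work onto a process pool; the await
        # keeps the event loop free to service other connections meanwhile.
        with concurrent.futures.ProcessPoolExecutor() as pool:
            result = await loop.run_in_executor(pool, cpu_bound, 10_000_000)
            print("cpu_bound result:", result)

        # Worker threads can schedule work back onto the loop safely.
        threading.Thread(target=worker, args=(loop,)).start()
        await asyncio.sleep(0.1)  # give the callback a chance to run

    if __name__ == "__main__":
        asyncio.run(main())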
If you want a middle ground between hand-written shell scripts and full-blown Kubernetes, we use Hashicorp's Nomad[0] on top of CoreOS at $dayjob and are quite happy with it.
Similar use case: self-hosted VMs for low-traffic internal tools, with no need for autoscaling.
I can't speak to how well it integrates with Gitlab's Auto DevOps, but Nomad integrates very well with Terraform[1] and I'd be surprised if there wasn't a way to plug Terraform into Gitlab's process.
The key difference between "classic" RDS and Aurora is that classic RDS really only automated the control plane. That is, RDS spins up an EC2 instance (or two, for multi-AZ) on your behalf, attaches an EBS volume of the appropriate specs, installs Postgres, sets up security and backups and replication etc.
Under classic RDS, when your application makes a SQL connection (the data plane) it's talking to a more or less stock Postgres instance, the same as you would have if you ran it locally.
Aurora, on the other hand, is involved in both the control plane and data plane. Your SQL connection is to a Postgres instance that's been forked/modified to work within Aurora.
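To make the control-plane/data-plane split concrete, a rough sketch; the instance identifier, credentials, and endpoint are placeholders, and it assumes boto3 and psycopg2 are installed:

    import boto3
    import psycopg2

    # Control plane: ask RDS to provision and manage the instance for you.
    rds = boto3.client("rds")
    rds.create_db_instance(
        DBInstanceIdentifier="example-db",        # placeholder
        DBInstanceClass="db.t3.medium",
        Engine="postgres",
        AllocatedStorage=100,
        MasterUsername="example_user",            # placeholder
        MasterUserPassword="change-me",           # placeholder
    )

    # Data plane: the application speaks the ordinary Postgres wire protocol
    # to the endpoint RDS hands back: stock Postgres on classic RDS, a
    # forked/modified engine behind the same protocol on Aurora.
    conn = psycopg2.connect(
        host="example-db.abc123.us-east-1.rds.amazonaws.com",  # placeholder
        dbname="postgres",
        user="example_user",
        password="change-me",
    )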
> what do you do about people moving to a community for its desirable character but killing that character in the process?
Here's what I think is the central (and flawed) assumption in this line of reasoning - people move to an area because of its "character". And that "character" is an intangible, immeasurable quality, but it is somehow diminished if more people move to the area.
I grew up in Seattle. Both of my grandparents, when I was a kid, lived in Seattle's Fremont neighborhood. I live in Fremont today. From one perspective, the Fremont of my childhood is completely changed. On the other hand, it's still Fremont, with the Center of the Universe sign and the statue of Lenin and many other things I remember from childhood. Does it have the same "character"? Does it have a newer, different, but just as good, "character"?
Those are impossible questions and it boils down to a Ship of Theseus style argument. Either way, I can't bring myself to assert that the housing supply of Fremont should be artificially constrained by zoning policies, in order to preserve my ideal of what Fremont "should be" or "used to be".
I admittedly come to this from a different angle. Many people move to my town for nature & recreation. Every new house & every new infill is less nature, less trails, less recreation. So paradoxically, by moving here, are we killing what we moved here for? Not a Ship of Theseus.
In our case redevelopment for density actually helps preserve that character. But I still feel like I can understand the Bay Area home owners.
We should collectively redefine what we are trying to preserve. You recognize that increasing density allows more people to live there without encroaching on the wilderness. As a bonus, increasing density also makes walkable neighborhoods more viable, so more people can live without cars.
But many voters believe what we should preserve is the single-family home, built environment that some developer created long ago. Then the number of people per unit of land is restricted: homes near economic activity become playthings of the rich, and any new home that is affordable is taking away wildlife habitat and farmland.
In short, their stance is understandable, but it is sociopathic.
> But many voters believe what we should preserve is the single-family home, built environment
Do you believe this is an honest characterization of their core goal? Is the opposition's number one goal simply to oppose multifamily property? Like, "God ordained that no two families should live in a single structure"? Or is it about property valuation changes, or building height, or street parking, or land use, or decreasing number of (semi-permanent) owners and increasing number of (temporary) renters, or...?
> But many voters believe what we should preserve is the single-family home, built environment
> Do you believe this is an honest characterization of their core goal?
Yes. I can quote Rothstein about racist motivations[0] or Marohn about short-sighted financial recklessness,[1] but I believe more people have nostalgia than malice. Even if they deploy structural racism and racist rhetoric.
Most people become set in their ways very quickly, and have difficulty imagining what is good other than what they thought was good when they were young. By now, you cannot find a native-born American who grew up in a time before cars became supreme. Most Americans don’t even remember a time before the Suburban Experiment.[2]
So, yes, people will bring up building heights, and respecting the neighborhood, and traffic, and parking über alles, but I think the main motivation is that they can’t imagine someone else can have a good life that is a benefit to the community other than the life that they think is a good life.
Replacing low density with high density shouldn't have to touch outdoor/public spaces. I live in Denver and people from around my state complain about this all the time. Perfectly possible to keep (or even expand) recreation spaces if we allow more density. Might be better to argue about increased population use of same public recreation resources (crowded trails) but that's the same selfish complaints as this whole thread - it comes down to why some feel because they have it already that they should be able to exclude others from having it in the future.
> Replacing low density with high density shouldn't have to touch outdoor/public spaces [...] Perfectly possible to keep (or even expand) recreation spaces if we allow more density.
Right, I'm pretty sure I specifically acknowledged this in my previous comment.
> why some feel because they have it already that they should be able to exclude others from having it in the future
If some hypothetical resource has a determinate carrying capacity and any greater usage degrades the resource for everyone, it's not unreasonable to exclude people. See the fixed number of backcountry permits Yosemite issues. Some things simply cannot be had by everyone. Given this, how do you decide who gets, and who does not get?
We really only have three systems that I know of: 1) lottery, a la Yosemite permits; 2) free market, a la Bay Area low-density housing, also ski condos; 3) precedent/I-was-there-first, a la people who already own a house there get to stay as it becomes desirable, also prescriptive easements of public trails on private land.
At that point you're not writing JSON though. Once you start bolting on non-standard bells and whistles, why not recognize that JSON was never meant to be used for config files, and switch to something that was, like TOML [0]?
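A quick sketch, assuming Python 3.11+ for the stdlib tomllib module; the config keys are made up for illustration, and note that comments, a common reason people extend JSON, are simply part of TOML:

    import tomllib

    config_text = """
    # Comments are part of the TOML spec, unlike JSON.
    [server]
    host = "127.0.0.1"
    port = 8080

    [logging]
    level = "info"
    """

    config = tomllib.loads(config_text)
    print(config["server"]["port"])   # 8080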
I'm a very happy fish user but this is one of my pain points as well. If you want to define a function, the syntax is light-years ahead of bash, including named arguments (and closures!), but it took a fair bit of googling and eyebrow-wrinkling before I could figure out just how to set a default argument for that function.
What I ended up with was (for a shortcut for generating a password on the command line):
    function pw --argument-names length
        # Default $length to 16 when no argument is given.
        test -z $length; and set length 16
        python3.6 -c "import secrets; print(secrets.token_urlsafe($length))"
    end