The version I've heard is that small data fits on an average developer workstation, medium data fits on a commodity 2U server, and "big data" needs a bigger footprint than that single commodity server offers.
I like that better than bringing racks into it, because once you have multiple machines in a rack you've got distributed systems problems, and there's a significant overlap between "big data" and the problems that a distributed system introduces.
It's frustrated me for the better part of a decade that the misconception persists that "big data" begins after 2U. It's as if we're all still living during the dot-com boom and the only way to scale is buying more "pizza boxes".
Single-server setups larger than 2U but (usually) smaller than 1 rack can give tremendous bang for the buck, no matter if your "bang" is peak throughput or total storage. (And, no, I don't mean spending inordinate amounts on brand-name "SAN" gear).
There's even another category of server, arguably non-commodity, where paying roughly a 2x price premium (on the server itself, not the storage) can quadruple the CPU and RAM capacity, if not the I/O throughput, of the cheaper version.
I think the ignorance of what hardware capabilities are actually out there ended up driving well-intentioned (usually software) engineers to choose distributed systems solutions, with all their ensuing complexity.
Today, part of the driver is how few underlying hardware choices one has from "cloud" providers and how anemic the I/O performance is.
It's sad, really, since SSDs have so greatly reduced the penalty for data not fitting in RAM (while still being local). The penalty for being at the end of an ethernet, however, can be far greater than that of a spinning disk.
That's a good point, I suppose it'd be better to frame it as what you can run on a $1k workstation vs. a $10k rackmount server, or something along those lines.
As a software engineer who builds their own desktops (and has for the last 10 years) but mostly works with AWS instances at $dayjob, are there any resources you'd recommend for learning about what's available in the land of that higher-end rackmount equipment? Short of going full homelab, tripling my power bill, and heating my apartment up to 30C, I mean...
> I suppose it'd be better to frame it as what you can run on a $1k workstation vs. a $10k rackmount server, or something along those lines.
That's probably better, since it'll scale a bit better with technological improvements. The problem is, it doesn't have quite the clever sound to it, especially with the numbers and dollars.
Now, the other main problem is that, though the cost of a workstation is fairly well-bounded, the cost of that medium-data server can actually vary quite widely, depending on what you need to do with that data (or, I suppose, how long you might want to retain data you don't happen to be doing anything to right at that moment).
I suppose that's part of my point: there's a mis-perception that, because a single server (including its attached storage) can be so expensive, to the tune of many tens of thousands of (US) dollars, that somehow makes it "big" and undesirable, despite its potentially close-to-linear price-to-performance curve compared to those small 1U/2U servers. Never mind doing any reasoned analysis of whether going farther up the single-server capacity/performance axis, where the price curve gets steeper, is worth it compared to the cost and complexity of a distributed solution.
> are there any resources you'd recommend for learning about what's available in the land of that higher-end rackmount equipment?
Sadly, no great tutorials or blogs that I know of. However, I'd recommend taking a look at SuperMicro's complete-server products, primarily because, for most of them, you can find credible barebones pricing with a web search. I expect you already know how to account for other components (primarily of concern for the mobos that take only exotic CPUs).
As I alluded to in another comment, you might also look into SAS expanders (conveniently also well integrated into some, but far from all, SuperMicro chassis backplanes) and RAID/HBA cards for the direct-attached (but still external) storage.
Well, do notice I did say the penalty "can be", not "always is", far greater.
That's primarily because I'm aware of the variability that random access injects into spinning disk performance and that 10GE is now common enough that it takes more than just a single (sequentially accessed) spinning disk to saturate a server's NIC.
Plus, if you're talking about a (single) local spinning disk, I'd argue that's a trivial/degenerate case, especially if compared to a more expensive SSD. Does my assertion stand up better if it had "of comparable cost" tacked on? Otherwise, the choice doesn't make much sense, since a local SSD is the obvious choice.
My overall point is that, though one particular workload may make a certain technology/configuration appear superior to another [1], in the general case, or, perhaps most importantly, in the high-performance case, it pays to keep an eye on the bottlenecks, especially the ones that carry a high incremental cost of increasing their capacity.
It may be that people think the network, even 10GE now, is too cheap to be one of those bottlenecks, arguably a form of fallacy [2] number 7, but that ignores the question of aggregate (e.g. inter-switch) traffic. 40G and 100G ports can get pricey, and, at 4x and 10x of a single server port, they're far from solving fallacy number 3 at the network layer.
The other tendency I see is for people not to realize just how expensive a "server" is, by which I mean the minimum cost, before any CPUs or memory or storage. It's about $1k. The fancy, modern, distributed system designed on 40 "inexpensive" servers is already spending $40k just on chassis, motherboards, and PSUs. If the system didn't really need all 80 CPU sockets and all those DIMM sockets, it was money down the drain. What's worse, since the servers had to be "cheap", they were cargo-cult sized at 2U with low-end backplanes, severely limiting I/O performance. Then, to expand I/O performance, more of the same servers [3] are added, not because CPU or memory is needed, but because disk slots are needed, and another $4k is spent to add capacity for 2-4 disks.
[1] This has been done on purpose for "competitive" benchmarks since forever
[3] Consistency in hardware is generally something I like, for supportability, but it's essentially impossible anyway, given the speed of computer product changes/refreshes, which is why I think it's also foolish not to re-evaluate when it's capacity-adding time 6-9 months later.
Actually my example is far simpler and less interesting.
Having a console devkit read unordered file data from a local disk ends up being slower than reading the same data over a plain gigabit network connection from an SSD in a developer's machine.
It simply comes down to the random access patterns and seek latency of the spinning disk versus the excellent random access capabilities of an SSD.
Note this is quite unoptimised reading of the data.
At a theoretical level, as a sysadmin, I learned the capabilities by reading datasheets for CPUs, motherboards (historically, also chipsets, bridge boards, and the like, but those are much less relevant now), and storage products (HBAs/RAID cards, SAS expander chips, HDDs, SSDs). Make sure you're always aware of the actual payload bandwidth (net of overhead), the actual units (base 2 or base 10), and duplex considerations (e.g. SATA).
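As a rough illustration of what "net of overhead" means, here's a quick back-of-the-envelope sketch; the encoding-overhead figures are the commonly cited ones for SATA 3 and 10GbE, and protocol framing on top of them eats a bit more:

    # Payload bandwidth net of line-encoding overhead only (decimal MB/s).

    # SATA 3: 6 Gbit/s line rate with 8b/10b encoding (20% overhead).
    sata3_payload_MBps = 6.0e9 * (8 / 10) / 8 / 1e6
    print(f"SATA 3 payload: ~{sata3_payload_MBps:.0f} MB/s")     # ~600 MB/s

    # 10GbE (10GBASE-R): 10.3125 Gbaud line rate with 64b/66b encoding.
    tenge_payload_MBps = 10.3125e9 * (64 / 66) / 8 / 1e6
    print(f"10GbE payload: ~{tenge_payload_MBps:.0f} MB/s")      # ~1250 MB/s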
From a more practical level, I look at various vendors' actual products, since it doesn't matter (for example) if a CPU series can support 8 sockets if the only mobos out there are 2- and 4-socket.
I also look at whatever benchmarks are out there to determine if claimed performance numbers are credible. This is where even enthusiast-targeted benchmark sites can sometimes be helpful, since there's often a close-enough (if not identical) desktop version of a server CPU out there to extrapolate from. Even SAS/SATA RAID cards get some attention, not in a configuration worthy of even "medium" data, but enough for validating marketing specs.
attrs also has a feature that dataclasses currently lack [0]: an easy way to use __slots__ [1].
It cuts down on the per-instance memory overhead, for cases where you're creating a ton of these objects. It can be useful even when not memory-constrained, because it will throw AttributeError, rather than succeeding silently, if you make a typo when assigning to an object attribute.
PEP 412 makes __dict__s more memory efficient than they were before, but not more efficient than no __dict__, which is the point of __slots__. The following program demonstrates the difference. Note that it lowers the available address space to 1GB so that memory exhaustion occurs sooner, and thus only works on UNIX-like systems that provide the resource module.
    import resource
    import sys

    class WithoutSlots:
        def __init__(self, a, b):
            self.a = a
            self.b = b

    class WithSlots:
        __slots__ = ('a', 'b')

        def __init__(self, a, b):
            self.a = a
            self.b = b

    # Cap the address space at 1 GB so memory exhaustion happens quickly.
    resource.setrlimit(resource.RLIMIT_AS, (1024 ** 3, 1024 ** 3))

    cls = WithSlots if sys.argv[1:] == ['slots'] else WithoutSlots
    count, instances = 0, []

    # Create instances until allocation fails, then report how many fit.
    while True:
        try:
            instances.append(cls(1, 2))
        except MemoryError:
            break
    count = len(instances)
    del instances  # free the instances so the print below can allocate
    print(cls, count)
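And, since the typo-catching behavior was mentioned above, here's a quick follow-on check; it assumes the same two classes have already been defined as in the program above:

    p = WithSlots(1, 2)
    try:
        p.c = 3        # 'c' is not in __slots__, so this raises AttributeError
    except AttributeError as exc:
        print('caught:', exc)

    q = WithoutSlots(1, 2)
    q.c = 3            # silently creates a brand-new attribute
    print(q.c)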
That's a silly example. If you're making billions of integers, use NumPy. If it's just one pass, use a generator. If you're making lots of objects with more interesting attributes, the attribute storage will overwhelm the difference the instance dicts make.
My point was not that __slots__ does nothing, but that there are more important things to worry about.
Suppose I want to run algorithms on large arrays of 2D points while maximizing readability. I want to store the x and y coordinates using Python integers so I don't have to worry about overflow errors, but I expect that most of the time the numbers will be small and this is "just in case".
I claim that in this case, __slots__ is exactly the right thing to worry about.
It's hard for me to imagine that situation coming up, but yes, __slots__ does indeed have a purpose.
BTW, have you considered using the complex type to handle that for you? It's 2d and ints should be safe in float representation. If it overflows it'll crash nicely.
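For what it's worth, a tiny sketch of what that suggestion looks like:

    # Each 2D point becomes a single complex number: x is the real part,
    # y the imaginary part. Coordinates are stored as C doubles, so they
    # stay exact up to 2**53; an int too large for a float raises
    # OverflowError rather than silently wrapping.
    a = complex(3, 4)
    b = complex(1, -2)

    print(a + b)             # vector addition: (4+2j)
    print(abs(a))            # distance from the origin: 5.0
    print(a.real, a.imag)    # unpack back into x and y: 3.0 4.0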
Note that it's not either/or - you can dispatch work from an event loop to a thread pool (or a process pool) with loop.run_in_executor [0], while loop.call_soon_threadsafe [1] can be used by worker threads to add callbacks to the event loop.
This means that the "frontend" of a service can be asyncio, allowing it to support features like WebSockets that are non-trivial to support without aiohttp or a similar asyncio-native HTTP server [2], while the "backend" of the service can be multi-threaded or multi-process for CPU-bound work.
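A minimal sketch of that shape; the function names are made up, and the only real asyncio APIs used are loop.run_in_executor and loop.call_soon_threadsafe:

    import asyncio
    import concurrent.futures
    import threading

    def cpu_bound(n):
        # Placeholder for CPU-heavy work that would otherwise block the loop.
        return sum(i * i for i in range(n))

    def worker(loop):
        # A plain thread handing a callback back to the event loop.
        loop.call_soon_threadsafe(print, "hello from a worker thread")

    async def main():
        loop = asyncio.get_running_loop()

        # "Backend": push CPU-bound work onto a process pool; the await
        # keeps the event loop free to service other connections meanwhile.
        with concurrent.futures.ProcessPoolExecutor() as pool:
            result = await loop.run_in_executor(pool, cpu_bound, 10_000_000)
            print("cpu_bound result:", result)

        # Worker threads can schedule work back onto the loop safely.
        threading.Thread(target=worker, args=(loop,)).start()
        await asyncio.sleep(0.1)  # give the callback a chance to run

    if __name__ == "__main__":
        asyncio.run(main())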
If you want a middle ground between hand-written shell scripts and full-blown Kubernetes, we use Hashicorp's Nomad[0] on top of CoreOS at $dayjob and are quite happy with it.
Similar use case: self-hosted VMs for low-traffic internal tools, with no need for autoscaling.
I can't speak to how well it integrates with Gitlab's Auto DevOps, but Nomad integrates very well with Terraform[1] and I'd be surprised if there wasn't a way to plug Terraform into Gitlab's process.
The key difference between "classic" RDS and Aurora is that classic RDS really only automated the control plane. That is, RDS spins up an EC2 instance (or two, for multi-AZ) on your behalf, attaches an EBS volume of the appropriate specs, installs Postgres, sets up security and backups and replication etc.
Under classic RDS, when your application makes a SQL connection (the data plane) it's talking to a more or less stock Postgres instance, the same as you would have if you ran it locally.
Aurora, on the other hand, is involved in both the control plane and data plane. Your SQL connection is to a Postgres instance that's been forked/modified to work within Aurora.
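To make the control-plane/data-plane split concrete, a rough sketch; the instance identifier, credentials, and endpoint are placeholders, and it assumes boto3 and psycopg2 are installed:

    import boto3
    import psycopg2

    # Control plane: ask RDS to provision and manage the instance for you.
    rds = boto3.client("rds")
    rds.create_db_instance(
        DBInstanceIdentifier="example-db",        # placeholder
        DBInstanceClass="db.t3.medium",
        Engine="postgres",
        AllocatedStorage=100,
        MasterUsername="example_user",            # placeholder
        MasterUserPassword="change-me",           # placeholder
    )

    # Data plane: the application speaks the ordinary Postgres wire protocol
    # to the endpoint RDS hands back: stock Postgres on classic RDS, a
    # forked/modified engine behind the same protocol on Aurora.
    conn = psycopg2.connect(
        host="example-db.abc123.us-east-1.rds.amazonaws.com",  # placeholder
        dbname="postgres",
        user="example_user",
        password="change-me",
    )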
> what do you do about people moving to a community for its desirable character but killing that character in the process?
Here's what I think is the central (and flawed) assumption in this line of reasoning - people move to an area because of its "character". And that "character" is an intangible, immeasurable quality, but it is somehow diminished if more people move to the area.
I grew up in Seattle. Both of my grandparents, when I was a kid, lived in Seattle's Fremont neighborhood. I live in Fremont today. From one perspective, the Fremont of my childhood is completely changed. On the other hand, it's still Fremont, with the Center of the Universe sign and the statue of Lenin and many other things I remember from childhood. Does it have the same "character"? Does it have a newer, different, but just as good, "character"?
Those are impossible questions and it boils down to a Ship of Theseus style argument. Either way, I can't bring myself to assert that the housing supply of Fremont should be artificially constrained by zoning policies, in order to preserve my ideal of what Fremont "should be" or "used to be".
I admittedly come to this from a different angle. Many people move to my town for nature & recreation. Every new house & every new infill is less nature, less trails, less recreation. So paradoxically, by moving here, are we killing what we moved here for? Not a Ship of Theseus.
In our case redevelopment for density actually helps preserve that character. But I still feel like I can understand the Bay Area home owners.
We should collectively redefine what we are trying to preserve. You recognize that increasing density allows more people to live there without encroaching on the wilderness. As a bonus, increasing density also makes walkable neighborhoods more viable, so more people can live without cars.
But many voters believe what we should preserve is the single-family home, built environment that some developer created long ago. Then the number of people per unit of land is restricted: homes near economic activity become playthings of the rich, and any new home that is affordable is taking away wildlife habitat and farmland.
In short, their stance is understandable, but it is sociopathic.
> But many voters believe what we should preserve is the single-family home, built environment
Do you believe this is an honest characterization of their core goal? Is the opposition's number one goal simply to oppose multifamily property? Like, "God ordained that no two families should live in a single structure"? Or is it about property valuation changes, or building height, or street parking, or land use, or decreasing number of (semi-permanent) owners and increasing number of (temporary) renters, or...?
> But many voters believe what we should preserve is the single-family home, built environment
> Do you believe this is an honest characterization of their core goal?
Yes. I can quote Rothstein about racist motivations[0] or Marohn about short-sighted financial recklessness,[1] but I believe more people have nostalgia than malice. Even if they deploy structural racism and racist rhetoric.
Most people become set in their ways very quickly, and have difficulty imagining what is good other than what they thought was good when they were young. By now, you cannot find a native-born American who grew up in a time before cars became supreme. Most Americans don’t even remember a time before the Suburban Experiment.[2]
So, yes, people will bring up building heights, and respecting the neighborhood, and traffic, and parking über alles, but I think the main motivation is that they can’t imagine someone else can have a good life that is a benefit to the community other than the life that they think is a good life.
Replacing low density with high density shouldn't have to touch outdoor/public spaces. I live in Denver and people from around my state complain about this all the time. Perfectly possible to keep (or even expand) recreation spaces if we allow more density. Might be better to argue about increased population use of same public recreation resources (crowded trails) but that's the same selfish complaints as this whole thread - it comes down to why some feel because they have it already that they should be able to exclude others from having it in the future.
> Replacing low density with high density shouldn't have to touch outdoor/public spaces [...] Perfectly possible to keep (or even expand) recreation spaces if we allow more density.
Right, I'm pretty sure I specifically acknowledged this in my previous comment.
> why some feel because they have it already that they should be able to exclude others from having it in the future
If some hypothetical resource has a determinate carrying capacity and any greater usage degrades the resource for everyone, it's not unreasonable to exclude people. See the fixed number of backcountry permits Yosemite issues. Some things simply cannot be had by everyone. Given this, how do you decide who gets, and who does not get?
We really only have three systems that I know of: 1) lottery, a la Yosemite permits; 2) free market, a la Bay Area low-density housing, also ski condos; 3) precedent/I-was-there-first, a la people who already own a house there get to stay as it becomes desirable, also prescriptive easements of public trails on private land.
At that point you're not writing JSON though. Once you start bolting on non-standard bells and whistles, why not recognize that JSON was never meant to be used for config files, and switch to something that was, like TOML [0]?
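A quick sketch, assuming Python 3.11+ for the stdlib tomllib module; the config keys are made up for illustration, and note that comments, a common reason people extend JSON, are simply part of TOML:

    import tomllib

    config_text = """
    # Comments are part of the TOML spec, unlike JSON.
    [server]
    host = "127.0.0.1"
    port = 8080

    [logging]
    level = "info"
    """

    config = tomllib.loads(config_text)
    print(config["server"]["port"])   # 8080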
I'm a very happy fish user but this is one of my pain points as well. If you want to define a function, the syntax is light-years ahead of bash, including named arguments (and closures!), but it took a fair bit of googling and eyebrow-wrinkling before I could figure out just how to set a default argument for that function.
What I ended up with was (for a shortcut for generating a password on the command line):
    function pw --argument-names length
        # Default $length to 16 when no argument is given.
        test -z $length; and set length 16
        python3.6 -c "import secrets; print(secrets.token_urlsafe($length))"
    end