We Don’t Use Docker | Hacker News

April 1, 2021

Along with this, vertical scaling is severely underrated. You can do a lot, possibly everything your company will ever need, with vertical scaling. That applies to 99% of companies, or even more.

Edit: Since people are confused, here is how StackOverflow handles all of its web operations. If SO can run with this, so can your 0.33 req/minute app which is mostly doomed to failure. I am only half joking.

StackOverflow architecture, current load (it will surprise you): https://stackexchange.com/performance

Every time you go to SO, it hits one of these 9 web servers, and all data on SO sits on those 2 massive SQL servers. That’s pretty amazing.

I want to be clear though: horizontal scaling has a place in companies big enough to have a team of corporate lawyers, and in many, many other scenarios for ETL and backend microservices.

Several years ago, I was chatting with another engineer from a close competitor. He told me about how they’d set up a system to run hundreds of data processing jobs a day over a dozen machines, using Docker, load balancing, and a bunch of AWS stuff. I knew these jobs very well; they were basically identical for any company in the space.

He then mentioned that he’d noticed that somehow my employer had been processing thousands of jobs, much faster than his, and asked how many machines we were using.

I didn’t have the heart to tell him we were running everything manually on my two-year-old MacBook Air.

F- me. I love this. This is a really important message.

It’s like Jonathan Blow asks: why does Photoshop take longer to load today than it did in the 90s despite the (insane) advances in hardware?

I believe it’s due to a bunch of things, but over-complicating the entire process is one of the big issues. If people (developers/engineers) would only sit back and realise just how much computing power they have available to them, and realise that if they kept things simple and efficient, they could build blazing fast solutions overnight.

I cringe thinking about the wasted opportunities out there.

Jonathan Blow’s Preventing the Collapse of Civilization [0] is an excellent talk on how too many abstractions destroy knowledge.

This is a problem with the software industry today: we have forgotten how to do the simple stuff that works and is robust.

[0]: https://www.youtube.com/watch?v=ZSRHeXYDLko

Compare Photoshop between the 90s and now: 10x the features, and the size of photos has grown exponentially as well.
I see the point you’re trying to make; however, the increase in features (and their complexity) plus the size of the average graphic a high-end professional works with has maybe grown by 300-500% since the 90s. In fact I’ll tell you what: I’ll give you a growth of 10,000% in file sizes and feature complexity since the 90s…

… computational power has grown ~259,900% since the 90s.

The point being made is this: Photoshop does one job and has one focus (or should), yet it has gotten slower at doing that one job, not faster. Optimising the code AND the incredible hardware being introduced to the consumer market should see Photoshop loading in milliseconds, in my opinion.

>The point being made is this: Photoshop does one job and has one focus (or should), yet it has gotten slower at doing that one job, not faster.

Has it though? Without measurements this is just idle talk.

And I used Photoshop in the 90s and I use it today occasionally. I remember having 90s-sized web pictures (say, 1024×768) and waiting tens of seconds for a filter to be applied – which I get instantly today with 24MP and more…

And if we’re into idle talk I’ve always found Photoshop faster in large images and projects than competitors, including “lightweight” ones.

It’s hella more optimized than them.

In any case, applying some image filter (one that takes, say, 20 seconds vs 1 minute in the competitors) just calls some optimized C++ code (perhaps with some asm thrown in) that does just that.

The rest of the “bloat” (in the UI, feature count, etc) has absolutely no bearing as to whether a filter or an operation (like blend, crop, etc) runs fast or not. At worst, it makes going around in the UI to select the operations you want slower.

And in many cases the code implementing a basic operation, filter, etc, hasn’t even been changed since 2000 or so (if anything, it was optimized further, taken to use the GPU, etc).

I recall my dad requiring overnight sessions to have Photoshop render a particular filter on his Pentium 166MHz. That could easily take upwards of an hour, and a factor 10 more for the final edit. He’d be working on one photo for a week.
To me it feels as though for the last decade and a half computational power has not grown vertically. Instead Intel and AMD have grown computational power horizontally (i.e. adding more cores). I’m looking at the difference the M1 has made to compute performance as a sign that x86 strayed.
It has also grown substantially vertically: single-core speeds keep going up (about 10x from a decade and a half ago), even as core count increases. (and M1 is not substantially faster than the top x86 cores, the remarkable thing is how power efficient it is at those speeds).
Clock speed != single-threaded performance. Clock speeds plateaued a long time ago; single-threaded performance is still improving exponentially (by being able to execute multiple instructions in an instruction stream in parallel, as well as execute the same instructions in fewer clock cycles), though the exponent approximately halved around 2004 (if the trend had continued we would be at about a 100-500x improvement by now).

https://github.com/karlrupp/microprocessor-trend-data

Hard to say it’s still “exponential”… what do you think the current doubling period is?

Here’s the single thread raw data from that repo. If you take into account clock speed increase (which, as you agree, have plateaued) we’re looking at maybe a 2x increase in instructions per clock for conventional int (not vectorized) workloads.

Is there even another 2x IPC increase possible? At any time scale?

https://github.com/karlrupp/microprocessor-trend-data/blob/m…

And somehow Photopea is fast, free and in browser and suffices for 85% of whatever people do in Adobe Photoshop.
Those people for whom Photopea is fast and suffices didn’t need Photoshop in the first place.
Opening PSDs is a big reason. A lot of designers will send PSDs and usually you also want them to check layers, extract backgrounds etc.
Back in the day I was using Intel Celeron 333 MHz and then AMD Duron 800 MHz.

I did not know how to use Winamp playlists because Winamp was “an instant” app for me; I just clicked on a song and it played within milliseconds. That was my flow of using Winamp for years. This did not change between the Celeron and the Duron; the thing was instant on both.

Then Winamp 3 came out and I had to use playlists because a song, once clicked, took a good second or two to start playing. Winamp 5 from 2018 still starts slower than my beloved 2.73* did 20 years ago on a Celeron 333 with a 5400 RPM HDD and 256 MB of RAM. I think even the good old Winamp 2.x is not as fast as it was on Windows 98/XP.

Something went wrong.

* not sure if it was 2.73, but I think so

Note: I realise Winamp 3 was crappy as hell, but still…

This is why I chose and stuck with CoolPlayer at the time (before I converted to Linux): no install, so light, so fast, and it had everything I needed. I loved it when I could find such an elegant and versatile app. I don’t need to get “more” every 6-12 months.

I learned that every time you gain something, you also lose something without realizing it, because you take it for granted.

Winamp 2.73 is still my default player and works on Win10 Pro x64 build $LATEST. I will never change. It is 2MB of pure gold.
This is sad to read 🙁

I remember having a 486 myself and everything being so damn snappy. Sigh.

Are you me? Perhaps a little bit later on and a different set of signifiers (foobar2000/xp/celeron 1.7) but the same idea. Things were so much snappier back then than on my previously-SOTA MBPR 2019. Sigh.
I was at a graphic design tradeshow back in the mid 90’s and there was a guy there demonstrating this Photoshop alternative called Live Picture or Live Photos or something like that. And he had a somewhat large, at the time, print image on the screen, probably 16mb or so, and was zooming in and out and resizing the window and it was redrawing almost instantly.

This was AMAZING.

Photoshop at the time would take many many seconds to zoom in/out.

One person in the group asked, “Yeah, but how much memory is in that machine?”

The guy hemmed and hawed a bit and finally said “It’s got a bit, but not a lot, just 24mb.”

“Yeah, well that explains it.” Nobody had 24mb of RAM at that time. Our “big” machine had 16mb.

Live Picture was the first to use what I think are called image proxies, where you can have an arbitrarily large image and only work with the screen-image sized image. Once you have applied all the changes and click save, it will then grind through the full size image if needed.

A feature that Photoshop has since added, but it appeared in Live Picture first.

Yup, that must have been it. I think Adobe implemented that as a feature into Photoshop within a year of us seeing that other software.
This might be a long shot, but did the demo include zooming in on a picture of a skull to show a message written on one of the teeth? If so, I’ve been trying to find a video of it for years.
No, it was a photo of a building. The “demo” was the guy right there playing with the software, no video.
Perhaps, but the size of photos shouldn’t affect load times when starting the app (and in my opinion, neither should most features, but that depends on your architecture I suppose).
Yeah, Jonathan Blow isn’t exactly a luminary in computer science. I once read him going on a meltdown over Linux ports because “programming is hard”. This is the kind of minds Apple enable, i.e. “why isn’t this easy?”
The entire point of computers is to make things easy.

The Apple question “why isn’t this easy” is missing from 95% of UX in modern software. Stop acting as if software devs are the only users of software. And even then: computers should do work and not produce work or mental overhead.

> The Apple question “why isn’t this easy” is missing from 95% of UX in modern software.

I switched to a MacBook at work a few months ago, and it’s been an epic of frustration and Googling: e.g.,

1. I set the keyboard layout to UK PC, but it kept switching back to default. (I was accidentally mashing a keyboard shortcut designed to do just that. I’ve since disabled it.)

2. Clicking on a link or other web-page element will, occasionally and apparently randomly, take me back a page or two in my history rather than opening the link or activating the element. (At least in Chrome: I’ve not used another browser on macOS yet.)

3. Command-tabbing to a minimised application does not, as one would naively expect, automatically unminimise it. Instead I’m left staring at an apparently unchanged screen.

4. If I open Finder in a given folder, there is, as best I can tell, no easy way to navigate to that folder’s parent.

Now arguably #1 was the fault of my own ignorance (though some kind of obvious feedback as to what was happening and why would have been nice), and #2 may be down to Google rather than Apple.

But #3 and #4 are plain bad design, bordering on user-hostile.

So far I’m not seeing that Apple’s reputation for superior UX is justified, at least not in laptops.

Window switching in OS X is so unintuitive it drives me MAD when you are remoting into a Mac.

Finder is just god awful and does all it can to obscure your actual filesystem location, but Go to Folder (shift+cmd+g) can get you where you need to go.

The Apple reputation was absolutely justified back in the OS9 era, and the early iPhone as well. However both OS X and iOS7 and beyond were huge steps backwards in usability.

At this point I think Apple still deserves their reputation for superior UX, however that’s a result of how epically bad Google, MS, and Facebook are at UX, not Apple doing a great job like they used to.

for 4. …as best I can tell, no easy way to navigate to that folder’s parent

You can add a button to the toolbar in Finder (customize it from options) that when dropped down will show the complete path to the current folder as a list. You can use that to move up the tree.

Are you implying Apple software and devices are “easy”?

I think that’s marketing, and not borne out of real experience.

Try setting up a new Apple device, there is very little in IT that is as frustrating and confusing.

>Try setting up a new Apple device, there is very little in IT that is as frustrating and confusing.

It’s a 5 minute process, and there’s very little to it that it’s not optimized for ease. Any champ can do it, and millions do.

Not sure what you’re on about.

>I think that’s marketing, and not borne out of real experience.

Yes, all those people are deluded.

Clearly you’ve never tried it, because it’s certainly not 5 minutes, it’s optimized for selling you bullshit Apple services and it’s buggy as hell, with no feedback to the user why everything is broken and why you’re having to re-authenticate five times in a row.

And good luck if you’re setting up a family account for several devices with different iOS versions. You’re gonna really need it.

>Clearly you’ve never tried it, because it’s certainly not 5 minutes, it’s optimized for selling you bullshit Apple services and it’s buggy as hell

I’ve tried it tons of times, have had over 20 iOS/macOS devices over the years, and for some perverse reason, on macOS/OS X I like to install every major update on a clean disk too (and then re-import my data; it’s an old Windows 95/XP-era reflex), so I do it at least once every year for my main driver (plus different new iOS devices).

What part of this sounds difficult?

https://www.youtube.com/watch?v=70u0x8Kf6j4

And the whole “optimized for selling you bullshit Apple services” is a couple of screens you can skip with one click — and you might want to legitimately use too.

Honestly, literally millions of people do this every year, and for most of them, it’s like 10 minutes, plus the time waiting for the iCloud restore. Even my dad was able to set up his new iPad, and he’s as technophobic as it gets.
Technophobic people are exactly the target audience that has a huge tolerance for broken software built on piles of abusive bullshit.
My father has very little patience for broken things, software or otherwise, so I’m really not sure what you’re talking about.
Watching the video that coldtea posted, no, it is not. Ubuntu and most of its derivatives have very easy installers, and take a fraction of the time. The video didn’t even include the time it would take to read and understand all the terms and conditions!
I don’t know what you mean by this- care to elaborate? I have several fully working computers running Ubuntu derivatives without having to do anything after the install.
I currently have two laptops, one a Lenovo from work with Kubuntu and the other cheap Asus with KDE Neon. Both required no additional work to be fully working after install.
>This is the kind of minds Apple enable, i.e. “why isn’t this easy?”

So the kind of minds we want?

“Why isn’t this easy?” should be the bread and butter question of a programmer…

Except for problems that are hard. I wholeheartedly disagree; Blow has shown many times he does not have the right mindset.
> This is the kind of minds Apple enable, i.e. “why isn’t this easy?”

I dunno, personally that’s why I’ve used Apple products for the past decade, and I think it’s also maybe part of why they have a 2T market cap, and are the most liquid publicly traded stock in the world?

So making simplistic products that treat users as dumb is profitable, yeah, I agree with that.
Believe it or not, lots of people have more important things to do with their computers than dick with them to make shit work.

Most people don’t really know how their car works, and shouldn’t have to. Same goes here.

Where’s your evidence for that? The argument that it “treats users as dumb” and that doing so is “profitable” is oft trotted out, but I never see any substantiation for it. Plenty of companies do that. What’s so special about Apple, then? I mean, it’s gotta be something.

You gotta be careful about these arguments. They often have a slippery slope to a superiority complex (of “leet” *NIX users over the unwashed “proles”) hiding deep within.

Needlessly complex or powerful user interfaces aren’t necessarily good. They were quite commonplace before Apple. Apple understood the value of minimalism, of cutting away interaction noise until there’s nothing left to subtract. Aesthetically speaking, this approach has a long, storied history with respect to mechanical design. It’s successful because it works.

What Apple understood really acutely and mastered is human interface design. They perfected making human-centric interfaces that look and feel like fluid prosthetic extensions of one’s body, rather than computational interfaces where power is achieved by anchoring one’s tasks around the machine for maximal efficiency. Briefly, they understood intuition. Are you arguing that intuition is somehow worse than mastering arcana, simply because you’ve done the latter?

Now, I’m not going to say that one is better than the other. I love my command line vim workflow dearly, and you’ll have to pry my keyboard out of my cold dead hands. But there’s definitely the idea of “right tool for the right job” that you might be sweeping by here. Remember, simplicity is just as much a hallmark of the cherished *NIX tools you probably know and love; it’s where they derive their power. Be careful of surface-level dismissals (visual interfaces versus textual) that come from tasting it in a different flavor. You might miss the forest for the trees!

It’s easy to take a stab at someone online, behind a keyboard, but I’d suggest you show us all your work and we’ll judge your future opinions based on it.
By no metric am I comparatively as successful as the guy, but I am still able to disagree with his point that Linux is held back by its tools. The fact that he did not want to, or did not have the time to, learn Linux’s tooling doesn’t mean anything in particular, except that he’s either very busy or very lazy. In any interview I read with him he’s just ranting and crying over this or that “too complex” matter. If he does not want to deal with the complexity of modern computers, he should design and build board games.

As an example, read how the one-man-band of Execution Unit manages to, in house, write and test his game on three operating systems. https://www.executionunit.com/blog/2019/01/02/how-i-support-…

It’s not a matter of “this is too hard”, Blow just does not want to do it, let’s be honest.

I think there are two things that cause this.

One is that there’s a minimum performance that people will tolerate. Beyond that you get quickly diminishing user satisfaction returns when trying to optimize. The difference between 30 seconds and 10 seconds in app startup time isn’t going to make anyone choose or not choose Photoshop. People who use PS a lot probably keep it open all day and everyone else doesn’t care enough about the 20 seconds.

The second problem is that complexity scales super-linearly with respect to feature growth, because each feature interacts with every other feature. This means that the difficulty of optimizing startup times gets harder as the application grows in complexity. No single engineer or team of engineers could fix the problem at this point; it would have to be a mandate from up high, which would be a silly mandate since the returns would likely be very small.

I’ll give it a read. Thank you.

(I like anything that echoes well in this chamber of mine… just kidding 😉)

The problem is: if everything were simple enough, how would you set your goals this year? Complication creates lots of jobs and waste. It keeps us from starving, but it starves others somewhere else in the world, or in the future once the resources are all gone.
I see where you’re coming from with this, but this is getting into the realm of social economics and thus politics.

To solve the problem you’re describing we need to be better at protecting all members of society not in (high paying) jobs, such as universal basic income and, I don’t know, actually caring about one another.

But I do see your point, and it’s an interesting one to raise.

> If people (developers/engineers) would only sit back…*

It is a matter of incentives. At many companies, developers are rewarded for shipping, and not for quality, efficiency, supportability, documentation, etc.* This is generally expected of a technology framework still in the profitability growth stage; once we reach a more income-oriented stage, those other factors will enter incentives to protect the income.

Maybe. I’m not convinced.

I think you can build something complex quickly and well at the same time. I built opskit.io in three weeks. It’s about 90% automated.

> I think you can build something complex quickly and well at the same time.

One definitely can. It’s a real crapshoot whether the average developer can. For what it’s worth, I consider myself a below-average developer. There is no way I could grind l33tcode for months and even land a FAANG interview. I can’t code a red-black tree to save my life unless I had a textbook in front of me. Code I build takes an enormous amount of time to deliver, and So. Much. Searching. You get the picture. I’m a reasonably good sysadmin/consultant/sales-engineer; all other roles I put in ludicrous amounts of effort into, to become relevant. Good happenstance I enjoy the challenge.

For the time being however, there is such enormous demand for any talent that I always find myself in situations where my below-average skills are treated as a scarcity. Like a near-100-headcount testing organization in a tech-oriented business, with an explicit leadership mandate to automate using developer-written integration code from that organization… and two developers, both with even worse skills than mine. When a developer balks at writing a single regular expression to insert a single character at the front of an input string, that’s nearly the definition of turning one wrench for ten years; while I’m slow and very-not-brilliant, I’m a smart enough bear to look up how to do it on different OSes or languages and implement it within the hour.

This is not unusual in our industry. That’s why FizzBuzz exists. That’s just to clear the bar of someone who knows the difference between a hash and a linked list.

To clear the bar of “something complex quickly and well at the same time” though, I’ve found it insufficient to clear only the technical hurdle and obtain consistent results. The developer has to care about all the stakeholders. Being able to put themselves into the shoes of the future developers maintaining the codebase, future operators who manage first line support, future managers who seek summarized information about the state and history of the platform, future users who apply business applications to the platform, future support engineers feeding support results back into developers, and so on. That expansive, empathetic orientation to balance trade-offs and nuances is either incentivized internally, or staffed at great expense externally with lots of project coordination (though really, you simply kick the can upstairs to Someone With Taste Who Cares).

I’d sure as hell like to know alternatives that are repeatable, consistently-performing, and sustainable though. Closest I can think of is long-term apprenticeship-style career progression, with a re-dedicated emphasis upon staffing out highly-compensated technical writers, because I strongly suspect as an industry we’re missing well-written story communication to tame the complexity monster; but that’s a rant for another thread.

“But developer productivity!”

Most orgs (and most devs) feel developer productivity should come first. They’re not willing to (and in a lot of cases, not able to) optimize the apps they write. When things get hard (usually about 2 years in) devs just move on to the next job.

Reminded me of Fabrice Bellard’s Pi digits record[1]

The previous Pi computation record of about 2577 billion decimal digits was published by Daisuke Takahashi on August 17th 2009. The main computation lasted 29 hours and used 640 nodes of a T2K Open Supercomputer (Appro Xtreme-X3 Server). Each node contains 4 Opteron Quad Core CPUs at 2.3 GHz, giving a peak processing power of 94.2 Tflops (trillion floating point operations per second).

My computation used a single Core i7 Quad Core CPU at 2.93 GHz giving a peak processing power of 46.9 Gflops. So the supercomputer is about 2000 times faster than my computer. However, my computation lasted 116 days, which is 96 times slower than the supercomputer for about the same number of digits. So my computation is roughly 20 times more efficient.
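
Spelling out the arithmetic from that quote: 94.2 Tflops ÷ 46.9 Gflops ≈ 2000x the raw compute, 116 days ÷ 29 hours ≈ 96x the wall-clock time, and 2000 ÷ 96 ≈ 21 – hence the “roughly 20 times more efficient”.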

[1]: https://bellard.org/pi/pi2700e9/faq.html

I once joked with a colleague that my sqlite3 install is faster than his Hadoop cluster for running a report across a multi-gig file.

We benchmarked it; it was much, much faster.

Technically though, once that multi-gig file becomes many hundreds of gigs, my computer would lose by a huge margin.
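
For the curious, the whole “cluster” on my side amounted to something like this (a rough sketch with made-up file and column names, not the actual report):

    # import the multi-gig CSV into a local SQLite file, then run the
    # report as a plain SQL aggregate -- no Hadoop required
    import csv
    import sqlite3

    conn = sqlite3.connect("report.db")
    conn.execute("CREATE TABLE IF NOT EXISTS events (user_id TEXT, amount REAL)")
    with open("events.csv", newline="") as f:
        conn.executemany("INSERT INTO events VALUES (?, ?)", csv.reader(f))
    conn.commit()

    # the "report": one aggregate query over the whole file
    top = conn.execute(
        "SELECT user_id, SUM(amount) AS total "
        "FROM events GROUP BY user_id ORDER BY total DESC LIMIT 10"
    ).fetchall()
    print(top)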

I recently did some data processing on a single (albeit beefy) node that someone had been using a cluster for. In a day I composed and ran an ETL job that had taken them weeks on their infrastructure (they were actually still in the process of fixing it).
No you cannot. You cannot infinitely scale SQLite; you can’t load 100GB of data into a single SQLite file in any meaningful amount of time. Then try creating an index on it and cry.

I have tried this. I literally wanted to create a simple web app powered by the cheapest solution possible, but it had to serve from a database that cannot be smaller than 150GB. SQLite failed. Even Postgres by itself was very hard! In the end I now launch Redshift for a couple of days, process all the data, then pipe it to Postgres running on a Lightsail VPS via dblink. Haven’t found a better solution.

My rule of thumb is that a single processor core can handle about 100MB/s, if using the right software (and using the software right). For simple tasks, this can be 200+ MB/s; if there is a lot of random access (both against memory and against storage), one can assume about 10k-100k IOPS per core.

For a 32-core processor, that means it can process a data set of 100GB in the order of 30 seconds. For some types of tasks, it can be slower, and if the processing is either light or something that lets you leverage specialized hardware (such as a GPU), it can be much faster. But if you start to take hours to process a dataset of this size (and you are not doing some kind of heavy math), you may want to look at your software stack before starting to scale out. Not only to save on hardware resources, but also because it may require less of your time to optimize a single node than to manage a cluster.
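
Spelled out: 100 GB ÷ (32 cores × 100 MB/s per core) = 100,000 MB ÷ 3,200 MB/s ≈ 31 seconds.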

> “using the right software (and using the software right)”

This is a great phrase that I’m going to use more.

This is a great rule of thumb which helps build a kind of intuition around performance I always try to have my engineers contextualizing. The “lazy and good” way (which has worked I’d say at least 9/10 times in my career when I run into these problems) is to find a way to reduce data cardinality ahead of intense computation. It’s 100% for the reason you describe in your last sentence — it doesn’t just save on hardware resources, but it potentially precludes any timespace complexity bottlenecks from becoming your pain point.
>No you cannot. You cannot infinitely scale SQLite; you can’t load 100GB of data into a single SQLite file in any meaningful amount of time. Then try creating an index on it and cry.

Yes, you can. Without indexes to slow you down (you can create them afterwards), it isn’t even much different than any other DB, if not faster.
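
A rough sketch of what I mean, with made-up table/column names and a hypothetical read_source_rows() generator standing in for your parser (pragmas relaxed only for the initial load, index built at the end):

    import sqlite3

    conn = sqlite3.connect("big.db")
    conn.execute("PRAGMA journal_mode = OFF")   # no rollback journal during the bulk load
    conn.execute("PRAGMA synchronous = OFF")    # don't fsync on every commit
    conn.execute("CREATE TABLE papers (id TEXT, title TEXT, year INTEGER)")

    batch = []
    for row in read_source_rows():              # hypothetical generator over your input data
        batch.append(row)
        if len(batch) == 100_000:
            conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", batch)
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", batch)
    conn.commit()

    # pay the indexing cost once, after all the data is in
    conn.execute("CREATE INDEX idx_papers_year ON papers (year)")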

>Even Postgres by itself was very hard!

Probably depends on your setup. I’ve worked with multi-TB single Postgres databases (heck, we had 100GB in a single table without partitions). Then again, the machine had a TB of RAM.

> but it had to serve from a database that cannot be smaller than 150GB. SQLite failed. Even Postgres by itself was very hard!

The PostgreSQL database for a CMS project I work on weighs about 250GB (all assets are binary in the database), and we have no problem at all serving a boatload of requests (with the replicated database and the serving CMS running on each live server, with 8GB of RAM).

To me, it smells like you lacked some indices or ran it on an RPi?

It sounds like the op is trying to provision and load 150GB in a reasonably fast manner. Once loaded, presumably any of the usual suspects will be fast enough. It’s the up front loading costs which are the problem.

Anyway, I’m curious what kind of data the op is trying to process.

I am trying to load and serve the Microsoft Academic Graph to produce author profile pages for all academic authors! Microsoft and Google already do this, but IMO they leave a lot to be desired.

But this means there are a hundred million entities, 3x that number of papers, and a bunch of associated metadata. On Redshift I can get all of this loaded in minutes and it takes around 100GB, but Postgres loads are pathetic comparatively.

And I have no intention of spending more than 30 bucks a month! So hard problem for sure! Suggestions welcome!

There are settings in Postgres that allow for bulk loading.

By default you get a commit after each INSERT which slows things down by a lot.
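
Concretely: wrap the load in a single transaction and use COPY instead of row-by-row INSERTs. A sketch with psycopg2 (the connection string, table and file names are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=graph user=loader")
    with conn, conn.cursor() as cur:            # one transaction, one commit at the end
        with open("papers.csv") as f:
            cur.copy_expert("COPY papers FROM STDIN WITH (FORMAT csv)", f)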

How many rows are we talking about? In the end, once I started using dblink to load via Redshift after some preprocessing, the loads were reasonable, and indexing too. But I’m looking at full data refreshes every two weeks and a tight budget (30 bucks a month), so I am constrained on solutions. Suggestions welcome!
Try DuckDB! I’ve been getting 20x SQLite performance on one thread, and it usually scales linearly with threads!
Maybe I’m misunderstanding, but this seems very strange. Are you suggesting that Postgres can’t handle a 150GB database with acceptable performance?
I’m trying to run a Postgres instance on a basic VPS instance with a single vCPU and 8GB of RAM! And I’ll need to erase and reload all 150GB every two weeks…
Had a similar problem recently. Ended up creating a custom system using a file-based index (append to files named by the first 5 chars of the SHA1 of the key). Took 10 hours to parse my terabyte. Uploaded it to Azure Blob Storage; now I can query my 10B rows in 50ms for ~$10^-7. It’s hard to evolve, but 10x faster and cheaper than other solutions.
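
The write/read path is roughly this (a sketch; the record format and the Azure upload are left out, and names are illustrative):

    import hashlib
    import os

    def bucket_for(key: str) -> str:
        # first 5 hex chars of SHA1(key) -> ~1M possible bucket files
        return hashlib.sha1(key.encode("utf-8")).hexdigest()[:5]

    def append_record(base_dir: str, key: str, value: str) -> None:
        os.makedirs(base_dir, exist_ok=True)
        with open(os.path.join(base_dir, bucket_for(key)), "a") as f:
            f.write(f"{key}\t{value}\n")

    def lookup(base_dir: str, key: str) -> list:
        try:
            with open(os.path.join(base_dir, bucket_for(key))) as f:
                return [line.rstrip("\n").split("\t", 1)[1]
                        for line in f if line.startswith(key + "\t")]
        except FileNotFoundError:
            return []
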
My original plan was to do a similar S3 idea, but I forgot about its charge per 1000 GETs and PUTs and had a 700 dollar bill I had to bargain with them to waive! Does Azure’s model not have that expense?
Curious if you tried this on an EC2 instance in AWS? The IOPS for EBS volumes are notoriously low, and possibly why a lot of self-hosted DB instances feel very slow vs similarly priced AWS services. Personal anecdote, but moving to a dedicated server from EC2 increased the max throughput by a factor of 80 for us.
Did you try to compare that to EC2 instances with ephemeral nvme drives? I’m seeing hdfs throughput of up to several GB/node using such instances.
You can use instances with locally attached SSDs. Then you’re responsible for their reliability, so you’re not getting all the ‘cloud’ benefits. We used them for provisioning our own CI cluster with RAID-0 btrfs running PostgreSQL, and only backed up the provisioning and CI scripts.
Got burned there for sure! Speed is one thing, but the cost is outrageous for IO-heavy apps! Anyway, I moved to Lightsail which paradoxically doesn’t have IO costs, so while IO is slow at least the cost is predictable!
You can skip Hadoop and go from SQLite to something like S3 + Presto that scales to extremely high volumes with low latency and better than linear financial scaling.
Does hundreds of gigs introduce a general performance hit or could it still be further optimized using some smart indexing strategy?
Unless it’s accidentally quadratic. Then all the RAM in the world isn’t going to help you.
And in 2021, almost everything* does.

*Obviously not really. But very very many things do, even doing useful jobs in production, as long as you have high enough specs.

I’ve had similar experiences. Sometimes we’ll have a dataset with tens of thousands of records and it will give rise to the belief that it’s a problem requiring a highly scalable solution, because “tens of thousands” is more than a human can hold in their head. In reality, if the records are just a few columns of data, the whole set can be serialized to a single file and consumed in one gulp into a single object in memory on commodity hardware, no sweat. Then process it with a for loop. Very few enterprises actually have big big data.
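
In other words, the “highly scalable solution” often boils down to something like this (file and field names are illustrative):

    import csv

    # one gulp: the whole dataset as a single in-memory list of dicts
    with open("records.csv", newline="") as f:
        records = list(csv.DictReader(f))

    # then just a for loop (or a couple of comprehensions)
    total = sum(float(r["amount"]) for r in records)
    overdue = [r for r in records if r["status"] == "overdue"]
    print(len(records), total, len(overdue))
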
My solution started out as a 10-line Python script where I would manually clean the data we received, then process it. CEO: “Will this scale?”

Me: “No, absolutely not, at some point we’ll need to hire someone who knows what they’re doing.”

As time passed and we got more data, I significantly improved the data cleaning portions so that most of it was automated, and the parts that weren’t automated would be brought up as suggestions I could quickly handle. I learned the very basics of performance and why `eval` is bad, set up my script so I didn’t need to hard-code the number of files to process each day, started storing data on a network drive and then eventually a db…

I still don’t know what I’m doing, but by the time I left it took maybe 5 minutes of manual data cleaning to handle thousands of jobs a day, and then the remainder could be done on a single machine.

Said enterprise WISH they had big data. Or maybe it’s fear, as in, ‘what IF we get big data?’
I’m aware of a couple of companies who behave like that – “well, we could increase our user base by an order of magnitude at any point here, so better spring for the order-of-magnitude more expensive database, just in case we need it.”

Feels like toxic optimism.

It’s not just about scaling databases; some people are simply unable to assess reasonable limits on any system. A few years ago a certain Scandinavian publisher decided to replace their standard industry tools with a single “Digital Experience Platform” that was expected to do everything. After a couple of years they understood it was a stupid idea and gave up. Then later someone in management thought that since they had already spent some millions of euros, they should continue anyway. This behemoth is so slow and buggy the end users work at 1/4 speed, but everyone is afraid to say anything, as the ones who did have been fired. The current PM is sending weekly success messages. It’s hilarious. And all because someone once had a fantasy of having one huge system that does everything.
I’ve noticed business people have a different idea of what ‘big data’ means than tech guys do. The business guys think it means a lot of data, like the records of a million people – which is a lot of data, but not the tech-guy definition, which tends to be data too large to process on a single machine.

Those come out at something like 1GB and 10TB which are obviously rather different.

Unfortunately, this kind of behavior will be rewarded by the job market, because he’s now got a bunch more tech buzzwords on his resume than you. Call it the Resume Industrial Complex: engineers build systems with as many bells and whistles as possible, because they want to learn all the hot new tech stacks so they can show off their extensive “skills” to potential employers.
I wonder what percentage of data center power is wasted running totally unnecessary trendy abstractions…
My favorite part of conducting design interviews is when a candidate has pulled some complex distributed system out of their ass, and I ask them what the actual throughput/memory usage looks like.
On that day, most probably nothing with regards to this task.

Then, later, probably someone would check out the scripts from a shared repo, read an outdated README, try them out, swear a bit, check for correctness with someone dependent on the results, and finally learn how to do the task.

There are a lot of business processes that can tolerate days or weeks of delay in case of such a tragic (and hopefully improbable) event. The trick is to know which of them can’t.

> There are a lot of business processes that can tolerate days or weeks of delay in case of such a tragic (and hopefully improbable) event. The trick is to know which of them can’t.

This is really true, BUT that kind of problem is OK – nobody cares – until somebody starts caring, and then all of a sudden it is urgent (exactly because it went undetected for weeks/months due to its periodicity).

I meant it’s fine to take calculated risks.

E.g. we have less than a 1% chance per year that a given person leaves us on bad terms or suffers a bad accident or illness. In case it really happens, it will cost us X in delays and extra work. To lower the probability of this risk to Y% would cost us Z (money, delay, etc.).

If you do this math, you can tell if it’s a good idea to optimize here, or if you have more pressing issues.

In my experience, this sort of one-man job gets automated, or at least well described and checked, for fear of mistakes and/or employee fraud rather than “downtime”.

Mistakes are “downtime” as well, in a way. Or maybe better: downtime is a kind of mistake; both cause errors and lead to problems.
Another guy installs Postgres on his machine, runs a git clone, connects to the VPN and initiates the jobs?
I wasn’t even using a db at the time, it was 100% pandas. We did eventually set up more infrastructure, when I left the data was loaded into the company’s SQL Server db, then pulled into pandas, then uploaded back into a different table.
It’s true – at that point, if I had disappeared without providing any transition help, the company would have been in trouble for a few days. But that goes for any employee – we were only 7 people at the time!

Eventually I built out some more infrastructure to run the jobs automatically on a dedicated machine, but last I checked everything still runs on one instance.

All of my routine tasks are documented in the operations manual. I’d be missed but the work would still get done.
SO is always impressive – love that their redis servers with 256GB RAM peak at 2% CPU load 🙂

SO is also my go-to argument when some smart “architect” proposes redundant Kubernetes cluster instances for some company-local project. People seem to have lost the feel for what is needed to serve a couple of thousand concurrent users (for company-internal usage, which I specialize in, you will hardly get more users). Everyone thinks they are Google or Netflix. Meanwhile, SO runs on 1-2 racks with a number of servers that would not even justify Kubernetes or even Docker.

SO really isn’t a great example; they have considerations most companies don’t – Windows and SQL Server licensing. When shit like that is involved, scale-out rarely seems like a better choice.

It’s not only about the number of users, it’s also a matter of availability. Even the most stupid, low-use, barely-does-anything internal apps at my company get deployed either to two machines or a Nomad cluster for redundancy (across two DCs). Design for failure and all that. Failure is unlikely, but it’s trivial to set up at least active-passive redundancy just in case, and it will make failures much easier.

> SO is also my go-to argument when some smart “architect” proposes redundant Kubernetes cluster instances for some company-local project.

Technically you don’t need Kubernetes, yes. But: There are advantages that Kubernetes gives you even for a small shop:

– assuming you have decent shared storage, it’s a matter of about 30 minutes to replace a completely failed machine – plug the server in, install a bare-bones Ubuntu, kubeadm join, done. If you use Puppet and a netboot install, you can go even faster (source: been there, done that). And the best thing: assuming well-written health checks, users won’t even notice you just had a node fail, as k8s will take care of rescheduling.

– no need to wrangle with systemd unit files (or, worse, classic init.d scripts) for your application. For most scenarios you will either find Docker-embedded healthchecks somewhere or you can easily write your own, so that Kubernetes can automatically restart unhealthy containers

– no “hidden undocumented state” like wonky manual customizations somewhere in /etc that can mess up disaster recovery / horizontal scale, as everything relevant is included in either the Kubernetes spec or the Docker images. Side effect: this also massively reduces the ops load during upgrades, as all there is on a typical k8s node should be the base OS and Docker (or, in newest k8s versions, not even that anymore)

– it’s easy to set up new development instances in a CI/CD environment

– generally, it’s easier to get stuff done in corporate environments: just spin up a container on your cluster and that’s it, no wrestling with finance and three levels of sign-off to get approval for a VM or, worse, bare metal.

I won’t deny that there are issues though, especially if you’re selfhosting:

– you will end up with issues with basic network tasks very quickly during setup. MetalLB is a nightmare to set up, but smooth once you have. Most stuff is made with the assumption of every machine being in a fully Internet-reachable cluster (coughs in certbot); once you diverge from that (e.g. because corp requires dedicated “load balancer” nodes that only direct traffic from outside to inside, with “application” nodes not directly internet-reachable), you’re on your own.

– most likely you’ll end up with one or two sandwich layers of load balancing (k8s ingress for one, and if you have it an external LB/WAF), which makes stuff like XFF headers … interesting to say the least

– same if you’re running anything with UDP, e.g. RTMP streaming

– the various networking layers are extremely hard to debug as most of k8s networking (no matter the overlay you use) is a boatload of iptables black magic. Even if you have a decade of experience…

Your arguments are true, but you did not consider the complexity that you have now introduced into a small-shop operation. You will need Kubernetes knowledge and experienced engineers on that matter. I would argue that the SO setup with 9 web servers, 2×2 DB servers and 2 redis servers could easily be administered with 20-year-old knowledge about networks and Linux/Windows itself.

And I also argue that a lack of experience fiddling with redundant Kubernetes is a more likely source of downtime than hardware failure in a simple setup.

> You will need kubernetes knowledge and experienced engineers on that matter.

For a small shop you’ll need one person knowing that stuff, or you bring in an external consultant for setting up and maintaining the cluster, or you move to some cloud provider (k8s is basically a commodity that everyone and their dog offers, not just the big 3!) so you don’t have to worry about that at all.

And a cluster for basic stuff is not even that expensive if you do want to run your own. Three worker machines and one (or, if you want HA, two) NAS systems… half a rack and you’re set.

The benefit you have is your engineers will waste a lot less time setting up, maintaining and tearing down development and QA environments.

As for the SO setup: the day-to-day maintenance of them should be fairly simple – but AFAIK they had to do a lot of development effort to get the cluster to that efficiency, including writing their own “tag DB”.

You will always need 2-3 experts, because in case of an incident, your 1 engineer might be on sick/holiday leave.

Well, but let’s take one step back and look at SO: they are a Windows shop (.NET, MS SQL Server), so I doubt k8s would be found in their setup.

Ah yes, I’ll make my critical infrastructure totally dependent on some outside consultant who may or may not be around when I really need him. That sounds like a great strategy. /s
SO is a great counter example to many over complicated setups, but they have a few important details going for them.

> Everytime you go to SO, it hits one of these 9 web servers

This isn’t strictly true. Most SO traffic is logged out, most doesn’t require strictly consistent data, most can be cached at the CDN. This means most page views should never reach their servers.

This is obviously a great design! Caching at the CDN is brilliant. But there are a lot of services that can’t be built like this.

CDN caches static assets. The request still goes to SO servers. Search goes to one of their massive Elastic Search servers.

I’m not saying we should all use SO’s architecture, I am trying to shed light on what’s possible.

YMMV obviously.

Are you an SO dev? I had thought I read about the use of CDNs and/or Varnish or something like that for rendered pages for logged out users? I don’t want to correct you on your own architecture if you are!
No, not a dev at SO. I am guessing at what would be a rather standard use of a CDN (hosting static assets, caching them geographically).

What you’re saying is probably right.

We went all-in on vertical scaling with our product. We went so far as to decide on SQLite because we were never going to have a separate database server (or any separate host, for that matter). 6 years later that assumption has still held very strong and yielded incredible benefits.

The slowest production environment we run in today is still barely touched by our application during the heaviest parts of the day. We use libraries and tools capable of pushing millions of requests per second, but we typically only demand tens to hundreds throughout the day.

Admitting your scale fits on a single host means you can leverage benefits that virtually no one else is even paying attention to anymore. These benefits can put entire sectors of our industry out of business if more developers were to focus on them.

Do you have any more details on your application? Sounds like your architecture choice worked out really well. I’m curious to hear more about it.
Our technology choices for the backend are incredibly straightforward. The tricky bits are principally .NET Core and SQLite. One new technology we really like is Blazor, because their server-side mode of operation fits perfectly with our “everything on 1 server” grain, and obviates the need for additional front-end dependencies or APIs.
Our backup strategy is to periodically snapshot the entire host volume via the relevant hypervisor tools. We have negotiated RPOs with all of our customers that allow for a small amount of intraday data loss (i.e. with 15-minute snapshot intervals, we might lose up to 15 minutes of live business state). There are other mitigating business processes we have put into place which bridge enough of this gap for it to be tolerable for all of our customers.

In the industry we work in, as long as your RTO/RPO is superior to the system of record you interface with, you are never the sore thumb sticking out of the tech pile.

In our 6-7 years of operating in this manner, we still have not had to restore a single environment from snapshot. We have tested it several times though.

You will probably find that VM snapshot+restore is a ridiculously easy and reliable way to provide backups if you put all of your eggs into one basket.

>> You will probably find that VM snapshot+restore is a ridiculously easy and reliable way to provide backups if you put all of your eggs into one basket.

Yep, this is something we rely on whenever we perform risky upgrades or migrations. Just snapshot the entire thing and restore it if something goes wrong, and it’s both fast and virtually risk-free.

I’m not the OP but I’m the author of an open source tool called Litestream[1] that does streaming replication of SQLite databases to AWS S3. I’ve found it to be a good, cheap way of keeping your data safe.

[1]: https://litestream.io/

I am definitely interested in a streaming backup solution. Right now, our application state is scattered across many independent SQLite databases and files.

We would probably have to look at a rewrite under a unified database schema to leverage something like this (at least for the business state we care about). Streaming replication implies serialization of total business state in my head, and this has some implications for performance.

Also, for us, backup to the cloud is a complete non-starter. We would have to have our customers set up a second machine within the same network (not necessarily same building) to receive these backups due to the sensitive nature of the data.

What I really want to do is keep all the same services & schemas we have today, but build another layer on top so that we can have business services directly aware of replication concerns. For instance, I might want to block on some targeted replication activity rather than let it complete asynchronously. Then, instead of a primary/backup, we can just have 4-5 application nodes operating as a cluster with some sort of scheme copying important entities between nodes as required. We already moved to GUIDs for a lot of identity due to configuration import/export problems, so that problem is solved already. There are very few areas of our application that actually require consensus (if we had multiple participants in the same environment), so this is a compelling path to explore.

You can stream back ups of multiple database files with Litestream. Right now you have to explicitly name them in the Litestream configuration file but in the future it will support using a glob or file pattern to pick up multiple files automatically.

As for cloud backup, that’s just one replica type. It’s usually the most common so I just state that. Litestream also supports file-based backups so you could do a streaming backup to an NFS mount instead. There’s an HTTP replica type coming in v0.4.0 that’s mainly for live read replication (e.g. distribute your query load out to multiple servers) but it could also be used as a backup method.

As for synchronous replication, that’s something that’s on the roadmap but I don’t have an exact timeline. It’ll probably be v0.5.0. The idea is that you can wait to confirm that data is replicated before returning a confirmation to the client.

We have a Slack[1] as well as a bunch of docs on the site[2] and an active GitHub project page. I do office hours[3] every Friday too if you want to chat over zoom.

[1]: https://join.slack.com/t/litestream/shared_invite/zt-n0j4s3c…

[2]: https://litestream.io/

[3]: https://calendly.com/benbjohnson/litestream

I really like what I am seeing so far. What is the rundown on how synchronous replication would be realized? Feels like I would have to add something to my application for this to work, unless we are talking about modified versions of SQLite or some other process hooking approach.
Litestream maintains a WAL position so it would need to expose the current local WAL position & the highest replicated WAL position via some kind of shared memory—probably just a file similar to SQLite’s “-shm” file. The application can check the current position when a transaction starts and then it can block until the transaction has been replicated. That’s the basic idea from a high level.
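
From the application side, the blocking part could look something like this rough sketch (the position file name, its format, and the integer positions are all hypothetical here, just to illustrate the idea, not the actual design):

    import time

    POS_FILE = "db.sqlite-litestream-pos"        # hypothetical shared position file

    def read_positions():
        # assume the file holds two integers: local WAL position, replicated WAL position
        with open(POS_FILE) as f:
            local, replicated = (int(x) for x in f.read().split())
        return local, replicated

    def wait_for_replication(timeout=5.0, poll=0.01):
        target, _ = read_positions()             # local position right after our write
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            _, replicated = read_positions()
            if replicated >= target:
                return True                      # our transaction has reached the replica
            time.sleep(poll)
        return False                             # caller decides what to do about the lag
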
Does your application run on your own servers, your customers’ servers, or some of each? I gather from your comments that you deploy your application into multiple production environments, presumably one per customer.
Both. We run a QA instance for every customer in our infrastructure, and then 2-3 additional environments per customer in their infrastructure.
Vertical scaling maybe works forever for the 99% of companies that are CRUD apps running a basic website. As soon as you add any kind of 2D or 3D processing like image, video, etc. you pretty much have to have horizontal scaling at some point.

The sad truth is that your company probably won’t be successful (statistically). You pretty much never have to consider horizontal scaling until you have a few hundred thousand DAU.

You don’t need to scale your application horizontally even with media processing, you just need to distribute that chunk of the work, which is a lot easier (no state).
> Like someone else said, distributing work across multiple machines is a form of horizontal scaling.

Sure, but it is the easy kind, when it comes to images or videos. Lambda, for example, can handle a huge amount of image processing for pennies per month and there is none of the additional machine baggage that comes with traditional horizontal scaling.

It really depends. A streaming video service that does any kind of reprocessing of the data would probably be better off with horizontal scaling.
I imagine it’s still super simple to have one core app that handles most of the logic and then a job queue system that runs these high-load jobs on worker machines.

Much simpler than having everything split.

Sure, but it’s massively simpler than some massively distributed microservice app where every component runs on multiple servers.

Most of these vertical scaling examples given actually do use multiple servers but the core is one very powerful server.

Definitely. There is certainly a place for horizontal scaling. I just wanted to highlight how underrated vertical scaling is; a good engineer would evaluate these scaling options with prudence and perspicacity, not the cult behavior so often observed in software engineering circles.
I think somehow this is related to how business-minded people think too. I went to a course where people learn to pitch their ideas to get funding, and the basics of business simply didn’t exist much among the technical people.

One simple example (which I suspect most businesses do) is that you do all the work either manually yourself or on your laptop, while advertising it as a resource-rich service. Only when you truly cannot handle the demand do you ‘scale up’ and turn your business into a ‘real’ business. And there are plenty of tricks like this (as legal as possible).

> Everytime you go to SO, it hits one of these 9 web servers and all data on SO sits on those 2 massive SQL servers. That’s pretty amazing.

I don’t find it amazing at all. Functionality-wise, StackOverflow is a very simple Web application. Moreover, SO’s range of 300-500 requests per second is not a mind-blowing load. Even in 2014, a powerful enough single physical server (running a Java application) was able to handle 1M requests per second[1]. A bit later, in 2017, similar performance was demonstrated on a single AWS EC2 instance, using Python (and a blazingly fast HTTP-focused micro-framework, Japronto), which is typically not considered a high-performance option for Web applications[2].

[1] https://www.techempower.com/blog/2014/03/04/one-million-http…

[2] https://www.freecodecamp.org/news/million-requests-per-secon…

The amazing part is that the leadership allows it to be simple.

This is such a great competitive advantage.

Compare this to a leadership that thinks you absolutely must use Akamai for your 50 req/secs webserver. You end up with tons of complexity for no reason.

Fair enough. Though not too surprising still, considering the original leadership of the company, one of whom (Joel Spolsky) is still on the board of directors. Having said that, the board’s 5:4 VC-to-non-VC ratio looks pretty scary to me. But this is a different story …
SO is a bit more complicated than returning a single character in a response. You can achieve high throughput with just about anything these days if you aren’t doing any “work” on the server. 300-500 reqs/second is impressive for a web site/application with real-world traffic.
Thing is, 99% of companies could run like SO if their software were like SO’s.

But if you are confronted with a very large 15+ year old monolith that requires multiple big-instance machines to even handle medium load, then you’re not going to get this easily fixed.

It’s very possible that you come to the conclusion that it is too complex to refactor for better vertical scaling. When your demand increases, then you simply buy another machine every now and then and spin up another instance of your monolith.

> if you are confronted with a very large 15+ year old monolith that requires multiple big machines just to handle medium load, then you're not going to get this fixed easily

Last 15+ year old monolith I touched needed multiple machines to run because it was constrained by the database due to an insane homegrown ORM and poorly managed database schemas (and this is a common theme, I find.)

Tuning the SQL, rejigging things like session management, etc., would have made it go a lot quicker on a lot fewer machines but management were insistent that it had to be redone as node microservices under k8s.

I totally agree with your main point and SO is kind of the perfect example. At the same time it is kind of the worst example because for one, to the best of my knowledge, their architecture is pretty much an outlier, and for another it is what it is for non-technical historical reasons.

As far as I remember they started that way because they were on a Microsoft stack, and Microsoft's licensing policies were (are?) pretty much prohibitive for scaling out. It is an interesting question whether they would design their system the same way if they had the opportunity to start from scratch.

Most people responding here are nitpicking about whether SO's architecture is the right one. I wasn't trying to imply that at all.

I wanted to drive home a point, and SO is a good enough example to show that if a massive company of SO's size can run like this, so can your tiny app.

Don’t scale prematurely. A lot can be done by reasonable vertical scaling.

For one $120k/year Kubernetes infra engineer, you could pay for an entire rack of beefy servers.

Obviously YMMV. Discussions about SO and licensing details are distracting.

Yes, but Stack Overflow is now mostly a graveyard of old closed questions, easily cached; I am only half joking. Most startup ideas today are a lot more interactive, so the SO model with two DBs would probably not serve them well. Horizontal scaling is not only for ETL, and I am uncertain why you say that it needs many lawyers.
Genuine question: how is 9 web servers vertical scaling? And also, peak CPU usage of 12% means this is about 10x oversized for what is needed. Isn't it much better to only scale up when actually needed, mostly in terms of cost?
Because they play in the major leagues, where most teams have hundreds or thousands of servers, while they have those nine.

Yes, there is some horizontal scaling, but the sheer amount of vertical scaling here is still mind blowing.

I've run more servers in what were basically hobby projects than SO does.

I think I agree, but what do you mean exactly? Just keep getting beefier servers as opposed to serverless junk?
Not the OP, but yes, getting more powerful machines to run your program is what “vertical scaling” means (as opposed to running multiple copies of your program on similar-sized machines aka “horizontal scaling” ).
Stack Overflow's use case has the benefit of being able to sit behind a Content Delivery Network (CDN) with a massive amount of infrastructure at the edge, offloading much of the computational and database demand. This reduces the requirements of their systems dramatically. Given their experience in the segment, it's plausible to expect they understand how to optimize their user experience to balance out the hardware demands and costs as well.
A 'single' big box with multiple terabytes of RAM can probably outperform many 'horizontally scaled' solutions. It all depends on the workload, but I feel that sometimes it's more about being 'hip' than being practical.

https://yourdatafitsinram.net/

Might apply to 99% of the companies, but I doubt it applies to 99% of the companies that HN readers work for.
Their database query to page request ratio is about 20:1. Seems like this should be lower.
Stack Overflow has unique constraints (Microsoft licensing) which make vertical scaling a cheaper option, and IMHO that's rarely the case.
People keep fapping to this, but Stack Overflow is served read-only to most users, and probably heavily cached.
Use bigger machines instead of more machines.

There's always a limit on how big you can go, and a smaller limit on how big you should go, but either way it's pretty big. I wouldn't go past dual Intel Xeon, because 4P gets crazy expensive; I haven't been involved in systems work on Epyc, so 1P might be a sensible limit there, but maybe 2P makes sense for some uses.

If you have a single machine with 64 cores running 256 threads of your daemon, is that considered vertical scaling? Odd definition.
If multiple cores/threads shouldn’t be considered vertical scaling, what should?

Overclocking a single-core processor from 2.8GHz to 4.2GHz can only take you so far, after all…

Get a more powerful single machine (in contrast to multiple machines). However, I wonder whether multi-socket Xeons count as vertical or horizontal. I never understood how programmable those machines are…
Wow, thanks. This is the last place I expected to see C# mentioned. Very interesting!
It was mentioned in the article as being hard to build/test/deploy, but I disagree. Everything can be done in a few clicks using VS or Rider.
It might apply to 99% who have specific requirements, but the vast majority of internet companies need more. Deployments, N+1 redundancy, HA etc… are all valuable, even if some resources are going to waste.
> Deployments, N+1 redundancy, HA etc

None of those things are mutually exclusive with vertical scaling?

Having two identical servers for redundancy doesn't mean you are scaling horizontally (assuming each can handle the load individually; nothing better than discovering that assumption was incorrect during an outage).

The author is not lying. I've been learning Docker and it is certainly nice to pull and docker-compose up for local dev, but there is a lot to learn when you factor in orchestrators. And by learn, I mean actually learning the nuts and bolts of a Dockerfile and not copy/pasting shit from the internet. While it's all helpful, it's certainly not needed for the project we have at work, nor the "microservices" that our lead thinks we need even though they aren't even microservices. All we've done is split our application into separate mini applications that don't talk to each other at all, essentially 4 web services. Great! Sigh.

So why am I learning all this instead of just programming? Programming is what I really want to do all day, just writing code and thinking through software architecture. Because the industry tells me if I don’t learn this stuff I will be decommissioned. Fun stuff.

You’re learning this stuff because you’re an engineer not a computer scientist. Deploying to prod is the goal, not writing code.

>separate mini applications that don't talk to each other at all, essentially 4 web services.

I mean, that sounds pretty good. If you can do that why couple them? “Microservices” is just SOA without any debate over how small a service can be.

Whether decoupling into microservices or making them part of a monolith is a good idea or not, to me, has a lot more to do with the teams and company organization behind these services than with the product itself.

So I’d say without knowing how big, and how many teams are dealing with this we cannot say one approach is better than the other.

I've been in companies where several hundred engineers were working on a "core" monolith and it was a real pain: almost impossible to track where problems were coming from, deploys sometimes blocked for weeks due to some team having problems with their tests, or months spent just agreeing on how to do something.

I've also been at companies where, with about 10 devs, the "architect" went totally crazy into microservices, microframeworks, CQRS, event sourcing, etc. etc.… And it was a total mess where each developer had to deal with 10 services, keep their dependencies up to date and coordinate deploys of 3 things at the same time, and nobody knew where and why things were failing.

So, as always, right tool for the job.

What I've seen work best is to adapt the services you run to the team structure you have.

> You’re learning this stuff because you’re an engineer not a computer scientist. Deploying to prod is the goal, not writing code.

Writing code is not the goal of a computer scientist.

Because there is a bunch of shared code, behavior and data between the services, and his idea was "just copy and paste". We devised a library that gets shared as a dependency, but managing that dependency is a PITA and not everything can go into the shared library anyway. Which means we still copy/paste some code, which is okay with this bozo somehow. I can stomach splitting things into microservices that don't actually talk to each other, though I don't see the need on day one, but why not run it as a monorepo and split it out in the CD pipeline so I don't have to open four PRs for one fucking task?

Because the goof couldn't figure it out and just gave up and said here you go. This is before I knew the guy was completely useless. I had to do a whole presentation and get the approval of every single developer just for him to succumb to FACTS about moving to a monorepo that then splits the code out to his beloved CV fuel on build. Trust me, there is a lot more wrong here if you are getting any of this. As for team size, we are small: six warm bodies, three actually moving.

Personally I say go monolith (for most small/medium projects) but design it with an eye towards splitting it into services down the line, unless something is smacking you in the face to go all in on micro. For a monolith that means a logical separation today that can easily be split out in a single sprint tomorrow. That means less pain and more features today, and a path towards some dude's wet dream in the future.

Yeah I’m salty.

> I mean, that sounds pretty good. If you can do that why couple them?

Because what is the point in separating each one into their own deployed service and having to deal with network issues when they could just be services inside a monolith?

OP just said they don’t talk at all. They sound completely decoupled already. You could put them in the same app and share routes if you want and just separate by package or compilation unit if you really want to. The hard part is already done though.

They could be managed and deployed by completely different teams with no overhead now. It really depends what we’re tuning for.

Scaling is another factor that comes to mind, also resilience.

If one part in the monolith goes haywire so will the entire application. If you can decouple and split the software into 4 applications with their own concerns, at least you have 3 applications left running (assuming they are decoupled).

If app 1 wants 5000% more CPU than app 2, maybe you can have different instance types running and save costs/resources.

A good reason no doubt, and if that comes to bear, sure. But at least code in a monorepo and split to microservices in your pipeline. Have your cake and eat it too.
Read my reply further up: they have a lot in common. We had to build a shared library that goes in as a dependency, and even with that we still have to copy/paste code between services, because a data structure change in one must often be reflected in another. Hello, multiple PRs for one task.
Developing with Docker does not necessarily mean microservices; it's just a way of packaging, distributing and deploying your application in a clean way. And Docker is not a virtual machine, so there's not much overhead. You don't need Kubernetes if it's just a simple app; you can just take advantage of a managed service like ECS, you get auto scaling right away, and you don't have to manage your node and deal with stupid things like systemd.
> you get auto scaling right away

Every time I see someone say something like this, or better yet use the word “magic”, what I hear is that they don’t understand how or why their system does the things it does. Nothing is free; nothing is magic; nothing “just works”. You understand it or you don’t, and nothing has contributed more to the plague of developers thinking they understand it when they don’t than cloud-based containers (and the absurd associated costs!).

> Every time I see someone say something like this, or better yet use the word “magic”, what I hear is that they don’t understand how or why their system does the things it does.

There is nothing magic about configuring a deployment to autoscale. You set resource limits, you configure your deployment to scale up when an upper limit is reached and you haven't maxed out the replica count, and to scale down when a lower limit is reached and you haven't hit the minimum. What do you find hard to understand?
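
For concreteness, here is a minimal sketch of what that looks like as a Kubernetes HorizontalPodAutoscaler (the name "web", the numbers, and the exact API version are just placeholders and depend on your cluster):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web
      minReplicas: 2        # never scale below this
      maxReplicas: 10       # never scale above this
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70   # add replicas above ~70% average CPU, remove below it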

> Nothing is free; nothing is magic; nothing “just works”. You understand it or you don’t,

You’re the only one claiming it’s magic.

The rest of the world doesn’t seem to have a problem after having read a couple of pages into the tutorial on how to configure autoscale on whatever service/cloud provider.

> and nothing has contributed more to the plague of developers thinking they understand it when they don’t than cloud-based containers (and the absurd associated costs!).

You’re the only one posting a baseless assertion that other developers somehow don’t understand autoscaling, as if everyone around you struggles with it.

Having separate services does add additional overhead and maintenance, but it does provide the benefit of 1) Reducing the blast radius if issues occur, allowing for a degraded service instead of being totally down, and 2) Better scaling/ optimizations. For example, one service could need to support more TPS, or need more memory, CPU, etc.
If they don't talk to each other, then you can run multiple instances of them without paying the price of inter-communication latency. It's a dream for scaling. In this example their team lead is right.
Looks good to whom? When I see all this kind of froth on a CV, that's a red flag to me, or at best effectively empty space, where another candidate might be telling me about experience and skills that are actually valuable.
Isn't that why the DevOps role was invented, so us programmers don't have to fanny around with Docker?
Devops is a practice not a role. Essentially it’s bringing good software engineering practice to operations and eliminating the silos. In practice, though, this does usually mean programmers fannying around with docker. Some organisations just rebranded the ops team to the devops team, but that’s kinda missing the point.
I'm actually perplexed why the people in our ops team have the title DevOps Engineer when they don't do any dev work, just handle all the AWS-related stuff.

I asked one of them this question, but couldn't get any answer that satisfied my curiosity.

EDIT: For what it’s worth I don’t view myself as a software developer or operations or systems administrator or frontend or backend or fullstack. I like to think of myself as a Problem Solver. It just so happens that I’m currently paid to solve software engineering problems.

I want to know everything from Dockerfiles to bash scripting to assembly to functional programming to DDD, etc.

I have asked this question myself, and after failing to get an appropriate answer, I started using “DevOps Administrator” instead of “Engineer” in my email signature. It feels more appropriate since I definitely do not write production code.
In my experience, developers then still have to implement a lot of behaviour to deal with docker’s limitations around networking. (in my case it was trying to connect multiple BEAMs in a docker network)
For me, Docker is just a superpower that lets me build larger, more complex applications. That's why it's worth learning. It raises the ceiling of what I'm able to create. Creating a site with multiple services that integrate with each other is complicated. You could do it from scratch, but it would take so long that you would likely never be able to manage the actual complexity you care about – the complexity you want to architect and program. You'd be too busy programming solutions to the problems that Docker already solves.
> You could do it from scratch, but it would take so long that you would likely never be able to manage the actual complexity you care about

Except plenty of people have and continue to manage? Do we think that _most_ people are using Docker now?

This is how we do it:

We use baremetal servers. We use systemd, including the systemd sandboxing capabilities. We have deploy scripts (almost all bash, and yes, they are largely idempotent). A host could be deployed with any number of: cockroachdb, openresty, monetdb, rabbitmq or postgresql. They're all running vector (log ingestion). And they can be running N apps. Each host also gets public keys for encrypting on-site, credentials for uploading backups to Google Cloud, and systemd timers for tasks like running lynis and rkhunter. Everything is connected via wireguard (over a physical private network).
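
To give a flavour, a unit file for one of those apps looks something like this (a sketch with invented names and paths; the real ones vary per host):

    [Unit]
    Description=example app
    After=network-online.target
    Wants=network-online.target

    [Service]
    ExecStart=/opt/app/bin/app
    Restart=on-failure
    # a few of the systemd sandboxing knobs mentioned above
    DynamicUser=yes
    ProtectSystem=strict
    ProtectHome=yes
    PrivateTmp=yes
    NoNewPrivileges=yes

    [Install]
    WantedBy=multi-user.target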

Our apps (mostly go and elixir) are built on a self-hosted gitlab and we deploy like normal. Each project can pick what it wants, e.g. auto deploy dev branch to dev, auto deploy master to stage, manually-triggered to prod.

We run separate tool servers for our internal stuff, gitlab, elastic search, kibana, prometheus, vault and grafana.

We have some clustering/mesh networking which we “discover” with an internal DNS. But everything is pretty static/fixed. We can survive downtime because either something is HA and stateless (e.g. we keep N monetdb “reporting” servers in sync from the OLTP->OLAP process), or stateful and part of a cluster (e.g., cockroachdb).

Here's the kicker: we have 0 devops. We don't spend a huge amount of time on this. We used to have devops and K8s and everything was worse. It was _much_ more expensive, everything took _a lot_ longer to get done, it all ran slower, it was much buggier, and it was less secure (now we aren't sharing a host).

I feel like we could build almost anything this way. I don’t see what complexity we’d run into that Docker would help with. I think we’d run into management issues…applying patches to hundreds of servers, reviewing hardening and access logs, etc. But that’s infrastructure complexity, not application complexity.

We use docker in apps that require a complex setup for end-to-end testing.

What you have sounds great, and sounds about the same as running containers with orchestration, just via systemd cgroups instead of container cgroups, and bash and humans instead of a real orchestrator. You lose a lot of features (idempotent and self-contained deployments (so zero risk of dependency clashing), cluster-wide auto-healing, load distribution and autoscaling, centralised management) and the ease of use of existing solutions (want to run a CockroachDB cluster? There's probably a k8s operator for that; want automatic DNS? Consul/Nomad and k8s already do that), but you understand it. There could be very serious downsides around maintenance of your bash scripts and onboarding new people.
Personally, I don’t want to think about all this complexity. I just build simple web applications. I want a box in my basement that has everything production has – the whole thing, all the web services running all the time.

It ought to be possible in my mind. After all, there is really only one user: me. I just want to be able to automatically get the latest code that reservations team or order management system team or what have you deploys to production.

Why do I need to connect to VPN just to work on a stupid web application? None of the code we write has any special sauce in it and we can have fake data…

I love Docker as a home user as well. Sometimes when I'm bored I'll build an IM bot. Docker makes hosting it super simple. I just start from the ruby Docker image, copy my script in, and then dump it on GitLab, which builds and hosts the container and lets me pull it from my server.

I can then copy a simple systemd config to start it on boot and restart it if it fails. This is all much simpler than managing multiple versions of Ruby installed locally. Not to mention the security benefits of keeping it in a container. Perhaps this is all so easy and convenient for me because I learned Docker for pro use and now it's just mentally free for casual use.
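
The systemd side really is just a handful of lines, roughly like this (the image and container names are placeholders, not my actual setup):

    [Unit]
    Description=IM bot container
    After=docker.service
    Requires=docker.service

    [Service]
    Restart=always
    # remove any stale container before starting a fresh one
    ExecStartPre=-/usr/bin/docker rm -f imbot
    ExecStart=/usr/bin/docker run --rm --name imbot registry.gitlab.com/example/imbot:latest
    ExecStop=/usr/bin/docker stop imbot

    [Install]
    WantedBy=multi-user.target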

Yeah I have seen this but for some reason it just wasn’t working for me. I think that maybe the docker service just wasn’t being started at boot but once I set it up with systemd it all just works now.
I find it painfully reliable. I forget about it and then weeks later I find the image when scrounging about trying to work out where my RAM went.
I started using Docker heavily about 3 years into my current project and that was the right time. There's real overhead to getting things working in containers compared to a plain local dev environment, but there are major benefits, especially once you actually do need to run the same thing on multiple machines. It's a pretty classic example of something with a high constant factor but better scaling properties.
Unfortunately I also had to waste my time learning containers when all I wanted was Heroku without the insane prices for addon services. For delivering just apps I think Cloud Native Buildpacks solve that: https://buildpacks.io/
Buildpacks predate Docker. They’re standardized build scripts for popular app runtimes. Heroku created the concept and applied it for their PaaS years before Docker was a thing. They’re not a competitor to Docker containers. They are perhaps a competitor to Dockerfiles.
They detect the kind of app and then basically do everything required to build and run that app from pushing a repository. So it’ll detect that it’s a NodeJS application because there’s a package.json file in the root and essentially: install node modules, run the build script and then do npm run start to run it.

All you need to know how to do is: supply a build + start script and push your repository and now it’s built and running in the Cloud to scale from 0 to unlimited.

It’s basically Heroku but buildpacks are a standardised way to do it on any Cloud.
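
If you want to poke at it locally, the pack CLI does the same thing against your Docker daemon; something like the following (the builder name is just one example, and the app is assumed to read $PORT):

    # build an OCI image straight from the app source, no Dockerfile needed
    pack build my-app --builder heroku/buildpacks:20
    # run it like any other container
    docker run --rm -e PORT=8080 -p 8080:8080 my-app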

Render.com is like Heroku but cheaper and handles static sites as well as distributed applications (nodes can talk to each other).
Have you tried Dokku? Very similar to Heroku. I run it on a DO server without any issues.
Used them years ago when we moved from AWS to bare-metal (losing Elastic Beanstalk that we were using in AWS in the process)

I seem to recall a few minor issues here and there, but I’d totally use them again.

> … the “microservices” that our lead thinks we need even though they aren’t even microservices. All we’ve done is split our application into separate mini applications that don’t talk to each other at all,

Replacing in-app API/library calls with RPC is a micro-service anti-pattern (1). If the services don't need to communicate, you have probably made the correct split – or you could think backwards: why would these services that are independent and don't talk to each other need to be merged into a monolith?

1) There are of course exceptions: it could be a good idea to separate things for performance, e.g. creating worker processes, which works well if they are essentially pure functions.

As my reply far up this chain says, the services have a ton in common. So much so that the idea was "just copy and paste" code between them. Still sound great? I forced his hand on a shared library that houses lots of the code all services need, but not everything can (or should) go in there.

Multiple PRs just to complete one task sometimes and we’re talking a small 4-6 hour task. If the services were truly independent this wouldn’t be needed and the approach wouldn’t be a poor developer experience and infrastructure headache. But a change in one often requires a change in another. We don’t have a monorepo, because he was too much of a doofus to figure it out and gave up instead of asking for help.

80% of the problems are because the guy doesn't know what he's doing, doesn't know architecture and just said we're doing this. Which means I have to go in and fix his poorly built Dockerfiles, which are an exposé of what not to do. We're now adding API Gateway. Was this explained as to why we need it? No. Did I ask? Yes. But I simply get "I've already explained it". Cool. I explained to you why running an update on packages in your build instead of installing from a lock file is some of the dumbest shit I've ever seen, yet I just had to go in and clean up your Dockerfile again. You want untested packages going into stage/prod? My lead is your guy. I'm sure in a few weeks he'll come to me with "But API Gateway doesn't work and I don't have time, can you fix it".

Fuck this dude. I just want to write clean code and not fuck with his mistakes. Did I mention we run completely different Dockerfiles between environments (local vs stage/prod)? Like, not even the same OS (Ubuntu vs Alpine), web server (Apache vs nginx), etc. Getting the picture of what it's like to deal with his mistakes day in and day out while slowly fixing them?

It's super simple and it works for you – yay! Great! If it continues to work for you – double yay!

However, as soon as I read it, I saw a lot of red flags:

– Do you really want to copy from a development computer to your production? No staging at all? (“go test” doesn’t mean that you have 0 bugs)

– Are you really sure that everything works exactly the same on different versions of Go? (Hey, a new guy in your company just installed unreleased Go 1.17, built on his notebook and pushed to production)?

– That VM with systemd died at 1am. No customers for you until 7am (when you wake up)

BTW, I am not saying that you should do Docker or CI/CD. What I am saying is that when you cut too much out of your process, you increase risk. (As an example, you didn't remove the unit tests part. Based on "Anything that doesn't directly serve that goal is a complication" you probably should have. However, you decided that would be way too much risk.)

Exactly my reasons: I don't use Docker because it's great that I can (could?) scale to the universe.

In my case, I simply use Docker because it is so easy to set up n HTTP services listening on ports 80/443 on the same server and then put a reverse proxy (traefik) with Let's Encrypt in front of them. I don't need to worry about port conflicts. I don't need to worry about running Nginx and Apache on the same host. I don't need to worry about multiple versions of Go/PHP/Dotnet/[insert lang here].

Still, I can't scale (single machine) but I don't need to. I don't have a failover, because I don't need one. But I have so much simpler management of the dozen services I run on the host. And that's worth it IMHO.

I think it's always about the right tool for the job. And I think if the OP does work with an automated script and scp, there's nothing wrong with that. Because that also adds reproducibility to the pipeline and that's just such an important point. As long as nobody SSHs into prod and modifies some file by hand.
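
For anyone curious, the per-service bit really is just a few labels in the compose file. A sketch, assuming traefik itself is configured with a 'websecure' entrypoint and a 'letsencrypt' certificate resolver; the hostname and image are placeholders:

    services:
      some-service:
        image: example/some-service:latest
        labels:
          - "traefik.enable=true"
          - "traefik.http.routers.some-service.rule=Host(`service.example.com`)"
          - "traefik.http.routers.some-service.entrypoints=websecure"
          - "traefik.http.routers.some-service.tls.certresolver=letsencrypt"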

100%. For the startup we’re starting/working on now, we’re running microk8s on bare metal (dedicated servers).

What you describe is a big reason for it. Once you have k8s set up, services can be very easily deployed with auto TLS, and basic auth or oauth2 is really simple as well.

So we’re big believers in vertical scaling, but still use k8s (microk8s or k3s) for these kinds of benefits. An additional benefit of this is that it makes scaling/transitioning to a bigger/managed k8s easy down the road.

It might sound like overkill, but it takes about 10 minutes to set up microk8s on an Ubuntu LTS server. It comes with nginx ingress. Install cert-manager (another 5 mins) and you don't even need traefik these days. All configs are kept in Git.

> it takes about 10 minutes

After you've spent weeks / months reading blogs, installing dependencies, tweaking parameters, playing around with deployments and figuring out new names for the same old stuff, debugging that extra whitespace in the yaml file, only to figure out oww, you could use helm charts for deployments, which are eerily similar to how JSP looked 15 years ago and everything that was wrong with templating deployments rather than scripting them.

Then it takes about 4 mins.

And now you get to do brown bag sessions with your baffled team members! Yay!

But only till kubernetes evicts your pods for being too resource “hungry”. Gotta keep prod up folks. Better grab a coffee and kubectl (“kube-cuttle” is it?) into prod to check why the pods have restarted 127 times.

All these “it takes 10 minutes” should really be “it takes 10 minutes plus several hours/days/weeks actually learning how to run and maintain all this stuff”.
Precisely. Some weeks reading, then 10 minutes doing (well some hours or a day doing, more likely).

And then half a day downtime every now and then maybe? Until one knows enough about the new stack

I would even go to a lower level. I use Docker swarm and Docker compose. It is reliable, simple and very effective.
> Do you really want to copy from a development computer to your production? …

> Are you really sure that everything works exactly the same on different versions of GoLang? …

He mentions he does have a build server which runs a 10 line shell script to download code and build the binary.

Builds happen on that server, and I assume it handles deploying the compiled binary (and systemd script?) to the target as well.

The build server would also have a "blessed" Go version. Code from the new guy that uses new, not-yet-blessed features would not compile.
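
Sketching the kind of script described (the repo, host and paths here are invented for illustration, not the author's actual setup):

    #!/usr/bin/env bash
    set -euo pipefail

    git clone --depth 1 git@example.com:acme/app.git /tmp/build
    cd /tmp/build
    go test ./...
    go build -o app ./cmd/app        # built with the build server's "blessed" Go toolchain
    scp app deploy@prod.example.com:/opt/app/app.new
    ssh deploy@prod.example.com 'mv /opt/app/app.new /opt/app/app && sudo systemctl restart app'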

> That VM with systemd died at 1am…

Your docker host died. All your containers die along with it. Docker alone cannot solve this category of issues anyway.

The infrastructure layer is expanding, to meet the needs of the expanding infrastructure layer
And that would be much easier to implement and understand.

We seem to be obsessed with producing the most complex machines we can instead of creating simple, understandable machines.

Now you have 3 problems.

It’s not solving “your system went down”, it’s adding more layers of system that can go down.

Except now your system can handle the inevitable server going down without taking the entire site offline. It DOES solve the problem of a single host failure causing an outage. Yes, there are other types of outages you can have, but it certainly does reduce the occurrence of outages significantly.

Are you really trying to suggest that people can’t use Kubernetes to increase their reliability?

Yeah, I guess I am. It’s adding whole layers of complexity and configuration to the system. I understand that those layers of complexity and configuration are designed to make the system more resilient, but it depends on everyone getting everything right all the time. The “screw-up surface” is huge.
Ever seen a large system that has its own job server and scripts for orchestration/deployment? Application code that checks the status of its peers and runtime env to determine what should run? All glued together with decades-old Perl and bash with no documentation.

I’ll take “more configuration in yaml” over that.

that’s not a 1:1 comparison though.

Leave your nice clean K8s deployment paradise to cruft up for decades, and will it be any better? I doubt it – there’ll be old Dockerfiles and weird bits of yaml that shouldn’t work but do, and upgrading a version of anything will break random things.

So yes, I think I would prefer the decades of crufty perl and bash to decades of crufty outdated yaml. At least the bash scripts have a hope of doing what they say they do, and are likely to still execute as intended.

Hum… No, Kubernetes is not an HA solution.

One can certainly create an HA cluster over some infrastructure set up by kubernetes, just as well as one can take a bunch of physical servers, set them up by hand, and create an HA cluster with them. K8s isn’t adding anything to the availability.

> Docker alone cannot solve this category of issues anyway.

Docker does come with an orchestrator out of the box; it's called Docker Swarm. You may not use it, but it's there and it's up to you to use it or not. It's extremely simple to set up: a single command on the manager and another one on the worker. It supports health checks, replication, etc., all super simple to set up too.

Sure, doing all of this will take, what, 30 minutes? Instead of the 5 he took for his deployment, but it does solve that issue, natively, out of the box.

Oh, and my Docker image always has the "blessed" [insert environment here] version, so everyone always uses it while testing locally. If you need to update it, anyone can do it easily, without any knowledge of the build server environment, nor any special access to it.
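
For reference, the whole dance is roughly this (the join token, addresses, image and health endpoint are placeholders):

    # on the manager
    docker swarm init
    # on each worker, using the join token the manager prints out
    docker swarm join --token <token> <manager-ip>:2377
    # a replicated service with a health check (assumes curl exists in the image)
    docker service create --name app --replicas 2 \
      --health-cmd "curl -f http://localhost:8080/health || exit 1" \
      example/app:latest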

– Staging: The world ran well with people pushing PHP files onto live environments, and it will continue to run well long after Docker gets replaced with something else.

– Versioning: It's pretty easy to ensure you have the same versions on the same platform.

– Systemd: None of this means he does not have PagerDuty or a similar setup.

Why do I say all of this? Because I ran really good businesses with a similar architecture to his back in the day. Sure, I run Docker now, but sometimes we tend to overcomplicate things.

If you have one app and one server, there is no good reason to run a layer of Docker on it. None.

Elixir, Ruby, PHP, Node — if your business has a monolith and can run on one server, guaranteed there is less to worry about when you remove Docker.

> The world ran well with people pushing PHP files onto live environments

No, it didn’t. The world didn’t fall apart, but it absolutely burned out lots of people who had to deal with this irresponsible way of doing things.

The way we do things is much, much better now. It is more complex, but that can be worth it. Don’t romanticize a past that was absolute hell for a lot of people.

Source: inherited and maintained many of these dumpster fires.

> Source: inherited and maintained many of these dumpster fires.

And you imagine whoever inherits what’s become of your current projects in 10 years is going to be happy? Shrug

No, but they will have some combination of declarative infrastructure, build scripts with error messages, and Docker images as a starting point.

I still maintain some of my 10-year-old code, by the way. Once I got it to build and deploy with modern tools, it has been much, much easier to keep it updated with patches and the latest server OS.

>Elixir, Ruby, PHP, Node — if your business has a monolith and can run on one server, guaranteed there is less to worry about when you remove Docker.

For Ruby at least, you run into the problem of keeping all the development environments the same. It's not insurmountable by any means, but it's a constant nagging annoyance, especially once we start talking about working on multiple projects that may be using different versions of Ruby, Postgres, etc. Being able to do a docker-compose up and have the exact same environment as production is huge.
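
i.e. something like this checked into the repo, pinning the same versions as prod (a sketch assuming a Rails app with its own Dockerfile; versions and commands are just examples):

    version: "3.8"
    services:
      web:
        build: .                      # assumes the app's Dockerfile is in the repo
        command: bundle exec rails server -b 0.0.0.0
        ports:
          - "3000:3000"
        depends_on:
          - db
      db:
        image: postgres:13            # pin the same major version as production
        environment:
          POSTGRES_PASSWORD: dev-only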

Docker has nothing to do with any of that though.

Their artifact is a single statically linked binary/executable, not a Docker container. They can build that binary once and pass it along a deployment pipeline, i.e. dev, test, prod, changing config parameters for each environment but running the exact same code.

Systemd, just like the various container runtimes, supports auto restarts + logging. You can have the same alerting tools hang off your logs etc. etc.

The fact they are not using Docker does not mean they can’t have a proper build/deployment pipeline. The fact they are dealing with a single static executable makes building a pipeline and robust deployment far simpler as they have far fewer moving pieces.

If the author was deploying python, javascript, ruby apps where you don’t get a static executable artifact or fat jar with all dependencies bundled then Docker would make sense.

I’ve been struggling with this for years now. Every time I ask the question “why do I need Docker when Go produces static binary executables?” I get some reasonable answers, but nothing along the lines of “you can’t do X without Docker”.

I totally grok the need when your deployable is a few hundred script files for a very specific runtime and a large set of very exact dependencies. But that’s not the situation with Go.

And now //go:embed too. Adding all my templates, static files, everything, to the binary. Ship one file. Awesome.
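
Something like this is all it takes (a minimal sketch; it assumes templates/ and static/ directories exist in the module at build time):

    package main

    import (
        "embed"
        "log"
        "net/http"
    )

    // templates/ and static/ are compiled into the binary at build time
    //go:embed templates static
    var assets embed.FS

    func main() {
        http.Handle("/", http.FileServer(http.FS(assets)))
        log.Fatal(http.ListenAndServe(":8080", nil))
    }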

Yeah, there isn’t really anything that you just can’t do at all without Docker. The question is whether it’s a net positive or negative on your architecture as a whole. Eg, I deploy some Go apps using docker (basically build the executable and then build a docker image that just contains that executable). Looking at just the single application, it’s pure overhead to add Docker vs just deploying the binary executable somewhere and running it. But in the overall context of my setup, since I’m running other apps as well that are written in other languages/frameworks and have different characteristics, it’s a huge net positive for me to have a single uniform interface for deploying and running them. Everything gets pushed to the same container registry the same way, is versioned the same way, can handle service discovery the same way (by listening on the docker port), can do canary deploys the same way, etc. I can use docker-compose to bring up a dev environment with multiple services with a single command. I can deploy the container to Cloud Run or ECS or an equivalent if that makes more sense than running servers.
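
The image in that kind of setup is about as small as a Dockerfile gets; a rough sketch (the Go version and package path are placeholders, and it assumes a CGO-free build):

    # build stage: compile a static binary with the pinned toolchain
    FROM golang:1.16 AS build
    WORKDIR /src
    COPY . .
    RUN CGO_ENABLED=0 go build -o /app ./cmd/server

    # final stage: nothing but the executable
    FROM scratch
    COPY --from=build /app /app
    ENTRYPOINT ["/app"]
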
I've just been going through this process for my product now. I have a bunch of deployment scripts that control the test, build and deploy of the app.

I can’t run them on a Docker container (the final hurdle was that Docker can’t run systemd). So my choice was to either add Docker to my production servers, or drop Docker and use Vagrant instead for localhost VM for dev.

Again, I couldn’t see what Docker was adding to the mix that was of value – it would be an additional layer of configuration, complexity and failure on the production servers. It wouldn’t save anything if the server went down, it would complicate attempts to restart the app if that crashed, and it gives us… what?

Again, I get it for Rails or Django builds (and similar) where the environment is complex and dependencies have to be managed carefully and can conflict horribly. But I just don’t have that problem. And it’s a real incentive to stick to the one language and not introduce any more dependencies 😉

In my opinion whether those are red flags depends entirely on the context. How many people work on the project, code base size and age, etc.

I feel these days projects are often starting out with way too much complexity and shiny tools. It should not be about cutting out things, but instead about adding things at the point they really add value.

None of the points you list have anything to do with Docker.

1. No I don’t, I still would potentially not use Docker in many cases (but I would use a CI, which might or might not run in a docker image, but it’s not the same as deploying docker images).

2. Depends on the language I’m using, for some languages I would be afraid of accidental incompatibilities. For others I’m not worried and would be fine if roughly the same OS is used in CI and production.

3. Can happen with Docker too; on the other hand, auto restarts (VM or not) exist independently of Docker. I'm not sure why you mention systemd here, it has very reasonable "auto restart if not alive" features.

Though in the end I'm increasingly drifting towards using images, but I really don't want to use Docker in production. I can still do what people normally expect from "using docker" without Docker, e.g. by using podman or other less "root"-heavy ways to run containers, with appropriate tooling for reliability (which yes can be systemd + rootless podman in some cases).

I don’t think any of these need docker though?

You can copy binaries to/from staging just fine.

They have a CI job that builds and pushes code. So it does not matter what the new guy did on his laptop.

I am not sure if you are going for “monitoring” or “redundancy” in your dead VM example, but docker by itself cannot provide either of those. You need some solution either way.

I use Docker daily because our app is big and has tons of system dependencies; I have no other choice. We need fast version rollback/update and the regular approach of installing deb packages will not work well. I dream of NixOS and Sylabs, but my org is not going to switch to a radically new technology. But if someone can set up their system so they don't need Docker – more power to them, I can only be envious.

It is not NixOS vs Docker, it is NixOS vs Docker/Ubuntu.

Most of the day-to-day OS problems don't come from Docker, they come from the base Linux distribution. And Ubuntu was released in Oct 2004, and it is in big part Debian, which was released in 1993.

There is a big advantage when you run the same OS on developers’ desktops and on server. It is also great when all sorts of weird third-party packages already know and support your OS.

This is the advantage of Docker I suppose — it does not “get in your way” and lets you use the same things you used to do before it.

> I am not sure if you are going for “monitoring” or “redundancy” in your dead VM example, but docker by itself cannot provide either of those. You need some solution either way.

You are the second one to say this here… That explains why Docker Swarm lost traction versus Kubernetes; it's a marketing issue.

I just set up a Docker Swarm cluster recently; in 5 minutes it was redundant. I didn't add any other software, just the basic docker swarm command that ships with it.

I actually didn't need the redundancy part at all, but I wanted a second server that could ping the first one and send me an alert. I was going to simply put it on a machine, with systemd like him, but it was just as easy to run it on the machine using Docker. Hell, I could run it on both even more easily than doing that twice with systemd… I don't even know how to use systemd, now that I mention it…

Source