What's wrong with 2006 programming? (antirez.com)
157 points by antirez on Oct 5, 2010 | 47 comments


Just to amplify his point, if you want your program to take page faults as PHK suggests, it has to be multithreaded. If you choose event-driven concurrency you can't afford to take page faults in mmap() or read(). When you make the threads vs. events decision you're implicitly making a bunch of related decisions about I/O and scheduling as well; a hybrid approach (like using events and mmap) won't work well.


Exactly: one of the two must be threaded, either the swapping subsystem or the part serving clients. Since Redis serves clients in an event-driven fashion, our VM I/O part is threaded. But it was much simpler to design a threaded VM than a fully threaded Redis, and anyway there were other good reasons for implementing VM at the application level.


Be sure to check PHK's comment (#5)

madvise + mincore syscalls could be used to get the kernel to preload the pages asynchronously.


Not really: because mincore() does not generate events, you'd have to poll to find out when the page-in has finished, which seems pretty wasteful. If I were doing disk I/O from an event-driven program I would rather use the Flash approach of using a small number of worker threads to perform blocking I/O. PHK says "I consider this a deficiency in POSIX standards, not in the concept of VM.", but you have to go to production with the kernel you have, not the kernel you wish you had.


I agree you have to make it work with the real kernel and not an ideal one :) Another thing PHK misses is that the single-threaded approach solves the database concurrency problems for free, and people are starting to use it more - e.g. Stonebraker in his VoltDB.

Polling mincore wouldn't be that bad, as it would only happen between commands, and at least the core of the algorithm would be simple: before executing a command, check if its data is in memory with mincore; if not, ask to load it with madvise, and go to the next command.
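To make the between-commands polling concrete, here's a rough C sketch of that loop. The function names and the fixed-size residency vector are my own invention, not Redis code:

```c
#define _DEFAULT_SOURCE /* for madvise()/MADV_WILLNEED on glibc */
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if every page backing [addr, addr+len) is resident. */
static int pages_resident(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(page - 1);
    size_t span = ((uintptr_t)addr + len) - start;
    size_t npages = (span + page - 1) / page;
    unsigned char vec[256];            /* plenty for this sketch */

    if (npages > sizeof(vec))
        return 0;                      /* treat huge ranges as cold */
    if (mincore((void *)start, span, vec) != 0)
        return 0;                      /* treat errors as "not resident" */
    for (size_t i = 0; i < npages; i++)
        if (!(vec[i] & 1))
            return 0;
    return 1;
}

/* One event-loop step: run the command now, or start an async
 * page-in and retry the command on a later iteration. */
static int try_command(void *data, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)data & ~(uintptr_t)(page - 1);

    if (pages_resident(data, len))
        return 1;                      /* safe to execute without blocking */
    madvise((void *)start, ((uintptr_t)data + len) - start, MADV_WILLNEED);
    return 0;                          /* requeue; poll again between commands */
}
```

The event loop stays non-blocking: a command whose data is cold just goes back in the queue while the kernel pages it in asynchronously.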

What does Flash use worker threads for?


Flash used worker threads for any operation that could potentially block, like disk reads.

http://www.usenix.org/event/usenix99/full_papers/pai/pai_htm...
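The pattern, roughly (illustrative names, not actual Flash or Varnish code): the event loop hands a request to a worker thread, which does the blocking read() and then signals completion over a pipe the loop already select()s on.

```c
#include <pthread.h>
#include <unistd.h>

struct io_req {
    int notify_fd;       /* write end of the event loop's pipe */
    int src_fd;          /* fd to do the blocking read from */
    ssize_t nread;       /* filled in by the worker */
    char buf[4096];
};

static void *io_worker(void *arg)
{
    struct io_req *req = arg;
    /* The blocking call happens here, off the event loop's thread. */
    req->nread = read(req->src_fd, req->buf, sizeof(req->buf));
    /* Completion "event": the loop sees its pipe become readable
     * and reads back a pointer to the finished request. */
    write(req->notify_fd, &req, sizeof(req));
    return NULL;
}

/* Event-loop side: fire and forget; keep serving other clients and
 * poll the pipe's read end for finished requests. */
static void submit_read(struct io_req *req)
{
    pthread_t t;
    pthread_create(&t, NULL, io_worker, req);
    pthread_detach(t);
}
```

A real server would use a fixed pool of workers and a request queue rather than a thread per read, but the completion-over-a-pipe trick is the same.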


This is exactly how Varnish is set up. Its threads basically exist only for I/O reasons. It takes some serious fu to effectively mix event-driven and threaded programming.


You can use events + mmap, you just need to factor the paging latency into your design. Normally, this might mean mmapping a chunk of data at startup, touching it all so that it's resident in RAM, and then beginning to serve queries, keeping an eye on your total resident set so that it never pages out.
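A minimal sketch of that startup pre-touch in C; the function name is made up and the mlock() step is optional:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static volatile unsigned char sink; /* defeat dead-load elimination */

/* Map a data file read-only and fault every page in before the
 * event loop starts serving, so later accesses don't page-fault. */
static void *map_and_prefault(const char *path, size_t *out_len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                         /* the mapping stays valid */
    if (p == MAP_FAILED)
        return NULL;

    long page = sysconf(_SC_PAGESIZE);
    for (off_t off = 0; off < st.st_size; off += page)
        sink = ((unsigned char *)p)[off];  /* touch = fault the page in */

    /* Optionally pin it so it can never page out (needs privileges):
     * mlock(p, (size_t)st.st_size); */
    *out_len = (size_t)st.st_size;
    return p;
}
```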


The entire point of Redis's VM layer is to serve data sets larger than RAM when the request distribution is such that not everything needs to be memory resident (I believe this is sometimes referred to as the 1:10 problem in the Redis community).

While RAM is far cheaper than it once was, there are still substantial savings in reducing your resident set requirements from TBs to just hundreds of GBs.


What would be the point of being able to spill to disk if you've got to keep everything in RAM? Simple serialization?


The point is zero-copy on load, not being able to spill to disk. Most high-performance, scalable servers I've seen ignore virtual memory entirely and kill (+ restart) the process if it exceeds the physical memory available on the machine. Yes, that means they use pre-1960s technology; sometimes, the price of performance is ignoring the programming conveniences we've come up with in the last 50 years.


Most applications need to perform some computations with the data that is read from the backing store; even simple sorts and searches will vastly outweigh the cost of an extra memcpy. Honestly, an HTTP cache is sort of a perfect case for the way Varnish was implemented. There is very little actual processing, if any, that needs to be done with the data that's read from disk; it just needs to be read from the backing store and shuttled over the socket as fast as possible, with very little friction. So an extra memcpy or two does matter in the Varnish scenario.


Actually the hybrid approach works extremely well, it's just more delicate to architect.

The idea is to process in parallel everything you can and have the services ("processors") communicate asynchronously by events.


Stop thinking of RAM/disk/etc as storage systems and start thinking of them as retrieval systems. Then stop thinking of your data costs as $/GB (storage systems) and start thinking of your data costs as $/(IO/sec/GB) (retrieval systems).

I know everyone these days seems to think that removing a structured query language parser from a database makes every other problem go away, but realistically RDBMS vendors spend millions of dollars trying to fix this exact problem. It's called cache invalidation and it's a hard problem to solve in a general way.

SSDs are just a midpoint in the performance trade-off game.

The OS is the worst at this, DBs are somewhat better, but realistically if you want serious performance out of your application you need to make those choices for yourself, and use all strategies where appropriate: RAM for records you need instantly (memcache, redis, mongodb), SSDs for the stuff you can't afford to keep on an SSD, and hard drives for stuff you can't afford to keep on an SSD.

What you need to think about is the value of your data in dollars per IO/sec per GB ($/(IO/sec/GB)); if the amortized value of that data exceeds the amortized cost of the retrieval system, then buy it. Focus on increasing the value of your data, not reducing the cost of its retrieval, as that will drop by 1/2 every 18 months anyway. Alternatively, change your business model so you are going short on IO/sec/GB (e.g. pre-sell storage so that when you need to buy it you can do so cheaply).

What I'm trying to say is that a picture is worth more to Flickr than it is to Facebook, so Flickr will have an easier time paying for its retrieval systems than Facebook will, because of the costs involved. That's why Facebook had to write their own filesystem for retrieving pictures.

I'd bet that any commercial DB would run rings around redis/mongo/etc if you had your persistent store on a RAM disk and used hard drives for the transaction log. The cost of a SQL Server license is negligible if you're going to buy a server with $200,000 worth of RAM in it. If your data is valuable enough you could just keep everything in SRAM (L1/L2 cache) and buy processors just for the cache.


This has been well understood since antiquity (in the CS world at least). Read Jim Gray's "The 5 minute rule" and the more recent papers that cite it. Most likely the access frequency of your objects does not demand that they reside in L2.

Ultimately there's no need to use a commercial database either, as there are compelling open source alternatives, though if your needs are very specific, a commercial database may be your best tool.


> This has been well understood since antiquity (in the CS world at least).

Yes I believe it was first Cicero that pointed this out. Or perhaps even Aristotle. ;)


> SSDs for the stuff you can't afford to keep on an SSD

I think you meant 'in RAM' there.


Again, the kernel will use a simple LRU algorithm, where the granularity is the page.

I don't think it's accurate to describe any performance critical part of the Linux kernel as "simple." For an overview of the page replacement policy, see http://kerneltrap.org/node/7608. I wondered if CLOCK-Pro [1, 2] had made it into the kernel yet, but it looks like it has not.

This author makes compelling arguments for implementing application level paging. But the nice thing about doing systems work is we never have to rely on arguments alone to evaluate something - show me numbers.

[1] http://linux-mm.org/ClockProApproximation

[2] http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.p...


There was no mention of Linux in the article. This system also runs on the various *BSDs, Solaris, MacOS...


Then it's worth noting that CLOCK-Pro is used in NetBSD.

My point here is that a lot of people have spent a lot of time working on the page replacement problem. I am very open to the idea that an application can beat the page replacement policy in the underlying kernel, for a variety of reasons. But: numbers. Always evaluate. The implementation of these algorithms in practice is always more subtle than our high level understanding, so we need to do real performance comparisons to know if we actually improved anything.

(I'm getting my information from the algorithm's author's page: http://www.ece.eng.wayne.edu/~sjiang/ Jiang graduated from William and Mary while I was a young grad student there.)


I'm a bit torn on this. I realize that you need to benchmark to really prove anything, and that optimization intuitions are often wrong, but on the other hand, you really need a wide variety of real-world benchmarks to approach any form of real-world "proof". In the early implementation stages, I think clear thinking such as this article provides is probably more important than spending 50 times as long setting up an array of benchmarks. Hopefully Salvatore has done some micro-benchmarks during development to guide his thinking, but even if not, it's hard to refute the thinking in this article.

If you think about the problem space of Redis vs Varnish, it's intuitively obvious that Varnish deals with a wide variety of general data without many opportunities to optimize beyond general purpose algorithms such as an OS provides. Whereas Redis has specific data types often with small footprints, and very careful attention paid to the details of optimization for memory and disk usage.


I think you're overstating the time it takes to come up with benchmarks to evaluate performance optimizations. Even micro-benchmarks tailored to showcase your performance under ideal circumstances are a start. The longer you go without doing any performance comparisons, the longer you go without knowing if your work was worth it.

I'm not trying to deride his work - it's a neat project, and I will probably read through his earlier entries more. I'm down with all of the reasons provided, but I recognize that as humans, we tend to believe in things we understand. Hence, we need to evaluate.


Perhaps I am, but why are we assuming he hasn't done any benchmarks? It seems quite likely that his opinion is informed by actual results. Does he need to draw up graphs and spend more time on a blog post to be taken seriously?


Well, yes. I take him "seriously," but I'm not yet convinced his techniques outperform the kernel. That's how systems work is done. If you want to convince people that your way is better, then you need data to back it up.


You can't take any of his arguments at face value? Like the blocking argument?


Remember that I started this discussion off by saying "This author makes compelling arguments for implementing application level paging." I'm down with his arguments - people have actually made them before, and I thought they were good arguments then, too. But I can't say "Yes, I agree, what you have done improves performance" until I actually see performance comparisons.

But that's only the first level point. The second level point is: are the optimizations worth it? That is, if you only improve performance by less than 1%, then it's probably not worth the hassle. These are the sorts of things that experiments can tell you.

If it sounds like I'm being pedantic: well, yes, I am. I do systems research. This is the same standard I hold myself and my peers to. If someone asked me to review a systems paper that claimed to improve something, but had no results, I'd reject it. I recognize this is a blog post and not an academic paper, but my standard for "do I accept that this is a better approach" has not changed. And I have seen plenty of blog posts with experiments.


No, performance of complex systems is really really hard to predict.


Yes, but again, why do we assume Salvatore is just pulling this stuff out of his ass?


Agreed. "Problems" without measurements aren't worth much. My initial reaction was to your assumption about Linux, but we agree that assumptions in general aren't worth much.


How is LRU not simple? The page replacement policy in Linux is simple; it just makes use of the information available - the algorithm does not make the system. Clock-pro is not in Linux and I doubt it ever will be. The goal in creating manageable code is passing the information that you know on to the next person who doesn't, using comments, and Linux doesn't do this at all, so hacking on any non-trivial part is near impossible without a few months of poking.
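For what it's worth, "simple LRU" in kernels usually means a clock/second-chance approximation over per-page reference bits rather than a true LRU list. A toy illustration (not the actual Linux code):

```c
#include <stddef.h>

struct page { int referenced; };

/* Sweep the clock hand, clearing reference bits as we pass (the
 * "second chance"); evict the first page found with its bit clear. */
static size_t clock_evict(struct page *pages, size_t n, size_t *hand)
{
    for (;;) {
        struct page *p = &pages[*hand];
        size_t here = *hand;
        *hand = (*hand + 1) % n;
        if (!p->referenced)
            return here;       /* victim: not referenced recently */
        p->referenced = 0;     /* spare it once; clear the bit */
    }
}
```

The hardware sets the reference bit on access, so the kernel gets "used recently" information for free, which is exactly the kind of information an application-level cache has to track by hand.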


I blame Benjamin Zorn in this paper: http://www.cs.colorado.edu/department/publications/reports/d...

for the whole "programmers shouldn't manage memory" myth. Clearly, if you know what you're doing, you can do better than the OS and/or malloc() does. If you don't know what you're doing, you have bigger problems than writing your own allocator will quickly solve.


It seems a bit strange to blame Zorn for spreading the myth that "programmers shouldn't manage memory" on the basis of that paper, when at about the same time he was writing a bunch of other papers extolling the virtues of customized memory allocators: http://portal.acm.org/citation.cfm?id=172674, for instance.


Did you read the "CustoMalloc" paper? It's more of the same. Just to illustrate, the paper you referenced has the title "CustoMalloc: efficient synthesized memory allocator".

That is, "CustoMalloc" takes a look at memory usage patterns of a particular program, then generates a semi-customized allocator for that program.


Yes, of course I read it. Yes, the allocators are synthesized; so what? The point is that "synthesize a semi-customized allocator for each program" is a very different thing from "just use what the system provides you with; you won't be able to do better".


As far as I can see, Zorn's paper is about mallocs vs. a garbage collector (moreover, specific mallocs and a specific GC), whereas the main article here is about just allocating from VM versus managing your own cache with your own file reads and writes.

And as others have already noted, "'just' allocating from VM" can be better (less copying) once you organize everything right and know how to manage the threads. And the main argument of the main article is "if you can't access it from different threads, VM access can block everything" (that's the two-clients complaint). On that much both are right, as long as nobody tries to make overly general statements like "always just use X."


But the real point isn't "can you do a better job", it's "is it worth it to do a better job?"

If I can use an OTS memory allocator and get within 2% of my hand-crafted hand-tuned allocator, I think I'd need to be extremely perf sensitive to care.


I'm not an expert, but the application's memory usage being opaque to the kernel sounds vaguely like the problems Azul had scaling its Java runtime to large (>2GB) heaps.

Azul recently open sourced some of their kernel patches:

http://www.managedruntime.org/faq

But that is about all I know.


In a vaguely similar vein, the recent Mirage paper showed impressive GC improvements by running OCaml on Xen without an operating system in between. It allowed them to use a much simpler allocation algorithm.

http://anil.recoil.org/papers/2010-hotcloud-lamp.pdf


Something like paging virtual memory for Redis is going to be a core feature. I can completely understand why they would want to keep core features "in house". It's not "not invented here" syndrome if the entire point of your company is to implement things like this.


The code is open source under a BSD style license, so it's not really about benefits for the company, which in this case is VMWare. They make their money elsewhere and pay antirez to hack on Redis, which is a pretty good deal for users of Redis.


Fascinating article. I wonder how SSDs affect his conclusions, though. If you really care, don't you have an SSD? Especially for the recently mentioned 1:10 problem (10G of data, 1G of memory).


Weird to see PHK being referred to as "the Varnish guy". :P


Fixed in the article, sorry I was not aware that the "Varnish guy" was a well known programmer.


It's like referring to Donald Knuth as 'the TeX guy'.


With all due respect, I wish Poul-Henning Kamp the biggest of fame, but it is hard to compare him with a worldwide legend like Donald Knuth.


Especially as everyone should know that he is "the bikeshed guy" :).


What colour should the varnish on the bikeshed be?



