For me, Redis was the first piece of software written in C that I could easily customize with additional features (despite my limited C knowledge). It was written beautifully. I've been learning Rust, and I certainly find it easier to learn than C; I also like that I can dive into software written in Rust without worrying about a GC. You've given me the best of both worlds by writing Redis in Rust. That said, I still have an easier time reading Redis in C than your code, since yours lacks comments and well-named functions and variables. I admire your work nonetheless.
That's a fair criticism, and well taken, but keep in mind I'm learning the language as I go, and most of the commits are just rewriting things because they were suboptimal, not idiomatic, or hard to read. At this stage, I would consider adding comments wasteful.
I also have no intention of making this project live as long or have as many users as Redis does.
If you want good examples of Rust code, you should probably look into other repositories. Today I was checking out how Rust's HashMap works and I found it quite readable:
I don't know Rust, and the only thing opaque to me in that line is the `'a` syntax. I suppose it is related to memory safety.
In plain English:
`search_entry_hashed` is a parametric function. It works in a type-safe way on any kind of K and V. It takes as arguments a reference to a mutable hash table, a hash function, and a key, and returns an `Entry` object whose precise type depends on the types of the parameters.
You could hardly express that more succinctly. You can't understand it by skimming it, but it is a complex definition.
Your guess about `'a` is correct. Here it is used to signify that the return value can only be safely used while the table exists and is not being concurrently modified. This implies that the value is not copied on return; only a reference into the table is returned.
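For readers who don't know Rust, here is a small self-contained analogue of what's being described (a hypothetical stand-in built on the public `HashMap` API, not the actual standard-library internals): a generic function whose returned `Entry` borrows from the table it was given, with `'a` expressing exactly that constraint.

```rust
// A simplified stand-in, not the actual std HashMap internals: a generic
// function whose returned Entry borrows from the table it was given.
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::hash::Hash;

fn search_entry<'a, K: Hash + Eq, V>(
    table: &'a mut HashMap<K, V>, // mutable borrow of the table
    key: K,                       // the key to look up
) -> Entry<'a, K, V> {
    // The returned Entry is only usable while the borrow of `table` lives;
    // `'a` is what lets the compiler enforce that, at zero runtime cost.
    table.entry(key)
}

fn main() {
    let mut table: HashMap<String, u32> = HashMap::new();
    *search_entry(&mut table, "hits".to_string()).or_insert(0) += 1;
    println!("{:?}", table);
}
```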
I'm surprised that RawTable isn't parametrized on the type of hash function. Something like this (possibly wrong way to express it, but you get the idea)
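(The commenter's snippet isn't reproduced here; the following is a hedged guess at the kind of thing they mean: make the table generic over the hasher as well, much as std's `HashMap<K, V, S: BuildHasher>` ended up doing, and spell the generic names out.)

```rust
// A guess at the kind of thing being suggested (the original snippet isn't
// shown here): parametrize the table over the hasher too, and use long
// generic names instead of 'a, K, V.
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash};

fn search_entry_hashed<'lifetime, Key: Hash + Eq, Value, Hasher: BuildHasher>(
    table: &'lifetime mut HashMap<Key, Value, Hasher>,
    key: Key,
) -> Entry<'lifetime, Key, Value> {
    table.entry(key)
}

fn main() {
    let mut table: HashMap<&str, i32> = HashMap::new();
    search_entry_hashed(&mut table, "answer").or_insert(42);
    println!("{:?}", table);
}
```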
"'lifetime" gives no useful semantic information there. All 'somethings are lifetimes; it would be like naming a type parameter "Type," or a variable "variable." If you don't have a specific shard context that would give the name meaning, `'a` is as good as any. I do agree that something like Key or Value would probably be better than K or V.
Yes, in general it is a good idea to give more descriptive names, but in the context of a hash table, K and V are good enough. I seriously doubt there is anyone reading the source code of a hash table who cannot figure out in milliseconds what K and V mean (how fast is the human brain? :).
That depends on whether you know Rust. The language can do things with types and memory that languages like (say) Python or Ruby can't, which means Rust needs extra syntax to mark those things. Most of the `<>` or `'` is to do with types or memory safety.
Before anyone starts using this as a Redis replacement on Windows as the readme suggests, take a look at the TODO file. Notable missing features include:
- maxmemory key eviction
- hash values
- ~2/3 of the set operators
- multi/exec
- lua scripting
This is an interesting and potentially useful effort, but a replacement for Redis it is not.
By the way, if you are looking for a production-quality Windows port of Redis, there is a fork available at https://github.com/MSOpenTech/redis. We (Microsoft) provide it in production as Azure's cache service today, and are committed to continuing to work on it.
Are you part of the Azure cache service? I was an intern on the Edge Caching and Storage team and am joining back full-time next month. If things are the same as they were, it would be worth exploring Redis for our cache, and I'd like to talk details.
I am not, but I work with them closely. If you'd like to talk to the team involved, send me (shankun_at_microsoft_com) your email and I'll connect you up!
Also notice that the main developer only has two months of experience with Rust, so it is probably not as stable and well tested as Redis. As stated in the README, the main goal is to learn Rust.
Since this seems to be just a learning project, note as well that there exist Rust bindings to Redis itself, from Armin Ronacher (though I'm not sure if they've yet been updated to work on 1.0): https://github.com/mitsuhiko/redis-rs
The phrase I like most is "data structure server". It's basically a giant heap that you can fill with data structures - strings, lists, sets, sorted sets, maps, bitsets, and this wacky HyperLogLog thing:
The data structures are all addressed by string keys.
Redis can persist this heap to disk, and load it again, so you get a measure of durability, but the typical use case is for data you can afford to lose - caches, metrics, some kinds of events, etc.
Redis's key non-functional strengths are speed and robustness. Operations people love it because you stick it in production and it just quietly keeps on working without needing attention or setting your CPU on fire.
To my mind, any project should have PostgreSQL as its first data store. But it should probably have Redis as its second, when it finally has some data that needs to be changed or accessed so fast that PostgreSQL can't keep up.
From their official GitHub page: Redis is an in-memory database that persists on disk. The data model is key-value, but many different kinds of values are supported: Strings, Lists, Sets, Sorted Sets, Hashes, HyperLogLogs, Bitmaps.
It simply means that the key-value store is loaded directly into memory (RAM) and is available for fast access, but the data is retained (persistent) even after the application is closed.
It is usually used as a cache store, or for queuing messages to communicate between different processes, whether local or distributed.
To add to that, Redis is a TCP server so you can speak to it from multiple processes on multiple machines (and Redis will easily support huge loads)
Its data structures cover a good part of what you'd need from generic data structures, which makes Redis an easy way to implement logic like, say, intersecting sets of friends common to multiple people, or keeping a sorted set of goods ranked by their amount, all of it shared with other processes.
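To illustrate that "shared data structures" point, a hypothetical sketch using the redis-rs client mentioned elsewhere in the thread (it assumes a Redis server on localhost; the key names are made up):

```rust
// Hypothetical example using the redis-rs crate; key names are illustrative.
use redis::Commands;

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    // Sets of friends per user, then a server-side intersection.
    let _: i64 = con.sadd("friends:alice", vec!["bob", "carol"])?;
    let _: i64 = con.sadd("friends:dave", vec!["carol", "erin"])?;
    let common: Vec<String> = con.sinter(vec!["friends:alice", "friends:dave"])?;
    println!("common friends: {:?}", common);

    // A sorted set of goods ranked by a numeric score.
    let _: i64 = con.zadd("goods", "apple", 3)?;
    let _: i64 = con.zadd("goods", "pear", 5)?;
    let ranked: Vec<String> = con.zrange("goods", 0, -1)?;
    println!("goods by score: {:?}", ranked);

    Ok(())
}
```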
Redis also offers pubsub capabilities in two forms:
- A standard PUB/SUB pair, which does what you think it does
- A blocking pop on a list by one client, and a push by another client, which "wakes up" the first one with the value (sketched below)
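A sketch of that second, list-based pattern with redis-rs (hypothetical key name `jobs`; assumes a local server):

```rust
// Hypothetical sketch: one client blocks on a list, another pushes to it.
use redis::Commands;

fn consumer() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;
    // Blocks until some client pushes a value onto "jobs" (timeout 0 = wait forever).
    let (list, value): (String, String) =
        redis::cmd("BLPOP").arg("jobs").arg(0).query(&mut con)?;
    println!("woken up by {} = {}", list, value);
    Ok(())
}

fn producer() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;
    // Pushing a value wakes up any client blocked on BLPOP "jobs".
    let _: i64 = con.rpush("jobs", "do-something")?;
    Ok(())
}

fn main() -> redis::RedisResult<()> {
    let waiter = std::thread::spawn(consumer);
    producer()?;
    waiter.join().expect("consumer thread panicked")
}
```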
It's a memory cache to take some load off your DB. The good thing about Redis is the useful data structures it supports: lists, sets, hashes, bitmaps, and the somewhat more specialized sorted sets.
I was just thinking that Rust is a great candidate for big-data processing tools, much more so than Java (which is annoyingly used a lot there). Something like Spark or HDFS should be implemented in Rust.
The more cores you have and the more RAM, the bigger advantage GC has. The thing with having lots of RAM is that it's very hard to take advantage of it with on-stack data (which can, at most, use about 1-2% of the total RAM available -- do the math) and thread-local heaps. Once you use thread-local heaps/arenas, you need to shard your data. Any cross-shard access would mean locking, which doesn't scale very well. That's exactly where GCs shine: they let you have scalable, concurrent access to data with arbitrary lifespan. That's why Java is used for those kinds of applications -- it performs and scales much better than Rust can hope to do on large machines.
You are right, though, that if the processing is extremely "batchy" and all data dies at the same time, then it doesn't make a difference.
> That's why Java is used for those kinds of applications
I'm not convinced that's the reason why Java is used for it. There are native alternatives like HPCC which claim to perform better.
As was noted, concurrent access to shared data is not very common in such distributed-computation scenarios. Well-designed processing will avoid it, and thus avoid the need for locking as well.
> There are native alternatives like HPCC which claim to perform better.
The goal with performance is almost never to get the maximum possible performance but the best performance/effort ratio for your needs. This is true even for performance-critical applications. As there are diminishing returns, every increase in performance costs more than the previous one. Very, very few applications are willing to invest an extra 30-50% effort to go from 98% (of maximum performance) to 100%.
As to concurrent access -- see my other comment (depends on use-case).
Of course, but concurrent access to shared data in Rust (let alone when that data has an arbitrary lifetime) carries more overhead (and is much trickier) than in GCed environments. As a general rule, GCs decrease overhead for concurrent memory access.
Hazard pointers are by no means fast, but a single hazard pointer being set per operation on a shared data structure barely registers compared to the cost of the rest of a nontrivial concurrent access most of the time. Scoped threads don't require bumping a reference count on access, either. Concurrent immutable access in Rust requires literally no overhead at all. In the long run, throughput might drop if you have to take a lock on the entire data structure to perform something like a resizing operation, of course, but for many applications this can be completely avoided (and for those that can't, Java isn't much better).
I generally agree with you on GC but, assuming you haven't, I think you would benefit from actually trying to write some of these algorithms in a manually memory-managed system. It's really not as hard as you consistently make it out to be in a language like Rust with very precise control over lifetimes (which can be exploited just as well for the hazard pointer itself as it can for anything else).
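To make the scoped-threads point above concrete, a minimal sketch (using `std::thread::scope`, the current form of scoped threads, which postdates this thread): several threads read the same data with no reference counting, no locking, and no GC, because the compiler proves the data outlives them.

```rust
// Minimal sketch of zero-overhead shared immutable access with scoped threads.
fn main() {
    let data: Vec<u64> = (1..=1_000).collect();
    std::thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                // Shared immutable access: just a borrow, nothing is copied,
                // counted, or locked.
                let sum: u64 = data.iter().sum();
                println!("sum = {sum}");
            });
        }
    });
    // All scoped threads are guaranteed to have finished by this point.
}
```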
1. The use-case I have in mind is an in-memory database because a. that is a significant part of what I do and am most familiar with and b. because in the age of large RAM, its most effective use is embedding large parts of the database within the application process (whether as an actual product, or replicating some part of a database's functionality ad-hoc). While doing so in an environment without a GC is certainly possible, it is both much harder and/or (depending on the effort spent) less scalable with increasing numbers of cores. I have come to that conclusion after spending many years trying to do that in C++.
2. I have no doubt that Rust -- and I take great joy in this fact -- makes a whole lot of things much easier than C++. But there is also no question that Rust is harder to program than most GCed languages. The question from my perspective, then, is not "why not do it in Rust?" but "why not in Java?". There is a good answer to that question: Rust is much more suitable than Java (in terms of performance and RAM footprint) when running in constrained environments. But unless you're targeting a constrained environment, why work even a little harder at all?
More generally, I believe that the most efficient form of memory management in unconstrained environments, in terms of both throughput and worst-case latency -- and I've had this discussion on HN with pcwalton -- is an environment that allows three types of memory: stack, scoped arenas (including a permanent scope), and a GCed heap. Real-time Java (RTSJ) provides exactly that. Combining arenas with Rust's beautiful scoped threads is a great idea. I believe that reference counting has no place in such unconstrained environments.
In my experience, concurrent access to shared data is relatively rare in a well-designed data-parallel platform. Data are exchanged rather than shared, at which point ownership of memory transfers from one worker to another. Rust is great at understanding ownership, and at managing memory in this sort of situation.
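A minimal sketch of that exchange-rather-than-share pattern, using a plain standard-library channel (illustrative only, not any particular dataflow system):

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();
    let producer = thread::spawn(move || {
        let batch = vec![1, 2, 3, 4];
        // Ownership of `batch` moves to the receiving worker; no sharing,
        // no locking, and the compiler rejects any later use of `batch` here.
        tx.send(batch).unwrap();
    });
    let consumer = thread::spawn(move || {
        let batch = rx.recv().unwrap();
        println!("sum = {}", batch.iter().sum::<i32>());
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
```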
Again, that depends on how batchy things are. If all the data is available in advance -- you're right. If it's streaming in, it's not so simple any more.
I'm sorry, but this hasn't been my experience, or the experience of anyone I know writing stream-processing systems. Having written one of each, with a GC and without (Naiad, and the timely dataflow in Rust linked above), I can say dev time on the former was disproportionately spent fighting GC perf issues.
The folks I know doing stream processing over the JVM have garbage collection (and avoiding it) as perf issue #1. For example, you can read about Flink's approach to avoiding GC at
Thanks, I'll read your references, but there are two general points to make:
1. Memory with very clear lifespan is obviously easier to manage without a GC.
2. The fact that GCs are a performance issue does not mean that using a non-GC environment would yield better performance. It is often easier to avoid GC in GCed environments than to handle concurrent access to data in non-GC environments.
GC definitely makes some types of systems easier to develop and reason about, but it also tends to make people be a bit sloppy with respect to memory management. For systems where managing data is one of their reasons for existence, it creates a negative effect. GC'd runtimes also bloat memory consumption to support GC, as well as perform additional "hidden" operations in support of GC (e.g. card marking), even if you manage to not trigger GCs. In addition, GC works well when the generational hypothesis holds. For data intensive applications with mutation/churn, that doesn't tend to be the case.
I disagree that avoiding GC is easier; you're swimming upstream the entire time. Managing memory manually in GC environments is annoying because the platform wasn't designed for that.
As Frank said, you want to avoid sharing mutable data irrespective of GC or not - that won't scale well with larger core counts. Instead, you want to delineate ownership so you have single writer and many readers, and share the data that way.
The key is that most of the effort is not where most of the data is. It's the small percentage of data (say, database index) that requires the biggest development effort, and you want the best mechanism to assist not where most bytes are stored, but where the most effort is invested.
> For systems where managing data is one of their reasons for existence, it creates a negative effect.
I think that people who author such systems/libraries know when to rely on the GC and when not to.
> GC'd runtimes also bloat memory consumption to support GC
When you write, say, an in-memory OLTP database (that's the use-case I know best), you keep the index -- that's where concurrency is managed -- on the heap and the user data off-heap. User data is the bulk of the RAM, but you only pay the GC footprint-overhead for the index.
> you're swimming upstream the entire time.
Going off-heap for large, known-lifespan data sets in Java has been widespread -- idiomatic, almost -- for at least five years now, if not ten. There are so many libraries that do that, that it's very, very simple. Again -- that's just where most data lies, not where the most effort is required. You end up with a very small percentage of your code that is not quite "beginner's Java" -- that's not swimming against the current; that's hitting a few strokes for one tiny section of the journey. It's managing the transactions and update to the indexes that requires 95% of the code, and where you end up thanking your lucky stars for a good GC.
I haven't measured exactly, but the entire off-heap code in our database is under 3%, even though somewhere between 85% and 95% of the data is stored off-heap.
> Instead, you want to delineate ownership so you have single writer and many readers, and share the data that way.
Even single writer/many readers still requires a concurrent access mechanism. A GC helps tremendously with this pattern.
Besides, the best way to manage data -- and that's based on how caching mechanisms work at every level -- is approximately single-writer/multiple-readers: the single writer may change identity at any time, but you ensure that happens rarely. Once you try to make ownership permanent, system-wide locking is required. It's that flexibility that helps with amortizing the costs and making sure you're never hit with the worst case on every transaction[1]. There's some literature touching on that required flexibility, but here's my own take: http://highscalability.com/blog/2012/8/20/the-performance-of... (that flexibility is required to keep the B-tree evenly distributed across nodes -- they can be across the network or even NUMA sockets).
[1]: Changing the owner is the amortization pressure-release valve (all amortized algorithms work on that idea of slowly increasing pressure, that is then released seldom but regularly). It replaces sharding's rebalancing, which usually requires a stop-the-world lock (or most of the world).
>The key is that most of the effort is not where most of the data is. It's the small percentage of data (say, database index) that requires the biggest development effort, and you want the best mechanism to assist not where most bytes are stored, but where the most effort is invested.
There's also a substantial amount of code around dealing with i/o efficiently (and not churning garbage), both net and disk (if there's any persistence involved in the product).
>I think that people who author such systems/libraries know when to rely on the GC and when not to.
Perhaps there's more awareness now, but there are plenty of JVM based systems where moving off-heap (and/or changing code to ease GC workload) came much later in the dev cycle.
>Going off-heap for large, known-lifespan data sets in Java has been widespread -- idiomatic, almost -- for at least five years now, if not ten. There are so many libraries that do that, that it's very, very simple. Again -- that's just where most data lies, not where the most effort is required. You end up with a very small percentage of your code that is not quite "beginner's Java" -- that's not swimming against the current; that's hitting a few strokes for one tiny section of the journey. It's managing the transactions and update to the indexes that requires 95% of the code, and where you end up thanking your lucky stars for a good GC.
IME, you end up not using a good bulk of JDK and 3rd party libs because they allocate frivolously. You end up having to look at internals of other code to see if they allocate or not since nobody in the java space cares about documenting that aspect (because allocation is mundane). If you then need to tune GC knobs, may the force be with you. On top of that, once you go off-heap, you're doing roughly the same style of block memory management that you'd do natively, except less efficiently (e.g. some Unsafe operations aren't JIT optimized as well, some operations aren't even intrinsics so you get JNI call overhead, you always get zero'd memory if you call Unsafe.allocate() even if you then overwrite everything, etc). You're probably going to now tell me that getting 90% of the performance is fine ...
I suggest you look at some non-managed OSS systems that deal with high-perf workloads. Your posts always make it sound like it's nigh near impossible to build anything of the sort without GC, which is ridiculous. If you want to see what a modern C++ rendition would look like that attempts to avoid sharing mutable data, have a look at https://github.com/cloudius-systems/seastar
> IME, you end up not using a good bulk of JDK and 3rd party libs because they allocate frivolously.
That's more your concern, because you care about worst-case (very) low-latency domains. Short-lived objects are not a big concern for those who just wish to avoid the large pauses (and can live with the very occasional query taking 40ms instead of the average 50us).
> with i/o efficiently (and not churning garbage), both net and disk
At least one of the most popular Java IO libraries doesn't create much garbage (Netty 4), and besides, that's not a major concern at all because short-lived etc...
> you're doing roughly the same style of block memory management that you'd do natively, except less efficiently
Only for very simple lifespans. The heap data guards the offheap data wrt concurrency. And so what? All of it still amounts to a tiny, almost negligible, portion of the code, and that portion turns out to be quite simple. I'd rather write a small simple package for stuff that would be taken care of automatically by a non-GC language than write lots of complex, bug-prone code in a non-GC language for stuff that's taken care of automatically in GC runtimes.
> you always get zero'd memory if you call Unsafe.allocate()
But you don't allocate offheap slabs often at all.
> Your posts always make it sound like it's nigh near impossible to build anything of the sort without GC
It's much more work, and requires placing and reasoning about various constraints that may prove problematic. An example later.
Thank you, I will. The example I like to give is that of VoltDB, a "NewSQL" database written in Java and C++ (C++ for the core) which uses internal sharding. Every thread is responsible for a slice of the data. But that database (designed or supervised by none other than Michael Stonebraker) requires locking for any join (even they couldn't have come up with a better way to do it). But their approach displays results that are far from groundbreaking (for modern, in-memory databases), and places a lot of burden on the user to carefully choose a sharding strategy. As another example, MemSQL don't shard, and use a lock-free index (they use concurrent skip-lists). They wrote the code in C++, so obviously that's possible, but put a lot of effort into just implementing the skip-lists (they use hazard pointers), and ended up with results that could have been just as good -- or better -- with a general-purpose GC. BTW, another thing they missed by picking C++, is that in order to make queries faster, they compile SQL to C++ at runtime, compile that C++ and load the resulting code. Java would have given them a much more efficient JIT.
> You're probably going to now tell me that getting 90% of the performance is fine ...
As a general case -- absolutely. In this case, Java gives you more like 96-97% performance for significantly less effort.
EDIT:
I've now read about seastar (and realized I know the people behind it). That approach is the absolute best way to get the lowest worst-case latency possible, if you're willing to pay a significant cost in design (how do you transactionally update data on two shards? -- you can't. Imagine a game object moving from one shard to another -- it will either show up twice or disappear momentarily). But if you're willing to relax the worst-case latency, you can make significant gains in both throughput and ease of development (and still keep latency low enough for almost all domains). That's certainly how I'd design video streaming or a network/phone switch -- not so much an online game, IoT sensor fusion, or a group-chat app.
In many cases distributed computation is needed when there is BIG data, which won't fit on a laptop like in that example. I.e. distribution is needed not only because computation can't be done on one node (i.e. it would take too long), but because data can't possibly fit on any one node.
How big are the jobs being done? I remember reading a statistic[1] from a couple of years ago that 50% of big data jobs were less than 10 GB, 90% were less than 100 GB, and only something like 1% was bigger than a TB. You can fit multiple terabytes on a single machine. I wonder if anyone has newer statistics about this.
A previous project I worked on was persisting more than 10 TB of data per week, data that had to be analyzed, the requirement being to extract meaningful user profiles from it. We also went back to analyze older data with new queries / algorithms quite a lot and playing with a petabyte is fun. It was a startup, it was dissolved at some point, so you've never heard of it.
I don't like the 90% argument. 90% of apps are just dumb frontends to a MySQL database, so you don't need Rust, or Spark. As you can see, that doesn't tell you anything meaningful about the other 10%.
And the point being made (and with which I agree) is there's probably fewer than a few hundred of such deployments in the world, and their problems are very likely not your problems.
I have good context on this from seeing the analytics warehouse for a very popular social network. Even with a multi-terabyte hot database, the analytics dataset fit on a single 2U server full of drives. I have not seen a deployment of Hadoop yet that makes sense. Usually I see things like a 2 or 3 TB log spread across 60 nodes and a job to count entries matching a regular expression (because Hadoop becomes the hammer hunting for nails), which can easily be done in a VM on a Mac in minutes.
That was the thrust of the original post, and I agree wholeheartedly.
If the data already lives on 60 nodes, then transferring all of it to a VM and performing the work there may take longer (and clog the pipes) than just doing the work where the data already is and aggregating the results.
> because Hadoop becomes the hammer hunting for nails
Yes, I agree that it's not an uncommon situation. But regardless of whether legitimately big deployments are common or not, I've never really appreciated the negligence about performance that comes with using Java for such things. Just because it scales doesn't mean the overhead should be ignored, because the cost of the overhead scales too.
So, going back to my original point, having these kinds of systems written in Rust would be much better.
Many parallel algorithms become slower if you make them use too many cores, as they spend most of the time in communication, and very little in actual computations. Maybe the parallel systems would have been faster on a smaller number of cores than 128?
Also GNU utils http://aadrake.com/command-line-tools-can-be-235x-faster-tha...
The above experiment (which has an interesting GitHub repo) is somewhat over the top (and unusable in the real world), but it is still eye-opening. Hadoop and Spark bring so much complexity that looking for simpler solutions is worth considering.
GNU parallel is one of the most underrated projects in the whole GNU universe. Sure it doesn't work for everything, but for all cases where it does work, spinning up 50 ec2 spot instances and just pointing parallel at them is by far the quickest and easiest way to do distributed computing.
Could someone (or the OP) elaborate on the value that re-implementing a whole piece of software in a new language provides, compared to just building an interface "bridging" both worlds?
To clarify, my metric for "value" is usefulness to other people. That is, without considering the (interesting) learning opportunity that it represents for the author.
For example, someone developed a Python interface to the Stanford Core-NLP library (written in Java). Would re-writing the Core NLP library in Python be useful to the community? How do you figure out what people need?
I am asking because while I think it would be a ton of fun and allow me to learn a lot, I also value building useful software, and re-writing a whole system sounds like overkill except for a few very niche cases.
And if I am not mistaken you would need a team at least as large as the parent project to implement new features, fix bugs and keep pace with it. Looking forward to hearing what HNers think!
Aside from this project's own reason, there's one very good reason to re-implement a bunch of stuff in Rust: testing whether it delivers. It's being pushed as a safer, better systems language. So, let's take many different C/C++ apps, rewrite them in Rust, and see what the results are across many metrics. By the end of it, we should know Rust's strengths and weaknesses in real-world coding.
Sorry if I wasn't clear, but I am looking for a more general answer! I would like to know in which cases it is useful (to other people) and discuss its value compared to writing interfaces to other languages.
I had no intention of making the end result useful, but I ran into interesting problems.
First, I wanted to make it as pure Rust as possible. I tried to avoid UNIX-specific code, and since there is no library with Windows support for asynchronous IO in Rust, I was pushed into spawning a thread per client and blocking while waiting for client data. I quickly noticed that the benchmark was way below Redis (around 60% of its ops/sec with 50 clients). But then someone pointed out to me[1] that I was running the tests on a machine with two cores, and this approach may actually be better on machines with more cores[2]. I have yet to try it out and benchmark the results.
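For readers curious what that looks like, here is a minimal sketch of the thread-per-connection, blocking-IO approach described above (illustrative only, not the actual rsedis code; the port and placeholder reply are made up):

```rust
// Illustrative only: accept connections and spawn one OS thread per client,
// blocking on reads.
use std::io::{BufRead, BufReader, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

fn handle_client(mut stream: TcpStream) {
    let mut reader = BufReader::new(stream.try_clone().expect("clone failed"));
    let mut line = String::new();
    // Block waiting for client data, one request at a time.
    while reader.read_line(&mut line).unwrap_or(0) > 0 {
        // A real server would parse the RESP protocol here; "+OK" is a placeholder.
        stream.write_all(b"+OK\r\n").expect("write failed");
        line.clear();
    }
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:6380")?;
    for stream in listener.incoming() {
        let stream = stream?;
        // Simple and portable, but costs one thread per connection instead of
        // multiplexing many connections with asynchronous IO.
        thread::spawn(move || handle_client(stream));
    }
    Ok(())
}
```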
So far the Rust API has been disappointing for network operations. For example, `TcpStream.read()`[3] and `TcpListener.incoming()`[4] do not have a timeout. Maybe that's because its development is driven by Servo and not by servers.
I have thought about doubling down on multithreading and, instead of the global database lock rsedis uses now, having one lock per key (or some other arbitrary partition), allowing concurrent operations, which is hard to do safely in C. But I have not gotten there yet.
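A hypothetical sketch of what per-partition locking could look like (illustrative only; `ShardedDb`, the shard count, and the key-hashing scheme are made up, not rsedis code):

```rust
// Hypothetical sketch of per-shard locking instead of one global lock.
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const SHARDS: usize = 16;

struct ShardedDb {
    // In a real server these shards would sit behind an Arc shared by workers.
    shards: Vec<Mutex<HashMap<String, Vec<u8>>>>,
}

impl ShardedDb {
    fn new() -> Self {
        ShardedDb {
            shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    fn shard(&self, key: &str) -> &Mutex<HashMap<String, Vec<u8>>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % SHARDS]
    }

    fn set(&self, key: &str, value: Vec<u8>) {
        // Only the shard holding this key is locked; operations on other
        // shards proceed concurrently.
        self.shard(key).lock().unwrap().insert(key.to_string(), value);
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.shard(key).lock().unwrap().get(key).cloned()
    }
}

fn main() {
    let db = ShardedDb::new();
    db.set("greeting", b"hello".to_vec());
    assert_eq!(db.get("greeting"), Some(b"hello".to_vec()));
}
```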
> set_timeout has been removed for now (as well as other
> timeout-related functions). It is likely that this may
> come back soon as a binding to setsockopt to the
> SO_RCVTIMEO and SO_SNDTIMEO options. This RFC does not
> currently propose adding them just yet, however.
And on UDP:
> All timeout support is removed. This may come back in
> the form of setsockopt (as with TCP streams) or with
> a more general implementation of select.
I'm on shaky wifi and my phone, so I can't find a citation for this, but I also believe it was removed due to 1) Rust not having any stable representation of time and 2) needing to shim certain behaviors on some platforms, which we decided wouldn't happen in the first round of stable interfaces.
That said, the lack here certainly hurts, and we did manage to stabilize Duration, paving the way for timeouts to return.
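A minimal sketch of what that looks like once such setsockopt-backed methods exist (the names below, `set_read_timeout`/`set_write_timeout`, are the ones that later stabilized in std):

```rust
use std::net::TcpStream;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let stream = TcpStream::connect("127.0.0.1:6379")?;
    // Corresponds to SO_RCVTIMEO / SO_SNDTIMEO as discussed in the RFC quote.
    stream.set_read_timeout(Some(Duration::from_secs(2)))?;
    stream.set_write_timeout(Some(Duration::from_secs(2)))?;
    Ok(())
}
```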
Thanks for the clarification. I wanted to subscribe to Rust's internals debates and proposals, but I was not sure how to find them. Should I be looking at https://github.com/rust-lang/rfcs or is there somewhere else?
1. We keep open issues on that repo to track ideas.
2. At some point, someone may decide to formally propose an idea. They may or may not post a "pre-RFC" to internals.rust-lang.org to get early feedback.
3. An actual RFC will get filed at that repo as a PR.
4. The relevant sub team will check it out and comment, and at some point, consensus will be reached either way.
5. The RFC will go to 'last call' for a week, making sure all voices have been heard.
6. Assuming nothing in last call blocks moving forward, the RFC will be accepted.
7. The RFC will be implemented.
So, in short, yes, subscribing to that repo will notify you of the relevant discussions.
IRC is used for quick, casual conversation. The internals forum is used for discussion, "pre-RFCs", and the like. The rfcs repo is used for actually discussing formal RFCs, as well as wishlist language and library bugs. The rust repo is for the actual implementation of rustc and the standard library itself.
Your comment is interesting to me because, although I've only looked at Rust from a distance, its seeming lack of features related to server-side code (threading, non-blocking IO and networking APIs) has made me wait before I actually start coding in the language.
It depends on what you're doing. See https://github.com/jonhoo/volley, which uses blocking IO and 1:1 threads, and we're still damn fast.
There are also libraries like mio that can give you non-blocking IO. Rust's standard library is deliberately minimal; exploring the ecosystem a bit is helpful, though admittedly not as easy as it could be. Working on it!
Right. People should realize that there is no latency difference between blocking and non-blocking IO -- in fact there is a small difference in blocking IO's favor. The difference is in throughput when there is a very large number of concurrent requests, namely more than about 10K and certainly more than 2K. If you're serving under 2K concurrent requests, blocking IO is probably a better choice than nonblocking. Of course, the number of concurrent requests you need to serve depends on the time you spend serving each request; Little's Law and all that.
>Could someone (or the OP) elaborate on the value that re-implementing a whole piece of software in a new language provides, compared to just building an interface "bridging" both worlds?
Making it safer and even catching bugs in the original implementation (both things Rust will help with)?
Making it integrate seamlessly with the new language's ecosystem? E.g. Lucene is Java, and someone could use that, but there are tons of ports of it in C, C++, Python etc., providing the convenience of integrating it with projects in those languages.
>And if I am not mistaken you would need a team at least as large as the parent project to implement new features, fix bugs and keep pace with it.
Not necessarily. A project with 10 part time contributors could be matched with a project with 2-3 full time competent hackers for example, or even surpassed.
> Lucene is Java ... there are tons of ports of it in C, C++, Python, etc.
There used to be several ports, though most are dead and/or several major versions behind. A new C++ or Rust port would be great, though unrealistic given the huge project size.
This project specifically is probably not useful for others just because it is written in another language (unless it were to succeed in fixing problems or improving security or performance.) Redis is a server application, not a library, so there's a clearly defined and interoperable protocol to bridge the Redis server to other languages already. Writing a Redis client for the other language would be more useful.
More generally, as coldtea mentions, easier integration into the rest of the language's ecosystem is the primary benefit of rewriting in another language.
The value of such a port to others depends on how easy it is to integrate the two languages, either via libraries or other methods. The harder it is to integrate them (and the fewer automated translation tools there are), the greater the value of the rewrite to others.
Your Core-NLP example is actually an interesting one, because that library has already been ported to other languages... It is available for the C#/F# ecosystem (http://sergey-tihon.github.io/Stanford.NLP.NET/).
In this case, it's a learning exercise. Sometimes, it's because there are other benefits, such as making it easier to distribute or easier to tie into specific parts of the algorithm than may be possible by calling a library that provides a high level interface.
From a learning perspective, reimplementation has a key advantage over other kinds of projects: the design is completely done, so you can focus exclusively on the mechanics of implementation.
I like this kind of project.
But the use case of Redis is a bit extreme; I mean that the main features of Redis are its speed and the way memory consumption is handled.
If these requirements are not satisfied, it is only a very good way to learn Rust (which is the author's goal) and Redis internals.
So how do you re-implement something like Redis in another language?
Is it more of a translation job, or do you start by splitting out the concepts and trying to implement them?
Or do you take the idea and go your own way with the implementation?
I'm trying the same thing for similar reasons in Go, but I'm wondering if at some point a Go version would perform better than C. On a machine with a large number of cores, perhaps?
GitHub.com/sudhirj/restis
Also wondering if some rethinking is possible - would an HTTP interface a la DynamoDB be more useful? Could complexity be reduced and performance increased by using a purely in-memory backend with no disk persistence? If there were pluggable backends, would a Postgres or DynamoDB backend be more useful for terabytes / petabytes of data? Is the beauty of Redis the API or the implementation?
> but I'm wondering if at some point a Go version would perform better than C.
The answer is "no" with a certain amount of probability. Redis isn't single threaded by lack of capability, but by design. Concurrency for multiple CPUs will actually slow down a lot of the stuff you see, as you will need to introduce locking mechanisms.
Also, garbage collection is highly tuned and customized in Redis to the use case of an in-memory-DB (in stark contrast to usual allocation patterns of an application), up to the point where it's almost impossible to replicate the performance in a garbage collected language.
I love Go and we're a 100% Go (and Angular) shop, but for an in-memory DB it wouldn't be a sane choice.
You can turn off disk persistence in Redis. My main usage of Redis is with disk persistence turned off (it handles a few hours' worth of data samples that we don't mind losing - we only care about long-term averages of the data in question).
There should be minimal overhead from having the capability in Redis due to the way it implements disk snapshots (RDB snapshots are done by fork()'ing and relying on copy on write to let the main process keep on doing its thing while the child process writes the snapshot, so the main process doesn't need to care; other than that Redis offers logging/journalling of changes, but the cost of having that as an option is trivial if it's switched off).
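A hypothetical, Unix-only sketch of that fork-and-copy-on-write idea, assuming the `libc` crate (illustrative only, not the actual Redis implementation):

```rust
// Illustrative sketch of snapshotting via fork() and copy-on-write.
use std::collections::HashMap;

fn snapshot(db: &HashMap<String, String>) {
    // SAFETY: calling fork() like this is only reasonable from a
    // single-threaded process; real servers have to be much more careful.
    let pid = unsafe { libc::fork() };
    if pid == 0 {
        // Child: sees a copy-on-write view of `db`, frozen at fork time, and
        // can write it out at leisure without blocking the parent.
        println!("child snapshotting {} keys", db.len());
        std::process::exit(0);
    }
    // Parent: returns immediately and keeps serving requests; pages are only
    // physically copied when either process writes to them.
}

fn main() {
    let mut db = HashMap::new();
    db.insert("greeting".to_string(), "hello".to_string());
    snapshot(&db);
}
```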
Having pluggable backends for things like Postgres or DynamoDB seems a bit at odds with the purpose of Redis, which is exactly that you pay the (very low) cost of in-memory manipulation of simple data structures, though if a single Redis server could partition the key space between plugins, it might potentially be useful by letting you e.g. move keys between backends and still access them transparently to the client. E.g. for the samples I mentioned above, we roll up and transfer data to a CouchDB instance for archival now (doesn't matter that it's CouchDB really - we just need a persistent key-value store; Postgres or DynamoDB would also both have worked fine), but if I could archive them while still having them visible in a Redis-like server together with the in-memory keys, that'd make the client a tiny bit simpler.
For most Redis usage, I think paying the cost of connection setup and teardown and sending HTTP headers etc. would likely slow things down immensely. At least it would for my usage. Having a HTTP interface as an addition might be useful in some scenarios to enable new use cases, but as a replacement for the current Redis API would be a disaster.
If you want to explore alternative interfaces, I'd instead suggest going in the opposite direction, and experimenting with a UDP interface. In a typical data centre setting packet loss is low enough that while you'd need retry logic, it wouldn't necessarily get exercised much in normal situations.
(On the other hand, for the typical request/reply cycle it might very well not give any benefits vs. tcp in most scenarios where multiple request/replies are done over a single connection and thus amortising the connection setup cost - would be interesting to benchmark, though)
Spectacular! Could you add synchronous replication, though? And coalescing of queries (so that the entire system processes queries in batches, say 300 times per second)?
What I'm personally really surprised about is that nobody's rewriting Redis as a unikernel to clear away all the OS context-switching/networking overhead from its basic operations.
Redis is already leaving plenty of performance on the table, e.g. by not having any concurrent shared memory data structures (the fastest concurrent hash tables achieve better throughput even on inserts than the fastest single-threaded ones). It does this in the name of implementation simplicity. People focused on implementation simplicity don't generally abandon the operating system.
People run Redis mostly in Linux-in-a-VM (with Redis being the only "user-serving" process) already, though, no? I would think Redis-as-the-entire-VM would be less to think about, operation-wise, at least if your cloud or data-center templates its VMs with something like EC2 AMIs. You would just launch a "Redis appliance" AMI and move on.
It's a feeling less of maintaining boxes, and more equivalent to paying a Redis-as-a-Service provider.
The "Why" is @seppo010's to answer (but having it run as is on all OSs is a big plus for one). As for writing it in Assembler, that makes little practical sense since Redis is written in (ANSI) C and it quite well optimized. In fact, if you profile Redis you'll see that very little time is actually spent by the code itself - OS, storage and network are the real bottlenecks usually.
Care to post details about this? Is this actually fast? Does it implement all features and guarantees of redis? Should anybody actually use this in production (maybe because it works on Windows)? Is it well tested?
Looks like a really cool effort, but authors of open-source projects often think people will read the code and figure it all out; the truth is people usually look at what's in the README, and that's all the attention span most people are going to give it. My 2c: improve your README.md.