> The application can move straight to supporting the new structure and entirely forget that the old one existed.
Not always. Changing the schema can break an application, in particular when the database supports multiple applications.
> It's simple data and some dangling records aren't going to hurt anyone.
It depends on the application. In healthcare people literally die due to some dangling inconsistent records.
The problem with anticipating and working around these inconsistencies is that the workaround is added ad hoc in code, rather than through a model defining the data change. You need a model: http://chronicdb.com/preparing_schema_changes
The fact that ALTER TABLE locks on most databases is not the primary reason changing a schema is hard. The primary reason is that a schema change breaks the application [1].
Migrations are programs, but your existing live application is also a program. Changing the schema breaks your live application to prepare for the new version of the application.
This is true for NoSQL too. If you change the format of your documents, your applications may need to be modified to support both the old and the new format at the same time.
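A minimal sketch of the upgrade-on-read pattern this implies (the field names here are made up for illustration): the application normalizes old-format documents to the new shape whenever it reads one, so both formats can coexist during the migration.

```python
# Hypothetical example: a "name" field was split into "first"/"last".
# During the migration, readers must tolerate both document formats.

def normalize(doc):
    """Upgrade an old-format document to the new shape on read."""
    if "name" in doc and "first" not in doc:
        # Old format: a single "name" string; split into the new fields.
        first, _, last = doc["name"].partition(" ")
        doc = {**doc, "first": first, "last": last}
        doc.pop("name", None)
    return doc

old_doc = {"name": "Ada Lovelace", "role": "programmer"}
new_doc = {"first": "Ada", "last": "Lovelace", "role": "programmer"}

assert normalize(old_doc) == new_doc
assert normalize(new_doc) == new_doc  # new documents pass through untouched
```

The same function can later be run as a one-off batch job to rewrite all remaining old documents, after which it can be deleted.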
Moving databases between clusters is no easy feat. Can you share more details on the man-hours it took to oversee and verify the migration?
I see you mention several days for the data to be copied, but how about the time to put your solution together and verify it? Also, did recomputing checksums take less time between passes, proportional to the size of the remaining changes?
> We didn't find much published material about the best way to do this,
You may want to have a look at our live migration solution: http://chronicdb.com
It goes beyond checksums: it guarantees the data move is atomic and data consistent.
It took a while. I can't put a number on it because a lot of it was done in spurts between other projects. Altogether the project took about four months, but that included refactoring a lot of old code to play nice with mirrored writes and developing some framework-level components. And waiting for hardware. And waiting for the InnoDB conversion. And setting up proper monitoring and alerting. Future projects like that will take a fraction of the time. The migration and verification itself took about two weeks. Yes, passes got proportionally quicker: the first took almost a week, the last a few hours. We did not crank the speed to the max, though, as we wanted to keep the load low and we had time.
In CouchDB, if you want to change your schema you just save the new data.
Your code will have to handle the old and the new form, but that is always true. In CouchDB the data is just a JSON document, so handling the changed model is about as easy as it gets.
I'm not sure what your point is. As you say, this is hard, and no one else has a solution either.
But the nature of CouchDB can help here - views can be used to create backward compatible versions of new data, and reading a JSON document with extra fields won't break existing code.
It's true it's not magic, though, if that is what you are after.
Edit: I see now you are working on http://chronicdb.com/ which looks like it tackles this problem.
10 apps connected to the same database. Changing the model breaks 10 apps. And you may not have access to change the code to any of those apps.
"Database as an integration layer" is an antipattern that should always be avoided (except possibly in the case of reporting applications). CouchDB isn't unique there.
In healthcare, when either the database or app go down people literally die.
Yes, and? CouchDB is reliable, distributable, etc etc. If your point is that you should be careful when you upgrade, then yes, I agree.
Adding extra JSON fields can break code if you have matching entity types, say in Java: if the code tries to "cast" those JSON objects to that data type, I'm fairly certain it will throw an exception when a JSON object has a field that's not in the Java type.
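The failure mode described here can be sketched in Python rather than Java (the field names are invented): binding a whole JSON object to a fixed type fails on an unknown field, while reading only the fields you know about tolerates it. (For what it's worth, Jackson's default behavior in Java is indeed to throw on unrecognized properties unless that check is disabled.)

```python
# Strict vs. lenient handling of a JSON object with an extra field.
import json
from dataclasses import dataclass

@dataclass
class Customer:
    id: int
    name: str

payload = json.loads('{"id": 1, "name": "Ada", "loyalty_tier": "gold"}')

# Lenient: reading only known fields never breaks on new ones.
assert payload["name"] == "Ada"

# Strict: binding the whole object to a fixed type fails on the new field.
try:
    Customer(**payload)
    raised = False
except TypeError:  # unexpected keyword argument 'loyalty_tier'
    raised = True
assert raised
```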
Views are the reflection of the light at the end of the tunnel, but solve less than half the problem to date. Views work on SELECT, but not on UPDATE/INSERT/DELETE.
Also, what makes you say that database as an integration layer should always be avoided?
> Also, what makes you say that database as an integration layer should always be avoided?
15 years of writing software. Every time I've had to deal with a shared database it has caused problems, and every time I've built a system avoiding that antipattern it has worked much better.
Database-as-IPC is about the transport aspect, not the logical integration (which is the problem of back-compatibility).
Ian F Nelson's blog post is about a DB that is tightly-coupled to the app: “my” application is using NHibernate as an ORM layer, so until this invasion we have been able to perform database schema refactorings with relative impunity.
Conventionally, RDB schemas are designed in terms of the data itself, not a particular app. In the Enterprise (where RDB are most used), data typically outlives different applications for the same task over decades, often written in different languages. Typically, the data needs also to be accessed by several different applications.
But here I'm talking about logical back-compatibility (surviving version changes) - the blog makes good points about "caching and application-level security". Where those vary with the application, it makes sense to separate them from the DB. But like Database-as-IPC, they are closer to the transport layer than to logical structure.
That's what versioning is for, isn't it? You can still submit old data under the previous version, and old apps will work just fine as they won't see the new data.
New apps will have to deal with old versions somehow, or some back-end job will produce new data out of the old.
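The versioning approach in these two comments can be sketched as follows (the schema_version field and upgrade function are made-up illustrations, not any particular database's API): every document carries a version, old readers skip versions they don't understand, and new readers upgrade old data on the fly.

```python
# Hypothetical version-tagged documents with upgrade-on-read.

def upgrade_1_to_2(d):
    """Migrate a v1 document to v2 (here: rename "name" to "full_name")."""
    d = dict(d)
    d["full_name"] = d.pop("name")
    d["schema_version"] = 2
    return d

UPGRADES = {1: upgrade_1_to_2}  # version -> step to the next version

def read(doc, reader_version):
    """Upgrade doc until it matches reader_version, or refuse."""
    while doc["schema_version"] < reader_version:
        doc = UPGRADES[doc["schema_version"]](doc)
    if doc["schema_version"] > reader_version:
        return None  # old app: ignore data it cannot understand
    return doc

v1 = {"schema_version": 1, "name": "Ada"}
v2 = {"schema_version": 2, "full_name": "Ada"}

assert read(dict(v1), 2) == v2      # new app upgrades old data
assert read(dict(v2), 1) is None    # old app won't see new data
assert read(dict(v1), 1) == v1      # old app reads old data as-is
```

A back-end job producing new data out of the old, as the comment suggests, would simply apply the same upgrade chain in bulk.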
On 6ren's rigidity argument: relations and SQL are still not enough to guarantee you'll always be able to get the data you need, notably when the data model changes. Backward compatibility is important, and we might have the only solution that guarantees it.
> relations and SQL are still not enough to guarantee you'll always be able to get the data you need; notably, when the data model changes
What data model changes are you thinking of? Relations and SQL will cope with reorganizations/restructurings of the data, provided the information content is the same. But won't cope if the information is different (e.g. info is deleted or added).
> "Only ChronicDB can run new applications and unmodified old applications against a master database without breaking anything." [from your link]
I like your points about undoable DELETE (do modern RDBs really lack that?); but the above quote is striking because it was the stated motivation for relational databases in Codd's 1970 paper: http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
[from abstract] Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.
[from conclusion] In Section 1 a relational model of data is proposed as a basis for protecting users of formatted data systems from the potentially disruptive changes in data representation caused by growth in the data bank and changes in traffic.
The paper discusses derivability, and how some relations (such as those needed by old applications) can be derived from named relations using a high level language (now SQL). The idea was that you can change the database schema yet still provide the old relations (this mechanism is now called "views").
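Codd's derivability idea can be demonstrated in a few lines of SQL; here is a minimal sqlite3 sketch (the table and column names are invented): the schema is restructured, but a view re-derives the old relation under its old name and shape, so old queries keep working.

```python
# Back-compatibility view: old queries survive a schema restructuring.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- New schema: "customers" was renamed and its columns changed.
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY, full_name TEXT);
    INSERT INTO clients VALUES (1, 'Ada Lovelace');

    -- View re-deriving the old relation's name and shape.
    CREATE VIEW customers AS
        SELECT client_id AS id, full_name AS name FROM clients;
""")

# An unmodified old application query still works against the view.
row = db.execute("SELECT id, name FROM customers").fetchone()
assert row == (1, 'Ada Lovelace')
```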
I'm not saying ChronicDB has no role, just that this one particular aspect, in isolation, doesn't seem new. But if it facilitates back-compatibility in contexts where RDB doesn't work well, it could be a big win. Back-compatibility, like integration, is always in demand.
EDIT looking at the "schema migration" example (http://chronicdb.com/system/files/images/agility_full_snap_g...), it looks like a join, with renaming, and a type change... is the new thing offered in this example that updates are bidirectional (especially for the type change)? Or is it that both old and new relations have the same name, "customers"?
You are right, this aspect is not new. More like half-baked.
I'm not sure what you mean by bidirectional. If this suggests having a separate copy for the old data and the new data then, in ChronicDB at least, no they are not bidirectional. Which is what one wants ideally: data consistency.
If you mean that updates on the old version are immediately reflected in the new version, then yes.
Basically, there's a single version of the truth for all DML operations, even though the old and new relations can have different names.
Yes, the second one. So that from the old app's point of view, nothing has changed: it can create, read, update, delete as before. A single version of the truth etc.
SQL's "CREATE VIEW" does this already (note: not all views can support CRUD, e.g. if you changed an aggregate column, like COUNT, there is no clear meaning for the effect on the underlying tables. If you increased the count, what rows should it add? If you decreased the count, which ones should it delete? The problem with aggregates is that the mapping from the underlying tables to the view isn't 1:1, so you can't run the mapping backwards to reconstruct the original).
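As a concrete sketch of the updatable-view point (names invented): SQLite's views are read-only by themselves, but an INSTEAD OF trigger makes a simple renaming view writable, which is exactly the 1:1 mapping the parenthetical says CRUD requires. No such trigger could be written unambiguously for an aggregate view.

```python
# Writable back-compatibility view via an INSTEAD OF trigger (SQLite).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY, full_name TEXT);
    CREATE VIEW customers AS
        SELECT client_id AS id, full_name AS name FROM clients;

    -- Map writes on the old relation back onto the new table.
    CREATE TRIGGER customers_insert INSTEAD OF INSERT ON customers
    BEGIN
        INSERT INTO clients (client_id, full_name)
        VALUES (NEW.id, NEW.name);
    END;
""")

# The old application's INSERT goes through the view unchanged.
db.execute("INSERT INTO customers (id, name) VALUES (2, 'Grace Hopper')")
assert db.execute("SELECT full_name FROM clients").fetchone() == ('Grace Hopper',)
```

Other engines need less ceremony: MySQL's simple views are updatable directly, and PostgreSQL auto-updates simple single-table views.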
1. How is SQL's solution "half-baked"?
2. What does ChronicDB add?
I mean, back-compatibility is valuable, but what is the problem you see with the existing solution?
1. Thank you for the link on views being updateable with some (natural) restrictions (http://news.ycombinator.com/item?id=3406952). Given this feature, SQL offers the primitives needed. Even if this feature is missing from various databases at the moment, it should be available eventually.
But updateable views are still not what you want! See below.
2. What ChronicDB adds in terms of backward compatibility is, effectively, automatic creation and management of these views for every schema change.
Rather than manually write compatibility views and rewrite/review an application to ensure it always uses these views, programmers instead access plain tables, not views. In fact ChronicDB creates no views at all!
You want to issue SQL queries that don't break. You don't want to manage views to do so. It's a subtle, yet major point. No database to date offers such automated versioning.
ChronicDB of course adds other features we wanted to have from our database (reversibility, data versioning, zero-downtime relocation) again in the spirit of saving the programmer/DBA from doing anything manually.
2. Yes, I agree, nice to have the "view" created automatically when you change schema.
Codd's idea was also that applications needn't be aware of the change (a view appears as a plain table)... but (in contrast to your approach) the views were managed by the DBA who was making the schema changes. So I think this feature was present from the very beginning, being the initial motivation for relational databases. [I looked up MySQL because that's the DB I have access to, but it's a general feature:
http://en.wikipedia.org/wiki/View_(database)#Read-only_vs._u... ]
I would expect automatic versioning to be available too, perhaps only in the older and more expensive RDBs. I've come across quite a few research papers on schema evolution in relational databases - some from big RDB vendors. I don't recall any on automated versioning, but I wasn't looking for that. So this is just my guess.
However, if you're not seeing databases with automated versioning, then (even if some high-end RDBs do have it) there will be customers who also aren't seeing it (or it's not applicable, e.g. it's too expensive). A market for you. And if you can make it truly painless (or even just less painful), that's really something.
I don't see any problem with your other features. The one feature of mapping for back-compatibility is of interest to me, which led me to study Codd's original paper in great detail over a few weeks. That's why I've been hassling you on this issue.
If you mean why is it a problem at all, it's usually because of database size. On any large-scale deployment (i.e. you have at least a million users) schema modifications will take hours. The only way to do reliable schema modifications is to have extra capacity and do it in stages. Also, your forward changes have to be backwards compatible. (AKA you're not allowed to both add and remove a column at the same time.)
The way to do it is to take some of your slaves out of the request pool and run the alter tables on them. You do this many times depending on your available capacity. (You probably can't just rip out half your slaves, you probably need to do at least 3 batches.) After you've altered all your slaves you can promote one to master and take the master offline to do its own alter. Then you push the code changes to production and add the old master back into the pool as a slave once it's done its alter.
In this scenario you need 3x the time the alter takes. So if the alter takes 6-7 hours (common in mysql if you have a large-ish table) it's going to take you at least 18 hours before you can push your code that depends on a database change.
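The arithmetic above amounts to a simple lower bound, sketched here as a back-of-the-envelope check (the batch count and alter duration are the figures from the comment):

```python
# Wall-clock lower bound for a rolling ALTER across replica batches:
# batches run sequentially, and code ships only after all have altered.

def rollout_hours(alter_hours, batches):
    """One full alter duration per sequential batch of slaves."""
    return alter_hours * batches

# A ~6 hour ALTER in three batches gives the quoted ~18 hours
# before the dependent code change can be pushed.
assert rollout_hours(6, 3) == 18
```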
Doing this manually at scale instead of an automated deployment process is extremely risky and will almost certainly be screwed up often.
This is one of the main reasons people are hoping schemaless databases work out in practice.
Agreed on backwards compatible, but for application semantics; not types of schema changes.
There's no reason to not be allowed to both add and remove a column at the same time, or to merge and split whole tables. In your example, these kinds of changes would not be possible.
There's also no reason to not be able to run old and new code at the same time, or to revert a schema change.
With ChronicDB we reduced schema changes to:
$ chd change -f upgrade_map mydb
Schemaless databases don't solve this, just as an instantaneous ALTER TABLE won't solve this.
No one said it'll work 100% of the time. But it most likely would have saved the OP a lot of pain. Just because something isn't a perfect solution, it doesn't mean it's not worth doing.
Denormalization is a performance consideration, and now even happens transparently at the storage layer in some RDBMSs. So restructuring a 'live' database should be done for semantic reasons: because the workflow changes; because the app needs to change.